Article

Vehicle Detection by Fusing Part Model Learning and Semantic Scene Information for Complex Urban Surveillance

Yingfeng Cai 1, Ze Liu 2, Hai Wang 2,*, Xiaobo Chen 1 and Long Chen 1


1 Automotive Engineering Research Institution, Jiangsu University, Zhenjiang 212013, China; [email protected] (Y.C.); [email protected] (X.C.); [email protected] (L.C.)
2 School of Automotive and Traffic Engineering, Jiangsu University, Zhenjiang 212013, China; [email protected]
* Correspondence: [email protected]; Tel.: +86-182-6197-7099

Received: 12 September 2018; Accepted: 15 October 2018; Published: 17 October 2018

 

Abstract: Vision-based vehicle detection has been studied extensively; however, great challenges remain in certain settings. To address this problem, this paper proposes a probabilistic framework that combines a scene model with a pattern recognition method for vehicle detection by a stationary camera. A semisupervised viewpoint inference method is proposed in which five viewpoints are defined. For a specific monitoring scene, the vehicle motion pattern corresponding to the road structure is obtained by trajectory clustering in an offline procedure. Then, the possible vehicle locations and the probability distribution over viewpoints at a fixed location are calculated. For each viewpoint, a vehicle model described by a deformable part model (DPM) and a conditional random field (CRF) is learned. The scores of the root and parts and their spatial configuration generated by the DPM are used to learn the CRF model. The occlusion states of vehicles are defined based on the visibility of their parts and are treated as latent variables in the CRF. In the online procedure, the output of the CRF, which is considered an adjusted vehicle detection result compared with the DPM, is combined with the probability of the apparent viewpoint at a location to give the final vehicle detection result. Comparative quantitative experiments under a variety of traffic conditions were conducted to test our method. The experimental results illustrate that our method performs well and is able to deal with various vehicle viewpoints and shapes effectively. In particular, our approach performs well in complex traffic conditions with vehicle occlusion.

Keywords: vehicle detection; traffic surveillance; deformable part model (DPM); conditional random field (CRF); context-based inference

1. Introduction

Collecting and generating vehicle information in an urban traffic monitoring system is a fundamental task of intelligent transportation systems (ITSs). Nowadays, as a result of the rapid pace of urbanization, vision-based vehicle detection faces great challenges. In addition to the challenges of outdoor visual processing, such as illumination changes, poor weather conditions, shadows, and cluttered backgrounds, vehicle detection by a stationary camera is further hindered by its own unique challenges, including varied vehicle appearances and poses, and partial occlusion due to the loss of depth information or to traffic congestion.

In fact, heavy traffic congestion has become common in many large cities, and as a result, many vehicle detection approaches that apply motion information to detect vehicles are not suitable, because the congestion causes vehicles to slow down, which reduces their motion (see Figure 1 for an example). During the past decade, vision-based object detection has been formulated as a binary classification problem with the goal of separating a target object from the background, and many of these algorithms achieve good performance with robust object detection.


Figure 1. Two urban traffic conditions. (a) Urban traffic conditions 1; (b) urban traffic conditions 2.

Among the various binary classification approaches, the deformable part model (DPM) has been regarded with increasing interest. Evaluated using the PASCAL Visual Object Classes (VOC) challenge datasets [1], DPM achieved state-of-the-art results in average precision (AP) for vehicle detection on the 2010 and 2011 benchmarks [2]. Since objects are detected as deformable configurations of parts, DPM should perform better at finding partially occluded objects; however, that performance has not been fully demonstrated for vehicle detection. Moreover, fusing semantic scene information into DPM methods, particularly in congested and complex traffic conditions, needs to be improved.

Motivated by prior work on deformable part models, mixtures of DPMs [3], and scene modeling [4], this paper proposes a probabilistic framework for vehicle detection based on fusing the results of structured part models and viewpoint inference. As the configuration of vehicle parts varies greatly as a function of the monitoring viewpoint, structured part models are learned for each possible viewpoint using the part detection results. On the other hand, potential viewpoints of vehicles can be predicted by spatial context-based inference, as vehicle motion is constrained by road structures and traffic signals.

The framework is depicted in Figure 2 and consists of two parts, offline learning and online processing. In this research, all the possible viewpoints and occlusion states for the stationary video cameras are summarized, and the viewpoint-related discriminative part models are learned using an existing approach described in [5]. All possible viewpoints in a certain location are generated by spatial context-based inference, where the logic constraint is obtained by trajectory analysis. This research extends DPM-based vehicle detection to multiview DPMs within a single probabilistic framework, which greatly improves the overall efficiency of vehicle detection. Such a method could also be extended to on-road vehicle detection.

The remainder of this paper is organized as follows. In Section 2, an overview of the related work is given. The online probabilistic framework of vehicle detection is proposed in Section 3, and the offline discriminative part model learning is described in Section 4. Viewpoint detection based on spatial context inference is proposed in Section 5. Experimental results and analysis are presented in Section 6, followed by a conclusion and recommendations for future work in Section 7.


Figure 2. Framework of the proposed vehicle detection method. CRF, conditional random field.

2. Related Works

Vision-based vehicle detection is widely used in ITSs in many parts of the world. Moving object detection methods are applied in many simple-background environments, such as bridges, highways, and urban expressways. These methods can be categorized as background modeling, frame differencing, and optical flow. They all have the ability to handle slight illumination changes; however, all of these techniques exhibit some performance limitations, as they are unable to detect stationary vehicles and incorrectly classify some moving objects as moving vehicles. Moreover, in relation to the congestion issue for urban surveillance, these methods may incorrectly identify several closely spaced vehicles as a single vehicle, or may not be able to detect any vehicle due to the lack of motion in a congested scene. Therefore, some research efforts attempt to utilize visual features of vehicles to detect them.

Simple features such as color, texture, edges, and object corners are usually used to represent vehicles. The features are then provided to a deterministic classifier to identify the vehicles [6–9]. Due to the unavailability or unreliability of some classification features in some instances, such as partial object occlusion, the utility of these methods can be limited to specific applications.

Many recent studies on part-based models have been conducted to recognize objects while maintaining efficient performance under occluded conditions [10,11]. Using these methods, a vehicle is considered to be composed of a window, a roof, two rear-view mirrors, wheels, and other parts [12,13]. After part detection, the spatial relationships, motion cues, and multiple part models are usually used to detect vehicles [14]. Wang et al. [15] applied local features around the roof and the two taillights (or headlights) to detect vehicles and identify partial occlusion. Li et al. proposed an AND-OR graph method [16,17] to detect front-view and rear-view vehicles. Saliency-based object detection has also been proposed [18].

Instead of manually identifying vehicle parts, they can be identified and learned automatically using a deformable part-based model. The DPM was first proposed by Felzenszwalb [1] and achieved state-of-the-art results in the PASCAL object detection challenges before the deep learning framework appeared, although it had low real-time performance. Variants of this pathfinding work have been proposed by many subsequent research efforts. Niknejad et al. [19] employed this model for vehicle detection in which a vehicle is decomposed into five components: front, back, side, front truncated, and back truncated. Each component contained a root filter and six part filters, which were learned using a latent support vector machine and histogram of oriented gradients features.


In [5], the DPM was combined with a conditional random field (CRF) to generate a two-layer classifier for vehicle detection. This method can handle vehicle occlusion in the horizontal direction; however, its experiments used artificial occlusion samples, which might suggest limited effectiveness in real-world traffic congestion conditions. In particular, in order to handle partial occlusions, various approaches were proposed to estimate the degree of visibility of parts, in order to properly weight the inaccurate scores of the root and part detectors [20–23]. Wang et al. [24] proposed an on-road vehicle detection method based on a probabilistic inference framework, in which the relative location relationships among vehicle parts are used to overcome the challenges of multiple views and partial observation.

Additionally, some researchers directly built occlusion patterns from annotated training data to detect occluded vehicles. Pepikj et al. [25] modeled occlusion patterns in specific street scenes with cars parked on either side of the road. The established occlusion patterns demonstrated the ability to aid object detection under partially occluded conditions. Wang et al. [26] established eight types of vehicle occlusion visual models; the suspected occluded vehicle region is loaded into a locally connected deep model of the corresponding type to make the final determination. However, it is hard to establish a comprehensive set of all possible occlusion patterns in real-world traffic scenes. Although the above approaches effectively handle localized vehicle occlusion, more sophisticated methods are still needed for complex urban traffic conditions in which vehicles are severely occluded by other vehicles or by multiple nonvehicle objects.

While deformable part models have become quite popular, their value has not been demonstrated in video surveillance by stationary cameras. Meanwhile, it is observed that for a specific monitoring scene, the vehicle motion pattern corresponding to the road structure can be obtained with offline learning, and the obtained vehicle motion pattern can be used with deformable part models to enrich and improve vehicle detection performance. In this paper, we summarize all the possible viewpoints and occlusion states for stationary video cameras, train the structured part models for each viewpoint by combining the DPM with a CRF as in [5] to handle occlusion, and propose a probabilistic framework that combines the spatial contextual inference of the viewpoint with the part model detection results. The pros and cons of the above related work are summarized in Table 1.

Table 1. Pros and cons of related work.

Methods | Pros | Cons
Simple features-based methods [6–9] | Easy to describe and perform in specific applications | Can only be used in specific simple scenes; cannot handle occlusion
Manual part-based model-based methods [10–18] | Able to handle weakly partial occlusion | Still low detection performance in complex scenes
Deformable part-based model-based methods [1,5,19–26] | Improved performance in vehicle detection | Cannot handle heavy occlusion

3. Online Probabilistic Framework

Given a frame in a traffic video, the online vehicle detection procedure is depicted in Figure 3. For each dominant viewpoint M_V^i, the features vector v_i is generated by the DPM, which includes the scores of the root and part filters and the best possible placement of the parts. The viewpoint-related CRF is treated as the second-layer classifier, which uses the information from the DPM to detect occluded vehicles. The detection conditional probability M_C^i and the location are output by the CRF. As the viewpoint is a location-specific parameter, this work uses a table-checking method to obtain the probability P_{M_V^i} of a certain viewpoint. In this research, i = 1, 2, ..., 5; the viewpoints are described in detail in Section 5. The problem of vehicle detection can be represented by the following framework:

\[ \hat{r}^k = \operatorname*{argmax}_{r^k} \, p(r^k \mid M_C^k, T_s) \tag{1} \]


where r^k is the estimated location of a vehicle center for viewpoint k, T_s is the knowledge of scene information, and P_{M_V^k} = p(r^k | T_s). The objective is to find the r^k that maximizes p(r^k | M_C^k, T_s). Since the viewpoint at a fixed location can be treated as independent from vehicle detection, the estimation can be converted to

\[ p(r^k \mid M_C^k, T_s) = Z\big[\, p(r^k \mid M_C^k),\; p(r^k \mid T_s) \,\big] \tag{2} \]

where Z[a, b] is an adjustment function used to avoid the effect of errors in the viewpoint probability: if b < a, then Z[a, b] = a; otherwise Z[a, b] = a · b.

Below, the training of the structured part-based vehicle models, including the DPM and the CRF, is given, and the estimation of p(r^k | M_C^k) in Equation (2) is described in Section 4. Typical types of road structures and the method of generating the possible-viewpoints table are defined, and the estimation of p(r^k | T_s) in Equation (2) is given in Section 5.
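To make the fusion rule concrete, the following is a minimal Python sketch of the online combination in Equations (1) and (2), assuming the CRF detection probability and the viewpoint prior have already been computed elsewhere; the function and variable names are illustrative, not taken from the authors' code.

```python
def fuse_scores(p_detection: float, p_viewpoint: float) -> float:
    """Adjustment function Z[a, b] from Equation (2).

    a: detection probability p(r^k | M_C^k) from the CRF.
    b: viewpoint prior p(r^k | T_s) from the scene model.
    If the prior is lower than the detection score it is ignored,
    which limits the damage of an erroneous viewpoint estimate.
    """
    a, b = p_detection, p_viewpoint
    return a if b < a else a * b


def detect_vehicle(candidates):
    """Pick the candidate location maximizing the fused score (Equation (1)).

    `candidates` is an iterable of (location, p_detection, p_viewpoint) tuples,
    one per hypothesized vehicle center r^k for a given viewpoint k.
    """
    best_location, best_score = None, float("-inf")
    for location, p_det, p_view in candidates:
        score = fuse_scores(p_det, p_view)
        if score > best_score:
            best_location, best_score = location, score
    return best_location, best_score
```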

Figure 3. Online procedure of vehicle detection. DPM, deformable part model.

4. Structured Part-Based Vehicle Models

The viewpoints of vehicles are closely related to the spatial contextual information of a traffic scene and the effect of possible occlusion states. In this paper, a summary of all possible vehicle viewpoints from the fixed camera location is proposed (as shown in Figure 4). Since cameras are usually installed at a certain distance above the ground, vehicle images for front viewpoints and rear viewpoints are similar in their visual feature space [17]. The same observation yields four additional pairs of common viewpoints: up-front and up-rear, up-right-front and up-left-rear, up-left-front and up-right-rear, and right and left. As a result, the training samples are merged from the original 11 categories into five, and models are learned for five viewpoints accordingly, i.e., k = 5. These are labelled as up, left/right, up-front/up-rear, up-right-front/up-left-rear, and up-left-front/up-right-rear.
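The merging of raw viewpoint categories into the five trained classes amounts to a simple lookup. The sketch below is only an illustrative encoding of the pairings named in the text; the assignment of the plain front and rear views to the up-front/up-rear class, and the label spellings, are assumptions rather than the authors' exact category list.

```python
# Symmetric viewpoints share a model because their appearance is similar
# from an elevated stationary camera (Section 4).
VIEWPOINT_MERGE = {
    "up": "up",
    "left": "left/right", "right": "left/right",
    "front": "up-front/up-rear", "rear": "up-front/up-rear",
    "up-front": "up-front/up-rear", "up-rear": "up-front/up-rear",
    "up-right-front": "up-right-front/up-left-rear", "up-left-rear": "up-right-front/up-left-rear",
    "up-left-front": "up-left-front/up-right-rear", "up-right-rear": "up-left-front/up-right-rear",
}

def merged_viewpoint(raw_label: str) -> str:
    """Map a raw viewpoint label to one of the k = 5 model classes."""
    return VIEWPOINT_MERGE[raw_label]
```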


Figure 4. Five possible viewpoints of a vehicle as observed by a stationary camera.

4.1. Deformable Part Model

The vehicle model for the kth viewpoint is based on a star model of a pictorial structure and is described by a root model and several part models. A set of permitted locations for each part with respect to the root is combined with the cost of deformation of each part. Formally, it is an n + 2 tuple, as defined by Equation (3):

\[ M_D = \{ F_0, (F_1, v_1, d_1), \ldots, (F_n, v_n, d_n), b \} \tag{3} \]

where F_0 is the root filter, n is the number of parts (P_1, P_2, ..., P_n), and b is the bias term. Each part model P_i is defined by a 3-tuple (F_i, v_i, d_i), where F_i is the ith part filter, v_i is a bidimensional vector that specifies the fixed position of part i relative to the root position, and d_i is a four-dimensional vector that specifies the coefficients of a quadratic function defining a deformation cost for each placement of part i relative to v_i. The score is given by the following formula:

\[ \mathrm{score}(p_0, \ldots, p_n) = \sum_{i=0}^{n} F_i' \cdot \phi(H, p_i) - \sum_{i=0}^{n} d_i \cdot \phi_d(dx_i, dy_i) + b \tag{4} \]

where φ(H, p_i) is a subwindow in the space-scale pyramid H with its upper left corner at p_i, (dx_i, dy_i) = (x_i, y_i) − (2(x_0, y_0) + v_i) gives the displacement of the ith part relative to its anchor position, and φ_d(dx_i, dy_i) = (dx_i, dy_i, dx_i², dy_i²) are the deformation features.
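As an illustration of how Equation (4) is evaluated, below is a minimal NumPy sketch of the star-model score for one root placement; the filter responses are assumed to be precomputed cross-correlation maps, and all names are illustrative rather than taken from the released DPM code.

```python
import numpy as np

def dpm_score(root_response, part_responses, anchors, deform_costs, bias):
    """Score of Equation (4) for one root location, maximized over part placements.

    root_response:  2D array, response of the root filter at its pyramid level.
    part_responses: list of 2D arrays, responses of each part filter (twice the root resolution).
    anchors:        list of (vx, vy) anchor offsets v_i relative to the doubled root position.
    deform_costs:   list of (d1, d2, d3, d4) quadratic deformation coefficients d_i.
    bias:           scalar bias term b.
    """
    y0, x0 = np.unravel_index(np.argmax(root_response), root_response.shape)
    total = root_response[y0, x0] + bias

    for resp, (vx, vy), d in zip(part_responses, anchors, deform_costs):
        ax, ay = 2 * x0 + vx, 2 * y0 + vy          # anchor position of this part
        ys, xs = np.mgrid[0:resp.shape[0], 0:resp.shape[1]]
        dx, dy = xs - ax, ys - ay                   # displacement from the anchor
        deformation = d[0] * dx + d[1] * dy + d[2] * dx**2 + d[3] * dy**2
        total += np.max(resp - deformation)         # best placement of this part
    return total
```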

4.2. CRF Model

Occlusion can be handled in the DPM by a grammar model; however, that is time consuming. Niknejad et al. [5] provided a method using the DPM and a CRF as a two-layer classifier to detect occluded vehicles. In the first layer, the DPM generates the root and part scores and the relative configuration of the parts. In the second layer, the CRF uses the output from the DPM to detect the occluded objects.

4.2.1. Occlusion State

The occlusion states are finite variables relating to the viewpoints and can be defined based on the visibility of the root and part filters (P_1, P_2, ..., P_n). We sum up all possible occlusion states for each viewpoint. Figure 5 depicts an example of parts clustering of a vehicle up-front model from the up to down direction. The occlusion states s_j have the property s_{j−1} ⊂ s_j.


For the up-front model in Figure 5, for example, the nested occlusion states are s_1 = {P_1, P_2}, s_2 = {P_1, P_2, P_3, P_4}, and s_3 = {P_1, P_2, P_3, P_4, P_5, P_6}.

Figure 5. Parts clustering based on the occlusion direction.

4.2.2. CRF Model

{P_i} are set to be the nodes in the lowest layer of the CRF, and {s_j} are set to be the other nodes in the CRF, as shown in Figure 6. Given a bounding box i with label y_i, we can calculate the detection conditional probability by maximizing the probability over all occlusion states s_j:

\[ P(y_i \mid v_i; \theta) = \operatorname*{argmax}_{j} \, P(y_i, s_j \mid v_i; \theta) \tag{5} \]

where v_i is the features vector and θ is the CRF parameter. v_i is generated by the DPM and includes the scores for the parts and the best possible placements of the parts. The probability p(r^k | M_C^k) for the kth viewpoint is converted into P(y_i | v_i; θ) with labeled vehicle y_i in the kth viewpoint.

In the CRF, maximizing P(y_i, s_j | v_i; θ) can be converted into finding the largest energy function ψ(y, s_j, v; θ) over all occlusion states. ψ(·) contains energy from both the parts and their relations:

\[ \psi(y, s_j, v; \theta) = \sum_{\forall p_k \in \Omega} f(y, p_k, s_j, v) \cdot \theta^n + \sum_{\forall p_k, p_l \in \Omega} g(y, p_k, p_l, s_j, v) \cdot \theta^c \tag{6} \]

where Ω = {p_0, ..., p_n}, and θ^n, θ^c are the components of θ = (θ^n, θ^c) in the CRF model, corresponding to the part information and the relative spatial relation with other parts, respectively; the f(·) features depend on a single hidden variable in the CRF model, while the g(·) features depend on the correlation between pairs of parts.

A belief propagation algorithm is used to estimate each occlusion state. The detection conditional probability is as follows:

\[ Z(y \mid v, \theta) = \operatorname*{argmax}_{j} \big( \exp(\psi(y, s_j, v; \theta)) \big) \tag{7} \]

Then the detection probability can be calculated by:

\[ P(y \mid v, \theta) = Z(y \mid v, \theta) \Big/ \sum_{\hat{y}} Z(\hat{y} \mid v, \theta) \tag{8} \]
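The normalization in Equations (7) and (8) can be pictured with a small sketch: assuming the energies ψ(y, s_j, v; θ) for each label and occlusion state have already been computed from the first-layer DPM features, the detection probability is a softmax-like ratio over the best occlusion state per label. The code below only illustrates that bookkeeping and is not the authors' implementation.

```python
import math

def detection_probability(energies_by_label):
    """Equations (7) and (8): pick the best occlusion state per label, then normalize.

    energies_by_label: dict mapping each label y (e.g. 'vehicle', 'background')
                       to a list of energies psi(y, s_j, v; theta), one per occlusion state s_j.
    Returns a dict of P(y | v, theta).
    """
    # Z(y | v, theta): exponential of the best occlusion-state energy for each label.
    z = {y: max(math.exp(e) for e in energies) for y, energies in energies_by_label.items()}
    total = sum(z.values())
    return {y: value / total for y, value in z.items()}


# Usage example with made-up energies for two occlusion states per label.
probs = detection_probability({"vehicle": [1.8, 2.3], "background": [0.4, 0.9]})
```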


Figure 6. CRF model for up-front vehicle model.

Using a Bayesian formula, the likelihood of each occlusion state is calculated as follows:

\[ P(s_j \mid y, v, \theta) = P(y, s_j \mid v, \theta) \,/\, P(y \mid v, \theta) \tag{9} \]

where P(y | v, θ) is constant for all occlusion states and is calculated by maximizing over all possible occlusion states [5]. The marginal distribution for visibility over each individual part p_k = a, or over pairs of parts corresponding to edges in the graph, is calculated as follows:

\[ P(s_j, p_k = a \mid y, v, \theta) \approx f(y, p_k = a, s_j, v) \cdot \theta^n \tag{10} \]

\[ P(s_j, p_k = k, p_l = l \mid y, v, \theta) \approx g(y, p_k = k, p_l = l, s_j, v) \cdot \theta^c \tag{11} \]

For the root p_0, the energy is calculated by concatenating the weight vector of the trained root filter and the HOG (histogram of oriented gradients) feature at the given position:

\[ f(y, p_0 = 0, s_j, v) = F_0 \cdot \phi(H, p_i) \tag{12} \]

For the parts (P_1, P_2, ..., P_n), the score of the parts is summed according to their appearance along the occlusion direction as the following function:

\[ f(y, p_k = k, s_j, v) = \sum_{h=0}^{k} \max_{dx, dy} \big( F_h \cdot \phi(H, (x_h + dx, y_h + dy)) - d_k \cdot \phi_d(dx, dy) \big) \tag{13} \]

For the spatial correlation, a normal distribution is used through the following formula:

\[ g(y, p_k = k, p_l = l, s_j, v) \cdot \theta^c = \theta^c_{(k,j)x} \cdot N(\Delta^x_{l,k} \mid \mu^x_{l,k}, \sigma^x_{l,k}) + \theta^c_{(k,j)y} \cdot N(\Delta^y_{l,k} \mid \mu^y_{l,k}, \sigma^y_{l,k}) \tag{14} \]

\[ \Delta^x_{l,k} = (x_{p_k} - x_{p_l}), \qquad \Delta^y_{l,k} = (y_{p_k} - y_{p_l}) \tag{15} \]

where Δ^x_{l,k} and Δ^y_{l,k} are the relative positions of parts p_k, p_l in the x and y directions, and θ^c_{(k,j)x}, θ^c_{(k,j)y} are parameters that correspond to the relative spatial relation between parts p_k, p_l in the x and y directions, respectively. μ^x_{l,k}, σ^x_{l,k} are the mean and covariance values of the relative spatial relation between parts p_k, p_l in the x direction, extracted from all positive samples in the DPM training data.
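To illustrate Equations (14) and (15), the snippet below evaluates the pairwise spatial term for one pair of parts, assuming the means, variances, and weights were estimated offline from the positive DPM training samples; names and values here are placeholders.

```python
from math import exp, pi, sqrt

def normal_pdf(x, mu, sigma):
    """Univariate normal density N(x | mu, sigma)."""
    return exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * sqrt(2.0 * pi))

def pairwise_spatial_energy(part_k_xy, part_l_xy, mu, sigma, theta_c):
    """Pairwise energy of Equations (14)-(15) for parts p_k and p_l.

    part_k_xy, part_l_xy: (x, y) detected positions of the two parts.
    mu, sigma:            dicts with 'x' and 'y' statistics learned from training data.
    theta_c:              dict with 'x' and 'y' CRF weights for this (k, j) pair.
    """
    dx = part_k_xy[0] - part_l_xy[0]   # Delta^x_{l,k}
    dy = part_k_xy[1] - part_l_xy[1]   # Delta^y_{l,k}
    return (theta_c["x"] * normal_pdf(dx, mu["x"], sigma["x"])
            + theta_c["y"] * normal_pdf(dy, mu["y"], sigma["y"]))
```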


4.2.3. Model Learning

For each viewpoint k, the root filter F_0^k and the part filters F_i^k, i = 1, ..., n, are learned using the image samples of the category. Here, n is set as a preassigned constant value, which simplifies the calculation. The maximum likelihood estimation (MLE) method is used to estimate the parameters θ* = argmax_θ L(θ) from the training samples.

5. Viewpoint Detection by Spatial Context-Based Inference

Normally, the distributions of where, when, and what types of vehicle activities occur have a strong correlation with the road structure, which is defined by the road geometry, the number of traffic lanes, and the traffic rules. Given a type of road structure, the viewpoint of a subject vehicle at any specific location can be predicted by the scene model.

5.1. Scene Modeling

A scene model can be manually described or automatically extracted from the static scene appearance. Alternatively, a scene model can be extracted from regular vehicle motion patterns, e.g., trajectories, which tends to be a better method than the other two.

5.1.1. Trajectory Coarse Clustering

During an uncongested traffic period, general vehicle trajectories can be obtained from coarse vehicle trajectories via a blob tracker based on background subtraction and length-width ratio constraints. Through a period of observation, it is possible to obtain thousands of trajectories from a scene. Before clustering, some outliers caused by tracking errors must be removed. Trajectories with large average distances to neighbors and with very short lengths are rejected as outliers. After that, the main flow direction (MFD) vector x is used to group the trajectories:

\[ \mathbf{x} = \big( (x(t) - x(0)),\; (y(t) - y(0)) \big) \tag{16} \]

where (x(0), y(0)) and (x(t), y(t)) are the position vectors of the starting point and ending point of a trajectory, respectively. Each cluster is described as a Gaussian function with a mean µ_k and covariance matrix σ_k. The overall distribution p(x) considering all MFD vectors can be modeled as a mixture of Gaussians (MoG):

\[ p(\mathbf{x}) = \sum_{k=1}^{k_{\max}} \omega_k \, p_k(\mathbf{x}) \tag{17} \]

Here, k_max is the number of classes, ω_k is the prior probability, and p_k(x) is the normal distribution for the kth class. Then, each vector x in the training period is assigned to a class according to

\[ k = \operatorname*{argmax}_{j \in \{1, \ldots, k_{\max}\}} \, \omega_j \, p_j(\mathbf{x}) \tag{18} \]

The number of Gaussians in the mixture is determined by the number of clusters.
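A compact way to realize the coarse clustering of Equations (16)-(18) is to fit a Gaussian mixture to the MFD vectors and read off the class assignments. The sketch below uses scikit-learn's GaussianMixture as one possible stand-in for the paper's MoG estimation, so the library choice and parameter values are assumptions rather than the authors' setup.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def coarse_cluster_trajectories(trajectories, n_clusters):
    """Group trajectories by main flow direction (MFD), Equations (16)-(18).

    trajectories: list of arrays of shape (T_i, 2) holding (x, y) points over time.
    n_clusters:   assumed number of mixture components k_max.
    Returns the MFD vectors and the cluster label assigned to each trajectory.
    """
    # Equation (16): MFD vector = end point minus start point.
    mfd = np.array([traj[-1] - traj[0] for traj in trajectories])

    # Equation (17): mixture of Gaussians over the MFD vectors.
    mog = GaussianMixture(n_components=n_clusters, covariance_type="full").fit(mfd)

    # Equation (18): assign each vector to the class with the highest weighted density.
    labels = mog.predict(mfd)
    return mfd, labels
```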

5.1.2. Classification Filtering

The detailed filtering procedure was developed in the authors' previous work [27]. First, clusters that are significantly broader than the remaining clusters are removed:

\[ |\sigma_m| > 2 \sum_{i} |\sigma_i| \tag{19} \]

where |σ_m| is the determinant of the covariance matrix of the mth cluster and |σ_i| is that of any other. Second, clusters that have obvious overlaps are merged. Here, the Bhattacharyya distance-based error estimation method is used [28]. The expected classification error E (in %) between two classes is defined as

\[ E = 40.22 - 70.02\,b + 63.58\,b^2 - 32.77\,b^3 + 8.72\,b^4 - 0.92\,b^5 \tag{20} \]

where b is the Bhattacharyya distance. In this paper, the threshold is set to 1.5: if the Bhattacharyya distance b < 1.5, the two clusters are merged. Lastly, isolated outliers are removed. Such errors are generated by incorrect tracking due to occlusion or are formed when overlaps split. In such cases, the adaptive mean and variance estimation method based on maximum and minimum peak coefficients can be used [29].

5.1.3. Trajectory Fine Clustering

After coarse clustering, trajectories moving in opposite directions are separated, regardless of their relative spatial proximity. The road structure of the scene can be represented by path models with entries, exits, and paths between them. However, trajectories with similar moving directions are clustered together regardless of the magnitude of their relative spatial separation, so during the fine clustering procedure each class of trajectories is further clustered according to its spatial distribution. The trajectory similarity is defined by the Hausdorff distance, which considers the location relationship between trajectories and is a simple algorithmic calculation.

Considering two trajectories A = {a_i} and B = {b_i}, where a_i = ⟨x_i^a, y_i^a⟩ and b_i = ⟨x_i^b, y_i^b⟩ are spatial coordinates, for an observation a_i on A, its nearest observation on B is

\[ d_E(a_i, b_j) = \operatorname*{argmin}_{j \in B} \, \big\| \big( x_i^a - x_j^b,\; y_i^a - y_j^b \big) \big\| \tag{21} \]

The Hausdorff distance between A and B is

\[ D_H(A, B) = \max\big( D_h(A, B),\; D_h(B, A) \big) \tag{22} \]

where D_h(A, B) = max_{i∈A}(d_E(a_i, b_j)). A summary of the fine clustering procedure is as follows:

(1) The first trajectory of the dataset initializes the first route model.
(2) Other trajectories are compared with the existing route models. If the distance calculated by Equation (22) is smaller than the threshold τ, the route model is updated (as shown in Figure 7). Otherwise, a new model is initialized.
(3) If two route models are sufficiently overlapped, they are merged.

h

h

Otherwise, a new model is initialized. (3) If two route models are sufficiently overlapped, they are merged.

Figure 7. Matched trajectory. The maximum distance of the trajectory from the route model is smaller

Figure 7. Matchedthan trajectory. The maximum distance of the trajectory from the route model is smaller the allowed threshold. than the allowed threshold.

Sensors 2018, 18, x FOR PEER REVIEW Sensors 2018, 18, x FOR PEER REVIEW Sensors 2018, 18, 3505

11 of 19 11 of 19 11 of 19

5.2. Viewpoint Inference 5.2. Viewpoint Inference 5.2. Viewpoint Inference 5.2.1. Some Phenomena in Imaging 5.2.1. Some Phenomena in Imaging 5.2.1.As Some Phenomena inaccording Imaging to the imaging principle, plane α , determined by the camera's show in Figure 8, α , determined As show in Figure 8, according thethe imaging principle, plane by the camera's , and projection point to Ptoof camera's center on ground plane β correspond to the optical Asaxis showο in Figure 8, according the imaging principle, plane α, determined by the camera's optical axis ο , and projection point P of the camera's center on ground plane β correspond to the optical line axis ,pand projection the camera's center groundofplane β correspond to the center AB be the center in the image point planePγof. Let line on segment the interaction of the camera’s pimage γbe. Let AB segment center line in the image plane be the line segment of theofinteraction of field the camera’s line p in the plane γ. Let AB the line of the interaction the camera’s of γ. field of vision (FOV), plane α , and β . Then, ab is the corresponding line segment in plane vision field of plane visionα,(FOV), and . Then, ab is line the corresponding lineγ.segment in plane γ . (FOV), and β.plane Then, αab, is theβcorresponding segment in plane p p

Z Z

a a

α α Y Y

ο οβ β

γ γ

c c

d d

e e

f f b b

X X Figure 8. Camera model and corresponding transform from 3D world plane to 2D image plane. Figure to 2D 2D image image plane. plane. Figure 8. 8. Camera Camera model model and and corresponding corresponding transform transform from from 3D 3D world world plane plane to

(1) When a car goes from point A to point B, its image in plane γ starts from point a to point b with Whenaacar cargoes goesfrom frompoint pointAAto topoint pointB,B,its itsimage imageininplane plane γγ starts (1) When starts from from point a to point b with viewpoints from up-front to up. viewpoints from from up-front up-front to to up. up. viewpoints (2) When a car appears in the camera’s far FOV and comes along line segment cd , the viewpoint (2) When a car appears in the camera’s far FOV and comes along line segment cd, the viewpoint (2) When is right.a car appears in the camera’s far FOV and comes along line segment cd , the viewpoint isright. right. (3) is When a car is in the camera’s near FOV and comes along line segment e f , the viewpoint is up. (3) When Whenaacar carisisin inthe thecamera’s camera’snear nearFOV FOVand and comes comes along along line line segment segment eeff, the , theviewpoint viewpointisisup. up. In conclusion, vehicle viewpoint is directly determined by location and motion direction, which In described conclusion, viewpoint is directly determined by location and motion direction, In conclusion, vehicle viewpoint is gradient. directly determined by location and motion direction, which which can be can be byvehicle the trajectory can be described by the trajectory gradient. described by the trajectory gradient. 5.2.2. Inference of Viewpoint Probability Distribution 5.2.2. Inference of Viewpoint Probability Distribution In the far FOV, the vehicle area is small and its viewpoints can be treated as two kinds, left/right In the far FOV, FOV, the vehicle area is and be treated as two kinds, theIn vehicle area is small small and its its viewpoints viewpointsiscan can asthe twomiddle kinds,left/right left/right and up-front/up-back. the near FOV, the vehicle viewpoint setbe to treated be up. In FOV, the up-front/up-back.InInthe the near FOV, the vehicle viewpoint is set to up. be up. In middle the middle FOV, and up-front/up-back. near FOV, the vehicle viewpoint is set to be In the FOV, viewpoints of vehicles are divided by trajectory gradient, as shown in Figure 9, where region π 1 the is the viewpoints of vehicles are divided by trajectory gradient, as shown in Figure 9, region where region viewpoints of vehicles are divided by trajectory gradient, as shown in Figure 9, where π 1 is in the up-front/up-rear viewpoint, region π 2 is in the up-left-front/up-right-rear viewpoint, and π1 the is inup-front/up-rear the up-front/up-rear viewpoint, in up-left-front/up-right-rear the up-left-front/up-right-rear viewpoint, 2 isthe in viewpoint, regionregion viewpoint, and π 2 isπin region is in the up-right-front/up-left-rear viewpoint. π 3 and region π3 is in the up-right-front/up-left-rear viewpoint. region π 3 is in the up-right-front/up-left-rear viewpoint.

-75 -75

π2 π2

π

-15 π 1 15 -15 1 15

π3 π3

75 75

Figure view (FOV). (FOV). Figure 9. 9. Division Division of of viewpoints viewpoints under under middle middle field field of of view Figure 9. Division of viewpoints under middle field of view (FOV).

Sensors 2018, 18, 3505

12 of 19

Sensors 2018, 18, x FOR PEER REVIEW For a fixed location r = ( xr , yr ),

12 of 19

if it is located in N path models, then the probability of vehicle

viewpoint i, ai = 1, 2,location . . . 5, in rr =is( x r , y r ) , if it is located in N path models, then the probability of vehicle For fixed viewpoint i, i = 1, 2,5 , in r is

p(r i

N

N

∑ q t ( i ) / ∑ Mt t =1 | T ) =  tq=(1i )  M

p(ri | Ts ) =N s

(23)

N

t

t

(23)

t =1 t =1 where Mt is the number of total trajectories passing through the cross-section with r in model t and where M t is the number of total trajectories passing through the cross-section with r in model t and qt (i ) is the number of trajectories with viewpoint i in the tth model. An example of the detection result qt (i ) is the number of trajectories with viewpoint i in the tth model. An example of the detection of viewpoint probability with a stationary camera is shown in Figure 10.

result of viewpoint probability with a stationary camera is shown in Figure 10.

Figure 10. Result of viewpoint probability distributions.

Figure 10. Result of viewpoint probability distributions.

Implementation 5.2.3.5.2.3. Implementation classification near FOV,middle middleFOV, FOV,and and far far FOV FOV can to to thethe camera The The classification of of near FOV, canbe bedone doneaccording according camera parameters and the size of the vehicle; however, it can be difficult to obtain the relevant parameters parameters and the size of the vehicle; however, it can be difficult to obtain the relevant parameters for for a fixed camera needed to help determine the appropriate FOV region. In this paper, a simple a fixed camera needed to help determine the appropriate FOV region. In this paper, a simple division division method is used that classifies the top one-sixth of an image as without consideration for the method is used that classifies the top one-sixth of an image as without consideration for the target area, target area, the next one-fifth from the top is defined as the far FOV of the camera, the bottom onethe next top defined as and the the far FOV of theofcamera, the of bottom one-fifth is setas to be fifth one-fifth is set to befrom nearthe FOV foristhe camera, remainder the middle the image is defined nearthe FOV for the camera, and the remainder of the middle of the image is defined as the middle FOV middle FOV for the camera.

for the camera.

Pseudocode of algorithm

1 Function fusingpartbasedObjectRecognition()
2 Build image pyramid
    for s = 1 to k
        calculate IG = IS * G
        calculate IS+1 by subsampling IS with factor λ
    next
3 Build HOG feature pyramid
    for s = 1 to k
        calculate gradient orientation image IΨ,S and gradient magnitude image IMag,S
        for y = 1 to H step 8
            for x = 1 to W step 8
                set H'(x, y, s) as the gradient orientation histogram based on IΨ,S and weighted by IMag,S
            next
        next
        calculate H(x, y, s) through normalization of H'(x, y, s) with respect to the four blocks containing the current cell
    next


4 Calculate matching score functions
    for s = 1 to k
        for y = 1 to H step 8
            for x = 1 to W step 8
                for i = 1 to M
                    calculate mi(x, y, s, ui)
                    if i ≠ 1 then
                        apply the generalized distance transform and calculate Pi(x, y, s) for the subsequent backward DP step
                    end if
                next
            next
        next
    next
5 Find all local maxima of the root part above the similarity threshold

6. Experimental Results

6.1. Dataset

A dataset consisting of thousands of image samples used for offline vehicle model training was generated from video sequences obtained from the surveillance system in our laboratory and from the Internet. Some images from the Caltech dataset [30] were also used. Vehicles were first detected in the video sequences based on background subtraction and length-width ratio constraints. Second, the vehicles were checked and labeled by hand. The positive samples were from the above three sources, whereas the negative samples were from random background images. The positive samples were divided into five categories according to the vehicle's viewpoint, which again are up, left/right, up-front/up-rear, up-right-front/up-left-rear, and up-left-front/up-right-rear.

We used more than 10 scenarios for the actual testing environment. As shown in Table 2, the testing samples cover a broad range of traffic conditions, from sparse to dense traffic, containing vehicles with different viewpoints and shapes, and under different illumination conditions. In particular, several experiments on congested traffic conditions are represented.

Table 2. Testing sample details.

Category | Fully Observed | Partially Observed
Viewpoint 1 | 5800 | 4580
Viewpoint 2 | 980 | 3600
Viewpoint 3 | 7420 | 5700
Viewpoint 4 | 6750 | 4022
Viewpoint 5 | 1256 | 18
Traffic density: Low | 18,016 | 5380
Traffic density: High | 4190 | 12,540
Total | 22,206 | 17,920

6.2. Experiment Details

The hardware part of the system consists of the image acquisition device, the server, and the client. The image acquisition device is a network camera with a maximum resolution of 1920 × 1080, high definition, and no delay; the system uses its SDK (software development kit) to acquire the video image sequences. The server is used for video processing, and the client is a PC.

The software part of the system consists of two parts: the server and the client. The server side includes an acquisition server and a central server. The acquisition server sends the video data captured by the camera to the central server. The central server performs real-time vehicle detection or returns query results according to the received client commands. At the same time, the test result or query result is sent to the client and displayed on the client interface.


according to the received client commands. At the same time, the test result or query result is sent to results according to the received client commands. At the same time, the test result or query result is the sent client and displayed on the client interface. to the client and displayed on the client interface. TheThe number of of training samples were used usedfor fortraining trainingand and number training sampleswas was28,600, 28,600,of ofwhich which 80% 80% (22,880) (22,880) were 20%20% (5720) were used for verification, and the total number of testing samples was 40,126. The testing (5720) were used for verification, and the total number of testing samples was 40,126. The testing samples were independent of of the training samples were independent the trainingsamples. samples. 6.3.6.3. Experimental Results Experimental Results TheThe presented results include a contrasting quantitative experiment on the testingtesting images presented results include a contrasting quantitative experiment onlabeled the labeled andimages severaland experiments on complex traffic conditions. several experiments on urban complex urban traffic conditions. 6.3.1. Quantitative Experiment 6.3.1. Quantitative Experiment In the quantitative experiment, into two two sets setsof offully fullyvisible visibleand and In the quantitative experiment,we wedivided dividedthe thetesting testing samples samples into partially occluded vehicles, which accounted respectively. partially occluded vehicles, which accountedfor for22,206 22,206and and17,920 17,920 samples, samples, respectively. A vehicle was determined totobebedetected betweenits itsbounding boundingbox boxand and A vehicle was determined detectedonly onlywhen when the the overlap overlap between the the ground truthtruth was was greater than than 50%. 50%. The recall and precision of ourof method on fullyonvisible ground greater The recall and precision our method fully vehicles, visible vehicles, as well as the results three other methods, are shown in Figure 11a. results The results of recall as well as the results from threefrom other methods, are shown in Figure 11a. The of recall and and precision for partially under low trafficand density high trafficconditions density precision for partially occludedoccluded vehicles vehicles under low traffic density high and traffic density are shown inrespectively. Figure 11b,c, Four respectively. Fourare approaches arewhich compared, up are conditions shown in Figure 11b,c, approaches compared, makeswhich up themakes proposed the proposed work we describe, the deformable part model (DPM) [1], the deformable part model work we describe, the deformable part model (DPM) [1], the deformable part model with conditional with field conditional (DPM + CRF) [5], and the multiple deformable random model random (DPM + field CRF)model [5], and the multiple deformable part model (MDPM)part [3]. model (MDPM) [3]. For a particular threshold, precision and recall are (true positive)/(true positive + false positive) For a particular threshold, and recall respectively. are (true positive)/(true positive + from false positive) and (true positive)/(true positiveprecision + false negative), It can be observed Figure 11 and (true positive)/(true positive + false negative), respectively. It can be observed from Figure 11 that that the approach we propose demonstrates comparable results to the other methods. The increased the approach we propose demonstrates comparable results to the other methods. 
The increased number of viewpoints and viewpoint map helped to improve the detection rate by filtering out other number of viewpoints and viewpoint map helped to improve the detection rate by filtering out other confusing background objects and estimating vehicle poses. The average processing time for the four confusing background objects and estimating vehicle poses. The average processing time for the four methods (DPM, DPM + CRF, MDPM, OURS) was 23.7 ms, 32.3 ms, 29.1 ms, and 33.9 ms, respectively, methods (DPM, DPM + CRF, MDPM, OURS) was 23.7 ms, 32.3 ms, 29.1 ms, and 33.9 ms, respectively, for for an input image of of 640640 × ×480 codeoptimization optimizationand and acceleration. an input image 480scale scalewith withC/C++ C/C++ code acceleration.

Figure 11. Performance comparison of precision–recall curves for four methods. (a) Comparison of fully visible vehicles; (b) comparison of partially occluded vehicles under low traffic density; (c) comparison of partially occluded vehicles under high traffic density.


For stationary cameras in video surveillance, combining viewpoint inference with the detector improves the performance of pattern recognition–based vehicle detection methods. In the following figures, we show detailed detection results for different viewing conditions.
Figure 12 provides vehicle detection results for five viewpoints, where the blue boxes denote detected vehicle parts and the red box denotes the final vehicle detection result. The results demonstrate that our method can accommodate different vehicle shapes and poses. Further, the method demonstrates good detection of nonoccluded vehicles in low-density traffic conditions. It is noted that since the training images did not include buses, our method failed to detect buses in the test data.
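As a rough illustration of how this combination can work, the sketch below weights the per-viewpoint DPM + CRF detection score by the learned probability of that viewpoint at the detection location. It is a simplified sketch, not the paper's exact fusion rule: the multiplicative weighting, the use of the box centre as the reference location, the H x W x 5 probability map layout, and the acceptance threshold are assumptions made for illustration.

```python
def fuse_with_viewpoint_map(detections, viewpoint_map, threshold=0.3):
    """Combine per-viewpoint detector scores with a scene-specific
    viewpoint probability map (illustrative sketch).

    detections    : list of dicts with 'box' = (x, y, w, h),
                    'score' = DPM + CRF detection score,
                    'view'  = viewpoint index in {0, ..., 4}
    viewpoint_map : array indexable as [y, x, v] (e.g., a NumPy array of
                    shape H x W x 5) giving the learned probability of
                    observing viewpoint v at pixel (x, y)
    threshold     : illustrative acceptance threshold on the fused score
    """
    fused = []
    for det in detections:
        x, y, w, h = det['box']
        cx, cy = int(x + w / 2), int(y + h / 2)    # box centre as reference point
        p_view = viewpoint_map[cy, cx, det['view']]
        fused_score = det['score'] * p_view        # weight score by viewpoint prior
        if fused_score > threshold:
            fused.append({**det, 'fused_score': fused_score})
    return fused
```

Under such a scheme, detections arising from background clutter in image regions where the hypothesized viewpoint is unlikely are suppressed by the low viewpoint probability, which is consistent with the improved detection rates reported above.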


Figure 12. Vehicle detection results for five viewpoints. (a) vehicle parts detection results in viewpoint1-viewpoint2; (b) vehicle global detection results in viewpoint1-viewpoint2; (c) vehicle parts detection results in viewpoint3-viewpoint4; (d) vehicle global detection results in viewpoint3-viewpoint4; (e) vehicle parts detection results in viewpoint4-viewpoint5; (f) vehicle global detection results in viewpoint4-viewpoint5.

As a result of the loss of depth information in the image projection transformation from 3D to 2D, occlusion happens frequently in video surveillance with monocular stationary cameras. Thus, our testing images include partial occlusions. As shown in Figure 13, the proposed method can detect partially occluded vehicles with similar shapes, as long as more than 20% of the vehicle body is observable.
In actual traffic scenes, the vehicle may also be obscured by road infrastructure as well as other common road features. As shown in Figure 14, the occluded vehicle parts can be deduced by the CRF model, which improves the model score of the DPM and guarantees the detection rate of occluded vehicles.
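A minimal sketch of the underlying idea is given below: each predefined occlusion state specifies which parts are assumed visible, and the hypothesis that best explains the observed part responses is selected. This is a simplified stand-in for the learned CRF inference used in the paper; the explicit enumeration of occlusion states, the fixed per-part occlusion penalty, and the function names are assumptions for illustration only.

```python
def best_occlusion_state(part_scores, occlusion_states, occlusion_penalty=0.2):
    """Select the occlusion hypothesis that best explains the DPM part responses.

    part_scores      : per-part scores produced by the DPM part filters
    occlusion_states : list of boolean tuples, one flag per part; True means
                       the part is assumed visible under that hypothesis
    occlusion_penalty: fixed cost for each part declared occluded (a simple
                       stand-in for the learned CRF potentials)
    """
    best_state, best_score = None, float('-inf')
    for state in occlusion_states:
        score = 0.0
        for visible, response in zip(state, part_scores):
            # Visible parts contribute their filter response; occluded parts
            # contribute a small penalty instead of their (typically low) response.
            score += response if visible else -occlusion_penalty
        if score > best_score:
            best_state, best_score = state, score
    return best_state, best_score
```

With this kind of adjustment, a vehicle whose rear parts are hidden by infrastructure can still keep a score above the detection threshold, because the weak responses of the hidden part filters no longer drag the total score down.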


Figure 13. Vehicle detection results with occlusion. (a) vehicle parts detection results in partial occlusion scenario 1; (b) vehicle global detection results in partial occlusion scenario 1; (c) vehicle parts detection results in partial occlusion scenario 2; (d) vehicle global detection results in partial occlusion scenario 2.

Figure 14. Vehicle detection results with occlusion on a rainy day. (a) vehicle parts detection results in partial occlusion scenario 1 on a rainy day; (b) vehicle global detection results in partial occlusion scenario 1 on a rainy day; (c) vehicle parts detection results in partial occlusion scenario 2 on a rainy day; (d) vehicle global detection results in partial occlusion scenario 2 on a rainy day.

In addition to the vehicle poses and shapes, the test images also include different weather conditions, consisting of sunny, cloudy, rainy, and twilight scenes. Our method is not affected by shadows on a sunny day, and can detect vehicles in rainy and twilight situations, both of which result in poor image quality. However, our method fails to detect vehicles at night or in heavy fog, because the vehicle edges are nearly invisible in these conditions. Our method will be further improved to address these limitations in our future work.


6.3.2. Experiments in Congested Traffic Conditions

As described previously, congested traffic conditions bring challenges to the effectiveness of vehicle detection methods. Therefore, we purposely paid attention to studying congested traffic conditions. In high-density traffic flow, the continuous adhesion of vehicles greatly affects the detection rate. As shown in Figure 15a, there is no consistent viewpoint distribution for a fixed camera location at intersections, so the detection result is given by the DPM + CRF. As shown in Figure 15c, five vehicles occlude each other in the middle FOV image region. With our method, the occlusion states contain only two vehicle parts in the CRF model in the up-front view. As a result, the five vehicles are all detected. Our proposed algorithm performs well at a certain resolution in the region of interest. However, in the far-field region, the area of the vehicle is very small and the edge detection effect is not ideal, leading to a low detection rate because of the low overlap between the estimated bounding box and the real vehicle region.

Figure 15. Vehicle detection results in congested traffic scenes. (a) vehicle parts detection results in congested traffic scene 1; (b) vehicle global detection results in congested traffic scene 1; (c) vehicle parts detection results in congested traffic scene 2; (d) vehicle global detection results in congested traffic scene 2.

6.3.3. Experimental Analysis

The experiments on the various road environments described above show that the proposed method detects multiple targets well in rainy conditions and performs well for vehicles seen from different perspectives in the captured surveillance video. However, because the training set does not include vehicle types such as buses and trucks, these vehicles cannot be detected reliably and are prone to false detections. Practical application shows that the proposed method addresses the main difficulties of vehicle detection: under normal lighting conditions, the detection rate is high, the method is not affected by most shadows, it is not sensitive to the background, and it detects most vehicles on the road even under occlusion.

In future work, we will attempt to make our method more suitable for real-world traffic video surveillance systems. First, further computation acceleration may be realized by leveraging parallel processing. Second, an adaptive parameter estimation method can be developed to avoid manually setting relevant parameters. Finally, it is necessary to expand our method to more object types by deep part hierarchies or by building more object models.


7. Conclusions

In this paper, we propose a vehicle detection method combining scene modeling and deformable part models for fixed cameras, particularly for use in congested traffic conditions. The proposed method consists of constructing a DPM + CRF model for each viewpoint to represent vehicles, training scene models, viewpoint inference, and vehicle object detection by addressing both partial observation and varying viewpoints within a single probabilistic framework. Compared with the current methods, there are two main innovations in this paper. First, we use the scene context information to constrain the possible location and viewpoint of a vehicle, which is reasonable since the camera is often in a fixed location in a real traffic scene. In practical applications, the judicious usage of scene information can potentially eliminate some interference and reduce required computation times. Second, based on the combined DPM and CRF model, the viewpoint-related vehicle models are built and the occlusion states are defined individually. Compared to the MDPM, the proposed method provides semisupervised viewpoint inference, which performs well for a variety of traffic surveillance conditions. Experimental results demonstrate the efficiency of the proposed work on vehicle detection by a stationary camera, especially for solving the occlusion problem in congested conditions.

Author Contributions: Methodology, C.Y. and W.H.; Software, L.Z. and C.X.; Project administration, C.L.

Funding: This research was funded by the National Key Research and Development Program of China (2018YFB0105003), National Natural Science Foundation of China (U1762264, 51875255, U1664258, U1764257, 61601203, 61773184), Natural Science Foundation of Jiangsu Province (BK20180100), Key Research and Development Program of Jiangsu Province (BE2016149), Key Project for the Development of Strategic Emerging Industries of Jiangsu Province (2016-1094, 2015-1084), Key Research and Development Program of Zhenjiang City (GY2017006), and Overseas Training Program for Universities of Jiangsu Province.

Acknowledgments: The authors thank the reviewers and editors for their help.

Conflicts of Interest: The authors declare no conflict of interest.

References

1. Felzenszwalb, P.F.; Girshick, R.B.; McAllester, D.; Ramanan, D. Object detection with discriminatively trained part-based models. IEEE Trans. Pattern Anal. Mach. Intell. 2010, 32, 1627–1645. [CrossRef] [PubMed]
2. Everingham, M.; Van Gool, L.; Williams, C.K.I.; Winn, J.; Zisserman, A. The pascal visual object classes (voc) challenge. Int. J. Comput. Vision 2010, 88, 303–338. [CrossRef]
3. Leon, L.C.; Hirata, R., Jr. Vehicle detection using mixture of deformable parts models: Static and dynamic camera. In Proceedings of the 2012 25th SIBGRAPI Conference on Graphics, Patterns and Images, Ouro Preto, Brazil, 22–25 August 2012; pp. 237–244.
4. Makris, D.; Ellis, T. Learning semantic scene models from observing activity in visual surveillance. IEEE Trans. Syst. Man Cybern. Part B 2005, 35, 397–408.
5. Niknejad, H.T.; Kawano, T.; Oishi, Y.; Mita, S. Occlusion handling using discriminative model of trained part templates and conditional random field. In Proceedings of the 2013 IEEE Intelligent Vehicles Symposium (IV), Gold Coast City, Australia, 23–26 June 2013; pp. 750–755.
6. Kanhere, N.K.; Birchfield, S.T. Real-time incremental segmentation and tracking of vehicles at low camera angles using stable features. IEEE Trans. Intell. Trans. Syst. 2008, 9, 148–160. [CrossRef]
7. Pang, C.C.C.; Lam, W.W.L.; Yung, N.H.C. A method for vehicle count in the presence of multiple-vehicle occlusions in traffic images. IEEE Trans. Intell. Trans. Syst. 2007, 8, 441–459. [CrossRef]
8. Nguyen, V.D.; Nguyen, T.T.; Nguye, D.D.; Lee, S.J.; Jeon, J.W. A fast evolutionary algorithm for real-time vehicle detection. IEEE Trans. Veh. Technol. 2013, 62, 2453–2468. [CrossRef]
9. Zhang, W.; Wu, Q.M.J.; Yang, X.; Fang, X. Multilevel framework to detect and handle vehicle occlusion. IEEE Trans. Intell. Trans. Syst. 2008, 9, 161–174. [CrossRef]
10. Shu, G.; Dehghan, A.; Oreifej, O.; Hand, E.; Shah, M. Part-based multiple-person tracking with partial occlusion handling. In Proceedings of the 2012 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Providence, RI, USA, 16–22 June 2012; pp. 1815–1821.
11. Wang, H.; Dai, L.; Cai, Y.; Sun, X.; Chen, L. Salient object detection based on multi-scale contrast. Neural Netw. 2018, 101, 47–56. [CrossRef] [PubMed]
12. Lin, L.; Wu, T.; Porway, J.; Xu, Z. A stochastic graph grammar for compositional object representation and recognition. Pattern Recognit. 2009, 42, 1297–1307. [CrossRef]
13. Lin, B.F.; Chan, Y.M.; Fu, L.C.; Hsiao, P.Y.; Chuang, L.A. Integrating appearance and edge features for sedan vehicle detection in the blind-spot area. IEEE Trans. Intell. Trans. Syst. 2012, 13, 737–747.
14. Tian, B.; Li, Y.; Li, B.; Wen, D. Rear-view vehicle detection and tracking by combining multiple parts for complex urban surveillance. IEEE Trans. Intell. Trans. Syst. 2014, 15, 597–606. [CrossRef]
15. Wang, C.C.R.; Lien, J.J.J. Automatic vehicle detection using local features-A statistical approach. IEEE Trans. Intell. Trans. Syst. 2008, 9, 83–96. [CrossRef]
16. Li, Y.; Li, B.; Tian, B.; Yao, Q. Vehicle detection based on the and-or graph for congested traffic conditions. IEEE Trans. Intell. Trans. Syst. 2013, 14, 984–993. [CrossRef]
17. Li, Y.; Wang, F.Y. Vehicle detection based on and-or graph and hybrid image templates for complex urban traffic conditions. Transp. Res. Part C Emerg. Technol. 2015, 51, 19–28. [CrossRef]
18. Cai, Y.; Liu, Z.; Wang, H.; Sun, X. Saliency-based pedestrian detection in far infrared images. IEEE Access 2017, 5, 5013–5019. [CrossRef]
19. Niknejad, H.T.; Takeuchi, A.; Mita, S.; McAllester, D.A. On-road multivehicle tracking using deformable object model and particle filter with improved likelihood estimation. IEEE Trans. Intell. Trans. Syst. 2012, 13, 748–758. [CrossRef]
20. Enzweiler, M.; Eigenstetter, A.; Schiele, B.; Gavrila, D.M. Multi-cue pedestrian classification with partial occlusion handling. In Proceedings of the 2010 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), San Francisco, CA, USA, 13–18 June 2010; pp. 990–997.
21. Wu, T.; Zhu, S.C. A numerical study of the bottom-up and top-down inference processes in and-or graphs. Int. J. Comput. Vision 2011, 93, 226–252. [CrossRef]
22. Azizpour, H.; Laptev, I. Object detection using strongly-supervised deformable part models. In Proceedings of the European Conference on Computer Vision, Florence, Italy, 7–13 October 2012; pp. 836–849.
23. Zhang, L.; Van Der Maaten, L. Preserving structure in model-free tracking. IEEE Trans. Pattern Anal. Mach. Intell. 2014, 36, 756–769. [CrossRef] [PubMed]
24. Wang, C.; Fang, Y.; Zhao, H.; Guo, C.; Mita, S.; Zha, H. Probabilistic inference for occluded and multiview on-road vehicle detection. IEEE Trans. Intell. Trans. Syst. 2016, 17, 215–229. [CrossRef]
25. Pepikj, B.; Stark, M.; Gehler, P.; Schiele, B. Occlusion patterns for object class detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Portland, OR, USA, 23–28 June 2013; pp. 3286–3293.
26. Wang, H.; Cai, Y.; Chen, X.; Chen, L. Occluded vehicle detection with local connected deep model. Multimedia Tools Appl. 2016, 75, 9277–9293. [CrossRef]
27. Cai, Y.; Wang, H.; Chen, X.; Jiang, H. Trajectory-based anomalous behaviour detection for intelligent traffic surveillance. IET Intell. Transp. Syst. 2015, 9, 810–816. [CrossRef]
28. Hampapur, A.; Brown, L.; Connell, J.; Ekin, A. Smart video surveillance: Exploring the concept of multiscale spatiotemporal tracking. IEEE Signal Process. Mag. 2005, 22, 38–51. [CrossRef]
29. Singh, A.; Pokharel, R.; Principe, J. The C-loss function for pattern classification. Pattern Recognit. 2014, 47, 441–453. [CrossRef]
30. Griffin, G.; Holub, A.; Perona, P. Caltech-256 Object Category Dataset; California Institute of Technology: Pasadena, CA, USA, Unpublished work, 2007.

© 2018 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).
