Vehicle Speed Estimation Using a Monocular Camera Wencheng Wu*, Vladimir Kozitsky*, Martin E. Hoover†, Robert Loce*, D. M. Todd Jackson‡ *PARC, A Xerox Company, 800 Phillips Road, Webster, NY, USA 14580 † Xerox Corp., 800 Phillips Road, Webster, NY, USA 14580 ‡ Xerox Corp., 12410 Milestone Center Drive, Germantown, MD, USA 20878
ABSTRACT

In this paper, we describe a speed estimation method for individual vehicles using a monocular camera. The system includes the following: (1) object detection, which detects an object of interest based on a combination of motion detection and object classification and initializes tracking of the object if detected; (2) object tracking, which tracks the object over time based on template matching and reports its frame-to-frame displacement in pixels; (3) speed estimation, which estimates vehicle speed by converting pixel displacements to distances traveled along the road; (4) object height estimation, which estimates the distance from the tracked point(s) of the object to the road plane; and (5) speed estimation with height-correction, which adjusts the previously estimated vehicle speed based on the estimated object and camera heights. We demonstrate the effectiveness of our algorithm on 30/60 fps videos of 300 vehicles travelling at speeds ranging from 30 to 60 mph. The 95th-percentile speed estimation error was within ±3% when compared to a lidar-based reference instrument. Key contributions of our method include (1) tracking a specific set of feature points of a vehicle to ensure a consistent measure of speed, (2) a high-accuracy camera calibration/characterization method that does not interrupt the regular traffic of the site, and (3) a license plate and camera height estimation method for improving the accuracy of individual vehicle speed estimation. Additionally, we examine the impact of spatial resolution on the accuracy of speed estimation and utilize that knowledge to improve computational efficiency. We also improve the accuracy and efficiency of tracking over standard methods via dynamic update of templates and predictive local search.

Keywords: Video processing, object tracking, machine learning, camera calibration/characterization, individual vehicle speed estimation, object height estimation
1. INTRODUCTION

Studies have shown a strong relationship between speeding and traffic accidents. For example, in the USA in 2011, 22% of passenger car and 34% of motorcycle fatalities involved speeding [1]; the economic cost of speeding-related crashes is estimated to be $40.4 billion each year [2]. High vehicle speed also has a negative impact on the environment: emissions of NOx, CO, and CO2 increase with speed, and noise increases linearly at speeds above 40-50 km/h. Photo enforcement is one approach commonly used to discourage speeding. Studies have shown that in certain settings photo enforcement has led to an average reduction of speed, leading to 21% and 14% reductions in accidents involving severe collisions and injuries, respectively [3]-[4]. In this paper, we aim to develop a monocular vision-based method capable of measuring the speed of individual vehicles with high accuracy for the purpose of photo enforcement.

Vehicle speed is a key traffic measurement required in an Intelligent Transport System (ITS). It is relevant to traffic flow and can be used for accident prediction or prevention. For example, instantaneous measurement of vehicle speeds at an intersection may be used to temporarily alter the duration of traffic signals to prevent accidents from occurring. Common sensors for vehicle speed measurement include inductive loops, radar, lidar, and stereo video cameras. A monocular vision system provides several advantages over these alternatives: it is more cost effective, monocular imaging systems are widely available, they are easier to install and maintain, and they can serve other purposes such as surveillance.

At a glance, it may appear quite simple to provide some measure of the speed of an object using video cameras if the object of interest is properly detected, identified, and tracked. Indeed, much work has been done in this area, but the focus has been primarily on measuring the average speed of vehicles. There has been little study of the accuracy and precision of speed measurement of an individual vehicle required for law enforcement when applying computer vision techniques using a monocular camera. The challenges in meeting the required accuracy and precision have been discussed in Ref. [5].
The objective of this paper is to describe and characterize a speed estimation method for individual vehicles using a stationary monocular camera. The system includes the following: (1) object detection, which detects an object of interest based on a combination of motion detection and object classification and initializes tracking of the object if detected; (2) object tracking, which tracks the object over time based on template matching and reports its frame-to-frame pixel displacement; (3) speed estimation, which estimates vehicle speed by converting pixel displacements to actual distances traveled along the road; (4) object height estimation, which estimates the distance of the tracked point(s) of the object to the road plane; and (5) speed estimation correction, which adjusts the previously estimated vehicle speed based on the estimated object and camera heights.

The flowchart of our method is shown in Fig. 1. The object of our tracking interest is the license plate. Our method starts by detecting an object of interest (i.e., a license plate) within the camera field of view and then initializes the tracking accordingly. Although there exist many image-based license plate detection and/or recognition (LPR) algorithms, it is not necessary for this application to decipher the license plate information at this stage. For efficiency, we defer LPR to a later stage. Instead, we utilize a combination of motion detection and object classification to identify possible license plates. We first apply double-frame-differencing on frames at reduced spatial resolution to detect regions of interest (ROIs) indicating objects in motion. A pre-trained classifier is applied to each ROI to determine whether the ROI represents a license plate. If so, tracking of the detected plate is initiated.

Once a license plate is detected, our method tracks the top-left and top-right corners of the plate via template matching until the plate exits the scene. The initial templates are extracted when the plate is first detected. The templates are updated frame to frame in the tracking step to account for pose changes of the tracked vehicle. This is important for maintaining the tracking and for speed estimation. There are also interactions between tracking and object detection, since not every object detected in a frame is a new object and not every object tracked is always detected in a frame. We will discuss in this paper how our method deals with these interactions in an efficient manner.

Next, an approximate speed of the tracked vehicle is estimated. A conversion of units is required for computing vehicle speed because the output of the object detection and tracking steps is the license plate trajectory in pixel units. This conversion is often referred to as camera geometric transformation, a well-known art [6], but it has drawbacks when applied to certain real-world transportation applications [7]. A rough speed estimate can be calculated based on the conversion of pixels to real-world coordinates and the frame rate. Below, we discuss why this is not sufficient and how our method addresses the issue with a process to estimate the height of each tracked plate. We thus develop a novel camera calibration/characterization method to meet this need.
Finally, we perform speed estimation correction, which adjusts the previously estimated individual vehicle speed based on the estimated plate height and camera height. The camera height can be estimated through our proposed camera calibration/characterization method or measured directly with instruments. The height of the plate of each individual vehicle is estimated at run-time based on a priori knowledge of plate dimensions and measured spatial characteristics of the camera. This height represents the distance between the tracked object/feature and the road plane, which is a critical measurement needed for accurate speed measurement using a monocular camera. It is the key to recovering information missing due to 3D-to-2D imaging. This is why we choose to track the license plate rather than other vehicle features that do not have consistent properties. Once the height of the tracked feature(s) is estimated, the correction of the speed measurement is simply a fractional factor based on the ratio of plate height to camera height.
Figure 1. Algorithm flowchart of our proposed vehicle speed estimation using a monocular camera.
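The stages of Fig. 1 can be summarized in pseudocode form. The following Python sketch is illustrative only: the helper functions (detect_motion_rois, classify_plate, init_tracker, estimate_speed, estimate_plate_height) and the tracker interface are hypothetical stand-ins for the modules described in Sections 2 and 3, not the authors' implementation.

```python
# Minimal sketch of the overall pipeline from Fig. 1.
# All helper functions are hypothetical placeholders for the modules
# described in Sections 2 and 3.

def process_video(frames, fps, camera_calibration, camera_height):
    trackers = []                       # one tracker per vehicle being followed
    for t, frame in enumerate(frames):
        rois = detect_motion_rois(frames, t)          # Sec. 2.1, Eq. (1)
        rois = [r for r in rois if not any(tr.covers(r) for tr in trackers)]
        for roi in rois:
            if classify_plate(frame, roi):            # HOG + LDA classifier
                trackers.append(init_tracker(frame, roi, t))
        for tr in trackers:
            tr.update(frame, t)                       # template matching, Sec. 2.2
    results = []
    for tr in trackers:
        v0 = estimate_speed(tr.trajectory, fps, camera_calibration)        # Sec. 3.2
        h = estimate_plate_height(tr, camera_calibration, camera_height)   # Sec. 3.3
        results.append(v0 * (1 - h / camera_height))  # height correction, Sec. 3.4
    return results
```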
The remainder of this paper is organized as follows. In Section 2, the computer vision and video processing technologies applied to vehicle tracking are discussed. Our processing and methods used for speed estimation are presented in Section 3. Section 4 describes our experiment in assessing the accuracy of our method for speed estimation. Conclusions and future work are summarized in Section 5.
2. VEHICLE TRACKING

Vehicle tracking is an essential step for vision-based speed estimation. To estimate the speed of an object using a video camera(s), it is necessary to know the presence of the object and its trajectory while it is in the field of view of the camera(s). Object detection and tracking is a well-studied research topic in video processing and computer vision; an excellent survey can be found in Ref. [8]. Although many of these methods are readily applicable to speed estimation, the resulting accuracy of the estimated speed depends heavily on the type of tracking method. The reason is that there is a distinct aspect that needs to be considered by the tracking algorithm. More specifically, common methods focus on a coarser concept of tracking: the objective of most trackers is to track the object "as a whole." The operation of a tracker is considered effective as long as it can track the object as it appears or re-appears in the scene over time under various practical conditions. For the purpose of speed measurement, the tracking objective is refined: it is necessary to track a fixed and specific portion(s) of the object. As a simple example, if a tracker starts by tracking a point near the front of a vehicle, and as the tracking proceeds the feature point shifts toward the rear of the vehicle, this trajectory alone would not be sufficient for accurate speed estimation. As a result, a suitable tracker for an accurate speed measurement system has to achieve one of the following: (1) directly track a specific, consistent portion(s) of the object to determine its trajectory, or (2) track the object coarsely as in common practice while applying additional processing to infer the trajectory of a specific portion(s) of the object indirectly. Our method follows the first option. The specific portions we track are the top corners of the license plate of a vehicle. This decision is driven by the fact that we are interested in speed enforcement applications, where license plate detection and recognition is already a key component, and license plates have certain consistent dimensions that aid in the estimation. Other easily recognized elements with distinct features could be used as well.

2.1 License plate detection

Our method starts by detecting an incoming license plate(s) within the camera field of view and then initializes the tracking accordingly. Although there are many image-based license plate detection and/or recognition (LPR) algorithms, it is not necessary for this application to decipher the license plate information (i.e., recognition) at this stage. For efficiency, we defer recognition to a later stage and focus on detection. Since images are captured in the form of video and our interest is in moving vehicles, motion detection is incorporated into our detection step for efficiency. That is, we utilize a combination of motion detection and object classification to identify possible license plates.

For motion detection, our method proceeds as follows. Given a series of three consecutive image frames $I(x, y; t-1)$, $I(x, y; t)$, $I(x, y; t+1)$, $x = 1 \sim N_x$, $y = 1 \sim N_y$, we first apply double-frame-differencing on frames at a reduced spatial resolution (e.g., by a factor of $s = 8$) and threshold the differences to form a binary motion map $B(x', y'; t)$, where

$$B(x', y'; t) = \begin{cases} 1 & \text{if } |I_s(x', y'; t) - I_s(x', y'; t-1)| > T(t) \;\&\; |I_s(x', y'; t) - I_s(x', y'; t+1)| > T(t) \\ 0 & \text{otherwise} \end{cases} \quad (1)$$

for $x' = 1, 2, \ldots, N_x/s$, $y' = 1, 2, \ldots, N_y/s$, where $I_s$ denotes the frame at reduced resolution.
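As a concrete illustration of Eq. (1), a minimal double-frame-differencing detector can be written with OpenCV as below. This is a sketch under simplifying assumptions (grayscale frames and a fixed threshold standing in for the time-dependent $T(t)$); it is not the authors' exact implementation.

```python
import cv2
import numpy as np

def motion_rois(prev, curr, nxt, s=8, thresh=15, min_area=20):
    """Double-frame-differencing motion detection (cf. Eq. (1)).

    prev, curr, nxt: consecutive grayscale frames (uint8 numpy arrays).
    s: spatial downsampling factor; thresh: fixed stand-in for T(t).
    Returns bounding boxes (x, y, w, h) of motion ROIs in full-resolution
    pixel coordinates.
    """
    small = [cv2.resize(f, None, fx=1.0 / s, fy=1.0 / s) for f in (prev, curr, nxt)]
    d1 = cv2.absdiff(small[1], small[0])   # |I(t) - I(t-1)|
    d2 = cv2.absdiff(small[1], small[2])   # |I(t) - I(t+1)|
    motion = ((d1 > thresh) & (d2 > thresh)).astype(np.uint8)  # binary map B
    # Morphological cleanup (closing) to remove noise, as in the text.
    kernel = np.ones((3, 3), np.uint8)
    motion = cv2.morphologyEx(motion, cv2.MORPH_CLOSE, kernel)
    # Connectivity analysis to extract ROIs.
    n, labels, stats, _ = cv2.connectedComponentsWithStats(motion)
    rois = []
    for i in range(1, n):                  # label 0 is the background
        x, y, w, h, area = stats[i]
        if area >= min_area:
            rois.append((x * s, y * s, w * s, h * s))  # back to full resolution
    return rois
```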
Here $T(t)$ is a time-dependent threshold determined algorithmically. Connectivity analysis is then applied to the binary map to yield regions of interest (ROIs) indicating potential objects in motion. Additionally and preferably, morphological filtering such as opening, closing, and hole-filling may be applied to the binary mask to remove noise from various sources. Finally, for those identified ROIs whose sizes are within a range of interest, further processing is performed to verify whether there are license plates in those ROIs. Conceptually, the motion detection serves as a prescreening step to reduce the search regions for license plate detection.

For license plate detection, we first train a license-plate classifier off-line with samples extracted from videos of a local highway. The training sample set consists of 236 positives and 115 negatives. These samples are extracted by first running the motion detection on the training videos to extract image frames where motion regions of possible license plates exist. These are the same ROIs detected by the motion detection step described above (except that the ROIs are resampled back to their original high spatial resolution). The positive samples are then localized and cropped by human operators from those ROIs that contain at least one license plate, while the negative samples are extracted and cropped randomly from those ROIs that do not contain a license plate. Note that a large set of training samples is normally required for training a vision-based object classifier for detection [9]-[10]. The reasons that our off-line trained classifier can work so well with such a small set of training samples are the following:

• We have simplified the classification problem to a more limited but relevant scenario by limiting the training samples to the ROIs yielded by our motion detection. This simplified classification problem thus only needs a small set of training samples. By way of our process, a license plate will not be detected if it is not first detected by the motion detection. This is a choice that we make for speed enforcement applications to gain computational efficiency.

• Our classifier is only required to differentiate a region that has "license plate texture" from a region that does not, because it is not required to recognize the alphanumeric information of the plate at this stage. With that, we only need a small set of positive training samples to represent the possible "license plate texture" rather than a large set of positive training samples to cover all possible fine license plate details (such as the various possible alphanumeric information and plate designs).
Figure 2. Exemplary training samples for the HOG-LDA license plate classifier: (a) positives and (b) negatives.
In our license plate classifier, we use the histogram of oriented gradients (HOG) [11] as the feature and linear discriminant analysis (LDA) as the machine learning method [12]. The HOG feature has a dimensionality of 81 (3 cells × 3 cells × 9 bins), which is adequate for this application. At run-time, all ROIs detected in each frame by the motion detection are first checked against the tracking module to see whether a given ROI belongs to a license plate already being tracked from the previous frame. We use a common validation method in tracking, where proximity and similarity criteria are used. For the remaining ROIs that are not already being tracked, we perform license plate detection. The detection method we use is a standard window-search approach: various search windows slide within each ROI; the image content of each window is passed to the license plate classifier to yield a score indicating the likelihood that the given window contains a license plate; finally, maximal suppression is applied to windows whose scores are above a threshold to consolidate overlapping windows that correspond to the same license plate. If a license plate is detected here, that means a new (not previously tracked) license plate has been found. A new tracking thread is then initiated, and a speed estimate for this plate is calculated in later modules.

There are a few things worth noting. In our motion detection phase, the threshold $T(t)$ can be a static value or a dynamic one that varies frame by frame. There are trade-offs between the choice of thresholds and the performance of license plate detection. Aggressive thresholding will yield fewer ROIs and thus result in more efficient license plate detection, at the cost of possible missed detections. On the other hand, conservative thresholding yields large and/or many ROIs and degrades detection efficiency. In our license plate detection phase, it is often necessary to search within each ROI to better localize the license plates in it. This is especially true when the ROI is larger than the license plate and/or there is more than one plate in the ROI. However, with a proper coupling of imaging and thresholding it is possible to use dynamic thresholding so that each ROI covers tightly around one license plate or none. Under such a configuration, one can skip the window search and pass the image content of the entire ROI to the classifier to determine whether a license plate is present. The key point is that there are various interactions among the motion detection, license plate detection, imaging, and tracking modules. It is worthwhile to explore these interactions and optimize the algorithm as a system. In our experiment section, we describe a system approach to optimize the trade-offs.
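A compact sketch of this window-search detection follows, using scikit-image's HOG implementation and a pre-trained linear LDA discriminant (weight vector and bias). The window size, stride, and score threshold are illustrative assumptions rather than the paper's tuned values.

```python
import numpy as np
from skimage.feature import hog

def detect_plates(roi_img, lda_w, lda_b, win=(32, 64), stride=8, score_thr=0.0):
    """Slide windows over an ROI and score each with an 81-D HOG + linear LDA.

    roi_img: grayscale ROI (2-D numpy array).
    lda_w, lda_b: weight vector and bias of a pre-trained LDA discriminant.
    Returns a list of (score, (row, col)) for windows above the threshold;
    overlapping hits would then be merged by non-maximal suppression.
    """
    hits = []
    H, W = roi_img.shape
    for r in range(0, H - win[0] + 1, stride):
        for c in range(0, W - win[1] + 1, stride):
            patch = roi_img[r:r + win[0], c:c + win[1]]
            # 3x3 cells x 9 orientation bins -> 81-D descriptor, as in the text.
            feat = hog(patch, orientations=9,
                       pixels_per_cell=(win[0] // 3, win[1] // 3),
                       cells_per_block=(1, 1), feature_vector=True)
            score = float(np.dot(lda_w, feat) + lda_b)   # linear discriminant
            if score > score_thr:
                hits.append((score, (r, c)))
    return hits
```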
2.2 Vehicle tracking

Conceptually, our vehicle tracking algorithm tracks the top-left and top-right corners of the plate via template matching, where cross-correlation is used as a measure of similarity. The initial templates are extracted from the image frame when the plate is first detected. In a typical setting, the same templates are used throughout the tracking phase. This is an effective method assuming there is no severe pose change, scaling, or occlusion. Also, in typical settings, the search for matches is performed over the entire image frame at each tracking step, which can be computationally expensive and unnecessary. Our method deviates from these general practices in two respects: the templates are updated, and may be scaled, for each frame before matching, and the search regions are adaptively constrained based on the past tracking trajectory of the object being tracked. More details are described below.
When a new license plate of the $i$-th vehicle is first detected at time $t_0$, we create the initial templates of the top-left and top-right corners by cropping out $(2a+1) \times (2b+1)$ sub-images from the image, centered at the top-left and top-right corners of the detected plate:

$$T_j^{(i)}(m, n; t_0) = I\big(x_j^{(i)}(t_0) - a + m - 1,\; y_j^{(i)}(t_0) - b + n - 1;\; t_0\big), \quad m = 1 \sim (2a+1),\; n = 1 \sim (2b+1),\; j = 1, 2. \quad (2)$$

Here, $T_1^{(i)}(m, n; t)$ and $T_2^{(i)}(m, n; t)$ are the $(2a+1) \times (2b+1)$ template images representing the top-left and top-right corners of the plate of vehicle $i$ at time $t$, respectively; and $\big(x_1^{(i)}(t_0), y_1^{(i)}(t_0)\big)$ and $\big(x_2^{(i)}(t_0), y_2^{(i)}(t_0)\big)$ are the detected locations of the top-left and top-right corners of the plate of vehicle $i$ at time $t_0$, respectively. Note that $\big(x_j^{(i)}(t_0), y_j^{(i)}(t_0)\big)$ are obtained directly from the license plate detection and are used as the initial points of the trajectory of the top corners of the plate of vehicle $i$. Tracking of vehicle $i$ starts at the next frame $t = (t_0 + 1)$ and proceeds as follows. First, new centers for the tracked points at time $t$ are found using the following optimization procedure:
$$C_j^{(i)}(x, y, t) = \frac{\sum_m \sum_n I\big(x - a + m - 1,\; y - b + n - 1;\; t\big)\, T_j^{(i)}(m, n; t-1)}{\sqrt{\sum_m \sum_n I\big(x - a + m - 1,\; y - b + n - 1;\; t\big)^2}\, \sqrt{\sum_m \sum_n T_j^{(i)}(m, n; t-1)^2}},$$

$$\big(\hat{x}_j^{(i)}(t), \hat{y}_j^{(i)}(t)\big) = \operatorname*{argmax}_{(x, y) \in \mathcal{R}^{(i)}(t)} C_j^{(i)}(x, y, t),$$

$$\big(x_j^{(i)}(t), y_j^{(i)}(t)\big) = \begin{cases} \big(\hat{x}_j^{(i)}(t), \hat{y}_j^{(i)}(t)\big) & \text{if } C_j^{(i)}\big(\hat{x}_j^{(i)}(t), \hat{y}_j^{(i)}(t), t\big) > T_C \\ \text{unchanged (tracking unsuccessful)} & \text{otherwise} \end{cases} \quad j = 1, 2. \quad (3)$$
Here, $C_j^{(i)}(x, y, t)$ is the cross-correlation between the $(2a+1) \times (2b+1)$ sub-image of the current frame at time $t$ centered at $(x, y)$ and the template image $T_j^{(i)}(m, n; t-1)$ from the previous time $(t-1)$; $\big(\hat{x}_j^{(i)}(t), \hat{y}_j^{(i)}(t)\big)$ is the location where the maximal cross-correlation is found; $\mathcal{R}^{(i)}(t)$ is the search range at time $t$ for the above optimization procedure; and $T_C$ is a threshold that specifies the minimal correlation required to consider a matching/tracking successful. Next, the template images $T_j^{(i)}(t)$ are updated by cropping the current image frame around the newly found centers from the above tracking at time $t$:

$$T_j^{(i)}(m, n; t) = \begin{cases} I\big(x_j^{(i)}(t) - a + m - 1,\; y_j^{(i)}(t) - b + n - 1;\; t\big) & \text{if the tracking of } T_j^{(i)} \text{ at } t \text{ is successful} \\ T_j^{(i)}(m, n; t-1) & \text{otherwise} \end{cases}$$
$$m = 1 \sim (2a+1),\; n = 1 \sim (2b+1),\; j = 1, 2. \quad (4)$$
Note that a template remains the same (i.e., is not updated) if the tracking of that template is not successful (the maximal cross-correlation is too low to provide sufficient confidence of a match). These templates will be used for tracking in the next frame $(t+1)$. Next, the search range $\mathcal{R}^{(i)}(t+1)$ is updated, to be used for tracking the next frame $(t+1)$, based on the trajectory tracked so far:

$$(u, v) = \big(x_j^{(i)}(t) - x_j^{(i)}(t-1),\; y_j^{(i)}(t) - y_j^{(i)}(t-1)\big),$$

$$\mathcal{R}^{(i)}(t+1) = \Big[\big(x_j^{(i)}(t) + u - \operatorname{sign}(u) \cdot \delta_x\big) \sim \big(x_j^{(i)}(t) + u + \operatorname{sign}(u) \cdot \delta_x\big)\Big] \times \Big[\big(y_j^{(i)}(t) + v - \operatorname{sign}(v) \cdot \delta_y\big) \sim \big(y_j^{(i)}(t) + v + \operatorname{sign}(v) \cdot \delta_y\big)\Big]. \quad (5)$$
Here, $(u, v)$ is the vector indicating the travel of the tracked object between $t$ and $t-1$. The intuition behind this update is that if the speed of the object were constant in pixel units, we would expect the next location to be offset by the same amount; we could then search only at that offset location, which amounts to essentially no search at all. In practice, however, the speed of the object is not exactly constant in pixel units due to camera distortion, and the vehicle may not travel at exactly the same speed in physical units. Hence it is desirable to perform additional searching around the expected location within the range specified by $[-\delta, \delta]$. Note that in the rare case where tracking is not successful at time $t$ and/or $t-1$, we use the estimated travel vector from earlier in the trajectory, scaled appropriately by the number of frames it spans. Finally, if the tracking is successful, we have a new position of the tracked corners of the plate, $\big(x_j^{(i)}(t), y_j^{(i)}(t)\big)$. The newly found corners are compared to the motion ROIs detected in the motion detection phase to see whether any detected motion ROI represents a vehicle already being tracked and thus should be removed. After this step, if any detected motion ROIs remain, we repeat the process discussed above to verify whether any of them corresponds to a newly found plate and, if so, initiate a new tracking thread for it. Our tracking method repeats this process of template matching via maximizing cross-correlation, updating the templates, updating the search region, and initiating tracking for newly detected untracked motion ROIs (Eqs. (3)-(5) and (2)) for the frames $(t+1), (t+2), \ldots$ until too many unsuccessful trackings occur for a tracked object or the object has left the scene.
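To make the interplay of Eqs. (3)-(5) concrete, the following sketch implements one tracking step for a single corner with OpenCV's normalized cross-correlation, the dynamic template update, and the predictive local search. The search margin delta and correlation threshold corr_thr are illustrative assumptions, and boundary handling is simplified.

```python
import cv2
import numpy as np

def track_corner(frame, template, last_xy, motion_uv, delta=10, corr_thr=0.7):
    """One tracking step for a single plate corner (cf. Eqs. (3)-(5)).

    frame: current grayscale frame; template: (2a+1)x(2b+1) patch.
    last_xy: (x, y) top-left of the template crop at t-1.
    motion_uv: (u, v) frame-to-frame displacement estimated so far.
    Returns (new_xy, new_template, success).
    """
    th, tw = template.shape
    x, y = last_xy
    u, v = motion_uv
    # Predictive local search: center the search window at the predicted
    # position (x+u, y+v) with a margin of delta pixels (Eq. (5)).
    x0, y0 = max(0, x + u - delta), max(0, y + v - delta)
    x1 = min(frame.shape[1], x + u + tw + delta)
    y1 = min(frame.shape[0], y + v + th + delta)
    region = frame[y0:y1, x0:x1]
    # Normalized cross-correlation over the constrained region (Eq. (3)).
    scores = cv2.matchTemplate(region, template, cv2.TM_CCORR_NORMED)
    _, max_val, _, max_loc = cv2.minMaxLoc(scores)
    if max_val <= corr_thr:
        return last_xy, template, False            # keep old template (Eq. (4))
    new_xy = (x0 + max_loc[0], y0 + max_loc[1])
    # Dynamic template update: re-crop around the new position (Eq. (4)).
    nx, ny = new_xy
    new_template = frame[ny:ny + th, nx:nx + tw].copy()
    return new_xy, new_template, True
```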
In most cases, the vehicle speed is relatively slow compared to the frame rate, and the pose of the vehicle is roughly unchanged from frame to frame. The dynamic template update approach, however, can be improved using scaled templates. The concept is as follows. Figure 3 shows a typical camera field of view (FOV) for a speed enforcement or traffic surveillance setting, where the vehicles are travelling away from the camera with visible and recognizable license plates. As shown in Figs. 3a and 3b, the license plate size shrinks by about 20% over the course of half a second for a vehicle travelling at ~30 mph. For this type of imaging configuration, the size of a license plate in pixel units varies depending on where the plate is in the FOV, but the possible trajectories and speeds of a license plate are somewhat constrained and predictable. Given these characteristics, our tracking algorithm can be further improved by introducing scaling into the update of templates. In addition to using the template $T_j^{(i)}(t-1)$ to find the best match in the current frame at time $t$ via Eq. (3), scaled versions of the template can also be used in Eq. (3) to find additional best matches at new scales. The final best match is the one that has the highest cross-correlation across the scales. This multi-scale approach is well known in object detection applications; its key drawback is the additional processing and computation. Our contribution in the speed estimation application is that we propose to use the camera calibration information, the possible speed range on the monitored road, and the predictable nature of the trajectory of the tracked object to determine suitable scales for the templates. This differs from a typical multi-scale approach, which simply uses various scales without incorporating this type of prior knowledge. Incorporating this a priori information enables multi-scale matching in a more efficient and effective manner (see the sketch following Fig. 3).
Figure 3. Illustration of scale changes of a license plate in pixel units over half a second: (a) first appearance of the vehicle and (b) 0.5 second later.
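One possible realization of this idea is sketched below: the candidate template scales are bounded by the plate sizes that the calibration predicts at positions reachable within one frame, given the road's plausible speed range. Both helper functions (plate_width_px and pixel_step) are hypothetical calibration queries, not part of the paper's stated interface.

```python
def candidate_scale_range(calibration, pos, direction, fps, v_min=25.0, v_max=65.0):
    """Bound template scales for the next frame using prior knowledge.

    pos: current plate position in image coordinates.
    direction: unit vector of the (predictable) image trajectory at pos.
    v_min, v_max: plausible speed range on the monitored road (mph).
    plate_width_px(calib, p): expected plate width in pixels at position p.
    pixel_step(calib, p, v, fps): pixel displacement per frame at speed v.
    Both helpers are hypothetical calibration queries.
    """
    w_now = plate_width_px(calibration, pos)
    scales = []
    for v in (v_min, v_max):
        step = pixel_step(calibration, pos, v, fps)
        nxt = (pos[0] + direction[0] * step, pos[1] + direction[1] * step)
        scales.append(plate_width_px(calibration, nxt) / w_now)
    # Search only the scale interval implied by the speed range, rather
    # than a generic multi-scale pyramid.
    return min(scales), max(scales)
```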
3. SPEED ESTIMATION

From our tracking of the license plates of vehicles, our method produces a pair of trajectories $\big(x_j^{(i)}(t), y_j^{(i)}(t)\big)$, $j = 1, 2$, for each tracked vehicle over the times $t$ where the plate corners of that vehicle are successfully tracked. Given that the time between frames is known (= 1/fps), the speed of the vehicle can be easily calculated by dividing the distance traveled by the elapsed time. However, the trajectories are in pixel units; a conversion from image pixel coordinates to physical coordinates is needed. For factors relating to the camera, this conversion can be adequately represented by a set of projective transformations, which can be characterized by a procedure often referred to as camera calibration/characterization. There are many techniques for camera calibration/characterization; an interested reader can refer to Ref. [7] for more details. Roughly speaking, camera calibration/characterization methods can be categorized as model-based or manual [5]. There are many advantages to using model-based approaches; however, the accuracy may be more difficult to achieve. For the present application, we use manual calibration while proposing a practical solution to overcome some of the issues that prior methods do not address [13].
The other issue concerning speed estimation is related to the height of the tracked vehicle feature and the dimensionality of image acquisition. As shown in Fig. 4 and discussed in Ref. [5], the speed at the road surface is the desired measure, while the feature being tracked (e.g., edges, blob centroid, etc.) is generally above the road at an unknown height. Since an image acquired from a single camera is a 2D representation of a 3D phenomenon, it is not possible to determine the height of an object in the image unless additional information or images are provided. One solution is to simply assume that the height is zero, i.e., at the road surface, or some typical height of the vehicle feature being tracked. The implication of such an assumption is illustrated in Fig. 4. Assume a camera mounted at a fixed height $H$, and two different features at heights $h_1$ and $h_2$ being tracked. It can be shown that, for the same actual travel, the perceived distances at the road surface for the two features, $d_1$ and $d_2$, are related by $d_1/d_2 = (H - h_2)/(H - h_1)$. If we ignore the height and use the perceived distance at the road surface for speed estimation, we over-estimate. The higher feature is over-estimated more, i.e., it appears to move faster than it actually does. The fraction of over-estimation is the ratio of the feature height to the camera height; for example, with a camera at $H$ = 25 ft, tracking a plate at $h$ = 3 ft while assuming it lies on the road inflates the estimated speed by 3/25 = 12%. Given this, we need a good estimate of the height of the feature being tracked for accurate speed estimation with a single camera. This will be addressed in Sec. 3.3. Another possibility is to mount the camera at a very high point, which is not always possible in practice.
Figure 4. Illustration of an accuracy issue related to tracked vehicle image feature height and the dimensionality of image acquisition (modified from Fig. 13 of Ref. [5]).
3.1 Camera calibration

The geometric mapping between pixel coordinates $(x, y)$ and real-world planar coordinates $(X, Y; z = z_0)$ for a camera can be well characterized by a projective matrix $M(z)$:

$$\lambda \begin{bmatrix} X \\ Y \\ 1 \end{bmatrix} = M(z) \begin{bmatrix} x \\ y \\ 1 \end{bmatrix}, \quad (6)$$

where $z$ is the height above an arbitrary reference plane such as the road plane. The $3 \times 3$ matrix $M(z)$ is known as the camera projective matrix for $z = z_0$. A different $z$ has a different projective matrix, but the matrices are related across $z$'s. Although there are 10 unknowns, the 9 entries of $M(z)$ and the scaling $\lambda$, they are not completely independent, since $m_{31} x + m_{32} y + m_{33} = \lambda$ for all $(x, y)$ in the image frame. Due to this constraint, the minimal number of known data points (reference points), i.e., $(x, y)$ & $(X, Y)$ pairs, needed to find $M(z)$ is four. However, Eq. (6) can be highly non-linear for some $(x, y)$'s. Hence a robust derivation of $M(z)$ usually requires many more known data points if one chooses to derive the above 10 parameters directly from the data. If it is desired to know the projective matrices $M(z)$ for more than one $z$ (other heights above the road plane, assuming $z = 0$ is the road plane), even more reference points at the heights of interest are needed.

For transportation applications, it is rare that many reference points are readily available in the scene. Hence a common approach to acquiring this information is to block the traffic of the scene (work zone) and manually place enough static markers in the scene to have a sufficient number of temporary reference points for camera calibration. This is a serious deployment issue due to the cost of traffic interruption and safety factors. In this paper, we propose a method for camera calibration without the need for traffic interruption, via the use of moving test targets mounted on test vehicle(s). Figure 5 depicts the flowchart of our approach. One example embodiment is the following: a test vehicle (see Fig. 6) with a moving calibration target travels through the traffic camera's FOV of interest; the traffic camera identifies the test vehicle via matching of license plate numbers; the calibration points are identified for all frames based on the test target used; and a camera calibration map is constructed based on these identified calibration points and knowledge of the test target. Key advantages of this approach include cost savings (e.g., no need for lane/traffic stops, less manual intervention) and better calibration performance (e.g., it can afford more points than static test targets and is thus more robust against noise and more accurate). More details are discussed below.
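For reference, fitting a projective matrix from $N \geq 4$ pixel/world correspondences can be done with the standard direct linear transform (DLT). The least-squares sketch below is a generic textbook construction, not the authors' specific solver.

```python
import numpy as np

def fit_projective_matrix(pix, world):
    """Least-squares DLT fit of a 3x3 projective matrix M (cf. Eq. (6)).

    pix:   Nx2 array of image points (x, y), N >= 4.
    world: Nx2 array of corresponding road-plane points (X, Y) at a fixed z.
    Returns M such that lambda * [X, Y, 1]^T = M @ [x, y, 1]^T.
    """
    A = []
    for (x, y), (X, Y) in zip(pix, world):
        A.append([x, y, 1, 0, 0, 0, -X * x, -X * y, -X])
        A.append([0, 0, 0, x, y, 1, -Y * x, -Y * y, -Y])
    A = np.asarray(A, dtype=float)
    # The solution of A m = 0 (up to scale) is the right singular vector
    # with the smallest singular value.
    _, _, Vt = np.linalg.svd(A)
    M = Vt[-1].reshape(3, 3)
    return M / M[2, 2]            # normalize so m33 = 1

def pixel_to_world(M, x, y):
    """Apply M to a pixel coordinate and dehomogenize."""
    X, Y, w = M @ np.array([x, y, 1.0])
    return X / w, Y / w
```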
Figure 5. Method for camera calibration without traffic stops via use of moving test targets mounted on a test vehicle.
3.1.1 Test vehicle identification step

The key function of this step is to identify the test vehicle with the moving calibration targets when it appears in the FOV of the traffic camera to be calibrated. Figure 6(a) shows a photo of a prototype test vehicle carrying a 3×7 calibration grid target (a set of 360° reflectors) that we used in demonstrating this novel calibration method. An improved prototype with vertical risers is shown in Fig. 6(b). The addition of vertical risers allows us to derive camera projective matrices across various heights (thus a 3-D lookup table, LUT) and to estimate plate heights and the camera height. Note that an ideal calibration target has the following characteristics: (1) it is easy to identify (day or night), (2) the relative locations of its elements are known with great certainty, and (3) it covers as much of the FOV as possible. Our moving grid target meets these requirements (it meets #3 by moving the target frame to frame, i.e., "sweeping" the full surface of the road with a large number of calibration points). In our exemplary implementation, we use the vehicle detection and tracking module to analyze the video streams from the traffic camera, and initiate a vehicle identification step via automated license plate recognition (e.g., the Xerox proprietary LPR engine named XLPR) whenever a new vehicle first enters the scene. When an XLPR result matches the identification of the test vehicle, the camera calibration procedure is automatically initiated.
Figure 6. Photo of a prototype test vehicle carrying a 3×7 calibration grid target (set of 360° reflectors): (a) grid target only and (b) grid target plus two vertical risers.
3.1.2 Calibration point identification step

The key function of this step is to detect the positions of the intended calibration target in all frames and then identify the corresponding individual position of each element in the calibration target for each frame. In the example in Fig. 6b, this means first finding the regions that enclose all 3×7 reflectors and then further identifying all 21 centroids of those reflectors for each frame. We then identify the centroids of the 13×2 reflective tapes on the vertical risers. Since we are using 360° reflectors and reflective tapes, we can robustly and accurately identify these centroids. Assuming that the test vehicle stays in the FOV for $N$ frames, the output of this module is $N$ sets of 21 & 26 centroids. Note that $N$ depends on the speed of the test vehicle travelling through the FOV of the traffic camera.

3.1.3 Camera calibration construction step

The key function of this step is to construct the camera calibration map (image-coordinate to real-world-coordinate mapping) using the output from the calibration point identification step and knowledge of the calibration target. Without loss of generality, let us use the prototype test target shown in Fig. 6a as an example and assume that there is a total of $N$ sets of such 21 extracted centroids that can be used to construct a robust and accurate camera calibration map $M: (x, y) \to (X, Y, z_{\mathrm{ref}})$. First, an arbitrary (but fixed for each camera) reference point in image coordinates $(x_r, y_r)$ is chosen to be the origin $(X = 0, Y = 0, z = z_{\mathrm{ref}})$ in the real-world coordinates. A typical $(x_r, y_r)$ can be chosen in the middle of the image coordinates. Here $z_{\mathrm{ref}}$ is 0 if the camera calibration map is set at the road plane, and 3 feet if it is set at the height of the tips of the 360° reflectors in our prototype. We will ignore $z_{\mathrm{ref}}$ from now on, since it is fixed once the moving target is selected.

Once the reference origin is chosen, we can derive $M$ based on the $N$ extracted sets of 21 centroids and their relative positions. One approach is to construct camera calibration maps independently based on each set of 21 centroids and then average them (since we have fixed the reference origin, we can average these camera calibration maps regardless of the speed of the test vehicle). This has the advantage of simplicity, but it is not as robust or accurate as the following preferred method (an iterative data alignment and outlier removal approach).

Let us assume that the $N$ sets of 21 centroids are $\big(x_{k,g}, y_{k,g}\big)$, $k = 1 \sim N$, $g = 1 \sim 21$. Let us further assume that for each group $k$ of 21 centroids the known relative positions are $\big(\Delta X_g, \Delta Y_g\big)$, $g = 1 \sim 21$. If we further assume that the moving target is a rigid body (as it should be) and does not "flutter" while travelling through the scene (we would consider this a source of error if it is inevitable), we can construct a robust and accurate camera calibration map in the following manner:

• Derive the camera calibration map iteratively until all $N$ sets of data are used. That is (the order can be arbitrary),

  o Given the requirements $M_1: (x_r, y_r) \to (0, 0)$ (reference requirement) and $M_1: \big(x_{1,g}, y_{1,g}\big) \to \big(X_1 + \Delta X_g, Y_1 + \Delta Y_g\big)$, $g = 1 \sim 21$ (1st set of 21 centroids), one can solve for the optimal (e.g., in the least-squares sense) mapping $M_1$ and the required offset $(X_1, Y_1)$ so that this camera calibration map optimally satisfies the condition of the chosen reference origin and the condition of the known relative positions among these 21 centroids. (Iteration #1, using the first set of 21 centroids.)

  o Given the additional data set $\big(x_{2,g}, y_{2,g}\big)$, we can first apply the previous camera calibration map $M_1$ to get an estimate of their real-world coordinates $\big(X_2 + \Delta X_g, Y_2 + \Delta Y_g\big)$. Here $(X_2, Y_2) = \frac{1}{21}\sum_g \big[M_1\big(x_{2,g}, y_{2,g}\big) - \big(\Delta X_g, \Delta Y_g\big)\big]$. That is, we apply the previous mapping to the new set of 21 centroids and then compute the corresponding average offset away from the known relative positions for this group. The key is that the "averaging" step removes the noise and aligns the two different sets of 21 centroids using the rigid-body constraint. Note that the difference between $(X_2, Y_2)$ and $(X_1, Y_1)$ gives an indication of the speed of the test vehicle (and thus could potentially be used to improve the current algorithm further if the speed of the test vehicle is known).

  o Update the mapping to $M_2$, which is the optimal mapping satisfying $M_2: (x_r, y_r) \to (0, 0)$, $M_2: \big(x_{1,g}, y_{1,g}\big) \to \big(X_1 + \Delta X_g, Y_1 + \Delta Y_g\big)$, and $M_2: \big(x_{2,g}, y_{2,g}\big) \to \big(X_2 + \Delta X_g, Y_2 + \Delta Y_g\big)$.

  o Continue this iteratively until all $N$ sets of data have been used to construct $M_N$.

• Construct the final camera calibration map $M$ by removing $p\%$ outliers using the estimation from $M_N$. That is,

  o From the previous iterative process, we have the estimated real-world coordinates of all $N \times 21$ points using $M_N$, and the estimated real-world coordinates of all $N \times 21$ points using the average offsets under rigid-body constraints, $\big(X_k + \Delta X_g, Y_k + \Delta Y_g\big)$, $k = 1 \sim N$, $g = 1 \sim 21$. These two would not be the same unless there were no noise and the mapping perfectly captured the camera characteristics. We thus compute the errors between the two for all $N \times 21$ points, remove the $p\%$ (e.g., $p = 10$) of points that have the largest errors, and use the remaining $(100 - p)\%$ of points to re-calculate the final camera calibration map $M$.

Note that we prefer this method over the simple averaging method, since it is more robust and accurate due to the additional processes of calculating average offsets and the outlier removal steps. Note also that with this iterative data alignment and outlier removal approach, one can even use data from different time periods to build a more robust camera calibration map over time. A sketch of this construction is given below.
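The following Python sketch mirrors the iterative alignment and outlier removal described above, reusing the fit_projective_matrix and pixel_to_world helpers from the earlier DLT sketch. It simplifies the first iteration by anchoring the first centroid set's offset at zero; the 10% trim is the example value from the text.

```python
import numpy as np

def build_calibration_map(centroid_sets, rel_pos, ref_pixel, trim_pct=10):
    """Iterative data alignment + outlier removal (Sec. 3.1.3 sketch).

    centroid_sets: list of N arrays, each 21x2 of pixel centroids.
    rel_pos: 21x2 known relative target positions (rigid-body layout).
    ref_pixel: (x_r, y_r) pixel mapped to the world origin (0, 0).
    """
    pix = [np.asarray(ref_pixel, dtype=float)[None, :]]
    world = [np.zeros((1, 2))]
    M = None
    for pts in centroid_sets:
        if M is None:
            off = np.zeros(2)          # first set anchors the map (simplification)
        else:
            # Average offset of the mapped centroids from the known
            # rigid-body layout aligns this set with the current map.
            mapped = np.array([pixel_to_world(M, x, y) for x, y in pts])
            off = (mapped - rel_pos).mean(axis=0)
        pix.append(np.asarray(pts, dtype=float))
        world.append(rel_pos + off)
        M = fit_projective_matrix(np.vstack(pix), np.vstack(world))
    # Outlier removal: drop the trim_pct% of points with the largest error.
    P, W = np.vstack(pix), np.vstack(world)
    pred = np.array([pixel_to_world(M, x, y) for x, y in P])
    err = np.linalg.norm(pred - W, axis=1)
    keep = err <= np.percentile(err, 100 - trim_pct)
    return fit_projective_matrix(P[keep], W[keep])
```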
3.2 Initial speed estimation

Without loss of generality, let us assume that the resulting trajectories $\big(x_j^{(i)}(t), y_j^{(i)}(t)\big)$, $j = 1, 2$, of the plate corners of vehicle $i$ are available for $t = t_0, t_0 + 1, t_0 + 2, \ldots, t_1$, and assume the frame rate is fps. Our initial speed estimation assumes that the height of the plate corners is at the road plane (i.e., $z = 0$) and converts each point on the trajectories from image pixel coordinates to road-plane coordinates using $M(0)$. That is,

$$\lambda \begin{bmatrix} X_j^{(i)}(t) \\ Y_j^{(i)}(t) \\ 1 \end{bmatrix} = M(0) \begin{bmatrix} x_j^{(i)}(t) \\ y_j^{(i)}(t) \\ 1 \end{bmatrix}, \quad \forall\, t = t_0, t_0 + 1, t_0 + 2, \ldots, t_1,\; j = 1, 2, \quad (8)$$

where we denote $\mathbf{p}_j^{(i)}(t) = \big(X_j^{(i)}(t), Y_j^{(i)}(t)\big)$. The initial estimate of the speed of vehicle $i$ is then simply calculated by

$$\tilde{v}^{(i)} = \left[ \frac{\mathrm{fps}}{t_1 - t_0} \sum_{t = t_0 + 1}^{t_1} \left\| \mathbf{p}_1^{(i)}(t) - \mathbf{p}_1^{(i)}(t-1) \right\| + \frac{\mathrm{fps}}{t_1 - t_0} \sum_{t = t_0 + 1}^{t_1} \left\| \mathbf{p}_2^{(i)}(t) - \mathbf{p}_2^{(i)}(t-1) \right\| \right] \Big/ 2. \quad (9)$$

Here $\big\| \mathbf{p}_1^{(i)}(t) - \mathbf{p}_1^{(i)}(t-1) \big\|$ and $\big\| \mathbf{p}_2^{(i)}(t) - \mathbf{p}_2^{(i)}(t-1) \big\|$ are the distances travelled in physical units by the tracked top-left and top-right corners of the plate of vehicle $i$ between consecutive frames $t$ and $t-1$, respectively. The factor fps is used to convert distance to speed. Equation (9) expresses that we estimate the speed of the vehicle using the average of the frame-to-frame instantaneous speeds of each corner and then take the average of the speeds of the two corners. Other statistics could be used to estimate the speed of the vehicle as well; we choose this one for its simplicity.

$\tilde{v}^{(i)}$ would be an excellent estimate of the vehicle speed if the camera calibration were accurate and the tracked feature points were indeed on the road plane. Unfortunately, the tracked top corners of the plate are not on the road plane. It is thus necessary to estimate their height using the method discussed in Sec. 3.3. One may ask why we do not track feature points that are indeed on the road plane, i.e., the bottoms of the vehicle's tires. The answer is fairly intuitive: it is very difficult to track the tires of a vehicle robustly, since they have low contrast against the road and may be occluded from the camera view. Another alternative is to track feature points with a known, standard height for all vehicles; we are not aware of such features or standards for vehicles.
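A minimal sketch of Eqs. (8) and (9), assuming the pixel_to_world helper from the calibration sketch and trajectories stored as lists of pixel coordinates:

```python
import numpy as np

def initial_speed(traj_left, traj_right, M0, fps):
    """Initial speed estimate per Eqs. (8)-(9), assuming plate height z = 0.

    traj_left/traj_right: lists of (x, y) pixel positions of the two plate
    corners at consecutive frames. M0: road-plane projective matrix M(0).
    Returns speed in the physical units of the calibration (per second).
    """
    speeds = []
    for traj in (traj_left, traj_right):
        world = np.array([pixel_to_world(M0, x, y) for x, y in traj])
        steps = np.linalg.norm(np.diff(world, axis=0), axis=1)  # per-frame travel
        speeds.append(fps * steps.mean())    # average instantaneous speed
    return float(np.mean(speeds))            # average over the two corners
```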
3.3 Plate height estimation

A speed enforcement solution requires an effective means of identifying the violating vehicle, and this is typically done through the use of automated license plate recognition (ALPR). A key part of the ALPR algorithm involves the localization and extraction of license plate characters. A high-level view of a typical ALPR algorithm is captured in Figure 7. Section 2.1 describes license plate localization, and thus we can skip this block in Figure 7 and feed the localized region of interest (ROI) image directly to "Character Segmentation." In order to generate highly accurate OCR results for each ROI image, the segmentation algorithm is used to normalize the ROI image by removing as much imaging system variation and noise as possible. This normalization involves correcting for perspective and removing plate rotation, shear, and camera motion/lens blur. The correction is driven by priors such as license plates having an aspect ratio of 1:2 (H/W) and characters being of equal height and assembled in a linear fashion.
Figure 7. License plate recognition system used to identify vehicles without an RFID transponder.
Once an ROI image has been normalized, the characters are of similar pixel height, and an average of the character heights produces a robust estimate of the actual character height given the distance from the camera. The physical sizes of characters on a license plate are typically fixed for a particular state (same font). The identification of the issuing state is produced as part of the ALPR process by the "State ID" block, and thus we can obtain a mapping from the actual character height in absolute units to the one measured by the imaging system. Using the fact that the size of objects varies as a function of distance from the camera according to a power law, the character pixel heights from the reference point of a fixed camera with known lens parameters can be used to estimate the actual distance of the license plate from the camera. Once the actual distance of the license plate from the camera is known, we can apply simple geometry, illustrated in Figure 8 and Eqs. (10) & (11), to determine the height of the license plate above the road plane. The additional parameters required to solve the geometry are the camera height and the distance to the ground for the pixel $(x, y)$ corresponding to the center of the license plate. The distance to the ground from the camera is obtained as part of the calibration process:

$$\theta(x, y) = \cos^{-1}\!\left( \frac{cam\_height}{dist\_ground(x, y)} \right), \quad (10)$$

$$plate\_height(x, y) = cam\_height - \big[ dist\_LP \cdot \cos\theta(x, y) \big]. \quad (11)$$

Figure 8. Geometry for license plate height estimation (camera, center of LP, camera height, plate height, and pixel $(x, y)$ in frame).
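In code, the geometry of Eqs. (10) and (11) reduces to a few lines. Here dist_lp would come from the character-pixel-height estimate described above and dist_ground from calibration; both are treated as given inputs in this sketch.

```python
import math

def plate_height(cam_height, dist_ground, dist_lp):
    """Height of the plate above the road plane (Eqs. (10)-(11)).

    cam_height:  camera height above the road plane.
    dist_ground: slant distance from camera to the ground point seen at
                 the plate-center pixel (from calibration).
    dist_lp:     slant distance from camera to the license plate
                 (from character pixel height and lens parameters).
    All inputs in the same length unit.
    """
    theta = math.acos(cam_height / dist_ground)     # Eq. (10)
    return cam_height - dist_lp * math.cos(theta)   # Eq. (11)

# Example: camera 25 ft up, ground point at 100 ft slant distance, plate at
# 88 ft slant distance -> 25 - 88 * 0.25 = 3.0 ft above the road plane.
```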
3.4 Speed estimation with height-correction

As shown in Fig. 4, the ratio between the perceived distance travelled and the actual distance travelled is determined by the height of the tracked feature relative to the camera height. Given that we are able to estimate the height $h$ of the tracked feature and the perceived speed $\tilde{v}^{(i)}$ of Eq. (9), the actual speed of the vehicle can be estimated as

$$v^{(i)} = \left( 1 - \frac{h}{H} \right) \tilde{v}^{(i)}, \quad (12)$$

where $H$ is the camera height.
4. EXPERIMENT

To evaluate the accuracy of our method, we developed a near-infrared (NIR) monocular imaging system consisting of an NIR light source and a monocular camera. The experiment was conducted at a race track at Crofton, MD, and covered three days: one sunny and windy day, one rainy and windy day, and one night-time test. In total, we have 294 vehicle runs with ground-truth speed from a reference instrument. The tested speeds were targeted at three groups: 30, 45, and 60 mph. Nine plate heights were used: 24.5", 31", 31.5", 35.5", 36", 36.25", 36.75", 37", and 43". The camera was mounted 25 feet above the road plane. The video camera has 1728 × 2304 pixel resolution and was operated at 30 fps (for night-time) and 60 fps (for daytime). The reference instrument for speed measurement was the Vitronic PoliScan lidar speed enforcement system [14], which is advertised with a speed measurement accuracy of 99%.

We first calibrated our camera off-line using the method described in Sec. 3.1. In particular, we ran our calibration test vehicle on each lane of the track to cover the full FOV of our camera. The output of our analysis of these videos is a series of projective matrices $M(z)$, each providing a mapping $(x, y) \to (X, Y)$ for height $z$. For efficient processing, we convert these matrices to a 3-D lookup table (LUT) with input $(x, y, z)$ and output $(X, Y)$, i.e., $M: (x, y, z) \to (X, Y)$. Figure 9 shows the 3-D LUT of our camera. The bottom slice is for $z = 3$, which is the height of the 21 grid reflectors relative to the road plane. The LUT converges to a single point, which corresponds to the mounting position of the camera (at the height of 25 feet, as we measured via other means).

At run-time, videos of vehicles passing through the FOV of our imaging system were acquired. The videos were then analyzed by our method to obtain the estimated speeds of these vehicles. The ground-truth speeds of these vehicles were measured via the reference lidar instrument. They were also measured by ground loops on the track, which were less accurate than the reference lidar instrument or our system. We did find consistent instrument-to-instrument bias among these three measurement systems. Due to the lack of an absolute reference instrument that is an order of magnitude better than these three systems [15], we performed an off-line bias removal process before comparing results. This process was done by running speed measurements on a few random vehicles with all instruments and then comparing their results to find the bias between instruments. The relative biases with respect to the reference lidar system are −1.63% and −2.67% for our measurement system and the loops, respectively. These constant biases were used to remove instrument-to-instrument bias before comparing the results against the reference lidar system.

Figure 10 shows the results of speed estimation of our method. Figure 10(a) shows the accuracy of our speed estimation when compared to the reference lidar system. The errors in our plate height estimations are shown in Figs. 10(b) and 10(c). The two figures are the same plots with different units: one is in inches, while the other is in units of the corresponding contribution to speed error in percent (i.e., $\frac{\Delta h}{H} \cdot 100\%$). They show that on average there is no bias (0.1%) introduced by our plate height estimation. However, in terms of the average absolute error, our plate height estimation errors contributed 1.0% to the speed estimation errors. The statistics of our performance are listed in Table 1.
To decouple the sources of errors in our method, we also computed the speed estimate of each vehicle using its actual license plate height rather than our estimate. As shown in Table 1, the speed measurements are more accurate with the true height of the plate. The difference between speed estimates with estimated plate heights and those with true plate heights is about 0.7%, which is very close to our estimate of the errors contributed by height estimation errors. When the actual plate height is used, we see a clear degradation in performance as vehicle speed increases. This is an indication of degradation in our vehicle tracking as vehicle speed increases, which is expected: as vehicle speed increases, fewer tracking points are available and the frame-to-frame scaling changes are larger (more difficult to track). There is no clear degradation with speed when estimated plate heights are used; the reason is that the tracking errors are then confounded with plate height errors, which do not degrade as speed increases. The overall performance is very good, with an average absolute error of 1.4% and a p95 absolute error of 3.1%. A significant portion of the errors is due to plate height estimation, which is one of the focus areas of our future work. Note that the reference lidar instrument has a "claimed" 1% error. Statistically speaking, it is "not good enough" to assess our system fairly. In fact, it can be shown that using a measurement system with a 1% error rate to assess another independent measurement system with a 1% error rate can lead to an apparent error rate of 1.4% (i.e., a √2 factor increase) if the first measurement is treated as the reference.
Figure 9. Example camera calibration maps $M_z$ for a 20 × 20 grid at different $z$ values (data from the Maryland calibration test): (a) 3-D view and (b) 3-D view with the camera location shown.
Figure 10. Speed estimation accuracy of our method: (a) speed from our method vs. speed from the reference ground truth (mph), with estimated plate heights and with input plate heights; (b) histogram of plate height estimation error in inches (ave = 0.2", |ave| = 3.1"); (c) histogram of plate height estimation error expressed as speed error in percent (ave = 0.1%, |ave| = 1.0%).
Table 1. Speed estimation accuracy of our method.

Vehicle group | |Error| (ave, p95) with estimated heights | |Error| (ave, p95) with input heights
30 mph        | (1.4%, 2.9%)                              | (0.5%, 1.4%)
45 mph        | (1.5%, 3.7%)                              | (0.8%, 2.1%)
60 mph        | (1.3%, 3.0%)                              | (1.0%, 2.5%)
Overall       | (1.4%, 3.1%)                              | (0.7%, 1.8%)
5. CONCLUSION AND FUTURE WORK

Our method relies on the detection and tracking of license plates, the estimation of the height of the plate above the road plane, and knowledge of the camera height to achieve high accuracy in individual vehicle speed estimation. Key contributions of our method include the development of (1) tracking a specific set of feature points of a vehicle to ensure a consistent measure of vehicle speed, (2) a high-accuracy camera calibration/characterization method that does not interrupt the regular traffic of the site, and (3) a plate and camera height estimation method for improving the accuracy of individual vehicle speed estimation. Additionally, we examined the impact of spatial resolution on the accuracy of speed estimation and utilized that knowledge to improve the computational efficiency of our algorithm; for example, we use relatively low spatial resolution for motion detection while using high spatial resolution for object classification and tracking. We also improved the accuracy and efficiency of our tracking over standard methods via dynamic update of templates and predictive local search.
REFERENCES

[1] "Traffic Safety Facts 2011 Data," http://www-nrd.nhtsa.dot.gov/Pubs/811753.pdf
[2] National Highway Traffic Safety Administration, "Traffic Safety Facts, 2005 Data. Speeding," NHTSA's National Center for Statistical Analysis, Washington D.C., DOT HS810629 (2006). Available: http://www-nrd.nhtsa.dot.gov/Pubs/810629.pdf
[3] Goldenbeld, C. and Schagen, I. V., "The Effects of Speed Enforcement with Mobile Radar on Speed and Accidents: An Evaluation Study on Rural Roads in the Dutch Province Friesland," Accident Analysis and Prevention, 37(6), 1135-1144 (2005).
[4] Chen, G., Meckle, W. and Wilson, J., "Speed and Safety Effect of Photo Radar Enforcement on a Highway Corridor in British Columbia," Accident Analysis and Prevention, 34(2), 129-138 (2002).
[5] Loce, R. P., Bernal, E. A., Wu, W. and Bala, R., "Computer vision in roadway transportation systems: a survey," J. Electron. Imaging, 22(4), 041121 (2013). doi: 10.1117/1.JEI.22.4.041121
[6] Tsai, R. Y., "A versatile camera calibration technique for high-accuracy 3D machine vision metrology using off-the-shelf TV cameras and lenses," IEEE J. Robot. Autom., RA-3(4), 323-344 (1987).
[7] Kanhere, N. K. and Birchfield, S. T., "A Taxonomy and Analysis of Camera Calibration Methods for Traffic Monitoring Applications," IEEE Transactions on Intelligent Transportation Systems, 11(2), 441-452 (2010).
[8] Yilmaz, A., Javed, O. and Shah, M., "Object Tracking: A Survey," ACM Computing Surveys, 38(4), 1-45 (2006).
[9] Felzenszwalb, P., Girshick, R., McAllester, D. and Ramanan, D., "Object Detection with Discriminatively Trained Part Based Models," IEEE Transactions on Pattern Analysis and Machine Intelligence, 32(9) (2010).
[10] Bulan, O., Loce, R. P., Wu, W., Wang, Y., Bernal, E. A. and Fan, Z., "Video-based real-time on-street parking occupancy detection system," J. Electron. Imaging, 22(4), 041109 (2013). doi: 10.1117/1.JEI.22.4.041109
[11] Dalal, N. and Triggs, B., "Histograms of Oriented Gradients for Human Detection," IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 886-893 (2005).
[12] Krzanowski, W. J., Principles of Multivariate Analysis: A User's Perspective, Oxford University Press, New York (1988).
[13] Wimalaratna, L. G. C. and Sonnadara, D. U. J., "Estimation of the Speeds of Moving Vehicles from Video Sequences," Proceedings of the Technical Sessions, 24, 6-12, Institute of Physics, Sri Lanka (2008).
[14] http://www.vitronic.de/en/traffic-technology/applications/traffic-enforcement/speed-enforcement/poliscan-speed-fixed.html
[15] Bellucci, P., Cipriani, E., Gagliarducci, M. and Riccucci, C., "The SMART Project - Speed Measurement Validation in Real Traffic Condition," Proceedings of the 8th International IEEE Conference on Intelligent Transportation Systems, Vienna, Austria (2005).