
Generalized Haar Filter based Object Detection for Car Sharing Services

Keyu Lu, Student Member, IEEE, Jian Li, Member, IEEE, Li Zhou, Xiping Hu, Xiangjing An, Member, IEEE, and Hangen He

Abstract—Object detection is important in car sharing services. Accuracy, efficiency, and low memory consumption are desirable for object detection in car sharing services. This paper presents an object detection system that satisfies all these requirements. Our approach first divides the object detection task into multiple simpler local regression tasks. Then, we propose a generalized Haar filter based convolutional neural network (CNN) to reduce the consumption of memory and computing resources. To achieve real-time performance, we introduce a sparse window generation strategy that reduces the number of input image patches without sacrificing accuracy. We perform experiments on both vehicle and pedestrian datasets. Experimental results demonstrate that our approach can accurately detect objects under challenging conditions.

Note to Practitioners—Object detection is an important part of intelligent vehicle technologies, which play an important role in car sharing services. Object detection provides metadata for collision avoidance, self-driving systems and driver assistance systems, which can result in better safety and consumer experiences in car sharing services. Although deep learning methods have achieved excellent performance in object detection, they consume a large amount of storage and computing resources, which makes them difficult to deploy for car sharing services. This paper proposes a novel approach based on generalized Haar filters and a local regression strategy. Our approach is accurate, efficient and lightweight. The experimental results verify the effectiveness of the proposed approach in car sharing services.

Index Terms—Object detection, traffic scene, car sharing, CNN, Haar filter.

I. INTRODUCTION

Car sharing services provide customers with access to vehicles reserved for short-term use. Customers are charged per unit of time and often per mile [1]. The past decade has witnessed the growth of car sharing services, which greatly reduce inner-city traffic, congestion, and pollution [2]. To achieve better safety and consumer experiences, the next generation of car sharing services [1] will incorporate intelligent vehicle technologies such as collision avoidance [3], self-driving technologies [4] and advanced driver assistance systems [5].

K. Lu, J. Li, L. Zhou, X. An and H. He are with the College of Mechatronics and Automation, National University of Defense Technology, China. X. Hu is with Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, China. J. Li, L. Zhou and X. Hu are the co-corresponding authors (e-mail: [email protected], [email protected], [email protected]).

Vision-based object detection (e.g., vehicle and pedestrian detection) is one of the fundamental components of these intelligent vehicle technologies for car sharing services. Vision sensors are popular due to their faster response, lower price and lower power consumption compared with other sensors such as LiDAR and millimeter-wave radar [6]. Furthermore, vision sensors are able to gather rich information from traffic scenes (such as luminance, color and texture) [7], [8]. However, object detection is also a challenging task. In the first place, traffic scenes vary with many factors such as illumination, shadows and weather conditions, which makes object detection in traffic scenes difficult. More importantly, most platforms (such as FPGAs and ASICs) for traffic scene applications have strict limits on memory, computing resources and power consumption. As a result, the object detection task is quite different from that in the ILSVRC [9] and COCO [10] competitions, where algorithms can be run on high-performance servers and workstations. Although deep learning methods have achieved excellent performance in object detection, they consume a large amount of storage and computing resources [11], which makes them difficult to deploy in applications for car sharing services.

This paper proposes a light and effective object detection system that is feasible for car sharing services. The contributions of this work are fourfold: (1) We introduce a local regression strategy that applies light and efficient regression networks to small image patches to detect objects. Our local regression strategy is more effective than global regression based methods such as YOLO [12] and SSD [13]. (2) We propose generalized Haar filter based CNNs (G-Haar CNNs) to reduce the consumption of memory and computing resources. Due to the strong representational power of Haar filters, the networks are able to achieve high performance without consuming too many resources, which makes our method practicable for car sharing services. (3) We introduce a sparse window generation strategy to reduce the number of input image patches without sacrificing accuracy. In this way, the efficiency of our approach is ensured. (4) We design a light and efficient deep network that simultaneously outputs the bounding box, category and confidence score of each object through two output channels: a localization channel and a classification channel.

[Fig. 1 diagram: an input image block passes through shared layers conv1–conv4 (each followed by a ReLU and pooling layers pool1–pool4), then splits into a localization branch (conv5_1 through conv7_2 with ReLUs) and a classification branch (conv8_1 followed by softmax), producing the Localization and Classification outputs.]

Fig. 1: The architecture of our network. In the figure, convX, reluX and poolX respectively stand for the X-th convolution, relu (Rectified Linear Unit) and pooling layer in the convolutional neural network.

The rest of the paper is organized as follows: In Section II, we review related work and discuss the pros and cons of existing methods. In Section III, we present the architecture of our generalized Haar filter based CNN and describe how its weights are designed. We introduce our sparse window generation strategy in Section IV. Experimental results and associated discussions are presented in Section V, and conclusions are provided in Section VI.

II. RELATED WORK

In the past several decades, object detection has been addressed by a variety of methods. Early attempts relied on hand-crafted solutions, where objects in images are detected by manually designed features. Early hand-crafted solutions include color histograms, geometry primitives and wavelet transform coefficients [14]. Later, SIFT [15] popularized the use of key-points, and numerous key-point based methods followed. Among them, color-SIFT [16], PCA-SIFT [17] and SURF [18] are well-known examples that have been successfully applied to the object detection task [19]. Another set of hand-crafted algorithms uses boosting-based methods to achieve real-time and robust object detection. For example, in [20], objects are detected by using a cascade strategy and Haar-like features. Because Haar-like features can be efficiently calculated with integral images, this method achieves real-time performance without extra effort. To achieve better performance, Dollár et al. suggested a robust boosting-based algorithm called Aggregated Channel Features (ACF) [21]. In this method, features of different channels are produced from integral images, and objects are then effectively detected by decision trees and AdaBoost [20]. Furthermore, Dalal et al. [22] introduced a method based on the "HOG+SVM" framework. More specifically, it first produces edge-based features called histograms of oriented gradients (HOG) for each image patch, and these features are then classified by a Support Vector Machine (SVM) [23]. After that, numerous extensions of "HOG+SVM" were proposed. Among them, the DPM (Deformable Part Model) [24] has been widely used in object detection competitions such as the Pascal VOC Challenges [25]. This method represents each object by a root model and multiple part models. In this way, both the whole object and its parts are considered and the performance is improved significantly.

In a different line of work, numerous researchers are interested in learning-based methods. Especially in recent years, the success of deep learning has boosted the progress of vision-based object detection [26]. The Convolutional Neural Network (CNN) [27], [28] stands out as one of the most famous network architectures. With local receptive fields, weight sharing and spatial pooling [29], CNNs have made a breakthrough in computer vision and image processing. For example, AlexNet [27], a form of CNN, achieved a surprising result in the ILSVRC-2012 competition [9]. However, as network depth increases, CNNs meet a bottleneck in training [30]. Recently, He et al. proposed the deep residual network [30], which employs shortcut connections to solve the training problem of deeper CNNs. Although CNNs have achieved marvelous success, two main problems remain when they are applied to object detection tasks. In the first place, a CNN involves a large number of convolutions, so it would be rather inefficient under the conventional sliding window paradigm [20], [24]. Besides, CNNs are less sensitive to the scale and shift of objects, so it is hard to achieve accurate localization if a CNN is directly applied to object detection [31].

To overcome these problems, Girshick et al. proposed the region proposal framework (R-CNN) [32]. This method applies CNNs to candidate bounding boxes (called "proposals") that potentially contain objects [32]. Later, the upgraded version "Fast R-CNN" [31] was proposed to improve the efficiency of object detection. This method combines R-CNN with SPPnet [33]. In this way, computations of overlapping proposals are shared and run-time can be reduced dramatically. However, in both R-CNN and Fast R-CNN, the proposals are generated by Selective Search [34], which is quite inefficient. To overcome this limitation, Faster R-CNN [35] uses Region Proposal Networks (RPNs) to produce proposals in place of Selective Search [34]. In this way, proposal generation and CNN-based classification/regression are placed under a unified framework, and the detection/localization speed can be boosted with the help of a GPU. However, region proposal based CNNs are complex and not easy to optimize [12]. For this reason, various studies attempt to achieve real-time object detection by treating it as a regression problem.


Fig. 2: Description of a bounding box using a 4-dimensional vector (dx1, dx2, dy1, dy2) defined relative to the input window.

YOLO [12] is one of the groundbreaking studies on deep regression networks for object detection. The approach simultaneously outputs the location, category and confidence score of each object at an extremely high frame rate. However, it does not work well for small object detection. In the same vein, Liu et al. proposed SSD (Single Shot MultiBox Detector) [13] for real-time object detection. It adopts new techniques (such as multi-scale feature maps and default boxes) to improve performance [13].

As deep networks usually consume a large amount of memory, several methods in the literature focus on network compression. Early solutions, such as [36], [37], mainly focused on network pruning. Recently, Han et al. introduced a method called "Deep Compression" that reduces memory consumption by removing redundant connections, quantizing weights and compressing weights with Huffman coding [11]. Another class of approaches is based on network binarization [38], [39]. These methods attempt to compress neural networks by binarizing their weights and activations. Our approach is different from these methods: we constrain the weights of the CNN to the form of generalized Haar filters, which have strong representational power for feature extraction [20], [40].

III. GENERALIZED HAAR FILTER BASED CNN

In this section, we introduce the architecture of our generalized Haar filter based CNN (G-Haar CNN) and describe how to design its weights and train the model.

A. Architecture of the network

As shown in Fig. 1, our network consists of two output channels: a localization channel and a classification channel. The localization channel focuses on bounding box regression and outputs a 4-dimensional vector (dx1, dx2, dy1, dy2) for each bounding box (see Fig. 2). The classification channel outputs the corresponding category and confidence score, which are represented by a 2-dimensional vector (l, s). As shown in Fig. 1, our deep network consists of 11 convolution layers, 4 max-pooling layers and 1 softmax layer.
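As an illustration only, the PyTorch-style sketch below shows one possible instantiation of this two-channel layout. The trunk widths follow row 7 of Table I and the kernel sizes follow Section V-B (3 × 3 for conv1–conv5_x, 1 × 1 afterwards), but the head widths and the final average pooling used to obtain fixed-size outputs are assumptions of ours, not a specification taken from the paper.

```python
import torch
import torch.nn as nn

class GHaarDetectorSketch(nn.Module):
    """Hypothetical sketch of the two-channel network of Fig. 1 (assumed widths)."""

    def __init__(self):
        super().__init__()
        def block(cin, cout):
            # conv + ReLU + 2x2 max pooling, as in the shared trunk conv1~pool4
            return nn.Sequential(nn.Conv2d(cin, cout, 3, padding=1),
                                 nn.ReLU(inplace=True),
                                 nn.MaxPool2d(2))
        self.trunk = nn.Sequential(block(3, 256), block(256, 512),
                                   block(512, 512), block(512, 1024))
        # Localization branch: conv5_1..conv7_2 (3x3 then 1x1), output (dx1,dx2,dy1,dy2)
        self.loc_head = nn.Sequential(
            nn.Conv2d(1024, 512, 3, padding=1), nn.ReLU(inplace=True),  # conv5_1
            nn.Conv2d(512, 512, 3, padding=1), nn.ReLU(inplace=True),   # conv5_2
            nn.Conv2d(512, 256, 1), nn.ReLU(inplace=True),              # conv6_1
            nn.Conv2d(256, 256, 1), nn.ReLU(inplace=True),              # conv6_2
            nn.Conv2d(256, 128, 1), nn.ReLU(inplace=True),              # conv7_1
            nn.Conv2d(128, 4, 1),                                       # conv7_2
            nn.AdaptiveAvgPool2d(1), nn.Flatten())                      # assumed pooling
        # Classification branch: conv8_1 followed by softmax over (object, background)
        self.cls_head = nn.Sequential(nn.Conv2d(1024, 2, 1),
                                      nn.AdaptiveAvgPool2d(1), nn.Flatten())

    def forward(self, x):                       # x: (N, 3, 48, 48) image patches
        f = self.trunk(x)                       # shared low-level features
        return self.loc_head(f), torch.softmax(self.cls_head(f), dim=1)
```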

Fig. 3: Illustration of the selected generalized Haar filters.

To reduce memory consumption, low-level features are shared by the localization and classification channels via conv1∼pool4. As the two channels focus on different tasks, each of them has 4 independent layers to handle its own task. Due to the localization channel, our deep network is less sensitive to the scale and shift of input objects. Thus, instead of generating region proposals, objects can be efficiently detected by using a sparse sliding-window paradigm and perspective geometry. More details on this issue are given in Section IV and associated experiments are presented in Section V.

B. Generalized Haar Filter based Weight Design

We propose a generalized Haar filter based weight (G-Haar weight) design method to balance the performance and resource consumption of the network. Haar-like filters have been successfully applied to object detection due to their strong representational power and high efficiency [20], [40]. The generalized Haar filter is an extension of the Haar-like filter: instead of using a fixed number of rectangles and configuration types, generalized Haar filters allow arbitrary configurations and rectangle numbers [41]. In this work, the weights of our deep network are constrained to generalized Haar filters. Unlike conventional generalized Haar filters with arbitrary configurations and rectangle numbers [41], in our work they are obtained in a data-driven manner. More specifically, we constrain a weight w_i of size m × m (m ≥ 3) to the following form:

w_i = \hat{w}_p \cdot k_i,   (1)

where k_i ∈ R is a multiplication factor and ŵ_p is the p-th filter in the Haar filter space. For a generalized Haar filter of size m × m, there are 2^{m^2} possible configurations. In our case, ŵ_p and its negative form −ŵ_p can be regarded as the same filter. Thus, the Haar filter space of size m × m contains 2^{m^2−1} filters, that is, p ∈ [1, 2^{m^2−1}].
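To make this constraint concrete, the short NumPy sketch below enumerates the 3 × 3 Haar filter space (treating ŵ_p and −ŵ_p as one filter) and reconstructs a weight from a filter index p and a multiplication factor k_i as in equation (1). The variable names and the example values of p and k_i are illustrative only.

```python
import itertools
import numpy as np

def haar_filter_space(m=3):
    """Enumerate the m x m generalized Haar filter space used in equation (1).

    Each filter is a +/-1 sign pattern; a pattern and its negation count as the
    same filter, so only patterns whose first element is +1 are kept, giving
    2**(m*m - 1) filters (256 for m = 3).
    """
    filters = []
    for signs in itertools.product((-1.0, 1.0), repeat=m * m):
        if signs[0] == 1.0:                      # canonical representative
            filters.append(np.array(signs).reshape(m, m))
    return filters

space = haar_filter_space(3)
print(len(space))                                # 256 = 2**(3*3 - 1)

# A G-Haar constrained weight: one filter index p plus one real factor k_i.
p, k_i = 17, 0.43                                # illustrative values only
w_i = space[p] * k_i                             # equation (1): w_i = w_hat_p * k_i
```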


Fig. 4: Filter usage (usage count vs. filter index, for indices 1–256) in the 3 × 3 Haar filter space. The network is trained using the TME Motorway dataset [42]. In the figure, the filter index is the serial number of each Haar filter in its Haar filter space.

For a network trained on the TME Motorway dataset [42] for vehicle detection, Fig. 4 illustrates the filter usage in the 3 × 3 Haar filter space. As shown in Fig. 4, filter usage is quite "sparse": a few filters are used much more frequently than the others. To some extent, these filters are more important and representative than the rest. Based on this observation, we select these "important" filters for weight design. Let N_r denote the number of filters to be selected. We select the N_r filters with the highest usage counts (N_r = 32 in our work). The selected filters are shown in Fig. 3. All weights of our network are then approximated by these N_r filters and corresponding multiplication factors.

Essentially, generalized Haar filters compress the deep network by establishing relationships between the elements of each weight, so a considerable amount of storage can be saved. For each weight of our deep network, only a multiplication factor and a filter index need to be stored. In our work, each multiplication factor (single-precision floating point) occupies 4 bytes and each filter index (ranging from 1 to 32) needs less than 1 byte. Thus, less than 5 bytes are needed for each weight. By contrast, in conventional deep networks, 4m^2 bytes are needed for each weight of size m × m. Besides, compared with binary networks [38], [39], our network keeps a higher computing precision due to the floating-point multiplication factor.

Unlike the approaches in [20], [40], [41], where Haar filters are evaluated using integral images, we use basic additions and multiplications to calculate Haar filter based convolutions. One reason is that most weights in deep networks are relatively small (3 × 3 and 5 × 5), so there is no obvious benefit in using integral images. Besides, integral images would need to be re-computed for each layer, which is inefficient. Moreover, as generalized Haar filters are selected in a data-driven way, many different computation rules would be needed if convolutions were evaluated with integral images. Thus, we design an efficient strategy to handle Haar filter based convolution. In our work, generalized Haar filters only contain two types of elements, −1 and +1, so each generalized Haar filter can be regarded as a sign pattern matrix. Therefore, the weight w_i can be written as:

w_i = \hat{w}_p \cdot k_i = \mathrm{sign}(\hat{w}_p) \cdot k_i.   (2)

Each convolution step in a CNN can be regarded as a dot product operation. Let P_i denote the element-wise product of the Haar filter ŵ_p and the input patch x_i, that is:

P_i = \hat{w}_p \cdot x_i = \mathrm{sign}(\hat{w}_p) \cdot x_i.   (3)

As we regard Haar filters as sign pattern matrices, equation (3) can be evaluated using lookup tables without multiplication. Each convolution step can then be transformed to the following form:

\sum w_i \cdot x_i = \sum (\hat{w}_p \cdot x_i) \cdot k_i = \sum P_i \cdot k_i = k_i \cdot \sum P_i,   (4)

where \sum denotes the sum of matrix elements. In this way, only one multiplication is needed for each convolution step, whereas m^2 multiplications are needed in conventional deep networks. In a word, our approach can markedly reduce the consumption of computing resources, and consequently power consumption can also be reduced. Thus, our deep network is suitable for embedded devices and FPGAs, where computing resources (multipliers) and power are limited. In addition, using generalized Haar filters to constrain the weights provides a form of regularization, which improves the generality of the deep network. The corresponding loss functions and regularization terms are introduced in the next subsection.
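The NumPy sketch below is one way to realize a single G-Haar convolution step as in equations (3)–(4): the signed sum over the patch needs only additions and subtractions (in hardware, the sign pattern could drive a lookup table), and a single multiplication by k_i produces the response. Names and the random test data are illustrative.

```python
import numpy as np

def ghaar_conv_step(x_patch, haar_sign, k_i):
    """One convolution step under a G-Haar weight w_i = haar_sign * k_i.

    haar_sign is the +/-1 sign pattern of w_hat_p. Summing the element-wise
    product P_i = haar_sign * x_patch needs no multiplications, since
    multiplying by +/-1 is just an add or a subtract; the only real
    multiplication is the final scaling by k_i (equation (4)).
    """
    sum_P = np.sum(x_patch[haar_sign > 0]) - np.sum(x_patch[haar_sign < 0])
    return k_i * sum_P

# Sanity check against an ordinary dense convolution step with w_i = haar_sign * k_i.
rng = np.random.default_rng(0)
x = rng.standard_normal((3, 3))
sign = np.where(rng.standard_normal((3, 3)) > 0, 1.0, -1.0)   # a +/-1 pattern
k = 0.43
assert np.isclose(ghaar_conv_step(x, sign, k), np.sum((sign * k) * x))
```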

C. Multi-Task Training

Our G-Haar model has 3 parts that need to be trained: the classification channel, the localization channel and the G-Haar weights. In each training iteration, we first jointly train the classification and localization channels using a loss function that consists of a classification loss and a localization loss. We then constrain the weights to the form of generalized Haar filters.

1) Classification and localization channels: The classification channel addresses a typical classification problem, so we use the softmax loss to optimize this channel:

L_{cla} = -\sum_j y_j \log\!\left( \frac{e^{x_j}}{\sum_k e^{x_k}} \right),   (5)

where y_j is the label of the j-th training sample and x is the output of the previous layer. The localization channel outputs a 4-dimensional vector d = (d_{x1}, d_{x2}, d_{y1}, d_{y2}) (see Fig. 2).


Fig. 5: Illustration of the object size ratio: Ls is the side length of the object bounding square and Ws is the side length of the input window.

Given a ground-truth vector d̂ = (d̂_{x1}, d̂_{x2}, d̂_{y1}, d̂_{y2}), the localization loss is defined in squared-loss form:

L_{loc} = \| d - \hat{d} \|^2.   (6)

The final loss function for these two channels is a weighted combination of the classification and localization losses:

L = L_{cla} + \beta L_{loc},   (7)

where β is a weight that balances the classification loss and the localization loss. In this work, β is experimentally set to 10. With this loss function, we jointly train the classification and localization channels using the stochastic gradient descent (SGD) algorithm.
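For concreteness, a minimal NumPy sketch of the joint objective of equations (5)–(7) for a single training sample is given below; the function and variable names, and the one-hot label encoding, are assumptions for illustration.

```python
import numpy as np

def joint_loss(cls_logits, y_onehot, d_pred, d_gt, beta=10.0):
    """L = L_cla + beta * L_loc for one sample (equations (5)-(7)).

    cls_logits : raw outputs x feeding the softmax of equation (5)
    y_onehot   : one-hot class label y
    d_pred     : predicted (dx1, dx2, dy1, dy2)
    d_gt       : ground-truth (dx1, dx2, dy1, dy2)
    """
    probs = np.exp(cls_logits - cls_logits.max())
    probs /= probs.sum()
    L_cla = -np.sum(y_onehot * np.log(probs))                         # eq. (5)
    L_loc = np.sum((np.asarray(d_pred) - np.asarray(d_gt)) ** 2)      # eq. (6)
    return L_cla + beta * L_loc                                       # eq. (7)

print(joint_loss(np.array([2.0, -1.0]), np.array([1.0, 0.0]),
                 [0.1, 0.9, 0.2, 0.8], [0.0, 1.0, 0.0, 1.0]))
```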

2) Generalized Haar filter based weights: The training objective of this part is to constrain each weight to the form of a generalized Haar filter. In other words, we aim to obtain a Haar filter index p and a multiplication factor k_i and use them to approximate each weight in our network. Each weight is approximated in the least-squares sense:

\min_r \| w_i - \hat{w}_r \cdot \lambda_r \|^2, \quad r = 1, 2, \ldots, 2^{m^2-1}.   (8)

For each filter in the Haar filter space, the corresponding multiplication factor λ_r can be obtained from:

\frac{d}{d\lambda_r} \| w_i - \hat{w}_r \cdot \lambda_r \|^2 = 0.   (9)

From equation (9), λ_r can be calculated as:

\lambda_r = \frac{\sum w_i \cdot \hat{w}_r}{\sum \hat{w}_r^2}.   (10)
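The projection implied by equations (8)–(10) (and reused for the final selection of p and k_i in equation (12) below) can be written compactly; the sketch below uses a tiny hand-made candidate set for illustration, whereas the paper would use the N_r = 32 selected 3 × 3 filters.

```python
import numpy as np

def project_to_ghaar(w, haar_filters):
    """Approximate a weight w by the closest w_hat_r * lambda_r (eqs. (8)-(10)).

    For each candidate filter, lambda_r = sum(w * w_hat_r) / sum(w_hat_r**2)
    (equation (10)); the filter index minimizing the squared residual is kept,
    which is also the selection used for p and k_i in equation (12).
    """
    best = None
    for r, w_hat in enumerate(haar_filters):
        lam = np.sum(w * w_hat) / np.sum(w_hat ** 2)      # eq. (10)
        res = np.sum((w - w_hat * lam) ** 2)              # residual of eq. (8)
        if best is None or res < best[0]:
            best = (res, r, lam)
    _, p, k_i = best
    return p, k_i

# Illustrative candidate filters (all entries +/-1).
candidates = [np.ones((3, 3)),                                   # all +1
              np.hstack([np.ones((3, 2)), -np.ones((3, 1))]),    # vertical edge
              np.vstack([np.ones((1, 3)), -np.ones((2, 3))])]    # horizontal edge
rng = np.random.default_rng(1)
p, k_i = project_to_ghaar(rng.standard_normal((3, 3)), candidates)
```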

Having obtained λ_r, we normalize each weight by adding a regularization term to its original loss function C_{oi}(w_i, w_i^*):

C_i(w_i, w_i^*) = C_{oi}(w_i, w_i^*) + \varphi \cdot \min_r \| w_i - \hat{w}_r \cdot \lambda_r \|^2.   (11)

This regularization term not only constrains the weights to the form of generalized Haar filters, but also improves the generality of the deep network. Further experiments are presented in Section V. Having obtained the updated weight w_i^{t+1} using the stochastic gradient descent (SGD) algorithm, the final Haar filter index p and multiplication factor k_i are obtained by:

p = \arg\min_r \left\| w_i^{t+1} - \hat{w}_r \cdot \frac{\sum w_i^{t+1} \cdot \hat{w}_r}{\sum \hat{w}_r^2} \right\|^2, \qquad k_i = \frac{\sum w_i^{t+1} \cdot \hat{w}_p}{\sum \hat{w}_p^2}.   (12)

For clarity, we summarize the training procedure in Algorithm 1.

Algorithm 1: Update the parameters of the i-th weight.
Input: a mini-batch of inputs x_i; loss function C_i(w_i^t, w_i^*); generalized Haar filter set ŵ_r (r = 1, 2, ..., 2^{m^2−1}); current filter index p̃; current multiplication factor k̃_i; current learning rate l_t.
Output: updated filter index p; updated multiplication factor k_i; updated learning rate l_{t+1}.
Procedure:
1: Construct the current weight: w_i^t = ŵ_{p̃} · k̃_i;
2: Obtain targets using forward propagation: w_i^* = Forward(w_i^t, x_i);
3: Compute gradients using backward propagation: ∂C_i/∂w_i^t = Backward(w_i^t, w_i^*);
4: Update the weight using SGD: w_i^{t+1} = UpdateWeight(w_i^t, ∂C_i/∂w_i^t, l_t);
5: Update the filter index: p = \arg\min_r \| w_i^{t+1} - \hat{w}_r \cdot (\sum w_i^{t+1} \cdot \hat{w}_r / \sum \hat{w}_r^2) \|^2;
6: Update the multiplication factor: k_i = \sum w_i^{t+1} \cdot \hat{w}_p / \sum \hat{w}_p^2;
7: Update the learning rate: l_{t+1} = UpdateLearningRate(l_t, t).

In this pipeline, we first construct the current weights using the filter indexes and multiplication factors obtained in the previous iteration. Then, target weights w_i^* are obtained by a standard forward propagation pass. After that, based on the loss function C_i(w_i^t, w_i^*) (see equation (11)), gradients are calculated by standard backward propagation. Having obtained the gradients, the weights are updated with the stochastic gradient descent (SGD) algorithm. Finally, using the updated weights, the filter indexes and multiplication factors are obtained with equation (12).

IV. SPARSE WINDOW GENERATION

In this section, we describe how to generate sparse windows (or bounding boxes) to feed our network for efficient object detection.



Fig. 6: Final sparse windows in image pyramids.

We generate sparse windows based on the scale and shift tolerance of our deep network and the perspective geometry of traffic scenes. Due to the localization channel, our deep network is less sensitive to the scale and shift of input objects. Thus, we can use a sparse sliding-window strategy to achieve real-time object detection in traffic scenes. Besides, as the camera is usually mounted at a fixed position on a vehicle (e.g., on top of the windshield), objects (e.g., vehicles and pedestrians) produce predictable location-specific patterns in images. Accordingly, potential locations of objects in images can be obtained from the perspective geometry of the given scene. Based on the perspective geometry, a series of sparse windows can be generated according to the scale and shift tolerance of our deep network. Finally, the locations and categories of objects can be obtained efficiently by applying our deep network to these sparse windows.


Fig. 7: Illustration of the 4 boundary planes that jointly limit the locations and sizes of input windows in the x2D − y2D − d2D space.

A. Sparse sliding-window strategy

We first generate initial sparse windows based on the scale and shift tolerance of our deep network. We define a size ratio for each input window. As shown in Fig. 5, let Ws denote the size of each input window and Ls denote the size of each object bounding square. The size ratio between Ls and Ws is represented by Rs:

Rs = \frac{Ls}{Ws}.   (13)

We then produce image pyramids by resizing the given image to different scales. For each resized image in the pyramid, the network is responsible for objects with Rs ranging from 0.5 to 0.7 (objects beyond this range are detected at other pyramid scales). Thus, by setting the stride of the sliding windows to 0.3 of the window size, we can ensure that each object is completely contained in at least one window. In this way, our approach combines the benefits of regression based and sliding-window based methods.
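A small sketch of this window generation for one pyramid level is given below; the 48-pixel window size follows Section V, while the helper names, the stride being a fraction of the window size, and the pyramid construction are our own illustrative assumptions.

```python
def sparse_windows(img_h, img_w, win=48, stride_ratio=0.3):
    """Generate sparse square windows (x2D, y2D, d2D) for one pyramid level.

    With objects constrained to Rs in [0.5, 0.7], the free margin inside a
    window is at least (1 - 0.7) * win, so a stride of 0.3 * win suffices to
    guarantee that every such object is fully contained in at least one window.
    """
    stride = int(round(stride_ratio * win))
    ys = range(0, img_h - win + 1, stride)
    xs = range(0, img_w - win + 1, stride)
    return [(x, y, win) for y in ys for x in xs]

def image_pyramid_scales(obj_sizes_px, win=48, rs_range=(0.5, 0.7)):
    """Pick pyramid scales so that each listed object size maps into the Rs range.

    obj_sizes_px are object side lengths (in original-image pixels) to cover;
    each scale resizes one of them to the middle of the Rs range. This is only
    one plausible way to build the pyramid.
    """
    target = 0.5 * (rs_range[0] + rs_range[1]) * win     # about 0.6 * 48 pixels
    return [target / s for s in obj_sizes_px]

print(len(sparse_windows(480, 640)))                      # windows at one level
print(image_pyramid_scales([32, 64, 128, 256]))           # assumed object sizes
```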

Besides, instead of applying global regression to the whole image, our approach performs local regression for bounding box localization. In this way, the approach has the potential to detect smaller objects than global regression based methods [12], [13]. Furthermore, the approach is quite efficient due to the sparse sliding-window paradigm. Further experiments are presented in Section V.

B. Perspective Geometry

We further reduce the number of sparse windows by using the perspective geometry of traffic scenes (assuming the camera parameters are known). Let (x3D, y3D, z3D) denote a 3D point in world coordinates and (x2D, y2D) the corresponding 2D point in pixel coordinates.


TABLE I: Comparison between conventional CNNs (without G-Haar weights) and our G-Haar based CNNs.

conv1 | conv2   | conv3   | conv4    | conv5_x   | G-Haar | Mem.      | Mul./St. | Cla. Err. | Loc. Err.
3×64  | 64×128  | 128×256 | 256×256  | 256×128   | no     | 5.97 MB   | 9        | 0.147     | 5.19
3×64  | 64×256  | 256×512 | 512×1024 | 1024×128  | no     | 32.13 MB  | 9        | 0.135     | 4.78
3×256 | 256×512 | 512×512 | 512×1024 | 1024×512  | no     | 67.74 MB  | 9        | 0.092     | 4.21
3×128 | 128×256 | 256×512 | 512×1024 | 1024×1024 | no     | 96.04 MB  | 9        | 0.054     | 3.22
3×64  | 64×128  | 128×256 | 256×256  | 256×128   | yes    | 901.96 KB | 1        | 0.095     | 4.33
3×64  | 64×256  | 256×512 | 512×1024 | 1024×128  | yes    | 4.51 MB   | 1        | 0.072     | 3.48
3×256 | 256×512 | 512×512 | 512×1024 | 1024×512  | yes    | 9.58 MB   | 1        | 0.067     | 2.97
3×128 | 128×256 | 256×512 | 512×1024 | 1024×1024 | yes    | 13.68 MB  | 1        | 0.042     | 2.32

According to the perspective geometry, they satisfy:

z \begin{bmatrix} x_{2D} \\ y_{2D} \\ 1 \end{bmatrix} = M \begin{bmatrix} x_{3D} \\ y_{3D} \\ z_{3D} \\ 1 \end{bmatrix},   (14)

where M is the camera projection matrix, i.e., the product of an intrinsic matrix and an extrinsic matrix:

M = \begin{bmatrix} f_x & 0 & u_0 & 0 \\ 0 & f_y & v_0 & 0 \\ 0 & 0 & 1 & 0 \end{bmatrix} \begin{bmatrix} R & T \\ 0^T & 1 \end{bmatrix}.   (15)

In equation (15), R is a rotation matrix resulting from three rotations around the world coordinate axes. As in most object detection systems for traffic scenes, the camera rotation angles are relatively small. Accordingly, the camera projection matrix M can be approximated as:

M = \begin{bmatrix} m_{11} & 0 & m_{13} & m_{14} \\ 0 & m_{22} & m_{23} & m_{24} \\ 0 & 0 & 1 & m_{34} \end{bmatrix}.   (16)

Then, equation (14) can be transformed to:

x_{2D} = \frac{m_{11} x_{3D} + m_{13} z_{3D} + m_{14}}{z_{3D} + m_{34}}, \qquad y_{2D} = \frac{m_{22} y_{3D} + m_{23} z_{3D} + m_{24}}{z_{3D} + m_{34}}.   (17)

In this work, each input window of our deep network has the same height and width. We use d_{2D} to denote the height (or width) of the input window centered at (x_{2D}, y_{2D}) in pixel coordinates, and let d_{3D} denote the corresponding height (or width) in world coordinates. Then d_{2D} can be formulated as:

d_{2D} = \frac{m_{11} d_{3D}}{z_{3D} + m_{34}}.   (18)

As the locations of most objects in traffic scenes are limited (e.g., vehicles and pedestrians will not appear in the sky region), the number of windows can be further reduced based on this assumption. We use a triplet (x_{2D}, y_{2D}, d_{2D}) to describe each input window of our deep network in pixel coordinates. Let [x_{3D}^{min}, x_{3D}^{max}] and [y_{3D}^{min}, y_{3D}^{max}] represent the location ranges of the corresponding windows along the x_{3D} and y_{3D} axes of world coordinates, respectively.

Fig. 8: Comparison of G-Haar weights and conventional weights on the TME motorway dataset [42]: (a) classification error of training, (b) classification error of test, (c) localization error of training, (d) localization error of test, each plotted against training iterations (×10^4) for traditional weights and G-Haar weights with N_r = 256 and N_r = 32.

By solving the equation set consisting of equations (17) and (18), we can find 4 boundary planes in the x_{2D}–y_{2D}–d_{2D} space that jointly limit the locations and sizes of the input windows:

d_{2D} = x_{2D} \cdot g_x(x_{3D}^{min}) - g_x(x_{3D}^{min}) \cdot m_{13},
d_{2D} = y_{2D} \cdot g_y(y_{3D}^{min}) - g_y(y_{3D}^{min}) \cdot m_{23},
d_{2D} = x_{2D} \cdot g_x(x_{3D}^{max}) - g_x(x_{3D}^{max}) \cdot m_{13},
d_{2D} = y_{2D} \cdot g_y(y_{3D}^{max}) - g_y(y_{3D}^{max}) \cdot m_{23},   (19)

where:

g_x(x_{3D}) = \frac{m_{11} d_{3D}}{m_{11} x_{3D} - m_{13} m_{34} + m_{14}}, \qquad g_y(y_{3D}) = \frac{m_{11} d_{3D}}{m_{22} y_{3D} - m_{23} m_{34} + m_{24}}.   (20)

As shown in Fig. 7, the valid input windows are distributed in an inverted-pyramid region enclosed by these 4 boundary planes in the x_{2D}–y_{2D}–d_{2D} space. We use U_p to represent the input windows in this region and U_s to denote the sparse windows generated based on the scale and shift tolerance of our deep network; the final sparse windows U_f are then obtained by:

U_f = U_p \cap U_s.   (21)
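The sketch below shows one way equations (17), (18) and (21) might be used in practice: a candidate window (x2D, y2D, d2D) is mapped back to world coordinates via the approximated projection matrix and kept only if it corresponds to a plausible 3D location, which is equivalent to lying inside the region bounded by the planes of equation (19). All numeric values (camera entries, object size, location ranges) are placeholders, not parameters from the paper.

```python
def window_to_3d(x2d, y2d, d2d, m, d3d):
    """Invert eqs. (17)-(18): recover (x3D, y3D, z3D) for a window of size d2D.

    m holds the non-zero entries of the approximated projection matrix M of
    equation (16): m11, m13, m14, m22, m23, m24, m34.
    """
    z3d = m["m11"] * d3d / d2d - m["m34"]                                # eq. (18)
    x3d = ((x2d - m["m13"]) * (z3d + m["m34"]) + m["m13"] * m["m34"] - m["m14"]) / m["m11"]
    y3d = ((y2d - m["m23"]) * (z3d + m["m34"]) + m["m23"] * m["m34"] - m["m24"]) / m["m22"]
    return x3d, y3d, z3d

def in_perspective_region(x2d, y2d, d2d, m, d3d, x3d_range, y3d_range):
    """Membership test for U_p: the window must map to an allowed 3D location."""
    x3d, y3d, z3d = window_to_3d(x2d, y2d, d2d, m, d3d)
    return (x3d_range[0] <= x3d <= x3d_range[1]
            and y3d_range[0] <= y3d <= y3d_range[1]
            and z3d > 0)

# U_f = U_p intersect U_s (eq. (21)): keep only sparse windows inside the region.
m = {"m11": 800.0, "m13": 320.0, "m14": 0.0,
     "m22": 800.0, "m23": 240.0, "m24": 0.0, "m34": 0.0}   # placeholder camera
U_s = [(x, 275, 48) for x in range(0, 640, 14)]            # sample sparse windows
U_f = [w for w in U_s if in_perspective_region(*w, m, d3d=1.8,
                                               x3d_range=(-8.0, 8.0),
                                               y3d_range=(1.0, 1.6))]
```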


Fig. 10: Performance evaluation (recall) at different vehicle distances on the TME motorway dataset [42], comparing Caraffi, Castangia, YOLO, SSD500, BNN, Ours (Conventional Weights) and Ours (Full System).

TABLE II: Runtime analysis on the TME motorway dataset [42].

Method                        | Average Runtime (s)
Caraffi [42]                  | 0.1
Castangia [44]                | 0.05
YOLO [12]                     | 0.022
SSD500 [13]                   | 0.043
Ours (Without Sparse Windows) | 32
Ours                          | 0.019

As shown in Fig. 6, we illustrate the final sparse windows in image pyramids using the TME Motorway dataset [42], which is a challenging and widely used dataset for vehicle detection. Note that all sparse windows have the same size (48 × 48 in this work), which means that they can be used directly as the input of our deep network. In this way, window resizing is avoided and efficiency is ensured.

V. EXPERIMENTS

The experiments in this section mainly focus on vehicle and pedestrian detection in traffic scenes. The performance of our approach is evaluated on two widely used datasets: the TME motorway dataset [42] and the ETHZ pedestrian dataset [43]. The former is designed for vehicle detection in challenging traffic scenes with various lighting conditions and complex traffic situations [42], and the latter is developed for pedestrian detection in busy pedestrian zones [43]. We first evaluate the performance and generality of our G-Haar weights by comparing them with conventional weights. Then the consumption of storage and computing resources is evaluated. Finally, the effectiveness and efficiency of our approach are validated in comparison with several recently published state-of-the-art methods. In these experiments, output bounding boxes are refined by mean-shift and non-maximum suppression, and we use an intersection-over-union (IoU) threshold of 0.7 to determine the correctness of detections. All the experiments in this section are performed on a GTX 1080 GPU.

A. Generalized Haar filter based weights vs. conventional weights

We evaluate the performance of our G-Haar weights by comparing them with conventional weights.

As our deep network has two output channels, a classification channel and a localization channel, we use the classification and localization errors on the training and test sets as the performance metrics. The classification error for both training and test is defined as:

Er_{cla} = \frac{FP + FN}{N},   (22)

where N is the total number of training or test samples, and FP and FN respectively denote the numbers of positive and negative samples that are incorrectly classified. For the localization channel, the error for both training and test is defined as:

Er_{loc} = \frac{\sum_{i=1}^{N} \| d_i - \hat{d}_i \|^2}{4N},   (23)

where the vector d = (d_{x1}, d_{x2}, d_{y1}, d_{y2}) is the output of the localization channel and d̂ = (d̂_{x1}, d̂_{x2}, d̂_{y1}, d̂_{y2}) is the ground-truth location vector.
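A small NumPy sketch of these two metrics is given below; the inputs are assumed to be arrays over the evaluated samples, and the example values are illustrative.

```python
import numpy as np

def classification_error(pred_labels, true_labels):
    """Er_cla = (FP + FN) / N (equation (22)): fraction of misclassified samples."""
    pred_labels, true_labels = np.asarray(pred_labels), np.asarray(true_labels)
    return np.mean(pred_labels != true_labels)

def localization_error(d_pred, d_gt):
    """Er_loc (equation (23)): summed squared box error averaged over 4N entries."""
    d_pred, d_gt = np.asarray(d_pred, float), np.asarray(d_gt, float)
    return np.sum((d_pred - d_gt) ** 2) / (4 * len(d_pred))

print(classification_error([1, 0, 1, 1], [1, 0, 0, 1]))                 # 0.25
print(localization_error([[0.1, 0.9, 0.2, 0.8]], [[0.0, 1.0, 0.0, 1.0]]))
```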

di − dˆi , (23) Erloc = i=1 4N where the vector d = (dx1 , dx2 , dy1 , dy2 ) represents the output of the localization channel, and dˆ = (dˆ x1 , dˆ x2 , dˆ y1 , dˆ y2 ) is the ground truth location vector. Fig. 8 illustrates how the errors of classification and localization change when increasing the number of training iterations. This figure shows that our deep network achieves a high performance when the training converges. Generality is the ability of a network to avoid overfitting and predict new cases in a test set. Although the deep network with GHaar weights has a slightly larger training errors and needs more iterations to get convergent, it has smaller test errors in both classification and localization tasks. It means that the deep network with G-Haar weights has a stronger generality than that with conventional weights. This is due to the regularization effect of G-Haar weights. Besides, the G-Haar network with less N r (number of selected filters) needs more training iterations and tends to have stronger generality when its training converges. B. Resource consumption We investigate the consumption of storage and computing resource in this subsection. The experiment is also performed on TME motorway dataset [42]. In this experiment, the number of selected filters N r is 32, and all weights are stored using single-precision floating-point format (32 bits). In the deep network (see Fig. 1), the size of conv1∼conv5 x is 3×3, and the rest convolution weights have the size of 1 × 1. As weights of size 1 × 1 consume much less resource compared with that of size 3 × 3, we only evaluate conv1∼conv5 x and their influence on resource consumption in table I. As shown in table I (column of Mem.), memory consumption can be dramatically reduced by using G-Haar weights (about 0.8n2 times, n = 3 in this work). This is due to the fact that only a filter index and a multiplication factor are needed to be stored for each G-Haar based convolution kernel, which only consumes 5 bytes. By contrast, there are n2 weights needed to be stored for each conventional convolution kernel of size n × n, thus, 4n2 bytes are required. Besides, computing resource can be also reduced by using G-Haar weights. As multipliers are major computing resource consumed by deep networks, we use the number of multiply operations required for each convolution step to measure

Fig. 9: Performance evaluation on the ETHZ pedestrian dataset [43]: recall vs. false positives per image for LRC, YOLO, SSD500 and Ours on (a) ETH-01, (b) ETH-02 and (c) ETH-03.

Fig. 11: Qualitative results of YOLO, SSD and Ours on the TME [42] and ETHZ [43] datasets. "L", "M" and "S" respectively stand for the object sizes of large, middle and small.

As multipliers are the major computing resource consumed by deep networks, we use the number of multiply operations required per convolution step to measure the consumption of computing resources. Each conventional convolution kernel of size n × n needs n^2 multiply operations per convolution step. By contrast, in a network using G-Haar weights, each convolution step can be transformed to the form of equation (4), so only one multiply operation is required per G-Haar convolution step. As shown in Table I (Mul./St. column), the computing resource consumption of deep networks using G-Haar weights is only 1/n^2 of that using conventional weights (n = 3 in this experiment).

Furthermore, G-Haar weights can achieve lower power consumption than conventional weights, since power usage is directly influenced by storage (including memory accesses) and computing resource consumption [45]. This feature is especially important when the deep network is implemented on embedded systems or mobile devices (such as ASICs/FPGAs), which are quite sensitive to power consumption. Owing to the generalized Haar filter based weights, only one multiply operation is required for each convolution step, which makes it possible to construct more parallel pipelines in ASICs/FPGAs, where the number of multipliers is limited. Moreover, each local regression task is an independent computation, so all of the local regression tasks can run in parallel in ASICs/FPGAs. In this way, the system can achieve real-time response in ASICs/FPGAs without extra effort.

TABLE III: Performance comparison with state-of-the-art methods. Detection performance is measured by the recall rate at 1 false positive per image. The best performance on each dataset is achieved by our method.

Dataset | YOLO [12] | SSD500 [13] | LRC [46] | BNN [38] | Ours
TME     | 0.837     | 0.861       | –        | 0.845    | 0.882
ETHZ-01 | 0.815     | 0.833       | 0.758    | 0.773    | 0.863
ETHZ-02 | 0.688     | 0.752       | 0.607    | 0.704    | 0.767
ETHZ-03 | 0.826     | 0.852       | 0.779    | 0.838    | 0.875

C. Comparing with state-of-the-art methods

For further analysis, our proposed approach is evaluated against several state-of-the-art methods, including hand-crafted methods (LRC [46], Castangia [44] and Caraffi [42]) and learning-based methods (YOLO [12], BNN [38] and SSD500 [13]).


In the following experiments, "Ours" represents our complete method, which employs G-Haar weights and the sparse window generation procedure. "Ours (Conventional Weights)" and "Ours (Without Sparse Windows)" denote the variants of our approach that employ conventional weights instead of G-Haar weights and that omit the sparse window generation procedure, respectively. The experimental settings of our approaches are the same as those in row 7 of Table I.

As shown in Fig. 9, Fig. 10 and Table III, learning-based methods (such as YOLO [12], BNN [38], SSD500 [13] and our approach) tend to perform better than conventional hand-crafted methods, which is due to the strong representational power of deep convolutional neural networks. Besides, our G-Haar based method outperforms BNN [38], since G-Haar weights keep a higher computing precision than binary networks. As presented in Table II, regression based networks achieve real-time performance on the object detection task in traffic scenes. The results in Table II also demonstrate that sparse window generation can dramatically reduce computational cost, since exhaustive sliding windows are avoided. The experimental results in Fig. 10 and Fig. 11 further indicate that our proposed method achieves better performance on small object detection than other regression based deep networks such as YOLO [12] and SSD500 [13], which use a global regression strategy. Our approach divides the global regression task into several easier local regression tasks and detects multi-scale objects from image pyramids. In this way, the image resolution of small objects is preserved and better performance can be achieved.

VI. CONCLUSIONS

We have presented a novel object detection system for car sharing services. We first introduced a local regression strategy for accurate object detection. Besides, we presented generalized Haar filter based deep networks to reduce the consumption of memory and computing resources. In addition, we proposed a sparse window generation strategy to reduce the number of input image patches without sacrificing accuracy. Experimental results on both vehicle and pedestrian datasets suggest that the proposed approach is efficient, robust and resource-saving compared with several state-of-the-art methods. Thus, the potential application of the proposed approach in car sharing services is verified.

ACKNOWLEDGMENT

This work is supported by the National Natural Science Foundation of China (Grant No. 61473303).

REFERENCES

[1] S. A. Shaheen, M. A. Mallery, and K. J. Kingsley, “Personal vehicle sharing services in north america,” Research in Transportation Business & Management, vol. 3, pp. 71–81, 2012. [2] B. Cohen and J. Kietzmann, “Ride on! mobility business models for the sharing economy,” Organization & Environment, vol. 27(3), pp. 279–296, 2014. [3] A. Balachandran, M. Brown, S. M. Erlien, and J. C. Gerdes, “Predictive haptic feedback for obstacle avoidance based on model predictive control,” IEEE Transactions on Automation Science and Engineering, vol. 13(1), pp. 26–31, 2016.

[4] D. Fagnant and K. M. Kockelman, “The travel and environmental implications of shared autonomous vehicles, using agent-based model scenarios,” Transportation Research Part C: Emerging Technologies, vol. 40, pp. 1–13, 2014. [5] Z. Sun, G. Bebis, and R. Miller, “Monocular precrash vehicle detection: features and classifiers,” IEEE Transactions on Image Processing (TIP), vol. 32(9), pp. 2019–2034, 2006. [6] K. Lu, J. Li, X. An, and H. He, “Vision sensor-based road detection for field robot navigation,” Sensors, vol. 15(11), pp. 29 594–29 617, 2015. [7] K.-M. Lee, Q. Li, and W. Daley, “Effects of classification methods on color-based feature detection with food processing applications,” IEEE Transactions on Automation Science and Engineering, vol. 4(1), pp. 40– 51, 2007. [8] H. Yang, S. Zheng, J. Lu, and Z. Yin, “Polygon-invariant generalized hough transform for high-speed vision-based positioning,” IEEE Transactions on Automation Science and Engineering, vol. 13(3), pp. 1367– 1384, 2016. [9] “Large Scale Visual Recognition Challenge (ILSVRC),” http://www. image-net.org/challenges/LSVRC/, [Online; accessed Oct. 23, 2016]. [10] “MS COCO Visual Recognition Challenges,” http://mscoco.org/, [Online; accessed Oct. 23, 2016]. [11] S. Han, H. Mao, and W. J. Dally, “Deep compression: Compressing deep neural networks with pruning, trained quantization and huffman coding,” Fiber, vol. 56(4), pp. 3–7, 2016. [12] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, “You only look once: Unified, real-time object detection,” arXiv preprint, vol. 1506.02640, 2015. [13] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu, and A. C. Berg, “Ssd: Single shot multibox detector,” arXiv preprint, vol. 1512.02325, 2015. [14] I. Daubechies and C. Heil, Ten Lectures on Wavelets. Capital city press, 1992. [15] D. Lowe, “Distinctive image features from scale-invariant keypoints,” International Journal of Computer Vision (IJCV), vol. 60(60), pp. 91– 110, 2004. [16] K. E. van de Sande, T. Gevers, and C. G. Snoek, “Evaluating color descriptors for object and scene recognition,” IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), vol. 32(9), pp. 1582–1596, 2010. [17] Y. Ke and R. Sukthankar, “Pca-sift: A more distinctive representation for local image descriptors,” in IEEE International Conference on Computer Vision and Pattern Recognition (CVPR), 2004, pp. 506–513. [18] H. Baya, A. Essa, T. Tuytelaarsb, and L. V. Goola, “Speeded-up robust features (surf),” Computer Vision and Image Understanding, vol. 110(3), pp. 346–359, 2008. [19] X. Yuan, L. Kong, D. Feng, and Z. Wei, “Automatic feature point detection and tracking of human actions in time-of-flight videos,” IEEE/CAA Journal of Automatica Sinica, vol. 4(4), pp. 677–685, 2017. [20] P. Viola and M. J. Jones, “Robust real-time face detection,” International Journal of Computer Vision (IJCV), vol. 57(2), pp. 137–154, 2004. [21] P. Doll´ar, R. Appel, S. Belongie, and P. Perona, “Fast feature pyramids for object detection,” IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), vol. 36(8), pp. 1532–1545, 2014. [22] N. Dalal and B. Triggs, “Histograms of oriented gradients for human detection,” in IEEE International Conference on Computer Vision and Pattern Recognition (CVPR), 2005, pp. 886–893. [23] C. Cortes and V. Vapnik, “Support-vector networks,” Machine Learning, vol. 20(3), pp. 273–297, 1995. [24] P. F. Felzenszwalb, R. B. Girshick, D. McAllester, and D. 
Ramanan, “Object detection with discriminatively trained part-based models,” IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), vol. 32(9), pp. 1627–1645, 2010. [25] “Visual Object Classes Challenge 2012 (VOC2012),” http://host.robots. ox.ac.uk/pascal/VOC/voc2012/, [Online; accessed Oct. 23, 2016]. [26] J. Yu, D. Tao, R. Hong, and X. Gao, “Recent developments on deep big vision,” Neurocomputing, vol. 187, pp. 1–3, 2016. [27] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” Advances in Neural Information Processing Systems, vol. 25, pp. 1–9, 2012. [28] S. Ren, K. He, R. Girshick, X. Zhang, and J. Sun, “Object detection networks on convolutional feature maps,” IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), vol. 39(7), pp. 1476–1481, 2017. [29] T. Wang, D. J. Wu, A. Coates, and A. Y. Ng, “End-to-end text recognition with convolutional neural networks,” in International Conference on Pattern Recognition (ICPR), 2012, pp. 1051–4651.


[30] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in IEEE International Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 770–778. [31] R. Girshick, “Fast r-cnn,” in IEEE International Conference on Computer Vision (ICCV), 2015, pp. 1440–1448. [32] R. Girshick, J. Donahue, T. Darrell, and J. Malik, “Rich feature hierarchies for accurate object detection and semantic segmentation,” in IEEE International Conference on Computer Vision and Pattern Recognition (CVPR), 2014, pp. 580–587. [33] K. He, X. Zhang, and S. Ren, “Spatial pyramid pooling in deep convolutional networks for visual recognition,” IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), vol. 37(9), pp. 1904–1916, 2014. [34] J. Uijlings, K. van de Sande, T. Gevers, and A. Smeulders, “Selective search for object recognition,” in International Journal of Computer Vision (IJCV), 2013, pp. 154–171. [35] S. Ren, K. He, R. Girshick, and J. Sun, “Faster r-cnn: Towards real-time object detection with region proposal networks,” IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2016. [36] S. J. H. an Lorien Y. Pratt, “Comparing biases for minimal network construction with back-propagation,” in Advances in neural information processing systems (NIPS), 1989, pp. 177–185. [37] J. S. D. Yann LeCun and S. A. Solla, “Optimal brain damage,” in Advances in neural information processing systems (NIPS), 1989, pp. 598–605. [38] M. Courbariaux, I. Hubara, D. Soudry, R. El-Yaniv, and Y. Bengio, “Binarized neural networks: Training neural networks with weights and activations constrained to +1 or -1,” arXiv preprint, vol. 1602.02830, 2016. [39] M. Rastegari, V. Ordonez, J. Redmon, and A. Farhadi, “Xnor-net: Imagenet classification using binary convolutional neural networks,” arXiv preprint, vol. 1603.05279, 2016. [40] K. Zhang, L. Zhang, and M.-H. Yang, “Real-time compressive tracking,” in European Conference on Computer Vision (ECCV), 2012, pp. 864– 877. [41] P. Doll´ar, Z. Tu, H. Tao, and S. Belongie, “Visual tracking with online multiple instance learning,” in IEEE International Conference on Computer Vision and Pattern Recognition (CVPR), 2009, pp. 983–990. ˇ [42] C. Caraffi, T. Voj´ıˇr, J. Trefn´y, J. Sochman, and J. Matas, “A system for real-time detection and tracking of vehicles from a single car-mounted camera,” in IEEE Intelligent Transportation Systems Conference, 2012, pp. 975–982. [43] B. Lucas and T. Kanade, “Moving obstacle detection in highly dynamic scenes,” in IEEE International Conference on Robotics and Automation (ICRA), 2009, pp. 56–63. [44] L. Castangia, P. Grisleri, P. Medici, A. Prioletti, and A. Signifredi, “A coarse-to-fine vehicle detector running in real-time,” in IEEE Intelligent Transportation Systems Conference, 2014, pp. 691–696. [45] M. Horowitz, “Computing’s energy problem (and what we can do about it),” in IEEE International Solid State Circuits Conference, 2014, pp. 10–14. [46] W. R. Schwartz, L. S. Davis, and H. Pedrini, “Local response context applied to pedestrian detection,” Progress in Pattern Recognition, Image Analysis, Computer Vision, and Applications, pp. 181–188, 2011.

Keyu Lu received the B.E. degree and the M.E. degree in control science and engineering from National University of Defense Technology (NUDT), Changsha, Hunan, P.R. China, where he is currently working toward the Ph.D. degree. From Dec. 2016 to Dec. 2017 he was a visiting Ph.D. student at the University of British Columbia (UBC). His research interests include computer vision, robotics, pattern recognition and machine learning.

Jian Li received the B.E., M.E. and Ph.D. degrees in control science and engineering from National University of Defense Technology (NUDT), Changsha, Hunan, P.R. China. He was a visiting Ph.D. student at Center for Intelligent Machines (CIM) in McGill University in 2012. He is currently a lecturer in the College of Mechatronic Engineering and Automation at NUDT. His research interests include computer vision, robotics, image processing, and machine learning.

Li Zhou received his B.S., M.S. and Ph.D. degrees from National University of Defense Technology (NUDT), China, in 2009, 2011 and 2015, respectively. From Sept. 2013 to Sept. 2014 he worked as a visiting scholar at The University of British Columbia, Canada. He is currently an assistant professor at the College of Electronic Science, NUDT, China. His research interests are in the area of software defined radios (SDRs), software defined networks (SDNs) and unmanned ground vehicles (UGVs). Dr. Zhou was invited as a keynote speaker at China Satellite 2017 and ICWCNT 2016. He served as a TPC member in IEEE CIT 2017 and as a co-chair in ITA 2016. His research contributions have been published and presented in more than 20 prestigious journals and conferences, such as IEEE Transactions on Vehicular Technology, Ad Hoc Networks, WInnComm 2017 and IEEE INFOCOM 2015.

Xiping Hu received the Ph.D. degree from The University of British Columbia, Vancouver, BC, Canada. He was the Co-Founder and the CTO of Bravolol Ltd., Hong Kong, a leading language learning mobile application company with over 100 million users, listed as the top-two language education platform globally. He is currently a Professor with the Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, Shenzhen, China. He has authored or co-authored around 60 papers published and presented in prestigious conferences and journals, such as the IEEE Transactions on Emerging Topics in Computing, the IEEE Internet of Things Journal, the ACM Transactions on Multimedia Computing, Communications, and Applications, the IEEE Communications Surveys and Tutorials, the IEEE Communications Magazine, the IEEE Network, ACM MobiCom, and WWW. His current research interests include mobile cyber-physical systems, crowdsensing, social networks, and cloud computing. Dr. Hu has been serving as an Associate Editor of IEEE Access and a Lead Guest Editor of the IEEE Transactions on Automation Science and Engineering and Wireless Communications and Mobile Computing.

Xiangjing An received the B.S. degree in automatic control from the Department of Automatic Control, National University of Defense Technology (NUDT), Changsha, P.R. China, in 1995 and the Ph.D. degree in control science and engineering from the College of Mechatronics and Automation (CMA), NUDT, in 2001. He was a visiting scholar for cooperative research at Boston University during 2009–2010. Currently, he is a Professor at the Institute of Automation, CMA, NUDT. His research interests include computer vision, mobile robots, image processing and machine learning.

Hangen He received the B.Sc. degree in Nuclear Physics from Harbin Engineering Institute, Harbin, China, in 1968. He was a visiting Professor at the University of the German Federal Armed Forces in 1996 and 1999, respectively. He is currently a professor in the College of Mechatronics and Automation (CMA), National University of Defense Technology (NUDT), Changsha, Hunan, China. His research interests include artificial intelligence, computer vision, robotics and learning control. He has served as a member of the editorial boards of several journals and has co-chaired many professional conferences. He is a joint recipient of more than a dozen academic awards in China.
