Multi-Object Detection and Behavior Recognition from Motion 3D Data

Kyungnam Kim, Michael Cao, Shankar Rao, Jiejun Xu, Swarup Medasani and Yuri Owechko
{kkim; mcao; srrao; jxu; smedasani; yowechko}@hrl.com
HRL Laboratories, LLC, Malibu, CA 90265, USA

Abstract

This paper presents methods for 3D object detection and multi-object (or multi-agent) behavior recognition using a sequence of 3D point clouds of a scene taken over time. This motion 3D data can be collected using different sensors and techniques such as flash LIDAR (Light Detection And Ranging), stereo cameras, time-of-flight cameras, or spatial phase imaging sensors. Our goal is to segment objects from the 3D point cloud data in order to construct tracks of multiple objects (i.e., persons and vehicles) and then classify the multi-object tracks as one of a set of known behaviors, such as "A person drives a car and gets out". A track is a sequence of object locations changing over time and is the compact object-level information we obtain from the motion 3D data. Leveraging the rich structure of dynamic 3D data makes many visual learning problems better posed and more tractable. Our behavior recognition method combines Dynamic Time Warping-based behavior distances computed from the multiple object-level tracks in a normalized car-centric coordinate system to recognize the interactive behavior of those objects. We apply our techniques to behavior recognition on data collected using a LIDAR sensor, with promising results.

1. Introduction

Many surveillance and active safety applications require sensors for detecting objects and recognizing behaviors (threats or unsafe actions) in various environments. In most cases, the sensors generate 2D motion imagery in the visible and IR bands. Even though 3D sensors are becoming more practical, algorithms for handling dynamic 3D data are still at an emergent stage. An important need in surveillance and intelligence applications is the processing of wide-area imaging data from unmanned vehicles for target tracking and suspicious behavior detection. Exploitation of this data is difficult due to the labor-intensive process of detecting threatening activities. The automotive industry has an important need for detecting obstacles in motion imagery for control of autonomous vehicles and for active safety applications. Soon, a new generation of sensors will be able to generate geo-registered 3D representations of a scene (along with geo-coordinates if equipped with a GPS) for air and ground vehicles at real-time refresh rates. Fusion of onboard motion 3D data with map data and off-board sensors will provide real-time 3D scene analysis and situational awareness for air and ground platforms. By avoiding the loss of 3D structure information inherent in 2D imaging, these sensors will make many problems in visual detection, tracking, and recognition much more tractable and well posed.

1.1. Previous work

Most vision-based behavior recognition methods in surveillance, gaming, or safety systems use 2D imaging sensors that lack 3D depth information¹. Current object detection and behavior recognition software does not approach human-level performance. For surveillance and safety applications, the difficulty in detecting threats and recognizing safety-related events in motion imagery is rooted in the loss of information that occurs when the 3D world is projected into a 2D image. Our proposed method makes use of full 3D information for object detection and behavior recognition. Recently, there has been some investigation into object detection and recognition using 3D LIDAR data, mostly for stationary objects like buildings and trees. Examples include object recognition in 3D LIDAR data with a recurrent neural network [1], a coarse-to-fine scheme for object indexing/verification with rotationally invariant semi-local spin-image features for 3D object representation [2], a probabilistic representation of LIDAR range data for efficient 3D object detection [3], 3D object recognition from range images using local feature histograms [4], and DARPA URGENT object recognition based on strip-cueing [6]. However, to the best of our knowledge, little work has been done on detection and recognition of moving objects (i.e., persons and vehicles) in motion 3D data.

In [5], Lai and Fox showed how to significantly reduce the need for manually labeled training data by leveraging objects from Google's 3D Warehouse to train an object detection system for 3D point clouds collected by robots navigating urban and indoor environments. In [7], Huang et al. presented a performance evaluation of shape similarity metrics for 3D video sequences of people with unknown temporal correspondence. Popular techniques such as Hidden Markov Models (HMM), Support Vector Machines (SVM), and Dynamic Time Warping (DTW) [8] have been applied to behavior recognition from spatial track data in 2D, but few have been leveraged for motion 3D data (aside from 3D data obtained from stereo imaging). Recently, Albanese et al. [9] proposed the Probabilistic Activity Description Language (PADL) to specify activities of interest and developed a probabilistic framework that estimates the probability that a given sub-video contains a given activity. An expandable graphical model based on salient postures was presented in [10], using an action graph that encodes new actions without compromising existing ones. A framework for modeling motion by exploiting the temporal structure of decomposable motion segments for human activity classification was proposed in [11], and a discriminative part-based approach based on the hidden conditional random field for human action recognition from video sequences using motion features was presented in [12]. Our method adapts DTW to handle behaviors between multiple agents and combines the behavior distances from the tracks in a normalized 3D coordinate system, which is more robust and accurate than working in 2D.

¹ Recently, the Microsoft Kinect has been used to capture human actions by interpreting 3D scene information from a continuously projected infrared structured-light pattern, for indoor or close-range applications.

Figure 1: Motion 3D data is processed to detect moving objects, which are then classified into Car and Human categories. (A) Baseline (background) point cloud; (B) 2D projection of the baseline; (C) input point cloud; (D) 2D projection of the input; (E) car (red) and person (blue) detections; (F) zoomed-in view of detected objects.


1.2. Applications

Our proposed method for multi-object detection and behavior recognition in motion 3D data can be applied to any surveillance or safety system in which object tracks or trajectories are extracted from motion 3D data and classified into behaviors while maintaining robustness to temporal variations. We describe our algorithms below in the context of the car-person behavior application using LIDAR data, but the method is generic enough to be applied to other application domains as well as to data captured by other types of 3D sensors. For example, it can be used in surveillance scenarios to detect suspicious or abnormal behaviors such as shoplifting, loitering, fast running, and meeting, reducing the workload of human security personnel. It can also be applied to automatically monitor and track workers in a factory and provide safety warnings when dangerous activities are undertaken.

1.3. Overview

The key features of our proposed object detection and behavior recognition method are the following:


Figure 2: Block diagram of the motion 3D object detection and behavior recognition algorithms.

1) Segmentation and recognition of multiple objects from motion 3D data: Moving objects are detected by subtracting the baseline (background) point cloud from an input point cloud in the 2D projection space. Our object classifier labels the detected 3D blobs as one of three classes: "person", "car", or "other".

2) Normalization of multi-agent tracks to assess relative interactions: Depending on the 3D sensor's location, the absolute 3D coordinates of the object tracks can differ even for the same behavior. In our car-person behavior recognition problem, in order to avoid this ambiguity and to make the track data for the same behavior unique, all tracks are normalized to a car-centric coordinate system, in which the center of the stopped car is the coordinate origin and the car's forward direction is aligned with the y-axis.

3) Combination of behavior distances for complex behavior recognition: The behavior distance (or score) for each of the atomic objects (agents) is computed separately and then combined into a final distance for multi-agent behavior recognition.

In the following sections, we describe our techniques for 3D object detection using motion 3D data and for multi-agent behavior recognition. For the motion 3D data used in our experiments, we used a Riegl scanning LIDAR (LMS-Z390i) to capture individual 3D point clouds and then concatenated the point clouds taken consecutively over time into a motion 3D sequence. A flash LIDAR, in contrast, would generate motion 3D data directly. LIDAR (Light Detection And Ranging) is an optical remote sensing technology that measures properties of scattered light to find the range and/or other information about a distant target.

Figure 3: Voxel grid. The 3D points are partitioned into cells by horizontal layers and grid lines; each voxel cell stores the count of 3D points that fall inside it.
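To make the voxel structure in Figure 3 concrete, the following sketch bins a point cloud into a fixed-resolution voxel grid and stores per-cell point counts. It is a minimal Python/NumPy illustration; the cell size, grid extent, and synthetic data are assumptions for demonstration and are not the parameters used in our system.

```python
import numpy as np

def voxelize(points, origin, cell_size, dims):
    """Count 3D points per voxel cell.

    points    : (N, 3) array of x, y, z coordinates
    origin    : (3,) minimum corner of the voxel grid
    cell_size : edge length of a cubic cell (same units as points)
    dims      : (nx, ny, nz) number of cells along each axis
    Returns an integer array of shape dims holding per-cell point counts.
    """
    counts = np.zeros(dims, dtype=np.int32)
    # Map each point to its cell index.
    idx = np.floor((points - origin) / cell_size).astype(int)
    # Keep only points that fall inside the grid.
    inside = np.all((idx >= 0) & (idx < np.array(dims)), axis=1)
    idx = idx[inside]
    # Accumulate counts (duplicate indices add up).
    np.add.at(counts, (idx[:, 0], idx[:, 1], idx[:, 2]), 1)
    return counts

# Example: synthetic points in a 20 m x 20 m x 5 m volume, 0.25 m cells.
pts = np.random.rand(10000, 3) * np.array([20.0, 20.0, 5.0])
vox = voxelize(pts, origin=np.zeros(3), cell_size=0.25, dims=(80, 80, 20))
print(vox.sum(), "points binned into a grid of shape", vox.shape)
```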

2. Motion 3D Object Detection Algorithm


Collected LIDAR point clouds from the Riegl scanner first undergo a data preparation process that involves format conversion and point cloud editing. Once the data is properly prepared, it is fed to our motion 3D detection algorithm. Processing starts by computing a voxel grid for the baseline (background) point cloud and one for the input point cloud. Each voxel cell contains the count of points that fall within it (see Figure 3). A ground plane map is computed for the baseline point cloud to deal with elevation variations. This is followed by a difference map computation in which the two voxel grids are compared to detect cars and persons. To obtain the difference maps, the 3D voxel grids are projected onto 2D projection maps as shown in Figure 1 (A to D). All voxel cells are projected along the z-axis so that each cell of the 2D map holds the number of 3D points in the voxel cells that project onto it. Differencing the 2D projection maps yields one difference map for cars and one for persons. The car difference map looks for differences between the baseline and input voxels from the ground up to the average car height.



Figure 4: Five different behaviors in our LIDAR motion 3D data and our object detection results. Each image is a representative snapshot of the corresponding behavior sequence.

The person difference map similarly looks for differences in the vertical density of points from the ground up to the average height of a person. These difference maps are then passed to a blob detector, which maps detections back into 3D space. The returned blobs are filtered and then written out as detections, as shown in Figure 1 (E and F). For better 3D object recognition, we can use the blob detections to extract point cloud objects and compute feature vectors that are then fed to a classifier such as [6]. This classifier-based approach allows us to detect a greater variety of cars, persons, and situations more robustly.
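The sketch below illustrates the core of this step under simplified assumptions: project the baseline and input voxel grids onto the x-y plane over a height band, difference the projections, threshold, and extract connected-component blobs. The height bands, thresholds, minimum blob size, and synthetic grids are illustrative assumptions, not the values used in our implementation.

```python
import numpy as np
from scipy import ndimage

def height_band_projection(voxels, z_lo, z_hi):
    """Sum voxel counts over a height band (z-index range) onto the x-y plane."""
    return voxels[:, :, z_lo:z_hi].sum(axis=2)

def detect_blobs(baseline_vox, input_vox, z_lo, z_hi, diff_thresh=5, min_cells=4):
    """Return 2D blobs where the input has more points than the baseline.

    baseline_vox, input_vox : (nx, ny, nz) voxel count grids
    z_lo, z_hi              : z-index band, e.g. ground level up to car or person height
    diff_thresh             : minimum per-cell point-count difference (assumed value)
    min_cells               : minimum blob footprint in grid cells (assumed value)
    """
    diff = height_band_projection(input_vox, z_lo, z_hi) \
         - height_band_projection(baseline_vox, z_lo, z_hi)
    mask = diff > diff_thresh
    labels, n = ndimage.label(mask)           # 4-connected components by default
    blobs = []
    for lbl in range(1, n + 1):
        cells = np.argwhere(labels == lbl)
        if len(cells) >= min_cells:            # discard tiny, noisy blobs
            blobs.append({"cells": cells, "centroid": cells.mean(axis=0)})
    return blobs

# Synthetic example: an empty baseline and an input with one added "object".
baseline_vox = np.zeros((40, 40, 10), dtype=np.int32)
input_vox = baseline_vox.copy()
input_vox[10:14, 20:22, 0:6] += 20             # a car-sized cluster of new points
print(detect_blobs(baseline_vox, input_vox, z_lo=0, z_hi=7))
```

Per-frame blob centroids can then be associated across frames (for example, by nearest-neighbor matching) to form the object tracks used in the next section; the specific tracker is not prescribed by this sketch.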


3. Motion 3D Behavior Recognition Algorithm

Our behavior recognition algorithm involves multiple objects (or agents) and their interactions. We first generate object tracks from the blobs detected over time in the detection stage. A track is a sequence of object locations changing over time and is the compact object-level information we obtain from the motion 3D data. The proposed multi-agent behavior recognition algorithm is illustrated in Figure 2. The metric used by our algorithm can be summarized as follows:

• Behavior distance metric: d = d_person + d_car.
• d_person (person behavior norm distance, 50% weight): DTW (dynamic time warping) distance of the person tracks in the car-centric normalized coordinate system shown in Figure 6.
• d_car (car behavior norm distance, 50% weight): DTW distance of the car tracks. An alternative measure is to check whether the car motions differ, using the difference between the motion variances; we found that both perform similarly on our dataset.

One can use a behavior score or a behavior distance for recognition/matching, depending on the metric used. In our implementation, we chose 'distance', which quantifies how much one behavior differs from another. The person behavior distance and the car behavior distance are combined with a weight of 50% each to obtain the final behavior distance. Depending on the 3D sensor's location, the absolute 3D coordinates of the object tracks can differ even for the same behavior. In our car-person behavior recognition problem, all tracks are therefore normalized to the car-centric coordinate system, in which the center of the stopped car is the coordinate origin and the car's forward direction is aligned with the y-axis. This is done to avoid ambiguity and to make the track data unique for the same behavior (see the normalized tracks in Figure 6 for the behaviors in Figure 4).
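As an illustration of this normalization, the sketch below translates each track so that the stopped car's center becomes the origin and rotates it so that the car's forward direction lies along the +y axis. How the car's heading is estimated (for example, from its orientation or its motion before stopping) and the purely 2D treatment of tracks are assumptions made for the example.

```python
import numpy as np

def car_centric_normalize(track_xy, car_center, car_heading):
    """Transform a 2D track into the car-centric coordinate system.

    track_xy    : (T, 2) array of x, y positions over time
    car_center  : (2,) center of the stopped car (becomes the origin)
    car_heading : unit 2D vector of the car's forward direction
                  (assumed known, e.g. from the car's orientation or motion)
    The car's forward direction is rotated onto the +y axis.
    """
    # Rotation angle that takes the heading onto +y.
    theta = np.arctan2(car_heading[0], car_heading[1])
    c, s = np.cos(theta), np.sin(theta)
    R = np.array([[c, -s],
                  [s,  c]])
    return (track_xy - car_center) @ R.T

# Example: a person walking toward a car centered at (5, 3) that faces +x.
person_track = np.array([[9.0, 3.0], [8.0, 3.0], [7.0, 3.0], [6.0, 3.0]])
norm_track = car_centric_normalize(person_track,
                                   car_center=np.array([5.0, 3.0]),
                                   car_heading=np.array([1.0, 0.0]))
print(norm_track)  # the person approaches the origin from in front of the car (+y)
```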

Figure 6: Normalized tracks in the car-centric coordinate system for behavior analysis (bird's-eye view). Green squares are cars and red circles are persons; note that this color scheme differs from the one in Figure 4. Some events involve a moving car, while others have only a stationary car.

Figure 5: Confusion matrix of behavior matching. Lower distance means better matching. Pdet = 90% and Pfa = 2.5%.

Given a pair of tracks a = {a1, ..., aI} and b = {b1, ..., bJ}, DTW finds a mapping between features in a and b such that the average distance d_dtw(ai, bj) between corresponding features ai and bj is minimized. Figure 7 shows an example of warping between two tracks. Each track is a sequence of pixel coordinate pairs, so in our context we choose d_dtw(ai, bj) to be the Euclidean distance between ai and bj.² The optimal mapping is constrained so that the endpoints match (i.e., a1 corresponds to b1 and aI corresponds to bJ) and no reversals of time are allowed.

Figure 7: An example of a time warping between tracks a and b. Corresponding features between tracks are indicated by red lines.

² For other applications, such as factory monitoring, portions of a track may already be annotated with semantic labels that can be used to modify and improve the distance metric between features.
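A minimal dynamic-programming version of this track-to-track DTW distance, with Euclidean feature distance, matched endpoints, and no time reversals, is sketched below, followed by the combination of the person and car distances into the final behavior distance d = d_person + d_car. This is an illustrative implementation, not the one used in our system; in particular, normalizing the accumulated cost by I + J is one common choice and is an assumption here.

```python
import numpy as np

def dtw_track_distance(a, b):
    """DTW distance between two tracks a (I, 2) and b (J, 2).

    Uses the Euclidean distance between track points, forces a1<->b1 and
    aI<->bJ to correspond, and allows no time reversals. The accumulated
    cost is divided by I + J (an upper bound on the warping path length)
    to normalize for track length.
    """
    I, J = len(a), len(b)
    # Pairwise Euclidean distances between all track points.
    cost = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=2)
    # D[i, j] = minimal accumulated cost of aligning a[:i+1] with b[:j+1].
    D = np.full((I, J), np.inf)
    D[0, 0] = cost[0, 0]
    for i in range(I):
        for j in range(J):
            if i == 0 and j == 0:
                continue
            prev = min(D[i - 1, j] if i > 0 else np.inf,                 # advance in a
                       D[i, j - 1] if j > 0 else np.inf,                 # advance in b
                       D[i - 1, j - 1] if i > 0 and j > 0 else np.inf)   # advance in both
            D[i, j] = cost[i, j] + prev
    return D[I - 1, J - 1] / (I + J)

# Combining per-agent distances into the multi-agent behavior distance.
# person_a, car_a and person_b, car_b are car-centric normalized tracks of two behaviors.
person_a = np.array([[0.0, 4.0], [0.0, 2.0], [0.0, 0.5]])
person_b = np.array([[0.5, 4.5], [0.2, 3.0], [0.1, 1.0], [0.0, 0.4]])
car_a = np.zeros((3, 2))   # stationary car at the origin
car_b = np.zeros((4, 2))
d_person = dtw_track_distance(person_a, person_b)
d_car = dtw_track_distance(car_a, car_b)
d = d_person + d_car       # d = d_person + d_car, each agent weighted equally
print(d_person, d_car, d)
```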



Our algorithm was evaluated using the five behaviors shown in Figure 4 and Figure 6 (two instances each, ten sequences in total). Figure 5 shows the confusion matrix of the pairwise behavior distances (d = d_person + d_car), with d_car and d_person defined as above. In the matrix, lower distances mean better matching. Our multi-agent behavior recognition method applied to the motion 3D data successfully recognized the five car-related behaviors with a probability of detection (Pdet) of 90% and a probability of false alarm (Pfa) of 2.5%. One of the 'GetOn' behaviors was matched to one of the 'GetOn-Drive' behaviors, as marked with the red ellipse in the confusion matrix, because those two behaviors are very similar.

4. Conclusion and Discussion

The proposed multi-object behavior recognition method, tested on motion 3D data captured by a LIDAR sensor, successfully detected objects (persons and cars) and recognized five different car-person interaction behaviors. Object detection and track generation were implemented in C/C++ on Windows, and we used OpenSceneGraph to display labeled 3D point clouds. The behavior recognition algorithm was developed in MATLAB and is being converted to C/C++ for real-time demonstration. With some adaptations, our 3D object detection and recognition algorithm can be applied to any motion 3D data (stereo, flash LIDAR, spatial phase imaging, time-of-flight, etc.). Likewise, the multi-object behavior recognition algorithm can be applied to any multi-object track data obtained from 2D or 3D behavior data in surveillance, safety, and autonomous vehicle applications. It is worth noting that, for our car-person behaviors, we used a car-centric coordinate system to provide a canonical view of the tracks; normalization of multi-agent tracks is needed to assess the relative interactions between multiple objects. We used a combination of the per-agent behavior distances (scores) to recognize multi-agent behaviors. In future work, we plan to use a shape-based object detector with a 3D classifier and to evaluate the behavior recognition algorithm extensively on more data. While the current demonstration pipeline does not run in real time, we plan to optimize it and transfer it to a mobile vehicle for a live demonstration.

References

[1] D. V. Prokhorov, "Object Recognition in 3D Lidar Data with Recurrent Neural Network", IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 9-15 (2009).
[2] B. C. Matei, Y. Tan, H. S. Sawhney, and R. Kumar, "Rapid and Scalable 3D Object Recognition Using LIDAR Data", Automatic Target Recognition XVI (F. A. Sadjadi, ed.), Proceedings of the SPIE, Vol. 6234, p. 623401 (2006).
[3] T. Yapo, C. V. Stewart, and R. J. Radke, "A Probabilistic Representation of LiDAR Range Data for Efficient 3D Object Detection", Proceedings of the S3D (Search in 3D) Workshop, in conjunction with IEEE CVPR (2008).
[4] G. Hetzel, B. Leibe, P. Levi, and B. Schiele, "3D Object Recognition from Range Images using Local Feature Histograms", IEEE Conference on Computer Vision and Pattern Recognition (2001).
[5] K. Lai and D. Fox, "Object Recognition in 3D Point Clouds Using Web Data and Domain Adaptation", International Journal of Robotics Research, Vol. 29, No. 8 (2010).
[6] Y. Owechko, S. Medasani, and T. Korah, "Automatic Recognition of Diverse 3-D Objects and Analysis of Large Urban Scenes Using Ground and Aerial LIDAR Sensors", Conference on Lasers and Electro-Optics and the Quantum Electronics and Laser Science Conference, San Jose, CA (2010).
[7] P. Huang, A. Hilton, and J. Starck, "Shape Similarity for 3D Video Sequences of People", International Journal of Computer Vision (2010).
[8] H. Sakoe and S. Chiba, "Dynamic Programming Algorithm Optimization for Spoken Word Recognition", IEEE Transactions on Acoustics, Speech and Signal Processing, Vol. 26, No. 1, pp. 43-49 (1978).
[9] M. Albanese, R. Chellappa, N. Cuntoor, V. Moscato, A. Picariello, V. S. Subrahmanian, and O. Udrea, "PADS: A Probabilistic Activity Detection Framework for Video Data", IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 32, pp. 2246-2261 (2010).
[10] W. Li, Z. Zhang, and Z. Liu, "Expandable Data-Driven Graphical Modeling of Human Actions Based on Salient Postures", IEEE Transactions on Circuits and Systems for Video Technology, Vol. 18, No. 11, pp. 1499-1510 (2008).
[11] J. C. Niebles, C. Chen, and L. Fei-Fei, "Modeling Temporal Structure of Decomposable Motion Segments for Activity Classification", ECCV (2010).
[12] Y. Wang and G. Mori, "Hidden Part Models for Human Action Recognition: Probabilistic vs. Max-Margin", IEEE Transactions on Pattern Analysis and Machine Intelligence (2010).
