A Novel 2D Marker Design and Application for Object Tracking and Event Detection

Xu Liu¹, David Doermann¹, Huiping Li¹, K. C. Lee², Hasan Ozdemir², Lipin Liu²

¹ Applied Media Analysis Inc., College Park, MD
[email protected]
² Panasonic R&D Company of America, Princeton, NJ
[email protected]
Abstract. In this paper we present a novel application that uses 2D barcodes for object tracking and event detection. We analyze the relationship between the spatial efficiency of a marker and its robustness against defocusing. Based on our analysis we design a spatially efficient and robust 2D barcode, M-Code (MarkerCode), which can be attached to the surface of static or moving objects. Compared with traditional object detection and tracking methods, M-Code not only identifies and tracks an object but also reflects its position and the orientation of the surface to which it is attached. We implemented an efficient algorithm that tracks multiple M-Codes in real time in the presence of rotation and perspective distortion. In our experiments we compare the spatial efficiency of M-Code with existing 2D barcodes and quantitatively measure its robustness, including its scaling capability and tolerance to perspective distortion. As an application we use the system to detect door movements and track multiple moving objects in real scenes.
1 Introduction
In this paper we present a novel barcode-based object detection and tracking method. The method detects and tracks 2D barcodes attached to the surface of an object through real-time decoding. 2D barcodes have been widely used in context-aware and mobile computing environments. Traditional 2D barcodes (Fig. 1a) carry up to a few hundred bytes of information. Some research [1, 2] suggests that 2D barcodes, which often store an index to online content, can also be used to identify the position and orientation of the capturing device to facilitate mobile interaction (Fig. 1b). Inspired by these novel applications and the use of visual markers [3] in video surveillance, we found that 2D barcodes can be used to identify the location of the object to which they are attached (Fig. 1c). This capability fits the requirements of tracking multiple objects and detecting concurrent events. Other visual trackers have been explored for unconstrained environments in video surveillance, but they require extensive processing for object detection and classification. Markers attached to moving objects can be detected and tracked automatically by computers at significantly reduced labor cost. Existing 2D barcodes, such as QR Code, MaxiCode or DataMatrix, are typically not designed for the purpose of tracking and are often decoded at a close distance. “Visual Code” and “Spot Code” (Fig. 1b) are designed for interacting with mobile devices,
which implicitly require cooperative users to spot the code. “Array Tags” [4] were designed to be located and decoded at a long distance, but require a considerable area to print and attach. The spatial efficiency of a marker is an important factor since an object may have only a limited area. Codes with higher coding efficiency are often more robust since they contain a smaller number of units (a unit typically refers to a black/white square); correspondingly, a larger unit is more resistant to distortions caused by blurring, noise and/or perspective distortion. It is worth mentioning that color has also been used to improve the spatial efficiency of markers [5], but color may be affected by environmental lighting and distance and is not a stable feature to recognize. Binary (black and white) markers are more convenient to print and more reliable to read, so we focus on binary markers in this paper. We describe the code design in Section 2, present the locating and decoding algorithm, including a fast perspective rectification method, in Section 3, evaluate the spatial efficiency of MCode and the speed and robustness of decoding in Section 4, and draw conclusions in Section 5.

Fig. 1. Existing 2D barcodes & MCode
2 Code Design
When an object is being tracked, the image of the attached M-Code may be defocused if the object is not within the best focal range of the camera. Defocusing is usually modeled by Gaussian convolution [6]. We first analyze the relationship between the unit width of the marker and the size of the Gaussian kernel to show the importance of spatial efficiency of the marker. For a black and white marker, Gaussian convolution may increase the gray scale value of a black cell and decrease the gray scale value of a white cell in the captured image. When a black cell appears to be brighter than a white cell, the captured M-Code image can no longer be read correctly. This situation is most
likely to happen when a black cell is surrounded by white cells, or vice versa, as shown in Fig. 2.

Fig. 2. Defocus

The gray scale value at the center of a black unit cell with width w after Gaussian convolution with kernel size σ may be as high as H:

\[
H = \iint_{|x|>w/2 \ \text{or}\ |y|>w/2} \frac{1}{2\pi\sigma^2} e^{-\frac{x^2+y^2}{2\sigma^2}} \, dx \, dy
  = 1 - \iint_{|x|\le w/2,\ |y|\le w/2} \frac{1}{2\pi\sigma^2} e^{-\frac{x^2+y^2}{2\sigma^2}} \, dx \, dy
  = 1 - \mathrm{Erf}^2\!\left(\frac{w}{2\sqrt{2}\,\sigma}\right) \tag{1}
\]

For the marker to remain readable, the blurred center of a black cell must stay darker than that of a white cell, whose value by symmetry is 1 − H; this requires H < 1/2:

\[
H < \frac{1}{2} \iff \mathrm{Erf}\!\left(\frac{w}{2\sqrt{2}\,\sigma}\right) > \frac{\sqrt{2}}{2} \implies w > 2\sqrt{2}\,\mathrm{Erf}^{-1}\!\left(\frac{\sqrt{2}}{2}\right)\sigma \tag{2}
\]

and therefore the unit width w must be greater than \(2\sqrt{2}\,\mathrm{Erf}^{-1}(\frac{\sqrt{2}}{2})\,\sigma\) to keep the marker readable under Gaussian convolution with kernel size σ. So we should utilize every cell in a limited print area and keep the unit width w as large as possible.

Taking this requirement into consideration, we designed a new marker that reflects its position (four corners) with high spatial efficiency. As shown in Fig. 1a, traditional 2D barcodes usually use part of the image as a finding pattern. The finding pattern, however, decreases the coding efficiency. Therefore, we do not use an explicit finding pattern. As shown in Fig. 3, MCode consists of 8 × 8 black and white cells bounded by a black box with half unit width. The black box plays an important role in facilitating decoding: it is a robust feature that can still be detected even if the code image is blurred or de-focused. More importantly, the location of the black box determines the location of every cell in the marker. MCode encodes 28 bits of information with a 4-bit checksum and 32 bits of Reed-Solomon [7] error correction data, which can correct 2 bytes (16 bits) of error at any location in the MCode.
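Equation (2) gives a concrete lower bound on the unit width. As a quick numerical check, the following minimal sketch (our own illustration, assuming SciPy is available; the function name min_unit_width is ours) evaluates the bound for a few blur levels:

```python
# Minimal sketch: evaluate the lower bound of Eq. (2) on the unit
# width w for a given Gaussian blur kernel size sigma (in pixels).
from scipy.special import erfinv

def min_unit_width(sigma: float) -> float:
    """Smallest readable unit width under Gaussian blur, per Eq. (2)."""
    return 2 * (2 ** 0.5) * erfinv((2 ** 0.5) / 2) * sigma

if __name__ == "__main__":
    for sigma in (1.0, 2.0, 4.0):
        print(f"sigma = {sigma:.1f} px  ->  w > {min_unit_width(sigma):.2f} px")
```

The constant \(2\sqrt{2}\,\mathrm{Erf}^{-1}(\sqrt{2}/2)\) evaluates to roughly 1.94, i.e. each cell should be about twice as wide as the blur kernel size to survive defocusing.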
3 Locating and Decoding
The detection and tracking process is actually performed by a single procedure: detecting and decoding the MCode in the captured frames. The decoding of MCode consists
of three steps: code location, corner registration and perspective correction. To track MCodes in real time, code location and perspective correction must be performed efficiently. To ensure successful decoding and find the exact position of an object, registration has to be highly accurate.

Fig. 3. MCode design

3.1 Code Location
Fig. 4. Code location: the original image and its super-pixel representation
Unlike mobile barcode reading, which relies on users to locate the barcode, we have to locate the MCode anywhere in the captured image. The location process must
be performed very efficiently to satisfy the real-time performance requirements. We use a multi-resolution approach to accelerate the location algorithm; multi-resolution techniques have been applied to image retrieval [8] and object extraction [9] to improve efficiency. We first downsample the original image by averaging every 5 × 5 pixel block into one super pixel (Fig. 4b). Meanwhile we compute the histogram of the grayscale values and find the threshold to binarize the image using 2-means classification. The binarization is adaptive to environmental lighting and can extract the bounding box of M-Code under global lighting changes. We then search the super pixel image for connected components using a breadth-first search (BFS). The complexity of BFS is linear in the number of pixels in the image, so running it on the super pixel image rather than on the original image accelerates the connected component search by a factor of 25. A Hough transform is then performed to locate the four edges of each connected component, and the approximate positions of the four corners are calculated as well.
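The sketch below illustrates the down-sampling, binarization and connected-component steps under stated assumptions: OpenCV and NumPy are used, Otsu's threshold stands in for the paper's 2-means classification (both are closely related histogram-based binarizations), and the area filter and function name are our own illustration:

```python
# Sketch of the code-location step (not the authors' implementation).
import cv2
import numpy as np

def locate_candidates(gray: np.ndarray):
    h, w = gray.shape
    # Average every 5x5 block into one super pixel (INTER_AREA averages).
    sp = cv2.resize(gray, (w // 5, h // 5), interpolation=cv2.INTER_AREA)
    # Histogram-based global threshold (Otsu), adaptive to lighting.
    _, binary = cv2.threshold(sp, 0, 255,
                              cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)
    # Connected components on the 25x-smaller image.
    n, _, stats, _ = cv2.connectedComponentsWithStats(binary)
    boxes = []
    for i in range(1, n):                      # label 0 is the background
        x, y, bw, bh, area = stats[i]
        if area > 16:                          # drop tiny noise components
            boxes.append((x * 5, y * 5, bw * 5, bh * 5))  # full-res coords
    return boxes
```

Each returned box would then be handed to the Hough-transform edge fit and the corner registration described next.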
3.2 Corner Registration
Fig. 5. Corner detection and the template used to exactly locate the corner
After locating the approximate position of an MCode, we compute the exact locations (Fig. 5a) of the four corners, which determine the full geometry (homography). We convolve a template in the neighborhood of the approximate corner position and find the maximum response. An example of the templates is shown in Fig. 5b.
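A minimal sketch of this refinement step, assuming OpenCV: normalized cross-correlation stands in for the paper's convolution, `template` is a small corner pattern like Fig. 5b, and the search radius is an illustrative choice:

```python
# Sketch of corner refinement by template matching (illustrative only).
import cv2
import numpy as np

def refine_corner(gray, approx_xy, template, search=8):
    x, y = approx_xy
    t = template.shape[0] // 2            # template half-size
    # Crop a search window around the approximate corner position.
    win = gray[y - search - t : y + search + t + 1,
               x - search - t : x + search + t + 1]
    # Find the position of maximum response in the window.
    resp = cv2.matchTemplate(win, template, cv2.TM_CCORR_NORMED)
    _, _, _, max_loc = cv2.minMaxLoc(resp)
    return (x - search + max_loc[0], y - search + max_loc[1])
```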
3.3 Perspective Correction
MCode may be captured from an arbitrary angle. To decode the data we first need to correct the perspective distortion, i.e. we need to calculate the mapping between an ideal, non-perspective image and a camera-captured image, which can be described as a
plane-to-plane homography. Unlike classic homography estimation methods [10], we use a fast perspective correction that reduces the computation to seven cross products and avoids floating point operations.
Fig. 6. Fast Perspective Correction

As shown in Fig. 6, we first perform an affine transformation and then a perspective transformation. Suppose the coordinates of the four corners in the image plane are (P1, P2, P3, P4), and the top and bottom boundaries of the bounding box intersect at vanishing point A. Under homogeneous coordinates A = L1 × L2 = (P1 × P4) × (P2 × P3); similarly B = L3 × L4 = (P1 × P2) × (P3 × P4). A and B are points at infinity in the original plane, so the third element of A and B under homogeneous coordinates should be 0 in the affine image. Any homography

\[
H = \begin{pmatrix} \vec{H}_1 \\ \vec{H}_2 \\ \vec{H}_3 \end{pmatrix}
\]

that maps the perspective image back into an affine image should map A and B to infinity, which implies

\[
H_3 \sim ((P_1 \times P_4) \times (P_2 \times P_3)) \times ((P_1 \times P_2) \times (P_3 \times P_4)) \tag{3}
\]

This suggests that we can calculate H3 using seven cross products. Any homography H with the third row H3 computed by (3) maps the perspective image (III in Fig. 6) to an affine image (II). The next task is to fill in the first and second rows of H. The reason to calculate this homography H is that, given any matrix coordinate, we can quickly find its pixel coordinate in the image. From the matrix coordinate (I) to the affine image (II), the transformation is linear and can be directly computed by transforming the basis
of the coordinate system. In the final step we transform the affine image (II) to the perspective image (III) by computing H⁻¹. Therefore, we choose the first and second rows of H so that it has a neat inverse:

\[
H^{\pm 1} \sim \begin{pmatrix} h_{33} & 0 & 0 \\ 0 & h_{33} & 0 \\ \pm h_{31} & \pm h_{32} & h_{33} \end{pmatrix} \tag{4}
\]

This “inverse” only requires flipping two signs in the third row of H, which accelerates the coordinate transformation with numerical stability: a numerical inverse is subject to division by zero when H is nearly singular, while our method is division-free. For any entry (i, j) in a w-by-h matrix we compute its affine coordinate \(\frac{i}{w}\overrightarrow{P_1'P_4'} + \frac{j}{h}\overrightarrow{P_1'P_2'}\) and use (4) to map this affine coordinate to the image coordinate. All of these computations can be performed with integer operations only and are therefore highly efficient.
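A minimal NumPy sketch of Eq. (3) and the sign-flip inverse of Eq. (4); the corner ordering follows Fig. 6 and the function names are ours:

```python
# Sketch: fast perspective correction via seven cross products.
import numpy as np

def third_row(p1, p2, p3, p4):
    """H3 ~ ((P1 x P4) x (P2 x P3)) x ((P1 x P2) x (P3 x P4)), Eq. (3)."""
    P = [np.array([x, y, 1.0]) for (x, y) in (p1, p2, p3, p4)]
    A = np.cross(np.cross(P[0], P[3]), np.cross(P[1], P[2]))  # 3 crosses
    B = np.cross(np.cross(P[0], P[1]), np.cross(P[2], P[3]))  # 3 crosses
    return np.cross(A, B)                                     # 7th cross

def homography_pair(p1, p2, p3, p4):
    """Build H and its sign-flip inverse per Eq. (4)."""
    h31, h32, h33 = third_row(p1, p2, p3, p4)
    H    = np.array([[h33, 0, 0], [0, h33, 0], [ h31,  h32, h33]])
    Hinv = np.array([[h33, 0, 0], [0, h33, 0], [-h31, -h32, h33]])
    return H, Hinv   # H @ Hinv is proportional to the identity
```

One can verify that H @ Hinv equals h33² times the identity matrix, so the two are inverses up to the scale factor that homogeneous coordinates ignore.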
4 Evaluation

4.1 Spatial Efficiency Evaluation
Fig. 7. Robustness under blur: MCode vs. QRCode
Table 1. Comparison of 2D barcodes

Code Type         Unit Cells      Bits   Cells/bit
MCode             9 × 9 = 81      28     2.89
DataMatrix (C80)  12 × 12 = 144   24     6.00
QRCode (M)        21 × 21 = 441   112    3.94
As previously addressed, spatial efficiency is directly related to the robustness of the marker: a code that requires more unit cells is more vulnerable to degradation.
As shown in Fig. 7, an MCode and a QRCode storing one integer undergo the same level of Gaussian blur. The cells in the QRCode can no longer be separated, while the MCode is still recognizable. We quantitatively compare MCode's spatial efficiency to QR Code and Data Matrix in Table 1. To encode the same amount of data MCode contains 9 × 9 = 81 unit cells, while a QR Code (version 1) contains at least 21 × 21 unit cells, where 44 bits are used for encoding the finding pattern, and a DataMatrix code contains at least 12 × 12 unit cells. From this we can see that MCode is the most spatially efficient.

4.2 Processing Speed
As addressed above, object detection and tracking are performed through a decoding process, so decoding speed is critical for real-time tracking. The MCode decoder is implemented on a Pentium 4 3.0 GHz PC with a Panasonic WV-NP224 surveillance camera and a WV-LZ62/8S lens. The camera can capture raw VGA images at 10 fps and runs at 9 fps while detecting M-Codes. The decoding time per frame is therefore 1/9 − 1/10 ≈ 0.011 seconds, which is equivalent to a frame rate of 90 fps if we ignore the time spent acquiring images.

4.3 Scale and View Angle Test
For MCode to be applicable in complex environments, one major concern is its scalability, i.e. at what distance an MCode can be recognized.
Fig. 8. Scale and View Angle Test
Our test results show that an MCode printed on A4 paper can be recognized at 30 meters (Fig. 8a) with VGA resolution, or printed at 0.5 inch × 0.5 inch and recognized at a distance of less than 2 inches (Fig. 8b). When an object is moving, the attached MCode may not always be captured from an upper-front view angle, and perspective distortion results. To quantitatively measure the robustness to perspective distortion, we use the ratio between the longest edge and the shortest edge (Fig. 3) as a criterion:
\[
K = \frac{\max(|P_1P_2|, |P_2P_3|, |P_3P_4|, |P_4P_1|)}{\min(|P_1P_2|, |P_2P_3|, |P_3P_4|, |P_4P_1|)}
\]
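As a small illustration (our own helper, not from the paper), K can be computed directly from the four registered corners:

```python
# Sketch: perspective-distortion ratio K from the four corners P1..P4.
import numpy as np

def distortion_ratio(p1, p2, p3, p4):
    pts = [np.asarray(p, dtype=float) for p in (p1, p2, p3, p4)]
    edges = [np.linalg.norm(pts[i] - pts[(i + 1) % 4]) for i in range(4)]
    return max(edges) / min(edges)   # K = 1 means no distortion
```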
Fig. 9. Histogram of K

K = 1 indicates no perspective distortion; the larger K is, the stronger the perspective distortion. Fig. 9 shows the histogram of K as we arbitrarily move an MCode in front of the camera. Our decoder can decode when K is as large as 3.45.
5 Application Scenario and Future Work
Our system has been applied to detecting door events and tracking multiple moving objects (Fig. 10).¹ The experimental results show that MCode and its decoder work robustly for these tasks. In future work, we will use M-Code as a tool for self-calibration. With the calibrated camera, we will fully reconstruct the 3D geometry and track the exact 3D coordinates of the moving object.
¹ Video demos of object tracking and event detection using MCode can be found at http://sites.google.com/site/mcodedemos/m-code-demos

Fig. 10. Application Scenario

References

1. Rekimoto, J., Ayatsuka, Y.: CyberCode: designing augmented reality environments with visual tags. In: Proceedings of DARE 2000 on Designing Augmented Reality Environments (2000) 1–10
2. Rohs, M., Gfeller, B.: Using camera-equipped mobile phones for interacting with real-world objects. In: Ferscha, A., Hoertner, H., Kotsis, G. (eds.): Advances in Pervasive Computing, Austrian Computer Society (OCG), Vienna, Austria (2004) 265–271
3. Schiff, J., Meingast, M., Mulligan, D., Sastry, S., Goldberg, K.: Respectful cameras: detecting visual markers in real-time to address privacy concerns. In: International Conference on Intelligent Robots and Systems (IROS) (2007)
4. Little, W., Baker, P.: Data with perimeter identification tag. US Patent 5,202,552 (1993)
5. Cheong, C., Kim, D., Han, T.: Usability evaluation of designed image code interface for mobile computing environment. Lecture Notes in Computer Science 4551 (2007) 241
6. Hwang, T.L., Clark, J., Yuille, A.: A depth recovery algorithm using defocus information. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (1989) 476–482
7. Wicker, S.B., Bhargava, V.K. (eds.): Reed-Solomon Codes and Their Applications. John Wiley & Sons, Inc., New York, NY, USA (1999)
8. Nikulin, V., Bebis, G.: Multiresolution image retrieval through fusion. Proceedings of SPIE 5307 (2004) 377–387
9. Porikli, F., Wang, Y.: An unsupervised multi-resolution object extraction algorithm using video-cube. In: International Conference on Image Processing, vol. 2 (2001)
10. Criminisi, A., Reid, I., Zisserman, A.: A plane measuring device. Image and Vision Computing 17 (1999) 625–634