Tracking Multiple People using Laser and Vision*

Jinshi Cui, Hongbin Zha

Huijing Zhao and Ryosuke Shibasaki

National Laboratory on Machine Perception Peking University Beijing, China {cjs, zha}@cis.pku.edu.cn

Center for Spatial Information Science University of Tokyo Tokyo, Japan {chou, shiba}@skl.iis.u-tokyo.ac.jp

Abstract - We present a novel system that aims at reliably detecting and tracking multiple people in an open area. Multiple single-row laser scanners and one video camera are utilized. Feet trajectory tracking, based on the registration of distance information from multiple laser scanners, and visual body region tracking, based on color histograms, are combined in a Bayesian formulation. Results from tests in a real environment demonstrate that the system can detect and track multiple people simultaneously with reliable, real-time performance.

Index Terms - Multiple people, tracking, laser scanner, video camera, sensor fusion.

I. INTRODUCTION

Reliable detection and tracking of multiple people in a wide, open area has many valuable applications, such as monitoring human activities for intelligent surveillance; measuring and analysing trajectories for planning and management in shopping malls, exhibitions, and railway stations; and studying human behaviours for the human and social sciences. In the computer vision community, automatic detection and tracking of people in dynamic scenes has become an important topic. A good survey of vision-based surveillance can be found in [1]. Several research efforts targeting relatively large crowds in wide, open areas are reviewed in [2-6]. These techniques are limited to simple situations where people appear in isolated regions or occluded areas are small. Typical methods rely on appearance models that must be acquired when a person enters the scene and is not yet occluded. Other methods rely heavily on visual detection of head candidates or face detection, which is time-consuming and not very reliable, especially in real environments where lighting conditions are uncontrolled and temporary occlusions can occur at any time. On the other hand, laser-based detection and tracking systems can provide reliable automatic detection of humans in varying scenes: they are insensitive to lighting conditions, and laser data processing is not time-consuming. In [7-10], a single laser scanner is used for tracking multiple people. In our previous research [9], multiple laser scanners were registered to perform a reliable and fast trajectory tracking process, and the system showed a strong ability to count and track pedestrians even in very high-density crowds. In one real experiment, three scanners were used to

cover an area of about 60×60 m² at the corner of an exhibition hall, and about 100 trajectories were extracted simultaneously at peak hour. However, the limitations of these laser-based systems are inherent and obvious. They cannot provide information such as the color of objects, so it is difficult to obtain features that uniquely distinguish one object from another; if a trajectory is broken due to occlusion, it is difficult to link the fragments together. In addition, when people walk in a group or walk across each other so that their feet are too close together, their data will be mixed and feet positions will be lost during extraction. Thus, visual computation is strongly required in these laser-based systems. At the same time, laser data can serve as an effective aid for existing visual tracking systems, providing cues for reliable detection and tracking. Recently, several systems have been proposed in the robotics area [11-13] that detect and track people from a mobile platform by combining distance information from laser range data with visual information from a camera. [11] first uses laser range data to detect moving objects; the position information is then used to perform face detection on the corresponding sub-image. In [12], a face-tracking module is used to verify that non-confirmed feet belong to people. [13] uses laser range data to verify that skin-colored pixels, isolated by a color-blob detection algorithm as possibly belonging to faces, do indeed belong to a face. All of the above systems can only detect and track one or a few people. In addition, their performance relies heavily on face detection, which may fail in a real surveillance and monitoring environment where most people do not face the camera. In this paper, we present a novel laser/camera system aimed at reliable, real-time monitoring and tracking of multiple people in an open area. Multiple laser scanners and one video camera are utilized. People detection is achieved mainly by feet trajectory detection from laser data. Visual color tracking, using the mean-shift method on color histograms, and laser tracking, using a typical pedestrian model, are processed independently; a Bayesian formulation then fuses the two tracking results at the decision level. Compared with previous research, our system can overcome several tracking errors that cannot be tackled with laser data alone, such as missing or confused data due to occlusion. Moreover, the system can track more than ten people simultaneously with real-time performance.

This work is partially supported by NSFC Grant #6033010 and NKBRPC #2004CB318000.

The paper is organized as follows: in the next section, we present an overview of the proposed system. In sections 3 and 4, the laser and visual subsystems are respectively described in detail. The data fusion scheme used for tracking is then introduced in section 5. Finally, we present results on real-environment data, which demonstrate the strong capability of the system.

II. SYSTEM ARCHITECTURE

The basic idea underlying the design of the proposed system is to fully exploit the powerful detection capability of laser scanners, and to extract and attach visual information to the detection results in order to obtain accurate and complete trajectories. Human detection is done mainly by laser-based feet detection. After detection, the laser-based and vision-based tracking subsystems operate asynchronously, and decision-level fusion integrates the two kinds of information. The system is depicted in Fig. 1.

Fig. 1. The architecture for people detection and tracking. (Camera input: body region calculation, then mean-shift matching against the inactive object dataset; if matching succeeds, the detection is linked to an existing object, otherwise a new object is created, followed by mean-shift tracking. Laser input: background model, feet detection, and feet tracking with a pedestrian model. Both branches feed prediction and Bayesian fusion, and result evaluation decides whether the output is accepted.)

It can be seen from the figure that, once a new feet trajectory is detected from laser data, the corresponding body region in the video image is localized using the average human height, the feet position, and the calibrated camera model. An appearance model is then calculated, followed by mean-shift matching against the appearance models of objects whose trajectories have been broken. If the matching succeeds, the new trajectory is a fragment of an existing object; otherwise, it belongs to a new object, i.e., a newly entered person. In the fusion process, the confidences of the mean-shift tracking result and the feet tracking result are evaluated respectively, and then the confidence of the fused result is evaluated. These confidences determine whether the final result should be accepted and which sensor should be trusted more.
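To make this dispatch logic concrete, the following Python sketch outlines the decision for one new feet trajectory. The data structures, helper parameters, and the way scores are supplied are hypothetical stand-ins of our own, not the authors' implementation; only the match-or-create logic follows the description above.

```python
from dataclasses import dataclass, field

MATCH_THRESHOLD = 0.7   # hypothetical value, mirroring the paper's acceptance threshold

@dataclass
class TrackedObject:
    histogram: list                                 # appearance model (color histogram)
    trajectory: list = field(default_factory=list)  # feet positions over time
    active: bool = True

def on_new_feet_trajectory(feet_pos, histogram, scores, inactive, active):
    """Dispatch a newly detected feet trajectory. `scores[i]` is the mean-shift
    matching score of `histogram` against the i-th inactive object."""
    best = max(range(len(inactive)), key=lambda i: scores[i], default=None)
    if best is not None and scores[best] > MATCH_THRESHOLD:
        obj = inactive.pop(best)        # a fragment of an existing, broken trajectory
        obj.trajectory.append(feet_pos)
        obj.active = True
        active.append(obj)
    else:
        active.append(TrackedObject(histogram, [feet_pos]))  # newly entered person
```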

III. LASER-BASED SUBSYSTEM

In this subsystem, LD-A laser range scanners, produced by IBEO Lasertechnik, are employed. In one scan (range frame), the LD-A profiles 1080 range distances equally spaced over 270 degrees on the scanning plane. The LD-A has a maximum range of 70 meters, an average distance error of 3 cm, and a scanning frequency of 30 Hz. The LD-As are set on the floor (about 16 cm above the ground surface), scanning horizontally. Data from moving objects (e.g., feet) as well as still objects (e.g., building walls, desks, chairs, and so on) are obtained in a rectangular coordinate system, and moving objects are extracted by background subtraction. The laser scans keep a degree of overlap with each other, and the relative transformations between the local coordinate systems of neighbouring laser scanners are calculated by pairwise matching of their background models [14].

At each iteration step, the server computer gathers the moving-feet data of the latest range frames from all laser clients and integrates them into a global coordinate system. Since many points may fall on the same foot, depending on the distance from the person to the scanner, the moving points of the integrated range frame are first clustered so that each cluster has a radius smaller than a normal foot (e.g., 15 cm); the cluster centers are treated as foot candidates. Trajectory tracking is conducted by first extending the trajectories extracted in previous frames using a Kalman filter with a well-defined walking model, and then looking for seeds of new trajectories among the foot candidates that are not associated with any existing trajectory. Seeds of new trajectories are extracted in two steps, sketched in the code below. Foot candidates not associated with any trajectory are first paired into step candidates if the Euclidean distance between them is less than a normal step size (e.g., 50 cm); a foot candidate may belong to several step candidates if there are multiple options. A seed trajectory is then extracted along a certain number (N > 3) of previous frames if it satisfies two conditions: first, step candidates in successive frames overlap at the position of at least one foot candidate; second, the motion vector determined by the other, non-overlapping pair of foot candidates changes smoothly along the frame sequence [9].
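A minimal sketch of the foot-candidate clustering and step-candidate pairing, assuming the moving points are 2-D coordinates in meters. The paper does not name a clustering algorithm, so single-linkage clustering with the foot-radius threshold is our illustrative stand-in for "clusters with a radius less than a normal foot":

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

FOOT_RADIUS = 0.15   # meters: points within a normal foot belong to one cluster
STEP_SIZE = 0.50     # meters: feet closer than a normal step form a step candidate

def foot_candidates(moving_points):
    """Cluster the moving laser points of the integrated range frame; the
    cluster centers are treated as foot candidates (moving_points: n x 2)."""
    Z = linkage(moving_points, method='single')
    labels = fcluster(Z, t=FOOT_RADIUS, criterion='distance')
    return np.array([moving_points[labels == k].mean(axis=0)
                     for k in np.unique(labels)])

def step_candidates(feet):
    """Pair unassociated foot candidates into step candidates; a foot may
    appear in several pairs when there are multiple options."""
    pairs = []
    for i in range(len(feet)):
        for j in range(i + 1, len(feet)):
            if np.linalg.norm(feet[i] - feet[j]) < STEP_SIZE:
                pairs.append((i, j))
    return pairs
```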

IV. VISION-BASED SUBSYSTEM

A. Body Region Localization

Once a person is detected, we first judge whether this is a new object by mean-shift color-histogram matching, and then track the person's location in subsequent image frames using the mean-shift tracking method. Before the mean-shift method can be applied, accurate localization of the initial model region (center, width, height, and rotation angle) for each individual is very important, and this requires calibrating the video camera with the laser scanners. The video camera is calibrated to the global coordinate system using Tsai's model, where both internal and external parameters are calculated from at least 11 control points. The global coordinate system is defined with its XY-axes coincident with those of the integrated coordinate system of the laser scanners, its Z-axis pointing vertically upward, and its origin on the ground surface. The elevation of the laser scanning plane is measured using the sensor chip developed in [17], so that a Z-coordinate is associated with each laser point. Control points are obtained by putting markers on the vertical edges of walls, desks, chairs, boxes, and so on, such that the markers are visible in the video image and the vertical edges are measured by the laser scanners. The Z-coordinate of each marker is its elevation from the ground surface, measured physically in advance; the XY-coordinates come from the laser-measured vertical edge. In addition, calibration blocks, such as the one shown in Fig. 2a, can be used to increase the number of control points. When a calibration block is placed directly on the ground, a horizontal transformation is needed to relate its local coordinate system to the global one. With at least two facades measured by the laser scans, each represented by a two-dimensional line on the horizontal plane of the scans, this horizontal transformation can be calculated, so that the coordinates of all grid points on the block can be converted to the global coordinate system.
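For concreteness, the following sketch estimates a linear 3×4 projection matrix from world/image control points by the direct linear transform. This is a simplification of our own for illustration: Tsai's model used in the paper additionally recovers lens distortion, which this linear form ignores.

```python
import numpy as np

def dlt_projection_matrix(world_pts, image_pts):
    """Estimate a 3x4 projection matrix from >= 6 world/image control points
    by the direct linear transform, solving A p = 0 via SVD. (Tsai's model,
    with >= 11 points, also recovers lens distortion; ignored here.)"""
    A = []
    for (X, Y, Z), (u, v) in zip(world_pts, image_pts):
        A.append([X, Y, Z, 1, 0, 0, 0, 0, -u * X, -u * Y, -u * Z, -u])
        A.append([0, 0, 0, 0, X, Y, Z, 1, -v * X, -v * Y, -v * Z, -v])
    _, _, Vt = np.linalg.svd(np.asarray(A, dtype=float))
    return Vt[-1].reshape(3, 4)    # right singular vector of the smallest value
```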

Fig. 2. a) A calibration block. b) Body regions. Colored ellipses are the body regions of the individuals; colored curves are tracked feet trajectories.

When a person is detected, the laser scanners give the XY-coordinates of the feet position in the world coordinate system, and the Z-coordinate is the previously measured elevation of the laser scanning plane. The image coordinates can then be calculated through the projective transformation without difficulty. An ellipse is used to model the body region in the image plane, described by a 5-dimensional vector $(x, y, w, h, \theta)$, where $x$ and $y$ are the location of the ellipse center, $h$ and $w$ are the lengths of the axes, and $\theta$ is the rotation angle. However, it is impossible to calculate the body ellipse from the feet position in the world coordinate system alone, so we introduce several constraints on the body region: it is assumed that a human has average height (e.g., 170 cm) and width (e.g., 50 cm) and always stands vertically on the ground. Under this assumption, the body ellipse can easily be calculated from the feet position, since the world coordinate of the head is then also known (with a fixed Z-coordinate). In fact, $(x, y)$ alone is enough to describe a body region, as the other parameters depend on it. Fig. 2b shows body regions calculated effectively from given feet positions. In mean-shift tracking, it is also necessary to inversely calculate the corresponding feet position in the world coordinate system from a body region, to compare with the laser-based detection result; with a fixed Z-coordinate, the feet position is easily obtained from the center of the body region (or the face/head).
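A sketch of this construction, assuming a 3×4 linear projection matrix P (e.g., from the calibration step above, ignoring lens distortion) and the stated average height and width. The scaling of the minor axis by the real width/height ratio is our simplification:

```python
import numpy as np

def body_ellipse_from_feet(feet_xy, P, z_scan=0.16, height=1.70, width=0.50):
    """Derive the body ellipse (x, y, w, h, theta) in the image from the feet
    position given by the laser scanners; the person is assumed to stand
    vertically with average height and width."""
    def project(Xw):
        u = P @ np.append(Xw, 1.0)        # homogeneous projection
        return u[:2] / u[2]
    feet = project(np.array([feet_xy[0], feet_xy[1], z_scan]))
    head = project(np.array([feet_xy[0], feet_xy[1], height]))
    cx, cy = (feet + head) / 2.0          # ellipse center
    axis = head - feet
    h = np.linalg.norm(axis)              # major axis: projected body height
    w = h * width / height                # minor axis: scaled by real aspect ratio
    theta = np.arctan2(axis[1], axis[0])  # image rotation of the body axis
    return cx, cy, w, h, theta
```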

B. Mean-shift Matching and Tracking

The mean-shift algorithm is a non-parametric statistical method that has recently been widely adopted as an efficient technique for visual appearance-based object region tracking [15]. It employs mean-shift iterations to find the region most similar to a given target region in terms of color distribution, with the similarity of two distributions expressed by a metric based on the Bhattacharyya coefficient. The main advantages of this method are its real-time performance and its robustness to partial occlusion and rotations in depth.

Let the histogram of the target be denoted by $c_k(i)$, $i = 1, \ldots, N$, where $N$ is the number of bins in the histogram and $\sum_{i=1}^{N} c_k(i) = 1$. We choose 4-bit quantization of the RGB values, which gives $N = 2^{4 \times 3} = 4096$ bins. The Bhattacharyya distance is used to measure the similarity between two histograms:

$$D[c_k, c_k^*(t)] = \left( 1 - \sum_{i=1}^{N} \sqrt{c_k(i)\, c_k^*(t; i)} \right)^{1/2},$$

and the matching score is

$$P_M = \exp\left( -\lambda D^2[c_k, c_k^*(t)] \right),$$

where $\lambda$ is a constant, set to 10 in our experiments [16].
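As a concrete illustration, the following sketch computes the 4-bit-quantized joint RGB histogram and the matching score defined above. The function names and array layout are illustrative choices of ours, not the authors' implementation:

```python
import numpy as np

def color_histogram(patch, bits=4):
    """Joint RGB histogram with `bits` bits per channel, normalized to sum to 1.
    patch: H x W x 3 uint8 array (pixels of a candidate body region).
    Returns a flat histogram with 2 ** (3 * bits) = 4096 bins for bits = 4."""
    shift = 8 - bits
    q = (patch >> shift).astype(np.int64)          # quantize each channel
    idx = (q[..., 0] << (2 * bits)) | (q[..., 1] << bits) | q[..., 2]
    hist = np.bincount(idx.ravel(), minlength=2 ** (3 * bits)).astype(float)
    return hist / hist.sum()

def matching_score(c_target, c_candidate, lam=10.0):
    """Bhattacharyya distance D between two histograms, score exp(-lam * D^2)."""
    bc = np.sum(np.sqrt(c_target * c_candidate))   # Bhattacharyya coefficient
    d2 = max(1.0 - bc, 0.0)                        # D^2, clipped for safety
    return np.exp(-lam * d2)
```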

For tracking, the initial candidate state of the tracker in the current image is obtained by prediction using a dynamic model and the previous position. For matching, the initial candidate is simply the state of the object as detected for the first time. Whether a matching result is accepted depends on the matching score calculated with the equation above.

The mean-shift tracking method has many advantages: it is robust to object rotation and deformation, partial occlusion, and cluttered environments, and it achieves real-time performance, which is why it is preferred by many researchers. However, it is very sensitive to the initial model of the object region, so the initial body region localization described in the previous subsection is very important. In addition, mean-shift tracking searches for the optimal solution only in the local neighbourhood of the candidate position, so appropriate prediction of the candidate state is also crucial. The dynamic model we use for prediction is described in the next section.
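A minimal sketch of one mean-shift iteration with a uniform kernel, following the sqrt(target/candidate) pixel weighting of [15]. The paper gives no implementation details, so the window handling and kernel choice here are our simplifications:

```python
import numpy as np

def mean_shift_step(frame, center, size, target_hist, bits=4):
    """One mean-shift iteration: weight each pixel of the candidate window by
    sqrt(target / candidate) for its color bin, then move the window to the
    weighted centroid. Assumes the window lies fully inside the frame;
    `frame` is H x W x 3 uint8, `target_hist` has 4096 bins."""
    cx, cy = center
    w, h = size
    ys, xs = np.mgrid[int(cy - h / 2):int(cy + h / 2),
                      int(cx - w / 2):int(cx + w / 2)]
    q = (frame[ys, xs] >> (8 - bits)).astype(np.int64)
    idx = (q[..., 0] << (2 * bits)) | (q[..., 1] << bits) | q[..., 2]
    cand = np.bincount(idx.ravel(), minlength=2 ** (3 * bits)).astype(float)
    cand /= cand.sum()
    weights = np.sqrt(target_hist[idx] / np.maximum(cand[idx], 1e-12))
    return (np.sum(weights * xs) / np.sum(weights),   # new window center
            np.sum(weights * ys) / np.sum(weights))
```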

V. BAYESIAN FORMULATION FOR TRACKING

We formulate the sequential tracking problem as computing the maximum a posteriori (MAP) estimate $\hat{X} = \arg\max_{X} P(X \mid Z)$, where $X = (X_0, \cdots, X_t)$ is the state sequence and $Z = (Z_0, \cdots, Z_t)$ is the observation sequence. $X_t^i = (p_t^i, v_t^i)$ represents the state of the $i$-th object at time $t$, where $p_t^i$ is the trajectory location and $v_t^i$ is the velocity; the 5-dimensional parameters of the body region (i.e., an ellipse) can be calculated from the feet position $p_t^i$. $Z_t$ is the observation at time $t$, composed of the image $Z_{V_t}$ and the laser data $Z_{L_t}$.

Following Bayes' rule, the posterior probability is decomposed into a likelihood term and a prior term:

$$P(X \mid Z) \propto P(Z \mid X)\, P(X),$$

where

$$P(X) = P(X_0) \prod_{t \ge 1} P(X_t \mid X_{t-1}, \cdots, X_0),$$

$$P(Z \mid X) = P(Z_0) \prod_{t \ge 1} P(Z_t \mid X, Z_{t-1}, \cdots, Z_0).$$

With the Markov assumption, we have

$$P(X_t \mid X_{t-1}, \cdots, X_0) = P(X_t \mid X_{t-1}), \qquad P(Z_t \mid X, Z_{t-1}, \cdots, Z_0) = P(Z_t \mid X_t).$$

We employ a constant-velocity model as the dynamic model, therefore

$$X_t^i = \Phi X_{t-1}^i + \omega_{t-1},$$

where $\Phi$ is the state transition matrix, and $\omega_{t-1}$ is normal process noise with zero mean and covariance matrix $\Sigma_{t-1}$, i.e., $N(0, \Sigma)$. $\Phi$ is defined as

$$\Phi = \begin{bmatrix} I_{2\times2} & I_{2\times2} \\ 0 & I_{2\times2} \end{bmatrix},$$

where $I_{2\times2}$ denotes the 2-by-2 identity matrix. Assuming that the two observations, from laser and image respectively, are independent given the state,

$$P(Z_t \mid X_t) = P(Z_{L_t} \mid X_t)\, P(Z_{V_t} \mid X_t),$$

where $P(Z_{L_t} \mid X_t)$ is the laser measurement likelihood and $P(Z_{V_t} \mid X_t)$ is the visual observation likelihood. We assume $P(Z_{L_t} \mid X_t)$ and $P(Z_{V_t} \mid X_t)$ have Gaussian distributions $N(X_{L_t}, \Sigma_{L_t})$ and $N(X_{V_t}, \Sigma_{V_t})$, where $X_{L_t}$ denotes the laser tracking result and $X_{V_t}$ the mean-shift visual tracking result at time $t$. The observation model is then

$$Z_t^i = H X_t^i + \nu_t,$$

where $H$ is the observation matrix and $\nu_t$ is normal measurement noise with zero mean and covariance matrix $\Sigma_{Z_t}$. $H$ and $\Sigma_{Z_t}$ are defined as

$$H = \begin{bmatrix} I_{2\times2} & 0 \\ I_{2\times2} & 0 \end{bmatrix}, \qquad \Sigma_{Z_t} = \begin{bmatrix} \Sigma_{L_t} & 0 \\ 0 & \Sigma_{V_t} \end{bmatrix}.$$

It is well known that the sequential tracking problem under the Gaussian assumption can be solved with Kalman filtering, and a standard Kalman filtering algorithm is employed in our system. For sensor fusion, it is important to define a reasonable noise model. A practical approach is to assume that the process noise and measurement noise are independent and identically distributed (i.i.d.), and we further define the noise model as

$$\Sigma_t = \delta_t I_{4\times4}, \qquad \Sigma_{Z_t} = \begin{bmatrix} \delta_{L_t} I_{2\times2} & 0 \\ 0 & \delta_{V_t} I_{2\times2} \end{bmatrix},$$

where $\delta_t$, $\delta_{L_t}$, and $\delta_{V_t}$ represent the noise intensities of the prediction model, the laser tracking result, and the mean-shift tracking result, respectively. The process noise and measurement noise reflect the confidence (or possibility) of the predicted and observed results. We connect noise intensity to confidence with

$$\delta = -\alpha \ln P, \qquad 0 \le P \le 1,$$

where $P$ is the confidence (or possibility) and $\delta$ is the noise intensity. The constant $\alpha$ can be eliminated, since only the relative proportions of the noise intensities matter. We assign the confidence values of both the prediction model ($P_t$) and the laser tracking result ($P_{L_t}$) a constant value (0.8 in our experiments), since we have no cues indicating their confidence. For the mean-shift tracking result, the matching score $P_{M_t}$ mentioned above is a good cue, and thus we assign the confidence value $P_{V_t} = P_{M_t}$.

After fusion with Kalman filtering, both the laser tracking result and the mean-shift tracking result are evaluated again to obtain confidence values. These values are used to correct laser tracking errors and to compute an overall confidence that serves as the acceptance criterion for the tracking result. The confidence of the mean-shift tracking result is kept as the matching score $P_{M_t}$, while the confidence of the laser tracking result is evaluated by its distance from the final tracking result $X_t^*$:

$$P_{V_t} = P_{M_t}, \qquad P_{L_t} = \exp\!\left( -\frac{\left\| X_{L_t} - X_t^* \right\|^2}{2\sigma^2} \right).$$
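The fusion itself is then a standard Kalman predict/update cycle with the stacked laser/vision observation defined above. A minimal sketch, assuming 2-D positions and the confidence-to-noise mapping δ = −ln P with α dropped; the epsilon guards are our own numerical safeguards:

```python
import numpy as np

I2 = np.eye(2)
PHI = np.block([[I2, I2], [np.zeros((2, 2)), I2]])   # constant-velocity transition
H = np.block([[I2, np.zeros((2, 2))],                # laser observes position
              [I2, np.zeros((2, 2))]])               # vision observes position

def fuse_step(x, P, z_laser, z_vision, p_pred=0.8, p_laser=0.8, p_vision=0.5):
    """One Kalman predict/update cycle fusing the laser and visual position
    measurements. Noise intensities use delta = -ln(confidence); the constant
    alpha is dropped, as only the proportions of the intensities matter."""
    eps = 1e-6                                       # keeps covariances positive
    d_pred, d_laser, d_vision = (-np.log(max(p, eps)) + eps
                                 for p in (p_pred, p_laser, p_vision))
    Q = d_pred * np.eye(4)                           # process noise covariance
    R = np.block([[d_laser * I2, np.zeros((2, 2))],
                  [np.zeros((2, 2)), d_vision * I2]])
    x_pred = PHI @ x                                 # predict
    P_pred = PHI @ P @ PHI.T + Q
    z = np.concatenate([z_laser, z_vision])          # stacked observation
    S = H @ P_pred @ H.T + R
    K = P_pred @ H.T @ np.linalg.inv(S)
    x_new = x_pred + K @ (z - H @ x_pred)            # update
    P_new = (np.eye(4) - K @ H) @ P_pred
    return x_new, P_new
```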

Under the independence assumption, the overall confidence is

$$P_t^* = P_{L_t} + P_{V_t} - P_{L_t} P_{V_t}.$$

If $P_t^*$ is larger than a threshold (0.7 in our experiments), we accept the result; otherwise, the trajectory of the object is labeled inactive and the object is put into the inactive object dataset. An inactive object is still tracked, and once its overall confidence rises above the threshold, it is labeled active again. An object that stays inactive longer than a threshold (1.5 seconds in our experiments) is labeled dead and considered to have left the field of view.

VI. EVALUATION OF THE SYSTEM

The evaluation of the proposed architecture is performed on a real scene at a corner of an exhibition hall. Two laser scanners and one camera are utilized. The camera has a 4.5 mm lens with a diagonal view angle of about 98 degrees and is slanted 30-45 degrees towards the ground. As we could only place the laser scanners around our booth, feet were measured from one side only and were blocked heavily due to the poor sensor layout. During the experiment, we defined an area on the horizontal plane of the laser scanning, and only the persons inside this area were tracked in real-time; the upper corners of the video image were not covered, so people near those corners had no laser points and were not tracked. Three computers with 600 MHz Pentium III CPUs controlled the clients, and one with a 1.0 GHz Pentium 4 CPU worked as the server; the four computers were connected by 10/100 Base LAN cables.

We test our system on 10 minutes of data. Despite the poor sensor layout, the performance of laser-based detection is very reliable: almost all people appearing in the scene were captured at least once, and only 6 of 167 people in total were missed, having no laser data due to severe occlusion. However, 964 trajectories were generated by the laser-based tracking module, which implies that many trajectories were broken into fragments that could not be linked with laser data alone. This confirms the need for visual computation and for an efficient fusion of the two types of data. Here, we select three typical situations in which the laser-based tracking module fails to track accurately, to evaluate various aspects of the system. In situation 1, we test the tracking of two moving people, with temporary occlusions occurring alternately in the laser data and the visual data. In situation 2, we test a much more complex case in which two people walk very close together and the laser module makes tracking errors. In situation 3, we test simultaneous tracking of multiple people.

A. Situation I: Laser Tracking Fails Due to Data Occlusion

Two people are tracked simultaneously (Fig. 3), and the trajectory of each person is fragmented into several parts. White curves and ellipses denote trajectories and regions generated when there is only visual data and no laser data due to occlusion. This is a very frequent situation in our test data; one feasible remedy is to add laser scanners in different directions to reduce mutual occlusions.

Fig. 3. A simple situation (only two people are tracked for demonstration; frames 1395-1434). Laser-based tracking fails due to occlusion in the laser data, and visual data is introduced to extend the trajectories. White curves are visual tracking results with no laser data; red curves are tracking results from data fusion.

Red curves and ellipses denote trajectories and regions generated from data fusion when both visual and laser data are available. In frame 1398, the laser data of the upper person is occluded, so only visual tracking is processed to extend the trajectory. In frame 1410, although the bottom person is largely occluded in the visual data, fusing the visual data with the laser data keeps the result smooth and accurate.

Fig. 4. A much more complex situation, demonstrating the efficiency of the Bayesian fusion method (only two people are tracked for demonstration; frames 44-53). a1-a4) Results without data fusion: red curves are generated with laser data only and contain two tracking errors (a2-a3, marked with yellow lines); white curves are mean-shift tracking results with no laser data, used only to link the fragments marked with red curves. b1-b4) Results with data fusion.

B. Situation II: Laser Tracking Fails Due to Data Confusion

This is a more complex situation. In Fig. 4, two people walk close together. In frames 44-48, their feet are too close to be distinguished in the laser data. The direction and speed of the upper person change sharply, so the predicted feet position is far from the real state, resulting in a laser tracking error; this is the first problem. The second problem is that the laser data of the bottom person is lost due to occlusion by other people. This situation is very difficult to handle, and the same error occurs when two people walk across each other. In frame 47, although the laser tracking result and the visual tracking result conflict, the visual tracking has a larger confidence (its matching score is higher than the

threshold of 0.7 despite partial occlusion) than the laser tracking result (the two trajectories are too close, less than 10 cm apart). This means that, in the result evaluation, we trust the visual tracking result more, and in this way the laser tracking error can be corrected.

C. Situation III: Multiple People Tracking

Multiple people are tracked simultaneously to test the system's performance in tracking multiple people. Two of the results are shown in Fig. 5.

Fig. 5. Multiple people tracking results computed in real-time (frames 713 and 1240). White curves are visual tracking results with no laser data; curves in other colors are tracking results from data fusion, with a different color for each person.

VII. CONCLUSION

In this paper, a novel system is proposed for tracking multiple people in an open area, such as a shopping mall or exhibition hall. Reliable detection and tracking are achieved by fusing distance information from multiple laser scanners with visual color information from one video camera. In a real experiment, two laser scanners and one camera were set up in an exhibition hall to monitor visitor flow, and more than 10 visitors were tracked simultaneously in real-time. Although feet were blocked heavily due to the poor sensor layout, the efficiency of combining laser scanners and a camera for reliable and fast multiple-people tracking was demonstrated. Compared with our previous research and various existing related systems, this system successfully overcomes the inherent disadvantages of purely laser-based or vision-based systems.

In future work, we will capture more real-scene data with more laser scanners and cameras. Since the video field of view in the current experiment is too small for real scenes, more cameras with wide-angle lenses, or omnidirectional cameras covering a much larger area, will be used. In this paper, fusion was done at a high (decision) level; theoretically, fusion at a low level might yield more promising results, and our future work will focus on an efficient low-level fusion method for these two types of sensors.

REFERENCES

[1] D. Gavrila, "The visual analysis of human movement: a survey", Computer Vision and Image Understanding, vol. 73, no. 1, pp. 82-98, 1999.

[2] I. Haritaoglu, D. Harwood, and L. S. Davis, "W4: Real-time surveillance of people and their activities," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 22, no. 8, pp. 809-830, August 2000.
[3] A. M. Elgammal and L. S. Davis, "Probabilistic framework for segmenting people under occlusion," in Proceedings of the 8th IEEE International Conference on Computer Vision (ICCV), July 2001, vol. 2, pp. 145-152.
[4] S. McKenna, S. Jabri, Z. Duric, A. Rosenfeld, and H. Wechsler, "Tracking Groups of People", Computer Vision and Image Understanding, vol. 80, pp. 42-56, 2000.
[5] N. T. Siebel and S. Maybank, "Fusion of Multiple Tracking Algorithms for Robust People Tracking", Proc. of European Conference on Computer Vision (ECCV), LNCS 2353, pp. 373-387, 2002.
[6] T. Zhao and R. Nevatia, "Bayesian Multiple Human Segmentation in Crowded Situations," Proc. of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2003.
[7] A. Fod, A. Howard, and M. J. Mataric, "Laser-based people tracking", Proc. of the IEEE International Conference on Robotics and Automation (ICRA), Washington DC, U.S.A., pp. 3025-3029, 2002.
[8] D. Schulz, W. Burgard, D. Fox, and A. B. Cremers, "Tracking multiple moving objects with a mobile robot", Proc. of IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), 2001.
[9] H. J. Zhao and R. Shibasaki, "A novel system for tracking pedestrians using multiple single-row laser range scanners", IEEE Transactions on Systems, Man and Cybernetics, Part A, November 2004.
[10] B. Kluge, C. Koehler, and E. Prassler, "Fast and robust tracking of multiple moving objects with a laser range finder", Proc. of IEEE International Conference on Robotics and Automation (ICRA), 2001.
[11] J. Blanco, W. Burgard, R. Sanz, and J. L. Fernandez, "Fast Face Detection for Mobile Robots by Integrating Laser Range Data with Vision", Proc. of the International Conference on Advanced Robotics (ICAR), 2003.
[12] M. Scheutz, J. McRaven, and Gy. Cserey, "Fast, Reliable, Adaptive Bimodal People Tracking for Indoor Environments", Proc. of the International Conference on Intelligent Robots and Systems (IROS), 2004.
[13] Z. Byers, M. Dixon, K. Goodier, C. M. Grimm, and W. D. Smart, "An Autonomous Robot Photographer", Proc. of the International Conference on Intelligent Robots and Systems (IROS), Las Vegas, NV, October 2003.
[14] H. J. Zhao and R. Shibasaki, "A robust method for registering ground-based laser range images of urban outdoor environments", Photogrammetric Engineering and Remote Sensing, vol. 67, no. 10, pp. 1143-1153, 2001.
[15] D. Comaniciu, V. Ramesh, and P. Meer, "Kernel-based object tracking", IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 25, no. 5, May 2003.
[16] B. Han, D. Comaniciu, Y. Zhu, and L. Davis, "Incremental Density Approximation and Kernel-based Bayesian Filtering for Object Tracking", Proc. of IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), 2004.
[17] T. Nishimura, et al., "A compact battery-less information terminal (CoBIT) for location-based support systems", Proc. SPIE, 4863B-12, 2002.
