JOURNAL OF MULTIMEDIA, VOL. 1, NO. 4, JULY 2006
K-means Tracker: A General Algorithm for Tracking People

Chunsheng Hua, Haiyuan Wu, Qian Chen, Toshikazu Wada
Faculty of Systems Engineering, Wakayama University, Wakayama City, Japan
Email: {hua, wuhy, chen}@vrl.sys.wakayama-u.ac.jp, [email protected]
Abstract— In this paper, we present a clustering-based algorithm for tracking people (e.g. hand, head, eyeball, body, and lips). Tracking people in complex environments is challenging because such targets often appear as concave objects or objects with apertures, so background areas become mixed into the tracking region and are difficult to remove by merely modifying the shape of the search area during tracking. Our method achieves robust tracking by applying the following four key ideas simultaneously: 1) a 5D feature vector describes both the geometric feature (x, y) and the color feature (Y, U, V) of each pixel uniformly, which lets our method follow position and color changes simultaneously during tracking; 2) robust tracking of objects with apertures is realized by classifying the pixels within the search area into "target" and "background" with a K-means clustering algorithm that uses both "positive" and "negative" samples; 3) a variable ellipse model (a) approximately describes the shape of a non-rigid object (e.g. a hand), (b) restricts the search area, and (c) models the surrounding non-target background, which guarantees stable tracking under various geometric transformations; 4) with both the "positive" and "negative" samples, our algorithm automatically detects and recovers from tracking failure, which makes it distinctively more robust than conventional tracking algorithms. Extensive experiments under various environments and conditions confirm the effectiveness and efficiency of the proposed algorithm.

Index Terms— K-means clustering, negative samples, tracking failure detection and recovery
I. INTRODUCTION

Robust human tracking in complex environments is a challenging problem due to complex non-rigid movements, illumination variation, cluttered backgrounds, occlusion, etc. This paper describes a general-purpose algorithm for tracking part of, or the whole, human body. Our goal is a robust and stable object tracking algorithm that uses only limited knowledge and input information.

Many studies on human body tracking have been reported [1]–[21]. Cascia et al. [1] estimated the position and orientation of a head by using texture-mapping techniques and a pre-built 3D head model; this method assumes a pre-defined initial position and pose of the head. Nanda et al. [4] used edges and the Chamfer distance to track the head in the image; this method suffers from cluttered backgrounds. Utsumi et al. [3] constructed a color patch model representing the 3D head arrangement.
The tracking is performed by fitting the projected patch model to the input image. Many methods for tracking the human body have also been reported [5]–[9]. The active contour method [5] is attractive for its ability to track objects with unpredictable non-rigid movements; with the CONDENSATION (particle filter) algorithm, it achieves stable object tracking by reasoning over a non-parametric distribution of the joint state probability of multiple hypotheses. Other particle-filter trackers have been reported in [9]. The mean shift method [6] achieves stable human body tracking by hill-climbing over the color histogram within the template: the target is found as the area with the maximum color-histogram similarity to the target. Heisele et al. [8] use a clustering method to partition the input image into many small regions and track the target by finding the moving region between two images.

Compared with head and body, hand tracking [15]–[18] is more difficult because the non-rigid shape of a hand is hard to follow. Bray et al. [15] combine SMD trackers as small particles in the CONDENSATION framework to deal with the high dimensionality of the problem; they use skin color to mask out the background, and the hand is tracked by fitting a pre-defined 3D hand model to depth maps obtained from a structured-light range sensor. Sato [16] uses an infrared camera to extract the arm region and define the search area; the fingertip is tracked by finding a circular region in the search area with template matching.

Since the eye is one of the most important facial organs, eye tracking has become a hot research topic [10]–[14]. To deal with eye movements, Kanade et al. [10] define several eye models describing eye blink; based on the extracted edges of the eyes, they fit the extracted eye boundary to a pre-defined eye model.

As mentioned above, most existing tracking algorithms are model-based: they rely on considerable prior knowledge or assumptions about the 3D shape, the appearance (2D shape, color, etc.), and the movement of the target object and the background. Many of them also require additional input information such as 3D range data. The purpose of such knowledge (or assumptions) and additional input is to improve the tracking performance and to provide additional features (the ability to give the 3D orientation,
shape, etc. of the target). Unfortunately, such rich knowledge (assumptions) also restricts the usefulness and the robustness of these methods: if the environment or the target object is not exactly as the prior knowledge describes, methods relying on such assumptions become unreliable or even useless.

Our idea is to use the least possible knowledge or assumptions about the target and the background. By doing so, although our method provides only a few outputs (where the target is and its approximate shape and size), far fewer than many existing methods, it gains greatly in robustness and processing speed.

In this paper, we propose a clustering-based general tracking framework for human movements; the targets of our algorithm include the human body, iris, head, hand, and lips. As a pixel-wise clustering-based algorithm, our method does not require any prior target appearance model and has more flexibility to deal with the deformation of the target. By introducing the concept of "negative" (background, or non-target) samples into the pixel-wise algorithm, our method becomes robust against the background interfused into concave objects or objects with apertures (e.g. a hand), from which many methods suffer.

We also found that even the best available tracking algorithms can still suffer tracking failure due to quick illumination changes, complete deformation of the target shape, motion blur caused by high target velocity, increase of the feature dimension, etc. Therefore, a tracking algorithm can work more robustly if it is able to detect and recover from tracking failure. In this paper, by using both the "positive" (target) and "negative" information, we achieve automatic detection of and recovery from tracking failure, which greatly increases the robustness of our tracking algorithm (details in Sections IV and V).

In Section II, we present the main ideas that make our algorithm robust during object tracking. In Section III, we explain how to apply those key ideas to object tracking and how to update the target center and the search area. Sections IV and V show how tracking failure is detected and recovered from. Finally, Section VI gives the experimental results and discussion.
II. MAIN IDEAS

A. Using K-means Clustering Algorithm for Pixel-wise Object Tracking

Since our purpose is to build a general tracking algorithm for arbitrary parts of the human body, we consider model-based methods inappropriate for this work: their success rests on prior knowledge and a specific target model (e.g. a head model, hand model, or eye model), so a model-based method cannot deal with a target that differs from the pre-defined target model. Therefore, in this paper we select the K-means clustering algorithm for object tracking (hereafter the K-means tracker). Compared with model-based tracking methods, a clustering-based tracking algorithm has the following advantages: 1) it needs no prior target appearance model, so initialization is simple; 2) it can track non-rigid objects (the human body) and objects with apertures (a hand) well, even when the target deforms greatly; 3) it can track any target object, not just a specific one.

B. Introducing the Concept of Non-target

Most conventional pixel-wise tracking algorithms use only the target information for pixel classification, similar to the model-based methods: they measure only the similarity (or dissimilarity) between an unknown pixel and the target sample, so a threshold on the estimated similarity is required to decide whether the pixel belongs to the target group. However, it is well known that during tracking both the target appearance and the surrounding background change easily due to illumination variation, target deformation, etc., so no fixed threshold is suitable under all conditions.

In this paper, we introduce the concept of the non-target (background) and apply K-means clustering to both the target and non-target samples. The motivation is that, through our experiments, we found the ability to discriminate the target from its surrounding background to be the key element that decides tracking success or failure; a good tracking algorithm should therefore work on both the target and the background information. As shown in Fig.1, whether an unknown pixel within the search area belongs to the target is decided by measuring its similarity to the target group and to the non-target one.

Figure 1. Illustration for describing non-target samples: the target pixel (ellipse center), the non-target pixels on the outline of the ATR, the unknown pixels, and the real target boundary (RTB), shown both in the image coordinate and in the feature space.

In this case, the similarity measurement becomes a "yes" or "no" problem, and we consider this measurement better than comparing the similarity with the target alone.
That is because, with the negative reference, it is more reliable to conclude that a pixel not belonging to the non-target group belongs to the target group. In Fig.1, RTB denotes the "Real Target Boundary" that actually separates the target from the background. Since it is difficult to obtain the RTB stably, we propose a model named the "absolute target region" (hereafter ATR) to approximate the RTB and to describe the target object and its surrounding background. The ATR is an area that contains all the pixels of the target object, while the pixels on the outline of the ATR are representative non-target pixels around the target. In this paper, we use the ATR to: 1) restrict the search area, which speeds up the algorithm and increases its robustness; 2) represent the approximate shape and direction of the target; 3) model the surrounding non-target background. Since both the target sample and the surrounding ATR are continuously updated, our method is robust against variation of the target appearance.

C. Describing Target Features with a Uniform 5D Feature Vector

Figure 2. Illustration of the 5D feature space: the color components (Y, U, V) and the position components (x, y) of a pixel in the image plane.

As shown in Fig.2, each pixel in an image is described by a 5D feature vector $f = [c\ p]^T$, where $c = [Y\ U\ V]^T$ describes the color of the pixel and $p = [x\ y]^T$ describes its position. By applying K-means clustering to both the target and non-target samples in this 5D feature space, the target center is computed and updated not only in the 2D position space but also in the 3D color space simultaneously. This property makes our method robust against illumination changes.

D. Self Failure Detection and Recovery

Tracking failure is detected by checking whether a dramatic change of the target color has occurred between two adjacent frames. Failure recovery is realized by finding, within the search area, the pixel with the maximum similarity to the target sample determined in the previous frame. The details are given in Sections IV and V.

III. TRACKING FRAMEWORK

A. Elliptical Search Area

In this paper, we define the search area as a small region that contains the whole target object in both the previous and the current frame, and we describe it by an ellipse. The position, direction and shape of the search area are determined from the previous tracking result and the maximum velocity of the target. Since all the pixels of the target object lie within the search area, the pixels on the ellipse contour are background pixels, and we use some of them as the negative (background) samples.

B. K-means Clustering with Target and Non-target

As shown in Fig.3, we take the ellipse center as the target cluster center and describe it by $f_T = [c_T\ p_T]^T$. The non-target pixels on the ellipse are represented as $f_N(j) = [c_N(j)\ p_N(j)]^T$, $j = 1 \sim m$, where $m$ is the number of selected pixels on the ellipse, and an unknown pixel is described by $f_u = [c_u\ p_u]^T$.
Figure 3. Illustration for target clustering
An unknown pixel $f_u$ is classified by comparing the distance $d_T$ from it to the target cluster center,

$$d_T = \|f_T - f_u\|^2, \qquad (1)$$

with the shortest distance $d_N$ from $f_u$ to the $m$ pixels on the ellipse,

$$d_N = \min_{j=1 \sim m} \|f_N(j) - f_u\|^2. \qquad (2)$$

In Eqs.(1)(2), the distance is the Euclidean distance in the 5D feature space, because the Euclidean distance is invariant to rotation of the target. If $d_T < d_N$, $f_u$ is classified as a target pixel; otherwise it is a non-target pixel. When a pixel is classified as target, it is recorded for the update procedure of the search area described in the following section.

As shown in Eq.(2), $m$ non-target points are selected from the outline of the search area. Here we let $m = 9$: eight of them are obtained by dividing the ellipse contour into eight equal arcs at intervals of 45°, and the remaining one is the cross point shown in Fig.3. For an unknown pixel, its cross point is the intersection of the ellipse contour with the ray from the ellipse center through the unknown pixel.
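To make the classification rule concrete, the following Python sketch (an illustrative implementation of our reading of Eqs.(1)(2), not the authors' original code; the function and variable names are ours) classifies every pixel of the search area by comparing its distance to the target cluster center against its distance to the nearest non-target sample:

import numpy as np

def classify_pixels(features, f_target, f_nontarget):
    """Classify search-area pixels as target / non-target, Eqs.(1)(2).

    features:    (N, 5) array of [Y, U, V, x, y] vectors, one row per pixel.
    f_target:    (5,) target cluster center in the same 5D feature space.
    f_nontarget: (m, 5) non-target samples taken on the ellipse contour.

    Returns a boolean mask that is True where d_T < d_N.
    """
    # Squared Euclidean distance to the target cluster center, Eq.(1).
    d_t = np.sum((features - f_target) ** 2, axis=1)
    # Shortest squared distance to any of the m non-target samples, Eq.(2).
    diff = features[:, None, :] - f_nontarget[None, :, :]   # shape (N, m, 5)
    d_n = np.min(np.sum(diff ** 2, axis=2), axis=1)
    return d_t < d_n

Note that the ninth non-target sample (the cross point) depends on the pixel being classified, so a faithful implementation would recompute that sample per pixel rather than using one fixed set of m samples as in this sketch.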
C. Update of Target Center

During the iterations of K-means clustering, the detected target center is updated in both the 2D position space and the 3D color space to follow illumination changes. However, rapid color shifts (e.g. highlight reflections) may occur because of the glossy surface of the target object (e.g. eyeballs). If we simply followed the normal update rule, the target property (here, the color) would be completely destroyed. To solve this problem, we use a simple averaging filter to gradually attenuate the influence of rapid color changes:

$$f_T^{(t)} = \gamma f_T^{(new)} + (1 - \gamma) f_T^{(t-1)}, \qquad (3)$$

where $0 < \gamma < 1$ is a pre-defined coefficient (e.g. 0.6) and the superscripts denote time: $f_T^{(new)}$ is the new target feature obtained by clustering in frame $t$, and $f_T^{(t-1)}$ is the target feature in frame $t-1$. With this equation, when a rapid color shift happens our method resists its influence, and when the target color returns it continues the correct process.
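As a minimal sketch (our naming, not the authors'), the filter of Eq.(3) is a one-line exponential blend applied to the target feature every frame:

import numpy as np

def update_target_feature(f_new, f_prev, gamma=0.6):
    """Averaging filter of Eq.(3).

    f_new:  (5,) target feature freshly obtained by clustering in frame t.
    f_prev: (5,) target feature kept from frame t-1.
    gamma:  blending coefficient, 0 < gamma < 1 (0.6 is the example value).
    """
    return gamma * np.asarray(f_new) + (1.0 - gamma) * np.asarray(f_prev)

Because gamma < 1, a sudden highlight only moves the stored feature part of the way toward the corrupted measurement, and the estimate drifts back once the true target color reappears.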
D. Update of the Search Area

As the target may change its shape and position continuously, the background surrounding the target also keeps changing. Therefore, a good tracking algorithm must update the background information to follow the target movement; in this paper, this becomes the problem of updating the search area (i.e. the ellipse, or ATR). After the K-means clustering procedure, the target center (also the ellipse center) is updated to a new position, which becomes the initial target center for the next frame. When the target rotates or scales, the ellipse contour should be updated dynamically to follow these changes. The update of the ellipse model therefore includes both the shift of the ellipse center for 2D translation and the variation of the ellipse contour for deformation. We use the movement of the target center to shift the ellipse center, and the distribution of the target pixels to update the ellipse contour.

The ellipse describing the search area is determined from the set of target pixels obtained in the current frame. These pixels can be regarded as a randomly distributed point set, so they can be represented by a probability distribution, which we assume can be approximated by a Gaussian probability density function (pdf). Because the Gaussian pdf is a statistical representation, it is superior to a rigid geometric model in describing the appearance of the target object: 1) it is insensitive to small geometric deformations of the object; 2) it reduces the influence of mis-detection of some target pixels due to image noise. The Gaussian pdf of $Z = [Z_1, Z_2, \ldots, Z_n]^T$ (the pixel set of the target object) is defined as

$$Z \sim N(m_Z, \Sigma_Z), \qquad (4)$$

where $Z_i = [x_i, y_i]^T$ is the position of the $i$-th target pixel, $m_Z$ is the mean and $\Sigma_Z$ is the covariance matrix. The Mahalanobis distance of a vector $Z$ to the mean $m_Z$ is given by

$$g(Z) = [Z - m_Z]^T \Sigma_Z^{-1} [Z - m_Z]. \qquad (5)$$

The ellipse $E(M)$ that contains at least $M\%$ of the target pixels is given by

$$g(Z) = J, \qquad (6)$$

where $J = -2 \ln(1 - \frac{M}{100})$. We let $M$ be large enough (e.g. 95, meaning the ellipse covers the target pixels within about two standard deviations) so that $E(M)$ contains the overwhelming majority of the target pixels; thus $E(M)$ represents the approximate shape of the target object. Since the object moves, we enlarge $E(M)$ by a factor $k$ (e.g. 1.25) and use it as the search area for tracking in the next frame, where $k$ is chosen by considering the velocity of the target object.

Figure 4. The ATR updated by the Mahalanobis distance: the detected target pixels and the enlarged search area E(M) in the image plane.
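The search-area update of Eqs.(4)-(6) can be sketched as below (illustrative code with assumed helper names, not the authors' implementation): fit a Gaussian to the detected target pixel positions, derive the Mahalanobis threshold, and enlarge the ellipse by k for the next frame.

import numpy as np

def update_search_ellipse(target_xy, M=95.0, k=1.25):
    """Fit E(M) to the detected target pixels and enlarge it by k.

    target_xy: (n, 2) array of [x, y] positions of pixels classified as target.
    Returns (mean, cov, J): the search area is {z : g(z) <= J} with g as in Eq.(5).
    """
    mean = target_xy.mean(axis=0)                  # m_Z in Eq.(4)
    cov = np.cov(target_xy, rowvar=False)          # Sigma_Z in Eq.(4)
    J = -2.0 * np.log(1.0 - M / 100.0)             # Eq.(6): J = -2 ln(1 - M/100)
    # Enlarging every axis of the ellipse by k multiplies the
    # Mahalanobis threshold by k^2.
    return mean, cov, J * k ** 2

def inside_search_area(xy, mean, cov, J):
    """Mahalanobis test of Eq.(5): g(z) <= J."""
    d = xy - mean
    g = d @ np.linalg.inv(cov) @ d
    return g <= J

For M = 95, J = -2 ln(0.05) ≈ 5.99, which matches the 95% quantile of the chi-square distribution with two degrees of freedom, consistent with the "about two standard deviations" remark above.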
IV. SELF TRACKING FAILURE DETECTION

In most tracking algorithms, target tracking is performed by searching a small area around the target detected in the previous frame. This approach works only when the result obtained in the previous frame is correct. Therefore, we consider that, before the object tracking in the current frame, it is necessary to perform a verification procedure that checks whether the previous tracking result is correct. If a tracking failure occurs and is detected, the current tracking job should be interrupted and a failure-recovery job started. Once the failure recovery has succeeded, the interrupted tracking job can be restarted.

A. Failure detection by checking the target center

If the target object is solid and moves smoothly, the center of the target determined in the previous frame should still lie on a part of the target with the same color in the current frame. Therefore, at the beginning of the object tracking in the current frame, if the pixel at the position of the target center determined in the previous frame does
not have a similar color (case 1 of Fig.5), we consider that a tracking failure may have occurred in the current frame. This situation is checked by

$$f_C(T) = \begin{cases} 1, & \text{if } d_T > d_{PB} \\ 0, & \text{otherwise,} \end{cases} \qquad (7)$$

where $d_T = \|C_T^{(t)} - C_P^{(t+1)}\|^2$ and $d_{PB} = \min_{j=1 \sim m} \|C_P^{(t+1)} - C_B^{(t+1)}(j)\|^2$. Here $C_T^{(t)}$ is the color of the target cluster center in the previous frame, $C_P^{(t+1)}$ is the color of the pixel, in the current frame, at the position of the target cluster center in the previous frame, and $C_B^{(t+1)}(j)$ is the color of the $j$-th background sample in the current frame. $f_C(T)$ is a binary flag: $f_C(T) = 1$ means that the pixel in the current frame at the position of the previous target center looks more like the background than like the previous cluster center, which indicates that a tracking failure may have occurred.
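A small sketch of the check in Eq.(7) follows (illustrative naming; the quantities compared are YUV colors, as defined above):

import numpy as np

def center_failure_flag(c_target_prev, c_center_now, c_background_now):
    """f_C(T) of Eq.(7).

    c_target_prev:    (3,) YUV color of the target cluster center at frame t.
    c_center_now:     (3,) YUV color at frame t+1, sampled at the image
                      position of the previous target center.
    c_background_now: (m, 3) YUV colors of the background samples at frame t+1.
    """
    d_t = np.sum((c_target_prev - c_center_now) ** 2)                      # d_T
    d_pb = np.min(np.sum((c_background_now - c_center_now) ** 2, axis=1))  # d_PB
    return 1 if d_t > d_pb else 0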
However, when a target object has thin color patterns, has a concave shape or has apertures, this check alone often raises false alarms. To cope with this situation, we propose one more check for tracking failure, described in the next sub-section.

Figure 5. Illustration for failure detection. Case 1: failure detection by checking the target center. Case 2: failure detection by checking the boundary of the search area.
B. Failure detection by checking the boundary of the search area

As described in Section III-A, the ellipse contains the whole target object in the current frame and should still contain the whole target object in the next frame. Therefore, if the object crosses the boundary of the search area (the elliptical contour), we consider that a tracking failure may have occurred in the current frame. This can happen when the speed of the target exceeds the predicted maximum speed. The situation is confirmed by checking whether the colors of the pixels, in the current frame, on the elliptical contour determined in the previous frame are similar to the colors of the target cluster centers (case 2 of Fig.5). This condition is checked, for the $j$-th contour pixel, by

$$f_B(j) = \begin{cases} 1, & \text{if } d_{BTj} > d_{BBj} \\ 0, & \text{otherwise,} \end{cases} \qquad (8)$$

where $d_{BTj} = \min_{k=1 \sim n} \|C_T^t(k) - C_B^{(t+1)}(j)\|^2$ and $d_{BBj} = \|C_B^t(j) - C_B^{(t+1)}(j)\|^2$. Here $C_T^t(k)$ is the color of a target cluster center determined in the previous frame ($n$ is the number of target cluster centers), and $C_B^t(j)$ and $C_B^{(t+1)}(j)$ are the colors of the $j$-th pixel on the elliptical contour in the previous and the current frame, respectively.

By checking both the target center and the background samples, the probability of tracking failure is estimated as

$$P(fail) = 0.5 f_C(T) + 0.5 \frac{\sum_{j=1}^{m} f_B(j)}{m}. \qquad (9)$$

When $P(fail)$ is greater than 0.7 (an experimentally determined value), we judge that a tracking failure has occurred.
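The sketch below combines both checks into the failure probability of Eq.(9); as elsewhere, the names are ours, and the target-center colors are passed as an (n, 3) array to match the minimum over k in $d_{BTj}$:

import numpy as np

def tracking_failure_probability(fc_flag, c_targets_prev,
                                 c_contour_prev, c_contour_now,
                                 threshold=0.7):
    """P(fail) of Eq.(9) from the two failure checks.

    fc_flag:        f_C(T) from the target-center check of Eq.(7).
    c_targets_prev: (n, 3) colors of the target cluster centers at frame t.
    c_contour_prev: (m, 3) colors of the ellipse-contour pixels at frame t.
    c_contour_now:  (m, 3) colors at the same contour positions at frame t+1.
    """
    m = len(c_contour_now)
    f_b = 0
    for j in range(m):
        # d_BTj: distance from the current contour color to the nearest
        # target-center color of the previous frame.
        d_bt = np.min(np.sum((c_targets_prev - c_contour_now[j]) ** 2, axis=1))
        # d_BBj: change of this contour pixel's color between the two frames.
        d_bb = np.sum((c_contour_prev[j] - c_contour_now[j]) ** 2)
        f_b += 1 if d_bt > d_bb else 0                 # f_B(j), Eq.(8)
    p_fail = 0.5 * fc_flag + 0.5 * f_b / m             # Eq.(9)
    return p_fail > threshold, p_fail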
V. TRACKING FAILURE RECOVERY
When a tracking failure has occurred and been detected, the newly determined target cluster center is false; that is, the tracking of the target cluster has failed. The tracking failure is detected by the condition of Eq.(9). Recovery from the tracking failure is achieved by re-determining the target cluster center: we search the elliptical search area in the current frame for the pixel that is most similar to the cluster center determined in the previous frame.

Let $f_T^{(t)}$ be the 5D feature vector of the lost cluster center determined in the previous frame $t$, $f^{(t+1)}$ be one of the pixels in the search area in the current frame $t+1$, and $f_T^{(t+1)}$ be the re-determined target cluster center. $f_T^{(t+1)}$ is determined by finding the $f^{(t+1)} \in ellipse$ that has the minimum distance to $f_T^{(t)}$:

$$f_T^{(t+1)} = \operatorname*{argmin}_{f^{(t+1)} \in ellipse} \|f^{(t+1)} - f_T^{(t)}\|^2. \qquad (10)$$
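A direct sketch of Eq.(10) (illustrative naming): the recovered center is simply the nearest-neighbor pixel in the 5D feature space.

import numpy as np

def recover_target_center(features_in_ellipse, f_target_lost):
    """Re-determine the target cluster center, Eq.(10).

    features_in_ellipse: (N, 5) feature vectors of the pixels inside the
                         elliptical search area in the current frame.
    f_target_lost:       (5,) feature of the cluster center from frame t.
    """
    d = np.sum((features_in_ellipse - f_target_lost) ** 2, axis=1)
    return features_in_ellipse[np.argmin(d)]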
VI. EXPERIMENT

To evaluate the performance of our tracking framework, we applied it to different human movements under various conditions. Some of the experiments are shown in Fig.6; they cover almost all parts of the human body, both indoors and outdoors.

The first row tests the performance of our algorithm on the eyeball. The eyeball is a challenging target because it is subject to unpredictable conditions such as highlight reflections, glasses being worn or removed, partial occlusion by the spectacle frames, and blinking. In frame 173, a reflection (the blue highlight) appeared because of the glossy surface of the glasses, which changed the target property and made tracking difficult; because we use the averaging filter to update the target center in the 5D feature space, our tracker kept tracking successfully. In frame 602, the subject took off the glasses and the target scaled greatly; although the spectacle frame moved into the ellipse in the 2D position space, its reflection could be discriminated from the target in the 3D color space. Thanks to the variable ellipse model, arbitrary scaling and direction changes are handled in frames 115, 193 and 750.
The second experiment was performed on the hand, a notoriously difficult target because of its arbitrary shape and the background interfused through the fingers. The difficulties here were: 1) the background was quite complex, with many objects of different colors; 2) the illumination was non-uniform, with strong light from the windows on the right side; 3) the arm has colors similar to those of the target hand. The target took a different shape in each image, and its color was also affected by the non-uniform illumination (e.g. frames 307 and 403). By applying the non-target samples in the K-means clustering, our tracking framework successfully discriminated the target object from the interfused background. Although the arm had almost the same color as the target hand, its 2D geometric feature differed from that of the target center, so the similarly colored arm could not disturb the tracking. With the variable ellipse model restricting the search area, our algorithm successfully dealt with the various target deformations.

In the third row, our tracking algorithm was applied to the human head. The target underwent 2D rotation (frame 86), revolution (frames 226, 253), motion blur (frame 452) and scaling (frame 570). Because the light was not uniformly distributed, the target color changed frequently (e.g. frames 226, 253 and 452). We paid special attention to frame 452, where the change of target color and the motion blur caused by fast movement happened simultaneously: although the target property had changed, the target could still be found by comparison with the surrounding non-target samples. Since the target feature was continuously updated in the 5D space, color shift was tolerated, and when the target moved into the bright area (frame 452) our algorithm adapted to the new target color.

The fourth experiment was performed on an outdoor soccer sequence (taken from the European Champions League 2002-2003). The human body in this test is a typical non-rigid target, and since occlusion happens constantly, this sequence is a challenging task in computer vision. In frame 155, because the target is non-rigid, another player who was not the target moved into the ellipse region, so the background was interfused into the target area; since the tracking is performed by discriminating between the target and non-target samples, our algorithm worked well in this case. In frame 215, scaling, interfused background pixels and partial occlusion happened simultaneously; to track robustly, we applied the K-means clustering to both the positive and negative information against the interfused background and partial occlusion, and used the variable ellipse model to deal with the scaling and non-rigid movements. Frame 309 shows another challenge: a player belonging to the same team as the target moved into the ellipse. Because not only the 3D color but also the 2D position information was
calculated in the clustering process, the interfused player in the white uniform had a low probability of belonging to the target group in the 2D position space; such influence could therefore be neglected, which improved the robustness of our method.

The fifth row shows the performance of our system on basketball players indoors. In these sequences, the complex colors of the audience, partial occlusion and the interfused background make it difficult for prior-knowledge-based tracking algorithms to work robustly. We applied: 1) the 5D feature vector to discriminate two players belonging to the same team, since they have different properties in the 2D position space; 2) the non-target samples in the K-means clustering to deal with the partial occlusion and the interfused background (the other team's players); 3) the variable ellipse model to deal with unpredictable non-rigid human movements.

The sixth row shows the lips-tracking experiment. Human lips are a typical case of non-rigid deformation. Since in our experiment the movement of the lips is caused by both the deformation of the mouth and the rotation of the head, this experiment is difficult for most conventional model-based lips-tracking algorithms because of the head rotation. Because our tracking is based on a pixel-wise clustering algorithm, no pre-defined target model is needed, and by modifying the search area according to the distribution of the target cluster, our method successfully deals with the non-rigid deformation.

The seventh row tracks a basketball player performing a slam dunk. As shown in these continuous sequences, both the target appearance and the background change greatly due to the target deformation and the illumination variation. In frame 052, the colors of both the target and the background were dramatically changed by a camera flash; our method automatically changed the ellipse contour to green to indicate that a tracking failure had occurred (as described in Section IV), and the failure was then recovered by the method described in Section V.

All the experiments were run on a 3.06 GHz Intel Xeon processor. The image size is 640 × 480 pixels, and when the target size varies from 140 × 140 to 200 × 200 pixels, the processing time is about 0.012 ∼ 0.018 sec/frame.

VII. CONCLUSION

In this paper, we proposed a general tracking framework for human movements. Because a pixel-wise clustering method is applied, our framework achieves robust object tracking with little prior knowledge and few assumptions. By applying the K-means clustering algorithm to both the target and non-target samples in a 5D feature space, the algorithm discriminates the target object from its surrounding background. We have implemented a real tracking system based on this framework, and video-rate processing has been achieved.
Figure 6. Experiment results of tracking different parts of the human body under various conditions:
(I) Tracking the iris with different pose and scale (frames 115, 173, 193, 602, 750); in frame 602, the subject has taken off his glasses.
(II) Hand tracking with free non-rigid deformation under a non-uniform illumination condition (frames 027, 073, 307, 375, 403).
(III) Head tracking with non-uniform illumination and the motion blur caused by the high velocity of the target (frames 086, 226, 253, 452, 570).
(IV) Tracking a soccer player with shape deformation, occlusion, background interfusion and scaling (frames 155, 215, 295, 309, 420).
(V) Tracking a basketball player with occlusion against a cluttered background (frames 057, 130, 234, 299).
(VI) Tracking a person's lips with different shapes (frames 017, 158, 226, 290, 318).
(VII) Tracking the non-rigid human body with dramatic variance of the background and the target shape (frames 002, 038, 052, 078, 095); in frame 052, the illumination is dramatically changed by a camera flash, and the ellipse contour turning green indicates that the tracking failure was detected.
For future work, we plan to extend this framework to a multi-color target model, and one-click initialization is also under consideration.

ACKNOWLEDGMENT

This research is partially supported by the Ministry of Education, Culture, Sports, Science and Technology, Grants-in-Aid for Scientific Research (A)(2) 16200014, (C) 18500131, and (C)(2) 15500112.

REFERENCES

[1] M. L. Cascia, S. Sclaroff, "Fast, Reliable Head Tracking under Varying Illumination: An Approach Based on Registration of Texture-Mapped 3D Models," PAMI, Vol.22, No.4, pp.322-336, April 2000.
[2] P. Fieguth, D. Terzopoulos, "Color-based Tracking of Heads and other Mobile Objects at Video Frame Rates," CVPR, 1997.
[3] A. Utsumi, N. Tetsutani, "Human Tracking using Multiple-Camera-Based Head Appearance Modeling," FG, pp.657-662, 2004.
[4] H. Nanda, K. Fujimura, "A Robust Elliptical Head Tracker," AFGR, pp.469-474, 2004.
[5] M. Isard, A. Blake, "Contour Tracking by Stochastic Propagation of Conditional Density," ECCV, Vol.1, pp.343-356, 1996.
[6] D. Comaniciu, V. Ramesh, P. Meer, "Real-time Tracking of Non-rigid Objects using Mean Shift," CVPR, Vol.2, pp.142-149, 2000.
[7] R. Urtasun, P. Fua, "3D Tracking for Gait Characterization and Recognition," FG, pp.17-22, 2004.
[8] B. Heisele, U. Kreßel, W. Ritter, "Tracking Non-rigid Moving Objects Based on Color Cluster Flow," CVPR, pp.253-257, 1997.
[9] T. Zhao, R. Nevatia, "Tracking Multiple Humans in Crowded Environment," CVPR, Vol.2, pp.406-413, 2004.
[10] Y-l. Tian, T. Kanade, J. F. Cohn, "Dual-state Parametric Eye Tracking," FG, pp.110-115, 2000.
[11] Z-w. Zhu, Q. Ji, K. Fujimura, "Combining Kalman Filtering and Mean Shift for Real Time Eye Tracking Under Active IR Illumination," ICPR, 2002.
[12] I. Starker, R. A. Bolt, "A Gaze-Responsive Self-Disclosing Display," CHI'90, pp.3-9, 1990.
[13] S. Baluja, D. Pomerleau, "Non-Intrusive Gaze Tracking using Artificial Neural Networks," Technical Report CMU-CS-94-102, CMU, 1994.
[14] R. Stiefelhagen, J. Yang, A. Waibel, "Tracking Eyes and Monitoring Eye Gaze," PUI'97, 1997.
[15] M. Bray, E. Koller-Meier, L. V. Gool, "Smart Particle Filtering for 3D Hand Tracking," FG, pp.675-680, 2004.
[16] Y. Sato, "Fast Tracking of Hands and Fingertips in Infrared Images for Augmented Desk Interface," FG, pp.462-467, 2000.
[17] A. Utsumi, J. Ohya, "Multiple-hand-gesture Tracking using Multiple Cameras," CVPR, pp.473-478, 1999.
[18] K. Oka, Y. Sato, H. Koike, "Real-time Tracking of Multiple Fingertips and Gesture Recognition of Augmented Desk Interface System," FG, 2002.
[19] R. Collins, Y. Liu, M. Leordeanu, "On-line Selection of Discriminative Tracking Features," PAMI, Vol.27, No.10, pp.1631-1643, October 2005.
[20] H. T. Nguyen, A. Smeulders, "Tracking Aspects of the Foreground Against the Background," ECCV, Vol.2, pp.446-456, 2004.
[21] C. Gräßl, T. Zinßer, H. Niemann, "Illumination Insensitive Template Matching with Hyperplanes," DAGM, pp.273-280, 2003.
Dr. Chunsheng Hua was born in Shenyang, China. He received his B.E. degree in electronic engineering from Shenyang University of Technology in 2001 and his M.S. degree from the Department of Mechanical and System Engineering at Kyoto Institute of Technology in 2004. He has been pursuing his Ph.D. since 2004. He is a student member of IEEE and IPSJ. His research interests include face recognition and color-based object tracking.
Prof. Haiyuan Wu received the Ph.D. degree from Osaka University in 1997. From 1996 to 2002, she was a research associate at Kyoto Institute of Technology. Since 2002, she has been an associate professor at Wakayama University. She is a member of IPSJ, IEICE, ISCIE and the Human Interface Society.
Prof. Qian Chen received the Ph.D. degree from Osaka University in 1992. From 1992 to 1994, he was a researcher at the Laboratory of Image Information Science and Technology. From 1994 to 1995, he was a research associate at Osaka University, and from 1995 to 1997 a research associate at the Nara Institute of Science and Technology. Since 1997, he has been an associate professor at Wakayama University. He is a member of IPSJ and RSJ.
Prof. Toshikazu Wada received his B.Eng. degree in electronic engineering from Okayama University, his M.Eng. degree in computer science from Tokyo Institute of Technology, and his D.Eng. degree in applied electronics from Tokyo Institute of Technology, in 1984, 1987 and 1990, respectively. He is currently a professor in the Department of Computer and Communication Science, Wakayama University. His research interests include pattern recognition, computer vision, image understanding and artificial intelligence. He received the Marr Prize at the International Conference on Computer Vision in 1995, the Yamashita Memorial Research Award from the Information Processing Society of Japan (IPSJ), and the Excellent Paper Award from the Institute of Electronics, Information and Communication Engineers (IEICE), Japan. He is a member of the IEICE, IPSJ, the Japanese Society for Artificial Intelligence and IEEE.