IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, VOL. 27, NO. 3, MARCH 2005, pp. 351-364

Real-Time Gesture Recognition by Learning and Selective Control of Visual Interest Points

Toshiyuki Kirishima, Member, IEEE, Kosuke Sato, Member, IEEE, and Kunihiro Chihara, Member, IEEE

Abstract—For the real-time recognition of unspecified gestures by an arbitrary person, a comprehensive framework is presented that addresses two important problems in gesture recognition systems: selective attention and processing frame rate. To address the first problem, we propose the Quadruple Visual Interest Point Strategy. No assumptions are made with regard to scale or rotation of visual features, which are computed from dynamically changing regions of interest in a given image sequence. In this paper, each of the visual features is referred to as a visual interest point, to which a probability density function is assigned, and the selection is carried out. To address the second problem, we developed a selective control method to equip the recognition system with self-load monitoring and controlling functionality. Through evaluation experiments, we show that our approach provides robust recognition with respect to such factors as type of clothing, type of gesture, extent of motion trajectories, and individual differences in motion characteristics. In order to indicate the real-time performance and utility aspects of our approach, a gesture video system is developed that demonstrates full video-rate interaction with displayed image objects.

Index Terms—Gesture recognition, selective control, visual interest points, Gaussian density feature, real-time interaction.

1 INTRODUCTION

In various fields of industry, Virtual Reality (VR) technologies are expected to bring about novel interaction methods and user interfaces. Immersive experiences are better obtained when a user's nonverbal information is reflected in the virtual environment in real time [1], [2]. Of particular interest is real-time recognition of human gestures, which realizes more flexible and more effective means of Human-Computer Interaction (HCI) [3], [4]. The quality of information services and experiences depends on the design and performance of the user interface, which is one of the most critical elements in cyberspace interaction [5], [6]. For example, consider a browsing scheme on the Web for commercial products. The persuasiveness of the advertisement and customer satisfaction are improved when a VR-based browser is used to maximize the freedom of the user to manipulate products at will [7], [8]. These examples imply that there is a definite need to realize higher adaptability and flexibility in the human interface, assuming that the participants include not only children but also elderly people and that users are not limited to specialists.

T. Kirishima is with the Department of Electrical Engineering, Nara National College of Technology, 22 Yata-cho, Yamatokoriyama-shi, Nara, 639-1080 Japan. E-mail: [email protected].
K. Sato is with the Department of Systems Innovation, Division of Systems Science and Applied Informatics, Graduate School of Engineering Science, Osaka University, 1-3 Machikaneyama-cho, Toyonaka-shi, Osaka, 560-8531 Japan. E-mail: [email protected].
K. Chihara is with the Graduate School of Information Science, Nara Institute of Science and Technology, 8916-5 Takayama-cho, Ikoma-shi, Nara, 630-0101 Japan. E-mail: [email protected].

Manuscript received 29 May 2003; revised 9 Sept. 2004; accepted 10 Sept. 2004; published online 14 Jan. 2005. Recommended for acceptance by S. Soatto.

Generally, the reproducibility of natural gestures in real situations is poor, and the gestures are usually not defined explicitly. By selectively extracting the necessary information, we can recognize not only standard gestures but also "similar" gestures in a highly robust manner. In the present study, a similar gesture refers to a gesture that shares several common points with a standard gesture. According to this definition, there can exist myriads of similar gestures for a particular gesture class. In order to cope with all kinds of gestures, a method that sequentially stores gesture samples of the same class might be considered, but this would cause the number of stored samples, and thereby the memory volume and the pattern-matching cost, to increase indefinitely.

In general, gestural actions in the same gesture class tend to have a number of visually common features. Even in a crowd, we can easily grasp the meaning of gestures given by a communication partner by paying attention to these visually common features. However, these visually common features are not always self-evident, and in real-life situations, cases in which multiple features must be considered simultaneously are not unusual. In order to deal with this problem, selective attention models [9], [10], [11], [12], [13] and theoretical frameworks [14], [15], [16], [17], [18], [19], [20], [21] can be powerful tactics that enable the selection of spatio-temporally reliable features from a given gestural image sequence. By focusing on the visually common features, recognition of similar gestures can be achieved without increasing the number of gesture samples. This contributes to faster processing and requires fewer similar gesture samples to be prepared beforehand. In this paper, we refer to the spatio-temporally reliable set of visual features as gesture protocols. In order to realize protocol-based learning
and recognition of gestures, the Quadruple Visual Interest Point Strategy (QVIPS) is introduced [22], [23].

Although stable gesture recognition at a high sampling rate is extremely important in improving the quality of the interaction in a virtual environment [24], very few of the previously reported systems guarantee a sampling rate of 30 [fps], namely, full video-rate interaction, without introducing special-purpose or state-of-the-art hardware [4]. In order to satisfy the real-time requirement, a more sophisticated framework is needed which selectively considers meaningful spatial features [25], because the system must deal with a tremendous amount of data in a fraction of a second. However, previous attention models and theoretical frameworks do not provide a comprehensive framework to deal with the problems of selective attention and processing frame rate in gesture recognition systems. In order to deal with these problems, a selective control method of visual interest points is introduced.

The remainder of this paper is organized as follows: In Section 2, related studies are described briefly. Section 3 describes QVIPS in detail. In Section 4, selective control methods for visual interest points are presented. Section 5 presents the results of the evaluation experiments that were performed in order to verify the effectiveness of the proposed techniques. Section 6 presents a discussion of the results presented in Section 5. Finally, Section 7 summarizes the above sections and their results.

2 RELATED WORK

Vision-based gesture recognition techniques can roughly be classified into two approaches: the body model-based approach and the appearance-based approach. In the body model-based approach [26], [27], [28], [29], geometrical primitives, such as cylindrical models, are fitted to an original image. Some of these approaches attach several markers on the user’s body and estimate body posture and relevant parameters based on the spatial relationships among the markers. Body model-based approaches require exact calibration and usually place restrictions on the type of clothing that can be worn. In addition, these approaches cannot avoid occlusion problems that frequently occur depending on the camera angle and the nature of the target gestures. This leads to difficulty in guaranteeing the convergence accuracy and the time for parameter estimation. Therefore, these approaches are not suitable for VR applications, in which recognition results are required within acceptable time intervals. On the other hand, appearance-based approaches estimate the gesture class by applying appearance-based methods [30], [31], [32]. The following are examples of representative methods: the Temporal Template-based methods [33], Support Vector Machine (SVM)-based methods [30], [34], Hidden Markov Model (HMM)-based methods [35], [36], [37], Neural Network (NN)-based methods [38], Fuzzy-based methods [39], [40], Figure Moment-based methods [41], [42], and Orientation Histogram-based methods [43]. Although these methods require the target gestures to be learned beforehand, there is no need for the user to wear special markers or devices and the recognition task is not affected by frequent occlusions.

Fig. 1. Simplified diagram of the gesture recognition process.

Considering that the quick response of the interaction system to user actions is extremely important in improving the quality of interaction [2], [44], we believe that appearance-based methods are more suitable for real-time processing of gestural image sequences. However, many conventional methods are based on the off-line class estimation of the input gestures and dynamic collaboration with VR applications is often ignored. In this paper, Quadruple Visual Interest Point Strategy (QVIPS) is proposed as a means by which to deal with the above-mentioned problems. In addition, a selective control method of visual interest points is introduced in order to guarantee real-time interaction. In order to demonstrate the real-time performance and utility aspect of the proposed method, a gesture video system is developed.

3 QUADRUPLE VISUAL INTEREST POINT STRATEGY (QVIPS)

3.1 Motivation and Overview of the Proposed Approach

If our visual attention system observed a gesture only by paying attention to the changes in the body posture of a communication partner, we would simply witness disjointed regions before our eyes. In reality, we observe both regions of change and regions of little or no change. We believe that a gesture recognition algorithm should not predetermine which visual features are of the greatest importance for a particular gesture. Therefore, we introduce protocol-learning for the determination of reliable visual features. A simplified diagram of the proposed gesture recognition approach is shown in Fig. 1. In short, the proposed approach first takes an "interested" approach, followed by a "disinterested" approach. Details for each processing block are described in the following sections.

3.2 Extraction of Gaussian Density Feature

In order to achieve real-time performance, fast and efficient algorithms are greatly needed, especially for the feature extraction and matching stages. In our approach, a difference
image is generated in order to detect motion and a silhouette image is generated in order to detect static postures. By referring to the center of gravity and the area of each region of interest, visual features with respect to position, posture, and motion are extracted frame by frame. As a feature extractor, we use the Gaussian Density Feature (GDF), as described below.

Let an image sequence be $f_\tau(x, y)$ $(\tau = 1, 2, \ldots, N)$. Binarize $f_\tau(x, y)$ with a threshold value $t$ and define the obtained binary image as $g_\tau(x, y)$. The geometrical moment $m_{p,q}$ for $g_\tau(x, y)$ is defined by

$$m_{p,q} = \sum_{x}\sum_{y} g_\tau(x, y)\, x^p y^q. \qquad (1)$$

Then, a centroid $(a_\tau, b_\tau)$ is obtained by

$$(a_\tau, b_\tau) = \left( \frac{m_{1,0}}{m_{0,0}},\ \frac{m_{0,1}}{m_{0,0}} \right). \qquad (2)$$

Next, consider the binary image $g_\tau(x, y)$ in a polar coordinate system $P(r, \theta)$, the origin of which is located at $(a_\tau, b_\tau)$. Now, divide the radial pattern of the figure (radius $= R$) at the angle $\theta$ into $\kappa$ partitions. Each of these partitions is called a kernel. Then, the figure feature $s_\varepsilon(\theta)$ $(\varepsilon = 1, 2, \ldots, \kappa)$ at the angle $\theta$ is defined by

$$s_\varepsilon(\theta) = R\, \frac{\sum_{r} P(r, \theta)\, \exp\{-a (r - \nu)^2\}}{\sum_{r} P(r, \theta)}, \qquad (3)$$

where $a$ is a gradient coefficient which determines the uniqueness of the figure feature and $\nu$ is a phase term, an offset value determined in accordance with $\varepsilon$. A detailed description of how to determine $a$ is given in the appendix. The figure feature $s_\varepsilon(\theta)$ at kernel $\varepsilon$ is obtained by applying (3) to the binary image in an omni-directional manner at an arbitrary angular resolution. After computing the GDF, the Fast Fourier Transform (FFT) is applied to the figure feature $s_\varepsilon(\theta)$ at each kernel to obtain the power spectrum $P_\varepsilon(\omega)$.

The following procedure is applied to each dynamic region of interest. First, by normalizing $s_\varepsilon(\theta)$, a figure feature that depends on the rotation, but not on the scale, of the object is obtained; without this normalization, a figure feature that depends on both the rotation and the scale of the object is obtained. Second, by normalizing $P_\varepsilon(\omega)$ and exploiting the shift-invariant nature of the Fourier transform [45], a figure feature that depends on neither the rotation nor the scale of the object is obtained; without this normalization, a figure feature that depends on the scale, but not on the rotation, of the object is obtained. Through this simple procedure, figure features corresponding to 32 kinds of visual interest points can be obtained.

3.3 Feature-Based Learning

A set of figure features and its corresponding frame number are stored together as members of a reference data set. We hereinafter refer to the frame number in a reference data set as the gestural phase number. A feature vector, which is stored in the reference data set having the number $g$ and visual interest point number $l$, is represented as follows:

$$\mathbf{K}_l^{(g)} = (K_1, K_2, \ldots, K_\theta, \ldots, K_\Theta)^t, \qquad (4)$$

where

$$K_\theta = (s_1(\theta), s_2(\theta), \ldots, s_\kappa(\theta))^t, \qquad \theta = \frac{2\pi}{\Theta}, \frac{4\pi}{\Theta}, \ldots, 2\pi. \qquad (5)$$

Through the decomposition of a gestural image into smaller fragments, the allocation of weights to each visual interest point becomes possible at the protocol-learning stage.
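To make the per-frame feature extraction of (1)-(3) concrete, the following is a minimal NumPy sketch for the single-kernel case ($\kappa = 1$) used in the experiments of Section 5. The ray-sampling scheme, the normalization of the radius, and the handling of empty rays are illustrative assumptions rather than details taken from the paper; the resulting vector of $s(\theta)$ values is what would be stored as one entry of the reference data set in (4)-(5).

```python
# Minimal sketch of the Gaussian Density Feature of (1)-(3) for kappa = 1.
# The ray sampling and radial normalisation below are assumptions made for
# illustration only.
import numpy as np

def gaussian_density_feature(binary_img, a=5.0, nu=0.0, n_angles=64):
    """Return s(theta) sampled at n_angles angles around the region centroid."""
    g = (binary_img > 0).astype(np.float64)
    ys, xs = np.nonzero(g)
    if xs.size == 0:
        return np.zeros(n_angles)
    cx, cy = xs.mean(), ys.mean()                 # centroid (a_tau, b_tau), (1)-(2)
    R = np.hypot(xs - cx, ys - cy).max() + 1e-9   # radius of the radial pattern
    s = np.zeros(n_angles)
    for t, theta in enumerate(np.linspace(0.0, 2.0 * np.pi, n_angles, endpoint=False)):
        r = np.arange(0.0, R + 1.0)               # sample P(r, theta) along the ray
        px = np.clip(np.round(cx + r * np.cos(theta)).astype(int), 0, g.shape[1] - 1)
        py = np.clip(np.round(cy + r * np.sin(theta)).astype(int), 0, g.shape[0] - 1)
        P = g[py, px]
        if P.sum() == 0.0:
            continue
        rn = r / R                                # radius normalised to [0, 1] (assumption)
        s[t] = R * np.sum(P * np.exp(-a * (rn - nu) ** 2)) / P.sum()   # (3)
    return s
```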

3.4 Feature-Based Matching

Let the feature vector of visual interest point $l$ on the current image be

$$\mathbf{T}_l = (T_1, T_2, \ldots, T_\theta, \ldots, T_\Theta)^t, \qquad (6)$$

where

$$T_\theta = (s_1(\theta), s_2(\theta), \ldots, s_\kappa(\theta))^t, \qquad \theta = \frac{2\pi}{\Theta}, \frac{4\pi}{\Theta}, \ldots, 2\pi. \qquad (7)$$

The distance between $\mathbf{T}_l$ and reference data set $\mathbf{K}_l^{(g)}$ at the recognition unit is defined by

$$d_l^{(g)} = \sum_{\theta=1}^{\Theta} \big\| T_\theta - K_\theta \big\|, \qquad (8)$$

where $\Theta$ is the angular resolution and $g$ refers to an arbitrary reference data set number.

Let the reference data set number of gesture class $i$ be $k_i$, which minimizes $d_l^{(g)}$. The similarity $S_{lf}^{(i)}$ of the input image of frame number $f$ to gesture class $i$ at visual interest point $l$ can then be defined by

$$S_{lf}^{(i)} = 1 - \frac{d_l^{(k_i)}}{\mathrm{Max}\big(d_l^{(g)}\big)}. \qquad (9)$$

Here, $\mathrm{Max}(\cdot)$ is a function which returns the maximum value. The similarity $S_{lf}^{(i)}$ $(0 \le S_{lf}^{(i)} \le 1)$ obtained by (9) and the phase number stored in the reference data set $k_i$ are input to the protocol-based learning/recognition stage. The matching results form multidimensional patterns of similarity at each visual interest point, $\mathbf{X}_l^{(i)} = (S_{l1}^{(i)}, S_{l2}^{(i)}, \ldots, S_{lK}^{(i)})^t$, which we refer to as an activation map. Since the distances among feature vectors are based on different measures, such as scale and rotation, the matching measures are made equivalent by normalizing the output distances among visual interest points.
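As an illustration of (8)-(9), the following sketch computes the similarity of the current feature vector of one visual interest point against all stored reference vectors of a gesture class; the array shapes and variable names are assumptions made for this example. Stacking the resulting similarities frame by frame yields the activation map used in the next subsection.

```python
# Minimal sketch of feature-based matching, (8)-(9).  Shapes are assumptions:
# one feature vector per angle theta and kernel, G reference vectors per class.
import numpy as np

def similarity_to_class(T, refs):
    """T: (Theta, kappa) feature vector of one visual interest point on the
    current frame.  refs: (G, Theta, kappa) stored reference vectors of one
    gesture class.  Returns (similarity in [0, 1], index of the best reference)."""
    d = np.linalg.norm(refs - T[None, ...], axis=2).sum(axis=1)   # d_l^(g), (8)
    k_i = int(np.argmin(d))
    S = 1.0 - d[k_i] / max(float(d.max()), 1e-12)                 # similarity, (9)
    return S, k_i
```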

3.5 Gesture Protocol-Based Learning and Recognition

In feature-based matching, spatial similarity is taken into account; however, the fluctuation of similarity in the time direction is not considered. Visual interest points that return stable similarity in the time direction can be important clues in learning a gesture. When the activation map of visual interest point $l$ in gesture class $i$ is given by

$$\mathbf{X}_l^{(i)} = (S_{l1}^{(i)}, S_{l2}^{(i)}, \ldots, S_{lK}^{(i)})^t, \qquad (10)$$

the average $\mu_l^{(i)}$ and the variance $\sigma_l^{(i)}$ of $\mathbf{X}_l^{(i)}$ are given by the following equations:

$$\mu_l^{(i)} = \frac{1}{K} \sum_{k=1}^{K} S_{lk}^{(i)}, \qquad (11)$$

$$\sigma_l^{(i)} = \frac{1}{K} \sum_{k=1}^{K} \big(S_{lk}^{(i)} - \mu_l^{(i)}\big)\big(S_{lk}^{(i)} - \mu_l^{(i)}\big)^t. \qquad (12)$$

Here, the weights $\omega_l^{(i)}$ for the visual interest points are defined by

$$\omega_l^{(i)} = \frac{\mu_l^{(i)}}{\sigma_l^{(i)} + \beta}, \qquad (13)$$

where $\beta$ is an emphasis coefficient. As $\beta$ becomes smaller, the influence of the variance $\sigma_l^{(i)}$ on $\omega_l^{(i)}$ becomes greater. When the similarity in the time direction becomes more stable, that is, when $\sigma_l^{(i)}$ becomes smaller, $\omega_l^{(i)}$ becomes larger. In protocol-based learning, the activation map $\mathbf{X}_l^{(i)}$ is stored as protocol data:

$$\mathbf{M}_l^{(i)} = (M_{l1}^{(i)}, M_{l2}^{(i)}, \ldots, M_{lM}^{(i)})^t. \qquad (14)$$

In addition, the weight $\omega_l^{(i)}$ for each visual interest point is calculated and stored.

In protocol-based recognition, the probability distribution of each element of the protocol data $\mathbf{M}_l^{(i)}$ is assumed to follow a Gaussian distribution. The estimation value $E_f^{(i)}$ of a gesture image of frame number $f$ with activation map $\mathbf{X}_l^{(s)}$ is given by

$$E_f^{(i)} = \sum_{l=1}^{L} \sum_{m=1}^{M} \exp\Big\{-\gamma \big(S_{lf}^{(s)} - M_{lm}^{(i)}\big)^2\Big\}, \qquad (15)$$

where $\gamma$ is a separation coefficient. The influence among similarity points becomes smaller as $\gamma$ becomes larger. Assume now that the likelihood distribution given by (15) is a protocol map for visual interest point $l$. Protocol maps are generated for each gesture class according to (16) immediately after $\mathbf{M}_l^{(i)}$ is fixed, that is, after the protocol-learning has been completed. For the purpose of quick acquisition of the estimation value, reference tables are generated in advance:

$$I_{l,\xi}^{(i)} = \sum_{m=1}^{M} \exp\Big\{-\gamma \big(\xi - M_{lm}^{(i)}\big)^2\Big\}, \qquad (16)$$

where $\xi$ is a quantized similarity value. Since protocol maps for each gesture class are created independently, there is no theoretical limitation on the number of target gesture classes. For a gestural image sequence which is made up of $N$ [frames], the cumulative estimation value $E^{(i)}$ is defined as follows:

$$E^{(i)} = W_i \sum_{f=1}^{N} E_f^{(i)}, \qquad (17)$$

where $W_i$ is the weight for each gesture class and is defined by (18). Here, $L$ is the number of all interest points:

$$W_i = \frac{1}{L} \sum_{l=1}^{L} \omega_l^{(i)}. \qquad (18)$$

As a result of protocol-learning, a higher class weight $W_i$, defined by (18), is given to gesture classes whose visual interest points are of greater importance. Finally, the input image is judged to belong to gesture class $C$, for which the cumulative estimation value $E^{(i)}$ becomes the maximum. Here, $E^{(i)}$ is obtained by accumulating the estimation value $E_f^{(i)}$ from the start of the action.
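The learning and recognition steps of (11)-(18) can be summarized in a short sketch. The listing below assumes the activation maps are available as NumPy arrays, follows the reconstruction of the class weight in (18) given above (which should be treated as an assumption), and omits the reference tables of (16); all variable names are illustrative.

```python
# Minimal sketch of protocol-based learning and recognition for one gesture
# class, following (11)-(18).  beta and gamma correspond to the emphasis and
# separation coefficients used in the experiments.
import numpy as np

def learn_protocol(activation_map, beta=0.1):
    """activation_map: (L, K) array of similarities S_lk for one class."""
    M = activation_map.copy()          # protocol data, (14)
    mu = M.mean(axis=1)                # (11)
    var = M.var(axis=1)                # (12)
    w = mu / (var + beta)              # (13): temporally stable points get large weights
    W = w.mean()                       # class weight, cf. (18) (assumed form)
    return M, w, W

def frame_estimate(S_frame, M, W, gamma=2000.0):
    """S_frame: (L,) similarities of the current frame; returns W * E_f, cf. (15), (17)."""
    E_f = np.exp(-gamma * (S_frame[:, None] - M) ** 2).sum()
    return W * E_f

# Recognition accumulates frame_estimate() over the frames of the action for
# every learned class and selects the class with the maximum cumulative value, (17).
```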

3.6 Quantification of Gestural Parameters

After the class estimation of the input image, quantification of gestural parameters is performed. The gestural phase number, relative direction, relative speed, and relative width are calculated based on the reference data set and the series of frame numbers in the activation map $\mathbf{X}_l^{(C)}$. Let the temporal sequence of phase numbers stored in the activation map of visual interest point $l$ in gesture class $C$ be $\mathbf{P}_l^{(C)} = (P_{l1}^{(C)}, P_{l2}^{(C)}, \ldots, P_{lK}^{(C)})^t$. The phase number $P'_k$ of the input image at frame number $k$ is estimated as follows:

$$P'_k = \frac{\sum_{l=1}^{L} P_{lk}^{(C)} Z_{lk}}{\sum_{l=1}^{L} Z_{lk}}. \qquad (19)$$

Here, $Z_{lk}$ is the estimation weight of visual interest point $l$ at frame number $k$ and is defined as follows:

$$Z_{lk} = \sum_{m=1}^{M} \exp\Big\{-\gamma \big(S_{lk}^{(C)} - M_{lm}^{(C)}\big)^2\Big\}, \qquad (20)$$

where $\gamma$ is the separation coefficient noted in the previous section. $P_k$ is the gestural phase number of the input image, which is obtained by applying a moving average function to $P'_k$ with a filter length of $U$, according to

$$P_k = \frac{1}{U}\sum_{u=0}^{U-1} P'_{k-u}. \qquad (21)$$

Gestural speed $s$ and gestural width $w$ are computed by the following simple calculations, where the gestural width $w$ indicates the estimated relative range of movement compared with the learned image sequence:

$$s = P_k - P_{k-\Delta}, \qquad (22)$$

$$w = \frac{\mathrm{Max}(P_k) - \mathrm{Min}(P_k)}{F_{\max}(C)} \times 100\ (\%). \qquad (23)$$

Here, $\mathrm{Max}(\cdot)$ is a function that returns the maximum value and $\mathrm{Min}(\cdot)$ is a function that returns the minimum value. $\Delta$ in (22) refers to the number of frames that corresponds to the minute time width, and $F_{\max}(C)$ in (23) refers to the number of frames in the standard image sequence of gesture class $C$. As shown in (22), $s$ is positive when the gesture is performed in the same direction as the original movement and negative when the gesture is performed in the opposite direction from the original movement; otherwise, $s$ is zero, assuming the direction of the original movement to be positive.
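The following sketch turns (19)-(23) into code. It assumes the per-frame phase numbers and estimation weights are already available as arrays P[l, k] and Z[l, k]; the boundary handling of the moving average and the use of the final frame for the speed are illustrative choices, not specified by the paper.

```python
# Minimal sketch of the gestural-parameter quantification of (19)-(23).
import numpy as np

def gestural_parameters(P, Z, F_max, U=3, delta=5):
    """P, Z: (L, K) phase numbers and estimation weights; returns (P_k, s, w)."""
    P0 = (P * Z).sum(axis=0) / np.maximum(Z.sum(axis=0), 1e-12)    # weighted phase, (19)
    kernel = np.ones(U) / U
    Pk = np.convolve(P0, kernel, mode="full")[: len(P0)]           # trailing moving average, (21)
    s = Pk[-1] - Pk[-1 - delta] if len(Pk) > delta else 0.0        # signed relative speed, (22)
    w = (Pk.max() - Pk.min()) / F_max * 100.0                      # relative width in percent, (23)
    return Pk, s, w
```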

4 SELECTIVE CONTROL OF VISUAL INTEREST POINTS

4.1 Overview of the Target Problem and the Proposed Approach

When both an application system and a recognition system are implemented as UNIX processes, the prediction of the overall processing load is extremely difficult. The response of the overall processing load can be stabilized by introducing a feedback control that dynamically selects the magnitude of the processing load. A block diagram of the proposed feedback control system is shown in Fig. 2.

Fig. 2. Block diagram of the feedback control.

In this paper, the selective control is handled as a problem of feedback control which consists of three control inputs and one controlled variable. Here, the controlled variable is the processing frame rate $x$ [fps] and $v$ [fps] is the target frame rate. The control inputs are the pattern scanning interval $S_k$, the pattern matching interval $RS_k$, and the number of effective visual interest points $N_{vip}$. Here, $S_k$ refers to the jump interval in scanning the feature image and $RS_k$ refers to the loop interval in matching the current feature vector with the feature vectors in the reference data set. Finally, $N_{vip}$ indicates the number of effective visual interest points.

Basically, we can assume three types of control targets: 1) the load of the recognition module, $D_{t1}(t)$, 2) the load of the interprocess communication with the application system, $D_{t2}(t)$, and 3) the load of the application system, $D_{t3}(t)$. In this paper, we try to stabilize the processing frame rate of the overall system to the target by controlling $D_{t1}(t)$. Unpredictable loads and the effects of third-party processes are comprehensively included in $D_{t3}(t)$, since the proposed method presupposes operation in a multiuser and multiprocess environment.

The frame rate detector provides the visual interest point controller with the current frame rate. Subsequent control inputs are determined after making corresponding changes to the control inputs and detecting the minute change in the processing time. The feedback control is continually applied until the control deviation $e$ falls within the minimal error range. Here, the control index $S$ is defined by (24) in order to represent the state of control using a single index:

$$S = LN(L - S_k) + N(L - RS_k) + N_{vip}. \qquad (24)$$

Here, $N$ refers to the maximum number of visual interest points and $L$ refers to the number of control levels in the selective control. Although the control index $S$ is introduced in order to indicate the state of the selective control, it can basically be said that the smaller the control index $S$ becomes, the shorter the time for the recognition process becomes. We take a sequential approach here because it is inherently difficult to predict the actual processing load, which can vary depending on the duration and area of the target gestures and on the performance of the computing machines. A simplified diagram of the visual interest point controller is shown in Fig. 3.

Fig. 3. Simplified diagram of the visual interest point controller.

In the following sections, the details of each of the control inputs are presented in a bottom-up manner: (A) Control of Pattern Scanning Interval, (B) Control of Pattern Matching Interval, and (C) Control of Visual Interest Points. Each control input could be changed independently, but this would require precise measurement and prediction of the actual processing load and would make the control procedure unnecessarily complex. In order to simplify the control schedule, we assume the priority of each control to be (C) > (B) > (A). Therefore, the proposed method sequentially performs selection in the order (C) → (B) → (A). When the desired processing frame rate is satisfied, the selective control is suspended.
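A minimal sketch of this sequential selection is given below. It assumes a hypothetical callback measure_fps that runs the recognizer briefly with the given control inputs and reports the resulting frame rate; the step sizes, the tolerance, and the restriction to load shedding are illustrative simplifications (restoring interest points when the rate exceeds the target is driven by the control gain of Section 4.4). With N = 32 and L = 3, the initial state S_k = RS_k = 1, N_vip = 32 gives S = 288, which matches the initial control index reported in Section 5.2.2.

```python
# Minimal sketch of the selective-control loop: the control inputs are relaxed
# in the priority order (C) -> (B) -> (A) until the measured frame rate reaches
# the target v.  measure_fps is a hypothetical callback; step sizes and the
# tolerance are assumptions.
def control_index(S_k, RS_k, N_vip, N=32, L=3):
    return L * N * (L - S_k) + N * (L - RS_k) + N_vip      # control index S, (24)

def select_controls(measure_fps, v=30.0, N=32, L=3, tol=1.0):
    S_k, RS_k, N_vip = 1, 1, N                             # full processing, S = 288
    while measure_fps(S_k, RS_k, N_vip) < v - tol:         # too slow: shed load
        if N_vip > 1:                                      # (C) drop low-weight interest points first
            N_vip -= 1
        elif RS_k < 3:                                     # (B) then skip reference data in matching
            RS_k += 1
        elif S_k < 3:                                      # (A) finally coarsen the radial scan
            S_k += 1
        else:
            break                                          # nothing left to relax
    return S_k, RS_k, N_vip, control_index(S_k, RS_k, N_vip, N, L)
```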

4.2 Control of Pattern Scanning Interval

In order to generate a wide variety of visual interest points, numerous dynamic regions of interest are required. As the number of regions increases, the cost of feature extraction also increases. Selective control of the pattern scanning interval attempts to satisfy the required processing speed by reducing the amount of computation in (3). Here, the angular resolution, which plays an important role in representing shapes, is left unchanged, and the amount of computation is reduced by increasing the scanning interval $S_k$ in the radial direction. In order to avoid unfavorable degradation of the recognition accuracy, $S_k$ is restricted to 3 at the maximum, because we have experimentally confirmed that values of $S_k$ greater than 3 do not significantly improve the processing speed.

4.3 Control of Pattern Matching Interval

In order to reduce the amount of computation in (8), the selective control of the pattern matching interval $RS_k$ skips feature vectors in the reference data set during the feature-based
matching. This could degrade the accuracy of the phase number estimation because the phase numbers of the skipped reference feature vectors are ignored. The smaller the number of effective visual interest points, the greater the influence of an increase in $RS_k$. In order to minimize this influence, the pattern matching interval $RS_k$ is restricted to a maximum of 3, because we have experimentally confirmed that values of $RS_k$ greater than 3 do not significantly improve the processing speed.

4.4 Control of Visual Interest Points

As the most powerful means of improving the processing speed, the number of visual interest points is reduced in ascending order of the weights given by (13). After the reduction, the generation of the now-unused feature images is suspended. Although the recognition accuracy may suffer due to the reduction of visual interest points, top priority is given to satisfying the real-time requirement in this paper. Here, the control gain is given by (25):

$$G = \alpha[n]\, \frac{X[n] - v}{X[n] - X[n-1]}, \qquad (25)$$

where $G$ is the control gain and $v$ is the target frame rate [fps];

$$X[n] = \frac{1}{2K + 1} \sum_{m=n-K}^{n+K} x[m], \qquad (26)$$

where $X[n]$ is the moving average of the processing frame rate [fps] at frame number $n$, $x[n]$ is the processing frame rate [fps] at frame number $n$, and $K$ is half the length of the moving average;

$$\alpha[n] = 1 - \frac{1}{n\,\{15r + 13\exp(-7r^2) - 5\}}, \qquad (27)$$

where $\alpha[n]$ is the adjusting function for the control gain; and

$$r = \frac{n}{2K}, \qquad (28)$$

where $r$ is the uprising coefficient ($r \le 1.0$).

The control gain $G$ increases as the difference between $X[n]$ and $v$ increases or as the difference between $X[n]$ and $X[n-1]$ decreases. Here, the value of $G$ is restricted to being less than $N$. The control input $N_{vip}$, the range of which is $[1, 32]$, is set in proportion to the gain $G$. A larger $N_{vip}$ causes the processing speed to be slower, but the robustness of recognition increases. In order to decrease the chance of entering an oscillation state, (27) is designed to gradually increase the influence of the control gain $G$ at the initial phase of the selective control.
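The gain computation of (25)-(28) can be sketched as follows; the reconstruction of the adjusting function $\alpha[n]$ above, the guard against a vanishing denominator, and the clipping of the gain are assumptions made for illustration.

```python
# Minimal sketch of the control-gain computation of (25)-(28).  x is the
# history of per-frame processing rates [fps]; n is the current frame index
# counted from the start of selective control.  The zero-division guard and
# the symmetric clipping of G are assumptions.
import numpy as np

def control_gain(x, n, v=30.0, K=5, N=32):
    X = lambda m: float(np.mean(x[max(0, m - K): m + K + 1]))   # moving average X[m], (26)
    n = max(n, 1)
    r = min(n / (2.0 * K), 1.0)                                  # uprising coefficient, (28)
    alpha = 1.0 - 1.0 / (n * (15.0 * r + 13.0 * np.exp(-7.0 * r ** 2) - 5.0))  # (27), as reconstructed
    dX = X(n) - X(n - 1)
    G = alpha * (X(n) - v) / (dX if abs(dX) > 1e-9 else 1e-9)    # control gain, (25)
    return float(np.clip(G, -(N - 1), N - 1))                    # |G| is kept below N

# N_vip is then chosen in proportion to G and clipped to [1, 32] before the
# lowest-weight visual interest points are switched off.
```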

4.5 Dynamic Reconstruction of the Recognition System

Regarding the visual interest points which are disabled by the selection process, a further improvement in processing speed can be expected by suspending their related processes. The internal structure graph of the recognition system, from the acquisition of an input image sequence to the extraction of figure features for each visual interest point, is shown in Fig. 4.

Fig. 4. Internal structure graph of the recognition system.

As shown in Table 1, the early-vision-related processing consists of four layers of processing targets.

TABLE 1. Processing Target and Number of Nodes for Each Processing Layer

The ON/OFF control of the related processing is performed according to the following rules:

1. When all of the dynamic regions of interest are turned off, the generation of the corresponding feature image is suspended.
2. The pattern processing at a dynamic region of interest for which all of the visual interest points in layer 4 are turned off is suspended.
3. The normalization processes that become unnecessary because of visual interest points being turned off are suspended.

The following is the procedure for implementing the above rules. Set the value of an arbitrary node $N_{4,i}$ $(1 \le i \le 32)$ in layer 4 to 1 if the corresponding visual interest point is effective; otherwise, set the value to 0. Define the value of an arbitrary node $N_{3,j}$ $(1 \le j \le 8)$ in layer 3 according to

$$N_{3,j} = \sum_{i\,:\,\text{with}\ N_{3,j}} N_{4,i}. \qquad (29)$$

Similarly, define the value of an arbitrary node $N_{2,k}$ $(1 \le k \le 3)$ in layer 2 according to

$$N_{2,k} = \sum_{j\,:\,\text{with}\ N_{2,k}} N_{3,j}. \qquad (30)$$

Finally, define the value of the top node $N_{1,1}$ in layer 1 according to

$$N_{1,1} = \sum_{k\,:\,\text{with}\ N_{1,1}} N_{2,k}. \qquad (31)$$

When all of the visual interest points are in operation, the value of the top node $N_{1,1}$ sums to 32. The value of each node, obtained by the above method, represents the degree of necessity of the corresponding processing. When the value of a node is set to zero, its corresponding processing is suspended. Unless the values of all nodes are set to zero, at least one node is guaranteed to be effective. The proposed approach requires setting the ON/OFF state of the nodes in only the bottom layer; the state of the upper layers is automatically determined in conjunction with the state of the bottom layer. The entire framework of the proposed approach is shown in Fig. 5.
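The bottom-up propagation of (29)-(31) is illustrated below. The node counts per layer (32, 8, 3, 1) follow Table 1, but the specific fan-in of each node is not given in the text, so the groupings used here are assumptions.

```python
# Minimal sketch of the bottom-up ON/OFF propagation of (29)-(31).  The fixed
# 4-to-1 and mixed fan-ins are illustrative assumptions about the actual graph.
def propagate(vip_on):
    """vip_on: list of 32 booleans, one per visual interest point (layer 4).
    Returns the node values of layers 3, 2, and 1; a value of 0 means the
    corresponding processing is suspended."""
    layer4 = [1 if on else 0 for on in vip_on]
    # Layer 3: each normalisation node gathers four interest points, (29).
    layer3 = [sum(layer4[4 * j: 4 * j + 4]) for j in range(8)]
    # Layer 2: each feature-image node gathers a subset of layer-3 nodes, (30).
    groups = ([0, 1, 2], [3, 4, 5], [6, 7])
    layer2 = [sum(layer3[j] for j in g) for g in groups]
    # Layer 1: the image-acquisition node sums everything and reaches 32
    # when all visual interest points are in operation, (31).
    layer1 = sum(layer2)
    return layer3, layer2, layer1
```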

5 EXPERIMENTS

5.1 Experiments Using the Proposed Approach

The proposed approach was implemented as a gesture interface system [46], [47] on a workstation (SGI Indigo2) and evaluation experiments were conducted. Images (160 by 120 [pixels]) were captured through the image input device, sent to the workstation, and recognized online. The experiments were performed in a normal laboratory setting without special-purpose hardware, special illumination, or a uniform background. In this experiment, the following parameters were determined empirically: gradient coefficient $a = 5.0$, number of kernels $\kappa = 1$, phase term $\nu = 0$, emphasis coefficient $\beta = 0.1$, separation coefficient $\gamma = 2,000$, length of moving average
filter $U = 3$, and number of frames for the minute time interval $\Delta = 5$.

Fig. 5. Proposed framework for the learning and selective control of visual interest points.

TABLE 2. Seven Types of Gestures Adopted for Use in the Evaluation Experiments

5.1.1 Learning of Gesture Protocols

In order to evaluate the dependency on clothing type, a recognition experiment was conducted using the seven gestures listed in Table 2, performed by the same subject while wearing six different types of clothing, denoted as types A through F and shown in Fig. 6. After storing a reference data set determined from standard samples with clothing type A, protocol-learning was performed under two conditions: (Condition A) using standard gesture samples and (Condition B) using similar gesture samples. The average recognition rates obtained under these conditions for all of the similar gesture samples (10 samples for each gesture and each clothing type, 420 samples in total) are shown in Fig. 7a for
clothing type (A, B, C, D, E, and F) and in Fig. 7b for gesture type (G-A, G-B, G-C, G-D, G-E, G-F, and G-G).

Fig. 6. Six types of clothing used in the experiment.

Fig. 7. Recognition results under various conditions. (a) Average recognition rate for clothing types A through F. (b) Average recognition rate for gestures A through G. (c) Recognition results for subjects A through K.

As shown in Fig. 7a, the average recognition rate for each clothing type improved by as much as 30 percent as a whole (from 54 percent to 84 percent) through the protocol-learning of similar gesture samples (Condition B). Moreover, judging from Fig. 7b and from the fact that the variance of the average recognition rate is 42.48 in Condition A but 15.63 in Condition B, the dependency on gestures has been alleviated significantly. In particular, the improvement for gestures (G-A), (G-B), and (G-G) is remarkable. These results suggest that the proposed techniques realize robust recognition with respect to both clothing and gesture types.

After storing the reference data set using the standard samples for clothing type A, protocol-learning was conducted under the following conditions: (Condition A) using only standard gesture samples for clothing type A and (Condition B) using all similar gesture samples for clothing types A through F. Next, a recognition experiment on 11 subjects was conducted using seven gesture samples that were not used in the learning phase. Photographs of the 11 subjects are shown in Fig. 8 and the average recognition rates for subjects A through K are shown in Fig. 7c. By using similar gesture samples in protocol-learning (Condition B), the average recognition rate for each subject in Fig. 7c was improved by approximately 27 percent on the whole (from 41 percent to 68 percent). These results indicate that robust recognition with respect to the extent of motion trajectories and individual differences in motion characteristics was realized by the protocol-learning of similar gesture samples.

Fig. 8. Photographs of the 11 subjects who participated in the experiment.

In this experiment, gesture test samples from the 11 subjects (77 samples in all) were evoked and collected by uttering a keyword, for instance, "Please make a gesture that would instantly bring to mind 'good-bye.'" The average recognition rate for all samples was 68 percent. This validates the hypothesis on gesture protocols that the gestural actions of different subjects have some common traits.

Next, some of the results obtained by allowing the system to learn the gesture "good-bye," or waving one hand, are described in order to demonstrate the stability of our protocol-learning. The following is the procedure for this experiment:

1. The reference data set is acquired by waving the right hand once from side to side.
2. After storing the reference data set, protocol-learning is conducted by alternately waving the right hand and the left hand.
3. The difference in the estimates between the right hand and the left hand is recorded for each test sample. The gesture class weight given by (18) is also recorded as an index of learning efficiency.

The fluctuation curves of the difference between the two estimates and of the gesture class weights are shown in Fig. 9, in which the results of the same experiment using the reference data set for the left hand only are superimposed.

Fig. 9. Convergence evaluation results for protocol-learning. (a) Fluctuation curves of the difference between the two estimates. (b) Fluctuation curves of the weights $W_i$ of two gesture classes.

Fig. 9 shows that, after a couple of instances of learning, the difference between the two estimates, for either the right hand or the left hand, drops sharply and becomes stable. Similarly, the gesture class weight $W_i$ becomes stable after a sharp increase. These results indicate that similar learning curves can be obtained when reference data sets are made using either the right hand or the left hand and that a small number of training samples is sufficient for protocol-learning.

5.1.2 Real-Time Estimation of Gestural Parameters

Gestural parameters obtained by repeatedly capturing the gesture "good-bye" in front of a CCD camera are shown as follows: 1) the gestural phase number in Fig. 10a, 2) the gestural speed in Fig. 10b, and 3) the gestural width in Fig. 10c. By applying QVIPS, it is possible to estimate the gestural parameters frame by frame and in real time.

Fig. 10. Estimated results for gestural parameters. (a) Phase estimation results for "good-bye." (b) Speed estimation results for "good-bye." (c) Width estimation results for "good-bye."

5.1.3 Comparison of Processing Speed

The response curves of the system's frame rate as the size of the reference data set is gradually increased were measured for the case in which feedback control is applied and for the case in which it is not. The target frame rate $v$ [fps] for the gesture interface system was set to 30 [fps]. Fig. 11 indicates that, in both cases, real-time recognition is possible without any special-purpose hardware. However, for the case without feedback control (FRate1), the system's frame rate falls in inverse proportion to the size of the reference data set. On the other hand, the system's frame rate of 30 [fps] is maintained constantly for the case with feedback control (FRate2). These results clearly indicate that the proposed method successfully handles the slowdown in processing speed caused by the increase in the size of the reference data set.

Fig. 11. Response curves of the frame rate of the system for a larger reference data set.

5.2 Example Application: Gesture Video System

5.2.1 Block Diagram of the System

In this section, we present the results obtained by connecting our gesture interface system to a virtual reality application. As an application example, we use a newly developed gesture video system. By presenting an image synchronized with the motion of the user, the gesture video system realizes an intuitive interaction between the user and image objects. This is appropriate as an application example because the system is required to satisfy the full video-rate requirement in order to generate a sufficient sense of reality.

The negotiation procedure for connecting our gesture interface system to the application system is shown in Fig. 12.

Fig. 12. Procedure for the connection negotiation.

A block diagram of the gesture video system is shown in Fig. 13.

Fig. 13. Block diagram of the gesture video system.

As shown in Fig. 13, the proposed system consists of three important subsystems: a gesture interface system, a gesture video system, and an image editor system. The transmission of gestural parameters and control signals between the gesture interface and the application system is done bidirectionally through interprocess communication (TCP/IP). The principal load in the gesture video system is the presentation of a video image to a specified window, and its fluctuation is usually small.
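To make the interprocess communication concrete, the following sketch shows one way the gesture interface could push its estimated parameters to the application over a TCP connection. The host, port, and message format are hypothetical; the paper only states that gestural parameters and control signals travel bidirectionally over TCP/IP and does not specify the wire protocol.

```python
# Hypothetical sketch of sending gestural parameters to the application side
# over TCP/IP.  The endpoint and the JSON message layout are assumptions.
import json
import socket

def send_parameters(sock: socket.socket, gesture_class: int,
                    phase: float, speed: float, width: float) -> None:
    msg = json.dumps({"class": gesture_class, "phase": phase,
                      "speed": speed, "width": width}) + "\n"
    sock.sendall(msg.encode("ascii"))

# Usage (hypothetical endpoint):
#   with socket.create_connection(("localhost", 5000)) as s:
#       send_parameters(s, gesture_class=2, phase=12.4, speed=0.8, width=75.0)
```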

5.2.2 Real-Time Interaction with Movie Clips

In order to increase the availability of our gesture interface system, the proposed methods were also implemented on a laptop personal computer (Pentium MMX 266 MHz, OS: Linux) and evaluation experiments were conducted. A video clip is available on the Computer Society's Digital Library at http://www.computer.org/publications/dlib. Images captured by a CCD camera at a resolution of 80 by 60 [pixels] were sent through the image input device (IBM Smart Capture Card I) and online recognition was performed. The experiments were conducted in a normal laboratory setting; no special-purpose hardware, special illumination, or special background was used. In this experiment, the following parameters were determined empirically: gradient coefficient $a = 5.0$, number of kernels $\kappa = 1$, phase term $\nu = 0$, emphasis coefficient $\beta = 0.1$, separation coefficient $\gamma = 500$, length of moving average filter $U = 4$, number of frames for the minute time interval $\Delta = 5$, angular resolution $\Theta = 64$, maximum number of visual interest points $N_{MAX} = 32$, initial number of effective visual interest points $N_{vip} = 32$, initial interval for pattern scanning $S_k = 1$, initial interval for pattern matching $RS_k = 1$, and initial control index $S = 288$.

The connection experiment was conducted according to the following procedure:

1. The connection between the gesture interface and the gesture video system was established according to the procedure shown in Fig. 12.
2. The target frame rate on the side of the gesture interface system was manually set to 30 [fps].
3. The user allows the system to learn his/her original gesture depending on the content of the video clip/stream and then starts the interaction. As an example of gestural actions, an object-viewing gesture of leaning one's head from side to side was used in this experiment. This enables historical relics to be viewed from arbitrary angles synchronously with the user's posture.
4. Selective control was started after repeating the gestural action several times. During the interaction, the frame rate $x$ [fps] and the control index $S$ were recorded using the gesture interface system.

An example screenshot of the gesture video system is shown in Fig. 14.

Fig. 14. Example screenshot of the gesture video system (the window at the top-left is the control panel of the gesture interface system and that at the bottom-center is the picture icon panel; in the center of the screen, a historical relic is displayed).

The response of the frame rate of the system obtained by the above procedure is shown in Fig. 15a, and the response of the control index $S$ is shown in Fig. 15b. After frame number 5,120, at which the selective control is started, the processing speed of the system gradually approaches the target frame rate of 30 [fps], and the target frame rate is maintained after frame number 5,260. These results clearly demonstrate that the proposed approach can handle the problem of the slower and unstable processing speeds incurred by the connection to the application system.

Fig. 15. Response of the proposed system. (a) System’s frame rate. (b) Control index S.

6 DISCUSSION

6.1 Learning of Gesture Protocols

6.1.1 Dependencies on Clothing and Gesture Types

By applying protocol-learning to samples of an arbitrary person's similar gestures, the recognition rate with respect to the six types of clothing was improved by as much as 30 percent on average, and improvements were found for all six types of clothing. This indicates that protocol-learning enables robust recognition with respect to different types of clothing. In particular, the difficulty the system had in recognizing gestures by a person wearing a black jersey was notably alleviated. In addition, remarkable improvements in the recognition rates for the gestures "good-bye," "guppa," and "mouse-operation" were confirmed. This suggests that protocol-learning can alleviate the clothing-dependency and gesture-dependency problems. Good learning performance was obtained despite the small number of training samples, and there was no great difference between the learning curves for the two similar gestures. These results demonstrate the flexibility and stability of the proposed protocol-learning.

6.1.2 Dependency on Individuality

After the protocol-learning using training samples collected by having the same person wear six types of clothing, a
recognition experiment was conducted on unknown test samples from 11 subjects. The features of this experiment are described below:

1. The subject is prompted by a keyword to perform a gesture operation.
2. Prior restrictions are not placed on position, posture, or movement.
3. Prior restrictions are not placed on the subject's sex or physique.
4. Prior restrictions are not placed on the subject's dress.
5. Prior restrictions are not placed on the extent or the speed of gestural actions.

The average recognition rate was approximately 70 percent, suggesting that the gestures performed by the same subject and those performed by different subjects have common traits. The protocol-learning improves the recognition rate for the test samples of unknown subjects.

6.2 Selective Control of Visual Interest Points

6.2.1 Cooperation with the Application System

By estimating gestural parameters and sending them to the gesture video system, real-time interaction with arbitrary image objects was demonstrated. In addition, the processing speed, which originally fluctuated between 17 [fps] and 21 [fps], was shown to stabilize at 30 [fps]. Selective control is important in dealing with the problem of slow and unstable processing speed, especially when the gesture interface system and its application systems are tightly coupled over a TCP/IP network.

6.2.2 Evaluation of Real-Time Performance

Although the processing speed is inversely proportional to the size of the reference data set, selective control realized a constant processing speed of 30 [fps]. Selective control is particularly effective in dealing with the increasing cost of pattern matching. Since each control input is determined dynamically by the selective control method, neither users nor system engineers need to worry about performance differences among massively interconnected networked or distributed computers. In this way, the proposed selective control method is extremely advantageous in satisfying an arbitrary processing speed specified by the application systems.

7 CONCLUSION

A comprehensive framework by which to recognize an arbitrary person’s unspecified gestures in real time was presented. For the problem of feature selection, QVIPS was proposed. Quadruple Visual Interest Point Strategy places more emphasis on spatio-temporally reliable visual interest points in image sequences given as belonging to the same gesture class. Recognition of a wide variety of gestures became possible under the framework of QVIPS without restricting the types of clothing or gestures, the extent of motion trajectories, or individual differences of motion characteristics.


In order to deal with the problem of a slow and unstable processing frame rate of the recognition system, a selective control method for visual interest points was incorporated into QVIPS. The processing frame rate was stabilized to an arbitrary time interval by applying the selective control method to the pattern scanning interval, the pattern matching interval, and the number of effective visual interest points. A gesture video system was developed in order to demonstrate that the proposed methods can be used as a vision-based gesture interface system and realize full videorate interaction with image objects. The proposed approach is a small but necessary step toward the realization of selfload controlling functionality for the recognition system.

APPENDIX A: FEATURE EXTRACTION METHOD

This appendix presents a brief proof of the uniqueness of the Gaussian Density Feature (GDF), an original feature extractor represented by (32), and the decision criterion for the gradient coefficient $a$:

$$q = R\, \frac{\sum_{r} p(r)\, \exp\{-a (r - \nu)^2\}}{\sum_{r} p(r)}. \qquad (32)$$

A.1 Properties of the Gaussian Density Feature (GDF)

The purpose of (32) is to extract a unique feature $q$ from a one-dimensional discrete pattern $p(r)$. However, the feature extractor must satisfy the uniqueness property with respect to the displacement $r$. Here, the following operator functions are considered as examples:

$$f(r) = ar, \qquad (33)$$

$$f(r) = \exp(-ar), \qquad (34)$$

$$f(r) = \exp(-ar^2). \qquad (35)$$

ð37Þ

Judging from the criterion shown in (36), (33) is unable to extract a unique feature from a symmetric pattern distribution. A similar procedure is applied to the other operators. Equation (38) is obtained from (34) and (39) is obtained from (35):

$$\int_{0}^{R} \frac{d}{dr}\big[\exp(-ar) + \exp\{-a(R - r)\}\big]\, dr = \int_{0}^{R} a \exp(-ar)\big[\exp\{a(2r - R)\} - 1\big]\, dr \neq 0, \qquad (38)$$

$$\int_{0}^{R} \frac{d}{dr}\Big[\exp(-ar^2) + \exp\{-a(R - r)^2\}\Big]\, dr = \int_{0}^{R} 2a \exp(-ar^2)\big[\exp\{aR(2r - R)\}(R - r) - r\big]\, dr \neq 0. \qquad (39)$$

The results shown above indicate that (33) has a symmetric nature and, hence, cannot satisfy the uniqueness property. On the other hand, (34) and (35) satisfy the uniqueness property and each can be used as a feature extractor. A Gaussian operator, which is robust to noise and has lower-band-pass filtering properties, is adopted herein.

A.2 Decision Criterion for the Gradient Coefficient

In the previous section, it was shown to be possible to extract unique features, even from the symmetric distribution of a pattern, by applying a Gaussian operator. However, the gradient coefficient must be determined in accordance with the peculiarities of the target problem, since the degree of uniqueness of the obtained feature depends on the gradient coefficient $a$. Fig. 16 shows Gaussian operators and their primary derivatives obtained by changing the gradient coefficient $a$ from 1 to 9. In applying the Gaussian operator as a feature extractor, the operator should be designed to be more sensitive to the external contour of the target feature image region by reversing the right and left-hand sides of the Gaussian distribution:

$$\frac{d}{dr}\Big[\exp(-ar^2) + \exp\{-a(R - r)^2\}\Big] = 2a \exp(-ar^2)\big[\exp\{aR(2r - R)\}(R - r) - r\big]. \qquad (40)$$

Fig. 16. (a) Gaussian operator and (b) its primary derivative.

Fig. 17. Symmetry evaluation of a Gaussian operator.

Fig. 17 shows the symmetry evaluation function represented by (40) when $r$ is changed from 0 to 0.5. In this paper, the evaluation experiments are conducted by setting the gradient coefficient $a$ to 5.0.

ACKNOWLEDGMENTS

The authors would like to express their sincere thanks to OMRON Co. Ltd. for funding the research project. The authors would also like to thank the associate editor and the reviewers for helpful comments that greatly improved the readability of the paper.

REFERENCES

[1] T. Darrell and A.P. Pentland, "Attention-Driven Expression and Gesture Analysis in an Interactive Environment," Proc. Int'l Workshop Automatic Face and Gesture Recognition, pp. 135-140, 1995.
[2] R.H.Y. So and M.J. Griffin, "Experimental Studies of the Use of Phase Lead Filters to Compensate Lags in Head-Coupled Visual Displays," IEEE Trans. Systems, Man, and Cybernetics-Part A: Systems and Humans, vol. 26, no. 4, pp. 445-454, July 1996.
[3] P. Ekman, "Essential Behavioral Science of the Face and Gesture that Computer Scientists Need to Know," Proc. Int'l Workshop Automatic Face and Gesture Recognition, pp. 7-11, 1995.
[4] T.S. Huang and V.I. Pavlovic, "Hand Gesture Modeling, Analysis, and Synthesis," Proc. Int'l Workshop Automatic Face and Gesture Recognition, pp. 73-79, 1995.
[5] H. Thwaites, "Human Navigation in Real and Virtual Space," Proc. Int'l Conf. Virtual Systems and Multimedia (VSMM '96), pp. 19-26, Sept. 1996.
[6] K. Mase, R. Kadobayashi, and R. Nakatsu, "Meta-Museum: A Supportive Augmented-Reality Environment for Knowledge Sharing," Proc. Int'l Conf. Virtual Systems and Multimedia (VSMM '96), pp. 107-110, Sept. 1996.
[7] C. Maggioni, "Gesture Computer—New Ways of Operating a Computer," Proc. Int'l Workshop Automatic Face and Gesture Recognition, pp. 166-171, 1995.
[8] M.J. Diorinos and H. Popp, "Conception and Development of Virtual Reality Shopping Malls," Proc. Int'l Conf. Virtual Systems and Multimedia (VSMM '96), pp. 183-188, Sept. 1996.
[9] H.D. Tagare, K. Toyama, and J.G. Wang, "A Maximum-Likelihood Strategy for Directing Attention During Visual Search," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 23, no. 5, pp. 490-500, May 2001.
[10] G. Backer, B. Mertsching, and M. Bollmann, "Data- and Model-Driven Gaze Control for an Active-Vision System," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 23, no. 12, pp. 1415-1429, Dec. 2001.
[11] J. Denzler and C.M. Brown, "Information Theoretic Sensor Data Selection for Active Object Recognition and State Estimation," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 24, no. 2, pp. 145-157, Feb. 2002.
[12] A.A. Salah, E. Alpaydin, and L. Akarun, "A Selective Attention-Based Method for Visual Pattern Recognition with Application to Handwritten Digit Recognition and Face Recognition," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 24, no. 3, pp. 420-424, Mar. 2002.
[13] G. Loy and A. Zelinsky, "Fast Radial Symmetry for Detecting Points of Interest," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 25, no. 8, pp. 959-973, Aug. 2003.

[14] J.T. Chien and C.C. Wu, "Discriminant Waveletfaces and Nearest Feature Classifiers for Face Recognition," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 24, no. 12, pp. 1644-1649, Dec. 2002.
[15] J. Kittler and F.M. Alkoot, "Sum Versus Vote Fusion in Multiple Classifier Systems," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 25, no. 1, pp. 110-115, Jan. 2003.
[16] D. Windridge and J. Kittler, "A Morphologically Optimal Strategy for Classifier Combination: Multiple Expert Fusion as a Tomographic Process," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 25, no. 3, pp. 343-353, Mar. 2003.
[17] A.J. Storkey and C.K.I. Williams, "Image Modeling with Position-Encoding Dynamic Trees," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 25, no. 7, pp. 859-871, July 2003.
[18] M.K. Titsias and A. Likas, "Class Conditional Density Estimation Using Mixtures with Constrained Component Sharing," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 25, no. 7, pp. 924-928, July 2003.
[19] S. Raudys, "Experts' Boasting in Trainable Fusion Rules," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 25, no. 9, pp. 1178-1182, Sept. 2003.
[20] M. Bressan and J. Vitrià, "On the Selection and Classification of Independent Features," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 25, no. 10, pp. 1312-1317, Oct. 2003.
[21] I.R. Vega and S. Sarkar, "Statistical Motion Model Based on the Change of Feature Relationships: Human Gait-Based Recognition," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 25, no. 10, pp. 1323-1328, Oct. 2003.
[22] T. Kirishima, K. Sato, and K. Chihara, "Realtime Gesture Recognition under the Multilayered Parallel Recognition Framework of QVIPS," Proc. Third IEEE Int'l Conf. Automatic Face and Gesture Recognition (FG '98), pp. 579-584, Apr. 1998.
[23] T. Kirishima, K. Sato, and K. Chihara, "Real-Time Gesture Recognition by Learning of Gesture Protocols," IEICE (D-II) Trans., vol. J81-D-II, no. 5, pp. 785-794, May 1998.
[24] I.S. MacKenzie and C. Ware, "Lag as a Determinant of Human Performance in Interactive Systems," Proc. Conf. Human Factors in Computing Systems (INTERCHI '93), pp. 488-493, Apr. 1993.
[25] T. Kirishima, K. Sato, and K. Chihara, "Realtime Gesture Recognition by Selective Control of Visual Interest Points," IEICE (D-II) Trans., vol. J84-D-II, no. 11, pp. 2398-2407, Nov. 2001.
[26] C. Rasmussen and G.D. Hager, "Probabilistic Data Association Methods for Tracking Complex Visual Objects," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 23, no. 6, pp. 560-576, June 2001.
[27] C.S. Wiles, A. Maki, and N. Matsuda, "Hyperpatches for 3D Model Acquisition and Tracking," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 23, no. 12, pp. 1391-1403, Dec. 2001.
[28] T. Drummond and R. Cipolla, "Real-Time Visual Tracking of Complex Structures," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 24, no. 7, pp. 932-946, July 2002.
[29] Y. Song, L. Goncalves, and P. Perona, "Unsupervised Learning of Human Motion," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 25, no. 7, pp. 814-827, July 2003.
[30] A. Mohan, C. Papageorgiou, and T. Poggio, "Example-Based Object Detection in Images by Components," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 23, no. 4, pp. 349-361, Apr. 2001.
[31] J. Triesch and C. Malsburg, "A System for Person-Independent Hand Posture Recognition against Complex Backgrounds," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 23, no. 12, pp. 1449-1453, Dec. 2001.
[32] A.D. Jepson, D.J. Fleet, and T.F. El-Maraghi, "Robust Online Appearance Models for Visual Tracking," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 25, no. 10, pp. 1296-1311, Oct. 2003.
[33] A.F. Bobick and J.W. Davis, "The Recognition of Human Movement Using Temporal Templates," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 23, no. 3, pp. 257-267, Mar. 2001.
[34] B. Moghaddam and M.H. Yang, "Learning Gender with Support Faces," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 24, no. 5, pp. 707-711, May 2002.
[35] J. Yamato, S. Kurakake, A. Tomono, and K. Ishii, "Human Action Recognition Using HMM with Category-Separated Vector Quantization," IEICE (D-II) Trans., vol. J77-D-II, no. 7, pp. 1311-1318, July 1994.
[36] P. Bharadwaj and L. Carin, "Infrared-Image Classification Using Hidden Markov Trees," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 24, no. 10, pp. 1394-1398, Oct. 2002.
[37] M. Bicego and V. Murino, "Investigating Hidden Markov Models' Capabilities in 2D Shape Classification," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 26, no. 2, pp. 281-286, Feb. 2004.
[38] M.H. Yang, "Extraction of 2D Motion Trajectories and Its Application to Hand Gesture Recognition," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 24, no. 8, pp. 1061-1074, Aug. 2002.
[39] A.D. Wilson and A.F. Bobick, "Configuration States for the Representation and Recognition of Gesture," Proc. Int'l Workshop Automatic Face and Gesture Recognition, pp. 129-134, 1995.
[40] D.S. Yeung and X.Z. Wang, "Improving Performance of Similarity-Based Clustering by Feature Weight Learning," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 24, no. 4, pp. 556-561, Apr. 2002.
[41] E. Hunter, J. Schlenzig, and R. Jain, "Posture Estimation in Reduced-Model Gesture Input Systems," Proc. Int'l Workshop Automatic Face and Gesture Recognition, pp. 290-295, 1995.
[42] S.X. Liao and M. Pawlak, "On the Accuracy of Zernike Moments for Image Analysis," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 20, no. 12, pp. 1358-1364, Dec. 1998.
[43] W.T. Freeman and M. Roth, "Orientation Histograms for Hand Gesture Recognition," Proc. Int'l Workshop Automatic Face and Gesture Recognition, pp. 296-301, 1995.
[44] I.S. MacKenzie and C. Ware, "Lag as a Determinant of Human Performance in Interactive Systems," Proc. Human Factors in Computing Systems (INTERCHI '93), pp. 488-493, Apr. 1993.
[45] H. Deng and D.A. Clausi, "Gaussian MRF Rotation-Invariant Features for Image Classification," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 26, no. 7, pp. 951-955, July 2004.
[46] F. Quek, "Toward a Vision-Based Hand Gesture Interface," Proc. Virtual Reality System Technology Conf., pp. 17-29, Aug. 1994.
[47] T. Kirishima, K. Sato, and K. Chihara, "A Novel Approach on Gesture Recognition: The Gesture Protocol-Based Gesture Interface," Proc. Int'l Conf. Virtual Systems and Multimedia (VSMM '96), pp. 433-438, Sept. 1996.

Toshiyuki Kirishima received the BS degree from the National Institution for Academic Degrees (NIAD), Japan, in 1995 and the MS and PhD degrees in engineering from the Nara Institute of Science and Technology, Japan, in 1997 and 2000, respectively. He is currently an assistant professor in the Department of Electrical Engineering, Nara National College of Technology, Institute of National Colleges of Technology, Japan. His research interests include pattern recognition, artificial reality, and behavior-based image retrieval. He is a member of IPSJ and the IEEE.

Kosuke Sato received the BS degree from Osaka University, Japan, in 1983 and the MS and PhD degrees in engineering science from Osaka University, Japan, in 1985 and 1988, respectively. He was a visiting scientist at the Robotics Institute, Carnegie Mellon University, from 1988 to 1990. He is currently a professor at the Graduate School of Engineering Science, Osaka University, Japan. His research interests include image sensing, 3D image processing, and virtual reality. He is a member of IEICE, IPSJ, VRSJ, the IEEE, and the IEEE Computer Society.

Kunihiro Chihara received the BS degree from Osaka University, Japan, in 1968 and the MS and PhD degrees in engineering science from Osaka University, Japan, in 1970 and 1973, respectively. He is currently a dean professor at the Graduate School of Information Science, Nara Institute of Science and Technology (NAIST), Japan. His research interests include medical ultrasonic imaging, ubiquitous computing, and the application of multimedia. He is a member of ISCIE, JSMBE, JSUM, and the IEEE.