International Journal of Computer Vision 69(3), 353–371, 2006
© 2006 Springer Science + Business Media, LLC. Manufactured in The Netherlands.
DOI: 10.1007/s11263-006-8112-5
Hand Motion Gesture Frequency Properties and Multimodal Discourse Analysis

YINGEN XIONG AND FRANCIS QUEK

Center for Human Computer Interaction, Department of Computer Science, Virginia Polytechnic Institute and State University, 621 McBryde Hall, MC 0106, Blacksburg, VA 24061, USA
[email protected]

Received May 20, 2003; Revised August 31, 2005; Accepted February 7, 2006
First online version published in April, 2006
Abstract. Gesture and speech are co-expressive and complementary channels of a single human language system. While speech carries the major load of symbolic presentation, gesture provides the imagistic content. We investigate the role of oscillatory/cyclical hand motions in ‘carrying’ this image content. We present our work on the extraction of hand motion oscillation frequencies of gestures that accompany speech. The key challenges are that such motions are characterized by non-stationary oscillations, and that multiple frequencies may be simultaneously extant. Also, the duration of the oscillations may extend over very few cycles. We apply the windowed Fourier transform and the wavelet transform to detect and extract gesticulatory oscillations. We tested these against synthetic signals (stationary and non-stationary) and real data sequences of gesticulatory hand movements in natural discourse. Our results show that both filters functioned well for the synthetic signals. For the real data, the wavelet bandpass filter bank is better for detecting and extracting hand gesture oscillations. We relate the hand motion oscillatory gestures detected by wavelet analysis to speech in natural conversation and apply this to multimodal language analysis. We demonstrate the ability of our algorithm to extract gesticulatory oscillations and show how oscillatory gestures reveal portions of the multimodal discourse structure.

Keywords: hand gesture, hand motion, gesticulatory oscillation, frequency ridges, multimodal discourse analysis, speech analysis
1. Introduction
Our work is motivated by the conviction that gesture and speech are coexpressive of the underlying dynamic ideation that drives human communication. In essence, spatio-temporal imagery interacts with the discourse production process to produce our vocal utterances. At the same time, these spatio-temporal images are embodied in the human gesture accompanying speech. As such, human gesture and speech cohere, not at the lexical or syntactic level (there is no such thing as a consistent gesticulatory correlate to a particular word), but at the level of the units of meaning and discourse as they are constructed (McNeill, 2000; Quek et al., 2002; Quek, 2003). This gives rise to the psycholinguistic concept known as the catchment (McNeill, 2000a, 2000b; McNeill et al., 2001). A catchment is the recurrence of one or more gesticulatory features (e.g. two inward-facing open palms held vertically, or the right hand rising and moving away from the body) that indicates that the co-temporal utterances arise from the same spatio-temporal imagery. Hence, transitions and cohesions in gesticulatory behavior would inform us about the discourse conceptualization. The issue then boils down to the typical computer vision problem of identifying visually computable metrics that correlate with the entity of interest.
We present a study on the utility of oscillatory hand motion as a gesticulatory feature that gives us such a glimpse into the discourse construction and organization. We detect and analyze hand motion gestures in the frequency domain, and investigate the relationship between hand gesture and multimodal discourse structure.

1.1. Human Communication, Gesture, and Oscillation
Although gesture may be considered broadly to include hand, head, eye, and posture, we consider only hand gesture in this paper. We observe that a particular gesture phase may feature either a moving hand or a stationary hand. If the hand is moving, the motion may be either single or repeated. One might characterize repeated motion gestures in terms of the size of the motion, the trajectory (directions) of the motion, and the frequency of the repetition. Within the framework of accessing discourse content through the embodied mental imagery of the speaker, an obvious question is whether the patterning of the repeated hand movement has sufficient ‘image-bearing’ capacity for describing gesticulatory imagery. If so, we might expect the patterning of gesticulatory repetition to correlate with the topical units of discourse. Consequently, changes in these patterns would correspond to points of departure between discourse units. We investigate the directions and frequencies of hand motion oscillation within this context. In this paper, we address the computational aspects of detecting hand motion repetitions, extracting the frequencies of such oscillations in the cardinal directions, and correlating these computational results with discourse structure. Motion oscillation can be found by detecting periodic and cyclic motion: in vision-based algorithms, objects are tracked through image sequences, and by analyzing the objects and their motion properties we can detect and recognize oscillatory motion.

1.2. Computer Vision Based Periodic and Cyclic Motion Detection
If a motion repeats itself with a constant period p, it is said to be a periodic motion. If p is not perfectly constant over time, the motion is cyclic. Hence, we can say that cyclic motion is a generalized version of periodic motion. That is to say, all periodic motions
are cyclic, but a cyclic motion is periodic only when its period is perfectly constant over time. The existing methods for the detection and recognition of periodic and cyclic motion can be categorized into three main classes: pixel-based, area-based, and model-based algorithms.

Pixel-based algorithms employ the motion information and properties of pixels on objects, such as point correspondences (Seitz and Dyer, 1997; Tsai et al., 1994), periodicities of pixel value variations (Liu and Wechsler, 1998; Polana and Nelson, 1997), and the periodic motion of pixel-derived features (Niyogi and Adelson, 1994; Fujiyoshi and Lipton, 1998; Heisele and Woehler, 1998). Seitz and Dyer (1997) describe a technique for resolving the difficulties and limitations due to changing viewpoint and irregularity in the periodicity of a motion. A temporal correlation plot for repeating motions is computed using different image comparison functions that require the establishment of point correspondences. This enables them to create a period trace measure for the detection and interpretation of cyclic motion. The algorithm is based on inputs from tracking several point features on the same object. The goodness of match is evaluated by the significance of a solution based on the comparison function that measures correlations between images containing a number of common features. A statistical measure based on the Kolmogorov-Smirnov test is utilized to classify cyclic and non-cyclic motion. Their approach is applicable to objects and scenes whose motion is highly non-rigid, but it has some weaknesses in handling nonstationary data.

Polana and Nelson (1997) describe a technique to detect periodic, nonrigid motion. The object employed in the process of detecting periodic motion is tracked in a video sequence and first realigned so that its centroid remains stationary over time. Then reference curves, which are lines parallel to the trajectory of the motion flow centroid, are computed. Along these curves, a spectral power is calculated for the image pixels. Each reference curve is then measured for periodicity using the difference between the spectral energy at the highest-amplitude frequency and its harmonics and the energy at frequencies at the midpoints between those harmonics. They implemented a real-time system that can recognize and classify repetitive motion activities in normal gray-scale image sequences.

Heisele and Woehler (1998) present an approach for recognizing walking pedestrians in sequences of color images taken from a moving camera. The periodic motion of the legs of a pedestrian walking parallel to the image plane is the feature employed in the recognition process.
The images are segmented using a color/position feature space and the resulting clusters are tracked. The clusters are classified by a two-stage classifier: the first stage uses a fast polynomial classifier for a rough preselection, and the final stage uses a time delay neural network.

Niyogi and Adelson (1994) describe an algorithm for walking gait analysis and recognition. In their algorithm, a person walking fronto-parallel to the image plane is segmented using background subtraction. A set of spatiotemporal snakes is employed to fit the braided pattern generated by the legs of walking persons. The snakes can be used to find the bounding contours of the walker, so that the periodic motion can be detected.

Fujiyoshi and Lipton (1998) describe a technique for analyzing the motion of a human target in a video stream by image skeletonization. In this approach, a “star” skeleton is created from moving targets and their boundaries. Periodic motion can be detected from the “star” skeleton by Fourier analysis. Their technique does not require a priori models or a large number of pixels on the target, but it needs accurate motion segmentation, which is not always possible.

Liu and Wechsler (1998) employ periodicity templates to detect, segment, and characterize spatiotemporal periodicity. The templates can indicate the presence and location of a periodic event and give an accurate quantitative periodicity measure. The Hough transform and Fourier analysis are also applied in their algorithm.

Tsai et al. (1994) present a method for analyzing cyclic motion using point correspondences. These points and their correspondences in a video sequence are obtained manually by tracking the joints of the body. Cycles are detected by applying autocorrelation and Fourier transform techniques to the smoothed spatiotemporal curvature function of the motion trajectories of specific points on the object.

Since pixel-based algorithms rely on information from pixels on objects, they depend heavily on motion segmentation and object or point tracking, and they are highly sensitive to noise if the segmentation and tracking are not performed manually. Pixel-based algorithms cannot be employed when the motion is not parallel to the image plane, and they have difficulty handling body deformations.

Model-based algorithms such as (Cohen et al., 1997; Davis et al., 2000) construct models to generate and recognize cyclic motions. Cohen et al. (1997) employ simple one-dimensional ordinary differential equations to model hand motion gesticulatory oscillation generated by a moving light.
With this model, they can generate and recognize six classes of hand gestures (all circular and linear paths). The recognition results are applied to control actuated mechanisms. Their model requires the availability of reliable point correspondences.

Davis et al. (2000) propose a categorical approach for the representation and recognition of a set of oscillatory biological motions. They employ a simple sinusoidal model with very specific and limited parameter values to represent and recognize animal oscillatory motion. Four classes of oscillatory motions are considered; as in Cohen et al. (1997), all are circular and linear paths. Fourier analysis is employed to calculate parameters for the model, including amplitude, frequency, and phase. Before modeling oscillatory motions, motion trajectories of the object must be obtained by object tracking techniques.

Model-based approaches require point correspondences or accurate motion trajectories of tracked objects to create parameters for models. The model-based approaches developed thus far can only represent and recognize some simple oscillatory motions (all circular and linear paths); it is difficult for them to handle arbitrary oscillatory motions.

Area-based algorithms such as (Cutler and Davis, 2000; Plotnik and Rock, 2002) compute the periodicities of object similarities to detect periodic and cyclic motion. Correlation metrics are employed to measure the self-similarity of the object between different frames in time. This is similar to the measures used in many object-tracking algorithms. Peaks in correlation are then measured for periodicity using various methods.

Cutler and Davis (2000) present an approach for real-time periodic motion detection and analysis. Their approach assumes that the orientation and apparent size of the object being tracked are constant or vary insignificantly. The moving objects are tracked through the image sequence, aligned along the temporal axis such that their centroids are coincident, and resized using a Mitchell filter so that their outer dimensions are normalized. As in Seitz and Dyer’s (1997) approach, the object’s self-similarity is computed as it evolves in time. For periodic motions, the self-similarity metric is periodic, and its period is measured by more than one method. Spectral power is computed using time-frequency analysis. In their approach, similarity and autocorrelation of 2D plots along time are also employed, and these techniques prove more robust to changes in periodicity frequency over time and lend themselves well toward detecting the onset or cessation of periodic motion.
Their approach is applied to detect periodic motion in image sequences of people walking, dogs running, etc. Unlike Seitz and Dyer’s (1997) approach, this algorithm can be used to detect cyclic motion in nonstationary data.

Plotnik and Rock (2002) extend the self-similarity approach to detect cyclic motion of marine animals. The animals are tracked through underwater video, and the self-similarity measures of the tracked animals are computed over time. Fourier analysis techniques are used to detect and characterize the periodicity.

Area-based approaches are robust to noise and can detect many types of cyclic motion even when the dominant motion is not parallel to the image plane (as long as cycles of shape changes are visible), when there are non-rigid body deformations, and where individual pixel variations are small. Area-based approaches can also be implemented easily in real time. The main disadvantage is the strict assumption of only very small changes in viewpoint between the camera and the tracked object.

1.3. The Challenge of Conversational Hand Motion Gesticulatory Oscillation Analysis
The challenge of detecting oscillation in hand motion in natural speech and conversation is that motion trajectories extracted from video tend to be noisy. The hand motion signal is a non-stationary signal with varying instantaneous frequencies. At a given time point there may be several frequencies, and the duration of the oscillation may be very short with respect to the period of oscillation. While one might use a self-similarity measure between motion frames to detect regular periodic motion (Cutler and Davis, 2000), determining motion oscillation embedded in otherwise non-repetitive motion poses a significant challenge. This is even more the case when our goal is the detection of these oscillations in the kind of idiosyncratic gestures that accompany speech. One could not, for example, train a system to recognize a specific oscillatory motion performed generally in the same way by the same subject.

In this paper, we present two techniques: a Windowed Fourier Transform (WFT) based approach and a wavelet analysis based method. We extract hand motion trajectories from video using a motion tracking algorithm, and employ these to detect hand motion oscillatory gestures. We process the hand motion trajectory signals with WFT and wavelet analysis. We demonstrate the extraction of multiple simultaneous oscillatory frequencies, and show that these can be used to characterize oscillatory, idiosyncratically-produced hand motion that co-occurs with speech. We relate the oscillatory gestures to co-temporally produced speech and show the promise of this approach for multimodal discourse segmentation.

2. Hand Motion Trajectory Signal Extraction from Video
Human hand gesture in standard video data poses several processing challenges. First, one cannot assume contiguity of motion in the video. A sweep of the hand across the body can span just 0.25 s to 0.5 s. This means that the entire motion is captured in 7 to 15 frames. Depending on the camera field-of-view on the subject, interframe displacement can be quite large. This means that dense optical flow methods cannot be used. Second, because of the speed of motion, there is considerable motion blur. Third, the hands tend to merge, separate, and occlude each other. Fourth, hand shapes are highly deformable. Finally, temporal resolution is extremely critical since we need to correlate the initiation and cessation of motion phases with speech units (e.g. a syllable).

We apply a parallelizable fuzzy image processing approach known as Vector Coherence Mapping (VCM) (Quek and Bryll, 1998; Quek et al., 1999; Bryll, 2003) to track the hand motion. VCM is able to apply spatial coherence, momentum (temporal coherence), motion, and skin color constraints in the vector field computation by using a fuzzy-combination strategy, and produces good results for hand gesture tracking. VCM hand tracking operates under the assumption that when the hand is in gross motion, hand shape changes are insignificant (Quek, 1994; Quek, 1996). Hence, we can obtain the dominant motion of the hand as an instantaneous aggregate of motion vectors extracted from the hand silhouette and points of high image variation. The robust fuzzy voting strategy makes the algorithm resilient to motion and image noise. The temporal coherence (localized momentum) constraint further contributes to the robustness of the vector computation. Since VCM typically extracts hundreds of vectors for each hand, the averaging effect of the vector aggregates yields smooth motion fields that are temporally accurate (i.e. no oversmoothing across frames to degrade temporal resolution). The algorithm has been applied to extract hand motion from very long video sequences, some in excess of 22,000 frames of video.
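The per-frame aggregation step can be illustrated with a short sketch. This is not the VCM algorithm itself (which computes and fuzzily combines the vector fields); it only shows how a cloud of motion vectors already attributed to one hand might be reduced to a single smooth trajectory sample per frame. The function name and array layout are our own assumptions.

```python
import numpy as np

def hand_trajectory(per_frame_vectors):
    """Reduce per-frame motion-vector clouds to one (x, y) hand position per frame.

    per_frame_vectors: a list with one entry per video frame; each entry is an
    (N, 2) array of image positions of the N motion vectors attributed to one
    hand (e.g. by VCM plus clustering).  Averaging hundreds of vectors smooths
    per-vector noise without smoothing across frames, so temporal resolution
    is preserved.
    """
    positions = []
    last = np.array([np.nan, np.nan])
    for vecs in per_frame_vectors:
        if len(vecs) > 0:
            last = np.asarray(vecs, dtype=float).mean(axis=0)
        positions.append(last)       # hold the previous position if no vectors were found
    return np.vstack(positions)      # shape (num_frames, 2)
```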
Furthermore, since our work involves the assembly of reliable multimodal speech-gesture-gaze corpora, we employ a principled editing system (Bryll, 2003) to manually ensure the reliability of the hand trajectories. This editor permits a human operator to fix vector clustering errors so that the resulting motion field for each hand maintains the smoothness due to the vector averaging. If the dataset is captured with a monocular camera, we can obtain two-dimensional (x and y) hand motion trajectory signals. For three-dimensional datasets (captured with synchronized stereo-calibrated cameras (Tsai, 1987)), we obtain three-dimensional (x, y, and z) hand motion traces.

In this paper, we consider hand motion trajectories with respect to the cardinal directions of the subject’s torso. We base this decision on the fact that human gestures tend to be lateralized and organized with respect to the plane parallel to the speaker’s torso (McCullough, 1992; McNeill, 1992; Pedelty, 1987). For convenience, in the following sections, we refer to a signal s(t) as a proxy for the directional signals s_x(t), s_y(t), and s_z(t) (x is lateral, y is vertical, and z is front-back with respect to the subject’s torso) extracted from video by our hand trajectory algorithm. If true three-dimensional data (e.g. from motion trackers or stereo cameras) of the reference torso frame and hand movements are available, these would be optimal. We found, however, that it is only important that the camera be somewhat fronto-oblique to estimate the oscillatory motion with respect to the subject’s torso frame. It turns out that we do not actually require the more stringent fronto-parallel view of the torso frame because we are detecting oscillatory motion: an oblique view may introduce some apparent x-oscillation for a real y-oriented oscillation (i.e. when the hand makes perfect up-down motions), but such perfectly aligned motion is highly unlikely. Humans do not move like robots, and so there is always some collateral motion in the other dimensions. What is important is the dominant oscillatory direction (i.e. the direction of largest oscillation), and this is what is employed in our analysis. This is fortunate, since it is awkward to perform human communication experiments between two subjects with the camera staring down each person’s face.

3. Windowed Fourier Transform Based Technique
The Windowed Fourier Transform (WFT) was introduced by Gabor (1946) to measure localized audio frequency components. To overcome the disadvantage of a Fourier expansion, which has only frequency resolution and no time resolution, the signal of interest is divided into several parts with a proper window operator, and the parts are then analyzed separately. A real and symmetric window w(t) = w(-t) is translated by u and modulated at the frequency f:

w_{u,f}(t) = e^{j 2\pi f t} \, w(t - u)    (1)

The window is normalized so that \|w_{u,f}\| = 1 for any (u, f) \in \mathbb{R}^2. Suppose s(t) \in L^2(\mathbb{R}) is a one-dimensional hand motion trajectory signal. The WFT of the signal s(t) is

F s(u, f) = \langle s, w_{u,f} \rangle = \int_{-\infty}^{+\infty} s(t) \, w(t - u) \, e^{-j 2\pi f t} \, dt = S(f) * W(f)    (2)

where S(f) and W(f) are the Fourier transforms of the signal s(t) and the window w(t) respectively. Equation (2) slides the window w(t) along s(t) and computes the usual Fourier transform of each s(t)w(t - u).

The window size T and the window shape are critical to the performance of a WFT. The spectral window must be real, even, and highly concentrated around zero frequency. In this paper we use a Hanning window. For a one-dimensional signal, the Fourier transform of the rectangular window w(t) = rect(T) is W(f) = T \,\mathrm{sinc}(fT). In this case, we have a resolution of T in the time domain and 1/T in the frequency domain. If we need to detect two frequencies f_1 and f_2 using a WFT, the individual frequencies cannot be resolved unless |f_1 - f_2| > 1/T; for adequate separation we should have |f_1 - f_2| > 2/T. As the window becomes shorter, frequency resolution decreases. On the other hand, as the window length decreases, the ability to resolve changes with time increases. Consequently, the choice of window length becomes a trade-off between frequency resolution and time resolution.
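As an illustration of Equation (2), the following sketch computes a discrete WFT power spectrum of a sampled trajectory signal with a unit-norm Hanning window. The function name, the brute-force inner-product evaluation, and the edge padding are our own choices, not the authors' implementation.

```python
import numpy as np

def wft_power(s, fs, T, freqs):
    """|Fs(u, f)|^2 for a 1-D signal s sampled at fs Hz (cf. Eq. 2).

    T is the window duration in seconds (Hanning window, normalized so that
    ||w|| = 1); freqs is the list of analysis frequencies in Hz.
    Returns an array of shape (len(freqs), len(s)).
    """
    n = max(3, int(round(T * fs)))
    w = np.hanning(n)
    w /= np.linalg.norm(w)                      # unit-norm window
    half = n // 2
    padded = np.pad(np.asarray(s, float), (half, n - half), mode="edge")
    t = np.arange(n) / fs
    power = np.zeros((len(freqs), len(s)))
    for i, f in enumerate(freqs):
        atom = w * np.exp(-2j * np.pi * f * t)  # modulated window w_{u,f}
        for u in range(len(s)):
            power[i, u] = np.abs(np.dot(padded[u:u + n], atom)) ** 2
    return power
```

For example, at the 29.97 Hz camera rate used later in this paper, a T = 4 s window corresponds to a frequency resolution on the order of 1/T = 0.25 Hz.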
4. Wavelet Analysis Based Technique
For a wavelet function ψ(t), a family of time-frequency atoms is obtained by scaling ψ by a and translating it by u:

\psi_{u,a}(t) = \frac{1}{\sqrt{a}} \, \psi\!\left(\frac{t - u}{a}\right)    (3)

A one-dimensional Continuous Wavelet Transform (CWT) (Rao and Bopardikar, 1998) of s(t) is given by

W s(u, a) = \langle s, \psi_{u,a} \rangle = \int_{\mathbb{R}} s(t) \, \frac{1}{\sqrt{a}} \, \psi\!\left(\frac{t - u}{a}\right) dt    (4)

which projects the signal s(t) onto all scaled (denoted by a) and translated (denoted by u) versions of a single wavelet function ψ(t). For linear filtering, the wavelet transform can be rewritten as a convolution product:

W s(u, a) = \int_{\mathbb{R}} s(t) \, \frac{1}{\sqrt{a}} \, \psi\!\left(\frac{t - u}{a}\right) dt = (s * \psi_a)(u)    (5)

with

\psi_a(t) = \frac{1}{\sqrt{a}} \, \psi\!\left(\frac{-t}{a}\right)    (6)
The wavelet coefficient W s(u, a) is defined on a two-dimensional time-scale domain and is interpreted as giving a measure of the time evolution of frequency content in the signal. In the CWT, instead of fixing the time and frequency resolutions, one can let both resolutions vary in the time-frequency plane in order to obtain a multiresolution analysis. This variation can be carried out without violating the Heisenberg inequality. In this case, the time resolution must increase as frequency increases, and the frequency resolution must increase as frequency decreases. Thus the CWT can be used to process nonstationary signals.

We experimented with different wavelet kernels and determined that the Morlet wavelet produces the most intuitively correct results. The Morlet wavelet is defined as (Goupillaud et al., 1984-1985):

\psi_{u,a}(t) = \frac{1}{\sqrt{a}} \, e^{-\left(\frac{t - u}{a}\right)^2 / 2} \cos\!\left(5 \, \frac{t - u}{a}\right)    (7)
This wavelet is a sinusoid modulated by a Gaussian envelope. Hence, the Morlet wavelet has infinite duration (from the Gaussian) with most of the energy concentrated within the ‘hump’ of the Gaussian. Also, this wavelet is centered, and the Gaussian performs an implicit smoothing operation over the input signal. ψ_{u,a}(t) is essentially a band-pass filter centered around the frequency 5/(2πa) with a bandwidth of 1/a. We believe that these properties (implicit smoothing and extraction of the central sinusoid) contribute to the suitability of the Morlet wavelet for gesture analysis.

Since hand motion trajectories are nonstationary signals, we apply a bank of wavelet band-pass filters to detect their instantaneous frequencies and to extract the dominant frequencies of hand motion. To create a wavelet band-pass filter bank, we need a strategy for determining a set of center frequencies for the filter bank. The range and intervals of these frequencies are, of course, related to the particulars of human gesticulatory motion. For human gesticulation, it is difficult for the hand to oscillate at greater than 4.5 Hz (and even if it could, the camera could not pick up the motion). This sets the upper bound of our frequency range. At the low frequency end, gesticulatory oscillations slower than 0.3 Hz (i.e. slower than roughly one cycle every 3 seconds) are unlikely. Within these bounds, we observe that absolute frequency deviations are less important than relative changes. For example, at 4.5 Hz, a 0.1 Hz (2.22%) deviation is imperceptible to both the gesturer and the human observer, whereas a 0.1 Hz deviation at 0.3 Hz (33.33%) is clearly significant. We implemented our filter bank on a log scale using the function

N = \frac{\ln f_{max} - \ln f_{min}}{\ln(1 + \gamma)}    (8)

for a set of N filters within the frequency range [f_{min}, f_{max}]. γ defines the rate of change in center frequencies. This allows us to determine the size of our bank with a single parameter γ that controls the frequency resolution. Setting γ to 0.02 (the center frequency of every filter in the bank deviates from its neighbors by 2%), Equation (8) yields a bank of 138 filters for the frequency range [0.3 Hz, 4.5 Hz]. The wavelet band-pass filter bank created by this method has constant relative frequency resolution error at both low and high frequencies. Figure 1 shows the central frequencies of the wavelet filters in the filter bank.
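A minimal sketch of the filter bank construction follows: the center frequencies come from Equation (8), and each filter is a Morlet band-pass filter (Eqs. (5)-(7)) applied by direct convolution. The truncation of the wavelet support to ±4a and the function names are our assumptions, not details given in the paper.

```python
import numpy as np

def center_frequencies(f_min=0.3, f_max=4.5, gamma=0.02):
    """Log-spaced center frequencies; each deviates from its neighbor by gamma (Eq. 8)."""
    n = int(np.ceil((np.log(f_max) - np.log(f_min)) / np.log(1.0 + gamma)))
    return f_min * (1.0 + gamma) ** np.arange(n + 1)

def morlet_bandpass(s, fs, f_c):
    """Wavelet response W s(u, a) for the Morlet filter centered at f_c Hz.

    The scale a is chosen so that the Morlet center frequency 5/(2*pi*a)
    equals f_c (Eq. 7); the response is computed as a convolution (Eq. 5).
    """
    a = 5.0 / (2.0 * np.pi * f_c)                  # scale for this center frequency
    t = np.arange(-4.0 * a, 4.0 * a, 1.0 / fs)     # truncate the infinite Gaussian tail
    psi = np.exp(-((t / a) ** 2) / 2.0) * np.cos(5.0 * t / a) / np.sqrt(a)
    return np.convolve(np.asarray(s, float), psi, mode="same") / fs

def scalogram(s, fs, freqs):
    """Stack of scale-normalized powers |W s|^2 / a used for ridge extraction (Section 5)."""
    rows = []
    for f_c in freqs:
        a = 5.0 / (2.0 * np.pi * f_c)
        rows.append(np.abs(morlet_bandpass(s, fs, f_c)) ** 2 / a)
    return np.array(rows)
```

With the default parameters, center_frequencies yields 138 values between 0.3 Hz and roughly 4.5 Hz, matching the bank size reported above.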
5. Frequency Ridge Extraction
In general, ridges are a set of curves defined on a surface z = f(x, y). According to Hall et al. (1992), these curves are completely characterized by their projection onto the (x, y) plane. Frequency ridges are a special class of ridges. Following Carmona et al. (1999), we can describe frequency ridges as follows. Assume that M_{norm}(b, a) is the normalized spectrogram of a WFT or CWT, a non-negative function defined on the whole half time-frequency plane D. The frequency ridge set R can then be defined as the set of local maxima in a of the functions a \mapsto M_{norm}(b, a) when the variable b is held fixed. We assume that the surface M_{norm}(b, a) is smooth enough that the ridge set is the finite union of the graphs of smooth functions varying slowly on their respective domains.

Figure 1. Central frequencies of the wavelet filter bank.

5.1. Wavelet Ridges
For the CWT, the square modulus of the wavelet transform (the scalogram) defines a local time-frequency energy density P_W s, which measures the energy of s(t) in the Heisenberg box of each wavelet ψ_{u,a} centered at (u, ξ = η/a), where η is the center frequency of the wavelet:

P_W s(u, \xi) = |W s(u, a)|^2 = \left| W s\!\left(u, \frac{\eta}{\xi}\right) \right|^2    (9)

The scalogram of the wavelet transform can be normalized by the wavelet scale a:

M_{norm}(u, a) = \frac{P_W s(u, \xi)}{a} = \frac{|W s(u, a)|^2}{a}    (10)

According to the definition above, wavelet ridges are curves in the time-scale domain which follow the local maximum values of the normalized scalogram. A good approximation of the ridge point p = (u, a_0) at a position u can be obtained from the local maxima of the normalized scalogram M_{norm}(u, a) throughout the neighborhood of p in scale a. Let N(∗) denote the neighborhood of the argument ∗, including ∗ itself; then (u, a_0) is selected as a ridge point if

|M_{norm}(u, a_0)| \geq |M_{norm}(u, a)| \quad \forall a \in N(a_0)    (11)

With the center frequency of the wavelet, we can obtain the relationship between the wavelet scale a and the frequency f, and so transfer wavelet ridges from the time-scale domain to the time-frequency domain.
5.2. WFT Ridges
For the WFT, the square modulus of the transform defines an energy density called a spectrogram:

P_F s(u, f) = |F s(u, f)|^2 = \left| \int_{-\infty}^{+\infty} s(t) \, w(t - u) \, e^{-j 2\pi f t} \, dt \right|^2    (12)

It measures the energy of s(t) in the time-frequency neighborhood of (u, f). As with the CWT, WFT ridges can be computed from the local maxima of the normalized spectrogram M_{norm}(u, f):

M_{norm}(u, f) = \frac{P_F s(u, f)}{a_f} = \frac{|F s(u, f)|^2}{a_f}    (13)

where a_f is a scale used to normalize the window w, w_{a_f}(t) = \frac{1}{\sqrt{a_f}} w\!\left(\frac{t}{a_f}\right), so that \|w_{a_f}\| = 1.

It is possible for multiple ridge points to occur at the same position u. One can connect neighboring ridge points into a continuous surface, called a ridge surface. We can use frequency ridges to characterize the instantaneous frequencies of nonstationary signals; from the frequency properties of the hand motion trajectory signals, we can then identify hand motion oscillatory gestures.
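The ridge test of Equation (11) reduces to a column-wise local-maximum search over the normalized scalogram. The following sketch (with an added response threshold, which the paper does not specify) marks ridge points and converts each scale to its Morlet center frequency so that the ridges can be plotted in the time-frequency plane.

```python
import numpy as np

def wavelet_ridge_points(W, scales, threshold=0.0):
    """Ridge points of a CWT matrix W with shape (n_scales, n_times).

    Implements Eqs. (10)-(11): normalize the scalogram by scale, then mark
    (u, a0) as a ridge point when it is a local maximum over neighboring
    scales at the same time u and exceeds a response threshold.
    Returns a boolean mask of ridge points and the per-scale frequencies.
    """
    scales = np.asarray(scales, dtype=float)
    M = np.abs(W) ** 2 / scales[:, None]            # normalized scalogram M_norm(u, a)
    ridges = np.zeros(M.shape, dtype=bool)
    ridges[1:-1] = (M[1:-1] >= M[:-2]) & (M[1:-1] >= M[2:]) & (M[1:-1] > threshold)
    freqs = 5.0 / (2.0 * np.pi * scales)            # Morlet scale-to-frequency conversion
    return ridges, freqs
```

Neighboring ridge points would then be linked over time into ridge traces of the kind shown in Section 7; thresholding is one way to suppress weak responses such as the noise ridges discussed there.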
6. Test Signals and Results
To test the WFT and CWT, we first need to determine the frequency domains of the test data and select appropriate parameters for both. Since our eventual application is the extraction of oscillatory gesticulatory hand motion from video, we use a frequency range concomitant with the variability of gesticulatory motion: human oscillatory hand gestures typically lie in the interval [0.3 Hz, 4.5 Hz]. We also use the digital camera sampling frequency of 29.97 Hz for all our test data. The first test signal contains two frequencies (1 Hz and 3 Hz) that run the entire length of the signal. The generator equations are as follows:
Figure 2. Results for the first test signal (top: input signal y; middle: WFT frequency (Hz) vs. time (s); bottom: CWT frequency (Hz) vs. time (s)).
Figure 3. Results for the second test signal (top: input signal y; middle: WFT frequency (Hz) vs. time (s); bottom: CWT frequency (Hz) vs. time (s)).
y_1 = \sin(2\pi f_1 t_1); \quad f_1 = 1\,\mathrm{Hz}, \quad 0 \le t_1 \le 50\,\mathrm{s}
y_2 = \sin(2\pi f_2 t_2); \quad f_2 = 3\,\mathrm{Hz}, \quad 0 \le t_2 \le 50\,\mathrm{s}
y = y_1 + y_2    (14)
Figure 2, from top to bottom, shows the input signal, the WFT response, and the CWT ridges. The CWT results in Figure 2 also show that it produces better temporal resolution at higher frequencies than at lower frequencies. The temporal resolution of the WFT is constant for all frequencies and is determined only by the size of T. Both approaches resolve the frequencies correctly.

Our second test signal contains two frequencies. The generator equations are:

y_1 = \sin(2\pi f_1 t_1); \quad f_1 = 1\,\mathrm{Hz}, \quad 5\,\mathrm{s} \le t_1 \le 15\,\mathrm{s}
y_2 = \sin(2\pi f_2 t_2); \quad f_2 = 3\,\mathrm{Hz}, \quad 20\,\mathrm{s} \le t_2 \le 30\,\mathrm{s}
y_3 = \sin(2\pi f_1 t_3) + \sin(2\pi f_2 t_3); \quad 35\,\mathrm{s} \le t_3 \le 45\,\mathrm{s}
y = y_1 + y_2 + y_3    (15)
Figure 3 shows the results for the second test signal. As before, the figure displays the input signal, the WFT result, and the CWT result. This second test illustrates a non-stationary signal whose frequencies change with time. While the frequency resolution of the CWT is lower at low frequencies, this is of little consequence when one considers the frequency involved, since the error is constant as a function of the wavelength of the signal being detected. For the WFT, the constant time resolution is problematic at higher frequencies because the window can span many cycles of the signal. Again, both methods correctly extracted the oscillation frequencies in the input signal. The third test signal contains one frequency (3 Hz) and an added white noise signal n_w. The generator equations are as follows:
y_1 = \sin(2\pi f_1 t_1); \quad f_1 = 3\,\mathrm{Hz}, \quad 0 \le t_1 \le 50\,\mathrm{s}
y = y_1 + n_w    (16)
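For reference, the three synthetic test signals of Equations (14)-(16) can be generated as below at the 29.97 Hz camera rate. The white-noise amplitude in Equation (16) is not given in the text, so the value used here is a placeholder.

```python
import numpy as np

fs = 29.97                                  # camera sampling frequency (Hz)
t = np.arange(0.0, 50.0, 1.0 / fs)

# Eq. (14): 1 Hz and 3 Hz components over the whole 50 s.
sig1 = np.sin(2 * np.pi * 1.0 * t) + np.sin(2 * np.pi * 3.0 * t)

# Eq. (15): 1 Hz on [5, 15] s, 3 Hz on [20, 30] s, both on [35, 45] s.
sig2 = (np.sin(2 * np.pi * 1.0 * t) * ((t >= 5) & (t <= 15))
        + np.sin(2 * np.pi * 3.0 * t) * ((t >= 20) & (t <= 30))
        + (np.sin(2 * np.pi * 1.0 * t) + np.sin(2 * np.pi * 3.0 * t)) * ((t >= 35) & (t <= 45)))

# Eq. (16): 3 Hz plus additive white noise (0.5 amplitude is an assumption).
rng = np.random.default_rng(0)
sig3 = np.sin(2 * np.pi * 3.0 * t) + 0.5 * rng.standard_normal(t.size)
```

Each signal can then be passed through the WFT sketch of Section 3 or the Morlet filter bank sketch of Section 4 to reproduce analyses of the kind shown in Figures 2-4.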
Figure 4. Results for the third test signal (top: input signal y; middle: WFT frequency (Hz) vs. time (s); bottom: CWT frequency (Hz) vs. time (s)).
Figure 4 shows the results for the third signal. From the results we can see that both methods can resolve the frequency from noisy signals. Although the noise has some influence on the CWT results, we note that the maximal frequency error is still within the constant uncertainty window of the WFT.

Our next test signal comes from head movement data shown at the top of Figure 5 (organized similarly to the previous figures). In this dataset, the subject moved his head from side to side in as constant an oscillation as possible. The oscillation frequencies were determined by stopwatch. The head movement signal was extracted from a video sequence recorded by a digital camera at a sampling frequency of 29.97 Hz, using the head tracking algorithm of Cascia et al. (2000). From Figure 5 we can see that both methods can resolve the frequencies of the head movement, but the wavelet band-pass filter bank provides better resolution in both the time and frequency domains. At the same time, we note that the WFT can also detect the frequencies of this non-stationary signal.
Figure 6 shows the x motion and oscillation plots (obtained using the WFT and CWT) for a set of oscillatory hand motions. As shown, the hand engages in oscillatory motions in different directions. Note that when the hand is oscillating in the y direction, there is some corresponding collateral oscillation in x, although it is small. The segment labeled ‘cycle-in-a-cycle’ plots the x movement of the hand as it twirls in small circles while the arm moves along a larger circular path. From the results we can see that both methods can resolve the frequencies of hand oscillatory motions. Again the wavelet band-pass filter bank provides better resolution in both the time and frequency domains.

One key observation we make between the WFT and CWT approaches for gesticulatory analysis is that the WFT’s temporal resolution for oscillation onset and offset is poorer than that of the CWT. In the WFT, these times are highly dependent on the window size. Since the beginning and end of each motion phase with respect to speech is critical in multimodal language analysis, we determined that the CWT is more appropriate for the application.
Figure 5. The results for the head movement signal (top: head x position; middle: WFT frequency (Hz) vs. time (s); bottom: CWT frequency (Hz) vs. time (s)).

7. Hand Motion Oscillatory Gesture Detection and Multimodal Discourse Analysis

7.1. Experimental Setup
We performed our analysis on a dataset consisting of a 961-frame (32 s) video of a subject describing her living space to an interlocutor (both seated) (Quek et al., 2002). Figure 7 shows one of the pictures from the dataset. The data was captured with a single camera from an oblique frontal view. Besides the extraction of hand trajectories described earlier, we also perform a detailed linguistic text transcription of the discourse that includes the presence of breath and other pauses, disfluencies, and interactions between the speakers. The speech transcript is aligned with the audio signal using Entropic’s word/syllable aligner. We also extract both the F0 and RMS of the speech. The output of the Entropic aligner is manually checked and edited using the Praat phonetics analysis tool (Boersma and Weenik, 1996) to ensure accurate time tags. This process yields a time-aligned set of traces of the hand motion with the precise locations of the start and end points of every speech syllable and pause. The time base of the entire dataset is also aligned to the experiment video.

In some of our data, we employ the Grosz ‘purpose hierarchy’ method (Nakatani et al., 1995) to obtain a discourse segmentation. This gives us a set of time indices where semantic breaks are expected. We ran our oscillatory extractor on the dataset and compared the results against the transcriptions for agreement with the discourse segmentation.

7.2. Results and Analysis
Figure 8 shows the hand motion trajectories in the x and y dimensions of the camera image plane obtained using the VCM algorithm described in Section 2. The top of the figure shows the left hand motion trajectories and the bottom the right hand motion trajectories. We label the respective hand-dimension pairs LH-x, LH-y, RH-x, and RH-y. We employed both the WFT and CWT techniques to detect the frequency properties of the subject’s hand
Figure 6. The results for the hand movement signal (top: hand x position, with segments labeled x-osc, y-osc, x-osc, y-osc, diagonal x-y osc, and cycle-in-a-cycle; middle: WFT frequency (Hz) vs. time (s); bottom: CWT frequency (Hz) vs. time (s)).

Figure 7. The dataset of the living space description experiment.
Figure 8. The hand motion trajectory signals of the living space description dataset (top: left hand; bottom: right hand; position vs. time (s)).
motion trajectory signals. We found that the CWT produces better results than the WFT does; the reason is that the former is better at dealing with nonstationary signals in the time domain. In natural speech and conversation, the hand trajectory signal is a non-stationary signal with instantaneously and continuously changing frequencies (the hand has to accelerate and decelerate). At a given time point there may be several frequencies, and the duration of the oscillation may be very short. WFT-based approaches use windowed Fourier atoms that have a fixed scale and thus cannot follow the instantaneous frequency of rapidly varying events (Mallat, 1998). In the following section, we therefore give only the results detected by the CWT-based technique and relate them to multimodal discourse analysis.

Applying our wavelet ridge algorithm, we obtain oscillation ridge traces for the corresponding trace signals. Figures 9 and 10 show the wavelet results for the x and y dimensions of the subject’s left hand motion respectively. The horizontal dimension of the graphs is time in seconds. The top of each figure shows the hand motion trace, and the bottom shows the extracted wavelet ridges. Figures 11 and 12 are
the corresponding x and y plots of the subject’s right hand. Figure 13 tabulates the wavelet ridges. Each ridge is assigned a number. The table shows the beginning time, the duration, the estimated number of oscillation cycles, the frequency at the point of highest wavelet response, f_Rmax (the peak point of the ridge), and the words spoken by the subject within the ridge duration. The number of cycles is estimated from f_Rmax and the ridge duration.

To show the temporal organization of these ridges with respect to the accompanying discourse, we ‘scored’ the ridge durations against the experiment transcript in Figure 14. The ridge durations are marked under each line of text in the order LH-x, LH-y, RH-x, RH-y, descending. When there are two simultaneous ridges in any Frequency-Time (FT) space, they are scored one above the other (we did not detect any instance of more than two simultaneous ridges). For example, LH-x5 and LH-x6 begin simultaneously on the second line of the transcript, cotemporally with the words “. . . there’s the the front . . . ”. By cross-referencing Figure 14 with Figure 13 we also know that f_Rmax for LH-x5 and LH-x6 is 1.73 Hz and 0.83 Hz respectively.
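As an arithmetic illustration of the cycle estimate, a ridge with f_Rmax = 4.09 Hz and a duration of about 0.93 s (LH-x4, discussed below) corresponds to roughly 4.09 Hz × 0.93 s ≈ 3.8 cycles, in line with the 3.82 cycles tabulated for that ridge.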
Figure 9. Results for left hand signal x (top: motion trace; bottom: extracted wavelet ridges, frequency (Hz) vs. time (s)).

Figure 10. Results for left hand signal y (top: motion trace; bottom: extracted wavelet ridges, frequency (Hz) vs. time (s)).
Figure 11. Results for right hand signal x (top: motion trace; bottom: extracted wavelet ridges, frequency (Hz) vs. time (s)).

Figure 12. Results for right hand signal y (top: motion trace; bottom: extracted wavelet ridges, frequency (Hz) vs. time (s)).
Figure 13. Oscillation table.

Figure 14. Oscillation-scored transcript.
To give a sense of the time scale, vertical lines mark two-second intervals. The numbers at the base of these lines indicate the absolute time in the dataset. For example, we can see that the “. . . um . . .” is rather lengthy, covering more than 2 seconds (from the 12 s mark to the 14 s mark), and is in fact longer than the utterance “left so you can go straight upstairs to the second floor.” This shows that the LH-x4 oscillation lasting 0.93 s is in fact an instance of a Butterworth gesture (McNeill, 1992) (a repetitive motion accompanying silence or filled speech pauses) that marks word-search behavior.

The discourse piece begins with the subject describing the back of her house (kitchen and back staircase). The subject consistently situates this using her left hand (Quek et al., 2002). The entire initial back-of-house description is captured in the oscillatory segment RH-x1 (1.56 Hz). There were no oscillatory gestures perceptible to a human observer in this segment. What we are measuring is the subtle physical movement that accompanies the underlying frequency of her cadence (there were two phrases of approximately 3.7 s duration during this period, leading to the two motion cycles). This frequency ridge captures this entire back-of-house fragment right up to the point where she aborted the description with a rapid withdrawal of the RH at the beginning of the editing phrase: “Oh I forgot to say.” RH-y1 (1.23 Hz) lasted only one cycle and was basically the preparation for her first gesture and the motion of the gesture. RH-y2 (1.48 Hz, 2 cycles) was the twirling motion of the iconic gesture representing the spiral back staircase (we know that the staircase was spiral only from the gesture). This oscillation is clearly perceptible to a human observer.

In the second phase of the discourse, the subject proceeds to the front of the house and the front doors. She consistently uses two-handed gestures to represent this discourse segment. This segment is clearly set out in the corresponding LH-x1 (2.54 Hz, 2.29 cycles) and RH-x2 (2.13 Hz, 2.63 cycles) symmetric 2H oscillatory gesture, where she held both hands horizontally with palms facing her torso at chest level. This properly captures two perceptible cycles of a ‘beat-like’ gesture in which the subject demonstrated the limit of the facade of the house by moving her hands away from her torso and back in two short strokes. Her hands were facing her torso in preparation for the next gesture, pantomiming “throwing open the double front doors.” LH-x2 (4.09 Hz) is a short high frequency oscillation we believe to be a noise ridge. The next strong symmetric oscillatory gesture, LH-x3 (2.21 Hz, 2.29 cycles) and RH-x3 (1.8 Hz, 2.23 cycles), captures the throwing open of the front doors (we know these were double doors, again, only from the gesticulatory imagery; it was never said). She performed the gesture with strong effort, and the oscillation observed involved both the gesture and the ‘overshoot-and-rebound’ motion typical at the end of effortful gesticulatory strokes. In LH-x4 (4.09 Hz, 3.82 cycles), the subject held both hands, palms out, in front of her in a 2H oscillatory action as though ‘feeling’ the glass doors. This is typical word-search behavior, corresponding to the utterance “doors with the . . . . . . the glass in them.” There was an FT ridge of almost the exact shape in the RH-x, but the ridge peaks were weak, and we did not include it in the dataset. Hence, we deem the equivalent RH-x oscillation to be missed by the algorithm.

After describing the front of the house, the subject proceeded to describe the front staircase. In the discourse, she consistently used LH gestures when describing this staircase. LH-x5 to LH-x8, LH-y1, and LH-y2 pertain to this discourse segment. The subject performed two clearly perceptible iterations of the same oscillatory gesture (2 motion cycles per gesture). The gesture indicated ascending straight staircases, with the hand rising and moving forward, with the palm facing down and outward at approximately 45°. LH-x6 (0.83 Hz) and LH-y1 (0.91 Hz) captured the first gesture, and LH-x7 (1.15 Hz) and LH-y2 (1.15 Hz) captured the second correctly. It is likely that the mental imagery contrast in this discourse was between the straight front staircase and the spiral back staircase. The final LH ridge, LH-x8 (1.97 Hz), captures the final LH oscillatory gesture coincident with the words “second floor from there.” After two cycles of the front staircase gesture (at the end of LH-x7 and LH-y2), the subject held her LH at the highest point of the stroke, indicating the ‘upstairs.’ As she uttered “second floor from there,” she oscillated her hand up and down at the wrist, palm down, as though patting the ‘upstairs floor.’ LH-x5 was a ridge that was barely strong enough to be counted. We believe that it was a noise ridge brought about by the strong neighboring ridges LH-x6 and LH-x7.

The subject returns to the back staircase and RH gestures in the final segment of the discourse. This segment begins with three relatively high frequency oscillations. In RH-x4 (4.01 Hz), the subject raises her RH in a small twirling action. This is followed by RH-x5 (3.19 Hz), where she makes a larger twirling gesture indicating
‘coming around to through the kitchen.’ RH-x6 (4.09 Hz) is a very quick hand-shape change (from an open down-facing palm in RH-x5 to a typical ASL G-hand pointing pose for RH-y3). This registered as a quick high frequency one-cycle oscillation. RH-y3 (2.13 Hz) is a return to the aborted spiral staircase gesture (an upward twirling motion with the hand in the pointing pose) of RH-y2; RH-y3 properly captures the two repetitions. RH-y4 (0.83 Hz) and RH-x7 (0.66 Hz) both straddle two utterances. The ridge responses were strong, though there were no perceptible oscillatory gestures. We believe what we are capturing in each case is the ‘gesticulatory cadence’ across two utterances. RH-x8 (1.89 Hz) and RH-y5 (1.24 Hz) detected the transition between the end of the spiral staircase gestures and the oscillatory beat that takes the description to the second floor.

8. Conclusion
We presented an investigation into oscillatory hand motion in human discourse. In doing so, we provide a handle for computer vision research to engage the broader domain of natural multimodal human language. Two questions were addressed in the course of this paper. First, how might one detect cyclical motion behavior, locate the beginnings of these cycles, and compute the frequencies of oscillation? To do this, we have to address the non-stationarity of the signal and extract oscillations with very few cycles. Second, how do these cyclic motion phases correlate temporally with discourse? This question is driven by the psycholinguistic concept relating mental imagery to the units of discourse production. In a sense, the second question is a test for the algorithms developed to answer the first, because oscillations in natural hand motions are difficult to code manually. Hence, detection of stretches of oscillation that correspond with cogent discourse pieces illustrates both the efficacy of our algorithms and their utility in accessing units of discourse. In the course of this paper, we developed our algorithms and presented experiments that answer both questions in the affirmative.

9. Acknowledgments
This research has been supported by the U.S. National Science Foundation STIMULATE program, Grant No. IRI-9618887, “Gesture, Speech, and Gaze in Discourse Segmentation”; the NSF KDI program, Grant No. BCS-9980054, “Cross-Modal Analysis of Signal Sense: Multimedia Corpora and Tools for Gesture, Speech, and Gaze Research”; and the NSF ITR program, Grant No. ITR-0219875, “Beyond the Talking Head and Animated Icon: Behaviorally Situated Avatars for Tutoring.”

References

McNeill, D. 2000. Growth points, catchments, and contexts. Cognitive Studies: Bulletin of the Japanese Cognitive Science Society, 7(1).
Quek, F., McNeill, D., Ansari, R., Ma, X., Bryll, R., Duncan, S., and McCullough, K.-E. 2002. Multimodal human discourse: Gesture and speech. ACM Transactions on Computer-Human Interaction (TOCHI), 9:171–193.
Quek, F. 2003. The catchment feature model for multimodal language analysis. In ICCV, Nice, France, 13–16.
McNeill, D. 2000. Catchments and context: Non-modular factors in speech and gesture. In Language and Gesture (D. McNeill, ed.), ch. 15, 312–328. Cambridge: Cambridge University Press.
McNeill, D., Quek, F., McCullough, K.-E., Duncan, S., Furuyama, N., Bryll, R., and Ansari, R. 2001. Catchments, prosody and discourse. In Oralité et Gestualité, ORAGE (Speech and Gesture 2001) (C. Cavé, I. Guaïtella, and S. Santi, eds.), Aix-en-Provence, France, 474–481.
Seitz, S. and Dyer, C. 1997. View-invariant analysis of cyclic motion. International Journal of Computer Vision, 25(3):231–251.
Tsai, P.-S., Keiter, K., Kasparis, T., and Shah, M. 1994. Cyclic motion detection. Pattern Recognition, 27(12):1591–1603.
Liu, C. and Wechsler, H. 1998. Probabilistic reasoning models for face recognition. In Proc. of the IEEE Conf. on CVPR, Santa Barbara, CA, 827–832. IEEE Comp. Soc.
Polana, R. and Nelson, R.C. 1997. Detection and recognition of periodic, non-rigid motion. International Journal of Computer Vision, 23:261–282.
Niyogi, S. and Adelson, E. 1994. Analyzing gait with spatiotemporal surfaces. In 1994 IEEE Workshop on Motion of Non-Rigid and Articulated Objects, 64–69.
Fujiyoshi, H. and Lipton, A. 1998. Real-time human motion analysis by image skeletonization. In IEEE Workshop on Applications of Computer Vision.
Heisele, B. and Woehler, C. 1998. Motion-based recognition of pedestrians. In Proc. Int. Conf. Pattern Recognition, vol. 2, Brisbane, Australia. IEEE Comp. Soc.
Niyogi, S. and Adelson, E. 1994. Analyzing and recognizing walking figures in XYT. In 1994 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'94), 469–474.
Cohen, C.J., Conway, L., Koditschek, D., and Roston, G.P. 1997. Dynamic system representation of basic and non-linear-in-parameters oscillatory motion gestures. In 1997 IEEE International Conference on Systems, Man, and Cybernetics, vol. 5, Orlando, FL, 4513–4518.
Davis, J., Bobick, A., and Richards, W. 2000. Categorical representation and recognition of oscillatory motion patterns. In Proc. of the IEEE Conf. on CVPR, vol. 1, Hilton Head Island, SC, 628–635. IEEE Comp. Soc.
Cutler, R. and Davis, L. 2000. Robust real-time periodic motion detection, analysis, and applications. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22:781–796.
Plotnik, A. and Rock, S. 2002. Quantification of cyclic motion of marine animals from computer vision. In IEEE OCEANS Conference, 1575–1581.
Quek, F.K.H. and Bryll, R.K. 1998. Vector coherence mapping: A parallelizable approach to image flow computation. In ACCV, vol. II, Hong Kong, China, 591–598.
Quek, F., Ma, X., and Bryll, R. 1999. A parallel algorithm for dynamic gesture tracking. In ICCV'99 Workshop on RATFG-RTS, Corfu, Greece, 119–126.
Bryll, R. 2003. A Robust Agent-Based Gesture Tracking System. PhD thesis, Wright State University, CSE Dept.
Quek, F. 1994. Toward a vision-based hand gesture interface. In Virtual Reality Software and Technology Conference, Singapore, 17–29.
Quek, F. 1996. Unencumbered gestural interaction. IEEE Multimedia, 4(3):36–47.
Tsai, R. 1987. A versatile camera calibration technique for high accuracy 3D machine vision metrology using off-the-shelf TV cameras and lenses. IEEE Journal of Robotics and Automation, RA-3(4):323–344.
McCullough, K.-E. 1992. Visual imagery in language and gesture. In Annual Meeting of the Belgian Linguistic Society, Brussels, Belgium.
McNeill, D. 1992. Hand and Mind: What Gestures Reveal about Thought. Chicago: University of Chicago Press.
Pedelty, L. 1987. Gesture in Aphasia. PhD thesis, Department of Psychology, The University of Chicago.
Gabor, D. 1946. Theory of communication. Journal of the Institution of Electrical Engineers, 93:429–441.
Rao, R. and Bopardikar, A. 1998. Wavelet Transforms: Introduction to Theory and Applications. Addison-Wesley.
Goupillaud, P., Grossmann, A., and Morlet, J. 1984–1985. Cycle-octave and related transforms in seismic signal analysis. Geoexploration, 23:85–102.
Hall, P., Qian, W., and Titterington, D.M. 1992. Ridge finding from noisy data. Journal of Computational and Graphical Statistics, 1(1):197–211.
Carmona, R., Hwang, W., and Torresani, B. 1999. Multiridge detection and time-frequency reconstruction. IEEE Transactions on Signal Processing, 47:480–492.
Cascia, M.L., Sclaroff, S., and Athitsos, V. 2000. Fast, reliable head tracking under varying illumination: An approach based on registration of texture-mapped 3D models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22:322–336.
Boersma, P. and Weenik, D. 1996. Praat, a system for doing phonetics by computer. Tech. Rep. 132, Institute of Phonetic Sciences of the University of Amsterdam. Version 3.4. http://www.fon.hum.uva.nl/praat/.
Nakatani, C., Grosz, B., Ahn, D., and Hirschberg, J. 1995. Instructions for annotating discourses. Tech. Rep. TR-21-95, Center for Research in Computing Technology, Harvard University, Cambridge, MA.
Mallat, S. 1998. A Wavelet Tour of Signal Processing. San Diego: Academic Press.