Workshop on Modeling Human Communication Dynamics at NIPS 2010

An Instantaneous Correlation Algorithm for Assessing Intra and Inter Subject Coordination During Communicative Behavior

Adriano V. Barbosa, Martin A. Oberg, Rose-Marie Déchaine, Eric Vatikiotis-Bateson
Department of Linguistics, University of British Columbia
Vancouver, BC, V6T 1Z4
[email protected], [email protected], [email protected], [email protected]

Abstract

This paper outlines a fast method for assessing spatial and temporal coordination between signals. Instantaneous correlation is computed recursively by a linear filter operating either uni- or bi-directionally. The method is proving especially useful in analyzing communicative behavior, where temporal fluctuations are ubiquitous and may be crucial to characterizing coordination within and between communicating entities. This is exemplified briefly for the physiologically coordinated production behavior of two speakers.

1 Introduction

Events of interest within and across biological systems may be spatially and temporally coordinated in a variety of ways, but are rarely fixed-phase synchronous (pathologies such as stuttering, Parkinson's disease, and epilepsy are notable exceptions). For example, both singers and instrumentalists strive for synchronization during ensemble music performance. Less conscious is the behavioral entrainment that occurs when two people walk down the street together. Such examples give the impression of synchronization, but they really exemplify coordinated behavior that incorporates naturally occurring time-domain fluctuations. Treating the fluctuations as noise will generally lead the analysis to underestimate the coordination between the two signals. Furthermore, computing the spatiotemporal relation between signals at a specific phase offset, zero or otherwise, misses the crucial value of these temporal fluctuations in characterizing the observed coordination, as intimated in the visualization shown in Figure 1.

In Section 2, we set out the mathematical structure of our method for computing the instantaneous correlation between two signals. Section 3 extends the method by introducing the concept of a correlation map, which allows us to characterize the correlation as a function not only of time but also of the phase offset between the signals. In Section 4, we exemplify the application of the method to one of the domains our research addresses, multimodal speech processing and musical performance.

2 Assessing coordination from instantaneous correlation

At any point in time, the instantaneous correlation is computed based on previous samples and therefore reflects the relation between the signals' past histories. The instantaneous correlation coefficient ρ(k) between signals x(k) and y(k) can be defined as [1]

\rho(k) = \frac{S_{xy}(k)}{\sqrt{S_{xx}(k)\, S_{yy}(k)}} ,    (1)

where the instantaneous covariance S_xy(k) between signals x(k) and y(k) is computed as

S_{xy}(k) = \sum_{l=0}^{\infty} c\, e^{-\eta l} \left( x(k-l) - \bar{x}(k) \right) \left( y(k-l) - \bar{y}(k) \right) ,    (2)

and S_xx(k) and S_yy(k) are similarly defined. In Equation 2, η is a small positive number. We call this a unidirectional system because only past samples are used to compute S_xy(k) at any point in time. The weights decay exponentially over time such that older samples receive smaller weights. The same exponential decay is used to compute the instantaneous means x̄(k) and ȳ(k) as

\bar{x}(k) = \sum_{l=0}^{\infty} c\, e^{-\eta l}\, x(k-l) ,    (3)

\bar{y}(k) = \sum_{l=0}^{\infty} c\, e^{-\eta l}\, y(k-l) .    (4)
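To make equations 2-4 concrete, the following sketch evaluates the exponentially weighted sums directly by truncating the infinite sum once the weights become negligible. It is our illustration rather than the authors' code; the function name, the NumPy dependency, and the truncation length are assumptions, and the normalizing constant c is taken from Equation 5 below. The recursive formulation derived later in this section is what makes the method fast in practice.

import numpy as np

def direct_instantaneous_covariance(x, y, eta, k):
    # Direct (non-recursive) evaluation of equations 2-4 at time index k.
    # The infinite sum is truncated once the weights c * exp(-eta * l)
    # fall below roughly 5e-5; only past samples (l <= k) are available.
    c = 1.0 - np.exp(-eta)                     # normalizing constant (Equation 5 below)
    n_terms = min(int(np.ceil(10.0 / eta)), k + 1)
    lags = np.arange(n_terms)                  # l = 0, 1, ..., n_terms - 1
    w = c * np.exp(-eta * lags)                # exponentially decaying weights

    x_past = x[k - lags]                       # x(k - l)
    y_past = y[k - lags]                       # y(k - l)

    x_bar = np.sum(w * x_past)                 # Equation 3
    y_bar = np.sum(w * y_past)                 # Equation 4
    s_xy = np.sum(w * (x_past - x_bar) * (y_past - y_bar))   # Equation 2
    return s_xy, x_bar, y_bar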

The constant c in equations 2, 3 and 4 is given by

c = 1 - e^{-\eta} = 1 - a ,    (5)

where a = e^{-\eta}. Using the definitions in equations 3 and 4, Equation 2 can be rewritten as

S_{xy}(k) = \sum_{l=0}^{\infty} c\, e^{-\eta l}\, x(k-l)\, y(k-l) - \bar{x}(k)\, \bar{y}(k) .    (6)
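This rewriting relies on the fact that, with c = 1 - a, the weights sum to one, since \sum_{l=0}^{\infty} c\, e^{-\eta l} = (1-a) \sum_{l=0}^{\infty} a^{l} = 1. Expanding the product in Equation 2 and substituting equations 3 and 4 gives

\begin{aligned}
S_{xy}(k) &= \sum_{l=0}^{\infty} c\, e^{-\eta l}\, x(k-l)\, y(k-l)
            - \bar{x}(k)\,\bar{y}(k) - \bar{y}(k)\,\bar{x}(k) + \bar{x}(k)\,\bar{y}(k) \\
          &= \sum_{l=0}^{\infty} c\, e^{-\eta l}\, x(k-l)\, y(k-l) - \bar{x}(k)\,\bar{y}(k) ,
\end{aligned}

which is exactly Equation 6.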

By defining the exponential weighting function

h(k) = c\, e^{-\eta k}\, u(k) ,    (7)

where u(k) is the unit step function, equations 3, 4 and 6 can be rewritten in terms of convolution sums as

\bar{x}(k) = h(k) * x(k) ,    (8)

\bar{y}(k) = h(k) * y(k) ,    (9)

S_{xy}(k) = h(k) * [x(k)\, y(k)] - \bar{x}(k)\, \bar{y}(k) .    (10)

Thus, the signals in equations 8-10 can all be determined in terms of a linear filter whose impulse response is h(k). This is a first-order IIR (infinite impulse response) filter whose z-domain representation is

H(z) = \frac{1-a}{1 - a z^{-1}} , \quad |z| > a .    (11)

Letting F_u{·} denote the operation performed by this filter, Equation 10 can be rewritten as

S_{xy}(k) = F_u\{x(k)\, y(k)\} - F_u\{x(k)\}\, F_u\{y(k)\} ,    (12)

which shows that the instantaneous covariance S_xy(k) can be completely determined from the signals x(k) and y(k) in terms of the filter H(z). The filter's difference equation is

s(k) = a\, s(k-1) + (1-a)\, r(k) ,    (13)

where r(k) and s(k) are the filter's input and output, respectively. Thus, the filter's output at any point in time can be easily computed as a linear combination of the current input sample and the previous output sample. Computing the instantaneous covariance recursively (Equation 13) is much faster than non-recursive computation over potentially large numbers of samples for each time step (Equation 2 or Equation 6). This is a major advantage of the exponential decay convolution kernel (Equation 7) over others – rectangular, semi-Gaussian, semi-Hanning – whose output may be similar, but whose computation is more expensive because it cannot be expressed in simple recursive form. Furthermore, exponential decay occurs naturally across a wide range of systems and thus seems a quite logical way of forgetting the past.

In the bi-directional mode, the instantaneous covariance at any point in time is computed based on both previous and future samples. The instantaneous covariance is redefined by changing the summation limits in equations 2, 3 and 4 to (−∞, +∞), and from there a similar derivation follows [2].
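As an illustration of how the recursion can be implemented, the sketch below computes the unidirectional instantaneous correlation of equations 1, 12 and 13 by realizing H(z) with scipy.signal.lfilter. This is our reconstruction rather than the authors' released code; the function name, the SciPy dependency, and the small constant guarding the denominator are assumptions.

import numpy as np
from scipy.signal import lfilter

def instantaneous_correlation(x, y, eta, eps=1e-12):
    # Unidirectional instantaneous correlation (equations 1, 12 and 13).
    # H(z) = (1 - a) / (1 - a z^-1) is realized by scipy.signal.lfilter,
    # which applies the recursion s(k) = a * s(k-1) + (1 - a) * r(k).
    a = np.exp(-eta)
    b_coef = [1.0 - a]                  # numerator of H(z)
    a_coef = [1.0, -a]                  # denominator of H(z)

    def F_u(r):                         # the filtering operator F_u{.}
        return lfilter(b_coef, a_coef, r)

    # Equation 12, plus the analogous expressions for S_xx and S_yy
    s_xy = F_u(x * y) - F_u(x) * F_u(y)
    s_xx = F_u(x * x) - F_u(x) ** 2
    s_yy = F_u(y * y) - F_u(y) ** 2

    # Equation 1; eps guards against division by zero in the start-up transient
    return s_xy / np.sqrt(np.maximum(s_xx * s_yy, eps))

For instance, rho = instantaneous_correlation(x, y, eta=0.05) returns a correlation trace sampled at the same rate as the inputs; the bi-directional variant, obtained by extending the summation limits as described above, follows the derivation in [2] and is not sketched here.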


3 The 2D Correlation Map

Here we discuss computing and visualizing the instantaneous correlation between two signals across an arbitrary range of temporal offsets. The computation and the resulting two-dimensional (2D) correlation map afford quantitative assessment of coordination, which in biological behaviors is invariably subject to time-domain fluctuations. The 2D visualization of the correlation map represents the coordination between signals graphically across a user-determined range of temporal offsets. Using the visualization, the influence of the parameter η (see Equation 2) on the 2D correlation can be readily inspected and fine-tuned for specific data and analytic goals. By adjusting η (and, therefore, the filter cutoff frequency), the 2D visualization can be used to emphasize different aspects of the time-varying correlation. Decreasing η emphasizes the stability of the correspondence, while increasing it emphasizes the time-varying fluctuations.

The correlation map between signals x(k) and y(k) is computed as

\rho(k, d) = \frac{S_{xy}(k, d)}{\sqrt{S_{xx}(k, d)\, S_{yy}(k, d)}} ,    (14)

where S_xy(k, d) is the instantaneous covariance between signals x(k) and y(k − d), that is,

S_{xy}(k, d) = S\left( x(k),\, y(k - d) \right) .    (15)

In practice, a maximum offset d_max is defined and the correlation map is then computed over the entire offset range \{ d : -d_{max} \le d \le +d_{max} \}.
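Building on the unidirectional sketch above, the correlation map of equations 14 and 15 can be assembled by shifting one signal across the chosen offset range and stacking the resulting correlation traces. Again this is only an illustration under our assumed function names, and the edge handling for the shifted signal (here, padding with its end values) is an implementation choice the paper does not specify.

import numpy as np

def correlation_map(x, y, eta, d_max):
    # Correlation map rho(k, d) (equations 14 and 15).
    # Rows correspond to offsets d = -d_max ... +d_max (in samples),
    # columns to the time index k.  Uses instantaneous_correlation()
    # from the sketch in Section 2.
    offsets = np.arange(-d_max, d_max + 1)
    rho_map = np.empty((offsets.size, x.size))
    for i, d in enumerate(offsets):
        y_shifted = np.roll(y, d)       # y(k - d)
        if d > 0:                       # replace wrapped samples with edge values
            y_shifted[:d] = y[0]
        elif d < 0:
            y_shifted[d:] = y[-1]
        rho_map[i] = instantaneous_correlation(x, y_shifted, eta)
    return offsets, rho_map

Plotting rho_map with time on the abscissa, offset on the ordinate, and a blue-to-red color scale from −1 to +1 yields the kind of visualization shown in Figure 1.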

4 Application

Figure 1 shows the instantaneous correlation coefficient (both 1D and 2D) between two signals collected during an experiment in which two speakers attempt to synchronize repeated productions of one- or two-word sequences such as top, cop, or top-cop. Motions of each speaker's tongue, lip, and jaw were recorded electromagnetically along with the speech acoustics. In the particular example of Figure 1, speakers S1 and S2 were repeatedly producing the words cop and top, respectively, and the signals being correlated are speaker S1's tongue root position and speaker S2's tongue tip position. The bottom panel of the figure shows the 2D correlation map between these two signals, where the correlation coefficient is plotted as a function of both time (abscissa) and the temporal offset (ordinate) between the signals. Being a function of two variables, the correlation map is visualized on the plane as color ranging from dark blue (ρ = −1) to dark red (ρ = +1).

The 2D visualization of the correlation map provides an immediate overview of the correspondence between the observed behaviors within the chosen range of temporal offsets. For example, the continuous red band in Figure 1 fluctuates around the zero offset. This shows that the two input signals are continuously coordinated, but that the coordination varies over time through the 30 sec shown. This cannot be seen in the 1D correlation computed at zero offset in the panel above. Also, because correspondence is calculated and plotted across a user-determined range of temporal offsets, phase-shifted primary and/or secondary correlations can be identified. For example, the substantial spatiotemporal repetition that characterizes the rhythmic behavior of speech shows up in the 2D visualization as vertically arranged, alternating sequences of positive and negative correlations. This is a direct consequence of the fact that the cross-correlation between two periodic streams of simple word repetitions is also periodic, and can be clearly seen in Figure 1. The distance between the "correlation bands" depends on the periods of the input signals. This can be seen in Figure 1 when the two speakers slow their speaking rate (15-20 sec) and the dark correlation bands become more vertically separated.

Acknowledgments

Research supported by NSERC, SSHRC, and CFI grants. Our thanks to Catherine Best, Mark Tiede, Denis Burnham and the ARC (Australia) for magnetometer data collection for two speakers simultaneously.


[Figure 1 appears here. Panels, top to bottom: S1 audio, S2 audio, S1 tongue root (TR), S2 tongue tip (TT), 1D correlation, and the 2D correlation map plotted as Offset (s) against Time (s) on a color scale from ρ = −1 to ρ = +1.]

Figure 1: Correlation map between the tongue root (TR) position of speaker S1 (third panel) and the tongue tip (TT) position of speaker S2 (fourth panel) during a task in which the speakers try to synchronize their speech. Speaker S1 is repeatedly saying cop (first panel), whereas speaker S2 is repeatedly saying top (second panel). Panels are counted from top to bottom.

References

[1] R. M. Aarts, R. Irwan, and A. J. E. M. Janssen, "Efficient tracking of the cross-correlation coefficient," IEEE Transactions on Speech and Audio Processing, vol. 10, pp. 391-402, September 2002.

[2] A. V. Barbosa, H. C. Yehia, and E. Vatikiotis-Bateson, "Algorithm for computing spatiotemporal coordination," in Proceedings of the International Conference on Auditory-Visual Speech Processing – AVSP 2008, (Tangalooma, Australia), pp. 131-136, September 2008.
