© Kluwer Academic Publishers, Boston. Manufactured in The Netherlands.
Multimedia Tools and Applications, Kluwer Academic Publishers, Vol. 18, No. 2, pp. 91-113, August 2002 Also as VISLab Report: VISLab-00-03.
A Multimedia System for Temporally Situated Perceptual Psycholinguistic Analysis
FRANCIS QUEK†, ROBERT BRYLL†, CEMIL KIRBAS†, HASAN ARSLAN†, AND DAVID MCNEILL‡
[email protected]
† Vision Interfaces and Systems Laboratory, Wright State University, Dayton, OH
‡ Department of Psychology, University of Chicago
Abstract. Perceptual analysis of video (analysis by unaided ear and eye) plays an important role in such disciplines as psychology, psycholinguistics, linguistics, anthropology, and neurology. In the specific domain of psycholinguistic analysis of gesture and speech, researchers micro-analyze videos of subjects using a high quality video cassette recorder that has a digital freeze capability down to the specific frame. Such analyses are very labor intensive and slow. We present a multimedia system for perceptual analysis of video data using a multiple, dynamically linked representation model. The system components are linked through a time portal with a current time focus. The system provides mechanisms to analyze overlapping hierarchical interpretations of the discourse, and integrates visual gesture analysis, speech analysis, video gaze analysis, and text transcription into a coordinated whole. The various interaction components facilitate accurate multi-point access to the data. While this system is currently used to analyze gesture, speech and gaze in human discourse, the system described may be applied to any other field where careful analysis of temporal synchronies in video is important.
Keywords: Multimedia Data Visualization; Temporal Analysis; User Interface; Multiple, Linked Representation; Gesture Coding; Gesture, Speech and Gaze Analysis
D R A F T
March 23, 2000, 12:23am
1. Introduction
Perceptual analysis of video (analysis by unaided ear and eye) plays an important role in such disciplines as psychology, psycholinguistics, linguistics, anthropology, and neurology. In the specific domain of psycholinguistic analysis of gesture and speech, researchers micro-analyze videos of subjects using a professional quality video cassette recorder that has a digital freeze capability down to the specific frame. This is a painstaking and laborious task. In our own work on the integrated analysis of gesture, speech, and gaze (GSG), the labor intensiveness of such analysis is one of the key bottlenecks in our research. We have developed a multimedia system for perceptual analysis of GSG in video (henceforth MDB-GSG). This system, developed with attention to the interactive needs of psycholinguistic perceptual analysis, has resulted in at least a ten-fold increase in coding efficiency. Furthermore, MDB-GSG provides a level of access to GSG entities computationally extracted from the video and audio streams that facilitates new analysis and discoveries. In this paper, we shall discuss the task of perceptual analysis of video, the interactive model of our MDB-GSG system based on time situated multiple-linked and related representations, and the MDB-GSG system design and implementation.
2. Perceptual Analysis of Video
Psycholinguistic perceptual analysis of video typically proceeds in three iterations. First, the speech is carefully transcribed by hand, and then typed into a text document. The beginning of each linguistic unit (typically a phrase) is marked by the time-stamp of the beginning of the unit in the video tape. Second, the researcher revisits the video and annotates the text, marking co-occurrences of speech and gestural phases (rest-holds, pre-stroke and post-stroke holds, gross hand shape, trajectory of motion, gestural stroke characteristics etc.). The researcher also inserts locations of audible breath pauses, speech disfluencies, and other salient comments. Third, all these data are formatted onto a final transcript for psycholinguistic analysis. This is a painstakingly laborious process: analyzing about a minute of discourse takes a week to ten days.
[Figure 1 panels, top to bottom: Hand Movement along X-Direction (pixels; LH and RH traces); Hand Movement along Y-Direction (pixels; LH and RH traces, with the left and right hand rest positions marked); Audio Pitch (F0 value) with numbered F0 units aligned to the speech transcript; Speech Amplitude (RMS amplitude). Horizontal axis in all panels: frame number, 1 to 481. Dataset label: Sue C1, 1 of 2.]
Figure 1. Hand position, analysis, F0, transcript and RMS graphs for frames 1-481
[Figure 2 content: time stamp, transcribed speech, and gesture comments. F0 unit numbers (3 to 23) appear above the text in the original layout.]

00:16:46:28
# [ / so you're in the kitchen] [['n' then there's a s* /] [ / the back stairc*]
1. iconic; RH sl.spread B PTD moves AB from RC to hold directly above knees;
2a. deictic vector§ - aborted;; RH G takes off from end point of #1; points and moves RtoL across C; head tilts to look left in the same direction the hand is moving;
2b. deictic vector§ - aborted;; RH G takes off from end point of #2a; in one smooth motion: points to R while moving TB, curves to point & move up;

00:16:50:12
[oh I forgot to say]
emblem-ish; A-hand moves to sternum & holds;

00:16:51:10
[[when you come through the* / / # ]
iconicΩ - aborted;; BH/mirror 5-hands PTB 5-hands start together PTB @chest, move AB and apart to a more PTC/PTB position on either side of CC;

00:16:2:29
[[when you enter the ^house ^from the ^front / ] [annd you / openn the / doors / with t]][he*
1. iconic; BH/mirror 5-hands start together PTB @chest, move AB but do not move apart; 'superimposed' beats, including a big upward & forward head movement at the end on "front";
2. iconic - repeat & complete Ω; BH/mirror 5-hands PTB 5-hands start together PTB @chest, move AB to a mid-stroke hold, then move apart to PAB on either side of C @far L&R;

Figure 2. Sample first page of a psycholinguistic analysis transcript
In our current work [17, 16], we relate such analysis to speech prosody, three-dimensional traces of hand motion (plotted as x, y, and z displacements against time), three-dimensional traces of head motion, and three-dimensional (turn, nod, roll) traces of gaze orientation. The fundamental frequency (F0) envelopes are extracted using Entropic's Xwaves™ [1], transferred to a page layout package, and printed together with the hand and head motion/direction traces. These F0 plots are then numbered and related to the text manually. The addition of these steps on top of the traditional perceptual analysis makes such research even more labor intensive. Finally, we use a graphical page layout program to combine all the plots manually so that the time scales are aligned. This is essential to provide visualization of the data to support discovery. It typically takes an entire day to organize a set of data in this way. Figure 1 is an example of such plots. The outcome of the psycholinguistic analysis process is a set of detailed transcripts. We have reproduced the first page of the transcript for our example dataset in Figure 2. The gestural phases are annotated and correlated with the text (by underlining, bolding, brackets, symbol insertion etc.), F0 units (numbers above the text), and video tape time stamp (time signatures to the left). Comments
about gesture details are placed under each transcribed phrase. Each step of this analysis requires significant research labor.
3. Time Situated Multiple-Linked and Related Representations
This paper presents a multimedia system that is designed to reduce the labor of perceptual analysis, and to provide a level of analysis that heretofore has not been possible. Our goal is not to do away with expert perceptual analysis. Rather, we seek to provide higher level objects and representations and to mitigate the labor-intensiveness of analysis that has access only to the time stamp of the video signal. By providing direct access to computed entities (GSG plots, automatically segmented gesture units, speech prosody groupings etc.), the underlying video and audio, and other representations, the MDB-GSG system also enables a level of analysis heretofore not available to researchers.
An interactive system may be viewed as a conduit of communication between the human and the machine [9]. Modern psychology and linguistics theories of discourse stress the importance of maintaining a state of `situatedness' for communication to be successful [4, 3, 5]. Under this model, the user and the computer system maintain a stream of communication that keeps the user situated within an abstract interactive space. In the complex environment of multi-modal discourse analysis, this becomes all the more important. In our system, the key element of this user-system coordination is temporal situatedness. To motivate this situatedness, all the interface components are linked by their time synchrony. Each representation of the complex multi-modal space is focused on the same moment in time. To reinforce this mental model, we call this a time-portal through which we view the extended timeline of the subject's GSG behavior. Hence, this is an example of Multiple, Linked Representations (MLR) of dynamic components [7, 8, 15, 14] in which each representation reinforces the situatedness condition. Furthermore, each representation in the system is active, thereby enabling multiple-point access to the underlying data. We shall flesh out these concepts using the actual components of our MDB-GSG system as examples.
Temporal cohesion is especially critical in GSG analysis. The motor and speech channels are not subservient to one another, but spring from a common psychological (and neurophysiological) source. They proceed on independent pathways to the final observable behavioral output. The temporal cohesion of the GSG channels is thus governed by the constants of neurophysiology and psycholinguistics [10, 6, 12, 13]. The time portal metaphor provides a panoramic snapshot of the functioning of all modalities at synchronized time instants.
4. The MDB-GSG System
Figure 3 shows our MDB-GSG system with all the representational components open. The essence of our MLR strategy is that each of these components is synchronized with the current time focus. This means that each component is animated to track this time focus, and when it changes, all the system components change to reflect this. Not all components, however, have to be active at the same time. When they are inactive, the system `deregisters' them, and they are not updated. When a system component is opened (e.g. the avatar representation at the bottom left of Figure 3), it is registered with the system and linked with the current time focus. These system components will be considered in five groups:
The VCR-like interface and player
The hierarchical shot-based representation and editor
The animated strip chart representation
The synchronized speech transcription interface
The GSG abstraction representation or avatar
Each of these representational components was chosen to advance our psycholinguistic analysis and to support our computer vision and signal processing efforts. We shall motivate each interface element as it is discussed.
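The registration scheme described above resembles an observer arrangement around a shared time focus. The following Python sketch is only an illustration of that idea under our own assumptions: the class name `TimePortal`, its methods, and the callback signature are hypothetical and are not taken from the MDB-GSG implementation.

```python
from typing import Callable, Dict


class TimePortal:
    """Hypothetical sketch: holds the current time focus (a frame index)
    and notifies every registered component when it changes."""

    def __init__(self, n_frames: int) -> None:
        self.n_frames = n_frames
        self.focus = 0
        self._components: Dict[str, Callable[[int], None]] = {}

    def register(self, name: str, on_focus_change: Callable[[int], None]) -> None:
        """Called when a component is opened; it is linked to the current focus at once."""
        self._components[name] = on_focus_change
        on_focus_change(self.focus)

    def deregister(self, name: str) -> None:
        """Called when a component is closed; it is no longer updated."""
        self._components.pop(name, None)

    def set_focus(self, frame: int) -> None:
        """Move the time focus (clamped to the video) and update all open components."""
        self.focus = max(0, min(frame, self.n_frames - 1))
        for callback in self._components.values():
            callback(self.focus)


# Usage: two stand-in components that simply record the frames they are told to show.
video_frames, chart_frames = [], []
portal = TimePortal(n_frames=481)
portal.register("video", video_frames.append)
portal.register("strip_chart", chart_frames.append)
portal.set_focus(120)
portal.deregister("strip_chart")   # closed components stop receiving updates
portal.set_focus(121)
```

In this sketch any component may call `set_focus`, which is one way to obtain the multiple-point access described in the text; how the actual system routes such changes is not specified here.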
[Figure 3 screen labels: Hierarchical Keyframe Browser (with the Current Shot Keyframe); Hierarchy Editor Panel; VCR-Like Control Panel (with the Timeline Representation); Animated Strip Chart Panel (with the Current Time Focus); Digital Video Monitor; Text Transcription Interface (with the Synchronized Transcript Display); Avatar Representation.]
Figure 3. System Screen of the MDB-GSG Analysis System
4.1. VCR-Like Interface and Player
The panels labeled VCR-like Control Panel and Digital Video Monitor at the bottom right of Figure 3 provide a handle to the data using the familiar video metaphor. The MDB-GSG system is designed so that different virtual players may be plugged into the system. Currently, we have drivers to control a digital video (e.g. MJPEG, MPEG, QuickTime) player and two physical devices (a computer controlled Hi-8 video player, and a laser videodisc player). Drivers for other media such as DVD players can easily be added.
The functions of this control panel and video display are similar to other computer-based players except for several enhancements. The frame shown in the Digital Video Monitor is always the frame at the current time focus. As the video plays, the time focus changes accordingly. If the time focus is changed through some other interface component, the video player will jump to the corresponding video frame.
Our choice of MJPEG for the video is driven by the need for random frame access and reverse play. The media player in the Silicon Graphics media library is able to play both the video and audio tracks together both forward and in reverse at various speeds. This is important for coding the exact beginning of particular utterances in the audio because of the psychological effect where a listener perceives a word a fraction of a second after it is uttered. Humans hear coherent sounds as words, and perceive the words as they emerge from the mental processing. Hence, it is difficult to locate the exact synchronies of the beginning of gesture phrases and speech phrases when the audio is played forward at regular speeds. When the audio is played backward or at slow speeds, this effect is removed and the coder can perceive the synchronies of interest.
The circular loop button on the top row of the control panel toggles the `shot loop' mode. When this mode is set, the video player will keep looping through the current shot at the current level until play is halted. A `shot' is a video segment of significance to the GSG analysis. The loop mode permits the researcher to examine a particular GSG entity (e.g. a stroke) to identify its idiosyncrasies at various play speeds. The jump-to-start and jump-to-end (double arrows with a vertical terminal
bar) buttons at the right end of the top row allow the user to skip from shot to shot in the shot hierarchy (described in the next section). The Step Rev and Step Fwd buttons in the second row permit the researcher to step through the video a frame at a time (similar to the frame jog operation in a professional video player). This is important for micro gesture and gaze shift coding. The triple and quadruple directional arrow buttons play the video and audio backward and forward at different greater-than-realtime speeds. The Slow Rev and Slow Fwd buttons in the third row permit play at various fractions of the regular video rate with the accompanying audio. These rates (0.25, 0.5 and 0.75) are set via the pull-down menu at the top of the VCR-Like Control Panel. This, again, is important for detailed analysis and the coding of exact synchronies in the GSG signals. The Timeline Representation at the bottom of the VCR-Like Control Panel shows the offset of the current frame in the entire video. Consistent with the rest of the interface, the slider is strongly coupled to the other representational components. As the video plays in the VCR-like representation, the location of the current frame is tracked on the timeline. Any change in the current frame caused by interaction with either the visual summary representation or the hierarchical shot representation is reflected by the timeline as well. The timeline also serves as a slider control. As the slider is moved, all the other representational components alter to reflect this change. The number above the slider represents the percent offset of the current frame into the video.
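As a rough illustration of the frame stepping, shot looping, and percent offset behavior just described, consider the following Python sketch. The function names and the exact formula for the percent offset are our assumptions for illustration; the paper does not state how MDB-GSG computes them.

```python
from typing import Optional, Tuple


def percent_offset(frame: int, n_frames: int) -> float:
    """Offset of `frame` into the video as a percentage.
    Assumed convention: first frame is 0%, last frame is 100%."""
    if n_frames <= 1:
        return 0.0
    return 100.0 * frame / (n_frames - 1)


def next_frame(frame: int, step: int, n_frames: int,
               loop_shot: Optional[Tuple[int, int]] = None) -> int:
    """Advance one step (+1 forward, -1 reverse).

    With `loop_shot=(first, last)` the player wraps around inside the
    current shot (the `shot loop' mode); otherwise it stops at the ends
    of the video."""
    candidate = frame + step
    if loop_shot is not None:
        first, last = loop_shot
        if candidate > last:
            return first
        if candidate < first:
            return last
        return candidate
    return max(0, min(candidate, n_frames - 1))


# Usage: loop forward through a hypothetical shot spanning frames 61..90.
frame = 89
visited = []
for _ in range(3):
    frame = next_frame(frame, +1, n_frames=481, loop_shot=(61, 90))
    visited.append(frame)
```

Reverse looping uses the same function with `step=-1`, which mirrors the reverse play that the text identifies as important for locating speech onsets.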
Rationale
We have already discussed the importance of careful temporal analysis in the study of GSG cross-modal communication. Expert gesture and speech coders are well experienced in operating high-end VCRs in their analysis. The design choices in our video player (the VCR-like control panel and the timeline slider) are intended to build on the users' familiarity with professional quality VCR controls and to extend that control for temporal analysis.
[Figure 4 diagram: level 0 holds shots 1 to 8; shot 2 has subshots 2.1 to 2.4 and shot 6 has subshots 6.1 to 6.3 (level 1); shot 2.3 has subshots 2.3.1 to 2.3.3 (level 2). The current shot is marked at levels 0, 1, and 2.]
Figure 4. Shot Hierarchical Organization
4.2. Hierarchical Shot-Based Representation and Editor
The panels labeled Hierarchical Keyframe Browser and Hierarchy Editor Panel facilitate the organization of the video stream into a nested hierarchical structure, and the visualization of this hierarchy in summary format.
4.2.1. Shot Architecture
Before we proceed, we need to define several terms to facilitate our discussion. A video sequence can be thought of as a series of video frames. These frames can be organized into shots. We define a shot as any sequential series of video frames delineated by a first and a last frame. These shots may be organized into a nested hierarchy as illustrated in Figure 4. In the figure, each shot at the highest level (we call this level 0) is numbered starting from 1. Shot 2 is shown to have four subshots which are numbered 2.1 to 2.4. These shots are said to be in level 1 (as are shots 6.1 to 6.3). Shot 2.3 in turn contains three subshots, 2.3.1 to 2.3.3. Shot 2 spans its subshots. In other words, all the video frames in shots 2.1 to 2.4 are also frames of shot 2. Similarly, all the frames of shots 2.3.1 to 2.3.3 are also in shot 2.3. Hence, the same frame may be contained in a shot in each level of the hierarchy. One could, for example, select shot 2 and begin playing the video from its first frame. The frames would play through shot 2, and eventually enter shot 3. If one begins playing from the first frame of shot 2.2, the video would play through shot 2.4 and proceed to shot 3.
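The nested shot structure can be sketched as a small recursive data type. The Python below is an illustrative sketch only: the `Shot` class, its field names, and the frame ranges in the example are hypothetical, chosen to mirror the numbering scheme of Figure 4 rather than the actual MDB-GSG data structures.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Shot:
    """A shot is delineated by a first and a last frame and may hold subshots."""
    first: int
    last: int
    subshots: List["Shot"] = field(default_factory=list)


def containing_labels(shots: List[Shot], frame: int, prefix: str = "") -> List[str]:
    """Return the label of the shot containing `frame` at each level,
    top level first (e.g. ["2", "2.3", "2.3.3"]). Shots are numbered from 1
    within their parent, as in Figure 4."""
    for number, shot in enumerate(shots, start=1):
        if shot.first <= frame <= shot.last:
            label = f"{prefix}{number}"
            return [label] + containing_labels(shot.subshots, frame, label + ".")
    return []


# Usage with made-up frame ranges: shot 2 has four subshots, and 2.3 has three.
video = [
    Shot(0, 49),
    Shot(50, 169, [
        Shot(50, 79),
        Shot(80, 109),
        Shot(110, 139, [Shot(110, 119), Shot(120, 129), Shot(130, 139)]),
        Shot(140, 169),
    ]),
    Shot(170, 199),
]
labels = containing_labels(video, 135)   # frame 135 lies in 2, 2.3 and 2.3.3
```

The example shows the point made in the text: a single frame belongs to one shot at every level of the hierarchy that covers it.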
[Figure 5 diagram: linked shot records, each with fields labeled F, K, L, and P; unused links are marked null.]
Figure 5. Shot Hierarchy Data Model
Next, we define a series of concepts which define the temporal situatedness of the system. The user may select any shot to be the current shot. The video system will play the frames in that shot, and the frame being displayed at any instant is known as the current frame. These are dynamic concepts since the current frame and current shot change as the video is being played. Suppose we are at level 2 of the hierarchy and select shot 2.3.3 as the current shot (shown as a shaded box in Figure 4). Shot 2.3 at level 1 and shot 2 at level 0 would conceptually become the current shot at those levels. This could confuse the user, and hence we introduce the concept of the current level. At any moment in the interface, the system is situated at one level, and only the shots at that level appear in the visual keyframe summary representation. In our current example, the system would be in level 2 although the current frame is also in shot 2.3 and shot 2 in levels 1 and 0 respectively. Figure 5 shows the data model in our shot hierarchy. Each shot defines its first and final frames within the video (represented by F and L respectively) and its keyframe for use in the visual summary. Each shot unit is linked to its predecessor and successor in the shot sequence. Each shot may also comprise a list of subshots. This allows the shots to be organized in a recursive hierarchy. The shot data model is essentially a structure containing the F and L indices into a
video. Hence, we may maintain multiple shot hierarchies into the same video/audio sequence. This is important for GSG analysis since natural communication typically contains multiple overlapping semantic threads.
4.2.2. Keyframe-Based Visual Summary
The use of the keyframe representation as a visual summary for video segments has proven surprisingly effective [18, 2]. For GSG segmentation, a keyframe representation of each video segment permits the researcher to see the general hand configurations and gaze directions represented in the keyframe. The system takes the first frame of each shot/subshot as the default keyframe, but the user may substitute this with any frame of her choice through the Hierarchy Editor Panel. Figure 3 shows the standard keyframe representation in the top left corner. Each shot is summarized by its keyframe, and the keyframes are displayed in a scrollable table. The keyframe representing the "current shot" is highlighted with a boundary. A shot can be selected as the current shot from this interface by selecting its keyframe with the computer pointing device. In accordance with the MLR strategy, the current time focus is set to the beginning of the shot, and all other interface components are updated (the position of the current shot in the shot hierarchy appears in the shot hierarchy representation, the first frame of the shot appears as the "current frame" in the display of the Digital Video Monitor, and the timeline representation is updated to show the offsets etc.). The video can be played using the VCR-Like Control Panel. When the video is being played, the current keyframe highlight boundary blinks. When the current frame advances beyond the selected shot, the next shot becomes the current shot, and the highlight is moved to the new current shot. Figure 6 shows the keyframe browser with a larger keyframe presentation than that in Figure 3. Our MDB-GSG implements three keyframe sizes that are generated dynamically from the MJPEG video to permit the user to trade off between keyframe resolution and the number of shots concurrently visible. The two text entry boxes at the bottom of Figure 6 permit the user to enter textual annotation for the current shot. The user enters a short label for the shot in the smaller box on
Figure 6. Keyframe Visual Summary Representation with Annotation Boxes
the left and a more complete textual description in the box on the right. The label entry is used to provide a textual synopsis and the description provides detail. This hierarchical structure is an effective means of representing nested semantic discourse structures. However, this may be insufficient to represent discourse models in which multiple concurrent semantic threads are pursued through the discourse. While each thread may be amenable to hierarchical analysis, the multiple threads taken together are not. In our shot hierarchy architecture, each shot object stores only the frame indices and its position within one hierarchy. This economy of representation permits us to impose multiple hierarchies on the same discourse video. These multiple hierarchies are synchronized through a single time portal with a single current time focus.
4.2.3. Shot Hierarchy Editor
The Hierarchy Editor Panel in the top right of Figure 3 is designed to allow the user to navigate, to view, and to organize video in our hierarchical shot model shown in Figure 4. It comprises three sub-panels. The one on the left labeled Shot Editing permits the construction and deletion of shots from the video stream. This panel allows the user to create new shots by setting the first and last frames in the shot and capturing its keyframe. When the "Set Start" or "Set End" buttons are pressed, the current frame (displayed in the Digital Video Monitor) becomes the start or end frame respectively of the new shot. The default keyframe is the first frame in the new shot, but the user can select any frame within the shot by pausing the video display at that frame and activating the "Grab Keyframe" button. The middle sub-panel labeled Subshot Editing permits the user to organize shots in a hierarchical fashion. This sub-panel is context sensitive. The buttons that represent currently unavailable operations are blanked. As the button icons in this panel indicate, it permits the deletion of a subshot sequence (the subshot data remains within the first and last frames of the supershot, which is the current shot). The "promote subshot" button permits the current shot to be replaced by its subshots, and the "group to subshot" button permits the user to create a new shot as the current shot and make all shots marked in the Hierarchical Keyframe Browser window its subshots. The blank button at the top of the subshot editing panel is a
"Create Subshots" button. In the example shown, the current shot already has its subshot list, so this button is disabled and blanked. The rightmost sub-panel labeled Subshot Navigation displays the ancestry of the current shot (the shot labels entered by the user in the Hierarchical Keyframe Browser are listed) and permits navigation up and down the hierarchy. The "Down" button in this panel indicates that the current shot has subshots, and the user can descend the hierarchy by clicking on the button. If the current shot has no subshots, the button becomes blank. Since the current shot in the figure (labeled as "L1/C (WOMBATS)") is a top level shot, the "Up" button (above the "Down" button) is left blank. The user can also navigate the hierarchy from the Hierarchical Keyframe Browser window. If a shot contains subshots, the user can click on the shot keyframe with the right mouse button to descend one level into the hierarchy. The user can ascend the hierarchy by clicking on any keyframe with the middle mouse button. These button assignments are, however, arbitrary and can be replaced by any mouse or key-chord combinations. The hierarchical shot representation panel also permits the user to hide the current shot or a series of marked shots in the visual summary display. This permits the user to remove video segments that do not participate in the particular discourse structure of interest in a study. The hide feature can, of course, be switched off to reveal all shots.
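The "promote subshot" and "group to subshot" operations described for the Subshot Editing sub-panel can be illustrated on a list of shots at one level. This Python sketch rests on our own assumptions: the `Shot` type, the function names, and the restriction that the marked shots form one contiguous run are hypothetical and are not taken from the MDB-GSG code.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Shot:
    label: str
    first: int
    last: int
    subshots: List["Shot"] = field(default_factory=list)


def promote_subshots(level: List[Shot], index: int) -> List[Shot]:
    """Replace the shot at `index` by its subshots (unchanged copy if it has none)."""
    shot = level[index]
    if not shot.subshots:
        return list(level)
    return level[:index] + shot.subshots + level[index + 1:]


def group_to_subshots(level: List[Shot], start: int, end: int, label: str) -> List[Shot]:
    """Create a new shot spanning the marked shots level[start..end]
    (assumed contiguous) and make them its subshots."""
    marked = level[start:end + 1]
    if not marked:
        raise ValueError("no shots marked")
    parent = Shot(label, marked[0].first, marked[-1].last, list(marked))
    return level[:start] + [parent] + level[end + 1:]


# Usage: group shots "b" and "c" under a new shot, then undo it by promotion.
level0 = [Shot("a", 0, 9), Shot("b", 10, 19), Shot("c", 20, 29)]
grouped = group_to_subshots(level0, 1, 2, "bc")
restored = promote_subshots(grouped, 1)
```

Both functions return new lists rather than editing in place, a choice made here only to keep the sketch easy to reason about; under that choice, promoting a freshly grouped shot restores the original level.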
Rationale
In our work on discourse analysis, we employ such discourse structure models as that of Grosz and colleagues [11] to parse the underlying text transcripts. The method consists of a set of questions with which to guide analysis and uncover the speaker's goals in producing each successive line of text. Such discourse models are amenable to hierarchical representation. We compare this structure against the discourse segments inferred from the objective motion patterns shown in the gesture and gaze modalities [17, 16]. Our Hierarchical Keyframe Browser and the underlying nested shot architecture directly support such discourse patterning. The Keyframe Browser spatializes the time dimension so that the user may view the discourse units as a hierarchy of keyframes. Each keyframe is a `snapshot' view of the gestural morphology of the corresponding discourse element, and serves as a memory cue of
the gestural and gaze configuration for the coder. The Shot Hierarchy Editor is the tool for the coder to add discourse segmentation and hierarchy information to the data. The shot labeling facility is used by coders to annotate each discourse segment with psycholinguistic observations. As will be described later, these annotations are used to generate text transcript formats with which psycholinguistics researchers are familiar.
4.3. Strip Chart Abstraction of Communicative Signal
The Animated Strip Chart Panel at the left middle of Figure 3 provides the user access to the computed GSG entities. The user may select any signal to be displayed in a pane in this panel. In the figure, the x-position trace describing the motion of both of the subject's hands is in the top pane, and her voiced utterances are displayed as the fundamental frequency F0 plots in the lower pane. Each pane may be displayed in Huge, Normal and Compressed resolutions in the y-dimension. The x-dimension of the plots is time expressed in terms of video frame offset into the video. The time scale may be displayed in 3 resolutions: Small, Normal (as shown) and Large (each successive scale being twice the previous). The red line down the middle of the plots represents the current time focus. When the video plays, this line stays in the middle of the panel, and the plots animate so that the signal points under this current time focus always represent the signal at that instant. The user is able to drag the plots right and left by pulling the time scale at the bottom of the panel in the desired direction with the middle mouse button depressed. All other MLR interaction components will animate to track this change (for practical reasons, audio does not play when this happens). The user may also bring any point of the graph immediately to the current time focus line by clicking on it with the left mouse button. If the mouse is in any portion of the Animated Strip Chart Panel other than the time scale, the middle and right mouse buttons will toggle forward and reverse play respectively at the current speed (set using the VCR-Like Control Panel). This feature was added because the psycholinguists wanted rapid control of the video playing without having to move to the VCR-Like Control Panel repeatedly.
The user may select any available plot to be displayed in any pane. All that the system needs to represent a plot is for its name to be registered with the system, and for an ASCII file containing a list of points to be provided. Although there is no theoretical limit to the number of plots in this scrollable panel, the number is limited by the pragmatic concerns of screen real-estate (there is little point in animating all plots when at most 5 may be seen at any time) and processor speed (how many plots the system can animate before performance suffers). Our current limit is 10 animated plots at any time.
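To make the strip chart behavior concrete, the Python sketch below shows one way to read a plot from an ASCII list of points, to extract the window of samples that keeps the current time focus in the middle of the pane, and to map a mouse click to a frame. The file format (one numeric value per line, one line per video frame), the function names, and the pixel mapping are assumptions made for illustration; the paper does not specify them.

```python
from typing import List, Optional


def parse_points(text: str) -> List[float]:
    """Parse an ASCII list of points.
    Assumed format: one numeric value per line, line i holding the value for frame i."""
    return [float(line) for line in text.splitlines() if line.strip()]


def visible_window(signal: List[float], focus: int,
                   half_width: int) -> List[Optional[float]]:
    """Samples for frames focus-half_width .. focus+half_width.
    Frames outside the signal yield None, so the sample at the current
    time focus is always the middle element of the result."""
    return [signal[f] if 0 <= f < len(signal) else None
            for f in range(focus - half_width, focus + half_width + 1)]


def frame_at_pixel(x: int, panel_width: int, focus: int,
                   frames_per_pixel: float) -> int:
    """Frame under horizontal pixel `x` when the focus line sits at the panel centre.
    Doubling `frames_per_pixel` models switching to the next coarser time scale."""
    return focus + round((x - panel_width // 2) * frames_per_pixel)


# Usage: a short made-up trace, viewed with the time focus at the first frame.
trace = parse_points("10\n20\n30\n40\n50\n")
window = visible_window(trace, focus=0, half_width=2)
clicked = frame_at_pixel(300, panel_width=400, focus=100, frames_per_pixel=0.5)
```

Clicking on a point, as described in the text, would then amount to setting the current time focus to the frame returned by `frame_at_pixel`, after clamping it to the length of the video.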
Rationale
This representational component has proven invaluable in our research into GSG. First, it provides the researcher with a `god's eye view' into the video time line so that she can conceptualize about the GSG entities represented beyond the immediacy of the current point being played. This has helped immensely to speed up the psycholinguistic coding. Second, since it is trivial to change and add signals to the display, different extracted time-based traces may be displayed in this way. This has been extremely useful in the development of algorithms to process the video and segment the GSG signals. In our work in `deconstructing' the hand motion traces into atomic `strokelet' motion units, for example, we simply generate a signal stream that has value spikes at the `strokelet' transition points and is zero elsewhere. This allows us to evaluate the effectiveness of our segmentation perceptually with access to the underlying video and audio through the interface. This system malleability directly supports our reciprocal cross-disciplinary research strategy. Psycholinguists provide perspective and analysis to guide the engineering efforts in audio, video, and signal processing. The engineering team provides access to GSG signals and entities, and the tools to access and visualize them. The MDB-GSG system provides the locus of integration and interaction among researchers from both disciplines.
4.4. Transcription Interface
The Transcription Interface shown in Figure 7 consists of a text display and editing area (the Transcript Display Pane), a status display, a set of control buttons, and a pull down menu. These provide access to, and manipulation of, the temporal properties and content of the underlying syntax of the subject's speech. The speech is first transcribed to text manually to obtain a preliminary ASCII text file that may be organized and refined using the MDB-GSG system. When this text transcript is registered with the system, it is indexed and displayed in the Transcript Display Pane. The Transcription Interface has three modes of operation: Time, Edit, and Playback. In Time mode, the user associates text entities with timestamps; in Edit mode, the user may modify the underlying transcript text; and in Playback mode, the MDB-GSG system animates the text to track the current time focus. The mode of operation is selectable from the 'Mode' pull down list. The default mode of operation is Playback, and the system returns to Playback mode whenever Time or Edit mode is terminated.
Figure 7. The Transcription Interface
4.4.1. Transcript and Associated Representation In our system, the transcription is maintained in two different files. The transcription text (and other embedded information) is maintained as a straight ASCII file. The transcription is divided
into 'separable entities' in the form of alphabetic strings that are delineated by separators (white space or special characters). These entities are represented as a list in the Tagged Transcription File. Each item in the Tagged Entity List may be associated with a time stamp that is synchronized with the rest of the database. The time stamp represents the onset of voicing of the particular transcript entity. These time stamps, therefore, describe a set of intervals between successive tagged entities in the list. If a time stamp is not assigned to a particular entity, that entity is said to belong to the interval between the last previously tagged and the next subsequently tagged entities in the list. Besides white space, our system provides for other delimiters, such as a period between two alphabetic character strings with no spaces. This permits the separate tagging of different syllables or phones within a single word. For example, the word "Transcription" may be stored in the ASCII Transcription File as "Tra.ns.crip.tion". This is represented as a sublist of four items, "[tra]-[ns]-[crip]-[tion]", in the Tagged Transcription File, allowing the separate tagging of each item. The user may also add comments to the ASCII Transcription File. Following programming convention, comment lines in the file begin with a semi-colon immediately following the preceding line break. Comment text is ignored in the Tagged Entity List, as are all text delimiters.
4.4.2. Time Mode Operation Transcription typically begins with an untagged
text transcript file that is generated manually by a transcriber viewing the experiment video tape. This file is imported into the system and forms the basis for the ASCII Transcription File. Upon importation, the MDB-GSG system parses the ASCII Transcription File to produce the list of separate entities that are initially untagged. This forms the basis of the Tagged Entity List. In Time mode operation, an 'Accept' button appears next to the mode status indicator. The user may mark a text entity to associate it with the current time focus. This time tag is entered into the Tagged Entity List when the 'Accept' button is pressed or when the user hits the return key on the keyboard. The system checks that the time tags entered are temporally ordered (i.e. items nearer the front of the list have earlier time stamps), and flags ordering errors for user correction. Since the only criterion we use for indexing the textual entities is a parse based
on word separators (spaces, tabs, and line breaks), the user may enforce a syllable and phone level parsing by inserting in-word separators. Once a time stamp is associated with a text entity, the system automatically highlights the next text entity to be associated with a new time stamp. The user may of course highlight some other text entity using either the mouse or the cursor control keys on the keyboard. Not all words need to be time-annotated. If the first words at the beginnings of two consecutive phrases are annotated, the system associates all the words from the first annotated word up to the one before the second annotated word with the duration between the time indices. This allows sentence-, phrase- and word-level analyses using the MDB-GSG system.
4.4.3. Edit Mode Operation In Edit mode operation, the user may modify the
underlying transcript text (add missed words, correct transcription errors, add audible breath pauses, etc.). When the user selects 'Edit' from the pull down mode menu, an edit session begins. A pair of buttons labeled 'Done' and 'Cancel' appears next to the mode status indicator. The user may add or delete words (or word fragments) in the tagged transcription list by changing the text in the Transcript Display Pane directly. In Edit mode, in-word separators and white space may also be added. Edit mode is exited when the user activates either the 'Done' or the 'Cancel' button. If 'Cancel' is selected, the changes made in the editing session are discarded. If the user selects 'Done', the changes are parsed and incorporated into the Tagged Entity List. Upon exiting Edit mode operation, Playback mode is automatically activated.
4.4.4. Synchronized Transcript Playback In Playback mode, the Transcript Display Pane serves as a synchronized playback display. Once the text is time annotated, the word[s] associated with the current time focus in the Transcript Display Pane are highlighted in synchrony with all the other MLR interaction components. When the video plays, this highlighting animates and the Transcript Display Pane scrolls to keep the current time focus text within the pane. When the current time focus is changed in any other interface component, the appropriate scrolling and
highlighting take place. The user is also able to select any word in the Transcript Display Pane and make its time index the current time focus of the entire system. The user may choose either to show or hide comment lines and within-word delimiters in the Transcript Display Pane. Since the Transcript Display Pane highlights the appropriate word[s] and scrolls during playback, the user may activate Time or Edit mode to modify and refine the time annotation or transcript text at any point during playback.
4.4.5. Importing and Exporting ASCII Transcript Files Since the transcript is
an ASCII text file, it may be prepared and edited independently outside the system, and then imported. Text may also be exported. During the importation process, it is critical that the imported file be reassociated with any existing Tagged Transcription File (if none exists, the system simply creates a new one). The imported ASCII Text File is parsed to produce a new tagged transcription list that is compared against the old list. If the only changes are added comments and formatting (the most common situation), the two lists are identical, and the time tags are simply transferred to the new list. If the underlying text entities are changed, the MDB-GSG system highlights these changes to bring them to the attention of the user for time tagging. Once the lists have been fully resolved, the new list is saved as the Tagged Transcription File.
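The resolution step may be sketched as follows. This is an illustration under stated assumptions, not the algorithm used in the system: the text says only that the new list is compared against the old one, so the alignment here is delegated to Python's standard difflib.SequenceMatcher, and the function name resolve_import and the (entity, frame tag) tuple layout are hypothetical.

```python
from difflib import SequenceMatcher
from typing import List, Optional, Tuple

Tagged = Tuple[str, Optional[int]]  # (entity text, frame tag or None); assumed


def resolve_import(old: List[Tagged],
                   new_entities: List[str]) -> Tuple[List[Tagged], List[int]]:
    """Carry time tags from the old list over to a re-imported transcript.

    Returns (new_tagged_list, changed_indices). changed_indices are the
    positions in the new list with no counterpart in the old list; these
    are the entities to bring to the user's attention for time tagging.
    """
    old_entities = [text for text, _ in old]
    new_tagged: List[Tagged] = [(text, None) for text in new_entities]
    changed: List[int] = []
    matcher = SequenceMatcher(a=old_entities, b=new_entities, autojunk=False)
    for op, a0, a1, b0, b1 in matcher.get_opcodes():
        if op == "equal":
            # Unchanged run: transfer the existing time tags one for one.
            for k in range(a1 - a0):
                new_tagged[b0 + k] = (new_entities[b0 + k], old[a0 + k][1])
        elif op in ("replace", "insert"):
            changed.extend(range(b0, b1))
        # "delete": entities dropped from the new file; their tags are lost.
    return new_tagged, changed


old_list = [("okay", 55), ("what", 86), ("we", None), ("need", None)]
same, flagged_same = resolve_import(old_list, ["okay", "what", "we", "need"])
edited, flagged_edit = resolve_import(
    old_list, ["okay", "so", "what", "we", "need"])
```

When only comments and formatting were added off-line, the entity lists are identical and every tag is transferred with nothing flagged, which is the common case described above.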
Rationale
Since text-level manual transcription using high quality frame-accurate VCRs is the way psycholinguistic GSG research is traditionally done, there was much interaction with our psycholinguist colleagues in the design of this sub-system. The 'within-string' delimiters were incorporated because they were deemed essential to the transcription process. Similarly, the addition of the commenting capability was motivated by the research need to add scientific observations in the commentary. The comments and formatting also permit the researcher to use indented formatting to represent discourse-level structure. For this reason, the ASCII Transcription File leaves the white space formatting of the transcript intact, and the system displays this in the Transcript Display Pane.
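The transcript representation described in this section (separable entities, within-string delimiters, comment lines, and onset time tags) can be sketched in a few lines of Python. The function names, the use of frame numbers as time tags, the treatment of apostrophes as part of an entity, and the rule that tags must be strictly increasing are all assumptions made for illustration.

```python
import re
from typing import List, Optional, Tuple

# Assumed: an entity is a run of letters (apostrophes kept, as in "we're").
ENTITY = re.compile(r"[A-Za-z']+")


def parse_entities(ascii_transcript: str) -> List[str]:
    """Split transcript text into separable entities.

    Lines starting with ';' are comments and are skipped. White space,
    periods, and other non-alphabetic characters act as separators.
    """
    entities: List[str] = []
    for line in ascii_transcript.splitlines():
        if line.startswith(";"):
            continue
        entities.extend(m.group(0) for m in ENTITY.finditer(line))
    return entities


def ordering_errors(tags: List[Optional[int]]) -> List[int]:
    """Return indices of time tags that do not follow the preceding tags."""
    errors: List[int] = []
    latest: Optional[int] = None
    for i, tag in enumerate(tags):
        if tag is None:
            continue
        if latest is not None and tag <= latest:
            errors.append(i)
        else:
            latest = tag
    return errors


def highlighted_span(tags: List[Optional[int]],
                     frame: int) -> Optional[Tuple[int, int]]:
    """Return (first, last) entity indices of the interval containing frame.

    Untagged entities belong to the interval opened by the last tagged
    entity before them. Assumes the tags are already correctly ordered.
    """
    start: Optional[int] = None
    for i, tag in enumerate(tags):
        if tag is None:
            continue
        if tag <= frame:
            start = i
        else:
            return None if start is None else (start, i - 1)
    return None if start is None else (start, len(tags) - 1)


words = parse_entities("; a comment line\nTra.ns.crip.tion is  we're\n")
tags = [55, 86, None, None, None, None, 134]  # okay | what we need to do | is
```

With these tags, a current time focus of frame 100 selects entities 1 through 5, so a phrase tagged only at its first word is highlighted as a unit during playback.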
The capability to import and resolve new ASCII Transcription Files is also driven by the working style of our research's primary critical resource: the expert psycholinguistic researcher. The ability to export the ASCII Transcription File for editing on a standard word processor and to reimport the edited result allows the researcher to analyze the text without being tied to the workstation that runs the MDB-GSG system. Since it is much easier to do time tagging and syllable and phone level parsing (these depend on the content of the video and audio tracks) on the MDB-GSG system, such off-line editing is typically done to add comments and indentation structure. Since such operations leave the Tagged Transcription File unchanged, minimal labor is required to resolve the new ASCII Transcription File with the existing MDB-GSG representation. The resolution process often serves as a debugging operation to remove inadvertently modified text (e.g. comments added without the commenting flag).
4.5. Avatar Abstraction
The Avatar Representation at the bottom left of the screen displays an animated avatar that moves in synchrony with the current time focus. In our current GSG work, we image the subjects using three cameras (two calibrated stereo, one closeup on the head) to extract the three-dimensional velocities and positions of the hands, the three-dimensional position of the head, and the head gaze direction in terms of the turn, nod, and roll angles. These values are fed into the avatar simulation that plays in synchrony with the rest of our MDB-GSG interface. This provides essentially a simulated analog of the subject in the Digital Video Monitor displaying only the signal dimensions extracted.
Rationale
This avatar serves three purposes to support GSG research. First, it is not restricted to a particular viewpoint. For example, the simulation may be rotated to give the user a top-down view of how the hands move toward and away from the body. A top-down view also aids the examination of the direction of gaze in terms of the head 'turn' alone. This provides a better understanding of how the subject is structuring her gestural space in the 'z-direction'. Second, it permits
researchers to see the communicative effects of each extracted signal. For example, we ran the avatar with a constant z-dimension to see the effect of depth on how one perceives a gesticulatory stream when the hand motions are constrained to a plane in front of the subject. We also saw the effectiveness of head and gaze direction in giving the avatar a sense of communicative realism by disabling that signal. We expect that this avatar interaction will also provide insight into the effects of slight dissynchronies in speech and gesture, or the removal of different gestural components. We cannot do this with the original video, but it is trivial with the avatar. Third, the avatar permits us to do a qualitative evaluation of our hand and head tracking algorithms. Since our concern is not absolute position but conversational intent, the avatar facilitates a quick evaluation of the effectiveness of our extraction algorithms in comparison with the original video.
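The signal manipulations mentioned above (holding z constant, disabling the head and gaze signal) amount to simple transformations of the per-frame data that drives the avatar. A minimal sketch follows; the AvatarPose record, its field names, and its units are hypothetical, and hand velocities are omitted for brevity.

```python
from dataclasses import dataclass, replace
from typing import Tuple

Vec3 = Tuple[float, float, float]


@dataclass(frozen=True)
class AvatarPose:
    """Hypothetical per-frame record of the extracted signals."""
    left_hand: Vec3   # (x, y, z) position; z is depth toward the viewer
    right_hand: Vec3
    head: Vec3
    gaze: Vec3        # (turn, nod, roll) angles; degrees assumed


def flatten_depth(pose: AvatarPose, z: float = 0.0) -> AvatarPose:
    """Constrain both hands to a plane of constant z before the subject."""
    lx, ly, _ = pose.left_hand
    rx, ry, _ = pose.right_hand
    return replace(pose, left_hand=(lx, ly, z), right_hand=(rx, ry, z))


def disable_gaze(pose: AvatarPose) -> AvatarPose:
    """Remove the head gaze signal by fixing all three angles at zero."""
    return replace(pose, gaze=(0.0, 0.0, 0.0))


pose = AvatarPose(left_hand=(0.1, 0.2, 0.3), right_hand=(0.4, 0.5, 0.6),
                  head=(0.0, 1.5, 0.0), gaze=(10.0, -5.0, 2.0))
planar = flatten_depth(pose, z=0.5)
no_gaze = disable_gaze(pose)
```

Because each transformation returns a new record and leaves the extracted data untouched, the original and the modified streams can be played side by side for perceptual comparison.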
5. Transcript Generation
The MDB-GSG system permits researchers to analyze and organize multi-modal discourse by applying specific psycholinguistic models. Hence, the resulting segmentation, structure, transcription and annotation are the intellectual product of the analysis. This product may be accessed and communicated in two ways. First, the MDB-GSG system itself provides multimedia access to these GSG entities. The database associated with a particular analysis may be loaded for perusal and query. Different databases may be generated on the same discourse data to reflect different discourse organization theories and methodologies. Second, the MDB-GSG system is able to produce a text transcript from the analysis. Figure 8 shows a fragment of such a transcript. The hierarchical organization of the transcript derives automatically from the shot hierarchy along with its labels and annotation. The speech transcription text associated with each item in this hierarchy derives directly from the time-annotated speech transcript.
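Each item in the generated transcript of Figure 8 carries a time code range and a frame range. The values in the figure are consistent with a non-drop-frame time code at 30 frames per second, with frame numbers counted from the start of the clip; this is an inference from the figure, not something the text states. Under that assumption, the conversion can be sketched as follows (the function names and the one-line header layout are illustrative only).

```python
FPS = 30  # assumed non-drop-frame rate; matches the values in Figure 8


def timecode_to_frames(tc: str, fps: int = FPS) -> int:
    """Convert 'h:m:s:f' to an absolute frame count."""
    h, m, s, f = (int(part) for part in tc.split(":"))
    return ((h * 60 + m) * 60 + s) * fps + f


def frames_to_timecode(frames: int, fps: int = FPS) -> str:
    """Convert an absolute frame count back to 'h:m:s:f'."""
    seconds, f = divmod(frames, fps)
    minutes, s = divmod(seconds, 60)
    h, m = divmod(minutes, 60)
    return f"{h}:{m}:{s}:{f}"


def shot_header(label: str, clip_start: str, first: int, last: int,
                fps: int = FPS) -> str:
    """Format a label with its time code range and clip-relative frames."""
    base = timecode_to_frames(clip_start, fps)
    return (f"{label} ({frames_to_timecode(base + first, fps)} - "
            f"{frames_to_timecode(base + last, fps)}) ({first} - {last})")


header = shot_header("clapper", "0:7:12:23", 0, 54)
```

For the first item of Figure 8, frames 0 to 54 of a clip starting at 0:7:12:23 map to the range 0:7:12:23 to 0:7:14:17 shown in the figure.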
Rationale
This transcript is similar to that produced manually in Figure 2. Psycholinguistic researchers familiar with manual transcription find this automatically generated transcript invaluable. MDB-GSG, therefore, facilitates greater communication among researchers.
1    clapper (0:7:12:23 - 0:7:14:17) (0 - 54)
     beginning of the film clip
     Transcript: ""
2    L1 (0:7:14:18 - 0:7:17:1) (55 - 128)
     Introduces top discourse layer
     Transcript: "okay what we need to do "
  2.1  L1 (0:7:14:18 - 0:7:15:18) (55 - 85)
       "okay"-Macro level discourse marker
       BH fists PTB/FTC contact in cc
       Transcript: "okay "
  2.2  L1 (0:7:15:19 - 0:7:17:1) (86 - 128)
       RH G to interlocutor
       Transcript: "what we need to do "
3    whisper by LSNR (0:7:17:2 - 0:7:17:6) (129 - 133)
     motivates slight SPKR hesitation
     Transcript: ""
4    L2/C(TRAINS) (0:7:17:7 - 0:7:28:9) (134 - 466)
     RH, BH, LH
     Transcript: "is we're gonna ride in through town umhm uhm get off at the train station uhm we'll be getting off at the right so we're coming from this direction past this to the station "
Figure 8. Transcript produced by the MDB-GSG system
6. Object-Oriented Implementation and Temporal Synchronization
Figure 9 shows the simplified object hierarchy of our system and also the method of temporal synchronization. All MDB-GSG interface elements are separate objects in our C++ implementation of the system. The system is implemented under SGI IRIX, using X Windows, Motif and SGI Movie libraries. To permit multiple time portals, our system can open multiple datasets comprising shot hierarchies, video recordings, transcriptions, and charts simultaneously. Some interface elements are shared by all datasets while some are associated with specific datasets.
6.1. Architecture Overview
Figure 9 presents a simplified diagram of our MDB-GSG system architecture. The VCR-like Control Panel, Shot Hierarchy Editor and Virtual Video Player are shared "common" objects and are instantiated only once. The Virtual Video Player is designed to give the system independence from the actual device that plays the video data. This player may be either a hardware device or a software media player. In the current implementation, the MDB-GSG system can handle two physical devices (an RS-232 controlled VCR and a laser disk player) and the Digital Video Monitor discussed in section 4.1. Besides these shared common objects, all other objects are instantiated on demand and tied to the specific dataset for which they are created. The key interface component for a particular dataset is the Keyframe Browser that is associated with a specific shot hierarchy. At any point in time only one Keyframe Browser may be active (or 'in focus'). The active Keyframe Browser is shown highlighted in Figure 9. Each Keyframe Browser contains a list of one or more Browser Windows, only one of which is active at any time (shown highlighted in Figure 9). Each Browser Window is a kind of 'view' into the shot hierarchy, maintaining a separate time focus and active level. Our time portal conceptual model is embodied by a specific Browser Window with its time focus and active level. As discussed earlier, this time portal concept is critical in the analysis of GSG interaction. Each browser window is associated with its own Strip Chart, Avatar, and Text Transcription interfaces. When a browser window is active or in focus, the entire system uses its time focus as the current time focus.
[Figure 9 diagram. Labeled elements: Main X Event Loop; Polling Function in the Event Loop (Timeout); "Query for Current Position in Media"; "Update Position" Message; "Synchronization Dispatcher" Propagates the "Update Position" Message to All Objects; Virtual Video Player; Video Storage or Device (VCR, Laser Disk, MPEG, MJPEG); Video is Played Independently of the Main Event Loop; Shot Hierarchy Editor; VCR-like Control Panel; Keyframe Browsers, each with its DB; Browser Windows; Strip Chart; Avatar; Text Transcript.]
Figure 9. Simplified Architecture of the MDB-GSG System and the Method of Media Synchronization
6.2. Media Synchronization
All objects, except the virtual video player object and the actual video device (implemented in software or hardware), share the same X event loop, shown as a dashed circle in Figure 9. The Digital Video Monitor is the most commonly used video device, but our system allows other types of video devices to be connected (e.g. a laser disk player or a computer-controlled VCR) and interfaced to the system through an appropriate 'virtual player'. The only requirements for each video device are that it be able to return the current (currently played) frame number and to go to a particular frame number on demand. The video device plays the video independently from the rest of the system. This is obviously the case when an external physical device is used. Software video players, such as our Digital Video Monitor, play the video data on a separate thread of execution. The system's main X event loop contains a function (implemented as a timeout) which periodically polls the video device for the current position in the media. The function then sends an 'update position' message to the 'synchronization dispatcher' object (see Figure 9). The 'synchronization dispatcher' propagates this message to all its sub-objects (keyframe browsers), and all currently active objects (i.e. currently opened windows) update themselves in response. Each object, in turn, propagates the message to all its active sub-objects (e.g. any active animated strip charts belonging to a keyframe browser window). So the update message originates in the X event loop and is propagated in a tree-like fashion throughout all currently active objects. Polling an independently executing virtual player makes the video recording the basis of temporal synchronization of all the system elements and is consistent with the basic philosophy of the entire system, where the video is central to analysis
and all other data. In fact, all system data (e.g. text transcript, strip chart data, hand position data for the avatar) are derived from the video. This approach is also easy to implement and makes all object interfaces relatively simple (in general, any object has to be able to update itself with a new current frame number, and also to send a new frame number to the virtual player, e.g. when a slider is moved by the user on the VCR-like Control Panel). Furthermore, most video players maintain accurate temporal synchronization, making them ideal 'clocks' for the system. With the virtual media player as the central synchronization agent, dropped frames during video playback do not pose a problem, since all the remaining interface modules respond to such an event by updating themselves with the correct new frame number. In fact, our synchronization strategy has the added benefit of automatic system load balancing. On a slower machine, the video player is likely to drop frames if many interface components are active and animated. This in turn causes a coarser animation update, giving more resources to the video player. In the current implementation the polling is performed at 33 ms intervals (slightly more often than 30 times per second), which is sufficient, considering that the typical video frame rate is 30 fps. Some events in certain interface objects (e.g. pressing the "skip single frame" buttons in the VCR-like Control Panel) automatically force dispatching of the 'update position' message to all interface elements.
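The synchronization scheme can be illustrated with a short Python sketch; the actual system is written in C++ on X Windows and Motif, and the class and function names below (VirtualPlayer, FakePlayer, SyncNode, poll_once) are hypothetical. The sketch shows only the two operations required of a video device and the tree-like propagation of the 'update position' message; the timeout that would call poll_once every 33 ms is left out.

```python
from abc import ABC, abstractmethod
from typing import List, Optional

POLL_INTERVAL_MS = 33  # polling period given in the text


class VirtualPlayer(ABC):
    """The two operations the text requires of every video device."""

    @abstractmethod
    def current_frame(self) -> int: ...

    @abstractmethod
    def goto_frame(self, frame: int) -> None: ...


class FakePlayer(VirtualPlayer):
    """Stand-in device for this sketch; real players run independently."""

    def __init__(self) -> None:
        self._frame = 0

    def current_frame(self) -> int:
        return self._frame

    def goto_frame(self, frame: int) -> None:
        self._frame = frame


class SyncNode:
    """Interface object: updates itself, then forwards to its children."""

    def __init__(self, name: str, active: bool = True) -> None:
        self.name = name
        self.active = active
        self.frame: Optional[int] = None
        self.children: List["SyncNode"] = []

    def add(self, child: "SyncNode") -> "SyncNode":
        self.children.append(child)
        return child

    def update_position(self, frame: int) -> None:
        if not self.active:
            return  # inactive objects neither update nor propagate
        self.frame = frame
        for child in self.children:
            child.update_position(frame)


def poll_once(player: VirtualPlayer, dispatcher: SyncNode) -> int:
    """Body of the timeout callback: query the player, dispatch the frame."""
    frame = player.current_frame()
    dispatcher.update_position(frame)
    return frame


player = FakePlayer()
dispatcher = SyncNode("synchronization dispatcher")
browser = dispatcher.add(SyncNode("keyframe browser"))
window = browser.add(SyncNode("browser window"))
chart = window.add(SyncNode("strip chart"))
closed = window.add(SyncNode("avatar", active=False))

player.goto_frame(128)  # e.g. the user moves the slider
poll_once(player, dispatcher)
```

Since every poll reads the frame number the player has actually reached, a player that skips from frame 128 to frame 131 simply causes the next update to carry 131; no interface object needs to know that frames were dropped.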
7. System Use
Our MDB-GSG system replaces a process of manual perceptual analysis using a frame-accurate videotape player (a Sony EVO-9650); hand transcription; manual production of gesture and gaze tracking charts and audio F0 and RMS charts; manual tagging of F0 charts and synchronization with the text transcript; and manual reproduction of the analysis results into a text transcript. For this reason, it is not possible to evaluate the MDB-GSG system against a predecessor system. Furthermore, the kind of psycholinguistic analysis performed is extremely skill-intensive and tedious. For this reason, the number of people actively engaged in micro-analysis of video for GSG coding is small. We hope that a system like our MDB-GSG will bring modern multimedia tools to bear on such research and increase the number of new
researchers in the field. We hope to have an effect similar to that of PC-controlled telescopes, which brought many new amateur astronomers to the discovery of celestial phenomena. The system is being used by two doctoral students in psycholinguistic research, and both have given the MDB-GSG system high marks in their subjective evaluation. The work is an order of magnitude faster with the MDB-GSG system (a day in place of a week and a half of intense labor). Furthermore, with direct access to the multimedia, multimodal data, the degree of integrated analysis is enhanced.
8. Lessons Learned
We worked closely with domain scientists in all phases of our research and implementation of the system. We had been doing GSG research in collaboration with our psycholinguistics partners, and had observed the degree of detail required in the micro-analysis of the video data. We also saw the importance of temporal analysis for such research. At each stage of our system development, we proposed the structure of each interface component to the psycholinguists and discussed with them how it would function. The computer scientists took the lead in this tool development effort as they are more familiar with what can be done. Since we were proposing new ways of accessing GSG data, the psycholinguists initially found it difficult to visualize how the new interface would work. To overcome this, after initial discussion we developed rapid prototypes of the new tools and showed them to the psycholinguists for comment. This is when we usually got the most effective insights on how to modify the tools for psycholinguistic GSG analysis. Most of the tools required two or three iterations to refine both the interface and the back end processing. The components that elicited the most discussion and refinement were, not surprisingly, the Text Transcription Interface and the VCR-like Control Panel. These were the components that matched most precisely how video micro-analysis for GSG research was previously conducted. The other components, like the video Keyframe Browser, the Avatar Representation, and the Strip Chart Interface, were completely new to the experience of the psycholinguists, and we expect that new comments and refinement requests will be forthcoming as they become more familiar with these tools. With respect
to system architecture, clean objects defined in relation to interface components facilitate such interactive refinement and system development. As for data representation, our discussion led to the implementation of the time portal concept to visualize and compare multiple instances in time. The time portal conceptual model provides a good mechanism to visualize and access single or multiple time instances in a spatialized representation of time. We also found that single hierarchies are insufficient to represent complex human discourse structures. Our implementation of multiple keyframe browsers associated with different shot hierarchies is the first attempt at visualizing the tangled hierarchies and overlapping semantic units at various levels of the psycholinguistic analysis. Our research on this representation continues. Multiple-linked representations are essential in the design of complex tools for temporal analysis. Requiring all interface components to function under a single current time focus helps the user to stay situated with the multi-modal data. The animation of the strip chart, avatar, keyframe highlighting, and text transcription display in synchrony with video and audio helps to maintain the sense of situatedness. For media synchronization, we found the use of the media or video player as the master clock for the entire system an effective and simple way to ensure the synchrony of the various interface components.
9. Conclusions
We have presented a multimedia database for perceptual analysis of video data using a multiple, dynamically linked representation model. The system components are linked through a time portal with a current time focus. The system provides mechanisms to analyze overlapping hierarchical interpretations of the discourse, and integrates visual gesture analysis, speech analysis, visual gaze analysis, and text transcription into a coordinated whole. The various interaction components facilitate accurate multi-point access to the data. While our system is currently applied to gesture, speech, and gaze research, it may be applied to any other field where careful analysis of temporal synchronies in
video is important. The VCR-like Control Panel, Digital Video Monitor, Hierarchical Keyframe Browser, Hierarchy Editor Panel, Animated Strip Charts, Text Transcription Interface, and Synchronized Transcript Display may be loaded with any time-based signal files as ASCII point lists, and may be used for any synchronized signal-based analysis (video compression rate against time for a compression application, force measurements in a videotaped ergonomic lifting experiment, etc.). The Avatar Representation may be customized for any other abstract representation. The time portal concept permits the simultaneous analysis of multiple time foci. In our work on GSG, we have found situations where we need to compare recurrences across multiple time indices and sometimes across video datasets. A single time portal may then represent the events converging in one nexus of time. This may be compared against that of an alternate time portal.
Acknowledgments
This research has been funded by the U.S. National Science Foundation STIMULATE program, Grant No. IRI-9618887, "Gesture, Speech, and Gaze in Discourse Segmentation", and the National Science Foundation KDI program, Grant No. BCS-9980054, "Cross-Modal Analysis of Signal and Sense: Multimedia Corpora and Computational Tools for Gesture, Speech, and Gaze (GSG) Research."