Accessing Speech Data Using Strategic Fixation

Steve Whittaker (1) & Julia Hirschberg (2)

(1) University of Sheffield
Department of Information Studies
211 Portobello St
Sheffield, S1 4DP, UK
phone: +44 114 222 6340
fax: +44 114 278 0300
email: [email protected] (corresponding author)

(2) Columbia University
Department of Computer Science
1214 Amsterdam Avenue, M/C 0401
450 CS Building
New York, NY 10027, USA
phone: +1 (212) 939-7114
email: [email protected]
ABSTRACT

When users access information from text, they engage in strategic fixation, visually scanning the text to focus on regions of interest. However, because speech is both serial and ephemeral, it does not readily support strategic fixation. This paper describes two design principles, indexing and transcript-centric access, that address the problem of speech access by supporting strategic fixation. Indexing involves users constructing external visual indices into speech: users visually scan these indices to find information-rich regions of speech for more detailed processing and playback. Transcription involves transcribing speech using automatic speech recognition (ASR) and enriching that transcription with visual cues. The resulting enriched transcript is time-aligned to the original speech, allowing users to scan the transcript as a whole, or the additional visual cues within it, to fixate on and play regions of interest. We tested the effectiveness of these two approaches on a set of reference tasks derived from observations of current voicemail practice. A field trial evaluation of JotMail, an index-based interface similar to commercial unified messaging clients, showed that our approaches were effective in supporting speech scanning, information extraction and status tracking, but not archive management. However, users found it onerous to take the manual notes JotMail requires to provide effective retrieval indices. We therefore built SCANMail, a transcript-based interface that constructs indices automatically, using ASR to generate a transcript of the speech data. SCANMail also uses information extraction techniques to identify regions of potential interest, e.g. telephone numbers, within the transcript. Laboratory and field trials showed that SCANMail overcame most of the problems users reported with JotMail, supporting scanning, information extraction and archiving. Importantly, our evaluations showed that, despite recognition errors, ASR transcripts provide a highly effective tool for browsing. Users exploited the enriched transcript to determine the gist of the underlying speech, and as a guide to identifying the areas of speech that were critical for them to play. Long-term field trials also showed the utility of transcripts in supporting notification and mobile access.

Key Words: speech access, interfaces, browsing, scanning, search, evaluation.
09/06/2006
CONTENTS

1. THE PROBLEM OF ACCESSING SPEECH ARCHIVES
2. VOICEMAIL TASKS AND USER PROBLEMS
   2.1. The Character of Voicemail Messages and Communication
   2.2. Key User Tasks in Voicemail
3. INDEXING VOICEMAIL: JOTMAIL USER INTERFACE AND DESIGN RATIONALE
   3.1. Information extraction using annotations
   3.2. Scanning using message overview information
   3.3. Status tracking using annotations and overview information
   3.4. Archive management
   3.5. Implementation
4. JOTMAIL EVALUATION
   4.1. Method
   4.2. Users and General Usage
   4.3. Hypotheses and Results
   4.4. Summary
5. THE SCANMAIL SYSTEM
   5.1. SCANMail Design Rationale
   5.2. SCANMail System Components
   5.3. Task Support in SCANMail
   5.4. Implementation and Architecture
6. SCANMAIL LABORATORY EVALUATION
   6.1. Method
   6.2. Results
   6.3. Summary
7. FIELD TRIAL
   7.1. Method
   7.2. General Usage
   7.3. Research Hypotheses and Results
   7.4. Long-Term Usage and System Evolution
8. CONCLUSIONS
9. ACKNOWLEDGEMENTS
REFERENCES
1. THE PROBLEM OF ACCESSING SPEECH ARCHIVES

Most research on speech interfaces has focused on using speech as a medium for interacting with computers, either in speech-only dialogue systems (Walker, 2000, Young, 2002, Zue and Glass, 2000) or multimodal ones (Oviatt, 2000, Walker, Whittaker, Stent, Maloor, Moore, Johnston and Vasireddy, 2004). In contrast, we focus here on interfaces to speech content, and the development of interfaces allowing users to browse and extract information from speech archives. Spoken information is ubiquitous (Kraut, Fish, Root, & Chalfonte, 1990, Panko, 1993, Whittaker, Frohlich and Daly-Jones, 1994). Increasingly large amounts of spoken data are being archived, such as meetings, phone calls, broadcast news and talk shows, and impromptu conversations (Galley, McKeown, Hirschberg, and Shriberg, 2004, Emnett and Schmandt, 2000, Garofolo, Lard and Voorhees, 2000, Hindus, Schmandt and Horner, 1993, Janin et al., 2003, M4, Wellner, Flynn, Tucker and Whittaker, 2004). Currently, however, it is hard to exploit these archives because we lack effective end user tools. Our goal in this study is to devise and test principles for designing effective interfaces for browsing and searching speech corpora.

One obvious strategy for designing speech tools is to capitalize on successful techniques developed for accessing text. But there are crucial differences between speech and text. Although speech has advantages over text in being both expressive and easy to produce, it is serial and ephemeral, giving rise to significant access problems (Arons, 1997, Chalfonte, Fish and Kraut, 1991, Hindus et al., 1993, Whittaker, Hirschberg and Nakatani, 1998a, 1998b). Simple processing studies also show that people extract information more quickly from text: average reading rates for text are 350 words/minute, compared with listening rates for speech of 180 words/minute (Arons, 1992, Beasley and Maki, 1976, Monk, 1984).
These differences arise from the different affordances of speech and text. Text can be processed more quickly because it is a permanent medium that affords strategic fixation, allowing readers to focus on important parts of a document while ignoring less significant regions. Studies of eye gaze confirm this: during reading, readers generally fixate upon less common (and hence more information-bearing) words. They also fixate on longer words, and on content words as opposed to function words (Rayner and Well, 1996, Schilling, Rayner and Chumbley, 1998). They exploit formatting information (Askwall, 1985), suggesting that, overall, users strategically focus on those aspects of the document that provide the most information.

Speech, in contrast, does not readily support strategic fixation, making speech browsing a data-driven process. Indeed, in one study of access from a voicemail archive, we found that users’ attempts to increase efficiency by strategic fixation and sampling had negative effects. Users forgot which parts of the archive they had already sampled, leading them to re-access the same information multiple times. On other occasions, sampling caused them to miss important material altogether, resulting in failure of their retrieval task. Overall, their sampling strategy ended up being less efficient than simply playing the speech from beginning to end (Whittaker, Hirschberg and Nakatani, 1998a).

The goal of this paper is to determine whether we can improve speech browsing by designing interfaces that directly support strategic fixation, converting speech access from a data-driven into a self-paced activity. We explore two different strategic fixation techniques:

(a) Indexing: This involves constructing external visual indices into speech. Users visually scan these indices to find information-rich regions of speech for more detailed processing and playback. This technique clearly depends on the quality of the indices that we can construct.

(b) Transcription: This involves transcribing the speech using automatic speech recognition (ASR). By time-aligning the transcript with the speech, users can scan the resulting transcript, fixating on and playing regions of interest. One obvious problem with this technique is that ASR transcription is inaccurate, with only 50-90% of words (depending on genre) being accurately transcribed (Garofolo, 2000, Huang, Acero and Hon, 2001).
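To make the transcription technique concrete, here is a minimal sketch of how a time-aligned transcript can support strategic fixation. This is our own illustration, not code from any system described in this paper; the AlignedTranscript class and its (word, start time) tuple format are hypothetical stand-ins for what an ASR word-alignment typically provides.

```python
# Illustrative sketch (not the authors' code): a time-aligned ASR
# transcript maps each word to its start time in the audio, so a
# reader can visually scan the text and play from any fixated word.

class AlignedTranscript:
    def __init__(self, words):
        # words: list of (word, start_seconds) pairs, as produced by
        # a typical ASR word alignment (hypothetical format).
        self.words = words

    def text(self):
        """Return the scannable transcript text."""
        return " ".join(w for w, _ in self.words)

    def start_time(self, word_index):
        """Audio offset to play from when the user fixates on a word."""
        return self.words[word_index][1]

t = AlignedTranscript([("please", 0.0), ("call", 0.4), ("me", 0.7),
                       ("at", 0.9), ("555-0123", 1.1)])
print(t.text())         # scan the transcript as a whole
print(t.start_time(4))  # click on the number: play audio from 1.1s
```

Note that ASR errors degrade this mapping gracefully: even when a word is misrecognized, its timestamp still points near the relevant region of speech, which is why browsing can succeed despite imperfect transcripts.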
Prior research on speech browsing has generally adopted an indexing approach, exploring various different types of indices. Speaker-based indexing allows users to choose particular speakers they want to listen to (Degen, Mander, and Salomon, 1992, Hindus et al., 1993, Kazman, Al-Halimi, Hunt, and Mantei, 1996, Wilcox, Chen, Kimber, and Balasubramanian, 1994). Intonation analysis can identify parts of speech that are emphasized, so users can focus on these (Arons, 1997, Stifelman, 1996, Stifelman, Arons and Schmandt, 2001, Wilcox and Bush, 1991). We can also index significant participant activities or external events, such as when participants take notes during a meeting (Abowd, Atkeson, Feinstein, Hmelo, Kooper, Long, Sawhney, and Tani, 1996, Moran, Palen, Harrison, Chiu, Kimber, Minneman, van Melle and Zellweger, 1997, Stifelman et
al., 2001, Whittaker, Hyland and Wiley, 1994, Wilcox, Schilit and Sawhney, 1997), or when a speaker changes slides (He, Sanocki, Gupta, and Grudin, 1999). Others have identified ‘hotspots’ where multiple participants are highly involved in the conversation (Kennedy and Ellis, 2004, Wrede and Shriberg, 2003). These activities can then be used as landmarks to identify important parts of the meeting. Indices can also be extracted from significant visual events, such as those detected in video accompanying speech (Boreczky, Girgensohn, Golovchinsky, and Uchihashi, 2000, Hauptmann and Witbrock, 1997, Jones, Foote, Sparck Jones and Young, 1996, Kazman et al., 1996). With some exceptions, however (Abowd et al., 1996, Arons, 1997, Whittaker et al., 1994, Moran et al., 1997, Boreczky et al., 2000), indexing research has generally focused on proof-of-concept systems, and one limitation of many of these studies is that indices have not been evaluated with users to determine their efficacy (Wellner, Flynn, Tucker and Whittaker, 2005).

Another browsing technique is surface manipulation. This approach employs signal processing techniques to remove pauses or increase the playback rate over the original, enabling listeners to process the data more rapidly (Arons, 1992, He and Gupta, 2000, Covell, Withgott, and Slaney, 1998). While these techniques permit faster audio processing, they do not support strategic fixation, as the listener ends up listening to all of the spoken parts of the archive. Other attempts at speech compression have been developed to summarize spoken documents by eliminating ‘unnecessary’ words and phrases (Hori and Furui, 2003, Tucker and Whittaker, 2005) or to extract portions of the data that contain the important information (Maskey and Hirschberg, 2003, Furui, Kikuchi, Shinnaka, and Hori, 2004, Tucker and Whittaker, 2005).
For the most part, these summarization techniques have not been incorporated into working audio navigation systems, so their usefulness in comparison to other audio browsing strategies remains to be tested.

Few interfaces to audio data have explored transcript-centric browsing. There has been substantial research on spoken document retrieval, but this has focused on the efficiency of text search algorithms applied to ASR transcripts, not on the development of interfaces to support browsing (Garofolo, 2000, Whittaker, Hirschberg, Choi, Hindle, Pereira and Singhal, 1999). The goal of spoken document retrieval has been to identify a set of audio documents relevant to a user query. These returned documents may be minutes or even hours long, which leaves the user with the significant problem of extracting relevant information from them. Where transcript-centric browsers have been built, researchers have often concluded that transcript quality is too low to support effective browsing (Hauptmann and Witbrock, 1997, Kazman et al., 1996). Recent exceptions are explorations of various visualizations of transcripts and ASR data (Vemuri et al., 2004, Whittaker et al., 1999, Jones and Gibson, 2005).

In exploring strategic fixation techniques, we chose voicemail as our reference application. Voicemail is an under-researched workplace communication technology that relies on access to recorded speech. It is also ubiquitous; a recent study estimated there are over 76 million enterprise voicemail users (Davidson, 2002), and most mobile phone providers now offer a voicemail service. Many organizations rely heavily on voicemail for conducting everyday work, and voicemail is often preferred to email (Rice and Shook, 1990, Whittaker et al., 1998a). Yet despite its ubiquity and importance, there are still outstanding problems with current voicemail interfaces, which makes it an appropriate domain for exploring novel access techniques.
One approach to improving voicemail access is a form of unified messaging, supported by various products from Avaya [1], Cisco [2], Nortel [3] and Siemens Communications [4] (Mier, Tarpley, and Mier, 2005). Unified messaging incorporates voicemail messages as email attachments, presenting voicemail header information graphically in the user’s email inbox, and allows users to play attached messages by clicking on them. However, current unified messaging interfaces support only rudimentary indexing and do not support transcript-centric access. Users therefore cannot access information or browse within a message; instead, unified messaging interfaces simply allow users to start or stop play of an entire message. And while unified messaging clients allow users to edit the text in the message subject line or body of the email message associated with the voicemail attachment, they do not allow direct annotation of the waveform or search of the voicemail message contents.

Our methodological approach combines requirements-driven iterative design (Preece, Rogers and Sharp, 2002, Rosson and Carroll, 2001) with quantitative and qualitative evaluation of working prototypes to determine how well they support key
[1] http://www.avaya.com/gcm/master-usa/en-us/products/offers/modularmessaging.htm
[2] http://www.cisco.com/en/US/netsol/ns340/ns394/ns165/ns152/networking_solutions_package.html
[3] http://products.nortel.com/go/product_content.jsp?segId=0&parId=0&prod_id=44039&locale=en-US
[4] http://enterprise.usa.siemens.com/products/products/applications/messaging/xpressions.html
user tasks (Whittaker, Terveen and Nardi, 2000). The structure of this paper mirrors that method. We first briefly review three studies that identify key voicemail processing tasks, and characterize user problems with accessing voicemail and speech data in general. We then describe two working systems that address these speech access problems. The first, JotMail, supports user annotation for indexing: it allows users to take temporally indexed notes associated with individual messages. Our second system, SCANMail, generates transcripts of speech data by applying ASR to the underlying speech. We present systematic evaluations of both interfaces to determine how well they support key speech browsing tasks, comparing them with a standard touchtone-phone interface for accessing voicemail. We conclude by discussing the trade-offs of these approaches, identifying future design and theoretical problems for speech browsing.
2. VOICEMAIL TASKS AND USER PROBLEMS

In this section we summarize three empirical studies that we performed to collect qualitative and quantitative data on users’ key tasks and problems with accessing speech for a typical voicemail system, Audix™. Our data included server logs, surveys, interviews and user observations. Full details of the studies can be found in (Whittaker et al., 1998a, 1998b).
2.1. The Character of Voicemail Messages and Communication
One common intuition is that voicemail is a parasitic technology predominantly used to set up synchronous conversations, with a typical message being ‘This is Anne. Call me back.’ But user interviews indicated that entire tasks are accomplished in voicemail without resorting to other media such as email, phone or face-to-face communication: “Entire transactions or entire tasks are accomplished by exchanging [voicemail] messages. That is, you will never talk to the person in real time.” We found too that voicemail messages contained significant amounts of contentful information: about half the people we surveyed reported average message lengths of 1-2 minutes. User interviews support this: “[A voicemail message] is really like a whole memo, or a huge email message worth of information.”

The following transcript shows the complexity of voicemail messages. In it a caller describes the target configuration for a new computer system he is ordering:

“erm yes Bob James Ingram with Digital Equipment Corporation how are you erm say Bob you think you could give me a call when you have the opportunity I would deeply appreciate it erm looks like the follow up to your email message you sent me on Thursday erm my telephone number is six zero nine five eight nine erm nine one one four erm just like to talk to you if nothing else around the configuration that you sent me erm all that stuff yes is digital Unix erm uh is it my understanding that uh you'd like to get a dual-boot system erm uh if you like Unix I can give you just Unix if that's the case erm uh we can supply you with wh- what appears to be everything that you've asked for you said you don't necessarily need to order a display and that's fine we don't have to do that erm uh are we looking at a five hundred megahertz system four thirty three six hundred erm it all depends on the performance that you're looking for which would be a determining factor there erm well please give me a call at six zero nine five eight nine nine one one four erm if I'm not in if you could leave a message on my voice mail I'd deeply appreciate it take care look forward to speaking with you soon bye bye”
2.2. Key User Tasks in Voicemail
Users reported four main tasks when processing voicemail: scanning the mailbox to identify important messages; extracting information from individual messages; tracking the status of current messages; and managing their archive of stored messages.

Scanning: Scanning is used for prioritizing incoming new messages, and for locating valuable saved messages. Only 24% of people we surveyed use voicemail message headers to identify urgent messages, reporting they are “too slow”. Instead, users’ main strategy is to sample the first few seconds of each new message in sequence to determine its content and urgency. In accessing stored messages, users do not retain a detailed model of their archive, and 76% report that “listening to each message in sequence” is also their standard procedure for finding archived messages. When the archive contains multiple messages, the sequential nature of access makes it onerous to locate particular messages.

Information Extraction: Once a relevant message is identified, users must then extract critical information from it. This often involves repeatedly accessing the same message for verbatim facts such as the caller’s name and phone number.
Figure 1 – Notes Taken About Voicemail Messages

46% of users report that they replay messages “about half the time”. To reduce repetitive processing, 72% report “almost always” taking written notes. Users employ two different note-taking strategies. The first is full transcription, where users attempt to completely transcribe the message, to reduce the need for future access. The second is summary notes, taken as indices to jog the user’s memory later about the message’s contents. Users also point out that voicemail messages are structured, so the summary notes aim to record the predictable key points of the message (caller name, caller number, reason for calling, important dates/times and action items). Representative notes are shown in Figure 1. In most cases, users keep the original voice message as a backup for their sometimes sketchy notes. Notes are then used to identify and navigate within the original message and archive. They allow users to locate the relative position of messages in the archive, as message notes with earlier dates will precede later ones. Users also commented on the laborious nature of constructing and managing their notes. Many kept dedicated voicemail notebooks for just this purpose.

Status tracking: Workplace tasks are often delegated through voicemail. A common user problem is tracking message status, in particular whether a message has been processed. Status tracking is a particularly difficult problem for users accessing voicemail under time pressure. They may often access, but have no time to respond to, incoming messages. When accessing voicemail later, they are often unclear about which messages they have dealt with. There are two main techniques for status tracking. First, during information extraction, people note down reminders on scraps of paper or in a dedicated logbook. Second, they take no notes but leave undischarged messages in their mailbox.
Reminding takes place as they encounter these undischarged messages when they next scan their archive. The weakness of this second strategy is the absence of visible reminding cues; as a result, users who do not access voicemail may be unaware of the presence of unresolved messages.

Archiving: Users also have to manage their archives. Given their access strategies, most users’ archives consist of a backlog of undischarged messages as well as saved valuable messages. Users therefore engage in periodic “clean-ups”: accessing each message in sequence to determine whether it should be preserved or not. By removing superfluous messages, they also make it easier both to scan for existing valuable messages, and to monitor reminder messages. Those who do not engage in “clean-ups” report being surprised by the extent to which they accumulate irrelevant messages.
We now describe two systems based on the principle of strategic fixation that address these key user tasks. The first, JotMail, relies on indexing, and the second, SCANMail, relies upon transcript-based access. In both cases we provide a detailed design rationale and system evaluation.
3. INDEXING VOICEMAIL: JOTMAIL USER INTERFACE AND DESIGN RATIONALE
Figure 2 – The JotMail User Interface (callouts: archive management tools; automatically generated message information; user generated message annotations)

JotMail is an index-based interface that supports our key user tasks of scanning, information extraction, status tracking and archive management (Whittaker, Davis, Hirschberg and Muller, 2000). The guiding principle for the design of the system is index-based speech access, where user notes serve as the index into messages. By scanning their annotations, users can identify and play important messages as well as individual parts of messages requiring detailed processing. User notes are time-indexed (Moran et al., 1997, Whittaker et al., 1994), following our previous observation that handwritten notes serve as a navigational index into the underlying structure of the original voicemail message: “My notes trigger things - they are … meant to just give me place holders while I am browsing. Then I have to go back and listen to stuff”.

The JotMail interface is shown in Figure 2. The design was finalized after three iterations with representative users, where the first design involved paper prototypes. The center right of the screen shows text boxes for user generated message annotations, with play control buttons located to the left of them. On the left side of the interface is automatically derived header information about each message. The upper left part of the screen shows archive management tools for creating and managing voicemail folder structures.

A key benefit of voicemail reported by our users is ubiquity. By developing a Web-based UI, we provided access to voicemail wherever there is an Internet browser, whether on a computer, a PDA, or a phone. There is some overlap in functionality with certain commercial unified messaging clients, which present visual header information for voicemail messages and tools for archiving messages (Mier et al., 2005). These systems also allow some user-centric indexing by editing the emails associated with voicemail attachments. However, none of these support time-based user indexing, i.e. the ability to index user notes to parts of the original message, allowing controlled access to specific parts of that message – a key feature of JotMail. We now describe how JotMail’s indexing approach supports the four key user tasks, and how this exploits existing user strategies.
3.1. Information extraction using annotations
A key user strategy for information extraction is the use of personal notes. A central, novel feature of the UI is support for user annotations. Users can type personal notes (e.g. “phil brittain”) in the scrollable text box associated with a given message. One use JotMail users made of annotations is for strategic fixation - to provide access to the most relevant parts of the message. Another is as summaries, since notes usually capture the main points of a message.

Time-based indexing works as shown in Figure 3: users take notes as they listen to the message, and each note is co-indexed with the speech currently being played. If users later click on a particular word in their notes, they automatically access the speech that was being played when the note was taken. In this way, notes provide reasonably accurate access into the underlying speech, allowing users to focus on areas of specific relevance. However, there is often a delay between the time when users hear the relevant information and when they enter their related note. Based on iterations with early users, later versions of the UI therefore introduced a two-second indexing offset, so that notes are indexed to the material being played two seconds previously. Figure 3 omits the offset for ease of exposition.

To further help with information extraction, we provide general play controls for navigating within the message without recourse to user notes. These allow people to play and stop a given message as well as skip forward and skip back two seconds within a given message (Sk B and Sk F buttons).
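The time-indexing scheme just described can be sketched as follows. This is our own minimal illustration rather than JotMail’s actual code; the class and method names are invented, but the two-second offset mirrors the behavior described above.

```python
# Minimal sketch of JotMail-style time-based indexing (our
# illustration, not the original implementation). Each word the user
# types is indexed to the audio position two seconds before it was
# typed, compensating for note-taking lag.

INDEX_OFFSET_SECONDS = 2.0  # offset introduced after early user trials

class AnnotatedMessage:
    def __init__(self):
        self.word_index = {}  # word -> audio offset in seconds

    def add_word(self, word, playback_position):
        # Index the note to material played two seconds earlier,
        # clamped so it never points before the start of the message.
        self.word_index[word] = max(0.0,
                                    playback_position - INDEX_OFFSET_SECONDS)

    def seek_for(self, word):
        """Position to resume playback from when the user clicks a word."""
        return self.word_index[word]

m = AnnotatedMessage()
m.add_word("phil", 5.3)       # typed while the audio was at 5.3s
m.add_word("brittain", 6.1)
print(m.seek_for("phil"))     # plays from ~3.3s, just before the name
```

The clamp at zero handles notes typed in the first two seconds of playback, where subtracting the full offset would otherwise produce a negative seek position.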
3.2. Scanning using message overview information
Our initial user studies revealed the requirement for scanning to prioritize and locate important messages. In addition to user-generated annotations, an important set of cues to aid scanning is message header information. Like many unified messaging systems, the JotMail UI displays the following information about each message: date, time, size, caller-ID number (when available) and caller name (for internal calls only, derived by looking up the caller's number in the corporate directory). By depicting this general information we enable users to visually scan and randomly access messages. In this way, headers allow users to identify specific messages, and users are no longer forced to access messages in sequence. Users can also manually supplement this automatically generated overview information with their own notes, so annotations can support scanning as well as information extraction.
[Figure 3 - Time-based indexing. The figure shows a user annotating a message (typing "details of 1st quarter results" while the audio "...first quarter results are good overall, earnings per share is up and..." plays), and later retrieving information by clicking on the word "details" to replay the co-indexed audio.]
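The header display described above amounts to assembling one scannable row per message from automatically available metadata, with a directory lookup supplying the caller name for internal calls. The following is an illustrative sketch only; the field layout and the directory structure are assumptions, not the original implementation.

```python
# Hypothetical sketch of JotMail-style header rows for visual scanning.
# The directory mapping and formatting are illustrative assumptions.

CORPORATE_DIRECTORY = {"5551234": "Phil Brittain"}  # caller number -> name (internal calls)

def header_line(date, time, size_secs, caller_id=None):
    """Build one scannable header row: date, time, size, caller-ID, caller name."""
    name = CORPORATE_DIRECTORY.get(caller_id, "") if caller_id else ""
    return f"{date}  {time}  {size_secs:>4}s  {caller_id or 'unknown'}  {name}".rstrip()

print(header_line("09/06", "14:02", 45, "5551234"))
print(header_line("09/07", "09:15", 120))  # external call: no caller-ID or name
```

Because every message gets a visible row regardless of whether caller information is available, users can scan the whole mailbox and jump directly to a message rather than stepping through messages in sequence.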
3.3.
Status tracking using annotations and overview information
Voicemail users also reported frequent problems in remembering what outstanding actions were required for a given message. JotMail was designed to support status tracking in two ways, again by analogy with people's paper-based strategy of leaving themselves visual reminders. Annotations could be used to explicitly record the actions necessary for each message. More implicitly, we hoped that simply having a visual representation of each message visible in the mailbox would remind people of the necessary action whenever they accessed JotMail.
3.4.
Archive management
Voicemail users also reported problems remembering the contents of their archive, and complained of the build-up of irrelevant messages. Consistent with other unified messaging clients, the JotMail interface provides a set of tools for organizing, managing and deleting voicemail data. Labeled buttons allow users to create new folders, as well as to move, delete and copy information to those folders. More implicit support for archive management comes from the visibility of messages, which lets the archive be quickly scanned to identify important messages and filter out superfluous ones.
3.5.
Implementation
JotMail is built on top of Webmail, a research system that supports email and voicemail access. Webmail is implemented as a CGI script that connects to a standard mail server. When the script is run, it produces HTML pages with interfaces for viewing, browsing, and archiving messages. Voicemail messages are retrieved from the voicemail system and stored as email messages with special headers and data. The requirement for broad access influenced our choice of a web-based platform. The annotation system in JotMail was implemented as an HTML browser plug-in. While this works in most browsers, using a plug-in restricts the complexity of possible UI implementations compared with Java. Webmail was modified to store annotation files alongside voicemail messages, and to display small annotation plug-in windows next to each message. The plug-in application downloads the annotation file and the audio file for each message and allows the user to play the message and type text in the annotation text box. If the message is playing while the user is typing, the current time index in the message is stored with each word. By holding "Control" and clicking on a word, the user can play the message from that time index. When the plug-in closes, modified annotations are sent back to Webmail so they will be displayed the next time the page is loaded. This gives JotMail users persistent annotations for information extraction and status tracking.
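The persistence step above (storing a time index with each word and reloading it on the next page load) can be sketched as follows. This is a hypothetical reconstruction, not the original Webmail code: the JSON file format, function names, and the `.ann` file name are illustrative assumptions.

```python
import json
import tempfile
from pathlib import Path

# Illustrative sketch of persisting per-word time indices: the annotation
# file stored alongside each voicemail message pairs every typed word with
# the playback position (in seconds) at which it was typed, so indices can
# be reloaded the next time the message page is opened.

def save_annotation(path, timed_words):
    """timed_words: list of (word, seconds_into_message) pairs."""
    entries = [{"word": w, "t": t} for w, t in timed_words]
    Path(path).write_text(json.dumps(entries))

def load_annotation(path):
    """Return the stored (word, time) pairs for a message's annotation file."""
    return [(e["word"], e["t"]) for e in json.loads(Path(path).read_text())]

with tempfile.TemporaryDirectory() as d:
    ann = Path(d) / "msg0001.ann"  # hypothetical annotation file name
    save_annotation(ann, [("details", 4.2), ("results", 6.8)])
    print(load_annotation(ann))
```

Storing the annotation as a separate file next to the message mirrors the paper's design: the plug-in only needs to download two files per message (audio and annotation), and writing the annotation back on close makes the notes persistent across sessions.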
4. JOTMAIL EVALUATION
4.1.
Method
The goal of our evaluation was to investigate JotMail use by people in their everyday work, to determine how well index-based retrieval supports our key user tasks. We compared our research prototype with a baseline representative voicemail system, Avaya's Audix™, in an 8-week field trial. We were also concerned with how well the system supported annotation behaviors, given the centrality of this strategy for indexing. We collected the following four sets of quantitative and qualitative data to test specific hypotheses about how JotMail compared with Audix.

Pre-installation survey: All our study participants were experienced Audix users. Before installing JotMail we administered a questionnaire probing their Audix usage. We asked people about the number of voicemail messages they received and sent, and how they processed these, as we felt these prior strategies and usage might influence their reactions to JotMail. We also gathered data about users' Audix scanning, information extraction, status tracking and archiving behaviors, as well as their note-taking strategies. We surveyed people about the success of Audix features (header information and message operations) in supporting these core tasks.

JotMail logs: We then installed JotMail and logged usage data for 8 weeks. From logfiles we collected data about: the number and duration of JotMail sessions, messages stored and accessed, operations on messages (stop, play, skip), and what notes people took and when they used these to replay messages. These data were used to analyze usage and identify the main types and functions of user annotations.
Post-installation survey: After 8 weeks, we uninstalled JotMail and administered an extended version of the original Audix questionnaire, containing additional questions about the basic features of JotMail, note-taking, and the use of notes for playback.

Interviews: We also carried out two semi-structured interviews with users, one while the system was installed and one after it was uninstalled. These probed the same issues as the surveys, but were designed to elicit fuller descriptions of the way the system was being used, as well as its main benefits and drawbacks. We also tried to find examples of novel or unexpected JotMail uses, and supplemented the interviews with observations of people using JotMail.
4.2.
Users and General Usage
Our users were researchers and secretarial support staff at a large telecommunications company. They were not directly involved with our project and were blind to the study's hypotheses. All were experienced Audix users, with a mean of 4.2 years' experience with the system. We collected logs for 9 users for a total of 184 sessions over 935 hours. According to their own estimates, these people used JotMail to access 70% of their voicemail during this time. Of these 9 users, 7 completed all questionnaires and interviews. Our logs show frequent JotMail use. People used JotMail for an average of 20.4 sessions, of mean length 5.1 hours. Users tended to keep the program running constantly, unlike Audix, where short sessions were the rule and users would in fact be timed out if they were not actively using system features. In each JotMail session users typically accessed 4.1 messages, and the mean number of operations per session was 2.1 plays, 1.8 skips and 0.3 annotation-based plays. Overall, users replayed 36% of messages; on 30% of occasions they listened to the same message two or more times in sequence. Users typically saved 11.0 messages at the end of each session.
4.3.
Hypotheses and Results
HYPOTHESIS  MEASURE                                                                    AUDIX (MEAN)  JOTMAIL (MEAN)  STATISTICAL DIFFERENCE & HYPOTHESIS CONFIRMED?
O1          Overall ease of processing voicemail (1 = "very hard", 5 = "very easy")    2.6           4.6             t(6) = 4.10, p
S1          Ease of locating a specific message (1 = "very hard", 5 = "very easy")
S2          Preferred method for locating a message
I1          Information extraction (1 = "very hard", 5 = "very easy")
I2          Replay frequency (1 = "very frequently", 5 = "never")
I3          Note-taking frequency (1 = "very frequently", 5 = "never")
ST1         Ease tracking message status (1 = "very hard", 5 = "very easy")
ST2         Frequency of losing notes (1 = "frequently", 6 = "never")
A1          Archival behavior (1 = "usually archive", 5 = "never archive")