
Forward and Backward Speech Skimming with the Elastic Audio Slider

Wolfgang Hürst, Tobias Lauer, Cédric Bürfent, Georg Götz
Institute of Computer Science, University of Freiburg
Georges-Köhler-Allee 051, D-79110 Freiburg, Germany
{huerst, lauer, buerfent, goetz}@informatik.uni-freiburg.de

In pursuit of the goal to make recorded speech as easy to skim as printed text, a variety of methods and user interfaces have been suggested in the literature, involving time-compressed audio, speech segmentation and recognition, etc. We propose a new user interface, the elastic audio slider, which makes navigation in speech documents similar to video navigation or text scrolling. The approach supports navigation at variable speed in both forward and backward direction while providing immediate intelligible audio feedback during the user’s interactions. A user study was conducted to prove the usefulness of backward replay of speech for tasks such as topic classification. In addition, we show that the proposed interface offers the opportunity to combine the advantages of existing approaches within a single, easy-to-use UI component that complements and enhances the common user interfaces known from standard audio player software.

Keywords: speech skimming, elastic interfaces, audio slider, time-scaled audio, backward speech replay.

1 Introduction

In order to cope with the growing amount of digital audio available today, interaction techniques and interface designs are needed which make speech signals as easy to skim and navigate as textual documents. While research in this area dates back to (at least) the early 1990s (see, e.g., Arons [1994]), common media players have only recently started to integrate techniques such as time-compressed replay in order to enable easier, interactive audio skimming. In addition, user

Figure 1. Exemplary illustration of a standard audio player interface design with start/stop buttons, speed controller (top right), and audio progress bar (bottom).

interface designs are often still based on the well-known but insufficient tape recorder metaphor or on simple modifications of replay speed. Most standard interfaces feature, if at all, some sort of speed controller and an audio progress bar in combination with a slider interface for random access (cf. Fig. 1). The slider on the progress bar can be used to set (or re-set) replay to any position within the file, e.g., in order to re-listen to some specific information. Modification of replay speed (time-scaled replay) using techniques such as SOLA [Roucos & Wilgus 1985] can be useful, e.g., to quickly skim over areas of minor interest or to locate specific parts with high relevance for some particular information need. Studies have shown that people are still able to understand the content of a speech recording (intelligibility) if it is replayed with up to 1.8 times the normal replay speed [He & Gupta 2001], and that they are able to classify the overall topic (comprehension) at replay speeds as high as 2.5 to 3 times normal replay [Arons 1997]. Slower than normal replay rates can be useful, e.g., if the speech recording is not in the user’s native language [Amir et al. 2000].

While the interface design illustrated in Figure 1 improves audio skimming, there are many situations in which it is not comfortable or flexible enough. For example, assume a user wants to skim a longer radio news show recording in order to find some particular stock quotes or the latest soccer results. One reasonable search and interaction strategy for this task might be as follows: First, increase the replay rate to 2.5 times normal speed in order to skim the data and roughly identify the area of interest (e.g., the part of the news show containing business news or sports coverage, respectively). Once the overall topic has been found, decrease replay speed to 1.8 in order to find the data of interest (e.g., reports from the stock market or soccer games, respectively). Once the searched position is found, audio replay is re-set to the beginning of the corresponding segment and replay speed must be reduced to normal replay again (or even slower, e.g., if it is necessary to listen to this part carefully or to take notes while listening).

While such a search strategy can be performed with the interface illustrated in Figure 1, its design is clearly not optimal in terms of usability. First, the involved interactions require continuous changes between different interface elements: The speed controller is used to increase replay speed to 2.5, then to decrease it to 1.8. Then the slider is used to re-set replay to the beginning of the segment, before the speed controller is used again to decrease replay speed to normal replay (cf. Fig. 2A). Secondly, re-setting audio replay to the beginning of a particular segment is not easy because users generally do not know the exact position and audio replay is normally muted while the slider thumb is being dragged. In general, moving back to re-listen to a particular part of an audio file is an essential task users often want or need to perform when skimming speech recordings, because in many situations users only know that they have found the correct position once they have already passed it. Especially if a file is replayed at higher speed, the need to re-set replay to a

Figure 2. A typical example for interaction using different interface designs. A, standard interfaces (speed controller and progress bar): (1) increase replay speed to 2.5 with the speed controller; (2) decrease replay speed to 1.8 with the speed controller; (3) re-position the slider thumb to the beginning of the segment; (4) re-set replay speed to normal replay. B, elastic audio slider: (1) speed up replay to 2.5 by pulling the rubber-band; (2) decrease pressure on the rubber-band, and thus replay speed, to 1.8 for more detailed listening; (3) reposition the slider thumb to the beginning of the segment.

particular position after a point of interest was identified is very likely. However, few audio interfaces address this need. It should be noted that the example just given oversimplifies the general course of interactions when skimming speech files in search of information. Normally, we can expect much more interaction to take place. For example, repeated repositioning of the slider thumb might be necessary in order to find the correct target position, as described above. In addition, if the topic has not been classified correctly in the first place, it may be necessary to speed up and slow down replay again. In the worst case, the user will end up switching several times between the speed controller and the progress bar, continuously re-setting replay speed and repositioning the slider thumb until the target position has been found. Therefore, we cannot and will not draw general conclusions from this simple example, but merely use it in the following to illustrate the advantages and limitations of the different user interface approaches described.

The goal of this paper is to introduce a new interface and interaction design for interactive audio skimming which is particularly useful for searching and navigating speech documents. The approach builds on our previous work, which is summarized in Section 2 together with a general discussion of audio skimming. Section 3 describes how this approach can be extended to improve speech skimming, in particular when trying to move backwards in order to re-listen to some previously heard parts of a file. Section 4 reviews related work and relates it to our approach. Section 5 concludes the paper with a short summary and an outlook on future work.

2 Time-Scaled Forward Replay for Speech Skimming

Both speech and text have a linear dimension defined by the sequence of words spoken or written, respectively. However, speech is linear in time while text is arranged in a spatial dimension (i.e. left to right, top to bottom). This is the main reason why text is much easier to skim than a speech signal. Layout and meta-information such as headlines, paragraphs, font styles (bold, italic, etc.), punctuation, etc. serve as visual cues which help the user in scanning the content. Comparable meta-information exists in speech: Speakers use intonation to make a point, pause before introducing a new topic, etc. In case of text, the speed at which the static visual information is processed and absorbed is under full control of the user (e.g., depending on how fast or slow the user’s eyes move over the text),


while in case of audio, this speed basically depends on the source of the sound signal. Not surprisingly, many approaches for audio skimming aim at making up for the time restriction implied in the audio signal by enabling the user to modify the replay speed, that is, by transferring the control over the replay speed from the system to the user. In addition, signal processing algorithms can be used to automatically extract acoustic cues similar to the visual cues from text. Advanced interface designs for audio skimming make use of these cues in order to facilitate navigation in the file. Several such interfaces have been introduced in the past; we refer to them in Section 4, where we relate them to our work.

The approach presented here is based on the intuitive assumption that a time-based progress bar or slider is a good and natural choice for quick and easy navigation in an audio file, since speech is linear in time. However, when a user is dragging the slider thumb along the time-line, audio feedback is generally muted, or normal replay continues and is not re-set until the mouse button is released; that is, there is a lack of immediate feedback during the interaction. The reason for this is grounded in the nature of these signals. Digital speech (and digital audio in general) consists of a continuous sequence of individual samples which, if played in the correct order, make up the sounds representing the words and phrases. Therefore, just playing individual samples corresponding to certain positions on the progress bar while the slider thumb is being dragged would result in unintelligible audio feedback. There are generally two approaches to dealing with this problem. One is to give up the strict synchronicity between the slider thumb position and the document progress and to always play a certain amount of successive samples (i.e., an intelligible piece of audio).
Once the current audio snippet is finished, replay is set to the position of the thumb at that time and the next audio segment is played. The other approach is to restrict users in their options to move the thumb in such a way that intelligible audio feedback can be provided. Our initial tests with the first approach showed that it works quite well for rough topic classification of the whole document but might be problematic when searching for more detailed, specific information due to the resulting loss of synchronization between audio feedback and slider movements.

The approach proposed here is based on the second idea, i.e., restricting the possibilities of the user to move the thumb (and therefore to navigate the file) in a way which still permits intelligible audio feedback. It is based on the concept of elastic interfaces, introduced by Masui et al. [1995] for visual data browsing of static data such as text or images. In this approach, the thumb of a slider or a scrollbar is not dragged directly but follows the mouse pointer along an imaginary ‘rubber-band’ tied between the mouse pointer and the thumb. The greater the distance between the two (i.e., the harder the band is stretched), the faster the thumb will move (see Fig. 3). When it gets closer to the mouse pointer (e.g. because the user has stopped dragging), it will slow down as the ‘force of the rubber-band’ decreases, and eventually stop when it arrives at the pointer position. Technically, the distance between the slider thumb and the mouse pointer is mapped to a speed value at which the document progress is displayed (see Fig. 4, left). Thus, while sliders and scrollbars are interfaces for position-based navigation (since the slider position corresponds to a document position), they can be used for


Figure 3. Visualisation of the elastic audio slider. The distance between mouse pointer and slider thumb (i.e., the ‘force’ on the ‘rubber-band’) determines the replay speed.

speed-based scrolling with the elastic interfaces approach. Masui et al. [1995] describe the usage of this concept for navigation of static, time-independent data, such as text or images. Hürst et al. [2004a] showed the viability and usefulness of the approach for dynamic visual documents such as video. One interesting characteristic of an elastic slider is that the thumb movement is always ‘smooth’ in the sense that there are no sudden jumps. This is because the slider positions are not manipulated directly by the user but only implicitly via the speed changes resulting from the distance to the mouse pointer. This property makes it possible to use the concept of elastic interfaces in order to create a slider for audio navigation with immediate acoustic feedback while the thumb is being pulled along the time-line. However, transferring the concept from visual data to the audio domain requires some substantial modifications and considerations regarding both the actual realization and the resulting usability [Hürst et al. 2004b, 2004c]. In the following, we summarize the final approach taken by us. When transferring a concept for visual navigation to the audio domain, a first issue which should be considered is that the default output value (i.e., when the user is not dragging the slider) should not be paused audio (in analogy to a static frame in the visual domain) but replay at normal speed. Secondly, the range of possible speedup factors has to be restricted to values that make sense for timescale modified speech. For example, it does not make sense to allow a replay rate of more than 3 times the normal speed, nor is it useful to play speech slower than half the normal rate. These constraints lead to a redefined mapping function which is shown in the right image of Figure 4. The resulting interface (cf. Fig. 2B) is used as follows: the slider thumb can still be grabbed directly and dragged as usual for quick navigation without acoustic feedback. 
The new functionality is invoked by pressing and holding down the mouse button on the slider bar anywhere next to the thumb (but not on it). A visualisation of the elastic band between the thumb and the mouse pointer is shown, together with a label displaying the current speed (see Fig. 3). Dragging to

Figure 4. Left: distance-to-speed mapping in the elastic slider for visual scrolling. Right: redefined mapping for the elastic audio slider. The dashed line shows the modified mapping after the default speed was set to 1.5 times the normal speed.


the right increases the speed, while moving to the left slows down replay, depending on the tension of the rubber-band. A neutral area around the slider thumb represents replay at normal speed. If the mouse pointer is dragged to the left of this area, playback is slowed down below the normal rate. Releasing the mouse button is equivalent to ‘letting go’ of the elastic band and reverts replay to normal speed instantly. The interface also features a standard speed controller interface element allowing users to set the replay speed to a preferred value. If it is adjusted to a speed faster or slower than normal, the elastic slider adapts to that value in the sense that it uses this speed as its new default value and speeds up or slows down from there, as illustrated in Figure 4 (dashed line). For a detailed description about individual design decisions together with a heuristic study leading to the final implementation of such an elastic audio slider, we refer to Hürst et al. [2004c]. Transferring a successful method for visual data skimming to the acoustic domain might not be reasonable in terms of usability. Therefore, a usability study with the proposed interface design and a group of twelve test users was conducted, showing the feasibility and usefulness of the proposed approach [Hürst et al. 2004b]. Looking at our introductory example of searching for a specific news message in a recorded news show, the situation is much easier with the new interface in terms of usability. All interactions needed to find the target position can now be done using one single interface element, the elastic audio slider (cf. Fig. 2B): Speedup to a high replay rate (for rough classification) can be done by dragging the elastic slider to the far right. If the overall topic is found, the speed can be reduced to a value allowing more detailed comprehension by moving the pointer closer to the slider thumb. Replay can be reset to the preferred rate by simply releasing the mouse button. 
Some test users in the evaluation explicitly noted that being able to return to the default speed by just releasing the mouse button while skimming a file at higher speed is a very useful and convenient feature. In order to go back to the beginning of the paragraph after the relevant part has been found, users can grab the slider thumb and drag it back in exactly the same way as they would do with the standard interface. No switching from one UI component (the speed slider) to the other (the progress slider) is necessary because the elastic audio slider integrates speed-based and position-based audio navigation into a single UI component.

Although the resulting interaction seems much easier and more fluid, the problem that a user has to re-set the slider thumb to an unknown position without any acoustic feedback still remains. This drawback is accompanied by an additional disadvantage, the scaling problem (see, e.g., Hürst et al. [2004a]): In long documents, short backward (and forward) steps are often impossible to accomplish with a slider since even a very small distance (e.g., one pixel) on the slider bar corresponds to a large portion of the document. In order to resolve these problems and provide feedback for navigation in both directions, some sort of backward audio replay might be useful. This issue is addressed in the next section.
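The distance-to-speed mapping of the elastic audio slider described in this section can be illustrated with a minimal sketch. The speed limits (0.5 and 3.0), the neutral area, the adjustable default speed, and the reset on button release are taken from the text. The linear shape of the mapping, the pixel constants, and the function name `replay_speed` are assumptions made here for illustration only; the actual curve is the one shown in Figure 4.

```python
MIN_SPEED = 0.5        # from the text: slower than half speed is not useful
MAX_SPEED = 3.0        # from the text: more than 3x normal speed is not useful
NEUTRAL_PX = 10        # ASSUMED half-width of the neutral area, in pixels
FULL_TENSION_PX = 200  # ASSUMED distance at which the limit speed is reached


def replay_speed(distance_px, default_speed=1.0, button_pressed=True):
    """Map the signed pointer-to-thumb distance to a replay speed factor.

    Positive distances (pointer right of the thumb) speed replay up,
    negative distances slow it down. The linear interpolation is an
    assumption of this sketch, not the mapping used by the authors.
    """
    # Releasing the button 'lets go' of the rubber-band: default speed.
    if not button_pressed or abs(distance_px) <= NEUTRAL_PX:
        return default_speed
    # Fraction of full rubber-band tension, clamped to at most 1.
    span = FULL_TENSION_PX - NEUTRAL_PX
    tension = min((abs(distance_px) - NEUTRAL_PX) / span, 1.0)
    # Interpolate from the default speed towards the relevant limit.
    limit = MAX_SPEED if distance_px > 0 else MIN_SPEED
    return default_speed + tension * (limit - default_speed)
```

With these assumed constants, a pointer held halfway between the neutral area and full tension on the right (`replay_speed(105)`) yields a factor of 2.0, and passing `default_speed=1.5` shifts the starting point of the mapping in the way the dashed line in Figure 4 indicates.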

3 Backward Audio for Speech Skimming

Skimming and searching usually involves navigating through a document in both directions. For example, when scanning printed text in order to find some specific information, the layout may help find the respective paragraph, but in order to



locate a particular sentence, users generally have to go back and forth within the actual text. In non-static visual documents such as video, fast-forward and backward is often used for search. A slider or scrollbar mapping the document length to the length of the slider bar allows navigation in both directions with full control over the speed if the screen display is updated in real time. It seems helpful and desirable to be able to do the same when searching and skimming recorded speech. In order to provide such a feature for both scrolling directions, some way of ‘intelligible backward replay’ needs to be realised.

3.1 Approaches to intelligible backward replay

As has been stated in Section 2, audio is stretched over time and a certain continuity of the signal is necessary in order to obtain intelligible feedback. This is different from video, which, besides the temporal dimension, has a spatial distribution of information. Thus, playing a video sequence backwards frame by frame still allows viewers to extract information from it, while doing the same with audio, i.e., playing samples in reverse direction, results in meaningless sound. Even though this method may help in detecting pauses and distinguishing speech from non-speech, the only way to provide any kind of intelligible audio feedback when going backwards in a document is to preserve the original direction and continuity of the signal over at least a short period of time. This means that small snippets of audio must be played in forward direction, containing a segment of the speech long enough to make sense to the listener. (In text scanning, too, readers do not go backwards letter by letter or word by word, but use ‘units’ consisting of several words that are read at a time.) Following this approach, continuous backward replay can be realised as a sequence of audio segments played normally (i.e., in forward direction), but in reverse order of their occurrence in the document.

Several approaches to speech skimming have used this method in order to establish intelligible backward audio. Some systems (e.g. Arons [1997]) try to partition the signal into meaningful segments consisting of words or phrases, which involves employing specialised algorithms. Others use fixed-length segments throughout the document, accepting fragmented speech within and between these segments. For example, Kim [2002] and Schmandt et al. [2002] use a fixed length of 4 seconds.
Interestingly, none of the known systems providing this type of backward replay report on any empirical basis for their specific choice of the segment length, although this is a critical parameter: on the one hand, segments must be long enough to contain some complete words or phrases in order to ensure intelligibility; on the other hand, they should be as short as possible for a fine granularity of the backward steps. Furthermore, no more than anecdotal evidence is given as to how useful this kind of backward speech actually is, i.e. how well users perform in search or classification tasks.

3.2 User Study

In order to address some of these open questions, an empirical evaluation was conducted with the following twofold goal: the first objective was to test the



overall suitability and usefulness of backward speech replay for the task of topic classification and thus to confirm the intuitive arguments claiming that backward replay of audio can be useful for search. The second one was to determine suitable values for some of the parameters involved. An approach with fixed-size segments was chosen, rather than using adaptive segment lengths resulting from partitioning the signal according to parts of speech. The two main parameters affecting the resulting signal in this approach are the duration sl of each segment and the jump width sj determining how much of the signal is skipped in each backward leap from the end of the segment just played to the beginning of the next segment to be played. Time-saving backward replay can be realised by increasing sj to a value higher than 2·sl. In this case, parts of the signal are omitted. (Conversely, if sj < 2·sl, segments would overlap and some parts would be played twice.) The other straightforward way to allow faster backward replay is audio time-compression, in which case the speedup factor is the main parameter. The study described in the following is a follow-up to an earlier evaluation presented in Hürst et al. [2005], where preliminary results are reported. The data discussed here is based on a larger number of participants, which is why the results are more reliable and an additional, more detailed analysis of the data could be carried out. 30 users, aged 16 to 61 (average 32.2), participated in this evaluation. Twelve of them were female, 18 male. 16 of the test subjects were students, the other 14 came from different professional backgrounds. No one had any previous experience with backward speech. The overall duration of the evaluation was 20 to 30 minutes per user. The study was subdivided into three experiments, A, B, and C, which all had the same structure but tested different parameters. 
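The roles of the two parameters sl and sj can be sketched in a few lines of code. The function below lists the segments of a recording in the order in which they would be played during backward replay; each segment itself is played forwards. The function name `backward_schedule`, the choice to start with the last sl seconds of the recording, and the decision to drop a leading remainder shorter than sl are assumptions of this sketch, not details reported for the study software.

```python
def backward_schedule(duration, sl, sj):
    """Return (start, end) times in seconds of the segments to play.

    sl: duration of each segment (each one is played forwards).
    sj: jump width, measured from the end of the segment just played
        back to the beginning of the next segment to be played.
    """
    if sl <= 0 or sj <= sl:
        raise ValueError("need sl > 0 and sj > sl to move backwards")
    segments = []
    # ASSUMPTION: replay starts with the last sl seconds of the recording.
    start = duration - sl
    while start >= 0:
        segments.append((start, start + sl))
        # Leap back by sj from the end of the segment just played.
        start = start + sl - sj
    # ASSUMPTION: a remainder shorter than sl at the very start is dropped.
    return segments
```

For a 10-second clip with sl = 2, a jump width of sj = 4 = 2·sl plays every part of the signal exactly once, `[(8, 10), (6, 8), (4, 6), (2, 4), (0, 2)]`, while sj = 6 skips 2 seconds between consecutive segments, `[(8, 10), (4, 6), (0, 2)]`. Values of sj below 2·sl make consecutive segments overlap, as noted in the text.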
In experiment A, the independent variable was the segment duration sl, while the jump width sj was always set to be 2·sl. The speech data consisted of news clips of 8 to 10 seconds in length, which were extracted from radio news messages. Each experiment contained two tests. First, users were asked to listen to the same audio clip several times with different values for sl (in ascending order). In the test, the segment lengths 0.25, 0.5, 0.75, 1, 1.5, 2, 2.5, and 3 seconds were evaluated. Users listened to the backward clip with the particular segment length, and had to rate this specific value on a 5-point scale ranging from “far too small” to “far too large”. The participants could listen only once to each version and then had to give their judgement. Later changes on the judgements were possible, but no re-listening was permitted. Different speech files were assigned to different users, but each participant was given the same file for all parameter values during this test. The goal of this test was to identify a threshold value for the segment length at which replay would generally be agreed to be intelligible. The second test of experiment A evaluated the same range of values for sl, but now the goal was to see how users performed at the task of classifying the topic and contents of the different news clips. In this part, the values were given in random order. Participants listened to a news clip played backwards and then had to classify the contents based on a given list of topics. 20 different clips were extracted from 10 news messages about sports (3 about soccer, 3 about car racing, 2 about cycling, and 2 about other sports). The participants got a different clip for each value of the segment length. The mapping of clips to parameter values was



equally distributed among the users. Participants were allowed to listen to each file only once and had to answer the questions immediately. No re-listening or later modification of the judgements was permitted in this test. Two detail levels for classification were offered: Users had the option to relate the clip to the actual news message it was extracted from (e.g., “Lance Armstrong wins at Tour de France”), to select the overall topic (e.g., “cycling”), or to give no classification if they were not able to classify the clip at all. In addition, they were asked to give the same subjective rating as in the first test.

Experiments B and C both tested some form of accelerated backward replay. Experiment B provided faster backward skimming through the omission of segments, as described above. Here, the jump width sj (determining the amount of omitted material) was the parameter to be evaluated. Based on initial testing, the segment length sl was fixed to 2 seconds for this test. Again, the participants performed two tests set up similarly to those in experiment A. In this case the test started with the ‘best’ value, i.e. sj = 2·sl = 4 seconds (where no parts of the signals were omitted). Values for the jump width sj increased in the first test (4, 4.5, 5, 5.5, 6 and 6.5 seconds), and were randomly ordered in the second one. User ratings were again done on a 5-point scale, this time ranging from “very good” to “very bad”.

Experiment C provided an alternative way of faster backward skimming, using time-compressed audio replay. Hence, the speedup factor was the independent variable, with the following values being tested: 1, 1.25, 1.5, 1.75, 2 and 2.25 times normal replay speed. These values were chosen in order to achieve the same overall time savings as in experiment B, and the two experiments were set up in the exact same way. Again, a fixed segment length sl = 2 seconds was used.
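The correspondence between the two sets of values can be made explicit with a short calculation. It assumes, following the definition of the jump width given above, that each played segment of length sl moves the replay position back by sj − sl in the document; the symbol r for the resulting effective backward rate is introduced here for illustration only.

```latex
\[
  r = \frac{s_j - s_l}{s_l}, \qquad s_l = 2\ \mathrm{s}
\]
\[
  \begin{array}{c|cccccc}
    s_j\ [\mathrm{s}] & 4 & 4.5 & 5 & 5.5 & 6 & 6.5 \\ \hline
    r                 & 1 & 1.25 & 1.5 & 1.75 & 2 & 2.25
  \end{array}
\]
```

Under this reading, the jump widths tested in experiment B yield exactly the speedup factors tested in experiment C.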
In order to exclude possible learning effects resulting from the order of these two experiments, 50% of the participants started with experiment B, the others with experiment C. Throughout the study, users were encouraged to make verbal comments. After completing the three experiments, the test subjects were interviewed and had to answer a short questionnaire. Although some users were rather sceptical about backward replay before the test, afterwards 80% agreed that it is a useful feature for speech skimming, and 29 out of 30 users thought backward audio could be a useful enhancement for skimming and searching audio-visual documents, e.g. video browsing. When asked about faster backward speech, 63% preferred time-compression, while 23% favoured the omission of segments. These preferences did not have any measurable influence on the classification task, where all users performed equally well.

The results of the subjective user judgements (median of all participants) for both tests of each experiment are illustrated in Figure 5 (left column). When comparing the outcomes of the two tests of each experiment, it is important to keep in mind that in the second tests the order of the parameter values was randomized to eliminate learning effects. Moreover, the users did not just listen to the file but had to solve an actual task. Therefore, the subjective judgements were expected to correspond to the users’ perceived success in the task and thus be less regular than in the first tests, an assumption that is confirmed by the data. The rather large deviations found in the subjective user judgements were somewhat surprising. A closer look at the data showed that judgements varied both within one document between different users and between documents for a single user. This indicates

Figure 5. Results of the classification tasks in the 3 experiments. Rows: Experiment A (segment length, 0.25 to 3 seconds), Experiment B (jump width, 4 to 6.5 seconds), Experiment C (replay rate, 1 to 2.25 times normal speed). Left column: subjective ratings in test 1 and test 2 (A: “far too short” to “far too long”; B and C: “very bad” to “very good”). Right column: share of users (0% to 100%) with correct classification of content, correct classification of topic, and wrong classification or no answer.

that subjective perception of backward speech highly depends on the user as well as on the actual document. The results of the second test of each experiment are illustrated in Figure 5 (right column). Users performed unexpectedly well in the classification task; even parameter values considered critical yielded predominantly correct results. For example, at a jump width of 6 seconds (i.e., 2 seconds of every 4-second block were omitted), 80% of the users were still able to identify the corresponding news message and another 17% were able to at least classify the overall topic. Performance with time-compressed audio decreased more obviously with higher values, but even at the highest rate of 2.25 times normal speed, half of the users were able to identify the corresponding news message. Thus, the most important finding is that the type of backward speech examined here does preserve enough intelligibility for users to classify the overall topic and, in a lot of cases, details of the contents. This is strong evidence that a proper integration of backward replay will enhance an audio skimming interface.

Figure 6. Redefined distance-to-speed mappings. Left: backward replay replacing slower than normal speed; right: backward replay plus slower than normal speed.

Considering the individual parameters, no clear trends or thresholds could be identified, except for the obvious findings that shorter segments, larger jump widths, and faster replay lead to a decrease in comprehension. This decrease is gradual for all three parameters; there is no sudden drop in intelligibility at any point. In the case of segment length, this finding did not match our expectation that a clear lower threshold value could be identified. However, it became apparent that rather short segments are sufficient for classification. Even at the lowest value of 250 ms (subjectively rated as “far too short” by almost all participants), more than half of the users were still able to identify the overall topic correctly. Regarding faster backward skimming, the variant which omitted parts of the signal showed a slightly better performance than the version using time-compressed audio, although 63% of the users expressed a preference for the latter.
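The arithmetic behind the jump-width example (6 seconds, with 2 seconds of every 4-second block omitted) can be made explicit with a small sketch. It assumes that each segment is played forward, that the jump width is measured backwards from the end of the segment just played, and that the segment length in that example is 2 seconds. These are readings consistent with the quoted numbers, not definitions taken from this section, and all function names are hypothetical.

```python
def backward_schedule(first_start, segment_len, jump_width):
    """List the (start, end) times of the segments played during backward replay.

    Assumption: every segment is played forward for `segment_len` seconds,
    then the position jumps back by `jump_width` seconds from the segment end.
    """
    if jump_width <= segment_len:
        raise ValueError("jump width must exceed segment length to move backwards")
    schedule = []
    start = first_start
    while start >= 0:
        schedule.append((start, start + segment_len))
        start = start + segment_len - jump_width  # play forward, then jump back
    return schedule


def omitted_per_block(segment_len, jump_width):
    """Seconds of signal skipped between two consecutively played segments."""
    return max(0.0, jump_width - 2 * segment_len)


def backward_speed(segment_len, jump_width):
    """Document time traversed per second of playback, relative to normal speed."""
    return (jump_width - segment_len) / segment_len


if __name__ == "__main__":
    print(backward_schedule(10.0, 2.0, 6.0))
    print(omitted_per_block(2.0, 6.0), backward_speed(2.0, 6.0))
```

Under these assumptions a jump width of 4 seconds omits nothing and moves backwards at normal speed, while 6 seconds skips half of the signal and doubles the effective backward speed.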

3.3 Backward Skimming with the Elastic Audio Slider

Considering the high classification performance and the positive user feedback, it makes sense to integrate this approach for backward replay into the elastic interface described in Section 2 and to provide audio feedback during scrolling in either direction. The findings of the study also suggest that either of the tested methods for faster backward replay could be used for implementing backward speech with interactively adjustable skimming speed. The original interface prototype did not support backward skimming in its ‘elastic’ function: when the mouse was dragged to the left of the slider thumb, replay was slowed down below the default speed (with the elastic band acting as a ‘brake’) rather than reversed. This behaviour is useful for close listening (e.g., if the speech document is not in the user’s native tongue) and still conforms to the rubber-band imagery. However, it may be counterintuitive to users who want to navigate backwards while using the elastic slider. This function can be replaced with our method to play audio backwards in the following way: when a user drags the mouse pointer to the left of the slider thumb, audio is set to backward replay, with the time-compression rate depending on the distance between pointer and thumb, just as in the forward scrolling scenario. The distance-to-speed mapping of the interface needs to be modified as illustrated in Figure 6 (left). The disadvantage of this approach is the complete loss of the option to scroll at a rate slower than the default speed. In order to preserve this function, we implemented a third variant including both time-stretched replay and backward replay. This, again, required a redefinition of the distance-to-speed

[Figure 7 (diagrams not reproduced): slider timelines from 0 min to 30 min with numbered interaction steps. Panel C: elastic audio slider with backwards replay; steps 1 and 2 as in B, step 3: pull thumb backwards till beginning of segment. Panel D: elastic audio slider with snap-to-segment; steps 1 and 2 as in B, step 3: click left of thumb and jump to beginning of paragraph. Legend: move pointer (button pressed); move pointer (button released); pause pointer (button pressed); mouse click.]

Figure 7. Example for navigation with (C) backward audio and (D) snap-to-segment.

mapping, which is shown in Figure 6 (right). Scrolling to the right still works as usual. When dragging to the left, replay first slows down to half the normal rate. If the user drags the mouse further to the left, backward replay is activated, starting at a low speed which increases with the distance. First tests suggest that this last approach is the most useful one. The concern that the sudden change of replay direction might confuse users was not confirmed, since an additional colouring of the backward region of the slider bar serves as an indicator and the label displaying the current speedup rate shows a negative number when scrolling backwards in the document. However, more testing is required in order to establish the best parameter settings, particularly those for faster backward skimming, which could be realized by either of the methods mentioned above, or even by a combination of the two. Backward speech replay further enhances the interface and facilitates skimming and search. In our example, once the user has detected the desired news message and wants to listen to it from the beginning, it is possible to navigate backwards while getting immediate audio feedback by simply dragging the elastic slider to the left (cf. Fig. 7C). That way, the user is less likely to miss the beginning of the message and be forced to re-position the slider several times, as is often the case when no audio feedback is given.
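The two redefined distance-to-speed mappings can be illustrated with a minimal sketch. Only the speed range of -3.0 to 3.0 and the slowdown to half the normal rate come from the text and Figure 6. The linear ramps, the normalised distance in [-1, 1], the width of the slow-down zone (`brake_zone`) and the initial backward speed (`min_backward`) are assumptions chosen for illustration, and the function names are hypothetical.

```python
def _lerp(a, b, t):
    """Linear interpolation between a and b for t in [0, 1]."""
    return a + (b - a) * t


def _clamp(d):
    """Restrict the normalised pointer-to-thumb distance to [-1, 1]."""
    return max(-1.0, min(1.0, d))


def speed_backward_only(d, max_speed=3.0):
    """Variant of Figure 6 (left): dragging left switches directly to backward replay.

    d > 0 means the pointer is right of the thumb, d < 0 left of it.
    A negative result denotes backward replay.
    """
    d = _clamp(d)
    if d >= 0:
        return _lerp(1.0, max_speed, d)
    return -_lerp(1.0, max_speed, -d)


def speed_slow_then_backward(d, max_speed=3.0, brake_zone=0.3, min_backward=0.5):
    """Variant of Figure 6 (right): slow down to half speed first, then go backwards."""
    d = _clamp(d)
    if d >= 0:
        return _lerp(1.0, max_speed, d)
    if -d <= brake_zone:
        # Assumed 'brake' zone: replay slows from normal speed to half speed.
        return _lerp(1.0, 0.5, -d / brake_zone)
    # Beyond the brake zone: backward replay, growing with the distance.
    t = (-d - brake_zone) / (1.0 - brake_zone)
    return -_lerp(min_backward, max_speed, t)


if __name__ == "__main__":
    for d in (1.0, 0.5, 0.0, -0.15, -0.3, -0.65, -1.0):
        print(d, speed_backward_only(d), speed_slow_then_backward(d))
```

The sign of the returned value corresponds to the negative speedup label mentioned above; an actual implementation would tune the zone widths and ramp shapes empirically.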

4 Discussion

Throughout this paper we have used a simple but realistic type of interaction from a typical search task as an example to motivate our approach and to illustrate its advantages. Specific issues and individual parts of the proposed interface have been evaluated in separate studies in order to support our intuitive arguments, to show the feasibility of the overall approach, and to optimize its parameter settings. The example and the different evaluations showed that the elastic audio slider offers advantages in particular situations, especially when skimming a speech file in search of information. However, we do not claim that it is superior to other techniques in all situations and under all circumstances. Instead, we argue that it complements existing methods and thus a combination of approaches should be used in practice. In this section, we review related work on speech skimming and discuss whether and how it can be combined with the elastic audio slider interface. As noted before, one additional approach to speech skimming is to locate acoustic cues in the sound signal in order to emulate the visual cues found in text documents. Such cues can be used to create a segmentation into sentences or phrases [Stifelman et al. 2001]. With this, advanced, more ‘intelligent’ navigation of a file can be supported using, e.g., simple interface elements, such as clickable



“jump to next / previous segment” icons [Arons 1997]. Such a segment-based navigation is particularly useful for search since it provides easy navigation in both directions. Other approaches, including the one presented in this paper, enable faster replay of the speech signal in order to make up for the missing ability of the ear to ‘skim over sound’ in the same way as the eyes can quickly scan printed text. Such time-scaled replay can be offered to the user via simple controller-like interfaces (cf. Fig. 1) or via more advanced approaches such as the elastic audio slider introduced in the previous sections. One of the main advantages of the latter is the ability to continuously move backwards in a file while still getting audio feedback. This type of backward skimming can have advantages compared to the segment-based backward jumps described before. Assume a situation where a user hears some stock quotes while skimming a file at high velocity. Instead of re-listening to the whole sentence or even the whole paragraph from its very beginning, the user might just want to go back a few seconds to replay the respective numbers. On the other hand, there are situations where jumps to the beginning of the previous segment might be more suitable (e.g., if the user is not just interested in the actual stock quotes but also in some background information and therefore wants to listen to the whole paragraph). Hence, the ultimate question is how to integrate all these options and techniques into the user interface in the most beneficial way.

Perhaps the most extensive and historically interesting work on user interface design for speech skimming is the SpeechSkimmer [Arons 1994, 1997]. The SpeechSkimmer incorporates time as well as content compression techniques by enabling replay modification in two dimensions, which is reflected in the layout of the interface.
For content compression, parts of the speech signal are identified as less relevant based on automatic identification of pauses and intonation. Those parts are skipped during replay in order to increase replay speed. In the horizontal dimension of the interface, users can choose between different, discrete browsing levels at which more or fewer parts of the speech signal are removed. Both forward and backward replay are supported. Within each level, replay speed can be adjusted continuously by moving a mark in the vertical dimension. Moving this indication mark up or down increases or decreases replay speed, respectively. Additional interface elements are provided, such as bookmark-based navigation along a timeline or buttons to jump to the previous or next segment. Segmentation is, again, based on automatic pause detection and intonation analysis. When skimming a file at a higher speed, one feature turned out to be particularly useful: a “jump & play” button allows a user to go back to the beginning of the current segment and to continue replay at normal replay speed. The SpeechSkimmer interface design thus incorporates three different concepts for speech skimming: content compression in the horizontal dimension, speed modification in the vertical dimension, and navigation by jumping to distinguished positions (personal bookmarks, segment borders, etc.) using a timeline and an additional button field. One consequence of this richness of functions is the rather extensive layout of the SpeechSkimmer interface, making it difficult to integrate it into existing software implementations such as media players. (In fact, Arons [1997] suggests the use of a special hardware component developed specifically for the SpeechSkimmer.)



Content-compressed replay is a very good feature for quick classification of the overall content of a file or for finding larger parts on one topic, but not necessarily for a detailed search for particular information, a situation in which the elastic audio slider unfolds its full strength. Modification of replay speed in the SpeechSkimmer is done in a similar way as with the controller interface element illustrated in Figure 1 (but with a different orientation, i.e. vertically instead of horizontally), which, again, is useful and sufficient in many situations but lacks flexibility and power in others, as has already been argued in the previous sections. Segment-based navigation can complement the elastic slider in a useful and reasonable way, which is why we suggest the following extension to our interface, where segment-based navigation can be done with the elastic slider by clicking at arbitrary positions on the progress bar. The two common implementations for mouse clicks at an arbitrary position of the progress bar are snap-to-tick and snap-to-click. Snap-to-click sets the slider thumb (and thus, the document position) to the position of the mouse pointer. With snap-to-tick, the slider thumb jumps to the closest tick on the slider scale in the direction of the mouse pointer. The ticks are equidistant marks on the (linear) scale and, in the case of audio documents, usually represent units of time. However, if a segmentation of the audio file into meaningful units is available or can be calculated from the signal, it is possible to use segment borders instead of ticks and thus to implement a snap-to-segment functionality: here, the thumb jumps to the beginning of the previous or following segment depending on whether the pointer is left or right of the current thumb position, respectively, a behaviour that Stifelman et al. [2001] refer to as “audio snap-to-grid”.
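The click behaviours described above differ only in the set of positions the thumb may land on, which the following sketch makes explicit. It is an illustration under stated assumptions rather than the implementation of the proposed interface: positions are given in seconds, the segment borders are assumed to be available as a sorted list of segment start times, a click to the left of a thumb that sits inside a segment is assumed to lead to the start of that segment (as suggested by Fig. 7D), and all names are hypothetical.

```python
from bisect import bisect_left, bisect_right


def snap_to_click(thumb_pos, click_pos):
    """Snap-to-click: the thumb moves to the clicked position."""
    return click_pos


def snap_to_boundary(boundaries, thumb_pos, click_pos):
    """Move the thumb to the nearest boundary in the direction of the click.

    `boundaries` must be sorted in ascending order. With equidistant ticks this
    is snap-to-tick; with segment start times it is snap-to-segment.
    """
    if click_pos < thumb_pos:
        i = bisect_left(boundaries, thumb_pos)   # boundaries[:i] lie strictly left of the thumb
        return boundaries[i - 1] if i > 0 else thumb_pos
    if click_pos > thumb_pos:
        i = bisect_right(boundaries, thumb_pos)  # boundaries[i:] lie strictly right of the thumb
        return boundaries[i] if i < len(boundaries) else thumb_pos
    return thumb_pos


if __name__ == "__main__":
    ticks = list(range(0, 601, 60))        # one tick per minute of a 10-minute file
    segment_starts = [0, 95, 240, 410]     # hypothetical news message starts

    print(snap_to_click(300, 100))                      # 100
    print(snap_to_boundary(ticks, 130, 400))            # 180
    print(snap_to_boundary(segment_starts, 300, 100))   # 240
    print(snap_to_boundary(segment_starts, 300, 590))   # 410
```

Because only the list of boundaries changes, a player that already supports snap-to-tick could in principle offer snap-to-segment by swapping the tick list for the detected segment starts.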
This feature allows users to skim a file with the elastic slider as outlined in the preceding sections or through segment-based navigation by clicking repeatedly at arbitrary positions of the slider bar. Switching between both modes can easily be done, since both interaction types are integrated in one single interface. Based on the good results from Arons [1997] with the “jump & play” button, we suggest that replay after a click should fall back to the user’s pre-selected default speed. Which kind of segmentation is best to use mainly depends on the data and the expected usage. For example, for highly structured data (such as news shows), a content-based segmentation might be useful (such as single news messages), while in the case of single-speaker files, pauses or the beginnings of individual phrases and sentences can be a better choice. Reconsidering the example used to illustrate and motivate the elastic audio slider, the situation becomes as follows: a user can quickly skim forward at a rather high speed, reduce replay speed in order to listen more carefully to the content, and then, once a part of particular interest is identified, go back using backward skimming (cf. Fig. 7C) or jump back to the beginning of, e.g., the respective news message (cf. Fig. 7D), depending on whether the aim is to just re-listen to the stock quotes or to the whole news message, respectively. The important issue here is that all these functionalities are integrated smoothly into the overall interface design, thus offering many more possibilities to the user while at the same time avoiding overloading the interface with additional interface elements and complicated interaction concepts. In fact, the basic design of the proposed interface is no different from the current, established layout of common audio and media players as illustrated in Figure 1, and aside from the snap-to-segment



behaviour, all standard interactions offered by the original interface can still be used in the same way as before. Thus the elastic audio slider does not replace but complements common approaches in a reasonable and beneficial way.

5 Summary and Future Work

In the preceding sections we described a new interface design for interactive skimming of speech recordings. First, we reviewed our previous work in this area, describing its advantages and identifying its limitations. Then, we presented an evaluation showing that backward replay of speech signals, realized in a suitable fashion, preserves enough of the intelligibility of the signal to be usable for speech skimming, and we described how it can be integrated in the elastic audio slider interface. Finally, we related our work to other approaches and proposed an extension of the elastic slider which can further improve its usability. Our first focus for future work is therefore the implementation and evaluation of the proposed extension of the interface. Other areas for further research include the question of how content compression can be integrated in a reasonable way. While we argued that content compression is not suitable when searching for detailed, particular information, Arons’ work showed that it can be very useful for rough topic identification. Thus, it should be considered in the interface design, e.g. by an option to automatically eliminate or shorten longer pauses in the speech signal. In addition, combinations should be evaluated with approaches that present not just meta-information and characteristics of the acoustic signal in the interface but the actual content of the speech file, e.g. by generating a transcript of the spoken words using automatic speech recognition. Related work showed that this offers great benefits if the ASR system produces transcripts of high word accuracy [Stark et al. 2000], but can even be useful when the transcripts are laden with errors [Vemuri et al. 2004]. Perhaps the most interesting area for future research is the combination of our interface for acoustic skimming with visual data browsing, e.g. for TV news show recordings where important information might be located in the acoustic signal as well as in the visual stream.

In addition to these extensions of the current interface design, its further evaluation in different scenarios is one of our key interests. The evaluations presented in this paper and the ones cited from our previous work demonstrated the feasibility and usefulness of the approach. However, it will be very interesting to see how the elastic audio slider is used in actual, real-world scenarios. One area we are particularly interested in is the evaluation of the interface with students who use lecture recordings for learning, e.g., when preparing for exams. We believe that the advanced navigation functionality and interactivity provided by the system will greatly improve the overall user experience and usability in this scenario.

References

Amir, A., Ponceleon, D., Blanchard, B., Petkovic, D., Srinivasan, S. & Cohen, G. [2000], Using Audio Time Scale Modification for Video Browsing, Proceedings of HICSS 2000, Maui, HI, IEEE Computer Society, 3046-3052.

Arons, B. [1994], Interactively Skimming Recorded Speech, PhD thesis, MIT.

Arons, B. [1997], SpeechSkimmer: A System for Interactively Skimming Recorded Speech, ACM Transactions on Computer-Human Interaction 4(2), 3-38.

He, L. & Gupta, A. [2001], Exploring Benefits of Non-linear Time Compression, Proceedings of ACM Multimedia 2001, Ottawa, Canada, ACM Press, 382-391.

Hürst, W., Lauer, T. & Bürfent, C. [2005], Playing Speech Backwards for Classification Tasks, Proceedings of the IEEE International Conference on Multimedia & Expo (ICME 2005), Amsterdam, The Netherlands, IEEE Press.

Hürst, W., Götz, G. & Lauer, T. [2004a], New Methods for Visual Information Seeking through Video Browsing, Proceedings of the 8th International Conference on Information Visualisation, London, UK.

Hürst, W., Lauer, T. & Götz, G. [2004b], An Elastic Audio Slider for Interactive Speech Skimming, Proceedings of the 3rd Nordic Conference on Human-Computer Interaction (NordiCHI 2004), Tampere, Finland, ACM Press, 277-280.

Hürst, W., Lauer, T. & Götz, G. [2004c], Interactive Manipulation of Replay Speed While Listening to Speech Recordings, Proceedings of the 12th ACM International Conference on Multimedia, New York, NY, ACM Press, 488-491.

Kim, J.S. [2002], TattleTrail: An Archiving Voice Chat System for Mobile Users over Internet Protocol, Masters thesis, MIT, Boston, MA.

Masui, T., Kashiwagi, K. & Borden IV, G.R. [1995], Elastic Graphical Interfaces for Precise Data Manipulation, Conference companion to the SIGCHI conference on Human factors in computing systems, ACM CHI 1995, ACM Press, 143-144.

Roucos, S. & Wilgus, A. [1985], High quality time-scale modification for speech, Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, vol. 2, Tampa, FL, IEEE Press, 493-496.

Schmandt, C., Kim, J.S., Lee, K., Vallejo, G. & Ackerman, M. [2002], Mediated Voice Communication via Mobile IP, Proceedings of UIST 2002, Paris, France, ACM Press, 141-150.

Stark, L., Whittaker, S. & Hirschberg, J. [2000], ASR Satisficing: The Effects of ASR Accuracy on Speech Retrieval, Proceedings of International Conference on Spoken Language Processing, ICSLP 2000, vol. 3, 1069-1072.

Stifelman, L., Arons, B. & Schmandt, C. [2001], The Audio Notebook. Paper and Pen Interaction with Structured Speech, Proceedings of CHI 2001, Seattle, WA, ACM Press, 182-189.

Vemuri, S., DeCamp, P., Vender, W. & Schmandt, C. [2004], Improving Speech Playback Using Time-Compression and Speech Recognition, Proceedings of CHI 2004, Vienna, Austria, ACM Press, 295-302.