Enhancing Caption Accessibility Through Simultaneous Multimodal Information: Visual-Tactile Captions

Raja S. Kushalnagar, Gary W. Behm, Joseph S. Stanislow and Vasu Gupta
National Technical Institute for the Deaf, Rochester Institute of Technology

{rskics,gwbnts,jssnbs,vxg1421}@rit.edu

ABSTRACT

Captions (subtitles) for television and movies have greatly enhanced accessibility for Deaf and hard of hearing (DHH) consumers who do not understand the audio, but can otherwise follow by reading the captions. However, these captions fail to fully convey auditory information, due to simultaneous delivery of aural and visual content, and lack of standardization in representing non-speech information. Viewers cannot simultaneously watch the movie scenes and read the visual captions; instead they have to switch between the two and inevitably lose information and context in watching the movies. In contrast, hearing viewers can simultaneously listen to the audio and watch the scenes. Most auditory non-speech information (NSI) is not easily represented by words, e.g., the description of a ring tone, or the sound of something falling. We enhance captions with tactile and visual-tactile feedback. For the former, we transform auditory NSI into its equivalent tactile representation and convey it simultaneously with the captions. For the latter, we visually identify the location of the NSI. This approach can benefit DHH viewers by conveying more aural content to the viewer's visual and tactile senses simultaneously than visual-only captions alone. We conducted a study, which compared DHH viewer responses between video with captions, tactile captions, and visual-tactile captions. The viewers significantly benefited from visual-tactile and tactile captions.

Figure 1: Intertitle-Scene temporal and spatial separation: The movie briefly narrates what will happen in the scene, and then displays the scene.

1. INTRODUCTION

In the U.S., most DHH viewers are accustomed to always-available captioning for television, especially after the passage of the Americans with Disabilities Act in 1991. The advent of online entertainment has increased technical capabilities and expectations for more customizable and accessible captions. Subtle barriers remain: multiple simultaneous visuals and imprecision in translating speech and non-speech information (NSI) [14]. Interestingly, depending on the state of the art in movie display technology, the amount of accessible information for DHH viewers has varied dramatically over the years.

The era from 1900 to 1920 was a golden era for deaf and hard of hearing consumers, who could watch and understand movies. During that era, all movies were exclusively visual, and therefore silent, as movie projectors of that era could not play synchronized audio. Speech and non-speech information alike was conveyed through pantomime or partially synchronized "intertitles" that were spliced between scenes; they paraphrased dialogue and other bare-bones information about the story not apparent from visual clues, as shown in Figure 1. Although the intertitles effectively transformed aural to visual information, the streams were separated both spatially and temporally. Movie consumers did not complain, for they were not aware of any alternatives.

From 1927, movies could show both video and audio, which ushered in the "talkies" era, and by 1929 the talkies had completely supplanted the silents. Intertitling continued to be used for showing dialogue in foreign language movies. Consumers, now exposed to audio and acutely aware of the importance of audio-visual synchronization, rapidly noticed the lack of intertitling synchronization with the video, and their complaints were doubtlessly heard. With this incentive solidly in place, by 1933 spatially synchronized subtitles had completely replaced intertitles.

Categories and Subject Descriptors

H.5.1 [Information Interfaces and Presentation]: Multimedia Information Systems; K.4.2 [Social Issues]: Assistive technologies for persons with disabilities

General Terms

Human Factors, Design, Experimentation

Keywords

Aural-to-tactile information; Caption Readability; Multimodal Interfaces; Deaf and Hard of Hearing Users

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]. ASSETS'14, October 20–22, 2014, Rochester, NY, USA. Copyright 2014 ACM 978-1-4503-2720-6/14/10 ...$15.00. http://dx.doi.org/10.1145/2661334.2661381.


DHH viewers mourned the loss of movie accessibility. They could foresee regaining access via yet-to-be-invented technology (captions), but they also knew captioned talkies would not be fully accessible, as shown in the quote below:

Perhaps, in time, an invention will be perfected that will enable the deaf to hear the "talkies," or an invention which will throw the words spoken directly under the screen as well as being spoken at the same time. ... The real solution, I believe, is to have a silent theatre in every city where there are a large number of deaf people.

— Emil S. Ladner, a deaf high school senior, in an essay that won a nationwide competition sponsored by Atlantic Monthly in 1931 [15].

1.1 Enhancing Caption Accessibility

In this paper, we explore two simple but effective approaches to aid DHH viewers in watching movies with non-speech audio content: tactile captions and visual-tactile captions. Our experiments measure students' preference for and recall of captions and associated scenes in a captioned movie by asking them to complete a preference and recall survey after using each of our two tools, compared to the baseline case of regular captions. We find that, while both approaches have a positive impact on viewer preferences and recall accuracy scores, captions with visual-tactile information are preferred over captions with tactile information, and both are preferred over captions alone. Moreover, visual-tactile enhanced captions yield significantly higher recall than the other two. We discuss the relevant design criteria based on the feedback we received from users during our iterative design process, and suggest future work that builds on these insights. In general, there is much potential in leveraging multiple sensory systems to maximize DHH viewers' access to audiovisual movies. Combining both aural and visual information into a single visual caption channel can result in reduced understanding and reduced scene or caption recall. We discuss the implications of converting a unimodal presentation into a bimodal (visual/tactile) one.

The rest of this paper is organized as follows:

• We discuss the impact of technological evolution on movie accessibility, including modern approaches that leverage the web to enable more accessible captions in settings where they may not have previously been practical.

• We present the design of a tactile caption interface that allows viewers to feel simple non-speech auditory events that occur simultaneously with the captions.

• Based on the results of the tactile captioning study, we present a visual-tactile caption interface that allows viewers not only to feel simple non-speech auditory events but also to see where they originate in the scene. Our results show that users prefer this method and perform significantly better on recall questions.

• We then discuss the implications of our findings, and the design principles for caption presentation that we derived from the feedback of our study participants.

• We conclude with a discussion of potential future improvements to captioning that can further address the problems faced by viewers while watching movies.

Figure 2: Caption-Scene spatial separation: The captions are shown simultaneously with the scene. But they are at the bottom of the screen, away from the main scene action. The DHH viewer's gaze path shows how the viewer reads the captions, and then scans the scene looking for the people who uttered the words that were shown in the captions.

2. BACKGROUND AND RELATED WORK

Aural and visual information contained in a "talkie" can be combined into a single "silent movie" with captions. Though a captioned movie is far more accessible to DHH viewers, their access is generally less than that of their hearing peers due to captioning interface issues.

2.1 Captioning Interface Issues

Many DHH viewers who watch captioned media deal with multiple factors such as simultaneous visuals, visual search, non-speech information, and identifying sound sources. These issues challenge viewers to watch video and read captions that convey sufficient aural information, yet remain synchronized and readable. Viewers are also likely to miss information that requires both seeing the scene and reading the captions. For example, some DHH viewers miss the point of visual gags or fail to identify who is talking in a group.

2.1.1 Simultaneous visuals

Viewers split their attention between the video and the captions, and this reduces the amount of time they can spend on each [13]. Most DHH viewers spend a majority of their time reading the captions [9], and thus have little time to watch the scenes, as shown in Figure 2. Many DHH viewers are not fluent in English and may need extra time to read the captions [8]. This means that they often fall behind and do not catch all of the dialog. Due to the spatial separation between the captions and the scene, as shown in Figure 3, viewers have to manage and split their attention between the two, which remains an elusive goal. Previous studies have found that DHH students usually watch the captions most of the time and barely spend any time watching the slides or the teacher. Overall, deaf students looked at the instructor 10% of the time and at the slides 14% of the time [12], as compared to 15% and 22% reported by Marschark et al. [16], and 12% and 18% reported by Cavender et al. [6]. Hearing students, on the other hand, have no captions to focus on, and usually spend most of their time watching the instructor or the lecture visuals.

Figure 3: Caption-Scene spatial separation: The captions are shown simultaneously with the scene. The viewer has to decide which visual to focus on and read, and ignore the other one.

2.1.2 Visual Search

Deaf viewers spend a lot of time searching for visuals when switching back to the video, as shown in Figure 2. This wasted search time discourages gaze shifting. Hearing viewers do not need to look at the audio source, while deaf viewers have to actively look at the visual translation of the audio in order to understand it. This is problematic for the deaf viewer, who has to switch between the lecture visual and the aural-to-visual translation, which can significantly interfere with audiovisual perception and understanding. Video with a lot of visual detail can also increase the time needed to switch between the video and the captions.

2.1.3 Non-speech information

It is difficult to represent all auditory information through captions. First, most NSI is not easily expressed in text, as it is perceptually meaningful but semantically vague. These sounds are not usually part of the language. Even if they are expressed as onomatopoeic words, different cultures use different words to represent the same sounds [19]. They do not map well to text and are often ambiguous; moreover, they can also occur simultaneously with speech. In these cases, the captions usually show only the text representation of the speech and omit the non-speech representation. In other words, non-speech information cannot easily or accurately be described and inserted into the captions.

2.1.4 Speaker Identification and NSI Localization

When there are multiple speakers or sources of NSI, captions usually do not have the bandwidth to indicate the sound source. For instance, in Figure 2, if one of the group members speaks up, a hearing person can often recognize who spoke by the audio signature alone, while captions do not carry this information. Similarly, if a group member's phone rings, a hearing viewer can often localize the sound and quickly identify whose phone rang, while this information is not available to a viewer reading captions. The duration also may not be clear: for example, is the phone still ringing, or did it just ring once?

2.2 Need for Improved Caption Interfaces

Modern user interfaces increasingly mirror physical and social interaction, as this leverages existing human interactive knowledge. However, the range of this interaction, such as gesture and speech, is incredibly diverse and can be a barrier for users with different abilities and cultures. We explore deaf and hard of hearing consumer preferences for tactile captions that enhance auditory information, such as a phone ring. The extra effort that has to be put into reading subtitles often diminishes the quality of visual attention on the movie and reduces total satisfaction in watching it. Yet, if viewers are unaware of alternatives, they cannot articulate or act on their issues clearly. It is our belief that simultaneous information in two modalities can be most accurately conveyed in two modalities, not necessarily the original ones. We investigate the efficacy of transforming audio-visual movies into visual-tactile movies by inserting visual-tactile cues.

2.2.1 Tactile feedback

This approach conveys information through vibrations that a person can feel. Vibrotactile stimulation activates numerous mechanoreceptors in the skin, and their responses depend on the frequency, amplitude, and duration of vibration and the area of skin in contact with the motor [10]. Vibrotactile discrimination over 0.4 to 1000 Hz is relatively poor compared with aural discrimination over 20 to 20,000 Hz [18]. The encoding parameters available for a vibrotactile stimulus (tactons) are frequency, intensity, locus, and duration; the latter two hold the most promise for encoding information in a tactile display [5]. Frequency and intensity appear to be the least exploitable dimensions, primarily because the skin is rather poor at discriminating differences in frequency [4]. Research on translating sound to vibration goes back to the earliest days of electronic devices: indeed, Norbert Wiener invented a hearing glove as a sound-to-tactile prosthesis for deaf people [17]. The frequency limitations were not well understood, and there were many failed inventions from the 1920s to the 1960s that tried to translate speech to vibrations for deaf consumers, including Wiener's hearing glove. However, there is scant prior research on combining video captions with tactile feedback. One line of research has focused on simulating the hearing experience by providing a rich, whole-body tactile experience. Prior research has explored sensory substitution via a tactile chair, which provided a high-resolution audio-tactile version of music to the body [2, 3]. The system uses eight separate audio-tactile channels to deliver sound to the body, and provides an opportunity to experience a broad range of musical elements as physical vibrations. The other line of research has focused on augmenting audio or visual information with tactile information. For example, Apostolopoulos et al. [1] and Khoo et al. [11] used the vibration function of smartphones to provide tactile information in parallel with auditory information for navigation by blind and low vision users. In terms of multimodal interfaces, tactile perception has much potential in supplementing other modalities for simple sounds, especially those that are distinguishable temporally. For instance, perception through tactile feedback nicely complements reading through visual perception. First, there is high overlap between aural and tactile perception, unlike between aural and visual perception: human ears and skin both detect physical oscillations, while the eyes perceive electromagnetic waves. Second, tactile perception has much finer temporal discrimination than its visual analogue, by almost an order of magnitude [7]. Third, tactile information tends to be perceived directly and untranslated, unlike visual reading: the viewer does not have to semantically interpret the sound, but can directly feel it. In contrast, a single non-speech sound can be captioned in multiple ways, and there is no clear agreement on which one to pick. For example, a phone ringing could be represented in at least three different ways: [ Phone Ringing ], [ Phone Rings Multiple Times ], or [ Phone rings 3 times ].

3. PRELIMINARY EVALUATION

From prior studies [8, 6, 12], we know that DHH viewers spend the majority of their time reading captions; for example, Jensema et al. found that viewers spent 84% of their viewing time on the captions and less than 16% on the scenes [9]. By contrast, hearing viewers, who listen to the audio and watch the scenes, spend close to 100% of the time watching the scenes. From prior observation, we note that DHH viewers struggle to identify and locate auditory cues, such as a ringing phone, or to determine the identity of who is speaking. For many non-speech event captions, DHH viewers also spend a significant amount of time searching for the location and nature of the cues. This extra searching time discourages viewers from shifting to the scene and obtaining more contextual information.

3.1 Pilot Study

We conducted a pilot study to investigate viewer recall of NSI events after they watched a captioned movie, as shown in Figure 5.

Figure 4: A DHH viewer watching a captioned video.

Figure 5: A snapshot of the captioned video at a specific moment that has NSI (object falls). This information is conveyed through the captions alone or supplemented through tactile feedback.

3.1.1 Setup

We recruited 27 DHH participants (15 male, 12 female) from the National Technical Institute for the Deaf. The average participant age was 19.8 years. Our participants all had prior experience with watching captioned movies from birth, reflecting the fact that they had grown up after the passage of the Americans with Disabilities Act. They sat in front of a laptop and viewed a 5-minute captioned movie that contained exactly seven NSI events, such as a phone ringing or foot stomps, as shown in Figure 4. Next, they completed a survey that tested their ability to 1) recall and narrate the NSI events, and 2) identify and describe the location of the NSI in the scene.

3.1.2 Results

The participants were asked to recall and describe all seven NSI events (NE), e.g., [ object falls ], that they observed in the video. For all correctly described NE, the mean and standard deviation were NE = 4.73, σ = .44. Next, the participants were asked to describe the item that caused the NSI (NI) in each video scene when the event occurred (e.g., floor stand). For all correctly described NI, the mean and standard deviation were NI = 3.87, σ = .35. The lower number of correctly described NSI items in comparison with correctly described NSI events suggests that DHH viewers miss important information by not being able to both see the item causing the NSI event and read the captions that describe the event. An approach that assists viewers in better perceiving the non-speech sounds would be a logical place to start. The responses indicated that some could not quickly associate the captions with the visuals. Most likely this is because many deaf viewers do not have the 'auditory' memory needed to associate and recognize sounds and the items causing them.

3.1.3 Participant Feedback

The participants stated that the captions were clear, but that they sometimes could not read all of the text in time. Additionally, they often did not see which item triggered the NSI until after the event.

The movie was funny! But! Many times I could not see where the sound was coming from.

I got a little irritated at the captions. The phone rang, but I could not tell whose phone rang.

Some of the captions was distracting. I need time to read words like printer whirring

Figure 6: The sound envelope of an RIT office doorbell.

4. ITERATIVE SYSTEM DESIGN

Our goal is to develop accessible multi-modal captions that enable DHH users to better perceive, follow and manage both the auditory and visual information that is simultaneously presented, which is common in most sitcoms and movies. Conveying additional NSI along with the textual description can help DHH viewers understand the plot, especially if the NSI is semantically important. In addition to assessing the impact of tactile captions, we also assess the impact of visual-tactile captions that indicate where the sound originated and its duration. The multi-modal caption interface should also be intuitive and customizable in order to meet the needs of the diverse population of deaf and hard of hearing consumers. The formative study identified two main issues:

1. Inability to interpret the caption words meaningfully in time, for many phrases including NSI.

2. Inability to associate scene content with the NSI caption content, and vice versa.

We investigated whether the synchronization of tactile and/or visual-tactile feedback with the captions could address both issues.

Figure 7: A vibration envelope that is temporally faithful to the sound envelope in Figure 6.

4.1 Vibration mappings

While some non-speech sounds can be represented with standardized onomatopoeic words, far more non-speech sounds have no standardized print representation. For example, there is no standard way to represent many ambient noises, such as air duct or printer noises. Moreover, written representations do not convey most aural properties, such as loudness, that can be important depending on context. For example, captions usually indicate printer noises as [ printer whirring ] or [ printer noise ]. These descriptions do not convey duration or loudness. With tactile feedback, a viewer can simultaneously watch the entire video scene, including the printer whirring, and feel it. If the printer noise is important to the plot, the viewer will notice it and relate better to the plot.
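To make this mapping concrete, the following sketch reduces an NSI audio segment to a coarse vibration envelope that preserves the sound's duration and its relative loudness over time. This is an illustrative C# sketch rather than the authors' implementation: the class, method, and parameter names are invented, and because the tactor described in Section 4.5 has a single power setting, the intensity values would in practice be collapsed to on/off timing.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Illustrative sketch (not the authors' code): reduce an NSI audio segment to a
// coarse vibration envelope that keeps the sound's duration and relative loudness.
public static class VibrationMapper
{
    public struct VibrationStep
    {
        public int DurationMs;   // how long this step lasts
        public double Intensity; // 0.0 (off) to 1.0 (strongest), relative loudness
    }

    // samples: mono PCM samples of the NSI segment, normalized to [-1, 1]
    // sampleRate: audio sample rate in Hz
    // stepMs: temporal resolution of the vibration envelope (assumed value)
    public static List<VibrationStep> ToVibrationEnvelope(
        float[] samples, int sampleRate, int stepMs = 20)
    {
        int samplesPerStep = Math.Max(1, sampleRate * stepMs / 1000);
        var steps = new List<VibrationStep>();

        for (int start = 0; start < samples.Length; start += samplesPerStep)
        {
            int end = Math.Min(start + samplesPerStep, samples.Length);

            // The peak amplitude within this window stands in for loudness.
            double peak = 0;
            for (int i = start; i < end; i++)
                peak = Math.Max(peak, Math.Abs(samples[i]));

            steps.Add(new VibrationStep { DurationMs = stepMs, Intensity = peak });
        }

        // Normalize so the loudest moment maps to full vibration strength.
        double max = steps.Count > 0 ? steps.Max(s => s.Intensity) : 0;
        if (max > 0)
            for (int i = 0; i < steps.Count; i++)
                steps[i] = new VibrationStep
                {
                    DurationMs = steps[i].DurationMs,
                    Intensity = steps[i].Intensity / max
                };

        return steps;
    }
}
```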

4.2 Tactile Captions

We built a library of NSI tactile events that faithfully replicated the NSI auditory temporal and amplitude changes over time. First, we implemented a program to scan the caption text for non-speech information enclosed by brackets, e.g., [ phone ringing ]. We then analyzed the auditory envelope for that timeframe. We picked video-specific minimum amplitudes to aid in identifying the start and end times for the NSI event. As touch has excellent temporal resolution, we focused on preserving temporal fidelity by copying the peak-to-peak timing as closely as possible. For example, in Figure 6, the doorbell sound envelope has two peaks, which can be transformed into the vibration envelope shown in Figure 7. Each tactile envelope was added to a library. When the entire video had been processed, the library was transferred to the vibration microcontroller. For viewing, users start an integrated program that displays both the video and the captions. When the program detects an NSI caption, it sends the NSI description to the vibration microcontroller, which looks up the matching vibration pattern in its library and runs it.
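The two preprocessing steps described above, scanning the caption text for bracketed NSI and finding the NSI's start and end in the audio envelope using a video-specific minimum amplitude, might look roughly like the sketch below. The regular expression and names are assumptions for illustration, not the authors' code.

```csharp
using System.Text.RegularExpressions;

// Illustrative sketch (not the authors' code) of the two preprocessing steps:
// (1) detect bracketed NSI descriptions such as "[ phone ringing ]" in a caption,
// (2) locate the NSI's start and end inside the caption's audio envelope using a
//     video-specific minimum amplitude threshold.
public static class NsiExtractor
{
    // Matches non-speech information written between square brackets.
    private static readonly Regex NsiPattern =
        new Regex(@"\[\s*(?<label>[^\]]+?)\s*\]", RegexOptions.Compiled);

    public static bool TryGetNsiLabel(string captionText, out string label)
    {
        Match m = NsiPattern.Match(captionText);
        label = m.Success ? m.Groups["label"].Value : null;
        return m.Success;
    }

    // envelope: amplitude envelope samples for the caption's timeframe (0..1)
    // minAmplitude: video-specific threshold separating the NSI from background
    // Returns the first and last indices above the threshold, or (-1, -1) if none.
    public static (int start, int end) FindNsiBounds(double[] envelope, double minAmplitude)
    {
        int start = -1, end = -1;
        for (int i = 0; i < envelope.Length; i++)
        {
            if (envelope[i] >= minAmplitude)
            {
                if (start < 0) start = i;
                end = i;
            }
        }
        return (start, end);
    }
}
```

In such a sketch, the bounds and the peak timing between them would then be stored as the library entry for that NSI.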

4.3 Vibration device characteristics

Like most personal consumer devices, personal vibration information devices should be usable, accurate, durable, low-cost, highly reliable, and easily worn. For our study, we focused on usability in terms of watching captioned videos:

1. The auditory-to-tactile translation should convey salient information such as duration, rhythm or amplitude.

2. The captions describing the auditory event to be translated to tactile information should appear simultaneously with the tactile information.

3. The device should not feel intrusive while being used or worn.

4.4 Visual-Tactile Captions

In addition to the synchronous tactile information, we added synchronous visual information overlaid on the NSI items to supplement the caption text. We created multiple parallel wavy lines over the NSI item. We extracted the NSI amplitude and automatically set the lines' width proportional to the auditory amplitude. We also automatically set the length of the thick wavy lines according to the NSI's duration. We overlaid these lines on the NSI item whenever the corresponding NSI caption was shown. We used a single solid, bright color to call attention to the sound, but not to convey any additional information.
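As a rough illustration of how the overlay could be parameterized, the sketch below derives the wavy lines' width from the NSI's peak amplitude and their length from its duration, as described above. The scale factors and the color are placeholders assumed for the sketch; the paper does not report the actual values used.

```csharp
using System;

// Illustrative sketch (not the authors' code): geometry of the wavy-line overlay
// drawn over the item producing the NSI while its caption is on screen.
public struct NsiOverlay
{
    public double LineWidthPx;  // proportional to the NSI's peak amplitude
    public double LineLengthPx; // proportional to the NSI's duration
    public string Color;        // single solid, bright color used only for attention

    public static NsiOverlay FromNsi(double peakAmplitude, double durationSeconds)
    {
        const double MaxWidthPx = 12.0;  // width at full amplitude (assumed)
        const double PxPerSecond = 80.0; // horizontal length per second (assumed)

        return new NsiOverlay
        {
            LineWidthPx = Math.Max(1.0, peakAmplitude * MaxWidthPx),
            LineLengthPx = durationSeconds * PxPerSecond,
            Color = "#FFD400" // any single bright color; the exact choice is arbitrary here
        };
    }
}
```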

4.5 System

The hardware for both the tactile and visual-tactile captions consists of a Windows 8 laptop, an mBed NXP 1768 micro-controller and a ROB-08449 vibration motor (tactor). The software is a stand-alone C# program that displays captioned multimedia and simultaneously scans the captions for NSI. When NSI is detected, the program sends a trigger via Bluetooth to the micro-controller. Based on the trigger command, the micro-controller selects and executes the corresponding tactile envelope from its library. The program cannot reproduce the amplitude or frequency due to the tactor's single power setting. At first, we found that the current supplied to the tactor was too low; we then increased the current via a MOSFET to make the vibrations stronger and more easily noticeable. The micro-controller controls the vibration motor so that it can produce different kinds of vibration by varying the frequency and duration of pulses. We developed a specific vibration envelope to mimic the feel of each NSI event. For all vibration envelopes, we preserved the amplitude envelope's temporal fidelity; that is, we carefully mapped the shape of the sound envelope onto the vibration envelope, as shown in Figures 6 and 7.
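A host-side trigger of the kind described above might be sketched as follows, assuming the Bluetooth link is exposed to the C# program as a serial port. The port name, baud rate, and message format are assumptions for the sketch; the paper only states that, when an NSI caption is detected, a trigger identifying the NSI is sent to the micro-controller, which then looks up and plays the matching vibration envelope.

```csharp
using System;
using System.IO.Ports;

// Illustrative host-side sketch (not the authors' code): send a short trigger
// naming the detected NSI over a Bluetooth serial link so the micro-controller
// can select and run the matching vibration envelope from its library.
public sealed class TactorLink : IDisposable
{
    private readonly SerialPort _port;

    // Port name and baud rate are assumptions; a Bluetooth SPP link typically
    // appears to the host as a virtual COM port.
    public TactorLink(string portName = "COM5", int baudRate = 9600)
    {
        _port = new SerialPort(portName, baudRate);
        _port.Open();
    }

    // Called by the caption renderer whenever a bracketed NSI caption appears.
    public void TriggerNsi(string nsiLabel)
    {
        // e.g. "doorbell" -> the micro-controller matches the label against its
        // stored envelopes and drives the tactor through the MOSFET.
        _port.WriteLine(nsiLabel);
    }

    public void Dispose() => _port.Dispose();
}
```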

4.6 Experiment

To get a sense of whether or not users benefited from tactile and/or visual-tactile captions, we conducted a study using a new video that had both kinds of captions. We recruited 21 new DHH participants (15 male, 7 female) from the National Technical Institute for the Deaf at the Rochester Institute of Technology. The average participant age was 19.5 years, and all had some degree of hearing loss at birth. All participants had prior experience with watching captioned movies from birth, reflecting the fact that they had grown up with the mandated universal captions required after the passage of the Americans with Disabilities Act in 1991.

4.6.1 Experiment Setup

We created a movie for this study that featured fellow peers to keep it interesting and relevant. The movie featured all deaf students, with the protagonist remembering that he has homework due, trying to find a computer to print it before it is due in class, and encountering a surprising twist at the end. We divided the movie into four equal clips, each with seven NSI events such as a phone ringing or foot stomps, which DHH people use to call others via vibration instead of sound.

The first clip was set up to provide practice. That clip was divided into three sections: the first section had only captions, the second had tactile-enhanced captions, and the third had visual- and tactile-enhanced captions, as shown in Figure 9. The experiment script then presented the second, third and fourth clips in a balanced and randomized order, each with one of the three kinds of captioning (regular, tactile, and visual-tactile enhanced), so as to avoid bias.

Figure 8: A DHH viewer watching a captioned video with a strapped-on tactor.

Figure 9: A snapshot of visual-tactile enhanced captioned video at a specific moment when there is non-speech information, transmitted through text, a corresponding visual cue, and a temporally faithful vibration.

4.6.2 Study Setup

When participants came in for the study, we informed them that the entire session would be about 30 minutes long. They signed the consent form, and then filled out a demographics questionnaire. They were told that each clip had seven NSI events, and that they should pay attention to the NSI event and the item that triggered the event. The participants strapped on a tactor as shown in Figure 8. They were invited to become familiar with all three kinds of captioning by watching the first clip. Next, they viewed the remaining clips, each with its associated caption type, in randomized order. After viewing each clip, they completed a survey that asked them to 1) rate their satisfaction with the clarity of the captions, 2) recall and narrate the NSI events, and 3) identify and describe the location of the non-speech sound in the scene at the time of the sound event. They were asked open-ended questions to solicit feedback on each kind of captioning, and were compensated for their participation and time.

5. RESULTS

We assessed the means and standard deviations for all questions: Likert ratings, number of accurately described NSI events, and number of accurately described NSI locations and items in the scene. For the Likert ratings, a chi-square test was used to compare the ratings for tactile and visual-tactile captions against captions. In terms of ease of use, there was no significant difference between captions (C = 4.2, σ = .4) and tactile captions (TC = 4.3, σ = .3), χ2 = 18.44, p = .4. However, there was a significant difference between captions (C = 4.2, σ = .4) and visual-tactile captions (VTC = 4.7, σ = .3), χ2 = 10.87, p < .01.

We find that visual-tactile captions yielded an improvement over captions in terms of participant recall of NSI and of the scenes that occurred during the NSI, as shown in Figure 10. For the recall accuracy of NSI events and the associated NSI locations, an independent-samples t-test was used to compare captions with tactile captions, and captions with visual-tactile captions. For accuracy in describing NSI events, tactile captions showed a 21.3% increase over captions, t(40) = 2.98, p < .01, while visual-tactile captions showed a 30.9% increase over captions, t(40) = 3.21, p < .01. In terms of participant accuracy in locating NSI in the corresponding scene, tactile captions yielded a 9.7% improvement over captions, t(40) = 1.47, p < .01, while visual-tactile captions yielded an 84.64% increase over captions, t(40) = 6.42, p < .01.

Figure 10: Left chart: Average NSI events recalled; Right chart: Average NSI locations/items recalled. (Number of correct answers out of seven: captions (baseline) 4.79 events, 3.71 locations/items; tactile captions 5.81 events, 4.07 locations/items; visual-tactile captions 6.27 events, 6.85 locations/items.)

5.1 Participant Feedback

Overall, the participants were very positive about the enhancements to regular captions, especially the visual-tactile enhanced captions. In particular, they liked being able to quickly locate the non-speech sounds. They also liked being able to perceive a temporally faithful rendering of the sound. In the open-ended answers, the participants commented:

It was interesting to see how the tactile feedback was being use to vibrate to warn about some things like the fall. I don't mind wear that to let me know that something happen.

The strength for the tactile feedback is that I can feel the vibration very well. It is pretty strong and understandable. Also, it is like a watch so it can stay on my arm and not feel loose or will fall off of the arm.

6. DISCUSSION

The participants liked the tactile captions more than the captions, and in turn liked the visual-tactile captions more than either the tactile captions or the captions alone.

6.1 Strengths

Participants recalled significantly more NSI when they watched tactile or visual-tactile captions, as compared to watching visual captions alone. These findings are supported by participant comments. For example, one participant noted: "Tactile captions let me feel the doorbell rather than just looking at the description: 'doorbell ringing'".

There was also a significant increase in recall of details with visual-tactile enhanced captions as compared to the tactile enhanced captions. This aligned with the feedback from the students in the initial pilot study. Most participants were able to adapt to both tactile and visual-tactile captions. Out of all of our participants, none of the visual-tactile caption users, and only three of the tactile caption users, actually decreased their score between the baseline and trial conditions. While we had expected some level of variation in this result, this is an additional indicator that our approach is reliably beneficial to users.

6.2 Weaknesses

For regular captions, the viewers commented that they were sometimes distracting because there was not always a visual connection between the captions and the scene details. In other words, the viewers did not have the auditory memory needed to identify or relate to auditory references in the scene. Because many of them lacked experience in understanding and interpreting environmental or non-speech sounds, they occasionally found it confusing to read the descriptions, as it was not always obvious what these relatively abstract descriptions referred to. For tactile captions, the viewers commented that they were sometimes distracting because they could not identify what the tactile feedback was, as they were too busy reading the captions. The caption description did not always help them identify or relate to auditory events. One participant stated: "I like the vibration to get my attention, but at the same time, it was distracting when trying to watch the video." When asked to directly compare tactile captions against visual-tactile captions, the viewers commented that the former were sometimes distracting because there was not always a visual connection between the tactile feedback and the scene details.

6.3 Conclusions

The results suggest that while tactile captions alone can make a significant difference in the visualization and recall of NSI, the open-ended feedback indicated that viewers were sometimes confused about the NSI, presumably because they had no prior experience with it. On the other hand, for most viewers, the visual-tactile captions made a significant difference in the visualization and recall of NSI, and this was supported by feedback indicating that they were able to connect unfamiliar NSI with familiar visuals. Although touch is not as information-rich as vision, we conjecture that the increased recall is due to simultaneous perception in the visual and tactile modalities. Feedback also indicated that some participants preferred stronger vibrations, while others liked weaker ones. This suggests that visual-tactile systems should have adjustable baselines to adapt to consumer preferences. Users noted that the visual cues were very helpful in guiding their attention towards immediately identifying the location and nature of the thing producing the NSI. The variable line width that was proportional to the amplitude may have been too subtle, for none of them commented on its helpfulness. The ability to directly perceive salient properties of auditory information via visual-tactile captions adds a new dimension to accessibility in watching multimedia and enables viewers to gain more access to multimodal information.

7. FUTURE WORK

There is great potential in maximizing the enjoyment and recall of visual-tactile captions. We will explore new methods to enhance traditional captions through visual and tactile feedback for DHH consumers, including the issue of interference between simultaneous visual and tactile input, with the goal of maximizing information perception.

Another important limitation of captions is that they strip out volume information, which then becomes imperceptible to the caption user. Every sound becomes equally "loud" when it is transferred to the caption track, and the distinction between background and foreground blurs. We will explore approaches to pass on speech loudness to viewers, possibly by driving the tactor through programmable MOSFETs that can adjust vibration amplitude and thereby provide more dimensions for tactile feedback. We plan to investigate which vibration properties work best in terms of enjoyment or recall, alone or in combination with visual captions. We plan to conduct evaluation studies in which viewers will be asked to describe what they saw and felt, and we will rate these descriptions for accuracy in relation to the auditory event, such as turn-signal versus seat-belt beeps.

The current work has laid a foundation for taking the idea behind the research to the next level. The aim is to explore more patterns and intensities of vibration in order to build a continuous vibration system rather than the current discrete vibration system. The effectiveness of visual-tactile captions suggests that presenting accessible multimodal information can increase a DHH viewer's viewing and recall ability. This has implications for other users as well, including blind and low vision (BLV) people, who face a similar problem in that multimodal information is all conveyed over audio, i.e., the audio description plus the movie dialog. The problems are not quite identical, and future studies would tackle different issues.

8. ACKNOWLEDGMENTS

This work is supported by a grant from the National Science Foundation, IIS-1218056.

9. REFERENCES

[1] I. Apostolopoulos, N. Fallah, E. Folmer, and K. E. Bekris. Integrated online localization and navigation for people with visual impairments using smart phones. ACM Transactions on Interactive Intelligent Systems, 3(4):1–28, Jan. 2014.
[2] A. Baijal, J. Kim, C. Branje, F. Russo, and D. I. Fels. Composing vibrotactile music: A multi-sensory experience with the emoti-chair. In 2012 IEEE Haptics Symposium (HAPTICS), pages 509–515. IEEE, Mar. 2012.
[3] C. Branje, M. Karam, D. Fels, and F. Russo. Enhancing entertainment through a multimodal chair interface. In 2009 IEEE Toronto International Conference Science and Technology for Humanity (TIC-STH), pages 636–641. IEEE, Sept. 2009.
[4] S. Brewster and L. M. Brown. Tactons: Structured tactile messages for non-visual information display. In Proceedings of the Fifth Conference on Australasian User Interface (AUIC '04), pages 15–23, 2004.
[5] S. Brewster, F. Chohan, and L. Brown. Tactile feedback for mobile interactions. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (CHI '07), page 159, New York, NY, USA, 2007. ACM Press.
[6] A. C. Cavender, J. P. Bigham, and R. E. Ladner. ClassInFocus. In Proceedings of the 11th International ACM SIGACCESS Conference on Computers and Accessibility (ASSETS '09), pages 67–74, New York, NY, USA, 2009. ACM Press.
[7] W. Fujisaki and S. Nishida. Audio-tactile superiority over visuo-tactile and audio-visual combinations in the temporal resolution of synchrony perception. Experimental Brain Research, 198(2-3):245–259, Sept. 2009.
[8] C. Jensema. Closed-captioned television presentation speed and vocabulary. American Annals of the Deaf, 141(4):284–292, 1996.
[9] C. J. Jensema, R. S. Danturthi, and R. Burch. Time spent viewing captions on television programs. American Annals of the Deaf, 145(5):464–468, Dec. 2000.
[10] L. A. Jones and N. B. Sarter. Tactile displays: Guidance for their design and application. Human Factors: The Journal of the Human Factors and Ergonomics Society, 50(1):90–111, Feb. 2008.
[11] W. L. Khoo, E. L. Seidel, and Z. Zhigang. Designing a virtual environment to evaluate multimodal sensors for assisting the visually impaired. Volume 7383 of Lecture Notes in Computer Science. Springer Berlin Heidelberg, Berlin, Heidelberg, 2012.
[12] R. S. Kushalnagar, P. Kushalnagar, and G. Manganelli. Collaborative gaze cues for deaf students. In Dual Eye Tracking Workshop at the Computer Supported Cooperative Work and Social Computing Conference, Seattle, WA, Mar. 2012. ACM Press.
[13] R. S. Kushalnagar, W. S. Lasecki, and J. P. Bigham. Captions versus transcripts for online video content. In Proceedings of the 10th International Cross-Disciplinary Conference on Web Accessibility (W4A), pages 1–4, Rio de Janeiro, Brazil, May 2013. ACM Press.
[14] R. S. Kushalnagar, W. S. Lasecki, and J. P. Bigham. Accessibility evaluation of classroom captions. ACM Transactions on Accessible Computing, 5(3):1–24, Jan. 2014.
[15] E. S. Ladner. Silent talkies. American Annals of the Deaf, 76:323–325, 1931.
[16] M. Marschark, G. Leigh, P. Sapere, D. Burnham, C. Convertino, M. Stinson, H. Knoors, M. P. J. Vervloed, and W. Noble. Benefits of sign language interpreting and text alternatives for deaf students' classroom learning. Journal of Deaf Studies and Deaf Education, 11(4):421–437, Jan. 2006.
[17] M. Mills. On disability and cybernetics: Helen Keller, Norbert Wiener, and the hearing glove. differences, 22(2-3):74–111, Dec. 2011.
[18] C. E. Sherrick, R. W. Cholewiak, and A. A. Collins. The localization of low- and high-frequency vibrotactile stimuli. The Journal of the Acoustical Society of America, 88(1):169–179, July 1990.
[19] S. Sundaram and S. Narayanan. Classification of sound clips by two schemes: Using onomatopoeia and semantic labels. In 2008 IEEE International Conference on Multimedia and Expo, pages 1341–1344. IEEE, June 2008.