Caveat: This document was written in November 2005 and found again in July 2015. The authorship is unclear; it could be Heiko Rudolph and/or Suresh Matta, but no such information was on the original 2005 document, and the authorship could be mixed, unknown, or contain unacknowledged text, because the document was never finished. It is a draft fragment, presented as is; no other claims are made.
____________________________
A conceptual summary by Heiko Rudolph for translating three-dimensional space to sound: the idea was to restrict the situations to a number of common spatial templates encountered in daily life. Heiko presented these concepts; Suresh fleshed out the ideas with research, citations, etc. Heiko Rudolph and Suresh Matta (PhD candidate) worked in the area of representing space in sound for blind people, in particular how to represent frequently encountered, well defined configurations in sound, such as: object in front, object at an angle, object on the ground, object coming from the top/ceiling, hole in front (or at any angle), incline down or up, etc., all conveyed through easily understood auditory methods. Angular information would also be supplied.
Often found, well defined configurations (standard templates):

    Standard template/configuration                                         Distance              Angle
    Free space, no objects, flat ground, 360 degrees                        N/A                   N/A
    Object on ground, e.g. a 1 m x 1 m x 1 m box                            Close, Middle, Far    Clock positions
    Object coming from the top (beware of collision with top of body)       Close, Middle, Far    Clock positions
    Path goes down: depression/hole/descent                                 Close, Middle, Far    Clock positions
    Path goes up: hill/stairs/ascent                                        Close, Middle, Far    Clock positions

Angle: use the positions of a standard clock face, i.e. directly in front = 12 o'clock, right angle = 3 o'clock, etc.

Definitions:
    Object  = something that could cause an impediment when walking; loosely defined at this point.
    Close   = the next step requires adjustment to the new condition.
    Middle  = at least 1 to 3 normal steps away.
    Far     = at least 4 to 10 normal steps away.
Two methods to translate the above into sound were considered:
1) Convey space using the inherent properties of sound, as sound is modified by these configurations; or
2) Use a set of easily and intuitively understood audio markers to indicate, in a kind of shorthand code, what is going on spatially. Example: a series of pitched beeps, with left/right ear differentiation, to indicate a depression/hole/stairwell going down directly in front of the person.
Only the space 90 degrees to the left and right of the person was considered, i.e. the 9, 10, 11, 12, 1, 2 and 3 o'clock positions.
The paper attempts to express something of the conceptual approach outlined above. Below is the 2005 document, converted to docx, then to PDF. A work in progress, never finished.
***************************************************************************
Psychophysical Measurements for Translation of Visual Data to Sounds

ABSTRACT
This paper focuses on the representation of visual data as sound and identifies appropriate ways of mapping data into sound cues. Sounds are employed in navigational aids for recognizing objects or obstacles in the environment. This paper surveys different sonic representations and analyzes them with respect to three important tasks: how object location is detected, how potholes and kerbs are avoided, and how stairs are identified. Several experiments are conducted to investigate the limits of representing visual data with sound and to compare several ways of representing the values.

1 INTRODUCTION
Currently many researchers employ sound as a tool for recognizing objects in the development of Electronic Travel Aids. Among them the most significant is "The vOICe" developed by Meijer, which scans and digitizes an image captured by a video camera. The image is divided into vertical strips. Pixels accounting for recognized objects are sonified using simple tones, with higher pitches associated with pixels closer to the top of the image. This enables the user to recognize the elevation of objects in a scene via an association between height and tone frequency. Finally, the horizontal position of objects is resolved using stereo panning, and the intensity of the light captured by the video camera is translated into the loudness of the respective tones.
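To make the vOICe-style mapping just described concrete, here is a minimal sketch (not Meijer's actual implementation): each image column becomes a time slice, pixel row maps to tone frequency, horizontal position maps to stereo pan, and pixel brightness maps to loudness. The scan rate, frequency range, and Python/NumPy realisation are illustrative assumptions only.

```python
import numpy as np

SAMPLE_RATE = 22050

def sonify_image(image, column_dur=0.05, f_lo=200.0, f_hi=5000.0):
    """Sketch of a vOICe-style scan: rows -> pitch, columns -> time/pan,
    brightness -> loudness. `image` is a 2-D array with values in [0, 1],
    row 0 at the top of the picture."""
    n_rows, n_cols = image.shape
    n = int(column_dur * SAMPLE_RATE)
    t = np.arange(n) / SAMPLE_RATE
    # Rows closer to the image top get higher frequencies.
    freqs = np.linspace(f_hi, f_lo, n_rows)
    out = np.zeros((n_cols * n, 2))
    for c in range(n_cols):
        # Sum one sinusoid per row, weighted by that pixel's brightness.
        tones = (image[:, c, None] * np.sin(2 * np.pi * freqs[:, None] * t)).sum(axis=0)
        pan = c / max(n_cols - 1, 1)              # 0 = far left, 1 = far right
        out[c * n:(c + 1) * n, 0] = (1.0 - pan) * tones
        out[c * n:(c + 1) * n, 1] = pan * tones
    peak = np.max(np.abs(out))
    return out / peak if peak > 0 else out        # stereo signal, normalized
```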
To create auditory displays for the blind, three basic questions must be answered. First, which auditory display dimensions best represent a given environment: is it best to use frequency, tempo, or some other auditory parameter to represent the data dimension? Second, what is the preferred polarity of the data-to-display mapping? If volume is used to represent brightness, one might postulate that increasing amplitude would represent increasing brightness. However, when representing the dimension of size, increasing frequency might best represent decreasing size, since high-frequency sounds tend to come from small objects. And third, once we establish a mapping and polarity, what is the scaling factor for that data and display pair? That is, exactly how much change in volume must we use to represent a given change in brightness? In this paper we answer these questions and then begin to validate the resulting mapping solutions in a representative practical listening task.
Many sonification applications have employed only frequency, amplitude, and time in their mappings; the resulting solutions may not suit the user's requirements and preferences, which can discourage use or cause problems. Further, most mapping schemes have not taken into account the actual type of data being displayed, or the specific listening experience of the listeners. Psychophysical scaling functions specify the mapping between acoustic parameters and the subjective perception of a stimulus. For example, if the physical amplitude is doubled, what is the resulting change in perceived loudness? Here we adapt the psychophysical scaling paradigm known as magnitude estimation to examine how changes in the physical sound attributes (e.g., frequency) result in different estimations of the data that the sound is supposed to represent. That is, if distance is mapped to the amplitude of the sound, what change in distance will the listener report when the amplitude is doubled? Is it simply the same as the change they report for pitch when frequency is doubled? Is it the same change they would report for pressure, or velocity, or the value of the dollar when the frequency is doubled? The question thus simplifies to whether it matters what data dimension the sound is supposed to represent, and if so, how.
Most navigational tools are exclusively visual, failing to exploit the advantages of the human auditory system. These auditory presentations have no specific rules, and no research has addressed how to create optimal displays. Three key research questions are: (1) What is the best sound parameter to use to represent a given situation in the surrounding environment? (2) Should an increase in the sound dimension (e.g., rising frequency) represent an increase or a decrease in the data dimension? (3) How much change in the sound dimension should represent a given change in the data dimension?
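As a worked illustration of question (3), here is a sketch based on Stevens' power law, S = k * I^a: given an assumed exponent, it computes how much the physical stimulus must change to convey a desired change in the perceived (or represented) value. The exponent used below is a commonly quoted textbook value, not a result measured in this study.

```python
def stimulus_ratio_for_percept_ratio(percept_ratio, exponent):
    """Stevens' power law: perceived magnitude S = k * I**exponent.

    Returns the physical-stimulus ratio needed to produce a given ratio
    in the perceived (or reported data) magnitude.
    """
    return percept_ratio ** (1.0 / exponent)

# Illustrative exponent only (assumed, not measured here): loudness versus
# sound amplitude is often quoted near 0.6, so representing a doubling of
# brightness by a doubling of perceived loudness needs roughly
stimulus_ratio_for_percept_ratio(2.0, 0.6)   # about 3.2x the amplitude
```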
Reasons for focusing on sound:
- Sound is currently under-utilized in interaction design.
- Vision is overloaded, while our auditory senses are seldom engaged.
- In the everyday world we are used to hearing a great deal.
- Adding sound to existing, optimized visual interfaces does not add much to usability.
- Sound has an extremely high temporal resolution (about ten times faster than vision).
1. Free Space in all directions

[Fig 1: Free space in all directions. Front and top views of a 10 m x 10 m room.]
Human localization of a sound source is three-dimensional. The egocentric location of a source is specified by two parameters indicating its direction, azimuth (lateral direction with respect to the facing direction of the head) and elevation (direction with respect to the ear-level plane), and a third parameter indicating its distance. In free space the HRTF parameters azimuth, elevation, and distance are all set to zero, as there are no objects near the cameras, so the scene is represented to the listener as white noise or a soothing sound. In our system the intensity of the sound decreases with the square of the distance from listener to source; because there is no object in the vicinity of the cameras, the intensity falls to zero. With the ITD and IID of the sound both equal to zero, the sound is heard at the centre of the head, since there is no interaural difference in intensity or time.
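A minimal sketch of the free-space case as described above, using a hypothetical render_free_space helper: inverse-square attenuation for intensity and zero ITD/IID, so both ears receive an identical (centred) signal.

```python
import numpy as np

SAMPLE_RATE = 44100

def render_free_space(duration_s=1.0, distance_m=None):
    """Hypothetical sketch: render the 'free space' case as centred noise.

    With no object in view, ITD and IID are zero, so both ears receive the
    same signal; intensity follows an inverse-square law, so it fades to
    silence as distance grows (or when there is no object at all).
    """
    n = int(duration_s * SAMPLE_RATE)
    noise = np.random.uniform(-1.0, 1.0, n)          # soothing broadband noise
    gain = 0.0 if distance_m is None else 1.0 / distance_m ** 2
    mono = gain * noise
    return np.column_stack([mono, mono])             # identical L/R -> centred image
```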
2. Obstacle in front of the subject

[Figure: front and top views of an obstacle in front of the subject; labelled dimensions 30 cm and 2 m.]
When the object is in front of the user, the intensity of the sound is determined by the distance from the object to the user, which is measured using a depth-measurement technique (triangulation). Azimuth and elevation are determined from the object's coordinates in the images taken by the cameras. All locations are given as a function of azimuth and elevation. Azimuth is defined as 0 degrees for an object directly in front of the subject in the horizontal plane; similarly, elevation is defined as 0 degrees for the direct centre in the vertical plane. At 0 degrees azimuth, the sound reaching the eardrums has an intensity determined by the distance between the subject and the object. Both timbre and spatial cues depend on the position of the sound source. There are also changes in interaural phase as a function of frequency that are caused by the HRTF. In this position the left-ear and right-ear waveforms are essentially identical, and the arrival time changes very little with elevation angle. Interaural time delay (ITD) and interaural intensity difference (IID) are the primary cues that allow one to locate a sound in the azimuthal plane [30].
In our case the HRTF depends on the direction of the object and its distance from the user. Here the azimuth is restricted to the interval from -45 to +45 degrees, while the elevation ranges over the interval from -90 to +90 degrees; for simplicity, we restricted all of our measurements on human subjects to the frontal half-space. To synthesize the visual scene with the object at location (θ, φ), the signal is filtered with the corresponding HRTF H(θ, φ) and the result is rendered binaurally through headphones. Additionally, the HRTF must be interpolated between discrete measurement positions to avoid audible jumps in the sound, and appropriate reverberation must be mixed into the rendered signal to create good externalization. The HRTF data are a set of experimental measurements of the impulse response at both the left and right ear for a stimulus at a certain point in space with respect to the head; the data used here come from the MIT Media Lab KEMAR measurements (http://sound.media.mit.edu/KEMAR.html). For the purpose of this experiment only data for 36 positions were used; this set covers 180 degrees of rotation at 5-degree intervals at zero elevation. Matlab was used to read the HRTF data, convolve it with the appropriate part of the source signal, and simulate the sound at each of the listener's ears. All the HRTF data were read into a single matrix using
the wavread function. Here only 180 degrees of rotation is simulated. The complete impulse-response matrix was then convolved with the random noise signal segment by segment; the noise signal was broken into segments corresponding to the 36 HRTFs used. The result of the convolution is the sound heard by the listener. Plots of the amplitude of the signal received at each ear were generated in Matlab; these show the waveforms heard at the left and the right ear.
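The convolution pipeline just described could be sketched as follows. This is a Python/NumPy stand-in for the Matlab code (scipy's fftconvolve in place of conv, scipy.io.wavfile in place of wavread); the KEMAR file naming and the segment length are assumptions.

```python
import numpy as np
from scipy.io import wavfile
from scipy.signal import fftconvolve

# Hypothetical file layout: one stereo KEMAR HRIR WAV per measured azimuth
# (0..175 degrees in 5-degree steps at zero elevation -> 36 positions).
AZIMUTHS = range(0, 180, 5)

def render_rotating_noise(hrir_paths, fs=44100, seconds_per_position=0.25):
    """Convolve successive noise segments with successive HRIRs, as in the text."""
    seg_len = int(fs * seconds_per_position)
    left, right = [], []
    for path in hrir_paths:                       # one HRIR pair per azimuth
        _, hrir = wavfile.read(path)              # shape (taps, 2): left, right
        hrir = hrir.astype(np.float64)
        noise = np.random.uniform(-1.0, 1.0, seg_len)
        left.append(fftconvolve(noise, hrir[:, 0]))
        right.append(fftconvolve(noise, hrir[:, 1]))
    out = np.column_stack([np.concatenate(left), np.concatenate(right)])
    return out / np.max(np.abs(out))              # normalize to avoid clipping

# Example: paths = [f"KEMAR/H0e{az:03d}a.wav" for az in AZIMUTHS]  (assumed naming)
```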
3. Obstacle to the right of the subject
We assume that the object is restricted to the horizontal plane and to the right side of the user. The Head Related Transfer Function (HRTF) models sound perception with two ears and is used to determine the positions of objects in space. Incorporating HRTFs into our system lets the user perceive objects as they would in a natural environment. Here we considered the HRTFs of an average human listener, which are determined experimentally. As given above, a particular HRTF is specified by four parameters: azimuth (θ), elevation (φ), frequency (f) and distance (d). When the object appears on the right side of the image, the azimuth angle lies in the range 0 to +45 degrees, and the elevation depends on the object's position in the image: 0 to +90 degrees if it is above the centre, and 0 to -90 degrees otherwise. In this simplified scheme the left-ear HRTF is set to zero and only the right-ear HRTF carries a value, so the sound comes from the right channel and the listener perceives the object as being to the right.
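A minimal sketch, under the assumptions above, of mapping an object's pixel position in the camera image to azimuth and elevation; the image dimensions, the linear mapping, and the pixel_to_angles helper itself are illustrative assumptions.

```python
def pixel_to_angles(x, y, width, height):
    """Map an object's pixel position to (azimuth, elevation) in degrees.

    Assumed linear mapping: image centre -> (0, 0); right edge -> +45 azimuth,
    left edge -> -45; top edge -> +90 elevation, bottom edge -> -90.
    """
    azimuth = 45.0 * (2.0 * x / width - 1.0)      # -45 .. +45 degrees
    elevation = 90.0 * (1.0 - 2.0 * y / height)   # +90 at top, -90 at bottom
    return azimuth, elevation

# An object in the right half of a 640x480 image, above centre:
# pixel_to_angles(560, 100, 640, 480) -> (+33.75, +52.5)
```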
[Figure: front and top views of an obstacle to the right of the subject; labelled dimensions 30 cm and 2 m.]
As the object approaches the user, the localization cues change drastically. Interaural intensity differences (IIDs) increase dramatically as distance decreases, while interaural time delays (ITDs) remain roughly constant. This systematic variation of the HRTF in the near-field region may allow listeners to make absolute distance estimates for nearby objects, and provides a means of significantly improving the capabilities of audio displays. As the user approaches the object, the ratio of the distances from the source to the near and far ears increases and the effects of head shadowing are amplified, causing the interaural intensity difference to increase. The spectral shaping caused by the head and pinnae may also change as the source enters the acoustic near field and the curvature of the sound field increases. The interaural delay, which results from the absolute difference in path length from the source to the two ears, remains approximately constant as distance decreases. Assume that the source is located at an azimuth and elevation of (θ, φ) and that the N closest available HRTF measurements are at (θ_i, φ_i).
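One common way to realise the interpolation set up above is to take a weighted average of the N closest measured HRIRs; the inverse-distance weighting below is an assumption for illustration, not necessarily the scheme the authors had in mind.

```python
import numpy as np

def interpolate_hrir(theta, phi, measured):
    """Weighted average of the N closest measured HRIRs.

    `measured` is a list of ((theta_i, phi_i), hrir_i) pairs, where each
    hrir_i is an array of shape (taps, 2) for the left and right ears.
    Weights fall off with angular distance (inverse-distance weighting).
    """
    angles = np.array([a for a, _ in measured], dtype=float)      # (N, 2)
    hrirs = np.stack([h for _, h in measured])                    # (N, taps, 2)
    d = np.linalg.norm(angles - np.array([theta, phi]), axis=1)   # angular distances
    if np.any(d < 1e-9):                                          # exact match found
        return hrirs[int(np.argmin(d))]
    w = 1.0 / d
    w /= w.sum()
    return np.tensordot(w, hrirs, axes=1)                         # (taps, 2)
```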
4. Obstacle to the left of the subject
The human auditory system employs ITD, IID, and other spectral cues, together captured by the HRTF, to identify spatial location, so that listeners can differentiate the locations of sounds in the free field. Virtual audio displays of this kind can help blind users to identify objects in the free field. The HRTFs corresponding to objects at different spatial locations can be described by different equalizer settings. HRTFs at the left and right ears are developed for several azimuths (locations along the left-right direction) and elevations (locations along the up-down direction). Thus, a spatial location is designated by an ordered pair of angles (azimuth θ, elevation φ), where (0, 0) corresponds to the location directly in front of the listener. Similarly, (-90, 0) and (+90, 0) correspond to locations directly opposite the left and right ears, respectively, while (0, -45) and (0, +45) correspond to locations in front of and below, and in front of and above, the listener, respectively.
[Figure: front and top views of an obstacle to the left of the subject; labelled dimensions 30 cm and 2 m.]
We assume that the object is restricted to the horizontal plane and to the left side of the user. When the object appears on the left side of the image, the azimuth angle lies in the range 0 to -45 degrees, and the elevation depends on the object's position in the image: 0 to +90 degrees if it is above the centre, and 0 to -90 degrees otherwise. In this simplified scheme the right-ear HRTF is set to zero and only the left-ear HRTF carries a value, so the sound comes from the left channel and the listener perceives the object as being to the left. The HRTFs for near sources have more energy than those for far sources.
5. Obstacles just above the user's head

[Figure: front and top views of a video screen mounted overhead; labelled dimensions 1 m, 1 m and 3 m.]
We assume that an object is just above the user's head, for example an overhead projector installed on a low ceiling. In this case the amplitude is determined from the distance of the obstacle to the camera, so the perceived volume varies dramatically with object position. The frequency varies with image brightness. Azimuth and elevation are determined from the location of the object in the image.
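A minimal sketch of the mapping just described, with a hypothetical overhead_warning_tone helper; the inverse-square amplitude law and the 200-2000 Hz brightness-to-frequency range are assumptions.

```python
import numpy as np

SAMPLE_RATE = 44100

def overhead_warning_tone(distance_m, brightness, duration_s=0.3):
    """Sketch: amplitude from obstacle distance, frequency from image brightness.

    Assumptions: inverse-square amplitude law (clamped to 1.0) and a linear
    map from brightness in [0, 1] to a 200-2000 Hz tone.
    """
    amplitude = min(1.0, 1.0 / max(distance_m, 0.1) ** 2)
    frequency = 200.0 + 1800.0 * float(np.clip(brightness, 0.0, 1.0))
    t = np.arange(int(duration_s * SAMPLE_RATE)) / SAMPLE_RATE
    return amplitude * np.sin(2.0 * np.pi * frequency * t)

# e.g. a bright projector 1 m overhead: overhead_warning_tone(1.0, 0.9)
```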
6. A wall to the left at a distance of 1 metre and nothing to the right
[Figure: front and top views of a wall 1 m to the left of the subject.]
7. A wall to the right at a distance of 1 metre and nothing to the left
[Figure: front and top views of a wall 1 m to the right of the subject.]
8. A wall 1 metre to the right, a wall 1 metre to the left, and a narrow space in front
[Figure: front and top views of walls 1 m to the left and right of the subject, with a narrow space in front.]
9. A wall to the left at a distance of 1 metre and a wall to the right at a distance of 1 metre
[Figure: front and top views of walls 1 m to the left and 1 m to the right of the subject.]
Experimental Procedure
Experiments were divided into 10 distinct sections:
Each section presented an environment (for example an object and stairs, or potholes on the road), asked the subject a question, and recorded the requested response, e.g.:

    Question: Is the environment empty?               Requested response: say "yes" if the environment is empty, otherwise "no".
    Question: Can you determine where the wall is?    Requested response: answer "left" or "right".
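A minimal sketch of how such a question-and-response session might be scripted; the trial list, the play_environment callback, and the CSV logging format are all assumptions, not the procedure actually used.

```python
import csv

# Hypothetical trial definitions: (environment label, question, accepted answers)
TRIALS = [
    ("free space", "Is the environment empty?", {"yes", "no"}),
    ("wall to one side", "Can you determine where the wall is?", {"left", "right"}),
    ("object and stairs", "Is the environment empty?", {"yes", "no"}),
    ("potholes on the road", "Is the environment empty?", {"yes", "no"}),
]

def run_session(play_environment, log_path="responses.csv"):
    """Play each sonified environment, ask its question, and log the reply.

    `play_environment(label)` is an assumed callback that renders and plays
    the sound for the named environment (e.g. via the sketches above).
    """
    with open(log_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["environment", "question", "response"])
        for label, question, valid in TRIALS:
            play_environment(label)
            response = input(f"{question} ({'/'.join(sorted(valid))}): ").strip().lower()
            writer.writerow([label, question, response])
```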
REFERENCES
1. D. R. Begault, 3-D Sound: For Virtual Reality and Multimedia. MA: AP Professional Publishers, pp. 293, (1994).
2. M. Capp and P. Picton, "The Optophone: an Electronic Blind Aid," The Engineering Science and Education Journal, (2000).
3. C. I. Cheng and G. H. Wakefield, "Moving Sound Source Synthesis for Binaural Electroacoustic Music Using Interpolated Head-Related Transfer Functions (HRTFs)," Computer Music Journal, vol. 25, pp. 57-80, (2001).
4. N. Efford, Digital Image Processing: A Practical Introduction Using Java. USA: Pearson Education Limited, (2000).
5. W. G. Gardner, 3D Audio and Acoustic Environment Modeling. Wave Arts Inc., (1999).
6. W. W. Gaver, "Auditory Interfaces," in Handbook of Human-Computer Interaction. Amsterdam, The Netherlands: Elsevier Science B.V., pp. 1003-1042, (1997).
7. W. L. Gulick, Hearing: Physiology and Psychophysics. London: Oxford University Press, (1971).
8. E. Gunther, G. Davenport and S. O'Modhrain, "Cutaneous Grooves: Composing for the Sense of Touch," in Proceedings of the Conference on New Instruments for Musical Expression, vol. 1, Dublin, Ireland, pp. 6, (2002).
9. R. L. Jenison, "On Acoustic Information for Motion," Ecological Psychology, vol. 9, pp. 131-151, (1997).
10. K. van den Doel, "Physically-Based Models for Liquid Sounds," in Proceedings of the International Conference on Auditory Display, (2004).
11. P. B. L. Meijer, "An Experimental System for Auditory Image Representations," IEEE Transactions on Biomedical Engineering, vol. 39, pp. 112-121, (1992).
12. D. Rocchesso, Introduction to Sound Processing, (2003).
13. B. G. Shinn-Cunningham, "Distance Cues for Virtual Auditory Space," in Proceedings of the First IEEE Pacific-Rim Conference on Multimedia, Sydney, Australia, (2000).
14. B. Shinn-Cunningham, "Learning Reverberation: Considerations for Spatial Auditory Displays," in Proceedings of the 2000 International Conference on Auditory Display, Atlanta, GA, (2000).
15. P. W. Wong and S. Noyes, "Space-Frequency Localized Image Compression," IEEE Transactions on Image Processing, vol. 3, pp. 302-307, (1994).
16. D. H. Ballard and C. M. Brown, Computer Vision. Prentice-Hall, Inc., New Jersey, (1982).
17. M. Boshra and H. Zhang, "Use of Tactile Sensors in Enhancing the Efficiency of Vision-Based Object Localization," in Proceedings of the IEEE International Conference on Multisensor Fusion and Integration for Intelligent Systems, Las Vegas, (1994).
18. S. Brewster and L. M. Brown, "Tactons: Structured Tactile Messages for Non-Visual Information Display," in Proceedings of the Fifth Australasian User Interface Conference, Australian Computer Society, Inc., Dunedin, (2004).
19. H. Castle and T. Dobbins, "Tactile Display Technology," Technology and Innovation, (2004).
20. P. K. Edman, Tactile Graphics. American Foundation for the Blind, New York, (1992).
21. E. Gunther and S. O'Modhrain, "Cutaneous Grooves: Composing for the Sense of Touch," Journal of New Music Research, (2003).
22. A. Heyes, "A Polaroid Ultrasonic Travel Aid for the Blind," Journal of Visual Impairment and Blindness, vol. 76, pp. 199-201, (1982).
23. D. Hoffman, Visual Intelligence. W. W. Norton, New York, (1998).
24. D. A. Kontarinis and R. D. Howe, "Tactile Display of Vibratory Information in Teleoperation and Virtual Environments," Presence, vol. 4, pp. 387-402, (1995).
25. S. Landau and L. Wells, "Merging Tactile Sensory Input and Audio Data by Means of the Talking Tactile Tablet," in EuroHaptics, Dublin, Ireland, (2003).
26. J. M. Loomis and S. J. Lederman, "Tactual Perception," in Handbook of Perception and Human Performance, K. R. Boff, L. Kaufman and J. P. Thomas, Eds., John Wiley and Sons, Inc., (1986).
27. P. Parente and G. Bishop, "BATS: The Blind Audio Tactile Mapping System," in ACM Southeast Conference, Savannah, Georgia, (2003).
28. C. Sjöström, "Designing Haptic Computer Interfaces for Blind People," in Proceedings of the Sixth IEEE International Symposium on Signal Processing and its Applications, Kuala Lumpur, Malaysia, (2001).
29. M. Sonka, V. Hlavac and R. Boyle, Image Processing, Analysis and Machine Vision. Chapman and Hall Computing, (1993).
30. T. D. Rossing, P. Wheeler and R. Moore, The Science of Sound, 3rd ed. San Francisco, California: Addison Wesley, (2002).