A Stereo-vision System for the Visually Impaired
John Zelek, Richard Audette, Jocelyn Balthazaar, Craig Dunk
School of Engineering, University of Guelph, Guelph, ON, Canada, N1G 2W1
Abstract
Navigation is one significant barrier for individuals with visual impairments. In response to this barrier, we have investigated a wearable stereo-vision system and developed a prototype. Preliminary investigations have demonstrated the viability of this device for navigation through simple hallway structures. The prototype is inexpensive and consists of two USB cameras, a "virtual touch" feedback system and a wearable computer (an inexpensive laptop or embedded platform). The feedback system relays visual information via tactile feedback through the user's fingers. Experimentation has shown the feasibility of this approach, and we soon plan to experiment with the intended demographic for such a device. Depth measures were computed using a pixel-to-pixel correspondence method that makes use of an epipolar geometry constraint. Real-time operation was achieved by limiting the field of view to the effective field of view deemed necessary for navigation. Experimentation was conducted with a non-calibrated unit. The depth measures were crude; however, calculations reveal that this inadequacy may be overcome by making allowances for epipolar uncertainty in the depth-from-stereo algorithm. This is a result of a better assessment of the requirements of the algorithm in the context of the entire system. We used an unrefined standard algorithm and, having constructed a prototype, we are now better equipped to iterate on our design, in particular the design of an appropriate depth-from-stereo algorithm.
1. Introduction
Sight and hearing, particularly sight, are the senses people make most use of in everyday life. Developing devices for individuals who are challenged in one of these senses is a multi-disciplinary effort. A beneficial aid for a visually challenged person is one that facilitates independent mobility. This paper presents preliminary work in the development of a wearable device that uses inexpensive USB cameras, a simple tactile glove and a lightweight computing platform to provide obstacle information to a visually challenged person. Our interest with respect to stereo vision stems from the observation that final system needs, constraints and requirements are rarely reflected in the original design of depth-from-stereo algorithms. However, we are only now at the stage in the project where we can make appropriate adjustments to the depth-from-stereo algorithm we used.

2. Background
The work in vision substitution (a brief review is presented in [1]) has focused on two main issues: (1) reading and writing; and (2) obstacle detection and avoidance. The most well-known device for reading is the raised-dot code developed by Louis Braille in the 19th century. Another invention, the Optacon, developed by Linvill and Bliss in the 1960s, was a tactile 6-by-24 matrix of vibrating pins corresponding to the brightness pattern seen by a camera (it was discontinued as a product in 1996 by the company TeleSensory). There have been many initiatives to develop text-to-speech reading machines, especially by major computer companies such as Apple and IBM; the original reading machine was introduced in 1975 by Raymond Kurzweil. The role of obstacle avoidance is to present spatial information about the immediate environment in order to improve orientation, localization and mobility. Electronic devices developed for this purpose are typically known as electronic travel aids (ETAs) or blind mobility aids. The two oldest aids for this purpose are the walking cane and the guide dog. The walking cane is an effective mechanical device that requires certain skills of the user to interpret the acoustical reflections that continually result from tapping. The cane's range is only a few feet (limited by the person's reach and the length of the cane), and some people find the cane difficult to master or spend significant amounts of time in the learning phase [2]. The guide dog alleviates some of the limitations of the cane, but little information regarding orientation and navigation is conveyed to the blind traveller; in addition, dogs require constant care and extensive training for both the dog and the person [3]. Early electronic devices relied on an acoustical sensor (i.e., sonar) or a laser light [1]. Unfortunately, power consumption is an important consideration, and laser-based devices typically require heavy battery packs. Sonar is problematic due to incorrect interpretations when presented with multiple reflections (e.g., at corners). Environmental sensing via a camera is attractive due to the wealth of information available through this sense, its closeness in function to the human eye, its typically low power consumption and the low cost that recent technology has made possible. In terms of feedback, the typical mechanisms are auditory [1] and tactile [4]. The Tactile Vision Substitution System (TVSS) [5], developed in the early 1970s, mapped images from a video camera to a vibrating tactile belt worn on the abdomen. Other, invasive types of substitution include an approach in which an array of electrodes is placed in direct contact with the visual cortex [6],[7].
Recently, two devices were developed that evolved from research in mobile robotics: the NavBelt [8] and the GuideCane [9]. The NavBelt provides acoustical feedback from an array of sonar sensors mounted on a belt around the abdomen. The array of sensors either provides information in a guidance mode (i.e., actively guiding the user around obstacles in pursuit of a target) or in an image mode (i.e., presenting the user with an acoustic or tactile image). The GuideCane is a cane attached to a small mobile robot platform with an array of ultrasonic sensors, an odometer, a compass and gyroscope sensors. The robot steers around obstacles detected by the sensors, and the user receives information via the current orientation of the cane. Unfortunately, the robot is a wheeled platform and is therefore restricted to travel along relatively flat surfaces. A problem with the NavBelt is the complexity of learning the patterns of the acoustical feedback, along with the typical problems of multiple reflections associated with sonar sensors.

A recent development is the further enhancement of a device that converts depth information to an auditory depth representation [1] (see the referenced web page for updated information). Rather than using sonar information as input, a camera is used as the input source. The image intensities are converted to sounds in which frequency and pitch represent different bits of information. With a single camera, however, image intensities do not typically correspond to the depth information that is necessary for navigation. Subsequently, an anaglyphic video input has been used (a red filter in front of the left camera and a green filter in front of the right camera) in which the two video images are superimposed on top of each other; this is analogous to the red-green 3D glasses used for watching 3D movies. Again, this anaglyphic image is transformed into an acoustical pattern that the operator interprets. The problem with acoustic feedback is that it ties up the hearing of the person when trying to engage in conversation with other people. In addition, a significant amount of learning is necessary to interpret the varying beeps and their correspondence with the environment. The constant beeping feedback, which bears only a loose correlation with the environment, can also be annoying. Ideally, in order to minimize learning and reliance on operator subjectivity, it would be more appropriate for the vision system to provide only a processed depth map to the feedback mechanism.

3. The Design Process
We decided to design and prototype a device that is able to transform depth information (computed from an inexpensive stereo configuration of two cameras) into another sensory domain (auditory or tactile) for use by a visually impaired person. The device should also be wearable, low power and relatively inexpensive. In the planning stages of our design, we enumerated the constraints and criteria that would govern the design process.

3.1 Constraints
1. The system must be able to perform in real time. A definition of real time is governed by the walking pace of a normal human (approximately 1 m/s). Thus, the camera system should provide depth information at a rate no worse than 2 Hz, although a better frame rate would be desirable (see the worked example after this list).
2. The system should be constructed with off-the-shelf components.
3. The system must weigh less than 4 kg with batteries. The system will only become truly non-intrusive when it is significantly lighter, comparable to the weight of regular clothing. This weight was chosen as the approximate equivalent of a backpack loaded with textbooks; it is noted that this weight may actually become uncomfortable on long journeys.
4. The system must be wearable. Therefore, the power source must be on-board the person (e.g., batteries).
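As a quick sanity check on constraint 1 (our arithmetic, based on the walking speed and update rate assumed above, not a figure from the original design documents): the distance travelled between successive depth maps at the minimum acceptable rate is

\Delta d = \frac{v}{f_{\min}} = \frac{1\ \mathrm{m/s}}{2\ \mathrm{Hz}} = 0.5\ \mathrm{m}

so the user advances half a metre between updates, which is why a higher frame rate is desirable for reacting to nearby obstacles in time.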
3.2 Criteria
1. The system should be easy to use and require little or no training.
2. The system should be ergonomically designed for efficiency, comfort and safety.
3. The system should be designed with off-the-shelf components.
4. The system should be wearable and thus portable.
5. The system should minimize intrusiveness into the daily activities of the user (e.g., conversation).
6. The entire system should be inexpensive.
3.3 Other
It is assumed that, for the time being, the user environment will be indoors, well lit and free of reflective surfaces (mirrors, windows). It is assumed that a deployable system would benefit from more customized parts, such as smaller cameras, improving the usability and aesthetics of the design. Furthermore, it is expected that available processing power will continually improve; unfortunately, advances in power technology have not been as significant as advances in computing technology.
Figure 1: Stereo Head Prototype. The constructed stereo head prototype. It is shown attached by a strap that goes around the user's neck, but the final device may instead be belt-mounted or embedded in a jacket or hat worn by the user. The construction is of wood, but a subsequent prototype will probably be made of lightweight plastic or metal.
4. System
This project was initiated as a senior undergraduate project, and the prototype was designed and constructed in a single semester. The students did not deliberate over a depth-from-stereo algorithm, but rather decided to use one that was readily available. They also did not focus on calibrating the cameras, assuming instead that the construction of the head enforced epipolar geometry. The constructed prototype allowed us to evaluate this assumption and the appropriateness of the stereo algorithm. The interesting aspect of this project is that we can now reflect on what changes to the stereo algorithm are necessary to facilitate better operation in the context of the task that the entire system must perform.
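For reference, the standard relation underlying this assumption (not stated explicitly in the text; it is the usual result for a rectified, parallel-axis stereo rig) is

Z = \frac{f B}{d}

where Z is the depth of a matched point, f the focal length in pixels, B the baseline between the cameras and d the disparity of the match. If the head construction leaves a vertical misalignment between the two imagers, corresponding pixels no longer fall on the same scanline, and a matcher that searches only along scanlines pairs the wrong pixels; this is the epipolar uncertainty referred to in the abstract.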
4.1 Hardware
The components selected for the system are off-the-shelf and readily available. A low-cost laptop computer (an IBM ThinkPad 1230 with 64 MB of RAM, a 6 GB hard drive and a 500 MHz Celeron CPU) was selected for the prototype; however, lighter embedded platforms [10] are proposed for the future. The stereo head was constructed out of wood, but a lighter plastic (or metal) assembly is planned for a future prototype (see Figure 1). The cameras selected were two de-cased Creative Video Blaster WebCam 3 cameras. These were selected because they use the USB interface and Linux drivers are available for their enclosed CMOS vision sensor chipset [11]. The Linux operating system was chosen because of its low latency in accessing the USB port (for the cameras) and the serial/parallel ports (for activating the piezo-electric buzzers of the tactile unit). In addition, Linux has been ported to various platforms, making a future transition to a lightweight embedded platform relatively easy.

A virtual-touch tactile feedback system was designed to relay the depth information computed by the stereo algorithm (implemented on the laptop) via a glove worn by the operator (see Figure 2). On one of the hands (e.g., the left hand), each finger corresponds to a sector of an obstacle map of the forward environment away from the user. The middle finger corresponds to the forward direction, while the other fingers correspond to the field to the left or to the right of the individual (see Figure 3). Finger stimulation represents an obstacle in that direction. The piezo elements are driven by a TI TLV5620 DAC interfaced to the host notebook by a standard parallel printer cable.
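As an illustration of the capture side of this pipeline, the following is a minimal sketch using OpenCV's Python bindings as a modern stand-in for the raw Linux USB camera drivers the prototype used (the device indices are assumptions and depend on enumeration order):

import cv2

# Indices 0 and 1 are assumptions; under Linux they depend on the order
# in which the two USB cameras are enumerated.
left_cam = cv2.VideoCapture(0)
right_cam = cv2.VideoCapture(1)

def grab_stereo_pair():
    # Grab a roughly synchronized pair of grayscale frames for matching.
    ok_left, left = left_cam.read()
    ok_right, right = right_cam.read()
    if not (ok_left and ok_right):
        raise RuntimeError("camera read failed")
    return (cv2.cvtColor(left, cv2.COLOR_BGR2GRAY),
            cv2.cvtColor(right, cv2.COLOR_BGR2GRAY))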
Figure 2: Tactile Unit. A collection of piezo-electric buzzers is attached to the fingers. The hand forms a map of the immediate forward-looking environment (see Figure 3).

Each piezo buzzer used in the prototype is activated (in a binary mode) if an obstacle is detected in its corresponding frontal direction. Only further investigation into other tactile feedback options, together with input from actual users, will determine whether this type of response system is appropriate. Ideally, the buzzers should be smaller and embedded into the glove construction.
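A minimal sketch of the depth-map-to-finger mapping just described (our reconstruction for illustration, not the authors' code; the five-sector split, the 1.5 m threshold and the function name are assumptions):

def fingers_from_depth(depth_map, threshold_m=1.5):
    # depth_map: 2-D array of depths in metres, zero where no depth was found.
    # Returns five binary buzzer states, ordered left-to-right so that the
    # middle entry corresponds to straight ahead (the middle finger).
    num_fingers = 5
    rows, cols = len(depth_map), len(depth_map[0])
    states = []
    for i in range(num_fingers):
        lo, hi = i * cols // num_fingers, (i + 1) * cols // num_fingers
        # Nearest valid depth within this vertical sector of the image.
        nearest = min((depth_map[r][c] for r in range(rows)
                       for c in range(lo, hi) if depth_map[r][c] > 0),
                      default=float("inf"))
        states.append(nearest < threshold_m)
    return states  # one binary activation per piezo buzzer

Each True entry would then be written, via the parallel port, to the DAC channel driving the corresponding piezo element.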
The overall weight of the system is approximately 3.5 kg (the notebook weighs 2.8 kg, the cameras and head weigh 0.5 kg, and the feedback system weighs 0.2 kg). The approximate cost of the entire system is $1882 CAN (approximately $1250 US), with the computer ($1599 CAN) and the two cameras ($180 CAN) being the major components. Unfortunately, the potential continuous operating time of the system is limited by the laptop's battery supply, which lasts approximately 4 hours.

Figure 3: Hand Spatial Correspondence. This illustrates a right hand with the thumb positioned at the extreme left. The middle finger corresponds to the spatial direction of straight ahead.

The algorithm assumes that image intensities vary, even ever so slightly, so that matches can exist. The search limit imposes a bound on the amount of disparity allowed and thus on the extent of the physical search along the epipolar line. A dynamic programming technique makes use of the structure of the cost function to find the optimal match sequence by conducting an exhaustive search, and pruning is used to make the algorithm more efficient. Information is subsequently propagated amongst scanlines [12]; a full 2D search could increase computing time ten-fold. Each pixel is assigned one of three levels of reliability. "A moderately reliable pixel propagates along its column changing the disparities of the pixels it encounters until it reaches either intensity variation or a slightly reliable region with lower disparity. Regions with a higher disparity are overrun no matter what their reliability, because reliability is not a good indication that the disparities are correct when the background has little intensity variation. After the pixels are propagated along their columns, the same process is repeated along the rows." The following cost function is minimized for each scanline match sequence [12]:

\gamma(M) = N_{occ}\,\kappa_{occ} - N_m\,\kappa_r + \sum_{i=1}^{N_m} d(x_i, y_i)    (1)

where \kappa_{occ} is the constant occlusion penalty, \kappa_r is the constant match reward, N_{occ} and N_m are the numbers of occlusions and matches respectively, and d(x_i, y_i) is the dissimilarity between matched pixels x_i and y_i:

d(x_i, y_i) = \min\left\{ \bar{d}(x_i, y_i, I_L, I_R),\ \bar{d}(y_i, x_i, I_R, I_L) \right\}    (2)

where the value \bar{d} is defined as:

\bar{d}(x_i, y_i, I_L, I_R) = \min_{y_i - \frac{1}{2} \le y \le y_i + \frac{1}{2}} \left| I_L(x_i) - \hat{I}_R(y) \right|

with \hat{I}_R(y) the linearly interpolated intensity along the right scanline.
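To make the dissimilarity of equation (2) concrete, below is a minimal Python sketch of the sampling-insensitive measure of [12] as we read it (our own reconstruction; the variable names are ours, and the full dynamic-programming matcher with the occlusion penalty \kappa_{occ}, match reward \kappa_r and pruning is omitted):

def d_bar(x, y, I_L, I_R):
    # One-sided dissimilarity between pixel x of left scanline I_L and
    # pixel y of right scanline I_R: zero if I_L(x) falls within the
    # linearly interpolated right intensities over [y - 1/2, y + 1/2].
    # Scanlines are assumed to be sequences of floats (avoids integer wrap).
    i_minus = 0.5 * (I_R[y] + I_R[max(y - 1, 0)])            # I_R at y - 1/2
    i_plus = 0.5 * (I_R[y] + I_R[min(y + 1, len(I_R) - 1)])  # I_R at y + 1/2
    i_min = min(i_minus, i_plus, I_R[y])
    i_max = max(i_minus, i_plus, I_R[y])
    return max(0.0, I_L[x] - i_max, i_min - I_L[x])

def dissimilarity(x, y, I_L, I_R):
    # Symmetric d(x_i, y_i) of equation (2).
    return min(d_bar(x, y, I_L, I_R), d_bar(y, x, I_R, I_L))

For linear interpolation, the interval minimization in the definition of \bar{d} reduces to comparing I_L(x) against the minimum and maximum of the interpolated half-pixel intensities, which is what the final max(0, ...) expression computes.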