COMPUTER VISION GUIDED NAVIGATION SYSTEM FOR VISUALLY IMPAIRED

A Dissertation Submitted in Partial Fulfillment of the Requirement for the Award of the degree of

MASTER OF TECHNOLOGY IN INSTRUMENTATION

SUBMITTED BY: ARVIND SHARMA M 3144518

UNDER THE GUIDANCE OF DR. PRAKASH CHAND Assistant Professor Department of Physics National Institute Of Technology Kurukshetra

NATIONAL INSTITUTE OF TECHNOLOGY (INSTITUTION OF NATIONAL IMPORTANCE) KURUKSHETRA, HARYANA – 136119 2014-2016


CANDIDATE DECLARATION
I, Arvind Sharma M, hereby certify that the work presented in this dissertation titled "Computer vision guided navigation system for visually impaired", submitted by me in partial fulfillment of the requirements for the award of Master of Technology in Instrumentation from National Institute of Technology, Kurukshetra, is an authentic record of my original work, carried out in the Department of Physics, N.I.T. Kurukshetra under the supervision of Dr. Prakash Chand, Assistant Professor, Department of Physics, N.I.T. Kurukshetra. All the work in this thesis is entirely my own except for the references quoted. To the best of my knowledge, this work has not been submitted to any University/Institute for the award of any degree.

Place : Kurukshetra

ARVIND SHARMA M

Date :

3144518


CERTIFICATE
It is certified that the dissertation work entitled "Computer vision guided navigation system for visually impaired" is being submitted by Mr. Arvind Sharma M (Roll No. 3144518) to the Department of Physics, National Institute of Technology Kurukshetra, for the award of the degree of Master of Technology in Instrumentation. The candidate has worked under my supervision. The work presented in this thesis has not been submitted to any other University/Institute for the award of any other degree.

Place : Kurukshetra Date :

DR. PRAKASH CHAND Assistant Professor Department of Physics National Institute Of Technology Kurukshetra


ACKNOWLEDGEMENT
I wish to express my overwhelming gratitude and immense respect to Dr. Prakash Chand, Assistant Professor, Department of Physics, NIT Kurukshetra, for his valuable guidance, constant encouragement and constructive criticism in bringing this dissertation to its present shape. He has been my mentor, friend, philosopher and source of inspiration during the whole tenure. Whenever I got puzzled and confused, the moment I entered his office I found all my load torque balanced, driving me towards steady-state progress.

I would like to acknowledge the contribution of Dr. Ashavani Kumar (Professor & Head, Department of Physics, National Institute of Technology Kurukshetra) for all his support in the department.

Above all, I would like to give my sincere thanks to all the respected professors of the Department of Physics, whose affectionate guidance, unremitting encouragement and valuable suggestions at each and every step have always kept me on track, ultimately leading to the successful completion of this work.

Last but not least, my special thanks go to my parents for their patience and financial support, as well as for their inspiration to continue this research work and their valuable advice not to worry about the consequences of success or failure.


ABSTRACT
The problems faced by a blind or visually impaired person are numerous: access to information; access to transportation; locating and obtaining services for the blind; shopping, cooking and other independent living skills; a lack of inclusive or accessible social activities; and insufficient finances for necessary assistive devices. Observing the living standards of visually impaired people and their struggle for their livelihood was the biggest motivation behind designing and developing a cost-efficient, dual-module, wearable navigation assistance system that not only serves as a travelling aid for avoiding human obstacles in the user's path but also helps in reading text. Most existing travelling aids cannot help the blind read text on signboards to find the direction to a particular location, whereas the proposed system can read the text out. The system has two modules: the Read Module, which is based on a text detection algorithm for natural scenes using edge-enhanced MSER, and the Guidance Module, which obtains the dimensions of the obstacle in the user's path by three-dimensional (3D) reconstruction. Instead of using several sensors, we choose a simple stereo camera setup and three-dimensionally reconstruct the surroundings from the images. Our work on 3D reconstruction of cylinders to recover real-world metrics is merged with the text detection algorithm. Real-time scene text localization and recognition via stroke width is more robust than other text extraction techniques. A voice synthesizer reads the recognized strings out through earphones, enabling the user to recognize the text. Robustness and performance are better because the number of words correctly detected by Optical Character Recognition (OCR) was high. 3D reconstruction for avoiding human obstacles is performed by taking stereo pictures of the surroundings and then exploiting isometric properties. The algorithm used in our work reconstructs cylinders without using a vanishing point or vanishing line. The metrics of a standing or walking human body are calculated based on the assumption that it occupies a cylindrical 3D volume, and the dimensions are finally conveyed to the user, who can then identify the size of the obstacle and avoid collision. The tests were performed in a controlled indoor environment, and the results show that the performance of both modules is superior to other techniques and suggests that real-time outdoor trials can be carried out in the future.


CONTENTS

List of Figures
List of Tables
List of Abbreviations

CHAPTER 1: INTRODUCTION
1.1 Introduction: Need for Blind Assistance Devices
1.2 Literature Review
1.2.1 Echolocation
1.2.2 Navbelt
1.2.3 voIce
1.2.4 University of Stuttgart Project
1.2.5 University of Guelph Project
1.2.6 FIU Project
1.2.7 Virtual Acoustic Space
1.2.8 Navigation Assistance for Visually Impaired
1.3 Text Recognition
1.4 Computer Vision and 3D Reconstruction
1.5 Modules of the System
1.5.1 Read Module
1.5.2 Guidance Module

CHAPTER 2: ALGORITHMS & METHODOLOGY
2.1 Dual-Module System
2.1.1 Read Module Algorithm
2.1.2 Guidance Module Algorithm
2.2 Experimental Setup

CHAPTER 3: RESULTS & DISCUSSION
3.1 Results Obtained
3.1.1 Text Recognition (Read Module) Results
3.1.2 3D Reconstruction (Guidance Module) Results

CHAPTER 4: CONCLUSIONS & FUTURE SCOPE
4.1 Conclusions
4.2 Future Work

References
List of Publications

LIST OF FIGURES

1.1 voIce system
1.2 University of Guelph project
1.3 Virtual Acoustic Space system
2.1 Flow chart of read module
2.2 Flow chart of guidance module
2.3 Stereo vision setup
2.4 Tangential mapping of points from cylinder
2.5 Flowchart of 3D reconstruction technique
2.6 Experimental setup
2.7 Creative Senz3D stereo camera
3.1 Actual input image
3.2 After recognition and detection of text regions
3.3 Text saved as meaningful strings in file
3.4 Text-to-speech engine
3.5 MSER regions
3.6 After removing non-text regions
3.7 Stroke width variation based approach
3.8 Expanded bounding boxes text
3.9 Detected text
3.10 Detected natural scene text saved as meaningful strings in file
3.11 Standing pose
3.12 Walking pose

LIST OF TABLES

4.1 Evaluation of text detection algorithm
4.2 Results of metric estimation in real world

LIST OF ABBREVIATIONS

AFB - American Foundation for the Blind
CCs - Connected Components
DOF - Degree of Freedom
EOAs - Electronic Orientation Aids
ETAs - Electronic Travel Aids
EGNOS - European Geostationary Navigation Overlay Service
FIU - Florida International University
GPS - Global Positioning System
HRTF - Head Related Transfer Function
ICDAR - International Conference on Document Analysis and Recognition
LVQ - Learning Vector Quantization
MSER - Maximally Stable Extremal Regions
NFB - National Federation of the Blind
OCR - Optical Character Recognition
PDA - Personal Digital Assistant
PLDs - Position Locator Devices
SBPS - Single Board Processing System
SVM - Support Vector Machine
3D - Three Dimensional
WHO - World Health Organization

CHAPTER 1 INTRODUCTION


1.1 Introduction
The problems faced by a blind or visually impaired person are numerous: access to information (mail, print media, warning systems and computers); access to transportation, i.e. moving without guidance either on foot or as a passenger on a vehicle; locating and obtaining services for the blind such as Braille training; shopping, cooking and other independent living skills; obtaining or maintaining employment; a lack of inclusive or accessible social activities and venues; and insufficient finances for necessary assistive devices. Observing the living standards of visually impaired people and their struggle for their livelihood was the biggest motivation behind designing and developing a dual-module, cost-efficient, wearable navigation assistance system that guides the visually impaired.

Need for blind assistance: In 2014 the World Health Organization (WHO) estimated that 285 million people are visually impaired [59], of whom 39 million are completely blind. Based on the same study, of the 39 million blind people across the globe, around 15 million reside in India, which makes India home to the world's largest number of blind people. The National Federation of the Blind (NFB) [56] and the American Foundation for the Blind (AFB) [57] have confirmed that in the United States an estimated 1.3 million people are blind, while approximately 10 million individuals have a visual impairment, of whom around 100,000 are students. Assistance systems are therefore indispensable in the present-day scenario as well as in the future [58].

There is a wide range of assistance systems and tools for making navigation of visually impaired individuals possible [1]. White canes and guide dogs are preferred mostly. Although they cannot provide the vital information needed for uninterrupted navigation, such as the distance, dimensions and speed of an obstacle along with its direction, which are normally gathered by the eyes and are necessary for the perception and control of movement, the white cane remains the most popular navigation aid because it is the cheapest, most reliable and simplest.

For the past half century, many researchers have developed electronic devices for navigation. These fall under three major categories, namely vision enhancement, vision replacement, and vision substitution. The function of any sensory aid is "to detect and locate objects and provide information that allows user to determine (within acceptable tolerances)


range, direction, and dimension and height of objects” [1]. Non-contact trailing and tracking is made possible, thus enabling the traveler to receive information about directions from physical structures that have strategic locations in the environment with additional object identification if possible.

Vision enhancement involves obtaining input from a camera, processing the information, and finally presenting the visual information on a display. As in many virtual reality systems, a miniature head-mounted camera with the output on a head-mounted visual display is the basic prototype of this category. Vision replacement involves delivering the necessary information directly to the visual cortex of the human brain or feeding it to the brain via the optic nerve. As it is an invasive technique, this category is currently not widely accepted or accessible, since it involves scientific, technological and medical issues. Vision substitution is similar to vision enhancement but with a non-visual output, typically tactile or auditory or some combination of the two. Since the senses of touch and hearing have a much lower information capacity than vision, it is essential to process the information down to a level that can be handled by the user [1]. The category that receives the most attention in the electrical and computer science research community is vision substitution. It is a non-invasive approach, and many real-time prototypes are being developed in this category to make navigation of the blind possible without the help of others.

The following subcategories can be identified:

1) Electronic travel aids (ETAs): these devices transform information about the surroundings, which is normally gathered through vision, into a form that can be conveyed through touch or sound.

2) Electronic orientation aids (EOAs): these devices provide orientation prior to or during travel. They are generally carried by the user, but can also be external, such as a handheld receiver or an infrared light transmitter.

3) Position locator devices (PLDs): navigational aids such as the Global Positioning System (GPS), the European Geostationary Navigation Overlay Service (EGNOS) and other such technologies fall under this category.


Our proposed system falls under the category of wearable ETAs. ETAs are categorized by the way they gather information from the environment and by the way they deliver that information to the user. Information can be gathered with sonars, laser scanners, or cameras, and the user can be informed through the auditory and/or tactile sense.

Sounds or a synthetic voice are the options for the auditory sense, and electrotactile or vibrotactile stimulators for the tactile sense. Tactile feedback has the benefit of leaving the auditory sense unblocked (free ears), which matters because hearing is the most important perceptual input source for a visually impaired user compared with others such as temperature, wind, odor and touch. Additionally, wearable ETAs benefit users to a greater extent because they are hands-free, while other devices must be held; it is up to the user to select what is more appropriate to his/her habits [1]. The National Research Council's guidelines for ETAs are listed below:

1) Obstacles should be detected from ground level up to head height, across the full body width.
2) Travel surface information should include discontinuities and textures.
3) Objects bordering the travel path should be detected, for projection and shorelining.
4) Cardinal direction and distance information of objects should be provided for projection of a straight line.
5) Landmark location and identification information.
6) Information that enables mental mapping of the surroundings and self-familiarization.

The remainder of the thesis is organized as follows. Chapter 2 describes the individual steps of the two proposed algorithms. Chapter 3 demonstrates the robust performance of the proposed system with the results obtained, and Chapter 4 concludes the thesis.


1.2 Literature Review
Since the 1960s, electrical and computer science engineers have developed navigation assistance devices to improve the living standards of blind and visually impaired people [14]. ETAs have been of interest to scientists for the past two decades [14]. A number of commercial devices are available, and research is carried out worldwide in this domain. This section discusses the literature studied and summarizes the latest developments.

1.2.1 Echolocation

The project started around three decades ago in Japan, mainly aimed at designing a mobility aid modelled on the bat's echolocation system [6]. The sizes of obstacles and their directions are indicated by the time differences and varying intensities of the reflected ultrasound waves transmitted by the sensors, creating localized sound images. A few initial experiments, using different ultrasound frequencies, examined the user's capability to distinguish between objects in front of the user's face. More experiments and statistical results are essential to establish the viability of the project, but the results obtained so far show that users can discriminate objects and even identify them in some limited cases. The developed prototype was portable and simple to use. Two ultrasonic sensors are attached to conventional eyeglasses and their data, using a microprocessor and an A/D converter, are down-converted to a stereo audible sound sent to the user via earphones.

1.2.2 Navbelt

Borenstein and his colleagues at the University of Michigan developed Navbelt [7], a blind guidance system based on the obstacle avoidance system of a mobile robot. The computer creates a map of angles using the information received from eight ultrasonic sensors, each covering one of eight directions and giving the distance of objects at that particular angle. The obstacle avoidance algorithm produces sounds appropriate for each mode. Guidance mode and image mode are the two modes in Navbelt. In guidance mode, a single recurring beep steers the user in the optimum direction of travel for reaching his/her destination. The device is not practically useful because a non-simulated, realistic implementation requires more sensors than are implemented here.


In image mode, similar to a radar sweep, eight tones of different amplitudes are played from the eight sensor-facing directions. Depending on the mode, the user listens through earphones to the sounds translated from the map by the computer. The disadvantages of the system are the exclusive use of audio feedback, the bulky prototype, and the extensive training periods required by users. The prototype was implemented in the early 90s and consisted of ultrasonic range sensors, a computer and earphones.

1.2.3 voIce

The voIce project [8] was started by Meijer, who believed that the human auditory system can process and interpret rapidly changing sound patterns even if they are extremely complicated. The images captured by the camera are processed by a portable computer, an unfiltered, direct, invertible one-to-one image-to-sound mapping is applied, and the sound is then sent through the earphones. Since the main idea was that the human auditory system and the brain are capable of processing extremely complicated sound information, no filters were used, in order to mitigate the risk of filtering out vital information. Recently, the same software was redesigned for embedding on a cell phone, so that the user can use his/her personal mobile phone's camera and earphones as a voIce assistant. Furthermore, a sonar extension is also available for better results, increased safety and an improved representation of the surroundings. Although the user needs extensive training because of the complex sound patterns, the system received very promising feedback from many users who tried it. The prototype is wearable and portable because it consists of cameras fitted to eyeglasses and earphones, both connected to a laptop with the necessary software.

Figure 1.1: voIce system [8]


1.2.4 University of Stuttgart Project
A portable, wearable system that assists blind people in orienting themselves in indoor environments was developed by researchers at the University of Stuttgart in Germany. The prototype consists of a sensor module with a detachable cane and a portable computer. The sensor module is equipped with two cameras, a keyboard, a digital compass, a 3-D inclinometer, and a loudspeaker. It can be handled like a flashlight, and "By pressing designated keys, different sequence and loudness options can be chosen and inquiries concerning an object's features can be sent to the portable computer. After successful evaluation these inquiries are acoustically answered over a text-to-speech engine and the loudspeaker." [9] The computer contains software for detecting the color, distance and size of objects, along with wireless local area network capabilities.

The device works almost in real time. In order to improve the performance of the system, a virtual 3D model of the environment was built, so the information from the sensor can be matched with the data stored in the 3D model. The proposed future work was a matching algorithm for the sensor information and the 3D model's data, and embedding the system in the Nexus framework, a platform that allows a general description of arbitrary physical real-world and virtual objects. In conclusion, the system's positives are the robustness of the sensor, the near real-time operation and the user-friendliness. The negatives are the hold-and-scan operation and the, so far, limited and simulated testing.

1.2.5 University of Guelph Project

Zelek, with students from the University of Guelph in Canada, developed an inexpensive, wearable, low-power device built with off-the-shelf components [13] that transforms the output of stereo cameras into tactile or auditory information for use by visually impaired people during navigation. The output is the depth of objects. The prototype consists of two stereo cameras, a tactile unit, which is a glove with five piezoelectric buzzers (one on each fingertip), and a portable computer. Each finger corresponds to a spatial direction; for example, the middle finger corresponds to straight ahead. Using a standard stereo vision algorithm, a depth map is created and then divided into five vertical sections, each one corresponding to a vibration element. If a pixel in a section falls within a threshold distance,


then the corresponding vibration element is activated, informing the user about a close obstacle in that direction.

Figure 1.2: University of Guelph Project [13]

1.2.6 FIU Project

This project [10] from researchers at Florida International University is an obstacle detection system that uses 3D spatialized sounds based on readings from a multidirectional sonar system. The prototype consisted of two subsystems: the sonar and compass control unit, consisting of six ultrasonic range sensors pointing in six radial directions around the user and a microcontroller; and the 3D sound rendering engine, consisting of earphones and a personal digital assistant (PDA) equipped with software capable of processing information from the sonar and compass control unit. The algorithm, using Head Related Transfer Functions (HRTF), creates a 3D sound environment that represents the obstacles detected by the sensors. In this way the user creates a mental map of the layout of his/her surroundings, so that obstacles can be avoided and open passages can be considered for path planning and navigation.

1.2.7 Virtual Acoustic Space

A research group at the Instituto de Astrofisica de Canarias developed a novel device named Virtual Acoustic Space [12]. By building a perception of space at the neuron level, the user can orient himself/herself as a sound map of the surroundings is generated. A stereo vision setup is used, with two cameras capturing information about the surroundings. A depth map with features such as textures, distances of obstacles and color is created by the processor using HRTFs, which then generates sound maps corresponding to the state of the surroundings, as if sonorous sources existed in the environment. The prototype consists of a stereo vision camera attached to the frame of conventional eyeglasses and earphones connected to the processor.

Figure 1.3: Virtual Acoustic Space system [12]

1.2.8 Navigation Assistance for Visually Impaired (NAVI)

The NAVI project is a sound-based ETA developed by Sainarayanan et al. [11] from University Malaysia to help blind people identify obstacles during navigation. The system mainly focuses on objects in front of the user's center of vision, so distinguishing obstacles from the background is very significant. The greyscale video captured by the camera is resampled to 32 × 32 resolution. Then, using a fuzzy learning vector quantization (LVQ) neural network, the pixels are classified as either background or objects based on different gray-level features. The background is suppressed to enhance the object of interest, and finally the processed image is split into left and right halves and transformed into a stereo sound map that is sent via earphones. The prototype consists of a headgear that holds the digital camera, a pair of stereo earphones, the single board processing system (SBPS), and a vest that holds the rechargeable batteries and the SBPS.

Based on the literature studied [15], [16], [17], ETAs can be divided into three types according to their feedback mechanisms; our proposed system falls in the audio feedback category:
A) Audio feedback
B) Tactile feedback
C) Without feedback

1.3 Text Recognition
Low-cost and high-performance cameras are very easily accessible nowadays, which has led to the increased popularity of mobile visual search [28]. Applications of visual search systems, such as product or object recognition and landmark recognition, have been developed in recent years, in which local image features extracted from images taken with simple mobile phone or laptop cameras are matched to a large database using visual word indexing techniques [29]. Text in images has been largely discounted as a useful feature even though current visual search technologies have already reached a considerable level of maturity [35]. In fact, text is particularly interesting because it provides contextual clues for the objects appearing inside an image. Given the vast number of text-based searches, using embedded text to retrieve an image offers an effective enhancement to visual search systems [30].

Automatic text recognition from images is one of the most difficult problems in computer vision. Robustly locating text in images is an essential prerequisite for text recognition. In spite of this, it remains a challenging task because of variations in text appearance, due to variations in text size, font, stroke width, color, texture and pattern [31]. Geometric distortions, partial or complete occlusions, different lighting conditions leading to different contrast ratios, and image resolution also play an important role in increasing the difficulty of text detection [33].

Text detection has been considered in many contemporary studies, and an abundant number of methods are reported and discussed in the literature [47], [48], [49]. Texture-based [50] and connected component (CC)-based are the two major classifications, and most of the existing techniques come under these two categories. In texture-based methodologies, text is viewed as a distinct feature or texture distinguishable from the background of the image. Typically, features are extracted over a particular region of the image by machine learning heuristics, and a classifier is employed to detect and recognize the text. Zhong et al. [37] assumed that text has certain horizontal and vertical frequencies and extracted features in the discrete cosine transform domain to perform text detection. Ye et al. [38] collected features from wavelet coefficients and classified text lines using an SVM. Chen et al. [32], [34], [55] fed a set of weak classifiers to the AdaBoost algorithm and trained a strong text classifier.

CC-based methods find individual characters by grouping pixels into regions using connected component analysis, assuming that pixels belonging to the same character have similar properties. Connected component methods differ in the properties they use, such as color, stroke width and others. The advantage of connected component methods is that their complexity typically does not depend on the range of scales, orientations, fonts and other properties of text, and that they also provide a segmentation which can be exploited in the OCR step [2]. In contrast to texture-based methods, the CC-based approach utilizes geometric constraints to rule out non-text background and extracts regions from the image. An adaptive binarization method to discover CCs is a top-scoring contestant: CCs are linked based on geometric properties and text lines are then formed. Recently, Epshtein et al. [42] proposed the stroke width transform, in which rays shot from edge pixels along the gradient direction yield a per-pixel stroke width image, and CCs are extracted in the transformed image. Shivakumara et al. [43] extracted CCs by performing K-means clustering in the Fourier-Laplacian domain and eliminated false positives using text straightness and edge density. A disadvantage of CC-based methods is their sensitivity to clutter and occlusions that change the connected component structure.

Text localization and recognition in natural scene images is still an open problem which has been receiving significant attention, because it is a vital component in a number of computer vision applications such as navigation assistance for the visually impaired, visual search systems, and reading business labels in map applications, as is done in Google Street View. Even though words occupy a substantial part of the image, are written horizontally, and appear without perspective distortion or substantial noise, and even though the data are not fully natural scenes, the best reported efficiency is only about 62% of words localized correctly. Numerous contests have been held in recent times, and this figure corresponds to the top-performing method in the most recent ICDAR 2011 competition [41], [52]. If N is the number of pixels, then localizing text in an image is in theory a computationally very expensive task, as in general any of the 2^N subsets of pixels can correspond to text.

Works dealing solely with the problem of text localization in natural scene images have been published. The method of Epshtein et al. [42] converts an input image to greyscale and uses a Canny detector to find edges. Pairs of parallel edges are then used to calculate the stroke width for each pixel, and pixels with similar stroke width are grouped together into characters. The method is sensitive to noise and blurry images because it is


dependent on successful edge detection, and it provides only a single segmentation for each character, which is not necessarily the best one for an OCR module.

Only a few methods that perform both text localization and recognition have been published. The method of Wong et al. [52] finds individual characters as visual words using the sliding-window approach and then uses a lexicon to group characters into words. The method is able to cope with noisy data, but its generality is limited, as a lexicon of words has to be supplied for each individual image. Other methods detect characters as MSERs and perform text recognition using the segmentation obtained by the MSER detector. An MSER is a particular case of an Extremal Region whose size remains virtually unchanged over a range of thresholds [46]. These methods perform well but have problems with blurry images or characters with low contrast. According to the description provided by the ICDAR 2011 Robust Reading competition organizers [41], [52], the winning method is based on MSER detection, but the method itself has not been published and it does not perform text recognition.

In this work, we propose a novel CC-based text detection algorithm, which employs MSERs as the basic letter candidates [4]. Despite their favorable properties, MSERs have been reported to be sensitive to image blur. To allow detection of small letters in images of limited resolution, the complementary properties of Canny edges and MSERs are combined in our edge-enhanced MSER. Further, we propose to generate the stroke width transform image of these regions using the distance transform, to obtain more reliable results efficiently. The geometric as well as stroke width information is then applied to perform filtering and pairing of CCs. Finally, letters are clustered into lines and additional checks are performed to eliminate false positives. In comparison to previous text detection approaches, our algorithm offers the following major advantages. First, the edge-enhanced MSERs detected in the query image can be used to extract feature descriptors for visual search; hence our text detection can be combined with visual search systems without further computational load to detect interest regions. Further, our system provides a reliable binarization for the detected text, which can be passed to OCR for text recognition. Finally, the proposed algorithm is simple and efficient: MSERs as well as the distance transform can be computed very efficiently, and determining the stroke width only requires a lookup table.


1.4 Computer Vision and 3D Reconstruction
Computer vision involves acquiring images of the object of interest and analyzing them to estimate higher-dimensional data of the real world as numerical information. Computational symmetry and group theory are fields that have contributed to image analysis and to retrieving structure from real-world scenes [18], [20], [23], [26]. So far, a single perspective view of a scene with minimal information has been used for computing metric measurements and for 3D reconstruction [22], [25]. A passive ranging technique generally applicable for acquiring 3D data uses a stereo camera setup, and it is better than the single view in many aspects. In single view metrology, uncertainty is added to the calculations and to the metric values because the method relies on the vanishing point and vanishing line; errors are inevitable because these parameters are assumed to be at infinity [21], [22], [24].

In single view metrology, the distance between parallel planes is computed when the corresponding points on the planes are normal to the planes [22], [25]. The homology between the two planes is also established by the above-mentioned method [18]. The general framework of corresponding points is considered in the proposed framework, and thus the restrictions that existed in previous works are eliminated. Two orthogonal planes for finding the corresponding points are found by using the 360° rotational symmetry and the location of the camera centre [21]. A cylindrical 3D volume has 360° rotational symmetry around its axis, and thus only two homologies are required to find a cylinder on the ground plane when the cylinder is imaged by two cameras. Earlier work obtained the depth of corresponding points by the vanishing point method [21], and hence the error percentage was comparatively higher than with our method. Our proposed method is developed by blending the earlier works, viz. (i) 360° rotational symmetry; (ii) stereo vision for 3D reconstruction; (iii) depth estimation without vanishing points. Geometrical calculations are carried out with some known reference measurements and assumptions. Obtaining metric information from images and the real world without considering the vanishing point and vanishing line gives a better 3D reconstruction technique with reduced error.


1.5 Modules of the System
The proposed system has two modules, namely the Read Module and the Guidance Module, each working within its respective scope and function. The Read Module is based on an algorithm for automatic text detection in natural images using MSER. The Guidance Module helps the user recognize the size of a human obstacle in the path; it works by a 3D reconstruction technique. Generally, existing assistance systems [1], [14] help by informing the user about the distance, orientation and direction of an object, but our algorithm helps the user know the dimensions of the obstacle. It is based on the assumption that a standing or walking human occupies a 3D cylindrical volume in space.

1.5.1 Read Module

A main hurdle for visually challenged persons is reading text. Education and reading for visually impaired persons is made possible by the Braille system or other similar tactile methods. But the main problem for the blind arises when they have to interact with natural environments, for example reading normal text from a newspaper, a menu card in a hotel, or the destination of a bus shown on its digital board. Our proposed system therefore serves not only as a travelling aid but also helps in reading text when it is in read mode. Either the reading module or the guidance module is exclusively selected by the user of the system. Algorithms for extracting text from images were studied extensively, and the most feasible technique is incorporated in this module.

The MSER feature detector works well for finding text regions. It works well for text because the consistent color and high contrast of text lead to stable intensity profiles. Although the MSER algorithm picks out most of the text, it also detects many other stable regions in the image that are not text. A rule-based approach can be used to remove such non-text regions: for example, geometric properties of text can be used to filter out non-text regions using simple thresholds [31]. Alternatively, a machine learning approach can be used to train a text vs. non-text classifier [32]. Typically, a combination of the two approaches produces better results. This algorithm uses a simple rule-based approach to filter non-text regions


based on the following geometric properties: aspect ratio, eccentricity, Euler number, extent, and solidity.

Another common metric used to discriminate between text and non-text is stroke width. Stroke width is a measure of the width of the curves and lines that make up a character. Text regions tend to have little stroke width variation, whereas non-text regions tend to have larger variations. The stroke width can therefore be used to remove non-text regions, and the stroke width of a detected MSER region can be estimated using a distance transform and a binary thinning operation. The stroke width image has very little variation over most of a text region. This indicates that the region is more likely to be text, because the lines and curves that make up the region all have similar widths, which is a common characteristic of human readable text.

Once stroke width filtering is done, the detection results are composed of individual text characters. To use these results for recognition tasks, such as OCR, the individual text characters must be merged into words or text lines. This enables recognition of the actual words in an image, which carry more meaningful information than the individual characters, since the meaning of a word is lost without the correct ordering. To merge individual text regions into words or text lines, neighboring text regions are first found and then a bounding box is formed around these regions. To find neighboring regions, the computed bounding boxes are expanded. This makes the bounding boxes of neighboring text regions overlap, such that text regions that are part of the same word or text line form a chain of overlapping bounding boxes.

Once text regions are detected, OCR recognizes the text within each bounding box and the meaningful text is stored in a text file. The text-to-speech synthesizer converts this text into speech and reads it out loud for the user. The user hears the text through his noise cancellation earphones, so that he can understand the text and act accordingly depending on the situation.


1.5.2 Guidance Module

Most of the ETAs [1]-[17] discussed in the previous sections calculate the distance of the obstacle, and a few systems can find the range and direction of the obstacle. But none of the systems concentrate on detecting dimensions such as the height, breadth and width of a human obstacle in the user's path. Our proposed system specifically finds the dimensions of the obstacle. This is important because if the obstacle is not large enough to affect the mobility of the user, it can be ignored.

Our system mainly focuses on human obstacles. In a real indoor environment, such as a corridor of a college building, fellow humans are usually more common obstacles than non-living objects. Hence the algorithm is designed to detect the dimensions of humans who come into the path of the user. A controlled indoor environment is considered for the whole experimental setup, and the results obtained are promising. The algorithm used in the guidance module is a 3D reconstruction technique in which the surroundings are regenerated in the computer and the real-world metrics are calculated. The experimental setup of our proposed system is simple, cheap and wearable. The system consists of a stereo camera (Creative Senz3D) and stereo earphones connected to a portable laptop computer (Intel Core i3-4030U CPU @ 1.90 GHz, 4 GB RAM).

The literature studied and the challenges faced by earlier works prompted us to investigate an alternative approach to existing systems. Our work was designed chiefly to overcome the limitations that exist in present-day commercial assistance products and research prototypes. The efficiency obtained in the text localization and recognition results was better than that of state-of-the-art techniques. 3D reconstruction for blind assistance is a novel idea that has not been attempted so far.


CHAPTER 2 METHODOLOGY & ALGORITHMS


This chapter discusses the details of the modules developed and the algorithms responsible for the working of these modules. It also describes the experimental setup used for the development of the system.

2.1 Dual-Module System

2.1.1 Read Module
The algorithm for text extraction in the read module is as follows:

Step 1: Detecting text regions using MSER
Step 2: Removing the non-text background regions based on basic geometric properties
Step 3: Additional removal of non-text regions based on stroke width variation
Step 4: Merging texts and recognizing strings for final detection results
Step 5: Recognizing the detected text using OCR
Step 6: Writing the detected text into a text file
Step 7: Using a speech synthesizer to read out the text saved in the file for the user

i. Detect candidate text regions using MSER

Text has consistent color and high contrast, which leads to stable intensity profiles and makes the MSER feature detector work well for finding text regions in images [4]. The detectMSERFeatures function in MATLAB's Computer Vision Toolbox is used to detect candidate text regions within the image, and the results are plotted. Many non-text regions are also detected alongside the text.
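As a rough illustration of this step, the following MATLAB sketch detects MSER regions on a grayscale image; the file name 'signboard.jpg' and the parameter values are hypothetical placeholders rather than the settings used in this work, and the Computer Vision Toolbox is assumed to be available.

% Minimal MSER detection sketch.
colorImage = imread('signboard.jpg');        % hypothetical input image
grayImage  = rgb2gray(colorImage);

% Detect MSER regions; the second output is a connected-component
% structure that regionprops can consume in the next step.
[mserRegions, mserConnComp] = detectMSERFeatures(grayImage, ...
    'RegionAreaRange', [200 8000], 'ThresholdDelta', 4);

figure; imshow(grayImage); hold on;
plot(mserRegions, 'showPixelList', true, 'showEllipses', false);
title('MSER regions (text and non-text)');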

Many stable regions in the image which are not actually text are also detected alongside the text regions, even though the MSER algorithm is highly robust. Non-text regions are removed by a rule-based approach: in our work, the geometric properties of text are exploited to filter out non-text regions using simple thresholds [51]. Alternatively, a machine learning approach can be used to train a text versus non-text classifier [34]. Classically, these two approaches are combined to produce better results [3].


ii. Remove non-text regions based on basic geometric properties

The geometric properties mainly used to distinguish between text and non-text regions [5], [2] are the following:
• Aspect ratio
• Eccentricity
• Euler number
• Extent
• Solidity

The regionprops function in MATLAB's Image Processing Toolbox is used to measure a few of the above-mentioned properties, and regions are then removed based on their property values.
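A sketch of this rule-based filtering is given below; the threshold values are illustrative assumptions that would need tuning, and mserConnComp is the connected-component output of detectMSERFeatures from the previous step.

% Measure geometric properties of each MSER region and drop regions
% whose geometry is unlike that of text characters.
mserStats = regionprops(mserConnComp, 'BoundingBox', 'Eccentricity', ...
    'Solidity', 'Extent', 'EulerNumber', 'Image');

bbox = vertcat(mserStats.BoundingBox);
aspectRatio = bbox(:,3) ./ bbox(:,4);                % width / height

filterIdx = aspectRatio' > 3;                        % very elongated regions
filterIdx = filterIdx | [mserStats.Eccentricity] > 0.995;
filterIdx = filterIdx | [mserStats.Solidity] < 0.3;
filterIdx = filterIdx | [mserStats.Extent] < 0.2 | [mserStats.Extent] > 0.9;
filterIdx = filterIdx | [mserStats.EulerNumber] < -4;

mserStats(filterIdx)   = [];
mserRegions(filterIdx) = [];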

iii. Remove non-text regions based on stroke width variation

Stroke width is one of the most common metrics used to differentiate between text and non-text. Stroke width is a measure of the width of the curves and lines that make up a character. Non-text regions have larger variations in their stroke width, while text regions tend to have less stroke width variation.

The stroke width of the detected MSER regions is estimated using a distance transform and a binary thinning operation [2]. It is noticed that the variation in the stroke width image is very small over most of a text region. The curves and lines making up the region all have similar widths, a typical characteristic of human readable text, which indicates that the region is a text region.

In order to use stroke width variation to remove non-text regions with a threshold, the variation over the entire region must be quantified into a single metric. This threshold value requires tuning for images with different font styles. A threshold can then be applied to remove the non-text regions.


Each detected MSER region undergoes the above-mentioned procedure separately. A for-loop processes all the regions, and the results after removing the non-text regions using stroke width variation are then displayed.
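A sketch of the stroke width check for a single region is shown below; the region index and the 0.4 threshold are illustrative assumptions, and in practice the same computation runs inside the for-loop over all remaining regions.

% Estimate the stroke width of one MSER region via distance transform + thinning.
regionImage = mserStats(1).Image;                    % binary mask of one region (index is arbitrary)
regionImage = padarray(regionImage, [1 1]);

distanceImage = bwdist(~regionImage);                % distance to nearest background pixel
skeletonImage = bwmorph(regionImage, 'thin', inf);   % binary thinning

strokeWidthValues    = distanceImage(skeletonImage);
strokeWidthVariation = std(strokeWidthValues) / mean(strokeWidthValues);

% Regions whose normalized variation exceeds the threshold are treated as non-text.
strokeWidthThreshold = 0.4;                          % assumed value, needs tuning per font
isNonText = strokeWidthVariation > strokeWidthThreshold;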

iv. Merge text regions for final detection result

At this point the detection results comprise discrete text characters. To use these results for recognition tasks, such as OCR, the discrete text characters are merged into strings or words and then text lines. This enables recognition of actual words in an image, which convey more meaningful information than the discrete characters: for example, recognizing the word 'PLATE' versus the set of individual characters {'P','L','A','T','E'}, where the meaning of the word is lost without the exact arrangement.

One approach for merging discrete text regions into strings or words and then text lines is to find neighboring text regions and form a bounding box around them. Expanding the bounding boxes computed previously with regionprops is used to find neighboring regions. This makes the bounding boxes of neighboring text regions overlap, such that text regions that are part of the same word or text line form a chain of overlapping bounding boxes.

Now the overlapping bounding boxes can be combined to form a single bounding box around each discrete string or word. To do this, the overlap ratio between all bounding box pairs is computed. Quantifying the distance between all pairs of text regions in this way makes it possible to find groups of neighboring text regions by looking for non-zero overlap ratios. After the pair-wise overlap ratios are computed, a graph is used to find all text regions connected by a non-zero overlap ratio.

The bboxOverlapRatio function in MATLAB's Computer Vision Toolbox is used to compute the pair-wise overlap ratios for all the expanded bounding boxes, and a graph is then used to find all the connected regions. The output of conncomp is the index of the connected text group to which each bounding box belongs. These indices are used to combine multiple neighboring bounding boxes into a single bounding box by computing the minimum and maximum of the individual bounding boxes that make up each connected component.
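The sketch below shows one way to carry out this merging; the 2% expansion factor is an assumption, and clamping the expanded boxes to the image borders is omitted for brevity.

% Expand character bounding boxes, group overlapping boxes, and merge
% each connected group into a single word/line bounding box.
bboxes = vertcat(mserStats.BoundingBox);
xmin = bboxes(:,1);                 ymin = bboxes(:,2);
xmax = xmin + bboxes(:,3) - 1;      ymax = ymin + bboxes(:,4) - 1;

expansionAmount = 0.02;                               % assumed expansion factor
xmin = (1 - expansionAmount) * xmin;   ymin = (1 - expansionAmount) * ymin;
xmax = (1 + expansionAmount) * xmax;   ymax = (1 + expansionAmount) * ymax;
expandedBBoxes = [xmin ymin xmax - xmin + 1 ymax - ymin + 1];

overlapRatio = bboxOverlapRatio(expandedBBoxes, expandedBBoxes);
n = size(overlapRatio, 1);
overlapRatio(1:n+1:n^2) = 0;                          % ignore self-overlap

g = graph(overlapRatio > 0);                          % boxes are nodes, overlaps are edges
componentIndices = conncomp(g);                       % group index for each box

xmin = accumarray(componentIndices', xmin, [], @min);
ymin = accumarray(componentIndices', ymin, [], @min);
xmax = accumarray(componentIndices', xmax, [], @max);
ymax = accumarray(componentIndices', ymax, [], @max);
textBBoxes = [xmin ymin xmax - xmin + 1 ymax - ymin + 1];

% Suppress groups containing a single region (likely false detections).
numRegionsInGroup = histcounts(componentIndices, 1:max(componentIndices) + 1);
textBBoxes(numRegionsInGroup == 1, :) = [];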


Finally, before showing the final detection result, false text detections are suppressed by removing bounding boxes made up of just one text region. This removes isolated regions that are unlikely to be actual text, given that text is usually found in groups such as words and sentences.

v. Recognize detected text using OCR

After the text regions are located, the ocr function is used to recognize the text within each bounding box. A noisy output is obtained from the ocr function if the text regions are not found first.
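Continuing the previous sketches, the step can be illustrated as follows; 'detected_text.txt' is a hypothetical output file name.

% Run OCR only inside the merged text bounding boxes and save the strings.
ocrResults = ocr(grayImage, textBBoxes);
recognizedText = strtrim({ocrResults.Text});

fid = fopen('detected_text.txt', 'w');
fprintf(fid, '%s\n', recognizedText{:});
fclose(fid);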

vi. Text-to-speech synthesizer

The detected texts are concatenated into strings or words and then text lines and saved in a text file. Finally, the speech synthesizer reads the text out to the user via a pair of earphones.
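One possible way to voice the saved strings, assuming the laptop runs Windows MATLAB, is the .NET System.Speech synthesizer sketched below; this is an assumption about the platform, not necessarily the synthesizer used in this work.

% Read the saved text file aloud through the earphones (Windows-only sketch).
NET.addAssembly('System.Speech');
synth = System.Speech.Synthesis.SpeechSynthesizer;
synth.Volume = 100;
Speak(synth, fileread('detected_text.txt'));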

Figure 2.1: Flow chart of the read module (capture image; detect text regions; remove non-text background; check stroke width variation; merge texts to recognize strings; optical character recognition)


2.1.2 Guidance Module
The algorithm for obstacle detection in the guidance module is as follows:

Step 1: Capturing the human obstacle in the path of the visually impaired user with the stereo camera.
Step 2: Resizing the color image and then converting it to a grayscale image for processing.
Step 3: Identifying the human body separately from the background and marking it as a rectangle.
Step 4: Treating the rectangle as a standing cylinder (the two-dimensional perpendicular view of a cylinder is a rectangle) and calculating the dimensions of the cylinder.
Step 5: Converting the calculated cylinder dimensions into the height, width and depth (distance from the user) of the obstacle.
Step 6: Using the speech synthesizer tool to inform the user of these important physical dimensions of the obstacle over earphones.
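As a rough illustration of steps 1 to 3 (summarized in figure 2.2 below), the following sketch uses vision.PeopleDetector to mark the human obstacle with a rectangle; the detector choice, the frame file name and the resize height are assumptions, since any person/background segmentation that yields a bounding rectangle would serve here.

% Sketch of capturing a frame and marking the human obstacle (steps 1-3).
peopleDetector = vision.PeopleDetector;        % assumed off-the-shelf detector

frame = imread('corridor_left.png');           % hypothetical left stereo frame
frame = imresize(frame, [480 NaN]);            % resize for processing
grayFrame = rgb2gray(frame);                   % convert to grayscale

personBBoxes = step(peopleDetector, grayFrame);     % [x y width height] per person

annotated = insertObjectAnnotation(frame, 'rectangle', personBBoxes, 'human obstacle');
figure; imshow(annotated); title('Human obstacle marked as rectangle');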

Figure 2.2: Flowchart of the guidance module (capture images using the stereo camera; resize the image; convert the image to grey scale; segment obstacles from the background; mark the region as a rectangle, treating the 2D view of a cylinder as a rectangle; 3D reconstruction of cylinders using stereo vision; obtain the metrics of the obstacle; speech synthesiser tool)

The steps involved in the algorithm are explained as follows. Each step is elaborated, and the results are discussed in Chapter 3.

Assume a left-handed world coordinate system with x, y and z axes as shown in figure 2.3. Let the imaginary planes Πxy, Πzx and Πyz be parallel to the z, y and x axes respectively. Perspective images of a standing featureless cylinder are captured by the stereo camera setup from two different viewpoints, as in figure 2.3. The camera model considered here is central projection. Corresponding points are found using the 360° rotational symmetry and two orthogonal planes passing through the axis of the cylinder, such that the shape remains the same even if the cylinder rotates about its axis (for a similar approach, see A. Miglani et al. [21]).

i. Measurement in the image formed by C

In figure 2.3, C and C' are the centres of the central projection cameras forming the images for stereo vision. The 3D position of a point is recovered by considering the reference planes Πxy, Πyz and Πzx and by applying triangulation, a geometrical method of assuming a point in the scene and identifying it correctly in each image. Relative affine structures (invariants) ρi and ρ'i are used for retrieving the 3D projective structure of the cylinder from the images taken by cameras C and C' respectively.

Figure 2.3: Stereo Vision Setup


Note that Nref is the reference point whose metrics are known. Ni denotes the points on the surface of the cylinder, and Zi and Z'i are the projective depths of each point from the camera centres C and C' respectively. di and d'i are the perpendicular distances of each point from the ground plane with respect to camera centres C and C' respectively, as discussed in [20]. The integer i varies from 1 to n, depending on the number of corresponding points, as shown in figure 2.3.

Assume a point N0 whose perpendicular distance from the ground plane is d0 and which is equidistant from both camera centres, so that its projective depth Z0 is the same in both views.

\rho_{xref} = \frac{d_{xref}}{Z_{ref}} \cdot \frac{Z_0}{d_0}    (2.1)

\rho_{yref} = \frac{d_{yref}}{Z_{ref}} \cdot \frac{Z_0}{d_0}    (2.2)

\rho_{zref} = \frac{d_{zref}}{Z_{ref}} \cdot \frac{Z_0}{d_0}    (2.3)

where ρxref is the coordinate of the relative affine structure in the x direction and ρyref, ρzref denote the same in the y and z directions respectively. dxref is the coordinate of the perpendicular distance in the x direction and similarly dyref, dzref denote the same in the y and z directions respectively (for a similar approach, see A. Miglani et al. [24]).

\rho_{xi} = \frac{d_{xi}}{Z_i} \cdot \frac{Z_0}{d_0}    (2.4)

\rho_{yi} = \frac{d_{yi}}{Z_i} \cdot \frac{Z_0}{d_0}    (2.5)

\rho_{zi} = \frac{d_{zi}}{Z_i} \cdot \frac{Z_0}{d_0}    (2.6)

Dividing equations (2.1), (2.2) and (2.3) by the respective equations (2.4), (2.5) and (2.6),

\frac{\rho_{xref}}{\rho_{xi}} = \frac{d_{xref}}{d_{xi}} \cdot \frac{Z_i}{Z_{ref}}    (2.7)

The ratio of relative affine structures in a single image can thus be formulated, and the resulting equations are:

d_{xi} = d_{xref} \cdot \frac{\rho_{xi}}{\rho_{xref}} \cdot \frac{Z_i}{Z_{ref}}    (2.8)

d_{yi} = d_{yref} \cdot \frac{\rho_{yi}}{\rho_{yref}} \cdot \frac{Z_i}{Z_{ref}}    (2.9)

d_{zi} = d_{zref} \cdot \frac{\rho_{zi}}{\rho_{zref}} \cdot \frac{Z_i}{Z_{ref}}    (2.10)

ii. Measurement in the image formed by C'

For the image formed by camera centre C', a similar formulation is derived. Again, the invariant ρ'i is used for retrieving the 3D projective structure of the cylinder from the image taken by the camera with centre C'.

d'_{xi} = d_{xref} \cdot \frac{\rho'_{xi}}{\rho'_{xref}} \cdot \frac{Z'_i}{Z_{ref}}    (2.11)

d'_{yi} = d_{yref} \cdot \frac{\rho'_{yi}}{\rho'_{yref}} \cdot \frac{Z'_i}{Z_{ref}}    (2.12)

d'_{zi} = d_{zref} \cdot \frac{\rho'_{zi}}{\rho'_{zref}} \cdot \frac{Z'_i}{Z_{ref}}    (2.13)

Note that Nref, with known metrics, is assumed to be the same reference point for this image, and N0 is equidistant from both camera centres, so its projective depth is the same.


iii. Stereo Vision for Reconstruction

The ratios from equations (2.8) to (2.13) are used for obtaining the ratio of relative affine structures between the two images of the stereo pair. Dividing equation (2.11) by (2.8),

\frac{d'_{xi}}{d_{xi}} = \frac{\rho'_{xi}}{\rho_{xi}} \cdot \frac{Z'_i}{Z_i}    (2.14)

To compute the coordinate of an arbitrary point Ni from the relative affine structure and projective depth,

d'_{xi} = d_{xi} \cdot \frac{\rho'_{xi}}{\rho_{xi}} \cdot \frac{Z'_i}{Z_i}    (2.15)

With the reference measurements along the x, y and z axes, the required affine transformation measurements can be computed up to the respective scale,

d'_{xi} = d_{xi} \cdot S_{xi}    (2.16)

d'_{yi} = d_{yi} \cdot S_{yi}    (2.17)

d'_{zi} = d_{zi} \cdot S_{zi}    (2.18)

where Si is a constant given by the product of the ratio of relative affine structures of the two images and the ratio of projective depths of the same point from camera centres C and C':

S_{xi} = \frac{\rho'_{xi}}{\rho_{xi}} \cdot \frac{Z'_i}{Z_i}    (2.19)

S_{yi} = \frac{\rho'_{yi}}{\rho_{yi}} \cdot \frac{Z'_i}{Z_i}    (2.20)

S_{zi} = \frac{\rho'_{zi}}{\rho_{zi}} \cdot \frac{Z'_i}{Z_i}    (2.21)

Thus, it is clear from equation (2.15) that the relative affine structure depends on the projective depth of the points, and it is calculated without vanishing point assumptions in order to minimize error.
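A small worked sketch of equations (2.15)-(2.21) is given below; all numerical values are hypothetical and serve only to show how the scale factor and the reconstructed coordinate follow from the relative affine structures and projective depths.

% Worked example of the stereo scale factor and coordinate reconstruction.
rho_x   = 0.82;   rho_x_p = 0.78;   % relative affine structure in image 1 and image 2 (assumed)
Z_i     = 2.40;   Z_i_p   = 2.55;   % projective depths of the point from C and C', in metres (assumed)
d_x     = 0.35;                     % x-coordinate measured via image 1, in metres (assumed)

S_x   = (rho_x_p / rho_x) * (Z_i_p / Z_i);   % equation (2.19)
d_x_p = d_x * S_x;                           % equations (2.15) / (2.16)
fprintf('S_x = %.3f, d''_x = %.3f m\n', S_x, d_x_p);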

iv. Transformation Matrices

Mapping between the two images is made by transformation matrices (rotation and translation).

Rotation matrix: if the same cylindrical pellet (object) is imaged by cameras C and C' related by pure rotation, then the rotation matrix is given by

\begin{bmatrix} S_{xi} & 0 & 0 & 0 \\ 0 & S_{yi} & 0 & 0 \\ 0 & 0 & S_{zi} & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix}    (2.22)

Translation matrix: if the same cylindrical pellet (object) is imaged by cameras C and C' related by pure translation, then the translation matrix is given by

\begin{bmatrix} 1 & 0 & 0 & T_x \\ 0 & 1 & 0 & T_y \\ 0 & 0 & 1 & T_z \\ 0 & 0 & 0 & 1 \end{bmatrix}    (2.23)

v. Depth Estimation

A cylinder can be considered as a stack of circles of equal radius arranged such that the centres of all the circles lie on the axis of the cylinder. Here, depth estimation is carried out and analyzed assuming a single viewpoint. The same method can be implemented for any number of cameras (viewpoints), though this work is restricted to two cameras because of stereo vision.

Jacobian matrix of mapping

Consider only a single circle, as shown in figure 2.4; the tangential mapping of the circle containing a point Ni to the image is given by equation (2.24) [19]. Here, a is the vector that denotes the 360° rotational symmetry of the cylinder.


T_N\pi(N_i, a) = (\pi(N_i),\ T_N\pi(a))    (2.24)

Figure 2.4: Tangential Mapping of points from Cylinder

The Jacobian matrix of the mapping T_N\pi is given by

T_N\pi = \frac{1}{|N_i|}\left[ I - \pi(N_i)\,\pi(N_i)^T \right]_{3\times 3}    (2.25)

The normalized Jacobian matrix of the mapping T'_N\pi is defined as

T'_N\pi : TP^3 \rightarrow T'S^2    (2.26)

where T'S^2 denotes the unit tangent bundle.

T'_N\pi : (N_i, a) \rightarrow \left( \pi(N_i),\ \frac{T_N\pi \cdot a}{|T_N\pi \cdot a|} \right)    (2.27)


T'_N\pi : (N_i, a) \rightarrow (\pi(N_i),\ \hat{a})    (2.28)

As in [19], here \hat{a} gives an extra degree of freedom (DOF) because of the 360° rotational symmetry of the cylinder. If there are two distinct points Ni and N'i on the circle and a vector a, then they can be modeled through T'_N\pi:

\hat{a} = \frac{T_N\pi \cdot a}{|T_N\pi \cdot a|} = \frac{T_N\pi \cdot (\pi(N_i) - \pi(N'_i))}{|T_N\pi \cdot (\pi(N_i) - \pi(N'_i))|}    (2.29)

Measurement without Vanishing Point

The proposed method measures the depth of points without vanishing points. As shown in figure 2.4, only one circle from the cylinder, mapped tangentially, is taken and the distance from the camera centre is measured as follows. Let the radius of the circle in the real world be r_b, the radius in the image be r_c, and the distance between the centres O_b and O_c be K. If a circle is imaged perpendicularly, it appears as a single line in the image whose length equals the scaled diameter of the circle, and the circles are then formed with that line as diameter (D).

$R = (r_b - r_c)$    (2.30)

As shown in figure 2.4, the line segment N_icB is equal in length and parallel to the line segment O_cO_b, whose length is the known value K. N_i is mapped to the point N_ic in the image. Using the Pythagorean theorem in ΔN_icBN_i,

$Z_n^2 = K^2 + (r_b - r_c)^2$    (2.31)

$Z_n = \sqrt{K^2 + R^2}$    (2.32)
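A minimal numeric sketch of equations (2.30) to (2.32), together with the projective depth of equation (2.33) given just below, is shown here. The radii, the distance K and the offset Z_l are assumed example values, not measurements from the experiments.

```python
import math

r_b = 0.165   # hypothetical real-world radius of the circle (metres)
r_c = 0.030   # hypothetical radius as it appears in the image (scaled units)
K   = 1.800   # hypothetical distance between the centres O_b and O_c (metres)
Z_l = 0.450   # hypothetical remaining length of the ray (metres)

R   = r_b - r_c                  # equation (2.30)
Z_n = math.sqrt(K**2 + R**2)     # equations (2.31)-(2.32)
Z   = Z_n + Z_l                  # equation (2.33): projective depth along the ray
print(f"Z_n = {Z_n:.3f} m, Z = {Z:.3f} m")
```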


Projective depth is the length of the ray passing through the camera centre C and N_i. Hence,

$Z = Z_n + Z_l$    (2.33)

As shown in figure 2.4, the body frame is aligned such that the origin is on the axis of the cylinder and the point N_i lies along the negative z-axis, so that the 3D coordinate of the point N_i is [0 0 -h]^T.

$N_i = -h\,z_3$    (2.34)

$v_1 = \pi(N_i)$    (2.35)

N_ic is the point in the image; it denotes the location of the body point N_i in the camera frame:

$N_{ic} = R\,N_i + d$    (2.36)

where R is the rotational matrix, and the tangential mapping is

$T_N\pi : (N_i, a) \rightarrow (\pi(N_i),\; T_N\pi(a))$    (2.37)

$T'_N\pi : (N_i, a) \rightarrow (v_1, v_2)$    (2.38)

where v_1 and v_2 are mutually orthogonal unit vectors [19].

$v_1 = \frac{N_i \cdot C}{|N_i \cdot C|}$    (2.39)


$v_2 = \frac{(I - v_1 v_1^T)\,\hat{a}_c}{|(I - v_1 v_1^T)\,\hat{a}_c|}$    (2.40)

vi. Error Analysis

Sensitivity of distance error

Taking the first order derivative of equation (2.15) with respect to projective depth, the sensitivity of error can be obtained. Ei is calculated as follows,

$E_i = \frac{d}{dZ}\left(d_{xi} \cdot \frac{\rho'_{xi}}{\rho_{xi}} \cdot \frac{Z'_i}{Z_i}\right)$    (2.41)

$E_i = X \cdot \frac{d}{dZ}\left(\frac{Z'_i}{Z_i}\right)$    (2.42)

$E_i = X\left(\frac{Z'_i\,\delta_i - Z_i\,\delta'_i}{Z_i^2}\right)$    (2.43)

where

$X = d_{xi} \cdot \frac{\rho'_{xi}}{\rho_{xi}}$    (2.44)


$\delta_i = \frac{dZ_i}{dZ}$    (2.45)

$\delta'_i = \frac{dZ'_i}{dZ}$    (2.46)

The equation shows that the error reduces as the depth gets larger, since the sensitivity of the error is inversely proportional to $Z_i^2$.

Maximum likelihood estimation and uncertainties

Here, uncertainty is described by the covariance matrix [22] of the end points of the three reference distances. Assuming an error-free transformation matrix, the uncertainty in selecting the image points that correspond to N_i and N'_i is modelled by a covariance matrix.

The MLE of the end points in the image [27] is given by

$\min_{\hat{N}_i,\,\hat{N}'_i}\left[(N_i - \hat{N}_i)^T \Lambda_{N_i}^{-1}(N_i - \hat{N}_i) + (N'_i - \hat{N}'_i)^T \Lambda_{N'_i}^{-1}(N'_i - \hat{N}'_i)\right]$    (2.47)

where $\Lambda_{N_i}$ and $\Lambda_{N'_i}$ are uncertainty ellipses. These ellipses are defined manually and indicate a confidence region for localizing the points, as discussed in [27].
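A minimal sketch of the cost in equation (2.47) is given below; it evaluates the two Mahalanobis terms for candidate estimates of the end points, with the covariance (uncertainty-ellipse) matrices chosen by hand as the text describes. The specific points and covariances are illustrative assumptions.

```python
import numpy as np

def mle_cost(N, N_p, N_hat, N_p_hat, Lam, Lam_p):
    """Cost of equation (2.47): sum of the two Mahalanobis distances."""
    r, r_p = N - N_hat, N_p - N_p_hat
    return r @ np.linalg.inv(Lam) @ r + r_p @ np.linalg.inv(Lam_p) @ r_p

# Hypothetical end points selected in the image (pixels) and manually chosen
# uncertainty ellipses (covariances), as described in the text.
N     = np.array([312.0, 145.0])
N_p   = np.array([318.0, 402.0])
Lam   = np.diag([4.0, 9.0])      # larger vertical uncertainty for the first point
Lam_p = np.diag([9.0, 4.0])

# Two candidate estimates; the one closer in the Mahalanobis sense costs less.
print(mle_cost(N, N_p, np.array([313.0, 146.0]), np.array([317.0, 401.0]), Lam, Lam_p))
print(mle_cost(N, N_p, np.array([320.0, 150.0]), np.array([310.0, 410.0]), Lam, Lam_p))
```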


Figure 2.5: Flowchart of the 3D reconstruction technique (Measurement in Image 1 → Measurement in Image 2 → Relative affine structure → Jacobian matrix of mapping → Depth estimation by Jacobian tangent bundle → Transformation and rotational matrices → Sensitivity of error estimation → Maximum likelihood estimation and uncertainties → Metrics in the real world)


2.2 Experimental Setup

The experimental setup of the proposed system is simple, cheap and wearable. The system consists of a stereo camera (Creative Senz3D) and stereo earphones connected to a portable laptop computer (Intel Core i3-4030U CPU @ 1.90 GHz, 4 GB RAM).

This makes the system very simple compared with the systems available in the literature or sold commercially. Its portability makes it convenient for the user to carry in a bag on the shoulders.

Figure 2.6: Experimental Setup

Figure 2.7: Creative Senz3D stereo camera


CHAPTER 3 RESULTS AND DISCUSSION


3.1 Results and Discussion

The techniques proposed in the two modules of the system were experimentally tested, and the results of both the text recognition algorithm and the 3D reconstruction show that the proposed prototype is superior to the existing models. Text localization is evaluated with the ICDAR 2013 measures of recall, precision and f-measure, while the efficiency of the reconstruction technique is measured by the reduction in the estimated error percentage. The results of the read module and the guidance module are discussed in sections 3.1.1 and 3.1.2 respectively.

3.1.1 Text Recognition (Read Module) Results

Non-natural scene texts

The words occupy a substantial part of the image without perspective distortion, they are written horizontally, and there is no substantial noise. The algorithm is simpler here than in the natural scene text detection procedure, and it also performs very well compared with other works in the literature.

The results are given in the same order as the steps involved in the algorithm.

Figure 3.1: Actual Input Image


Figure 3.2: After recognition and detection of text regions

Figure 3.3: Text saved as meaningful strings in file

Figure 3.4: Text-to-speech engine

Natural Scene text detection

A fully natural scene is captured by the camera and the image is processed by the MSER-based algorithm described in the previous section; the results are organized in the same order as the steps of the algorithm.

a) Detect candidate text regions using MSER

Figure 3.5: MSER Regions


b) Remove non-text regions based on basic geometric properties

Figure 3.6: After removing non-text region

Figure 3.7: Stroke width variation based approach

c) Merge text regions for final detection result

Figure 3.8: Expanded bounding boxes text

Figure 3.9: Detected text

d) Text-to-speech synthesizer

Note that in the image the letter “E” in the word “EXPENSE” has been deformed by the bolt. Thus, when text is detected but a character is blurred or deformed beyond recovery, the system returns a “0 or null character” at that point.

Figure 3.10: Detected natural scene text saved as meaningful strings in file
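The thesis implementation itself is not listed here; the sketch below is a minimal Python/OpenCV approximation of the read-module pipeline described above (MSER region detection, simple geometric filtering of non-text regions, OCR of the merged region, and speech output). The geometric thresholds, the use of pytesseract for OCR and pyttsx3 for speech are assumptions made for illustration, and the stroke-width filtering step is omitted for brevity.

```python
import cv2
import numpy as np
import pytesseract   # assumed OCR back end
import pyttsx3       # assumed text-to-speech back end

def read_module(image_path):
    gray = cv2.cvtColor(cv2.imread(image_path), cv2.COLOR_BGR2GRAY)

    # a) Detect candidate text regions using MSER.
    mser = cv2.MSER_create()
    regions, _ = mser.detectRegions(gray)

    # b) Remove non-text regions using basic geometric properties
    #    (aspect-ratio and extent thresholds are illustrative values).
    boxes = []
    for pts in regions:
        x, y, w, h = cv2.boundingRect(pts.reshape(-1, 1, 2))
        aspect, extent = w / float(h), len(pts) / float(w * h)
        if 0.1 < aspect < 10 and extent > 0.2:
            boxes.append((x, y, x + w, y + h))
    if not boxes:
        return ""

    # c) Merge the surviving regions into one expanded bounding box.
    b = np.array(boxes)
    x1, y1, x2, y2 = b[:, 0].min(), b[:, 1].min(), b[:, 2].max(), b[:, 3].max()

    # d) OCR the merged region and read the string aloud.
    text = pytesseract.image_to_string(gray[y1:y2, x1:x2]).strip()
    if text:
        engine = pyttsx3.init()
        engine.say(text)
        engine.runAndWait()
    return text

print(read_module("signboard.jpg"))   # hypothetical input image
```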

Test results of ICDAR 2013

The ICDAR 2013 [4], [41], [52] Robust Reading competition dataset consists of a total of 1189 words and 6393 letters in 255 images. Using the ICDAR 2013 competition evaluation scheme, our method achieves a recall of 60.01%, a precision of 73.10% and an f-measure of 66.30% in text localization.

In the statistical analysis of binary classification [60], the f-measure is a measure of a test's accuracy. It considers both the precision p and the recall r of the test to compute the score: p is the number of correct positive results divided by the number of all positive results, and r is the number of correct positive results divided by the number of positive results that should have been returned.

The f-measure is the harmonic mean of precision and recall; the F1 score reaches its best value at 1 and its worst at 0.

$f = 2 \cdot \frac{p \cdot r}{p + r}$    (3.1)
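For reference, the sketch below computes precision, recall and the f-measure of equation (3.1) from assumed detection counts; the counts are made-up examples, not the ICDAR 2013 results.

```python
# Hypothetical counts from a text-localization run.
true_positive  = 80    # correctly detected words
false_positive = 20    # detections that are not words
false_negative = 40    # ground-truth words that were missed

p = true_positive / (true_positive + false_positive)   # precision
r = true_positive / (true_positive + false_negative)   # recall
f = 2 * p * r / (p + r)                                # equation (3.1)
print(f"precision = {p:.4f}, recall = {r:.4f}, f-measure = {f:.4f}")
```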

Based on the test results, our proposed method proves to be better and more efficient. The comparison is given in the table below.

Method                       | Precision | Recall | F-measure
Proposed algorithm           | 73.10     | 60.01  | 66.30
Epshtein et al. [42]         | 73.10     | 59.95  | 66.25
Neumann & Matas [44], [45]   | 73.03     | 64.70  | 68.70
Yao & Lu [2]                 | 59.01     | 61.97  | 61.05
Chen et al. [4], [39], [40]  | 73.02     | 60.09  | 66.10
Gonzalez et al. [5]          | 72.67     | 56.00  | 63.39
Kim’s method [36]            | 82.96     | 62.49  | 71.28
Yi’s method                  | 67.26     | 58.06  | 62.44
TH - Textloc system          | 66.97     | 57.51  | 61.93
TDM – IACS                   | 63.52     | 53.75  | 58.02
LIP6 – Retin                 | 62.97     | 50.07  | 55.78
KAIST – AIPR system          | 59.17     | 44.67  | 51.07
ECNU – CCG method            | 35.07     | 38.45  | 36.69
Text Hunter                  | 50.08     | 25.96  | 34.29
Minetto [53]                 | 60.89     | 62.95  | 61.55
Fabrizio [54]                | 38.89     | 49.34  | 43.35
Wolf                         | 44.00     | 30.00  | 35.00
Todoran                      | 18.38     | 19.45  | 18.45

Table 4.1: Evaluation of text detection algorithm


3.1.2 3D Reconstruction (Guidance Module) Results

Human detection

Figure 3.11: Standing Pose

Figure 3.12: Walking Pose
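The exact person detector used to produce Figures 3.11 and 3.12 is not listed in this chapter; as one plausible illustration, the sketch below uses OpenCV's pre-trained HOG + linear-SVM pedestrian detector and treats the resulting rectangle as the two-dimensional view of a standing cylinder, as the guidance module assumes. The detector choice, file names and parameters are illustrative assumptions.

```python
import cv2

# HOG descriptor with OpenCV's pre-trained people detector.
hog = cv2.HOGDescriptor()
hog.setSVMDetector(cv2.HOGDescriptor_getDefaultPeopleDetector())

frame = cv2.imread("scene_left.jpg")                  # hypothetical stereo frame
rects, weights = hog.detectMultiScale(frame, winStride=(8, 8), scale=1.05)

for (x, y, w, h) in rects:
    # The detected rectangle is taken as the 2D projection of a standing
    # cylinder: h maps to the cylinder height, w to its projected diameter.
    cv2.rectangle(frame, (x, y), (x + w, y + h), (0, 255, 0), 2)
    print(f"candidate cylinder: height_px={h}, diameter_px={w}")

cv2.imwrite("detections.jpg", frame)
```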

The rectangular detected region in the image is considered to be the two-dimensional view of a standing cylinder. The 3D reconstruction technique is then carried out and the results are tabulated below.

Orientation   | Type    | Scene (cm) | Image (cm) | Error (cm) | Error percentage
Standing pose | Height  | 171        | 137.82     | 33.18      | 19.40 %
Standing pose | Breadth | 33         | 35.64      | -2.64      | 8.00 %
Standing pose | Width   | 17         | 17.68      | -0.68      | 4.00 %
Walking pose  | Height  | 171        | 140.22     | 30.78      | 18.00 %
Walking pose  | Breadth | 33         | 30.459     | 2.541      | 7.70 %
Walking pose  | Width   | 17         | 16.3455    | 0.6545     | 3.85 %
Standing pose | Height  | 171        | 142.70     | 28.30      | 16.55 %
Standing pose | Breadth | 33         | 34.8744    | -1.8744    | 5.68 %
Standing pose | Width   | 17         | 17.4828    | -0.4828    | 2.84 %

Table 4.2: Results of metric estimation in real world

CHAPTER 4 CONCLUSIONS & FUTURE SCOPE


4.1 Conclusions

The proposed wearable dual-module electronic travelling aid was developed cost-efficiently; it helps the visually impaired to avoid human obstacles in their path and also to read texts. Instead of using several sensors, a simple stereo camera setup was chosen and the surroundings were three-dimensionally reconstructed from the images. This reduced the computational difficulties and made the prototype less bulky, so that it is light-weight and easy for the user to carry on the shoulders. The algorithm for 3D reconstruction of cylinders to find real-world metrics was successfully merged with the text detection algorithm. In the Read Module, a novel edge-enhanced MSER text detection algorithm is proposed to overcome the sensitivity of MSER to image blur and to detect even very small letters by exploiting the additional properties of MSER and Canny edges. Our method employed MSERs as basic letter candidates and demonstrated cutting-edge performance for text localization even in natural images. In the Guidance Module, our work on using one invariant to analyze projective, affine and Euclidean space for vision tasks was successful. The 3D measurement was computed from stereo vision using relative affine structure and rotational symmetry, without vanishing points, which reduced the error to a good extent. The transformation is represented by relative affine structures along the three orthogonal directions; therefore, the camera transformation for a cylindrical object can also be expressed in terms of relative affine structures. A voice synthesizer reads out the strings or sentences through earphones, enabling the user to recognize the text. Our method achieves a recall of 60.01%, a precision of 73.10% and an f-measure of 66.30% in text localization, and the error percentage in the 3D reconstruction is reduced to as low as 2.84%, which shows that the efficiency of the system is excellent. Robustness and performance are good because the number of exact texts detected by the OCR was high. The tests were performed in a controlled indoor environment, and the results show that the performance of both modules is superior to other techniques and indicates that real-time outdoor tests can be carried out in the future.


4.2 Future Work

As noted in the conclusions, the tests were performed in a controlled indoor environment; the results indicate that real-time outdoor tests can be carried out in the future. Our proposed guidance-module framework can be extended to vision applications such as motion analysis involving multiple views, motion segmentation and tracking, and structure from motion. Machine learning can also be added to the system to avoid the null-character output when deformed letters are encountered; furthermore, our read-module framework can be efficiently combined with visual search systems by sharing MSER as the interest region.


References: [1] D.Dakopoulos, and G.Bourbakis, “Wearable obstacle avoidance electronic travel aids for blind: A survey”, IEEE transactions on systems, man, and cybernetics - part c: applications and reviews, vol. 40, 2010. [2] Y.Li, and H.Lu, “Scene text detection via stroke width”, 21st International Conference on Pattern Recognition (ICPR 2012), 2012. [3] L.Neumann, and J.Matas, “Real-Time scene text localization and recognition”, 978-14673-1228-8/12, 2012. [4] H.Chen, S.S.Tsai, G.Schroth, D.M.Chen, R.Grzeszczuk, and B.Girod, “Robust text detection in natural images with edge-enhanced maximally stable extremal regions”, 18th IEEE conference on Image Processing, 2011. [5] A.Gonzalez, L.M.Bergasa, J.J.Yebes, and S.Bronte, “Text location in complex images”, 21st International Conference on Pattern Recognition (ICPR 2012), 2012. [6] T. Ifukube, T. Sasaki, and C. Peng, “A blind mobility aid modeled after echolocation of bats,” IEEE Trans. Biomed. Eng., vol. 38, no. 5, pp. 461– 465, 1991. [7] S. Shoval, J. Borenstein, and Y. Koren, “Mobile robot obstacle avoidance in a computerized travel aid for the blind,” in Proc. IEEE Robot. Autom. Conf., San Diego, CA, pp. 2023–2029, 2009. [8] P. B. L. Meijer, “An experimental system for auditory image representations.” IEEE Trans. Biomed. Eng. [Online]. pp. 112–121, 1992. [9] A. Hub, J. Diepstraten, and T. Ertl, “Design and development of an indoor navigation and object identification system for the blind,” in Proc. ACM SIGACCESS Accessibility Computing, no. 77–78, pp. 147–152, 2004. [10] D. Aguerrevere, M. Choudhury, and A. Barreto, “Portable 3D sound / sonar navigation system for blind individuals,” presented at the 2nd LACCEI Int. Latin Amer. Caribbean Conf. Eng. Technol. Miami, FL, 2004.


[11] G.Sainarayanan, R. Nagarajan, and S. Yaacob, “Fuzzy image processing scheme for autonomous navigation of human blind,” Appl. Softw. Comput., vol. 7, no. 1, pp. 257–264, 2007. [12] J. L. Gonzalez-Mora, A. Rodriguez-Hernandez, L. F. Rodriguez-Ramos, L. Diaz Saco, and N. Sosa “Development of a new space perception system for blind people, based on the creation of a virtual acoustic space”, 2008. [13] R. Audette, J. Balthazaar, C. Dunk, and J. Zelek, “A stereo-vision system for the visually impaired,” Sch. Eng., Univ. Guelph, Guelph, ON, Canada, Tech. Rep. 2000-41x-1, 2000. [14] N.G.Bourbakis, and D.Kavraki, “Intelligent assistants for handicapped people’s independence: case study”, 0-8186-7726-7/96, 1996. [15] D.Dakopoulos, S.K.Boddhu, and N.Bourbakis, “A 2D vibration array as an assistive device for visually impaired”, 1-4244-1509-8/07, 2007. [16] Meers.S, and Ward.K, “A substitute vision system for providing 3D perception and GPS navigation via electro-tactile stimulation”, Proceedings of the International Conference on Sensing Technology, Palmerston North, New Zealand, 2005. [17] Meers.S, and Ward.K, “A Vision System for Providing 3D Perception of the Environment via Transcutaneous Electro-Neural Stimulation”, Proceedings of the Eighth International Conference on Information Visualisation, 2004. [18] R.I. Hartley and A. Zisserman, “Multiple View Geometry in Computer Vision,” Cambridge University Press, ISBN: 0521540518, 2004. [19] Cowan, N.J.; Dong Eui Chang, “Geometric visual servoing,” Robotics, IEEE Transactions on, vol.21, no.6, pp.1128,1138, 2005. [20] A. Shashua and N. Navab, “Relative affine structure: canonical model for 3d from 2d geometry and applications,” in Computer Vision and Pattern Recognition, IEEE Computer Society Conference on, pp. 483-489, 1994. [21] Miglani, A.; Roy, S.D.; Chaudhury, S.; Srivastava, J.B., “Symmetry based 3D reconstruction of repeated cylinders,” Computer Vision, Pattern Recognition, Image


Processing and Graphics (NCVPRIPG), 2013 Fourth National Conference on, vol., no., pp.1,4, 18-21, 2013.
[22] A. Criminisi, I. Reid, and A. Zisserman, “Single view metrology,” Int. J. Comput. Vision, vol. 40, no. 2, pp. 123-148, 2000.
[23] Joseph L. Mundy, “Object Recognition in the Geometric Era: a Retrospective,” Division of Engineering, Brown University, Providence, Rhode Island. Email: [email protected].
[24] Miglani, A.; Roy, S.D.; Chaudhury, S.; Srivastava, J.B., “Complete visual metrology using relative affine structure,” Computer Vision, Pattern Recognition, Image Processing and Graphics (NCVPRIPG), 2013 Fourth National Conference on, vol., no., pp.1,4, 2013.
[25] A. Criminisi, “Accurate visual metrology from single and multiple uncalibrated images,” Ph.D. dissertation, University of Oxford, Dept. Engineering Science, D.Phil. Thesis, 1999.
[26] Sharma A, Chaudhary S, Roy S D, Chand P, “Three dimensional reconstruction of cylindrical pellet using stereo vision,” 5th National Conference on Computer Vision Pattern Recognition Image Processing and Graphics (NCVPRIPG 2015), 2015.
[27] Liebowitz, D. and Zisserman, A., “Metric rectification for perspective images of planes,” in Proceedings of the Conference on Computer Vision and Pattern Recognition, pp. 482-488, 1998.
[28] D. Chen, S. S. Tsai, C. H. Hsu, K. Kim, J. P. Singh, and B. Girod, “Building book inventories using smartphones,” in Proc. ACM Multimedia, 2010.
[29] G. Takacs, Y. Xiong, R. Grzeszczuk, V. Chandrasekhar, W. Chen, L. Pulli, N. Gelfand, T. Bismpigiannis, and B. Girod, “Outdoors augmented reality on mobile phone using loxel-based visual feature organization,” in Proc. ACM Multimedia Information Retrieval, pp. 427–434, 2008.
[30] D. G. Lowe, “Distinctive image features from scale-invariant keypoints,” International Journal of Computer Vision, vol. 60, pp. 91–110, 2004.
[31] H. Bay, A. Ess, T. Tuytelaars, and L. Van Gool, “Speeded-up robust features (SURF),” Computer Vision and Image Understanding, vol. 110, no. 3, pp. 346–359, 2008.

[32] V. Chandrasekhar, G. Takacs, D. Chen, S. Tsai, R. Grzeszczuk, and B. Girod, “CHoG: Compressed histogram of gradients: a low bit-rate feature descriptor,” in CVPR, pp. 2504–2511, 2009.
[33] D. Nistér and H. Stewénius, “Scalable recognition with a vocabulary tree,” in CVPR, pp. 2161–2168, 2006.
[34] D. M. Chen, S. S. Tsai, V. Chandrasekhar, G. Takacs, R. Vedantham, R. Grzeszczuk, and B. Girod, “Inverted Index Compression for Scalable Image Matching,” in Proc. of IEEE Data Compression Conference (DCC), Snowbird, Utah, 2010.
[35] J. Liang, D. Doermann, and H. P. Li, “Camera-based analysis of text and documents: a survey,” IJDAR, vol. 7, no. 2-3, pp. 84–104, 2005.
[36] K. Jung, K. I. Kim, and A. K. Jain, “Text information extraction in images and video: a survey,” Pattern Recognition, vol. 37, no. 5, pp. 977–997, 2004.
[37] Y. Zhong, H. Zhang, and A. K. Jain, “Automatic caption localization in compressed video,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 22, no. 4, pp. 385–392, 2000.
[38] Q. Ye, Q. Huang, W. Gao, and D. Zhao, “Fast and robust text detection in images and video frames,” Image Vision Comput., vol. 23, pp. 565–576, 2005.
[39] X. Chen and A. L. Yuille, “Detecting and reading text in natural scenes,” in CVPR, vol. 2, pp. II-366 – II-373, 2004.
[40] X. Chen and A. L. Yuille, “A time-efficient cascade for real-time object detection: With applications for the visually impaired,” in CVPR - Workshops, p. 28, 2005.
[41] S. M. Lucas, “ICDAR 2005 text locating competition results,” in ICDAR, vol. 1, pp. 80–84, 2005.
[42] B. Epshtein, E. Ofek, and Y. Wexler, “Detecting text in natural scenes with stroke width transform,” in CVPR, pp. 2963–2970, 2010.


[43] P. Shivakumara, T. Q. Phan, and C. L. Tan, “A Laplacian approach to multi-oriented text detection in video,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 33, no. 2, pp. 412–419, 2011.
[44] J. Matas, O. Chum, M. Urban, and T. Pajdla, “Robust wide baseline stereo from maximally stable extremal regions,” in British Machine Vision Conference, vol. 1, pp. 384–393, 2002.
[45] K. Mikolajczyk, T. Tuytelaars, C. Schmid, A. Zisserman, J. Matas, F. Schaffalitzky, T. Kadir, and L. Van Gool, “A comparison of affine region detectors,” Int. J. Comput. Vision, vol. 65, pp. 43–72, 2005.
[46] D. Nistér and H. Stewénius, “Linear time maximally stable extremal regions,” in ECCV, pp. 183–196, 2008.
[47] D. G. Bailey, “An efficient Euclidean distance transform,” in Combinatorial Image Analysis, IWCIA, pp. 394–408, 2004.
[48] J. Canny, “A computational approach to edge detection,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 8, pp. 679–698, 1986.
[49] A. Srivastav and J. Kumar, “Text detection in scene images using stroke width and nearest-neighbor constraints,” in TENCON 2008 - 2008 IEEE Region 10 Conference, pp. 1–5, 2008.
[50] K. Subramanian, P. Natarajan, M. Decerbo, and D. Castanon, “Character-stroke detection for text-localization and extraction,” in ICDAR, vol. 1, pp. 33–37, 2007.
[51] N. Otsu, “A threshold selection method from gray-level histograms,” IEEE Trans. Syst. Man Cybern., vol. 9, no. 1, pp. 62–66, 1979.
[52] S. M. Lucas, A. Panaretos, L. Sosa, A. Tang, S. Wong, and R. Young, “ICDAR 2003 robust reading competitions,” in ICDAR, vol. 2, p. 682, 2003.
[53] R. Minetto, N. Thome, M. Cord, J. Fabrizio, and B. Marcotegui, “Snoopertext: A multiresolution system for text detection in complex visual scenes,” in ICIP, pp. 3861–3864, 2010.

[54] J. Fabrizio, M. Cord, and B. Marcotegui, “Text extraction from street level images,” in CMRT, pp. 199–204, 2009. [55] S. S. Tsai, H. Chen, D. M. Chen, G. Schroth, R. Grzeszczuk, and B. Girod, “Mobile visual search on papers using text and low bit-rate features,” in ICIP, 2011. [56] National Federation of the Blind. (2009) [Online]. Available: http://www.nfb.org/ [57] American Foundation for the Blind. (2009) [Online]. Available: http://www.afb.org/ [58] International Agency for Prevention of Blindness. (2009). [Online]. Available: http://www.iapb.org/ [59] World Health Organization Factsheet: Visually impairment and Blindness, updated on (2014, August). Available: http://www.who.int/mediacentre/factsheets/fs282/en/ [60] https://en.wikipedia.org/wiki/Precision_and_recall


List of Publications [1] Sharma A, Chaudhary S, Roy S D, Chand P, “Three dimensional reconstruction of cylindrical pellet using stereo vision,” 5th National Conference on Computer Vision Pattern Recognition Image Processing and Graphics (NCVPRIPG 2015), 2015. [2] Sharma A, Chand P, “Computer Vision guided Navigation System for Visually Impaired,” Recent Advances in Analytical Science (RAAS), IIT BHU, 2016. [3] Sharma A, Chand P, “Assistance system for Visually Impaired through Stereo Vision,” Biomaterials, Biodiagnostics, Tissue Engineering, Drug delivery and Regenerative Medicine (BiTREM), IIT Delhi, 2016.
