Challenges in Enabling Speech as a Service Channel for Indian Scenario

Charudatta Jadhav, Imran Ahmed, Meghna Pandharipande, Venkatakrishna T, Mithun BS, Vrushali Kulkarni, Chitralekha Bhat, Arun Pande, Sunil Kumar Kopparapu

TCS Innovation Labs – Mumbai, Tata Consultancy Services, Yantra Park, Thane (West), Maharashtra 400601, INDIA.

Abstract: In India, information dissemination remains a major challenge despite advances in mobile and related technologies, for three main reasons: (a) language diversity, (b) low English literacy and (c) limited access to fast broadband or mobile networks. Together, these reasons make speech the best suited channel for information dissemination in the Indian context. Existing self-help Interactive Voice Response (IVR) based solutions require the user to provide inputs through the phone keypad, which imposes serious constraints on what the user can input and on the time taken to input it; this makes conventional IVR systems less user friendly. Moreover, only the digitally aware citizen (digizen) is comfortable with “Press x for y” systems. The best alternative for dissemination of information would be for the user to speak in their local language to a human agent at the other end of the IVR; however, this turns out to be expensive and also imposes a serious limitation on scalability. Speech technology based solutions are therefore seen as the best alternative from the perspective of both usability and scalability. However, there are several challenges in deploying speech based solutions, and these challenges come in several dimensions. During our development of a speech enabled system, several technological and people adoption challenges came into focus. The typical technology challenges were making the system usable across India, meaning that it should work robustly for all languages, dialects and accents, and scaling the infrastructure to support the enormous user base. The typical people adoption challenges were making the system easy to navigate and user friendly, and giving users what they sought. In this paper, we describe our experience of the technology and people adoption challenges in deploying speech as a service channel.

1 Purpose of Research

With the increasing penetration of mobile devices worldwide, and in India in particular over the last couple of years, the information reachability gap between the source of knowledge and the need for that knowledge has reduced significantly. This phenomenon has made mobile devices an affordable, ubiquitous, personal interface. With voice and text communication being the primary uses of a mobile phone (and text communication in Indian languages often being difficult), speech stands out as a communication channel, and this in turn has brought speech based solutions for information dissemination into focus. In India, however, information dissemination is still a major challenge [1] despite advances in technology, for three main reasons: (a) language diversity, (b) low English literacy and (c) limited access to fast broadband or mobile networks. For these reasons, speech as a channel for information dissemination suits the Indian scenario better than anywhere else in the world. Although speech technology has attracted continued research interest for more than five decades, its commercial viability in terms of return on investment (RoI) is fairly recent in western countries and has probably not yet been demonstrated in India. For the Indian market, speech enabled solutions pose significant and complex challenges. These arise from the large number of languages and the multitude of dialects, and from the fact that a vast majority of the Indian population can converse comfortably only in their native language. For the Indian scenario, speech technology requires continuous tuning and an approach that is solution centric rather than product specific. Some of the India specific challenges are (a) the need to address a large number of people, (b) many languages with their dialects and accents, (c) use of mixed language in day to day communication, (d) noisy telephone lines and noisy environments, and (e) people's inertia in adopting technology. While the need for speech enabled solutions is clearly laid out, the terrain is difficult to traverse. In this paper, we capture the terrain and bring out the practical challenges faced in traversing it.

The solutions that best suit Indian conditions are the ones that use the voice channel, where people can seek information by actually speaking to another person, as in a call center. However, this is an expensive proposition in terms of infrastructure and human resource setup, which points to speech based solutions as the best fit for information dissemination. In the rest of the paper we use a specific example, namely a speech solution to access railway related information.

2 Methodology

Indian Railway [2] is one of the most used modes of transportation in India, with 9000 passenger trains running each day, covering a whopping 64015 kilometers and moving over 20 million passengers. Information related to trains is highly sought after, even during non-peak seasons (which are rare). To gauge the popularity of railway information, it is sufficient to observe that there are as many as twenty applications related to Indian Railway information in the Android marketplace for mobile phones (https://market.android.com/search?q=indian+railway&c=apps), one of which has been downloaded several hundred thousand times (https://market.android.com/details?id=com.rahul.utility.indianrail&feature=search_result). There are primarily four pockets of information that users tend to query: the status of a train in terms of its arrival and departure, the availability of accommodation on a particular train, the status of an already booked ticket in terms of its PNR, and the cost of a journey from one place to another. This information is disseminated through several channels: (a) an IVR (popularly called the 139 service), (b) SMS, (c) websites and (d) radio/TV broadcast. Radio and television broadcasts are infrequent and do not necessarily occur when the user needs them, which makes this channel less useful. Internet based and SMS based dissemination is mostly constrained to the English literate and hence attracts only a small population base. The majority use IVR as the channel for information acquisition because it does not have any of the difficulties that the other channels face. However, the economics of infrastructure and the limited means by which the user can input queries lead to a situation where a very large number of users wait in a queue for their turn to get the information they seek from a human agent, which seriously affects the usability of an IVR system.

With this in the background, the best way forward was to build a speech recognition based solution that could assist in the dissemination of information related to Indian Railway. Several prime components make up a functional speech enabled information dissemination system. They are as follows.

Speech Technology is the core component of any speech enabled solution and consists of (a) an Automated Speech Recognition (ASR) engine, which recognizes what the user is speaking, and (b) a Text-To-Speech (TTS) engine, which speaks back to the user in a voice that is as close as possible to that of another human. Together, these two technologies process the speech input received from the user and produce the output that has to be spoken to the user. The ASR engine converts the caller's speech into meaningful text within the scope of the context, enabling the IVR system to respond appropriately to the user's request. The TTS engine converts a textual string into voice, which is generally used to speak back to the user. Resources in the form of a pronunciation lexicon and a dictionary assist the ASR and TTS engines. Clearly, these two components are highly language dependent and often accent dependent.

Voice User Interface (VUI) design addresses the cultural and language complexities of the user through a thorough understanding of what the user is trying to accomplish and of how to create an engaging dialogue, thereby ensuring a satisfactory user experience. The VUI design components are (a) call flow design, which addresses how the overall system should engage the user so as to come as close as possible to the effect of speaking to another human, (b) dialogue design and (c) system persona, that is, the impression that the system makes on the user, or the personality of the IVR or speech system.

While these two are the primary components that enable a speech recognition based solution, a very important third component is the ability of the system to encourage people to use it. This is specifically relevant to people who have more than one choice of channel for getting information. For example, people in cities have better access to the internet, which makes it their preferred channel for getting information compared to speech.
User Engagement can be enabled by providing a facility to proactively inform the user of any updates to the information that was sought earlier. Obviously, this has to be non-intrusive. For example, if the user sought information about their PNR status, any further change in the PNR status could be communicated to the user without any effort on the user's part.

We list the challenges in building a usable system and getting it used.

• Noisy Environment affects the performance of a speech application
  ◦ Background street noise is inherent in the Indian scenario
  ◦ Telephone channels in India are inherently noisy
• People Related
  ◦ Lack of tech savviness makes people prefer to speak to a human agent rather than a machine
  ◦ People speak in mixed language [3,4]
  ◦ People will not use the system if they can do without it
• Technology Related
  ◦ Languages and dialects are many, but resources are minuscule
  ◦ No mature speech recognition or Text-To-Speech engine is available for Indian languages
  ◦ Scalability of infrastructure
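The proactive update described under User Engagement can be illustrated with a minimal sketch. It assumes the system remembers the last status it told each caller and compares it against the latest status from the railway backend; the class name `StatusWatcher`, its methods and the message format are hypothetical and are not taken from the deployed system.

```python
from dataclasses import dataclass, field


@dataclass
class StatusWatcher:
    """Remembers the last status told to each caller and reports changes."""

    # Maps (phone_number, pnr) -> last status communicated to that caller.
    last_told: dict = field(default_factory=dict)

    def record_query(self, phone: str, pnr: str, status: str) -> None:
        """Store what the voice channel just told the caller."""
        self.last_told[(phone, pnr)] = status

    def pending_updates(self, current: dict) -> list:
        """Return (phone, message) pairs for every status that has changed.

        `current` maps pnr -> latest status from a hypothetical backend.
        """
        updates = []
        for (phone, pnr), told in self.last_told.items():
            latest = current.get(pnr)
            if latest is not None and latest != told:
                message = f"PNR {pnr}: status changed from {told} to {latest}"
                updates.append((phone, message))
                # Remember the new status so the same change is sent only once.
                self.last_told[(phone, pnr)] = latest
        return updates
```

Sending a message only when the status differs from what the caller was last told is one way to keep the service non-intrusive.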

3 Experiential Learning

In this section we capture our learning and the practical challenges we faced in bringing the speech-enabled IVR based railway inquiry system into use. We list the challenges together with the approach and methodology used to overcome them. The aim was twofold: in one dimension, to ensure good performance of the speech enabled system so that it could be used by the masses; in the other, to ensure adoption of the service by the people.

Challenge 1: Background noise is an integral part of the Indian environment, especially with the increase in mobile users calling from noisy surroundings. If the call originates from a fixed line, the environment is relatively quiet, because the call comes from an office or a home. With mobile phones, however, there is significant environmental noise from many sources: bustling traffic, clanking factories, crowded streets, busy malls, moving trains and vehicles, swarming auto rickshaws, windy bike rides and so on. Calls originating from mobile phones therefore need more effective noise management than calls from a fixed line. We addressed this by first identifying from the caller ID whether the call was from a mobile or a fixed line, and then using simple energy based methods to determine whether there was noise in calls made from mobile phones. Noise reduction algorithms were enabled only when deemed necessary rather than all the time, since they are computationally expensive and can have a negative effect when there is no environmental noise. The noise characteristics were adapted to and learned dynamically, in near real time. In addition, a context-sensitive error handling mechanism can serve to reduce the number of misrecognitions.

Challenge 2: There is a definite inertia in adopting technology, most significantly when (a) a different channel is available and (b) people have used that channel earlier. We addressed this by offering attractive features through multimodality. We used the SMS channel, which is available to all mobile users, to communicate train information to the user, when deemed necessary, after the user had used the voice channel to fetch information. We enabled what we call the Outbound SMS service, which constantly gathers updates about the latest known status, or a change in status, of the train. Any change in the information given to the user during the telephone transaction is proactively communicated to the user. For example, if a user inquires about the arrival status of a train at a station and this arrival status changes later, the change is communicated to the user through SMS.

Challenge 3: People desire natural language interaction with the system (a query of the form "Mumbai se aane waali Godaan express Lucknow kab pohchegi", that is, "when will the Godaan express coming from Mumbai reach Lucknow"), but building such a natural language speech interface is a hard problem. In [8] we discussed an IVR system enabled with a natural language Hindi speech interface. Although a natural language speech interface allows a user to pose naturally spoken queries, it may not be applicable in all scenarios. In certain scenarios it has been observed that unconstrained natural language severely strains recognition technology [9]. Thus the optimal paradigm, or style, for human-machine speech communication arguably lies somewhere between the two extremes: more regular than natural language, yet more flexible than hierarchical menus.
We identified situations where the system could seek natural language inputs, and enabled natural language speech recognition only at those points in the system.

Challenge 4: In a multilingual country like India, a speech solution that has to be usable by a large number of people has to work in different Indian languages. India officially recognizes 22 languages, and there are in fact hundreds of dialects and thousands of accents. Hence, to truly address a large population that speaks different languages, we need a system for quickly adapting a speech recognition solution that works in one language to a second language, and for deploying it in that second language. We addressed this by identifying a method that enables rapid prototyping of a speech solution in one language into a different language. In [5] we proposed a practical approach to adapting a speech recognition based solution working in one language to work in a different language. In [7] we discuss in detail the system for rapid prototyping of an existing speech solution into another language, and present a case study of this system as applied to the speech-enabled IVR system for railway inquiry.

Challenge 5: In addition to the large number of languages spoken in India, there is large variability within each language, because different individuals speak the language or its words differently. This is an enormous challenge for the speech recognition engine. For example, a station name can be spoken in different ways: “Chatrapati Shivaji Terminus”, “CST”, “VT” and “Bori Bandar” all refer to the same station. By this token, if there are a total of X unique stations, we need to consider a much higher number of possible terms for recognition, which increases confusability and makes recognition difficult. We adopted an approach wherein only a small subset of the 3000 station names is enabled at any given time. For example, once the context of the train name is determined, only the stations on the path of that particular train, typically 30 to 50 station names, are activated for speech recognition. This approach gives better recognition accuracy.

Challenge 6: No good resources, such as lexicons and speech data, are available for building speech solutions in Indian languages. Indian languages are speech resource-deficient, meaning that there is little or no text annotated speech data in Indian languages that could be used to train and build Automatic Speech Recognition (ASR) systems. We addressed this by devising a way to use existing resources on the internet to enrich the data resources. Eventually, the rapidly developed speech solution in the second language could use a Wizard of Oz speech data collection system [6] to collect rich speech data in the resource-deficient second language.

Challenge 7: The cognitive load on the user has to be kept low, or the user will not use the system. In the speech enabled railway inquiry application, when the user speaks a train name (say "Rajdhani express"), the system finds 9 different trains. Asking the user which of these 9 trains was meant would not only be a lengthy interaction but would also deter the user from using this channel and push them towards another channel for the information. We are trying to address this by enabling a multimodal interface on the mobile phone [10] that can take keypad inputs as well as recognize speech. In the "Rajdhani express" scenario, the user could then see the 9 train choices on the mobile phone screen and make a selection more easily. This also helps promote the voice based service.

Challenge 8: A speech solution will not be effective from day one of its implementation. There is a definite learning curve for the system, which may extend over a period of time, during which the system is tuned based on user feedback as well as from the recognizer's perspective. To accommodate this learning curve, speech solutions go through several iterations of pilot launches before the system reaches its peak efficiency. Gaining the loyalty of users and ensuring that they continue to use the service despite some unsatisfactory experiences is a massive challenge. We adopted this strategy in enabling the speech solution: we first opened the application to a small set of users and then gradually increased the user base.
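The energy based noise check described under Challenge 1 can be sketched as follows. This is a minimal illustration under stated assumptions, not the deployed implementation: samples are assumed to be normalized to the range -1 to 1, a frame is 160 samples (20 ms at an assumed 8 kHz telephone sampling rate), the noise floor is estimated as a low quantile of the frame energies, and the -40 dB threshold is an arbitrary illustrative value. The `from_mobile` flag stands in for the caller ID lookup, which is not shown.

```python
import math
from typing import Sequence


def frame_energies_db(samples: Sequence[float], frame_len: int = 160) -> list:
    """Mean-square energy of each full frame, in dB (floored at -100 dB)."""
    energies = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        mean_sq = sum(x * x for x in frame) / frame_len
        energies.append(10.0 * math.log10(max(mean_sq, 1e-10)))
    return energies


def noise_floor_db(energies: Sequence[float], quantile: float = 0.1) -> float:
    """Estimate the noise floor as a low quantile of the frame energies.

    The quietest frames of a call are assumed to contain background only.
    """
    ordered = sorted(energies)
    return ordered[int(quantile * (len(ordered) - 1))]


def should_denoise(from_mobile: bool, samples: Sequence[float],
                   threshold_db: float = -40.0) -> bool:
    """Enable noise reduction only for mobile calls with a high noise floor."""
    if not from_mobile:
        return False  # fixed-line calls are treated as quiet
    energies = frame_energies_db(samples)
    if not energies:
        return False  # not enough audio to decide
    return noise_floor_db(energies) > threshold_db
```

Gating on the estimated noise floor rather than on the overall energy keeps loud but clean speech from triggering noise reduction, which matches the stated goal of running the noise reduction algorithms only when needed.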

4 Implications

What is needed is a platform that allows a person to pick up a phone and get the desired information by asking for it in a natural way. While the technology that can make this possible is still developing, there are a number of innovative and workable approaches that can be adopted to make speech the channel for transactions and query answering. We have shown through an example how this can be made possible after identifying the problems associated with building a working solution for use by the masses. The use of speech as a channel to extend the reach of information has several direct and indirect impacts. In an agricultural scenario, real time information about commodity prices gives rural farmers the power to negotiate better prices for their produce. In the case of railway queries, although the problem of offering train information to the masses is huge, it can be broken down into addressable parts. For example, though there are 9000 trains that the speech recognition engine might have to recognize, this number can be reduced significantly by considering that not all trains pass through all railway stations, and that people geographically close to a certain railway station most often need information only about the trains that pass through that station. This observation enables one to build a reliable and usable speech solution that lets the masses ask for information. Further, one could use language specific speech recognition depending on the region in which the speech solution is deployed. For example, in a particular geography only a small set of languages (far fewer than the 22 official languages) is likely to be spoken, so the speech engine can be made region (and hence language) specific instead of being built to recognize all languages simultaneously. Using this philosophy, we built a speech recognition system that can be used by the masses to get information about Indian Railway, and we have described the challenges that we faced during our journey. Speech-enabled solutions are apt for the Indian scenario, given its socio-economic conditions and the diversity of its people in terms of spoken languages and dialects. Building a system that addresses this diversity is a huge task, but one that must be undertaken. A solution based approach is more appropriate to Indian conditions than a product approach.
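The vocabulary reduction described above, and under Challenge 5, can be sketched as follows. The idea is to activate only the spoken forms of stations on the selected train's route, and to map every spoken variant to one canonical station. The alias table, the route for "Example Express" and the function names are hypothetical illustration data, not real timetable data or the authors' implementation.

```python
# Hypothetical alias table: canonical station -> spoken variants.
ALIASES = {
    "Chatrapati Shivaji Terminus": [
        "chatrapati shivaji terminus", "cst", "vt", "bori bandar",
    ],
    "Thane": ["thane"],
    "Lucknow": ["lucknow"],
    "Pune": ["pune"],
}

# Hypothetical route table: train name -> stations on its path.
ROUTES = {
    "Example Express": ["Chatrapati Shivaji Terminus", "Thane", "Lucknow"],
}


def active_vocabulary(train: str) -> dict:
    """Map each spoken form to activate for this train to its station."""
    vocab = {}
    for station in ROUTES.get(train, []):
        # Fall back to the lower-cased name if no aliases are listed.
        for spoken in ALIASES.get(station, [station.lower()]):
            vocab[spoken] = station
    return vocab


def resolve(train: str, recognized_text: str):
    """Return the canonical station for a recognized phrase, or None."""
    return active_vocabulary(train).get(recognized_text.strip().lower())
```

With this data, `resolve("Example Express", "VT")` returns the canonical station, while "Pune" is not in the active vocabulary for that train and returns `None`. A smaller active vocabulary leaves the recognizer fewer confusable alternatives.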

5 Key References

[1] Arun Pande, Sunil Kopparapu, “Challenges in Serving Speech Solution for use by Masses”, Book Chapter in Speech Recognition, Nova Publication (ISBN: 978-1-61470844-5).
[2] Indian Railway (www.irctc.co.in).
[3] Kiran Bhuvanagiri, Sunil Kopparapu, “Recognition of Mixed Language Speech without Language Identification”, International Conference on Signal System and Automation, ICSSA 2011, 24-25 January 2011.
[4] Imran Ahmed, Sunil Kumar Kopparapu, “Specifications for Mixed Language Speech Corpora: A Proposal”, 14th Oriental COCOSDA Conference, Hsinchu, Taiwan, Oct. 26-28, 2011.
[5] S. K. Kopparapu, I. A. Sheikh, and A. S. Pharande, “System and method for rapid prototyping of existing speech recognition solutions in different languages”, US Patent Application 20100299133, November 2010. [Online]. Available: http://www.freepatentsonline.com/y2010/0299133.html
[6] M. Gruber, M. Legát, P. Ircing, J. Romportl, and J. Psutka, “Czech senior companion: Wizard of oz data collection and expressive speech corpus recording and annotation”, Human Language Technology. Challenges for Computer Science and Linguistics, vol. 6562, pp. 280–290, 2011.
[7] S. K. Kopparapu and I. A. Sheikh, “Enabling Rapid Prototyping of an Existing Speech Solution Into Another Language”, Oriental-COCOSDA 2011, Hsinchu, Taiwan, Oct. 2011.
[8] Ahmed I. and Kopparapu S. K., “Building a Natural Language Hindi Speech Interface to Access Market Information”, NCVPRIPG-2011, Hubli, India, Dec. 2011.
[9] Stefanie T and Rosenfeld R., “Speech Graffiti vs. Natural Language: Assessing the User Experience”, Proc. of Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics, HLT-NAACL 2004, Carnegie Mellon University, pp. 73-76.
[10] Kvale, K., Warakagoda, N.D. and Knudsen, J.E., “Speech centric multimodal interfaces for mobile communication systems”, Telektronikk nr. 2, pp. 104-117, 2003.
