Implementing a Natural Language Conversational Interface for Indian Language Computing

Rahul Jindal
Infosys Technologies, Mohali

Rohit Kumar, Ritvik Sahajpal, Sanjeev Sofat, Shailendra Singh
Department of Computer Science and Engineering, Punjab Engineering College, Chandigarh

I. INTRODUCTION

Natural Language Conversational Interfaces (NLCIs) use the natural modes of interaction among humans, like speech and dialog, to provide better interfaces for human-computer interaction. NLCIs are motivated by the need to bring human-computer interaction as close to human-human interaction as possible. There are two basic challenges involved in the development of an NLCI. First, the system needs to process and respond to natural language conversational inputs from the user; the response should be an output in natural language and/or a system-level action. Second, the modes of interaction with the user should be similar to those of human-human interaction, such as spoken dialog and facial expressions. There are several issues involved in each of these basic challenges. Through this paper, we describe the development of an NLCI for Indian language computing. The challenges involved in the development of NLCIs are met through convergence of existing technologies in the domains of Dialog Management and Spoken Language Generation. Currently a Text-In, Speech-Out interface is implemented. The proposed NLCI finds application as the primary interface for interaction with the machine in kiosks for use in rural and sub-urban India. In our work, we concentrate on an Indian language conversational interface, currently using Hindi as our medium of communication.

II. THE COMPUTING TIME COMPANION MODEL

[Figure I: The Computing Time Companion Model. Blocks: User; Natural Language Conversational Interface (C.T.C.) comprising the Dialog Management Sub-System and the Multi-Modal Interface; Machine.]

In Figure I the basic model of a Computing Time Companion (CTC) is shown. The CTC stands between the user and the machine and provides a natural language conversational interface for human-computer interaction. The CTC makes interaction with the machine closer to human-human interaction. For the user, the CTC acts as a human-like companion helping him in accessing the machine's services; hence the name Computing Time Companion. The CTC's ability to provide human-like interaction is realized through two basic sub-systems: the Dialog Management Sub-System and the Multi-Modal Interface. The Dialog Management Sub-System provides the natural language understanding and generation capabilities to the CTC by accepting inputs in the natural language of the user and by generating the corresponding response in the natural language of the user, as a system-level action on the machine, or as a combination of both.

III. DEEPTI: IMPLEMENTATION OF THE COMPUTING TIME COMPANION

To provide an instantiation of the CTC model proposed in Section II, we have implemented an NLCI named Deepti, which has the characteristics described below.

III. 1. Characteristics of Deepti

III. 1. 1. Intelligent Human-like Conversational Ability: Deepti is capable of intelligently interacting with the user through natural dialog. The natural dialog is user controlled instead of being a fixed set of prompts to which the user replies in a specified format. This ability essentially classifies Deepti as a true NLCI.

III. 1. 2. Text-In, Speech-Out Interface: Deepti's natural dialog response to user stimulus has been made multi-modal by providing the responses both as readable text and as speech. The inputs from the user remain in text format.

III. 1. 3. Controls the Machine for the User: Deepti plays the role of providing a human-like interface to the machine. The interface controls the machine, i.e. the operating system, for the user. The Deepti system generates system-level actions, like opening up a notepad, in response to the user's explicit requests like "I want to type down some notes" or implicit mentions like "I have got an idea".

III. 1. 4. Support for Hindi: We have concentrated our efforts on implementing the Deepti NLCI for one Indian language, Hindi, though it should be mentioned that the approach followed is quite general and applicable to the implementation of a similar NLCI for other Indian languages.

III. 1. 5. Easy Modifiability for New Domains: As mentioned in Section II, easy modifiability is a desirable characteristic of an NLCI. The architecture of Deepti as discussed ahead and the technologies we have used are easily modifiable to incorporate knowledge about new domains and to adapt to new applications of Deepti.

III. 2. Architecture of the Deepti System

The block diagram of the architecture of the Deepti system is shown in Figure II. The derivation of this architecture from the CTC model shown in Figure I is clearly observable. The dialog management sub-system is realized by an Artificial Intelligence Markup Language (AIML) engine. The multi-modal interface currently comprises an Indian language Text-to-Speech system only. Currently the system has been developed on the Linux platform for desktop computers.

[Figure II: Architecture of the Deepti System. Blocks: User; Romanized Hindi text exchange; Deepti, comprising the A.I.M.L. Engine and the Text-to-Speech System; Machine; Data and Control paths.]

The user and Deepti, the interface to the machine, interact through natural dialog, strictly following a protocol of taking turns in communicating alternately. The AIML engine generates the responses to the user's stimulus both in the form of a textual message and, if required, in the form of a system-level action to control the machine. The textual message is passed through the Text-to-Speech system to generate a multi-modal response for the user.
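For a concrete picture of this turn-taking pipeline, the following minimal sketch in Python chains an AIML interpreter to a speech synthesizer. It is not part of the Deepti implementation: it assumes the open-source PyAIML package ("aiml") and a locally installed Festival binary on the PATH, and the knowledge-base file name hindi.aiml is hypothetical.

    # Minimal sketch of a Deepti-style Text-In, Speech-Out loop (illustrative).
    # Assumes: the PyAIML package and a Festival TTS binary on the PATH.
    import subprocess
    import aiml

    kernel = aiml.Kernel()
    kernel.learn("hindi.aiml")              # hypothetical AIML knowledge base

    def speak(text):
        """Pipe the textual response to Festival for spoken output."""
        subprocess.run(["festival", "--tts"], input=text, text=True, check=True)

    while True:
        stimulus = input("user> ")          # Text-In: Romanized Hindi
        if not stimulus:
            break
        response = kernel.respond(stimulus) # AIML engine selects a category
        print("deepti>", response)          # readable text response
        speak(response)                     # Speech-Out via the TTS system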

III. 3. Standard and Encoding for Hindi: Use of Romanized Hindi

In the Deepti system, we have tried to keep the text-in interface as natural for the user as possible. We have used Romanized Hindi for the textual inputs and responses. Romanized Hindi in itself has no standard for encoding each grapheme of Hindi. Due to this, there is a lot of flexibility for the user to type in the words in a natural way. For example, the word hmaara (meaning: our) can be written in different ways like hamara, hamaaraa, hamaara, hamaraa in Romanized Hindi. The ambiguity resulting from this flexibility is tackled by the use of substitution dictionaries, as discussed in Section IV.
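As a simple illustration of how such a substitution dictionary could collapse spelling variants to one form before pattern matching, consider the sketch below. This is illustrative only, and the canonical spellings chosen are assumptions.

    # Illustrative sketch of a substitution dictionary collapsing Romanized
    # Hindi spelling variants to one canonical form before pattern matching.
    SUBSTITUTIONS = {
        "hamara": "hamaaraa",
        "hamaara": "hamaaraa",
        "hamaraa": "hamaaraa",
    }

    def normalize(utterance):
        """Replace each known variant with its canonical spelling."""
        words = utterance.lower().split()
        return " ".join(SUBSTITUTIONS.get(word, word) for word in words)

    print(normalize("yeh hamara gaon hai"))   # -> "yeh hamaaraa gaon hai"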

IV. ARTIFICIAL INTELLIGENCE MARKUP LANGUAGE

Artificial Intelligence Markup Language (AIML) [22] has established itself as one of the premier technologies for building interactive chat bots. We have chosen an AIML engine for use as the dialog management sub-system of the Deepti system due to its successful use in several experiments and applications.

IV. 1. Architecture of the AIML Engine

Figure III shows the block diagram of the architecture of the AIML engine. Some other blocks and details in the architecture have been omitted to highlight only the most relevant components.

IV. 1. 1. AIML Knowledge Base: The offline knowledge base of the system comprises AIML files. The basic element of knowledge in these files is a Category. A Category consists of a Pattern and a Template. The Pattern specifies what the possible stimulus should look like for that particular category to be selected. The Template specifies the exact response to be generated when that category is selected. A simple AIML Category is shown below:

    <category>
      <pattern>NAMASTE</pattern>
      <template>aapki kaise madad kar sakti hoon?</template>
    </category>

Whenever a user's input is "namaste", the response generated is "aapki kaise madad kar sakti hoon?"

[Figure III: Architecture of the A.I.M.L. Engine. Blocks: Knowledge Base (A.I.M.L. Files); A.I.M.L. Interpreter with Stimulus Pre-Processor, Graph Master (PATTERN matching) and Category Processor; Response Generator handling TEMPLATE and SYSTEM Tags; Responder(s) and System Level Control; Stimulus in, Response out.]

IV. 2. Response Generation in the AIML Engine

The Response Generator reads the TEMPLATE and takes suitable action. In case a system-level action is intended in the TEMPLATE, indicated through a SYSTEM Tag, that particular action is requested from the underlying operating system. The processing of the dialog response generated by a TEMPLATE varies from being as simple as copying its contents into the response, to expansion of other Tags inserted within the TEMPLATE; the Response Generator handles this processing. The SRAI Tag, usually taken as an abbreviation of Symbolic Reduction Artificial Intelligence, is used to recursively generate the response to a particular stimulus by looking up the response to some other stimulus. It always appears in the TEMPLATE part of a category. [27] gives a good number of examples explaining the use of the SRAI Tag. The RANDOM Tag is used to generate different responses to the same stimulus at different instants. For example, to the stimulus "namaste", a variety of responses could be given, like "namaskaar", "namaste", "kyaa haal hain", "aapki kaise madad kar sakti hoon?", etc. By the use of the RANDOM Tag, more human-like responses are generated. Another useful Tag is the THAT Tag, which is used outside both the PATTERN and the TEMPLATE and can be considered an addition to the PATTERN; it refers to the last response of the AIML engine during a particular session of interaction.
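To illustrate the control flow of SRAI and RANDOM resolution, the following toy sketch can be considered. It is not the AIML engine used by Deepti; a real engine matches patterns with a Graphmaster, whereas this sketch uses a plain dictionary, and the tiny in-memory knowledge base and fallback reply are assumptions.

    # Illustrative toy resolver for RANDOM and SRAI templates.
    import random

    KB = {
        "NAMASTE": {"random": ["namaskaar", "namaste", "kyaa haal hain",
                               "aapki kaise madad kar sakti hoon?"]},
        "NAMASKAAR": {"srai": "NAMASTE"},   # reduce one greeting to another
    }

    def respond(stimulus):
        template = KB.get(stimulus.strip().upper())
        if template is None:
            return "maaf kijiye, main samajh nahin paayi"   # assumed fallback
        if "srai" in template:              # SRAI: recurse on another stimulus
            return respond(template["srai"])
        if "random" in template:            # RANDOM: vary replies across turns
            return random.choice(template["random"])
        return template.get("text", "")

    print(respond("namaskaar"))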

IV. 3. Approaches to Build the AIML Knowledge Base

The first approach to building the AIML knowledge base is anticipatory. The botmaster enumerates the most probable inputs and writes a suitable response to each of these inputs to form the categories. This approach can be made more convenient if two botmasters engage in a dialog, with one of them playing the role of a user and the other the role of the machine. The resulting sequence of dialog can then be processed to generate categories automatically or semi-automatically, with some human intervention to provide corrections and generalizations (a rough sketch of this processing follows below). The second approach is based on backward-looking log file analysis. By analyzing the interaction logs of the bot, the botmaster adds knowledge to provide correct responses to the most usual inputs which were incorrectly or improperly answered.
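The semi-automatic processing mentioned for the first approach could look like the sketch below, which pairs alternating user and machine turns and emits draft categories for the botmaster to review. The transcript format, the example turns and the pairing heuristic are illustrative assumptions.

    # Illustrative sketch: converting an alternating user/machine transcript
    # into draft AIML categories for the botmaster to review and generalize.
    from xml.sax.saxutils import escape

    transcript = [                       # assumed transcript format
        ("user", "namaste"),
        ("machine", "aapki kaise madad kar sakti hoon?"),
        ("user", "mujhe kuch notes likhne hain"),
        ("machine", "theek hai"),
    ]

    def to_categories(turns):
        """Pair each user turn with the machine turn that follows it."""
        categories = []
        for (role_a, text_a), (role_b, text_b) in zip(turns, turns[1:]):
            if role_a == "user" and role_b == "machine":
                categories.append(
                    "<category>\n"
                    "  <pattern>" + escape(text_a.upper()) + "</pattern>\n"
                    "  <template>" + escape(text_b) + "</template>\n"
                    "</category>"
                )
        return categories

    print("\n".join(to_categories(transcript)))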

V. INDIAN LANGUAGE TEXT-TO-SPEECH SYSTEM

To provide a multi-modal response to the user, comprising both the written and the spoken form of the dialog, we have built a Text-to-Speech (TTS) system for Hindi based on the Festvox framework [29]. The TTS involves concatenative synthesis of diphones.

V. 1. Festvox based Text-to-Speech System for Hindi

[Figure IV: Design of the Festvox based Hindi Text-to-Speech System. Blocks: Text Input; Tokenizer; Token to Words Converter; Pronunciation Generator with Pronunciation Lexicon and Letter-to-Sound Rules; Speech Synthesizer with Diphone Database; Spoken Response.]

The building blocks of our Festvox based Hindi TTS system are shown in Figure IV. In our Hindi TTS system, pronunciation is generated primarily through lookup in a pronunciation lexicon. A lexicon comprising all the words in our AIML templates, along with their pronunciations as sequences of phone units, has been built; initially it contains around 1900 words drawn from our AIML templates. To generate pronunciations for unseen words, i.e. words not in the lexicon (e.g. proper nouns like names), we use a set of carefully designed letter-to-sound (LTS) rules to generate their nearest pronunciation. An example of the use of LTS rules is as follows. The preferred way of writing the name rama in Romanized Hindi is "ram". If an input contains "rama", we take it that the word is intended to be pronounced ramaa. So an LTS rule to handle "a" at word endings is: ( [ a ] # = aa ). The basic units of synthesis are diphones.
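The lexicon-first lookup with an LTS fallback described above can be sketched as follows. This is illustrative only: the phone sequences stored in the toy lexicon are assumptions, and the fallback implements just the single word-final rule from the example.

    # Illustrative sketch of the pronunciation lookup order: lexicon first,
    # then a letter-to-sound fallback mirroring the rule ( [ a ] # = aa ).
    LEXICON = {
        "ram": ["r", "aa", "m"],                      # assumed phone sequence
        "namaste": ["n", "a", "m", "a", "s", "t", "e"],
    }

    def letter_to_sound(word):
        """Crude fallback: treat each letter as a phone, lengthen final 'a'."""
        phones = list(word)
        if phones and phones[-1] == "a":
            phones[-1] = "aa"                         # ( [ a ] # = aa )
        return phones

    def pronounce(word):
        return LEXICON.get(word, letter_to_sound(word))

    print(pronounce("rama"))   # unseen word -> ['r', 'a', 'm', 'aa']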

V. 2. Building a Diphone Database for Hindi

The major task involved in building the Hindi TTS using the Festvox framework is that of building the diphone database [37]. Figure V shows the steps involved in this task.

V. 2. 1. Definition of the Phone-Set: We have come up with a definition of a Hindi phone-set comprising 12 vowels and 32 consonants, which are listed below in Table 1. Besides these, a silence phone is also defined in the phone-set.

Vowels (12): a, aa, I, ii, u, uu, e, ai, o, au, am, an
Consonants (32): k, kh, g, gh, c, ch, j, jh, T, Th, D, Dh, N, t, th, d, dh, n, p, ph, b, bh, m, y, r, l, v, sh, s, h, ks, gy

Table 1. Hindi Phone-Set

V. 2. 2. Listing the Diphones: Based on the total of 45 phones in the definition of the Hindi phone-set, a list of all possible diphones was generated, and some of the phonotactically invalid diphones were manually deleted from the list (an illustrative sketch of this enumeration is given at the end of this section).

V. 2. 3. Recording of Carrier Words: Carrier words containing the listed diphones are recorded at a rate of 16000 samples per second, in a calm environment to avoid extraneous noise, using commonly available recording equipment on a desktop PC.

V. 2. 4. Labeling the Recordings: The recordings were manually segmented and labeled at phone boundaries using the EmuLabeler tool [39].

V. 2. 5. Extracting Features: Two basic features that are required during synthesis and need to be extracted offline are pitchmarks and pitch-synchronous linear prediction coefficients.

The Hindi TTS system is invoked by the Deepti system whenever a response from the AIML engine is generated. The TTS system generates the spoken form of the response, which is then played out to the user along with the display of the textual response.
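The diphone enumeration referred to in V. 2. 2 can be sketched as below. The silence phone name "pau" and the invalid-pair set are placeholder assumptions, since the actual phonotactic pruning was done by hand.

    # Illustrative sketch: enumerating candidate diphones from the phone-set
    # of Table 1, then removing assumed phonotactically invalid pairs.
    from itertools import product

    VOWELS = "a aa I ii u uu e ai o au am an".split()
    CONSONANTS = ("k kh g gh c ch j jh T Th D Dh N t th d dh n "
                  "p ph b bh m y r l v sh s h ks gy").split()
    PHONES = VOWELS + CONSONANTS + ["pau"]          # 45 phones in total

    INVALID = {("pau", "pau")}                      # assumed invalid pairs

    diphones = ["%s-%s" % (a, b) for a, b in product(PHONES, repeat=2)
                if (a, b) not in INVALID]
    print(len(PHONES), "phones ->", len(diphones), "candidate diphones")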

VI. POSSIBLE APPLICATIONS FOR THE DEEPTI SYSTEM

Some of the applications considered are mentioned here. Primarily, the Deepti system could serve as the interface for the Information and Service Kiosks being developed for rural and sub-urban India. These kiosks provide elementary information about various domains of public interest like health, government schemes, income management, etc. The kiosks also provide easy access to various services like application forms for passports, driving licenses, voter ID cards, etc. The Deepti system could be deployed as an information dispensing system at shopping centers, commercial banks, post offices, railway stations, airports, offices of insurance companies, tourist spots, schools and colleges, etc. Our NLCI can also serve as the primary interface for a computer to be used by a layperson who needs only certain basic applications like emailing, taking notes, writing letters, etc. The Deepti system can also be used as an interactive educational aid for children and/or illiterate adults in certain areas, though some changes to the text-in interface would be required.

REFERENCES

[22] Artificial Intelligence Markup Language, Artificial Intelligence Foundation Inc., http://www.alicebot.org/alice/aiml.html
[27] Richard S. Wallace, "Symbolic Reduction in AIML," The A.L.I.C.E. AI Foundation, http://www.alicebot.org/documentation/srai.html, 2000.
[29] Alan W. Black and Kevin A. Lenzo, Building Voices in the Festival Speech Synthesis System: Processes and Issues in Building Speech Synthesis Voices, Edition 1.2-beta, for Festival version 1.4.1, Carnegie Mellon University, http://cstr.ed.ac.uk/projects/festival/docs/festvox/, 2000.
[37] Kevin A. Lenzo and Alan W. Black, "Diphone Collection and Synthesis," in Proceedings of the International Conference on Spoken Language Processing (ICSLP), Beijing, 2000.
[39] Steve Cassidy, Jonathan Harrington, and others, Emu-Labeler, The Emu Speech Database System, Speech Hearing and Language Research Centre, Macquarie University, 2001, http://www.shlrc.mq.edu.au/emu/
