The current issue and full text archive of this journal is available at www.emeraldinsight.com/0737-8831.htm
LHT 27,3
Received 18 December 2008 Revised 2 February 2009 Accepted 14 April 2009
Issues in Indian languages computing in particular reference to search and retrieval in Telugu language
Devika P. Madalli
Documentation Research and Training Centre, Indian Statistical Institute, Bangalore, India, and
Dimple Patel
Department of Library & Information Science, Osmania University, Hyderabad, India

Abstract
Purpose – The purpose of this paper is to discuss the various issues involved in Indian languages computing, particularly Telugu, like creating, displaying, searching and retrieving digital content. The paper also aims to emphasize the issues involved in retrieval in Indian languages. The complexities presented by the grammar, syntax and morphology of Indian languages are discussed.
Design/methodology/approach – The paper undertakes and presents descriptive study of the issues and challenges in Indian languages computing in general and Telugu language in particular.
Findings – The problem of multilingual information retrieval in Indian languages is multi-pronged. A major observation of this study is that, though digital content is available in Indian languages, it is mostly in non-standard encoding format and fonts. There is an urgent need to work in the area of developing search algorithms for Indian languages, like soundex and metaphones to tolerate spelling variations and mistakes that a user might make in queries and suggest correct spelling(s).
Practical implications – With existing technologies libraries can now build online catalogues in the language of the documents or build digital repositories with content in various Indian languages. Though a few library automation software like NewGenLib and digital library software like DSpace, etc. are offering Unicode support for Indian languages, they do not allow for different types of search such as truncation search, word variants, etc. The present study is a step towards developing algorithms for indexing and searching in Indian languages.
Originality/value – The paper addresses various issues in Indian language computing with emphasis on search and retrieval.
Keywords Information retrieval, Languages, Indexing, India
Paper type Research paper
Library Hi Tech Vol. 27 No. 3, 2009 pp. 450-459 © Emerald Group Publishing Limited 0737-8831 DOI 10.1108/07378830910988568
Introduction
Language computing is needed to accomplish the tasks that a computer is capable of in the native language known to the end-user. Language computing research involves development of this capability in operating systems as well as in applications like word processors, text editors and web browsers in a given language. The basic argument here is that though information services should be generically designed, they have to be given in the language that the target community understands. In the process, language-related tasks such as auto-correction, spell-checking and grammar checking, and tools like dictionaries and thesauri, prove to be an integral part of language computing. Research in language computing is on-going in many areas such as:
• machine translation;
• speech processing;
• optical character recognition (OCR);
• standards (character representation, font display, etc.);
• localization;
• applications like word processors, e-mail clients, etc.;
• search, information extraction and retrieval; and
• search engines.

Multilingual computing
The Internet has influenced many disciplines and changed the way research is carried out. In the field of library and information science, one of the significant products has been the “digital library”. Digital libraries organize and store information and strive to give tailor-made information services according to the needs of the user community. Often these user communities like to have information in their local language/script. Thus managing multilingual information and providing multilingual information services is a big challenge for libraries and digital libraries. The library community has been dealing with multilingual access to information since the 1970s (Brendler, 1970) and information professionals are concerned about providing multilingual information services to user communities (Zielinska, 1976). Wellisch (1978) advocates the support of machine-readable catalogue formats for multilingual documents. He demonstrated the trends of book production in non-Roman scripts, which included Cyrillic, Japanese, Chinese, Devanagari and Arabic. Most libraries had extensive collections of books in non-Roman scripts. He pointed out the problem of finding language experts to catalogue records in different scripts. Universal bibliographic control of documents cannot be achieved if multilingual documents are excluded (Wellisch, 1978). The digital library movement in India has gained momentum in the last few years.
There are around 35+ institutions which have developed their own Institutional Repositories. Many of these are developed on DSpace and EPrints digital library software and a few are developed on Greenstone Digital Library (GSDL) software. But there is a serious issue regarding digital content in Indian languages. Due to the lack of machine-readable text in Indian languages, only scanned images of the documents are available for display, supplemented by metadata. The Digital Library of India is a case in point. Due to the lack of OCR for Indian languages, most of its content has been uploaded in image formats and the text is not searchable. Even if editable text is available, there are no efficient search techniques available for Indian languages. The use of non-standard metadata schemas by many of the libraries can also be a potential problem. With the use of codified data, a standard metadata scheme can yield efficient cross-lingual information retrieval. This technique of cross-lingual information retrieval has been demonstrated in the system named “Brass” developed at the Documentation Research & Training Centre (Tripathi, 2004). Such information retrieval systems can supplement automatic translation projects to some extent, as there is no significant development in Indian language machine translation.
Indian languages
Technology development
Character representation, display of characters (fonts), and search and retrieval techniques for the English language have come a long way, and though there are a few issues left in English language computing, a lot of research and development has been done over the last decade. In fact, the focus is now shifting to the semantic angle of information retrieval in the English computing world. But the same cannot be said of other languages of the world, more so the Indian languages. India is a multilingual country with 428 languages listed, of which 415 are living languages and 13 are extinct (Gordon, 2005). The Constitution of India recognises 22 languages, namely Assamese, Bengali, Bodo, Dogri, Gujarati, Hindi, Kannada, Kashmiri, Konkani, Maithili, Malayalam, Meitei, Marathi, Nepali, Oriya, Punjabi, Sanskrit, Santhali, Sindhi, Tamil, Telugu and Urdu. Although 64.8 per cent (Census of India, 2001) of the Indian population is literate, many are not able to take advantage of Information and Communication Technologies (ICT) tools because of a lack of familiarity with the English language. In recent years, the Ministry of Information and Communication Technology (MICT), Government of India (GOI), has taken up the agenda of making technologies and tools such as opentype fonts, Internet browsers, search engines, e-mail clients, online translation service tools, and speech-to-text as well as text-to-speech interfaces available in various Indian languages in the public domain. It has already made some of these products available in some of the Indian languages (Telugu, Hindi, Tamil, Marathi, Urdu, Punjabi, Oriya, Kannada, Assamese and Malayalam) and is currently working on making them available in all 22 scheduled languages of India (TDIL, 2007).
MICT is also working in tandem with private software development companies, Microsoft for example, to develop their popular Office Suite in Indian languages and provide it at affordable prices to the end-user. Other companies like Tata Consultancy Services (TCS) have also entered the arena of Indian languages computing (Tata ATC, 2007). Slowly but surely, the Government of India as well as the industry have realized the potential and utility of addressing the computing requirements in Indian languages.

Issues in Indian languages computing
Indian languages computing is beset with myriad issues, ranging from the lack of content created in Indian languages to standards, localization, and search and retrieval. The following sub-sections discuss these issues.

1. Content creation in Indian languages
English language documents dominate the web when compared to other languages of the world (English on the internet, 2007). This is paradoxical, because English stands third among the world's most spoken languages, with Chinese and Hindi in first and second place. It is only recently that there has been evidence of an ever-growing literature in Indian languages on the web. The issues involved in creating content in Indian languages are:
• availability of standard, opentype fonts;
• character encoding standards, especially with legacy language data that was created with proprietary standards, tools and techniques;
• input methods: keyboards, OCR technology, etc.;
• word processors, spell/grammar checkers and other applications; and
• availability of vocabulary tools, i.e. dictionaries, thesauri, etc.
One of the offshoots of the internet is the development of digital libraries. Many libraries and institutions across India have realized the importance and role of digital libraries in disseminating information to their clientele. A study of digital library development in India shows that many institutions deal with documents in Indian languages in addition to English language documents. This is especially true of universities, where the presence of a few Indian language departments is not uncommon.

2. Digitization projects
Many libraries in India hold a wealth of valuable ancient classics like the Vedas, Upanishads, Puranas, Epics, Sushruta-Samhita, Charak-Samhita, etc. in various forms like manuscripts, palm leaves, bamboo leaves, cloth, etc. Projects like the Digital Library of India (Digital Library of India, 2007) and the National Mission for Manuscripts (NAMAMI, 2007) are attempting to digitize these ancient Indian classics to preserve them for posterity. These projects are limited to scanning the documents, storing them as images and providing a few metadata elements. The practical application of language technologies in Indian languages for large-scale usage is yet to be implemented.

3. Standards
The main, and possibly the most controversial, issue in computing in Indian languages has been the lack of development and implementation of standards. Though standards like the Indian Script Code for Information Interchange (ISCII) and the Indian Standard FOnt Code (ISFOC) have been developed by organisations like the Centre for Development of Advanced Computing (C-DAC), these standards were never imposed by the government. This resulted in private industry developing proprietary standards for Indian languages. The problem with proprietary standards is that once the user creates content using them, there is no way to access that information if the tools based on them become obsolete.
Another reason for the current state of affairs is that Indian researchers have concentrated on intellectually more challenging problems like machine translation, speech recognition, text-to-speech and speech-to-text interfaces and optical character recognition, rather than on developing sustainable computing standards for Indian languages that would lead to the development of tools for wider use.

4. Localization
A major issue is the localization of operating systems and applications in Indian languages. Localization means making the user interface of operating systems and applications available in the language of each country or region. Localization has not been taken seriously in India all these years. It is only in the recent past that localization of operating systems like Windows and Linux, and of other applications working on these platforms, has been started. The IndLinux project aims to create a distribution of Linux localized to Indian languages (IndLinux project, 2007). The Department of Information Technology, GOI, is also participating in World Wide Web Consortium (W3C) activities, having become an affiliate member of the World Wide Web Consortium. A project “Web internationalization initiative” has been initiated
with the objective of adequate representation of Indian scripts/languages in the web technology standards being evolved by W3C. The W3C India Office has been set up at C-DAC, Noida. It is even being debated whether localization of operating systems and applications to Indian languages is worth the effort, because people who are acquainted with computers are already familiar with the English interface. But one should keep in mind that the majority of the population in India is still rural-based, where teaching and learning would be in local languages and where communities are literate but still not familiar with English. So localization efforts are definitely worthwhile to bring this population into the IT-savvy fold.

5. Encoding standards for Indian languages
The two main standards in character representation of Indian languages are ISCII and Unicode.
(1) ISCII (Bureau of Indian Standards, 1991) – ISCII is an 8-bit code. It covers ten Indian scripts (Devanagari, Gujarati, Punjabi, Bengali, Assamese, Oriya, Telugu, Tamil, Malayalam, Kannada). ISCII extends ASCII and uses the upper 128 code positions to represent the characters of Indian scripts. The arrangement of characters is phonetic.
(2) Unicode and Indian languages – The Unicode Consortium was initiated in January 1991, under the name Unicode, Inc., to promote the Unicode Standard as an international encoding system for information interchange, to aid in its implementation, and to maintain quality control over future revisions (Unicode Standard, 2007). At the time of writing, Unicode is in version 5.0.0. The Unicode standard provides three encoding forms: UTF-8, UTF-16 and UTF-32. Any one of these forms can be used to represent the Unicode characters, and each is used in different environments. The standard does not mandate one form over the others; UTF-16 is used internally by platforms such as Windows and Java, while UTF-8 predominates on the web. Operating system level support for Unicode encoding of Indian language scripts is available on both Windows XP and Linux.
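To make the difference between the three encoding forms concrete, the short Python sketch below encodes one Telugu word in each form and counts the bytes. The sample word and the helper names `encoded_sizes` and `in_telugu_block` are illustrative assumptions, not part of any system discussed in this paper; the Telugu block range U+0C00 to U+0C7F is the one published in the Unicode code charts.

```python
# Illustrative sketch: compare the three Unicode encoding forms for Telugu text.
# The helper names below are assumptions made for this example.

# The word "Telugu" in Telugu script, spelled with explicit code points:
# TA, vowel sign E, LA, vowel sign U, GA, vowel sign U.
TELUGU_WORD = "\u0c24\u0c46\u0c32\u0c41\u0c17\u0c41"


def encoded_sizes(text):
    """Return the number of bytes the text occupies in each encoding form."""
    # The "-le" codecs fix the byte order, so no byte-order mark is added.
    return {
        "UTF-8": len(text.encode("utf-8")),
        "UTF-16": len(text.encode("utf-16-le")),
        "UTF-32": len(text.encode("utf-32-le")),
    }


def in_telugu_block(text):
    """True if every character lies in the Telugu block U+0C00..U+0C7F."""
    return all(0x0C00 <= ord(ch) <= 0x0C7F for ch in text)


if __name__ == "__main__":
    print(TELUGU_WORD, encoded_sizes(TELUGU_WORD))
    print("All characters in Telugu block:", in_telugu_block(TELUGU_WORD))
```

Each Telugu code point takes three bytes in UTF-8, two in UTF-16 and four in UTF-32, so the six code points of the sample word occupy 18, 12 and 24 bytes respectively. A check like `in_telugu_block` is one simple way to tell Unicode-encoded Telugu apart from text stored in a font-based proprietary encoding, which typically reuses Latin code positions.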
Unicode fonts for many of the Indian languages are now available. In addition, HTML supports Unicode.

6. Search and retrieval in Indian languages
Most search engines can index and search English documents, and some, like Altavista and Google, also support European languages such as Greek, French and German. Many search engines and digital library software like DSpace do support Indian scripts; for example, Google has come up with search features in five Indian languages, i.e. Hindi, Bengali, Telugu, Marathi and Tamil (www.google.co.in). However, they do not support “stemming” for Indian languages; consequently, one can only make exact keyword searches. The complexities of the grammar, syntax, morphology and script of Indian languages are the main barriers in developing search algorithms for these languages. The approaches and methodology adopted for the English language are not adequate for processing Indian language queries. Some of the complexities presented when working with Indian languages are:
• Multiplicity of spellings of words, i.e. one word can have variant spellings.
• Many languages being represented by one script, e.g. the Devanagari script supports many languages like Hindi, Nepali, Sanskrit, Marathi, etc.
• Use of synonyms and colloquial terms.
• Word variations: the same word varies in its manifestation for different numbers, genders and tenses (demonstrated in examples/tables in the following sections).
In spite of their diversity, almost all the scripts are derived from Brahmi and the order of the alphabet in all the scripts is similar. They also share some common characteristics: a common phonetic-based alphabet; non-linear and complex scripts; free word order; and no cases (upper or lower).
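The limitation of exact keyword search noted in this section can be illustrated with a small sketch. The Python code below builds a toy inverted index over three romanized case forms of the Telugu word for "house" (the forms listed in Table I) and runs an exact search and a truncation (prefix) search. The function names and the three one-word "documents" are assumptions made purely for illustration; they do not describe how DSpace, Google or any other system mentioned here is implemented.

```python
# Illustrative sketch: exact and truncation search over transliterated Telugu.
# Function names and the toy documents are assumptions made for this example.


def build_index(docs):
    """Map each whitespace-separated token to the ids of documents containing it."""
    index = {}
    for doc_id, text in docs.items():
        for token in text.split():
            index.setdefault(token, set()).add(doc_id)
    return index


def exact_search(index, term):
    """Return ids of documents containing exactly this token."""
    return sorted(index.get(term, set()))


def truncation_search(index, prefix):
    """Return ids of documents with any token starting with the prefix."""
    hits = set()
    for token, doc_ids in index.items():
        if token.startswith(prefix):
            hits |= doc_ids
    return sorted(hits)


# Case forms of "house" in romanized Telugu; capital letters stand for
# different sounds, so tokens are deliberately not lower-cased.
DOCS = {1: "illu", 2: "intiki", 3: "iLLaki"}
INDEX = build_index(DOCS)

if __name__ == "__main__":
    print(exact_search(INDEX, "illu"))       # [1]
    print(truncation_search(INDEX, "ill"))   # [1]
```

Both searches for the nominative form find only document 1, even though documents 2 and 3 contain the dative singular and dative plural of the same noun. Truncation does not help because the stem itself changes (illu becomes inti in the oblique), which is why some form of morphological analysis is needed before indexing or query expansion.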
7. Telugu
Telugu is one of the 22 officially recognized languages of India. Telugu is a member of the Telugu languages, which are part of the South-central branch of the Dravidian languages (the other Telugu languages being Chenchu, Savara and Waddar). In the Telugu language the stem/root of a word is known as “Dhaathu”. The Dhaathu or stem undergoes many modifications for plural/singular forms, gender, tense, dative and accusative cases, and animate and inanimate objects. This is explained with an example below (see Table I).

Number | Case | Inanimate noun | Animate noun
Singular | Basic stem (nominative) | illu (house) | snehithudu (friend)
Singular | Oblique stem (genitive) | inti (of a house) | snehithudi (of a friend)
Singular | Accusative | illu (house) | snehithunni or snehithudini (friend)
Singular | Dative | intiki (to a house) | snehithudiki (to a friend)
Plural | Basic stem (nominative) | iLLu (houses) | snehithulu (friends)
Plural | Oblique stem (genitive) | iLLa (of houses) | snehithula (of friends)
Plural | Accusative | iLLu (houses) | snehithulani/nu (friends)
Plural | Dative | iLLaki/ku (to houses) | snehithulaki/ku (to friends)
Table I. Example of stem/root word modifications in Telugu

The example discusses the postpositions, i.e. the Dative (ki/ku) and Accusative (ni/nu) suffixes. The Dative suffixes ki and ku add the sense of “to” or “for” to the basic stems of words. The Accusative suffixes ni and nu
denote the object of the sentence. When the object is inanimate (like illu, meaning house, in the example), the Accusative case is the same as the nominative; use of the suffix is optional for inanimate objects. But nouns denoting animate objects (like snehithudu, meaning friend, in the example) have to take the Accusative suffix. Many variations and transformations occur in a Telugu word due to sandhi formations, vibhakthis and samasas. All these variations and transformations have to be resolved by morphological analysis of the word to arrive at its basic stem. A comparative study of the search algorithms in English and Telugu is presented in Table II.

Search algorithm | English | Telugu
Representation | ASCII, Unicode compatible | ISCII and Unicode compatible
Exact search | Possible | Possible
Truncation | Simple | Requires morphological analysis
Spelling variations | British and American, e.g. colour and color | No spelling variants
Variant words | Already identified, e.g. manage, managed, managing, management | Need to be identified, e.g. Ramunichetha, ramunivalla
Thesaurus | Readily available, general as well as subject-specific | Needs to be explored
Embedded words | Morphological analysis of prefixes, suffixes and roots | More complicated because of ‘vibhakthis’, ‘samasas’, ‘sandhis’
Tolerance to error | Books on common spelling mistakes readily available | Not readily available
Transliteration | Complex | Fairly easy within Indian languages, though not without problems
Table II. Comparative study of search algorithms in English and Telugu

8. Issues in developing algorithms for Telugu language
Studies are still ongoing to develop efficient information retrieval techniques in Indian languages. Work on search algorithms for the Telugu language in particular is going on at very few Indian research institutes, notably at the Language Engineering Research Centre (LERC) of the University of Hyderabad and the Language Technologies Research Centre (LTRC) of the Indian Institute of Information Technology (IIIT), Hyderabad. Kumar and Murthy (2008) have adopted a corpus-based statistical approach. A Telugu text corpus already developed and analysed by LERC was used in this study. This corpus consisted of approximately 40 million words and 3,320,920 distinct word forms. In this work, rules for syllabification have been worked out for Telugu, tested and refined. The study is based on the assumption that words are sequences of syllables and that morpheme boundaries coincide with syllable boundaries. Some of the rules for syllabification in this study were taken from the literature and a few were proposed by the researchers, based on a study they conducted with native speakers of Telugu. Three approaches were taken by the researchers:
(1) Heuristic stemmer: this was based on the premise that “the best place to cut a word into a root and a suffix is the one that globally maximizes the probability of the root as also that of the suffix”. The score that gave the best results was:

$$\text{Score} = \frac{2PS}{P + S}$$
$$P = \text{frequency of prefix} \times \text{length of prefix} + 0.5$$
$$S = \text{frequency of suffix} \times \text{length of suffix} + 0.5$$

This stemmer gave an accuracy of 70.8 per cent.
(2) N-gram based stemmer: Telugu being a suffixing language, the inflected and derived words generally match the initial portions of the root word. In all, 267,502 clusters were obtained based on the word-initial bi-grams. The smallest word found within a cluster is taken as the root word or lemma. This stemmer gave an accuracy of 65.4 per cent.
(3) Suffix tree approach: here words are represented as a suffix tree. For each word and for each possible prefix, the successor variety (Nascimento and da Cunha, 1998) is calculated. It was observed in this study that this criterion alone does not work well for Telugu, so a set of heuristics was used to decide which among the first four maxima was taken for stemming. This stemmer gave an accuracy of 74.5 per cent.

It was conceded, however, that while this works to the percentage of accuracy indicated above, it fails in cases of exceptions, which occur often enough in Telugu, and also in the case of sandhis (the joining of two or more words that results in changed word forms).

Due to the peculiarities of Indian languages in general and Telugu in particular, as discussed above, stemming and stopping algorithms meant for the English language cannot simply be adopted for Telugu. For example, when searching for “transform”, a stemmer could be used so that the query is equated to “transformations”, “transformed” or “transforms”. A simpler example: a search for the term “apple” also retrieves “apples”. But in Telugu, consider the plural noun banDLu; the singular that a stemmer should arrive at is banDi. Here, the last character of the plural is a compound character, so the singular cannot be recovered by simply removing a plural ending.
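The score used by the heuristic stemmer in item (1) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation of Kumar and Murthy (2008): it splits romanized words at character positions rather than at syllable boundaries, it takes "frequency" to mean the number of corpus words that share a given prefix or suffix, and the seven-word corpus and all function names are hypothetical.

```python
# Illustrative sketch of the prefix/suffix score used by the heuristic stemmer.
# Assumptions: romanized words, character-level splits, a seven-word toy corpus.
from collections import Counter

TOY_CORPUS = ["intiki", "intini", "intilo", "baDiki", "baDini", "guDiki", "iLLaki"]


def count_affixes(words):
    """Count how many corpus words share each proper prefix and each proper suffix."""
    prefix_freq, suffix_freq = Counter(), Counter()
    for word in words:
        for cut in range(1, len(word)):
            prefix_freq[word[:cut]] += 1
            suffix_freq[word[cut:]] += 1
    return prefix_freq, suffix_freq


def score(p, s):
    """Score = 2PS / (P + S), i.e. the harmonic mean of P and S."""
    return 2 * p * s / (p + s)


def best_split(word, prefix_freq, suffix_freq):
    """Return the (root, suffix) pair with the highest score over all cut points."""
    best_score, best_pair = -1.0, (word, "")
    for cut in range(1, len(word)):
        root, suffix = word[:cut], word[cut:]
        p = prefix_freq[root] * len(root) + 0.5
        s = suffix_freq[suffix] * len(suffix) + 0.5
        current = score(p, s)
        if current > best_score:
            best_score, best_pair = current, (root, suffix)
    return best_pair


PREFIX_FREQ, SUFFIX_FREQ = count_affixes(TOY_CORPUS)

if __name__ == "__main__":
    print(best_split("intiki", PREFIX_FREQ, SUFFIX_FREQ))  # ('inti', 'ki')
```

On this toy corpus the dative form intiki is cut into the oblique stem inti and the suffix ki, because that split balances a frequent, long root against a frequent suffix. With so few words the counts are only illustrative; the accuracy figures quoted above were obtained on a corpus of about 40 million words, and other words in the toy list are not necessarily split at a linguistically correct boundary.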
Variations in gender and tense present further complications. Telugu has two genders, masculine and non-masculine. There is no feminine gender. Nouns denoting female persons are treated as non-masculine in the singular, but in the plural they are treated as masculine (see Table III). Because grammar was codified much later than the spoken language, exceptions to grammatical rules are natural. For instance, though there are some set rules for plural formation in Telugu, there are also a large number of exceptions. The same word can have more than one plural form, e.g. kaNDlu and kaLLu are alternate plurals of the same singular noun kannu; or the same singular word, having different meanings depending on context, can form different plurals. For example, the singular noun pannu has two different meanings, tooth and tax; it forms the plural paLLu for the former (teeth) and pannulu for the latter (taxes).

9. Conclusion
Humans infer the semantics of a sentence even if the speaker does not pronounce the words distinctly. Machines are not blessed with such intuitive learning. To make retrieval in Indian languages meaningful, search engines will have to understand the intricacies and nuances of the language. This may not mean pragmatic language understanding as aimed at in natural language processing (NLP), but at least
Telugu (transliterated) | English | Form
sneehitudu | Friend | Masculine noun, singular
sneehitulu | Friends | Masculine noun, plural
naa sneehitudu waccaaDu | My friend has come | Past tense using masculine noun, singular
naa sneehitudu wastaaDu | My friend will come | Future tense using masculine noun, singular
naa sneehitulu waccaaru | My friends have come | Past tense using masculine noun, plural
naa sneehitulu wasaaru | My friends will come | Future tense using masculine noun, plural
kuuturu | Daughter | Feminine noun, singular
kuutuLLu | Daughters | Feminine noun, plural
aame kuuturu chadutundi | Her daughter is studying | Present tense using feminine noun, singular
aame kuutuLLu chadutunaru | Her daughters are studying | Present tense using feminine noun, plural
Table III. For example
morphological understanding is essential, as it gives important clues for developing tools like stemmers and stoppers for Indian languages.

References
Brendler, G. (1970), “The multilingual thesaurus: a tool for rationalising information flow”, Informatika, Vol. 17 No. 4, pp. 19-24.
Bureau of Indian Standards (BIS) (1991), Indian Script Code for Information Interchange (ISCII) ISCII-91 or IS13194:1991, Bureau of Indian Standards, New Delhi.
Census of India (2001), Literacy Rate, Registrar General & Census Commissioner, New Delhi, available at: www.censusindia.gov.in/Census_Data_2001/India_at_Glance/literates1.aspx (accessed 13 October 2007).
Digital Library of India (DLI) (2007), available at: www.dli.iiit.ac.in (accessed 28 August 2007).
English on the internet (2007), available at: www.wikipedia.org (accessed 28 August 2007).
Gordon, R.G. Jr (Ed.) (2005), Ethnologue: Languages of the World, 15th ed., SIL International, Dallas, TX, available at: www.ethnologue.com/ (accessed 13 October 2007).
IndLinux Project (2007), available at: http://indlinux.org (accessed 28 August 2007).
Kumar, S. and Murthy, K.N. (2008), “Corpus-based statistical approaches for stemming Telugu”, Journal of Language Technology, April 2007-January 2008, available at: http://tdil.mit.gov.in/april-jan-2008/8.12_Corrpus_based_Statistical.pdf (accessed 4 December 2008).
Nascimento, M.A. and da Cunha, A.C.R. (1998), “An experiment stemming non-traditional text”, Proceedings, SPIRE’98, Santa Cruz de La Sierra, Bolivia, September 1998, pp. 75-80.
National Mission for Manuscripts (NAMAMI) (2007), available at: www.namami.org (accessed 28 August 2007).
Tata ATC (Advanced Technology Center) (2007), Indian Language Computing, available at: www.atc.tcs.co.in/indic-computing (accessed 13 October 2007).
Technology Development for Indian Languages (TDIL) (2007), available at: http://tdil.mit.gov.in/ (accessed 13 October 2007).
Tripathi, A. (2004), “Design and development of multilingual information retrieval system with numeric MARC”, doctoral dissertation.
Unicode Standard (2007), available at: www.unicode.org/standard/standard.html (accessed 14 October 2007).
Wellisch, H.H. (1978), “Multiscript and multilingual bibliographic control alternatives to Romanization”, Library Resources and Technical Services, Vol. 22 No. 2, pp. 179-90.
Zielinska, M. (1976), “Multilingual biblioservice”, Canadian Library Journal, Vol. 33 No. 5, pp. 441-3, 445.

About the authors
Devika P. Madalli is based at the Documentation Research and Training Centre, Indian Statistical Institute, Bangalore, India. Devika P. Madalli is the corresponding author and can be contacted at:
[email protected] Dimple Patel is based in the Department of Library & Information Science at Osmania University, Hyderabad, India.