Hindi Part of Speech Tagging and Translation

Int. J. Tech. 2011; Vol. 1: Issue 1, Pg 29-32
ISSN 2231-3907 (Print), 2231-3915 (Online)
www.enggresearch.net

RESEARCH ARTICLE

Shachi Mall and Umesh Chandra Jaiswal
Madan Mohan Malaviya Engineering College, Gorakhpur, Uttar Pradesh, India
*Corresponding Author E-mail: [email protected], [email protected]

ABSTRACT:

Words are divided into different classes called parts of speech (POS; Latin pars orationis), word classes, morphological classes, or lexical tags[4]. Traditional grammars distinguish only a few parts of speech (noun, verb, adjective, preposition, adverb, conjunction, etc.), whereas many recent models use much larger sets of word classes. Part-of-speech tagging (POS tagging or POST), also called grammatical tagging, is the process of marking up each word in a text as corresponding to a particular part of speech, based on both its definition and its context, i.e., its relationship with adjacent and related words in a phrase, sentence, or paragraph [6]. Equivalently, tagging is the process of assigning a part-of-speech or other lexical-class marker to each word in a corpus. Tags are also usually applied to punctuation markers; thus tagging for natural language is analogous to tokenization for computer languages, although tags for natural languages are much more ambiguous[5]. Taggers play an increasingly important role in speech recognition, natural language parsing, and information retrieval.

KEYWORDS: Tagging, verbs, POS, morphological, punctuation.

INTRODUCTION:

The definition of parts of speech has been based on morphological and syntactic function; words that function similarly with respect to the affixes they take (their morphological properties) or with respect to what can occur nearby (their distributional properties) are grouped into classes. While word classes do tend toward semantic coherence (nouns often describe people, places, or things, and adjectives often describe properties), this is not necessarily the case, and semantic coherence is not used as a definitional criterion for parts of speech[1]. Parts of speech can be divided into two broad supercategories: closed class types and open class types. Closed classes are those with relatively fixed membership. For example, prepositions are a closed class because there is a fixed set of them in English; new prepositions are rarely added. By contrast, nouns and verbs are open classes because new nouns and verbs are continually coined or borrowed from other languages. Any given speaker or corpus is likely to have different open class words, but all speakers of a language, and corpora that are large enough, will likely share the same set of closed class words[9].

Received on 05.04.2011; Accepted on 18.04.2011
© EnggResearch.net. All rights reserved.
Int. J. Tech. 1(1): Jan.-June 2011; Page 29-32

Closed class words are generally also function words; function words are grammatical words like of, it, and, or you, which tend to be very short, occur frequently, and play an important role in grammar. There are four major open classes that occur in the languages of the world: nouns, verbs, adjectives, and adverbs. English has all four of these, although not every language does; many languages have no adjectives[10]. Every known human language has at least the two categories noun and verb (although in some languages, for example Nootka, the distinction is subtle). Noun is the name given to the lexical class in which the words for most people, places, or things occur[4]. But since lexical classes like noun are defined functionally (morphologically and syntactically) rather than semantically, some words for people, places, and things may not be nouns, and conversely some nouns may not be words for people, places, or things. Thus nouns include concrete terms like ship and chair as well as abstractions like bandwidth and relationship. What defines a noun in English, then, are properties like its ability to occur with determiners (a goat, its bandwidth, Plato's Republic), to take possessives (IBM's annual revenue), and, for most but not all nouns, to occur in the plural (goats, abaci). Nouns are traditionally grouped into proper nouns and common nouns. Proper nouns, like Regina, Colorado, and IBM, are names of specific persons or entities[8]. In English, they generally are not preceded by articles (e.g., the book is upstairs, but Regina is upstairs), and in written English proper nouns are usually capitalized.



In many languages, including English, common nouns are divided into count nouns and mass nouns. Count nouns are those that allow grammatical enumeration; that is, they can occur in both the singular and the plural (goat/goats, relationship/relationships) and they can be counted (one goat, two goats). Mass nouns are used when something is conceptualized as a homogeneous group[3], so words like snow, salt, and communism are not counted (i.e., *two snows or *two communisms). Mass nouns can also appear without articles where singular count nouns cannot (Snow is white but not *Goat is white).

The verb class includes most of the words referring to actions and processes, including main verbs like draw, provide, differ, and go. English verbs have a number of morphological forms: non-3rd-person-singular (eat), 3rd-person-singular (eats), progressive (eating), and past participle (eaten). The third open class in English is adjectives; semantically this class includes many terms that describe properties or qualities. Most languages have adjectives for the concepts of color (white, black), age (old, young), and value (good, bad), but there are languages without adjectives.

The closed classes differ more from language to language than do the open classes. Here is a quick overview of some of the more important closed classes in English, with a few examples of each[9]:
1. prepositions: on, under, over, near, by, at, from, to, with
2. determiners: a, an, the
3. pronouns: she, who, I, others
4. conjunctions: and, but, or, as, if, when
5. auxiliary verbs: can, may, should, are
6. particles: up, down, on, off, in, out, at, by
7. numerals: one, two, three, first, second, third

PART OF SPEECH TAGGING:
Part-of-speech tagging (or just tagging for short) is the process of assigning a part-of-speech or other lexical-class marker to each word in a corpus. Tags are also usually applied to punctuation markers; thus tagging for natural language is analogous to tokenization for computer languages, although tags for natural languages are much more ambiguous. Taggers play an increasingly important role in speech recognition, natural language parsing, and information retrieval. The input to a tagging algorithm is a string of words and a specified tagset of the kind described in the previous section. The output is a single best tag for each word. For example, here are two sample sentences from the ATIS corpus of dialogues about air-travel reservations, each with a potential tagged output using the Penn Treebank tagset:

Book/VB that/DT flight/NN ./.
Does/VBZ that/DT flight/NN serve/VB dinner/NN ?/.

Even in these simple examples, automatically assigning a tag to each word is not trivial. For example, book is ambiguous: it has more than one possible usage and part of speech. It can be a verb (as in book that flight or to book the suspect) or a noun (as in hand me that book, or a book of matches). Similarly, that can be a determiner (as in Does that flight serve dinner?) or a complementizer (as in I thought that your flight was earlier). The problem of POS tagging is to resolve these ambiguities, choosing the proper tag for the context. Most words in English are unambiguous; they have only a single tag. But many of the most common words of English are ambiguous (for example, 'can' can be an auxiliary ('to be able'), a noun ('a metal container'), or a verb ('to put something in such a metal container'))[10]. Only 11.5% of English word types in the Brown Corpus are ambiguous, but over 40% of Brown tokens are ambiguous.
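To make this ambiguity concrete, the following minimal Python sketch (illustrative only, not the authors' implementation) tags the two ATIS-style examples with a toy lexicon and a single context rule; the lexicon entries and the rule are assumptions made for this example.

# Toy illustration of lexical ambiguity in POS tagging.
# "book" and "that" each admit more than one Penn Treebank tag;
# a single context rule selects one tag per token.

LEXICON = {
    "book":   ["NN", "VB"],   # noun ("a book of matches") or verb ("book that flight")
    "that":   ["DT", "IN"],   # determiner or complementizer
    "flight": ["NN"],
    "does":   ["VBZ"],
    "serve":  ["VB"],
    "dinner": ["NN"],
    ".":      ["."],
    "?":      ["."],
}

def tag(tokens):
    """Pick the first lexicon tag for each token, except that a
    sentence-initial 'book' followed by a determiner is read as a verb."""
    tagged = []
    for i, tok in enumerate(tokens):
        candidates = LEXICON.get(tok.lower(), ["NN"])  # unknown words default to NN
        choice = candidates[0]
        if (tok.lower() == "book" and i == 0
                and i + 1 < len(tokens)
                and LEXICON.get(tokens[i + 1].lower(), [""])[0] == "DT"):
            choice = "VB"  # imperative reading: "Book that flight."
        tagged.append((tok, choice))
    return tagged

print(tag(["Book", "that", "flight", "."]))
# [('Book', 'VB'), ('that', 'DT'), ('flight', 'NN'), ('.', '.')]
print(tag(["Does", "that", "flight", "serve", "dinner", "?"]))
# [('Does', 'VBZ'), ('that', 'DT'), ('flight', 'NN'), ('serve', 'VB'), ('dinner', 'NN'), ('?', '.')]

Real taggers replace such a hand-written rule with statistical context models, for example the HMM and maximum-entropy approaches cited in the references[2][4].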

Hindi POS Tagger

Figure 1: Hindi POS tagger

Module Description:

Module 1: Hindi File Read
This module reads a Hindi (Unicode) corpus, either by browsing to a text file on the drive or by letting the user enter the Hindi corpus directly into the edit box. It then counts the total number of lines, sentences, and words in the given corpus and displays the whole corpus in the edit box, along with the file path if a file was browsed[4].
Input: Hindi (Unicode) corpus
Processing: If the corpus is in Hindi, this module reads the file, counts the total number of lines, sentences, and words, and shows the file path when a text file is browsed.



Output: Displays the corpus, the number of lines, sentences, and words, and shows the file path when a file is browsed.
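As a rough sketch of Module 1 (not the authors' code), the Python snippet below reads a UTF-8 Hindi text file and reports the counts described above; the file name and the assumption that sentences end with the Devanagari danda '।', '?', or '!' are illustrative.

# Sketch of Module 1 (Hindi File Read): load a Hindi (Unicode) corpus
# and count lines, sentences, and words. Sentence delimiters are assumed
# to be the Devanagari danda '।', '?', and '!'.
import re

def read_hindi_corpus(path):
    with open(path, encoding="utf-8") as f:
        text = f.read()
    return {
        "path": path,
        "text": text,
        "lines": len(text.splitlines()),
        "sentences": len([s for s in re.split(r"[।?!]", text) if s.strip()]),
        "words": len(text.split()),
    }

stats = read_hindi_corpus("hindi_corpus.txt")   # hypothetical file name
print(stats["lines"], stats["sentences"], stats["words"], stats["path"])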

Module 2: Tokenizer

Sub-Module 2.1: Sentence Extraction
This module splits the input Hindi (Unicode) corpus into sentences according to the sentence delimiter.
Input: Hindi corpus
Processing: Split the corpus into sentences according to the delimiter and store them in the sentence-banker table.
Output: Display the split Hindi sentences. (A combined sketch of both tokenizer sub-modules follows Sub-Module 2.2.)

Sub-Module 2.2: Word Tokenizer
This module splits each extracted sentence into words according to the space delimiter.
Input: Extracted Hindi sentence
Processing: Split the sentence according to the space delimiter and store the words in the lexicon table.
Output: Display the split Hindi words.
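A combined sketch of Sub-Modules 2.1 and 2.2 (again illustrative, not the authors' code): sentences are split on the assumed danda delimiter and each sentence is split into words on whitespace; the plain Python lists stand in for the sentence-banker and lexicon tables, and the two-sentence Hindi corpus is a made-up example.

# Sketch of Module 2 (Tokenizer): Sub-Module 2.1 splits the corpus into
# sentences on the assumed delimiters; Sub-Module 2.2 splits each sentence
# into words on whitespace.
import re

SENTENCE_DELIMITERS = r"[।?!]"

def extract_sentences(corpus):
    """Sub-Module 2.1: Sentence Extraction."""
    return [s.strip() for s in re.split(SENTENCE_DELIMITERS, corpus) if s.strip()]

def tokenize_words(sentence):
    """Sub-Module 2.2: Word Tokenizer (space delimiter)."""
    return sentence.split()

corpus = "राम स्कूल जाता है। सीता किताब पढ़ती है।"   # hypothetical two-sentence corpus
sentence_bank = extract_sentences(corpus)          # stands in for the sentence-banker table
lexicon_table = [tokenize_words(s) for s in sentence_bank]
print(sentence_bank)
print(lexicon_table)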




Module 3: Tagging
This module (Fig. 2) tags each Hindi word in the sentence with its related tag, such as Noun, Adjective, Postposition, Verb, Adverb, Conjunction, Pronoun, Number, etc. If a Hindi word does not fall into any part-of-speech category, it is tagged "No_Tag". The module also identifies and displays tag patterns, namely the start pattern, mid pattern, and end pattern.
Input: Extracted Hindi sentence
Processing: Tag each word of the input sentence.
Output: Display the tagged output, search for tag patterns, and save the tag structure to an HTML file.
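The following sketch illustrates the lookup-style tagging that Module 3 describes, with the "No_Tag" fallback for words outside every category. The tiny Hindi lexicon, the tag abbreviations (NOUN, PP, VERB, PUNC), and the example words are assumptions for illustration, not the authors' resources.

# Sketch of Module 3 (Tagging): tag each Hindi word by lexicon lookup,
# falling back to "No_Tag" for unknown words, and build the tag structure.
HINDI_LEXICON = {            # hypothetical entries for illustration
    "राम": "NOUN",
    "ने": "PP",              # postposition
    "किताब": "NOUN",
    "को": "PP",
    "मेज़": "NOUN",
    "पर": "PP",
    "रखा": "VERB",
    "।": "PUNC",
}

def tag_sentence(words):
    return [(w, HINDI_LEXICON.get(w, "No_Tag")) for w in words]

def tag_structure(tagged):
    return "+".join(t for _, t in tagged)

tagged = tag_sentence(["राम", "ने", "किताब", "को", "मेज़", "पर", "रखा", "।"])
print(tagged)
print(tag_structure(tagged))   # NOUN+PP+NOUN+PP+NOUN+PP+VERB+PUNC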

Fig. 2: Layout of the proposed system

Sub-Module 3.1: Tag Pattern Identifier
This sub-module identifies the start, mid, and end tag patterns of a tagged sentence. For the example sentence shown in Fig. 3, the identified patterns and tag structure are:
Start pattern: NOUN
Mid pattern: NOUN+PP
End pattern: PUNC
Tag structure: NOUN+PP+NOUN+PP+NOUN+PP+NOUN+PUNC

Fig. 3: Input taken
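A small sketch of how the tag pattern identifier might derive the start, mid, and end patterns and the full tag structure from a tag sequence. Treating the mid pattern as the most frequent adjacent tag pair is an assumption; the source only shows the resulting labels in the figure.

# Sketch of Sub-Module 3.1 (Tag Pattern Identifier): derive start, mid,
# and end patterns plus the full tag structure from a tag sequence.
# "Mid pattern" is assumed to mean the most frequent adjacent tag pair.
from collections import Counter

def tag_patterns(tags):
    pairs = list(zip(tags, tags[1:]))
    if pairs:
        counts = Counter(pairs)
        mid = "+".join(max(pairs, key=lambda p: counts[p]))  # ties go to the earliest pair
    else:
        mid = ""
    return {
        "start": tags[0],
        "mid": mid,
        "end": tags[-1],
        "structure": "+".join(tags),
    }

tags = ["NOUN", "PP", "NOUN", "PP", "NOUN", "PP", "NOUN", "PUNC"]
print(tag_patterns(tags))
# {'start': 'NOUN', 'mid': 'NOUN+PP', 'end': 'PUNC',
#  'structure': 'NOUN+PP+NOUN+PP+NOUN+PP+NOUN+PUNC'}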




Fig. 4: Save an input file
Fig. 5: Tagging
Fig. 6: Tagging with dictionary
Fig. 7: Translation from Hindi to English (1)
Fig. 8: Translation from Hindi to English (2)
Fig. 9: Translation from Hindi to English (2)

CONCLUSION:

We have presented a part-of-speech tagger and a translation system for Hindi to English[4]. We also discussed language-dependent as well as language-independent features suitable for Hindi POS tagging and chunking[2]. We have shown that such a system performs well, with an average accuracy of 88.4% for POS tagging and best accuracies of 89.35% and 87.39% for POS tagging and translation, respectively. We believe that further error analysis and more language-specific features would improve the system's performance, particularly in the case of translation.

REFERENCES:


1. J.N. Darroch and D. Ratcliff. 1972. Generalized iterative scaling for log-linear models. Annals of Mathematical Statistics, 43(5):1470-1480.
2. Adwait Ratnaparkhi. 1996. A maximum entropy model for part-of-speech tagging. In Eric Brill and Kenneth Church, editors, Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 133-142. Association for Computational Linguistics, Somerset, New Jersey.
3. Adwait Ratnaparkhi. 1997. A simple introduction to maximum entropy models for natural language processing. Technical Report 97-08, Institute for Research in Cognitive Science, University of Pennsylvania, May.
4. Akshay Singh, Sushma Bendre, and Rajeev Sangal. 2005. HMM based chunker for Hindi. In Proceedings of IJCNLP-05. Jeju Island, Republic of Korea, October.
5. P. Baudisch, B. Lee, and L. Hanna. Fishnet, a fisheye web browser with search term popouts: a comparative evaluation with overview and linear view. In Proceedings of the Working Conference on Advanced Visual Interfaces, pages 133-140. ACM, 2004.
6. J. Collins and Kaufer. Description of DocuScope, 2001. [Online]. http://betterwriting.net/projects/fed01/dsc fed01.html [Accessed: Aug. 9, 2010].
7. G. G. Robertson and J. D. Mackinlay. The Document Lens. In Proceedings of the 6th Annual ACM Symposium on User Interface Software and Technology (UIST '93), pages 101-108, 1993.
8. M. Sarkar, S. S. Snibbe, O. J. Tversky, and S. P. Reiss. Stretching the rubber sheet. ACM Press, New York, New York, USA, June 1993.
9. J. Slack, K. Hildebrand, T. Munzner, and K. John. SequenceJuxtaposer: Fluid navigation for large-scale sequence comparison in context. In German Conference on Bioinformatics, pages 37-42. Citeseer, 2004.
10. M. Witmore. The Funniest Thing Shakespeare Wrote? 767 Pieces of the Plays, 2010.

