Knowledge Based Machine Translation - IEEE Xplore

Knowledge Based Machine Translation
Semantically Enriched English-to-Urdu Machine Translation Using Data Mining Techniques

Ghulam Rasool Tahir1, Sohail Asghar2, Nayyer Masood3
Department of Computer Science, Muhammad Ali Jinnah University, Islamabad, Pakistan
1 [email protected], [email protected], [email protected]

Abstract: Machine translation, a part of computational linguistics, belongs to Natural Language Processing (NLP) and is an active topic in the computing community. The gap between the linguist and the computer programmer gives rise to many problems, such as lexical ambiguity, syntactic and structural ambiguity, polysemy, induction, discourse, anaphoric ambiguity and different shades of meaning. Most English-to-Urdu machine translation systems were developed without considering the target language, and existing systems do not include semantics. This shortcoming causes several issues during Natural Language Processing. In this paper we propose and design a new Knowledge Based Machine Translation System that addresses these problems using data mining and text mining techniques. Our machine translation system fulfills almost all the requirements of Natural Language Processing and computational linguistics. The system is designed primarily for Urdu, but it can be used for many other languages. The proposed system is expected to give better results than existing systems.

Keywords: machine translation; semantics; translation ambiguities; polysemy; data mining; bilingual; English-to-Urdu

I. INTRODUCTION

Translation, the art of rendering a work from one language into another, is as old as written literature. In modern civilization the need for translation is ever growing, and its importance in business, economics and industrialization cannot be ignored [1]. These needs, coupled with modern scientific advances, paved the way for machine translation, which is “an automatic translation of one language into another by means of a computer or another machine that contains a dictionary, along with the programs needed to make logical choices from semantics, supply missing words and rearrange word order as required for the new language” [2]. Urdu is widely spoken not only in South Asia but also in the West. In Pakistan it is the national language and is used for instruction in most government schools, at the lower levels of administration, and in the mass media. Urdu is also spoken in India, Bangladesh, Afghanistan and Nepal, and has become the culture language and lingua franca of the South Asian Muslim diaspora outside the subcontinent, particularly in the Middle East, Europe, the United States and Canada [3].

978-1-4244-8003-6/10/$26.00 ©2010 IEEE

English is important both as an international language and as a language in which much knowledge is recorded; a good translator can narrow the gap between languages. About 45% of the world's knowledge exists in English, whereas the remaining 55% is in other languages such as Russian, French, German, Arabic, Farsi and Urdu. Pakistan has a 37% literacy rate, and only 2% can read and understand English. The remaining 35% are shut out of the digital world, because they cannot understand English or communicate their views in it. How can we bridge the digital divide and eliminate digital illiteracy? Looking at the concepts of world knowledge, a total of seventeen million concepts have been found across 250 sciences. English has only 5.6 million terms and Urdu has 2 million. Because of this gap, over 80% of Pakistan’s population cannot benefit from the IT revolution [4]. Developing an English-to-Urdu machine translator is not an easy task [5]. Existing systems do not resolve ambiguities such as lexical, syntactic, polysemy, anaphoric and structural ambiguity [6]. The proposed system addresses these issues by integrating data mining, text mining and Natural Language Processing techniques [7]. Using data mining techniques, we can obtain association rules among verbs and nouns very effectively and can also resolve the issue of multiple parses of a sentence [8]. Self-Organizing Maps (SOMs) are useful for clustering text documents on the basis of their importance [9]. Most research all over the world is now done and published in English. For this reason, and given the survey above, machine translation can give us access to more knowledge, and our nation may progress rapidly by using machine translation rather than spending more time learning English. A machine translation system will be beneficial for all levels of government and the private sector, and for personal and academic research.
This paper proposes only an enhancement of existing machine translation systems: many such systems are on the market, but they have no notion of semantics. English-to-Urdu machine translation has many benefits. Knowledge that exists in English can easily be converted to Urdu, so people with little knowledge of English benefit most. It is also beneficial for the government and private sectors, which can translate their work into Urdu; with a few clicks their progress reports or messages, e.g. a budget report, are converted into Urdu for public use. If other languages are added, the Urdu-speaking community may be able to read any research in its own language. Web search engines now provide machine translation services to their users [8]. Machine translation is also beneficial for professional translators, students and knowledge seekers, and it carries weight in the localization market [10].

II. PROBLEM STATEMENT

Machine translation is a prominent field of computational linguistics, the branch of science that deals with language using computer science technology. In this field all processing of natural language is done by the machine (computer). Computation takes into account all known, possible and necessary principles of the syntax, semantics and morphology of the language. The machine should understand all these aspects of the language, but previous work does not handle them all during machine translation. Current online and desktop machine translation systems ignore many aspects of language during translation. As a result many ambiguities arise, such as lexical ambiguity, polysemy, syntactic ambiguity, structural ambiguity, anaphoric/reference ambiguity and discourse. Because of these ambiguities, current machine translators are unable to produce the right translation. The problem can be seen in the output of different machine translators in TABLE I below.

TABLE I. TRANSLATION BY DIFFERENT TRANSLATOR

| English Phrase | Translator Name | Translation Sample |
|---|---|---|
| The house which Abraham Lincoln was born is still standing. | Babylon 8 (http://translation.babylon.com/) | سنت ابراہیمی کے گھر جو لنکن میں پیدا ہوا تھا اب بھی بند ہے |
| | Worldlingo (www.worldlingo.com) | سنت ابراہیمی کے گھر جو لنکن پیدا ہوا تھا اب بھی بند ہیں. |
| | PakTranslations (www.paktranslations.com) | مکان کون سے Abraham Lincoln پیدے تھے ساکت کھڑا ہو رہے ہیں |
| | ApniUrdu (www.apniurdu.com) | (ابراحام Abraham) گھر (لنکولن Lincoln) (بورن born) ابھبھي کون (abhi bhi) سا اٹھرہا ہے |
| | MT by FAST-NU (www.crulp.org) | ان ازسٹل سٹایڈنگ دا ہاوس ویچ ابراہیم لنکولن وازبورن |
| Flesh of mango | MT by FAST-NU (www.crulp.org) | آم کا گوشت |
| | Babylon 8 | مٹی کے انبہ |
| | ApniUrdu | گوشت آم کا |

If we analyze the above translations, we can easily see that they are ambiguous, wrong and do not convey the right meaning. Evidently the lexical databases and algorithms of these systems are unable to resolve ambiguities.

III. PROPOSED MODEL

The proposed model is able to produce an almost correct translation by considering all known types of ambiguity. It is an enhanced form of the SAMPARK model of machine translation, which was developed in India [11]. SAMPARK is being developed primarily for translation among the official Indian languages, but with enhancement it can be used for English-to-Urdu machine translation. The SAMPARK model, shown in Fig. 1, consists of three levels: Source Analysis, Transfer and Target Generation.

Figure 1. ESAMPARK works in India[11]

First, we enhance it by adding semantics so that it can generate good and meaningful translations. The SAMPARK architecture fulfills its designers' requirements because most Indian languages are derived from Sanskrit, which is based on rules set down by Panini, the 4th century B.C. grammarian [11]. Even those Indian languages that are not derived from Sanskrit are structurally similar to others in India. This common foundation makes the translation from one Indian language to another easier than from, say, German to Chinese [11]. Because we are working on an English-to-Urdu translation system, we have further requirements, so we add another level, “Addition of Semantics”. The modified SAMPARK model is shown in Fig. 2.

Figure 2. Proposed model of machine translation based on data mining techniques

This added level is the one that resolves the ambiguities. Our machine translation system consists of six levels: two are simple source input and target output, and four are major levels, namely Source Analysis, Semantics Addition, Transfer and Target Generation. Most of our work focuses on semantics addition, which in turn requires changes in the other levels. The proposed model is elaborated below.

A. Source Analysis
This is the first phase, where processing starts and general analysis is performed. The following steps are performed in this phase.
• Tokenizer: Converts text into a sequence of tokens (words, punctuation marks, etc.).
• Morphological Analyzer: Uses rules to identify the root and grammatical features of a word. Splits the word into its root and grammatical suffixes.
• Parts of Speech Tagger: Based on statistical techniques, assigns a part of speech, such as noun, verb or adjective, to each word.
• Chunker: Uses statistical methods to identify and tag parts of a sentence, such as noun phrases, verb groups and adjectival phrases, and a rule base to give each a suitable chunk tag.
• Parser: Identifies and names the relations between a verb and its participants in the sentence, based on the computational grammar framework.

B. Adding Semantics
Most issues are solved at this level using intelligent data mining and text mining techniques. Our system may draw on existing lexical resources such as WordNet or the British National Corpus (BNC), and ultimately we shall have to develop an enriched bilingual lexical database (Bilingual Lexicon). WordNet is an online lexical reference system whose design is inspired by current psycholinguistic theories of human lexical memory [12]. The main issues, challenges and problems to be solved at this level are given below.

1) Discourse Analysis: Discourses are simply combinations or series of such units whose significance in turn derives from a corresponding combination of their meanings as represented. In this step the proposed model considers the source text with respect to the following properties [13]:
a) Cohesion - the grammatical relationship between parts of a sentence, essential for its interpretation;
b) Coherence - the order of statements relates them to one another by sense;
c) Intentionality - the message has to be conveyed deliberately and consciously;
d) Acceptability - the communicative product needs to be satisfactory in that the audience approves of it;
e) Informativeness - some new information has to be included in the discourse;
f) Situationality - the circumstances in which the remark is made are important;
g) Intertextuality - reference to the world outside the text or to the interpreters' schemata.
Fig. 3 shows the addition of the semantics.

Figure 3. Showing addition of semantics
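The first three Source Analysis steps of Section III-A (tokenizer, morphological analyzer, part-of-speech tagger) can be illustrated with a minimal Python sketch. It is only an illustration under stated assumptions, not the system's implementation: the suffix rules, the mini-lexicon and all function names are hypothetical, and the tagger is a plain lexicon lookup rather than the statistical tagger the model calls for. The chunker and parser are omitted.

```python
import re

# Hypothetical suffix rules (suffix, grammatical feature); a real analyzer
# would rely on a full morphological rule base.
SUFFIX_RULES = [
    ("ing", "progressive"),
    ("ed", "past"),
    ("es", "plural_or_3sg"),
    ("s", "plural_or_3sg"),
]

# Hypothetical mini-lexicon: root word -> part of speech. This stands in
# for the statistical tagger described in the text.
POS_LEXICON = {
    "a": "DET", "an": "DET", "the": "DET",
    "teacher": "NOUN", "student": "NOUN", "umbrella": "NOUN",
    "hit": "VERB", "with": "PREP",
}


def tokenize(text):
    """Split text into word tokens and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)


def analyze_morphology(token):
    """Return (root, feature); feature is None when no suffix rule applies."""
    word = token.lower()
    if word in POS_LEXICON:
        return word, None
    for suffix, feature in SUFFIX_RULES:
        root = word[: -len(suffix)]
        if word.endswith(suffix) and root in POS_LEXICON:
            return root, feature
    return word, None


def source_analysis(text):
    """Run tokenizer, morphological analyzer and lexicon-based tagger."""
    analysis = []
    for token in tokenize(text):
        root, feature = analyze_morphology(token)
        if not token[0].isalnum():
            pos = "PUNCT"
        else:
            pos = POS_LEXICON.get(root, "UNK")
        analysis.append(
            {"token": token, "root": root, "feature": feature, "pos": pos}
        )
    return analysis


if __name__ == "__main__":
    for item in source_analysis("A teacher hits a student with an umbrella."):
        print(item)
```

In this sketch the token "hits" is split into the root "hit" plus an "s" suffix feature, which is the kind of root-and-suffix information that the later Lexical Transfer step looks up in the bilingual dictionary.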

After general analysis of the source language, the flow of the system reaches the step “Adding Semantics”. In this step all issues (different discourses and ambiguities) are resolved, and the proposed system produces a correct translation by using Natural Language Processing and data mining techniques. While resolving ambiguities the model draws on corpora, namely the British National Corpus (BNC) and WordNet, so that it can reach maturity. The proposed system also has its own bilingual lexicon, which is enriched with as many properties of each word as possible (part-of-speech tags).

2) Lexical Ambiguity: Ideally, each word in a language should have a unique meaning, but in natural languages many words have two or more interpretations. When a sentence becomes ambiguous because of such a word, this is called lexical ambiguity [14]. This type of ambiguity can be resolved by syntactic analysis. Examples are given in TABLE II.

TABLE II. EXAMPLES OF LEXICAL AMBIGUITY

| English Word | Lexical Type | Meanings |
|---|---|---|
| Fly | Noun | Insect (حشراء) |
| Fly | Verb | To fly (اڑنا) |
| Can | Noun | A can of juice (مائع کے لیے برتن) |
| Can | Auxiliary | Power to do (سکنا) |

3) Polysemy: Lexical ambiguity in which a word has different meanings within the same lexical category is purely lexical in nature and cannot be resolved by syntactic analysis. This property of words is often termed polysemy. Semantic and contextual knowledge of the word usage is required to resolve it [14]. Examples are shown in TABLE III.

TABLE III. EXAMPLES OF POLYSEMY

| English Word | Lexical Type | Meanings |
|---|---|---|
| Bank | Noun Singular | A financial institute (مالیاتی ادارہ) |
| Bank | Noun Singular | Side of river (دریا کنارہ) |
| Cricket | Noun Singular | A game (کرکٹ) |
| Cricket | Noun Singular | An insect (جھینگر) |

4) Structural Ambiguity: A sentence has a syntactic or structural ambiguity if two or more structural interpretations can be assigned to it. Examples are shown in TABLE IV.

TABLE IV. EXAMPLE OF STRUCTURAL AMBIGUITY

| English Sentence | No. | Possible Interpretation in Urdu |
|---|---|---|
| I saw an astronomer with a telescope. | 1 | میں نے خلاباز کو دیکھا جس کے پاس دوربین تھی۔ |
| | 2 | میں نے خلاباز کو دوربین سے دیکھا۔ |
| A teacher hits a student with an umbrella. | 1 | استادنے طالب علم کو چھتری سے مارا۔ |
| | 2 | استادنے طالب علم کو، جس کے پاس چھتری تھی، مارا۔ |

5) Anaphoric/Reference Ambiguity: An anaphor refers to an object that has previously been mentioned in a discourse. A pronoun appearing in the sentence needs to be bound to its antecedent in order to remove the ambiguity involved [14]. Examples are illustrated in TABLE V below.

TABLE V. EXAMPLES OF ANAPHORIC AMBIGUITY

| English Sentence | No. | Urdu Translation | Anaphoric Word (which depends) |
|---|---|---|---|
| Aslam loves his mother. | 1 | اسلم اپنی ماں سے پیار کرتا ہے۔ | His |
| | 2 | اسلم اس کی ماں سے پیار کرتا ہے | |
| When Aslam and Tahir went to the hospital, and saw the hospital staff was busy and they told them to go home and rest. | | جب اسلم اور طاہر ہسپتال پہنچے اور دیکھاہسپتال کا عملہ مصروف تھا اور انہوں نے ان کو گھر جانے اور آرام کرنے کا کہا۔ | There are two words, “they” and “them”. Each may refer either to Aslam and Tahir or to the hospital staff. |
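The polysemy case of TABLE III ("bank") can be illustrated with a small Python sketch that learns association-style rules between context words and senses and then picks the sense best supported by the sentence. This is a hedged illustration, not the algorithm of the proposed system: the sense-labelled examples, the function names and the scoring (summing rule confidences) are all assumptions made for the example.

```python
from collections import Counter

# Urdu glosses for the two senses of "bank" listed in TABLE III.
FINANCIAL = "مالیاتی ادارہ"
RIVERSIDE = "دریا کنارہ"

# Hypothetical sense-labelled contexts for "bank"; a real system would mine
# these from a corpus such as the BNC.
TRAINING = [
    ({"deposit", "money", "account"}, FINANCIAL),
    ({"loan", "money", "interest"}, FINANCIAL),
    ({"river", "water", "fishing"}, RIVERSIDE),
    ({"river", "boat", "mud"}, RIVERSIDE),
]


def learn_rules(examples):
    """Return {(word, sense): confidence} for rules 'word -> sense'.

    confidence = count(word together with sense) / count(word)
    """
    word_count = Counter()
    pair_count = Counter()
    for context, sense in examples:
        for word in context:
            word_count[word] += 1
            pair_count[(word, sense)] += 1
    return {
        (word, sense): count / word_count[word]
        for (word, sense), count in pair_count.items()
    }


def choose_sense(sentence, rules, senses):
    """Pick the sense with the highest summed rule confidence, else None."""
    words = set(sentence.lower().replace(".", " ").split())
    scores = {
        sense: sum(rules.get((word, sense), 0.0) for word in words)
        for sense in senses
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


if __name__ == "__main__":
    rules = learn_rules(TRAINING)
    senses = [FINANCIAL, RIVERSIDE]
    print(choose_sense("He sat on the bank of the river.", rules, senses))
    print(choose_sense("She opened an account at the bank.", rules, senses))
```

Returning None when no context word matches keeps the sketch from guessing; in that case a fuller system would have to fall back on wider discourse context.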

These ambiguities are resolved by adding semantics. Extracting semantic relationships between entities mentioned in text documents is an important task in natural language processing. The various types of relationships that are discovered between mentions of entities can provide useful structured information to a text mining system [15].

C. Transfer
In this phase the query, enriched with semantics and lexical tags and free from all types of ambiguity, is transferred (translated) to the target language as a first draft. The steps involved are explained below.
1) Syntax Transfer: Converts the parse structure in the source language to the structure in the target language that gives the correct word order, as well as a change in structure, if any.
2) Lexical Transfer: Root words identified by the morphological analyzer are looked up in a bilingual dictionary for the target language equivalent.
3) Transliteration: Allows a source word to be rendered in the script of the target language. Useful in cases where translation fails for a word or a chunk.
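The three Transfer steps can be sketched in Python as follows. The sketch assumes a parse already reduced to subject, verb and object roots; the tiny bilingual lexicon, the letter map used for transliteration and every function name are hypothetical and serve only to show how the steps fit together. Inflection and case markers are deliberately left out, since they belong to Target Generation.

```python
# Hypothetical bilingual lexicon: English root -> Urdu root form.
LEXICON = {
    "mango": "آم",
    "mother": "ماں",
    "love": "پیار کرنا",
    "like": "پسند کرنا",
}

# Hypothetical and deliberately tiny Latin-to-Urdu letter map; a real
# transliterator needs context-sensitive rules for vowels and digraphs.
TRANSLIT = {"t": "ط", "a": "ا", "h": "ہ", "i": "", "r": "ر"}


def syntax_transfer(parse):
    """Reorder English subject-verb-object into Urdu subject-object-verb."""
    return [parse["subject"], parse["object"], parse["verb"]]


def transliterate(word):
    """Render a word letter by letter; unmapped letters pass through."""
    return "".join(TRANSLIT.get(ch, ch) for ch in word.lower())


def lexical_transfer(root):
    """Look the root up in the lexicon; transliterate when lookup fails."""
    return LEXICON.get(root.lower()) or transliterate(root)


def transfer(parse):
    """Produce a first-draft sequence of Urdu root forms in SOV order."""
    return [lexical_transfer(root) for root in syntax_transfer(parse)]


if __name__ == "__main__":
    draft = transfer({"subject": "Tahir", "verb": "like", "object": "mango"})
    print(" ".join(draft))
```

Here "Tahir" is absent from the lexicon and therefore falls through to transliteration, which mirrors the stated use of transliteration for words where translation fails.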

D. Target Generation
This step performs the finishing process. Two main functions are performed here: agreement and insertion of case markers.
1) Agreement: Performs gender-number-person agreement between related words in the target sentence.
2) Insertion of Case-marker: Adds postpositions and other markers that indicate the meanings of words in the sentence. In short, it is the process in which postposition markers such as “نے” and “کو” are added.

E. Output
The proposed model does not produce bidirectional translation (English-to-Urdu and Urdu-to-English); its output is only in the Urdu language, in editable form.

IV. PROPOSED OUTPUT

The proposed model is able to resolve the ambiguity-related issues, challenges and problems faced by existing machine translation systems. After text mining algorithms are applied with the help of the enriched bilingual lexicon and with semantic aspects taken into account, the translation will be correct, as shown in TABLE VI.

TABLE VI.

| English Sentence | Urdu Translation |
|---|---|
| Tahir is a good boy. He loves his mother. | طاہر ایک اچھا لڑکا ہے۔ وہ اپنی ماں سے پیار کرتا ہے۔ |
| Do you like flesh of mango? | کیاآپ آم کا گودا پسند کرتے ہو؟ |

V. CONCLUSION

The discussion shows that no system can produce correct translation without adding semantics. The earlier examples show that the translations produced by existing systems are wrong and do not convey the sense of the source text. Without considering semantic aspects machine translation is useless; by adding semantics the system will be able to produce correct and meaningful translation.

ACKNOWLEDGMENT
Thanks to all the technical staff at Center of Excellence for Urdu Informatics, National Language Authority, Islamabad.

REFERENCES
[1] H. Abdallah Homiedan, “Machine Translation,” in King Saud Univ., vol. 10, pp. 1-21, 1998.
[2] J. Hutchins, “Latest Development in Machine Translation Technology: Beginning a New Era in MT Research,” in Reflections on the History and Present State of Machine Translation. Norwich, England: University of East Anglia, 1986.
[3] R. L. Schmidt, “Urdu: An Essential Grammar,” Routledge Taylor & Francis Group, London and New York, 2004.
[4] D. Attash, “Urdu Informatics,” vol. First, National Language Authority, Islamabad, pp. 99, 2009.
[5] U. Muhammad, “AGHAZ: An Expert System Based approach for the Translation of English to Urdu,” PWASET, vol. 6, 2005.
[6] R. Muhammad Zafar, “Challenges for a Machine Translation, Development of Algorithms and Computational Grammar for Urdu,” Ph.D. thesis, Pakistan Institute of Engineering and Applied Sciences, Islamabad, 2007.
[7] R. Feldman, “The Text Mining Handbook: Advanced Approaches in Analyzing Unstructured Data,” Cambridge University Press, p. 43, 2007.
[8] S. D. Samantaray, “A Data Mining Approach for Resolving Cases of Multiple Parsing in Machine Aided Translation of Indian Languages,” International Conference on Information Technology (ITNG’07), IEEE, 2007.
[9] J. Henderson, P. Merlo, I. Petroff, G. Schneider, “Using NLP to Efficiently Visualize Text Collections with SOMs,” IEEE Proceedings of the 13th International Workshop on Database and Expert Systems Applications (DEXA’02), 2002.
[10] G. R. Tahir, “Machine Translation: Scope, Types, Advantages, History, Limits and Future,” Monthly Akhbar-e-Urdu, pp. 10-20, December 2009.
[11] G. Anthes, “Communications of the ACM,” vol. 53, no. 1, January 2010.
[12] A. Suarez, M. Saiz-Noeda, M. Palomar, “A Method of Restricted Knowledge Acquisition from Wordnet,” IEEE Third International Conference on Knowledge-Based Intelligent Information Engineering Systems, 31st Aug-1st Sept 1999, Adelaide, Australia.
[13] K. Wisniewski, “Discourse Analysis,” unpublished.
[14] S. R. Muhammad Zafar, “Challenges for a Machine Translation, Development of Algorithms and Computational Grammar for Urdu,” Ph.D. thesis, Pakistan Institute of Engineering and Applied Sciences, Islamabad, 2007.
[15] A. Kao, S. R. Poteet, “Natural Language Processing and Text Mining,” Springer, pp. 29-35, 2007.