International Journal of Computational Linguistics and Natural Language Processing
Vol 2 Issue 1 January 2013 ISSN 2279 – 0756
A New Dynamic Statistical Maximum Likelihood Alignment Algorithm for Sentence Translations in Bilingual Corpora (Malayalam & English)

Rajesh K S, Sr. Asst. Professor, SCE & Research Scholar in CS, DU, Kuppam
Dr. Lokanatha C Reddy, Professor, Dept of CS, DU, Kuppam
Dr. S Arulmozi, Asst. Professor, Dept of CL, DU, Kuppam

Abstract- The sentence alignment task in machine translation has gained importance in recent years. It is the task of finding correspondences between sentences in one language (e.g. English) and another (e.g. Malayalam, an Indian language). An aligned parallel corpus aids human translators, since it makes it possible to look up all sentences in which a word or phrase occurs and see how that word or phrase has been translated into the other language. Sentence alignment is also a first step towards word alignment, which is used to determine instances where a word in one language consistently appears in sentences aligned with sentences containing the equivalent word in the other language. Broadly, alignment aims at extracting structural information and statistical parameters from bilingual corpora. This process might seem easy at first sight, but it poses important challenges that make the task difficult. These challenges are discussed with special reference to English-to-Malayalam sentence alignment. Previous work in this area is reviewed, and a new algorithm is proposed for sentence alignment from English to Malayalam and vice versa. The main aim of this research work is to design a sentence alignment algorithm for use with bilingual texts in which one of the texts is Malayalam. This hybrid method uses the location information of sentences and paragraphs, as well as their lengths, to align the bilingual texts. The method is easy to implement and independent of the languages of the bilingual texts. The results of the experiments show that the method's success depends on the success of the paragraph alignment phase. When paragraph alignment is successful, the method achieves 96.2% accuracy on an easy text (90% 1-1 beads). On a difficult text (64% 1-1 beads) the accuracy is lower (about 70.4%), but still high given the difficulty of the text. However, if the method makes too many errors in paragraph alignment, which is rare, it produces continuous blocks of wrong alignment beads. We also outline some of the prominent difficulties and challenges in Malayalam-English sentence alignment and provide a statistical maximum likelihood algorithm which can be used in future studies and implementations of sentence alignment.
Keywords: sentence alignment, parallel corpus, local word grouping, stemming, sentence boundary detection, sentence boundary identification
1.0 INTRODUCTION
Aligned parallel corpora are collections of pairs of sentences in which one sentence is a translation of the other. Sentence alignment means identifying which sentence in the target language (TL) is a translation of which sentence in the source language (SL). Such corpora are useful for statistical NLP, algorithms based on unsupervised learning, automatic creation of resources and many other applications.

Over the last fifteen years, several algorithms have been proposed for sentence alignment from English to other languages, but rarely for Indian languages. Their reported performance is excellent (in most cases not less than 95%, and usually 98 to 99% and above). The evaluation is performed in terms of precision and sometimes also recall. While this gives an indication of the performance of an algorithm, the variation in performance under varying conditions has not been considered in most cases. Very little information is given about the conditions under which evaluation was performed. This gives the impression that an algorithm will perform with the reported precision and recall under all conditions. We have tested several algorithms under different conditions, and our results show that the performance of a sentence alignment algorithm varies significantly depending on the conditions of testing.

Based on these results, we propose a method of evaluation that gives a better estimate of the performance of a sentence alignment algorithm and allows a more meaningful comparison. Our view is that unless this is done, it will not be possible to pick the best algorithm for a given set of conditions. Those who want to align parallel corpora may end up picking a less suitable algorithm for their purposes, particularly for Malayalam, because of the lack of a full-fledged Malayalam corpus. We have used a method for comparing the
performance of our algorithm under different conditions. Several algorithms are available for sentence alignment, but there is a lack of systematic evaluation and comparison of these algorithms under different conditions. In most cases, the factors that can significantly affect the performance of a sentence alignment algorithm have not been considered during evaluation, and most algorithms have been tried only on European language pairs. We have used an evaluation method that gives a better estimate of a sentence alignment algorithm's performance, and we have evaluated, manually checked and validated English-Malayalam aligned parallel corpora under different conditions. We also suggest some guidelines on actual alignment.
2.0 Related Work
2.1 Parallel Corpus
A parallel text is a text placed alongside its translation or translations, and a large collection of such texts is called a parallel corpus. There is no ready-made parallel corpus available for the English-Malayalam language pair. For the current research work we created a parallel corpus from the translated software documentation of two well-known Free and Open Source applications, OpenOffice.org and the GNOME desktop. The majority of the text in the user interface and documentation of both applications has been translated into Malayalam. The translated text in this documentation can be classified into three categories: a) word or phrase translations, b) sentence or short paragraph translations, and c) sentences with variable components. The third category is omitted from the parallel corpus because of its odd nature. A sample of such text is:
English text - “User %s is not a super user”
Malayalam text - “%s ഉപയോക്താവ് ഒരു സൂപ്പര് യൂസര് അല്ല.”
(“Upabhokthavu oru super user alla”)
The text contains C-style placeholders (%s etc.), which make the sentence incomplete; such sentences are omitted from the parallel corpus for this reason. The software translations are available in a specific file format, the .po file, which is the de facto standard in the localization world. A script has been written to extract the required text from the .po files. The program reads a .po file and extracts each original English sentence with its Malayalam translation, automatically identifying sentences belonging to category c) mentioned above. From the extracted parallel text, a dictionary has been generated manually for this work. The dictionary contains both word meanings and phrase meanings. The phrase has been considered as a unit for the lexicon for the following reason: a phrase in English may be translated to Malayalam using a single word, and vice versa.
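The extraction step can be sketched as follows. This is a minimal illustration, not the actual script used: it assumes a simplified .po subset with single-line msgid/msgstr entries, and the placeholder pattern covers only %s/%d.

```python
import re

# C-style placeholders mark category c) sentences, which are skipped.
PLACEHOLDER = re.compile(r'%[sd]')

def extract_pairs(po_text):
    """Extract (English, Malayalam) sentence pairs from .po text.

    Minimal sketch: handles only single-line msgid/msgstr entries and
    skips the header, untranslated entries and category c) sentences.
    """
    pairs = []
    for msgid, msgstr in re.findall(r'msgid "(.*)"\s*msgstr "(.*)"', po_text):
        if not msgid or not msgstr:
            continue  # .po header entry or untranslated entry
        if PLACEHOLDER.search(msgid):
            continue  # sentence with variable components (category c)
        pairs.append((msgid, msgstr))
    return pairs
```

A real implementation would use a full .po parser to handle multi-line strings and comments.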
2.2 The Proposed Alignment Model
Figure 1.1: Proposed Sentence Alignment Model
2.3 Proposed Algorithm for English-Malayalam Sentence Aligner
The input is an English text and its corresponding Malayalam translation, available as an aligned parallel corpus.

Input -> English Text
Input -> Malayalam Translation

for sentence in Input:
    mark sentence boundary
for each (English sentence, Malayalam sentence) pair:
    for each word in English sentence:
        get local word groups
    for each local word group:
        find the matches in the statistical language model
    end
    for each match:
        do contextual disambiguation with statistical data
        mark alignment
    end
    for rest of sentence:
        get stem
        get meaning
        get target language alignment
        if meaning == Null:
            find possible transliteration in the target language text
        end
    end
end
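The control flow above can be sketched in Python as below. This is an illustrative skeleton, not the actual implementation: lwg_dict, lexicon and the one-line stemmer stand-in are hypothetical placeholders for the components described in the following sections.

```python
def align_sentences(en_sentences, ml_sentences, lwg_dict, lexicon):
    """Illustrative skeleton of the aligner's control flow.

    lwg_dict maps English local word groups to Malayalam equivalents;
    lexicon maps English stems to Malayalam stems (both hypothetical).
    """
    alignments = []
    for en, ml in zip(en_sentences, ml_sentences):
        matched = []
        words = en.lower().split()
        i = 0
        while i < len(words):
            # local word grouping: greedy longest match in the LWG dictionary
            for j in range(len(words), i, -1):
                group = " ".join(words[i:j])
                if group in lwg_dict and lwg_dict[group] in ml:
                    matched.append((group, lwg_dict[group]))
                    i = j
                    break
            else:
                # rest of sentence: stem and look up in the bilingual lexicon
                stem = words[i].rstrip("s")  # toy stand-in for the stemmer
                if stem in lexicon and lexicon[stem] in ml:
                    matched.append((words[i], lexicon[stem]))
                i += 1
        alignments.append((en, ml, matched))
    return alignments
```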
2.4 Sentence Boundary Detection for Malayalam and English Text
Identification of sentence boundaries is considered one of the toughest tasks in text processing. The standard algorithm for detecting sentence boundaries in English is given below:
1) Place putative sentence boundaries after all occurrences of . ? ! (and maybe ; : --).
2) Move the boundary after a following quotation mark, if any.
3) Disqualify a period boundary in the following circumstances:
a) If it is preceded by a known abbreviation of a sort that does not normally occur word-finally but is commonly followed by a capitalized proper name, such as Prof. or vs.
b) If it is preceded by a known abbreviation and not followed by an upper-case word. This deals correctly with most usages of abbreviations like etc. or Jr. which can occur sentence-medially or finally.
4) Disqualify a boundary with a ? or ! if:
a) It is followed by a lower-case letter (or a known name).
5) Regard all other putative sentence boundaries as sentence boundaries.
In the case of Malayalam this algorithm is not effective as such, so the algorithm is modified to work with the
language. The major deviations from the above algorithm in the case of Malayalam are:
i) Part three of the algorithm is not directly implemented, because there is no concept of upper case in Malayalam.
ii) Abbreviations may contain more than one character. E.g., the English abbreviation “A.B” is written in Malayalam as എ.ബി, where 'ബി' contains two characters. We therefore modified the abbreviation identification accordingly. To attain accuracy, we kept a list of possible abbreviated forms in the implementation.
iii) Titles in English follow certain orthographic conventions, such as keeping the first letter upper case, e.g. 'Prof.', 'Dr.' etc. For the accurate identification of such titles, we collected a list of titles used in the Malayalam language for the implementation.
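The modified procedure can be sketched as below. This is a minimal illustration assuming a small, hypothetical abbreviation list; the actual implementation keeps the full lists collected for Malayalam.

```python
import re

# Hypothetical, abbreviated sample lists; the implementation keeps full lists
# of English titles and Malayalam abbreviation components.
ABBREVIATIONS = {"prof", "dr", "vs", "etc", "jr", "എ", "ബി", "ഡോ"}

def split_sentences(text):
    """Split text on . ? ! boundaries, keeping known abbreviations intact."""
    sentences, start = [], 0
    for m in re.finditer(r'[.?!]', text):
        tokens = text[start:m.start()].split()
        before = tokens[-1] if tokens else ""
        if m.group() == '.' and before.lower() in ABBREVIATIONS:
            continue  # disqualify: the period belongs to an abbreviation
        sentences.append(text[start:m.end()].strip())
        start = m.end()
    if text[start:].strip():
        sentences.append(text[start:].strip())  # trailing text without a mark
    return sentences
```

Since Malayalam has no upper case, the capitalization checks of the English algorithm are simply absent here; only the abbreviation list carries the disambiguation.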
2.5 Local Word Grouping
The problem of local word grouping is handled in a special way in the proposed sentence alignment algorithm; the methodology is heuristic in nature. As part of this work, we created a large database of local word groups in English as well as Malayalam from the parallel corpus we generated. Since no POS taggers are available for Malayalam for research, we were forced to adopt this methodology. The strategy of local word group identification is as follows:

for each sentence:
    do local word group dictionary search
    if dictionary search fails:
        perform heuristics-based grouping
    end

It is obvious that we may not be able to collect all the local word groups/phrases in
the language, so some heuristics were developed to handle the rest. E.g.:

if 'to' is followed by a verb, then mark it as a local word group

After a minimal corpus-linguistic inquiry, we created such heuristics to resolve the issue.
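The dictionary-first strategy with a heuristic fallback can be sketched as follows. The "'to' followed by a verb" rule is the one named above; the tiny dictionary and verb list are hypothetical stand-ins for the databases built from the corpus.

```python
# Hypothetical stand-ins for the LWG database and a verb list.
LWG_DICT = {"cattle feed", "white whale"}
VERBS = {"go", "read", "write", "translate"}

def local_word_groups(sentence):
    """Return local word groups: dictionary matches first, then the
    heuristic "'to' followed by a verb" as a fallback."""
    words = sentence.lower().split()
    groups, i = [], 0
    while i < len(words):
        pair = " ".join(words[i:i + 2])
        if pair in LWG_DICT:
            groups.append(pair)          # dictionary search succeeded
            i += 2
        elif words[i] == "to" and i + 1 < len(words) and words[i + 1] in VERBS:
            groups.append(pair)          # heuristics-based grouping
            i += 2
        else:
            groups.append(words[i])      # single word, no grouping
            i += 1
    return groups
```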
2.6 Statistical Language Modeling for the Sentence Alignment Task
A statistical language model assigns a probability to a sequence of m words, P(w1, ..., wm), by means of a probability distribution. For the task of language modeling we adopted the n-gram approach. In an n-gram model, the probability P(w1, ..., wm) of observing the sentence w1, ..., wm is approximated as

P(w1, ..., wm) = ∏(i=1 to m) P(wi | w1, ..., wi-1) ≈ ∏(i=1 to m) P(wi | wi-(n-1), ..., wi-1)
From the parallel corpus, we estimated the unigram probability of each local word group/phrase/word; this data is used in the alignment system. For contextual disambiguation, we generated word and phrase/local word group collocations from both the English and the Malayalam text. Again with the help of the parallel corpus, we generated a language model containing the probability that a word/phrase/word group W followed by another X is translated into a target language word/phrase/word group Z.
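A maximum likelihood estimate for such an n-gram model (here bigrams, n = 2) can be sketched as below; this illustrates the counting step only, not the full translation model over W, X and Z.

```python
from collections import Counter

def bigram_model(sentences):
    """Maximum likelihood bigram model:
    P(w_i | w_{i-1}) = c(w_{i-1}, w_i) / c(w_{i-1})."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in sentences:
        words = ["<s>"] + sentence.lower().split()  # sentence-start marker
        unigrams.update(words)
        bigrams.update(zip(words, words[1:]))
    return lambda prev, w: bigrams[(prev, w)] / unigrams[prev] if unigrams[prev] else 0.0
```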
2.7 Stemming
The Porter stemming algorithm (1980) has been adopted for stemming English words. No stemmer is available for Malayalam language research, so a pseudo-stemmer with tailor-made rules was developed for Malayalam. If meaning is not available
in the corpus, finding a possible transliteration in the target language text can be considered a possible enhancement to this algorithm. The proposed algorithm is expected to give 75% accuracy; with this enhancement, it may give more than 75%.
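A suffix-stripping pseudo-stemmer of the kind described can be sketched as follows. The suffix list is a small, hypothetical sample of Malayalam inflectional endings, not the full tailor-made rule set.

```python
# Hypothetical sample of Malayalam inflectional suffixes (case endings etc.);
# the actual pseudo-stemmer uses the full tailor-made rule set.
SUFFIXES = ["ത്തിന്റെ", "ത്തില്‍", "ത്തിന്", "ിന്റെ", "ില്‍", "ിന്", "ും"]

def pseudo_stem(word):
    """Strip the longest matching suffix; longest-first so that, e.g.,
    a long case ending is removed before a shorter one it contains."""
    for suffix in sorted(SUFFIXES, key=len, reverse=True):
        if word.endswith(suffix) and len(word) > len(suffix):
            return word[:-len(suffix)]
    return word
```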
2.8 Sentence Boundary Identification
The standard algorithm used for detecting sentence boundaries in English is the one already given in Section 2.4.
2.9 Sentence Boundary Identification for Malayalam Text
In the case of Malayalam, the above-discussed algorithm is not directly applicable, so we modified the algorithm to work with the language. The major deviations from the
above algorithm in the case of Malayalam are:
1) Part three of the algorithm is not directly implemented, because there is no concept of upper and lower case in Malayalam.
2) Abbreviations may contain more than one character. E.g., the English abbreviation “A.B” is written in Malayalam as എ.ബി, where 'ബി' contains two characters. Most components of Malayalam abbreviations contain more than one character. We did a manual analysis of online Malayalam texts and generated a set of character combinations that can be part of an abbreviation in a Malayalam text, and modified the abbreviation identification accordingly.
3) To attain accuracy, we kept a list of possible abbreviated writings in the implementation.
4) Titles in English follow certain orthographic conventions, such as keeping the first letter upper case, e.g. 'Prof.', 'Dr.' etc. For the accurate identification of such titles, we collected a list of titles used in the Malayalam language for the implementation.
2.10 Maximum Likelihood Sentence Boundary Alignment
Once the sentence boundaries are identified, the next step is to find the words and phrases in each sentence. Phrase identification is crucial because a phrase may be translated to Malayalam as a single word. Here the word 'phrase' is not used in the sense of syntactic phrases like the Noun Phrase (NP) and Verb Phrase (VP); it means a word group that refers to a single concept and has a single-word translation in the target language. The word group 'cattle feed' is an example of such a word group; it is translated to Malayalam as 'കാലിത്തീറ്റ'
(‘kaalitheetta’). While performing the alignment we have to map 'cattle feed' to 'കാലിത്തീറ്റ'. To make this possible, we adopted a statistical technique to identify possible phrases/word groups in the English as well as the Malayalam text. We use the likelihood ratio approach to extract the phrases. The methodology is explained below.

This approach forms two hypotheses about an input word sequence and compares their likelihoods to find all the possible phrases. The first step is to generate bigrams from the input sentence. For any bigram {w1, w2} in the sentence:

H0: w2 is independent of w1.
H1: w2 is dependent on w1.

The probabilities for the dependent and independent cases are estimated from word occurrences in the corpus, where:
n1 = number of occurrences of w1
n2 = number of occurrences of w2
n12 = number of occurrences of the bigram {w1, w2}
N = total number of words in the corpus

Table 1.1: Probabilities for the dependent and independent cases from the word occurrences in the corpus.

       P(w2|w1)                P(w2|¬w1)
H0     p00 = n2 / N            p01 = n2 / N
H1     p10 = n12 / n1          p11 = (n2 - n12) / (N - n1)

Under hypothesis H0:
p00 = P(w2|w1) = P(w2 ∩ w1) / P(w1)   (conditional probability)
    = P(w2) · P(w1) / P(w1)           (because w1 and w2 are independent)
    = P(w2) = n2 / N
p01 = P(w2|¬w1) = P(w2|w1) = n2 / N   (under independence, the probability of w2 is the same whether or not it is preceded by w1)

Under the alternative hypothesis H1:
p10 = P(w2|w1) = P(w2 ∩ w1) / P(w1) = (n12 / N) / (n1 / N) = n12 / n1
p11 = P(w2|¬w1) = P(w2 ∩ ¬w1) / P(¬w1) = ((n2 - n12) / N) / ((N - n1) / N) = (n2 - n12) / (N - n1)

Using the binomial distribution b(k; n, p) = nCk · p^k · (1 - p)^(n-k), we calculate the expected likelihoods from the observed probabilities.

Table 1.2: Expected likelihoods from the observed probabilities.

       E(w2|w1)                E(w2|¬w1)
H0     b(n12; n1, p00)         b(n2 - n12; N - n1, p01)
H1     b(n12; n1, p10)         b(n2 - n12; N - n1, p11)

The likelihood of each hypothesis is the product of its two binomial terms:

L(H0) = b(n12; n1, p00) · b(n2 - n12; N - n1, p01)
      = n1Cn12 · p00^n12 · (1 - p00)^(n1 - n12) · (N - n1)C(n2 - n12) · p01^(n2 - n12) · (1 - p01)^(N - n1 - n2 + n12)

L(H1) = b(n12; n1, p10) · b(n2 - n12; N - n1, p11)
      = n1Cn12 · p10^n12 · (1 - p10)^(n1 - n12) · (N - n1)C(n2 - n12) · p11^(n2 - n12) · (1 - p11)^(N - n1 - n2 + n12)

The ratio L(H1) / L(H0) tells us how much more likely the dependence assumption is than the independence assumption. The binomial coefficients cancel, giving the final expression used to find the phrases:

R(H1/H0) = [p10^n12 · (1 - p10)^(n1 - n12) · p11^(n2 - n12) · (1 - p11)^(N - n1 - n2 + n12)] / [p00^n12 · (1 - p00)^(n1 - n12) · p01^(n2 - n12) · (1 - p01)^(N - n1 - n2 + n12)]
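The ratio R(H1/H0) can be computed in log space to avoid underflow, as sketched below; the binomial coefficients are omitted since they cancel in the ratio.

```python
import math

def _log_b(k, n, p):
    """log of the binomial kernel p^k (1-p)^(n-k); the nCk coefficient
    is omitted because the coefficients cancel in R(H1/H0)."""
    eps = 1e-12
    p = min(max(p, eps), 1 - eps)  # guard against log(0)
    return k * math.log(p) + (n - k) * math.log(1 - p)

def log_ratio(n1, n2, n12, N):
    """log R(H1/H0) for the bigram {w1, w2}; large positive values
    indicate that w2 depends on w1, i.e. {w1, w2} is a likely phrase."""
    p00 = n2 / N                     # = p01 under H0
    p10 = n12 / n1                   # P(w2|w1) under H1
    p11 = (n2 - n12) / (N - n1)      # P(w2|not w1) under H1
    h1 = _log_b(n12, n1, p10) + _log_b(n2 - n12, N - n1, p11)
    h0 = _log_b(n12, n1, p00) + _log_b(n2 - n12, N - n1, p00)
    return h1 - h0
```

A bigram whose log ratio exceeds a chosen threshold is extracted as a phrase candidate such as 'cattle feed'.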
2.11 Word/Phrase Translation Model
The word/phrase translation model is developed to identify the most probable translation equivalent of a word/phrase in a given bilingual sentence pair. Most words in natural languages fall into more
than one part-of-speech category, and the meaning may vary with the usage context, so identifying the equivalents is crucial in the sentence boundary alignment task. We developed a dictionary model for storing the translation equivalents of phrases and words. It can be used to handle both English-to-Malayalam and Malayalam-to-English sentence alignment. A sample entry is given below:

book%{പുസ്തകം#ബുക്ക്#ഗ്രന്ഥം#അദ്ധ്യായം}
{pustakam#book#grandham#adhyayam}
white whale%{വെള്ളത്തിമിംഗലം#വെളുത്ത തിമിംഗലം}
{vellathimingalam#velutha thimingalam}

The data is stored as key-value pairs and can be searched on either side: which values a key has, or which key corresponds to a given meaning. Apart from this, we keep a bilingual lexicon as a precautionary measure, since it is difficult to collect and store all word forms and phrases in the form of a dictionary. For handling unknown forms we use the lexicon: if any word is unknown to the word/phrase model, it is stemmed and searched in the bilingual lexicon to find the target language stem. The returned stem is then matched against the stems in the target language sentence to find the equivalents.
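The key%{value#value#...} format above can be parsed as sketched below; a minimal illustration of the lookup in both directions.

```python
import re

def load_dictionary(text):
    """Parse key%{m1#m2#...} entries into a dict of key -> list of meanings."""
    entries = {}
    for key, values in re.findall(r'(.+?)%\{(.+?)\}', text):
        entries[key.strip()] = [v.strip() for v in values.split('#')]
    return entries

def reverse_lookup(entries, meaning):
    """Find the key(s) for a given meaning, i.e. search by value."""
    return [k for k, vals in entries.items() if meaning in vals]
```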
3.0 Results and Interpretations
The performance of a sentence alignment algorithm depends on some identifiable factors, and we can even make predictions about whether performance will increase or decrease. However, the algorithms do not always behave in a predictable way. This variation in performance is quite significant, and it cannot be ignored for actual
alignment. Some of these factors have been indicated in earlier papers, but they were not taken into account during evaluation, nor were their effects studied.

The translation of a text can be fairly literal, or it can be a re-creation, with a whole range between these two extremes. Paragraphs and/or sentences can be dropped or added. In actual corpora, there can even be noise (sentences which are not translations at all and may not even be part of the actual text); this can happen when the texts have been extracted from some other format such as web pages. While translating, sentences can also be merged or split. Thus, the SL and TL corpora may differ in size. All these factors affect the performance of an algorithm in terms of, say, precision, recall and F-measure. For example, we can expect performance to worsen with an increase in additions, deletions or noise, and if the texts were translated fairly literally, statistical algorithms are likely to perform better. However, our results show that this does not always happen for our algorithm.

The linguistic distance between SL and TL can also play a role in performance. The simplest measure of this distance is distance on the language family tree; other measures could be the number of cognate words or some measure based on syntactic features. For our purposes, it may not be necessary to have a quantitative measure of linguistic distance. The important point is that for distant languages, some algorithms may not perform well if they rely on closeness between the languages. For example, an algorithm based on cognates is likely to work better for English-French or English-German than for English-Malayalam, because there are fewer cognates for English-Malayalam. It is not without basis to say that Malayalam is more distant from English than German is: English and German belong to the Indo-Germanic
branch, whereas Malayalam belongs to the Dravidian family, and there are many more cognates between English and German than between English and Malayalam. Similarly, compared to French, Malayalam is also distant from English in terms of morphology. The vibhaktis (case inflections) of Malayalam can adversely affect the performance of sentence-length (especially word-count) based algorithms as well as word-correspondence based algorithms. From the syntactic point of view, Malayalam is a comparatively free word order language with a preference for SOV (subject-object-verb) order, whereas English is more of a fixed word order, SVO-type language. For sentence-length and IBM model-1 based sentence alignment this does not matter, since they do not take word order into account. Melamed's algorithm (Melamed, 1996), however, which takes care of some differences in word order, is somewhat sensitive to word order; how it fares with languages showing more word order variation than English and French is an open question.

Another aspect of performance which may not seem important from the NLP-research point of view is speed. Anyone who has to use these algorithms for actual alignment of large corpora (say, more than 1000 sentences) will realize the importance of speed. Any algorithm which does worse than O(n) is bound to create problems for large sizes. Obviously, an algorithm that can align 5000 sentences in one hour is preferable to one which takes three days, even if the latter is marginally more accurate. Similarly, one which takes two minutes for 100 sentences but sixteen minutes for 200 sentences will be difficult to use for practical purposes; actual corpora may be as large as a million sentences.

The corpus used for the language pair (English-Malayalam) is a Linux user manual, in which words, sentences and even small paragraphs are
available in both English and Malayalam, and which is small in size. We took 2500 sentences, as this was the size of the smallest corpus. It consists of Linux-related details which appear in both languages. We expected this corpus to be the most difficult, because the translations are often more like adaptations; they may even be rewritings of the English sentences in Malayalam. The algorithm was tested on different data sets.

One limitation of our work is that we mostly consider only 1-to-1 alignments. This is partly due to practical constraints, but also because 1-to-1 alignments are the ones that can be most easily and directly used for linguistic analysis as well as machine learning. Since we had to prepare a large number of data sets of sizes up to 10000 sentences, manual checking was a major constraint. We had four options. The first was to take a raw unaligned corpus and manually align it; this option would have allowed consideration of 1-to-many, many-to-1, or partial alignments. The second option was to pass the text through an alignment tool and then manually check the output for all kinds of alignments. The third option was to check only for 1-to-1 alignments in this output. The fourth option was to evaluate on much smaller sizes. In terms of time and effort required, there is an order of difference between the first and the second options, and also between the second and the third. It is much easier to manually check the output of an aligner for 1-to-1 alignments than to align a corpus from scratch. We could not afford the first two options. The fourth option was affordable, but we opted for a more thorough evaluation of 1-to-1 alignment rather than an evaluation of all kinds of alignments at smaller sizes. Thus, our starting data sets had only 1-to-1 alignments. In future, we may extend the evaluation to all kinds of alignments, since the manual alignment currently being done on our
corpus includes partial and 1-to-2 or 2-to-1 alignments. Incidentally, there are rarely any 2-to-1 alignments in an English-Malayalam corpus, since two English sentences are rarely combined into one Malayalam sentence (when translating from English to Malayalam), whereas the reverse is quite possible.

In the literature survey we have seen that most researchers on sentence alignment, especially when the bilingual texts are French, German, English or Chinese, use the Hansards of the respective countries as a reliable common bilingual database. No such Hansard exists for Indian-language-to-English bilingual texts, so we used other data sources for the experiments. This situation makes it difficult to compare the accuracy of our method with other alignment methods. The proposed method described in the previous sections was tested on 3 different data sets from a Linux user manual in English and Malayalam.

1) Data 1 was a text containing large paragraphs in both languages and having somewhat similar paragraph counts. But it was a hard text when we consider the sentence alignment beads: the percentage of 1-1 beads was only 65.4%, and the percentage of 1-2 or 2-1 beads was 22%. The remaining 12.6% of alignment pairs consisted of more complex beads, even including 1-6, 1-5 or 2-5 sentence beads. It also contained a deleted region 18 sentences long in the English text, which is hard to handle. Under these conditions the method did 65% of the alignments correctly, and 25% were complete errors. The remaining 10% were partial errors, in which the alignment is partially correct; for example, the real bead is a 1-2 bead but our program splits it into two beads, a 1-1 and a 0-1. By changing parameters we can avoid these errors to some extent. Another important point is the question of how much the deleted block affected overall
performance. The 18-sentence deleted segment was towards the end of the text. For a short stretch it caused the program to give continuous wrong alignments, but it managed to recover after some paragraphs. If we exclude this continuous segment, the accuracy increases to 73.7%, which is very good for such a difficult text.

2) In the experiment on data 2, we had very bad results, because the paragraph alignment phase made many errors: there were a lot of 1-6, 1-5, etc. paragraph beads. When the program failed in paragraph alignment, it inevitably made errors in sentence alignment in large blocks. Due to this problem, accuracy was lower than 45% for data 2.

3) Finally, in the experiment on data 3, we again used data similar to data 2, but this time the paragraphs aligned mostly 1-1 and they were long paragraphs. At the sentence level, the 1-1 bead percentage was again high (about 90%). Under these conditions the method gave very good accuracy: the percentage of true alignments was 96.2%, 2.1% were partial alignment errors, and only 1.7% of all alignments were completely wrong.

For the simulation of the algorithm explained above, since a full-fledged Malayalam corpus is quite impossible to develop at this stage, a simple corpus, a Linux user manual in English and Malayalam, was used for most of the research work. The other data sources given above were also used for the analysis of some of the results. A hypothetical result analysis was done to test whether the proposed model and algorithm give satisfactory results. A detailed study of statistical measures of the performance of the proposed algorithm was also made.

3.1 System Evaluation
We tested our system with texts of different styles corresponding to different alignment difficulties.
3.1.1 Accuracy
The evaluation was done on the Linux user manual used as the corpus. About 33 paragraphs were taken at random, each containing between 3 and 9 sentences. The experiment covered 304 sentences, of which 254 were aligned by the proposed algorithm, and 187 of these were aligned correctly. Some technical English words in the user manual have no corresponding Malayalam words; sentences containing them were not aligned and were eliminated.

Accuracy = Number of aligned sentences / Total number of sentences

The accuracy of the algorithm was measured as 83.55%. It can be visualized in the graph of Figure 1.2, with the percentage scale on the y-axis and the number of test cases on the x-axis.

Figure 1.2: Graph showing the accuracy of alignment in the English-Malayalam corpus

3.1.2 Precision
Precision = Number of correctly aligned sentences / Number of aligned sentences

Precision was calculated as 73.62%. The variation in precision can be seen in the graph of Figure 1.3.

3.1.3 Recall
Recall = Number of correctly aligned sentences / Total number of sentences in the source

Recall was calculated as 61.51%. The variation can be seen in the graph plotted with the number of test cases on the x-axis and the corresponding percentage recall on the y-axis, also given in Figure 1.3.

Figure 1.3: Graph showing the variation in precision and recall

3.1.4 Error Analysis
Error rate = Number of unaligned English sentences / Total number of English sentences

The error rate was calculated as 16.90%. Taking the average of the error percentages of the individual test cases instead gives an error rate of 12.3%; the difference between the two figures is due to the varying lengths of the sentences.
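The three measures defined above follow directly from the reported counts (304 source sentences, 254 aligned, 187 aligned correctly); a small sketch that reproduces the reported figures:

```python
def evaluate(total, aligned, correct):
    """Accuracy, precision and recall (in percent) as defined above:
    accuracy = aligned / total, precision = correct / aligned,
    recall   = correct / total."""
    return (round(100.0 * aligned / total, 2),
            round(100.0 * correct / aligned, 2),
            round(100.0 * correct / total, 2))

# counts reported for the Linux user manual corpus
print(evaluate(304, 254, 187))  # → (83.55, 73.62, 61.51)
```

These match the reported accuracy (83.55%), precision (73.62%) and recall (61.51%) exactly.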
Identified sources of errors:
• Merging does not take place when there is too little difference between the number of words in the source- and target-language sentences.
• There are not enough matching words between the source- and target-language sentences.
• The corresponding sentence is missing from the target text (an incorrect corpus).
• The input is supplied to the system in an unrecognized format.

These sources of error can be reduced in the future by improving the translation quality of the source and by applying more heuristics, keeping in mind that only perfectly aligned paragraphs should be given as input to the system.

3.1.5 F-Measure
The other evaluation measure is the F-measure, the harmonic mean of precision and recall:

F-Measure = 2 × (Recall × Precision) / (Recall + Precision)

With the precision (73.62%) and recall (61.51%) values above, the F-measure evaluates to 67.02%.
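Since the F-measure is the harmonic mean, it always lies between precision and recall and can never exceed the larger of the two. Computing it from the precision and recall values reported above:

```python
def f_measure(precision, recall):
    """Harmonic mean of precision and recall (both in percent)."""
    return round(2.0 * precision * recall / (precision + recall), 2)

print(f_measure(73.62, 61.51))  # → 67.02
```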
4.0 Conclusion and Future Work
4.1 Future Work
A Java implementation of the proposed algorithm is a major piece of future work. At this stage, the non-availability of ready-made tools for Malayalam makes implementation a major difficulty. Another serious issue is the lack of a large corpus in Malayalam, which can only be developed as a team effort and as a funded project. The results of the experiments reveal some deficiencies and advantages of our algorithm:
First of all, the results reveal the importance and effect of paragraph alignment. If the paragraphs are well arranged in both bilingual texts, paragraph alignment is advantageous and increases the accuracy of the alignment remarkably, so the program is best used on texts with well-arranged paragraphs. In the future, paragraph alignment can be studied further to increase its robustness on arbitrary texts.

Secondly, there may be a problem of deleted blocks: the algorithm can take some time to recover after a deleted segment. Since we managed to shorten this recovery period and lexical information is used, deleted segments should no longer be a serious problem. Indeed, by using the hybrid approach in our algorithm, the accuracy rates can be expected to increase.

Finally, the values of the parameters may be tuned to determine their best settings. This is the simplest improvement, but it requires a great deal of time, since the effect of each variation in the parameter values has to be checked manually; lack of time prevented us from calculating the best parameter values.

The system can be utilized as such for developing speech systems; a smart, intelligent text-to-speech alignment system with human-computer interaction would have a high impact on society. The system can be further enhanced with a massive bilingual dictionary for a better choice of words, and it is always better to have a statistical machine translation system for prolonged usage. A word sense disambiguation system can be developed on the target side for Malayalam-to-English translation to avoid semantic ambiguities.

4.2 Conclusion
The proposed algorithm performs well for a small English-Malayalam corpus which is
having a limited number of paragraphs, sentences and words. The results need to be tested on a large corpus; only then can the accuracy of the results be known with confidence, and the precision and recall need to be calculated and tested for different sample sizes. A proper analysis can only be done when the necessary tools are all readily available, the lack of which is a major shortcoming of work on Malayalam. This work needs to be extended so that the algorithm is implemented and tested for accuracy and performance.

Authors:
1. Rajesh K. S received his MCA degree from Mahatma Gandhi University, Kottayam and his MBA in Finance from IGNOU, New Delhi, in 2000 and 2006 respectively. He is working as Sr. Assistant Professor in the Department of MCA, Saintgits College of Engineering,
Pathamuttom, Kottayam, Kerala, India. He is a Research Scholar in the Department of Computer Science at Dravidian University, Kuppam, AP, India, working towards his Ph.D. in Computer Science. His current area of research is the alignment of sentences in bilingual corpora, an interesting area in NLP.
2. Lokanatha C. Reddy earned an M.Sc. (Maths) from the Indian Institute of Technology, New Delhi; an M.Tech. (CS) with Honours from the Indian Statistical Institute, Kolkata; and a Ph.D. (CS) from Sri Krishnadevaraya University, Anantapur. He earlier worked at KSRM College of Engineering, Kadapa (1982-87) and at the Indian Space Research Organization (ISAC), Bangalore (1987-90). He has been Head of the Computer Centre (on leave) at Sri Krishnadevaraya University, Anantapur since 1991, and a Professor of Computer Science and Dean of the School of Science & Technology at Dravidian University, Kuppam since 2005. His active research interests include real-time computation, distributed computation, device drivers, geometric designs and shapes, digital image processing, pattern recognition and networks.
3. S. Arulmozi, after obtaining his M.Phil. and Ph.D. in Applied Linguistics and a PG Diploma in Translation Studies from the University of Hyderabad, served as Guest Faculty (19 January 2005 - 28 December 2005) at the University of Hyderabad. Prior to that he worked as Research Staff (29 December 2000 - 15 January 2005) at the AU-KBC Research Centre, Chennai, as Project Fellow (21 January 1999 - 2 July 2000) at Tamil University, and as Language Assistant, Tamil (8 June 1998 - 31 October 1998) at CIIL, Mysore, before joining Dravidian University as an Assistant Professor. His areas of interest are computational linguistics and lexical semantics. He has coordinated a project on Tamil WordNet and is currently coordinating a project on ILIL (Telugu-Tamil) MT. He has also organized three workshops: Corpus-based NLP, WordNet in Dravidian Languages, and Introduction to Computational Linguistics.