Identifying Comparable Texts for Under-resourced Languages and Domains

Evangelos Kanoulas, Monica Paramita, Paul Clough, Ahmet Aker, David Guthrie, Rob Gaizauskas
University of Sheffield, Sheffield, UK
{e.kanoulas, m.paramita, p.d.clough}@shef.ac.uk
{a.aker, d.guthrie, r.gaizauskas}@dcs.shef.ac.uk

Abstract

The availability of parallel corpora (i.e. documents that are exact translations of each other in different languages) has been a key factor in the development of Machine Translation (MT) technology, in particular data-driven (statistical and example-based) MT. Parallel corpora, however, are not always easily available. This is the case for certain under-resourced languages or under-resourced domains. Nevertheless, the richness of the Web allows the discovery of pieces of text in different languages that could be used to improve MT. Pairs of documents in different languages that contain such useful pieces of text without being parallel constitute a comparable corpus. In this work we attempt to identify comparable documents by extracting a number of language-dependent and language-independent features and employing machine learning techniques to learn what constitutes a comparable pair of documents and what does not.

1 Introduction

The availability of parallel corpora (i.e. documents that are exact translations of each other in different languages) has been a key factor in the development of data-driven Machine Translation (MT) technology. Parallel corpora are used as training data to learn MT models. Given the importance of parallel corpora to MT, several techniques have been proposed in the literature to identify parallel corpora on the Web.

Resnik and Smith (2003) identified parallel corpora by retrieving web pages of similar length, structure, or URLs. Candidate pages were initially collected if they contained anchor text in multiple languages, such as "English" and "Greek", within a small distance from one another. These anchors were later used to find the parallel counterpart of a page. URLs of parallel pages also followed certain rules, such as including the language as a folder name ("http://www.AAA.com/english/home.html") or adding the language initial at the end of the file name ("http://www.AAA.com/home_en.html"). They also compared the order of the HTML tag structure (title, h1, body) between the two pages, with the number of unaligned tags in the pair counted as a difference score. The text chunks aligned using this structure analysis were found to be useful in the alignment process. Zhang et al. (2006) identified parallel sources by comparing the out-links and image links of candidate pairs. Pages which refer to the same links are more likely to be parallel. Document length, on the other hand, was not shown to be a good feature for analyzing parallel pages. The last feature they analyzed was the proportion of aligned sentences, as produced by the Champollion Tool Kit [1]. These features were fed into a K-nearest-neighbors classifier to retrieve parallel documents. These previous approaches work reasonably well for retrieving parallel corpora. Unfortunately, the freely available parallel corpora on the Web represent too few languages and limited domains.

[1] http://champollion.sourceforge.net

Given that, on average, improving translation quality at a constant rate requires an exponential increase in the training data (Och and Ney, 2003), a relatively good quality translation requires a huge volume of training data. Collecting such a large volume of parallel corpora for under-resourced languages is infeasible. Furthermore, training translation models for Statistical Machine Translation (SMT) has been shown to be domain-dependent (Koehn, 2010). Therefore, for narrow subject domains SMT does not produce acceptable translation quality without highly specialized parallel textual resources. As a result, traditional ways of building SMT engines with acceptable translation quality are often not possible for many domain/language combinations. To overcome the problem of limited availability of parallel corpora for certain languages and domains, Do et al. (2009) and Munteanu et al. (2004) used different methods to retrieve comparable documents. First, they implemented a filter on an article's publication date: articles which were published more than five days apart were discarded. Munteanu et al. (2004) also translated foreign documents to build an English query and retrieved candidate pairs using TF-IDF. Further, Munteanu et al. (2004) retrieved parallel sentences by evaluating the percentage of words in the source document which have a translation in the target document. Do et al. (2009) applied linguistic methods to identify and evaluate the named entities in both documents. Linguistic methods were also used to identify word forms, lemmas and parts of speech. In this work we explore an approach to automatically classify document pairs in terms of comparability. We (a) identify document features which are promising candidate indicators of comparability; (b) automatically extract these features from a set of initial comparable corpora (ICC) obtained manually; (c) investigate the distribution of these features across the different comparability classes in the ICC; and (d) train classifiers to automatically classify document pairs into comparability classes (parallel, strongly comparable, weakly comparable, non-comparable). Since comparable corpora are more readily available for under-resourced languages and specific subject domains, they can be used efficiently for building MT of much better quality.

2 Comparable Corpora

A key concept in all the aforementioned works, as well as in this work, is the notion of comparability. The definition of comparability is rather intuitive. Essentially, comparability can be defined by how useful a pair of documents/pieces of text is to machine translation. This means, for instance, that documents with similar "language" might be comparable, even if they talk about different entities.

Granularity of comparability: Comparability can be defined at different granularities, i.e. units at different structural levels: (a) corpus-level comparability – between corpus A and corpus B as a whole, or comparability between individual sections (subcorpora) within the corpus; (b) document-level comparability – between different documents within or across corpora (Lee et al., 2005); (c) paragraph- and sentence-level comparability – between structural and communicative units within or across individual documents (Li et al., 2006); (d) comparability of sub-sentential units – between clauses, phrases, multi-word expressions, and lexico-grammatical constructions. The approaches described in this paper primarily focus on comparability at the higher levels (corpus and document comparability), with the task of selecting comparable corpora, texts and paragraphs for further alignment and use within MT.

Degrees of comparability: Further, comparability is not a binary notion. On the contrary, it is a continuous value that depends on how useful a pair of texts is for MT. For simplicity we discretize comparability into three degrees: "parallel", "strongly comparable" and "weakly comparable". Table 1 presents the different degrees of comparability along with intuitive definitions. This table was used as guidelines in the collection and annotation of the pairs of documents that constitute our data set (described in the following section).

Initial Comparable Corpora: Two bilingual comparable corpora were chosen as a dataset for the feature analysis: an English-Greek (EN-EL) corpus and a Romanian-Greek (RO-EL) corpus. Both corpora were collected by the Athena Research and Innovation Center in Information Communication & Knowledge Technologies, Institute for Language and Speech Processing (ILSP), Athens, Greece.

Parallel: These are traditional parallel texts, i.e. texts which are true and accurate translations, as well as texts which are approximate translations with minor language-specific variations, such as new examples or omissions.

Strongly comparable: These are heavily edited translations or independent but closely related texts reporting the same event or describing the same subject. This category includes texts coming from the same source under the same editorial control but written in different languages, e.g. BBC News in English and Romanian; independently written texts concerning the same subject, e.g. Wikipedia articles linked via Wiki links; or news items concerning exactly the same specific event from AFP, DPA and Reuters.

Weakly comparable: This category includes texts in the same narrow subject domain and genre but describing different events, or texts within the same broader domain and genre but varying in subdomains and specific genres.

Table 1: The three grades of comparability.

The corpora contain documents from a variety of domains and genres (shown in brackets): International news (Newswires), Sports (Newswires), Admin (Legal), Travel (Advice), Software (Wikipedia, User Manuals) and Medicine (For doctors, For patients). The comparability level between pairs of documents was categorized into three different types: parallel, strongly comparable and weakly comparable, according to Table 1. ILSP collected parallel data from the official website of the European Union, which mostly contains legal documents, and from multilingual news sites to obtain newswire documents. Strongly comparable data were collected manually by retrieving newspaper articles about the same events. For the other domains, specific topics were manually chosen, and the partners had to find comparable data in different languages. This process was reported to be quite difficult and very time-consuming. To obtain the weakly comparable data, ILSP extracted seed words from the previously collected parallel and strongly comparable data using BootCat, and retrieved new documents based on these words. The size of both corpora (in number of Greek words) is shown in Table 2.

| Corpora | Parallel | Strongly | Weakly  | All       |
|---------|----------|----------|---------|-----------|
| EN-EL   | 191,843  | 294,554  | 952,534 | 1,438,931 |
| RO-EL   | 282,213  | 267,897  | 315,108 | 865,218   |

Table 2: Initial Comparable Corpora size (in words)

The documents in the EN-EL corpora were distributed quite evenly over most domains, with an average of 200,000 words. The Software domain, however, is notably smaller for the Wikipedia and User Manuals genres, at 20,000 and 113,000 words respectively. The average size of the RO-EL corpora is slightly lower, around 170,000 words for most domains, with the Software domain containing 27,000 words and Admin 90,000 words. No documents belonging to the Travel domain were found. ILSP also provided the alignment between documents, mostly for parallel and strongly comparable pairs. However, of all the retrieved weakly comparable documents, only a small subset was paired at the document level. Note that in this work we only used paired documents. Furthermore, the corpora did not contain any pairs of non-comparable documents. Certainly, one could randomly select documents from the entire Web, and given the enormous size of the Web these documents could be considered non-comparable. However, especially for training rankers or classifiers, documents that are close to the margins of separation are the most useful ones.

That is, we want to identify document pairs that may have similar characteristics to comparable documents but are nevertheless non-comparable. These will constitute hard examples that can improve the performance of machine learning and feature extraction technologies. How hard these examples should be is an open question, given that the different degrees of comparability may not be separable in our feature space. For this reason, as an initial exploration, we generated three sets of non-comparable document pairs: (1) document pairs that come from the same domain but a different genre, (2) document pairs that come from the same genre but a different domain, and (3) document pairs that come from a different domain and a different genre, as a safe option; a sketch of how such negative pairs could be sampled is given below.
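As an illustration only (the paper does not give an implementation; the function name and the (doc_id, domain, genre) document representation below are our assumptions), negative pairs of each kind could be sampled along these lines:

```python
import random

def sample_non_comparable_pairs(src_docs, tgt_docs, n_pairs, mode, seed=0):
    """Sample negative (non-comparable) cross-language document pairs.

    src_docs, tgt_docs: lists of (doc_id, domain, genre) tuples.
    mode: 'same_domain_diff_genre', 'diff_domain_same_genre',
          or 'diff_domain_diff_genre' (the "safe" option).
    """
    rng = random.Random(seed)
    # Required (same_domain?, same_genre?) combination per mode.
    wanted = {
        'same_domain_diff_genre': (True, False),
        'diff_domain_same_genre': (False, True),
        'diff_domain_diff_genre': (False, False),
    }[mode]
    pairs = set()
    while len(pairs) < n_pairs:  # assumes the requested mode is satisfiable
        s = rng.choice(src_docs)
        t = rng.choice(tgt_docs)
        if (s[1] == t[1], s[2] == t[2]) == wanted:
            pairs.add((s[0], t[0]))
    return sorted(pairs)
```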

3 Methodology

The document pairs in the ICC constitute the data set over which we test our methods for identifying comparable document pairs. From each document pair in the ICC we extract a number of features that could be used to identify the comparability of the pair. Given a set of (comparability label, features) examples, we learn a classification function that is used to classify new document pairs into the different comparability classes. In Section 3.1 we describe the features we extract from the ICC and examine the distribution of their mean values across the comparability levels defined in the ICC. In Section 3.2 we describe two machine learning techniques we used to build a multi-class classifier.

3.1 Features / Criteria of Comparability

A list of the extracted features can be viewed in Table 3. These features are divided into two categories: language-dependent features, which require some translation method or other linguistic knowledge, and language-independent features, which do not. For the language-dependent features, Google Translate was used to translate Greek into English and Romanian. Google Translate is not expected to perform excellently on these under-resourced languages and domains; nevertheless, it can still be considered of better quality than bilingual dictionaries or MT based on our parallel corpora. Thus, the results of our classification can be considered a realistic upper bound.

Language-dependent features
1. Word/stem overlap (relative to source language)
2. Cosine similarity on word/stem occurrence
3. Cosine similarity on word/stem TF
4. Cosine similarity on stemmed word TF-IDF
5. Cosine similarity on stemmed bi-gram and tri-gram TF

Language-independent features
1. Out-link overlap
2. URL character overlap
3. URL number-of-slashes difference
4. Image link word overlap
5. Image link filename overlap

Table 3: Features or criteria of comparability

The extracted language-dependent features include the relative word overlap, i.e. the number of common unique words in the source and target documents divided by the number of unique words in the source document; the relative stem overlap, the same as the relative word overlap except that words are first stemmed by Porter's stemmer (Sparck Jones and Willett, 1997); and cosine similarity in the document vector space (Salton and McGill, 1983; Salton et al., 1975). In the vector space, documents are represented as t-dimensional vectors, where t is the number of words (or stems) in the entire corpus of source (English or Romanian in this case) and translated target (Greek in this case) documents. The vectors can be binary (1 if a word/stem is present in a document and 0 otherwise), or can contain term (word or stem) frequencies (TF) or term frequencies weighted by the importance of a word in the corpus (TF-IDF). The cosine similarity for bi-gram and tri-gram TF vectors was also computed. The extracted language-independent features include the out-link overlap, the number of common out-links in the two documents; the image link overlap, the number of common image source URLs in the two documents; the URL level overlap, the difference in the number of slashes in the URLs that correspond to the two documents (recall that documents were crawled from the Web and thus each document corresponds to a URL); and the URL character overlap.
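To make these definitions concrete, here is a minimal sketch of two of the features (the function names are ours, and the target document is assumed to be already translated into the source language, as in the paper's setup):

```python
import math
from collections import Counter

def relative_word_overlap(source_words, target_words):
    """Common unique words of source and (translated) target,
    divided by the number of unique words in the source."""
    src, tgt = set(source_words), set(target_words)
    return len(src & tgt) / len(src) if src else 0.0

def cosine_similarity_tf(source_words, target_words):
    """Cosine similarity between the two term-frequency vectors."""
    tf_s, tf_t = Counter(source_words), Counter(target_words)
    dot = sum(tf_s[w] * tf_t[w] for w in tf_s.keys() & tf_t.keys())
    norm = math.sqrt(sum(v * v for v in tf_s.values()))
    norm *= math.sqrt(sum(v * v for v in tf_t.values()))
    return dot / norm if norm else 0.0

print(relative_word_overlap("a b c".split(), "a b d".split()))  # 0.666...
```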

Features were extracted for the English-Greek (EN-EL) and Romanian-Greek (RO-EL) language pairs. Tables 4 and 5 illustrate the distribution of the mean feature values over the different comparability levels for the EN-EL corpus. The mean feature values for the RO-EL corpus are similar to those for the EN-EL corpus and are thus not presented here due to space limitations. As can be seen in Table 4, all language-dependent features appear to be good indicators of the comparability grade of a document pair. In particular, an identifiable gap in the values between parallel, strongly comparable, weakly comparable and non-comparable document pairs is present for most of the features. Only the cosine similarity among term frequencies seems unable to separate strongly from weakly comparable pairs. Table 5 shows the language-independent features of comparability. Language-independent features are particularly appealing since no initial translation is required. As can be seen in Table 5, these language-independent features can identify parallel documents well but have an extremely hard time identifying any of the other degrees of comparability. Essentially, there is not much difference in the values of these features between comparable and non-comparable documents. One exception is the image link overlap, which appears to identify strongly comparable documents along with the parallel ones.

3.2 Classification

Support Vector Machines (SVM) were used for the classification (Vapnik, 1995). Given that we have more than two comparability classes, two different techniques were used for multi-class classification. Our first method uses Joachims' implementation, SVMmulticlass (Tsochantaridis et al., 2004), of the multi-class classifier described in (Crammer and Singer, 2002). For a training set (x_1, y_1), ..., (x_n, y_n) with labels y_i ∈ [1..k], it finds the solution of the following optimization problem during training:

    min  (1/2) Σ_{i=1..k} w_i · w_i + (C/n) Σ_{i=1..n} ξ_i

    s.t. ∀y ∈ [1..k]:
         x_1 · w_{y_1} ≥ x_1 · w_y + 100 · Δ(y_1, y) − ξ_1
         ...
         x_n · w_{y_n} ≥ x_n · w_y + 100 · Δ(y_n, y) − ξ_n

C is the regularization parameter that trades off margin size and training error. Δ(y_n, y) is the loss function that returns 0 if y_n equals y, and 1 otherwise.
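The paper uses Joachims' SVMmulticlass; for readers who want to experiment, the same Crammer-Singer formulation is also available in scikit-learn. The following is a sketch with synthetic stand-in data, not the authors' setup:

```python
import numpy as np
from sklearn.svm import LinearSVC

# Synthetic stand-in for the (features, comparability label) data; in the
# paper each row would hold the Table 3 features of one document pair.
rng = np.random.default_rng(0)
X = rng.random((300, 10))            # 10 comparability features per pair
y = rng.integers(0, 3, size=300)     # 0=non-comp., 1=strongly, 2=parallel

# Crammer-Singer multi-class SVM; C trades off margin vs training error.
clf = LinearSVC(multi_class="crammer_singer", C=1.0, max_iter=10000)
clf.fit(X, y)
print(clf.predict(X[:5]))
```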

Our second method solves the multi-class classification problem via Error-Correcting Output Codes (ECOC) (Dietterich and Bakiri, 1995). In ECOC each comparability class is assigned a unique binary string of length n (codeword). An example can be seen in Table 6. For instance, the codeword assigned to the strongly comparable class is 0011001. The multi-class classification problem is decomposed into n binary classification problems, one per column. During learning, a binary classifier f_i is trained for each column i, with the labels for each class in the training data given by the bit in the corresponding column. For instance, according to Table 6, during training of the classifier f_0, the non-comparable document pairs constitute the positive examples, while all other classes constitute the negative ones. During testing, all classifiers are evaluated to obtain an n-length string s for the new document pair x. Let us assume that the resulting string s is 1101111. The string is then compared to all codewords and x is assigned to the class whose codeword is closest to s in terms of Hamming distance. In the aforementioned example that would be the non-comparable class. A good error-correcting output code should satisfy two properties: (a) each codeword should be well separated in Hamming distance from every other codeword, so that the maximum number of misclassifications can be corrected; and (b) each column classification function should be uncorrelated with the functions learned for the other columns; this implies that each column string should also be well separated in Hamming distance from every other column string and their complements. Details of the algorithm can be found in (Dietterich and Bakiri, 1995). One of the advantages of ECOC is robustness with respect to changes in the size of the training sample. This is particularly important given that our data set is extremely skewed, with many more non-comparable document pairs than comparable ones. The binary classifier employed is Joachims' implementation, SVMlight (Joachims, 1999), of Vapnik's SVM (Vapnik, 1995).
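A minimal sketch of ECOC training and Hamming-distance decoding with the Table 6 code is given below; we substitute scikit-learn's LinearSVC for the SVMlight binary classifier used in the paper, so this illustrates the scheme rather than reproducing the authors' implementation:

```python
import numpy as np
from sklearn.svm import LinearSVC

# Codewords from Table 6: one row per class, one column per binary classifier.
CODEWORDS = np.array([
    [1, 1, 1, 1, 1, 1, 1],   # non-comparable
    [0, 0, 0, 0, 1, 1, 1],   # weakly comparable
    [0, 0, 1, 1, 0, 0, 1],   # strongly comparable
    [0, 1, 0, 1, 0, 1, 0],   # parallel
])

def train_ecoc(X, y):
    """Train one binary classifier per codeword column (bit).

    X: feature matrix; y: integer class indices into CODEWORDS.
    """
    classifiers = []
    for col in range(CODEWORDS.shape[1]):
        bits = CODEWORDS[np.asarray(y), col]   # binary relabeling per column
        classifiers.append(LinearSVC(C=1.0, max_iter=10000).fit(X, bits))
    return classifiers

def predict_ecoc(classifiers, X):
    """Assign each example to the class with the nearest codeword."""
    bits = np.column_stack([clf.predict(X) for clf in classifiers])
    hamming = (bits[:, None, :] != CODEWORDS[None, :, :]).sum(axis=2)
    return hamming.argmin(axis=1)
```

For the example in the text, a predicted bit string 1101111 has Hamming distance 1 to the non-comparable codeword 1111111 and distance at least 4 to all others, so the pair is decoded as non-comparable.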

| Average | Word Overlap (Rel.) | Stem Overlap (Rel.) | Cos Sim. Word Occ. | Cos Sim. Stem Occ. | Cos Sim. Word TF | Cos Sim. Stem TF | Cos Sim. Stem TF-IDF | Cos Sim. Bi-gram TF | Cos Sim. Tri-gram TF |
|---|---|---|---|---|---|---|---|---|---|
| Parallel | 65.89% | 69.22% | 0.67 | 0.70 | 0.92 | 0.92 | 0.77 | 0.58 | 0.34 |
| Strongly comparable | 37.44% | 40.95% | 0.43 | 0.47 | 0.80 | 0.80 | 0.60 | 0.29 | 0.11 |
| Weakly comparable | 26.89% | 31.65% | 0.26 | 0.30 | 0.82 | 0.82 | 0.41 | 0.25 | 0.02 |
| Same Domain, Different Genre | 14.11% | 16.63% | 0.15 | 0.18 | 0.57 | 0.56 | 0.06 | 0.07 | 0.003 |
| Different Domain, Same Genre | 13.36% | 15.22% | 0.15 | 0.17 | 0.63 | 0.62 | 0.05 | 0.08 | 0.001 |
| Different Domain, Different Genre | 11.88% | 13.78% | 0.14 | 0.16 | 0.57 | 0.56 | 0.04 | 0.06 | 0.0006 |

Table 4: Language-dependent comparability features.

| Average | Out-Links Overlap | Image Links Overlap | URL Level Overlap | URL Character Overlap |
|---|---|---|---|---|
| Parallel | 2.91 | 0.41 | 2.942 | 34.05 |
| Strongly comparable | 0.17 | 0.00 | 0.018 | 3.52 |
| Weakly comparable | 0.00 | 0.00 | 0.000 | 0.80 |
| Same Domain, Different Genre | 0.00 | 0.00 | 0.012 | 2.94 |
| Different Domain, Same Genre | 0.00 | 0.00 | 0.004 | 1.75 |
| Different Domain, Different Genre | 0.00 | 0.00 | 0.000 | 2.58 |

Table 5: Language-independent comparability features.

| Class     | f0 | f1 | f2 | f3 | f4 | f5 | f6 |
|-----------|----|----|----|----|----|----|----|
| Non-comp. | 1  | 1  | 1  | 1  | 1  | 1  | 1  |
| Weakly    | 0  | 0  | 0  | 0  | 1  | 1  | 1  |
| Strongly  | 0  | 0  | 1  | 1  | 0  | 0  | 1  |
| Parallel  | 0  | 1  | 0  | 1  | 0  | 1  | 0  |

Table 6: Error-Correcting Code for a four-class problem.

4 Results

In this section we report the results of our classification. A 5-fold cross-validation was performed. Document pairs were randomly shuffled and split evenly into 5 buckets, S1 ... S5. 3/5 of the document pairs (i.e. 3 buckets) were used for training, 1/5 for validation and 1/5 for testing. 5 folds were generated, with the first containing S1 ... S3 as training data, S4 as validation data and S5 as test data; the second containing S2 ... S4 as training data, S5 as validation data and S1 as test data; and so on. The validation set was used to select the regularization parameter C that trades off margin size and training error. The parameter C was varied from 0.0001 to 2 and the best value was selected based on the Average Class Accuracy (ACA) on the validation data.

To compute ACA, the accuracy for each class is computed separately as the number of correctly classified document pairs in the class under consideration divided by the total number of document pairs in that class; the average over all classes is then taken. ACA, as opposed to Overall Accuracy (OA), i.e. the number of correctly classified pairs over all classes divided by the total number of pairs in the set, does not depend on the size of each class. ACA is the measure we report in our results. Where more detailed analysis is required, contingency tables are also reported. In the first set of experiments we considered only a three-class problem ("non-comparable", "strongly comparable" and "parallel") and used the two algorithms described above, SVMmulticlass and ECOC classification. Moreover, we only used document pairs from a different genre and a different domain as "non-comparable", as a safe choice. Further, given that we can generate numerous non-comparable document pairs, we varied the ratio between non-comparable and strongly comparable/parallel pairs from 0.4 to 2.
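For concreteness, a small sketch of the ACA measure as defined above (our helper, assuming NumPy label arrays):

```python
import numpy as np

def average_class_accuracy(y_true, y_pred):
    """Mean of the per-class accuracies; unlike overall accuracy,
    it does not depend on the size of each class."""
    per_class = [(y_pred[y_true == c] == c).mean()
                 for c in np.unique(y_true)]
    return float(np.mean(per_class))
```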

[Figure 1: Average class accuracy for SVMmulti and ECOC over the EN-EL corpus and for varying ratio between the non-comparable and the comparable/parallel document pairs. x-axis: ratio of non-comp. to comparable and parallel pairs (0 to 2.5); y-axis: average class accuracy (0.5 to 1.0).]

The results for the two algorithms (SVMmulti and ECOC) over the EN-EL corpus can be viewed in Figure 1. The solid lines correspond to the SVMmulti algorithm, while the dashed lines correspond to the ECOC algorithm. As can be observed, the ECOC algorithm almost perfectly classifies all document pairs into the three categories, achieving an ACA of about 0.97 in both corpora. Furthermore, one can observe that the ratio, i.e. the skewness of the training data, does not affect the performance of the classifier, which remains approximately fixed. On the other hand, SVMmulti demonstrates significantly worse effectiveness, achieving an ACA of about 0.73. Furthermore, the effectiveness of the algorithm highly depends on the size of the training data in each class. The achieved ACA values vary between 0.55 and 0.73 depending on the skewness of the data. The optimal ACA is achieved with a 1.2-1.6 : 1 ratio between non-comparable and comparable/parallel document pairs. To further understand how SVMmulti fails to correctly classify document pairs, one can observe the following two contingency tables for the EN-EL corpus [2].

0.4:1 ratio:
| Predicted \ Actual | Non-comp. | Strongly | Parallel |
|---|---|---|---|
| Non-comp. | 46 | 37 | 0 |
| Strongly | 61 | 175 | 4 |
| Parallel | 97 | 188 | 82 |

1.8:1 ratio:
| Predicted \ Actual | Non-comp. | Strongly | Parallel |
|---|---|---|---|
| Non-comp. | 712 | 304 | 6 |
| Strongly | 0 | 0 | 0 |
| Parallel | 216 | 96 | 80 |

[2] Note that when contingency tables are reported they correspond to the optimal training ratio. Thus, the total number of non-comparable document pairs often differs across contingency tables.

In the first table the ratio is 0.4:1, while in the second it is 1.8:1. As can be observed, while fewer non-comparable document pairs allow the algorithm to classify strongly comparable and parallel document pairs reasonably well, it massively misclassifies non-comparable document pairs. On the other hand, when more non-comparable document pairs are used, the algorithm learns to classify those correctly but cannot distinguish them from the strongly comparable document pairs, which it misclassifies. Contrary to SVMmulti, ECOC almost perfectly classifies all document pairs in both corpora. The contingency tables for EN-EL and RO-EL respectively can be viewed below.

EN-EL:
| Predicted \ Actual | Non-comp. | Strongly | Parallel |
|---|---|---|---|
| Non-comp. | 299 | 0 | 0 |
| Strongly | 0 | 385 | 3 |
| Parallel | 0 | 2 | 83 |

RO-EL:
| Predicted \ Actual | Non-comp. | Strongly | Parallel |
|---|---|---|---|
| Non-comp. | 292 | 45 | 0 |
| Strongly | 0 | 305 | 2 |
| Parallel | 0 | 5 | 35 |

Given that ECOC clearly outperforms SVMmulti, in the remaining experiments we only report results obtained with the Error-Correcting Output Codes algorithm. So far the constructed classifiers have used both language-dependent and language-independent features. In the next experiment we explored the use of only language-dependent and only language-independent features in the classification of document pairs. The contingency table below reports results when only language-independent features are used by the classifier. As can be seen, language-independent features are very useful for identifying parallel documents. This is in line with past results on identifying parallel documents by using the document structure, out-links and image links, and also in line with the average feature values reported in Table 5.

However, as one can observe in the table below, such features are unable to distinguish comparable from non-comparable documents, as expected. Given that language-independent features are particularly appealing, since they do not require any translation technology in the first place, the results below highlight the need to identify language-independent features different from those used so far for identifying parallel documents.

Lang. Ind. (EN-EL):
| Predicted \ Actual | Non-comp. | Strongly | Parallel |
|---|---|---|---|
| Non-comp. | 0 | 0 | 0 |
| Strongly | 100 | 394 | 3 |
| Parallel | 0 | 6 | 83 |

On the other hand, language-dependent features are able to clearly distinguish comparable from non-comparable document pairs, as can be seen in the contingency table below. Nevertheless, they sometimes misclassify parallel documents as comparable.

Lang. Dep. (EN-EL):
| Predicted \ Actual | Non-comp. | Strongly | Parallel |
|---|---|---|---|
| Non-comp. | 100 | 10 | 0 |
| Strongly | 0 | 364 | 17 |
| Parallel | 0 | 26 | 69 |

Comparing the last two tables with the EN-EL ECOC contingency table that appeared earlier in this section, one can observe that combining the ability of the language-dependent features to distinguish non-comparable from comparable document pairs with the ability of the language-independent features to identify parallel document pairs leads to the optimal result. Similar results were obtained for the RO-EL corpus. The two contingency tables corresponding to language-independent and language-dependent features can be viewed below. Here the classifier over language-independent features can better distinguish non-comparable from comparable document pairs than in the EN-EL corpus, but clearly still misclassifies many of them.

Lang. Ind. (RO-EL):
| Predicted \ Actual | Non-comp. | Strongly | Parallel |
|---|---|---|---|
| Non-comp. | 277 | 191 | 1 |
| Strongly | 16 | 161 | 0 |
| Parallel | 0 | 3 | 36 |

Lang. Dep. (RO-EL):
| Predicted \ Actual | Non-comp. | Strongly | Parallel |
|---|---|---|---|
| Non-comp. | 291 | 64 | 0 |
| Strongly | 4 | 272 | 1 |
| Parallel | 3 | 19 | 36 |

So far we have ignored the weakly comparable documents. Such documents could also be useful as training data for Machine Translation. The following contingency table reports the results of the ECOC algorithm on a four-class classification ("non-comparable", "weakly comparable", "strongly comparable" and "parallel"). Similar to the previous experiments, the classifier correctly classified most of the non-comparable, strongly comparable and parallel document pairs. On the other hand, it failed to correctly classify the weakly comparable ones. Instead, most of the weakly comparable document pairs were misclassified as strongly comparable and a couple of them as non-comparable. A possible explanation for this outcome is the limited number of weakly comparable document pairs. There were only 45 of them in total, and thus in a 5-fold cross-validation setup only 27 of them were used during training. Given that weakly comparable document pairs are probably the hardest to identify, a classifier can hardly learn anything about this class from only 27 instances. Another possible explanation comes from the fact that all labels were manually assigned, and thus mislabeling strongly comparable as weakly comparable and vice versa could be a common mistake. We plan to further investigate this issue in the future.

| Predicted \ Actual | Non-comp. | Weakly | Strongly | Parallel |
|---|---|---|---|---|
| Non-comp. | 419 | 3 | 11 | 0 |
| Weakly | 0 | 0 | 0 | 0 |
| Strongly | 0 | 42 | 383 | 3 |
| Parallel | 0 | 0 | 6 | 83 |

One last experiment we ran used all six classes of comparability, i.e. "parallel", "strongly comparable", "weakly comparable" and the three classes of non-comparable document pairs. The purpose of this experiment was to investigate (a) whether the classifier can tell the different "non-comparable" categories apart, and (b) whether harder negative examples, i.e. non-comparable document pairs from the same domain but a different genre and vice versa, could further improve classification. The results of this experiment demonstrated that it is particularly hard to distinguish document pairs from the three classes of non-comparability. A large number of non-comparable pairs were misclassified across these three classes, but none of them were misclassified as one of the comparable classes. Further, the use of these "hard" instances did not seem to further improve the classification of comparable document pairs, with weakly comparable pairs again being misclassified as strongly comparable.

5 Conclusions

In this work we employed machine learning to identify comparable document pairs for the purpose of Machine Translation for under-resourced languages and subject domains. We extracted a number of language-dependent and language-independent features from two corpora (EN-EL and RO-EL) and used two ML algorithms to learn classification functions: an SVM multi-class classifier and Error-Correcting Output Codes. We demonstrated that ECOC outperforms the SVM multi-class classifier, achieving an average class accuracy of about 0.93, and that ECOC is robust to the skewness of the training data. Extensive experiments with ECOC also demonstrated that language-independent features can identify parallel document pairs but cannot distinguish between comparable and non-comparable pairs. Language-dependent features, on the other hand, can clearly distinguish between these two classes; they can also identify parallel document pairs, but not with the same accuracy as language-independent features. Finally, we have shown that weakly comparable document pairs are often misclassified as strongly comparable, and that the use of "harder" negative examples does not further improve classification.

In the future we intend to explore the use of new language-independent features that can capture the characteristics of comparable document pairs. Regarding the language-dependent features, we intend to employ bilingual dictionaries or MT models trained on our parallel documents instead of Google Translate; as mentioned earlier, Google Translate only gives a realistic upper bound. We also plan to explore the reasons weakly comparable pairs are misclassified. Manual alignment of weakly comparable document pairs may be needed to increase the number of instances. Further, we intend to investigate whether the constructed classifiers are corpus-dependent; if not, then using resources from multiple language pairs, and thus more training data, may further help in identifying comparable texts. Finally, we intend to use comparable document pairs to train SMT models and explore whether such pairs can improve the quality of translation.

6 Acknowledgments

The reported work has been carried out within the ACCURAT project, funded by the European Community's Seventh Framework Programme (FP7/2007-2013) under Grant Agreement no. 248347.

References

Koby Crammer and Yoram Singer. 2002. On the algorithmic implementation of multiclass kernel-based vector machines. Journal of Machine Learning Research, 2:265–292.

Thomas G. Dietterich and Ghulum Bakiri. 1995. Solving multiclass learning problems via error-correcting output codes. Journal of Artificial Intelligence Research, 2(1):263–286.

Thi-Ngoc-Diep Do, Viet-Bac Le, Brigitte Bigi, Laurent Besacier, and Eric Castelli. 2009. Mining a comparable text corpus for a Vietnamese-French statistical machine translation system. In StatMT '09: Proceedings of the Fourth Workshop on Statistical Machine Translation, pages 165–172, Morristown, NJ, USA. Association for Computational Linguistics.

Thorsten Joachims. 1999. Making large-scale support vector machine learning practical. In Advances in Kernel Methods: Support Vector Learning, pages 169–184.

Philipp Koehn. 2010. Statistical Machine Translation. Cambridge University Press.

Michael D. Lee and Matthew Welsh. 2005. An empirical evaluation of models of text document similarity. In Proceedings of CogSci 2005, pages 1254–1259.

Yuhua Li, David McLean, Zuhair A. Bandar, James D. O'Shea, and Keeley Crockett. 2006. Sentence similarity based on semantic nets and corpus statistics. IEEE Transactions on Knowledge and Data Engineering, 18(8):1138–1150.

Dragos Stefan Munteanu, Alexander Fraser, and Daniel Marcu. 2004. Improved machine translation performance via parallel sentence extraction from comparable corpora. In HLT-NAACL, pages 265–272.

Franz Josef Och and Hermann Ney. 2003. A systematic comparison of various statistical alignment models. Computational Linguistics, 29(1):19–51.

Philip Resnik and Noah A. Smith. 2003. The web as a parallel corpus. Computational Linguistics, 29(3):349–380.

Gerard Salton and Michael J. McGill. 1983. Introduction to Modern Information Retrieval. McGraw-Hill, New York, NY, USA.

Gerard Salton, Anita Wong, and Chung-Shu Yang. 1975. A vector space model for automatic indexing. Communications of the ACM, 18(11):613–620.

Karen Sparck Jones and Peter Willett. 1997. Readings in Information Retrieval. Morgan Kaufmann.

Ioannis Tsochantaridis, Thomas Hofmann, Thorsten Joachims, and Yasemin Altun. 2004. Support vector machine learning for interdependent and structured output spaces. In ICML '04: Proceedings of the Twenty-First International Conference on Machine Learning, page 104, New York, NY, USA. ACM.

Vladimir N. Vapnik. 1995. The Nature of Statistical Learning Theory. Springer-Verlag, New York, NY, USA.

Ying Zhang, Ke Wu, Jianfeng Gao, and Phil Vines. 2006. Automatic acquisition of Chinese-English parallel corpus from the web. In Proceedings of the 28th European Conference on Information Retrieval (ECIR 2006), pages 420–431. Springer.
