4th International Conference on Arabic Language Processing, May 2–3, 2012, Rabat, Morocco

Using a Statistical Language Model Combined with a Cross-Language Search Engine for English-Arabic Machine Translation

Nasredine Semmar*, Dhouha Bouamor*, Ali Jaoua**, Samir Elloumi**

* CEA LIST/LVIC, Gif sur Yvette, France. [email protected], [email protected]
** Qatar University/College of Engineering, Doha, Qatar. [email protected], [email protected]

Abstract—In this paper, we present a new approach to English-Arabic machine translation. This approach uses a cross-language search engine to extract translation candidates from a monolingual corpus and a bilingual reformulator to transform syntactic structures from the source language into the target language. Linguistic information (lemmas, part-of-speech tags and syntactic dependency relations) attached to the words of the translation candidates returned by the cross-language search engine is combined with a statistical model of the target language to produce a correct translation. This approach has been evaluated on the corpus of the MEDAR Machine Translation package. The results obtained are encouraging and demonstrate the effectiveness of the proposed approach.

Keywords—Cross-language information retrieval, bilingual reformulation, statistical language modeling, morphosyntactic analysis.

I. INTRODUCTION

There are mainly two approaches to Machine Translation (MT): rule-based and corpus-based [1] [2]. Rule-based approaches include word-to-word translation, syntactic translation with transfer rules, and interlingua approaches, which use an intermediate semantic representation common to more than one language. Corpus-based machine translation approaches use statistics and probability calculations to identify equivalences between texts in the corpus [3]. Hybrid approaches combine the strengths of rule-based and corpus-based machine translation strategies [4]. Reference [5] reported that, within the framework of factored and tree-based translation models, additional linguistic information (lemma, part-of-speech and morphological information) can be successfully exploited to overcome some shortcomings of the currently dominant phrase-based statistical machine translation approach and produces promising results.

Rule-based approaches require the manual development of bilingual lexicons and linguistic rules, which can be costly and which often do not generalize to other languages. Corpus-based approaches are effective only when large amounts of parallel text are available, and finding relevant parallel corpora for an arbitrary language pair remains a real challenge for current Statistical Machine Translation (SMT) systems.

In this paper, we present a new approach for machine translation which is based on a cross-language search engine to extract translation candidates from a monolingual corpus and on a bilingual reformulator to transform syntactic structures from the source language into the target language.

Linguistic information (lemmas, part-of-speech tags and syntactic dependency relations) provided by a multilingual analyzer for the words of the translation candidates returned by the cross-language search engine is combined with a statistical model of the target language to produce a correct translation. This approach was first used to develop an English-French prototype [6]; it has now been ported to the English-Arabic language pair.

Section 2 presents the main components of the English-Arabic machine translation prototype, focusing in particular on the cross-language search engine and the linguistic processing. Section 3 discusses the results obtained after translating a set of English sentences. Section 4 concludes our study and presents our future work.

II. MACHINE TRANSLATION BASED ON CROSS-LANGUAGE INFORMATION RETRIEVAL AND STATISTICAL LANGUAGE MODELLING

The main idea of our machine translation approach is to use only a monolingual corpus in the target language collected from the Web. This corpus is linguistically analyzed and the results are stored in the database of a cross-language search engine. For each sentence to translate, the search engine returns a set of sentences in the target language with their lemmas, part-of-speech tags, gender, number and syntactic dependency relations. These linguistic properties are used with a statistical language model learned from the target language corpus to find the correct translations. The English-Arabic machine translation prototype implementing our approach is composed of three modules: a cross-language search engine, a bilingual reformulator and a text generator (Figure 1):


Figure 1: Machine translation using cross-language information retrieval
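The following minimal Python sketch illustrates the data flow between these three modules. The object and method names (clir_engine.search, reformulator.reformulate, language_model.best_path, text_generator.generate) are hypothetical stand-ins introduced for the example; they do not correspond to the actual implementation of the prototype.

# Illustrative sketch of the translation pipeline of Figure 1 (hypothetical API).
def translate(source_sentence, clir_engine, reformulator, language_model, text_generator):
    # 1. Cross-language search: retrieve candidate target-language sentences with
    #    their linguistic properties (lemmas, POS tags, gender, number, dependencies).
    candidates = clir_engine.search(source_sentence)

    # 2. Bilingual reformulation: translate the source words with the bilingual
    #    lexicon and transform the syntactic structure into the target language.
    hypotheses = reformulator.reformulate(source_sentence)

    # 3. Assemble candidates and hypotheses in a lattice and let the statistical
    #    language model of the target language pick the best path.
    best_hypothesis = language_model.best_path(candidates + hypotheses)

    # 4. Text generation: inflect the lemmas of the selected hypothesis into
    #    surface forms using their grammatical category, gender and number.
    return text_generator.generate(best_hypothesis)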

A. The Cross-language search engine

The purpose of Cross-Language Information Retrieval (CLIR) is to find, in a collection of documents, all the documents relevant to a user's query written in a different language. In our use of CLIR for machine translation, a document corresponds to a sentence, and the role of the cross-language search engine is to retrieve, for each query (the sentence to translate), candidate translations from an indexed monolingual corpus. Our cross-language search engine is based on a deep linguistic analysis of the query and of the monolingual corpus to be indexed [7]. It is composed of the following modules (Figure 2):

Figure 2: Main components of the cross-language search engine

• A multilingual analyzer (LIMA) [8], which produces for the words of each sentence a set of normalized lemmas, a set of named entities and a set of syntactic dependency relations. These linguistic properties are associated with the translation candidates returned by the cross-language search engine.
• A statistical analyzer, which computes, for the documents to be indexed, concept weights based on concept database frequencies.
• A comparator, which computes intersections between queries and documents and provides a relevance weight for each intersection.
• A reformulator, which expands queries during the search. The expansion is used to infer from the original query words other words expressing the same concepts. The expansion can be in the same language (synonyms, hyponyms, etc.) or in a different language.
• An indexer, which builds the inverted files of the documents on the basis of their linguistic analysis and stores the indexed documents in a database.

The LIMA linguistic analyzer is built using a traditional architecture involving separate processing modules:
• A tokenizer, which separates the input stream into a graph of words. This separation is achieved by an automaton developed for each language and a set of segmentation rules.
• A morphological analyzer, which looks up each word in a general full-form dictionary. If a word is found, it is associated with its lemmas and all its grammatical tags. For Arabic agglutinated words which are not in the full-form dictionary, a clitic stemmer was added to the morphological analyzer. The role of this stemmer is to split agglutinated words into proclitics, simple forms and enclitics. The clitic stemmer proceeds as follows (a sketch of this procedure is given after this list of modules):
1. Several normalizations are performed: the vowel marks are removed, the characters أ and إ are replaced by the character ا, and the final characters ي, ئ, ؤ and ة are replaced by ى, ءى, ءو and ه respectively.
2. All clitic possibilities are computed using proclitic and enclitic dictionaries.
3. The radical, obtained by removing these clitics, is checked against the full-form lexicon. If it does not exist in the full-form lexicon, rewrite rules are applied and the altered form is checked against the full-form dictionary. For example, consider the token "بكرته" (with his ball) and the included clitics ب (with) and ه (his): the computed radical كرت does not exist in the full-form lexicon, but after applying one of the rewrite rules the modified radical "كرة" (ball) is found in the dictionary, and the input token is segmented into root and clitics as بكرته = ب + كرة + ه (with + ball + his).
4. The compatibility of the grammatical tags of the three components (proclitic, radical, enclitic) is then checked. Only valid segmentations are kept and added to the graph of words.




• An idiomatic expressions recognizer, which detects idiomatic expressions and treats them as single words for the rest of the processing. Idiomatic expressions are phrases or compound nouns that are listed in a specific dictionary. The detection of idiomatic expressions is performed by applying a set of rules that are triggered on specific words and tested on the left and right contexts of the trigger. These rules can recognize contiguous expressions such as "البيت الأبيض" (the white house). Non-contiguous expressions such as phrasal verbs are recognized too.
• A module to process unknown words, which assigns to these words default linguistic properties based on features identified during tokenization (e.g., presence of Arabic or Latin characters, numbers, etc.).
• A Part-Of-Speech (POS) tagger, which searches for valid paths through all the possible tag paths using attested trigram and bigram sequences. The trigram and bigram sequences are generated from a manually annotated training corpus; they are extracted from a hand-tagged corpus of 13257 Arabic words. If no continuous trigram full path is found, the POS tagger tries to use bigrams at the points where the trigrams were not found in the sequence. If no bigram allows completing the path, the word is left undisambiguated. The accuracy of the Arabic Part-Of-Speech tagger is around 86%.
• A syntactic analyzer, which splits the graph of words into nominal and verbal chains and recognizes dependency relations (especially those within compounds) by using a set of syntactic rules. We developed a set of dependency relations to link a noun to another noun, a noun to a proper noun, a proper noun to a post-nominal adjective and a noun to a post-nominal adjective. These relations are restricted to the same nominal chain and are used to compute compound words. For example, in the nominal chain "نقل المياه" (water transportation), the syntactic analyzer considers this nominal chain as a compound word "نقل_مياه" composed of the words "نقل" (transportation) and "مياه" (water).
• A named entity recognizer, which uses name triggers (e.g., President, lake, corporation, etc.) to identify named entities [9]. For example, the expression "الأول من مارس" (the first of March) is recognized as a date and the expression "قطر" (Qatar) is recognized as a location.
• A module to eliminate empty words, which consists in identifying words that should not be used as search criteria and removing them. These empty words are identified using only their Part-Of-Speech tags (such as prepositions, articles, punctuation marks and some adverbs). For example, the preposition "ل" (for) in the agglutinated word "للنقل" (for transportation) is considered an empty word.
• A module to normalize words by their lemmas. In the case where a word has several lemmas, only one of these lemmas is taken as the normalization. Each normalized word is associated with its morphosyntactic tag. For example, the normalization of the word "أنابيب" (pipelines), which is the plural of the word "أُنْبُوب" (pipeline), is represented by the couple (أُنْبُوب, Noun).
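The short Python sketch below illustrates the clitic stemming procedure described for the morphological analyzer (steps 1 to 4 above). The proclitic, enclitic and full-form dictionaries, as well as the single rewrite rule, are toy assumptions introduced for the example; the real LIMA resources are far larger, and the final tag-compatibility check of step 4 is omitted.

# Minimal, illustrative sketch of the Arabic clitic stemmer (toy resources only).
import re

PROCLITICS = {"و", "ف", "ب", "ل", "ال"}        # example proclitics
ENCLITICS = {"ه", "ها", "هم", "ك", "ي"}         # example enclitics
FULL_FORM_LEXICON = {"كرة", "نقل", "مياه"}      # tiny stand-in lexicon

def normalize(word):
    """Step 1 (simplified): remove vowel marks and normalize hamza letters."""
    word = re.sub(r"[\u064B-\u0652]", "", word)   # strip short vowels and tanwin
    return word.replace("أ", "ا").replace("إ", "ا")

def candidate_segmentations(word):
    """Step 2 (simplified): enumerate (proclitic, radical, enclitic) splits.
    Proclitic-only or enclitic-only splits are omitted for brevity."""
    yield ("", word, "")
    for p in PROCLITICS:
        for e in ENCLITICS:
            if word.startswith(p) and word.endswith(e) and len(word) > len(p) + len(e):
                yield (p, word[len(p):len(word) - len(e)], e)

def rewrite(radical):
    """Step 3 (simplified): one example rewrite rule restoring a final ة."""
    return radical[:-1] + "ة" if radical.endswith("ت") else radical

def stem(word):
    """Keep the segmentations whose radical belongs to the full-form lexicon."""
    word = normalize(word)
    valid = []
    for proclitic, radical, enclitic in candidate_segmentations(word):
        if radical not in FULL_FORM_LEXICON:
            radical = rewrite(radical)
        if radical in FULL_FORM_LEXICON:
            valid.append((proclitic, radical, enclitic))
    return valid

print(stem("بكرته"))   # -> [('ب', 'كرة', 'ه')]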

B. The Bilingual reformulator

Because the indexed monolingual corpus does not contain a translation of every sentence, we need a mechanism to extend the translations returned by the cross-language search engine. This is done by the bilingual reformulator, which consists, on the one hand, in transforming the syntactic structure of the sentence to translate into the target language and, on the other hand, in translating its words. This reformulator uses an English-Arabic bilingual lexicon composed of 149495 entries to translate words and a dozen linguistic rules to transform syntactic structures. These rules create translation hypotheses from the source language to the target language. For example, the rule Translation(A.B) = Translation(B).Translation(A) allows the translation into Arabic of the compound word "registration fee" as follows: Translation(registration.fee) = Translation(fee).Translation(registration) = رَسْم.تَسْجِيل. The results of the lexical and syntactic transformations achieved by the bilingual reformulator are assembled in a lattice together with the results of the cross-language search engine. This lattice contains linguistic information (lemma, part-of-speech, gender, number, etc.) for each word of the translation hypotheses. In order to select only the best translation hypothesis, we use a statistical model learned on a monolingual lemmatized corpus. The lattice is implemented using the AT&T FSM toolkit [10], and the language model is learned with the CRF++ toolkit [11].

C. The Text generator

The text generator produces a correct sentence in the target language by using the syntactic structure of the translation candidate. A flexor is used to obtain the right forms of the translation candidate words: it transforms the lemmas of the target language sentence into surface words. We use the linguistic information returned by the cross-language search engine to produce the right form of each lemma. The flexor transforms the lemma of a word into the surface form of this word by using the grammatical category, the gender and the number of the word. For example, the lemma "رَسْم" (fee) in the plural will be transformed into the form "رسوم" (fees). Sometimes, we obtain several forms for the same lemma; for example, the lemma "أَوْلَى" could have several surface forms: تولي, يولي, etc. To disambiguate, we use a statistical language model based on CRF that has been previously trained on a monolingual corpus. This disambiguation provides the right inflection of the lemma and therefore the best translation from the source language to the target language. A toy sketch of this disambiguation step is given below.
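As a toy illustration of the disambiguation performed by the text generator, the sketch below chooses among several candidate surface forms of a lemma by scoring the resulting sentences with a language model. The bigram-count scorer and the tiny corpus used here are simple stand-ins invented for the example; the prototype itself relies on a CRF-based model trained with CRF++, which is not reproduced here.

# Toy disambiguation of surface forms with a stand-in bigram language model.
from collections import Counter

def train_bigram_counts(sentences):
    """Count bigrams in a small monolingual corpus."""
    counts = Counter()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        counts.update(zip(tokens, tokens[1:]))
    return counts

def score(sentence, counts):
    """Very rough sentence score: sum of bigram counts (higher is better)."""
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    return sum(counts[bigram] for bigram in zip(tokens, tokens[1:]))

def best_surface_form(prefix, candidate_forms, suffix, counts):
    """Pick the surface form whose full sentence the model prefers."""
    sentences = [f"{prefix} {form} {suffix}".strip() for form in candidate_forms]
    return max(sentences, key=lambda s: score(s, counts))

# Illustrative corpus and candidate forms (toy data only).
corpus = ["فرنسا تولي أهمية قصوى للبيئة",
          "هو يولي اهتماما كبيرا بذلك"]
counts = train_bigram_counts(corpus)
print(best_surface_form("فرنسا", ["تولي", "يولي"], "أهمية قصوى", counts))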


III. EXPERIMENT RESULTS AND DISCUSSION

Our machine translation prototype has been evaluated on a monolingual corpus extracted from the English-Arabic parallel corpus provided by the MEDAR Machine Translation package. The MEDAR parallel corpus is composed of 96063 sentences and covers the general (news) domain. We extracted from this corpus a monolingual text composed of 75000 Arabic sentences, used as the indexed database of the cross-language search engine, and 500 English sentences to translate into Arabic. The experiment consisted in submitting these 500 sentences to the machine translation prototype and comparing the returned translations with the reference translations of the original MEDAR parallel corpus.

We show below, with an example, the entire translation process step by step. The English sentence to translate is "France attaches great importance to tackling climate change.". The first step consists in retrieving the translation candidates from the monolingual corpus by submitting the sentence to translate as a query to the cross-language search engine. Table I illustrates the first six translation candidates returned by the cross-language search engine.

TABLE I. ARABIC RETRIEVED SENTENCES CORRESPONDING TO THE ENGLISH QUERY "FRANCE ATTACHES GREAT IMPORTANCE TO TACKLING CLIMATE CHANGE."

Rank  Score        Translation candidate
1     0.832656     (Arabic sentence)
2     0.037076     (Arabic sentence)
3     0.037076     (Arabic sentence)
4     0.037076     (Arabic sentence)
5     0.037076     (Arabic sentence)
6     0.00662131   (Arabic sentence)

For each word of the retrieved candidates, the cross-language search engine also returns its lemma, its surface form and its morphosyntactic category, in the form lemma | surface form#category (for example, the pair أَوْلَى | تولي tagged L_VERBE_ACC and the pair لِ | ل tagged L_PREP_AVEC_NOM). This linguistic information is assembled in the lattice together with the output of the bilingual reformulator, and the statistical language model selects the best translation hypothesis from this lattice.
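The comparison against the reference translations can be pictured with the following sketch, which loops over the test sentences and measures a simple word-overlap score between each produced translation and its MEDAR reference. The overlap measure and the variable names are assumptions made for illustration only; the paper does not specify the exact evaluation metric used.

# Hypothetical evaluation loop: translate test sentences and compare each
# output with its MEDAR reference using a simple word-overlap ratio.
def word_overlap(hypothesis, reference):
    hyp, ref = set(hypothesis.split()), set(reference.split())
    return len(hyp & ref) / max(len(ref), 1)

def evaluate(test_pairs, translate):
    """test_pairs: list of (english_sentence, arabic_reference) tuples."""
    scores = [word_overlap(translate(english), reference)
              for english, reference in test_pairs]
    return sum(scores) / len(scores)

# Usage (illustrative): average_score = evaluate(medar_test_pairs, prototype.translate)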
