A HYBRID RULES AND STATISTICAL METHOD FOR ARABIC TO ENGLISH MACHINE TRANSLATION

ARWA HATEM QASSIM

THESIS SUBMITTED IN FULFILMENT FOR THE DEGREE OF DOCTOR OF PHILOSOPHY

FACULTY OF INFORMATION SCIENCE AND TECHNOLOGY UNIVERSITI KEBANGSAAN MALAYSIA BANGI

2017

KAEDAH HIBRID BERASASKAN PERATURAN DAN STATISTIK UNTUK TERJEMAHAN MESIN BAHASA ARAB KE BAHASA INGGERIS

ARWA HATEM QASSIM

TESIS YANG DIKEMUKAKAN UNTUK MEMPEROLEH IJAZAH DOKTOR FALSAFAH

FAKULTI TEKNOLOGI DAN SAINS MAKLUMAT UNIVERSITI KEBANGSAAN MALAYSIA BANGI 2017

DECLARATION

I hereby declare that the work in this thesis is my own except for quotations and summaries which have been duly acknowledged.

ARWA HATEM QASSIM P65613

ACKNOWLEDGMENT

All praise be to the mighty Allah, the Merciful and the Beneficent, for the strength and blessing in the completion of this study. Many wonderful people have contributed significantly throughout the course of my study up to the completion of this thesis, and I owe a great deal to them. First and foremost, I wish to express my most sincere acknowledgment to my supervisor, Assoc. Prof. Dr. Nazlia Omar, for her valuable guidance, generosity and freedom throughout the entire research and thesis writing. Sincere appreciation goes to Dr. Rabha Ibrahim for her constructive comments. Many thanks to Dr. Marta Ruiz Costa-jussa for the encouragement, thoughtful comments and helpful discussion. To my dearest mother and father, thank you for bringing me up to be who I am today. My success symbolises and reflects the support and love from both of you. My deepest appreciation goes to my husband, Dr. Khalid Shaker, and my children for their love, patience and understanding. I am very grateful to UKM and all the staff and members of the School of Computer Science for their help, and to the friends who have assisted me whenever I needed them.

ABSTRACT

Machine translation (MT) is the automatic translation of text from one language to another by computer. The main problem in this field is producing high-quality MT that meets human requirements. This is particularly challenging for translation between Arabic and English, because Arabic is a morphologically rich and complex language that differs significantly from other languages. These differences give rise to specific problems involving sentence structure rules, word order patterns, affixes, and ambiguity between the two languages. Arabic is a highly inflectional language with rich morphology, relatively free word order, and a variety of sentence structures such as subject–object–verb, subject–verb–object, verb–subject–object, and verb–object–subject. The language has a large number of prefixes, suffixes, and infixes that can modify a stem to form words, which leads to a large vocabulary. Ambiguity may occur when a sentence or a phrase has more than one structure or meaning. The objective of this study is to propose a hybrid method that combines rule-based and statistical approaches for Arabic-to-English MT. The rule-based approach includes 93 rules, developed from basic rules, that address word reordering and affixes in order to enhance the quality of translation from Arabic to English. This approach achieved 70% precision with the 1-gram model in the bilingual evaluation understudy (BLEU) system. The study also proposes a statistical approach that handles the ambiguity problem by using the Expectation Maximization algorithm to estimate word translation probabilities, so that the translation word is selected on the basis of the collocation of word translations. The Expectation Maximization approach achieved 76% precision with the 1-gram model in the BLEU system. To further improve the results, the study designed a new approach that hybridizes the rule-based approach and the Expectation Maximization algorithm. The hybrid approach combines the strength of the rule-based approach, which uses a large number of rules to handle the word ordering problem, with the strength of the statistical approach, which selects the translation word on the basis of collocation to resolve ambiguity. The proposed approach significantly outperformed other available systems, improved translation quality, and addressed the Arabic word ordering and ambiguity problems. The evaluation results show that it achieved 89% precision with the 1-gram model in the BLEU system. This study also proposes a new statistical evaluation metric, called the Hölder mean, which assesses MT quality by considering the size of word order differences based on the distance between the words in a sentence. The performance of the approach was tested on the United Nations Arabic–English parallel corpus. The hybrid approach achieved 91.9% on the Hölder mean metric.
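To make the idea of estimating word translation probabilities with Expectation Maximization more concrete, the following is a minimal sketch in the style of IBM Model 1 (without a NULL word). It is an illustration under stated assumptions, not the implementation used in this thesis: the function names `train_word_translation_em` and `best_translation`, the transliterated toy corpus, and the number of iterations are all hypothetical.

```python
from collections import defaultdict


def train_word_translation_em(pairs, iterations=10):
    """Estimate t(e | a) from (arabic_tokens, english_tokens) pairs with EM.

    Illustrative IBM Model 1 style estimator; not the thesis's own code.
    """
    english_vocab = {e for _, english in pairs for e in english}
    uniform = 1.0 / len(english_vocab)
    # t[(e, a)] approximates P(English word e | Arabic word a); start uniform.
    t = defaultdict(lambda: uniform)

    for _ in range(iterations):
        count = defaultdict(float)  # expected number of (e, a) alignments
        total = defaultdict(float)  # expected number of alignments from a

        # E-step: spread each English word over the Arabic words in its
        # sentence pair, in proportion to the current probabilities.
        for arabic, english in pairs:
            for e in english:
                norm = sum(t[(e, a)] for a in arabic)
                for a in arabic:
                    fraction = t[(e, a)] / norm
                    count[(e, a)] += fraction
                    total[a] += fraction

        # M-step: renormalise the expected counts into probabilities.
        t = defaultdict(
            float, {(e, a): c / total[a] for (e, a), c in count.items()}
        )
    return t


def best_translation(t, arabic_word):
    """Return the English word with the highest estimated t(e | a)."""
    candidates = {e: p for (e, a), p in t.items() if a == arabic_word}
    return max(candidates, key=candidates.get)


# Hypothetical toy corpus (transliterated Arabic, English).
toy_pairs = [
    (["kitab", "jadid"], ["new", "book"]),
    (["bayt", "jadid"], ["new", "house"]),
    (["kitab"], ["book"]),
]
toy_table = train_word_translation_em(toy_pairs)
```

On this toy corpus the sentence pair that contains only "kitab" and "book" anchors that translation, and the remaining probability mass then shifts so that "jadid" is explained by "new" and "bayt" by "house". A real system would also need smoothing, a NULL source word, and a far larger corpus.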

ABSTRAK

Machine translation refers to the translation of text by computer from one language to another. The main problem is to find high-quality machine translation that meets human requirements. However, obtaining high-quality machine translation is challenging, especially for translation between Arabic and English. This is because Arabic has a diversity of grammatical rules governing sentence structure and word order, and a morphology that is richer and more complex than that of other languages. This leads to specific problems such as differing knowledge of sentence structure rules, word order patterns, affixes, and ambiguity between the two languages. Arabic is highly inflectional, with rich morphology, relatively free word order, and a variety of sentence structures such as subject–object–verb (SOV), subject–verb–object (SVO), verb–subject–object (VSO), and verb–object–subject (VOS). It has a large number of prefixes, suffixes, and infixes that can modify a stem to form words, leading to a large vocabulary and lexicon size. Ambiguity may occur when a sentence or phrase has more than one structure or meaning. The objective of this study is to propose a hybrid method of rules and statistical approaches for machine translation from Arabic to English. The rule-based approach involves ninety-three rules, developed from basic rules, to solve the word reordering and affix problems and improve the quality of translation from Arabic to English. This approach achieved 70% precision with the 1-gram model under the BLEU system. Next, this study proposes a statistical approach to handle the ambiguity problem, using the Expectation Maximization algorithm to estimate word translation probabilities and to select the translation word based on the collocation of word translations. With the Expectation Maximization approach, 76% precision was achieved with the 1-gram model under the BLEU system. To further improve the results, a hybridization of the rule-based approach and the Expectation Maximization algorithm is proposed. The hybrid approach combines the strength of the rule-based approach in handling the word ordering problem with the strength of the statistical approach, which selects translation words based on collocation to address the ambiguity problem. The proposed approach significantly outperformed other systems and helped improve translation quality and the handling of the Arabic word ordering and ambiguity problems. The evaluation results show that the approach achieved 89% precision with the 1-gram model under the BLEU system. This thesis also proposes a new statistical evaluation metric, known as the Hölder mean, which assesses machine translation quality by taking into account the size of word order differences based on the distance between words in a sentence. The performance of the approach was tested on the United Nations Arabic–English corpus. The hybrid approach achieved 91.9% on the Hölder mean metric.
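The 1-gram BLEU precision figures quoted in the abstract can be read against the standard definition of modified (clipped) unigram precision. The sketch below shows that definition for a single reference translation. It is an assumption-laden illustration: the function name `modified_unigram_precision` and the example sentences are hypothetical, and the thesis's actual evaluation setup (number of references, tokenization, brevity penalty) is not described in this front matter.

```python
from collections import Counter


def modified_unigram_precision(candidate, reference):
    """Clipped 1-gram precision of a candidate against one reference.

    Each candidate word is credited at most as many times as it
    appears in the reference, as in the 1-gram component of BLEU.
    """
    if not candidate:
        return 0.0
    candidate_counts = Counter(candidate)
    reference_counts = Counter(reference)
    clipped = sum(
        min(count, reference_counts[word])
        for word, count in candidate_counts.items()
    )
    return clipped / len(candidate)


# Hypothetical example: "the" is repeated in the candidate but only
# credited once, so 2 of the 4 candidate words count as matches.
example_score = modified_unigram_precision(
    "the the the cat".split(), "the cat sat".split()
)
```

The full BLEU score additionally combines the precisions for longer n-grams by a geometric mean and applies a brevity penalty, so a 1-gram precision on its own mainly reflects word choice rather than word order.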

CONTENTS

DECLARATION
ACKNOWLEDGMENT
ABSTRACT
ABSTRAK
CONTENTS
LIST OF FIGURES
LIST OF TABLES

CHAPTER I  INTRODUCTION
1.1 Background and Motivation
1.2 Challenges in Arabic Machine Translation
    1.2.1 Word ordering problem
    1.2.2 Affixes
    1.2.3 Structural and Lexical Ambiguity Problem
    1.2.4 Evaluation of Machine Translation
1.3 Problem Statement
1.4 Research Questions
1.5 Research Objectives
1.6 Research Scope
1.7 Overview of the Thesis

CHAPTER II  LITERATURE REVIEW
2.1 Introduction
2.2 Machine Translation (MT)
2.3 Arabic Language Characteristics
2.4 Major Problems in Arabic MT
2.5 Linguistic Resources
    2.5.1 Parallel Corpus
    2.5.2 Morphological Analysers
2.6 Machine Translation Approaches
    2.6.1 Rule-based machine translation
    2.6.2 Statistical machine translation
    2.6.3 Example-based machine translation
    2.6.4 Knowledge-based machine translation
    2.6.5 Hybrid Method
2.7 Preprocessing in Machine Translation
2.8 English to Arabic Machine Translation
2.9 Arabic to English Machine Translation
2.10 Machine Translation from Arabic to Other Languages and Vice Versa
2.11 Evaluation Metrics for Machine Translation
    2.11.1 BLEU
    2.11.2 METEOR
    2.11.3 WER
    2.11.4 NIST
    2.11.5 F-measure
    2.11.6 ORANGE
2.12 Conclusion

CHAPTER III  RESEARCH METHODOLOGY
3.1 Introduction
3.2 Research Design
3.3 Conclusion

CHAPTER IV  RULE-BASED MACHINE TRANSLATION
4.1 Introduction
4.2 Phrase Structure
4.3 Rules Creation
4.4 Syntactic Reordering Phrase Based Arabic to English Machine Translation
    4.4.1 Tokenizer
    4.4.2 Morphological analysis
    4.4.3 Syntactic Parsing
    4.4.4 Morphological Lexicon
    4.4.5 Transfer Process
    4.4.6 Target Language Generation
4.5 Handling the word ordering problem
4.6 Affixes Problem Handling
4.7 Evaluation and Results
4.8 Conclusion

CHAPTER V  PROBABILITY ESTIMATION METHODS FOR ARABIC TO ENGLISH MACHINE TRANSLATION
5.1 Introduction
5.2 Estimating probabilities of word translation
    5.2.1 Frequency of Word Translation Method
    5.2.2 Collocation of word translation method
    5.2.3 Expectation Maximization Method
5.3 Evaluation results
5.4 Conclusion

CHAPTER VI  HYBRIDIZATION OF RULE AND EXPECTATION MAXIMIZATION ALGORITHM
6.1 Introduction
6.2 Hybridisation Rule and Expectation Maximisation Approach
    6.2.1 RBMT Component
    6.2.2 SMT Component
    6.2.3 Hybridisation Algorithm
6.3 Evaluation and Results
6.4 Conclusion

CHAPTER VII  MACHINE TRANSLATION TESTING AND EVALUATION TECHNIQUES
7.1 Introduction
7.2 Evaluation of Machine Translation Systems
7.3 Human Evaluation
7.4 Evaluation Metrics of Machine Translation
7.5 HÖLDER Mean
7.6 Evaluation Results
    7.6.1 BLEU and HÖLDER Evaluation
    7.6.2 Hypothesis Testing
    7.6.3 The Values of t-test and p-value
7.7 Conclusion

CHAPTER VIII  CONCLUSION AND FUTURE WORK
8.1 Introduction
8.2 Summary of the Research
8.3 Contributions
8.4 Future work

REFERENCES

APPENDIX
A  List of Arabic Rules and Their Equivalents in English
B  List of Corpus Sentences
C  List of Acronyms

LIST OF FIGURES

Figure No.
1.1 Objectives answer the research questions
2.1 Arabic diacritical marks
2.2 Example for Arabic text
2.3 Derivation of words from a three-letter root
2.4 Arabic sentence and the equivalent sentence in English
2.5 Transfer strategy (Hutchins and Somers 1992)
2.6 Direct machine translation approach
2.7 Interlingua MT with 4 languages
2.8 Knowledge-based machine translation approach
2.9 Classification of hybrid MT architectures (Marta et al. 2015)
2.10 The parser for Arabic sentence
3.1 Research design
3.2 Rules for phrases
4.1 Implementation process
4.2 Phrase structure tree
4.3 A simple syntactic transfer
4.4 A complex syntactic transfer
4.5 Word alignment
4.6 Word alignment between English and Arabic before morphological processing
4.7 Word alignment between Arabic and English
4.8 Score of the system with n-grams of lengths 1, 2, 3, and 4
5.1 Arabic word translated to many English words
5.2 Possibility of translation of each English word that the translation of "طانة"
5.3 BLEU score of EMM and RBMT with length
6.1 Tokenization process
6.2 Morphological analyser for Arabic sentence
6.3 Parse tree for Arabic sentence
6.4 The Arabic parse tree and its equivalent English
6.5 Architecture of HMT system
6.6 Rule based part of the hybrid system
6.7 EM algorithm part of the proposed hybrid approach
6.8 BLEU score of HMT, SMT and RBMT with phrase length 1-gram, 2-gram, 3-gram, and 4-gram
The relation of p with H(p), where 0