NEW TRENDS FOR BUILDING ARABIC LANGUAGE RESOURCES By Mohamed Abdelrahman Zahran Mohamed A Thesis Submitted to the Faculty of Engineering at Cairo University in Partial Fulfillment of the Requirements for the Degree of MASTER OF SCIENCE in Computer Engineering
FACULTY OF ENGINEERING, CAIRO UNIVERSITY GIZA, EGYPT 2015
NEW TRENDS FOR BUILDING ARABIC LANGUAGE RESOURCES By Mohamed Abdelrahman Zahran Mohamed
A Thesis Submitted to the Faculty of Engineering at Cairo University in Partial Fulfillment of the Requirements for the Degree of MASTER OF SCIENCE in Computer Engineering
Under the Supervision of

Prof. Dr. Amir Atyia
Professor, Computer Engineering Department, Faculty of Engineering, Cairo University

Prof. Dr. Mohsen Rashwan
Professor, Electronics and Communications Department, Faculty of Engineering, Cairo University
FACULTY OF ENGINEERING, CAIRO UNIVERSITY GIZA, EGYPT 2015
NEW TRENDS FOR BUILDING ARABIC LANGUAGE RESOURCES
By Mohamed Abdelrahman Zahran Mohamed
A Thesis Submitted to the Faculty of Engineering at Cairo University in Partial Fulfillment of the Requirements for the Degree of MASTER OF SCIENCE in Computer Engineering
Approved by the Examining Committee
____________________________
Prof. Dr. Ahmed Rafea, External Examiner

____________________________
Prof. Dr. Sameh Alansary, External Examiner

____________________________
Prof. Dr. Amir Atyia, Thesis Main Advisor

____________________________
Prof. Dr. Mohsen Rashwan, Advisor
FACULTY OF ENGINEERING, CAIRO UNIVERSITY GIZA, EGYPT 2015
Engineer's Name: Mohamed Abdelrahman Zahran Mohamed
Nationality: Egyptian
E-mail: [email protected]
Registration Date: 11/10/2011
Awarding Date: …./…./……..
Degree: Master of Science
Department: Computer Engineering
Supervisors:
Prof. Amir Atyia
Prof. Mohsen Rashwan

Examiners:
Prof. Ahmed Abdelwahed Rafea (External examiner)
Prof. Sameh Saad Alansary (External examiner)
Prof. Amir Sorial Atyia (Thesis main advisor)
Prof. Mohsen Abdelrazek Rashwan (Advisor)

Title of Thesis: New Trends for Building Arabic Language Resources.

Key Words: Natural Language Processing, Arabic Language Resources Integration, Text Similarity, Cross Lingual Lexical Substitution, Word Sense Disambiguation, Word Representations, Word Embeddings, Word Vectors, Word to Word Relations, Machine Learning, Neural Networks, Cosine Regression, Arabic-English Vector Space Mapping.
Summary: Language resources are an important factor in any Natural Language Processing application. However, language resource support for Arabic is not mature, because the existing Arabic language resources are scattered, inconsistent or even incomplete. To address this problem, we first automatically bootstrap a rich Arabic language resource leveraging the existing resources. Next, we build the largest statistical Arabic language resource, and then introduce a new technique to map this statistical Arabic resource to its English counterpart, outperforming standard techniques on this task. Finally, using the new statistical methods, we present a novel technique to link conventional Arabic language resources to English using cross lingual lexical substitution, outperforming the state-of-the-art system on this problem.
Acknowledgments

بسم الله الرحمن الرحيم

"سُبْحَانَكَ لَا عِلْمَ لَنَا إِلَّا مَا عَلَّمْتَنَا إِنَّكَ أَنتَ الْعَلِيمُ الْحَكِيمُ" (البقرة ٣٢)
"Glory to You (O Lord), we have no knowledge except what You have taught us. Indeed, it is You who is the Knowing, the Wise." (Al-Baqarah, 32)

"وَمَا تَوْفِيقِي إِلَّا بِاللَّهِ عَلَيْهِ تَوَكَّلْتُ وَإِلَيْهِ أُنِيبُ" (هود ٨٨)
"And my success is not but through Allah. Upon Him I have relied, and to Him I return." (Hud, 88)
I want to express my gratitude to Dr. Mohsen Rashwan and Dr. Amir Atyia for their guidance, learning opportunities and support. I would like to thank the Microsoft Advanced Technology Lab in Cairo for supporting this line of work with data and computing resources. I would also like to thank RDI for providing us with data and linguistic aid.
Dedication

I'd like to dedicate this thesis to my family and friends for supporting me during my work.
Table of Contents
LIST OF TABLES
LIST OF FIGURES
NOMENCLATURE
ABSTRACT
CHAPTER 1: INTRODUCTION
1.1. Brief Introduction on Language Resources
1.2. Why Arabic?
1.3. Thesis Objective
1.4. Thesis Organization
CHAPTER 2: BUILDING SEMI-AUTOMATED CONVENTIONAL ARABIC LANGUAGE RESOURCE (ARABASE)
2.1. Introduction
2.2. Related Work
2.3. Arabic Language Resources
2.3.1. King Abdulaziz City for Science and Technology (KACST)
2.3.2. Arramooz
2.3.3. Arabic WordNet (AWN)
2.3.4. RDI Lexical Semantic Database
2.3.5. RDI light lexicon
2.3.6. Alkhalil
2.3.7. Arabic stop words
2.4. Integration Methodology
2.4.1. Analysis
2.4.2. Design
2.4.3. Integration
2.4.4. Linking
2.5. Arabase Architecture and Description
2.5.1. Description
2.5.2. Problems with Integration
2.5.3. Proposed Solutions
2.5.3.1. Modified Lesk
2.5.3.2. Latent Semantic Analysis
2.6. Evaluation and Testing
2.6.1. The evaluation of the integrated resource (Arabase)
2.6.2. The evaluation of the linking algorithm
2.7. Limitations and Future Work
CHAPTER 3: WORD REPRESENTATION IN VECTOR SPACE AND THEIR APPLICATIONS FOR ARABIC
3.1. Introduction
3.2. Necessary Background
3.2.1. Language Modeling
3.2.1.1. Unigram Language Model
3.2.1.2. Bigram Language Model
3.2.2. One-hot Vector Representation
3.2.3. Softmax Function
3.2.4. Skip-grams
3.2.5. Log Linear Models
3.3. A Survey on Word Representations in Vector Space
3.3.1. Feed Forward Neural Network Language Model (FFNNLM)
3.3.2. Recurrent Neural Networks (RNN)
3.3.3. Log Linear Models (CBOW, SKIP-G) for Word Representation
3.3.4. GloVe: Global Vectors for Word Representation
3.3.5. Relation between GloVe and SKIP-G
3.4. Building Word Representation for Arabic
3.5. Vector Quality Assessment
3.5.1. Intrinsic Evaluation
3.5.2. Extrinsic Evaluation
3.5.2.1. Information Retrieval
3.5.2.2. Short Answer Grading
3.6. Arabic-English Vector Space Mapping
CHAPTER 4: CROSS LINGUAL LEXICAL SUBSTITUTION
4.1. Introduction
4.2. Cross Lingual Lexical Substitution Task
4.3. Related Work
4.4. System Description
4.4.1. Building word vector representation for the target language (Spanish)
4.4.2. Spanish Translation
4.4.3. Mapping Algorithm
4.5. Scoring
4.6. Results and Evaluation
4.7. Cross Lingual Lexical Substitution for Arabic
CHAPTER 5: CONCLUSION AND FUTURE WORK
REFERENCES
APPENDIX A
GLOSSARY
List of Tables
Table 1: A statistical comparison between the existing and the final integrated resources.
Table 2: Depth and breadth coverage for Arabase.
Table 3: Co-occurrence probabilities for the words "ice" and "steam" with selected contextual words from a 6 billion token corpus.
Table 4: Training configuration parameters used to build the Arabic models.
Table 5: A sample of Mikolov's semantic and syntactic analogy test for English and its translation to Arabic.
Table 6: Total accuracy of English models and Arabic models on the test set and its Arabic translation. The first column per model shows the percentage of test cases covered by this model. The second column shows how many of the covered test cases are correct.
Table 7: A comparison between the expansion lists for the word "Alsadr" in a query before and after disambiguation using the query context vector.
Table 8: Arabic vectors results in short answer grading using RMSE (the lower the better) and correlation (the higher the better).
Table 9: Translation examples using Cosine neural network.
Table 10: A comparison between optimizing for Cosine error versus mean square error using NDCG, Recall and Accuracy.
Table 11: Correlation between predicted grades and reference grades for Native Arabic, Translated English and Native English for short answer grading using word vectors.
Table 12: Two senses for the word "Bank" with contexts.
Table 13: The pairwise Cosine similarity between contextual words and the two senses "money" and "riverside".
Table 14: A sample of the English-Spanish CLLS dataset.
Table 15: Parameter values used by our systems.
Table 16: A sample of the Arabic-English CLLS dataset.
Table 17: Parameter values used by our systems for Arabic CLLS.
List of Figures

Figure 1: Graphical representation for KACST
Figure 2: Graphical representation for ARRAMOOZ
Figure 3: Graphical representation for AWN
Figure 4: Graphical representation for RDI-LSDB
Figure 5: Graphical representation for Arabase
Figure 6: The performance of Lesk vs. LSA on the development-set using the k-best measure.
Figure 7: The performance of Lesk vs. LSA on the test-set using the k-best measure.
Figure 8: Feed Forward Neural Language Model.
Figure 9: Another illustration for the Feed Forward Neural Language Model.
Figure 10: Mikolov's RNNLM.
Figure 11: CBOW model architecture; it predicts w(t) using its contextual words w(t-2), w(t-1), w(t+1), and w(t+2).
Figure 12: The skip-gram model architecture. The objective is using the pivot word w(t) to predict the context w(t-2), w(t-1), w(t+1), w(t+2).
Figure 13: Levels of precision for expansions using Arabic vectors, Mahgoub expansion using Wikipedia and other resources, and raw text matching on TREC 2002.
Figure 14: Recall-precision curve for expansions using Arabic vectors, Mahgoub expansion using Wikipedia and other resources, and raw text matching on TREC 2002.
Figure 15: Using translated vectors in short answer grading.
Figure 16: The 'best' scores for the systems participating in the CLLS, SemEval 2010 task 2.
Figure 17: The 'oot' scores for the systems participating in the CLLS, SemEval 2010 task 2.
Figure 18: Effect of changing MAXTRNS with/without gold translations on CBOW.
Figure 19: Effect of changing MAXTRNS with/without gold translations on GloVe.
Figure 20: Effect of changing MAXTRNS with/without gold translations on SKIP-GRAM.
Figure 21: Comparison between the "best" score of SKIP-G, GloVe, WordNet with/without using gold translations in Arabic CLLS.
Figure 22: Comparison between the "oot" score of SKIP-G, GloVe, WordNet with/without using gold translations in Arabic CLLS.
Figure 23: A generic neural network and the connections between two successive layers.
Nomenclature

| Abbreviation | Term or symbol |
| --- | --- |
| HLT | Human Language Technology |
| ML | Machine Learning |
| NLP | Natural Language Processing |
| CL | Computational Linguistics |
| LR | Language Resource |
| OCR | Optical Character Recognition |
| CLLS | Cross Lingual Lexical Substitution |
| MT | Machine Translation |
| WSD | Word Sense Disambiguation |
| RNN | Recurrent Neural Networks |
| NNLM | Neural Networks Language Model |
| RNNLM | Recurrent Neural Networks Language Model |
| FFNNLM | Feed Forward Neural Network Language Model |
| MSE | Mean Square Error |
| RMSE | Root Mean Square Error |
| NDCG | Normalized Discounted Cumulative Gain |
Abstract

Language resources are an important factor in any natural language processing application. The term "language resource" refers to any machine readable piece of information concerned with the language being processed; language resources are thus the fundamental building units of any natural language processing application. However, language resource support for Arabic is not mature, as evidenced by the inconsistency and incompleteness of the existing Arabic language resources, mainly due to the lack of time and money invested in human expert linguists to build a well renowned Arabic resource. To alleviate this problem, we propose to decrease the dependency on human effort in building Arabic language resources by introducing semi- and fully automatic techniques to build these resources. We discuss the notion of an integrated Arabic resource leveraging various pre-existing ones: we compare these resources and then present new preliminary semi-automatic methods to integrate them with at least 80% recall. This work serves as bootstrapping for a rich Arabic resource with good potential to map to WordNet (a famous English language resource). Next, using machine learning techniques, we automatically build a large scale statistical Arabic language resource spanning 6.5 million words (unigrams and bigrams) by giving each word an adequate representation in vector space that captures semantic and syntactic properties of the language. We compare different techniques for building vectorized representation models for Arabic, and test these models via intrinsic and extrinsic evaluations. Intrinsic evaluation assesses the quality of the models using benchmark semantic and syntactic datasets, showing the competency of our models against the English models. Extrinsic evaluation assesses the quality of the models by their impact on two natural language processing applications: information retrieval and short answer grading. For information retrieval, the Arabic vectors enhanced the retrieval process slightly better than conventional semantic expansion techniques, with a 1.3% precision gain on average, while for short answer grading, the Arabic vectors made it possible to grade short answers for an Arabic dataset without the need for Arabic-English translation, achieving 82% correlation with the reference grades. Then, we present a novel technique to map the Arabic vector space to its English counterpart using a Cosine error regression neural network and show that it outperforms standard mean square error regression neural networks on this task with more than 0.7% gain on average. Finally, using vectorized word representation models, we present a novel technique to link conventional Arabic language resources to English by solving the cross lingual lexical substitution problem. We compare our techniques with different systems on a benchmark dataset using two metrics and show that our techniques outperform the state of the art on one metric with a 22.78% gain, while keeping reasonable performance on the second metric with only a 1.1% loss with respect to the state-of-the-art system.
Chapter 1: Introduction

1.1. Brief Introduction on Language Resources

Human Language Technology (HLT) is any computerized technology dealing with the processing of human languages. It is the big umbrella that incorporates Natural Language Processing (NLP), Computational Linguistics (CL) and speech processing. NLP and CL focus on how to enable computers to process human languages. The level and depth of processing required varies with the numerous applications of NLP, as in information retrieval, machine translation, spelling correction, text clustering/classification, spam filtering, named entity recognition, part of speech tagging, parsing, word sense disambiguation, cross lingual lexical substitution, sentiment analysis, opinion mining, key phrase extraction, acronym expansion, plagiarism detection, author identification and many others. All these applications use, in one way or another, language resources to work correctly. The term "language resource" refers to any machine readable piece of information, in any format, providing information about the language being processed. For example, a language resource for a certain language can provide the following information for a given word: different senses with definitions and examples for each sense, different word relations such as synonyms and antonyms, the word's morphological analysis, semantic information, links to other languages such as English, use in a monolingual corpus and use in a parallel corpus. Thus, language resources provide a lot of vital information about the language being processed, and this information plays a critical role in numerous NLP applications as mentioned. The most notable work done to build computerized language resources is the English WordNet [16], built at Princeton University. It is a large lexical database where nouns, verbs, adjectives and adverbs are clustered together; within each cluster, words are grouped into homogeneous cognitive sets called "synsets". Each synset describes and expresses a certain concept, and words of the same synset can replace each other in their contexts. Synsets are linked to each other via links that represent lexical or semantic relations between synsets. Words in the same synset, together with the inter-synset links, form a network of words which altogether captures important features and properties of the English language. WordNet became the default language resource for all English NLP applications, to the point that it got support from NLP development communities such as the Python NLTK toolkit (http://www.nltk.org/), which provides a clear WordNet API for programming. WordNet also became the measure against which other language resources are compared. The main reason for the attention given to WordNet is its depth and precision: it is a fine-grained resource that cares for any small variation in a word's semantic or syntactic features, which explains its depth, while its precision can be explained by the huge amount of human effort exerted by linguists to develop it. Many other languages tried to follow WordNet's footsteps, such as EuroWordNet [19], which is a framework that aims to unite all European languages (such as Italian, Dutch,
German, Spanish, Czech …); each language has its own WordNet constructed in the same spirit as the English WordNet. The addition is that these WordNets are linked to an inter-lingual index based on the English WordNet. Thanks to this index, the languages are interconnected, such that it is possible to start from words in one language and move to their equivalent or similar words in other languages. Another form of language resource is the ontology. An ontology is a study of the entities of a universe; it involves formal definitions of types, features of entities and relationships between them, and can be regarded as a philosophical taxonomy. This kind of resource is used mainly in semantics-aware NLP applications. The interlinks in WordNet can be regarded as a lightweight ontology because they form a taxonomy through the links between synsets. One example of an ontology is the Suggested Upper Merged Ontology (SUMO) [15]. Upper ontologies (also known as top level ontologies) are concerned with describing very abstract and general concepts. Upper ontologies must be very broad and general so that other, more specific ontologies can be built on top of them, which is why they are called "upper" ontologies. Thus, SUMO is concerned with meta-level concepts that do not fall under the categorization of a specific domain. A big corpus of raw text also counts as a language resource. Such a corpus provides useful statistics about the language, such as the frequency of words and the frequency of word co-occurrences, and it provides reasonable means to learn the structures and rules of the language by building a language model. A parallel corpus is a type of language resource as well; by parallel corpus we mean a list of sentences in one language and their translations into another language. Parallel corpora can be used to build machine translation systems and to build an alignment model that maps words from one language to another.
1.2. Why Arabic?

Arabic is one of the most important global languages. Spoken by more than 300 million speakers, it is considered an opportunity to be seized by many companies seeking to expand their market share. Other factors that may direct research work towards Arabic are the challenges imposed by its complex nature as a language; however, our main motive to focus on Arabic as the language of concern is a sense of responsibility towards Arabic and a commitment to establish good foundations for future Arabic NLP research and development. Many papers and books have discussed the challenges faced when working with Arabic from an NLP perspective; most of the problems discussed focus on the characteristics of Arabic that make it particularly challenging:

Orthography: By orthography we mean the methodology of writing the language, which includes the alphabet, spelling and punctuation. Arabic is special because it is written from right to left, its characters have allographic variants (a character can take many written forms depending on its position in a word), there is no concept of capitalization and the use of diacritics is optional. These properties impose challenges on applications such as Arabic Optical Character Recognition (OCR) and Word Sense Disambiguation (WSD).

Morphology: Morphology concerns the bits and pieces that form a word, such as prefixes and suffixes. Arabic has a complex morphological nature; an Arabic word can simply be a sentence on its own, e.g. "زوجناكها" (translated as: we made you marry her). These
properties impose challenges on applications such as part of speech (POS) taggers, tokenizers and parsers.

Dialects: Arabic has many dialects spoken in different regions, such as Egyptian Arabic, Levantine Arabic, Gulf Arabic, North African Arabic, Iraqi, Yemenite, Sudanese and Maltese. Almost no native speaker of Arabic can naturally sustain spontaneous use of Modern Standard Arabic (MSA), which adds to the challenges facing Arabic NLP applications.
The language properties of Arabic listed above pose challenges to the Arabic NLP community, which aims to build genuine tools and applications for Arabic rather than relying on just using and adapting English tools. However, in this thesis we adopt a different perspective on the challenges of Arabic, because our goal is not to build an Arabic NLP tool but rather to build the fundamental resources used in building those tools, and hence we face different types of challenges in dealing with Arabic as the language of concern. Most of the challenges we face in our work to build language resources for Arabic stem from the fact that investment in Arabic NLP in general is limited. Most of the existing Arabic language resources are limited, inconsistent and incomplete; the reason behind that is the lack of investment in terms of time and money, and not entirely the challenges imposed by the complex nature of Arabic as a language.
1.3. Thesis Objective

This thesis focuses on how to build a language resource with Arabic as the language of concern, and it considers building language resources from different angles, views and ideologies. For example, WordNet-like ideologies hold that language resources should be built with extensive care; this requires countless hours of human effort in order to achieve a very reliable, precise and accurate resource. This is a sound argument, because the job will be done correctly once and can be used forever with little additional effort for maintenance and versioning. However, such an investment requires human experts, time and money, and these factors can be hard to secure. These factors also explain why the existing Arabic language resources are not as mature as the English ones: such investments do not exist for Arabic to build a world-class resource like the English WordNet, which leaves the existing Arabic resources inconsistent, scattered or even incomplete. The second ideology is to automate the process of building the language resource, which means relying less on human effort, given the absence of investment, and more and more on machines. This can be classified as a hybrid methodology that minimizes the human interaction and effort needed to build a language resource; we adopt this type of thinking in chapter 2. The basic idea is to develop a framework that uses pre-existing Arabic resources, picks up all the pieces of information scattered across them and automatically compiles them into one single resource in a database format, which makes it much easier for human linguists to revise. The third ideology, discussed in chapters 3 and 4, is to rely completely on machines to build the language resource without any human interaction; this is the main focus of the thesis. We will discuss the techniques and theory behind this line of work and provide extensive evaluations and testing beds for the models we built for Arabic, which can be regarded as the world's first and largest statistical Arabic resource available online for future research. The main idea is to collect a massive amount
of Arabic running text, and then use machine learning techniques that scan over the corpus to learn a higher representation for words as vectors in a multidimensional space. The constructed space captures semantic and syntactic properties of the words, so we can use the vectorized representations of the words to find word-to-word relations, and we can even generalize the model to represent whole sentences, not just single words. In a nutshell, this thesis focuses on building an Arabic language resource using two methods: semi-automated and fully automated. We aim to change the conventional methods used before in building Arabic language resources. Finally, we test our new language resources using intrinsic and extrinsic evaluations.
1.4. Thesis Organization

This thesis is organized as follows. The rest of the thesis consists of four chapters, each of which is self-contained: each chapter discusses a certain problem, surveys its related work, presents our proposed solution to the problem and finally shows our results. Chapter 2 discusses the conventional Arabic language resources and presents our semi-automated method to integrate these resources into a single database. Chapter 3 presents the idea of fully automating the process of building a language resource, using machine learning techniques to build vectorized representations of words in a multidimensional space, and shows how to apply these techniques to Arabic. Chapter 4 shows how to link the conventional Arabic resource with the English one using new techniques that solve the problem of cross lingual lexical substitution. Finally, we conclude the work in chapter 5.
Chapter 2: Building Semi-Automated Conventional Arabic Language Resource (Arabase)

This chapter discusses conventional methods for building Arabic language resources; the work presented here is published in [1]. We propose techniques to combine different resources together, using fully and semi-automated methods, into a single database (Arabase). This chapter is organized as follows: first we introduce the problem (section 2.1) and discuss related work (section 2.2), then we present a comparison of different Arabic resources (section 2.3). Next we present our proposed integration methodology (section 2.4) and the database architecture and description (section 2.5). Finally, we present the evaluation and testing (section 2.6) and discuss the limitations and future work (section 2.7).
2.1. Introduction

Language resources have a great impact on the quality of any NLP application. The term "language resource" refers to any machine readable piece of information, in any format, providing information about the language being processed. For example, a language resource for a certain language can provide the following information for a given word: different senses with definitions and examples for each sense, different word relations like synonyms and antonyms, the word's morphological analysis and semantic information. This information is important for many NLP applications such as word sense disambiguation, text similarity, semantic search, text mining and opinion mining, among many others; Fassieh (Attia et al., 2009) [7] is one example of these applications. A lot of work has been done in the field of language resources for many languages like English, French, German and many other European languages, but very little has been done for Arabic. Moreover, these few Arabic language resources are limited and not fully developed. Yaseen et al. (2006) [20] conducted a review of Arabic language resources. A good language resource can be built manually by expert linguists, but such a task takes a long time and a great deal of human labor. In this chapter we examine various Arabic language resources, compare them and apply an algorithm to integrate the information scattered across these different resources into one compact database, using a technique configurable between fully and semi-automated modes and showing the trade-off between them in the integration.
2.2. Related Work

Several attempts have been recorded to enrich Arabic language resources. Elkateb et al. (2006) [12] reported efforts to build a WordNet for Arabic, following the methods developed for EuroWordNet (Vossen, 1998) [19]. Concepts from WordNet (Princeton University, "About WordNet.", 2010) [16], the EuroWordNet languages and BalkaNet (Tufis, 2004) [18] are used as synsets in Arabic WordNet, and some Arabic-language-specific
concepts are translated and added; then equivalent English entries according to the SUMO ontology (Niles & Pease, 2001) [15] are translated and added as well. Finally, a bi-directional propagation is performed from English to Arabic and vice versa to generate synsets. Most of this work was done manually, which decreased the coverage and depth of the resulting resource. Another attempt bootstrapped an Arabic WordNet using WordNet and parallel corpora (Diab, 2004) [10], exploiting the fact that a polysemous word in one language will have a number of translations in another language; these translations can be clustered based on word sense proximity using WordNet. Diekema (2004) [11] attempted to build an English-Arabic semantic resource that can be used in Cross Lingual Information Retrieval (CLIR), using WordNet and various bilingual resources. Attia et al. (2008) [8] built a rich Arabic lexical semantic database based on the theory of semantic fields, using various Arabic resources.
2.3. Arabic Language Resources

We started our work by searching for different Arabic language resources and exploring what information they provide. First, we present the nature of the information provided by these resources, and then we present a comparison between them. Below is a list of the information provided by these Arabic language resources:

- Morphological information (Im): the word's morphological analysis, such as root, type, gender and number.
- Sense information (Ise): the different word senses, with gloss, definitions and examples for each sense.
- Synset information (Isy) (adapted from WordNet terminology): different relations between two sets of words, e.g. synonyms and antonyms.
- Semantic information (Ism): the semantic field the word belongs to, together with different semantic relations between semantic field pairs.
- Interfacing with WordNet (Iwn): Arabic words are linked to their equivalent English words in WordNet.

Each word can have one or more of these information components, in what we call the 'information vector' < Im, Ise, Isy, Ism, Iwn >. Next, we present the resources we found, and we give a tabular comparison between them in Table 1. The main entry of all resources is the unvocalized word, which can have more than one vocalized form. Vocalized forms are classified by their part of speech (POS) into nouns, verbs and particles. Each vocalized form can have its own morphological information and can also have more than one sense. Each sense can have its own sense information, synset information, semantic information and a WordNet interface, if applicable.
2.3.1. King Abdulaziz City for Science and Technology (KACST)
It is an Arabic-Arabic resource in a MYSQL database format (almuajam, 2011) [3]. It provides sense and morphological information. It has very limited and incomplete semantic information (Figure 1).
Figure 1: Graphical representation for KACST
2.3.2. Arramooz
It is an Arabic-Arabic resource available in different formats (SQL, XML and raw text) (Arramooz AlWaseet, n.d.) [6]. It provides morphological and sense information (Figure 2).
Figure 2: Graphical representation for ARRAMOOZ
2.3.3. Arabic WordNet (AWN)

It is an Arabic-English resource in Derby database format (Arabic WordNet, 2007) [5]. Its main entries are both English and Arabic words. It provides Arabic-Arabic information by mapping and translating the pre-existing English-English information obtained from WordNet, thus interfacing with it. It also provides synset and sense information (Figure 3).
Figure 3: Graphical representation for AWN
2.3.4. RDI Lexical Semantic Database

It is an Arabic-Arabic resource by RDI, available in Access format (Attia et al., 2008) [8]. It provides semantic information and some sense and morphological information (Figure 4).
Figure 4: Graphical representation for RDI-LSDB
2.3.5. RDI light lexicon
This resource is limited. It provides some morphological information and some sense information.
2.3.6. Alkhalil
It is an Arabic morphological analyser used to add some morphological information when needed in the integration (alkhalil dot net, 2011) [2].
2.3.7. Arabic stop words

It provides Arabic stop words with morphological inflections using a combination of possible prefixes and suffixes (Arabic Stop Words, 2010) [4]. It has 10,389 unique entries.
Figure 5: Graphical representation for Arabase
2.4. Integration Methodology

The main goal of the integration is to allocate an information vector from all available resources for each distinct vocalized word. The integration process is done in four main steps: Analysis, Design, Integration and Linking.
2.4.1. Analysis

We carefully analyse each resource to detect its potential information components, then we transform all the resources into one format to ease the integration process; we transformed all the resources into MySQL database format.
2.4.2. Design

We put a design for the target integrated database given the resources we have. The scalable design of the KACST database was the starting point of our integrated database; we then modified it to match the target database design shown in Figure 5. Table 1 shows a statistical comparison between the existing resources and the target database; only distinct, non-floating entries are counted (floating entries are word entries with no information components).
2.4.3. Integration

We design and apply an algorithm to automatically compile these resources together. The following is an overview of the integration algorithm (a code sketch is given after this outline):

For each vocalized word w in resource r:
    - Get the POS of w if provided by r, otherwise use Alkhalil to get the POS of w.
    - Add w to the table corresponding to its POS only if w was not added before, otherwise use the existing entry.
    - For each information component i in r for w:
        - Add i to w.
For each vocalized word w:
    - Cluster the information components i of w by word sense.

Here 'i' refers to an information component.
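A minimal Python sketch of this integration loop is given below. It is illustrative only: the shape of the resource iterator and the Alkhalil wrapper get_pos_with_alkhalil are hypothetical placeholders, not the actual Arabase implementation or schema.

```python
from collections import defaultdict

def integrate(resources, get_pos_with_alkhalil):
    """Compile information components from several resources into POS tables.

    resources: iterable of (resource_name, entries), where entries yields
    (vocalized_word, pos_or_None, info_components) and info_components is a
    list of (component_type, value) pairs, e.g. ("Ise", gloss). These names
    and shapes are assumptions made for the sketch.
    """
    # tables keyed by POS ("noun", "verb", "particle", "not_classified");
    # each maps a vocalized word to its accumulated information components
    tables = defaultdict(dict)
    for resource_name, entries in resources:
        for word, pos, info_components in entries:
            # use the resource's POS if available, otherwise ask Alkhalil
            pos = pos or get_pos_with_alkhalil(word) or "not_classified"
            entry = tables[pos].setdefault(word, defaultdict(list))
            for component_type, value in info_components:
                entry[component_type].append((resource_name, value))
    # the later linking phase clusters each entry's components by word sense
    # using the text similarity techniques of section 2.5.3
    return tables
```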
2.4.4. Linking

It is the information linking phase in which we cluster the information content for each unvocalized word by its senses.
Table 1: A statistical comparison between the existing and the final integrated resources.

| | KACST | Arramooz | AWN | RDI-LSDB | RDI-Lite | Integrated Resource (Arabase) |
| --- | --- | --- | --- | --- | --- | --- |
| #Nouns | 42,732 | 27,272 | 11,564 | 15,095 | N/A | 81,977 |
| #Verbs | 18,054 | 9,866 | 3,226 | 8,254 | N/A | 35,400 |
| #Particles and stop-words | 171 | N/A | N/A | N/A | N/A | 10,389 |
| #Not classified | N/A | N/A | 1,290 | 27,883 | 7,354 | 30,651 |
| #Total entries | 60,957 | 37,138 | 16,080 | 51,232 | 7,354 | 158,417 |
| #Meaning definitions | 80,512 | 19,053 | N/A | 33,394 | 13,938 | 146,897 |
| #Examples | 5,232 | N/A | N/A | N/A | 2,554 | 13,388 |
| #Semantic fields | 247 | N/A | N/A | 5,039 | N/A | 5,187 |
| #Semantic field relations | N/A | N/A | N/A | 292,910 | N/A | 292,799 |
| #Semantic relations types | N/A | N/A | N/A | 19 | N/A | 19 |
| #Synsets relations types | N/A | N/A | 22 | N/A | N/A | 22 |
| #Synset relations | N/A | N/A | 145,655 | N/A | N/A | 145,655 |
| Information components | Im, Ise, Ism | Im, Ise | Ise, Isy, Iwn | Im, Ise, Ism | Im, Ise | Im, Ise, Isy, Ism, Iwn |
2.5. Arabase Architecture and Description

2.5.1. Description

The main entry in the integrated resource is the unvocalized word, which can have more than one vocalized form. These vocalized forms can be nouns, verbs, particles or unclassified. For example, consider the unvocalized word (w) عين (pronounced Aien); it has more than one vocalized form, e.g. (w1) عَيْن and (w2) عَيِّن. Each of these vocalized forms can have more than one sense. For example, w1 has these two senses: Sense1 عضو الإبصار للإنسان وغيره من الحيوان (translated as Eye) and Sense2 الجاسوس (translated as Spy). Each sense can have a meaning or a definition, and more than one example to illustrate its usage. Each of these senses can have more than one synonym; e.g. Sense2 can have the two synonyms عميل and مراقب. These synonyms form a synset, and we choose one of the synonyms and promote it to be the synset head. We should note that synonyms themselves are in fact vocalized words. Each synset can have a semantic field. For example, synset1 {اسود، داكن، حالك} (synonyms for the word dark) can have the word لون (translated as color) as its semantic field, and synset2 {ابيض، فاتح، ناصع} (synonyms for the word bright) can also have لون as its semantic field.
Synsets can have relations with each other, e.g. hyponymy, antonymy and synonymy; for example, synset1 and synset2 above are antonyms. Semantic fields can have relationships with one another as well. Entries in the integrated resource are divided among four main tables: Nouns, Verbs, Particles and Not classified. The table 'Not classified' is for words whose POS we could not specify during the integration process, because this information is missing from the resource or Alkhalil failed to analyze the word. Our goal is to compile as many information vector components as possible for each vocalized word w. These components will possibly be compiled from different resources (r1, r2, …), so that the final vector could be w: < r1(Im), r2(Ise), … >, where rx(Iy) is the piece of information y allocated from resource x.
2.5.2. Problems with Integration

Let w1 and w2 be two different senses from two different resources for the same word w. Since different senses of the same word can have the same vocalized form of the word, w1 and w2 can have the same vocalized form. Therefore, we cannot rely solely on the vocalized word form to distinguish between different word senses, which means that w1 and w2 could be the same sense or two different senses. This confusion is a problem in the integration. The following is an example of that confusion, for the vocalized word w عَيْن:

Resource r1 provides w1: عَيْن with Ise: عضو الإبصار للإنسان (definition for the word Eye) and Im: مفرد، مؤنث، اسم (some morphological information).
Resource r2 provides w2: عَيْن with Ise: الجاسوس (translated as Spy) and Ism: الرقيب (semantic field).
Resource r3 provides w3: عَيْن with Ise: من جوارح الإنسان الخاصة بالبصر (definition for the word Eye) and Ism: عضو من الأعضاء (semantic field).
Resource r4 provides w4: عَيْن with Isy: عميل، مراقب (synonyms for the word Spy) and Iwn: Spy (corresponding WordNet entry).

We notice that w1, w2, w3 and w4 have the same vocalized form, but w1 & w3 have the same sense and w2 & w4 have the same sense. Our goal is to collect two information vectors, one for each sense: the first for the sense (Eye), < r1(Im), r1(Ise), r3(Ise), r3(Ism) >, and the second for the sense (Spy), < r2(Ise), r2(Ism), r4(Isy), r4(Iwn) >. In order to do so, we have to search the available information for clues showing that the information from r1 & r3 and from r2 & r4 belong together. At the integration phase we assume that each information vector I coming from a resource r represents a distinct sense, i.e. r1(I), r2(I), r3(I) and r4(I) are totally different and each represents a different sense (i.e. a different synset) for the word w. Finally, at the linking phase we analyse this information and link the related information together.
2.5.3. Proposed Solutions

We discuss some heuristics we used in the linking phase. The first: if two information components belonging to two different resources are linked together, then all the information content of these resources is linked together as well. The second: all words in the same synset, i.e. synonyms, share all the links established by all the synset members (except for Im). Generally, the sense information (Ise) consists of definitions, which means we can use text similarity algorithms to decide whether two given definitions are similar or not, and then, using the heuristics discussed above, link all the information content of their resources. Looking closely at the previous example, we find that we need to link r1 & r3 and r2 & r4. We can link r1(Ise) with r3(Ise) using text similarity algorithms, and then decide that r1 & r3 are linked together to the same sense.

Generally, text similarity algorithms give a similarity score for two given texts, but they do not decide whether the two texts are similar or not. To make that decision using similarity scores we have the following three alternatives. The first is to use the similarity score when querying the integrated database: we can show the links with their scores, or show only the links whose confidence exceeds a certain value. It turned out that this method can cause problems when using the integrated database in different NLP applications. The second is to find a threshold such that if the similarity score between two texts is greater than this threshold we label the two texts as similar, and as not similar otherwise. The main drawback of this method is that false positives (labelling two texts as similar when in fact they are not) are dangerous, because the text similarity decision is a clue used to link all the information of the two resources the texts come from, which can cause serious confusion. The third is to do configurable semi-automatic linking. There is no doubt that the linking task and the confusion problem can be solved using human labor, but this is a very time consuming task. We propose to use text similarity techniques to reduce the time taken by humans to do this task as follows. For two given resources r1 and r2 having the vocalized word w in common, we retrieve all the senses for w per resource, r1 {sense1, sense2, …} and r2 {sense1, sense2, …}, then:

For each sense s in r1:
    For each sense s' in r2:
        s.calculateSimilarityScore(s')
    s.sortScoresDescendingly()

This way each sense has a list of senses sorted by similarity score. When one sense is chosen, all the similar senses will appear sorted by similarity score, so there is no need to scan all the senses, and most likely a match will be found within the first k best senses. The more accurate the text similarity algorithm, the smaller k will be before a match is found. We compared two text similarity algorithms in terms of this k-best measure: Modified Lesk and Latent Semantic Analysis.

2.5.3.1. Modified Lesk

The Lesk [14] method is used in word sense disambiguation problems. We modified it to do text similarity, scoring two texts by counting the number of common words between them. The intuition behind this is that two texts expressing the same meaning
usually use similar words. We modified the behaviour of the algorithm by introducing some parameters to be tuned on a development set. Some of these parameters are Boolean (yes/no) and others take real values. The Boolean parameters are: removing diacritization, removing stop words, stemming the words and using edit distance. The real-valued parameter is the stop-word-to-stop-word weight.

2.5.3.2. Latent Semantic Analysis

LSA (Deerwester et al., 1990) [9] uses a corpus to build a matrix whose rows represent unique words and whose columns represent paragraphs (the paragraphs in our problem are the words' definitions). Singular value decomposition (SVD) is then used to reduce the number of columns while preserving the similarity structure among rows. Texts to be compared are projected onto this space, and the similarity is calculated as the cosine of the angle between the two vectors. The intuition behind using a semantic similarity algorithm like LSA is that texts expressing the same meaning are usually semantically similar. The number of dimensions used by LSA is tuned on a development set. The following is a simple example. Assume we have five documents d1, d2, d3, d4, d5 and eight words occurring in them: "romeo", "juliet", "happy", "dagger", "live", "die", "free" and "new-hampshire". First we build an occurrence matrix whose rows are the words and whose columns are the documents; then, using SVD, we decompose the matrix into its k principal components (in this example k = 2), giving three matrices S2, Σ2 and U2 (the subscript 2 refers to k). The rows of S2 × Σ2 give the word vectors, while the columns of Σ2 × U2 give the document vectors. A new document is indexed (converted into vector form) by averaging the vectors of its words.
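To make the k-best ranking procedure above concrete, the following is a minimal Python sketch that scores pairs of sense definitions with a simple word-overlap (Lesk-style) measure and sorts the candidates for each sense. It is an illustrative assumption-laden example: the real system also applies the tunable parameters listed above (diacritic removal, stemming, edit distance, stop-word weighting), which are omitted here, and the function names are hypothetical.

```python
def lesk_overlap(definition_a, definition_b):
    """Simplified Lesk-style similarity: number of shared tokens between two glosses."""
    tokens_a, tokens_b = set(definition_a.split()), set(definition_b.split())
    return len(tokens_a & tokens_b)

def rank_candidate_senses(senses_r1, senses_r2, similarity=lesk_overlap):
    """For every sense definition of a word in resource r1, return the candidate
    senses from r2 sorted by descending similarity, so that a human linker only
    needs to inspect the top-k candidates."""
    ranking = {}
    for s1 in senses_r1:
        scored = [(similarity(s1, s2), s2) for s2 in senses_r2]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        ranking[s1] = scored
    return ranking
```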
2.6. Evaluation and Testing

Two types of evaluation were performed. The first is the depth and breadth coverage evaluation of the integrated resource itself: breadth is the number of entries found in the database, while depth is the information content of each entry. The second is the evaluation of the linking algorithm.
2.6.1. The evaluation of the integrated resource (Arabase)

In order to evaluate the database in both depth and breadth coverage, we used a random sample of running Arabic text collected from Wikipedia on different topics. For each word in the running text we retrieved all the possible word forms from Arabase (referred to as hits); this represents the breadth coverage. For each hit we retrieved all its
possible information content (Im, Ise, Isy, Ism, Iwn); this represents the depth coverage (Table 2).

Table 2: Depth and breadth coverage for Arabase.

| Measure | Arabase |
| --- | --- |
| #words | 2059 |
| #hits | 6997 |
| #missed | 10 |
| #stopwords | 450 |
| Total Im | 2546 |
| Total Ise | 6055 |
| Total Ism | 4064 |
| Total Isy | 871 |
| Total Iwn | 871 |
| Avg. Im per hit | 0.3639 |
| Avg. Ise per hit | 0.8654 |
| Avg. Ism per hit | 0.8653 |
| Avg. Isy per hit | 0.1245 |
| Avg. Iwn per hit | 0.1245 |
2.6.2. The evaluation of the linking algorithm

We examined some words and manually linked information components together based on Ise, resulting in 184 links. We used 134 of them as a development set to tune the parameters of both Lesk and LSA, and the remaining 50 for testing. Figure 6 shows the performance of Lesk vs. LSA on the development set using the k-best measure; the vertical axis is the percentage of correct matches with a rank less than or equal to k. Figure 7 shows the performance of Lesk vs. LSA on the test set using the k-best measure. After tuning both Lesk and LSA, these are the parameters that gave the best performance on the development set:

Lesk: removing diacritization and edit distance: True; removing stop words and stemming: False; stop-word-to-stop-word weight = 0.01.
LSA: we used the gensim API (Rehurek, R., 2010) [17] on the collected glosses of Arabase; the tuned number of dimensions = 100.
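For illustration, a minimal sketch of how such an LSA similarity model can be built with the gensim API is shown below. The gloss list and the 100-dimension setting mirror the configuration above, while the function names and the whitespace tokenization are simplifying assumptions, not the thesis code.

```python
from gensim import corpora, models, similarities

def build_lsa_index(glosses, num_dimensions=100):
    """glosses: list of Arabic definition strings collected from Arabase."""
    texts = [g.split() for g in glosses]                 # naive whitespace tokenization
    dictionary = corpora.Dictionary(texts)
    bow_corpus = [dictionary.doc2bow(t) for t in texts]
    lsa = models.LsiModel(bow_corpus, id2word=dictionary, num_topics=num_dimensions)
    index = similarities.MatrixSimilarity(lsa[bow_corpus])
    return dictionary, lsa, index

def most_similar(query_gloss, dictionary, lsa, index, topn=10):
    """Project a new gloss into the LSA space and return its cosine similarity
    against all indexed glosses, sorted descending (the k-best candidates)."""
    query_vec = lsa[dictionary.doc2bow(query_gloss.split())]
    sims = sorted(enumerate(index[query_vec]), key=lambda pair: -pair[1])
    return sims[:topn]
```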
Figure 6: The performance of Lesk vs. LSA on the development-set using the k-best measure (recall % on the vertical axis against @k = 1..10, for LESK and LSA).
Figure 7: The performance of Lesk vs. LSA on the test-set using the k-best measure (recall % on the vertical axis against @k = 1..10, for LESK and LSA).
2.7. Limitations and Future Work

Since the integration work here is automated with adjustable human interaction, the algorithm is liable to some errors that can be fixed manually. Below are some of the limitations involved in our approach:

- Classifying words by POS is liable to errors when using Alkhalil. If Alkhalil fails to get the POS, the word is given the label "Not classified".
- N-gram entries (entries with more than one token) are classified according to the POS information given in the resource. If no such information is found, we classify the entry according to its first token; if that fails, we label it "Not classified".
- Poorly diacritized words can confuse the morphological analyzer, which ends up with more than one morphological analysis for the same word. In these cases the first analysis is taken, which can be an erroneous approach but can be fixed manually. We can also choose not to provide morphological information for such words.
- The linking algorithm is limited to linking resources based on the sense information (Ise). If a resource does not have this information component and its synset has no other words, then it will not be linked by our linking algorithm.
- We can enrich Arabase by linking it with WordNet, such that each Arabic sense is linked with its corresponding English sense in WordNet. Currently the only interface is the integrated entries from Arabic WordNet.
Chapter 3: Word Representation in Vector Space and their Applications for Arabic

The work presented in this chapter is published in [23].
3.1. Introduction

In this chapter we discuss the idea of automatically building a language resource for Arabic, without any human interaction, using machine learning techniques. This chapter is organized as follows. First we discuss the machine learning techniques used to build such Arabic language resources; these techniques are based on word representation in vector space, giving the individual words of a language adequate representations in vector space so that these representations capture semantic and syntactic properties of the language (section 3.3). Then we show how to use these techniques to build vectorized representation models for Arabic (section 3.4). These models serve as fundamental building units that can take part in numerous Arabic NLP applications, which means that these vectorized representations on their own constitute a language resource. Next we test these models via intrinsic and extrinsic evaluations. Intrinsic evaluation assesses the quality of the models using benchmark semantic and syntactic datasets, while extrinsic evaluation assesses the quality of the models by their impact on two natural language processing applications: information retrieval and short answer grading (section 3.5). Finally, we map the Arabic vector space to its English counterpart using a Cosine error regression neural network and show that it outperforms standard mean square error regression neural networks on this task (section 3.6).
3.2. Necessary Background
3.2.1. Language Modeling
A language model is a mathematical model that assigns a probability to a sentence in a certain language. A sentence with a high probability obeys the language's structure and requirements. The most common language modeling technique is n-gram language modeling. Let a sentence s be formed of a sequence of words w_1, w_2, w_3, ..., w_n. The general form of the n-gram language model is:

P(s) = P(w_1, w_2, w_3, \dots, w_n) = P(w_1) P(w_2 \mid w_1) P(w_3 \mid w_1, w_2) \dots P(w_n \mid w_1 \dots w_{n-1}) = \prod_i P(w_i \mid w_1 \dots w_{i-1})

where the conditional probabilities are estimated from counts, for example:

P(w_2 \mid w_1) = \frac{count(w_1, w_2)}{count(w_1)}
3 This work is published in [23].
To simplify the general form, the Markovian assumption is applied:

P(w_i \mid w_1, \dots, w_{i-1}) \approx P(w_i \mid w_{i-k}, \dots, w_{i-1})

3.2.1.1. Unigram Language Model
It assigns a probability to a sentence by observing its sequence of words one at a time, assuming total independence between these words:

P(w_1, w_2, w_3, \dots, w_n) = \prod_i P(w_i)

For example: P(the water is clear) = P(the) P(water) P(is) P(clear)

3.2.1.2. Bigram Language Model
It assigns a probability to a sentence by observing its sequence of words two at a time, assuming that each word depends only on the preceding word:

P(w_1, w_2, w_3, \dots, w_n) = \prod_i P(w_i \mid w_{i-1})

For example: P(the water is clear) = P(the) P(water|the) P(is|water) P(clear|is)
3.2.2. One-Hot Vector Representation
This is a very simple vector representation for a word. Given a corpus, a dictionary (often called the vocabulary) of unique words is constructed. For an input word w, its one-hot vector is a zero vector whose length equals the size of the constructed dictionary, with a single one-valued entry at the index of w in the dictionary. For example, consider the following simple corpus:

I am Sam
Sam I am
I love eggs and ham

Constructed dictionary (word : index): I : 0, am : 1, Sam : 2, love : 3, eggs : 4, and : 5, ham : 6

The one-hot vectors for some words will be:
I:    1000000
eggs: 0000100
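A minimal sketch of constructing such one-hot vectors for this toy corpus (the corpus and variable names are illustrative):

```python
corpus = ["I am Sam", "Sam I am", "I love eggs and ham"]

# build the dictionary of unique words, preserving first-seen order
vocab = {}
for sentence in corpus:
    for word in sentence.split():
        if word not in vocab:
            vocab[word] = len(vocab)

def one_hot(word):
    """Return the one-hot vector of `word` as a list of 0/1 ints."""
    vec = [0] * len(vocab)
    vec[vocab[word]] = 1
    return vec

print(vocab)            # {'I': 0, 'am': 1, 'Sam': 2, 'love': 3, 'eggs': 4, 'and': 5, 'ham': 6}
print(one_hot("I"))     # [1, 0, 0, 0, 0, 0, 0]
print(one_hot("eggs"))  # [0, 0, 0, 0, 1, 0, 0]
```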
3.2.3. Softmax Function
The softmax function, sometimes called the "normalized exponential", is used in various probabilistic multiclass classifiers, including Naïve Bayes classifiers and neural networks. The input to the softmax is a k-dimensional vector x and the output is the probability of the jth class given the sample x:

P(y = j \mid x) = \frac{e^{x^T w_j}}{\sum_{k=1}^{K} e^{x^T w_k}}
The log-linear models in the following sections are a prime example of models that use the softmax function.
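A short numerical sketch of the softmax; the weight matrix and input are made-up values, and a max-subtraction is included for numerical stability:

```python
import numpy as np

def softmax(scores):
    """Turn a vector of raw scores into a probability distribution."""
    shifted = scores - np.max(scores)        # subtract the max for numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum()

# illustrative: 3 classes, 4-dimensional input x, weight vectors w_j as rows of W
x = np.array([1.0, -0.5, 2.0, 0.3])
W = np.random.randn(3, 4)
probs = softmax(W @ x)                       # P(y = j | x) for j = 1..3
print(probs, probs.sum())                    # probabilities that sum to 1
```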
3.2.4. Skip-grams
While unigram, bigram and n-gram models use n consecutive words, skip-grams allow tokens to be skipped (Guthrie et al. 2006) [22]. For example, for the sentence "we should go to club":
Unigrams: "we", "should", "go", "to", "club"
Bigrams: "we should", "should go", "go to", "to club"
Trigrams: "we should go", "should go to", "go to club"
Skip-grams allow certain tokens to be skipped while still joining words together. A skip-gram is defined by two parameters, k and n, where k is the number of tokens allowed to be skipped and n is the length of the grams formed. For example:
At k=2 and n=2, the 2-skip-bigrams are "we should", "we go", "we to", "we club", "should go", "should to", "should club", "go to", "go club", "to club".
At k=2 and n=3, the 2-skip-trigrams are "we should go", "we should to", "we should club", "we go to", "we go club", "should go to", "should go club", "should to club", "go to club".
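A small sketch that enumerates k-skip-n-grams following the definition of Guthrie et al. (2006), in which at most k tokens in total may be skipped inside each gram; depending on the exact convention adopted, the enumeration may differ slightly from the lists above:

```python
from itertools import combinations

def skip_grams(tokens, n, k):
    """k-skip-n-grams: n-grams whose tokens come from a window of at most
    n + k consecutive tokens, preserving their original order."""
    grams = set()
    for i in range(len(tokens)):
        window = tokens[i:i + n + k]
        if len(window) < n:
            continue
        # always keep the first token of the window to avoid duplicate grams
        for rest in combinations(window[1:], n - 1):
            grams.add((window[0],) + rest)
    return grams

print(skip_grams("we should go to club".split(), n=2, k=2))
print(skip_grams("we should go to club".split(), n=3, k=2))
```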
3.2.5. Log-Linear Models
This section briefly introduces log-linear classifiers, adapted from Michael Collins' notes [21]. Model description:
A set X of possible inputs.
A finite set Y of possible outputs.
A positive integer d specifying the number of features and parameters in the model.
A function f : X × Y → ℝ^d mapping any (x, y) pair to a feature vector f(x, y).
A parameter vector v ∈ ℝ^d that serves as a weight vector for the feature vector f(x, y).
The model's task is to find the probability of y given x, parameterized by v:

P(y \mid x; v) = \frac{e^{v \cdot f(x,y)}}{\sum_{y' \in Y} e^{v \cdot f(x,y')}}    (1)

where the inner product is

v \cdot f(x,y) = \sum_{k=1}^{d} v_k f_k(x,y)    (2)
Taking a closer look at the model, we notice that the inner product v · f(x, y) can take any value, positive or negative. Since we aim to model a conditional probability, no negative values are allowed, so an exponential is employed to restrict the values to positives. Dividing by the normalizing factor is essential to make the probabilities sum to one: ∑_{y ∈ Y} P(y | x; v) = 1.
Applying the log form:

\log P(y \mid x; v) = v \cdot f(x,y) - \log \sum_{y' \in Y} \exp(v \cdot f(x,y'))    (3)

= v \cdot f(x,y) - g(x)

where g(x) = \log \sum_{y' \in Y} \exp(v \cdot f(x,y'))    (4)
The term v · f(x, y) is linear in f(x, y), while g(x) depends only on x and is independent of y, which makes log P(y | x; v) linear in f(x, y) as long as x is held constant; this justifies the name log-linear. Parameter estimation in log-linear models means finding the optimal values for the parameter vector v. Training is done using a training set with samples (x^1, y^1), (x^2, y^2), ..., (x^n, y^n). For any parameter vector v, the value L(v) is a measure of how well the parameter vector fits the training examples:

L(v) = \sum_{i=1}^{n} \log P(y^i \mid x^i; v)    (5)

Parameter estimation is performed by maximizing the likelihood:

v_{ML} = \arg\max_{v \in \mathbb{R}^d} L(v)    (6)
In order to prevent the parameter values from shooting to infinity, a regularization term is introduced into the objective:

L(v) = \sum_{i=1}^{n} \log P(y^i \mid x^i; v) - \frac{\lambda}{2} \sum_k v_k^2    (7)
Now L(v) is a concave function, so gradient ascent methods will find the optimal parameter vector v_{ML}.
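A toy sketch of evaluating the log-linear model of equations (1)-(7); the feature function, labels and data are made up for illustration:

```python
import numpy as np

def feature_vector(x, y, num_labels, dim):
    """Toy feature map f(x, y): copy x into the block belonging to label y."""
    f = np.zeros(num_labels * dim)
    f[y * dim:(y + 1) * dim] = x
    return f

def log_linear_probs(x, v, labels, dim):
    """P(y | x; v) for every y, i.e. the softmax over the scores v . f(x, y)."""
    scores = np.array([v @ feature_vector(x, y, len(labels), dim) for y in labels])
    scores -= scores.max()                      # numerical stability
    exp = np.exp(scores)
    return exp / exp.sum()

def regularized_log_likelihood(data, v, labels, dim, lam=0.1):
    """L(v) from equation (7): log-likelihood minus an L2 penalty."""
    ll = sum(np.log(log_linear_probs(x, v, labels, dim)[y]) for x, y in data)
    return ll - (lam / 2) * np.sum(v ** 2)

labels, dim = [0, 1, 2], 4
v = np.random.randn(len(labels) * dim) * 0.01
data = [(np.random.randn(dim), np.random.randint(3)) for _ in range(5)]
print(regularized_log_likelihood(data, v, labels, dim))
```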
3.3. A Survey on Word Representations in Vector Space
Researchers have proposed various techniques to leverage large amounts of unlabeled data. One particular method adopted by many researchers is representing the individual words of a language as vectors in a multidimensional space that capture semantic and syntactic properties of the language. These representations can serve as a fundamental building unit for many Natural Language Processing (NLP) applications. A word representation is a mathematical model representing a word in space, mostly as a vector; each component (dimension) is a feature of this word that can have semantic or syntactic meaning.
Collobert and Weston [24] proposed a unified architecture for natural language processing. They used a deep neural network architecture that is jointly trained using back propagation for many tasks: part of speech tagging, chunking, named entity recognition, and semantic role labeling. Individual words are embedded into a d-dimensional vector where each dimension is regarded as a feature. Word representations (embeddings) are stored in a matrix W ∈ ℝ^{d×|D|}, where D is a dictionary of all unique words. Look-up tables are used for retrieving the specific features of words. Sentences are represented using the embeddings of their forming words within a window around the word of interest, which solves the problem of variable-length sentences.
Mnih and Hinton [25] introduced the hierarchical log-bilinear model (HLBL), another form of word representation. For n-gram sentences, it concatenates the embeddings of the first n-1 words and learns a neural linear model to predict the last word. It is a probabilistic model that uses a softmax layer to transform the similarity between the predicted representation and the reference representations into a probability distribution.
Turian et al. [27] presented a survey of various work on representing words in vector space; they also presented yet another neural language model, resembling the work of Collobert and Weston, to represent words in vector space.
3.3.1. Feed Forward Neural Network Language Model (FFNNLM)
Figure 8: Feed Forward Neural Language Model.
Figure 9: Another illustration of the Feed Forward Neural Language Model.

Figure 8 shows a neural language model using n consecutive words to predict the (n+1)th word. Setting n to 3, Figure 9 shows a neural language model that uses the past three words from the context, w(t-1), w(t-2), and w(t-3), to predict the current word w(t). Each word in the vocabulary is mapped to its continuous vector representation via a look-up table (matrix) with dimensions N × P, where N is the size of the vocabulary and P is the dimension of the continuous representation (vector), so that the ith row in the table corresponds to the ith word in the vocabulary. One-hot vector coding is used at the input layer, and the continuous representations of the inputs are concatenated in the projection layer. The projection can be regarded as a hidden layer with a linear activation function. After the projection layer, values are passed to a second hidden layer with a non-linear hyperbolic tangent activation. Finally, the output layer is a softmax function yielding posterior probabilities for the current word given the context of n words:

P(w_j = k \mid w_{j-n+1}, \dots, w_{j-1}) = \frac{e^{a_k}}{\sum_{i=1}^{N} e^{a_i}}    (8)
where a_k is the value of the kth neuron of the output layer, which corresponds to the kth word in the vocabulary. The look-up table can be represented by the matrix U shown in Figure 9, which is trained using back propagation and eventually holds the continuous vectorized representations of the words.
3.3.2. Recurrent Neural Networks (RNN)
Mikolov et al. [26] built a neural language model using a recurrent neural network (RNN) that encodes the context word by word and predicts the next word. They used the trained network weights as the word representation vectors. The network architecture has an input layer with a recurrent feed, one hidden layer and an output layer. The training procedure iterates over the sentences; each sentence is broken down into words, and the input layer receives the current word encoded as a one-hot vector (1-of-N coding, where N is the vocabulary size). The recurrent feed is the previously encoded context history, and its length depends on the length of the hidden layer. The output layer is a softmax layer that generates a probability distribution over the vocabulary, which means that the length of the output layer equals the size of the vocabulary. The RNN is trained using back propagation to maximize the likelihood of the data under the model.
Figure 10: Mikolov's RNNLM
s(t) = f(U w(t) + W s(t-1))    (8)

y(t) = g(V s(t))    (9)

where:

f(z) = \frac{1}{1 + e^{-z}}    (10)

g(z_n) = \frac{e^{z_n}}{\sum_m e^{z_m}}    (11)
Example: "I love pizza". At t = 3 (the current word is "pizza"), w(t) = "pizza", represented as a one-hot vector, and s(t-1) is the encoded vector of the context history "I love". The objective is to predict the current word "pizza" correctly given the context history "I love". After training is complete, the weight matrices U, W and V are learned, and the word representations are found in the columns of U.
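A numpy sketch of one forward step of this RNN language model, following equations (8)-(11); the matrix shapes, random initialization and toy vocabulary are illustrative assumptions:

```python
import numpy as np

vocab = ["I", "love", "pizza"]
N, H = len(vocab), 8                      # vocabulary size, hidden layer size

rng = np.random.default_rng(0)
U = rng.normal(size=(H, N))               # input -> hidden (columns become word vectors)
W = rng.normal(size=(H, H))               # hidden -> hidden (recurrent feed)
V = rng.normal(size=(N, H))               # hidden -> output

def sigmoid(z):                           # equation (10)
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):                           # equation (11)
    e = np.exp(z - z.max())
    return e / e.sum()

def step(word_index, s_prev):
    """One time step: returns the new hidden state s(t) and output distribution y(t)."""
    w = np.zeros(N); w[word_index] = 1.0   # one-hot input w(t)
    s = sigmoid(U @ w + W @ s_prev)        # equation (8)
    y = softmax(V @ s)                     # equation (9)
    return s, y

s = np.zeros(H)
for token in ["I", "love"]:               # encode the context history word by word
    s, y = step(vocab.index(token), s)
print("P(next word):", dict(zip(vocab, y.round(3))))
```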
3.3.3. Log-Linear Models (CBOW, SKIP-G) for Word Representation
Mikolov et al. [28, 29] proposed two new techniques for building word representations in vector space using log-linear models: the continuous bag of words (CBOW) model and the skip-gram (SKIP-G) model. These techniques are based on a neural network architecture in which the hidden layer is replaced with a simple projection layer to reduce the computational requirements. Although the hidden layer is the main reason that makes neural networks a tempting choice, due to their ability to represent data accurately, with enough training data the new models can be trained accurately in a much faster setting. Using a large corpus of raw text, these models first build a vocabulary of all unique words in the corpus, and then assign each word a randomly initialized vector of user-defined dimension. This vector will eventually represent the corresponding word in space, such that each dimension encodes a certain property of the word, including semantic and syntactic features. The objective of the models is to learn optimal vectorized representations for words such that words sharing some semantic or syntactic meaning lie in close proximity in space. These models assume that words that tend to replace each other in sentences, or at least occur in similar contexts and sentence structures, share semantic or syntactic properties. Both models (CBOW and SKIP-G) use log-linear classifiers (discussed earlier in this chapter) to train these vectors, which are regarded as the parameters (weights) to be learnt by the classifiers.
CBOW predicts a word using a window of contextual words from its history and future. As shown in Figure 11, this simple network uses the contextual words as inputs, the hidden layer is removed, and the contextual words are projected by averaging their vector representations; the output of the projection layer is then used as the input to a log-linear classifier that tries to predict the pivot word.
Figure 11: CBOW model architecture, it predicts w(t) using its contextual words w(t-2), w(t-1), w(t+1), and w(t+2).
While most models use a certain context to predict the current word, skip-gram models use a pivot word to predict its context, by maximizing the probability of a contextual word given that pivot word using a log-linear classifier. As shown in Figure 12, the continuous vector representation of the pivot word is used as an input to the classifier to predict another word in the same context within a certain window. Increasing the context window increases the model accuracy, reflected in the quality of the resulting word vectors, but it also increases the computational complexity. Choosing a window of size C, the model picks a random number R in [1, C] and considers R words from the history and R words from the future, which means that for each pivot word, 2R classifications are performed. Here we can see why this model is called a skip-gram model: the pivot word tries to predict a word that is at most C words apart, with no regard to (skipping) the C-1 words between them. The current word w(t) is used to predict words from its context within a certain window. The projection layer will hold the final word representations.
Figure 12: The skip-gram model architecture. The objective is using the pivot word w(t) to predict the context w(t-2), w(t-1), w(t+1), w(t+2)
The skip-gram model tries to maximize the average log probability:

\frac{1}{T} \sum_{t=1}^{T} \sum_{-c \le j \le c,\, j \ne 0} \log p(w_{t+j} \mid w_t)    (12)
Alternatively, we can rewrite the objective function as:

\arg\max_{\Theta} \prod_{w \in V} \prod_{c \in C(w)} p(c \mid w; \Theta)    (13)

\arg\max_{\Theta} \prod_{(w,c) \in D} p(c \mid w; \Theta)    (14)
where w is a word from the vocabulary V, c is a context (which can be a contextual word), C(w) is the set of contextual words around w, Θ is the set of parameters (the word vectors) to be optimized, and D is the set of all word-context pairs extracted from the corpus. The number of parameters in Θ is |C| × |V| × d, where d is the dimension of the vectors.
By plugging in the log-linear model formula (using the softmax):

P(c \mid w; \Theta) = \frac{e^{v_c \cdot v_w}}{\sum_{c' \in C} e^{v_{c'} \cdot v_w}}    (15)
Using a sum instead of a product by taking the log:

\arg\max_{\Theta} \sum_{(w,c) \in D} \log p(c \mid w) = \sum_{(w,c) \in D} \left( \log e^{v_c \cdot v_w} - \log \sum_{c'} e^{v_{c'} \cdot v_w} \right)    (16)
where v_c and v_w ∈ ℝ^d are the vector representations of c and w, and C is the set of all available contexts, which are all unique words in the corpus but with vector representations distinct from those of the vocabulary, i.e. each word receives two vectorized representations according to its role (pivot word or contextual word). Having two distinct representations for the same word is sensible. Consider the case where the word "car" as a word and "car" as a context share the same vector: since words rarely appear in the context of themselves ("car" is rarely repeated in the context of "car"), this forces the model to give a low probability to P(car|car), which means the model would have to give a low value to the inner product v_car · v_car, which is impossible since it is the squared norm of the vector.
3.3.4. GloVe: Global Vectors for Word Representation
Pennington et al. [30] presented another technique to learn word representations, called "GloVe" for Global Vectors. While the CBOW and SKIP-G models can be classified as shallow window-based approaches, because they represent a word in vector space as a function of its local context controlled by a window, GloVe utilizes the global statistics of word-word co-occurrence in the corpus. The co-occurrence matrix is used to calculate the probability that word_j appears in the context of word_i, P_ij = P(j|i); this probability is postulated to capture the relatedness between these words.
Let the word-word co-occurrence matrix be X, such that X_ij is the number of times word_j appears in the context of word_i.
Let X_i = \sum_k X_{ik} be the number of times any word appears in the context of word_i.
Let P_{ij} = P(j \mid i) = \frac{X_{ij}}{X_i} be the probability that word_j appears in the context of word_i.
For example, the word "solid" is more related to "ice" than to "steam"; this can be confirmed by the ratio between P("solid"|"ice") and P("solid"|"steam") being high. Using probe words k, we notice that for words related to "ice" but not "steam", such as k = "solid", the ratio is high, while for words related to "steam" but not "ice", such as k = "gas", the ratio is small, and for words related to neither "ice" nor "steam" the ratio is close to 1 (Table 3).

Table 3: Co-occurrence probabilities for the words "ice" and "steam" with selected contextual words from a 6 billion token corpus [30].

Probability and ratio       k="solid"   k="gas"    k="water"  k="fashion"
P(k|"ice")                  1.9×10-4    6.6×10-5   3×10-3     1.7×10-5
P(k|"steam")                2.2×10-5    7.8×10-4   2.2×10-3   1.8×10-5
P(k|"ice")/P(k|"steam")     8.9         8.5×10-2   1.36       0.96
GloVe uses this ratio to encode the relationship between the words (i, j, k) and tries to find vectorized representations for words that satisfy this ratio:

F(w_i, w_j, w'_k) = \frac{P_{ik}}{P_{jk}}    (17)

where w ∈ ℝ^d are the word vectors, w' ∈ ℝ^d is a separate context word vector, and F is a function used to encode the information in the ratio (the right hand side) into a vectorized representation of the words. Since relations between words are linear under vector subtraction, as shown in [28], we are interested in learning vectorized representations for words that encode this linear relation. Since the right hand side is a scalar value, the natural choice is to take the dot product of the arguments of F:

F((w_i - w_j)^T w'_k) = \frac{P_{ik}}{P_{jk}}    (18)
Imposing F(w_i^T w'_k) = P_{ik} = \frac{X_{ik}}{X_i} gives:

F((w_i - w_j)^T w'_k) = \frac{F(w_i^T w'_k)}{F(w_j^T w'_k)}    (19)

This equation is solved by taking F = exp, which makes:

\exp(w_i^T w'_k) = P_{ik} = \frac{X_{ik}}{X_i}    (20)

w_i^T w'_k = \log(P_{ik}) = \log(X_{ik}) - \log(X_i)    (21)

The term \log(X_i) is independent of k and can be absorbed into a bias term b_i. Finally, adding another bias term b'_k for w'_k, the objective becomes finding values for the word vectors that satisfy:

w_i^T w'_k + b_i + b'_k = \log(X_{ik})    (22)

w_i^T w'_k + b_i + b'_k - \log(X_{ik}) = 0    (23)
One main drawback of this model is that it weights all co-occurrences (frequent and rare) equally; to alleviate this problem, a weighting function f(X_ij) is introduced into the model. The cost function then becomes a weighted least squares regression problem:

J = \sum_{i,j=1}^{V} f(X_{ij}) \left( w_i^T w'_j + b_i + b'_j - \log(X_{ij}) \right)^2    (24)

3.3.5. Relation between GloVe and SKIP-G
Since both models use occurrence statistics from a large corpus, they share a number of similarities. SKIP-G models the probability that word_j appears in the context of another word_i as a softmax Q_ij:

Q_{ij} = \frac{e^{w_i^T w'_j}}{\sum_{k \in V} e^{w_i^T w'_k}}    (25)
The objective of the model is to maximize the log probability as a window scans over the corpus:

J = - \sum_{i \in corpus} \sum_{j \in context(i)} \log Q_{ij}    (26)
A word's role can be interchanged between vocabulary word and context word: for example, w_i (as a context) can appear in the context of another word w_j (as a vocabulary word), and later, as we scan through the corpus, w_j (as a context) can appear in the context of w_i (as a vocabulary word). Interchanging the roles makes the number of times w_i appears in the context of w_j (X_ji) equal to the number of times w_j appears in the context of w_i (X_ij), which enables us to calculate Q_ij once and simply multiply it by X_ij, making the objective of SKIP-G:

J = - \sum_{i=1}^{V} \sum_{j=1}^{V} X_{ij} \log Q_{ij}    (27)
Since P_{ij} = \frac{X_{ij}}{X_i}:

J = - \sum_{i=1}^{V} X_i \sum_{j=1}^{V} P_{ij} \log Q_{ij}    (28)
The cross entropy of the two distributions P_i and Q_i is H:

H(P_i, Q_i) = - \sum_{j=1}^{V} P_{ij} \log Q_{ij}

J = \sum_{i=1}^{V} X_i H(P_i, Q_i)    (29)
This means that the objective function is a weighted sum of cross entropy errors, which resembles the weighted least squares error of equation 24; cross entropy is just one possible way to measure the distance (or resemblance) between two probability distributions. Cross entropy as a distance measure for probability distributions has a few problems. First, distributions with long tails are modeled poorly, with too much weight given to unlikely events. Second, it requires the model Q to be normalized, and this normalization factor is very expensive to compute, which led Mikolov to use hierarchical softmax as an approximation to the full softmax in the CBOW and SKIP-G models.
We can consider another distance measure instead of cross entropy: a least squares objective in which the normalization factors are discarded:

J' = \sum_{i,j}^{V} X_i \left( P'_{ij} - Q'_{ij} \right)^2    (30)

where P'_{ij} = X_{ij} and Q'_{ij} = e^{w_i^T w'_j} are the two un-normalized versions of P_{ij} and Q_{ij}. To avoid X_{ij} taking on large values that can complicate the optimization, we minimize the squared error of the logarithms:

J' = \sum_{i,j} X_i \left( \log P'_{ij} - \log Q'_{ij} \right)^2 = \sum_{i,j} X_i \left( w_i^T w'_j - \log X_{ij} \right)^2    (31)
Adding the weighting factor introduced in equation 24:

J' = \sum_{i,j} f(X_{ij}) \left( \log P'_{ij} - \log Q'_{ij} \right)^2    (32)
Now, equation 32 is equivalent to equation 24.
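For reference, a small sketch of the weighting function f(X_ij) used by GloVe, assuming the commonly cited form from the GloVe paper with α = 3/4 and x_max = 100 (the X_Max value listed for our GloVe configuration in Table 4); the exact implementation used here may differ:

```python
def glove_weight(x, x_max=100.0, alpha=0.75):
    """GloVe weighting function f(X_ij): down-weights rare and noisy co-occurrences
    and caps the influence of very frequent ones at 1."""
    return (x / x_max) ** alpha if x < x_max else 1.0

for count in [1, 10, 100, 1000]:
    print(count, round(glove_weight(count), 3))   # 0.032, 0.178, 1.0, 1.0
```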
3.4. Building Word Representation For Arabic
Mikolov et al. [28] compared different techniques for building word representations in vector space, namely Mikolov's RNN embeddings, Collobert and Weston's embeddings, Turian's embeddings and Mnih's embeddings, and showed that the CBOW and SKIP-G models are significantly faster to train with better accuracy. Pennington et al. [30] showed that GloVe performs well compared to the CBOW and SKIP-G models on the semantic and syntactic analogy test presented in [28]. Accordingly, we used the CBOW, SKIP-G and GloVe models to build word representations in vector space for Modern Standard Arabic (MSA). To train these models, we collected a large amount of raw Arabic text from the following sources:
Arabic Wikipedia.
Arabic Gigaword Corpus.
LDC Arabic newswire.
Arabic Wiktionary.
The open parallel corpus [31, 32].
Combined glosses and definitions for Arabic words in Arabase [1].
MultiUN, a collection of translated documents from the United Nations [33].
OpenSubtitles 2011, 2012, and 2013, a collection of movie subtitles [34].
Raw Quran text [35].
A corpus of KDE4 localization files [36].
A collection of translated sentences from Tatoeba [32].
Khaleej 2004 and Watan 2004 [37].
BBC and CNN Arabic corpus [38].
Meedan Arabic corpus [38].
Ksucorpus; King Saud University Corpus [40].
A text version of the Zad-Almaad book.
Microsoft crawled Arabic Corpus.
We compiled all these sources together and performed several cleaning and normalization steps on the combined corpus, as sketched below:
Cleaning noisy characters and tags, and removing diacritics.
Arabic character normalization: we normalized ( آ, إ, أ ) to ( ا ) and ( ة ) to ( ه ).
Normalizing all numerical digits to the token "NUM".
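A minimal sketch of such cleaning and normalization, assuming the standard Unicode code points for the Arabic letters involved; the exact cleaning pipeline used in this work may differ:

```python
import re

ALEF_VARIANTS = "\u0622\u0623\u0625"          # آ أ إ
ALEF = "\u0627"                               # ا
TEH_MARBUTA, HEH = "\u0629", "\u0647"          # ة -> ه
DIACRITICS = re.compile(r"[\u064B-\u0652]")    # fathatan .. sukun

def normalize(text):
    """Apply the cleaning steps described above to one line of Arabic text."""
    text = DIACRITICS.sub("", text)                      # remove diacritics
    text = re.sub(f"[{ALEF_VARIANTS}]", ALEF, text)      # normalize alef variants
    text = text.replace(TEH_MARBUTA, HEH)                # normalize teh marbuta
    text = re.sub(r"[0-9\u0660-\u0669]+", "NUM", text)   # digits -> NUM token
    return re.sub(r"\s+", " ", text).strip()             # collapse whitespace

print(normalize("ذَهَبَ إلى المدرسة ٢٠١٥"))
```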
We formed short phrases from individual words by attaching n-gram tokens together to be treated as a single unit [29]. To form such phrases we choose a score threshold above which an n-gram is treated as a short phrase. For words w_i and w_j and their bigram w_i w_j:

score(w_i, w_j) = \frac{count(w_i w_j) - \delta}{count(w_i) \times count(w_j)}    (33)

Bigrams whose score is above this threshold are used as phrases; δ is a discounting factor that prevents the formation of phrases from infrequent words. The vocabulary size of the compiled corpus is about 6.3 million entries (unigrams and bigrams), and the total number of words is about 5.8 billion. Training these models requires choosing several hyper-parameters that affect the resulting vectors:
Word vector size: the vector size is an input parameter; a couple of hundred dimensions is the recommended choice. This parameter affects the performance of the model, so it is useful to tune it if the resulting vectors will be used in a specific task.
Window: for CBOW/SKIP-G it refers to the amount of context to consider around the pivot word during training, while for GloVe it refers to the maximum distance between two words for them to be considered a co-occurrence.
Sample: for CBOW/SKIP-G, a frequency threshold; words appearing with frequency higher than this threshold are randomly down-sampled using this parameter.
Hierarchical Softmax (HS): for CBOW/SKIP-G, hierarchical softmax is a computationally efficient approximation of the full softmax used to predict words during training.
Negative: for CBOW/SKIP-G, the number of negative examples used in training.
Frequency threshold: words appearing with frequency less than this threshold are discarded.
Maximum number of iterations: for GloVe, the number of iterations used to train the model.
X_Max: for GloVe, this parameter is used in a weighting function whose job is to give rare and noisy co-occurrences low weights.
We built three models for Arabic4 (CBOW, SKIP-G and GloVe). Table 4 shows the training details for each model.
Table 4: Training configuration parameters used to build the Arabic models.

Parameter         CBOW    SKIP-G   GloVe
Vector size       300     300      300
Window            5       10       10
Sample            1e-5    1e-5     N/A
HS                No      No       N/A
Negative          10      10       N/A
Freq. thresh.     10      10       10
Phrase thresh.    200     200      200
Max iterations    N/A     N/A      25
X_Max             N/A     N/A      100
3.5. Vector Quality Assessment
3.5.1. Intrinsic Evaluation
Having the individual words represented in vector space introduces new ways of using these representations for word-to-word relations. A relationship between two words can be measured using a similarity function that maps a pair of word vectors to a real number: F(v_1, v_2) → ℝ. This mapping function (similarity measure) can be cosine similarity, Euclidean distance, Manhattan distance, or any other similarity measure.
4 Models are available at: https://sites.google.com/site/mohazahran/data
One particularly interesting task is to apply the relationship between a pair of words (e.g. singular/plural, feminization, tense change, ...) to a third word; this is called the "analogy task". Let the first pair of words be w_1 and w_2 and the third word be w_3. Let the relationship between w_1 and w_2 be r_1, then v(r_1) = v(w_2) - v(w_1), where the operator v(x) returns the vector representation of x, and v(r_1) represents a vector in space joining w_1 and w_2. We can apply r_1 to w_3 to give a fourth word w_4 such that v(w_4) = v(w_3) + v(r_1), so that w_3 and w_4 have the same relationship as w_1 and w_2.
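A sketch of answering one analogy question with plain numpy over a word-to-vector dictionary; the tiny random vocabulary stands in for the real 300-dimensional models:

```python
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def analogy(vectors, w1, w2, w3, topn=5):
    """Return the topn candidates w4 such that w1 : w2 :: w3 : w4."""
    target = vectors[w3] + (vectors[w2] - vectors[w1])   # v(w3) + v(r1)
    candidates = [(w, cosine(target, v)) for w, v in vectors.items()
                  if w not in (w1, w2, w3)]              # exclude the question words
    return sorted(candidates, key=lambda x: -x[1])[:topn]

# illustrative 3-d toy vectors; the real vectors are 300-dimensional
vectors = {w: np.random.randn(3) for w in ["Athens", "Greece", "Oslo", "Norway", "Cairo"]}
print(analogy(vectors, "Athens", "Greece", "Oslo"))
```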
Table 5: A sample of Mikolov's semantic and syntactic analogy test for English and its translation to Arabic.

Type of relationship     Word pair 1                          Word pair 2
Common capital city      Athens أثينا : Greece اليونان          Oslo اوسلو : Norway النرويج
Man-Woman                brother شقيق : sister شقيقة           grandson حفيد : granddaughter حفيدة
Superlative              bad سيء : worst اسوا                  big كبير : biggest اكبر
Plural nouns             bird طائر : birds طيور                 car سيارة : cars سيارات
To test the quality of the vectors, Mikolov's analogy test for English vectors [28] is used by translating the test cases manually to Arabic. The test set contains five types of semantic questions and nine types of syntactic questions; examples are shown in Table 5. We compared our models to the English skip-gram model [41] and GloVe model [42] using the translated version of the test (Table 6). The test focuses on calculating the relation between the first pair of words and applying it to a third word, then comparing the fourth word with the predicted word. The predicted word is the word with the highest cosine similarity score to the predicted vector. A test case is answered correctly only if one of the top five predicted words matches the fourth word, which means that synonyms and semantically close words are counted as mistakes. Examining the results in Table 6, the Arabic models are trained on a significantly smaller corpus compared to English. Having a large corpus is essential to enable the models to represent the words accurately; by a large corpus we mean a corpus big enough that each word in the vocabulary is repeated an adequate number of times for the models to represent it accurately. However, our models show good performance on test cases whose Arabic translation is unambiguous, such as translations of named entities (countries, capitals, and cities), because these are frequent in the corpus. On the other hand, they show low performance on other test cases, such as the "opposite" test cases, because most of the English words in this test do not have a clear Arabic translation. For example, the word "uncompetitive" has no direct (word-to-word) Arabic translation; the closest translation would be "غير_منافس". These unusual translations are either out of vocabulary or rarely found in the training corpus and thus receive a poor vectorized representation.
Low-frequency terms also explain the low performance on the "currency" test cases. For example, the term "الزلوتي" (translated as zloty, the Polish currency) occurred only 63 times.

Table 6: Total accuracy of the English models and the Arabic models on the test set and its Arabic translation. For each model, the first column (Cov.) shows the percentage of test cases covered by the model, and the second column (Acc.) shows how many of the covered test cases are answered correctly.

Model (training words)     English SKIP-G300 (300B)   English GloVe300 (840B)   Arabic CBOW300 (5.8B)   Arabic SKIP-G300 (5.8B)   Arabic GloVe300 (5.8B)
                           Cov.    Acc.                Cov.    Acc.               Cov.    Acc.            Cov.    Acc.              Cov.    Acc.
capital-common-countries   100     94.9                100     100                100     94.3            100     93.7              100     92.7
capital-world              100     93.1                100     95.6               100     74.7            100     77                100     80.4
currency                   100     37.8                100     13.2               100     7.7             100     7.9               100     5.7
city-in-state              100     87.2                100     87.4               100     32.5            100     32.6              100     36.4
family                     100     95.3                100     86.8               67.6    46.5            67.6    36.3              67.6    50.3
adjective-to-adverb        100     53.8                100     65.6               100     34.2            100     30.1              100     22
opposite                   100     57.6                100     45.8               80      3.7             80      3.2               80      3.5
comparative                100     97                  100     96.6               100     73.8            100     67                100     71.5
superlative                100     95.4                100     93.7               100     68.9            100     64.6              100     66.9
present-participle         100     96.7                100     97.4               93.9    46.1            93.9    42.1              93.9    30.5
nationality-adjective      100     95.7                100     89.2               100     49.9            100     55.5              100     44.2
past-tense                 100     93.7                100     91.9               100     44.7            100     41.6              100     43.5
plural                     100     95.8                100     96.5               100     56.1            100     56.9              100     57.7
plural-verbs               100     89.5                100     90.3               100     80.1            100     75.5              100     72.2
TOTAL                      100     87.4                100     86.2               98      54.3            98      53.6              98      53.5
Another source of errors is the absence of diacritization, which is needed in Arabic to differentiate between words having the same form but different meanings; in practice, however, the use of diacritics in Arabic is rare and the collected Arabic data is almost free of diacritics. These results suggest that there should be a test set tailored for Arabic, rather than a translation of the English test cases, in order to evaluate the Arabic vectors more accurately.
3.5.2. Extrinsic Evaluation
3.5.2.1. Information Retrieval
Many query expansion techniques have been proposed to enhance the performance of text retrieval; they can be classified into semantic-based and statistical-based expansion techniques. Here, we propose using the Arabic vectors as a semantic expansion technique, because the Arabic vectors capture the semantic properties of the language such that semantically close terms are clustered in close proximity in the vector space. Mahgoub et al. [45] proposed query semantic expansion techniques for Arabic information retrieval based on various language resources such as Wikipedia, Google Translate with WordNet, and other Arabic linguistic resources. We compare our vector expansion technique with their techniques using the TREC 2002 cross-lingual (CLIR) track dataset [46], which contains 50 queries tested against 383,872 documents; we discarded any non-judged documents from our experiments before evaluation.
The basic idea is to expand a query term such that the expansions are semantically related to the query, which means the query should act as a sense gauge for the expanded term. A term is expanded using its vector representation by retrieving all other terms in the vector space ordered descendingly by cosine similarity score, while a query is represented by adding up the vectors of all its terms. The order of possible expansions for a term is then influenced by the query by re-ordering the terms in the expansion list using the cosine similarity score with the query vector. In order to avoid bias in re-ordering the expansion list, the term being expanded is not included among the terms forming the query vector. For each query, we allow a maximum of 50 expansions over all of its terms, such that the number of expansions for each term is inversely proportional to its frequency, thus allowing less frequent terms to have more expansions [45]. Figure 13 and Figure 14 compare the impact of using the Arabic vectors as an expansion scheme versus traditional resources such as Wikipedia, WordNet translations and other resources, using Indri [47].
The example in Table 7 shows how the query is used to disambiguate the expansion list of a term (the underlined word). The query "كيف يعامل طالب الدين في النجف بعد اغتيال صادق الصدر؟" translates as (How are the religion students treated in Nagaf after the assassination of Sadeq Alsadr?). The expanded word (الصدر, Alsadr) has two senses: either "chest" or the name of a person, "Alsadr" (which is the correct sense for this query).
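A sketch of this expansion scheme using gensim KeyedVectors; the model path, candidate pool size and per-term expansion budget are illustrative assumptions rather than the exact configuration used:

```python
import numpy as np
from gensim.models import KeyedVectors

wv = KeyedVectors.load("arabic_cbow_300.kv")   # illustrative path to the Arabic vectors

def cos(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def expand_query(query_terms, per_term=10, candidates=50):
    """Expand each query term with neighbours re-ranked by the query context vector."""
    expansions = {}
    for term in query_terms:
        others = [t for t in query_terms if t != term and t in wv]
        if term not in wv or not others:
            continue
        # query vector = sum of the other terms' vectors (the expanded term is excluded)
        query_vec = np.sum([wv[t] for t in others], axis=0)
        # candidates by similarity to the term, re-ranked against the query vector
        cands = [w for w, _ in wv.most_similar(term, topn=candidates)]
        expansions[term] = sorted(cands, key=lambda w: -cos(query_vec, wv[w]))[:per_term]
    return expansions
```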
Table 7: A comparison between the expansion lists for the word "Alsadr" in a query before and after disambiguation using the query context vector.

Before disambiguation (Arabic, English translation): والصدر (And the chest), للصدر (For the chest), بالصدر (By the chest), البطن (Stomach).
After disambiguation (Arabic, English translation): النجف (Alnagaf), انصار مقتدى الصدر ببغداد (Supporters of Moqtada Alsadr in Baghdad), مقتدي الصدر (Moqtada Alsadr).
[Plot: levels of precision at @k (k = 5 to 1000) for vectors expansion, Mahgoub expansion, and raw text matching.]
Figure 13: Levels of precision for expansion using Arabic vectors, Mahgoub's expansion using Wikipedia and other resources, and raw text matching on TREC 2002.
[Plot: interpolated recall-precision curve for raw text matching, Mahgoub expansion, and vectors expansion.]
Figure 14: Recall-precision curve for expansion using Arabic vectors, Mahgoub's expansion using Wikipedia and other resources, and raw text matching on TREC 2002.
3.5.2.2. Short Answer Grading
Short answer grading is an interesting NLP application for assessing how well the Arabic vectors capture semantic and syntactic properties of the language. In short answer grading, given a reference answer and a student answer, it is required to return a grade that represents the correctness of the student answer. To employ the vectors in such a problem, it is essential to transform the grading problem into a sentence-to-sentence similarity measuring task. A sentence can be represented using a combination of the vectors of its words; for example, a simple addition of the word vectors can give a sufficient representation of a sentence in vector space, especially for short sentences. Using combinations of different preprocessing steps (lemmatization, stemming, ...) with the various vector-based sentence representation schemes (CBOW, SKIP-G, GloVe) results in a number of features relating a student answer to the reference answer via similarity measures such as cosine similarity; these features are fed to an SVM regression module that scales the similarity scores to a reasonable grade. Table 8 shows the impact of using the Arabic vectors on an Arabic dataset for short answer grading, using root mean square error (RMSE) and Pearson's correlation against the inter-annotator agreement (IAA). We also report the results of Goma's system discussed in [44] on the equivalent humanly translated English dataset.
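A sketch of this grading pipeline: sentences are represented by summing their word vectors, cosine similarities from different models become features, and an SVM regressor maps them to a grade. The function names and feature set are illustrative, not the exact configuration used in our experiments:

```python
import numpy as np
from sklearn.svm import SVR

def sentence_vector(tokens, wv):
    """Represent a sentence by adding the vectors of its in-vocabulary words."""
    vecs = [wv[t] for t in tokens if t in wv]
    return np.sum(vecs, axis=0) if vecs else np.zeros(wv.vector_size)

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

def features(student, reference, models):
    """One similarity feature per vector space model (e.g. CBOW, SKIP-G, GloVe)."""
    return [cosine(sentence_vector(student, wv), sentence_vector(reference, wv))
            for wv in models]

def train_grader(X, y):
    """X: feature vectors for (student, reference) answer pairs; y: human grades."""
    return SVR(kernel="rbf").fit(np.array(X), np.array(y))
```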
Table 8: Arabic vectors results in short answer grading using RMSE (the lower the better) and correlation (the higher the better).

System                                                       RMSE   Correlation
IAA (Arabic and English dataset)                             0.69   0.86
Arabic vectors (Arabic dataset)                              0.95   0.82
Goma's system (manually translated English dataset)         0.75   0.83
3.6. Arabic-English Vector Space Mapping
Mapping the Arabic vector space to the English vector space is an attractive application, especially for Arabic, because Arabic suffers from poor language resource support compared to English. Mapping the two vector spaces allows Arabic NLP applications to use the English language support in the Arabic NLP domain. Mikolov et al. [43] used a translation matrix to learn a linear transformation between the two vector spaces by minimizing the mean square error between the reference and the predicted vectors; this translation matrix can be regarded as a simple neural network with no hidden layers. Alternatively, we propose training a neural network to learn the vector mapping by minimizing the Cosine error instead of the mean square error (derivations in the appendix). The intuition behind this objective function is that the Cosine similarity score is used in the literature as the default measure to assess the similarity between two word vectors [28, 29, 30, 43], which means there is a mismatch between the objective function (mean square error) and the similarity metric (Cosine similarity score).
We used an English-Arabic dictionary to train the neural network by retrieving the vectors corresponding to the dictionary's entries to form parallel training data. Each entry consists of an Arabic word with a list of possible English translations, totaling 8,444 entries divided into 5,872 entries for training, 1,254 for validation and 1,318 for testing. The training and validation entries are expanded such that an Arabic word forms a separate entry with each of its possible English translations, resulting in 27,089 entries for training and 5,718 entries for validation. We used vectors from the best scoring models in Table 6: CBOW300 for Arabic and SKIP-G300 for English. Two simple neural networks are constructed with 300 neurons for both the input and output layers and no hidden layers. Both networks have the same architecture, parameters and initial weights; the only difference is the objective function: the first optimizes the Cosine error and the second optimizes the mean square error, and both are trained using backpropagation. Some translation examples from the test set using the Cosine neural network are shown in Table 9.
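A numpy sketch of training a linear mapping (no hidden layer) with a cosine objective by gradient descent; this is a simplified stand-in for the network described above, with made-up hyper-parameters, and is not the exact training setup or the derivation given in the appendix:

```python
import numpy as np

def cosine_loss_grad(W, x, y):
    """Loss 1 - cos(Wx, y) and its gradient with respect to W."""
    p = W @ x
    norm_p, norm_y = np.linalg.norm(p), np.linalg.norm(y)
    c = (p @ y) / (norm_p * norm_y)
    # d cos / d p = y / (|p||y|) - cos * p / |p|^2
    dc_dp = y / (norm_p * norm_y) - c * p / (norm_p ** 2)
    grad_W = -np.outer(dc_dp, x)          # chain rule: dL/dW = -(d cos/d p) x^T
    return 1.0 - c, grad_W

def train_mapping(ar_vecs, en_vecs, dim=300, lr=0.1, epochs=20):
    """ar_vecs, en_vecs: aligned lists of Arabic/English vectors from the dictionary."""
    rng = np.random.default_rng(0)
    W = rng.normal(scale=0.01, size=(dim, dim))
    for _ in range(epochs):
        for x, y in zip(ar_vecs, en_vecs):
            loss, grad = cosine_loss_grad(W, x, y)
            W -= lr * grad                # gradient descent on the cosine error
    return W
```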
Table 9: Translation examples using the Cosine neural network.

Arabic word     English translations using the Cosine neural network
عصور            epochs, epoch, millenniums, millennia, eras, era
علم_الفلك       astronomy, cosmology, quantum_physics, astrology, science
الكونجرس        parliament, congress, legislature, senate, vote, Congress
يتسلل           infiltrate, sneak, penetrate, seep, slither, creep
Table 10 compares the two networks on the test set using three measures: NDCG, recall and accuracy. For each instance in the test set, we use the predicted English vector to retrieve the top k English translations using cosine similarity. The NDCG for test sample j using k predicted translations is:

NDCG_j = \frac{match(1) + \sum_{i=2}^{k} \frac{match(i)}{\log_2 i}}{1 + \sum_{i=2}^{k} \frac{1}{\log_2 i}}    (34)

where the function match takes a predicted translation and checks whether it matches one of the possible reference English translations, with no credit for synonyms, semantically close terms or even morphological variations:

match(x) = \begin{cases} 1, & \text{if } x \text{ is a match} \\ 0, & \text{otherwise} \end{cases}    (35)

The overall NDCG for M test samples using k predicted translations is:

NDCG = \frac{\sum_{j=1}^{M} NDCG_j}{M}    (36)
While the NDCG accounts for the rank k at which a match happens, the recall and accuracy take no account of the rank. The accuracy counts how many test samples were translated correctly, i.e. at least one of the k predicted translations matches one of the reference translations, while the recall measures how many reference translations were covered by the predicted translations. The recall for a test sample j with l reference translations and k predicted translations is:

Recall_j = \frac{\sum_{i=1}^{k} match(i)}{\min(k, l)}    (37)

The overall recall for M test samples using k predicted translations is:

Recall = \frac{\sum_{j=1}^{M} Recall_j}{M}    (38)
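A small sketch of these metrics for a single test sample, following equations (34)-(38); the reference and predicted lists are toy data:

```python
import math

def ndcg_at_k(predicted, references, k):
    """Equation (34): rank-discounted credit for matches among the top-k predictions."""
    matches = [1 if p in references else 0 for p in predicted[:k]]
    gain = matches[0] + sum(m / math.log2(i + 1) for i, m in enumerate(matches[1:], start=1))
    ideal = 1 + sum(1 / math.log2(i + 1) for i in range(1, k))
    return gain / ideal

def recall_at_k(predicted, references, k):
    """Equation (37): matched predictions over min(k, number of references)."""
    hits = sum(1 for p in predicted[:k] if p in references)
    return hits / min(k, len(references))

refs = {"epochs", "eras", "ages"}
pred = ["epochs", "millennia", "eras", "periods", "era"]
print(ndcg_at_k(pred, refs, 5), recall_at_k(pred, refs, 5))
```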
Table 10: A comparison between optimizing for Cosine error versus mean square error using NDCG, Recall and Accuracy.

k    Cos (NDCG)   MSE (NDCG)   Cos (Recall)   MSE (Recall)   Cos (Accuracy)   MSE (Accuracy)
1    27.1%        25.9%        27.1%          25.9%          27.1%            25.9%
2    19.3%        19.1%        22%            21.7%          34.7%            34.4%
3    16.8%        16.5%        20.8%          20.1%          38.8%            38.5%
4    15.2%        14.9%        20.1%          19.4%          41.4%            40.4%
5    14%          13.8%        20.2%          19.6%          43.6%            43.1%
6    13.1%        12.8%        20.6%          19.6%          45.8%            44.8%
7    12.4%        12.2%        21.2%          20.3%          47.8%            46.7%
8    11.8%        11.5%        21.8%          20.8%          49.3%            48.1%
9    11.2%        11%          22.3%          21.1%          50.2%            49%
10   10.7%        10.5%        22.6%          21.7%          50.8%            50.4%
Another way to assess the quality of the neural network is to compare the translated vectors with the native English vectors in a practical NLP task such as short answer grading. As shown in Figure 15, the idea is to have an Arabic dataset for short answer grading and its human translation to English. Given an Arabic student answer a_s and its Arabic reference answer a_r, and their corresponding English student and reference answers e_s and e_r respectively: first, we assign a grade to the English answers using the English vectors, following the same ideas discussed in the previous section. Second, using the proposed neural network we translate the two Arabic answer vectors to English vectors (e'_s and e'_r) and assign a grade using the translated vectors. Finally, we compare the Arabic vectors on the Arabic data, the translated vectors on the English data, and the English vectors on the English data, using Pearson's correlation between the predicted grades of each system and the reference grades, as shown in Table 11.
Figure 15: Using translated vectors in short answer grading.
Table 11: Correlation between predicted grades and reference grades for native Arabic, translated English and native English vectors in short answer grading.

Model                                                          Correlation
Arabic vectors & Arabic data                                   0.79
Translated English vectors & human-translated English data    0.75
English vectors & human-translated English data               0.76
These results show that the translated English vectors produced by the neural network perform remarkably closer to the native English vectors on the same dataset (the human-translated English dataset) than to the Arabic vectors on the Arabic data.
Chapter 4: Cross Lingual Lexical Substitution5
In chapter 2, we automatically built a rich Arabic language resource (Arabase) leveraging pre-existing Arabic language resources. In this chapter, we discuss the problem of mapping the Arabic entries in Arabase to WordNet. Given an Arabic word and a representative sentence (a gloss in our case), it is required to find an English word that can substitute the Arabic word in its context; this problem is known in the literature as cross lingual lexical substitution (CLLS). Polysemous words acquire different senses and meanings from their contexts. Representing words in vector space as a function of their contexts captures some semantic and syntactic features of words and introduces new useful relations between them. In this chapter, we exploit different vectorized representations for words to solve the problem of Cross Lingual Lexical Substitution. We compare our techniques with different systems using two measures, "best" and "out-of-ten" (oot), and show that our techniques outperform the state of the art in the "oot" measure while keeping a reasonable performance in the "best" measure. This chapter is organized as follows: first, we introduce the CLLS problem (sections 4.1 and 4.2) and present the related work on CLLS (section 4.3); then we introduce our novel algorithm for CLLS (section 4.4) and test it on the benchmark Spanish-English dataset (sections 4.5 and 4.6). Next, we build an equivalent Arabic-English dataset and evaluate our algorithm on it (section 4.7).
4.1. Introduction
Word sense disambiguation (WSD) is a well-known problem in Natural Language Processing (NLP): identifying a particular sense of a polysemous word given its context. Cross Lingual Lexical Substitution (CLLS) can be regarded as multilingual word sense disambiguation. It substitutes a word, given its context in a source language, with a suitable word in a target language, so the task is closely related to Machine Translation (MT). However, many machine translation systems fail to do this task correctly, mainly due to insufficient parallel data covering the different word senses with their contexts. We can see how the context surrounding a word is a key player in all these tasks: WSD, CLLS and MT. This suggests that the real sense of a word is somehow formulated from its surrounding context, which means that if we can represent a word as a vector in a multidimensional space as a function of its context, then words appearing in similar contexts will be related. In this chapter, we examine how to employ such vectorized representations for words to solve CLLS.
5 This work is published in [48].
4.2. Cross Lingual Lexical Substitution Task
The problem of CLLS is described in SemEval-2010 Task 2 [49], where given an English word in a context, it is required to find an alternative Spanish word or phrase to substitute for this English word in its context. English words (headwords) were collected such that each word has instances; each instance expresses a certain sense of the word through a context. The instances are not necessarily distinct, which means that they can share translations. Four native Spanish-speaking annotators manually performed the cross lingual lexical substitution for the collected dataset. Each annotator examined each headword and, for each instance, supplied as many translations as possible. Afterwards, for each instance, all Spanish words supplied by the annotators were pooled together, keeping track of the frequency of each translation, so that the most frequent translation given by the annotators for any instance is most likely the correct one. The dataset is divided into a test set and a development set. The test set has 100 English words, each with 10 instances; the development set has 30 English words with 10 instances each.
4.3. Related Work
Here we give a brief description of the competing systems [49]. Two baseline systems were introduced: the first is dictionary based (DICT) and the second is both dictionary and corpus based (DICTCORP). The dictionary used is an online Spanish-English dictionary and the corpus is the Spanish Wikipedia. DICT retrieves all the Spanish translations for the English headwords and uses the first translation provided by the online dictionary as the Spanish word for all the English instances. DICTCORP sorts the retrieved Spanish translations by their frequency of occurrence in the Spanish Wikipedia corpus. The UvT systems [53] build a word expert for the target words using k nearest neighbors; the correct translation is then selected using GIZA word alignments from the Europarl parallel corpus. WLVusp [54] uses the open machine translation framework Moses to obtain the N-best translation list for the instances, then uses an English-Spanish dictionary as a filter to pick the correct translation. UBA-T and UBA-W [50] work in two steps. The first is candidate collection, retrieving and ranking candidate translations from Google dictionary, SpanishDict.com and Babylon; UBA-T then uses Google to translate instances to Spanish, while UBA-W uses a parallel corpus automatically constructed from DBpedia. The second step is candidate selection, performed by several heuristics that use tokenization, part of speech tags, lemmatization, Spanish WordNet and Spanish translations. SWAT-E and SWAT-S [52] use a lexical substitution framework: the SWAT-E system first performs lexical substitution in English and then translates the substitutions into Spanish, while SWAT-S translates the source sentences into Spanish, identifies the Spanish word corresponding to the target word, and then performs lexical substitution in Spanish.
4.4. System Description
The main idea behind our approach is to make use of the word-to-word relations in vector space to disambiguate between different senses using the context. For example, the word "bank" has two senses; the first is the financial institution (labeled as a sense by "money") and the second is "riverside" (Table 12). Given a context for each sense, it is required to map each context to the correct sense.

Table 12: Two senses for the word "bank" with contexts.

Sense        Context
money        Context1: He cashed a check at the bank to pay his loan.
riverside    Context2: He sat by the bank of the river to watch the fish in water currents.
By examining the contexts, we notice that the words "check", "cashed" and "loan" are more strongly related to "money" than to "riverside". Likewise, the words "river", "water" and "fish" show a stronger relationship with "riverside" than with "finance". This relationship can be measured using a similarity function that maps a pair of word vectors to a real number: F(v_1, v_2) → ℝ. This mapping function (similarity measure) can be cosine similarity, Euclidean distance, Manhattan distance, or any other similarity measure. Given a list of sense representative words, e.g. "money" and "riverside", and a list of contexts, it is required to map each sense to its correct context. We propose two scoring functions that assign a score to a sense/context pair, H(sense, context) → ℝ. The first scoring function is:

H_1(s_i, c_j) = \sum_{\forall w \in c_j} F(v(s_i), v(w))    (39)

where the operator v(w) takes a word w and returns its vector representation, sense_i is a sense representative word denoted s_i, and context_j is denoted c_j. This function calculates the similarity score for a sense and a context by accumulating the pairwise similarity scores between sense_i and each word in context_j. Table 13 shows the pairwise cosine similarity scores for the senses and contexts in Table 12 using Mikolov's skip-gram word vector representations for English6. Using these scores we can apply H_1 to calculate the scores of both senses with both contexts and map each sense to the context with the highest score, thus assigning context1 to "money" and context2 to "riverside".
6 https://code.google.com/p/word2vec/
The resulting H1 scores are:

H1("money", context1) = 0.686771
H1("riverside", context1) = 0.164374
H1("money", context2) = 0.534379
H1("riverside", context2) = 1.248834
Table 13: Pairwise cosine similarity between contextual words and the two senses "money" and "riverside".
“money”
“riverside”
cashed
0.293907
0.01455
check
0.163442
0.039154
loan
0.229422
0.11067
river
0.161156
0.595411
water
0.245526
0.287399
currents
0.01496
0.094783
fish
0.112737
0.271241
Another idea is to treat word vectors as semantic layers, such that a context can be regarded as a concept formed by its individual words; each word contributes to the formation of this concept by a certain increment (a semantic layer). Combining those layers should give an abstract approximation of the concept. A simple combination of semantic layers is adding the vector representations of the words together:
H_2(s_i, c_j) = F\left( v(s_i), \sum_{\forall w \in c_j} v(w) \right)    (40)
Applying the semantic layers idea to both contexts and calculating the similarity between the approximated concepts and the senses will again assign context1 to "money" and context2 to "riverside":
H2("money", context1) = 0.334804
H2("riverside", context1) = 0.081559
H2("money", context2) = 0.168080
H2("riverside", context2) = 0.413259

The rest of this section discusses how to apply the ideas presented in this example to the CLLS problem. We divide our technique into three steps: first, data collection and preparation; second, building the vector space models; and finally, evaluating the models and comparing them to other techniques.
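A sketch of the two scoring functions H1 (equation 39) and H2 (equation 40) over a dictionary of word vectors; the toy vectors stand in for the trained models:

```python
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def h1(sense, context, wv):
    """Equation (39): sum of pairwise similarities between the sense word and context words."""
    return sum(cosine(wv[sense], wv[w]) for w in context if w in wv)

def h2(sense, context, wv):
    """Equation (40): similarity between the sense word and the added-up context vector."""
    layers = np.sum([wv[w] for w in context if w in wv], axis=0)
    return cosine(wv[sense], layers)

def best_sense(senses, context, wv, scorer=h1):
    return max(senses, key=lambda s: scorer(s, context, wv))

# toy vectors; real contexts would be scored against the senses "money" and "riverside"
wv = {w: np.random.randn(50) for w in ["money", "riverside", "check", "loan", "river", "fish"]}
print(best_sense(["money", "riverside"], ["check", "loan"], wv))
```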
4.4.1. Building Word Vector Representations for the Target Language (Spanish)
Using the three models discussed (CBOW, SKIP-G, and GloVe), we build word representations in vector space for Spanish. To train the models, we collect raw Spanish text from the following sources, made available by the open parallel corpus7 and Wikipedia:
MultiUN [33]; a collection of translated documents from the United Nations.
OpenSubtitles8 2011, 2012, and 2013; a collection of movie subtitles.
EUbookshop [51]; a corpus of documents from the EU bookshop.
Europarl3 [36]; a parallel corpus extracted from the European Parliament web site.
Europarl [32]; an improved parallel corpus extracted from the European Parliament web site.
EMEA [36]; a parallel corpus made out of PDF documents from the European Medicines Agency.
ECB [36]; documentation from the European Central Bank.
Tatoeba [36]; a collection of translated sentences from Tatoeba.
OpenOffice [36]; a collection of documents from openoffice.org.
PHP [36]; a parallel corpus originally extracted from http://se.php.net/downloaddocs.php
EUconst [36]; a parallel corpus collected from the European Constitution.
Spanish Wikipedia dump.
We compile all these sources together and clean them of noisy characters and tags. The vocabulary size of the compiled corpus is 1.18 million words and the total number of words is 2.7 billion. Next, we train the models9 with the window of context set to 5 and 10. We refer to this window parameter later on as the model window (MWINDOW).
7 http://opus.lingfil.uu.se/
8 http://www.opensubtitles.org/
9 Models are available at: https://sites.google.com/site/mohazahran/data
4.4.2. Spanish Translation
Google Translate is used to retrieve all possible Spanish translations for each headword, sorted by frequency, and to translate all instances (contexts) to Spanish. The CLLS problem is now transformed into a mapping problem: mapping between the possible headword translations (acting as sense representative words) and the instance translations (acting as contexts), as shown in Table 14.

Table 14: A sample of the English-Spanish CLLS dataset.

Instance: side.n 301
Headword translations: lado, cara, costado, parte, banda, equipo, aspect, orilla, borde, ladera
Instance (context): On Sunday at Craven Cottage, Jose Mourinho and his all stars exhibited all of the above symptoms and they were made to pay the price by a Fulham side that had in previous weeks woken up after matches with their heads kicked in.
Instance (context) translation: El domingo en Craven Cottage, José Mourinho y sus todas las estrellas exhibido todos los síntomas anteriores y estaban obligados a pagar el precio por un lado Fulham que tenía en las semanas anteriores despertado después de los partidos con la cabeza patada en.

Instance: side.n 302
Headword translations: lado, cara, costado, parte, banda, equipo, aspect, orilla, borde, ladera
Instance (context): On our side: provide more aid, untied to trade; write off debt; help with good governance and infrastructure; training to the soldiers, with UN blessing, in conflict resolution; encouraging investment; and access to our markets so that we practise the free trade we are so fond of preaching.
Instance (context) translation: Por nuestra parte: proporcionar más ayuda, no vinculada al comercio; condonar la deuda; ayudar con el buen gobierno y la infraestructura; entrenamiento para los soldados, con la bendición de las Naciones Unidas, en la resolución de conflictos; fomento de la inversión; y el acceso a los mercados para que practiquemos el libre comercio somos tan aficionados a la predicación.

4.4.3. Mapping Algorithm
The Spanish translations of the instances are cleaned of stop words and noisy characters, and then we introduce a few parameters to control the mapping algorithm:
Similarity measure between two vectors (SIM): cosine similarity, Euclidean distance or Manhattan distance.
Vector normalization (NORM): whether to normalize the word vectors before applying a similarity measure.
The number of output choices per instance (MAXO): the CLLS task allows systems to output more than one suggestion per instance; this parameter supplies a specific number of translation choices.
The number of headword translations to consider (MAXTRNS): each headword has more than one Spanish translation sorted by frequency. This parameter limits the number of translations to consider in order to ignore infrequent translations.
Minimum score threshold (MINSIM): the algorithm refuses to assign a context to a headword translation if their similarity score is below this threshold.
The window size around the headword (HWINDOW): to limit the words considered in a context, we use a window around the headword translation, so that only words in the range [p - w : p + w] are considered contextual words, where p is the position of the headword translation and w is the window size. The intuition behind this parameter is to put the problem into a setting similar to the one used to train our models.
Semantic layers (SEMLAYER): whether to use semantic layers (choose between H1 and H2).
Vector averaging (AVG): in case SEMLAYER is used, we may combine vectors by taking their average instead of their mere addition.
Using a certain configuration of these parameters, we can transform the CLLS problem into a mapping task between sense-representative words (headword translations) and the translated contexts. It is worth noting that we remove the headword translation itself from all translated instances, because machine translation often fails to pick the headword translation that matches the context, and keeping this possibly erroneous translation could confuse our matching algorithm.
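A minimal sketch of this mapping step is given below, assuming an in-memory dictionary from word to numpy vector, cosine similarity as SIM, and additive context vectors; the function names are ours, and the semantic-layer (SEMLAYER/AVG) handling is omitted for brevity.

```python
# Minimal sketch of the mapping step, assuming cosine similarity,
# additive context vectors, and an in-memory {word: vector} dictionary.
# Function and variable names are illustrative, not from the thesis code.
import numpy as np


def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))


def context_vector(tokens, position, vectors, hwindow):
    """Add the vectors of the words within HWINDOW of the headword position."""
    lo, hi = max(0, position - hwindow), min(len(tokens), position + hwindow + 1)
    window = [vectors[t] for i, t in enumerate(tokens[lo:hi], start=lo)
              if i != position and t in vectors]
    return np.sum(window, axis=0) if window else None


def map_instance(tokens, position, translations, vectors,
                 hwindow=5, minsim=0.0, maxo=1):
    """Rank the headword translations against one translated instance."""
    ctx = context_vector(tokens, position, vectors, hwindow)
    if ctx is None:
        return []
    scored = [(t, cosine(vectors[t], ctx)) for t in translations if t in vectors]
    scored = [(t, s) for t, s in scored if s >= minsim]     # MINSIM threshold
    scored.sort(key=lambda ts: ts[1], reverse=True)
    return scored[:maxo]  # up to MAXO translation choices for this instance
```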
4.5. Scoring

Two scoring metrics were used to score the systems competing in this task: "best" and "out-of-ten (oot)" [49]. Since systems are allowed to supply more than one translation per instance, it is required to give credit to the correct translations, to give higher scores to the translations picked by most annotators, and to penalize the wrong ones, taking into account the number of supplied translations. Let item i belong to the set of instances I of a headword, let T_i be the set of gold translations supplied by the annotators for i, and let S_i be the set of translations supplied by the system. Then the best score for i is

best\_score(i) = \frac{\sum_{s \in S_i} \mathrm{frequency}(s \in T_i)}{|S_i| \cdot |T_i|} \quad (41)

Precision is calculated by adding the scores and dividing by the number of items attempted by the system, thus penalizing systems that increase the number of supplied translations. Recall, on the other hand, divides the sum of the scores over all items i by |I|:

best\_precision = \frac{\sum_i best\_score(i)}{|\{i \in I : \mathrm{defined}(S_i)\}|} \quad (42)

best\_recall = \frac{\sum_i best\_score(i)}{|I|} \quad (43)
The oot metric allows the systems to supply up to ten translations per item, and it does not penalize the system for the number of supplied translations:

oot\_score(i) = \frac{\sum_{s \in S_i} \mathrm{frequency}(s \in T_i)}{|T_i|} \quad (44)

oot\_precision = \frac{\sum_i oot\_score(i)}{|\{i \in I : \mathrm{defined}(S_i)\}|} \quad (45)

oot\_recall = \frac{\sum_i oot\_score(i)}{|I|} \quad (46)
According to these metrics, the theoretical upper bound¹⁰, if all items are attempted and only one translation is supplied, is best_up = 40.57 and oot_up = 405.78.
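The sketch below implements the scores defined in equations (41) to (46). Here |T_i| is read as the total number of annotator responses for instance i (each gold translation weighted by its frequency), and the dictionary-based data layout is an assumption made for illustration; the thesis tables report these values multiplied by 100.

```python
# Sketch of the "best" and "oot" metrics from equations (41)-(46).
# gold:   {instance_id: {translation: annotator_frequency}}   (assumed layout)
# system: {instance_id: [supplied translations]}              (assumed layout)

def best_score(gold_i, supplied):
    freq = sum(gold_i.get(s, 0) for s in supplied)
    # |T_i| taken as total annotator responses (multiset size)
    return freq / (len(supplied) * sum(gold_i.values()))      # eq. (41)

def oot_score(gold_i, supplied):
    freq = sum(gold_i.get(s, 0) for s in supplied)
    return freq / sum(gold_i.values())                        # eq. (44)

def precision_recall(gold, system, score_fn):
    scores = {i: score_fn(gold[i], s) for i, s in system.items() if s}
    precision = sum(scores.values()) / len(scores)            # eqs. (42)/(45)
    recall = sum(scores.values()) / len(gold)                 # eqs. (43)/(46)
    return precision, recall
```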
4.6. Results and Evaluation

We employed our models in the CLLS task using the configuration parameters in Table 15, and compared our results, using the "best" and "oot" measures, to the other systems that competed in the task [49] (Figure 16 and Figure 17). Examining the results, we notice that our systems outperform the state-of-the-art system in the "oot" measure while keeping very reasonable performance in the "best" measure. Considering the scores of the other systems, we notice that systems performing well in one measure usually perform poorly in the other. For example, the "UBA-T" system is ranked first in the "best" measure but tenth in the "oot" measure; the same happens with "SWAT-E", which is ranked first in "oot" but eighth in "best". Our systems, on the other hand, achieve a considerable balance between the two measures.
¹⁰ The upper bound for both best and oot is multiplied by 100.
Table 15: Parameter values used by our systems.

Parameter | CBOW      | SKIP-G    | GloVe
SIM       | Euclidean | Manhattan | Manhattan
NORM      | False     | True      | False
MAXO      | 1         | 1         | 1
MINSIM    | 0         | 0         | 0
SEMLAYER  | True      | True      | True
MAXTRNS   | 2         | 2         | 2
AVG       | False     | False     | False
HWINDOW   | 5         | 5         | ALL
MWINDOW   | 5         | 10        | 10
We examine the effect of using all possible translations provided by Google Translate for the headwords, including the infrequent ones (Figures 18, 19 and 20). We notice that using only one translation achieves the highest scores. This is also shown by the baseline (DICT), which achieves remarkably good scores in the "best" measure. This naïve baseline picks the first translation of the headword and assigns it to all instances, which suggests that the SemEval task 2 dataset has a problem: it contains 100 English headwords with 10 instances each. Ideally, these 10 instances per headword should represent distinct senses, so that they do not share translations. This is hard to achieve under the restriction of having exactly 10 instances per headword, because not all English words show such fine-grained polysemous behavior; this results in overlapping correct translations between instances and enables a single translation for all instances to perform well.
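For reference, the DICT baseline described above reduces to the following trivial rule; the function name and data layout are ours, used only for illustration.

```python
# The naive DICT baseline: assign the most frequent (first) translation of
# each headword to all of its instances. Name/layout are illustrative.
def dict_baseline(headword_translations, instance_ids):
    """headword_translations is sorted by frequency; use the first for all."""
    first = headword_translations[0]
    return {instance_id: [first] for instance_id in instance_ids}
```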
Figure 16: The ‘best’ scores (precision P and recall R) for the systems participating in the CLLS SemEval-2010 task 2.
Figure 17: The ‘oot’ scores (precision P and recall R) for the systems participating in the CLLS SemEval-2010 task 2.

One challenge that CLLS imposes on our technique is the need to obtain a list of headword translations that ideally covers all the gold (reference) translations supplied by the annotators. A limited or incomplete list can cause some instances to receive false translations. Figures 18, 19 and 20 show the performance of our systems with and without using all the possible gold translations for all instances as possible headword translations, thus removing the factor of wrong or incomplete headword translations from the evaluation of our systems. Comparing the scores with and without gold translations: for MAXTRNS = 1, 2, and 3, the scores obtained using the gold translations are better; however, at MAXTRNS = ALL, not using the gold translations is better, due to the presence of collocations in the gold translations that are not supported by our systems.
Figure 18: Effect of changing MAXTRNS (1, 2, 3, ALL), with and without gold translations, on CBOW. Plotted series: F1score, Mode F1score, F1score_gold, Mode F1score_gold.
Figure 19: Effect of changing MAXTRNS (1, 2, 3, ALL), with and without gold translations, on GloVe. Plotted series: F1score, Mode F1score, F1score_gold, Mode F1score_gold.
Figure 20: Effect of changing MAXTRNS (1, 2, 3, ALL), with and without gold translations, on SKIP-GRAM. Plotted series: F1score, Mode F1score, F1score_gold, Mode F1score_gold.
4.7. Cross Lingual Lexical Substitution for Arabic

In this section, we discuss the problem of mapping the Arabic entries in Arabase to WordNet. Given an Arabic word and a representative sentence (a gloss in our case), it is required to find an English word that can substitute for the Arabic word in its context; this falls into the category of the CLLS problem solved in the previous sections using word vectors. We manually build a benchmark dataset for Arabic-English following the style and format of the CLLS task presented in the previous sections [49]. We manually selected 65 Arabic headwords with different senses, totaling 138 senses (instances), each with its gloss from Arabase, and manually selected the corresponding English substitution words (Table 16). We follow the exact same steps explained in the previous sections: first, build an English vector space; second, translate the Arabic instances to English and get the possible Arabic headword translations; finally, plug in the mapping algorithm and report the results.
Table 16: A sample of the Arabic-English CLLS dataset.

Id 4 - Arabic headword: عين
Arabic instance: عضو الإبصار للإنسان وغيره من الحيوان
English instance translation: Member vision of humans and other animals.
English headword translations: eye, optic, peeper
Reference translation: eye

Id 5 - Arabic headword: عين
Arabic instance: ينبوع الماء ينبع من الأرض ويجري
English instance translation: Water fountain springs from the earth and being.
Reference translation: spring, well

Id 6 - Arabic headword: عين
Arabic instance: الجاسوس
English instance translation: Spy
English headword translations: spy, stool pigeon
Reference translation: spy

Id 7 - Arabic headword: عين
Arabic instance: اسند اليه وظيفة يعمل ويشتغل بها
English instance translation: Assigned to him and the function works and work out
English headword translations: assistance, specify, appoint, assign
Reference translation: hire

Id 8 - Arabic headword: شعر
Arabic instance: زوائد خيطية تظهر على جلد الإنسان وغيره من الثدييات
English instance translation: Filamentous growths appear on the skin of the human and other mammals
English headword translations: hair, felt
Reference translation: hair

Id 9 - Arabic headword: شعر
Arabic instance: كلام موزون مقفى يعتمد على التحييل والتأثير وما علمناه الشعر
English instance translation: Weighted words rhymed depends on Althial and influence what we
English headword translations: poetry, verse, rhyme
Reference translation: poetry, verse
We use the "best" and "oot" measures to compare different techniques: the English skip-gram model [41] and the English GloVe model [42]. We also used WordNet as a technique to measure the similarity between words using its ontology tree: we can assess the similarity between two words in WordNet by noticing how far away their common parent in the tree is. The closer the common parent (in terms of number of tree levels), the more related the children are. This metric is used in one experiment instead of word vectors to measure similarity. We used the same baseline as SemEval task 2 (DICT), which uses the most common translation as the English substitution for all instances. Table 17 shows the parameter values for our systems in the Arabic-English CLLS.
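A minimal sketch of this kind of WordNet tree-based similarity, using NLTK's WordNet interface, is shown below; the thesis does not name the exact measure, so the Wu-Palmer similarity (based on the depth of the two synsets' lowest common ancestor) stands in as one common realization of the idea.

```python
# Sketch of a WordNet tree-based similarity between two English words.
# Uses NLTK's WordNet interface; the Wu-Palmer measure scores word pairs by
# how deep their lowest common ancestor (common parent) sits in the tree.
# This is one possible realization; the thesis does not name the exact measure.
from nltk.corpus import wordnet as wn


def wordnet_similarity(word1, word2):
    """Return the best Wu-Palmer similarity over all synset pairs, or 0.0."""
    best = 0.0
    for s1 in wn.synsets(word1):
        for s2 in wn.synsets(word2):
            score = s1.wup_similarity(s2)
            if score is not None and score > best:
                best = score
    return best


# Example: "eye" should relate more closely to "peeper" than to "spy".
print(wordnet_similarity("eye", "peeper"), wordnet_similarity("eye", "spy"))
```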
Table 17: Parameter values used by our systems for the Arabic CLLS.

Parameter | WordNet | SKIP-G | GloVe
SIM       | WordNet | Cosine | Manhattan
NORM      | True    | True   | False
MAXO      | 1       | 1      | 1
MINSIM    | 0.0     | 0.2    | 0.0
SEMLAYER  | N/A     | True   | True
MAXTRNS   | ALL     | ALL    | ALL
AVG       | False   | True   | False
HWINDOW   | ALL     | ALL    | ALL
MWINDOW   | N/A     | 10     | 10
Figure 21: Comparison of the "best" scores (precision P and recall R) of SKIP-G, GloVe and WordNet, with and without using gold translations, in the Arabic CLLS.
Figure 22: Comparison of the "oot" scores (precision P and recall R) of SKIP-G, GloVe and WordNet, with and without using gold translations, in the Arabic CLLS.
One challenge that CLLS imposes on our technique is the need to obtain a list of headword translations that ideally covers all the gold (reference) translations supplied by the annotators. A limited or incomplete list can cause some instances to receive false translations, so we include experiments that use the gold (reference) translations, thus removing the factor of wrong or incomplete headword translations from the evaluation of our systems. Figure 21 and Figure 22 show the "best" and "oot" scores of our SKIP-G, GloVe and WordNet systems for the Arabic-English CLLS. The theoretical upper bound¹¹, if all items are attempted and only one translation is supplied, is best_up = 73.42 and oot_up = 734.1. Our experiments show that our systems perform remarkably well given a correct and complete translation list.
¹¹ The upper bound for both best and oot is multiplied by 100.
Chapter 5: Conclusion and Future Work

In this thesis we covered the spectrum of building language resources, from conventional human-driven techniques to contemporary statistically based techniques. We compared different Arabic resources, examining their points of strength and weakness, and then presented a framework that can be used to compile pieces of Arabic language information scattered across these resources into a single resource. We showed the trade-off between fully automated and manual methods in the integration task: full automation significantly decreases the human effort, saving time and man-power, at the expense of the accuracy and consistency of the resulting resource, while a compromise between the two methods can yield acceptable accuracy and consistency with minimal human effort.

Next, we compared different models for building continuous representations in vector space for Arabic and tested these models via intrinsic and extrinsic evaluations. In the intrinsic evaluation, we used the analogy task to test the vectors' capability to capture semantic and syntactic properties of Arabic. In the extrinsic evaluations, we employed the vectors in two NLP applications: query expansion for information retrieval and short answer grading. For query expansion, the Arabic vectors enhanced the retrieval process slightly better than other semantic expansion techniques; for short answer grading, the Arabic vectors made it possible to grade short answers on an Arabic dataset without the need for Arabic-English translation, with high correlation with the reference grades. We also built a neural network to map Arabic vectors to English vectors and showed that optimizing for the cosine error outperforms the standard mean square error optimization for word-to-word similarity measured with the cosine score, which means that the objective function of the training procedure should match the similarity measure used. Using the proposed neural network, we achieved results for the short answer grading task comparable to those obtained with human translations.

Many extensions can follow from the work presented here, starting with increasing the raw Arabic data used to train the vectorized representations. It would also be useful to have an analogy test built specifically for Arabic rather than a manual translation of the English test. We showed a simple technique for word sense disambiguation in the context of query expansion using the Arabic vectors, yet more sophisticated techniques can be developed. Using deep neural networks, it may be possible to learn complex relations that map the Arabic and English vector spaces more accurately.

Finally, we presented a novel technique to solve the CLLS problem that outperformed the state-of-the-art system in the "oot" measure. We introduced the idea of semantic layers using word representations in vector space and showed how effective they can be at capturing a concept expressed in a context. We believe our technique explores new ground in the field of semantic and linguistic computation; it is fast and simple and minimizes the language-dependent requirements, which means it is easily applicable to new languages. We would like to address some of the limitations as future work. Since translation is a key player in our approach, it would be useful to rely on different sources of translations to ensure the depth and quality of the translations, and to find ways to rank these translations. Moreover, handling collocations is essential, as collocations are commonly used in many languages. Finally, increasing the raw data will help build better word vector representations.
References

1. Raafat, H., Zahran, M., & Rashwan, M. (2013). Arabase - A Database Combining Different Arabic Resources with Lexical and Semantic Information. In KDIR/KMIS (pp. 233-240).
2. alkhalil dot net, 2011. KACST. Available at: http://sourceforge.net/projects/alkhalildotnet/. [Accessed 23 June 2013].
3. almuajam, 2011. Arabic Interactive Dictionary Project. Available at: http://sourceforge.net/projects/almuajam/. [Accessed 23 June 2013].
4. Arabic Stop Words, 2010. Available at: http://arabicstopwords.sourceforge.net/. [Accessed 23 June 2013].
5. Arabic WordNet, 2007. A multi-lingual concept dictionary. Available at: http://awnbrowser.sourceforge.net/. [Accessed 23 June 2013].
6. Arramooz AlWaseet: Arabic dictionary for morphology. Available at: http://arramooz.sourceforge.net/. [Accessed 23 June 2013].
7. Attia, M., Rashwan, M. A., & Al-Badrashiny, M. A. S. A. A. (2009). Fassieh, a semi-automatic visual interactive tool for morphological, PoS-Tags, phonetic, and semantic annotation of Arabic Text Corpora. Audio, Speech, and Language Processing, IEEE Transactions on, 17(5), 916-925.
8. Attia, M., Rashwan, M., Ragheb, A., Al-Badrashiny, M., Al-Basoumy, H., & Abdou, S. (2008). A compact Arabic lexical semantics language resource based on the theory of semantic fields. In Advances in Natural Language Processing (pp. 65-76). Springer Berlin Heidelberg.
9. Deerwester, S. C., Dumais, S. T., Landauer, T. K., Furnas, G. W., & Harshman, R. A. (1990). Indexing by latent semantic analysis. JAsIs, 41(6), 391-407.
10. Diab, M. (2004, September). The feasibility of bootstrapping an arabic wordnet leveraging parallel corpora and an english wordnet. In Proceedings of the Arabic Language Technologies and Resources, NEMLAR, Cairo.
11. Diekema, A. R. (2004, August). Preliminary lexical framework for English-Arabic semantic resource construction. In Proceedings of the Workshop on Computational Approaches to Arabic Script-based Languages (pp. 10-14). Association for Computational Linguistics.
12. Elkateb, S., Black, W., Rodríguez, H., Alkhalifa, M., Vossen, P., Pease, A., & Fellbaum, C. (2006). Building a wordnet for arabic. In Proceedings of The fifth international conference on Language Resources and Evaluation (LREC 2006).
13. ksucorpus, 2013. King Saud University Corpus of Classical Arabic. Available at: http://ksucorpus.ksu.edu.sa/ar/. [Accessed 23 June 2013].
14. Lesk, M. (1986, June). Automatic sense disambiguation using machine readable dictionaries: how to tell a pine cone from an ice cream cone. In Proceedings of the 5th annual international conference on Systems documentation (pp. 24-26). ACM.
15. Niles, I., & Pease, A. (2001, October). Towards a standard upper ontology. In Proceedings of the international conference on Formal Ontology in Information Systems - Volume 2001 (pp. 2-9). ACM.
16. Princeton University "About WordNet." 2010. WordNet. Princeton University. Available at: http://wordnet.princeton.edu. [Accessed 23 June 2013].
17. Rehurek, R. & Sojka, P. (2010). Software Framework for Topic Modelling with Large Corpora. In Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks, Valetta, pp. 46-50.
18. Tufis, D., Cristea, D., & Stamou, S. (2004). BalkaNet: Aims, methods, results and perspectives. A general overview. Romanian Journal of Information Science and Technology, 7(1-2), 9-43.
19. Vossen, P. (1998). Introduction to EuroWordNet. In EuroWordNet: A multilingual database with lexical semantic networks (pp. 1-17). Springer Netherlands.
20. Yaseen, M., Attia, M., Maegaard, B., Choukri, K., Paulsson, N., Haamid, S. & Ragheb, A. (2006, May). Building annotated written and spoken Arabic LR's in NEMLAR Project. In Proceedings of LREC.
21. http://www.cs.columbia.edu/~mcollins/loglinear.pdf [Accessed 3 March 2015].
22. Guthrie, D., Allison, B., Liu, W., Guthrie, L., & Wilks, Y. (2006). A closer look at skip-gram modelling. In Proceedings of the 5th international Conference on Language Resources and Evaluation (LREC-2006) (pp. 1-4).
23. Zahran, M. A., Magooda, A., Mahgoub, A. Y., Raafat, H., Rashwan, M., & Atyia, A. (2015). Word Representations in Vector Space and their Applications for Arabic. In the 16th International Conference on Intelligent Text Processing and Computational Linguistics (CICLing), April 14-20, 2015, Cairo, Egypt.
24. Collobert, R., & Weston, J. (2008, July). A unified architecture for natural language processing: Deep neural networks with multitask learning. In Proceedings of the 25th international conference on Machine learning (pp. 160-167). ACM.
25. Mnih, A., & Hinton, G. E. (2009). A scalable hierarchical distributed language model. In Advances in neural information processing systems (pp. 1081-1088).
26. Mikolov, T., Yih, W. T., & Zweig, G. (2013). Linguistic Regularities in Continuous Space Word Representations. In HLT-NAACL (pp. 746-751).
27. Turian, J., Ratinov, L., & Bengio, Y. (2010, July). Word representations: a simple and general method for semi-supervised learning. In Proceedings of the 48th annual meeting of the Association for Computational Linguistics (pp. 384-394). Association for Computational Linguistics.
28. Mikolov, T., Chen, K., Corrado, G., & Dean, J. (2013). Efficient Estimation of Word Representations in Vector Space. In Proceedings of the International Conference on Learning Representations (ICLR) Workshop Track, Arizona, USA. arXiv:1301.3781.
29. Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S., & Dean, J. (2013). Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems (pp. 3111-3119).
30. Pennington, J., Socher, R., & Manning, C. D. (2014). GloVe: Global vectors for word representation. In Proceedings of the Empirical Methods in Natural Language Processing (EMNLP 2014), 12.
31. http://opus.lingfil.uu.se/ [Online; accessed 29-January-2015].
32. Tiedemann, J. (2012, May). Parallel Data, Tools and Interfaces in OPUS. In LREC (pp. 2214-2218).
33. Eisele, A., & Chen, Y. (2010, May). MultiUN: A Multilingual Corpus from United Nation Documents. In LREC.
34. http://www.opensubtitles.org/ [Online; accessed 29-January-2015].
35. http://tanzil.net/download/ [Online; accessed 29-January-2015].
36. Tiedemann, J. (2009). News from OPUS - A collection of multilingual parallel corpora with tools and interfaces. In Recent Advances in Natural Language Processing (Vol. 5, pp. 237-248).
37. https://sites.google.com/site/mouradabbas9/corpora [Online; accessed 29-January-2015].
38. Saad, M. K., & Ashour, W. (2010). OSAC: Open Source Arabic Corpora. In 6th International Symposium on Electrical and Electronics Engineering and Computer Science, Cyprus (pp. 118-123).
39. https://github.com/anastaw/Meedan-Memory [Online; accessed 29-January-2015].
40. http://ksucorpus.ksu.edu.sa/ar/ [Online; accessed 29-January-2015].
41. https://code.google.com/p/word2vec/ [Online; accessed 29-January-2015].
42. http://nlp.stanford.edu/projects/glove/ [Online; accessed 29-January-2015].
43. Mikolov, T., Le, Q. V., & Sutskever, I. (2013). Exploiting similarities among languages for machine translation. arXiv preprint arXiv:1309.4168.
44. Gomaa, W. H., & Fahmy, A. A. (2014). Automatic scoring for answers to Arabic test questions. Computer Speech & Language, 28(4), 833-857.
45. Mahgoub, A. Y., Rashwan, M. A., Raafat, H., Zahran, M. A., & Fayek, M. B. (2014). Semantic Query Expansion for Arabic Information Retrieval. ANLP 2014, 87-92.
46. Oard, D. W., & Gey, F. C. (2002). The TREC 2002 Arabic/English CLIR Track. In TREC.
47. http://sourceforge.net/p/lemur/wiki/Indri/ [Online; accessed 31-January-2015].
48. Zahran, M. A., Raafat, H., & Rashwan, M. (2015). Cross Lingual Lexical Substitution Using Word Representation in Vector Space. In the 28th International FLAIRS Conference, Florida Artificial Intelligence Research Society, Hollywood, Florida, USA.
49. Mihalcea, R.; Sinha, R.; and McCarthy, D. 2010. SemEval-2010 Task 2: Cross-Lingual Lexical Substitution. In Proceedings of the Fifth International Workshop on Semantic Evaluation, ACL, Uppsala, Sweden. Pages 9-14.
50. Basile, P.; and Semeraro, G. 2010. UBA: Using Automatic Translation and Wikipedia for Cross-Lingual Lexical Substitution. In Proceedings of the Fifth International Workshop on Semantic Evaluation, ACL, Uppsala, Sweden. Pages 242-247.
51. Skadins, R.; Tiedemann, J.; Rozis, R.; and Deksne, D. 2014. Billions of Parallel Words for Free. In Proceedings of the International Conference on Language Resources and Evaluation (LREC), Reykjavik, Iceland. Pages 1850-1855.
52. Wicentowski, R.; Kelly, M.; and Lee, R. 2010. SWAT: Cross-lingual lexical substitution using local context matching, bilingual dictionaries and machine translation. In Proceedings of the Fifth International Workshop on Semantic Evaluation. Association for Computational Linguistics. Pages 123-128.
53. Gompel, M. 2010. UvT-WSD1: a Cross-Lingual Word Sense Disambiguation system. In Proceedings of the Fifth International Workshop on Semantic Evaluation. Association for Computational Linguistics. Pages 238-241.
54. Aziz, W.; Specia, L. 2010. USPwlv and WLVusp: Combining Dictionaries and Contextual Information for Cross-Lingual Lexical Substitution. In Proceedings of the Fifth International Workshop on Semantic Evaluation. Association for Computational Linguistics. Pages 117-122.
Appendix A
Figure 23: A generic neural network and the connections between two successive layers.

The objective function is to maximize the cosine similarity between the predicted vector y and the reference vector d. This is equivalent to:

\mathrm{Minimize}\; E = 1 - \cos(y, d) = 1 - \frac{y \cdot d}{|y|\,|d|}
Let the activation function be y_i^z = f(x_i^z), where the superscript denotes the layer number and the subscript denotes the input number. The notation l(z) refers to the number of neurons in layer z. The derivative of the error function E with respect to the weights at layer z for a training sample m is:

\frac{\partial E}{\partial w_{IJ}^{z}} = \frac{\partial E}{\partial y_{I}^{z+1}} \times \frac{\partial y_{I}^{z+1}}{\partial x_{I}^{z+1}} \times \frac{\partial x_{I}^{z+1}}{\partial w_{IJ}^{z}} = \delta_{I}^{z+1} \times \frac{\partial x_{I}^{z+1}}{\partial w_{IJ}^{z}} \quad (1)

where

\delta_{I}^{z+1} = \frac{\partial E}{\partial y_{I}^{z+1}} \times \frac{\partial y_{I}^{z+1}}{\partial x_{I}^{z+1}} \quad (2)

\frac{\partial E}{\partial y_{I}^{z+1}} = \frac{(y \cdot d)\, y_{I}^{z+1} - d_{I}^{z+1}\, |y|^2}{|d|\,|y|^3} \quad (3)

With y_{I}^{z+1} = f(x_{I}^{z+1}) and f(x) = 1.7159 \tanh(\tfrac{2}{3} x):

\frac{\partial y_{I}^{z+1}}{\partial x_{I}^{z+1}} = f'(x_{I}^{z+1}) = 1.7159 \times \tfrac{2}{3} \left(1 - \left(\frac{y_{I}^{z+1}}{1.7159}\right)^2\right) \quad (4)

x_{I}^{z+1} = \sum_{j=1}^{l(z)} w_{Ij}^{z}\, y_{j}^{z} \;\Rightarrow\; \frac{\partial x_{I}^{z+1}}{\partial w_{IJ}^{z}} = y_{J}^{z} \quad (5)

Finally, \frac{\partial E}{\partial w_{IJ}^{z}} is calculated by substituting (2), (3), (4) and (5) into (1).
Now we prove that, for a single training sample, optimizing the cosine error is equivalent to optimizing half the square error (SE) if both the reference vector d and the predicted vector y are normalized, i.e. |y| = 1 and |d| = 1:

E_{COS} = 1 - \cos(y, d) = 1 - y \cdot d

E_{SE} = \sum_{i=1}^{K} (d_i - y_i)^2 = (d - y) \cdot (d - y) = d \cdot d - d \cdot y - y \cdot d + y \cdot y = |d|^2 - 2\, d \cdot y + |y|^2 = 2 - 2 \cos(y, d)

\tfrac{1}{2} E_{SE} = 1 - \cos(y, d) = E_{COS}
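As a quick numerical check of this equivalence (an illustration, not part of the thesis derivation), the snippet below compares the two losses on a pair of random unit-normalized vectors.

```python
# Numerical check that, for unit-normalized vectors, the squared error equals
# twice the cosine error: E_SE = 2 * E_COS. Illustrative only.
import numpy as np

rng = np.random.default_rng(0)
y = rng.normal(size=300)
d = rng.normal(size=300)
y /= np.linalg.norm(y)   # |y| = 1
d /= np.linalg.norm(d)   # |d| = 1

e_cos = 1.0 - float(np.dot(y, d))        # E_COS = 1 - cos(y, d)
e_se = float(np.sum((d - y) ** 2))       # E_SE = sum_i (d_i - y_i)^2

print(e_se, 2 * e_cos)                   # the two values should match
```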
Glossary¹²

Artificial Neural Network (ANN): In artificial intelligence and machine learning, a mathematical model that mimics the work of brain neurons. Each neuron is considered a processing unit; connecting these neurons together can approximate complex mathematical functions.

Computational Linguistics (CL): The use of computers to understand the rules of lexicology, which can be useful in building lexicons in both the theoretical and the practical aspects.

Dictionary: In lexicology, a reference book containing the terms of one language or more, illustrating the terms' derivatives, meanings, and usage. Terms are sorted by topic. Dictionaries can be domain-specific or generic.

Euclidean Distance: In mathematics and linear algebra, the length of the line segment joining two points in a multidimensional space.

Human Language Technologies: In artificial intelligence and computational linguistics, a group of applications used in building and developing natural language processing tools. They can be classified into natural language processing, speech processing and image processing tools.

Information Retrieval: In computer science, the search for relevant information in any data structure. Applications of information retrieval include search engines and library search tools.

Language Resources: A group of resources and tools used in carrying out lexical analysis or taking part in any natural language processing application in general. They include corpora, lexicons, databases, dictionaries and any aiding tools.

Language Modeling: In natural language processing, building a mathematical/statistical model of the probability distribution of words (unigrams, bigrams, ...). Language modeling is used in many NLP applications such as language identification, machine translation and information retrieval.

Latent Semantic Indexing (LSI): In information retrieval, latent semantic indexing (sometimes called latent semantic analysis) is used for indexing and retrieval of data by statistical analysis of texts. The basic idea is that, using raw texts, LSI tries to find the most important words related to latent concepts or topics inferred from the texts using mathematical operations.

Lexical Database: A database enriched with lexical information that may include lexical categories and synonyms of words, as well as semantic relations between different words or sets of words. Lexical databases can be open source, allowing information to be edited, added or deleted, or can be closed for commercial purposes.

Lesk: In computer science and natural language processing, an algorithm that assigns a similarity score to two sentences by counting the number of overlapping words between them.

Linear Algebra: In mathematics, a branch of algebra concerned with studying vectors, matrices, vector spaces and linear mappings between spaces.

Machine Learning: A field of computer science, and particularly artificial intelligence, concerned with the study of algorithms that enable computers to learn patterns and rules from data examples. Machine learning uses solved data examples (training data) to build a model that should be good enough to generalize to unseen data examples. It is closely related to computational statistics and mathematical optimization.

Matrix: In linear algebra, a mathematical construct of a group of numerical values or variables arranged in a rectangular two-dimensional array.

Monolingual Corpora: Collections of coherent texts of the same language; they can be used in building language models.

Morphology: The study and analysis of word forms and of the bits and pieces forming up a word, such as prefixes and suffixes and their functions. It also studies word inflections, word patterns, singular and plural forms, masculine and feminine forms, stems and roots.

Natural Language Processing: A field of computer science and artificial intelligence concerned with the interactions between humans and computers, which raises many challenges such as machine understanding of human languages.

Ontology: The philosophical study of the nature of beings, reality, entities and their categories and relations. In the NLP context, ontologies are often viewed as a taxonomy of words into semantic fields having relations with each other.

Opinion Mining: In natural language processing, the extraction and summarization of the overall polarity of written opinions, sentiments and emotions of the writer with respect to a certain topic.

Parallel Corpora: A group of texts in two languages or more, with one language called the source language, from which the texts were originally collected; the other languages are called the target languages, which are translations of the source language texts. Usually parallel corpora are presented in separate files, such that each file is a translation of the source texts in one language.

Part of Speech Tagging: In natural language processing, the process of labeling each word in a text with its corresponding part of speech (noun, verb, adjective, adverb, ...).

Semantics: The study of meanings; in natural language processing it refers to the study of word meanings (nouns, verbs, ...) in their respective contexts.

Statistical Machine Translation: In machine translation, statistical translation is the most commonly used approach. The idea is based on building a system that can learn word/phrase translations from one language to another using a training parallel corpus.

Vector: In linear algebra, a mathematical construct that groups a set of numerical values or variables in a single-dimensional array.

Word Sense Disambiguation: In natural language processing and computational linguistics, the identification of the correct sense (meaning) of a word in a context. For example, "bank" can be a financial institution or the bank of a river; depending on the context, one of the two senses should be picked.

¹² Introduction to computerizing language, Rashwan et al. (in press).
Abstract

Language resources are an important element in any natural language processing application. However, language resource support for Arabic is weak because the existing Arabic language resources are either unorganized or incomplete. To solve this problem, we first build a refined, unified language resource for Arabic by compiling several available Arabic resources; we then build the largest statistical Arabic language resource and present a new technique to link the statistical Arabic and English resources; finally, using the new statistical methods, we present a new technique to link the unified Arabic resource to its English counterpart by solving the cross-lingual lexical substitution problem, achieving higher accuracy than previously existing methods.

The thesis is therefore divided into three parts. The first part covers the conventional methods of building language resources: we discuss the idea of building a refined, unified Arabic language resource by compiling several available Arabic resources, survey the existing Arabic resources, and then propose automatic and semi-automatic methods for merging these resources into a single rich resource, through which we can build a bridge linking it to the well-known English resource (WordNet).

The second part covers the new trends for building Arabic language resources. Much research discusses representing a word as a vector in space as a function of its textual context (the neighboring words); through this representation we can find semantic and syntactic properties of words and new relations between words. We start this work by comparing several methods for building vector-space representations of words, and then test these models with two kinds of evaluations, intrinsic and extrinsic. The intrinsic evaluation assesses the quality of the models using standard semantic and syntactic tests, while the extrinsic evaluation assesses their quality through their effect on two important applications: information retrieval and short answer grading. Finally, we link the Arabic vector-space model to its English counterpart using a new technique that trains an artificial neural network to optimize the cosine of the angle between the predicted and the reference vectors instead of the mean square error between them; the results show that our method outperforms the older one.

The third part of the thesis covers building a bridge between the conventional Arabic and English resources by solving the problem of words with multiple meanings, which take their meaning from the context in which they occur. We study different models for representing words in vector space and use them to solve the cross-lingual lexical substitution problem, presenting new methods and comparing them with existing ones using two measures; our new methods outperform the existing ones in one of the two measures.