Supervised learning to measure the semantic similarity between Arabic sentences

Wafa Wali 1, Bilel Gargouri 1, Abdelmajid Ben Hamadou 2

1 MIR@CL Laboratory, FSEGS, Sfax, Tunisia
{wafa.wali,bilel.gargouri}@fsegs.rnu.tn
2 MIR@CL Laboratory, ISIMS, Sfax, Tunisia
[email protected]

Abstract. Many methods for measuring the semantic similarity between sentences have been proposed, particularly for English. These methods are considered restrictive as they usually do not take into account semantic and syntactic-semantic knowledge such as the semantic predicate, the thematic role and the semantic class. Measuring the semantic similarity between Arabic sentences is a particularly challenging task because of the complex linguistic structure of the Arabic language and the lack of electronic resources such as syntactic-semantic knowledge bases and annotated corpora. In this paper, we propose a method for measuring the similarity between Arabic sentences based on automatic learning, taking advantage of LMF standardized Arabic dictionaries, notably the syntactic-semantic knowledge they contain. Furthermore, we evaluated our proposal with the cross-validation method using 690 pairs of sentences taken from old Arabic dictionaries designed for human use, such as Al-Wassit and Lissan-Al-Arab. The obtained results are very encouraging and show a good performance that approximates human intuition.

Keywords: sentence similarity, automatic learning, Arabic language, syntactico-semantic knowledge, LMF-ISO 24613 standardized dictionaries.

1 Introduction

Today, people are surrounded by a huge amount of information due to the rapid development of the Internet and its associated technologies. Techniques related to information retrieval, knowledge management, Natural Language Processing (NLP), and so on, are thus becoming increasingly important and are being developed to help people manage and process this information. One of the key problems underlying these fields is sentence similarity, which has a close relationship with psychology and cognitive science. Numerous studies have previously been carried out with the aim of computing sentence similarity. The problem was formally brought to attention and the first solutions were proposed in 2006 with the work reported in [1], which takes into account syntactic information via the word order and semantic information via the semantic similarity of words, using knowledge-based and corpus-based methods. Several methods have adapted the proposition of Li et al. [1] and improved it by adding other features such as the Longest Common Subsequence (LCS) and Word Sense Disambiguation (WSD). Recently, methods such as [2] and [3] have been suggested, which take into consideration, in their computation of sentence similarity, the syntactic dependencies and the semantic similarity between words using WordNet.

Nevertheless, one of the main problems of existing sentence similarity methods is that most of them neglect some elements of semantic knowledge such as the semantic class and the thematic role. Additionally, the syntactic-semantic knowledge that can be extracted from a sentence, and that is highly relevant to the computation of sentence similarity, is ignored. Indeed, these elements of knowledge, notably the semantic predicate, the thematic role and the semantic class, supply a mechanism of interaction between the syntactic processor and the discourse model, provide information about the relationships between words, and play an important role in conveying the meaning of a sentence.

Concerning research on semantic similarity for the Arabic language, existing works focus on word similarity, as in [4]; to the best of our knowledge, there is no sentence similarity measure developed specifically for Arabic, in contrast to English, which has already benefited from extensive research in this field. Some aspects slow down progress in Arabic NLP compared to the accomplishments in English and other European languages, such as the absence of diacritics (which represent most vowels) in written text, which creates ambiguity, as well as agglutination and a complex grammar. In addition to these linguistic issues, there is also a lack of Arabic corpora and lexicons covering syntactic-semantic knowledge, which are essential to any advanced research in this area. In fact, the standardization committee ISO TC37/SC4 has validated the Lexical Markup Framework (LMF) project under the standard ISO 24613 [5], which allows for the encoding of rich linguistic information, including among others morphological, syntactic and semantic aspects, covering several languages. Arabic has taken advantage of this standard, and an LMF standardized Arabic dictionary [6] that incorporates several types of knowledge has been developed within the MIR@CL research team.

The focus of this paper is on the production of the first sentence similarity benchmark dataset for Modern Standard Arabic (MSA), which we will use in evaluating the content of Arabic dictionaries [7]. Indeed, the similarity function is devoted to measuring the degree of similarity between the definitions of lexical entries, or between the examples of one definition, in order to avoid redundancy [8]. In this paper we propose a novel method to compute the similarity between Arabic sentences by taking into account semantic, syntactic and syntactic-semantic knowledge and taking advantage of the LMF standardized Arabic dictionary. The proposal measures the semantic similarity via the synonymy relations between the words of the sentences. The syntactic similarity is measured based on the co-occurrence of dependent structures between two sentences after a dependency parsing. The syntactic-semantic similarity is measured on the basis of the common semantic arguments, associated with the semantic predicate in terms of thematic role and semantic class, between the pair of sentences.

An experiment was carried out on 1380 Arabic sentences taken from various definitions found in old Arabic dictionaries designed for human use, such as Al-Wassit and Lissan-Al-Arab. The experiment is based on supervised learning and shows a good accuracy that approximates human judgment. Our method thus demonstrates the importance of semantic and syntactic-semantic knowledge in the computation of sentence similarity.

The next section presents a brief review of the approaches used to compute sentence similarity. Section 3 presents the main features of the Arabic language. Section 4 gives the details of the proposed method for measuring sentence similarity. Section 5 covers the experiments and the obtained results. Lastly, Section 6 sums up the work, draws some conclusions and announces prospective future work.

2 Related works

During the last decade, several methods for measuring sentence similarity were established based on semantic and/or syntactic knowledge. In this section, we examine some related works, called hybrid methods, which are based on both syntactic and semantic knowledge and share the same goal as our proposal, in order to explore the advantages and limitations of the previous methods.

Stefanescu et al. [3] introduced a method for measuring the semantic similarity between sentences based on the assumption that the meaning of a sentence is captured by its constituents and the dependencies between them. The method considers that every chunk has a different importance with respect to the overall meaning of a sentence, computed according to the information content of the words in the chunk. The disadvantage of this method is that the semantic measurement is isolated from the syntactic measurement: the semantic similarity is calculated based on the semantic similarity of words, while the syntactic dependencies are counted separately to compute the syntactic similarity.

Lee et al. [2] proposed a sentence similarity algorithm that takes advantage of a corpus-based ontology via Wu and Palmer's measure [9] and grammatical rules. Nevertheless, the Wu and Palmer measure presents the following drawback: in some situations, the similarity of two elements of an IS-A ontology located in the same neighborhood exceeds the similarity value of two elements contained in the same hierarchy. This situation is inadequate for measuring sentence similarity.

The Semantic Textual Similarity (STS) task, organized as part of the Semantic Evaluation Exercises (see [10] for a description of STS 2013), provides a common platform for the evaluation of such systems through a comparison with human-annotated similarity scores over a large dataset. The authors introduced a benchmark for measuring the semantic textual similarity between sentences. They computed the similarity between words using knowledge-based similarity, syntactic analysis, Named Entity Recognition, Semantic Role Labeling, string similarity and Word Sense Disambiguation. The main drawback of this method is that it computes the similarity of words from many different features, which is not computationally efficient.

In STS 2012, Saric et al. [11] proposed a hybrid method that derives sentence similarity from semantic information, using WordNet (similarity between words) and a corpus (information content), and from syntactic information based on the syntactic roles, the overlapping syntactic dependencies and the named entities. However, the judgment of similarity is situational and depends on time: the information collected in the corpus may no longer be relevant in the present.

Overall, all of the hybrid methods presented above exploit the information of the sentence insufficiently. Their major disadvantage is that some elements of semantic knowledge, such as the semantic class and the thematic role of the sentence's words, are not considered in calculating sentence similarity. Also, the relationship between the syntactic and the semantic levels, such as the semantic predicate, is not taken into account. We discuss how this knowledge enhances sentence similarity measurement in the following sections.

3 Arabic language background

Arabic is one of the world's major languages. It is the fifth most widely spoken language in the world, with over 250 million speakers, of whom roughly 195 million are first language speakers and 55 million are second language speakers. Arabic is characterized by a complex morphology and a rich vocabulary. It is a derivational and flexional language. Indeed, an Arabic word may be composed of a stem plus affixes (referring to tense, gender and/or number) and clitics (including some prepositions, conjunctions, determiners and pronouns). For instance, the word "الكُتُب", transliterated al-kutubu and meaning "the books", is derived from the stem "كِتَاب", transliterated kitAb and meaning "book", which is in turn derived from the root "كَتَب", transliterated katab and meaning "to write".

Moreover, agglutination is another phenomenon specific to Arabic. In fact, in Arabic, articles, prepositions, pronouns, etc. can be affixed to the adjectives, nouns, verbs and particles to which they are related. The derivational, flexional and agglutinative aspects of the Arabic language raise significant challenges for NLP. Thus, many morphological ambiguities have to be resolved when dealing with the Arabic language. Moreover, many Arabic words are homographic: they have the same orthographic form, though their pronunciation is different. In most cases, these homographs are due to the non-vocalization of words. A full vocalization of words could resolve these ambiguities, but most Arabic texts are not vocalized. For example, the word "كتب – ktb" has 16 possible vocalizations, representing 9 different grammatical categories, such as "كَتَبَ – kataba – he wrote", "كُتِبَ – kutiba – it was written" and "كُتُب – kutub – books".

In addition, Arabic grammar is a very complex subject of study; even Arabic-speaking people nowadays are not fully familiar with the grammar of their own language. Thus, Arabic grammatical checking is a difficult task. The difficulty comes from several sources: first, the length of sentences and the complex Arabic syntax; second, the flexible word order of the Arabic sentence; and third, the presence of the elliptic personal pronoun "al-Damiir al-mustatir".

4 The proposed method

This section is devoted to the presentation of our suggested method for measuring the semantic similarity between Arabic sentences, which has two phases: the learning phase and the test phase.

4.1 Method overview

The first phase of our proposal requires a training corpus, the features extracted from this corpus and an LMF standardized Arabic dictionary [9]. The features are lexical, semantic and syntactico-semantic. This phase includes two processes: the first is the pre-processing, which aims to produce an annotated corpus, and the second is the training, which is used to obtain a hyperplane equation via the learning algorithm. The second phase applies the learning results of the first phase to compute the similarity score of a pair of Arabic sentences in order to classify them as similar or not similar. The phases of our approach are illustrated in the following figure.

Fig. 1. The suggested method

4.2 The learning phase

The learning phase involves the use of a training corpus, a set of features extracted from the analysis of this corpus and an LMF standardized Arabic dictionary in order to train the learning algorithm. It is composed of the following two processes:

Pre-processing: in this process, we apply the defined features to the learning corpus, taking advantage of the LMF standardized Arabic dictionary [9], in order to obtain an annotated corpus. The features are classified into three classes, namely the lexical, the semantic and the syntactic-semantic features. The lexical feature identifies the common words between the pair of sentences. The semantic feature analyzes the synonymy relations among the words of the sentences. The syntactic-semantic feature detects the common semantic arguments between the sentences in terms of semantic class and thematic role [12]. In this step, we also aim to annotate each word of a sentence in the learning corpus according to the different extraction features presented above. Each pair of sentences is then described by a vector called the extraction vector.

The value of the lexical feature SL(S1, S2) is obtained by determining the similar stems between the pair of sentences. In this step, we use the Jaccard coefficient [13] to compute the lexical feature, as shown in the following formula:

SL(S1, S2) = MC / (MS1 + MS2 - MC)    (1)

Where:
MC: the number of common stems between the two sentences
MS1: the number of stems contained in the sentence S1
MS2: the number of stems contained in the sentence S2
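As an illustration, a minimal sketch of this lexical feature in Python is given below; it assumes the stems of each sentence have already been extracted by a morphological analysis step, and the function and variable names are ours, not taken from the actual implementation.

def lexical_feature(stems_s1, stems_s2):
    """Jaccard coefficient over the stem sets of two sentences (Eq. 1)."""
    s1, s2 = set(stems_s1), set(stems_s2)
    common = len(s1 & s2)                 # MC: common stems
    denom = len(s1) + len(s2) - common    # MS1 + MS2 - MC
    return common / denom if denom else 0.0

# e.g. with toy stem lists (placeholders, not real analyzer output):
print(lexical_feature(["ktb", "wld", "byt"], ["ktb", "byt"]))  # -> 0.666...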

This extraction vector is completed by the semantic feature, selected from the semantic annotations in the corpus. The semantic feature is derived from the LMF standardized Arabic dictionary [9]. The procedure to compute the semantic feature is to first form a joint word set using only the distinct stems of the pair of sentences. For each sentence, a raw semantic vector, denoted Š, is then derived with the assistance of the semantic annotation, so that each sentence is represented over the joint word set as follows. Each entry of the semantic vector corresponds to a stem in the joint word set, so its dimension equals the number of stems in the joint word set. The value of an entry Ši (i = 1, 2, …, m) is determined by the semantic similarity of the corresponding stem to the words of the sentence. Taking S1, made up of W1 W2 … Wm, as an example:

Case 1: if Wi appears in the sentence, Ši is set to 1.
Case 2: if Wi is not contained in S1, a semantic similarity score is computed between Wi and each word in the sentence S1, using the semantic annotation. The most similar word in S1 to Wi is the one with the highest similarity score δ. If δ exceeds a preset threshold, then Ši = δ; otherwise, Ši = 0.

Once the two sets of synonyms for each stem are collected, we calculate the degree of similarity between them using the Jaccard coefficient [13]:

Sim(W1, W2) = MC / (MW1 + MW2 - MC)    (2)

Where:
MC: the number of common words between the two synonym sets
MW1: the number of words contained in the synonym set of W1
MW2: the number of words contained in the synonym set of W2
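Under the same illustrative assumptions, the word-level Jaccard measure and the semantic vector construction could be sketched as follows; the synonym sets would come from the LMF standardized Arabic dictionary, and the threshold value used here is an assumption, since only the existence of a preset threshold is stated above.

def word_similarity(syns_w1, syns_w2):
    """Jaccard coefficient over two synonym sets (Eq. 2)."""
    s1, s2 = set(syns_w1), set(syns_w2)
    common = len(s1 & s2)
    denom = len(s1) + len(s2) - common
    return common / denom if denom else 0.0

def semantic_vector(sentence_stems, joint_word_set, synonyms, threshold=0.2):
    """Builds the semantic vector Š of one sentence over the joint word set.

    synonyms maps a stem to its synonym set (drawn, in our setting, from the
    LMF standardized Arabic dictionary); the 0.2 threshold is an assumption.
    """
    stems = set(sentence_stems)
    vec = []
    for w in joint_word_set:
        if w in stems:            # Case 1: the word occurs in the sentence
            vec.append(1.0)
        else:                     # Case 2: keep the best synonym-based score
            delta = max((word_similarity(synonyms.get(w, ()), synonyms.get(x, ()))
                         for x in stems), default=0.0)
            vec.append(delta if delta > threshold else 0.0)
    return vec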

From the semantic vectors generated as described above, we compute the semantic feature between the pair of sentences, which we call SM(S1, S2), using the cosine similarity:

SM(S1, S2) = V1 . V2 / (||V1|| * ||V2||)    (3)

Where:
V1: the semantic vector of the sentence S1
V2: the semantic vector of the sentence S2
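For completeness, a small sketch of this cosine computation (again with illustrative names only):

import math

def semantic_feature(v1, v2):
    """Cosine similarity between two semantic vectors (Eq. 3)."""
    dot = sum(a * b for a, b in zip(v1, v2))
    norm1 = math.sqrt(sum(a * a for a in v1))
    norm2 = math.sqrt(sum(b * b for b in v2))
    return dot / (norm1 * norm2) if norm1 and norm2 else 0.0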

Also, the extraction vector is completed by the syntactic-semantic feature, selected from the syntactico-semantic annotations in the corpus. The syntactic-semantic feature is derived from the LMF standardized Arabic dictionary [9]. On the one hand, each sentence is syntactically parsed using a syntactic analyzer; on the other hand, it is semantically analyzed by an expert who gives its semantic predicate. The correspondence between the syntactic and the semantic analyses is then established in the LMF standardized Arabic dictionary [9] in order to extract the semantic arguments, such as the semantic class and the thematic role. The value of the syntactico-semantic feature then corresponds to the extraction of the similar semantic arguments, in terms of semantic class and thematic role, between the pair of sentences. The syntactico-semantic feature, which we call SSM(S1, S2), is computed using the Jaccard coefficient [13], as shown in the following formula:

SSM(S1, S2) = ASC / (ASS1 + ASS2 - ASC)    (4)

Where:
ASC: the number of common semantic arguments between the two sentences
ASS1: the number of semantic arguments contained in the sentence S1
ASS2: the number of semantic arguments contained in the sentence S2
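A sketch of this last feature follows the same Jaccard pattern; encoding each semantic argument as a (thematic role, semantic class) pair is our assumption about the representation, and the labels in the example are hypothetical.

def syntactico_semantic_feature(args_s1, args_s2):
    """Jaccard coefficient over semantic-argument sets (Eq. 4)."""
    s1, s2 = set(args_s1), set(args_s2)
    common = len(s1 & s2)                 # ASC
    denom = len(s1) + len(s2) - common    # ASS1 + ASS2 - ASC
    return common / denom if denom else 0.0

# e.g. with hypothetical role/class labels:
print(syntactico_semantic_feature({("agent", "human"), ("patient", "artifact")},
                                  {("agent", "human")}))  # -> 0.5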

Finally, the extraction vector is completed with the appropriate similarity decision (similar, not similar), which is provided by an expert. The set of extraction vectors forms an input file for the learning stage. At the end of this process, the learning corpus is converted from its original format into a vector format, yielding a tabular corpus that consists of a set of vectors separated by line breaks, as shown in the following example:

Vector1: 0.3, 0.4, 1, not similar
Vector2: 0.8, 0.7, 1, similar
Vector3: 0, 0, 0, not similar

The values of the lexical, semantic and syntactic-semantic features lie between 0 and 1.

Training: this stage uses the previously generated extraction vectors in order to produce an equation known as the hyperplane equation. Many learning algorithms have been proposed in the literature, such as the SVM, the Naïve Bayes and the J48 decision tree algorithms. These algorithms generate the equation that is used to compute a similarity score in order to classify the sentences (similar, not similar). It is noteworthy that the learning stage is done only once and is only repeated if we increase the size of the corpus or change the type of corpus. This step is performed using the data set of sentences and the Weka library [14]. This tool takes extraction vectors as input, in the form of an ".arff" file, and outputs a hyperplane equation.
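As a sketch of this input format, the extraction vectors above could be serialized to Weka's ARFF format as follows; the relation and attribute names are illustrative, and the space in the class labels is replaced with an underscore to keep the nominal values unquoted.

def write_arff(vectors, path="similarity.arff"):
    """Serializes extraction vectors to Weka's ARFF format.

    vectors: (lexical, semantic, syntactico_semantic, label) tuples, with
    label in {"similar", "not_similar"}; attribute names are illustrative.
    """
    header = ("@relation sentence_similarity\n"
              "@attribute lexical numeric\n"
              "@attribute semantic numeric\n"
              "@attribute syntactico_semantic numeric\n"
              "@attribute class {similar,not_similar}\n"
              "@data\n")
    with open(path, "w", encoding="utf-8") as f:
        f.write(header)
        for sl, sm, ssm, label in vectors:
            f.write(f"{sl},{sm},{ssm},{label}\n")

# The three example vectors from the text:
write_arff([(0.3, 0.4, 1, "not_similar"),
            (0.8, 0.7, 1, "similar"),
            (0.0, 0.0, 0, "not_similar")])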

4.3 The test phase

This phase applies the results of the learning phase in order to measure the semantic similarity between Arabic sentences. The user must provide segmented sentences as input to our system. This phase proceeds in two steps.

Firstly, a pre-processing is applied to the input pair of sentences. We use the features and the LMF standardized Arabic dictionary to process the sentences in order to obtain the vector format presented in the learning stage. This pre-processing generates extraction vectors like those generated as input for the learning stage; the only difference is that these vectors do not contain the similarity decision (similar or not similar), since this information will be computed by the learning algorithm.

Then, the extraction vectors generated in the first step and the hyperplane equation generated in the learning stage are provided as input to the classification module. For each vector, we compute a score using the hyperplane equation. Each equation discriminates between the two similarity decision classes, so every vector receives a score according to the coefficients of the three features (lexical, semantic and syntactic-semantic). The score and its sign are used to identify the similarity decision class of the test vector. At the end of this stage, we obtain a similarity decision (similar, not similar) for the pair of sentences.
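A minimal sketch of this scoring step is given below; the weights and bias stand for the learned hyperplane coefficients (e.g. as produced by a linear classifier in Weka), and the numeric values shown are hypothetical.

def classify(vector, weights, bias):
    """Scores an extraction vector against the learned hyperplane; the
    sign of the score selects the similarity decision class."""
    score = sum(w * x for w, x in zip(weights, vector)) + bias
    return "similar" if score >= 0 else "not similar"

# hypothetical hyperplane coefficients, for illustration only:
print(classify([0.8, 0.7, 1.0], weights=[1.2, 2.5, 0.9], bias=-2.0))  # similar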

5 Experimentation

5.1 Learning corpus

There are currently no suitable Arabic data sets (or even standard text sets) annotated with syntactic-semantic knowledge, notably the semantic predicate, the thematic role and the semantic class. Building such a data set is not a trivial task due to the subjectivity in the interpretation of language, which is partly due to the lack of deeper contextual information. In our work, we collected a set of 1380 sentences consisting of the dictionary definitions of words and the examples of these definitions, taken from Arabic dictionaries such as Lissan-Al-Arab and Al-Wassit. The data were annotated with the features described above. The lexical knowledge was derived from the MADAMIRA analyzer [18], while the semantic and the syntactic-semantic knowledge were taken from the LMF standardized Arabic dictionary [9].

5.2 Results

The evaluation of our method was carried out following the cross-validation method using the Weka tool [15]. To this end, we divided the training corpus into two distinct parts, one for learning (80%) and one for testing (20%). Since the classification constitutes the most important stage of our proposal, we carried out some experiments to evaluate its impact on the whole solution. We chose five classifiers from the Weka Java API, covering different types of classification algorithms, namely: rule-based, probabilistic, case-based, function-based and decision tree methods. The results are listed in the table below.

            Probabilistic   Decision tree   Function   Empiric   Rule-based
            (NaiveBayes)    (J48)           (SMO)      (KStar)   (DecisionTable)
Precision   0.968           0.987           0.979      0.987     0.985
Recall      0.962           0.988           0.978      0.987     0.985
F-measure   0.964           0.988           0.977      0.987     0.985

Table 1. Evaluation results
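Our evaluation itself was run in Weka; purely as an illustration of the 80/20 protocol, an analogous setup with a linear SVM (the counterpart of Weka's SMO) could be written in scikit-learn as follows, on toy data standing in for the 690 expert-annotated sentence pairs.

from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import precision_recall_fscore_support

# Toy extraction vectors (lexical, semantic, syntactico-semantic) and labels:
X = [[0.3, 0.4, 1.0], [0.8, 0.7, 1.0], [0.0, 0.0, 0.0],
     [0.9, 0.8, 1.0], [0.1, 0.2, 0.0], [0.7, 0.9, 1.0]]
y = ["not similar", "similar", "not similar",
     "similar", "not similar", "similar"]

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2,
                                          stratify=y, random_state=0)
clf = SVC(kernel="linear").fit(X_tr, y_tr)
prec, rec, f1, _ = precision_recall_fscore_support(
    y_te, clf.predict(X_te), average="weighted")
print(prec, rec, f1)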

The obtained results are encouraging and represent a good start for the application of automatic learning to measure the semantic similarity between Arabic sentences. We noticed that the analysis of short sentences (
