
Lexical Network Enrichment Using Association Rules Model

Souheyl Mallat1, Emna Hkiri1, Mohsen Maraoui2 and Mounir Zrigui1

1 LATICE Laboratory, Research Department of Computer Science, University of Monastir, Tunisia
2 Computational Mathematics Laboratory, University of Monastir

{[email protected], [email protected], [email protected], [email protected]}

Abstract. In this paper, we present a method of lexical enrichment applied to a semantic network in the context of query disambiguation. The network represents the list of relevant sentences in French (denoted listRSF) that respond to a given Arabic query. In a first step, we generate the semantic network covering the content of the listRSF; this generation is based on our approach to semantic and conceptual indexing. In a second step, we apply a contextual enrichment to this network using the association rules model. The evaluation of our method shows the impact of this model on the enrichment of the semantic network: the enrichment increases the F-measure from 71% to 81% in terms of listRSF coverage.

Keywords: Association rules model, semantic network, contextual enrichment.

1 Introduction

In the last decade, lexical disambiguation in automatic query translation has made great progress. Our disambiguation method builds on our representation of the listRSF, the list of relevant sentences in French that semantically answer the Arabic query. The listRSF corresponds to the list of relevant sentences in Arabic (listRSA) [1][2] and was obtained by a sentence-level alignment step with the MkAlign tool [3]. The disambiguation method attempts to improve our Arabic query translation system by eliminating translated terms whose senses do not belong to the semantic context of the listRSF. In this work, we are interested in representing the listRSF by a semantic network. A number of works in natural language processing (NLP) build on the principles presented in [4] to exploit networks of lexical collocations (semantic, syntactic, pragmatic). [5] used lexical networks for word sense disambiguation, and [6] exploited them in parsing and generation. Such networks have the advantage of being easy to build automatically. Consequently, in our context, we treat the listRSF without restriction to a particular theme.

In this paper, we first propose a semantic network structure that achieves semantic cohesion by (1) using the various relations (synonymy, hypernymy, hyponymy, meronymy), (2) grouping the concepts related to the significant terms of the listRSF and (3) projecting them onto the French EuroWordNet (EWNF) [7]. The problem is that the network generated in this step does not fully cover the listRSF content: it only partially fulfills our objectives because it fails on some queries containing ambiguous words.

To overcome this issue, we proceed to a second step in which we use contextual information between concepts. This brings out concepts and contextual relations that are only implicitly defined, in order to obtain a rich semantic description of the listRSF. These relations are provided by semantic association rules generated by the Apriori algorithm [8]. As a result of this step, our semantic network is contextually enriched. Our development is performed on French data extracted from the “Le Monde diplomatique” corpus [9]. These data are represented in two formalisms: the first is the semantic network without contextual enrichment and the second is the contextually enriched network. The two networks are used in a comparative study to demonstrate the effect of the enrichment on the coverage of the listRSF.

1.1 State of the art: Approaches to semantic relation identification

In this section, we first present the different semantic relations that can exist between terms and two methods of acquisition of these relations.

Use of contextual distribution of terms in relations extraction. This approach consists in grouping terms that share the same contexts (originally syntactic contexts) [10]. For example, the terms teacher and board are semantically close because they share the same context, namely teaching. Distributional analysis applied to a corpus of texts makes it possible to identify several types of relations, such as proximity relations [11] and synonymy relations [12]. This method was also used by [13] to highlight the semantic relations associated with terms. The idea was to replace the terms present in the contexts by their semantic classes, based on WordNet. For example, the term soldier is replaced by the class “ministry of defense” in WordNet. There is also a hybrid approach, presented in [14], that combines distributional analysis and lexico-syntactic patterns.

The works presented above reflect the importance of this domain and show some diversity in the approaches used to acquire relations between terms. In this paper, we first propose a method to represent the listRSF by a semantic network, similar to the work of [8]. Our network is essentially composed of concepts associated with significant terms identified from the listRSF. In the first step, we index the listRSF with our indexing method and then build the network (identification of the nodes and of the relations between them). In the second step, we enrich this semantic network by adding other, hidden relations.
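The distributional idea described above (terms that share contexts are semantically close) can be illustrated with a minimal Python sketch. It is not the method of [10]: it assumes a simple word-window context instead of syntactic contexts, and the function names, the window size and the toy corpus are invented for illustration only.

```python
from collections import Counter
from math import sqrt


def context_vectors(sentences, window=2):
    """Map each word to a bag of the words seen within `window` positions of it."""
    vectors = {}
    for tokens in sentences:
        for i, word in enumerate(tokens):
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            neighbours = tokens[lo:i] + tokens[i + 1:hi]
            vectors.setdefault(word, Counter()).update(neighbours)
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm = sqrt(sum(x * x for x in u.values())) * sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0


# Toy corpus, invented for illustration (stop words already removed).
corpus = [
    ["teacher", "explains", "lesson", "students"],
    ["board", "displays", "lesson", "students"],
    ["soldier", "guards", "border", "post"],
]

vectors = context_vectors(corpus)
sim_teacher_board = cosine(vectors["teacher"], vectors["board"])
sim_teacher_soldier = cosine(vectors["teacher"], vectors["soldier"])
```

In this toy corpus, teacher and board share the context word lesson, so their similarity is higher than that of teacher and soldier, which share no context word.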

2 Representation of the listRSF by a semantic network

This section introduces the formalism of the network. Our network is essentially composed of the set of concepts associated with the significant terms identified from the listRSF. This identification aims to extract the significant information of the listRSF and is essentially based on the indexing process.

2.1 Description of the indexing method of the listRSF

Our indexing method is inspired by the work of Baziz [15] and represents the listRSF by a list of index concepts. It is based on the combination of semantic and conceptual indexing [16]. In semantic indexing, the semantic structure used makes it possible to extend the representation of the listRSF through the synonymy relation. Baziz showed that this combination improves the quality of the system compared with an indexing based only on conceptual indexing: his IR system performs better with it, since it produced less than 30% of disambiguation errors. Our indexing method relies on the French EuroWordNet (EWNF) semantic network and incorporates three main steps:

Extraction of concepts: The concepts of the significant terms (simple and composed) of the listRSF are extracted by projection onto EWNF. If the projection yields several concepts for a given term, the term has to be disambiguated. Identifying the composed terms in the list improves the performance of automatic indexing: composed terms considerably reduce the ambiguity of terms and increase precision, because they reduce the number of senses of a term. For example, the composed term "North America" has a single sense, whereas EWNF returns 6 senses for the term north and 3 for America. Our identification of simple and composed terms is symbolic and requires a morphosyntactic analysis of the listRSF, which we obtain by integrating TreeTagger [17]. TreeTagger produces a list of words labeled with their grammatical categories. Since most composed terms consist of combinations of nouns, adjectives and prepositions, we generate a list of n-grams (2 ≤ n ≤ 3).

Concepts weighting: Once the simple and composed terms have been extracted from the listRSF, we assign each of them a weight in the listRSF. The purpose of this step is to eliminate the least frequent terms and keep only the most representative terms of the listRSF. The weighting method combines statistical and semantic analysis [18], so that the weight of a term reflects both its frequency and its semantic variation. For the statistical analysis, the concept identification step stresses the importance of composed terms, but in some cases the words composing such a term are used alone to refer to it after a number of occurrences. This is a form of simplification or abbreviation used by the author. The frequency of a term Ti therefore depends on the number of occurrences of the term itself and of the words that compose it (its sub-terms STi).
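The generation of candidate composed terms described in the extraction step (n-grams with 2 ≤ n ≤ 3 built from nouns, adjectives and prepositions) can be sketched in Python as follows. This is only an illustration under stated assumptions: the coarse tags NOUN, ADJ and PREP are hypothetical and do not reproduce the TreeTagger tagset, the rule that a candidate may not start or end with a preposition is our own simplification, and the tagged sentence is invented.

```python
# Hypothetical coarse tags; the real TreeTagger tagset differs.
ALLOWED = {"NOUN", "ADJ", "PREP"}


def candidate_terms(tagged, min_n=2, max_n=3):
    """Return n-grams (min_n <= n <= max_n) made only of allowed tags.

    Assumed boundary rule: a candidate must not start or end with a
    preposition, so fragments such as "of North" are rejected.
    """
    candidates = []
    for n in range(min_n, max_n + 1):
        for i in range(len(tagged) - n + 1):
            window = tagged[i:i + n]
            tags = [tag for _, tag in window]
            if not all(tag in ALLOWED for tag in tags):
                continue
            if tags[0] == "PREP" or tags[-1] == "PREP":
                continue
            candidates.append(" ".join(word for word, _ in window))
    return candidates


# Invented (word, tag) pairs standing in for a tagger's output.
tagged_sentence = [
    ("the", "DET"), ("countries", "NOUN"), ("of", "PREP"),
    ("North", "ADJ"), ("America", "NOUN"), ("export", "VERB"),
    ("wheat", "NOUN"),
]

terms = candidate_terms(tagged_sentence)
```

On this sentence the sketch keeps "North America" but also the noisy candidate "countries of North". Such candidates would be discarded by the projection step, since only terms that correspond to a concept in the lexical resource are retained.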
The statistical analysis is defined by the conceptual frequency of a term Ti in the listRSF, calculated as follows:

CF(Ti) = […]    (1)

where Length(·) gives the number of words in a term and STi denotes the sub-terms (single words) derived from Ti.

The semantic analysis is based on the representativeness of a concept, which takes into account the frequency of occurrence of the terms denoting the concept in the listRSF as well as its relations with other concepts of the domain. The more relations a concept has with other concepts present in the listRSF, the more representative of the listRSF it is. The EWNF resource is used to generate the set of concepts related to these terms in the form of synsets, taking every defined sense and its semantic relations. The basic relation between the terms of the same synset is synonymy, while different synsets are linked by various semantic relations such as subsumption (hyponymy/hypernymy). In our case, we use the semantic frequency weight W_freqsem(Ti), which is calculated for each term as a function of the frequency of occurrence of the concepts associated with that term and of the ranks of the sentences to which those concepts belong. The coefficient of each sentence is assigned as follows: 10 if the term belongs to the first sentence, 9 for the second, and 1 for the tenth sentence and all remaining sentences of the listRSF. Assume that the term Ti contains n terms and appears p times in the listRSF, and let Mi,j be the coefficient of the sentences containing the conceptual occurrence j of the term Ti (the different senses associated with this term are extracted from EWNF, and each sense is associated with a synset and all its semantic relations). The weight of semantic frequency W_freqsem of a term Ti in the listRSF is calculated as follows:

W_freqsem(Ti) = […]    (2)

where […] is the weight of term Ti, and Ns = k - number(Mi,j = 0) (ns represents the number of possible senses of Ti). W(Ti, listRSF) represents the global weight of a term Ti in the KB (listRSF) and is defined by the expression:

W(Ti, listRSF) = WTi = […]    (3)

The index of the listRSF is denoted Index(listRSF) = (Ti, WTi).

Disambiguation of index terms: This step aims to identify the exact sense of a polysemous index term in the listRSF. Let Ti be an ambiguous term belonging to the index of the listRSF and let Si be the number of senses associated with Ti. The principle of the disambiguation method is to select the best concept (sense) in the listRSF among several candidates (C1, C2, ..., Cn). For semantic disambiguation, we follow the method used by [19]. It is based on the calculation of a symmetric similarity weight P(Cij) for each concept associated with sense j of the term Ti in the list of indexes:

P(Cij) = […]    (4)

where m is the number of terms in Index(listRSF), nl is the number of senses of the term Ti in EWNF, and Dist(·, ·) is a measure of semantic proximity between two concepts [20][8], calculated as a score based on their mutual distance in the EWNF network. The disadvantage of this method is that it considers only the semantic similarity between concepts in the listRSF and does not take into account the representativeness of the terms in the context of the listRSF. The best sense for a term Ti in the listRSF must be strongly correlated with the senses associated with the other important terms of the listRSF. For this reason, we integrate the weight of the term in the calculation of the conceptual scores, using the following formula:

P(Cij) = […]    (5)
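The weighted sense selection described above can be sketched in Python. The exact form of the score is an assumption of this sketch: each candidate sense accumulates its proximity to every sense of the other index terms, and each proximity is scaled by the global weight of the other term, so that senses agreeing with important terms score higher. The function best_senses, the proximity table and the toy index are hypothetical and do not come from EWNF.

```python
def best_senses(senses, weights, dist):
    """Pick, for each term, the candidate sense with the highest weighted score.

    senses  : dict term -> list of candidate concept identifiers
    weights : dict term -> global weight of the term in the list
    dist    : function (concept_a, concept_b) -> proximity score
    """
    chosen = {}
    for term, candidates in senses.items():
        scores = {}
        for concept in candidates:
            total = 0.0
            for other, other_candidates in senses.items():
                if other == term:
                    continue  # a term is not compared with its own senses
                for other_concept in other_candidates:
                    # Assumed combination: proximity scaled by the other term's weight.
                    total += weights[other] * dist(concept, other_concept)
            scores[concept] = total
        chosen[term] = max(scores, key=scores.get)
    return chosen


# Invented symmetric proximity table; unlisted pairs get a low default score.
PROXIMITY = {
    frozenset(("bank#finance", "loan#finance")): 0.9,
    frozenset(("bank#river", "water#liquid")): 0.8,
}


def toy_dist(a, b):
    return PROXIMITY.get(frozenset((a, b)), 0.1)


senses = {
    "bank": ["bank#finance", "bank#river"],
    "loan": ["loan#finance"],
    "water": ["water#liquid"],
}
weights = {"bank": 1.0, "loan": 3.0, "water": 1.0}

chosen = best_senses(senses, weights, toy_dist)
```

With these toy weights the financial sense of bank is selected because loan is the heavier term. Raising the weight of water instead would favor the river sense, which is the effect the term weights are meant to have on the conceptual scores.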

The concept with the highest weight is considered the best sense of the term Ti. After extracting the concepts and calculating their weights, the listRSF is represented by m concepts (m