Arabic Name Entity Recognition Using Second Order Hidden Markov

June 2015 - August 2015; Sec. B; Vol. 4, No. 3, 744-751

E-ISSN: 2278–179X

Journal of Environmental Science, Computer Science and Engineering & Technology An International Peer Review E-3 Journal of Sciences and Technology

Available online at www.jecet.org Section B: Computer Science Research Article

Arabic Name Entity Recognition Using Second Order Hidden Markov Model Fadl Dahan, Ameur Touir and Hassan Mathkour Department of Computer Science, College of Computer and Information Sciences, King Saud University Received: 07 July 2015; Revised: 13 July 2015; Accepted: 15 July 2015

Abstract: In this paper, we introduce a model that builds an automatic named entity recognition (NER) system for the Arabic language by using a context structure of three words (the target word and the two words before it) and a second-order Hidden Markov Model (HMM). We keep this context history in a tri-gram model, then integrate the HMM with word features to decide the classification of target words without human intervention or a supporting name list.

Keywords: Hidden Markov Model (HMM), Named Entity Recognition (NER), Tri-gram Model.

INTRODUCTION

Named entity (NE) recognition is widely used in Natural Language Processing (NLP). The term was coined for the Sixth Message Understanding Conference (MUC-6) 1. The main problem of NER is to recognize entity names, generally organized into person, location, and organization, in a document. Several classification methods have been successfully applied to this task. Some of these methods depend on human knowledge of a specific language, elaborating the rules necessary to extract name entities from documents. However, these methods are limited to a particular language and cannot be transferred to others, as they require a lot of effort and coordination between programmers and linguists.
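To make the grouping of histories concrete, here is a minimal sketch of counting three-word histories so that all histories sharing the same two preceding words fall into one equivalence class (a second-order Markov model). The function name, padding symbols, and toy data are ours, not the paper's.

```python
from collections import defaultdict

def build_trigram_counts(tagged_tokens):
    """Count (w-2, w-1, w) histories: all histories sharing the same
    last two words fall into one equivalence class (2nd-order Markov)."""
    counts = defaultdict(int)
    # Pad so the first words still have a two-word history.
    padded = [("<s>", "O"), ("<s>", "O")] + tagged_tokens
    for i in range(2, len(padded)):
        w2, w1, w = padded[i - 2][0], padded[i - 1][0], padded[i][0]
        counts[(w2, w1, w)] += 1
    return counts

# Toy example with English placeholders for readability.
tokens = [("kingdom", "B-LOC"), ("of", "I-LOC"), ("saudi", "I-LOC"),
          ("arabia", "I-LOC"), ("announced", "O")]
counts = build_trigram_counts(tokens)
print(counts[("kingdom", "of", "saudi")])  # 1
```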


The other methods study the features of positive and negative examples of NEs over a large collection of annotated documents and design rules that capture instances of a given type 1. These methods include Hidden Markov Models (HMM), Decision Trees, Maximum Entropy models (ME), Support Vector Machines (SVM), and Conditional Random Fields (CRF). Their main shortcoming is the requirement of a large annotated corpus 2. The problem addressed in this paper is to classify Arabic text into one of the NE classes (Person, Location, and Organization) by predicting the next word given the previous words. For this task we cannot consider each textual history separately, so we need a method of grouping histories that are similar in some way in order to give reasonable predictions. One possible way to group them is to make a Markov assumption that only the prior local context, the last few words, affects the next word. If we construct a model in which all histories sharing the same n-1 words are placed in the same equivalence class, we have an (n-1)th-order Markov model, or an n-gram word model 2. For Arabic, two preceding words and their features are a good choice: we model the relationship between three words using their joint probability, and further analyse the structure of these words (the feature history) to build a good NER system.

RELATED WORKS

To our knowledge, few studies have addressed the NER problem for Arabic, and most of them depend on rule-based methods. Maloney and Niv 3 combined a pattern-matching engine and supporting data with a morphological analysis component to build their system. The morphological analyzer makes a distinction between likely and unlikely name constituents, which is particularly important when deciding where a name ends and the non-name context begins. The pattern-matching engine uses data consisting of a set of pattern-action rules supported by word lists for finding names.
Abuleil 4 presented a technique to extract names from Arabic texts by building a database and graphs to represent the words that might form a name and the relationships between them. He used directed graphs to represent the words found in the name phrases, the relative frequency (weight) of each, and the relationships between them. Shaalan and Raza 5 developed a rule-based person name entity recognition system for Arabic called PERA, using linguistic grammar-based techniques. The system consists of a lexicon, in the form of gazetteer name lists, and a grammar, in the form of regular expressions, which together recognize person name entities. Mesfar 6 described a system for Arabic named entity recognition that combines a morphological parser and a syntactic parser built with the NooJ linguistic development environment. Its morphological analyzers use finite-state technology to parse vowelled texts as well as partially vowelled and unvowelled text. Benajiba et al. 7,8 applied an automatic approach to the Arabic NER problem. In 7 they presented a first version based on bigrams and Maximum Entropy (ME), boosted by gazetteers (name lists), and developed their own training and testing corpus, ANERcorp, which is freely available on their website 9. In 8 they describe a second version that improves on the first, using Part-Of-Speech (POS) tags and a two-step approach to enhance performance.

System Implementation: The system runs two distinct processes sequentially. It first trains itself on a set of previously annotated words, called the corpus, to obtain useful rules


and attributes, which a second process then applies to classify new text using probabilities.

Training Phase: Each word in the corpus carries a set of information. In this phase the system analyses these words and extracts the information needed to build a tri-gram of each word and its two previous words, through the process shown in Figure 1.

[Figure 1: flowchart of the training phase: Text Files → Tokenization → Normalization → Ambiguity Analysis → Tri-gram Model, iterating over the next token and branching on known versus unknown (unannotated) tokens]

Figure 1: Training Phase

Text Files: We train the system using a corpus of 200,000 words spread over a group of files. These files contain a collection of news (international, Arabic, sports) collected from different sources. The words were annotated manually, so that each word carries its class name and its position in the class sequence, as in the example in Figure 2.

Figure 2: Example of Training Corpus

Tokenization: The first step is to break the stream of input Arabic text into meaningful elements called tokens. Each token is either a word or something else such as a number or punctuation mark, since the core system can only tag entities on a token-by-token basis. During tokenization we also preprocess the input text to omit diacritics and special characters.
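A small sketch of such a tokenizer follows. The diacritic range (Arabic tashkeel, U+064B to U+0652, plus the superscript alef U+0670 and the tatweel U+0640) comes from the Unicode Arabic block; the function name and exact token pattern are our assumptions, not the paper's implementation.

```python
import re

# Arabic diacritics (tashkeel) live in U+064B..U+0652; U+0670 is the
# superscript alef, U+0640 the tatweel (elongation character).
DIACRITICS = re.compile(r"[\u064B-\u0652\u0670\u0640]")

def tokenize(text):
    text = DIACRITICS.sub("", text)  # drop diacritics and tatweel
    # Keep runs of Arabic letters, Latin letters, or digits as tokens;
    # each punctuation character becomes its own token.
    return re.findall(r"[\u0621-\u064A]+|[A-Za-z]+|\d+|[^\s\w]", text)

print(tokenize("ذَهَبَ إلى جدة، 2015"))  # ['ذهب', 'إلى', 'جدة', '،', '2015']
```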


Normalization: Every word in Arabic can appear in texts in more than one form, either alone or with a group of inflections. These additions are characters that come before or after a noun. Dealing with every word in all its forms would cost a lot of space and effort, so we remove these additions by fixing one canonical form and measuring the other forms against it using light stemming. Figure 3 shows the same locations with inflections: ‫اﻟﻤﺎﻧﯿﺎ‬ "Germany" appears with the ‫و‬ "and" inflection as ‫واﻟﻤﺎﻧﯿﺎ‬ "and Germany"; similarly ‫ﻓﺮﻧﺴﺎ‬ "France", ‫اﻟﻨﻮرﯾﺞ‬ "Norway", and ‫اﻟﺴﻌﻮدﯾﺔ‬ "Saudi Arabia".

Figure 3: Example of Words Inflections
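A minimal light-stemming sketch in the spirit of the example above: fix one canonical form and map prefixed variants onto it. The prefix list (common one-letter proclitics) and the function name are our assumptions, not the authors' exact rules.

```python
PREFIXES = ("و", "ف", "ب", "ك", "ل")  # common one-letter Arabic proclitics

def light_stem(word, known_forms):
    """Return (base_form, removed_prefix) if stripping a proclitic yields
    a form already seen in the corpus; otherwise the word unchanged."""
    for p in PREFIXES:
        if word.startswith(p) and word[len(p):] in known_forms:
            return word[len(p):], p
    return word, ""

known = {"المانيا", "فرنسا"}
print(light_stem("والمانيا", known))  # ('المانيا', 'و')
print(light_stem("فرنسا", known))     # ('فرنسا', '')
```

The removed prefix is kept rather than discarded, since the paper later uses these additions as stemming features during classification.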

At the end of this process, we have a list of features about the words that will be stored in the tri-gram model.

Ambiguity Analysis: Each word in the training corpus may take one or more of the name classes. For example, the word ‫ﺟﺪة‬ "Jeddah", the name of a city, has the B-LOC class in Figure 2, but the same word may take the O class when it means "grandmother" and the ORG class when it comes as part of a company name. This illustrates the ambiguity problem, which pervades Arabic: a word can carry more than one name class as well as the O class. We deal with it by storing each such name word first, then checking other words against it; when we find a match with a stored word, we treat the new word as a name word and keep it with the necessary information.

Building the Tri-gram: A tri-gram is a special case of the N-gram, where N is the number of words used in building the model; in a tri-gram we use three words (the name word and the two words before it). When building the tri-gram model we store, alongside the series of words, features that help classify new texts. These features come from tokenization and normalization, so each word stored in the tri-gram carries its position in the class sequence, its class, its number of occurrences, and any addition or deletion (stemming features).

Classification Phase: In the classification process we take the text we want to classify; unlike the training input, here we have only the words. These words are called the testing corpus. We collected our own testing corpus of about 20,000 words from different news websites (Alarabiya, Aljazeera, and Alhayat) and different types of news: sport, international, Arabic, and economic.
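One plausible shape for such a tri-gram entry, holding the kinds of information the text lists (class, position in the class sequence, occurrence counts, stemming features), could look like the following. The record layout and all names are hypothetical; the paper does not specify its data structures.

```python
from dataclasses import dataclass, field
from collections import Counter

@dataclass
class TrigramEntry:
    """Hypothetical record for one word slot in the tri-gram model."""
    class_counts: Counter = field(default_factory=Counter)     # e.g. {"B-LOC": 3, "O": 1}
    position_counts: Counter = field(default_factory=Counter)  # position in the class sequence
    prefixes: Counter = field(default_factory=Counter)         # removed additions (stemming features)

    def observe(self, name_class, position, prefix=""):
        self.class_counts[name_class] += 1
        self.position_counts[position] += 1
        if prefix:
            self.prefixes[prefix] += 1

    def is_ambiguous(self):
        # The ambiguity problem: a word seen with more than one class.
        return len(self.class_counts) > 1

entry = TrigramEntry()
entry.observe("B-LOC", "B", prefix="و")
entry.observe("O", "B")
print(entry.is_ambiguous())  # True
```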


[Figure 4: flowchart of the classification phase: Input Text → Tokenization → Normalization → Probability Estimation → Name Classification → Annotated Text, iterating over the next token]

Figure 4: Classification Phase

Probability Estimation: Each word has a probability of belonging to each name class. We need statistics about the following:

• The relationship between the word and each class (eq. 1).
• The relationship between the word and the previous words (eqs. 2, 3, 4).
• The features of the word and of the previous words.

In this process, the system obtains four attributes for each word from the training corpus and normalization: position of the name in the class sequence, class, feature from normalization, and number of occurrences. The system uses these attributes, the joint probabilities from equations 1, 2, 3, and 4, and the attributes of the two previous words. Together these are enough to decide the correct class of the current word and help to solve the ambiguity problem.

P(W, NC)                            (eq. 1)

P(W, NC, W-1, NC-1, W-2, NC-2)      (eq. 2)

P(W, NC, W-1, NC-1)                 (eq. 3)

P(W, NC, W-2, NC-2)                 (eq. 4)
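Under a simple relative-frequency (maximum-likelihood) reading, which the paper does not spell out, each of these joint probabilities can be estimated from event counts over the training corpus. A sketch with toy counts for eq. 3 (the function name and data are ours):

```python
from collections import Counter

def joint_prob(counts, total, *event):
    """Relative-frequency estimate of a joint probability such as
    eq. 3: P(W, NC, W-1, NC-1). `counts` maps observed event tuples
    to their frequencies in the training corpus."""
    return counts.get(event, 0) / total if total else 0.0

# Toy counts over (word, class, prev_word, prev_class) events.
counts = Counter()
counts[("السعودية", "I-LOC", "العربية", "I-LOC")] = 8
counts[("السعودية", "O", "الجامعات", "O")] = 2
total = sum(counts.values())
print(joint_prob(counts, total, "السعودية", "I-LOC", "العربية", "I-LOC"))  # 0.8
```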

Name Classification: In this process we use the statistical information from the probability-estimation step to make the final decision about the correct class; each piece of information has a priority among the classification cases. In the first case we have a matching context in the tri-gram model, so we choose the class with the maximum joint probability, e.g. the location name ‫اﻟﺴﻌﻮدﯾﺔ‬ "Saudi" when it comes after ‫اﻟﻤﻤﻠﻜﺔ‬ ‫اﻟﻌﺮﺑﯿﺔ‬ "the Arab Kingdom". In the second case we have no matching context but do have information that helps us choose between classes (the ambiguity problem). Here we integrate all available information to choose the correct class, e.g. in ‫اﻟﺠﺎﻣﻌﺎت اﻟﺴﻌﻮدﯾﺔ‬ "Saudi universities" the word ‫اﻟﺴﻌﻮدﯾﺔ‬ "Saudi" is in most cases classified as LOC; however, in this case we classify it as O depending on the


relationship with the previous word: the previous word is not a location indicator and has not occurred before a location in the training data. There are many examples of this case, and they are identified by storing them in an output log file, which facilitates assessing system performance on this case. The last case is when we have no information at all about the word because it did not occur in the training corpus (an unknown word). Here we have two situations, depending on whether the previous word is classified as a name or not. In the first situation we use the position of the name class and the feature from normalization. The position of the name class is useful for locations, whose forms are often fixed: there is no inside-class word after ‫ﺳﻮرﯾﺎ‬ "Syria", ‫ﺗﻮﻧﺲ‬ "Tunisia", or ‫اﻟﺴﻌﻮدﯾﺔ‬ "Saudi" when it comes alone or at the end of a long name such as ‫اﻟﻤﻤﻠﻜﺔ اﻟﻌﺮﺑﯿﺔ اﻟﺴﻌﻮدﯾﺔ‬ "Kingdom of Saudi Arabia"; the last word closes the class sequence, so any unknown word after it takes the O class. The feature from normalization is useful for organizations, because we cannot see an organization name that starts with a definite word followed by an indefinite word, and in many cases the character ‫و‬ "and" connects more than one part of the organization name. For the last name class (person) we have no such properties to help recognition; if a person name has occurred before and we meet an unknown word after it with no additions except the definite article, we classify that word as a person name.

Evaluation: We evaluated our system by measuring precision P, recall R, and F-measure Fm, where P indicates the percentage of correct responses out of all responses, R indicates how many of the entities that should have been found were effectively extracted, and Fm is the combination of precision and recall given by their harmonic mean 5.

P = number of correct responses / number of total responses

R = number of correct responses / number of correct annotations

Fm = 2(PR) / (P + R)
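These three measures are direct to compute from counts; a short sketch (the counts below are illustrative, chosen to reproduce the averages reported next, not taken from the paper's actual output):

```python
def evaluate(correct, total_responses, total_annotated):
    """Precision, recall and F-measure as defined above; the arguments
    are simple counts from the system output and the gold annotation."""
    p = correct / total_responses
    r = correct / total_annotated
    fm = 2 * p * r / (p + r)  # harmonic mean of P and R
    return p, r, fm

p, r, fm = evaluate(correct=820, total_responses=988, total_annotated=1000)
print(round(p, 2), round(r, 2), round(fm, 2))  # 0.83 0.82 0.82
```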

The average precision and recall of the testing corpus in recognizing name entities are 83% and 82%, respectively, and the average F-measure is 82%. These results are better than the systems of Benajiba et al. 7,8, Ahmed, and Samir, as shown in Figure 5. Our system is fully automated: the classification process depends only on the information extracted from the training corpus, without any supporting name list. By comparison, Benajiba's results were supported by a name list called ANERgazet, built manually from web resources and containing three different gazetteers 7,8: 1- a location gazetteer of 1,950 names; 2- a person gazetteer of 2,309 names;


[Figure 5: bar chart comparing Precision, Recall, and F-measure for Our System, ANERsys 1.0, ANERsys 2.0, ANERsys 3.0, Ahmed, and Samir]

Figure 5: Results Comparison

CONCLUSION

Context structure proves to be helpful information for building an Arabic NER system. Although the Arabic language does not have many features that are good indicators of named entities, such as capitalization, the sequence of word features and the position of the name in the class sequence serve that purpose. This approach needs a big corpus to work properly, with a variety of Arabic text (sport, news, economic, etc.), especially for ambiguity problems and unknown tokens.

ACKNOWLEDGEMENT

This work is partially supported by the Research Center of the College of Computer and Information Sciences, King Saud University, Riyadh, Saudi Arabia. The authors wish to thank KACST (King Abdulaziz City for Science and Technology, Riyadh, Saudi Arabia) for providing part of the corpus. We also thank Yassine Benajiba from the Universidad Politécnica de Valencia for providing valuable resources.

REFERENCES

1. R. Grishman, B. Sundheim. "Message Understanding Conference - 6: A Brief History". COLING-96.
2. C. D. Manning, H. Schuetze. "Foundations of Statistical Natural Language Processing". MIT Press, Cambridge, MA, May 1999.
3. J. Maloney, M. Niv. "TAGARAB: A Fast, Accurate Arabic Name Recognizer Using High-Precision Morphological Analysis". Proceedings of the Workshop on Computational Approaches to Semitic Languages, August 1998, pp. 8-15.


4. S. Abuleil. "Extracting Names From Arabic Text For Question-Answering Systems". Proceedings of Coupling Approaches, Coupling Media and Coupling Languages for Information Retrieval (RIAO 2004), Avignon, France, pp. 638-647.
5. K. Shaalan, H. Raza. "Person Name Entity Recognition for Arabic". Proceedings of the 5th Workshop on Important Unresolved Matters, Prague, Czech Republic, June 2007, pp. 17-24.
6. S. Mesfar. "Named Entity Recognition for Arabic Using Syntactic Grammars". Proceedings of the 12th International Conference on Applications of Natural Language to Information Systems (NLDB), Paris, France, June 2007, Vol. 4592/2007, pp. 305-316.
7. Y. Benajiba, P. Rosso, J. M. Benedí Ruiz. "ANERsys: An Arabic Named Entity Recognition System Based on Maximum Entropy". Proceedings of the 8th International Conference on Computational Linguistics and Intelligent Text Processing (CICLing), Mexico City, Mexico, February 2007, Vol. 4394/2007, pp. 143-153.
8. Y. Benajiba, P. Rosso. "ANERsys 2.0: Conquering the NER Task for the Arabic Language by Combining the Maximum Entropy with POS-tag Information". Proceedings of the 3rd Indian International Conference on Artificial Intelligence (IICAI-07), December 17-19, 2007, pp. 1814-1823.
9. http://www.dsic.upv.es/~ybenajiba

Corresponding Author: Fadl Dahan; Department of Computer Science, College of Computer and Information Sciences, King Saud University
