Article
A hybrid approach to Arabic named entity recognition
Journal of Information Science 2014, Vol. 40(1) 67–87 © The Author(s) 2013 Reprints and permissions: sagepub.co.uk/journalsPermissions.nav DOI: 10.1177/0165551513502417 jis.sagepub.com
Khaled Shaalan The British University in Dubai, Dubai International Academic City, UAE
Mai Oudah The British University in Dubai, Dubai International Academic City, UAE
Abstract In this paper, we propose a hybrid named entity recognition (NER) approach that takes advantage of both rule-based and machine learning-based (ML-based) approaches in order to improve overall system performance and to overcome the knowledge elicitation bottleneck and the lack of resources for underdeveloped languages that require deep language processing, such as Arabic. The complexity of Arabic poses special challenges to researchers of Arabic NER, which is essential for both monolingual and multilingual applications. We used the hybrid approach to develop an Arabic NER system that is capable of recognizing 11 types of Arabic named entities: Person, Location, Organization, Date, Time, Price, Measurement, Percent, Phone Number, ISBN and File Name. Extensive experiments were conducted using decision trees, Support Vector Machines and logistic regression classifiers to evaluate the system performance. The empirical results indicate that the hybrid approach outperforms both the rule-based and the ML-based approaches when they are applied independently. More importantly, our system outperforms the state of the art in Arabic NER in terms of accuracy when applied to the ANERcorp standard dataset, with F-measures of 0.94 for Person, 0.90 for Location and 0.88 for Organization.
Keywords hybrid approach; information extraction; information retrieval; machine learning approach; named entity recognition; natural language processing; rule-based approach
Corresponding author: Khaled Shaalan, The British University in Dubai, Dubai International Academic City, Block 11, 1st Floor, P.O. Box 345015, UAE. Email: [email protected]

1. Introduction
Named entity recognition (NER) is the task of detecting and classifying named entities (NEs) within texts into predefined classes, such as Person, Location and Organization names [1]. NER acts as an important preprocessing step in many natural language processing (NLP) applications to enhance their overall performance. Examples of NLP applications that use the functionalities of NER [2] are information retrieval [3], machine translation [4], question answering [5–7] and text clustering [3]. Most NER systems have been developed using either rule-based or machine learning-based (ML-based) approaches. A hybrid approach can take advantage of both approaches in order to improve the overall system performance and overcome the knowledge elicitation bottleneck and the lack of resources for underdeveloped languages that require deep language processing, such as Arabic.

The task of NER was first introduced in the 1990s, in particular at the Message Understanding Conferences (MUC), where it received attention from the research community. Three main NER subtasks were defined at the 6th MUC: ENAMEX (i.e. Person, Location and Organization), TIMEX (i.e. temporal expressions) and NUMEX (i.e. numerical expressions). Customized NER systems may require further subdivisions of one or more of these subtasks to fulfill the system goals and objectives; for example, Location NEs may have sub-types such as City, Country, River or Road.

The Arabic language, an important member of the family of Semitic languages, is the official language of Arab world countries, where more than 300 million people speak Arabic in their daily life [8, 9]. It is the language of the Holy Quran. As such, every Muslim around the world uses Arabic in their daily prayer and worship. Arabic is one of the richest natural languages in the world in terms of morphology [10, 11]. Interest in Arabic NLP has been gaining momentum in the last decade, and some NLP tasks have proven to be challenging, NER in particular, because of the language’s scarce resources [12] and critical aspects of the language such as high morphological ambiguity, absence of capitalization, common word/proper noun ambiguities and writing style [13].

NER for Arabic has been receiving increasing attention, yet there is still room for improvement in performance. Most Arabic NER systems have been developed using one of two approaches: the rule-based approach, notably the NERA system [14], and the ML-based approach, notably ANERsys 2.0 [15]. Each approach has its strengths and weaknesses. Arabic rule-based NER systems rely on handcrafted grammatical rules acquired from linguists. Therefore, any maintenance applied to rule-based systems is labour-intensive and time-consuming, especially if linguists with the required knowledge and background are not available. In contrast, ML-based NER systems utilize learning algorithms that make use of a selected set of features extracted from datasets annotated with NEs for building predictive NER classifiers. The main advantage of ML-based NER systems is that they are adaptable and updatable with minimal time and effort as long as sufficiently large datasets are available. Recently, interest in a hybrid Arabic NER approach has emerged, as it achieves better overall performance than either the rule-based approach or the ML-based approach applied independently, and it overcomes the weaknesses inherent in each approach.
In this article, we report the results of our research project, which investigates the integration of the rule-based approach with the ML-based approach to form a novel hybrid NER pipeline. We apply this approach to the Arabic language and then evaluate its performance. With pipelining, the Arabic NER task is divided into two stages that are solved sequentially, such that the output of the first stage is processed and then fed as input into the next. Consequently, the proposed system consists of two components: (a) a rule-based NER component that produces NE labels based on lists of NEs/keywords and contextual rules; and (b) an ML-based post-processor intended to make use of language-specific and language-independent features. The feature set also includes rule-based features extracted from datasets annotated with NEs that are recognized by the rule-based component, aiming at enhancing the overall performance of the NER task. To the best of our knowledge, no other Arabic NER research uses a hybrid architecture that is domain-independent, recognizes 11 NE types, explores a large set of features, including contextual features derived from rule-based decisions, and has gone through rigorous testing on standard datasets.

Our early hybrid Arabic NER research was aimed at recognizing three types of NEs in Arabic texts: Person, Location and Organization. As a proof of concept, only the Decision Trees ML technique was used within the hybrid system. This technique was applied to a limited set of selected features. The experimental results were promising and confirmed the quality of the prototype [16]. As a continuation of the Arabic hybrid NER research, we enhanced our system, aiming at the recognition of 11 different types of NEs, namely: Person, Location, Organization, Date, Time, Price, Measurement, Percent, Phone Number, ISBN and File Name.
Moreover, we extended the ML feature space to include linguistic (morphological) and non-linguistic (contextual) features. In addition to Decision Trees [17], we investigated two more ML algorithms: Support Vector Machines [18] and Logistic Regression [19]. An extended group of features was used to evaluate the system on wider standard and collected datasets. The experimental results were indicative of a better system performance and comparable to the state-of-the-art systems [20, 21]. Thereafter, more experiments and analysis of results were conducted in order to assess the quality of the hybrid Arabic NER system by means of standard evaluation metrics and precision–recall curve. Throughout this paper, we introduce the development of a robust and consolidated hybrid NER system, and discuss extensive experiments conducted for studying the impact of different parameters and new features on the Arabic NER system’s performance. Furthermore, a deep analysis of the results is presented using a precision–recall curve, which gives a more informative picture of the system’s performance. The rule-based component is a reproduction of NERA, an Arabic rule-based NER system developed by Shaalan and Raza [14], with enhancements that improve its performance. Two types of linguistic resources are collected and acquired: gazetteers (predefined lists of NEs or keywords) and corpora (training and evaluation datasets). The developed linguistic resources have gone through verification and preparation phases before applying NER. The annotated dataset is presented to the ML-based component through a set of features, which is selected in such a way as to optimize the performance of the ML component as much as possible. The 11 types of NEs are distributed among three groups of NE types according to their nature. We conducted extensive evaluation experiments in which the baseline was the performance of only the rule-based component. 
At the level of the hybrid system, experiments are subdivided along three dimensions: the NE type, the ML classifier used, and the inclusion/exclusion of feature groups, with the rule-based features included as one of the feature groups. Feature groups were examined to study their effect on the overall performance when the whole feature space was considered, as well as for different combinations of all-but-one feature groups. Experimental results show performance gains over the state of the art when applied to the same standard dataset. Finally, the use of hybrid approaches combining ML and rule-based approaches yields higher scores.

The structure of the article is as follows. Section 2 gives background on Arabic language characteristics. A literature review of NER is given in Section 3. Section 4 describes the method followed for data collection. Section 5 illustrates the architecture of the proposed NER system and then describes in detail the rule-based and ML-based components. The experimental results are discussed in Section 6. Section 7 concludes this article and gives directions for future work.
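The two-stage pipeline described in this introduction can be illustrated with a minimal Python sketch. It is not the implementation evaluated in this article: the function names, the feature names, the one-token window and the toy gazetteer are all assumptions made for illustration, and the rule-based stage is reduced to a plain gazetteer lookup. The point shown is only how the decisions of the first stage become features for the second stage.

```python
from typing import Callable, Dict, List

Label = str  # e.g. "PERSON", "LOCATION", or "O" for tokens outside any NE


def rule_based_stage(tokens: List[str], gazetteer: Dict[str, Label]) -> List[Label]:
    """Stage 1 (stand-in): label tokens by exact gazetteer lookup only."""
    return [gazetteer.get(tok, "O") for tok in tokens]


def build_features(tokens: List[str], rule_labels: List[Label],
                   window: int = 1) -> List[Dict[str, str]]:
    """Stage 2 input: one feature dict per token, combining word-level
    features with the rule-based decisions for the token and its neighbours."""
    rows = []
    for i, tok in enumerate(tokens):
        row = {"word": tok, "length": str(len(tok))}
        for offset in range(-window, window + 1):
            j = i + offset
            inside = 0 <= j < len(tokens)
            # Hypothetical feature name; positions outside the sentence get "PAD".
            row[f"rule_label[{offset:+d}]"] = rule_labels[j] if inside else "PAD"
        rows.append(row)
    return rows


def hybrid_pipeline(tokens: List[str], gazetteer: Dict[str, Label],
                    classifier: Callable[[Dict[str, str]], Label]) -> List[Label]:
    """Run stage 1, turn its output into features, and let stage 2 decide."""
    rule_labels = rule_based_stage(tokens, gazetteer)
    return [classifier(row) for row in build_features(tokens, rule_labels)]
```

Any trained classifier that maps a feature dictionary to a label could be passed as `classifier`; in a first experiment, a function that simply returns `row["rule_label[+0]"]` reproduces the rule-based baseline.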
2. Arabic language characteristics
Applying NLP tasks in general and the NER task in particular is very challenging when it comes to Arabic because of its peculiarities and unique nature. The main characteristics of Arabic that pose non-trivial challenges for the NER task are as follows:

• No capitalization – capitalization is not a feature of Arabic script, unlike Latin languages where an NE usually begins with a capital letter. Therefore, the usage of the capitalization feature is not an option in Arabic NER. However, the English translation of Arabic words may be exploited as a feature indicator in this respect [22].
• The agglutinative nature – Arabic has a high agglutinative nature in which a word may consist of prefixes, lemma and suffixes in different combinations, which results in a very complicated morphology [23].
• Optional short vowels – in theory, short vowels, or diacritics, are needed for pronunciation and disambiguation. However, practically, most modern standard Arabic texts do not include diacritics, and therefore, a surface form of a word in Arabic may refer to two or more different words or meanings according to the context it appears in, creating a one-to-many ambiguity.
• Spelling variants – in Arabic script, the word may be spelled differently and still be the same word with the same meaning, creating a many-to-one ambiguity. For example, the word ‘جرام’, jrAm¹, ‘Gram’, can also be written as ‘غرام’, grAm, with the same meaning.
• Lack of linguistic resources – in general, there is a limitation on the number of Arabic linguistic resources that are publicly and freely available for research purposes. In particular, many of the available corpora neither are annotated with NEs nor include a sufficient number of NEs, which makes them unsuitable for the Arabic NER task. The Arabic gazetteers are rare as well and limited in size. Therefore, researchers tend to spend considerable effort annotating/acquiring and verifying their own Arabic linguistic resources in order to train and test their proposed Arabic NER systems.

Consequently, the issues that need to be resolved when building a system for Arabic NER relate to the ambiguity inherent in the language, the complex nature of its morphology, typographical errors or spelling variants that are widely accepted by Arabic writers, and the availability of resources.
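The many-to-one ambiguity of spelling variants and the optional diacritics discussed in this section are often handled by normalizing tokens before any lookup. The following Python sketch shows one possible way to do this and is not taken from the system described in this article: the variant table contains only the ‘Gram’ pair given above, the choice of which spelling is canonical is arbitrary, and removing every Unicode nonspacing mark is an assumed simplification of diacritic stripping.

```python
import unicodedata
from typing import Dict

# The two spellings of 'Gram' from the text, written with escapes for clarity.
GRAM_J = "\u062c\u0631\u0627\u0645"   # spelling transliterated as jrAm
GRAM_GH = "\u063a\u0631\u0627\u0645"  # spelling transliterated as grAm

# Hypothetical variant table: each known variant maps to one canonical form.
VARIANTS: Dict[str, str] = {GRAM_GH: GRAM_J}


def strip_diacritics(token: str) -> str:
    """Drop nonspacing marks (Unicode category Mn), which include the
    Arabic short-vowel signs."""
    return "".join(ch for ch in token if unicodedata.category(ch) != "Mn")


def normalize(token: str, variants: Dict[str, str] = VARIANTS) -> str:
    """Strip diacritics, then map a known spelling variant to its canonical form."""
    bare = strip_diacritics(token)
    return variants.get(bare, bare)
```

With such a step in place, a gazetteer needs to store only the canonical spelling of each entry, provided that the input text and the gazetteer are normalized in the same way.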
3. Literature review of NER
NER serves to achieve two main goals: the detection of NEs and the classification of those NEs into different predefined types. Three types of approaches are used to fulfill those two goals: the rule-based, the ML-based and the hybrid approaches.
3.1. Rule-based NER
Rule-based NER systems depend on local handcrafted linguistic rules to identify NEs within texts using linguistic and contextual clues and indicators [9]. Such systems exploit gazetteers/dictionaries as auxiliary clues to the rules. The rules are usually implemented in the form of regular expressions or finite-state transducers [7]. The maintenance of rule-based systems is not a straightforward process since experienced linguists need to be available to make the proper adjustments to the system [25]. Thus, any adjustment to such systems would be labour-intensive and time-consuming.

Maloney and Niv [26] have presented the TAGARAB system, which is one of the early attempts to tackle Arabic NER. It is a rule-based system where a pattern matching engine is combined with a morphological tokenizer to recognize Person, Organization, Location, Number and Time NEs. The empirical results show that combining an NE recognizer with a morphological tokenizer outperforms the individual NE recognizer in terms of accuracy when applied to random datasets from Al-Hayat newspaper.
Maynard et al. [27] introduced the MUSE system, which is a grammar-based NER system implemented in GATE framework. MUSE is composed of a tokenizer, gazetteers and grammars. After a preliminary analysis of the input text, the set of processing resources is manually manipulated in order to extract NEs based on the text type. The evaluation was conducted on English datasets to extract NEs of various types, including Person, Date, Location, Organization, Time, Money, Percent, Email, URL, Telephone, IP and Identifier. Riaz [28] introduced a rule-based system which tackles the task of Urdu NER. Urdu is an Arabic script language but differs in morphology and syntax. Owing to the lack of large annotated datasets, linguistic rules were constructed instead of training a statistical model to extract NEs of the following types: Person, Location, Organization, Date and Number. The system outperforms the Conditional Random Fields-based NER systems proposed in IJCNLP 2008 as reported by Riaz [28]. Mesfar [7] developed an Arabic component under a NooJ linguistic development environment to enable Arabic text processing and NER. The NER component consists of a tokenizer, morphological analyser and NE finder. The NE finder exploits a set of gazetteers and indicator lists to support rule construction. The system identifies NEs of the following types: Person, Location, Organization, Currency and Temporal expressions. The system utilizes the morphological information to extract unclassified proper nouns and thereby enhance the overall performance of the system. Another work adopting the rule-based approach for NER is the one developed by Shaalan and Raza [29] called PERA. PERA is a grammar-based system which was built for identifying Person Names in Arabic scripts with high degree of accuracy. PERA is composed of three components: gazetteers, local grammar and a filtration mechanism. 
Whitelists of complete Person Names are provided in the gazetteer component in order to extract the matching names regardless of the grammar. Afterwards, the input text is presented to the local grammar, which is in the form of regular expressions, to identify the rest of Person NEs using gazetteers and indicators. Finally, the filtration mechanism is applied on NEs detected through certain grammatical rules in order to exclude ambiguous and invalid NEs. PERA achieved satisfactory results when applied to the ACE and Treebank Arabic datasets. As a continuation of Shaalan and Raza’s [29] research work, the NERA system was introduced [12, 14]. NERA is a rule-based system that is capable of recognizing NEs of 10 different types: Person, Location, Organization, Date, Time, ISBN, Price, Measurement, Phone Numbers and Filenames. The implementation of the system was in the FAST ESP framework, where the system has three components, as the PERA system, with the same functionalities to cover the 10 NE types. The authors have constructed their own corpora from different resources in order to have a representative number of instances for each NE type. The rule-based component in our proposed hybrid system is a reproduction of the NERA system and the evaluation is applied to standard datasets in order to provide results that can be compared with existing systems’ performance. Elsebai et al. [30] proposed a rule-based NER system that integrates pattern matching with morphological analysis to extract Person Names from Arabic text. The pattern matching engine utilizes lists of keywords without using predefined lists of Person Names. The performance of the system was comparable to PERA results despite the fact that the PERA system is evaluated using datasets different from the ones used in Elsebai et al. [30], which are news articles from Aljazeera website. Aboaoga and Aziz [31] introduced a rule-based Arabic NER system that extracts Person Names from Arabic text. 
The domains covered by the system are sports, politics and economics. The authors have developed four main rules composing the core of the system. The system goes through three steps for Arabic person name recognition: (a) preprocessing (tokenization, data cleaning and sentence splitting); (b) automatic NE tagging, where predefined lists of person names and keywords are used to annotate person NEs and keywords; and (c) applying the rules to the text in order to extract person names that do not exist in the built-in dictionaries. The dataset used for evaluation was collected from different online Arabic newspapers. The empirical results show that the system achieves higher accuracy in the sports domain than in the politics and economics domains.

Zaghouani [32] introduced a rule-based Arabic NER system (RENAR) to extract Person, Location and Organization NEs. The system is composed of three phases: (a) morphological preprocessing; (b) looking up known NEs; and (c) using local grammar to extract unknown NEs. According to the empirical results, RENAR outperforms ANERsys 1.0 [33], ANERsys 2.0 [15] and LingPipe [34] in extracting Location NEs when applied to the ANERcorp dataset, while LingPipe outperforms RENAR in extracting Person and Organization NEs.
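Several of the rule-based systems reviewed in this subsection share a common pattern: a lookup of known names in a gazetteer, a local grammar written as regular expressions that uses indicator keywords to find unknown names, and a filter that discards invalid candidates. The Python sketch below illustrates that pattern only. It does not reproduce the rules of PERA, NERA or any other cited system; the name list, the trigger keywords and the rejection list are invented, and English placeholders are used instead of Arabic for readability.

```python
import re
from typing import List, Set, Tuple

KNOWN_PERSONS: Set[str] = {"Khaled Shaalan"}           # whitelist gazetteer (illustrative)
PERSON_TRIGGERS: Tuple[str, ...] = ("Dr", "President")  # indicator keywords (illustrative)
REJECTED: Set[str] = {"of", "the"}                      # filter list (illustrative)

# Local grammar: a trigger keyword, an optional full stop, then one word token.
TRIGGER_RULE = re.compile(
    r"\b(?:" + "|".join(re.escape(t) for t in PERSON_TRIGGERS) + r")\.?\s+(\w+)"
)


def find_persons(text: str) -> List[Tuple[str, str]]:
    """Return (name, source) pairs, where source is 'gazetteer' or 'rule'."""
    found: List[Tuple[str, str]] = []
    # Step 1: whitelist lookup, independent of the grammar.
    for name in sorted(KNOWN_PERSONS):
        if name in text:
            found.append((name, "gazetteer"))
    # Step 2: local grammar for names that are not in the gazetteer.
    for match in TRIGGER_RULE.finditer(text):
        candidate = match.group(1)
        already_found = any(candidate in name for name, _ in found)
        # Step 3: filtration of invalid or duplicate candidates.
        if candidate not in REJECTED and not already_found:
            found.append((candidate, "rule"))
    return found
```

The sketch also makes the maintenance cost noted above concrete: every new indicator keyword or exception has to be added to these lists by hand.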
3.2. Machine learning-based NER
ML-based NER systems take advantage of ML algorithms in order to learn NE tagging decisions from annotated texts. The most common approach used in ML-based NER is the Supervised Learning (SL) approach, which represents the NER problem as a classification task and requires the availability of sufficiently large annotated datasets. Among the most common SL algorithms utilized for NER are Support Vector Machines (SVM), Conditional Random Fields (CRF), Maximum Entropy (ME), Hidden Markov Models (HMM) and Decision Trees [1].

Zhou and Su [35] adopted HMM as the ML approach in their NER system, which is able to recognize different types of NEs, including named, time and numerical expressions, from English text. The utilized feature set includes features of two categories: internal evidence and external evidence. The internal evidence focuses on word-level features such as InitialCaps, semantic features such as SuffixPercent (e.g. %), and gazetteer features that determine the presence of the candidate word in one or more of the system gazetteers. On the other hand, the features representing the external evidence are the contextual features that can be derived from the context to extract the relationships between previously recognized entities and the candidate word within the input text. The system was applied on the MUC-6 and MUC-7 English shared tasks and the results show that the performance reaches its highest rates when all the features from both categories are considered by the system.

Mayfield et al. [25] proposed a language-independent statistical-based system for NER. The system exploits the SVM approach to handle a large number of features with minimal overfitting. Examples of the features are word length, POS tag, whether the word is between double quote marks or not, and whether a dash exists in the word or not. The features of the surrounding words contribute to the learning and classification processes as well. The window size is seven words centred on the targeted word. The system was applied on the CoNLL2003 English and German datasets for training and evaluation.

Finkel and Manning [36] introduced a new approach to enhance the recognition of nested NEs. NEs are nested if one or more NEs occur within the structure of another NE.
The proposed approach relies on representing the sentences within a text using constituency parse trees where POS tags are included, and then a discriminative constituency parser (i.e. CRF-based) is used to generate the classification models. Each node within the tree needs to be labelled with its parent and grandparent. The system, which is domain-independent, was applied on the GENIA v.3.02, JNLPBA and AnCora datasets. The empirical results show that the overall performance of the new approach is better than that of the semi-CRF model in terms of the standard evaluation metrics when used to extract NEs of any level, that is, either nested or normal NEs.

Benajiba et al. [33] developed an Arabic NER system, ANERsys 1.0, which uses ME. The authors built their own linguistic resources, which have become a de facto standard in Arabic NER literature: ANERcorp (i.e. an annotated corpus) and ANERgazet (Person, Location and Organization gazetteers). The features used by the system are lexical, contextual and gazetteer features. The system can recognize four types of NEs: Person, Location, Organization and Miscellaneous. ANERsys 1.0 had difficulties detecting NEs that are composed of more than one token/word; hence Benajiba and Rosso [15] developed ANERsys 2.0, which employs a two-step mechanism for NER: (a) detecting the start and the end points (boundaries) of each NE; and (b) identifying the NE type.

Benajiba and Rosso [37] applied CRF instead of ME as an attempt to improve the performance. The feature set used in ANERsys 2.0 was used in the CRF-based system. The features are POS tags and base phrase chunks (BPC), gazetteers and nationality. The CRF-based system achieves higher results in terms of accuracy. Benajiba et al. [38] developed another NER system based on SVM. The features used are contextual, lexical, morphological, gazetteers, POS tags and BPC, nationality and the corresponding English capitalization.
The system was evaluated using ACE Corpora and ANERcorp. The best results were achieved when all the features were considered. A simplified feature set was proposed by Abdul-Hamid and Darwish [39] to be utilized in Arabic NER. They relied on CRF to recognize three types of NEs: Person, Location and Organization. The system considers only surface features (i.e. leading and trailing character n-gram, word position, word length, word unigram probability, the preceding and succeeding words n-gram and character n-gram probability) without taking into consideration any other type of features. The system was evaluated using ANERcorp and ACE2005 dataset. The results show that the system outperforms the CRF-based NER system of Benajiba and Rosso [37]. Benajiba et al. [40] investigated the sensitivity of different NE types to various types of features. They built multiple classifiers for each NE type, adopting SVM and CRF approaches. ACE datasets were used in the evaluation process. According to their findings, it cannot be stated whether CRF is better than SVM or vice versa in Arabic NER. Each NE type is sensitive to different features and each feature plays a role in recognizing the NE in different degrees. Further studies, such as Benajiba et al. [3, 41], have also confirmed the importance of considering language-independent and language-specific features in Arabic NER. Mohammed and Omar [42] proposed an ML-based system, which adopts Artificial Neural Networks (ANN) approach, to tackle the problem of Arabic NER. The system has three main phases: (a) preprocessing of the data (data cleaning, text tokenization and tagging using Inside-Outside-Beginning (IOB) schema); (b) transforming the Arabic letters to Roman letters; and (c) applying the ANN classifier to the text. The system is capable of recognizing four NE classes: Person, Location, Organization and Miscellaneous. The ANERcorp dataset was used for evaluation. 
The authors compared the performance of the system when decision trees and ANN approaches were used individually. The relationship between the corpus size and the accuracy of the system was examined as well. According to the experimental results, the NER system achieves higher accuracy with the ANN approach than with the decision trees approach. The results also show that the accuracy improves as the size of the corpus increases.

AbdelRahman et al. [23] integrated two ML approaches to handle Arabic NER: CRF and bootstrapping pattern recognition. The feature set used with the CRF classifier included word-level features, POS tag, BPC, gazetteers and morphological features. The system was developed to extract 10 types of NEs: Person, Location, Organization, Job, Device, Car, Cell Phone, Currency, Date and Time. The results show that the system outperforms the LingPipe NE recognizer when both are applied to the ANERcorp dataset.

Farber et al. [22] suggested the integration of a morphological tagger with an Arabic NER system. The authors claim that the morphological information produced by a morphological tagger is significant for Arabic NER. The output of a morphological tagger is utilized as features to be used by the NE classifier in order to recognize NEs of the types Person, Organization and Geo-political entities. The system adopts the structured perceptron approach [43] for Arabic NER. The empirical results show that the morphological features improve the performance of an NER system for Arabic.
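Two recurring ingredients of the ML-based systems reviewed in this subsection are the Inside-Outside-Beginning (IOB) tagging schema, which encodes both the boundaries and the type of each NE at token level, and a fixed-size window of surrounding words used as context. The Python sketch below illustrates both. It is not code from any of the cited systems: the span format, the padding symbol and the English example tokens are assumptions, and only the default window size of seven follows the description of Mayfield et al. [25].

```python
from typing import List, Tuple


def to_iob(tokens: List[str], spans: List[Tuple[int, int, str]]) -> List[str]:
    """Convert NE spans (start, end_exclusive, type) over token indices to IOB tags."""
    tags = ["O"] * len(tokens)
    for start, end, ne_type in spans:
        tags[start] = f"B-{ne_type}"      # first token of the entity
        for i in range(start + 1, end):
            tags[i] = f"I-{ne_type}"      # remaining tokens of the entity
    return tags


def context_window(tokens: List[str], index: int, size: int = 7,
                   pad: str = "<PAD>") -> List[str]:
    """Return `size` tokens centred on `index`, padded at sentence edges."""
    half = size // 2
    return [tokens[j] if 0 <= j < len(tokens) else pad
            for j in range(index - half, index + half + 1)]
```

In this encoding a multi-token NE is a `B-` tag followed by `I-` tags of the same type, so the boundary detection and type identification steps of a two-step system such as ANERsys 2.0 can both be read off the same tag sequence.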
3.3. Hybrid NER
The hybrid approach is the integration of the rule-based approach and the ML-based approach [44]. The direction of the processing flow may be from the rule-based system to the ML-based system or vice versa. To the best of our knowledge, we are the first to exploit the hybrid approach in NER for Arabic, aiming to improve the system’s overall performance and overcome the weaknesses of the integrated approaches when they are applied individually. However, in the literature, the hybrid approach has been utilized in NER for other languages, such as English and French.

Srihari et al. [45] proposed a hybrid NER system that exploits two ML algorithms and pattern matching rules to extract NEs of several types, including Person, Location, Organization, Date, Money, Time and Percentage, from English text. The system is composed of four main modules. The first module consists of pattern-matching rules, which are used to extract temporal and numerical expressions. The second module is a Maximum Entropy model that was built to utilize the gazetteers and the contextual information in order to provide a preliminary recognition of Person and Location NEs. The third module is the HMM classifier, which gives the final tagging for the entities of the types Person, Location and Organization. The final module is another Maximum Entropy model, produced to identify further sub-categories such as Government and Airport. The system was evaluated using the MUC-7 dataset and the results show that using the hybrid approach improves the performance in terms of accuracy. Our hybrid approach differs from this approach mainly in two aspects. First, the number of modules within their architecture requires more processing. Our hybrid system is composed of only two main modules: rule- and ML-based modules. Second, each module or approach in their system handles only a limited set of NE types.
In our hybrid system, we cover a reasonable number of NEs, that is, 11 NE types, in a pipeline processing of rule-based and ML-based NER. The output of the rule-based NER in our approach is utilized as contextual features, which is used with other language-specific and language-independent features in the ML-based NER. Seon et al. [46] introduced a hybrid approach to tackle the problem of NER for Korean. Two ML approaches are used by the system: Maximum Entropy and Neural Networks. The Maximum Entropy is utilized to resolve the problem of unknown words that do not exist in any of the predefined dictionaries within the system, while Neural Networks use lexical information to deal with ambiguity. The pattern-selection rules are exploited to just combine adjacent words that comprise a multiword NE. The system was developed to extract NEs of the types Person, Location and Organization. In comparison, the authors utilized a pure ML learning approach for the actual NER while the role of the rule-based postprocessing is not NER but rather setting the boundary of the identified NEs by combining them, which might be suitable for Korean NER. In our hybrid NER, the two NER rule-based and ML-based components cooperate to produce final annotation and each of them can be used independently. Mencius is a hybrid NER system for Chinese proposed by Tsai et al. [47]. The rule-based and ML-based methods are combined in Mencius in order to enhance the recognition capabilities for Person, Location, and Organization NEs. The output of the rule-based module is used as an input to the ML-based module, which is a Maximum Entropy model. The feature set consists of internal (i.e. category-dependent) features utilized to distinguish between different NE categories. The authors built their own corpus to evaluate the performance of their system. According to the empirical results, the highest performance was achieved when the hybrid system was used along with word tokenization. 
Our proposed hybrid approach differs from the Mencius approach in two aspects: the architecture of the hybrid system and the selected feature set. Mencius is heavily dependent on the output of the rule-based module and uses it to extract the features used by the ML module, while our hybrid system does not rely only on the rule-based system’s output in extracting the features.
Moreover, the feature set of Mencius is composed of only NE category-dependent features, while the structure of our feature set includes rich features consisting of language-independent and language-specific features of different types, such as morphological, contextual and word-level features.

Küçük and Yazıcı [48] presented a hybrid NER system for Turkish. The main module in the system is the rule-based module, which includes knowledge sources for specific domains. The original rule-based system is extended to a hybrid NER system which learns from annotated text with the purpose of improving the knowledge sources accordingly. The authors developed their own corpus for learning and evaluation purposes. The rule-based module contains dictionaries of Person, Location and Organization NEs along with pattern bases for the recognition of the aforementioned NE types as well as numerical and temporal expressions. The system utilizes rote learning as the ML technique. According to the empirical results, the hybrid NER system for Turkish outperforms the pure rule-based NER system in terms of accuracy.

Our proposed hybrid approach differs from this approach in several aspects. The flow of processing in Küçük and Yazıcı’s approach [48] goes from the ML-based module to the rule-based module, while it is the converse in our hybrid approach. In the authors’ system, the rule-based module is heavily dependent on the output of the ML module, while the modules in our system can function separately and do not heavily rely on each other’s output. Our hybrid system is domain-independent, while the authors’ system is built with the capability of supporting four specific domains – news, financial news, historical news and child stories.
The authors’ hybrid approach is mainly a rule-based approach which employs an ML approach for the purpose of enhancing the knowledge sources within the rule-based module to support new text genres when annotated text is available, while our architecture is a true hybrid in the sense that it does not favour one approach over the other.

Petasis et al. [44] introduced a hybrid method for the purpose of facilitating the maintenance of their rule-based NER system. The role of the ML-based component is to evaluate the performance of the rule-based NER system and, accordingly, to decide whether the rule-based NER system needs maintenance. In the training phase, the ML-based system relies only on the annotated output of the rule-based system; hence, there is no need to train on a manually tagged dataset. Then, the two systems are applied on a new dataset and their results are compared. Disagreements in the recognition output indicate the need to update the rule-based system. The ML technique adopted is C4.5, which implements the decision tree ML method. Greek and French rule-based systems are evaluated using this approach. The systems can recognize NEs of the types Person, Location and Organization. The experimental analysis shows that this method can help in deciding when to maintain a rule-based NER system.

Our hybrid approach differs from this approach mainly in two aspects. First, the authors’ approach is mainly a rule-based approach, which employs an ML approach for the purpose of enhancing the performance of the rule-based approach, while our architecture is hybrid and does not favour one approach over the other. Second, the authors’ approach restricts the feature extraction to the rule-based output (tagging) in order to avoid the requirement of a large annotated corpus, which is necessary for machine learning.
Our hybrid approach does not rely only on the rule-based system’s output in extracting language-independent and language-specific features of different types such as morphological, contextual and word-level features. We only use the rule-based decisions in order to extract the contextual features, which are then combined with the rest of the features.
4. Data collection
Various linguistic resources are necessary in order to develop the proposed Arabic hybrid NER system with a scope of 11 different categories of NEs. The linguistic resources are of two main categories: corpora and gazetteers. The corpora utilized in this research are a combination of licensed and free linguistic resources. In the literature, they are commonly used for evaluation as well as comparison with existing systems. The licensed linguistic resources² are the Automatic Content Extraction (ACE) corpora and the Arabic Treebank (ATB) Part1 version 2.0 dataset. The free linguistic resource is the ANERcorp³ dataset, which is freely available for research purposes. We have also built our own corpus to train and evaluate our system in order to identify certain types of NEs that were not sufficiently covered previously, in particular, file names, phone numbers and ISBN numbers. The dataset files have been prepared and transformed (i.e. tagged) using our tag schema and in XML format. Figure 1 illustrates a sample from the ANERcorp dataset before (on the left) and after (on the right) transformation. Our tag schema includes the following 11 named entity tags:
<Person>Entity</Person>
<Location>Entity</Location>
<Organization>Entity</Organization>
<Date>Entity</Date>
<Time>Entity</Time>
<Price>Entity</Price>
<Measurement>Entity</Measurement>
<Percent>Entity</Percent>
<PhoneNumber>Entity</PhoneNumber>
<ISBN>Entity</ISBN>
<FileName>Entity</FileName>
Figure 1. Sample of ANERcorp dataset before and after transformation.
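As an illustration of how text annotated with this tag schema could be read back, the following is a minimal sketch in Python. It is not part of the described system: the function name `extract_entities`, the assumption that tags are written without internal spaces and are never nested, and the transliterated sample sentence are all hypothetical.

```python
import re

# The 11 tag names of the schema described above.
NE_TAGS = ["Person", "Location", "Organization", "Date", "Time", "Price",
           "Measurement", "Percent", "PhoneNumber", "ISBN", "FileName"]

# Matches <Tag>text</Tag>; assumes tags are well formed and never nested.
TAG_PATTERN = re.compile(r"<(" + "|".join(NE_TAGS) + r")>(.*?)</\1>", re.DOTALL)


def extract_entities(tagged_text):
    """Return (tag, entity text) pairs in the order they appear."""
    return [(m.group(1), m.group(2).strip())
            for m in TAG_PATTERN.finditer(tagged_text)]


# Invented sample sentence (transliterated), not taken from any corpus.
sample = ("wSl <Person>Ahmd</Person> AlY <Location>dby</Location> "
          "ywm <Date>AlAvnyn</Date>")
print(extract_entities(sample))
```

A reader working with real data would more likely use a proper XML parser; the regular expression is chosen here only to keep the sketch short.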
The three ACE corpora used in this research are ACE 2003–2005 Multilingual Training Corpora. ACE 2003 Corpus is available under licence from LDC with catalogue number LDC2004T09 and ISBN 1-58563-292-9 [49]. ACE 2004 Corpus [50] is available under licence from LDC with catalogue number LDC2005T09 and ISBN 1-58563-334-8. ACE 2005 Corpus [51] is available under licence from LDC with catalogue number LDC2006T06 and ISBN 1-58563-376-3. The ACE training datasets covered are Newswire (NW) and Broadcast News (BN). ANERcorp is an annotated dataset provided by Yassine Benajiba [33]. ATB Part1 version 2.0 dataset [52] is available under licence from LDC with catalogue number LDC2003T06 and ISBN 1-58563-261-9. ATB Part1 version 2.0 dataset has no NE annotations and was originally designed to support POS tagging in Arabic NLP. Therefore, we manually annotated the ATB dataset in order to support Arabic NER. Our study indicates that the listed datasets suffer from sparseness when it comes to NEs of the types Phone Number, ISBN and File Name, and hence are not suitable for training and/or testing the proposed Arabic NER system. In order to have a dataset with a representative number of NEs of certain types, including Phone Number, ISBN and File Name, we acquired our own corpus from different resources and did the manual tagging ourselves. In this study, the total number of annotated NEs covered by all datasets is 23,929, as demonstrated in Table 1. Another type of linguistic resource used is the gazetteers (or dictionaries). The gazetteers for Person, Location and Organization were collected from Shaalan and Raza [14], while the gazetteers for the rest of the NE types were prepared as part of this research. Examples of Time gazetteers’ entries are given in Table 2 with Latin transliteration and English translation. The total number of NEs/keywords covered by all gazetteers is 19,328.
Table 1. The number of named entities in each reference dataset.

| NE type | ACE 2003 NW | ACE 2003 BN | ACE 2004 NW | ACE 2004 BN | ACE 2005 NW | ACE 2005 BN | ANERcorp | ATB P.1 v. 2 | Our corpus | Total |
|---|---|---|---|---|---|---|---|---|---|---|
| Person | 711 | 517 | 1865 | - | - | - | 3602 | - | - | 6695 |
| Location | 1292 | 1073 | 3449 | - | - | - | 4425 | - | - | 10239 |
| Organization | 493 | 181 | 1313 | - | - | - | 2025 | - | - | 4012 |
| Date | 58 | 20 | 357 | 67 | 154 | 37 | - | 431 | - | 1124 |
| Time | 15 | 1 | 28 | 4 | 20 | 7 | - | 80 | - | 155 |
| Price | 17 | 3 | 105 | 36 | 163 | 9 | - | 168 | - | 501 |
| Measurement | 28 | 14 | 51 | 30 | 60 | 22 | - | 330 | - | 535 |
| Percent | 35 | 3 | 54 | 32 | 42 | 5 | - | 75 | - | 246 |
| Phone Number | - | - | - | - | - | - | - | - | 136 | 136 |
| File Name | - | - | - | - | - | - | - | - | 160 | 160 |
| ISBN | - | - | - | - | - | - | - | - | 126 | 126 |

Table 2. Examples of entries in Time gazetteers.

| Time gazetteer | Entry | Transliteration | Translation |
|---|---|---|---|
| Time Words | مساء | masaA’ | Evening |
| Time Words | صباح | SabaAH | Morning |
| Time Zones | توقيت غرينيتش | tawqiyt griyniytš | GMT |
| Time Zones | التوقيت المحلي | Altawqiyt AlmaHaliy | Local Time |
| Keyword Prefixes | قبل | qabl | Before |
| Keyword Prefixes | بعد | baBd | After |
| Sharp Time | تمامًا | tamaAmAa | Sharp |
| Sharp Time | بالضبط | biAlDabt | Exactly |
| Time Fractions | ربع | rubB | Quarter |
| Time Fractions | نصف | niSf | Half |

5. The system architecture
The rule- and ML-based NER approaches have their strengths and weaknesses. In this article, we propose a hybrid architecture that is demonstrably better than the rule-based or ML-based systems individually. Figure 2 illustrates the architecture of the proposed hybrid NER system for Arabic. The system consists of two sequential loosely coupled components: (a) a rule-based NER component that produces NE labels based on lists of NEs/keywords and contextual rules; and (b) an ML-based post-processor intended to make use of rule-based features extracted from datasets annotated with NEs that are recognized by the other component aiming at enhancing the overall performance of the NER task.

The processing goes through three main phases: (a) The rule-based NER phase; (b) the feature engineering, that is, the feature selection and extraction phase; and (c) the ML-based NER phase. The rule-based NER component takes the raw text as an input, applying the gazetteers and local grammar rules, respectively, and then produces an annotated version as an output. Thereafter, the rule-based annotations are presented to the ML-based component as a group of features, so called rule-based features. The resultant dataset is subsequently used to extract morphological features that are produced from running MADA [53–55]. In the feature engineering phase, additional types of features are selected and extracted, such as gazetteer features, capitalization and word length. Finally, the set of selected features out of the feature space are presented to the ML-based component in order to predict the final NE annotations for each word in the input text, which is the final output of the hybrid system. One of the research goals of this study is to determine the feature combination that achieves the highest results. Therefore, extensive experiments and results analysis are discussed in detail in this paper.

Figure 2. The overall architecture of the hybrid NER system. [Diagram: Input text → Rule-based component (exact matching against Whitelists/Gazetteers, parsing with Grammar Rules, filtering with the Blacklist) → Tagged text → Feature Extraction → Selected Features → ML-based component (ML method, Model (Classifier)) → Final tagged text.]
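The data flow through the three phases can be sketched as follows. This is only an illustrative skeleton under strong simplifying assumptions: the rule-based component is reduced to a dictionary lookup, the feature set to two toy features, and the trained model to a hypothetical callable. None of the function names come from the described system.

```python
def rule_based_tag(tokens, gazetteer):
    """Phase (a): stand-in for the rule-based component (gazetteer lookup only)."""
    return [gazetteer.get(tok, "OTHER") for tok in tokens]


def extract_features(tokens, rb_labels):
    """Phase (b): one feature dict per token; only two toy features here."""
    return [{"word": tok, "rb_label": lab} for tok, lab in zip(tokens, rb_labels)]


def hybrid_ner(tokens, gazetteer, classifier):
    """Phase (c): the classifier predicts the final label from the features."""
    rb_labels = rule_based_tag(tokens, gazetteer)
    features = extract_features(tokens, rb_labels)
    return [classifier(f) for f in features]


# Toy stand-ins: a one-entry gazetteer and a "model" that trusts the rules.
toy_gazetteer = {"AlmAnyA": "Location"}
toy_classifier = lambda feats: feats["rb_label"]

print(hybrid_ner(["fy", "AlmAnyA", "tuwAjih"], toy_gazetteer, toy_classifier))
```

The point of the sketch is the loose coupling: the classifier only sees feature dictionaries, so the rule-based labels are one input among others and can be dropped without changing the rest of the pipeline.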
Figure 3. Example of a Location NE rule implemented within the rule-based component.
Figure 4. Typical annotations in GATE interface when the rule-based system is applied on ACE 2003 BN dataset – temporal and numerical expressions are highlighted.
5.1. The rule-based component
The rule-based component is a reproduction of the NERA system [14] using the GATE framework⁴ [56]. The rule-based component consists of three main modules, as illustrated in Figure 2: Whitelists (lists of full names of different types of NEs), Grammar Rules and a Filtration mechanism (blacklists of invalid NEs). In the GATE framework, the rule-based component works as a corpus pipeline in which a group of documents comprising a corpus is processed through a set of processing tools, including an Arabic tokenizer, and resources, including a list of gazetteers and local grammatical rules (implemented as finite-state transducers for each NE type). The rule-based component is built with the capability of recognizing the aforementioned 11 NE types.

An example of the Location NE rules utilized by the rule-based component is illustrated in Figure 3. The rule can be described as follows: if the expression starts with an administrative division (i.e. an indicator word), which might begin with the preposition ‘بـ’ (in), and is followed by a location name such as a State Name, Country Name or any other Location, then the expression after the exclusion of the indicator word is identified as a Location NE. Examples of expressions matched by this rule are مملكة البحرين (Kingdom of Bahrain) and بامبراطورية اليابان (in the Empire of Japan), in which the recognized Location NEs are البحرين (Bahrain) and اليابان (Japan). Figure 4 demonstrates the GATE interface in which the rule-based component is applied on the ACE 2003 BN dataset to extract the temporal and numerical expressions.
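The logic of the Location rule can be approximated in a few lines of Python. This is a hedged sketch only: the rule in the system is implemented as a finite-state transducer in GATE, whereas the code below is plain Python, handles single-word location names only, and uses tiny hypothetical gazetteers (`INDICATORS`, `LOCATIONS`) built from the two examples in the text.

```python
# Hypothetical, tiny gazetteers; the real ones are far larger.
INDICATORS = {"مملكة", "امبراطورية"}   # administrative divisions
LOCATIONS = {"البحرين", "اليابان"}      # known location names
PREPOSITION = "ب"                        # attached preposition "in"


def find_locations(tokens):
    """Return location names that directly follow an indicator word."""
    found = []
    for current, following in zip(tokens, tokens[1:]):
        # Drop the attached preposition when the remainder is an indicator.
        if current not in INDICATORS and current.startswith(PREPOSITION):
            stripped = current[len(PREPOSITION):]
            if stripped in INDICATORS:
                current = stripped
        if current in INDICATORS and following in LOCATIONS:
            found.append(following)  # the indicator itself is excluded
    return found


example_1 = ["مملكة", "البحرين"]                      # Kingdom of Bahrain
example_2 = [PREPOSITION + "امبراطورية", "اليابان"]   # in the Empire of Japan
print(find_locations(example_1))
print(find_locations(example_2))
```

In both examples only the second token is returned, which mirrors the rule's exclusion of the indicator word from the recognized NE.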
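Table 3. The number of gazetteers and rules for each NE type.

| NE type | Number of gazetteers | Number of rules |
|---|---|---|
| Person | 11 | 9 |
| Location | 20 | 20 |
| Organization | 8 | 9 |
| Date | 12 | 7 |
| Time | 10 | 8 |
| Price | 8 | 4 |
| Measurement | 7 | 3 |
| Percent | 3 | 1 |
| Phone Number | 7 | 7 |
| File Name | 3 | 3 |
| ISBN | 1 | 2 |
| Total | 90 | 73 |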
Figure 5. The architecture of the feature extraction phase. [Diagram: the output of the rule-based component (tagged text) is passed to MADA for extracting morphological features; the Whitelists/Gazetteers are used for extracting the rest of the features of different types.]
Table 3 illustrates the number of gazetteers and rules implemented for each NE type. The system currently contains a total of 73 rules and 90 gazetteers.
5.2. The ML-based component
The ML-based component depends on two main aspects: feature engineering and the selection of ML techniques. Feature engineering involves selecting a combination of classification features from the entire feature space. Figure 5 illustrates the architecture of the feature extraction phase. The features explored are divided into six categories: rule-based features, which are derived from the rule-based decisions over a window of five words centred on the targeted word (which also makes them contextual features); morphological features (i.e. derived from morphological analysis decisions); POS features (although POS is a morphological feature, its impact on the recognition process is studied separately owing to its significant role in the classification); gazetteer features; contextual features; and word-level features. It is worth mentioning that the ML-based component itself does not necessarily depend on the rule-based features and can work without them, but we study their impact on the system’s performance as a hybrid approach. Feature engineering allows study of the impact of feature combinations on the overall system’s performance in different dimensions: the NE type, the ML classifier used and the considered types of features.

Three ML techniques were explored and examined in isolation on standard datasets in order to reach a conclusion on the best ML technique to utilize with the Arabic hybrid NER system given the selected set of features. The three techniques are Decision Trees, SVM and Logistic Regression. The first two techniques were chosen for their satisfactory performance in NER in general and Arabic NER in particular, while we decided to investigate the effect of the third technique on the proposed system’s performance. As illustrated in Figures 6 and 7, the output of the training phase (i.e. the classification model) is used in the prediction phase to generate the final annotation.
In this research, WEKA⁵ [57], a comprehensive workbench with support for a large number of ML algorithms, is utilized as the development environment of the ML-based component. The decision tree algorithm is applied using the J48 classifier, SVM with the Libsvm classifier and Logistic Regression with the Logistic classifier.
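The text does not state the file format of the feature set data file. Assuming it is ARFF, WEKA's native input format, a feature file restricted to the five rule-based window features could be produced roughly as follows. The attribute names, the relation name and the reduced label set are hypothetical choices made for this sketch.

```python
# Labels for the first NE group plus the non-entity label (assumed set).
NE_LABELS = ["Person", "Location", "Organization", "OTHER"]
# Hypothetical attribute names for the five-word rule-based window.
WINDOW = ["rb_n_minus_2", "rb_n_minus_1", "rb_n", "rb_n_plus_1", "rb_n_plus_2"]


def to_arff(rows, relation="ner_first_group"):
    """rows: iterable of (five rule-based labels, gold label) pairs."""
    nominal = "{" + ",".join(NE_LABELS) + "}"
    lines = ["@relation " + relation, ""]
    for name in WINDOW:
        lines.append("@attribute %s %s" % (name, nominal))
    lines.append("@attribute class " + nominal)  # the actual NE tag
    lines += ["", "@data"]
    for window_labels, gold in rows:
        lines.append(",".join(list(window_labels) + [gold]))
    return "\n".join(lines) + "\n"


rows = [(["OTHER", "OTHER", "Location", "OTHER", "OTHER"], "Location")]
print(to_arff(rows))
```

A file written this way could then be loaded into the WEKA workbench and used with any of the three classifiers named above.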
Figure 6. The architecture of the training phase. [Diagram: the output of the feature extraction phase (feature set data file) → machine learning classifier (J48, Libsvm or Logistic) → generated classification (statistical) model.]

Figure 7. The architecture of the prediction phase. [Diagram: the output of the feature extraction phase (feature set data file) → classification (statistical) model → the final annotated text.]

Table 4. Example of the rule-based features for a phrase from the ANERcorp dataset.

| Word | Transliteration | Translation | WordN–2 | WordN–1 | WordN | WordN+1 | WordN+2 |
|---|---|---|---|---|---|---|---|
| صناعة | SinABaħ | Industry | OTHER | OTHER | OTHER | OTHER | OTHER |
| السيارات | Als~ayArAt | Auto | OTHER | OTHER | OTHER | OTHER | Location |
| في | fy | In | OTHER | OTHER | OTHER | Location | OTHER |
| ألمانيا | ÂlmAnyA | Germany | OTHER | OTHER | Location | OTHER | OTHER |
| تواجه | tuwAjih | Faces | OTHER | Location | OTHER | OTHER | OTHER |
The 11 NE types are distributed among three groups according to their nature (each group has a distinct feature set):
• first group – Person, Location and Organization NEs (aka ENAMEX);
• second group – Date, Time, Price, Measurement and Percent NEs (aka TIMEX and NUMEX);
• third group – Phone Number, ISBN and File Name NEs. Although the first two NE types can be considered as NUMEX, they were included here because of the specific and limited nature of their rules and patterns.
The three groups have two kinds of classification features: generic and distinct features. The generic classification features are common among all groups and are described as follows:
• Rule-based features – these contextual features include the NE type predicted by the rule-based component for the targeted word as well as the NE types for the two immediate left and right neighbours of the candidate word, that is, NE type for a sliding window of size 5. Table 4 shows an example.
• Morphological features – the set of 13 features generated by MADA as shown in Table 5.
• POS tag – the POS tag of the targeted word estimated by MADA as shown in Table 5.
• Word length flag – a binary feature to represent whether the current word length ≥ 3.
• Dot flag – a binary feature to represent whether the word has an adjacent dot.
• Capitalization flag – a binary feature to represent the existence of capitalization information on the English gloss (translation) corresponding to the Arabic word. The English gloss is generated by MADA, and this feature is used as a possible indication of encountering an NE if the translation starts with a capital letter.
• Actual NE tag of the word – it is used along with other features for training the classification model. It is also used as a reference when calculating the accuracy scores.

Table 5. MADA morphological features with their description [58].

Number. Feature – Feature value definition
1. Aspect – Verb aspect: Command, Imperfective, Perfective or Not applicable (NA)
2. Case – Grammatical case: Nominative, Accusative, Genitive, NA or Undefined
3. Gender – Nominal gender: Feminine, Masculine or NA
4. Mood – Grammatical Mood: Indicative, Jussive, Subjunctive, NA or Undefined
5. Number – Grammatical number: Singular, Plural, Dual, NA or Undefined
6. Person – Person information: 1st, 2nd, 3rd or NA
7. State – Grammatical state: Indefinite, Definite, Construct/Poss/Idafa, NA or Undefined
8. Voice – Verb voice: Active, Passive, NA or Undefined
9. Proclitic 3 – Question proclitic: No proclitic (NP), NA or Interrogative Particle >a
10. Proclitic 2 – Conjunction proclitic: NP, NA, Conjunction fa, Response conditional fa, Subordinating conjunction fa, Conjunction wa, Particle wa or Subordinating conjunction wa
11. Proclitic 1 – Preposition proclitic: NP, NA, Particle bi, Preposition bi, Preposition ka, Emphatic Particle la, Preposition la, Response conditional la, Jussive li, Preposition li, Future marker sa, Preposition ta, Particle wa, Preposition wa, Preposition fy, Negative particle lA, Negative particle mA, Vocative yA, Vocative wA or Vocative hA
12. Proclitic 0 – Article proclitic: NP, NA, Determiner, Negative particle lA, Negative particle mA, Relative pronoun mA or Particle mA
13. Enclitics – Pronominal: No enclitic, NA, 1st person (plural|singular), 2nd person (dual|(feminine (plural|singular))|(masculine (plural|singular))), 3rd person (dual|(feminine (plural|singular))|(masculine (plural|singular))), Vocative particle, Negative particle lA, Interrogative pronoun (ma|mA|man), Relative pronoun (ma|mA|man) or Subordinating conjunction (ma|mA)
POS – POS definition: Nouns, Number Words, Proper Nouns, Adjectives, Adverbs, Pronouns, Verbs, Particles, Prepositions, Abbreviations, Punctuation, Conjunctions, Interjections, Digital Numbers or Foreign/Latin
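The sliding window behind the rule-based features can be sketched as follows, using the phrase from Table 4. The function name is hypothetical, and padding positions beyond the phrase boundary with OTHER is an assumption of this sketch, chosen because it is consistent with the values shown in Table 4.

```python
def rule_based_window(labels, size=2, pad="OTHER"):
    """For each word, return the rule-based labels of the window
    (WordN-2, WordN-1, WordN, WordN+1, WordN+2) when size == 2."""
    padded = [pad] * size + list(labels) + [pad] * size
    return [tuple(padded[i:i + 2 * size + 1]) for i in range(len(labels))]


# Rule-based decisions for: Industry, Auto, In, Germany, Faces (Table 4).
labels = ["OTHER", "OTHER", "OTHER", "Location", "OTHER"]
for row in rule_based_window(labels):
    print(row)
```

Each printed tuple corresponds to one row of Table 4; for instance, the row for the fourth word (Germany) carries Location in the WordN position.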
Distinctive features are related to each NE group. There are two distinct features that are used with the first group:
• nominal flag – a binary feature to represent whether POS tag is a Noun (or Proper Noun);
• check Person/Location/Organization Gazetteers feature flags – a binary feature to represent whether the word (or left/right neighbour of targeted word) belongs to Person/Location/Organization Gazetteer(s).

Similarly, there are two distinct features used with the second group:
• check POS feature flags – a binary feature to represent whether the POS tag is a Noun_num (i.e. literal number) or a proper noun;
• check Date/Time/Price/Measurement/Percent Gazetteers feature flags – a binary feature to represent whether the word (or left/right neighbour of targeted word) belongs to Date/Time/Price/Measurement/Percent Gazetteer(s).

Likewise, two distinct features are used with the third group:
• nominal flag – as described in the first group feature set;
• check Phone Number/ISBN/File Name Gazetteers feature flags – a binary feature to represent whether the word (or left/right neighbour of targeted word) belongs to Phone Number/ISBN/File Name Gazetteer(s).
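The binary word-level and gazetteer flags described above could be computed along the following lines. This is a sketch under assumptions: the English gloss is taken as a given input rather than obtained from MADA, "adjacent dot" is interpreted as a neighbouring "." token, and all function and key names are hypothetical.

```python
def word_level_flags(word, gloss, prev_tok, next_tok):
    """Word length, dot and capitalization flags for one word."""
    return {
        "length_ge_3": len(word) >= 3,
        # Assumed reading of "adjacent dot": a neighbouring "." token.
        "dot_adjacent": prev_tok == "." or next_tok == ".",
        # True when the (externally supplied) English gloss is capitalized.
        "gloss_capitalized": bool(gloss) and gloss[0].isupper(),
    }


def gazetteer_flags(tokens, i, gazetteer):
    """Gazetteer membership of the word and of its left/right neighbours."""
    left = tokens[i - 1] if i > 0 else None
    right = tokens[i + 1] if i + 1 < len(tokens) else None
    return {
        "in_gaz": tokens[i] in gazetteer,
        "left_in_gaz": left in gazetteer,
        "right_in_gaz": right in gazetteer,
    }


tokens = ["fy", "AlmAnyA", "."]
print(word_level_flags("AlmAnyA", "Germany", "fy", "."))
print(gazetteer_flags(tokens, 0, {"AlmAnyA"}))
```

Keeping the neighbour flags separate from the word's own flag makes it straightforward to drop only the neighbour-related gazetteer features, as one of the experimental settings does.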
6. Experimental results
In this section, we show how the proposed hybrid NER system’s performance is evaluated and compared with a baseline (rule-based NER performance) and state-of-the-art systems.
Table 6. Sample of the AnnotationDiff evaluation results (for Date NE annotations) when the rule-based component is applied on the ACE 2005 NW dataset.

| Correct | Partially correct | Missing | False positives |
|---|---|---|---|
| 144 | 7 | 3 | 0 |

| Mode | Recall | Precision | F-measure |
|---|---|---|---|
| Strict | 0.94 | 0.95 | 0.94 |
| Lenient | 0.98 | 1.00 | 0.99 |
| Average | 0.96 | 0.98 | 0.97 |
Table 7. Sample of the WEKA evaluation results for the extraction of the second NEs group when the hybrid system (using J48 classifier) is applied on the ACE 2005 NW dataset.

| Class | TP rate | FP rate | Precision | Recall | F-measure | ROC area |
|---|---|---|---|---|---|---|
| Price | 0.97 | 0.00 | 1.00 | 0.97 | 0.98 | 0.98 |
| Date | 0.95 | 0.00 | 0.99 | 0.95 | 0.97 | 0.97 |
| Measurement | 0.97 | 0.00 | 1.00 | 0.97 | 0.98 | 0.98 |
| Percent | 1.00 | 0.00 | 1.00 | 1.00 | 1.00 | 1.00 |
| Time | 0.94 | 0.00 | 0.95 | 0.94 | 0.94 | 0.95 |
6.1. Experiment setup
Experimental results are presented across three dimensions: the considered types of NEs, the ML classifier used and the considered types of features within the feature set. Each experiment includes a reference (i.e. gold-standard) dataset version. The reference datasets are the initial datasets described with their tagging details in Section 4, including the ACE corpora, ATB Part1 version 2.0, ANERcorp and our own corpus. The cleaned (unannotated) version of the reference datasets is fed into the rule-based component for annotation. For each experiment, a set of features is determined from the whole feature space. The annotated datasets are used by the feature extraction phase to generate the feature set data files, which are then processed by the ML-based component to produce the final predicted annotated version.

The performance of the rule-based component is evaluated using GATE’s built-in evaluation tool, AnnotationDiff. This tool enables a comparison of the reference with the system output annotated versions, and the results are presented in terms of the Information Extraction standard evaluation measures (i.e. precision, recall and F-measure). Table 6 shows a sample of AnnotationDiff evaluation results for Date NE type annotations when the rule-based component is applied on the ACE 2005 NW dataset.

With regard to the ML component, three different functions (or classifiers) are applied separately to the rule-based annotated dataset, including decision trees, SVM and logistic regression techniques, which are supported in the WEKA workbench via the J48, LibSVM and Logistic built-in classifiers, respectively. In this research, 10-fold cross validation is chosen to avoid overfitting. The WEKA tool provides the functionality of applying the conventional k-fold cross-validation for evaluation with each classifier and then having the results represented in the aforementioned standard measures. Table 7 illustrates a sample of WEKA evaluation results for extracting NEs that belong to the second group of NE types. In this sample evaluation, the J48 ML classifier is applied on the rule-based annotated version of the ACE 2005 NW dataset.
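The strict, lenient and average scores of the kind reported in Table 6 can be derived from the four counts with the commonly used definitions below. This is a sketch: the formulas are an assumption about how such scores are computed (strict counts only fully correct annotations as matches, lenient also counts partially correct ones, and the average is the mean of the two), and the function names are hypothetical.

```python
def precision_recall_f(matched, responses, keys):
    """Precision, recall and F-measure for a given number of matches."""
    precision = matched / responses if responses else 0.0
    recall = matched / keys if keys else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure


def annotation_scores(correct, partial, missing, false_positives):
    responses = correct + partial + false_positives  # system annotations
    keys = correct + partial + missing               # reference annotations
    strict = precision_recall_f(correct, responses, keys)
    lenient = precision_recall_f(correct + partial, responses, keys)
    average = tuple((s + l) / 2 for s, l in zip(strict, lenient))
    return {"strict": strict, "lenient": lenient, "average": average}


# Counts from Table 6 (Date NEs, ACE 2005 NW).
scores = annotation_scores(correct=144, partial=7, missing=3, false_positives=0)
for mode, (p, r, f) in scores.items():
    print(mode, round(p, 2), round(r, 2), round(f, 2))
```

With the counts from Table 6, these definitions give values that agree, after rounding to two decimals, with the strict, lenient and average rows of that table.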
6.2. Experiments and results
A number of experiments were conducted to evaluate the performance of the proposed system in extracting the various types of NEs using each of the three different ML techniques when applied to different datasets. We group similar features together according to the nature of the feature type. For example, number, gender, person, case, voice and mood features are grouped under the morphological features group, called MF for short. We examine six settings of feature groups to study their effect on the overall performance: when all features are considered, and when different combinations of all-but-one feature groups are considered. The baseline in all experiments is the performance of the pure rule-based component. The settings are:
(1) All Features – all features are considered;
(2) W/O RB – excluding the rule-based (RB) features (i.e. pure ML-based mode);
(3) W/O MF – excluding the morphological features;
(4) W/O POS – excluding POS feature;
(5) W/O GAZ – excluding Gazetteers features;
(6) W/O NbG – excluding neighbours’ related features within the Gazetteers features.
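The six settings amount to masks over feature groups, which can be written down compactly. In this sketch the group names, the split of the gazetteer features into the word's own flag and its neighbours' flags, and the example feature names are all hypothetical.

```python
# Hypothetical feature groups; gazetteer features are split so that the
# neighbour-related ones can be excluded on their own (setting 6).
ALL_GROUPS = frozenset({"RB", "MF", "POS", "GAZ_SELF", "GAZ_NEIGHBOUR", "WORD_LEVEL"})

SETTINGS = {
    "All Features": ALL_GROUPS,
    "W/O RB": ALL_GROUPS - {"RB"},
    "W/O MF": ALL_GROUPS - {"MF"},
    "W/O POS": ALL_GROUPS - {"POS"},
    "W/O GAZ": ALL_GROUPS - {"GAZ_SELF", "GAZ_NEIGHBOUR"},
    "W/O NbG": ALL_GROUPS - {"GAZ_NEIGHBOUR"},
}

# Hypothetical mapping from individual feature names to their group.
FEATURE_GROUP = {
    "rb_n": "RB", "gender": "MF", "pos": "POS",
    "in_gaz": "GAZ_SELF", "left_in_gaz": "GAZ_NEIGHBOUR",
    "length_ge_3": "WORD_LEVEL",
}


def select_features(feature_vector, setting):
    """Keep only the features whose group is enabled in the given setting."""
    enabled = SETTINGS[setting]
    return {name: value for name, value in feature_vector.items()
            if FEATURE_GROUP[name] in enabled}


vector = {"rb_n": "Location", "gender": "f", "pos": "noun_prop",
          "in_gaz": True, "left_in_gaz": False, "length_ge_3": True}
print(sorted(select_features(vector, "W/O NbG")))
```

Running the same feature vector through each key of `SETTINGS` yields the six input variants that are compared against the rule-based baseline.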
Table 8 shows the system’s performance in terms of F-measure when applied on the ACE2003 (NW & BN), ACE2004 (NW) and ANERcorp datasets in order to extract NEs of the first group (i.e. Person, Location and Organization). According to the empirical results illustrated in this table, the highest performance of our system when applied on the ACE2003 NW and BN datasets is achieved by the J48 classifier when the sixth feature setting is used (i.e. without neighbouring features), while using the J48 classifier with the first setting (i.e. all features are used) leads to the highest performance when applied on the ACE2004 NW and ANERcorp datasets.

The results illustrated in Table 9 show that the highest performances in terms of average F-measures when applied on the ACE2003 BN, ACE2004 NW & BN, ACE2005 BN and ATB Part1 version 2.0 datasets to extract NEs of the second group (i.e. Date, Time, Price, Measurement and Percent) are achieved by the J48 classifier when any of the six feature settings is utilized except the second feature setting, while using the J48 classifier with the fifth or sixth feature settings leads to the highest performance in extracting NEs of the same group from the ACE2003 NW and ACE2005 NW datasets.

As shown in Table 10, the highest performance of our system in terms of average F-measures when applied on our own corpus to extract NEs of the third group (i.e. Phone Number, ISBN and File Name) is achieved by the J48 classifier when any of the six feature settings is utilized except the second feature setting, by the Libsvm classifier when the fourth feature setting is used, and by the Logistic classifier when any of the six feature settings is utilized except the second feature setting.

The overall experimental results show that the adoption of the hybrid approach leads to the highest performance. It is worth noting that the results of the proposed hybrid system are very close to those of the rule-based component (baseline) when it comes to the numerical and temporal expressions (mostly the second and third NE groups). The performance of the pure ML-based component (W/O RB) is poor when it is used for the recognition of the second and third NE groups because the number of NE instances within the training datasets is small; we overcome this issue by including the rule-based features in the feature set, which is what constitutes the hybrid approach. Therefore, the hybrid approach proves its suitability for the recognition of the three groups of NEs. Also, the decision trees function has proved its comparatively higher efficiency as a classifier in our Arabic hybrid NER system.
Table 8. The results of applying the proposed hybrid system on the ACE2003 (NW & BN), ACE2004 (NW) and ANERcorp datasets in order to extract NEs of the first group (average F-measure).

| Classifier | Feature setting | ACE2003 NW | ACE2003 BN | ACE2004 NW | ANERcorp |
|---|---|---|---|---|---|
| Rule-based (baseline) | - | 0.64 | 0.61 | 0.47 | 0.67 |
| J48 | All features | 0.85 | 0.81 | 0.76 | 0.91 |
| J48 | W/O RB | 0.82 | 0.76 | 0.74 | 0.84 |
| J48 | W/O MF | 0.85 | 0.82 | 0.74 | 0.90 |
| J48 | W/O POS | 0.83 | 0.81 | 0.74 | 0.90 |
| J48 | W/O GAZ | 0.79 | 0.79 | 0.71 | 0.87 |
| J48 | W/O NbG | 0.86 | 0.81 | 0.74 | 0.90 |
| Libsvm | All features | 0.80 | 0.77 | 0.72 | 0.90 |
| Libsvm | W/O RB | 0.75 | 0.63 | 0.66 | 0.81 |
| Libsvm | W/O MF | 0.79 | 0.77 | 0.71 | 0.90 |
| Libsvm | W/O POS | 0.78 | 0.75 | 0.70 | 0.89 |
| Libsvm | W/O GAZ | 0.74 | 0.67 | 0.63 | 0.85 |
| Libsvm | W/O NbG | 0.79 | 0.76 | 0.70 | 0.89 |
| Logistic | All features | 0.80 | 0.77 | 0.72 | 0.90 |
| Logistic | W/O RB | 0.76 | 0.67 | 0.64 | 0.78 |
| Logistic | W/O MF | 0.78 | 0.76 | 0.71 | 0.89 |
| Logistic | W/O POS | 0.78 | 0.77 | 0.69 | 0.88 |
| Logistic | W/O GAZ | 0.75 | 0.76 | 0.68 | 0.84 |
| Logistic | W/O NbG | 0.79 | 0.77 | 0.71 | 0.88 |
Table 9. The results of applying our hybrid system on ACE2003 NW and BN, ACE2004 NW and BN, ACE2005 NW and BN and ATB Part1 version 2.0 datasets when the second group is the targeted group (average F-measure).

| Classifier | Feature setting | ACE2003 NW | ACE2003 BN | ACE2004 NW | ACE2004 BN | ACE2005 NW | ACE2005 BN | ATB-Part1 v 2.0 |
|---|---|---|---|---|---|---|---|---|
| Rule-based (baseline) | - | 0.98 | 1.00 | 0.98 | 0.99 | 0.96 | 0.98 | 0.98 |
| J48 | All features | 0.98 | 1.00 | 0.99 | 1.00 | 0.98 | 0.99 | 0.99 |
| J48 | W/O RB | 0.74 | 0.63 | 0.57 | 0.63 | 0.67 | 0.57 | 0.82 |
| J48 | W/O MF | 0.99 | 1.00 | 0.99 | 1.00 | 0.98 | 0.99 | 0.99 |
| J48 | W/O POS | 0.99 | 1.00 | 0.99 | 1.00 | 0.98 | 0.99 | 0.99 |
| J48 | W/O GAZ | 0.99 | 1.00 | 0.99 | 1.00 | 0.98 | 0.99 | 0.99 |
| J48 | W/O NbG | 0.99 | 1.00 | 0.99 | 1.00 | 0.98 | 0.99 | 0.99 |
| Libsvm | All features | 0.96 | 0.70 | 0.98 | 0.91 | 0.97 | 0.92 | 0.99 |
| Libsvm | W/O RB | 0.87 | 0.29 | 0.58 | 0.50 | 0.84 | 0.36 | 0.75 |
| Libsvm | W/O MF | 0.97 | 0.78 | 0.98 | 0.92 | 0.97 | 0.93 | 0.99 |
| Libsvm | W/O POS | 0.97 | 0.70 | 0.99 | 0.95 | 0.98 | 0.84 | 0.99 |
| Libsvm | W/O GAZ | 0.96 | 0.75 | 0.98 | 0.92 | 0.97 | 0.89 | 0.98 |
| Libsvm | W/O NbG | 0.96 | 0.75 | 0.98 | 0.92 | 0.97 | 0.93 | 0.98 |
| Logistic | All features | 0.98 | 0.93 | 0.98 | 0.99 | 0.97 | 0.97 | 0.99 |
| Logistic | W/O RB | 0.65 | 0.41 | 0.52 | 0.54 | 0.66 | 0.41 | 0.72 |
| Logistic | W/O MF | 0.99 | 0.93 | 0.98 | 0.99 | 0.98 | 0.99 | 0.99 |
| Logistic | W/O POS | 0.98 | 0.86 | 0.98 | 0.97 | 0.98 | 0.98 | 0.99 |
| Logistic | W/O GAZ | 0.99 | 0.93 | 0.98 | 0.99 | 0.97 | 0.98 | 0.99 |
| Logistic | W/O NbG | 0.97 | 0.95 | 0.98 | 0.99 | 0.97 | 0.97 | 0.99 |
Person, Location and Organization NE types (ENAMEX or first NE group in our discussion) have been used specifically in the literature for comparing the performance of the Arabic NER systems when applied on standard datasets. In comparison, the results achieved by ANERsys 1.0 [33], ANERsys 2.0 [15], Arabic ML-based NER system using CRF [37] and the hybrid NER prototype for Arabic developed by Abdallah et al. [16] when applied on ANERcorp, have shown that our system performs demonstrably better, as illustrated by Table 11. Our hybrid system outperforms the other systems in terms of F-measure in extracting Person, Location and Organization NEs from ANERcorp dataset. The detailed analysis of these results in terms of precision–recall curve is presented in the next section. It is worth noting that a comparison between our results and the results achieved by Benajiba et al. [38, 40] is not possible because their published evaluation lacks sufficient detail. The authors reported the performance results of their system as one group representing ENAMEX, Facility, Vehicle and Weapon NEs types.
6.3. Precision–recall trade-off
The precision–recall curve is used by information extraction and information retrieval systems to visualize their performance with respect to the two main standard measures: precision and recall [59]. Precision is the fraction of detected NEs that are correctly classified, while recall is the fraction of relevant NEs that are detected [59]. We used the precision–recall curve to represent the performance of the system with regard to a single category or to a set of categories taken as a whole. Figures 8–10 show the per-category precision–recall curves, while Figure 11 shows the total precision–recall curve. We used the precision–recall trade-off to compare the performance of the different ML classifiers (i.e. J48, Libsvm and Logistic) used by the hybrid system on the first group of NEs when all features were considered.
Figure 8 illustrates the precision–recall trade-off for Person NE extraction. We first observe that the precision obtained by each of the three classifiers remains above 0.80 until the recall reaches 0.90, and then starts to decrease. The second observation concerns the breakeven point (BEP), the point at which the precision equals the recall (the higher the better in terms of performance): the J48, Libsvm and Logistic classifiers achieve BEPs of 0.89, 0.88 and 0.88, respectively.
Figure 9 illustrates the precision–recall trade-off for Location NE extraction. The precision obtained by each of the three classifiers remains above 0.80 until the recall reaches 0.85, and then starts to decrease. The BEPs achieved by the J48, Libsvm and Logistic classifiers are 0.88, 0.85 and 0.83, respectively.
Figure 10 illustrates the precision–recall trade-off for Organization NE extraction. The precision obtained by the J48 classifier
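Table 10. The results of applying our hybrid system on our own corpus when the third group is the targeted group.
| Classifier | Feature set | Phone Number (F-measure) | ISBN (F-measure) | File Name (F-measure) | Average F-measure |
|---|---|---|---|---|---|
| Rule-based (baseline) | - | 1.00 | 1.00 | 1.00 | 1.00 |
| J48 | All features | 1.00 | 1.00 | 1.00 | 1.00 |
| J48 | W/O RB | 0.45 | 0.44 | 0.90 | 0.60 |
| J48 | W/O MF | 1.00 | 1.00 | 1.00 | 1.00 |
| J48 | W/O POS | 1.00 | 1.00 | 1.00 | 1.00 |
| J48 | W/O GAZ | 1.00 | 1.00 | 1.00 | 1.00 |
| J48 | W/O NbG | 1.00 | 1.00 | 1.00 | 1.00 |
| Libsvm | All features | 1.00 | 1.00 | 1.00 | 1.00 |
| Libsvm | W/O RB | 0.40 | 0.15 | 0.90 | 0.48 |
| Libsvm | W/O MF | 1.00 | 1.00 | 1.00 | 1.00 |
| Libsvm | W/O POS | 1.00 | 1.00 | 1.00 | 1.00 |
| Libsvm | W/O GAZ | 1.00 | 1.00 | 1.00 | 1.00 |
| Libsvm | W/O NbG | 1.00 | 1.00 | 1.00 | 1.00 |
| Logistic | All features | 1.00 | 1.00 | 1.00 | 1.00 |
| Logistic | W/O RB | 0.45 | 0.52 | 0.88 | 0.61 |
| Logistic | W/O MF | 1.00 | 1.00 | 1.00 | 1.00 |
| Logistic | W/O POS | 1.00 | 1.00 | 1.00 | 1.00 |
| Logistic | W/O GAZ | 1.00 | 1.00 | 1.00 | 1.00 |
| Logistic | W/O NbG | 1.00 | 1.00 | 1.00 | 1.00 |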
Table 11. The results of ANERsys 1.0, ANERsys 2.0, CRF-based system [37] and Abdallah et al.’s [16] system compared with our hybrid system’s highest performance when applied to ANERcorp dataset.
| System | Person (F-measure) | Location (F-measure) | Organization (F-measure) |
|---|---|---|---|
| ANERsys 1.0 | 0.46 | 0.80 | 0.36 |
| ANERsys 2.0 | 0.52 | 0.86 | 0.46 |
| CRF-based system [37] | 0.73 | 0.89 | 0.65 |
| Abdallah et al. [16] | 0.92 | 0.87 | 0.86 |
| Our Hybrid System (J48) | 0.94 | 0.90 | 0.88 |
remains above 0.80 until the recall reaches 0.50, then starts to decrease, while the precision of each of the other two classifiers falls below 0.80 at recall values of around 0.40 and 0.45. The BEPs achieved by the J48, Libsvm and Logistic classifiers are 0.66, 0.63 and 0.61, respectively.
Figure 11 summarizes the performance of the hybrid system using the three classifiers to extract the NEs of the first group (i.e. ENAMEX). When the recall is below 0.50, the Logistic classifier outperforms the Libsvm and J48 classifiers in terms of precision, whereas when the recall is above 0.50, the J48 classifier outperforms the other two. As illustrated in Figure 11, the highest BEP, which is around 0.77, is achieved by the J48 classifier. The total precision–recall curve shows that the J48 classifier can achieve a precision of 0.87 when the recall is 0.60. In conclusion, our system, when the J48 classifier is used with all features, outperforms the Libsvm and Logistic classifiers in terms of BEP. As can be noted from the total precision–recall curve, the BEPs are located in the upper-right portion of the graph, which indicates good overall performance for all the classifiers.
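The curve construction and the BEP discussed in this section can be sketched in a few lines of Python. This is a minimal illustration rather than the evaluation code used for the reported experiments: the function names, the toy confidence scores and the gold labels are all assumptions introduced here, and the sketch assumes that each candidate NE carries a distinct confidence score that can be thresholded.

```python
from typing import List, Tuple


def precision_recall_curve(scores: List[float], labels: List[int]) -> List[Tuple[float, float]]:
    """Return (recall, precision) pairs obtained by sweeping a confidence threshold.

    scores: hypothetical classifier confidence for each candidate NE.
    labels: 1 if the candidate is a correct NE, 0 otherwise.
    """
    total_relevant = sum(labels)
    if total_relevant == 0:
        raise ValueError("at least one relevant NE is required")
    # Rank candidates from most to least confident.
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    points = []
    true_positives = 0
    for detected, (_, label) in enumerate(ranked, start=1):
        true_positives += label
        precision = true_positives / detected
        recall = true_positives / total_relevant
        points.append((recall, precision))
    return points


def break_even_point(points: List[Tuple[float, float]]) -> float:
    """Return the value at the curve point where precision and recall are closest."""
    recall, precision = min(points, key=lambda p: abs(p[0] - p[1]))
    return (precision + recall) / 2


# Toy data (assumed, not taken from the experiments).
toy_scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4]
toy_labels = [1, 1, 0, 1, 0, 1]
curve = precision_recall_curve(toy_scores, toy_labels)
print(break_even_point(curve))  # 0.75
```

Because precision and recall share the same numerator, they coincide exactly when the number of detected candidates equals the number of relevant NEs, so in this sketch the BEP is reached at that rank of the sorted list.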
7. Conclusion and future work
In the literature, both the rule-based approach and the pure ML-based approach are considered successful NER approaches. Our proposed hybrid Arabic NER approach differs from these in that the ML-based subsystem can make use of the decisions made by the rule-based subsystem in order to improve the overall performance of the hybrid system. Nevertheless, many existing rule-based NER systems have achieved high performance levels and can achieve even better performance if their decisions contribute to the set of features used by a
Figure 8. Precision–recall curve of the Person NE type.
Figure 9. Precision–recall curve of the Location NE type.
Figure 10. Precision–recall curve of the Organization NE type.
post-processor ML-based Arabic NER subsystem, forming a larger hybrid Arabic NER system. The lack of NER linguistic resources and the high dependence of ML methods on annotated training data make the hybrid approach suitable for Arabic NER.
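The way rule-based decisions can feed the ML-based subsystem may be easier to see in code. The following is a minimal sketch under stated assumptions, not the system's implementation: the toy gazetteer, the single lookup rule, the token fields and the feature names (`rb_decision`, `in_gazetteer`, `pos`) are hypothetical and only loosely mirror the RB, GAZ and POS feature groups named in the result tables.

```python
from typing import Dict, List, Set

Token = Dict[str, str]  # hypothetical token record: {"word": ..., "pos": ...}


def rule_based_decision(token: Token, person_gazetteer: Set[str]) -> str:
    """Toy stand-in for the rule-based subsystem: a single gazetteer lookup."""
    return "Person" if token["word"] in person_gazetteer else "O"


def build_feature_rows(tokens: List[Token], person_gazetteer: Set[str]) -> List[Dict[str, object]]:
    """Turn each token into a feature row that includes the rule-based decision."""
    rows = []
    for token in tokens:
        rows.append({
            "word": token["word"],
            "pos": token["pos"],
            "in_gazetteer": token["word"] in person_gazetteer,
            # The rule-based output becomes one more feature for the classifier.
            "rb_decision": rule_based_decision(token, person_gazetteer),
        })
    return rows


# Toy input (assumed): three tokens and a one-entry person gazetteer.
tokens = [
    {"word": "Khaled", "pos": "NNP"},
    {"word": "visited", "pos": "VBD"},
    {"word": "Dubai", "pos": "NNP"},
]
rows = build_feature_rows(tokens, person_gazetteer={"Khaled"})
```

Each row would then be passed, together with its gold NE label, to a classifier such as a decision tree. Dropping the `rb_decision` key corresponds in spirit to the "W/O RB" setting in the tables.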
Figure 11. The total precision–recall curve of the NEs of the first group (Person, Location and Organization).
The proposed hybrid system is capable of recognizing 11 different types of named entities, namely: Person, Location, Organization, Date, Time, Price, Measurement, Percent, Phone Number, ISBN and File Name. The system has undergone rigorous testing, and a large set of features has been explored. Extensive experiments have been conducted along three dimensions, namely the NE types, the feature set (divided into groups of naturally related features) and the ML technique, to evaluate the performance of our domain-independent Arabic NER system when applied to a variety of standard datasets. The experimental results showed that the hybrid approach outperforms the pure rule-based approach and the pure ML-based approach. Moreover, the proposed system outperforms the state-of-the-art in Arabic NER in terms of F-measure when applied to the ANERcorp standard dataset, with F-measures of 0.94 for Person NEs, 0.90 for Location NEs and 0.88 for Organization NEs. In addition to the F-measure, we reported the performance of the system in terms of the precision–recall curve, tracking the performance of the system at different precision and recall points and observing the trade-off between them. This has also confirmed that the proposed hybrid Arabic NER approach is demonstrably better than the pure rule-based system or the pure ML classifier.
In future work, we intend to enhance the gazetteers and explore the possibility of improving the system by adding more predefined lists of NEs. There is also room for improving the local grammar rules implemented within the rule-based component by analysing the hybrid system’s output in such a way as to automate the enhancement process. We are also considering investigating ML techniques other than decision trees, SVM and logistic regression and studying their impact on the overall performance of the hybrid NER system.
Funding
This research was funded by the British University in Dubai (grant no. INF004 – Using machine learning to improve Arabic named entity recognition).
Notes
1. We have used the Habash–Soudi–Buckwalter transliteration scheme [24].
2. Available for our institution under licence agreement from the Linguistic Data Consortium.
3. Available to download at: http://www1.ccls.columbia.edu/~ybenajiba/downloads.html
4. GATE is freely available at: http://gate.ac.uk/
5. WEKA is available at: www.cs.waikato.ac.nz/ml/weka/
References
[1] Nadeau D and Sekine S. A survey of named entity recognition and classification. Lingvisticae Investigationes 2007; 30: 3–26.
[2] Cowie J and Wilks Y. Information extraction. Communications of the ACM 1996; 39: 80–91.
[3] Benajiba Y, Diab M and Rosso P. Arabic named entity recognition: A feature-driven study. IEEE Transactions on Audio, Speech, and Language Processing 2009; 17: 926–934.
[4] Babych B and Hartley A. Improving machine translation quality with automatic named entity recognition. In: Proceedings of the 7th international EAMT workshop on MT and other language technology tools, improving MT through other language technology tools: resources and tools for building MT (EAMT 2003), 2003, pp. 1–8.
[5] Abouenour L, Bouzoubaa K and Rosso P. IDRAAQ: New Arabic question answering system based on query expansion and passage retrieval. CLEF (Online Working Notes/Labs/Workshop), 2012.
[6] Hamadene A, Shaheen M and Badawy O. ARQA: An intelligent Arabic question answering system. In: Proceedings of Arabic language technology international conference (ALTIC 2011), 2011.
[7] Mesfar S. Named entity recognition for Arabic using syntactic grammars. In: Proceedings of the 12th international conference on application of natural language to information systems. Berlin: Springer, 2007, pp. 305–316.
[8] Farghaly A and Shaalan K. Arabic natural language processing: Challenges and solutions. ACM Transactions on Asian Language Information Processing (TALIP) 2009; 8: 1–22.
[9] Shaalan K. Rule-based approach in Arabic natural language processing. The International Journal on Information and Communication Technologies (IJICT) 2010; 3: 11–19.
[10] Al-Sughaiyer I and Al-Kharashi A. Arabic morphological analysis techniques: A comprehensive survey. Journal of the American Society for Information Science and Technology 2004; 55: 189–213.
[11] Attia M. Ambiguity in Arabic computational morphology and syntax: A study within the lexical functional grammar framework. Lambert Academic, 2012.
[12] Shaalan K and Raza H. NERA: Named entity recognition for Arabic. Journal of the American Society for Information Science and Technology 2009; 60: 1652–1663.
[13] Habash NY. Introduction to Arabic natural language processing. Morgan & Claypool, 2010.
[14] Shaalan K and Raza H. Arabic named entity recognition from diverse text types. In: Proceedings of the 6th international conference on natural language processing (GoTAL 2008). Berlin: Springer, 2008, pp. 440–451.
[15] Benajiba Y and Rosso P. ANERsys 2.0: Conquering the NER task for the Arabic language by combining the maximum entropy with POS-tag information. In: Proceedings of workshop on natural language-independent engineering, 3rd Indian international conference on artificial intelligence (IICAI-2007), 2007, pp. 1814–1823.
[16] Abdallah S, Shaalan K and Shoaib M. Integrating rule-based system with classification for Arabic named entity recognition. In: Proceedings of the 13th international conference on intelligent text processing and computational linguistics (CICLing). Berlin: Springer, 2012, pp. 311–322.
[17] Orphanos G, Kalles D, Papagelis T and Christodoulakis D. Decision trees and NLP: A case study in POS tagging. In: Proceedings of annual conference on artificial intelligence (ACAI), 1999.
[18] Vapnik VN. The Nature of statistical learning theory. New York: Springer, 1995.
[19] Hastie T, Tibshirani R and Friedman J. The elements of statistical learning: Data mining, inference, and prediction, 2nd edn. Berlin: Springer, 2009.
[20] Oudah MM and Shaalan K. A pipeline Arabic named entity recognition using a hybrid approach. In: Proceedings of the 24th international conference on computational linguistics (COLING 2012), 2012, pp. 2159–2176.
[21] Oudah M and Shaalan K. Person name recognition using the hybrid approach. Lecture Notes in Computer Science, Natural Language Processing and Information Systems, Vol. 7934. Berlin: Springer, 2013, pp. 237–248.
[22] Farber B, Freitag D, Habash N and Rambow O. Improving NER in Arabic using a morphological tagger. In: Proceedings of workshop on HLT & NLP within the Arabic world (LREC 2008), 2008, pp. 2509–2514.
[23] AbdelRahman S, Elarnaoty M, Magdy M and Fahmy A. Integrated machine learning techniques for Arabic named entity recognition. International Journal of Computer Science Issues (IJCSI) 2010; 7: 27–36.
[24] Habash N, Soudi A and Buckwalter T. On Arabic transliteration. Arabic Computational Morphology: Knowledge-based and Empirical Methods 2007; 38: 15–22.
[25] Mayfield J, McNamee P and Piatko C. Named entity recognition using hundreds of thousands of features. In: Proceedings of the 7th conference on Natural language learning at HLT-NAACL 2003 (CONLL 2003), 2003, pp. 184–187.
[26] Maloney J and Niv M. TAGARAB: A fast, accurate Arabic name recognizer using high-precision morphological analysis. In: Proceedings of the workshop on computational approaches to Semitic languages (Semitic 1998), 1998, pp. 8–15.
[27] Maynard D, Tablan V, Ursu C, Cunningham H and Wilks Y. Named entity recognition from diverse text types. In: Proceedings of recent advances in natural language processing conference, 2001.
[28] Riaz K. Rule-based named entity recognition in Urdu. In: Proceedings of the 2010 named entities workshop (ACL 2010), 2010, pp. 126–135.
[29] Shaalan K and Raza H. Person name entity recognition for Arabic. In: Proceedings of the 5th workshop on important unresolved matters, 2007, pp. 17–24.
[30] Elsebai A, Meziane F and BelKredim FZ. A rule based Persons names Arabic extraction system. Communications of the IBIMA 2009; 11: 53–59.
[31] Aboaoga M and Aziz MJA. Arabic person names recognition by using a rule based approach. Journal of Computer Science 2013; 9: 922–927.
[32] Zaghouani W. RENAR: A rule-based Arabic named entity recognition system. ACM Transactions on Asian Language Information Processing 2012; 11: 1–13.
[33] Benajiba Y, Rosso P and Benedí JM. ANERsys: An Arabic named entity recognition system based on maximum entropy. In: Proceedings of the 8th international conference on computational linguistics and intelligent text processing (CICLing-2007). Berlin: Springer, 2007, pp. 143–153.
[34] Alias-I. LingPipe 4.1.0, 2008. Available at: http://alias-i.com/lingpipe (accessed October 2012).
[35] Zhou G and Su J. Named entity recognition using an HMM-based chunk tagger. In: Proceedings of the 40th annual meeting of the association for computational linguistics (ACL), 2002, pp. 473–480.
[36] Finkel J and Manning C. Nested named entity recognition. In: Proceedings of the 2009 conference on empirical methods in natural language processing, 2009.
[37] Benajiba Y and Rosso P. Arabic named entity recognition using conditional random fields. In: Proceedings of workshop on HLT & NLP within the Arabic world (LREC), 2008.
[38] Benajiba Y, Diab M and Rosso P. Arabic named entity recognition: An SVM-based approach. In: Proceedings of Arab international conference on information technology (ACIT 2008), 2008, pp. 16–18.
[39] Abdul-Hamid A and Darwish K. Simplified feature set for Arabic named entity recognition. In: Proceedings of the 2010 named entities workshop (ACL), 2010, pp. 110–115.
[40] Benajiba Y, Diab M and Rosso P. Arabic named entity recognition using optimized feature sets. In: Proceedings of the conference on empirical methods in natural language processing (EMNLP 2008), 2008, pp. 284–293.
[41] Benajiba Y, Diab M and Rosso P. Using language independent and language specific features to enhance Arabic named entity recognition. The International Arab Journal of Information Technology 2009; 6: 464–473.
[42] Mohammed NF and Omar N. Arabic named entity recognition using artificial neural network. Journal of Computer Science 2012; 8: 1285–1293.
[43] Collins M. Discriminative training methods for hidden Markov models: Theory and experiments with perceptron algorithms. In: Proceedings of the ACL-02 conference on empirical methods in natural language processing, 2002, pp. 1–8.
[44] Petasis G, Vichot F, Wolinski F, Paliouras G, Karkaletsis V and Spyropoulos CD. Using machine learning to maintain rule-based named-entity recognition and classification systems. In: Proceedings of conference of Association for Computational Linguistics, 2001, pp. 426–433.
[45] Srihari R, Niu C and Li W. A hybrid approach for named entity and sub-type tagging. In: Proceedings of the 6th conference on applied natural language processing (ANLC), 2000, pp. 247–254.
[46] Seon C, Ko Y, Kim J and Seo J. Named entity recognition using machine learning methods and pattern-selection rules. In: Proceedings of the 6th natural language processing Pacific Rim symposium, 2001, pp. 229–236.
[47] Tsai T, Wu S, Lee C, Shih C and Hsu W. Mencius: A Chinese named entity recognizer using the maximum entropy-based hybrid model. Computational Linguistics and Chinese Language Processing 2004; 9: 65–82.
[48] Küçük D and Yazıcı A. A hybrid named entity recognizer for Turkish. Expert Systems with Applications 2012; 39: 2733–2742.
[49] Mitchell A, Strassel S, Przybocki M, Davis J, Doddington G, Grishman R et al. Tides extraction (ACE) 2003 multilingual training data. LDC2004T09. Linguistic Data Consortium, 2003.
[50] Mitchell A, Strassel S, Huang S and Zakhary R. ACE 2004 multilingual training corpus. LDC2005T09. Linguistic Data Consortium, 2005.
[51] Walker C, Strassel S, Medero J and Maeda K. ACE 2005 multilingual training corpus. LDC2006T06. Linguistic Data Consortium, 2006.
[52] Maamouri M, Bies A, Jin H and Buckwalter T. Arabic treebank: Part 1 v 2.0. LDC2003T06. Linguistic Data Consortium, 2003.
[53] Habash N and Rambow O. Arabic tokenization, part-of-speech tagging and morphological disambiguation in one fell swoop. In: Proceedings of the 43rd annual meeting of the Association for Computational Linguistics (ACL’05), 2005, pp. 573–580.
[54] Habash N, Owen R and Ryan R. MADA + TOKAN: A toolkit for Arabic tokenization, diacritization, morphological disambiguation, POS tagging, stemming and lemmatization. In: Proceedings of the 2nd international conference on Arabic language resources and tools (MEDAR), 2009.
[55] Roth R, Rambow O, Habash N, Diab M and Rudin C. Arabic morphological tagging, diacritization, and lemmatization using lexeme models and feature ranking. In: Proceedings of ACL-08: HLT, short papers, 2008, pp. 117–120.
[56] Cunningham H, Maynard D, Bontcheva K, Tablan V, Aswani N, Roberts I et al. Text processing with GATE (version 6). University of Sheffield Department of Computer Science, 2011.
[57] Hall M, Frank E, Holmes G, Pfahringer B, Reutemann P and Witten IH. The WEKA data mining software: An update. ACM SIGKDD Explorations Newsletter 2009; 11: 10–18.
[58] Habash N, Owen R and Ryan R. MADA + TOKAN Manual. Technical Report CCLS-10-01. Center for Computational Learning Systems, Columbia University, 2010.
[59] Manning CD, Raghavan P and Schütze H. Introduction to information retrieval. Cambridge: Cambridge University Press, 2008.