2009 Eighth International Symposium on Natural Language Processing

Named-Entity Techniques for Terrorism Event Extraction and Classification

Uraiwan Inyaem, Phayung Meesad, and Choochart Haruechaiyasak

Abstract: The aim of this paper is to study and compare several machine learning methods for implementing a Thai terrorism event extraction system. The main function of the system is to extract information related to terrorism events found in Thai news articles. The terrorism events can then be classified and presented to intelligence officers, who can further analyze and predict terrorism events. This paper compares three named entity feature selection techniques for entity recognition: terrorism gazetteer, terrorism ontology, and terrorism grammar rules. The machine learning algorithms used for event extraction are Naïve Bayes (NB), K Nearest Neighbor (KNN), Decision Tree (DTREE), and Support Vector Machine (SVM). Each term feature is weighted using Term Frequency-Inverse Document Frequency (TF-IDF). Finite State Transduction is applied for learning feature weights. Experimental results show that the SVM algorithm with terrorism ontology feature selection yields the best performance, with 69.90% for both precision and recall.

I. INTRODUCTION

Many Thai people can access and read news articles about terrorism events occurring in the south of Thailand on many news websites. A news article typically includes a headline, date and time, places, event type, and full content. News articles are unstructured information; therefore, the required metadata must first be extracted manually before it can be entered into decision support applications. The extra time required for data entry discourages users from recording terrorism events, which reduces the usefulness of both the news articles and the decision support applications. In the proposed decision support system, users can select predefined entities, for example words, names, and places, to be extracted from news articles and classified automatically.

Grishman defines the task of information extraction as the identification of instances of a particular class of events or relationships in a natural language text, and the extraction of the relevant arguments of those events or relationships [1]. Most information extraction algorithms have been applied to English texts; relatively few have been developed for the Thai language. The existing machine learning algorithms are suitable for information extraction research, but they may not be directly applicable to Thai event extraction.

This paper focuses on linguistic feature selection, here also referred to as named entity techniques, such as a terrorism gazetteer, a terrorism ontology, and terrorism grammar rules used for entity recognition. News articles are annotated on the basis of these named entities. The annotated entities summarize relevant information from a given set of news articles into the desired template slots. Machine learning algorithms are used for event extraction from the news articles. The learning algorithms in these experiments are Support Vector Machine (SVM), Naïve Bayes (NB), K Nearest Neighbor (KNN), and Decision Tree (DTREE). TF-IDF similarity-based event classification is used to isolate specific events from the news article categories [2] for display to the users.

The rest of this paper is organized as follows. Section II presents related work on event extraction methodology and linguistic feature selection techniques. Section III presents the Thai terrorism news article corpus used in the system design and implementation. Section IV outlines the proposed framework for Thai terrorism event extraction. Section V describes the performance measures used in the empirical evaluation, outlines the experiments, and discusses the event extraction results. Finally, Section VI concludes the paper.

II. RELATED WORKS

Previous research in event information extraction has proposed a variety of algorithms. These algorithms differ in how they define feature selection to improve the information extraction process. Black and Ranjan proposed a method to bridge the gap between an incoming email and the user's personal calendar [2]. Their work uses the RAPIER algorithm [3] and a hand-coded pattern matcher to automate event extraction. The main system has four components that extract all requisite calendar information from incoming messages: (1) a named entity recognizer, (2) a threaded email filter, (3) an email signature filter, and (4) an email summarizer. This work proposes solutions to the problem at hand. Califf and Mooney use a hand-coded pattern and keyword matcher to identify obvious attributes [3]. When no matching pattern is found, a hidden Markov model (HMM) is used, as in [4]. In this approach, providing a large amount of training data and embedding the HMM system gave more realistic conditions for testing the full potential of the algorithm.

Manuscript received June 15, 2009. This work was supported in part by the faculty of Information Technology, King Mongkut’s University of Technology North Bangkok, Bangkok 10800, Thailand. U. Inyaem is with the faculty of Information Technology, King Mongkut’s University of Technology North Bangkok, Bangkok 10800, Thailand (phone: 662-549-4197; fax: 662-549-4193; e-mail: [email protected]). P. Meesad is with the faculty of Technical Education, King Mongkut’s University of Technology North Bangkok, Bangkok 10800, Thailand (e-mail: [email protected]). C. Haruechaiyasak is with the Human Language Technology Laboratory, National Electronics and Computer Technology Center, Pathumthani 12120, Thailand (e-mail: [email protected]). 978-1-4244-4139-6/09/$25.00 ©2009 IEEE


Naughton et al. propose a system that distinguishes the descriptions of separate events within a single document and then matches the description of each event with corresponding event descriptions from other documents [4]. The key idea is to cluster sentences using a distance metric and the regularity in the sequential structure of events within documents. Term Frequency-Inverse Document Frequency (TF-IDF) is used to better capture word frequencies within events. Li et al. use the SVM and NB learning algorithms with linguistic features for opinion analysis pilot tasks [5], and report satisfactory experimental results.

III. THAI TERRORISM NEWS ARTICLE CORPUS

In this paper, a training corpus and a test corpus are created. The corpus contains Thai terrorism news articles [6]. Three template patterns are defined for this task, as follows.

A. Occurrence event template

An occurrence event template is a news article that clearly contains information about the occurrence of a terrorism event, such as the date and time of the event, terrorists, victims, tactics, weapons, and places. Usually, a key word in the content of a news article for this pattern is “เกิดเหตุ” (to occur). There are 1080 terrorism news articles of this pattern in the experiment.

B. Finding suspicious item template

A finding suspicious item template is a news article that reports police finding suspicious items in the three southernmost provinces of Thailand, namely Pattani, Yala, and Narathiwat. Such a news article contains the date and time of the event, evidence, and places. Usually, a key word in the content of a news article for this pattern is “วัตถุต้องสงสัย” (suspicious items). There are 165 terrorism news articles of this pattern in the experiment.

C. Arrest of suspect template

An arrest of suspect template is a news article that reports police arresting suspects in the three southernmost provinces of Thailand. Such a news article contains the date and time of the event, suspects, weapons, and places. Usually, key words in the content of a news article for this pattern are “ผู้ต้องสงสัย” (suspect) and “ปิดล้อม” (to blockade). There are 255 terrorism news articles of this pattern in the experiment.

IV. THE PROPOSED SYSTEM FRAMEWORK

The architecture of the Thai terrorism event extraction system can be viewed as a pipeline of processes that takes the terrorism news article corpus as its input. The corpus data is preprocessed before named entity feature selection. Terrorism event extraction is then performed and compared across four machine learning algorithms. Finally, each extracted event is classified into one of the three patterns using TF-IDF similarity-based event classification. Each step of the pipeline is discussed below.

A. Named Entity Feature Selection

The corpus is preprocessed to obtain the named entity features for learning. The corpus of Thai terrorism news articles is prepared by exporting it into a single text file in Extensible Markup Language (XML) format [7]. About 70% of the articles are annotated with the labels that the learning system is expected to produce; these form the training files. The remaining 30% of the articles form the test files. The header and body of each news article are demarcated with XML tags. Up to nineteen entity types are annotated in this task: day, month, year, time, period, suburb, district, city, tactic, evidence, terrorist's name, amount of terrorists, level of terrorist, place, title, victim's name, amount of victims, victim's occupation, and amount of weapons. Three linguistic feature selection techniques, used with a rule-based recognizer, are studied and compared in this task: terrorism gazetteer, terrorism ontology, and terrorism grammar rules. They are described as follows.

1) Terrorism Gazetteer (TG): The terrorism gazetteer is developed using an open-source tool, the General Architecture for Text Engineering (GATE) graphical user interface tool [8]. The terrorism gazetteer is 36.5 KB in size and contains entries for the nineteen entity types. It consists of a set of lists compiled into Finite State Transduction (FST) [8]. Each list has three attributes: major type, minor type, and language, for example City.lst:Location:City:Thai. In this example, City.lst is the gazetteer list file, Location is the major type, City is the minor type, and Thai is the language. All attributes are used as inputs to the Java Annotation Pattern Engine (JAPE) grammar [8]. The list entries may be entities or parts of entities, or they may contain context information. In particular, titles often indicate people.

2) Terrorism Ontology (TO): The terrorism ontology is developed using the GATE graphical user interface tool. The steps for generating the ontology are as follows. First, a domain expert determines the scope of the ontology for the terrorism domain. Thai terrorism news articles consist of terrorist details, victim details, the date and time of events, places, location details, amount of evidence, amount of weapons, and tactics. Next, reusing an existing ontology is considered, but this is not possible in this system, so the ontology in this experiment is created anew from the Thai news article corpus. After that, the terms are enumerated by considering an overview of the terrorism domain. The class and instance hierarchy is defined and implemented using the GATE tool, and the properties and constraints of the classes are developed. The domain expert checks the final version again, and then all instances are created in the terrorism ontology system.

3) Terrorism Grammar Rule (TGR): Java Annotation Pattern Engine (JAPE) grammars in the GATE tool are used to develop the terrorism grammar rules. JAPE provides FST over annotations based on regular expressions. The JAPE phases run sequentially and constitute a cascade of FSTs over annotations. Hand-coded rules are created and applied to the annotations to identify named entities. The input annotations come from the format analysis, tokenization, and gazetteer modules. The priority of grammar rules can be set based on pattern length, rule status, and rule ordering. Nineteen grammar rules are created for the Thai terrorism domain in this task.
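The gazetteer list definition format described above (list file, major type, minor type, language) can be illustrated with a short Python sketch. It parses a definition such as City.lst:Location:City:Thai and annotates occurrences of list entries in a text. The class and function names, the English example sentence, and the use of plain substring search are assumptions made for illustration; the system described in the paper compiles its lists into finite state machinery inside GATE, which this sketch does not reproduce.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ListDefinition:
    """One gazetteer list definition, e.g. City.lst:Location:City:Thai."""
    file_name: str
    major_type: str
    minor_type: str
    language: str


def parse_list_definition(line: str) -> ListDefinition:
    """Split a colon-separated definition into its four attributes."""
    parts = [part.strip() for part in line.split(":")]
    if len(parts) != 4 or not all(parts):
        raise ValueError(f"expected file:major:minor:language, got {line!r}")
    return ListDefinition(*parts)


def lookup(text: str, entries: list[str], definition: ListDefinition) -> list[dict]:
    """Naive substring lookup (hypothetical stand-in for the FST lookup).

    Returns one annotation per occurrence of a list entry, sorted by offset.
    """
    annotations = []
    for entry in entries:
        if not entry:
            continue  # skip empty list lines
        start = text.find(entry)
        while start != -1:
            annotations.append({
                "start": start,
                "end": start + len(entry),
                "text": entry,
                "major_type": definition.major_type,
                "minor_type": definition.minor_type,
            })
            start = text.find(entry, start + 1)
    return sorted(annotations, key=lambda a: a["start"])


if __name__ == "__main__":
    city_list = parse_list_definition("City.lst:Location:City:Thai")
    sentence = "A bomb was found in Yala and Pattani."
    for hit in lookup(sentence, ["Pattani", "Yala", "Narathiwat"], city_list):
        print(hit)
```

Each annotation carries the major and minor type of the list it came from, which is the information the grammar rules take as input in the description above.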

B. The Event Extraction System

Machine learning is used to extract terrorism events from the Thai terrorism news articles, with linguistic feature selection performed using the GATE tool. The tool relies on finite state algorithms and the JAPE language [8], and its components form a pipeline. The task involves three subtasks. The first is to annotate some news articles with the class labels that the learning system is expected to produce. The second is to prepare the corpus to obtain the linguistic feature selection for the learning system. The third is to create an XML configuration file that sets up the machine learning application interface; in particular, it selects the learning algorithm and defines the linguistic features used for learning. The first and second steps have already been described in the previous section. In the last step, the XML configuration file is created to connect the GATE tool to the Weka application [9]. The configuration file must contain three basic elements: optional settings, dataset, and engine. The optional settings facilitate different tasks; in this task they are left at their default values. The dataset element defines the types of annotation to be used as instances and the set of attributes that characterizes them. The engine element specifies which machine learning algorithm is used and sets the options for that algorithm.

The learning algorithms used in this system are SVM, NB, KNN, and DTREE, described in detail as follows.

1) Support Vector Machine (SVM) [10]: The basic idea of SVM is to find a hypothesis h for which the lowest true error can be guaranteed. The true error of h is the probability that h will make an error on an unseen and randomly selected test example. SVM finds the hypothesis h that approximately minimizes a bound on the true error by effectively and efficiently controlling the dimension of the hypothesis space.

2) Naïve Bayes (NB) [11]: NB is a simple probabilistic classifier based on applying Bayes' theorem with strong (naïve) independence assumptions. The basic idea of NB is to use the joint probabilities of words and categories to estimate the probabilities of categories given a document.

3) K Nearest Neighbor (KNN) [10]: KNN is a supervised learning algorithm. The main idea is to classify data according to its similarity to predefined (labeled) data. The classifier measures the similarity between an unclassified data object and the predefined data using a distance measure. The algorithm computes the distance between the unclassified data object and the objects in the training dataset and selects the k closest ones. The majority class among the k nearest neighbors is assigned to the unclassified data object.

4) Decision Tree (DTREE) [11]: DTREE is a well-known machine learning approach. A decision tree is a tree in which the internal nodes are labeled by features, the edges leaving a node are labeled by tests on the weights of the features, and the leaves are labeled by categories.

The four algorithms are implemented with the following tools included in the open-source machine learning package Weka: SMOWeka for SVM, NaiveBayesWeka for NB, KNNWeka for KNN, and C4.5 for DTREE. The experiments are divided into a training mode, which learns a model from a training dataset, and a test mode, which runs on new documents. The extracted event results are printed in the message window of the GATE GUI and can then be saved to an XML file.

C. TF-IDF-based Event Classification

This section describes the event classification task. After the event extraction task is finished, a hand-coded Perl script [12] extracts all relevant entities of the terrorism event in each news article into their pattern slots in XML format. Event classification then uses a similarity measure based on Term Frequency-Inverse Document Frequency (TF-IDF) [13]. Each news article is compared against the three patterns as candidate categories. This task reproduces the TF-IDF measure from [2], which is defined below.

Let N be a news article and F(N) be its word-frequency vector. Each component F(N, t) represents the frequency of token word t in news article N. Using this definition, the weight for a given category C is computed as follows. Let TT(C, w) be the number of times that word w occurs in the news articles belonging to category C, divided by the total number of words in category C. Let TT(Tr, w) be the number of times that word w occurs in the training news articles, divided by the total number of words across all training news articles. The term frequency TF(C, w) is then computed as in equation (1).
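As a rough illustration of the definitions of TT(C, w) and TT(Tr, w) in the TF-IDF-based event classification step, the following Python sketch computes both relative frequencies from tokenized articles and takes their ratio as the term frequency TF(C, w). The function names, the toy English tokens, and the category labels are assumptions made for the example; the paper does not show its own implementation.

```python
from collections import Counter


def relative_frequencies(docs):
    """Map each word to (occurrences in docs) / (total words in docs).

    `docs` is a list of token lists, one per news article.
    """
    counts = Counter(token for doc in docs for token in doc)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {word: count / total for word, count in counts.items()}


def term_frequency(training):
    """Compute TF(C, w) = TT(C, w) / TT(Tr, w) for every category C.

    `training` maps a category label to its list of tokenized articles.
    """
    all_docs = [doc for docs in training.values() for doc in docs]
    tt_tr = relative_frequencies(all_docs)      # TT(Tr, w)
    tf = {}
    for category, docs in training.items():
        tt_c = relative_frequencies(docs)       # TT(C, w)
        # Every word of a category also occurs in the training set,
        # so tt_tr[word] is always present and non-zero here.
        tf[category] = {word: tt_c[word] / tt_tr[word] for word in tt_c}
    return tf


if __name__ == "__main__":
    # Hypothetical toy corpus with two of the three template categories.
    toy = {
        "occurrence": [["bomb", "exploded", "market"]],
        "arrest": [["police", "arrested", "suspect"], ["police", "blockade"]],
    }
    for category, weights in term_frequency(toy).items():
        print(category, weights)
```

A value above 1 means the word is relatively more frequent in the category than in the training set as a whole, which is what makes it useful as evidence for that category.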

$$TF(C, w) = \frac{TT(C, w)}{TT(Tr, w)} \tag{1}$$

The document frequency DF(w) is defined as the number of categories in which the word w occurs at least once, divided by the total number of categories. Using this term frequency and document frequency, the weight W(C, w) is computed as follows.

$$W(C, w) = \frac{TF(C, w)}{DF(w)^{2}} \tag{2}$$

Using these weights, a similarity measure SIM(N, C) between a given news article N and a given category C is computed as follows, where C(N, w) is the frequency of word w in news article N, that is, the component F(N, w) of the word-frequency vector.

$$SIM(N, C) = \frac{\sum_{w \in N} C(N, w)\, W(C, w)}{\min\left( \sum_{w \in N} C(N, w),\ \sum_{w \in N} W(C, w) \right)} \tag{3}$$

When a news article is classified, the category with the largest similarity score is selected.

V. RESULTS

This section describes the performance measures for all experiments, followed by the experimental results and discussion.

A. Performance measures

The system output is evaluated by comparing it with summaries generated by human experts for the same test set of previously unseen texts. The comparison is performed by an automated scoring program that rates each system according to precision, recall, and F1 score [14]. Precision measures the reliability of the extracted information, as shown in equation (4).

$$\text{precision} = \frac{\#\,\text{correct slot fillers in output templates}}{\#\,\text{slot fillers in output templates}} \tag{4}$$

Recall measures the amount of relevant information that the natural language processing system correctly extracts from the test collection, as shown in equation (5).

$$\text{recall} = \frac{\#\,\text{correct slot fillers in output templates}}{\#\,\text{slot fillers in answer keys}} \tag{5}$$

The F1 score is the weighted harmonic mean of precision and recall. It is one of the commonly used measures that combine precision and recall into a single rating, and is defined in equation (6).

$$F_1 = \frac{2 \times \text{precision} \times \text{recall}}{\text{precision} + \text{recall}} \tag{6}$$

B. The experimental results and discussion

All experiments are performed on the Thai news article corpus [6]. The corpus has three news article categories and is partitioned into training and test datasets. The training dataset contains 1,050 news articles and the test dataset contains 450 news articles. GATE is used to perform all experiments, with default settings for all algorithms. The performance measures for evaluating event information extraction are precision, recall, and F1 measure, as described in the previous section. The proposed techniques, terrorism gazetteer, terrorism ontology, and terrorism grammar rule, are used for the entity recognition process. The results in terms of precision, recall, and F1 measure are averaged across all experiments on the dataset.

The named entity recognition results with the proposed linguistic features are summarized in Table I. They show that, among the linguistic feature selection techniques, the terrorism ontology gives higher performance, with 52.00% precision and 34.70% recall, than the terrorism gazetteer and the terrorism grammar rule. The event extraction results for the four machine learning algorithms with terrorism ontology feature selection are summarized in Table II. The SVM algorithm with terrorism ontology feature selection yields the best performance, with 69.90% for both precision and recall. The classification results for the extracted events using the TF-IDF similarity measure are summarized in Table III. They show that P2 gives the highest classification rate, with 90.50% for both precision and recall, followed by P3 and then P1. Moreover, the results indicate that the size of the data affects how correctly the articles are classified.

TABLE I. THE SUMMARIZED RESULTS USING SEVERAL NAMED ENTITY METHODS

Methods    Precision (%)    Recall (%)    F-Measure (%)
TG         52.00            34.44         50.00
TO         52.00            34.70         50.25
TGR        50.00            33.31         49.99
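The slot-based precision, recall, and F1 definitions in equations (4) to (6) can be illustrated with a minimal Python sketch. It assumes that each slot filler is represented as a (template id, slot name, filler) tuple and that a filler counts as correct only when it matches an answer-key filler exactly; both the representation and the exact-match rule are assumptions for the example, not a description of the scoring program cited in the paper.

```python
def slot_scores(output_fillers, answer_fillers):
    """Return (precision, recall, f1) for two sets of slot fillers.

    Each filler is a hashable tuple such as (template_id, slot, value).
    A filler is counted as correct when it also appears in the answer key.
    """
    output_set = set(output_fillers)
    answer_set = set(answer_fillers)
    correct = len(output_set & answer_set)

    precision = correct / len(output_set) if output_set else 0.0
    recall = correct / len(answer_set) if answer_set else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


if __name__ == "__main__":
    # Hypothetical answer key: five slot fillers for one news article.
    answer = [
        ("n1", "city", "Yala"),
        ("n1", "tactic", "bombing"),
        ("n1", "day", "15"),
        ("n1", "month", "June"),
        ("n1", "amount_of_victims", "3"),
    ]
    # Hypothetical system output: four fillers, three of them correct.
    output = [
        ("n1", "city", "Yala"),
        ("n1", "tactic", "bombing"),
        ("n1", "day", "15"),
        ("n1", "month", "July"),
    ]
    p, r, f = slot_scores(output, answer)
    print(f"precision={p:.4f} recall={r:.4f} f1={f:.4f}")
```

In this toy case three of four output fillers are correct and three of five answer-key fillers are recovered, so precision exceeds recall and F1 falls between the two.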

TABLE II. THE SUMMARIZED RESULTS USING SEVERAL MACHINE LEARNING ALGORITHMS WITH TO FEATURE

Algorithms    Precision (%)    Recall (%)    F-Measure (%)
SVM           69.90            69.90         69.90
NB            49.00            49.10         49.05
DTREE         60.60            62.30         61.89
KNN           49.20            47.60         48.40

TABLE III. THE SUMMARIZED RESULTS FOR THE CLASSIFICATIONS IN SEVERAL FOLDERS USING SVM ALGORITHM WITH TF-IDF METHOD

Folders    Precision (%)    Recall (%)    F-Measure (%)
P1         88.90            88.90         88.90
P2         90.50            90.50         90.50
P3         89.99            89.99         89.99

Here P1 denotes the news articles of the occurrence event pattern, P2 the news articles of the finding suspicious item pattern, and P3 the news articles of the arrest of suspect pattern.

VI. CONCLUSIONS

This paper has examined the design of a system to assist Thai terrorism event extraction. Three named entity techniques have been compared to find the most suitable one for implementing the terrorism event extraction system. The main functions of the proposed system are the identification of named entities, the extraction and classification of terrorism events from the terrorism news article corpus, and the presentation of the results to the users. Three linguistic feature techniques, namely the terrorism gazetteer, the terrorism ontology, and the terrorism grammar rule, have been studied and compared for the entity recognition process. The machine learning algorithms extract the relevant entities into the desired template slots of the news articles. The learning methods in this study are the SVM, NB, KNN, and DTREE algorithms. TF-IDF similarity-based event classification is applied to all news articles to isolate specific events from the news article categories for presentation to the users. Additionally, FST is used to learn the feature weights and is studied with a view to improving the performance of the machine learning algorithms.

The results for named entity recognition show that the terrorism ontology has the highest accuracy, with 52.00% precision and 34.70% recall. The accuracy of the terrorism gazetteer differs only slightly, with 52.00% precision and 34.44% recall. However, the terrorism gazetteer has the disadvantage of being limited to its domain. In addition, the terrorism grammar rule has high accuracy when a covering rule can be defined for a specific domain. The results for event extraction show that the SVM algorithm with terrorism ontology feature selection yields the best performance, with 69.90% for both precision and recall. The results for event classification show that the finding suspicious item articles have the highest accuracy, with 90.50% for both precision and recall. The experimental results also show that the F-measure values are still low because, in this study, the system is at an early stage and has little information to learn from. The system needs more data to learn and improve. If the test dataset is not kept separate from the training dataset, the F-measure values will be high.

ACKNOWLEDGMENT

U. Inyaem thanks the supervisor, Asst. Prof. Dr. Phayung Meesad, who was abundantly helpful and offered invaluable assistance, support, and guidance. Deepest gratitude is also due to the co-supervisor, Dr. Choochart Haruechaiyasak, without whose knowledge and assistance this study would not have been successful. The author would also like to convey thanks to the faculty of Information Technology, King Mongkut’s University of Technology North Bangkok, for providing the financial means and laboratory facilities.

REFERENCES

[1] R. Grishman, “Discovery Methods for Information Extraction,” In proceeding of the ISCA & IEEE workshop on spontaneous speech processing and recognition, submitted for publication.
[2] J. A. Black, N. Ranjan. (2004, July). Automated event extraction from email. Available: http://nlp.stanford.edu/courses/cs224n/2004/
[3] M. E. Califf, R. J. Mooney, “Bottom-up relation learning of pattern matching rules for information extraction,” Journal of Machine Learning Research, vol. 4, Dec. 2003, pp. 177–210.
[4] M. Naughton, N. Kushmerick, J. Carthy, “Event extraction from heterogeneous news source,” In proceedings of the AAAI workshop event extraction and synthesis, submitted for publication.
[5] Y. Li, K. Bontcheva, H. Cunningham, “Experiments of opinion analysis on the corpora MPQA and NTCIR-6,” In proceedings of NTCIR-6 workshop meeting, submitted for publication.
[6] U. Inyaem, P. Meesad, C. Haruechaiyasak, “Domain Knowledge based information filtering for terrorism new articles,” In proceeding of NCCIT Conference, submitted for publication.
[7] E. Castro, XML for the World Wide Web (Visual Quick Start Guide). CA: Berkeley, 2001, pp. 70–119.
[8] H. Cunningham, K. Bontcheva, D. Maynard, V. Tablan, “GATE – A new release,” ELSNews, vol 11(1), 2002.
[9] I. H. Witten, E. Frank, Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations. Academic Press, CA: San Diego, 2000, pp. 159–226.
[10] R. Feldman, J. Sanger, The Text Mining Handbook: Advanced Approaches in Analyzing Unstructured Data. Cambridge University Press, NY: New York, 2007, pp. 70–78.
[11] T. M. Mitchell, Machine Learning. McGraw-Hill, NY: New York, 1997, pp. 52–78, 154–199.
[12] L. Wall, T. Christiansen, J. Orwant, Programming Perl. O’Reilly media, CA: Sebastopol, 2000, pp. 123–135.
[13] Z. Yun-tao, G. Ling, W. Yong-cheng, An improved TF-IDF approach for text classification. J Zhejiang Univ SCI, CN, 2005, pp. 49–55.
[14] A. Dalli, “Automated email integration with personal information management application,” The UK special-interest group for computational linguistics, submitted for publication.
