Available online at www.sciencedirect.com

ScienceDirect Procedia Technology 11 (2013) 1181 – 1187

The 4th International Conference on Electrical Engineering and Informatics (ICEEI 2013)

An Intelligent Document Clustering Approach to Detect Crime Patterns

Qusay Bsoul, Juhana Salim*, Lailatul Qadri Zakaria

Knowledge Technology Research Group, Universiti Kebangsaan Malaysia, 43600 Bangi Selangor, Malaysia

Abstract

The growing number of crime reports and news stories published every day makes it increasingly difficult to detect crimes and to identify crime patterns in the news. Document Clustering has become an important unsupervised learning task for this purpose. It aims to automatically group similar documents into one cluster using different types of feature extraction and clustering algorithms. Work is ongoing to improve Document Clustering techniques, in both extraction and clustering, in order to overcome the difficulty of designing a general-purpose document clustering method for crime investigation and the ill-posed nature of the extraction and clustering problems. This paper discusses the two major sequential stages of Document Clustering, “Extraction Features and Clustering Algorithms”, as well as the major challenges and key issues in designing extraction features and clustering algorithms. In addition, the proposed approach is intended to assist law enforcement officers and detectives in enhancing performance and speeding up the process of solving crimes.

© 2013 The Authors. Published by Elsevier Ltd. Open access under CC BY-NC-ND license. Selection and peer-review under responsibility of the Faculty of Information Science & Technology, Universiti Kebangsaan Malaysia.

Keywords: Intelligent Clustering; K-means; Affinity Propagation algorithm; Lemmatization Algorithm; Crime Detection

* Corresponding author. E-mail address: [email protected]

2212-0173 © 2013 The Authors. Published by Elsevier Ltd. Open access under CC BY-NC-ND license. Selection and peer-review under responsibility of the Faculty of Information Science & Technology, Universiti Kebangsaan Malaysia. doi:10.1016/j.protcy.2013.12.311

1. Introduction

Due to its increasing social importance, the crime domain has been selected as the application area for this work. The history of tracking and solving crimes shows that this important domain has been the preserve of a large number of experts and specialists in criminal justice and law enforcement. Recently, with rapid technological advances, including the growing use of computerized systems to trace and track crimes, computer data analysts have taken practical steps to assist law enforcement officers and detectives in enhancing the process of solving crimes and improving its performance. In recent years, researchers have shown increasing interest in detecting and tracking crime news stories using clustering methods. This interest is attributed to the social dilemma posed by crimes, which spread like an epidemic and pose tremendous threats to societies [1-2]. Because large amounts of news stories in general, and crime news in particular, accumulate on the web like a flood, decision makers in law enforcement departments face many challenges in detecting, identifying and tracking crime events [1-3]. Thus, tracking crimes or events along their timeline is becoming a tedious task. The difficulty of organizing crime news stories stems from the high dimensionality of crime data, which usually refers to highly diverse embedded modalities such as criminal data and weapon data [4]. In other words, these modalities give law enforcement officers and detectives justified explanations of the global view of crime patterns by identifying the relations between local patterns [4-5]. Document Clustering is one of the most commonly used methods for detecting topics/events or types of crime documents [6], and it consists of three main processes [1-7].
The first process is the preprocessing of documents, which removes unimportant words and symbols from the crime document. The second process is the representation of crime documents, which extracts the most important information from each document and shows the similarity among documents. The last process applies a clustering algorithm to group the documents by topic/event or type of crime, based on the similarities among them. This study aims to present the limitations of Document Clustering in two stages (document extraction and clustering algorithms) with regard to the crime domain. Document Clustering has been studied for a long time; however, it remains an active research area that requires further improvement [8]. Section 2 examines two of the three main processes of document clustering. Section 3 describes our proposed system for Crime Document Clustering and, lastly, section 4 presents the conclusion.

2. Process of document clustering

This section reviews previous work on term extraction and clustering algorithms for Document Clustering, in the first and second sub-sections respectively. It then presents the limitations of that work in two of the three processes of Document Clustering.

2.1. Extraction terms

In this stage, each crime document is represented by a set of terms, which are called features. Selecting the features from the indexed words is a real challenge, because more than one feature, or combinations of features, are considered in this stage. A wide body of research has been carried out in this extraction domain.
Some researchers have focused on extracting information from the terms or ‘features’ that indicate a certain crime by employing named entities [9], bag of words [10-11], n-grams [12], Frequent Word and Frequent Word Meaning [13], Concept Weighting ontology [14], WordNet Lexical ontology [15] and core semantic features [16] to make document clustering more effective. Zhiwei Li et al. [8] compared bag of words with named entities; their findings showed that the named entity approach gave better results than bag of words. In addition, Yanjun Li et al. [13] compared Frequent Word and Frequent Word Meaning with bag of words; their results showed that Frequent Word and Frequent Word Meaning performed better than bag of words. Masnizah et al. [6] compared named entities with bag of words on the basis of user judgments and found that the named entity approach is better than bag of words.
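To make two of the feature types named above concrete, the following minimal sketch builds a bag of words and word-level n-grams for a single sentence. It is an illustration only: the tokenizer, the function names and the example sentence are assumptions introduced here, and the cited works may use different units (for example character n-grams) and different preprocessing.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def bag_of_words(text):
    """Count how often each word occurs, ignoring word order."""
    return Counter(tokenize(text))


def word_ngrams(text, n=2):
    """Count contiguous word sequences of length n."""
    tokens = tokenize(text)
    grams = zip(*(tokens[i:] for i in range(n)))
    return Counter(" ".join(gram) for gram in grams)


# Hypothetical example sentence, not taken from the paper's dataset.
doc = "Police arrested the suspect after the suspect fled the scene"
bow = bag_of_words(doc)
bigrams = word_ngrams(doc, n=2)
```

The bag of words discards word order, whereas the bigram counts keep local context such as "the suspect"; this is the kind of difference the comparisons above are concerned with.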


Samah Fodeh et al. [16] carried out an experiment using ontology and semantic features in which all polysemous and synonymous nouns were extracted from the documents. They developed an approach to measure the information gain obtained by disambiguating these nouns in an unsupervised learning setting, with the purpose of identifying the core subset of semantic features that represents a text corpus. Their results showed that using core semantic features for clustering can reduce the number of features by 90% or more while still producing clusters that capture the major themes in a text corpus. According to Hmway Hmway and Thi Thi [14] and Tarek F. Gharib et al. [15], Concept Weighting ontology and WordNet Lexical ontology each performed much better than bag of words.

Table 1. Summary of previous work on extraction

Authors | Dataset used     | State of the art      | Outperforming method
--------|------------------|-----------------------|-------------------------------------
[8]     | Financial news   | BOW                   | NER
[13]    | News             | BOW                   | Frequent word, frequent word meaning
[6]     | TDT              | BOW                   | NER
[16]    | Diverse datasets | Ontology and semantic | Core semantic features
[14]    | News             | BOW                   | Weighting ontology
[15]    | News             | BOW                   | WordNet ontology

A lemmatization algorithm extracts information from sentences based on certain rules of the sentence. For instance, Eiman Al-Shammari and Lin [17] developed a new technique, published as a patent application, called a lemmatization algorithm for the Arabic language. It aims to extract nouns and verbs from Arabic documents based on preposition words, together with rules related to other linguistic elements such as the definite article “the”. The algorithm catches important words using two lists of prepositions: the first contains words that precede verbs and the second contains words that precede nouns. To our knowledge, it is very important to extract words such as nouns and verbs related to crime topics/events, and this is especially important in Document Clustering.

2.2. Cluster Algorithms

Many researchers have carried out empirical work on clustering algorithms. For example, Dai et al. [7] improved Agglomerative Hierarchical Clustering (AHC) by taking into account the importance of the title of a story: a term that occurs in the title is assigned a higher weight. The findings showed that the proposed method was effective in clustering financial news documents. Other researchers have focused on clustering topics or events. Fernando Bação et al. [18] used a self-organizing map (SOM) to cluster IRES (“image of eye”) data and compared it with k-means; the SOM performed better than k-means. Sheng-Tun Li et al. [19] used a fuzzy self-organizing map (FSOM) network to detect and analyze patterns of crime trends from temporal crime activity data. Fredrik and James [20] compared scalable k-means, complete k-means and k-means on the Knowledge Discovery and Data Mining (KDD) data set; their results showed that k-means was better than the other algorithms.
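The title weighting attributed to Dai et al. [7] earlier in this sub-section can be illustrated with a small sketch. It is one possible reading, not their published formula: the boost factor `TITLE_BOOST`, the whitespace tokenization and the function name are all assumptions made here for illustration.

```python
from collections import Counter

TITLE_BOOST = 2.0  # hypothetical factor; the source gives no value


def weighted_term_counts(title, body, title_boost=TITLE_BOOST):
    """Body term counts, multiplied by title_boost for terms found in the title."""
    title_terms = set(title.lower().split())
    counts = Counter(body.lower().split())
    return {
        term: count * (title_boost if term in title_terms else 1.0)
        for term, count in counts.items()
    }


# Hypothetical story used only to show the effect of the boost.
weights = weighted_term_counts(
    title="Bank fraud trial",
    body="the fraud case opened as the bank reported losses from fraud",
)
```

With this weighting, "fraud" and "bank" count for more than equally frequent body words such as "the", so stories that share title terms end up closer together before clustering.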
Bouras and Tsogkas [21] evaluated several clustering methodologies: single, maximum and centroid linkage hierarchical clustering, as well as regular k-means, k-medians and k-means++. They found that k-means not only gave the best results on the internal clustering index function, but also led to the best results in an experiment with real users. In addition, when comparing k-means with single pass and other clustering algorithms on news topics, Taeho Jo [22] found that k-means was better than single pass clustering. Zhiwei Li et al. [8] suggested estimating the initial number of events from the article count-time distribution in their probabilistic model, where the estimated number of events serves as the initial number of clusters (K). Masnizah et al. [6] proposed new ways of choosing the optimal initial centroid for each cluster, and their results achieved a higher F-measure. Regarding hybrid clustering algorithms, Xiangying Dai et al. [23] proposed a two-layer text clustering approach for retrospective news event detection that uses Affinity Propagation (AP) clustering as the first layer. They then performed a second feature selection on the clusters generated by AP and applied the usual agglomerative hierarchical clustering to generate the final news events. Traditional agglomerative hierarchical clustering (AHC) and classic k-means were chosen as comparative methods. The findings showed that the proposed method achieved the highest precision, followed by AP clustering, k-means and AHC, in that order. For recall, the proposed method and k-means obtained the highest results, followed by AHC and AP clustering. Velmurugan and Santhanan [24] compared three clustering algorithms on a geographic map data set. Their results showed that k-means was good with small data, k-medoids was good with large data, and fuzzy c-means fell between k-means and k-medoids.

Table 2. Summary of previous work on cluster algorithms

Authors | Dataset used          | State-of-the-art algorithm                             | Outperforming clustering algorithm
--------|-----------------------|--------------------------------------------------------|-----------------------------------
[7]     | Financial news        | AHC                                                    | Enhanced AHC
[18]    | IRES                  | k-means                                                | SOM
[19]    | Data series of crimes | N/A                                                    | FSOM
[20]    | KDD                   | Scalable, complete k-means                             | k-means
[21]    | Web                   | Single, maximum, centroid AHC and k-medians, k-means++ | k-means
[22]    | News                  | Single pass                                            | k-means
[8]     | News                  | Probabilistic model cluster                            | Enhanced probabilistic model
[6]     | Crimes                | Single pass, k-means                                   | Enhanced k-means
[23]    | News                  | k-means, AHC, AP                                       | APAHC
[24]    | Geographic map        | Fuzzy c-means                                          | k-means, k-medoids

2.3. Limitations of feature extraction and cluster algorithm

In general, the main limitations of the previous work summarized in Tables 1 and 2 are as follows:

• The researchers assumed that named entities are good for extracting information from crime documents, without examining their weakness in extracting specific important information such as nouns and verbs while avoiding unimportant features [25].
• The researchers assumed that k-means is the best algorithm [26] when comparing plain (non-hybrid) k-means with other clustering algorithms, without examining the weaknesses of k-means for crime documents. Others addressed these weaknesses by hybridizing k-means [6] with other algorithms and compared the resulting algorithms with others to increase their performance.
• In general, all of the proposed works addressed the weakness of document clustering by enhancing only part of the process, although the output of each stage affects the accuracy of the next [25]. The problems therefore need to be detected from the bottom up, that is, from the clustering algorithm back to the extraction of terms.

As mentioned above, document clustering has three processes, running from the bottom (“cluster algorithm”) up to the top (“extracting the terms from crime documents”). The bottom-up idea thus guides us to identify the problem of extracting information from crime documents while solving the weakness of the k-means clustering algorithm. In this way, it becomes easy to detect the extraction problem and to know which information should be avoided.


2.4. Need

As far as we know, all existing document clustering methods for detection and identification have assumed that all crime documents are proved. However, there are two problems related to the clustering algorithm: the performance of k-means clustering depends heavily on the chosen number of clusters and on the initial seed centroids, so its result is often suboptimal [27]. There is also a problem related to feature extraction [25]. Each of these problems affects the other processes [25]. To overcome these obstacles, we propose alternatives that help detect and identify groups of topics/events or types of crime in Document Clustering with higher performance and effectiveness, as described in the next section.

3. Data collection and performance measure

3.1. Data collection and performance measure

For this research, the corpora were collected from Bernama news*. The test dataset covers six categories of topics: Canny Ong, Mona Fandy, Noritta Samsudin, Nurin Jazlin, Sharlinie Mohd Nashar and Sosilawati articles. In addition, the test dataset covers ten types of crime: traffic violation, theft, sex crime, murder, kidnap, fraud, drugs, cybercrime, arson and gang articles. The topic and type datasets consist of 247 and 2,400 documents, respectively, which are used as the testing dataset. This study employs overall purity and overall F-measure to measure external quality. These two measures are popular document clustering measures [28-29]. Higher overall purity and F-measure values indicate a better clustering.

3.2. Proposed extraction features

To address the extraction weakness described in section 2.3, named entities are suggested [30] for extracting information that answers three questions: who, where and when. These questions are important because such features are usually used to describe crime in general. We would also like to extract nouns and verbs using WordNet to provide the most important features for describing specific types or events of crime (as shown in Fig. 1). Events or types of crime can be identified from nouns, which identify locations, names, topics, dates and so on, and from verbs, which identify the events and types of event. In addition, redundant features extracted by both the named entity step and the noun and verb step will be removed.

3.3. Proposed Affinity Propagation algorithm

To overcome the weakness of k-means clustering, we would use the Affinity Propagation algorithm [31] to generate the number of clusters, as illustrated in Fig. 2, and to create a locally optimal initial centroid for each cluster. In addition, we will detect unimportant features by studying the terms shared between documents placed in the wrong cluster and correctly clustered documents, as shown in Fig. 2.
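The overall purity and overall F-measure chosen in section 3.1 to judge the proposals above can be written down compactly. The sketch below assumes the commonly used definitions (purity as the share of documents in the majority class of their cluster, and overall F-measure as the class-size-weighted best F-score of each class over all clusters); the paper does not spell out its formulas, and the labels below are toy data, not the Bernama dataset.

```python
from collections import Counter


def contingency(labels, clusters):
    """Count documents for every (cluster, class) pair."""
    return Counter(zip(clusters, labels))


def overall_purity(labels, clusters):
    """Fraction of documents that belong to the majority class of their cluster."""
    table = contingency(labels, clusters)
    best = Counter()
    for (cluster, _), n in table.items():
        best[cluster] = max(best[cluster], n)
    return sum(best.values()) / len(labels)


def overall_f_measure(labels, clusters):
    """Class-size-weighted average of each class's best F-score over clusters."""
    table = contingency(labels, clusters)
    class_sizes = Counter(labels)
    cluster_sizes = Counter(clusters)
    total = 0.0
    for cls, n_cls in class_sizes.items():
        best_f = 0.0
        for clu, n_clu in cluster_sizes.items():
            n = table[(clu, cls)]
            if n == 0:
                continue
            precision = n / n_clu
            recall = n / n_cls
            best_f = max(best_f, 2 * precision * recall / (precision + recall))
        total += n_cls / len(labels) * best_f
    return total


# Toy ground-truth crime types and cluster ids for six documents.
labels = ["murder", "murder", "murder", "theft", "theft", "fraud"]
clusters = [0, 0, 1, 1, 1, 2]
purity = overall_purity(labels, clusters)
f_measure = overall_f_measure(labels, clusters)
```

Both measures need the true category of every document, which is why they are called external quality measures; a value of 1.0 means every cluster matches exactly one category.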

[Figure 1: Named entities, nouns and verbs feed into the most important features of crime.]

Fig. 1. Our proposed extraction of the most important features of crime.
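The combination shown in Fig. 1, including the removal of redundant features mentioned in section 3.2, can be sketched as a simple ordered union. The function name, the case-insensitive notion of redundancy and the example terms are assumptions made for illustration; the named entity and WordNet extraction steps themselves are not implemented here.

```python
def merge_features(named_entities, nouns, verbs):
    """Union the three feature lists, keeping first-seen order and dropping
    case-insensitive duplicates (e.g. a noun already found as a named entity)."""
    merged, seen = [], set()
    for term in [*named_entities, *nouns, *verbs]:
        key = term.lower()
        if key not in seen:
            seen.add(key)
            merged.append(term)
    return merged


# Hypothetical extractor outputs for one article (not taken from the dataset).
named_entities = ["Kuala Lumpur", "Monday"]
nouns = ["suspect", "kuala lumpur", "knife"]
verbs = ["stabbed", "arrested"]
features = merge_features(named_entities, nouns, verbs)
```

Keeping the first-seen form means a term found by the named entity step is not counted again when the noun step returns it.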


[Figure 2: flowchart with the following boxes: document of crime; preprocessing; extraction of nouns and verbs, also named entities; representation using TF-IDF; (AP) algorithm; k-means clustering; detect wrong documents by studying all terms shared with their wrong centroid and correct centroid; avoid unimportant terms and redo from TF-IDF.]

Fig. 2. Our proposed crime document clustering, which enhances clustering and avoids unimportant features.
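The hand-off from the AP step to the k-means step in Fig. 2 can be sketched as k-means that starts from supplied centroids, so that both the number of clusters and the initial seeds come from the earlier stage. This is a minimal sketch under stated assumptions: Affinity Propagation itself is not implemented, the `exemplars` list merely stands in for its output, and the 2-D points are toy vectors rather than TF-IDF rows.

```python
import math


def nearest(point, centroids):
    """Index of the centroid closest to point (Euclidean distance)."""
    return min(range(len(centroids)), key=lambda i: math.dist(point, centroids[i]))


def seeded_kmeans(points, seeds, max_iter=100):
    """Lloyd's k-means started from the given seed centroids.

    The number of clusters is len(seeds), so k and the initial centroids
    are inherited from the previous stage instead of chosen at random.
    """
    centroids = [list(seed) for seed in seeds]
    assignment = None
    for _ in range(max_iter):
        new_assignment = [nearest(p, centroids) for p in points]
        if new_assignment == assignment:
            break  # converged: no document changed cluster
        assignment = new_assignment
        for i in range(len(centroids)):
            members = [p for p, a in zip(points, assignment) if a == i]
            if members:  # keep the old centroid if a cluster is empty
                centroids[i] = [sum(col) / len(members) for col in zip(*members)]
    return assignment, centroids


# Toy 2-D document vectors; in the proposed pipeline these would be TF-IDF rows.
points = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.2), (5.0, 5.1), (5.2, 4.9)]
# Hypothetical exemplars, standing in for the output of Affinity Propagation.
exemplars = [points[0], points[3]]
assignment, centroids = seeded_kmeans(points, exemplars)
```

Because the seeds are fixed, repeated runs give the same clustering, which removes the dependence on random initial centroids noted in section 2.4.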

4. Conclusion

This paper identifies some limitations in Crime Document Clustering. First, it addresses the weaknesses of the k-means algorithm; then it examines the weakness of term extraction from documents, as stated above. The study aims to enhance the reliability of Document Clustering for crime documents through a more efficient k-means and improved feature extraction. It is applied to crime document clustering, where enhancing the k-means algorithm and the extraction of the information that groups crime topics/events can outperform the original Document Clustering and other Document Clustering methods on two criteria, time and performance. We expect that our proposed crime document clustering will enhance both performance and effectiveness.

References

[1] Chandra A, Gupta B, Gupta C. A multivariate time series clustering approach for crime trends prediction. Proceedings of the International Conference on Systems, Man and Cybernetics, SMC; 2008. p. 892-896.
[2] Alruily, Ayesh, Al-Marghilani. Using Self Organizing Map to Cluster Arabic Crime Documents. Proceedings of the International Multiconference on Computer Science and Information Technology; 2010. p. 357-363.
[3] Chen A, Zeng B, Atabakhsh C, Wyzga D, Schroeder E. COPLINK: managing law enforcement data and knowledge. Communications of the ACM 2003; 46(1): 28-34.
[4] Boo A, Alahakoon B. Mining Multi-modal Crime Patterns at Different Levels of Granularity Using Hierarchical Clustering. CIMCA 2008, IAWTIC 2008, and ISE 2008. p. 1268-1273.
[5] Bache A, Crestani B. Estimating real-valued characteristics of criminals from their recorded crimes; 2008. p. 1385-1386.
[6] Mohd, Bsoul, Ali, Noah, Saad, Omar, Ab Aziz. Optimal Initial Centroid in K-Means for Crime Topic. Journal of Theoretical and Applied Information Technology 2010; 45(1): 19-26.
[7] Dai X, Chen Q, Wang X, Xu J. Online topic detection and tracking of financial news based on hierarchical clustering. Proceedings of the International Conference on Machine Learning and Cybernetics; 2010. p. 3341-3346.
[8] Zhiwei Li, Wang, Mingjing Li, Wei-Ying. A Probabilistic Model for Retrospective News Event Detection. Proceedings of the 28th annual international ACM SIGIR conference on Research and development in information retrieval; 2005. p. 106-113.
[9] Kumaran G, Allan J. Text classification and named entities for new event detection. Proceedings of the 27th annual international ACM SIGIR conference on Research and development in information retrieval; 2004.
[10] Can F. New event detection and topic tracking in Turkish. Journal of the American Society for Information Science and Technology 2010; 61: 802-819.
[11] Chen H, Zeng D, Atabakhsh H, Wyzga W, Schroeder J. COPLINK: managing law enforcement data and knowledge. Communications of the ACM 2003; 46(1): 28-34.
[12] Mustafa H, Al-Radaideh Q. Using N-grams for Arabic text searching. J. Am. Soc. Inf. Sci. Technol. 2004; 55(11): 1002-1007.
[13] Li, Chung, Holt. Text document clustering based on frequent word meaning sequences. 8th International Conference on Enterprise Information Systems (ICEIS 2006); 2006. p. 381-404.
[14] Tar, Nyunt. Enhancing Traditional Text Documents Clustering based on Ontology. International Journal of Computer Applications 2011; 33(10): 38-42.
[15] Gharib, Fouad, Mashat, Bidawi. Self Organizing Map-based Document Clustering Using WordNet Ontologies. International Journal of Computer Science Issues 2012; 9(1-2): 88-95.
[16] Fodeh, Punch, Ning Tan. On ontology-driven document clustering using core semantic features. Journal of Knowl Inf Syst, Springer-Verlag London; 2011.
[17] Al-Shammari. Patent application publication of lemmatizing, stemming, and query expansion method and system. Pub. no.: US 2010/0082333 A1, Pub. date: Apr. 1, 2010. http://www.google.com/patents/US20100082333?printsec=abstract&source=gbs_overview_r&cad=0#v=onepage&q&f=false
[18] Bação, Lobo, Painho. Self-organizing Maps as Substitutes for K-Means Clustering. Computational Science – ICCS, Lecture Notes in Computer Science 2005; 3516: 476-483.
[19] Li, Kuo, Tsai. An intelligent decision-support model using FSOM and rule extraction for crime prevention. Expert Systems with Applications, Elsevier 2010; 37(10): 7108-7119.
[20] Farnstrom, Lewis. Fast, single-pass K-means algorithms. www.citeulike.org/user/zador/article/1772993; 2007.
[21] Bouras C, Tsogkas V. Assigning Web News to Clusters. Proceedings of the Conference on Internet and Web Applications and Services; 2010. p. 1-6.
[22] Taeho J. Clustering News Groups using Inverted Index based NTSO. NDT, First International Conference on Networked Digital Technologies; 2009. p. 1-7.
[23] Dai, He, Sun. A Two-layer Text Clustering Approach for Retrospective News Event Detection. International Conference on Artificial Intelligence and Computational Intelligence, IEEE computer security; 2010. p. 364-368.
[24] Velmurugan T, Santhanan T. A survey of partition based clustering algorithms in data mining: An experimental approach. Information Technology Journal 2011; 10(3): 487-484.
[25] Ramadan, Mohd. A Review of Retrospective News Event Detection. International Conference on Semantic Technology and Information Retrieval, Putrajaya, Malaysia; 2011. p. 209-214.
[26] Jain. Data clustering: 50 years beyond K-means. Journal of Pattern Recognition Letters 2009; 136-713.
[27] Aouf M, Lyanage L, Hansen S. Review of data mining clustering techniques to analyze data with high dimensionality as applied in gene expression data. Proceedings of the International Conference on Service Systems and Service Management; 2008. p. 1-5.
[28] Jain K, Dubes C. Algorithms for clustering data, Ed; 1988.
[29] Steinbach, Karypis M, Kumar G. A comparison of document clustering techniques. In: KDD Workshop on Text Mining; 2000. p. 525-526.
[30] Chau M, Xu J, Chen H. Extracting Meaningful Entities from Police Narrative Reports. Proceedings of the 2002 Annual National Conference on Digital Government Research; 2002. p. 1-5.
[31] Dueck, Frey. Non-metric affinity propagation for unsupervised image categorization. IEEE 11th International Conference on Computer Vision, ICCV; 2007. p. 1-8.