Document Classification Using Enhanced Grid Based Clustering Algorithm Mohamed Ahmed Rashad, Hesham El-Deeb, and Mohamed Waleed Fakhr
Abstract
Automated document clustering is an important text mining task especially with the rapid growth of the number of online documents present in Arabic language. Text clustering aims to automatically assign the text to a predefined cluster based on linguistic features. This research proposes an enhanced grid based clustering algorithm. The main purpose of this algorithm is to divide the data space into clusters with arbitrary shape. These clusters are considered as dense regions of points in the data space that are separated by regions of low density representing noise. Also it deals with making clustering the data set with multidensities and assigning noise and outliers to the closest category. This will reduce the time complexity. Unclassified documents are preprocessed by removing stops words and extracting word root used to reduce the dimensionality of feature vectors of documents. Each document is then represented as a vector of words and their frequencies. The accuracy is presented according to time consumption and the percentage of successfully clustered instances. The results of the experiments that were carried out on an in-house collected Arabic text have proven its effectiveness of the enhanced clustering algorithm with average accuracy 89 %. Keywords
Clustering K-means Density based clustering Grid based clustering TFIDF
M.A. Rashad (*) · M.W. Fakhr
College of Computing and IT, Arab Academy for Science and Technology (AASTMT), Latakia, Syria
e-mail: [email protected]

H. El-Deeb
College of Computer Science, Modern University for Technology and Information (M.T.I), Cairo, Egypt
e-mail: [email protected]

Introduction

Pattern recognition is generally categorized according to the type of learning procedure used to generate the output value. One of these types of learning is 'unsupervised learning', an example of which is 'clustering', based on the common perception of the task as involving no training data to speak of, and of grouping the input data into clusters based on some inherent proximity measure, rather than assigning each input instance to one of a set of pre-defined classes. One of the important applications of clustering is 'text clustering': the method of partitioning unlabelled data into disjoint subsets of clusters. It was developed to improve the performance of search engines through pre-clustering the entire corpus. Arabic is one of the six official languages of the United Nations and is used by a large portion of the world's population. Most Arabic words are derived from a list of Arabic roots of three, four, five, or six letters. Sentences in Arabic consist of nouns, verbs, pronouns, prepositions, and conjunctions. Nouns and verbs have roots while the other categories do not, so the latter are removed in the preprocessing phase.
Fig. 1 Documents data preprocessing (block diagram: input dataset documents → read documents → removing stop words → word stemming → vector representation → output)
One of the important steps in this research is preprocessing the documents after getting them from the internet or any other resource. Two research questions arise at this point: how to select the appropriate stemming methodology, and how to scale to large datasets while reducing the computational complexity of such applications. Figure 1 shows a block diagram of the document preprocessing phase. The rest of this research is structured as follows. Section "Related Works" presents a brief description of some related work in this area of research. Section "Document Preprocessing" discusses how to extract data from the preprocessed documents and make them ready for training; it also presents TF-IDF, a feature selection and weighting algorithm, and three text clustering algorithms: K-means, DBSCAN improved with K-medoids, and Grid Based Multi-density DBSCAN using representative points. Section "Experimental Results" evaluates the experimental results. Finally, conclusions are presented in section "Conclusions".
Related Works

An efficient density based K-medoids clustering algorithm has been proposed to overcome the drawbacks of the DBSCAN and K-medoids clustering algorithms [1]. Another related research effort deals with the problem of text categorization [2]; it concentrates on the filter approach to achieve dimensionality reduction, which consists of two main stages: feature scoring and thresholding. A new algorithm based on DBSCAN is proposed in [3]; it presents a new method for automatic parameter generation that creates clusters with different densities and generates arbitrarily shaped clusters.
To improve the performance of the DBSCAN algorithm, a new algorithm combined the Fast DBSCAN algorithm [3] and the memory effect in DBSCAN [4] to speed up the processing as well as to improve the quality of the output. As the region query operation takes a long time to process the objects, only a few objects are considered for the expansion, and the remaining missed border objects are handled separately during cluster expansion. Among the aforementioned related works, [1] used K-medoids to improve the performance of DBSCAN, whereas [2] used the popular SVM algorithm, one of whose drawbacks is its long training time. Finally, [3, 5] used the DBSCAN approach, which requires two parameters, Eps and MinPts, to be specified manually. DBSCAN cannot cluster data sets with large differences in densities well, since the MinPts-Eps combination cannot be chosen appropriately for all clusters. Therefore, a new approach is proposed in order to reduce the time complexity, handle clustering of data sets with multiple densities, and avoid specifying the density parameters manually.
Document Preprocessing

Before starting the feature selection procedure, there are five proposed steps for preprocessing the online datasets. Finally, a clustering phase is proposed to implement the recognition task.
Preprocessing Phase

The common model for text representation is the Vector Space Model [2, 6-8]. Its main idea is to represent each document as a numerical vector in a multi-dimensional space. The preprocessing phase includes HTML parsing and tokenization, word rooting, removing stop words, vector space representation, and feature selection and weight calculation. Figure 2 shows the steps for preparing the datasets.
HTML Parsing and Tokenization
An HTML token parser application can be used to extract information from the inner body of HTML files by removing HTML tokens such as tags, newlines, etc.

Arabic Dataset Documents
This research uses Arabic datasets: online, dynamic datasets taken from the internet in the form of HTML text files, which need preprocessing to remove the tags and extract the main body that contains the data, as shown below in Table 1 (a minimal parsing sketch follows the table).
Fig. 2 Text processing with root extraction and removing stop words
Table 1 Arabic dataset sample
Category    Document number    Sample data
Economy     1,500              …
Culture     1,500              …
Religion    1,500              …
Sport       1,500              …
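As a concrete illustration of the HTML parsing step above, the following is a minimal sketch using Python's standard html.parser module. The assumption that the article body is simply all visible text (rather than a specific tag) is a simplification for demonstration, not the paper's actual parser.

```python
# Hedged sketch: strip HTML tags and keep only the visible text of a page.
# A fuller version would also skip <script> and <style> contents.
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collects the text data between tags, discarding the tags themselves."""
    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.chunks.append(text)

def html_to_text(html):
    parser = TextExtractor()
    parser.feed(html)
    return " ".join(parser.chunks)

print(html_to_text("<html><body><p>نص المقال هنا</p></body></html>"))
```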
Word Rooting (Stemming)
Arabic belongs to the Semitic group of languages and consists of 28 letters. Three of these letters (alif ا, waw و, and ya' ي) function as vowels; the others are consonants. The algorithm presented in [9] is preferred for root extraction because of its simple manipulation. Its simplicity comes from employing a letter weight and order scheme: the algorithm extracts word roots by assigning weights and ranks to the letters that constitute a word. Weights are real numbers in the range 0-5. The order rank of letters in a word depends on the length of that word and on whether the word contains an odd or even number of letters.
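To make the weight-and-rank idea concrete, here is a heavily hedged, illustrative sketch in the spirit of [9]. The concrete weight values, the ranking rule, and the test word are assumptions for demonstration only; the published algorithm defines its own weights and odd/even-length ranking rules.

```python
# Illustrative sketch only: letters that commonly appear in Arabic affixes
# receive a positive weight; all other letters receive weight 0.
# The weight values below are assumed, not those of [9].
AFFIX_WEIGHTS = dict(zip("سألتمونيها",
                         [4, 3.5, 2, 3, 3.5, 1, 1, 1, 3.5, 2]))

def extract_root(word, root_len=3):
    n = len(word)
    # Simplified rank: ranks grow toward the middle of the word
    # (the real algorithm distinguishes odd- and even-length words).
    ranks = [min(i + 1, n - i) for i in range(n)]
    scored = [(AFFIX_WEIGHTS.get(ch, 0.0) * r, i, ch)
              for i, (ch, r) in enumerate(zip(word, ranks))]
    # Letters with the smallest weight*rank products are kept as the root,
    # restored to their original order within the word.
    kept = sorted(sorted(scored)[:root_len], key=lambda t: t[1])
    return "".join(ch for _, _, ch in kept)

print(extract_root("والمعلمون"))  # toy input; a real implementation follows [9] exactly
```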
Removing Stop Words
This step removes stop words from the documents in order to generate the frequency profile of a certain list of words. These words may be prepositions, Arabic particles, and special characters.

Vector Space Document Representation
Now a vector space model is formed from those keywords and documents [6]. Every element in the vector space model indicates how many times a word occurs in a document. Consider, for example, a term-document matrix: rows represent the words in each document, columns represent documents, and each cell holds the count of the word in the particular document. As the number of documents increases, the size of the matrix also increases.
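The following is a minimal sketch of building the term-document count matrix described above, after stop-word removal. The tiny corpus and stop-word list are placeholders, not the paper's dataset.

```python
# Build a raw term-document count matrix from stop-word-filtered documents.
from collections import Counter

stop_words = {"في", "من", "على"}                    # illustrative Arabic particles
docs = ["الاقتصاد في مصر", "الرياضة في العالم"]      # toy documents

tokenized = [[w for w in d.split() if w not in stop_words] for d in docs]
vocab = sorted({w for doc in tokenized for w in doc})
counts = [Counter(doc) for doc in tokenized]

# Rows are words, columns are documents, cells are raw counts.
matrix = [[c[w] for c in counts] for w in vocab]
for w, row in zip(vocab, matrix):
    print(w, row)
```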
Feature Selection and Weighting Phase
As mentioned in [7, 8], TF-IDF is used for the term weighting phase. Essentially, TF-IDF works by determining the relative frequency of a word in a specific document compared to the inverse proportion of that word over the entire document corpus. Intuitively, this calculation determines how relevant a given word is to a particular document. Words that are common in a single document or a small group of documents tend to have higher TF-IDF scores than common words such as articles and prepositions:

$$TFIDF(t_k, d_j) = TF_{kj} \cdot \log \frac{|Tr|}{df_k} \qquad (1)$$

where $TF_{kj}$ is the frequency of term $t_k$ in document $d_j$, $|Tr|$ is the total number of documents in the corpus, and $df_k$ is the number of documents in which $t_k$ occurs.
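A small sketch of Eq. (1), computing a TF-IDF weight from the raw term-document counts built earlier; the variable names and example numbers are illustrative.

```python
# Compute the TF-IDF weight of a term per Eq. (1).
import math

def tfidf(tf_kj, df_k, num_docs):
    """Weight of term k in document j: raw frequency times log(|Tr| / df_k)."""
    return tf_kj * math.log(num_docs / df_k)

# Example: a term appearing 3 times in one document and present in
# 2 documents out of a corpus of 1,000 documents.
print(tfidf(tf_kj=3, df_k=2, num_docs=1000))  # ~18.64 (natural log assumed)
```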
Clustering Phase
This phase deals with finding structure in a collection of unlabeled data. There are many types of unsupervised clustering algorithms, such as:
1. K-means clustering.
2. Fuzzy c-means clustering.
3. Hierarchical clustering.
4. Threshold based clustering.
5. Density based clustering.
K-means is a simple partitioning algorithm: it keeps track of the centroids of the subsets and proceeds in simple iterations. The initial partitioning is randomly generated, initializing the centroids to random points in the region of the space [10]. K-means needs to perform a large number of "nearest-neighbour" queries for the points in the dataset. If the data is d-dimensional and there are N points in the dataset, the cost of a single iteration is O(kdN). As one would have to run several iterations, it is generally not feasible to run the naïve K-means algorithm for a large number of points. So a modification of K-means is essential, as concluded in [1], for the following reasons:
1. DBSCAN does not require knowing the number of clusters in the data beforehand, as opposed to K-means.
2. DBSCAN can find arbitrarily shaped clusters. It can even find clusters completely surrounded by a different cluster.
3. DBSCAN has a notion of noise.
4. DBSCAN requires just two parameters and is mostly insensitive to the ordering of the points in the database.
However, the main disadvantage of DBSCAN is that it does not respond well to data sets with varying densities. DBSCAN is also sensitive to its two parameters, Eps and MinPts, which must be determined manually; likewise, the number of clusters, K, must be determined beforehand for K-means. In order to solve these drawbacks of the two algorithms, a grid based algorithm based on multi-density DBSCAN using representative points is proposed.
Experimental Results

Datasets Outlines
The selected datasets are online, dynamic datasets characterized by their availability and credibility on the internet. The Arabic text corpus was collected from online magazines and newspapers: EL-Watan News [11] and EL-Jazeera News [12]. 5,000 documents that vary in length and writing style were collected. These documents fall into four pre-defined categories, every category containing 1,500 documents. The set of pre-defined categories includes sports, economics, science, religion, and politics. Every collected document is assigned to exactly one category according to a human categorizer's judgment.
Proposed Algorithms

In this research, three algorithms are used: the traditional K-means algorithm, DBSCAN using K-medoids, and the grid based density algorithm.
K-Means Algorithm
As presented in [10], the traditional K-means algorithm proceeds as follows:
1. Input k, the number of suggested clusters.
2. Select k centers randomly.
3. Determine the number of trials used to test the clusters.
4. Assign each document to the closest cluster based on the distance between the centers and the document, and compute the squared error between the centers and the selected documents.
5. Store the results in the database.

Density Based Algorithm
Basic DBSCAN algorithm. The steps involved in the DBSCAN algorithm are as follows:
1. Arbitrarily select a point P.
2. Retrieve all points density-reachable from P w.r.t. Eps and MinPts.
3. If P is a core point, a cluster is formed.
4. If P is a border point, no points are density-reachable from P, and DBSCAN visits the next point of the database.
5. Continue the process until all of the points have been processed.

Enhanced DBSCAN with K-Medoids Algorithm
As stated in [1], the algorithm can be described as follows:
1. For each unvisited point p in dataset D, get its neighbors w.r.t. Eps and MinPts.
2. There will now be m clusters; find the cluster centers and the total number of points in each cluster.
3. Join two or more clusters based on density and number of points, find the new cluster center, and repeat until achieving k clusters.
4. Otherwise, split one or more clusters based on density and number of points using the K-medoids clustering algorithm, and repeat until achieving k clusters.

Proposed Grid Based Multi-density Using Representative Points Algorithm
The main purpose of this algorithm is to divide the data space into clusters of arbitrary shape. These clusters are considered dense regions of points in the data space, separated by regions of low density representing noise. The algorithm also handles clustering of data sets with multiple densities and assigns noise and outliers to the closest category. The proposed clustering algorithm consists of seven steps (milestones), as shown in Fig. 3; for comparison, an illustrative sketch of the two baseline algorithms is given after the figure.
Fig. 3 Proposed clustering algorithm milestones: data normalization, nested mean partitioning, parameters selection, local clustering and merging, choosing representative points, labeling and post-processing, and assigning noise to the closest category (input: dataset; output: clusters)
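For comparison with the listed steps, here is a minimal sketch of running the two baseline algorithms on TF-IDF document vectors. It relies on scikit-learn for brevity; the toy corpus and the parameter values (k, Eps, MinPts) are illustrative assumptions, not the paper's implementation.

```python
# Hedged sketch: the two baseline algorithms applied to TF-IDF vectors.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans, DBSCAN

docs = [  # toy preprocessed documents (stemmed, stop words removed)
    "economy market trade growth",
    "economy bank trade inflation",
    "football match team goal",
    "team player match league",
]
X = TfidfVectorizer().fit_transform(docs)

# K-means: the number of clusters k must be chosen beforehand.
km_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# DBSCAN: Eps and MinPts must be supplied manually -- exactly the
# limitation the proposed grid-based algorithm is designed to avoid.
db_labels = DBSCAN(eps=0.9, min_samples=2, metric="cosine").fit_predict(X)

print(km_labels, db_labels)
```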
Table 2 EL-Watan news dataset
Scenario     No. of categories   No. of files   No. of doc. per category
Scenario 1   4                   400            100
Scenario 2   4                   1,000          250
Scenario 3   4                   2,000          500
Scenario 4   4                   3,000          750
Scenario 5   4                   4,000          1,000
Scenario 6   4                   5,000          1,250
Performance Measures
In the proposed research, performance is measured according to the percentage success rate, the number of incorrectly clustered instances, and the time each algorithm consumes to cluster the data space.
Results Evaluation
The evaluation of results is conducted via six scenarios. Each scenario has a number of files and a number of categories; the categories refer to the type of news, as shown in Table 2.
Data Space Grid Distribution
As mentioned before, several factors affect the effectiveness of the clustering, such as the preprocessing phase; the proposed algorithm also includes a factor that increases clustering performance. Below is a comparison between two ways of dividing the data space into a grid of cells.
Sparse Tree
The sparse tree divides a dimension into a number of intervals, each of which has the same length. This algorithm does not consider the distribution of the data. Although it can effectively locate strong clusters, it often results in an extremely uneven assignment of data items to cells and fails to examine detailed patterns within a dense area. Extreme outlier values can also severely affect its effectiveness. As shown in Fig. 4, with the sparse tree algorithm the two smaller but much denser clusters fall in a single cell; therefore these two clusters can never be distinguished in further analysis with those cells.
Fig. 4 Sparse tree data grid distribution
Nested Mean
The nested mean divides a dimension into a number of intervals, adapts well to the data distribution, and is robust to outlier values and noisy points. It recursively calculates the mean value of the data and cuts the data set into two halves at the mean value; each half is then cut into halves at its own mean value. This recursive process stops when the required number of intervals is obtained. The nested mean can examine detailed structures within a dense region. Although it tends to divide a cluster into several cells, the cells that constitute the cluster are always denser than neighboring cells; the distance among cells of the same cluster is very small, and the clustering procedure can easily restore the cluster by connecting them. As in Fig. 5, the two smaller but denser clusters now fall in eight cells, each of which is still denser than cells in a sparse area. Thus these two clusters are distinguishable in further analysis.

The two data grid distribution algorithms were applied to the El-Watan news dataset; Table 3 shows the accuracy of the proposed algorithm under both distributions for three scenarios. Table 4 shows the performance measures when applying the K-means algorithm to each scenario, Table 5 when applying DBSCAN using K-medoids, and Table 6 when applying the proposed grid based multi-density DBSCAN using representative points. A short sketch of the nested-mean partitioning described above is given below.
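The following is a minimal sketch of the recursive nested-mean split, assuming one-dimensional data and a power-of-two interval count; the function name and toy data are illustrative.

```python
# Hedged sketch of nested-mean partitioning: recursively split at the mean
# until the requested number of intervals is reached.
import numpy as np

def nested_mean_cuts(values, intervals):
    """Return cut points dividing `values` into `intervals` parts
    (intervals is assumed to be a power of two)."""
    if intervals <= 1 or len(values) == 0:
        return []
    m = float(np.mean(values))
    left, right = values[values < m], values[values >= m]
    # The mean is a cut point; each half is split further at its own mean.
    return (nested_mean_cuts(left, intervals // 2) + [m]
            + nested_mean_cuts(right, intervals // 2))

data = np.array([0.1, 0.2, 0.25, 0.3, 5.0, 5.1, 5.2, 40.0])  # outlier at 40
print(nested_mean_cuts(data, 4))  # cuts adapt to the data, unlike equal width
```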
Fig. 5 Nested mean data grid distribution

Table 3 Comparative accuracy of sparse tree and nested mean (%)
Scenario     No. of categories   No. of files   Sparse tree   Nested mean
Scenario 2   4                   1,000          85.30         92.30
Scenario 3   4                   2,000          80.80         89.10
Scenario 4   4                   3,000          77.50         86.80

Table 4 K-means algorithm
Scenario     Time (s)   Incorrectly clustered instances   Success rate (%)
Scenario 1   87.03      62                                84.60
Scenario 2   122.20     272                               72.80
Scenario 3   253.54     698                               65.10
Scenario 4   220.94     1,131                             62.30
Scenario 5   428.20     1,680                             58.00
Scenario 6   495.40     2,280                             54.40

Table 5 DBSCAN using K-medoids algorithm
Scenario     Time (s)   Incorrectly clustered instances   Success rate (%)
Scenario 1   72.20      58                                85.60
Scenario 2   110.30     160                               84.00
Scenario 3   229.58     443                               77.85
Scenario 4   218.15     712                               76.27
Scenario 5   313.92     1,108                             72.30
Scenario 6   430.50     1,469                             70.62

Table 6 Proposed grid based multi-density using representative points algorithm
Scenario     Time (s)   Incorrectly clustered instances   Success rate (%)
Scenario 1   68.32      50                                94.40
Scenario 2   97.60      77                                92.30
Scenario 3   219.34     218                               89.10
Scenario 4   209.00     396                               86.80
Scenario 5   285.30     580                               85.50
Scenario 6   378.20     778                               84.45
As shown in Table 7, the degradation of the accuracy results of the three algorithms is due to the increasing complexity of the scenarios as the number of dataset samples grows; another factor in this difference is the dataset preprocessing. In order to measure the performance of the three algorithms a suitable metric is needed, so the F-measure is used to assess the clustering accuracy of each algorithm for the type of case study used in this research [8, 10].
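For reference, the F-measure referred to here is, in its standard form, the harmonic mean of precision $P$ and recall $R$ (stated as assumed background; the paper does not spell out the formula):

$$F = \frac{2 \cdot P \cdot R}{P + R}$$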
Table 7 Percentage of F-measure (%)
Scenario     DBSCAN using K-medoids   K-means   Proposed algorithm
Scenario 1   85.60                    84.60     94.40
Scenario 2   84.00                    72.80     92.30
Scenario 3   77.85                    65.10     89.10
Scenario 4   76.27                    62.30     86.80
Scenario 5   72.30                    58.00     85.50
Scenario 6   70.62                    54.40     84.45
Average      77.77                    66.20     88.76
Table 8 AL-Jazeera news dataset
Scenario     No. of categories   No. of files   K-means (%)   DBSCAN using K-medoids (%)   Proposed algorithm (%)
Scenario 1   4                   1,000          77.80         89.00                        97.30
Scenario 2   4                   2,000          70.10         82.85                        94.10
Scenario 3   4                   3,000          67.30         81.27                        91.80
It is also observed that these three approaches are sensitive to the dataset: according to the visualization of the clustering results of the selected dataset, two categories (Economy and Culture) are closely related to each other, and such overlap affects the effectiveness of the clustering. The proposed clustering approach was also applied to the AL-Jazeera news dataset, and Table 8 shows a comparative result between the three approaches after passing the dataset through the same preprocessing phase. The same categories are used (Culture, Economy, Sports and Religion), with a different number of files in each scenario. All three algorithms show better accuracy when applied to the AL-Jazeera news dataset. This supports the observation that the effectiveness of clustering is sensitive to the dataset, even though this dataset has the same categories as the AL-Watan news dataset. In addition, the proposed algorithm is compared with the support vector machine algorithm proposed by [2], which was applied to a dataset collected from AL-Jazeera news categorized into five categories: science, arts, economy, religion and sports.
Table 9 Comparative accuracy
                     No. of categories   No. of docs   Accuracy (%)
Reference [2]        4                   600           93.00
Reference [10]       4                   1,000         65.00
Proposed algorithm   4                   1,000         95.00
Fig. 6 Difference in accuracy between the two algorithms for each category
Table 9 shows comparative accuracy results between the proposed algorithm and the algorithms proposed by [2, 10], where an Arabic corpus is the common case study of each research. The two algorithms were applied to the dataset case study used by [2] to test the clustering accuracy of each approach and how accurately documents are assigned to their right categories, as shown in Fig. 6. The categories used are science, economy, sports and politics, each containing 150 documents.
Conclusions

This research has shown the effectiveness of unsupervised learning algorithms by using K-means, DBSCAN using K-medoids, and a grid based clustering algorithm. Arabic is a challenging language for inference based algorithms for a number of reasons. First, orthographic variations are common in Arabic; certain combinations of characters can be written in different ways. Second, Arabic has a very complex morphology. Third, Arabic words are often ambiguous due to the trilateral root system: a word is usually derived from a root, which usually contains three letters. Fourth, short vowels are omitted in writing. Fifth, synonyms are widespread. That is why the proposed solution is customized for the Arabic language. Selecting the appropriate dataset is an important factor in such research. The chosen datasets are dynamic, robust, and applicable, which extends the range of applications that could adopt the proposed algorithm. The Grid Based Multi-Density using representative points algorithm has shown better results than the other two algorithms, K-means and DBSCAN improved with K-medoids, for the following reasons. First, it takes into account the connectivity of clusters that are closely related. Second, it operates successfully on data sets with various shapes. Third, it does not depend on a user supplied model. Fourth, it divides the data space into a grid using the nested mean, which allows scaling to large datasets and reduces the computational complexity. Finally, its two density parameters, MinPts and Eps, change dynamically according to a certain factor for each cell rather than being fixed manually, and the algorithm can handle the issue of noise and outliers.
References

1. Raghuvira Pratap A., K. Suvarna Vani, J. Rama Devi, K. Nageswara Rao, "An Efficient Density based Improved K-medoids Clustering Algorithm", (IJACSA) International Journal of Advanced Computer Science and Applications, Vol. 2, No. 6, 2011.
2. Dina Adel Said, "Dimensionality Reduction Techniques for Enhancing Automatic Text Categorization", 2007.
3. Priyanka Trikha and Singh Vijendra, "Fast Density Based Clustering Algorithm", International Journal of Machine Learning and Computing, Vol. 3, No. 1, February 2013.
4. Li Jian, Yu Wei, Yan Bao-Ping, "Memory Effect in DBSCAN Algorithm", 4th International Conference on Computer Science & Education (ICCSE '09), pp. 31-36, 25-28 July 2009.
5. J. Hencil Peter, A. Antonysamy, "An Optimised Density Based Clustering Algorithm", International Journal of Computer Applications (0975-8887), Vol. 6, No. 9, September 2010.
6. Anil Kumar, S. Chandrasekhar, "Text Data Pre-processing and Dimensionality Reduction Techniques for Document Clustering", International Journal of Engineering Research & Technology (IJERT), Vol. 1, Issue 5, July 2012, ISSN: 2278-0181.
7. Osama A. Ghanem, Wesam M. Ashour, "Stemming Effectiveness in Clustering of Arabic Documents", International Journal of Computer Applications (0975-8887), Vol. 49, No. 5, July 2012.
8. Motaz K. Saad, "The Impact of Text Preprocessing and Term Weighting on Arabic Text Classification", September 2010.
9. Al-Shalabi, R., Kanaan, G. and Al-Serhan, H., "New Approach for Extracting Arabic Roots", The International Arab Conference on Information Technology (ACIT 2003), Alexandria, Egypt, December 2003.
10. Mahmud S. Alkoffash, "Comparing between Arabic Text Clustering using K-means and K-mediods", International Journal of Computer Applications (0975-8887), Vol. 51, No. 2, August 2012.
11. El-Watan news, http://www.elwatannews.com/
12. Al-Jazeera news, http://www.aljazeera.net/news