A Visualization Approach to Automatic Text Documents Categorization Based on HAC

Rayner Alfred, Mohd Norhisham Bin Razali, Suraya Alias and Chin Kim On

Center of Excellence in Semantic Agents, School of Engineering and Information Technology, Universiti Malaysia Sabah, Jalan UMS, 88400, Kota Kinabalu, Sabah, Malaysia
Telephone: (+6088) 320000, Fax: (+6088) 320348
{ralfred, hishamrz, suealias, kimonchin}@ums.edu.my

Abstract. The ability to visualize documents as clusters is essential: even the best data summarization technique is misleading when its results are poorly represented or visualized. As proposed in many studies, clustering techniques can be applied to group documents into clusters. In some cases, however, the user may want to know the relationships that exist between clusters. To illustrate these relationships, a hierarchical agglomerative clustering (HAC) technique can be applied to build a dendrogram. The dendrogram displays the relationship between a cluster and its sub-clusters, so the user is able to view the relationships that exist between clusters. In addition, the terms or features that characterize each cluster can be displayed to help the user understand the contents of all text documents stored in the database. In this paper, a text analyzer (VisualText) that automates the categorization of text documents based on a visualization approach using the hierarchical agglomerative clustering technique is proposed. This paper also studies the effect of different inter-cluster proximities on the quality of the clusters produced. The Cophenetic Correlation Coefficient is measured in order to evaluate the quality of the clusters produced using three different inter-cluster distance measurements.

Keywords: Interactive Visualization, Hierarchical Agglomerative Clustering, Text Analyzer, Text Categorization, Data Summarization, Cophenetic Correlation Coefficient

1 Introduction

The wide availability of huge collections of text documents (news corpora, e-mails, web pages, scientific articles, etc.) has fostered the need for efficient text mining tools. Information retrieval, text filtering and classification, and information extraction technologies are rapidly becoming key components of modern information processing systems, helping end-users to select, visualize and shape their informational environment [1].

Nowadays, information extracted from documents is becoming more significant. Most organizations and companies store their documents in relational databases for data analysis purposes, in order to support organizational decision making. Some organizations keep resources such as documents for referral purposes, whereas others keep them as collections. However, as these documents accumulate every day, it becomes critical and tedious to represent them in an understandable yet simple and interactive manner. For instance, a database may hold a large number of documents whose contents are hard to recall, and reading a document's title alone does not always categorize its contents correctly. Searching for documents in a particular field therefore takes a great deal of time, because each document in the database needs to be inspected. Moreover, even the best data summarization technique is misleading when its results are poorly represented or visualized [2]. Hence, efficient tools for organizing and maintaining these documents are becoming more and more valuable. There is therefore a need for an efficient and effective tool capable of visualizing the data summarization obtained from clustering text documents. The proposed tool is potentially very useful for analyzing text documents automatically for summarization purposes, and thus facilitates the decision-making process.

This paper is organized as follows. Section 2 reviews work related to document categorization based on visualization approaches. Section 3 describes the proposed text analyzer, which helps the user visualize the terms or features that characterize each cluster of text documents, particularly English news articles. Using the proposed text analyzer, users are able to analyze and categorize text documents automatically, visualize the overall structure of their informational environment through each cluster and its sub-clusters, identify the words or terms used to categorize each cluster and its sub-clusters, and finally evaluate the quality of the text categorization based on the distance method used. Section 4 discusses the experimental results obtained when investigating the effect of three different inter-cluster proximity methods, namely MIN, MAX and Group Average linkage. The paper is concluded with future work in Section 5.

2 Related Works

There have been quite a number of studies on visualizing documents and articles that have been categorized or clustered, in order to reveal hidden information in the clustered documents [2, 4, 6, 7, 8].

Yen and Wu proposed a Growing Hierarchical Self-Organizing Map (SOM) approach to document categorization and visualization [6]. In their approach, documents are first encoded into a similarity matrix constructed from bibliographic coupling. The bibliographic coupling of two documents is computed by counting the number of common references they cite: the more common references two documents cite, the more similar the areas or concepts they cover. Documents are thus clustered together based on the common areas or concepts derived from the common references cited. The similarity matrix is then used to train a Growing Hierarchical Self-Organizing Map (GHSOM) that clusters document items into a hierarchical order, and the Ranked Centroid Projection (RCP) projects the input vectors into a hierarchy of two-dimensional maps. As a result, users gain a better understanding of the information hidden in a large collection of documents, since the documents have been clustered into their respective categories. This approach produces document maps at various levels of detail. Unfortunately, GHSOM fails to show the detailed relationship between two different levels of the SOM.

Miller et al. proposed a document clustering and visualization method based on Latent Dirichlet Allocation and self-organizing maps (LDA-SOM) [8]. LDA-SOM combines probabilistic topic models and self-organizing maps to cluster and visualize document collections, using LDA for dimensionality reduction and the SOM for clustering and visualization. The approach produces a map of topical clusters: documents within each cluster share similar topics, and neighboring clusters may have one or more topics in common. This layout allows users to browse quickly through a document collection, and it can reveal relationships between documents that might otherwise go unnoticed. However, LDA-SOM does not show which topics are shared by documents in the same cluster or in neighboring clusters.

Lehmann et al. described an interactive visualization called Wivi, which enables users to navigate Wikipedia intuitively by visualizing the structure of visited articles and emphasizing other relevant topics [2]. Wivi helps users read up on subjects relevant to the current point of focus, and thus find relevant information opportunistically, by visualizing the potential paths that could be taken. However, Wivi does not provide the overall structure of all articles stored in the database.

Hsiao introduced a new clustering algorithm to categorize and spatially cluster text documents [7]. He employed TF-IDF and term co-occurrence to measure the similarity between two documents, and then applied a modified minimum spanning tree algorithm to cluster similar documents. The results show that the system is capable of distinguishing different topics and producing unique and informative clusters.

Other work has studied methods to improve the categorization of documents. One such technique is CFC, also known as collaborative-filtering-based personalized document clustering [9]. The CFC technique combines the target user's partial clustering with those of other users when evaluating the categorization preferences of the target user. There are also companies that provide solutions to their clients, such as Vivisimo, whose Velocity product is able to locate, extract and produce relevant information from data regardless of where that information resides.

Much research has also been conducted on data summarization using various clustering techniques, such as hierarchical clustering [10], fuzzy clustering [11] and genetic algorithms [12], and some work combines several clustering techniques, such as an effective hybrid approach based on PSO, ACO and k-means for cluster analysis [13]. Regardless of the technique chosen, the common purpose is to group text documents into meaningful clusters. Hence, the main challenge in this paper is to visualize the findings in an interactive way, so that they can be easily understood by the user. Besides choosing the right algorithms for stemming and stop-word removal in order to reduce the number of terms, the right clustering technique is also needed so that the results can be visualized clearly.

3 VisualText: An Interactive Visualization Approach to Automatic Documents Categorization

As in most text mining and information retrieval tasks, the process begins by preprocessing the text document collection. Terms that carry little discriminative value are removed from the collection's vocabulary, including stop words such as definite articles and pronouns. Additionally, exceptionally rare words, e.g., those appearing in fewer than three documents, can also be removed. Documents are then encoded as word histograms based on word occurrence frequency, and an additional weighting scheme such as inverse document frequency is applied at this stage.

The most common way to organize and label text documents is to group similar documents into clusters and then extract the concepts that characterize each cluster. An assumed number of clusters is often unreliable, since the grouping structure of the data is unknown before processing, so partitioning methods cannot predict the structure of the data very well. Hierarchical clustering has been chosen to address this problem because it provides views of the data at different levels of abstraction, making it ideal for visualizing the concepts generated and for interactively exploring large document collections.

The basic idea of hierarchical clustering is to group data objects into a tree of clusters, producing a set of nested clusters organized as a hierarchical tree [14, 19]. It is divided into two main types: bottom-up, also known as agglomerative, and top-down, also known as divisive. The results can be displayed as a nested cluster diagram or as a dendrogram; at any level, the hierarchy can be seen as joining two clusters from the next lower level or splitting a cluster from the next higher level [15]. Fig. 1(a) and (b) show an example of a nested cluster diagram and a dendrogram, respectively.


Fig. 1. An example of (a) a nested cluster diagram and (b) a dendrogram
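Before any clustering takes place, the collection must be vectorized as described at the start of this section. The sketch below illustrates that preprocessing step with scikit-learn; the library choice and the three example snippets are our own assumptions, not the paper's actual pipeline.

```python
# A minimal sketch of the preprocessing described above: stop-word removal,
# rare-term filtering, and TF-IDF weighting. Library and sample texts are
# illustrative assumptions only.
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "Parliament debates the new trade bill",         # hypothetical news snippets
    "Stock markets rally after the trade agreement",
    "The striker scores twice in the cup final",
]

# min_df=1 keeps this toy example non-empty; on a real corpus, min_df=3 would
# drop terms appearing in fewer than three documents, as the paper suggests.
vectorizer = TfidfVectorizer(stop_words="english", min_df=1)
X = vectorizer.fit_transform(docs)   # documents as TF-IDF-weighted histograms
print(X.shape)
print(vectorizer.get_feature_names_out())
```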

Agglomerative clustering is also known as a merging method: it starts with each point as an individual cluster and, at each step, joins the closest pair of clusters, until only one cluster (or k clusters) remain. Divisive clustering, or splitting, starts with a single cluster and splits it at each step, until each cluster contains one point (or until k clusters remain). As agglomerative hierarchical clustering is the more common of the two, this paper focuses on it. The basic algorithm for hierarchical agglomerative clustering is shown below:

1. Compute the proximity matrix.
2. Let each data point be a cluster.
3. Repeat: merge the two closest clusters and recompute the proximity matrix.
4. Until only a single cluster remains.
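A minimal sketch of this loop, assuming MIN (single-link) proximity and a handful of hypothetical 2-D points, is given below; in practice a library routine such as SciPy's linkage performs the same computation far more efficiently.

```python
# Naive hierarchical agglomerative clustering following the four steps above,
# with single (MIN) linkage. Points and names are illustrative only.
import numpy as np

def hac_min_linkage(points):
    # Step 1: proximity matrix of pairwise Euclidean distances.
    prox = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    # Step 2: each data point starts as its own cluster.
    clusters = [[i] for i in range(len(points))]
    merges = []
    # Step 3: repeatedly merge the two closest clusters...
    while len(clusters) > 1:
        a, b, best = 0, 1, np.inf
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # MIN proximity: closest pair of members across the clusters.
                d = min(prox[p, q] for p in clusters[i] for q in clusters[j])
                if d < best:
                    a, b, best = i, j, d
        merges.append((clusters[a], clusters[b], best))
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    # Step 4: ...until only a single cluster remains.
    return merges   # the merge history defines the dendrogram

pts = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [1.1, 0.9], [5.0, 5.0]])
for left, right, height in hac_min_linkage(pts):
    print(left, "+", right, f"merged at height {height:.2f}")
```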


Fig. 2. Points that are considered in (a) MIN cluster proximity (b) MAX cluster proximity and (c) Group Average cluster proximity

As defined in the algorithm above, two clusters must be merged at each step, and cluster proximity is used to decide which. Different hierarchical agglomerative clustering techniques arise from different definitions of cluster proximity, namely MIN, MAX and Group Average.

MIN, also known as single link, defines cluster proximity as the shortest distance between any two points in two different clusters; this shortest distance corresponds to the maximum similarity between any two points. In graph terms, it is the shortest edge between two nodes in different subsets of nodes. If all points start as singleton clusters, the shortest links are added between points first, one at a time, and these single links merge the points into clusters. The strength of MIN is that it handles non-elliptical shapes well; however, it is sensitive to noise and outliers. Fig. 2(a) shows which points are considered in the MIN cluster proximity.

MAX, also known as complete linkage or CLIQUE, defines cluster proximity as the distance between the two furthest points in different clusters; this maximum distance indicates the least similarity between them. In graph terms, it is the longest edge between two nodes in different subsets of nodes. If all points start as singleton clusters and the shortest links are added between points one at a time, a group of points is not a cluster until all the points in it are completely linked, i.e., until they form a clique. MAX is less susceptible to noise and outliers, but it can break large clusters and is biased towards globular shapes. Fig. 2(b) illustrates which points are chosen in the MAX cluster proximity.
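The three proximity definitions (MIN and MAX above, and the Group Average described next) differ only in how they aggregate the pairwise distances between two clusters, as this small sketch with two hypothetical clusters shows:

```python
# Contrast of the three cluster-proximity definitions on the same pair of
# clusters; the member coordinates are hypothetical.
import numpy as np

ca = np.array([[0.0, 0.0], [0.0, 1.0]])      # members of cluster A
cb = np.array([[3.0, 0.0], [4.0, 2.0]])      # members of cluster B

# All pairwise distances between a point in A and a point in B.
d = np.linalg.norm(ca[:, None, :] - cb[None, :, :], axis=-1)

print("MIN / single link:     ", d.min())    # closest cross-cluster pair
print("MAX / complete link:   ", d.max())    # farthest cross-cluster pair
print("Group Average linkage: ", d.mean())   # mean over all cross-cluster pairs
```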

Last but not least is Group Average, where the proximity of two clusters is the average pairwise proximity among all pairs of points in different clusters. In graph terms, it is the average length of the edges between all pairs of points in different clusters. This is an intermediate approach that compromises between MIN and MAX. Fig. 2(c) shows that all points are considered when computing the average length between two different clusters in Group Average.

There are various methods for measuring the quality of the clusters produced, such as the Davies-Bouldin index (DBI) [17] and the Sum of Squared Error (SSE) [18]. In this paper, however, the quality of the clusters is measured using the Cophenetic Correlation Coefficient, by verifying their dissimilarity and their consistency. The cophenetic correlation is the correlation between the actual dissimilarities, as recorded in the original dissimilarity matrix, and the dissimilarities that can be read off the dendrogram. In essence, it measures how well the dendrogram, which is a model of the similarity behavior, models the actual behavior. The formula for the Cophenetic Correlation Coefficient [16], C3, is given below:

C_3 = \frac{\sum_{i<j}\bigl(x(i,j)-\bar{x}\bigr)\bigl(t(i,j)-\bar{t}\bigr)}{\sqrt{\Bigl[\sum_{i<j}\bigl(x(i,j)-\bar{x}\bigr)^{2}\Bigr]\Bigl[\sum_{i<j}\bigl(t(i,j)-\bar{t}\bigr)^{2}\Bigr]}}    (1)

where x(i, j) is the distance between the ith and jth observations, t(i, j) is the dendrogrammatic distance between Ti and Tj, i.e., the height of the node at which these two points are first joined together, and \bar{x} and \bar{t} are the means of x(i, j) and t(i, j), respectively. The closer the value of the Cophenetic Correlation Coefficient C3 is to 1, the more accurately the clustering solution reflects the original data. The main purpose of this measurement is to determine which type of inter-cluster distance (MIN, MAX or Average linkage) should be used in hierarchical agglomerative clustering in order to cluster text documents more efficiently and effectively. A better interpretation of the cophenetic correlation can be gained by examining each step of the construction of the dendrogram: sudden decreases or increases in the cophenetic correlation indicate that the cluster just formed has made the dendrogram less faithful to the data.

Based on the best results obtained under the Cophenetic Correlation Coefficient measurement, an interactive user interface is designed to help users view the relationships that exist between clusters. In addition, the terms or features that characterize each cluster can be displayed to help the user understand the contents of all text documents stored in the database. The user can also view the documents that reside in each cluster based on two different distance measures, Euclidean Distance and Cosine Similarity. In this paper, we apply a clustergram, formed from the combination of a dendrogram and a heat map, to illustrate the relationships that exist between clusters.
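Assuming SciPy as the computational backend (the paper does not name its tooling), Eq. (1) can be evaluated directly with the cophenet routine; the random matrix below is only a stand-in for a real document-term matrix.

```python
# Hedged sketch: Cophenetic Correlation Coefficient C3 of Eq. (1) via SciPy.
import numpy as np
from scipy.cluster.hierarchy import linkage, cophenet
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)
X = rng.random((50, 70))                 # stand-in for a 50x70 TF-IDF matrix

y = pdist(X, metric="euclidean")         # original dissimilarities x(i, j)
Z = linkage(y, method="average")         # Group Average HAC (the dendrogram)
c3, coph_dists = cophenet(Z, y)          # correlates x(i, j) with t(i, j)
print(f"C3 = {c3:.6f}")                  # closer to 1 = more faithful dendrogram
```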

Fig. 3. Sample clustergram produced with four features in four documents

Fig. 3 shows a sample clustergram produced from four documents containing four features (apple, orange, pear and grape). The X-axis represents the terms or words used in the documents, while the Y-axis represents the document IDs. In Fig. 4, there is a color bar at the left-hand side of the clustergram; its color intensities differentiate the importance of the terms. By activating the data cursor function in the clustergram window, a more detailed explanation of a term can be seen. At the top of the color bar, light red, or a value near 3, signifies that the term is more important in the document. For instance, when the user clicks on any part of the heat map, a datatip, similar to a tooltip, appears, containing a value in the range -3 to 3, the term, and the corresponding document. In short, the importance of a term is rated on a scale from -3 to 3, where -3 indicates that the term has the lowest importance or does not appear in the document, while a value nearer to 3 means that the term is very important in the document.
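A clustergram of this kind can be approximated with seaborn's clustermap (our stand-in for whatever plotting tool the authors used); standardizing each term column yields roughly the -3 to 3 scale described above. The document-term counts below are hypothetical.

```python
# Hedged sketch of a clustergram: dendrogram + heat map with per-column
# z-scores, mimicking the -3..3 importance scale described above.
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

data = pd.DataFrame(
    {"apple": [5, 0, 1, 0], "orange": [3, 0, 2, 1],
     "pear": [0, 4, 0, 3], "grape": [0, 5, 1, 2]},
    index=["D1.txt", "D2.txt", "D3.txt", "D4.txt"],   # hypothetical documents
)
g = sns.clustermap(
    data,
    method="average",      # Group Average linkage for the dendrogram
    metric="euclidean",    # inter-document distance
    z_score=1,             # standardize each term column
    cmap="RdYlGn_r",       # red = important, green = absent/unimportant
)
plt.show()
```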

4 Experimental Evaluation

In this experiment, 50 text documents related to politics, business, sports and entertainment are used for the clustering task. These documents are clustered based on 70 unique features that are selected and extracted automatically according to the degree of relevancy of the features (TF-IDF weights) to the set of all documents [7].

The degree of importance of each term in the text documents can be analyzed by referring to Fig. 4, Fig. 5 and Fig. 6, by examining the intensity of the colors with respect to each document. The X-axis represents the 70 terms or words used, while the Y-axis represents the 50 documents being clustered. At the top of the color bar, bright red signifies that a term is very important, black in the middle signifies that a term is of average importance, and bright green at the bottom indicates that a term is unimportant or does not appear in that particular document. For instance, in the close-up of the clustergram using MIN linkage shown in Fig. 7, the term 'pa' appears to be more important in document P4.txt, with a value of 0.42, whereas the term 'asli' shows a lower degree of importance, with a value of -0.32. Comparing across documents, the term 'pa' is more significant in document P11.txt than in P4.txt, with a value of 2.73.

Table 1 summarizes the Cophenetic Correlation Coefficient values obtained with three different forms of cluster proximity, based on Euclidean Distance and Cosine Similarity. Group Average linkage produces the best cluster structure, with the highest Cophenetic Correlation Coefficient value of 0.988687 when using Euclidean Distance. When using the Cosine Similarity distance, the Cophenetic Correlation Coefficient values are lower and vary more across the three cluster proximities.

Table 1. Cophenetic Correlation Coefficient for three different types of cluster proximity, based on Euclidean Distance and Cosine Similarity.

Cluster Proximity | Similarity Distance | Cophenetic Correlation Coefficient
MIN               | Euclidean Distance  | 0.983366
MIN               | Cosine Similarity   | 0.700378
MAX               | Euclidean Distance  | 0.983332
MAX               | Cosine Similarity   | 0.771906
Group Average     | Euclidean Distance  | 0.988687
Group Average     | Cosine Similarity   | 0.873056
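The evaluation behind Table 1 can be reproduced in outline as follows; since the 50-document corpus is not available here, the random stand-in matrix will not reproduce the paper's numbers.

```python
# Hedged sketch of the Table 1 evaluation: C3 for every combination of
# cluster proximity (MIN, MAX, Group Average) and distance measure.
import numpy as np
from scipy.cluster.hierarchy import linkage, cophenet
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)
X = rng.random((50, 70))                 # stand-in for the 50x70 TF-IDF matrix

for method in ("single", "complete", "average"):     # MIN, MAX, Group Average
    for metric in ("euclidean", "cosine"):
        y = pdist(X, metric=metric)
        c3, _ = cophenet(linkage(y, method=method), y)
        print(f"{method:8s}  {metric:9s}  C3 = {c3:.6f}")
```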

Fig. 4. Clustergram produced using MIN linkage

Fig. 5. Clustergram produced using MAX linkage

Fig. 6. Clustergram produced using Average Group linkage

Fig. 7. A close-up of the clustergram produced using MIN linkage

5 Conclusion

In this paper, we have presented a Hierarchical Agglomerative Clustering (HAC) approach to visualizing the relationships that exist between clusters. In addition, the terms or features that characterize each cluster can be displayed to help the user understand the contents of all text documents stored in the database. By visualizing the relationships between clusters, the proposed approach enables users to analyze and categorize text documents automatically, visualize the overall structure of their informational environment through each cluster and its sub-clusters, identify the words or terms used to categorize each cluster and its sub-clusters, and finally evaluate the quality of the text categorization based on the Cophenetic Correlation Coefficient measurement. The Cophenetic Correlation Coefficient is measured to evaluate the quality of the clusters produced using three different inter-cluster distance measurements, namely MIN, MAX and Group Average linkage. Based on the experimental results obtained, the best clustering results are produced when Group Average linkage is used. In addition, Euclidean Distance generally provides a better clustering solution, as it produces a higher Cophenetic Correlation Coefficient value than Cosine Similarity.

Acknowledgments. This work has been supported by the Research Grant Scheme project funded by the Ministry of Higher Education (MoHE), Malaysia, under Grant No. RAG0008-TK-2012.

References

1. Hamzah, M.P., Sembok, T.M.T.: Enhancing Retrieval Effectiveness of Malay Documents by Exploiting Implicit Semantic Relationships between Words. Transactions on Engineering, Computing and Technology, V10, ISSN 1305-5313 (2005)
2. Lehmann, S., Schwanecke, U., Dorner, R.: Interactive Visualization for Opportunistic Exploration of Large Document Collections. Information Systems, 35:260-269 (2010)
3. Amiri, B., Niknam, T.: An Efficient Hybrid Approach Based on PSO, ACO and k-means for Cluster Analysis. Applied Soft Computing, 10:183-197 (2010)
4. Bang, S.L., Yang, J.D., Yang, H.J.: Hierarchical Document Categorization with k-NN and Concept-Based Thesauri. Information Processing and Management, 42:387-406 (2006)
5. Muflikhah, L., Baharudin, B.: Document Clustering Using Concept Space and Cosine Similarity Measurement. In: Proceedings of the International Conference on Computer Technology and Development, Vol. 1, pp. 58-62 (2009)
6. Yen, G.G., Wu, Z.: A Self-Organizing Map Based Approach for Document Clustering and Visualization. In: International Joint Conference on Neural Networks, July 16-21, Vancouver, BC, Canada (2006)
7. Hsiao, P.L.: Document Clustering and Visualization. Retrieved from http://sequoia.csc.ncsu.edu/~phsiao/document_clustering.pdf (2006)
8. Millar, J.R., Peterson, G.L., Mendenhall, M.J.: Document Clustering and Visualization with Latent Dirichlet Allocation and Self-Organizing Maps. In: Proceedings of the Twenty-Second International FLAIRS Conference, pp. 69-74 (2009)
9. Wei, C.P., Yang, C.S., Hsiao, H.W.: A Collaborative Filtering-Based Approach to Personalized Document Clustering. Decision Support Systems, 45:413-428 (2008)
10. Bang, S.L., Yang, J.D., Yang, H.J.: Hierarchical Document Categorization with k-NN and Concept-Based Thesauri. Information Processing and Management, 42:387-406 (2006)
11. Tsekouras, G., Sarimveis, H., Kavakli, E., Bafas, G.: A Hierarchical Fuzzy-Clustering Approach to Fuzzy Modeling. Fuzzy Sets and Systems, 150:245-266 (2005)
12. Mak, B., Blanning, R., Ho, S.: Genetic Algorithms in Logic Tree Decision Modeling. Computing, Artificial Intelligence and Information Technology, 170:597-612 (2006)
13. Amiri, B., Niknam, T.: An Efficient Hybrid Approach Based on PSO, ACO and k-means for Cluster Analysis. Applied Soft Computing, 10:183-197 (2010)
14. Tan, P.N., Steinbach, M., Kumar, V.: Introduction to Data Mining. Pearson Education, Inc., United States of America, 769 pages (2006)
15. Steinbach, M., Karypis, G., Kumar, V.: A Comparison of Document Clustering Techniques. In: KDD Workshop on Text Mining (2000)
16. Ludwig, J.A., Reynolds, J.F.: Statistical Ecology. J. Wiley & Sons, New York (1988)
17. Davies, D.L., Bouldin, D.W.: A Cluster Separation Measure. IEEE Transactions on Pattern Analysis and Machine Intelligence, 1(2):224-227 (1979)
18. Tabachnick, B.G., Fidell, L.S.: Using Multivariate Statistics, 5th ed. Pearson Education, pp. 217-218 (2007)
19. Alfred, R., Kazakov, D., Bartlett, M., Paskaleva, E.: Hierarchical Agglomerative Clustering for Cross-Language Information Retrieval. International Journal of Translation, 19(1):139-162 (2007)