Extending the Growing Hierarchal SOM for Clustering Documents in Graphs Domain

Mahmoud F. Hussin, Mahmoud R. Farra and Yasser El-Sonbaty
Abstract—The Growing Hierarchal Self-Organizing Map (GHSOM) is the most efficient model among the variants of the SOM and has been used successfully in document clustering and in various pattern recognition applications. The main constraint that limits this model and all the other SOM variants is that they work only with the Vector Space Model (VSM). In this paper, we extend the GHSOM to the graph domain to enhance the quality of clusters. Specifically, we represent the documents by graphs and then cluster them using a new algorithm, G-GHSOM: Graph-based Growing Hierarchal SOM, after modifying its operations to work with graphs instead of the vector space. We have tested the G-GHSOM on two different document collections using three different measures of clustering quality. The experimental results of the proposed G-GHSOM show an improvement in clustering quality compared to the classical GHSOM.
I. INTRODUCTION
We are living in a world full of data, and that data continues to grow every day. Clustering attempts to organize such data into distinct groups, where each group represents a topic that differs from the topics of the other groups. Document clustering is the automatic grouping of text documents into clusters such that documents within one cluster are highly similar to each other but dissimilar to documents in other clusters. Document clustering has been used for grouping the results of a search engine query, clustering the documents in a collection, and automating (or semi-automating) the creation of document taxonomies. Any clustering technique consists of four stages [2]: (1) data representation, (2) the clustering algorithm, (3) cluster validation, and (4) interpretation of the results. One of the most widely used clustering algorithms is the Self-Organizing Map (SOM) [4], a very popular artificial neural network (ANN) algorithm based on unsupervised learning. It has two main advantages over other clustering methods [3]: (1) mapping the input documents is able to extract underlying non-linear relationships among them, and (2) similar documents are mapped close to each other on the map while dissimilar ones are far apart. On the other hand, its earlier variants had the disadvantage that the map size had to be predetermined. To overcome this limitation, the Growing Hierarchical SOM (GHSOM) variant has been proposed [6]. This variant consists of a hierarchical architecture in which each layer is composed of independent SOMs that adjust their
size according to the requirements of the input data. The GHSOM is based on five main operations: (1) initialization of the neurons, (2) finding the winning neuron, (3) updating the winning neuron and its neighbors, (4) horizontal growing, and (5) vertical growing (a minimal sketch of these classical operations is given at the end of this introduction). The representative work on GHSOM for document clustering [3] achieves an improvement in the quality of the clusters. The GHSOM and all the other SOM variants are limited by the fact that they use only the Vector Space Model (VSM) for document representation, which does not capture any relation between words, because sentences are broken down into their individual components without any representation of sentence structure. In contrast, using graphs to represent documents allows us to capture the salient features of the data, using edges to represent relations and vertices to represent words. In addition, it decreases the space complexity compared to a VSM that uses the phrases of the document collection as features: such a VSM requires a huge space, while the graph representation requires only one graph, as we will explain in the following sections. Previous graph-based clustering algorithms have shown a great ability to improve the quality of the resultant clusters, as in [1], [7], [8], [9] and [10]. A combination of the graph model with the GHSOM model may therefore improve the quality of the clusters beyond algorithms that use the GHSOM with the vector space. In this paper, we present an extension of the GHSOM to the graph domain to enhance the quality of clusters in document clustering applications. This extended variant is termed the G-GHSOM: graph-based growing hierarchal SOM. To make this extension possible, we propose five new operations in the graph domain corresponding to the operations of the classical GHSOM [6].

The rest of the paper is organized as follows. Section 2 provides a brief overview of document clustering algorithms that use SOM variants and of the graph model. Section 3 presents the graph-based representation model of the documents, which we extended the GHSOM to work on. Section 4 describes the proposed G-GHSOM and its graph-based operations. Section 5 presents the results of using the G-GHSOM and discusses its complexity. Finally, section 6 provides the main conclusions and suggestions for future work.
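To make the five classical operations listed above concrete, the following sketch outlines a single vector-space SOM layer of the kind the GHSOM stacks hierarchically. It is a minimal illustration written for this paper's exposition, not the authors' implementation; the function names, the Manhattan-distance neighborhood, and the use of the mean quantization error as the growth driver are our own simplifications of the standard GHSOM formulation.

import numpy as np

rng = np.random.default_rng(0)

def train_som_layer(data, rows=2, cols=2, epochs=10, lr=0.1, radius=1):
    # (1) initialization: random weight vectors in the range of the input data
    weights = rng.random((rows, cols, data.shape[1]))
    for _ in range(epochs):
        for x in data:
            # (2) find the winning neuron (best-matching unit)
            dist = np.linalg.norm(weights - x, axis=2)
            wr, wc = np.unravel_index(np.argmin(dist), dist.shape)
            # (3) update the winner and its grid neighbours towards the input
            for i in range(rows):
                for j in range(cols):
                    if abs(i - wr) + abs(j - wc) <= radius:
                        weights[i, j] += lr * (x - weights[i, j])
    return weights

def mean_quantization_error(weights, data):
    # The classical GHSOM uses this error to drive (4) horizontal growth
    # (insert rows/columns while the layer's error stays too high) and
    # (5) vertical growth (expand high-error neurons into child layers).
    return float(np.mean([np.min(np.linalg.norm(weights - x, axis=2)) for x in data]))

docs = rng.random((20, 50))   # 20 documents as 50-dimensional TF-IDF-like vectors
layer = train_som_layer(docs)
print(mean_quantization_error(layer, docs))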
II. LITERATURE REVIEW

A. The graph-based document representation
Most document clustering algorithms in use depend on the VSM, which has many weaknesses, as we briefly mentioned in the introduction. Therefore, several graph-based models have been proposed to overcome these weaknesses and improve the quality of the resultant clusters. Zamir et al. [7] proposed the Suffix Tree Clustering (STC) algorithm, a phrase-based document clustering approach. They achieved n log(n) performance and produced high-quality clusters. The results they reported were encouraging, but the major disadvantage of STC is the high number of redundant words in the nodes of the suffix tree. Chim et al. [8] proposed a new suffix-tree similarity measure for document clustering that uses the suffix tree to detect the overlapping nodes between documents (sub-trees). However, they did not solve the major disadvantage of STC, namely the high redundancy. Hammouda et al. [1] extended the STC by proposing the Document Index Graph (DIG) model to represent the collection of documents. They avoided the redundancy of data in the nodes, which is the main limitation of the STC. In our work, we adapt the DIG to use it in a way suitable for the G-GHSOM; more details about this graph model are given in section III. Bhoopesh et al. [9], [10] used a semantic graph to represent the semantic relations in documents in order to improve the quality of document clustering. In their algorithm, they converted the semantic graphs to vectors and then used the classical SOM as the clustering algorithm; that is, they did not propose a technique to use the semantic graphs directly with SOM.

B. Document clustering algorithms based on SOM variants
Earlier, SOM was successfully applied as a classification tool in various problem domains, including speech recognition [11], image data compression [12], image and character recognition [13], robot control [14], and medical diagnosis [15]. In the document clustering domain there have been many attempts to use SOM with the vector space, as in [16], [17]. As we will see in this section, all the works that have used SOM in document clustering are limited, directly or indirectly, to the VSM. Bakus et al. [16] presented a phrase grammar extraction technique to represent documents, using the extracted phrases as the features of vectors that are then input to the SOM. This technique was a step forward in using phrase representations with SOM, but it still uses vectors, which cannot represent all the phrases of the documents. Freeman et al. [3] proposed a method termed TreeView SOMs for clustering and organizing text documents by extending the GHSOM, but they still represent the documents by the VSM. Russell et al. [18] presented a method for clustering documents in the VSM using a series of one-dimensional SOMs arranged hierarchically to provide an intuitive tree structure representing the document clusters.
Many other works have tried to use the SOM with the VSM, based on words or phrases, to enhance the quality of document clustering, as in [19], [20].

III. ENHANCING THE DIG TO WORK WITH THE G-GHSOM
In this section, we enhance the DIG to work with our G-GHSOM. In the following subsections we discuss the DIG briefly, then discuss the enhancement steps, and finally show how the DIG is exploited in the new phrase-based similarity measure used in the G-GHSOM, which is discussed in detail in subsection C.

A. The Document Index Graph (DIG) model
Hammouda et al. [1] proposed the DIG for representing documents and exploited it in document clustering. They constructed a cumulative graph, a single graph that represents all the documents in the collection. Each vertex (vi) in the cumulative graph represents a unique word in the entire document set, while each edge (ej) is an ordered pair of vertices ej = (vi, vi+1), where vi+1 is adjacent to vi. The vertices store all the required information about each word, while the edges represent the relations between words and thus the phrases: each edge connects a pair of successive vertices. Since the graph is directed, each node maintains a list of outgoing edges per document entry. For example, as in [1], consider the following three sentences as three different documents:
• River rafting. (document 1)
• Mild river rafting. (document 2)
• River fishing. (document 3)
Figure 1 shows the document table of the word "river". It stores all its information (word frequency, its importance, and its positions in the sentences) so that the edges can represent phrases. As shown, the node "river" appears in document 1 at position 0 in sentence 0, s0(0). In document 2, it appears at position 1 in sentence 0, s0(1). Finally, it appears in document 3 at position 0 in sentence 0, s0(0). Moreover, the frequency of "river" in each of the three documents is recorded in the table as 1, with the same importance. This information tells us which sentence continues along which edge.
Fig. 1. The details of the recorded data in the vertices of the DIG
The cumulative graph is constructed incrementally, which reduces the time and space required for construction, since an incoming document only adds the words and edges that are not already in the cumulative graph. Figure 2 illustrates this incremental construction using our simple example.
Fig. 2. Incremental construction of the DIG.

Algorithm 1 Constructing the cumulative graph
Input: di: the i-th document of the collection, i = 1..n
Output: G: the cumulative graph of documents d1 to dn
1. Construct the vertex of the first word of each sentence:
   For each sentence sij in di do:
     tij1 → vij1, where tij1 is the first word of sij and vij1 is a vertex of the cumulative graph
     If vij1 ∉ G then
       Add vij1 to G
     Else
       Modify the document table of vij1
     End if
2. Construct the edges of each sentence sij of each document:
   For each word tijk in sentence j of document i, where k = 2..length of sij, do:
     tijk-1 → vijk-1; tijk → vijk
     ek-1 ← (vijk-1, vijk)
     If vijk ∉ G then
       Add vijk to G
     Else
       Modify the document table of vijk
     End if
     If ek-1 ∉ G then
       Add eijk-1 to G
     Else
       Modify the document table of vijk-1
     End if
   End for
End for
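As a concrete illustration of Algorithm 1, the following sketch builds a cumulative graph for the three example documents above. It is a simplified reading of the DIG, not the authors' code: the vertex document tables keep only frequency and (sentence, position) pairs, the word-importance information of the DIG is omitted, and the names build_cumulative_graph, vertices and edges are our own.

from collections import defaultdict

def build_cumulative_graph(docs):
    # docs: {doc_id: [sentence, ...]}, each sentence a list of words
    # vertex table: word -> {doc_id: {"freq": int, "positions": [(sent_no, word_no), ...]}}
    vertices = defaultdict(lambda: defaultdict(lambda: {"freq": 0, "positions": []}))
    # edge table: (word_i, word_j) -> set of doc_ids sharing that two-word phrase
    edges = defaultdict(set)
    for doc_id, sentences in docs.items():
        for s_no, sentence in enumerate(sentences):
            for w_no, word in enumerate(sentence):
                entry = vertices[word][doc_id]      # add/update the vertex document table
                entry["freq"] += 1
                entry["positions"].append((s_no, w_no))
                if w_no > 0:                        # add the edge for the consecutive word pair
                    edges[(sentence[w_no - 1], word)].add(doc_id)
    return vertices, edges

# The three-sentence example from the paper:
docs = {
    1: [["river", "rafting"]],
    2: [["mild", "river", "rafting"]],
    3: [["river", "fishing"]],
}
vertices, edges = build_cumulative_graph(docs)
print(dict(vertices["river"]))        # document table of "river", as described for figure 1
print(edges[("river", "rafting")])    # {1, 2}: the phrase shared by documents 1 and 2

Printing the document table of "river" reproduces the information described for figure 1, and the edge ("river", "rafting") records that documents 1 and 2 share that two-word phrase.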
In the DIG, the shared phrases are detected while the cumulative graph is being constructed, so that when the construction of the cumulative graph is finished, all the phrases shared between all documents are available. Hammouda et al. [1] proposed a similarity measure based on these matched phrases and their importance, frequency and length. A phrase in the DIG consists of two or more ordered words in the same sentence, and a shared phrase is a phrase that appears in more than one document of the collection, such as "river rafting", which is shared between document 1 and document 2. They report that the DIG improves the performance and produces high-quality clusters. The results they showed were encouraging, but the phrase-based similarity they proposed measures the length of a phrase by recounting the words that are shared between two overlapping matching phrases, which makes the measured length of the phrase longer than its real length. Therefore, we propose a new phrase-based similarity measure to avoid this problem. In the next subsections, we enhance the DIG to work with our G-GHSOM.

B. The enhancement steps of the DIG to work with the G-GHSOM
The G-GHSOM, like the other SOM variants, clusters one document at a time: the SOM clusters the first input document, then the second, and so on. In [1], all the phrases shared between all the documents are detected during the construction of the cumulative graph and exploited directly in the clustering method. Therefore, in the first enhancement step, we need to prepare the representative sub-graph of each document from the cumulative graph so that it can be input to the G-GHSOM; we adapt the DIG so that the sub-graph of each document can be extracted from the cumulative graph. Clustering one document at a time makes detecting all the shared phrases during the construction of the cumulative graph unfavorable, because the clustering of one document using the SOM is achieved by measuring its similarity only with the neurons of the SOM, which are represented by graphs. So, in the second enhancement step, the shared phrases are detected during the clustering process, after extracting the sub-graph.
One of the most important consequences of these enhancements to the DIG is that the number of neurons at any level of the G-GHSOM is very small compared to the number of documents in the collection, which is reflected in the reduced time complexity once this enhancement step is implemented. In addition, it decreases the time complexity of both the cumulative graph construction and the clustering process. The algorithm of graph construction that follows from these two enhancement steps is shown in algorithm 1. The last enhancement step concerns how to measure the similarity between two graphs. Here we propose a new phrase-based similarity measure that depends on the shared phrases between the two documents with a fixed length of two words, which is a further difference from the original DIG. In the following subsection we discuss the new phrase-based similarity measure and why we propose a new one. The advantages of constructing such a graph for document clustering using the G-GHSOM are substantial in terms of space complexity and in the clear improvements achieved by representing a document as phrases rather than single words: representing all the phrases of the documents using the VSM would require a huge amount of space and time, while we can achieve the same with the graph using less time and space.
C. The phrase-based similarity measure
Measuring the similarity between two documents is based on finding the proportion of phrases and words shared between them: as the number of shared phrases increases, the similarity increases. Our similarity measure is therefore based on the length of the shared phrases between the two documents, normalized by the lengths of the two documents. Hammouda et al. [1] proposed a phrase-based similarity measure based on the same idea, but their method of measuring the length of a matching phrase does not give the real length of the phrase, because when two or more matching phrases overlap, as shown in figure 3, the shared words are counted more than once.
Fig. 3. A word shared between two overlapping phrases makes the similarity between two documents inaccurate.

Therefore, to avoid this problem, our phrase-based similarity measure depends on phrases of length 2 only, each corresponding to a single edge. In addition, instead of measuring the length of each phrase, we measure only the frequency of the words that appear in the matching phrases, and we ignore a word if it appears in two successive matching phrases at the same position, such as the character "C" in figure 3. The new phrase-based similarity measure is normalized by the lengths of the two compared documents, so the similarity value lies in [0, 1]: a value of 0 means that the two documents do not share any phrases, while a value of 1 means that the two documents are identical. Our phrase-based similarity measure is based on a list of the shared phrases between the compared input sub-graph (document) and the neuron (graph), generated by comparing each edge in graph i to all edges in graph j, as shown in figure 4.

Fig. 4. Finding the matching phrases and their information between any two documents.

Our phrase-based similarity measure is a function of four factors; three of them are extracted from the list of the shared phrases between the two documents, and the fourth is available once the cumulative graph is constructed. Each of the following factors may have a different value in each of the two compared documents. The factors are: (1) the number of uniquely shared words, i.e. the words appearing in the shared phrases without being counted twice across two overlapping phrases, as figure 4 shows; (2) the frequency of these words, counted in the list of shared phrases rather than in the whole document, which is an important factor; (3) the importance of these words in the two documents; and (4) the lengths of the two documents together with their importance. The phrase-based similarity is given by equation 1:

sim_p(d_1, d_2) = \frac{\sum_{i=0}^{n} \left[ (F_{1i} \times w_{1i}) + (F_{2i} \times w_{2i}) \right]}{\sum_{j} \left( |S_{1j}| \times w_{1j} \right) + \sum_{k} \left( |S_{2k}| \times w_{2k} \right)}    (1)

Where:
F_{1i}, F_{2i}: the frequency of word i in documents 1 and 2, respectively, counted at unique positions.
w_{1i}, w_{2i}: the weight of word i in documents 1 and 2, respectively.
|S_{1j}|, |S_{2k}|: the length of the original sentence j in document 1 and of sentence k in document 2, respectively.
w_{1j}, w_{2k}: the weight of sentence j in document 1 and of sentence k in document 2, respectively.

Relying solely on the phrase-based similarity is in many cases not sufficient: two documents are often similar without sharing enough matching phrases. In this case, we need to integrate the single-word similarity with the phrase-based similarity using a weighting factor. We used the cosine correlation similarity measure, with TF-IDF word weights, as the single-word similarity measure. The cosine measure was chosen because of its wide use in the document clustering literature and because it has been described as able to capture human categorization behavior well. The cosine measure calculates the cosine of the angle between the two document vectors. Accordingly, our word-based similarity measure (sim_w) is given by equation 2:

sim_w(d_1, d_2) = \frac{d_1 \cdot d_2}{\|d_1\| \, \|d_2\|}    (2)

Where d_1 and d_2 are the vectors of the two compared documents, represented as word weights calculated using the TF-IDF weighting scheme. The total similarity is the integration of the word-based and the phrase-based similarity measures, a weighted average of the results of equations 1 and 2, given by equation 3:

sim(d_1, d_2) = \alpha \cdot sim_p(d_1, d_2) + (1 - \alpha) \cdot sim_w(d_1, d_2)    (3)

Where \alpha is a value in the interval [0, 1] that determines the weight of the phrase similarity measure; according to the experimental results, a value of \alpha between 0.5 and 0.75 gives the maximum improvement in clustering quality. Although this integration decreases the value of the similarity between two documents, it improves the quality of the document clustering, as we will note in the experimental discussion, because the similarity measure is now more accurate. Measuring the similarity is described in algorithm 2.
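A minimal sketch of how equations 1-3 can be combined is given below. It is written for illustration only and makes simplifying assumptions that are not in the paper: all word and sentence weights are set to 1, raw term frequencies stand in for TF-IDF in the cosine part, and the function names are our own.

import math
from collections import Counter

def phrase_similarity(doc1, doc2):
    # Simplified reading of equation 1: two-word shared phrases (matching edges),
    # unit word/sentence weights, normalized by the total document length.
    def edges(doc):  # doc: list of sentences, each a list of words
        return {(s[i], s[i + 1]) for s in doc for i in range(len(s) - 1)}
    shared = edges(doc1) & edges(doc2)                # the matching list ML(d1, d2)
    shared_words = {w for e in shared for w in e}     # uniquely shared words
    def hits(doc):
        return sum(1 for s in doc for w in s if w in shared_words)
    total_len = sum(len(s) for s in doc1) + sum(len(s) for s in doc2)
    return (hits(doc1) + hits(doc2)) / total_len if total_len else 0.0

def cosine_similarity(doc1, doc2):
    # Equation 2, with raw term frequencies standing in for TF-IDF weights.
    v1 = Counter(w for s in doc1 for w in s)
    v2 = Counter(w for s in doc2 for w in s)
    dot = sum(v1[w] * v2[w] for w in v1.keys() & v2.keys())
    norm = math.sqrt(sum(c * c for c in v1.values())) * math.sqrt(sum(c * c for c in v2.values()))
    return dot / norm if norm else 0.0

def total_similarity(doc1, doc2, alpha=0.7):
    # Equation 3: weighted blend of phrase-based and word-based similarity.
    return alpha * phrase_similarity(doc1, doc2) + (1 - alpha) * cosine_similarity(doc1, doc2)

d1 = [["river", "rafting"]]
d2 = [["mild", "river", "rafting"]]
print(total_similarity(d1, d2))

For the example documents "river rafting" and "mild river rafting", the phrase part is 0.8 and the blended score with alpha = 0.7 is about 0.8.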
IV. THE ENHANCEMENT OF THE G-GHSOM TO WORK WITH GRAPHS
The main contribution of our work is to extend the GHSOM from working only with the VSM to working with a graph model, in order to enhance the performance of document clustering. As we introduced, the GHSOM is based on five operations in the vector-space domain, so to make our contribution possible we need to propose five corresponding operations in the graph domain. First, a new procedure for the initialization of the neurons is proposed in subsection A. Second, an equivalent update procedure based on the enhanced DIG and our phrase-based similarity measure is proposed in subsection B. The third and fourth operations involve a new criterion for the horizontal and vertical growth, proposed in subsection C. Finally, the fifth operation deals with measuring the similarity of two graphs; this operation was proposed in section III and is achieved through algorithm 2 and equation 3.

Algorithm 2 Detecting the matching list to calculate the phrase-based similarity
Input: graphs of documents d1, d2
Output: ML(d1, d2): the matching list between d1 and d2
1. Detect the matching edges as follows:
   For each edge ek in d1, where k = 1..number of edges in d1, do:
     If ek is an edge in d2 then
       Add a new matching edge to ML(d1, d2)
     End if
   End for
2. Detect the unique nodes from ML(d1, d2), as shown in fig. 3
3. Calculate the total similarity using equation 3

A. Neuron initialization
In the classical variations of SOM, neuron initialization is based on random weights in the range of the input data. In the graph domain, this method is expected to cause problems, as there is no straightforward way to guarantee that randomly generated graphs belong to the same part of the pattern space as the input patterns. Gunter et al. [21] initialized the neurons by random selection of pairs of inputs and generated a weighted mean for each pair in an application of clustering characters using SOM. This method may cause two problems in the document domain. First, the weighted graph may belong to another part of the pattern space. Second, it will generate clusters that contain documents belonging to different topics without any real relation between them, which makes the clustering results unreliable. Here, we achieve a random initialization with the guarantee of belonging to the same part of the pattern space by initializing each new neuron with the first arriving input graph. We can consider this a random selection from the same part of the pattern space: it is random because the input graph is extracted randomly from the cumulative graph, and the weights generated in this way are certain to belong to the pattern space.

B. Graph-based neuron updating
The updating operation in the graph domain is based on the idea that two graphs become more similar when the number of matching phrases, or of shared words, increases. Our new procedure for updating the winning neuron and its neighbors works as follows. The value of the similarity between the input graph and the neurons is used to detect the winning neuron and its neighbors; then the number of differing nodes between them, β, is multiplied by a learning rate μ:

δ = β × μ    (4)

Then δ randomly chosen nodes, together with their successive nodes, are added to the neuron weights as new phrases (edges), as shown in algorithm 3.

C. The graph-based criterion of G-GHSOM growing
In the classical variations of SOM, a neighborhood relation is defined: typically, the neurons are linearly ordered or arranged in a two-dimensional array. This ordering is exploited in detecting the neighbors and updating them according to their similarity with the input pattern. The neurons, i.e. the graphs, in the G-GHSOM should provide an approximation of the original input pattern distribution, but here the input and output spaces are dimensionless. Therefore, we replaced the topological neighborhood relation among the neurons of the G-GHSOM with the similarity values between the input and the neurons. This allows us to achieve the same goal as the topological neighborhood relation in the context of clustering, since in a clustering task the position of the neurons is not important.
The growth of the G-GHSOM consists mainly of a horizontal growth step at every level, followed by a vertical growth step after the horizontal levels have converged. The relation between the first level (parent neurons) and the second level (child neurons) is shown in figure 5. As shown, not every parent neuron has children; it depends on the quality of the parent neuron. Each child neuron is trained using only its parent's documents. In the horizontal growth step, once the first neuron has been initialized, the next input graph is compared with the weights of the existing neurons: if the similarity is smaller than a threshold, a new neuron is constructed; otherwise the input is assigned to the closest neuron. This operation continues until the network is tuned, and then the vertical growing part starts. The criteria of the horizontal and vertical growth steps are described below, and their procedure is shown in algorithm 4.
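The horizontal step just described can be summarized in a few lines. The sketch below is an illustrative reading of it, with our own names (horizontal_growth, threshold1) and with the similarity and update operations passed in as functions; it is not the authors' implementation, and in the real algorithm each neuron holds a weight graph rather than the raw input graph.

def horizontal_growth(documents, similarity, update, threshold1):
    # `similarity` is the total similarity of equation 3, `update` the neighbour
    # update of Algorithm 3; threshold1 is data dependent.
    neurons = []                                       # each neuron holds a weight graph
    for doc_graph in documents:
        if not neurons:
            neurons.append(doc_graph)                  # first neuron initialized by the first input
            continue
        sims = [similarity(doc_graph, n) for n in neurons]
        best = max(range(len(neurons)), key=lambda i: sims[i])
        if sims[best] < threshold1:
            neurons.append(doc_graph)                  # too dissimilar: grow the map with a new neuron
        else:
            neurons[best] = update(neurons[best], doc_graph)   # otherwise assign and update the winner
    return neurons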
Algorithm 3 Updating the winning neuron and its neighbors, which are represented by graphs
Input: the graph Gin representing the input document
Output: the updated Gwin and its neighbors Gbors, all represented by graphs
1. Calculate the similarity between the input graph Gin and all the neurons (graphs):
   For each neuron graph Gneuron do
     Calculate the similarity of Gin and Gneuron using Equ. 3
   End for
2. Select the winning neuron Gwin and the neighbors Gbors
3. Find the differences Dif(Gin, Gwin) and Dif(Gin, Gbors), which are two sets of nodes
4. Calculate the number of nodes to add to Gwin, μw = |Dif(Gin, Gwin)| × Ɛ, and to each of the k neighbors Gbors, μn,k = |Dif(Gin, Gbors)| × Ɛ, where k = 1..r, r is an integer, and Ɛ is a factor that decreases as the neighbor gets farther from the winner
5. Randomly choose μw and μn,k nodes from the sets Dif(Gin, Gwin) and Dif(Gin, Gbors), respectively
6. Add the chosen μw and μn,k nodes to Gwin and Gbors, respectively, such that new phrases (edges) are generated
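Abstracting away the graph bookkeeping, the core of the update step is the δ = β × μ rule of equation 4. The following sketch, with hypothetical names, operates on node (word) sets only; in the paper each added node is carried over together with its successor so that a new edge, i.e. a new phrase, is formed in the neuron graph.

import random

def update_neuron(neuron_nodes, input_nodes, epsilon):
    # neuron_nodes and input_nodes are the node (word) sets of the neuron graph
    # and the input graph; epsilon shrinks with the neighbour's distance from the winner.
    diff = list(input_nodes - neuron_nodes)      # Dif(G_in, G_neuron): nodes not yet in the neuron
    n_add = int(len(diff) * epsilon)             # delta = beta * mu  (equation 4)
    chosen = random.sample(diff, min(n_add, len(diff)))
    return neuron_nodes | set(chosen)            # adding them creates new phrases (edges) in the neuron

winner = {"river", "rafting"}
doc = {"mild", "river", "rafting", "trip"}
print(update_neuron(winner, doc, epsilon=0.5))   # winner updated with about half of the differing nodes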
Fig. 5. The architecture of the trained G-GHSOM with four layers
After the horizontal growth step is finished and the map is fine-tuned, the vertical growth starts by checking, for each neuron in the first level (parent neuron), whether it needs to be decomposed to generate new child neurons. This check considers, first, the overall similarity of the neuron and, second, the number of documents (L) assigned to it. If the overall similarity of the tested parent neuron is less than a certain threshold, then the number of assigned documents L is tested; if L exceeds its threshold, the parent neuron is decomposed, and the documents clustered to that parent neuron are used to train the child map and to generate a suitable number of children. The decomposition continues until the network is tuned. The combination of these two conditions avoids decomposing a parent that contains a small number of documents with low overall similarity. All the thresholds here depend on the data set, especially the thresholds on the number of documents and on the overall similarity.
The growing strategy at the child level is the same as at the parent level, with stricter thresholds.

Algorithm 4 The G-GHSOM: Graph-based Growing Hierarchal SOM
Input: the collection of documents represented by graphs
Output: document clusters
1. Horizontal growth of map level K:
   For each new input graph Gin do
     For each neuron Gneuron in map level K do
       Calculate Sim(Gin, Gneuron) using Equ. 3
     End for
     Select the neuron Gmax that has the maximum similarity with Gin
     If Sim(Gin, Gmax) < threshold1 then
       Construct a new cluster and initialize it using Gin
     Else
       Gwin is Gmax; update the weights of Gwin and its neighbors
     End if
   End for (until the collection of documents is exhausted)
2. Vertical hierarchy growth from map level K:
   For each neuron Gneuron in map level K do
     Calculate the overall similarity using

       Over(Gneuron) = (1 / L^2) \sum_{Gx, Gy} Sim(Gx, Gy)    (5)

     where L is the number of documents assigned to Gneuron, and Gx, Gy are the graphs representing the documents assigned to Gneuron
     If Over(Gneuron) < threshold2 then
       If L > threshold3 then
         Decompose the neuron Gneuron using the horizontal growth part
       End if
     End if
   End for
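The vertical growth decision of step 2 reduces to the average pairwise similarity of equation 5 plus the two threshold tests. The sketch below is illustrative only: the function names and threshold values are placeholders, equation 5's handling of self-pairs is not specified in the paper (here each unordered pair is counted twice and self-pairs are ignored), and `sim` stands for the total similarity of equation 3.

from itertools import combinations

def overall_similarity(assigned_docs, sim):
    # Equation 5: pairwise similarity of the L documents assigned to a neuron,
    # normalized by L^2 as in the paper.
    L = len(assigned_docs)
    if L < 2:
        return 1.0
    total = sum(sim(gx, gy) for gx, gy in combinations(assigned_docs, 2))
    return 2 * total / (L * L)      # each unordered pair counted twice, self-pairs ignored

def needs_decomposition(assigned_docs, sim, threshold2=0.3, threshold3=10):
    # Vertical growth check from Algorithm 4: decompose a parent neuron only when
    # its documents are loosely related AND it holds enough of them.
    return overall_similarity(assigned_docs, sim) < threshold2 and len(assigned_docs) > threshold3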
V. EXPERIMENTAL RESULTS
We used two data sets that differ in the number of documents, the subjects covered, and the average number of words per document. The first data set was collected from various University of Waterloo and Canadian web sites (UW-CAN); it was used in [1] and is available at [25]. The second data set is a collection of Reuters news articles (RNA) posted on Yahoo! News. This is a standard text clustering corpus; 2340 documents were selected from it as the test set to be clustered, as was done by Boley et al. [23] and in [1]. All the documents are in HTML format, which is more informative and more helpful in achieving better cluster quality than plain text. The overall architecture of our proposed system is shown in figure 6. To demonstrate the effectiveness of the proposed graph-based G-GHSOM model, we compare it to the GHSOM based on single words (T-GHSOM) and evaluate both using the most widely used quality measures: an internal measure (overall similarity) and two external measures (entropy and F-measure) [22].
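For reference, a common formulation of the two external measures (following the usual definitions, as cited in [22]) is sketched below; the exact variant used in the paper may differ, and the helper names and the toy example are ours.

import math
from collections import Counter

def entropy(clusters):
    # Weighted average entropy over clusters; each cluster is a list of true class labels.
    # Lower is better.
    n = sum(len(c) for c in clusters)
    total = 0.0
    for c in clusters:
        counts = Counter(c)
        e = -sum((k / len(c)) * math.log(k / len(c), 2) for k in counts.values())
        total += (len(c) / n) * e
    return total

def f_measure(clusters, classes):
    # Overall F-measure: for each true class, take the best F1 over all clusters,
    # weighted by class size. Higher is better.
    n = sum(len(c) for c in clusters)
    score = 0.0
    for label, members in classes.items():
        best = 0.0
        for c in clusters:
            tp = sum(1 for x in c if x == label)
            if tp == 0:
                continue
            p, r = tp / len(c), tp / len(members)
            best = max(best, 2 * p * r / (p + r))
        score += (len(members) / n) * best
    return score

clusters = [["sport", "sport", "news"], ["news", "news"]]
classes = {"sport": ["d1", "d2"], "news": ["d3", "d4", "d5"]}
print(entropy(clusters), f_measure(clusters, classes))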
Table 1: Description of the data sets
No | Data Set Name | # of Documents | # of Classes | Avg. Words
1  | UW-CAN        | 314            | 10           | 469
2  | RNA           | 2340           | 20           | 289
To achieve high-quality clustering we would like to maximize the F-measure and minimize the entropy of the clusters. The results show a clear improvement using the G-GHSOM, which is due to the use of the graph model instead of single words. Table 2 shows the results of the quality measures and the percentage improvement of the G-GHSOM over the T-GHSOM.
Fig. 6. The overall architecture of the proposed system
Using the first data set, the G-GHSOM improved the average F-measure by 18%, while on the second data set the average improvement was 12.6%. The entropy was reduced by 22.5% and 19.5% on the first and second data sets, respectively. Finally, the average improvement in overall similarity was 4% on the first data set and 16% on the second. These improvements mean that with the G-GHSOM the resultant clusters are tighter and the data is better characterized by the cluster centroids, which makes the use of the G-GHSOM in retrieval systems more efficient.

Table 2: The improvements of G-GHSOM (E: entropy, F: F-measure)
Data Sets  | T-GHSOM E | T-GHSOM F | G-GHSOM E | G-GHSOM F | Improvement E | Improvement F
1. UW-CAN  | 0.62      | 0.39      | 0.48      | 0.46      | -22.5%        | +18%
2. RNA     | 0.41      | 0.63      | 0.33      | 0.71      | -19.5%        | +12.6%

The results are shown in detail in figures 7, 8, 9 and 10. We performed the experiments using different dimensions of the G-GHSOM and T-GHSOM on the two data sets, as the figures show. In all the presented results the weighting factor α was fixed at 0.7. This value achieved the best improvements: the range of best values was between 0.5 and 0.75, but 0.7 gave the best average improvement. These findings illustrate that integrating the graph model into the document representation plays an important role in accurately judging the relation between documents, as evidenced by the clear improvement achieved with this method. By definition, the time complexity of the G-GHSOM clustering algorithm depends on the number of neurons in the maps and the number of levels, since for each new document we only compute its similarity to all existing neurons in the first level, and in the vertical growth we need to repeat this process with only one network per document, even if all the neurons are decomposed.

Fig. 7. Entropy of G-GHSOM and T-GHSOM on the UW-CAN data set
Fig. 8. Entropy of G-GHSOM and T-GHSOM on the RNA data set
Fig. 9. F-measure of G-GHSOM and T-GHSOM on the RNA data set
Fig. 10. F-measure of G-GHSOM and T-GHSOM on the UW-CAN data set

VI. DISCUSSIONS AND CONCLUSIONS
Instead of using the SOM and its variants only with the vector space model, in this paper we extended the GHSOM to a new graph-based GHSOM (G-GHSOM) to enhance the quality of document clustering. We enhanced the DIG model [1] to work with the GHSOM algorithm by presenting five operations in the graph domain corresponding to those in the VSM domain. The first is a new criterion to initialize the neurons, based on randomly choosing one of the input documents. The second is a new phrase-based similarity measure that works with graphs, based on the matching edges between two graphs. The third is a new procedure to update the neurons, based on increasing the matching edges between the input graph and the weight graph of the neurons. Finally, the fourth and fifth operations are a new criterion for horizontal and vertical growing, based on the overall similarity of the neurons and the number of documents assigned to them. The performance was evaluated by testing the G-GHSOM and the T-GHSOM on two different data sets, using the two most widely used quality measures, entropy and F-measure. The experimental results demonstrate that the proposed G-GHSOM works successfully in the graph domain and achieves better clustering quality than the T-GHSOM. The G-GHSOM can also be used in new text retrieval and summarization algorithms, by representing the documents as graphs and clustering them with the G-GHSOM, which should improve the results of the classical versions of these applications.

REFERENCES
[1] K. M. Hammouda and M. S. Kamel, "Efficient phrase-based document indexing for web document clustering," IEEE Transactions on Knowledge and Data Engineering, vol. 16, no. 10, pp. 1279-1296, 2004.
[2] R. Xu, "Survey of clustering algorithms," IEEE Transactions on Neural Networks, vol. 16, no. 3, pp. 645-678, 2005.
[3] R. Freeman and H. Yin, "Tree view self-organisation of web content," Neurocomputing, vol. 63, pp. 415-446, 2005.
[4] T. Kohonen, Self-Organizing Maps, 3rd edition, Springer, 2001.
[5] M. Y. Kiang, "Extending the Kohonen self-organizing map networks for clustering analysis," Computational Statistics & Data Analysis, vol. 38, no. 2, pp. 161-180, 2002.
[7] O. Zamir, O. Etzioni, O. Madani, and R. M. Karp, "Fast and intuitive clustering of web documents," KDD '97, pp. 287-290, 1997.
[8] H. Chim and X. Deng, "A new suffix tree similarity measure for document clustering," WWW 2007, ACM, Banff, Alberta, Canada, 2007.
[9] B. Choudhary and P. Bhattacharyya, "Text clustering using semantics," World Wide Web Conference (WWW2002), Hawaii, USA, May 2002.
[10] B. Choudhary and P. Bhattacharyya, "Text clustering using Universal Networking Language," Universal Networking Language Conference, Shanghai, China, November 2001.
[11] L. Leinonen, T. Hiltunen, K. Torkkola, and J. Kangas, "Self-organized acoustic feature map in detection of fricative-vowel coarticulation," J. Acoust. Soc. Am., vol. 93, no. 6, pp. 3468-3474, 1993.
[12] C. N. Manikopoulos, "Finite state vector quantisation with neural network classification of states," IEEE Proc.-F, vol. 140, no. 3, pp. 153-161, 1993.
[13] A. D. Bimbo, L. Landi, and S. Santini, "Three-dimensional planar-faced object classification with Kohonen maps," Opt. Eng., vol. 32, no. 6, pp. 1222-1234, 1993.
[14] J. A. Walter and K. J. Schulten, "Implementation of self-organizing neural networks for visuo-motor control of an industrial robot," IEEE Trans. Neural Networks, vol. 4, no. 1, pp. 86-95, 1993.
[15] L. Vercauteren, G. Sieben, M. Praet, G. Otte, R. Vingerhoeds, L. Boullart, L. Calliauw, and H. Roels, "The classification of brain tumours by a topological map," in Proceedings of the International Neural Networks Conference, Paris, pp. 387-391, 1990.
[16] J. Bakus, M. Hussin, and M. Kamel, "A SOM-based document clustering using phrases," in 9th International Conference on Neural Information Processing (ICONIP 2002), vol. 5, pp. 2212-2216, Singapore, November 2002.
[17] M. Hussin and M. Kamel, "Enhanced neural network document clustering using SOM based word clusters," ICONIP 2005, Taiwan, October 2005.
[18] B. Russell, H. Yin, and N. M. Allinson, "Document clustering using the 1+1 dimensional self-organising map," in Proceedings of the Third International Conference on Intelligent Data Engineering and Automated Learning, vol. 2412, Springer-Verlag, London, pp. 154-160, 2002.
[19] M. Hussin and M. Kamel, "Document clustering using hierarchical SOMART neural network," in Proceedings of the 2003 Int'l Joint Conference on Neural Networks, Portland, Oregon, USA, pp. 2238-2242, July 2003.
[20] M. Hussin, M. Kamel, and M. Nagi, "An efficient two-level SOMART document clustering through dimensionality reduction," ICONIP 2004, vol. 3316, pp. 158-165, 2004.
[21] S. Gunter and H. Bunke, "Self-organizing map for clustering in the graph domain," Pattern Recognition Letters, vol. 23, pp. 405-417, 2002.
[22] M. Steinbach, G. Karypis, and V. Kumar, "A comparison of document clustering techniques," in KDD Workshop on Text Mining, 2000.
[23] D. Boley, M. Gini, R. Gross, S. Han, K. Hastings, G. Karypis, V. Kumar, B. Mobasher, and J. Moore, "Partitioning-based clustering for web document categorization," Decision Support Systems, vol. 27, pp. 329-341, 1999.