Markov Cluster Shortest Path

Markov Cluster Shortest Path Founded upon the Alibi-breaking Algorithm
Jaeyoung Jung, Maki Miyake, and Hiroyuki Akama
Tokyo Institute of Technology, Department of Human System Science
2-12-1 O-okayama, Meguro-ku, Tokyo, 152-8552 Japan
{catherina, mmiyake, akama}@dp.hum.titech.ac.jp

Abstract. In this paper, we propose a new variant of breadth-first shortest path search called Markov Cluster Shortest Path (MCSP). It is applied to an associative semantic network to show the flow of association between two very different concepts by providing the shortest path between them. MCSP is obtained from the virtual adjacency matrix of the hard clusters, taken as vertices, that result from the MCL process. Since the hard clusters of concepts produced by MCL do not overlap with one another, we propose a method called the Alibi-breaking algorithm, which calculates their adjacency matrix by collecting their past overlapping information, tracing back through the intermediate MCL loops. MCSP is compared with the ordinary shortest paths to assess the difference in quality.

1 Introduction

In network science, the structure and scale of graphs have again become a matter of concern. The same is true of corpus and cognitive linguistics, which allow us to see the world of language as a large-scale graph of words. If one word is associated in some sense with another, the two are said to be connected, and all the words taken in this way as nodes (vertices) are linked together by a set of edges that correspond to lexical associations. In this structure, the shortest path between two words or concepts represents their distance in the semantic network. Steyvers et al. (2003) showed that large-scale word association data possess a small-world structure characterized by the combination of highly clustered neighborhoods and a short average path length. According to them, the average shortest path (SP) length between any two words was 3.03 in the Undirected Associative Network of Nelson et al., 4.26 in their Directed Associative Network, 5.43 in Roget's thesaurus and 10.61 in WordNet. The same held true for the Ishizaki Associative Concepts Dictionary of Japanese Words (abbreviated ACD), which provided the lexical association data for our graph manipulation: the average SP length was 3.442 over 43 word pairs randomly chosen from it. Despite such low values, however, the usual search method, which automatically traces the shortest routes based on word-node connectivity in the semantic network, took a relatively long time (more than 1 minute on average, according to our experiment described below). This kind of word-to-word distance measure not only takes time, but may also restrict the ways of presenting other possible semantic structures of the network. Accordingly, to make the best use of the small-world feature of the semantic network, we propose a new shortest path detection algorithm that uses a customized strategy of graph clustering: the Recurrent Markov Cluster Algorithm (RMCL).

2 Markov Cluster Process

The original MCL algorithm proposed by Van Dongen (2000) can be formulated as the alternation of two steps, expansion and inflation, repeated until a stochastic matrix converges; through this process a whole graph is subdivided into clusters that do not overlap with one another. In this work, we used GridMathematica to write the original MCL program and applied it to ACD. The dictionary, compiled from free associations produced by ten participants, comprises 33,018 words and 240,093 word pairs. To build a significant and well-arranged semantic network, we selected 9,373 critical words from it by removing the rarest words. The MCL process was then applied to a 9,373 × 9,373 adjacency matrix calculated from the 187,113 pairs of critical words. The process reached a nearly idempotent stochastic matrix at the 16th cluster stage, yielding 1,408 hard clusters, each corresponding to a concept sustained by a series of similar words.
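The alternation of expansion and inflation can be illustrated with a minimal sketch. It is not the authors' GridMathematica program: the function names, the added self-loops, the inflation exponent of 2, the convergence tolerance and the read-out threshold are all assumptions made for illustration. The sketch keeps every intermediate matrix, since the later back-tracing step needs the cluster stages before convergence.

```python
def normalize_columns(m):
    """Rescale each column so that it sums to 1 (column-stochastic matrix)."""
    n = len(m)
    sums = [sum(m[i][j] for i in range(n)) for j in range(n)]
    return [[m[i][j] / sums[j] if sums[j] else 0.0 for j in range(n)]
            for i in range(n)]

def expand(m):
    """Expansion: square the matrix (two-step random walks)."""
    n = len(m)
    return [[sum(m[i][k] * m[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def inflate(m, r):
    """Inflation: raise every entry to the power r, then renormalize columns."""
    return normalize_columns([[x ** r for x in row] for row in m])

def mcl_stages(adjacency, inflation=2.0, max_iter=100, tol=1e-9):
    """Return the list of matrices produced by MCL, one per cluster stage.

    Self-loops are added before normalization (an assumption of this sketch).
    """
    n = len(adjacency)
    m = normalize_columns(
        [[adjacency[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
         for i in range(n)])
    stages = [m]
    for _ in range(max_iter):
        nxt = inflate(expand(m), inflation)
        stages.append(nxt)
        change = max(abs(nxt[i][j] - m[i][j])
                     for i in range(n) for j in range(n))
        m = nxt
        if change < tol:  # nearly idempotent: stop
            break
    return stages

def clusters_of(m, threshold=1e-6):
    """Read clusters off a stage: each non-empty row lists one cluster."""
    n = len(m)
    found = set()
    for i in range(n):
        members = frozenset(j for j in range(n) if m[i][j] > threshold)
        if members:
            found.add(members)
    return found

def adjacency_from_edges(n, edges):
    """Build a symmetric 0/1 adjacency matrix from undirected edges."""
    a = [[0.0] * n for _ in range(n)]
    for u, v in edges:
        a[u][v] = a[v][u] = 1.0
    return a

if __name__ == "__main__":
    # Two triangles joined by one bridging edge (toy data, not ACD).
    edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
    stages = mcl_stages(adjacency_from_edges(6, edges))
    print(len(stages), "stages;", clusters_of(stages[-1]))
```

Reading clusters from an intermediate stage with `clusters_of` generally gives overlapping (soft) clusters, whereas the converged stage gives the hard clusters.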

3 Recurrent MCL and Markov Cluster Shortest Path

The final concept clusters generated by MCL share no word node. Since they do not expose their adjacency relations at the final stage of convergence, we need to recover their connections in order to search for the shortest path based on them. To this end, we devised a way to restore their virtual connections by tracing back to the cluster stages that precede convergence. In this procedure, each concept cluster is newly considered as a vertex, or meta-vertex, that includes word nodes (later, each of them is identified by its representative word node, the one with the largest degree value). This back-tracing search collects the evidence that the final clusters had common word nodes somewhere in the previous cluster stages, and then reconnects the final clusters on that basis. We call this procedure the “alibi-breaking algorithm”, in the sense that it takes up evidence of past “implication” to show the connections of the final clusters.

The alibi-breaking algorithm, shown below, is the core part of our Recurrent Markov Cluster Algorithm (RMCL). ClusterStagesList is the set of clustering results still in progress during the MCL loops, except for ClusterStagek, which represents the final converged clusters. First, OverlappingNodes(ClusterStagei) looks for all the multiply-attributed nodes (abbreviated as oln(p)) in each on-going ClusterStagei. Then OverlappingClusters(oln(p)) generates olc(p), the union of all the soft clusters at ClusterStagei that include oln(p). For each oln(p), all the past co-occurring nodes in olc(p) (we call them “conodes”) are enumerated, and by searching for the clusters containing conodes(p) at ClusterStagek we newly establish adjacency relationships between the final clusters.

Formula of Alibi-breaking Algorithm

    ClusterStagesList = {ClusterStage1, ClusterStage2, ..., ClusterStagek};
    OverlappingNodes(ClusterStagei) = {oln(1), oln(2), ..., oln(p), ..., oln(m)};
    OverlappingClusters(oln(p)) = olc(p) = ∪_j (ClusterStagei(j) ⊃ oln(p));
    For each oln(p) {
        conodes(p) = olc(p) ∩ ¬{oln(p)} = {con(1), con(2), ..., con(q), ..., con(n)}
    };
    MakeAdjacency(ClusterStagek(j) ⊃ conodes(p));
    end.

Breadth-first routing is a method of tracing the shortest path in a traversal search: starting from one node, all the nodes adjacent to it are searched, and so on, by constructing spanning trees from connected graphs. We adopt it here because we are interested in representing the coordination of a series of paradigms (sets of similar words) rather than a straightforward chain of single words. Markov Cluster Shortest Path (MCSP) uses the breadth-first traversal strategy, yet unlike the ordinary shortest paths it is applied to the adjacency matrix of the concept cluster nodes, not of the word nodes.
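The two steps above, alibi-breaking followed by a breadth-first search over the cluster graph, can be sketched as follows. This is one reading of the formula, not the authors' code: `alibi_breaking`, `bfs_path`, the representation of each cluster stage as a list of Python sets, and the choice to link every pair of final clusters that hold conodes of the same overlapping node are all assumptions of this sketch.

```python
from collections import deque
from itertools import combinations

def alibi_breaking(cluster_stages):
    """Build a virtual adjacency between the final hard clusters.

    cluster_stages: list of stages, each a list of sets of nodes.
    The last stage holds the converged hard clusters; the earlier
    stages hold soft (possibly overlapping) clusters.
    Returns {final_cluster_index: set of adjacent final_cluster_indices}.
    """
    *earlier, final = cluster_stages
    # Final cluster index of every node (hard clusters do not overlap).
    home = {node: idx for idx, cluster in enumerate(final) for node in cluster}
    adjacency = {idx: set() for idx in range(len(final))}

    for stage in earlier:
        # Collect, for every node, the soft clusters that contain it.
        containing = {}
        for cluster in stage:
            for node in cluster:
                containing.setdefault(node, []).append(cluster)
        for node, clusters in containing.items():
            if len(clusters) < 2:
                continue                      # not an overlapping node
            olc = set().union(*clusters)      # union of its soft clusters
            conodes = olc - {node}            # past co-occurring nodes
            touched = {home[c] for c in conodes if c in home}
            for a, b in combinations(sorted(touched), 2):
                adjacency[a].add(b)
                adjacency[b].add(a)
    return adjacency

def bfs_path(adjacency, start, goal):
    """Breadth-first shortest path between two cluster vertices, or None."""
    parent = {start: None}
    queue = deque([start])
    while queue:
        current = queue.popleft()
        if current == goal:
            path = []
            while current is not None:
                path.append(current)
                current = parent[current]
            return path[::-1]
        for neighbor in sorted(adjacency[current]):
            if neighbor not in parent:
                parent[neighbor] = current
                queue.append(neighbor)
    return None

if __name__ == "__main__":
    # Toy stages (hypothetical words): "c" and "d" overlap before convergence.
    stages = [
        [{"a", "b", "c"}, {"c", "d"}, {"d", "e", "f"}],   # on-going stage
        [{"a", "b"}, {"c", "d"}, {"e", "f"}],             # final hard clusters
    ]
    graph = alibi_breaking(stages)
    print(graph, bfs_path(graph, 0, 2))
```

Because the search runs over cluster vertices rather than word nodes, the graph it traverses is much smaller than the word graph, which is the design choice MCSP relies on.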

4 Results

Using RMCL and the ordinary search, we computed the shortest paths for the 43 word pairs that were randomly chosen from ACD as a sample set at the size of 1.0e-6 of the population. The results fall into three types.

a) Type 1 of Markov Cluster shortest path (MCSP1): the breadth-first shortest path detected in the graph of the 1,408 clusters obtained by the MCL process. Its results are given in the form of cluster-to-cluster flows.
b) Type 2 of Markov Cluster shortest path (MCSP2): computed in the same way as MCSP1, except that the specific paths between words are then searched within the clusters returned by MCSP1.
c) Ordinary breadth-first shortest path (SP): searched in the graph of the 9,373 words.

The core breadth-first search function was identical across all three types. We observed a tendency for SP to capture relatively precise denotative semantic relationships between any two words, whereas MCSP shows the large sphere of meaning that is joined together by free association through the extensive connotative uses of the words. As for the quantitative data, there was a highly significant difference in average calculation time (on Windows XP, 2.01 GHz, with Mathematica 5.0): 5.071 sec, 2.342 sec and 84.487 sec on average for a), b) and c) respectively, F(2,126) = 16.066, p