Special-purpose Text Clustering


Mario Kubek and Herwig Unger
Faculty of Mathematics and Computer Science, FernUniversität in Hagen, Hagen, Germany
[email protected]

Abstract Selecting appropriate query terms to both satisfy an information need and find matching documents has become a tedious task when searching the World Wide Web (WWW). Although web search engines suggest terms and keywords to enhance query formulation in common interactive search sessions, more specific search tasks such as performing in-depth research on a topic of interest and tracking topics across multiple documents are still not supported technically. In particular, the linearly ordered result lists do not indicate topical dependencies between linked documents. This article presents three algorithms to support such search tasks by extracting and clustering high-quality keywords from texts and using them as search words in consideration of their semantic relationships, and by automatically linking documents with significant topical dependencies derived from cluster analysis. These algorithms can be executed in local text processing agents that can regard user-specific information needs to a much greater extent than web search engines.

Keywords: Clustering co-occurrence graphs, search word extraction, source topic detection, link induction, topic tracking

1. Introduction

In recent years, the search for information in the World Wide Web (WWW) has become dominated by big search engines such as Google, which index the vast amount of data in the web in order to make it more accessible to users. Despite recent advances in query formulation techniques, the most common way to express information needs is still the combination of search words that are sent as queries to those web search engines. On average, users provide two search words [1] in order to find matching contents. Several publications [2,3] have shown that the application of query expansion techniques may significantly reduce the number of search results by describing the search subject better. That is why most search engines already suggest additional search words that are semantically connected to the query terms entered initially [4]. However, even though such help exists, it is often not useful, because the expansion terms are generally selected by analysing term combinations found in queries of the entire user community. Such inappropriate expansion terms will likely not satisfy experts. Also, it is not always possible for the users to determine the semantic relationship of the terms suggested. This makes it hard to decide which terms should be used for a particular search task. Either way, the search space needs to be broadened or narrowed down repeatedly when starting with a general query. In the latter case, the topical basics of a search subject often need to be found and examined, as they provide more detailed background information. This tedious process is time-consuming and error-prone, as it requires manual evaluation of the usually linearly presented search results by users who might not have enough knowledge to do so properly. Therefore, new mechanisms are required that assist users in conducting their (re)search tasks in the web in a more effective and efficient manner with the help of proper user-specific search words. To this end, this article presents two graph-based term clustering algorithms to automatically determine search words and expansion terms that address the above issues directly. Moreover, as these algorithms do not rely on large text corpora, unlike classical algorithms for keyword extraction, they can efficiently run in local text processing agents that regard the users' special topical interests and help to determine and select proper search terms even before a web search engine is invoked. The mentioned classical methods for keyword extraction, such as TF-IDF [5] and difference analysis [6], rely on word frequency analysis and do not return satisfying results when too few (similar) documents are available. Also, methods to generate ontologies or taxonomies need larger amounts of data to derive reliable knowledge and are, thus, not properly applicable either. It is also shown that the terms returned by the algorithms presented here are well suited for the automatic retrieval of semantically similar and related documents from large corpora like the WWW through automatic and local query formulation. Two additional advantages of the proposed graph-based methods are that they do not rely on third-party datasets such as reference corpora, and that they can be applied to single texts. Additionally, an algorithm is introduced to automatically generate topically induced links between documents based on the results of one of the clustering algorithms. This algorithm can be used to facilitate topic tracking across multiple documents, which is especially useful when dealing with large corpora such as the WWW. Moreover, this algorithm provides a suitable basis to re-rank web search results according to their topical dependencies. The remaining article is structured as follows: the next section explains the methodology used, thereby focusing on the generation of directed word graphs based on statistical co-occurrence analysis, which form the basis of the presented clustering algorithms. A short summary of existing graph-based clustering algorithms is given in Section 2 as well. Section 3 describes the new centrality-based term clustering algorithms in detail and presents experimental results. In Section 4, applications of the term clustering algorithms, such as the search for similar and related documents in the WWW, are discussed. A derived application to follow topics across multiple documents will be elaborated on as well. For this purpose, a new algorithm to automatically link documents based on their topical dependencies is introduced. Section 5 provides a look at options to enhance the algorithms presented. The focus is hereby on possible improvements of the link generation algorithm and its applicability.

2. Methodology

The extraction of keywords and keyphrases should not only return high-quality results, but must also meet semantic criteria for distinct applications like identifying the sources and basics of topics for topic tracking purposes across different documents, and narrowing down or expanding queries based on semantic relations such as hyponyms or hypernyms of the respective search terms. In order to do so, queries with topic-specific combinations of search words must either be formulated manually by users, which requires a decent but not always existing amount of background knowledge on a domain of interest, or be generated automatically using the results of special-purpose clustering techniques that support these applications. This involves the identification of and the separation between general and more specific terms, and the determination of their semantic relationships. The set of all these relationships in text documents can naturally be represented in the form of graphs analysable with graph clustering algorithms. Before two special-purpose graph clustering algorithms are presented in Section 3, this section presents techniques for generating word nets and describes well-known general techniques from the literature to cluster such graphs. They can be used to identify semantically homogeneous groups of words within single texts and text corpora.

2.1. Generating word nets

In order to determine topically connected terms (e.g. keywords, concepts), words and word forms in texts, clustering techniques can be applied. The initial problem arising in this context is that words are basic elements of texts and, therefore, carry no explicit attributes characterising their semantic orientation which could be used to compare their similarity. One feasible solution is to determine a word's semantics by considering its context, namely the words it co-occurs with in text sections. Such word pairs are usually referred to as co-occurrences or syntagmatic relations; those that occur with a higher probability than expected are called significant co-occurrences. The most prominent kinds of co-occurrences are word pairs that appear as immediate neighbours, and term pairs that occur together in a sentence. The following considerations will focus on the latter.

There are several well-established measures to calculate the statistical significance of such word pairs' occurrences by assigning them a significance value. If this value is above a pre-set threshold, the co-occurrence can be regarded as significant, and a semantic relation between the words involved can often be derived from it. Rather simple co-occurrence measures are, e.g., the frequency count of co-occurring terms and the similar Dice [7] and Jaccard [8] coefficients. More advanced formulae rely on the expectation that two terms are statistically independent (a usually inadequate hypothesis). With them, however, the deviations between observed and expected co-occurrences of real corpus data can be calculated; a significant deviation then leads to a high co-occurrence value. Co-occurrence measures based on this hypothesis are, for instance, the mutual information measure [9], the Poisson collocation measure [10] and the log-likelihood ratio [11]. Stimulus-response experiments show that co-occurrences found to be significant by these measures correlate well with term associations made by humans [6].
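To make these measures concrete, here is a minimal Python sketch of two of them, the Dice coefficient and the log-likelihood ratio, computed from sentence-level counts; the variable names (n_a, n_b, n_ab, n) are illustrative:
———————————————————————————————————————–
import math

def dice(n_ab, n_a, n_b):
    # Dice coefficient: joint sentence count relative to both terms' counts.
    return 2.0 * n_ab / (n_a + n_b)

def log_likelihood(n_ab, n_a, n_b, n):
    # Log-likelihood ratio: deviation of the observed joint count from the
    # count expected if the two terms were statistically independent.
    def ll(k, m, p):
        p = min(max(p, 1e-12), 1.0 - 1e-12)   # avoid log(0)
        return k * math.log(p) + (m - k) * math.log(1.0 - p)
    p = n_b / n                      # expected probability under independence
    p1 = n_ab / n_a                  # observed probability in sentences with A
    p2 = (n_b - n_ab) / (n - n_a)    # observed probability in sentences without A
    return 2.0 * (ll(n_ab, n_a, p1) + ll(n_b - n_ab, n - n_a, p2)
                  - ll(n_ab, n_a, p) - ll(n_b - n_ab, n - n_a, p))
———————————————————————————————————————–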

Figure 1. A co-occurrence graph for the term “shower” (http://corpora.uni-leipzig.de/)

The co-occurrences of a text can be considered as a graph of semantically related terms, with the terms as the vertices and the significance values as the edge weights. The graph has the small-world properties, i.e. it has a large clustering coefficient, as it comprises definable groups of strongly connected vertices, and a small average path length. In [12] it was shown that the co-occurrence graph of the BNC corpus with 470,000 vertices and 1.7·10^7 edges has a mean average path length of 2.65 and a clustering coefficient of 0.5. Co-occurrence graphs, as examples for word nets, are the basic input for the clustering algorithms described below. The following steps are necessary to obtain a co-occurrence graph G in the form of a term-term-matrix:
———————————————————————————————————————–
Algorithm 1: Generating a term-term-matrix representing a co-occurrence graph
———————————————————————————————————————–
Input: Text files f1...fn
Output: A term-term-matrix inducing co-occurrence graph G=(V, E) with vertex set V and edge set E
———————————————————————————————————————–
1. Remove stop words and apply a stemming algorithm on all words in the text files. (Optional)
2. Generate a text corpus C by concatenating all text files to C=f1∘f2∘…∘fn.
3. Enumerate the sentences of C as s1…sm.
4. Determine for all words ti ∈ C the sentences sj in which ti occurs and save the results in a term-sentence-matrix Ts.
5. Determine the significance of all co-occurring words ti and tj on sentence level using Ts and return the results in a term-term-matrix.
———————————————————————————————————————–
Usually, co-occurrence graphs determined using the measures mentioned above are undirected, which is suitable for the flat visualisation of term relations and for applications like query expansion that rely on spreading activation techniques. However, real-life associations are mostly directed, e.g. an Audi is a German car, but not every German car is an Audi. The association of Audi with German car is, therefore, much stronger than the association of German car with Audi. Thus, it actually makes sense to deal with directed term relations, and it is necessary to describe the construction of directed co-occurrence graphs before getting into the details of the clustering methods and their applications.
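Before turning to the directed case, here is a minimal Python sketch of Algorithm 1 under simplifying assumptions (naive sentence splitting and tokenisation, the Dice coefficient as significance measure, an illustrative threshold):
———————————————————————————————————————–
import itertools
import re
from pathlib import Path

def cooccurrence_graph(files, threshold=0.2, stop_words=frozenset()):
    # Steps 1-3: concatenate the files to the corpus C and enumerate sentences.
    corpus = " ".join(Path(f).read_text(encoding="utf-8") for f in files)
    sentences = [s for s in re.split(r"[.!?]+", corpus) if s.strip()]
    # Step 4: term-sentence matrix Ts as term -> set of sentence indices.
    ts = {}
    for i, sentence in enumerate(sentences):
        for term in set(re.findall(r"\w+", sentence.lower())) - stop_words:
            ts.setdefault(term, set()).add(i)
    # Step 5: sentence-level significance for all co-occurring term pairs.
    graph = {}
    for a, b in itertools.combinations(ts, 2):
        n_ab = len(ts[a] & ts[b])
        if n_ab:
            sig = 2.0 * n_ab / (len(ts[a]) + len(ts[b]))   # Dice coefficient
            if sig >= threshold:
                graph[(a, b)] = sig   # undirected, weighted edge
    return graph
———————————————————————————————————————–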

To determine the significance of the directed relation of term A with term B, which can also be regarded as the strength of the association of term A with term B, it is proposed to use the confidence measure known from the field of association rule mining as a basis:

$$\mathrm{conf}(A \rightarrow B) = \frac{n_{AB}}{n_A} \qquad (1)$$

whereby $n_{AB}$ is the number of times the terms A and B co-occurred in the text on the sentence level, and $n_A$ is the number of sentences term A occurred in.

To generate a directed co-occurrence graph, it is sensible to take into account only the direction of the dominant association (the one with the higher value when applying this formula for both association directions of the two terms involved). Additionally, in order to compensate for the effects of great differences in the involved term frequencies, the dominant association should be weighted. As an example, the association of a less frequent term A with a frequently occurring term B could be 1.0. If another term C, which occurs more frequently in the text than A, always co-occurs with term B, then its association value with B would be 1.0, too. Yet, this co-occurrence is more significant than the co-occurrence of A with B because of A's low support. However, even though A has a low support, it would be a mistake to ignore this term completely, because that would imply a loss of data. An additional weight influencing the association value and taking this fact into account could be determined by

- the (normalised) number of sentences in which both terms co-occur, or
- the (normalised) frequency of the term A.

The normalisation basis could be the maximum number of sentences in which any term of the text has occurred.

The association Assn of term A with term B can then be calculated using the second approach by:

$$\mathrm{Assn}(A \rightarrow B) = \mathrm{conf}(A \rightarrow B) \cdot \frac{n_A}{n_{max}} \qquad (2)$$

Here, $n_{max}$ is the maximum number of sentences any term has occurred in. A relation of term A with term B with a high association strength can be interpreted as a recommendation of A for B. Because of their direction, relations gained by this means are more specific than undirected relations between terms. Such a relation resembles a hyperlink on a website; in this case, however, it has not been set manually and explicitly, and it carries an additional weight indicating the strength of the term association. The set of all such relations obtained from a text represents a directed co-occurrence graph. Detecting the relationship of term pairs by statistical means is very effective. However, there are further approaches to enhance this detection and to correct wrongly detected relations. One possible way is to consult manually created semantic networks such as WordNet [13], a large lexical database containing semantic relationships for the English language and covering relations like polysemy, synonymy, antonymy, hypernymy and hyponymy (i.e. more general and more specific concepts), as well as part-of relationships. If two co-occurring terms are, e.g., synonyms, then it is sensible to merge their respective vertices of the co-occurrence graph, or to add this relationship information to their interconnecting edges as an additional feature or, at least, to draw an undirected edge with a high weight between the vertices of the terms. If a term A is a hyponym of another co-occurring term B, or if it is in a part-of relationship with that term B, then a directed and accordingly annotated edge should be drawn from the vertex of term A to that of term B, whereby the weight depends on the distance of these terms in the WordNet graph [14]. Another way to build directed term graphs is the usage of dependency parsers [15]. After part-of-speech tagging has been applied, they identify syntactic relationships among the words in the text and generate dependency trees for each sentence. Detecting term relations using lexico-syntactic patterns is another well-known approach [16] for this task. Hereby, interesting patterns of parts of speech and/or word forms in a specified order are defined and searched for in the texts. This way, part-of and is-a relationships can be detected easily. A pattern like "[NN] like [NN] and [NN]" can be used to uncover hyponyms and hypernyms and, thus, determine the direction of the relation between the terms referred to.
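Returning to the statistical construction, the following fragment sketches how directed, weighted edges could be derived from the term-sentence index of the Algorithm 1 sketch above, keeping only the dominant direction (Formula 1) and weighting it according to Formula 2; the helper name and data layout are illustrative:
———————————————————————————————————————–
def directed_edges(ts):
    # ts: term -> set of sentence ids, as built in the Algorithm 1 sketch.
    n_max = max(len(s) for s in ts.values())   # n_max from Formula 2
    edges = {}
    terms = list(ts)
    for i, a in enumerate(terms):
        for b in terms[i + 1:]:
            n_ab = len(ts[a] & ts[b])
            if n_ab == 0:
                continue
            # Formula 1: confidence of both association directions.
            conf_ab, conf_ba = n_ab / len(ts[a]), n_ab / len(ts[b])
            # Keep only the dominant direction ...
            src, dst, conf = ((a, b, conf_ab) if conf_ab >= conf_ba
                              else (b, a, conf_ba))
            # ... and weight it by the source term's normalised sentence
            # frequency (Formula 2).
            edges[(src, dst)] = conf * (len(ts[src]) / n_max)
    return edges
———————————————————————————————————————–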

2.2. Clustering co-occurrence graphs

Language networks, such as the mentioned co-occurrence graphs as well as syntactic dependency and similarity networks, often contain huge sets of vertices and edges. Relying on algorithms with exponential running time makes it a hard task to find optimal clusters in them. To address this issue, heuristic algorithms are used to find good, but not necessarily optimal, solutions within short processing times. In this subsection, several important approaches to cluster such graphs are described as found in the literature. Since most of these algorithms rely on the similarity of data elements (here: words), it is first necessary to describe some common measures to determine their semantic similarity or distance.

2.2.1. Choosing an appropriate similarity measure

As mentioned above, the grouping of words should reflect their semantic orientation. This means that the similarity between the words within a group should be high. Intuitively, the similarity of two words can be determined by representing their sets of significant co-occurrences as vectors expressing their semantic context. The comparison of these co-occurrence vectors is then a feasible way to obtain similarity values for all pairs of terms. This approach is based on the assumption that similar terms have similar contexts. To calculate term-term-similarity values, measures operating on vectors such as the Euclidean distance, the inner product (for normalised vectors) or the cosine similarity can be applied. The latter is defined in Formula 3 and is used in the following considerations to obtain the similarity between term A and term B by comparing their co-occurrence vectors $\vec{A}$ and $\vec{B}$:

$$\mathrm{sim}(A, B) = \frac{\vec{A} \cdot \vec{B}}{|\vec{A}| \cdot |\vec{B}|} \qquad (3)$$

The vector $\vec{A}$ contains the significances of all co-occurrences of term A; the same applies to $\vec{B}$ and term B. These values have to be calculated for all term pairs in order to obtain a matrix that contains similarity values for each term pair in the text. As this formula takes the contexts of terms into account, the calculated similarity value is more meaningful than the plain co-occurrence value, which, of course, cannot be calculated for every term pair. In particular, terms can have a high similarity to each other even if they do not co-occur in the text. The resulting term-term-matrix contains the term-term-similarities in the range from 0 to 1 for all term combinations in the text, whereby values near 0 indicate low similarity (small overlap of the context vectors) and values near 1 indicate high similarity (large overlap of the context vectors). In the latter case, a paradigmatic relation of the terms involved can be assumed. With the help of these values, it is also possible to determine the mean term-term-similarity inside a cluster of terms. Other measures to determine the semantic distance between terms rely on external lexical and semantic databases such as WordNet [13] and Roget's Thesaurus [17]. These measures determine the shortest path between the terms' respective concepts in these databases [18,19].

2.2.2. Non-hierarchical (flat) clustering

The classic K-means clustering algorithm [20] is an example of a non-hierarchical clustering technique. The number of clusters to be generated has to be specified as an input parameter k. Initially, k elements are picked as representatives (“means”) of the different clusters. In the following steps, the other data elements are associated with their closest means, and new cluster centroids are calculated repeatedly until the centroids do not change anymore and convergence is reached. This way, a local optimum can be achieved. In order to increase the probability of finding a global optimum, K-means is executed several times with different start configurations. With the help of the mentioned term-term-similarity matrix, it is possible to employ this algorithm to cluster co-occurrence graphs, too. In doing so, the graph topology is taken into account only indirectly (via the term-term-similarity matrix). This applies to the algorithms presented in the next subsection as well. Another algorithm for flat clustering is the so-called Chinese-Whispers algorithm [21], which employs a simple, yet effective technique for label assignment in graphs. First, a specific label is assigned to each vertex. In the next steps, the vertices determine and take on the label most popular in their direct neighbourhood. This algorithm was successfully used to determine the parts of speech of text corpora.

2.2.3. Hierarchical clustering

In contrast to non-hierarchical methods, hierarchical clustering techniques determine a hierarchy of element groups and work in an agglomerative (bottom-up) or divisive (top-down) way [6]. In the first case, initially each data element to be clustered resides in its own cluster; in subsequent steps, the two clusters with the highest similarity are merged. In the second case, initially all data elements reside in a single (possibly big) cluster; in subsequent steps, the cluster with the least degree of coherence is split. In both cases, the clustering process is stopped when a reasonable compromise between a small number of clusters and a high degree of homogeneity of the data elements inside the clusters has been reached. Also, in both cases, methods are needed to determine the similarity of two complex clusters. The following approaches can be used for this purpose:

- Single-link: This approach merges the two clusters whose elements have the highest similarity. It tends to lead to large clusters.
- Complete-link: This approach determines the maximum distance between two clusters and merges the clusters with the lowest maximum distance. It tends to lead to small clusters.
- Average-link: This approach calculates the average distance between the objects in two clusters. The clusters with the lowest average distance are merged. In general, this leads to clusters of almost the same size.
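To illustrate how the cosine similarity of Formula 3 and these linkage strategies fit together, here is a minimal sketch using scipy (assumed to be available; the distance cut-off is an illustrative parameter):
———————————————————————————————————————–
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist

def cluster_terms(cooc_vectors, terms, max_dist=0.5):
    # One row per term: its vector of co-occurrence significances.
    # Cosine distance = 1 - cosine similarity (Formula 3).
    distances = pdist(np.asarray(cooc_vectors), metric="cosine")
    tree = linkage(distances, method="average")   # or "single" / "complete"
    labels = fcluster(tree, t=max_dist, criterion="distance")
    clusters = {}
    for term, label in zip(terms, labels):
        clusters.setdefault(label, []).append(term)
    return list(clusters.values())
———————————————————————————————————————–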

In the next subsection, further graph clustering techniques will be outlined that are applicable in the domains of text mining, natural language processing and social network analysis.

2.2.4. Other clustering techniques

In 2004, Flake et al. [22] introduced a cut-clustering algorithm to determine a set of minimum s-t-cuts in undirected graphs with edge weights. First, the original graph is augmented by an artificially inserted sink, which is linked to every other vertex in the graph with the edge weight α. A cut tree of this augmented graph is then computed, and the sink is removed afterwards. This leads to a separation of the cut tree into several connected components, which are returned as clusters. Thereby, the clusters are only weakly connected to the rest of the graph, because they are induced by minimum cuts. The inserted edges guarantee an expansion based on α within the clusters. Spectral graph partitioning algorithms rely on calculating the eigenvectors of a graph's Laplacian or its adjacency matrix [18]. The method introduced by Capocci et al. [23] also works on directed weighted graphs. Another algorithm of this class was presented by Qiu and Hancock [24]. It relies on the calculation of the Fiedler vector, which partitions a given graph into two components. Random-walk-based algorithms rely on the intuitive idea of finding dense subsets of vertices (clusters) of a graph that a random walker is not likely to leave [25]. This means that the members of such a cluster should be visited with a higher probability from within the cluster than from outside. Another method, Latent Semantic Indexing (LSI) [26], can be used to automatically group similar concepts found in document collections by mapping synonymous and related terms to the same topical dimension. It reduces the dimensionality of the original term-document-matrix using a mathematical technique called Singular Value Decomposition (SVD). A natural problem is to interpret the gained dimensions to which the concepts are assigned. The topical grouping of terms in documents can also be realised by employing inference algorithms for the well-known probabilistic topic modelling technique Latent Dirichlet Allocation (LDA) [27].
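As a minimal illustration of the LSI idea with plain numpy (assumed available; k is the number of latent dimensions to keep):
———————————————————————————————————————–
import numpy as np

def lsi_term_coordinates(term_doc_matrix, k=2):
    # Rank-k SVD of the term-document matrix; the rows of U_k * S_k place
    # the terms in a k-dimensional topical space, where synonymous and
    # related terms end up close to each other.
    u, s, vt = np.linalg.svd(term_doc_matrix, full_matrices=False)
    return u[:, :k] * s[:k]
———————————————————————————————————————–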

3. Centrality-based clustering of co-occurrence graphs

In this section, two algorithms for clustering directed and undirected co-occurrence graphs based on the relative importance of their nodes will be proposed. Thereby, the centrality score is influenced not only by the graph topology, but also by the strengths of the semantic relations involved. These two algorithms can be utilised in different practical applications relying on the determination and grouping of keywords, which will be described here, too.

3.1. Clustering co-occurrence graphs using extended PageRank

In [28], the authors have introduced a divisive text clustering algorithm using PageRank [29] calculations. This algorithm iteratively rules out terms in co-occurrence graphs that have a higher PageRank than their neighbouring terms. Since these graphs are generally scale-free and have the small-world properties, the deletion of such terms leads to a separation of semantically related components (clusters). It was shown that the clusters obtained contain terms of a high mean term-term-similarity. In contrast to other clustering algorithms, the number of clusters to be determined is not required as an initial input parameter. The original PageRank formula, however, takes only the structure of a graph into account to determine a node's relative importance. It is, therefore, feasible to extend it by employing the significance values of co-occurrences in order to consider the strength of the semantic term relations as well. The idea to also regard the bandwidth of internodal links for PageRank calculation has been elaborated in detail in [30]. This principle can be transferred to co-occurrence graphs: a random walker will follow paths (sequences of edges) between words with a higher probability when they are strongly connected. Therefore, the terms involved should be ranked highly, also because they are usually strongly connected in the human brain. The thus extended formula to calculate the PageRank of each term ti in a co-occurrence graph is given in Formula 4:

$$PR(t_i) = (1 - d) + d \sum_{t_j \in S_i} PR(t_j) \cdot \frac{sig(t_j, t_i)}{\sum_{t_k} sig(t_j, t_k)} \qquad (4)$$

The set $S_i$ represents the terms $t_j$ linking to $t_i$, $sig(t_j, t_i)$ denotes the significance value of the co-occurrence of $t_j$ and $t_i$, $d$ is the usual damping factor, and the sum over all terms $t_k$ that $t_j$ links to takes the place of the plain out-degree of $t_j$ used in the original formula. The extended algorithm will now be given in detail.

3.1.1. Algorithm

In order to perform the clustering using extended PageRank, the following steps must be executed:
———————————————————————————————————————–
Algorithm 2: Clustering co-occurrence graphs using extended PageRank calculations
———————————————————————————————————————–
Input: A co-occurrence graph G=(V, E) in form of a term-term-matrix Tm from Algorithm 1
Output: Clusters Gs of words and a list Lt of removed words
———————————————————————————————————————–
1. Increase the iteration counter and determine all separate components Gs of the (remaining) co-occurrence graph G using Tm.
2. Check if there are components Gs with two or more words. If yes, return these clusters along with the current value of the iteration counter and continue, otherwise terminate.
3. Calculate the extended PageRanks PR(ti) for all words ti ∈ V using Formula 4.
4. For all words ti ∈ V check whether PR(ti) is greater than PR(tj) for all neighbouring words tj. If yes, mark word ti for removal.
5. Remove all marked words ti from G and add them to Lt along with their PageRank PR(ti).
6. Go to step 1.
——————————————————————————————————————–
Empirical results of this algorithm will be provided in the next subsection.
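First, however, a compact Python sketch of Algorithm 2; the graph layout (a dict mapping each term to a weighted neighbour dict), the damping factor and the iteration count are illustrative assumptions:
———————————————————————————————————————–
def extended_pagerank(adj, d=0.85, iters=50):
    # Formula 4: each link's share is its significance relative to the
    # source node's total outgoing significance.
    pr = dict.fromkeys(adj, 1.0)
    out_sum = {v: sum(adj[v].values()) or 1.0 for v in adj}
    for _ in range(iters):
        pr = {v: (1 - d) + d * sum(pr[u] * adj[u][v] / out_sum[u]
                                   for u in adj if v in adj[u])
              for v in adj}
    return pr

def components(adj):
    # Connected components of the (remaining) undirected graph.
    seen, comps = set(), []
    for start in adj:
        if start in seen:
            continue
        comp, stack = set(), [start]
        while stack:
            v = stack.pop()
            if v not in comp:
                comp.add(v)
                stack.extend(adj[v])
        seen |= comp
        comps.append(comp)
    return comps

def pagerank_clustering(adj):
    # Steps 1-6: repeatedly remove terms whose extended PageRank exceeds
    # that of all their neighbours; the emerging components are the clusters.
    removed, history, iteration = [], [], 0
    while True:
        iteration += 1
        clusters = [c for c in components(adj) if len(c) > 1]
        if not clusters:
            break
        history.append((iteration, clusters))
        pr = extended_pagerank(adj)
        marked = [v for v in adj
                  if adj[v] and all(pr[v] > pr[u] for u in adj[v])]
        if not marked:            # no strict local maximum left
            break
        for v in marked:          # adjacent nodes cannot both be marked
            for u in adj[v]:
                adj[u].pop(v, None)
            removed.append((v, iteration, pr[v]))   # the ordered list Lt
            del adj[v]
    return history, removed
———————————————————————————————————————–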

3.1.2. Results

The main goal of the experiments carried out was to prove the hypothesis that the PageRanks of nodes in co-occurrence graphs can be used to separate clusters of semantically similar terms. The results therefore show, for some example text documents, how the number of clusters and their mean term-term-similarity change during the execution of the algorithm. The first finding was that if the initial graph G contains stop words, they receive a high PageRank and are ruled out first. Of course, these terms are neither characteristic nor discriminating, and must not be regarded as keywords. If it is impossible to rule them out in the optional first step of co-occurrence graph generation, because a stop word list may not be available, then the state transition point can be detected at which the co-occurrence graph first separates into clusters with meaningful terms. At this point, most stop words have been removed from the graph and the mean term-term-similarity of all clusters rises significantly, as is shown in Figure 3 for five example texts with around 250 to 500 words taken from the English Wikipedia.

Figure 3. Significant increase in mean term-term-similarity

This significant increase in mean term-term-similarity correlates with the number of clusters emerging during the algorithm's execution, as can be seen in Figure 4. Therefore, the point in time from which on the number of clusters constantly rises indicates that the most common terms have been ruled out.

Figure 4. Number of clusters in each iteration

It is feasible to define a threshold to inhibit further cluster separations in order to obtain usable groups of topically related terms. This threshold should primarily depend on the mean term-term-similarity in a cluster as a quality measure. As this algorithm mainly relies on the underlying graph topology in finding connected components, it will not return perfect results, because it is still possible that the clusters obtained contain topically inappropriate terms. However, this hard clustering algorithm has proven to efficiently return useful groups of semantically similar terms by determining with high precision strongly connected components in co-occurrence graphs. As an example, clusters with high mean term-term-similarity formed from the Wikipedia article “automobile” are given:

Table 1. Interesting clusters from the English Wikipedia article “automobile”
———————————————————————————————————————–
Term clusters                                                               Mean term-term-similarity
———————————————————————————————————————–
parts, hood, roof, powertrains, windows, platforms, doors                   0.91
gasoline, combustion, engine, automobiles, diesel                           0.76
casualties, EuroNCAP, tests, pedestrian, fatalities                         0.96
trains, horseback, trolleybuses, cycling, aspects, riding,
velomobile, walking, tramways, subways, transit                             0.97
greenhouse, climate change, gas, emissions, warming, laws, restrictions     0.85
quality, areas, carbon dioxide, monoxide, study, kg, amounts,
pounds, hydrocarbons                                                        0.96
———————————————————————————————————————–

As can easily be seen in Table 1, the terms in the clusters are topically grouped. The first cluster contains terms describing car parts, the second one deals with engine-related terms and the third one comprises terms related to accidents. In addition to the clusters obtained in each iteration, an ordered list Lt of removed terms (keywords) along with their PageRank values is another result of this algorithm. The list contains the terms of the text ordered according to the iteration of their removal from the co-occurrence graph, which indicates their levels of specificity: the earlier terms are ruled out (i.e. the terms that are added first to the list), the more general they are for the specific text. Hence, the algorithm does not only return clusters of similar terms, but also a ranked list of keywords. These two sets of results turn out to be useful when it comes to automatic query formulation to find similar documents in corpora, especially when semantic aspects of documents need to be explored interactively at different levels of specificity. These applications will be discussed in Section 4. A shortcoming of the original PageRank algorithm is that entire graphs need to be considered. Therefore, in recent years, many solutions for distributed PageRank computation were published [31,32,33] to address this issue. In [30], extended methods for distributed PageRank calculation based on random walks and considering network parameters are discussed and empirically evaluated. These approaches will be valuable when analysing large text corpora.

3.2. Detecting the main topics and their sources using extended HITS

In [34], the authors presented a novel graph-based approach for clustering keywords of texts by analysing their co-occurrence graphs using an extended version of the HITS algorithm [35], which has similarities with PageRank. The HITS algorithm was initially designed to evaluate the relative importance of nodes in web graphs (which are directed). It is now applied in a different domain, the analysis of word graphs in form of directed co-occurrence graphs, for text clustering purposes. The algorithm returns two lists of keywords: the characteristic terms (authorities) and the source topics (hubs) that strongly influence a text's main topics, whereby authorities are nodes that are linked to by many other nodes, and hubs are nodes pointing to many other nodes and, therefore, topically influencing them. For this purpose, the co-occurrence graphs to be analysed must be directed. Applying Formula 2 to calculate asymmetric term associations renders such graphs. Undirected co-occurrence graphs can also be analysed with HITS; in this case, the authority and hub lists will be identical, as a node's out-degree is equal to its in-degree. The two calculated clusters of terms can, however, also show an overlap when analysing directed co-occurrence graphs. This means that this is a soft graph clustering technique. The two-fold classification obtained is especially useful for follow-up tasks such as topic detection and tracking. To regard the association strengths provided by the edge weights (Formula 2) of the directed co-occurrence graphs, the update rules of the HITS algorithm, which iteratively calculate the authority score a(ti) and the hub score h(ti) of a vertex (word) ti, need to be extended accordingly, and are given by the following formulae:

$$a(t_i) = \sum_{t_j \rightarrow t_i} w(t_j, t_i) \cdot h(t_j) \qquad (5)$$

$$h(t_i) = \sum_{t_i \rightarrow t_j} w(t_i, t_j) \cdot a(t_j) \qquad (6)$$

Here, $w(t_j, t_i)$ denotes the association strength (Formula 2) of the directed edge from $t_j$ to $t_i$.

These rules will be executed until convergence is reached (the calculated values do not change significantly in two consecutive iterations), or a fixed number of iterations has been performed. In order to prevent diverging values, the authority and hub scores should be normalised by dividing each hub score by the square root of the sum of all squared hub scores, and by dividing each authority score by the square root of the sum of all squared authority scores.
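As a minimal sketch, the weighted update rules and the described normalisation can be transcribed directly to Python (the fixed iteration count stands in for a convergence test; the data layout is an assumption):
———————————————————————————————————————–
import math

def extended_hits(edges, iters=50):
    # edges: dict mapping (source, target) pairs to association weights
    # according to Formula 2.
    nodes = {v for edge in edges for v in edge}
    incoming = {v: [] for v in nodes}   # (source, weight) per target
    outgoing = {v: [] for v in nodes}   # (target, weight) per source
    for (s, t), w in edges.items():
        incoming[t].append((s, w))
        outgoing[s].append((t, w))
    auth = dict.fromkeys(nodes, 1.0)
    hub = dict.fromkeys(nodes, 1.0)
    for _ in range(iters):
        # Formula 5: authority = weighted sum of the hubs pointing to a node.
        auth = {v: sum(hub[s] * w for s, w in incoming[v]) for v in nodes}
        # Formula 6: hub = weighted sum of the authorities a node points to.
        hub = {v: sum(auth[t] * w for t, w in outgoing[v]) for v in nodes}
        # Normalise both score vectors to prevent divergence.
        na = math.sqrt(sum(x * x for x in auth.values())) or 1.0
        nh = math.sqrt(sum(x * x for x in hub.values())) or 1.0
        auth = {v: x / na for v, x in auth.items()}
        hub = {v: x / nh for v, x in hub.items()}
    return auth, hub
———————————————————————————————————————–
The lists LA and LH used below are then simply the terms sorted by these scores in descending order.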

3.2.1. Algorithm

In order to obtain the two lists for the authorities and hubs based on these extended update rules, the following steps must be carried out:
———————————————————————————————————————–
Algorithm 3: Clustering directed co-occurrence graphs using extended HITS
———————————————————————————————————————–
Input: A directed co-occurrence graph G=(V, E) in form of a term-term-matrix Tm from Algorithm 1 with co-occurrence significances based on Formula 2
Output: Lists (clusters) LA (authorities, keywords) and LH (hubs, source topics) of word forms
———————————————————————————————————————–
1. For all words ti ∈ V determine the authority value a(ti) and the hub value h(ti) using Formulas 5 and 6 until convergence is reached (the calculated values do not change significantly in two consecutive iterations) or a fixed number of iterations has been executed.
2. Put all words ti in descending order by their authority values a(ti) in the list LA and return LA. Put all words ti in descending order by their hub values h(ti) in the list LH and return LH. The first 10 to 20 words in LA can be regarded as keywords. The first 10 to 20 words in LH represent the most influential source topics.
——————————————————————————————————————–
Empirical results of this algorithm will be provided in the next subsection.

3.2.2. Results

Tables 2 and 3 show the lists LA and LH extracted for two documents of the English Wikipedia. To conduct these experiments, the following preprocessing parameters have been used:

- removal of stop words
- restriction to nouns and names
- base form reduction
- activated phrase detection

The examples show that the extended HITS algorithm can determine clusters of the most characteristic terms (authorities) and source topics (hubs) in texts by analysing their directed co-occurrence graphs. Especially the hubs provide information useful in finding suitable terms to be employed as search words in queries when background information is sought on a specific topic. Another empirical finding was that the quality of the authority and hub lists improved when analysing clusters of semantically similar documents instead of single texts.

Table 2. Terms and phrases with high authority and hub values of the Wikipedia article “Earthquake”
———————————————————————————————————————–
Term            Authority value     Term/Phrase         Hub value
———————————————————————————————————————–
earthquake      0.48                movement            0.18
earth           0.30                plate               0.16
fault           0.27                boundary            0.15
area            0.23                damage              0.15
boundary        0.18                zone                0.15
plate           0.16                landslide           0.14
structure       0.16                seismic activity    0.14
rupture         0.15                wave                0.13
aftershock      0.15                ground rupture      0.13
tsunami         0.14                propagation         0.12
———————————————————————————————————————–

Table 3. Terms and phrases with high authority and hub values of the Wikipedia article “Android” (mobile operating system)
———————————————————————————————————————–
Term/Phrase     Authority value     Term/Phrase         Hub value
———————————————————————————————————————–
Android         0.32                source code         0.19
Google          0.31                development         0.18
application     0.27                October             0.16
version         0.24                project             0.15
open source     0.23                platform            0.14
Linux           0.22                handset             0.14
system          0.22                Alliance            0.13
December        0.21                Android Inc.        0.13
software        0.20                Java                0.13
Play Store      0.19                mobile              0.12
———————————————————————————————————————–

The reason for this is that smaller texts like newspaper articles often address only specific subtopics of a main topic. Therefore, document-specific terms and topics would be given a high importance for such a document, but would not be of great importance if this document were analysed in combination with other, topically similar documents. Moreover, when analysing corpora of documents with the same main topic, the results in the authority and hub lists will be more meaningful, because they can be validated statistically. For ranking, however, external corpora are not necessary. In conjunction with the regarded term association strengths, the topology of the analysed documents' directed co-occurrence graphs is a suitable basis for high-quality keyword extraction and clustering. Therefore, even a small set of topically related documents is sufficient to find important keywords for a topical domain.

4. Applications

In this section, application scenarios for the presented clustering methods will be outlined. Especially the generation of topically induced links between text documents based on the second method is an interesting way to make topical dependencies across documents visible. Further related applications will be discussed as well.

4.1. Automatic selection of search words for web-based document retrieval

Both clustering techniques do not only group document terms according to their similarities or semantic influence, but also return lists of keywords. The detected keywords and their sources can not only be used as query expansion terms, but can also be useful to directly find topically similar and related documents in the World Wide Web (WWW). Since these algorithms do not rely on large text corpora, unlike classical algorithms for keyword extraction, they can efficiently run in local text processing agents that regard the user's special topical interests. Thus, the knowledge present on the user's computer can be combined with the huge and well-indexed databases as well as the access mechanisms of the big search engines, which are unrivaled. Such an agent has access to a special folder of local (perhaps confidential) text documents of the user, and may even establish a fine-granular user profile to support interactive search based on the user's topical interests. As these data are kept local, there is no danger of privacy and security violations. When a query is entered, the local search agent processes the current search words, enriches them with knowledge from the local files and previous searches, and generates a keyword suggestion, which may or may not be used as a query to the (remote) search engine that returns its results in the known manner. Also, it is possible that the local search agent generates queries by itself, simply by using single text documents as the only input, in order to find similar and related documents in the web. As an example, in [36] it was shown that such documents can be found in the web by sending the analysed document's keywords as search words to a web search engine. Experiments revealed that at most five terms and phrases should be used for this purpose: a larger number would limit the search results too much, while too few terms would generate too many results. In order to retrieve documents that deal with the most important, yet general, topics of the analysed document, proper query terms are those that have been ruled out from the co-occurrence graph in the first iterations (low specificity level) when relying on the PageRank clustering technique, as they have received a higher PageRank value compared to other terms in the list Lt. Another approach is to use semantically related terms from the same cluster as search words in order to find documents that cover the cluster's topic in detail.
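As an illustration (the function name, level threshold and tuple layout are assumptions tied to the Algorithm 2 sketch above), such a query could be assembled from the removal list Lt like this:
———————————————————————————————————————–
def formulate_query(removed, max_level=2, max_terms=5):
    # removed: list Lt of (term, iteration, pagerank) tuples; terms ruled
    # out in early iterations (low specificity level) are the general ones.
    general = [term for term, iteration, _ in removed if iteration <= max_level]
    return " ".join(general[:max_terms])   # at most five search words
———————————————————————————————————————–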

4.2. Interactive search for similar and related documents

If users should be able to modify the automatically generated queries, the term lists can facilitate interactive document retrieval, too. Besides just recommending terms with a high PageRank for this purpose, the specificity level offers users the possibility to browse through the terms, which should be selectable as search words in an interactive search system. A first implementation of this idea is the interactive Firefox extension "FireMatcher" (www.firematcher.com). Its aim is to locally analyse text documents a user provides, and to send their characteristic terms as queries to web search engines in order to find similar documents. The returned search words and web search results are generally of good quality. Another application of a term's specificity level is to use it in automatic query expansion, similar to the approach presented in [3]. Given a query, the search system could suggest related terms of a higher or lower specificity level in order to broaden or narrow down the search space. The relatedness of terms can be obtained from the co-occurrence significances of the analysed documents' co-occurrence graphs. It is also possible to expand a query with terms from the same topical cluster, as discussed in the previous section. This query expansion approach is promising, as it takes not only the degree of relatedness but also hierarchical dependencies into account when recommending terms. If a user accepts suggested expansion terms, this user is evidently satisfied with the recommendation. It is, therefore, feasible to think about ways to influence the strength of the term relations involved by human interaction. Thus, the PageRank of terms would depend on the structure of a document's co-occurrence graph, on the significance values of term pairs as shown in this article, and also on implicit or explicit user feedback. A personalised ranking of terms in documents would not only consider a user's interest in special topics, but would also influence the selection of search words in automatic document retrieval. These possibilities will be elaborated on in later publications.

Figure 5. Screenshot of "FireMatcher" with determined search words and Google's web results

When background information is needed on specific topics, the calculated source topics of the HITS-based clustering technique can be used as search words to find related, not necessarily similar, documents, too. The reason for this is the naturally occurring topic drift in the results, induced by the intended semantic differences between authorities and hubs. This observation indicates that the source topics of documents can be used as a means to follow topics across several related documents. Hereby, it is desirable that the hubs of the analysed documents are the authorities (main topics) of the documents found, in order to obtain a chain of topically dependent documents. This approach will be discussed in the next subsection.

4.3. Automatic link induction

Web search engines return useful results when a user is looking for specific documents that contain known keywords. The situation gets more difficult when it is necessary to search for topically connected documents in the web and to find out why they are semantically related, e.g. in the course of scientific research. Besides the required user knowledge of the area of interest, the time to conduct the research and the effort to evaluate the results, the following reasons make especially the tracking of topics in the web a hard task:

- Web search results are mostly presented linearly, and their (often existing) semantic dependencies are usually not visualised.
- The number of manually established links (despite their generally high quality) on a website is limited by the webmaster's knowledge of related material and his/her willingness to actually link it. Therefore, although feasible, relevant documents may not have been linked.
- Moreover, despite possibly high user interest, links may not exist for specific reasons, such as not to lead customers to competitors or similar artists, or because plain text documents, unlike web pages in HTML format, do not provide means to express explicit links to related documents.

In order to address these problems, the search engine must be able to automatically calculate the semantic dependencies between the documents in question. To make these semantic dependencies visible and to enhance document retrieval methods, Kurland and Lee [37] used language models to generate asymmetric links between initially returned search results and applied the PageRank [29] algorithm on the obtained document graphs to re-rank the results. Although automatically induced links can be noisier than manually created ones [18], this structural re-ranking was able to improve the quality of the result lists consistently. While Kurland and Lee used a generation probability to model topical dependencies between two documents to establish those links, here a new algorithm for automatic link induction is provided that measures the degree of the semantic dependency between two documents in question using the HITS-based clustering method. The following steps are necessary to establish such links between documents based on their determined main and source topics:
———————————————————————————————————————–
Algorithm 4: Generation of topically induced links between text documents
———————————————————————————————————————–
Input: Directed co-occurrence graphs Gf=(Vf, Ef) for each text f1...fn in form of term-term-matrices Tf from Algorithm 1 with co-occurrence significances based on Formula 2
Output: Topically induced links between the documents f1...fn with the respective degree of their semantic relatedness
———————————————————————————————————————–
1. Apply Algorithm 3 on each Gf.
2. For all pairs of texts (fk, fl) with k ≠ l calculate the similarity Skl between the authority list of fk and the hub list of fl, e.g. by determining the overlap of the first 10 to 20 entries in these lists.
3. If Skl is greater than a preset threshold, return the pair (fk, fl) (document link) along with Skl.
———————————————————————————————————————–
The intuition behind this approach is that a link from document K to L can be established when K's most important terms (main topics) can also be found in L's list of source topics. In other words, if K primarily deals with topics that significantly influence L's content, then a topically induced link from K to L should be added.
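A condensed sketch of Algorithm 4, reusing per-document authority and hub scores as computed by the extended HITS sketch in Section 3.2; the overlap size and threshold are the tuning parameters mentioned in the algorithm:
———————————————————————————————————————–
def induce_links(docs, top_n=15, threshold=0.2):
    # docs: list of (auth, hub) score dictionaries, one pair per text.
    # Link document k to document l when k's main topics (authorities)
    # also appear among l's source topics (hubs).
    tops = [(set(sorted(a, key=a.get, reverse=True)[:top_n]),
             set(sorted(h, key=h.get, reverse=True)[:top_n]))
            for a, h in docs]
    links = []
    for k, (auth_k, _) in enumerate(tops):
        for l, (_, hub_l) in enumerate(tops):
            if k == l:
                continue
            s_kl = len(auth_k & hub_l) / top_n   # normalised list overlap
            if s_kl > threshold:
                links.append((k, l, s_kl))
    return links
———————————————————————————————————————–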

4.4. Further applications

An interesting application for the topically induced links between related documents is to track topics across multiple documents, e.g. in large corpora like the WWW. In the approach proposed herein, a document primarily covering the basics of another document's main topic will be linked to this other document. Thus, from an originally examined document, users can be led to related ones that mainly cover important aspects of it. By following incoming links repeatedly, topics can be tracked down to their basics. This way, chains of topically connected documents can be uncovered without requiring users to formulate queries with search terms. These chains of topically dependent documents can also be generated automatically, and presented in the same form as the web search result lists that users are already used to. In contrast to the often unrelated web search results, however, these result lists make the semantic connections between the documents visible. Thus, users do not have to follow the links manually and step-by-step in a time-consuming manner. This approach also goes beyond a simple search for similar documents, as it offers a new way to search for related documents and to find background information on a topic of interest. This functionality can be seen as a useful addition to Google Scholar (http://scholar.google.com/), which already offers users the possibility to search for similar scientific articles.
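The following hypothetical helper sketches this backward tracking over links produced by the Algorithm 4 sketch above:
———————————————————————————————————————–
def topic_chain(links, start, max_depth=5):
    # Track a topic down to its basics by repeatedly following the strongest
    # incoming link; links are (source, target, strength) tuples.
    chain, current = [start], start
    for _ in range(max_depth):
        incoming = [(s, w) for s, t, w in links
                    if t == current and s not in chain]
        if not incoming:
            break
        current = max(incoming, key=lambda sw: sw[1])[0]
        chain.append(current)
    return chain
———————————————————————————————————————–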

These automatically determined links between web search results can also be very useful for positively influencing the ranking of search results, because they represent verified semantic relations between documents. Manually set links, e.g. found on websites, can be automatically evaluated regarding their validity using the approach for source topic detection, too. To realise this function, the web search results must be downloaded by the local agent and analysed with regard to their topical dependencies using the presented HITS-based clustering algorithm. The web search results could then be re-ordered according to the relationships found, in such a manner that topical clusters of documents become visible. Also, by comparing newly found documents with the lists of keywords and source topics of locally existing documents, it is possible to re-rank them based on their similarity with the local knowledge. As this function may take a large amount of time, its use is not appropriate when a timely response is needed. It is feasible, however, when an in-depth analysis of a topic is required and real-time demands play a secondary role.

5. Conclusion

In this article, two special-purpose graph-based methods to cluster co-occurrence graphs have been presented, and their effectiveness has been shown empirically. The algorithms' objective is to support new search applications, such as finding similar and related content with more background information in the web, by using automatic query formulation and by enhancing solutions for query expansion through local document analysis. Furthermore, a new algorithm to automatically generate links between topically interdependent documents based on these clustering results has been presented. This technique can be used to realise special search functions such as the tracking of topics spanning multiple documents, a task that currently requires a great deal of user knowledge, time and manual evaluation when relying on usual web search results. There are still options, however, to enhance the described method for inducing document links. One way is to detect, within documents, topically connected keywords by applying term clustering techniques such as LDA [27] or the PageRank-based clustering method described. Based on such topical clusters, fine-grained links between documents can be generated that are valid for specific topics and subtopics only, but not for the entire documents, which usually cover many topics. This possibility will be examined in future contributions.

References
[1] R. Agrawal et al., “Enrichment and Reductionism: Two Approaches for Web Query Classification”, in: Bao-Liang Lu, Liqing Zhang and James T. Kwok (eds.), ICONIP (3), Lecture Notes in Computer Science, Vol. 7064, pp. 148–157, Springer, Berlin / Heidelberg, 2011
[2] J. Xu and W. B. Croft, “Query expansion using local and global document analysis”, in: Hans-Peter Frei, Donna Harman, Peter Schäuble and Ross Wilkinson (eds.), Proc. of the 19th Annual International ACM/SIGIR Conference on Research and Development in Information Retrieval, SIGIR ’96, pp. 4–11, Zurich, 1996
[3] M. Kubek and H. F. Witschel, “Searching the Web by Using the Knowledge in Local Text Documents”, in: Kyandoghere Kyamakya, Herwig Unger and Wolfgang Halang (eds.), Proc. of Mallorca Workshop 2010 Autonomous Systems, Shaker Verlag, Aachen, 2010
[4] Website of Google Autocomplete, Web Search Help, http://support.google.com/websearch/bin/answer.py?hl=en&answer=106230
[5] G. Salton, A. Wong and C. S. Yang, “A vector space model for automatic indexing”, in: Communications of the ACM, Vol. 18, Issue 11, pp. 613–620, New York, November 1975
[6] G. Heyer, U. Quasthoff and T. Wittig, Text Mining: Wissensrohstoff Text: Konzepte, Algorithmen, Ergebnisse, W3L-Verlag, Dortmund, 2006
[7] L. R. Dice, “Measures of the Amount of Ecologic Association Between Species”, in: Ecology, Vol. 26, No. 3, pp. 297–302, 1945
[8] P. Jaccard, “Étude Comparative de la Distribution Florale dans une Portion des Alpes et des Jura”, in: Bulletin de la Société Vaudoise des Sciences Naturelles, Vol. 37, pp. 547–579, 1901
[9] M. Büchler, “Flexibles Berechnen von Kookkurrenzen auf strukturierten und unstrukturierten Daten”, Master's thesis, University of Leipzig, 2006
[10] U. Quasthoff and C. Wolff, “The Poisson Collocation Measure and its Applications”, in: Second International Workshop on Computational Approaches to Collocations, IEEE, Vienna, 2002
[11] T. Dunning, “Accurate methods for the statistics of surprise and coincidence”, in: Computational Linguistics, Vol. 19, Issue 1, pp. 61–74, MIT Press, Cambridge, 1993
[12] R. Ferrer i Cancho and R. V. Solé, “The Small World of Human Language”, in: Proc. of The Royal Society of London, Series B, Biological Sciences, Vol. 268, pp. 2261–2266, 2001
[13] C. Fellbaum, “WordNet and wordnets”, in: Keith Brown et al. (eds.), Encyclopedia of Language and Linguistics, Second Edition, pp. 665–670, Elsevier, Oxford, 2005
[14] T. Hughes and D. Ramage, “Lexical semantic relatedness with random graph walks”, in: Jason Eisner (ed.), EMNLP-CoNLL 2007, pp. 581–589, ACL, Prague, 2007
[15] R. McDonald et al., “Non-projective dependency parsing using spanning tree algorithms”, in: Donna Byron, Anand Venkataraman and Dell Zhang (eds.), Proc. of the Joint Conf. on Human Language Technology and Empirical Methods in Natural Language Processing (HLT/EMNLP), pp. 523–530, ACL, Vancouver, 2005
[16] E. Riloff and R. Jones, “Learning dictionaries for information extraction by multi-level bootstrapping”, in: Jim Hendler and Devika Subramanian (eds.), Proc. of the Sixteenth National Conference on Artificial Intelligence, pp. 474–479, Orlando, 1999
[17] M. Jarmasz and S. Szpakowicz, “Roget's Thesaurus and semantic similarity”, in: Nicolas Nicolov, Kalina Bontcheva, Galia Angelova and Ruslan Mitkov (eds.), Proc. of the Conference on Recent Advances in Natural Language Processing 2003, pp. 212–219, Borovets, 2003
[18] R. Mihalcea and D. Radev, Graph-based Natural Language Processing and Information Retrieval, Cambridge University Press, 2011
[19] A. Budanitsky and G. Hirst, “Evaluating WordNet-based measures of semantic distance”, in: Computational Linguistics, Vol. 32, Issue 1, pp. 13–47, MIT Press, Cambridge, 2006
[20] J. B. MacQueen, “Some Methods for Classification and Analysis of Multivariate Observations”, in: Lucien M. Le Cam and Jerzy Neyman (eds.), Proc. of the 5th Berkeley Symposium on Mathematical Statistics and Probability, Vol. 1, pp. 281–297, University of California Press, 1967
[21] C. Biemann, “Chinese whispers: an efficient graph clustering algorithm and its application to natural language processing problems”, in: Rada Mihalcea and Dragomir Radev (eds.), Proc. of the HLT-NAACL-06 Workshop on Textgraphs-06, pp. 73–80, ACL, New York City, 2006
[22] G. W. Flake, R. E. Tarjan and K. Tsioutsiouliklis, “Graph clustering and minimum cut trees”, in: Internet Mathematics, Vol. 1, No. 4, pp. 385–408, 2003
[23] A. Capocci et al., “Detecting communities in large networks”, in: Physica A: Statistical Mechanics and its Applications, Vol. 352, Issues 2–4, pp. 669–676, Elsevier, Amsterdam, 2005
[24] H. Qiu and E. R. Hancock, “Graph matching and clustering using spectral partitions”, in: Pattern Recognition, Vol. 39, Issue 1, pp. 22–34, Elsevier, 2006
[25] S. M. van Dongen, “Graph clustering by flow simulation”, Ph.D. thesis, Universiteit Utrecht, Utrecht, The Netherlands, 2000
[26] S. C. Deerwester et al., “Indexing by latent semantic analysis”, in: Journal of the American Society for Information Science, Vol. 41, No. 6, pp. 391–407, 1990
[27] D. Blei, A. Ng and M. Jordan, “Latent Dirichlet Allocation”, in: The Journal of Machine Learning Research, Vol. 3, pp. 993–1022, 2003
[28] M. Kubek and H. Unger, “Topic Detection Based on the PageRank's Clustering Property”, in: Proc. 11th Intl. Conf. on Innovative Internet Community Systems, GI Lecture Notes in Informatics, Vol. P-186, pp. 139–148, Köllen Verlag, Bonn, 2011
[29] L. Page, S. Brin, R. Motwani and T. Winograd, “The PageRank Citation Ranking: Bringing Order to the Web”, Technical Report, Stanford Digital Library Technologies Project, 1998
[30] S. Sodsee et al., “An Extended PageRank Calculation Including Network Parameters”, in: Proc. of the Annual International Conference on Computer Science Education: Innovation and Technology (CSEIT 2010), pp. 121–126, Phuket, 2010
[31] Y. Zhu, S. Ye and X. Li, “Distributed PageRank computation based on iterative aggregation-disaggregation methods”, in: Proc. of the 14th ACM International Conference on Information and Knowledge Management, pp. 578–585, ACM, Bremen, 2005
[32] K. Sankaralingam, S. Sethumadhavan and J. C. Browne, “Distributed PageRank for P2P systems”, in: Proc. 12th IEEE International Symposium on High Performance Distributed Computing, pp. 58–68, IEEE Computer Society Press, Seattle, 2003
[33] H. Ishii and R. Tempo, “Distributed PageRank computation with link failures”, in: Proc. of the 2009 American Control Conference, pp. 1976–1981, IEEE Control Systems Society, St. Louis, 2009
[34] M. Kubek and H. Unger, “Detecting Source Topics by Analysing Directed Co-occurrence Graphs”, in: Proc. 12th Intl. Conf. on Innovative Internet Community Systems, GI Lecture Notes in Informatics, Vol. P-204, pp. 202–211, Köllen Verlag, Bonn, 2012
[35] J. M. Kleinberg, “Authoritative sources in a hyperlinked environment”, in: Journal of the ACM, Vol. 46, Issue 5, pp. 604–632, ACM, New York, 1999
[36] M. Kubek and H. Unger, “Search Word Extraction Using Extended PageRank Calculations”, in: Autonomous Systems: Developments and Trends, Studies in Computational Intelligence, Vol. 391, pp. 325–337, Springer, Berlin / Heidelberg, 2011
[37] O. Kurland and L. Lee, “PageRank without hyperlinks: Structural re-ranking using links induced by language models”, in: Proc. of the 28th Annual International ACM/SIGIR Conference on Research and Development in Information Retrieval, pp. 306–313, ACM, Salvador, 2005
