Noname manuscript No. (will be inserted by the editor)
Mutual Information based labelling and comparing clusters Rob Koopman · Shenghui Wang
arXiv:1702.08199v1 [cs.IR] 27 Feb 2017
Received: date / Accepted: date
Abstract After a clustering solution is generated automatically, labelling these clusters becomes important to help understanding the results. In this paper, we propose to use a Mutual Information based method to label clusters of journal articles. Topical terms which have the highest Normalised Mutual Information (NMI) with a certain cluster are selected to be the labels of the cluster. Discussion of the labelling technique with a domain expert was used as a check that the labels are discriminating not only lexical-wise but also semantically. Based on a common set of topical terms, we also propose to generate lexical fingerprints as a representation of individual clusters. Eventually, we visualise and compare these fingerprints of different clusters from either one clustering solution or different ones. Keywords Cluster labelling · Normalised mutual information · Visualisation
1 Introduction Identifying thematic structures in science (so-called topics) is the shared goal for all the different methods described in this special issue “Same data, different results?” Every method produced a set of clusters, with each cluster grouping similar or relevant articles together, to reflect certain thematic structures in the same dataset. Comparing different clustering solutions is however a challenge in itself, as reported in [5]. What ever numeric measures we apply, such as shared documents across different solutions, or size distributions, those measures give little insight into the meaning of the differences between clustering solutions. For the R. Koopman OCLC Research, Schipholweg 99, Leiden, The Netherlands Tel.: +31 71 524 6500 E-mail:
[email protected] S. Wang OCLC Research, Schipholweg 99, Leiden, The Netherlands Tel.: +31 71 524 6500 E-mail:
[email protected]
2
Rob Koopman, Shenghui Wang
interpretation of clustering solutions and to relate them to the research fields that people know about, it seems more natural and intuitive if we could assign humanunderstandable labels to describe the content of those clusters or the topics they represent. This paper presents a method to first assign labels and second to compare them at a more abstract yet still meaningful level. Although, the approach has been developed as a part of the “Same data, different results?” collaboration, we believe that it could also be applied in clustering of other objects. It is not straightforward to pick descriptive, human-understandable labels to summarize the content of clusters produced by an automated clustering algorithm [3]. Labelling clusters is often done by those producing them on the basic of common or specific tacit knowledge. When doing automatic clustering of documents, which also contain lexical information in titles, keywords, abstracts, publication venues and alike, using measures based on frequency and co-occurrence of terms come to mind. For example, one could look at the most frequent terms in the bibliographic metadata of the documents belonging to a cluster. Or, one could consider to use terms that occur frequently in the centroid (the middle of a cluster) or the documents that lies closest to the centroid. For this paper we follow another labelling approach, namely to use measures such as mutual information to compare distributions of terms in one cluster with that of other clusters. Those terms are extracted from the lexical information of the documents, in our case, the titles and abstracts of the articles. We call this a differential cluster labelling approach, because it selects those terms which are frequent in one cluster but are not frequent in others as potential labels for this cluster. The advantage of this method is that it is independent of the clustering method, because it only uses the terms extracted from the articles’ title and abstract against the final cluster assignments. Furthermore, it can be applied independently of the availability of assigned keywords or subject headings which are commonly used. Having labels based on significant terms which are human-understandable should contribute to the comparison of clustering solutions. We further extend the labelling approach and identify a set of terms which are most informative for all clustering solutions that we need to compare. We then generate a fingerprint for each cluster by measuring their Normalised Mutual Information against these labels. We order these selected terms as solving the Travelling Salesman Problem [1], resulting a one-dimensional word-space, where each term has a specific coordinate. Now we can visualize the fingerprints of all clusters in terms of the Normalised Mutual Information and compare them using those common label-based coordinates. This gives us direct insight in the qualitative differences between clusters across different clustering solutions. The main goal of this paper is to apply methods based on Normalised Mutual Information to label and compare clusters. In the first part we describe our experiment of using the Normalised Mutual Information to identify labels for individual clusters from different clustering solutions. To test the meaningfulness of the labels we discussed them with a domain expert. The second part of the paper is about comparing clusters based on their label-based fingerprints.
Mutual Information based labelling and comparing clusters
3
2 Mutual information based labelling Mutual Information is a common technique for labelling clusters [3]. The Mutual Information measures how much information the presence/absence of a term t contributes to making the correct clustering decision on a cluster c. In other words, the Mutual Information represents the reduction in uncertainty about the cluster c given the knowledge of the term t. A high Mutual Information between a term t and a cluster c suggests this term describes a large part of the content of this cluster therefore this term could be a candidate for labelling this cluster. Formally, the mutual information between a term t and a cluster c is calculated as follows: I(t, c) =
1 X 1 X i=0 j=0
P (Ti , Uj ) log2
P (Ti , Uj ) P (Ti )P (Uj )
(1)
where T0 indicates an article does not contain the term t and otherwise T1 , U0 indicates an article does not belong to the cluster c and otherwise U1 , P (Ti , Uj ) is the probability that Ti and Uj happen together within one article, P (Ti ) and P (Uj ) are the probabilities that these events happen independently. The probabilities are estimated by dividing the frequency of the observed event (the article contains or does not contain the term t or it is in or not in the cluster c) by the total number of articles. We then normalize this mutual information I(t, c) by dividing over the entropy of cluster c, i.e., I(t, c) N M I(t, c) = 2 × (2) H(t) + H(c) where H(t) = −
1 X i=0
P (Ti ) log2 P (Ti ) and H(c) = −
1 X
P (Uj ) log2 P (Uj ).
(3)
j=0
This Normalised Mutual Information (NMI) score is non-negative. When a term and a cluster are completely independent from each other, the NMI score is 0. When a term only occurs frequently in one cluster but rarely in others, then the NMI score between this term and this cluster is high. When a relatively frequent term has an extraordinarily low occurrence in one cluster, the NMI score between this term and the cluster is also pretty high, indicating that the cluster is not about this term. In order to distinguish the positive and negative associations between terms and cluster, we assign a negative sign to those NMI scores when the occurrence of a term is less than expected.1 For each cluster in a specific clustering solution, we calculate the NMI scores between this cluster and all the 60 thousand topical terms,2 and rank these terms based on their NMI scores. The terms which have the highest NMI scores are good candidates for labelling this particular cluster. 1 If a cluster covers 10% of the total dataset, then the term is expected to occur 10% of its occurrences over the total dataset. 2 The topical terms were extracted from the titles and abstracts of the articles in the Astro dataset. Please refer to [2] for more details.
4
Rob Koopman, Shenghui Wang
Table 1 shows the topical terms with top 10 highest NMI scores for our KMeans clusters [6]. The same method was used to label all the clusters from different solutions described in this special issue. As shown here, this method is independent of how these clusters are generated, as it only uses the information from the articles which are in the clusters. Also these topical terms are extracted automatically from the titles and abstracts of all articles [2]. As said before, this labelling method does not depend on the availability of the pre-assigned keywords or subject headings. It is actually generalisable to label any collections of articles. Because it is a data-driven approach based on lexical information, these topical terms are sometimes only understandable to domain experts, and most likely part of the very specific vocabulary in this domain. Our impression, alternating between articles and the identified labels, was that the labels do reflect potential topics, and do so on a lower level of abstraction or more specific than keywords chosen by authors or librarians. Analysis with an domain expert With the produced labels for the clusters we can now compare clusters based on those labels. Still, there is one problem, we cannot solve, there is no objective ground truth to evaluate the resulting labels. Not being experts in astrophysics, we cannot really judge how meaningful the selected labels are, compared with the content of the clusters. As cross-check, we discussed the labels for one clustering solution, shown in Table 1, with a domain expert. The fact that this expert also has a background in bibliometrics helped in the discussion. We explained the labelling procedure and provided the domain expert with the table, without providing the articles assigned to each cluster. In this discussion, he remarked on some inner logic between the labels and subsequently the clusters. Based on his knowledge of the Astrophysics field, he developed a concept map to order those clusters. He started with six categories of astrophysical objects at different scales: Cosmology, Galaxies, Compact Objects, Stars, Planets, Elementary Particles, plus a general category about Observation Techniques. He used “is part of” relations to draw a categorical backbone of the whole field. He then assigned each cluster to these categories and produced the concept map as shown in Figure 1. Each cluster is connected to one of the main categories by a solid blue arrow, representing a “deals with” relation. This relation indicates that these clusters mostly belong to that particular category. For example, the clusters ok0, ok1 and ok7 is all about Galaxies. Indeed we find terms as galactic or galaxy among the labels for this cluster. Some clusters in the concept map also have an extra dotted red link to another categories, indicating that they are also related to the other category. For example, the cluster ok13 is mostly about Elementary Particles, but is also related to Cosmology. This is not so much a surprise, because as we reported in [6] there are not always clear boundaries between clusters. So it is quite possible that some clusters are actually bridging two major categories. This exercise with one expert is not a representative evaluation. A more systematic evaluation with more experts carefully checking the articles in these clusters is of course more informative. Still this exercise gave us some confidence in the appropriateness and meaningfullness of those automatically generated labels. They were obviously specific, discriminating and at the end informative enough for a domain expert to draw a knowledge landscape about these clusters along the seven categories with good confidence. For us it was encouraging enough to trust
Mutual Information based labelling and comparing clusters Table 1 K-Means cluster labels Cluster ID ok 0
Size 2866
ok 1
2934
ok 2
5449
ok 3
3389
ok 4
5420
ok 5
3874
ok 6
4721
ok 7
4206
ok 8
2195
ok 9
3225
ok 10
4070
ok 11
3409
ok 12
6022
ok 13
5556
ok 14
5583
ok 15
2569
ok 16
1985
ok 17
2465
ok 18
2206
ok 19
1627
ok 20
2020
ok 21
1849
ok 22
4103
ok 23
2071
ok 24
4592
ok 25
2622
ok 26
3593
ok 27
6292
ok 28
4903
ok 29
3176
ok 30
2624
Cluster labels seyfert 1, active galactic, narrow line, agn, broad line, galactic nuclei, quasars, line seyfert, nuclei agns, emission line lens, microlensing, gravitational lens, rotation curve, spiral galaxies, bars, dark matter, barred galaxies, galaxy, pattern speed transit, star, eclipsing binary, radial velocity, hd, planet, corot, photometric, main sequence, type stars spacetimes, black hole, horizon, asymptotically flat, reissner nordstrom, metric, einstein maxwell, spherically symmetric, hole solutions, schwarzschild solar wind, magnetosphere, interplanetary magnetic, magnetic field, auroral, plasma, magnetopause, ion, substorm, spacecraft standard model, higgs, lhc, minimal supersymmetric, supersymmetric standard, neutrino mass, lepton, right handed, hadron collider, electroweak quark, qcd, meson, decays, lattice qcd, pi pi, pion, j psi, form factors, chiral galaxy clusters, dark matter, haloes, cluster, n body, weak lensing, intracluster medium, halo mass, 1 mpc, galaxies blazar, bl lac, jet, radio sources, lac objects, radio galaxies, synchrotron, radio, flat spectrum, 3c yang mills, gauge theory, mills theory, string, supergravity, noncommutative, field theory, supersymmetric, dual, branes globular clusters, fe h, metal poor, red giant, metallicity, giant branch, horizontal branch, galactic globular, color magnitude, stars ray binary, x ray, hard state, ray timing, rossi x, timing explorer, black hole, accretion disk, neutron star, rxte galaxies, star formation, formation rate, deep field, redshift, early type, sample, rest frame, starburst, lyman break quantum gravity, quantum, loop quantum, spacetime, general relativity, scalar field, gravity, quantum cosmology, metric, quantization coronal, active region, solar, flare, magnetic flux, cme, quiet sun, chromosphere, mass ejections, hinode cosmic ray, high energy, gamma rays, tev, hess, ultra high, air showers, tev gamma, extensive air, shower microwave background, cosmic microwave, background cmb, cmb, microwave anisotropy, anisotropy probe, wilkinson microwave, wmap, power spectrum, probe wmap dark energy, quintessence, universe, phantom, f r, cosmological constant, cosmic acceleration, chaplygin gas, modified gravity, accelerated expansion white dwarf, cataclysmic variables, dwarf nova, nova, wd, mass transfer, orbital period, secondary star, cvs, superhumps inflation, slow roll, curvature perturbation, non gaussianity, inflationary models, curvaton, reheating, cosmological perturbations, f nl, primordial gravitational wave, inspiral, ligo, lisa, wave detectors, laser interferometer, waveforms, binary black, space antenna, post newtonian grb, ray burst, gamma ray, afterglow, bursts grbs, swift, prompt emission, prompt, fireball, batse asteroid, comet, body problem, orbits, kuiper belt, main belt, bodies, mean motion, planets, solar system planetary nebulae, asymptotic giant, agb stars, post agb, giant branch, pne, branch agb, agb, central star, mira molecular cloud, protostellar, cloud, c 13, star forming, h 2, molecules, hco, forming regions, massive star ionospheric, winter, summer, degrees n, mesosphere, tec, electron content, iri, ozone, seasonal mars, titan, ice, water, deposits, cassini, co2, methane, atmosphere, surface performance, scientific, technology, mission, astronomical, development, research, flight, cost, software sn, explosion, wolf rayet, type ia, supernova, ejecta, wr, progenitor, eta carinae, lines brown dwarfs, tauri stars, pre main, herbig ae, substellar, circumstellar disks, young, main sequence, disks, low mass pulsar, neutron stars, psr, radio pulsars, anomalous x, magnetar, isolated neutron, soft gamma, millisecond pulsars, axp
5
6
Rob Koopman, Shenghui Wang
Fig. 1 Expert’s concept map again the K-Means clusters (courtesy of Marcus John)
the labels and engage in a comparison exercise between clusters from different solutions.
3 Quantitative comparing clustering solutions by labels Since each cluster in each solution is labelled by its most significant or informative topic terms, it is possible to find the most informative topic terms across all clusters in one solution. Those labels would than represent the whole clustering solution. Therefore, we further extend the formula (Eq. 1) to measure the NMI scores between a whole clustering solution and all the 60K topical terms. For a clustering solution C with m clusters ({c1 , c2 , . . . , cm }), we computer I(t, C) =
m X 1 X k=1 i=0
P (Ti , Uk ) log2
P (Ti , Uk ) P (Ti )P (Uk )
(4)
where t is a topical term, T0 indicates an article does not contain the term t and otherwise T1 , and Uk indicates an article belongs the cluster cj . We normalise it as follows: I(t, C) N M I(t, C) = 2 × (5) (H(t) + H(C)) where, H(t) = −
1 X i=0
P (Ti ) log2 P (Ti ) and H(C) = −
m X i=1
P (Ui ) log2 P (Ui ).
(6)
Mutual Information based labelling and comparing clusters
7
The sign of the final NMI score is also assigned in the same way as described in the previous section. The probabilities are estimated by dividing the frequency of the observed event by the total number of articles which have a cluster assignment. This is different from labelling the individual clusters described in the previous session. The reason is that not all clustering solutions have a full coverage of the whole dataset.3 We only look at the part of the dataset which is covered by the clustering solution and consider that the rest of the dataset do not contribute to the information of the clusters. For a clustering solution, we can now compute topic terms which have the highest NMI scores, i.e, they contain the most information about all the clusters in this solution. Table 2 gives the top ten terms, in the descending order of their NMI scores, for seven clustering solutions described in this special issue. Clustering CWTS-C5 UMSI0 OCLC-31 OCLC-Louvain STS-RG ECOOM-BC13 ECOOM-NLP11
Top ten terms galaxies, stars, x ray, solar, gamma ray, black hole, star formation, redshift, stellar, magnetic field, cosmological galaxies, stars, x ray, solar, black hole, star formation, gamma ray, stellar, redshift, observations galaxies, stars, x ray, gamma ray, black hole, star formation, solar, magnetic field, dark matter, redshift galaxies, x ray, stars, black hole, solar, star formation, gamma ray, redshift, cluster, similar galaxies, stars, solar, observations, similar, x ray, magnetic field, star formation, dark matter, observed galaxies, x ray, stars, dark matter, solar, star formation, black hole, gamma ray, cosmological, magnetic field black hole, gamma ray, galaxies, dark matter, magnetic field, stars, x ray, star formation, black holes, microwave background
Table 2 Top ten most informative terms for different clustering solutions
As Table 2 shows, the most informative terms for these clustering solutions are very similar. That means these methods actually use the similar information to make major clustering decisions, while their differences lie more in how to handle less informative terms which are not listed in this table. The order of these terms, based on their NMI score, is also similar for most of the solutions, with “galaxies” as the most informative term. However, ECOOM-NLP11 ranks “black hole” and “gamma ray” at the top, which is different from the others. And compared to other solutions, “dark matter” ranked relatively higher for the two ECOOM solutions. This qualitative analysis invited us to do a quantitative comparison of individual clusters based on their information content. To be able to compare all the clusters using the same coordinates, we collected the top 50 labels computed from the seven clustering solutions listed in Table 2. We kept all terms occurring in at least two lists of labels for clustering solutions and removed all duplicates. This leads to 61 different labels in total. In a next step we ordered these labels in a way that the sum of the distances between neighbouring labels is the minimal, i.e. similar labels are positioned close to each other. In other words we deal with a Travelling Salesman Problem (TSP) [1]. Distances between labels are calculated based on their vectorial representations in the semantic matrix [2]. Then we apply 3
Please see Table 1 in [2].
8
Rob Koopman, Shenghui Wang
one of the standard algorithms for the TSP [1] and implement a simplified version of the Chained Lin-Kernighan heuristic [4]. In the last step, we re-computed the NMI scores between these labels and all the clusters from all the solutions, using Eq. 2. Eventually each cluster is represented by a 61 dimensional vector, which we call the fingerprint of a cluster. We use this vector as global lexical coordinate system, based on which we can now compare individual clusters, within a solution or across solutions, in a more visual way. In Figure 2, we visualise selected K-Means and Louvain clusters in terms of their fingerprints. The selection of these clusters is based on the fact that their most informative labels have a very high absolute NMI score. Given the fact that most of the labels have a very low or even negative NMI values, the high peaks at a very few and sometimes unique labels are very informative about what these clusters are about. For example, we are almost certain that ok19 is about “inflation,” ok18 is about “white dwarf,” and etc. Actually, the 7 clusters in the upper figure for each solution, for example, ok19, ok18, etc. for K-Means, have more distinguishing labels than those in the figure below, i.e, their highest NMI scores are higher than those in the figure below. We are more certain about the topics of these clusters. The clusters in the lower figures for both solutions have relatively less distinguishing (multiple peaks with lower NMI scores) yet still pretty informative labels. It is also possible to group the individual clusters from different solutions based on their fingerprints. Figure 3 shows two groups of clusters. In Figure 3 (a), four clusters from four clustering solutions have highly similar fingerprints and they all have one single focus: “grb” (gamma ray burst). Looking at these four clusters more carefully, they share a large overlap in terms of articles they contain. The average Jaccard similarity coefficient4 among them is 0.64. In Figure 3 (b) the seven clusters also have very similar but more complicated fingerprints, with the top three peaks at the “galaxy,” “redshift” and “active galactic.” Their average Jaccard similarity coefficient is 0.47. It is not surprising to see that the fingerprints calculated from the NMI scores are consistent with the simple set-based similarity between clusters. If two clusters overlap more, then their fingerprints are more similar. Actually, after applying a simple Affinity Propagation clustering algorithm5 over these fingerprints of clusters from different clustering solutions, there are 22 “clusters of clusters,” including the two shown in Figure 3. This may suggest that there exist core articles who have a tendency to always cluster together and are recognised by different methods, in other words, they are prototypical articles which clearly represent certain topics agreed by different methods, while other articles are in between topics, forming the fuzzy boundaries among topics. Different methods have different ways of deciding the boundaries, which makes the study in this special issue interesting. More comparison between different methods can be found in [5].
4
https://en.wikipedia.org/wiki/Jaccard_index We applied the Python package provided at http://scikit-learn.org/stable/modules/ generated/sklearn.cluster.AffinityPropagation.html with all default parameter settings. Further investigation with this clustering exercise is out of the scope of this paper. 5
NMI 5
10
5 x ray xmm newton ray emission pulsar neutron star white dwarf main sequence metallicity photometric star stellar circle dot mass gas disk dust star forming galaxy clusters dark matter cosmological power spectrum cosmic microwave microwave background dark energy universe redshift survey sample optical observed emission radio active galactic black hole gravitational wave spacetime scalar field inflation standard model decays quark qcd ray burst grb gamma ray high energy cosmic ray solar wind ionospheric spacecraft mars planetary planets earth surface solar coronal magnetic plasma energy
NMI 10
x ray xmm newton ray emission pulsar neutron star white dwarf main sequence metallicity photometric star stellar circle dot mass gas disk dust star forming galaxy clusters dark matter cosmological power spectrum cosmic microwave microwave background dark energy universe redshift survey sample optical observed emission radio active galactic black hole gravitational wave spacetime scalar field inflation standard model decays quark qcd ray burst grb gamma ray high energy cosmic ray solar wind ionospheric spacecraft mars planetary planets earth surface solar coronal magnetic plasma energy
NMI
NMI
Mutual Information based labelling and comparing clusters
60
0
0 10
10 20
50
20
30
30
9
OCLC - KMeans
50
40
30 ok 16 ok 17 ok 18 ok 19 ok 20 ok 21 ok 30
20
10
0
30 40
25 40
20
10 15
Fig. 2 Visual comparison between clusters’ labelling fingerprints 50
20 25
15 ok 0 ok 4 ok 5 ok 6 ok 12 ok 14 ok 26
40
30
20
50
60
10
0 5
OCLC - Louvain ol 4 ol 9 ol 11 ol 18 ol 23 ol 27 ol 29
10
0
ol 7 ol 8 ol 15 ol 20 ol 24 ol 25 ol 26
60
5
0
Rob Koopman, Shenghui Wang
grb
10
cti c
ift
ac
tiv eg
ala
sh re d
ga
lax
y
(a) clusters about “grb”
(b) clusters about “galaxy” Fig. 3 Visual comparison between clusters
Mutual Information based labelling and comparing clusters
11
4 Conclusion In this paper, we showed that using Normalized Mutual Information between clusters and topical terms extracted from titles and abstracts is an effective way to identify important topical terms to describe clusters. It is a data driven approach which can clearly scale, and has the advantage that the process is independent from clustering methods, and so probably more objective than human judgement. The discussion with a domain expert also showed that these labels represent information he could interact with and which related to his own understanding of the field. The chosen labels are meaningful and useful in follow-up human interpretation and ordering. However, having said this, other labels chosen by other techniques might have a similar function. The aim of this exercise was not to find the best labels, but labels which can claim some representativeness and meaningfulness, and labels which enables further semantic interpretation. Once a selection of labels is determined to be lexical reference or lexical coordinates, we can compare different clustering solutions at a global level and also to map single clustering solutions against each other. We showed how such common label-based coordinates can be used to visually compare different clusters or generate “clusters of clusters”. This way of visual comparison is intuitive and straightforward. It is surprising yet understandable that different clustering methods, despite their differences in data models or algorithms, do share a fair amount of terms which are most informative about their clustering results. The most important result of this paper is a method that uses NMI measures based on lexical information from the documents (articles) which are clustered. We have shown that such a lexical based comparison complements the comparison of clusters in LittleAriadne [2]. Its findings are more or less consistent with other methods of comparison [5].
Acknowledgement Part of this work has been funded by the COST Action TD1210 Knowescape. We would like to thank Marcus John, who functioned as domain expert, for his valuable analysis of the cluster labels. We would like to thank Michael Heinz for discussions on the method.
References 1. Applegate, D.L., Bixby, R.E., Chvatal, V., Cook, W.J.: The Traveling Salesman Problem: A Computational Study (Princeton Series in Applied Mathematics). Princeton University Press, Princeton, NJ, USA (2007) 2. Koopman, R., Wang, S., Scharnhorst, A.: Contextualization of topics – browsing through the universe of bibliographic information. In: J. Gl¨ aser, A. Scharnhorst, W. Gl¨ anzel (eds.) Same data – different results? Towards a comparative approach to the identification of thematic structures in science, Special Issue of Scientometrics (2017). DOI 10.1007/s11192-017-2303-4 3. Manning, C.D., Raghavan, P., Sch¨ utze, H.: Introduction to Information Retrieval. Cambridge University Press (2008) 4. Martin, O., Otto, S.W., Felten, E.W.: Large-step markov chains for the traveling salesman problem. Complex Systems 5, 299–326 (1991)
12
Rob Koopman, Shenghui Wang
5. Velden, T., Boyack, K., van Eck, N., Gl¨ anzel, W., Gl¨ aser, J., Havemann, F., Heinz, M., Koopman, R., Scharnhorst, A., Thijs, B., Wang, S.: Comparison of topic extraction approaches and their results. In: J. Gl¨ aser, A. Scharnhorst, W. Gl¨ anzel (eds.) Same data – different results? Towards a comparative approach to the identification of thematic structures in science, Special Issue of Scientometrics (2017). DOI 10.1007/s11192-017-2306-1 6. Wang, S., Koopman, R.: Clustering articles based on semantic similarity. In: J. Gl¨ aser, A. Scharnhorst, W. Gl¨ anzel (eds.) Same data – different results? Towards a comparative approach to the identification of thematic structures in science, Special Issue of Scientometrics (2017). DOI 10.1007/s11192-017-2298-x