Fundamenta Informaticae 119 (2) (2012) 319–336
DOI 10.3233/FI-2012-740
IOS Press

Unsupervised Similarity Learning from Textual Data*

Andrzej Janusz†, Dominik Ślęzak‡, Hung Son Nguyen
Faculty of Mathematics, Informatics and Mechanics, University of Warsaw
Banacha 2, 02-097 Warszawa, Poland
[email protected]; [email protected]; [email protected]

Abstract. This paper presents research on the construction of a new unsupervised model for learning a semantic similarity measure from text corpora. The two main components of the model are a semantic interpreter of texts and a similarity function whose properties are derived from data. The first component associates particular documents with concepts defined in a knowledge base corresponding to the topics covered by the corpus. It shifts the representation of the meaning of the texts from potentially ambiguous words to concepts with predefined semantics. With this new representation, the similarity function is derived from data using a modification of the dynamic rule-based similarity model, adjusted to the unsupervised case. The adjustment is based on a novel notion of an information bireduct, which has its origin in the theory of rough sets. This extension of classical information reducts is used to find diverse sets of reference documents described by diverse sets of reference concepts that determine different aspects of the similarity. The paper explains the general idea of the approach and gives some implementation guidelines. Additionally, results of preliminary experiments are presented in order to demonstrate the usefulness of the proposed model.

Keywords: similarity learning, semantic similarity, text mining, feature extraction, bireducts

* The authors are supported by the grants N N516 077837 and 2011/01/B/ST6/03867 from the Ministry of Science and Higher Education of the Republic of Poland, and by the National Centre for Research and Development (NCBiR) under the grant SP/I/1/77065/10 within the strategic scientific research and experimental development program "Interdisciplinary System for Interactive Scientific and Scientific-Technical Information".
† Address for correspondence: University of Warsaw, Banacha 2, 02-097 Warszawa, Poland
‡ Also affiliated with: Infobright Inc., Krzywickiego 34 lok. 219, 02-078 Warsaw, Poland

1. Introduction

The selection of an appropriate similarity measure for a given task has always been a challenge to researchers in fields related to data mining and information retrieval. Although numerous similarity functions have been proposed, none of them is general enough, and different applications require different functions to be used. In practice, a similarity measure which is reasonable for a given task is often chosen by experts but, due to the complexity of the similarity evaluation problem, this choice is rarely optimal. On the other hand, many psychologists and cognitive scientists have studied general properties of similarity in order to develop models capturing the rules that govern the assessment of resemblance between objects [26, 6]. Numerous experiments conducted by those scientists suggest that for humans, the decision whether two stimuli are similar is individual and strongly depends on factors such as the domain, the context and the personal experience of the judge.

In this paper we propose a model for learning, from available data, a similarity function that can be used to evaluate semantic resemblance of texts. In our approach, we would like to combine the experience of psychologists with mathematical tools for data analysis. As a result, we want to construct a similarity measure which, on the one hand, possesses psychologically plausible properties and, on the other, is effective in practical applications such as document clustering, classification or information retrieval.

As a starting point we have chosen our previous research on the Dynamic Rule-Based Similarity (DRBS) model [9]. It can be regarded as an application of the idea behind the feature contrast model, proposed by Tversky [26], to the problem of learning an appropriate similarity measure for a given classification task. DRBS utilizes notions from rough set theory [13] to approximate a set of objects which are similar to a query object in the context of classification. In this research we present an adaptation of DRBS to the unsupervised case. In order to learn aggregations of the basic attributes of texts represented by the so-called bag-of-words (see Section 3.1), we utilize methods for the extraction of semantic features, such as Explicit Semantic Analysis (ESA) [4].

In the course of this paper we discuss and use a few types of document representations: as a vector of weights expressing a relation to individual words, as a vector of automatically generated association strengths to predefined concepts from a domain knowledge base, and as a set of the most important concepts that can be treated as binary features in Tversky's model. Finally, we also represent documents by sets of labels assigned by experts, which define their topical context and are used for the evaluation of the models.

We employ a novel rough set notion, called an information bireduct, to select diverse sets of documents described by diverse sets of higher-level features. The resulting sets can be regarded as different aspects of similarity which are meaningful for a given text corpus, considering the available knowledge base. A collection of bireducts can also be seen as a representation of agents who, within the proposed model, try to jointly assess the similarity of the compared texts.

In the following section we briefly overview the existing approaches to issues related to learning a similarity measure from data. In Section 3, we introduce some basic notation and discuss the building blocks of our model. Then, in Section 4 we show how the unsupervised RBS is constructed, and in Section 5 we evaluate the model by comparing it to the state of the art in a few practical application tests. Finally, in Section 6 we draw directions for future research.

2. Related Works

To overcome the issues related to the problem of selecting an appropriate similarity measure for a given task, some techniques for learning similarity from data have been developed [24, 28]. Instead of relying on globally defined measures, those methods try to discover proper local similarity functions and aggregate them in a way that reflects dependencies between objects from a particular dataset. For this purpose, they often make use of basic properties of a similarity relation in a given context. For example, the Dynamic Rule-Based Similarity (DRBS) model [9] utilizes the fact that two objects from different decision classes are unlikely to be similar. Other models restrain the search space to some classes of similarity functions (e.g. distance-based measures [24, 28]).

One application of similarity models is the clustering task. Unlike in the supervised classification case, where the context for similarity is defined by a decision attribute, clustering requires the evaluation of resemblance in a general setting. Depending on the domain of objects, the context for similarity in the clustering task can be, e.g., the "general appearance" of physical objects or the "meaning" of texts. Such semantic ambiguity makes the selection of an appropriate similarity measure even more challenging. Currently, only a few similarity learning models are able to cope with this problem. Classical clustering algorithms utilize distance measures to compute dissimilarities between objects. By doing so, they enforce some potentially undesired properties on the resulting similarity relation and make it reflect the human perception of similar objects less faithfully. This drawback is especially important for grouping semantically related documents within text corpora for information retrieval purposes [7, 16].

Many methods utilizing different similarity measures have been developed for the purpose of textual data clustering. The most popular ones are modifications of the standard k-means algorithm which use spherical distance measures such as the cosine distance [3, 11]. Those approaches are usually based on a bag-of-words representation of documents (see Section 3.1), in which each text is represented by a vector of weights assigned to the unique terms that it contains. There have been many attempts to extend the bag-of-words representation in order to capture the semantics of texts more accurately. This has been done by, for example, including tags derived from the linguistic analysis of a natural language or considering N-grams of words [3]. A different method has also been developed within rough set theory [13, 14]: in the tolerance rough set model, the bag-of-words representation is extended by considering the upper approximation of texts [8, 12]. Other approaches, such as Latent Semantic Analysis [2], try to model the relatedness of artificial concepts to the analysed texts.

There are also methods which make use of additional knowledge sources, such as domain ontologies or encyclopaedias, in order to assess semantic similarity between documents more accurately. For example, in [16] an ontology is used to forge better estimates of the relatedness of particular terms in texts. In [4] the Explicit Semantic Analysis (ESA) model is described, in which documents are represented by their associations with concepts explicitly defined in an external knowledge base. This particular representation of texts is utilized in the presented similarity learning model and will be called the bag-of-concepts (see Section 3.1).
Although the bag-of-concepts representation can be computed using several different techniques, the unsupervised similarity learning model proposed in this paper utilizes the ESA. It treats the associated concepts as basic features that make it possible to express the notion of similarity between documents. Therefore, from the rough set perspective, the presented approach can be seen as a method for finding an approximation space that is suitable for approximating a similarity relation [19, 20]. It can also be regarded


as a rough set application of the psychologically plausible contrast model [26], which is discussed in the following sections.

In order to select relevant and maximally independent higher-level features, which are used in Tversky's contrast model, our similarity learning method employs the novel notion of information bireducts [22]. Bireducts describe approximate dependencies in data, like many types of approximate reducts [21, 10]. However, unlike other methods, they allow more natural control over approximation areas by explicitly indicating which objects need to be discerned. Due to this property, the experiments conducted so far suggest that bireducts usually lead to more diverse results [22]. Additionally, in the context of similarity learning, different information bireducts can be intuitively interpreted as independent agents who assess the resemblance of the considered objects.

3. Basic Notions

The main idea of the model that we propose is to utilize a variant of Tversky's psychologically plausible contrast model of similarity [26] for the evaluation of resemblance between semantic representations of documents. The semantic features that are taken for the contrast model are defined and aggregated analogously to the DRBS approach [9] – they can be derived from the text corpus using methods such as the ESA [4] and information bireducts [22]. In this section we explain some basic notions and ideas which we later use in the construction of the unsupervised RBS model.

3.1. Explicit Semantic Analysis

Any document can be represented by predefined concepts which are related to the information that it carries (its semantics). Those concepts can then be treated as features describing documents. The process of choosing concepts that describe documents in the most meaningful way can be regarded as a feature extraction task. In the ESA, natural language definitions of concepts from an external knowledge base, such as an encyclopaedia or an ontology, are matched against documents to find the best associations. The scope of the knowledge base may be general (as in the case of Wikipedia) or it may be focused on a domain related to the investigated text corpus, e.g. Medical Subject Headings (MeSH)¹ [27]. The knowledge base may contain additional information on relations between concepts, which can be utilized during the computation of the "concept-document" association levels. Otherwise it is regarded as a regular collection of texts, with each concept definition treated as a separate document.

¹ MeSH is a controlled vocabulary and thesaurus created and maintained by the United States National Library of Medicine. It is used to facilitate searching in life sciences related article databases.

The associations between concepts from a knowledge base and documents from the corpus are treated as indicators of their relatedness. They are computed in two steps. First, after the initial processing (stemming, stop word removal, identification of terms), the corpus and the concept definitions are converted to the bag-of-words representation, in which each of the unique terms in the texts is given a weight expressing its association strength. Assume that after the initial processing of a corpus consisting of M documents, D = {T_1, ..., T_M}, there have been identified N unique terms (e.g. words, stems, N-grams) w_1, ..., w_N. Any text T_i in the corpus D can be represented by a vector ⟨v_1, ..., v_N⟩ ∈ ℝ_+^N, where each coordinate v_j expresses a value of some relatedness measure for the j-th term in the vocabulary (w_j), relative to this document. The most common measure used to calculate v_j is the tf-idf (term frequency-inverse document frequency) index (see [3]), defined as:

$$v_j = tf_{i,j} \cdot idf_j = \frac{n_{i,j}}{\sum_{k=1}^{N} n_{i,k}} \cdot \log\left(\frac{M}{|\{i : n_{i,j} \neq 0\}|}\right), \qquad (1)$$

where n_{i,j} is the number of occurrences of the term w_j in the document T_i.

In the second step, the bag-of-words representation of the concept definitions is transformed into an inverted index which maps words to lists of the K concepts described in the knowledge base, c_1, ..., c_K. The inverted index is then used to perform a semantic interpretation of the documents from the corpus. For each text, the semantic interpreter iterates over the words that it contains, retrieves the corresponding entries from the inverted index and merges them into a vector of concept weights (association strengths) that represents the given text. Let W_i = ⟨v_j⟩_{j=1}^{N} be a bag-of-words representation of an input text T_i, where v_j is a numerical weight of a word w_j expressing its association with the text T_i (e.g. its tf-idf). Let inv_{j,k} be an inverted index entry for w_j, where inv_{j,k} quantifies the strength of association of the term w_j with a knowledge base concept c_k, k ∈ {1, ..., K}. The new vector representation of T_i is calculated as:

$$\Big\langle \sum_{j : w_j \in T_i} v_j \cdot inv_{j,k} \Big\rangle_{k=1}^{K}. \qquad (2)$$

This new vector will be called a bag-of-concepts representation of a text T_i and will be denoted by C_i = ⟨c_1(T_i), ..., c_K(T_i)⟩, where c_k(T_i) = Σ_{j : w_j ∈ T_i} v_j · inv_{j,k} is the numerical association strength of the k-th concept to the document T_i. In the later sections we will also use a set representation of texts, which for the document T_i is the set of concepts with sufficiently high association levels, denoted by F_i = {f_k : c_k(T_i) ≥ minAssoc_k}. Those concepts will be treated as binary semantic features of texts. If the utilized knowledge base contains additional information on semantic dependencies between the concepts, this knowledge can be used to further adjust the vector (2). Moreover, if experts can provide feedback in the form of manually labeled exemplary documents, some supervised learning techniques can also be employed in that task. However, particular methods for the automatic tagging of textual data are beyond the scope of this research.
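To make the two steps above concrete, the following sketch shows one way formulas (1) and (2) can be implemented. It is an illustrative Python rendering, not the implementation used in our experiments; the toy documents, the inverted index entries and all identifiers are hypothetical.

    import math
    from collections import defaultdict

    def tf_idf(corpus_tokens):
        # tf-idf weights (formula (1)) for a corpus given as lists of terms
        M = len(corpus_tokens)
        df = defaultdict(int)              # number of documents containing a term
        for tokens in corpus_tokens:
            for w in set(tokens):
                df[w] += 1
        bags = []
        for tokens in corpus_tokens:
            counts = defaultdict(int)
            for w in tokens:
                counts[w] += 1
            total = len(tokens)            # the sum of n_{i,k} over all terms
            bags.append({w: (n / total) * math.log(M / df[w])
                         for w, n in counts.items()})
        return bags                        # bag-of-words: term -> weight v_j

    def bag_of_concepts(bag_of_words, inverted_index, K):
        # semantic interpretation (formula (2)): merge the inverted index
        # entries of the words occurring in a document into K concept weights
        c = [0.0] * K
        for w, v in bag_of_words.items():
            for k, inv in inverted_index.get(w, []):   # pairs (concept id, inv_{j,k})
                c[k] += v * inv
        return c

    # hypothetical usage: two toy documents and a two-concept inverted index
    docs = [["gene", "protein", "gene"], ["protein", "therapy"]]
    inv_idx = {"gene": [(0, 0.9)], "protein": [(0, 0.4), (1, 0.6)],
               "therapy": [(1, 0.8)]}
    print(bag_of_concepts(tf_idf(docs)[0], inv_idx, K=2))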

3.2. Tversky's Similarity Model

In 1977, Amos Tversky, influenced by the results of his experiments on human perception of similarity, came up with the contrast model of similarity [26]. He argued that the distance-based approaches are not appropriate for modeling a similarity relation due to constraints imposed by the mathematical properties of distance measures. He also noticed that the similarity between objects depends not only on their common features but also on the features that are considered distinct. Such features may be interpreted as arguments for or against the similarity. He proposed the following formula to evaluate the similarity of the compared stimuli:

$$Sim(x, y) = \theta f(X \cap Y) - \big(\alpha f(Y \setminus X) + \beta f(X \setminus Y)\big), \qquad (3)$$


where X and Y are the sets of binary features of the instances x, y, f is an interval scale function and the non-negative constants θ, α, β are the parameters. In Tversky's experiments f usually corresponded to cardinality (|·|). Depending on the values of θ, α, β, the contrast model may have different characteristics, e.g., for α ≠ β the model is not symmetric. In formula (3), θf(X ∩ Y) can be interpreted as corresponding to the strength of arguments for the similarity of x to y, whereas αf(Y \ X) + βf(X \ Y) may be regarded as the strength of arguments against the similarity. Using this model, Tversky was able to create similarity rankings of simple geometrical objects which were consistent with evaluations made by humans.

The bag-of-concepts representation discussed in Subsection 3.1 makes it possible to examine relations between concepts and documents, as well as to identify and filter key concepts for given documents in a corpus. The concepts with the strongest association levels with texts seem to be perfect semantic features with which to instantiate Tversky's contrast model. Usually, texts are represented by associations with the words which they contain. Due to the large number of possible words and their ambiguity, it is not clear whether a particular word is truly relevant for expressing the semantics of a document. Even if two documents share the same word, its meaning can be different and, as a consequence, it should not be regarded as a common feature. On the other hand, different words can be related, expressing the same semantic entity. This is why semantic similarity of documents should be considered in terms of well-defined concepts, not just words.

The semantic representations of documents can be used to define an information system S = (D, F), where D is a set of objects (in our case these are documents) and F is a set of their attributes – concept-related features describing the documents [13]. For example, F may be the set of all possible semantic features of documents from the corpus D, i.e. F = ⋃_{i=1}^{|D|} F_i. An information system may here be seen as a tabular representation of knowledge about the considered universe of documents. In this table, the i-th row corresponds to a document T_i and indicates which of the concepts are related to this text (rows in this representation are exact mappings of the set representation mentioned in the previous section, thus we denote attributes from F as f_k, k = 1, ..., K). Other document representation types (e.g. bag-of-words or vectors of concept association strengths) can also be stored in the form of information systems [3, 11].

Altogether, the combination of the ESA with equation (3), delivered within a simple framework of information systems, should provide us with a very intuitive model of learning and representing semantic similarities between documents. On the other hand, the issue remains of tuning the parameters θ, α, β, as well as of choosing appropriate subsets of concept-based features in order to avoid the curse of dimensionality. Moreover, as is widely known in, e.g., the clustering literature [1], the ways of measuring similarity may differ across some subspaces of the considered universe. In our case, this means that the choice of parameters and of the most meaningful features may not be unique with respect to both the universe D of considered documents and the set F of all considered concept-based attributes.
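For illustration, formula (3) with f taken as set cardinality can be instantiated directly on such binary feature sets. The sketch below is a minimal Python rendering; the feature sets and the parameter values are arbitrary examples, not values prescribed by the model.

    def tversky_similarity(X, Y, theta=1.0, alpha=0.5, beta=0.5):
        # contrast model (formula (3)) with f = |.| (set cardinality)
        common = len(X & Y)        # arguments for the similarity of x to y
        only_y = len(Y - X)        # arguments against, weighted by alpha
        only_x = len(X - Y)        # arguments against, weighted by beta
        return theta * common - (alpha * only_y + beta * only_x)

    # with alpha != beta the measure is not symmetric:
    X, Y = {"f1", "f2", "f3"}, {"f2", "f3", "f4", "f5"}
    print(tversky_similarity(X, Y, alpha=0.9, beta=0.1))  # 2 - (0.9*2 + 0.1*1) = 0.1
    print(tversky_similarity(Y, X, alpha=0.9, beta=0.1))  # 2 - (0.9*1 + 0.1*2) = 0.9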

3.3. Information Bireducts

The problem of learning a similarity relation from data involves working with imprecise concepts, and it may be well handled in the framework provided by rough set theory, particularly based on two of its key notions: a discernibility relation and an information reduct [18]. We say that two objects T_1, T_2 ∈ D can be discerned in an information system S = (D, F) if and only if there exists at least one f′ ∈ F for which the values f′(T_1) and f′(T_2) are sufficiently different (e.g. T_1 and T_2 are discernible in the classical sense when f′(T_1) ≠ f′(T_2)). Moreover, we say that B ⊆ F is


an information reduct for S = (D, F), if and only if it is an irreducible subset of attributes such that each pair of objects which is discerned by F is also discerned by B. In [22] we extended the notion of an information reduct to an information bireduct, which is a pair (B, X) consisting of a subset of attributes B and a subset of objects X, defined as:

Definition 3.1. (Information bireduct)
Let S = (D, F) be an information system. A pair (B, X), where B ⊆ F and X ⊆ D, is called an information bireduct, iff B discerns all pairs of objects in X and the following properties hold:

1. There is no proper subset C ⊊ B such that C discerns all pairs in X;
2. There is no proper superset Y ⊋ X such that B discerns all pairs in Y.

Information bireducts describe non-extendible subsets of objects that are discernible using irreducible subsets of attributes. One may say that information bireducts correspond to the most irregular, informative areas in a dataset. Due to this characteristic, the exploration of a diverse set of information bireducts can be particularly useful for defining different aspects of similarity between objects.

Algorithm 1: Calculation of an information bireduct of S = (D, F)
Input: a permutation σ : {1, ..., M + K} → {1, ..., M + K}, K = |F|, M = |D|
Output: (B, X), B ⊆ F, X ⊆ D

    B ← F; X ← ∅
    for i = 1 to M + K do
        if σ(i) ≤ K then
            if B \ {f_σ(i)} discerns all pairs in X then
                B ← B \ {f_σ(i)}
        else
            if B discerns all pairs in X ∪ {T_σ(i)−K} then
                X ← X ∪ {T_σ(i)−K}
    return (B, X)

In [22] we also proposed an algorithm for constructing decision bireducts that uses a random permutation of attributes and objects. This algorithm can be easily adjusted to the case of information bireducts (Algorithm 1). The randomization of the bireduct calculation algorithm guarantees that its multiple executions result in a diverse set of bireducts. Experiments presented in [22] for decision bireducts show that such a set corresponds to a much better coverage of the universe of objects under scope. The same can be stated for information bireducts. This is crucial for addressing the problems identified at the end of Subsection 3.2, i.e., to unbias the model with regard to hidden relations between the concepts, occurring potentially for some unknown subsets of the considered universe of documents.
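Algorithm 1 translates almost line by line into code. The sketch below is our illustrative Python rendering under the classical discernibility semantics (two objects are discerned by an attribute set iff their value tuples differ); the tabular data layout is an assumption made for the example.

    import random

    def discerns(attrs, objects, data):
        # True iff the attribute set discerns every pair of the given objects;
        # data[i][a] is the value of attribute a on object i
        seen = set()
        for i in objects:
            key = tuple(data[i][a] for a in attrs)
            if key in seen:                    # two objects identical on attrs
                return False
            seen.add(key)
        return True

    def information_bireduct(data, num_attrs, seed=None):
        # Algorithm 1: compute an information bireduct (B, X) from a random
        # permutation of attribute and object indices
        M, K = len(data), num_attrs
        sigma = list(range(M + K))
        random.Random(seed).shuffle(sigma)
        B, X = set(range(K)), set()
        for s in sigma:
            if s < K:                                   # an attribute index
                if discerns(sorted(B - {s}), X, data):
                    B.discard(s)                        # redundant for discerning X
            else:                                       # an object index
                if discerns(sorted(B), X | {s - K}, data):
                    X.add(s - K)                        # still discernible by B
        return B, X

Running such a function repeatedly with different seeds is what yields the diverse collection of bireducts required later in the model.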

4. The Model Design

In this section we show how the intuition behind Tversky's model can be used for the evaluation of semantic similarity of scientific articles. The construction of the model starts with assigning concepts from a chosen knowledge base to a training corpus of documents. This can be done in an automatic fashion with the use of methods such as the ESA. The associations with the key concepts assigned to the documents can be transformed into binary features and are therefore suitable for use with the contrast model of similarity. However, a direct application of this model would not take into consideration data-based relations between concepts from the knowledge base and a potentially different meaning of those relations for different documents. The problem of finding appropriate values of the parameters would also remain unsolved. Below, we explain how to use the methods outlined in Section 3 to overcome those issues. We call this approach the unsupervised RBS model, since it can be seen as a continuation of the research on Dynamic Rule-Based Similarity described in [9].

4.1. Bireduct-Based Commonality

The main intuition behind Tversky's contrast model is that the resemblance between two objects should be considered in terms of arguments for and against their similarity. Those arguments should be expressed in terms of higher-level features which can better reflect important properties or the semantics of the objects. Let F be the set of all possible semantic features of texts from a corpus D, and let F_i be the set of the most important concepts related to the document T_i, so that F = ⋃_{i=1}^{|D|} F_i. The concepts from F_i can be assigned by experts or in an automatic fashion, e.g. by taking the ontological entities which were given the highest weights by methods such as the ESA. If we treat the concepts from F as features, we can construct an information system S = (D, F), as explained in Section 3.2. An example of such a system is shown in Table 1.

Table 1. An information system S with three exemplary bireducts.

          f1   f2   f3   f4   f5   f6   f7   f8   f9   f10
    T1     1    1    1    0    0    0    0    0    1    1
    T2     0    1    1    1    1    0    1    0    0    1
    T3     1    1    0    0    0    0    1    0    0    0
    T4     0    0    0    0    1    0    0    1    0    0
    T5     1    0    1    0    1    1    0    0    0    0
    T6     1    0    1    0    0    0    0    0    0    0
    T7     0    1    1    0    0    1    1    1    0    0
    T8     0    0    0    0    1    1    1    1    1    0
    T9     1    1    0    0    1    0    0    0    1    0

    Exemplary information bireducts:
    BR1 = ({f2, f3, f8, f9}, {T1, T2, T3, T4, T6, T7, T8, T9})
    BR2 = ({f3, f5, f7, f9}, {T1, T2, T3, T4, T5, T6, T7, T8, T9})
    BR3 = ({f1, f2, f5}, {T2, T3, T5, T6, T7, T8, T9})

We can say that two documents described by this table have a common feature if they both have the value 1 in the corresponding column (e.g. the documents T1 and T2 have three common features: f2, f3 and f10). In many practical applications, the numerical values of the association strength between a concept and a document may be discretized into more than two intervals in order to model their bond more precisely. In that case, it is reasonable to define a few binary features that represent consecutive intervals and remain dependent, in the sense that if a feature is "highly related" to a document then it is also "weakly related", but not the opposite. Such an approach is popular in Formal Concept Analysis [5] and makes it possible to model the psychological phenomenon that simpler objects are usually more similar to more complex ones than the other way around.

In order to find out which combinations of independent concepts comprise the informative aspects of similarity, we could compute information reducts of S [18] or, to obtain more compact and robust subsets of F, some form of approximate reducts [21, 10]. However, to ensure the stability of the model and to limit its bias toward both common concepts and common objects of negligible importance, we suggest working with information bireducts [22]. For each bireduct BR = (B, X), B ⊆ F, X ⊆ D, we can define a commonality relation in D with regard to BR. One example of such a relation is τ|_BR, which is defined as follows:

$$(T_i, T_j) \in \tau|_{BR} \iff T_j \in X \,\wedge\, |F_{i|BR} \cap F_{j|BR}| \geq p, \qquad (4)$$

where p > 0, T_i, T_j ∈ D and F_{i|BR} is the representation of T_i restricted to the features from B. Intuitively, two documents are in the commonality relation τ|_BR if and only if one of them is covered by the bireduct BR and they have at least p common concepts. We will denote the commonality class of a document T with regard to BR by I(BR, T). For instance, if we consider the information system from Table 1 and the commonality relation defined by formula (4) with p = 2, then I(BR1, T1) = {T1, T2, T7, T9} and I(BR1, T5) = ∅.

It is important to realize that the commonality of two objects is conceptually different from indiscernibility. For example, the documents T5 and T6 are indiscernible with regard to the features from the bireduct BR1, but they are not in the commonality relation, since they have only one feature from BR1 in common, namely f3.

From the practical point of view, a similarity model needs a functionality which allows it to be applied to the analysis of new documents. Typically, we would like to assess their similarity to the known documents (those available during the learning phase) in order to index, classify or assign them to some cluster. For this reason, in the definition of the commonality relation only T_j needs to belong to X. This also makes it more convenient to utilize information bireducts that explicitly define the set of reference cases, for which the comparison with the new ones is well defined.
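The commonality classes from this example can be reproduced with a short script. The sketch below is illustrative Python code in which the documents of Table 1 are encoded as the sets of features at which they take the value 1.

    docs = {  # feature sets of the documents from Table 1
        "T1": {"f1", "f2", "f3", "f9", "f10"},
        "T2": {"f2", "f3", "f4", "f5", "f7", "f10"},
        "T3": {"f1", "f2", "f7"},
        "T4": {"f5", "f8"},
        "T5": {"f1", "f3", "f5", "f6"},
        "T6": {"f1", "f3"},
        "T7": {"f2", "f3", "f6", "f7", "f8"},
        "T8": {"f5", "f6", "f7", "f8", "f9"},
        "T9": {"f1", "f2", "f5", "f9"},
    }

    def commonality_class(doc, bireduct, p=2):
        # I(BR, T): the documents covered by the bireduct that share at least
        # p features with T on the bireduct's attributes (formula (4))
        B, X = bireduct
        f_doc = docs[doc] & B              # representation restricted to B
        return {t for t in X if len(f_doc & docs[t] & B) >= p}

    BR1 = ({"f2", "f3", "f8", "f9"},
           {"T1", "T2", "T3", "T4", "T6", "T7", "T8", "T9"})
    print(sorted(commonality_class("T1", BR1)))   # ['T1', 'T2', 'T7', 'T9']
    print(sorted(commonality_class("T5", BR1)))   # []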

4.2. Tversky's Approach Revisited

Having selected bireducts which correspond to different contexts or points of view for the evaluation of document resemblance, we may use the idea expressed in the contrast model formula (3) to assess the similarity between two given texts. Obviously, we could define many different similarity functions but, due to its psychological plausibility, Tversky's approach can better fit human-made judgements. This aspect is especially important for a similarity model whose main application will be the semantic information retrieval task [7, 16] or semantic clustering [25]. This property may also allow for better interactions with human experts, for whom the resulting model can be much more intuitive. However, in large-scale applications it is unrealistic to expect that experts or end-users will be able to manually set optimal values of the contrast model parameters θ, α, β or to come up with an appropriate


function f. For this reason we propose to use the commonality relation (4) to locally estimate the real significance of arguments for and against the similarity of the documents being compared. We may aggregate those arguments analogously to Tversky's formula (3) and compute the similarity of T_i to T_j with regard to a bireduct BR using the following formula:

$$Sim_{BR}(T_i, T_j) = \frac{|I(BR, T_i) \cap I(BR, T_j)|}{|I(BR, T_i)|} - \frac{\big|\big(X \setminus I(BR, T_i)\big) \cap I(BR, T_j)\big|}{|X \setminus I(BR, T_i)|}. \qquad (5)$$

Since each information bireduct in our setting is a non-extendable subset of documents coupled with an irreducible subset of features that discern them, it carries maximum information on a diverse set of reference documents. Due to this property, the utilization of bireducts nullifies the undesired effect which common objects (or common features) would impose on the sizes of the commonality classes and thus on the similarity function value. Moreover, such a use of information bireducts in combination with the commonality relation (4) substitutes for the need to manually tune Tversky's model parameters. Instead, the relative size of the intersection of the commonality classes locally estimates the relevance of arguments for similarity without the need for considering additional parameters. By analogy, the importance of arguments against similarity is reflected by the relative size of the set that comprises those documents from outside the commonality class of the first document which are sufficiently compliant with the second text.

Following the example from Section 4.1, we can use formula (5) to compute the similarity between any two documents from S with regard to a chosen bireduct BR_i. For instance, Sim_BR1(T1, T2) = 3/4 − 0 = 0.75, Sim_BR1(T1, T5) = 0 − 0 = 0 and Sim_BR1(T1, T8) = 0 − 1/4 = −0.25. It is worth noting that the proposed approach keeps the flexibility of the contrast model and does not impose any properties on the resulting similarity function, i.e. depending on the data and on the selection of τ|_BR, the function Sim_BR may not be symmetric (Sim_BR1(T2, T1) = 0.8 ≠ Sim_BR1(T1, T2)) and not even reflexive (Sim_BR1(T3, T3) = 0), which is consistent with observations made by psychologists [26].
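The same worked example can be verified in code. The following illustrative sketch evaluates formula (5) on commonality classes supplied as plain sets (precomputed as in Section 4.1); the 0/0 cases, which the example treats as zero, are guarded explicitly.

    def sim_br(I_i, I_j, X):
        # bireduct-based similarity of T_i to T_j (formula (5)); I_i and I_j
        # are the commonality classes I(BR, T_i) and I(BR, T_j), and X is the
        # set of documents covered by the bireduct
        outside = X - I_i
        pro = len(I_i & I_j) / len(I_i) if I_i else 0.0              # for
        con = len(outside & I_j) / len(outside) if outside else 0.0  # against
        return pro - con

    X1 = {"T1", "T2", "T3", "T4", "T6", "T7", "T8", "T9"}
    I = {"T1": {"T1", "T2", "T7", "T9"},   # commonality classes w.r.t. BR1
         "T2": {"T1", "T2", "T7"},
         "T8": {"T8"},
         "T3": set()}
    print(sim_br(I["T1"], I["T2"], X1))    # 3/4 - 0/4  =  0.75
    print(sim_br(I["T2"], I["T1"], X1))    # 3/3 - 1/5  =  0.8
    print(sim_br(I["T1"], I["T8"], X1))    # 0/4 - 1/4  = -0.25
    print(sim_br(I["T3"], I["T3"], X1))    # 0/0 cases  =  0.0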

4.3. Ensemble-Based Similarity

The utilization of information bireducts makes it possible to conveniently model different aspects of similarity. By analogy to the initial experiments with decision bireducts (see [22]), a set of bireducts will cover much broader aspects of data than an equally sized set of regular information reducts. This makes it possible to capture approximate dependencies between features which could not be discovered using classical methods, and may contribute to the overall performance of the model. Additionally, by working with explicitly defined subsets of documents X, we can express diverse points of view on the compared texts.

Intuitively, each bireduct may represent an agent who is assessing the resemblance of the information conveyed by the compared documents. Different agents are characterised by different levels of experience and preferences. In a bireduct, the set of documents (examples or cases) may be treated as the agent's experience, and the set of features naturally models the factors that influence an agent while making judgements. To robustly evaluate the similarity of two documents, the agents need to interact by combining their assessments. The simplest method of such an interaction is to average the votes of all agents. In such a case the final similarity of T_i to T_j can be computed using the following formula:

$$Sim(T_i, T_j) = \frac{\sum_k Sim_{BR_k}(T_i, T_j)}{\#\,\text{extracted bireducts}}. \qquad (6)$$


For example, if for the information system S from Table 1 we consider the information bireducts BR1, BR2 and BR3, the final similarity of T1 to T2 would be equal to Sim(T1, T2) = (0.75 + 0.05 + 0.3)/3 ≈ 0.37. The design of such a similarity function is computationally feasible and does not require the tuning of unintuitive parameters. It also guarantees that the resulting similarity function keeps the flexibility and the psychologically plausible properties of the contrast model. This kind of an ensemble significantly reduces the variance of similarity judgements in cases when the available data set changes over time (e.g. new documents are added to the repository), and it increases model robustness.

However, some more sophisticated methods can also be employed for carrying out the interaction between agents (bireducts), in order to improve performance in a given task or to reduce similarity computation costs. For instance, properties of the extracted bireducts can be used to select only those which will most likely contribute to the performance of the model. The considered properties may include, e.g., the number of selected features, the size of the reference document subset or the average intersection with other bireducts (see [22]). Using such statistics in combination with general knowledge about the data, we can try to decrease the number of bireducts required for making consistent assessments of similarity.
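The averaging step of formula (6) is itself a one-liner; a minimal sketch using the three example values from above:

    def ensemble_similarity(votes):
        # average the similarity votes of the bireduct agents (formula (6))
        return sum(votes) / len(votes)

    print(round(ensemble_similarity([0.75, 0.05, 0.3]), 2))   # Sim(T1, T2) = 0.37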

5. Experimental Framework

In this section we describe a series of experiments that we conducted to evaluate the proposed similarity learning model. In order to check whether the similarity measure induced from data using our approach can be useful for tasks such as document grouping and semantic information retrieval, we decided to compare it with commonly utilized methods which are based on the cosine similarity measure.

5.1. Testing Methodology

The experiments were performed on a document corpus consisting of 1000 research papers related to biomedicine which were acquired from the PubMed Central repository [17]. We used our implementation of the ESA algorithm, which we adapted to work with the MeSH ontology [27], to compute associations between the documents and the MeSH headings². During the experiments, we use the bag-of-concepts representation (see Section 3.1) to construct the similarity model described in Section 4 as well as all the other models used for comparison (see Section 5.2). We excluded from these experiments the similarity models based on the bag-of-words representation, since our previous studies suggest that this representation is inferior to bag-of-concepts for the computation of popular similarity models [25].

² The corpus used in our experiments is a subset of the JRS 2012 Data Mining Competition data (for more details refer to http://tunedit.org/challenge/JRS12Contest).

Additionally, all of the documents were manually labeled by experts from the U.S. National Library of Medicine (NLM) with MeSH subheadings [27]. Those labels represent the topical content of the documents and as such they can serve as a means for evaluating the truly semantic relatedness of the texts. We use them to compute a semantic distance between the analyzed documents and we treat it as a reference for the compared similarity measures. For this purpose we used the F1 distance, defined as:

$$F_1distance(T_i, T_j) = 1 - 2 \cdot \frac{precision(S_i, S_j) \cdot recall(S_i, S_j)}{precision(S_i, S_j) + recall(S_i, S_j)}, \qquad (7)$$

where S_i, S_j are the documents T_i, T_j ∈ D represented by the MeSH subheadings assigned by experts, and

$$precision(S_i, S_j) = \frac{|S_i \cap S_j|}{|S_i|}, \qquad recall(S_i, S_j) = \frac{|S_i \cap S_j|}{|S_j|}. \qquad (8)$$

We selected this particular measure because it is often used for the evaluation of results in the information retrieval setting [3, 11]. Although we realize that the evaluation of a similarity measure by a distance metric may fail to capture some of the psychologically important properties and underestimate its real quality, in this way we are able to quantitatively assess many similarity models and use those assessments to compare them objectively. We define the semantic distance between two sets of documents as the average of the F1 distances between each pair of texts from the two sets:

$$semDist(D_1, D_2) = \frac{\sum_{T_i \in D_1,\, T_j \in D_2} F_1distance(T_i, T_j)}{|D_1| \cdot |D_2|}. \qquad (9)$$

We made three types of comparisons between the similarity models. In the first one, for each of the models we computed the absolute value of the correlation between the similarity measurements and the semantic distance (7). This kind of comparison is often used in psychological studies, where researchers try to measure the dependence between the values returned by their models and the assessments made by human subjects [6, 23, 26].

One disadvantage of this evaluation method is that the linear correlation does not necessarily indicate the usefulness of the measure in practical applications such as information retrieval or clustering. For this reason we devised the second test, in which we verify the semantic homogeneity of the clusterings resulting from the use of different similarity models. For each document in the corpus we can compute its average semantic distance to the other documents from the same cluster and to the remaining texts. If, for a division of the data into k groups, we denote the documents from a corpus D belonging to the same cluster as T_i by cluster_{T_i}, then we can define the semantic homogeneity of T_i with regard to the semantic distance function semDist as:

$$homogeneity(T_i) = \frac{B(T_i) - A(T_i)}{\max\big(A(T_i), B(T_i)\big)},$$

where

$$A(T_i) = semDist(T_i, cluster_{T_i} \setminus T_i) \quad \text{and} \quad B(T_i) = semDist(T_i, D \setminus cluster_{T_i}).$$

If cluster_{T_i} \ T_i = ∅, then it is assumed that homogeneity(T_i) = 1. The average semantic homogeneity of all documents can be used as a measure of clustering quality. Since useful similarity models should lead to meaningful clustering results, we may employ the average semantic homogeneity to intuitively evaluate the usefulness of the compared models for the clustering task.

In the last of the performed tests we checked how different similarity measures impact cluster separability for two different hierarchical clustering algorithms – agnes (AGglomerative NESting) and diana (DIvisive ANAlysis), which are described in [11] (see also a brief description in Section 5.3). Apart from a clustering hierarchy, those algorithms return an agglomerative and a divisive coefficient, respectively. These coefficients express the conspicuousness of a clustering structure in a clustering tree [11]. Although they are internal measures and their values do not necessarily correspond to the semantic relatedness of objects within the clusters, they can give an intuition about the interpretability of a clustering solution.
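Both evaluation measures can be computed directly from the expert label sets. The sketch below is an illustrative Python rendering of formulas (7)–(9) and of the semantic homogeneity; documents are assumed to be represented by sets of MeSH subheadings, and the guard values for the degenerate cases follow the conventions stated above.

    def f1_distance(S_i, S_j):
        # formula (7): one minus the F1 score of the label sets S_i, S_j
        common = len(S_i & S_j)
        if common == 0:
            return 1.0                     # precision = recall = 0
        precision = common / len(S_i)      # formula (8)
        recall = common / len(S_j)
        return 1.0 - 2.0 * precision * recall / (precision + recall)

    def sem_dist(D1, D2, labels):
        # formula (9): average F1 distance between pairs from two document sets
        return (sum(f1_distance(labels[t_i], labels[t_j])
                    for t_i in D1 for t_j in D2) / (len(D1) * len(D2)))

    def homogeneity(t_i, cluster, corpus, labels):
        # semantic homogeneity of t_i: (B - A) / max(A, B), where A is the
        # mean distance within t_i's cluster and B the mean distance to the
        # rest of the corpus
        inside = cluster - {t_i}
        if not inside:
            return 1.0                     # convention for singleton clusters
        a = sem_dist({t_i}, inside, labels)
        b = sem_dist({t_i}, corpus - cluster, labels)
        m = max(a, b)
        return (b - a) / m if m > 0 else 0.0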

5.2. Compared Similarity Models

For the purpose of the experiments, four similarity models were implemented in R System [15]. The unsupervised RBS was constructed as described in Section 4. Documents from the corpus were given associations to MeSH headings by our tagging algorithm, and an information system S = (D, F) was constructed, consisting of 1000 documents described by 25,640 semantic features. During preprocessing, the features which were not present in at least one document from the corpus D were filtered out from further analysis. The numerical association values of each term were transformed into four distinct symbolic values. The discretization was based on general knowledge of the data (e.g. for each feature the possible association values ranged from 0 to 1000) and the cut thresholds were constant for every feature (i.e. they were set to = 0, ≥ 300, ≥ 700 and ≥ 1000).

From the discretized information system, 500 information bireducts were computed using random permutations (see Algorithm 1). As expected, they significantly differ in the selection of features and reference documents. On average, a bireduct consisted of 210 attributes (min = 173, max = 246), with each attribute belonging on average to 9 bireducts (min = 1, max = 42). The average number of documents in a single bireduct was 995 (min = 988, max = 1000), and each document appeared on average in 498 bireducts (min = 489, max = 500). All of the computed information bireducts were used for the assessment of similarity by the unsupervised RBS model.

Apart from the unsupervised RBS, for the sake of comparison we implemented three other similarity models. The first one was the popular cosine similarity [3]. In this model, for documents T_i and T_j represented by vectors C_i, C_j of numerical association strengths to headings from the MeSH ontology (i.e. the vector representation of bag-of-concepts described in Section 3.1), the cosine similarity is:

$$Sim_{cos}(T_i, T_j) = \frac{C_i \times C_j}{\|C_i\| \cdot \|C_j\|}, \qquad (10)$$

where × is the scalar product of two vectors and ‖·‖ is the L2 norm of a vector. This particular measure is very often used for the comparison of texts due to its robust behaviour in highly dimensional and sparse data domains.

The second reference model used in the comparison was also based on the cosine similarity measure. However, unlike in the first one, its similarity judgements were not based on the entire data but were ensembled from 500 local evaluations. Each of those local assessments was made by the cosine similarity restricted to the features selected by a corresponding information bireduct (the same as those used in the construction of the unsupervised RBS model). The similarity function of this model was:

$$Sim_{ens}(T_i, T_j) = \sum_{l=1}^{500} Sim_{cos}(T_i|_{BR_l}, T_j|_{BR_l}), \qquad (11)$$

where T|_BR is a document T represented only by the features from BR. We call this model the cosine ensemble. The main reason for including it in our experiments was that we wanted to verify what impact the bireduct-based similarity combining technique has on the overall quality of a metric-based similarity.

The last reference model, which we call the single RBS, was constructed using the notion of a commonality relation (4) and the same aggregation method as in the unsupervised RBS (formula (5)). Its only difference was that it did not use bireducts to create multiple sub-models corresponding to local aspects of similarity; instead, it made assessments using the whole data set. Such a model can be interpreted


as a super-agent whose experience covers all available documents and who takes all possible factors into consideration at once. We used it to verify whether the bireduct-based ensemble approach is beneficial for our model.
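For completeness, the two cosine-based baselines of formulas (10) and (11) reduce to a few lines of NumPy. This is an illustrative sketch, not the R implementation used in the experiments; the boolean feature masks standing in for the bireducts are a hypothetical input format.

    import numpy as np

    def cosine_sim(c_i, c_j):
        # formula (10): cosine similarity of two bag-of-concepts vectors
        denom = np.linalg.norm(c_i) * np.linalg.norm(c_j)
        return float(c_i @ c_j) / denom if denom > 0 else 0.0

    def cosine_ensemble(c_i, c_j, feature_masks):
        # formula (11): sum of cosine similarities restricted to the feature
        # subsets selected by the individual bireducts (boolean masks)
        return sum(cosine_sim(c_i[m], c_j[m]) for m in feature_masks)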

5.3. Results of Experiments

In the experiments we used the similarity models described in the previous section to assess the similarities between every pair of documents from our corpus. This allowed us to construct four similarity matrices, in which the value at the intersection of the i-th row and j-th column expresses the similarity of the document T_i to T_j assigned by the corresponding model. We also constructed a semantic distance matrix using formula (7), as described in Section 5.1.

In the first series of tests we measured the absolute value of the correlation between the measurements from the similarity matrices obtained for each similarity model and the semantic distance. Since similarity assessments made using different measures are likely to come from different distributions, Spearman's rank correlation [23] was utilized in this test to increase its robustness. The results are displayed in Table 2.

Table 2. Absolute value of the correlation between the tested similarity models and the semantic distance.

                     unsupervised RBS   cosine   cosine ensemble   single RBS
    |correlation|         0.186          0.155        0.153           0.144

The result of the unsupervised RBS in this test is much higher than the results of the other models. It is interesting that the correlation of the third reference model (the single RBS) with the semantic distance was the lowest. This highlights the benefit of considering multiple similarity aspects in the RBS approach. On the other hand, the difference between the two cosine-based similarity models is negligible, which suggests that the ensemble approach may be ineffective for spherical similarity measures.

The second test involved the computation of two clustering trees for each of the models using the agnes and diana algorithms. Those algorithms, described in detail in [11], differ in the way they form a hierarchy of data groups. Agnes starts by assigning each observation to a different (singleton) group. In the consecutive steps, the two nearest clusters are combined to form one larger cluster. This operation is repeated until only a single cluster remains. The distance between two clusters is evaluated using a linkage function; to maximize the semantic homogeneity of the clusters, in our experiments we used the complete linkage method. The diana algorithm starts with a single cluster containing all observations. Next, the clusters are successively divided until each cluster contains a single observation. At each step, only the single group with the highest internal dissimilarity is split.

The result of the unsupervised RBS in this test is much higher than results of other models. It is interesting that the correlation of the third of the reference models (the single RBS) with the semantic distance was the lowest. This highlights the benefit from considering multiple similarity aspects in the RBS approach. On the other hand, the difference between the two cosine-base similarity models is negligible which suggests that the ensemble approach may be ineffective for spherical similarity measures. The second test involved computation of two clustering trees for each of the models using the agnes and diana algorithms. Those algorithms, described in details in [11], differ in the way they form a hierarchy of data groups. Agnes starts by assigning each observation to a different (singleton) group. In the consecutive steps, the two nearest clusters are combined to form one larger cluster. This operation is repeated until there remains only a single cluster. The distance between two clusters is evaluated using a linkage function. To maximize the semantic homogeneity of the clusters, in our experiments we used the complete linkage method. The diana algorithm starts with a single cluster containing all observations. Next, the clusters are successively divided until each cluster contains a single observation. At each step, only a single group, with the highest internal dissimilarity is split. We used two different algorithms in the experiments to verify the stability of the compared similarity models and avoid bias towards a single clustering method. Since the implementation which we used (from the cluster R System package [15]) can work only with symmetric dissimilarity matrices, we had to transform the similarity matrices of the tested models using the formula (12): dissM atrix = 1 − (simM atrix + simM atrixT )/2,

(12)
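In code, this transformation is a single matrix expression; a minimal NumPy sketch:

    import numpy as np

    def to_dissimilarity(sim_matrix):
        # formula (12): symmetrize a (possibly asymmetric) similarity matrix
        # and turn it into a dissimilarity matrix for agnes/diana clustering
        return 1.0 - (sim_matrix + sim_matrix.T) / 2.0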

Figure 1. Comparison of the average semantic homogeneity of clusterings into a consecutive number of groups using different similarity models and clustering algorithms. The plot on the left is a close-up of the most interesting area from the plot on the right. The clustering based on a randomly generated dissimilarity matrix is given as the black dot-dashed line. [Two plots of average semantic homogeneity versus the number of clusters, with curves for agnes and diana under each of the four similarity models; not reproduced here.]

Figure 1 presents the average semantic homogeneities of clusterings into a consecutive number of groups, made using the compared similarity models. The plot on the left is a close-up of the area in the plot on the right which is marked by a rectangle with dotted edges. This area corresponds to the most interesting part of the plot, since a clustering of documents into a large number of groups produces small individual clusters and is very often difficult to interpret.

The results of this test show an evident superiority of the unsupervised RBS similarity over the other models for clusterings into 2 to 50 groups. Interestingly, in this interval the semantic homogeneity of the single RBS is also much higher than in the case of the cosine-based measures. The maximum difference between the unsupervised RBS and the cosine similarity for the agnes algorithm is visible when the clustering is made into 4 groups and is equal to 0.083. For the diana algorithm the difference is even higher – when the clustering is made into 10 groups it reaches 0.097. For clusterings into 51 to approximately 150 groups the results, especially for the agnes algorithm, changed slightly in favour of the cosine similarity. The highest loss of the unsupervised RBS was to the cosine ensemble model and reached 0.015 for a division of the data into 101 groups. Going further, the unsupervised RBS takes the lead again, but the difference is not that apparent. Figure 1 also shows the results of a clustering made using a random dissimilarity matrix (the black dot-dashed line). They can serve as a reference, since they clearly show that all of the investigated similarity models led to much better results than the random clusterings.


The compared models also differ in the results of the internal clustering measures. Table 3 shows the agglomerative and divisive coefficients of the clustering trees obtained for each similarity model.

Table 3. Values of the internal clustering separability measures.

    measure:               unsupervised RBS   cosine   cosine ensemble   single RBS
    agglomerative coef.          0.58          0.33         0.37           0.55
    divisive coef.               0.54          0.28         0.31           0.51

In this test, the clusterings for both RBS-based models significantly outperformed the cosine measure approaches. Higher values of the coefficients indicate that the clusterings resulting from the use of the proposed model are clearer (the groups of documents are better separated), and thus they are more likely to be easier to interpret for experts and end-users. It is also worth noticing that both ensemble-based measures achieved higher internal scores than their single-model counterparts.

6. Directions for the Future

There are many possible directions for the development of the presented unsupervised RBS model. One of them is an adaptation of the learning process to the case when information about labels assigned by experts is available for a set of training documents. In such a situation it is possible to utilize some supervised learning techniques to improve the final similarity model. For instance, the feedback from experts can be used for selecting the information bireducts that should be included in the model. This problem can be neatly formulated as an ensemble learning task. It can also be seen as a way of interaction among agents who try to assess the similarity of given documents. Preliminary tests suggest that it would significantly improve the model. For example, in our experiments we tried to sort all the local RBS models by the decreasing size of the corresponding bireducts³ and then we computed correlations between the semantic distance matrix and ensembles of the first k RBS similarity matrices, with k ranging from 1 to the total number of the local models. We noticed that the highest score is obtained when only the first 10 matrices are averaged – it reaches 0.203, compared to 0.186 when all the bireducts are used (Table 2). It seems that by using some additional validation document set we would be able to estimate the optimal number of bireducts and to increase the overall performance of the model. Moreover, considering a lower number of local models would have a great impact on the scalability of the similarity learning process.

The unsupervised RBS model can also be used for supervised learning tasks, such as the topical classification of documents. It would be interesting to compare the performance of the unsupervised RBS to the supervised DRBS model. Moreover, a classification task could provide another opportunity to measure the quality of the model. The ability to accurately classify texts (or other types of objects) would definitely extend the possible application areas of the model.

Another research direction involves further investigation of the information bireducts and the utilization of their properties for similarity learning. For example, we would like to verify whether there exists any correlation between the properties of the information bireduct ensembles that are imposed by their generation process [22] and the performance of the unsupervised RBS model for textual data.

³ By the size of a bireduct we mean the sum of the cardinalities of its attribute set and its document set.


The discovery of such dependencies may help to improve the resulting similarity relation and speed up its induction. It would also be interesting to compare the unsupervised RBS models constructed using information bireducts with those using classical information reducts. This comparison would help to better evaluate the practical benefits of using bireducts. Finally, with the proposed model, it is possible to experiment with different parameter settings for the commonality relation and the extraction of higher-level features. In order for the model to be applicable to large article repositories, it is necessary to develop fast heuristic algorithms for computing bireducts and finding reasonable parameter values.

References

[1] Böhm, C., Faloutsos, C., Plant, C.: Outlier-robust clustering using independent components, SIGMOD Conference, 2008.
[2] Deerwester, S., Dumais, S. T., Furnas, G. W., Landauer, T. K., Harshman, R.: Indexing by latent semantic analysis, Journal of the American Society for Information Science, 41(6), 1990, 391–407.
[3] Feldman, R., Sanger, J., Eds.: The Text Mining Handbook, Cambridge University Press, 2007, ISBN 978-0-521-83657-9.
[4] Gabrilovich, E., Markovitch, S.: Computing Semantic Relatedness using Wikipedia-based Explicit Semantic Analysis, Proceedings of the 20th International Joint Conference on Artificial Intelligence, 2007.
[5] Ganter, B., Stumme, G., Wille, R., Eds.: Formal Concept Analysis, Foundations and Applications, vol. 3626 of Lecture Notes in Computer Science, Springer, 2005, ISBN 3-540-27891-5.
[6] Goldstone, R., Medin, D., Gentner, D.: Relational Similarity and the Nonindependence of Features in Similarity Judgments, Cognitive Psychology, 23, 1991, 222–262.
[7] Hliaoutakis, A., Varelas, G., Voutsakis, E., Petrakis, E. G. M., Milios, E.: Information Retrieval by Semantic Similarity, Int. Journal on Semantic Web and Information Systems (IJSWIS), Special Issue of Multimedia Semantics, 3(3), 2006, 55–73.
[8] Ho, T. B., Nguyen, N. B.: Nonhierarchical document clustering based on a tolerance rough set model, International Journal of Intelligent Systems, 17, 2002, 199–212.
[9] Janusz, A.: Dynamic Rule-Based Similarity Model for DNA Microarray Data, LNCS Transactions on Rough Sets, 2012, in print.
[10] Janusz, A., Stawicki, S.: Applications of Approximate Reducts to the Feature Selection Problem, Proc. of Int. Conf. on Rough Sets and Knowledge Technology (RSKT), 6954, Springer Berlin/Heidelberg, 2011.
[11] Kaufman, L., Rousseeuw, P.: Finding Groups in Data: An Introduction to Cluster Analysis, Wiley Interscience, New York, 1990.
[12] Ngo, C. L., Nguyen, H. S.: A Tolerance Rough Set Approach to Clustering Web Search Results, in: Knowledge Discovery in Databases: PKDD 2004 (J.-F. Boulicaut, F. Esposito, F. Giannotti, D. Pedreschi, Eds.), vol. 3202 of Lecture Notes in Computer Science, Springer Berlin/Heidelberg, 2004, 515–517.
[13] Pawlak, Z.: Information systems, theoretical foundations, Information Systems, 3(6), 1981, 205–218.
[14] Pawlak, Z.: Rough sets, rough relations and rough functions, Fundamenta Informaticae, 27(2-3), 1996, 103–108, ISSN 0169-2968.
[15] R Development Core Team: R: A Language and Environment for Statistical Computing, R Foundation for Statistical Computing, Vienna, Austria, 2008.
[16] Rinaldi, A. M.: An ontology-driven approach for semantic information retrieval on the Web, ACM Trans. Internet Technol., 9, July 2009, 10:1–10:24, ISSN 1533-5399.
[17] Roberts, R. J.: PubMed Central: The GenBank of the published literature, Proceedings of the National Academy of Sciences of the United States of America, 98(2), January 2001, 381–382.
[18] Skowron, A., Rauszer, C.: The Discernibility Matrices and Functions in Information Systems, in: Intelligent Decision Support: Handbook of Applications and Advances of the Rough Sets Theory (R. Słowiński, Ed.), Kluwer Academic Publishers, Dordrecht, Netherlands, 1992, 331–362.
[19] Skowron, A., Stepaniuk, J.: Tolerance Approximation Spaces, Fundamenta Informaticae, 27(2-3), 1996, 245–253.
[20] Skowron, A., Stepaniuk, J., Peters, J. F., Świniarski, R. W.: Calculi of Approximation Spaces, Fundamenta Informaticae, 72(1-3), 2006, 363–378.
[21] Ślęzak, D.: Approximate Entropy Reducts, Fundamenta Informaticae, 53(3-4), 2002, 365–390.
[22] Ślęzak, D., Janusz, A.: Ensembles of Bireducts: Towards Robust Classification and Simple Representation, FGIT, 7105, Springer, 2011.
[23] Spearman, C.: The proof and measurement of association between two things. By C. Spearman, 1904, The American Journal of Psychology, 100(3-4), 1987, 441–471, ISSN 0002-9556.
[24] Stahl, A., Gabel, T.: Using Evolution Programs to Learn Local Similarity Measures, in: Proceedings of the Fifth International Conference on Case-Based Reasoning, Springer, 2003.
[25] Szczuka, M., Janusz, A., Herba, K.: Clustering of Rough Set Related Documents with use of Knowledge from DBpedia, Proc. of the 6th Int. Conf. on Rough Sets and Knowledge Technology (RSKT), 6954, Springer, 2011.
[26] Tversky, A.: Features of similarity, Psychological Review, 84, 1977, 327–352.
[27] United States National Library of Medicine: Introduction to MeSH - 2011, http://www.nlm.nih.gov/mesh/introduction.html, 2011.
[28] Xiong, H., Chen, X.-w.: Kernel-based distance metric learning for microarray data classification, BMC Bioinformatics, 7(1), 2006, 299, ISSN 1471-2105.