Adapting Gloss Vector Semantic Relatedness Measure for Semantic Similarity Estimation: An Evaluation in the Biomedical Domain

Ahmad Pesaranghader¹, Azadeh Rezaei¹, and Ali Pesaranghader²

¹ Multimedia University (MMU), Jalan Multimedia, 63100 Cyberjaya, Malaysia
{ahmad.pgh,azadeh.rezaei}@sfmd.ir
² Universiti Putra Malaysia (UPM), 43400 UPM Serdang, Selangor, Malaysia
[email protected]

Abstract. Automatic methods of ontology alignment are essential for establishing interoperability across web services. These methods need to measure semantic similarity between two ontologies' entities to discover reliable correspondences. While existing similarity measures suffer from some difficulties, semantic relatedness measures tend to yield better results, even though they are not completely appropriate for the 'equivalence' relationship (e.g. "blood" and "bleeding" are related but not similar). We attempt to adapt the Gloss Vector relatedness measure for similarity estimation. Generally, Gloss Vector uses angles between entities' gloss vectors for relatedness calculation. After employing Pearson's chi-squared test for statistical elimination of insignificant features to optimize entities' gloss vectors, we enrich them, by considering the concepts' taxonomy, for better similarity measurement. The discussed measures are evaluated in the biomedical domain using MeSH, MEDLINE and a dataset of 301 concept pairs. We conclude that Adapted Gloss Vector similarity results are more correlated with human judgment of similarity than other measures.

Keywords: Semantic web · Similarity measure · Relatedness measure · UMLS · MEDLINE · MeSH · Bioinformatics · Pearson's Chi-squared test · Text mining · Ontology alignment · Computational linguistics

1 Introduction

Considering the human ability to understand different levels of similarity and relatedness among concepts or documents, similarity and relatedness measures attempt to estimate these levels computationally. Developing intelligent algorithms which automatically imitate this human acquisition of cognitive knowledge is a challenging task for many computational linguistics studies. Generally, the output of a similarity or relatedness measure is a value indicating how much two given concepts (or documents) are semantically similar (or related). Ontology matching and ontology alignment systems primarily employ this quantification of similarity and relatedness [1] to establish interoperability among different ontologies. These measures also have wide usage in machine translation [2] and information retrieval from the Web [3].


Ontologies, as representations of shared conceptualization for a variety of domains [4], aiming at organizing data for the purposes of 'understanding', are a vital part of the Semantic Web. Considering the importance of ontologies for web services, to improve and facilitate interoperability across underlying ontologies, we need an automatic mechanism which aligns or mediates between their vocabularies. Ontology alignment or ontology matching is the process of deriving correspondences among entities (concepts) within two different aligning ontologies. Formally, a correspondence with unique identifier id is a five-tuple (1) by which relation R (denoting specialization or subsumption, $\sqsubseteq$; exclusion or disjointness, $\perp$; instantiation or membership, $\in$; or assignment or equivalence, $=$) with confidence level n between entities (concepts) e1 and e2 of two different ontologies holds. To find this level of confidence, the majority of ontology alignment systems rely on similarity and relatedness measures for the subsequent decision of acknowledgment regarding the examined relation.

$$\langle id,\ e_1,\ e_2,\ n,\ R \rangle \qquad (1)$$
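As a minimal illustrative sketch (all names are hypothetical, not from the paper), such a correspondence can be modeled as a plain record:

```python
from dataclasses import dataclass

@dataclass
class Correspondence:
    """An alignment correspondence <id, e1, e2, n, R> between two ontologies."""
    id: str      # unique identifier of the correspondence
    e1: str      # entity (concept) from the first ontology
    e2: str      # entity (concept) from the second ontology
    n: float     # confidence level, typically in [0, 1]
    R: str       # relation: subsumption, disjointness, membership, or '=' (equivalence)

# e.g. an equivalence correspondence acknowledged with confidence 0.87
c = Correspondence(id="c42", e1="onto1#Blood", e2="onto2#Blood", n=0.87, R="=")
```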

Statistically, among the variety of proposed graph-based, corpus-based, statistical, and string-based similarity (or distance) and relatedness measures, those which consider the semantic aspects of the concepts are proven to be more reliable [1]. For this main reason, these measures are separately known as semantic similarity or relatedness measures, which are the topics of interest in this paper.

The remainder of the paper is organized as follows. We enumerate existing measures of semantic similarity and semantic relatedness in Sect. 2 with an explanation of the concomitant problems for each one. In Sect. 3, we list the dataset and resources employed for our experiments in the biomedical domain. In Sect. 4, we present our measure as a solution to overcome the shortcomings of semantic similarity measures. In brief, our proposed similarity measure is an optimized version of the Gloss Vector semantic relatedness measure adapted for improved semantic similarity measurement. Regarding the experiments in the given domain, in Sect. 5, we evaluate and discuss the calculated similarity results of different semantic similarity measures by employing a reference standard of 301 concept pairs. The reference standard was already scored by medical residents based on the concept pairs' similarity. Finally, we state the conclusions and future studies in Sect. 6. In the following section we discuss the two main computational classes into which semantic similarity and relatedness methods are mostly divided.

2 Distributional vs. Taxonomy-Based Similarity Measures

Different proposed approaches for measuring semantic similarity and semantic relatedness are largely categorized under two classes of computational techniques. These two classes are known for their distributional versus taxonomy-based attitudes toward calculating their estimations. By applying a technique belonging to either of these classes, it is possible to construct a collection of similar or related terms automatically. This is achievable either by using information extracted from a large corpus in the distributional methods or by considering concepts' positional information exploited from an ontology (or taxonomy) in the taxonomy-based approaches. In the following subsections, we explain each of these classes and functions of their kind in detail, along with the challenges arising in each case. In Sect. 4, we present our method for overcoming these difficulties.

2.1 Distributional Model

The notion behind methods of this kind comes from Firth's idea (1957) [5] that "a word is characterized by the company it keeps". In practical terms, such methods learn the meanings of words by examining the contexts of their co-occurrences in a corpus. Since these co-occurring features (of words or documents) are represented in a vector space, the distributional model is also recognized as the vector space model (VSM). The values of a vector belonging to this vector space are calculated by the function applied in the model. The final intention of this function is to measure term-term, document-document or term-document relatedness based on the corresponding vectors. The result is achieved by computing the cosine of the angle between the two input concepts' vectors.

Definition-based Relatedness Measures - A number of studies try to augment the concepts' vectors by taking the definition of the input term into consideration. In these cases, a concept is indicative of one sense of the input term, and its definition is accessible from an external resource (or resources) such as a dictionary or thesaurus. Be aware that in these methods the full positional and relational information for a concept from the thesaurus is still unexploited; the thesaurus here performs just as an advanced dictionary. WordNet (http://wordnet.princeton.edu) is the well-known thesaurus which often gets used in non-specific contexts for this type of relatedness measure. In WordNet, terms are characterized by synonym sets called synsets, each with its own associated definition or gloss. Synsets are connected to each other through semantic relations such as hypernym, hyponym, meronym and holonym. The Unified Medical Language System (UMLS, http://www.nlm.nih.gov/research/umls) is a mega-thesaurus specifically designed for medical and biomedical purposes. The scientific definitions of medical concepts can accordingly be derived from the resources (thesauruses) included in the UMLS.

Lesk [6] calculates the strength of relatedness between a pair of concepts as a function of the overlap between their definitions, considering their constituent words. Banerjee and Pedersen [7], for an improved result, proposed the Extended Gloss Overlap measure (also known as Adapted Lesk), which augments a concept's definition with the definitions of senses that are directly related to it in WordNet. The main drawback of these measures is their strict reliance on concepts' definitions and negligence of other knowledge sources.
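Both the plain VSM and the definition-based variants ultimately score a pair by the cosine of the angle between two vectors. A minimal sketch with toy vectors (not real corpus data):

```python
import numpy as np

def cosine(u: np.ndarray, v: np.ndarray) -> float:
    """Cosine of the angle between two co-occurrence vectors (1.0 = same direction)."""
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    return float(u @ v) / denom if denom else 0.0

# toy 4-feature context vectors for two terms
blood = np.array([12.0, 3.0, 0.0, 7.0])
hemorrhage = np.array([10.0, 1.0, 2.0, 6.0])
print(cosine(blood, hemorrhage))  # close to 1 -> strongly related
```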


To address the foregoing limitation of the Lesk-type measures, Patwardhan and Pedersen [8] introduced the Gloss Vector measure by joining together both ideas of concepts' definitions from a thesaurus and co-occurrence data from a corpus. In their approach, every word in the concept's definition from WordNet is replaced by its context vector from the corpus co-occurrence data, and the relatedness is calculated as the cosine of the angle between the two input concepts' associated vectors (gloss vectors). This Gloss Vector measure is highly valuable as it: (1) employs empirical knowledge implicit in a corpus of data, (2) avoids the direct matching problem, and (3) has no need for an underlying structure. Therefore, in another study, Liu et al. [9] extended the Gloss Vector measure and applied it as the Second Order Context Vector measure to the biomedical domain. The UMLS was used for deriving concepts' definitions, and biomedical corpora were employed for co-occurrence data extraction. In brief, this method is completed in five steps, which sequentially are: (1) constructing a co-occurrence matrix from a biomedical corpus, (2) removing insignificant words by low and high frequency cut-off, (3) developing concepts' extended definitions using the UMLS, (4) constructing the definition matrix using the results of steps 2 and 3, and (5) estimating semantic relatedness for a concept pair. For a list of concept pairs already scored by biomedical experts, they evaluated the Second Order Context Vector measure through Spearman's rank correlation coefficient. Figure 1 illustrates the entire procedure.

Fig. 1. 5 steps of Gloss Vector relatedness measure
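A minimal sketch of steps 3-5 under simplifying assumptions (toy first-order vectors; the real method uses extended definitions from the UMLS and a normalized matrix):

```python
import numpy as np

# output of steps 1-2 (assumed given): first-order context vectors of corpus words
first_order = {
    "blood":  np.array([4.0, 1.0, 0.0]),
    "cell":   np.array([2.0, 3.0, 1.0]),
    "vessel": np.array([0.0, 2.0, 5.0]),
}

def gloss_vector(definition_words):
    """Steps 3-4: a concept's second-order gloss vector is built by summing the
    first-order context vectors of the words in its (extended) definition."""
    return np.sum([first_order[w] for w in definition_words if w in first_order], axis=0)

def relatedness(def1, def2):
    """Step 5: relatedness is the cosine of the angle between the two gloss vectors."""
    g1, g2 = gloss_vector(def1), gloss_vector(def2)
    return float(g1 @ g2) / (np.linalg.norm(g1) * np.linalg.norm(g2))

print(relatedness(["blood", "vessel"], ["blood", "cell"]))
```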

Even though the Gloss Vector semantic relatedness measure has proven its applicability in a wide variety of intelligent tasks, and various studies try to improve it [10, 11], the low/high frequency cut-off phase of these measures is one of their weak points, as they do not take the frequency of individual terms into account and merely consider the frequency of bigrams for the cut-off. This is problematic because, first, we believe that when a concept is specific (or concrete) in its meaning, it needs more specific and informative terms in its definition describing that concept. Second, it is believed that a term which is more specific in meaning has a lower frequency in the corpus compared to terms which are general (or abstract). Therefore, removing terms (features) only based on the frequency of bigrams can lead to the elimination of many descriptive terms or features. In Sect. 4, in the first stage of our proposed measure for an enhanced measure of semantic similarity built on the Gloss Vector relatedness measure, we introduce Pearson's chi-squared test ($\chi^2$) as a measure of association to eliminate insignificant features (on which a concept is not dependent) in a statistical fashion.

Distributional Model Challenges - At times relatedness and similarity measures can be used interchangeably; however, the results may show inconsistency for several involved pairs. This issue is more considerable in ontology alignment and database schema matching, where finding the relation of equivalence (or assignment) for the examined concepts is of highest priority. As a realistic example, the semantic correspondence result for the entities "blood" and "hemorrhage" represents a high degree of relatedness when a low level of similarity between these terms was really intended. This drawback is more intense for comparisons of negations ("biodegradable" vs. "non-biodegradable"). For this reason, we need to look for a semantic measure which specifically takes the similarity of the two compared entities (concepts) into account.

Semantic similarity measures typically depend on a hierarchical structure of concepts derived from taxonomies or thesauruses. A concept, which is equal to an entity in ontology language, is a specific sense of a word (polysemy). For example, "ventricle" is a representative word for two different concepts: "a chamber of the heart that pumps blood to the arteries" and "one of the four cavities of the brain". One concept can also have different representative words (synonymy). For instance, the words "egg" and "ovum" can both imply the concept "the reproductive cell of the female, which contains genetic information and participates in the act of fertilization". Normally, all representative terms of a concept in a taxonomy (or an ontology) comprise a set with an assigned unique identifier for that concept (sense). In the following subsection we categorize existing taxonomy-based semantic similarity measures, which are generic and domain-independent for a variety of applications, into three groups.

2.2 Taxonomy-Based Model

The main factor showing the unique characteristic of measures belonging to this class is their capability of estimating semantic similarity (rather than relatedness). The functions of this kind, as the model's name implies, rely on positional information of concepts derived from a taxonomy or ontology. Depending on the degree of information extraction from the taxonomy and the option of employing a corpus for enriching it, the methods for similarity measurement are categorized into three main groups.

Path-based Similarity Measures - For these methods the distance between concepts in the thesaurus is what matters. That is, the only input to the similarity measurement of a concept pair is the shortest number of jumps from one concept to the other. The semantic similarity measures based on this approach are:

• Rada et al. 1989 [12]

$$sim_{path}(c_1, c_2) = \frac{1}{\text{shortest is-a path}(c_1, c_2)} \qquad (2)$$

In path-based measures we count nodes (not paths).

• Caviedes and Cimino 2004 [13]
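Both measures rest on counting nodes along the shortest is-a path. A minimal sketch with a hypothetical toy taxonomy:

```python
from collections import deque

def shortest_is_a_path(graph, c1, c2):
    """Number of nodes on the shortest path between c1 and c2 over is-a edges
    (in path-based measures we count nodes, not edges)."""
    frontier, seen = deque([(c1, 1)]), {c1}
    while frontier:
        node, length = frontier.popleft()
        if node == c2:
            return length
        for nbr in graph.get(node, []):
            if nbr not in seen:
                seen.add(nbr)
                frontier.append((nbr, length + 1))
    return float("inf")

def sim_path(graph, c1, c2):
    return 1.0 / shortest_is_a_path(graph, c1, c2)

# toy undirected is-a taxonomy
isa = {"entity": ["organ", "tissue"], "organ": ["entity", "heart"],
       "tissue": ["entity"], "heart": ["organ"]}
print(sim_path(isa, "heart", "tissue"))  # heart-organ-entity-tissue: 4 nodes -> 0.25
```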


It is sound to distinguish higher concepts in the thesaurus hierarchy, which are abstract in their meaning and so less likely to be similar, from lower concepts, which are more concrete in their meanings (e.g. physical_entity/matter, two high-level concepts, vs. skull/cranium, two low-level concepts). This is the main challenge in path-based methods, as they do not consider this growth in specificity as we go down from the root towards the leaves.

Path-based and Depth-based Similarity Measures - These methods are proposed to overcome the abovementioned drawback of path-based methods. They are generally based on both the path and the depth of concepts. These measures are:

• Wu and Palmer 1994 [14]

$$sim_{wup}(c_1, c_2) = \frac{2 \times depth(LCS(c_1, c_2))}{depth(c_1) + depth(c_2)} \qquad (3)$$

LCS is the least common subsumer (lowest common ancestor) of the two concepts.

• Leacock and Chodorow 1998 [15]

$$sim_{lch}(c_1, c_2) = -\log\left(\frac{minpath(c_1, c_2)}{2D}\right) \qquad (4)$$

minpath is the shortest path, and D is the total depth of the taxonomy.

• Zhong et al. 2002 [16]

$$sim_{zhong}(c_1, c_2) = \frac{2 \times m(LCS(c_1, c_2))}{m(c_1) + m(c_2)} \qquad (5)$$

m(c) is computable by $m(c) = 1/k^{depth(c)+1}$.

• Nguyen and Al-Mubaid 2006 [17]

$$sim_{nam}(c_1, c_2) = \log\big(2 + (minpath(c_1, c_2) - 1) \times (D - d)\big) \qquad (6)$$

D is the total depth of the taxonomy, and d = depth(LCS(c1, c2)).

From the information retrieval (IR) viewpoint, when one concept is less frequent than other concepts (especially its siblings), we should consider it more informative, as well as more concrete in meaning, than the others. This is a notable challenge for path-based and depth-based similarity functions. For example, when two siblings are compared with another concept in these measures, they end up with the same similarity estimations, while it is highly possible that one of the siblings (the more frequent or general one) has greater similarity to the compared concept.
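A minimal sketch of the depth-scaling idea, using formulas (3) and (4) with hypothetical depths:

```python
import math

def sim_wup(depth_lcs, depth_c1, depth_c2):
    """Wu & Palmer (3): scaled depth of the least common subsumer."""
    return 2.0 * depth_lcs / (depth_c1 + depth_c2)

def sim_lch(min_path, total_depth):
    """Leacock & Chodorow (4): -log of the shortest path scaled by taxonomy depth."""
    return -math.log(min_path / (2.0 * total_depth))

# a skull/cranium-like pair: deep LCS relative to the concepts' own depths
print(sim_wup(depth_lcs=7, depth_c1=8, depth_c2=8))  # 0.875 -> quite similar
print(sim_wup(depth_lcs=1, depth_c1=2, depth_c2=2))  # 0.5   -> abstract, less similar
print(sim_lch(min_path=2, total_depth=16))           # ~2.77
```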


Path-based and Information Content Based Similarity Measures - As mentioned, depth shows specificity but not frequency; low-frequency concepts are often much more informative, and so more concrete, than high-frequency ones. This feature of a concept is known as its information content (IC) [18]. The IC value for a concept in the taxonomy is defined as the negative log of the probability of that concept in a corpus.

$$IC(c) = -\log(P(c)) \qquad (7)$$

P(c) is the probability of the concept c in the corpus:

$$P(c) = \frac{tf + if}{N} \qquad (8)$$

tf is the term frequency of concept c, if is the inherited frequency, i.e. the frequencies of the concept's descendants summed together, and N is the sum of all augmented concepts' (concept plus its descendants) frequencies in the ontology. Different methods based on the information contents of concepts have been proposed:

• Resnik 1995 [18]

$$sim_{res}(c_1, c_2) = IC(LCS(c_1, c_2)) \qquad (9)$$

• Jiang and Conrath 1997 [19]

$$sim_{jcn}(c_1, c_2) = \frac{1}{IC(c_1) + IC(c_2) - 2 \times IC(LCS(c_1, c_2))} \qquad (10)$$

• Lin 1998 [20]

$$sim_{lin}(c_1, c_2) = \frac{2 \times IC(LCS(c_1, c_2))}{IC(c_1) + IC(c_2)} \qquad (11)$$
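A minimal sketch of formulas (7)-(11) with hypothetical corpus counts:

```python
import math

def information_content(tf, inherited_f, N):
    """IC (7)-(8): negative log of the concept's augmented corpus probability."""
    return -math.log((tf + inherited_f) / N)

def sim_res(ic_lcs):
    return ic_lcs                            # Resnik (9)

def sim_jcn(ic1, ic2, ic_lcs):
    return 1.0 / (ic1 + ic2 - 2.0 * ic_lcs)  # Jiang & Conrath (10)

def sim_lin(ic1, ic2, ic_lcs):
    return 2.0 * ic_lcs / (ic1 + ic2)        # Lin (11)

# hypothetical counts from a corpus with N = 1,000,000 concept occurrences
ic_c1  = information_content(tf=50,  inherited_f=150,  N=1_000_000)
ic_c2  = information_content(tf=80,  inherited_f=120,  N=1_000_000)
ic_lcs = information_content(tf=400, inherited_f=9600, N=1_000_000)
print(sim_res(ic_lcs), sim_jcn(ic_c1, ic_c2, ic_lcs), sim_lin(ic_c1, ic_c2, ic_lcs))
```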

Typically, path-based and IC-based similarity measures should yield better results compared with path-based and path-and-depth-based methods. Nonetheless, in order to produce accurate IC values and achieve reliable results from IC-based similarities, many challenges are involved. The first challenge is that IC-based similarities rely heavily on an external corpus, so any weakness or imperfection in the corpus can affect the result of these similarity measures. This issue is more intense when we work on a specific domain of knowledge (e.g. healthcare and medicine, or disaster management response), where a corpus related to the domain of interest is accordingly needed. Even when the corpus is meticulously chosen, the absence of corresponding terms in the corpus for the concepts in the thesaurus is highly likely. This hinders computing IC for many concepts (when the term frequencies of their descendants are also unknown) and consequently prevents semantic similarity calculation for those concepts. The second challenge is associated with the ambiguity of terms (terms with multiple senses) in their context during the parsing phase. As we really need concept frequency rather than term frequency for IC computation, this ambiguity of the numerous ambiguous terms encountered in the corpus leads to miscounting and consequently to noisiness and impurity of the produced ICs for concepts. The third problem arises for concepts having equivalent terms that are multi-word rather than just one word (e.g. semantic web). In this case the parser should be smart enough to distinguish these multi-word terms in the corpus from occurrences where their constituent words appear alone; otherwise the frequency count for this type of concept will also be unavailable. As a result, there will be no chance of IC measurement and similarity estimation for the concept. The last setback regarding the usage of IC-based measures is strongly associated with those specific domains of knowledge which are in a state of rapid evolution, where new concepts and ideas get added to their vocabularies and ontologies every day. Unfortunately, the speed of growth of documentation on these new topics does not match their evolution speed. Therefore, it is another hindrance to calculating the required ICs for these new concepts.

Pesaranghader and Muthaiyah [21] try to deal with these drawbacks to some extent in their similarity measure; however, these issues in semantic similarity measures cause measures of semantic relatedness like Gloss Vector to outperform them in the application of similarity estimation, even though, as we previously stated, relatedness measures are not perfectly tuned for this kind of task. In this research, considering the high reliability, performance and applicability of the Gloss Vector semantic relatedness measure in the task under study, after optimizing it using Pearson's chi-squared test ($\chi^2$) in the first stage of the proposed method, in the second stage, by taking a hierarchical taxonomy of concepts into account, we attempt to adapt this measure for an even more accurate estimation of semantic similarity (instead of relatedness). Our approach to the computation of semantic similarity using the Gloss Vector idea is fully discussed in Sect. 4. Through the experiments of our study in the specific biomedical domain, in Sect. 5, the result of our similarity estimation is compared with other similarity measures. For this purpose, we have applied these measures on the Medical Subject Headings (MeSH) from the Unified Medical Language System (UMLS), using MEDLINE as the biomedical corpus for computation of the optimized and enriched gloss vectors. The produced similarity results are evaluated against a subset of concept pairs as a reference standard previously judged by medical residents. The following section explains each of the abovementioned resources employed in our experiments.

3 Experimental Data

Some of the external resources available for the study act as a taxonomy; the other resources are a corpus from which the required information fed to the semantic similarity measures is extracted, and a dataset used for testing and evaluation. While MEDLINE abstracts are employed as the corpus, the UMLS is utilized for the construction of concepts' definitions used in our proposed Adapted Optimized Gloss Vector semantic similarity measure.

3.1 Unified Medical Language System (UMLS)

The Unified Medical Language System (UMLS, http://www.nlm.nih.gov/research/umls) is a knowledge representation framework designed to support biomedical and clinical research. Its fundamental usage is the provision of a database of biomedical terminologies for encoding information contained in electronic medical records and for medical decision support. It comprises over 160 terminologies and classification systems. The UMLS contains more than 2.8 million concepts and 11.2 million unique concept names. The Metathesaurus, Semantic Network and SPECIALIST Lexicon are the three components of the UMLS. This research focuses on the Metathesaurus and Semantic Network, since for the calculation of all semantic similarity measures examined in the study we need access to the biomedical concepts residing in the UMLS Metathesaurus and must know the relations among these concepts through the Semantic Network. In the UMLS there are 133 semantic types and 54 semantic relationships. The typical hierarchical relations are parent/child (PAR/CHD) and broader/narrower (RB/RN). Medical Subject Headings (MeSH) is one of the terminologies (sources) contained in the UMLS. Since all the concept pairs tested in this study are from MeSH, access to MeSH through the UMLS is requisite for our experiments. Basically, MeSH is a comprehensive controlled vocabulary which aims at indexing journal articles and books in the life sciences; additionally, it can serve as a thesaurus that facilitates searching. In this research we have limited the scope to the 2012AB release of the UMLS.

3.2 MEDLINE Abstract

MEDLINE (http://mbr.nlm.nih.gov/index.shtml) contains over 20 million biomedical articles from 1966 to the present. The database covers journal articles from almost every field of biomedicine, including medicine, nursing, pharmacy, dentistry, veterinary medicine, and healthcare. For the current study we used MEDLINE article abstracts as the corpus to build a first-order term-term co-occurrence matrix for the later computation of the second-order co-occurrence matrix. We used the 2013 MEDLINE abstracts.

3.3 Reference Standard

The reference standard used in our experiments (http://rxinformatics.umn.edu/data/UMNSRS_similarity.csv) was based upon a set of medical term pairs created specifically for testing automated measures of semantic similarity, freely provided by the University of Minnesota Medical School as an experimental study [22]. In that study the pairs of terms were compiled by first selecting all concepts from the UMLS with one of three semantic types: disorders, symptoms and drugs. Subsequently, only concepts with entry terms containing at least one single-word term were further selected for potential differences in similarity and relatedness responses.


Eight medical residents (2 women and 6 men) at the University of Minnesota Medical School were invited to participate in this study for a modest monetary compensation. They were presented with 724 medical term pairs on a touch-sensitive computer screen and were asked to indicate the degree of similarity between the terms on a continuous scale by touching a touch-sensitive bar at the bottom of the screen. The overall inter-rater agreement on this dataset was moderate (intraclass correlation coefficient 0.47); however, in order to reach good agreement, after removing some concept pairs, a set of 566 UMLS concept pairs manually rated for semantic similarity on a continuous response scale was provided. A number of concepts from the original reference standard are not included in the MeSH ontology. More importantly, some of them (and their descendants) are not found in MEDLINE, which leaves no chance for the information content calculation used in IC-based measures. Therefore, after removing these concepts from the dataset, a subset of 301 concept pairs was available for testing the different semantic similarity functions, including the new measure in this study.

4 Methods

As already mentioned, the main attempts of this paper are, first, to avoid the concomitant difficulties of implementing the low and high frequency cut-off of bigrams in the Gloss Vector relatedness measure, and second, to improve this effective relatedness measure in a way which is more appropriate for semantic similarity measurement. According to these two main objectives of the study, the method implemented to address the foregoing statements consists of two very important stages which, at the final point, provide us with the Adapted Gloss Vector semantic similarity measure.

4.1 Stage 1 - Optimizing Gloss Vector Semantic Relatedness Measure

We use Pearson’s chi-squared test (X2) for test of independence. By independence we mean in a bigram two given words are not associated with each other and merely have accrued together by chance if the result of X2 test for that bigram is lower than a certain degree. The procedure of the test for our case includes the following steps: (1) Calculate the chi-squared test-statistic, X2, using (12), (2) Determine the degree of freedom, df, which is 1 in our case, (3) Compare X2 to the critical value from the chisquared distribution with d degree of freedom (again 1 in our case). X2 ¼

n X ðOi  Ei Þ2 i¼1

Ei

ð12Þ

where $\chi^2$ is Pearson's cumulative test statistic, $O_i$ is an observed frequency, $E_i$ is an expected frequency, and n is the number of cells in the contingency table. In order to show what the contingency table for a bigram means and how $\chi^2$ is computed from it, we take "respiratory system" as a bigram sample on which $\chi^2$ has been applied. In Table 1, by considering an external corpus, O1 = 100 represents that "system" has occurred after "respiratory" 100 times; accordingly, O2 = 400 denotes that "respiratory" has been seen in the corpus without "system" 400 times, so in total "respiratory" has a frequency of 500 in the corpus. The total number of bigrams collected from the corpus is 100'000.

Table 1. 2 x 2 contingency table for "respiratory system"

               | system              | ¬system                  | TOTAL
respiratory    | O1 = 100, E1 = 2    | O2 = 400, E2 = 498       | 500
¬respiratory   | O3 = 300, E3 = 398  | O4 = 99'200, E4 = 99'102 | 99'500
TOTAL          | 400                 | 99'600                   | 100'000

To calculate the expected value for each cell, we multiply the totals of the column and row to which that cell belongs and then divide the result by the total number of bigrams. The expected value for O3 is: E3 = (99'500 × 400) / 100'000 = 398. The other expected values are calculable in the same way. Now, to obtain the chi-squared test statistic, we only need to put the numbers into (12):

$$\chi^2(\text{respiratory system}) = \frac{(O_1 - E_1)^2}{E_1} + \frac{(O_2 - E_2)^2}{E_2} + \frac{(O_3 - E_3)^2}{E_3} + \frac{(O_4 - E_4)^2}{E_4}$$
$$= \frac{(100 - 2)^2}{2} + \frac{(400 - 498)^2}{498} + \frac{(300 - 398)^2}{398} + \frac{(99'200 - 99'102)^2}{99'102} = 4845.5$$
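The same computation as a short sketch (observed counts from Table 1; expected values derived from the marginals):

```python
import numpy as np

def chi_squared(table: np.ndarray) -> float:
    """Pearson's chi-squared statistic (12) over a contingency table of observed counts."""
    row = table.sum(axis=1, keepdims=True)       # row totals, shape (2, 1)
    col = table.sum(axis=0, keepdims=True)       # column totals, shape (1, 2)
    expected = row @ col / table.sum()           # E_i = row_total * col_total / N
    return float(((table - expected) ** 2 / expected).sum())

# observed counts for the "respiratory system" example (Table 1)
observed = np.array([[100, 400],
                     [300, 99_200]])
print(round(chi_squared(observed), 1))  # -> 4845.5
```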

The calculated value of 4845.5 represents the level of dependence (association) between "respiratory" and "system", which is rather high. This value is calculable for any bigram extracted from the corpus. By comparing these calculated chi-squared test statistics to the critical value from the chi-squared distribution, with 1 degree of freedom, the appropriate cut-off level for removing the truly unwanted bigrams can be found.

We take advantage of the Pearson's chi-squared test presented above for our proposed approach of insignificant feature (term) removal. In order to integrate this statistical association measure into the Gloss Vector measure procedure, in our approach for optimizing the gloss vectors computed for each concept in the Gloss Vector relatedness measure, we (1) ignore the low and high frequency cut-off step of the Gloss Vector measure, (2) construct the normalized second-order co-occurrence matrix using the first-order co-occurrence matrix, (3) build the $\chi^2$-on-SOC matrix by enforcing Pearson's chi-squared test on the normalized second-order co-occurrence matrix to find the relative association of concepts (rows of the matrix) and terms (columns of the matrix), and finally (4) apply the level-of-association cut-off on the $\chi^2$-on-SOC matrix; a sketch of this last step follows Fig. 2. Figure 2 illustrates the entire procedure of our proposed semantic relatedness measure using Pearson's chi-squared test ($\chi^2$).


Fig. 2. 5 steps of the Gloss Vector measure with $\chi^2$ optimization
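A minimal sketch of step (4), with a tiny hypothetical $\chi^2$-on-SOC matrix:

```python
import numpy as np

def apply_association_cutoff(chi2_soc: np.ndarray, critical_value: float) -> np.ndarray:
    """Step 4: zero out concept-term associations whose chi-squared score falls
    below the chosen critical value (e.g. 3.84 for p = 0.05 at df = 1)."""
    return np.where(chi2_soc >= critical_value, chi2_soc, 0.0)

# hypothetical chi-squared-on-SOC matrix: rows = concepts, columns = terms (features)
chi2_soc = np.array([[4845.5,  1.2, 20.4],
                     [   0.7, 15.9,  3.1]])
print(apply_association_cutoff(chi2_soc, critical_value=3.84))
```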

There are three important points considered in this stage: (1) we count co-occurrences from the corpus instead of bigrams; (2) we apply $\chi^2$ on the second-order co-occurrence matrix cells, in other words on concept-term elements instead of bigrams (co-occurrences); (3) to find the appropriate cut-off point for independent values (features), we gradually increase the critical value and examine the correlation of the similarity results calculated at these different cut-off points to find the optimum cut-off threshold. After this stage, we have the Optimized Gloss Vector relatedness measure, which still has some difficulties with similarity calculation. In the next stage, by taking a taxonomy of concepts into account, we adapt this optimized relatedness measure for semantic similarity calculation.

4.2 Stage 2 - Adapted Gloss Vector Semantic Similarity Measure Using $\chi^2$

Here, by considering the concepts' taxonomy, we construct the enriched second-order co-occurrence matrix used for the final measurement of similarity between concepts. For this purpose, using the optimized $\chi^2$-on-SOC matrix, we can calculate the enriched gloss vectors of the concepts in a taxonomy. Our proposed formula to compute the enriched gloss vector for a concept is:

$$Vector_i(c_j) = \frac{ca_i(c_j) + ia_i(c_j)}{\sum_{k=1}^{m} \left( ca_i(c_k) + ia_i(c_k) \right)} \quad \forall\, i \le n \qquad (13)$$

$$Vector \in \mathbb{R}^n,\ n \text{ is the quantity of features}; \quad 1 \le j \le m,\ m \text{ is the quantity of concepts}$$

where $ca_i(c_j)$ (concept association) is the level of association between concept $c_j$ and feature i (already computed by $\chi^2$ in the previous stage); $ia_i(c_j)$ (inherited association) is the level of association between concept $c_j$ and feature i inherited from $c_j$'s descendants in the taxonomy, calculable as the summation of all the concept's descendants' levels of association with feature i; and finally, the denominator is the summation of all augmented concepts' (concepts plus their descendants) levels of association with feature i. All of these association values are retrievable from the optimized $\chi^2$-on-SOC matrix. The collection of enriched gloss vectors builds the enriched gloss matrix, which will be used for the final semantic similarity measurement.

The intuition behind this extra stage is that the higher concepts in a taxonomy (or an ontology) tend to be more general and therefore associated with many terms (features in their vectors). On the other hand, the lower concepts should be more specific and therefore associated with fewer features. Accordingly, the concepts in the middle are associated with a proportionate number of features (terms). For this reason, by adding up the computed optimized second-order vector of a concept with its descendants' optimized vectors, we attempt to enforce the foregoing notion regarding the abstractness/concreteness of concepts, which by itself implicitly implies the idea of path and depth for similarity measurement as well.

In order to measure the similarity between two concepts, we only need the enriched gloss matrix loaded in memory. With the enriched gloss vectors of the two concepts available this way, their semantic similarity can be measured by calculating the cosine of the angle between these two vectors.
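A minimal sketch of equation (13) and the final cosine similarity, under simplifying assumptions (a tiny hypothetical association matrix and descendant map):

```python
import numpy as np

def enriched_gloss_matrix(assoc: np.ndarray, descendants: dict) -> np.ndarray:
    """Equation (13): add each concept's descendants' association scores (ia)
    to its own (ca), then normalize every feature column over all concepts."""
    enriched = assoc.copy()
    for j in range(assoc.shape[0]):
        for d in descendants.get(j, []):
            enriched[j] += assoc[d]          # inherited association ia_i(c_j)
    return enriched / enriched.sum(axis=0)   # denominator of (13), per feature

def similarity(matrix: np.ndarray, j: int, k: int) -> float:
    """Cosine of the angle between two enriched gloss vectors."""
    u, v = matrix[j], matrix[k]
    return float(u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))

# toy optimized chi-squared-on-SOC matrix (3 concepts x 4 features);
# concept 0 subsumes concepts 1 and 2 in the taxonomy
assoc = np.array([[5.0, 0.0, 2.0, 1.0],
                  [0.0, 3.0, 4.0, 1.0],
                  [1.0, 0.0, 2.0, 6.0]])
matrix = enriched_gloss_matrix(assoc, descendants={0: [1, 2]})
print(similarity(matrix, 1, 2))
```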

5 Experiments and Discussions

In the following subsections, first, in one place, we compare the best results generated by all the semantic similarity measures discussed in this paper. Then, considering our Adapted Gloss Vector semantic similarity measure, we focus on the optimized $\chi^2$-on-SOC matrix to illustrate the influence of the level-of-association cut-off, found with the statistical approach of Pearson's chi-squared test ($\chi^2$), on the accuracy of the final similarity estimation. For the evaluation, Spearman's rank correlation coefficient is employed to assess the relationship between the reference standard and the auto-generated semantic similarity results.

5.1 Adapted Gloss Vector Similarity Measure vs. Other Similarity Measures

Figure 3 presents the Spearman's rank correlation coefficients of nine similarity measures proposed in other studies and of our Adapted Gloss Vector semantic similarity measure, computed for the 301 concept pairs against the biomedical experts' similarity scores. To show the reliability of Gloss Vector semantic relatedness, its different forms are represented. The test is conducted on the MeSH ontology from the UMLS. In both the IC-based and Gloss Vector-based measures, MEDLINE's single words and bigrams are extracted for the calculation of the ICs and vectors. In the Gloss Vector-based measures, a window size of 2 is used for the construction of the first-order co-occurrence matrix, after stop-word removal and Porter stemming. In these cases we have used the finest cut-off points for the removal of insignificant features.


Fig. 3. Spearman's rank correlation of similarity measures

Because of these considerations, which help to produce the best estimated similarity results, the highest produced correlation results are likewise presented for the other measures. Table 2 shows precisely the Spearman's correlation results of the 5 highest-scoring semantic similarity measures. Adapted Gloss Vector similarity (frequency cut-off) means we applied the taxonomy idea (stage 2) on the Gloss Vector measure while implementing the bigram frequency cut-off (as in the original Gloss Vector measure) instead of $\chi^2$. This is done to show the effectiveness of $\chi^2$ for insignificant feature elimination.

Table 2. 5 highest Spearman's correlation results of similarity measures

Semantic similarity measure                               | Spearman's rank correlation
Lin                                                       | 0.3805
Gloss Vector relatedness (frequency cut-off)              | 0.4755
Adapted Gloss Vector similarity (frequency cut-off)       | 0.5181
Optimized Gloss Vector relatedness (chi-squared cut-off)  | 0.6152
Adapted Gloss Vector similarity (chi-squared cut-off)     | 0.6438

Comparing the Adapted Gloss Vector similarity measure in its optimized form with the other measures, we can see that this approach has achieved a higher Spearman's rank correlation. On the other hand, due to some drawbacks of the semantic similarity measures, Gloss Vector relatedness yields a more reliable estimation of similarity than they do. We also discussed that setting absolute low and high cut-off thresholds, without considering the level of dependency of the two terms (words) involved in a bigram, can waste a beneficial as well as expressive amount of information. Therefore, before considering any strict low and high cut-off points to remove insignificant features, we see the need to measure the relative level of association between the terms in a bigram. The results of the $\chi^2$ implementation prove the efficiency of this consideration.

5.2 Pearson's Chi-squared Test and Level of Association Cut-off

If we consider the words w1, w2, w3, ..., wn as an initial set of features which construct the n × n first-order co-occurrence matrix, in practice only a subset of these words is significant and descriptive for a given concept (i.e. the words on which the concept is really dependent, or with which it is associated). To find these levels of association, after applying $\chi^2$ on the second-order co-occurrence matrix, the critical value from the chi-squared distribution with 1 degree of freedom can be used to remove those features on which a concept is not dependent. In order to find this critical value, as mentioned, we gradually increase it, going upward from 0, until we reach the point where the best correlation between the computed similarities and the human judgments is achieved, as sketched after Fig. 4. Figure 4 demonstrates the correlation changes at different cut-off points for the $\chi^2$ level of association. For example, after removing $\chi^2$ results with values less than 15 from the $\chi^2$-on-SOC matrix, we achieved a correlation score of 0.6425. In the represented ranges, all 26'958 concepts in MeSH have their associated vectors. Further cut-offs beyond the last point may cause the loss of some concepts' adapted and optimized gloss vectors, which would hinder future similarity measurement for those concepts (depending on whether the concept has any descendant with an available optimized gloss vector).

Fig. 4. Adapted Gloss Vector similarity measure and different levels of association cut-off point
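A sketch of this threshold search (the similarity function, concept pairs and human scores are assumed given; names are hypothetical):

```python
from scipy.stats import spearmanr

def best_cutoff(cutoffs, similarity_fn, pairs, human_scores):
    """Sweep candidate critical values and keep the one whose similarity scores
    correlate best (Spearman's rank) with the human reference standard."""
    best_cut, best_rho = None, -1.0
    for cut in cutoffs:
        system_scores = [similarity_fn(c1, c2, cut) for c1, c2 in pairs]
        rho, _ = spearmanr(system_scores, human_scores)
        if rho > best_rho:
            best_cut, best_rho = cut, rho
    return best_cut, best_rho

# candidate cut-offs: 0 (no cutting), chi-squared critical values for df = 1, and beyond
cutoffs = [0, 3.84, 6.64, 10.83, 15, 25, 50, 100, 400]
# best_cut, best_rho = best_cutoff(cutoffs, adapted_gloss_similarity, pairs, human_scores)
```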

Considering the results of the experiments, we found that insignificant feature elimination by the frequency cut-off approach in the Gloss Vector relatedness measure is problematic, as it does not consider the frequency of individual terms and merely takes the frequency of bigrams into account. This can cause the loss of many valuable features needed for a more reliable estimation. Through a slight change in this measure's procedure, using Pearson's chi-squared test, we find the insignificant terms statistically and more properly.


6 Conclusion

In this study we attempted to adapt the Gloss Vector relatedness measure for improved similarity estimation, calling the result the Adapted Gloss Vector semantic similarity measure. Generally, Gloss Vector uses angles between entities' gloss vectors for the calculation of relatedness. In order to adapt this measure for similarity measurement, first, by altering the Gloss Vector procedure to use Pearson's chi-squared test ($\chi^2$) for determining the level of association between concepts and terms (features), we could eliminate insignificant features, which led us to the optimized gloss vectors; second, by taking a taxonomy of concepts into account and, with the optimized gloss vectors of the concepts available, augmenting each concept's optimized gloss vector with its descendants' optimized gloss vectors, we produced enriched gloss vectors for all concepts in the taxonomy, suitable for semantic similarity measurement. Our evaluation of the Adapted Gloss Vector similarity measure in the biomedical domain proved that this measure outperforms the other similarity and Gloss Vector based measures. For future work, other statistical measures of association, such as Pointwise Mutual Information (PMI) and the Log-Likelihood statistic (G2), can be examined for implementing insignificant feature removal. Additionally, our proposed approach can be tested on extrinsic tasks such as ontology alignment and database schema matching, or be evaluated in other domains of knowledge.

References

1. Muthaiyah, S., Kerschberg, L.: A hybrid ontology mediation approach for the semantic web. Int. J. E-Bus. Res. 4, 79–91 (2008)
2. Chen, B., Foster, G., Kuhn, R.: Bilingual sense similarity for statistical machine translation. In: Proceedings of the ACL, pp. 834–843 (2010)
3. Pesaranghader, A., Mustapha, N., Pesaranghader, A.: Applying semantic similarity measures to enhance topic-specific web crawling. In: Proceedings of the 13th International Conference on Intelligent Systems Design and Applications (ISDA'13), pp. 205–212 (2013)
4. Gruber, T.R.: Toward principles for the design of ontologies used for knowledge sharing. Int. J. Hum. Comput. Stud. 43, 907–928 (1995)
5. Firth, J.R.: A synopsis of linguistic theory 1930–1955. In: Firth, J.R. (ed.) Studies in Linguistic Analysis, pp. 1–32. Blackwell, Oxford (1957)
6. Lesk, M.: Automatic sense disambiguation using machine readable dictionaries: how to tell a pine cone from an ice-cream cone. In: Proceedings of the 5th Annual International Conference on Systems Documentation, New York, USA, pp. 24–26 (1986)
7. Banerjee, S., Pedersen, T.: An adapted Lesk algorithm for word sense disambiguation using WordNet. In: Gelbukh, A. (ed.) CICLing 2002. LNCS, vol. 2276, pp. 136–145. Springer, Heidelberg (2002)
8. Patwardhan, S., Pedersen, T.: Using WordNet-based context vectors to estimate the semantic relatedness of concepts. In: Proceedings of the EACL 2006 Workshop (2006)
9. Liu, Y., McInnes, B.T., Pedersen, T., Melton-Meaux, G., Pakhomov, S.: Semantic relatedness study using second order co-occurrence vectors computed from biomedical corpora, UMLS and WordNet. In: Proceedings of the 2nd ACM SIGHIT IHI (2012)
10. Pesaranghader, A., Pesaranghader, A., Rezaei, A.: Applying latent semantic analysis to optimize second-order co-occurrence vectors for semantic relatedness measurement. In: Proceedings of the 1st International Conference on Mining Intelligence and Knowledge Exploration (MIKE'13), pp. 588–599 (2013)
11. Pesaranghader, A., Pesaranghader, A., Rezaei, A.: Augmenting concept definition in gloss vector semantic relatedness measure using Wikipedia articles. In: Proceedings of the 1st International Conference on Data Engineering (DeEng-2013), pp. 623–630 (2014)
12. Rada, R., Mili, H., Bicknell, E., Blettner, M.: Development and application of a metric on semantic nets. IEEE Trans. Syst. Man Cybern. 19, 17–30 (1989)
13. Caviedes, J., Cimino, J.: Towards the development of a conceptual distance metric for the UMLS. J. Biomed. Inform. 37(2), 77–85 (2004)
14. Wu, Z., Palmer, M.: Verb semantics and lexical selections. In: Proceedings of the 32nd Annual Meeting of the Association for Computational Linguistics (1994)
15. Leacock, C., Chodorow, M.: Combining local context and WordNet similarity for word sense identification. In: Fellbaum, C. (ed.) WordNet: An Electronic Lexical Database, pp. 265–283. MIT Press, Cambridge (1998)
16. Zhong, J., Zhu, H., Li, J., Yu, Y.: Conceptual graph matching for semantic search. In: Priss, U., Corbett, D.R., Angelova, G. (eds.) ICCS 2002. LNCS (LNAI), vol. 2393, pp. 92–106. Springer, Heidelberg (2002)
17. Nguyen, H.A., Al-Mubaid, H.: New ontology-based semantic similarity measure for the biomedical domain. In: Proceedings of the IEEE International Conference on Granular Computing (GrC'06), pp. 623–628 (2006)
18. Resnik, P.: Using information content to evaluate semantic similarity in a taxonomy. In: Proceedings of the 14th International Joint Conference on Artificial Intelligence (1995)
19. Jiang, J.J., Conrath, D.W.: Semantic similarity based on corpus statistics and lexical taxonomy. In: International Conference on Research in Computational Linguistics (1997)
20. Lin, D.: An information-theoretic definition of similarity. In: 15th International Conference on Machine Learning, Madison, USA (1998)
21. Pesaranghader, A., Muthaiyah, S.: Definition-based information content vectors for semantic similarity measurement. In: Noah, S.A., Abdullah, A., Arshad, H., Abu Bakar, A., Othman, Z.A., Sahran, S., Omar, N., Othman, Z. (eds.) M-CAIT 2013. CCIS, vol. 378, pp. 268–282. Springer, Heidelberg (2013)
22. Pakhomov, S., McInnes, B., Adam, T., Liu, Y., Pedersen, T., Melton, G.: Semantic similarity and relatedness between clinical terms: an experimental study. In: Proceedings of AMIA, pp. 572–576 (2010)
