Semantic Cores for Representing Documents in IR
Mustapha Baziz*, Mohand Boughanem*, Nathalie Aussenac-Gilles*, Claude Chrisment*
*IRIT, Université Paul Sabatier, Campus universitaire Toulouse III, F-31062 Toulouse Cedex 4, France, +33 5 61 55 74 16
{baziz, boughane, aussenac, chrisme}@irit.fr
ABSTRACT
This paper deals with the use of ontologies in Information Retrieval. Roughly, the proposed approach consists in identifying important concepts in documents using two criteria, co-occurrence and semantic relatedness, and then disambiguating them via an external general-purpose ontology, namely WordNet. Matching the ontology and a document results in a set of scored concept-senses (nodes) with weighted links. This representation, called the semantic core of a document, best reveals the semantic content of the document. We regard our approach, whose first evaluation results are encouraging, as a short but solid step toward the long-term goal of Intelligent Indexing and Semantic Retrieval.
Categories and Subject Descriptors
H.3.3 [Information Search and Retrieval]: Search process, Retrieval models; H.3.1 [Content Analysis and Indexing].
General Terms Algorithms, Experimentation.
Keywords Information Retrieval, Semantic Representation of Documents, ontologies, WordNet.
1. INTRODUCTION
Ontology-based information retrieval (IR) approaches promise to increase the quality of responses, since they aim at capturing within computer systems some part of the semantics of documents, allowing for better information retrieval. Two key contributions can be distinguished: query reengineering and document representation. In both cases, the expected advantage is a richer and more precise representation of meaning, leading to a more accurate identification of the relevant documents matching a query. Concepts from the
ontology that are semantically related to the query can be added in order to select more documents, namely those using a different vocabulary but dealing with the same topic [1], [2], [3]. In document representation, known as semantic indexing and defined for example in [2] and [5], the key issue is to identify appropriate concepts that describe and characterize the document content. The challenge is to make sure that irrelevant concepts are not kept and that relevant concepts are not discarded. In this paper, we propose an automatic mechanism for selecting these concepts from documents using a general-purpose ontology, namely the WordNet lexical database [6]. We propose to build a semantic network representing the semantic content of a document. The semantic network we refer to here is the one defined in [7]: "a semantic network is broadly described as any representation interlinking nodes with arcs, where the nodes are concepts and the links are various kinds of relationships between concepts". The resulting scored concepts can then be used in the indexing and searching phases of an Information Retrieval System (IRS).
2. THE USE OF SEMANTICS IN INFORMATION RETRIEVAL
Recently, several approaches have been proposed to use semantics in IR. In semantic-based IR, sets of words, names and noun phrases are mapped onto the concepts they encode [12]. In this model a document is represented as a set of concepts; to this end, a crucial component is a semantic structure for mapping document representations onto concepts. A semantic structure can be represented using distinct data structures: trees, semantic networks, conceptual graphs, etc. This includes dictionaries, thesauri and ontologies [8]. They can be generated manually or automatically, or they may pre-exist. WordNet and EuroWordNet are examples of (thesaurus-based) ontologies widely employed to improve the effectiveness of IR systems. [20] and [21] proposed methods to extract terminology and word relations from domain data and web sites. In [9], the authors propose an indexing method based on the use of WordNet synsets: the vector space model is used, and synsets constitute the indexing space instead of word forms. In [2] and [10], an approach connecting concepts from ontology "regions" to those in documents is described. The authors propose a method for measuring the similarity of concepts via the semantic network of the ontology. Similarity between concepts in semantic networks has already been studied by researchers, and various methods and metrics have been proposed [11], [12], [13], [7] for disambiguating words in text. In our case, the goal is not primarily Word Sense Disambiguation (WSD); for a complete survey of WSD, we refer the interested reader to the Senseval home page (http://www.senseval.org). However, we were inspired by Pedersen's approach described in [13], which uses an adapted Lesk [14] algorithm based on gloss overlaps between concepts (WordNet synsets) to compute semantic similarity between concept-senses.
3. MATCHING A DOCUMENT WITH AN ONTOLOGY
The approach we propose, as schematized in Figure 1, consists of three main steps.
[Figure 1. Overall scheme of the approach: the terms of a document (t1, t2, t3, …) are matched against the semantic network of the ontology (concepts C1 … Cn) to produce the semantic representation, through (1) term extraction from the document and concept identification, (2) computation of similarity measures between concepts and concept disambiguation, and (3) construction of the semantic core of the document from the disambiguated concepts (nodes).]
In the first step, given a document, terms (mono- and multiword) corresponding to entries in the ontology are extracted, and their frequencies are computed using a variant of TF.IDF. In step 2, the ontology's semantic network is used, first to identify the different meanings of each extracted concept (senses, named synsets in WordNet terminology) and then to compute relatedness measures between these senses (concept-senses). Finally, in step 3, the document's semantic core is built. It contains the group of concepts from the document that best covers the topic expressed in the document.
3.1 Concept Extraction
Before any document processing, and in particular before pruning stop words, an important step for what follows consists in extracting from texts the mono- and multiword concepts that correspond to nodes in the ontology. The recognized concepts can be named entities like united_kingdom_of_great_britain_and_northern_ireland or phrases like pull_one's_weight. For this purpose, we use an ad hoc technique that relies solely on the concatenation of adjacent words to identify compound concepts of the ontology. Two alternative ways can be distinguished. The first consists in projecting the ontology onto the document, by extracting all multiword concepts (compound terms) from the ontology and then identifying those occurring in the document. This method has the advantage of being fast and yields a reusable resource even if the corpus changes. Its drawback is that it may miss concepts which appear in the source text and in the ontology under different forms. For example, if the ontology contains the compound concept solar battery, a simple comparison will not recognize the same concept appearing in the text in its plural form, solar batteries. The second way, which we adopt in this paper, follows the reverse path, projecting the document onto the ontology: for each multiword concept candidate formed by combining adjacent words of the text, we first query the ontology using these words just as they are, and then using their base forms if necessary. This resolves the word-form problem.
Concerning word combination, we select the longest term for which a concept is detected. If we consider the example shown in Figure 2, the sentence contains three different concepts: external oblique muscle, abdominal muscle and abdominal external oblique muscle.
The first concept, abdominal muscle, is not identified because its words are not adjacent. The second, external oblique muscle, and the third, abdominal external oblique muscle, are synonyms: they belong to the same WordNet synset (also designated by the term node), whose definition is: external oblique muscle, musculus obliquus externus abdominis, abdominal external oblique muscle, oblique -- (a diagonally arranged abdominal muscle on either side of the torso)
The selected concept is associated with the longest multiword, abdominal external oblique muscle, which corresponds to the correct sense in the sentence. Note that in word combination the left-to-right order must be respected; otherwise we would be confronted with the syntactic variation problem (science library is different from library science).
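To make the projection concrete, here is a minimal sketch of the longest-match extraction described above, written against NLTK's WordNet interface (the paper used WordNet 2.0 directly; NLTK ships a later WordNet, so individual lookups may differ). The function names and the MAX_LEN bound are illustrative, not from the paper.

```python
# Sketch of longest-match concept extraction: slide left to right over the
# tokens, preferring the longest run of adjacent words that is an ontology
# entry, trying surface forms first and base forms (morphy) as a fallback.
from nltk.corpus import wordnet as wn

MAX_LEN = 9  # multiword entries in WordNet span up to 9 words

def lookup(words):
    """True if the word sequence is a WordNet entry, as-is or lemmatized."""
    if wn.synsets("_".join(words)):
        return True
    bases = [wn.morphy(w) or w for w in words]
    return bool(wn.synsets("_".join(bases)))

def extract_concepts(tokens):
    concepts, i = [], 0
    while i < len(tokens):
        for n in range(min(MAX_LEN, len(tokens) - i), 0, -1):
            window = tokens[i:i + n]
            if lookup(window):
                concepts.append("_".join(window))
                i += n          # consume the matched words
                break
        else:
            i += 1              # no entry starts here, skip this word
    return concepts

print(extract_concepts("the abdominal external oblique muscle".split()))
# expected: ['abdominal_external_oblique_muscle'] (the longest match wins)
```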
3.1.1 Why Using Multiword Concepts?
[Figure 2. Example of text with different concepts: the fragment "the abdominal external oblique muscle", in which the concepts external oblique muscle, abdominal muscle and abdominal external oblique muscle overlap.]
The extraction of multiword concepts in plain text is important to reduce ambiguity. These concepts are generally monosemous even though the words they contain may individually be ambiguous. For example, taking each word of the concept ear_nose_and_throat_doctor separately, we have to disambiguate between 5 senses for the word ear, 13 senses (7 for the noun and 6 for the verb) for nose, 3 senses for throat (and is not counted because it is a stop word) and 7 senses (4 for the noun and 3 for the verb) for doctor. We would thus have 5 × 13 × 3 × 7 = 1,365 possible combinations of candidate senses. But when considering all the words as forming
a single multiword concept (which must, of course, belong to the ontology), we have only one sense. In this example, the full concept (WordNet synset) and its definition (gloss in WordNet) are: 1. ENT man, ear-nose-and-throat doctor, otolaryngologist, otorhinolaryngologist, rhinolaryngologist -- (a specialist in the disorders of the ear or nose or throat.)
Statistics we computed on WordNet 2.0, shown in Table 1, indicate that out of a total of 63,218 multiword concepts (composed of 2 to 9 words), 56,286 (89%) are monosemous, 6,238 (9.867%) have 2 senses, and only 694 (0.506%) have more than 2 senses. Thus, the more multiword concepts a document to be analyzed contains, the easier their disambiguation is.
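As a quick empirical check of this monosemy argument, one can count senses with NLTK's WordNet; note that the figures above are for WordNet 2.0, while NLTK ships WordNet 3.x, so the exact counts may differ slightly:

```python
# Individual words are polysemous; the full multiword entry is not.
from nltk.corpus import wordnet as wn

for word in ("ear", "nose", "throat", "doctor"):
    print(word, len(wn.synsets(word)), "senses")
print("ear-nose-and-throat_doctor",
      len(wn.synsets("ear-nose-and-throat_doctor")), "sense(s)")
```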
3.1.2 Weighting Concepts
The extracted concepts are then weighted according to a kind of tf.idf that we call cf.idf. For a concept c composed of n words, its frequency in a document combines the number of occurrences of the concept itself with length-weighted counts of all its sub-concepts. Formally:
$$cf(c_i) = count(c_i) + \sum_{sc \in sub(c_i)} \frac{length(sc)}{length(c_i)} \, count(sc) \qquad (1)$$
where length(ci) represents the number of words of ci and sub(ci) is the set of all sub-concepts that can be derived from ci. Example: for the concept "elastic potential energy", composed of 3 words, the frequency is computed as: cf("elastic potential energy") = count("elastic potential energy") + 2/3 count("potential energy") + 1/3 count("elastic") + 1/3 count("potential") + 1/3 count("energy").
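A minimal sketch of this computation, together with the document-frequency weighting of equation (2) given just below; the counts and names here are illustrative, not taken from the paper's collection:

```python
import math

# Equation (1): a concept's frequency combines its own occurrence count
# with length-weighted counts of its sub-concepts. `count` maps each
# (sub-)concept, as a tuple of words, to its raw count in the document.
def cf(concept, sub_concepts, count):
    total = count.get(concept, 0)
    for sc in sub_concepts:
        total += len(sc) / len(concept) * count.get(sc, 0)
    return total

# Equation (2): cf.idf weighting; null when the concept is in every document.
def weight(cf_value, df, n_docs):
    return cf_value * math.log(n_docs / df)

# Worked example from the text, with made-up counts:
count = {("elastic", "potential", "energy"): 1, ("potential", "energy"): 1,
         ("elastic",): 2, ("potential",): 1, ("energy",): 3}
subs = [("potential", "energy"), ("elastic",), ("potential",), ("energy",)]
c = cf(("elastic", "potential", "energy"), subs, count)
print(c)                              # 1 + 2/3 + 2/3 + 1/3 + 1 = 3.67
print(weight(c, df=40, n_docs=1000))  # kept if above the chosen threshold
```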
Although potential energy is itself a multiword concept, we decided to use its occurrence count rather than its frequency. Other methods for computing concept frequencies have been proposed in the literature; they generally use statistical and/or syntactic analysis [15], [16]. Roughly, they add single-word frequencies, multiply them, or multiply the number of concept occurrences by the number of single words belonging to the concept. In our case, we used the weighting method of equation (1). The global weight of a concept ci in a document dj is:
$$Weight(c_i, d_j) = cf(c_i) \cdot \ln(N / df) \qquad (2)$$
where N is the total number of documents and df (document frequency) is the number of documents in which the concept occurs. If a concept occurs in all documents, its weight is null. Only concepts whose weight exceeds a threshold are kept. Once the most important concepts are extracted from a document, they are used to build up the semantic core of this document. The link between two nodes is a condensed value resulting from the comparison of the two concept-senses (nodes) using several semantic relations. It has no precise meaning in itself but expresses the semantic closeness of the two concept meanings. In the next section, we introduce the method for computing semantic relatedness between concepts using the WordNet semantic network.
3.2 Computing Semantic Relatedness between Concepts
Given a set of n relations from the ontology, R = {R1, R2, …, Rn}, and two concepts Ck and Cl with assigned senses j1 and j2, noted S_{j1}^k and S_{j2}^l, the semantic relatedness (overlaps) between S_{j1}^k and S_{j2}^l, noted P_{kl}(S_{j1}^k, S_{j2}^l) or overlaps(S_{j1}^k, S_{j2}^l), is defined as follows:

$$P_{kl}(S_{j_1}^k, S_{j_2}^l) = \sum_{(i,j) \in \{1, \dots, n\}^2} \left| R_i(S_{j_1}^k) \cap R_j(S_{j_2}^l) \right| \qquad (3)$$
It represents the adapted Lesk intersection (the number of common words, squared in the case of runs of successive words) between the strings returned by all relations of R when applied to the concept-senses S_{j1}^k and S_{j2}^l. For example, suppose S_{j1}^k = applied_science#1, sense 1 of the concept applied_science, and S_{j2}^l = information_science#1, sense 1 of the concept information_science, with R1 = gloss and R2 = holonym-gloss as the known relations (gloss and holonymy are defined below). The relatedness between the two concepts computed with R1 and R2 equals 11: R1(applied_science#1) = (the discipline dealing with the art or science of applying scientific knowledge to practical problems; "he had trouble deciding which branch of engineering to study") R2(information_science#1) = (the branch of engineering science that studies (with the aid of computers) computable processes and structures).
R1(applied_science#1) ∩ R2(information_science#1) = 1 × "science" + 1 × "branch of engineering" + 1 × "study" = 1 + 3² + 1 = 11. The Ri relations depend on those available in the ontology. In our case (WordNet), we used the following relations:
− gloss, the definition of a concept, possibly with specific example(s) from the real world. For example, the gloss of sense 1 of the noun car (car#n#1) is: (4-wheeled motor vehicle; usually propelled by an internal combustion engine; "he needs a car to get to work"). The real-world example is given between quotation marks.
− hypernymy for the is-a relation (physics is-a natural science) and hyponymy, its reverse relation;
− meronymy for the has-a relation (car has-a {accelerator, pedal, car seat}) and holonymy, its reverse relation;
− domain relations: domain-specific relations may be relevant to assess semantic relatedness between concepts. [17] used domain relations extracted with a supervised method. In our case we used an automatic method, since these relations have recently been added to WordNet (version 2.0). We used six domain relations: domain-category (dmnc: e.g. acropetal → botany, phytology), domain-usage (dmnu: bloody → intensifier), domain-region (dmnr: sumo → Japan), and the member-of-domain relations for nouns: member of domain-category (dmtc: matrix_algebra → diagonalization, diagonalisation), member of domain-usage (dmtu: idiom → euphonious, spang), member of domain-region (dmtr: Manchuria → Chino-Japanese_War, Port Arthur).
This relatively high number of relations helps to cover most aspects of the links two concepts may have. As for gloss, it is not in itself a relation but rather a concept definition [12]. It is used to exploit the entire semantic nets of WordNet
regardless of parts of speech¹, and it is applied to all 10 remaining relations in order to compute the overlaps between concepts and detect hidden links between them. To illustrate how the relations and gloss are used to detect hidden semantic links, consider the two concepts Henry Ford and car. A priori, there is no explicit semantic relation between them in WordNet. But when gloss is applied, for example, to the synonymy and hyponymy relations, we find word overlaps that are indicative of the strong relation between the two concepts: there is one overlap between the gloss of the first concept (Henry Ford) and the synset of the second (car), namely the term "automobile". And when applying gloss to the hyponymy relation, there is, on the one hand, a very interesting overlap between the synset of Henry Ford and the gloss of a hyponym of the concept car, namely the compound "Henry Ford"; and, on the other hand, overlaps between the gloss of Henry Ford and the gloss of a hyponym of car, namely "automobile" and "mass". (Only one of the 30 hyponyms returned by WordNet for sense 1 of car is considered in this example.)
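The following sketch shows how the overlap score of equation (3) can be computed, again over NLTK's WordNet. Only three relation outputs (gloss, hypernym glosses, part-holonym glosses) are expanded rather than the ten relations listed above, and no stop-word filtering or stemming is applied; all names are illustrative.

```python
from nltk.corpus import wordnet as wn

def overlap_score(a, b):
    """Adapted Lesk overlap: repeatedly take the longest run of words
    common to both strings and score it by its squared length, so that
    "branch of engineering" counts 3**2 = 9, as in the example above."""
    a, b = a.split(), b.split()
    best = None
    for i in range(len(a)):
        for j in range(len(b)):
            k = 0
            while i + k < len(a) and j + k < len(b) and a[i + k] == b[j + k]:
                k += 1
            if k and (best is None or k > best[0]):
                best = (k, i, j)
    if best is None:
        return 0
    k, i, j = best
    return k * k + overlap_score(" ".join(a[:i] + a[i + k:]),
                                 " ".join(b[:j] + b[j + k:]))

def expansions(synset):
    """Strings produced by a few WordNet relations applied to one
    concept-sense (a subset of the ten relations listed above)."""
    return ([synset.definition()]                               # gloss
            + [s.definition() for s in synset.hypernyms()]      # hypernym glosses
            + [s.definition() for s in synset.part_holonyms()]) # holonym glosses

def relatedness(s1, s2):
    """Equation (3): sum the overlaps over all pairs of relation outputs."""
    return sum(overlap_score(t1, t2)
               for t1 in expansions(s1) for t2 in expansions(s2))

s1 = wn.synsets("applied_science")[0]
s2 = wn.synsets("information_science")[0]
print(relatedness(s1, s2))   # exact value depends on the WordNet version
```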
3.3 Building a Semantic Network
Once the most important concepts have been extracted from a document, they are used to build semantic networks. Let
$$D_c = \{C_1, C_2, \dots, C_m\} \qquad (4)$$
be the set of concepts selected from a document D by the extraction method described in section 3.1. Concepts may be mono- or multiword, and each Ci has a certain number of senses, represented by WordNet synsets and noted Si:
$$S_i = \{S_1^i, S_2^i, \dots, S_n^i\} \qquad (5)$$
such that concept Ci has |Si| = n senses. When we choose one sense for each concept of Dc, we always obtain a set SN(j) of m elements, because each concept of Dc is guaranteed to have at least one sense, given that it belongs to the ontology's semantic network. We define a semantic network SN(j) as:
$$SN(j) = (S_{j_1}^1, S_{j_2}^2, S_{j_3}^3, \dots, S_{j_m}^m) \qquad (6)$$
It represents the j-th configuration of concept-senses of Dc; j1, j2, …, jm are sense indexes ranging between 1 and the number of senses of concepts C1, C2, …, Cm respectively. For the m concepts of Dc, different semantic networks can be constructed using all sense combinations. The number of possible semantic networks Nb_SN depends on the numbers of senses of the concepts of Dc:
$$Nb\_SN = |S_1| \cdot |S_2| \cdots |S_m| \qquad (7)$$
¹ WordNet is divided into four non-interacting semantic nets, one for each open word class (nouns, verbs, adjectives and adverbs). This separation is based on the fact that the appropriate semantic relations between the concepts represented by the synsets of different parts of speech are incompatible.
For example, Figure 3 represents a possible semantic network (S_2^1, S_7^2, S_1^3, S_1^4, S_4^5, S_2^m) resulting from the combination of the 2nd sense of the first concept, the 7th sense of C2, …, and the 2nd sense of Cm (null links are not represented).
[Figure 3. A semantic network built from one configuration of concept-senses: nodes S_2^1, S_7^2, S_1^3, S_1^4, S_4^5 and S_2^m, linked by relatedness values P_{kl}.]
To build the best semantic network representing a document, the selection of each of its nodes (concept-senses) relies on the computation of a score (C_score) for every concept-sense. The score of a concept-sense is the sum of the semantic relatedness values computed for all the couples in which it appears. For a concept Ci, the score of its sense number k is computed as:
$$C\_score(S_k^i) = \sum_{\substack{l \in [1..m],\; l \neq i \\ j \in [1..n_l]}} P_{i,l}(S_k^i, S_j^l) \qquad (8)$$
where m is the number of concepts of Dc and n_l is the number of WordNet senses of each Cl, as defined in equation (5). The best concept-sense to retain is then the one that maximizes C_score:
$$Best\_score(C_i) = \max_{k = 1..n} C\_score(S_k^i) \qquad (9)$$
By doing so, we have disambiguated the concept Ci, which becomes a node of the semantic core. The final semantic core of a document is (S_{j1}^1, S_{j2}^2, S_{j3}^3, …, S_{jm}^m), where the nodes correspond respectively to Best_score(C1), Best_score(C2), Best_score(C3), …, Best_score(Cm).
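Putting equations (8) and (9) together, a compact sketch of the disambiguation step might look as follows; the concept list, names, and the crude stand-in relatedness are illustrative, and the adapted-Lesk relatedness() sketched in section 3.2 could be substituted for it. Note that scoring each sense independently avoids enumerating the Nb_SN full configurations of equation (7):

```python
from nltk.corpus import wordnet as wn

def gloss_overlap(s1, s2):
    """Crude stand-in for equation (3): unordered word overlap between
    glosses. Drop in the relatedness() of section 3.2 instead."""
    return len(set(s1.definition().split()) & set(s2.definition().split()))

def build_semantic_core(concepts, relatedness=gloss_overlap):
    """Equations (8) and (9): score each candidate sense of a concept by
    its summed relatedness to every sense of every other concept, then
    keep the argmax sense per concept."""
    senses = {c: wn.synsets(c) for c in concepts}
    core = {}
    for ci, s_i in senses.items():
        def c_score(sense):                      # equation (8)
            return sum(relatedness(sense, other)
                       for cl, s_l in senses.items() if cl != ci
                       for other in s_l)
        core[ci] = max(s_i, key=c_score)         # equation (9)
    return core

core = build_semantic_core(["geophysics", "earth", "atmosphere"])
for concept, sense in core.items():
    print(concept, "->", sense.name(), "|", sense.definition())
```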
4. EVALUATION
4.1 Evaluation Method
We evaluated our approach in Information Retrieval (IR). The documents' semantic cores are used for semantic indexing, and the search results are compared to those returned using classical keyword indexing. Three cases were tested:
[case 1] Keyword-based indexing: in this classical method, no compound concepts are used and all weights are computed with the TF.IDF formula, for documents and queries alike.
[case 2] Concept-based indexing: only the nodes (concept-senses) of the semantic core of each document are used. The CF.IDF formula with a threshold of 2 is used to select concepts. The C_score values are used to weight concepts in documents and queries; however, their logarithm is taken in order to attenuate overly large variations. The link values between concepts in the resulting semantic core of each document are not used.
[case 3] Concepts + keywords indexing: here we combine the two previous methods. A dictionary is first generated using simple keyword indexing (case 1), and then the semantic core of each document is generated (case 2). Two cases may arise when adding the concept-senses of the semantic core nodes: either a concept-sense is a multiword, and it is then added directly to the dictionary with the log of its C_score as weight; or a concept-sense is a single word (i.e., it is already indexed by the classical indexing method), in which case we simply replace its TF.IDF weight by the log of its C_score value. An example query from the test set is: Query 29: Heparin induced thrombocytopenia, diagnosis and management.
The experimental method follows the one used in the TREC campaigns [19]. For each query, the first 1000 documents retrieved by the search engine are returned, and precision is computed at different cut-off points. We chose to use a small collection because of the computational complexity: building the semantic core of one document takes one minute on average (the overall collection takes about 5 days). In Figure 6, the search precision is given at different points: 5, 10, 15, 20, 30, 100 and 1000 top selected documents, plus Avg-Pr, the overall average precision. At each point, the average precision over the 25 queries is given for the three cases: keyword-based indexing, concepts + keywords indexing and concepts-only indexing. The results clearly show the advantage of combining concept-based indexing with keyword indexing to enhance system accuracy. Roughly, precision is higher with concepts + keywords indexing at all precision points, and more so when considering only the first selected documents. For example, the precision over the 5 top selected documents is 0.3440 with simple keyword indexing and 0.232 with concepts-only indexing; it rises to 0.432 (+20%) when the concepts + keywords indexing method is used. The average precision equals 0.2581 with concepts + keywords indexing, 0.2230 with classical keyword indexing, and only 0.105 when only the concepts (nodes of the semantic core) are used. We can conclude that combining concept indexing derived from document semantic cores with classical keyword indexing is the best way to enhance retrieval accuracy, whereas concept-only indexing is not sufficient to obtain good results.
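A small sketch of the [case 3] merge described above, under the assumption that the keyword index is a simple term-to-weight map; the data and names are illustrative, not from the paper:

```python
import math

# Start from the keyword index and overlay the semantic-core nodes:
# multiword concept-senses become new dictionary entries, single-word
# ones replace the TF.IDF weight of the word they came from.
def merge_index(keyword_index, semantic_core):
    """keyword_index: {term: tfidf}; semantic_core: {concept_sense: c_score}."""
    index = dict(keyword_index)
    for sense, c_score in semantic_core.items():
        w = math.log(c_score)          # attenuate large C_score variations
        word = sense.split("#")[0]     # e.g. 'surface#n#2' -> 'surface'
        if "_" in word:
            index[sense] = w           # multiword: new dictionary entry
        else:
            index[word] = w            # single word: overwrite its weight
    return index

print(merge_index({"surface": 3.1, "deep": 1.2},
                  {"surface#n#2": 3211.0, "earth_science#n#1": 1317.0}))
```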
Figure 6. Search results for keyword-only, concepts + keywords and concepts-only indexing.
We can also remark that the difference between concepts + keywords and keyword-only indexing shrinks as the number of selected documents increases. This can be explained by the total number of relevant documents, which is relatively small (even if it varies from one topic to another) given the size of the collection.
5. CONCLUSION
The work presented in this paper lies within the scope of applying ontologies to the information retrieval field. We have introduced an approach for representing the semantic content of documents. The result is a semantic network in which nodes represent disambiguated concepts and edges carry the value of the semantic similarity between nodes. Different semantic relations from WordNet are used; roughly, these relations denote domain, generalization, specialization and composition links. The approach can be generalized to any ontology whose structure includes a semantic network as defined in [7]: broadly, a set of concepts and semantic relations between these concepts.
6. REFERENCES
[1] OntoQuery project net site: http://www.ontoquery.dk
[2] Khan, L., and Luo, F. Ontology Construction for Information Selection. In Proc. of the 14th IEEE International Conference on Tools with Artificial Intelligence, pp. 122-127, Washington DC, November 2002.
[3] Guarino, N., Masolo, C., and Vetere, G. OntoSeek: content-based access to the web. IEEE Intelligent Systems, 14:70-80, 1999.
[4] Boughanem, M., Dkaki, T., Mothe, J., and Soulé-Dupuy, C. Mercure at TREC-7. In Proceedings of TREC-7, 1998.
[5] Mihalcea, R., and Moldovan, D. Semantic indexing using WordNet senses. In Proceedings of the ACL Workshop on IR & NLP, Hong Kong, October 2000.
[6] Miller, G. WordNet: A lexical database. Communications of the ACM, 38(11):39-41, 1995.
[7] Lee, J. H., Kim, M. H., and Lee, Y. J. Information retrieval based on conceptual distance in IS-A hierarchies. Journal of Documentation, 49(2):188-207, June 1993.
[8] Haav, H. M., and Lubi, T.-L. A Survey of Concept-based Information Retrieval Tools on the Web. In Proc. of the 5th East-European Conference ADBIS*2001, Vol. 2, Vilnius "Technika", pp. 29-41.
[9] Gonzalo, J., Verdejo, F., Chugur, I., and Cigarrán, J. Indexing with WordNet synsets can improve text retrieval. In Proc. of the COLING/ACL '98 Workshop on Usage of WordNet for Natural Language Processing, 1998.
[10] Cucchiarelli, A., Navigli, R., Neri, F., and Velardi, P. Extending and Enriching WordNet with OntoLearn. In Proc. of the Second Global WordNet Conference (GWC 2004), Brno, Czech Republic, January 20-23, 2004.
[11] Hirst, G., and St-Onge, D. Lexical chains as representations of context for the detection and correction of malapropisms. In C. Fellbaum, editor, WordNet: An Electronic Lexical Database, pages 305-332. MIT Press, 1998.
[12] Resnik, P. Semantic Similarity in a Taxonomy: An Information-Based Measure and its Application to Problems of Ambiguity in Natural Language. Journal of Artificial Intelligence Research (JAIR), 11, pp. 95-130, 1999.
[13] Banerjee, S., and Pedersen, T. An adapted Lesk algorithm for word sense disambiguation using WordNet. In Proc. of the Third International Conference on Intelligent Text Processing and Computational Linguistics, Mexico City, February 2002.
[14] Lesk, M. Automatic sense disambiguation using machine readable dictionaries: How to tell a pine cone from an ice cream cone. In Proc. of SIGDOC '86, 1986.
[15] Croft, W. B., Turtle, H. R., and Lewis, D. D. The Use of Phrases and Structured Queries in Information Retrieval. In Proceedings of the 14th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Chicago, Illinois, pp. 32-45, 1991.
[16] Huang, X., and Robertson, S. E. Comparisons of Probabilistic Compound Unit Weighting Methods. In Proc. of the ICDM'01 Workshop on Text Mining, San Jose, USA, November 2001.
[17] Magnini, B., and Cavaglia, G. Integrating Subject Field Codes into WordNet. In Proc. of the 2nd International Conference on Language Resources and Evaluation (LREC 2000), Athens, 2000.
[18] Buitelaar, P., Steffen, D., Volk, M., Widdows, D., Sacaleanu, B., Vintar, S., Peters, S., and Uszkoreit, H. Evaluation Resources for Concept-based Cross-Lingual IR in the Medical Domain. In Proc. of LREC 2004, Lisbon, Portugal, May 2004.
[19] Voorhees, E. M., and Harman, D. K., editors. The Sixth Text REtrieval Conference (TREC-6). Gaithersburg, MD: NIST, 1998.
[20] Vossen, P. Extending, Trimming and Fusing WordNet for Technical Documents. NAACL 2001 Workshop on WordNet and Other Lexical Resources, Pittsburgh, July 2001.
[21] Maedche, A., and Staab, S. Semi-automatic Engineering of Ontologies from Text. In Proceedings of the Twelfth International Conference on Software Engineering and Knowledge Engineering, 2000.
APPENDIX
A concrete example. The document:
geophysics is the study of the earth and its atmosphere by physical measurements using a combination of mathematics and physics along with electrical_engineering computer_science and geology and other earth sciences the geophysicist analyzes measurements taken at the surface to infer properties and processes deep within the earths complex interior. with its use of physics mathematics geology and other earth sciences electrical_engineering and computer_science geophysics can be truly called an applied and interdisciplinary science it combines these different sciences to study the earths properties and processes from the atmosphere and oceans to the shallow subsurface and deep interior down to its central core except for boreholes which penetrate only a small fraction of the outer surface the earths interior cannot be directly observed all that_is known about the history continuing changes and distribution of resources of the earths interior must be inferred through detective_work based_on mastering and applying the sciences and technologies listed above because the earth supplies societys material needs and is the repository of its used products and home to all its inhabitants the wide range and importance of this field are apparent on the applied side oil companies and mining firms use the exploratory skills of geophysicists to locate deeply hidden resources geophysicists assess the strength of the earths surface when sites are chosen for large engineering and waste management operations on the theoretical side geophysicists try to understand such earth processes as heat distribution and flow gravitational magnetic and other force_fields and vibrations and other disturbances within the earths interior between the pure and the applied is seismology the branch of geophysics that studies earthquake causes and predictability
a) Concept extraction. The most important concepts, with their frequencies, are as follows: deep#n#2 = 88 geophysicist#n#1 = 274 science#n#1 = 1867 earth_science#n#1 = 1335 resource#n#1 = 192 infer#v#4 = 126 computer_science#n#1 = 1588 mathematics#n#1 = 800 earth#n#3 = 779 electrical_engineering#n#1 = 1514 physics#n#1 = 926 study#n#6 = 2351
process#n#2 = 452 geology#n#1 = 1065 surface#n#2 = 3198 measurement#n#1 = 466 apply#v#1 = 442 side#n#5 = 3041 geophysics#n#1 = 1007 interior#n#2 = 2367 distribution#n#2 = 203 property#n#3 = 625 atmosphere#n#3 = 1081
For example: cf("earth_science") = count("earth science") + 1/2 count("earth") + 1/2 count("science") = 2 + 9/2 + 3/2 = 8.
b) Semantic relatedness. Semantic relatedness values are calculated by combining all the concept senses using the 10 relations:
overlaps(distribution#n#3, study#n#9) = 2
overlaps(distribution#n#2, side#n#3) = 4
overlaps(process#n#6, measurement#n#1) = 14
overlaps(process#n#2, infer#v#5) = 1
overlaps(science#n#1, mathematics#n#1) = 400
overlaps(electrical_engineering#n#1, study#n#6) = 598
overlaps(mathematics#n#1, computer_science#n#1) = 19
overlaps(geology#n#1, computer_science#n#1) = 33
overlaps(atmosphere#n#1, side#n#2) = 2
overlaps(study#n#6, computer_science#n#1) = 599
… For example, the 3rd line means that the relatedness between sense 6 of the noun process (whose frequency is 3) and sense 1 of measurement (whose frequency is 2) equals 14.
c) Selecting the best semantic network. For each concept, the sense with the best cumulated score is retained as the appropriate one.
For the term atmosphere, for example, the 6 senses in WordNet are as follows:
1. (18) atmosphere, ambiance, ambience -- (a particular environment or surrounding influence; "there was an atmosphere of excitement")
2. (10) standard atmosphere, atmosphere, atm, standard pressure -- (a unit of pressure: the pressure that will support a column of mercury 760 mm …)
3. (9) atmosphere, air -- (the mass of air surrounding the Earth; "there was great heat as the comet entered the atmosphere"; "it was exposed to the air")
4. (6) atmosphere, atmospheric state -- (the weather or climate at some place; "the atmosphere was thick with fog")
5. (4) atmosphere -- (the envelope of gases surrounding any celestial body)
6. air, aura, atmosphere -- (a distinctive but intangible quality surrounding a person or thing; "an air of mystery"; "the house had a neglected air"; ….)
The selected sense, sense 3, is indeed the appropriate one. The ambiguity problem does not arise for compound concepts like earth_science, computer_science and electrical_engineering, since they have only one sense in WordNet.
Finally, the semantic core corresponding to the document content is as shown in the scheme below; only links with values ≥ 30 are drawn, for readability. The concept-senses (nodes) can be grouped into sub-topics according to their C_scores. We can remark, for example, that the group of nodes {computer_science#n#1, physics#n#1, science#n#1, earth_science#n#1, geology#n#1, mathematics#n#1, electrical_engineering#n#1, geophysics#n#1} have scores around 1000: they effectively belong to the same topic (science and applied science). The same remark applies to the nodes {side#n#5, interior#n#2, surface#n#2}.
[Scheme: graph of the resulting semantic core. The nodes are the disambiguated concept-senses listed above (surface#n#2, side#n#5, study#n#6, interior#n#2, science#n#1, computer_science#n#1, earth_science#n#1, electrical_engineering#n#1, geophysics#n#1, physics#n#1, geology#n#1, mathematics#n#1, atmosphere#n#3, earth#n#3, property#n#3, measurement#n#1, process#n#2, resource#n#1, distribution#n#2, geophysicist#n#1, infer#v#4, deep#n#3, apply#v#1), with edges weighted by their relatedness values; only values ≥ 30 are shown.]
Scheme. The resulting semantic core for the example document.