Quality of semantic compression in classification

Dariusz Ceglarek¹, Konstanty Haniewicz², and Wojciech Rutkowski³

¹ Poznan School of Banking, Poland, [email protected]
² Poznan University of Economics, Poland, [email protected]
³ Business Consulting Center, Poland, [email protected]

Abstract. This article presents the results of an implementation of semantic compression for English. The idea of semantic compression is reintroduced with examples, and the steps taken to perform the experiment are given. The task of re-engineering available structures so that they can be applied to the already existing project infrastructure is described. The experiment demonstrates the validity of the research, along with real examples of semantically compressed documents. Key words: semantic compression, semantic network, WiSENet, clustering, natural language processing

1 Introduction

The aim of this work is to present an implementation of semantic compression, an idea introduced in [1], for English. Its main contributions are the application of the technique to a new language and the introduction of a new semantic net. This has been achieved by re-engineering WordNet into a data structure used for disambiguation resolution and by building a set of English domain frequency dictionaries. The research was motivated by the good results already achieved for Polish [2]. In order to reach a broader spectrum of peers and demonstrate the usability of semantic compression, the authors decided to introduce semantic compression for the English language. For completeness, the notion of semantic compression is reintroduced below; a detailed discussion of its place in Information Retrieval systems can be found in [1]. A discussion of the necessary adjustments to existing, widely used resources precedes the description of the experiment and the presentation of its results. The work is divided into the following sections: a description of semantic compression, a description and discussion of the semantic net in the SenecaNet format, the process of transferring WordNet into the SenecaNet format (the result is denoted as WiSENet), an evaluation experiment using semantic compression for English, and conclusions and future work.

2 Semantic compression and its applications

Consistent with what was stated in the introductory section of this article, the authors decided to reintroduce semantic compression. From now on, any reference to it is to be understood in the spirit of the following definition.

Definition 1. Semantic compression is a technique that transforms a text fragment so that it retains a similar meaning while using less detailed terms, with the minimization of information loss as an imperative.

The most important idea behind semantic compression is that it reduces the number of words used to describe an idea. As a consequence, semantic compression allows one to identify a common thought in seemingly different communication vessels, and it reduces the number of word-vector dimensions in the methods involved, making them more efficient. Semantic compression for English has been made possible through the adoption of WordNet [11] and its adjustment to the tools already crafted for SEIPro2S. The details of and motivation for transferring WordNet to the SenecaNet format are discussed in a separate section.

As stated in previous works, semantic compression can be perceived as an advanced technique for replacing unnecessary terms in a processed text with more general ones. Unnecessary terms are understood as words that are too specialised for the given text [10]. Such terms can be replaced by more general ones that fit into the text's domain. This cannot be achieved without prior domain research and the definition of domain frequency dictionaries. Hence the need for a semantic net storing relations among words (with special focus on hypernymy and synonymy), and for domain frequency dictionaries with which to measure which words can be freely generalised without visible information loss. These two crucial elements are explored in greater detail after the mechanism of semantic compression has been pictured.

To further visualize semantic compression, consider the following artificial, demonstration-purpose reference example. First, consider sentence A0 and compare it with sentence A1. Then proceed to sentences B0 and B1. Finally, compare sentences A1 and B1.

Sentence A0: Our cherished color television and fridge went under the hammer as the family tried to make ends meet.
Sentence A1: Our loved colour television and fridge was sold as the family tried to survive.
Sentence B0: Our treasured color TV set and refrigerator was traded so that our household could carry on through.
Sentence B1: Our loved colour television and fridge was sold so that our family could survive.

The comparison demonstrates that sentences can easily be matched when one is able to transform them using semantic compression (sentences A1 and B1 are the compressed versions). To yield the best results, matching should be based on an algorithm resembling the classic bag of words [12].
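To make the mechanism concrete, the following minimal Python sketch shows the two moving parts just described: a generalisation step driven by a hypernymy map and a domain frequency dictionary, followed by a bag-of-words comparison of the compressed fragments. The hypernym map, the frequencies and the threshold are toy values invented for this illustration; the real system uses WiSENet and corpus-derived domain dictionaries instead.

from collections import Counter

# Toy stand-ins for WiSENet (hypernymy) and a domain frequency dictionary.
HYPERNYM = {
    "cherished": "loved",
    "treasured": "loved",
    "refrigerator": "fridge",
    "household": "family",
    "traded": "sold",
}
DOMAIN_FREQ = Counter({
    "loved": 50, "fridge": 40, "family": 90, "sold": 70,
    "cherished": 3, "treasured": 2, "refrigerator": 8,
    "household": 12, "traded": 9, "television": 60,
})
MIN_FREQ = 20  # terms rarer than this in the domain are generalised

def compress(tokens):
    """Replace too-specialised terms with more general hypernyms."""
    out = []
    for term in tokens:
        # Climb the hypernymy chain while the term is too rare.
        while DOMAIN_FREQ[term] < MIN_FREQ and term in HYPERNYM:
            term = HYPERNYM[term]
        out.append(term)
    return out

def bag_of_words_similarity(a, b):
    """Share of tokens the two bags have in common."""
    ca, cb = Counter(a), Counter(b)
    return sum((ca & cb).values()) / max(len(a), len(b))

raw_a = "our treasured television and refrigerator was traded".split()
raw_b = "our cherished television and fridge was sold".split()
print(bag_of_words_similarity(raw_a, raw_b))                      # ~0.57
print(bag_of_words_similarity(compress(raw_a), compress(raw_b)))  # 1.0

Before compression the two fragments share little beyond function words and "television"; after compression they become identical, which is exactly the effect exploited by the matching step.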

The authors strongly believe that the given example is clear enough to convey the basic mechanism of semantic compression. Real-life examples are given in the section devoted to the evaluation experiment. Readers who wish to delve further into semantic compression and its place in Information Retrieval systems are referred to the more detailed description given in [2].

As one is ready to observe, semantic compression has to impose some information loss. It is a result of the generalisation process, which replaces less frequent terms with their more frequent hypernyms. Domain corpora are the resource on which the domain frequency dictionaries, of great value in the generalisation process, are founded. There is no predefined level of generalisation that yields the greatest effects; it has to be computed experimentally for a given set of documents. Exemplary results of semantic compression for Polish can be found in [2]; results for English are summarised and discussed later in this work.

A number of applications of semantic compression are possible. For the authors of this work, an interesting real-world usage scenario is checking whether an artifact overuses unquoted references to someone else's work; this kind of application enables one to weed out instances of plagiarism. Another interesting application is searching for similar work in some vast corpus. This shall be extremely useful to anyone poised against such a task, as one does not have to match the actual word phrasing but can focus on the notion to be found. This application can be treated as a method for the automatic creation of related search terms, based not on other queries but on a rephrasing of the current one. Others are also interested in this field; refer to [3], [4] and [5]. One should also consider applying semantic compression to verify whether community-based classifications are free from random misclassification. The previously referenced work on semantic compression, along with the verification in this article, hints that the overall quality of automatic categorization can be significantly better than that achieved with traditional methods: performing clustering over a corpus of semantically compressed documents results in fewer errors [8]. Semantic compression can also find application in intellectual property protection systems such as SEIPro2S, implemented for Polish [1]. Such systems focus on the meaning of documents trespassing corporate boundaries. In knowledge-oriented societies, where the majority of revenue is an outcome of knowledge application, this kind of system is an invaluable asset.

3 SenecaNet features and structure

As emphasized earlier, any reasonable text transformation that promises informed choices when substituting a term with a more general one fitting the text's domain must be based on a structure capable of storing a variety of semantic relations. A number of structures, ranging from simple dictionaries through thesauri to ontologies, have been applied to the matter [7].

Among them, the semantic net has proven to be the best solution due to its outstanding features coupled with its lack of overcomplexity. The experiment given in [2] was based on SenecaNet, a semantic net that stores relations among concepts for Polish. It stores over 137,000 concepts; its other features are listed and described elsewhere. It follows the notion of a semantic net in every aspect and represents the concepts of the semantic net as a list, in a specific format that allows for fast traversal and a number of checkup optimizations.

As mentioned before, concepts are represented as a list of entries. An entry is stored in a way that allows connected concepts to be referenced efficiently. Every entry of this list conveys information on the actual descriptor to be found in text, its hypernyms, synonyms, antonyms, and the descriptors that are in an unnamed relation to the given descriptor. An additional rule is that every descriptor may occur exactly once in the leftmost position of an entry when the whole semantic network is considered. This restriction introduces an extremely important feature: there can be no cycles in a structure devised in this manner. Each descriptor can have one or more hypernyms (a heterarchy, as in [7]). Each descriptor can have one or more synonyms. Synonyms are listed only once, on the right side of an entry; they never occur in the leftmost position, which is an additional anti-cycle guard. An excerpt in the WiSENet format is given below to illustrate the described content.

Listing 1.1. SenecaNet file format

Barack Obama|politician,#president(USA),
car|vehicle,&engine,
gigabyte|computer memory unit,&byte,
Real Madrid|football team,@Madrid,
volume unit|unit of measurement,@volume,
Jerusalem|city,:Palestine Authority,
Anoushka Shankar|musician,#sitarist,#daughter(Ravi Shankar),
Hillary Clinton|politician,#secretary of state(USA),

For many solutions the structure of the semantic net is transparent: it does not affect the tasks the net is applied to. Nevertheless, semantic compression is much easier when descriptors are represented by actual terms and their variants are stored as synonyms.
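For readers who prefer code to prose, here is a minimal Python sketch of a parser for the entry format of Listing 1.1. The interpretation of the relation markers is an assumption made for this illustration only (unmarked items as hypernyms; '#', '&', '@' and ':' as other relation kinds); the actual SenecaNet semantics are defined by the SEIPro2S tooling.

from dataclasses import dataclass, field

MARKERS = {"#": "unnamed", "&": "part", "@": "location", ":": "other"}

@dataclass
class Entry:
    descriptor: str
    hypernyms: list = field(default_factory=list)
    relations: dict = field(default_factory=dict)  # relation kind -> targets

def parse_line(line):
    # Leftmost part is the unique descriptor; the rest are related concepts.
    descriptor, rest = line.split("|", 1)
    entry = Entry(descriptor.strip())
    for token in (t.strip() for t in rest.split(",")):
        if not token:
            continue  # skip the empty token after the trailing comma
        kind = MARKERS.get(token[0])
        if kind is None:
            entry.hypernyms.append(token)  # unmarked items: hypernyms
        else:
            entry.relations.setdefault(kind, []).append(token[1:])
    return entry

print(parse_line("car|vehicle,&engine,"))
# Entry(descriptor='car', hypernyms=['vehicle'], relations={'part': ['engine']})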

4 WordNet to SenecaNet conversion

When faced with implementing semantic compression for English, one has to use a solution with capabilities similar to those of SenecaNet. Building a new semantic net for English from scratch is a great effort surpassing the authors' abilities, so we turned to existing solutions. WordNet has proven to be an excellent resource: it has been applied by numerous research teams to a great number of tasks, yielding good results. It was thus a natural choice for the authors' research.

WordNet itself is a sense-oriented semantic net structure that contains over 130,000 terms grouped into synsets. Every synset is a collection of words (lemmas) that are in synonymy [11]. The design choices are not to be discussed here, yet one has to emphasize that while synsets are an elegant solution, they are cumbersome in text processing applications. The authors had to confront the challenge of converting a synset-oriented structure into a cycle-free semantic net operating on descriptors that can be recognized as actual terms in a processed text fragment. An algorithm to accomplish this has been devised. It operates on sets, taking into account data on every lemma stored in a given synset and on the synsets (and therefore their lemmas) that are hypernyms of the one being processed. A synset is understood as a group of terms that have similar meaning. Under close scrutiny, many of the terms gathered in one synset fail to be perfect synonyms of each other: they share a common sense, yet the degree to which they do so varies. A lemma is any member of a synset; it can be a single term or a group of terms representing some phrase [11].

Before the algorithm is given, an example of a naive approach to the problem is demonstrated. This shall enable the reader to follow the process of semantic network transformation in greater detail and with less effort. One needs to drop the additional data on word sense, as ideally one would like to end up with a list of words. Let us consider the word "abstraction" as a target term. WordNet stores "abstraction" in six different synsets. As they are numbered by frequency, the naive approach would suggest starting with the first sense. A generalization path leading from our chosen farthest leaf to the root can easily be obtained. However, when one applies this kind of transformation, one quickly faces the consequences: a great many cycles are introduced into the mapped structure.

In order to avoid graph cycles in the target structure, the authors needed to modify the way words are chosen to describe a synset. The best situation is when a lemma contained in the synset descriptor belongs only to this synset, i.e. the lemma itself is a unique synset descriptor. In other situations, the authors try to find another lemma from the same synset which satisfies the condition. Experiments have proved that this produces the desired networks, but it cannot satisfy the criterion of losslessness: the obtained semantic net consisted of only 25,000 terms serving as concepts, where a total of 86,000 noun synsets were processed. Eventually, a "synthetic" synset descriptor was introduced. The introduction of synthetic descriptors is not contrary to the authors' ambition to convert WordNet into WiSENet in a lossless manner while using actual terms as concept descriptors: synthetic descriptors are always the result of untangling some cycle, so they can always be output as actual terms to be found in processed text. Please refer to figures 1 and 2 for a visualisation of this process. Notice that the term "approximation" is contained in several synsets; it therefore fails as a concept descriptor (see figure 1). One can easily observe that the term "bringing close together" occurs exactly once, and thus can replace the synthetic descriptor "approximation.n.04".
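The naive approach can be reproduced with a few lines of NLTK. Using NLTK here is our assumption made for illustration; the original conversion worked on the WordNet data itself and need not have gone through this interface.

# Reproducing the "abstraction" example with NLTK's WordNet interface.
from nltk.corpus import wordnet as wn

# "abstraction" occurs in several noun synsets, so the bare term is
# ambiguous and fails as a unique concept descriptor.
for synset in wn.synsets("abstraction", pos=wn.NOUN):
    print(synset.name(), synset.lemma_names())

# Naive generalization path: walk the hypernyms of the first (most
# frequent) sense up to the root, ignoring all other senses.
path = wn.synsets("abstraction", pos=wn.NOUN)[0].hypernym_paths()[0]
print(" -> ".join(s.name() for s in reversed(path)))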

All of this is gathered in Tables 1 and 2.

Table 1. Companion table for figure 1

Synset              | Terms                                     | Parent synset
change of integrity | change of integrity                       | change.n.03
joining.n.01        | joining, connection, connexion            | change of integrity
approximation.n.04  | approximation, bringing close together    | joining.n.01
approximation.n.03  | approximation                             | version.n.01
approximation.n.02  | approximation                             | similarity.n.01
estimate.n.01       | estimate, estimation, approximation, idea | calculation.n.02

Table 2. Companion table for figure 2

Term                    | Parents
change of integrity     | change.n.03
approximation           | bringing close together, approximation.n.02, estimate.n.01, approximation.n.03
approximation.n.02      | similarity.n.01
approximation.n.03      | version.n.01
bringing close together | joining
joining                 | change of integrity
estimate.n.01           | calculation.n.02
estimate                | estimate.n.02, estimate.n.01, estimate.n.05, appraisal.n.02, estimate.n.04, compute, count on
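For illustration, some rows of Table 2 can be written directly as entries in the SenecaNet line format of Listing 1.1. This rendering is a sketch that assumes the Parents column maps onto the unmarked (hypernym) positions of an entry; synonym and relation markers are omitted.

change of integrity|change.n.03,
bringing close together|joining,
joining|change of integrity,
estimate.n.01|calculation.n.02,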

The whole procedure is realized as described below. The first step is to build a frequency dictionary F for lemmas, counting the synsets that contain a given lemma: the algorithm loops through all synsets in WordNet and through all lemmas in each synset, counting every lemma occurrence. In the second step, it picks a descriptor (preferably a lemma) for every synset. It begins by checking whether the synset descriptor d contains a satisfactory lemma: after splitting the descriptor (the partition point is the first dot in the synset description) and taking the first element of the resulting list, the algorithm examines whether that lemma occurs exactly once throughout all synsets. If the answer is positive, it can be used as the new synset descriptor. If not, the algorithm loops through the lemmas of the examined synset and checks whether there is any unique lemma that can be utilised as a descriptor. In case no unique lemma can be found, the genuine WordNet descriptor is used.
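A sketch of this procedure in Python, over NLTK's WordNet interface, might look as follows. The function and variable names are ours, and the fallback order follows the description above; the original implementation need not have used NLTK.

from collections import Counter
from nltk.corpus import wordnet as wn

# Step 1: frequency dictionary F - in how many synsets does each lemma occur?
F = Counter(lemma
            for synset in wn.all_synsets()
            for lemma in synset.lemma_names())

def pick_descriptor(synset):
    """Step 2: choose a descriptor for a synset.

    Prefer the head of the WordNet descriptor (the part before the first
    dot) if it names exactly one synset; otherwise any unique lemma of
    the synset; otherwise keep the genuine WordNet descriptor.
    """
    head = synset.name().split(".", 1)[0]
    if F[head] == 1:
        return head
    for lemma in synset.lemma_names():
        if F[lemma] == 1:
            return lemma
    return synset.name()  # synthetic descriptor, e.g. approximation.n.04

for synset in wn.synsets("approximation"):
    print(synset.name(), "->", pick_descriptor(synset))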

Fig. 1. WordNet synset description

Fig. 2. Concepts description in WiSENet format

5 Evaluation

As in the previously conducted research for Polish, we devised an experiment that makes it possible to verify whether semantic compression yields better results when applied to specific text processing tasks. The evaluation is performed by comparing clustering results for texts that were not semantically compressed with results for texts that were [6]. The authors gathered texts from the following domains: business, crime, culture, health, politics, sport, biology, astronomy. To verify the results, all documents were initially labeled manually with a category. All documents were in English. The clustering procedure was performed 8 times. The first run was without semantic compression: all identified concepts (about 25,000, which is only about a fifth of all concepts in the research material) were included.

The semantic compression algorithm was then used to gradually reduce the number of concepts: it started with 12,000 and proceeded with 10,000, 8,000, 6,000, 4,000, 2,000 and 1,000 concepts. The classification results were evaluated by comparing them with the labels specified by the document editors, and a ratio of correct classifications was calculated. The outcome is presented in Tables 3 and 4. The loss of classification quality is virtually insignificant down to the compression strength that reduces the number of concepts to 4,000. As briefly remarked in an earlier section, the conducted experiment indicates that the semantic compression algorithm can be employed in classification tasks to significantly reduce the number of concepts and the corresponding vector dimensions. As a consequence, tasks with extensive computational complexity are performed faster. A set of examples of semantically compressed text fragments (for 4,000 chosen concepts) is given below; each compressed fragment is preceded by its original.

Table 3. Classification quality without semantic compression

Clustering features | 1000   | 900    | 800    | 700    | 600    | Average
All concepts        | 94,78% | 92,50% | 93,22% | 91,78% | 91,44% | 92,11%
12000 concepts      | 93,39% | 93,00% | 92,22% | 92,44% | 91,28% | 91,81%
10000 concepts      | 93,78% | 93,50% | 93,17% | 92,56% | 91,28% | 92,23%
8000 concepts       | 94,06% | 94,61% | 94,11% | 93,50% | 92,72% | 93,26%
6000 concepts       | 95,39% | 94,67% | 94,17% | 94,28% | 93,67% | 93,95%
4000 concepts       | 95,28% | 94,72% | 95,11% | 94,56% | 94,06% | 94,29%
2000 concepts       | 95,56% | 95,11% | 94,61% | 93,89% | 93,06% | 93,96%
1000 concepts       | 95,44% | 94,67% | 93,67% | 94,28% | 92,89% | 93,68%
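The evaluation loop itself can be sketched as follows. This is a hedged reconstruction, not the authors' code: the corpus loader and the compress() routine are assumed to exist, and scikit-learn's KMeans stands in for whichever clustering method was actually used.

from collections import Counter
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import CountVectorizer

def clustering_accuracy(docs, labels, n_clusters, max_features):
    """Cluster docs, map each cluster to its majority label, and score."""
    X = CountVectorizer(max_features=max_features).fit_transform(docs)
    clusters = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(X)
    majority = {}
    for c in set(clusters):
        members = [lab for lab, cl in zip(labels, clusters) if cl == c]
        majority[c] = Counter(members).most_common(1)[0][0]
    correct = sum(majority[c] == lab for c, lab in zip(clusters, labels))
    return correct / len(labels)

# docs, labels = load_corpus()  # hypothetical loader for the labeled texts
# for k in (12000, 10000, 8000, 6000, 4000, 2000, 1000):
#     compressed = [compress(d, k) for d in docs]  # semantic compression
#     print(k, clustering_accuracy(compressed, labels,
#                                  n_clusters=8, max_features=k))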

1a The information from AgCam will provide useful data to agricultural producers in North Dakota and neighboring states, benefiting farmers and ranchers and providing ways for them to protect the environment.
1b information will provide adjective data adjective producer american state adjective state benefit creator creator provide structure protect environment

2a Researchers trying to restore vision damaged by disease have found promise in a tiny implant that sows seeds of new cells in the eye. The diseases macular degeneration and retinitis pigmentosa lay waste to photoreceptors, the cells in the retina that turn light into electrical signals carried to the brain.
2b researcher adjective restore vision damaged by-bid disease have found predict tiny implant even-toed ungulate seed new cell eye disease macular degeneration retinitis pigmentosa destroy photoreceptor cell retina change state light electrical signal carry brain

3a Together the two groups make up nearly 70 percent of all flowering plants and are part of a larger clade known as Pentapetalae, which means five petals. Understanding how these plants are related is a large undertaking that could help ecologists better understand which species are more vulnerable to environmental factors such as climate change.
3b together two group constitute percent group flowering plant part flowering plant known means five leafage understanding plant related large undertaking can help biologist better understand species more adjective environmental factor such climate change

Fig. 3. Classification quality for two runs; the upper line denotes results with semantic compression enabled

Table 4. Classification quality using semantic compression with the proper names dictionary enabled

Clustering features | 1000   | 900    | 800    | 700    | 600    | Average
All concepts        | 94,78% | 92,50% | 93,22% | 91,78% | 91,44% | 92,11%
12000 concepts      | 93,56% | 93,39% | 93,89% | 91,50% | 91,78% | 92,20%
10000 concepts      | 95,72% | 94,78% | 93,89% | 91,61% | 92,17% | 93,08%
8000 concepts       | 95,89% | 95,83% | 94,61% | 95,28% | 94,72% | 94,86%
6000 concepts       | 96,94% | 96,11% | 96,28% | 96,17% | 95,06% | 95,77%
4000 concepts       | 96,83% | 96,33% | 96,89% | 96,06% | 96,72% | 96,27%
2000 concepts       | 97,06% | 96,28% | 95,83% | 96,11% | 95,56% | 95,83%
1000 concepts       | 96,22% | 95,56% | 94,78% | 94,89% | 94,00% | 94,66%

6 Conclusions and future work

This work has demonstrated that semantic compression is viable for English. The steps that were needed to make it possible have been described, and an experiment has been presented along with its results.

The authors have identified a number of important areas that need further development. The first issue to be tackled is vocabulary that is not represented in WordNet yet is ubiquitous in current culture. To exemplify, words such as the following are missing: superpipe, windsurfer, airball, blazar, biofuel, spacelab, exoplanet, wildcard, superhero, smartphone. WiSENet is in great need of incorporating a vast corpus of geographic names and locations; this shall easily improve the results of further experiments and applications. In addition, the inclusion of more information on actual people shall further boost generalisation results. This inclusion must focus on introducing unnamed relations to WiSENet, as it currently supports only those from the original WordNet. A great number of adjectives and adverbs can only be generalised to their type, i.e. WiSENet can only tell whether it is dealing with an adjective or an adverb. The authors envision adding information on whether the adjective or adverb in question relates to a verb or to a noun; this is another feature that will improve the performance of semantic compression. Last but not least, an improvement is the creation of vaster text corpora, so that WiSENet can store more new concepts.

The research has brought a number of other interesting observations. They shall be brought to the reader's attention in further publications because, although close to the topic of this work, they shift its focus from semantic compression to the semantic net's features.

References

1. Ceglarek D., Haniewicz K., Rutkowski W.: Semantically Enhanced Intellectual Property Protection System - SEIPro2S. In: 1st International Conference on Computational Collective Intelligence, Springer-Verlag, Berlin Heidelberg, 2009
2. Ceglarek D., Haniewicz K., Rutkowski W.: Semantic compression for specialised Information Retrieval systems. In: 2nd Asian Conference on Intelligent Information and Database Systems, Studies in Computational Intelligence 283, Springer, 2010
3. Baziz M.: Towards a Semantic Representation of Documents by Ontology-Document Mapping, 2004
4. Baziz M., Boughanem M., Aussenac-Gilles N.: Semantic Networks for a Conceptual Indexing of Documents in IR, 2005
5. Gonzalo J. et al.: Indexing with WordNet Synsets can improve Text Retrieval, 1998
6. Hotho A., Staab S., Stumme G.: Explaining Text Clustering Results using Semantic Structures. In: Principles of Data Mining and Knowledge Discovery, 7th European Conference, PKDD 2003, Dubrovnik, Croatia, September 22-26, 2003
7. Hotho A., Maedche A., Staab S.: Ontology-based Text Document Clustering. In: Proceedings of the Conference on Intelligent Information Systems, Zakopane, Physica/Springer, 2003
8. Khan L., McLeod D., Hovy E.: Retrieval effectiveness of an ontology-based model for information selection, 2004
9. Krovetz R., Croft W.B.: Lexical Ambiguity and Information Retrieval, 1992
10. Frakes W.B., Baeza-Yates R.: Information Retrieval: Data Structures and Algorithms. Prentice Hall, 1992
11. Fellbaum C.: WordNet - An Electronic Lexical Database. The MIT Press, May 1998, ISBN 978-0-262-06197-1
12. Harris Z.S.: Distributional Structure. Word 10(2/3): 146-162, 1954
