Semantically Enhanced Intellectual Property Protection System - SEIPro2S

Dariusz Ceglarek (1), Konstanty Haniewicz (2), and Wojciech Rutkowski (3)

(1) Wyzsza Szkola Bankowa w Poznaniu, Poland, [email protected]
(2) Uniwersytet Ekonomiczny w Poznaniu, Poland, [email protected]
(3) Business Consulting Center, Poland, [email protected]
Abstract. The aim of this work is to present some of the capabilities of the Semantically Enhanced Intellectual Property Protection System. The system has reached a prototype phase where experiments are possible. It uses algorithms based on an extensive semantic net for the Polish language that enable it to detect similarities between two compared documents on a level far beyond simple text matching. SEIPro2S benefits both from using a local document repository and from Web-based resources. The main focus of this work is to give the reader an overview of the architecture and some actual results.

Key words: intellectual property, semantic net, thought matching, natural language processing
1 Introduction
SEIPro2S has been built to be an aid in situations where some kind of intellectual asset has to be protected from unauthorized usage, understood as copying ideas. An example of such a situation is plagiarism, which, as is commonly known, violates one's intellectual property in order to gain some sort of revenue. One cannot come up with a universally valid definition of plagiarism. As an example, Polish law does not define the notion of plagiarism; nevertheless, it enumerates a list of offences directed at intellectual property, i.e.:
– claiming authorship of some piece of work as a whole or of its parts,
– reusing someone's work without changes to the narration, examples and the order of arguments,
– claiming authorship of a research project,
– claiming authorship of an invention, used in the evaluation of research achievements, that is not protected by intellectual property law or industrial property law.
SEIPro2S enables its users to perform semiautomatic control of publicly available documents, and of those made available to it through other channels, in the context of using concepts which represent the user's intellectual property.
This application can easily be transformed into another one: it lets us monitor a stream of documents for content similar on the concept level to a base document. One can easily imagine this as a help in the automatic building of a relatively complete reference base for some research domain. The working prototype produces a report demonstrating areas copied from other sources, along with a clear indicator informing the user of the ratio of copied work to the overall document content. Thanks to the devised algorithms, based on an extensive semantic net for the Polish language and on thesauri, the prototype is able to detect copied fragments even where an insincere author tried to hide his deeds by changing word order or by using synonyms, hyponyms or commonly interchangeable phrases when compiling his document. The structure of the work is as follows: a brief discussion of the technologies used, a description of the system functioning, its architecture, an example from the undertaken experiments, and a summary along with future work.
2 Technologies used by SEIPro2S
The system applies many well-known text mining techniques. Due to the fact that text mining is a multidisciplinary domain, it employs a variety of algorithms coming from various branches of computer science. Every document submitted to the system has to undergo a set of transformations, collectively defined as text refinement. As the output of this text-refinement process the system produces a structure containing ordered concept descriptors coming from the input document. The following steps are undertaken in the text-refinement process:
– tokenization,
– applying a stop-list,
– concept identification (phrases, order of concepts),
– stemming,
– concept generalization with the aid of a knowledge representation structure (synonyms, hypernyms),
– concept disambiguation,
– applying the correct set of algorithms based on the type of text-mining task.
A short illustrative sketch of these steps is given below; a further explanation of the knowledge representation structures follows in the next subsection, as this is the crucial element of the system's internal workings.
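As an illustration only, the following minimal Python sketch mimics the order of these refinement steps on toy data; the stop-list, the synonym-to-descriptor mapping and the crude suffix-stripping stemmer are invented stand-ins, not the actual SEIPro2S components.

```python
# Toy text-refinement pipeline: tokenize, filter stop-words, stem, generalize.
STOP_WORDS = {"i", "a", "an", "the", "of", "to"}            # toy stop-list
DESCRIPTORS = {"fragrance": "scent", "aroma": "scent"}       # synonym -> descriptor

def toy_stem(word: str) -> str:
    # crude suffix stripping as a placeholder for real stemming/lemmatization
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def refine(text: str) -> list:
    tokens = [t.strip(".,;:!?").lower() for t in text.split()]   # tokenization
    tokens = [t for t in tokens if t and t not in STOP_WORDS]    # stop-list
    stems = [toy_stem(t) for t in tokens]                        # stemming
    return [DESCRIPTORS.get(s, s) for s in stems]                # generalization

print(refine("The fragrance of roses"))   # -> ['scent', 'rose']
```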
2.1 Knowledge representation
By a knowledge representation method we understand the way in which knowledge of the real world is stored, computed and inferred about by a system. More strictly, it is a knowledge description language paired with computation mechanisms. Every possible entity is mapped to natural language using a word, collocation or phrase. This mapping will be referred to from now on as a concept. A good knowledge representation serves the goal of producing structures capable of being computed by the system and of transforming the input in a manner that allows the system to come up with elements relating to stored, system-known concepts. Classic methods employed by information retrieval systems are based on a simple knowledge representation where a document is a set of the words used in it. This representation yields well-established querying methods such as the Boolean model and the vector space model. More sophisticated knowledge representation models are: glossary, domain dictionary, taxonomy, thesaurus, semantic net and ontology. In comparison with a bag of words, the listed methods introduce different, more complicated lexical relations of higher information value among stored concepts. We have a special interest in the semantic net, as it is employed as the knowledge representation model in the system. It has been chosen due to its superb capability of capturing relationships among concepts. A semantic net is a directed graph with vertices representing concepts and edges representing lexical relationships. A list of the supported types of lexical relationships, along with examples, follows:
– hypernym (animal is more general than elephant),
– hyponym (elephant is less general than animal),
– meronym (lamp has part bulb),
– holonym (bulb is part of lamp),
– connotation (rose has trait scent),
– attribute (soft is a specification of hardness),
– synonym (scent is a synonym of fragrance).
Every edge can be weighted to introduce a notion of the importance of a relation. Inference is realized by edge traversal. One can apply well-known methods of graph search: starting from the vertex in question, one can reach all relevant vertices by following edges, thus inferring their properties. The semantic net contains all information concerning concept semantics, hence its natural application to the domain of interest. Some of the benefits one can obtain by applying thesauri and semantic nets were described in [2]. Its greatest advantage is supplying the system with the right meaning of the processed concept based on its contextual usage. What is more, when used in classification tasks it raises the quality of classification and categorization. The semantic net commonly used in information retrieval systems for processing English is WordNet. Its structure is organized around the notion of a synset. Every WordNet synset contains words which are mutual synonyms. Relationships among synsets are hypernymy or hyponymy; combined with the above, it is easily seen that the whole of WordNet acts as a thesaurus. At the moment, the following relationships are present in WordNet: hypernymy, hyponymy, synonymy, metonymy, homonymy and antonymy. All these make WordNet a fully functional semantic net for English. Facing the challenge of not being able to use a similar structure for Polish, a semantic net produced as an outcome of the SeNeCa project was employed in the system. Table 1 demonstrates a comparison between WordNet and the SeNeCa net.
Table 1. Comparison of WordNet and SeNeCa net

Features               WordNet   SeNeCa net
Concept count          155200    126800
Polysemic words        27000     18200
Synonyms               0         5100
Homonyms, hypernyms    +         +
Antonyms               −         +
Connotations           +         +
Unnamed relationship   −         +
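The inference-by-traversal idea described above can be illustrated with a small sketch over a toy adjacency list; the concepts, relations and weights below are illustrative and do not come from the SeNeCa net.

```python
# Inference by edge traversal over a tiny weighted semantic net.
from collections import deque

# concept -> list of (relation, target, weight); weights are illustrative only
NET = {
    "elephant": [("hypernym", "animal", 1.0)],
    "animal":   [("hypernym", "organism", 0.8)],
    "rose":     [("connotation", "scent", 0.5)],
}

def related(concept, max_depth=2):
    """Collect concepts reachable from `concept` within max_depth edges."""
    seen, queue, found = {concept}, deque([(concept, 0)]), []
    while queue:
        node, depth = queue.popleft()
        if depth == max_depth:
            continue
        for relation, target, weight in NET.get(node, []):
            if target not in seen:
                seen.add(target)
                found.append((target, relation, weight))
                queue.append((target, depth + 1))
    return found

print(related("elephant"))  # [('animal', 'hypernym', 1.0), ('organism', 'hypernym', 0.8)]
```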
3 SEIPro2S functioning
Figure 1 presents the sequence of steps which are undertaken when a document is to be subjected to a similarity check against the local document repository and documents obtained from the Internet. When an input document ∆ is presented to the system, the responsible component starts a procedure of Internet sampling. Samples are obtained by submitting a number of fragments taken from the input document. A list of potentially relevant documents is prepared based on the occurrence of sufficiently long identical passages. The documents are downloaded and subjected to text refinement. After completing these steps, every downloaded document is indexed. For every document an abstract is created and stored in the local repository. The key procedure of the whole process follows. At first, the most relevant of the previously sampled, indexed and abstracted documents are selected by comparing the abstracts. Then, the input document ∆ is compared with every relevant document. As a result of this action, a similarity report is constructed. It conveys the following information on the overall similarity of the input document ∆ to every checked document:
– the length of the longest identical phrase obtained from the documents from the sampling phase,
– a similarity indicator, which, as a percentage, demonstrates how much of the submitted text is identical with documents coming both from the sampling phase and from the local repository (from earlier checks),
– the checked text along with markup showing which text fragments are identical to those coming from the local repository and the current Internet sample.
When SEIPro2S acts in monitoring mode, some enhancements are introduced. First of all, each analyzed document, along with the first batch of samples obtained from the Internet, is stored in the local repository. Thus, when the system is monitoring the Internet it can omit all previously checked documents and focus only on new ones.
Fig. 1. Checking a document against a sample of documents obtained from the open Internet
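As a rough illustration of the comparison step, the sketch below treats refined documents as sequences of concept descriptors and derives the longest identical run and a coverage-style similarity ratio; the threshold and the data are made up, and the real SEIPro2S classifier is considerably more elaborate.

```python
# Toy comparison of concept-descriptor sequences.
def longest_common_run(doc, other):
    """Length of the longest identical run of concept descriptors."""
    best = 0
    for i in range(len(doc)):
        for j in range(len(other)):
            k = 0
            while i + k < len(doc) and j + k < len(other) and doc[i + k] == other[j + k]:
                k += 1
            best = max(best, k)
    return best

def similarity_indicator(doc, others, min_run=3):
    """Fraction of the document covered by runs of at least `min_run` matches."""
    covered = set()
    for other in others:
        for i in range(len(doc) - min_run + 1):
            window = doc[i:i + min_run]
            for j in range(len(other) - min_run + 1):
                if other[j:j + min_run] == window:
                    covered.update(range(i, i + min_run))
                    break
    return len(covered) / max(len(doc), 1)

doc = ["scent", "rose", "garden", "spring", "rain"]
sources = [["scent", "rose", "garden", "dusk"], ["winter", "rain"]]
print(longest_common_run(doc, sources[0]))          # 3
print(round(similarity_indicator(doc, sources), 2)) # 0.6
```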
4 System architecture
The components constituting the SEIPro2S system are illustrated in figure 2. Its main part is the semantic network module, which works as the system engine processing semantic relations. The end user communicates with the system through a web interface, defining tasks to be completed by the system. Tasks are characterized by type, subject and parameters such as a time limit. After a task is done, the user can view a report or have it sent by e-mail. The other components of SEIPro2S, namely the sampling agent, downloading agent, file format adapters, abstractor, classifier and document corpus, are described below.
4.1 Sampling agent, downloading agent
The sampling agent uses Internet search engines to find documents relevant to the examined text. It uses document abstracts and the semantic network module to generate multiple queries. Using heuristics, the agent makes use of a domain dictionary to narrow down the obtained results to the specified domain by determining the frequency of domain-typical words. It can also query the search engine with long text phrases. The sampling agent outputs a list of URLs and passes it to the downloading agent, which downloads the relevant documents from the Internet and saves them to the local repository. For efficiency reasons, the downloading agent works in multiple threads.

Fig. 2. SEIPro2S components
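A minimal sketch of such a multi-threaded downloader, assuming nothing more than a list of URLs handed over by the sampling agent, might look as follows; the worker count and error handling are illustrative and the real component also deals with file formats, retries and the repository layout.

```python
# Simple multi-threaded downloading agent.
from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlopen

def fetch(url: str) -> bytes:
    with urlopen(url, timeout=10) as response:   # download one document
        return response.read()

def download_all(urls, workers=8):
    # several downloads run in parallel; failures are reported, not fatal
    results = {}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for url, future in [(u, pool.submit(fetch, u)) for u in urls]:
            try:
                results[url] = future.result()
            except OSError as exc:
                print(f"failed to fetch {url}: {exc}")
    return results
```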
4.2 File format adapters
This module is a strictly functional one, as its sole purpose is to homogenize the format of retrieved resources. The module follows the well-tested approach of flattening any structured file format into plain text.
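For illustration, the following standard-library sketch flattens an HTML resource to plain text; the actual adapters also cover other structured formats (e.g. PDF) and more careful whitespace handling.

```python
# Flatten HTML markup to plain text using only the standard library.
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.chunks = []
    def handle_data(self, data):
        if data.strip():                 # keep only non-empty text nodes
            self.chunks.append(data.strip())
    def text(self):
        return " ".join(self.chunks)

extractor = TextExtractor()
extractor.feed("<html><body><h1>Title</h1><p>Some <b>text</b>.</p></body></html>")
print(extractor.text())   # Title Some text .
```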
4.3 Abstractor
In order to conduct the similarity analysis, text documents need to be refined first. Text refinement uses multiple information retrieval methods aimed at improving the quality of the resulting analysis. First, the text units in a document need to be identified to allow correct indexing. The following steps are taken by the abstractor:
– lexical analysis,
– concept identification,
– removal of stop-words,
– lemmatization,
– descriptor adding,
– disambiguation.
Lexical analysis is the process of converting an input stream of characters into a stream of words, which can be used for indexing. At this stage, all words need to be split by identifying spaces, punctuation or letter case. The next step is to drop low-significance words, which do not add much information to the context. Such words occur on a so-called stop-list, which contains mainly prepositions, pronouns and conjunctions. After the stop-words are eliminated, the text is indexed. At this stage, concepts consisting of two or more words need to be identified so that they represent one semantic unit and in order to ensure the quality of results. Then the stream of words is lemmatized to change inflected forms into the base form. This is a particularly difficult task for highly inflected languages, such as Polish or French. In Polish, there are multiple noun declension forms and verb conjugation forms. Therefore, for inflected languages word lemmatization is crucial to represent multiple word forms as one concept. Synonyms need to be represented with concept descriptors using the semantic network. This allows correct similarity analysis and also increases the efficiency of classification algorithms by reducing the number of dimensions without loss in comparison quality. Replacing words with their hypernyms [10] is only possible when using a semantic network as the form of knowledge representation. Indexing faces another problem here, which is polysemy. One word can represent multiple meanings, so the apparent similarity needs to be eliminated. This is done by concept disambiguation. Disambiguation, which identifies word meaning depending on its context, is important to ensure that no irrelevant documents will be returned in response to a query [14]. A very efficient method of concept disambiguation has been proposed in [7]. It uses a semantic network for the Polish language and examines the word context to determine its meaning, resulting in 82% accuracy. It seems that only linguistic analysis methods can exceed 90% accuracy [16], while human experts are able to recognize the correct meaning of 96.8% of polysemic words [19].
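Purely as a toy illustration of context-based disambiguation (not the semantic-net method of [7]), the snippet below picks the sense whose hand-made context set overlaps most with the surrounding words; the senses and context sets are invented.

```python
# Toy context-overlap disambiguation for a polysemic word.
SENSES = {
    "zamek": {                        # Polish homonym: castle / lock
        "castle": {"king", "tower", "medieval"},
        "lock":   {"key", "door", "open"},
    }
}

def disambiguate(word, context_words):
    """Pick the sense whose context set overlaps most with the document context."""
    senses = SENSES.get(word, {})
    if not senses:
        return word
    return max(senses, key=lambda s: len(senses[s] & set(context_words)))

print(disambiguate("zamek", ["the", "king", "lived", "in", "a", "medieval", "tower"]))
# -> 'castle'
```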
4.4 Classifier
SEIPro2S implements innovative algorithms which ensure that the document comparison methods are insensitive to techniques used to disguise plagiarism, such as changes in word order or the use of synonyms. The classifier module compares documents and searches for common phrases which are long enough (the phrase length is a configurable parameter set by the user). The classifier uses document abstracts, which include concept descriptors instead of words, to represent multiple synonyms as one descriptor.
Fig. 3. Fragment of a similarity report matching a document with Internet sources. Positively matched fragments are marked with a red bold font, while single discrepancies are ignored (underline only).
The algorithm is not sensitive to word order changes within a window of a few words. A user-configurable number of concepts with no match is also tolerated: when a discrepancy occurs inside a longer positively matched passage, it can be treated as incidental and does not eliminate the match. The described algorithm has high computational complexity, which is, however, typical for similar tasks. One more feature of the SEIPro2S classifier is that it takes into account cases in which text copying is allowed. Citations marked with quotation marks and references are not treated as potential plagiarism, nor are some domain-typical phrases.
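The gap tolerance can be sketched roughly as follows; `max_gaps` is an illustrative stand-in for the user-configurable parameter and the concept sequences are toy data, not the actual classifier logic.

```python
# Extend a match while tolerating a limited number of single discrepancies.
def matched_span(doc, source, start, src_start, max_gaps=1):
    """Length of the match starting at the given positions, allowing up to `max_gaps` mismatches."""
    length, gaps = 0, 0
    while start + length < len(doc) and src_start + length < len(source):
        if doc[start + length] == source[src_start + length]:
            length += 1
        elif gaps < max_gaps:
            gaps += 1
            length += 1      # treat the discrepancy as incidental
        else:
            break
    return length

doc = ["concept", "a", "b", "x", "c", "d"]
src = ["concept", "a", "b", "y", "c", "d"]
print(matched_span(doc, src, 0, 0))   # 6 -- one discrepancy is tolerated
```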
4.5 End-user report
The result of a SEIPro2S analysis is a similarity report between the examined document and the text from the document corpus (including documents downloaded from the Internet), which is presented to the end user. The report contains the following information:
Fig. 4. Original text submitted to SEIPro2S for analysis. Text formatting has no further significance (and can easily be changed in order to camouflage a plagiarism attempt), hence it is cleared in the early processing steps.
– the maximum length of a positively matched phrase, which can be used to classify the document as violating intellectual property,
– a similarity factor, i.e. the ratio of the document positively matched with the document corpus,
– the examined document with the positively matched text fragments marked (see figure 3), together with the found sources.
One can compare the report with the original tested text (see figure 4). In this example the system not only provides the user with information on which elements were copied but also shows that the tested fragment of text was compiled from two different Internet resources. It contrasts the copied passages of the original text with the corresponding fragments of the found Internet resources. What is more, the system was able to correctly flatten the structure of the Internet resources and cope with Polish diacritics.
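A minimal sketch of how such a report could mark matched fragments and compute the similarity factor is given below; the HTML styling and the data are purely illustrative and do not reflect the actual report generator.

```python
# Toy report rendering: highlight matched words and compute a similarity factor.
def render_report(words, matched_indices):
    parts = []
    for i, word in enumerate(words):
        if i in matched_indices:
            parts.append(f"<b style='color:red'>{word}</b>")   # positively matched
        else:
            parts.append(word)
    similarity = len(matched_indices) / max(len(words), 1)
    return " ".join(parts), similarity

html, ratio = render_report(["this", "passage", "was", "copied"], {1, 2, 3})
print(f"similarity factor: {ratio:.0%}")   # similarity factor: 75%
print(html)
```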
5 Summary
SEIPro2S has proved to be a successful tool for checking whether submitted content is an unauthorized copy. It is possible not only to find direct copying but also passages that rephrase the copied content with another set of words, thus reproducing the original thought. This work has presented an example of SEIPro2S working as an authenticity checker. This is only one possible application. Among others, one has to emphasize its use as a tool that can monitor whether important information leaks out of an organization. In addition, the system can be of great help to researchers checking whether their idea has already been addressed by their peers, giving them a better overview of the chosen research field. SEIPro2S still yields interesting research problems. Above all, the system is to be reimplemented so as to allow for performance optimizations, a new user interface and the introduction of a set of additional algorithms that will enable further automation. There are also a few topics for further research. Certain phrases are typical for some domains (e.g. law, science) and therefore their exact occurrence in different documents cannot be perceived as an act of plagiarism; the same applies to the use of public domain documents. An investigation of how to recognize allowed, properly marked citations could also be conducted.
References

1. Ah-Hwae T., Fon-Lin L.: Text Categorization, Supervised Learning, and Domain Knowledge Integration, 2004
2. Baeza-Yates R., Ribeiro-Neto B.: Modern Information Retrieval, ACM Press, Addison-Wesley Longman Publishing Co., New York, 1999
3. Ballinger K.: .NET Web Services, Architecture and Implementation, Pearson Education, 2003
4. Baziz M.: Towards a Semantic Representation of Documents by Ontology-Document Mapping, 2004
5. Baziz M., Boughanen M., Aussenac-Gilles N.: Semantic Networks for a Conceptual Indexing of Documents in IR, 2005
6. Ben-Ari M.: Principles of Concurrent and Distributed Programming, 2nd Edition, Pearson Education, 2005
7. Ceglarek D.: Zastosowanie sieci semantycznej do disambiguacji pojec w jezyku naturalnym, red. Porebska-Miac T., Sroka H., w: Systemy wspomagania organizacji SWO 2006, Katowice: Wydawnictwo Akademii Ekonomicznej (AE) w Katowicach, 2006
8. Frakes W.B., Baeza-Yates R.: Information Retrieval - Data Structures and Algorithms, Prentice Hall, 1992
9. Gonzalo J. et al.: Indexing with WordNet Synsets can improve Text Retrieval, 1998
10. Hotho A., Staab S., Stumme S.: Explaining Text Clustering Results using Semantic Structures. In: Principles of Data Mining and Knowledge Discovery, 7th European Conference, PKDD 2003, Dubrovnik, Croatia, September 22-26, 2003
11. Labuzek M.: Wykorzystanie metamodelowania do specyfikacji ontologii znaczenia opisow rzeczywistosci, projekt badawczy KBN, 2004
12. Khan L., McLeod D., Hovy E.: Retrieval effectiveness of an ontology-based model for information selection, 2004
13. Krafzig D., Banke K., Slama D.: Enterprise SOA, Prentice Hall, 2005
14. Krovetz R., Croft W.B.: Lexical Ambiguity and Information Retrieval, 1992
15. Sanderson M.: Word Sense Disambiguation and Information Retrieval, 1997
16. Sanderson M.: Retrieving with Good Sense, 2000
17. Stokoe Ch., Oakes M.P., Tait J.: Word Sense Disambiguation in Information Retrieval Revisited, SIGIR 2003
18. Van Bakel B.: Modern Classical Document Indexing. A linguistic contribution to knowledge-based IR, 1999
19. Gale W., Church K., Yarowsky D.: A Method for Disambiguating Word Senses in a Large Corpus. Computers and the Humanities, 26, pp. 415-439, 1992