Use of Text Syntactical Structures in Detection of Document Duplicates

Mohamed Elhadi
Department of Computer Science
Sultan Qaboos University, P. O. Box 36, Al-Khod 123, Oman
[email protected]

Amjad Al-Tobi
Department of Computer Science
Sultan Qaboos University, P. O. Box 36, Al-Khod 123, Oman
[email protected]

Abstract

This is the first paper on a set of experiments addressing issues related to the determination of text similarity using combined syntactical representation and string alignment techniques. The suggested approach takes advantage of the document's syntactical structure, manifested in Part of Speech (POS) tags, and uses it as a basis for further processing. Documents, including the query, are preprocessed using a POS tagger that converts them into a reduced string capturing some of the author's writing style and some of the semantics of the written text. This provides a means of representing a document at a higher level of abstraction, one that captures the different alterations that can be made to documents that are similar in origin and possibly in style. This in turn enables the processing of such documents using many of the available string manipulation and matching algorithms. This work is inspired and driven by the parallel between text processing and sequence alignment in computational biology. Sequence alignment techniques are used to analyze and establish the utility of the strings produced by this syntactical representation of content.

1. Introduction

With the growth of the web and the emergence of digital libraries, the tasks of document management and text analysis have become very important. Some of the commonly used techniques in such tasks are copy detection, near-copy detection, and similarity calculation. These techniques have been based on string representations and processing, and are widely used in text analysis as well as in other fields such as computational biology [1, 4, 6, 9, 10, 11, 13, 14, 16, 19, 21]. Computational biology, in particular sequence alignment, relies heavily on well-established string manipulation techniques and algorithms. Many of those techniques and algorithms can be readily utilized in text similarity calculation if the text can be transformed into appropriately representative strings.

In bioinformatics, nucleotide and protein sequences are strings that are considered to be modified versions of some original sequences. The modification takes place over a long time through edit operations that are the work of evolution. Operations that can be performed on strings representing text or bio-sequences are of one of three types (a minimal edit-distance sketch, added for illustration, follows the list):

• Insertion of one or more units into the string
• Deletion of one or more units from the string
• Replacement of some units by other units
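The following is a minimal sketch (ours, not the paper's) of the classic dynamic program that counts the fewest such operations needed to turn one string into another:

def edit_distance(a, b):
    # dp[i][j] = minimum number of insertions, deletions and
    # replacements needed to turn a[:i] into b[:j]
    m, n = len(a), len(b)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i  # delete all of a[:i]
    for j in range(n + 1):
        dp[0][j] = j  # insert all of b[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1   # replacement
            dp[i][j] = min(dp[i - 1][j] + 1,          # deletion
                           dp[i][j - 1] + 1,          # insertion
                           dp[i - 1][j - 1] + cost)
    return dp[m][n]

print(edit_distance("DT JJ NN", "DT RB JJ NN"))  # 3 ("RB" plus a space inserted)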

The objective of processing strings in bioinformatics is to compare unknown or partially known sequences against a collection of known and annotated ones. Results of comparisons are expressed as numerical values that serve as an indication of the level of relatedness or identity of sequences [4, 6, 21]. We propose an approach that examines the utility of using syntactical units, namely POS tags [3, 20], to represent text structure as a basis for further comparison and analysis. This realizes the intuition that similar (exact-copy) documents would have similar (exact) syntactical structure, that is, the same sequence of POS tags. Similar documents, in particular those that contain some exact or near-exact parts of other documents, would contain similar syntactical structures. This is all the more so when the similarity is the result of reduction, expansion, plagiarism or modification. In this approach, text documents are converted into a reduced version, a string of tags, capturing some of the underlying syntax and semantics manifested in the author's writing style and the structure of the written text. In a way, this is a representation of documents at a higher level of abstraction that captures the different alterations of the original text. The paper reports results obtained from an initial set of experiments in which the validity of the intuitions and underlying assumptions is investigated. The aim is to explore the usefulness of representing text using syntactical structures that can be manipulated by algorithms and methods such as clustering, fingerprinting, BLAST, and other dynamic programming methods [2, 3, 4, 6, 11, 21]. The paper provides an overall description of the proposed model along with preliminary results and analysis. The combined use of syntactical POS tagging and text processing methods for the purpose of document similarity and copy detection is novel; as far as we know, the literature does not show any previous or similar combined use of POS tagging and string processing methods for copy detection. The rest of the paper is organized as follows: Section 2 covers related work; Section 3 the proposed model (under implementation); Section 4 the experiments conducted, document collections used and results obtained; and Section 5 conclusions and future work.

2. Related Work

Some of the areas this work relates to are briefly presented next.

2.1. Syntactical (POS) Representation

POS tagging is the process of annotating a given text with its POS units. Several approaches have been used to implement POS tagging systems, with variable degrees of accuracy [3]. Many of those systems implement probabilistic methods built on first-order or second-order Markov models; such systems have experienced difficulties in estimating small probabilities accurately [20]. TreeTagger [3], the tagger used in this work, applies a probabilistic method for automatically annotating words with POS tags, but uses a decision tree to obtain more reliable estimates than other systems. It has achieved the highest accuracy in comparison with other taggers, up to 96.36%, and has gone through many improvements [3, 20].
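As an illustration of what a tagger produces, here is a short sketch using NLTK's Penn Treebank tagger as a stand-in (the paper itself uses TreeTagger; NLTK is our assumption for a self-contained example):

import nltk  # assumes the tokenizer and tagger models have been downloaded

def pos_sequence(text):
    # Tokenize the text, tag it, and keep only the POS tags,
    # discarding the words themselves
    tokens = nltk.word_tokenize(text)
    return [tag for _, tag in nltk.pos_tag(tokens)]

print(pos_sequence("a very good book"))  # e.g. ['DT', 'RB', 'JJ', 'NN']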

2.2. Similarity Calculations

Similarity calculation is an important issue in its own right and serves as the basis for many text analysis tasks. Different methods and approaches have been used to calculate similarities between documents, some semantic and others syntactic. Semantic approaches seem to have received less attention, for reasons that may be due to the difficulties of representing semantics, or to limitations in the assessment coverage of user studies, which do not scale with the size, heterogeneity, and growth of the web [17, 18]. Syntactic approaches, on the other hand, are more common. They are divided into fingerprinting [11], information retrieval techniques [1] and hybrid techniques [7, 8, 12]. Fingerprinting techniques break a text document into small chunks, each of which is hashed using a hashing algorithm to produce a list of hash values representing the document; these values are then compared against other documents' hash values to detect similarities [11]. Information retrieval focuses on representing documents based on content words and word frequencies, using indexes with an appropriate model to evaluate similarities between documents [1]. Attempts have been made to combine some of the above techniques; in one approach, fingerprinting was combined with information retrieval [7]. Other techniques discussed in the literature aim to detect overlap between documents for more specific purposes, adopting different detection strategies depending on the task required by the system [8, 12].
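A rough sketch of the chunk-and-hash fingerprinting idea follows; it is a generic illustration, not any specific published scheme, and the chunk size k is an arbitrary choice:

import hashlib

def fingerprints(text, k=5):
    # Hash every overlapping k-word chunk of the document
    words = text.lower().split()
    return {hashlib.md5(" ".join(words[i:i + k]).encode()).hexdigest()
            for i in range(len(words) - k + 1)}

def overlap(a, b, k=5):
    # Fraction of shared fingerprints, normalized by the smaller set
    fa, fb = fingerprints(a, k), fingerprints(b, k)
    return len(fa & fb) / max(1, min(len(fa), len(fb)))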

2.3. Biological Sequence Alignment

Biological DNA and protein sequences are represented as strings over some predefined alphabet. Edit operations represent a process, an act of evolution, performed on one sequence to produce a modified one. Homologous sequences, those related in origin or function, can be identified if the amount of editing can be identified, computed and represented as a similarity or dissimilarity measure [4, 6, 19, 21]. Many methods are used to measure similarities between sequences, ranging from shallow analysis using unit frequencies to more complex dynamic programming algorithms [1, 2, 5].
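As a concrete instance of the dynamic-programming family, here is a compact sketch of a local alignment score in the style of Smith-Waterman [6]; the scoring parameters are arbitrary illustrative values:

def smith_waterman(a, b, match=2, mismatch=-1, gap=-1):
    # Local alignment score: the best-scoring pair of substrings
    # of a and b; the max with 0 resets poor alignments
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        cur = [0]
        for j in range(1, len(b) + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            cur.append(max(0, prev[j - 1] + s, prev[j] + gap, cur[j - 1] + gap))
            best = max(best, cur[j])
        prev = cur
    return best

print(smith_waterman("GATTACA", "GCATGCA"))  # small toy example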

3. Proposed Approach

3.1. Parallel with Biological Sequences

When compared to biological sequences, a major hurdle appears to be the difference in the makeup of the strings and the lack of a theory, like that of evolution, that can be used for explanation. A different way to look at this task is to think of a chunk of text as a string made of meaningful yet well-defined and enumerable units (an alphabet). As a result, modified or similarly-created text can be thought of as the result of some intervention or the application of some edit operations. To do so, text can be considered a string of syntactical units derived from POS tagging, instead of the actual characters or words commonly used in text processing. This not only represents strings using meaningful units but also uses a well-defined and clearly-represented set of units, namely the tags. The created strings capture some of the semantics contained in the writing style of authors and the relationships defined by the order of the text units. With this in mind, a person who attempts some modification of an existing text, whether maliciously by plagiarizing or purposefully by re-editing, would produce either a reduced or an expanded version of the existing text. He/she would be involved in one of the following (a small worked example follows the list):

1) Total cut-and-paste, where very little is done to modify the original text. That is, strings representing phrases or sentences from the original text would be used as-is in the newly created document.
2) Insertion, where a person would, for example, insert new words into an original text to produce a partially modified version. The insertion of the word very, for example, into the original phrase a good book would make it look partially different, producing the phrase a very good book. The new text contains partially matching parts or strings.
3) Deletion, where the opposite of the above takes place. Some of the words of the original text are deleted to produce a partially modified version. The deletion of the word very, for example, from the original phrase a very good book would make it look partially different, producing the phrase a good book. The new text would contain partially matching parts or strings.
4) Substitution, where the operation is a combination of deletion(s) and addition(s). The original text would be modified by the deletion of a word and the addition of one or more similar words. Applying a substitution to the phrase a very good book may produce the phrases a useful book or a valuable book, in both of which two units (very good) were deleted and a new one (useful or valuable) was inserted.
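To make the parallel concrete, the example edits above look like this at the tag level (the Penn Treebank tags shown are illustrative, not taken from the paper):

original    = ["DT", "JJ", "NN"]        # "a good book"
inserted    = ["DT", "RB", "JJ", "NN"]  # "a very good book": one tag (RB) inserted
substituted = ["DT", "JJ", "NN"]        # "a useful book": the tag string is unchanged

Note that a like-for-like word substitution can leave the tag string intact, which is part of what makes this representation robust to word-level rewording.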

Reducing the original documents to their syntactical structures greatly reduces the dimensionality of the document: smaller strings are dealt with instead of all the characters in the document. At the same time, less information is lost than when documents are processed based on actual characters or words. A brief description of the proposed model and its major phases is provided next (see Figure 1).

3.2. Phases of the Proposed Approach

3.2.1. Syntactical Processing Phase. In existing systems based on ranking or hashing, text is preprocessed by the removal of stop words and by stemming in order to reduce the text to a set of general tokens, excluding numbers, punctuation and special characters. The proposed approach instead reduces the text to a smaller set of syntactical (POS) tags, making use of most of the document's content.

Figure 1: Overall proposed approach (flow: Text Docs → POS Tagging → Tagged Text → Tag Optimization → Matching & Ranking Processing → Ranked Docs, with the query following the same path)

The choice of tagger and tag-set certainly has an impact on accuracy. Taggers are relatively accurate; still, they may not be 100% error-free [3, 20]. A tagger with high accuracy and a relatively large tag set was adopted and used in this work. For easier and more efficient processing, each tag is replaced by a single character when producing the string of tags representing a document.

3.2.2. String Optimization Phase. The size of the tag-set is an important accuracy factor: the bigger the tag set, the more detailed the produced string and thus the more accurate the representation, and vice versa. TreeTagger, which was used in this work, is freely available and has a reasonably large POS-tag set [3, 20]. Table 1 contains a partial list of the tags that can be produced by TreeTagger.

The tag-set is a detailed 55-tag set. It can be reduced into sets of different sizes by collapsing some of the tags.

Table 1: Sample POS tags produced by TreeTagger

Adjective                      Modal verb
Adjective, comparative         Participle
Adjective, superlative         Base form verb
Adverb                         Past tense verb
Adverb, comparative            Present participle
Adverb, superlative            Past participle
Article (Determiner)           Present tense verb
Cardinal number                Possessive ending
Common noun, singular          Personal pronoun
Common noun, plural            Possessive pronoun
Proper noun, singular          Wh-determiner
Proper noun, plural            Wh-pronoun
Conjunction, coordinating      Poss. wh-pronoun
Conjunction, subordinating     Wh-adverb

The reduction of the tag set is, in essence, a lowering of the accuracy of the representation, but it can result in more appropriate capturing of some of the overlap and similarity, depending on the types of documents and the precision sought. Tag sets of sizes 9, 19, 29 and 55 tags were generated and used in these experiments; the smaller tag sets were produced through consolidation of grammatically related tags, such as verbs.

3.2.3. Matching and Ranking Phase. The result of tagging and tag optimization is a set of strings representing the ordered tags corresponding to each word in the original document. The produced string can then be considered for further processing. A number of processing and analysis techniques are being tested to see the effect of using the POS-tag strings; results will be reported as they become available. A sketch of the two preceding steps, single-character encoding and tag-set reduction, is shown below.
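The tag-to-character mapping in the sketch is invented for illustration, as the paper's actual table is not reproduced here:

# Invented mapping for illustration; only a few of the 55 tags are shown
FULL = {"JJ": "a", "JJR": "b", "JJS": "c",   # adjectives, kept distinct
        "NN": "n", "NNS": "m",               # common nouns, kept distinct
        "VB": "v", "VBD": "w", "VBZ": "x",   # verbs, kept distinct
        "DT": "d", "RB": "r"}

# A reduced set collapses grammatically related tags onto one character
REDUCED = {"JJ": "A", "JJR": "A", "JJS": "A",
           "NN": "N", "NNS": "N",
           "VB": "V", "VBD": "V", "VBZ": "V",
           "DT": "D", "RB": "R"}

def encode(tags, table):
    # Map each POS tag to its single character; unknown tags become "?"
    return "".join(table.get(t, "?") for t in tags)

print(encode(["DT", "RB", "JJ", "NN"], FULL))       # dran
print(encode(["DT", "RB", "JJS", "NNS"], REDUCED))  # DRAN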

4. Experiments

The objective of the experiments described in this paper was to analyze the appropriateness of the tag sets and the strings they produce as document representatives for further processing, in order to find and quantify similarities between original documents. Clustering and a string manipulation technique, the Longest Common Subsequence (LCS) algorithm [2, 5], were used to see whether the produced strings can usefully yield similar groups of strings. Closely related documents should cluster together and show high similarity when run through a clustering tool; similar strings should also rank high when compared with the LCS algorithm.
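The following is a minimal sketch of an LCS-based similarity score over tag strings; the paper does not state its exact normalization, so normalizing by the shorter string is an assumption here:

def lcs_length(a, b):
    # Classic O(len(a) * len(b)) dynamic program using two rows of memory
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, 1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]

def lcs_score(a, b):
    # Normalized similarity in [0, 1]
    return lcs_length(a, b) / min(len(a), len(b))

print(round(lcs_score("dran", "dxan"), 2))  # 0.75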

Two experiments were performed using two different collections of text documents. The first experiment used a limited set of text documents, referred to here as the Controlled Collection Set (CC-Set), which was created by the authors for the purpose of testing. The second experiment used a subset of documents from the well-known Reuters collection [15], referred to as the RC-Set. Due to the large size of the RC-Set and the factors that influence clustering, only LCS was used with this set.

4.1. Controlled Collection Set (CC-Set)

The CC-Set was taken from David Gardner's website [22], which contains a document on plagiarism prevention with some exercises. The set was supplemented with several other general documents from the web. The CC-Set contained the following:

1) Two documents used as sources for exercises, along with five documents written as exercises for students with variable degrees of plagiarism.
2) Three unrelated short-story documents.
3) Documents created from parts of the documents written by Dr. Gardner on plagiarism, plus one unrelated document by Dr. Gardner.
4) Documents taken from the web by a search on the keywords "Africa and China" and from the Canadian Broadcasting Corporation (CBC) website.

The source documents are related: they are on the same topic and written by the same author. The exercise documents are based, in general, on the two source documents. Some of the exercise documents should cluster together and/or with one or more of the sources; the documents by Gardner may cluster with some of the source and exercise documents too; other web documents may or may not cluster together. Table 2 shows a summarized comparison of the exercise documents as analyzed by Gardner and summarized by the authors. The main idea here was to see how these documents cluster. Looking at Table 2 and based on manual investigation of the contents of those documents, one would expect an overall similarity between the documents. Hierarchical clustering [23] was used to group the set of documents into a non-predetermined number of clusters, with the expectation that similar documents would cluster together. The following procedure was followed on the CC-Set (see the sketch after the list):

(1) Documents are first tagged using TreeTagger.
(2) Tags are converted into single-character tags.
(3) Tag frequency tables are created.
(4) Tag tables of different sizes are produced.
(5) Tag tables are clustered.
(6) Results are analyzed and compared.
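A sketch of steps (3)-(5) using SciPy's hierarchical clustering is shown below; the distance metric, linkage method and cut level are our assumptions, since the paper does not specify them:

from collections import Counter
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist

def tag_frequency_table(tag_strings):
    # One row per document: counts of each single-character tag
    alphabet = sorted(set("".join(tag_strings)))
    return [[Counter(s)[c] for c in alphabet] for s in tag_strings]

tag_strings = ["dran", "ddan", "drann"]  # hypothetical encoded documents
X = tag_frequency_table(tag_strings)
Z = linkage(pdist(X), method="average")        # agglomerative clustering
print(fcluster(Z, t=2, criterion="maxclust"))  # cut the dendrogram into 2 clusters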

Table 2: Summary of the set of exercise documents

Docs-Number         1    2    3    4    5
Plagiarism          Yes  Yes  No   No   No
Sources Indication  No   Yes  Yes  Yes  Yes
Copying             Yes  Yes  Low  Low  Low
Copy Indication     No   No   Yes  Yes  Yes
Paraphrase          No   Low  Yes  Yes  Yes
Para. Indication    No   No   Yes  Yes  Yes
Points Order        No   Yes  Yes  Yes  Yes
Explanation         No   No   No   Yes  Yes
Opinions            No   No   No   No   Yes

Clustering results and distance coefficients, as can be seen from the dendrogram of Figure 2, strongly confirm that tag frequencies of tagged documents can serve as an indication of similarities between documents.

Figure 2: Dendrogram based on the 55-tag set

It was observed that the exercise documents did cluster together: documents 4 and 5 on the one hand, and documents 1 and 2 on the other, have very close distances. Surprisingly, the three (unrelated) story documents also clustered closely together; this can be attributed to writing style. It was also observed that the results for the smaller tag sets (sizes 9 and 19) were identical, while the larger ones (sizes 29 and 55) did better, with the best results obtained using the 55-tag set. Results obtained using LCS confirmed these findings: the entire set of exercises ranked highest, with documents 9 and 8 obtaining the highest score of 82%.

4.2. Reuters Collection (RC-Set)

Two subsets of documents were taken from the Reuters collection: one containing 999 overlapping documents and the other 741 non-overlapping documents. Documents that belonged to a single topic only, as pre-classified by Reuters, are considered non-overlapping; the overlapping documents were selected regardless of whether they belonged to more than one pre-labeled category. The implication is that non-overlapping documents should contain no duplications. Each collection was divided into three subsets of 333 and 247 documents, respectively. The following procedure was applied to both subsets:

(1) Documents are first tagged using TreeTagger.
(2) Tags are converted into single-character tags.
(3) The strings produced in (2) are submitted to the Longest Common Subsequence algorithm.
(4) Results are analyzed and compared.

The results were divided into three ranges using the normalized LCS score: 0-30%, 30-60% and 60-100%.

Table 3: Percentage ranges for each and all sets

                          Set 1   Set 2   Set 3   All
Overlap        < 30%      99.77   99.74   99.32   99.61
               30-60%      0.11    0.15    0.08    0.11
               > 60%       0.13    0.11    0.60    0.28
Non-overlap    < 30%      99.91   99.97   99.78   99.89
               30-60%      0.06    0.01    0.21    0.09
               > 60%       0.02    0.02    0.01    0.02

The results, as can be seen from Table 3, were very positive, reflecting the fact that the overlapping subsets contained more duplicates and near-duplicates than the non-overlapping ones. In general, documents having a score of less than 30% were non-duplicates; those scoring 60% or more were all duplicates; and those scoring 30-60% were mostly non-duplicates, although documents with scores close to 60% contained duplicates and near-duplicates. The results were very encouraging, as can be seen from the percentages for each set and for the total sets in Table 3. Quite surprising was the amount of duplication present in both collections, and in the overlapping collection in particular: many of the documents were full duplicates, while others were given a unique identification but still had the same or almost the same contents.

It is understood that efficiency is an important consideration when dealing with text processing. Intuitively, however, the reduced document representation is an improvement in size and efficiency. We have used an improved LCS algorithm; work is underway to apply the BLAST algorithm [4, 6].

5. Conclusions

Inspired by the parallels between text processing and sequence alignment in computational biology, a different perspective of looking at text, as a string of syntactical units, was introduced. The method takes advantage of the syntactical structure manifested in POS tags to further process documents. Documents are pre-processed using a POS tagger, converting them into strings of tags. This constitutes a representation of document content at a higher level of abstraction and enables document processing using many of the available string manipulation algorithms. Encouraging results were obtained in this first set of experiments on text duplication detection and similarity determination using the proposed method. More analysis is needed to fully understand the implications of such an approach and to address issues of efficiency and better tuning of the tags.

6. References

[1] A. Singhal, "Modern Information Retrieval: A Brief Overview", Google, Inc., IEEE, 2001.
[2] L. Bergroth, H. Hakonen and T. Raita, "A Survey of Longest Common Subsequence Algorithms", 7th International Symposium on String Processing and Information Retrieval, 27-29 Sept. 2000, pp. 39-48.
[3] H. Schmid, "Probabilistic Part-of-Speech Tagging Using Decision Trees", International Conference on New Methods in Language Processing, Manchester, UK, 1994, pp. 4-9.
[4] S. F. Altschul, W. Gish, W. Miller, E. W. Myers and D. J. Lipman, "Basic Local Alignment Search Tool", J. Mol. Biol., Vol. 215, Academic Press Limited, 1990, pp. 403-410.
[5] I. Yang, C. Huang and K. Chao, "A fast algorithm for computing a longest common increasing subsequence", Information Processing Letters, Vol. 93(5), Elsevier B.V., 2004, pp. 249-253.
[6] C. Baral, Local Alignment: Smith-Waterman Algorithm, CSE 591: Computational Molecular Biology Course, Arizona State University, 2004.
[7] Y. Liu and L. Liang, "A Dual-method Model for Copy Detection", IEEE IAT Workshops, 2006, pp. 634-637.
[8] K. Monostori, R. Finkel, A. Zaslavsky, G. Hodasz and M. Pataki, "Comparison of Overlap Detection Techniques", International Conference on Computational Science, Amsterdam, The Netherlands, 21-24 Apr. 2002, pp. 51-60.
[9] P. Clough, Old and New Challenges in Automatic Plagiarism Detection, Department of Information Studies, University of Sheffield, 2003.
[10] J. Bull, C. Collins, E. Coughlin and D. Sharp, Technical Review of Plagiarism Detection Software Report, Computer Assisted Assessment Centre, University of Luton, Luton, UK.
[11] S. Schleimer, D. S. Wilkerson and A. Aiken, "Winnowing: Local Algorithms for Document Fingerprinting", International Conference on Management of Data, ACM, 2003, pp. 76-85.
[12] N. Kang, A. Gelbukh and S. Han, PPChecker: Plagiarism Pattern Checker in Document Copy Detection, 2006.
[13] R. Steinberger, B. Pouliquen and J. Hagman, Cross-lingual Document Similarity Calculation Using the Multilingual Thesaurus EUROVOC, Springer-Verlag, Berlin Heidelberg, 2002.
[14] P. Poinçot, S. Lesteven and F. Murtagh, "Comparison of Two Document Similarity Search Engines", ASP Conference Series, Vol. 153, 1998.
[15] Reuters, Reuters Corpus (Volume 1: English Language, 1996-08-20 to 1997-08-19), NIST, 2000.
[16] D. Grune and M. Huntjens, Detecting Copied Submissions in Computer Science Workshops, Vakgroep Informatica, Faculteit Wiskunde & Informatica, Vrije Universiteit, Amsterdam, 1989.
[17] A. G. Maguitman, F. Menczer, H. Roinestad and A. Vespignani, "Algorithmic Detection of Semantic Similarity", International World Wide Web Conference Committee, 2005, pp. 107-116.
[18] R. Mihalcea, C. Corley and C. Strapparava, Corpus-based and Knowledge-based Measures of Text Semantic Similarity, American Association for Artificial Intelligence, Jul. 2006.
[19] D. M. Campbell, W. R. Chen and R. D. Smith, "Copy Detection Systems for Digital Documents", IEEE, Washington, DC, USA, May 2000, pp. 78-88.
[20] H. Schmid, "Improvements in Part-of-Speech Tagging with an Application to German", EACL SIGDAT Workshop, Dublin, Ireland, 1995.
[21] M. S. Waterman, "General Methods of Sequence Comparison", Bull. Math. Biol., Vol. 46, 1984, pp. 473-500.
[22] D. Gardner, Plagiarism and How To Avoid It, The English Centre, The University of Hong Kong, 2006. Available at http://ec.hku.hk/plagiarism/
[23] K. A. Kazuaki, "Techniques of Document Clustering: A Review", Library & Information Science Journal, Mita Press, Japan, 2003, pp. 33-75.
