Semantic Sequence Kin: A Method of Document Copy Detection
Jun-Peng Bao, Jun-Yi Shen, Xiao-Dong Liu, Hai-Yan Liu, and Xiao-Di Zhang
Department of Computer Science and Engineering, Xi'an Jiaotong University, Xi'an 710049, People's Republic of China
[email protected]
Abstract. String matching and the global word frequency model are the two basic models of Document Copy Detection, but both are unsatisfactory in some respects. The String Kernel (SK) and Word Sequence Kernel (WSK) can map string pairs directly into a new feature space in which the data is linearly separable. This idea inspires our Semantic Sequence Kin (SSK), which we apply to document copy detection. SK and WSK take into account only the gap between the first and the last word/term, so they are not well suited to plagiarism detection. SSK considers the position of every common word and can therefore detect plagiarism at a fine granularity. SSK is based on semantic density, which is in essence local word frequency information. We believe these measures greatly diminish the noise introduced by rewording. We test SSK on a small corpus containing several common copy types. The results show that SSK is excellent at detecting non-rewording plagiarism and remains valid even when documents are reworded to some extent.
1 Introduction
In this paper we propose a novel Semantic Sequence Kin (SSK) that is based on local semantic density rather than on the common global word frequency, and we apply it to Document Copy Detection (DCD) rather than Text Classification (TC). The task of DCD is to detect whether part or all of a given document is copied from other documents, i.e. plagiarized. The word frequency based kernel, although popular in TC, is not suitable for DCD. The word frequency model mainly captures the global semantic features of a document but loses detailed local features and structural information. For example, the TF-IDF (Term Frequency - Inverse Document Frequency) vector is a basic document representation in TC, yet a TF-IDF vector cannot distinguish two sentences (or sections) that are merely different arrangements of the same words and usually have different meanings (a small illustration is given at the end of this section). By matching strings, we can locate plagiarized sentences exactly, and many DCD prototypes [4-6] indeed prefer this approach. Such a method first extracts some strings, called fingerprints, as text features and then matches the fingerprints to detect plagiarism. The string matching model exploits mainly the local features of a document. It can hardly resist noise, and rewording sentences may impair the detection precision
heavily. Therefore, it is better to take both global and local features into account in order to detect plagiarism in detail while resisting rewording noise. In SSK, we first find the semantic sequences based on the concept of semantic density, which represents the locally frequent semantic features, and then we collect all of the semantic sequences to capture the global features of the document. When we calculate the similarity between document features, we absorb the ideas of the word sequence kernel and the string kernel. In the next section, we introduce related work on string kernels and DCD. We present the Semantic Sequence Kin in detail in Section 3 and report experimental results in Section 4. We discuss some aspects of SSK in Section 5. Finally, we draw conclusions in Section 6.
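The limitation of the global word frequency model noted above can be illustrated with a minimal sketch (in Python; the tokenization and the example sentences are our own simplifying assumptions, not part of any cited system). Two sentences built from exactly the same words in different orders receive identical bag-of-words vectors, so any purely frequency-based similarity measure judges them identical:

from collections import Counter

def term_frequency_vector(sentence):
    # A plain bag-of-words (term frequency) representation.
    tokens = sentence.lower().split()
    return Counter(tokens)

s1 = "the dog bit the man"
s2 = "the man bit the dog"

# The two vectors are indistinguishable, although the sentences
# clearly carry different meanings.
print(term_frequency_vector(s1) == term_frequency_vector(s2))  # True

A string matching model, by contrast, separates these two sentences immediately, which is why SSK combines both kinds of features.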
2 Related Work
Joachims [1] first applied SVM to TC. He used the VSM (Vector Space Model) to construct the text feature vector, which contains only global word frequency information without any structural (sequence) information. Lodhi et al. [3] proposed the string kernel method, which classifies documents by the common subsequences between them. The string kernel exploits structural information (i.e. gaps between terms) instead of word frequency. Shortly afterwards, Cancedda et al. [2] introduced the word sequence kernel, which extends the idea of the string kernel: it greatly expands the number of symbols to consider, as the symbols are words rather than characters (a small illustrative sketch is given at the end of this section). Kernel methods are now popular in TC, but we have not found any application of them to DCD.

Brin et al. [4] proposed the first DCD prototype, COPS, which detects overlap based on sentence and string matching, but it has some difficulties in detecting sentences and cannot find partial sentence copies. To improve on COPS, Shivakumar and Garcia-Molina [7] developed SCAM (Stanford Copy Analysis Method), which measures overlap based on word frequency. Heintze [6] developed the KOALA system for plagiarism detection, and Broder et al. [5] proposed a shingling method to determine the syntactic similarity of files; these two systems are similar to COPS. Monostori et al. [8] proposed the MDR (Match Detect Reveal) prototype to detect plagiarism in large collections of electronic texts. It is also based on string matching, but it uses a suffix tree to find and store strings. Si et al. [9] built a copy detection mechanism, CHECK, which parses each document into an internal indexing structure called the structural characteristic (SC), used in the document registration and comparison modules. Song et al. [10] presented an algorithm (CDSDG) to detect illegal copying and distribution of digital goods, which in essence combines CHECK and SCAM to discover plagiarism.
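To make the gap-weighted idea behind SK and WSK concrete, the sketch below (an illustrative brute-force version in Python with an assumed decay factor lam; it is not the recursive formulation of Lodhi et al. or Cancedda et al., nor any part of the cited prototypes) scores two word sequences by their common word subsequences of a fixed length n, discounting each match according to how widely it is spread in each text:

from itertools import combinations

def gap_weighted_similarity(s, t, n=2, lam=0.5):
    # Sum lam**(span_in_s + span_in_t) over every pair of index tuples
    # that pick out the same length-n word subsequence in both texts.
    s, t = s.lower().split(), t.lower().split()
    score = 0.0
    for i_idx in combinations(range(len(s)), n):
        for j_idx in combinations(range(len(t)), n):
            if all(s[i] == t[j] for i, j in zip(i_idx, j_idx)):
                span_s = i_idx[-1] - i_idx[0] + 1
                span_t = j_idx[-1] - j_idx[0] + 1
                score += lam ** (span_s + span_t)
    return score

print(gap_weighted_similarity("the cat sat on the mat",
                              "the cat lay on the mat"))

Note that only the overall spread of each matched subsequence is penalized; SSK, described in the next section, instead tracks the position of every common word.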
3 Semantic Sequence Kin
In the following we first introduce some concepts related to the semantic sequence, and then present SSK in detail.
3.1 Semantic Density and Semantic Sequence

Definition 1. Let S be a sequence of words, i.e. S = S_1 S_2 ... S_n. We denote the word at position i in S by S_i. The word distance of position i (1 ≤ i ≤ n), denoted by σ(i), is the number of words between S_i and its nearest preceding occurrence S_h:

    σ(i) = i − h                                (1)

where S_h = S_i and S_k ≠ S_i (1 ≤ h < k < i).
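As a concrete reading of Definition 1, the following sketch (in Python; the convention that σ(i) equals i when S_i has no preceding occurrence is our assumption, since the original text is truncated at this point) computes the word distance for every position of a token sequence:

def word_distances(tokens):
    # sigma(i) for each 1-based position i: the distance i - h to the
    # nearest preceding occurrence of the same word at position h.
    last_seen = {}
    sigma = []
    for i, word in enumerate(tokens, start=1):
        if word in last_seen:
            sigma.append(i - last_seen[word])
        else:
            sigma.append(i)  # assumed convention for a first occurrence
        last_seen[word] = i
    return sigma

print(word_distances("a b a c b a".split()))  # [1, 2, 2, 4, 3, 3]

Intuitively, a small word distance indicates that a word recurs within a short window, i.e. it is locally frequent, which is the local word frequency information that semantic density captures.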