Detection of Plagiarism in Database Schemas Using Structural Fingerprints

Samer M. Abd El-Wahed
Department of Information Technology, Institute of Graduate Studies and Research, Alexandria University, Egypt
[email protected]

Ahmed Elfatatry
Department of Information Technology, Institute of Graduate Studies and Research, Alexandria University, Egypt
[email protected]

Mohamed S. Abougabal
Department of Computer and Systems Engineering, Faculty of Engineering, Alexandria University, Egypt
[email protected]

Abstract

This research is concerned with plagiarism in software design, with a specific focus on database design. Plagiarized software may be produced through the illegal use of someone else's source code or database design. Since it is hard to steal source code from the original author(s), the database schema is usually easier to view, study, and hence plagiarize. Efficient design of a database schema requires an understanding of the application area, the data usage patterns, and the underlying database management system. Stealing the effort of this step may be rewarding to some. The main contribution of this paper is a software tool for detecting plagiarized database schemas using structural fingerprints of database tables.

Keywords: Copyright infringement, Plagiarism, Software forensics, Database, Intellectual property

I. INTRODUCTION

The success of any information system critically depends on how the analysis phase is carried out. The "know-how" and core competencies needed to accomplish this task are essential for its completion. In this context, there is intellectual property associated with solving a specific problem in a given domain. Such "know-how" may have a commercial value protected by law as intellectual property. Plagiarizing a database schema is not only stealing part of the design; it also implicitly involves stealing the effort and the know-how of the analysis phase. Finding a correlation between the source code files of two different programs does not necessarily mean that illicit behavior has occurred [1]. Detecting plagiarism in database schemas is even harder. There are, however, many occasions in which similarity can be considered evidence of copyright infringement. For example, a highly similar database schema may include unrequired fields, which is not reasonable.

Manual approaches to detecting plagiarism in database schemas are time consuming, labor intensive, and usually involve multiple readings of each suspect schema. In contrast, software is ideally suited to automating the detection of verbatim plagiarism, but it is not capable of higher levels of detection. Nevertheless, software can serve as a valuable tool for evaluating the degree of similarity. In fact, some researchers [3, 12, 19, 20] do not foresee fully automated schema matching as a possibility, and orient their research towards assisting human-performed schema matching.

This paper presents a tool for the detection of plagiarism in database schemas. The tool is based on an algorithm that generates a fingerprint for every individual table in a schema and then uses it to scan the suspected schemas and locate possible similarities. Finally, an annotated report is generated containing statistics about the degree of similarity found.

The organization of this paper is as follows: a survey of schema matching research is presented in section II. In section III, facts about database structure are discussed. In section IV, related work in software plagiarism detection is reviewed. In section V, an algorithm for detecting plagiarism in database schemas is presented. Section VI presents an evaluation of the suggested algorithm. Finally, a step-by-step guideline for investigating database schema similarity in courts is suggested in section VII.

II. A SURVEY OF SCHEMA MATCHING RESEARCH

In its simplest form, database schema matching consists of identifying two elements from two different schemas as semantically equivalent, or matching [5]. There has been considerable research in this area. However, as this paper focuses on detecting suspected plagiarism in schemas, many of these techniques do not fit our needs. Historically, the main purpose of schema matching techniques has been either to merge two or more databases, or to enable queries on multiple, heterogeneous databases to be formulated on a single schema.

A recent literature review classified ongoing research on schema matching into two main techniques: rule-based and learning-based techniques [11].

A. Rule-Based Techniques

A wide variety of rules have been used to match schemas using element names, data types, structures, and numbers of sub-elements. For instance, a rule may state that "two elements match if they have the same name and the same number of sub-elements" [18]. Other techniques employ rules that categorize elements based on names, data types, and domains [17]. Rule-based techniques provide several benefits. First, they are relatively inexpensive and do not require training, as learning-based techniques do. Second, they typically operate only on schemas (not on data instances), and hence are fairly fast. Third, they can work very well in certain types of applications [11]. For example, given the two relational schemas in figure 1, the process finds matches such as: Item in schema 1 matches Item in schema 2, Location in schema 1 matches (Address concatenated with Country) in schema 2, and Name in schema 1 matches Vendor Name in schema 2.

Figure 1: Semantic correspondences between two relational databases.
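To make the quoted name-and-arity rule concrete, the following is a minimal sketch, not code from the systems cited in [17, 18]; the element names, sub-element lists, and the representation of a schema as a list of (name, sub-elements) pairs are all hypothetical.

```python
# Minimal sketch of a rule-based matcher. The rule implemented is the one
# quoted above: two elements match if they have the same name and the same
# number of sub-elements. Element names are illustrative only.

def elements_match(elem_a, elem_b):
    """Rule: same (case-insensitive) name and same number of sub-elements."""
    name_a, subs_a = elem_a
    name_b, subs_b = elem_b
    return name_a.lower() == name_b.lower() and len(subs_a) == len(subs_b)

schema_1 = [("Item", ["ID", "Name", "Location"])]
schema_2 = [("Item", ["ID", "Vendor Name", "Address"]),
            ("Vendor", ["ID", "Country"])]

matches = [(a[0], b[0])
           for a in schema_1 for b in schema_2
           if elements_match(a, b)]
print(matches)   # [('Item', 'Item')]
```

Real rule-based matchers combine several such rules, for example name similarity together with data-type compatibility, as described above.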

B. Learning-Based Techniques

These techniques typically utilize both schema elements and data to determine matches. Some techniques use neural network learning approaches [15], while others use a Naive Bayes learning approach [4, 10]. The key idea is that a matching tool must be able to learn from past matches in order to predict successful matches. However, one potential drawback of learning-based techniques is that they not only require a training data set consisting of correct matches, but also need a large training data set from which to deduce rules. Such techniques are described in [11]. A combination of both techniques is discussed in [5].

III. DATABASE STRUCTURE

The database schema consists of the structures and operations necessary to define the way data is logically organized and accessed within a database management system [8]. In a relational database, the schema defines the tables and the fields in each table. A table groups an entity's attributes in the database schema, representing them using simple data types such as integer, floating point, and character types [14]. It is worth noting that the order of attributes does not matter in the schema. In other words, there is no such thing as "the first attribute" or "the second attribute"; attributes are always referenced by name, not by position [9].
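As a small illustration of this point, a table can be modelled as a mapping from attribute names to data types, in which ordering carries no meaning; the table and attribute names below are hypothetical.

```python
# A table modelled as a mapping from attribute names to data types.
employee = {
    "ID": "Integer",
    "Name": "VarChar",
    "HireDate": "DateTime",
}

# Attributes are referenced by name, never by position:
print(employee["HireDate"])            # DateTime

# Two definitions listing the same attributes in a different order
# describe the same table structure.
employee_reordered = {
    "HireDate": "DateTime",
    "ID": "Integer",
    "Name": "VarChar",
}
print(employee == employee_reordered)  # True
```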

Creativity is an important issue to be protected in computer programs and database design [1], whether the database is simple, consisting of a small number of files, or complex, containing multiple files and relations.

IV. RELATED WORK IN PLAGIARISM DETECTION METHODS

Plagiarism detection is a two-step process: first, source code is transformed into a language-independent format (tokenized); second, a comparison algorithm is applied to the tokenized code [2, 16].

A. Tokenization

The tokenization procedure aims to convert source code into token strings that represent the code in a language-independent form. This is where different parts of the code are replaced by consistent tokens [7].

B. Comparison Algorithms

CCFinder [13] performs a token-by-token matching algorithm, focusing on analyzing large-scale systems with a limited amount of language dependence. Other approaches use pattern matching algorithms [2]. An evaluation of five detection tools (JPlag, MOSS, Covet, CCFinder, and CloneDr) is presented in [6]. GPLAG [12] detects program plagiarism based on the analysis of program dependence graphs (PDG). A recent technique takes a few seconds to find (simulated) plagiarism in programs having thousands of lines of code. Several techniques are insensitive to identifier renaming, statement reordering, and control replacement; in this way, an intensively plagiarized program is treated as equivalent to an exact copy.
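As a toy illustration of this two-step process, and not of the actual algorithms used by CCFinder or GPLAG, the sketch below replaces identifiers and literals with consistent tokens and then compares the resulting token strings; the keyword set, code fragments, and similarity measure are simplifications chosen here.

```python
# Toy illustration of "tokenize, then compare". Identifiers and numeric
# literals are replaced by the consistent tokens ID and LIT, so renaming a
# variable does not change the token string.
import re
from difflib import SequenceMatcher

KEYWORDS = {"if", "for", "while", "return", "int"}

def tokenize(source):
    words = re.findall(r"[A-Za-z_]\w*|\d+|\S", source)
    tokens = []
    for w in words:
        if w in KEYWORDS:
            tokens.append(w)
        elif re.fullmatch(r"\d+", w):
            tokens.append("LIT")
        elif re.fullmatch(r"[A-Za-z_]\w*", w):
            tokens.append("ID")
        else:
            tokens.append(w)
    return tokens

original = "int total = 0; for (i = 0; i < n; i = i + 1) total = total + a[i];"
renamed  = "int sum = 0; for (k = 0; k < m; k = k + 1) sum = sum + b[k];"

t1, t2 = tokenize(original), tokenize(renamed)
print(SequenceMatcher(None, t1, t2).ratio())   # 1.0 -- identical token strings
```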

V. ALGORITHM FOR DETECTING THE PLAGIARIZED SCHEMA

From the above discussion, it is clear that tables, attributes, and relations are the significant elements of a database. Therefore, the database tables are tokenized in order to generate a fingerprint for every individual table in the database, which is then used to scan the suspected database and locate possible similarities. Each attribute in a table has one and only one type. Several important types are listed in table 1, which also illustrates how data types can be tokenized. Moreover, as in the schema matching survey above, rules have been considered and employed to match schemas. In its simplest form, the algorithm considers two tables to be matching if they have the same number of attributes and the same data types.

TABLE 1. DEMONSTRATION OF HOW DATA TYPES CAN BE TOKENIZED

Data Type               Description                                                   Token
VarChar                 Text and numbers, such as names and addresses                 T
Int, Number, Currency   Numerical data to perform mathematical calculations           N
Text, Memo              Lengthy text and numbers, such as comments or explanations    M
DateTime                Date and time                                                 D

Since the table structure is not affected by attribute ordering, the resulting fingerprint has to be sorted alphabetically after the tokenization process. Then, a series of ranking lists is generated. This deliberately ignores easy-to-modify actions such as attribute reordering, name changing, and attribute splitting. The steps of detection are shown in figure 2.

Example:
ID: Integer
First_Name: VarChar(20)
Middle_Name: VarChar(20)
Last_Name: VarChar(20)
Age: Short Integer
Length: Short Integer
Description: Text
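The following is a minimal sketch of these steps under the stated assumptions: each table is represented as a mapping from attribute names to declared types, the type-to-token mapping follows table 1 (extended with Integer and Short Integer as numerical types), and the helper names are ours rather than part of the tool.

```python
# Minimal sketch of the fingerprinting step (tokenize, sort, encode) and of
# the scan for tables with identical fingerprints. The type-to-token mapping
# follows table 1; table representations and helper names are illustrative.
from collections import Counter

TYPE_TOKENS = {
    "VARCHAR": "T",                                    # text and numbers
    "INT": "N", "NUMBER": "N", "CURRENCY": "N",        # numerical data
    "INTEGER": "N", "SHORT INTEGER": "N",
    "TEXT": "M", "MEMO": "M",                          # lengthy text
    "DATETIME": "D",                                   # date and time
}

def tokenize_type(declared_type):
    """Map a declared type such as 'VarChar(20)' to its one-letter token."""
    base = declared_type.split("(")[0].strip().upper()
    return TYPE_TOKENS[base]

def fingerprint(table):
    """Tokenize every attribute, sort alphabetically, and encode as counts."""
    tokens = sorted(tokenize_type(t) for t in table.values())
    counts = Counter(tokens)
    return "".join(f"{counts[tok]}{tok}" for tok in sorted(counts))

person = {
    "ID": "Integer",
    "First_Name": "VarChar(20)",
    "Middle_Name": "VarChar(20)",
    "Last_Name": "VarChar(20)",
    "Age": "Short Integer",
    "Length": "Short Integer",
    "Description": "Text",
}
print(fingerprint(person))   # 1M3N3T, as in figure 2

def scan(original_schema, suspect_schema):
    """Report pairs of tables from the two schemas that share a fingerprint."""
    suspect_prints = {name: fingerprint(t) for name, t in suspect_schema.items()}
    for name, table in original_schema.items():
        fp = fingerprint(table)
        for other, other_fp in suspect_prints.items():
            if fp == other_fp:
                print(f"{name} ~ {other} ({fp})")
```

Running scan over two schemas reports candidate table pairs; as discussed under Case 2 below, fingerprints such as [1N1T] produced by trivial two-attribute lookup tables would normally be filtered out before reporting.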

Assume that we have two schemas, S1 and S2, each with a different number of tables. Similarity can be established between such schemas if one or more tables have the same data-type attributes as a table or tables in the other schema. The algorithm generates a fingerprint for every individual table in the two schemas, e.g., S1 {[5N1T], [1N1T], [8N1T], [9N2T], [1M5N2T], ...} and S2 {[1M2N], [1N1T], [13N2T], [1M7N], [6N], ...}. Then, scanning is performed to detect similarities.

A. Experimental Results

Programmers tend to create their schemas in standardized ways, so some tables from different schemas may have the same fingerprint even if the schemas come from different domains. In general, three cases were found in the experimental results using this technique:

• Case 1: No similarity is detected. This indicates that no copyright infringement has taken place, because no tables with the same types of attributes were found.

• Case 2: Some tables are used to store a single field (to avoid redundancy); a case in point is a table that contains a country list or a units list. Such tables consist of two attributes, one for the ID and the second for the name or description. In such a case, the fingerprint will be [1N1T]. This kind of similarity is ignored.

• Case 3: Some tables in schema 1 were also found in schema 2, which gives an indication to start human testing. The testing must consider two points:
  • Testing the original authorship: the schema may not be owned by the plaintiff or the defendant; it may have been created by a third party.
  • Worthiness: similar tables may not indicate unique ideas or creativity. This is the case when modeling usual objects.

1- Log in to the database.
2- Tokenize each table in the database (e.g., a table yields NTTTNNM).
3- Sort the resulting tokens alphabetically (MNNNTTT).
4- Encode the resulting fingerprints (1M3N3T).
5- Generate a fingerprint for every individual table in the database and use it to scan the suspected database and locate the similar tables.

Figure 2: Plagiarism detection.

VI. EVALUATION OF THE SUGGESTED ALGORITHM

To demonstrate the accuracy and effectiveness of the suggested algorithm, an experiment on real data was performed. We applied our technique to different schemas from different domains, and we also performed cross-domain testing.

TABLE 2. DOMAINS AND DATA SOURCES FOR THE EXPERIMENT

Domains                No. of schemas   Average no. of tables   Matching schemas   Matching tables
Within same domains    60               53                      30-100%            3%
Different domains      20               48                      5-15%              0%

VII. INVESTIGATING SCHEMA PLAGIARISM IN COURTS

Similar database schemas may have been created merely from the common marketing needs of a particular industry or profession. Therefore, even if the schemas are similar, it is not necessarily a copyright infringement. The following points must be considered in the computer expert's report in order to prove an incidence of software plagiarism:

1- If direct (exact) duplication has occurred: copyright law does not protect works that do not exhibit creative elements, so an element of creativity has to be proven to criminalize such an act.
2- A high similarity in database schemas may include unrequired fields, which is not reasonable. Therefore, the author of the original source needs to be identified; in other words, a third-party artifact may have been used.
3- If exact duplication cannot be proven: it is not logical that the defendant would generate new tables or alter the schema just to hide plagiarism, because severe modification may have a significant effect on database performance. Therefore, other factors must be tested.
4- Newly developed schemas, where only the idea is used: in such a case the plaintiff has to obtain a valid patent before suing whoever made use of the idea.

VIII. CONCLUSION


This work has presented an automated tool for the detection of schema similarity based on fingerprints of database tables. From a legal perspective, the existence of similarity alone does not prove the occurrence of a criminal act; further intervention of human experts might then be needed. However, proving a "no similarity" case is evidence that no copyright infringement has occurred.


REFERENCES


[1] Samer Abd El-Wahed, Ahmed Elfatatry, and Mohamed S. Abougabal (2007) "A New Look at Software Plagiarism Investigation and Copyright Infringement". 5th International Conference on Information and Communications Technology (ICICT 2007), 16-18 December 2007, Cairo, Egypt.
[2] Hamid Abdul Basit and Stan Jarzabek (2005) "Detecting Higher-level Similarity Patterns in Programs". European Software Engineering Conference and ACM SIGSOFT Symposium on the Foundations of Software Engineering, September 2005.
[3] Aumueller, D., Do, H., Massmann, S., and Rahm, E. (2005) "Schema and Ontology Matching with COMA++". Proceedings of the 2005 ACM SIGMOD International Conference on Management of Data, pp. 906-908.


[4] Berlin, J., and Motro, A. (2002) "Database schema matching using machine learning with feature selection". In Proceedings of the Conference on Advanced Information Systems Engineering (CAiSE).
[5] Roger Blake (2007) "A Survey of Schema Matching Research". University of Massachusetts Boston, College of Management, Working Papers, September 2007.
[6] Burd, E., and Bailey, J. (2002) "Evaluating Clone Detection Tools for Use during Preventative Maintenance". In Proceedings of the Second IEEE International Workshop on Source Code Analysis and Manipulation (SCAM '02), October 2002.
[7] Burrows et al. (2004) "Efficient and Effective Plagiarism Detection for Large Code Repositories". Proceedings of the Second Australian Undergraduate Students' Computing Conference.
[8] C. J. Date (1990) "An Introduction to Database Systems", Volume 1, Fifth Edition. Addison-Wesley Publishing Company, Inc.
[9] C. J. Date (2005) "Database In Depth: Relational Theory for Practitioners". O'Reilly Media, Inc.
[10] Doan, Domingos, and Halevy (2001) "Reconciling schemas of disparate data sources: A machine learning approach". In Proceedings of the ACM SIGMOD Conference.
[11] Doan, A., and Halevy, A. (2005) "Semantic-Integration Research in the Database Community". AI Magazine, Spring 2005, pp. 83-94.
[12] Halevy, A. (2005) "Why Your Data Won't Mix". ACM Queue, October 2005, pp. 50-58.
[13] Kamiya, T., Ohata, F., Kondou, K., Kusumoto, S., and Inoue, K. (2001) "Maintenance Support Tools for Java Programs: CCFinder and JAAT". International Conference on Software Engineering.
[14] Katherine C. Morris and Mary Mitchell (1992) "Database Management Systems in Engineering". U.S. Department of Commerce, National Institute of Standards and Technology, Gaithersburg, Maryland 20899.
[15] Li, W., Clifton, C., and Liu, S. (2000) "Database integration using neural network: implementation and experience". Knowledge and Information Systems 2(1):73-96.
[16] Chao Liu et al. (2006) "GPLAG: Detection of Software Plagiarism by Program Dependence Graph Analysis". KDD '06, August 20-23, Philadelphia, Pennsylvania, USA. ACM, 2006.
[17] Madhavan, J., Bernstein, P., and Rahm, E. (2001) "Generic schema matching with Cupid". In Proceedings of the International Conference on Very Large Databases (VLDB).
[18] Milo, T., and Zohar, S. (1998) "Using schema matching to simplify heterogeneous data translation". In Proceedings of the International Conference on Very Large Databases (VLDB).
[19] Ram, S., and Park, J. (2004) "Semantic Conflict Resolution Ontology (SCROL): An Ontology for Detecting and Resolving Data and Schema-Level Semantic Conflicts". IEEE Transactions on Knowledge and Data Engineering, 16(2), pp. 189-202.
[20] Yan, L., Miller, R., Haas, L., and Fagin, R. (2001) "Data-Driven Understanding and Refinement of Schema Mappings". SIGMOD Record, 30(2), pp. 485-496.
