Sim: A Utility For Detecting Similarity in Computer Programs


David Gitchell & Nicholas Tran
Department of Computer Science
Wichita State University
Wichita, KS 67060-0083
{ddgitche,tran}@cs.twsu.edu

Abstract

We describe the design and implementation of a program called sim to measure similarity between two C computer programs. It is useful for detecting plagiarism among a large set of homework programs. This software is part of a project to construct tools to assist the teaching of computer science.

1 Introduction

Computer science education began with the establishment of the first CS Departments more than thirty years ago, and yet its practitioners have lagged behind their colleagues in other CS subfields in developing software tools for their trade. Today there is an impressive array of software systems at the disposal of software designers, circuit designers, network administrators, numerical analysts, and digital artists. In contrast, computer science educators are still relying on traditional tools and techniques: their main tool for communicating ideas, only recently supplemented by course webpages, is still chalk and blackboard, and their main tool for evaluation is still a human grader.

We believe that this dearth of software tools is not due to a lack of diligence from CS educators but rather to the absence of a mature theory in the field. Research on computational learning theory [15], which investigates the limitations and costs of different strategies of learning, began only in the mid 1980's. Similarly, research on program checking, which measures correctness of programs by examining their behavior [3, 4, 5], began only in the late 1980's. Both of these developing areas are generating deep and surprising results, which will certainly lead to improvements in the practice of computer science education in the near future.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. SIGCSE '99 3/99 New Orleans, LA, USA © 1999 ACM 1-58113-085-6/99/0003...$5.00


Yet at this point not all software development in CS education requires further foundational work. As an example, look at the task of evaluating assignments and exams in lower-level programming courses. Such courses tend to have high enrollment and a large number of programming assignments with simple solutions, which are to be evaluated on not only their correctness but also their style and uniqueness. A software tool for this task would need to examine the programs' structure and would benefit from the already well-developed theory of string algorithms.

In this paper we describe the design and implementation of a program called sim that measures structural similarity between two C computer programs, a task that underlies the evaluation of correctness, style, and uniqueness. This program is a direct application of string alignment techniques that were recently developed to detect similarity between DNA strings [6, 12] (see also [7, 13, 14]). Given two programs, sim first reduces them to their parse trees using a standard lexical analyzer. Viewing the parse trees as strings, sim then aligns them by inserting spaces in each to obtain a maximal common subsequence of tokens. A score between 0.0 and 1.0 is reported as the degree of similarity between the two programs. In its current form, sim runs in time O(s^2), where s is the maximum size of the parse trees. The bulk of sim is implemented in C++, and its graphical user interface is implemented in Tcl/Tk.

As a first application, we used sim to detect plagiarism. Experiments on real data showed that sim is resistant to extensive name changes, reordering of statements and functions, and adding and removing white space and comments. sim is reasonably fast on small-sized programs: comparisons between all pairs of 56 programs with an average length of 3415 bytes take about three and a half minutes.

The rest of this paper is organized as follows. Section 2 explains the underlying string alignment algorithm used by sim, Section 3 describes the design and implementation of sim, Section 4 presents the experimental setup and results, and Section 5 discusses future improvements.

We conclude this section with a review of previous work on detecting structural similarity in programs. The Unix diff command [8, 9] also uses string alignment methods to detect similarity between two programs, but its basic textual unit is a line instead of a parse tree token as in sim; diff cannot detect systematic name changes or textual reordering of modules. Baker's dup program [2] reports all maximal exact matches over a threshold length between two programs; it can detect systematic (parameterized) name changes and reordering of large modules, but it is ineffective when spurious statements are inserted, or when a small block of statements is reordered. Aiken's moss program [1], developed at UC Berkeley for plagiarism detection, also examines program structure, but its underlying algorithm for comparison has not been made public. Other tools designed specifically for plagiarism detection such as [10, 11] are based on heuristics that measure statistics on frequencies of words and symbols, and thus may have a high rate of false positives.

2

D(i, j - 1). The running time of evaluating the best alignment is O(|s||t|). The space requirement is O(max(|s|, |t|)), since only two rows are needed by the computation at any time.
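The two-row evaluation mentioned here can be illustrated with a minimal sketch in C. It is not sim's actual code: the function name align_score and the scoring constants MATCH, MISMATCH, and GAP are assumptions made for illustration, since the scores used by sim are not preserved in this text. Tokens are represented as plain integers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical scoring constants (assumptions, not sim's actual values). */
enum { MATCH = 1, MISMATCH = -1, GAP = -1 };

static int max3(int a, int b, int c) {
    int m = a > b ? a : b;
    return m > c ? m : c;
}

/* Score of the best global alignment of token strings s (length n) and
 * t (length m). Only the previous and the current row of the table D are
 * kept, so the extra space is two rows of m + 1 entries. */
int align_score(const int *s, size_t n, const int *t, size_t m) {
    int *prev = malloc((m + 1) * sizeof *prev);
    int *cur = malloc((m + 1) * sizeof *cur);
    assert(prev != NULL && cur != NULL);

    /* Row 0: the first j tokens of t aligned against spaces only. */
    for (size_t j = 0; j <= m; ++j)
        prev[j] = (int)j * GAP;

    for (size_t i = 1; i <= n; ++i) {
        cur[0] = (int)i * GAP; /* first i tokens of s against spaces */
        for (size_t j = 1; j <= m; ++j) {
            int diag = prev[j - 1] + (s[i - 1] == t[j - 1] ? MATCH : MISMATCH);
            int up = prev[j] + GAP;      /* from D(i - 1, j) */
            int left = cur[j - 1] + GAP; /* from D(i, j - 1) */
            cur[j] = max3(diag, up, left);
        }
        int *tmp = prev; /* the current row becomes the previous row */
        prev = cur;
        cur = tmp;
    }

    int result = prev[m];
    free(prev);
    free(cur);
    return result;
}
```

Each of the |s||t| table entries is computed once from three neighbours, which matches the stated running time. This sketch returns only the score; recovering the alignment itself would require keeping more of the table or a divide-and-conquer variant.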

3 Design and Implementation

sim was implemented in 1780 lines of C++ and 393 lines of Tcl/Tk
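To illustrate why comparing token streams can be resistant to name changes, white space, and comments, here is a minimal hand-written tokenizer sketch in C. It is only a stand-in for the standard lexical analyzer that sim uses: the function tokenize, the token codes, and the short keyword list are all assumptions made for this example, and string literals and // comments are not handled.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical token codes; punctuation is represented by its character code. */
enum { TOK_ID = 256, TOK_NUM = 257, TOK_KEYWORD_BASE = 300 };

/* A deliberately short keyword list, for illustration only. */
static const char *const KEYWORDS[] = { "int", "void", "for", "if", "return", "while" };
#define N_KEYWORDS (sizeof KEYWORDS / sizeof KEYWORDS[0])

/* Return the keyword's token code, or TOK_ID for any other name. */
static int classify_word(const char *w, size_t len) {
    for (size_t k = 0; k < N_KEYWORDS; ++k)
        if (strlen(KEYWORDS[k]) == len && strncmp(KEYWORDS[k], w, len) == 0)
            return TOK_KEYWORD_BASE + (int)k;
    return TOK_ID;
}

/* Tokenize src into out (capacity cap); returns the number of tokens written.
 * White space and block comments produce no tokens, and every identifier
 * maps to the same code, so renaming a variable leaves the stream unchanged. */
size_t tokenize(const char *src, int *out, size_t cap) {
    assert(src != NULL && out != NULL);
    size_t n = 0;
    const char *p = src;
    while (*p != '\0' && n < cap) {
        if (isspace((unsigned char)*p)) {
            ++p;
        } else if (p[0] == '/' && p[1] == '*') {
            p += 2; /* skip the comment body up to the closing delimiter */
            while (*p != '\0' && !(p[0] == '*' && p[1] == '/'))
                ++p;
            if (*p != '\0')
                p += 2;
        } else if (isalpha((unsigned char)*p) || *p == '_') {
            const char *start = p;
            while (isalnum((unsigned char)*p) || *p == '_')
                ++p;
            out[n++] = classify_word(start, (size_t)(p - start));
        } else if (isdigit((unsigned char)*p)) {
            while (isalnum((unsigned char)*p))
                ++p;
            out[n++] = TOK_NUM; /* all numeric literals share one code */
        } else {
            out[n++] = (unsigned char)*p++;
        }
    }
    return n;
}
```

Under these assumptions, the inputs "int x = 0; /* init */" and "int   counter=0;" yield the same five tokens, so an alignment over the two streams would treat them as identical.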

[Figure: listings of sample C programs that randomly shuffle an array and print it (functions shuffle, print_array, and main); the listings are not recoverable from the extracted text.]
