Copy detection in Urdu Language Documents using N ... - IEEE Xplore

Copy detection in Urdu Language Documents using N-grams Model

Muhammad A. Khan1, Abdul Aleem1, Abdul Wahab1, M. Nasir Khan2

1 Department of Computer Science, University of Peshawar, 25000 PAKISTAN
2 GIK Institute of Engineering & Technology, Topi

1{mak.fast, abdul_aleem13, uop.wahab}@yahoo.com, 2mnasirkhan174@gmail.com

Abstract: In this paper we present our work on copy detection in short Urdu text passages. Given two passages, one as the source text and another as the copied text, it is determined whether the second passage is a plagiarized version of the source text. We have developed an algorithm for plagiarism detection. We have used the n-gram model for word retrieval and found tri-grams to be the best model for comparing Urdu text passages. Based on probability and the resemblance measures calculated from the bi-gram comparison, we categorize the passages on a threshold. In the algorithm, the connecting words are considered in computing and matching trigrams. We have developed a software system in C# for both the algorithms. This system can be used to detect copying in students' assignments in the Urdu language.

Keywords: Copy detection, N-gram Model, Bi-gram, Urdu Language, Natural Language Processing

I. INTRODUCTION

The advent and availability of digital information has made it possible to send, share, save and use digital data. Virtually no system is available that can prohibit or limit the misuse of the available data. Other issues associated with the misuse of digital data are ownership detection, copyright issues and plagiarism detection. Plagiarism detection is of particular interest to people in academia and the publishing sector. Plagiarism means copying the thought and text of another author and presenting them as one's own work [1]. One way to ensure quality in academic research is through the application of plagiarism detection. Donor and sponsor agencies like the Higher Education Commission (HEC) are interested in determining the quality of research work when deciding eligibility for a grant or fund. One of the quality assurance steps is checking the work for plagiarism. Plagiarism in academics is considered academic dishonesty, and those responsible are subject to punishment by the university or the research funding organization.

The N-gram model was first used in text categorization based on the statistical information gathered from the usage of sequences of characters [4]. N-grams are consecutive overlapping characters formed from an input stream. N-grams are used as an alternative to word based retrieval of text.

In section 2 the proposed detection algorithm is explained. In section 3 the experiments on different passages and the comparison with other n-gram models are presented. In section 4 the conclusion and future work of the paper are given.

II. PROPOSED PLAGIARISM DETECTION SYSTEM

We have proposed a plagiarism detection system for Urdu text passages based on the n-gram model. We have used the trigram as our model for representing the text. A trigram is a token of three consecutive words; the trigrams are extracted from both passages and then matched. The resemblance measures are then computed for text categorization. The resemblance measure R [3] is defined as

$$R = \frac{|S(A) \cap S(B)|}{|S(A) \cup S(B)|} \tag{1}$$

where S(A) is the set of trigrams from passage A and S(B) is the set of trigrams from passage B. The number of matched trigrams is calculated as

$$M = |S(A) \cap S(B)| \tag{2}$$

and the total number of trigrams is computed as

$$N = |S(A) \cup S(B)| \tag{3}$$
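To make the three quantities concrete, the following is a minimal Python sketch (not the authors' C# system) that builds the trigram sets S(A) and S(B) from whitespace-separated words and evaluates M, N and R = M / N. The function names `word_trigrams`, `resemblance` and `is_plagiarized` are hypothetical, the 0.75 default mirrors the 75% threshold this section adopts, and treating a value exactly at the threshold as plagiarized is an assumption.

```python
def word_trigrams(passage: str) -> set[str]:
    """Return the set of distinct word trigrams of a passage (S(A) or S(B))."""
    words = passage.split()  # assumption: words are separated by whitespace
    return {" ".join(words[i:i + 3]) for i in range(len(words) - 2)}


def resemblance(passage_a: str, passage_b: str) -> tuple[int, int, float]:
    """Return (M, N, R) as in equations (1) to (3)."""
    s_a = word_trigrams(passage_a)
    s_b = word_trigrams(passage_b)
    matched = len(s_a & s_b)   # M = |S(A) intersect S(B)|
    total = len(s_a | s_b)     # N = |S(A) union S(B)|
    r = matched / total if total else 0.0  # assumption: R = 0 when no trigrams exist
    return matched, total, r


def is_plagiarized(passage_a: str, passage_b: str, threshold: float = 0.75) -> bool:
    """Classify using the threshold; '>=' at the boundary is an assumption."""
    return resemblance(passage_a, passage_b)[2] >= threshold


# English placeholder words are used only for readability; Urdu words
# separated by spaces are handled the same way.
source = "one two three four five"
suspect = "one two three four six"
print(resemblance(source, suspect))     # (2, 4, 0.5)
print(is_plagiarized(source, suspect))  # False
```

Because S(A) and S(B) are sets, a trigram that repeats within one passage is counted only once, which keeps R between 0 and 1.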

The value of R ranges between 0 and 1. We have set a threshold of 75% resemblance as the yardstick for classifying text as plagiarized.

1) Punctuation Removal

First of all, the punctuation is removed from the passage. The algorithm used in the punctuation removal is as follows.

978-1-61284-941-6/11/$26.00 ©2011 IEEE

Listing no. 1: the pseudo code for punctuation removal from passages.

CleanString(STR)
1. Define legalcharacterset = "all valid urdu characters"
2. Initialize String = "validcharacterset"
3. Define CleanString = empty string
   For index = 0 to STR.length - 1
       currentcharacter = STR.charAt(index)
       If legalcharacterset.indexOf(currentcharacter) >= 0 Then
           CleanString += currentcharacter
   End of loop
4. Return CleanString
5. Exit

2) Algorithm for Extracting and Matching Trigram

The pseudo code of the algorithm for extracting trigrams from the given passages and matching them is given in listing no. 2. When matching, the search for a trigram stops as soon as its first match is encountered, because the sets S(A) and S(B) contain only the distinct trigrams of each passage.

Listing no. 2: the pseudo code for extracting and comparing trigrams from the given passages.

/* Creating Tokens out of String */
String[] words1 = passage1.split(" ");
String[] words2 = passage2.split(" ");

/* Creating Trigrams from Passage 1 and Passage 2 */
For index = 0 to words1.length - 3
    trigrams1[index] = words1[index] + " " + words1[index + 1] + " " + words1[index + 2]
For index = 0 to words2.length - 3
    trigrams2[index] = words2[index] + " " + words2[index + 1] + " " + words2[index + 2]

/* Token Matching */
count = 0
For i = 0 to trigrams1.length - 1
    For j = 0 to trigrams2.length - 1
        If (trigrams1[i] == trigrams2[j]) Then
            Increment count
            Break

III. EXPERIMENTS

1) Experiments with Trigram Model

We have used passages from the standard Urdu text books and rephrased them ourselves (the text passages can be provided on request).

The following are two passages, J1 and J2. Passage J1 [2] is the original passage (taken from Urdu text book of ih class, NWFP text book board) and J2 is the rephrased version of J1. Trigrams are calculated for both passages: table no. 1 lists the trigrams calculated for J1 and a further table contains the trigrams computed for passage J2.

2) Comparison with other n-gram Models

The tri-gram model is compared with other n-gram models to assess our selection of tri-grams as the word extraction model. In table no. 4, three passages and their rephrased versions are compared for copy detection using trigram and four-gram models. The copy detection rate of the bi-gram model is the highest, but the complexity of extracting and comparing bi-grams is also the highest. The copy detection rate of the four-gram model is the lowest; because it compares longer word sequences, it finds only a very small number of matched four-grams. The trigram model gives acceptable average performance at an affordable cost in terms of complexity and false alarms.

IV. CONCLUSION & FUTURE WORK

In this paper we have presented a copy detection mechanism using trigrams as our word extraction model. We have used the resemblance measure R for computing the probability of matching text. Based on a threshold, the given text is categorized as plagiarized. To assess the validity of the trigram model selection, we have compared it with bi-gram and four-gram models. This comparison gives further confidence in our results and in the selection of the trigram as the model. In future we will extend this study and try to compute an adaptive threshold using machine learning techniques for classifying Urdu text.
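As a rough, runnable illustration of the pipeline described in this paper (punctuation removal against a legal character set, word n-gram extraction, and the resemblance measure R evaluated for bi-gram, trigram and four-gram models), the Python sketch below may help. It is not the authors' C# implementation. The names `clean_string`, `ngrams`, `resemblance` and `compare_models` are hypothetical, and the legal character set is only approximated here as the Arabic Unicode block (U+0600 to U+06FF) plus the space character, minus four common Urdu punctuation marks; a real system would need a carefully curated character list.

```python
# Assumption: these four marks (Arabic comma, semicolon, question mark,
# and the Urdu full stop) stand in for "punctuation" in this sketch.
URDU_PUNCTUATION = {"\u060C", "\u061B", "\u061F", "\u06D4"}


def is_legal(ch: str) -> bool:
    """Approximate the legal character set: space or a non-punctuation
    character from the Arabic Unicode block (an assumption, see lead-in)."""
    return ch == " " or ("\u0600" <= ch <= "\u06FF" and ch not in URDU_PUNCTUATION)


def clean_string(text: str) -> str:
    """Keep only legal characters, as in the punctuation removal listing."""
    return "".join(ch for ch in text if is_legal(ch))


def ngrams(text: str, n: int) -> set[tuple[str, ...]]:
    """Return the set of distinct word n-grams of the cleaned text."""
    words = clean_string(text).split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def resemblance(text_a: str, text_b: str, n: int) -> float:
    """R = |S(A) intersect S(B)| / |S(A) union S(B)| for word n-grams."""
    s_a, s_b = ngrams(text_a, n), ngrams(text_b, n)
    union = s_a | s_b
    return len(s_a & s_b) / len(union) if union else 0.0


def compare_models(text_a: str, text_b: str, sizes=(2, 3, 4)) -> dict[int, float]:
    """Evaluate R for several n-gram sizes on the same pair of passages."""
    return {n: resemblance(text_a, text_b, n) for n in sizes}


# Two short made-up passages that differ only in their last word.
original = "یہ ایک کتاب ہے اور۔"
rephrased = "یہ ایک کتاب ہے قلم"
print(compare_models(original, rephrased))
```

For this toy pair, R falls as n grows (0.6 for bi-grams, 0.5 for trigrams, about 0.33 for four-grams), which is in line with the paper's observation that longer n-grams find fewer matches. Note that, like the punctuation removal listing, the sketch simply drops illegal characters, so two words separated only by a punctuation mark would be merged.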


[Table: "THE LIST OF ..." for Passage No. J1; the Urdu table contents did not survive extraction.]
