Copy detection in Urdu Language Documents using N ... - IEEE Xplore
Copy detection in Urdu Language Documents using N-grams Model

Muhammad A. Khan¹, Abdul Aleem¹, Abdul Wahab¹, M. Nasir Khan²
¹Department of Computer Science, University of Peshawar, 25000 PAKISTAN
²GIK Institute of Engineering & Technology, Topi
¹{mak.fast, abdul_aleem13, uop.wahab}@yahoo.com, ²mnasirkhan174@gmail.com

Abstract: In this paper we present our work on copy detection in short Urdu text passages. Given two passages, one as the source text and another as the copied text, we determine whether the second passage is a plagiarized version of the source text. We have developed an algorithm for plagiarism detection. We have used the n-gram model for word retrieval and found tri-grams to be the best model for comparing the Urdu text passages. Based on probability and the resemblance measures calculated from the bi-gram comparison, we categorize the passages against a threshold. In the algorithm, the connecting words are considered when computing and matching trigrams. We have developed a software system in C# for both the algorithms. This system can be used to detect copying in students' assignments in the Urdu language.

I. INTRODUCTION

The advent and availability of digital information has made it possible to send, share, save and use digital data. Virtually no system is available that can prohibit or limit the misuse of the available data. Other issues associated with the misuse of digital data are ownership detection, copyright issues and plagiarism detection. Plagiarism detection is of particular interest to people in academia and the publishing sector. Plagiarism means copying the thoughts and text of another author and presenting them as one's own work [1]. One way to ensure quality in academic research is through the application of plagiarism detection. Donor and sponsor agencies like the Higher Education Commission (HEC) are interested in determining the quality of research work when deciding eligibility for a grant or fund. One of the quality assurance steps is checking the work for plagiarism. Plagiarism in academics is considered academic dishonesty, and those responsible are subject to punishment by the university or the research funding organization.

The N-gram model was first used in text categorization based on the statistical information gathered from the usage of sequences of characters [4]. N-grams are consecutive overlapping characters formed from an input stream. N-grams are used as an alternative to word-based retrieval of text. In section 2 the proposed detection algorithm is explained; in section 3 the experiments on different passages and the comparison with other n-gram models are presented. In section 4 the conclusion and future work of the paper are given.

II. PROPOSED PLAGIARISM DETECTION SYSTEM
We have proposed a plagiarism detection system for Urdu text passages based on the n-gram model. We have used trigrams as our model for representing the text. A trigram is a token of three consecutive words; trigrams are extracted from each passage and the trigrams of the two passages are matched. The resemblance measure is then computed for text categorization. The resemblance measure R [3] is defined as
$$R = \frac{|S(A) \cap S(B)|}{|S(A) \cup S(B)|} \tag{1}$$

where S(A) is the set of trigrams from passage A and S(B) is the set of trigrams from passage B. The matched trigrams are calculated as

$$M = |S(A) \cap S(B)| \tag{2}$$

and the total number of trigrams is computed as

$$N = |S(A) \cup S(B)| \tag{3}$$
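Equations (1) to (3) can be illustrated with a minimal Python sketch. This is not the authors' C# system. The function names `word_trigrams` and `resemblance` are hypothetical, splitting on whitespace is an assumed tokenization, and the Latin letters stand in for Urdu words purely for readability.

```python
def word_trigrams(passage):
    """Return the set of distinct word trigrams in a passage.

    Assumption: words are separated by whitespace.
    """
    words = passage.split()
    return {tuple(words[i:i + 3]) for i in range(len(words) - 2)}


def resemblance(passage_a, passage_b):
    """Return (R, M, N) as defined in equations (1), (2) and (3)."""
    s_a = word_trigrams(passage_a)
    s_b = word_trigrams(passage_b)
    matched = len(s_a & s_b)  # M = |S(A) ∩ S(B)|
    total = len(s_a | s_b)    # N = |S(A) ∪ S(B)|
    # Assumption: R is taken as 0 when neither passage yields a trigram.
    r = matched / total if total else 0.0
    return r, matched, total


r, m, n = resemblance("a b c d e", "a b c d f")
print(r, m, n)  # prints: 0.5 2 4
```

Each toy passage yields three trigrams, two of which are shared, so M = 2, N = 4 and R = 0.5. Because the trigrams are held in sets, repeated trigrams within a passage count once.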
The value of R ranges between 0 and 1. We have set a threshold of 75% resemblance as the yardstick for classifying text as plagiarized.

1) Punctuation Removal

First of all, the punctuation is removed from the passage. The algorithm used for punctuation removal is as follows.
Listing no. 1: The pseudo code for punctuation removal from passages.

Clean String (STR)
1. Define legalcharacterset = "all valid Urdu characters"
2. Initialize String = "validcharacterset"
3. Define CleanString = empty string
   For index = 0 to str.length - 1
       currentcharacter = str.charAt(index)
       If legalcharacterset.indexOf(currentcharacter) >= 0 Then
           CleanString += currentcharacter
   End of loop
4. Return clean string
5. Exit

2) Algorithm for Extracting and Matching Trigram

The pseudo code of the algorithm for calculating trigrams from the given passages is given in listing no. 2. In the algorithm for matching trigrams, the search is stopped as soon as the first match of a trigram is encountered. The reason is that the sets S(A) and S(B) can contain only distinct trigrams from the passages.

Listing no. 2: The pseudo code for extracting and comparing trigrams from the given passages.

/* Creating Trigrams from the Passage */

III. EXPERIMENTS

1) Experiments with Trigram Model

We have used passages from the standard Urdu text books and rephrased them ourselves (the text passages can be provided on request).

The following are two passages, P1 and P2. P1 [2] is the original passage (taken from Urdu text book of ih class, NWFP text book board) and P2 is the rephrased version of P1. Trigrams for both passages are calculated; table no. 1 lists the trigrams calculated for P1 and a second table contains the trigrams computed for P2.

2) Comparison with other n-gram Models
The tri-gram model is compared with other n-gram models to assess our selection of tri-grams as the word extraction model. In table no. 4, three passages and their rephrased versions are compared for copy detection using trigram and four-gram models. The copy detection rate of the bi-gram model is the highest, but the complexity of extracting and comparing bi-grams is also the highest. The copy detection rate of the four-gram model is the lowest; it finds very few matched four-grams because it compares longer word sequences. The trigram model gives an acceptable average performance at an affordable cost in terms of complexity and false alarms.
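The trade-off between n-gram sizes can be seen in a small Python sketch that cleans two passages and computes the resemblance R for bi-grams, trigrams and four-grams. This is an illustration under assumptions, not the authors' C# implementation: the helper names are hypothetical, punctuation is dropped by Unicode category rather than by the legal-character whitelist of listing no. 1, and the English sentences are made-up stand-ins for an Urdu passage and its rephrased version, not the paper's data.

```python
import unicodedata


def clean(passage):
    """Drop punctuation characters and keep everything else.

    Swapped-in technique: filter on Unicode general category P*,
    instead of a whitelist of valid Urdu characters.
    """
    return "".join(ch for ch in passage
                   if not unicodedata.category(ch).startswith("P"))


def word_ngrams(passage, n):
    """Return the set of distinct word n-grams of a cleaned passage."""
    words = clean(passage).split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def resemblance(passage_a, passage_b, n):
    """Resemblance R between two passages using word n-grams."""
    s_a = word_ngrams(passage_a, n)
    s_b = word_ngrams(passage_b, n)
    union = s_a | s_b
    # Assumption: R is taken as 0 when no n-grams can be formed.
    return len(s_a & s_b) / len(union) if union else 0.0


# Toy stand-in passages; one word differs between them.
source = "the quick brown fox jumps over the lazy dog."
rephrased = "the quick brown fox leaps over the lazy dog!"

for n in (2, 3, 4):
    print(n, resemblance(source, rephrased, n))
# prints:
# 2 0.6
# 3 0.4
# 4 0.2
```

A single changed word breaks n consecutive n-grams, so the measured resemblance drops as n grows. In this toy case only the bi-gram score comes near a 75% threshold, which mirrors the observation that shorter n-grams detect more copying while longer ones find fewer matches.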
IV. CONCLUSION AND FUTURE WORK

In this paper we have presented a copy detection mechanism using trigrams as our word extraction model. We have used the resemblance measure R for computing the probability of matching text. Based on a threshold, the given text is categorized as plagiarized. To assess the validity of selecting the trigram model, we have compared it with bi-gram and four-gram models. This comparison gives further confidence in our results and in the selection of trigrams as the model. In future we will extend this study and try to compute an adaptive threshold using machine learning techniques for classifying Urdu text.