Proceedings of the 4th International Sanskrit Computational Linguistics Symposium (SCLS 2010), New Delhi, December 10-12, 2010. Published in Springer LNCS, vol. 6465.

Citation Matching in Sanskrit Corpora Using Local Alignment

Abhinandan S. Prasad and Shrisha Rao
International Institute of Information Technology - Bangalore

Abstract. Citation matching is the problem of finding which citations occur in a given textual corpus. Most existing citation matching work is done on scientific literature. The goal of this paper is to present methods for performing citation matching on Sanskrit texts. Exact matching and approximate matching are the two methods presented. The exact matching method checks for an exact occurrence of the citation in the textual corpus. Approximate matching is a fuzzy string-matching method which computes a similarity score between an individual line of the textual corpus and the citation. The Smith-Waterman-Gotoh algorithm for local alignment, which is generally used in bioinformatics, is used here for calculating the similarity score. This similarity score is a measure of the closeness between the text and the citation. The exact- and approximate-matching methods are evaluated and compared. The methods presented can be easily applied to corpora in other Indic languages like Kannada, Tamil, etc. The approximate-matching method can in particular be used in the compilation of critical editions and in plagiarism detection in a literary work.

Keywords: citation matching, local alignment, Smith-Waterman-Gotoh algorithm, Sanskrit, Mahābhārata, Mahābhārata-Tātparyanirṇaya

1 Introduction

Citation matching in literature is the problem of finding which citations occur where in a given textual corpus. Citation matching is applied in areas such as authorship detection and content analysis. Currently, citation matching is limited to scientific literature, because most citation matching work, such as autonomous citation matching and identity uncertainty, is done on scientific literature. Autonomous citation matching identifies and groups variants of the same paper [7]. Identity uncertainty in the context of citation matching decides whether a set of citations corresponds to the same publication or not [11]. Citation matching is an unexplored area in Sanskrit literature for various reasons, such as the lack of encoded texts, the complexity of the Sanskrit language compared to English, and the lack of Sanskrit knowledge among computer scientists. As far as we know, no one has seriously attempted citation matching on Sanskrit literature.

This paper presents two methods for citation matching in Sanskrit texts: exact matching and approximate matching. Exact matching looks for a precise character-by-character match between pattern and text, so it finds only citations that occur in the corpus exactly as they are quoted. The search space for a given text with citations is directly proportional to the size of the corpus in which the citations are sought. Sorting the corpus reduces the number of comparisons by eliminating unnecessary ones. This method cannot find citations if there are occurrences of pāṭhāntara (variant readings) or sandhi-viccheda (splits of compounds).

Such occurrences provide the motivation for the approximate matching method. The pāṭhāntara and sandhi-viccheda problems can be easily handled if we consider similarity or closeness rather than identity. Approximate matching is based on this idea, which is widely applied in bioinformatics through a common technique called local alignment. The Smith-Waterman algorithm [16] is one of the predominant algorithms for local alignment: it finds a pair of segments in two nucleotide or amino acid (protein) sequences such that no other pair of segments has greater similarity [16]. The Smith-Waterman-Gotoh algorithm [6] is an extension of the Smith-Waterman algorithm which uses an affine gap penalty to reduce the computational overhead of the basic Smith-Waterman algorithm. Our approximate matching method computes the Smith-Waterman-Gotoh distance [6] between citation and text, and the results are filtered based on a similarity cutoff. As intuition would suggest, this method is computationally intensive compared to exact matching.
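The local-alignment scoring described above can be illustrated with a minimal Python sketch of Smith-Waterman with affine gaps in the style of Gotoh. This is not the code used in the paper. The function names, the score values (match, mismatch, gap_open, gap_extend) and the normalisation by the shorter string are all assumptions chosen for illustration, and the Harvard-Kyoto strings in the usage note are transliterations made for this example.

```python
def swg_score(a, b, match=2.0, mismatch=-1.0, gap_open=2.0, gap_extend=0.5):
    """Best local-alignment score of a and b with affine gap penalties.

    A gap of length k costs gap_open + (k - 1) * gap_extend.
    All score values are illustrative placeholders.
    """
    neg_inf = float("-inf")
    prev_h = [0.0] * (len(b) + 1)      # best score of an alignment ending at (i-1, j)
    prev_f = [neg_inf] * (len(b) + 1)  # same, but ending in a gap that skips a[i-1]
    best = 0.0
    for i in range(1, len(a) + 1):
        cur_h = [0.0] * (len(b) + 1)
        cur_f = [neg_inf] * (len(b) + 1)
        e = neg_inf                    # alignment ending in a gap that skips b[j-1]
        for j in range(1, len(b) + 1):
            # Either open a new gap or extend the one already running.
            e = max(cur_h[j - 1] - gap_open, e - gap_extend)
            cur_f[j] = max(prev_h[j] - gap_open, prev_f[j] - gap_extend)
            diag = prev_h[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # The 0 lets an alignment restart anywhere: this makes it local.
            cur_h[j] = max(0.0, diag, e, cur_f[j])
            best = max(best, cur_h[j])
        prev_h, prev_f = cur_h, cur_f
    return best


def similarity(citation, line, match=2.0, **penalties):
    """Score scaled to [0, 1] by the best score the shorter string could reach."""
    shorter = min(len(citation), len(line))
    if shorter == 0:
        return 0.0
    return swg_score(citation, line, match=match, **penalties) / (match * shorter)


def filter_by_cutoff(citation, corpus_lines, cutoff=0.8):
    """Keep the corpus lines whose similarity to the citation reaches the cutoff."""
    scored = [(similarity(citation, line), line) for line in corpus_lines]
    return [(s, line) for s, line in scored if s >= cutoff]
```

With these placeholder scores, a call such as similarity("tumulaM lomaharSaNam", "tumulaM romaharSaNam") stays close to 1 despite the variant reading, while unrelated strings score near 0. The cutoff of 0.8 is likewise only an assumed value.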
An experiment is conducted using the Mahābhārata-Tātparyanirṇaya as the base text where citations may exist, and the Mahābhārata [15] as the source text in which to look for citations.1 The Mahābhārata-Tātparyanirṇaya is a commentary on the Mahābhārata in digest form.2 Its author Madhva (1238–1317 CE) indicates that many of its verses are taken from the Mahābhārata directly:

भारतेऽपि यथा प्रोक्तो निर्णयोऽयं क्रमेण तु । तथा प्रदर्शयिष्यामस्तद्वाक्यैरेव सर्वशः ॥ 2-53 ॥

However, many of the citations are yet to be traced. Madhva's citation of untraceable sources has in general been the subject of some controversy since the 17th century.3 Thus, beyond the general and computational aspects, we believe this work also has probative value in the context of this particular textual controversy.

The rest of the paper is structured as follows. Section 2 presents related work in other domains like bioinformatics. Section 3 describes the problem and the challenges. Section 4 presents the method for sorting a Sanskrit corpus. Section 5 presents the exact matching method. Section 6 presents the approximate matching method. Section 7 presents some conclusions and suggestions for further work.

2 Related Work

Citation matching is an active area of research. It is used for identifying authors of scientific papers [7, 11], knowledge discovery, etc. Citation matching is currently limited to scientific literature like research papers.

Sorting is performed to reduce the search space when we use the exact matching method. There are classical algorithms for sorting, like heap sort, merge sort, etc.4 There is an x86 assembly language program [1] which sorts Sanskrit texts. Mahoney [8] implements sorting using Perl. Neither [1] nor [8] specifies the maximum text size that these programs can handle. Also, as far as we know, nobody has sorted the entire Mahābhārata text.

Gale and Church [5] describe a method for aligning sentences based on a simple statistical model of character length. This method is used by Csernel and Patte [4] for performing comparisons between Sanskrit manuscripts.

Local alignment is one of the sequence alignment techniques used in bioinformatics to find regions of similarity between two DNA, RNA or protein sequences of unequal length. The Smith-Waterman algorithm [16] is widely used for performing local alignment. It is more accurate but more computationally intensive than other methods like BLAST. The Smith-Waterman-Gotoh algorithm [6] is an extension of the Smith-Waterman algorithm. Symmetric [2] is an open source similarity measurement library implemented in Java. Basic algorithms like edit distance and Smith-Waterman-Gotoh [6] are implemented in this library, which we have used.

3 Problem Definition

Citation matching is an unexplored area in natural language texts, especially in Sanskrit literature. Table 1 shows examples of sources and citations.

Citation | Source
मानुषीं तनुमाश्रितं | मानुषीं तनुं आश्रितं
कालो वा कारणं राज्ञो | कालो वा कारणं राज्ञो
युद्धं तुमुलं लोमहर्षणं | युद्धं तुमुलं रोमहर्षणं
ब्रह्मचारी कौमारादपि पाण्डवः | ब्रह्मचारी कौमाराद् अपि पाण्डवः

Table 1. Source and Citation Examples

Citation matching in Sanskrit literature is complex compared to the same problem with scientific literature, for the following reasons.

(i) The Sanskrit alphabetical system is entirely different from that of the English language. In Sanskrit each letter is a combination of a consonant and a vowel, like क् + अ = क, but this kind of combination is not present in English. Modern English does not use ligatures, whereas Sanskrit uses ligatures like क् + क = क्क.
(ii) In many cases, the corpus has pāṭhāntaras with respect to the text, and there are also cases of differences due to sandhi-viccheda (e.g., तनुमाश्रितम् and तनुं आश्रितम्).
(iii) The size of the corpus may be very large (the ASCII text [15] of the Mahābhārata being about 8 MB in size). It is challenging to handle data sets of this size.
(iv) The conversion of Sanskrit texts into machine-readable formats is prone to error as it involves human interaction. There is a high probability of error in the machine-readable format, which in turn affects the result significantly.

For both methods, exact matching and approximate matching, we use the Mahābhārata-Tātparyanirṇaya as the base text where citations may exist, and the Mahābhārata [15] as the source text or corpus in which to look for citations. To evaluate these methods, the Mahābhārata and the Mahābhārata-Tātparyanirṇaya are both encoded in the ASCII-based Harvard-Kyoto format. The Harvard-Kyoto format is one of the common encoding schemes used to make Sanskrit texts machine readable.
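The effect of sandhi-viccheda on exact matching can be seen in a minimal Python sketch. The Harvard-Kyoto strings below are transliterations made for this example, not lines copied from the encoded corpus. The function name exact_match and the choice of substring containment as the matching test are assumptions.

```python
def exact_match(citation, corpus_lines):
    """Return the indices of the corpus lines that contain the citation verbatim."""
    return [i for i, line in enumerate(corpus_lines) if citation in line]


# Two illustrative corpus lines in Harvard-Kyoto transliteration.
corpus = [
    "mAnuSIM tanuM Azritam",   # written with the sandhi split
    "kAlo vA kAraNaM rAjJo",
]

# The citation keeps the sandhi, so a character-by-character test finds nothing.
print(exact_match("mAnuSIM tanumAzritam", corpus))

# A citation that agrees with the corpus letter for letter is found.
print(exact_match("kAlo vA kAraNaM rAjJo", corpus))
```

The first call returns an empty list although both strings represent the same words. This is the gap that a similarity-based comparison is meant to close.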

4 Sorting

Exact matching compares each line of the source corpus with each citation. The number of comparisons is directly proportional to both the corpus size and the number of citations. If there are m lines of source corpus and n citations, then the total number of comparisons needed in exact matching is m × n. Consider a corpus like the BORI Mahābhārata, which has around 80000 verses or 160000 lines, and a text set with 100 citations; the total number of comparisons is then 16 × 10^6.

One approach to reducing the number of comparisons is to sort the source corpus. Sorting reduces the search space drastically. Consider a citation ककुदं तस्य चाभाति स्कन्धं आपूर्य विष्ठितम्. It is enough to compare it with the Mahābhārata hemistichs starting with क, and we can ignore the rest of the verses.

Sorting Sanskrit texts is not as straightforward as sorting English texts. Even classical sorting tools like the Unix sort fail to sort Sanskrit text. Table 2 shows the input and output of the sort command in Unix, which sorts in Roman, rather than Sanskrit, alphabetical order. In Sanskrit each letter is a combination of a consonant and a vowel, so the comparison function should look at both the consonant and the vowel when comparing words. Consider the example of की and कि. कि should come before की because कि = क् + इ and की = क् + ई. Even though both have क् in common, they differ in their vowels, and इ comes before ई in the Sanskrit alphabetical system.
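The two ideas in this section, comparing in Sanskrit alphabetical order and restricting the search to lines that share the citation's first letter, can be sketched in Python as follows. This is an illustration under stated assumptions, not the sorting method of this paper. It assumes Harvard-Kyoto input, places anusvāra (M) and visarga (H) after the vowels, reads digraphs such as kh or ai greedily as single letters, and ranks spaces and unknown characters first. All names are hypothetical and the sample lines are transliterations made for this example.

```python
import bisect

# Harvard-Kyoto symbols in an assumed Sanskrit alphabetical order.
HK_ORDER = [
    "a", "A", "i", "I", "u", "U", "R", "RR", "lR", "lRR", "e", "ai", "o", "au",
    "M", "H",
    "k", "kh", "g", "gh", "G", "c", "ch", "j", "jh", "J",
    "T", "Th", "D", "Dh", "N", "t", "th", "d", "dh", "n",
    "p", "ph", "b", "bh", "m", "y", "r", "l", "v", "z", "S", "s", "h",
]
RANK = {sym: i + 1 for i, sym in enumerate(HK_ORDER)}  # 0 is kept for everything else


def sort_key(text):
    """Turn a Harvard-Kyoto string into a list of letter ranks."""
    key, i = [], 0
    while i < len(text):
        for size in (3, 2, 1):              # try the longest symbol first
            sym = text[i:i + size]
            if sym in RANK:
                key.append(RANK[sym])
                i += len(sym)
                break
        else:                               # space, digit or unknown character
            key.append(0)
            i += 1
    return key


def candidates(citation, sorted_lines, sorted_keys):
    """Lines of the sorted corpus that start with the citation's first letter."""
    first = sort_key(citation)[:1]
    if not first:
        return []
    lo = bisect.bisect_left(sorted_keys, first)
    hi = bisect.bisect_left(sorted_keys, [first[0] + 1])
    return sorted_lines[lo:hi]


lines = ["kathaM virATanagare", "janamejaya uvAca",
         "ajJAtavAsaM uSitA", "kakudaM tasya cAbhAti"]
sorted_lines = sorted(lines, key=sort_key)
sorted_keys = [sort_key(line) for line in sorted_lines]
```

With this key, "ki" sorts before "kI" and the k-lines precede "janamejaya uvAca", whereas the plain sorted(lines) call follows byte order and puts "kI" before "ki" and j before k. A call such as candidates("kakudaM tasya cAbhAti", sorted_lines, sorted_keys) then returns only the two lines beginning with k, so the citation is compared with 2 lines instead of 4.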

Input

Output

кT\ EvrAVngr  mm pv EptAmhA, ajAtvAs\ uEqtA dyo DnByAEd tA, яnm яy uvAc g(vArm\ b }AhmZ