Paraphrase Identification of Malayalam Sentences


Ditty Mathew, Dr. Sumam Mary Idicula, Department of Computer Science, CUSAT

Paraphrase Identification of Malayalam Sentences - an Experience

Abstract—Sentences with different structures may convey the same meaning. Identification of sentences that are paraphrases plays an important role in text-related research and applications. This work focuses on statistical measures and semantic analysis of Malayalam sentences to detect paraphrases. The statistical similarity measures between sentences, based on symbolic characteristics and structural information, can measure the similarity between sentences without any prior knowledge, using only the statistical information of the sentences. The semantic representation of the Universal Networking Language (UNL) captures only the inherent meaning of a sentence, without any syntactic details. Thus, comparing the UNL graphs of two sentences can give an insight into how semantically similar the two sentences are. Combining the statistical similarity and semantic similarity scores yields the overall similarity score. This is the first attempt towards paraphrase identification of Malayalam sentences.

Index Terms—Statistical Similarity, Semantic Textual Similarity (STS), Universal Networking Language (UNL)

I. INTRODUCTION

Paraphrase is an expression of the same message in different words. Paraphrase identification has many applications in the areas of information retrieval, information extraction, natural language processing, machine translation, etc. Paraphrases can be identified by calculating the similarity between sentences. Calculating the similarity between sentences is the basis of measuring the similarity between texts, which is the key to document classification and clustering. Sentence similarity is also one of the key issues in sentence alignment, sentence clustering, question answering, etc. To illustrate the concept of paraphrase, consider the following sentence pair,

Each sentence can be considered as a paraphrase of the other. They both describe the same event and communicate the same information. Paraphrases can be detected using sentence similarity measures such as symbolic similarity, structural similarity and semantic similarity. If two sentences are similar, then the words in the two sentences will be similar; here, "similar" means that the words are similar either symbolically or semantically. Two sentences with different symbolic and structural information can still convey the same or similar meaning. Semantic similarity of sentences is based on the meanings of the words and the syntax of the sentences. If two sentences are similar, the structural relations between their words will be similar. Structural relations include the relations between words and the distances between words. If the structures of two sentences are similar, then there is a possibility that they convey similar meanings.

Symbolic similarity and structural similarity can be calculated using statistical methods. In statistical methods, the similarity between sentences is measured based only on the statistical information of the sentences. Some of the statistical measures include similarity based on word distance, word order, word set, word vector, edit distance, etc. The semantic similarity approach makes extensive use of information about similarities between word meanings. The semantic representation of the Universal Networking Language (UNL) captures only the inherent meaning of a sentence, without any syntactic details. Thus, comparing the UNL graphs of two sentences can give an insight into how semantically similar the two sentences are. The overall sentence similarity is calculated as the sum of two components, the semantic similarity and the syntactic similarity, weighted by a smoothing factor.

II. RELATED WORK

Several works have evaluated different approaches to measure the similarity between English sentences. Statistical similarity between sentences can be calculated based on word set, word order, word vector, edit distance, word distance, etc. [1]. In [4], three classes of sentence similarity measures, namely word overlap measures, phrasal overlap measures and linguistic measures, are combined to find the overall similarity. The similarity score produced by these measures is a normalized real number from 0 to 1. In [7], the UNL graphs of two sentences are compared to find how semantically similar the two sentences are; it presents a UNL graph matching method for the semantic similarity task. The Universal Networking Language (UNL) itself is described in [16], [17]. The conversion to UNL of Tamil sentences, which are similar to Malayalam, is described in [2]. The information needed to construct the UNL structure is available at different linguistic levels. Malayalam, being a morphologically rich language, allows a large amount of information, including syntactic categorization and thematic case relations, to be extracted from the morphological level itself. Information relating concepts, such as verbs to thematic cases, adjectival components to nouns and adverbial components to verbs, is available through syntactic functional grouping. It is done by a specially designed parser, taking into consideration the


requirements of the UNL structure.

III. PROPOSED METHOD

In this work, statistical measures and semantic measures are used to find paraphrases. The statistical techniques selected in this work are based on word set, word vector, word order and word distance. The UNL graph matching method is used for the Semantic Textual Similarity (STS) task. The Universal Networking Language (UNL) gives the semantic representation of sentences in a graphical form. It is a computer language that enables computers to process information and knowledge, and it is designed to replicate the functions of natural languages. By comparing the similarity of these graphs, the semantic content of the two sentences can be compared, rather than the similarities in their syntax. After calculating the statistical similarity and the semantic similarity, the overall similarity of the two sentences is calculated by combining these two measures. The overall architecture is shown in Fig. 1.

Fig. 1. Overall Architecture

IV. STATISTICAL MEASURES

The only preprocessing needed for the statistical techniques is tokenization. Tokenization is the process of breaking a stream of text up into meaningful elements called tokens by removing extraneous punctuation. This is useful both in linguistics and in computer science, where it forms part of lexical analysis. These tokens are often loosely referred to as terms or words, but it is sometimes important to make a token distinction. The following statistical methods are used to find the similarity of sentences.

A. Jaccard Similarity

Jaccard similarity is a word set based measure in which the word sets of the two sentences are formed first. There are two ways here: one is to form the word set of a sentence from the original words in the sentence, the second is to use stemmed words. The first method is used here since it takes into consideration the tense and voice information. Let w(Sa) be the set of words in the first sentence Sa and w(Sb) be the set of words in the second sentence Sb. After the word sets are formed, the Jaccard similarity is computed as:

Jaccard(Sa, Sb) = |w(Sa) ∩ w(Sb)| / |w(Sa) ∪ w(Sb)|   (1)

B. Dice Similarity

Dice similarity is also a word set based measure, which is computed as:

Dice(Sa, Sb) = 2|w(Sa) ∩ w(Sb)| / (|w(Sa)| + |w(Sb)|)   (2)
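As an illustration of Eqs. (1) and (2), the two word-set measures can be sketched as follows; simple whitespace tokenization is assumed here, and the example sentences are illustrative, not from the paper's dataset.

```python
# Word-set similarity measures: Jaccard (Eq. 1) and Dice (Eq. 2).
# Assumes whitespace tokenization of already-preprocessed sentences.

def jaccard(sa: str, sb: str) -> float:
    """|w(Sa) ∩ w(Sb)| / |w(Sa) ∪ w(Sb)| over unstemmed word sets."""
    wa, wb = set(sa.split()), set(sb.split())
    return len(wa & wb) / len(wa | wb)

def dice(sa: str, sb: str) -> float:
    """2|w(Sa) ∩ w(Sb)| / (|w(Sa)| + |w(Sb)|)."""
    wa, wb = set(sa.split()), set(sb.split())
    return 2 * len(wa & wb) / (len(wa) + len(wb))

print(jaccard("the cat sat here", "the cat ran here"))  # 3 shared / 5 total = 0.6
print(dice("the cat sat here", "the cat ran here"))     # 2*3 / (4+4) = 0.75
```

Note that Dice always scores at least as high as Jaccard for the same pair, which matches the Table I figures later in the paper (0.67 vs 0.8).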

C. Cosine Similarity

For cosine similarity, the word vectors of the sentences are constructed first. If the words in w(Sa) and w(Sb) are assigned weights, the word vectors of Sa and Sb can be represented as:

v(Sa) = {(w1, wa1), (w2, wa2), ..., (w(i+j), wa(i+j))}
v(Sb) = {(w1, wb1), (w2, wb2), ..., (w(i+j), wb(i+j))}

Then the cosine similarity between the sentences can be calculated by:

Cosine(Sa, Sb) = Σ(k=1..i+j) wak·wbk / ( √(Σ(k=1..i+j) wak²) · √(Σ(k=1..i+j) wbk²) )   (3)
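A minimal sketch of Eq. (3), assuming term-frequency weights over the union vocabulary of the two sentences (the paper does not specify the weighting scheme, so raw counts are an assumption here):

```python
from collections import Counter
from math import sqrt

# Cosine similarity (Eq. 3) over term-frequency word vectors built
# from the union vocabulary w1 ... w(i+j) of the two sentences.

def cosine(sa: str, sb: str) -> float:
    va, vb = Counter(sa.split()), Counter(sb.split())
    vocab = set(va) | set(vb)                      # union vocabulary
    dot = sum(va[w] * vb[w] for w in vocab)        # Σ wak · wbk
    na = sqrt(sum(c * c for c in va.values()))     # √Σ wak²
    nb = sqrt(sum(c * c for c in vb.values()))     # √Σ wbk²
    return dot / (na * nb)

print(cosine("the cat sat here", "the cat ran here"))  # 3 / (2*2) = 0.75
```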

D. Word Order Similarity

Sentence similarity based on word order requires constructing the order vectors of the two sentences first. If sentence Sa has words (wa1, wa2, ..., wai) and sentence Sb has words (wb1, wb2, ..., wbi), then the word order vectors for Sa and Sb are:

L(Sa) = {(wa1, wa2), (wa1, wa3), ..., (wa(i-1), wai)}
L(Sb) = {(wb1, wb2), (wb1, wb3), ..., (wb(i-1), wbi)}

where (wx, wy) ∈ L(Sa) ∪ L(Sb) means that wx occurs before wy. The similarity between Sa and Sb can then be calculated from the orders of the words by:

WordOrder(Sa, Sb) = |L(Sa) ∩ L(Sb)| / |L(Sa) ∪ L(Sb)|   (4)
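Eq. (4) can be sketched by mapping each sentence to its set of ordered word pairs and comparing the sets Jaccard-style; sentences with at least two words and no repeated words are assumed.

```python
from itertools import combinations

# Word order similarity (Eq. 4): L(S) is the set of ordered pairs
# (wx, wy) with wx occurring before wy in the sentence.

def order_pairs(s: str) -> set:
    return set(combinations(s.split(), 2))

def word_order(sa: str, sb: str) -> float:
    la, lb = order_pairs(sa), order_pairs(sb)
    return len(la & lb) / len(la | lb)

# "a b c" vs "a c b": shared pairs {(a,b), (a,c)}, union of 4 pairs.
print(word_order("a b c", "a c b"))  # 2/4 = 0.5
```

Swapping two words changes only the pairs involving both of them, so the measure degrades gracefully with small reorderings.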

E. Word Distance Similarity

The lists of distances between word pairs in Sa and Sb are as follows:

L(Sa) = {(wa1, wa2, d(a1,a2)), (wa1, wa3, d(a1,a3)), ..., (wa(i-1), wai, d(a(i-1),ai))}
L(Sb) = {(wb1, wb2, d(b1,b2)), (wb1, wb3, d(b1,b3)), ..., (wb(j-1), wbj, d(b(j-1),bj))}

Let w(a,b) = {(wax, way) | 1 ≤ x ≤ y ≤ i} ∩ {(wbp, wbq) | 1 ≤ p ≤ q ≤ j}; then the similarity between Sa and Sb can be calculated as follows:

WordDist(Sa, Sb) = Σ(w(i,j) ∈ w(a,b)) d(ai,aj)·d(bi,bj) / ( √(Σ(ax,ay) d²(Sa)) · √(Σ(bp,bq) d²(Sb)) )   (5)
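A hedged sketch of Eq. (5) as reconstructed above: the positional distances of word pairs shared by both sentences are compared with a cosine-style normalisation. The exact normalisation in [1] may differ from this reading, and sentences without repeated words are assumed.

```python
from itertools import combinations
from math import sqrt

# Word distance similarity (Eq. 5, as reconstructed here): compare
# positional distances d(wx, wy) of word pairs present in both
# sentences, normalised by the distance-vector norms of each sentence.

def pair_distances(s: str) -> dict:
    words = s.split()
    # Maps each ordered word pair to its positional distance j - i.
    return {(wx, wy): j - i
            for (i, wx), (j, wy) in combinations(enumerate(words), 2)}

def word_dist(sa: str, sb: str) -> float:
    da, db = pair_distances(sa), pair_distances(sb)
    shared = set(da) & set(db)                      # w(a, b)
    dot = sum(da[p] * db[p] for p in shared)
    na = sqrt(sum(d * d for d in da.values()))      # √Σ d²(Sa)
    nb = sqrt(sum(d * d for d in db.values()))      # √Σ d²(Sb)
    return dot / (na * nb)

print(word_dist("the cat sat", "the cat sat"))  # identical sentences → 1.0
```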

V. SEMANTIC MEASURES

The Universal Networking Language (UNL) gives the semantic representation of sentences in a graphical form. By comparing the similarity of these graphs, we inherently compare only the semantic content of the two sentences, rather than the similarities in their syntax. Thus, the UNL graph matching strategy is a natural choice for Semantic Textual Similarity.

A. UNL Expression

UNL expresses information and knowledge in the form of a semantic network. The semantic network of UNL is a directed graph. Its nodes are UWs (universal words) or hyper-nodes (commonly called "scopes") representing concepts, and its edges are relations between concepts. Concepts can be annotated with attributes. Such a semantic network is called a "UNL Expression" or "UNL Graph". The general description format of a UNL expression is:

relation:scope-id( uw1, uw2 )

where a relation label is given in relation and the ID of a scope is described in scope-id. All binary relations that constitute a scope must be given the same ID, and the scope-id can be omitted. In each of uw1 and uw2 a UW is described, and a UW can be followed by an attribute list. A UNL expression is a list of these binary relations. For example, the UNL expression of the English phrase "science and art" is:

and:01( science(icl>art), art(icl>abstract thing) )

B. UNL En-conversion Process

The process of converting natural language text into UNL is called the UNL en-conversion process. The en-conversion process analyses the natural language text morphologically, syntactically and semantically. It works based on a word dictionary and a set of en-conversion rules (the grammar rules of en-conversion), and it analyses sentences according to these rules. It can deal with various natural languages by using the respective word dictionaries and rule sets. The en-converter works in the following way. A sentence is scanned from left to right. When an input string is scanned, all matched morphemes from the beginning (left) of the string are retrieved from the word dictionary and become the candidate morphemes. These candidate morphemes are sorted according to priority. Word selection is done by applying the grammar rules of en-conversion to the candidate morphemes. Syntactic and semantic analysis is carried out by applying the rules to the already selected words, building up a syntactic tree and a semantic network for the input sentence. This process continues until all words of the sentence have been input and a complete semantic network of the input sentence has been made. The output of this whole process is a semantic network expressed in the UNL format.

Fig. 2. UNL graph Example
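To make the binary-relation format concrete, here is a hypothetical parser for lines such as the "science and art" example; the regular expression and the returned field names are assumptions of this sketch, not part of the UNL specification, and nested commas inside a UW would require a real parser.

```python
import re

# Hypothetical parser for UNL binary relations of the form
# relation:scope-id( uw1, uw2 ), where :scope-id is optional.
REL = re.compile(r"^(\w+):?(\w*)\(\s*(.+?)\s*,\s*(.+?)\s*\)$")

def parse_relation(line: str) -> dict:
    m = REL.match(line.strip())
    if not m:
        raise ValueError(f"not a UNL binary relation: {line!r}")
    rel, scope, uw1, uw2 = m.groups()
    return {"relation": rel, "scope": scope or None, "uw1": uw1, "uw2": uw2}

print(parse_relation("and:01( science(icl>art), art(icl>abstract thing) )"))
# {'relation': 'and', 'scope': '01', 'uw1': 'science(icl>art)',
#  'uw2': 'art(icl>abstract thing)'}
```

Parsing every line of a UNL expression this way yields the list of binary relations that constitutes the UNL graph.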

Simply, the en-conversion process includes:

• Convert or load the rules
• Input the sentence
• Apply the rules and retrieve the word dictionary
• Output the UNL expressions

C. UNL Graph Based Similarity

Semantic similarity can be calculated by comparing the UNL graphs of the input sentences. Let unl1 represent the UNL graph of one sentence and unl2 represent the UNL graph of the other sentence; then


SimilarityScore(unl1, unl2) = 2·P·R / (P + R)   (6)

where P and R are:

P = Σ(rel ∈ unl1) relnscore(rel) / count(rel ∈ unl1)   (7)

R = Σ(rel ∈ unl2) relnscore(rel) / count(rel ∈ unl2)   (8)

where relnscore(rel) of a relation rel is calculated as:

relnscore(rel) = avg(relmtch, uw1score, uw2score)   (9)

where relmtch is 1 if the relation matches and 0 otherwise, and the uwscore of a universal word is:

uwscore = avg(wordscore, attributescore)   (10)

where the word score is:

wordscore = 1 if the universal word matches, 0 otherwise   (11)

and the attribute score is:

attributescore = F1score(attr1, attr2)   (12)

The matching scheme is based on the idea of the F1 score. UNL graphs are lists of UNL relations. Each relation match is given a score, which is used in the calculation of the precision (P) and recall (R). From the precision and recall, the F1 score can be easily calculated, and it becomes the total matching score of the two graphs. The relation score is obtained by averaging the score of the relation match and the scores of the two universal word matches. The universal word match score has a component for the attributes that match between the corresponding universal words; this attribute matching is again an F1 score calculation, similar to the relation matching. Matching the attributes of the universal words contributes to the score of the matched universal word, which in turn contributes to the score of the matched relation. Thus, matching of the semantic relations has more weight than matching of the attributes.

VI. OVERALL SIMILARITY

After calculating the statistical similarity and the semantic similarity, the overall similarity of two sentences is found as follows:

Simi(sa, sb) = p·Simistat(sa, sb) + q·Simisem(sa, sb)   (13)

where p and q are the coefficients which denote the contribution of each part to the overall similarity, with p + q = 1 and p, q ∈ (0, 1).

VII. EXPERIMENTS AND EVALUATION

The dataset used in this work is from the agricultural domain. There are 80 sentence pairs in the dataset, taken from the parallel corpus developed by the research team at IIT Bombay. The parallel corpus contains the UNL expressions of the sentences. The primary source of the parallel corpus is aAqua, an agricultural forum for farmers. Every sentence pair has an associated binary classification, judged by a human, which says whether the pair is a paraphrase or not. This work used 50 sentence pairs for training and 30 sentence pairs for testing. The training set contained 30 positive examples and 20 negative examples; the testing set contained 20 positive examples and 10 negative examples. For example, consider the sentence pairs S1 and S2 mentioned in the beginning. The similarity scores obtained for each method are shown in Table I. The UNL graphs obtained for the two sentences are the same and are given in Fig. 3.

TABLE I
SIMILARITY MEASURES FOR EACH METHOD

Measure            Value
Jaccard            0.67
Dice               0.8
Cosine             0.8
Word Order         0.43
Word Distance      0.4
UNL graph based    1
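The graph matching of Eqs. (6)-(8) and the combination of Eq. (13) can be sketched as follows. This is a simplified reading: each UNL graph is modelled as a set of (relation, uw1, uw2) triples, relation scoring is reduced to exact-triple matching, and the attribute matching of Eqs. (10)-(12) is omitted for brevity; the example triples are illustrative, not from the paper's corpus.

```python
# Simplified UNL graph similarity (Eqs. 6-8) and overall score (Eq. 13).

def reln_score(triple, other_graph):
    # Eq. (9) reduced: 1 if an identical triple exists in the other graph.
    return 1.0 if triple in other_graph else 0.0

def unl_similarity(unl1: set, unl2: set) -> float:
    prec = sum(reln_score(t, unl2) for t in unl1) / len(unl1)  # Eq. (7)
    rec = sum(reln_score(t, unl1) for t in unl2) / len(unl2)   # Eq. (8)
    if prec + rec == 0:
        return 0.0
    return 2 * prec * rec / (prec + rec)                       # Eq. (6)

def overall(sim_stat: float, sim_sem: float, p: float = 0.5) -> float:
    # Eq. (13): weighted combination with p + q = 1.
    return p * sim_stat + (1 - p) * sim_sem

g1 = {("agt", "eat", "cow"), ("obj", "eat", "grass")}
g2 = {("agt", "eat", "cow"), ("obj", "eat", "hay")}
print(unl_similarity(g1, g2))   # P = R = 0.5, so F1 = 0.5
```

Because P and R are symmetric averages over the two relation lists, the score is 1 exactly when the graphs contain the same relations, which matches the UNL score of 1 reported in Table I for the identical graphs of S1 and S2.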

Fig. 3. UNL graph for sentences S1 and S2

A confusion matrix was developed for evaluating the performance of the system, and it is given in Table II.

TABLE II
CONFUSION MATRIX

                      Obtained Similar    Obtained Not Similar
Actual Similar        46                  4
Actual Not Similar    5                   25

Out of the 50 similar sentence pairs in the dataset, the system classified 46 as similar and 4 as dissimilar; out of the 30 dissimilar sentence pairs, the system classified 25 as dissimilar and 5 as similar. The performance analysis of the overall similarity is shown in Table III.

TABLE III
PERFORMANCE ANALYSIS OF OVERALL SIMILARITY

Measure      Value
Accuracy     0.9
Precision    0.9
Recall       0.92
F Measure    0.91

The measures accuracy, precision, recall and F measure are used to evaluate the performance of the system, taking the overall similarity as input. Even though there is variation in the similarity values of the individual methods, the overall similarity score provides a better result, since all the statistical and semantic aspects are considered.

VIII. ISSUES THAT AFFECTED PERFORMANCE

Firstly, spelling mistakes in the sentences will produce a wrong result even if the sentences are similar. Another issue is that this work considered only simple and compound sentences; paraphrase matching of complex sentences could also be considered to improve the performance of the system. Further, issues in the UNL matching system affect the semantic similarity performance. A better result could be obtained by considering synonyms of the universal words while matching the UNL relations. The UNL generation system needs to be more robust to improve its performance: since its accuracy is low for grammatically incorrect sentences, the UNL graph matching system does not work for them, and in such cases fewer UNL relations, or incorrect ones, may be generated. Finally, the statistical similarity and semantic similarity scores are calculated in a single shot in this work, which affects the computing time of the process.

IX. CONCLUSION

Paraphrase identification is important for text classification and retrieval. It is closely related to both word similarity and document similarity. The statistical similarity between sentences is calculated based on symbolic and structural characteristics, while the semantic similarity is carried out with reference to the grammatical deep structure. This work used five methods to find the statistical similarity of Malayalam sentences. Semantic similarity is also important for paraphrase identification; here, a UNL graph based matching score is used to find it. UNL is a conceptual language which describes the meaning of sentences in terms of semantic nets. Finally, the overall similarity score is calculated from the statistical similarity scores and the semantic similarity score. The work can be extended to complex sentences as well.

REFERENCES

[1] Zhang, Sun, Wang and He, "Calculating Statistical Similarity between Sentences", Journal of Convergence of Information Technology, February 2011.
[2] T. Dhanabalan, K. Saravanan, T. V. Geetha, "Tamil to UNL EnConverter".
[3] Lin Li, Xia Hu, Bi-Yun Hu, Jun Wang, Yi-Ming Zhou, "Measuring Sentence Similarity from Different Aspects", Proceedings of the Eighth International Conference on Machine Learning and Cybernetics, Baoding, 12-15 July 2009.
[4] Palakorn, Xiaohua, Shen Xiajiong, "The Evaluation of Sentence Similarity Measures", Proceedings of the 10th International Conference on Data Warehousing and Knowledge Discovery, pp. 305-316, 2008.
[5] Li, Bandar, McLean, Shea, "A Method for Measuring Sentence Similarity", 17th International FLAIRS Conference, 2012.
[6] Mihalcea, Corley, Strapparava, "Corpus-based and Knowledge-based Measures of Text Semantic Similarity", American Association for Artificial Intelligence, 2006.
[7] Janardhan Singh, Arindam Bhattacharya, Pushpak Bhattacharyya, "Semantic Textual Similarity using Universal Networking Language Graph Matching", First Joint Conference on Lexical and Computational Semantics.
[8] Yuhua Li, David McLean, Zuhair A. Bandar, James D. O'Shea, Keeley Crockett, "Sentence Similarity Based on Semantic Nets and Corpus Statistics", IEEE Transactions on Knowledge and Data Engineering, Vol. 18, No. 8, August 2006.
[9] Xiaoying Liu, Yiming Zhou, "Sentence Similarity Based on Dynamic Time Warping", International Conference on Semantic Computing.
[10] Xiaoying Liu, Yiming Zhou, Ruoshi Zheng, "Measuring Semantic Similarity within Sentences", Proceedings of the Seventh International Conference on Machine Learning and Cybernetics, Kunming, 12-15 July 2008.
[11] Ming Che Lee, Jia Wei Zhang, Wen Xiang Lee, "Sentence Similarity Computation Based on POS and Semantic Nets", Fifth International Joint Conference on INC, IMS and IDC, 2009.
[12] Amitabha Mukerjee, Achla M. Raina, Kumar Kapil, Pankaj Goyal, Pushpraj Shukla, "Universal Networking Language - A Tool for Language Independent Semantics", International Conference on the Convergence of Knowledge, Culture, Language and Information Technologies, 2003.
[13] Pushpak Bhattacharyya, "Multilingual Information Processing Through Universal Networking Language".
[14] H. Uchida, M. Zhu, "The Universal Networking Language (UNL) Specifications", Version 3.0, Technical Report, United Nations University, Tokyo, 1998.
[15] I. Boguslavsky, N. Frid, L. Iomdin, "Creating a Universal Networking Module within an Advanced NLP System", Proceedings of the 18th International Conference on Computational Linguistics, pp. 83-89, 2000.
[16] (2013) "Universal Networking Language" [Online]. Available: http://www.undl.org
[17] (2013) "UNL" [Online]. Available: http://language.worldofcomputing.net/unl/universal-networking-language-unl.html
[18] (2013) "UNL enconversion" [Online]. Available: http://www.cfilt.iitb.ac.in/UNL enco
