Paraphrase Detection for Tamil Language using Deep Learning Algorithm

1S. Mahalakshmi, 2Dr. M. Anand Kumar and 3Dr. K. P. Soman

1PG Student, 2Assistant Professor, 3Head of the Department
Centre of Excellence in Computational Engineering and Networking, Amrita Vishwa Vidyapeetham, Coimbatore, India.
[email protected], [email protected], [email protected].

Abstract
In this paper, we perform paraphrase detection for Tamil sentences. Paraphrase detection is the process of detecting sentences or paragraphs that have been restated, with reordered words or a different choice of words, while preserving their meaning. The real challenge in paraphrase detection is that the semantics of the sentence must be preserved while the sentence is restated. Here, we use the unfolding recursive auto encoder (RAE), a deep learning algorithm, for learning feature vectors for phrases in syntactic trees in an unsupervised way. The experimental results give an accuracy of around 65.17% for Tamil sentences.

Keywords: Deep Learning Architecture, Recursive Auto Encoders (RAEs), Paraphrase Detection.

1. Introduction
Paraphrasing is the process of restating sentences or paragraphs using different sentence structures or a different choice of words, while keeping a similar meaning. Paraphrase detection is one of the difficult tasks where deep semantic understanding of the context is required in order to achieve high-quality results. Paraphrase detection is a central application of computational semantics, which is the process of constructing meaning representations of natural language expressions. Paraphrasing finds application in various NLP tasks such as information extraction, information retrieval, question answering and summarization [20][22]. Paraphrase identification is different from paraphrase generation: the former is the task of identifying whether the given sentences or paragraphs stand in a relation of paraphrasing, whereas the latter is the task of generating a paraphrased sentence for a given original sentence. For a native speaker of a language, it is always easy to paraphrase a given sentence, and it is quite natural to determine whether a given pair of sentences is a paraphrase or not. But an algorithmic formulation of paraphrase detection that can be fed to a computer for classification is quite challenging [1][21]. We discuss paraphrasing further through the following examples. The following pairs of Tamil sentences can be considered paraphrases, and their corresponding English meanings are given in brackets. In each pair, S1 and S2 represent a pair of Tamil paraphrase sentences.

S1: பாகிஸ்தானின் ப஬ங்க஭வாத நடவடிக் ககக஬ ஆதாிக்காதீர்கள் சீனாவுக்கு ஫த்தி஬ அ஭சு ககாாிக்கக. (The central government demanded that China not support Pakistan's terrorist activities.)
S2: பாகிஸ்தானின் ப஬ங்க஭வாத நடவடிக் ககக஬ ஆதாிக்க கவண்டாம் என சீனா கவ ஫த்தி஬ அ஭சு ககட்டுக்ககாண்டுள்ளது. (China was requested by the central government not to support Pakistan's terrorist activities.)

S1: பழநி அரு கக சிறுத் கத நட஫ாட்டம் உள்ளதால் , கி஭ா஫஫க்கள் பீதி஬ில் உள்ளனர் . (The wandering of a leopard near the Palani area has created fear among the village people.)
S2: பழநி பாலாறு அகைப்பகுதி஬ில் சிறுத் கத நட஫ாட்டம் உள்ளதால் அப்பகுதி கி஭ா஫஫க்கள் பீதி஬ில் உள்ளனர். (Due to the wandering of a leopard in the surroundings of the Palani Palar dam, the villagers are frightened.)

2. Literature Review
For the Tamil language, NLP tools [26] such as a morphological analyzer [29] and word sense disambiguation [25] have been developed using machine learning approaches, but this work on paraphrase detection is the first attempt for Tamil. Earlier work on the unsupervised construction of paraphrase corpora exploited massively parallel news sources, using two techniques: 1) a simple string edit distance, and 2) a heuristic strategy that pairs the initial sentences obtained from different news stories in the same cluster [2]. Fernando's semantic similarity approach to paraphrase detection uses semantic similarity metrics to find the similarity of two text segments; a key difference is that all word-to-word similarities are taken into account, not just the maximal similarities between the sentences [3]. Socher et al. proposed paraphrase detection using recursive auto encoders (RAEs) and dynamic pooling, which gives very good results using deep learning concepts for the English language. They used the Stanford Parser to parse the sentences and fed the output into a similarity matrix. Using the dynamic pooling concept they convert the obtained variable-size matrix into a fixed-size matrix, which is then given to a softmax classifier to detect whether the sentences are paraphrased or not [4]. Paraphrase identification technologies depend on the application. They are applicable in fields such as summarization, where paraphrase detection eliminates redundant sentences, and in question answering, where they identify sentence pairs with related content. Such practical applications are possible because of the availability of large corpora and the development of robust paraphrase models on the scale of the best SMT models [5]. Another related work takes a text canonicalization approach to the paraphrase identification task. This approach handles both the lexical and the grammatical level by applying transformation rules, and it has shown performance comparable to current state-of-the-art systems on the MSR Paraphrase Corpus [11][19]. The method provides a significant increase in the recall of paraphrases compared with a system using non-canonicalized text. However, future research is required to determine how many transformation rules are needed for the task. In the future, more work also has to be done to enhance the system with lexical semantic knowledge, either from manually constructed lexical databases like WordNet or from resources automatically learned from corpora like VerbOcean [6]. Paraphrase identification (PI) is the task of classifying whether two sentences have similar meaning. PI is an important research area with applications in higher-level NLP tasks such as information extraction, machine translation, information retrieval, automatic identification of copyright infringement, and question answering systems. One study proposed a novel approach to paraphrase identification using semantic heuristic features, improving accuracy over existing PI systems, and finally provided a comprehensive analysis of the proposed approach and the experiments [7][23].

3. Methodology
Here, we first discuss the proposed paraphrase model given in Fig. 1, followed by the flow diagram given in Fig. 2. In this task, we use a shallow parser to parse the Tamil sentences [12]. The challenge in this work is to detect paraphrases in Tamil sentences. To solve this problem, we apply machine learning methods (neural networks and recursive auto encoders) to identify whether the sentences are paraphrases or not. Since paraphrase detection in the Tamil language has not yet been addressed, comparative benchmark results and adequate corpora are missing. The Tamil sentences were acquired in an unsupervised manner by obtaining hourly news flashes and corresponding news stories from leading Indian Tamil news sites. The datasets were hand-tagged by linguists in order to obtain a comparable baseline data set. The size of the corpus is 2500 pairs of sentences in the Tamil language, with 65.17% of the pairs being in a paraphrase relationship.
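To make the data layout concrete, the following is a minimal loading sketch in Python; the file name tamil_pairs.tsv and the tab-separated one-pair-per-line layout are illustrative assumptions, since the paper does not specify a storage format.

```python
# Hypothetical corpus loader: "tamil_pairs.tsv" is an assumed file with one
# pair per line: sentence1 <TAB> sentence2 <TAB> label (1 = paraphrase, 0 = not).
def load_pairs(path):
    pairs = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            s1, s2, label = line.rstrip("\n").split("\t")
            pairs.append((s1, s2, int(label)))
    return pairs

pairs = load_pairs("tamil_pairs.tsv")   # e.g. 2500 pairs: 2000 train, 500 test
```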

Fig. 1: An overview of the paraphrase model (input Tamil sentences → variable-sized similarity matrix → dynamic pooling layer → fixed-sized matrix → softmax classifier → decision whether the sentences are paraphrases or not).

In this method, the lengths of the input Tamil sentences are calculated and the comparisons are stored in a similarity matrix of variable size. Initially, the recursive auto encoder (RAE) computes phrase vectors for each node in the given parse tree. Then, it computes the Euclidean distances between the word and phrase vectors for the given pair of Tamil sentences. When the similarity matrix is built, its rows and columns are filled following the word order of the original sentences. Paraphrased sentences often have low Euclidean distance values close to the diagonal of the similarity matrix; this occurs when similar words are ordered similarly in the two given Tamil sentences. However, since the matrix dimensions vary with the sentence lengths, we cannot directly feed the obtained similarity matrix into a standard neural network classifier [4]. Here, we use a dynamic pooling approach that converts the variable-size matrix into a fixed-size matrix. The obtained fixed-size matrix is given to the softmax classifier, which identifies whether the given Tamil sentences are paraphrases or not. The general approach is as follows (a minimal sketch of the similarity matrix and pooling step is given after the list), and the flow of paraphrase detection is described in Fig. 2.

• Given an input pair (S1, S2) of sentences in the Tamil language.
• Parse them using the shallow parser to obtain two trees (t1, t2).
• Modify the leaves of the parse trees to contain Tamil language model word embeddings.
• Use the unfolding recursive auto encoder (RAE) to obtain good representations at every inner node of the parse trees.
• Find the minimal tree matching between the two parse trees to decide whether the pair of sentences is a paraphrase or not.
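As referenced above, the following is a minimal sketch of the similarity matrix and dynamic pooling steps. It is not the authors' implementation; the pooling size n_p = 15 and the use of min-pooling follow Socher et al. [4] and are assumptions here.

```python
import numpy as np

def similarity_matrix(vecs1, vecs2):
    # Pairwise Euclidean distances between all word/phrase vectors of the
    # two sentences; the matrix size varies with the sentence lengths.
    return np.linalg.norm(vecs1[:, None, :] - vecs2[None, :, :], axis=-1)

def dynamic_min_pool(S, n_p=15):
    # Map a variable-sized matrix S onto a fixed n_p x n_p grid by splitting
    # rows and columns into roughly equal regions and keeping the minimum
    # (smallest distance) within each region.
    if S.shape[0] < n_p:                                   # duplicate rows/columns
        S = np.repeat(S, -(-n_p // S.shape[0]), axis=0)    # so no region is empty
    if S.shape[1] < n_p:
        S = np.repeat(S, -(-n_p // S.shape[1]), axis=1)
    rows = np.array_split(np.arange(S.shape[0]), n_p)
    cols = np.array_split(np.arange(S.shape[1]), n_p)
    pooled = np.empty((n_p, n_p))
    for i, r in enumerate(rows):
        for j, c in enumerate(cols):
            pooled[i, j] = S[np.ix_(r, c)].min()
    return pooled
```

For example, for a sentence pair with 8 and 12 nodes, `dynamic_min_pool(similarity_matrix(v1, v2))` yields a fixed 15×15 input for the softmax classifier regardless of the original sentence lengths.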

Fig. 2: The flow of paraphrase detection in the Tamil language (input pair → shallow parser → embed the leaves using the dictionary → encode using unfolding RAEs → extract minimal match → pass through classifier → yes or no).

The fixed parse tree learns semantic information of phrase vectors without any labelled corpus. In addition to the paraphrase corpus, we have also used a Tamil monolingual corpus to capture a broad range of semantic information in an unsupervised way.

4. Learning Methods
In this section, we discuss neural network concepts, followed by the background of recursive auto encoders (RAEs), and finally the RAE model used here, the unfolding recursive auto encoder. In Natural Language Processing (NLP), these techniques are applied over a natural language corpus. The data in the given input Tamil sentences have to be labelled for training, i.e., pairs of Tamil sentences are marked manually. Additional labels may include semantic information, which is not naturally present in raw natural language. With this data and the possible additional labels for the given input Tamil sentences, learning methods are used to explore recurring patterns [17].

4.1 Neural Network
Neural networks mimic the behaviour of neurons in the human brain. Particular cases are the input neurons, which accept input from an input instance of fixed format, and the output neurons, which produce the result of the whole network [8][15]. The neural language model came into existence with the work of Bengio [17]. In this model, words are embedded into an n-dimensional vector space. The model was mainly used to predict the occurrence of a word in a specified context. Collobert and Weston [28] came up with a new neural network model in which the network is optimized through gradient-based training. The word embedding matrix is

E ∈ R^{n×|V|}, where |V| is the vocabulary size of the corpus. The distributional syntactic and semantic information of the word vectors is captured through co-occurrence statistics. After the matrix is learned on an unlabelled corpus, the vector of every word is used for further processing, and for every sentence we have an ordered list of vectors X = {x1, x2, ..., xn}. Auto encoders fare better when compared to binary number representations such as the Recursive Auto Associative Memory (RAAM) model by Pollack [9][10]. The recurrent neural network model is another example of an auto encoder model related to binary number representations.

4.2 Recursive Auto Encoders (RAEs)
The auto encoder is one of the most widely used techniques for dimensionality reduction. We have reviewed the Recursive Auto Associative Memory (RAAM) model [9][10]. A powerful way to represent data is to find a lower-dimensional non-linear manifold on which the data concentrate, and to project the data onto it. Auto encoders consist of a non-linear encoding function which transforms the input data into features or representations. This feature mapping is easy to compute and is given in equation (1):

y = f(Wx + b)   (1)

where x is the input data vector, b the bias, f the activation function, W the weight matrix and y the output representation. The learned representations are assessed by their ability to reproduce the input data using a decoding function, given in equation (2):

x' = g(W'y + b')   (2)

where x' is the reconstruction of the input and W', b' are the decoding weight matrix and bias. Auto encoders are unsupervised learning implementations under the framework of clustering, with the additional advantage that they label the clusters efficiently. Also, from a theoretical perspective, auto encoders try to learn representations from the input data.
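As a concrete illustration of equations (1) and (2), a minimal single-layer auto encoder sketch follows; the dimensions, the tanh activation and the untied decoding weights are assumptions, not choices specified by the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_hidden = 100, 50                              # assumed dimensions

W  = rng.normal(scale=0.01, size=(n_hidden, n_in))    # encoding weights
b  = np.zeros(n_hidden)
Wd = rng.normal(scale=0.01, size=(n_in, n_hidden))    # decoding weights
bd = np.zeros(n_in)

def encode(x):
    return np.tanh(W @ x + b)                  # equation (1): y = f(Wx + b)

def decode(y):
    return np.tanh(Wd @ y + bd)                # equation (2): x' = g(W'y + b')

def reconstruction_error(x):
    # Squared Euclidean distance between the input and its reconstruction.
    return np.sum((x - decode(encode(x))) ** 2)
```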

One of the efficient ways to train an auto encoder is to learn the mapping under some basic regularization so that it generalizes well to data coming from the probability distribution of the training data. The main difference between the standard RAE and the unfolding recursive auto encoder is that the standard RAE tries to reconstruct only the direct children of a node, whereas the unfolding RAE reconstructs all the leaf nodes under each node [18]. In this model, we used unfolding RAEs to learn Tamil sentences.


4.3 Unfolding Recursive Auto Encoders (RAEs)
Figure 3 shows the unfolding recursive auto encoder with details of the reconstruction at node y1; the unfolding RAE tries to reconstruct all the leaf nodes under each node. The algorithm of the unfolding recursive auto encoder is as follows (see Figure 3). Assume we are given a list of word vectors X = {x1, x2, ..., xn}. The binary parse tree has the structure of branching triplets of a parent with its children: parent p → (c1 (child1), c2 (child2)). Unfolding a node decodes it back into its two children:

[x1'; y1'] = f(Wd y2 + bd)   (3)

[x2'; x3'] = f(Wd y1 + bd)   (4)

where [x1; y1] and [x2; x3] are the concatenations of the two children, f is an element-wise activation function, We is the encoding matrix and Wd is the decoding matrix. Here x1, x2 and x3 are the word nodes of the sentences, We holds the encoding weights and Wd the decoding weights. In general, we make use of the decoding matrix Wd to unfold every node in the parse tree. The reconstruction error is the difference between the leaf nodes under a node, [x1; x2; x3], and their reconstructed counterparts [x1'; x2'; x3']. During training, the objective is to minimize the reconstruction error over all the given Tamil input sentences. Equation (5) gives the reconstruction error, calculated using the Euclidean distance:

E_rec(y_(i,j)) = || [x_i; ...; x_j] − [x_i'; ...; x_j'] ||²   (5)

Thus, we use the algorithm of the unfolding recursive auto encoder in the proposed model; the unfolding RAE performs unsupervised feature learning from unlabelled parse trees [4].

6. Experimental Results
This is an initial attempt at paraphrase detection for the Tamil language, as no existing work has been done for it to date; therefore, no standard dataset is available. So, we collected 2500 pairs of Tamil sentences. In the training data, we include the label "1" if the pair of Tamil sentences is a paraphrase, otherwise "0". We used 2000 pairs of Tamil sentences and their respective labels for training. For testing, we used 1000 sentences, i.e., 500 pairs of Tamil sentences. Figure 4 shows the input Tamil sentences, hand-tagged with "1" if the pair of sentences is a paraphrase and "0" otherwise.

Fig. 4: Input Tamil sentences with their paraphrase labels.

The testing labels have dimension 1×500, extracted from the 500 pairs of testing Tamil sentences; the training labels have dimension 1×2000. The two label vectors are:
dataFullTrain - 1×2000
dataTest.labels - 1×500
Table 1 gives the statistics of the training and testing data sets. In Table 2, we report the reference and response results using precision, recall and F-measure. Precision measures the fraction of retrieved instances that are relevant, while recall measures the fraction of relevant instances that are retrieved. Here Tp denotes true positives, Fp false positives, Fn false negatives and Tn true negatives. The total, calculated using equation (6), is depicted in Table 2. In Table 3, we calculate the precision results using equation (7), obtaining around 67.17%. Similarly, we calculate the recall results using equation (8), obtaining around 66.17%. Finally, the overall results, with an accuracy of 65.17%, are given in Table 5.

Table 1: Statistics of the training and testing data

Category          Training   Testing
Paraphrase        1014       332
Not Paraphrase    986        168
Total             2000       500

Table 2: Confusion matrix

                   Response True    Response False   Reference Totals
Reference True     Tp = 223         Fn = 109         Tp + Fn = 332
Reference False    Fp = 109         Tn = 168         Fp + Tn = 277
Response Totals    Tp + Fp = 332    Fn + Tn = 277    Total = 64.22%

The total from Table 2 is calculated as follows:

Total = (Tp + Tn) / (Tp + Fp + Tn + Fn)   (6)

Table 3: Precision table

                   Response True   Response False
Reference True     223             109
Reference False    109             168

The precision is calculated using equation (7):

Precision = Tp / (Tp + Fp)   (7)

Table 4: Recall table

                   Response True   Response False
Reference True     223             109
Reference False    109             168

The recall is calculated using equation (8):

Recall = Tp / (Tp + Fn)   (8)

Table 5: Accuracy results

Category                   Accuracy
Paraphrase sentences       65.17%
Non-Paraphrase sentences   34.83%
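For reference, the computations in equations (6)-(8) can be reproduced directly from the Table 2 counts:

```python
Tp, Fp, Fn, Tn = 223, 109, 109, 168            # values from Table 2

total     = (Tp + Tn) / (Tp + Fp + Tn + Fn)    # equation (6) -> ~0.642
precision = Tp / (Tp + Fp)                     # equation (7) -> ~0.6717
recall    = Tp / (Tp + Fn)                     # equation (8) -> ~0.6717
f_measure = 2 * precision * recall / (precision + recall)

print(f"total={total:.4f} precision={precision:.4f} "
      f"recall={recall:.4f} f1={f_measure:.4f}")
```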

7. Conclusion and Future Work
We aimed at learning features from the given input Tamil sentences, and the results show that unfolding recursive auto encoders (RAEs) are easy to train and implement. Our main objective was paraphrase detection for Tamil sentences; given that in recent years research on the English language has gained a lot of interest and focus in the task of paraphrase detection, we tried to obtain good results for Tamil as well. We developed a system that classifies whether an input pair of Tamil sentences is in a paraphrase relationship using unfolding recursive auto encoders. The corpus was acquired in an unsupervised manner from leading Tamil news sites. As future research, we can also apply these algorithms to higher-level NLP applications such as text classification, sentiment analysis, big data applications, plagiarism detection, and textual entailment.

References

[1] Stanovsky, G., 2012, "A Study in Hebrew Paraphrase Identification", Doctoral dissertation, Ben-Gurion University of the Negev.
[2] Dolan, B., Quirk, C., and Brockett, C., 2004, "Unsupervised construction of large paraphrase corpora: Exploiting massively parallel news sources", In Proceedings of the 20th International Conference on Computational Linguistics (p. 350). Association for Computational Linguistics.
[3] Fernando, S., and Stevenson, M., 2008, "A semantic similarity approach to paraphrase detection", In Proceedings of the 11th Annual Research Colloquium of the UK Special Interest Group for Computational Linguistics (pp. 45-52).
[4] Socher, R., Huang, E. H., Pennin, J., Manning, C. D., and Ng, A. Y., 2011, "Dynamic pooling and unfolding recursive auto encoders for paraphrase detection", In Advances in Neural Information Processing Systems (pp. 801-809).
[5] Brockett, C., and Dolan, W. B., 2005, "Support vector machines for paraphrase identification and corpus construction", In Proceedings of the 3rd International Workshop on Paraphrasing (pp. 1-8).
[6] Zhang, Y., and Patrick, J., 2005, "Paraphrase identification by text canonicalization", In Proceedings of the Australasian Language Technology Workshop (Vol. 2005, pp. 160-166).
[7] Qayyum, Z. U., and Altaf, W., 2012, "Paraphrase Identification using Semantic Heuristic Features", Research Journal of Applied Sciences, Engineering and Technology, 4(22): 4894-4904.
[8] Rojas, R., 1996, "Neural Networks: A Systematic Introduction", Springer, 1st edition, Chapter 1: The Biological Paradigm.
[9] Voegtlin, T., and Dominey, P. F., 2005, "Linear recursive distributed representations", Neural Networks, 18(7), 878-895.
[10] Pollack, J. B., 1990, "Recursive distributed representations", Artificial Intelligence, 46(1), 77-105.
[11] Microsoft Research Paraphrase Corpus, accessed September 2014, http://research.microsoft.com/en-us/.
[12] Shallow parser, accessed October 2014, http://ltrc.iiit.ac.in.
[13] Blacoe, W., and Lapata, M., 2012, "A comparison of vector-based representations for semantic composition", In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (pp. 546-556). Association for Computational Linguistics.
[14] Madnani, N., Tetreault, J., and Chodorow, M., 2012, "Re-examining machine translation metrics for paraphrase identification", In Proceedings of the 2012 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (pp. 182-190). Association for Computational Linguistics.
[15] Goller, C., and Kuchler, A., 1996, "Learning task-dependent distributed representations by backpropagation through structure", In Neural Networks, 1996, IEEE International Conference on (Vol. 1, pp. 347-352). IEEE.
[16] Das, D., and Smith, N. A., 2009, "Paraphrase identification as probabilistic quasi-synchronous recognition", In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP: Volume 1 (pp. 468-476). Association for Computational Linguistics.
[17] Bengio, Y., Ducharme, R., Vincent, P., and Janvin, C., 2003, "A neural probabilistic language model", The Journal of Machine Learning Research, 3, 1137-1155.
[18] Socher, R., Pennington, J., Huang, E. H., Ng, A. Y., and Manning, C. D., 2011, "Semi-supervised recursive auto encoders for predicting sentiment distributions", In Proceedings of the Conference on Empirical Methods in Natural Language Processing (pp. 151-161). Association for Computational Linguistics.
[19] Xu, W., Ritter, A., Callison-Burch, C., Dolan, W. B., and Ji, Y., 2014, "Extracting Lexically Divergent Paraphrases from Twitter", Transactions of the Association for Computational Linguistics, 2, 435-448.
[20] Xu, W., 2014, "Data-driven approaches for paraphrasing across language variations", Doctoral dissertation, New York University.
[21] Xu, W., Ritter, A., and Grishman, R., 2013, "Gathering and generating paraphrases from Twitter with application to normalization", In Proceedings of the Sixth Workshop on Building and Using Comparable Corpora (pp. 121-128).
[22] Xu, W., Ritter, A., Dolan, B., Grishman, R., and Cherry, C., 2012, "Paraphrasing for Style", In COLING (pp. 2899-2914).
[23] Xu, W., Grishman, R., Meyers, A., and Ritter, A., 2013, "A Preliminary Study of Tweet Summarization using Information Extraction", NAACL 2013.
[24] Wu, W., Ju, Y. C., Li, X., and Wang, Y. Y., 2010, "Paraphrase detection on SMS messages in automobiles", In Acoustics, Speech and Signal Processing (ICASSP), 2010 IEEE International Conference on (pp. 5326-5329). IEEE.
[25] Anand Kumar, M., Rajendran, S., and Soman, K. P., 2014, "Tamil word sense disambiguation using support vector machines with rich features", International Journal of Applied Engineering Research, 9(20), pp. 7609-7620.
[26] Anand Kumar, M., Dhanalakshmi, V., Soman, K. P., and Rajendran, S., 2014, "Factored statistical machine translation system for English to Tamil language", Pertanika Journal of Social Science and Humanities, 22(4), pp. 1045-1061.
[27] Kozareva, Z., and Montoyo, A., 2006, "Paraphrase identification on the basis of supervised machine learning techniques", In Advances in Natural Language Processing (pp. 524-533). Springer Berlin Heidelberg.
[28] Collobert, R., and Weston, J., 2008, "A unified architecture for natural language processing: Deep neural networks with multitask learning", In Proceedings of the 25th International Conference on Machine Learning (pp. 160-167). ACM.
[29] Dhanalakshmi, V., Anand Kumar, M., Rekha, R. U., Arunkumar, C., Soman, K. P., and Rajendran, S., 2009, "Morphological analyzer for agglutinative languages using machine learning approaches", ARTCom 2009 - International Conference on Advances in Recent Technologies in Communication and Computing, art. no. 5329355, pp. 433-435.
