A Spell Correction Method for Query-Based Text Summarization Nazreena Rahman and Bhogeswar Borah Department of Computer Science and Engineering, Tezpur University, Sonitpur, Assam, India-784028
[email protected],
[email protected]
Abstract. Finding and correcting incorrect spellings in text documents plays an important role in information retrieval. Many approaches exist to tackle the spell-checking problem. Here, a spell checking and correcting method is proposed specifically for query-based text summarization, which produces a summary based on a given query. Our method mainly handles non-word errors: it refines the query by replacing each misspelled word with an exact dictionary match. We use the TAC 2009 dataset for our experiments and validation and obtain encouraging results.
Keywords: Information retrieval, spell checking, non-word errors, query-based text summarization.
1 Introduction
The rapid and continuous growth of text databases makes it difficult to retrieve the required information. Query-based text summarization can therefore be used to find a summarized answer according to the user's need. A text summary is a text constructed from single or multiple source texts whose length is at most half that of the original documents [1]. Text summarization can be performed on a single text document or on multiple documents, and it can be extraction based or abstraction based. In the extractive method, sentences are extracted from the original texts; abstractive summarization needs information integration, sentence compression and reformulation. A summary can also be generic or query-based: a generic summary is produced from the document alone, with no query required, while a query-based summary depends on the user's query. Based on the level of detail, a summary can be indicative or informative: an indicative summary indicates whether the reader should go through the text document, and an informative summary gives the most important information in the texts.
Query-based text summarization plays a vital role in information retrieval. It contributes greatly to natural language processing, especially in the areas of information extraction, question answering, text summarization and text analysis. Query-based summarization can be applied to complex question answering. A complex question is one whose answer can be obtained by integrating and interpreting knowledge from single or multiple text documents. Here, the complex question is answered by combining important information from each document in a cohesive and redundancy-free manner. The query-based text summarization process takes the query and one or more text documents as input and produces the summary as output. Figure 1 shows an overview of query-based text summarization.
Fig. 1. Query-Based Summarization Overview
Spelling correction in queries plays an important role in information retrieval. Spell checking can be considered a pre-processing step of query-based text summarization; it helps in producing a more relevant and user-focused query. Two types of errors can occur: non-word errors, which are not found in the dictionary, and real-word errors, which are present in the dictionary. Real-word errors can be typographical or cognitive. Typographical errors are introduced by mistake, while a cognitive error occurs when the spelling of the word is not known, as happens with homophones. An example of such an error is 'piece' versus 'peace': the first word means slice and the second means silence [2]. Spell checking and correcting helps to produce the summarized text document more accurately; in fact, correct words help in extracting more semantically similar sentences, yielding a more useful summary.
2 Related Work
The spelling task can be divided into error detection and error correction. A survey finds that 26% of web queries contain spelling errors (https://web.stanford.edu/class/cs124/lec/spelling.pdf). For efficient retrieval, a revised n-gram based technique has been put forward in which n-gram statistics and lexical resources are used [3]; the authors generate a ranked list of correction candidates to derive the most suitable candidate word for the incorrect word. Garaas et al. [4] suggest a personalized error correction system using neural networks: they train a simple feed-forward neural network so that if the same error happens again it can be detected and the proper word supplied. Islam et al. [5] use the Google Web 1T 3-grams dataset for detecting and correcting incorrect real-words. For string similarity, they use the longest common subsequence string matching algorithm with different normalizations and modifications. Their method tries to improve the detection and correction recall of incorrect words, where detection recall is the fraction of errors detected correctly and correction recall is the fraction of errors corrected correctly. Duan et al. [6] propose a model to find incorrect words in online spelling correction. They provide spell-corrected completion suggestions as the query words are being typed, training a Markov n-gram transformation model to analyze the user's spelling behavior, and they study different techniques to enhance the efficiency of this transformation model. Finally, for searching correct words, they use the A* informed search algorithm, with different pruning and thresholding methods applied to improve its results. A dictionary based approach has been proposed by Amorim et al. [7] using an unsupervised method that integrates anomalous pattern initialization and partitioning around medoids (PAM) clustering; their results show an 88.42% success rate on challenging datasets. Sharma et al. [8] propose a system to correct commonly confused words when they are found to be contextually wrong. To identify and correct real-word errors, one phase of the algorithm applies a trigram approach and the other applies a Bayesian technique; a set of commonly confused words and the Brown corpus are used. Though different spell checking and correcting methods exist for query completion in information retrieval, our spell checking and correcting method is designed specifically for query-based summarization: here, the input text documents themselves are used to correct misspelled words.
3 An Approach for Spell Checking and Correcting
In this paper, a method for correcting misspelled non-word errors is presented. The approach performs dictionary based spelling correction that also depends on the input text documents.

3.1 Overview of the Spell Correction Method for Query-Based Text Summarization
First, the method finds the incorrect words using a dictionary. It then searches for candidate words for these incorrect words. Candidate words are real-words that are present in the dictionary and whose spellings are quite similar to the incorrect words. The candidate words are then filtered by finding the highest n-gram character matches with the incorrect words. All candidate words are also used to find matching words in the input text documents. If the highest n-gram character matching word and the input text matching word together yield a single unique word, that unique word is taken as the correct word. If not, we take both the highest n-gram character matching words and the input text matching words and compute the scores of those words on the basis of the following similarity measures.
1. Minimum Edit Distance Score (MEDS): Edit distance measures the dissimilarity between two words as the least number of operations necessary to transform one word into the other. Here, the Levenshtein distance [9] is used. The Levenshtein distance between two strings X (length N) and Y (length M) is D(N, M), where:

Initialization:
  D(i, 0) = i
  D(0, j) = j

Recurrence relation, for each i = 1...N and each j = 1...M:
  D(i, j) = min( D(i-1, j) + 1,
                 D(i, j-1) + 1,
                 D(i-1, j-1) + 2 if X(i) ≠ Y(j), else D(i-1, j-1) + 0 )

Termination: D(N, M) is the minimum edit distance value.

The edit distance score is then:

  s1 = (no. of characters in the longest word) / (minimum edit distance value)
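As a sketch, the recurrence above can be written directly in Python (substitution cost 2 on mismatch, as in the recurrence; the distance-zero guard in the score is a defensive assumption, since a misspelled word is only ever compared with distinct candidates):

```python
def levenshtein(x, y):
    """Levenshtein distance with substitution cost 2, per the recurrence above."""
    n, m = len(x), len(y)
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):          # D(i, 0) = i
        d[i][0] = i
    for j in range(m + 1):          # D(0, j) = j
        d[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 0 if x[i - 1] == y[j - 1] else 2
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + sub)  # substitution
    return d[n][m]

def meds(w1, w2):
    """s1: longest-word length over the minimum edit distance."""
    dist = levenshtein(w1, w2)
    return max(len(w1), len(w2)) / dist if dist else float('inf')
```

For example, levenshtein("detal", "detail") is 1 (one insertion), so meds("detal", "detail") is 6.0.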
2. Character Similarity Measure Score (CSMS): This finds the similar characters between two words. The percentage of similarity is calculated as:

  s2 = (no. of similar characters) / (no. of characters in the longest word)
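The counting rule for "similar characters" is not spelled out; a minimal sketch, assuming a multiset intersection of the two words' characters:

```python
from collections import Counter

def csms(w1, w2):
    """s2: shared characters (multiset intersection, an assumed reading)
    divided by the length of the longest word."""
    shared = sum((Counter(w1) & Counter(w2)).values())
    return shared / max(len(w1), len(w2))
```

For example, 'detal' and 'dental' share the characters d, e, t, a, l, giving s2 = 5/6.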
3. Longest Common Substring Score (LCSS): Here, the longest common substring between the words is found. First, the longest common suffix is computed:

  LCSuff(S(1..p), T(1..q)) = LCSuff(S(1..p-1), T(1..q-1)) + 1   if S(p) = T(q)
                           = 0                                  otherwise

The maximum longest common substring is then:

  LCSubstr(S, T) = max over 1 ≤ i ≤ m, 1 ≤ j ≤ n of LCSuff(S(1..i), T(1..j))

Finally, the longest common substring score between two words is:

  s3 = (length of longest common substring) / (no. of characters in the longest word)
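The LCSuff dynamic program and the resulting score can be sketched as:

```python
def lcs_substring_len(s, t):
    """Longest common substring length via the LCSuff dynamic program."""
    best = 0
    suff = [[0] * (len(t) + 1) for _ in range(len(s) + 1)]
    for p in range(1, len(s) + 1):
        for q in range(1, len(t) + 1):
            if s[p - 1] == t[q - 1]:          # S(p) = T(q): extend the suffix
                suff[p][q] = suff[p - 1][q - 1] + 1
                best = max(best, suff[p][q])  # LCSubstr is the max over all cells
    return best

def lcss(w1, w2):
    """s3: longest common substring length over the longest word's length."""
    return lcs_substring_len(w1, w2) / max(len(w1), len(w2))
```

For example, the longest common substring of 'detal' and 'detail' is 'deta', so s3 = 4/6.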
4. First Letter Weighting (FLW): Yannakoudakis and Fawthrop [10] found that people usually do not make mistakes in the first letter of a word while writing. Therefore, an extra weight of 0.5 is given to candidate words whose first letter matches that of the incorrect word.

3.2 The Steps of the Spell Correction Method for Query-Based Text Summarization (SCQBT)
The pseudo-code for finding incorrect words and replacing each with a correct word is as follows.

Data: Query (Qi) and Input Text (I)
Result: Correct Query (Qcorrect)
Find the incorrect words (Qincorrect) using the dictionary
for each incorrect word Inw in Qincorrect do
    Generate the candidate words (Cw)
    Do n-gram character matching between Inw and Cw
    Take the highest n-gram character matching words (Nw) from Cw
    if Cw ∈ I then
        take those words (Iw)
    else
        do not take any Cw words
    end
    if Nw ∩ Iw has one unique word then
        Replace Inw with the unique word
    else
        Take Nw ∪ Iw
        for each word W ∈ (Nw ∪ Iw) do
            Calculate MEDS(Inw, W)
            Calculate CSMS(Inw, W)
            Calculate LCSS(Inw, W)
            Calculate FLW(Inw, W)
            Sum up all scores (score)
        end
        Replace Inw with the W having the highest score
    end
end
Algorithm 1: Steps of the Spell Correction method for Query-Based Text Summarization (SCQBT)
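The scoring branch of Algorithm 1 can be sketched as below, with compact re-implementations of the four scores from Section 3.1 summed with equal weight. With the substitution cost of 2 from the edit distance recurrence, this reproduces the 'dental' (7.83) and 'detail' (8.0) scores reported later in Table 3; the reported 'metal' score differs, so the exact substitution cost used there is uncertain:

```python
from collections import Counter

def levenshtein(x, y):
    """Levenshtein distance (rolling row), substitution cost 2 on mismatch."""
    d = list(range(len(y) + 1))
    for i in range(1, len(x) + 1):
        prev, d[0] = d[0], i
        for j in range(1, len(y) + 1):
            sub = prev + (0 if x[i - 1] == y[j - 1] else 2)
            prev, d[j] = d[j], min(d[j] + 1, d[j - 1] + 1, sub)
    return d[len(y)]

def lcs_substring(x, y):
    """Length of the longest common substring (LCSuff dynamic program)."""
    best, row = 0, [0] * (len(y) + 1)
    for i in range(1, len(x) + 1):
        new = [0] * (len(y) + 1)
        for j in range(1, len(y) + 1):
            if x[i - 1] == y[j - 1]:
                new[j] = row[j - 1] + 1
                best = max(best, new[j])
        row = new
    return best

def total_score(wrong, cand):
    """Sum of MEDS, CSMS, LCSS and FLW for one candidate word."""
    longest = max(len(wrong), len(cand))
    dist = levenshtein(wrong, cand)
    s1 = longest / dist if dist else float(longest)                # MEDS
    s2 = sum((Counter(wrong) & Counter(cand)).values()) / longest  # CSMS
    s3 = lcs_substring(wrong, cand) / longest                      # LCSS
    s4 = 0.5 if wrong[:1] == cand[:1] else 0.0                     # FLW
    return s1 + s2 + s3 + s4

def pick_replacement(wrong, filtered_candidates):
    """Replace the misspelled word with the highest-scoring candidate."""
    return max(filtered_candidates, key=lambda c: total_score(wrong, c))
```

For the worked example of Section 4, pick_replacement("detal", ["metal", "dental", "detail"]) selects 'detail', in agreement with Table 3.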
4 Experimental Data and Results
Experiments are performed on datasets provided by the Text Analysis Conference (TAC) (http://www.nist.gov/tac/data/). The text documents are taken from the TAC 2009 dataset, which contains 44 documents, each having 2 topics; for each topic, there are ten text documents.
Below, the experimental process is described with a sample query and an input text file. Let the sample query be "detail china accident", entered with spelling mistakes as "detal chna accidnt". The dictionary then flags the incorrect words along with their possible candidate words, shown in Table 1.

Table 1. Candidate Words List

Incorrect Words | Candidate Words
detal           | 'deal', 'dental', 'detail', 'dealt', 'delta', 'metal', 'petal', 'fetal', 'decal'
chna            | 'tuna', 'china'
accidnt         | 'accident', 'accidence', 'acidness', 'accordant', 'account'
Now, n-gram character matching is done for the candidate words, with n = 3. The n-gram character matching values are shown in Table 2.

Table 2. n-gram character matching values of Candidate Words

Incorrect Words | n-gram values of Candidate Words
detal           | 'dental': 0.5, 'detail': 0.5, 'deal': 0.44, 'petal': 0.4, 'metal': 0.4, 'fetal': 0.4, 'decal': 0.4, 'delta': 0.16, 'dealt': 0.16
chna            | 'china': 0.4, 'tuna': 0.2
accidnt         | 'accident': 0.58, 'account': 0.38, 'accordant': 0.33, 'accidence': 0.33, 'acidness': 0.26
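The exact matching measure behind Table 2 is not stated. Most of its values are consistent with Jaccard similarity over character trigrams padded at the word boundaries, so that reconstruction is sketched here; the padding scheme is an assumption:

```python
def char_ngrams(word, n=3, pad='#'):
    """Set of character n-grams, with (n-1) padding characters on each side
    so that word boundaries contribute their own n-grams."""
    padded = pad * (n - 1) + word + pad * (n - 1)
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}

def ngram_match(a, b, n=3):
    """Jaccard similarity over padded character trigrams (assumed measure)."""
    ga, gb = char_ngrams(a, n), char_ngrams(b, n)
    return len(ga & gb) / len(ga | gb)
```

Under this reading, ngram_match("detal", "dental") and ngram_match("detal", "detail") both come to 0.5 and ngram_match("accidnt", "accident") to about 0.58, matching Table 2; a few entries (e.g. 'chna'/'china') round slightly differently, so the reconstruction is indicative only.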
Each of the candidate words is searched for in the given text document. For the incorrect word 'chna', we get one unique word, 'china', which has the highest n-gram character matching score and is present in the input text document. Similarly, for the incorrect word 'accidnt', we get one unique word, 'accident'. But for 'detal', we get 'metal' as the input text matching word and 'dental' and 'detail' as the highest n-gram character matching words. Hence, according to our method, the total scores of these three filtered candidate words are calculated; the scores are given in Table 3. Word replacement is done on the basis of the highest score, so 'detal' is replaced by 'detail'.

Table 3. Scores of Filtered Candidate Words

Filtered Candidate Words | Score Value
metal                    | 6.6
dental                   | 7.83
detail                   | 8.0

We take 50 queries for evaluation. Our method is compared with baseline methods: the Microsoft Word Spell Corrector, Character Similarity, Minimum Edit Distance and Longest Common Subsequence. Table 4 shows the detailed results.

Table 4. Comparison of Performance with Baseline Systems

Method Name                    | Accuracy | Recall | Precision | F-measure
SCQBT                          | 70%      | 89.7%  | 87.5%     | 89.1%
Microsoft Word Spell Corrector | 64%      | 89.2%  | 89%       | 89.1%
Character Similarity           | 48%      | 85%    | 82.8%     | 83.9%
Minimum Edit Distance          | 42%      | 84%    | 80.8%     | 82.4%
Longest Common Subsequence     | 18%      | 69%    | 64.3%     | 66.6%

From Table 4, it is observed that the proposed SCQBT algorithm works well in terms of accuracy. SCQBT also exceeds the precision, recall and F-measure values of the Character Similarity, Minimum Edit Distance and Longest Common Subsequence methods. The recall value of SCQBT is higher than that of the Microsoft Word Spell Corrector and its F-measure is comparable, but its precision value is lower.
5 Discussion
Our method is limited to non-word errors. Moreover, an incorrect replacement can sometimes receive the highest score. For example, consider the sample query 'India Pakistan peace polici'. For this query, the incorrect word is 'polici', and the candidate words found are listed in Table 5.

Table 5. Candidate Words List

Incorrect Words | Candidate Words
polici          | 'polis', 'police', 'policy'
We then calculate the n-gram character matching values for all the candidate words, shown in Table 6. From the input text matching we find the word 'police'. Finally, 'police' is selected as the unique correct word, since it is both the highest n-gram character matching word and the input text matching word; but it is not the correct word for this query ('policy' was intended). Therefore, our method fails to find the correct word for this incorrect query.
Table 6. n-gram character matching values of Candidate Words

Incorrect Words | n-gram values of Candidate Words
polici          | 'police': 0.45, 'policy': 0.45, 'polis': 0.36

6 Conclusion and Future Work

This spell checking technique is applied mainly for query-based text summarization purposes. The proposed method is dictionary based: it tries to select the most appropriate word from the candidate words provided by the dictionary. Experimental results show good performance. This work can be extended by adding corpus-based semantic similarity to obtain further improvements.
References

1. Hovy, E., Lin, C.Y.: Automated text summarization and the SUMMARIST system. In: Proceedings of a Workshop held at Baltimore, Maryland, October 13-15, 1998, Association for Computational Linguistics (1998) 197-214
2. Martin, J.H., Jurafsky, D.: Speech and Language Processing. International Edition 710 (2000)
3. Ahmed, F., Luca, E.W.D., Nürnberger, A.: Revised n-gram based automatic spelling correction tool to improve retrieval effectiveness. Polibits (40) (2009) 39-48
4. Garaas, T., Xiao, M., Pomplun, M.: Personalized spell checking using neural networks
5. Islam, A., Inkpen, D.: Real-word spelling correction using Google Web 1T 3-grams. In: Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing: Volume 3, Association for Computational Linguistics (2009) 1241-1249
6. Duan, H., Hsu, B.J.P.: Online spelling correction for query completion. In: Proceedings of the 20th International Conference on World Wide Web, ACM (2011) 117-126
7. Cordeiro De Amorim, R., Zampieri, M.: Effective spell checking methods using clustering algorithms. In: Proceedings of Recent Advances in Natural Language Processing, Association for Computational Linguistics (2013)
8. Gupta, S., et al.: A correction model for real-word errors. Procedia Computer Science 70 (2015) 99-106
9. Haldar, R., Mukhopadhyay, D.: Levenshtein distance technique in dictionary lookup methods: An improved approach. arXiv preprint arXiv:1101.1232 (2011)
10. Yannakoudakis, E.J., Fawthrop, D.: An intelligent spelling error corrector. Information Processing & Management 19(2) (1983) 101-108