Int. J. Reasoning-based Intelligent Systems, Vol. 8, Nos. 3/4, 2016
Improving post-processing optical character recognition documents with Arabic language using spelling error detection and correction

Iyad Abu Doush* and Ahmed M. Al-Trad
Computer Science Department, Yarmouk University, Irbid, Jordan
Email: [email protected]
Email: [email protected]
*Corresponding author

Abstract: Optical character recognition (OCR) is used to convert scanned documents into text. The resulting text needs to be validated for correctness. The problem is greater when working on Arabic text because of the complexity of the Arabic language. This research explores ways of improving OCR spell checking effectiveness by proposing a post-processing Arabic OCR system based on three different approaches: Microsoft Office Word with the Google online suggestion system, the Ayaspell spell checker with the Google online suggestion system, and the Google online suggestion system alone. We use precision and recall to evaluate the effectiveness of our proposed OCR post-processing. The results show that using Microsoft Office Word with Google outperforms the other approaches, with an accuracy of 0.49.

Keywords: post-processing Arabic OCR; Arabic optical character recognition; Arabic spell checker.

Reference to this paper should be made as follows: Abu Doush, I. and Al-Trad, A.M. (2016) 'Improving post-processing optical character recognition documents with Arabic language using spelling error detection and correction', Int. J. Reasoning-based Intelligent Systems, Vol. 8, Nos. 3/4, pp.91–103.

Biographical notes: Iyad Abu Doush is an Associate Professor in the Department of Computer Science at Yarmouk University in Jordan. He earned his PhD from the Computer Science Department at New Mexico State University, USA, in 2009. In 2009, he was awarded the Merit-based Enhancement Fellowship by the graduate school for outstanding service to New Mexico State University as a Teaching and Research Assistant. Since 2010, he has been a Professor of Computer Science at Yarmouk University, Jordan. He has published more than 30 articles in international journals and conferences, and he has served as a reviewer and committee member for several specialised international journals and conferences. His research interests include evolutionary algorithms, optimisation, accessibility and human computer interaction.

Ahmed M. Al-Trad graduated from Yarmouk University in 2008 with an excellent rating in Computer Information Systems. In 2009, he started his career as a computer teacher in one of the Jordanian Ministry of Education schools. In 2012, he received his Higher Diploma in Computer Information Systems from Yarmouk University with a very good rating. He then completed his Master in Computer Science at Yarmouk University in 2014 with a very good rating. His research interests include Arabic language spell checking and correction algorithms, information retrieval (semantic search), artificial intelligence systems, optimisation, accessibility and human computer interaction.
Copyright © 2016 Inderscience Enterprises Ltd.

1 Introduction

As a result of the computer and information technology revolution, dependence on computer systems for performing various real-life activities has increased. However, any activity to be performed by a computer has to be expressed in digital format so that it can be interpreted and carried out by the computer system. One of these activities is the searching, analysis, maintenance and processing of data and information stored in various books and documents. Since a huge amount of these documents is available only in printed format, researchers have been driven to search for computerised techniques that convert these documents into digital format (Magdy and Darwish, 2008). Optical character recognition (OCR) is a process that converts text present in a digital image into editable text. The main job of an OCR system is to transform written documents into electronic documents so that they can be accessed, maintained and processed easily by computer. It allows a machine to recognise characters through optical
mechanisms. The output of the OCR should ideally be the same as the original text. The process involves some pre-processing of the image file and then the acquisition of important knowledge about the written text (Singh, 2013). OCR works by scanning source documents and performing character analysis on the resulting images, producing a translation to ASCII text, which can then be stored and manipulated electronically like any standard electronic text document. Although OCR is useful for scanning documents, its output still contains critical and undesirable errors, which make the character recognition process imperfect and result in low-quality digital documents. These errors have an adverse effect on the effectiveness of information retrieval algorithms that are based on exact matches of query terms and document terms. Searching OCR data is essentially a search of 'noisy' or error-filled text (Beitzel et al., 2003).

Research work on Arabic OCR has increased considerably since the 1980s. The first systems for Arabic printed text became available on the market in the 1990s (Märgner and El Abed, 2011). Today, there are many Arabic OCR systems on the market; the most commonly used Arabic OCR products are Sakhr Automatic Reader and OmniPage for Arabic (Kanungo et al., 1999). The difficulty of OCR increases when dealing with Arabic documents due to the complexity of the Arabic language and the different shapes a letter can appear in, in addition to the complex morphological structure of Arabic text. For these reasons, the need for an OCR engine specialised in the Arabic language has emerged, prompting researchers to give this domain more attention (Märgner and El Abed, 2011; Magdy and Darwish, 2006; Slimane et al., 2011). Typically, the OCR process introduces errors in the text representation of the document images. The error rate is affected by the quality of paper, printing, and scanning.

The introduced errors are more common in Arabic OCR due to some of the orthographic and morphological features of Arabic. Furthermore, Arabic characters are connected and change shape depending on their position in a word. As for morphological complexity, Arabic allows the insertion of infixes to form words and the attachment of prefixes and suffixes that include pronouns, determiners, number markers (singular, dual, and plural), conjunctions, etc. The introduced errors adversely affect the retrieval effectiveness of OCR documents. Consequently, an effective technique is needed to detect and correct these errors in order to improve the performance of the OCR system (Magdy and Darwish, 2008; Bassil and Alwani, 2012b). There is a lack of automatic spelling correctors that work without human interference and without the effort and time wasted by traditional manual correction (Bassil and Alwani, 2012b; Mahdi, 2012). In this paper, we first discuss some issues related to error detection and correction in Arabic OCR systems. We then propose an OCR post-processing technique to improve Arabic OCR results using spelling error detection and correction techniques. Finally, we compare the results of a regular Arabic OCR system with the results of our proposed system on samples from different datasets, as an evaluation of our proposed OCR post-processing system.
2 Background

We present an overview of the OCR system and summarise some Arabic OCR challenges. In addition, we discuss some classifications of spelling errors and the types of errors that are more likely to occur in Arabic OCR output documents, and we review some of the approaches that have been used in previous work.
2.1 Optical character recognition

The goal of OCR systems is to convert scanned document images in printed format into digital format in order to make their data searchable and editable. The transformation process, as shown in Figure 1, usually starts by scanning the documents and storing them as images. The OCR system then segments the document image into sub-character images, after which the OCR engine tries to predict each character by analysing the character image segment and determining the character text that corresponds to one of the character shapes according to the sequential context of the list of possible words. It then selects the most likely character according to its position, and finally the OCR errors are optionally corrected, indexed, and searched (Magdy and Darwish, 2008).

Figure 1 The OCR transformation process (see online version for colours). Source: Magdy and Darwish (2008)

Several approaches have been developed to minimise the number of errors in OCR results. One of the most popular and widely used is the spell checker approach, which is based on various spell checking techniques. In the spell checking approach, candidate words are selected using information gathered from multiple knowledge sources and automatically replaced with the correct word (Magdy and Darwish, 2008; Mohapatra et al., 2013).
2.2 Arabic OCR challenges

Arabic OCR output usually contains errors and wrongly predicted words. The character error rate is influenced by several factors and challenges, some of which can be summarised as follows (Magdy and Darwish, 2006, 2008; Rashwan et al., 2007; Borovikov and Zavorin, 2012):

• The reproduction quality: original documents usually yield a lower error rate than photocopies.

• The scanning resolution: the higher the resolution of the scanned documents, the lower the error rate.

• The various shapes of a character: an Arabic character can appear with different shapes in different words, or even in the same word, because the shape of a character depends on its position in the word. There are four forms that Arabic characters can take based on their position in the word: isolated, initial, medial and final, as shown for the letter 'Ein' in Figures 2 and 3.

Figure 2 The letter Ein ‘ﻉ’ in four different positions. Source: Rashwan et al. (2007)

Figure 3 Four shapes of the Arabic letter ‘Ein’ (right to left): isolated, initial, medial, and final. Source: Borovikov and Zavorin (2012)

• The presence of dots in 15 Arabic letters: it is sometimes difficult to distinguish these dots from dirt or dust that may occur in the scanned document image. In addition, many Arabic letters share the same shape, and the dots are the only thing that differentiates them from each other (see Figure 5), which in turn makes the Arabic OCR recognition task more difficult.

Figure 5 Example sets of dotted letters. Source: Rashwan et al. (2007)

• The diacritics challenge: Arabic diacritics are sometimes used to resolve linguistic ambiguity in the text. The problem of diacritics in Arabic OCR is similar to the dots challenge, as diacritics can be difficult to distinguish from dirt or dust in noisy documents. Figures 7 and 8 show some examples of Arabic diacritics.

Figure 7 Arabic diacritics (see online version for colours). Source: Borovikov and Zavorin (2012)

• The difficulty of distinguishing multiple connected characters from single characters: it is possible for multiple connected characters to be recognised as one single character or vice versa. For example, the letter ‘ﺷـ’ (sheen) may resemble ‘ﻧﺘـ’ (noon-ta combination). Moreover, in most Arabic words the letters occur connected to each other, as shown in Figure 4, which makes the segmentation and recognition tasks harder.

Figure 4 The complexity of the segmentation task. Source: Rashwan et al. (2007)

• The ligatures challenge: a ligature is a compound of two or more letters that are represented, at certain positions of word segments, by a single atomic segment. The ligatures used often depend on the font of the text. Traditional Arabic contains 220 ligatures and Simplified Arabic contains 151 ligatures, while English contains only 40 to 50 ligatures. This makes the recognition task harder. Figure 6 shows some ligatures that may be found in Traditional Arabic.

Figure 6 Some ligatures in traditional Arabic font. Source: Rashwan et al. (2007)
2.3 Spell checkers classification

According to Naseem (2004) and Verberne (2002), the spell checking process can be divided into three main steps:

1 Error detection: the process of finding misspelled words in the text.

2 Finding candidate correction words: the process of finding the suggested corrections.

3 Ranking candidate words: the process of ranking the suggested corrections in decreasing order of probability (Naseem, 2004; Verberne, 2002).

Spell checkers can be classified into two types (Naseem, 2004; Verberne, 2002):

1 Interactive: in this type of spell checker, the spell checker detects the misspelled words and then suggests a number of possible corrections for each error. The user then selects the appropriate word from the suggestions to replace the misspelled word.

2 Automatic: in an automatic spell checker, the misspelled word is automatically detected and corrected without user intervention. The misspelled word is replaced with the most probable word according to some statistical calculations.

Figure 8 Automatic spelling error detection and correction techniques. Source: Verberne (2002)

2.4 Error patterns in Arabic

Many studies have been performed to analyse and classify the types of spelling errors in Arabic. For example, errors can be classified based on grammar or semantics, but for simplicity many studies consider them typographic errors. These errors can also be categorised into single-error misspellings and multi-error misspellings. Studies have found that approximately 80% of all misspelled words are single-error misspellings, and most of these errors result from: transposition of two letters, one letter extra, one letter missing or one letter wrong (Mohapatra et al., 2013). Misspelling errors can result from human typing errors and from text recognition errors such as OCR. Most researchers divide word misspelling errors into three groups (Haddad and Yaseen, 2007; Mahdi, 2012; Naseem and Hussain, 2007):

1 Typographic errors

A typographic error happens due to keyboard adjacencies, where a writer knows how to spell the word but makes a mistake when typing it. Typographic errors belong to one of the following categories (Haddad and Yaseen, 2007; Mahdi, 2012; Naseem and Hussain, 2007):

• Insertion: this error occurs when an extra letter is inserted into the word, e.g., typing 'acress' for 'cress' in English, where the letter (a) is additionally inserted. In Arabic, we can find many examples of this error type, e.g., typing ‘ﻣﻜﺘﺘ ﻮﺏ’ for ‘ﻣﻜﺘ ﻮﺏ’, where the letter (ﺕ) is additionally inserted, or typing ‘ﻭﻡﺍﻟﻌﻠ ﻞ’ for ‘ﺍﻟﻌﻠ ﻮﻡ’, where the letter (ﻝ) is additionally inserted. According to Haddad and Yaseen (2007), this error forms approximately 15% of all single-error rates.

• Deletion: this error occurs when one of the word's letters is unintentionally deleted. For example, in English, typing 'acress' for 'actress', where the letter (t) is deleted. Also, in Arabic, typing ‘ﻣﺒ ﺪ’ for ‘ﻣﻌﺒ ﺪ’, where the letter (ﻉ) is deleted, or typing ‘ﻣﺘﺒ ﺔ’ for ‘ﻣﻜﺘﺒ ﺔ’, where the letter (ﻙ) is deleted. This error forms approximately 23% of error rates (Haddad and Yaseen, 2007).

• Substitution: this error occurs when a letter is mistakenly substituted by another letter. For example, in English, typing 'acress' for 'across', where the letter (o) is mistakenly substituted by (e). In Arabic, typing ‘ﺍﻟﺘ ﺺ’ for ‘ﺍﻟﻨ ﺺ’, where the letter (ﻥ) is mistakenly substituted by (ﺕ); alternatively, the letter (ﻕ) is mistakenly substituted by (ﻑ) in ‘ﻓ ﺎﻝ’ instead of ‘ﻗ ﺎﻝ’. Substitution forms 41.5% of error rates (Haddad and Yaseen, 2007).

• Transposition: this error occurs when two adjacent letters are swapped, for example in English typing 'acress' for 'caress', where the two letters (c) and (a) are swapped. In Arabic, typing ‘ﺍﺟﻤﺘ ﺎﻉ’ for ‘ﺍﺟﺘﻤ ﺎﻉ’, where the letters (ﺕ) and (ﻡ) are swapped, or typing ‘ﻣﻌﻠ ﺐ’ for ‘ﻣﻠﻌ ﺐ’, where the letters (ﻝ) and (ﻉ) are swapped. This error forms only 4% of error rates (Haddad and Yaseen, 2007).

2 Cognitive errors

Unlike typographic errors, a cognitive error occurs when a writer does not know how to spell the word due to lack of knowledge. For example, in English (receive → recieve, or abyss → abiss). Arabic examples are typing ‘ﻻﻛ ﻦ’ for ‘ﻟﻜ ﻦ’, or typing ‘ﻫﺎﺫﺍ’ for ‘ﻫﺬﺍ’, or typing ‘ﺫﺍﻟ ﻚ’ for ‘ﺫﻟ ﻚ’.

3 Phonetic errors

Phonetic errors are caused mainly by homophone characters (Naseem and Hussain, 2007), i.e., characters that represent the same sound. In Arabic, phonetic errors often occur according to some Arabic dialects (Haddad and Yaseen, 2007). In fact, cognitive errors include phonetic errors, where a word is replaced by a similar-sounding word, for example, typing ‘ﻗﻀ ﺎﺓ’ as ‘ﻗﻀ ﺎﺕ’ or typing ‘ﻋﻈﻴ ﻢ’ as ‘ﻋﻀ ﻴﻢ’.
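The four single-error typographic patterns described above (insertion, deletion, substitution and transposition) are exactly the operations counted by the Damerau-Levenshtein edit distance. The sketch below, which is our illustration and not part of the surveyed systems, shows that each of the paper's English misspellings of this kind is one edit away from its intended word:

```python
# Minimal Damerau-Levenshtein (optimal string alignment) edit distance,
# allowing insertion, deletion, substitution and adjacent transposition.

def damerau_levenshtein(a: str, b: str) -> int:
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i                      # delete all of a's prefix
    for j in range(len(b) + 1):
        d[0][j] = j                      # insert all of b's prefix
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution / match
            if (i > 1 and j > 1 and a[i - 1] == b[j - 2]
                    and a[i - 2] == b[j - 1]):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[len(a)][len(b)]

# Each misspelling type from the text is a single edit:
print(damerau_levenshtein("acress", "cress"))    # insertion of 'a'   -> 1
print(damerau_levenshtein("acress", "actress"))  # deletion of 't'    -> 1
print(damerau_levenshtein("acress", "across"))   # substitution o->e  -> 1
print(damerau_levenshtein("acress", "caress"))   # transposition c/a  -> 1
```

A spell checker can use this distance to generate candidate corrections, since roughly 80% of misspellings are within distance 1 of the intended word.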
2.5 OCR errors

OCR errors are the errors that arise from the OCR misrecognising text in the original document. In this section, we present an overview of possible English and Arabic OCR errors that may result from OCR systems (Mahdi, 2012). There are many Arabic OCR errors; most of them occur due to the similarity between some Arabic letter shapes, like ‘ﺯ’ and ‘ﺭ’ or ‘ﻁ’ and ‘ﻅ’ and many others. Arabic OCR errors can be grouped as follows:

• single substitution, for example: ﺽ → ﺹ or ﻭ → ﻡ

• multi-substitution

• one single character recognised as multiple characters, for example: ‘ﻧﺘـ’ → ‘ﺷـ’

• multiple characters recognised as one single character, for example: ‘ﺷـ’ → ‘ﻧﺘـ’

• space deletion, for example: ﺔﻓ ﻲ ﻣﻜ ﺖ ﺑ ﺔ → ﻓﻴﻤﻜﺘﺒ, or space insertion, for example: ﻁﺎﻟ ﺐ → ﻁ ﺎ ﻟ ﺐ

• letter deletion, for example: ﺍﻟ ﺬﻱ → ﻟ ﺬﻱ, or letter insertion, for example: ﺯﻳ ﺖ → ﺯﻳﻨ ﺖ.

2.5.1 OCR post-processing approaches

Much research has been done on correcting OCR recognition errors and fixing OCR-degraded collections. According to Magdy and Darwish (2008), OCR-degraded error correction approaches can be divided into two main categories:

1 Word-level OCR post-processing: this category includes the use of dictionary lookup, probabilistic relaxation, character and word n-gram frequency analysis, and morphological analysis.

2 Passage-level OCR post-processing: this category includes the use of word n-grams, word collocations, grammar, conceptual closeness, passage-level word clustering, linguistic context, and visual context.

The next section presents an overview of OCR-error correction techniques.

2.5.2 OCR-degraded errors correction techniques

In this section, we present an overview of some OCR-degraded error correction techniques proposed previously in both the word-level and passage-level OCR post-processing categories. Some of these OCR error correction techniques are (Magdy and Darwish, 2008; Beitzel et al., 2003; Bassil and Alwani, 2012b):

• Dictionary look-up: the recognised words in the OCR document are compared with words in a dictionary term list. If a word is found in the dictionary, it is considered correct. This technique was used in the design and implementation of a Punjabi spell checker (Lehal, 2007). In that work, a lexicon of correctly spelled words was created, storing all the possible word forms of the Punjabi lexicon, and the lexicon was then partitioned into 16 sub-dictionaries based on word length. Dictionary lookup is used to detect misspelled words.

• Character n-grams: this technique may be used alone or in combination with dictionary lookup. The main reason for using n-grams is that some letter sequences are more common than others, while other letter sequences are rare or impossible. For example, the trigram ‘xzx’ is rare in the English language, while the trigram ‘ies’ is common.

• Word n-grams: a word n-gram is a sequence of n consecutive words in text. This technique can be used to calculate the probability that a word sequence would appear, and with it the candidate correction of a misspelled word can often be picked successfully. For example, in the sentence ‘I bought a peece of land’, the possible corrections for the word peece might be ‘piece’ and ‘peace’. Using the n-gram method, the word trigram ‘piece of land’ is much more likely to be selected than the trigram ‘peace of land’; thus, the word ‘piece’ is the more likely correction.

• Using morphology: many morphologically complex languages, such as Arabic, Turkish, and German, have a huge number of possible words, which makes listing all possible words infeasible for error correction. For this reason, researchers have used stem lists together with orthographic rules, which govern how a word is written, and morphotactic rules, which govern how morphemes can be combined.

• Using grammar: in this technique, spelling errors are parsed based on a language-specific grammar. Applying this technique to Arabic might be challenging because work on Arabic parsing has been very limited.

• Multi-OCR output fusion: multiple OCR systems are used to recognise the same text, and the outputs of the different OCR systems are fused by picking the most likely recognised sequence of tokens using language modelling.
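The word n-gram ranking described above can be sketched in a few lines. The trigram counts below are illustrative stand-ins for frequencies that would, in a real system, come from a large corpus; the function and numbers are our own assumptions, not part of any surveyed system:

```python
# Toy word-trigram model for the 'peece of land' example: the candidate
# whose trigram context is most frequent in the (assumed) corpus wins.

trigram_counts = {
    ("piece", "of", "land"): 120000,  # assumed corpus frequency
    ("peace", "of", "land"): 300,     # assumed corpus frequency
}

def best_candidate(next1, next2, candidates):
    """Pick the candidate w maximising the count of the trigram (w, next1, next2)."""
    return max(candidates, key=lambda w: trigram_counts.get((w, next1, next2), 0))

# 'I bought a peece of land' -> candidates for 'peece' are 'piece' and 'peace'
print(best_candidate("of", "land", ["piece", "peace"]))  # -> piece
```

The same maximisation generalises to longer contexts and to smoothed probabilities rather than raw counts.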
2.5.3 Arabic spell checking and correction approaches

Recently, Arabic spell checkers have received great attention due to the increasing number of Arabic applications that require spell checking and correction capabilities. One of the most famous and widely used Arabic spell checkers available in word processing applications is the Microsoft Word spell checker. According to Haddad and Yaseen (2007), most Arabic spell checkers are based on simple morphological analysis, considering the keyboard effect, for correcting single-error misspellings. Moreover, most Arabic OCR systems segment characters in order to recognise them, while few recognise words without segmenting characters (Ezzat et al., 2013). A system developed by BBN (Ezzat et al., 2013) avoids character segmentation by dividing lines into slender vertical frames, dividing the frames into cells, and using an HMM recogniser to recognise character sequences.

Many spell checking systems have been proposed. One of them is built on approximate string matching and uses a statistical model that incorporates a confusion matrix. This system corrects errors that could result from commercially available OCR applications, and it handles error patterns such as substitution, transposition, insertion and deletion (Mohapatra et al., 2013).

In Ezzat et al. (2013, 2014), a new model for enhancing the retrieval effectiveness of Arabic OCR-degraded text was proposed. The model is based on simulating Arabic OCR recognition mistakes with a word-based approach; it then expands the user's search query using the expected OCR errors. The resulting expanded search query gives higher precision and recall when searching Arabic OCR-degraded text than the original query. The proposed model shows a significant increase in degraded-text retrieval effectiveness over previous models: its retrieval effectiveness is 97%, while the best published effectiveness for a word-based approach was 84% and the best for a character-based approach was 56% (Ezzat et al., 2013, 2014).

Another approach to OCR error detection and correction is based on Google's online spelling suggestion, which harnesses an internal database containing a huge collection of terms and word sequences gathered from all over the web. Experiments revealed a significant improvement in the OCR error correction rate (Bassil and Alwani, 2012b). In addition, the researchers in Whitelaw et al. (2009) designed, implemented and evaluated an end-to-end system for spell checking and auto-correction that does not require any manually annotated training data. The World Wide Web is used as a large noisy corpus from which knowledge about misspellings and word usage is inferred; this knowledge is used to build an error model and an n-gram language model. Their system achieves a 3.8% total error rate in English and shows similar preliminary results for Russian and Arabic (Whitelaw et al., 2009).

Another Google-based post-processing system, for the English and French languages, was proposed in Bassil and Alwani (2012a). This post-processing OCR system used a context-sensitive error correction method for detecting and correcting non-word and real-word OCR errors, with the Google Web 1T 5-gram dataset as a dictionary of words to spell-check the OCR text. Experiments on the proposed algorithm showed a drastic improvement in the detection and correction of OCR errors: the proposed solution was able to detect and correct about five times (504%) more English errors and around four times (405%) more French errors than the OmniPage tool, yielding an error rate close to 4.2% for the English text and 3.5% for the French text (Bassil and Alwani, 2012a). In our proposed post-processing system, we use Google's online spelling suggestion system, in addition to the Microsoft Office Word and Ayaspell spell checkers, to improve Arabic OCR post-processing.
3 The methodology

In the first section, we describe the steps of the approach we applied and the stages of the research methodology. The second section discusses the Ayaspell and Microsoft Office Word spell checkers used in our proposed system. In the last section, we discuss the evaluation metrics used to evaluate the results returned by our proposed system.
3.1 Description of the research methodology

In this paper, we used C#.NET to develop our proposed system. The approach we used to evaluate the effectiveness of the proposed system is divided into two main phases. In the first phase, the image document is assumed to have been converted into text using an OCR engine. We split this text into tokens to be processed, then filter the text for errors using an Arabic spell checker (Ayaspell or the Microsoft spell checker). These steps are shown in Figure 9.

The second phase checks the resulting text using Google's online suggestion system. In this phase, as shown in Figure 10, the proposed system breaks the output text down into a collection of blocks, each containing five words. Five words were chosen following Bassil and Alwani (2012b), who illustrated that five words provide Google with enough insight into the context of every block. These blocks are given one by one to Google's search engine as search parameters. If Google returns a successful response without the ‘did you mean’ expression, then the query contains no misspelled words and no correction is needed for this particular block of words. Conversely, if Google responds with a ‘did you mean: spelling-suggestion’ expression, then the query contains some misspelled words and a correction is required for this block. The actual correction consists of replacing the original block in the OCR output text with Google's suggested correction.
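The first phase can be sketched as tokenisation followed by per-token spell checking. The toy checker below is a stand-in for Microsoft Word or Ayaspell, whose real integrations go through their own APIs and are not shown here; the lexicon, the one-character-difference suggester, and all names are our own illustrative assumptions:

```python
# Phase 1 sketch: tokenise the OCR output, keep words the checker accepts,
# and replace flagged words with the checker's top suggestion.

class SimpleChecker:
    """Stand-in for a real spell checker (Microsoft Word / Ayaspell)."""
    def __init__(self, lexicon):
        self.lexicon = set(lexicon)

    def check(self, word):
        return word in self.lexicon

    def suggest(self, word):
        # toy suggester: lexicon words of equal length differing by one character
        return [w for w in self.lexicon
                if len(w) == len(word)
                and sum(a != b for a, b in zip(w, word)) == 1]

def phase1_correct(text, checker):
    out = []
    for tok in text.split():                      # split the text into tokens
        if checker.check(tok):
            out.append(tok)                       # accepted: keep as-is
        else:
            suggestions = checker.suggest(tok)
            out.append(suggestions[0] if suggestions else tok)
    return " ".join(out)

checker = SimpleChecker(["scanned", "document", "text"])
print(phase1_correct("scaneed document text", checker))  # -> scanned document text
```

A real deployment would swap SimpleChecker for the spell checker's API while keeping the same tokenise-check-replace loop.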
Figure 9 (Phase 1) spell checking and correction (see online version for colours)

Figure 10 (Phase 2) using Google's online suggestion system. Source: Bassil and Alwani (2012b)
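The second phase's block-based querying can be sketched as follows. Since Google's suggestion service is queried over HTTP in the real system, `did_you_mean()` below is a stub standing in for that service; its canned response, like all other names here, is an illustrative assumption:

```python
# Phase 2 sketch: split the text into five-word blocks, query each block,
# and replace a block whenever the service returns a "did you mean" suggestion.

def did_you_mean(query):
    """Stub for Google's online suggestion service: returns a suggestion
    string for a misspelled query, or None when no correction is needed."""
    fake_suggestions = {"I bought a peece of": "I bought a piece of"}
    return fake_suggestions.get(query)

def phase2_correct(text, block_size=5):
    words = text.split()
    corrected = []
    for i in range(0, len(words), block_size):
        block = " ".join(words[i:i + block_size])   # one five-word block
        suggestion = did_you_mean(block)
        corrected.append(suggestion if suggestion else block)
    return " ".join(corrected)

print(phase2_correct("I bought a peece of land near the old farm"))
# -> I bought a piece of land near the old farm
```

The five-word block size follows Bassil and Alwani (2012b); only the blocks the service flags are rewritten, so correctly recognised text passes through untouched.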
3.2 Spell checkers

3.2.1 Microsoft spell checker

The Microsoft spell checker is one of the most famous and most widely used Arabic spell checkers available in word processing applications. According to Hirst (2008), Office Word 2007 includes a ‘contextual spelling checker’ that is intended to find misspellings that nonetheless form correctly spelled words. As described in Bernstein et al. (2010), human checkers are currently more reliable and can offer suggestions on how to fix the errors they find, which is not always possible for Word. For example, the common Microsoft Word feedback ‘Fragment, consider revising’ is not useful, since it does not give any suggestions (Bernstein et al., 2010). The Microsoft Arabic spell checker does not give reasonable and natural suggestions for many real-word errors, and in many cases the correction candidates appear unfounded (Haddad and Yaseen, 2007; Bernstein et al., 2010). Due to these issues, we combine other spell checkers and approaches with the Microsoft spell checker in order to improve the whole process.
Figure 11 Our OCR post-processing snapshot (see online version for colours)
3.2.2 Ayaspell spell checker

Ayaspell (Bernstein et al., 2010; Hajeer, 2008; Ayaspell-dic Project, http://ayaspell.blogspot.com/) is a project that aims to create Arabic-language dictionaries for office applications such as OpenOffice (OpenOffice.org), the Firefox browser, Thunderbird, Abiword, and Gedit. Ayaspell is a Hunspell spell checker library specialised in the Arabic language. Hunspell is a spell checker based on MySpell that can use the MySpell dictionaries for languages with complex word compounding and rich morphology. It was written by László Németh for spell checking the Hungarian language and can be used with UTF-8 encoded Unicode dictionaries. The Ayaspell word list is the official resource used in OpenOffice applications. The developers of this word list created their own morphological generator, and the word list contains about 300,000 inflected words (Ayaspell Project, http://ayaspell.sourceforge.net/; Shaalan et al., 2012; Hunspell, 2014).
3.3 Evaluation metrics

To evaluate the results and performance of the proposed techniques, we use recall and precision (Mahdi, 2012; Liang, 2008). The recall rate is the fraction of the errors in the text that are correctly detected, namely, the total number of actual misspelled words that are detected divided by the total number of misspelled words. Recall therefore compares the detected misspelled words against all misspelled words. Equation (1) shows the error detection recall rate:

Recall = (total number of the actual misspelled words corrected) / (total number of all misspelled words in the document)   (1)

The precision rate is the ratio of the total number of actual misspelled words detected by the system to the total number of detected (changed) words. Equation (2) shows the precision rate:

Precision = (total number of the actual misspelled words corrected) / (total number of all changed words)   (2)
In addition, we use the statistical performance measures sensitivity, specificity and accuracy. Sensitivity measures the proportion of actual positives that are correctly identified, while specificity measures the proportion of negatives that are correctly identified. Accuracy measures the proportion of true results (both true positives and true negatives) in the population (Baldi et al., 2000). Equations (3), (4) and (5) show these measures:

Sensitivity = Total number of true positives / Total number of all positives    (3)

Specificity = Total number of true negatives / Total number of all negatives    (4)
Accuracy = Total number of true negatives and true positives / Total number of all negatives and positives    (5)
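Equations (3)–(5) can be sketched the same way. Reading Table 4, the denominators appear to be totals over all documents (36 positives and 417 negatives for Ayaspell with Google); under that assumption the sketch below reproduces the ART row.

```python
def sensitivity(tp: int, positives: int) -> float:
    """Equation (3): true positives / all positives."""
    return tp / positives if positives else 0.0

def specificity(tn: int, negatives: int) -> float:
    """Equation (4): true negatives / all negatives."""
    return tn / negatives if negatives else 0.0

def accuracy(tp: int, tn: int, positives: int, negatives: int) -> float:
    """Equation (5): (TP + TN) / (all positives + all negatives)."""
    total = positives + negatives
    return (tp + tn) / total if total else 0.0

# ART row of Table 4: TP = 3, TN = 10, with 36 positives and 417 negatives
print(round(sensitivity(3, 36), 5))    # 0.08333
print(round(specificity(10, 417), 5))  # 0.02398
```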
Figure 12 Evaluation accuracy (see online version for colours)

4 Experiments and evaluation
In this section, we show the experimental results and the evaluation metric values (precision and recall) obtained when evaluating our proposed OCR post-processing techniques for Arabic. We then discuss and analyse these results and compare the different spell checking techniques used.
4.1 Preparing the data for evaluation

In this section, we describe the datasets used to test and evaluate our proposed OCR post-processing system for Arabic. We have prepared three forms of data:

1 Book pages recognised text samples (books): this data was prepared by the authors using ABBYY Arabic OCR to convert book pages into text. We prepared five documents, each containing one book page from five different books.

2 Arabic recognised text (ART) samples collected from KFUPM (Mahdi, 2012): this data was generated by an OCR correction prototype developed at KFUPM. We have taken a sample of one document from the KFUPM dataset containing around 100 words.

3 Computer-generated data with errors: this data was generated by KFUPM by taking normal correct text and randomly introducing three types of errors (insert, delete and replace). The data are divided into two classes, CG1 and CG2, each consisting of 88 words. The first has single-character errors per word, while the second has two-character errors per word. Moreover, each class is divided into two categories, CG-5 and CG-10, containing different error rates.

4.2 Evaluation results

In this section, we present the results of our experiments, with precision and recall values for each of the dataset categories described above. The results are divided into three categories: first, the results using Microsoft Office Word with the Google online suggestion system; second, the results using Ayaspell with the Google online suggestion system; and finally, the results using the Google online suggestion system alone.

4.2.1 Recall and precision

Tables 1, 2 and 3 show the recall and precision for OCR post-processing using the three techniques. As shown in the tables, the highest recall and precision occurred when using Google alone. This is because it corrected the highest number of words out of the total number of misspelled words, and it also correctly detected the highest number of misspelled words out of the total number of detected words.
Table 1 Recall and precision using Ayaspell with Google (see online version for colours)

Category   Doc      Before   Changed   Corrected   Recall    Precision
ART        ART      31       13        3           0.09677   0.23077
Books      B1       107      98        13          0.12150   0.13265
Books      B2       158      128       3           0.01899   0.02344
Books      B3       8        64        0           0.00000   0.00000
Books      B4       38       104       4           0.10526   0.03846
Books      B5       22       12        4           0.18182   0.33333
CG         CG1-5    14       10        3           0.21429   0.30000
CG         CG1-10   6        6         1           0.16667   0.16667
CG         CG2-5    6        9         2           0.33333   0.22222
CG         CG2-10   13       9         3           0.23077   0.33333
Total average                                      0.14694   0.17809
Table 2 Recall and precision using Google alone (see online version for colours)

Category   Doc      Before   Changed   Corrected   Recall    Precision
ART        ART      31       19        6           0.19355   0.31579
Books      B1       55       28        16          0.29091   0.57143
Books      B2       150      165       3           0.02000   0.01818
Books      B3       1        64        0           0.00000   0.00000
Books      B4       35       104       4           0.11429   0.03846
Books      B5       25       15        8           0.32000   0.53333
CG         CG1-5    15       17        9           0.60000   0.52941
CG         CG1-10   8        9         3           0.37500   0.33333
CG         CG2-5    5        10        2           0.40000   0.20000
CG         CG2-10   13       15        6           0.46154   0.40000
Total average                                      0.27753   0.29399

Table 3 Recall and precision using Microsoft Word with Google (see online version for colours)
Category   Doc      Before   Changed   Corrected   Recall    Precision
ART        ART      29       7         0           0.00000   0.00000
Books      B1       50       25        10          0.20000   0.40000
Books      B2       151      165       2           0.01325   0.01212
Books      B3       4        64        0           0.00000   0.00000
Books      B4       37       19        12          0.32432   0.63158
Books      B5       16       13        4           0.25000   0.30769
CG         CG1-5    11       11        3           0.27273   0.27273
CG         CG1-10   6        6         1           0.16667   0.16667
CG         CG2-5    6        9         1           0.16667   0.11111
CG         CG2-10   13       14        7           0.53846   0.50000
Total average                                      0.19321   0.24019
However, in the case of using the Microsoft Word spell checker along with Google, the results become worse than using Google alone. This is because the spell checker may change a correct word into a wrong one. This problem grows larger when using the Ayaspell spell checker, as the number of wrongly changed words becomes higher. It is also apparent from the tables that the computer-generated documents give the best results. This is because errors in real OCR output can be of types other than spelling errors, for example a deleted word, or a word replaced with a single character. Such errors cannot all be detected by the proposed techniques; context-based techniques, which can solve such problems, would need to be used.
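The cascade evaluated here, a local spell checker followed by the online suggestion system, can be sketched as below. The two checker callables are illustrative stubs standing in for Word/Ayaspell and Google; neither real API is used, and the real tools of course do far more than a lookup table.

```python
def cascade_correct(words, local_check, online_suggest):
    """Run each word through a local spell checker, then an online
    suggester; each checker returns a replacement or None for 'no change'."""
    out = []
    for w in words:
        w = local_check(w) or w     # local pass (Word or Ayaspell in the paper)
        w = online_suggest(w) or w  # online pass (Google suggestion system)
        out.append(w)
    return out

# Toy stubs: tiny replacement tables standing in for the real checkers.
local = {"teh": "the"}.get
online = {"recieve": "receive"}.get
print(cascade_correct(["teh", "recieve", "cat"], local, online))
# ['the', 'receive', 'cat']
```

Note that the local pass runs first, so a local checker that "corrects" an already-valid word can push a bad candidate to the online pass, which is exactly the failure mode described above.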
4.2.2 Sensitivity, specificity, and accuracy

Tables 4, 5 and 6 show the sensitivity, specificity, and accuracy for OCR post-processing using the three techniques. The results show that the highest number of true positives (i.e., correctly detected erroneous words) occurred when using Google alone. On the other hand, the highest number of true negatives (i.e., words correctly identified as not misspelled) occurred when using Ayaspell with Google.

Table 4 Sensitivity, specificity, and accuracy using Ayaspell with Google (see online version for colours)
Doc       TP    TN     Sensitivity   Specificity   Accuracy
ART       3     10     0.08333       0.02398       0.02870
B1        13    85     0.36111       0.20384       0.21634
B2        3     125    0.08333       0.29976       0.28256
B3        0     64     0.00000       0.15348       0.14128
B4        4     100    0.11111       0.23981       0.22958
B5        4     8      0.11111       0.01918       0.02649
CG1-5     3     7      0.08333       0.01679       0.02208
CG1-10    1     5      0.02778       0.01199       0.01325
CG2-5     2     7      0.05556       0.01679       0.01987
CG2-10    3     6      0.08333       0.01439       0.01987
Average   36    417    0.099999      0.100001      0.100002
The sensitivity rate is lowest in the case of using Ayaspell with Google and highest in the case of using Google alone, because Google alone was the best at correctly detecting erroneous words. The specificity was highest in the case of Ayaspell with Google, as it has the largest number of words correctly identified as not misspelled. The lowest specificity rates occurred in the case of using Microsoft Word with Google.

Table 5
Sensitivity, specificity, and accuracy using Google alone (see online version for colours)

Doc       TP    TN     Sensitivity   Specificity   Accuracy
ART       6     13     0.105263      0.033419      0.042601
B1        16    12     0.280702      0.030848      0.062780
B2        3     162    0.052632      0.416452      0.369955
B3        0     64     0.000000      0.164524      0.143498
B4        4     100    0.070175      0.257069      0.233184
B5        8     7      0.140351      0.017995      0.033632
CG1-5     9     8      0.157895      0.020566      0.038117
CG1-10    3     6      0.052632      0.015424      0.020179
CG2-5     2     8      0.035088      0.020566      0.022422
CG2-10    6     9      0.105263      0.023136      0.033632
Average   57    389    0.100000      0.100000      0.100000
Table 6 Sensitivity, specificity, and accuracy using Microsoft Word with Google (see online version for colours)

Doc       TP    TN     Sensitivity   Specificity   Accuracy
ART       0     7      0.00000       0.02389       0.02102
B1        10    15     0.25000       0.05119       0.07508
B2        2     163    0.05000       0.55631       0.49550
B3        0     64     0.00000       0.21843       0.19219
B4        12    7      0.30000       0.02389       0.05706
B5        4     9      0.10000       0.03072       0.03904
CG1-5     3     8      0.07500       0.02730       0.03303
CG1-10    1     5      0.02500       0.01706       0.01802
CG2-5     1     8      0.02500       0.02730       0.02703
CG2-10    7     7      0.17500       0.02389       0.04204
Average   40    293    0.100000      0.099998      0.100001
However, as shown in Figure 12, using Microsoft Office Word with Google gives the best results, with an accuracy of 0.49, while using Ayaspell with Google gives an accuracy of 0.28 and using the Google online suggestion system alone gives an accuracy of 0.37. The best results are thus achieved by the Microsoft Office Word with Google approach.
5 Conclusions and future work
Arabic OCR systems usually produce many errors that reduce the quality of their results, so it is necessary to apply error detection and correction techniques to minimise the error rate and maximise the quality of the output. In this paper, we first discussed issues related to error detection and correction in Arabic OCR systems and summarised previous work on this topic. We then proposed an OCR post-processing system that improves Arabic OCR results using Microsoft Office Word and Ayaspell together with the Google online suggestion system. Finally, we evaluated the proposed system on sample data of three different types: ART data generated at KFUPM, recognised text samples from pages of different books (books), and computer-generated documents with random errors. The results show that using Microsoft Office Word with Google gives an accuracy of 0.49, while using Ayaspell with Google gives an accuracy of 0.28 and using the Google online suggestion system alone gives an accuracy of 0.37. The best results are thus achieved with the Microsoft Office Word with Google approach. This is because the Word spell checker corrects each word separately, while Google corrects a word according to its context, and this combination outperforms the other techniques.

In future work, we will refine the proposed OCR post-processing system by adding more Arabic spell checkers and combining them into one hybrid spell checker. We will also look for additional evaluation measures beyond recall and precision, and use larger and more varied datasets than those used here. Furthermore, we will try suggestion systems from other search engines, such as Bing and Yahoo, and compare them with the Google suggestion system used in this paper.
References

Ayaspell Project [online] http://ayaspell.sourceforge.net/ (accessed April 2014).

Ayaspell-dic Project [online] http://ayaspell.blogspot.com/ (accessed April 2014).

Baldi, P., Brunak, S., Chauvin, Y., Andersen, C.A.F. and Nielsen, H. (2000) 'Assessing the accuracy of prediction algorithms for classification: an overview', Bioinformatics, Vol. 16, No. 5, pp.412–424, DOI: 10.1093/bioinformatics/16.5.412.

Bassil, Y. and Alwani, M. (2012a) 'Context-sensitive spelling correction using Google Web 1T 5-gram information', Computer and Information Science, Vol. 5, No. 3, pp.37–48.

Bassil, Y. and Alwani, M. (2012b) 'OCR post-processing error correction algorithm using Google online spelling suggestion', Journal of Emerging Trends in Computing and Information Sciences, January, Vol. 3, No. 1, pp.1–9, ISSN: 2079-8407.

Beitzel, S.M., Jensen, E.C. and Grossman, D.A. (2003) 'A survey of retrieval strategies for OCR text collections', 2003 Symposium on Document Image Understanding Technologies, Greenbelt, Maryland, April.
Bernstein, M.S., Little, G., Miller, R.C., Hartmann, B., Ackerman, M.S., Karger, D.R., Crowell, D. and Panovich, K. (2010) 'Soylent: a word processor with a crowd inside', Proceedings of the 23rd Annual ACM Symposium on User Interface Software and Technology (UIST '10), ACM, New York, NY, USA, pp.313–322.

Borovikov, E. and Zavorin, I. (2012) 'A multi-stage approach to Arabic document analysis', in Guide to OCR for Arabic Scripts, pp.55–78, Print ISBN: 978-1-4471-4071-9, Online ISBN: 978-1-4471-4072.

Ezzat, M., El Ghazaly, T. and Gheith, M. (2014) 'A word & character N-gram based Arabic OCR error simulation model', International Journal of Computers & Technology, Vol. 12, No. 8, pp.3758–3767, ISSN: 2277-3061.

Ezzat, M., ElGhazaly, T. and Gheith, M. (2013) 'An enhanced Arabic OCR degraded text retrieval model', International Journal of Computational Linguistics and Natural Language Processing, August, Vol. 2, No. 8, pp.380–393, ISSN: 2279-0756.

Haddad, B. and Yaseen, M. (2007) 'Detection and correction of non-words in Arabic: a hybrid approach', International Journal of Computer Processing of Oriental Languages (IJCPOL), December, Vol. 20, No. 4, pp.237–257, World Scientific Publishing and Imperial College Press.

Hajeer, I. (2008) For an Arabic Open Source Spell Checker, MSc thesis, Center of Scientific and Technical Researches to Upgrade Arabic Language, Algeria.

Hirst, G. (2008) An Evaluation of the Contextual Spelling Checker of Microsoft Office Word 2007, The Natural Sciences and Engineering Research Council of Canada, University of Toronto, Toronto, Canada, January.

Hunspell, N. (2014) Free Spell-Checker, Hyphenation and Thesaurus for .NET (C#, Visual Basic, ...) [online] http://nhunspell.sourceforge.net/ (accessed April 2014).

Kanungo, T., Marton, G.A. and Bulbul, O. (1999) 'OmniPage vs. Sakhr: paired model evaluation of two Arabic OCR products', Proceedings of SPIE Conference on Document Recognition and Retrieval (VI) (also appears as LAMP Technical Report LAMP-TR-030, December 1998), San Jose, CA, 27–28 January, Vol. 3651.

Lehal, G.S. (2007) 'Design and implementation of Punjabi spell checker', International Journal of Systemics, Cybernetics and Informatics, Vol. 3, No. 8, pp.70–75.

Liang, H. (2008) Spell Checkers and Correctors: A Unified Treatment, Master's thesis, University of Pretoria, South Africa.

Magdy, W. and Darwish, K. (2006) 'Arabic OCR error correction using character segment correction, language modeling, and shallow morphology', Proceedings of the 2006 Conference on Empirical Methods in Natural Language Processing (EMNLP 2006), pp.408–414.

Magdy, W. and Darwish, K. (2008) 'Effect of OCR error correction on Arabic retrieval', Information Retrieval, Vol. 11, No. 5, pp.405–425, Springer Science+Business Media, LLC.

Mahdi, A. (2012) Arabic Spell Checking and Correction for Arabic Text Recognition, MSc thesis, King Fahd University of Petroleum and Minerals (KFUPM), Saudi Arabia.

Märgner, V. and El Abed, H. (2011) 'Arabic word and text recognition – current developments', International Conference on Document Analysis and Recognition (ICDAR).

Mohapatra, Y., Mishra, A.K. and Mishra, A.K. (2013) 'Spell checker for OCR', International Journal of Computer Science and Information Technologies (IJCSIT), Vol. 4, No. 1, pp.91–97.

Naseem, T. (2004) A Hybrid Approach for Urdu Spell Checking, MSc thesis, National University of Computer & Emerging Sciences, Pakistan.

Naseem, T. and Hussain, S. (2007) Spelling Error Trends in Urdu, Center for Research in Urdu Language Processing, FAST-NU, Lahore.

Rashwan, M.A., Fakhr, M.W., Attia, M. and El-Mahallawy, M.S. (2007) 'Arabic OCR system analogous to HMM-based ASR systems; implementation and evaluation', Journal of Engineering and Applied Science (JEAS)-Cairo, Vol. 54, No. 6, p.653, Faculty of Engineering, Cairo University.

Shaalan, K., Samih, Y., Attia, M., Pecina, P. and van Genabith, J. (2012) 'Arabic word generation and modelling for spell checking', The Eighth International Conference on Language Resources and Evaluation (LREC'12).

Singh, S. (2013) 'Optical character recognition techniques: a survey', International Journal of Advanced Research in Computer Engineering & Technology (IJARCET), June, Vol. 2, No. 6, pp.545–550, ISSN: 2278-1323.

Slimane, F., Kanoun, S., El Abed, H., Alimi, A.M., Ingold, R. and Hennebert, J. (2011) 'Arabic recognition competition: multi-font multi-size digitally represented text', International Conference on Document Analysis and Recognition (ICDAR), pp.1449–1453.

Verberne, S. (2002) Context-Sensitive Spell Checking Based on Word Trigram Probabilities, MSc thesis, University of Nijmegen.

Whitelaw, C., Hutchinson, B., Chung, G.Y. and Ellis, G. (2009) 'Using the web for language independent spellchecking and autocorrection', Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing, Singapore, 6–7 August, pp.890–899.
Appendix

The following shows an example of the steps taken when correcting misspelled words in Arabic text. For example, suppose we have an OCR document containing the following text:
The original (correct) text is:
Now, in order to compute the number of errors in the OCR result, we perform a matching between the two texts and calculate the number of word differences between them; the result looks like:
As shown above, the number of errors is eight (words in red colour). We now apply spell checking using Microsoft Office Word; the output is:
As shown, the Word spell checker has changed four words. Matching the Microsoft Word output against the original text again, we find that the number of errors after the spell checking process becomes seven.
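The word-matching step used above to count errors can be sketched as a simple positional comparison. This is an illustrative simplification: a real alignment would use an edit-distance alignment so that an inserted or deleted word does not shift every following word out of position.

```python
def count_word_errors(ocr_text: str, original_text: str) -> int:
    """Count word-level differences between OCR output and the original:
    mismatches at the same position, plus any surplus or missing words."""
    ocr, ref = ocr_text.split(), original_text.split()
    mismatches = sum(o != r for o, r in zip(ocr, ref))
    return mismatches + abs(len(ocr) - len(ref))
```

For instance, an OCR line with one substituted word and one dropped trailing word would yield two errors under this scheme.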