Document Recognition and Translation System for Unconstrained Arabic Documents

Huaigu Cao, Jinying Chen, Jacob Devlin, Rohit Prasad, Prem Natarajan
Raytheon BBN Technologies, Cambridge, MA 02138, USA
{hcao, jchen, jdevlin, rprasad, pnataraj}@bbn.com
Abstract

We describe an end-to-end system for translating real-world Arabic field documents that contain a mix of handwritten and printed content into English. These documents are extremely challenging to recognize due to the presence of noise, poor image capture quality, and variations in writing style, writing device, font, layout, genre, etc. Furthermore, no off-the-shelf machine translation (MT) engine is available to translate these documents into English. We present key innovations for dealing with these challenges in document preprocessing, text line segmentation, and text recognition. In addition, we describe our approach for adapting MT using a limited amount of in-domain training data, which results in significant improvements in translation accuracy.
1. Introduction

Document image translation is an extremely challenging research area with several potential applications. In spite of significant research advances in optical character recognition (OCR) for printed and handwritten content, its application has been limited to either office-quality printed documents or constrained tasks such as check processing [1] or mail sorting [2]. In contrast, most real-world document processing applications require processing unconstrained documents that exhibit significant variation in content type (printed vs. handwritten), layout, font, writing style, image quality, etc. These variations make it impossible for commercial off-the-shelf (COTS) OCR engines to accurately recognize the content. From a machine translation perspective, such real-world document corpora contain a mix of genres that cannot be accurately translated by COTS machine translation (MT) systems.

In this paper, we describe an end-to-end system for translating real-world document image archives that contain both machine-printed and handwritten content in Arabic. Our system has three main processing stages: (1) document analysis, (2) text recognition, and (3) statistical machine translation. Document image analysis includes two steps: image cleaning and text line finding. Image cleaning removes artifacts such as large clutter noise, small random noise, and pre-printed rule-lines. Line finding segments the image into text line images for subsequent recognition and translation. Here, we leverage our prior work on ensemble-based combination of outputs from multiple line finding algorithms [3]. We used the BBN Byblos OCR system [4] to recognize Arabic text in the lines extracted by the document analysis stage. Byblos is a segmentation-free, trainable OCR engine that uses context-dependent hidden Markov models (HMMs) to model character glyphs. For this work, we trained glyph HMMs on the in-domain data available to us. For translating the Arabic OCR output into English, we used BBN's state-of-the-art machine translation (MT) system [10]. While we had a large amount of data from news and web blogs (50M words of parallel text for translation model training and 9 billion English words for the target language model), the in-domain data was limited to 400,000 words of parallel text. This amount of in-domain data is inadequate for training our statistical MT engine for the domain. Instead, we used the limited in-domain training data to effectively bias the out-of-domain translation model and language model toward the domain.

The rest of the paper is organized as follows. First, we describe the corpus used for our experimentation. Next, we describe each of the system components and salient experimental results, followed by conclusions and future work.
1 The views expressed are those of the author and do not reflect the official policy or position of the Department of Defense or the U.S. Government.
2 Distribution Statement "A" (Approved for Public Release, Distribution Unlimited).
2. Corpus for Experimentation

In this paper, we use a corpus of real-world Arabic documents. This corpus includes several thousand pages containing both unconstrained handwriting and typewritten text. A small fraction of these documents were transcribed and annotated by a team led by the Linguistic Data Consortium (LDC). As shown in Table 1, the transcribed data set is categorized into three types: 1) Handwritten (HW) if more than 90% of the words in the document are handwritten, 2) Typewritten (TW) if more than 90% of the words are typewritten, and 3) Mixed (MX) otherwise; a short illustration of this rule is given below. Two sample pages from the data set are shown in Figure 1.

Table 1. Arabic field data set statistics.
Type        | #transcribed pages | #translated pages
Handwritten | 17330              | 1929
Typewritten | 2607               | 68
Mixed       | 5799               | 590
All         | 25736              | 2587

Figure 1. Sample pages from the Arabic field document data set.
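The page-level genre assignment follows directly from this 90% rule when per-word script labels are available. A minimal Python sketch; the function name and the per-word "HW"/"TW" labels are illustrative assumptions, not part of the LDC annotation format:

```python
def classify_page(word_labels, threshold=0.9):
    """Assign a page-level type from per-word script labels.

    word_labels: iterable of "HW" or "TW", one label per transcribed word.
    Returns "HW", "TW", or "MX" following the >90% rule described above.
    """
    labels = list(word_labels)
    if not labels:
        return "MX"  # no transcribed words: fall back to mixed/unknown
    hw_frac = sum(1 for w in labels if w == "HW") / len(labels)
    if hw_frac > threshold:
        return "HW"
    if hw_frac < 1.0 - threshold:  # i.e., more than 90% typewritten
        return "TW"
    return "MX"

# A page with 95 handwritten and 5 typewritten words is labeled "HW".
print(classify_page(["HW"] * 95 + ["TW"] * 5))
```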
3. Document Analysis

Figure 2 shows a flowchart of the document analysis processing stage. First, a few pre-processing steps are applied to remove noise and artifacts from the field documents. These steps include morphological operations to remove large clutter around the edges of the documents, shape-based rule-line removal [6], and salt-and-pepper noise removal. We then run four different line finding algorithms and combine their outputs using a novel graph clustering-based ensemble combination algorithm [3]. The F-scores (harmonic means of line finding precision and recall) of the individual line finding systems range from 0.56 to 0.60, whereas the ensemble combination achieves an F-score of 0.71. This shows that combining multiple line finding algorithms improves line finding performance on complex layouts.

Figure 2. Flowchart for document analysis: field document images undergo noise and artifact filtering to produce cleaned document images; line extraction yields line images, which are up-sampled and de-skewed to produce enhanced line images.
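As a rough illustration of this kind of clean-up, connected-component size filtering can drop both tiny specks and very large clutter components, followed by a median filter for residual salt-and-pepper noise. The OpenCV sketch below is only a generic stand-in, not the shape-based rule-line removal of [6] or the authors' actual filters; it assumes a binarized page with text as white (255) on black (0), and the area thresholds are placeholders that would need tuning.

```python
import cv2
import numpy as np

def clean_page(binary, min_area=20, max_area=50000):
    """Keep only connected components of plausible text size, then apply a
    small median filter to remove remaining isolated specks."""
    n, labels, stats, _ = cv2.connectedComponentsWithStats(binary, connectivity=8)
    cleaned = np.zeros_like(binary)
    for i in range(1, n):  # label 0 is the background
        area = stats[i, cv2.CC_STAT_AREA]
        if min_area <= area <= max_area:
            cleaned[labels == i] = 255
    return cv2.medianBlur(cleaned, 3)
```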
Note that in the OCR and MT experiments, line finding is performed on sentence units (SUs). Each document is manually divided into several sentence units according to the spacing between text lines and their semantic content. This was done to simplify the evaluation of OCR and MT, because the reading order of an entire document is often ambiguous and hard to evaluate. Under this condition, a single line finding algorithm is able to give very good results, so in the OCR and MT experiments only the line finding algorithm of [9] is used. We also extract manually labeled line images as a contrastive condition to test OCR and MT performance. In this condition, line images are further labeled so that each contains only handwritten or only typewritten text, but not both; this labeling is not available to automatic line finding. Extracted line images are up-sampled to 600 DPI using bi-cubic interpolation and then de-skewed by searching for the rotation angle that gives the narrowest horizontal projection profile.
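The de-skewing step can be read as a brute-force search over small rotation angles for the one whose horizontal projection profile is most compact. A minimal sketch, assuming a binarized line image with foreground pixels greater than zero; the angle range, step size, and narrowness score (number of ink-bearing rows) are our assumptions rather than the authors' exact settings:

```python
import numpy as np
from scipy.ndimage import rotate

def deskew_line(binary, max_angle=5.0, step=0.25):
    """Return the de-skewed line image and the chosen angle (degrees)."""
    def profile_width(img):
        row_sums = (img > 0).sum(axis=1)   # horizontal projection profile
        return int((row_sums > 0).sum())   # rows containing ink: fewer = narrower
    angles = np.arange(-max_angle, max_angle + step, step)
    best = min(angles, key=lambda a: profile_width(
        rotate(binary, a, reshape=False, order=0)))
    return rotate(binary, best, reshape=False, order=0), float(best)
```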
4. Text Recognition

4.1. OCR training
In our OCR system, each character is modeled as a 14-state HMM. Features including image intensity percentiles, local angle and correlation representing stroke orientation, frame energy, gradient, concavity, and Gabor filter responses are extracted using sliding-window analysis. This high-dimensional feature vector is projected to a 17-dimensional feature space using Linear Discriminant Analysis (LDA). Context-dependent, state-tied mixture (STM) models are trained for each character glyph. A trigram word language model (LM) is also estimated from the training data.
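To make the training pipeline concrete, the sketch below extracts a toy set of frame-wise features over a sliding window and fits an LDA projection with scikit-learn. The feature set shown (intensity percentiles, mean gradients, standard deviation) is a deliberately simplified stand-in for the percentile/angle/correlation/energy/concavity/Gabor features described above, and the frame labels needed to fit the LDA are assumed to come from a forced alignment of glyph HMM states:

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

def frame_features(line_img, win=8, shift=3):
    """Slide a window across a grayscale line image and emit one feature
    vector per frame (simplified stand-in for the system's feature set)."""
    feats = []
    h, w = line_img.shape
    for x in range(0, w - win + 1, shift):
        frame = line_img[:, x:x + win].astype(float)
        gy, gx = np.gradient(frame)
        feats.append(np.concatenate([
            np.percentile(frame, [10, 50, 90]),               # intensity percentiles
            [np.abs(gx).mean(), np.abs(gy).mean(), frame.std()],
        ]))
    return np.array(feats)

def fit_lda_projection(train_frames, frame_state_labels, dim=5):
    """Fit an LDA projection to a low-dimensional space (the real system
    projects a much richer feature vector down to 17 dimensions)."""
    return LinearDiscriminantAnalysis(n_components=dim).fit(
        train_frames, frame_state_labels)
```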
4.2. OCR decoding

Recognition is performed in two stages. 1) Unadapted Decoding (UDEC): We use a two-pass search strategy to generate an n-best list (n=300) for each line image [7]. We then re-rank the n-best list using a weighted combination of the HMM score, trigram language model score, and other features, with weights estimated on a held-out development/tuning set. 2) Adapted Decoding (ADEC): For each page, the re-ranked best OCR hypotheses are taken as the reference to adapt the HMMs using the Maximum Likelihood Linear Regression (MLLR) algorithm [8]. The page is then decoded again to produce an n-best list, which is re-ranked using the same rescoring procedure as in UDEC.
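The re-ranking step in both decoding passes is a weighted combination of log-domain component scores. A minimal sketch; the component names ("hmm", "lm", and "wc" for a word-count penalty) and the example weights are illustrative, not the system's actual feature set or tuned values:

```python
def rerank_nbest(nbest, weights):
    """Sort an n-best list by a weighted sum of its component scores.
    Weights would be estimated on a held-out tuning set (e.g., to minimize WER)."""
    def total(hyp):
        return sum(weights[name] * hyp[name] for name in weights)
    return sorted(nbest, key=total, reverse=True)

hyps = [
    {"text": "hypothesis A", "hmm": -120.3, "lm": -35.1, "wc": 7},
    {"text": "hypothesis B", "hmm": -118.9, "lm": -39.8, "wc": 8},
]
best = rerank_nbest(hyps, {"hmm": 1.0, "lm": 0.9, "wc": -0.5})[0]
print(best["text"])
```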
4.3. OCR results

The numbers of pages in our OCR training, tuning, and test sets are shown in Table 2. Three groups of HMMs (HMM_HW, HMM_TW, and HMM_ALL) are trained using manually labeled line images from each type of data. We used the transcriptions of all training data and a dictionary of over 600K words to train the language model. For each genre, line images are extracted under two contrastive conditions: Line (manually labeled line images) and SU (automatic line finding run on sentence units). The HMM used to decode documents of each genre and line finding condition is shown in Table 3.

Table 2. Numbers of pages in the training, tuning, and test sets.
Type        | Training | Tuning | Test
Handwritten | 16330    | 500    | 500
Typewritten | 2007     | 300    | 300
Mixed       | 5099     | 800    | 500
Table 3. Selection of HMMs for each genre and line finding condition.
Condition | Handwritten | Typewritten | Mixed
Line      | HMM_HW      | HMM_TW      | HMM_HW for handwritten lines, HMM_TW for typewritten lines
SU        | HMM_HW      | HMM_TW      | HMM_ALL
The OCR word error rates (WERs) are shown in Table 4. As expected, the Line condition performs better than the SU condition on handwritten and typewritten documents. However, for mixed-type documents the SU condition performs better. We believe this is because the line segmentation condition artificially fragments the context available for computing language model scores. Also, for mixed-type documents, which contain SUs with both typewritten and handwritten content, we did not find any benefit in using separate glyph HMMs for each type, i.e., HMM_ALL worked best. However, when we decoded the predominantly handwritten and predominantly typewritten documents using the HMM_ALL model, we obtained worse results.

Table 4. OCR word error rate (WER, %) by genre and line finding condition.
Genre | Line | SU
HW    | 26.4 | 29.7
TW    |  9.4 | 11.2
MX    | 20.2 | 19.6
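For reference, WER is the length-normalized word-level edit distance between the OCR hypothesis and the reference transcription. A minimal self-contained implementation (standard dynamic programming, not BBN's scoring tool):

```python
def wer(ref_words, hyp_words):
    """(substitutions + insertions + deletions) / number of reference words."""
    n, m = len(ref_words), len(hyp_words)
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = i
    for j in range(m + 1):
        d[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if ref_words[i - 1] == hyp_words[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution or match
    return d[n][m] / max(n, 1)

print(wer("w1 w2 w3".split(), "w1 wX w3 w4".split()))  # 2 errors / 3 words = 0.667
```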
We also contrasted the results above with an OCR system trained on a corpus of 37K Modern Standard Arabic (MSA) documents written by 259 scribes [5]. These documents were created by scribes manually copying text from news and web sources. On the handwritten portion of the field-data test set, the system trained on this out-of-domain corpus yielded 80.4% WER, much worse than with in-domain training.

5. Statistical Machine Translation (SMT)

The final component of our pipeline is the Arabic-to-English SMT system, which takes the OCR output as input (Figure 3). Here we face two key challenges: noisy input due to OCR errors, and a lack of in-domain data for training. Previous experiments show that using noisy input and its corresponding translations as training data degrades MT system performance. We therefore use only the error-free transcribed text to develop our MT system. To adapt our MT system to Arabic field documents, we use a novel variant of the framework described in [5].

Figure 3. Flowchart of MT training and decoding: 0.4M words of in-domain and 50M words of out-of-domain parallel text feed MT training of the translation model; the English side of 9 billion words of out-of-domain text feeds language modeling for the English language model; the MT tune set drives tune-set optimization; MT decoding then produces the English translation of the OCR output.
5.1. Domain adaptation of SMT

Our baseline MT system is a state-of-the-art hierarchical engine trained and tuned with out-of-domain data [10]. The features used by the system include linguistic and contextual features such as word translation probabilities, rule translation probabilities, language model scores, and target-side dependency scores, as well as discriminatively tuned features similar to [11]. Feature scores are combined in a log-linear model whose weights are set by optimizing the BLEU score on the tune set. To adapt our baseline MT system to field documents, we train a separate language model on the target side of our in-domain parallel training data and discriminatively estimate its interpolation weight with the standard (out-of-domain) language model. To adapt the translation model, we discriminatively estimate separate feature weights and penalties for rules extracted from the in-domain and out-of-domain parallel training data. We tuned the system using an in-domain tune set composed of 615 documents from the HW and MX genres (Tune-A).
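The language model adaptation amounts to interpolating an in-domain LM with the large out-of-domain LM. The sketch below shows plain linear interpolation with the weight picked by held-out likelihood; the paper estimates such weights discriminatively, so this is only the basic mechanism, and the callable LM interfaces are assumptions:

```python
import math

def interp_logprob(word, history, p_out, p_in, lam):
    """log of (1 - lam) * P_out(word | history) + lam * P_in(word | history)."""
    p = (1.0 - lam) * p_out(word, history) + lam * p_in(word, history)
    return math.log(max(p, 1e-12))

def tune_lambda(heldout, p_out, p_in, grid=(0.1, 0.3, 0.5, 0.7, 0.9)):
    """Pick the interpolation weight maximizing held-out log-likelihood.
    heldout: iterable of (history, word) pairs from in-domain text."""
    def loglik(lam):
        return sum(interp_logprob(w, h, p_out, p_in, lam) for h, w in heldout)
    return max(grid, key=loglik)
```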
Through domain adaptation, the MT system performance improved on the HW/TW/MX transcribed test sets by 8 to 17 points in BLEU and TER, as shown in Table 5.
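BLEU and TER here are corpus-level scores of the MT output against the reference translations. The snippet below shows how such scores can be computed with the sacrebleu package; the paper does not specify its scoring tool, so this is purely illustrative, as are the toy sentences:

```python
# pip install sacrebleu
from sacrebleu.metrics import BLEU, TER

hypotheses = ["the committee approved the new budget",
              "forces entered the village at dawn"]
references = [["the committee approved the new budget",
               "troops entered the village at dawn"]]  # one reference stream

print(f"BLEU = {BLEU().corpus_score(hypotheses, references).score:.1f}")
print(f"TER  = {TER().corpus_score(hypotheses, references).score:.1f}")
```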
Table 5. MT domain adaptation: BLEU and TER (%) on the transcribed test sets.
        | Baseline | Adapted
HW BLEU | 19.8     | 28.1
HW TER  | 62.3     | 52.9
TW BLEU | 19.9     | 28.4
TW TER  | 63.2     | 52.8
MX BLEU | 20.1     | 36.5
MX TER  | 61.4     | 43.7

5.2. MT optimization for the typewritten portion

The TW genre has only 68 translated documents (1093 segments in total). To ensure the reliability of the test results, we reserved all of them as the test set. We therefore need another set to optimize the MT system for the TW genre. We developed a novel method that automatically generates a matching tune set for the TW genre [12]. The method uses a novel n-gram based similarity metric to extract the N nearest neighbors (N=2 in our case) of each TW test segment from the mixed genres in Tune-A; a sketch of the selection procedure is given below. This method is scalable to new test sets since it uses only the source language to find neighbors. Compared with the MT system tuned on the Tune-A set, the system using the optimized tune set improves the BLEU and TER scores by 1.2% and 0.9% on the transcribed TW test set. The improvement carries over to the OCR output with recognition errors (Table 6). The final MT system performance on the OCR output is shown in Table 7.
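The tune-set selection can be pictured as source-side nearest-neighbor retrieval under an n-gram similarity score. The sketch below uses a plain overlap ratio as the similarity; the actual metric of [12] differs, and the N=2 default mirrors the setting above:

```python
from collections import Counter

def ngram_counts(tokens, max_n=4):
    """Multiset of all 1..max_n-grams of a token list."""
    grams = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            grams[tuple(tokens[i:i + n])] += 1
    return grams

def overlap(seg_a, seg_b, max_n=4):
    """Fraction of seg_a's n-grams also found in seg_b (a stand-in metric)."""
    a, b = ngram_counts(seg_a.split(), max_n), ngram_counts(seg_b.split(), max_n)
    return sum((a & b).values()) / max(sum(a.values()), 1)

def select_tune_set(test_segments, candidate_segments, n_neighbors=2):
    """Union of the N most similar candidates for each test segment (source side only)."""
    selected = set()
    for seg in test_segments:
        ranked = sorted(candidate_segments, key=lambda c: overlap(seg, c), reverse=True)
        selected.update(ranked[:n_neighbors])
    return selected
```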
Table 6. Tune set optimization for TW data: BLEU and TER (%).
            | Tune-A BLEU | Tune-A TER | Optimized BLEU | Optimized TER
Transcribed | 28.4        | 52.8       | 29.7           | 51.8
OCR (Line)  | 26.2        | 56.1       | 27.1           | 55.2
OCR (SU)    | 26.0        | 56.7       | 26.4           | 56.0
Table 7. MT performance on transcribed text and OCR output: BLEU and TER (%) by genre.
            | HW BLEU | HW TER | TW BLEU | TW TER | MX BLEU | MX TER
Transcribed | 28.1    | 52.9   | 28.4    | 52.8   | 36.5    | 43.7
OCR (Line)  | 20.5    | 63.4   | 27.1    | 55.2   | 31.6    | 52.4
OCR (SU)    | 20.2    | 64.5   | 26.4    | 56.0   | 31.2    | 52.4
6. Conclusion We have presented a system for recognizing and translating Arabic field documents. This system has been applied to a large corpus of real-world Arabic handwritten and typewritten document images and has achieved impressive performance. Our future work will focus on understanding the complex structure of such documents, tighter integration of document layout analysis into OCR and MT, and use of lattices for high-quality information extraction.
7. Acknowledgement This paper is based upon work supported by the DARPA MADCAT Program.
References

[1] G. Kim and V. Govindaraju, "A Lexicon Driven Approach to Handwritten Word Recognition for Real-Time Applications," IEEE Transactions on Pattern Analysis and Machine Intelligence, 19(4):366-379, Apr. 1997.
[2] W. Ding, C. Y. Suen, and A. Krzyzak, "A New Courtesy Amount Recognition Module of a Check Reading System," Proc. 19th International Conference on Pattern Recognition (ICPR), 2008.
[3] V. Manohar, S. Vitaladevuni, H. Cao, R. Prasad, and P. Natarajan, "Graph Clustering-Based Ensemble Method for Handwritten Text Line Segmentation," Proc. International Conference on Document Analysis and Recognition (ICDAR), pp. 574-578, 2011.
[4] S. Saleem, H. Cao, K. Subramanian, M. Kamali, R. Prasad, and P. Natarajan, "Improvements in BBN's HMM-based Offline Arabic Handwriting Recognition System," Proc. International Conference on Document Analysis and Recognition (ICDAR), pp. 773-777, 2009.
[5] P. Koehn and J. Schroeder, "Experiments in Domain Adaptation for Statistical Machine Translation," Proc. Second Workshop on Statistical Machine Translation, pp. 224-227, 2007.
[6] H. Cao, R. Prasad, and P. Natarajan, "A Stroke Regeneration Method for Cleaning Rule-lines in Handwritten Document Images," Proc. International Workshop on Multilingual OCR, 2009.
[7] Y. Chow and R. Schwartz, "The N-Best Algorithm: An Efficient Procedure for Finding the Top N Sentence Hypotheses," Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 81-84, 1990.
[8] C. Leggetter and P. Woodland, "Maximum Likelihood Linear Regression for Speaker Adaptation of HMMs," Computer Speech and Language, vol. 9, pp. 171-186, 1995.
[9] Z. Shi, S. Setlur, and V. Govindaraju, "A Steerable Directional Local Profile Technique for Extraction of Handwritten Arabic Text Lines," Proc. International Conference on Document Analysis and Recognition (ICDAR), pp. 176-180, 2009.
[10] L. Shen, J. Xu, and R. Weischedel, "A New String-to-Dependency Machine Translation Algorithm with a Target Dependency Language Model," Proc. 46th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pp. 577-585, 2008.
[11] D. Chiang, K. Knight, and W. Wang, "11,001 New Features for Statistical Machine Translation," Proc. Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the ACL, pp. 218-226, 2009.
[12] J. Chen, J. Devlin, H. Cao, R. Prasad, and P. Natarajan, "Automatic Tune Set Generation for Machine Translation with Limited In-domain Data," Proc. 16th Annual Conference of the European Association for Machine Translation (EAMT), 2012.