Using Web Search Engines to Improve Text Recognition

Michael Donoser, Horst Bischof
Institute for Computer Graphics and Vision
Graz University of Technology, Austria
{donoser,bischof}@icg.tugraz.at

Silke Wagner
Department of Slavic Studies
University of Graz, Austria
[email protected]

Abstract

In this paper we introduce a framework for automated text recognition in images. We first describe a simple but efficient text detection and recognition method, based on the analysis of Maximally Stable Extremal Regions (MSERs) and simple template matching, which provides initial character recognition results. The main emphasis of the paper is on a novel method for exploiting contextual information to improve these recognition results. We propose to analyze the results of web search engine queries on two levels of detail, both of which significantly improve the overall text recognition performance. Experimental evaluations on reference data sets show that, even on top of a low-quality single-character recognition method, the proposed web search engine extension enables reasonable text recognition results.

1. Introduction

Text recognition translates images of text, e.g. scanned documents or images of natural scenes containing signs, into actual text characters. Although it has been widely regarded as a solved problem, this is only true for clean, non-distorted images. Recognition performance degrades significantly under noise, variable lighting and perspective distortions. In general, methods for text recognition mainly consist of two subsequent steps: segmentation and classification. For segmentation of the individual characters, the text is localized in the image and binary segmentations of every character in the text area are provided. Classification first assigns each character, independently of all the others, to one of a set of character classes. Finally, post-processing that exploits contextual information tries to improve the final text recognition results.

In this work we propose a method that recognizes text in single images in three subsequent steps. First, the text is automatically localized in the images and segmentations of every character are provided. Second, we generate weak character hypotheses by a simple template matching method. Finally, we show that by analyzing contextual information on word and sentence level the weak hypotheses can be improved to strong classification results. Although this paper also describes an efficient method for text localization, segmentation and character recognition, the main emphasis lies on the third step: improving the final text recognition performance through contextual knowledge.

Methods exploiting contextual information for improving text recognition can be broadly classified into statistical [7, 12] and dictionary-based [3, 2, 13] methods. The main issue of dictionary-based methods is the time it takes to examine candidate words in a large lexicon, but in general they achieve higher word-level recognition performance than statistical approaches. We propose to use a web search engine like Google, Yahoo or Live Search, which allows us to draw on a virtually unlimited amount of words and texts in all languages. Using web search engines for computer vision applications was already demonstrated by Fergus et al. [4], but exclusively for learning object categories.

The outline of the paper is as follows. Section 2 describes the entire framework, focusing on the post-processing step that analyzes the results of web search engine queries to improve the final text recognition results. In Section 3 we demonstrate the performance of the system on a text recognition data set of the ICDAR 2005 conference and on high-resolution document images acquired with digital cameras. Finally, Section 4 draws some conclusions.
2. Text recognition framework

Our method for recognizing text in single images can be roughly divided into three main parts: text localization and segmentation, character recognition, and contextual post-processing. The first step is to localize text in the images and to provide segmentations of the individual characters (Section 2.1). The second step is to calculate class likelihoods for the segmented characters by a simple template matching method (Section 2.2). Finally, we describe how the results of web search engine queries can be used to improve the text recognition results by exploiting contextual information on two levels of detail (Section 2.3).
2.1. Text localization

The first step of the system is to localize the text in the images. We apply a method based on Maximally Stable Extremal Region (MSER) detection [8]. The MSER detector is one of the best-performing interest region detectors in computer vision, as shown in the evaluations by Mikolajczyk et al. [10] and by Fraundorfer and Bischof [5]. The set of MSERs is closed under continuous geometric transformations, is invariant to affine intensity changes, and MSERs are detected at all scales. We use these properties mainly for segmentation purposes because, as already shown in [9], MSERs can be used to segment characters in images. We apply a method similar to [9], but instead of looking for Category Specific Extremal Regions (CSERs) we simply use the MSER detection results. After MSER detection, linear configurations of MSERs are found by a Hough transform, constrained to lines covering regions of approximately equal mean color and height. In addition to localizing the text in the images by the estimated lines, the corresponding MSERs also provide segmentations of the individual characters. Figure 1 shows some text localization results for images of the ICDAR 2005 robust reading competition, highlighting the MSER segmentations and the detected lines.
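The grouping of detected character regions into text lines can be sketched as follows. This is a simplified stand-in for the Hough-based line search described above, operating only on bounding boxes (MSER detection itself is assumed to be available, e.g. from an off-the-shelf detector); the tolerance values are illustrative assumptions, not the paper's parameters.

```python
def group_into_lines(boxes, center_tol=0.4, height_tol=0.3):
    """Group character bounding boxes (x, y, w, h) into text lines.

    Two boxes belong to the same line if their vertical centers and
    heights are similar -- a simplified stand-in for the Hough-based
    search over lines of approximately equal mean color and height.
    """
    lines = []  # each line is a list of boxes, built left to right
    for box in sorted(boxes, key=lambda b: b[0]):
        x, y, w, h = box
        cy = y + h / 2.0
        for line in lines:
            lx, ly, lw, lh = line[-1]       # rightmost box of the line
            lcy = ly + lh / 2.0
            ref = max(h, lh)
            if abs(cy - lcy) < center_tol * ref and abs(h - lh) < height_tol * ref:
                line.append(box)
                break
        else:
            lines.append([box])             # start a new text line
    return lines

# Three characters on one line plus a region at a different height:
boxes = [(0, 10, 8, 12), (10, 11, 8, 12), (30, 9, 8, 13), (5, 40, 8, 12)]
lines = group_into_lines(boxes)
```

In practice the Hough-based variant is more robust to rotated text lines, since it votes for line parameters instead of chaining neighbors.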
Figure 1: Examples of text localization results (yellow lines) and corresponding character segmentations (blue boundaries) for images from the ICDAR 2005 robust reading competition.
2.2. Character recognition

The next step is to classify the segmentation of each character independently into one of C pre-defined character classes Ck with k = 1 . . . C. In our case we use the Latin alphabet, i.e. C = 23. This is a general classification task to which any method based on a learning-from-examples strategy, e.g. support vector machines [11], neural networks [1] or boosting [6], can be applied. Since the main emphasis of this work is on demonstrating that integrating contextual information through the analysis of web search engine query results improves text recognition performance, we use a very simple character recognition method based on template matching against single reference characters. Such an approach only enables a rather low recognition performance. Nevertheless, as demonstrated in Section 3, even these low-quality results are sufficient to achieve reasonable text recognition results with the web search engine extension. We compare the segmentation of each character Sn to each of the 23 reference segmentations by cross-correlation. Every comparison provides a class likelihood p(Ck|Sn). A naive approach to text recognition would be to take the highest-probability class assignment for each character. But, as described in the next section, we can significantly improve recognition performance by passing these probability values to a verification and correction step based on the analysis of web search engine query results.
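A minimal sketch of how such template-matching likelihoods could be computed is given below, using tiny binary character masks. The 3x3 templates and the normalization of the correlation scores into a probability distribution are illustrative assumptions for the example, not the paper's actual references.

```python
def class_likelihoods(segment, templates):
    """Cross-correlate a binary character mask with reference templates
    and normalize the scores into class likelihoods p(C_k | S_n)."""
    scores = {}
    for label, tmpl in templates.items():
        # overlap count = unnormalized cross-correlation of binary masks
        overlap = sum(s * t for srow, trow in zip(segment, tmpl)
                            for s, t in zip(srow, trow))
        scores[label] = overlap
    total = sum(scores.values()) or 1
    return {label: s / total for label, s in scores.items()}

# Tiny 3x3 binary templates (illustrative, not the paper's references):
templates = {
    "I": [[0, 1, 0], [0, 1, 0], [0, 1, 0]],
    "L": [[1, 0, 0], [1, 0, 0], [1, 1, 1]],
}
segment = [[0, 1, 0], [0, 1, 0], [0, 1, 1]]  # an "I" with one noisy pixel
probs = class_likelihoods(segment, templates)
```

Note that the noisy pixel leaves a non-zero likelihood for "L" -- exactly the kind of weak, ambiguous hypothesis the verification step of Section 2.3 is designed to resolve.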
2.3. Search engine verification

In the last step of our text recognition framework we exploit contextual information to improve the individual character recognition results. Most text recognition methods perform simple matches against a standard dictionary in this step. It is a straightforward idea to instead exploit the internet, which provides access to billions of documents in all languages. We therefore propose to use web search engines like Google, Yahoo or Live Search to validate, and furthermore correct, the recognition results on two different levels of detail: on word level and on sentence level.

We first describe the analysis on word level. For that purpose we merge all segmented characters Sn of an image into a set of words Wl by analyzing the distance between neighboring characters. The character recognition method presented in Section 2.2 provides class probabilities p(Ck|Sn) for each of the N characters S1 . . . SN building a word Wl. We now use these character likelihoods as an importance density for randomly drawing hypotheses Wl^k for the entire word. Each hypothesis is sent as a query to a web search engine, and the number of search engine hits Nr(Wl^k) and the corresponding "Did you mean" entry Dym(Wl^k) are saved as result. The "Did you mean" entry is provided by almost all search engines, and since it was designed for correcting misspelled search words, it is a powerful tool for correcting wrong word recognition results.
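Drawing word hypotheses from the character likelihoods can be sketched as follows -- a minimal illustration of sampling from the per-character class distributions used as importance density. The distributions below and the number of samples are assumptions for the example.

```python
import random

def draw_word_hypotheses(char_likelihoods, n_samples=20, seed=0):
    """Sample word hypotheses W_l^k, weighting each character position
    by its class likelihoods p(C_k | S_n) (the importance density)."""
    rng = random.Random(seed)
    hypotheses = set()
    for _ in range(n_samples):
        word = "".join(
            rng.choices(list(dist.keys()), weights=list(dist.values()))[0]
            for dist in char_likelihoods
        )
        hypotheses.add(word)
    return hypotheses

# Per-position likelihoods for an ambiguous 6-letter word (cf. "Summer";
# the numbers are illustrative, not actual recognition output):
char_likelihoods = [
    {"s": 1.0},
    {"u": 0.9, "o": 0.1},
    {"m": 0.5, "n": 0.3, "h": 0.2},
    {"m": 0.4, "n": 0.3, "h": 0.3},
    {"e": 1.0},
    {"r": 1.0},
]
hyps = draw_word_hypotheses(char_likelihoods)
```

Because sampling is weighted by the likelihoods, plausible confusions such as "sumner" or "sunner" appear among the hypotheses while wildly improbable strings do not, which keeps the number of search queries small.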
Query Wl^k   Nr(Wl^k)      Dym(Wl^k)   p(Wl^k)
suhher       101           –           0 %
sunner       327 000       –           0.12 %
suhner       358 000       –           0.14 %
sumher       2 660         summer      0 %
summer       360 000 000   –           96.96 %
sumner       8 210 000     –           2.78 %
sunher       18 000        suhner      0 %

Query Wl^k   Nr(Wl^k)      Dym(Wl^k)   p(Wl^k)
scnool       9 250         school      0 %
school       954 000 000   –           99.99 %
sohool       14 000        school      0.01 %
sonool       177           –           0 %
Table 1: Search engine verification of hypotheses on word level. The queries, search results and final word probabilities are shown for "Summer". Note that the correct word was only considered because it was returned as an alternative by the search engine.

To access the web search engines we wrote Perl scripts that automatically send the queries to a search engine and then parse the returned HTML code for the desired entries by simple text matching, e.g. for the text behind "Did you mean". For a standard user the number of allowed search queries per second is limited, but it is possible to buy continuous access to the APIs of the different search engines, which would enable unlimited access.

The process of verification and correction on word level works as follows and is illustrated by an example. First, the hypotheses Wl^k, drawn randomly according to the importance density, are each sent as a query to the search engine by the Perl script, and the results Nr(Wl^k) and Dym(Wl^k) are saved. In addition, if the Dym(Wl^k) entry of a query is not empty, the corresponding result is added to the list of hypotheses. Note that this allows us to analyze word candidates that would never have been selected by the random hypothesis generation. Finally, we calculate a probability value p(Wl^k) for every candidate word Wl^k that yields search results by

    p(Wl^k) = [ Nr(Wl^k) / Σ_{n=1}^{N} Nr(Wl^n) ] · pc(Wl^k),   (1)

where pc(Wl^k) is the mean conditional probability of all character likelihoods of the corresponding word. Tables 1 and 2 show two examples of the web search engine verification and correction step for the words "Summer" and "School" of the image shown in Figure 1(a). As can be seen, although single character recognition fails to recognize the words, the web search engine verification finally provides the correct text recognition result together with meaningful probability values. Note further that the correct word "Summer" was only considered because it was returned as an alternative by the search engine.
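The word probability of Equation (1) can be sketched as follows. The hit counts are taken from Table 1, while the mean character likelihoods pc and the helper name are assumptions for the example.

```python
def word_probabilities(hits, pc):
    """Compute p(W_l^k) as in Equation (1): the hit count Nr of each
    candidate, normalized over all candidates, weighted by the mean
    character likelihood pc of that candidate."""
    total = sum(hits.values())
    return {w: (n / total) * pc.get(w, 0.0) for w, n in hits.items()}

# Hit counts Nr from Table 1; pc values are assumed for illustration:
hits = {"summer": 360_000_000, "sumner": 8_210_000, "suhner": 358_000}
pc = {"summer": 0.99, "sumner": 0.95, "suhner": 0.90}
probs = word_probabilities(hits, pc)
best = max(probs, key=probs.get)
```

Weighting by pc keeps a frequent but visually implausible word from dominating a rarer word that matches the character evidence well.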
Table 2: Search engine verification of hypotheses on word level. The different queries, search results and the final word probabilities are shown for “School”.
Query Yl^k      Nr(Yl^k)    Dym(Yl^k)       p(Yl^k)
summer school   4 380 000   –               99.61 %
sumner school   32 300      –               0.39 %
sunner school   45          summer school   0 %
suhner school   0           –               0 %
Table 3: Search engine verification of hypotheses on sentence level. The queries, search results and final combination probabilities are shown for "Summer School".
As can be seen in Table 1, the word "Summer" is correctly recognized, but other words like "Sumner" are also returned as possible solutions. To further improve the recognition performance, we therefore propose to perform the web search engine verification on a second level of detail: on sentence level. All web search engines provide the capability to search for exact phrases instead of combinations of single words. For that purpose we again randomly draw importance-density-weighted hypotheses Yl^k for word combinations, but this time on sentence level, where the importance density is defined by the estimated word probabilities p(Wl^k). Table 3 shows the results for the example, considering the two neighboring words Wl^k ("Summer") and Wl^m ("School") for phrase verification. The final combination probability p(Yl^k) is again calculated in the manner of Equation (1). Of course, the combination of more than two neighboring words is also possible. For example, by considering three words and also including the results for "Essex" (see Figure 1(a)), we get 3 340 hits for "Essex Summer School" and 0 for "Essex Sumner School", and therefore "Essex Summer School" is returned as text recognition result with a probability of 100 %.

To sum up, using contextual information by analyzing the results of web search engine queries significantly improves the recognition performance. Even simple, low-quality character recognition methods can, with this extension, lead to reasonable text recognition results, as demonstrated in the next section.
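The phrase-level step can be sketched in the same way. The hit counts below are stand-ins for real exact-phrase search engine responses (the actual query interface is not part of the example), and for brevity the mean character likelihood term of Equation (1) is omitted, i.e. the sketch scores combinations by normalized hit counts only.

```python
def phrase_probabilities(phrase_hits):
    """Score word-combination hypotheses Y_l^k by their exact-phrase
    hit counts, normalized over all drawn combinations (Equation (1)
    with the character likelihood term omitted for brevity)."""
    total = sum(phrase_hits.values()) or 1
    return {p: n / total for p, n in phrase_hits.items()}

# Exact-phrase hit counts (stand-ins for real search engine results):
phrase_hits = {
    "summer school": 4_380_000,
    "sumner school": 32_300,
    "sunner school": 45,
    "suhner school": 0,
}
probs = phrase_probabilities(phrase_hits)
best = max(probs, key=probs.get)
```

The phrase counts separate "summer" from "sumner" far more decisively than the single-word counts, since "sumner school" is a much rarer phrase than "Sumner" is a word.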
Figure 2: An example of the images used for evaluating the influence of the web search engine verification on sentence level.
3. Experiments

To demonstrate the improvements achievable by the proposed web search engine verification and correction step we analyzed images of the ICDAR 2005 robust reading competition. In these images 98 words were automatically detected by the described text localization method. Without any contextual analysis the simple character recognition method only achieved a recognition performance of 68.58 %. Including the proposed web search verification step on word level, as described in Section 2.3, improves the recognition rate to 87.05 %.

Since most of the images of the ICDAR data set contain only a single word, we additionally tested our method on high-resolution images of documents returned by web search engines or acquired with digital cameras. An example image is shown in Figure 2. This evaluation better demonstrates the effect of contextual analysis on sentence level. We analyzed 257 words localized in these images and achieved an average character recognition performance of 59.65 % without contextual analysis, 75.82 % including word-level analysis, and 84.18 % when considering two neighboring words on sentence level. Especially for short words like "in", "to", "for", etc., verification on word level failed due to similar abbreviations. For these words, the analysis on sentence level, e.g. the combination with the following verb, corrected almost all of the recognition results.

4. Conclusion

In this paper we introduced a framework for automatically detecting and recognizing text in images. We described a text localization method and a simple character recognition method that provide initial character recognition results. The main emphasis was on introducing the use of web search engine query results to improve the final recognition performance by exploiting contextual information. Experimental evaluation demonstrated that the analysis on word and on sentence level improves the overall recognition performance. We further demonstrated that even rather simple, low-quality character recognition methods can, with this extension, lead to reasonable text recognition results.

References

[1] C. M. Bishop. Neural Networks for Pattern Recognition. Oxford University Press, 1995.
[2] X. Chen, J. Yang, J. Zhang, and A. Waibel. Automatic detection and recognition of signs from natural scenes. IEEE Transactions on Image Processing, 13:87–99, 2004.
[3] X. Chen and A. L. Yuille. Detecting and reading text in natural scenes. In Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR), pages 366–373, 2004.
[4] R. Fergus, L. Fei-Fei, P. Perona, and A. Zisserman. Learning object categories from Google's image search. In Proceedings of the International Conference on Computer Vision (ICCV), volume 2, pages 1816–1823, 2005.
[5] F. Fraundorfer and H. Bischof. A novel performance evaluation method of local detectors on non-planar scenes. In Workshop Proceedings of the Conference on Computer Vision and Pattern Recognition, pages 33–38, 2005.
[6] Y. Freund and R. E. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55(1):119–139, 1997.
[7] J. Hull, S. Srihari, and R. Choudhari. An integrated algorithm for text recognition: Comparison with a cascaded algorithm. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), 5(4):384–395, July 1983.
[8] J. Matas, O. Chum, M. Urban, and T. Pajdla. Robust wide baseline stereo from maximally stable extremal regions. In Proceedings of the British Machine Vision Conference (BMVC), pages 384–393, 2002.
[9] J. Matas and K. Zimmermann. A new class of learnable detectors for categorisation. In Proceedings of the Scandinavian Conference on Image Analysis (SCIA), pages 541–550, 2005.
[10] K. Mikolajczyk, T. Tuytelaars, C. Schmid, A. Zisserman, J. Matas, F. Schaffalitzky, T. Kadir, and L. Van Gool. A comparison of affine region detectors. International Journal of Computer Vision (IJCV), 65(1-2):43–72, 2005.
[11] B. Schölkopf and A. J. Smola. Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. MIT Press, Cambridge, MA, USA, 2001.
[12] R. M. K. Sinha, B. Prasada, G. Houle, and M. Sabourin. Hybrid contextual text recognition with string matching. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), 15:915–925, 1993.
[13] J. J. Weinman, E. Learned-Miller, and A. Hanson. Fast lexicon-based scene text recognition with sparse belief propagation. In Proceedings of the International Conference on Document Analysis and Recognition (ICDAR), 2007.