2011 International Conference on Digital Image Computing: Techniques and Applications

Natural Image Character Recognition Using Oriented Basic Image Features

Andrew J. Newell and Lewis D. Griffin
Department of Computer Science, University College London

Abstract—Whilst effective methods exist for character recognition in certain contexts, such as characters taken from printed or handwritten documents, these methods have not performed well when tested with characters taken from natural images. In this work we introduce oriented Basic Image Features (oBIFs), a system based upon local symmetry and orientation, and demonstrate how they can be used within a multiscale system for natural image character recognition. The best performing system outperforms previous methods by 8.1% and 1.2% when tested with the two most common datasets.

of training examples per class (5 or 15) on datasets which contain digits and upper and lower case characters. Most recently Wang et al. [5] have shown that Histograms of Oriented Gradients (HOG) could be used in conjunction with a Nearest Neighbour classifier to produce better performance than all previous methods. HOG features [6], like Geometric Blur and Shape Context, have previously been used in applications involving shape recognition. Given the different testing paradigms it is difficult to make comparisons between the state of the art in natural image character recognition and the state of the art in handwritten character recognition. However, the current best performing schemes achieve performance rates of less than 60% when tested using 15 training images per class, implying that there may be substantial room for improvement.

I. INTRODUCTION

A. Character Recognition

Text taken from natural images presents a challenge for traditional optical character recognition (OCR) systems for several reasons. First, there may be far greater variation in terms of the number of fonts used and their variability within sections of text. Second, whereas in a document the intention will generally be to convey information in as clear a way as possible, text in natural images may be used with a different motivation. Signs and labels may have been designed to attract attention or for aesthetic appeal, rather than clear presentation of the text. Third, the text in images has often come from a three-dimensional surface, giving rise to a variation in viewing angles along with clutter and occlusion not generally found in documents. These aspects of variation not only pose an interesting problem for traditional OCR techniques, but also make natural image character recognition an interesting instance of an object categorisation problem. Whilst the underlying objects may be considered relatively simple, the degree of intraclass variation is substantial. Therefore natural image character recognition may provide a useful tool for evaluating general object categorisation systems, particularly when tested using small training sets. It is this recognition problem that we focus on in this work.

II. METHODS

A. Oriented Basic Image Features

Basic Image Features (BIFs) [7], [8] have shown leading performance in texture recognition applications [9], [10]. They offer a system of image description where each location in an image is classified into one of seven types based upon the type of the approximate local symmetry. These approximate types are flat, dark and light rotational, dark and light line, slope and saddle-like. The classification is determined from the output of a bank of six derivative-of-Gaussian filters, one 0th order, two 1st order and three 2nd order. The algorithm has two tunable parameters: a filter scale parameter, σ, and a threshold, ε, which is influential in deciding whether a locality should be classified as flat, or as one of the other six articulated symmetry types. Larger values of ε result in a greater proportion of an image being classified as flat. Given the apparent usefulness of local orientation in recognition systems, both biological and artificial, we consider an extension to the BIF system where local symmetry is combined with local orientation to produce a set of features referred to as oriented Basic Image Features (oBIFs). For any given level of orientation quantisation, n, the possible orientations that can be assigned will depend on the local symmetry type. If the location is assigned to the dark or light rotational or the flat classes then, as these all exhibit rotational symmetry, no orientation is assigned. If the location is assigned to the dark line, light line or saddle-like classes then n possible orientations can be assigned. If the location is of the slope class

B. Related Work

Perhaps the most successful representation for object recognition, SIFT [1], has not performed well when tested against other methods [2]. Shape Context [3] and Geometric Blur [4] have been shown to outperform SIFT and other methods when used in conjunction with a Nearest Neighbour (NN) classifier [2]. These methods were tested using a small number


In addition we need to choose a value for the scale ratio for the oBIF features. In order to assess the value of using oBIFs over first order features, which have been very effective in methods such as HOG, we considered two further schemes which used oriented gradients in place of oBIFs. Each of these schemes differed only in the initial calculation of features. The first of the oriented gradient schemes corresponded to the single oBIF scheme described above. Oriented gradients were calculated using first order derivative-of-Gaussian filters at a given orientation quantisation over a set of arithmetically spaced scales. Locations which were classed as flat were discarded and the histogram of the remaining features formed the image encoding. The second oriented gradient scheme, referred to as 1st Order Columns, corresponded to the oBIF column scheme described above. For this, column features were assembled from pairs of oriented gradients at a given scale ratio and then put into a single histogram for the whole image. The scheme used the same three parameters as oBIF columns, namely ε, the orientation quantisation n, and the scale ratio. For both schemes a nearest neighbour (NN) classifier was used, with similarities between images measured by the Bhattacharyya distance [11].
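As a concrete illustration of the classification step, the following sketch (not the authors' code) shows a nearest-neighbour classifier over normalised histogram encodings using Kailath's form of the Bhattacharyya distance, -ln Σ √(p_i q_i); the paper cites [11] but does not spell out the exact variant used, so that choice is an assumption here.

```python
# A sketch (not the authors' code) of the nearest-neighbour classification step,
# assuming each image has already been encoded as a normalised histogram.
# The distance follows Kailath's Bhattacharyya form, -ln(sum_i sqrt(p_i * q_i)).
import numpy as np

def bhattacharyya_distance(p, q, eps=1e-12):
    """Distance between two normalised histograms p and q (smaller = more similar)."""
    bc = np.sum(np.sqrt(p * q))   # Bhattacharyya coefficient
    return -np.log(bc + eps)

def nearest_neighbour_classify(test_hist, train_hists, train_labels):
    """Return the label of the training histogram closest to test_hist."""
    dists = [bhattacharyya_distance(test_hist, h) for h in train_hists]
    return train_labels[int(np.argmin(dists))]
```

Any of the four encodings considered in this work (oBIFs, oBIF columns, or their first-order counterparts) could be supplied as the histograms.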

then, due to the directional nature of the slope, we can assign 2n orientations. Thus we have a total set of 5n+3 features. The oBIF calculation is given in Algorithm 1.

Algorithm 1 The oBIF calculation
1) Measure filter responses c_{i,j} to an (i,j)-order derivative-of-Gaussian filter, and from these calculate the scale normalised filter responses s_{i,j} = σ^{i+j} c_{i,j}.
2) Compute λ = s_{2,0} + s_{0,2} and γ = √((s_{2,0} − s_{0,2})² + 4 s_{1,1}²).
3) Assign the BIF type according to which expression is largest, then calculate the orientation where appropriate:

Expression                | BIF type          | Quantisable Orientation
ε s_{0,0}                 | flat              | no orientation
2 √(s_{1,0}² + s_{0,1}²)  | slope             | atan2(s_{0,1}, s_{1,0})
λ                         | dark rotational   | no orientation
−λ                        | light rotational  | no orientation
(γ + λ)/√2                | dark line         | arctan(2 s_{1,1} / (s_{0,2} − s_{2,0}))
(γ − λ)/√2                | light line        | arctan(2 s_{1,1} / (s_{0,2} − s_{2,0}))
γ                         | saddle-like       | arctan(2 s_{1,1} / (s_{0,2} − s_{2,0}))
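The following Python sketch illustrates Algorithm 1 under the assumption that SciPy's Gaussian derivative filters are an acceptable stand-in for the derivative-of-Gaussian bank; it is an illustration of the steps above, not the authors' implementation.

```python
# A sketch of Algorithm 1 (an illustration, not the authors' implementation),
# using SciPy's Gaussian derivative filters for the derivative-of-Gaussian bank.
# sigma is the filter scale and epsilon the flatness threshold described above.
import numpy as np
from scipy.ndimage import gaussian_filter

BIF_TYPES = ['flat', 'slope', 'dark rotational', 'light rotational',
             'dark line', 'light line', 'saddle-like']

def obif_map(image, sigma, epsilon):
    # Step 1: scale-normalised responses s_ij = sigma^(i+j) * c_ij
    s = {}
    for (i, j) in [(0, 0), (1, 0), (0, 1), (2, 0), (1, 1), (0, 2)]:
        c = gaussian_filter(image.astype(float), sigma, order=(j, i))  # (row, col) orders
        s[(i, j)] = (sigma ** (i + j)) * c

    # Step 2: lambda and gamma from the second-order responses
    lam = s[(2, 0)] + s[(0, 2)]
    gam = np.sqrt((s[(2, 0)] - s[(0, 2)]) ** 2 + 4.0 * s[(1, 1)] ** 2)

    # Step 3: the largest of the seven expressions gives the BIF type
    exprs = np.stack([
        epsilon * s[(0, 0)],                             # flat
        2.0 * np.sqrt(s[(1, 0)] ** 2 + s[(0, 1)] ** 2),  # slope
        lam,                                             # dark rotational
        -lam,                                            # light rotational
        (gam + lam) / np.sqrt(2.0),                      # dark line
        (gam - lam) / np.sqrt(2.0),                      # light line
        gam,                                             # saddle-like
    ])
    bif_type = np.argmax(exprs, axis=0)

    # Orientations, as in the table above: gradient direction for slopes,
    # second-order orientation for lines and saddles (none for the other types)
    slope_ori = np.arctan2(s[(0, 1)], s[(1, 0)])
    line_ori = np.arctan(2.0 * s[(1, 1)] / (s[(0, 2)] - s[(2, 0)] + 1e-12))
    return bif_type, slope_ori, line_ori
```

Quantising slope_ori into 2n bins and line_ori into n bins, and combining these with bif_type, would give the 5n+3 oBIF labels described in the text.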

B. Recognition using oBIFs

We consider two different schemes for using oBIFs for character recognition. In the first we calculate oBIFs, for a given quantisation of orientation, at a set of arithmetically spaced scales. So that we ignore blank space in the image, we discard all the pixels that are classified as flat. We then form a histogram of all other oBIF types, across all scales considered, and normalise so that the histogram sums to 1. In this scheme we have to choose a value for ε and the number of possible orientations for the oBIF calculation. At one extreme, with no orientation quantisation, the system is equivalent to the BIF system.

The first scheme produces a histogram of 3+5n bins, where n is the level of quantisation. Such a low number of bins may put a limit on the discrimination ability of the system and so we consider a method of combining the oBIF classes to form new features that are rarer, yet potentially still consistent with object class. One way to create such features is to look at pairs of oBIF types that occur next to each other in the scale dimension. To create each feature, called an oBIF column, we calculate oBIFs at two scales and then look at the pair of oBIFs that occurs at each point. These can then be put into a histogram and normalised as before, except such a histogram now has (3+5n)² bins. This process is illustrated in Figure 1. These oBIF column features can then be used in a similar way as in the first scheme. For each image, we calculate oBIF columns at a range of scales, although the ratio between scales for each oBIF column remains fixed. We then remove any oBIF columns that contain a flat type. All oBIF columns are then counted and normalised. In the second scheme we still need to choose a value for ε and the number of orientations.
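A minimal sketch of the oBIF column encoding for a single base scale is given below; obif_labels is a hypothetical helper, assumed to return for each pixel an integer label in [0, 5n+3) that combines symmetry type and quantised orientation, with 0 reserved for flat.

```python
# A minimal sketch of the oBIF column encoding for one base scale. obif_labels is a
# hypothetical helper assumed to return, per pixel, an integer in [0, 5n+3) that
# combines symmetry type and quantised orientation, with 0 reserved for flat.
import numpy as np

def obif_column_histogram(obif_labels, image, sigma, scale_ratio, n_orient):
    n_labels = 5 * n_orient + 3
    a = obif_labels(image, sigma, n_orient)                # labels at the finer scale
    b = obif_labels(image, sigma * scale_ratio, n_orient)  # labels at the coarser scale

    keep = (a != 0) & (b != 0)                  # discard columns containing a flat entry
    pairs = a[keep] * n_labels + b[keep]        # one bin per ordered pair of labels

    hist = np.bincount(pairs, minlength=n_labels ** 2).astype(float)
    total = hist.sum()
    return hist / total if total > 0 else hist
```

In the full scheme this histogram would be accumulated over the whole set of base scales, with the ratio between the two scales of each column held fixed, before normalising.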

III. RESULTS

A. Evaluation

In order to evaluate the four schemes we used the chars74k [2] and ICDAR03-CH [12] datasets. The first of these has 62 classes of images, made up of digits with upper and lower case letters. Images generally contained a single character from a natural image. However, high levels of clutter meant that some images contained small subsidiary characters, in which case only the main character is labelled. Examples from the chars74k are shown in Figure 2. The ICDAR03-CH dataset comes from the robust OCR challenge section of the ICDAR03 challenges and contains 75 classes, made up of digits, upper and lower case letters and symbols. When referring to specific results we use the convention of suffixing the dataset name with the number of training images per class, so, for example, chars74k-05 refers to the chars74k dataset when tested with 5 training images per class.

B. Preprocessing

As a first step, all images were resized and padded where necessary to ensure they were all the same size. The datasets contain both light letters on a dark background and dark letters on a light background. In order to be invariant to this difference we performed a simple test on each image by looking at the relative strength of oBIF type at a coarse level, which essentially considered whether the image tended to a dark patch on light or a light patch on dark. If this oBIF type was light rotational the image was inverted, whereas if it was dark rotational the image was left as it was.
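A rough sketch of this preprocessing is given below. It is not the authors' code: the target size and coarse scale are illustrative, and the "relative strength of oBIF type at a coarse level" is approximated here by the sign of the scale-normalised Laplacian at the image centre (positive corresponding to a dark patch on light, negative to a light patch on dark).

```python
# A sketch of the preprocessing step (an illustration, not the authors' code):
# images are resized and padded to a common size, and polarity is normalised by
# a coarse-scale second-order test. The target size and coarse scale are
# illustrative assumptions.
import numpy as np
from scipy.ndimage import gaussian_filter, zoom

def preprocess(image, target=48):
    img = image.astype(float)
    # Resize so the longer side equals `target`, then pad to a square with edge values
    s = target / max(img.shape)
    img = zoom(img, s)
    pad = [(0, max(target - d, 0)) for d in img.shape]
    img = np.pad(img, pad, mode='edge')[:target, :target]

    # Polarity test at a coarse scale (roughly the size of the character):
    # a positive scale-normalised Laplacian suggests a dark patch on light
    # (dark rotational), a negative one a light patch on dark (light rotational)
    sigma = target / 4.0
    lap = (sigma ** 2) * (gaussian_filter(img, sigma, order=(2, 0)) +
                          gaussian_filter(img, sigma, order=(0, 2)))
    if lap[target // 2, target // 2] < 0:   # light patch on dark: invert
        img = img.max() + img.min() - img
    return img
```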


Fig. 1. The stages in computation of an oBIF Column histogram descriptor: (1) the input image; (2) oBIFs, where each pixel is assigned to one of 7 local symmetry types along with an orientation where applicable; (3) oBIF Columns, formed by stacking oBIFs computed at different scales; (4) the normalised histogram, in which the column features are counted and normalised.

draw subsequent training and test sets. These images, referred to as the main set, contained the majority of available images for most classes. We used the remaining images from chars74k to tune the parameter values for the four schemes. For this process none of the images from the main set were used, with the consequence that the number of images available per class varied considerably in the tuning process. The same parameter values were used for both datasets.

D. Performance


The results for the four schemes are shown in Table I alongside previously published results, including SIFT and the OCR software ABBYY. The three columns indicate the dataset and the number of training images per class. The performance measure given is the mean score over all runs (50 runs for the chars74k dataset and 20 for the ICDAR03 dataset), along with the standard deviation of these scores. From this table it can be seen that, on the chars74k dataset, both oBIF columns and 1st Order Columns outperformed previous methods when using either 5 or 15 training images per class. However, on the ICDAR03 dataset only the oBIF column scheme outperforms previous methods. The performance for other sizes of training set on the chars74k dataset is shown in Figure 3.
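A sketch of this evaluation protocol, assuming images from the main set have already been encoded as histograms and that the classifier is the nearest-neighbour scheme sketched earlier, might look as follows; the split sizes, number of runs and random seed are illustrative.

```python
# A sketch of the repeated random-split evaluation: for each run, n_train images
# per class are drawn for training, the remainder are used for testing, and the
# mean and standard deviation of the per-run accuracies are reported.
import numpy as np

def evaluate(histograms_by_class, n_train, n_runs, classify, seed=0):
    """histograms_by_class: dict mapping class label -> list of histograms."""
    rng = np.random.default_rng(seed)
    scores = []
    for _ in range(n_runs):
        train_h, train_y, test_h, test_y = [], [], [], []
        for label, hists in histograms_by_class.items():
            order = rng.permutation(len(hists))
            for k in order[:n_train]:
                train_h.append(hists[k])
                train_y.append(label)
            for k in order[n_train:]:
                test_h.append(hists[k])
                test_y.append(label)
        correct = sum(classify(h, train_h, train_y) == y
                      for h, y in zip(test_h, test_y))
        scores.append(100.0 * correct / len(test_y))
    return np.mean(scores), np.std(scores)
```

The nearest_neighbour_classify function from the earlier sketch could be passed as the classify argument.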

Fig. 2. Example images from the chars74k dataset.

C. Dataset splits

In order to make a comparison to previous work we focused on using training sets of 5 or 15 images per class. With such low numbers of images per class available for training the results can vary substantially from run to run. Therefore we wanted to ensure the performance measures were based upon a significant number of trials. To do this we first selected 30 images from each class in the chars74k dataset from which to


TABLE I
Comparison of performance (in % correct) on the chars74k and ICDAR03 datasets.

Scheme                       | chars74k-5 | chars74k-15 | ICDAR03-CH-5
SIFT [2]                     |            |    20.8     |
ABBYY [5]                    |   18.7     |    18.7     |    21.2
Multiple Kernel Learning [2] |            |    55.3     |
Shape Context [2]            | 26.1±1.7   |    34.4     |    18.3
Geometric Blur [2]           | 36.9±1.0   |    47.1     |    27.8
HOG Features [5]             | 45.3±1.0   |    57.5     |    51.5
1st Order Features           | 27.0±1.2   |  36.7±0.8   |  28.4±1.1
oBIFs                        | 35.3±1.2   |  46.4±1.0   |  30.8±1.2
1st Order columns            | 50.8±1.1   |  60.2±1.1   |  46.1±1.3
oBIF columns                 | 53.4±1.4   |  64.3±1.3   |  52.7±1.2

between upper and lower case examples, thus reducing the problem to 36 classes, the overall score becomes 73.0±1.3% on chars74k-15. A similar pattern was seen in the confusion matrices for the other schemes, as well as for the ICDAR03 dataset.
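A small sketch of this case-insensitive re-scoring (an illustration, not the authors' code): predicted and true labels are collapsed so that upper and lower case versions of the same letter count as a single class.

```python
# Collapse upper and lower case versions of the same letter into one class,
# reducing the 62 chars74k classes to 36 (26 letters plus 10 digits).
def case_insensitive_accuracy(predicted_labels, true_labels):
    fold = str.lower   # digits are unchanged; letters are folded by case
    hits = sum(fold(p) == fold(t) for p, t in zip(predicted_labels, true_labels))
    return 100.0 * hits / len(true_labels)
```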

Fig. 3. The performance of oBIFs and oBIF columns (score in % correct against the number of training images per class on the chars74k dataset), with the 1st Order and 1st Order Columns schemes shown for comparison.

E. Parameter Values

For the first oBIF scheme, the tuning process gave an optimal value of ε of 0.05 and an optimal orientation quantisation of 12, giving an oBIF set of 63 features. For the oBIF column scheme, the tuning process gave ε as 0.03 and an orientation quantisation of 8, giving 43 oBIF features. The optimal ratio between the scales in the oBIF features was 3. For the first order feature schemes, a set of 24 orientations was optimal for both. The ε values were 0.02 for the first scheme and 0.05 for the 1st Order Columns, and the optimal scale ratio was 2.5.

Fig. 4. The confusion matrix when using oBIF columns on the chars74k dataset with 15 training images per class.

G. Parameter Investigation

We wanted to see how the various parameter settings for the oBIF column scheme affected the performance. First, we looked at the sensitivity to the value of ε. From the graph in Figure 5, it can be seen that the performance is robust with respect to small changes in ε, with at least 95% of optimum performance being achieved in the range 0 to 0.075, which is 2.5 times the optimal value of ε. We then looked at the influence of the ratio between the two scales. For this we looked at how the level of performance changed as the ratio between the two scales varied between 1 and 10, with results shown in Figure 6. As seen from this graph, when the ratio between the scales is 1, the system

F. Confusion Matrix

As part of the evaluation process we looked at the pairs of classes that were often confused. For the oBIF column scheme and the chars74k dataset this is shown in the confusion matrix in Figure 4. The classes are given in the same order as in Figure 2, that is digits followed by upper case letters then lower case letters. A notable feature of the confusion matrix is the pair of lines running parallel to the main diagonal. These indicate a relatively high level of confusion between upper and lower case examples of the same letter. If we allow confusion


Fig. 5. The relationship between performance and ε.

Fig. 7. The relationship between performance and orientation quantisation.

is equivalent to using single scale oBIFs and we therefore get the same level of performance. There is a sharp increase in performance as the ratio increases to a value of 2, with marginal increases in performance thereafter. The true optimal value of the ratio is 3.5, as opposed to the value of 3 obtained from the tuning process. However, performance within 95% of the optimum is achieved with a ratio in the range 1.75 to 7.

Fig. 8. The relationship between performance and the number of scales in the column descriptor, for both oBIF and 1st order column features.

IV. DISCUSSION



A. Is it possible to achieve 100% performance?




The oBIF columns method tested within this work has shown an improvement over previous methods using the chars74k and ICDAR03 datasets. However, the performance is still far from perfect and, even with 29 training images per class, we see performance flattening out at approximately 70% (as shown in Figure 3). In order to put this improvement into context it would be helpful to have some idea of the upper bound on performance. Ideally this would come from human-level performance, but such an estimate is very difficult to make as it would require subjects who had no prior experience of the characters used in these datasets. We can, however, detect apparent ambiguities between classes that occur because of the nature of the testing regime, where each character is presented out of context. Certain pairs of classes, for example an upper and lower case ’x’ or a ’one’ and a lower case ’l’, may have visually identical instances, meaning that the true class can only be identified by the context. Upper case letters may be larger than surrounding letters and found at the start of words, whereas digits may be found next to other digits. Examples of ambiguous pairs of classes are shown in Figure 9. The absence of these context cues will likely place an upper

Fig. 6. The relationship between performance and scale ratio.

Next, we looked at the importance of the level of orientation quantisation in the oBIF set, shown in Figure 7. Here, the point on the far left represents the BIF system without orientation, using a set of 7 underlying features. We see a rapid increase in performance as orientation is introduced, with a significant increase up to a quantisation level of 8 and a slow decline in performance thereafter. Finally, we wanted to see whether any further increase in performance could be achieved by increasing the complexity of the oBIF column features to triplets of oBIFs across 3 scales. The oBIF features were calculated as before, and performance for both oBIFs and first order features is plotted against the single scale and 2 scale schemes in Figure 8. This shows a marginal, though not significant, improvement in both schemes. It should also be noted that the 3 scale oBIF columns produce a total encoding size of (5n+3)³ as opposed to the 2 scale oBIF column encoding size of (5n+3)², where n is the orientation quantisation; for the tuned value of n = 8 this is 79,507 bins against 1,849.


HOG features. This is roughly what is seen in the results, where oBIF columns outperform HOG features by a margin of 8.1% on chars74k-5 but only 1.2% on ICDAR03-CH-5.

V. CONCLUSIONS

In this paper we have described and demonstrated a method that has outperformed previous methods on the two most common datasets of characters from natural images. On the chars74k dataset, our best performing method achieves an 8.1% increase in performance when tested using 5 training images per class and an increase of 6.8% when using 15 training images per class. On the ICDAR03 dataset, an increase of 1.2% was achieved using 5 training images per class. To achieve these results we have found it necessary to incorporate two elements. First, we have used second order features in the form of oBIFs, instead of oriented gradients. Second, we have used a multiscale encoding, which consists of a global histogram of paired features.

Fig. 9. Examples of ambiguous images in the chars74k dataset. All images in the top row are from a different class than the corresponding image in the bottom row.

bound on performance. It is difficult to determine exactly what this upper bound might be, but if, for example, there were 10 pairs of visually ambiguous classes, as implied by the confusion matrices, we would have an upper bound of just under 84% (assuming equal numbers of test images per class and chance-level resolution within each ambiguous pair, this is (42 + 20 × 0.5)/62 ≈ 83.9%).

B. How do oBIF columns compare to other methods?


The methods tested in this work are perhaps most comparable to the Shape Context, Geometric Blur and HOG features. All these methods involve the extraction of local features, followed by some form of pooling step, followed by a nearest neighbour classifier. There are, however, important differences between the methods. Both Shape Context and Geometric Blur combine local features in a way that will produce an encoding that is largely invariant to position and at least partially invariant to the size of the object. Therefore, when testing these methods on a set of upright characters, we are essentially evaluating their ability to categorise images. However, when using HOG features, as local histograms are essentially concatenated to make the overall image encoding, we would not expect this to be particularly invariant to changes in position and size. When testing this method, then, performance may vary according to how much size and position variation there is in the dataset, as well as with intraclass variation. Thus, given two datasets with different levels of variation in size and position, we might expect the relative performance between the datasets to differ between HOG and Geometric Blur or Shape Context. In the previously published results it is interesting to note the relative performance between Geometric Blur, with 36.9% on chars74k-05 and 27.8% on the ICDAR03-CH-5 test, and HOG features, which scored 45.3% on chars74k-05 and 51.5% on ICDAR03-CH-5. Whilst HOG features outperform Geometric Blur on both sets, they perform far better on ICDAR03 as opposed to chars74k, whereas the opposite is true for Geometric Blur. This may be down to differing levels of variation in position and size between the datasets. If the ICDAR03 set contained characters of relatively uniform position and size, but with a greater degree of intraclass variation, then this might explain the relative differences in performance between the methods. As all the schemes tried in this work have used global histograms, and should therefore be highly position invariant and at least partially invariant to size changes, we might expect a relative performance more similar to Geometric Blur than to

REFERENCES

[1] D. G. Lowe, “Object recognition from local scale-invariant features,” in Proc. International Conference on Computer Vision (ICCV’99), pp. 1150–1157, 1999.
[2] T. E. de Campos, B. R. Babu, and M. Varma, “Character recognition in natural images,” in Proceedings of the International Conference on Computer Vision Theory and Applications, Lisbon, Portugal, February 2009.
[3] S. Belongie, J. Malik, and J. Puzicha, “Shape matching and object recognition using shape contexts,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 24, pp. 509–522, April 2002.
[4] A. C. Berg, T. L. Berg, and J. Malik, “Shape matching and object recognition using low distortion correspondence,” in CVPR, vol. 1, pp. 26–33, 2005.
[5] K. Wang and S. Belongie, “Word spotting in the wild,” in ECCV (K. Daniilidis, P. Maragos, and N. Paragios, eds.), vol. 6311, pp. 591–604, Springer, 2010.
[6] N. Dalal and B. Triggs, “Histograms of oriented gradients for human detection,” in CVPR, vol. 1, pp. 886–893, 2005.
[7] L. D. Griffin, “The second order local-image-structure solid,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 29, pp. 1355–1366, 2007.
[8] L. D. Griffin and M. Lillholm, “Symmetry sensitivities of derivative-of-Gaussian filters,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 32, pp. 1072–1083, 2010.
[9] M. Crosier and L. Griffin, “Using basic image features for texture classification,” International Journal of Computer Vision, vol. 88, pp. 447–460, January 2010.
[10] A. J. Newell, L. D. Griffin, R. M. Morgan, and P. A. Bull, “Texture-based estimation of physical characteristics of sand grains,” in Proceedings of the International Conference on Digital Image Computing: Techniques and Applications, Sydney, pp. 504–509, 2010.
[11] T. Kailath, “The divergence and Bhattacharyya distance measures in signal selection,” IEEE Transactions on Communications Technology, vol. 15(1), pp. 52–60, 1967.
[12] S. M. Lucas, A. Panaretos, L. Sosa, A. Tang, S. Wong, and R. Young, “ICDAR 2003 robust reading competitions,” in Proceedings of the Seventh International Conference on Document Analysis and Recognition, pp. 682–687, IEEE Press, 2003.

