Science and Information Conference 2014 August 27-29, 2014 | London, UK
A Fully Visual Based Business Document Classification System

Ignazio Infantino, Umberto Maniscalco, Dario Stabile, Filippo Vella
Institute of High Performance Computing and Networking - ICAR
National Research Council - CNR
Viale delle Scienze, edificio 11, 90128, Palermo (PA), Italy
{infantino, maniscalco, stabile, vella}@pa.icar.cnr.it

Abstract—A fully visual approach to business document classification is presented. The paper describes how SURF visual features, extracted from the documents, can be used for business document recognition and classification. Some of the extracted features are used to compute a prototype that speeds up the comparison with a document class while obtaining the best recognition rate. Moreover, we can determine which features are relevant and select zones of interest in the documents. The experiments were performed on a set of real business documents of different typologies and companies. We also tested the robustness of our approach by adding artificial defects and noise to the original documents and classifying them using exclusively visual and graphical features. The capability of classifying documents without any kind of text analysis has the great advantage of making the system totally independent of the idiom.

Keywords—document classification; image processing; image features extraction
I. INTRODUCTION
The recognition and automatic classification of digital business documents are two difficult tasks that usually require both the extraction and the automatic analysis of the content (e.g. using OCR). Our hypothesis is that the appearance of the document (visual information), if properly managed, can be used to perform an effective recognition and classification, and can sometimes avoid content analysis altogether. Nevertheless, addressing the two aspects simultaneously or sequentially could achieve faster and more accurate classification results. For both automated document storing and document searching, merging modern image processing and machine learning techniques allows us to obtain satisfactory results that are independent of the textual content. In general, the image features involved in these tasks can be sub-symbolic, structural or textual. An important aspect to take into account is the choice of relevant and robust features extracted from the document image.

The document image classification algorithm proposed by Jagtap [1] classifies documents into three classes using low-level image features such as mean, variance, skewness, energy, contrast, homogeneity and entropy. Sarkar [2] adopts a classifier based on Viola-Jones features and uses maximum likelihood in the latent conditional independence (LCI) model. A document is represented by a list of features with an associated probability distribution, and the classification is achieved by applying an expectation-maximization algorithm.

Strictly related to the recognition and classification task is the definition of a metric to evaluate document similarity. Visual similarity of the document layout structure is used by Shin et al. for their classifier [3]. The authors use both geometric and sub-symbolic features obtained by analyzing rectangular regions of the image. Some of these features are text components, horizontal and vertical lines, column gap, row gap, number of 2D line sets, foreground pixel window density, foreground pixel bounding box density, and so on. Eglin and Bres [4] suggest a similarity measure based on the visual content of the document. This visual similarity is computed using textural and structural image features; in particular, the authors propose complexity and visibility as textural features.

Regarding the structural features, layout analysis can be divided into physical and logical layout analysis. The first concerns the composition of the document image as a hierarchical structure, while the second performs a semantic and logical labeling of image blocks. Eglin and Bres [5] use both textural and structural features to characterize the logical labeling of the physical layout of the document. Héroux et al. [6] perform an image segmentation on a document and describe the document as a hierarchical structure modeled by a tree in which nodes are suitably labeled. Several document image classification methods use both physical and logical layout representations. In these cases, the classification can be performed by layout graph matching [7], hidden tree Markov models [8], first order random graphs [9] or neural networks [10]. Usilin et al. [11] present a classifier that recognizes documents with a rigid geometry; for these types of documents it is possible to treat document image classification as an object detection problem.

In this paper we present an approach based on SURF (Speeded Up Robust Features) features [12]. It computes the descriptors from the upper quarter of the document, which appears to be the richest in invariant visual information. Relevant key points are used to define a prototype to perform a robust classification. Here we underline that we face
the problem of automatic business document recognition and classification, taking into account exclusively visual and graphical aspects of the document and excluding any kind of text analysis. This point has many significant implications, including the positive effect of making the system totally independent of the idiom. In section II we describe the proposed system, in section II-A the document representation is shown and finally, in section III, the experimental part is reported.

II. THE PROPOSED SYSTEM

This work is part of a wider research project named Innovative Document Sharing (IDS), which aims to introduce several aspects of artificial intelligence into a typical workflow of business document management in order to reduce human involvement in the processing chain. A fully visual and graphical approach to business document recognition and classification is introduced to achieve this goal. The acquired documents are the input of the processing system described here. They can be the output of the digitalization of paper documents or can be in an electronic format (typically pdf, but they can also be stored in any image representation format). The document workflow can manage both typologies and, after an initial type dependent processing, the elaboration is merged in a single processing sequence as shown in Fig. 1.

Paper documents can be affected by issues due to the acquisition process. For example, defects that can deteriorate system performance are: undesired rotation of the documents, noise due to the digitalization system and disturbances affecting the digitalization procedure. The same system can add to the digitalized copy some noise that is due to defects or dust on the lens. Finally, a wrong resolution setting or non optimized acquisition parameters can affect the output with unwanted artifacts. As shown in Fig. 1, both paper documents and e-documents can be processed by the system. In the case of paper documents, a pre-processing step is applied with the aim of "adjusting" the document. This phase typically covers noise reduction algorithms, rotation adjustment and cropping of the borders of the documents. The other steps consist of a feature extraction module to represent the document content (see II-A). This representation is used to produce a classification of the document and to propose a label for the input document.

Fig. 1. Complete framework for fully visual business document recognition and classification. Man-in-the-loop allows the system to detect errors and to adjust prototypes if needed.

The classification is checked by the user, who can validate the classification or trigger a change in the classification module. A further module, which is outside the scope of this paper, receives the user's feedback to reorganize the prototypes and improve the classifier performance.

Fig. 2. Some documents of the dataset used in the experiments. Images are given in small size since they contain sensitive data, but they are useful to understand the structure of this kind of document.

A. Document representation

Some examples of images of the paper documents used as input of the system are shown in Fig. 2. The structure of these documents shows that the most relevant region of the images is the upper part and that it can be described through local descriptors. Local features have attracted considerable interest for their capability to capture the information in a visual subpart of the image and to maintain the same representation when the image is affected by common image modifications. They are based on detectors that find interest points representing a portion of the image with
features that are invariant to the most typical disturbances. The algorithm for these features aims at finding points that show repeatability, distinctiveness and robustness with a process that is not computationally expensive. Repeatability is the property of maintaining the same key points even if the image is captured in different conditions. Distinctiveness assures that the key point description is suitable for matching similar points, and robustness is the property of having minimal variation when changes in illumination, rotation or blurring occur.

Among the local features, SIFT (Scale Invariant Feature Transform) [13] features are widely adopted for tasks involving image registration and image matching. These features are invariant to variations of scale, image rotation and affine variation of viewpoint. Changes of luminance are compensated by the SIFT representation as histograms of gradients, changes of scale are compensated by the point selection, and changes of rotation are compensated by the orientation normalization. Some attempts at further improving the SIFT algorithm have been made with PCA-SIFT [14], bi-SIFT [15] and Speeded Up Robust Features (SURF) by Bay et al. [12]. The SURF features yield results comparable to SIFT at a fraction of the computational cost and are more suitable for two-dimensional transformations.

The SURF features are based on a Hessian matrix. The Hessian has been shown to have good accuracy and can detect blob-like structures where the determinant is maximum. Given a point x = (x, y) in an image I, the Hessian matrix H(x, σ) in x at scale σ is defined as follows:

$$H(\mathbf{x},\sigma) = \begin{bmatrix} L_{xx}(\mathbf{x},\sigma) & L_{xy}(\mathbf{x},\sigma) \\ L_{xy}(\mathbf{x},\sigma) & L_{yy}(\mathbf{x},\sigma) \end{bmatrix} \qquad (1)$$

where $L_{ij}(\mathbf{x},\sigma)$ are the convolutions of the Gaussian second order derivative along the i and j axes with the image I in the point x.
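To make the role of this matrix more concrete, the following is a minimal sketch of a determinant-of-Hessian response map, not the SURF implementation itself. It assumes a synthetic Gaussian blob as the image and approximates the second derivatives with central finite differences rather than with Gaussian second order derivative convolutions at a scale σ; all function names are hypothetical.

```python
import math


def gaussian_blob(size, cx, cy, s):
    """Synthetic test image: one Gaussian blob centred at (cx, cy)."""
    return [[math.exp(-((x - cx) ** 2 + (y - cy) ** 2) / (2.0 * s * s))
             for x in range(size)]
            for y in range(size)]


def hessian_determinant(img):
    """Per-pixel det(H) = Lxx * Lyy - Lxy^2.

    Assumption: second derivatives are approximated by central finite
    differences; border pixels are left at 0.
    """
    h, w = len(img), len(img[0])
    det = [[0.0] * w for _ in range(h)]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            lxx = img[y][x + 1] - 2.0 * img[y][x] + img[y][x - 1]
            lyy = img[y + 1][x] - 2.0 * img[y][x] + img[y - 1][x]
            lxy = (img[y + 1][x + 1] - img[y + 1][x - 1]
                   - img[y - 1][x + 1] + img[y - 1][x - 1]) / 4.0
            det[y][x] = lxx * lyy - lxy * lxy
    return det


def strongest_response(det):
    """Return (x, y) of the pixel with the highest determinant."""
    _, x, y = max((v, x, y)
                  for y, row in enumerate(det)
                  for x, v in enumerate(row))
    return x, y


image = gaussian_blob(size=25, cx=12, cy=9, s=3.0)
print(strongest_response(hessian_determinant(image)))
```

On this toy image the strongest response falls on the blob centre, which is the behaviour an interest point detector based on the Hessian determinant relies on.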
The determinant of the matrix H(x, σ) represents the blob response of the image at the location x. The points with the highest determinant at multiple scales are selected as interest points. The points are described with a vector of 128 elements computed over a neighborhood divided into 4 x 4 sub-regions. For each of the 16 sub-regions a descriptor of four values is computed (the 128-element variant splits these sums according to the sign of the responses, giving eight values per sub-region). The values are stored in the following vector:

$$v = \left( \sum d_x,\ \sum d_y,\ \sum |d_x|,\ \sum |d_y| \right) \qquad (2)$$

The values of $d_x$ and $d_y$ are the Haar wavelet responses that approximate the Difference of Gaussians. The SURF features are particularly well adapted to scale changes and in-plane rotation, and this limitation in the range of application has the advantage of reducing the computational cost. This focus is particularly suitable for our system, which is limited to this kind of modification.

B. Document description

The proposed approach extracts the SURF features from each document and uses the set of key point descriptions to
characterize the documents. The main advantage of this choice is that each feature represents a local part of the document and no hypothesis on the structure is made. As a consequence, the global description is maintained even if the structure of the document is deformed or rotated. The properties of these local features also include robustness against rotation and scaling, so that when small modifications occur the description maintains the same values. The drawback of describing a document as a collection of key points is that the length of the description is not fixed and a different number of key points is extracted from similar documents. Therefore a classification module is needed that associates a document with the correct class.

C. Document classification

To classify a document we consider the matches between the key points of a test document and all the documents of the ground truth. To detect a match between two key points we use a threshold distance that is empirically determined. In a naive classification methodology a test document could be compared with all the documents of the ground truth, and the document showing the highest number of matches would provide the label for the new document. Since the number of key points is potentially high, we computed a set of prototypes to reduce the number of documents to be compared. The prototypes are chosen so that they cover all the classes with a sub-optimal strategy. The methodology can be made optimal in terms of minimal number of prototypes at an increased computational cost.
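The naive methodology described above can be sketched as follows. This is an illustration under stated assumptions rather than the system's code: descriptors are plain tuples of floats, a match is declared when the Euclidean distance to the nearest descriptor falls below a threshold, and the names `count_matches` and `classify_naive` as well as the toy data are hypothetical.

```python
import math


def count_matches(desc_a, desc_b, max_dist):
    """Count descriptors in desc_a whose nearest neighbour in desc_b
    lies closer than max_dist (brute-force search)."""
    matches = 0
    for a in desc_a:
        nearest = min((math.dist(a, b) for b in desc_b), default=math.inf)
        if nearest < max_dist:
            matches += 1
    return matches


def classify_naive(test_desc, labelled_docs, max_dist):
    """Return (label, matches) of the ground-truth document that shares
    the highest number of matches with the test document."""
    best_label, best_count = None, -1
    for label, desc in labelled_docs:
        count = count_matches(test_desc, desc, max_dist)
        if count > best_count:
            best_label, best_count = label, count
    return best_label, best_count


# Toy 2-D "descriptors"; real SURF descriptors have 64 or 128 values.
ground_truth = [
    ("acme", [(0.0, 0.0), (1.0, 1.0)]),
    ("globex", [(5.0, 5.0), (6.0, 6.0)]),
]
test_document = [(0.1, 0.0), (1.0, 0.9), (5.0, 5.1)]
print(classify_naive(test_document, ground_truth, max_dist=0.5))
```

Every test document is compared with every ground-truth document here, which is exactly the cost that the prototypes are meant to reduce.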
The methodology used is the following:

1) Fix all the desired classes to be detected
2) Set all the documents in the training set as unmarked
3) Compute all the key points for all the documents of the ground truth
4) Repeat as long as there is a class without prototypes or an unmarked document:
   a) choose an uncovered class λ
   b) choose the document d that has the highest number of valid matches with the other documents within the given class
   c) set d as prototype for the class
   d) mark all the documents in the class λ that have a number of matches with d higher than a given threshold

This algorithm assures that all the classes have one or more prototypes and that each document is associated with a prototype that is very similar to itself. This process leads to a low generalization capability for the system but allows a high recognition performance. The main purpose of the system, which is to correctly recognize documents, is met with the tradeoff of a lower recall.

Usually the documents produced by a company share a common layout, typically given by the logo and some information about the address of the company, placed in the upper part of the document. Starting from this hypothesis the number of points to be checked can be reduced by evaluating the position of the matched points and observing that the portion with the most matches inside the same class lies in a given region of the image.
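The prototype selection loop in steps 1) to 4) can be sketched as below. The sketch makes a few assumptions that the text does not spell out: the candidate prototype is chosen among the still unmarked documents of the class, its score is the total number of matches with the other unmarked documents, and the chosen prototype is itself marked. The function `select_prototypes`, the pairwise `match_count` callable and the toy match table are hypothetical.

```python
def select_prototypes(docs_by_class, match_count, min_matches):
    """Greedy, sub-optimal prototype selection.

    docs_by_class: dict mapping class label -> list of document ids
    match_count:   callable (doc_a, doc_b) -> number of key point matches
    min_matches:   a document is covered by a prototype when it shares
                   more than min_matches matches with it
    """
    prototypes = {}
    for label, docs in docs_by_class.items():
        unmarked = list(docs)
        prototypes[label] = []
        while unmarked:
            # b) document with the most matches against the other unmarked ones
            d = max(unmarked,
                    key=lambda c: sum(match_count(c, o)
                                      for o in unmarked if o != c))
            # c) set d as prototype for the class
            prototypes[label].append(d)
            # d) mark d and every document sufficiently similar to d
            unmarked = [o for o in unmarked
                        if o != d and match_count(d, o) <= min_matches]
    return prototypes


# Toy symmetric match table (hypothetical numbers).
pairs = {
    frozenset(("a1", "a2")): 20,
    frozenset(("a1", "a3")): 2,
    frozenset(("a2", "a3")): 3,
}


def toy_match_count(a, b):
    return pairs.get(frozenset((a, b)), 0)


print(select_prototypes({"A": ["a1", "a2", "a3"], "B": ["b1"]},
                        toy_match_count, min_matches=10))
```

In the toy data, document a3 shares few matches with the rest of its class and therefore becomes a second prototype, which illustrates why the number of prototypes can exceed the number of classes.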
Fig. 3. Some examples of feature extraction. All detected keypoints (red crosses), keypoints (blue circles) more relevant for matching, regions of interest (dashed green lines).
The experiments have led to the selection of a reduced number of points placed in the most discriminating region of the images. The matching of each key point has been performed through an improved Nearest Neighbor algorithm. The details and the results are shown in section III. Although the algorithm is independent of this hypothesis, it is possible to choose different parts of the documents that, from a preliminary analysis, provide the areas with the highest density of feature points. We found that the upper part of the document (approximately one third of the document height) is the most discriminative sub-region for the given classes.

III. EXPERIMENTS
The dataset used for the experiments consists of 339 business documents belonging to 67 companies. These documents cover different types such as bills of lading, invoices, bills of materials, mileage allowances, cheques, bank transfers, financial statements, forms and commercial business letters. The documents were acquired using a color digital scanner with a resolution of 2480x3508 pixels. Among them, two documents have horizontal page orientation, and one document is wrongly rotated by 90 degrees. For documents with more than one page, the image of each page was considered independently. The documents were acquired following the usual procedures of the personnel of the company that collaborated in the experiments, using their own software platform for document management. In the dataset considered, the following defects were found: handwritten notes, small rotations (less than 10 degrees) and small shifts (less than 5 mm), presence of staples, barcode stickers inserted in different positions, missing upper left corner of the page, holes for ring binders, folds that produce dark areas in the image, reflections, and erasures of handwritten notes.

The SURF features (with 128 values) were extracted from the images of the dataset, considering only the upper quarter of the document. Our experiments showed that the upper part of the document is relevant to the task of visually recognizing the company that originates the document (see the examples of Fig. 3). The result is understandable given that this region of the document usually contains a logo and some fixed pieces of text (often in a particular location). The extracted features were used for keypoint matching using FLANN (Fast Library for Approximate Nearest Neighbors). In particular, a matcher available in OpenCV (http://www.opencv.org) trains an index on a key point collection and runs a nearest neighbor search to find the best matches with a performance that is much better than brute force search [16]. The number of extracted keypoints depends on the resolution of the images, and for this reason the resolution was set to 620x295 pixels (considering only the upper part of the image), thus allowing us to set an optimal threshold (equal to 10) on the number of matches for the classification task. The classification is based on the largest number of matches found. It is also possible to determine a set of prototypes which guarantee the best recognition rate as explained in the previous
section. For the training phase we used all the available images, and 95 prototypes were determined that allow all the documents to be identified. The prototypes consist of lists of detected features and require only a few Kbytes of storage. The classification of the documents produced a success rate of 99%. To test the robustness of the recognition and classification, we used a subset of the original data, introducing some defects on the original images. The subset is composed of 83 images extracted from the original dataset by selecting the documents without relevant defects. They were processed with the following filters: diffuse glow distortion, corner cutting, pinch distortion, liquid filtering, facet pixelation, blurring, upper shifting.
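Applying one filter at a time to each selected image gives 83 × 7 defective images. The short sketch below illustrates that bookkeeping and how an error rate over such a set would be computed; the image identifiers and the function names are hypothetical, and the filters themselves are only represented by their names.

```python
from itertools import product

# Filter names as listed in the text.
FILTERS = [
    "diffuse glow distortion",
    "corner cutting",
    "pinch distortion",
    "liquid filtering",
    "facet pixelation",
    "blurring",
    "upper shifting",
]


def build_defect_set(image_ids, filters=FILTERS):
    """One (image, filter) pair per defective image: a single defect each."""
    return list(product(image_ids, filters))


def error_rate_percent(n_errors, n_total):
    """Misclassified images as a percentage of all processed images."""
    return 100.0 * n_errors / n_total


# Hypothetical identifiers for the 83 selected images.
defect_set = build_defect_set([f"doc_{i:03d}" for i in range(83)])
print(len(defect_set))
print(round(error_rate_percent(1, len(defect_set)), 2))
```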
The classification task performed on the set so formed produced only one error on 83*7 = 581 processed images, i.e., an error rate of 0.17%. The obtained results are very promising, and we are moving to a validation with a larger dataset. Regarding the sensitivity to a single defect, we counted the average number of matches relevant for recognition and calculated the percentage of preserved matches with respect to the original documents (see Table I). The table shows that the defects that significantly reduce the number of matches are diffuse glow, pinch, and blurring. However, as previously stated, the recognition rate remains very high.

IV. CONCLUSIONS
We presented a fully visual classification system for business documents, tested on a real dataset. We obtained very good results with artificially imposed artifacts and we are considering extending the dataset with more real documents. The main advantage of our approach is that it does not involve text analysis and is completely independent of the idiom. Furthermore, it can be enriched with other approaches considering the structure of the document and other visual aspects.

TABLE I.
SENSITIVITY TO DEFECTS

Defect            % of preserved matches
diffuse glow      27.9%
corner cutting    74.8%
pinch             29.3%
liquid            65.2%
pixelation        37.0%
blurring          23.1%
shifting          40.5%
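The percentage of preserved matches can be read as the average number of matches on the defective images relative to the average on the corresponding originals. The sketch below shows one way to compute such a figure; the exact averaging used for Table I is not detailed in the text, so the ratio of averages, the function name and the example counts are assumptions.

```python
def preserved_matches_percent(original_counts, defective_counts):
    """Average match count on defective images, expressed as a
    percentage of the average match count on the original images.

    Assumption: both lists refer to the same images in the same order.
    """
    if not original_counts or len(original_counts) != len(defective_counts):
        raise ValueError("need two non-empty lists of equal length")
    avg_original = sum(original_counts) / len(original_counts)
    avg_defective = sum(defective_counts) / len(defective_counts)
    return 100.0 * avg_defective / avg_original


# Hypothetical match counts for two documents before and after a filter.
print(preserved_matches_percent([40, 60], [10, 15]))
```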
Fig. 4. An example of a document with different defects added. The first row reports the original document (the left corner of the upper quarter), and the subsequent rows show the effects of the following filters: diffuse glow distortion, corner cutting, pinch distortion, liquid filtering, facet pixelation, blurring, upper shifting.

The seven types of defects were chosen considering the ones that can usually be found in documents acquired manually with a commercial digital scanner (see Fig. 4). In order to understand which defects are the most critical for recognition, we chose to introduce a single defect on each image.

ACKNOWLEDGMENT

This research was supported by Regione Sicilia grant POR FESR 2007/2013 Linea d’intervento 4.1.1.1 Progetto IDS, INNOVATIVE DOCUMENT SHARING. The authors would like to thank Dr. Giovanni Spoto for his contribution to the analysis of the state of the art.

REFERENCES

[1] R. Jagtap, “An improved processing technique with image mining method for classification of textual images using low-level image features,” International Journal of Advanced Computer Science, vol. 2, no. 2, pp. 79–84, 2012.
[2] P. Sarkar, “Image classification: Classifying distributions of visual features,” in Proc. of the 18th International Conference on Pattern Recognition - Volume 02, ser. ICPR ’06, Washington, DC, USA, 2006, pp. 472–475.
[3] C. Shin, D. Doermann, and A. Rosenfeld, “Classification of document pages using structure-based features,” International Journal on Document Analysis and Recognition, vol. 3, no. 4, pp. 232–247, 2001.
[4] V. Eglin and S. Bres, “Document page similarity based on layout visual saliency: Application to query by example and document classification,” in Proceedings of the Seventh International Conference on Document Analysis and Recognition, ser. ICDAR ’03, vol. 2, Washington, DC, USA, 2003, pp. 1208–1212.
[5] V. Eglin and S. Bres, “Analysis and interpretation of visual saliency for document functional labeling,” International Journal on Document Analysis and Recognition, vol. 7, no. 1, pp. 28–43, Mar. 2004.
[6] P. Heroux, S. Diana, A. Ribert, and E. Trupin, “Classification method study for automatic form class identification,” Pattern Recognition, International Conference on, vol. 1, pp. 926–929, 1998.
[7] J. Liang and D. S. Doermann, “Logical labeling of document images using layout graph matching with adaptive learning,” in Proceedings of the 5th International Workshop on Document Analysis Systems V, ser. DAS ’02. London, UK: Springer-Verlag, 2002, pp. 224–235.
[8] M. Diligenti, P. Frasconi, and M. Gori, “Hidden tree markov models for document image classification,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 25, no. 4, pp. 519–523, Apr. 2003.
[9] A. Bagdanov and M. Worring, “Fine-grained document genre classification using first order random graphs,” in Proceedings of the Sixth International Conference on Document Analysis and Recognition, ser. ICDAR ’01, Washington, DC, USA, 2001, pp. 79–83.
[10] F. Cesarini, M. Lastri, S. Marinai, and G. Soda, “Encoding of modified x-y trees for document classification,” in Proceedings of the Sixth International Conference on Document Analysis and Recognition, ser. ICDAR ’01, Washington, DC, USA, 2001, pp. 1131–1136.
[11] S. Usilin, D. P. Nikolaev, V. V. Postnikov, and G. Schaefer, “Visual appearance based document image classification,” in ICIP. IEEE, 2010, pp. 2133–2136.
[12] H. Bay, T. Tuytelaars, and L. Van Gool, “Surf: Speeded up robust features,” in Computer Vision–ECCV 2006. Springer, 2006, pp. 404–417.
[13] D. G. Lowe, “Object recognition from local scale-invariant features,” in Computer Vision, 1999. The Proceedings of the Seventh IEEE International Conference on, vol. 2. IEEE, 1999, pp. 1150–1157.
[14] Y. Ke and R. Sukthankar, “Pca-sift: A more distinctive representation for local image descriptors,” in Proc. of Computer Vision and Pattern Recognition (CVPR) 04, 2004, pp. 506–513.
[15] I. Infantino, G. Spoto, F. Vella, and S. Gaglio, “Composition of sift features for robust image representation,” Proceedings of SPIE, vol. 7540, p. 754016, 2010.
[16] M. Muja and D. G. Lowe, “Fast approximate nearest neighbors with automatic algorithm configuration,” in International Conference on Computer Vision Theory and Application (VISAPP’09). INSTICC Press, 2009, pp. 331–340.