The precise and efficient identification of medical order forms using Shape Trees

Uwe Henker1, Uwe Petersohn2, and Alfred Ultsch3

1 DOCexpert Computer GmbH, Kirschäckerstraße 27, D-96052 Bamberg [email protected]
2 Dresden University of Technology, Department of Computer Science, D-01062 Dresden [email protected]
3 University of Marburg, Data Bionics Group, Hans-Meerwein-Straße, D-35032 Marburg [email protected]

Abstract
A powerful and flexible technique to identify, classify and process documents using images from a scanning process is presented. The types of documents can be described to the system as a set of differentiating features in a case base using shape trees. The features are filtered and abstracted from an extremely reduced scanner image of the document. Classification rules are stored with the cases to enable precise recognition and the subsequent mark reading and Optical Character Recognition (OCR) steps. The method is implemented in a system that processes the majority of requests for medical lab procedures in Germany. A large practical experiment with data from practitioners was performed. An average of 97% of the forms were correctly identified; none were identified incorrectly. This meets the quality requirements for most medical applications. The modular description of the recognition process allows flexible adaptation to future changes in the form and content of the documents' structures.

Key words: Document Identification, Shape Tree, Document Processing, Case Based Reasoning, Optical Mark Reading, Similarity, Segmentation.

1 Introduction
Identifying documents on the basis of scanned images is a frequently used process in image processing. In document management systems (DMS) in particular, it is a basic prerequisite. Although it is possible to gather information by analyzing the raster graphic, efficiently identifying a document or ascertaining similarity is difficult without first assigning unique characteristics. A possible approach to identification without unique identifiers is to abstract the image. This makes it possible to distinguish the regions that form the characteristics. Depending on the degree of abstraction, the image's layout can be determined. Hierarchies can also be employed to describe the regions.


This article describes the identification of documents on the basis of abstracted images. The approach includes:
1. Image pre-processing, which is necessary for successful identification because it creates a suitable starting point for the comparison.
2. Shape trees, which store the distinctive characteristics that sufficiently describe the layout.
3. Case-based reasoning (CBR), which is used to access the cases saved as shape trees. This makes it possible to search efficiently for similar documents and thus to find the case learned a priori.

2 Geometrical Shapes for Determining Similarity

2.1 Object recognition
The general problem model is object recognition, which seeks to detect the presence of a known object (here, a document) in a new image. Object (class) recognition is basically a classification problem: assign a class label to an input vector. Some techniques are described in [8], [9], [10] and further methods in [11], [12], [13], [14], [15]. We present a hierarchical description of objects and associations. With case-based reasoning techniques and instance-based learning, object class recognition can be regarded as a classification problem in which the class is predicted by means of the query image representation.

2.2 Shapes as Models for Regions
If regions can be modeled using geometrical shapes such as rectangles or polygons, then such a model can also be used to calculate similarity more precisely according to a degree of abstraction [4]. Shapes in different regions can have the following relationships to each other:
• one region's shape can contain another,
• the shapes can partially overlap,
• the shapes can touch at one or more points, or
• the shapes can be disjoint, in which case a minimum distance can be set for how far apart two shapes can be.

If it is possible to produce various shapes for a domain and to identify them with similar regions, then the similarity between two shapes can be differentiated as follows:
• one shape contains the other,
• the shapes overlap, and
• the shapes are not more than a set distance apart from each other.


A similarity definition, which can be calculated for any two shapes by a similarity function $SIM_{Shapes}$, can lead to the following results:

$SIM_{Shapes}(s_1, s_2) = (distance, contains, matches, touches)$

The values represent the following:
• distance: the distance between two shapes,
• contains: one shape contains the other,
• matches: the shapes overlap,
• touches: the shapes touch.
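The four-valued result can be illustrated for the simplest case of two axis-aligned rectangles. The following is only a minimal sketch: the names `Rect` and `sim_shapes` are hypothetical, and the choices to measure distance as the gap between the rectangles and to count a contained shape as also overlapping are assumptions that the text does not prescribe.

```python
from dataclasses import dataclass
from math import hypot


@dataclass(frozen=True)
class Rect:
    """Axis-aligned rectangle; (x0, y0) is one corner, (x1, y1) the opposite one."""
    x0: float
    y0: float
    x1: float
    y1: float


def sim_shapes(s1: Rect, s2: Rect) -> tuple[float, bool, bool, bool]:
    """Return (distance, contains, matches, touches) for two rectangles."""
    # Gap along each axis; zero when the projections overlap or meet.
    dx = max(s2.x0 - s1.x1, s1.x0 - s2.x1, 0.0)
    dy = max(s2.y0 - s1.y1, s1.y0 - s2.y1, 0.0)
    distance = hypot(dx, dy)

    def inside(inner: Rect, outer: Rect) -> bool:
        return (outer.x0 <= inner.x0 and outer.y0 <= inner.y0
                and inner.x1 <= outer.x1 and inner.y1 <= outer.y1)

    # Assumption: "contains" is symmetric, either shape may be the outer one.
    contains = inside(s1, s2) or inside(s2, s1)
    # Interiors intersect (strict inequalities on both axes).
    matches = (s1.x0 < s2.x1 and s2.x0 < s1.x1
               and s1.y0 < s2.y1 and s2.y0 < s1.y1)
    # Boundaries meet, but the interiors do not intersect.
    touches = distance == 0.0 and not matches
    return distance, contains, matches, touches
```

For example, a small box nested in a larger one yields a distance of zero with `contains` and `matches` set, while two boxes sharing only an edge yield a distance of zero with only `touches` set.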

2.3 Modeling Regions as a Shape Tree
The aim is to organize the regions modeled by shapes as a tree so that similarities can be searched for efficiently. It should thus be possible to find similar entries for a given query shape. To simplify the process, each shape is assigned a minimum bounding box that contains the shape. In an initial step, this makes calculating the distance easier, as only the distance between the two rectangles is calculated. The original shapes are used only once it has been ascertained that both rectangles are within a defined area. Here, the definition of shapes and forms is by no means limited to primitive object models. A large number of descriptors can be used. For example, results have been achieved using edge detectors in [1], [7] and Fourier descriptors in [5], [7].

2.4 Shape Tree Structure
The following discussion is limited to simple primitive forms as a starting point. Although, as already mentioned, a multitude of definitions is possible, the shapes used here are simple for the purposes of illustration. [Figure 1] shows six shapes, some of which overlap or are contained in another shape. [Figure 2] shows the corresponding shape tree. The top node contains the rectangle F, which contains all the remaining shapes. The shapes A, B, C and E, which are contained in F but in no other shape, follow as children of the root node.

Fig. 1. Shapes in a model

Fig. 2. Example of a shape tree


In contrast, shape D is subordinate to C because the triangle D is located in the circle C. The shape tree is not height balanced. However, a degree of balance is possible by limiting a node to a certain maximum number of children. This maximum can be maintained by splitting the child nodes. The following formal expressions can be used to represent a shape tree's data structure:
• Node: $f = (f_n, fS_n)$, where $f_n$ is the name of the node and $fS_n$ is a set of attributes.
• Attribute: $s = (s_n, sFct)$, $s \in fS_n$, where $s_n$ is the attribute name and $sFct$ is the set of facets. A node in a shape tree, which describes a region, must have at least the following attributes:
  – Shape: a geometrical shape that models the region in question.
  – Bounding box: a geometrical shape (here, a rectangle) that contains the model shape.
  – Semantic: a semantic description of the region.
  – Neighborhood distance: shapes within this distance are considered as neighboring and therefore as similar.
• Facet: $Fct = (Fct_n, FctEINTR)$, where $Fct_n$ is the name (type) of the facet and $FctEINTR$ is the set of entries represented.
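The node, attribute and facet tuples map naturally onto three small record types. The sketch below is one possible reading, not the authors' implementation: the class names, the `"value"` facet, the helper `make_node` and all coordinates are hypothetical, and the shape kinds of A, B and E are left unspecified because the text does not state them.

```python
from dataclasses import dataclass, field


@dataclass
class Facet:
    name: str                                   # facet type, e.g. "value" or "default"
    entries: list = field(default_factory=list)


@dataclass
class Attribute:
    name: str
    facets: dict[str, Facet] = field(default_factory=dict)

    def value(self):
        """First entry of the (hypothetical) "value" facet, or None."""
        facet = self.facets.get("value")
        return facet.entries[0] if facet and facet.entries else None


@dataclass
class Node:
    name: str
    attributes: dict[str, Attribute] = field(default_factory=dict)
    children: list["Node"] = field(default_factory=list)


# The four attributes every region node must carry according to the text.
REQUIRED = ("shape", "bounding_box", "semantic", "neighborhood_distance")


def make_node(name: str, **values) -> Node:
    missing = [key for key in REQUIRED if key not in values]
    if missing:
        raise ValueError(f"node {name!r} lacks attributes: {missing}")
    attributes = {key: Attribute(key, {"value": Facet("value", [val])})
                  for key, val in values.items()}
    return Node(name, attributes)


def example_tree() -> Node:
    """Tree of Figure 2: F is the root, A, B, C, E its children, D below C.
    All coordinates are made up for illustration."""
    def n(name, shape, box):
        return make_node(name, shape=shape, bounding_box=box,
                         semantic=name, neighborhood_distance=2)

    root = n("F", "rectangle", (0, 0, 100, 60))
    a = n("A", "unspecified", (5, 5, 25, 20))
    b = n("B", "unspecified", (30, 5, 45, 20))
    c = n("C", "circle", (50, 10, 90, 50))
    e = n("E", "unspecified", (5, 30, 40, 55))
    c.children.append(n("D", "triangle", (60, 20, 80, 40)))
    root.children.extend([a, b, c, e])
    return root
```

Keeping facets as a dictionary leaves room for the default values or methods mentioned in the text without changing the node type.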

Different facet types make a specific representation of a shape tree's knowledge elements possible within the context of the attribute characteristics. Apart from the actual attribute characteristics, default values, methods, etc. can be saved for further processing. In constructing a shape tree, the following definitions and characteristics must be considered:
1. A shape tree's node $f = (f_n, fS_n)$ defines a classifier. A classifier $e(x, f)$ for a set $M$ is a transformation $e : M \rightarrow I$ with $I = \{0, 1\}$, $P = \{x \mid e(x, f) = 1\}$ (set of positive elements) and $N = \{x \mid e(x, f) = 0\}$ (set of negative elements).
2. A node $f_1$ in shape tree $K$ is more general than a node $f_2$ ($f_2 \leq f_1$) iff for all $x \in M$: if $e(x, f_2) = 1$, then also $e(x, f_1) = 1$.
3. The shape tree must always be constructed in such a way that $f_i \geq f_i.child$ is true.

2.5 Searching in a Shape Tree
The following describes the method for efficiently searching for similar shapes in a shape tree. Before searching for shapes, the criteria for deciding whether two shapes are similar must be set. That is, a similarity interpretation function based on the similarity description given above must be defined.


$SRI_{\Pi} : SR \rightarrow I$, $I = \{identical, similar, not\ similar, ambiguous\}$

Searching begins with the inspection of the node's bounding box. If the inspection is successful, that is, the shapes are similar, the model shape is verified, provided the node is of the semantic type. If the model shape and the query shape meet the criteria of the comparison, the node with the model shape and its semantic value becomes part of the solution pool of similar shapes. The children of the current node are then inspected recursively. If the query criterion is not met, the child nodes do not require any further attention. This follows from the shape-tree definition: the shapes of the subordinate nodes are contained within the bounding box of the current node.

The comparison of the individual shapes after successfully testing the distance between the bounding boxes must consist of two components. First, the distance between the forms with regard to their length is determined via a Euclidean distance measure; second, the forms themselves are compared. Comparing Fourier descriptors is useful here if the shapes are more complex. In the case of simpler forms, simpler comparison mechanisms can be applied. For complex forms, other methods of pattern matching are suitable. Disparities within a query image can cause recognition to fail: an object of circular form, for example, may not be recognized although it is saved at the appropriate location in the reference tree. In such a case, a more complex form description using Fourier descriptors is carried out so that a comparison between different description forms is possible.

Starting from the root node, the tree is traversed to the leaves and the respective child nodes are matched. The form similarity is ascertained by inspecting the type of the corresponding object and applying a similarity measure. It is then possible to compare two cases, and therefore also to search the saved prototype cases.

3 Document Identification of Specialized Order Forms
Identifying medical order forms poses a special challenge in terms of both the quality and the quantity of processing. The order forms contain the medical requests as marks placed in a defined grid on the form. After the form has been identified, the processing rule is read as part of the solution and the form is analyzed using special OMR (Optical Mark Recognition) techniques. Coordinates and rules for analyzing OCR fields can also be saved as part of the solution. Identifying such forms represents a special case of our algorithm. The case base is saved as a modified shape tree in which each node represents a shape (box). Due to the design of such forms, only rectangular frames are used in the shape tree. These may contain subordinate boxes.


Fig. 3. Example form

The following definitions are set, which specify the actual application at hand:
• leaf nodes are boxes, and
• boxes in child nodes must be completely contained within the parent node's box.

In creating the shape tree, the following is true:
• a shape tree is structured by iterative decomposition of the image in the frame, and
• only rectangles are employed due to the requirements of the forms used.
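The containment rule for boxes can be turned into a small tree builder. Note the swap of technique: the text describes an iterative decomposition of the image, whereas this sketch assumes the boxes have already been extracted and only nests them by containment. `BoxNode` and `build_tree` are hypothetical names.

```python
from dataclasses import dataclass, field

Box = tuple[int, int, int, int]  # (x0, y0, x1, y1)


@dataclass
class BoxNode:
    box: Box
    children: list["BoxNode"] = field(default_factory=list)


def contains(outer: Box, inner: Box) -> bool:
    return (outer[0] <= inner[0] and outer[1] <= inner[1]
            and inner[2] <= outer[2] and inner[3] <= outer[3])


def area(box: Box) -> int:
    return (box[2] - box[0]) * (box[3] - box[1])


def build_tree(frame: Box, boxes: list[Box]) -> BoxNode:
    """Nest boxes so that every child lies completely inside its parent."""
    root = BoxNode(frame)
    # Largest boxes first, so a parent is always inserted before its children.
    for box in sorted(boxes, key=area, reverse=True):
        if not contains(frame, box):
            raise ValueError(f"box {box} lies outside the frame {frame}")
        node = root
        while True:
            # Descend into the first child that fully contains the new box.
            parent = next((c for c in node.children if contains(c.box, box)), None)
            if parent is None:
                break
            node = parent
        node.children.append(BoxNode(box))
    return root
```

Boxes that merely overlap are kept as siblings, which matches the rule that only complete containment creates a parent-child relation.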

After the necessary pre-processing (deskewing and cropping), the derived shape tree is compared with the available case base. If a case similar to the current one is not found, we apply instance-based learning techniques such as IBL3. The cases to be identified must be known to the system a priori. In the present application, returning the most similar case does not make sense, because it generally differs from the query and can, among other things, lead to incorrect results in the semantic interpretation. Similarities in the identification of the forms are only allowed when comparing the distinctive interpreted areas. For a new query, the search in the case base starts by pre-processing the query image. A reduced copy of the form is produced first.
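How the reduced copy is computed is not specified in the text. One plausible reading, shown here purely as an assumption, is a block-wise reduction of a binary image in which a reduced pixel is set as soon as any pixel in its block is set. The function name `reduce_image` and the factors `fx` and `fy` are hypothetical; with the ratios reported in the experiments they would be 60 and 40.

```python
def reduce_image(pixels: list[list[int]], fx: int, fy: int) -> list[list[int]]:
    """Shrink a binary image by fx horizontally and fy vertically.

    A reduced pixel is 1 if any source pixel in its fx-by-fy block is 1,
    so thin lines survive the reduction (assumed behavior).
    """
    height = len(pixels)
    width = len(pixels[0]) if pixels else 0
    reduced = []
    for top in range(0, height, fy):
        row = []
        for left in range(0, width, fx):
            block = (pixels[y][x]
                     for y in range(top, min(top + fy, height))
                     for x in range(left, min(left + fx, width)))
            row.append(1 if any(block) else 0)
        reduced.append(row)
    return reduced
```

A reduction of this kind merges nearby marks and handwriting into coarse connected areas, which is the effect the approach relies on.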


Fig. 4. Reduced copy of a form

Based on this, the connected areas of the image are determined and a query shape tree is formed using the line coincidence algorithm discussed above. The shape tree describes the image's characteristics sufficiently. [Figure 5] shows the identified connected areas from which the shape tree is formed.

Fig. 5. Representation of the image areas

The subsequent search in the case base is iterative: in the first step, the structures of the shape trees are compared. If no match is found, the second step compares the leaves according to a defined distance measure. If there is sufficient similarity, the case is identified and can be processed further according to a set rule.


Fig. 6. The resulting shape tree

4 Experiments
In this section, the quality of the presented process is demonstrated in a series of experiments. The quality of the form identification is influenced by the process for calculating the shape tree. Different parameters in the reduction of the image information lead to differing connected regions, and these in turn form the shape tree. The aim of the experiments was to determine the optimal parameters for the forms used.

A total of 2463 forms of four different types were made available by a North German laboratory for the tests. They were filled out by doctors and nurses in various institutions for the purpose of ordering medical services. The four form types were first added to the case base. Important parameters for the identification are the reduction of the image information for forming the shape tree and the similarity threshold used when comparing the leaf nodes. The algorithms presented here were tested using various parameters for the reduction. The values for the horizontal and vertical reduction were varied. In the first test, 25 forms of each of the four form types were used. The best results were achieved with a horizontal reduction of 1:60 and a vertical reduction of 1:40. At a higher reduction, that is, with less image information during the comparison, distinction was no longer possible. At a lower reduction, the variation between the images was weighted too heavily, and the necessary adjustment of the similarity also led to incorrect identifications. Using this optimal configuration, the remaining forms were processed. The results are displayed in [Table 1].

Form      Number   Correct-Positive   Sensitivity   Specificity
Form 1       785                763           97%          100%
Form 2       780                759           97%          100%
Form 3       781                747           96%          100%
Form 4       117                115           98%          100%
total       2463               2384           97%          100%

Table 1. Results
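The sensitivity column of Table 1 can be checked directly from the counts, taking sensitivity as the share of correctly identified forms among all forms of a type. The short script below copies the counts from the table; the variable names are illustrative only. Specificity is not recomputed because the table lists no counts of negative cases.

```python
# (number of forms, correctly identified forms) per form type, from Table 1
TABLE_1 = {
    "Form 1": (785, 763),
    "Form 2": (780, 759),
    "Form 3": (781, 747),
    "Form 4": (117, 115),
}


def sensitivity(number: int, correct: int) -> float:
    """Share of forms of one type that were identified correctly."""
    return correct / number


total_number = sum(number for number, _ in TABLE_1.values())
total_correct = sum(correct for _, correct in TABLE_1.values())

percent = {name: round(100 * sensitivity(number, correct))
           for name, (number, correct) in TABLE_1.items()}
percent["total"] = round(100 * sensitivity(total_number, total_correct))

for name, value in percent.items():
    print(f"{name}: {value}%")
```

The per-type counts add up to the totals of 2463 forms and 2384 correct identifications, and the rounded percentages agree with the table.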


Failed pre-processing was determined to be the reason that some forms were not recognized. In particular, the alignment marks were not recognized. Interestingly, in general only one of the four corners was not found. Calculating the unidentified corner from the remaining ones would be useful here for the alignment. The implemented algorithm delivered no incorrect results. The characteristic elements were grouped and used to form a shape tree. The identification was performed in two steps: first, by comparing the resulting tree structure and, second, by comparing the formed shapes. The second comparison was only carried out when the comparison of the tree structure did not yield a result. The presented test demonstrates the efficiency of the approach using four different, but very similar, forms.

5 Discussion
The identification of medical order forms using their layouts was described. The forms used here have a clear structure, which enables a high rate of recognition. Their use is primarily limited to the marking of orders. Because of this, the forms contain little variable data, which would otherwise be a disruptive factor for identification. Unlike in [2], the analysis does not move from the detail level to structural connections; instead, the abstraction of the image serves as the basis for analyzing the layout (footnote 4). By reducing the image information, it is possible to suppress or completely remove irrelevant and disruptive data in this phase of processing. By heavily reducing the image information, the connected areas of the image can be summarized and recognized as geometrical shapes. This leads to the forming of the shape tree and at the same time justifies its use. The result of 97% correctly classified forms was not expected on the basis of the available test data. Apart from the order markings, the forms contained additional handwritten information in various areas of the page, which affected the layout. Algorithms that require distinctive characteristics, such as characters or geometrical symbols located in set areas of a form, fail if any of these symbols are written over, glued over or modified in some other way. The presented approach, however, has a high degree of tolerance to changes on the form.

Footnote 4: In [2], a process is presented in which documents are compared by their layout with a collection of already identified documents. A total of 2555 documents divided into 18 classes were examined. The results varied, depending on the algorithm and document used, between an average of 72% and 85%.

6 Summary
This article discussed the identification of documents using the abstraction of a created image. The types of documents to be identified must be known to the system a priori. For this, the necessary characteristics are saved in a case base as shape trees. The case base also contains rules for possible further processing. It was shown that, in an extremely reduced image, it is possible to filter out the significant, characteristic image information and to identify it using CBR. The method was demonstrated and shown to be effective in experiments using medical order forms. The resolution selected for scanning, as well as the parameters for abstracting the form, are key to the quality achieved. Based on the achieved results, the approach described here offers a solution for identifying documents by their layout where identification via conventional elements, such as barcodes, fails. The CBR methods employed have proven suitable.

References
1. W. Abmayr. Einführung in die digitale Bildverarbeitung. Teubner, 1994.
2. M. Huang, D. DeMenthon, D. Doermann, L. Golebiowski, and B.A. Hamilton. Document Ranking by Layout Relevance. Eighth International Conference on Document Analysis and Recognition, 2005.
3. B. Jähne. Digitale Bildverarbeitung. Heidelberg: Springer. Berlin, 2002.
4. T. Lunze. Entwurf und Implementierung von Komponenten für Case Base Reasoning zum iSuite Wissensbanksystem. Diplomarbeit TU Dresden, 2005.
5. E. Persoon, K. Fu. Shape discrimination using Fourier descriptors. IEEE Trans. Syst. Man Cybern., vol. SMC-7, no. 3, pages 1119–1122. Montreal 1977.
6. U. Petersohn. Vorlesung: Angewandte Systeme der künstlichen Intelligenz. TU Dresden, Fakultät Informatik, Institut für Künstliche Intelligenz, 2006.
7. A. Rosenfeld. Digital Picture Processing, Computer Science and Applied Mathematics Vol. 2. Academic Press INC., 2006.
8. V. Ferrari, L. Fevrier, F. Jurie, C. Schmid. Groups of adjacent contour segments for object recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, pages 36–51, 2008.
9. B. Epshtein, S. Ullman. Feature hierarchies for object classification. Proc. International Conference on Computer Vision, 30(1), pages 220–227, 2005.
10. D. Lowe. Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision, 60(2), pages 91–110, 2004.
11. D. Comaniciu, P. Meer. Mean shift: a robust approach toward feature space analysis. IEEE Trans. on Pattern Analysis and Machine Intelligence (24), pages 603–619, 2002.
12. D. Nistér, H. Stewénius. Scalable recognition with a vocabulary tree. Proc. IEEE Int. Conference on Computer Vision and Pattern Recognition (CVPR), 2006.
13. D. Lowe. Object recognition from local scale invariant features. Proc. International Conference on Computer Vision (ICCV), 1999.
14. B. Leibe, B. Schiele. Interleaved object categorization and segmentation. Proc. British Machine Vision Conference, pages 759–768, 2003.
15. B. Leibe. Interleaved object categorization and segmentation, PhD thesis. ETH Zürich, 2004.
