Content-based hypertext creation in text/figure databases

Marcel Worring and Arnold W.M. Smeulders
Intelligent Sensory Information Systems, Department of Computer Science, University of Amsterdam, The Netherlands
[email protected]
Abstract. We consider the construction of hypertext for documents containing a mixture of text and figures, to provide convenient access to the content of both. The focus of this paper is on detecting alphanumeric labels in the figures, to be used as anchors in the hypertext construction.
1 Introduction

More and more documents are becoming available in digital form and are stored in document databases or made accessible through the World Wide Web. Most of these documents contain a mixture of text and figures. Full-text retrieval and automatic extraction of text-to-text hyperlinks [4] are becoming commonplace. Such a level of functionality has not yet been realized for figures or for the combination of text and figures. However, for many classes of documents convenient access requires that the content of the figures be included as well; after all, the figures have been included by the authors to underline the most important points they wish to make. Notably in patent applications and manuals, the accompanying text consists largely of extended captions of the figures.

Consider for example a user manual for a PC. Such a manual will contain a figure showing the monitor, with labels attached to the different buttons. These buttons have a description in the accompanying text. Such a reference may be at a considerable distance in the linear reading order of the document. When accessing large files of digital documents, it is therefore of great value to have the non-linear hypertext structure of the document available. One can then create separate text and figure channels, where clicking a keyword in the text automatically turns the focus of attention in the figure channel to the corresponding figure (even if that is in another chapter, e.g. where the definition is) and to the proper position within this figure. Vice versa, clicking in the figure should display the associated text.

The above is not easy to achieve, as in almost any digital document the figure is available in bitmap format only. In document image databases this is even more complicated, as these are usually created by scanning paper documents, so the whole document is stored as a bitmap. As a consequence, most systems
providing hyperlink access to (scanned) documents [4, 5, 6] limit their attention to links between fragments of text, or links to figures as a whole. One of the few systems relating text to figure details is the Piction system [7], developed for analyzing newspaper photographs by means of their captions. Although its goal was not hypertext construction, it can be viewed as such.

We are developing a system for hypertext construction within the domain of patent applications, manuals, and other documents with figures created by a drawing package [9, 10]. Current topics of the system are described in this paper, which is organized as follows. In section 2, an overview of the system is given and a classification of the different labels in a figure is made. For the particular class of alphanumeric labels we present a new detection algorithm in section 3. Results are presented in section 4.
2 The hyperdocument system

2.1 Hypertext

The essence of hypertext is formed by the links from and to (composite) components and anchors within the components [2]. For text/figure documents the composite components are typically formed by the logical structure [1], which among other things defines the linear reading order. The components are the individual text blocks, figures, and lines. Finally, anchors are keywords in text blocks and labels (or other details) in the figures. Hence, when the input to the system is a scanned document, our goal is to

1. find the components by analyzing the image of the document,
2. find anchors in both text and figures,
3. create the relevant links to anchors and/or components.

The first step in the analysis is the classification of the different document parts into the geometric classes text, figure, line, etc. Many algorithms are known, e.g. [8]. We have developed a similar bottom-up approach with a visual algorithm editor allowing fast adaptation to new application domains. For efficiency reasons this classification is performed at a reduced resolution (in our case reduction by a factor 4 in both the x- and y-direction). The later analysis of figures is performed at full resolution, as otherwise small details are lost. In all cases the parameters are specified in actual document dimensions rather than pixels. Hence, increasing the resolution always yields more accurate results. If required, the user can interactively correct the classification.

Every block in the image classified as text is sent to an external Optical Character Recognition program to reveal the ASCII text conveyed by the facsimile of the text. These pieces of text and the figures in the document form the basis for the hypertext construction. Note that at this point the only distinction between a scanned document and a generic ASCII text combined with figures is that the former might contain more errors due to improper recognition.
Fig. 1. (a) plain alphanumeric labels, (b) two examples of generic alphanumeric labels, (c) icon labels, and (d) legend labels.
2.2 Anchors and links

We now turn our focus to the figures in the document, where we aim to detect labels to be used as anchors. We distinguish four different types of labels which can be present (see figure 1), namely:
- plain alphanumeric labels: text strings without a generic visual part,
- generic alphanumeric labels: text strings with an associated generic visual part,
- icon labels: labels distinguished by their shape alone,
- legend labels: icon labels with an associated textual definition.

The detection of the last three types has been studied in [10]. In the next section we consider the detection of plain alphanumeric labels in figures, which can then be used as anchors. Link creation to all but the icon labels is based on the associated textual meaning. This ASCII string can be retrieved through Optical Character Recognition of the facsimile of the textual part of the label, and can then be used in a keyword search in the associated text. For icon labels, links will always point to positions in the document, either figure or text, where the same icon is present.
3 Label detection and recognition

3.1 Introduction

We consider the detection of alphanumeric labels in figures. For scanned paper maps this has been studied, among others, in [3], where a preselection of objects in the image is made on the basis of the expected width and height of the characters. From there, domain knowledge is used to improve recognition. Our situation is more general in that no extensive domain knowledge is available. As a consequence, the detection of labels should be based on the similarities within the set of characters rather than on their distinction from the environment, as non-character objects can have arbitrary, unknown properties. Furthermore, as no predefined notion of scale is present, the method should be scale independent.
The analysis is performed in a number of steps. First, connected objects are detected. These objects can be individual characters, characters accidentally connected in the horizontal direction, or graphics. We only consider horizontal connections between characters, as vertical spacing is usually much larger and hence accidental vertical connections are unlikely. For each of the detected objects the relevant features are computed. On the basis of these features the selection of candidate labels is performed. Finally, OCR is used to retrieve the ASCII content of each label. This can also result in rejection of individual characters or complete labels, and hence provides a further verification.
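The first step is standard connected-component labeling. A minimal sketch in Python of what it produces, labels plus per-object bounding boxes (the function name and the choice of 8-connectivity are our assumptions, not taken from the paper):

```python
import numpy as np
from collections import deque

def connected_components(mask):
    """Label 8-connected foreground components of a binary image.
    Returns a label image (0 = background) and, per component,
    its bounding box as (ymin, ymax, xmin, xmax)."""
    h, w = mask.shape
    labels = np.zeros((h, w), dtype=int)
    boxes = {}
    cur = 0
    for sy in range(h):
        for sx in range(w):
            if mask[sy, sx] and labels[sy, sx] == 0:
                cur += 1                     # start a new component
                labels[sy, sx] = cur
                q = deque([(sy, sx)])
                y0 = y1 = sy
                x0 = x1 = sx
                while q:                     # breadth-first flood fill
                    y, x = q.popleft()
                    y0, y1 = min(y0, y), max(y1, y)
                    x0, x1 = min(x0, x), max(x1, x)
                    for dy in (-1, 0, 1):
                        for dx in (-1, 0, 1):
                            ny, nx = y + dy, x + dx
                            if (0 <= ny < h and 0 <= nx < w
                                    and mask[ny, nx]
                                    and labels[ny, nx] == 0):
                                labels[ny, nx] = cur
                                q.append((ny, nx))
                boxes[cur] = (y0, y1, x0, x1)
    return labels, boxes
```

The bounding boxes feed directly into the width/height features of section 3.2; a production system would use an optimized two-pass labeling instead of this flood fill.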
3.2 Features
In this section we consider which features should be used in the detection. The assumption is that a consistent font is used for the labels in the figure. As the method should work for every character, local shape information cannot be used as a feature; only global features are considered. Candidate features are the height of characters, their width, and features based on the stroke width.

Height is usually very consistent within one document and varies only depending on whether ascenders (e.g. "b", "l", "k"), descenders (e.g. "j", "p", "q"), or neither are present (e.g. "a", "c", "e"). Capitals usually have the same height as characters with ascenders. Width varies much more between different characters, e.g. "W" versus "i", and furthermore two subsequent characters in a label might be connected. Both width and height are easily computed from the bounding box of the connected binary objects.

As a measure of stroke width we consider the average stroke width, which varies between different types of fonts (e.g. italic versus bold). As a consistent font is assumed, the average stroke width should be similar for all characters. It is computed from the discrete distance transform. First, the distance transform of the binary object is computed; its local maxima are used to compute the medial axis. Then the average of the distance values on the axis can be used as an approximation of the average stroke width.
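A sketch of this stroke-width estimate, using a two-pass city-block distance transform and taking local maxima of the distance values as the medial axis. The particular transform and the conversion from distance to width (2d - 1) are our choices; the paper does not specify them:

```python
import numpy as np

def distance_transform(mask):
    """Two-pass city-block distance transform: for each foreground
    pixel, the number of 4-connected steps to the nearest background."""
    h, w = mask.shape
    d = np.where(mask, h + w, 0).astype(int)
    for y in range(h):                      # forward pass (up, left)
        for x in range(w):
            if d[y, x]:
                if y > 0:
                    d[y, x] = min(d[y, x], d[y - 1, x] + 1)
                if x > 0:
                    d[y, x] = min(d[y, x], d[y, x - 1] + 1)
    for y in range(h - 1, -1, -1):          # backward pass (down, right)
        for x in range(w - 1, -1, -1):
            if d[y, x]:
                if y < h - 1:
                    d[y, x] = min(d[y, x], d[y + 1, x] + 1)
                if x < w - 1:
                    d[y, x] = min(d[y, x], d[y, x + 1] + 1)
    return d

def average_stroke_width(mask):
    """Average distance value on the medial axis (local maxima of
    the distance transform), converted to an approximate width."""
    d = distance_transform(mask)
    h, w = mask.shape
    axis = []
    for y in range(h):
        for x in range(w):
            if d[y, x] == 0:
                continue
            nb = [d[y + dy, x + dx]
                  for dy, dx in ((-1, 0), (1, 0), (0, -1), (0, 1))
                  if 0 <= y + dy < h and 0 <= x + dx < w]
            if all(d[y, x] >= v for v in nb):
                axis.append(d[y, x])
    if not axis:
        return 0.0
    return 2.0 * (sum(axis) / len(axis)) - 1.0  # distance -> width
```

Note that the corner pixels of an object also register as weak local maxima, biasing the estimate slightly low for short strokes; the paper's exact medial-axis computation may handle this differently.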
3.3 Label selection
When detecting the connected objects in the image, we regard any object with an area smaller than (0.05 cm)^2 as noise. Objects with a width or height larger than 1.0 cm are disregarded (i.e. classified as graphics). In the next step, candidate characters are merged into candidate labels. This has two functions. First, after merging, the selection on width can be made on the basis of complete words rather than individual characters, circumventing the problems described above. Second, for a complete label, average feature values are more robust than values for single characters. The following characteristics are used to decide whether two label fragments should be merged:

1. the horizontal spacing between the two fragments should be small, and
2. the two fragments should have a substantial overlap in the vertical direction.
As stated before, measures for these characteristics should be scale invariant. As width varies with the number of characters in a fragment, scale independence must be achieved by defining the measures relative to the height of the fragments.
Fig. 2. Graphical impression of the computation of the measures ΔH and ΔW.
To be precise, let x^i_min, x^i_max, y^i_min, y^i_max define the bounding box of label fragment i; then the following equations measure the spacing and overlap of fragments i and j:

    ΔW = ( max(x^i_min, x^j_min) - min(x^i_max, x^j_max) ) / ( max(y^i_max, y^j_max) - min(y^i_min, y^j_min) )

    ΔH = 1 - ( |y^i_max - y^j_max| + |y^i_min - y^j_min| ) / ( max(y^i_max, y^j_max) - min(y^i_min, y^j_min) )
The measures are exemplified in figure 2. Candidate label fragments are only merged if ΔW ≤ W_t and ΔH ≥ H_t. The process continues until there are no more changes to the set of fragments. Note that negative values of ΔH indicate no overlap at all, whereas negative values of ΔW do indicate overlap. From the characteristics of characters with and without ascenders/descenders we conclude that an appropriate setting for the overlap threshold is H_t = 0.5. We further set W_t = 0.75.

As the labels in the figure are set in some fixed font, we expect to find many objects with similar height and stroke width. We therefore compute the modal height h_m and the modal average stroke width s_m from histograms of height and average stroke width over all objects. As these values vary much more for graphical objects, the modal height and stroke width will correspond to the characters; to be precise, we expect them to correspond to the characters without ascenders/descenders. Now candidate labels for which the height h_obj ∉ [(1 - α)h_m, (1 + β)h_m], for properly chosen α and β, can be classified as graphics. Again from the characteristics of characters we set α = 0.5, β = 2.0. When multiple fonts are used in the figure, each maximum in the height histogram should be analyzed. In the resulting set one can still expect non-character objects. Therefore labels with average stroke width s_obj ∉ [(1 - δ)s_m, (1 + ω)s_m] should also be discarded.
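The merge test under these definitions can be sketched as follows, with bounding boxes given as (x_min, x_max, y_min, y_max) and the thresholds W_t = 0.75, H_t = 0.5 from the text (the function names are ours):

```python
def delta_w(bi, bj):
    """Horizontal spacing between two bounding boxes, normalized by
    their joint vertical extent; negative values indicate overlap."""
    h_tot = max(bi[3], bj[3]) - min(bi[2], bj[2])
    return (max(bi[0], bj[0]) - min(bi[1], bj[1])) / h_tot

def delta_h(bi, bj):
    """Vertical overlap measure: 1 for identical vertical extents,
    negative when the boxes do not overlap vertically at all."""
    h_tot = max(bi[3], bj[3]) - min(bi[2], bj[2])
    return 1.0 - (abs(bi[3] - bj[3]) + abs(bi[2] - bj[2])) / h_tot

def should_merge(bi, bj, w_t=0.75, h_t=0.5):
    """Merge two label fragments if they are horizontally close
    and vertically well aligned."""
    return delta_w(bi, bj) <= w_t and delta_h(bi, bj) >= h_t
```

Normalizing both measures by the joint vertical extent h_tot is what makes the test scale invariant: doubling the resolution doubles every coordinate and leaves ΔW and ΔH unchanged.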
As this is document dependent, δ and ω are not fixed. Finally, if the number of characters in a label is known to be at least n_char, labels are discarded when w_obj < n_char · h_m. Note that, again, all steps are invariant to change of scale. In summary, the algorithm is given by:

1. join characters into labels,
2. discard labels on the basis of height,
3. discard labels on the basis of stroke width,
4. discard labels on the basis of width,
5. perform OCR and discard characters or labels if not recognized.
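The height selection (step 2) can be sketched as follows, with heights in document units (e.g. cm) and the α = 0.5, β = 2.0 settings from the text; the histogram bin width is our assumption, as the paper does not state one:

```python
import numpy as np

def modal_value(values, bin_width=0.02):
    """Mode of a set of measurements, estimated from a coarse
    histogram (bin_width in the same units as the values)."""
    values = np.asarray(values, dtype=float)
    if values.max() - values.min() < bin_width:
        return float(values.mean())          # all values in one bin
    edges = np.arange(values.min(), values.max() + bin_width, bin_width)
    hist, edges = np.histogram(values, bins=edges)
    k = int(np.argmax(hist))
    return 0.5 * (edges[k] + edges[k + 1])   # center of the modal bin

def select_by_height(heights, alpha=0.5, beta=2.0):
    """Keep candidate labels whose height lies within
    [(1 - alpha) * h_m, (1 + beta) * h_m] of the modal height h_m."""
    h_m = modal_value(heights)
    return [h for h in heights
            if (1 - alpha) * h_m <= h <= (1 + beta) * h_m]
```

The same modal-value helper applies unchanged to the stroke-width selection of step 3, with δ and ω in place of α and β.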
4 Results
Fig. 3. Representative example documents used in the text. Documents of class 1 (left) are from an old (1964) manual of a teletypewriter, in which photographs, text, and graphics are mixed. The documents of class 2 (right) are taken from a manual for a VCR and are completely drawing-package generated.

To illustrate the algorithm we have applied it to a set of 9 documents, of which the documents in figure 3 are representative. In total there are 2337 objects larger than the noise value and with width and height smaller than 1.0 cm, of which 1551 are genuine characters.
We have experimented with different values of δ and ω, but it turned out that we could not find values yielding good results: either genuine characters were removed, or not a single non-character was classified as such. Hence, for the current experiments we do not use step 3 of the algorithm. For the selection on width we set n_char = 3. After this selection, the remaining labels are sent to the commercial OCR package OmniPage.

Fig. 4. Precision as a function of the different processing steps in the algorithm, for images of class 1 (v1-v4) and class 2 (t1-t5).
To judge the performance of the algorithm we consider the

- precision: proportion of genuine characters in the selected set, and
- recall: proportion of the genuine characters retrieved.

We measure both after each step of the algorithm, noting that it is important that no labels are missed; hence, recall should stay high after each step, while precision should clearly not decrease. The resulting precision is presented in figure 4. From the figure it is observed that precision is very high; especially the selection on width gives a significant increase. For two documents the precision after OCR is not shown, as OmniPage was not capable of handling the remaining graphics, mainly lines in the figures; hence, no textual labels could be retrieved. After steps 1-4 the recall of the algorithm was high (100% for five documents, and 99%, 98%, 97%, and 94% for the others). The most prominent error is the deletion of dashes in words. Except for the two documents mentioned above, recall did not change after OCR. Furthermore, only 3 of the genuine characters were assigned the wrong ASCII label.
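With the detected objects and the ground-truth characters represented as sets of object identifiers, the two measures as defined above amount to the following (a hypothetical helper, not from the paper):

```python
def precision_recall(selected, genuine):
    """Precision: fraction of the selected objects that are genuine
    characters. Recall: fraction of the genuine characters that
    were selected."""
    sel, gen = set(selected), set(genuine)
    tp = len(sel & gen)                      # true positives
    precision = tp / len(sel) if sel else 1.0
    recall = tp / len(gen) if gen else 1.0
    return precision, recall
```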
5 Conclusion

Content-based access to a database containing text documents with associated figures should make the hypertext structure of the documents explicit, including links to details in figures. In this paper we have presented an algorithm for detecting alphanumeric labels in bitmap figures, to be used as anchors in hypertext construction. The results show that the algorithm performs well on the basis of a limited set of assumptions and hence can effectively support the process of hypertext creation. However, the results also show that precision should be improved further. This can be achieved by considering other measures of stroke width than the average stroke width used here, or by using other features, such as the aspect ratio, to classify lines in the image and, locally, to detect dashes in words. Further research will consider larger sets of documents and a complete hypertext construction, including derivation of the logical document structure. Other invariances (e.g. rotation) and multiple fonts will also be considered.
References

1. I.R. Campbell-Grant. Introducing ODA. Computer Standards & Interfaces, 11:149-157, 1991.
2. F. Halasz and M. Schwartz. The Dexter hypertext reference model. Communications of the ACM, 37(2):30-39, 1994.
3. C.P. Lai and R. Kasturi. Detection of dimension sets in engineering drawings. IEEE PAMI, 16(8):848-855, 1994.
4. A. Myka and U. Güntzer. Using electronic facsimiles of documents for automatic reconstruction of underlying hypertext structures. In Second ICDAR, Tsukuba Science City, pages 528-532, 1993.
5. G. Nagy, S. Seth, and M. Viswanathan. A prototype document image analysis system for technical journals. IEEE Computer, 25(7):10-22, 1992.
6. S. Satoh, A. Takasu, and E. Katsura. A collaborative approach supporting method between document processing and hypertext construction. In Second ICDAR, Tsukuba Science City, pages 533-536, 1993.
7. R.K. Srihari. Automatic indexing and content-based retrieval of captioned images. IEEE Computer, 28(9):49-56, 1995.
8. S. Tsujimoto and H. Asada. Major components of a complete text reading system. Proceedings of the IEEE, 80(7):1133-1149, 1992.
9. M. Worring, R. Buitenhuis, and A.W.M. Smeulders. An ODA/Dexter hyperdocument system with automated link definition. In S. Spaccapietra and R. Jain, editors, Visual Database Systems III, Lausanne, pages 346-359, 1995.
10. M. Worring, R. van den Boomgaard, and A.W.M. Smeulders. Structured hyperdocument generation using OCR and icon detection. In Third ICDAR, Montreal, pages 1180-1183, 1995.