Layout and Language: Preliminary investigations ... - Semantic Scholar

16 downloads 1400 Views 51KB Size Report
originals marked up in, e.g., SGML, such as HTML or ... In [2] we offered an analysis of table layout and linguis- .... c: [w1 d? tv] A single column of values.
Layout and Language: Preliminary investigations in recognizing the structure of tables Matthew Hurst & Shona Douglas Language Technology Group, Human Communication Research Centre, University of Edinburgh, Edinburgh EH8 9LW, UK. Copyright Notice Copyright 1997 IEEE. Published in the Proceedings of ICDAR’97, August 18-20, 1997 in Ulm, Germany. Personal use of this material is permitted. However, permission to reprint/republish this material for advertising or promotional purposes or for creating new collective works for resale or redistribution to servers or lists, or to reuse any copyrighted component of this work in other works, must be obtained from the IEEE. Contact: Manager, Copyrights and Permissions / IEEE Service Center / 445 Hoes Lane / P.O. Box 1331 / Piscataway, NJ 08855-1331, USA. Telephone: + Intl. 908-562-3966.

Abstract We describe a prototype system for assigning table cells to their proper place in the logical structure of the table, based on a simple model of table structure combined with a number of measures of cohesion between cells. A framework is presented for examining the effect of particular variables on the performance of the system, and preliminary results are presented showing the effect of cohesion measures based on the simplest domain-independent analyses, with the aim allowing future comparison with more knowledge-intensive analyses based on Natural Language Processing. These baseline results suggest that very simple string-based cohesion measures are not sufficient to support the extraction of tuples as we require. Future work will pursue the aim of more adequate approximations to a notional subtype/supertype definition of the relationship between value cells and label cell.

1 Introduction Most representations of tabular data are layout or geometry-based; this is the case whether they derive from document recognition processes or directly from electronic originals marked up in, e.g., SGML, such as HTML or CALS tables. Marcy Thompson notes “. . . table markup contains a great deal of information about what a table looks like. . . but very little about how the table relates the entries. . . . [This approach] prevents me from doing automated context-based data retrieval or extraction.” ([7], p151). As Thompson suggests, this might not be a disaster

as long as “you can do something really clever to retrieve data from tables.” ([7], p152). But how? In [2] we offered an analysis of table layout and linguistic characteristics which emphasised the potential importance of linguistic information about the contents of cells to the task of extracting the underlying relational structure from a layout-oriented table representation. In this paper, we present some preliminary results of investigating the effects of taking into account different kinds of information about the contents of cells and a simple model of table structure. Such results represent a baseline for comparison with more knowledge-intensive techniques in the future. We are grateful to BICC plc for the use of their construction industry specification documents in this work, and for financial support of the first author. We also acknowledge the support of the UK Engineering and Physical Sciences Research Council though the CISAU project (Construction Industry Specification Analysis and Understanding, project IED4/1/5818) which supported the second author.

2 Approach and objectives 2.1 Views of tables In previous work [2] we distinguished two components of representation for tables. On the denotational view, a table is a set of n-tuples, forming a relation between values drawn from n valuesets or domains. Domains typically consist of a set of values with a common supertype in some actual or notional Knowledge Representation scheme. The table may also include label cells which typically can be interpreted as a lexicalisation of the common supertype. We hypothesize that the contents of value cells and corresponding label cells for a given domain are significantly related in respect of some measures of similarity or cohesion that we can identify. In particular, we are interested in the sort of linguistic cohesion introduced in [4]. On the functional view, we take it that the information presentation characteristics of the table embody a decision structure [9] or reading path, which determines the order

in which domains are accessed in building up (looking up) a tuple. To express a given table denotation according to a given functional view, there is a repertoire of layout patterns that express how domains can be grouped and ordered for reading in two dimensions. These layout patterns constitute a syntax of table structure, defining the basic geometric configurations in which domain values and their labels can appear, e.g., horizontal list of values with spanning label above; vertical list of values with spanning label to the left; or, as in Figure 1, vertical list of values with labels above.

2.2 An information extraction task In the CISAU application domain, we are faced with the task of information extraction (cf [6]) in documents containing a large number of tables. The tables appear in our original documents as ASCII formatted text1 ; the table in Figure 1 is a typical example: CHLORIDE CONTENT OF MIXES:

| Maximum total | chloride ion content | (% by weight of cement, | including GGBS) | ---------------------------------------------------------------| Prestressed concrete | 0.1 | Concrete made with sulphate | resisting Portland cement | 0.2 or supersulphated cement | | Concrete made with Portland cement, | Portland blastfurnace cement or | combinations of GGBS or PFA with | 0.4 ordinary Portland cement and | containing embedded metal | Mix

Figure 1: Example table from the application domain Our aim is to extract information from the text into a set of frames that contain various elements represented in terms of the sublanguage world model, a simple partof/type-of knowledge representation (KR). The elements we are looking for are entities, attributes which the KR accepts as appropriate for those entities, a unit or type for each attribute, a value which the assignment gives to each attribute, and a relationship expressing the semantic content of the assignment. To extract these components, we would like to have a basic representation of the tuple structure of the table, plus information about any labels and how they relate to the values, in order to specify fully the relationship and its arguments.

2.3 Aims of the current work Without some way of identifying which cells belong together as domains we cannot extract the table relation we 1 Getting this text into a geometric, cell-based representation is assumed to be the province of document recognition experts — but see the discussion section.

require. From this identification of domains, we will be able to go on to employ the notion of reading paths to identify the tuples of the relation expressed in the table, and thus complete the information extraction task. Our aim is to investigate the usefulness of a range of cohesion measures, from knowledge-independent to knowledge-intensive, in allowing us to select, from among those areas of table cells which are syntactically capable of being domains, those which in fact form the domains of the table.

3 The current prototype system 3.1 Overview of operation The system operates in two phases. In the first, a set of areas that might constitute domains is identified, using the constraints of geometric configuration and cell cohesion. In the second, this candidate set is filtered on the basis of cohesion scores to produce a consistent tiling over the table.

3.2 A simplified table structure model: domain templates The potential geometric configurations that we allow for a set of domain values (plus optional label) are called templates. A notation for specifying simple domain templates is defined as follows. A template is delimited by a pair of brackets [. . . ]. Contained within the template is a list of sub-templates, currently recursive only to depth 1. Currently the subtemplates within a template are taken to be stacked vertically in the physical table. If a template has no subtemplates, it consists of a triple (width; depth; type). width specifies the x-extent of an area of cells that can match the template, given by w followed by either an integer or the wild card ? which matches any width of cells. depth specifies the y-extent of an area cells that can match the template, given by d followed by either an integer or the wild card ? which matches any depth of cells. type specifies whether the template is to be counted as a value or a label area, given by t followed by either v or l. A selection from a set of four possible restrictions on a template can be defined: RESTRICTION

AREA MUST

-top -left +right +bottom

not contain top row not contain leftmost column contain rightmost column contain bottom row

The following templates are used currently: lc: [[w1 d1 tl][w1 d? tv]] A label above a single column of values, of any height. lr: [[w1 d1 tl][w? d1 tv]] A label above a single row of values, of any width.

v: [w? d? tv]{-top -left +right +bottom} A rectangular area consisting of only values, restricted to domains at the bottom right margin, typically accessed using both x and y keys. c: [w1 d? tv] A single column of values.

3.3 A simplified cohesion model The ‘goodness’ of a rectangular area of the table, viewed as a possible instantiation of a given template, is given by its score on the various cohesion functions. Each score is a value between 0 and 1, where 0 corresponds to high similarity and 1 to low similiarity. The cohesion attributes we currently measure are: Na j jαb j?jNb j alpha-numeric ratio: j( jjααaa j?j j+jNa j ? jαb j+jNb j )=2j, where jαaj is the number of alphabetic characters, and jNaj the number of numeric characters, in string a. bj string length ratio: j jjaaj?j j+jbj j, where jaj is the length of one string and jbj the length of the other.

Values assigned for each of these chosen attributes are combined in a weighted sum to yield two overall cohesion scores for each MatchedArea, the value-value cohesion (v-v) and the label-value cohesion (l-v) as follows. v-v cohesion: Given





a

set

of n v-v cohesion functions f 1v-v : : : f nv-vg which each take two cells and return a value between 0 and 1 which reflects how similar the two cells are on that function; and a set of n weights fwv-v; wv-v : : : wv-vg which

f f0v-v

;

0

1

n

determine the relative importance of each v-v cohesion function’s result

for an area A composed of a set of cells we calculate a measure of the area’s cohesion as a set of domain values: VS = ∑ci;c j 2A v-vScore(ci ; c j ) (where (ci ; c j ) is an ordered pair of cells) v-vScore = f iv-v = ∑ni=0 wv-v ∑ni=0 wv-v i i l-v cohesion: Given



a

set of m l-v cohesion functions l-v l-v ; f 1 : : : f n g which each take two cells and return a value between 0 and 1 which reflects how likely one of the cells is to be a label for the other

f f0l-v



l-v l-v a set of m weights fwl-v 0 ; w1 : : : wm g which determine the relative importance of each label cohesion function’s result

for an area A composed of a set of cells and a label cell cl we calculate a measure of the area’s cohesion as a label plus set of domain values: LS = ∑(cl ;cv):cv2A l-vScore l-v l-v m l-v l-vScore = ∑m i=0 wi f i = ∑i=0 wi A final score for the area is calculated as follows, depending on the type of template: If the area’s template contains values and a label: finalScore =

wv-v VS+w l-vLS wv-v +w

l-v

where wv-v and wl-v are weights reflecting the relative importance given to the VS and LS respectively. If the area’s template contains only values, finalScore = VS.

3.4 Choosing a tiling on the basis of cohesion Given a set of templates, we find a set of MatchedAreas, rectangular areas of cells which satisfy a template definition and which reach a given cohesion threshold. The set of MatchedAreas contains no areas that are wholly contained in other matched areas for the same template. From this set of candidate domains, a greedy algorithm selects a subset that form a complete, non-overlapping tiling over the table, by successively adding to the tiling the highest scoring MatchedArea consistent with the previous choices.

4 Experiments 4.1 Material used To test our system, we created a corpus of tables marked up in SGML with basic cell boundaries. This representation is similar in relevant information content to many SGML table DTDs, and is also a plausible output from completely domain-independent systems for table recognition in ascii text or images, e.g., [3]. To this basic representation we added human-judgment information about the domains in each table, specifying cell areas of values and labels for each domain. (These markup operations were performed using a specialised interactive markup tool written in Lucid Emacs Lisp.) The tables were taken from a corpus of formatted ascii documents in the domain of construction industry specifications. 29 tables consisting of 91 domains formed a development set which was open to examination during the experimental development; 4 tables consisting of 13 domains were held back as a test set.

4.2 Experimental factors The tests we ran combined various factors: baseline: all coefficients set to 1. alphanum: alpha-numeric ratio cohesion measure. stringlength: string-length ratio cohesion measure. noempty: disallow the selection of MatchedAreas with empty top left hand cells. ignorelabel: reduce the weighting wl for the goodness of the label match to 0. The recall score for each experimental condition is the number of matched areas that perfectly agree with a domain as marked by the human judge as a ratio of the number of domains identified by the human judge. Because the template matching operations heavily favour larger matches, recall and precision are similar, so precision is not given here for.

5 Results and discussion The recall results are given in Table 1. The experiment column specifies the experimental condition in terms of the factors defined above. Effect of templates: Increasing the number of templates available at one time reduces the recall performance because of confusion during the selection process. Although examples of all templates occur in the material, the labelled column configuration lc is so common that better results would be obtained (for this corpus at least) by using only that, or weighting heavily towards it; this would put a ceiling on performance that would probably be unacceptable, but represents some sort of baseline at least. The simple singly recursive templates we have considered so far are also not adequate for more complex tables with patterns of recapitulation and multiply layered spanning labels. It will be desirable to take a more sophisticated view of the layout configurations possible for labels and values, possibly using something like the treatment in [8]. Effect of cohesion measures: The baseline experiment used both cohesion measures, with all weight coefficients set to 1. While using both cohesion measures was better than either alone, or none, the results are not impressive, and the overall impression must remain that the system’s ability to distinguish between labels and values is poor. This will be more of a problem when we deal with more complex tables with complex multi-cell labels. It is here that the addition of more sophisticated cohesion measures, based on better approximations to the supertype/subtype distinction we postulated earlier, are likely to be helpful.

Effect of algorithms for selecting matched areas: The greedy algorithm that selects a single tiling is designed to be computationally cheap at the expense of the optimality of the tiling, and so although we currently have no explicit comparison to make, it is very possible that a better algorithm can be found.

6 Future Work 6.1 Improvements Our aim in this work has been to find a baseline measure of the usefulness of simple string-based techniques as approximations to the supertype/subtype cohesion relations between values cells and label cells which allows us to identify domains in tables, and hence their relational structure. We have found that it is possible to measure a certain amount of cohesion with string-based techniques; however we believe we have demonstrated that these are inadequate for anything other than the most simple tables. From this, we intend to measure the efficacy of more sophisticated cohesion measures of cells contents. Our plans include using domain-independent KR, such as wordnet [1], corpus-based Knowledge Acquisition techniques [5], and n-gram models of syntactic similarity. Combining a number of types of measure, in the kind of framework we have presented here, should allow graceful performance over a wide range of domains using as much information as is available, from whatever source. We intend to extend the current system with a tuple extraction phase, improve the design of the selection algorithm, increase the number and accuracy of the templates we use, and investigate weighting of templates on the basis of frequency. Clearly, we require a larger test corpus too. We also intend to pursue competitive evaluation of cohesion using a dynamic method for deciding whether to include l-v weighting or not; if we have a good score on l-v cohesion, use it, if not, weight it at zero.

6.2 Applications A table structure disambiguator with reasonable performance would have numerous applications, some of which we are already pursuing. It is convenient from the point of view of modularity to consider the content-based disambiguation of table structure separately from lower-level document recognition processes which typically result in a layout-based representation on a par with HTML tables. However, it would clearly be useful to combine the two, probably by using cohesion scores to choose among multiple hypotheses about cell boundaries generated by prior processes. The same techniques can be used for adding information about relational structure to tables already available in (semantically impoverished) forms. We are currently working on applying our techniques to HTML tables.

Templates available flc; flr; flr; cg vg cg

Experiment

flcg

flr g

fvg

fcg

flc; lr g

flc; vg

fv; cg

flc; lr; vg

flc; lr; cg

flc; v; cg

flr; v; cg

2 1

7 0

33 31

59 36

24 26

9 1

flc; lr; v; cg 26 27

alphanum stringlength alphanum, ignorelabel stringlength, ignorelabel alphanum, stringlength alphanum, stringlength, ignorelabel unseen

84 41

3 1

3 0

3 0

82 42

32 30

60 35

5 1

84

3

3

3

84

34

84

5

2

7

36

84

34

9

36

41

1

0

0

42

34

41

1

1

0

35

42

34

1

35

75

2

3

3

75

45

68

4

1

7

45

67

42

8

42

75 77

2 8

3 0

3 0

76 77

47 62

75 62

5 8

2 8

7 0

48 62

76 62

47 54

8 0

48 54

Table 1: Recall results for all experimental conditions Once the relational structure of the table is known it can be manipulated for many purposes. Smart editors can allow restructuring of tables on demand to reflect different functional views, while keeping track of the underlying semantics. Also, very different presentation formats could be generated. A presentation of tables for, e.g., telephone interfaces to Web pages, ought to reflect the information access structure of a table [9], rather than its physical structure; we are working on automatically reacasting tables as specifications for information-seeking dialogues.

References [1] Richard Beckwith, Christiane Fellbaum, Derek Gross, and George A. Miller. Wordnet: A lexical database organized on psycholinguistic principles. Technical Report 42, Cognitive Science Laboratory, Princeton University, March 1990. [2] Shona Douglas, Matthew Hurst, and David Quinn. Using natural language processing for identifying and interpreting tables in plain text. In Fourth Annual Symposium on Document Analysis and Information Retrieval, pages 535–545, 1995. [3] E. Green and M. Krishnamoorthy. Recognition of tables using tables grammars. In Proceedings of the Fourth Annual Symposium on Document Analysis and Information Retrieval, pages 261–278, University of Nevada, Las Vegas, April 1995. [4] M. A. K. Halliday and Ruqaiya Hasan. Cohesion in English. English Language Series. Longman, 1976. [5] A. Mikheev and S. Finch. A workbench for acquisition of ontological knowledge from natural text. In

Proceedings of the 7th conference of the European Chapter for Computational Linguistics, pages 194– 201, Dublin, Ireland, 1995. [6] Proceedings of the Sixth Message Understanding Conference. Morgan Kaufmann, San Francisco, CA, 1996. [7] Marcy Thompson. A tables manifesto. In Proceedings of SGML Europe 1996, pages 151 – 153, Munich, Germany, 12-16 May 1996. [8] Xinxin Wang. Tabular Abstraction, Editing, and Formatting. Phd, University of Waterloo, Ontario, Canada, 1996. [9] Patricia Wright. A user-oriented approach to the design of tables and flowcharts. In David H. Jonassen, editor, The Technology of Text, pages 317–340. Educational Technology Publications, 1982.