Document structures: A survey - Department of Computer Science

Document Structures: A Survey

Yuan Y. Tang and Ching Y. Suen
Centre for Pattern Recognition and Machine Intelligence
Suite GM-606, Concordia University
1455 de Maisonneuve Blvd. West, Montreal, Quebec H3G 1M8, Canada

Abstract

Knowing the structure of a document is the key to processing it successfully. Different points of view yield different definitions of document structure. This paper presents a survey which collects many methods of describing document structures. Several novel concepts and theoretical analyses are also presented in this survey.

1 Introduction

Traditionally, document processing and pattern recognition studies are concerned with concrete documents and patterns. Little attention has been paid to conceptual ones [4, 5, 8]. [8] presented an approach to automatic recognition of design concepts, but it is not related to document processing. [4] proposed two kinds of structures for a document: the deep structure and the surface structure. The basic principles are listed below:

• The deep structure is incorporated into the linear order as a string of concepts;

• The deep structure should remain invariant when the document is translated from one language to another;

• Surface structure contains sentences, phrases and words;

• The atomic symbols of the surface structure are letters of an alphabet or ideographs (for example, Chinese, Japanese and Korean ideographs);

• The deep structure contents are encoded into the surface structure;

• The surface structure changes from one language to another.

As a document is a medium of knowledge, it can be considered not only as a two-dimensional image (a concrete document) but also as a conceptual document, which corresponds to human thinking. The abstract representations of these two kinds of documents are the conceptual structure and the concrete structure. The process of publishing or writing corresponds to encoding the conceptual structure into a concrete structure. Conversely, in document processing the concrete structure of the document is decoded into its conceptual one, i.e.

(1) Process of publishing or writing: $\text{Conceptual Structure} \xrightarrow{\text{encoding}} \text{Concrete Structure}$;

(2) Document processing: $\text{Conceptual Structure} \xleftarrow{\text{decoding}} \text{Concrete Structure}$.

In this paper, we will limit our discussion to document processing only. The process of publishing or writing is outside its scope.

2 Conceptual and Concrete Structures

In this section, we introduce a new concept of document structure and the new terminology “concrete structure” and “conceptual structure”. The basic principle of these structures will be presented, followed by a basic model for the concrete structure. The relationship between the conceptual and concrete structures will also be discussed. The end of this section describes how the deep and surface structures correspond to the conceptual structure and the concrete document respectively. The term document is used in a very general sense: a document can be considered not only as a concrete two-dimensional image but also as a concept. A general definition of document is given below:

Definition 1 A document is a medium of knowledge which presents human ideas across a very wide scope, including politics, economics, history, culture, education, arts, science, engineering, etc.

Note the key words written in italics in the above definition: “ideas” and “medium”. The key word “ideas” indicates the conceptual meaning of the document, while “medium” implies the concrete two-dimensional image. Consequently, a document has both conceptual and concrete properties.

0-8186-4960-7/93 $3.00 © 1993 IEEE

2.1 Conceptual Structure

A conceptual document is represented by its conceptual structure $\omega$, which can be described as follows:

Definition 2 A conceptual structure $\omega$ is specified by a conceptual space, such that

$$\omega = (\Sigma, \Im, U)$$

where

• $\Sigma$ stands for an alphabet, which is a finite, nonempty set of elements. The elements of the alphabet are usually called symbols or letters. For example, $\Sigma = \{a, b, c, \ldots, x, y, z\}$ for documents written in English.

• $\Im$ states a finite set of words. A word over an alphabet $\Sigma$ is a finite sequence of symbols from $\Sigma$, usually written without any separating commas.

• $U$ is a finite set of operations defined in pragmatic and semantic domains, such that $\Im \times U \times \Im' \in \omega$.

A very important characteristic of the conceptual document is that of language-invariance, i.e.

$$\frac{\partial \omega}{\partial \Sigma} = 0$$

This means that the conceptual structure remains unchanged when the document is translated from one language to another. For instance, the idea of a paper has to be the same when it is translated from English to French.
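As a small illustration of the alphabet and word definitions above, the following Python sketch checks whether candidate strings are words over a given alphabet. The names `is_word_over`, `words_over` and `SIGMA_ENGLISH` are hypothetical and do not come from the paper. The sketch covers only the first two components of the conceptual structure and does not attempt to model the operations in U.

```python
import string
from typing import AbstractSet, Iterable, Set


def is_word_over(word: str, alphabet: AbstractSet[str]) -> bool:
    """Return True if every symbol of `word` belongs to `alphabet`."""
    return all(symbol in alphabet for symbol in word)


def words_over(candidates: Iterable[str], alphabet: AbstractSet[str]) -> Set[str]:
    """Keep only the candidates that are words over `alphabet`."""
    return {word for word in candidates if is_word_over(word, alphabet)}


# Assumed alphabet for English documents: the 26 lowercase letters.
SIGMA_ENGLISH = frozenset(string.ascii_lowercase)

# "idée" contains a symbol outside the assumed alphabet, so it is rejected.
valid = words_over(["document", "structure", "idée"], SIGMA_ENGLISH)
```

Changing the alphabet changes which strings count as words, which is the surface-level variation that the language-invariance property says the conceptual structure does not share.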

2.2 Concrete Structure

Concrete document structure is the division and repeated subdivision of the content of a document into increasingly smaller parts, which are called objects. An object which cannot be subdivided into smaller objects is called a basic object. All other objects are called composite objects. Structure can be realized as two types: (1) geometric (layout) structure, in terms of its geometric characteristics, for instance the position and size of each document object; (2) logical structure, in terms of its logical properties, for instance the logical relationships among the different objects.

Most concrete documents, such as newspapers, journals, books and reports, are organized hierarchically. Both their geometric and logical structures can be represented by trees [1, 3]. The geometric relationship between blocks can be described by a geometric tree, while the logical properties of the document objects can be represented by a logical tree. Building both the geometric tree and the logical tree is a major task of a document processing system.

It is clear that the property of language-invariance fails in the concrete structure: both geometric and logical structures might change when the document is published in different languages. Because the concrete document is the first stage of document processing, most research and development efforts in document processing are concentrated on it. Thus, this survey also puts some emphasis on this stage.
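The division into basic and composite objects can be sketched as a small tree type. This is a minimal illustration rather than the authors' model: the class `DocObject`, its `bbox` field and the sample page layout are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Tuple


@dataclass
class DocObject:
    """A node of a geometric tree: a named region with position and size."""
    name: str
    bbox: Tuple[int, int, int, int]  # assumed layout: (x, y, width, height)
    children: List["DocObject"] = field(default_factory=list)

    @property
    def is_basic(self) -> bool:
        # A basic object cannot be subdivided further, so it has no children.
        return not self.children

    def basic_objects(self) -> Iterator["DocObject"]:
        """Yield the basic objects (leaves) below this object, depth first."""
        if self.is_basic:
            yield self
        else:
            for child in self.children:
                yield from child.basic_objects()


# A hypothetical page: one title block and a body split into two columns.
page = DocObject("page", (0, 0, 600, 800), [
    DocObject("title", (50, 40, 500, 60)),
    DocObject("body", (50, 120, 500, 640), [
        DocObject("column-1", (50, 120, 240, 640)),
        DocObject("column-2", (310, 120, 240, 640)),
    ]),
])

leaf_names = [obj.name for obj in page.basic_objects()]
```

Here `page` and `body` are composite objects, while `title` and the two columns are basic objects. A logical tree could reuse the same shape with logical labels in place of bounding boxes.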

Definition 3 A concrete structure $\Omega$ is specified by a quintuple

$$\Omega = (\Im, \Phi, \delta, \alpha, \beta)$$

where

• $\Im$ is a finite set of document objects, which are sets of blocks $\Theta^i$ ($i = 1, 2, \ldots, m$).

• $\{\Theta^i_j\}^*$ denotes repeated sub-division, since an object may be subdivided into several smaller objects.

• $\Phi$ is a finite set of linking factors. $\varphi_l$ and $\varphi_r$ stand for leading linking and repetition linking respectively.

• $\delta$ is a finite set of logical linking functions which indicate logical linking of the document objects.

• $\alpha$ is a finite set of heading objects.

• $\beta$ is a finite set of ending objects.

2.3 Basic Model for Concrete Document

A basic model for processing the concrete document was first proposed in our early work presented at the First International Conference on Document Analysis and Recognition [1]. A graphic illustration can be found in Fig. 1, where the relationships among the geometric structure, logical structure, document analysis and document understanding are depicted.

Figure 1: Basic Model for Document Processing
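The quintuple of Definition 3 can be pictured as a small record type. The Python sketch below is illustrative only. The class `ConcreteStructure`, the use of strings as object identifiers, and the toy linking table are hypothetical. Two assumptions go beyond what the definition states: the linking function is taken to accept one object and one linking factor and return a set of objects, and heading and ending objects are required to be members of the object set.

```python
from dataclasses import dataclass
from typing import Callable, FrozenSet


@dataclass(frozen=True)
class ConcreteStructure:
    """Illustrative container for the five components of the quintuple."""
    objects: FrozenSet[str]                      # document objects
    linking_factors: FrozenSet[str]              # e.g. leading, repetition
    delta: Callable[[str, str], FrozenSet[str]]  # logical linking function
    heading: FrozenSet[str]                      # heading objects
    ending: FrozenSet[str]                       # ending objects

    def __post_init__(self) -> None:
        # Assumption: heading and ending objects are document objects.
        if not self.heading <= self.objects:
            raise ValueError("heading objects must be document objects")
        if not self.ending <= self.objects:
            raise ValueError("ending objects must be document objects")


# Hypothetical linking table: which objects may follow a given object.
LINKS = {
    ("title", "leading"): frozenset({"abstract"}),
    ("abstract", "leading"): frozenset({"body"}),
    ("body", "repetition"): frozenset({"body"}),
}


def toy_delta(obj: str, factor: str) -> FrozenSet[str]:
    """Look up linked objects; unknown pairs link to nothing."""
    return LINKS.get((obj, factor), frozenset())


structure = ConcreteStructure(
    objects=frozenset({"title", "abstract", "body", "references"}),
    linking_factors=frozenset({"leading", "repetition"}),
    delta=toy_delta,
    heading=frozenset({"title"}),
    ending=frozenset({"references"}),
)
```

Returning a set from `delta` leaves room for one object to link to several candidates, which is one way to read the idea of logical linking between document objects.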

The following principal concepts were proposed in this model:

• A concrete document is considered to have two structures: the geometric (or layout) structure and the logical structure. Document processing is divided into two phases: document analysis and document understanding.

• Extraction of the geometric structure from a document refers to document analysis; mapping the geometric structure into the logical structure is defined as document understanding. Once the logical structure has been captured, its meaning can be decoded by AI or other techniques.

• In some cases, however, the boundary between the two phases just described is not clear. For example, the logical structure of bank cheques may also be found during an analysis by knowledge rules.

In the Handbook of Pattern Recognition and Computer Vision, we have presented a formal description of the concrete document as well as the processing involved using the natural language description [7].
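The two phases described above can be sketched as two functions applied in sequence. This is a deliberately simplified stand-in, not the authors' system. It assumes blocks have already been segmented from the page image, and the labelling rule (first block in reading order is the title, the rest are paragraphs) is a hypothetical knowledge rule chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Block:
    """One geometric object, described only by position and size."""
    x: int
    y: int
    width: int
    height: int


def analyse(raw_blocks: List[Tuple[int, int, int, int]]) -> List[Block]:
    """Document analysis stand-in: build blocks and put them in reading order
    (top to bottom, then left to right)."""
    blocks = [Block(*raw) for raw in raw_blocks]
    return sorted(blocks, key=lambda b: (b.y, b.x))


def understand(blocks: List[Block]) -> Dict[Block, str]:
    """Document understanding stand-in: map each geometric block to a
    logical label using a toy rule."""
    labels: Dict[Block, str] = {}
    for index, block in enumerate(blocks):
        labels[block] = "title" if index == 0 else "paragraph"
    return labels


# Hypothetical segmentation output, given out of reading order.
geometric = analyse([(50, 120, 500, 300), (50, 40, 500, 60)])
logical = understand(geometric)
```

Keeping the phases as separate functions mirrors the model. A domain such as bank cheques, where the boundary is less clear, might fold the labelling rules into the analysis step instead.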

2.4 Relation between Conceptual and Concrete Structures

The relationship between the conceptual structure and concrete structure can be illustrated as follows [5]:

$$\omega \xrightarrow{\text{encoding}} \Omega \qquad (4)$$

$$\omega \xleftarrow{\text{decoding}} \Omega \qquad (5)$$

where the notation $\omega$ stands for the conceptual structure and $\Omega$ denotes the concrete structure. Eq. 4 corresponds to the process of publishing or writing, where the conceptual structure is encoded into a concrete structure. On the other hand, Eq. 5 refers to the processing required to decode the conceptual structure from the concrete structure. Eq. 5 can be re-written in the following form with more details:

$$\omega \xleftarrow{\text{decoding}} S_{LS} \xleftarrow{\text{mapping}} S_{GS} \qquad (6)$$

where $S_{GS}$ and $S_{LS}$ denote the geometric and logical structures respectively. A graphical illustration of Eqs. 4 - 6 is shown in Fig. 2.

Figure 2: Conceptual and Concrete structures

It is obvious that in order to decode a concrete structure into the conceptual one, we have to extract the geometric structure of the concrete document and map it into the logical structure. These are the necessary steps in document processing. A complete and universal system for treating both conceptual and concrete documents is probably still decades away. To reach that goal, it is necessary to restrict the domain of discourse. In the remainder of this paper, we will pay attention to the concrete document, and the meaning of “document” will be limited to the domain of the concrete one.

3 Geometric and Logical Structures

A concrete document can be viewed as a geometric structure and a logical structure. To formalize these basic terms, the basic model of a concrete document and its processing will be used in this section.

3.1 Geometric Structure

The geometric structure can be formally described by the following definition according to the basic model given by Eqs. (2) and (3).

Definition 4 The geometric structure is described by the element $\Im$ in the document space $\Omega = (\Im, \Phi, \delta, \alpha, \beta)$ shown in Eqs. (2 - 3), together with a set of operations performed on $\Im$, such that

where $\Im_B$ represents a set of Basic objects, and $\Im_C$ stands for a set of Composite objects.

3.2 Logical Structure

The document understanding process finds the logical relations between the objects of a document. To facilitate this process, a logical structure and its model will be presented in this sub-section. Logical structure is the result of dividing and subdividing the content of a document into increasingly smaller parts called logical objects, based on the human-perceptible meaning of the content. For example, chapters, sections, subsections, and paragraphs, which are application-dependent, can be defined using the Object class mechanism [2].


According to the basic model represented by Eqs. (2) and (3), a formal description of the logical structure is presented as follows:

Definition 5 The logical structure is described by the elements $\Phi$, $\delta$, $\alpha$, and $\beta$ in the document space $\Omega = (\Im, \Phi, \delta, \alpha, \beta)$ in Eqs. (2 - 3), such that

$$\delta: \Im \times \Phi \rightarrow 2^{\Im}$$

From this definition, it is obvious that there is a nondeterministic mapping from the geometric structure into the logical structure. However, once the geometric structure is extracted, a deterministic mapping can be achieved. It is formally described as follows:

Theorem 1 Let $\Omega$ be a document defined by a quintuple $(\Im_i, \Phi_i, \delta_i, \alpha_i, \beta_i)$ having a nondeterministic mapping from a geometric structure into a logical structure. Then there exists a quintuple $(\Im_j, \Phi_j, \delta_j, \alpha_j, \beta_j)$ which contains a deterministic mapping from the geometric structure of $\Omega$ into a logical structure.

The proof of this theorem can be found in [5]. Many articles exist which deal with the geometric and logical structures [1, 3].

3.3 Document Processing

Definition 6 Document processing is a process to construct the quintuple represented by Eqs. (2 - 3).

Document analysis refers to the extraction of elements $\Im$, $\Theta^i$ and $\Theta^i_j$ in Eq. (3), i.e. the geometric structure of $\Omega$. Document understanding deals with the finding of $\Phi$, $\delta$, $\alpha$, and $\beta$ in Eq. (3), based on the logical structure of $\Omega$.

Example - A simple example is illustrated in Fig. 3, where we have

$$\Im = \{\Theta^1, \Theta^2, \Theta^3, \Theta^4, \Theta^5\}$$

$$\Theta^4 = \{\Theta^4_j\}^* = \{\Theta^4_1, \Theta^4_2\}$$

$$\Theta^5 = \{\Theta^5_j\}^* = \{\Theta^5_1, \Theta^5_2, \Theta^5_3\}$$

$$\alpha = \{\Theta^1, \Theta^2\}$$

$$\beta = \{\Theta^4, \Theta^5\}$$

Figure 3: A simple example of document processing described by the basic model

4 Conclusions

There are many definitions of document structures, such as concrete structure, conceptual structure, surface structure, deep structure, geometric structure, logical structure, textual structure, information structure, textural structure, etc. Due to the page limitation, only a few of them have been presented in this paper. A complete version can be found in [6].

Acknowledgment

This work was supported by research grants received from the National Networks of Centres of Excellence research program of Canada, the Ministry of Education of Quebec, and the Natural Sciences and Engineering Research Council of Canada.

References

[1] ICDAR. Proc. First Int. Conf. on Document Analysis and Recognition, Saint-Malo, France, Sept. 30 - Oct. 2, 1991.

[2] ISO. 8613: Information Processing - Text and Office Systems - Office Document Architecture (ODA) and Interchange Format, International Organization for Standardization, 1989.

[3] Journal. Machine Vision and Applications (Special Issue: Document Image Analysis Techniques), Vol. 5, No. 3, 1992.

[4] G. Nagy. “What does a machine need to know to read a document?” Proc. Symposium on Document Analysis and Information Retrieval, March 16-18, 1992, pp. 1-10.

[5] Y. Y. Tang and C. Y. Suen. “Concrete document and conceptual document,” Technical Report, Centre for Pattern Recognition and Machine Intelligence (CENPARMI), Concordia University, 1992.

[6] Y. Y. Tang and C. Y. Suen. “Document Structures: A Survey,” Technical Report, Centre for Pattern Recognition and Machine Intelligence (CENPARMI), Concordia University, 1993.

[7] Y. Y. Tang, C. D. Yan, M. Cheriet, and C. Y. Suen. “Automatic analysis and understanding of documents,” Handbook of Pattern Recognition and Computer Vision, edited by Patrick S.P. Wang, C.H. Chen and L.F. Pau, The World Scientific Publishing Co. Pte, Ltd., Singapore, 1993.

[8] J. T. Tou. “Computer recognition of design concepts,” Proc. 11th Int. Conf. on Pattern Recognition, 1992, Vol. B, pp. 639-642.
