A Formal Framework For Interlinear Text

Kazuaki Maeda and Steven Bird
Linguistic Data Consortium, University of Pennsylvania
{maeda,sb}@ldc.upenn.edu
November 24, 2000

Paper presented at the workshop on Web-Based Language Documentation and Description, 12-15 December 2000, Philadelphia, USA.

Abstract

Interlinear texts come in many forms and can be represented digitally in many ways, e.g. plain text with hard spacing, tables, special markup, and special-purpose data structures. There are various methods for linking to audio data and lexical entries, and for including footnotes and other marginalia. This diversity of form presents problems for general purpose software for searching, exchanging, displaying and enriching interlinear texts. In this paper, we survey several existing tools, models and formats for interlinear text. We argue that a general purpose abstract data model for interlinear text is necessary in order to abstract away from all the physical storage formats and display styles. We propose a new formal framework for interlinear text based on the annotation graph model [Bird and Liberman, 2001].

This model has several desirable properties for the development of practical tools. It scales well, so that extended, richly layered texts can be stored and manipulated efficiently. The model has a direct representation as a relational table, permitting efficient query. Incomplete information can be represented naturally, so partially analyzed texts are well-formed structures and are well-behaved with respect to query. For the same reason, essentially arbitrary parts of an interlinear text, such as particular layers, can be projected and remain well-formed, and this facilitates analysis by external programs. Finally, this model makes it trivial to construe interlinear text as an annotation of time-series data, so that tools can give access to the primary audio data while a text is being transcribed and annotated.

As a partial demonstration of these claims, we present a prototype interlinear text editor based on the annotation graph model. This tool is being developed in conjunction with a general architecture of tools for transcribing and annotating time-series data within the framework of annotation graphs. By using this architecture we can reuse and integrate interlinear text software components with other tools. The tool is available in open source form.

1 Introduction and Review of Previous Work

Interlinear texts come in many forms and can be represented digitally in many ways, e.g. plain text with hard spacing, tables, special markup, and special-purpose data structures. There are various methods for linking to audio data and lexical entries, and for including footnotes and other marginalia. This diversity of form presents problems for general purpose software for searching, exchanging, displaying and enriching interlinear texts. In this section, we survey a range of forms of interlinear text used in various software packages and data models, and describe the need for a general framework to express interlinear text.

In section 2 we present the annotation graph model and show how it can be applied to interlinear text. We study the structural limitations which are inherent in interlinear text, and present a formalization of these within the model in section 3. In section 4 we describe an interlinear text editor based on the framework, and evaluate the work in section 5.

1.1 Printed Texts and Text Formatting Tools

The physical appearance of interlinear text typically consists of a main text line annotated with linguistic transcriptions and analyses, such as morphological representations, glosses at various levels, part-of-speech tags, and free translations at the sentence or utterance level. Typically, fragments of these annotation lines are vertically aligned with the corresponding fragments of the main text. Sometimes punctuation symbols, such as the dash or the period, are used to indicate further breakdowns of the annotations, e.g. into morphemes. Phrasal translations and footnotes are often presented on other lines or off to one side.

There are a number of TeX and LaTeX macros that are capable of formatting interlinear text. The ITF (Interlinear Text Formatter) package [SIL, 2000a] is a sophisticated collection of tools to use with TeX. Figure 1 shows a portion of a formatted text from an Eskimo sample included in the ITF package. This example illustrates how interlinear texts are typically represented on the page.


Figure 1: ITF formatted text

The following is a portion of the TeX file produced by the itxprep program (included in the ITF package) corresponding to Figure 1. One can see the use of the control character sequences, such as \[, \

[DTD listing: element and attribute declarations not recoverable. Surviving values: "1.0", "http://www.ldc.upenn.edu/atlas/ag/", "http://www.w3.org/1999/xlink", "http://purl.org/DC/documents/rec-dces-19990702.htm", "simple"; attribute defaults #REQUIRED, #IMPLIED, #FIXED.]

Appendix 2: a Mawu XML file

A sample Mawu interlinear text.




Yala o   yE  Bulama  min lOOn n yE  a   ye
lion the AUX Ibrahim WH  eat  I AUX him saw
I saw Ibrahim, whom a lion ate.
Basic rel. clause with obj. as relative element.
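
The Mawu sentence above can be used to illustrate, in a simplified way, the claims that an annotation graph has a direct representation as a relational table and that individual layers can be projected while remaining well-formed. The following Python sketch is not the authors' API or DTD. The `Arc` tuple, the integer node identifiers, the layer names `word`, `gloss` and `translation`, and the functions `build_arcs`, `project` and `render` are all assumptions introduced for illustration, as is the one-to-one pairing of words with glosses.

```python
from typing import List, NamedTuple, Sequence


class Arc(NamedTuple):
    """One labeled arc between two nodes: one row of the relational table."""
    start: int   # source node id (hypothetical integer ids)
    end: int     # target node id
    layer: str   # annotation type (hypothetical layer names)
    label: str   # annotation content


WORDS = "Yala o yE Bulama min lOOn n yE a ye".split()
GLOSSES = "lion the AUX Ibrahim WH eat I AUX him saw".split()
TRANSLATION = "I saw Ibrahim, whom a lion ate."


def build_arcs(words: Sequence[str], glosses: Sequence[str],
               translation: str) -> List[Arc]:
    """Pair each word with its gloss over the same node span, and add
    one translation arc spanning the whole sentence."""
    arcs = []
    for i, (word, gloss) in enumerate(zip(words, glosses)):
        arcs.append(Arc(i, i + 1, "word", word))
        arcs.append(Arc(i, i + 1, "gloss", gloss))
    arcs.append(Arc(0, len(words), "translation", translation))
    return arcs


def project(arcs: Sequence[Arc], layer: str) -> List[Arc]:
    """Select a single layer; the result is still a list of arcs."""
    return sorted((a for a in arcs if a.layer == layer),
                  key=lambda a: a.start)


def render(arcs: Sequence[Arc], layers=("word", "gloss")) -> str:
    """Lay the chosen layers out as hard-spaced, vertically aligned lines.
    A missing annotation simply leaves its cell blank."""
    cells = {layer: {a.start: a.label for a in project(arcs, layer)}
             for layer in layers}
    starts = sorted({s for row in cells.values() for s in row})
    width = {s: max(len(cells[layer].get(s, "")) for layer in layers)
             for s in starts}
    lines = [
        " ".join(cells[layer].get(s, "").ljust(width[s])
                 for s in starts).rstrip()
        for layer in layers
    ]
    return "\n".join(lines)


arcs = build_arcs(WORDS, GLOSSES, TRANSLATION)
print(render(arcs))
print(project(arcs, "translation")[0].label)
```

Because every annotation is an independent row, dropping some rows (for example, glosses that have not yet been entered) leaves a structure that `project` and `render` still accept, which is one way to read the claim that partially analyzed texts remain well-formed.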
