An XML-based data model for flexible representation and query of linguistically interpreted corpora

Richard Eckart and Elke Teich

Technische Universität Darmstadt, Institut für Sprach- und Literaturwissenschaft
Hochschulstrasse 1, 64289 Darmstadt, Germany
{eckart,teich}@linglit.tu-darmstadt.de

Abstract. We present an XML-based data model that is deployed in a system for querying corpora with multiple layers of linguistic annotation. The model is based upon the simple but effective idea of leaving each layer of annotation intact at annotation time and relating the layers to each other only at query time. Queries select parts of the layers or of the text and then use interval operations based on stand-off anchors to relate the results to each other. The queries are performed by the XQuery engine of a native XML database, which has been extended with custom functions for interval operations and for access to the annotated text.

1 Introduction

Work with authentic natural language text in the form of corpora, in both linguistic and literary computing, has brought with it a concern with data models that can adequately express the needs of annotation and query. Since the late 1990s, a lot of research has been done into data representation in linguistic contexts as diverse as syntactic, phonological and discourse analysis, as well as in literary contexts. This research has resulted in a number of proposals for data models, including Thompson and McKelvie (1997), Bird and Liberman (2001), Sperberg-McQueen and Huitfeldt (2000) and Carletta et al. (2004), to mention only the more influential publications. While XML has developed into a standard for the representation of linguistically interpreted corpora, there exists a variety of data models and formats. The simple explanation for this is that, depending on the analysis task, each tool adopts its own data model and creates its own kind of XML output (see e.g., TIGER (Brants et al., 2002), NITE XML Toolkit (NXT) (Carletta et al., 2004) or ATLAS (Laprun et al., 2002)). As a consequence, corresponding query mechanisms have to be defined, which are often again task-specific.

As long as only one type of annotation is dealt with, proceeding this way presents no problem. However, the recently growing interest in more comprehensive text interpretation creates an amplified need to draw upon more than one type of linguistic annotation at a time. In our own work, we have encountered this need in contexts as diverse as analysis of information structuring (Teich et al., 2001; Baumann et al., 2004), cohesion (Fankhauser and Teich, 2004), and (cross-linguistic) register variation and translation (Teich, 2003, 2006). At a technical level, the problem arising in this situation is that of overlapping hierarchies: task-specific tools use specific corpus representations that are not compatible, in that integrated hierarchies of annotations cannot be formed (see also Alink et al., 2006) and subsequently cannot be readily queried simultaneously.

There are essentially two paths towards a solution: first, to try to find a comprehensive solution to linguistic data representation and build a system that anticipates all possible corpus analysis needs, as for instance in NXT or ATLAS; second, to acknowledge the fact that various task-specific tools are used for linguistic interpretation and provide a basis for interoperation of the resulting heterogeneous corpora (see also Wörner et al. (2006)). The research presented in this paper adopts the latter position. We have developed a data model that can act as a “lingua franca” between heterogeneous corpora and that allows a user to relate them when needed, notably at query time. Using this data model, we have implemented a system for querying linguistically interpreted corpora resulting from multiple, diverse annotation processes (such as PoS-tagging, syntactic parsing and other special-purpose annotation). Following Teich et al. (2001), our data model strives to be as expressive as necessary and as simple as possible.
Its basis is formed by the explicit distinction of four types of objects or tiers that linguistically interpreted corpora typically need to recognize (Eckart, 2006b):

– signal tier: primary data, e.g., audio signals, transcriptions or text,
– location tier: mapping between annotations and primary data,
– structure tier: structural elements and the ways they relate to each other, e.g., trees or graphs,¹
– feature tier: information associated with structural elements, e.g., part-of-speech tags.

Any corpus processing system dealing with multiple annotations needs to acknowledge these four tiers, but the concrete modeling and implementations may very well differ. We proceed in the following way. First, we present related work, discussing selected state-of-the-art corpus processing tools that instantiate the tiers in different ways (Section 2). We then elaborate our own data model and its encoding in XML (Section 3) and illustrate its use in multi-layer querying (Section 4). We conclude with a summary and outlook on future research (Section 5).

2 Related Work

Stand-off annotation (Thompson and McKelvie, 1997) first made it possible to decouple signal and structure/feature tiers. There are essentially two variants of stand-off annotation: one coordinates multiple annotations directly by means of pointers, the other separates annotations from the signal, thus introducing a location tier. Most of today’s state-of-the-art corpus processing tools acknowledge separate signal and structure/feature tiers. Many acknowledge a location tier. Also, structure and feature tiers can be separated since features can exhibit their very own complexity, ranging from simple key/value pairs to complex feature structures (cf. ISO-24610-1, 2006). In the following, we briefly discuss the concrete realizations of tiers of representation in three selected corpus processing tools that were built for different application purposes: the IMS Corpus Workbench (CWB) (Christ, 1994), TIGERSearch (Lezius, 2002) and the NITE XML Toolkit (NXT) (Carletta et al., 2004).

¹ Potentially multiple structures forming a multi-layer annotation.

CWB was built with the goal of providing a tool for inspection of textual corpus data. At the structure tier, there is support for annotation trees of limited depth. The feature tier connects only to the leaves of this tree; elements of the annotation hierarchy cannot bear features. This is sufficient, for instance, for the representation of Part-of-Speech annotation but not for the representation of syntactic annotation. A location tier does not exist in the CWB, hence there is no support for multiple annotation layers. However, alignment of multiple signals is possible and thus, for example, multilingual parallel corpora are supported.

TIGERSearch is a query tool for corpora annotated in terms of syntactic structure. TIGERSearch is tree-based and allows the representation of overlaying arbitrary graphs using secondary edges. Thus, both syntactic trees and coreference can be represented. The feature tier consists of key/value pairs that connect to tree nodes. Tree edges and secondary edges can carry role labels. A location tier is not present, since TIGERSearch does not use a true stand-off annotation. Crossing edges are supported, yet, as the leaves of the annotation tree directly contain the annotated text, overlapping segments are not supported. At the signal tier, TIGERSearch supports text only.

NXT was developed as a corpus processing tool for multi-modal data. Similar to TIGERSearch, the structure tier of NXT is tree-based and allows the representation of overlaying arbitrary graphs using pointers. The feature tier connects to the annotation elements contained in the structure tier. As a result, phrase and dependency structure can be represented. Using pointers between elements in different annotation layers, coreference can also be represented. NXT uses stand-off annotation and allows multiple annotation layers that can refer to each other using pointers or to primary data using signal offsets. The location tier is based on continuous intervals defined by two decimal numbers. This allows NXT to address text as well as audio or video signals. At the signal tier, all three kinds of signals are supported. Multiple signals can be aligned with each other if they share a common timeline. This is sufficient to represent, for example, dialogue or subtitled video, but is not suitable for multilingual parallel corpora.

Comparing these three approaches reflects the current situation concerning existing corpus processing tools more generally:

– At one end of the spectrum we have task-specific tools (such as TIGERSearch), at the other end we have generic tools (such as NXT);
– at a more abstract level, the underlying data models vary in expressiveness (here: CWB’s being the least expressive, NXT’s being the most expressive);
– in terms of query, each tool develops its own query engine and accompanying query language.

Thus, what is called for, in our view, is an approach that accommodates various heterogeneous (XML-based) representations, on the one hand, and employs a standard query engine and language, on the other. Sections 3 and 4 below describe such an approach, implemented in a system for corpus query (AnnoLab; Eckart (2006a)). The approach is based on a modification of the XML data model that is similarly expressive to the one employed in NXT and allows querying using a regular XQuery engine with some custom extensions.

3 A modified XML data model

The XML data model as defined by the W3C (2005) describes an ordered tree with typed nodes that can bear attributes (key/value pairs). Special text nodes can contain textual data and are always leaf nodes of the tree. This model acknowledges structure, feature and signal tiers and is sufficiently expressive for many linguistic applications, such as syntactic parsing or Part-of-Speech tagging. However, trees are required to be projective, which leads to problems with crossing edges, overlapping segments and conflicting hierarchies. We have therefore modified the XML data model, turning it into a stand-off multi-layer annotation model that addresses these problems.

At the structure tier, our model uses XML elements as annotation elements and thus inherits all structural relations present in the ordered tree model XML is based upon: parent/child, preceding/following sibling, following/preceding element. At the feature tier, it uses XML attributes to model key/value features. An annotation layer consists of elements and features represented by a single XML tree, and multiple layers can exist in parallel. The XML text nodes are replaced by segments, each describing a region of a signal via anchors. These anchors represent the start and end offsets of a segment. A layer can refer to no signal, to one signal or to multiple signals. Layers can express a fully nested tree, a flat list, or an alignment by means of elements that refer to multiple signals. Figure 1 displays our model.

This model stays close enough to XML to support queries using an existing XQuery engine with few modifications, while solving the problems of crossing edges, overlapping segments and conflicting hierarchies. With the structure tier and signal tier being decoupled by means of the location tier, the annotation layers need not form trees that directly project to the signal, i.e., that share the same linear ordering.
Also, regions addressed by segments do not need to be disjoint, but can overlap. Furthermore, as multiple annotation layers can exist in parallel, it is not necessary to integrate all annotations into a single tree structure, which at times is even impossible (cf. Teich et al., 2005). Instead, different kinds of annotations are kept in separate layers and at query time these can be related to each other via the signal tier.

Fig. 1. Modified XML data model

The presented approach is a compromise between expressiveness and reusability of existing XML query implementations. It is sufficiently expressive for a wide range of linguistic annotations (see Section 4), yet the full expressiveness of enhanced Annotation Graphs or even arbitrary graphs is not attained. Annotation layers can relate to each other only via the segments that anchor the layers to the signal. Pointers, which can be used to refer from one annotation element to another within a layer or across layer boundaries, are currently not supported, but may be accommodated as well.
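To make the model more tangible, the following Python sketch mirrors the four tiers with plain data classes. It is an illustration only and not AnnoLab code: the class names `Segment` and `Element`, the half-open character offsets, the sample signal and both annotation layers are assumptions introduced here. The sketch shows two layers anchored to the same signal whose regions cross, which is the situation that prevents them from being merged into one tree.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Segment:
    """Location tier: a region of a signal given by start/end anchors."""
    signal: str  # id of the signal the anchors point into
    start: int   # offset of the first character (assumed inclusive)
    end: int     # offset one past the last character (assumed exclusive)


@dataclass
class Element:
    """Structure tier node; `features` plays the role of the feature tier."""
    name: str
    features: dict = field(default_factory=dict)
    children: list = field(default_factory=list)  # Element or Segment items

    def segments(self):
        """Yield every segment below this element, in document order."""
        for child in self.children:
            if isinstance(child, Segment):
                yield child
            else:
                yield from child.segments()

    def span(self, signal):
        """Smallest (start, end) interval covering this element on `signal`."""
        segs = [s for s in self.segments() if s.signal == signal]
        return (min(s.start for s in segs), max(s.end for s in segs))


def crosses(a, b):
    """True if two intervals overlap without one containing the other."""
    overlap = a[0] < b[1] and b[0] < a[1]
    nested = (a[0] <= b[0] and b[1] <= a[1]) or (b[0] <= a[0] and a[1] <= b[1])
    return overlap and not nested


# Signal tier: the primary data, stored once and never modified.
signals = {"A": "you turn left and then right"}

# Two independent layers over the same signal (hypothetical annotations).
clause_layer = Element("layer", {"id": "clause"}, [
    Element("clause", {"n": "1"}, [Segment("A", 0, 13)]),    # "you turn left"
    Element("clause", {"n": "2"}, [Segment("A", 14, 28)]),   # "and then right"
])
inton_layer = Element("layer", {"id": "inton"}, [
    Element("unit", {"tone": "1"}, [Segment("A", 0, 8)]),    # "you turn"
    Element("unit", {"tone": "3"}, [Segment("A", 9, 17)]),   # "left and"
    Element("unit", {"tone": "1"}, [Segment("A", 18, 28)]),  # "then right"
])

# Pairs of (clause number, unit text) whose regions cross each other.
conflicts = [
    (c.features["n"], signals["A"][u.span("A")[0]:u.span("A")[1]])
    for c in clause_layer.children
    for u in inton_layer.children
    if crosses(c.span("A"), u.span("A"))
]
print(conflicts)
```

Because each layer only points into the signal, neither layer has to know about the other; the crossing regions are discovered by comparing anchors, not by inspecting tree structure.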

4 Representation and query

The modified XML data model described in Section 3 has been implemented in a system for corpus query, AnnoLab (Eckart, 2006a). AnnoLab is based on the native XML database eXist (Meier, 2006), which has been extended with custom functions to coordinate annotation layers and to access the associated signals. To illustrate the use of our data model, we discuss two areas of linguistic inquiry: the grammar-intonation interface and parallel text-translation. Due to space constraints, the XML data shown in the examples is an abbreviated version of the original code used in AnnoLab. The following non-standard XQuery functions are used in the examples:

– ds:layer($a as xs:string) element*
  Takes the name of a layer and returns the root element of this layer.
– seq:overlapping($a as element*, $b as element*) element*
  Takes two sets of annotation elements and returns all a in A that overlap with some b in B.
– txt:get-text($a as element*) xs:string*
  Takes a set of elements or segments and returns the text from the associated signal.

Grammar-intonation interface. The example illustrates how to deal with overlapping segments from multiple annotation layers. The clause layer (text into.clause) and the intonation layer (text into.inton) shown in Figure 4 cannot be merged into a single tree because of overlapping segments, so both layers are kept separately in the database. The query in Figure 2 searches for all intonation units that include a clause boundary. Incidentally, these are exactly the overlapping segments that prevent the two layers from being represented in a single tree.

    let $c := ds:layer('text into.clause')//clause
    for $i in ds:layer('text into.inton')//inton-unit/t
    where count(seq:overlapping($i, $c)) gt 1
    return {txt:get-text($i)}
    -------------------------------------------
    ridor and

Fig. 2. Query for overlapping segments

Parallel text-translation. In cases where multiple signals that do not share a common timeline need to be aligned, as for example in a parallel text-translation corpus, an annotation layer can contain references to more than one signal. Figure 5 illustrates two signals and an alignment layer represented in XML which aligns the words in each clause with each other. Figure 3 shows a query using the alignment layer (test en de.align). In addition to that layer, assume a Part-of-Speech layer (test en.pos.tnt). The query yields a list of all verb forms in the English text that are one or two tokens to the left of a determiner along with their translations into German. The query selects all tokens from the Part-of-Speech layer and all alignments from the alignment layer. The result set contains those combinations of segments and alignments that fulfill the specified conditions:

– the English part of the alignment has to overlap the word from the Part-of-Speech layer,
– the Part-of-Speech tag of the token has to start with a V (verb forms),
– the Part-of-Speech tag of the first or second token following the verb token has to start with DT.

    for $eng in ds:layer('test en.pos.tnt')//token,
        $aln in ds:layer('test en de.align')//align
    let $next := $eng/following::token[position() < 2]
    where seq:overlapping($eng, $aln//i[@role='en'])
      and starts-with($eng/@pos, 'V')
      and starts-with($next/@feature, 'DT')
    return {txt:get-text($eng)} {txt:get-text($aln//i[@role='de'])} </ger>
    -------------------------------------------
    locked schloss ab

Fig. 3. Query for aligned signals using an alignment layer and PoS tags

The general principle underlying query formulation is the following: First, individual layers from the structure/feature tiers are queried. Typically, this is expressed by means of a simple XPath expression. Second, the results of these individual queries are combined via references to a common signal. This uses the extension function seq:overlapping. Third, a result is constructed by retrieving a part of the actual signal. This employs the extension function txt:get-text.
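The three steps can be sketched outside of XQuery as well. The Python below is a hedged illustration, not the AnnoLab implementation: `overlapping` and `get_text` only imitate the stated behaviour of seq:overlapping and txt:get-text on plain (start, end) offsets, `tile` is a helper invented here, and both segmentations of the Figure 4 signal are hypothetical, since the original segment boundaries are not recoverable from the figure. Given the stated semantics of seq:overlapping (it returns members of its first argument), the sketch passes the clauses first so that the count is the number of clauses a unit touches.

```python
SIGNAL = ("Okay well you turn right and you go along the corridor and you turn "
          "left into the second little corridor and as soon as you do that the "
          "office will be straight on your right")


def tile(signal, chunks):
    """Turn consecutive text chunks of `signal` into (start, end) offsets."""
    spans, pos = [], 0
    for chunk in chunks:
        start = signal.index(chunk, pos)  # search only to the right of `pos`
        spans.append((start, start + len(chunk)))
        pos = start + len(chunk)
    return spans


# Step 1: select elements from each layer (hypothetical segmentations).
CLAUSES = tile(SIGNAL, [
    "Okay well you turn right",
    "and you go along the corridor",
    "and you turn left into the second little corridor",
    "and as soon as you do that",
    "the office will be straight on your right",
])
UNITS = tile(SIGNAL, [
    "Okay well",
    "you turn right",
    "and you go along the corridor and",
    "you turn left",
    "into the second little corridor",
    "and as soon as you do that the office",
    "will be straight on your right",
])


def overlapping(a_set, b_set):
    """All a in a_set whose interval intersects some b in b_set."""
    return [a for a in a_set
            if any(a[0] < b[1] and b[0] < a[1] for b in b_set)]


def get_text(segments):
    """Text of the signal covered by each (start, end) segment."""
    return [SIGNAL[start:end] for start, end in segments]


# Step 2: relate the layers via offsets into the common signal.
# A unit contains a clause boundary if it touches more than one clause.
hits = [u for u in UNITS if len(overlapping(CLAUSES, [u])) > 1]

# Step 3: construct the result from the signal itself.
result = get_text(hits)
print(result)
```

Only step 2 needs interval arithmetic; steps 1 and 3 are ordinary selection and lookup, which is why a standard query engine with a few extension functions suffices.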

5 Conclusions

In this paper we have addressed the problem of interoperability of heterogeneous corpora, suggesting a modified XML data model that can act as a lingua franca for linguistically interpreted corpora resulting from multiple, diverse annotation processes. This data model forms the basis for a corpus query system, AnnoLab, that allows inspection of multi-layer corpora. AnnoLab uses the native XML database eXist, employing its XQuery engine with a few extensions to query functionality. The advantages of our approach are that only minimal modifications on the part of the XML data model are required to achieve sufficient expressiveness and that only a few extensions of XQuery are needed to cater for the specific needs of linguistic query. However, query formulation in XQuery can become quite complex and require a level of expertise that linguistic users may often not have or want to acquire. Therefore, in our future work we plan to develop a query preprocessor that generates complex XQuery queries from a simpler representation.

Acknowledgements. We thank Peter Fankhauser and the anonymous reviewers for their helpful comments. All remaining weaknesses are ours. This research is supported by a grant from Deutsche Forschungsgemeinschaft (DFG).

    <signal id="A">
    Okay well you turn right and you go along the corridor and you turn left into the second little corridor and as soon as you do that the office will be straight on your right

    <layer id="text into.clause">
    Okay well you turn right and you go along the corridor and you turn left into the second little corridor and as soon as you do that the office will be straight on your right

    <layer id="text into.inton">
    Okay well you turn right and you go along the corridor and you turn left into the second little corridor and as soon as you do that the office will be straight on your right

Fig. 4. Overlapping segments: grammatical and intonation units

    He locked the gate

    <layer id="test en de.align">
    <i role="de">Er <i role="en">He
    <i role="de">schloss ab <i role="en">locked
    <i role="de">das <i role="en">the
    <i role="de">Tor <i role="en">gate

Fig. 5. Word-alignment of a German and an English clause
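The alignment in Figure 5 and the query conditions of Section 4 can be imitated with a few lines of Python. This is a sketch under assumptions, not the system's code: the German signal text, the Penn-style PoS tags and the helper names `find`, `overlaps` and `text` are all introduced here for illustration, and the discontinuous German verb is modelled as an alignment entry holding two segments. Like the query in Figure 3, the sketch inspects only the token immediately following the verb.

```python
SIGNALS = {
    "en": "He locked the gate",
    "de": "Er schloss das Tor ab",  # assumed: the German signal is not shown
}


def find(signal_id, word):
    """Segment (signal id, start, end) of the first occurrence of `word`."""
    start = SIGNALS[signal_id].index(word)
    return (signal_id, start, start + len(word))


# Part-of-Speech layer over the English signal (tags are an assumption).
POS = [(find("en", w), tag) for w, tag in
       [("He", "PRP"), ("locked", "VBD"), ("the", "DT"), ("gate", "NN")]]

# Alignment layer: each entry refers to segments on both signals.
ALIGN = [
    {"en": [find("en", "He")], "de": [find("de", "Er")]},
    {"en": [find("en", "locked")],
     "de": [find("de", "schloss"), find("de", "ab")]},
    {"en": [find("en", "the")], "de": [find("de", "das")]},
    {"en": [find("en", "gate")], "de": [find("de", "Tor")]},
]


def overlaps(a, b):
    """True if both segments are on the same signal and intersect."""
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]


def text(segments):
    """Signal text covered by the segments, joined with spaces."""
    return " ".join(SIGNALS[sig][start:end] for sig, start, end in segments)


def verbs_before_determiner():
    """English verb forms directly followed by a determiner, with German text."""
    hits = []
    for (seg, tag), (_, next_tag) in zip(POS, POS[1:]):
        if tag.startswith("V") and next_tag.startswith("DT"):
            for pair in ALIGN:
                # The English side of the alignment must overlap the token.
                if any(overlaps(seg, en) for en in pair["en"]):
                    hits.append((text([seg]), text(pair["de"])))
    return hits


print(verbs_before_determiner())
```

The PoS layer and the alignment layer never reference each other; they meet only through offsets into the English signal, and the German text is reached through the second set of segments in the same alignment entry.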

Bibliography

Alink, W., Jijkoun, V., Ahn, D., de Rijke, M., Boncz, P., and de Vries, A. (2006). Representing and querying multi-dimensional markup for question answering. In Proceedings of the Workshop on Multi-Dimensional Markup in Natural Language Processing, pages 3–9, Trento, Italy. EACL.

Baumann, S., Brinckmann, C., Hansen-Schirra, S., Kruijff, G.-J., Kruijff-Korbayova, I., Neumann, S., and Teich, E. (2004). Multi-dimensional annotation of linguistic corpora for investigating information structure. In Proceedings of NAACL Workshop Frontiers in corpus annotation, Boston. Meeting of the North-American Chapter of the Association for Computational Linguistics (NAACL).

Bird, S. and Liberman, M. (2001). A formal framework for linguistic annotation. Speech Communication, 33(1-2):23–60.

Brants, S., Dipper, S., Hansen, S., Lezius, W., and Smith, G. (2002). The TIGER treebank. In Proceedings of the Workshop on Treebanks and Linguistic Theories, September 20-21 (TLT02), Sozopol, Bulgaria.

Carletta, J., McKelvie, D., Isard, A., Mengel, A., Klein, M., and Møller, M. B. (2004). A generic approach to software support for linguistic annotation using XML. In Sampson, G. and McCarthy, D., editors, Corpus Linguistics: Readings in a Widening Discipline, chapter 39. Continuum International, London and New York.

Christ, O. (1994). A modular and flexible architecture for an integrated corpus query system. In Proceedings of COMPLEX ’94: 3rd Conference on Computational Lexicography and Text Research, pages 23–32, Budapest, Hungary.

Eckart, R. (2006a). A framework for storing, managing and querying multi-layer annotated corpora. Diploma thesis, Technische Universität Darmstadt, Darmstadt.

Eckart, R. (2006b). Towards a modular data model for multi-layer annotated corpora. In Proceedings of the COLING/ACL 2006 Main Conference Poster Sessions, pages 183–190, Sydney, Australia. Association for Computational Linguistics.

Fankhauser, P. and Teich, E. (2004). Multiple perspectives on text using multiple resources: Experiences with XML processing. In Proceedings of LREC Workshop on XML-based richly annotated corpora, 4th Conference on Language Resources and Evaluation (LREC), Lisbon, Portugal.

ISO-24610-1 (2006). ISO-24610-1 - Language resource management - Feature structures - Part 1: Feature structure representation. International Organization for Standardization.

Laprun, C., Fiscus, J. G., Garofolo, J., and Pajot, S. (2002). A practical introduction to ATLAS. In Proceedings of the 3rd International Conference on Language Resources and Evaluation (LREC) 2002, Las Palmas, Spain.

Lezius, W. (2002). Ein Suchwerkzeug für syntaktisch annotierte Textkorpora (German). Ph.D. thesis, University of Stuttgart, Institut für Maschinelle Sprachverarbeitung.

Meier, W. (2006). eXist – Open Source native XML database. http://exist.sourceforge.net/index.html.

Sperberg-McQueen, M. and Huitfeldt, C. (2000). GODDAG: A data structure for overlapping hierarchies. In DDEP/PODDP, pages 139–160, Munich.

Teich, E. (2003). Cross-linguistic variation in system and text. In A methodology for the investigation of translations and comparable texts, Berlin and New York. Mouton de Gruyter.

Teich, E. (2006). Information load in Theme and New: an exploratory study of science texts. In Cho, S.-Y. and Steiner, E., editors, Information distribution in English grammar and discourse and other topics in linguistics, pages 289–304, Frankfurt am Main. Peter Lang Pub Inc.

Teich, E., Fankhauser, P., Eckart, R., Bartsch, S., and Holtz, M. (2005). Representing SFL-annotated corpus resources. In Proceedings of the 1st Computational Systemic Functional Workshop, Sydney, Australia.

Teich, E., Hansen, S., and Fankhauser, P. (2001). Representing and querying multi-layer corpora. In Proceedings of the IRCS Workshop on Linguistic Databases, pages 228–237, Philadelphia. University of Pennsylvania.

Thompson, H. S. and McKelvie, D. (1997). An hyperlink semantics for standoff markup of read-only documents. In Proceedings of SGML Europe 97, Barcelona, Spain.

W3C (2005). W3C: Document Object Model (DOM). http://www.w3.org/DOM/.

Wörner, K., Witt, A., Rehm, G., and Dipper, S. (2006). Modelling linguistic data structures. In Proceedings of Extreme Markup Languages 2006, Montréal, Québec.
