The application of annotation models for the construction of databases and tools: overview and analysis of MPI work since 1994
Hennie Brugman
Max-Planck-Institute for Psycholinguistics
Wundtlaan 1, 6525 XD Nijmegen, Netherlands
[email protected]

Peter Wittenburg
Max-Planck-Institute for Psycholinguistics
Wundtlaan 1, 6525 XD Nijmegen, Netherlands
[email protected]

Abstract

This paper discusses four generations of models for linguistic annotation and evaluates their evolution in relation to the software tools and corpora they are used for. MPI work on models is compared with other recent efforts to design generic models.

Introduction

At the Max Planck Institute for Psycholinguistics [1], software development in the area of linguistic annotation has been ongoing since 1994. This has resulted in several generations of linguistic corpora, annotation formats and software tools. Data models and object oriented models have played a central role from the start. This paper gives an overview and critical evaluation of this evolutionary development process and attempts to relate it to recent standardization and cooperation efforts. The following models will be discussed in chronological order: the model that is used for the MediaTagger annotation tool (1994), the extension of this model to a relational database model (1996), the object oriented Abstract Corpus Model (1997) and recent revisions of this model (2001).

This paper often uses terms without giving proper definitions. This is because the definitions evolved along with the models discussed. In a sense, the role that each concept plays in a model can be seen as a definition of that concept. We hope that readers have their own definitions of the concepts discussed and will be able to relate them to ours.

1 MediaTagger's implicit model
MediaTagger was built initially as a viewer and editor for QuickTime text tracks. Its model therefore started rather implicitly, by adapting QuickTime's model. It uses an unrestricted number of independent text tracks with segment labels as its tiers and annotations. All labels have to be time aligned. Time dependencies between tiers are explicitly added. All annotations on a tier have dependencies to their parent annotations as specified at the level of the tiers. Because annotations are not allowed to overlap within a tier, each parent annotation is uniquely identified. There are two types of dependencies: "included in" and "attribute of". The first restricts begin and end times to within the time segment of the parent annotation; the latter copies the begin and end times of the parent. Dependent tiers can have their own dependent tiers, thus forming trees of annotations. Each tier has an id that is unique within the annotation document, and it has a type. These types have an id and an optional closed vocabulary. Types can obviously be shared among tiers in the document. Tier attributes and type information are stored in QuickTime's user definable elements. MediaTagger documents are valid QuickTime movies that can be distributed and even played with the standard QuickTime players, including visual representations of the annotations.

1.1 Evaluation
Many of this model's principles still hold and comply with what is currently seen as up to date. Some disadvantages of MediaTagger's model are:
• Tier types are internal to the annotation document and therefore cannot be shared among documents.
• Tier types that only have an id and an optional closed vocabulary are relatively basic.
• There was a need for more document and tier metadata.
• Annotation begin and end times are duplicated from parent annotations to attribute annotations instead of being shared in some way.
• All annotations have to be fully time-aligned; partially aligned annotations or annotations that are only ordered cannot be dealt with.
• Annotation values can only be simple Roman text strings.
• Not all annotation structures are possible: for example, there is no support for annotations that refer to discontinuous segments of time or to symbolic tree structures (like syntax trees).
• For a long time QuickTime support for editing was only available on the Macintosh, so there exists only a Mac implementation.
• There is no ability for simultaneous annotation on different tiers of the same document.
• QuickTime video files support their own mechanisms for referencing between files.

2 ...Made explicit as relational model
Soon after the deployment of MediaTagger at the MPI there was a need for centralized storage with search capabilities, and the need to reuse tier type information for multiple documents came up. It was decided to design and develop a
relational database. We used a formal method for the model's design. This resulted in a detailed Entity-Relationship diagram and a set of carefully normalized relational database tables. This database is implemented and has been used for several years. It now contains approximately 80,000 annotations on 3,000 tiers that are associated with 500 digital video movies (corresponding to approximately 50 hours). A separate graphical tool for the configuration and execution of queries was built with OracleForms and OracleGraphics to compose time-related multi-tier queries. MediaTagger was adapted and extended to support client-server operation using ODBC, in such a way that both online and offline operation on annotation documents is possible. Synchronization is performed on documents or tier types when they are exported or imported. The condensed form of the relational schema is described by the following eight database tables:

SESSIONS table: SESSIONS_ID, SESSIONS_NAME, OWNER, TAPE_REF, RECORDING_DATE, AV_FILE, CODEC, PAL_NTSC, BEGIN_TIME,
SUBJECT table: SUBJECT_ID, SUBJECT_NAME, OWNER, BIRTHDAY, FIRST_LANGUAGE
CONDITION table: SESSIONS_ID, CONDITION_NAME
TIER_TYPE table: TIER_TYPE_ID, TIER_TYPE_NAME, OWNER, VALUE_RANGE_ID
VALUE_RANGE table: VALUE_RANGE_ID, SAMPLE_VALUE
TIER table: TIER_ID, TIER_NAME, TIER_TYPE_ID, SESSIONS_ID, SUBJECT_ID, TRANSCRIBER
SAMPLE table: TIER_ID, BEGIN_TIME, END_TIME, SAMPLE_VALUE
DEPENDENCY table: PARENT, CHILD, DEPENDENCY_TYPE
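As an illustration, the core of this schema can be exercised with a small SQLite sketch. The actual implementation used Oracle; the sample data and the overlap query below are our own invented example of the kind of time-related multi-tier query the graphical query tool composed:

```python
import sqlite3

# Minimal subset of the schema above (SQLite dialect; the original used Oracle).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE TIER (TIER_ID INTEGER PRIMARY KEY, TIER_NAME TEXT,
                   TIER_TYPE_ID INTEGER, SESSIONS_ID INTEGER,
                   SUBJECT_ID INTEGER, TRANSCRIBER TEXT);
CREATE TABLE SAMPLE (TIER_ID INTEGER, BEGIN_TIME INTEGER,
                     END_TIME INTEGER, SAMPLE_VALUE TEXT);
""")
conn.executemany("INSERT INTO TIER VALUES (?,?,?,?,?,?)",
                 [(1, "speech", 1, 1, 1, "hb"), (2, "gesture", 2, 1, 1, "hb")])
conn.executemany("INSERT INTO SAMPLE VALUES (?,?,?,?)",
                 [(1, 0, 1200, "hello"), (1, 1300, 2000, "there"),
                  (2, 1000, 1500, "point")])

# A time-related multi-tier query: gesture annotations that overlap
# in time with a speech annotation (times in milliseconds).
rows = conn.execute("""
SELECT g.SAMPLE_VALUE, s.SAMPLE_VALUE
FROM SAMPLE g JOIN SAMPLE s
  ON g.TIER_ID = 2 AND s.TIER_ID = 1
 AND g.BEGIN_TIME < s.END_TIME AND s.BEGIN_TIME < g.END_TIME
""").fetchall()
print(rows)  # overlapping (gesture, speech) pairs
```

The overlap condition `g.BEGIN_TIME < s.END_TIME AND s.BEGIN_TIME < g.END_TIME` is the standard interval-intersection test on which such multi-tier queries rest.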
The SESSIONS table contains one record for each annotation document. It has a couple of fields to store metadata values and a reference to an associated audio or video file. CONDITION records are a way to group SESSIONS on the basis of some keyword. TIER records contain some tier metadata fields and are linked to a SESSION record, a SUBJECT record and a TIER_TYPE record. This TIER_TYPE can be associated with a closed vocabulary, here called a VALUE_RANGE. The SAMPLE table contains a record for each annotation. It stores begin and end times and the annotation's content value, as well as a reference to one TIER. All annotations from all database users end up in the same table, as do all tiers from all users. Finally, the DEPENDENCY table stores time dependencies between annotations at the level of their tiers. Dependencies at the level of individual annotations can be uniquely derived because of the implicit constraint that annotations on the same tier may not overlap in time.

2.1 Evaluation
Designing and implementing the relational database resulted in improvements on some of the disadvantages mentioned above.

2.1.1 Sharing of tier types

Each database user can define and reuse tier types in several annotation documents. It is also possible to share tier types among users by explicitly granting another user access. Although possible, in practice no real tier type repositories have evolved within the MPI yet. Because tiers from multiple documents share types, modification of a type may imply modification of all documents that use that type. Therefore, creating a type repository involves careful design from the start.

2.1.2 Document and tier metadata

Of all the tier attributes that were added, the most important ones are those for a participant and a coder. At the level of annotation documents it is possible to group documents on the basis of keywords using the CONDITION table. Compared to recent initiatives concerning metadata for language resources, our document metadata support here is rudimentary.

2.1.3 Simultaneous annotation

In the database configuration, annotation documents are spread over a number of records in a number of tables. It is therefore straightforward to load only a subset of all available tiers for a document into MediaTagger. These tiers can be modified and stored in the database separately. This allows concurrent editing on different tiers of the same document. Although this is possible and also done, it is not robust in the sense that there is no good locking mechanism that prevents users from modifying the same data simultaneously.

2.1.4 Reference to video files

Video files are kept separate from the database tables. Only a reference to a video file is maintained for each annotation document. This reference is just an identification string; it contains no information about where the media resource is located. It is up to the user to locate the video using a dialog box each time a document is loaded from the database.

3 The Abstract Corpus Model
Although the implementation of a central annotation database solved some of the shortcomings of MediaTagger, some issues remained. Further, it was clear that user requirements and the state of technology were evolving past the initial design of MediaTagger. Therefore the EUDICO (European Distributed Corpora) project [2] was initiated. Its starting point was our experiences with MediaTagger and a number of new, ambitious aims:
• Independence of operating system.
• Independence of annotation file and corpus formats.
• Extensibility, both with respect to file formats on one side and viewing, editing and search tools for annotations on the other side.
• Internet based operation.
• A three-layer, distributed architecture that allows network collaboration on annotation related tasks.
• Support for streaming audio and video.
• Support for a wider range of annotation structures.

Figure 1: ACM class diagram (the classes CorpusDatabase, Corpus, CompositeCorpus, LeafCorpus, Transcription, MediaObject, MetaTime, Tier, TierSharedInfo, TierUnsharedInfo, TierBundle, CodeGroup, CodeType and Tag, and their associations).
To meet these aims we chose to start from scratch using Java and an object oriented approach. Substantial time was taken to design an object oriented model that would be able to cope with a number of new user requirements. An informal use-case driven method was used to create this design. In contrast to the previously discussed relational model, which merely models concepts and their relations, this object oriented model is expressed mostly as a set of related interface definitions. It is therefore more of an operational model than a data model. The model is called the Abstract Corpus Model (ACM) since it models concepts from the domain of annotated corpora in an abstract way. It is realized in the first instance as a set of abstract classes that implement common behavior. These abstract classes each have concrete subclasses, one for each of the annotation file formats that ACM currently supports (CHAT (MacWhinney, 1999), Shoebox [3], MediaTagger's relational database, Tipster via the GATE API [4], several varieties of XML). The method calls from ACM's interfaces can be used by a range of annotation related tools. The interfaces are uniform to the tools, although the actual objects that implement those interfaces may be instantiated from differently formatted files or even from a relational database. For example, the tools are not aware of whether they work on a CHAT file or on a set of database records. Most ACM objects are implemented as remote objects using Java's RMI (Remote Method Invocation) facilities. This means that these objects can exist on a central annotation server while the annotation related tools that use their services run on local clients on the network. Method calls to a set of remote interfaces, with arguments and return values, offer a natural way to organize protocols for an annotation server. This type of support for remote objects is efficient since only data that is asked for is sent over the network, e.g. a tier name instead of a complete tier or annotation document. It also forms the basis for a collaborative annotation environment since remote objects can be simultaneously accessed by multiple users. For a class diagram of the first generation of the ACM see figure 1.
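The format-independence idea can be sketched as follows. Python is used here for brevity (the actual ACM is a set of Java interfaces and RMI objects), and the class bodies are invented stand-ins for real parsers and database code:

```python
from abc import ABC, abstractmethod

# Tools program against an abstract Tier; concrete subclasses hide the
# storage format, in the spirit of ACM's abstract/concrete class split.
class Tier(ABC):
    @abstractmethod
    def get_name(self) -> str: ...
    @abstractmethod
    def get_annotation_values(self) -> list: ...

class ChatTier(Tier):            # would parse a CHAT file in reality
    def __init__(self, name, values):
        self._name, self._values = name, values
    def get_name(self): return self._name
    def get_annotation_values(self): return list(self._values)

class DatabaseTier(Tier):        # would issue SQL queries in reality
    def __init__(self, name, rows):
        self._name, self._rows = name, rows
    def get_name(self): return self._name
    def get_annotation_values(self): return [r["value"] for r in self._rows]

def list_tier(tier: Tier) -> str:
    # A "tool": unaware of whether the tier comes from a file or a database.
    return f"{tier.get_name()}: {' '.join(tier.get_annotation_values())}"

print(list_tier(ChatTier("utt", ["hello", "world"])))
print(list_tier(DatabaseTier("utt", [{"value": "hello"}, {"value": "world"}])))
```

Both calls produce identical output even though the backing "formats" differ, which is exactly the uniformity the ACM interfaces give to tools.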
CorpusDatabase is a singleton remote object on a server that is at the root of a tree of Corpus objects. It is the only object that has to be made available explicitly by the EUDICO server program, and it has to be addressed by client programs using a URL. All other ACM objects can be retrieved directly or indirectly via method calls to this object. The CorpusDatabase returns a tree of Corpus objects (either CompositeCorpus or LeafCorpus). These three Corpus classes are organized in the "composite" design pattern (Gamma, 1994). CompositeCorpora contain Corpora; LeafCorpora contain Transcription objects (annotation documents). A Transcription object can be associated with one MediaObject. Transcriptions can contain Tier objects that can contain Tag objects (annotations).

In this model there are two types of Tier attributes: attributes whose value applies to one annotation document and one tier (grouped in a TierUnsharedInfo object; examples: 'coder', 'coding quality') and attributes whose value can be shared by a number of tiers or annotation documents (grouped in a TierSharedInfo object; example: tier name). The motivation for this split into two groups of attributes is that a number of TierSharedInfo objects could serve as a template or TierBundle that could be used for similar annotation documents. Changing a value in a TierSharedInfo object would then automatically update all tiers that use it. The one essential part of a TierSharedInfo object is a CodeGroup, which in turn consists of a number of CodeTypes. The role of a CodeType is to specify and check the annotation values that are allowed for the associated Tier's Tags. We chose to support Tags that can have a list of values; CodeGroup specifies the order and types of these values. This choice is mainly motivated by requests from gesture research, where researchers typically want to encode a large number of attributes for the same time segment.
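A minimal sketch of the composite Corpus structure (the class names follow the paper; the method is our own illustration, as the real classes expose much richer interfaces):

```python
# Composite pattern: a client can ask any Corpus node a question and the
# answer is computed uniformly over leaf and composite children.
class Corpus:
    def __init__(self, name):
        self.name = name
    def transcription_count(self):
        raise NotImplementedError

class LeafCorpus(Corpus):
    def __init__(self, name, transcriptions):
        super().__init__(name)
        self.transcriptions = transcriptions   # annotation documents
    def transcription_count(self):
        return len(self.transcriptions)

class CompositeCorpus(Corpus):
    def __init__(self, name, children):
        super().__init__(name)
        self.children = children               # Corpus objects of either kind
    def transcription_count(self):
        return sum(c.transcription_count() for c in self.children)

root = CompositeCorpus("corpora", [
    LeafCorpus("sessions-A", ["doc1", "doc2"]),
    CompositeCorpus("nested", [LeafCorpus("sessions-B", ["doc3"])]),
])
print(root.transcription_count())  # 3
```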
In this version of ACM, Tags have begin and end times that can be specified or unspecified. To make this possible the order of all unaligned Tags in a Transcription has to be stored explicitly. The object responsible for this is called MetaTime and is associated with Transcription.
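The MetaTime idea can be sketched like this (class and method names are illustrative, not ACM's actual API): even a tag with no times at all keeps a definite position in the document's explicit order.

```python
# Tags keep their document order explicitly in a MetaTime-like object,
# so tags without begin/end times can still sit between aligned ones.
class Tag:
    def __init__(self, value, begin=None, end=None):
        self.value, self.begin, self.end = value, begin, end

class MetaTime:
    def __init__(self):
        self._order = []                 # explicit total order of all tags
    def insert_after(self, tag, anchor=None):
        if anchor is None:
            self._order.append(tag)
        else:
            self._order.insert(self._order.index(anchor) + 1, tag)
    def ordered_values(self):
        return [t.value for t in self._order]

mt = MetaTime()
a = Tag("a", 0, 500)
b = Tag("b", 900, 1400)
mt.insert_after(a)
mt.insert_after(b)
u = Tag("um")                            # no time alignment at all
mt.insert_after(u, anchor=a)             # but an explicit position: after a
print(mt.ordered_values())  # ['a', 'um', 'b']
```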
3.1 Evaluation
Over the years ACM version 1 has proven to be remarkably robust and stable. Adding new subclass implementations for a new annotation file format has turned out to be a relatively straightforward task that takes a couple of days at most. We have been able to create a number of viewing and annotation tools that function in a format independent way, thus offering new functionality for existing corpora. Using the services of objects from the ACM with either local method calls or with RMI remote method calls is completely transparent. Therefore all components of EUDICO software can run on one notebook or be distributed over a range of servers. Initial experiments with locking of remote objects and client notification of lock status on the basis of bi-directional RMI were successfully conducted. Nevertheless, evolving user needs and new insights made a recent major revision of ACM necessary. The main issues are:
• Bird and Liberman (2001) introduced new terminology that is nowadays widely used. It makes sense to use their terminology wherever possible. Examples: annotation document instead of transcription, annotation instead of tag.
• Metadata for language resources has become an important issue since the start of the EAGLES/ISLE metadata initiative (IMDI) at the beginning of 2000. Use cases where metadata is involved percolate into models for linguistic annotation.
• The need to be able to locate and instantiate individual ACM object resources, for example from a metadata based browser that is external to EUDICO, asks for new generic methods to point at those resources at any location (be it on a network or on an offline medium like DVD-ROM) and for new mechanisms to instantiate those objects.
• ACM used to focus primarily on visualization and search operations. Annotation format independent editing requires a substantial extension of the model's set of operations.
• The application of EUDICO and ACM in a number of new projects (Corpus Gesproken Nederlands - Spoken Dutch Corpus [5]; DoBeS - Dokumentation Bedrohter Sprachen - Documentation of Endangered Languages [6]) required support for new and more complex annotation structures like interlinear glossing formats, syntactic trees and annotations that refer to discontinuous time segments or non-sequential annotations.
• The model is underspecified with respect to annotation type. More about this in the next chapter.

4 The revised ACM
This chapter reports on progress on some of the recent changes made to ACM.

4.1 Metadata integration
Starting in 1999, the BrowsableCorpus concept and technology has been used at the MPI to organize and structure multimedia language resources with the help of metadata. Since then the MPI has been involved in the ISLE Metadata Initiative (IMDI) [7] and the development of a BrowsableCorpus implementation of this standard, as well as in the development of metadata browsing, searching and editing tools [8]. BrowsableCorpus includes a more elaborate model for hierarchies of corpora and language resources than ACM. A central concept is that of a Session: a bundle of all language resources that are associated with one linguistic event or performance (e.g. annotation documents, audio and video recordings, photographs). Sessions can be administered in Session XML files that also contain metadata. These Session XML files can be linked in hierarchies using Corpus XML files. Merging BC (BrowsableCorpus) and EUDICO models required the introduction of a Session class in ACM. The direct association between Transcriptions and MediaObjects is now administered by a Session object. The composite Corpus structure in ACM is maintained, but as an alternative to BC Corpus hierarchies. There was also a need to introduce Metadata, MetadataContainer and LanguageResource interfaces into ACM as a way to merge in behavior that is needed for BC.

4.2 Locating and instantiating EUDICO objects
In the first version of ACM, new objects were usually instantiated by their direct ancestors in the corpus tree, e.g. Transcription objects were instantiated from LeafCorpus objects. The exact type of the LeafCorpus determined the exact type of the Transcription to be instantiated. In the case of instantiation of a Transcription from a browser over generic corpus trees (like the BC browser), we needed another way to specify the exact type of the Transcription object, and a separate mechanism for the creation of this object had to be available. We were also confronted with a number of related cases where the issue of specifying type and location, and subsequent instantiation of the proper object, played a role. For example, in the case of the Spoken Dutch Corpus, currently all digital audio data is delivered on a number of CDROMs. Pointing at and accessing this data, including prompting for the proper CDROM, can be solved by a similar mechanism. For the same corpus, a variation of stand-off annotation is used for annotation documents, where separate annotation tiers are kept in separate XML files in separate directories. Instantiation of an annotation document requires pointing at and combining these separate files. To solve this range of problems a design was completed that makes use of the standard mechanisms that Java offers to deal with URLs. Based on a generalization of URL syntax and content type, the required access mechanisms (like a login prompt, or a prompt for a media carrier) are triggered automatically and the proper type of object is instantiated. In the case of ordinary URLs and content types, everything automatically falls back on Java's built-in URL handling.
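As an illustrative sketch of such URL-driven instantiation (the scheme names, classes and registry below are hypothetical, not EUDICO's actual design, which builds on Java's URL machinery): handlers are keyed on URL scheme, with a fallback for ordinary URLs.

```python
from urllib.parse import urlparse

# Registry of scheme handlers: the scheme (and, in the real system, the
# content type) selects the access mechanism and the class to instantiate.
HANDLERS = {}

def handler(scheme):
    def register(fn):
        HANDLERS[scheme] = fn
        return fn
    return register

@handler("chat")
def open_chat(url):
    return f"ChatTranscription({url.path})"

@handler("cdrom")
def open_cdrom(url):
    # In reality this would first prompt for the proper media carrier.
    return f"CdromResource({url.path})"

def instantiate(resource_url):
    url = urlparse(resource_url)
    fn = HANDLERS.get(url.scheme)
    if fn is None:
        # Unknown scheme: fall back to default handling, like plain URLs.
        return f"DefaultResource({resource_url})"
    return fn(url)

print(instantiate("chat://corpus/session1.cha"))
print(instantiate("http://example.org/doc.xml"))
```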
4.3 Annotation structures
As mentioned, new projects required more complex relations between annotations than the ACM could deal with in its original form. For example, for the Spoken Dutch Corpus both utterances and individual words can be (but do not have to be) time aligned, and each word can have a number of associated codes on different tiers. The Spoken Dutch Corpus also required support for syntactic trees. For the DoBeS project a wide range of legacy material has to be incorporated in the archive, and the EUDICO based archive software has to be able to cope with that. Much of this data is Shoebox or Shoebox-style MS Word data. Therefore interlinear glossing formats have to be supported at the level of ACM. Within the DoBeS community, the maximal format requirements are well described by Lieb and Drude (Lieb and Drude, 2000) in their Advanced Glossing paper. To support all of these structures we made the following adaptations to the model: we now support two basic types of Annotations, AlignableAnnotations and ReferenceAnnotations. AlignableAnnotations can in principle be aligned with the time axis of some recording because they have two associated TimeSlots. Each TimeSlot can be coupled with an explicit time but does not have to be. TimeSlots can be shared by more than one AlignableAnnotation. All TimeSlots in an AnnotationDocument are explicitly ordered at all times, whether they are explicitly time-aligned or not. ReferenceAnnotations refer to one or more Annotations, on other tiers or on their own tier. For some tiers it may be necessary to explicitly store the order of the ReferenceAnnotations; for others this is irrelevant or even impossible (e.g. morphemes that refer to words, versus coreference chains or syntax trees). We chose not to allow Annotations that can both be aligned with time and contain references to other Annotations, for two reasons. First, we cannot think of any real-world example of an annotation that by its nature is embedded in its annotation document both by time reference and by symbolic reference. Second, because each ReferenceAnnotation can always be traced back via its references to some time interval, conflicts might arise between the annotation's own time and the time that is implied by the references.
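A rough sketch of these two annotation types (the class names follow the paper; the implementations are our own simplification, and deriving an interval by tracing references is shown only for the conflict-free case):

```python
# AlignableAnnotations hold two TimeSlots (whose times may be unset);
# ReferenceAnnotations point at other annotations and never carry their
# own times, so an enclosing interval is derived by tracing references.
class TimeSlot:
    def __init__(self, time=None):   # time may be left unspecified
        self.time = time

class AlignableAnnotation:
    def __init__(self, value, begin, end):
        self.value, self.begin, self.end = value, begin, end
    def interval(self):
        return (self.begin.time, self.end.time)

class ReferenceAnnotation:
    def __init__(self, value, refs):
        self.value, self.refs = value, refs
    def interval(self):
        # Derived: the span covering all referenced annotations.
        intervals = [r.interval() for r in self.refs]
        return (min(b for b, _ in intervals), max(e for _, e in intervals))

t0, t1, t2 = TimeSlot(0), TimeSlot(400), TimeSlot(900)
w1 = AlignableAnnotation("the", t0, t1)
w2 = AlignableAnnotation("dog", t1, t2)    # TimeSlot t1 is shared by w1 and w2
np_node = ReferenceAnnotation("NP", [w1, w2])  # a syntactic node, never aligned itself
print(np_node.interval())  # (0, 900)
```

Because `np_node` carries no times of its own, the conflict described above (an annotation's own time disagreeing with the time implied by its references) cannot arise.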
Our first models had annotations with one single attribute value. The first ACM version contained annotations that could have a list of attribute values. Because for the revised ACM more complex annotation structures than plain attribute lists had to be modeled, we had the choice to put all this structure either in the annotation's values or in explicit relations between different atomic annotations. Since our motivation for structured annotation values mainly came from user requirements concerning easy data entry, we chose the latter option and decided to deal with entry and visualization of complex annotation structures in other ways. The Tag class from ACM version 1, with its constituent annotation content, can be maintained, but is now considered not as an annotation anymore but as a stereotypic pattern of annotations that can be treated by some specialized user interface element as one entity for display or editing.

4.4 Annotation types and tier types
In almost every annotation system or format the concept of a tier exists as a kind of natural extension of the concept of a database field applied to time-based data. It is an old idea to "put different things in different places"; a tier is the place to put similar things. When we consider this tier concept more precisely we might attempt a definition: a tier is a group of annotations that all describe the same type of phenomenon, that all share the same metadata attribute values and that are all subject to the same constraints on annotation structures, on annotation content and on time alignment characteristics. Metadata attributes can be, for example, a participant, coder, coding quality, or a reference to a parent tier. Examples of constraints on annotation structures are:
• Annotations on the tier refer to exactly one associated parent annotation on a parent tier (1-n).
• Annotations on the tier refer one-to-one to parent annotations on a parent tier.
• Annotations may refer to n parent annotations, which all have to be on the same tier (e.g. co-reference).
• Annotations on the tier must be ordered in time.
• Annotations on the tier refer to parent annotations on a parent tier or to annotations on the same tier (e.g. syntax trees could be encoded this way).

Examples of constraints on annotation content:
• Content is restricted by a specific closed vocabulary, or by an open vocabulary.
• Content can only consist of Unicode IPA characters.

Examples of constraints on time alignment:
• Annotations on this tier may not overlap in time.
• Time segments of annotations on the tier are always within the time segment of a parent annotation.
• Annotations on the tier are strictly consecutive; gaps are not allowed.
• Time segments are strictly consecutive within the time segment of a parent annotation.

Explicitly including these types of constraints in the ACM makes tool support possible for a wide range of use cases and user interface optimizations. For example, known begin or end times of annotations can be reused for new annotations or as constraints on the time segments of other annotations. Text entry boxes can be set up automatically with the proper input method for IPA, and annotation values can be specified using popup menus. Tier metadata, with attribute values specified or not, combined with the tier constraints could be reused as a template for the creation and configuration of new tiers, either in the same document or in another. One step further, a set of tier templates could be part of a document template, making it possible to reuse complete configurations of tiers for other documents.
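As a sketch of how a tool could check two of the time alignment constraints listed above (the functions are our own illustrations, not part of ACM):

```python
# Time alignment constraint checks over (begin, end) pairs in milliseconds.

def no_overlap(annotations):
    # "Annotations on this tier may not overlap in time."
    ordered = sorted(annotations)
    return all(prev_end <= begin
               for (_, prev_end), (begin, _) in zip(ordered, ordered[1:]))

def within_parent(annotations, parent):
    # "Time segments ... are always within the time segment of a parent annotation."
    pb, pe = parent
    return all(pb <= b and e <= pe for b, e in annotations)

tier = [(0, 500), (500, 900)]
assert no_overlap(tier)
assert within_parent(tier, (0, 1000))
print(no_overlap(tier + [(400, 600)]))  # False: the new segment overlaps
```

A tool would run such checks before accepting a new annotation on a tier whose type declares the corresponding constraint.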
5 ACM and the ATLAS model
Recently we attempted to implement our extended ACM in terms of the ATLAS API [9], thereby integrating a well supported storage format (AIF). Although the models seem to be conceptually quite closely related, we identified some issues where there appear to be differences.
5.1 The encoding of hierarchies
Using the strict Annotation Graph formalism we found no unique way to represent symbolic hierarchies such as syntax trees: some differing hierarchies turn out to have the same annotation graph. Therefore there is an explicit need for references from annotation content to other annotations, as introduced by AIF's AnnotationRef element. But, unlike the AIF proposal, we would like annotations to be associated either with a time interval (or RegionRef) or with AnnotationRefs, not with both at the same time. Since it should be possible to find an enclosing time segment for annotations that cannot be time-aligned themselves by tracing back references to other annotations that are time-aligned, conflicts might otherwise arise. For example, a syntax tree can have time-aligned orthographic word annotations as leaves. The time segment for some compound syntactic unit can then be determined by combining the time segments of its word constituents. So, no obligatory RegionRef.

5.2 Tiers
It is typical for annotation formats to group annotations in tiers according to shared characteristics (as discussed before). For building tools it is very convenient, if not necessary, to have such a tier concept. AIF only approximately supports the sharing of characteristics of annotations by means of its Analysis concept. An Analysis supports grouping of Annotations, but not necessarily Annotations that all refer to the same Signal.

5.3 Regions
ATLAS/AIF generalized the concept of nodes in the annotation graph model to multi-dimensional cases (Regions) at the cost of losing the ordering that was present in the strict AG model. Therefore, complete time-alignment of AIF's Anchors seems to be necessary to know the order of Annotations in the one-dimensional case. For our applications we need unaligned Anchors.
5.4 Structured annotation content
AIF's Content is associated with FeatureData, which can basically be any 'data structure'. However, these structures seem to be outside the scope of the ATLAS API's functionality and are therefore application specific. When possible, we prefer to encode this structure on the basis of ReferenceAnnotations between more atomic annotations.

5.5 AIF and EUDICO Annotation Format

Our original intention was (and still is) to use AIF as the persistent format for ACM by exploiting the ATLAS API. Since ATLAS is still under development, and because of the issues mentioned here, we designed an XML based annotation file format that suits our current needs (EAF). We will extend this format along with the functionality of our software until it is able to deal with persistence of the full ACM, or until AIF becomes a realistic alternative.

Conclusion

There is a close relation between a model on one side and a set of problems and their solution in the form of a software system on the other side. A different problem requires a different model. Over the years, user requirements changed, partly because new technology made new things possible and partly because of users' experience with our own tools. The models changed with the user requirements and will continue to do so.

Since 1994 we have been attempting to construct "generic" models for linguistic annotation. This has paid off in the sense that we have been able to construct good tools on the basis of these models. However, we have now seen a couple of generations of models, each one more powerful and more "generic". Therefore it seems reasonable to expect that our most recent Abstract Corpus Model will some day be revised as well. We would like to make the observation that "generic model" is in fact a contradictio in terminis. Booch, Rumbaugh and Jacobson (1998) in their UML User Guide make some clear statements about models that illuminate this:
• A model is a simplification of reality.
• We build models so that we can better understand the system we are developing.
• The choice of what models to create has a profound influence on how a problem is attacked and how a solution is shaped.
Links
[1] http://www.mpi.nl
[2] http://www.mpi.nl/world/tg/lapp/eudico/eudico.html
[3] http://www.sil.org/computing/catalog/shoebox.html
[4] http://www.dcs.shef.ac.uk/nlp/gate
[5] http://lands.let.kun.nl/cgn/home.htm
[6] http://www.mpi.nl/DOBES
[7] http://www.mpi.nl/ISLE
[8] http://www.mpi.nl/ISLE/tools/tools_frame.html
[9] http://www.nist.gov/speech/atlas

References
Bird, S. and Liberman, M. (2001) A formal framework for linguistic annotation. Speech Communication 33(1-2), pp. 23-60.
Gamma, E. et al. (1994) Design Patterns: Elements of Reusable Object-Oriented Software. Addison-Wesley.
Lieb, H. and Drude, S. (2000) Advanced Glossing: A language documentation format. Unpublished working paper.
MacWhinney, B. (1999) The CHILDES Project: Tools for Analyzing Talk. Second ed. Hillsdale, NJ: Lawrence Erlbaum.