Time- and Text-Aligned Annotations: the SpLaSH Data Model
Francesco Cutugno 1, Sara Romano 2
1 LUSI-Lab, Dipartimento di Scienze Fisiche, University "Federico II", Naples, Italy
2 Dipartimento di Informatica e Sistemistica, University "Federico II", Naples, Italy
[email protected],
[email protected]
Abstract
In this work we present the SpLaSH data model. SpLaSH (Spoken Language Search Hawk) is a freely available toolkit for performing complex queries on spoken language corpora. The proposed system implements a data model for the integration of any kind of phonetic annotation with textual mark-up (e.g. POS tagging, syntax, pragmatics), and provides functions to perform queries across speech and text labels. The integration of time-aligned annotations (TMA), represented by means of Annotation Graphs, with text-aligned ones (TXA), stored in generic XML files, is provided by a data structure, the Connector Frame, which acts as a look-up table linking temporal data to units parsed from the text (usually words). SpLaSH imposes a very limited number of constraints on the data model design, allowing the integration of annotations developed separately within the same dataset and without any relative dependency. It also provides a GUI allowing queries on TXA or TMA structures, and cross queries on an integrated framework where both TXA and TMA annotations are included. SpLaSH is an open source project freely available at http://s2snaples.fisica.unina.it/splash
Index Terms: search tools, speech corpora, time-aligned annotations, text-aligned annotations.
1. Introduction
For the purposes of this work we consider two classes of annotations normally available in linguistic corpora: time-aligned (TMA) and text-aligned (TXA) annotations.
TMA data are annotations in which time is the independent variable used to describe the succession of grammatical units, in one-to-one relation to one or more annotation layers, as typically happens in collections of spoken texts. Annotations containing data derived from observations of acoustic phenomena (such as phone sequences, syllables, words, intonation, timing, disfluencies, etc.) and having associated temporal labels (i.e. measured on an audio/video signal) can be classified as TMA annotations.
TXA data are annotations derived from written texts or from transcriptions of spoken utterances. The independent variable used to observe and describe the data, in this case, is the progression of the unit sequence exactly as it appears to the operator in the text. Annotations of this kind usually aim to classify structural information about how words are formed or how they give rise to a text, regardless of the time that passes while reading or uttering the text itself. Syntactic tree-banks,
POS tagging, semantic and pragmatic annotations are examples of TXA annotations. Both in TMA and in TXA, taken separately, hierarchical relations between annotation levels may exist; furthermore, a corpus may be characterized by multiple hierarchies that share some levels.
Corpora presenting both TXA and TMA have been available for many years. This increased complexity would have required the development of specific databases and related search tools; however, owing to a lack of agreement on the ideal storage format for linguistic annotations, tools developed within a given project have rarely been reused, and the integration of data coming from different sources has systematically required additional effort.
At the same time, general purpose systems for the management of different annotation standards, with multiple hierarchies, have been developed; they accept annotation files as input and return special-format databases that a user can search by means of specific tools [1].
The EMU Speech Database System [2] is a representative example of this class of applications: it is one of the first general purpose systems created for the management of heavily cross-annotated data. In EMU the data are organized in tokens that represent some convenient unit of analysis, such as words, phonemes or sentences. There are two types of tokens: events and segments. Querying the system means retrieving events, segments or sequences of segments by navigating a hierarchical tree that rigidly structures the relations among the levels of the available annotation datasets. In EMU, temporal labels are associated with only one annotation level; all remaining levels inherit time information through the hierarchy on the basis of compositional criteria. This solution does not facilitate the reuse of limited portions of the corpus, and it introduces a principle of temporal dependency between annotation levels that is not always acceptable.
The NITE XML toolkit [3] is a general purpose system that considers two kinds of relations between annotation levels: the former organizes data hierarchically, the latter defines ontologies for the description of more complex data. Unlike EMU, NITE allows temporal references on more than one level and permits multiple hierarchies, but it imposes the use of an ontology, stored in a metadata file, that describes and fixes their structure. Consequently, each change in the data structure requires a change to the metadata file.
SpLaSH (Spoken Language Search Hawk) is a new general purpose system for multilevel linguistic corpora management [4]. In SpLaSH, data coming from linguistic annotations belonging to both the TMA and TXA categories are integrated into a unified dataset. It is a software package including various
functionalities, from data import and conversion to querying. Queries can be performed both on TXA/TMA data separately and on the integrated dataset. Unlike the EMU system, SpLaSH imposes no fixed hierarchies among the annotation levels; our system considers only those implicitly defined in the data model, as it is based on the idea that each level could be obtained independently of the others. Unlike the NITE toolkit, SpLaSH uses no metadata files to describe data structures; hence no human intervention is needed to define the internal organization of linguistic resources.
This work focuses mainly on the description of the SpLaSH data model and the general method of querying; more complex aspects of SpLaSH querying capabilities are described in [5]. The paper is organized as follows: Section 2 introduces the data model and describes the method used to integrate the two annotation classes, TXA and TMA; Section 3 shows how the system performs its queries.
2. SpLaSH Data-Model
2.1. TMA and TXA original formats
SpLaSH manages both TMA and TXA annotation classes, both ultimately stored as XML files.
2.1.1. TMA data
SpLaSH encodes TMA annotations through Annotation Graphs (AG) [6]. AGs are a descriptive model able to represent the main annotation formats, such as TIMIT [7], Praat TextGrid (see http://www.fon.hum.uva.nl/praat/) and Partitur [8], and can be considered a unifying standard applicable to any speech corpus. As shown in Figure 1, Annotation Graphs are data structures whose nodes are anchored to the signal (and thus contain timing information), while the left-to-right oriented arcs are labeled with the annotation values. Several relations are defined within arc- and node-like data structures: temporal precedence relations are stored in the node fields, while inclusion, coincidence and overlap relations are expressed topologically by the relative positions of the arcs in the graph structure. The Annotation Graphs representing a speech corpus are organized in a set called AGSet. SpLaSH implements TMA annotations coded as AG, resorting to the XML native database format and respecting the related formal definitions as originally proposed by Bird and Liberman.
Figure 1: Annotation Graph data structure
2.1.2. TXA data
As is well known, TXA data can be recursive, and the descriptive elements forming the annotation system are often structured in a well-defined hierarchy. Furthermore, TXA annotations usually derive directly from a mark-up process performed on the analyzed text. For all these reasons XML is naturally considered the ideal instrument for this type of annotation, as this form of textual mark-up allows the annotation elements to be organized in a tree structure in which, if necessary, the sequential nature of the text can be preserved by reading the leaf labels from left to right (see Figure 2 and Figure 3).
It is important to note that in some cases (for example, in some syntactic parsing procedures) the original unit order is lost; it is therefore good practice, during tokenization, to index the units in a way that allows the correct order to be reconstructed a posteriori when necessary.
Figure 2: A syntax tree-bank as an example of TXA
Figure 3: XML as storage format for the TXA of Figure 2
2.2. Integration model
The integration of TMA and TXA data is the main aim of SpLaSH. As far as TXA data are concerned, SpLaSH accepts any XML file with the only constraint that, given the usual tree structure of this kind of data, the textual annotation values (usually words) in the leaves must coincide with at least one level of the TMA annotations. TMA labels in AG are stored as arc labels. As shown in Figure 4, in order to integrate the TMA and TXA annotation classes into a single structure, we perform the following steps: (i) the original TMA data in their original format (e.g. TIMIT or Praat TextGrid) are converted to AG; (ii) the original TXA XML file is modified by adding a key id to each "minimal content textual sub-part", i.e. to the node that is the parent of any textual leaf in the text ("extended TXA" module in Fig. 4); (iii) a new object named Connector Frame (CF) is introduced: it consists of a further XML file acting as an interface between the TXA and TMA classes.
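The integration steps can be illustrated with a minimal Python sketch. It is not the SpLaSH implementation: the element names (`W`, `CF`, `link`), the id scheme (`t1`, `a1`, ...), the toy word arcs and both helper functions are assumptions introduced here only for illustration. The sketch adds a key id to every node holding a textual leaf and then builds a CF-like XML look-up table pairing those ids with word-level arcs.

```python
import xml.etree.ElementTree as ET

# Hypothetical TXA fragment: a tiny syntax tree whose leaves are words.
TXA_XML = "<S><NP><W>the</W><W>dog</W></NP><VP><W>barks</W></VP></S>"

# Hypothetical word-level TMA arcs: (arc id, start in s, end in s, label).
WORD_ARCS = [
    ("a1", 0.00, 0.12, "the"),
    ("a2", 0.12, 0.45, "dog"),
    ("a3", 0.45, 0.90, "barks"),
]


def extend_txa(root):
    """Add a key id to every element that directly holds a textual leaf."""
    leaves = [el for el in root.iter() if el.text and el.text.strip()]
    for n, el in enumerate(leaves, start=1):
        el.set("id", f"t{n}")
    return leaves


def build_connector_frame(leaves, arcs):
    """Pair text-holding nodes with arcs in order, checking that labels agree."""
    if len(leaves) != len(arcs):
        raise ValueError("TXA leaves and TMA arcs differ in number")
    cf = ET.Element("CF")
    for el, (arc_id, _start, _end, label) in zip(leaves, arcs):
        if el.text.strip() != label:
            raise ValueError(f"label mismatch: {el.text!r} vs {label!r}")
        ET.SubElement(cf, "link", {"txa": el.get("id"), "tma": arc_id})
    return cf


root = ET.fromstring(TXA_XML)
connector = build_connector_frame(extend_txa(root), WORD_ARCS)
print(ET.tostring(root, encoding="unicode"))       # extended TXA with ids
print(ET.tostring(connector, encoding="unicode"))  # CF look-up table
```

The label check mirrors the constraint stated above: the leaf values of the TXA must coincide with one level of the TMA annotations, otherwise no link can be created.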
Figure 4: TMA and TXA integration procedure (diagram blocks: PRAAT files, TMA, AG-XML, TXA XML format, XML Extended TXA, Connector Frame, Integrated Structures)
Once again, the CF has a simplified tree structure containing references to nodes belonging to both the TMA and the TXA. A sketch of how it works is shown in Figure 5, which illustrates the creation of links from the TXA leaf level to the CF and from the CF to the TMA arcs. Arc labels belong to the annotation class chosen as the reference for the temporal alignment between TMA and TXA data. The integration process allows TXA nodes to inherit temporal relationships from the TMA levels and allows a user to perform analyses and queries on such linguistic data.
Figure 6: Another way to interpret the data model
Figure 6 uses a different graphic metaphor to explain the data model. TMA data are seen as a multilevel, hierarchy-free dataset in AG format lying on a horizontal plane; TXA data are plugged perpendicularly into the AG, with plug pins representing the CF. Although it increases model and querying complexity, it cannot be excluded a priori that more than one TXA vertical plane can be plugged into the AG, either using the same point of entry (and the same CF) or resorting to a different link class that generates a different CF.
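The way TXA nodes inherit temporal information through the CF can be sketched as follows. This is a simplified illustration under stated assumptions, not the SpLaSH code: the CF is reduced to a Python dictionary, arcs are flattened to (start, end) pairs, and all ids and time values are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical extended TXA: text-holding nodes already carry a key id.
EXTENDED_TXA = (
    '<S><NP><W id="t1">the</W><W id="t2">dog</W></NP>'
    '<VP><W id="t3">barks</W></VP></S>'
)
# Hypothetical Connector Frame reduced to a dictionary: TXA id -> TMA arc id.
CF = {"t1": "a1", "t2": "a2", "t3": "a3"}
# Hypothetical word arcs: arc id -> (start, end) in seconds.
ARCS = {"a1": (0.00, 0.12), "a2": (0.12, 0.45), "a3": (0.45, 0.90)}


def time_span(node):
    """Return the (start, end) span a TXA node inherits from its linked arcs."""
    spans = [ARCS[CF[el.get("id")]] for el in node.iter() if el.get("id") in CF]
    if not spans:
        return None  # the node has no leaf linked through the CF
    return (min(s for s, _ in spans), max(e for _, e in spans))


root = ET.fromstring(EXTENDED_TXA)
print(time_span(root.find("NP")))  # (0.0, 0.45)
print(time_span(root))             # (0.0, 0.9)
```

Any node of the tree, not only the leaves, obtains a time span in this way, which is what makes temporal queries on syntactic constituents possible.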
3. Querying
Simple queries can be performed on a TXA structure or on a TMA structure.
3.1. Simple queries on TXA
Queries on TXA are strongly based on XPath operators (http://www.w3.org/TR/xpath20/). Like many other path query languages, XPath filters elements in the dataset by defining a template-match procedure that expresses key-match criteria on the names of nodes and attributes and on their values. Searched paths may be recursive and can be obtained without prior knowledge of their depth: a template-match filter can return a sub-tree portion of any length. SpLaSH provides a GUI that allows the user to query the data graphically: a front end helps to build the template path without any a priori knowledge of the XPath syntax. The user selects any tree element and poses constraints on the tree topology and on the values of text labels and attributes, and the system automatically generates an XPath expression that is executed on the corpus. The Result section displays the result of the XPath expression executed on the TXA corpus. If the XPath expression is written manually by an expert user, the Nodes result section shows the intermediate results of the element selection, updated with every added element.
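As an illustration of template matching on a TXA tree, the sketch below runs a few path expressions with Python's standard `xml.etree.ElementTree` module, which supports only a limited XPath subset rather than the full XPath 2.0 referenced above. The tree, the `pos` attribute and the `words` helper are assumptions made for this example and do not come from the SpLaSH GUI.

```python
import xml.etree.ElementTree as ET

# Hypothetical TXA tree with a recursive NP and POS tags on the words.
TXA_XML = """
<S>
  <NP>
    <W pos="DET">the</W>
    <NP><W pos="ADJ">old</W><W pos="NOUN">dog</W></NP>
  </NP>
  <VP><W pos="VERB">barks</W></VP>
</S>
"""
root = ET.fromstring(TXA_XML)


def words(xpath):
    """Run a path expression and return the text of the matched elements."""
    return [el.text for el in root.findall(xpath)]


# './/' matches at any depth, so nested NPs are found without knowing the depth.
print(len(root.findall(".//NP")))      # 2
# Constraint on topology (W directly under an NP) and on an attribute value.
print(words(".//NP/W[@pos='NOUN']"))   # ['dog']
print(words(".//W[@pos='VERB']"))      # ['barks']
```

The first expression shows the recursive behaviour described in the text: both the outer and the nested NP are returned by the same template.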
3.2. Simple queries on TMA
Figure 5: Connector Frame: structure and functions
Bird et al. [9] have proposed a formal query language for AG named AG-QL, once again based on path template matching. In this case the template is based on the definition of an arbitrary pair of nodes on the timeline, in addition to filters on the label values of the arcs. Operators based on the temporal properties of arcs and nodes (such as precedence, overlap, same start node, etc.) can be used. Unfortunately, no practical implementation of this standard is presently available. Alternatively, the use of an AG-to-SQL conversion presents many limitations; in particular, it does not permit the retrieval of sequences of indefinite length (Kleene closure, *). In SpLaSH a reduced version of AG-QL is implemented graphically within a specific GUI. Our system allows the formulation of queries as described in the first step, intersecting up to three different levels of annotation on the basis of their relative temporal position on the timeline.
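The idea of intersecting annotation levels by their relative temporal position can be sketched in Python as below. This is neither AG-QL nor the SpLaSH GUI logic: arcs are flattened to records with start and end times instead of node references, and the `Arc` class, the three relation functions, the `query` helper and the toy data are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Arc:
    level: str    # annotation level, e.g. "word" or "phone"
    label: str
    start: float  # time of the start node, in seconds
    end: float    # time of the end node, in seconds


def precedes(a, b):
    return a.end <= b.start


def includes(a, b):
    return a.start <= b.start and b.end <= a.end


def overlaps(a, b):
    return a.start < b.end and b.start < a.end


# Hypothetical two-level annotation of the same signal.
ARCS = [
    Arc("word", "dog", 0.12, 0.45),
    Arc("word", "barks", 0.45, 0.90),
    Arc("phone", "d", 0.12, 0.20),
    Arc("phone", "o", 0.20, 0.35),
    Arc("phone", "g", 0.35, 0.45),
    Arc("phone", "b", 0.45, 0.52),
]


def query(arcs, level_a, label_a, level_b, relation):
    """Return arcs of level_b standing in `relation` to arcs (level_a, label_a)."""
    anchors = [a for a in arcs if a.level == level_a and a.label == label_a]
    return [b for a in anchors for b in arcs
            if b.level == level_b and relation(a, b)]


print([b.label for b in query(ARCS, "word", "dog", "phone", includes)])
# ['d', 'o', 'g']
print([b.label for b in query(ARCS, "word", "dog", "phone", precedes)])
# ['b']
```

A third level could be intersected by feeding the arcs returned by one call as anchors of a further relation test.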
3.3. Cross queries
The extraction of information from multilevel corpora is done by means of queries performed on both TMA and TXA structures. Queries can be performed in a Top-Down way (i.e. first on the TXA structure and afterwards on the TMA structure) or in a Bottom-Up way (i.e. TMA, then TXA). It is easy to demonstrate that the resulting complex graph is strongly connected: given any two nodes belonging to the whole structure, it is always possible to find a path that connects them. Considering the multi-level corpus in Figure 6, let $\upsilon \in$ TXA node-set and $\tau \in$ TMA node-set, and let $\mathrm{path}(a, b)$ be the Boolean operator that returns TRUE if a path exists from node $a$ to node $b$; then $\mathrm{path}(\upsilon, \tau)$ is TRUE and $\mathrm{path}(\tau, \upsilon)$ is TRUE. The path between the $\upsilon$ and $\tau$ nodes goes through the CF structure. The selection of a node in the TXA structure implies traversing a number of arcs at most equal to the TXA tree depth, while node selection in the TMA structure is a minimum path problem that is quadratic in the number of nodes of the graph. On the basis of these considerations we can conclude that generic queries performed in the Bottom-Up and Top-Down directions are equivalent, but Top-Down has the lower complexity: with Top-Down execution, the query is reduced to a minimum path search on the sub-graph underlying the sub-tree rooted in node $\upsilon$.
The cross-query interface is composed of two sections that join the two operations shown in the previous sections. As just stated, the query starts from the TXA dataset and is performed by means of an XPath expression. It returns a node list containing the set of instances in the TXA satisfying the given request. In a second step, a loop projects every node in this list onto the TMA, using the CF to reach the TMA dataset. In other words, cross queries are executed in the Top-Down direction and consist of a TXA node selection followed by a TMA arc selection.
The TMA arc selection is performed on TMA data portions that underlie the nodes resulting from the TXA node selection.
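A Top-Down cross query of this kind can be sketched in a few lines of Python. The sketch is only an approximation under stated assumptions: the CF is a dictionary, the TMA arc selection is reduced to a temporal inclusion test rather than a minimum path search on the graph, and the tree, ids, phone labels and time values are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical extended TXA, Connector Frame and two TMA levels.
EXTENDED_TXA = (
    '<S><NP><W id="t1">the</W><W id="t2">dog</W></NP>'
    '<VP><W id="t3">barks</W></VP></S>'
)
CF = {"t1": "a1", "t2": "a2", "t3": "a3"}  # TXA id -> word arc id
WORD_ARCS = {"a1": (0.00, 0.12), "a2": (0.12, 0.45), "a3": (0.45, 0.90)}
PHONE_ARCS = [
    ("dh", 0.00, 0.05), ("ax", 0.05, 0.12),
    ("d", 0.12, 0.20), ("o", 0.20, 0.35), ("g", 0.35, 0.45),
    ("b", 0.45, 0.52), ("aa", 0.52, 0.70), ("k", 0.70, 0.80), ("s", 0.80, 0.90),
]


def cross_query(root, xpath, target_arcs):
    """TXA node selection followed by TMA arc selection under each node."""
    results = []
    for node in root.findall(xpath):                  # step 1: TXA selection
        spans = [WORD_ARCS[CF[el.get("id")]]          # step 2: project via CF
                 for el in node.iter() if el.get("id") in CF]
        if not spans:
            continue
        start = min(s for s, _ in spans)
        end = max(e for _, e in spans)
        labels = [lab for lab, s, e in target_arcs    # step 3: TMA selection
                  if start <= s and e <= end]
        results.append((node.tag, labels))
    return results


root = ET.fromstring(EXTENDED_TXA)
print(cross_query(root, ".//VP", PHONE_ARCS))  # [('VP', ['b', 'aa', 'k', 's'])]
print(cross_query(root, ".//NP", PHONE_ARCS))  # [('NP', ['dh', 'ax', 'd', 'o', 'g'])]
```

Restricting the arc selection to the time span under each selected node is what keeps the second step local to the portion of TMA data underlying the TXA result.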
4. Conclusions
SpLaSH is an open source project available at http://s2snaples.fisica.unina.it/splash under the GNU Public license, and is constantly upgraded with new features. To support end users' work, SpLaSH includes a GUI for query generation on data belonging to the SpLaSH data model. Phonetic data in both TIMIT and Praat TextGrid formats are parsed and converted to the standard AG XML format. SpLaSH accepts any kind of textual annotation in XML, with the only constraint that the textual leaves of the document tree can be correlated to one of the arc families in the corresponding AG. A new (extended) version of the XML TXA is generated on line; subsequently, the user chooses the annotation class that, being common to both TXA and TMA, will be used to build the CF. At this point the data are ready to be queried. Each class of query has its own graphical interface containing several components to facilitate query generation.
SpLaSH thus presents interesting innovations in the area of general purpose linguistic systems development. Our system imposes a very limited number of constraints on the data model design, allowing the integration of annotations developed separately within the same dataset and without any relative dependency. The choice of XML, a metalanguage that emphasizes simplicity, generality, and usability over the web, as the storage format for linguistic annotations improves data reusability.
As shown in [5], SpLaSH is going to be enriched with a query language named SpLaSH-QL, whose formal definition is being completed. SpLaSH-QL consists of a set of specific algebraic operators aimed at retrieving information from the integrated TXA and TMA dataset; operators can be composed with each other and can explicitly contain XPath queries as arguments. In the next SpLaSH releases, the guided query interface will be based on this language, and its use will increase the potential query expression power of the system.
5. Acknowledgements
Author FC is supported by the "Marie Curie" Research Thematic Network "Sound2Sense", EC - FP-VI.
6. References
[1] Bird, S. and Harrington, J., "Speech annotation and corpus tools", Speech Communication, 33(1-2):1-4, 2001.
[2] Cassidy, S. and Harrington, J., "Multi-level annotation in the Emu speech database management system", Speech Communication, 33(1-2):61-77, 2001.
[3] Carletta, J., Evert, S., Heid, U. and Kilgour, J., "The NITE XML Toolkit: data model and query language", Language Resources and Evaluation Journal, 39(4):313-334, 2005.
[4] Romano, S., Cecere, E. and Cutugno, F., "SpLaSH (Spoken Language Search Hawk): integrating time-aligned with text-aligned annotations", Proceedings of Interspeech 2009: 1487-1490, 2009.
[5] Romano, S. and Cutugno, F., "New features in Spoken Language Search Hawk (SpLaSH): Query Language and Query Sequence", Proceedings of LREC 2010: 3738-3741, 2010.
[6] Bird, S. and Liberman, M., "A formal framework for linguistic annotation", Speech Communication, 33(1-2):23-60, 2001.
[7] Garofolo, J.S., Lamel, L.F., Fisher, W.M., Fiscus, J.G., Pallett, D.S. and Dahlgren, N.L., "The DARPA TIMIT Acoustic-Phonetic Continuous Speech Corpus", Technical Report NISTIR 4930, NIST, 1993.
[8] Schiel, F., Burger, S., Geumann, A. and Weilhammer, K., "The Partitur Format at BAS", Proceedings of LREC 1998: 1295-1301, 1998.
[9] Bird, S., Buneman, P. and Tan, W.C., "Towards a query language for annotation graphs", Proceedings of LREC 2000: 807-814, 2000.