From Experiments to Corpora: The Ariadne Corpus Management System

Peter Menke
Bielefeld University
[email protected]

Alexander Mehler
Goethe University Frankfurt/Main
[email protected]

Abstract
In this paper we present the current state of development of the Ariadne Corpus Management System, a tool for the organisation and analysis of complex data sets resulting from experiments on multimodal communication. After a short overview of some potential problems of interdisciplinary research, we introduce the core features of the Ariadne applications. The rest of the paper describes a selection of tasks and processes that typically occur while conducting a dialogue or communication experiment, together with the mechanisms and modules in Ariadne that support or simplify these tasks.
1 Background
Interdisciplinarity is one of the most important and promising strategies in contemporary science. It can, however, cause problems in coordination and collaboration between researchers, due to the different scientific fields, paradigms, theories, methods, tools, data formats, and representations that are regularly involved in interdisciplinary projects. Imagine that a researcher A has produced a data set that could be interesting for another researcher B. There are many cases where direct use of that data set is not possible, for instance, if:

• B does not know that this data set exists,
• B expects or needs a different representation or format than the one produced by A,
• A and B use different naming conventions for the same concepts, so that they misunderstand each other although they probably refer to the same things,
• A did not make the data set accessible to B.

An example scenario could involve a phonetician intending to perform analyses on data sets assembled by conversation analysts. In many cases it is infeasible for the phonetician to apply her methods to this kind of data, since phonetics usually requires high-quality audio recordings. Conversation analysis, however, focuses on naturally occurring dialogues, which means that recording devices have to be integrated into the setting in an inconspicuous way. As a consequence, audio recordings regularly have a comparatively low quality: single speakers cannot be separated, and there can be background noise as well as noise caused by activities of the participants. This quality is still perfectly sufficient for creating transcriptions of the kind used in conversation analysis.

Obviously, software cannot compensate for low audio quality. It is possible, however, to develop tools that assist researchers in data management: they could let users describe experimental setups and data sets in a standardized way. With such descriptions at hand, other users can get a quick overview of the type and quality of experiments and related resources without having to inspect the actual recordings themselves.
Figure 1: Screenshots of different functions of the Phoibos web application. (a) Permissions of the current user. (b) Detailed view for a selected corpus. (c) List of designs for a selected corpus. (d) Detailed view for a single resource.

In this paper, we present such a tool: Ariadne assists at various stages of experimental research. It models experimental designs and represents the resulting resources as corpora¹ that can later be queried and inspected by authorised users of the system.
2 The software
Project X1 “Multimodal Alignment Corpora” (part of the Collaborative Research Centre (CRC) 673 “Alignment in Communication”) is developing Ariadne, a software system for multimodal corpus management that tries to make interdisciplinary work easier by providing:

1. a flexible data model that can express many data formats used in the field of research on multimodal communication,
2. a system of user accounts, groups, and permissions that allows for the management of fine-grained access levels for different groups of users,
3. mechanisms to assign special categorial information to data units that will later allow for complex searches and comparisons in the data pool,
4. and a variety of modules for tasks from several stages of corpus production.
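To make the second item more concrete, the following minimal sketch shows one way in which accounts, groups, and graded access levels could interact. It is an illustration only and not Ariadne's actual implementation: the class names, the four access levels, the path notation for corpus parts, and the rule that a grant on a corpus also covers its trials and resources are all assumptions made for this example.

```python
from dataclasses import dataclass, field

# Hypothetical access levels, ordered from weakest to strongest.
LEVELS = {"none": 0, "read": 1, "write": 2, "admin": 3}


@dataclass
class User:
    name: str
    groups: set = field(default_factory=set)


@dataclass
class AccessControl:
    # Maps (kind, principal name, path) to a level; kind is "user" or "group".
    grants: dict = field(default_factory=dict)

    def grant(self, kind, name, path, level):
        """Record a grant for a user or a group on a corpus path."""
        if kind not in ("user", "group") or level not in LEVELS:
            raise ValueError("unknown principal kind or access level")
        self.grants[(kind, name, path.strip("/"))] = level

    def level_for(self, user, path):
        """Return the strongest level granted to the user, directly or via
        a group, on the path itself or on one of its ancestors (assumed
        inheritance rule)."""
        principals = [("user", user.name)]
        principals += [("group", g) for g in sorted(user.groups)]
        parts = path.strip("/").split("/")
        best = "none"
        for depth in range(1, len(parts) + 1):
            prefix = "/".join(parts[:depth])
            for kind, name in principals:
                level = self.grants.get((kind, name, prefix), "none")
                if LEVELS[level] > LEVELS[best]:
                    best = level
        return best

    def can(self, user, path, needed):
        """Check whether the user holds at least the needed level."""
        return LEVELS[self.level_for(user, path)] >= LEVELS[needed]
```

In this sketch, a group grant of "read" on "SaGA" would let every member view all parts of that corpus, while a user without any grant sees nothing, which mirrors the idea that users only see data they have been given explicit access to.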
¹ To be more precise: as multimodal dialogue corpora resulting from studies and experiments. These differ from text corpora, which are based on a sequence of characters in a text (and, as a consequence, can be represented in a one-dimensional, linear fashion).
http://foo.sfb673.org/corpora
    Index of all publicly available corpora
http://foo.sfb673.org/corpora/SaGA
    Detailed view of corpus with ID “SaGA”
http://foo.sfb673.org/corpora/SaGA/trials
    Index of all trials of the SaGA corpus
http://foo.sfb673.org/corpora/SaGA/trials/1/resources/2
    Detailed view of resource #2 of trial #1 of the SaGA corpus

Figure 2: Examples of clean, human-readable URLs and the associated resources in Phoibos, assuming that the domain foo.sfb673.org hosts a Phoibos instance.

In the past years, two prototypes of Ariadne have been developed and evaluated by researchers of the CRC 673: a first version consisting of a Java applet available online, and a second version that used the framework of the eHumanities Desktop (???), a web-based application for a broad variety of research tasks in the digital humanities. Evaluation of the first version revealed that an applet-based solution had some grave drawbacks, such as limitations in memory, or a complicated permission system that often prevented access to the local file system at certain secured university workstations. The second version (?) eliminated these problems, but due to the nature of the eHumanities Desktop as a web-based application, some problems in the management of very large files remained. For example, one research group maintains a collection of dozens of uncompressed video files, each with a file size of approximately 3.5 gigabytes. These files have to be accessed frequently on different university workstations. The limitations of HTTP-based client-server interaction, however, make a frequent exchange of these resources via an instance of the eHumanities Desktop difficult.² As a consequence, a different approach was chosen for the current version of Ariadne, one that takes into account the exact problems and limitations that users experienced with the previous versions. Nevertheless, the eHumanities Desktop is still a flexible and versatile tool that has, for example, successfully been used in the fields of iconographic research (?) and historical semantics (?), to name but a few.

The current version of Ariadne consists of two applications, each named after an epithet of Apollo, the Greek god of wisdom, knowledge and the sciences. Each application is suited for a different situation and environment.
2.1 “Phoibos” (web application and server)
Phoibos (Ancient Greek: “radiant, shining”) is a web application that can be accessed as easily as any other web site (provided you have an authorised account). It offers an interface to corpora and their associated resources, such as video and audio files, data sets, transcriptions and annotations. Data sets can be presented in different views and combinations (a selection can be seen in Figure ??): Permissions to corpora, their parts and resources can be managed by corpus administrators. Users can view only those parts of data collections they have explicit access to (see Figure ??). Main corpus information (see Figure ??) as well as organisational data (see Figure ??) can be viewed and modified. Finally, the resources of a corpus can be accessed online via smaller, web-compatible preview versions of the (often huge) original resources (see Figure ??).
² We would like to point out that this is not at all the fault of the eHumanities Desktop: any other application that is purely web-based and communicates via standard HTTP would have had the very same problems when dealing with such large resource collections.
Figure 3: Screenshots of different functions of the Mantikos desktop application. (a) Resource import wizard. (b) Editor for value lists and dictionaries. (c) Contents view for one layer of an annotation document. (d) Statistics for one layer of an annotation document.

Phoibos has been implemented as a RESTful web service (cf. ?), and it uses clean, human-readable and reliable URLs for the representation of resources and corpus components wherever possible. It also acts as a central server for Mantikos client applications (see section ??). In addition, Phoibos can perform scheduled or request-triggered long-running tasks. These can be run on the physical machine where the Phoibos server software resides, or on separate machines. This functionality is especially useful for tasks like object classification, machine learning, automatic tagging and parsing, conversion of large media files, etc.
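The URL patterns listed in Figure 2 follow a regular nesting of collections and identifiers (corpora, then trials, then resources). The sketch below illustrates how such a path could be decomposed into its components. It is not taken from the Phoibos code base: the function name, the returned dictionary keys, and the assumption that the nesting is limited to exactly these three levels are hypothetical.

```python
from urllib.parse import urlparse

# Assumed nesting order of collections, paired with the key used for
# the identifier found at that level (both are illustrative choices).
NESTING = [("corpora", "corpus"), ("trials", "trial"), ("resources", "resource")]


def parse_resource_url(url):
    """Decompose a Phoibos-style URL into corpus, trial and resource IDs.

    A level that is addressed as an index (for example ".../trials")
    is reported with the value None.
    """
    segments = [s for s in urlparse(url).path.split("/") if s]
    if not segments or len(segments) > 2 * len(NESTING):
        raise ValueError("not a recognised corpus URL: " + url)

    address = {}
    for level, (collection, key) in enumerate(NESTING):
        pos = 2 * level  # collection names sit at even positions
        if pos >= len(segments):
            break
        if segments[pos] != collection:
            raise ValueError("unexpected path segment: " + segments[pos])
        # The identifier, if present, directly follows the collection name.
        address[key] = segments[pos + 1] if pos + 1 < len(segments) else None
    return address
```

Because every component of a corpus has exactly one such address, a client such as Mantikos could in principle derive the location of a resource from its position in the corpus alone; whether the real client does so is not stated here.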
2.2 “Mantikos” (client application)
Mantikos (Ancient Greek: “prophetic, visionary”) is a rich desktop application with an interface to one or more Phoibos servers. Mantikos was specifically designed to perform tasks that are difficult on the web, for example the initial organisation of resources scattered on a local disk into a new corpus, or operations on large files that are present on local or network-attached storage media. Some of the operations available in the system can be seen in Figure ??: Local data can be added to the corpus management system with an assistant that helps researchers to find the correct location and data type in the corpus (see Figure ??). The software contains views and graphical interfaces for various parts of a corpus: for value lists and dictionaries used in correction assistants (see Figure ??), as well as for actual data sets, that is, annotation files and the layers within them (see Figures ??, ??).

Mantikos has a dynamic module system that enables developers to attach new methods and functions to the system by creating a separate library written in Java, conforming to the Mantikos Module API. With this mechanism, developers can create new use cases as well as bridges and interfaces to existing software systems. The module system can also be used by students in advanced programming courses. The creation of new modules can likewise be assigned to students as a task for their Bachelor’s or Master’s thesis, since the modular nature of the software already provides them with a base application and a rich data model that can be used and accessed: all they have to do is create the module itself, not an environment for running and testing it.

Figure 4: Typical stages and phases of an experiment, arranged along a time axis t: 1. Planning, 2. Experimenting, 3. Processing, 4. Evaluation, 5. Publication.
3 Application: from experiments to corpora
Figure ?? shows the typical stages of a corpus creation process. In each of these phases, different tasks have to be performed. Our aim is to provide auxiliary modules and functions in Ariadne for as many areas as possible. The remainder of this section illustrates examples of such tasks and procedures from each of these stages.

Planning. In the planning phase, researchers develop a design for the studies to be performed. This also involves specific setups of recordings (audio and video devices, and also special devices such as electroencephalographs or interactive game boards) and of other types of secondary data³. These setups can be modeled inside Ariadne in so-called trial designs, which are templates for the various trials that are yet to be performed in the second phase. A trial design defines the number, the types and the characteristics (camera angle, microphone position, etc.) of secondary data that will be collected during a trial.

Experimenting. During actual experimentation, Ariadne has little to offer, since the purpose of this phase is to create and assemble the resources that Ariadne can deal with at later stages.

Processing. As soon as secondary data is available, Ariadne offers mechanisms to link these resources to the proper trial and design component (see Figure ??). This is important because relying on file names and naming schemes frequently causes problems: if too many pieces of information have to be encoded in a single file name, the approach becomes error-prone. It will also be possible to perform different types of resource conversions. Large media files can be converted on the Ariadne server for use on typical workstations or for integration into web sites, for example, if techniques such as social tagging (see ?) are to be employed.
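The relation between trial designs, trials, and linked resources described above can be sketched as a small data structure. This is a minimal illustration under stated assumptions, not Ariadne's data model: the class names, attribute names, and the example file path are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class DesignComponent:
    """One expected piece of secondary data (hypothetical structure)."""
    name: str                      # e.g. "cam_front"
    kind: str                      # e.g. "video" or "audio"
    characteristics: dict = field(default_factory=dict)


@dataclass
class TrialDesign:
    """Template that lists what every trial of a study should record."""
    name: str
    components: list

    def component(self, name):
        for candidate in self.components:
            if candidate.name == name:
                return candidate
        raise KeyError(name)


@dataclass
class Trial:
    """A single trial; resources are attached to named design components."""
    number: int
    design: TrialDesign
    resources: dict = field(default_factory=dict)  # component name -> path

    def link(self, component_name, path):
        # Raises KeyError if the design has no such component, so a file
        # cannot silently end up in the wrong place.
        self.design.component(component_name)
        self.resources[component_name] = path

    def missing(self):
        """Names of design components that have no resource yet."""
        return [c.name for c in self.design.components
                if c.name not in self.resources]


# Illustrative usage with made-up component names and file path.
design = TrialDesign("dyadic_dialogue", [
    DesignComponent("cam_front", "video", {"camera angle": "frontal"}),
    DesignComponent("mic_a", "audio", {"microphone position": "headset"}),
])
trial = Trial(number=1, design=design)
trial.link("cam_front", "/recordings/0001.avi")
```

The point of such a structure is that information like camera angle or microphone position lives in the design component rather than in the file name, so the file itself can keep whatever name the recording device gave it.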
The creation of tertiary data will in most cases be done with specialised third-party software: Praat (?), for example, is regularly used for phonetic analyses and transcriptions, while the annotation tool ELAN (??) is particularly suited for the annotation of multimodal communicative events involving one or multiple video recordings.

³ We follow the model-theoretic categorisation of study-related data in ?, pp. 34–39. They differentiate data into primary data (real conversations that are inevitably lost because of their transient nature), secondary data (direct transcripts, logs, and recordings of these events), and, finally, tertiary data (annotations and transcriptions on the basis of repeated, detailed analysis of secondary data sets, mostly of recordings).
Neither Mantikos nor Phoibos is intended to replace these specialist tools. They are rather designed to complement them and to seamlessly integrate their resulting data sets into larger processes.

Evaluation. There are (at least) two different kinds of evaluative processes that are important at this point. For the evaluation of generated data, that is, of multiple annotations of the same base resource, several measures of inter-rater agreement have been proposed (for example, ? and its improvements, extensions, and adaptations). Modules are planned that implement these measures and make them available to users in as intuitive a way as possible. To support the evaluation of the actual research questions or problems, two modules for statistical analysis are provided.

1. Simple parameters and indices of data sets, as well as results of basic statistical tests, can be summarised directly in a statistics module (see Figure ??). If, for example, researchers only need to know whether turn lengths differ significantly between two distributions, they can obtain the respective values directly from the application.
2. For more complex computations, data can be exported in various data formats compatible with major statistics software, such as Matlab, GNU R, or SPSS.

Publication. Two types of objects can be published: results of scientific research, and the resources resulting from that research. When publishing scientific results in articles, authors frequently need to add video stills, excerpts from transcriptions, diagrams, tables, and other illustrative elements. In future releases, some of these elements can be prepared or generated with the aid of special modules. Different formats will be available, for example, different spreadsheet formats for integration into Microsoft Office or OpenOffice/LibreOffice documents, or tabular environments for LaTeX documents. In addition, many funding organisations demand that the sustainability and reusability of resources be guaranteed. For both purposes, a metadata description of corpora and resources that complies with a well-established metadata scheme or standard can be helpful. For Ariadne resources, a first conversion to the CMDI format (short for “Component MetaData Infrastructure”, see http://www.clarin.eu/cmdi) has been implemented. This approach utilises the markup of categories and data types for single data units. These are internally represented as data categories from the ISOcat data registry (?).
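Returning briefly to the evaluation stage: to illustrate what an inter-rater agreement measure computes, the following sketch implements Cohen's kappa for two annotators who labelled the same sequence of units. Cohen's kappa is chosen here only as a widely known example of such a measure; it is an assumption of this sketch and not a statement about which measures the planned Ariadne modules implement. The function name and the example labels are likewise hypothetical.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters labelling the same units.

    kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed share of
    identical labels and p_e the agreement expected by chance, derived
    from each rater's own label frequencies.
    """
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both raters must label the same, non-empty units")

    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    # Counter returns 0 for categories that rater B never used.
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)

    if expected == 1.0:
        # Both raters used one and the same category throughout.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Two made-up annotation layers over four units
# ("g" = gesture, "s" = speech only).
rater_1 = ["g", "g", "s", "s"]
rater_2 = ["g", "s", "s", "s"]
kappa = cohens_kappa(rater_1, rater_2)
```

For the four units above, the raters agree on three of them (observed agreement 0.75), chance agreement is 0.5, and kappa is therefore 0.5.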
4 Conclusion
We gave an overview of the current development of the Ariadne Corpus Management System, consisting of the web application Phoibos and the client desktop application Mantikos, and we showed examples of how these tools can be useful at many stages of experimentation and corpus production. Since this is work in progress, several features and modules that have been referred to in this paper have not been completed yet. The basic systems, however, are already functional and are being tested by a small group of researchers of the CRC. Anybody who is interested in the development of Ariadne is welcome to contact us in order to access a test version of the software.

Acknowledgement. This research has been supported by the German Research Foundation (DFG) in the project X1 of the Collaborative Research Centre (CRC) 673 “Alignment in Communication” at Bielefeld University. Several examples of real resources from the Bielefeld SaGA (Speech And Gesture Alignment) Corpus have been used in figures and examples. That corpus was created by the project B1 of the CRC 673, and the excerpts were used with the project’s permission.