

Sharing sign language data online
Experiences from the ECHO project

Onno Crasborn, Johanna Mesch, Dafydd Waters, Annika Nonhebel, Els van der Kooij, Bencie Woll and Brita Bergman
Radboud University Nijmegen / Stockholm University / University College London / Radboud University Nijmegen / Radboud University Nijmegen / University College London / Stockholm University

This article describes how new technological possibilities allow sign language researchers to share and publish video data and transcriptions online. Both linguistic and technological aspects of creating and publishing a sign language corpus are discussed, and standards are proposed for both metadata and transcription categories specific to sign language data. In addition, ethical aspects of publishing video data of signers online are considered, and suggestions are offered for future corpus projects and software tools.

Keywords: sign language, manual-visual modality, video corpus, linguistic typology, metadata standards, linguistic transcription conventions

.

Introduction

1.1 The goals of this article

The overall goal of this article is to describe some of the steps to be taken and choices involved in creating a sign language corpus consisting of video recordings plus linguistic annotations. In discussing the various technical and linguistic issues, we raise the broader question of standardising transcription conventions and metadata categories. The article is based on our experiences from the sign language case study entitled “Language as cultural heritage: a pilot project with sign languages”, which was carried out in the context of the ECHO project, ECHO being an acronym for European Cultural Heritage Online. The ECHO project was an EU initiative aiming to take a step towards sharing scientific data in the humanities


online, and to find out how online data for various research areas could be presented in a unified way. The goal of the case study was to involve multimodal language data in the project and to stimulate the comparative study of European sign languages. The case study resulted in an online corpus of video material with elaborate linguistic annotations for three signed languages.

The first part of this article introduces and describes the ECHO project and the overall design of the sign language case study within that project. Moreover, we discuss the general trend in scientific research towards sharing data in addition to just sharing research results. Section 2 describes the nature of metadata for linguistic research, and proposes a method for storing information specific to sign language data within a standard metadata convention developed for spoken language materials (named IMDI). Section 3 characterises the linguistic annotation conventions that were used within ECHO, and addresses the question of the extent to which they could serve as a prototype transcription standard for all kinds of sign language studies. Section 4 discusses some of the ethical problems inherent in publishing video data online, while Section 5 concludes with a presentation of the core aspects of constructing a sign language corpus.

1.2 The ECHO project

Between 2002 and 2004, the EU commission funded an eighteen-month project called European Cultural Heritage Online, abbreviated ECHO.1 Sixteen partner institutions from nine different European countries worked closely together to set up a model web site that allows for integrated access to a large variety of data sources from the humanities and social sciences. Linguistics (the present sign language project), social and cultural anthropology, the history of science, and the history of arts constituted the specific research domains for which case studies were carried out. For each of these domains, the ECHO project aimed to find out how existing methodology and tools could be enhanced to create a joint network infrastructure, by which these disciplines could profit optimally from the ‘Internet revolution’ (Levinson & Wittenburg 2001). The participating institutions signed a charter, documenting some common values, goals and restrictions for the various research groups that participated.2 This charter describes the ECHO project as “a collaborative research endeavour that provides active support for scientific and cultural institutions and projects in Europe that hold or enrich cultural heritage through new technologies and media” (ECHO Charter, p. 1; see also the ECHO Statement of Purpose). The resulting web portal interfaces an impressive number of varied data


resources, which can all be accessed free of charge. Most of the data collections that were published stem from the 16th to the 19th century; the sign language corpus is one of the few resources of contemporary culture within ECHO. In addition to this concrete outcome of data sources that can be accessed, a number of documents were compiled which describe the various technical, organisational and scientific aspects of designing such a corpus, and possibilities for future developments.

The fact that the corpora are all freely accessible fits with a broader development seen over the last decade towards ‘open access’ (Esanu & Uhlir 2004). As a part of ECHO, the so-called Berlin Declaration was set up, which by mid 2006 had been signed by over 150 scientific organisations from all over the world.3 This text states explicitly that while authorship should be respected, researchers should be encouraged to publish not only the results of their scientific work, but also data sources of various kinds, including “raw data and metadata, source materials, digital representations of pictorial and graphical materials and scholarly multimedia material” (Berlin Declaration, p. 1). In this way, not only a restricted group of researchers within a given department, but society at large can have access to all aspects of the knowledge collected in scientific studies within the various sub-disciplines of the humanities.

The sign language case study served as a pilot project for linguistics and was completed in mid 2004.4 Data were recorded, transcribed and published for three signed languages used in Deaf5 communities in Europe, viz. Sign Language of the Netherlands (abbreviated NGT (Nederlandse Gebarentaal)), British Sign Language (BSL) and Swedish Sign Language (SSL). The resulting ECHO website forms the main portal for accessing all the different research domains. While it is clear that these domains, to a large extent, still use different techniques, the site does represent an important step towards linking the various fields within the humanities.

1.3 Sharing linguistic data

As the general description of the ECHO initiative above makes clear, the central idea of the project was not simply to create large corpora of any type of data source, but actually to share them online. This is useful for several reasons. The publication of data in a specific format will promote the standardisation of research methods within a discipline, leading to results that are more readily comparable. While this of course is good for science in general, an additional advantage is that it facilitates collaboration between research groups working at some distance from each other. The advent of Internet technology has provided


some of the technical means to exchange data, but actually working together requires that the different research teams use a comparable methodology as well. Modern computer tools (such as the annotation programme ELAN discussed below) facilitate remote collaboration between researchers, at the same time increasing the transparency in methodology. For the sign language case study, collaboration between linguistic researchers from different countries could be particularly useful because it can facilitate cross-linguistic studies on signed languages. While it has been recognised that different signed languages have different lexicons and grammars since the linguistic study of signed communication took off in the 1970s, it is only recently that the broader typological comparison of these languages has received more attention (e.g., Zeshan 2004ab; van der Kooij et al. forthcoming). Linguistic research makes use of a large variety of methodologies and data types. Written records traditionally have played a central role in the analysis of languages. Later these were supplemented by experimental methods, initially for phonetic, and after some time also for psycholinguistic analyses. The more cognitive and psychological view of language promoted by Chomsky since the late 1950s has emphasised the relevance of the intuitions of native speakers. Research on the basis of text corpora was a first step in emphasising the (linguistic and sociolinguistic) study of actual language use by many different speakers in a large variety of settings (e.g. Granger & Petch-Tyson 2003); this development has seen a major new step in the last decade with the creation of spoken language corpora. An example is the Spoken Dutch Corpus (CGN), a corpus of over nine million words that was completed in 2004.6 For signed languages, there has been no commonly used writing system in any deaf community. A writing system that is receiving increasing attention is SignWriting.7 It is increasingly used in deaf education in different countries, but it is not widely used by adults in Deaf communities. At the moment, therefore, to focus on written corpora of signed data would simply be impossible and illogical. However, modern computer technology allows for the easy processing and accessing of video recordings. Consumer-level video cameras now record digitally, and removable media such as DVDs and fixed media such as hard drives can easily store many hours of compressed video recordings. These technical developments in the area of video and audio recordings, together with the increasing bandwidth of computer networks, in principle allow for easy and fast sharing of data among researchers in different countries. For actual cross-linguistic studies to take place, it is necessary that not only the same stimulus material be used, but also that the same conventions for annotating the data are used, both in terms of linguistic transcription and in © 2007. John Benjamins Publishing Company All rights reserved


terms of metadata description. The availability, even of a small corpus, of video recordings from different languages, as published for the ECHO project, can promote standardisation. The sign language case study in ECHO made use of two tools and corresponding standards developed and defined at the Max Planck Institute for Psycholinguistics. To describe the language resources that are being collected at a metadata level, we used the IMDI standard and the associated tools.8 To annotate the signed data in terms of linguistic categories, we used the ELAN software, which allows for the precise time-alignment of annotations with the corresponding video (and audio) sources on multiple tiers that can be specified and adapted by the user.9 Another tool which is currently available for annotating video data is SignStream10 (Neidle & MacLaughlin 1998) which was developed especially for sign language data, whereas ELAN started its life in the domain of general multimodal studies (an earlier version of ELAN was called MediaTagger), similar to tools like Anvil11 and Transana.12 Compared to SignStream, ELAN is especially well-suited for annotating larger corpora of video data, allowing for multiple ways of browsing and searching large files and collections of files. Moreover, it guarantees accessibility of the data by storing annotations in open-format XML files, and can be used on various computer platforms (Windows, MacOS, Linux). These new technologies for presenting sign language data and transcriptions together pose the question of the extent to which standard transcription conventions should be used. If all the raw material (in the form of the video sources) is available, are full transcriptions really necessary? In principle, one can look at the video source for all kinds of information that are traditionally included in various transcription systems, such as eye gaze, head nods, etc. On the other hand, the great strength of computer tools such as ELAN is that they allow for complex searches in large data domains and for the immediate inspection of the video fragments relating to the search results. This is extremely time consuming when using paper transcription forms or even digitised transcription forms that are not directly linked to the original data. Since it is not possible (yet) to directly search in a video file for all instances of eye blinks, for example, annotations have to be added to mark each instance of a particular category. In many areas of sign language research, what counts as a (linguistic) category is yet to be determined (see below). Within the ECHO project, we wanted to establish an annotation system that would be useful for as many researchers as possible, with a focus on the syntactic and discourse domains. We tried to be careful to let the description categories not be biased by theoretical assumptions. For example, we avoided © 2007. John Benjamins Publishing Company All rights reserved


imposing too much analysis on any tier by for example not saying that a specific phonetic form is an instance of ‘person agreement’. Also in the description categories we proposed we tried to strictly separate form from function by using formal descriptions only. For example, eyebrow positions were not given functional labels such as ‘wh-q’ (i.e. indicating a question); instead formal labels such as ‘raised’ were used. On the other hand, it must be recognised that analytical decisions are constantly being made in any transcription process (Ochs 1979). Even adding multiple tiers with sentence-level translations in various written languages (in the case of the ECHO project: Dutch, English and Swedish) implies taking (implicit or explicit) decisions about where sentence boundaries are located. While every research project will have its own research questions and therefore will require special transcription categories, it should be possible to define a standard set of transcription tiers and values that are useful to large groups of researchers, regardless of their specific interests. For example, a sentence level translation to a written language is always useful, if only for quickly exploring a video recording. Working with three teams of linguists from different countries, each with their own research interests, the ECHO project formed a good starting point for developing a standard set of transcription conventions. This ECHO set is described in Section 3. A relatively small set of transcription tiers allows for the coding of a relatively large data set, which can be further expanded by researchers according to their specific needs. ELAN will continue to see several updates in the years to come; one of the future functions will be the possibility of expanding a publicly available transcription file with an individual researcher’s additions, including extra tiers.13 In this article, we outline some of the issues faced in the creation of the multilingual ECHO corpus and present and discuss our solutions. We start in Section 2 by discussing the technological framework used for the creation of our corpus, including the use of metadata descriptions, while transcription conventions are the subject of Section 3. Some of the ethical and copyright issues involved in the online publication of video recordings of signers are discussed in Section 4. Finally, we summarise the developments achieved within the ECHO case study and the recommendations for future corpus projects that emerge from the project.
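Before moving on, a toy illustration of the point made above about searching annotations rather than raw video: once events such as eye blinks are stored as time-aligned annotations, finding every instance becomes a trivial query. The tuple layout below is a deliberately simplified stand-in for an ELAN tier, not the actual file format used by the ECHO corpus.

```python
# Toy illustration: time-aligned annotations (tier, start ms, end ms, value)
# can be searched directly, unlike the video frames they describe.
# The layout is a simplified stand-in for an ELAN tier, not the actual format.
annotations = [
    ("Eye Aperture", 1200, 1350, "blink"),
    ("Gloss RH", 1000, 1600, "BOAT"),
    ("Eye Aperture", 2100, 2250, "blink"),
    ("Head", 2000, 2400, "nod"),
]

blinks = [a for a in annotations if a[0] == "Eye Aperture" and a[3] == "blink"]
print(len(blinks), "blinks found at:", [(start, end) for _, start, end, _ in blinks])
```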



2. The use of corpora for sign language research

2.1 Introduction

In the last four decades, researchers studying sign languages have collected a large number of different data sets. These data sets have rarely been made available for public use, and often are not described in detail in publications of research results. Different elicitation methods have been used (including the recording of spontaneous interactions), different numbers of signers have been recorded for a given elicitation method, and the research questions underlying the data recording have spanned many areas of linguistics. Only incidentally have collections of data been published, so that other researchers can make use of them for their own research and the analysis presented in linguistic publications can be checked against the original data. A theme issue of the journal Sign Language & Linguistics (vol. 4, 2001) focuses on methodological issues in sign language research, but does not actually contain data sets.

One of the few data sets that have been published is a collection of dialogues from German Sign Language (DGS), entitled Gehörlos So!. In this one-hour collection, ten different signers each talk to another deaf person; all these data have been glossed and translated. The glossing of sign language data in this case consisted of assigning a tag in written German to every manual sign, where the left and the right hand were assumed to be forming one lexical item together. In addition, the data have been annotated for some additional information, including the occurrence of mouth actions and head movements. The corpus is available as a VHS tape accompanied by two books with background information and the transcriptions (Heßmann 2001).14 This project appears to be the first substantial corpus of transcribed discourse data, and as such represents a considerable step forward for the whole field of sign linguistics. At the same time, actually using the transcriptions together with the video tape is rather cumbersome: it is not possible to search for specific instances of glosses or words in the text, and it is not possible to quickly browse different segments of the video.

These technological limitations in the publication and use of a corpus like Gehörlos So! (which until recently were unavoidable) explain only in part why so few corpora have been published for general use. A substantial factor has probably also been that researchers may not wish to share material which they have spent a lot of effort collecting and transcribing. Furthermore, sharing one’s data implies that fellow researchers can use the material for conducting studies that previously could be done exclusively by the people who collected the data.



Finally, there are ethical issues relating to the publication of video recordings of signers that may have deterred researchers from publishing recordings. While these objections remain valid, the overall ECHO project instantiates a trend towards making scientific data available to other researchers. Although sharing linguistic corpora is not uncommon, the publication of data sets currently has limited academic prestige in the humanities, and is not valued highly by measures of academic output, which focus on publications in high-ranked academic journals. With the increase in technological possibilities, publication of data is expected to receive increasing recognition as scientific output.15 This broader trend in the humanities is likely to be instantiated in the field of linguistics as well.

Aside from this ‘indirect’ motivation to publish scientific data, the publication of corpora for linguistic research also has a wide range of academic advantages. When data are accessible to other researchers, research outcomes can be checked by colleagues working in the same field; cross-linguistic studies are facilitated because similar data sets can be recorded for additional languages; the creation of new research groups and the work performed by a single researcher (as for dissertation projects) will become easier because part of the data collection effort can be skipped; finally, seeing how other data sets have been collected can lead to the gradual improvement of methodologies for the whole field. While these ideas are commonplace in the world of speech and text corpora, they are new to the young and relatively small field of sign linguistics.

2.2 The ECHO corpus

The aim in compiling the ECHO corpus of sign language data was not to collect and publish a large corpus. Rather, the goal was to collect comparable data from different sign languages and, through collaboration between different research groups, to see to what extent current annotation conventions, metadata categories, and technical possibilities were sufficient for the creation of a useful linguistic corpus. The actual data that were recorded, annotated and published in an online corpus are therefore limited in scope and quantity, as will appear from the description in this section.

We recorded data from three signed languages: Swedish Sign Language (SSL), British Sign Language (BSL), and Sign Language of the Netherlands (NGT). In each country, one man and one woman were asked to reproduce five stories from Aesop’s fables that were offered to them in written form (in Swedish, English, and Dutch, respectively). The recording of these stories from the classical European tradition (written down by authors such as Aesop,


Phaedrus, and later La Fontaine)16 nicely fitted with ECHO’s overall purpose to publish aspects of Europe’s cultural heritage; for the data elicitation, we only used a version of Aesop’s fables. In addition to these narratives, we elicited a basic word list of 300 items, based on the extended Swadesh list of 200 items (Samarin 1967). This Swadesh list was developed in the 1940s and 1950s and forms a list of basic (possibly universal) concepts; it was used to investigate historical relations between languages. To this set of words from the Swadesh list we added two groups of items: concepts relating to deafness and sign language, and concepts from the fable stories that formed the core of the corpus. The elicitation of narratives and word lists resulted in a collection of comparable data that can be used for cross-linguistic research. In addition to these semi-spontaneous data, we also recorded a short interview in which we asked the signers to introduce themselves. Finally, for BSL and NGT, we included some sign poetry; this included published work by Wim Emmerik (NGT) and Dorothy Miles (BSL), and some newly recorded work by Paul Scott (BSL). The fable stories, word lists and poems all received an elaborate linguistic annotation using ELAN. The details of this linguistic annotation are given in Section 3 below. All parts of the data received a metadata description, which we discuss in the rest of this section. 2.3 Metadata standards In order to create an electronic corpus in which data such as the set described in the previous section are bundled, and which can then be made available and searched online, some kind of metadata description has to be created of each item in the data set. This description refers to properties of the whole recording, rather than to what is said in the recording itself: the recording date, who the speakers/signers are that participate, what the subject and style of the communication event is, whether it was elicited or spontaneous, etc. There are several standard sets of metadata categories for linguistic resources. One widely used standard set of metadata categories that has been developed for any type of electronic resource is the Dublin Core.17 This set consists of fifteen elements, including ‘title’, ‘creator’, ‘description’, ‘references’, and ‘comment’. The Open Language Archives Community (OLAC) has developed a set that is more specifically targeted at language resources.18 In particular, it includes extensions to the Dublin Core set that specify discourse types, the language in question, the specific subfield of linguistics, the types of data that are included (such as lexicon, grammatical description, etc.), and the role © 2007. John Benjamins Publishing Company All rights reserved


Table 1. The different sections in the IMDI metadata set.

1. Session: Bundles all information about the circumstances and conditions of the linguistic event, groups the resources (e.g., video files and annotation files) belonging to this event, and records the administrative information for the event.
2. Project: Information about the project for which the sessions were originally created.
3. Collector: Name and contact information for the person who collected the session.
4. Content: A set of categories describing the content of the language used in the session.
5. Actors: Names, roles and further information about the people involved in the session.
6. Resources: Information about the media files, such as URL, size, etc.
7. References: Citations and URLs to relevant publications and other archive resources.

of the participants in the data. Other initiatives that focus on the storing of digital texts include the Text Encoding Initiative (TEI),19 the Corpus Encoding Standard (CES),20 and the Expert Advisory Group on Language Engineering Standards (EAGLES). For the ECHO project, we decided to make use of a more detailed metadata set, termed IMDI (ISLE MetaData Initiative). This set was developed within a series of large European projects on language and linguistics, and covers a broader range of properties plus more detail within each category. At the same time, it does not include details that are specific to certain types of data or certain types of linguistic fields. IMDI consists of seven broad groups of information, listed in Table 1. As in every other metadata set, it is possible in IMDI to add extensions that are specific to a given set of resources. These extensions are termed ‘keys’ in IMDI; each of the seven sections in IMDI allows the addition of keys. Sets of keys that apply to a larger group of sessions can be bundled in a ‘profile’ that can easily be re-used. Moreover, use of such profiles guarantees consistency within and across projects. 2.4 Additional metadata for sign language corpora One of the most important achievements of the ECHO case study on sign languages was to consider a standard metadata description specifically for sign language resources. These extra keys would become available as a ‘sign language profile’ in the IMDI editor. Several properties of signers will be relevant in the description of any sign language data set, including the linguistic © 2007. John Benjamins Publishing Company All rights reserved


environment of the signer during childhood, whether the person is deaf or hearing, etc. Since neither IMDI nor any other metadata set includes categories to describe such sign-specific properties, it seemed appropriate to establish a standard set of properties that would be useful not only for the data collected for the ECHO project, but for any type of linguistic study on sign language. With this aim in mind, a workshop was organised in which a large number of European linguists working on sign languages discussed the number and nature of possible IMDI keys to describe sign language data. The start of the workshop was formed by a proposal containing a very large set of properties considered relevant and specific to sign language resources (Crasborn & Hanke 2003a). Just as in creating linguistic transcriptions, in any type of metadata description one has to find a balance between relevant detail and the time investment needed to describe one’s resources. Moreover, there is no a priori category of ‘relevant detail’. Depending on the goal for which the data are collected, or the type of study for which the data are used later in time, it may be relevant whether the signer has extensive experience with cued speech. However, to register this information for every signer in every situation implies a lot of extra work. Although such fields can be left blank when creating a metadata description, their presence in a metadata set implies that a judgement needs to be made in each instance. Another consideration raised in the workshop is that researchers should not be encouraged to collect personal information from informants just because it can be registered in a metadata editor. The outcome of the workshop was a document describing twenty-three keys in two different sections of the IMDI scheme, jointly available in the IMDI editor by selecting the ‘sign language profile’ as the default for new sessions. The keys are listed in Table 2, and described in full, following the IMDI conventions, in Crasborn and Hanke (2003b). Most of the keys above have an associated ‘vocabulary’: a restricted list of possible values or a list of suggested values. Further use of the sign language profile beyond the ECHO project, including the current vocabulary choices, will be required in order to demonstrate whether the profile is indeed useful for all types of sign language resources, or whether additions or modifications are called for. Of course, researchers can always add further keys for a specific session or a specific project. Finally, a number of extensions were identified that often apply to sign language recordings and that were generally judged as relevant for inclusion in an IMDI description of sign language data, but which do not specifically refer to sign language data. These include the number of cameras used and their viewpoints (in the Session area), the handedness of the signer (in the © 2007. John Benjamins Publishing Company All rights reserved


Table 2. The keys in the sign language profile for the IMDI metadata standard.

Section 1. Session: no keys in the sign language profile.
Section 2. Project: no keys in the sign language profile.
Section 3. Collector: no keys in the sign language profile.

Section 4. Content:
- Language Variety: A description of the language variety used in the session.
- Elicitation Method: A characterization of specific prompts used for eliciting language production (e.g. picture story, sign video).
- Interpreting Source: Source modality and language type (e.g. sign, speech, sign supported speech, fingerspelling).
- Interpreting Target: Target modality and language type.
- Interpreting Visibility: Visibility of the interpreter in the video recordings.
- Interpreting Audience: Presence and nature of an audience that the interpreter is signing for.

Section 5. Actors:
- Deafness: Status: Actor’s ability to hear (hearing, deaf, hard-of-hearing).
- Deafness: Aid Type: Type of hearing aid the actor has (none, conventional, cochlear implant).
- Sign Language Experience: Exposure Age: Age at which exposure to sign language and sign language use started.
- Sign Language Experience: Acquisition Location: Place where sign language was learnt (e.g. at home from family, from teachers, from friends).
- Sign Language Experience: Sign Teaching: Amount of experience with teaching sign language.
- Family: [Mother/Father/Partner]: Deafness: Describes the mother’s / father’s / partner’s deafness status.
- Family: [Mother/Father/Partner]: Primary Communication Form: Describes the mother’s / father’s / partner’s language input towards the actor.
- Education: Age: Describes the age during which the school was attended.
- Education: School Type: Describes the type of school.
- Education: Class Kind: Describes the kind of class in the school.
- Education: Education Model: Describes the education model used at the school (e.g. bilingual, oral, mixed).
- Education: Location: Describes where (town or region) the institution was located.
- Education: Boarding School: Is the school a boarding school?

Section 6. Resources: no keys in the sign language profile.
Section 7. References: no keys in the sign language profile.



Actor area), and information about the dialect and educational background of the signer (also in the Actor area). These properties appear to have a wider scope than sign language research alone, and therefore were not included in the sign language profile.21 The corresponding keys were used in the IMDI descriptions of the ECHO data set, and are specified in an additional document (Crasborn 2003).
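To make the metadata structure concrete, the sketch below shows how a session description along the lines of Tables 1 and 2 might be assembled in a small script. The layout is our own illustration rather than the official IMDI XML schema (sessions are normally created with the IMDI editor), and all values are invented; only the section names and key names are taken from the tables above.

```python
# Illustrative sketch only: the dictionary mirrors the IMDI sections (Table 1)
# and a few sign language profile keys (Table 2). It is not the official IMDI
# XML schema; real sessions are created and validated with the IMDI editor.
import json

session = {
    "Session": {"Name": "ngt-fable-01", "Date": "2003-10-15",
                "Description": "Aesop fable retold in NGT"},      # invented values
    "Project": {"Name": "ECHO sign language case study"},
    "Collector": {"Name": "..."},
    "Content": {
        "Keys": {                                   # sign language profile, Content section
            "Language Variety": "NGT",
            "Elicitation Method": "written fable",
        }
    },
    "Actors": [{
        "Role": "signer",
        "Keys": {                                   # sign language profile, Actors section
            "Deafness: Status": "deaf",
            "Sign Language Experience: Exposure Age": "0",
            "Education: School Type": "school for the deaf",
        }
    }],
    "Resources": [{"Type": "video", "URL": "ngt-fable-01.mpg"}],
    "References": [],
}

print(json.dumps(session, indent=2))
```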

3. Transcription of sign language data: A proposal for basic transcription categories

3.1 Introduction

While the metadata descriptions discussed in the previous section characterise recordings as a whole, linguistic transcriptions focus on the details of the language used in the video recordings. An overview of various approaches can be found in the papers in Bergman et al. (2001). To make the sign language material collected for ECHO useful for a broad base of linguists, an important aspect of the study was to agree on a set of linguistic transcription conventions that would be relevant for most language researchers as well as being relatively theory-neutral. Adding annotations for such a diverse audience constitutes a considerable challenge, as analytical decisions are constantly being made in using any kind of transcription system. For example, even adding multiple tiers with translations in various written languages (in this case Dutch, English and Swedish) implies taking (implicit or explicit) decisions about where sentence boundaries are located. The choice of tiers and the values assigned is a first proposal, and further use of the material by linguists from a range of backgrounds and with different research interests is needed to reveal the extent to which the choices are seen as appropriate, over-detailed, or incomplete.

While this may be considered a rather cumbersome procedure, it is well suited to the new technological possibilities. For instance, the immediate presence of the original source data with every annotation makes it possible to check each decision taken in the annotation process by comparing it to the actual data, something that was not possible with the use of paper transcriptions. It remains to be investigated how these possibilities are exploited in actual practice, but the free access to all data, which is at the heart of the whole ECHO project, may promote some form of standardisation in transcription conventions, whether at the level of the tiers that are distinguished, or the details of particular distinctions within these tier



categories or the way they are represented. Both are described further on in this section; it will become clear that we have decided to focus on annotating aspects of form rather than meaning. Aspects of the meaning of signs and utterances are provided only on the gloss and translation tiers. There is no commonly used standard for annotating sign language material, although many researchers have discussed aspects of the annotation conventions they used in their research (e.g. Johnston 1991; Engberg-Pedersen 1993; Neidle & MacLaughlin 1998). There are phonetic annotation systems similar to the International Phonetic Alphabet (IPA) for spoken language, such as HamNoSys (Prillwitz et al. 1989; Prillwitz & Zienert 1990). By contrast, the Berkeley Transcription System (BTS) (Slobin et al. 2001) focuses on the transcription of meaning components. It is clear that these and other annotation systems all have their own goals and functions, and it is the research question at hand that determines what system is best suited. Since the data for the ECHO project, and also those for similar corpus projects that are currently underway (e.g. Johnston & Crasborn 2006), were intended to serve as possible data for all kinds of investigations, the selection of one or more of the existing annotation systems was felt to be inappropriate, also taking into consideration that they often are very time-consuming to use. In the selection of which transcription categories to use, one has to find a balance between the amount of time needed to make the transcription and the possibilities the transcription offers for linguistic analysis later in time (Crasborn et al. 2001). While a fine-grained transcription offers more detail for the linguist to help answer research questions, creating transcriptions is very time-consuming. The detail of transcription is inevitably lost when time is limited. For example, the Facial Action Coding System for facial expression (FACS) (Ekman et al. 2002) is the most precise system available, but it is slow to use. In the case of the ECHO project, for example, we would not have been able to annotate more than a single fable per language using FACS. With the restricted set of transcription categories we decided to use, we were able to add annotation for facial expressions for ten fables per language. Aside from the time involved in creating the annotations, it also takes a considerable effort to learn and use a specialised system like FACS correctly and reliably. A more important consideration in the case of ECHO, both for facial expressions and other aspects of the signing, was that for the average user, a highly refined annotation would offer more detail than is necessary for the average research project. Altogether, these considerations encouraged us not to use available systems like HamNoSys and FACS for the ECHO corpus, but to agree on a simpler set of basic transcription categories. © 2007. John Benjamins Publishing Company All rights reserved


It is important to emphasise that, unlike the IMDI metadata software, the ELAN annotation tool does not offer a set of standard profiles to choose from, including tiers and vocabularies (possible values) for certain tiers. It is possible to store tier setups in a separate document, a so-called ‘template’, which can be reused and shared with other researchers. However, these documents are not readily available for anyone who downloads and installs ELAN: the template has to be downloaded separately from another researcher’s web site, copied from other users, or created anew by the researcher him/herself.

In conclusion, then, the set of transcription conventions that are proposed and described below is just that: a proposal for a set of tiers and values that covers elementary transcription categories that are useful for many different kinds of research (Nonhebel et al. 2004a). To achieve the latter goal, the categories aim to be as theory-neutral as possible and focus on the form rather than the function of signing. Three groups can be distinguished: tiers storing general information, tiers for the activity of the hands, and tiers with non-manual signals.

3.2 Tiers with general information

General information that can be supplied for every fragment of a video file includes ‘Translation’ tiers for English, Swedish and Dutch. Each of these tiers is targeted at a translation at sentence level. ‘Role’ indicates where a signer takes on the role of a specific discourse participant, as is common in sign languages, especially in stories such as those that were collected for the ECHO corpus. Finally, the ‘Comments/notes’ tier can be used to add any kind of comment by the user. Some possible categories that we decided not to transcribe include the specific dialectal nature of a lexical item or stretch of signing, the style of an utterance or part of discourse, and prosodic units within utterances.

3.3 Tiers with manual information

ELAN distinguishes between two types of tiers, ‘parent tiers’ and ‘child tiers’. Parent tiers are independent tiers, which contain annotations that are linked directly to a time interval in the media file. Child tiers or referring tiers contain annotations that are linked to annotations on another tier (the parent tier). The only place in the ECHO annotations where this possibility was used is for describing manual behaviour. The manual activity is systematically described separately for the two hands. For both the left and the right hand, there


is a ‘Gloss’ tier; two-handed signs are assigned the same gloss on both tiers, so that the alignment of the annotation (the start and end times) can differ for the two hands. Glosses can be prefixed by (1h) and (2h), indicating respectively that a normally two-handed sign is realised with only one hand, and vice versa. This Gloss tier acts as a parent tier for two further tiers for each hand: ‘Repetition’ and ‘Direction & Location’. The Repetition tier can be used to specify how often a movement is repeated, either by entering the exact number of movement cycles for that hand, or by specifying ‘u’ for ‘uncountable’. Further, alternating movement between the two hands can be indicated by adding ‘a’ to the number of repetitions. As its name indicates, the Direction & Location tier contains information about movement direction and the spatial location of the hand. In this way, information that is potentially relevant to morphosyntactic and discourse processes is included in the transcription without applying a specific linguistic analysis, which would, for example, describe a certain movement as an instance of agreement with a specific referent. Direction and location values include different parts of space (such as upward, straight left and diagonally left, etc.), but also more abstract categories such as ‘towards a person present’ or ‘towards the other hand’.

Both translation and glossing of words (signs) were seen as indispensable for the material to be usable for anyone who is not familiar with the language in question. Glossing in this project involved creating a text label for each action of the left and right hand. For the BSL data, the gloss was only in English; for SSL and NGT, an additional gloss in Swedish or Dutch respectively was added. The glossing that is used aims to be as general as possible, and includes no explicit morphological information. However, several markers were added to the transcription protocol to allow the researchers to indicate that the hand was doing something beyond articulating a lexical item. The marker (p-) can be prefixed to a gloss to indicate that the sign is a polycomponential form built up of a classifier-like hand and other (meaningful) components, and thus is neither a lexically fixed form-meaning combination nor a simple inflected verb form. A precise analysis of the form is not made, however, and is left to the user of the transcription. The same is true of another category of hand movements, ‘gestures’, which consist of a short description of the meaning in lowercase letters, prefixed by the code (g-). While there is no clear-cut definition of what counts as a gesture in sign languages, the code can be used to label hand actions that may be borrowed from the surrounding hearing culture, or that have neither a clear lexical status nor are polycomponential morphological constructions (see Emmorey (1999), for example, for a discussion of gestures in signed languages).


Finally, fingerspelled words are assigned a gloss consisting of the manual letters used, prefixed by (fs-) so that users can easily identify them. Again, there are many other possible categories that we decided not to include in the transcription, including phonological details such as handshape and movement speed and size. To include such categories would involve considerable extra annotation time, while their possible use is likely to be limited.

3.4 Tiers with non-manual information

A set of non-manual tiers allows for the specification of relevant properties of the face, head, and body of the signer. The position and movement of the head and eyebrows, the state of the cheeks, the amount of eye opening, and the direction of eye gaze can all be specified. For each of these five tiers, only a very small number of distinctions were incorporated. This was the only way for the annotators to transcribe all of the material in the corpus within the limited time available. Moreover, linguistic analysis of sign languages has not yet revealed which finer distinctions are universally relevant. As with the labelling of polycomponential constructions in the gloss tiers, the annotations on these non-manual tiers form a first labelling of phonetic events, which will hopefully be useful for a wide range of linguistic studies; such studies may need to make further refinements in the basic annotations proposed here.

Suggested distinctions on the Head tier are ‘nod’, ‘shake’ and ‘tilt’. The two possible values on the Cheek tier are ‘puffed’ and ‘in’. The Eyebrows tier distinguishes between ‘raised’ and ‘furrowed’. Values on the Eye Aperture tier are ‘blink’ (including the number of blinks), ‘closed’, ‘wide’ and ‘squint’. Finally, the Eye Gaze tier can take on values similar to those for Direction & Location of the hands: both directions in space and gaze towards a person present, the camera or (one of) the hands can be notated. The details of the tiers and distinctions discussed above can be found in Nonhebel et al. (2004a).

A new system was devised to specify the behaviour of the mouth, including the tongue, which in previous systems was often treated in a rather fragmentary manner (Nonhebel et al. 2004b).22 A summary of the distinctions that were used is presented in Table 3. As in the system proposed by Bergman and Wallin (2001), sequences of different forms can be annotated; in the present proposal, these forms are separated by the character >. For further details, see the description in Nonhebel et al. (2004bc). Properties of the face that were not transcribed include wrinkling of the nose and the combinations of facial signals expressing emotion.



Table 3. Categories for transcribing mouth behaviour.

Lip aperture: closed, open
Lip position: round, forward, stretched
Air stream: air in, air out
Corners of the mouth: up, down
Tongue shape: pointed, relaxed
Tongue position: 0–100% out of the mouth
Teeth: end of the teeth touching upper vs. lower lip
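Once such annotation values are machine-readable, even small helpers make them easier to query. The sketch below is our own illustration, not part of the ECHO tooling: it classifies gloss labels by the prefixes introduced in Section 3.3 and splits a mouth-tier value into its successive forms using the ‘>’ separator described above. The exact string shape of a prefixed gloss is an assumption made here for the example.

```python
# Illustration only: classify ECHO-style gloss labels by the prefixes described
# in Section 3.3, and split mouth-tier values on the '>' separator (Table 3).
# The precise string form of prefixed glosses is assumed for this example.
def classify_gloss(gloss: str) -> str:
    prefixes = {
        "(fs-": "fingerspelling",
        "(g-": "gesture",
        "(p-": "polycomponential form",
        "(1h)": "one-handed realisation of a normally two-handed sign",
        "(2h)": "two-handed realisation of a normally one-handed sign",
    }
    for prefix, label in prefixes.items():
        if gloss.startswith(prefix):
            return label
    return "lexical sign"

def mouth_forms(value: str) -> list[str]:
    # Sequences of mouth forms are separated by '>' (e.g. "closed > round").
    return [form.strip() for form in value.split(">")]

for g in ["BOAT", "(p-)vehicle-moves-forward", "(g-)well", "(fs-)a-e-s-o-p", "(2h)HOUSE"]:
    print(g, "->", classify_gloss(g))
print(mouth_forms("closed > round > stretched"))
```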

4. Ethics: Privacy issues in publishing video data online

Needless to say, the privacy of subjects in scientific studies has to be respected. For the sign language component of the ECHO project, this gives rise to problems not previously encountered in the creation of spoken language corpora that just make use of sound recordings. The visual information in the video recordings contains much more personal information than do audio recordings of voices, including not only the identity of the signer (i.e., the visual appearance of the face, head, and upper body) but also clues to the emotional state and age of the person. Detailed personal information stored in the metadata (including name, age, and family background) can of course be kept hidden using the IMDI tools, allowing only certain users to see the full metadata description. However, the alternative of ‘anonymising’ subjects by assigning a number or other code to them in the metadata is ineffective if their identity can be established by looking at any segment of the video recording.

While it is common practice to ask subjects in linguistic recordings for their explicit written permission to use the data for various purposes, including making images for publications, discussion among sign language specialists revealed that this permission is a rather sensitive issue in the case of Internet publication. Open-access publication of data online implies that the information is available to the whole world, and not just to a limited group of people with access to specific university libraries, for example, as in the case of video tape recordings. While Deaf people often indicate that they are accustomed to being recorded on video, and sometimes do not understand why they would ever object to that, it is difficult for anyone to accurately judge the impact and speed of current developments in Internet technology and use. Given the effort involved in creating a corpus, this effort would be wasted if at a later stage signers decided to withdraw their permission for use. More importantly, given the fact that data are downloaded by users, it is also impossible to really withdraw the data from the Internet corpus.


Future projects aimed at making data accessible online should explore these issues in more depth, with assistance from both legal and ethical expertise. One recent development in the online publication of data is the rapid adaptation of a series of copyright licences from the Creative Commons organisation.23 These licences are specifically developed to protect creative work published on the Internet, including music, videos and photos. A series of possible licenses is proposed, which have both a legal underpinning that is adapted to various countries and a ‘translation’ that is accessible to the average user. The copyright license can be designed to explicitly allow Internet users to download and reuse data such as sign language videos, at the same time obliging the user to mention the creator when data are reused. Other aspects of such a license may allow or prohibit commercial use of the data, or the modification of the data for re-use in other projects. If such licenses come into common use in the future, it may be fairly easy to protect sign language data legally; it remains to be seen whether users actually obey the restrictions of these licenses, of course. It should be noted that in the world of software development, similar licenses such as the GNU General Public License (GPL)24 have been in common use for some time, and have been successful in protecting published ‘free’ software. For the ECHO data, we did not use any such licenses, but included a simple statement in the metadata and on the web site accompanying the project, requesting users of the data for research purposes to refer to the corpus in publications. Moreover, every video file starts and ends with a one-frame text giving credit to the signer and referring the viewer to the project web site. Since we did not yet have a clear legal basis for restricting the use of the video recordings published online, we made sure to check in detail that the signers were aware of what was going to happen with the recordings, and that it would not be possible to un-publish them at a later stage. All signers indicated that this was not problematic. For the description of the signers in the metadata, we asked every signer to fill in a form containing personal information, and granting us permission to use the data for different purposes, including the publication of the recordings on Internet and the use of the recordings for research and teaching. The information in the form was explained and signed to the participants in their own sign language.



5. Conclusion and future developments

5.1 Summary of experiences and results

In creating a first cross-linguistic digital corpus of sign language data to be published online, we encountered difficulties arising from the lack of standardisation in both the metadata and the data domain. For metadata, we found that the IMDI standard developed for spoken languages, which is among the most stable and detailed metadata descriptions for linguistic resources, can be fairly easily expanded for describing sign language data sets. The resulting ‘sign language profile’ for IMDI was the outcome of a workshop with participants from various European countries and many different research interests and theoretical backgrounds. We hope that this elaboration of the IMDI standard, which can be revised if further experience in other projects indicates that this is necessary, will make it easier for descriptions of existing data collections and for future corpus projects to describe sign language data sets in a straightforward and systematic way.

Creating a standard for the linguistic annotation of sign language data is much less straightforward. Any transcription is coloured to some extent by the research tradition and theoretical perspective of the researcher at hand. Larger linguistic corpora are typically not designed for a narrow and specific set of research questions, but aim to provide a solid database for all kinds of researchers and research questions. Linguistic corpora do not form the only type of data that can be used in research; nevertheless, they have been a neglected data source for sign language researchers. The use of large linguistic corpora has for a long time relied on the collection of written texts, which can be easily accessed by computer tools. Now that computer technology for recording and storing video data is readily available, it is expected that for sign language research as well, the use of corpus data will rapidly increase. In this context, some kind of standardisation of transcription conventions is necessary. The set of transcription conventions used to annotate the ECHO corpus aims to provide another step towards such standardisation. Its central feature is the clear focus on the form of signing, rather than the function of specific forms. While ELAN’s rapid access to the video recording always provides a researcher with the opportunity to see the signing itself, some type of annotation is crucial in order to quickly browse and search large collections of data. The ECHO transcription conventions aim to provide a good balance between the effort needed to annotate the video and the usefulness of the distinctions for the end user.



Although the ECHO corpus for the three sign languages is fairly small, the data have already been used by external researchers. Both Morrisey and Way (2005) and Zahedi et al. (2006) used parts of the ECHO corpus in an investigation of automatic sign language recognition. In addition to this use of the published data, there are currently several efforts underway to add new recordings of the fables and lexical information in other sign languages, including Catalan Sign Language and Spanish Sign Language. Aside from the use of the corpus by linguists, the data are freely accessible. Corpora like the ECHO corpus may also satisfy the need of other users, such as Deaf people interested in seeing foreign sign languages, adult sign language learners seeking extra opportunities for comprehension practice, and sign language teachers developing new teaching materials. 5.2 Recommendations for future corpus projects There are several areas for progress. A larger corpus could aim to broaden the type of data that are included, focussing not only on narratives but also including spontaneous interaction among signers, for example. Two current projects are doing just that (Johnston & Crasborn 2006), by collecting large-scale annotated video corpora of Auslan (Australian Sign Language) and NGT.25 These projects focus on the documentation of these two languages by recording a relatively large number of native signers in a wide range of linguistic tasks, but only have limited resources for adding annotations to the video data. Future projects might also investigate how people actually use corpora, and how they use tools like ELAN. The immediate access to the source data when browsing or searching an annotation document might render some types of annotation or certain levels of detail redundant. For example, given that relatively little is known about the exact formal distinctions and functions of head and body movements, annotating the details of the direction and size of body leans may require a great deal of work while adding limited value for a future research project on these types of markers. One could consider marking on a ‘body’ tier that something is happening with respect to the position or movement of the upper body, and not try to specify what exactly it is. The presence of the annotation, especially in larger corpora, could then be used in inspecting data by going back to the video source. Depending on the research question at hand and the insights that have been developed in the years since the creation of the corpus, actual content could later be added to the annotation. In the ECHO annotations, as was implicit in the descriptions in Section 3 above, we did not make use of the possibility to add ‘empty’ annotations. © 2007. John Benjamins Publishing Company All rights reserved


More generally, the ECHO project has focused on the creation of the corpus, rather than its actual use. While one specific linguistic study has been carried out on the basis of the ECHO annotations (van der Kooij et al. forthcoming), we still have limited experience with the use of existing data for answering new research questions. The involvement of experts in the area of spoken language corpora (whether containing written or spoken material) may lead to further improvements in the design and usability of new sign language data collections.

5.3 Future technological developments

Our experience with the IMDI tools and ELAN in the ECHO project so far suggests various functions that would be useful additions to these tools. In terms of data entry, facilities could be envisaged that save the user a great deal of time in creating annotations. For example, being able to copy annotations from one tier to another would make it easier to mark the occurrence of two-handed signs in the form that was used for the ECHO corpus, where the left and right hand receive the same gloss, with only some timing difference between the two annotations. Similarly, the ability to add short annotations to running video with a single keystroke would enable the rapid annotation of brief events such as eye blinks or head nods.

Most of the suggestions for future software improvements relate to the use of existing annotations. As corpora grow, the need for complex search functions and other corpus exploitation tools will grow with them. In the future, it will be of crucial importance to be able to search across multiple files with video annotations, possibly restricting the search to specific nodes in the corpus. Moreover, researchers will wish to use queries that combine information from the metadata and the data domains. For example, users might want to find occurrences of specific linguistic material, but only for a subset of all participants: “all occurrences of the sign newspaper [data domain] uttered by male signers who went to the school for deaf children in the north of the country [metadata domain]”. In the present state of the tools, one first needs to search within the metadata categories, and then search the resulting set of transcription files for data categories one by one. Finding all cases of weak hand spreading by signers over 60 years old thus becomes a very time-consuming task, even though corpora are particularly useful for precisely these kinds of complex queries. In addition to such search facilities, easy access to basic statistical functions would be useful in studying corpus data, including the calculation of frequencies of annotation values on different tiers and the distribution of the durations of these annotation values. A sketch of what such a combined metadata and data query might look like is given below.
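As an illustration of the kind of cross-domain query described above, the following is a minimal sketch under stated assumptions: a tab-separated table of session metadata and, per session, a tab-delimited export of a gloss tier. None of the file names or column layouts are prescribed by IMDI or ELAN; they stand in for whatever exports a project would actually use.

```python
# Sketch of a combined metadata + data query, e.g. "all occurrences of the sign
# NEWSPAPER uttered by male signers who attended a particular school".
# The file layouts below are hypothetical stand-ins for real IMDI/ELAN exports.
import csv
from collections import Counter

def matching_sessions(metadata_file, sex="male", school="school for the deaf, north"):
    """Metadata domain: select session IDs whose participant fits the profile."""
    with open(metadata_file, newline="", encoding="utf-8") as f:
        return [row["session"] for row in csv.DictReader(f, delimiter="\t")
                if row["sex"] == sex and row["school"] == school]

def gloss_occurrences(session, gloss="NEWSPAPER"):
    """Data domain: find annotations with a given gloss in one exported tier.
    Assumed columns: tier, begin_ms, end_ms, value (one annotation per line)."""
    hits = []
    with open(f"{session}_glosses.txt", newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f, delimiter="\t"):
            if row["value"] == gloss:
                hits.append((session, int(row["begin_ms"]), int(row["end_ms"])))
    return hits

if __name__ == "__main__":
    sessions = matching_sessions("sessions_metadata.txt")
    occurrences = [hit for s in sessions for hit in gloss_occurrences(s)]
    print(f"{len(occurrences)} occurrences in {len(sessions)} sessions")
    # Basic statistics of the kind mentioned above: counts per session and durations.
    print(Counter(session for session, _, _ in occurrences))
    print(sorted(end - begin for _, begin, end in occurrences))
```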



While there is no commonly accepted standard or software for creating lexical databases for sign languages (comparable to Shoebox for spoken languages), ultimately the integration of lexicons with transcriptions of ‘running text’ will be an important development that allows for new ways of doing linguistic research, at the level of both meaning and form. Conversely, corpus data can aid in creating sign language lexicons (Hanke 2006). Like the combination of the metadata and the data domains mentioned above, the joint study of the discourse domain and the lexical domain is an obvious linguistic demand that is hampered by the current state of technology.

The IMDI tools that are currently available allow for linking resources of many different kinds, all related to the same ‘session’ unit. Thus, it is possible to add text documents or hyperlinks to Internet URLs to the combination of a video and its ELAN transcription file. However, since ELAN sessions can be relatively long, often containing more than a thousand words or signs, it would be useful to allow reference to files or URLs for specific stretches of annotations as well, rather than linking a text document to the whole annotation file. Conversely, it would be very useful if hyperlinks in text documents could refer to individual annotations or combinations of annotations in ELAN files, so that linguistic papers in PDF or HTML format could refer directly to the data source of an example, without parts of the video and annotation having to be copied to separate files.

A very different type of data that could be integrated with video annotations is numerical data from kinematic measurements with data gloves, or from eye-tracking equipment. As the field of sign language phonetics is still in its infancy, the specifications of such functionality will have to develop over the years to come. ELAN could thus be further developed to include functionality similar to that of the Praat software used for speech analysis (Boersma & Weenink 1996). While the data collection techniques involved are typically used for quantitative analyses carried out independently of video recordings of language production, ELAN could serve an important function in the further exploration of these data, at the same time enlarging the video corpus for a given language. A first step towards integrating and visualising kinematic data in ELAN has been completed in version 2.6 of ELAN (Crasborn et al. 2006).
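As a rough illustration of how such numerical data could be related to annotations, the sketch below aligns time-stamped data-glove samples with the intervals of an annotation tier. The file formats, column names and the idea of summarising each interval by a mean value are assumptions for the sake of the example, not a description of the ELAN implementation.

```python
# Sketch: aligning time-stamped kinematic samples (e.g. from a data glove) with
# annotation intervals. File layouts and column names are hypothetical; this is
# not a description of how ELAN 2.6 handles such data.
import csv

def load_samples(path):
    """Read kinematic samples as (time_ms, value) pairs, e.g. hand speed over time."""
    with open(path, newline="", encoding="utf-8") as f:
        return [(int(row["time_ms"]), float(row["speed"]))
                for row in csv.DictReader(f, delimiter="\t")]

def load_annotations(path, tier="RH-gloss"):
    """Read exported annotations as (begin_ms, end_ms, value) tuples for one tier."""
    with open(path, newline="", encoding="utf-8") as f:
        return [(int(row["begin_ms"]), int(row["end_ms"]), row["value"])
                for row in csv.DictReader(f, delimiter="\t") if row["tier"] == tier]

def mean_per_annotation(samples, annotations):
    """For each annotation interval, average the samples that fall inside it."""
    results = []
    for begin, end, value in annotations:
        inside = [v for t, v in samples if begin <= t <= end]
        if inside:
            results.append((value, begin, end, sum(inside) / len(inside)))
    return results

if __name__ == "__main__":
    samples = load_samples("signer1_glove.txt")
    annotations = load_annotations("signer1_glosses.txt")
    for value, begin, end, mean_speed in mean_per_annotation(samples, annotations):
        print(f"{value}\t{begin}-{end} ms\tmean speed {mean_speed:.2f}")
```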


While the use of any Unicode font is already possible within ELAN, the most commonly used notation and writing systems for sign language, such as HamNoSys26 and SignWriting,27 are not yet available in Unicode versions. To be able to use such fonts, it would be necessary to select a specific font per tier. The disadvantage of using specific fonts is that the font in question would have to be installed on every computer on which the annotation file is to be used; for this reason, the selection of specific fonts per tier has not yet been implemented in ELAN.

Finally, in addition to improved search facilities and the integration of different types of data, it would be useful if researchers located in different places could jointly work on the same video files and annotation documents, whether creating new annotations or discussing existing data. Such ‘collaborative annotation’ was characterised in Brugman et al. (2004), and is present in another form in the multi-user version of the Transana annotation tool (see Note 12).

We hope that the freely accessible ECHO corpus will be a stimulus to other researchers to publish their data online as well, allowing both specialists and the general audience to profit from linguistic data collections, and to contribute to technological developments of the type suggested above.

Notes . The ECHO home page is http://echo2.mpiwg-berlin.mpg.de/home. 2. The ECHO charter is available at: http://echo2.mpiwg-berlin.mpg.de/home/documents/ charter. 3. Information on the Berlin Declaration, including the text and the list of signatories, can be found at http://echo.mpiwg-berlin.mpg.de/home/documents/declaration. 4. The results of the project are published at http://www.let.ru.nl/sign-lang/echo/. 5. We follow the common practice among sign language researchers to write Deaf with an initial capital when referring to members of a cultural minority group, and deaf with a lowercase initial to indicate people with an auditory impairment. 6. http://lands.let.kun.nl/cgn/ 7. http://www.signwriting.org 8. See http://www.mpi.nl/IMDI/ 9. http://www.mpi.nl/tools/elan.html 0. The SignStream software is described at and available from http://www.bu.edu/asllrp/ SignStream/. . http://www.dfki.de/~kipp/anvil/ 2. http://www.transana.org


13. The underlying XML format of the annotation files created by ELAN already ensures future accessibility as well as access by other applications. Built-in export functions provide more user-friendly ways of analysing annotations, for example by exporting to text format for import and analysis within spreadsheet software.
14. All of the video and some of the annotations have been added to the ECHO corpus following the completion of the project in 2005.
15. The increasing attention to the publication of data as part of academic research output is exemplified by a position statement of the Research Councils UK, http://www.rcuk.ac.uk/access/2005statement.pdf.
16. http://www.let.ru.nl/sign-lang/echo/docs/AesopsFables.pdf
17. http://dublincore.org
18. http://www.language-archives.org/OLAC/metadata.html
19. http://www.tei-c.org
20. http://www.xml-ces.org
21. Strictly speaking, this also holds for the “Language Variety” and “Elicitation Method” keys now included in the profile, but it was commonly agreed that these are core properties for any type of sign language data and that it is very convenient to have them available through selection of the sign language profile. They might be incorporated into the general IMDI standard in the future.
22. The SSL data were annotated using the system described by Bergman & Wallin (2001).
23. http://creativecommons.org
24. http://www.gnu.org/copyleft/gpl.html
25. http://www.let.ru.nl/sign-lang/corpusngt/
26. http://www.sign-lang.uni-hamburg.de/hamnosys/
27. http://www.signwriting.org

References

Bergman, B., Boyes Braem, P., Hanke, T. & Pizzuto, E. (Eds.) (2001). Sign transcription and database storage of sign information. Papers from the Intersign network. Special issue of Sign Language & Linguistics, 4 (1/2).
Bergman, B. & Wallin, L. (2001). A preliminary analysis of visual mouth segments in Swedish Sign Language. In P. Boyes Braem & R. Sutton-Spence (Eds.), The hands are the head of the mouth (pp. 51–68). Hamburg: Signum-Verlag.



Boersma, P. & Weenink, D. (1996). PRAAT. A system for doing phonetics by computer. Version 3.4. Amsterdam: Institute of Phonetic Sciences of the University of Amsterdam, report 132.
Brugman, H., Crasborn, O. & Russel, A. (2004). Collaborative annotation of sign language data with peer-to-peer technology. In Proceedings of LREC 2004, Lisbon (pp. 213–216). Paris: ELRA.
Crasborn, O. (2003). General IMDI extensions. Available at: http://www.let.ru.nl/sign-lang/echo/docs/General_IMDI_Extensions.doc (last accessed August 2007).
Crasborn, O. & Hanke, T. (2003a). Metadata for sign language corpora. Background document for an ECHO workshop, 8–9 May 2003, Nijmegen University. Available at: http://www.let.ru.nl/sign-lang/echo/docs/Metadata_SL.doc (last accessed August 2007).
Crasborn, O. & Hanke, T. (2003b). Additions to the IMDI metadata set for sign language corpora. Agreements at an ECHO workshop, 8–9 May 2003, Nijmegen University. Available at: http://www.let.ru.nl/sign-lang/echo/docs/SignMetadata_Oct2003.pdf (last accessed August 2007).
Crasborn, O., van der Hulst, H. & van der Kooij, E. (2001). SignPhon. A phonological database for sign languages. Sign Language & Linguistics, 4 (1/2), 215–228.
Crasborn, O., Sloetjes, H., Auer, E. & Wittenburg, P. (2006). Combining video and numeric data in the analysis of sign languages within the ELAN annotation software. In C. Vettori (Ed.), Proceedings of the 2nd Workshop on the Representation and Processing of Sign Languages, 28 May 2006, Genova (pp. 82–87). Paris: ELRA.
Ekman, P., Friesen, W. V. & Hager, J. C. (2002). Facial Action Coding System. Salt Lake City, Utah: Research Nexus.
Emmorey, K. (1999). Do signers gesture? In L. S. Messing & R. Campbell (Eds.), Gesture, speech, and sign (pp. 133–159). New York: Oxford University Press.
Engberg-Pedersen, E. (1993). Space in Danish Sign Language: The semantics and morphosyntax of the use of space in a visual language. Hamburg: Signum.
Esanu, J. M. & Uhlir, P. F. (Eds.) (2004). Open access and the public domain in digital data and information for science: Proceedings of an international symposium. US National Committee for CODATA, National Research Council.
Granger, S. & Petch-Tyson, S. (Eds.) (2003). Extending the scope of corpus-based research: New applications, new challenges. Amsterdam & Atlanta: Rodopi.
Hanke, T. (2006). Towards a corpus-based approach to sign language dictionaries. In C. Vettori (Ed.), Proceedings of the 2nd Workshop on the Representation and Processing of Sign Languages, 28 May 2006, Genova (pp. 70–73). Paris: ELRA.
Heßmann, J. (2001). Gehörlos So! Materialien zur Gebärdensprache Gehörloser. Hamburg: Signum Verlag.
Johnston, T. (1991). Transcription and glossing of sign language texts: Examples from Auslan (Australian Sign Language). International Journal of Sign Linguistics, 2, 3–28.
Johnston, T. & Crasborn, O. (2006). The use of ELAN annotation software in the creation of signed language corpora. Paper presented at Tools & Standards: The State of the Art, E-MELD workshop on digital language documentation, East Lansing, MI, USA, 20–22 June 2006.



Kooij, E. van der, Crasborn, O., Waters, D., Woll, B. & Mesch, J. (forthcoming). Frequency distribution and spreading behaviour of different types of mouth actions in three sign languages. Manuscript submitted for publication.
Levinson, S. C. & Wittenburg, P. (2001). Language as cultural heritage. In J. Renn (Ed.), ECHO: An infrastructure to bring European cultural heritage online. The foundation papers of a European initiative (pp. 103–111). Berlin: Max Planck Institute for the History of Science.
Morrisey, S. & Way, A. (2005). An example-based approach to translating sign language. Paper presented at the Workshop on Example-Based Machine Translation (MT X–05), Phuket, Thailand, 16 September 2005.
Neidle, C. & MacLaughlin, D. (1998). SignStream: A tool for linguistic research on signed languages. Sign Language & Linguistics, 1, 111–114.
Nonhebel, A., Crasborn, O. & van der Kooij, E. (2004a). Sign language transcription conventions for the ECHO project. Version 9, 20 January 2004. Manuscript, Radboud University Nijmegen. Available at: http://www.let.ru.nl/sign-lang/echo/docs/ECHO_transcr_conv.pdf (last accessed August 2007).
Nonhebel, A., Crasborn, O. & van der Kooij, E. (2004b). Sign language transcription conventions for the ECHO project: BSL and NGT mouth annotations. Manuscript, Radboud University Nijmegen. Available at: http://www.let.ru.nl/sign-lang/echo/docs/ECHO_transcr_mouth.pdf (last accessed August 2007).
Nonhebel, A., Crasborn, O. & van der Kooij, E. (2004c). Sign language transcription conventions for the ECHO project: SSL mouth annotations. Manuscript, Radboud University Nijmegen. Available at: http://www.let.ru.nl/sign-lang/echo/docs/ECHO_transcr_mouth_SSL.pdf (last accessed August 2007).
Ochs, E. (1979). Transcription as theory. In E. Ochs & B. B. Schieffelin (Eds.), Developmental pragmatics (pp. 43–72). New York: Academic Press.
Prillwitz, S., Leven, R., Zienert, H., Hanke, T. & Henning, J. (1989). HamNoSys. Hamburg Notation System for sign languages. An introductory guide. Hamburg: Signum Verlag.
Prillwitz, S. & Zienert, H. (1990). Hamburg Notation System for sign language. Development of a sign writing with computer application. In S. Prillwitz & T. Vollhaber (Eds.), Current trends in European sign language research. Proceedings of the 3rd European Congress on Sign Language Research, Hamburg, July 26–29, 1989 (pp. 355–379). Hamburg: Signum Verlag.
Samarin, W. J. (1967). Field linguistics: A guide to linguistic field work. New York: Holt, Rinehart & Winston.
Slobin, D., Hoiting, N., Anthony, M., Biederman, Y., Kuntze, L., Lindert, R., Pyers, J., Thumann, H. & Weinberg, A. (2001). The Berkeley Transcription System (BTS) for sign language. Sign Language & Linguistics, 4, 63–96.
Zahedi, M., Dreuw, P., Rybach, D., Deselaers, T., Bungeroth, J. & Ney, H. (2006). Continuous sign language recognition — approaches from speech recognition and available data resources. In C. Vettori (Ed.), Proceedings of the 2nd Workshop on the Representation and Processing of Sign Languages, 28 May 2006, Genova (pp. 21–24). Paris: ELRA.



Zeshan, U. (2004a). Hand, head, and face: Negative constructions in sign languages. Linguistic Typology, 8, 1–58.
Zeshan, U. (2004b). Interrogative constructions in signed languages: Cross-linguistic perspectives. Language, 80, 7–39.

Author’s address: O. Crasborn Department of Linguistics / CLS Radboud University Nijmegen PO Box 9103 NL–6500 HD Nijmegen The Netherlands [email protected] Phone: +31 24 3611377
