A Data Category Registry- and Component-based Metadata Framework

Daan Broeder, Marc Kemps-Snijders, Dieter Van Uytvanck, Menzo Windhouwer, Peter Withers, Peter Wittenburg, Claus Zinn
Max Planck Institute for Psycholinguistics, Nijmegen, The Netherlands
{firstname.lastname}@mpi.nl

Abstract

Metadata plays a crucial role in realizing the eScience vision. Without a proper description of research data, there is little hope of building environments that help researchers find, access, exploit, and preserve the data they need to conduct their studies effectively. In fact, we firmly believe that disciplines without proper metadata management will progress less than those that invest time and effort in managing their work and results at the research community level. Nonetheless, researchers’ awareness of this aspect is often rather poorly developed, and the reasons for this are manifold. Certainly, the belief that “one metadata schema fits all” is misguided. At the other extreme, it is harmful for a community to devise new metadata descriptors and schemas just because existing ones were ignored or perceived as not fitting entirely with the community’s requirements. This, in fact, led to the many schemas existing today, so that the metadata universe can be described as rather fragmented, causing the interoperability problems that so many of us suffer from. This paper describes a novel, computer-supported framework to overcome this metadata schism. It combines the use of controlled vocabularies, managed by a data category registry, with a component-based approach, where the vocabulary terms can be combined to yield complex metadata structures. A metadata scheme devised in this way will thus be firmly grounded in its use of terms. Schema designers will profit from existing prefabricated larger building blocks, motivating re-use at a larger scale. The common base of any two metadata schemes within this framework will solve, at least to a good extent, the semantic interoperability problem, and consequently further promote the systematic use of metadata so that existing resources and tools can be shared.

1. Introduction

The library community is certainly one of the oldest communities dealing with metadata: the Ancient Library of Alexandria, with its hundreds of thousands of scrolls, must have had some organisational principles to facilitate scholars’ access to its resources. Today, the Dublin Core Metadata Element Set (DCMES) is predominantly used to describe books and other forms of publications. Developed in the 1990s, its 15 metadata elements (for the Simple Dublin Core) are deemed sufficient to describe any type of resource. About a decade ago, the IMDI [15] and OLAC [26] initiatives started to systematically assign descriptive metadata to language resources; at DFKI, natural language processing tools were described and stored in the NLP Tools Registry [9]; the Text Encoding Initiative [10] suggested header elements to annotate text at a finer-grained level; and the ENABLER Thematic Network proposed metadata sets for the description of lexica, text corpora, speech resources, multimodal resources and tools [18]. In these initiatives, many new field descriptors, in addition to DCMES, were introduced to describe the complexities of the language resource and technology (LRT) world. Researchers engaged with IMDI and OLAC now have considerable experience in the use of metadata in the LRT field. Despite the initial and ongoing reservations of linguists about investing time in the provision of high-quality metadata, and despite the ongoing debates about the usefulness of suggested metadata elements, there is an increasing awareness of the importance and benefits of descriptive metadata in the LRT community at large. Clearly, some sub-fields of linguistics are more disciplined than others; the field of language documentation, for example, makes extensive use of metadata, also given the role of data preservation issues (see the DoBeS endangered languages project [30]). The need for proper research data management will only increase.
Recently, for instance, the special ESFRI working group on repositories made a clear recommendation [11] that each research infrastructure initiative needs to have a solution for a proper repository infrastructure. Both the eIRG meeting in Paris [12] and the meeting of the Alliance for Permanent Access in Budapest [13] also discussed aspects of data preservation and hinted at the crucial role of high-quality metadata descriptions in this respect. Funding agencies, interested in evaluating research work or in matching research groups, will also rely greatly on metadata to inform their decisions. Political issues aside, an increasing number of researchers are, or will be, confronted with the provision and exploitation of metadata, be it to properly archive the resources they collected, to provide additional material when submitting a journal paper, or to create and browse virtual collections, each described with its (potentially different) metadata set.

The main lesson learned in the past decade is that there is no single metadata scheme that satisfies the needs of all researchers of all (sub-)disciplines, with their many different types of resources but also domain-specific vocabulary and research methods. On the other hand, it is unwise to create new metadata schemes just because existing ones were ignored or did not perfectly fit a community’s requirements. In this paper, we present a novel approach for metadata management. It is based on a controlled vocabulary, centrally managed in a data category registry, to provide the basic building blocks, and a component registry for constructing larger components from existing ones, and for sharing them. The CLARIN research infrastructure [27] will provide those registries, together with tool support, and aims at describing existing LRT resources within this new framework to facilitate their access, re-use, and interoperability.

2. Existing Metadata Solutions

2.1. Element Sets

The Library Sciences look back on a long tradition of managing large amounts of resources. The emergence of standards for cataloguing resources such as MARC [3] and DCMES enabled data exchange across institutional boundaries and the creation of computer-supported catalogue instruments. But it is only with the emergence of interconnected libraries and other repositories that a standardization of metadata sets became essential. The Library Sciences advocated the use of the DCMES, with its 15 metadata descriptors, to describe all kinds of objects. However, the strong emphasis on library-specific expert terminology prevented its wide acceptance in other disciplines. Also, the DCMES is an unstructured set of descriptors, which some felt makes it unsuitable for the description of complex objects.

In the linguistic domain, metadata first appeared in headers of annotation files such as CHAT [24] or ESF [4]. In these first steps, however, the encoding and semantics of the metadata were all corpus-specific, and no attempts were made to cover a wider field of resources. The TEI initiative has been successful in establishing a widely accepted system for text annotation, at least within the Humanities. It also includes metadata fields to describe types of text resources, the TEI header, which is not independent of the TEI annotation format but is widely used nevertheless. The systematic use of DCMES for linguistic resources resulted in OLAC [26], which adds an (extensible) set of descriptors to the DCMES. Although OLAC started with the addition of only a single metadata element (“language identifier”), others followed over the years: codes for discourse type, participant roles, the linguistic field and data type were introduced. OLAC is now the accepted metadata exchange format between language resource archives.

The IMDI metadata scheme aims at addressing the need to describe resources in the linguistic domain; it is suitable for multimedia resources, uses a domain-specific terminology, and supports complex resources. IMDI was devised and used in several international projects. Its development was influenced by a number of international research projects such as ISLE and EAGLES [16], INTERA [5], CGN [25], ECHO [29] and ENABLER [18], and also profited from several years of interaction with various communities of linguistic sub-disciplines such as Sign Language Research. Although IMDI can be used to describe text corpora and lexica, its main strength and its primary use is the detailed description of bundles of tightly related resources of multimedia corpora (for instance, audio and video material stemming from DoBeS endangered language documentation projects). Metadata schemes, or profiles, were created as community-specific extensions, such as those for the Spoken Dutch Corpus and the Sign Language community.

Clearly, there is a tension between the need for sufficiently rich and domain-specific terminology to adequately describe resources and the desire for interoperability, where terms have to be understood by humans (from varying disciplines) and machines alike.
This has led to a kind of rebound effect for metadata description sets: it started with a move from small sets of descriptors with broad semantics to large sets with highly specific descriptors, which was then followed by a backward move; Baker compared this to the linguistic theory of pidginization and creolization [14].

2.2. Semantic Alignment of Metadata Schemes

Given the various metadata schemes, interoperability is facilitated by creating a semantic mapping from one to another. There are well-established alignments between MARC and Dublin Core, and other metadata schemes [13]. In our linguistic domain, alignments need to be created when creating joint metadata domains. We are using the OAI-PMH protocol [6] to harvest metadata from other institutions, and subsequently make it accessible to others. For this, we expect metadata providers to adhere to the formal specifications of the protocol, but also to deliver a mapping of their metadata scheme to OLAC, which serves as a kind of least common denominator. Depending on the mapping’s quality (coverage, correctness of mapping links), valuable information present in the original schema may be lost in the mapping process. All OLAC-based metadata is then converted, without further information loss, into IMDI-based metadata, and then made accessible via the IMDI Browser.

There is promising research in automatically constructing alignments between metadata schemes, as the Dutch STITCH project demonstrated [19, 20]. However, we are not aware of any studies that attempted the automatic alignment of metadata schemes in the linguistic domain, and it is unclear whether there is sufficient high-quality metadata to inform instance-based approaches, or whether lexical approaches (exploiting similarity between concept labels) would work reliably. It seems that any mapping between linguistically motivated metadata schemes would rather profit from manual work, not least because of their moderate size.

While the alignment path is certainly one option to address the interoperability issue, it attacks the symptoms rather than the cause. Having a commonly agreed base set of controlled vocabulary, together with a component-based approach to build complex metadata blocks from simpler ones, is a better path towards achieving a common understanding and consensus across the various communities, and interoperability between resources and tools.
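The information loss that a least-common-denominator mapping can cause may be illustrated with a minimal Python sketch. It maps a record from a richer source scheme onto a smaller target scheme and reports the fields that have no counterpart. All field names, the mapping table and the sample record are hypothetical illustrations; they are not actual IMDI or OLAC elements.

```python
from typing import Dict, List, Tuple

# Hypothetical mapping from a rich source scheme to a small target scheme.
# The field names are illustrative only, not real IMDI or OLAC elements.
SOURCE_TO_TARGET: Dict[str, str] = {
    "session_title": "title",
    "content_language": "language",
    "recording_date": "date",
}


def map_record(record: Dict[str, str],
               mapping: Dict[str, str]) -> Tuple[Dict[str, str], List[str]]:
    """Map a source record onto the target scheme.

    Returns the mapped record together with the sorted names of the source
    fields that have no counterpart in the mapping and are therefore lost.
    """
    mapped: Dict[str, str] = {}
    lost: List[str] = []
    for field_name, value in record.items():
        target = mapping.get(field_name)
        if target is None:
            lost.append(field_name)
        else:
            mapped[target] = value
    return mapped, sorted(lost)


# Hypothetical sample record with one field the target scheme cannot express.
record = {
    "session_title": "Sample session",
    "content_language": "Kilivila",
    "actor_role": "speaker",
}
mapped, lost = map_record(record, SOURCE_TO_TARGET)
print(mapped)
print(lost)
```

Reporting the unmapped fields explicitly is one simple way to judge a mapping's coverage before harvested metadata is converted further.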

3. The ISO Data Category Registry

In 2004, the new ISO group ISO TC37/SC4 [1] was founded with the central topic “management of language resources”. Within this group, descriptive metadata is one of the important dimensions. Its main focus, however, is to establish a central registry of relevant linguistic concepts called the ISO Data Category Registry (DCR). The data model of this registry and the procedures around it are described in the forthcoming revision of the ISO 12620 standard [21]. The previous 1999 version of this standard contained a hard-coded list of data categories commonly used by various standards produced by ISO TC37. With this revision, the hard-coded list is replaced by a registry that can be collectively maintained and used by the linguistic community at large. However, the ISO-standardized subset of data categories in the registry will be guarded by members of TC37, i.e., these data categories will undergo the normal ISO evaluation and standardization process.

The DCR data model allows very detailed data category specifications with an administrative, a descriptive, and a linguistic section. The descriptive section contains at least an English name and definition for the data category, and optional translations of these into other relevant working languages. In the linguistic section, the value domain of a data category is specified and can be restricted for specific object languages. For example, the data category /grammaticalGender/ has as value domain {/masculine/, /feminine/, /neuter/}, but for French this value domain is restricted to {/masculine/, /feminine/}. To support the interchange of selections of data categories, the standard also describes an XML serialization of the data model: the Data Category Interchange Format (DCIF).

Each data category is assigned a persistent identifier (PID), especially suited to be included in metadata and schemata of linguistic resources to foster semantic interoperability. Some schema languages, e.g., TBX XCS and TEI ODD, have built-in support for embedding these PIDs into the schema. More generic schema languages, such as Relax NG and W3C XML Schema, do not. But as these languages are XML-based, they can easily be extended by annotating specific parts of the schema documents with attributes or elements from the DC Reference XML vocabulary (see Annex A in [21] or the ISO 12620 page at [2]).

ISOcat [2] is a reference implementation of ISO 12620. It implements the data model and will support all procedures to use and maintain the DCR. Using its web-based interface, data categories can be created, edited, shared, exported and standardized. In addition to a user interface, ISOcat also supports a RESTful web services API, which allows tools to interact with the system. The lexicon tool LEXUS [22], which supports the Lexical Markup Framework [17], another standard developed by ISO TC37, uses ISOcat’s web services to adorn the LMF metamodel with application-specific data categories.

Fig. 1 shows a screenshot of ISOcat’s web user interface. It has four major parts: (1) the user’s workspace explorer allows users to access selections of data categories created by the TC37 Thematic Domain Groups (TDGs), by other users or by the user herself; (2) the active data category selection shows a table with basic information for each data category, e.g., its name and owner; (3) the full data category specification is shown for the currently selected data category; and (4) the basket area, in which relevant data categories can be collected, exported or saved as a new persistent selection. This area is also used for editing previously saved selections. Depending on the rights of the user, other parts of the application can be accessed, e.g., an editor can be opened if the user is the owner of the selected data category. At the time of writing, social features are being implemented. This functionality will allow users to form groups to share and work together on data category specifications. The ability to comment on a data category or to contact its owner will also be part of the upcoming release. The process for ISO standardization will be implemented at a later stage.
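The restriction of a value domain for a specific object language, as in the /grammaticalGender/ example of this section, can be illustrated with a small Python sketch. The class below is a deliberately simplified stand-in and not the ISO 12620 data model or the DCIF serialization; the class name, attribute names and methods are assumptions made for illustration only.

```python
from dataclasses import dataclass, field
from typing import Dict, FrozenSet, Optional


@dataclass
class DataCategory:
    """Simplified, hypothetical stand-in for a data category with a value domain."""
    identifier: str
    value_domain: FrozenSet[str]
    # Optional restrictions of the value domain, keyed by object language.
    language_restrictions: Dict[str, FrozenSet[str]] = field(default_factory=dict)

    def __post_init__(self) -> None:
        # A restriction may only narrow the general value domain.
        for language, values in self.language_restrictions.items():
            if not values <= self.value_domain:
                raise ValueError(f"restriction for {language} exceeds the value domain")

    def allowed_values(self, object_language: Optional[str] = None) -> FrozenSet[str]:
        """Return the value domain, restricted if the language has a restriction."""
        if object_language is None:
            return self.value_domain
        return self.language_restrictions.get(object_language, self.value_domain)

    def is_valid(self, value: str, object_language: Optional[str] = None) -> bool:
        return value in self.allowed_values(object_language)


# The example from the text: three genders in general, two for French.
gender = DataCategory(
    identifier="/grammaticalGender/",
    value_domain=frozenset({"/masculine/", "/feminine/", "/neuter/"}),
    language_restrictions={"French": frozenset({"/masculine/", "/feminine/"})},
)
print(sorted(gender.allowed_values()))
print(sorted(gender.allowed_values("French")))
print(gender.is_valid("/neuter/", "French"))
```

Languages without a registered restriction fall back to the general value domain, which mirrors the idea that restrictions are optional refinements.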

Figure 1. ISOcat, the reference implementation of the ISO DCR

4. Metadata Component Model

The ISO DCR serves as a common ground for terminology use and shall become a reference to achieve semantic interoperability. In this section we complement the controlled vocabulary registry with a component registry to build a flexible but grounded metadata framework.

4.1. Model

A metadata component describes different aspects or dimensions of a resource. Such a component is basically a collection of metadata fields. Each field refers via a URI to exactly one data category in the ISO DCR, thus indicating unambiguously how the content of the field in a metadata description should be interpreted. Components can have a recursive structure: next to the atomic fields, they may contain other components. They are expressed as XML files; an example component description can be found in Fig. 2(a), with the graphical representation in Fig. 2(b); here, fields are marked with dotted lines, components with a solid line. Any number of components can be combined with a header element into a metadata profile. A profile, also represented as an XML file, provides a blueprint for the personalised metadata schema. It can be converted into a W3C XML schema with an XSLT transformation; references to all occurring data categories will be maintained. Finally, the XML files containing the metadata descriptions can be created, using the generated XML schema to check the formal correctness of the data. A fragment of such an example description is shown in Fig. 2(c). As each metadata description will contain a link to its W3C schema, its validity can be checked and the data categories used in the description can be retrieved.

CLARIN will suggest a number of recommended components that will be made available in a component registry. But users can use and create their own components, too. The fundamental requirement for all components is that the metadata elements that are used explicitly refer to concepts registered in ISOcat or other trusted registries. If a user wants to include an element which is not yet registered, since she feels that it is necessary for the proper description of the resource, she would need to register the new concept at least in the so-called ISOcat “user space”. The ISOcat process will then decide whether the new category will be integrated in the official part of the registry. CLARIN will be strict and only accept categories that are registered in accepted registries, since otherwise no semantic interoperability can be established.

Fig. 3 describes the interaction between the data category registries, the component registry, the component editor and the metadata editor. The design aims at promoting the reuse of existing fields and components, but also gives users the opportunity to create new ones within the well-defined ISO DCR process. Naturally, we anticipate the creation of different metadata profiles for different communities, e.g., sign-language researchers will need to describe a video signal differently than multimodality experts.
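The recursive structure of components can be sketched in a few lines of Python. This is only an in-memory illustration of the idea that every field points to exactly one data category and that components may nest; the actual component descriptions are XML files. The class names, field names and the example.org URIs are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class MetadataField:
    """An atomic field that refers to exactly one data category by URI."""
    name: str
    data_category_uri: str


@dataclass
class Component:
    """A collection of fields that may also contain other components."""
    name: str
    fields: List[MetadataField] = field(default_factory=list)
    components: List["Component"] = field(default_factory=list)

    def data_categories(self) -> Set[str]:
        """Collect the data category URIs used in this component, recursively."""
        uris = {f.data_category_uri for f in self.fields}
        for child in self.components:
            uris |= child.data_categories()
        return uris


# Hypothetical component with one nested component; URIs are placeholders.
actor = Component(
    name="Actor",
    fields=[
        MetadataField("firstName", "http://example.org/dcr/firstName"),
        MetadataField("lastName", "http://example.org/dcr/lastName"),
    ],
    components=[
        Component(
            name="ActorLanguage",
            fields=[MetadataField("languageName", "http://example.org/dcr/languageName")],
        )
    ],
)
print(sorted(actor.data_categories()))
```

Collecting the URIs recursively corresponds to the property stated above that the data categories used in a description can always be retrieved.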

Figure 2. A metadata component description: (a) XML representation; (b) schematic representation; (c) Actor instance in XML (text remaining from the instance: Foo Bar Kilivila French).

Clearly, the principles behind this model need to be enforced by the accompanying end user software: the component editor will check whether all elements used are indeed taken from ISOcat or another trusted registry, and it will interact with the component registry to store final component ensembles or profiles and make them re-usable.
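The check that a component editor has to perform can be sketched as follows. This Python fragment is an assumption-laden simplification: it treats a data category reference as trusted if its URI starts with the base URI of a trusted registry, whereas a real editor would look the identifier up in the registry itself. The registry prefix and the field names are hypothetical.

```python
from typing import Iterable, List, Optional, Tuple

# Hypothetical base URIs of trusted registries. Real references would be the
# persistent identifiers issued by ISOcat or another accepted registry.
TRUSTED_REGISTRY_PREFIXES: Tuple[str, ...] = ("http://example.org/isocat/",)


def unregistered_fields(
    fields: Iterable[Tuple[str, Optional[str]]],
    trusted_prefixes: Tuple[str, ...] = TRUSTED_REGISTRY_PREFIXES,
) -> List[str]:
    """Return the names of fields whose data category URI is missing or untrusted.

    Each item in `fields` is a (field name, data category URI) pair.
    """
    return [
        name
        for name, uri in fields
        if not uri or not uri.startswith(trusted_prefixes)
    ]


# Hypothetical fields: one registered, one from an unknown source, one without a URI.
candidate_fields = [
    ("firstName", "http://example.org/isocat/DC-0001"),
    ("shoeSize", "http://example.org/private/shoeSize"),
    ("nickname", ""),
]
print(unregistered_fields(candidate_fields))
```

A component would only be stored in the component registry if this list is empty.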

4.2. Initial Set of Metadata Categories

Tapping into the rich and long-standing experience of CLARIN members with different category sets for different resource types across a broad range of subdisciplines, a process was established within the CLARIN network to identify a first set of basic concepts to be included in the ISO DCR. An initial list of 135 data categories with their definitions (and partly, with constraints) was created [7]; they describe the resources’ creation process, their content and the participants contributing to the content, formal characteristics of the resources, and the resources’ identity and access. Taken together, they should provide a good basis for describing all major linguistic resource types (such as media resources, textual resources and corpora, annotations, and lexica) as well as tools and web services.

In the first selection of data categories, special attention was given to profile matching, an automatic process that identifies suitable processing components for a given resource (e.g., finding a parser tool that can cope with a given resource’s text format). Here, CLARIN aims at building metadata-based support for the construction of workflow chains where the output of an operation must be accepted as input by the subsequent operation. Machines must be able to read the description of a resource and compare its content with the required input of a web service to determine whether the processing pipeline can lead to useful results. Categories describing resources at different granularities, ranging from the very general “resource type” to the very specific “used tag set”, have been devised.

It is worth noting that data category definitions should be largely independent of the contexts in which they can occur. So far, we have thus chosen fine-grained semantics for those elements that we expect to be consumed by machine processing, searching or profile matching. For those categories aimed at human consumption, the semantic scope and granularity are not as critical, as humans can use context quite efficiently to disambiguate between various meanings.

In a next step, these data categories will be entered into ISOcat, providing all information required by the ISO 12620 model including, for instance, translations of all terms into various European languages. Once a data category is entered in ISOcat, an ISO-controlled process is in place, which may finally promote it to an accepted standard. The 12620 model provides a flat list of categories. No relations are included in the registry, except for more generic terms that are part of the definition (e.g., a transitive verb is a verb). Since relations often depend on their use in context, the ISO committee decided to separate them from the definitions. For this, the notion of a relation registry has been proposed [23], constituting a store for expressing relations between entries of data category registries, and thus imposing some structure. We anticipate that the contents of the relation registry will support semantic search (see below).
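Profile matching for workflow chains, as described in this section, can be sketched in Python under simple assumptions: a resource is described by a mapping from data categories to values, each service states which values it accepts per category, and each service declares how it changes the description of its output. The category names, values and services below are hypothetical and do not come from the CLARIN category list.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set

Description = Dict[str, str]  # data category -> value


@dataclass
class Service:
    """Hypothetical description of a processing service."""
    name: str
    requires: Dict[str, Set[str]]                 # category -> accepted values
    produces: Description = field(default_factory=dict)  # categories changed by the output


def accepts(service: Service, description: Description) -> bool:
    """True if the description satisfies every requirement of the service."""
    return all(
        description.get(category) in accepted
        for category, accepted in service.requires.items()
    )


def chain_is_valid(resource: Description, chain: List[Service]) -> bool:
    """Check that each service accepts the output of the previous step."""
    current = dict(resource)
    for service in chain:
        if not accepts(service, current):
            return False
        current.update(service.produces)
    return True


# Hypothetical resource and services.
corpus = {"resourceType": "text", "format": "plain"}
tokenizer = Service("tokenizer", {"resourceType": {"text"}, "format": {"plain"}},
                    {"format": "tokenized"})
parser = Service("parser", {"format": {"tokenized"}}, {"format": "parsed"})

print(chain_is_valid(corpus, [tokenizer, parser]))
print(chain_is_valid(corpus, [parser]))
```

The comparison only works if resource and service descriptions use the same categories with the same meaning, which is why the categories consumed by machines need fine-grained semantics.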

Figure 3. Interaction of registries and editors

4.3. Initial Set of Components

We are currently in the process of “deconstructing” the IMDI and OLAC metadata schemes into components. All of their field descriptors are being entered in an ISOcat workspace, and re-usable components that mirror existing IMDI and OLAC structures are being defined and stored. This modularisation of the IMDI and OLAC metadata schemes should provide the most typical components first. Users can then select and re-assemble them to fit their needs, and we expect users to re-use more than to create. CLARIN will also offer more complex components, for instance to describe lexica for the purpose of natural language processing, or lexica for the purpose of language documentation and field work. The two profiles will differ, but will also largely share a common base of descriptors and components. CLARIN also plans to offer predefined metadata components and profiles to support the aggregation of closely related resources (e.g., annotations, media files, digital bookmarks) into compound objects up to the scale of virtual collections [28]. At the time of writing, our component editor is a standard XML editor; schemas enforce the formal correctness of all components created. The component registry is in its early stages and is based on the SVN versioning system.

5. Building the CLARIN Infrastructure

Currently, we are collecting metadata descriptions from the CLARIN language resource and tools inventory, the IMDI resources, the OLAC resources, part of the ELRA catalogue, the DFKI registry, the DELAMAN network, the UNESCO list on endangered languages and a few others. Here, the aim is to get a good grip on the various schemes and on how they are used to describe language resources and tools. Some of the vocabulary will make its way to ISOcat, and an analysis of the various schemes will also inform the design of larger, reusable building blocks for the component registry, similar to the aforementioned “deconstruction” of the IMDI and OLAC schemas. For the provision of new metadata, CLARIN will enforce a policy where metadata will only be harvested when the terminology of its field descriptors is registered in ISOcat or another trusted data category registry. To further promote the need for and uptake of metadata to describe research data, we will launch an initial version of the CLARIN Virtual Language Observatory at the 2009 NEERI conference [28]. The observatory will support access to language resources and tools via three modes of presentation: via a normal catalogue, via a facetted browser, and via a geographic representation (Google Earth/Maps). Figure 4 shows the current state of two of the interfaces.

Figure 4. Two of the presentation modes: (a) facetted browsing; (b) geographical browsing.

The provision of effective tools for metadata management is crucial. The implementation of the ISOcat web application is close to completion, and its current feature set is being tested thoroughly. We will also build a special-purpose component editor with a user-friendly interface to ISOcat’s REST services and the component registry. We are also investigating how existing tools such as search engines can be adapted to make use of the new metadata framework in the context of federated metadata domains. Fig. 5 gives an overview of the search process, where metadata-based user queries are transparently evaluated against the joint metadata repository. During this process, equivalent metadata fields in different schemes are detected via the data category registries and the relation registry to increase the search recall. Clearly, our visualization tools (facetted and geographic browsing) will also profit from this semantic search.
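How data category references and a relation registry can increase search recall may be sketched as follows. The Python fragment assumes that every field of every scheme is linked to one data category and that the relation registry lists pairs of categories to be treated as equivalent for search; the expansion is a single step rather than a transitive closure. Scheme names, field names, category identifiers and records are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Set, Tuple


@dataclass
class Record:
    identifier: str
    scheme: str
    fields: Dict[str, str]  # field name -> value


# Hypothetical links from (scheme, field name) to a data category identifier.
FIELD_TO_CATEGORY: Dict[Tuple[str, str], str] = {
    ("schemeA", "Language"): "DC-language",
    ("schemeB", "subject.language"): "DC-subjectLanguage",
    ("schemeB", "title"): "DC-title",
}

# Hypothetical relation registry: categories treated as equivalent for search.
RELATIONS: List[Tuple[str, str]] = [("DC-language", "DC-subjectLanguage")]


def expand_category(category: str, relations: List[Tuple[str, str]]) -> Set[str]:
    """Return the category plus all categories directly related to it."""
    expanded = {category}
    for left, right in relations:
        if left == category:
            expanded.add(right)
        elif right == category:
            expanded.add(left)
    return expanded


def search(records: List[Record], category: str, value: str) -> List[str]:
    """Find records with the given value in any field linked to a related category."""
    categories = expand_category(category, RELATIONS)
    hits = []
    for record in records:
        for field_name, field_value in record.fields.items():
            linked = FIELD_TO_CATEGORY.get((record.scheme, field_name))
            if linked in categories and field_value == value:
                hits.append(record.identifier)
                break
    return hits


records = [
    Record("r1", "schemeA", {"Language": "Kilivila"}),
    Record("r2", "schemeB", {"subject.language": "Kilivila", "title": "Sample"}),
    Record("r3", "schemeB", {"title": "Kilivila"}),
]
print(search(records, "DC-language", "Kilivila"))
```

Without the relation between the two language categories, the query would miss the record described in the second scheme; the title field of the third record is not matched because its category is unrelated.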

6. Conclusion

CLARIN is moving ahead on a number of parallel tracks to arrive at a DCR-based and component-based metadata infrastructure with a critical mass of representative metadata descriptions for language resources and tools. The investments in the new CLARIN component metadata infrastructure will pay off, given the large number of existing language resources and tools, and the crucial role of metadata in accessing and exploiting them. The realization of workflow chains, supported by profile matching, will further push the need for detailed metadata, but our registries framework is flexible enough to cope with evolving requirements.

Our new approach to metadata management is largely discipline-independent. This also holds for the data category registry model, which is compliant with the ISO norm ISO/IEC 11179 [8], a very broad standard defining a generic framework for metadata registries. The ISO 12620 model, of course, has been designed to address the needs for metadata to describe language resources and tools, and to make its creation and maintenance a feasible task. For disciplines or applications that require a fully-fledged ontology based on some deep conceptual domain modeling, the ISO 12620 model may be too restrictive. In fact, our approach does not advocate the (traditional) use of ontologies; rather, concept definitions should be clearly separated from relations where possible so that some flat reference terminology for a given domain can be established first, thus allowing others to refer to the registered categories and to make use of them in various contexts.

Other disciplines may want to run the ISOcat software independently of the ISOcat registry for language resources and tools management. In fact, the software’s design allows users to create their own instance, allowing any research community to define its domain-specific vocabulary, and to share it with others, as long as the general ISO decision process (built into the software) is followed. The component model for metadata is also inherently domain-independent, so that interested parties could also use the component software in addition to the ISOcat software. Given the generality of our approach, we are currently evaluating its use for building a joint metadata domain across some other institutions of the Max Planck Society.

Figure 5. Semantic Search in Joint Metadata Domains

References

[1] http://www.tc37sc4.org/.
[2] http://www.isocat.org/.
[3] http://www.oclc.org/programs/ourwork/renovating/changingmetadata/marcgroup.htm.
[4] The ESF second language database. http://hdl.handle.net/1839/00-0000-0000-0005-6634-C.
[5] http://www.mpi.nl/INTERA/.
[6] http://www.openarchives.org.
[7] http://www.clarin.eu/view_datcats.
[8] http://metadata-standards.org/11179.
[9] http://registry.dfki.de/.
[10] http://www.tei-c.org.
[11] ESFRI working group about digital repositories – position paper. ftp://ftp.cordis.europa.eu/pub/esfri/docs/digital_repositories_working_group.pdf.
[12] http://www.e-irg.eu/images/stories/reports/paris_workshop_summary.pdf.
[13] http://www.alliancepermanentaccess.eu/index.php?id=3.
[14] T. Baker. Languages for Dublin Core. D-Lib Magazine, 4:12, 1998.
[15] D. Broeder and P. Wittenburg. The IMDI metadata framework, its current application and future direction. International Journal of Metadata, Semantics and Ontologies, 1(2):119–132, 2006.
[16] N. Calzolari, A. Zampolli, and A. Lenci. Towards a standard for a multilingual lexical entry: The EAGLES/ISLE initiative. Lecture Notes in Computer Science, pages 264–279, 2002.
[17] G. Francopoulo, N. Bel, M. George, N. Calzolari, M. Monachini, M. Pet, and C. Soria. Lexical Markup Framework: ISO standard for semantic information in NLP lexicons. In Proceedings of the 5th International Conference on Language Resources and Evaluation (LREC 2006), 2006.
[18] M. Gavrilidou and E. Desypri. D2.2 – report for the definition of common metadata description for the various types of national LRs. ENABLER Network, see http://www.ilc.cnr.it/enabler-network/reports.htm.
[19] A. Isaac, S. Schlobach, H. Matthezing, and C. Zinn. Integrated access to cultural heritage resources through representation and alignment of controlled vocabularies. Library Review, 57(3):187–199, 2008.
[20] A. Isaac, S. Wang, C. Zinn, H. Matthezing, L. van der Meij, and S. Schlobach. Evaluating thesaurus alignments for semantic interoperability in the library domain. IEEE Intelligent Systems, 24(2):76–86, 2009.
[21] ISO 12620. Computer Applications in Terminology – Data Categories – Specification of Data Categories and Management of a Data Category Registry for Language Resources. ISO, Geneva, Switzerland, 2009. Forthcoming.
[22] M. Kemps-Snijders, M.-J. Nederhof, and P. Wittenburg. LEXUS, a web-based tool for manipulating lexical resources. In Proceedings of the 5th International Conference on Language Resources and Evaluation (LREC 2006), 2006.
[23] M. Kemps-Snijders, M. Windhouwer, and S. Wright. Putting data categories in their semantic context. In IEEE eHumanities Workshop, Indianapolis, Indiana, USA, December 2008.
[24] B. MacWhinney. The CHILDES project: Tools for analyzing talk. Lawrence Erlbaum Associates, 2000.
[25] N. Oostdijk. The Spoken Dutch Corpus: Overview and first evaluation. In Proceedings LREC, pages 887–894, 2000.
[26] G. Simons and S. Bird. The open language archives community: An infrastructure for distributed archiving of language resources, June 2003.
[27] T. Váradi, S. Krauwer, P. Wittenburg, M. Wynne, and K. Koskenniemi. CLARIN: Common language resources and technology infrastructure. Proceedings of the sixth international language resources and evaluation (LREC08). Marrakech, Morocco, 2008.
[28] CLARIN shortguide: Virtual collections. http://www.clarin.eu/files/virtual_collections-CLARIN-ShortGuide.pdf.
[29] P. Wittenburg. Echo: Wp2 task and mission. Technical report, MPI for Psycholinguistics, 2003. http://www.mpi.nl/echo/html/mission-tr1.html.
[30] P. Wittenburg. The DoBeS model of language documentation. Language Documentation and Description, 1:122–140, 2003.
