European Initiatives to Promote Cooperation between Speech and Text Communities

Nicoletta Calzolari
Istituto di Linguistica Computazionale (ILC) CNR, Pisa, Italy

Abstract

In this paper I touch on a few initiatives recently promoted and carried out in Europe with the aim of strengthening cooperation between the communities of Spoken and Written language processing, which until now have been mostly separate. I offer some thoughts on the current state of the art with respect to initiatives common to the two areas, in particular within ENABLER, ELSNET, and ELRA. In doing so I reflect on how we should proceed so that the two fields can together meet the global requirements of the multilingual information society. This may become the new ‘vision’ for LRs in the coming years.

1. Introduction

In this position paper on cooperation between speech and text I do not give technical or implementation details. Rather ambitiously, and without aiming at an organic, let alone exhaustive, picture, I try to give an overview of the current state of cooperation between Spoken and Written (S&W) Language Resources (LR) in Europe, in particular as it emerges from a number of representative European initiatives. After briefly highlighting the strategic and infrastructural role of LRs within any Human Language Technology (HLT) application, I focus on the main activities and results of the ENABLER (European National Activities for Basic Language Resources) Thematic Network, because they deliberately address both the S&W communities. After mentioning a common initiative with ELSNET, I sketch a few initiatives within ELRA that are relevant for promoting a unified view of our sector. I then provide recommendations towards the design of general strategies and an overall coordination for the field of LRs as a whole, which is critical to satisfying some of the requirements of the multilingual information society.

At the same time, implicitly, this overview i) touches on a number of aspects that are relevant when considering the infrastructural role of S&W LRs, ii) underlines some of the circumstances, conditions and attitudes that are specific to the European approach, and iii) sketches the current situation in Europe with respect to implementing a common infrastructure. My objective is to show that the initiatives which are, or can be, launched in the EU and world-wide need an underlying global strategy, and that a global vision is necessary to achieve more coherent and useful results.

2. The strategic infrastructural role of LRs

LRs (Written, Spoken and, recently, Multimodal) are a central and strategic component of the so-called “linguistic infrastructure” (the other key element being Evaluation), necessary for the development of any HLT application and product. The availability of adequate LRs for as many languages as possible is a prerequisite for the development of a truly multilingual Information Society. As a horizontal technology, LRs play a critical role in different areas of the EU 6th Framework Programme (FP), and have been recognized as a priority within a number of national projects around Europe. The availability of LRs is also a “sensitive” issue: it touches directly on the sphere of linguistic and cultural identity, and it has economic, societal and political implications. This will be even more true in the new Europe, with 25 languages on a par.

2.1. The role of the EU S&W LR projects

The EU S&W LR projects of the last decade fostered a broad awareness in the EU of which aspects could be settled by consensual agreement and which still posed harder theoretical or technological problems with respect to the state of the art. These projects helped to create a more homogeneous community in the EU by compelling researchers from different countries and from public and private organisations to work together. They were also effective in spotting and bringing to light a number of commonly felt needs for which some solution had to be found. It was seen as a waste of money, effort and time that every new project started by building again and again (the same type of) fragments of LRs, without reusing what was already available, while the LRs produced by the projects, even though small, were usually forgotten and left unused. From this the notion of “reusability” arose [1] [2]. Pooling these requirements together made the emergence of the concept of LRs as an infrastructural type of resource necessary, and almost inevitable.

3. The ENABLER thematic network

The ENABLER Thematic Network of HLT National Projects in European countries, an EC-funded IST project designed and initiated by Antonio Zampolli with a clear strategic vision for the field of LRs [3], is the first European initiative whose mission is to explicitly consider together the technical, organizational, strategic and political issues of LRs. In ENABLER these various aspects are brought together in a coherent framework [4] [5], in order to set up a medium- and long-term set of priorities (both technical and strategic) and to promote these at the national and international levels. Moreover, ENABLER has recognized the importance of promoting actions aimed at integrating the different resource types, until now developed independently, and consequently at improving cooperation between the communities of Speech, Text and Multimodality.

In the following I briefly highlight the main issues tackled by ENABLER on the different layers (for more details see http://www.enabler-network.org/).

3.1. The survey of LRs

The ENABLER Consortium conducted a survey of LRs [6] to get a global picture of the situation of LRs, to compare the various conditions that hold across different languages and across speech and text, and, on this basis, to suggest sounder recommendations. The survey provides an overview of the results of National Projects and activities on LRs of different types (written, spoken, multimodal, lexical resources and related tools). The results, covering 164 resources from various countries and languages, concern all the facets needed to properly describe LRs: resource production issues (collection, processing, annotation, validation), legal issues (copyright, availability), descriptive issues (type, language coverage, content) and the distribution policies adopted by the LR-producing organizations. The survey aimed not only at collecting information on LR activities, but also at harmonizing their descriptions and thereby at arriving at a common metadata schema, again encompassing S&W LRs.

3.2. Compatibility and interoperability of LRs: standards and multilinguality

To facilitate the integration of the LRs and tools resulting from all the various LR initiatives of the last decade and, at the same time, to make word-content machine understandable, as is the aim of the Semantic Web vision, three critical issues must be addressed:

1. standards, critical to achieve the interoperability needed for effective integration;
2. content, as the crucial information to be represented is semantic information;
3. multilinguality, a critical issue for the immediate future.

Multilinguality is also a strong integrating factor, horizontal with respect to different application areas and S&W LR types. It involves not only harmonised technical decisions, but also heavy organisational aspects, which can only be taken into account at the supranational level. One way to optimise the production and sharing of (multilingual) LRs is a common, standardized framework that ensures linguistic information is encoded in such a way as to guarantee its reusability in different applications and tasks.

ENABLER promoted the compatibility and interoperability of LRs mainly through cooperative work with: i) ISLE/EAGLES (http://www.ilc.cnr.it/EAGLES96/isle/), for harmonisation of linguistic specifications, in particular for corpora and multilingual lexicons (based on MILE, the Multilingual ISLE Lexical Entry [7]); ii) ISO TC37 SC4 WG4, to make European standards truly international ISO Standards; iii) the ELRA Validation Committee, for the incorporation of agreed standards in protocols for the validation of LRs, both Spoken and Written; iv) INTERA, for the harmonisation of metadata descriptions; v) Semantic Web communities, to promote synergy between the groups of knowledge management/ontology and of HLT/LR technology.

3.3. Strategic issues and international research infrastructures for S&W LRs

Important ENABLER goals were: to provide recommendations for strategic initiatives to be promoted for LR production and management, to address the main priorities for LRs, and to define a strategy for LRs in the coming years. Two main lines have been highlighted:

1. infrastructural initiatives – ENABLER has promoted the creation of a new international infrastructure for LRs;
2. coordination initiatives – these concern both the national and the transnational and transcontinental dimensions.

3.3.1. A common Roadmap for S&W LRs and HLT

The workshop “International Roadmap for Language Resources”, organised by ENABLER and ELSNET in Paris in August 2003, laid the basis for building a roadmap for LRs. A first list of priorities, which are critical issues for the future of both S&W LRs, was drawn up:

• provide basic LRs for a larger set of languages;
• increase multilingual LRs;
• reduce development time of LRs;
• enhance LR content interoperability;
• foster synergies between S&W areas and with neighbouring areas (e.g. terminology, Semantic Web);
• develop new methodologies and tools for LR management, quick domain and application adaptation, data-driven tuning, etc.

Another Roadmap Building meeting, as a common enterprise of the S&W communities, is held in conjunction with LREC 2004 in Lisbon.

3.3.2. From BLARK to ELARK

ENABLER has adopted and strongly supported the BLARK (Basic LAnguage Resource Kit) concept, first launched through ELSNET [8] and Nederlandse Taalunie [9]. Promoting BLARK requires us to:

• specify for every language the minimum set of LRs (in terms of S&W corpora, lexicons, basic tools to manipulate them, skills required, etc.) needed to do any pre-competitive research for that language;
• spot the actual gaps to be filled (a matrix highlighting the gaps in LRs for many applications and languages will soon be accessible and modifiable directly from the ELRA Web site, to enable customers or providers of LRs to fill it in, to identify available LRs, and to promote the production of new LRs);
• present a summary of the technical, operational and organisational problems to be tackled, and provide suggestions for an overall organisational framework for international cooperation.
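The gap matrix mentioned in the second point can be pictured with a minimal sketch. Everything in it is an assumption made for illustration: the resource types in `BLARK_KIT`, the per-language inventory, and the function names are hypothetical and do not reproduce the actual ELRA matrix or any official BLARK definition.

```python
from typing import Dict, List, Set

# Hypothetical minimum kit: resource types every language should have.
# The real BLARK is specified per language; these names are illustrative only.
BLARK_KIT: List[str] = ["written_corpus", "spoken_corpus", "lexicon", "basic_tools"]

# Hypothetical inventory of resource types known to exist for each language.
available: Dict[str, Set[str]] = {
    "Dutch": {"written_corpus", "spoken_corpus", "lexicon", "basic_tools"},
    "Italian": {"written_corpus", "lexicon"},
    "Maltese": {"written_corpus"},
}


def gap_matrix(kit: List[str], inventory: Dict[str, Set[str]]) -> Dict[str, Dict[str, bool]]:
    """Return {language: {resource_type: True if available, False if missing}}."""
    return {lang: {res: res in have for res in kit} for lang, have in inventory.items()}


def gaps(kit: List[str], inventory: Dict[str, Set[str]]) -> Dict[str, List[str]]:
    """Return {language: missing resource types, listed in kit order}."""
    return {lang: [res for res in kit if res not in have] for lang, have in inventory.items()}


matrix = gap_matrix(BLARK_KIT, available)
missing = gaps(BLARK_KIT, available)

# Print one row per language: "x" marks an available resource, "." a gap.
for lang, row in matrix.items():
    cells = " ".join("x" if row[res] else "." for res in BLARK_KIT)
    print(f"{lang:<8} {cells}")
```

A table of this shape lets a provider mark a cell as filled once a resource is produced, and lets a customer read off which resource types are still missing for a given language.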

Moreover, BLARK must be considered an evolving notion. A further level is defined as the Extended LAnguage Resource Kit (ELARK), which will be promoted extensively to encourage its wider adoption [10].

3.3.3. An open and distributed framework for LRs

The need for ever-growing LRs, attested also by current US funding strategies, led us to propose and promote a change in the overall model of how to build, maintain and share LRs. In particular, a new paradigm is required and proposed to make the Web usable: an open, distributed and collaborative language infrastructure, based on open content interoperability standards. Semantic Web developers will need repositories of words and terms, and machine-understandable knowledge about their relations within language use and ontological classification. The cost of adding this structured information can be one of the factors that delay its full deployment. Making available millions of ‘words’ for dozens of languages is an effort that no small group can afford. Existing experience in LR development shows that such a challenge can be tackled only by pursuing, on the organisational side, a truly interdisciplinary and cooperative approach, and by establishing, on the technical side, a highly advanced environment for the representation and acquisition of linguistic information, open to the reuse and interchange of linguistic data.

We promote the launch of a large initiative, comprising the major LR and HLT groups in Europe and world-wide, for the creation of an open and distributed infrastructure for LRs. The outcome of such an initiative could be the design of a completely “new generation” of LRs. An important Declaration on Open Access to LRs was endorsed by all participants of the ENABLER/ELSNET Workshop.

The Linguistic Infrastructure supported by ENABLER intends to contribute to the structuring and integration of the European Research Area, addressing problems such as the fragmentation of its research base and the weakness in converting R&D results into useful economic or societal benefits. To this aim, we claim it is necessary to pool together and build on many different, but related, S&W initiatives.

3.4. Contributing to the design of an overall coordination and strategy in the field of LRs

International cooperation will certainly be the most important factor for the field of LRs, and consequently of HLT, in the coming years. A report produced by ELDA [11] presents an analysis of several organisational frameworks, focussing on funding and organisational procedures to provide LRs. The prerequisites to be addressed for the production of interoperable LRs in a cooperative framework belong to different layers: technical (specifications), validation (quality assessment), legal, commercial. In order to fill the gaps in terms of LRs, cooperation on all the organisational, funding, technical and commercial issues combined appears to be necessary. Strengthening such cooperation undoubtedly requires an effort to coordinate it. A coordinated operation was already launched in the framework of Speech LRs with the creation of COCOSDA (International Committee for the Coordination and Standardisation of Speech Databases and Assessment Techniques). The major strategic outcomes of ENABLER with respect to international cooperation are the following.

3.4.1. ICCWLRE

A new committee, originally conceived by Antonio Zampolli, has been established in the field of Written LRs: the International Coordination Committee for Written LRs and Evaluation (ICCWLRE). It provides the optimal environment to continue (part of) the ENABLER mission while enlarging its scope beyond European boundaries. Tasks for this Committee include: information dissemination on LRs and standards; promotion, coordination, and enabling activities; copyright and IPR; training and methodology for LR creation and validation; roadmaps for LRs; political and strategic initiatives. The first joint meeting of COCOSDA and ICCWLRE is organised as a satellite event at LREC 2004, with the goal of building a Roadmap for LRs as a joint effort of the S&W communities, fostering future synergies between them.

3.4.2. LangNet

Last but not least, an initiative called LangNet is being proposed in the framework of the ERA-Net scheme of the 6th FP of the European Commission (EC) to coordinate national initiatives in HLT all over Europe. LangNet puts itself forward as the most natural environment in which to continue the efforts of the ENABLER Network and sustain the momentum it has gained. Language Technologies seem to be especially well suited to the ERA-Net scheme, on the assumption that each country wishes to conduct research activities allowing the development of systems and applications for its language(s). It therefore seems natural that the individual countries take care of all the “(S&W) language-dependent” aspects and that the EC takes care of all the generic, “language-independent” aspects, in agreement with the principle of subsidiarity. But in order to avoid a two-speed Europe [12], linguistically speaking, coordination should be established between the EC and the member states, and strategies should be drawn up to ensure a proper balance of language coverage in Europe. The idea behind these initiatives is to establish some sort of permanent coordination that builds on existing parallel (national or international) initiatives.

4. The ELRA role

ELRA, as a promoter of infrastructures for LRs, also has in its mission the production and validation of LRs and the promotion of standards. Both the Validation and Production Committees have started experiments aimed at defining common standards for S&W LRs, to overcome existing barriers between independently built S&W LRs. This is the first step towards innovative methods of building and acquiring S&W LRs according to individual requirements, starting from available repositories. It thus helps to overcome the current fragmentation of LRs, while capitalising on and reusing results from previous EC and national projects and standardization activities.

4.1. The ELRA Unified Lexicon experiment

A small but interesting experiment was the Unified Lexicon (UL). From a methodological perspective, it aimed at investigating the feasibility of, and defining methods and procedures for, pooling and unifying independently created lexical resources, combining sources with complementary information and different formats at reasonable cost in terms of both human effort and computational techniques. A merging experiment was carried out on two Italian lexicons available at ILC [13]:

• a pronunciation lexicon, part of the DMI [14], an Italian Machine Dictionary, containing, among other data, the phonological encoding;
• the Italian morphological module of the multi-layered PAROLE/CLIPS lexicon [15].
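A merge of this kind can be sketched roughly as follows. This is only a toy illustration under stated assumptions, not the procedure used in the UL experiment [13]: the two dictionaries, their field names, the word-form key, the `UnifiedEntry` class and the simplified transcriptions are all hypothetical, and neither the DMI nor the PAROLE/CLIPS format is reproduced here.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class UnifiedEntry:
    """One merged record, keyed by word-form (hypothetical layout)."""
    word_form: str
    sampa: Optional[str] = None      # only in the pronunciation source
    lemma: Optional[str] = None      # only in the morphological source
    features: Optional[str] = None   # only in the morphological source
    pos: Optional[str] = None        # present in both sources (overlap)
    conflicts: List[str] = field(default_factory=list)


# Toy pronunciation lexicon; transcriptions are simplified and illustrative.
pron_lexicon: Dict[str, Dict[str, str]] = {
    "casa": {"sampa": "kasa", "pos": "N"},
    "cane": {"sampa": "kane", "pos": "N"},
    "porta": {"sampa": "pOrta", "pos": "N"},
}

# Toy morphological lexicon with a different, partly overlapping field set.
morph_lexicon: Dict[str, Dict[str, str]] = {
    "casa": {"lemma": "casa", "pos": "N", "features": "fem.sing"},
    "parlo": {"lemma": "parlare", "pos": "V", "features": "ind.pres.1.sing"},
    "porta": {"lemma": "portare", "pos": "V", "features": "ind.pres.3.sing"},
}


def merge(pron: Dict[str, Dict[str, str]],
          morph: Dict[str, Dict[str, str]]) -> Dict[str, UnifiedEntry]:
    """Union the two sources by word-form and flag disagreements on shared fields."""
    unified: Dict[str, UnifiedEntry] = {}
    for form in sorted(set(pron) | set(morph)):
        p = pron.get(form, {})
        m = morph.get(form, {})
        entry = UnifiedEntry(
            word_form=form,
            sampa=p.get("sampa"),
            lemma=m.get("lemma"),
            features=m.get("features"),
        )
        p_pos, m_pos = p.get("pos"), m.get("pos")
        if p_pos and m_pos and p_pos != m_pos:
            # Overlapping data that disagree are recorded, not silently dropped.
            entry.conflicts.append(f"pos: {p_pos} vs {m_pos}")
        entry.pos = m_pos or p_pos
        unified[form] = entry
    return unified


unified = merge(pron_lexicon, morph_lexicon)
for entry in unified.values():
    print(entry)
```

The design choice worth noting is that disagreements on the shared field are kept as explicit conflict notes for later human review, since a real merge would need linguistic judgement (here "porta" is both a noun form and a verb form) rather than an automatic winner.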

As befits an ideal test for this kind of experiment, each source contains information not present in the other, and some data overlap. Besides carrying complementary information, the two lexicons also use different formats. The UL makes it possible to import a new module into the CLIPS relational database: the phonological repository of word-forms with their pronunciation encoding (both in the DMI proprietary format and in the SAMPA phonetic alphabet: www.phon.ucl.ac.uk). Via the morphological unit(s), this module is connected to further linguistic layers, yielding a complete XML entry, from phonology to semantics. On the ILC Web site, a portion of the CLIPS syntactic and semantic data can be output in HTML format. We will shortly also experiment with merging two different large lexicons built separately by the EU S&W communities, with partially complementary and partially overlapping information. This will hopefully produce a unified standard for morphology (for both S&W), and will pave the way to linking e.g. phonology to syntax and semantics: two strategically very relevant results.

4.2. Validation methodologies for LRs

Within the ELRA Validation Committee, protocols and procedures are defined and implemented for validating S&W LRs, representing the current best practice in validation of LRs (http://www.elra.info/).

5. Conclusions

In the end everything is tied together, which is what makes our overall task so interesting, and so difficult. What we need is the ability to combine the overall view with its decomposition into manageable pieces. Neither perspective, the global or the sectorial, is really fruitful if taken in isolation. A strategic and visionary policy for cooperation between S&W groups has to be debated, designed and adopted for the next few years if we hope to be successful, but within it a realistic and stepwise approach to solving well-defined and limited aspects must be adopted. To this end, the contribution of the main actors in the field is of extreme importance. Some of the events of recent years are hopefully moving in this direction. As a last remark, I want to mention that at LREC 2004 we have common sessions for S&W LRs, to start encouraging and pushing towards more interaction and integration between the two communities. This will be a must if our field is to contribute, effectively and globally, to the big challenges of the ‘knowledge-based society’ [16].

6. References

[1] Zampolli, A., “Perspectives for an Italian Multifunctional Lexical Database”, Linguistica Computazionale, IV-V, 1987, pp. 304-341.
[2] Calzolari, N., “Lexical databases and textual corpora: perspectives of integration for a Lexical Knowledge Base”, U. Zernik (ed.), Lexical Acquisition: Exploiting on-line Resources to build a Lexicon, Lawrence Erlbaum Associates, Hillsdale, NJ, 1991, pp. 191-208.
[3] Zampolli, A. et al., ENABLER Technical Annex, Pisa, 2000.
[4] Baroni, P., Calzolari, N., Lenci, A., Extended Configuration of the Network and Final Report, ENABLER Deliverable D1.2, Pisa, 2003.
[5] Calzolari, N., Choukri, K., Gavrilidou, M., Maegaard, B., Baroni, P., Fersøe, H., Lenci, A., Mapelli, V., Monachini, M., Piperidi, S., “ENABLER Thematic Network of National Projects: Technical, Strategic and Political Issues of LRs”, LREC 2004 Proceedings, Lisbon, 2004.
[6] Gavrilidou, M., Desypri, E., Final Edited Version of the Survey, ENABLER Deliv. D2.3, Athens, 2003.
[7] Calzolari, N., Bertagna, F., Lenci, A., Monachini, M. (eds.), Standards and Best Practice for Multilingual Computational Lexicons. MILE (the Multilingual ISLE Lexical Entry), ISLE CLWG Deliverables D2.2&D3.2, Pisa, 2003, pp. 194.
[8] Krauwer, S., “ELSNET and ELRA: A Common Past and a Common Future”, ELRA Newsletter, 3(2), 1998.
[9] Binnenpoorte, D., De Vriend, F., Sturm, J., Daelemans, W., Strik, H., Cucchiarini, C., “A Field Survey for Establishing Priorities in the Development of HLT Resources for Dutch”, LREC 2002 Proceedings, Las Palmas, 2002.
[10] Mapelli, V., Choukri, K., Report on a (Minimal) Set of LRs to Be Made Available for as Many Languages as Possible, and Map of the Actual Gaps, ENABLER Deliverable D5.1, Paris, 2003.
[11] Mapelli, V., Choukri, K., Report Contributing to the Design of an Overall Co-ordination and Strategy in the Field of LRs, ENABLER Deliverable D5.2, Paris, 2003.
[12] Joscelyne, A., Lockwood, R., Benchmarking HLT Progress in Europe, HOPE, Copenhagen, 2003.
[13] Monachini, M., Calzolari, F., Mammini, M., Rossi, S., Ulivieri, M., “Unifying Lexicons in view of a Phonological and Morphological Lexical DB”, LREC 2004 Proceedings, Lisbon, 2004.
[14] Calzolari, N., Ceccotti, M.L., Roventini, A., Documentazione sui tre nastri contenenti il DMI, ILCDMI-2, Pisa, 1983.
[15] Ruimy, N., Corazzari, O., Gola, E., Spanu, A., Calzolari, N., Zampolli, A., “The European LE-PAROLE Project: The Italian Syntactic Lexicon”, Proceedings of the First International Conference on Language Resources and Evaluation, Granada, 1998.
[16] Calzolari, N., “Introduction of the Conference Chair”, LREC 2004 Proceedings, Lisbon, 2004.