
Jornada de Seguimiento de Proyectos, 2009. Programa Nacional de Tecnologías Informáticas

KNOW: Developing large-scale multilingual technologies for language understanding
TIN2006-15049-C03

Eneko Agirre ∗ University of the Basque Country

Lluís Padró † Universitat Politècnica de Catalunya

Irene Castellón ‡ University of Barcelona

Abstract

This project aims to add meaning, knowledge and reasoning to current interface technologies. Specifically, KNOW is providing novel natural language interpretation and reasoning capabilities to current multilingual computer applications: full syntactic parsers and semantic interpreters (including word sense disambiguation systems and semantic role labelers) for the languages involved in the project, a common conceptual structure (the Multilingual Central Repository or MCR), and both automatic reasoning and analogy-based inference based on the MCR. KNOW has opened the way for a new generation of broad-coverage, unrestricted-domain, concept-based language understanding applications. KNOW has demonstrated the feasibility of these technologies in two applications linked to the project EPOs: Cross-Lingual Information Retrieval and Question Answering on two multi-modal databases. Consult the project website [90] for further information.

Keywords: Natural Language Processing, Syntactic Analysis, Semantic Interpretation, Knowledge Acquisition, Information Retrieval

1 Goals of the project

This project corresponds to and addresses objective 3, Intelligent Systems, within line 3.6, Human Language Engineering, of the National Program for Computer Science (in Spanish, “Programa Nacional de Tecnologías Informáticas”, TIN). In particular, it addresses the Strategic Action on Human Language Engineering, targeting Spanish, Catalan and Basque, through the following objectives:

1. Improving current syntactic analysis of Spanish, Catalan and Basque, combining machine learning techniques and hand-written rules, and integrating the deep semantic information acquired below.
2. Automatic acquisition of deep knowledge on verbal models, including selectional preferences and semantic roles.
3. Integration of large-scale semantic knowledge in the Multilingual Central Repository.
4. Advancing in the use of Machine Learning techniques for semantic analysis and language understanding tasks, namely word sense disambiguation and semantic role labeling, approached either as independent tasks or simultaneously.
5. Developing efficient deduction and automatic reasoning techniques able to generate new knowledge from the knowledge already in the MCR, and using it as an additional source of knowledge to include in the MCR.
6. Providing Spanish, Catalan and Basque with the resources and tools developed.
7. Proving the viability of these applications, and the advantages that the use of advanced semantic knowledge represents, in two demonstrators for cross-lingual information retrieval and cross-lingual question answering.

The project objectives are organized in tasks (see following section) and in subprojects:

• The UB subproject (02) gathered existing language processors for each of the KNOW languages and developed the main missing link: wide-coverage parsers for Spanish, Basque and Catalan. This subproject will also acquire verbal models, which are indispensable for giving coherence to the syntactic parses and for allowing semantic interpretation.
• The UPC subproject (03) will focus on semantic interpretation and on maintaining the repository of all acquired knowledge. The former requires advances in the combination of two existing techniques, word sense disambiguation and semantic role labeling, and will use the knowledge in the repository.
• The UPV/EHU subproject (01) builds on top of the results of the two other subprojects. Its goals are twofold: to develop reasoning capabilities on top of the semantic repository and semantic interpretation modules, and to demonstrate the capabilities of the acquired techniques in the demonstrators.

The partners form a closely knit community, collaborating in all tasks. For instance, both UPC and UPV/EHU are involved in developing parsers for Spanish, Basque and Catalan.

2 Project progress and achievements

In its second year the project has met all the objectives planned for that period, and in some cases also those set for the third year. In this section we will review the project progress, structured in several parts, according to the work packages of the project.


2.1 Management and dissemination

The project is being carried out as planned and is meeting its deadlines, with no major deviations from the technical plan. The results of the project have been widely published (see references), and the partners have organized international workshops and public competitions (cf. Sections 2.8 and 3). Regarding the workshops, we have organized the most prestigious semantic evaluation workshop, SemEval-2007 [17], and are organizing a follow-up workshop in 2009 (footnote 1).

2.2 Design of the project and demonstrators

The overall methodology of KNOW, including the standard protocols, formats, procedures and evaluation criteria, as well as the Multilingual Central Repository database, was defined early in the project. The partners also designed the demonstrators that showcase the capabilities of the project, in close contact with the EPOs (see Section 3.1).

2.3 Linguistic Processors

The project has gathered, adapted and enriched the basic tools and linguistic resources available for all the tasks in the project in English, Spanish, Catalan and Basque. These include tokenization and sentence boundary detection, morphological analysis and treatment of referential expressions, named entity recognition and syntactic analysis. We have also developed grammars for deep syntactic analysis of the languages in the project. Three demos are available: the Basque surface parser [91], the Basque full parser [92], and the Spanish, Catalan and English parsers [93]. Finally, we have also developed a new version of the EWN Top Concept Ontology, available at [101].

2.4 Knowledge Integration

KNOW has managed and maintained the Multilingual Central Repository, which is used for uploading and porting the knowledge acquired across languages and resources and for maintaining the compatibility among them. The Multilingual Central Repository now includes modules for uploading the data acquired from one language to the repository, porting the knowledge stored in the repository to the local wordnets, and checking the integrity of the data stored in the repository. The MCR can be consulted through the following interfaces [94, 95]. In addition, we have enriched by semi-automatic means the reasoning rules built on the MCR Top Concept Ontology, which have been used by the automatic theorem provers and inferencing tools in the reasoning tasks (cf. Section 2.7).
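The porting and integrity-checking modules mentioned above can be pictured with a minimal sketch. It assumes, purely for illustration, that relations are stored between language-independent ILI records and that each local wordnet maps ILI records to its own synsets; the function names, identifiers and data layout below are hypothetical and are not taken from the MCR software.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical layout: (source ILI id, relation name, target ILI id)
Relation = Tuple[str, str, str]


def port_relations(relations: List[Relation],
                   ili_to_local: Dict[str, Set[str]]) -> List[Relation]:
    """Project ILI-level relations onto the synsets of one local wordnet."""
    ported = []
    for source, name, target in relations:
        # A relation is ported only when both ends exist in the local wordnet.
        for local_source in sorted(ili_to_local.get(source, ())):
            for local_target in sorted(ili_to_local.get(target, ())):
                ported.append((local_source, name, local_target))
    return ported


def dangling_relations(relations: List[Relation],
                       known_ilis: Set[str]) -> List[Relation]:
    """Integrity check: return relations that mention an unknown ILI record."""
    return [r for r in relations
            if r[0] not in known_ilis or r[2] not in known_ilis]


# Toy data: every identifier here is invented for the example.
relations = [("ili-1", "has_hyperonym", "ili-2"),
             ("ili-1", "related_to", "ili-9")]
spanish = {"ili-1": {"spa-perro"}, "ili-2": {"spa-animal"}}
known = {"ili-1", "ili-2"}

ported = port_relations(relations, spanish)
dangling = dangling_relations(relations, known)
```

In this toy run only the first relation is ported to the local wordnet, and the second is flagged because it points at an ILI record that is not in the repository.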

2.5 Acquisition of semantic knowledge

We have designed and applied automatic acquisition techniques, both supervised and unsupervised, to widely representative corpora in order to infer lexical selectional preferences, with special attention to verbal preferences for semantic roles. They are available for English [106]. We have also acquired an extensive amount of new relations [49], forming what we call KNOWNETs for the project languages, which are publicly available [102]. In addition, we have mined the MCR for Base Level Concepts, which offer different generalization levels. Base Level Concepts are freely available [103]. The acquired knowledge is being used to improve word sense disambiguation and semantic role labeling, as well as parsing, with positive results for English [49, 87, 6, 70].

1 http://www.lsi.upc.edu/~lluism/sew2009/

2.6 Semantic Processing

Word Sense Disambiguation systems based on the MCR have been developed. These systems are capable of disambiguating words for any language with a WordNet, with results that surpass the state of the art [14, 15]. The system is available as open source [105]. A demo of a related system is also available [96]. In parallel, we have continued to develop supervised WSD, with new developments on the use of untagged corpora [8]. In the case of Basque, we have completed the annotation of the Basque SemCor (which can be consulted online at [104]) and developed a supervised WSD system based on it (demo available [97]). Regarding Semantic Role Labeling, we have developed a system which uses the selectional preferences acquired in Section 2.5, and a joint-learning system [69] which has an online demo [98].
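As an illustration of how a knowledge graph such as the MCR can drive WSD, the sketch below runs a small personalized PageRank over a toy sense graph, in the spirit of the graph-based approach cited in [14, 15]. It is a simplified sketch under stated assumptions, not the released system [105]: the graph, the sense labels and the function names are invented for the example, and the random walk is seeded directly on the senses of the context words.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple


def build_graph(edges: Iterable[Tuple[str, str]]) -> Dict[str, Set[str]]:
    """Build an undirected adjacency map from (node, node) pairs."""
    graph: Dict[str, Set[str]] = defaultdict(set)
    for left, right in edges:
        graph[left].add(right)
        graph[right].add(left)
    return dict(graph)


def personalized_pagerank(graph: Dict[str, Set[str]], seeds: Set[str],
                          damping: float = 0.85,
                          iterations: int = 50) -> Dict[str, float]:
    """Power iteration where the random walk restarts only at seed nodes."""
    teleport = {n: (1.0 / len(seeds) if n in seeds else 0.0) for n in graph}
    rank = dict(teleport)
    for _ in range(iterations):
        updated = {n: (1.0 - damping) * teleport[n] for n in graph}
        for node, neighbours in graph.items():
            # Each node spreads its damped rank evenly over its neighbours.
            share = damping * rank[node] / len(neighbours)
            for neighbour in neighbours:
                updated[neighbour] += share
        rank = updated
    return rank


def disambiguate(graph: Dict[str, Set[str]], target_senses: List[str],
                 context_senses: List[str]) -> str:
    """Pick the target sense with the highest rank given the context senses."""
    rank = personalized_pagerank(graph, set(context_senses))
    return max(target_senses, key=lambda sense: rank.get(sense, 0.0))


# Toy sense graph: all labels are hypothetical.
GRAPH = build_graph([
    ("bank#finance", "money#1"), ("bank#finance", "loan#1"),
    ("money#1", "loan#1"),
    ("bank#river", "river#1"), ("river#1", "water#1"),
])

best_sense = disambiguate(GRAPH, ["bank#finance", "bank#river"],
                          ["money#1", "loan#1"])
```

Because the walk restarts only at the context senses, rank mass concentrates on the senses that are well connected to the context, which is why the financial sense wins in this toy graph. Nothing in the procedure depends on the language of the words, only on the graph of concepts.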

2.7 Reasoning

KNOW has produced a layer of inference on top of the knowledge acquired and integrated in the MCR, using rule-based theorem provers that work with formal rules linked to top-level semantic information in the MCR [78], and graph-based algorithms that work with all the semantic information in the MCR.

2.8 Evaluation and demonstration

We have evaluated the quality and accuracy of the developed software and the acquired data. Objective measures on significant samples of the data and software results have been taken, and we have also participated in public evaluation exercises on parsing, information retrieval, semantic role labeling and word sense disambiguation, with high ranking positions in all of them [17, 87, 7, 13, 65, 10, 9, 12, 46, 11, 26, 55, 73]. We have demonstrated the feasibility of integrating the project results in Information Retrieval and Question Answering (cf. Section 3.1). Two demos are available [99, 100], although the ArgazkiPress demo is only available on request. In any case, the main objective for KNOW in this third year is to further improve current results on those applications using the knowledge and tools developed during the two previous years, with special attention to semantic technologies.

3 Quantification of results

We present the quantification of the results according to the subproject involved, as well as some comments about the added value of the coordination in KNOW. The relevance of the publications is also mentioned, following the so-called “Resolución de 11 de noviembre de 2008, de la Presidencia de la Comisión Nacional Evaluadora de la Actividad Investigadora (BOE 22-11-2008)”, as required for area “6. Ingenierías y Arquitectura”. In the case of journal papers we indicate the impact factor in the Science Citation Index. For conference papers we indicate those venues in relevant positions of the following indices: CiteSeer (http://citeseer.ist.psu.edu/impact.html), Computing Research and Education (CORE) (http://www.core.edu.au/) or CS Conference Rankings (http://www.cs-conference-ranking.org/co

3.1 Subproject 01

UPV/EHU leads the coordinated project and subproject 01. The main objectives of this subproject are: 1) the evaluation and demonstration of the technologies developed for KNOW, with special attention to Information Retrieval and Question Answering, and 2) the development of reasoning capabilities on top of the Multilingual Central Repository. We also had a prominent role in the other tasks but, for the sake of brevity, we focus this report on these two objectives. We only want to stress the development of new parsers for Basque, based both on manual rules [21] and on machine learning over hand-annotated data [27] (demos available [91, 92]).

Regarding the first objective, we have organized [11] and participated in [73] the CLEF 2008 task on IR and WSD (footnote 2), where the goal was to analyze the potential impact of WSD technologies on the usual ad-hoc task. The task was very successful, with 8 participants and positive results in both the English-to-English and Spanish-to-English tasks. Our system [73] obtained very good results overall, showing that WSD allowed for better results and opening the way for the integration of semantic technologies in IR engines (see EPO demo at [99]). The competition has been accepted for 2009, so it will run for another year. We also organized a Question Answering task where WSD was provided by us, and the Basque subtask of the main Question Answering task [55]. We participated in the Basque subtask [26] with our own Question Answering system, into which we will incorporate our semantic technologies in the third year, as set out in the project plan.

Regarding the second objective, development is going as planned. On the one hand, we have adapted the axioms of the Suggested Upper Merged Ontology (SUMO), and for the first time it is possible to reason on top of the MCR concepts. On the other hand, we have set up analogical reasoning as a similarity engine which, given two words, returns their similarity according to the information in the MCR.
This similarity engine has been used to develop a state-of-the-art WSD system. Demos of both are available from the project website [96, 97], as well as open source code [105].

The relevance and originality of the research carried out in this project are shown by the breadth and depth of the publications. We want to underscore the international impact of our research, with publications in major conferences and journals. All in all, this subproject has produced the following publications, with special mention of 5 publications in high-quality venues (footnote 3):

• 44 publications in conferences, workshops and journals with peer review committees [2, 1, 4, 6, 18, 19, 20, 21, 27, 35, 34, 63, 87, 3, 78, 51, 64, 70, 8, 7, 14, 13, 65, 66, 24, 25, 5, 11, 10, 9, 12, 26, 55, 73, 88, 89, 46, 44, 45, 47, 49, 50, 48, 17].
• One publication in a journal with a 1.107 impact factor [70].
• Two publications [89, 6] in an A+ conference according to CORE, which, according to CSCR, is in the top 65 conferences on Artificial Intelligence / Machine Learning (9th position), with an EIC of 0.90.
• Two publications [8, 49] in an A conference according to CORE, which, according to CSCR, is in the top 65 conferences on Artificial Intelligence / Machine Learning (35th position), with an EIC of 0.64.

2 http://ixa2.si.ehu.es/clirwsd
3 Due to lack of space, we omit details of papers in the B or C categories of CORE.

Technology transfer. Two EPOs (Elhuyar and ArgazkiPress) have incorporated our technologies. In the first case we have built a prototype of our Basque Question Answering system on their Science and Technology Portal, which they are planning to deploy on their portal (footnote 4). In the second, they have applied the semantic information in the MCR to enhance their Cross-Lingual IR system. Demos of both are available from the project website. This interest from the EPOs underscores the usefulness of our results and our relations with the socio-economic environment. In order to facilitate fast technology transfer, we have made a special effort to produce several demos and downloadable resources, all of which can be found on the project website. We have also registered with the intellectual property office the following resources, which were developed in the project: EuSemcor (a Basque semantic concordance, that is, a corpus where nouns have been annotated by hand with word senses from the Basque WordNet, ref. SS-411-08) and IXAti (a surface syntactic analyzer for Basque, ref. SS-78-08). The Basque WordNet was registered in 2003 (ref. SS-323-03).

Participation in international projects. UPV/EHU is a partner of KYOTO (footnote 5), a 7th FP project of the European Commission which started in March 2008. The group from our university is led by German Rigau, a member of KNOW. We are actively participating in tasks closely related to KNOW, and in fact some publications are joint publications with KYOTO partners.
Eneko Agirre led a proposal for a STREP in the last call, which obtained 14.5 points out of a maximum of 15 but was not selected for funding. Although not expressly funded, we want to stress the organization of competitions, which shows the standing of the members of the subproject in the international arena [10, 9, 12, 46, 11, 55].

Training human resources. Five PhD students are active members of KNOW (Egoitz Laparra, Arantxa Otegi, Aitziber Atutxa, Beñat Zapirain and Olatz Ansa), with several publications [88, 89, 87, 26, 73, 10, 9, 24]. Two of them are in their final year. Eneko Agirre and Izaskun Aldezabal supervised the thesis of Eli Pociello [77] on work closely related to KNOW.

Cooperation with other research groups. Eneko Agirre visited Timothy Baldwin at the University of Melbourne for three months. Arantxa Otegi is currently visiting Hugo Zaragoza at Yahoo Research in Barcelona for three months. Beñat Zapirain visited Diana McCarthy at the University of Sussex, also for three months. Eneko Agirre and Aitor Soroa are collaborating with Google Research Zurich [16].

4 http://www.zientzia.net/
5 http://www.kyoto-project.eu/


3.2 Subproject 02

The main objective of this subproject is the development of wide-coverage deep grammars for syntactic parsing of Spanish, Catalan and English. The other goals related to this main objective are: 1) the acquisition of syntactic information and 2) the acquisition of semantic information. The main tasks carried out by the UB subproject are:

• Development of wide-coverage deep grammars for syntactic parsing of Spanish, Catalan and English, in collaboration with the UPC group. These grammars integrate lexical, semantic and syntactic information [29]. The grammar sizes are: 2900 rules in the Catalan grammar, 4349 rules in the Spanish grammar and 1857 in the English grammar.
• Subcategorization acquisition. In order to enrich the grammars with lexico-syntactic information, our target is to obtain, for verbs that are not in the SenSem verbal database, the kind of information that is manually encoded in that database. We acquire a model of the behavior of SenSem verbs in corpus and then try to apply this model to previously unseen verbs. Our underlying hypothesis is that verbs that behave similarly in corpus have similar subcategorization frames. We have explored two methods. First, we have applied clustering techniques to find an empirically determined grouping of the verbs in SenSem [31]. Our second method, under development, consists of learning a classifier by means of a bootstrapping approach.
• Consistent annotation of the nominal part of EuroWordNet, that is, the full annotation of the nouns in the EWN ILI with the semantic features of the EWN Top Concept Ontology (TCO). This has been achieved by following a methodology based on an iterative and incremental expansion of the initial labeling through the hierarchy, while setting inheritance blockage points [78, 25].
• Acquisition of selectional preferences for verbs. We have started the creation of a resource for acquisition. The task is in its first stage; our aim is to complete the semantic annotation of SenSem with WordNet senses. This first stage consists of building a manually annotated sample of the SenSem corpus to be used as a gold standard. Agreement between judges has been assessed and the performance of analyzers has been evaluated [28]. The annotation interface, which can be found at http://sisx04.si.ehu.es:8080/spsemcor/, has been adapted for this task. A study of clustering techniques has been carried out. In order to complete the annotation of the whole SenSem corpus, a complementary project (FFI2008-02579-E/FILO) has been granted. This project will start in April 2009.
• Studies of linguistic theories for the improvement of verbal classification based on eventive classes. In order to determine the eventive classes of verbs, a study of linguistic theories and several psycholinguistic experiments have been carried out [39, 40, 41, 42]. In addition, the semantics of verbal periphrases and their effects on eventive classes in some complex predicates have been studied, and a representation proposal for their computational processing is in progress.

The results are relevant in several senses. First, there are few open-source, wide-coverage deep grammars for Spanish, and none for Catalan. Therefore, the grammars developed in this project are a very valuable resource for the NLP community. Second, it is commonly assumed that subcategorization frames can significantly improve the performance of automatic syntactic analyzers of natural language. Many methods have been developed for subcategorization acquisition for English, but only a few works can be found for other languages, particularly for Spanish. Manually enriching resources with subcategorization information is costly, and the resulting resources tend to have low coverage. Consequently, the methods explored in this project for subcategorization acquisition for Spanish fill an important gap in the automatic analysis of Spanish.

Concerning the semantic task, the annotation of the nominal part of EuroWordNet provides a wide-coverage, consistent resource for NLP tasks such as inference and parsing. The studies of linguistic theories are of great importance because they allow us to determine which linguistic information is crucial to the creation of semantic resources. Polysemy, sense delimitation, identification of primitives and definition of classes are needed to represent and describe linguistic knowledge. Our goal at this point is to use some principles of cognitive linguistics in the computational representation and to integrate them with linguistic knowledge obtained empirically from psycholinguistic experiments and from corpora.

The grammar performance can be checked in the online demo (footnote 6), and the TCO can be consulted online (footnote 7). The results of these tasks have been published in international and national journals.
• 19 publications in conferences, workshops and journals [25, 22, 23, 29, 30, 33, 43, 42, 37, 38, 36, 39, 40, 41, 67, 86, 85, 24, 25].
• 3 indexed publications: [78] (CORE category C), [28] (CORE category C) and [31] (indexed by CiteSeer in the top 93.28% venues).

Training human resources: One thesis, to be defended in June 2009, is partially related to the contents of the project (Marta Coll). Three master's theses have been carried out within this project [67, 30, 86].

Technology transfer: The company Automatic Trans is the EPO in this subproject. The company, which focuses on the development of machine translation systems, is interested in the resources developed by KNOW, mainly for parsing. So far, the collaboration has consisted of members of the company attending seminars organized by GRIAL. Two yearly meetings have been held as well. In March a workshop will be organized that will bring together the research group and the company.

Cooperation with other research groups: Marta Coll Florit stayed at York University with Dr. Silvia Genari for three months.
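The hypothesis behind the subcategorization work in this subproject (verbs that behave similarly in corpus share subcategorization frames) can be illustrated with a small sketch. It deliberately swaps in a simpler technique than the clustering of [31] or the bootstrapped classifier: a nearest-neighbour transfer by cosine similarity. The feature names, counts and frame labels are invented for the example and are not SenSem data.

```python
import math
from typing import Dict, List

Profile = Dict[str, float]  # corpus feature -> observed count


def cosine(left: Profile, right: Profile) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(value * right.get(key, 0.0) for key, value in left.items())
    norm_left = math.sqrt(sum(v * v for v in left.values()))
    norm_right = math.sqrt(sum(v * v for v in right.values()))
    if norm_left == 0.0 or norm_right == 0.0:
        return 0.0
    return dot / (norm_left * norm_right)


def transfer_frames(unseen: Profile,
                    known_profiles: Dict[str, Profile],
                    known_frames: Dict[str, List[str]]) -> List[str]:
    """Give an unseen verb the frames of its most similar known verb."""
    nearest = max(known_profiles,
                  key=lambda verb: cosine(unseen, known_profiles[verb]))
    return known_frames[nearest]


# Hypothetical corpus profiles for verbs already in a lexicon.
known_profiles = {
    "dar": {"subj_NP": 10, "obj_NP": 9, "iobj_PP_a": 8},
    "dormir": {"subj_NP": 10, "adjunct_PP_en": 3},
}
# Hypothetical frame labels attached to those verbs.
known_frames = {"dar": ["ditransitive"], "dormir": ["intransitive"]}

# Profile of a verb that is not in the lexicon.
entregar = {"subj_NP": 7, "obj_NP": 6, "iobj_PP_a": 5}
predicted = transfer_frames(entregar, known_profiles, known_frames)
```

A clustering approach replaces the single nearest verb with a group of verbs, which makes the transfer less sensitive to noise in any one profile.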

3.3 Subproject 03

This subproject deals with two main research lines: the development and application of semantic analyzers, specifically of WSD and SRL systems for Spanish, Catalan and English, and the acquisition and integration of knowledge into the MCR.

6 http://garraf.epsevg.upc.es/freeling/demo.php
7 http://adimen.si.ehu.es/cgi-bin/wei5/public/wei.consult.perl


The main advances made by the UPC subproject regarding the syntactic and semantic analyzers are the following:

• Improvement and extension of the rule-based dependency parser TXALA (included in FreeLing): the expressivity of the dependency rule formalism has been enhanced to include access to the MCR-based semantic database (word classes, Top Ontology, Wordnet SF, etc.).
• Resources and analyzers for English, Spanish and Catalan have been improved at both the lexical level (dictionary coverage) and the syntactic level (chunking grammars, dependency rules).
• MaltParser [72] has been trained on CoNLL 2007 data to produce a statistical dependency parser for Spanish and Catalan. A prototype plugin to embed MaltParser in FreeLing has been developed.
• A joint approach to WSD+SRL using global inference was developed, and it participated in the CoNLL-2008 Shared Task with a joint inference model which integrates syntax and semantics [69]. A demo is available. The work is detailed in X. Lluís' Master's thesis, "Joint Learning of Syntactic and Semantic Dependencies" [68], which obtained an excellent grade (9.5/10).
• We built KNOWNET, a new semantic resource based on Topic Signatures. KNOWNET is produced by a novel, completely automatic technique for building precise and highly dense knowledge bases from existing semantic resources. The technique uses SSI-Dijkstra, a precise and wide-coverage knowledge-based WSD algorithm, to assign the proper senses to large sets of topic-related words. The data used in KNOWNET have been retrieved from the Web (TSWEB). The first version of this resource has been evaluated with test sets from SemEval-07 (for English) and Senseval-3 (English and Spanish).
• The use of WSD techniques for SMT and RBMT has also been explored (in coordination with the OpenMT project): a new approach to lexical selection in SMT systems has been developed, based on WSD techniques [57, 59]. Experimental results show an improvement in lexical selection for our models when compared to traditional MLE-based techniques. Nevertheless, the improvements do not necessarily propagate to deeper quality levels (syntax or semantics), so we are planning new studies to improve the integration of these models in the SMT architecture.
• The same strategy has been applied to Arabic-English translation, confirming the previous results [54]. This work constitutes the core of the Master's thesis of C. España-Bonet [53], which received the top mark (Matrícula de Honor) and obtained a mention at the CCIA-2008 Master Thesis Award.

With respect to the knowledge acquisition and integration line, the most relevant issues addressed and results are:

• Release of a new MCR version (MCR5), including error correction in the Spanish, Catalan and Basque glosses. In addition, examples have been separated from definitions and a final port to UTF-8 has been performed. Extended WN gloss/rgloss relationships have been uploaded to the repository.


• An interface to give non-expert users access to the MCR has been developed, taking into account usability criteria and tests with real users. It is currently a stable prototype, with improvements pending.

The research carried out in the project has yielded several papers, published in national and international scientific forums:

• Publications in top conferences (CORE categories A and A+): [89] (CiteSeer top 10%, CSCR conference-ranking 0.90), [49] (CiteSeer top 25%, CSCR conference-ranking 0.64), [69] (CORE A, CSCR conference-ranking 0.82), [60] (CORE A).
• Publications in other relevant conferences (CORE categories B and C): [83] (CORE B), [88] (CORE B), [62] (CORE C).
• Other publications: [29, 46, 44, 45, 47, 50, 48, 54, 57, 58, 61, 59, 71, 52, 75, 76, 80, 79, 81, 82, 84].

Training human resources: Two theses have been published with contents partially developed in the project. The first is related to the use of semantic analyzers and knowledge for machine translation [56], and the second is devoted to sequence analysis algorithms, useful for named entity detection or chunking, among other tasks [74]. The two related Master's theses mentioned above have also been published [53, 68].
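The KNOWNET construction described in this subproject relies on SSI-Dijkstra to pick senses for sets of topic-related words. The sketch below shows the general idea in a much simplified form and should not be read as the project's implementation: it starts from the monosemous words and then, for each remaining word, chooses the sense whose summed shortest-path distance to the senses already chosen is smallest. The graph, its weights, the sense labels and the first-sense fallback are assumptions made for the example.

```python
import heapq
import math
from typing import Dict, List, Tuple

Graph = Dict[str, Dict[str, float]]  # node -> {neighbour: edge weight}


def build_graph(edges: List[Tuple[str, str, float]]) -> Graph:
    """Build an undirected weighted graph from (node, node, weight) triples."""
    graph: Graph = {}
    for left, right, weight in edges:
        graph.setdefault(left, {})[right] = weight
        graph.setdefault(right, {})[left] = weight
    return graph


def shortest_distances(graph: Graph, source: str) -> Dict[str, float]:
    """Dijkstra's algorithm: distance from source to every reachable node."""
    dist = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        d, node = heapq.heappop(heap)
        if d > dist.get(node, math.inf):
            continue  # stale queue entry
        for neighbour, weight in graph.get(node, {}).items():
            candidate = d + weight
            if candidate < dist.get(neighbour, math.inf):
                dist[neighbour] = candidate
                heapq.heappush(heap, (candidate, neighbour))
    return dist


def assign_senses(graph: Graph,
                  candidates: Dict[str, List[str]]) -> Dict[str, str]:
    """Greedy sense assignment anchored on the monosemous words."""
    chosen = {w: s[0] for w, s in candidates.items() if len(s) == 1}
    for word, senses in candidates.items():
        if word in chosen:
            continue
        # Assumed fallback: keep the first listed sense if nothing is closer.
        best_sense, best_cost = senses[0], math.inf
        for sense in senses:
            dist = shortest_distances(graph, sense)
            cost = sum(dist.get(t, math.inf) for t in chosen.values())
            if cost < best_cost:
                best_sense, best_cost = sense, cost
        chosen[word] = best_sense
    return chosen


# Toy data: every label and weight is hypothetical.
GRAPH = build_graph([
    ("guitar#1", "instrument#1", 1.0), ("bass#music", "instrument#1", 1.0),
    ("bass#fish", "fish#1", 1.0), ("fish#1", "animal#1", 1.0),
    ("animal#1", "entity#1", 1.0), ("instrument#1", "entity#1", 1.0),
])
senses = assign_senses(GRAPH, {"guitar": ["guitar#1"],
                               "bass": ["bass#fish", "bass#music"]})
```

Applied to a topic signature, each word that receives a sense this way can then contribute relations between its sense and the senses of the other words in the set.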

3.4 Benefits of the coordination

The large number of joint publications (around 15%) showcases the benefits of coordinating the three subprojects. The yearly project meetings have been complemented with focused workshops on special themes, where members of the project and invited speakers have shared their experiences. In addition, the coordination enables KNOW to combine different resources and techniques in creative ways, allowing us to pursue the ambitious objectives that we set in the project.

References

[1] I. Aduriz, K. Ceberio, and A. Díaz de Ilarraza. Pronominal anaphora in Basque: Annotation issues for later computational treatment. In DAARC2007, 2007.
[2] I. Aduriz, K. Ceberio, A. Díaz de Ilarraza, and I. Garcia. Análisis de la correferencia para su anotación en un corpus en euskera. In Actas de Congreso: VIII Congreso de Lingüística General, 2008.
[3] E. Agirre, I. Aldezabal, A. Estarrona, and E. Pociello. A methodology for the joint development of the Basque WordNet and SemCor. In Dutch SemCor workshop, 2008.
[4] E. Agirre and I. Alegria. Tresna linguistikoak informazioa atzitzeko. In Komunikabideetako Dokumentazioari Buruzko I. Jardunaldiak, 2008.
[5] E. Agirre, I. Alegria, G. Rigau, and P. Vossen. MCR for CLIR. In SEPLN aldizkaria, monografia TIIMM. vol 38, 2007.
[6] E. Agirre, T. Baldwin, and D. Martinez. Improving parsing and PP attachment performance with sense information. In Proceedings of the 46th Annual Meeting of the Association for Computational Linguistics (ACL HLT 2008), 2008.
[7] E. Agirre and O. Lopez de Lacalle. UBC-ALM: Combining k-NN with SVD for WSD. In Proceedings of the 4th International Workshop on Semantic Evaluations (SemEval-2007), 2007.
[8] E. Agirre and O. Lopez de Lacalle. On robustness and domain adaptation using SVD for word sense disambiguation. In The 22nd International Conference on Computational Linguistics (COLING), 2008.


[9] E. Agirre, B. Magnini, O. Lopez de Lacalle, A. Otegi, G. Rigau, and P. Vossen. SemEval-2007 task 01: Evaluating WSD on cross-language information retrieval. In Proceedings of the 4th International Workshop on Semantic Evaluations (SemEval-2007), in conjunction with ACL, 2007.
[10] E. Agirre, B. Magnini, O. Lopez de Lacalle, A. Otegi, G. Rigau, and P. Vossen. SemEval-2007 task 01: Evaluating WSD on cross-language information retrieval. In Proceedings of CLEF 2007 Workshop, 2007.
[11] E. Agirre, G. Di Nunzio, N. Ferro, T. Mandl, and C. Peters. CLEF 2008: Ad hoc track overview. In Working Notes of the Cross-Lingual Evaluation Forum, 2008.
[12] E. Agirre and A. Soroa. SemEval-2007 task 02: Evaluating word sense induction and discrimination systems. In Proceedings of the Fourth International Workshop on Semantic Evaluations (SemEval-2007), 2007.
[13] E. Agirre and A. Soroa. UBC-AS: A graph based unsupervised system for induction and classification. In Proceedings of the Fourth International Workshop on Semantic Evaluations (SemEval-2007), 2007.
[14] E. Agirre and A. Soroa. Using the Multilingual Central Repository for graph-based word sense disambiguation. In Proceedings of LREC, 2008.
[15] E. Agirre and A. Soroa. Personalizing PageRank for word sense disambiguation. In Proceedings of EACL, Forthcoming.
[16] E. Agirre, A. Soroa, and E. Alfonseca. A study on similarity and relatedness using distributional and WordNet-based approaches. In Proceedings of NAACL, Forthcoming.
[17] Eneko Agirre, Lluís Màrquez, and Richard Wicentowski, editors. Proceedings of the Fourth International Workshop on Semantic Evaluations (SemEval-2007). Association for Computational Linguistics, Prague, Czech Republic, June 2007.
[18] I. Aldezabal. Estudio preliminar para la creación de Euskal PropBank perspectivas de análisis de la unidad verbal. In SERES. Universitat de Barcelona, 2007.
[19] I. Aldezabal, I. Alegria, J. Arriola, A. Díaz de Ilarraza, M. Lersundi, and K. Sarasola. Language technology is an effective tool to promote use of Basque. In AILA 2008, Multilinguism: Challenges & Opportunities, 2008.
[20] I. Aldezabal, M. Aranzabe, J. Arriola, A. Díaz de Ilarraza, A. Estarrona, K. Fernandez, M. Iruskieta Quintian, and L. Uria. EPEC (Euskararen Prozesamendurako Erreferentzia Corpusa) dependentziekin etiketatzeko eskuliburua. In UPV/EHU / LSI / TR 12-, 2007.
[21] I. Aldezabal, M.J. Aranzabe, A. Diaz de Ilarraza, and K. Fernández. From dependencies to constituents in the reference corpus for the processing of Basque. In SEPLN 2008, 2008.
[22] L. Alonso, I. Castellón, and N. Tincheva. Obtaining coarse-grained classes of subcategorization patterns for Spanish. In Proceedings of the International Conference RANLP, 2007.
[23] L. Alonso, I. Castellón, and N. Tinkova. Adquisición de subcategorizaciones verbales mediante un clasificador automático. Procesamiento del Lenguaje Natural, 2007.
[24] J. Álvez, J. Atserias, J. Carrera, S. Climent, E. Laparra, A. Oliver, and G. Rigau. Complete and consistent annotation of WordNet using the Top Concept Ontology. In 6th International Conference on Language Resources and Evaluation (LREC'08), Marrakesh, Morocco, 2008.
[25] J. Álvez, J. Atserias, J. Carrera, S. Climent, A. Oliver, and G. Rigau. Consistent annotation of WordNet using the Top Concept Ontology. In Proceedings of the 4th Global WordNet Association Conference, Szeged, Hungary, 2008.
[26] O. Ansa, X. Arregi, A. Otegi, and A. Soraluze. Ihardetsi question answering system at QA-CLEF 2008. Working Notes of the Cross-Lingual Evaluation Forum, 2008.
[27] K. Bengoetxea and K. Gojenola. Desarrollo de un analizador sintáctico estadístico basado en dependencias para el euskera. In Congreso Anual de la SEPLN, 2007.
[28] J. Carrera, I. Castellón, S. Climent, and M. Coll-Florit. Towards Spanish verbs' selectional preferences automatic acquisition. Semantic annotation of SenSem corpus. In Proceedings of the 6th International Conference on Language Resources and Evaluation, 2008.
[29] Jordi Carrera, Irene Castellón, Marina Lloberes, Lluís Padró, and Nevena Tinkova. Dependency grammars in FreeLing. Procesamiento del Lenguaje Natural, 41:21–28, September 2008.
[30] J. T. Carrera Ventura. Análisis de técnicas de adquisición automática de restricciones selectivas. Master's thesis, Dept. Lingüística General, Universitat de Barcelona, 2007.
[31] I. Castellón, L. Alonso, and N. Tincheva. A procedure to automatically enrich verbal lexica with subcategorization frames. Inteligencia Artificial, 12(37):45–53, 2008.
[32] I. Castellón, L. Alonso, and N. Tincheva. A procedure to automatically enrich verbal lexica with subcategorization frames. In Proceedings of Argentinean Symposium on Artificial Intelligence, ASAI'07, 2007.
[33] I. Castellón and A. Fernández, editors. Perspectivas de análisis de la unidad verbal. Publicacions i Edicions de la Universitat de Barcelona, 2007.
[34] K. Ceberio, I. Aduriz, A. Díaz de Ilarraza, and I. García. Erreferentziakidetasunaren azterketa eta anotazioa euskarazko corpus batean. In Gramatika Jaietan. P. Goenagaren 30 "Gramatika Bideetan" liburuaren omenez, 2008.
[35] K. Ceberio, I. Aduriz, A. Díaz de Ilarraza, and I. García. La anotación de la referencia sobre un corpus periodístico en euskara. In XXVI Congreso internacional de AESLA, 2008.
[36] M. Coll-Florit. Aktionsart y polisemia verbal. In Actas del III Congreso Internacional de Lingüística Hispánica (CILHIS), 2007.
[37] M. Coll-Florit. Experimento psicolingüístico sobre la dinamicidad verbal en español. El caso de cumplir. Interlingüística, 18, 2007.
[38] M. Coll-Florit. Mètodes empírics en lingüística cognitiva. Technical Report 004, Internet Interdisciplinary Institute UOC Working Paper Series WP07, 2007.
[39] M. Coll-Florit. Sobre la realidad cognitiva de los parámetros aspectuales. De la teoría a los métodos experimentales. In Proceedings of 6th International Conference of the Spanish Cognitive Linguistics Association (AELCO-SCOLA), 2008.
[40] M. Coll-Florit, I. Castellón, and S. Climent. Sobre la natura dels estats. Una revisió basada en corpus. Sintagma. Revista de Lingüística, 20, 2008.
[41] M. Coll-Florit, I. Castellón, S. Climent, and J. Santiago. Trends in Cognitive Linguistics: theoretical and applied models, chapter Realidad psicológica del aspecto léxico. Evidencias experimentales. Peter Lang, 2008.
[42] M. Coll-Florit, S. Climent, and I. Castellón. Aspecto léxico y desambiguación semántica. El caso de los estados. In Ricardo Mairal Usón, editor, Aprendizaje de lenguas, uso del lenguaje y modelación cognitiva: perspectivas aplicadas entre disciplinas. UNED-AESLA, 2007.
[43] M. Coll-Florit, S. Climent, and I. Castellón. A self-paced reading experiment on the cognitive status of lexical aspect in Spanish. Indian Journal of Applied Linguistics, 33:2 (Special number on Recent trends in Neurolinguistics, Psycholinguistics & Language Cognition: Methodologies and Innovations):9–28, 2007.
[44] Montse Cuadros, Mauro Castillo, and German Rigau. Evaluating large-scale knowledge resources across languages. In RANLP, 2007.
[45] Montse Cuadros and German Rigau. Bases de conocimiento multilíngües para el procesamiento semántico a gran escala. Cursos de verano de la Fundación Duques de Soria. Industrias de la Lengua. En M.F. Verdejo (ed) Acceso y visibilidad de la Información Multilingüe en la red, 2007.
[46] Montse Cuadros and German Rigau. SemEval-2007 task 16: Evaluation of wide coverage knowledge resources. In Fourth International Workshop on Semantic Evaluations (SemEval-2007), 2007.
[47] Montse Cuadros and German Rigau. Bases de conocimiento multilíngües para el procesamiento semántico a gran escala. Procesamiento del Lenguaje Natural, 40:35–42, 2008.
[48] Montse Cuadros and German Rigau. KnowNet: A proposal for building highly connected and dense knowledge bases from the web. In STEP, September 2008.
[49] Montse Cuadros and German Rigau. KnowNet: using topic signatures acquired from the web for building automatically highly dense knowledge bases. In Proceedings of the 22nd International Conference on Computational Linguistics (COLING 2008), Manchester, UK, August 2008.
[50] Montse Cuadros and German Rigau. Multilingual evaluation of KnowNet. In SEPLN, September 2008.
[51] A. Díaz de Ilarraza, K. Gojenola, and M. Oronoz. Reusability of a corpus and a treebank to enrich verb subcategorisation in a dictionary. In Conference on Recent Advances in Natural Language Processing (RANLP), 2007.
[52] James Dowdall, Bill Keller, Lluís Padró, and Muntsa Padró. An automata based approach to biomedical named entity identification. In Proceedings of the Annual Meeting of the ISMB BioLINK Special Interest Group on Text Data Mining, Vienna, Austria, July 2007.
[53] Cristina España-Bonet. A proposal for an Arabic-to-English SMT. Master’s thesis, Universitat de Barcelona and Universitat Politècnica de Catalunya (Artificial Intelligence Program), 2008.
[54] Cristina España-Bonet, Jesús Giménez, and Lluís Màrquez. The UPC-LSI discriminative phrase selection system: NIST MT evaluation 2008. In Proceedings of the 2008 NIST Open Machine Translation Evaluation Workshop, Washington, USA, 2008.
[55] P. Forner, A. Peñas, E. Agirre, I. Alegria, C. Forascu, N. Moreau, P. Osenova, P. Prokopidis, P. Rocha, B. Sacaleanu, R. Sutcliffe, and E. Tjong Kim Sang. Overview of the CLEF 2008 multilingual question answering track. Working Notes of the Cross-Lingual Evaluation Forum, 2008.
[56] Jesús Giménez. Empirical Machine Translation and its Evaluation. PhD thesis, Universitat Politècnica de Catalunya, July 2008.
[57] Jesús Giménez and Lluís Màrquez. Context-aware discriminative phrase selection for statistical machine translation. In Proceedings of the Second ACL Workshop on Statistical Machine Translation, pages 159–166, June 2007.
[58] Jesús Giménez and Lluís Màrquez. Linguistic features for automatic evaluation of heterogeneous MT systems. In Proceedings of the Second ACL Workshop on Statistical Machine Translation, pages 256–264, June 2007.
[59] Jesús Giménez and Lluís Màrquez. Discriminative phrase selection for statistical machine translation. In Learning Machine Translation. MIT Press, 2008.
[60] Jesús Giménez and Lluís Màrquez. Heterogeneous automatic MT evaluation through non-parametric metric combinations. In Proceedings of the Third International Joint Conference on Natural Language Processing (IJCNLP’08), pages 319–326, January 2008.
[61] Jesús Giménez and Lluís Màrquez. A smorgasbord of features for automatic MT evaluation. In Proceedings of the Third Workshop on Statistical Machine Translation, pages 195–198, Columbus, Ohio, June 2008. The Association for Computational Linguistics.
[62] Jesús Giménez and Lluís Màrquez. Towards heterogeneous automatic MT error analysis. In Proceedings of the 6th International Conference on Language Resources and Evaluation (LREC), 2008.
[63] M. Iruskieta, A. Díaz de Ilarraza, and M. Lersundi. Análisis de los marcadores del discurso para el euskera: denominación, clases, relaciones semánticas y tipos de ambigüedad. In XXVI Congreso internacional de AESLA, 2008.
[64] R. Izquierdo, A. Suárez, and G. Rigau. Exploring the automatic selection of basic level concepts. In Proceedings of the International Conference on Recent Advances on Natural Language Processing (RANLP’07), 2007.
[65] R. Izquierdo, A. Suárez, and G. Rigau. Plsi: Word coarse-grained disambiguation aided by basic level concepts. In Proceedings of the Fourth International Workshop on Semantic Evaluations (SemEval-2007), 2007.
[66] R. Izquierdo, A. Suárez, and G. Rigau. A proposal of automatic selection of coarse-grained semantic classes for WSD. In Proceedings of the 23rd Annual Meeting of Sociedad Española para el Procesamiento del Lenguaje Natural, SEPLN07, 2007.
[67] M. Lloberes. Guia d’ús i criteris. Gramàtiques de dependències per a l’analitzador de dependències Txala castellà i català. Technical Report GRIAL- Research Report No 1/2008, Departament de Lingüística General. Universitat de Barcelona, 2008.
[68] Xavier Lluís. Joint learning of syntactic and semantic dependences. Master’s thesis, Universitat Politècnica de Catalunya (Artificial Intelligence Program), 2008.
[69] Xavier Lluís and Lluís Màrquez. A joint model for parsing syntactic and semantic dependencies. In Proceedings of the 12th Conference on Computational Natural Language Learning (CoNLL-2008), Manchester, UK, 2008.
[70] D. Martinez, E. Agirre, and O. Lopez de Lacalle. On the use of automatically acquired examples for all-nouns WSD. Journal of Artificial Intelligence Research, 2008.
[71] Lluís Màrquez, Lluís Padró, Mihai Surdeanu, and Luís Villarejo. UPC: Experiments with joint learning within SemEval task 9. In Proceedings of the 4th International Workshop on Semantic Evaluations (SemEval-2007), pages 426–429, June 2007.
[72] J. Nivre, J. Hall, J. Nilsson, A. Chanev, G. Eryigit, S. Kübler, S. Marinov, and E. Marsi. MaltParser: A language-independent system for data-driven dependency parsing. Natural Language Engineering, 13(2):95–135, 2007.
[73] A. Otegi, E. Agirre, and G. Rigau. IXA at CLEF 2008 robust-WSD task: using word sense disambiguation for (cross lingual) information retrieval. In Working Notes of the Cross-Lingual Evaluation Forum, 2008.
[74] Muntsa Padró. Applying Causal State Splitting Reconstruction Algorithm to Natural Language Processing Tasks. PhD thesis, Universitat Politècnica de Catalunya, July 2008.
[75] Muntsa Padró and Lluís Padró. ME-CSSR: an extension of CSSR using maximum entropy models. In Proceedings of the 2007 Conference on Finite-State Methods for NLP (FSMNLP), Potsdam, Germany, September 2007.
[76] Muntsa Padró and Lluís Padró. Studying CSSR algorithm applicability on NLP tasks. Procesamiento del Lenguaje Natural, 39:89–96, September 2007.
[77] E. Pociello. Euskararen ezagutza-base lexikala: Euskal WordNet. PhD thesis, Euskal Filologia Saila (EHU/UPV), February 2008.
[78] E. Pociello, A. Gurrutxaga, E. Agirre, I. Aldezabal, and G. Rigau. WNTERM: Combining the Basque WordNet and a terminological dictionary. In Proceedings of the 6th International Conference on Language Resources and Evaluation, 2008.
[79] Horacio Rodríguez, David Farwell, Javier Farreres, Manuel Bertran, Musa Alkhalifa, and M. Antònia Martí. Arabic WordNet: Semi-automatic extensions using Bayesian inference. In Proceedings of the 6th Conference on Language Resources and Evaluation (LREC 2008), Marrakech, Morocco, May 2008.
[80] Horacio Rodríguez, David Farwell, Javier Farreres, Manuel Bertran, Musa Alkhalifa, M. Antònia Martí, William J. Black, Sabri Elkateb, James Kirk, Adam Pease, Piek Vossen, and Christiane Fellbaum. Arabic WordNet: Current state and future extensions. In Proceedings of the Fourth International Global WordNet Conference - GWC 2008, Szeged, Hungary, January 2008.
[81] Emili Sapena, Lluís Padró, and Jordi Turmo. Alias assignment in information extraction. Procesamiento del Lenguaje Natural, 39:105–112, September 2007.
[82] Emili Sapena, Lluís Padró, and Jordi Turmo. A graph partitioning approach to entity disambiguation using uncertain information. In Advances in Natural Language Processing, pages 428–439. 6th International Conference, GoTAL 2008, August 2008.
[83] Mihai Surdeanu, Roser Morante, and Lluís Màrquez. Analysis of joint inference strategies for the semantic role labeling of Spanish and Catalan. In Proceedings of the 9th International Conference on Computational Linguistics and Intelligent Text Processing, CICLing-2008, LNCS 4919, pages 206–218, Haifa, Israel, February 2008.
[84] Mihai Surdeanu, Lluís Màrquez, Xavier Carreras, and Pere R. Comas. Combination strategies for semantic role labeling. Journal of Artificial Intelligence Research, pages 105–151, June 2007.
[85] N. Tinkova. Construcción de una gramática del español para el análisis. In R. Mairal et al., editor, Aprendizaje de lenguas, uso del lenguaje y modelación cognitiva: perspectivas aplicadas entre disciplinas. UNED-AESLA, 2007.
[86] N. Tinkova. A state of the art review on automatic parsing of Spanish. Master’s thesis, Departament de Lingüística General, Universitat de Barcelona, 2007.
[87] B. Zapirain, E. Agirre, and L. Màrquez. Sequential SRL using selectional preferences. An approach with maximum entropy Markov models. In Proceedings of the 4th International Workshop on Semantic Evaluations (SemEval-2007), 2007.
[88] Beñat Zapirain, Eneko Agirre, and Lluís Màrquez. A preliminary study on the robustness and generalization of role sets for semantic role labeling. In Proceedings of the 9th International Conference on Computational Linguistics and Intelligent Text Processing, CICLing-2008, LNCS 4919, pages 219–230, Haifa, Israel, February 2008.
[89] Beñat Zapirain, Eneko Agirre, and Lluís Màrquez. Robustness and generalization of role sets: PropBank vs. VerbNet. In Proceedings of the 46th Annual Meeting of the Association of Computational Linguistics (ACL-08), pages 550–558, Columbus, Ohio, USA, 2008.

Project website, resources and demos

[90] KNOW project website: http://ixa.si.ehu.es/know.
[91] Surface syntactic parser for Basque: http://ixa2.si.ehu.es/demo/zatiak.jsp.
[92] Full syntactic parser for Basque: http://sisx04.si.ehu.es:8080/maltixa/maltixa.jsp.
[93] Full syntactic parser for Spanish, Catalan and English: http://garraf.epsevg.upc.es/freeling/demo.php.
[94] Technical interface to the MCR: http://adimen.si.ehu.es/cgi-bin/wei/public/wei.consult.perl.
[95] User-friendly interface to the MCR: http://www.lsi.upc.edu/%7enlp/know/herramienta.
[96] Word sense disambiguation based on the MCR (English): http://adimen.si.ehu.es/cgi-bin/ssi-dijkstra/index.pl.
[97] Word sense disambiguation based on supervised techniques (Basque): http://ixa3.si.ehu.es/wsd-demo/.
[98] Semantic role labeling (English): http://www.lsi.upc.edu/%7exlluis/jointparser.
[99] Cross-lingual information retrieval for Argazkipress (EPO): https://siuc05.si.ehu.es/argazkienbilaketa. Note: contact [email protected] for access keys.
[100] Question answering system for Basque on science and technology (Elhuyar EPO): http://sisx04.si.ehu.es:8080/IhardetsiWebDemo/IhardetsiBezeroa.jsp.
[101] Top Concept Ontology mapping to the MCR: http://lpg.uoc.edu/ewntco-wordnet-mapping.htm.
[102] KnowNet: http://adimen.si.ehu.es/web/knownet.
[103] Base level concepts of the MCR: http://adimen.si.ehu.es/web/blc.
[104] Interface to the Basque SemCor: http://sisx04.si.ehu.es:8080/eusemcor/.
[105] Open source word sense disambiguation program based on MCR graph: http://www2.let.vu.nl/twiki/pub/Kyoto/WP05:KnowledgeMining/wsd kyoto.tgz.
[106] Selectional preferences for semantic roles (English): ixa.si.ehu.es/ixa/resources/srl-selprefs/sp-pb-wn1.6.zip.
