Research Memo CS { 00 { 06
A Survey of Uses of GATE Diana Maynard, Hamish Cunningham, Kalina Bontcheva, Roberta Catizone, George Demetriou, Robert Gaizauskas, Oana Hamza, Mark Hepple, Patrick Herring, Brian Mitchell, Michael Oakes, Wim Peters, Andrea Setzer, Mark Stevenson, Valentin Tablan, Christian Ursu, Yorick Wilks July 2000 Institute for Language, Speech and Hearing (ILASH), and Department of Computer Science University of Sheeld, UK
[email protected] http://www.dcs.shef.ac.uk/~diana
Contents 1 Introduction
2
2 Motivation
2
3 Infrastructure for Language Processing R&D
3
4 GATE
4
5 Projects involving GATE
6
5.1 ECRAN . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
6
5.2 Cass-SWE . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
7
5.3 GIE . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
7
5.4 AVENTINUS . . . . . . . . . . . . . . . . . . . . . . . . . . .
8
5.5 SVENSK . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
8
5.6 LOTTIE . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
8
5.7 EUDICO . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
9
5.8 LaSIE . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
9
5.9 EMPathIE . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10 5.10 TRESTLE . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10 5.11 PASTA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11 5.12 German Named Entity Recognition . . . . . . . . . . . . . . . 11
6 Usage of GATE
12 1
7 Strengths and Weaknesses
14
8 Conclusion
15
9 Acknowledgements
15 Abstract
GATE, a General Architecture for Text Engineering, aims to provide a software infrastructure for researchers and developers working in NLP. GATE has now been widely available for four years. In this paper, we review the objectives which motivated the creation of GATE and the functionality and design of the current system. We describe some of the ways in which GATE has been used during this time, and examine the strengths and weaknesses of the current system, identifying areas for improvement.
1 Introduction This paper relates experiences in projects that have used GATE (General Architecture for Text Engineering) over the four years since its initial release in 1996. We begin in section 2 with some of the motivation behind this type of system, and then give a de nition of architecture in this context (section 3). Section 4 brie y describes GATE; section 5 covers a range of projects that have used the system; and section 6 examines some additional ways in which GATE has been used. These experiences form the input to section 7, which discusses the system's strengths and weaknesses.
2 Motivation If you're researching human language processing you should probably not be writing code to:
store data on disk; 2
display data; load processor modules and data stores into processes; initiate and administer processes; divide computation between client and server; pass data between processes and machines.
A Software Architecture for language processing should do all this for you. You will have to parameterise it, and sometimes deployment of your work into applications software will require some low-level ddling for optimisation purposes, but in the main these activities should be carried out by infrastructure for the language sciences, not by each researcher in the eld. We can go further and say that you shouldn't have to reinvent components and resources outside of your specialism if there is already something that could do the job. A statistician doesn't need to know the details of the IEEE Floating Point computation standard; a discourse processing specialist doesn't need to understand all the ins and outs of part-of-speech tagging (or worse still how to install a particular POS tagger on a particular machine). If you're a professional mathematician, you probably regard a tool like SPSS or Mathematica as necessary infrastructure for your work. If you're a computational linguist or a language engineer, the chances are that large parts of your work have no such infrastructural support. Where there is infrastructure, it tends to be speci c to restricted areas. GATE, a General Architecture for Text Engineering [CHGW97], represents an attempt to ll this gap, and is a software architecture for language processing R&D. We now have four years of experience with GATE, work on which began in 1995, with a rst widespread release late in 1996. The system is currently at a pivotal point in its development, with a new version due for release. The experiences reported in this paper contribute to the requirements analysis and design of the new system.
3 Infrastructure for Language Processing R&D What does infrastructure mean for Natural Language Processing (NLP)? What sorts of tasks should be delegated to a general tool, and which should 3
be left to individual projects? The position we took in designing GATE is to focus on the common elements of NLP systems. There are many useful tools around for performing speci c tasks such as developing feature structure grammars for evaluation under uni cation, or collecting statistical measures across corpora. To varying extents, they entail the adoption of particular theories. The only common factor of NLP systems, alas, seems to be that they very often create information about text. Developers of such systems create modules and data resources that handle text, and they store this data, exchange it between various modules, compare results of test runs, and generally spend inordinate amounts of time pouring over samples of it when they really should be enjoying a slurp of something relaxing instead. The types of data structure typically involved are large and complex, and without good tools to manage and allow succinct viewing of the data we work below our potential. At this stage in the progress of our eld, no one should really have to write a tree viewing program for the output of a syntax analyser, for example, or even have to do signi cant work to get an existing viewing tool to process their data. In addition, many common language processing tasks have been solved to an acceptable degree by previous work and should be reused. Instead of writing a new part of speech tagger, or sentence splitter, or list of common nominal compounds, we should have available a store of reusable tools and data that can be plugged into our new systems with minimal eort. Such reuse is much less common than it should be, often because of installation and integration problems that have to be solved afresh in each case [CFB94]. In sum, we de ne our infrastructure as an architecture, framework and development environment, where an architecture is a macro-level organisational pattern for the components and data resources that make up a language processing system; a framework is a class library implementing the architecture; a development environment adds graphical tools to access the services provided by the architecture.
4 GATE GATE version 1.n does three things: 4
manages textual data storage and exchange; supports visual assembly and execution of modular NLP systems plus visualisation of data structures associated with text; provides plug-in modularity of text processing components.
The architecture does this using three subsystems:
GDM, the GATE Document Manager; GGI, the GATE Graphical Interface; CREOLE, a Collection of REusable Objects for Language Engineering.
Figure 1: Gate Architecture GDM manages the information about texts produced and consumed by NLP processes; GGI provides visual access to this data and manages control ow; CREOLE is the set of resources so far integrated. Developers working with GATE begin with a subset of CREOLE that does some basic tasks, perhaps tokenisation, sentence and paragraph identi cation and part-of-speech tagging. They then add or modify modules for their speci c tasks. They use a single API for accessing the data and for storing their data back into 5
the central database. With a few lines of con guration information they allow the system to display their data in friendly graphical form, including tree diagrams where appropriate. The system takes care of data storage and module loading, and can be used to deliver embeddable subsystems by stripping the graphical interface. It supports modules in any language including Prolog, Lisp, Perl, Java, C++ and Tcl.
5 Projects involving GATE Over the past 4 years, GATE has been used in a number of projects, not only within the University of Sheeld, but also by a variety of external institutions. In this section, we outline some of the main projects that have used GATE, and examine its performance in each case.
5.1 ECRAN Goal: ECRAN (Extraction of Content: Research at Near-market) [BPV+97]
was a 3-year EU funded research project with the main aim of carrying out Information Extraction (IE) using adapted lexicons. Participants: Thomson-CSF (Paris) (project co-ordinators), SIS (Smart Information Systems, Germany), University of Sheeld, University of Rome La Sapienza, University of Geneva, NCSR \Demokritos" (Athens) Description: GATE was mainly used in this project to implement a general word sense disambiguation engine based on a combination of classi ers. Bene ts: The modular architecture of GATE allowed this to be carried out very rapidly. Drawbacks: Two main disadvantages were found with GATE. (1) The architecture was under development at the same time as the word sense disambiguation engine. (2) The speed of database access for the Tipster database was found to be slow for large amounts of lexical data. The solution used was to store large amounts of lexical data separately from GATE as gdbm hash tables.
6
5.2 Cass-SWE Goal: The aim of the Cass-SWE project (A Cascaded Finite-State Parser
for Syntactic Analysis of Swedish) [KJK99] was to create a parsing system for fast and accurate analysis of large volumes of written Swedish. Participants: Sprakdata/Goteborg University, Sweden. Description: Cass-SWE implements a grammar as a modular set of 6 small grammars. GATE is used to integrate all the required software components into one system prior to parsing, and to enable the results to be visualised in a user-friendly environment. Bene ts: GATE allows the tagging process to be carried out sequentially, and enables modi cation of individual elements without disruption to others. Using GATE as a visualisation environment also enables the results of Cass-SWE to be further used in applications such as IE tasks and additional semantic processing. Drawbacks: There were a few initial diculties understanding the workings of the GATE system, but problems originally thought to be caused by GATE were later traced to the CASS parser.
5.3 GIE Goal: The aim of the GIE (Greek Information Extraction) project [PPK+99]
was to develop a prototype named entity recognition model for Greek.
Participants: NCSR \Demokritos" (Athens), University of Sheeld Description: The GIE system is based on the VIE system provided with
GATE, but requires dierent language-speci c resources such as gazetteers and grammars. Using GATE enables non-language speci c resources to be reused from the English version, thereby saving time and eort. Bene ts: GATE facilitated signi cantly the integration of existing and new modules in GIE, as well as the validation of the nal demonstrator. It was generally found to be fast, easy to use and powerful. Drawbacks: GATE's demand for system resources as document size increases can become a serious limitation. Complex compilation processes made the embedding of static modules dicult. GATE also has some diculties supporting non-Latin languages. mostly relating to the GUI. Many minor possible improvements to the GUI and to GATE in general (such as the addition of new features) were identi ed during this project.
7
5.4 AVENTINUS Goal: AVENTINUS (Advanced Information System for Multinational Drug
Enforcement) is an EU funded research and development programme set up to build an information system for multinational drug enforcement. Participants: SIETEC (Germany), ADB (France), Amt fur Auslandsfragen (Germany), Bundeskriminalamt (Germany), Sprakdata Gothenburg (Sweden), Institute for Language and Speech Processing (Greece), INCYTA (Spain), University of Sheeld. Description: AVENTINUS aims to collect information from distributed international sources, using advanced linguistic techniques to improve IE, involving multimedia resources and supporting multilinguality.
5.5 SVENSK Goal: SVENSK [Ols97, OGE98, GO00] was a 4-year project aimed at de-
veloping an integrated toolbox of language processing components and resources for Swedish. Participants: SICS (Swedish Institute of Computer Science), NUTEK, Uppsala University, Goteborg University, PipeBeach AB., Telia Research AB, IBM Svenska AB Description: The toolbox is based on the GATE language engineering platform and incorporates language processing tools developed at SICS or contributed by external sources. Bene ts: Each component has a standardised interface, so users have the choice of working within GATE or selecting and combining supplied components for integration into a user application. GATE is useful in that it is not committed to any particular type of data or task. The emphasis on modularity was also found to be particularly appealing. Drawbacks: GATE was at the time still in its early phases and had some problems with very large-scale resources. Speci cation of byte oset and I/O requirements for dierent modules was also dicult.
5.6 LOTTIE Goal: LOTTIE (Low Overhead Triage from Text using Information Extrac-
tion) was a demonstrator project for the GATE infrastructure. It aimed to 8
provide proof-of-concept by implementing demonstration software dealing with the major technological problems involved in computer-assisted triage. Participants: University of Sheeld Description: LOTTIE did not itself use GATE, but formed a basis on which to prototype initial versions of release 2. Parts of it were real, based on a project in a dierent domain, and parts of it served as a test case for GATE development and as a demonstration of future possibilities.
5.7 EUDICO Goal: The aim of Eudico was a distributed multimedia infrastructure sup-
porting annotation of speech and video corpora [BRWP98]. Participants: Max Planck Institute for Psycholinguistics (Nijmegen, Netherlands), University of Sheeld Description: Eudico enables transcriptions of utterances to be time-aligned with speech and video data, so that dynamic and simultaneous viewing and editing is possible. Integration with GATE was carried out in order to bene t from GATE's ability to represent, store and visualise linguistic data. Bene ts: The exibility of GATE's data model enabled the seamless integration between EUDICO's time-based data and GATE's oset-based annotations. This enabled the representation, manipulation and display of time-aligned transcriptions into GATE's viewers, allowing the user to manipulate the dierent types of data simultaneously in a uniform environment. Drawbacks: There is a certain lack of support for distributed/remote access to the document manager. Therefore in a client-server environment, the entire data has to be sent over the network instead of just the parts that are needed.
5.8 LaSIE Goal: LaSIE [WG99]is an advanced large-scale IE system, performing named entity recognition, coreference resolution, template element lling and scenario template lling. Participants: University of Sheeld Description: LaSIE was designed speci cally to work within the GATE architecture, and led to the free distribution of its counterpart, VIE, a baseline IE system. LaSIE modules within GATE have also formed part of other 9
customised projects within the EC Fourth Framework (AVENTINUS and ECRAN).
5.9 EMPathIE Goal: EMPathIE (Enzyme and Metabolic Path Information Extraction)
was an 18-month research project aimed at applying Information Extraction technology to bioinformatics tasks. Participants: Dept. of Information Studies & Dept. of Computer Science (University of Sheeld), Glaxo Wellcome plc., Elsevier Science. Description: EMPathIE aims to extract details of enzyme reactions from articles in biomedical journals. The IE system is derived from LaSIE and was developed within the GATE architecture. Bene ts: The embedding of EMPathIE within the GATE environment means that many modules can be reused. EMPathIE thus makes use of many of the LaSIE modules, and itself produces modules which have been used for other related projects. Using GATE therefore enables much of the low-level work in moving IE systems to new domains to be carried out effortlessly.
5.10 TRESTLE Goal: TRESTLE (Text Reuse, Extraction and Summarisation for Large
Enterprises) [TRE00] is a 2-year project involving IE from electronic alerting bulletins distributed daily throughout the pharmaceutical industry. Participants: Glaxo-Wellcome plc, University of Sheeld Dept. of Computer Science and Dept. of Information Studies. Description: TRESTLE is based on the LaSIE IE system, but requires different domain-speci c resources, such as gazetteers and ontology, and substantial modi cation of the discourse interpreter and template writer. Bene ts: GATE provides domain independent linguistic components for TRESTLE, the most important of which is the semantic parser. Named Entity recognition requires only the installation of domain speci c gazetteers. Drawbacks: Strong computing skills are necessary in order to make the most of the system.
10
5.11 PASTA Goal: PASTA (Protein Active Site Template Acquisition) [KHG00] extracts
information about protein structures directly from scienti c journal papers, and stores them in a template. Participants: Depts. of Computer Science, Molecular Biology & Biotechnology, and Information Studies (University of Sheeld). Description:The system has been adapted to the molecular biology domain from pre-existing IE technology such as LaSIE. The progress so far demonstrates the feasibility of developing intelligent systems for IE from text-based sources in the pursuit of knowledge in the biological domain. Bene ts: The use of a common database for storing intermediate results oers several advantages. GATE allows simple integration of heterogeneous system components and algorithms. The user interface is also attractive. Drawbacks: GATE version 1 is slow and memory hungry, even for mediumsized documents, and is not very robust in face of changes in versions of the C compiler used (gcc).
5.12 German Named Entity Recognition Goal: German Named Entity Recognition [Mit97] was an MSc project to adapt part of the LaSIE system to deal with German, and to test whether the architecture was suitable for processing a language other than English. Participants: Dept. of Computer Science, Sheeld University Description: The system followed the same general architecture as LaSIE, but with modi cations to various modules such as the grammar and tokeniser. Bene ts: Using the GATE architecture meant that only fairly minor modi cations to individual modules were necessary, and rule adaptation was easy. The evaluation of LaSIE as a tool for processing other languages was very positive (as borne out by the later development of M-LaSIE, a multilingual IE system). Drawbacks: The ability to group modules into blocks for processing would be a useful addition, as would an easier method of inserting new modules in the correct place.
11
6 Usage of GATE In addition to the projects described above, GATE has been used as a research tool, for applications development and for teaching. Some examples of these uses are given below. Our colleagues in the Universities of Edinburgh, UMIST in Manchester, and Sussex (amongst others) have reported using the system for teaching, and the University of Stuttgart produced a tutorial in German for the same purposes (see http://www.dcs.shef.ac.uk/gate/contrib/michael.dorna). Numerous postgraduates in locations as diverse as Israel, Copenhagen and Surrey are using the system in order to avoid having to write simple things like sentence splitters from scratch, and to enable visualisation and management of data. Turning to applications development, ESTEAM Inc., of Gothenburg and Athens are using the system for adding name recognition to their MT systems (for 26 language pairs) to improve performance on unknown words. Syntalex Ltd., of London, are developing a product that automatically applies amendments to legal documents within the GATE framework. Both British Gas Research and Master Foods NV (owner of the Mars confectionery brand) used the LaSIE system for competitor intelligence systems. LaSIE was con gured as an embedded library using GATE's deployment facilities. [GRG+ 98] report experiments using Named Entity tagged language models for large vocabulary connected speech recognition. The modelling data was created by running LaSIE as a batch process using GATE's command line interface. Competitor intelligence researchers in Finland are using GATE in the BRIEFS project [Kei99] (Brief driven Information retrieval and Extraction for Strategy), which concerns business intelligence for companies. Although they had some initial diculties getting the system to run on their platform (Linux Red Hat 6.0), they were able to get good results in a short period of time, which was one of their main reasons for choosing GATE in the rst place1 . 1
(Matti Keijola, personal communication, October 1999)
12
The Polytechnic of Catalunya in Barcelona have used GATE as part of the ITEM project for multilingual IE and retrieval for Spanish, Catalan and Basque. They used GATE as a framework for integrating their tools for dierent languages, and as a visualisation tool for the results. They found it particularly useful for providing a user-friendly environment for non-experts, although its slow speed in processing large corpora was a problem.2 GATE has also been used for Information Extraction in English, German, French, Spanish, Greek and Swedish. In these and other projects, researchers have contributed a diverse and growing set of components to CREOLE, the collection of language resources. CREOLE now includes:
the VIE3 English IE components (tokeniser and text structure analysers; sentence splitter; several POS taggers; morphological analyser; chart parser; name matcher; discourse interpreter); the Alvey Natural Language Tools morphological analyser and parser; the Plink parser; the Parseval tree comparison software; the MUC scoring tools; French parsing and morphological analysis tools from Fribourg and INRIA; Italian corpus analysis tools from Rome Tor Vergata; a wide range of Swedish language processing tools.
The main de ciency in this set is a bias towards language analysis, and towards Processing Resources above Language Resources. A partial list of GATE licensees (of which there were over 300 at the end of 1999) is available at http://www.dcs.shef.ac.uk/nlp/gate/users.html. Along with the experiences cited above, this list indicates that take-up of the system is healthy and that LE R&D workers have found the system useful in many contexts. 2 3
(Jordi Batalla, personal communication, March 1999.) A `Vanilla IE' system related to LaSIE
13
7 Strengths and Weaknesses The strengths of the nal version 1 release of GATE are that it:
facilitates reuse of NLP components by reducing the overheads of integration, documentation and data visualisation; facilitates multi-site collaboration on IE research by providing a modular base-line system (VIE) with which others can experiment; facilitates comparative evaluation of dierent methods by making it easy to interchange components; facilitates task-based evaluation, both of `internal' components such as taggers and parsers, and of whole systems, e.g. by using materials from the ARPA MUC programme [GS96] (whose scoring software is available in GATE, as is the Parseval tree scoring tool [Har91], and a generic annotation scoring tool [RGHC97]); provides a reasonably easy to use, rich graphical interface; contributes to increased software-level robustness, quality and eciency in NLP systems, and to their cross-platform portability (all UNIX systems, Windows NT and Windows 95; native support for Java, C++ and Tcl); contributes to the portability of NLP systems across problem domains by providing a markup tool for generating training data for examplebased learning (it can also take input from the Alembic tool [DAH+ 97] for this purpose); uni es the two best approaches to managing information about text by combining a TIPSTER-style database with SGML input/output lters (developed using tools from Edinburgh's Language Technology Group [MBT97]).
The principal problems with version 1 are that:
It is biased towards algorithmic components for language processing, and neglects data resource components (PRs vs. LRs). It is biased towards text analysis components, and neglects text generation components. 14
The database implementation is space and time inecient. The visual interface is complex and somewhat non-standard. The task graph generation and management process does not scale beyond small component sets: \GGI suers from the scaling problem [BBB+ 87], as the size of the custom graph quickly becomes unmanageable" [RGHC97]. Only the annotator component model is extensible; adding new viewers or tools is not possible. Installing and supporting the system is a skilled job, and it runs better on some platforms than on others (UNIX vs. Windows). Sharing of components depends on sharing of annotation de nitions (but isomorphic transformations are relatively easy to implement). It only caters for textual documents, not for multi-media documents. It only supports 8-bit character sets. Module reset cascades through all previously run PRs that made nonmonotonic database updates.
Work is currently in progress on Version 2 of GATE, which aims to combat some of these problems and extend the range of the system [Cun00]. Further details of requirements for this type of system, and how to evaluate them, can be found in [CBTW00].
8 Conclusion Based on the collective experiences of a sizeable user base across the EU and elsewhere, the system can claim to be a viable infrastructure for certain sections of the eld. Given further development, we hope that it can take on this role for a wider variety of tasks.
9 Acknowledgements This work was supported by EPSRC grants GR/K25267 and GR/M31699. 15
References [BBB+ 87] M. Burnett, M.J. Baker, C. Bohus, P. Carlson, S. Yang, and van Zee P. Scaling Up Visual Languages. IEEE Computer, 28(3):45{54, 1987. [BPV+ 97] R. Basili, M. Pazienza, P. Velardi, R. Xatizone, R. COllier, M. Stevenson, Y. Wilks, O. Amsaldi, A. Luk, B. Vauthey, and J. Grandchamp. Extracting case relations from corpora. ECRAN Deliverable 2.4 version 1, 1997. [BRWP98] H. Brughman, A. Russel, P. Wittenburg, and R. Piepenbrock. Corpus-based research using the Internet. In First International Conference on Language Resources and Evaluation (LREC) Workshop on Distributing and Accessing Linguistic Reseources, Granada, Spain, 1998. [CBTW00] H. Cunningham, K. Bontcheva, V. Tablan, and Y. Wilks. Software Infrastructure for Language Resources: a Taxonomy of Previous Work and a Requirements Analysis. In Proceedings of the 2nd International Conference on Language Resources and Evaluation (LREC-2), Athens, 2000. http://gate.ac.uk/. [CFB94] H. Cunningham, M. Freeman, and W.J. Black. Software Reuse, Object-Oriented Frameworks and Natural Language Processing. In New Methods in Language Processing (NeMLaP-1), September 1994, Manchester, 1994. (Re-published in book form 1997 by UCL Press). [CHGW97] H. Cunningham, K. Humphreys, R. Gaizauskas, and Y. Wilks. Software Infrastructure for Natural Language Processing. In Proceedings of the Fifth Conference on Applied Natural Language Processing (ANLP-97), March 1997. http://xxx.lanl.gov/abs/cs.CL/9702005. [Cun00] Hamish Cunningham. Software Architecture for Language Engineering. PhD thesis, University of Sheeld, 2000. http://gate.ac.uk/sale/thesis/. [DAH+ 97] D. Day, J. Aberdeen, L. Hirschman, R. Kozierok, P. Robinson, and M. Vilain. Mixed-Initiative Development of Language Processing Systems. In Proceedings of the 5th Conference on Applied NLP Systems (ANLP-97), 1997. 16
[GO00]
B. Gamback and F. Olsson. Experiences of Language Engineering Algorithm Reuse. In Second International Conference on Language Resources and Evaluation (LREC), pages 155{160, Athens, Greece, 2000. [GRG+ 98] Y. Gotoh, S. Renals, R. Gaizauskas, G. Williams, and H. Cunningham. Named Entity Tagged Language Models for LVCSR. Technical Report CS-98-05, Department of Computer Science, University of Sheeld, 1998. [GS96] R. Grishman and B. Sundheim. Message understanding conference - 6: A brief history. In Proceedings of the 16th International Conference on Computational Linguistics, Copenhagen, June 1996. [Har91] P. Harrison. Evaluating Syntax Performance of Parsers/Grammars of English. In Proceedings of the Workshop on Evaluating Natural Language Processing Systems, ACL, 1991. [Kei99] M. Keijola. BRIEFS { Gaining Information of Value in Dynamical Business Environments. http://www.tuta.hut.fi/briefs, 1999. [KHG00] G. Demetriou K. Humphreys and R. Gaizauskas. Two applications of information extraction to biological science journal articles: Enzyme interactions and protein structures. In Proc. of Paci c Symposium on Biocomputing (PSB-2000), Honolulu, Hawaii, 2000. [KJK99] D. Kokkinakis and S. Johansson-Kokkinakis. Cascaded nitestate parser for syntactic analysis of swedish. Technical Report GU-ISS-99-2, Dept. of Swedish, Goteborg University, 1999. http://svenska.gu.se/ svedk/publications.html. [MBT97] D. McKelvie, C. Brew, and H. Thompson. Using SGML as a Basis for Data-Intensive NLP. In Proceedings of the fth Conference on Applied Natural Language Processing (ANLP-97), Washington, DC, 1997. [Mit97] B. Mitchell. Named Entity Recognition in German: the identi cation and classi cation of certain proper names. Master's thesis, Dept. of Computer Science, University of Sheeld, 1997. http://www.dcs.shef.ac.uk/campus/dcscd/projects/bm.pdf. 17
[OGE98]
F. Olsson, B. Gamback, and M. Eriksson. Reusing Swedish Language Processing Resources in SVENSK. In Workshop on Minimising the Eorts for LR Acquisition, Granada, Spain, 1998. [Ols97] F. Olsson. Tagging and morphological processing in the svensk system. Master's thesis, University of Uppsala, 1997. http://http://stp.ling.uu.se/ fredriko/exjobb.ps. [PPK+ 99] G. Petasis, G. Paliouras, V. Karkaletsis, C.D. Spyropoulos, and I. Androutsopoulos. Resolving part-of-speech ambiguity in the greek language using learning techniques. In Proc. of the ECCAI Advanced Course on Arti cial Intelligence (ACAI), Chania, Greece, 1999. [RGHC97] P.J. Rodgers, R.J. Gaizauskas, K.W. Humphreys, and H. Cunningham. Visual Execution and Data Visualisation in Natural Language Processing. In IEEE Visual Language, Capri, Italy, 1997. [TRE00] TRESTLE. The TRESTLE project. http:/www.dcs.shef.ac.uk/research/groups/nlp/trestle, 2000. [WG99] Y. Wilks and R. Gaizauskas. Report on epsrc research grant on the large scale information extraction research project. Technical Report GR/K25267, University of Sheeld, 1999.
18