The KANTOO Machine Translation Environment
Teruko Mitamura & Eric Nyberg
Language Technologies Institute, Carnegie Mellon University, Pittsburgh, PA 15213
Telephone: (412) 268-6596 Fax: (412) 268-6298
{teruko,ehn}@cs.cmu.edu
System Description / Demo
Abstract
In this paper we describe the KANTOO machine translation environment, a set of software services and tools for multilingual document production. KANTOO includes modules for source language analysis, target language generation, source terminology management, target terminology management, and knowledge source development. The KANTOO system represents a complete redesign and reimplementation of the KANT machine translation system. We discuss the motivation behind the development of KANTOO and describe the requirements and functionality of each KANTOO module.
1. Introduction
KANTOO is a knowledge-based, interlingual machine translation system for multilingual document production. KANTOO is a redesign and reimplementation of the KANT system [2, 4]. The KANTOO system consists of several modules, illustrated in Figure 1.
Figure 1: KANTOO Architecture. (Diagram labels: KANTOO Clients; KANTOO Servers; Lexical Maintenance Tool; Controlled Language Checker; Analyzer; Batch Translator; Generator; Knowledge Maintenance Tool; KANTOO Knowledge Bases; Knowledge Server; Language Translation Database; two Oracle DB instances; IMPORT/EXPORT links.)
KANTOO runs in a client-server environment. The source language Analyzer and target language Generator reside on network servers, and the Controlled Language Checker, Batch Translator, and Knowledge Maintenance Tool (KMT) are client programs residing on individual workstations. The Knowledge Maintenance Tool also interacts with the Knowledge Server, which provides version-controlled access to the KANTOO knowledge repository. Terminological resources are created and maintained using the Lexical Maintenance Tool (LMT) and Language Translation Database (LTD). The LMT and LTD are Oracle Forms applications, which connect to Oracle database servers and export their terminology in KANTOO dictionary format.

The KANTOO server architecture is robust and scalable: several domains, languages, and versions of their knowledge sources can be maintained and executed in parallel. The PC delivery format of the LTD and LMT allows those tools to be used by third-party translation vendors to develop terminology resources. These tools are in daily use at an industrial document production facility [1] for Spanish, French, and German.

In Sections 2 through 6, we describe the core servers and tools in the KANTOO architecture (see Footnote 1). In Section 7 we conclude with some remarks about ongoing and future KANTOO research topics.

Footnote 1: Space limitations preclude a discussion of a) the Controlled Language Checker, which has been discussed at length in [3], and b) the Batch Translator, which is a simple piece of driver code that uses the KANTOO servers to translate entire documents.
2. Analyzer Module
The Analyzer module performs tokenization, morphological processing, lexical lookup, syntactic parsing with a unification grammar, and semantic interpretation, yielding one or more interlingua expressions for each valid input sentence. For sentences that are not valid, the Analyzer produces an appropriate diagnostic message for the user. The Analyzer is implemented as a standalone server daemon, which spawns a new Analyzer process for each new client connection. The same Analyzer server can be used by the Controlled Language Checker, Batch Translator, and KMT in parallel.
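The one-handler-per-connection pattern described above can be sketched as follows. This is only an illustration and not the KANTOO implementation: the one-sentence-per-line protocol, the `analyze` placeholder, and all class and function names are assumptions, and threads stand in for the per-connection processes so that the sketch runs on any platform.

```python
import socket
import socketserver
import threading


def analyze(sentence: str) -> str:
    """Hypothetical stand-in for the Analyzer: a fake result or a diagnostic."""
    if not sentence.strip():
        return "DIAGNOSTIC: empty input"
    return "INTERLINGUA: " + " ".join(sentence.lower().split())


class AnalyzerHandler(socketserver.StreamRequestHandler):
    """Serves exactly one client connection; one instance per connection."""

    def handle(self) -> None:
        # Assumed protocol: one sentence per line in, one reply line out.
        for raw in self.rfile:
            reply = analyze(raw.decode("utf-8"))
            self.wfile.write((reply + "\n").encode("utf-8"))


class AnalyzerServer(socketserver.ThreadingTCPServer):
    # Threads are used here in place of spawned processes (an assumption
    # made for portability); each connection still gets its own handler.
    allow_reuse_address = True
    daemon_threads = True


def start_server(host: str = "127.0.0.1", port: int = 0) -> AnalyzerServer:
    """Start the server in the background; port 0 picks a free port."""
    server = AnalyzerServer((host, port), AnalyzerHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


def ask(address, sentence: str) -> str:
    """Minimal client: send one sentence and read one reply line."""
    with socket.create_connection(address) as conn:
        conn.sendall((sentence + "\n").encode("utf-8"))
        with conn.makefile("r", encoding="utf-8") as reader:
            return reader.readline().rstrip("\n")
```

Because each connection is served independently, several clients (for example a checker and a batch driver) could call `ask` against the same server address at the same time.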
3. Generator Module
The Generator module performs lexical selection, structural mapping, syntactic generation, and morphological realization for a particular target language. The same Generator executable can be loaded with different knowledge bases for different languages. The Generator is also implemented as a standalone server daemon, which spawns a new Generator process for each new client connection. The same Generator server can be used by the Batch Translator and KMT in parallel.
4. Lexical Maintenance Tool (LMT)
The Lexical Maintenance Tool (LMT) is implemented as an Oracle database and Forms application that helps users create, modify, and navigate through large numbers of lexical entries. The LMT brings together into one database the various kinds of lexical entries used in NLP development, including words, phrases, and specialized entries such as acronyms, abbreviations, and units of measure. The LMT's search function can scan the entire lexicon for entries with specific features. The main LMT window provides direct access to the main features of each entry; detailed supporting information about each entry can be accessed through a variety of pop-up windows.
5. Language Translation Database (LTD)
The Language Translation Database (LTD) tool is the target language counterpart to the LMT, and supports the following functions for translators developing target language terminology:
The tool provides support for both sequential navigation and partial (wildcard) keyword search;
The main window provides information on the source term: its part of speech, concept symbol, definition, and usage examples;
Translators can store comments about a source term and remarks about a translation (especially useful if a team of translators is working together to refine a domain vocabulary);
A status box clearly displays whether or not the current translator has made any changes to the translations of the current source term;
Each update is marked with a time stamp which indicates the time and origin of the change;
A special draft mode serves as a productivity enhancement: it provides the translator with a partial translation taken from the longest translated term which is most similar to the current (untranslated) term.
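The draft mode in the last item can be illustrated with a small sketch. The text does not define "most similar", so this example assumes one simple reading: among already-translated source terms, pick the longest one whose words occur as a contiguous run inside the current term. The function names and the sample term bank are hypothetical and are not taken from the LTD.

```python
from typing import Optional


def contains_phrase(term_words: list[str], phrase_words: list[str]) -> bool:
    """True if phrase_words occurs as a contiguous run inside term_words."""
    n = len(phrase_words)
    if n == 0:
        return False
    return any(term_words[i:i + n] == phrase_words
               for i in range(len(term_words) - n + 1))


def draft_translation(term: str,
                      translated: dict[str, str]) -> Optional[tuple[str, str]]:
    """Return (matched source term, its translation) as a draft, or None.

    Assumed similarity measure: the matched term must be contained in the
    current term; the longest match (by words, then characters) wins.
    """
    term_words = term.lower().split()
    candidates = [
        source for source in translated
        if source.lower() != term.lower()
        and contains_phrase(term_words, source.lower().split())
    ]
    if not candidates:
        return None
    best = max(candidates, key=lambda s: (len(s.split()), len(s)))
    return best, translated[best]


# Hypothetical English-to-Spanish entries, for illustration only.
bank = {
    "oil": "aceite",
    "oil filter": "filtro de aceite",
    "fuel filter": "filtro de combustible",
}
print(draft_translation("oil filter housing", bank))
```

Under these assumptions, the untranslated term "oil filter housing" would be pre-filled from "oil filter" rather than from the shorter match "oil", leaving only the remainder for the translator to complete.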
The LMT and LTD tools share the same virtual database schema, so there are formally specified procedures for synchronizing them following a period of distributed parallel development. In this way, KANTOO takes advantage of the fact that source and target terms are usually developed by different individuals, while making it possible to merge all the terms together into a multilingual term bank.
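As a rough illustration of such a merge, the sketch below joins separately developed source and target entries on a shared key. It assumes that the concept symbol shown in the LTD main window can serve as that key; the record layouts, the names, and the handling of unmatched translations are hypothetical and do not describe the actual synchronization procedures.

```python
from dataclasses import dataclass, field


@dataclass
class TermBankEntry:
    """One multilingual entry, keyed by an assumed shared concept symbol."""
    concept: str
    source_term: str
    part_of_speech: str
    translations: dict[str, str] = field(default_factory=dict)


def merge_term_bank(source_entries, target_entries):
    """Merge source-side and target-side records into one term bank.

    source_entries: iterable of (concept, source_term, part_of_speech)
    target_entries: iterable of (concept, language, translation)
    Returns (bank, orphans); orphans are translations whose concept has
    no source entry and therefore need review before merging.
    """
    bank = {c: TermBankEntry(c, term, pos) for c, term, pos in source_entries}
    orphans = []
    for concept, language, translation in target_entries:
        entry = bank.get(concept)
        if entry is None:
            orphans.append((concept, language, translation))
        else:
            entry.translations[language] = translation
    return bank, orphans


# Hypothetical sample data, for illustration only.
bank, orphans = merge_term_bank(
    [("OIL-FILTER", "oil filter", "N")],
    [
        ("OIL-FILTER", "es", "filtro de aceite"),
        ("OIL-FILTER", "fr", "filtre à huile"),
        ("GASKET", "de", "Dichtung"),
    ],
)
```

Reporting unmatched translations separately, rather than dropping them, is one way to keep independently edited source and target data consistent after a period of parallel work.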
6. Knowledge Maintenance Tool (KMT) and Knowledge Server
The Knowledge Maintenance Tool (KMT) is a graphical user interface which allows developers to test their knowledge changes immediately, in the context of a complete working system, without requiring a complete re-build of the system. The KMT operates in conjunction with the Knowledge Server, which provides distributed network access to a version-controlled repository of KANTOO knowledge sources. The KMT provides the following functionality for the developer/maintainer:
Incremental test/trace/edit/reload cycle. The developer can incrementally test, trace, and edit various rules in different knowledge sources, and then reload the updated knowledge sources into a running system with minimal delay. It is possible to create modified versions of the working system to test experimental updates, without affecting other developers working in parallel on other parts of the system.
Extensive tracing of different system modules. It is possible to trace any combination of modules in the system interactively.
Built-in syntax checking for rules. In order to isolate structural faults in rules before they are loaded into the system, all editing operations are accompanied by a syntax verification step.
Built-in support for interactive and batch testing. The developer can run a set of interactive and/or batch tests immediately after a knowledge source is updated, to determine whether the change is successful and has not caused other faults in the system.
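The edit, check, test, and reload steps listed above can be sketched as a single guarded update. This is a toy model under stated assumptions, not the KMT itself: knowledge sources are plain strings in a dictionary, the syntax check only verifies balanced parentheses (an invented stand-in for real rule syntax), and the names `check_syntax` and `try_update` are hypothetical.

```python
from typing import Callable


def check_syntax(rule_text: str) -> list[str]:
    """Toy syntax check (assumption): parentheses must be balanced."""
    errors = []
    depth = 0
    for pos, ch in enumerate(rule_text):
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                errors.append(f"unmatched ')' at offset {pos}")
                depth = 0
    if depth > 0:
        errors.append(f"{depth} unclosed '('")
    return errors


def try_update(live: dict[str, str], name: str, new_text: str,
               tests: list[Callable[[dict[str, str]], bool]]
               ) -> tuple[bool, list[str]]:
    """Apply an edit to the live knowledge sources only if it passes checks."""
    # Step 1: syntax verification before anything is loaded.
    errors = check_syntax(new_text)
    if errors:
        return False, errors
    # Step 2: load into an experimental copy; the live system is untouched.
    candidate = dict(live)
    candidate[name] = new_text
    # Step 3: run the batch tests against the experimental copy.
    failed = [t.__name__ for t in tests if not t(candidate)]
    if failed:
        return False, [f"test failed: {n}" for n in failed]
    # Step 4: reload the updated source into the running system.
    live[name] = new_text
    return True, []
```

A rejected edit leaves `live` unchanged, which mirrors the idea of testing experimental updates without affecting the working system.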
7. Current Status and Future Work
KANTOO is implemented in C++ (Analyzer, Generator, Knowledge Server), Java (KMT), and Oracle/Forms (LMT, LTD). KANTOO has been deployed under AIX and Linux, and will be ported to Windows NT in 2000. The KANTOO modules can be licensed from Carnegie Mellon University (please contact the authors if you are interested in a KANTOO software license).

The flexibility of the KANTOO client-server architecture supports distributed, parallel development of new applications and robust, scalable deployments. Our current research focuses on the issues related to deploying the KANTOO architecture in an environment where document authoring and document translation are performed by third-party vendors external to the customer site. This architecture is particularly well suited for the deployment of authoring and translation as distributed internet services, available over the network 24 hours a day.
References
[1] Kamprath, Christine, Eric Adolphson, Teruko Mitamura & Eric Nyberg: 1998, ‘Controlled Language for Multilingual Document Production: Experience with Caterpillar Technical English’, in Proceedings of the Second International Workshop on Controlled Language Applications: CLAW-98, Pittsburgh.
[2] Mitamura, Teruko, Eric Nyberg & Jaime Carbonell: 1991, ‘An Efficient Interlingua Translation System for Multi-lingual Document Production’, in Proceedings of the Third Machine Translation Summit, Washington, D.C.
[3] Mitamura, Teruko & Eric Nyberg: 1995, ‘Controlled English for Knowledge-Based MT: Experience with the KANT System’, in Proceedings of TMI-95, Leuven.
[4] Nyberg, Eric & Teruko Mitamura: 1992, ‘The KANT System: Fast, Accurate, High-Quality Translation in Practical Domains’, in Proceedings of COLING-92, Nantes.