ISA & ICA - Two Web Interfaces for Interactive Alignment of Bitexts
Jörg Tiedemann
Alfa Informatica, University of Groningen
Oude Kijk in ’t Jatstraat 26, 9712 EK Groningen, The Netherlands
[email protected]

Abstract
ISA and ICA are two web interfaces for interactive alignment of parallel texts. ISA provides an interface for automatic and manual sentence alignment. It includes cognate filters, uses structural markup to improve automatic alignment, and provides intuitive tools for editing the resulting alignments. Alignment results can be saved to disk or sent via e-mail. ICA provides an interface to the clue aligner from the Uplug toolbox. It allows one to set various parameters and visualizes alignment results in a two-dimensional matrix. Word alignments can be edited and saved to disk.
1. Introduction
In this paper we describe two web-based interfaces for interactive alignment of parallel texts, one for sentence alignment (ISA) and one for word alignment (ICA). Both interfaces use automatic alignment as the backbone but allow users to revise the proposed alignments. The sentence aligner can also be used to create manual alignments from scratch. Manual alignments have previously been used for evaluation purposes, i.e. to produce gold standards for automatic alignment systems, see e.g. (Véronis and Langlais, 2000; Melamed, 2001). Alignment interfaces have also been shown to be effective for visualization (Smith and Jahr, 2000) and alignment improvement (Isahara and Haruno, 2000; Ahrenberg et al., 2003; Callison-Burch et al., 2004). Our interfaces include both alignment visualization and post-editing tools for sentence and word alignment.
2. General Architecture

... It will be pursued with firmness and consistency . ...
Figure 1: A simple example of tokenized input text.

ISA and ICA are part of the Uplug toolbox for parallel text processing (Upl, 2005) and are designed to interact with other parts of the toolbox. They are implemented in PHP (JavaScript is also used for highlighting objects and tooltip windows). Hence, they are server-side web applications that provide interfaces for aligning parallel texts stored on the server’s file system. They do not provide functions for file management (e.g. for uploading files to the server). Corpus files have to be installed and prepared off-line on the server before they can be aligned via the interfaces. This can easily be done using the provided configuration routines and the Uplug tools.

Both ISA and ICA use Uplug alignment tools for automatic alignment. Essentially, they call external Uplug scripts and transform their output into the information displayed in the web interfaces. ISA and ICA allow users to modify alignment parameters and to edit alignment results. In this way, they can be used to visualize alignments with different settings and to post-edit alignment results.

ISA and ICA require different kinds of pre-processing. Both presume tokenized corpus data in XML format. Figure 1 shows a simple example of one tokenized sentence in an English sample corpus.
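To illustrate the kind of tokenized XML input that both interfaces presume, the following minimal Python sketch parses a single sentence. The element names `s` and `w` and the `id` attribute values are assumptions made for this example only; the actual markup is defined by the Uplug tools and may differ.

```python
import xml.etree.ElementTree as ET

# Hypothetical tokenized sentence. Element names (s, w) and the id scheme
# are assumptions for illustration, not taken from the paper.
SAMPLE = """<s id="s1">
  <w id="w1.1">It</w>
  <w id="w1.2">will</w>
  <w id="w1.3">be</w>
  <w id="w1.4">pursued</w>
  <w id="w1.5">with</w>
  <w id="w1.6">firmness</w>
  <w id="w1.7">and</w>
  <w id="w1.8">consistency</w>
  <w id="w1.9">.</w>
</s>"""


def read_tokens(xml_text):
    """Return the sentence id and the list of token strings."""
    sentence = ET.fromstring(xml_text)
    tokens = [w.text for w in sentence.iter("w")]
    return sentence.get("id"), tokens


sent_id, tokens = read_tokens(SAMPLE)
print(sent_id, " ".join(tokens))
```

Keeping each token in its own element means that an aligner can refer to sentences and words by identifier rather than by character position.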
Corpus files may include further markup that can be used in automatic alignment. For instance, structural markup such as paragraph boundaries can be used in the sentence aligner interface. Another example is part-of-speech tags, which can be used as a feature in word alignment clues. However, once tokenized and annotated, corpus files should be static, i.e. they should not be modified anymore, because ISA and ICA produce indices with file positions in order to jump to arbitrary sentences in the corpus. The indexing is done automatically when the web interfaces are opened for the first time with a new corpus.

ICA (the word aligner) presumes sentence alignment and a collection of alignment clues. Sentence alignment may be done by ISA or off-line using the automatic sentence aligner integrated in Uplug. Alignment clues, however, always have to be produced off-line before starting ICA. They can also be re-used from other alignment runs. Please refer to the section on ICA for further information.

Both alignment interfaces can be used for several corpora. You may prepare any number of alignment projects using the configuration scripts. ICA and ISA will scan through the corpus directory to look for available data. Initially, the user has to select a corpus and may switch to another one at any later point when working with the interfaces.
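The idea of indexing file positions, and the reason why corpus files must stay static afterwards, can be sketched as follows. This is an illustrative Python sketch and not the actual ISA/ICA code, which is written in PHP; it assumes, purely for simplicity, that each sentence element starts on its own line with an `<s` tag. The function names `build_index` and `read_sentence` are hypothetical.

```python
import re

# Assumption: a sentence starts with "<s" followed by whitespace or ">".
SENT_START = re.compile(rb"<s[\s>]")


def build_index(path):
    """Return the byte offset of every line that opens a sentence element."""
    offsets = []
    with open(path, "rb") as f:
        while True:
            pos = f.tell()          # position before reading the line
            line = f.readline()
            if not line:            # end of file
                break
            if SENT_START.match(line.lstrip()):
                offsets.append(pos)
    return offsets


def read_sentence(path, offsets, n):
    """Seek directly to sentence n (0-based) and return its first line."""
    with open(path, "rb") as f:
        f.seek(offsets[n])
        return f.readline().decode("utf-8").strip()
```

Because the index stores raw byte offsets, any later change to the corpus file shifts the positions of all following sentences and invalidates the index, which is why the files should not be modified once indexed.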
ISA and ICA need to store some intermediate data to preserve settings and parameters in your session. This is done using PHP session support, which usually relies on a client-side cookie to identify the session. Hence, cookies should be enabled in your browser. Due to the server-side architecture, ISA and ICA require write permissions on the file system of the web server in the corpus directory (corpora/).

User authentication and other routines for Internet security are not included. Access permissions and the like should be handled using other standard techniques, depending on your web server software. Furthermore, no multi-user mode is implemented in the system. Running the interfaces and editing alignments with multiple users at the same time is not a problem because intermediate data is kept separately for each session. However, saving data is done on the file system of the server and will overwrite previously saved data. Storing data on disk can be disabled in the configuration scripts for each corpus to avoid this problem.

In general, ICA and ISA are not designed for processing huge amounts of data. However, this is not the intended purpose of an interactive alignment tool.