NACSIS Corpus Project for IR and Terminological Research

Kyo Kageura, Teruo Koyama, Masaharu Yoshioka, Atsuhiro Takasu, Toshihiko Nozue
NACSIS, 3-29-1 Otsuka, Bunkyo-ku, Tokyo 112 Japan
E-Mail:

fkyo,koyama,yoshioka,takasu,[email protected]

Keita Tsuji
Library and Information Science Course, Graduate School of Education, The University of Tokyo, 7-3-1 Hongo, Bunkyo-ku, Tokyo 113 Japan
E-Mail: [email protected]

1 Introduction

In this paper we introduce the corpus construction project currently being carried out at the National Center for Science Information Systems (NACSIS), Japan.¹ In the following, the motivations and aims of the project are first introduced, followed by the linguistic specifications of the corpora. We then discuss some technical problems concerning the specification and the actual construction of the corpora.

¹ This project is part of the larger research project "A Study on Ubiquitous Information Systems for Utilization of Highly Distributed Information Resources", supported by the Japan Society for the Promotion of Science.

2 Motivations and Aims

NACSIS runs a retrieval service for academic documents and has recently launched an electronic library service for the Japanese academic community. To enhance these services, various lines of research are being pursued, covering IR user interfaces, indexing, the application of NLP to IR, etc. The project is situated against this background, and the corpus is mainly intended for research at the intersection of IR and NLP, especially terminology processing. The project is also motivated by the recognition of the following situations:

• Research in automatic term recognition has recently been developing, but most work is not sensitive to such distinctions as that between domain-oriented terminology and document-oriented keywords.

• Many methods have been proposed for various IR tasks, but there are few comparative examinations from a consistent point of view. TREC offers one such opportunity (Harman 1993-7), but does not cover Japanese.

Given this background, the corpus is first and foremost intended to provide theoretically sound data for carrying out various descriptive studies of terminology, as well as for developing and evaluating various lexically oriented methods for IR. Particular attention is therefore paid to (a) the consistent treatment of morphological units and word units, and (b) the relation between terms and their occurrences in texts, although we also pay attention to other aspects that make the corpus usable for other purposes, such as developing various NLP tools.

3 Specifications of the Corpora

3.1 Population and Types of Corpora

The target population is very important in the construction of a corpus (Biber 1994). The basic boundaries of the population are set by the fact that the services of NACSIS are concerned with academic documents; we are concerned with the present-day, written, stable linguistic phenomena of the academic fields. Within these boundaries, two different spheres of linguistic phenomena are defined for our purpose, i.e. terminological and textual. These two spheres are essential with respect to the basic aims of our corpus project. Thus two types of corpus are constructed, i.e. terminological and textual, which correspond to each other.

The domain and the text types are the next factors to be considered. We have made the following decisions: (i) the two domains of artificial intelligence and information processing are selected; (ii) conference abstracts and full journal papers are selected as the textual types; (iii) current data (from the late 1980s to the present) are chosen. These decisions are influenced by external factors such as the availability of resources, the needs at NACSIS, etc.

Table 1. Types of Data Selected for the Corpus
[Table body not recoverable from the source. Column headings: Linguistic Sphere (Terminological, Textual) and Text Types (Book, Abst., Paper, ...); row headings: Domain (AI, IP, ...).]

Table 1 summarises the types of data selected for the corpus. Although the range of the target population covered by the current corpus is limited, we allow for the consistent extension of the corpus in the future with respect to the linguistic variations within the target population. Note that our primary target is Japanese, although English data are kept when available.

3.2 Terminological Corpus

Table 2 lists the basic quantities, the reference sources, and the types of information added to the terminological corpus. In constructing the corpus, terms were first selected from the sources, entered into the computer, and then checked manually. A term decomposition program (Tsuji & Kageura 1997) was then applied, and its results were manually corrected. At the time of writing, we are in the process of defining the final POS set and of examining methods for defining lexicological information, i.e. (4) and (5) in Table 2.
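As a rough illustration of the workflow just described (automatic decomposition followed by manual correction), the following Python sketch shows one way such records might be kept. It is not the program of Tsuji & Kageura (1997): the `TermEntry` record, its field names, the greedy longest-match `decompose` function, and the toy lexicon are all assumptions introduced here for illustration only.

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class TermEntry:
    """One record of a terminological corpus (hypothetical field names)."""
    term: str
    domain: str                    # e.g. "AI" or "IP"
    source: str                    # reference source the term was taken from
    units: List[str] = field(default_factory=list)  # decomposition result
    manually_corrected: bool = False


def decompose(term: str, lexicon: Set[str]) -> List[str]:
    """Greedy longest-match split of `term` into units found in `lexicon`.

    A simple stand-in for a real decomposition program. Characters not
    covered by the lexicon are emitted as single-character units.
    """
    units = []
    i = 0
    while i < len(term):
        # Try the longest candidate first, falling back to one character.
        for j in range(len(term), i, -1):
            if term[i:j] in lexicon or j == i + 1:
                units.append(term[i:j])
                i = j
                break
    return units


def correct(entry: TermEntry, units: List[str]) -> TermEntry:
    """Record a manual correction; the units must still spell the term."""
    if "".join(units) != entry.term:
        raise ValueError("corrected units do not reconstruct the term")
    entry.units = units
    entry.manually_corrected = True
    return entry


# Toy lexicon and term, chosen only for illustration.
lexicon = {"情報", "検索", "システム"}
entry = TermEntry(term="情報検索システム", domain="IP", source="(hypothetical)")
entry.units = decompose(entry.term, lexicon)
print(entry.units)
```

Keeping the `manually_corrected` flag separate from the automatic output makes it possible to tell later which decompositions have been checked by hand; the check in `correct` only guards against corrections that no longer reconstruct the original term.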