TBMS: Domain Specific Text Management and Lexicon Development

9 downloads 0 Views 377KB Size Report
Input and editing of texts according to numerous points of view. 2. Management of an unlimited number of text units on a suitably sized auxiliary storage device.
TBMS: Domain S p e c i f i c Text Management and Lexicon Deve]opment* S. Goeser and E. Mergenthaler University of Ulm FRG

Abstract The d e f i n i t i o n of a Text Base Management System is

component and a l l o w extensive processing of t e x t ,

introduced in terms of software engineering. That

but no kind of t e x t analysis a t a l l .

gives a basis for discussing p r a c t i c a l t e x t adminis t r a t i o n , including questions on corpus properties and appropriate r e t r i e v a l c r i t e r i a .

Finally, strate-

2. D e f i n i t i o n

o f a TBMS

gies for the d e r i v a t i o n of a word data base from an actual TBMS w i l l be discussed.

From the p o i n t of view of a TBMS-user, who is supposed to be a non-programming a p p l i c a t i o n - f i e l d worker, the system i s an instrument to take up, to

l. Introduction

control and/or to analyze his or her i n d i v i d u a l t e x t s for domain-dependent purposes. Consequently, a

Textual data are a sort of complex data o b j e c t that is of growing importance in many a p p l i c a t i o n s .

system intended to manage a t e x t bank has the f o l lowing tasks:

Research p r o j e c t s from such d i f f e r e n t f i e l d s as h i s t o r y , law, social sciences, humanities and l i n g u i s t i c s but also commercial i n s t i t u t i o n s are dea-

1. Input and e d i t i n g of texts according to numerous points of view.

l i n g with vast q u a n i t i t i e s of t e x t . At Ulm University f o r instance, a machine-readable corpus of spoken language t e x t s has been b u i l t up, with the purpose

2. Management of an unlimited number of t e x t units on a s u i t a b l y sized a u x i l i a r y storage device.

of support For psychotherapeutic process research. The corpus is administered by a T e x t Base Manage-

3. Management of an u n l i m i t e d amount of information

ment System (TBMS), that i n t e g r a t e s the functions

concerning the t e x t u n i t s , t h e i r authors, and the

of archiving, processing and analyzing an a r b i t r a r y

r e l a t e d t e x t analyses.

amount of t e x t (MERGENTHALER 1985). 4. Management of an open q u a n t i t y of methods for Several sysLmns s a t i s f y i n g the TBMS d e f i n i t i o n were

e d i t i n g and analyzing stored t e x t u n i t s .

conceived independently in the l a t e seventies. THALLER (1983) reports a system CLIO, a TBMS with a highly d i f f e r e n t i a t e d data base component and a

5. Assistance f o r i n t e r f a c e s to s t a t i s t i c a l other user packages.

and

method base providing c~nputerized content analysis and comfortable e d i t i n g . LDVLIB (DREWEK and ERNI

6. Assistance f o r a simple, d i a l o g u e - o r i e n t e d user

1982) is mainly a t e x t analysis package, where data

i n t e r f a c e when the tasks mentioned in points I to

base management and t e x t processing play a subordi-

5 are supplied or performed.

nate r o l e . A PC-suited TBM-system, ARRAS (SMITH 1984), supports comfortable t e x t i n q u i r y by concor-

The tasks mentioned in p o i n t 1 belong to the domain

dance and index functions, but has no textbase

of t e x t processing. The term t e x t p r o c e s s i n g sys-

component. F i n a l l y there are two TBM-systems for

tem can be used i f s u f f i c i e n t user assistance is

commercial use, MIDOC (KOWARSKI and MICHAUX 1983)

provided.

and MINDOK [INFODAS 1983) which have a database

235

sets are grouped in f i l e s and put into magnetic

Point 4 refers to a collection of methods that, givencomputer assistance in the user's selection of methods, can be termed a method base. Further assistance, such as method documentation and parameter

storage. I f , in addition, i t is possible to adminis-

input, is provided by the method base management

ter the data sets with the access methods provided

system. Point 5, on interfaces, is a subset of the

by the operating system, then we w i l l speak of a

tasks described by point 4.

The task of managing an unlimited number of text units on an auxiliary storage (point 2) is the object of long-term data maintenance. The stored

data maintenance system or a f | ] e management system (e.g., Archive with BS2000 from Siemens and Datatrieve with VMS from Digital Equipement)o

The tasks mentioned in point 3 concern the management of a data base including all of its services. These tasks are f u l f i l l e d by general data base management systems; the classic functions of such systems are the description, handling, take up,

All of these tasks are collected in point 6 with regard to the user interface. Thus the TBMS is an integrated overall system consisting of f i l e management system data base management system text management system method base management system

FMS DBMS TMS MBMS

manipulation, and retrieval of data. Data structures can be classified as hierarchic, network oriented, or r e l a t i o n a l .

user

user

rA

. . . . . . . . . . .

user

text

method base

data acquisitionr~5

text acquisition

text ~'""" analysis package characte-

data base user interface

.

request outp~ut~ of text

analysis

data base

ex-~

~ 3 y s man~egement

4

TBS text

Figure 1

236

J ~

base

} ~ output

base

Since the s e l e c t i o n of the s p e c i f i c data to be

3. Text Base Management

managed by the DBMS and of the methods provided by the MBMS is made in accordance with the kind of

The p r a c t i c a l management of texts can roughly be

t e x t s managed by the FMS, i t is l e g i t i m a t e to des-

separated into management of texts or t e x t sets (as

cribe the o v e r a l l information system as tailo~nade.

complex s t r i n g objects) and of t e x t - r e l a t e d information usually not being part of the t e x t . This i n f o r -

The following definition of the entire system is

mation may be pragmatical such as i d e n t i f i c a t i o n and

made, in analogy to the definitions of the i n d i v i -

a t t r i b u t e s of the t e x t producer(s) or a d e s c r i p t i o n

dual components, in order to ensure that our termi-

of the communicative s i t u a t i o n the t e x t o r i g i n a t e d

nology adequately reflects this state of affairs:

from, or i t may be semantical, specifying some mapping of the t e x t on a formal representation. For both kinds the method base w i l l provide adequate

¥extbase management system = TBMS

analysis procedures.

Branch: L i n g u i s t i c data processing A TBMS is i n d i f f e r e n t with regard to the completeThe TBMS is an information system that can

ness of a t e x t corpus, but i t supports the more

administer t e x t s and information on t e x t s , and

ambitious handling of an open corpus in a special

t h a t makes t e x t s accessible by i n t e g r a t i n g

way.

techniques from l i n g u i s t i c

data processing and

t e x t processing. I t features a homogeneous user

The c h a r a c t e r i s t i c of an open corpus is that i t is

i n t e r f a c e that assists in the take up, pro-.

an excerpt from a parent population which is contin-

cessing, output, and analysis of t e x t units.

uously being expanded without being t i e d to the goal of completeness. An example is the c o l l e c t i o n of f u l l y transcribed short psychotherapies. I t is

A System Architecture I (top l e v e l ) , which represents

possible to expand t h i s corpus to include every

one way of r e a l i z i n g a TBMS is shown in Fig. I .

newly undertaken treatment without ever reaching a state of completeness or representativeness with

The TBMS d i f f e r s from document r e t r i e v a l systons by

respect to the t e x t type "short psychotherapy". The

containing i~o a d d i t i o n a l components: t e x t process-

q u a l i t y of completeness can only be approximated i f

ing and method base. I t is true that the emphasis,

there are features l i m i t i n g the composition of the

as f a r as r e t r i e v a l in the I BMS is concerned, is

corpus. Thus a c o l l e c t i o n oF diagnostic f i r s t

still

views has a higher degree of completeness i f i t

on data r e t r i e v a l ( f o r requests made according

to the author of a t e x t ) and on document r e t r i e v a l

inter-

consists of equal parts on the features "sex",

( f o r t e x t references to be determined according to

"age", "diagnosis", and "social s t r a t a " . An example

i n t e r n a l textual features). Fact r e t r i e v a l is none-

of a completely c l o s e d corpus is the Bible (see

theless an i n t e g r a l component of a l l the plans for a

PARUNAK 1982).

TBMS, even i f systems able to cope with large quannot ready for production. However, t h i s should not

The degree of completeness of a corpus also influen~ ces the s t r a t e g i e s for handling the results of t e x t

be a basis for d i f f e r e n t i a t i n g w i t h i n a TBMS. i d e a l -

analyses, such as a semantic mapping. There are two

l y f a c t and document r e t r i e v a l are to be integrated

p r i n c i p a l approaches. In the one, a l l the a v a i l a b l e

in one system in order to provide s a t i s f a c t o r y user

results of analyses are stored completely with the

tities

of material from c o l l o q u i a ] speech are s t i l l

assistance.

t e x t or in d i r e c t r e l a t i o n to the t e x t . In the other, parts of the corpus are processed a l g o r i t h m i c a l l y as needed. As d e t a i l e d by PARUNAK (1982, p.150), users with open corpora tend to use the a l g o r i t h m i c version, while the storage of e x i s t i n g results from previous studies is often preferred for 237

closed corpora.

attributes

For each p a r t i c i p a n t in a communicat i o n , a set of feature-value pairs

Most of the t e x t - r e l a t e d information, be i t semanti-

specifying r e l e v a n t p r o p e r t i e s is

cal or pragmatical in nature, can be applied in

reserved.

d i f f e r e n t ways. F i r s t , a l l of them is c l e a r l y of an immediate i n t e r e s t to the system user, who is concerned with a given t e x t . For example, a psychiatrist

4. D e r i v i n g a Word Data Base

examining the verbatim protocol of a treatment

hour w i l l be i n t e r e s t e d in i t s content and in the

A word data base (WDB) i s , f o r m a l l y spoken, a r e l a -

p a t i e n t ' s medical h i s t o r y . He or she might be pro-

tion ranging over l e x i c o g r a p h i c a l features. Every

vided with a weighted semantic category l i s t , frequency word l i s t tient's

a high

and, as an a t t r i b u t e , the pa-

ICD-diagnesis.

tupel in a WDB s p e c i f i e s a class of feature values being c h a r a c t e r i s t i c f o r a single word form; one or several tupels may be organized in a subset of what has been c a l l e d a l e x i c a l entry (see HESS, BRUSTKERN

Secondly, t e x t - r e l a t e d information can serve w i t h i n

and LENDERS 1983)o Lexical features to be considered

TBMS as an a d d i t i o n a l c r i t e r i o n (beside t e x t i d e n t i -

here are word form, lemma name, word class, frequen-

f i c a t i o n and a s u i t a b l e pre-segmentation) f o r re-

cy of the word form and some grammatical features,

t r i e v i n g t e x t s from FMS. In our TBMS-implementation,

p a r t i a l l y depending on the p a r t i c u l a r word class.

the f o l l o w i n g features do have that r e t r i e v a l funcE m p i r i c a l l y , a WDB is "based" on one or more commu-

tion:

n i c a t i v e s i t u a t i o n s , t h a t i s , i t is accumulated with

text type

text unit

Given a c e r t a i n class of d e s t i n a t i o n

respect to an a p p r o p r f a t e l y sized (see MERGENTHALER

features a t e x t type can be defined by

1985) t e x t corpus. Note t h a t s i t u a t i o n s l i k e ,

for

a set of a t t r i b u t e - v a l u e p a i r s . I n t u i -

instance, a psychoanalytic treatment do in f a c t

tivly

l i m i t c o l l o q u i a l l e x i c a l domains to some degree,

a t e x t type d i f f e r e n t i a t e d t h a t

way corresponds to a communicative

simply by imposing thematical r e s t r i c t i o n s on the

situation.

t e x t drawn from i t .

A t e x t seen as s t r i n g o b j e c t can be

The r e l a t i o n s h i p between such situation-based do-

organized h i e r a r c h i c a l l y , containing

mains and the WDB derived from i t w i l l be worked out

(not necessarily continuous) substrings

in more d e t a i l as f o l l o w s .

such as utterances or segments, which we c a l l t e x t u n i t s .

As l e x i c a l TBMS-component, a WDB w i l l usually support the method base in analyzing textual proper-

speaker

selects among speakers of a given

t i e s . We only mention, t h a t most computerized con-

(dialogic) text

tent analysis procedures w i l l operate on raw and lemmatized t e x t , y i e l d i n g d i f f e r e n t r e s u l t s . The WDB

text

size

to be s p e c i f i e d in current word forms

in our TBMS-implementation (see Table I) is due to i t s a p p l i c a t i o n in psychotherapy protocol analysis.

theme

on the basis of a semantic category list,

textual i d e n t i f i c a t i o n units can

be selected according to the weight of a category. 2

The most

important procedure in d e r i v i n g a WDB is

automatic lemmatization, defined as providing one pair and o p t i o n a l l y i n f l e c t i o n a l features f o r every w o r d - f o r m - i n - t e x t ,

238

using context f o r disambiguation. Since TBMS are

i f the lemma is element in the WDB. Otherwise, an

dealing with mass t e x t , the primary o b j e c t i v e s in

i n t e r a c t i v e procedure, the lemmatization d i a l o g ,

l emmatization w i l l be f i r s t ,

w i l l be i n i t i a t e d .

to have an e f f i c i e n t

procedure, second, to have l e a s t user assistance and t h i r d , to minimize the e r r o r rate with respect to the r e s u l t i n g WDB. These objections are t i g h t l y

procedure

l i n k e d to a r e s o l u t i o n of word class and lemma name ambiguities (homographies) by analyzing context. This seems to be obvious with regard to e f f i c i e n c y

determi-

nondetermi-

and user support. We also stress upon a s i g n i f i c a n t

nistic

nistic

decrease in e r r o r r a t e , in order to avoid i n t e r a c t i v e lemmatization that goes along with well-known consistency problems.

static appro-

WDB-entry:

WDB-entry:

word

pattern

morpholog,

syntact.

analysis

analysis

ach

dynamic

All current lemmatization procedures (see e.g. KRAUSE and WILLEE 1982) combine s t a t i c

cmnponents

designed f o r lexicon lookup and dynamic components t r y i n g to analyze e n t r i e s not y e t contained in the l e x i c o n . S t a t i c lemmatization surely w i l l work

Table 2

s u f f i c i e n t l y well in a l l cases where an appropriate lexicon can be provided and a constant domain asserts congruence between t h i s lexicon and the on-

5. F~nal remark

coming vocabulary. Dynamic lemmatization, however may become a c r u c i a l task in case of n o n - s p e c i f i c ,

The current implementation of the TBMS ULM TEXTBANK,

'fuzzy' domains "like psychoanalytic t a l k s . This is

being f i n i s h e d in 1986, is on a SIEMENS 7.550-D main

simply a question of the size of vocabulary, which

frame under BS2000 operating system. Further work on

is up to a ~ime not covered by the lexicon. For any

the ULM TEXTBANK w i l l

given l e x i c o n , t h i s uncovered vocabulary w i l l be

method base by robust parsing and rule based content

include extensions of the

more extensive for fuzzy domains than for s p e c i f i c

analysis methods.

ones. While unconstrained lexicon access and morphological word analysis are c l e a r l y d e t e r m i n i s t i c in nature,

I) The i l l u s t r a t i n g

context r e l a t e d lemnatization requires a sort of

as f o l l o w s : I n d i v i d u a l s are represented by t r i a n -

technique used here s h o r t l y is

i n d e t e m i n i s m on word l e v e l . That i s , some kinds of

gles, dynamic elements as rectangles and s t a t i c

l e x i c a l pattern 3 should be recognized by the lemma-

elements as e l l i p s e s . The system d e t a i l is enclosed

t i z a t i o n algorithm and become stored in the lexicon

in a frame. C o n t r o l l i n g operations are indicated by

f o r l a t e r access. This leads us to a m a t r i x descrip-

a broken l i n e , reading operations as uninterrupted

tion of contextual l e l m a t i z a t i o n (Table 2).

l i n e s coming from above and w r i t i n g operations as going to below the frame. Comunicating elements are l i n k e d with double l i n e s . Lines extending to the

The dynamic component is applied to those word fonns

frame i n d i c a t e , that a l l the dynamic elements con-

- plus a two sided context of several words - for

tained w i t h i n i t r e l a t e to the outside. See Mergen-

which no corresponding pattern is a v a i l a b l e in the

t h a l e r ( 1 9 8 5 ) f o r a more d e t a i l e d d e s c r i p t i o n .

WDB. In order to avoid mis-leading lemmatization a linkage between dynamic and s t a t i c component is b u i l t up. A p a i r , is gener~

2) The feature theme is a c t u a l l y not a part of the

ated dynamically and w i l l be accepted only

r e t r i e v a l system but of the method base, since i t s

239

LEMMA NAME I

lemma name

2 v a r i a n t meanings i d e n t i f i c a t i o n 3 reference to i n f l e c t e d forms 4 semantic d e s c r i p t i o n (planned) 5 frequency of occurrence 6 grammatical features {as f o l l o w i n g ) WORD KIND I 2 3 4 5 6

noun verb adjective adverb pronoun numeral

7 B 9 I0 11 12

negation article preposition conjunction interjection other

I simplex (s) 2 compound (c) 3 affixal derivation {a) 4 s adjective derivation 5 s verb derivation 6 s nominal number 7 s abbreviation B s proper name 9 c adjective derivation 10 c verb derivation 11 c nominal number 12 c, abbreviation 13 c, proper name 14 c, adjective derivation 1.G c, verb derivation 16 a, nominal number 17 a, abbreviation IR a, proper name

I 2

foreign dialect

Kowarski, I . and Michaux, C.: MIDOC: A Microcomputer System for the Management of Structured Documents. Information Processing ~3:567 - 572 (1983). Krause, W. and Willee, G.: Lemmatizing German Newspaper Texts with t h e A i d of an Algorithm. In: Computers and the Humanities 15:101-113 (1983). Mergenthaler, E.: Textbank Systems. Computer Science Applied in the F i e l d of Psychoanalysis. Springer: Heidelberg New York (1985). PARTICIPLE I have 2 he

INFLECTED FORMS

I 2 3 4 5 6

inflected form variant meanings i d e n t i f i c a t i o n reference to lemma names semantic description (planned) frequency of occurrence grammatical features (as following) RELATIVE FORMS

i 2 3

I 2 3 4 5

i singular 2 plural 3 plural only

MORPHOLOGY

CASE/MODE

Table 1

240

I 2 3 4

Parunak, Dyke van, M.: Data Base Design f o r B i b l i c a l Texts. In: Bailey, R.W. (Ed.): Computing in the Humanities: 149-161, North Holland, Amsterdam 1982. Smith, J . : ARRAS User Manual. State University of North Carolina, Chapel H i l l NC, 1984. T h a l l e r , M.: CLIO: EinfUhrung und SystemUberblick. Manual, G~ttingen 1983. *This work has been supported by the German Research Foundation w i t h i n the C o l l a b o r a t i v e Research Program 129, Projekt B2.

NUMBERS

diminutive comparative superlative

pres.particip. past p a r t i c i p . pres.tense past tense infinitive

Drewek, R. and Erni, M.: LDVLIB. A (new) Software Package for Text Research. Vortrag ALLC - Conference, Pisa 1982.

INFODAS: MINDOK. Ein Informationssystem auf Kleinrechnern zur Erfassung, Verwaltung und Retrieval yon Dokumenten und Daten. (Ausgabe 2.0) INFODAS GmbH, K~In 1983.

GENDER I masculin 2 feminine 3 neutral

References

Hem, K., Brustkern, J. and Willee, G.: Maschinenlesbare deutsche W~rterb~cher. Niemeyer, T(~bingen 1983.

MORPHOLOGYOF LEMMA

WORD ORIGIN

3) Concerning the problem of d i f f e r e n t i a t i n g homographs contextually, the straightforward approach of rewrite rules operating on word-class categories won't work except the inventory of those categories is extremely differentiated. For example the rule AD + X + NOUN - - - > AD + AD + NOUN i s c o r r e c t only i f NOUN is the head of an NP and X is not a present p a r t i c i p l e or (in English) a noun w i t h i n that NP. We suppose t h a t a s u f f i c i e n t d i f f e r e n t i a t i o n may be achieved only with some kind of semantic p a t t e r n .

nominat./indicat. genitive/conjunct. dative/imperat. accusative

Authors address: S. Goeser, l i c . p h i l and Dr. E. Mergenthaler, D i p l . Inform. Sektion Psychoanalytische Methodik U n i v e r s i t ~ t Ulm Am Hochstr~8 8 D7900 ULM

Suggest Documents