Some reflections on the conversion of the TIC lexicon into DATR

Lynne J. Cahill
School of Cognitive and Computing Sciences
University of Sussex
Brighton BN1 9QH
England
Abstract
The Traffic Information Collator (TIC) (Allport, 1988, 1989) is a prototype system which takes verbatim police reports of traffic incidents, interprets them, builds a picture of what is happening on the roads and broadcasts appropriate messages automatically to motorists where necessary. Cahill and Evans (1990) described the process of converting the main TIC lexicon (a lexicon of around 1000 words specific to the domain of traffic reports) into DATR (Evans and Gazdar, 1989a, 1989b, 1990). This paper reviews the strategy adopted in the conversion discussed in that paper, and discusses the results of converting the whole lexicon, together with statistics comparing efficiency and performance between the original lexicon and the DATR version.
1 Introduction

The Traffic Information Collator (TIC), originally developed as part of the Alvey MISC Demonstrator Project, is a prototype system which takes verbatim police reports of traffic incidents, interprets them, builds a picture of what is happening on the roads and broadcasts appropriate messages automatically to motorists where necessary. In Cahill and Evans (1990), the basic strategy for defining the structure of lexical entries was described. That paper concentrated on the main TIC lexicon, which was just one part of a collection of different kinds of lexical information, and only dealt with a small fragment even of that. The whole conversion involved collating all of that information into a single DATR description. In this paper we review the structure of the original TIC lexicon, with sections describing the reasons behind the conversion, the conversion itself, and some statistics relating to the relative sizes of the two lexica and their processing times. Finally we provide some conclusions.

1.1 Structure of the original TIC lexicon
The original TIC lexicon was written in POP11 (the TIC is implemented in a combination of POP11 and Prolog in the POPLOG multi-language programming environment; Hardy, 1982), using macros to define different sets of lexical entries. The main lexicon matched words with expressions denoting their syntax and semantics, in the format

    WORD CAT=(FEAT1=VAL1, ... FEATn=VALn) # SEMANTICS #
where WORD is the word being defined, CAT is the syntactic category to which it belongs and FEAT1 to FEATn are features having the values VAL1 to VALn respectively. The feature/value pairs are optional, although most categories require that at least the feature "type" is given a value in the lexicon. The semantics is in the form of a list, which consists of propositions expressed in lambda calculus, e.g. [ L X [road X]] (the "L" stands for lambda).

In addition to the main lexicon, the original TIC lexicon had sections dealing with phrases, abbreviations, irregular lexical forms and special entries (e.g. car registration numbers, police classification numbers). The lookup procedure involved checking in the special sections first, and then, if no corresponding entry was found there, splitting the word into possible root and ending pairs by pattern matching. Lookup in the special sections effectively bypassed the root/ending splitting stage, so these sections had to provide root and ending information explicitly. The root was then looked up in the main lexicon, and a set of procedures applied to the root and ending, for example to assign arguments of verbs to subject or object roles. The phrases were defined as follows:

    en route . enroute null
where the phrase being defined is everything before the ".", and the two words after the "." are the root and ending respectively. Abbreviations were defined as triples, with the first "word" being the abbreviation, the second the root and the third the ending; irregular forms were defined in the same way as abbreviations, e.g.

    amb     ambulance   null
    broken  break       ed
The special entries were simple rules which associated with each class of entry a procedure for deriving the syntax and semantics. For example, the procedure for roads (such as a23) generated a syntax of the form

    db road=(type=@road)
and a semantics

    [L x [and [road x] [called x a23]]].
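To make the division of labour between the special sections, the pattern matcher and the main lexicon easier to follow, the following is a minimal Python sketch of the lookup flow just described. It is illustrative only: the table names, the list of endings and the ending_procs hook are assumptions, not the actual POP11 code.

    # Hypothetical sketch of the original TIC lookup flow; names and the
    # ending list are illustrative, not taken from the TIC sources.

    ENDINGS = ["ing", "ed", "s", "age", "null"]   # "null" = empty ending

    def split_candidates(word):
        """Split a word into possible root/ending pairs by pattern matching."""
        pairs = [(word, "null")]
        for ending in ENDINGS:
            if ending != "null" and word.endswith(ending):
                pairs.append((word[:-len(ending)], ending))
        return pairs

    def lookup(word, special, main, ending_procs):
        # Special sections (phrases, abbreviations, irregular forms) are tried
        # first; they bypass splitting, so they supply root and ending directly.
        if word in special:
            root, ending = special[word]
            candidates = [(root, ending)]
        else:
            candidates = split_candidates(word)
        # Each candidate root is looked up in the main lexicon; the ending
        # procedures then adjust the entry (e.g. assign verb argument roles).
        results = []
        for root, ending in candidates:
            if root in main:
                results.append(ending_procs(main[root], ending))
        return results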
2 Reasons for the conversion

The two principal reasons for converting the TIC lexicon into DATR were:

- ease of adaptation to a new domain (and ease of maintenance, since the language used by the police forces, like any language, varies slowly over time);

- a more integrated and linguistically accurate definition of all the lexical information.

The original TIC project developed a prototype system with the principal aim of ascertaining whether the problem was indeed tractable, and was based solely on the Sussex police force. The current project, POETIC (POrtable Extendable Traffic Information Collator; Evans et al., 1990), aims to take the TIC and develop it in such a way as to make it more easily adaptable to new domains, and thus more widely applicable. (POETIC is a collaborative project between Sussex University, Racal Research Ltd, the Automobile Association and National Transcommunications Ltd., funded by SERC and the DTI, following on from the TIC project.) At present, work is centered on adaptation to the Metropolitan police force. With this in mind, it was felt that a more structured organisation of the lexicon, which is one of the main areas that will need adaptation for a new domain, was desirable.

The nature of the DATR representation language means that much information is shared between entries, either by means of abstract nodes or simple inheritance between terminal nodes. Thus, for example, verbs inherit the structure of their syntactic arguments from the abstract "VERB" node, while features such as "type" and "subjtype" are defined at individual nodes. However, a verb like
\request" inherits all of its information from \ask" because in terms of the syntax and semantics required for the TIC parser they are identical. The rst bene t of this type of organisation is thus that when changes need to be made to the lexicon, they do not necessarily need to be made for every aected word, but may be made only at an abstract node from which those words all inherit. This is not always the case, of course, but it is still expected to be of signi cant bene t in this regard. Besides this practical reason for the conversion, there is the theoretical desirability of a more integrated and linguistically sound representation of the lexical information needed by the system. The original TIC lexicon system was devised primarily to be functional, with little regard for linguistic niceties (understandably, since the aim of the original TIC project was simply to build a prototype to see whether the basic problem was soluble). Thus the means of handling dierent types of information, e.g. abbreviations, irregular forms and information carried by endings, are rather ad hoc. By using DATR we are able to integrate all of this information into a single DATR description, not only providing a more elegant uni ed representation of the lexicon, but also meaning that a single, one-pass lexical lookup procedure can be used. As discussed in Section 4 below, this does not necessarily mean a faster lexical lookup, but speed is not of primary concern to us at this stage.
3 The Conversion

3.1 Querying the lexicon
Each query of the lexicon is of a root and ending, as in the original lexicon, and is done by constructing a new node which inherits from the node representing the word in question, but has the path <init ending> (meaning initial ending) defined. For example, to query the lexicon for the syntax of the root/ending pair "ask/ing", the node

    Node1:
        <> == Ask
        <init ending> == ing
        <syn> = ??.
would be compiled, with the "??" indicating that the value at the path in question is to be evaluated (see Evans, 1990). Three endings are distinguished: "init ending", which is the ending in the query; "ending", which may be defined at some other point in the hierarchy; and "true ending", which is the resultant ending arising when the other two are considered together. Their relationship, with "true ending" taking the value of "init ending" unless the latter is null and an "ending" is defined, is defined at the CATEGORY node, the top node of the hierarchy.

The reason for the distinction can be seen by considering as an example the word "children". The pattern matching procedures would define this as the root/ending pair "children/null". The entry for the word "children" in the lexicon is

    Children:
        <> == Child
        <ending> == s.
which says that "children" inherits from "child" but has the "ending" "s". The true ending in this case is therefore the same as <ending>, i.e. "s", since the initial ending was null. If the initial ending were not null, then it would be the true ending, since the assignment of endings in the lexicon is only to complete word forms, so we do not allow the overriding of non-null initial endings.
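As a quick procedural restatement of the "true ending" relationship just described, here is a hedged Python sketch. The function name and the use of the string "null" for the empty ending are assumptions made for illustration; in the lexicon itself the relationship is expressed declaratively at the CATEGORY node.

    def true_ending(init_ending, ending=None):
        """The initial ending from the query wins unless it is null, in which
        case any 'ending' defined in the lexicon (e.g. "s" at Children) is used."""
        if init_ending != "null":
            return init_ending      # non-null initial endings are never overridden
        return ending if ending is not None else "null"

    # the "children/null" query with the Children node's ending "s":
    assert true_ending("null", "s") == "s"
    # a regular query such as "ask/ing" keeps its initial ending:
    assert true_ending("ing") == "ing"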
3.2 Structure of entries
Each entry in the main lexicon has a syntax and a semantics, which are of a particular structure. The syntax consists of a major category and a (possibly empty) set of feature/value pairs. The features which are defined for each major category are different. The basic structure of the DATR lexicon is as described in Cahill and Evans (1990). The top level of the DATR lexicon consists of a CATEGORY node, which is the very top of the hierarchy. This defines the basic structure of any set of syntactic information thus:

    <syn> == ({ "<cat>" [ "<synargs>" ] })
    <synargs> == ({ type "<type>" })
which says that the syntax of a word is a list (in round brackets) which contains a {, the value of the path <cat>, a [, the value of the path <synargs>, a ] and a }. With the exception of the outer round brackets, which indicate a DATR list, the various brackets are individual elements of the list, which will be interpreted as delimiters of POP11 lists and vectors when the resultant structure is passed to the POP11 procedures in the parser. The quotes around the paths indicate that the path is to be evaluated at the original query node (which may refer by default or explicitly back to other nodes higher up the hierarchy, including the CATEGORY node). What the two lines given above state is that the syntax of a word consists of its category and its "synargs", or syntactic arguments, which by default consist simply of the word "type" and the type of that word. The default semantics of a word is defined by the line

    <sem> == ([ L X true ])
The full CATEGORY node is:

    CATEGORY:
        <> == false
        <cat> == cat
        <synargs> == ({ type "<type>" })
        <syn> == ({ "<cat>" [ "<synargs>" ] })
        <sem> == ([ L X true ])
        <true ending> == <true "<init ending>">
        <true null> == "<ending>"
        <true> == "<init ending>".
with the word "false" being the default value for any path not defined elsewhere. The major categories all inherit from the top node, adding their own "synargs" to those of CATEGORY, assigning default values to some of these and defining default structures for the semantics. The "ADJECTIVE" node, for example, is as follows:

    ADJECTIVE:
        <> == CATEGORY
        <cat> == adjective
        <type> == @property
        <synargs> == ({ subjtype "<subjtype>" } CATEGORY:<synargs>)
        <sem> == ([ L X "<property>" ])
        <pred> == true.
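To make the shape of the value delivered by <syn> concrete, here is a small Python sketch that assembles the same token sequence from a category and a list of feature/value pairs. The feature values shown for a hypothetical adjective are invented for illustration, and the function is not part of the TIC.

    # The braces and brackets are literal tokens which the parser later reads
    # as POP11 vector/list delimiters; values here are invented examples.
    def build_syn(cat, synargs):
        """synargs is a list of (feature, value) pairs, most specific first."""
        tokens = ["{", cat, "["]
        for feat, val in synargs:
            tokens += ["{", feat, val, "}"]
        tokens += ["]", "}"]
        return tokens

    # A hypothetical adjective with an assumed subjtype:
    print(build_syn("adjective", [("subjtype", "@road"), ("type", "@property")]))
    # -> ['{', 'adjective', '[', '{', 'subjtype', '@road', '}',
    #     '{', 'type', '@property', '}', ']', '}']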
In addition to these, some nodes (notably NOUN and VERB) supply additional information derived from the ending. This is one of the respects in which the DATR lexicon differs from the original lexicon. Since the original (main) lexicon could not refer to the ending, all information derived from the ending had to be provided by procedures run at lookup time. The DATR lexicon can express all of this information in a declarative manner, simply because of the assumption that the query will be in the form of a DATR node which may inherit and disseminate information in exactly the same way as other nodes permanently in the lexicon. The NOUN node defines the feature "num" as singular or plural depending on the "true ending", even though this is only defined for query nodes, not for any of the permanent nodes in the lexicon:

    NOUN:
        <> == CATEGORY
        <cat> == n
        <num null> == sing
        <num> == plur
        <synargs> == ({ num <num "<true ending>"> } CATEGORY:<synargs>)
        <sem> == ([ L X [ "<root>" X ] ]).
and VERB defines the features "pass" (which defines whether it is possible that a verb is passive) and "tense" by the same means:

    VERB:
        <> == CATEGORY
        <cat> == v
        <type> == @event
        <synargs> == ({ subcat "<subcat>" } { subjtype "<subjtype>" } { objtype "<objtype>" }
                      { obj2type "<obj2type>" } { subjprep "<subjprep>" } { objprep "<objprep>" }
                      { obj2prep "<obj2prep>" } { passive "<pass>" } { ending "<true ending>" }
                      CATEGORY:<synargs>)
        <sem> == ([ L E L S L O1 [ and [ eventtype E "<root>" ] [ time E <tense> ] ] ])
        <pass> == <passive "<true ending>">
        <passive> == no
        <passive ed> == yes
        <tense> == <tense2 "<true ending>">
        <tense2> == pres
        <tense2 s> == pres_fut
        <tense2 ed> == past.
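The following Python sketch restates the ending-derived features procedurally. The exact ending-to-value mapping (for example, which endings yield pres_fut) is an assumption made purely for illustration and should be checked against the lexicon itself.

    # Hedged sketch of deriving noun number and verb tense/passive potential
    # from the true ending; the mapping is assumed, not taken from the TIC sources.
    def noun_number(true_ending):
        return "sing" if true_ending == "null" else "plur"

    def verb_features(true_ending):
        tense = {"null": "pres", "s": "pres_fut", "ed": "past"}.get(true_ending, "pres")
        pass_possible = "yes" if true_ending == "ed" else "no"
        return {"tense": tense, "pass": pass_possible}

    print(noun_number("s"))       # -> plur
    print(verb_features("ed"))    # -> {'tense': 'past', 'pass': 'yes'}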
In addition to the major categories, there are other abstract nodes which inherit from these, and which group together sets of words that have other shared information. For example, the "WEATHER" node inherits from NOUN, but defines the type and the semantics, with a single word, defined by the path <root>, derived from the terminal node. Thus words like "rain", "snow" etc. can inherit everything from the WEATHER node, needing only to define the weather condition in question:

    WEATHER:
        <> == NOUN
        <type> == @ilpevent
        <sem> == ([ L E [ eventtype E "<root>" ] ]).

    Rain:
        <> == WEATHER
        <root> == rain.
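The attraction of this organisation is that a terminal node such as Rain carries almost no information of its own. The following Python sketch shows the kind of fallback through the hierarchy that answers a query on Rain; it is a deliberate simplification (real DATR works with paths and default extension, not single attributes), and the dictionary contents are partly invented.

    # Illustrative sketch (not the DATR engine) of default inheritance:
    # a query falls back through Rain -> WEATHER -> NOUN -> CATEGORY.
    HIERARCHY = {
        "Rain":     {"parent": "WEATHER", "root": "rain"},
        "WEATHER":  {"parent": "NOUN", "type": "@ilpevent"},
        "NOUN":     {"parent": "CATEGORY", "cat": "n"},
        "CATEGORY": {"parent": None, "cat": "cat"},
    }

    def value(node, attr):
        """Return the first value for attr found on node or any ancestor."""
        while node is not None:
            entry = HIERARCHY[node]
            if attr in entry:
                return entry[attr]
            node = entry["parent"]
        return "false"      # CATEGORY's catch-all default

    print(value("Rain", "cat"))    # -> "n", inherited from NOUN
    print(value("Rain", "type"))   # -> "@ilpevent", inherited from WEATHER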
The treatment of abbreviations and irregular forms, as should be obvious, is trivial and needs no extra machinery. An abbreviation simply inherits everything from the word which it abbreviates, e.g.

    Amb:
        <> == Ambulance.
and an irregular form inherits from the root of that form, together with a separately defined ending, as in the example of "children" above. Phrases are still handled separately, in the same way as in the old lexicon. This is the only piece of "lexical" information which is not contained in the DATR theorem.
3.3 Coverage of the conversion
In the conversion, the initial aim was purely to convert the existing lexicon into DATR without making any improvements to it. However, in the course of the conversion, it became apparent that certain changes implemented now would save effort later. For example, it was deemed impractical to produce a DATR lexicon and interface which handled ending information in the same way as the original lexicon, when this would subsequently be changed. Similarly, some entries were omitted from the lexicon since they are now handled by tokenisation (e.g. numbers and punctuation characters), and others were omitted rather than set up distinct abstract nodes to handle them where this seemed inappropriate. For example, the original lexicon had an entry for "smith" as a name, but this was the only name given. To incorporate it would have meant creating an abstract node for the category of "name" simply for this one word, so it was omitted.
4 The lexica - some statistics

In the tables below, the number of "entries" in each section of the original TIC lexicon is given. These do not correspond exactly to the number of words for which lexical information exists, because there is duplication in the representation. A word such as "report" had two entries in the irregular lexicon:

    report  report  age
    report  report  null

the first of which states that it could have the information associated with the verb "report" in the main lexicon, with the manipulations entailed by the "age" ending (the ending used to express nominalisation). The second stated that it could also have the information associated with the verb in the main lexicon with a null ending. This was necessary since the lookup procedures did not allow normal checking of the lexicon once such an entry had been found, so the irregular entries frequently had to have their regular forms specified explicitly in the irregular section in this way.
    Original lexicon
    Number of entries in main lexicon      786
    Number of abbreviation entries         179
    Number of irregular entries            129
    TOTAL                                 1094
    Total size of files (bytes)          70046
With the figures for the DATR lexicon, the number of nodes is given, divided into the total number of nodes, the number of abstract nodes and the number of terminal nodes. In contrast to the original TIC lexicon, there is a direct correlation here between the number of terminal DATR nodes and the number of words for which lexical information exists. This assumes that "ambulance" and "amb" are distinct words, although the latter is an abbreviation of the former. As one would expect with DATR, there is no essential difference in the format, nor in the type or amount of information, between the different sections of the lexicon. The node for an abbreviation inherits from the node for its full form in exactly the same way as a word inherits from the node for another (in TIC terms) equivalent word.
    DATR lexicon
    Number of nodes in main lexicon        722
    Number of nodes in special lexicon     186
      (includes abbreviations and irregularities)
    TOTAL                                  908

    Abstract nodes in main lexicon         154
    Abstract nodes in special lexicon        7
    TOTAL                                  161

    Terminal nodes in main lexicon         568
    Terminal nodes in special lexicon      179
    TOTAL                                  747

    Total size of files (bytes)          68032
The number of terminal DATR nodes differs from the number of lexical entries in the original lexicon for a number of reasons:

1. Certain entries are unnecessary with the improved morphological (i.e. ending) analysis in the new lexicon. Principally this affects the irregular entries. As discussed above, in the original lexicon any word which had an irregular entry and a regular entry had to have both entries specified in the irregular section, in addition to having an entry in the main lexicon.

2. A few entries were omitted because they were inappropriate and/or not consistent (e.g. the example of "smith" given above).

3. Some punctuation characters were included in the original lexicon and have been omitted.

4. The organisation into main, abbreviations and irregulars changed, so that some entries which were in the main lexicon originally are now treated as abbreviations. In addition, many of the irregularities were omitted because they were linguistically unsound, e.g. a set of nouns which inherited their information from the corresponding verbs, like collision/collide. This information was constructed by means of some highly convoluted procedures; any such shared information can easily be accounted for in the DATR lexicon by means of inheritance between nodes. In fact, the division between sections of the lexicon, although maintained for ease of adaptation, is redundant in the processing of the DATR lexicon, because it is all compiled into a single DATR theorem.

5. The way the lexicon is structured and the way lexical lookup takes place mean that entries in the original lexicon do not correspond directly to terminal nodes in the DATR lexicon. Multiple entries are handled using two distinct methods. Polysemous entries have a single node which refers to a set of nodes, one for each entry, which are distinguished by lettering; e.g. the verb "ask" has a node "Ask" which has a single sentence,

       Ask:
           <> == ([ poly "AskA" "AskB" ]).
and the separate senses of the word are dealt with under the nodes "AskA" and "AskB". Homographs have distinct nodes which are differentiated by means of numbering, so that the entries for "close" (meaning a kind of street) and "close" (meaning "shut") are represented by two nodes, "Close1" and "Close2". The lookup procedures then check for a node which corresponds to the input root (capitalised); if none is found, they look for a node which corresponds to the capitalised word with a 1 appended, and if this succeeds they look for the word, capitalised, with a 2, and so on until the search fails (sketched after this list). This method of representation has led to an overall decrease in the number of terminal nodes required. The reason for this is that, although each polysemous entry needs one extra node, in many cases there are words which are essentially identical (in terms of the syntax and semantics required for the TIC), so a word which had seven polysemous entries in the original TIC lexicon, but whose entries were all identical to those of another word, requires only one DATR node, referring to the other word's top node. Thus, although in several cases a single additional node is required, in other cases six or seven entries in the original lexicon are reduced to a single DATR node. In addition, some polysemous entries can refer directly to abstract nodes. For example,

       Ford:
           <> == ([ poly "SMALL_VEH" "VEH_ADJ" ]).
where the two entries in the original lexicon were for the use of "ford" as a noun meaning a make of car or as an adjective applied to a vehicle (as in "ford transit"). Another similar case which requires fewer DATR nodes than original lexical entries is the situation where the quoted nodes are not abstract nodes but other terminal nodes used for the definition of other words, e.g.

       Send:
           <> == ([ poly "SendA" "Pass" ]).
5 Processing times

The original TIC lexicon was written in POP11 and was pre-compiled into a simple lookup table which was consulted at run-time. The process of compiling involves loading a set of POP11 files and then building the lexicon onto disc. The total time for this is approximately 6 minutes, of which less than a minute is the initial loading of files. The DATR lexicon uses the DATR implementation written by Roger Evans in Prolog (Evans, 1990) and involves a two-stage compilation. The DATR files containing the lexicon are first compiled into Prolog files, then the Prolog files are compiled into the parser saved image. This could be done in just one stage, with the DATR files compiled directly into the saved image, but that way takes much longer. Compilation directly into the saved image takes about 37 minutes, while compilation into Prolog files takes approximately 12 minutes, with another 1.5 minutes to compile the Prolog files into the saved image. These times are based on the system running on a SPARCstation 1, but are extremely approximate. It should be stressed that the compilation time is not of great concern to us and that the DATR compiler used is non-optimal.

Similarly, efficiency of lexical lookup is not of primary concern at this stage of the project. None of the system code has been written with this kind of optimisation in mind; the whole aim of the project up to now has been simply to show that the problem for the system as a whole is tractable. Timed tests have shown that lexical lookup in the old lexicon is, as expected, faster than in the DATR lexicon. The times for lookup in the original lexicon ranged from about 0.09 seconds for a word with a simple, small, single entry (e.g. "rta", "you") to about 0.5 seconds for a word with polysemous entries (e.g. "close", "northbound"). The DATR lookup times ranged from 0.2 seconds to 1.05 seconds for the same words. It was interesting that the times for "you", which has a relatively complex entry in the DATR lexicon, with reference to two abstract nodes and quoted path inheritance, were not significantly different from those for "rta", which has a relatively simple inheritance structure. Similarly, the times for "close", which requires lookup of more than one node name ("Close1", "Close2"), were not significantly different from those for "northbound", which has three polysemous entries. It would appear from these times that the complexity of the DATR is of less importance for lookup times than the simple number of different senses of a word, however they are represented. This implies that a large part of the lookup time is actually taken up by the macro expansion procedures, which take the abbreviated forms of syntactic types in the lexical entries and expand them out, for both the DATR and old versions of the lexicon. This also explains why the differences between the times for words with multiple entries and those with single entries were of similar proportions in both the DATR and old lexica.
6 Conclusions

Although the times for lexical lookup have not been improved, the conversion of the TIC lexicon into DATR has undoubtedly been worthwhile. One of the main aims of converting the lexicon into DATR, in addition to the general aim of finding out how feasible it was to code a real application lexicon in DATR, was to improve portability. This has been achieved, since the very structure of the DATR lexicon means that changes to a set of entries, or to all entries, can very often be made at only a small number of abstract nodes, rather than throughout the lexicon as was previously the case. This is particularly important at the current stage of the project, when we are involved in changing the grammar rules and the semantic representations used on a fairly large scale, as well as developing a lexicon for a new police domain.
References

Allport, D. (1988). "Understanding RTA's", Proceedings of the 1988 Alvey Technical Conference.

Allport, D. (1989). "The TIC: Parsing Interesting Text", Proceedings of the Second ACL Conference on Applied Natural Language Processing.

Cahill, L. J. and R. Evans (1990). "An Application of DATR: the TIC Lexicon", in Proceedings of ECAI-90, pp. 120-125, Stockholm, 1990.

Evans, R. (1990). "An Introduction to the Sussex Prolog DATR Implementation", in Evans and Gazdar (1990), pp. 63-72.

Evans, R. and G. J. M. Gazdar (1989a). "Inference in DATR", Proceedings of the 4th Conference of the European Chapter of the Association for Computational Linguistics, Manchester, England, 1989, pp. 66-71.

Evans, R. and G. J. M. Gazdar (1989b). "The Semantics of DATR", Proceedings of the 7th Conference of the Society for the Study of Artificial Intelligence and Simulation of Behaviour, Sussex, England, 1989, pp. 79-88.

Evans, R. and G. J. M. Gazdar (1990). The DATR Papers, Cognitive Studies Research Paper No. CSRP 139, University of Sussex.

Evans, R., R. Gaizauskas and A. F. Hartley (1990). "POETIC - The Portable Extendable Traffic Information Collator", OECD Workshop on Knowledge-Based Expert Systems in Transportation, Espoo, Finland, 1990.

Hardy, S. (1982). The POPLOG Programming Environment, Cognitive Studies Research Paper No. CSRP 82-06, University of Sussex.