Adding preferences to CUF

Chris Brew
Language Technology Group
Human Communication Research Centre
Edinburgh University

Abstract

We discuss the problem of associating grammars written in CUF [DD93] with preference information of the sort which has proved useful in the development of large scale grammars like the Alvey Natural Language Tools Grammar [GBCB89]. We show a way to embed a useful part of the functionality of the Xerox part of speech tagger [Kup92] into the framework of CUF, developing a design for a version which has the appropriate facilities for the handling of numeric data. On the basis of this design we speculate about the prospects of including other preference-based mechanisms within grammars encoded in CUF, and draw preliminary conclusions about the nature of this enterprise.

1 Introduction

Typed feature formalisms are too new to have yet been used for large-scale grammar development. Because of the extra generality of formalisms like CUF it is not immediately clear that techniques which were developed for finite-state automata or context-free grammars will be of any assistance to developers working with the more elaborate formalisms. We use this note to describe a framework for preference-ranking which can co-exist with CUF. Although we do not significantly extend existing techniques, we do show how they can be incorporated into a modern framework, making it clearer what the issues are likely to be in seeking to develop more sophisticated preference-based approaches. The design proposed in this note is unimplemented, but is intended as a broadly compatible extension of the facilities already provided by CUF.

1.1 Declarative and procedural information in CUF

CUF already makes a distinction between the declarative meaning of a specification and the control information which indicates how the declarative specification is to be exploited. We believe it is appropriate to consider a third class of information, which we shall term preference information. This is essentially the control information described by Uszkoreit [Usz91]. For the sake of clarity we reserve the term control information for the non-numeric delay and indexing statements described by Dorre and Dorna [DD93], which are different in character. It is an acknowledged problem that large-scale grammars associate more than one analysis with each sentence. In fact, there are often so many analyses that it is unreasonable to expect an analyst to manually inspect the complete set, making the debugging of a large grammar at best an extremely tedious and error-prone activity [BC93, p. 40]. If possible, we would like to use preference information to provide a rank-ordering of the outputs of the system, in order that the analyst should need to inspect, or otherwise process, only those analyses which are (in some suitable sense) the most likely. We think that the ability to rank parses is extremely important for the debugging and development of large coverage grammars. This is really only a special case of a larger problem. In any system there comes a point when the explosion in the number of parse results becomes so large that later stages of processing cannot reasonably be asked to handle them all. Now it is the capacity of the system rather than the patience of the analyst which is at issue. Given a suitable framework for preference information it may be possible to alleviate this problem by structuring the computation in such a way that a rank-ordered sequence of solutions is generated by lazy evaluation of the input against the grammar. This would be a kind of best-first search. The practicality of this approach would depend on the exact form of the calculus by which preference information is combined, but we see no reason to doubt the theoretical possibility of such a scheme. This is essentially the programme described by Uszkoreit [Usz91]. Our current focus is on showing how the preference information can be made available. To complete this work we need an inference scheme capable of making efficient and appropriate use of the information which we provide. The provision of such a scheme is left for further research. We show how it is possible to systematically associate numerical information with the different constructs of CUF. For the purposes of this note we treat CUF as a representative of the current crop of typed feature formalisms. The details of how to make the connection to preferences will of course be specific to CUF. Readers more interested in the differences between the various systems should refer to Manandhar's contribution to this volume [Man93].

* This work was carried out while the author was supported by a U.K. SERC grant to Henry Thompson. Special thanks to Suresh Manandhar for very useful discussions. Thanks also to Steve Finch for helping me to understand the Xerox tagger, to Henry Thompson and Claire Grover for providing comments on drafts and to Jochen Dörre for patiently answering odd questions about CUF.

2 Background

This section provides background information about the components of the system which we will extend.

2.1 Types and sorts

CUF makes a distinction between types and sorts. Sorts are definite clauses defining relations over feature terms, while type axioms are constraints on the form of feature terms. Sorts are more expressive, but make it harder for the CUF implementation to adopt efficient representations at compile-time, while types sacrifice a degree of expressivity for the sake of decidable satisfiability checking and the possibility of run-time efficiency. The relative simplicity of types makes it easier to see how they can be pressed into service for statistics, but even for the limited goals of this note we will turn out to need definite clauses as well.

2.2 Probabilistic language models

As the introductory example of a probabilistic language model, we consider the Hidden Markov Model (HMM). This model is in extensive use in speech recognition [HAJ90] as well as in word-tagging [Kup92]. The basic model is a non-deterministic finite-state automaton. In order to generate language from an HMM, one begins in a start state, chooses a transition to another state, generates an output symbol, then chooses a transition once more. This process continues until an end-state is reached. The process is doubly stochastic, since the choice of a transition is made using a (state-dependent) vector of transition probabilities and the choice of a symbol to generate is controlled by a matrix tabulating the probability of generating a particular symbol when one is in a given state. In the word-tagging application described by Kupiec [Kup92], which we shall from now on call the tagger, the underlying Markov process operates over part-of-speech tags. The system has access to a lexicon which gives an ambiguity class for each known word. The ambiguity class is the set of tags permitted for a word. Only one of these will be correct in context. The output symbols are the ambiguity classes. We assume that the parts of speech themselves are unobservable, and operate a hill-climbing optimisation scheme which searches for the sequence of states which best accounts for the observed sequence of ambiguity classes. It is a remarkable fact that an appropriate optimisation scheme exists. Given only a lexicon and a large corpus of text the training procedure is certain to converge to a locally optimal solution, although there is no guarantee of a global optimum [Bau72, Kup92]. In practice the procedure often converges to good solutions. In the analysis phase we search for the optimal path through the trained network, looking for the sequence of states which provides the best account of the sequence of ambiguity classes which were observed. Once we know the best sequence of states, we can read off the disambiguated part-of-speech tags. For example, the lexicon provided with the Xerox tagger includes a two-way ambiguity for the word "red", making the following input strings mildly ambiguous.

Norman buys the red     bottle
np     vbz  at  (jj|nn) nn

Norman is  in the red
np     bez in at  (jj|nn)

The task of the tagger is to use its statistical model to resolve the ambiguity, which it achieves, as witnessed by the following sample output.

Norman buys the red  bottle
np     vbz  at  jj/2 nn

Norman is  in the red
np     bez in at  nn/2

Here the numerical suffixes indicate the size of the ambiguity class which has been considered for each tag. For a more realistic example consider the first sentence of the abstract of the main article of this volume:

We   describe the formalism CUF    ( Comprehensive Unification Formalism)
ppss vb       at  nn        nps/25 - jj            nn          nn

which has been designed as   a  tool for writing and making use  of any
wdt   hvz ben  vbn/2    cs/2 at nn/2 in  vbg/2   cc  vbg/2  nn/2 in dti/3

kind of linguistic descriptions ranging from phonology to   pragmatics.
nn/2 in jj         nns          vbg     in   nn        in/2 nns/2

Note that many of the words have an unambiguous part-of-speech, but that the tagger is still choosing a good analysis from 25 × 2 × 2 × 2 × 2 × 2 × 2 × 3 × 2 × 2 × 2 = 38400 possibilities. The 25 arises because the system has essentially given up on "CUF", assigning all possible tags. There is actually another word not present in the lexicon, namely "pragmatics", but the system is capable of making a fair (but wrong) guess on the basis of simple morphological analysis. Compare

Pragmatics is  hard.
nn/2       bez rb/2

in which the tagger correctly regards "pragmatics" as a singular noun, with:

Pragmatics are hard.
nns/2      ber rb/2


in which it (under pressure) makes the same mistake as it did in the long sentence, and treats "pragmatics" as a plural. While acknowledging the deficiencies of the theoretical model, this remains a fairly impressive performance in coping with general text. By its very nature the tagger is sensitive to preference information, which it uses to choose from a range of available analyses. In order to arrive at an understanding of how preference information could be exploited in CUF it seems sensible to investigate a reconstruction of the essence of the tagger using the facilities which CUF provides.
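Before turning to parse ranking, it may help to see the generative story in executable form. The following Python sketch is ours, not part of the original system; the tag set and the probability tables are invented toys, where the real tagger estimates such tables from a corpus.

import random

# Invented toy tables.  States are tags, outputs are ambiguity classes.
TRANS = {                       # P(next tag | current tag)
    "at": {"jj": 0.4, "nn": 0.6},
    "jj": {"nn": 0.9, "jj": 0.1},
    "nn": {"at": 0.5, "nn": 0.5},
}
EMIT = {                        # P(ambiguity class | tag)
    "at": {("at",): 1.0},
    "jj": {("jj", "nn"): 0.3, ("jj",): 0.7},
    "nn": {("jj", "nn"): 0.1, ("nn",): 0.9},
}

def weighted_choice(dist):
    """Draw a key from a {key: probability} table."""
    r, acc = random.random(), 0.0
    for key, p in dist.items():
        acc += p
        if r < acc:
            return key
    return key                  # guard against rounding error

def generate(start="at", length=5):
    """The doubly stochastic process: emit a symbol from the current
    state, then choose a transition to a successor state, and repeat."""
    state, symbols = start, []
    for _ in range(length):
        symbols.append(weighted_choice(EMIT[state]))
        state = weighted_choice(TRANS[state])
    return symbols

Training (Baum-Welch re-estimation) and decoding then operate over exactly these two tables, with only the emitted ambiguity classes observable.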

2.3 Parse ranking

Although the tagger is usually used to produce a single analysis, the language model which it constructs can be applied to rank parses. Under this regime the first stage is that a grammar written in CUF generates a set of analyses of an input. Each parse result is then treated as a description of the ways in which the words in the input may have been used. The tagger is then used to rank the parses produced by the full-scale grammar.

This approach contrasts with that described by Pulman [Pul92]. Under Pulman's scheme the tagger was used as a pre-processor, with the output from the tagger controlling the lexical access carried out by the parser. What is happening there is that the results of the tagger are used to reduce the size of the search space explored by the parser. What we are proposing is that the parse results are used to restrict the search space which the tagger is allowed to explore. Parses which force the tagger to follow high-cost paths will be declared bad; those which allow the tagger to follow low-cost paths will be ranked highly. Under Pulman's scheme mistakes made by the tagger can deny the parser the opportunity to get it right, since the initial choice of tag may have prevented the parser from starting with the right lexical edges. Under our scheme tagger mistakes can cause good parse results to be given low rankings, but the correct parse remains available if the analyst is patient enough.

In order to make our scheme work we have to say something about the correspondences between the tags which the tagger wants and the descriptions which the parser generates. In general this can be quite complex. In the simplest case the parse result is equivalent to a single sequence of tags. Ranking the parse is simply a matter of reading off from the language model the cost of traversing the appropriate sequence of states. Another possibility is that there are several different ways of translating the parse results into sequences of tags. This situation will sometimes arise, but it is likely to be fairly infrequent, because the distinctions made by a full-scale grammar are likely to be finer grained than those made by the tagger. When the problem does arise, it can be dealt with by searching for the maximum probability sequence of tags allowed by the parse result. More serious is the problem which arises when many parses give rise to the same sequence of tags. If this occurs, all the relevant parses will be assigned the same ranking. This is the same problem which was noted by Briscoe and Carroll [BC93, p. 30] for the case of stochastic context-free grammars, namely that the limitations of the language model cause the ranking mechanism to declare parses equivalent despite the fact that there seem to be important differences between them.

The proposal developed above is that the tagger be used as a post-processor in order to rank the parses. This does not help if there are so many parses that the parser crashes before delivering results. We will present a more refined version of our proposal once we have developed an account of the tagger's functionality within the framework of CUF. This makes it possible for the system as a whole to adopt appropriately flexible control strategies, and opens the way for regimes in which the ranking ability provided by the tagger helps the parsing process to explore its search space more intelligently.
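To make the post-processing regime concrete, here is a rough Python sketch of parse ranking. It is our own illustration; the cost table, the tag names and the parse representation are invented, and a real system would derive costs from the trained tagger.

from itertools import product

# Invented toy costs (think negative log probabilities); unseen
# transitions get a heavy penalty.
TRANS_COST = {("np", "vbz"): 0.1, ("vbz", "at"): 0.2, ("at", "jj"): 0.5,
              ("at", "nn"): 0.4, ("jj", "nn"): 0.3, ("nn", "nn"): 1.5}

def seq_cost(tags):
    """Cost of one tag sequence under the toy model; lower is better."""
    return sum(TRANS_COST.get(pair, 10.0) for pair in zip(tags, tags[1:]))

def rank_parses(parses):
    """Each parse is a list of tag sets, one per word: the tags which its
    lexical signs allow.  A parse scores as well as the cheapest tag
    sequence it permits, so parses forcing high-cost paths rank low."""
    return sorted(parses,
                  key=lambda parse: min(seq_cost(seq)
                                        for seq in product(*parse)))

# "Norman buys the red bottle": one parse takes "red" as jj, one as nn.
parses = [[("np",), ("vbz",), ("at",), ("jj",), ("nn",)],
          [("np",), ("vbz",), ("at",), ("nn",), ("nn",)]]
print(rank_parses(parses)[0])   # the jj reading wins under these costs

Note that two parses licensing exactly the same tag sets tie under this scheme, which is the granularity problem just discussed.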


3 Reconstructing the tagger within CUF

In the next section we show how ambiguity classes and tags can be re-interpreted within CUF's type discipline. We provide an account of the way in which small extensions to CUF could be used to emulate the behaviour which the tagger shows at run-time. We do not display a corresponding emulation for the training phase of the tagger, since that can conveniently be carried out off-line, and does not need to be integrated within CUF. The ultimate aim, which will not be achieved in this short note, is to extend the tagger's functionality to provide a coherent and practical probabilistic interpretation for a typed feature framework.

3.1 Using types

In CUF, feature structures carry types, and these types are organised into a hierarchical structure. A similar hierarchical relationship exists between part-of-speech tags and the corresponding ambiguity classes. We can express the required relationship in the CUF type system. We do this by organising the type hierarchy according to the following principles.

- Each word tag is represented by a CUF type (e.g. nn).
- Each ambiguity class is represented by a CUF type (e.g. ambig13). Ambiguity classes are placed in a relationship with orthographic forms by the tagging lexicon.
- Both the ambiguity classes and the word tags are described as disjoint unions of smaller CUF types (e.g. nn13), which pick out the occasions on which a word in a particular ambiguity class should have a particular tag (in the example nn13 is the type associated with words from ambiguity class 13 which are assigned the tag nn).

There follows an extract from a suitable grammar.

% word occurrences (wocs) have an ambiguity class and an id
woc = ambig & tag
% With the following features
woc :: orth:string, gprob:real, tprob:real.
% The type TAG is defined by a disjoint union of possible tags
tag = jj | nn | nns | ...
% The type AMBIG is defined by a disjoint union of possible
% ambiguity classes
ambig = ambig1 | ... | ambig12 | ambig13 | ambig14 | ...
...
% Ambiguity classes decompose into elements
ambig12 = nn12 | nns12.
ambig13 = nn13 | jj13.
ambig14 = jj14 | ...
...
% Tags decompose into elements from the same set as
% the ambiguity classes
nn = nn12 | nn13 | ...
jj = jj13 | jj14 | ...
nns = nns12 | ...
...
% Predicate: tag_lex
tag_lex(orth: ``pragmatics'' & ambig12) := ``pragmatics''.


tag_lex(orth: ``red'' & ambig13) := ``red''.
...
% Predicate: tag_lookup
tag_lookup([]) := [].
tag_lookup([Tag|Tags]) := [tag_lex(Tag)|tag_lookup(Tags)].

The lexicon is encoded as a binary relation tag_lex between feature terms and bare strings. The syntax of CUF is potentially misleading here, because the functional notation tends to obscure the fact that we are dealing with a relation. The relation between lists of tags and lists of strings is defined by the clauses for tag_lookup. The functional notation again slightly obscures what is going on. In order to clarify what is intended, we provide equivalent pseudo-Prolog.

% Predicate: tag_lex(Tag, String)
tag_lex(orth: ``pragmatics'' & ambig12, ``pragmatics'').
tag_lex(orth: ``red'' & ambig13, ``red'').
...
% Predicate: tag_lookup(Tags, Strings)
tag_lookup([], []).
tag_lookup([Tag|Tags], [String|Strings]) :-
    tag_lex(Tag, String),
    tag_lookup(Tags, Strings).

This is only pseudo-Prolog, because of the feature term syntax in the first argument of tag_lex. There is a symmetry in the Prolog-style encoding which is no longer present in the CUF version. The reason for the initially odd choice of the string as the result argument in the CUF version of tag_lex is that this choice makes later code shorter and clearer. In the training phase the tagger picks up information from a corpus and records it as information about discrete probability distributions. Specifically, the information which the tagger learns is:

- Information about state transitions. This expresses information about the context in which tags occur. The type mechanism says nothing about this. To make progress with this we will turn to the use of definite clauses in a later section.
- Information about the probabilities that a particular tag will be expressed in the observable input as each of its possible ambiguity classes. The type system says which ambiguity classes are possible, but does no more than this. If required, we can record numerical information in the grammar.[1]

This is our first step towards integrating the functionality of the tagger into CUF. In a later section we will demonstrate how the information recorded during training can be exploited in tandem with CUF's usual processing. The way in which we associate probabilities with sub-types is simply to associate values for the feature gprob with the smallest types of the framework.

% Types: nn12, nn13, jj13, ...
...
nn12 = gprob:0.05
nn13 = gprob:0.05
...
jj13 = gprob:0.07
jj14 = gprob:0.91
...
nns12 = gprob:0.44
...

[1] Allowing numbers is a small extension. We shall further extend CUF notation once we need arithmetic.


The particular numbers are chosen arbitrarily. The idea is that these values should be induced by the tagger in the training phase, not that they should be created ex nihilo by the grammar writer. So far, we have encoded only half the information which the tagger collects during the training phase, because we haven't said anything about the transitions between states. To add this information we make use of the definite clause facilities of CUF.
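To summarise the machinery so far in executable form, here is a hedged Python sketch (ours; the data follows the toy fragments above) in which tag_lex is an explicit relation and each smallest type carries its gprob value:

from itertools import product

# Smallest types paired with orthographic forms, following the
# decomposition ambig12 = nn12 | nns12 and ambig13 = nn13 | jj13,
# plus the (arbitrary) gprob values from the extract above.
TAG_LEX = [("nn12", "pragmatics"), ("nns12", "pragmatics"),
           ("nn13", "red"), ("jj13", "red")]
GPROB = {"nn12": 0.05, "nns12": 0.44, "nn13": 0.05, "jj13": 0.07}

def tag_lex(string):
    """All typed readings of one orthographic form."""
    return [typ for typ, orth in TAG_LEX if orth == string]

def tag_lookup(strings):
    """All ways of typing a list of strings: the relational analogue
    of the tag_lookup predicate, made nondeterministic by enumeration."""
    return list(product(*(tag_lex(s) for s in strings)))

for reading in tag_lookup(["red", "pragmatics"]):
    print(reading, [GPROB[t] for t in reading])
# four readings in all, e.g. ('nn13', 'nn12') [0.05, 0.05]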

3.2 Definite clauses

We start by setting things up to show how we encode the sequences of states in which a first-order Markov process can find itself. We choose to represent the transition possibilities of the Markov model by a disjunctive definition of the successor relationship.

% Predicate: succ
succ(jj) := jj.
succ(jj) := nn.
succ(jj) := nns.
...
succ(nn) := jj.
succ(nn) := nn.
succ(nn) := nns.
...
succ(nns) := jj.
succ(nns) := nn.
succ(nns) := nns.
...

This is just the representation of the universal relation over the set of states. As a grammar this is very uninteresting, since it says that any sequence of states is allowed. Once we start adding numerical information it becomes more useful.

% Predicate: succ
succ(jj) := jj & tprob:0.1.
succ(jj) := nn & tprob:0.3.
succ(jj) := nns & tprob:0.3.
...
succ(nn) := jj & tprob:0.001.
succ(nn) := nn & tprob:0.40.
succ(nn) := nns & tprob:0.10.
...
succ(nns) := jj & tprob:0.001.
succ(nns) := nn & tprob:0.10.
succ(nns) := nns & tprob:0.80.
...

Again, the precise numbers are fictional, but would in a real system be subjected to a regime of re-estimation against a corpus until they reflected the observed regularities as accurately as possible. We now provide a definition of the relationship between a sequence of words and the probability that a first-order Markov model would give rise to that sequence of words. Note the use of the true(X) construct to impose side-conditions on the computation.

% Predicate: path_prob
path_prob([Tag]) := SP & true(gprob:SP & Tag).


path_prob([Tag0,Tag1|Tags]) :=
    S0P * S0S1P * path_prob([Tag1|Tags]) &
    true(gprob:S0P & Tag0) &
    true(Tag1 & succ(Tag0) & tprob:S0S1P).

Here we have assumed arithmetic facilities which will allow CUF to evaluate arithmetic expressions by suspending the arithmetic operations until their arguments are sufficiently instantiated. The first clause says that the probability of a path through a one-element list of tags is produced by inspecting the gprob attribute of the object which has the appropriate orthography in the tagging lexicon. The second clause says that the path probability of a path through a word list of two or more elements is found by multiplying the probabilities found under the gprob attribute of the first state, the tprob attribute of the second state and the path probability of the rest of the word list. Note that the probabilities in question will only be appropriately instantiated (to numbers) when the words have been coerced to have types such as nn13, jj13, .... In general this coercion can happen in a number of different ways, giving rise to different solutions. Each of these solutions corresponds to a different path through the Markov model. In the most obvious use of the tagging facilities from within CUF we need to collect all these solutions, returning only the maximum probability interpretation. To do this we introduce another ad-hoc extension to CUF.

% Predicate: best_path
best_path(Strings) :=
    maximize(P,
             P & path_prob(Tags) & true(Strings & tag_lookup(Tags)),
             Tags).

The maximize construct should cause the system to find the best way of constructing a path probability for the given input sequence. The first argument indicates the variable over which maximization is to occur, the second indicates the relation whose solutions will be searched for the maximum, and the third argument indicates which variable will be returned as a result. Prolog programmers will recognize this as a variant of the setof construct. Despite the fact that they are definable, it is standard for numerically-oriented constraint logic programming languages to include such facilities as built-ins, since this allows the interpreter to exploit efficient strategies such as dynamic programming or the simplex method [Van89]. The standard algorithm for recognition using hidden Markov models is the Viterbi algorithm [Vit67], which is an instance of the dynamic programming scheme. It is possible that there will be problems in integrating this approach to numeric computation with CUF's policy on type inference, but for our purposes this is a minor matter. At this point we have all we need to use the tagging facilities to rank the outputs of a full-scale grammar. For the sake of simple illustration we suppose that the goal of the whole enterprise is to find a single best parse, and that the full-scale grammar is a lexicalist system which produces sets of richly structured lexical signs.[2] The changes necessary to lift either of these assumptions are minor. We also make the assumption that the lexicon used by the full-scale grammar is an extension of the tagging lexicon introduced earlier. The reason for this assumption is that we want to ensure that there is some interaction between the two processes which are carried out within the scope of the maximize construct.

% Predicate: best_parse
best_parse(Strings) :=
    maximize(P,
             P & path_prob(Tags)
               & true(Strings & tag_lookup(Tags))
               & true(wf_sequence(Tags)),
             Tags).

[2] Rather than, for example, a tree structure or a single spanning sign.

All we are saying here is that we want to find the maximum probability sequence of words which also satisfies the constraints imposed by the grammar. The idea is that every parse will restrict the space of possibilities which can be explored by the tagger. Note that we have not specified the order in which computation occurs. There is clearly scope for a whole range of interesting interleavings of the tagging process and the workings of CUF proper. This is the potential advantage of building the tagger and the grammar as interacting processes rather than running a pipeline process with separate components. We do not here attempt the task of providing a control regime which will make such computations efficient, leaving this as an open problem for further research in constraint logic programming.
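To fix intuitions about what an implementation of maximize has to do, here is a hedged Python sketch (ours; tag_of, the tables and their contents are invented stand-ins) of the path probability computation together with a Viterbi-style dynamic-programming search of the kind the text envisages:

import math

GPROB = {"nn12": 0.05, "nns12": 0.44, "nn13": 0.05, "jj13": 0.07}
TPROB = {("jj", "nn"): 0.3, ("nn", "nn"): 0.40, ("nn", "nns"): 0.10}

def tag_of(subtype):
    """nn13 -> nn, nns12 -> nns: strip the ambiguity-class suffix."""
    return subtype.rstrip("0123456789")

def path_prob(tags):
    """Mirror of the two path_prob clauses: the gprob of the first
    element, times the tprob of the transition into the second, times
    the path probability of the remainder."""
    if len(tags) == 1:
        return GPROB[tags[0]]
    return (GPROB[tags[0]]
            * TPROB[(tag_of(tags[0]), tag_of(tags[1]))]
            * path_prob(tags[1:]))

def viterbi(readings):
    """readings[i] is the set of smallest types allowed for word i.
    Dynamic programming over prefixes finds the maximum probability
    path without enumerating every combination, which is the strategy
    a built-in maximize could exploit."""
    lp = lambda p: math.log(p) if p > 0 else float("-inf")
    # best[t] = (score, path) for the best path ending in type t
    best = {t: (lp(GPROB[t]), [t]) for t in readings[0]}
    for types in readings[1:]:
        best = {t: max(((s + lp(TPROB.get((tag_of(p[-1]), tag_of(t)), 0))
                         + lp(GPROB[t]), p + [t])
                        for s, p in best.values()),
                       key=lambda x: x[0])
                for t in types}
    return max(best.values(), key=lambda x: x[0])[1]

print(path_prob(["jj13", "nn12"]))              # 0.07 * 0.3 * 0.05
print(viterbi([["jj13", "nn13"], ["nn12", "nns12"]]))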

4 Future work

In this section we discuss various directions in which the system already described might usefully be extended.

4.1 Getting useful output

As configured above, the system only guarantees to return a parse which gives rise to the highest-scoring sequence of tags. There are a number of reasons why this may not be exactly what we want:

- There may be several parses corresponding to the top-scoring tag sequence.
- The correct (human-preferred) parse may not in fact correspond to the best tag sequence. We ought to expect this, since the main point of having a full-scale grammar is to achieve a better model of the language than can be provided by a finite-state machine.
- The input may contain ambiguities which should not be resolved at this stage of processing.

The first problem can be attributed to the use of a tag set which is too impoverished to allow the full-scale grammar to help as much as it might. Since we have assumed a lexicalist grammar, in which all the information about a parse is stored somewhere in the extended description of the input words, it will always be theoretically possible to extend the tag set in order to reflect appropriate distinctions made by the full-scale grammar. Such an extension of the tag set would require re-training of the tagger, and if the tag set were to become too large the whole operation would become infeasible for lack of sufficiently large corpora. Nevertheless, judicious extensions to the tag set could produce improvements in parse ranking, which would assist with the first two of the problems mentioned above. All the problems would be alleviated if we could provide a version of the maximize construct which was backtrackable. This would return the top score first, followed by lesser scores as desired. Carroll and Briscoe [CB92] describe work on the heuristic unpacking of parse forests which is closely related to this, and motivated by the same desire to ease the task of using large grammars which produce substantial numbers of analyses.
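For finite solution sets a backtrackable maximize can be approximated by yielding solutions best-first; the Python sketch below (ours) does this with a heap. Note that it still materialises all candidates before yielding any, so it eases inspection rather than the underlying combinatorial explosion; a genuinely lazy scheme would need the dynamic-programming structure discussed earlier.

import heapq

def ranked(solutions, score):
    """Yield candidate analyses best-first, on demand; a consumer can
    stop after the first few, as a backtrackable maximize would allow."""
    heap = [(-score(s), i, s) for i, s in enumerate(solutions)]
    heapq.heapify(heap)
    while heap:
        _, _, s = heapq.heappop(heap)
        yield s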

4.2 Modularity

It is a violation of modularity to insist that the tagger interact with the full-scale grammar through a shared set of tags. The CUF reconstruction of the tagger actually represents these tags as types. Whether they are atoms or types, a modern grammar typically uses a richer representational scheme for its lexical information, and should not be constrained to use what the tagger requires. Although we cannot avoid doing some work to reconcile the two representations, we can shelter the provider of the modern grammar from the exigencies of the tagger by providing an intermediary between the two processes. If we do this the code given above becomes:

% Predicate: best_parse
best_parse(Strings) :=
    maximize(P,
             P & path_prob(Tags)
               & true(Strings & dolmetscher(Tags,Lexicals))
               & true(wf_sequence(Lexicals)),
             Lexicals).

We have called the intermediary dolmetscher, since it is a kind of highly intelligent simultaneous translation service. Now best_parse is the CUF encoding of a binary relation between the input and a set of lexical signs produced by the full-scale grammar. The role of the intermediary is to control the flow of information between the original input (Strings), the tags (Tags), and the terms which are manipulated by the full grammar (Lexicals). For a simple grammar the intermediary might be as shown in the following extract.

% Predicate: dolmetscher
dolmetscher([],[]) := [].
dolmetscher([Tag & lex_to_tag(Lexical)|Tags],[Lexical|Lexicals]) :=
    [full_lex(Lexical) & tag_lex(Tag) | dolmetscher(Tags,Lexicals)].
...
% Predicate: full_lex
full_lex(n:+ & v:- & bar:0 & num:sing) := ``pragmatics''.
full_lex(n:+ & v:- & bar:0 & num:sing) := ``red''.
full_lex(n:+ & v:- & bar:0 & num:pl) := ``pragmatics''.
full_lex(n:+ & v:+ & bar:0) := ``red''.
...
% Predicate: lex_to_tag
lex_to_tag(n:+ & v:- & bar:0 & num:sing) := nn.
lex_to_tag(n:+ & v:- & bar:0 & num:pl) := nns.
lex_to_tag(n:+ & v:+ & bar:0) := jj.
...

We have given full_lex the same argument structure as tag_lex in order to make their analogous roles obvious. The dolmetscher predicate is closely analogous to the tag_lookup predicate given earlier. Its generalization to systems with yet further components should be obvious. Both the tagger and the grammar are now free to use whatever representations they choose. In the worst case, the information which passes through the intermediary will fail to constrain the operation of the individual components, and we will be back where we started. However, if the intermediary does its job well there will be mutual constraints between the two main components, and the possibility of improved performance.
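Viewed procedurally, the intermediary is little more than a pair of translation tables. A minimal Python sketch follows (ours; the feature bundles and the collapsing rule are invented stand-ins for the grammar's real signs):

# Hypothetical rich lexical signs, as small feature dictionaries.
FULL_LEX = {
    "pragmatics": [{"n": "+", "v": "-", "bar": 0, "num": "sing"},
                   {"n": "+", "v": "-", "bar": 0, "num": "pl"}],
    "red":        [{"n": "+", "v": "-", "bar": 0, "num": "sing"},
                   {"n": "+", "v": "+", "bar": 0}],
}

def lex_to_tag(sign):
    """Collapse a rich sign onto the tagger's coarser vocabulary."""
    if sign.get("v") == "+":
        return "jj"
    return "nn" if sign.get("num") == "sing" else "nns"

def dolmetscher(words):
    """For each word, pair every full sign with its tag: the channel
    through which tagger and grammar can constrain one another."""
    return [[(lex_to_tag(sign), sign) for sign in FULL_LEX[w]]
            for w in words]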

4.3 Extensions

Once we have the idea of using an intermediary to establish arbitrary relations between different components of the system, there is no reason why one should not add further preference-based components. Uses of tagger output for phrase recognition, word sense disambiguation and grammatical function assignment are described in the final section of the paper by Cutting et al. [CKPS92]. Any or all of these might prove useful as components working in collaboration with the conventional grammar. Another candidate is a component which assigns merit to parses according to the assessed probabilities that particular slots in the argument structure are filled by particular items (i.e. prefer the reading of "The brave customer stopped the bank raider with the gun" in which it is the criminal who has the gun). We might also contemplate much more elaborate statistical models. The task of associating unification grammars with stochastic context-free grammars and/or LR parsing tables is a topic of active research, as described by Briscoe and Carroll [BC93]. There is no a priori reason why the fruits of this work should not take a role parallel to that which we have described for the tagger.

4.4 Feature frameworks and probabilistic calculi

The main purpose of the framework developed in this note is as a high-level description of the possible relationships between a declarative grammar and associated preference information. In this final section we expand on this. In the example we identified the types nn, jj and nns with corresponding states in the tagger, and we associated probabilities with the elements of a disjoint union. The probabilities in the different elements of the disjoint union should sum to 1, since these numbers are conditional probabilities of the form $P(nn_i \mid nn)$. Clearly we want

$$\sum_{i} P(nn_i \mid nn) = 1$$

This is because the type nn is defined as the union of the subtypes mentioned in the disjoint union. For similar reasons it will follow that

$$\sum_{ambig_i} P(ambig_i \mid ambig) = 1$$

and that

$$\sum_{\{x \mid x \sqsubseteq tag\}} P(x \mid tag) = 1$$

and that for each individual ambiguity class $ambig_i$

$$\sum_{\{x \mid x \sqsubseteq ambig_i\}} P(x \mid ambig_i) = 1$$

Note that the sums of the probabilities associated with the elements of the disjoint unions which form the ambiguity classes are not 1. These sums are defined by the expressions

$$\sum_{\{x \mid x \sqsubseteq ambig_i\}} P(x \mid tag(x))$$

which represent quite different quantities and are not guaranteed to sum to unity. The reason for this apparent anomaly is that we chose to associate the probabilities with types. What we really wanted to say was that there was a probability distribution associated with the disjoint union. We chose the alternative approach because it fits in better with the existing CUF facilities, but it is unsatisfactory. If we had need of recording the probabilities which actually do sum to one for each ambiguity class, namely

$$\sum_{\{x \mid x \sqsubseteq ambig_i\}} P(x \mid ambig_i)$$

we should have to define another numeric feature (say aprob) on the sub-types in order to store this information. This is inelegant, since the information in question is not really about the sub-types per se.
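These constraints are easy to state as executable checks. In the Python sketch below (ours; the gprob numbers are invented so as to satisfy the per-tag constraint, unlike the arbitrary figures used earlier) the per-tag sums come out as 1 while the same values regrouped by ambiguity class do not:

# Decomposition of tags and ambiguity classes into smallest types,
# as in the CUF extract.
TAG_PARTS = {"nn": ["nn12", "nn13"], "jj": ["jj13", "jj14"],
             "nns": ["nns12"]}
AMBIG_PARTS = {"ambig12": ["nn12", "nns12"], "ambig13": ["nn13", "jj13"]}
# gprob read as P(x | tag(x)); invented so the per-tag constraint holds.
GPROB = {"nn12": 0.5, "nn13": 0.5, "jj13": 0.09, "jj14": 0.91,
         "nns12": 1.0}

# Within each tag the gprob values must sum to 1 ...
for tag, parts in TAG_PARTS.items():
    assert abs(sum(GPROB[x] for x in parts) - 1.0) < 1e-9

# ... but the same numbers regrouped by ambiguity class need not:
for ambig, parts in AMBIG_PARTS.items():
    print(ambig, sum(GPROB[x] for x in parts))   # 1.5 and 0.59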

Likewise the transition probability information is recorded in the target state of the transition, when the relevant probability distribution more appropriately operates at the point where a choice is made between transitions. We are getting into a situation where the linguistic engineer keeps having to make arbitrary decisions about where to put information. Before proceeding down this path we ought to step back and consider what is going on at a more abstract level.

Although CUF, like other similar systems, provides us with a uniform framework for the description of things as diverse as orthography, syntactic structure and anaphoric phenomena, it does not follow that there will be a similar uniformity in the appropriate treatment of preferences. We initially thought that the way to proceed would be to associate a probabilistic operator with each construct of the logic which underlies CUF, using these operators to compositionally construct probabilities from those of the component parts of a structure. We now think this inappropriate, because a probabilistic interpretation stands or falls by its fidelity to the described situation, not by its connection to the formal description. Since we are no longer looking for a homogeneous account of the relation between CUF constructs and probabilities, we no longer need to develop a general account of the appropriate treatment of such potentially problematic constructs as variable co-reference and involvement in relational dependencies. Instead we will proceed on a case by case basis, attempting to ensure that the ways in which we use probabilities reflect the underlying realities of the situation which is being modelled.

For example, in developing the code which implemented the functionality of hidden Markov models we began by associating numerical information with CUF's construct for disjoint union of types, then proceeded to associate further numerical information with the traversal of a list of states. It seems extremely natural to associate disjoint union of types with addition of probabilities, as we have done, but there is no similar feeling of obvious correctness in interpreting passage along the 'R' feature of a feature structure as multiplication, as we have also done (the 'R' feature is CUF's name for Lisp's cdr and is implicitly used in the CUF list notation). In general, we will find the right probabilistic calculus only when we take into account the relationship between the feature structure and what it represents. Once we know that traversal of the 'R' feature corresponds to progress along a list, the logical interpretation of feature traversal as conjunction is clearly motivated. Once we know that, it is easy to see why multiplication, as the probabilistic counterpart of conjunction, is clearly the right choice. The traversal of other features may well require different interpretations.

We therefore need to provide the linguistic engineer with a flexible means of specifying the desired probabilistic interpretation of parts of the grammar. This is a matter for further research. The numerical extensions to CUF which we have proposed represent a first step in this direction. Unfortunately, in order to work with these extensions one has to take a rather low-level view of the problem. A longer term goal is to provide a modular toolkit of appropriate probabilistic calculi and a convenient high-level means of associating them with the grammar. The intention would be that the linguistic engineer would be able to use these in the same way that packages implementing abstract data types are routinely used in software engineering.

References

[Bau72] L. E. Baum. An inequality and associated maximization technique in statistical estimation for probabilistic functions of a Markov process. Inequalities, 3:1-8, 1972.

[BC93] Ted Briscoe and John Carroll. Generalised probabilistic LR parsing of natural language (corpora) with unification grammars. Computational Linguistics, 19(1):25-60, March 1993.

[CB92] John Carroll and Ted Briscoe. Probabilistic normalisation and unpacking of packed parse forests for unification-based grammars. In Proceedings, AAAI Fall Symposium on Probabilistic Approaches to Natural Language, pages 33-38, 1992.

[CKPS92] Doug Cutting, Julian Kupiec, Jan Pedersen, and Penelope Sibun. A practical part-of-speech tagger. In ACL Conference on Applied NLP, Berkeley, California, 1992.

[DD93] Jochen Dörre and Michael Dorna. CUF: A Formalism for Linguistic Knowledge Representation. DYANA deliverable, 1993. This volume.

[GBCB89] Claire Grover, Ted Briscoe, John Carroll, and Bran Boguraev. The Alvey Natural Language Tools Grammar. Technical Report 162, University of Cambridge Computer Laboratory, April 1989.

[HAJ90] X. D. Huang, Y. Ariki, and M. A. Jack. Hidden Markov Models for Speech Recognition. Edinburgh University Press, 1990.

[Kup92] Julian Kupiec. Robust part-of-speech tagging using a hidden Markov model. Computer Speech and Language, 6(3):225-242, July 1992.

[Man93] Suresh Manandhar. CUF in Context. DYANA deliverable, 1993. This volume.

[Pul92] Steve Pulman. Using tagging to improve analysis efficiency. In Henry Thompson, editor, SALT/ELSNET Workshop on Sub-Language Grammar and Lexicon Acquisition for Speech and Natural Language Processing, pages 71-75, January 1992. Record of verbal presentation.

[Usz91] Hans Uszkoreit. Strategies for adding control information to declarative grammars. Technical Report RR-91-29, Deutsches Forschungszentrum für Künstliche Intelligenz GmbH, Saarbrücken, 1991.

[Van89] P. Van Hentenryck. Constraint Satisfaction in Logic Programming. MIT Press, 1989.

[Vit67] A. J. Viterbi. Error bounds for convolutional codes and an asymptotically optimal decoding algorithm. IEEE Transactions on Information Theory, pages 260-269, April 1967.

