Disjoint Feature Structures

Matthew Hurst
Language Technology Group
Human Communication Research Centre
University of Edinburgh

Abstract

This paper looks at some of the space and time issues faced in implementing feature structures for natural language parsers. The design of a space- and time-efficient graph representation is presented, as well as an evaluation of its performance.

1 Introduction

The SEATS project aims to produce a checker for controlled languages and consequently requires the implementation of an efficient parsing/recognition mechanism for analysing the author's texts. A feature-based information structure is required to support a certain class of errors recognisable by relaxation techniques, and it is the implementation of this data structure which is the focus of this paper.

2 Controlled Languages

A controlled language (CL) is a restricted variation on some natural language. The purpose of defining a CL for some domain is to control aspects of the language used to describe a task in that domain. The control is designed to reduce the ambiguity inherent in natural languages, making the text easier to understand and less prone to incorrect interpretation. A typical application area is one in which the correct execution of a procedure manipulating objects in the domain is safety-critical (or legally critical), e.g. [AECMA89].

3 Previous Work

Previous work in this area includes [KK85] and [Per85]. The work described in [Per85] is particularly interesting and relevant to that described here. Direct analogies can be drawn between solutions to various sub-problems, and to allow for comparison a summary of [Per85] follows. The underlying observations, motivation and concepts map directly onto those presented here.

- Structures loaded at run time are considered to be base (skeleton) instances of feature structures.
- Additions to these structures are monotonic (incremental).
- Structures generated by the unification operator are stored as additions to the base structures.

Pereira uses virtual-copy arrays ([War]) to provide a certain amount of structure sharing in update representations; the idea is illustrated in the sketch below. Virtual-copy arrays require O(log n) access or update time for an array with highest index n. It is the storage of the additions which differs between the two implementations. Pereira uses a reference to the skeleton together with a set of modifications. The use of the virtual-copy array to represent these modifications means that a derivation of a feature structure can be computed in O(log d) time (where d is the length of the derivation: the chaining referred to later in this paper). Additionally, merging the updates has worst-case behaviour O(|d| log |d|), where |d| is the size of the derivation. This operation is also represented by chaining in this paper.
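Warren's program is unpublished, so the following minimal Python sketch is only an illustration of the complexity claim above, not a reconstruction of his representation: a persistent, path-copying array in which an update copies the O(log n) nodes on the path to the slot, leaving all earlier versions readable. All names are hypothetical.

```python
class VArray:
    """A persistent array over 2**depth slots, stored as a binary trie."""

    def __init__(self, depth=5, node=None):
        self.depth = depth        # tree height; indexes 0 .. 2**depth - 1
        self.node = node          # nested pairs; None means "all unset"

    def get(self, i):
        node, bit = self.node, self.depth - 1
        while bit >= 0 and node is not None:
            node = node[(i >> bit) & 1]   # follow one bit of the index
            bit -= 1
        return node                        # the stored value, or None

    def set(self, i, value):
        # Copy only the O(log n) nodes on the path to slot i.
        def build(node, bit):
            if bit < 0:
                return value
            left, right = node if node is not None else (None, None)
            if (i >> bit) & 1:
                return (left, build(right, bit - 1))
            return (build(left, bit - 1), right)
        return VArray(self.depth, build(self.node, self.depth - 1))

a0 = VArray()
a1 = a0.set(3, "np")   # a0 is untouched; both versions coexist
assert a0.get(3) is None and a1.get(3) == "np"
```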


Figure 1: Extending the Feature Structure through Unification. The rule triggered by the addition of a determiner to the chart has the information structure represented by the top half of the feature structure. Unification with the instance of the determiner category edge results in extending the feature structure with the num feature shown in the circled section of the figure. This extension is monotonic.

The main conceptual difference between this work and that of Pereira is that the representation here is in-line whereas Pereira's is off-line. For example, at a node n in a DAG, calculating the type of the node (atom, variable or complex structure) for an in-line representation can be done at that node; the equivalent calculation for an off-line representation requires examination of global data. The data being distinguished in this manner is the additions to the base structures: Pereira stores them in a separate data structure, whereas here they are stored at the relevant node in the graph. The advantage of this technique is that the merging of additions for unification can be removed by incorporating the relevant information in the chaining process.

4 Chart Parsing

The main means of analysis in the SEATS project is a bottom-up chart parser. Before describing the feature structures used in the system, an overview of the concepts used, as well as some of the robust analysis techniques, will be given.

4.1 General Concepts

In the chart parsing paradigm, rules are represented by an information structure and a description of the grammatical production. The information structure may simply be a declaration of constituent category values or a more complex representation like the feature structures (see [Shi86]) described here (see Figure 3). Edges in the chart are represented by a rule, a description of the state of the edge, e.g. the dotted-rule notation ([Ear86]), and a pair of delimiting vertices. Lexical items are represented in much the same way and, for brevity, can be thought of as unary rules from strings to categories and so on. The processes associated with chart parsing are well documented and will not be discussed here in full (see, for example, [Ear86], [Kay86]). What is pertinent is the notion that the construction of complex information structures from matched constituent information is assumed to be monotonic. For example, an edge that has been formed through the triggering of a rule and combination with other edges produces a monotonically extended information structure, see Figure 1. This is, in the system under consideration, the behaviour of unification of feature structures. This property is relied upon by the chaining reference mechanism described later.
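As an illustration of these concepts, the following sketch shows one plausible shape for the rule and edge objects. The field names are hypothetical, not those of the SEATS implementation.

```python
from dataclasses import dataclass

@dataclass
class Rule:
    info: object          # information structure, e.g. a feature structure DAG
    mother: str           # the category produced, e.g. "np"
    daughters: list       # the constituents required, e.g. ["det", "noun"]

@dataclass
class Edge:
    rule: Rule
    dot: int              # dotted-rule state: how many daughters are matched
    start: int            # delimiting vertices in the chart
    end: int

    def is_inactive(self) -> bool:
        # an edge is inactive (complete) once every daughter has been matched
        return self.dot == len(self.rule.daughters)
```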


4.2 Targeted Relaxation Techniques

The SEATS system provides a degree of robustness through the use of targeted relaxation techniques (see [Hur95]). These methods provide a limited search space for detecting errors with respect to a 'correct' grammatical description of the language.

4.2.1 Robust PATR

Robust PATR is a method for encoding expectations about constraint violations in grammar rules with a PATR-like notation. This method, described in [DD92], is implemented in the SEATS system to encode such things as subject/verb number disagreement and so on. The basic idea is that, taking a feature structure as a graph, failure to unify may occur at a node if that node has been marked for relaxation. In the language of feature structures, this means that a certain feature (described by a path through the structure) has been relaxed with respect to the constraints it represents. More elaborate versions may be implemented in which different levels of relaxation are used.
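To make the idea concrete, here is a minimal sketch of relaxed unification over flat dictionaries of atomic feature values. [DD92]'s actual mechanism works over full feature structure graphs and paths, so this is only an illustration; all names are hypothetical.

```python
def relaxed_unify(a, b, relaxable=frozenset()):
    """Unify two flat feature dicts, recording (rather than failing on)
    value clashes at features marked as relaxable."""
    result, violations = dict(a), []
    for feat, val in b.items():
        if feat in result and result[feat] != val:
            if feat in relaxable:
                violations.append((feat, result[feat], val))  # note the clash
            else:
                return None, None                             # hard failure
        else:
            result[feat] = val
    return result, violations

fs, errs = relaxed_unify({"cat": "np", "num": "sing"},
                         {"cat": "np", "num": "plur"},
                         relaxable={"num"})
# errs == [("num", "sing", "plur")]: the number disagreement is detected,
# neither silently accepted nor fatally rejected.
```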

4.2.2 Finite State Productions

A further technique for detecting violations at the constituent level (insertions and deletions) uses a finite state automaton representation of grammar productions to encode the necessary variations; see [Hur95].

4.2.3 Adjacency

Due to the multiple levels of analysis carried out by the SEATS parser, it is necessary to define the notion of adjacency in the chart. In a standard chart parser, adjacency is a simple matter of sharing the same vertex: two edges are adjacent if one enters vertex v_i and the other exits vertex v_i. Allowing this notion to vary over more complex definitions of adjacency gives the grammar writer more flexibility. For example, a rule defining the capitalisation of a word requires immediate adjacency between the single capital character and the remaining lower-case characters, whereas a rule describing constituents at the word level requires adjacency over spaces. Errors in text to do with adjacency, then, may be detected by this mechanism.
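The following sketch illustrates one way such a parameterised adjacency test might look, assuming the chart can report the surface text lying between two vertices. The function and level names are hypothetical.

```python
def adjacent(left_edge_end: int, right_edge_start: int,
             text_between: str, level: str) -> bool:
    """Decide whether two edges count as adjacent at a given level of analysis."""
    if level == "char":
        # e.g. a capitalisation rule: capital and lower-case remainder must touch
        return left_edge_end == right_edge_start
    if level == "word":
        # word-level constituents remain adjacent across intervening spaces
        return left_edge_end == right_edge_start or text_between.isspace()
    raise ValueError(f"unknown adjacency level: {level}")
```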

4.3 Experimental Data

The evaluation of the system used the grammar and input examples for the controlled language described in [FHS94]. The grammar described there has been extended to give a complete account of the example input. This grammar contains 34 rules and 120 lexical items. The results of using this grammar are plotted against a dimension indicating the number of edges in a parse. The input data for this dimension is a set of 38 sentences, from length 5 to length 43.

5 Feature Structures

Feature structure formalisms provide powerful descriptive and computational mechanisms for NLP. Space and time factors in their implementation, however, pose real problems for language engineers. Consequently, it is vital that their use be properly justified and that the expensive space and time considerations be met head on. The space cost can be attributed largely to the need to copy structures in parsing environments; the time cost can be attributed to the need to carry out graph traversal. This second issue has been the topic of much research (e.g. [Erb95]); however, it is not clear whether those methods translate well to unification regimes which perform default or relaxed unification. Of course, the expenses of space and time are interrelated, and a space-efficient implementation should lead to efficient time behaviour. The key copying phase can, for the most part, be removed, as described below.


Figure 2: Data Structures for Simple Graph. An arc has a name, represented by an integer, and a value, represented by a pointer to a node. A node has a name, also represented by an integer, and a set of arcs (possibly empty). The arc can be used to represent a feature, and the node either an atomic value, a variable, or a complex feature structure. An atomic value requires an empty arc set and a name; a variable requires an empty arc set and the name 0.

Figure 3: A Grammar Rule as a DAG. There are two components to a grammar rule: the information structure, here a feature structure represented as a DAG, and a description of the production, here a context-free rule.

5.1 Graph Implementations of Feature Structures

The work reported here uses a straightforward graph representation of feature structures built up from arcs, which represent the features of the structure, and nodes, which represent the feature values. These simple components are described in Figure 2.
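Rendered as code, the structures of Figure 2 might look as follows. This is a Python sketch; the paper does not specify an implementation language, and the helper methods and interned integer names are illustrative.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    name: int                                   # 0 = variable; nonzero + no arcs = atom
    arcs: list = field(default_factory=list)    # outgoing Arc objects (possibly empty)

    def is_variable(self) -> bool:
        return self.name == 0 and not self.arcs

    def is_atom(self) -> bool:
        return self.name != 0 and not self.arcs

@dataclass
class Arc:
    name: int      # the feature, e.g. "cat" interned as an integer
    value: Node    # the feature's value

# cat: np, with "cat" interned as 1 and the atom "np" as 7
np_fs = Node(name=0, arcs=[Arc(name=1, value=Node(name=7))])
```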

5.2 The Population of Feature Structures

A grammar and lexicon are described by a number of structures. A grammar rule can be thought of as a single DAG, with the form of the rule (e.g. a simple context-free production or a finite state automaton) being expressed as some function over a set of marked nodes (feature values) in the graph, see Figure 3. Similarly, a lexical item is a single graph. Once a parser has been populated with feature structures there is no need to explicitly build new feature structures, as the generative unification operation simply combines feature structure information (see [GM89]). Consequently, after creating unique feature structures for rules and lexical entries, it is possible to encode all other possible feature structures that may be generated as annotations to this original set. The straightforward implementation described here relies on the monotonic qualities mentioned earlier. Some simple changes must be made to the graph representation presented in Figure 2 to allow for the implementation of this space-efficient disjoint feature structure representation.

Figure 4: Data Structures for Disjoint Feature Structure Graph. An arc now carries an index of values (a map from feature structure identifier to node) in place of a single value; a node carries the shared arc set together with indexed maps of names and of bit maps.

5.3 Disjoint Feature Structures

The basic idea is to use the same instance of a feature structure representation for all the times it is used in an edge in the chart. This should be compared with using copies of a feature structure. Similarly, when unification occurs, new structures are not created through copying, but by modifications to the existing structures. In summary, when a rule or lexical item is loaded, a new structure is generated. If that structure is used in the parse (i.e. forms part of an edge) then, instead of making a copy of it, a reference[1] to it is used. If the structure is used again as a result of this first use, another reference is made, and so on.

However, the structure must be capable of representing conflicting information, precisely in the cases where it is extended (through the unification operation) in various ways. For example, a structure may have a variable which is bound through unification to an atom. Another use of that same structure may bind the variable to a different atom, or to a feature structure. In order to cope with this disjunction, each reference to the structure is given a unique identifier. The identifier is unique over all feature structures. When a new feature structure is created during parsing, the next identifier is obtained by incrementing a global counter of feature structures. At load time, the feature structures used by rules all have the unique identifier 0.

The first requirement is to provide a data structure capable of representing these disjoint structures. Additions to the arc are made so that an index of possible values is available instead of the unique value, as in Figure 4. This index is provided by a hash table which can be referenced by the unique identifier of a feature structure. The node representation is a little more complex. The set of arcs is represented by a unique set. The disjunction over this set is represented by a set of indexed bit masks. These masks are then used to view the set of arcs, thereby ignoring those arcs which are not part of that particular structure.[2]

Instead of having private copies of entire feature structures, the parsing process is now dealing with references to public objects. A new data structure is required to wrap around the reference and store the unique identifier. The reference is now a reference to the base feature structure, and the feature structure becomes the data structure in Figure 5. Hereafter, the term feature structure will refer to the wrapper class, and the term base feature structure will refer to the underlying global object. Expressions of the form f_n refer to a feature structure with unique identifier n.

[1] In implementation-specific terms, this reference may be a pointer to a data structure.

[2] The bit masks may also be used to enhance the subsumption operation. A node with a set of exiting arcs represented by bit map b_0 will be able to subsume a node with a set of exiting arcs represented by b_1 iff b_0 ∧ b_1 is equal to b_0 (where ∧ is the bitwise and operator), i.e. b_0 is a subset of b_1.

Figure 5: Wrapper Structure for Feature Structures. The wrapper holds a reference to the base feature structure (BaseFS), a reference to the feature structure it was copied from (OriginalFS), and an integer unique identifier (Id).
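A sketch of the structures of Figures 4 and 5, assuming Python dictionaries in place of the hash tables and integers in place of the bit maps; the field names are illustrative, not those of the original implementation.

```python
from dataclasses import dataclass, field
from itertools import count

_next_id = count(1)   # identifier 0 is reserved for load-time structures

@dataclass
class DNode:
    names: dict = field(default_factory=dict)     # fs id -> atom name
    arcs: list = field(default_factory=list)      # the shared, ever-growing arc set
    bitmaps: dict = field(default_factory=dict)   # fs id -> bit map over `arcs`

@dataclass
class DArc:
    name: int                                     # the feature name, as an integer
    values: dict = field(default_factory=dict)    # fs id -> DNode (the value index)

@dataclass
class FeatureStructure:                           # the wrapper of Figure 5
    base: DNode                                   # reference to the base feature structure
    original: object = None                       # originalFS: the structure copied from
    id: int = 0

def subsumes_arcs(b0: int, b1: int) -> bool:
    # footnote 2: b0's visible arc set is a subset of b1's iff b0 AND b1 == b0
    return b0 & b1 == b0
```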

5.3.1 Copying

Copying a feature structure f_0 to a feature structure f_1 involves the following steps (see the sketch below).

1. Construct a new instance of a feature structure: this is the wrapper data structure, not the base feature structure. This action generates a new unique identifier for the new feature structure. So the original feature structure, f_0, has identifier 0; the new feature structure, f_1, has identifier 1.

2. Copy the reference, i.e. f_1's base feature structure is the same as the base feature structure of f_0.

3. Assign the originalFS field a reference to the feature structure being copied, so f_1's originalFS is a pointer to f_0. This forms a history of the generation of the feature structures.

Note that the base feature structure has not been accessed in any way, so the expense of traversing the structure for copying has been removed, as has the space expense of allocating more memory for the feature structure.
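The three steps amount to a constant-time operation. As a sketch, assuming the illustrative classes from the previous listing:

```python
def copy_fs(f0: FeatureStructure) -> FeatureStructure:
    """Copy a feature structure without visiting the base graph at all."""
    return FeatureStructure(
        base=f0.base,            # step 2: share the base feature structure
        original=f0,             # step 3: record the derivation history
        id=next(_next_id),       # step 1: a fresh globally unique identifier
    )
```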

5.3.2 Traversal

Traversing the graph of a feature structure is the basic process used by subsumption and unification. This process makes use of sub-processes which look up values in the hash tables.

- bitRepLookUp: when at a node in the graph, the process has to know which bit map to use in order to view the arc set according to the correct set of features. First, the table of bit maps is indexed on the id of the feature structure. If this returns a null value then the process recurses on the id of the originalFS until a bit map is found.

- nameLookUp: a node may have a number of different names if multiple atom values have been assigned to it. Consequently, another look-up facility is required to index on the name.

- valueLookUp: when traversing an arc, the value used is computed in exactly the same manner as the bit maps in bitRepLookUp.

Traversal is then a matter of using the chaining mechanisms described (sketched below) to view the correct names and bit maps for nodes, and the correct values for arcs. For example, if f_0 is a feature structure which was loaded from the grammar and a copy, f_1, is made of it during parsing, finding the value of a particular feature a_x in f_1 might happen as follows.

1. The value hash table of a_x is indexed with the unique identifier of f_1: 1.
2. This returns a null value, as the only values present are those loaded at run time.
3. The originalFS pointer of f_1 is followed and f_0 is found.
4. The value hash table of a_x is indexed with the unique identifier of f_0: 0.
5. This returns the load-time value.
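All three look-ups share the same chaining walk, which might be sketched as follows, again assuming the illustrative classes above:

```python
def chained_lookup(table: dict, fs: FeatureStructure):
    """Look a value up under fs's identifier, chaining back through originalFS."""
    while fs is not None:
        if fs.id in table:
            return table[fs.id]
        fs = fs.original         # recurse on the structure this one was copied from
    return None                  # null: no value anywhere along the chain

def bit_rep_lookup(node: DNode, fs: FeatureStructure):
    return chained_lookup(node.bitmaps, fs)

def name_lookup(node: DNode, fs: FeatureStructure):
    return chained_lookup(node.names, fs)

def value_lookup(arc: DArc, fs: FeatureStructure):
    return chained_lookup(arc.values, fs)
```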

5.3.3 Unification

Unification is an operation that requires the traversal of two feature structures. The failure case has already been discussed in the section on relaxed unification; here we discuss the case where information is added to a structure. The graphs being unified will be referred to as the a-graph and the b-graph; the a-graph is the one which will be modified. When information is added to a base feature structure, one of the following two cases arises.

1. An atomic value is assigned to a feature.

2. A base feature structure is assigned to a feature, i.e. a sub-graph of a base feature structure.

It is the second case which will be described fully here. Assigning a base feature structure to a feature is the operation of assigning a value to an arc. The value to be assigned is a reference to the node in the b-graph. However, it is not possible to simply make the reference, for the following important reason: the feature structure which is being used for the interpretation of the b-graph is not contained in the chain of feature structures being used for the a-graph. Consequently, it is not possible to interpret the added material with the correct index, or chain of indices. There are two solutions to this problem.

1. Traverse the added material, adding the appropriate indices to the arcs and nodes.

2. Incorporate a more complex indexing system into the chaining mechanism, allowing the b-graph's current interpretation feature structure to be considered in the a-graph's chain.

The first solution results in a trivial implementation which traverses the base feature structure being added and adds the appropriate index to the hash tables. The second solution simply adds the b-graph's current index into the chaining process. This requires the addition of optional information to the feature structure. This information records:

- a pointer to the feature structure used to interpret the b-graph.

This information can then be used by the chaining process to avoid the updating of nodes required by the first solution. The possible extra expense of searching more than one branch of the evolution of the feature structure is non-persistent if reduced chaining is used (see later). The first solution will be referred to as simple chaining and the second as complex chaining.

The addition of atomic values to the graph results in the modification of the name and bit map tables of the node: the name table has the name of the atom added, and the bit map table has an empty bit map added.

Figure 6 illustrates the unification operation, and a sketch follows below. There are two unique base feature structures, S1 and S2. S1 is referred to by the active edge between V_i and V_j; S2 is referred to by the inactive edge between V_j and V_k. Unifying the X2 (daughter) node of the active edge with the X0 (mother) node of the inactive edge requires the addition of Feature1 with the value Node1 to the structure S1. Consequently, the X2 node of S1 has a new feature added to the bit map indexed by the unique identifier 1, and the arc Feature1 has a new entry added to its value table, with the unique identifier 1 mapping to Node1.
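The destructive step of this example can be sketched as follows, reusing the illustrative classes and chained look-up above. The one-bit-per-arc encoding of the bit maps is an assumption; the paper does not specify how bit maps index the arc set.

```python
def add_arc_value(node: DNode, arc: DArc, value: DNode, fs: FeatureStructure):
    """Record `arc: value` at `node`, visible only under fs's identifier."""
    if arc not in node.arcs:
        node.arcs.append(arc)                 # the shared arc set only ever grows
    bit = 1 << node.arcs.index(arc)           # assumed encoding: one bit per arc
    visible = bit_rep_lookup(node, fs) or 0   # inherit the view along the chain
    node.bitmaps[fs.id] = visible | bit       # fs's view now includes the new arc
    arc.values[fs.id] = value                 # the value, indexed under fs's id
```

Under complex chaining, the wrapper would additionally record the pointer to the feature structure used to interpret the b-graph, so that subsequent look-ups can continue into the b-graph's own chain instead of re-indexing the added sub-graph.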

5.3.4 Reducing Chaining

Chaining, the main point of comparison between this work and [Per85], can be reduced greatly if the following procedure is introduced into the three look-up mechanisms.

- If the identifier under which a value is found is not the identifier of the feature structure itself, then add the value to the table under the feature structure's identifier (see the sketch below).
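As code, this is a caching variant of the chained look-up sketched earlier; the function name is illustrative.

```python
def chained_lookup_reduced(table: dict, fs: FeatureStructure):
    """As chained_lookup, but cache chained results under the querying id."""
    start = fs
    while fs is not None:
        if fs.id in table:
            if fs is not start:
                table[start.id] = table[fs.id]   # future look-ups hit directly
            return table[fs.id]
        fs = fs.original
    return None
```

After the first access, a chain of any length behaves like a chain of length one, which is the lazy compression of the derivation path referred to in the conclusions.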

Figure 6: Unification. Two base feature structures: S1, referred to by the active edge from V_i to V_j (nodes X0, X1, X2), and S2, referred to by the inactive edge from V_j to V_k (nodes X0, X1); the new arc Feature1 points to Node1.

Figure 7: Evaluation of Reduced Chaining. The number of chaining operations (y-axis, thousands) is plotted against the number of edges in a parse (x-axis, 50 to 300). The upper line describes the number of chaining operations carried out without compression; the lower line represents the number carried out with compression, and increases 4 times more slowly than the upper line.

This reduces the chaining operation by incrementally updating the index tables: for table values that have already been accessed, the chaining will at most have to look under the index of the originalFS. There is a slight cost in terms of memory, as bit maps and so on will have to be entered in the look-up tables. The advantages of this reduction can be seen in Figure 7.

5.3.5 Results

A useful measure of the work being carried out by a unification system is the number of nodes visited during parsing. Nodes in feature structure graphs are visited in the subsumption and unification operators as well as in the copying operation. Note that in the disjoint representation, the updating of base feature structures with new index values is akin to copying, and is measured in the same manner for simple chaining. The measure as reported here is also an indication of the space efficiency of the implementation: the operators will visit the same number of nodes regardless of implementation, so the difference is provided by the cost of copying. Figure 8 shows the number of nodes visited against the number of edges in a parse. It clearly demonstrates the value of using the feature structure implementation described in this paper against the naive copying implementation.

Figure 8: Evaluation of the Disjoint Feature Structure Representation. The number of nodes visited (y-axis, thousands) is plotted against the number of edges in a parse (x-axis, 50 to 300). The upper line describes the cost of the copying implementation of a graph feature structure representation; the second line represents the cost of the disjoint representation without complex chaining; the lower line represents the cost of the disjoint representation with complex chaining.

The upper line shows the number of nodes that would be visited if copying did take place; the lower lines show the number of nodes actually visited (for simple and complex chaining). The upper line grows 4 times as fast as the second line and 6 times as fast as the lower line.

6 Conclusions

This paper has described algorithms and data structures used to encode feature structures as graphs in a disjoint manner. This method involves the use of indices over the feature structures and consequently avoids the need to copy structure during parsing, thereby saving space. It has been demonstrated that the implementation is 6 times as efficient as the naive solution in terms of the number of nodes visited during parsing. In comparison with [Per85], the distinct processes of derivation and merging have been collapsed into the data structure and are operated on by a single process: chaining. It has been shown that this process can be reduced through a form of lazy look-up which in effect compresses the path of derivation.

References

[AECMA89] AECMA. Simplified English: A Guide for the Preparation of Aircraft Maintenance Documentation in the International Aerospace Maintenance Language. Association Européenne des Constructeurs de Matériel Aérospatial, Paris, 5th edition, 1989.

[DD92] Shona Douglas and Robert Dale. Towards robust PATR. In Proceedings of the 14th COLING Conference, 1992.

[Ear86] Jay Earley. An efficient context-free parsing algorithm. In Barbara J. Grosz, Karen Sparck Jones, and Bonnie Lynn Webber, editors, Readings in Natural Language Processing. Morgan Kaufmann, 1986.

[Erb95] Gregor Erbach. ProFIT: Prolog with features, inheritance and templates. In Seventh Conference of the European Chapter of the Association for Computational Linguistics, 1995.

[FHS94] Norbert E. Fuchs, Hubert F. Hofmann, and Rolf Schwitter. Specifying logic programs in controlled natural language. Technical Report 94.17, Department of Computer Science, University of Zurich, November 1994.

[GM89] Gerald Gazdar and Chris Mellish. Natural Language Processing in PROLOG. Addison-Wesley, 1989.

[Hur95] Matthew Hurst. Parsing for targeted errors in controlled languages. To appear, 1995.

[Kay86] Martin Kay. Algorithm schemata and data structures in syntactic processing. In Barbara J. Grosz, Karen Sparck Jones, and Bonnie Lynn Webber, editors, Readings in Natural Language Processing. Morgan Kaufmann, 1986.

[KK85] Lauri Karttunen and Martin Kay. Structure sharing with binary trees. In 23rd Annual Meeting of the Association for Computational Linguistics. ACL, 1985.

[Per85] Fernando C. N. Pereira. A structure-sharing representation for unification-based grammar formalisms. In 23rd Annual Meeting of the Association for Computational Linguistics. ACL, 1985.

[Shi86] Stuart M. Shieber. An Introduction to Unification-Based Approaches to Grammar. CSLI, 1986.

[War] David H. D. Warren. Logarithmic access arrays for Prolog. Unpublished program.