On the Expressiveness of Information Extraction Patterns

Mark A. Greenwood and Mark Stevenson

Abstract

Many recently reported machine learning approaches to the acquisition of information extraction (IE) patterns have used dependency trees as the basis for their pattern representations (Yangarber et al., 2000a; Yangarber, 2003; Sudo et al., 2003; Stevenson and Greenwood, 2005). While varying results have been reported for the resulting IE systems, little has been reported about the ability of dependency trees, or patterns extracted from them, to represent the relationships needed to perform IE. In this paper we evaluate the ability of a number of pattern representations, derived from dependency trees, to represent the relationships being extracted by an IE system. The paper concludes by suggesting the use of the “linked chains” model, which represents around 94% of the possible relations in text without generating an unwieldy number of candidate extraction patterns.

1 Introduction

A common approach to Information Extraction (IE) is to use patterns which match against text and identify the items of interest (Riloff, 1993; Gaizauskas et al., 1996; Grishman, 1997; Soderland, 1999). These patterns are designed to be applied to text which has undergone various levels of linguistic analysis, including phrase chunking (Soderland, 1999) and full syntactic parsing (Gaizauskas et al., 1996).

There are two requirements for a pattern representation language to be suitable for use within an IE system: (1) it must encode enough detail to successfully identify the information which is to be extracted from the text, and (2) it must be general enough to allow for at least some of the variation in the way information is described in text. A key question in the design of any extraction pattern language is therefore to ensure that patterns contain enough information to meet the first requirement while remaining general enough to meet the second.

Several recent approaches have used extraction patterns based on dependency tree parses of the input text (Yangarber et al., 2000b; Yangarber et al., 2000a; Yangarber, 2003; Sudo et al., 2001; Sudo et al., 2003; Stevenson and Greenwood, 2005). Each of these approaches learns extraction patterns from text which has been analysed using a dependency parser, and the resulting patterns are then used for IE. However, a large variety of pattern representations are used, from ones which represent just a verb and its arguments to others which allow arbitrary sections of the tree to be patterns. This paper provides an analysis of the various pattern representation models by comparing their ability to represent the information which we are interested in extracting from various corpora.

Semantic patterns are widely used in Natural Language Processing and their application is far from limited to IE. Lin and Pantel (2001) learned rules which matched dependency parses for question answering, Snow et al. (2004) applied them to the discovery of hypernyms, while Riloff and Wiebe (2003) learned extraction patterns to identify opinion expressions.

The rest of this paper is organised as follows. Dependency trees are described in Section 2. A number of possible dependency-tree-based IE patterns are then described in detail (Sections 3 and 4). Section 5 describes an experiment which aims to compare the representations. The results, and their implications, are presented in Section 6.

2 Dependency Trees

A dependency relationship is a binary relationship between a word (often referred to as the head word) and one of its modifiers. Dependency grammars represent sentences as a set of dependency relationships, producing a dependency tree in which each word can be modified by zero or more other words but can itself modify at most one other word. An example dependency analysis for the sentence "I own a black cat" is shown in tree format and in an abbreviated representation[1] in Figure 1.

A dependency tree is often much simpler than the corresponding syntactic tree and is hence better suited as the basis for a pattern representation which aims to capture the meaning of a sentence. For example, the syntactic tree structure produced by the Stanford Parser[2] (Klein and Manning, 2002; Klein and Manning, 2003) for the same example sentence, "I own a black cat", can be seen in Figure 2.

[1] This formalism for representing dependency patterns is similar to the one introduced by Sudo et al. (2003). Each node in the tree is represented in the format a[b/c] (e.g. subj[N/bomber]) where c is the lexical item (bomber), b its grammatical tag (N) and a the dependency relation between this node and its parent (subj). The relationship between nodes is represented using notation of the form X(A+B+C), which indicates that nodes A, B and C are direct descendants of node X.
[2] http://www-nlp.stanford.edu/software/lex-parser.shtml


verb[V/own](subj[N/I]+obj[N/cat](det[Det/a]+adj[Adj/black]))

Figure 1: Example dependency analysis in tree and alternative abbreviated format.

Figure 2: An example syntax tree for "I own a black cat".

For the purpose of representing textual meaning, dependency structures also show the semantic relationships between words and phrases more clearly than is possible using phrase structures. For instance, the dependency tree shows that black is an adjective which modifies cat, whereas the syntax tree only states that both words are part of the same noun phrase. Similarly, the dependency tree clearly shows the subject and object of the verb own, whereas the syntax tree only shows that own is the verb in a verb phrase. Dependency trees therefore capture more of the meaning of the text than the corresponding syntax trees.
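The abbreviated a[b/c] formalism is straightforward to mirror in code. The following minimal sketch (in Python; the paper itself contains no code, so the class and method names here are our own) builds the tree for "I own a black cat" and reproduces the Figure 1 string:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Node:
    """A dependency tree node in the a[b/c] formalism of Sudo et al. (2003)."""
    rel: str                      # a: dependency relation to the parent, e.g. subj
    tag: str                      # b: grammatical tag, e.g. N
    word: str                     # c: lexical item, e.g. bomber
    children: List["Node"] = field(default_factory=list)

    def render(self) -> str:
        """Serialise the subtree rooted here in the abbreviated format."""
        s = f"{self.rel}[{self.tag}/{self.word}]"
        if self.children:
            s += "(" + "+".join(c.render() for c in self.children) + ")"
        return s

# The dependency analysis of "I own a black cat" from Figure 1.
tree = Node("verb", "V", "own", [
    Node("subj", "N", "I"),
    Node("obj", "N", "cat", [Node("det", "Det", "a"),
                             Node("adj", "Adj", "black")]),
])
print(tree.render())
# verb[V/own](subj[N/I]+obj[N/cat](det[Det/a]+adj[Adj/black]))
```

The sketches accompanying the pattern models in Section 3 reuse this Node class.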

3 Pattern Representations

This section outlines four models for representing extraction patterns which can be derived from dependency trees. Each model has been taken from the Information Extraction literature. The dependency tree in Figure 3 will be used as a running example.

3.1 Predicate-Argument Model (SVO)

The simplest of the four extraction pattern models uses subject-verb-object tuples: a verb together with its subject and object. This approach has been used by Yangarber (2003), Sudo et al. (2003), and Stevenson and Greenwood (2005). An SVO pattern is extracted for each verb in a sentence so, using this model, the example dependency tree in Figure 3 would produce the following two patterns:[3]

verb[V/hire](subj[N/Acme Inc.]+obj[N/Mr Smith])
verb[V/replace](obj[N/Mr Bloggs])

This model may be motivated by the assumption that many IE scenarios involve the extraction of participants in specific events. For example, the MUC-6 (MUC, 1995) management succession scenario concerns the identification of individuals who are changing job, including details such as the organisation and job title. These events are commonly described using a simple predicate-argument structure, for example "Acme Inc. fired Smith". However, the SVO model cannot represent information described using other linguistic constructions such as nominalisations or prepositional phrases. For example, in the texts used for this evaluation it is common for the job title to be mentioned within a prepositional phrase, as in "Smith joined Acme Inc. as CEO" (Stevenson, 2004).

[3] The dependency relation between the root node and its parent (verb) is redundant in these patterns but is included to allow all nodes to be represented in a common format.
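Extraction under this model reduces to walking the tree and keeping only a verb's subject and object children. A minimal sketch, reusing the Node class from Section 2 and assuming verbs are tagged V and arguments labelled subj/obj as in the patterns above:

```python
def svo_patterns(root: Node) -> list:
    """One pattern per verb: the verb plus its subject and object
    children, with any deeper modifiers stripped off."""
    patterns = []
    stack = [root]
    while stack:
        n = stack.pop()
        if n.tag == "V":  # assumes verbs are tagged V, as in the examples
            args = [Node(c.rel, c.tag, c.word)
                    for c in n.children if c.rel in ("subj", "obj")]
            patterns.append(Node(n.rel, n.tag, n.word, args).render())
        stack.extend(n.children)
    return patterns
```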


Figure 3: An example sentence and the corresponding dependency tree: "Acme Inc. hired Mr Smith as their new CEO, replacing Mr Bloggs."

3.2 Chains

To overcome some of the limitations of the SVO representation, Sudo et al. (2001) introduced the chain model as an alternative pattern representation. In this representation a pattern is defined as a path between a verb node and any other node in the dependency tree, passing through zero or more intermediate nodes. For example, the seven chains for the verb hire are:

verb[V/hire](subj[N/Acme Inc.])
verb[V/hire](obj[N/Mr Smith])
verb[V/hire](obj[N/Mr Smith](as[N/CEO]))
verb[V/hire](obj[N/Mr Smith](as[N/CEO](gen[N/their])))
verb[V/hire](obj[N/Mr Smith](as[N/CEO](mod[A/new])))
verb[V/hire](vpsc mod[V/replace])
verb[V/hire](vpsc mod[V/replace](obj[N/Mr Bloggs]))

This representation has the advantage of providing a mechanism for encoding information beyond the direct arguments of predicates, covering areas of the dependency tree ignored by the SVO model. For example, chains can represent information expressed as a prepositional phrase. Information expressed as a nominalisation can also be captured by this model, for example "The resignation of Smith from the board of Acme shocked the markets."

However, a potential shortcoming of this model is that it cannot represent the link between the arguments of a verb. The chain model would encode even a simple sentence containing a transitive verb, for example "Smith left Acme Inc.", as two patterns. This may be a problem for IE since many extraction scenarios involve the identification of the participants of events, and in natural language these are easily expressed as the arguments of a verb. It would seem that representing these relations is important for an extraction pattern representation.
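Chain enumeration is a walk from each verb to every node beneath it. A sketch under the same assumptions as the SVO example (the helper names downward_paths and as_branch are ours; they are reused by the linked chain sketch in the next subsection):

```python
def downward_paths(n: Node) -> list:
    """Every downward path starting at (and including) n; there is exactly
    one such path per node in n's subtree."""
    result = [[n]]
    for c in n.children:
        result.extend([n] + p for p in downward_paths(c))
    return result

def as_branch(path: list) -> Node:
    """Copy a downward path into a fresh single-child chain of nodes."""
    head = Node(path[0].rel, path[0].tag, path[0].word)
    tip = head
    for step in path[1:]:
        tip.children = [Node(step.rel, step.tag, step.word)]
        tip = tip.children[0]
    return head

def chains(root: Node) -> list:
    """One chain per (verb, descendant) pair: the path from the verb down
    to that descendant."""
    out = []
    def visit(v: Node):
        if v.tag == "V":
            for c in v.children:
                for path in downward_paths(c):
                    out.append(Node(v.rel, v.tag, v.word,
                                    [as_branch(path)]).render())
        for c in v.children:
            visit(c)
    visit(root)
    return out
```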

3.3 Linked Chains

A model which aims to overcome the limitations of both the SVO and chain models was introduced by Greenwood et al. (2005). The linked chains model represents extraction patterns as either a chain or a pair of chains which share the same verb but no direct descendants. For example, for the verb hire in Figure 3 the linked chain model would generate 14 patterns in addition to those generated under the chain model. Examples of these patterns include:

verb[V/hire](subj[N/Acme Inc.]+obj[N/Mr Smith])
verb[V/hire](subj[N/Acme Inc.]+obj[N/Mr Smith](as[N/CEO]))
verb[V/hire](obj[N/Mr Smith]+vpsc mod[V/replace](obj[N/Mr Bloggs]))

This pattern representation encodes most of the information in the sentence, much as the chain model does, with the added advantage of being able to link event participants together: we can see that Smith replaces Bloggs, whereas with the chain model we could only represent the facts that Acme Inc. hired Smith and that Acme Inc. replaced Bloggs.

Each linked chain consists of a pair of chains, and it follows that all relations captured by the chains will also be captured by a number of the linked chains. Consequently, the chains could be discarded without affecting the number of relations which can be represented. The chains are retained to allow an IE system using these patterns to extract partial relations and incorporate these into an IE template.
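The paired patterns can be generated by combining one downward path from each of two distinct direct children of a verb, so the two chains share only the verb node. A sketch reusing downward_paths and as_branch from the chain example (the complete pattern set for a tree would be chains(root) plus linked_chains(root)):

```python
from itertools import combinations

def linked_chains(root: Node) -> list:
    """Pairs of chains that meet only at the verb: the two chains pass
    through different direct children, so they share no other node."""
    out = []
    def visit(v: Node):
        if v.tag == "V":
            # one bundle of downward paths per direct child of the verb
            bundles = [downward_paths(c) for c in v.children]
            for b1, b2 in combinations(bundles, 2):
                for p1 in b1:
                    for p2 in b2:
                        out.append(Node(v.rel, v.tag, v.word,
                                        [as_branch(p1), as_branch(p2)]).render())
        for c in v.children:
            visit(c)
    visit(root)
    return out
```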


3.4 Subtrees

The final pattern representation model considered is the subtree model introduced by Sudo et al. (2003). In this model any subtree of a dependency tree can be used as an extraction pattern. The subtree model is a richer representation than those discussed so far and can represent any part of a dependency tree; the patterns generated by each of the previous models form a proper subset of the subtrees. By choosing an appropriate subtree it is possible to link together any pair of nodes in a tree, and consequently this model can represent the relation between any two items in the sentence, provided the parser has produced a dependency tree including them both.

4 Pattern Enumeration and Complexity

We have already discussed the fact that the models differ in terms of the information from dependency trees they can represent. However, they also vary in the number of patterns which will be generated from the dependency analysis of a sentence. This section discusses the variation in the number of patterns each model generates and the implications this has for computational practicality. It begins by listing the number of patterns generated under each model.

Predicate-Argument Model (SVO). Under this model a single pattern will be extracted for each verb in the dependency tree. More formally, assume a dependency tree $T$ consists of a set of nodes $N$ and that $V$, such that $V \subseteq N$, is the set of nodes in the dependency tree which represent verbs. Then the number of patterns extracted from $T$ under this model is given by:

$$N_{svo}(T) = |V| = \sum_{v \in V} 1 \qquad (1)$$

Chain Model. In this representation there is one chain per node below each verb in the dependency tree. If we extend the notation so that $d(v)$ denotes the number of descendants of a node $v$ in $T$, the number of chains is given by:

$$N_{chains}(T) = \sum_{v \in V} d(v) \qquad (2)$$

Linked Chains. This model gives rise to many more patterns. Let $C(v)$ denote the set of direct child nodes of a verb $v$ and $v_i$ denote the $i$-th child, so $C(v) = \{v_1, v_2, \ldots, v_{|C(v)|}\}$. The number of patterns extracted from a dependency tree will be:

$$N_{linked\ chains}(T) = \sum_{v \in V} \left( d(v) + \sum_{i=1}^{|C(v)|} \sum_{j=i+1}^{|C(v)|} (d(v_i) + 1) \times (d(v_j) + 1) \right) \qquad (3)$$

Subtrees. The number of SVO, chain and linked chain patterns for a particular dependency tree can be derived from information about the number of verbs in the tree and their descendants. However, for arbitrary tree structures further information is required in order to determine the number of subtrees. For a tree with $N$ nodes, upper and lower bounds on the number of subtrees can be determined:[4]

$$\sum_{i=2}^{N-1} i \;\leq\; N_{subtree}(T) \;\leq\; 2^{N-1} \qquad (4)$$

Table 1 lists the number of patterns generated by each of the four models for the dependency tree in Figure 3.

Model           | # of Patterns
N_svo           | 2
N_chains        | 7
N_linked chains | 21
N_subtree       | 42

Table 1: Number of patterns produced by each model for the dependency tree in Figure 3.

[4] The upper bound can be derived as follows. A tree with N nodes has N − 1 edges. Each subtree consists of a subset of the edges in the tree, and there are 2^{N−1} such subsets. Of course, not all subsets of the set of edges will form well-formed subtrees, so this gives the maximum possible number for a tree. The lower bound is the number of subtrees of the simplest tree of N nodes: one in which each node has a single parent and only one child. Accurately determining the number of subtrees for a given tree requires full knowledge of its structure (Knuth, 1968).
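As a concrete check of equations (1)-(4), the following sketch (Python; the function and variable names are ours) computes the pattern counts from per-verb descendant counts. The worked example uses the "I own a black cat" tree from Figure 1, whose single verb has four descendants split across two direct children:

```python
def n_svo(verbs):
    """Equation (1): one pattern per verb."""
    return len(verbs)

def n_chains(verbs, d):
    """Equation (2): d(v) chains below each verb v."""
    return sum(d[v] for v in verbs)

def n_linked_chains(verbs, d, children):
    """Equation (3): the chains plus one pattern for every pair of paths
    passing through distinct direct children of the same verb."""
    total = 0
    for v in verbs:
        kids = children[v]
        pairs = sum((d[a] + 1) * (d[b] + 1)
                    for i, a in enumerate(kids) for b in kids[i + 1:])
        total += d[v] + pairs
    return total

def subtree_bounds(n_nodes):
    """Equation (4): lower/upper bounds on the number of subtrees."""
    lower = sum(range(2, n_nodes))   # sum_{i=2}^{N-1} i, chain-shaped tree
    upper = 2 ** (n_nodes - 1)       # one candidate subtree per edge subset
    return lower, upper

# "I own a black cat" (Figure 1): own has four descendants; its direct
# children are I (no descendants) and cat (two descendants).
verbs = ["own"]
d = {"own": 4, "I": 0, "cat": 2}
children = {"own": ["I", "cat"]}
print(n_svo(verbs))                          # 1
print(n_chains(verbs, d))                    # 4
print(n_linked_chains(verbs, d, children))   # 4 + (0+1)*(2+1) = 7
print(subtree_bounds(5))                     # (9, 16)
```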


It is clear that there is a tradeoff between the complexity of pattern representations and the practicality of computation using them. Some pattern representations are more expressive, in terms of the amount of information from the dependency tree they make use of, than others (Section 3) and are therefore more likely to produce accurate extraction patterns. However, the more expressive models introduce extra complexity during computation since a greater number of patterns will be generated. The number of SVO patterns is a linear function of the size of the dependency tree, while the function is polynomial for the chain and linked chain models. The subtree model generates an exponential number of patterns, and it is known that enumerating all subtrees is a #P-complete problem (Goldberg and Jerrum, 2000). This complexity, both in the number of patterns produced and the computational effort required to produce them, limits the learning algorithms that can reasonably be applied to learn useful extraction patterns.

Each of the models presented here has been used in a machine learning approach to automatic pattern acquisition for IE, and it is important that computations over these patterns can be carried out efficiently for these approaches to be feasible. For a pattern model to be suitable for an extraction task it needs to be expressive enough to encode enough information from the dependency parse to accurately identify the items which need to be extracted. However, we also aim for the model to be as computationally tractable as possible. The ideal model will then be one which has enough expressive power but without any redundant expressiveness which would make computation less efficient.

5 Experiments

The aim of this work is to identify how accurately each of the pattern representations can identify events in the sort of extraction scenarios which are commonly used to evaluate IE systems. To do this we make use of existing corpora (detailed in Section 5.1) which have been annotated with events. Patterns from each of the four models are examined to check whether they cover the information which should be extracted. In this context "cover" simply means that the pattern contains the elements of the sentence which should be extracted. For example, the sentence "Smith was recently made chairman of Acme." contains information about the person (Smith), post (chairman) and organisation (Acme). An SVO pattern derived from a dependency parse of this sentence would be something like Smith-made-chairman and so would cover the person and post information but not the organisation.

Subtrees can be used to cover any set of items in a dependency tree. So, given an accurate dependency analysis for a sentence, this model will cover any events mentioned in it. Consequently the coverage of the subtree model can be determined by checking whether the elements of the event are connected in the dependency analysis of the sentence and, for simplicity, we chose to do this rather than enumerating all subtrees.

For practical applications, parsers are required to generate the dependency analysis and these cannot always be relied upon to provide a complete analysis for every sentence. The coverage of each model will be influenced by the ability of the parser to produce a tree which connects the elements of the event which should be extracted from the sentence. To account for this we compute the coverage of each model relative to the number of events which could be represented given a particular dependency parse, rather than the total count of events in the corpus. We refer to this measure as the bounded coverage of a pattern representation. The subtree model covers any event whose entities are included in the dependency tree, so the bounded coverage for the subtree model will always be 100%; for all other models it can be computed by dividing the number of events that model covers by the number covered by the subtree model.
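Both measures are simple ratios. A small sketch (Python; our own function names) with figures taken from the linked chain row of Table 2 below:

```python
def coverage(covered: int, total: int) -> float:
    """Percentage of all annotated relations that a model covers."""
    return 100.0 * covered / total

def bounded_coverage(covered: int, connected: int) -> float:
    """Percentage of the relations whose elements are connected in the
    dependency parse (i.e. those the subtree model covers)."""
    return 100.0 * covered / connected

# Linked chains, all corpora, MINIPAR (Table 2): 2774 of 3911 relations
# covered; 2933 relations connected in the parses.
print(f"{coverage(2774, 3911):.2f}%")          # 70.93%
print(f"{bounded_coverage(2774, 2933):.2f}%")  # 94.58%
```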

5.1 Corpora

Corpora representing two genres of text were used to evaluate the pattern representations detailed in Section 3: one containing newspaper text and another composed of text describing biomedical science.

The first corpus uses the evaluation texts from the Sixth Message Understanding Conference (MUC, 1995), which are taken from the Wall Street Journal. These texts are reliably annotated with templates which include details about the movement of executives between jobs. We make use of a version of the corpus produced by Soderland (1999) in which events described within a single sentence were annotated. These events consist of up to four entities, PersonIn, PersonOut, Post and Organisation, of which each event contains at least two. In this paper we are concerned with modelling binary relations and so we use all six paired relationships in our experiments.

The remaining documents are taken from the biomedical domain and are made up of three individual corpora: the training corpus used in the LLL-05 challenge task (Nédellec, 2005), and a pair of corpora produced by Craven and Kumlien (1999) which were derived from the (now defunct) Yeast Proteome Database (YPD) (Hodges et al., 1999) and the Online Mendelian Inheritance in Man (OMIM) database (Hamosh et al., 2002). All three corpora contain a single binary relation over which we can evaluate the pattern representations. The LLL-05 corpus contains genic interactions between genes and proteins. For example, the sentence "Expression of the sigma(K)-dependent cwlH gene depended on gerE" contains a relation between sigma(K) and cwlH and between gerE and cwlH. The YPD corpus is concerned with the subcellular compartments in which particular yeast proteins localize. An example sentence "Uba2p is located largely in the nucleus" relates Uba2p and the nucleus. The relations in the OMIM corpus are between genes and diseases; for example, "Most sporadic colorectal cancers also have two APC mutations" contains a relation between APC and colorectal cancer.


Figure 4: MINIPAR duplicating a sentence subject phrase: "The bomb caused widespread damage and killed three people."

These corpora contain events of different formats. In the MUC corpus each event may have either two or three elements while the other corpora contain events which always have a pair of elements. For consistency we convert all events in the MUC corpus into a set of relations with two elements. For example, in Soderland's version of the MUC corpus the sentence "Smith was recently made chairman of Acme." would be annotated with the succession event PersonIn(Smith), Post(chairman), Org(Acme), which contains information about the person (Smith), post (chairman) and organisation (Acme). For evaluation we convert this event structure into a set of binary relations: PersonIn-Post, PersonIn-Organisation and Post-Organisation. We do not believe this has any effect on the outcome of the experiments reported here; others, including Chieu and Ng (2002) and McDonald et al. (2005), have shown that n-ary relationships can be recovered from sets of binary relations to generate complex template structures.

We only consider relationships described within a single sentence. This assumption is reasonable since the previous work on the use of dependency tree patterns for Information Extraction has made the same assumption, and the aim of this work is to compare the models that work proposes.

The MUC corpus contains a total of six possible binary relations. Each of the three biomedical corpora contains a single relation type, giving a total of nine binary relations for the experiments. In total the corpora used for our analysis contain 3911 instances of binary relations.

The named entities participating in relations are replaced by a token indicating their class during a preprocessing phase. For example, the four types of name in the MUC6 corpus (PersonIn, PersonOut, Post and Organisation) are replaced by the tokens NAMPERSONIN, NAMPERSONOUT, NAMPOST and NAMEORG. The corpora used have already been annotated with the relevant named entities. Coverage for each pattern model can then be checked in a straightforward way by checking whether both tokens are included in the pattern.
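Both the masking and the coverage check are string-level operations. A minimal sketch (Python; the pattern p below, including its desc relation, is a hypothetical illustration rather than actual parser output):

```python
def mask_entities(sentence: str, entities: dict) -> str:
    """Replace each annotated entity mention with its class token."""
    for mention, token in entities.items():
        sentence = sentence.replace(mention, token)
    return sentence

def pattern_covers(pattern: str, token_a: str, token_b: str) -> bool:
    """A pattern covers a binary relation iff it contains both tokens."""
    return token_a in pattern and token_b in pattern

print(mask_entities("Smith was recently made chairman of Acme.",
                    {"Smith": "NAMPERSONIN", "chairman": "NAMPOST",
                     "Acme": "NAMEORG"}))
# NAMPERSONIN was recently made NAMPOST of NAMEORG.

p = "verb[V/make](obj[N/NAMPERSONIN]+desc[N/NAMPOST])"
print(pattern_covers(p, "NAMPERSONIN", "NAMPOST"))  # True
print(pattern_covers(p, "NAMPOST", "NAMEORG"))      # False
```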

5.2 Dependency Parsers and Pattern Generation

Rather than rely on a single parser, which may affect our results, we make use of three dependency parsers: MINIPAR[5] (Lin, 1999), the Machinese Syntax[6] parser from Connexor Oy (Tapanainen and Järvinen, 1997) and the Stanford Parser[7] (Klein and Manning, 2003). A number of post-processing steps were necessary to (a) make the most of the output from the parsers and (b) produce consistent structures from which the patterns could be extracted.

The output from MINIPAR provides details of implicit dependency relationships which violate the constraint that each word can only modify at most one other word. For example, in the sentence shown in Figure 4, the bomb is the subject of not only the verb cause but also the verb kill. To capture this information, sections of a dependency tree can be duplicated. This means that the tree now contains multiple copies of certain nodes but none of the nodes violates the one-parent constraint. For example, the section of the tree representing the bomb in Figure 4 appears in two places in the dependency tree whereas it is only explicitly mentioned once in the original sentence.

The Stanford parser makes use of an underspecified generic dependency relation which is used when the type of the relation between two nodes cannot be determined. This often results in a single tree being generated where other dependency parsers would generate correct fragments from which patterns could be extracted. The trees produced by the Stanford parser are therefore post-processed to generate tree fragments from those nodes linked by a generic dependency relation. This creates a set of complete tree fragments and brings the output into a format similar to that produced by the other parsers we are using.

The parsers were adapted to deal with the tokens representing name classes. The corpora were parsed with each parser and patterns for each of the models, with the exception of subtree, extracted. It was found that the parsers were often unable to generate a dependency tree which included the whole sentence and would instead generate an analysis consisting of sentence fragments represented as separate tree structures. Some of these fragments did not include a verb, so no patterns could be extracted from them. To allow as many patterns as possible to be generated, patterns were extracted not only from each verb (as detailed in Section 3) but also from the root node of any tree fragment in the analysis produced by a parser.

[5] http://www.cs.ualberta.ca/~lindek/minipar.htm
[6] http://www.connexor.com/software/syntax/
[7] http://www-nlp.stanford.edu/software/lex-parser.shtml
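The fragment-splitting post-processing can be sketched as a single traversal. The sketch below reuses the Node class from Section 2; we assume the underspecified relation carries the label dep, which is an assumption about the parser output rather than something specified above:

```python
def split_on_generic(root: Node, generic: str = "dep") -> list:
    """Detach every child linked by the underspecified relation, returning
    the original root plus one new root per cut edge."""
    fragments = [root]
    def visit(n: Node):
        kept = []
        for c in n.children:
            visit(c)
            if c.rel == generic:
                fragments.append(c)  # c now roots its own fragment
            else:
                kept.append(c)
        n.children = kept
    visit(root)
    return fragments
```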


Relationship (Corpus)         | # of Relations | SVO %C      | %B-C  | Chains %C   | %B-C  | Linked Chains %C | %B-C  | Subtrees %C
Person Out - Company (MUC6)   | 186            | 2.15 (4)    | 2.50  | 41.40 (77)  | 48.13 | 85.48 (159)      | 99.38 | 86.02 (160)
Person In - Company (MUC6)    | 133            | 12.03 (16)  | 15.68 | 17.29 (23)  | 22.55 | 75.94 (101)      | 99.02 | 76.69 (102)
Person Out - Post (MUC6)      | 334            | 0.90 (3)    | 1.04  | 65.27 (218) | 75.43 | 85.93 (287)      | 99.31 | 86.53 (289)
Person In - Post (MUC6)       | 308            | 8.12 (25)   | 10.78 | 23.05 (71)  | 30.60 | 75.32 (232)      | 100   | 75.32 (232)
Person In - Person Out (MUC6) | 103            | 42.72 (44)  | 54.32 | 8.74 (9)    | 11.11 | 78.64 (81)       | 100   | 78.64 (81)
Post - Company (MUC6)         | 258            | 2.71 (7)    | 3.07  | 56.20 (145) | 63.60 | 86.43 (223)      | 97.81 | 88.37 (228)
MUC6 Average                  | 1322           | 7.49 (99)   | 9.07  | 41.07 (543) | 49.73 | 81.92 (1083)     | 99.18 | 82.60 (1092)
Genic Interaction (LLL-05)    | 103            | 0.97 (1)    | 1.08  | 6.80 (7)    | 7.53  | 80.58 (83)       | 89.25 | 90.29 (93)
Protein - Location (YPD)      | 1372           | 1.02 (14)   | 1.43  | 12.46 (171) | 17.52 | 66.40 (911)      | 93.34 | 71.14 (976)
Gene - Disease (OMIM)         | 1114           | 0.81 (9)    | 1.17  | 24.42 (272) | 35.23 | 62.57 (697)      | 90.28 | 69.30 (772)
Biomed Average                | 2589           | 0.93 (24)   | 1.30  | 17.38 (450) | 24.44 | 65.31 (1691)     | 91.85 | 71.11 (1841)
Combined Average              | 3911           | 3.14 (123)  | 4.19  | 25.39 (993) | 33.86 | 70.93 (2774)     | 94.58 | 74.99 (2933)

Table 2: Evaluation results when parsing using MINIPAR.

Relationship (Corpus)         | # of Relations | SVO %C     | %B-C  | Chains %C   | %B-C  | Linked Chains %C | %B-C  | Subtrees %C
Person Out - Company (MUC6)   | 186            | 0.00 (0)   | 0.00  | 40.86 (76)  | 52.41 | 77.96 (145)      | 100   | 77.96 (145)
Person In - Company (MUC6)    | 133            | 2.26 (3)   | 3.33  | 24.81 (33)  | 36.67 | 66.92 (89)       | 98.89 | 67.67 (90)
Person Out - Post (MUC6)      | 334            | 0.30 (1)   | 0.38  | 41.32 (138) | 52.47 | 78.44 (262)      | 99.62 | 78.74 (263)
Person In - Post (MUC6)       | 308            | 0.00 (0)   | 0.00  | 23.05 (71)  | 30.21 | 75.64 (233)      | 99.15 | 76.30 (235)
Person In - Person Out (MUC6) | 103            | 17.48 (18) | 24.32 | 8.74 (9)    | 12.16 | 71.84 (74)       | 100   | 71.84 (74)
Post - Company (MUC6)         | 258            | 2.33 (6)   | 2.86  | 56.20 (145) | 69.05 | 79.84 (206)      | 98.10 | 81.40 (210)
MUC6 Average                  | 1322           | 2.12 (28)  | 2.75  | 35.70 (472) | 46.41 | 76.32 (1009)     | 99.21 | 76.93 (1017)
Genic Interaction (LLL-05)    | 103            | 0.00 (0)   | 0.00  | 5.83 (6)    | 7.14  | 71.84 (74)       | 88.10 | 81.55 (84)
Protein - Location (YPD)      | 1372           | 0.36 (5)   | 0.51  | 8.02 (110)  | 11.11 | 67.20 (922)      | 93.13 | 72.16 (990)
Gene - Disease (OMIM)         | 1114           | 0.00 (0)   | 0.00  | 23.43 (261) | 33.98 | 62.75 (699)      | 91.02 | 68.94 (768)
Biomed Average                | 2589           | 0.19 (5)   | 0.27  | 14.56 (377) | 20.47 | 65.47 (1695)     | 92.02 | 71.15 (1842)
Combined Average              | 3911           | 0.84 (33)  | 1.15  | 21.71 (849) | 29.70 | 69.14 (2704)     | 94.58 | 73.10 (2859)

Table 3: Evaluation results when using the Machinese Syntax parser.

6 Results

Coverage and bounded-coverage results for each pattern representation are given in Tables 2, 3, and 4, which show the results for MINIPAR, Machinese Syntax, and the Stanford parser respectively. Each table uses the same format: the first column lists the relationship, the second the total number of instances of that relation, and the remaining four columns list the results for each of the four pattern models. Results for the subtree model list the coverage and raw count; the bounded coverage for this model is always 100% and is not listed. Results for the other three models are listed in a pair of sub-columns, one of which shows the coverage and raw count while the other shows the bounded coverage. Coverage percentages are computed by dividing the number of relations covered (the bracketed figure in the left sub-column) by the total number of instances of the relation, while bounded coverage is found by dividing the same figure by the bracketed figure in the rightmost column of the table. (Recall that the bounded-coverage figure is computed as the percentage of the relations which could have been identified given a particular parse, and this is equivalent to the relations identified by the subtree model.) Figure 5 summarises the coverage of the different pattern representations using each of the three dependency parsers.

The simplest representation, SVO, does not perform well in this evaluation. The highest bounded-coverage score is 15.1% (MUC6 corpus, Stanford parser) but the combined average over all corpora is less than 6% for any parser. This suggests that the SVO representation is simply not expressive enough for IE. Previous work which used this representation relied on indirect evaluation, document and sentence filtering (Yangarber et al., 2000a; Yangarber, 2003; Stevenson and Greenwood, 2005), and while the SVO representation may be expressive enough to allow a classifier to distinguish between documents or sentences which are relevant to a particular extraction task, it seems too limited to be used for full text extraction.

The SVO representation performs noticeably worse on the biomedical text. Our analysis suggested that this was because the items of interest were commonly described using nominalisations in these texts, which the limited SVO model is unable to represent.


Relationship (Corpus)         | # of Relations | SVO %C      | %B-C  | Chains %C   | %B-C  | Linked Chains %C | %B-C  | Subtrees %C
Person Out - Company (MUC6)   | 186            | 2.69 (5)    | 2.69  | 40.86 (76)  | 40.86 | 91.40 (170)      | 91.40 | 100 (186)
Person In - Company (MUC6)    | 133            | 5.26 (7)    | 5.30  | 18.80 (25)  | 18.94 | 94.74 (126)      | 95.45 | 99.25 (132)
Person Out - Post (MUC6)      | 334            | 14.67 (49)  | 14.71 | 58.68 (196) | 58.89 | 95.51 (319)      | 95.80 | 99.70 (333)
Person In - Post (MUC6)       | 308            | 34.42 (106) | 34.42 | 25.32 (78)  | 25.32 | 97.73 (301)      | 97.73 | 100 (308)
Person In - Person Out (MUC6) | 103            | 23.30 (24)  | 23.53 | 17.48 (18)  | 17.65 | 95.15 (98)       | 96.08 | 99.03 (102)
Post - Company (MUC6)         | 258            | 3.10 (8)    | 3.11  | 58.14 (150) | 58.37 | 92.64 (239)      | 93.00 | 99.61 (257)
MUC6 Average                  | 1322           | 15.05 (199) | 15.10 | 41.07 (543) | 41.20 | 94.78 (1253)     | 95.07 | 99.70 (1318)
Genic Interaction (LLL-05)    | 103            | 0.00 (0)    | 0.00  | 4.85 (5)    | 4.85  | 89.32 (92)       | 89.32 | 100 (103)
Protein - Location (YPD)      | 1372           | 0.80 (11)   | 0.81  | 12.03 (165) | 12.18 | 92.57 (1270)     | 93.73 | 98.76 (1355)
Gene - Disease (OMIM)         | 1114           | 0.09 (1)    | 0.10  | 23.16 (258) | 25.72 | 83.48 (930)      | 92.72 | 90.04 (1003)
Biomed Average                | 2589           | 0.46 (12)   | 0.49  | 16.53 (428) | 17.39 | 88.52 (2292)     | 93.13 | 95.06 (2461)
Combined Average              | 3911           | 5.40 (211)  | 5.58  | 24.83 (971) | 25.69 | 90.64 (3545)     | 93.81 | 96.62 (3779)

Table 4: Evaluation results when using the Stanford parser.

Figure 5: Coverage of various pattern representation models for each of the three parsers. Coverage is across all relations in the test corpora.

The more complex chain model covers a greater percentage of the relations. However, its bounded coverage is still less than half of the relations in either the MUC6 corpus or the biomedical texts. This model does well for some relations; 74.43% of the PersonOut-Post relations in the MUC6 corpus are covered. But the performance is not consistent across the various relations in the evaluation and the results are often limited. Using the chain model the best coverage which can be achieved over all corpora is 33.86% (MINIPAR), which is unlikely to be sufficient to create an IE system.

Results for the linked chain representation are much more promising. This representation covers around 70% of all relations using the MINIPAR and Machinese Syntax parsers and 90.64% using the Stanford parser. Moreover, for all three parsers this model achieves a bounded coverage of close to 95%, indicating that it can represent the majority of the relations which are included in the dependency tree.

The subtree representation covers slightly more of the relations than linked chains: around 75% using the MINIPAR or Machinese Syntax parsers and 96.62% using the Stanford parser. These are the best results that can be obtained using these parsers. However, the disadvantage of this model is that the number of patterns generated is exponential in the size of the tree, while this number is merely polynomial for the linked chain model.

Analysis of the relations which were covered by the subtree model but not by linked chains suggested that there are certain constructions which cause difficulties. One such construction is the appositive. For example, in the MUC6 corpus the relation between PERSON and ORGANISATION in the sentence fragment "ORGANISATION's POST, PERSON, resigned yesterday morning" is such a construction. Some nominalisations can also cause problems for the linked chains representation. In the biomedical texts the relation between the AGENT and TARGET in the nominalisation "the AGENT-dependent assembly of TARGET" cannot be represented by a linked chain. In both cases the problem is caused by the fact that the dependency tree includes the two named entities in a part of the tree which is dominated by a node marked as a noun. Since each linked chain must be anchored at a verb (or the root of a tree fragment) and the two chains cannot share part of their path, these relations are not covered. It would be possible to create another representation which allowed these relations to be captured, but it would generate more patterns than the linked chain model.

These results suggest that the linked chain model is a promising representation for extraction patterns. It has the advantage of generating far fewer possible extraction patterns from a dependency tree than the subtree model, with only a small loss of representational power. In addition, the linked chain model can represent far more of the relations in a variety of extraction corpora than the SVO or chain models.

The results of this analysis also provide insight into the more general question of how suitable dependency trees are as a basis for extraction patterns. By basing their analysis on semantic relations between elements of the sentence, dependency trees have the advantage of abstracting away from the surface realisation, which can be useful for extraction. However, the approach relies on the availability of accurate dependency parsers. The experiments reported here demonstrate a wide difference between the abilities of the three parsers to generate dependency trees which connect together the two parts of the relations in a sentence: MINIPAR and Machinese Syntax both cover less than 75% of the relations while the Stanford parser achieves 96.62%. One reason for the wide difference in performance may be that the Stanford parser is unique in allowing the use of an underspecified dependency link when the link type cannot be determined (Section 5.2). This allows the parser to generate analyses which span more of the sentence than the other two and therefore cover a greater proportion of the relations. The use of underspecified dependency relations may not be useful for all applications but is unlikely to cause problems for systems which learn IE patterns, provided the trees generated by the parser are consistent. The differences between the results produced by the three parsers suggest that it is important to fully evaluate a parser's suitability for a particular purpose.

7 Previous Work

A number of IE systems have used extraction patterns based on dependency trees (Yangarber et al., 2000b; Yangarber et al., 2000a; Sudo et al., 2001; Yangarber, 2003; Stevenson and Greenwood, 2005) but less work has been carried out comparing the expressiveness of the various pattern models used. Sudo et al. (2003) compared the SVO, chain and subtree models as part of a system designed to extract the named entities involved in events. However, they were not concerned with identifying the connections between entities and so no attempt was made to quantify how well each model represented relations. Muslea (1999) reviewed the extraction patterns used by several Information Extraction systems, although none of them are based on dependency tree representations. In general these systems used lexico-syntactic patterns which relied on a limited amount of syntactic analysis and a small set of concept classes. These allow for alternative ways of expressing an event and some generalisation over the participating entities.

8 Discussion

It should be noted that the evaluations contained within this paper do not constitute a direct evaluation of parsing performance. A parser which produces a number of correct tree fragments is likely to be more linguistically useful than a parser which produces a single badly formed dependency tree spanning the whole sentence. However, for our experiments the single tree is likely to produce better results, as the chance of both entities in a relation being contained in the tree is greater. For the purposes of this evaluation, as long as a parser is consistent in the way it treats a language construct it does not matter if the treatment is incorrect.

The problem discussed in this paper, namely identifying the least expressive pattern representation model required to represent the relations in a corpus, has parallels with the question of the grammatical formalism required to represent the syntax of natural language. There are advantages to using the simplest possible type of grammar to model language, since more efficient parsing procedures can be applied. The vast majority of structures in natural language can be satisfactorily parsed using context-free parsing techniques (Gazdar and Mellish, 1989). However, Pullum and Gazdar (1982) demonstrated that there is a small class of natural language constructions, such as certain verb phrases in a Swiss German dialect and compound nouns in the Malian language Bambara, which cannot be modelled using context-free languages. Consequently grammars from the next level of the Chomsky hierarchy, context-sensitive languages, are required for a comprehensive treatment of natural language phenomena.

9 Conclusions

This paper presents an experiment comparing the usefulness of four extraction pattern models: SVO, chain, linked chain and subtree. Texts containing a variety of extraction tasks drawn from the management succession and biomedical domains were used. It was found that the linked chains model can represent around 95% of the total possible relations contained in the text, given a dependency parse. While subtrees would be able to represent all the relations contained within the dependency trees, their use is limited by the time complexity involved in enumerating all possible subtrees, and the large number of resulting patterns restricts the learning algorithms which can be applied. It was also found that information to be extracted from text can be described using different syntactic constructions depending upon the particular extraction task. For example, biomedical text often describes chemical interactions using nominalisations. This affects the success of extraction patterns when applied to that task.

Acknowledgements

The authors would like to acknowledge the support of the UK EPSRC under grant GR/T06391 and Connexor Oy for providing the Machinese Syntax parser for use in the experiments detailed here.

Bibliography

H. Chieu and H. Ng. 2002. A Maximum Entropy Approach to Information Extraction from Semi-structured and Free Text. In Proceedings of the Eighteenth National Conference on Artificial Intelligence (AAAI-02), pages 768–791, Edmonton, Canada.

Mark Craven and Johan Kumlien. 1999. Constructing Biological Knowledge Bases by Extracting Information from Text Sources. In Proceedings of the Seventh International Conference on Intelligent Systems for Molecular Biology, pages 77–86, Heidelberg, Germany. AAAI Press.

R. Gaizauskas, T. Wakao, K. Humphreys, H. Cunningham, and Y. Wilks. 1996. Description of the LaSIE system as used for MUC-6. In Proceedings of the Sixth Message Understanding Conference (MUC-6), pages 207–220, San Francisco, CA.

G. Gazdar and C. Mellish. 1989. Natural Language Processing in Prolog. Addison-Wesley.

Leslie Ann Goldberg and Mark Jerrum. 2000. Counting Unlabelled Subtrees of a Tree is #P-Complete. London Mathematical Society Journal of Computation and Mathematics, 3:117–124.

Mark A. Greenwood, Mark Stevenson, Yikun Guo, Henk Harkema, and Angus Roberts. 2005. Automatically Acquiring a Linguistically Motivated Genic Interaction Extraction System. In Proceedings of the 4th Learning Language in Logic Workshop (LLL05), Bonn, Germany, August.

Ralph Grishman. 1997. Information Extraction: Techniques and Challenges. In Maria Teresa Pazienza, editor, Information Extraction: a multidisciplinary approach to an emerging technology, Lecture Notes in Artificial Intelligence. Springer.

Ada Hamosh, Alan F. Scott, Joanna Amberger, Carol Bocchini, David Valle, and Victor A. McKusick. 2002. Online Mendelian Inheritance in Man (OMIM), a knowledgebase of human genes and genetic disorders. Nucleic Acids Research, 30(1):52–55.

Peter E. Hodges, Andrew H. Z. McKee, Brian P. Davis, William E. Payne, and James I. Garrels. 1999. The Yeast Proteome Database (YPD): a model for the organization and presentation of genome-wide functional data. Nucleic Acids Research, 27(1):69–73.

Dan Klein and Christopher D. Manning. 2002. Fast Exact Inference with a Factored Model for Natural Language Parsing. In Advances in Neural Information Processing Systems 15 (NIPS 2002), Vancouver, Canada.

Dan Klein and Christopher D. Manning. 2003. Accurate Unlexicalized Parsing. In Proceedings of the 41st Meeting of the Association for Computational Linguistics, pages 423–430, Sapporo, Japan.

D. Knuth. 1968. The Art of Computer Programming: Fundamental Algorithms. Addison-Wesley.

D. Lin and P. Pantel. 2001. Discovery of Inference Rules for Question Answering. Natural Language Engineering, 7(4):343–360.

Dekang Lin. 1999. MINIPAR: A Minimalist Parser. In Maryland Linguistics Colloquium, University of Maryland, College Park.

Ryan McDonald, Fernando Pereira, Seth Kulick, Scott Winters, Yang Jin, and Pete White. 2005. Simple Algorithms for Complex Relation Extraction with Applications to Biomedical IE. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics, pages 491–498, Ann Arbor, MI.

MUC. 1995. Proceedings of the Sixth Message Understanding Conference (MUC-6), San Mateo, CA. Morgan Kaufmann.

I. Muslea. 1999. Extraction Patterns for Information Extraction: A Survey. In Proceedings of the AAAI-99 Workshop on Machine Learning for Information Extraction, Orlando, FL.

Claire Nédellec. 2005. Learning Language in Logic - Genic Interaction Extraction Challenge. In Proceedings of the 4th Learning Language in Logic Workshop (LLL05), Bonn, Germany, August.

G. Pullum and G. Gazdar. 1982. Natural Languages and Context-Free Languages. Linguistics and Philosophy, 4:471–504.

E. Riloff and J. Wiebe. 2003. Learning Extraction Patterns for Subjective Expressions. In Proceedings of the 2003 Conference on Empirical Methods in Natural Language Processing (EMNLP-03), Sapporo, Japan.

E. Riloff. 1993. Automatically Constructing a Dictionary for Information Extraction Tasks. In Proceedings of the Eleventh National Conference on Artificial Intelligence, pages 811–816.

R. Snow, D. Jurafsky, and A. Ng. 2004. Learning Syntactic Patterns for Automatic Hypernym Discovery. In Advances in Neural Information Processing Systems 17 (NIPS 2004).

Stephen Soderland. 1999. Learning Information Extraction Rules for Semi-structured and Free Text. Machine Learning, 31(1-3):233–272.

Mark Stevenson and Mark A. Greenwood. 2005. A Semantic Approach to IE Pattern Induction. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics, pages 379–386, Ann Arbor, MI.

Mark Stevenson. 2004. Information Extraction from Single and Multiple Sentences. In Proceedings of the Twentieth International Conference on Computational Linguistics (COLING-04), pages 875–881, Geneva, Switzerland.

Kiyoshi Sudo, Satoshi Sekine, and Ralph Grishman. 2001. Automatic Pattern Acquisition for Japanese Information Extraction. In Proceedings of the Human Language Technology Conference (HLT 2001).

Kiyoshi Sudo, Satoshi Sekine, and Ralph Grishman. 2003. An Improved Extraction Pattern Representation Model for Automatic IE Pattern Acquisition. In Proceedings of the 41st Annual Meeting of the Association for Computational Linguistics (ACL-03), pages 224–231, Sapporo, Japan.

Pasi Tapanainen and Timo Järvinen. 1997. A Non-Projective Dependency Parser. In Proceedings of the 5th Conference on Applied Natural Language Processing, pages 64–74, Washington, DC.

Roman Yangarber, Ralph Grishman, Pasi Tapanainen, and Silja Huttunen. 2000a. Automatic Acquisition of Domain Knowledge for Information Extraction. In Proceedings of the 18th International Conference on Computational Linguistics (COLING 2000), pages 940–946, Saarbrücken, Germany.

Roman Yangarber, Ralph Grishman, Pasi Tapanainen, and Silja Huttunen. 2000b. Unsupervised Discovery of Scenario-Level Patterns for Information Extraction. In Proceedings of the Applied Natural Language Processing Conference (ANLP 2000), pages 282–289, Seattle, WA.

Roman Yangarber. 2003. Counter-Training in the Discovery of Semantic Patterns. In Proceedings of the 41st Annual Meeting of the Association for Computational Linguistics (ACL-03), pages 343–350, Sapporo, Japan.

