Department of Computer Science Tokyo Institute of Technology

ISSN 0918-2802 Technical Report


Integrated probabilistic language modeling for statistical parsing

INUI Kentaro, SHIRAI Kiyoaki, TANAKA Hozumi, and TOKUNAGA Takenobu

TR97-0005 May

Department of Computer Science, Tokyo Institute of Technology, Ookayama 2-12-1, Meguro, Tokyo 152, Japan

http://www.cs.titech.ac.jp/

© The author(s) of this report reserve all the rights.

Abstract

This paper proposes a new framework of probabilistic language modeling that satisfies the two basic requirements of: (a) integration of part-of-speech n-gram statistics, structural preference and lexical sensitivity, and (b) maintenance of their modularity. Our framework consists of the syntactic model, for part-of-speech bigram statistics and structural preference, and the lexical model, for lexical sensitivity. As to the syntactic model, we argue that LR-based language modeling, which was originally proposed by Briscoe and Carroll and significantly refined by Inui et al., has good potential for its construction. The main issue we discuss in this paper is how to incorporate lexical sensitivity into the syntactic model while maintaining its modularity. We propose the new notions of lexical dependency parameters and lexical contexts, which capture lexical dependency. A lexical dependency parameter measures the degree of the dependency between the derivation of a word and its lexical context. We discuss how to incorporate lexical dependency parameters into the overall probabilistic language model. Further, we demonstrate that one can use a unification-based grammar as a means of collecting the lexical contexts for each lexical derivation.

1 Introduction

The increasing availability of text corpora has been encouraging researchers to explore statistical approaches for various tasks in natural language processing. Statistical parsing is one of these approaches. In statistical parsing, the following three types of statistics have proven empirically to be effective:

1. short-term n-gram statistics of preterminal symbols (part-of-speech n-gram)
2. long-term statistical preference over structures of parse trees (structural preference)
3. long-term dependency between terminal symbols (lexical sensitivity)

Part-of-speech n-gram models have typically been used for morphological analysis, including word segmentation (for agglutinative languages such as Japanese, Korean, etc.) and part-of-speech tagging [5, 15]. For structural preference, perhaps one of the most straightforward methodologies is to generalize context-free grammars by associating a probability with each rule, producing probabilistic context-free grammars (PCFGs). PCFGs have been applied to syntactic disambiguation. However, as many researchers have already pointed out, PCFGs are not quite adequate for syntactic disambiguation due to their lack of context-sensitivity (particularly, the lack of lexical sensitivity). This problem has led research to diverge in two directions. The first direction consists of attempts to refine the actual models of syntactic preference, by assigning a probability to each single syntactic derivation according to previous syntactic derivations [10, 13]. The second direction consists of attempts to incorporate lexical sensitivity into models [3, 6, 7, 11, 12, 14, 17, 18].

Following these trends, in this paper we consider the following two basic requirements. First, since the above three types of statistics represent analyses of linguistic distributions from different points of view, language models for statistical parsing are ideally required to capture all of them. However, it seems to be the case that no existing framework of language modeling successfully incorporates all three types of statistics. For example, Kita's model [10] introduces context-sensitivity by incorporating an n-gram model of the sequence of syntactic derivations; however, it captures neither part-of-speech n-gram statistics nor lexical sensitivity. Magerman's probabilistic chart parsing [13] captures both part-of-speech n-gram statistics and structural preference, but is not lexically sensitive. History-based grammars (HBGs) proposed by Black et al. [3], stochastic lexicalized tree adjoining grammars (SLTAGs) proposed by Resnik [17] and Schabes [18], and Li's scoring function [12] are all lexically sensitive while maintaining structural preference, but fail to capture part-of-speech n-gram statistics.

Second, when one considers integration of these three types of statistics, it is also important to pay attention to their modularity. As with most existing language models, such an integrated model would still only reflect limited aspects of linguistic phenomena, with inherently limited potential for contribution to parsing. There is thus a need for further refinement and/or exploration of the possibilities of combining this type of statistical approach with other approaches. In order to do so, we have to be able to analyze the behavior of the model easily, particularly in terms of each of the above three types of statistics. This requires that the total score of a parse derivation can be decomposed into the factors derived from the different types of statistics. We refer to this requirement as the modularity of statistic types. For example, in HBGs [3] and SLTAGs [17, 18], which are extensions of PCFGs, lexical dependency is incorporated into each production, making it potentially difficult to analyze the contribution of lexical dependency separately from that of structural preference.

This paper presents a new framework of probabilistic language modeling that satisfies both of the above two basic requirements: (a) integration of part-of-speech n-gram statistics, structural preference and lexical sensitivity, and (b) maintenance of their modularity.

2 The Overall Framework

Suppose we have an input string A = {a_1, ..., a_n}, a word sequence W = {w_1, ..., w_m} that generates A, a part-of-speech tag sequence L = {l_1, ..., l_m} that generates W, and a parse derivation R that generates L (see Figure 1). A parse derivation R uniquely specifies a syntactic tree with its leaves being preterminal symbols (i.e. part-of-speech tags). Since P(A|W) = 1 and P(L|R) = 1, the joint distribution P(R, L, W|A) can be simplified to α · P(R, W) for a constant α that depends on A. Thus our problem is one of ranking R and W according to their joint distribution P(R, W).

Figure 1: A parse tree (for "she ate spaghetti with chopsticks", with the layers R, L, W, and A marked)

Since the joint distribution P(R, W) consists of a prohibitively large number of parameters, it cannot be trained directly. We need to estimate it from trainable peripheral distributions. As with most statistical frameworks, we decompose this joint distribution into a product of peripheral distributions. We first decompose our language model P(R, W) into two submodels:

P(R, W) = P(R) \cdot P(W \mid R)    (1)

We refer to these two submodels P(R) and P(W|R) as the syntactic model and the lexical model, respectively. The syntactic model is expected to cover both part-of-speech n-gram statistics and structural preference. We take LR-based language modeling [4, 9] as a good candidate for the syntactic model. The lexical model, on the other hand, is expected to reflect lexical sensitivity. The main issue we discuss in this paper is how to incorporate lexical sensitivity into the syntactic model while maintaining its modularity.

In what follows, we first briefly discuss the syntactic model, arguing that LR-based modeling potentially integrates part-of-speech bigram statistics and structural preference (Section 3). We then move to the central issue of this paper: how to incorporate lexical sensitivity into the syntactic model (Section 4). In Section 5, we discuss the features of our framework, comparing it with other existing models. Finally, we conclude this paper in Section 6.
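To make the role of the decomposition concrete, the following minimal sketch (not the authors' implementation; the candidate derivation names, probability values, and the two scoring functions are placeholder assumptions) ranks candidate (R, W) analyses by the product of a syntactic-model score and a lexical-model score, mirroring equation (1).

# Minimal sketch: ranking candidate analyses by P(R) * P(W|R), as in equation (1).
# The two model functions below are toy stand-ins, not trained models.

def syntactic_model_prob(parse_derivation):
    """P(R): probability of the parse derivation (here a hypothetical lookup table)."""
    toy_p_r = {"R1": 0.02, "R2": 0.01}
    return toy_p_r.get(parse_derivation, 1e-9)

def lexical_model_prob(words, parse_derivation):
    """P(W|R): probability of the word sequence given the derivation (toy lookup table)."""
    toy_p_w_given_r = {("she ate spaghetti with chopsticks", "R1"): 0.30,
                       ("she ate spaghetti with chopsticks", "R2"): 0.05}
    return toy_p_w_given_r.get((" ".join(words), parse_derivation), 1e-9)

def rank_analyses(candidates):
    """Rank (R, W) pairs by the joint score P(R) * P(W|R)."""
    scored = [(syntactic_model_prob(r) * lexical_model_prob(w, r), r, w)
              for r, w in candidates]
    return sorted(scored, key=lambda t: t[0], reverse=True)

words = "she ate spaghetti with chopsticks".split()
print(rank_analyses([("R1", words), ("R2", words)]))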

3 The Syntactic Model

The idea of LR-based probabilistic language modeling was originally proposed by Briscoe and Carroll [4]. It was recently formalized by Inui et al., resolving several problems of Briscoe and Carroll's modeling [9]. An LR-based language model estimates the probability of a parse derivation as the probability of the LR parsing action sequence that produces that derivation. In order to estimate the probability of each LR parsing action sequence, this model distributes probabilities over the LR parsing actions associated with each LR parse state, so that the probabilities of the transitions out of any given LR parse state sum to one. The probability of each LR parsing action sequence is then computed as the product of the probabilities assigned to the actions included in the sequence. The advantages of such an approach can be summarized in the following three respects.

First, the model is mildly context-sensitive, since it assigns a probability to each LR parsing action according to its left context (i.e. the LR parse state) and right context (i.e. the next input symbol). Second, since the probability of each parse derivation can be estimated simply as the product of the probabilities associated with all the actions in that derivation, one can easily implement a probabilistic LR parser through a simple extension to the original LR parser. One can also easily train the model, as we need only count the frequency of each action applied to generate correct parse derivations in the training corpus. Third, PCFG-based models give long-term preference over structures but do not sufficiently reflect short-term bigram statistics of terminal symbols, whereas LR-based models reflect both of these types of statistics. The overall probability of each parse derivation includes the probabilities P(l_i | s_{i-1}), which constitute a distribution predicting the next input symbol (i.e. part-of-speech tag) l_i for the current LR parse state s_{i-1}. Since s_{i-1} uniquely specifies the previous part-of-speech tag l_{i-1}, P(l_i | s_{i-1}) can be seen as a slightly more context-sensitive version of the part-of-speech bigram P(l_i | l_{i-1}). Thus, an LR-based model is expected to cover both part-of-speech n-gram statistics and structural preference. Further, our preliminary experiments suggest that our formulation of an LR-based model significantly improves the disambiguation performance of the original LR-based modeling proposed by Briscoe and Carroll [19].
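As a concrete illustration of this scheme, the sketch below (with hypothetical state and action names; this is not the formulation of [9]) estimates P(action | LR state) by relative frequency, normalizing action counts within each state so that the transitions out of every state sum to one, and then scores an action sequence as the product of its action probabilities.

from collections import defaultdict
from math import prod

# Hypothetical training observations: (LR state, action) pairs recorded while
# producing the correct parse derivations in a training corpus.
observed = [("s0", "shift_pron"), ("s0", "shift_n"), ("s0", "shift_pron"),
            ("s1", "reduce_np"), ("s1", "shift_vt"), ("s1", "reduce_np")]

# Relative-frequency estimate of P(action | state): counts normalized per state.
counts = defaultdict(lambda: defaultdict(int))
for state, action in observed:
    counts[state][action] += 1
p_action = {s: {a: c / sum(acts.values()) for a, c in acts.items()}
            for s, acts in counts.items()}

def derivation_prob(action_sequence):
    """Probability of an LR action sequence: product of per-state action probabilities."""
    return prod(p_action[s][a] for s, a in action_sequence)

print(p_action["s0"])                                    # distribution over actions in state s0
print(derivation_prob([("s0", "shift_pron"), ("s1", "reduce_np")]))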

4 The Lexical Model

The lexical model is the distribution P(W|R) of lexical derivations l_i → w_i (for i = 1, ..., m) given a syntactic tree specified by R. In estimating this distribution, we consider the dependency between the lexical derivations. We first apply the chain rule to decompose P(W|R) as in (2):

P(W \mid R) = \prod_{i=1}^{m} P(w_i \mid w_1^{i-1}, R)    (2)

In order to reduce the parameter space of the distribution P(w_i | w_1^{i-1}, R), we introduce the notion of lexical contexts, and assign a probability to each lexical derivation only according to the lexical contexts of that derivation.

4.1 An example

Let us first consider the previous example illustrated in Figure 1. Suppose that we derive W in the following order:

ate → spaghetti → with → chopsticks → she

First, we estimate the probability of the derivation vt → ate, assuming that the derivation of the lexical head of a sentence depends only on its part of speech:

P(ate \mid R) \approx P(ate \mid vt)    (3)

Next, we estimate the probability of the derivation n → spaghetti, which functions as a filler of the object slot of the head verb eat. To estimate P(spaghetti | ate, R), we assume that the derivation of the slot-filler spaghetti depends only on its part of speech n, the head verb eat, and the kind of the slot obj:

P(spaghetti \mid ate, R) \approx P(spaghetti \mid n[s(eat, obj)])    (4)

where s(h, s) denotes a slot s of a head word h, and P(w_i \mid l_i[s(h, s)]) is the probability of a lexical derivation l_i → w_i given that w_i functions as a filler of a slot s(h, s).

We then estimate the probability of the derivation p → with, which functions as a slot-marker of the head verb eat, which subordinates another slot obj, as:

P(with \mid ate, spaghetti, R) \approx P(with \mid p[h(ate, [obj, subj])])    (5)

where h(h, [s_1, ..., s_n]) denotes a lexical head h that subordinates the set of slots s_1, ..., s_n, and P(w_i \mid l_i[h(h, [s_1, ..., s_n])]) is the probability of a lexical derivation l_i → w_i given that w_i functions as a slot-marker of the lexical head h(h, [s_1, ..., s_n]).

For the derivations n → chopsticks and n → she, we estimate the probabilities in a similar way as in (4):

P(chopsticks \mid ate, spaghetti, with, R) \approx P(chopsticks \mid n[s(eat, with)])    (6)

P(she \mid ate, spaghetti, with, chopsticks, R) \approx P(she \mid n[s(eat, subj)])    (7)

Finally, combining equations (3) through (7), we produce (8):

P(W \mid R) \approx P(ate \mid vt) \cdot P(spaghetti \mid n[s(eat, obj)]) \cdot P(with \mid p[h(ate, [obj, subj])]) \cdot P(chopsticks \mid n[s(eat, with)]) \cdot P(she \mid n[s(eat, subj)])    (8)
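As a worked illustration of (8), the sketch below simply multiplies the five conditional probabilities; the numeric values are invented for illustration only and are not estimates from any corpus.

from math import prod

# Invented probabilities for the five lexical derivations in equation (8);
# in practice these would be estimated from a training corpus.
factors = {
    "P(ate | vt)":                     0.010,
    "P(spaghetti | n[s(eat,obj)])":    0.020,
    "P(with | p[h(ate,[obj,subj])])":  0.300,
    "P(chopsticks | n[s(eat,with)])":  0.050,
    "P(she | n[s(eat,subj)])":         0.100,
}

p_w_given_r = prod(factors.values())   # P(W|R) as approximated in (8)
print(f"P(W|R) ~= {p_w_given_r:.2e}")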

4.2 Lexical dependency

4.2.1 Lexical contexts

As shown in the above example, we estimate the probability of each lexical derivation l_i → w_i as:

P(w_i \mid w_1^{i-1}, R) \approx P(w_i \mid l_i[c_{i1}, \ldots, c_{i n_i}])    (9)

where c_{ij} is a lexical context of the derivation l_i → w_i. A lexical context c_{ij} is either h(h, S) or s(h, s), for a lexical head h, a set S of slots subordinated by h, and a slot s. The assumptions underlying equation (9) are the following:

Assumption 1. If a word w_i (typically a preposition, conjunction, etc.) functions as a slot-marker of a certain lexical head h, the context that has the strongest effect on the derivation l_i → w_i must include h and the sibling slots of h that have already been derived. By considering sibling slots, we can deal with the dependency between the slots of a lexical head, as well as the dependency between a slot and its head. This case is exemplified in (5).

Assumption 2. If a word w_i (typically a noun, verb, adjective, adverb, etc.) functions as a filler of a slot s of a head word h, the context that has the strongest effect on the derivation l_i → w_i must include h and s, as in (4).

Assumption 3. If a word w_i (typically the head of a sentence) is neither a slot-marker nor a slot-filler, the probability of the derivation l_i → w_i depends only on l_i, as in (3).
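The following sketch restates Assumptions 1-3 as a small selection procedure; the role labels and the tuple encoding of contexts are our own illustrative choices, not part of the model. It returns the lexical context(s) used to condition a lexical derivation, depending on whether the word is a slot-marker, a slot-filler, or neither.

# Illustrative encoding of lexical contexts: ("s", head, slot) stands for s(h, s),
# and ("h", head, tuple_of_slots) stands for h(h, [s1, ..., sn]).

def lexical_contexts(role, head=None, slot=None, derived_sibling_slots=()):
    """Select conditioning contexts for a lexical derivation, per Assumptions 1-3."""
    if role == "slot-marker":    # Assumption 1: head plus already-derived sibling slots
        return [("h", head, tuple(derived_sibling_slots))]
    if role == "slot-filler":    # Assumption 2: head plus the slot being filled
        return [("s", head, slot)]
    return []                    # Assumption 3: condition on the part-of-speech tag alone

print(lexical_contexts("slot-filler", head="eat", slot="obj"))                              # spaghetti
print(lexical_contexts("slot-marker", head="ate", derived_sibling_slots=["obj", "subj"]))   # with
print(lexical_contexts("sentence-head"))                                                    # ate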

4.2.2 Lexical dependency parameters

Note that a lexical derivation may be associated with more than one lexical context (multiple lexical contexts). Multiple lexical contexts appear, for example, in the derivation of a word modified by a relative clause. The word spaghetti in the sentence "she ate the spaghetti I cooked" functions as a filler of the two slots s(eat,obj) and s(cook,obj). Thus, the probability of the lexical derivation n → spaghetti is estimated to be P(spaghetti | n[s(eat,obj), s(cook,obj)]). Multiple lexical contexts also appear in coordinate structures. The word grapes in the sentence "she ate spaghetti and grapes" has the two lexical contexts s(eat,obj) and s(spaghetti,and).

Since distributions including more than one lexical context still have a prohibitively large parameter space, we need to decompose them further into peripheral distributions. Let us first consider the decomposition of the probability of a lexical derivation l_i → w_i with two lexical contexts c_1 and c_2:

P(w_i \mid l_i[c_1, c_2])
  = \frac{P(l_i[c_1, c_2] \mid w_i) \cdot P(w_i)}{P(l_i[c_1, c_2])}
  \approx \frac{P(l_i[c_1] \mid w_i) \cdot P(l_i[c_2] \mid l_i, w_i) \cdot P(w_i)}{P(l_i[c_1]) \cdot P(l_i[c_2] \mid l_i)}
  = \frac{P(w_i \mid l_i[c_1])}{P(w_i \mid l_i)} \cdot \frac{P(w_i \mid l_i[c_2])}{P(w_i \mid l_i)} \cdot P(w_i \mid l_i)
  = P(w_i \mid l_i) \cdot D(w_i \mid l_i[c_1]) \cdot D(w_i \mid l_i[c_2])    (10)

In (10), we assume that the two lexical contexts c_1 and c_2 are mutually independent given l_i (and w_i):

P(l_i[c_2] \mid l_i[c_1]) \approx P(l_i[c_2] \mid l_i)    (11)

P(l_i[c_2] \mid l_i[c_1], w_i) \approx P(l_i[c_2] \mid l_i, w_i)    (12)

D(w_i \mid l_i[c]) is what we call a lexical dependency parameter, which is given by:

D(w_i \mid l_i[c]) = \frac{P(w_i \mid l_i[c])}{P(w_i \mid l_i)}    (13)

D(w_i \mid l_i[c]) measures the degree of the dependency between the lexical derivation l_i → w_i and its lexical context c. It is close to one if w_i and c are highly independent. It becomes greater than one if w_i and c are positively correlated, whereas it becomes less than one and close to zero if w_i and c are negatively correlated. From (10), the following equation can be induced:

P(w_i \mid l_i[c_1, \ldots, c_n]) \approx P(w_i \mid l_i) \cdot \prod_{j=1}^{n} D(w_i \mid l_i[c_j])    (14)

Summarizing equations (2), (9) and (14), the lexical model P(W|R) can be estimated as the product of the context-free distribution of the lexical derivations, P_cf(W|L), and the degree of the dependency between the lexical derivations, D(W|R):

P(W \mid R) \approx P_{cf}(W \mid L) \cdot D(W \mid R)    (15)

P_{cf}(W \mid L) = \prod_{i=1}^{m} P(w_i \mid l_i)    (16)

D(W \mid R) = \prod_{i=1}^{m} \prod_{c \in C_{w_i}} D(w_i \mid l_i[c])    (17)

where C_{w_i} is the set of the lexical contexts of w_i. Further, from equations (1) and (15), the overall distribution P(R, W) can be decomposed as follows:

P(R, W) \approx P(R) \cdot P_{cf}(W \mid L) \cdot D(W \mid R)    (18)

where the first two terms, P(R) and P_cf(W|L), reflect part-of-speech bigram statistics and structural preference, whereas the third term, D(W|R), reflects lexical dependency. Thus, equation (18) suggests that our model integrates these three types of statistics, while maintaining the modularity of lexical dependency. For further reduction of the parameter space of each lexical dependency parameter, recently examined techniques such as probabilistic decision trees [14], maximum entropy models [16], etc. will be applicable.
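A minimal sketch of how lexical dependency parameters could be estimated from co-occurrence counts and combined as in equations (13)-(14) is given below; the count tables are invented for illustration, and the smoothing a real estimator would need is omitted.

from math import prod

# Invented counts: how often word w was derived from tag l overall, and within context c.
count_w_given_l = {("spaghetti", "n"): 40, ("*any*", "n"): 10000}
count_w_given_l_c = {("spaghetti", "n", "s(eat,obj)"): 12,
                     ("*any*", "n", "s(eat,obj)"): 300}

def p_cf(w, l):
    """Context-free lexical probability P(w | l)."""
    return count_w_given_l[(w, l)] / count_w_given_l[("*any*", l)]

def d(w, l, c):
    """Lexical dependency parameter D(w | l[c]) = P(w | l[c]) / P(w | l), as in (13)."""
    p_in_context = count_w_given_l_c[(w, l, c)] / count_w_given_l_c[("*any*", l, c)]
    return p_in_context / p_cf(w, l)

def p_lexical(w, l, contexts):
    """P(w | l[c1,...,cn]) ~= P(w | l) * prod_j D(w | l[cj]), as in (14)."""
    return p_cf(w, l) * prod(d(w, l, c) for c in contexts)

print(d("spaghetti", "n", "s(eat,obj)"))            # > 1: positively correlated with the context
print(p_lexical("spaghetti", "n", ["s(eat,obj)"]))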

4.3 Using unification-based grammars

As described in Section 4.2, the estimation of the probability of each lexical derivation requires the collection of the lexical contexts for that derivation. One straightforward means of collecting lexical contexts is the use of unification-based grammars. In this subsection, we briefly demonstrate how this task can be done by means of definite clause grammars (DCGs). Let us first consider a very simple grammar:

(c0)  s                          --> np(s(H,subj),_), vp(h(H,[subj]),h(H,_)).
(c1)  vp(S0,S2)                  --> vp(S0,S1), pp(S1,S2).
(c2)  vp(h(H,Ss),h(H,[obj|Ss]))  --> vt(H), np(s(H,obj),_).
(c3)  pp(h(H,Ss),h(H,[S|Ss]))    --> p(h(H,Ss),S), np(s(H,P),_).
(c4)  np(F1,S2)                  --> n(F1,H), pp(h(H,[]),S2).
(c5)  np(F,h(H,[]))              --> pron(F,H) | n(F,H).
(c6)  v(F,eat)                   --> [ate].
(c7)  p(S,with)                  --> [with].
(c8)  pron(F,he)                 --> [he].
(c9)  n(F,spaghetti)             --> [spaghetti].
(c10) n(F,chopsticks)            --> [chopsticks].

A term beginning with a capital letter denotes a logical variable. The arguments of each literal are used to pass the lexical contexts. For example, this grammar can parse the previous example sentence "she ate spaghetti with chopsticks" exactly as shown in Figure 1. During this parsing, the lexical contexts are passed as shown in Figure 2, where each curved link denotes a unified logical variable and each arrow denotes the direction of the flow of lexical context information. As a result, we find that the lexical context of spaghetti is s(eat,obj), that of with is h(eat,[obj,subj]), and so on.

One can extend the above grammar to deal with multiple lexical contexts. For example, the preposition with has another usage, as in "she ate spaghetti with cheese." In order to parse this sentence, one may extend the grammar by replacing (c2) and (c3) with (c2') and (c3'), respectively, and adding (c11) and (c12).

Figure 2: The flow of the lexical contexts

(c2')  vp(F1,h(H,[obj]))        --> vt(F1,H), np([s(H,obj)],_).
(c3')  pp(h(H,Ss),h(H,[S|Ss]))  --> p(h(H,Ss),S), np([s(H,P)],_).
(c11)  np(F1,h(H,[P|Ss]))       --> np(F1,h(H,Ss)), co_p(h(H,Ss),S), np([s(H,S)|F1],_).
(c12)  co_p(S,with)             --> [with].

This extended grammar can properly collect the lexical contexts for cheese: s(eat,obj) and s(spaghetti,with).

5 Discussion

Several attempts have been made to incorporate lexical sensitivity into probabilistic language models. Stochastic lexicalized tree adjoining grammars (SLTAGs) proposed by Resnik [17] and Schabes [18] can be seen as an extension of PCFGs. SLTAGs assign a probability to each combination of two structures, each of which is associated with a word. History-based grammars (HBGs) proposed by Black et al. [3] can also be seen as an extension of PCFGs, where productions are distinguished according to not only syntactic categories but also semantic categories, lexical heads, etc. Both models are expected to successfully incorporate lexical dependency into PCFG-based structural preference. However, it is not a trivial question how one could combine such models with part-of-speech n-gram statistics, since PCFG-based models represent the distribution of a part-of-speech sequence L (as well as a syntactic tree R and a word sequence W), while part-of-speech n-gram models also represent the distribution of L.

Recently proposed models such as the model used in Magerman's SPATTER parser [14] and Collins' model based on bigram lexical dependencies [6] are bottom-up models, as it were, which represent the distribution P(R|L, W) of syntactic trees given the tagged sentence represented by L and W. Such models not only integrate structural preference and lexical sensitivity but can also be combined easily with a part-of-speech n-gram model. However, all these models merge lexical dependency into syntactic preference without maintaining modularity, making it potentially difficult to analyze the contribution of lexical dependency separately from that of the other statistic types. The lack of modularity also makes the training of such models costly. Namely, since such models incorporate lexical information into each production, they ideally require fully parsed corpora as training data. (Although one could apply algorithms such as the inside-outside algorithm to train production distributions from plain corpora, the feasibility of such methodologies in terms of the convergence rate has not yet been fully proven.) In contrast, in our model, one can train lexical dependency independently of syntactic preference. Although the syntactic model still requires fully parsed training corpora, its parameter space should be only a small part of the overall parameter space. In order to train our lexical model, on the other hand, one can also use partially parsed corpora, which can be collected at a lower cost, as well as fully parsed corpora. This feature is significant since the lexical dependency distribution may have a much larger parameter space, and thus may require much larger amounts of training data, than the syntactic model.

There are also several attempts to deal with lexical dependency independently of structural preference. The statistical techniques proposed by Hindle and Rooth [8] and Ratnaparkhi et al. [16] are expected to resolve the PP-attachment problem for the particular word sequence "v, n1, p, n2". These techniques use lexical dependency distributions such as P(p|v), P(p|n1), etc. However, since the problem these techniques target is quite specific, it is still questionable whether they are also applicable to more general parsing problems.

Li's statistical model [12] is expected to be applicable to general parsing problems. In this model, the overall score of a parse derivation is the product of the syntactic score, which reflects PCFG-based structural preference, and the lexical score, which reflects lexical dependency. The lexical score is similar to our lexical model, with slight differences. Given a set W' ⊆ W of words functioning as slot-fillers in a sentence W, the lexical score is given by the geometric mean:

\left( \prod_{w \in W'} \prod_{s(h,s) \in C_w} P(w \mid h, s) \right)^{1 / \sum_{w \in W'} |C_w|}    (19)

where C_w is the set of the lexical contexts, in the sense of our model, of w. Thus, if a word w' has multiple lexical contexts, say s(h_1, s_1) and s(h_2, s_2), then the lexical score includes the factor:

P(w' \mid h_1, s_1) \cdot P(w' \mid h_2, s_2)

However, such a formulation makes it difficult to give probabilistically well-founded semantics to the overall score. In contrast, our lexical model would include the factor:

P(w_i \mid l_i) \cdot D(w_i \mid l_i[s(h_1, s_1)]) \cdot D(w_i \mid l_i[s(h_2, s_2)])

which can be probabilistically interpreted. One may have noticed that the logarithm of a lexical dependency parameter D(w_i \mid l_i[c]) corresponds to the mutual information of w_i and c:

\log D(w_i \mid l_i[s(h, s)]) = \log \frac{P(w_i \mid l_i[s(h, s)])}{P(w_i \mid l_i)} = \log \frac{P(w_i, h, s)}{P(w_i) \cdot P(h, s)}

This fact implies that the statistical approach that ranks dependency structures according to the product of the mutual information values of words for each dependency pair [7, 11] considers only the lexical dependency factor D(W|R) in equation (18).

6 Conclusion

We proposed a new framework of probabilistic language modeling that satisfies the two basic requirements of: (a) integration of part-of-speech n-gram statistics, structural preference and lexical sensitivity, and (b) maintenance of their modularity. By introducing the notions of lexical dependency parameters and lexical contexts, we can incorporate lexical sensitivity into the syntactic model while maintaining its modularity. This modularity allows us to apply mildly context-sensitive models, such as an LR-based model, as the syntactic model. Further, we demonstrated that one can use a unification-based grammar as a means of collecting the lexical contexts for each lexical derivation.

The implementation and evaluation of our framework are under way, with Japanese as the tentative target language. As for the syntactic model, we have fully implemented a probabilistic LR parser and conducted several preliminary experiments comparing our LR-based model with the PCFG model and Briscoe and Carroll's model, attaining promising results, some of which are presented in [19]. As for the lexical model, we are currently conducting preliminary experiments. The results will be reported elsewhere.

References

[1] Proceedings of the 14th International Conference on Computational Linguistics, 1992.

[2] Proceedings of the 34th Annual Meeting of the Association for Computational Linguistics, 1996.

[3] E. Black, F. Jelinek, J. Lafferty, D. M. Magerman, R. Mercer, and S. Roukos. Towards history-based grammars: Using richer models for probabilistic parsing. In Proceedings of the 31st Annual Meeting of the Association for Computational Linguistics, pp. 31-37, 1993.

[4] T. Briscoe and J. Carroll. Generalized probabilistic LR parsing of natural language (corpora) with unification-based grammars. Computational Linguistics, Vol. 19, No. 1, 1993.

[5] E. Charniak, C. Hendrickson, N. Jacobson, and M. Perkowitz. Equations for part-of-speech tagging. In Proceedings of the 11th National Conference on Artificial Intelligence, pp. 784-789, 1993.

[6] M. J. Collins. A new statistical parser based on bigram lexical dependencies. In Proceedings of the 34th Annual Meeting of the Association for Computational Linguistics [2].

[7] E. de Paiva Alves. The selection of the most probable dependency structure in Japanese using mutual information. In Proceedings of the 34th Annual Meeting of the Association for Computational Linguistics [2].

[8] D. Hindle and M. Rooth. Structural ambiguity and lexical relations. Computational Linguistics, Vol. 19, No. 1, pp. 103-120, 1993.

[9] K. Inui, V. Sornlertlamvanich, H. Tanaka, and T. Tokunaga. A new probabilistic LR language model for statistical parsing. Technical Report TR97-0004, Dept. of Computer Science, Tokyo Institute of Technology, 1997.

[10] K. Kita. Spoken sentence recognition based on HMM-LR with hybrid language modeling. IEICE Trans. Inf. & Syst., Vol. E77-D, No. 2, 1994.

[11] Y. Kobayashi, T. Tokunaga, and H. Tanaka. Analysis of syntactic structure of Japanese compound noun. In Proceedings of the Natural Language Processing Pacific Rim Symposium '95, pp. 326-331, 1995.

[12] H. Li. A probabilistic disambiguation method based on psycholinguistic principles. In Proceedings of the Fourth Workshop on Very Large Corpora (WVLC-4), 1996.

[13] D. M. Magerman and M. Marcus. Pearl: A probabilistic chart parser. In Proceedings of the 5th Conference of the European Chapter of the Association for Computational Linguistics, pp. 15-20, 1991.

[14] D. M. Magerman. Statistical decision-tree models for parsing. In Proceedings of the 33rd Annual Meeting of the Association for Computational Linguistics, pp. 276-283, 1995.

[15] M. Nagata. A stochastic Japanese morphological analyzer using a forward-DP backward-A* N-best search algorithm. In Proceedings of the 15th International Conference on Computational Linguistics, Vol. 1, pp. 201-207, 1994.

[16] A. Ratnaparkhi, J. Reynar, and S. Roukos. A maximum entropy model for prepositional phrase attachment. In Proceedings of the Human Language Technology Workshop, pp. 250-255, 1994.

[17] P. Resnik. Probabilistic tree-adjoining grammar as a framework for statistical natural language processing. In Proceedings of the 14th International Conference on Computational Linguistics [1], pp. 418-424.

[18] Y. Schabes. Stochastic lexicalized tree-adjoining grammars. In Proceedings of the 14th International Conference on Computational Linguistics [1], pp. 425-432.

[19] V. Sornlertlamvanich, K. Inui, K. Shirai, H. Tanaka, T. Tokunaga, and T. Takezawa. Incorporating probabilistic parsing into an LR parser: LR table engineering (4). Information Processing Society of Japan, SIG-NL-119, 1997. Also available from http://tanaka-www.cs.titech.ac.jp/~inui/.