A Structured Language Model

Ciprian Chelba
The Johns Hopkins University, CLSP, Barton Hall 320, 3400 N. Charles Street, Baltimore, MD 21218
[email protected]
Abstract

The paper presents a language model that develops syntactic structure and uses it to extract meaningful information from the word history, thus enabling the use of long distance dependencies. The model assigns probability to every joint sequence of words and binary parse structure with headword annotation. The model, its probabilistic parametrization, and a set of experiments meant to evaluate its predictive power are presented.
1 Introduction
The main goal of the proposed project is to develop a language model (LM) that uses syntactic structure. The principles that guided this proposal were: the model will develop syntactic knowledge as a built-in feature; it will assign a probability to every joint sequence of words and binary parse structure; the model should operate in a left-to-right manner so that it would be possible to decode word lattices provided by an automatic speech recognizer. The model consists of two modules: a next-word predictor, which makes use of the syntactic structure developed by a parser, and the parser itself. The operations of these two modules are intertwined.
2 The Basic Idea and Terminology

Consider predicting the word barked in the sentence:

the dog I heard yesterday barked again.
A 3-gram approach would predict barked from (heard, yesterday), whereas it is clear that the predictor should use the word dog, which is outside the reach of even 4-grams. Our assumption is that what enables us to make a good prediction of barked is the syntactic structure in the past.
Figure 1: Partial parse

Figure 2: A word-parse k-prefix
The correct partial parse of the word history when predicting barked is shown in Figure 1. The word dog is called the headword of the constituent (the (dog (...))) and dog is an exposed headword when predicting barked, that is, the topmost headword in the largest constituent that contains it. The syntactic structure in the past filters out irrelevant words and points to the important ones, thus enabling the use of long distance information when predicting the next word.

Our model will assign a probability P(W, T) to every sentence W with every possible binary branching parse T and every possible headword annotation for every constituent of T. Let W be a sentence of length l words to which we have prepended <s> and appended </s> so that w_0 = <s> and w_{l+1} = </s>. Let W_k be the word k-prefix w_0 ... w_k of the sentence and W_k T_k the word-parse k-prefix. To stress this point, a word-parse k-prefix contains only those binary trees whose span is completely included in the word k-prefix, excluding w_0 = <s>. Single words can be regarded as root-only trees. Figure 2 shows a word-parse k-prefix; h_0, ..., h_{-m} are the exposed headwords. A complete parse (Figure 3) is any binary parse of the w_1 ... w_l </s> sequence with the restriction that </s> is the only allowed headword.
Figure 3: Complete parse

Figure 4: Before an adjoin operation
Note that (w_1 ... w_l) needn't be a constituent, but for the parses where it is, there is no restriction on which of its words is the headword.

The model will operate by means of two modules:

PREDICTOR predicts the next word w_{k+1} given the word-parse k-prefix and then passes control to the PARSER;

PARSER grows the already existing binary branching structure by repeatedly generating the transitions adjoin-left or adjoin-right until it passes control to the PREDICTOR by taking a null transition.

The operations performed by the PARSER ensure that all possible binary branching parses with all possible headword assignments for the w_1 ... w_k word sequence can be generated. They are illustrated by Figures 4-6. The following algorithm describes how the model generates a word sequence with a complete parse (see Figures 3-6 for notation):

Transition t;                 // a PARSER transition
generate <s>;
do{
  predict next_word;          // PREDICTOR
  do{                         // PARSER
    if(T_{-1} != <s>)
      if(h_0 == </s>) t = adjoin-right;
      else t = {adjoin-{left,right}, null};
    else t = null;
  }while(t != null)
}while(!(h_0 == </s> && T_{-1} == <s>))
t = adjoin-right;             // adjoin <s>; DONE
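To make the data structures concrete, here is a minimal Python sketch of the word-parse prefix, viewed as a stack of headword-annotated binary trees, together with the generation loop above. The Tree class and the predict_word / choose_transition callbacks are illustrative stand-ins introduced for this sketch, not components specified in the paper.

# Minimal sketch of the word-parse prefix and the generation loop above.
# Tree, predict_word and choose_transition are illustrative stand-ins.

SB, SE = "<s>", "</s>"   # sentence-begin / sentence-end markers

class Tree:
    """Binary tree with a headword; a single word is a root-only tree."""
    def __init__(self, headword, left=None, right=None):
        self.headword = headword
        self.left, self.right = left, right

def adjoin_left(stack):
    """Join the two rightmost trees; h_{-1} (left child's head) percolates up."""
    right, left = stack.pop(), stack.pop()
    stack.append(Tree(left.headword, left, right))

def adjoin_right(stack):
    """Join the two rightmost trees; h_0 (right child's head) percolates up."""
    right, left = stack.pop(), stack.pop()
    stack.append(Tree(right.headword, left, right))

def exposed_headwords(stack):
    """h_0, h_{-1}, ..., h_{-m}: headwords of the trees, rightmost first."""
    return [t.headword for t in reversed(stack)]

def generate(predict_word, choose_transition):
    """Generate a word sequence with a complete parse (Section 2 algorithm)."""
    stack = [Tree(SB)]                                    # generate <s>
    while True:
        stack.append(Tree(predict_word(exposed_headwords(stack))))  # PREDICTOR
        while True:                                       # PARSER
            if len(stack) > 2:                            # T_{-1} != <s>
                if stack[-1].headword == SE:              # h_0 == </s>
                    t = "adjoin-right"
                else:
                    t = choose_transition(exposed_headwords(stack))
            else:
                t = "null"
            if t == "null":
                break
            if t == "adjoin-left":
                adjoin_left(stack)
            else:
                adjoin_right(stack)
        if stack[-1].headword == SE and len(stack) == 2:  # h_0 == </s>, T_{-1} == <s>
            break
    adjoin_right(stack)                                   # adjoin <s>; DONE
    return stack[0]

The choose_transition callback is where a probabilistic parser model, such as the one outlined in the next section, would plug in; the sketch deliberately leaves it abstract.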
It is easy to see that any given word sequence with a possible parse and headword annotation is generated by a unique sequence of model actions.
3 Probabilistic Model

The probability P(W, T) can be broken into:

P(W, T) = \prod_{k=1}^{l+1} [ P(w_k | W_{k-1} T_{k-1}) \prod_{i=1}^{N_k} P(t^k_i | w_k, W_{k-1} T_{k-1}, t^k_1 ... t^k_{i-1}) ]

where:
W_{k-1} T_{k-1} is the word-parse (k-1)-prefix;
w_k is the word predicted by the PREDICTOR;
t^k_i is the i-th PARSER transition taken at position k;
N_k - 1 is the number of adjoin operations the PARSER executes before passing control to the PREDICTOR (the N_k-th operation at position k is the null transition); N_k is a function of T.
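To make the decomposition concrete, here is a minimal Python sketch that accumulates log P(W, T) from per-step probabilities; the p_word and p_transition callables and the flat list representation of the word-parse prefix are assumptions of the sketch, not the paper's parametrization.

import math

def sentence_logprob(steps, p_word, p_transition):
    """Accumulate log P(W, T) following the decomposition above.
    steps is a list of (w_k, [t^k_1, ..., t^k_{N_k}]) pairs, one per
    position k = 1 .. l+1; the last transition at each position is null.
    p_word and p_transition are stand-in conditional models, and the
    word-parse prefix W_{k-1} T_{k-1} is represented here simply by the
    history of generated words and transitions."""
    logp = 0.0
    prefix = ["<s>"]                                   # encodes W_{k-1} T_{k-1}
    for w_k, transitions in steps:
        logp += math.log(p_word(w_k, prefix))          # P(w_k | W_{k-1} T_{k-1})
        for i, t in enumerate(transitions):            # P(t^k_i | w_k, ..., t^k_1..t^k_{i-1})
            logp += math.log(p_transition(t, w_k, prefix, transitions[:i]))
        prefix = prefix + [w_k] + list(transitions)    # advance to W_k T_k
    return logp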
Figure 5: Result of adjoin-left

Figure 6: Result of adjoin-right
and in the {w,h} models they were: ... In these templates, "_" denotes a don't-care position and (word, tag) a word-tag pair; for example, such a template will trigger on all ((word, any tag), predicted-word) pairs that occur more than 3 times in the training data. The sentence boundary is not included in the PP calculation. Table 1 shows the PP results along with
the number of parameters for each of the 4 models described.

LM   PP    param      LM   PP    param
W    352   208487     w    419   103732
H    292   206540     h    410   102437

Table 1: Perplexity results

1 ftp://ftp.cs.princeton.edu/pub/packages/memt
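Perplexity (PP) above is the standard per-word measure; a minimal Python sketch, assuming the per-word log probabilities have already been collected and that, as stated above, the sentence-boundary token is left out:

import math

def perplexity(word_logprobs):
    """PP = exp(-(1/N) * sum_k log P(w_k | history)); the sentence-boundary
    token is assumed to be excluded from word_logprobs, as in Table 1."""
    return math.exp(-sum(word_logprobs) / len(word_logprobs))

# Example: three words, each assigned probability 0.1, give PP = 10.
print(perplexity([math.log(0.1)] * 3))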
5 Acknowledgements
The author thanks all the members of the Dependency Modeling Group (Stolcke97): David Engle, Frederick Jelinek, Victor Jimenez, Sanjeev Khudanpur, Lidia Mangu, Harry Printz, Eric Ristad, Roni Rosenfeld, Andreas Stolcke, Dekai Wu.
References
Michael John Collins. 1996. A new statistical parser based on bigram lexical dependencies. In Proceedings of the 34th Annual Meeting of the Association for Computational Linguistics, pages 184-191, Santa Cruz, CA.

Frederick Jelinek. 1997. Information extraction from speech and text (course notes). The Johns Hopkins University, Baltimore, MD.

Frederick Jelinek, John Lafferty, David M. Magerman, Robert Mercer, Adwait Ratnaparkhi, Salim Roukos. 1994. Decision tree parsing using a hidden derivational model. In Proceedings of the Human Language Technology Workshop, pages 272-277. ARPA.

Raymond Lau, Ronald Rosenfeld, and Salim Roukos. 1993. Trigger-based language models: a maximum entropy approach. In Proceedings of the IEEE Conference on Acoustics, Speech, and Signal Processing, volume 2, pages 45-48, Minneapolis.

Mitchell P. Marcus, Beatrice Santorini, Mary Ann Marcinkiewicz. 1995. Building a large annotated corpus of English: the Penn Treebank. Computational Linguistics, 19(2):313-330.

Eric Sven Ristad. 1997. Maximum entropy modeling toolkit. Technical report, Department of Computer Science, Princeton University, Princeton, NJ, January 1997, v. 1.4 Beta.

Ciprian Chelba, David Engle, Frederick Jelinek, Victor Jimenez, Sanjeev Khudanpur, Lidia Mangu, Harry Printz, Eric Sven Ristad, Roni Rosenfeld, Andreas Stolcke, Dekai Wu. 1997. Structure and performance of a dependency language model. In Proceedings of Eurospeech'97, Rhodes, Greece. To appear.