1 A Hidden Tag Model for Language

E. Brill (JHU), D. Harris (NSA), S. Lowe (Dragon), X. Luo (JHU), P. S. Rao (IBM), E. Ristad (Princeton), S. Roukos (IBM)

During the LM95 workshop, the "parse group" explored ideas on improving the language model by incorporating the linguistic structure of language. The group worked on three related activities during the summer:

- development of a new Hidden Tag Model (HTM) for language,
- initial exploration of parsing Switchboard transcripts, and
- development of keyword selection methods for topic identification.

Most of our work was on developing the HTM model, which we describe in detail in this report. We also describe briefly the goals of the parsing activity. The keyword selection work was a quick attempt, based on earlier work in the area, to see if keywords can be used to improve recognition accuracy for these keywords.

Dramatic progress has been demonstrated in solving the speech recognition problem via the use of a statistical model of the joint distribution p(W, O) of the sequence of spoken words W and the corresponding observed sequence of acoustic information O. This approach, pioneered by the IBM Continuous Speech Recognition group, is called the source-channel model. In this approach, the speech recognizer determines an estimate \hat{W} of the identity of the spoken word sequence from the observed acoustic evidence O by using the a posteriori distribution p(W|O). To minimize its error rate, the recognizer chooses the word sequence that maximizes the a posteriori distribution:

    \hat{W} = \arg\max_W p(W|O) = \arg\max_W p(W) \, p(O|W)

where p(W) is the probability of the sequence of n words W and p(O|W) is the probability of observing the acoustic evidence O when the sequence W is spoken. The a priori distribution p(W) of what words might be spoken (the source) is referred to as a language model (LM). The observation probability model p(O|W) (the channel) is called the acoustic model. Our work this past summer concentrated on building the language model. For a given word sequence W = {w_1, ..., w_n} of n words, we rewrite the LM probability as:

    p(W) = p(w_1, ..., w_n) = \prod_{i=1}^{n} p(w_i | w_0, ..., w_{i-1})

where w_0 is chosen appropriately to handle the initial condition. The probability of the next word w_i depends on the history h_i of words that have been spoken so far. Some of the most successful models of the past two decades are the simple n-gram models for the words, particularly the word trigram model (n = 3), where only the most recent two words of the history are used to condition the probability of the next word. The probability of a word sequence becomes:

    p(W) \approx \prod_{i=1}^{n} p(w_i | w_{i-2}, w_{i-1}).
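To make the trigram approximation concrete, here is a minimal sketch that estimates unsmoothed maximum-likelihood trigram probabilities from a training corpus and scores a sentence with them. The function names and the <s>/</s> boundary tokens are illustrative choices, and a practical LM would add smoothing such as the backing-off schemes discussed later in this report.

```python
from collections import defaultdict

def train_trigram(sentences):
    """Collect trigram and bigram counts over boundary-padded sentences."""
    tri, bi = defaultdict(int), defaultdict(int)
    for sent in sentences:
        words = ["<s>", "<s>"] + sent + ["</s>"]
        for i in range(2, len(words)):
            bi[(words[i - 2], words[i - 1])] += 1
            tri[(words[i - 2], words[i - 1], words[i])] += 1
    return tri, bi

def trigram_prob(tri, bi, w1, w2, w3):
    """Unsmoothed maximum-likelihood estimate of p(w3 | w1, w2)."""
    denom = bi.get((w1, w2), 0)
    return tri.get((w1, w2, w3), 0) / denom if denom else 0.0

def sentence_prob(tri, bi, sent):
    """p(W) under the trigram approximation above."""
    words = ["<s>", "<s>"] + sent + ["</s>"]
    p = 1.0
    for i in range(2, len(words)):
        p *= trigram_prob(tri, bi, words[i - 2], words[i - 1], words[i])
    return p
```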

1.1 Hidden Tag Model

Our goal was to explore ways to incorporate grammatical sentence structure into the language model to improve the accuracy of a speech recognizer. Small improvements have been obtained by augmenting the language model with word class information, but to a large degree we have not yet been able to use the part-of-speech structure or the parse structure of running text. There have been several attempts to incorporate linguistic structure into statistical LMs, with limited success. Ideally, one would like to incorporate various kinds of linguistic information such as:

- morphology of words: the fact that book and books are related.
- part of speech: whether a word is being used as a noun or a verb.
- word sense: to disambiguate between May the person and May the month.
- parse structure: words come in chunks that are noun phrases, verb phrases, etc.
- topic: keywords that indicate the current topic of a sentence.

Instead of modeling language as a sequence of word spellings, we associate with every spelling additional information about the use of the word spelling, and we treat the additional information as a "hidden" state that is unobserved. For example, consider the phrase "I read the book"; if we use the part of speech (POS) of each word as the additional information, we would have "I PRP read VBD the DET book NN", which disambiguates the word "read" for a reader who knows that VBD means past-tense verb. While we did not explore additional markings, one can envision adding such markings to delineate sense, topic, and parse structure, e.g. that the current word is part of a noun phrase. Let S = s_1^n denote the sequence of additional linguistic information for the word sequence W; the statistical language model becomes a joint model p(W, S), and the probability of a given word sequence is given by the marginal:

    p(W) = \sum_{all S} p(S, W) = \sum_{all S} p(S) p(W|S)          (1)

One common factorization of the above distribution, which has led to the HMM tagging model, is:

    p(W) = \sum_{all S} \prod_n p(s_n | s_1^{n-1}) \prod_n p(w_n | w_1^{n-1}, S)

where the output word distribution is simplified to depend only on the current state, i.e. p(w_n | s_n), and the transition distribution is considered to be a trigram:

    p(W) = \sum_{all S} \prod_n p(s_n | s_{n-2}, s_{n-1}) p(w_n | s_n)

This HMM tagging model has been effectively used for assigning POS tags to a stream of text. However, when used as a language model, the perplexity of such a model is significantly higher than that of the word trigram model (by about a factor of 2). This past summer, we explored factoring the joint state and word distribution given in Equation 1 by making less drastic independence assumptions. Since we are modeling language, we anticipate that linguistically motivated markings would lead to more accurate probabilistic models of the word sequence.
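As an illustration of the marginal in Equation 1 under this trigram tagging factorization, the brute-force sketch below sums over every possible tag sequence. It is exponential in the sentence length and usable only for tiny examples; the trellis recursion described in the Experiments section computes the same quantity efficiently. The callables p_trans and p_emit and the <s> boundary tag are assumptions made for illustration.

```python
from itertools import product

def htm_word_prob(words, tagset, p_trans, p_emit):
    """Brute-force marginal p(W) = sum over all tag sequences S of
    prod_n p(s_n | s_{n-2}, s_{n-1}) * p(w_n | s_n).

    p_trans(s3, s1, s2) and p_emit(w, s) are assumed callables returning
    probabilities; "<s>" pads the tag history at the sentence start.
    """
    total = 0.0
    for tags in product(tagset, repeat=len(words)):
        padded = ("<s>", "<s>") + tags
        p = 1.0
        for n, w in enumerate(words):
            s1, s2, s3 = padded[n], padded[n + 1], padded[n + 2]
            p *= p_trans(s3, s1, s2) * p_emit(w, s3)
        total += p
    return total
```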


Let us denote the tag and word pair at time n by x_n = (s_n, w_n). The joint probability can be factored as:

    p(S, W) = \prod_n p(x_n | x_1^{n-1}) = \prod_n p(s_n, w_n | s_1^{n-1}, w_1^{n-1})

The conditional probability for the tag-word pair at time n can be rewritten (writing t_n for the tag s_n) as:

    p(x_n | x_1^{n-1}) = p_t(t_n | x_1^{n-1}) p_w(w_n | t_n, x_1^{n-1})

Clearly, we can approximate the tag and word distributions in different ways. For the case of a trigram window on the tags and a bigram window on the words, we can try the following approximations: 1) for the tag distribution, we used

    p_t(t_n | x_1^{n-1}) \approx p_t(t_n | t_{n-2}, t_{n-1}, w_{n-1})

and 2) for the output distribution, we used

    p_w(w_n | t_n, x_1^{n-1}) \approx p_w(w_n | t_n, w_{n-1}).

In this new model, we have injected w_{n-1} into the transition and output probability distributions of the traditional HMM tagging model. The hope is that the additional dependence will improve the perplexity of the resulting LM. We will compare this model to word bigram and trigram models. Clearly, one may choose to factor the two distributions in other ways. Specifically, we implemented a general exponential model for the transition distribution using the Maximum Entropy (ME) approach to modeling distributions. We refer the reader to [1] for a general discussion of the ME approach to language modeling.

1.1.1 Experiments

The University of Pennsylvania provided us with a corpus of about 1.6 million words that had been tagged with POS tags and manually corrected, using a tagset of about 40 tags. They also provided us with about 100k words of text that was parsed manually. To perform their tagging and parsing, the UPenn team segmented the data not according to speaker turns but rather into units that corresponded to full sentences, crossing and skipping turn boundaries where necessary.

Due to processing limitations, the text was uppercased, which is usually a disadvantage for taggers. Building a traditional HMM tagger, we obtained a tagging accuracy of about 95%, somewhat lower than usual. The vocabulary was about 22k words, and the tagging dictionary was extracted from all of the data without any additional care to improve its coverage. The perplexity of the resulting model was above 200 on the development set. Previous attempts have used linear interpolation of the above tagging model with a word trigram model and a lemma model to obtain a small improvement in speech recognition accuracy. We wanted to develop a new way, instead of linear interpolation, to incorporate the POS tag information and achieve a significant improvement in speech recognition accuracy.

Trellis computation: The probability can be computed by a recursive forward sum:

    \alpha_n(w_n, t_2 t_3) = \sum_{t_1 t_2 \in T_{n-1}} p(w_n | w_{n-1}, t_3) \, p(t_3 | t_1, t_2, w_{n-1}) \, \alpha_{n-1}(w_{n-1}, t_1 t_2)
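A minimal sketch of this forward recursion is given below, assuming the two component models are available as callables p_t(t3, t1, t2, w_prev) and p_w(w, t3, w_prev), and that tags_for(w) implements the tag-dictionary restriction discussed in the next paragraph; the names and the <s> boundary convention are illustrative.

```python
from collections import defaultdict

def forward_htm(words, tags_for, p_t, p_w):
    """Forward sum over the HTM tag trellis.

    State = (t_{n-1}, t_n).  p_t(t3, t1, t2, w_prev) is the tag
    transition model, p_w(w, t3, w_prev) the output model, and
    tags_for(w) returns the tags the dictionary allows for word w.
    """
    alpha = {("<s>", "<s>"): 1.0}      # boundary tags before the sentence
    w_prev = "<s>"
    for w in words:
        new_alpha = defaultdict(float)
        for (t1, t2), a in alpha.items():
            for t3 in tags_for(w):
                new_alpha[(t2, t3)] += (
                    a * p_t(t3, t1, t2, w_prev) * p_w(w, t3, w_prev)
                )
        alpha = new_alpha
        w_prev = w
    return sum(alpha.values())          # p(W), marginalized over tags
```

Because the state keeps only the last two tags, the cost per word grows with the square of the number of allowed tags, which is one reason restricting tags through the dictionary also speeds up the computation.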

The set of states T_n in the trellis at position n can be controlled by restricting the allowed tags for word w_n to those in the tag dictionary. Due to the available software, we were restricted to trigrams for the various component models. In Table 1, we show the perplexity for the various models that we tried. In all of our results, we used the word model p(w3 | t3, w2); for this word model, we back off to the w2 w3 bigram and then to the w3 unigram. In the first row of Table 1, we restricted the allowed tags for a word based on a dictionary, and we used a trigram model for the tags. We also indicate in Table 1 the type of backoff order used for smoothing the tag transition distribution: we use the notation t1 t2 t3 to indicate that we back off to the t2 t3 bigram and then to the t3 unigram, and for the trigram w2 t2 t3 we back off to the t2 t3 bigram and then to the t3 unigram. After analyzing some of the high-perplexity regions (hot spots), we improved our tagging model by adding the previous word to the history of the model, indicated by w2 t2 t3. The perplexity of the resulting models is shown in Table 1. Restricting the tags to the ones allowed by the dictionary results in a significant increase in perplexity over the word bigram model; this is due in part to the method of smoothing used by the backing-off model, which was chosen because of software availability. Using a full-search approach reduces the perplexity to 129.
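The backing-off order described above (t1 t2 t3, then t2 t3, then t3) can be sketched as follows. This is a schematic chain with a fixed stand-in backoff weight, not the exact smoothing imposed by the available software; a Katz-style model would discount the observed counts and compute the backoff weights so that each conditional distribution normalizes.

```python
def backoff_tag_prob(t3, t1, t2, tri, bi, uni, total, alpha=0.4):
    """Schematic backing-off estimate of p(t3 | t1, t2).

    tri, bi, uni are raw counts of tag trigrams, bigrams and unigrams,
    and total is the total tag count.  alpha is a fixed stand-in for the
    backoff weight; a proper Katz-style model discounts the higher-order
    counts and computes the weight so each conditional sums to one.
    """
    if tri.get((t1, t2, t3), 0) > 0 and bi.get((t1, t2), 0) > 0:
        return tri[(t1, t2, t3)] / bi[(t1, t2)]       # t1 t2 t3 trigram
    if bi.get((t2, t3), 0) > 0 and uni.get(t2, 0) > 0:
        return alpha * bi[(t2, t3)] / uni[t2]         # back off to t2 t3
    return alpha * alpha * uni.get(t3, 0) / total     # back off to t3
```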


Search        Tag model    Perplexity
Dict.         t1 t2 t3     188
Full Sum      t1 t2 t3     129
Full Sum      w2 t2 t3     115
Word bigram                105

Table 1: Perplexity of various HTMs using p(w3 | t3, w2) as the output distribution.

Model                Perplexity
p(w3 | t3, w2)       25
p(t3 | t1, t2)       8.9
p(t3 | w2, t2)       6.9
p(t3 | t1, t2, w2)   6.4

Table 2: Perplexity of tag and word models on the development set.

Adding the w2 dependence in the transition distribution reduces the perplexity to 115, which is about 9% higher than the word bigram model. We investigated further the perplexity of each of the two components of our model. Table 2 shows the perplexity of the word model at 25; this assumes that the model knows the tag t3 of the current word. We also show the perplexity of three models for predicting the tag. In particular, we used a 4-gram tag model whose perplexity is about 7% lower than that of the trigram w2 t2 t3 model. We did not incorporate the 4-gram model in our HTM implementation.

Our goal was to use a more general model than a trigram model for the tag and word distributions. We only implemented an ME model for the tag distribution. The general form of the model was:

    p(t_3 | t_1, t_2, w_2) = \frac{p_b(t_3 | t_1, t_2) \prod_i F_i(w_2, t_1, t_2, t_3)}{Z_{t_1, t_2, w_2}}

where the feature set consists of binary-valued indicator functions defined by various n-grams on tag and word sequences. The prior model on tags used the backing-off trigram model. The ME code was implemented and tested for just the simplest feature types, as shown in Table 3; the resulting model had a perplexity of 120. The larger ME model shown in Table 4 was not completed.

Feature type   No. of features
w2 t2 t3       17,993
t2 t3          1,100
t3             44
Perplexity     120

Table 3: Number of features by type and perplexity of the tag-word model on the development set.

Feature type   No. of features
w2 t1 t2 t3    29,856
w1 w2 t3       36,149
w2 t3          16,889

Table 4: Number of features by type.
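The sketch below illustrates this kind of exponential model in the standard maximum-entropy parameterization, where each binary feature that fires contributes a multiplicative factor exp(lambda_i) on top of the backing-off prior p_b(t3 | t1, t2); this corresponds to the factors F_i above, but the exact parameterization, feature inventory, and interfaces shown here are assumptions made for illustration.

```python
import math

def me_tag_dist(t1, t2, w2, tagset, prior, weights, features):
    """Exponential (maximum-entropy style) tag model over a prior.

    prior(t3, t1, t2) is the backing-off trigram prior p_b(t3 | t1, t2);
    features(w2, t1, t2, t3) returns the names of the binary features
    that fire; weights maps feature names to their lambdas.  Returns the
    normalized distribution p(t3 | t1, t2, w2) over the tagset.
    """
    unnorm = {}
    for t3 in tagset:
        score = sum(weights.get(f, 0.0) for f in features(w2, t1, t2, t3))
        unnorm[t3] = prior(t3, t1, t2) * math.exp(score)
    z = sum(unnorm.values())            # the normalizer Z_{t1, t2, w2}
    return {t3: v / z for t3, v in unnorm.items()}
```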

Speech recognition: We used our HTM with ME model 1 in a top-N rescoring strategy and obtained a slight increase in the error rate. The bigram baseline error rate was 51.4%, the top-10 error rate was 52.3%, and the top-100 error rate was 53.6%. It was surprising to see the error rate increase as the search was widened. We could not analyze the results in time to identify potential problems. However, over the coming year we hope to repeat the experiments to obtain a more thorough evaluation of the HTM LM.
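A top-N rescoring pass of this kind can be sketched as follows; the N-best list format, the log-domain combination, and the language-model weight are illustrative assumptions rather than the exact setup of the experiment.

```python
def rescore_nbest(nbest, lm_logprob, lm_weight=10.0):
    """Rescore an N-best list with a new language model.

    nbest is a list of (words, acoustic_logprob) pairs produced by the
    recognizer; lm_logprob(words) returns log p(W) under the new LM.
    The hypothesis with the best combined score is returned.
    """
    best_words, best_score = None, float("-inf")
    for words, acoustic_logprob in nbest:
        score = acoustic_logprob + lm_weight * lm_logprob(words)
        if score > best_score:
            best_words, best_score = words, score
    return best_words
```

In practice the LM weight would be tuned on held-out data; the value here is only a placeholder.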

1.2 Parsing Switchboard

The sentence, a meaningful unit in written text, appears to be a much less meaningful unit in spoken language, as can be seen by examining the Switchboard transcripts. Because of this, the standard approach to parsing, finding the best single tree structure for a sentence such that the root of that tree is labelled S, is probably not well suited to Switchboard. We would like to assign syntactic structure to Switchboard utterances, in the hope that this syntactic annotation will prove useful to a language model. For instance, if we have phrase structure for the sentence "The pig in the car oinked", we could get a more accurate estimate of the probability of "oinked" given the history from P(oinked | head of previous phrase is "pig") than from the trigram P(oinked | the car).

In an attempt to allow a language model to use phrasal information, we have begun to explore the possibility of building a phrase chunker for Switchboard, which would annotate phrasal information wherever it could do so reliably, but would not be forced to fit phrases into a sentence. The chunker we are building uses an ordered list of deterministic rules that describe where to assign structure. These rules are learned automatically, using a corpus of manually parsed text; in particular, we are using a portion of Switchboard parsed at the University of Pennsylvania. In each iteration of learning, the learner searches for the most reliable context for inserting structure, and then learns a rule to insert structure everywhere that context is seen, as sketched below. During the workshop we implemented a small prototype of the learner and chunker. More recently, a student has been working on extending the power of the system and applying it to both conversational and written language.
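The learning loop just described can be sketched as below. This is a simplified, hypothetical version that assigns per-token chunk labels (B-NP/I-NP/O) rather than the bracket-insertion rules of the workshop system; at each iteration it greedily picks the context-to-label rule with the largest net reduction in labeling errors on the training corpus.

```python
from collections import Counter

def learn_chunk_rules(train, max_rules=20):
    """Greedy transformation-based learner (simplified sketch).

    train is a list of sentences, each a list of (pos_tag, gold_label)
    pairs with chunk labels such as "B-NP", "I-NP", "O".  A rule says:
    in context (previous tag, current tag, current guess), relabel the
    position with a new chunk label.
    """
    labels = {lbl for sent in train for _, lbl in sent}
    guess = [["O"] * len(sent) for sent in train]       # initial labeling
    rules = []
    for _ in range(max_rules):
        gain = Counter()
        for sent, g in zip(train, guess):
            for i, (tag, gold) in enumerate(sent):
                prev = sent[i - 1][0] if i else "<s>"
                ctx = (prev, tag, g[i])
                for lbl in labels:
                    if lbl == g[i]:
                        continue
                    # +1 if the rule would fix this position, -1 if it breaks it
                    gain[(ctx, lbl)] += (lbl == gold) - (g[i] == gold)
        if not gain:
            break
        (ctx, new_lbl), score = gain.most_common(1)[0]
        if score <= 0:
            break
        rules.append((ctx, new_lbl))
        for sent, g in zip(train, guess):               # apply the new rule
            for i, (tag, _) in enumerate(sent):
                prev = sent[i - 1][0] if i else "<s>"
                if (prev, tag, g[i]) == ctx:
                    g[i] = new_lbl
    return rules
```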

References

[1] Lau, R., Rosenfeld, R., and Roukos, S., "Trigger-Based Language Models: A Maximum Entropy Approach," Proceedings of ICASSP-93, Vol. II, pp. 45-48, 1993.
