Advances in Automatic Speech Recognition by Imitating Spreading Activation

Stefan Bordag, Denisa Bordag
University of Leipzig
[email protected] [email protected]

Inspired by recent insights into the properties of statistical word co-occurrences, we propose a mechanism which imitates spreading activation in the human mind in order to improve the identification of words during the automatic speech recognition process. This mechanism is able to make accurate semantic predictions about the currently uttered word as well as about words which are likely to occur in the rest of a sentence. A robust automatic disambiguation algorithm provides a framework for semantic clustering, which makes it possible to avoid the inherent polysemy problem.

Introduction

Computer systems, lacking the broad human knowledge of world and language, have difficulty determining the most appropriate meaning of a word even in sentences which humans are able to parse unambiguously thanks to their cognitive skills and their ability to infer from the situation and from larger contexts. Thus, in a sentence like 'Each node connected by a hub is allowed, due to hierarchical extension of the STAR topology, to in turn play the role of a hub for a disjoint set of leaf nodes.', humans, even those with little knowledge about computers, would immediately recognize that the word star refers to a technical device. A fluent speaker of English with some hardware knowledge would not even consider alternative interpretations relating to nature or show business, despite the fact that the word leaf might indicate the meaning of star as an object in the sky (e.g. One large leaf of the old chestnut tree standing in front of his window spoiled Henry's view of the morning star.) and the word play the meaning of a famous and/or successful actor (e.g. One year later our star appeared in a play by John Osbourn.). In the human mind, in contrast to current computational algorithms (from [Lesk 1986] to [Patwardhan et al. 2003] or [Rosso et al. 2003]), the problem of immediate word sense disambiguation is resolved by means of spreading activation. Psycholinguistic theories assume that lexical access involves the selection of the most highly activated lexical node from a set of activated nodes. Selection is necessary because, according to spreading activation theory, other semantically or phonologically related nodes are activated as well [Caramazza 1997], [Dell 1986], [Garrett 1980], [Levelt 1989], [Levelt et al. 1999]. Thus, if the nodes of the first


words of our example sentence are activated, accessed and selected, activation spreads from them to all nodes which are connected with them on account of semantic and phonological relations. From the nodes reached by this mediated activation, weaker activation spreads to the network of nodes connected to them, and so on. Consequently, the activation spreading from the nodes of the words node, connected and hub pre-activates the node, or, to be more precise, the lemma (here used in its psycholinguistic meaning, see e.g. [Levelt 1989]) of the word star in its technical sense. When a human then reads or hears the word star in our sentence, only the lemma with the proper meaning will be selected, because it will have the highest activation. The other two lemmas connected with the ambiguous word form star (with the meanings of an object in the sky and of show business) will be considered as competitors, but will not be selected, because their activation will be lower (they receive no additional activation from their neighbours). Thus it can be concluded that the principle responsible for the highly effective word disambiguation by humans is spreading activation and its natural flow through the network of nodes on various levels, whose organisation is based on semantic (or, in a broader sense, associative) principles on the higher levels and on phonological/phonetic principles on the lower levels.
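The decaying flow of activation just described can be pictured with a small sketch. Everything in it (the toy graph, the decay factor, the initial energies) is a hypothetical illustration of the psycholinguistic model, not part of the cited theories:

```python
def spread_activation(graph, seeds, decay=0.5, steps=3):
    """Toy illustration of spreading activation: energy flows from the
    seed nodes to their neighbours, weakening by `decay` at each step."""
    activation = dict(seeds)                 # node -> accumulated energy
    frontier = dict(seeds)
    for _ in range(steps):
        nxt = {}
        for node, energy in frontier.items():
            for neighbour in graph.get(node, ()):
                passed = energy * decay
                activation[neighbour] = activation.get(neighbour, 0.0) + passed
                nxt[neighbour] = nxt.get(neighbour, 0.0) + passed
        frontier = nxt
    return activation

# Hypothetical mini-network around the example sentence:
graph = {"node": ["hub", "network"], "hub": ["star", "network"],
         "network": ["star", "topology"]}
print(spread_activation(graph, {"node": 1.0, "hub": 1.0}))
# "star" ends up pre-activated, while unrelated words receive nothing.
```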

Word sense disambiguation: current problems

Attempts to solve the problem of lexical disambiguation automatically essentially try to substitute computational processes for the process of spreading activation. Compared to spreading activation, many of these automatic processes are static, because they operate on stored, completed sentences and on a static framework of dictionary-like word definitions. Despite the fact that such mechanisms use whole sentences (or other windows), their results are very unsatisfactory when compared to those of humans (see [Banerjee 2002] for a comparison of automated systems), because even the use of all words in a given sentence cannot successfully substitute for the highly complex associative networks in the human mind. Consequently, this approach is at a disadvantage if used to find the correct word sense in a real-time environment, as only the part of the sentence uttered so far is available in that case. However, a larger number of considered words does improve the results of mechanisms like Lesk's algorithm [Lesk 1986], because the chance that one of the words in the critical sentence matches a word used in a definition of one of the senses of an ambiguous word naturally grows with the size of the compared word sets. The number of correct hits roughly corresponds to the level of activation in a human mind: the definition (i.e. word sense) with the most hits is selected. It is, however, obvious that such a simple and straightforward mechanism can achieve only very imperfect results compared to the complex system of spreading activation, especially as such definitions tend to be very short, rarely containing more than three content words. Another disadvantage of this approach becomes obvious if we want to use it for purposes like improving sense recognition in dialogue systems. Here the static character of the process starts to matter, because it is desirable to know in advance which word or sense is likely to come next (i.e. what is "pre-activated") in order to choose the correct word/word sense quickly.
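A minimal sketch of this overlap counting in the style of [Lesk 1986]; the toy dictionary and sense labels are hypothetical, and real definitions are similarly short:

```python
def lesk_sense(sentence_words, sense_definitions):
    """Pick the sense whose definition shares the most words with the
    sentence context (Lesk-style overlap counting)."""
    context = set(sentence_words)
    best_sense, best_overlap = None, -1
    for sense, definition_words in sense_definitions.items():
        overlap = len(context & set(definition_words))
        if overlap > best_overlap:
            best_sense, best_overlap = sense, overlap
    return best_sense

# Hypothetical toy dictionary for the word form "star":
definitions = {
    "star_sky":  {"luminous", "celestial", "object", "night", "sky"},
    "star_net":  {"network", "topology", "central", "hub", "node"},
    "star_show": {"famous", "actor", "performer", "film"},
}
print(lesk_sense(["node", "connected", "hub", "topology"], definitions))
# -> "star_net"
```

With only one or two preceding words available, the overlap is usually zero, which is exactly the real-time weakness discussed next.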


Using simple Markov model techniques this is possible, with one drawback: it does not solve the polysemy problem. That means that such a system would "pre-activate" words like constellation or twinkling after processing the word 'star' in our example sentence, which would certainly be wrong in this context. A more sophisticated mechanism imitating spreading activation on the basis of the Markov model would not help, for a further reason: if the target word is preceded by only one or two words, the probability that they match a word in one of the definitions is quite low. However, using the results of a disambiguation algorithm [Bordag 2002b] based on statistical co-occurrences of words within sentences, the effectiveness of the spreading activation process, and especially the complexity of the associative networks in the human mind, can be imitated quite closely, as will be shown. Based on the quantitative corpus finding that a graph constructed from word co-occurrences has a small-world structure (strong local clustering and high connectivity), the disambiguation process is treated as a maximum-cluster-finding problem. The algorithm rests on two assumptions: first, that words in the graph cluster semantically, and second, that any three given words taken together are unambiguous. If the three words are semantically homogeneous, then they are located in the same cluster of the graph and the intersection of their direct neighbours will be non-empty and semantically homogeneous as well. After generating a number of such triplets (always including the input word), their neighbour intersections are clustered with hierarchical agglomerative clustering. As a result, for a given word one or more sets of semantically homogeneous words are found, along with a set of words which are either semantically unrelated to the input word (although they co-occur with it) or whose statistical counts are too low for a reliable decision.
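A minimal sketch of the triplet idea, assuming `neighbours` maps each word form to its set of significant co-occurrences; the naive single-pass merge at the end stands in for the hierarchical agglomerative clustering of the actual algorithm:

```python
from itertools import combinations

def sense_clusters(word, neighbours, min_overlap=2):
    """For triplets (word, a, b) drawn from the word's co-occurrence
    neighbours, intersect the neighbour sets of all three members; a
    non-empty intersection is taken as evidence of one semantically
    homogeneous cluster, and overlapping intersections are merged."""
    intersections = []
    for a, b in combinations(sorted(neighbours[word]), 2):
        if a in neighbours and b in neighbours:
            common = neighbours[word] & neighbours[a] & neighbours[b]
            if common:
                intersections.append(common)
    clusters = []
    for s in intersections:
        for c in clusters:
            if len(c & s) >= min_overlap:   # enough shared words: same sense
                c |= s
                break
        else:
            clusters.append(set(s))
    return clusters
```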

The algorithm

The algorithm is based on the co-occurrence analysis of the "Projekt Deutscher Wortschatz", which specializes in researching statistical properties of word forms in large text corpora. Co-occurrences can be calculated in many different ways, but only the standard sentence-wide co-occurrences of word forms from the Wortschatz project have proven useful for this particular task and have therefore been used. We refer to [Läuter et al. 1999], [Quasthoff et al. 2002] and [Heyer et al. 2001] for more information about the properties of the co-occurrence analysis itself. Implicitly, these co-occurrences define a graph whose nodes represent the word forms. Two nodes or word forms w_i and w_j are viewed as connected with each other if the significance value sig(w_i, w_j) of the co-occurrence measure for w_i and w_j is above a certain threshold. The resulting graph is sparse, fully connected and has the small-world property, see [Bordag 2002a], [Steyvers & Tenenbaum 2002], or [Kleinberg 2000]. The context set K (or neighbour set, or collocation profile) of a word form w_i is then defined as the set {w : sig(w_i, w) > t} of words which are directly connected to w_i with threshold t. A special and, for our work, highly relevant property of this graph is that it has local clusters which roughly correspond to topics or, to be more precise, to contexts in which particular words appear significantly often. Such a topic or context could be, e.g., computer networks, show business or astronomy. Using the disambiguation algorithm [Bordag 2002b] it is possible to determine the membership of words in particular clusters. After determining which clusters are accessed by each of the input words, the accessed clusters are compared for overlap. Those clusters which have more than a given threshold number of input words attached to them then represent a good description of the topic of the sentence. The precise definition is as follows:

• begin with the set Ψ of input words (the content words of a sentence)
• for each element of Ψ run the disambiguation algorithm: D: Ψ → {{κ_1, κ_2, ..., κ_n}}
• map the resulting context vectors {{κ_1, κ_2, ..., κ_n}} → {p_1, p_2, ..., p_n} into pairs p_i = (w_i, κ_i) for each word w_i ∈ Ψ
• cluster the set of pairs by comparing the κ into a set of groups of pairs: {p_1, p_2, ..., p_n} → {{p_1, p_2, ..., p_m}}
• merge pairs which have been clustered together into tuples q_m = (W_m, K_m): {{p_1, p_2, ..., p_m}} → {q_1, q_2, ..., q_m} with W_m ⊆ Ψ. The tuple q_m = (W_m, K_m) then contains in K_m the words of the found context (topic) and in W_m those words from the input which are attached to the {{κ_1, κ_2, ..., κ_n}} that resulted in K_m.
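A condensed sketch of these steps under stated assumptions: `sig` and `disambiguate` are placeholders for the Wortschatz significance measure and for D, and a simple overlap test (using the 50%-of-the-smaller-cluster criterion reported for the implementation later in the text) stands in for the clustering of the pairs:

```python
def context_set(word, vocabulary, sig, t):
    """K(w_i) = {w : sig(w_i, w) > t}: the direct neighbours of w_i in
    the co-occurrence graph."""
    return {w for w in vocabulary if w != word and sig(word, w) > t}

def sentence_topics(input_words, disambiguate, overlap=0.5):
    """Condensed sketch of the five steps above.  `disambiguate(w)`
    stands in for D and is assumed to return the sense clusters (sets
    of words) of w.  Pairs (w, kappa) are grouped whenever the cluster
    overlap reaches `overlap` of the smaller set; the real system
    clusters the pairs hierarchically."""
    pairs = [(w, kappa) for w in input_words for kappa in disambiguate(w)]
    groups = []                                  # list of tuples (W_m, K_m)
    for w, kappa in pairs:
        if not kappa:
            continue
        for W_m, K_m in groups:
            if len(K_m & kappa) >= overlap * min(len(K_m), len(kappa)):
                W_m.add(w)                       # attach the input word
                K_m |= kappa                     # grow the topic cluster
                break
        else:
            groups.append(({w}, set(kappa)))
    return groups
```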

The sentence from the beginning of this paper will be used as an example again. All content words preceding 'star' were used as input words: 'connected extension hierarchical hub node'. Two of them, extension and hub, were found to be ambiguous in the data; the others were members of only one cluster each. As the algorithm can find contexts/topics but not name them, sample words from each cluster are provided as representatives of the particular topics, along with the total number of words in the given cluster. As will be explained in greater detail later, the quality of the results suffered from the quality and size of the English corpus at the Wortschatz project.

connected: 343 words {… connection connections connector consisting consists console controller controllers converter cord data dedicated desktop detector …}
extension 1: 63 words {… application approval approved authority beyond call comments compliance creditors date deadline determination expiration expire …}
extension 2: 50 words {… ANSI allows application automatic automatically batch command date default directory extensions fax feature file file's filename …}
hierarchical: 78 words {… features file files folder folders functions hierarchy information interactive interface layers lets logical management manipulate …}
hub 1: 217 words {… cards carriers central chassis closet coax coaxial communications concentrator concentrators configuration connect …}
hub 2: 76 words {… airline airline's airlines airport airports announced bus carrier carrier's carriers center expansion fares flights hubs located main major …}


node: 258 words {… data database dedicated defined defines destination device devices dial directly directory diskless distributed either element enables errors …}

It is obvious that the first cluster of extension and the second cluster of hub are both inappropriate in the context of the sentence. After comparing all found clusters with each other, these two are dropped, because they do not overlap with any of the other clusters. In the next step, all appropriate clusters (i.e. those which are in accordance with the topic of the sentence) are merged into one larger cluster on account of their many overlapping words. Consequently, this large cluster contains only words relevant for the topic of the given sentence. The overlap percentage used in the implementation which computed these results was 50% (of the smaller cluster). In the case of our example sentence, the cluster retrieved by the algorithm in a fully unsupervised way contained 751 words, consisting of computer-network-specific words only.

It might seem a bit artificial at this point that a common word like node has only one cluster, but this is a limitation imposed by the size and especially the quality of the English corpus used. In fact, the corpus contains only business-newspaper texts (mainly Wall Street Journal) and is not very large (about 13 million sentences). As such, this corpus is very unbalanced and cannot be used for all purposes. It contains 1,240,002 different word forms, of which only the first 34,356 (ordered by frequency) occurred often enough to have enough co-occurrences for the disambiguation process to return meaningful data. The distribution of the number of senses found by the algorithm per word is given in the following table:

number of senses      1      2      3     4     5     6     7     8
number of words   28135   6221   1156   409   178    85    45    32

The scope of the corpus is thematically very narrow, and hence we would expect that only some meanings of words are represented. The analysis, however, shows that multiple meanings can be retrieved for a significant number of words with the described algorithm.

Parallels between the algorithm and spreading activation

The parallel between the mechanism of spreading activation and our approach becomes more obvious if we map the psycholinguistic terminology onto the corresponding processes in the algorithm. In the above example it has been shown which words can be reached with one step in the graph (i.e. which are directly connected to a given input word). If a counter is added to each word and increased each time the word is reached, then the model can imitate the activation spreading through association networks by broadcasting an amount of "energy" over the neighbours: more to those which are found directly via overlapping clusters, less to their neighbours, and even less to the neighbours of the neighbours, and so on. A special case of this model would be to spread "energy" only to the immediate neighbours and only to those which are situated in overlapping clusters. If a word is reached several times, the sum increases accordingly, imitating the accumulation of energy in a node.
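A small sketch of this one-step special case, reusing the sentence_topics sketch from above. The particular weighting (one unit of energy per attached input word) is our assumption, chosen only to make the counter idea concrete; the text fixes only the one-step restriction:

```python
from collections import Counter

def utterance_energy(heard_words, disambiguate, overlap=0.5):
    """One-step spreading: only words inside topic clusters that survive
    the overlap comparison receive energy.  Every word of a cluster K_m
    collects one unit per input word attached to it (len(W_m)), so a
    word reached via several inputs accumulates energy accordingly."""
    energy = Counter()
    for W_m, K_m in sentence_topics(heard_words, disambiguate, overlap):
        for word in K_m:
            energy[word] += len(W_m)
    return energy
```

Called again after each newly recognized content word, the returned counter plays the role of the accumulating activation levels shown in the table below.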


In the following table the development of the energy sums is shown as the sentence proceeds from the word star on. The bottom line of the table ('#') gives the count of words which are found after merging the similar clusters (i.e. those which overlap) of all the words of the sentence uttered so far (again excluding the stop words), or, in other words, the total number of words affected by the activation. The table can be read as follows: the 4 in the upper left field means that the word topology had received activation energy 4 times at the point after star had been uttered. As can be seen, the topic-specific words, especially hub and nodes, receive more and more activation over time, while others remain unactivated. It would fit our argumentation better if disjoint had been activated as well, since we consider it appropriate for the given topic, but disjoint was too rare in the corpus to have any significant co-occurrences.

            star  topology  turn  play  role  hub  disjoint  set  leaf
topology       4         -     -     -     -    -         -    -     -
turn           0         0     -     -     -    -         -    -     -
play           0         0     0     -     -    -         -    -     -
role           0         0     0     1     -    -         -    -     -
hub            3         4     4     4     4    -         -    -     -
disjoint       0         0     0     0     0    0         -    -     -
set            0         0     0     0     0    0         1    -     -
leaf           1         1     1     1     1    1         1    1     -
nodes          3         4     4     4     4    5         5    5     6
#            388       708   708   728   728  728       751  751   751

Most words from the general phrase 'turns out to play a role' were not activated either, because they are not specific to the given context/topic. It is assumed that expressions and phrases with a general meaning and high frequency are not significantly pre-activated by topic-specific words in the human mind either, but that their activation threshold is very high, so that they can be selected very quickly once they are activated. (Theoretically, these expressions/phrases would have to be pre-activated by any word in the lexicon, because they can co-occur with them all, but, importantly, there is no associative relationship between them.) On the other hand, due to the disambiguating effect of the clustering, the words from the mentioned phrase did not activate any inappropriate words, because their context sets did not overlap. It is also important to note that at this point no stemming was used at all. That means that play is not activated even if played or playing occur in the data. We assume that stemming could improve accuracy significantly, as the dispersion caused by the various word forms of the same lemma would disappear. Furthermore, it can be seen that after the word topology the number of activated words no longer rises significantly, which means that all clusters relevant for the topic of this sentence in the whole corpus have been reached and activated.


Further research

Evaluating the results of the mechanism described above is an inherently complex task. While measuring precision is not very difficult, measuring recall causes problems. First, it is important to know how many relevant words are activated, compared to how many exist in the corpus. Second, it is even more important to measure how many topics the algorithm finds compared to how many there are in the corpus. Another task would be to extend the model so that it is not based only on the simplified one-step energy spreading. Open problems at this point are how to treat polysemous but unfitting clusters and how exactly the energy should be spread. Further, this mechanism could be implemented as part of a speech recognition system. Finally, it is noteworthy that this model can be generalized so as to be applicable in, e.g., error correction systems. Such systems generate candidate lists of words for a presumably misspelled word based on edit distance and a dictionary. Weighting words semantically might prevent such systems from offering candidates which are completely out of context, and instead provide words which are semantically related although they have a larger edit distance.
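As a purely hypothetical illustration of such a weighting (nothing here is part of the described system; the linear combination and the `alpha` trade-off are assumptions of this sketch):

```python
def rerank_candidates(candidates, energy, edit_distance, alpha=1.0):
    """Re-rank the edit-distance candidates of a misspelling by
    subtracting their accumulated activation energy, so that a
    semantically pre-activated word can outrank a closer but
    off-topic candidate.  `energy` is a dict/Counter such as the one
    returned by the utterance_energy sketch above."""
    return sorted(candidates,
                  key=lambda w: edit_distance(w) - alpha * energy.get(w, 0))
```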

References:

[Banerjee 2002] Banerjee, S.: Adapting the Lesk Algorithm for Word Sense Disambiguation to WordNet. Department of Computer Science, University of Minnesota, Duluth, Minnesota 55812, 2002
[Bordag 2002a] Bordag, S.: Vererbungsalgorithmen von semantischen Eigenschaften auf Assoziationsgraphen und deren Nutzung zur Klassifikation von natürlichsprachlichen Daten. Diplomarbeit, Universität Leipzig, Institut für Mathematik und Informatik, 2002
[Bordag 2002b] Bordag, S.: Sentence Co-occurrences as Small-World Graphs: A Solution to Automatic Lexical Disambiguation. In: A. Gelbukh (ed.): CICLing 2003, LNCS 2588, pp. 329-332, Springer-Verlag, Berlin Heidelberg, 2003
[Caramazza 1997] Caramazza, A.: How many levels of processing are there in lexical access? Cognitive Neuropsychology, 14, pp. 177-208, 1997
[Dell 1986] Dell, G. S.: A spreading-activation model of retrieval in sentence production. Psychological Review, 93, pp. 231-241, 1986
[Garrett 1980] Garrett, M. F.: Levels of processing in sentence production. In: B. Butterworth (ed.): Language Production: Vol. 1. Speech and Talk. San Diego, CA: Academic Press, pp. 177-220, 1980
[Heyer et al. 2001] Heyer, G., Läuter, M., Quasthoff, U., Wittig, Th., Wolff, Chr.: Learning Relations using Collocations. In: A. Maedche, S. Staab, C. Nedellec, E. Hovy (eds.): Proc. IJCAI Workshop on Ontology Learning, Seattle, WA, 19.-24. August 2001
[Kleinberg 2000] Kleinberg, J.: The small-world phenomenon: An algorithmic perspective. Proc. 32nd ACM Symposium on Theory of Computing, 2000
[Lesk 1986] Lesk, M.: Automatic sense disambiguation using machine readable dictionaries: How to tell a pine cone from an ice cream cone. In: Proceedings of SIGDOC '86, 1986
[Levelt 1989] Levelt, W. J. M.: Speaking: From Intention to Articulation. Cambridge, MA: MIT Press, 1989
[Levelt et al. 1999] Levelt, W. J. M., Roelofs, A., Meyer, A. S.: A theory of lexical access in speech production. Behavioral and Brain Sciences, 22, pp. 1-75, 1999
[Läuter et al. 1999] Läuter, M., Quasthoff, U.: Kollokationen und semantisches Clustering. GLDV-Tagung, 1999
[Patwardhan et al. 2003] Patwardhan, S., Banerjee, S., Pedersen, T.: Using Measures of Semantic Relatedness for Word Sense Disambiguation. In: A. Gelbukh (ed.): CICLing 2003, LNCS 2588, pp. 241-257, Springer-Verlag, Berlin Heidelberg, 2003
[Quasthoff et al. 2002] Quasthoff, U., Wolff, Chr.: The Poisson Collocation Measure and its Applications. Proc. Second International Workshop on Computational Approaches to Collocations, Wien, 2002
[Rosso et al. 2003] Rosso, P., Masulli, F., Buscaldi, D., Pla, F., Molina, A.: Automatic Noun Sense Disambiguation. In: A. Gelbukh (ed.): CICLing 2003, LNCS 2588, pp. 273-276, Springer-Verlag, Berlin Heidelberg, 2003
[Steyvers & Tenenbaum 2002] Steyvers, M., Tenenbaum, J. B.: The large-scale structure of semantic networks: statistical analyses and a model of semantic growth. Cognitive Science, 2002