Proceed. of the Intern. Conf. on Computer Linguistics, Speech and Document Processing, February 18-20, Calcutta (1998).

Representation issues in spoken word recognition: An account for the repetition effect

Agnès Lainé*, Dominique Béroule**, Phillip Dermody***
* Post-Doc at SEDAL (University of Sydney)
** CNRS, visiting scholar at SEDAL
*** University of Newcastle (NSW), visiting scholar at SEDAL

Abstract

The present paper speculates that the effects of stimulus repetition may underlie a number of critical psycholinguistic phenomena that have been investigated in attempts to understand the organisation of human speech processing and lexical organisation. A model is propounded to account for the repetition effect, which complements Stimulus-Response theory with a Context component and can be implemented as a Guided Propagation Network (GPN). To explore the operation of the model for speech processing, the investigation focuses on a particular speech processing ability in which human listeners demonstrate a cumulative identification function for the presentation of brief speech segments. The GPN model produces a function similar to that of human listeners, including better performance on unvoiced than on voiced segments of stop consonants. The model is then used to demonstrate how higher activation thresholds, simulating limitations in repetition effects, lead to poorer performance on the speech processing task. The effects suggest a plausible simulation of repetition effects that might be used to model human identification of speech in noise, as well as developmental effects (differences in speech identification between children and adults) that could be the result of differential stimulus repetition similar to the effects shown in our modelling.

I. Introduction

Everyone who has attended a cocktail party in a foreign country has experienced the difficulty of understanding non-native speech in a noisy environment... and the subsequent relief when an interlocutor is found who speaks one's native language. In both situations, the level of background sound may have remained the same; the only difference between them lies in the greater familiarity of the native language. Several factors can be put forward to account for the natural enhancement of a well-known language, including more adaptable identification of speech patterns, better anticipation, and better use of higher linguistic knowledge gained from experience with the language. More precisely, considering the spectral image of a compound signal that reaches the ears of a cocktail-party participant, some regions of the spectrum are masked by noise. There should be something inside the perceptual system to compensate for this loss of input information, to reconstruct a full pattern from subparts of it: something that may be called reconstruction of an "internal representation" from partial information. This compensation would only work with familiar speech items that have had their internal representations constructed by repetition of similar patterns, particularly those of the native language. This paper concerns the nature of internal speech representations, and the way they could be shaped by repetition to exhibit the robust recognition illustrated by the cocktail-party effect. The notion of "internal representation" has been widely addressed in experimental psychology. A discussion of the related models of "lexical access" will be presented, with an emphasis on the repetition effect. Our proposed formulation, which involves the computational approach of Guided Propagation Networks (GPNs), will then be presented. Experimental results concerning the identification of consonants by subjects from their very beginning will be compared with the GPN computer simulation performance.


II. The repetition effect in psychological models

II.1. Demonstrations of repetition effects in lexical processing

The investigation of repetition effects in psychological studies has mostly been directed at the repetition priming effect, which describes facilitation in the speed or accuracy with which a word is read (or heard) produced by prior presentation of that word. The temporal course of the facilitation can be short (over a few seconds) or may persist over longer periods (e.g. days). For example, a reader may be presented with a list of written words that have to be named as fast as possible (a speeded naming task) or identified under ambiguous conditions (such as spoken words at a cocktail party). If the reader/listener has previously been presented with a word that needs to be named or identified, a facilitatory effect on performance for the primed word is demonstrated. Another frequently investigated psycholinguistic effect is that more familiar or frequently presented words consistently produce a facilitation in processing compared to less familiar or statistically infrequent words in the lexicon of a language. This word frequency effect might be considered a very long term product of repetition priming. Finally, a great deal of effort in psycholinguistic studies has concentrated on lexical decision tasks, in which people are presented with written or spoken words from their lexicon along with strings that represent reasonable approximations to words but are not actually part of the lexicon. This experiment produces the word superiority effect, in which actual words are judged to be words more quickly than non-words are judged to be non-words. In addition, statistically frequent words from the lexicon are judged more quickly than less frequent words (that is, word frequency effects are also demonstrated in lexical decision tasks).
Investigations of repetition priming have involved a number of tasks, including lexical decision tasks (in which words in a lexicon are presented along with non-words and a person responds as quickly as possible to indicate whether the presented string constitutes a legitimate word in the lexicon); speeded naming; perceptual identification of ambiguous strings; fragment completion (in which only a part of the word is presented and must be completed as quickly as possible); and stem completion, which fragments the word at a linguistic juncture related to its stem. These investigations have shown that primed words are identified more quickly and more accurately than words that have not been repeated. A related set of studies has demonstrated that listeners respond accurately to a word once the presentation of the word has included its unique identification point, the point at which it is uniquely discriminable from other words (e.g. [1]). The unique identification point effect has been used to form a major lexical processing model known as the Cohort Model [2], based on the demonstration that word identification occurs when the unique identification point is reached (the point at which the word's cohort of candidates reduces to one), which is typically significantly shorter than the full duration of the word required to present all of its phoneme structure. Given the ubiquity of repetition priming effects, one speculative explanation of the unique identification point effect is that it represents a combination of 1/ the word familiarity effect (produced by having had the word repeated many times during lexical acquisition) and 2/ the effects observed for fragment completion in repetition priming studies, where words that have been repeated are completed more accurately and faster than non-repeated words. The reason these effects are produced is that stimulus repetition over many instances through experience with the language produces the ability to shorten the critical time required to process the stimuli (an effect often observed in short-delay repetition priming studies). In summary, it would seem that repetition effects might be used to explain a wide range of lexical processing phenomena, ranging from the word superiority effect, to easier identification of familiar/frequent words (the frequency effect), to word identification points based on cohorts. The present paper investigates possible mechanisms of the repetition effect and their effect on speech sound processing.

II.2. Repetition effects in speech sound processing

The majority of studies have investigated repetition effects for lexical items and have involved the presentation of the initial item (the prime) with a delay produced by the presentation of intervening potential targets (foils). However, there are also a number of studies that provide evidence for repetition effects in speech processing in which the repetition effect can be observed by immediate presentation of the target items. These studies have mainly focused on spoken word identification (in contrast to the delayed repetition priming experiments, which have mostly been conducted with written stimuli) and have demonstrated the effect in


ambiguous listening conditions such as the cocktail party effect. For example, in perhaps the earliest study showing repetition effects in spoken word identification, Miller, Heise and Lichten [3] showed that if listeners are either given automatic repetition of a word or are able to request repetition, facilitation effects are clearly demonstrated. Repeating a word three times before requiring identification, compared to responding to the word after only one presentation, produced a difference in recognition threshold of about 2.5 dB for monosyllabic words and about 2.0 dB for words presented in sentences. Therefore, in terms of cocktail party effects there would be an improvement of about 2 dB for words that were repeated. Similar results were also presented by Thwing [4] and Stuckey [5]. These studies suggest that the human information processing system is sensitive to repetitions of stimuli and can use repetition to aid speech and lexical processing. The question remains what mechanisms might be operating to take advantage of stimulus repetition effects. The present paper proposes one view of the underlying principles of this mechanism and describes a model that might be used to investigate the mechanisms in more detail.

II.3. A neo-behaviourist interpretation of the repetition effect

When the physiologist Ivan Petrovitch Pavlov started his study of digestion processes by measuring the amount of salivation elicited by the presence of food in a dog's mouth, he did not know that he would face a serious complication: the response of the dog sometimes occurred earlier than expected, before food was placed in its mouth. This unexpected phenomenon initiated a series of experiments known as "conditioning", in which a neutral stimulus (S1, a bell sound) becomes significant (entails a response R, salivation) as it is repetitively presented shortly before an unconditioned stimulus (S2, food) that automatically produces R. This provided experimental evidence for the behaviourist theory, according to which learning can be reduced to a strengthening of associations between a stimulus and a response: in Pavlov's experiment, it is repetition that reinforced the association between the bell sound and the salivation response. Another interpretation of this conditioning experiment involves the notion of the internal Context in which a Stimulus occurs. Instead of being directly linked with a stimulus, a response could be mediated by a contextual input that conveys a history of previous significant stimuli. In the example reported above, this contextual input would be the internal response (R1) caused by the bell sound (S1), which would combine with the food stimulus (S2) so as to trigger the response R2 (see Fig. 1). Following several pairings of the bell and food, the contextual internal input would become strong enough to elicit some salivation, even in the absence of food.

[Figure 1 schematic: S1 (sound) in context C0 → R1 ; S2 (food) in context C1 → R2 (salivation)]

Figure 1: Hypothetical internal representation suggested by Pavlov's conditioning experiment. The first stimulus S1 occurs in a neutral context and causes an internal response R1. If S2 occurs before R1 has vanished, an internal connection is created between R1 and the internal response triggered by S2. This connection may be reinforced through several repetitions of R1 and R2. The wide arrows convey the contextual information, which may thus propagate towards R2 without requiring a contribution from S2.

In this view, the internal response produced by the first stimulus may be able to activate in advance the response associated with the second one, as if the latter had actually occurred (the bell sound would then have activated an internal representation of food). It may be noticed here that the proposed time-dependent internal representation fits the fact that S1 should occur 0.5 to 1 second before S2 to obtain the best effect (the principle of "temporal contiguity"). Before turning to a more formal description of our underlying model, it can be put forward that this anticipated response may involve two factors: 1/ a decrease of the response thresholds, 2/ an increase of the strength of the contextual input.
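As an illustrative sketch (not part of the original model description), the second factor can be mimicked by a single response unit whose contextual weight grows with each bell-food pairing, until the contextual input alone crosses the response threshold; all names and numerical values below are our own assumptions:

```python
# Hypothetical sketch of the conditioning account: the response R2 is driven
# by a stimulus input S2 (food) and a contextual input C1 (the internal
# response to the bell). Repeated pairings strengthen the contextual
# transmission factor until C1 alone can elicit R2.

def response(s2: float, c1: float, context_factor: float, threshold: float) -> bool:
    """R2 fires when the weighted sum of stimulus and context exceeds the threshold."""
    return s2 + context_factor * c1 >= threshold

context_factor = 0.1          # weak initial bell->salivation association
threshold = 1.0               # at first, roughly a full stimulus is required

# Conditioning: each bell+food pairing reinforces the contextual factor.
for pairing in range(10):
    if response(s2=1.0, c1=1.0, context_factor=context_factor, threshold=threshold):
        context_factor += 0.1  # Hebbian-style reinforcement on each joint firing

# After conditioning, the bell alone (s2 = 0) can elicit the response.
print(response(s2=0.0, c1=1.0, context_factor=context_factor, threshold=threshold))
```

The same sketch could equally implement the first factor by decrementing `threshold` on each pairing; both manipulations move the unit towards context-driven responding.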

II.4. Consequences for speech recognition

From a Pattern Recognition point of view, S1 and S2 can be considered as forming a temporal pattern whose identification corresponds to their associated response R. By analogy with the conditioning experiment, repetitions of this pattern may lead to its early identification, possibly induced by the recognition of S1, immediately followed by the anticipation of S2. Assuming that perceptual memory traces keep track of series of stimuli in the same way, the reinforcement process induced by repetition could also be applied to our present purpose: the identification of speech items. After having been stabilised by many repetitions, a speech unit (phoneme, syllable, word) could be identified before its full completion, in the same way as S1 was sufficient to elicit R in the conditioning


experiment. This question has already received positive answers from experimental psychology, such as the existence of a "unique identification point" in words, as described in II.1. A challenging point concerns the lower limit of this phenomenon: what is the shortest speech segment at the beginning of a phoneme that allows the latter to be identified? This duration is probably linked with the familiarity, frequency and number of repetitions of the phoneme, that is, the amount of reinforcement it underwent. But if the answer to the above question were that a well-reinforced phoneme could be recognised thanks to a few milliseconds of its very beginning, this would shed light on the crucial topic of speech recognition in noisy environments. It would also provide an explanation for the fact that a foreign language, which is not as reinforced as the native one, appears to be much harder to identify in a noisy environment.

III. A behaviourist presentation of Guided Propagation Networks

III.1. From the C&S-R schema to the complete architecture

A major criticism of Stimulus-Response (S-R) theory concerns its inadequacy to account for the knowledge-driven nature of learning processes, even in the absence of input stimuli. Apart from a few stimuli that entail pre-wired responses whatever the context, the great majority of learned responses rely on the context-dependent interpretation of stimuli. This is why the C&S-R schema has been proposed as a substitute for S-R, where the letter C stands for a Contextual internal input [6]. C&S-R expresses that R depends on two factors occurring together: an internal Context, and an environmental Stimulus.

[Figure 2 schematic: (a) chained S-R pairs S1 → R1, S2 → R2; (b) C0 & S1 → R1, then R1 combined with C1 → R2; (c) C0 & S1 → R1 = C1, then C1 & S2 → R2]

Figure 2. Compared combinatorial possibilities of the S-R (a) and C&S-R (b, c) schemata. A Stimulus, Context or Response is represented by its indexed capital first letter. An oriented association between two items is represented by an arrow. When equivalent, two terms are linked by a segment. The only way to combine S-R elements is to consider the output of one of them as the input of the following one (a). In (b), the output R1, driven by Context C0, is combined with another context C1 to elicit R2. In (c), R1 is the context input C1 in which S2 elicits a response R2.

In certain conditions, only one of the two factors may elicit a Response. When this is S, our schema reduces to the classical S-R formulation. When C does not require the simultaneous contribution of S to generate R, this corresponds to the generation of a context (knowledge)-driven response. The main advantage of this three-term schema lies in the richer set of combinations it offers, whereas S-R items can only be chained, as shown in Fig. 2a. As a matter of fact, a complete computational architecture can be built by combining elementary C&S-R bricks, considering the output R of a brick as either the Stimulus (Fig. 2b) or the Context input (Fig. 2c) of other bricks. Thanks to this combinatorial flexibility, a given stimulus may trigger different responses, R2 or R3, in two different contexts, C1 and C2.

C1 & S → R2 ;  C2 & S → R3   (I)

In a similar way, two responses R1 or R2 may occur in the same context C, depending on the current stimulus S1 or S2.

C & S1 → R1 ;  C & S2 → R2   (II)

Once a Response has occurred, it may be considered as an updated internal context for a following stimulus, which brings a temporal dimension into our model. The following schema is obtained by representing the equivalent items Ri and Ci of Fig. 2c as merging into a single symbol Ci. It may be noticed that we recover here the hypothetical internal representation proposed for the conditioning experiments.

C0 & S1 → C1 ;  C1 & S2 → C2 ;  C2 & S3 → C3   (III)
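Schema (III) can be illustrated by a toy sequence recogniser (our own sketch, not taken from the paper; the symbol names follow the schema, everything else is invented). Each step fires only when its stimulus arrives while the previous context is active, so the chain as a whole recognises the ordered sequence S1, S2, S3:

```python
# Minimal sketch of temporal chaining in the C&S-R schema: the response of
# each step becomes the context of the next, so only the expected ordered
# sequence of stimuli propagates all the way to the final context C3.

def recognise(sequence, expected=("S1", "S2", "S3")) -> bool:
    context_active = True  # C0: the neutral starting context is always on
    for unit, stimulus in zip(expected, sequence):
        # C&S-R: each response needs both the active context and the right stimulus.
        context_active = context_active and (stimulus == unit)
    return context_active and len(sequence) == len(expected)

print(recognise(["S1", "S2", "S3"]))  # ordered sequence: recognised
print(recognise(["S2", "S1", "S3"]))  # wrong order: the context never propagates
```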

By using (I), (II) and (III), we may obtain the following example structure, which integrates combinations of stimuli (at the top) for generating responses (at the bottom):


[Schema (IV): stimuli S1, S2, S3 (at the top) feed context units C0, C01, C03, C013, C031, C032, C0132, which generate responses R1, R2, R3 (at the bottom)]

As shown in Fig. 2b, an internal response located at the output of the above structure (or module) can form the stimulus of other C&S-R items, contained in deeper modules of the architecture. Moreover, by assuming that several parallel analyses might contribute to the accurate interpretation of stimuli, several modules can work simultaneously within each layer of the architecture. Given that such modules can also be used to generate patterns, we are led to a complete computational architecture based on coincidence detection [6].

III.2. Implementation of C&S-R in a topological memory

Instead of being distinguished by a specific code or index, the symbols (S, C, R) involved in the C&S-R schema can be differentiated by their location within a physical substratum: for instance, by their coordinates in the above two-dimensional figures. Note that it is not their actual position in space (and time) which is significant, but the position they occupy relative to each other, hence the topological nature of the underlying memory model (Fig. 3).

[Figure 3 schematic: two equivalent C&S-R schemes, C(t1) & S(t2) → R(t3) and C(t1+T) & S(t2+T) → R(t3+T)]

Figure 3. Topological equivalence of two C&S-R schemes whose symbols hold different positions. R may respond to the co-occurrence of S and C as long as the temporal (T stands for a time interval) and spatial (oriented links) relationships between them are preserved.

In order to implement the C&S-R schema in a topological memory, a computational operator is needed for detecting the co-occurrence of S and C, and for taking decisions concerning the related generation of R. Thanks to the topological representation it would support, an operator can be located anywhere in the machine, provided the ability to "connect" with any other one is available. A basic processing unit would thus own two inputs and a decision (or response) threshold. At the input level, Transmission Factors allow the S and C contributions to be regulated relative to each other (in most implementations, only the contextual factor F is variable, while the stimulus input does not undergo any weighting); time-delays permit R to be elicited even when S and C occur at different times. To deal with the uncertainty associated with the instant at which either C or S occurs, inputs remain active for a certain tolerance time-interval, in the same way as people stay for a certain time at an appointment location, waiting for each other [7]. An easy way of detecting the coincidence of S and C associates propagating signals of a certain amplitude (A) with both symbols. The operator always sums its inputs, so that if both the C and S signals reach it, the resulting amplitude will be equal to 2*A. By using a response threshold set above A and below 2*A, one obtains an operator that only responds to the pair of S and C signals, after they have been synchronised by time-delays (Fig. 4). The response threshold is simply considered as a ratio of the maximum instantaneous input to the unit. By introducing the unit Excitability E, the threshold value is set to Amax/E. For instance, with an Excitability of 3, only one third of the maximum input will be sufficient to trigger a response. Above the threshold, the amplitude of the output signal varies linearly with the total input.
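The coincidence-detecting operator just described can be sketched in a few lines (a minimal illustration under our own assumptions; only the relations A < threshold = Amax/E <= 2*A are taken from the text):

```python
# Sketch of the basic processing unit: both inputs carry amplitude A, the
# unit sums them, and the response threshold is Amax / E. With E just below
# 2, the threshold sits between A and 2*A, so only the S+C coincidence fires.

A = 1.0           # amplitude of a propagating signal
A_MAX = 2 * A     # maximum instantaneous input (S and C together)

def unit_output(s: float, c: float, excitability: float) -> float:
    """Above threshold, the output grows linearly with the total input."""
    total = s + c
    threshold = A_MAX / excitability
    return total if total >= threshold else 0.0

# Restricted propagation: a single input is not enough.
print(unit_output(A, 0.0, excitability=1.9))  # 0.0: S alone is sub-threshold
print(unit_output(A, A,  excitability=1.9))   # 2.0: S and C coincide
```

With an Excitability of 3, as in the text, the threshold drops to one third of the maximum input, and either input alone triggers a response.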

[Figure 4 schematic: S and C converge on a coincidence-detecting operator producing R; signals a(t1) and a(t2) combine into a(t3)]

Figure 4. Implementation of the C&S-R schema by using an operator that detects the co-occurrence (fuzzy coincidence) of S and C, represented by signals occurring at characteristic times. If t2 > t1, the contextual input is delayed by t2-t1 at the level of the connection, and t3 follows t2.

A response threshold, a contextual factor, input time-delays and tolerance intervals are set when a new processing unit is brought into play to represent a new S-C occurrence. The related Differentiation mechanism runs while the system is working, which immerses learning in the processing context. A new coincidence detector is thus created in the course of processing, at the intersection between the C and S flows of internal signals. This continuous growing process requires a focussed contextual flow to indicate the exact location where a new unit should appear. The width of the internal flow depends on response thresholds, contextual transmission factors and tolerance intervals. In the "Restricted propagation" mode, there is no time-tolerance, the contextual flow is equally balanced with the stimulus input, and the response threshold is high (see Fig. 5). But
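The differentiation mechanism can be caricatured as follows (a hypothetical sketch; the keying of units by a (context, stimulus) pair and all parameter values are our own simplifications, not the paper's implementation):

```python
# Rough sketch of differentiation: when an active contextual flow meets a
# stimulus that no existing unit accounts for, a new coincidence detector is
# grown at that point, born with "restricted propagation" parameters.

units = {}  # (context_id, stimulus_id) -> parameters of the detector

def differentiate(context_id: str, stimulus_id: str) -> None:
    key = (context_id, stimulus_id)
    if key not in units:
        units[key] = {
            "excitability": 1.9,      # high threshold: both S and C required
            "context_factor": 1.0,    # context balanced with the stimulus
            "tolerance": 0.0,         # no time-tolerance at birth
        }

differentiate("C0", "S1")   # first occurrence: a detector is grown
differentiate("C0", "S1")   # repetition: no duplicate unit is created
print(len(units))           # 1
```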


as a memory trace is reactivated by repetitions of its associated series of stimuli, a reinforcement mechanism tends to move the unit's behaviour towards other functioning modes. In the "Extended propagation" mode, a unit can respond more quickly to its contextual input, without waiting for its stimulus. This feature accounts for the repetition effect, through which the response of a unit is facilitated and possibly triggered in advance by its contextual input.

[Figure 5 plot: Excitability E (vertical axis, 1 to 4) against contextual transmission Factor F (horizontal axis, 1 to 4); the curves E = F + 1 and E = 1 + 1/F separate areas 1 to 4; the repetition effect moves the functioning point A up to B]

Figure 5. Functioning modes of an elementary processing unit, depending on two of its parameters: the contextual transmission Factor F, and the Excitability E. At birth, a unit is set in the "Restricted propagation" area (1), where both the S and C inputs are required for R to be triggered. The subsequent increase of Excitability caused by repetitions of S drives the unit's functioning point up to areas (2: "Extended propagation") and then (3: "Free propagation"), where C alone can elicit a response. For instance, the functioning point A is moved up to point B. Area (4: "Forced propagation") corresponds to a functioning mode in which the unit can respond to S alone. The response threshold can never be reached in the hatched area (E < 1).

III.3. GPNs as a speech perception model

Now that the architecture and the basic processing units of the model have been described, its application to real-world tasks relies on being able to map actual environmental signals onto space-time distributions of "micro-stimuli" that feed the memory input. The advantage of speech over other signals lies in its inherent temporal dimension. For obtaining a spatial distribution of speech-related stimuli, a spectral transformation can be followed by enhancement functions inspired by the peripheral auditory system [8]. After lateral inhibition has been carried out between spectral channels, a short-term adaptation function detects significant variations of

energy across time within every channel. Whether positive ("onset") or negative ("offset"), the time-space location of each detected variation corresponds to a micro-stimulus. A binary piece of information concerning the voiced/unvoiced nature of the signal can be provided by a pitch detector and attached to the stimulus. Considering dual sets of sensors (inner and outer hair cells) as a natural solution to the poor accuracy of a single spectral analysis, micro-stimuli are extracted from two complementary analyses (narrow-band and wide-band). Due to the short-term adaptation function, which enhances the transient regions of the spectrum, micro-stimuli are grouped in regions between stable states. Two consequences result from the fact that modifications of the speech rate mainly influence the stable parts of the speech, where no stimuli are detected by the enhancement functions:
- the number of micro-stimuli remains stable whatever the duration of a word, which eases the identification process;
- a single processing unit (Dissyllable Detector) can be associated with a group of micro-stimuli.
In order to preserve significant features of the spectral landscape, the frequency-time locations of spectral maxima constitute another set of micro-stimuli, notably used in the experiments reported in this paper. The resulting architecture (Fig. 6) is composed of 3 banks of (64) detectors distributed across the narrow-band analyser, plus a bank of (4) onset-detectors for the broad-band analyser. A second layer of detectors (convergence units) is aimed at grouping neighbouring micro-stimuli, which are transformed into purely temporal events by a scanning process. As a matter of fact, the time-delays imposed by the basilar membrane on speech signals [9] have recently been proposed as a way of distributing spectral events across time, letting the GPN deal with frequency variations [10].
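The onset/offset extraction in one spectral channel can be sketched as a per-frame energy-difference detector (an illustration only: the channel data and the 0.2 change threshold are invented, and the real front-end operates on lateral-inhibited spectra):

```python
# Hedged sketch of the short-term adaptation stage: per-channel energy
# differences mark "onset" (positive) and "offset" (negative) micro-stimuli;
# stable regions of the spectrum yield no micro-stimuli at all.

def micro_stimuli(energy, delta=0.2):
    """Return (time index, 'onset'|'offset') events for one spectral channel."""
    events = []
    for t in range(1, len(energy)):
        change = energy[t] - energy[t - 1]
        if change > delta:
            events.append((t, "onset"))
        elif change < -delta:
            events.append((t, "offset"))
    return events

channel = [0.0, 0.1, 0.9, 0.9, 0.9, 0.2]   # a burst, then a stable vowel-like region
print(micro_stimuli(channel))              # [(2, 'onset'), (5, 'offset')]
```

Note how the stable middle frames contribute nothing, which is exactly why the number of micro-stimuli stays constant under speech-rate changes.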
Within the spectral region viewed by one of these detectors, frequency variations are transformed into time variations that are handled by Dissyllable (D) Detectors forming a third layer of cells. D-Detectors are created when a new combination of micro-stimuli has to be preserved in the course of processing, as is the case for the deeper parts of the network. Although not used in the following experiments, a deeper module receives the activity delivered across time by the D-Detectors, which corresponds to words. Because of the differentiation learning mechanism, this lexical module grows as a tree-like structure compatible with the Cohort model previously mentioned: when the identification point of a word is reached, there are no


diverging branches left to be activated in parallel by the internal flow. The latter follows only one direction, and may thus be strong enough to reach the word detector before the word's actual completion. This is another consequence of repetition that remains to be investigated.

IV. Experimental and computational results

IV.1. Experiments on subjects

The present results use the experimental method proposed by Dermody, Mackie and Katsch [11], which employs spoken stop consonants recorded in citation form and edits cumulative speech segments from their onset out to approximately 30 ms of duration (segment 1 = 1 ms; segment 2 = 2 ms, etc.). These segments or speech gates are presented in randomised-order trials to listeners, who are asked to judge which speech sound occurred on each trial. Listeners are given a closed set of 6 response alternatives /pa, ta, ka, ba, da, ga/ from which to choose their response. The resulting performance-duration function (% correct as a function of stimulus duration) indicates that listeners demonstrate a continuously incrementing identification function that achieves better than chance performance at quite short durations. Figure 7 shows the results averaged over 10 listeners on a speech gating task that presented speech gates from 4 male speakers. This curve demonstrates the characteristic increasing identification performance that is consistently seen in speech gating tasks. The figure shows the average result for both the unvoiced speech sounds (/pa, ta, ka/) and the voiced speech sounds (/ba, da, ga/). In general, however, unvoiced speech sounds are initially perceived more accurately than voiced speech sounds in the speech gating task. This can be explained by noting that unvoiced sounds provide a reasonably homogeneous spectral representation over about 70 ms from onset, while the "consonant" portion of voiced sounds is very short (7-15 ms) before it is strongly influenced by the vowel spectrum. The joint consonant-vowel spectrum of voiced sounds also shows an increasing continuous identification function, but greater ambiguity in the signal produces less accurate performance for the shorter-duration stimuli.
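The gating procedure above amounts to slicing cumulative onset segments from each recorded token; a minimal sketch follows (the 16 kHz sampling rate, and hence 16 samples per millisecond, is our assumption, not stated in the paper):

```python
# Sketch of speech-gate construction: segment k covers the first k
# milliseconds of the token, so the gates grow cumulatively from the onset.

SAMPLES_PER_MS = 16  # assumes a 16 kHz sampling rate (our assumption)

def gates(token, max_ms=30):
    """Return the list of cumulative onset segments, from 1 ms to max_ms."""
    return [token[: k * SAMPLES_PER_MS] for k in range(1, max_ms + 1)]

token = list(range(30 * SAMPLES_PER_MS))   # stand-in for a 30 ms recording
segments = gates(token)
print(len(segments))        # 30 gates per token
print(len(segments[0]))     # 16 samples: the 1 ms gate
```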

IV.2. Experiments on a GPN computer simulation

The current results involve a learning session on a corpus of six stop consonants (duration 30 ms) uttered once by each of the four speakers. Only 24 segments are needed to build the network. The recognition tests are carried out on a shortened version of the same segments

(duration 29, 28, 27, ... 1 ms) for the same four speakers that were used in the human identification experiments. The recognition rate is based on 696 segments. For this task, the GPN performance (Fig. 8a) is very similar to the human one (Fig. 7), although the voiced/unvoiced detection (A.M.D.F.) is not reliable for segments shorter than 15 ms. Figure 8b shows that unvoiced segments are better identified than voiced segments. This is particularly true for the shorter-duration segments, an effect also observed in human speech perception [12]. Recognition rates in Fig. 9 clearly benefit from a repetition effect.

V. Conclusion

The following general conclusions can be drawn from this work:
1. Stimulus repetition effects might be used to explain a broad range of perceptual and cognitive organisation effects for speech sound and lexical processing.
2. Stimulus repetition effects can be modelled using a stimulus/context/response explanation and simulated using the Guided Propagation Network model.
3. The GPN model demonstrates processing of brief cumulative durations of speech sounds similar to human performance, including effects due to the different articulatory properties related to the voiced/voiceless distinction.
4. The GPN model can simulate repetition effects as a changing threshold for activation in the response network to speech sounds.
From these conclusions, we are planning to use the GPN model to investigate further how adult and child listeners can be differentiated on the basis of the amount of repetition (activation) operating, and how adult listeners use well-learned spectral patterns (from frequently repeated speech sounds) to process speech in noise. The initial promising results also suggest the potential value of attempting to model the identification of spoken lexical items from their uniqueness points by using the stimulus repetition model provided by the GPN.

Proceed. of the Intern. Conf. on Computer Linguistics, Speech and Document Processing, February 18-20, Calcutta (1998).

[Figure 6 diagram, components: broad-band spectrum; Onset/Offset Detectors; narrow-band spectrum; convergence cells; Dissyllable-Detectors; Context-Dependent cells; Word Detectors.]

Figure 6. Architecture of the speech processing GPN used in the computer simulation.
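The layered detectors of Figure 6 embody the paper's stimulus/context/response scheme: a context-dependent cell responds only when a bottom-up stimulus coincides with top-down context supplied by the preceding cell on its pathway. A deliberately simplified toy version of this idea (symbolic stimuli, binary context; not the network's actual coincidence mechanism) might look like:

```python
# Toy illustration of guided propagation (hypothetical; not the authors' GPN code):
# a cell fires only when its expected stimulus arrives while its context input
# is primed, so activity can only travel along a learned pathway in order.

class ContextDependentCell:
    def __init__(self, expected_stimulus):
        self.expected = expected_stimulus
        self.context = 0.0  # primed by the previous cell on the pathway

    def step(self, stimulus):
        fired = stimulus == self.expected and self.context > 0.0
        self.context = 0.0  # context is consumed after each step
        return fired

def recognise(pathway, sequence):
    """Propagate a stimulus sequence along a chain of cells; the final
    cell's firing stands in for a word-detector response."""
    pathway[0].context = 1.0  # the pathway's entry point is always primed
    for i, (cell, stimulus) in enumerate(zip(pathway, sequence)):
        if not cell.step(stimulus):
            return False
        if i + 1 < len(pathway):
            pathway[i + 1].context = 1.0  # firing primes the next cell
    return True

pathway = [ContextDependentCell(s) for s in ["b", "a"]]
print(recognise(pathway, ["b", "a"]))  # True: stimuli arrive in the learned order
```

In this toy, stimuli presented out of order fail to propagate because no cell on the pathway has been primed, which is the sequencing constraint the context component adds to a plain stimulus-response account.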

[Figure 7 plot: identification score (%) vs. gate duration (29 ms down to 1 ms in 2 ms steps).]

Figure 7. Performance-duration function for 10 listeners responding to stop consonant segments from 4 speakers.


Figure 8. GPN recognition scores for stop consonant segments, as a function of segment duration. The recognition tests are carried out on shortened versions of the learnt 30 ms segments; only the beginning of each segment is taken into account. The curve in a/ (top) shows the average result over the unvoiced (/pa, ta, ka/) and voiced (/ba, da, ga/) stop consonants. The two curves in b/ correspond respectively to the voiced and unvoiced stop consonant groups.



[Figure 9 plot: recognition score (%) vs. duration (1-29 ms); one curve per Excitability value: E = 10, E = 2, E = 1.75.]

Figure 9. GPN recognition scores using different response thresholds. When the Excitability parameter of the D-Detectors is low (E = 1.75), the recognition score stays at a low level, especially for speech segments below 15 ms. As Excitability gradually increases under the effect of repetition, performance improves. When E = 10, only 1/10th of the learned micro-stimuli are required for the D-Detector to respond.
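A minimal sketch of this threshold mechanism, under our simplifying assumption of one matched micro-stimulus per millisecond of gate duration: the D-Detector responds once the matched micro-stimuli reach 1/E of the number learned for the full 30 ms segment.

```python
# Sketch of the Excitability threshold (assumes, for illustration, one matched
# micro-stimulus per millisecond of the 30 ms learned segment).
FULL_MS = 30

def detector_fires(gate_ms, excitability):
    """A D-Detector responds when the gated segment supplies at least
    1/E of the learned micro-stimuli, i.e. gate_ms >= FULL_MS / E."""
    return gate_ms >= FULL_MS / excitability

# Shortest gate that still triggers a response, for each E shown in Figure 9:
for E in (10, 2, 1.75):
    shortest = min(g for g in range(1, FULL_MS) if detector_fires(g, E))
    print(f"E = {E}: responds from {shortest} ms gates upward")
```

Under this reading, raising E through repetition lowers the evidence needed to respond, which is consistent with the E = 10 curve in Figure 9 staying high even at very short gate durations.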

References

[1] Marslen-Wilson, W. and Tyler, L., The temporal structure of spoken language understanding. Cognition, 8, 1-71, 1980.
[2] Marslen-Wilson, W., Functional parallelism in spoken word recognition. Cognition, 25, 71-102, 1987.
[3] Miller, G., Heise, G. and Lichten, W., The intelligibility of speech as a function of the context of the test materials. Journal of Experimental Psychology, 41, 329-335, 1951.
[4] Thwing, E., Effect of repetition on articulation scores for PB words. Journal of the Acoustical Society of America, 28, 302-303, 1956.
[5] Stuckey, C., Investigation of the precision of an articulation-testing program. Journal of the Acoustical Society of America, 35, 1782-1787, 1963.
[6] Béroule, D., Vers un Connexionnisme Cognitiviste ?, in: Modèles et Concepts pour la Science Cognitive, M. Denis, G. Sabah (Eds.), Presses Universitaires de Grenoble, 109-124, 1993.
[7] Béroule, D., Management of Time Distortions through Rough Coincidence Detection. 1st European Conference on Speech Communication and Technology, Paris, 1989.
[8] Schwartz, J.L. and Béroule, D., Essai de formalisation de faits et hypothèses de physiologie concernant le traitement de l'information pour la Reconnaissance Automatique de la Parole. XVèmes JEP, Aix, 1986.
[9] Schwartz, J.L., Apports de la psychoacoustique à la modélisation du Système Auditif chez l'homme: étude des phénomènes de propagation des ondes cochléaires. PhD Thesis, INPG, University of Grenoble, 1981.
[10] Lainé, A., Architecture à Détection de Coïncidence pour la Reconnaissance de la Parole continue bruitée: gestion des distortions fréquentielles. PhD Thesis, Orsay, 1996.
[11] Dermody, P., Mackie, K. and Katsch, R., Initial speech sound processing in human word recognition. Proceedings of the 1st Australian Conference on Speech Science and Technology, Canberra: ANU Printing Service, 66-71, 1986.
[12] Kewley-Port, D., Studdert-Kennedy, M. and Pisoni, D., Perception of static and dynamic acoustic cues to place of articulation in initial stop consonants. Journal of the Acoustical Society of America, 73, 1779-1793, 1983.