Speech Recognition Training for Pre-Elementary School Language Learning

Alejandro Curado-Fuentes, Juan Enrique Agudo-Garzón, Héctor Sánchez-Santamaría
Escuela Politécnica, Avda. Universidad s/n, Cáceres 10003, Spain
{acurado, jeagudo, sasah}@unex.es

Abstract
The integration of CALL (Computer-Assisted Language Learning) in FL (foreign language) teaching at pre-elementary levels (three to five years of age) motivated the design of SHAIEX some years ago. The project consists of an adaptive hypermedia system that presents course content based on the child's particular language and learning skills. The material in SHAIEX primarily involved visual and auditory stimuli to which the child reacted via tasks requiring the use of the mouse (e.g., choosing). More recently, to address the fourth skill (speaking), a speech recognition prototype has been tested in SHAIEX. This tool, called AIRE, has been developed by our research group as an external implementation that connects to SHAIEX via a local server. Tests were conducted with a first prototype on the activity of matching characters with names, i.e., the child had to utter the correct names instead of just choosing them with the cursor. The tests led to positive results, with a high recognition rate for the utterances examined. These results motivate further training with the speech recognition architecture in the ongoing project.

Keywords: speech recognition (S.R.), SHAIEX, early age, language learning, adaptive hypermedia, sound patterns

1. Introduction

SHAIEX (Adaptive Hypermedia System in Extremadura) is a project begun in 2002 (cf. Cumbreño et al., 2006; Edwards et al., 2008), consisting of a platform with seven interactive lessons drawn from the pre-elementary school curriculum. The goal was to adapt the level of the material to the learners' skills. Thus, a three-year-old may be guided through hypermedia activities that differ from a five-year-old's in terms of both linguistic content and mouse movement skills. Should the child progress or fall behind in his/her scores, the system moves the student up or down the levels accordingly. Each lesson consists of four phases: Presentation-repetition, Interaction, Evaluation, and Catching up.

SHAIEX was also an answer to the new learning scenario that introduced both foreign languages and computers in school at an early age, in the form of hypermedia input that could suit visual, aural, and kinesthetic learning styles. The expanding European dimension of language learning at all levels also supported this approach to language at an early age (cf. Rico et al., 2006).

As text is not used at this stage, all the lessons contain a rich variety of oral input (in all four phases). The activities have the children listen to and repeat much of the input from previous dialogues and songs. Such tasks mainly require them to understand the oral input and use the mouse to complete the exercise (e.g., by choosing the elements mentioned, or matching objects and characters). The child, however, never actually interacted by producing his/her own speech in the exercises, i.e., speaking.

In 2009, collaborating with colleagues at the Mathematics Department and with a newly hired computer technician, we decided to implement our own speech recognition tool in the system. This application would have to run externally but still work within the

platform, integrated into activities in which the child says key words instead of pointing the mouse cursor at the recognized objects. As a one-year project, the tool, called AIRE, was designed, implemented, and tested. In this paper, we describe how AIRE has been integrated into SHAIEX to meet the speaking-skill demand for English learners between ages 3 and 5. The results are preliminary, as the tool has served as a first prototype that can be improved and extended to more and better activities over the years.

2. Speech recognition at an early age

Speech recognition (S.R.) poses a greater challenge with children than with adults, and even more so when the children are only between three and five years old. The main reason is that the acoustic parameters of children's speech, such as pitch and formant frequencies, are higher than those of adult speech. Because of this, when an automatic speech recognition system trained on adult speech is used to recognize children's speech, the error rate increases significantly (Wilpon and Jacobsen, 1996). The error rate of a speech recognizer trained with data from speakers of all ages increases when it is tested on speech data from children aged 12 or younger. Even if an S.R. system for children's speech uses an adequate amount of training material (i.e., data from different children) to train age-specific acoustic models, the recognition error rate reported for children is usually significantly higher than that for adults, and it decreases as children's age increases (cf. Gerosa et al., 2007). Speech clarity in fact rises over the first few years of life, reaching complete intelligibility by age 4. Gerosa et al. (2009) investigate age-independent acoustic modeling with a large vocabulary to reduce such differences between children and adults as much as possible.

Nonetheless, their population did not include any children under age 7. To develop a high-performance speech recognizer for children, several techniques have been proposed: VTLN (vocal tract length normalization; cf. Elenius and Blomberg, 2005; Zhan and Waibel, 1997; Giuliani et al., 2006), speaker-independent linear frequency warping, maximum likelihood linear regression (MLLR), and constrained MLLR (cf. Shinoda, 2005). Our own approach to the speech model closely follows Elenius and Blomberg (2005) for VTLN, as the children in that study were about the same age, and background noise was likewise a significant factor to explore and minimize in the algorithm.
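As a rough illustration of the idea behind VTLN (the notation is ours, not taken from the cited papers): formant frequencies scale roughly inversely with vocal tract length, so a child's spectrum can be brought closer to adult-trained acoustic models by warping the frequency axis with a speaker-dependent factor,

$$\hat{f} = \alpha f, \qquad \alpha \approx \frac{L_{\text{speaker}}}{L_{\text{ref}}},$$

where $L$ denotes vocal tract length. In practice, $\alpha$ is not measured physically but chosen by a maximum-likelihood search over a small grid of candidate warping factors.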

3. The adaptive model

One of the innovative aspects of SHAIEX is the adaptive mode of its lessons and activities (cf. Cumbreño et al., 2006; Agudo et al., 2006). Each didactic unit differs according to the educational level selected. Overall, there are three difficulty levels: Low for beginners (age 3, or older children with no previous experience with the system), Intermediate for age 4 and any child who demonstrates mastery of the low-level tasks, and Advanced for age 5 or any other child who has completed all previous requirements to reach this level. The different levels are present in each of the four sections or phases of a lesson, and each section consists of one or more learning tasks / activities.

SHAIEX has been developed in two modes: standalone and adaptive (client/host). In the first mode, the student selects the educational level (linked to age) after launching the application and then goes to the didactic unit to work with. The system provides the student with a sequential arrangement of the four sections described above (Section 1).

In the second mode, learning is customized. The system registers information for each individual student (educational level, interactivity style with mouse movements, and linguistic knowledge). The child does not select the educational level or the unit, but simply enters the system with his/her user account (logged on via the web to a server). The system then selects and provides the learner with the most appropriate activities according to the history and profile stored in the application at a given point. The student chooses one activity and completes it, and the system sends the results to the server, which updates the user model for the next learning-task query. The activities are not pre-recorded but are generated automatically in accordance with the profile / user model in use.

As mentioned, a second version of SHAIEX has been undertaken, in which a first solution for the integration of an S.R. tool in the activities could be implemented for speech recognition training. This opened the way for developing the speaking skill according to age level and task difficulty in the adaptive model.
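As a minimal sketch of what such a user model might imply (the field names and the promotion rule here are our own illustration; the paper does not publish the actual data structures):

```c
#include <stdio.h>

/* Hypothetical per-student profile kept on the SHAIEX server. */
typedef struct {
    int level;          /* 0 = low, 1 = intermediate, 2 = advanced */
    int mouse_skill;    /* interactivity style with mouse movements */
    int language_score; /* accumulated linguistic knowledge */
} user_model;

/* After each activity the client reports a score and the server updates
   the profile, moving the student up or down the levels accordingly. */
void update_model(user_model *m, int activity_score) {
    m->language_score += activity_score;
    if (activity_score >= 80 && m->level < 2)
        m->level++;     /* demonstrated mastery: promote */
    else if (activity_score < 40 && m->level > 0)
        m->level--;     /* struggling: catch up at a lower level */
}

int main(void) {
    user_model m = { .level = 0, .mouse_skill = 1, .language_score = 0 };
    update_model(&m, 85);
    printf("new level: %d\n", m.level);   /* prints: new level: 1 */
    return 0;
}
```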

4. S.R. integration in SHAIEX

The S.R. tool must be integrated in a way that complies with the technical features of SHAIEX, i.e., portability and a client / host architecture for the adaptive mode. Another important requirement is to avoid installing any software beyond what is already configured for SHAIEX (i.e., Adobe Flash Player) in the client application. The S.R. tool must meet the following criteria:
• Recognition of isolated words / terms.
• Speaker independence.
• A small vocabulary (< 100 words). In addition, a different library may be built for each educational task, which keeps the recognition simple and efficient.
• Robustness to classroom conditions. The tool will normally run in a classroom, where a significant amount of background noise is likely, and it must be conceived with this premise in mind: the algorithm must include background noise as extended bandwidth in the test speech (cf. Elenius and Blomberg, 2005).

4.1. AIRE

The tool chosen, AIRE, was initially developed by our colleagues at the Mathematics Department (http://gsd.unex.es/projects/aire/index.html). Essentially, AIRE is a library that recognizes vocabularies composed of isolated words. The tool is written in C and based on Hidden Markov Model (HMM) theory. Among other features, the library is designed according to the IVORY methodology, which splits the automatic speech recognition process into several phases (modules) that can be distributed among various DSP cards, increasing the performance and scalability of the system. Furthermore, AIRE is a re-entrant library suitable for multi-user environments, and it is totally independent of the audio capture process, so another module must provide the audio signal.

In the recognition process, the audio signal is first processed in three phases. A recognition procedure then follows, called pattern recognition (phases 4 and 5). In these last phases, statistical techniques such as vector quantization and Markov models are used: the sound patterns extracted from the speech signal are compared with the patterns of each word in the vocabulary database. AIRE exposes each phase of the overall recognition process as a separate function with parameters, completely isolating the phases from one another; in fact, each phase can be placed on its own DSP card. To use AIRE, the user must do the following:

1. Capture the speech signal with the minimum level of noise (although AIRE can also work in relatively noisy environments) and store it in small overlapped windows.
2. Provide these audio windows to AIRE and then execute each phase of the recognition.
3. When AIRE detects the utterance of a word from the vocabulary, it notifies the user of this event through the last function (phase).

The phases must be executed continuously, in an infinite loop, as Figure 1 shows.
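The AIRE API is not reproduced in the paper, so the following C sketch of that loop uses hypothetical function names in place of the real library calls (the library itself is documented at the project page cited above):

```c
#include <stdio.h>

/* Hypothetical stand-ins for the real AIRE calls. */
typedef struct aire_ctx aire_ctx;
extern aire_ctx   *aire_open(const char *vocabulary);
extern void        aire_feed(aire_ctx *c, const short *window, int n);
extern void        aire_run_phase(aire_ctx *c, int phase);   /* phases 1..5 */
extern const char *aire_result(aire_ctx *c);                 /* NULL if none */

/* Audio capture is external to AIRE: another module must supply it. */
extern int capture_window(short *buf, int n);

int main(void) {
    aire_ctx *c = aire_open("introductions.voc");
    short window[256];                        /* small overlapped window */
    for (;;) {                                /* the phases run forever */
        int n = capture_window(window, 256);  /* step 1: capture audio */
        aire_feed(c, window, n);              /* step 2: hand it to AIRE */
        for (int phase = 1; phase <= 5; phase++)
            aire_run_phase(c, phase);         /* signal processing + HMM */
        const char *word = aire_result(c);    /* step 3: notification */
        if (word)
            printf("recognized: %s\n", word);
    }
}
```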

Fig. 1: AIRE execution phases

Because of the client / host architecture of SHAIEX, the recognition function is performed entirely on the server side. The client submits the audio captured via the microphone to the server, and the server, after the recognition process, returns the degree of success or failure resulting from the comparison of the submitted audio with the built-in vocabulary. The speed of the audio transfer to the server is a crucial feature of the system, as each submission may reach an approximate size of 25 Kb. Ideally, the load of queries is balanced among several servers dispatching the different recognition requests.

4.2. AIRE / SHAIEX architecture

Fig. 2: AIRE integration in SHAIEX

Figure 2 renders the architecture of AIRE within SHAIEX. The integration follows four stages:

1. Capturing the audio from SHAIEX. As mentioned above, the user interface of SHAIEX is developed in Flash, and this software cannot record audio by itself (unless run as a local application in AIR). A multimedia server must therefore be used, e.g., Flash Media Server, Wowza, or Red5. In our case, Red5 was chosen, as it is free of charge and open source. The audio captured by Red5 is stored in FLV format.

2. Retrieving the audio from the FLV file. AIRE can only work with WAV files, so the audio must be extracted from the FLV files. The standard audio extractors failed at this task because they expected video content in addition to the audio in the FLV file (not our case). FFMPEG, however, extracts the audio perfectly from an FLV file that contains only audio and stores it in WAV format (a sketch of this step follows the list).

3. Speech recognition. AIRE is used in its executable form, generated after the library has been compiled. AIRE transfers the recognition result to SHAIEX, together with the percentage of validity of the received word in relation to the other words in the dictionary.

4. Message registration. SHAIEX integrates the received message in the client user interface and registers the action so that, once the activity is finished, the user model can be updated.
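The exact extraction command is not given in the paper; a plausible server-side call for stage 2, using standard ffmpeg options (-vn drops any video stream, -acodec and -ar select the codec and sampling rate), could look like this:

```c
#include <stdlib.h>

/* Stage 2 sketch: convert the audio-only FLV recorded by Red5 into a
   mono 22050 Hz WAV file that AIRE can consume. The file names are
   illustrative. */
int extract_wav(void) {
    return system("ffmpeg -i streams/utterance.flv -vn "
                  "-acodec pcm_s16le -ar 22050 -ac 1 utterance.wav");
}

int main(void) { return extract_wav(); }
```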

4.3. S.R. in an interactive activity

The educational game "Choose" introduces a specific number of elements, which may or may not be repeated, that the learner must pick out after listening to information on each element. For instance, in the first unit, "Introductions", all the characters that appear in the lessons are presented. Using the mouse, the child clicks on the appropriate character after listening to its speech (e.g., "My name is Nose" for the elephant). This choosing activity has been adapted to AIRE for the implementation of speech recognition. The game can be configured with the following parameters: a) dexterity level (interactivity style with the mouse): a single click, hovering, or a left double-click; b) the number of characters (more or fewer), depending on the difficulty level; and c) whether each element is introduced once or several times. The specific values of these parameters are passed to the game via an XML file (see Figure 3).

Fig. 3: Interface for game (Flash file)
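The schema of that XML file is not shown in the paper; a hypothetical example covering the three parameters above might look like this (every element and attribute name is illustrative, not the actual SHAIEX schema):

```xml
<!-- Hypothetical configuration for the "Choose" game. -->
<choose-game level="low">
  <dexterity mode="single-click"/>   <!-- or "hover", "double-click" -->
  <elements count="3" repeat="false"/>
  <element name="Nose" audio="my_name_is_nose"/>
  <element name="Benito" audio="my_name_is_benito"/>
  <element name="Lupe" audio="my_name_is_lupe"/>
</choose-game>
```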

The navigation bar has been modified in the activities that support S.R. to include the following elements:
• A graphic button for audio recording (REC) that starts audio capture. This button changes colour while audio is being captured. When the microphone detects silence (70 percent silence for one second), audio capture ends.
• A text field that indicates the word detected by the S.R. tool.
• A text field that displays the similarity percentage between the captured audio and the output issued by the tool. Observing this percentage makes it possible to tune the value above which a word recognition is accepted as valid.

During the activity, as elements are uttered and recognized, positive or negative reinforcement is played back according to whether the answer is good or bad. The ActionScript code is added to the game as a layer. This code spells out each action the game executes so that the S.R. tool can perform accordingly. The actions defined are:

1. Connecting to the Red5 server. Using the "connect" method, a connection is made through "netConnection" to a folder in the "Red5 > webapps" directory. The folder "oflaDemo" is chosen for no particular reason; it merely serves as a repository.
2. Retrieving audio from the microphone. Using the "Microphone" class, the microphone is declared (the device must be plugged in). This flow is then attached to the connection made in step 1 through the "netStream" class and its "attach" method; when the "netStream" object is created, the type of associated connection must be specified. This action is performed after the "bt_reconocedor" button is activated.
3. Storing audio with Red5. This is done with another "netStream" method, "publish", using the "record" option. An FLV file is thus generated with an identifier (also passed as a parameter to this method) within the sub-folder "streams" in "oflaDemo". The file is not finally written until the connection is closed, which happens when "bt_reconocedor" is selected again (controlled by means of the "record" boolean).
4. Extracting audio from the FLV file captured with Red5. This is done through the "Ffmpeg" servlet, which is invoked via the "LoadVars" class. A short pause is made as standby while the servlet runs to completion (detected with the "onLoad" function of this class).
5. Running AIRE with the captured audio as a parameter. This is done through the "Aire2" servlet, which runs immediately after step 4.
6. Reading the answer given by AIRE. Immediately after step 5, everything returned by AIRE is obtained through the "toString()" method of the "LoadVars" class. Because the returned string contains irrelevant information (e.g., the executable path), it is trimmed to keep the recognizer's answer alone.
7. Checking whether the answer is correct. The string obtained in step 6 is compared with the string stored in the "datos" class for the pertinent query; the function "control_acierto_y_fallo" is invoked if they match. Otherwise, no action is performed at all.

Customizing the educational level for an activity that uses S.R. can be achieved by adjusting the precision level of the recognizer: at the low educational level, the tool is less rigorous than at the higher levels, where more precision is demanded, as sketched below.
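A minimal sketch of that idea, assuming the similarity percentage returned by AIRE is simply compared against a level-dependent threshold (the threshold values below are invented for illustration; the paper does not report them):

```c
/* Hypothetical acceptance rule: lower educational levels tolerate
   lower similarity percentages from the recognizer. */
enum level { LOW, INTERMEDIATE, ADVANCED };

static const int min_similarity[] = { 50, 65, 80 };  /* percent, illustrative */

int accept_word(int similarity_percent, enum level lvl) {
    return similarity_percent >= min_similarity[lvl];
}
```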

5. Tests and results

A six-word vocabulary was built for the "Choose" activity selected for the tests with AIRE in SHAIEX (see Sub-section 4.3 above). These six words were recorded by 75 different speakers (21 at age 3, 20 at age 4, and 34 at age 5); there were 37 girls and 38 boys. A total of 450 WAV files were thus recorded with the Praat program. Noise was removed as far as possible in the algorithm, although some was expected, as explained above. The words correspond to the names of the characters in SHAIEX: Sese (snake), Nose (elephant), Benito (penguin), Lupe (stork), Thai (turtle) and Hakim (snail). Figures 4 to 6 display the frequency features of four randomly chosen utterances in our corpus for three of the six words in the vocabulary. In general, the frequencies follow a similar, identifiable sound pattern, but there are some important differences.

Fig. 4: Samples for the word "Benito"

The pattern in the spoken name "Benito" (Figure 4) is easily identifiable; it is thus no surprise that this word is the one most easily recognized by the tool. Something similar happens with the name "Nose": it is easily identified because of its characteristic pattern. However, the tool can sometimes confuse this word when it shows a wide range of frequencies, as the samples illustrate (Figure 5).

Fig. 5: Samples for the word "Nose"

Fig. 6: Samples for the word "Thai"

The frequencies are most intricate and divergent for the name "Thai", as Figure 6 shows. This word presents the most confusing pattern for AIRE: it is seldom recognized and often confused with one of the other names.

To evaluate the validity of the words in the library, another test was conducted. Thirty children (10 per age level, divided evenly between the two sexes), different from the previous 75, were selected. Each student produced six WAV files (one per character name), so a total of 180 WAV files were fed to the S.R. tool to be contrasted with the vocabulary. Table 1 provides the number of correct utterances per word (out of 30) and the total success percentage in the test.

Benito | Hakim | Lupe | Nose | Sese | Thai | Total
  30   |  25   |  15  |  25  |  30  |  5   | 77.78%

Table 1. Correct answers in the test with 30 students

The results in Table 1 corroborate the much more confusing pattern for "Thai", whereas the overall 78 percent can be regarded as positive at this point, given that some of the names involved a good deal of complexity.

In addition to the WAV format, which provides good quality for the sound analysis, other parameters may alter the recognition rate. All of them are set at the "vocTrainer" stage, since this stage builds the library (the other stages are compilations for the creation of a single library file ["vocNumber"] and for the client creation ["clientewav"]). The parameters in "vocTrainer" are: Hquant value, sampling frequency, and window size. Table 2 shows the results obtained with the same 180 utterances (30 students x 6 words) for each Hquant value tested; each word cell again gives the number of correct recognitions out of 30.

Hquant value | Benito | Hakim | Lupe  | Nose  | Sese  | Thai  | Success rate
     64      |   20   |   15  |  15   |  25   |  30   |   0   |   58.33%
    128      |   30   |   15  |  20   |  25   |  25   |   0   |   63.89%
    256      |   30   |   25  |  15   |  25   |  30   |   5   |   77.78%
    512      |   30   |   25  |  15   |  15   |  25   |   5   |   63.89%
   1024      | ERROR  | ERROR | ERROR | ERROR | ERROR | ERROR |   ERROR

Table 2. Results with spoken names for Hquant

For the Hquant value of 1024, the execution of "vocTrainer.exe" returns an error because the number of clusters is too large (there are too many empty spaces). As the table shows, the Hquant value yielding the best results is 256.

Table 3 displays the results for the sampling frequency parameter; these results are not positive.

Sampling frequency (Hz) | Benito | Hakim | Lupe | Nose | Sese | Thai | Success rate
         11025          |   0    |   0   |  30  |  0   |  0   |  0   |   16.67%
         22050          |   30   |  25   |  15  |  25  |  30  |  5   |   77.78%
         44100          |   5    |   0   |  25  |  0   |  0   |  0   |   16.67%

Table 3. Results with spoken names for the sampling frequencies

At 11025 Hz, all the words are confused with "Lupe"; at 44100 Hz, the errors again involve "Lupe" and, in part, "Benito". These poor results are directly related to the window size (sampling frequency ≈ window size / 0.01, i.e., the window should span roughly 10 ms of audio; at 22050 Hz this is about 220 samples, which rounds up to the 256-sample window used here). Therefore, if the frequency is changed, the window size must be modified as well. For the vocabulary tested, the best sampling frequency and window size are 22050 Hz and 256 samples, respectively.

6. Conclusions and future work

Our research on FL (foreign language) learning at an early age in Extremadura has led to some interesting observations and findings in the classroom when English is learned in combination with stimulating computer programs. SHAIEX has been, at least in part, an answer to how very young children can be motivated and gently introduced to the world of computer use and English practice. Our mixed research group (university researchers and teachers, pre-elementary teachers, psychologists, and computer scientists) has been able to integrate various resources for educational purposes.

In the complex area of S.R., our findings point to a viable course along which early-age learning can advance. The process requires many more tests and improved mathematical methods to enhance the tool. Because we have consistently chosen free applications and open source code for the SHAIEX projects, AIRE fits in quite conveniently and can be upgraded through independent work and effort. The tool must run externally yet smoothly within the platform, and efficiently enough for the activities to work with diverse young speakers. The option to speak should parallel the mouse functionality whenever the activities require the use of AIRE. The results in this paper are preliminary, but they may serve as guidance and reference for further research in this line of work.

References

Agudo, J.E., Sánchez, H., Rico, M., Curado, A. and Domínguez, E. (2006). Adaptive Hypermedia for Foreign Language Development at Early Age. GESTS International Transactions on Computer Science and Engineering, 34(1), pp. 90-101.

Cumbreño, A.B., Rico García, M., Curado Fuentes, A. and Domínguez, E. (2006). Developing Adaptive Systems at Early Stages of Children's Foreign Language Development. ReCALL Journal, 18(1), pp. 45-62.

Edwards, P., Rico, M., Agudo, J.E., Paín, M.A., Curado, A. and Sánchez, H. (2008). The SHAIEX Project: Principles and Practice for Multimedia Foreign Language Learning in Pre-school. The EuroCALL Review, 13(1), pp. 8-18.

Elenius, D. and Blomberg, M. (2005). Adaptation and Normalization Experiments in Speech Recognition for 4 to 8 Year Old Children. In: Proceedings of INTERSPEECH 2005, Lisbon, pp. 2749-2752.

Gerosa, M., Giuliani, D. and Brugnara, F. (2007). Acoustic Variability and Automatic Recognition of Children's Speech. Speech Communication, 49, pp. 847-869.

Gerosa, M., Giuliani, D. and Brugnara, F. (2009). Towards Age-Independent Acoustic Modeling. Speech Communication, 51(6), pp. 499-509.

Giuliani, D., Gerosa, M. and Brugnara, F. (2006). Improved Automatic Speech Recognition through Speaker Normalization. Computer Speech & Language, 20(1), pp. 107-123.

Rico, M., Curado, A., Domínguez, E. and Cumbreño, A.B. (2006). Hypermedia-based Tasks for L2 Learning at Pre-school and Primary Education. In: V. Guerrero-Bote (Ed.), Current Research in Information Sciences and Technologies: Multidisciplinary Approaches to GIS. Badajoz: Open Institute of Knowledge, pp. 242-246.

Shinoda, K. (2005). Speaker Adaptation Techniques for Speech Recognition Using Probabilistic Models. Electronics and Communications in Japan (Part III: Fundamental Electronic Science), 88(12), pp. 25-42.

Wilpon, J.G. and Jacobsen, C.N. (1996). A Study of Speech Recognition for Children and the Elderly. In: Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Atlanta, GA, May 7-10, 1996, pp. 349-352.

Zhan, P. and Waibel, A. (1997). Vocal Tract Length Normalization for Large Vocabulary Continuous Speech Recognition. Pittsburgh, PA: Carnegie Mellon University.