Efficient Client-Server based Implementations of Mobile Speech Recognition Services

Richard C. Rose (a) and Iker Arizmendi (b)

(a) Corresponding author: McGill University, Department of Electrical and Computer Engineering, McConnell Engineering Building, Room 755, 3480 University Street, Montreal, Quebec H3A 2A7, Canada. Email: [email protected], Phone: 514-398-1749, Fax: 514-398-4470

(b) Coauthor: AT&T Labs – Research, Room D129, 180 Park Ave., Florham Park, NJ 07932-0971, U.S.A. Email: [email protected], Phone: 973-360-8516
Abstract

The purpose of this paper is to demonstrate the efficiencies that can be achieved when automatic speech recognition (ASR) applications are provided to large user populations using client-server implementations of interactive voice services. It is shown that, through proper design of a client-server framework, excellent overall system performance can be obtained with minimal demands on the computing resources that are allocated to ASR. System performance is considered in the paper in terms of both ASR speed and accuracy in multi-user scenarios. An ASR resource allocation strategy is presented that maintains sub-second average speech recognition response latencies observed by users even as the number of concurrent users exceeds the available number of ASR servers by more than an order of magnitude. An architecture for unsupervised estimation of user-specific feature space adaptation and normalization parameters is also described and evaluated. Significant reductions in ASR word error rate were obtained by applying these techniques to utterances collected from users of hand-held mobile devices. These results are important because, while there is a large body of work addressing the speed and accuracy of individual ASR decoders, there has been very little effort applied to dealing with the same issues when a large number of ASR decoders are used in multi-user scenarios.
Key words: Automatic Speech Recognition, Distributed Speech Recognition, Robustness, Client-Server Implementations, Adaptation
1 Introduction
There are a large number of voice enabled services that are currently being provided to telecommunications customers using client-server implementations in multi-user scenarios. The interest in this work is in those implementations where the functionality of the interactive system may be distributed between a client and server which can be interconnected over any of a variety of communications networks. The client in these applications may be a cellular telephone, personal digital assistant, portable tablet computer, or any other device that supports speech input along with additional input and output modalities that may be appropriate for a given application. The server deployment, on the other hand, often consists of many low cost commodity computers located in a centralized location. For these implementations to be practical, it is necessary for the server deployment to support large numbers of users concurrently interacting with voice services under highly variable conditions. This requires that the server deployments be able to scale to large user populations while simultaneously minimizing degradations in performance under peak load conditions. Methods for maintaining efficient and robust operation under these conditions will be presented.

This paper presents a client-server framework that efficiently implements multimodal applications on general purpose computers. This framework will serve as the context for addressing two important practical problems that have received relatively little attention in the ASR literature. The first problem is the efficient assignment of ASR decoders to computing resources in network based server deployments. There has been a great deal of work applied towards increasing the efficiency of individual ASR decoders using a number of strategies including efficient pruning [26], efficient acoustic likelihood computation during decoding [16,3], and network optimization [15]. However, there has been little effort applied to increasing the overall efficiency at peak loads when a large number of ASR decoders are used in multi-user scenarios.

The second problem is the implementation of acoustic adaptation and normalization algorithms within a client-server framework. Over the last decade, a large number of techniques have been proposed for adapting hidden Markov models (HMMs) or normalizing observation vectors based on a set of adaptation utterances. The overall goal in this work is to apply these techniques to minimizing the impact of speaker, channel, or environment variability relative to purely task independent ASR systems. Methods will be proposed and evaluated for applying these algorithms under the constraints that are posed when
implemented within client-server scenarios.

One can make many qualitative arguments for when either fully embedded ASR implementations or network based client-server implementations are appropriate. It is generally thought that fully embedded implementations are most appropriate for value added applications like name dialing or digit dialing, largely because no network connectivity is necessary when ASR is implemented locally on the device [24]. Distributed or network based ASR implementations are considered appropriate for ASR based services that require access to large application specific databases. In these cases, issues of database security and integrity make it impractical to distribute representations of the database to all devices [21]. Network based implementations also facilitate porting the application to multiple languages and multiple applications without having to effect changes to the individual devices in the network. However, implementing ASR in a network based server deployment can also lead to potential degradations in ASR word accuracy (WAC) resulting from transmitting speech over the communications channel between client and server. There has been a very large body of research devoted to this issue. Some of the relevant work in this area will be summarized in Section 2.

The client-server framework presented here, referred to as the distributed speech enabled middleware (DSEM), performs several functions. First, it implements the communications channels that allow data and messages to be passed between the components that make up the interactive dialog system. Second, it manages a set of resources that include ASR decoders, database servers, and reconfiguration modules that are responsible for adapting the system to particular users. The framework was designed to minimize the degradation in performance that occurs as the number of clients begins to exceed a server's peak capacity [25]. This degradation could be the result of context switching and synchronization overhead as can occur for any non-ASR server implementation, but can also be a result of the high input-output activity necessary to support ASR services. Algorithms presented in this paper for efficient allocation of ASR resources and for efficient user configuration of ASR acoustic modeling are implemented in the context of the DSEM framework. The framework will be described in Section 3 and its performance will be evaluated in terms of its ability to minimize response latencies observed by the user under peak load conditions.

Strategies are presented for efficient allocation of ASR resources in server deployments that utilize many low cost, commodity computational servers. By dynamically assigning ASR decoders to individual utterances within a dialog, these strategies are meant to compensate for the high variability in processing effort that exists in human-machine dialog scenarios. The sources of this variability are discussed in Section 4. A model for these ASR allocation strategies is also presented in Section 4. The strategies are evaluated both in
terms of their simulated and actual performance for a large vocabulary dialog task running on a deployment with ten ASR servers. Finally, an efficient architecture is presented for implementing algorithms for fast acoustic reconfiguration of individual ASR decoders to a particular mobile client. This is motivated primarily by the need to deal with the environmental, channel, and speaker variability that might be associated with a typical mobile domain. It is also motivated by the opportunity for acquiring representations of speaker, environment, and transducer variability that is afforded in the case where the client is dedicated to a particular user account. Since the ASR allocation strategies discussed above can dynamically assign ASR decoders to individual utterances, it is difficult in practice to adapt the acoustic hidden Markov model (HMM) parameters to a particular client. Hence, it is more practical to modify the feature space parameters for a particular client rather than attempt to adapt the acoustic model parameters. An architecture is presented in Section 5 for unsupervised adaptation and normalization of feature space parameters in the context of the DSEM framework.
2 Robustness Issues for Mobile Domains
It is well known that the performance of client-server based ASR implementations suffers from the distortions associated with transmission of speech, or the ASR features derived from speech, over a communications channel. There has been a great deal of research addressing issues of acoustic feature extraction and channel robustness for ASR under these conditions [2,4,7,8,11,12,19,23]. This section provides a brief summary of some of this work and the impact of these degradations, especially in wireless mobile applications. This serves as motivation for the discussion in Section 5 on the implementation of acoustic feature space normalization and adaptation algorithms.
2.1 Feature Extraction Scenarios
There have been several investigations comparing the ability of different feature analysis scenarios to obtain high performance network-based ASR over wireless telephone networks [7,8,11,12]. The ETSI distributed speech recognition (DSR) effort has standardized feature analysis and compression algorithms that run on the client handset [7]. In this scenario, the coded features are transmitted over a protected data channel to mitigate the effects of degradation in voice quality when channel carrier-to-interference ratio is low. Another scenario involves performing feature analysis in the network by extracting ASR features directly from the received voice channel bit stream [11].
A third scenario, which does not involve additional client-based or network-based processing, has also been evaluated for ASR [8,12]. Instead, it involves the use of the adaptive multi-rate (AMR) speech codec that has been selected as the default speech codec for use in Wideband Code Division Multiple Access (WCDMA) networks. Studies have shown that the ability of the AMR codec to trade off source coding bit-rate over a range from 4.75 to 12.2 kbit/s against channel coding bit allocation results in negligible change to ASR accuracy for carrier-to-interference ratios as low as 4 dB [8].
2.2 Robustness with Respect to Channel Distortions
There have also been a variety of approaches that have been investigated for making ASR more robust with respect to the distortions induced by Gaussian and Rayleigh fading channels associated with wireless communications networks [2,19,23]. One approach is to apply transmission error protection and concealment techniques to the coded feature parameters as they are shipped over wireless channels [23]. Another approach involves combining confidence measures derived from the channel decoder with the likelihood computation performed in the Viterbi search algorithm used in the ASR decoder. In this approach, confidence measures are computed from the a posteriori probability, which provides an indication of whether received feature vectors have been correctly decoded. These confidence measures are then used to weight or censor the local Gaussian likelihood computations used in the Viterbi algorithm [2,19]. This last approach is similar in some ways to the missing feature theory approach to robust ASR, where noise-corrupted feature vector components are labeled and removed from the likelihood computation [19]. However, the ability of the channel decoder to identify missing features in this application is far more effective than the existing techniques for labeling feature vectors corrupted by noisy acoustic environments. Similar techniques have been investigated for the packet loss scenarios associated with packet-based transmission over VoIP networks [4].
2.3 Importance of Acoustic Environment
It is well known that distortions introduced by both the acoustic environment and the communications channel can impact ASR performance. Studies based on empirical data collected in multiple cellular telephone domains have demonstrated that the effects of environmental noise can often dominate the observed increases in ASR word error rate (WER) [22]. Increases in WER of 50% have
been measured over wireless communications channels in noisy automobile environments compared to quiet office environments. On the other hand, a WER increase of only 10% was observed for wireless channels compared to wire-line channels when speech was collected in a quiet office environment. This agrees with similar findings suggesting that, except for extremely degraded communications channels, the impact of channel-specific variability has been found in some cases to be secondary with respect to environmental variability in mobile ASR applications.
3 Mobile ASR Framework
Modern multi-user applications are often challenged by the need to scale to a potentially large number of users while minimizing the degradation in service response even under peak load conditions. Scaling multi-modal applications that include ASR as an input modality presents an additional hurdle, as there is typically a great disparity between the number of potentially active users and a system's limited ability to provide computationally intensive ASR services.

This section provides an overview of the proposed distributed speech enabled middleware (DSEM) framework that is used to efficiently implement multi-modal applications that maximize performance under normal loads and are well conditioned under peak loads. The section comprises two parts. First, the framework rationale and design are briefly described. The second part of the section presents an experimental study demonstrating the throughput of the framework in the context of hundreds of simulated mobile clients simultaneously accessing a system equipped with a limited number of ASR decoders.
3.1 Description

3.1.1 Models for Efficient Client-Server Communication
Traditional non-ASR server implementations that assign a thread or process per client suffer from degraded time and space performance as the number of clients approaches and exceeds a server's peak capacity. The practical and theoretical issues that are behind these observed performance degradations have received a great deal of attention in the computer science community [1,5,17,25,14]. There is general agreement, however, that while this degradation stems from many factors, there are three principal issues that limit the performance of the thread per client model. A first issue is the overhead incurred by the operating system in performing
context switching. Context switching involves saving the current thread's context and loading the context of the next runnable thread. It can introduce a number of artifacts. For example, cache performance can be reduced as the memory required for storing the stacks associated with individual threads competes for processor cache space [14]. This switching overhead can be exacerbated in the presence of threads that handle streaming media, which must be frequently invoked to service buffers that are used for transmission of digital audio and video. It is important to note that the degradations associated with this particular scenario, which is critical to supporting the interaction between clients and ASR servers, have received relatively little attention when evaluating existing client-server communication models. The degradations stemming from this type of overhead are the principal focus of the evaluation described in Section 3.2.

A second issue is associated with the synchronization of state information that must often be shared between clients. Resources like shared pools of feature frames and pools of ASR decoder proxies are examples of shared state information which are heavily contended between clients. In the presence of multiple threads, access to this shared state information must be synchronized using locking primitives. The overhead of such synchronization can be significant, especially for the above examples.

Finally, a third issue limiting the performance of thread per client models is the virtual memory requirement associated with each individual thread. Every thread allocated by the operating system requires a small data structure to track it and, more importantly, an allocation of virtual memory for the thread's stack. One reason why this issue is problematic is because of limitations that are specific to a given operating system. For example, it is common on Linux operating systems to define a thread's stack size to be 2 MB. It is also common in Linux to limit the user address space to 2 GB, which limits the maximum number of threads that can be supported in a thread per client model to approximately 1000.

In an effort to address the above issues, the proposed DSEM framework uses an event-driven, non-blocking IO model. This requires only a single thread to manage a large number of concurrently connected clients [25]. In addition, an ASR decoder cache is employed to effectively share limited decoder resources among active clients.
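To make the event-driven, non-blocking IO model concrete, the following minimal sketch multiplexes many client connections over a single thread using an OS readiness-notification facility. It is an illustration only: the handler class, port number, and echo behavior are hypothetical stand-ins, not the DSEM implementation or its API.

```python
# Minimal sketch of a single-threaded, event-driven dispatch loop of the kind
# described above. Handler names and the echo behavior are illustrative only.
import selectors
import socket

selector = selectors.DefaultSelector()

class SessionHandler:
    """Per-client handler invoked by the dispatch loop when its socket is readable."""
    def __init__(self, conn):
        self.conn = conn

    def handle_read(self):
        data = self.conn.recv(4096)     # readiness was signaled, so this does not block
        if not data:                    # client closed the connection
            selector.unregister(self.conn)
            self.conn.close()
            return
        # Application-specific processing (e.g. handing audio to an ASR handler)
        # would go here; this sketch simply echoes the bytes back.
        self.conn.send(data)

def accept(server_sock):
    conn, _ = server_sock.accept()
    conn.setblocking(False)
    # Register the new client with the dispatcher and attach its session handler.
    selector.register(conn, selectors.EVENT_READ, SessionHandler(conn))

def serve(port=8000):                   # port number is an assumption for illustration
    server_sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server_sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    server_sock.bind(("", port))
    server_sock.listen(512)
    server_sock.setblocking(False)
    selector.register(server_sock, selectors.EVENT_READ, None)

    # Single dispatch loop: one thread services every registered socket, so there
    # are no per-client threads, stacks, context switches, or locks.
    while True:
        for key, _ in selector.select():
            if key.data is None:        # event on the listening socket
                accept(key.fileobj)
            else:                       # event on a client socket
                key.data.handle_read()

if __name__ == "__main__":
    serve()
```

The design point illustrated here is the one argued above: because no per-client threads are created, there is no per-client stack allocation, no context switching between client handlers, and no locking of shared state.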
3.1.2 An Example Interaction
The basic functional components of the framework can be introduced by way of the example illustrated in the block diagram of Figure 1. Figure 1 shows a typical interaction that allows a client to issue a voice query that retrieves the contents of a URL on a remote web server. In this example, the recognition
Fig. 1. A block diagram of the distributed speech enabled middleware (DSEM) framework. Strategies for allocation of ASR resources and for the server-based implementation of feature space adaptation techniques were evaluated within this framework.
result is not returned to the client directly, but is instead acted upon by the DSEM server to produce the final result. The sequence of steps in such an interaction can be summarized as follows:

Establishing DSEM Connection: The interaction begins with the client establishing a connection with the DSEM server. Upon accepting the connection, the DSEM server creates a special session handler, labeled "SES" in Figure 1, for that connection and adds the client's socket to its internal dispatch table. The session handler performs two functions. First, it houses application specific processing such as determining which other handlers to invoke. Other handlers in this example may include ASR handlers and HTTP handlers. Second, it provides a place to store session state that spans more than one request.

Generating a Voice Query: The user then issues a voice query which is streamed to its session handler on the DSEM server. The query is streamed using a custom protocol which typically includes the type of query (e.g., the URL to fetch, what database query to perform, etc.) and can also include ASR related parameters such as audio coding and language model.

Creating an ASR Handler: The DSEM dispatcher, which is responsible for detecting and routing all of the system's IO events, detects activity on the client's socket and notifies its session handler to process any incoming data. In this example, the session handler creates an ASR handler to process the audio stream and registers interest in the ASR handler's output. Among other things, this may include a recognition string or word lattice produced by the ASR decoder.
Initializing the ASR Handler: Upon activation, the ASR handler fetches client specific parameters from the user database. These may include, for example, the user specific acoustic feature space adaptation and normalization parameters that are discussed in Section 5. It also acquires a decoder proxy, which is a local representation of a decoder process potentially residing on another machine, from the decoder proxy cache. If there are no free decoder processes, the proxy enters "buffering mode". The ASR handler registers its decoder proxy's socket with the DSEM dispatcher to receive notification when decoder IO is detected.

Generating and Buffering Acoustic Features: Each portion of the audio stream received by the client session handler is forwarded to the ASR handler, which may perform feature analysis and implement user-specific acoustic feature space normalizations and transformations. If the decoder proxy created in the previous step acquired an actual decoding process, then the computed cepstrum vectors are streamed directly to that process. If no decoding processes were available, the vectors are buffered and transmitted when a decoder is freed. The proxy cache provides a signaling scheme that alerts its proxies when this occurs.

Obtaining ASR Results and Releasing the ASR Decoder: An ASR decoder process produces a result and transmits it to the DSEM server. The DSEM dispatcher detects this event and notifies the associated ASR handler, which extracts the recognition string from the decoder proxy and reports it to the session handler. Once the session handler receives the recognition string, the ASR handler unregisters itself from the DSEM dispatcher and releases its decoder back to the cache.

Issuing Query to Web Server: With the recognition string in hand, the session handler creates an HTTP handler, registers interest in the HTTP handler's output, and issues a query to a remote web server. The prototype application implemented in this work uses this technique to retrieve employee information from AT&T's intranet site.

Sending Result to Client: When the web server responds to the HTTP request, the HTTP handler processes the reply and notifies the session handler, which in turn sends the result to the waiting mobile client.

One of the key assumptions of the above framework is that it is impractical to permanently assign and adapt an ASR decoder and model to a specific client. Typical ASR implementations require the use of large acoustic and language models which, if speaker and environment independent, can be preloaded and efficiently shared across multiple instances of a decoder, drastically reducing the cost of an ASR deployment. This is typically done by memory mapping all needed acoustic and language models on all ASR servers, subject to available
Fig. 2. DSEM server performance for an eight server installation plotted with respect to the number of concurrent clients. a) Average response latency measured in seconds between the time a client submits an ASR request and a result is returned by the DSEM. b) Average server throughput computed as the number of completed recognition transactions per second.
physical memory, and selecting between them at runtime (which involves little overhead). Speaker-adapted models, on the other hand, cannot be shared and thus result in a substantial increase in the amount of memory required on a decoding server that is servicing several clients. In order to enjoy the benefits of both shared models and client-specific adaptation, the proposed framework implements all acoustic modeling techniques as feature space normalizations and transformations in the DSEM server. This issue is addressed further in Section 5.
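The decoder proxy cache and its "buffering mode" can be sketched as follows. The class and method names are hypothetical, and the remote decoder process is reduced to an object with a send() method; the sketch only illustrates the acquire/buffer/flush/release cycle described in the interaction above.

```python
# Sketch of a decoder proxy cache. When no remote decoder process is free, a
# proxy buffers feature frames; when a decoder is released, the cache signals
# the longest-waiting proxy, which flushes its buffered frames in a burst.
from collections import deque

class DecoderProxy:
    def __init__(self, cache):
        self.cache = cache
        self.decoder = None        # handle to a remote decoder process, if any
        self.buffer = []           # feature frames held while in buffering mode

    def feed(self, frame):
        if self.decoder is not None:
            self.decoder.send(frame)      # stream the frame directly
        else:
            self.buffer.append(frame)     # buffering mode: hold until signaled

    def attach(self, decoder):
        """Called by the cache when a decoder process becomes available."""
        self.decoder = decoder
        for frame in self.buffer:         # flush buffered frames in a burst
            decoder.send(frame)
        self.buffer.clear()

    def release(self):
        decoder, self.decoder = self.decoder, None
        if decoder is not None:
            self.cache.release(decoder)

class DecoderProxyCache:
    def __init__(self, decoders):
        self.free = deque(decoders)       # idle decoder processes
        self.waiting = deque()            # proxies currently in buffering mode

    def acquire(self):
        proxy = DecoderProxy(self)
        if self.free:
            proxy.attach(self.free.popleft())
        else:
            self.waiting.append(proxy)    # no decoder free: proxy will buffer
        return proxy

    def release(self, decoder):
        if self.waiting:
            self.waiting.popleft().attach(decoder)   # hand off to a waiting proxy
        else:
            self.free.append(decoder)
```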
3.2 Performance Evaluation
An experimental study was performed to demonstrate the throughput of the framework described in Section 3.1. The goal of the study was to measure both the throughput maintained by the DSEM server and the latencies that would be observed by users of the associated mobile ASR services as the number of users making simultaneous requests increased into the hundreds of users. The study was performed by simulating many clients interacting with the DSEM and performing the following interaction:

• Each client streamed an 8-bit, 8-kHz speech request to the DSEM server. Each request consisted of a 1.5 second utterance corresponding to a query to an AT&T interactive dialog application.
• The DSEM server performed acoustic feature analysis and streamed features
to an available ASR decoder. When a decoder was not available, the DSEM server buffered features.
• When a decoder was released and made available to the decoder proxy cache, the DSEM streamed buffered features in a burst and streamed subsequent features (if any) as they arrived. The decoder returned a decoded result to the DSEM server, which in turn forwarded the result to the waiting client.

The infrastructure used for the study included eight 1 GHz Linux ASR servers, with each server running four instances of the AT&T Watson ASR decoder, and a single 1 GHz Linux DSEM server with 256 MB of RAM. Figure 2a illustrates the effect on response latency as the number of concurrent clients increases. Response latency was calculated as the interval in seconds between the time that the last sample of a speech request was sent by the client and the time that the recognition result was returned to the client by the DSEM server. The plot shows a relatively constant latency when the number of clients is less than 128 and a gracefully degrading response latency as the number of clients is increased. In addition, note the slight increase between 32 and 128 clients: as the number of clients exceeds the number of available decoders, the DSEM buffers features and transmits them in bursts to an available decoder. The fact that the audio streams are transmitted all at once and that the decoding task typically ran at better than real time (at most 1/4 real time) helped to minimize latency in that range. After the number of clients exceeds 128, the delay imposed on clients by the DSEM decoder wait queue and the overhead of the DSEM server itself begin to dominate. A more thorough investigation of this effect could shed some light on the relative importance of audio arrival rate to decoder performance.

Figure 2b illustrates the effect on server throughput as the number of concurrent clients increases. Throughput was calculated as the number of completed recognition transactions per second. The plot in Figure 2b demonstrates that throughput gradually increases until the server's peak capacity is reached at a point corresponding to 128 clients and remains relatively constant even as the number of clients far exceeds this peak capacity. Again, the buffering of features in the DSEM server provides a throughput benefit beyond the expected 32 recognitions per second.
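The client simulation used in a study of this kind can be sketched as below. The host name, port, and newline-terminated reply format are illustrative assumptions; the actual DSEM custom wire protocol is not reproduced here. Each simulated client streams a 1.5 second, 8-bit, 8 kHz utterance in real time and records the interval between the last audio sample sent and the returned result, which is how response latency was defined above.

```python
# Sketch of a multi-client load test: stream 1.5 s of 8-bit, 8 kHz audio per
# client and measure latency from the last sample sent to the reply received.
# The address and reply framing are hypothetical stand-ins for the DSEM protocol.
import socket
import threading
import time

HOST, PORT = "dsem.example.com", 8000        # hypothetical DSEM server address
UTTERANCE = bytes(12000)                     # 1.5 s x 8000 samples/s x 1 byte (silence stand-in)
CHUNK = 800                                  # 100 ms of audio per write
latencies = []
lock = threading.Lock()

def one_client():
    with socket.create_connection((HOST, PORT)) as sock:
        for i in range(0, len(UTTERANCE), CHUNK):
            sock.sendall(UTTERANCE[i:i + CHUNK])
            time.sleep(0.1)                  # pace the stream at real time
        sent = time.monotonic()              # time of the last audio sample
        reply = b""
        while not reply.endswith(b"\n"):     # assumed newline-terminated result
            data = sock.recv(4096)
            if not data:
                break
            reply += data
        with lock:
            latencies.append(time.monotonic() - sent)

def run(num_clients=128):
    threads = [threading.Thread(target=one_client) for _ in range(num_clients)]
    start = time.monotonic()
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    elapsed = time.monotonic() - start
    print(f"mean latency: {sum(latencies) / len(latencies):.2f} s, "
          f"throughput: {len(latencies) / elapsed:.1f} recognitions/s")

if __name__ == "__main__":
    run()
```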
4 Allocation Strategies for ASR Resources
The problem of efficient assignment of ASR decoders to computing resources in client-server frameworks like the DSEM is addressed in this section. The section begins by providing basic definitions of a call in the context of human-machine dialogs and of quality of service for ASR servers. Next, a simple theoretical model for efficient ASR resource allocation is presented. This model is used
to predict the total number of users that can be supported by the proposed framework under different assumptions while maintaining a given quality of service. Finally, the theoretical and actual performance of the model, evaluated on a large vocabulary dialog task running on a deployment with ten ASR servers, is presented.
4.1 Multi-User ASR Scenario
There are several assumptions that are made in this work concerning the means by which a user interacts with a speech dialog system and how both ASR quality of service (QoS) and system overload are defined. The most general assumption about the overall implementation is that calls are accepted from multiple users and are serviced by pools of ASR servers, each of which can return a recognition string for any given utterance with some latency. The manner in which these ASR servers are allocated is described in Section 4.2.

A typical interaction, or call, in human-machine dialog applications consists of several steps. The user first establishes a channel with the dialog system over a public switched telephone network (PSTN) or VoIP connection. Once the channel is established, the user engages in a dialog that consists of one or more turns during which the user speaks to the system and the system responds with information, requests for disambiguation, confirmations, etc. During the periods in which the system issues prompts to the user, the user will generally remain silent and the system will be mostly idle with respect to that channel. Finally, when the user is done, the channel is closed and the call is complete.

The quality of service (QoS) of an overall implementation is defined here in terms of the latency a system exhibits in generating a recognition result for an utterance. For utterances processed on a server, there are a number of factors that contribute to this latency. When the multi-server system is operating at near peak capacity, the number of concurrent utterances, or utterance load, the server is handling can be the dominant factor. The focus of this paper rests on the observation that, irrespective of all other factors, implementing simple strategies for reducing the instantaneous load on ASR servers will result in a significant decrease in the average response latency observed by the user. A server's maximum utterance load is defined here as the maximum number of concurrent utterances which can be processed with an acceptable average response latency. A server that handles more than its maximum utterance load is said to be overloaded.
4.2 ASR Resource Allocation Strategies
Two strategies are presented for allocating ASR servers to incoming calls. It will be shown that an intelligent approach for allocating utterances to servers in a typical multi-server deployment can dramatically reduce the incidence of overload with respect to more commonly used allocation strategies.
4.2.1 Call-Level Allocation
A common approach for indirectly balancing the utterance load across the hardware resources an allocator has at its disposal is call-level allocation. Using this approach, an allocator assigns a call to an ASR process running on a decoding server for the duration of the call. This process is responsible for all feature extraction, voice activity detection, and decoding. For example, consider the hardware configuration shown in Figure 3, which illustrates a typical setup where a source of call traffic (a PBX or VoIP gateway) routes user request streams to ASR processes residing on two servers. The figure depicts six calls where each call consists of intervals of speech or silence denoted by colored and uncolored blocks, respectively. As calls arrive, a simple allocator tracks the number of calls on each server and ensures that they all handle an equal number of calls.

However, as the number of calls handled by an ASR deployment increases, use of such a simple allocator can lead to an unacceptably high utterance load on some servers even when other servers are underutilized. In Figure 3 we see that during the first and second intervals the first server will need to handle an utterance load of 3 even though the second server is only handling a load of 1. Assuming that the maximum utterance load for each server is 2 and assuming that the processing of each utterance requires identical computational complexity, the first server will be overloaded. If the computational complexity of the ASR task is sufficiently high, this may result in unacceptably high latencies for users assigned to the overloaded first server.

A simple probabilistic argument can be made that generalizes the example to an arbitrary deployment and makes this deficiency explicit. Assume, for simplicity, that each utterance is of some fixed duration, d, and each call is of some fixed duration, D. A call is then assumed to consist of L randomly occurring utterances so that at any time t, the probability that an utterance is active is given by

p_t = L \frac{d}{D}.    (1)
If we assume that a server that handles an utterance load of more than Q is
Fig. 3. Example of call-level allocation showing six calls being routed directly to two ASR servers. Individual utterances are shown as colored blocks within each call.
overloaded, then the probability of overload if it services M calls, with M > Q, is given by

P_q = \sum_{k=Q+1}^{M} \binom{M}{k} p_t^k (1 - p_t)^{M-k}.    (2)
This is simply the probability that more than Q users out of M calls on a server will speak at any given moment. This probability is obviously zero when the server is handling Q calls or fewer. The probability P_q can then be used to calculate the probability, P_C, that one or more servers in a deployment of S servers (with S > 1), each handling M calls, will be overloaded:

P_C = 1 - (1 - P_q)^S.    (3)
In Section 4.3, Equations 2 and 3 will be used to determine the number of calls, M, that can be supported by the call-level allocation strategy when the probability of overload, P_C, is fixed at an acceptable value. It will be shown that the fundamental difficulty with this approach arises from the fact that the call-level allocator knows nothing of what transpires within a call.
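Equations (1) through (3) are simple to evaluate numerically. The following sketch computes the per-server overload probability P_q and the deployment-wide overload probability P_C for the call-level allocation strategy; the example operating point (ten servers, a maximum load of four utterances per server, speech active for one third of a call) is the one used in Section 4.3.

```python
# Evaluate Equations (1)-(3): probability that a call has an active utterance,
# probability that a single server handling M calls exceeds a load of Q
# concurrent utterances, and probability that at least one of S servers is
# overloaded under call-level allocation.
from math import comb

def utterance_activity(L, d, D):
    """Eq. (1): p_t = L * d / D."""
    return L * d / D

def server_overload(M, Q, p_t):
    """Eq. (2): P_q, probability that more than Q of M calls are speaking."""
    return sum(comb(M, k) * p_t**k * (1 - p_t)**(M - k) for k in range(Q + 1, M + 1))

def deployment_overload_cla(S, M, Q, p_t):
    """Eq. (3): P_C, probability that one or more of S servers is overloaded."""
    return 1 - (1 - server_overload(M, Q, p_t))**S

if __name__ == "__main__":
    # Section 4.3 operating point: S = 10 servers, Q = 4 utterances per server,
    # speech active for one third of each call; M = 8 calls per server is only
    # an example input.
    p_t = 1.0 / 3.0
    print(deployment_overload_cla(S=10, M=8, Q=4, p_t=p_t))
```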
4.2.2 Utterance-Level Allocation
One way to reduce the probability of overload is to let the allocator look within calls to determine when utterances begin and end. This additional information can be used to implement an allocator that assigns computational resources to utterances instead of entire calls. This will be referred to as utterance-level allocation.
Figure 4 illustrates this approach. In order to inspect the audio stream of incoming calls, the allocator is placed between the source of call traffic and the ASR decoding servers. In addition, feature extraction and voice activity detection are moved to the allocator so that it may determine when utterances begin and end. Of course, it is possible to perform feature extraction in several locations, including the client, the allocator as shown here, or the ASR server. From this vantage point the allocator can keep track of activity across the deployment, intelligently dispatch utterances, and balance the incoming utterance load. This allows the same deployment of S servers to be viewed as a single virtual server that can handle an aggregate utterance load of SQ concurrent utterances. Under this model, an overload
Fig. 4. An utterance level allocator looks within dialogs to determine when utterances begin and end. This information is used to balance the load on decoding servers.
on any server can only occur if more than SQ utterances are active, an event that is considerably less likely than any individual server being overloaded. More specifically, for a deployment handling SM calls, with SM > SQ, the probability, P_U, that an overload will occur is given by

P_U = \sum_{k=SQ+1}^{SM} \binom{SM}{k} p_t^k (1 - p_t)^{SM-k}.    (4)
Equation 4 will be used in Section 4.3 to determine the number of calls that can be supported by the utterance-level allocation strategy when the probability of overload, P_U, is fixed at an acceptable value. Note that although the allocator in this scenario acts as a gateway to the decoding servers, it generally is not a bottleneck, as the processing required to detect utterances is very small [20]. However, we must introduce an allocator that can monitor all traffic, which may be a potential bottleneck. We look at the effects of such an allocator in Section 4.3.
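Equation (4) can be evaluated in the same way, and the two strategies compared by finding the largest call load that keeps the overload probability below a target value. The sketch below reproduces the style of comparison shown in Figure 6 for the Section 4.3 operating point; the target overload probability of 0.1 is the value referenced there.

```python
# Compare call-level and utterance-level allocation by finding the largest
# number of calls per server, M, that keeps the overload probability below a
# target. Parameters follow the Section 4.3 example: S = 10 servers, Q = 4
# concurrent utterances per server, p_t = 1/3.
from math import comb

def binom_tail(n, k_min, p):
    """P(X >= k_min) for X ~ Binomial(n, p)."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(k_min, n + 1))

def overload_cla(S, M, Q, p_t):
    """Eq. (3) via Eq. (2): any of S independent servers exceeds load Q."""
    return 1 - (1 - binom_tail(M, Q + 1, p_t))**S

def overload_ula(S, M, Q, p_t):
    """Eq. (4): the pooled deployment of S servers exceeds an aggregate load SQ."""
    return binom_tail(S * M, S * Q + 1, p_t)

def max_calls_per_server(overload_fn, S, Q, p_t, target=0.1):
    M = Q
    while overload_fn(S, M + 1, Q, p_t) <= target:
        M += 1
    return M   # total calls supported by the deployment is S * M

if __name__ == "__main__":
    S, Q, p_t = 10, 4, 1.0 / 3.0
    print("CLA calls supported:", S * max_calls_per_server(overload_cla, S, Q, p_t))
    print("ULA calls supported:", S * max_calls_per_server(overload_ula, S, Q, p_t))
```

At the 0.1 overload probability, this calculation shows the roughly two-fold gain for utterance-level allocation that is reported for Figure 6 in Section 4.3.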
4.2.3 Refining Utterance-Level Allocation
Incorporating knowledge of additional sources of variability in ASR computing effort can further improve the efficiency of multi-user ASR deployments. Two examples of these sources of variability are illustrated by the plots displayed in Figure 5. The first is the high variance in computational load exhibited by a decoder over the length of an utterance. It is well known that the instantaneous branching factor associated with a given speech recognition network can vary considerably. This fact, coupled with the pruning strategies used in decoders, results in a large variation in the number of network arcs that are active and must be processed at any given instant. This is illustrated by the plot in Figure 5a which displays the number of active network arcs in the decoder plotted versus time for an example utterance in a 4000 word proper name recognition task. The plot demonstrates the fact that the majority of the computing effort in such tasks occurs over a fairly small portion of the utterance. Knowledge of this time dependent variability in the form of sample distributions could potentially be used to allocate utterances such that peak processing demands do not overlap.
Fig. 5. a) Computational effort measured as the number of active arcs versus time for an example utterance from the proper names recognition task. b) The distribution of the ratio of decoding time to audio duration (CPU vs. audio) for test utterances taken from a digit recognition task and from c) an LVCSR task.
The second source of variability comes from the variation in computational complexity that exists between different ASR tasks. This is illustrated by the histograms displayed in Figures 5b and 5c. The plots display the distribution of average computational effort measured as the ratio of the decoding time to the utterance duration. The distributions correspond to continuous digit and large vocabulary continuous speech recognition (LVCSR) tasks with means of 0.022
and 0.44, respectively, on a 2.6 GHz server. As would be expected, the high perplexity stochastic speech recognition network associated with the LVCSR task demands a higher and more variable level of computational resources than the small vocabulary deterministic network. Distributions characterizing this inter-task variability could be incorporated into server allocation strategies. In addition to the obvious efficiency improvements beyond those discussed in Section 4.2.2, servers with large CPU caches can be dedicated to a single ASR task to achieve improved cache utilization.
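As one illustration of how such distributions could be incorporated, the sketch below implements a hypothetical utterance-level allocator that assigns each utterance to the server with the smallest predicted load, where the prediction uses the mean decoding-time-to-duration ratios quoted above. This is not the allocation policy evaluated in Section 4.3; it is only meant to show where task-level statistics would enter.

```python
# Hypothetical task-aware utterance allocator: server load is predicted as the
# sum of the expected CPU ratios of the utterances it is currently decoding,
# using the mean ratios reported above (0.022 for digits, 0.44 for LVCSR).
EXPECTED_CPU_RATIO = {"digits": 0.022, "lvcsr": 0.44}

class TaskAwareAllocator:
    def __init__(self, num_servers):
        self.load = [0.0] * num_servers   # predicted load per server
        self.assignment = {}              # utterance id -> (server, cost)

    def assign(self, utterance_id, task):
        cost = EXPECTED_CPU_RATIO[task]
        server = min(range(len(self.load)), key=lambda s: self.load[s])
        self.load[server] += cost
        self.assignment[utterance_id] = (server, cost)
        return server

    def release(self, utterance_id):
        server, cost = self.assignment.pop(utterance_id)
        self.load[server] -= cost

if __name__ == "__main__":
    # An LVCSR utterance contributes twenty times the predicted load of a digit
    # utterance, so digit traffic packs much more densely onto a server.
    alloc = TaskAwareAllocator(num_servers=2)
    print(alloc.assign("u1", "lvcsr"), alloc.assign("u2", "digits"), alloc.assign("u3", "digits"))
```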
4.3 Experimental Results
This section presents the results of two comparisons of the call-level allocation (CLA) and utterance-level allocation (ULA) strategies. The first compares the efficiencies of the CLA and ULA strategies that are predicted by the model presented in Sections 4.2.1 and 4.2.2. The second compares the two strategies using an actual deployment where ASR decoders are run on multiple servers processing utterances from an LVCSR task.

A comparison of the efficiencies as predicted by the model can be made by plotting the number of incoming calls with respect to the probabilities of overload, P_C in Equation 3 for the CLA strategy and P_U in Equation 4 for the ULA strategy. The difference in overall efficiency for the two strategies can be measured as the difference between the number of calls that are supported at a given probability of overload. Figure 6 shows a plot of this comparison for an example where the multi-user configurations illustrated in Figures 3 and 4 are configured with ten ASR servers, each of which can service a maximum of four simultaneous utterances without overload. It is also assumed that, on the average, there are active utterances to be processed by an ASR server for only one third, p_t = 1/3, of the total duration of a call. It is clear from Figure 6 that, at a probability of overload equal to 0.1, the utterance-level allocation strategy can support approximately two times the number of calls that can be supported by the call-level allocation strategy.

A comparison of the efficiencies that are obtainable in an actual deployment was made using the DSEM framework that was evaluated in Section 3. The framework was configured with ten 1 GHz Linux based servers running instances of the AT&T Watson ASR decoder. Calls were formed from utterances that were natural language queries to a spoken dialog system, with speech active for an average of 35 percent of the total call duration and each server able to service approximately two simultaneous utterances without overload. A rather aggressive load of four hundred of these calls was presented simultaneously to the multi-user system. An overall performance measure was
Fig. 6. Number of calls supported by CLA and ULA strategies using ten simulated ASR servers. The curves are plotted versus probability of overload predicted by P_C in Equation 3 for CLA and P_U in Equation 4 for ULA.
used that is derived from the latency-based QoS defined in Section 4.1. For a given number of incoming calls, the percentage of utterances for which the latency in generating a recognition result falls below a specified threshold is computed. Figure 7 shows these percentages plotted versus the threshold that is placed on the maximum response latency. The maximum response latency ranges from 0.5 to 3.0 sec. Curves are shown for both the CLA and ULA strategies. The system implemented with the ULA strategy is shown in Figure 7 to support a significantly larger call load than the CLA system. It can be seen that the ULA strategy is able to service approximately twice as many requests with a one second maximum latency.
5 Robust Modeling Techniques in Client-Server Scenarios
This section describes the application of acoustic adaptation and feature normalization procedures in the context of the DSEM client-server based ASR framework described in Section 3. This class of procedures is in general carried out in two steps, where parameters are first estimated from adaptation data and these parameters are then applied as transformations in the acoustic feature space or the HMM model space. A discussion of how the client-server framework impacts the implementation of this class of procedures will be followed by a description of the implementation of three well-known approaches to feature space adaptation/normalization. The implementation of these approaches was evaluated on a task where users fill in "voice fields" that appear
Fig. 7. Percentage of actual calls serviced within specified latencies for CLA and ULA strategies. Measurements were made on an actual server deployment consisting of ten servers with calls formed from natural language queries to a spoken dialog system.
on the display of a mobile hand-held device. The evaluation was performed under a scenario where unsupervised estimation of adaptation parameters was performed from user utterances collected during the normal use of the hand-held device.
5.1 Adaptation within the DSEM Framework
Client-server communications frameworks like the DSEM impact the implementation of these algorithms in several ways. First, the dynamic assignment of ASR decoders to individual utterances makes it very difficult in practice to configure the acoustic HMMs associated with these decoders to a particular user. As each individual utterance is shipped to one of multiple servers as part of the utterance-level ASR server allocation strategy, the server will suffer the overhead of loading a user-specific HMM model or adapting the parameters of a task-independent model for that user. Since it is not unusual for the HMM for a given server installation to be composed of tens of thousands of states and hundreds of thousands of Gaussian densities, this overhead can be substantial. This process would have to be repeated for each utterance that is routed to that server.

The second impact of the DSEM arises from the fact that, as argued in Section 3, it is ideally suited for operating on multiple channels of input speech data. Communications frameworks like the DSEM facilitate the implementation of low complexity feature space transformations and normalization
procedures for a large number of concurrent clients. The plot in Figure 2 demonstrates that it is possible to route a large number of client utterances to ASR servers while still maintaining acceptable user response latencies. As a result, relatively low complexity feature space transformation and normalization procedures can easily be applied within the DSEM framework with little impact on overall system performance.

A third impact of the DSEM arises in the implementation of "personalized services" where state information relating to individual users can be stored within the server installation. Acoustic compensation parameters can be estimated off-line from adaptation utterances, or statistics derived from those utterances, that have been collected from users' previous interactions with voice enabled services that are supported by the installation. The advantages of this paradigm for adaptation, from the standpoint of providing sufficient adaptation data and minimizing computational complexity during recognition, are well known. First, individual input utterances can be very short, sometimes single words, and are often insufficient for robust parameter estimation on their own; estimating parameters off-line from accumulated data avoids this problem. Second, the computational complexity associated with the estimation of parameters for many adaptation/normalization techniques could overwhelm the DSEM if performed at recognition time.
Fig. 8. The role of the DSEM framework in ASR feature adaptation and normalization.
5.2 Algorithms
This section describes the robust acoustic compensation algorithms that are implemented within the DSEM framework. The algorithms that are applied here include frequency warping based speaker normalization [13], constrained model adaptation (CMA) and speaker adaptive training (SAT) [9], and cepstrum mean and variance normalization. They were applied to compensating utterances spoken into a far-field, device-mounted microphone with respect to acoustic HMM models that were trained in a mismatched acoustic environment. Normalization/transformation parameters were estimated using anywhere from approximately one second to one minute of speech obtained from previous utterances spoken by the user of the device. All of these techniques were applied in the context of mel frequency cepstrum coefficient (MFCC) feature analysis. The Davis and Mermelstein triangular weighting functions with center frequencies spaced on a mel frequency scale were applied as a filter-bank to the 8 kHz bandwidth magnitude spectrum [6].

The first technique is frequency warping based speaker normalization [13]. Several definitions have been proposed for warping functions that can be applied to warping the frequency axis in ASR feature analysis, and there have been several techniques proposed for estimating an optimum warping function from adaptation data [18,13]. In this work, warping is performed by selecting a single linear warping function, α, from a W-length ensemble of candidate warping functions, using the adaptation utterances for a given speaker to maximize the likelihood of the adaptation speech with respect to the HMM. This ensemble of warping functions typically consists of approximately W = 20 linearly spaced values and corresponds to a compression or expansion of the frequency axis by ten to twenty percent. Then, during speech recognition for that speaker, the warping factor is retrieved and applied to scaling the frequency axis in MFCC based feature analysis [13]. During acoustic model training, a "warped HMM" is trained by estimating optimum warping factors for all speakers in the training set and retraining the HMM model using the warped utterances.

There are several regression based adaptation algorithms that obtain maximum likelihood estimates of model transformation parameters. The techniques differ primarily in the form of the transformations. Constrained model space adaptation (CMA) is investigated here [9]. CMA estimates a model transformation {A, b} to an HMM, λ, with means and variances µ and Σ, to create updated mean and variance

\hat{\mu} = A\mu - b, \quad \hat{\Sigma} = A \Sigma A^T.    (5)
These parameters are estimated to maximize the likelihood of the adaptation
data, X, P(X|λ, A, b), with respect to the model λ. The term "constrained" refers to the fact that the same transformation is applied to both the model means and covariances. Since the variances are transformed under CMA, it is generally considered to have some effect in compensating with respect to environmental variability, which is generally characterized by additive noise, as well as speaker and channel variability. An important implementation aspect of CMA is that this model transformation is equivalent to transforming the feature space, \hat{x}_t = A x_t + b. It is applied during recognition to the d = 39 component feature vectors, x_t, t = 1, ..., T, composed of cepstrum observations and the appended first and second order difference cepstrum.

Speaker adaptive training was also used for training the original acoustic model. In one implementation of SAT, an HMM is trained by estimating an optimum CMA transform for each speaker in the training set and retraining the HMM model using the transformed utterances [9]. This provides a more "compact" HMM model and results in improved performance when CMA is applied during recognition.

Cepstrum mean normalization (CMN) and cepstrum variance normalization (CVN) were also applied under a scenario similar to that of the algorithms described above. Normalization vectors, \tilde{\mu} and \tilde{\sigma} respectively, were computed from adaptation utterances for each speaker and then used to initialize estimates of normalization vectors for each input utterance. The incorporation of additional speech data provided by this simple modification to standard cepstrum normalization procedures had a significant impact on ASR performance.
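At recognition time, all of the above techniques reduce to inexpensive per-frame operations on the feature vectors. The following sketch, written with assumed parameter names, applies CMN/CVN followed by the feature-space form of the constrained transformation to a single 39-dimensional frame; frequency warping is realized separately by selecting a warped mel filterbank and is therefore not shown.

```python
# Sketch of the per-frame, feature-space application of the user-specific
# compensation parameters: cepstral mean and variance normalization followed by
# the affine feature-space form of the constrained transformation, x_hat = A x + b.
# The parameter names and the identity-valued example profile are illustrative;
# in practice the values come from the per-user store shown in Figure 8.
import numpy as np

D = 39  # static cepstra plus first and second order differences

def compensate_frame(x, user):
    """Apply CMN/CVN (O(d) per frame) and the constrained transform (O(d^2) per frame)."""
    x = (x - user["cmn_mean"]) / user["cvn_std"]   # mean and variance normalization
    return user["A"] @ x + user["b"]               # constrained affine transformation

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    user = {
        "cmn_mean": rng.normal(size=D),
        "cvn_std": np.ones(D),
        "A": np.eye(D),
        "b": np.zeros(D),
    }
    frame = rng.normal(size=D)
    print(compensate_frame(frame, user).shape)     # (39,)
```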
5.3 The Acoustic Reconfiguration Server
A block diagram of the architecture that has been realized for acoustic feature space adaptation/normalization within the DSEM framework is shown in Figure 8. It relies on acoustic transformations being applied to recognition utterances as they are routed by the DSEM from clients to ASR decoders. Figure 8 depicts acoustic feature analysis and feature space transformations being performed within the DSEM. It also depicts the storage of the user-specific acoustic parameters needed to implement these transformations. This includes speech data taken from previous utterances from a given user, transcriptions or word lattices produced by an ASR decoder from these utterances, and partial statistics that have been accumulated by parameter estimation algorithms. Finally, a "reconfiguration server" that is invoked by the DSEM for off-line estimation of adaptation parameters is also depicted in the figure. Of course, as discussed in Section 2, the architecture shown in Figure 8 represents one of many possible ways for distributing functionality between client and server. Many of the following arguments still apply if, for example, the feature
analysis is performed on the client instead of within the DSEM. One motivation for the architecture in Figure 8 is the need to minimize computational complexity during recognition. Applying frequency warping based speaker normalization during recognition requires no additional operations: it can be implemented in this scenario simply by swapping in the filterbank in MFCC analysis corresponding to the given warping function α. CMA requires d^2 operations per frame, corresponding to multiplying observation vectors by the regression matrix A. CMN and CVN require only d operations per frame, associated with subtracting the mean from feature vectors and scaling by the inverse variance. It has been found in practice that the additional computation associated with applying these transformations during recognition has minimal impact on the overall throughput as characterized by the plots in Figure 2.

Parameter estimation for all of the above procedures can be performed in an "incremental mode" where partial statistics are accumulated across multiple utterances. These partial statistics can be used to produce the next incremental update of the parameters as additional data becomes available. The per-user storage and computational requirements for feature space adaptation techniques like CMA can be fairly heavy. The computational load is dominated by an iterative matrix inversion that requires d^4 operations per iteration, and the partial statistics require on the order of d^3 floating point locations for storage. Obtaining a maximum likelihood estimate of the frequency warping function, α, can also be computationally intensive, requiring on the order of T · W · d operations, where T is the total number of adaptation frames and W is the size of the warping ensemble. To deal with this additional computational complexity, the DSEM invokes the "reconfiguration server" in Figure 8 at infrequent intervals to estimate the adaptation and normalization parameters for speaker S_i: {α_i, \tilde{\mu}_i, \tilde{\sigma}_i, A_i, b_i}. It is assumed that the DSEM has continually augmented the data storage for speaker S_i with speech and ASR transcriptions collected from previous utterances. The reconfiguration server produces the updated partial statistics and the updated parameters. The following section addresses the potential gains in WAC that are obtainable from the scenario given in Figure 8 for a typical application and addresses the frequency with which the off-line parameter estimation procedures should be invoked.
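The incremental mode of parameter estimation can be illustrated for the simplest case, the CMN/CVN statistics. The sketch below accumulates per-user sufficient statistics (frame count, sum, and sum of squares) across utterances and re-estimates the normalization vectors on demand, which is the role the reconfiguration server plays for these parameters; class and field names are illustrative.

```python
# Sketch of the "incremental mode" described above for the per-user CMN/CVN
# statistics: partial statistics are accumulated as utterances arrive, and the
# normalization vectors are re-estimated at infrequent intervals.
import numpy as np

class UserNormalizationStats:
    def __init__(self, dim=39):
        self.n = 0                        # total frames seen for this user
        self.sum = np.zeros(dim)          # running sum of feature vectors
        self.sumsq = np.zeros(dim)        # running sum of squared features

    def accumulate(self, frames):
        """Add one utterance's worth of frames (array of shape [T, dim])."""
        self.n += frames.shape[0]
        self.sum += frames.sum(axis=0)
        self.sumsq += (frames ** 2).sum(axis=0)

    def estimate(self):
        """Return updated CMN mean and CVN standard deviation vectors."""
        mean = self.sum / self.n
        var = self.sumsq / self.n - mean ** 2
        return mean, np.sqrt(np.maximum(var, 1e-6))   # floor for numerical safety

if __name__ == "__main__":
    stats = UserNormalizationStats()
    rng = np.random.default_rng(1)
    for _ in range(3):                    # three previously collected utterances
        stats.accumulate(rng.normal(size=(100, 39)))
    cmn_mean, cvn_std = stats.estimate()
    print(cmn_mean.shape, cvn_std.shape)
```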
5.4 Experimental Study
The feature normalization/adaptation algorithms described in Section 5.2 were used to reduce acoustic mismatch between task-independent HMM models and utterances spoken through a Compaq iPAQ hand-held device over the distributed framework described in Section 3. This section describes the scenario
under which the algorithms were evaluated, the speech database, and the experimental study.

The dataset for the study included a maximum of 400 utterances of proper names per speaker from a population of six speakers. The utterances were spoken through the device-mounted microphone on the hand-held device in an office environment. Since the data collection scenario also involved interacting with the display on the hand-held device, a distance of approximately 0.5 to 1.0 meters was maintained between the speaker and the microphone. The first 200 utterances for each speaker were used for estimating the parameters of the normalizations and transformations described in Section 5.2. After automatic end-pointing, this corresponded to an average of 3.5 minutes of speech per speaker. The remaining 1200 utterances, corresponding to isolated utterances of last-names (family names) from the six speakers, were used as a test set for the experimental study described below.

A baseline acoustic hidden Markov model (HMM) was trained from 18.4 hours of speech, which corresponds to 35,900 utterances of proper names and general phrases spoken over wire-line and cellular telephone channels. After decision tree based state clustering, the models consisted of approximately 3450 states and 23,500 Gaussian densities. In order to evaluate the effect of acoustic-level and task-level mismatch on this baseline model, ASR word error rates (WER) were evaluated on several speech corpora. The first corpus included 1000 utterances of proper names spoken as first-name last-name pairs that were collected over wire-line telephone channels. A WER of 4.8 percent was obtained for this corpus. The second corpus included isolated telephone bandwidth utterances of last-names that were collected from a different population of speakers over a close-talking noise-canceling microphone [21]. A significant increase in WER to 26.1 percent was obtained for this corpus, which was largely due to the more difficult task of recognizing isolated last-name utterances rather than first-name, last-name pairs. The third corpus, and the corpus that was used for the experiments reported in Table 1, consisted of the isolated last-name utterances that were spoken through a far-field device-mounted microphone under the conditions described above. A baseline WER of 41.5 percent was obtained for this corpus.

One can infer from these comparisons that both acoustic mismatch due to the far-field microphone and lexical ambiguity due to the less constrained recognition grammar combine to significantly degrade the baseline ASR performance for this task. In any case, 41.5 percent WER is clearly a very high baseline WER and not acceptable for realistic applications. One must be careful about making general interpretations of performance gains achieved when the baseline WER is high. However, it is not uncommon to find these levels of performance degradation when applications developers attempt to incorporate
incorporate generic acoustic models provided by speech technology vendors into a new domain. The goal of the robust compensation algorithms applied here is to close the performance gap between these scenarios.

It is important to note that this experimental study is by no means an exhaustive evaluation of robust ASR techniques. Model-based adaptation techniques were not evaluated because, as mentioned in Section 5.1, it is not practical for the ASR servers to dynamically load user-specific acoustic models for each utterance in our multi-user framework. Furthermore, the channel robustness techniques discussed in Section 2 and the large class of algorithms developed specifically for dealing with environmental distortions have the potential to improve ASR robustness in mobile domains. These techniques were not implemented as part of this study mainly because the speech utterances used here were collected from a domain where these classes of distortions had only marginal impact.

Table 1 displays the results of the experimental study as the word error rate (WER) resulting from the use of each of the individual algorithms, with parameters estimated from adaptation data of varying length. Columns 2 through 5 of Table 1 correspond to the WER obtained when an average of 1.3, 6.8, 13.4, and 58.2 seconds of speech data per speaker are used for speaker dependent parameter estimation.
                            Ave. Adaptation Data Dur. (sec)
Compensation Algorithm      1.3      6.8      13.4     58.2
------------------------------------------------------------
Baseline                    41.5     41.5     41.5     41.5
N                           40.2     37.2     36.8     36.8
N+W                         36.7     33.8     33.6     33.3
N+W+C                       –        35.0     32.3     29.8
N+W+C+SAT                   –        34.4     31.5     28.9

Table 1: WER obtained using unsupervised estimation of mean and variance normalization (N), frequency warping (W), and constrained model adaptation (C) parameters from varying amounts of adaptation data.
There are several observations that can be made from Table 1. First, by comparing rows 1 and 2, it is clear that simply initializing mean and variance normalization estimates using the adaptation data (N) results in a significant decrease in WER across all adaptation data sets. Second, frequency warping (W) is also shown to provide a significant reduction in WER, with the most dramatic reduction occurring for the case where an average of only 1.3 seconds of adaptation data per speaker is used to estimate warping factors. Third, by
observing rows 4 and 5 of Table 1, it is clear that constrained model adaptation (C) actually increases WER when the transformation matrix is estimated from less than 13.4 seconds of adaptation data. However, significant WER reductions were obtained as the adaptation data length was increased. The over-training problem observed here, in which adaptation algorithms degrade when insufficient adaptation data is available, is well known. Future work will investigate the use of procedures that prevent over-training by interpolating counts estimated on a small adaptation set with those obtained from other sources of data [10].
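As a rough illustration of the count-interpolation idea mentioned above, the sketch below blends adaptation-set statistics with a background estimate using an assumed interpolation weight; it is a generic MAP-style smoother offered only as an example, not the discounted likelihood method of [10].

```python
import numpy as np

def interpolate_mean(n_adapt, sum_adapt, mu_background, tau=20.0):
    """Blend an adaptation-data mean with a background mean.

    n_adapt       -- number of adaptation frames observed
    sum_adapt     -- per-dimension sum of adaptation observation vectors
    mu_background -- mean estimated from a larger, task-independent source
    tau           -- assumed interpolation weight; larger tau leans more
                     heavily on the background estimate
    With little adaptation data the result stays close to the background
    mean, which guards against the over-training behavior seen in Table 1.
    """
    return (sum_adapt + tau * mu_background) / (n_adapt + tau)
```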
6
Conclusions
This paper has addressed several important issues that are specific to the implementation of ASR applications and services in client-server scenarios. It was noted that there has been a great deal of research addressing issues relating to the communications channels associated with distributed speech recognition scenarios and addressing methods for making individual ASR channels more efficient. The techniques presented here, however, have addressed robustness and efficiency issues strictly in the context of multi-user scenarios.

All of these techniques relied on the existence of an efficient framework for client-server communications and for managing the resources associated with human-machine dialog systems. It was shown in Section 3 that the DSEM framework, with an event-driven, non-blocking I/O model and a single thread for managing concurrently connected clients, was well-behaved even when supporting many hundreds of clients. An architecture for implementing unsupervised acoustic feature space adaptation and normalization in the context of this framework was introduced. When approximately one minute of adaptation utterances was used to estimate parameters for a combination of algorithms in a large vocabulary name recognition task under this scenario, a 31% relative reduction in word error rate was obtained. Again using the DSEM framework, the effect of an intelligent scheme for allocating ASR decoders to application servers in multi-user client-server deployments was demonstrated. It was shown to decrease average response latencies by more than a factor of two when compared to a commonly used alternative approach to ASR resource allocation.

With an expanding infrastructure of personal devices, communications networks, and server configurations, it is hoped that there will be increased interest in addressing the problems of robust and efficient ASR that are relevant to this infrastructure. It is often the case that very efficient single channel ASR systems are deployed in relatively inefficient client-server installations which do not exploit the power of the underlying ASR technology. It is also the case
that many robust modeling techniques are not realizable in a given client-server framework, or are simplified to the point where they are less effective. Hence, addressing these problems from the standpoint of multi-user distributed scenarios may have a greater impact than incremental improvements in the underlying single channel systems.
References
[1] G. Banga, J. C. Mogul, and P. Druschel. A scalable and explicit event delivery mechanism for UNIX. In Proc. USENIX 1999 Annual Technical Conference, June 1999.

[2] A. Bernard and A. Alwan. Joint channel decoding - Viterbi recognition for wireless applications. Proc. European Conf. on Speech Communications, September 2001.

[3] E. Bocchieri and B. Mak. Subspace distribution clustering hidden Markov model. IEEE Transactions on Speech and Audio Processing, 9(3):264–275, March 2001.

[4] A. Cardenal-Lopez, L. Docio-Fernandez, and C. Garcia-Mateo. Soft decoding strategies for distributed speech recognition over IP networks. Proceedings of the International Conference on Acoustics, Speech, and Signal Processing, pages 49–52, May 2004.

[5] A. Chandra and D. Mosberger. Scalability of Linux event-dispatch mechanisms. Technical Report HPL-2000-174, Hewlett Packard Laboratory, 2000.

[6] S. B. Davis and P. Mermelstein. Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences. IEEE Trans. on Acoust., Speech, and Sig. Proc., ASSP-28(4):357–366, 1980.

[7] ETSI TS 126 094 (2001-03). Universal Mobile Telecommunications System (UMTS); Mandatory speech codec speech processing functions AMR speech codec; Voice Activity Detector (VAD) (3GPP TS 26.094 version 4.00 Release 4).

[8] T. Fingscheidt, S. Aalburg, S. Stan, and C. Beaugeant. Network-based versus distributed speech recognition in adaptive multi-rate wireless systems. Proc. Int. Conf. on Spoken Lang. Processing, pages 2209–2212, September 2002.

[9] M. J. F. Gales. Maximum likelihood linear transformations for HMM-based speech recognition. Computer Speech and Language, 12:75–98, 1998.

[10] A. Gunawardana and W. Byrne. Robust estimation for rapid speaker adaptation using discounted likelihood techniques. Proc. Int. Conf. on Acoust., Speech, and Sig. Processing, May 2000.
[11] H. K. Kim, R. V. Cox, and R. C. Rose. Bitstream-based front-end for wireless speech recognition in adverse environments. IEEE Trans. on Speech and Audio Processing - Special Issue on Speech Technologies for Mobile and Portable Devices, November 2002, to be published.

[12] I. Kiss, A. Lakaniemi, C. Yang, and O. Viikki. Review of AMR speech codec and distributed speech recognition-based speech-enabled services. Proc. IEEE ASRU Workshop, pages 613–618, December 2003.

[13] L. Lee and R. C. Rose. A frequency warping approach to speaker normalization. IEEE Trans. on Speech and Audio Processing, 6, January 1998.

[14] L. K. McDowell, S. J. Eggers, and S. D. Gribble. Improving server software support for simultaneous multithreaded processors. In Proc. Ninth ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, June 2003.

[15] M. Mohri and M. Riley. Network optimization for large vocabulary speech recognition. Speech Communication, 25(3), 1998.

[16] S. Ortmanns, H. Ney, and T. Firzlaff. Fast likelihood computation methods for continuous mixture densities in large vocabulary speech recognition. Proc. European Conf. Speech Communication and Technology, September 1997.

[17] V. S. Pai, P. Druschel, and W. Zwaenepoel. Flash: An efficient and portable web server. In Proc. USENIX 1999 Annual Technical Conference, June 1999.

[18] M. Pitz, S. Molau, R. Schluter, and H. Ney. Vocal tract normalization equals linear transformation in cepstral space. Proc. European Conf. on Speech Communications, September 2001.

[19] A. Potamianos and V. Weerackody. Soft-feature decoding for speech recognition over wireless channels. Proceedings of the International Conference on Acoustics, Speech, and Signal Processing, pages 269–272, May 2001.

[20] R. C. Rose, I. Arizmendi, and S. Parthasarathy. An efficient framework for robust mobile speech recognition services. Proceedings of the International Conference on Acoustics, Speech, and Signal Processing, April 2003.

[21] R. C. Rose, S. Parthasarathy, B. Gajic, A. E. Rosenberg, and S. Narayanan. On the implementation of ASR algorithms for hand-held wireless mobile devices. Proc. Int. Conf. on Acoust., Speech, and Sig. Processing, May 2001.

[22] R. A. Sukkar, R. Chengalvarayan, and J. J. Jacob. Unified speech recognition for landline and wireless environments. Proc. Int. Conf. on Acoust., Speech, and Sig. Processing, pages 293–296, May 2002.

[23] Z.-H. Tan, P. Dalsgaard, and B. Lindberg. On the integration of speech recognition into personal networks. Proc. Int. Conf. on Spoken Lang. Processing, October 2004.
[24] O. Viikki. ASR in portable wireless devices. Proc. IEEE ASRU Workshop, December 2001.

[25] M. Welsh, D. E. Culler, and E. A. Brewer. SEDA: An architecture for well-conditioned, scalable internet services. In Symposium on Operating Systems Principles, pages 230–243, 2001.

[26] S. Wendt, G. A. Fink, and F. Kummert. Dynamic search-space pruning for time-constrained speech recognition. Proc. Int. Conf. on Spoken Lang. Processing, September 2002.