Efficient Automatic Speech Recognition on the GPU

Jike Chong∗†‡, Ekaterina Gonina∗§, and Kurt Keutzer∗¶

∗ Department of Electrical Engineering and Computer Science, Soda Hall, UC Berkeley, CA 94720. † Parasians, LLC, 258 Ficus Terrace, Sunnyvale, CA 94086. ‡ Email: [email protected] § Email: [email protected] ¶ Email: [email protected]

May 3, 2011

Automatic speech recognition (ASR) allows multimedia content to be transcribed from acoustic waveforms to word sequences. This technology is emerging as a critical component in data analytics for the wealth of media data being generated every day. Commercial usage scenarios are already appearing in industries such as customer service call centers, where ASR is used to search recorded content, track service quality, and provide early detection of service issues. Fast and efficient ASR enables the economical application of text-based data analytics to multimedia content, opening the door to applications such as automatic meeting diarization, news broadcast transcription, and voice-activated multimedia systems for home entertainment. This chapter provides speech recognition application developers with an understanding of the specific implementation challenges of the speech inference process, weighted finite state transducer (WFST) based methods, and the Viterbi algorithm, and illustrates an efficient reference implementation on the GPU that can be productively customized to meet the needs of specific usage scenarios. The chapter also introduces to machine learning researchers the capabilities of the GPU for handling large, irregular graph-based models with millions of states and arcs. Lastly, the chapter provides GPU computing researchers and developers with four generalized solutions to four types of algorithmic challenges encountered in the implementation of speech recognition on GPUs. These solutions include concepts that are generally useful for creating performance-critical applications on the GPU.
1 Introduction, Problem Statement, and Context
It is well known that automatic speech recognition (ASR) is a challenging application to parallelize [6, 5]. On the GPU in particular, an efficient implementation of ASR involves resolving a series of implementation challenges specific to the platform's data-parallel architecture. This section introduces the speech application and highlights four of the most important challenges to be resolved when implementing the application on the GPU.
1.1 The speech recognition application
The goal of an automatic speech recognition application is to analyze a human utterance from a sequence of input audio waveforms in order to interpret and distinguish the words and sentences intended by the speaker. Its top-level software architecture is shown in Figure 1. The inference process uses a recognition network, or speech model, which includes three components: an acoustic model, a pronunciation model, and a language model. The acoustic model and language model are trained off-line using powerful statistical learning techniques. The pronunciation model is constructed from a dictionary. The speech feature extractor collects feature vectors from input audio waveforms using standard scalable signal processing techniques [9, 10]. The inference engine traverses the recognition network based on the Viterbi search algorithm [8]. It infers the most likely word sequence based on the extracted speech features and the recognition network. In a typical recognition process, there are significant parallelism opportunities in concurrently evaluating thousands of alternative interpretations of a speech utterance to find the most likely interpretation [3, 13].
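For concreteness, the Viterbi recurrence at the heart of such an inference engine can be written as follows. This is a generic log-domain formulation, not notation taken from the chapter: psi_t(s) denotes the score of the best path reaching recognition-network state s after t feature frames, w(s', s) the transition (arc) weight from the recognition network, and log P(x_t | s) the acoustic model score for feature vector x_t.

    \[
      \psi_t(s) \;=\; \max_{(s',\,s)\,\in\,E} \Big[\, \psi_{t-1}(s') \;+\; w(s', s) \;+\; \log P(x_t \mid s) \,\Big],
      \qquad \psi_0(s_{\text{start}}) = 0 .
    \]

The most likely word sequence is then recovered by backtracking through the arcs that achieved the maxima, and beam pruning restricts the recurrence to a set of active states at each time step.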
,%$"-(#."(&/%+0"12& C=/:$.=& 7/K%>&
!"#$%& '()*+&
F0%%="& G%#,:-%& HI,-#=,/-&
3)%%$4& 5%6+*1%7&
!-/*:*=+#./*& 7/K%>&
L#*E:#E%& 7/K%>&
J&
;*D%-%*=%&& H*E+*%&
C-="+,%=,:-%&/D&,"%&+*D%-%*=%&%*E+*%2&
8"19& 3%:*%($%& I think therefore I am
)*%&+,%-#./*&0%-& .1%&$,%02& &&&&&34567&+*$,8&
!"#$%&'&
9/10:,%&;*,%*$+2.-3,#?)##
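The per-time-step structure shown in the inference engine diagram can be made concrete with a short skeleton. The following is a minimal CUDA sketch under assumptions of ours, not the chapter's actual code: the kernel names, signatures, and stubbed kernel bodies are hypothetical, and only the two-phase, one-iteration-per-feature-frame control flow is the point.

    #include <cuda_runtime.h>

    // Phase 1 (compute intensive): evaluate observation probabilities for the
    // acoustic model states needed by the active portion of the network.
    __global__ void computeObservationProbabilities(const float *frame,
                                                    float *obsProb, int numActive)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < numActive) {
            // ... Gaussian mixture evaluation for active state i (omitted) ...
            obsProb[i] = 0.0f;  // placeholder
        }
    }

    // Phase 2 (communication intensive): traverse outgoing arcs of active states,
    // perform the Viterbi reduction, prune against the beam, and build the next
    // active-state set.
    __global__ void traverseAndPrune(const float *obsProb, float *stateScore,
                                     int numArcs)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < numArcs) {
            // ... score propagation and pruning for arc i (omitted) ...
        }
    }

    // Host-side decoding loop: one iteration per feature frame (time step).
    void decodeUtterance(const float *d_features, int frameDim, int numFrames,
                         float *d_obsProb, float *d_stateScore,
                         int numActive, int numArcs)
    {
        const int threads = 256;
        for (int t = 0; t < numFrames; ++t) {
            computeObservationProbabilities<<<(numActive + threads - 1) / threads,
                                              threads>>>(d_features + t * frameDim,
                                                         d_obsProb, numActive);
            traverseAndPrune<<<(numArcs + threads - 1) / threads,
                               threads>>>(d_obsProb, d_stateScore, numArcs);
        }
        cudaDeviceSynchronize();  // wait for the last frame to finish
    }

In a real decoder the active state and arc counts change every frame, which is exactly what makes the communication intensive phase and its data structures challenging; the sketch fixes them only to keep the skeleton short.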
[Figure 5 panels: decoding time per second of speech for the sequential and manycore (GPU) implementations, with pie charts breaking each run down into its compute intensive and communication intensive phases.]
Figure 5: Ratio of the compute intensive phase to the communication intensive phase of the algorithm, and the corresponding speedups on the GPU.

The speedup numbers indicate that synchronization overhead dominates the run time as more processors need to be coordinated in the communication intensive phase. In terms of the ratio between the compute and communication intensive phases, the pie charts in Figure 5 show that 82.7% of the time in the sequential implementation is spent in the compute intensive phase of the application. As we scale to our GPU implementation, the compute intensive phase becomes proportionally less dominant, taking only 49.0% of the total runtime. The increasing dominance of the communication intensive phase motivates a detailed examination of the parallelization implications of the communication intensive phase of our inference engine. We found that the sequential overhead in our implementation is less than 2.5% of the total run time even for the fastest implementation. This demonstrates that we have a scalable software architecture that promises greater potential speedups with more parallelism in future generations of processors.
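These two fractions by themselves determine how much more the compute intensive phase was accelerated than the communication intensive phase, independent of the overall speedup. As a back-of-the-envelope check (assuming the two phases account for essentially all of the run time, which the sub-2.5% sequential overhead supports), with T_seq and T_gpu the total sequential and GPU run times:

    \[
      \frac{S_{\text{compute}}}{S_{\text{comm}}}
      = \frac{0.827\,T_{\text{seq}} \,/\, (0.490\,T_{\text{gpu}})}
             {0.173\,T_{\text{seq}} \,/\, (0.510\,T_{\text{gpu}})}
      = \frac{0.827}{0.490}\cdot\frac{0.510}{0.173}
      \approx 5.0 .
    \]

That is, the compute intensive phase gained roughly five times as much speedup as the communication intensive phase, which is consistent with synchronization overhead limiting the latter.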
4 Conclusion and Future Directions
In this chapter, we demonstrated concrete solutions to mitigate the implementation challenges of an automatic speech recognition application on NVIDIA graphics processing units (GPUs). We described the software architecture of an automatic speech recognition application, posed four of the most important challenges in implementing the application, and resolved these challenges with four corresponding solutions on a GPU-based platform. The challenges and solutions are:

1. Challenge: Handling irregular graph structures with data-parallel operations.
   Solution: Constructing an efficient dynamic vector data structure to handle irregular graph traversals.

2. Challenge: Eliminating redundant work when threads access an unpredictable, input-dependent subset of data.
   Solution: Implementing an efficient find-unique function by leveraging the GPU global memory write-conflict-resolution policy.

3. Challenge: Performing conflict-free reductions in graph traversal to implement the Viterbi beam search algorithm.
   Solution: Implementing lock-free accesses to a shared map, leveraging advanced GPU atomic operations with arithmetic operations to enable conflict-free reduction (see the first sketch below).

4. Challenge: Parallel construction of a global queue causes sequential bottlenecks when atomically accessing queue control variables.
   Solution: Using hybrid local/global atomic operations and local buffers for the construction of a global queue, avoiding sequential bottlenecks in accessing global queue control variables (see the second sketch below).

Our ongoing research is to incorporate these techniques into an automatic speech recognition inference engine framework that allows application developers to further customize and extend their speech recognizer within an optimized infrastructure for this class of applications. Figure 6 shows such a framework with its reference implementation.
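To make solutions 3 and 4 more tangible, two minimal CUDA sketches follow. They are illustrations under our own assumptions rather than the chapter's reference code: the array names, the packed-score representation, and the buffer sizes are all hypothetical. The first sketch shows a conflict-free Viterbi max-reduction in the spirit of solution 3, using an order-preserving float-to-integer transform so that a single atomicMax per arc keeps the best score per destination state.

    #include <cuda_runtime.h>

    // Order-preserving map from float to unsigned int: if a < b as floats, then
    // flipFloat(a) < flipFloat(b) as unsigned ints. This lets atomicMax on
    // unsigned integers implement a max-reduction over floating-point scores.
    __device__ unsigned int flipFloat(float f)
    {
        unsigned int u = (unsigned int)__float_as_int(f);
        return (u & 0x80000000u) ? ~u : (u | 0x80000000u);
    }

    // One thread per active arc. bestScore[] holds packed scores and is assumed
    // to be initialized to flipFloat(-infinity) before the launch; the inverse
    // transform recovers the float score afterwards.
    __global__ void propagateArcs(const int *activeArcs, int numActiveArcs,
                                  const int *arcSrc, const int *arcDst,
                                  const float *arcWeight, const float *srcScore,
                                  unsigned int *bestScore)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= numActiveArcs) return;

        int arc = activeArcs[i];
        float candidate = srcScore[arcSrc[arc]] + arcWeight[arc];  // log domain

        // Many arcs share a destination; atomicMax resolves the write conflicts
        // without locks, keeping only the best candidate per destination state.
        atomicMax(&bestScore[arcDst[arc]], flipFloat(candidate));
    }

The second sketch illustrates the hybrid local/global queue idea of solution 4: survivors of the beam are first appended to a per-block shared-memory buffer with cheap block-local atomics, and each block then reserves space in the global queue with a single global atomicAdd.

    #define LOCAL_CAPACITY 1024  // per-block buffer size (an assumption)

    __global__ void gatherActiveStates(const int *candidates, const float *scores,
                                       int numCandidates, float beamThreshold,
                                       int *globalQueue, int *globalCount)
    {
        __shared__ int localQueue[LOCAL_CAPACITY];
        __shared__ int localCount;
        __shared__ int globalBase;

        if (threadIdx.x == 0) localCount = 0;
        __syncthreads();

        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < numCandidates && scores[i] > beamThreshold) {
            int slot = atomicAdd(&localCount, 1);      // cheap shared-memory atomic
            if (slot < LOCAL_CAPACITY)
                localQueue[slot] = candidates[i];
        }
        __syncthreads();

        if (threadIdx.x == 0)                          // one global atomic per block
            globalBase = atomicAdd(globalCount, min(localCount, LOCAL_CAPACITY));
        __syncthreads();

        // Cooperatively copy the block's survivors into the reserved global slots.
        for (int j = threadIdx.x; j < min(localCount, LOCAL_CAPACITY); j += blockDim.x)
            globalQueue[globalBase + j] = localQueue[j];
    }

Both kernels assume 32-bit representations and a one-thread-per-work-item launch; a production decoder would also handle overflow of the local buffer and the un-flipping of packed scores, which are omitted here for brevity.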
!"#!$$%&'()*"#+,(-./*,0#1*,#(#2$..'3#4"1.,."'.#5"6&".# IR;%(">"(%% L&EI%S".R;/3%
E.#,1% G#.#%E./01.0/"% 45,"%6/"(;#
%$*."/#,;+%:;+./;(%
!"#$%&'(")%
&'("%*+50.%&;/"E".%
D$#5,>"%2"#