A Fully Data Parallel WFST-based Large Vocabulary Continuous Speech Recognition on a Graphics Processing Unit

Jike Chong, Ekaterina Gonina, Youngmin Yi, Kurt Keutzer
Department of Electrical Engineering and Computer Science, University of California, Berkeley
{jike, egonina, ymyi, keutzer}@eecs.berkeley.edu
Abstract

Tremendous compute throughput is becoming available in personal desktop and laptop systems through the use of graphics processing units (GPUs). However, exploiting this resource requires re-architecting an application to fit a data parallel programming model. The complex graph traversal routines in the inference process for large vocabulary continuous speech recognition (LVCSR) have been considered by many as unsuitable for extensive parallelization. We explore and demonstrate a fully data parallel implementation of a speech inference engine on NVIDIA's GTX280 GPU. Our implementation consists of two phases: a compute-intensive observation probability computation phase and a communication-intensive graph traversal phase. We take advantage of dynamic elimination of redundant computation in the compute-intensive phase while maintaining close-to-peak execution efficiency. We also demonstrate the importance of exploring application-level trade-offs in the communication-intensive graph traversal phase to adapt the algorithm to data parallel execution on GPUs. On a 3.1-hour speech data set, we achieve more than 11× speedup over a highly optimized sequential implementation on an Intel Core i7 without sacrificing accuracy.

Index Terms: data parallel, continuous speech recognition, graphics processing unit

1. Introduction

Graphics processing units (GPUs) are enabling tremendous compute capabilities in personal desktop and laptop systems. Recent advances in GPU programming models, such as CUDA [1] from NVIDIA, have provided an implementation path for many exciting applications beyond graphics processing, such as speech recognition. To take advantage of the high throughput of GPU-based platforms, programmers must transform their algorithms to fit the data parallel model. This can be challenging for algorithms that do not map directly onto the model, such as graph traversal in a speech inference engine. In this paper we explore the use of GPUs for large vocabulary continuous speech recognition (LVCSR) on the NVIDIA GTX280 GPU. An LVCSR application analyzes a human utterance from a sequence of input audio waveforms to interpret and distinguish the words and sentences intended by the speaker. Its top-level architecture is shown in Fig. 1. The recognition process uses a Weighted Finite State Transducer (WFST) based recognition network [2], a language database that is compiled offline from a variety of knowledge sources using powerful statistical learning techniques. The speech feature extractor collects feature vectors from the input audio waveforms, and the Hidden-Markov-Model-based inference engine then computes the most likely word sequence based on the extracted speech features and the recognition network.
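To make the compute-intensive phase concrete, the following is a minimal CUDA sketch, our own illustration rather than the paper's code. It assumes a standard diagonal-covariance Gaussian mixture acoustic model and scores one HMM state per thread for the current feature frame; the GmmState layout, the dimensions D and M, the kernel name observationProb, and the max-approximation to log-sum-exp are all assumptions made for this example.

// Hypothetical sketch of the compute-intensive phase (not the paper's code):
// each thread computes the observation log-likelihood of one
// Gaussian-mixture HMM state for the current feature vector.
#include <cuda_runtime.h>
#include <math_constants.h>

#define D 39   // feature dimension (assumed: 13 MFCCs + deltas + delta-deltas)
#define M 16   // Gaussian mixture components per state (assumed)

struct GmmState {
    float logWeight[M];   // log mixture weights
    float mean[M][D];     // component means
    float invVar[M][D];   // 1 / sigma^2 (diagonal covariance)
    float logNorm[M];     // precomputed -0.5 * (D*log(2*pi) + sum log sigma^2)
};

__global__ void observationProb(const GmmState* states, int numStates,
                                const float* feature,  // one frame, D floats
                                float* logProb)        // one score per state
{
    int s = blockIdx.x * blockDim.x + threadIdx.x;
    if (s >= numStates) return;

    const GmmState& g = states[s];
    float best = -CUDART_INF_F;
    for (int m = 0; m < M; ++m) {
        float e = 0.0f;                      // squared Mahalanobis distance
        for (int d = 0; d < D; ++d) {
            float diff = feature[d] - g.mean[m][d];
            e += diff * diff * g.invVar[m][d];
        }
        // Max over components approximates log-sum-exp.
        best = fmaxf(best, g.logWeight[m] + g.logNorm[m] - 0.5f * e);
    }
    logProb[s] = best;
}

// Host-side launch, one thread per HMM state:
//   observationProb<<<(numStates + 255) / 256, 256>>>(dStates, numStates,
//                                                     dFeature, dLogProb);

The paper additionally eliminates redundant computation dynamically in this phase; that optimization is omitted from the sketch for simplicity.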
Figure 1: Architecture of the inference engine. Voice input is processed by the speech feature extractor; the inference engine combines the resulting speech features with the acoustic model, pronunciation model, and language model to produce the output word sequence (e.g. "I think therefore I am").

(Further recovered figure labels: "one iteration per time step", "Phase 1", "Compute Intensive".)
Figure 4: Communication Intensive Phase Run Time in the Inference Engine (normalized to one second of speech)
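Figure 4 concerns the communication-intensive graph traversal phase, in which, as the paper notes, each arc transition updates its destination state atomically. The sketch below is again our own illustration under stated assumptions: the Arc layout, the kernel name arcTransitions, and the indexing of observation costs by an hmmState field are hypothetical, and costs are assumed to be non-negative -log probabilities so that CUDA's integer atomicMin can compare their bit patterns.

// Hypothetical sketch of the communication-intensive phase (not the
// paper's code): one thread per arc; each arc transition updates its
// destination state's path cost atomically.
#include <cuda_runtime.h>

struct Arc {            // one WFST arc (assumed layout)
    int src;            // source state index
    int dst;            // destination state index
    int hmmState;       // HMM state whose observation cost applies (assumed)
    float weight;       // arc weight as a -log probability
};

__global__ void arcTransitions(const Arc* arcs, int numArcs,
                               const float* srcCost,      // path costs at time t
                               const float* obsCost,      // -log observation prob
                               unsigned int* dstCostBits) // path costs at time t+1
{
    int a = blockIdx.x * blockDim.x + threadIdx.x;
    if (a >= numArcs) return;

    Arc arc = arcs[a];
    float cost = srcCost[arc.src] + arc.weight + obsCost[arc.hmmState];
    // For non-negative IEEE-754 floats (costs assumed clamped to >= 0),
    // unsigned bit-pattern order equals float order, so atomicMin on the
    // bits implements an atomic min over costs. dstCostBits must be
    // initialized to __float_as_uint(FLT_MAX) at the start of each frame.
    atomicMin(&dstCostBits[arc.dst], __float_as_uint(cost));
}

Pruning and management of the active state set, where the application-level trade-offs discussed in the abstract arise, are omitted from this sketch.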
6. References

[1] NVIDIA CUDA Programming Guide, NVIDIA Corporation, 2009, version 2.2 beta. [Online]. Available: http://www.nvidia.com/CUDA
[2] M. Mohri, F. Pereira, and M. Riley, "Weighted finite state transducers in speech recognition," Computer Speech and Language, vol. 16, 2002.
[3] A. Lumsdaine, D. Gregor, B. Hendrickson, and J. Berry, "Challenges in parallel graph processing," Parallel Processing Letters, 2007.
[4] A. Janin, "Speech recognition on vector architectures," Ph.D. dissertation, University of California, Berkeley, Berkeley, CA, 2004.
[5] H. Ney and S. Ortmanns, "Dynamic programming search for continuous speech recognition," IEEE Signal Processing Magazine, vol. 16, 1999.
[6] M. Ravishankar, "Parallel implementation of fast beam search for speaker-independent continuous speech recognition," 1993.
[7] S. Phillips and A. Rogers, "Parallel speech recognition," Intl. Journal of Parallel Programming, vol. 27, no. 4, pp. 257–288, 1999.
[8] K. You, Y. Lee, and W. Sung, "OpenMP-based parallel implementation of a continuous speech recognizer on a multi-core system," in Proc. IEEE Intl. Conf. on Acoustics, Speech, and Signal Processing (ICASSP), Taipei, Taiwan, 2009.
[9] S. Ishikawa, K. Yamabana, R. Isotani, and A. Okumura, "Parallel LVCSR algorithm for cellphone-oriented multicore processors," in Proc. IEEE Intl. Conf. on Acoustics, Speech, and Signal Processing (ICASSP), Toulouse, France, 2006.
[10] P. R. Dixon, T. Oonishi, and S. Furui, "Fast acoustic computations using graphics processors," in Proc. IEEE Intl. Conf. on Acoustics, Speech, and Signal Processing (ICASSP), Taipei, Taiwan, 2009.
[11] P. Cardinal, P. Dumouchel, G. Boulianne, and M. Comeau, "GPU accelerated acoustic likelihood computations," in Proc. Interspeech, 2008.
[12] J. Chong, Y. Yi, A. Faria, N. R. Satish, and K. Keutzer, "Data-parallel large vocabulary continuous speech recognition on graphics processors," in Proc. Workshop on Emerging Applications and Manycore Architectures, 2008.
[13] X. Huang, A. Acero, and H.-W. Hon, Spoken Language Processing: A Guide to Theory, Algorithm and System Development. Prentice-Hall, 2001.
[14] S. Kanthak, H. Ney, M. Riley, and M. Mohri, "A comparison of two LVR search optimization techniques," in Proc. Intl. Conf. on Spoken Language Processing (ICSLP), Denver, Colorado, USA, 2002, pp. 1309–1312.
[15] J. Chong, K. You, Y. Yi, E. Gonina, C. Hughes, W. Sung, and K. Keutzer, "Scalable HMM based inference engine in large vocabulary continuous speech recognition," in Proc. Workshop on Multimedia Signal Processing and Novel Parallel Computing, July 2009.
[16] G. Tur et al., "The CALO meeting speech recognition and understanding system," in Proc. IEEE Spoken Language Technology Workshop, 2008.
[17] A. Stolcke, X. Anguera, K. Boakye, O. Cetin, A. Janin, M. Magimai-Doss, C. Wooters, and J. Zheng, "The SRI-ICSI Spring 2007 meeting and lecture recognition system," Lecture Notes in Computer Science, vol. 4625, no. 2, pp. 450–463, 2008.