SPEECH PROCESSING TECHNOLOGY FOR HUMAN BEHAVIOR AND PERFORMANCE MONITORING: RECENT ALGORITHMIC ADVANCEMENTS, SOLUTIONS AND CHALLENGES
A. Sangwan, C. Yu, L. Kaushik, A. Ziaei, J.H.L. Hansen
Center for Robust Speech Systems (CRSS), The University of Texas at Dallas (UTD), Richardson, Texas 75080.

INTRODUCTION: We intend to build an automatic system that uses speech-processing technology to measure aspects of individual behavior and team communication. In the envisioned system, speech data is captured continuously, so that every activity, interaction and conversation can be analyzed for intelligent metrics. An important step in building such a system is researching and developing robust algorithms that can process long-duration audio signals and extract entry-level information such as “who spoke when?” (Speaker Identification and Diarization technology), “what was said?” (Automatic Speech Recognition and Keyword Spotting) and “how was it said?” (Speech under Stress). This capability organizes the information by identifying semantic segments in the long-duration audio, which lays the foundation for meaningful computation of second-level behavior and communication metrics such as turn-taking statistics, conversation sentiment and topic, speaker stress level, and non-verbal events such as laughter and sighs. In this work, we will discuss our recent algorithmic advancements that facilitate robust extraction and effective evaluation of the first- and second-level information described above.

METHOD: The proposed method captures speech data continuously. In this manner, all activities are monitored continuously, along with every interaction and conversation. First, speech activity detection (SAD) technology is used to separate speech from pause. Next, speaker diarization technology is used to identify individual speakers in the audio stream, and the speaker identity is determined using speaker recognition technology.
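To make the first stage concrete, the step of separating speech from pause can be illustrated with a simple frame-energy threshold. This is only a minimal sketch of a generic baseline, not the SAD method of [3]: the function names, the 20 ms frame length, the -35 dB threshold and the minimum run of 3 frames are all assumptions chosen for illustration, and the input is assumed to be a list of samples scaled to the range [-1, 1].

```python
import math


def frame_energies_db(samples, frame_len):
    """Log-energy in dB for consecutive non-overlapping frames."""
    energies = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        power = sum(x * x for x in frame) / frame_len
        # Small floor avoids log10(0) on digital silence.
        energies.append(10.0 * math.log10(power + 1e-12))
    return energies


def detect_speech(samples, sample_rate, frame_ms=20,
                  threshold_db=-35.0, min_frames=3):
    """Return (start_s, end_s) segments whose frame energy exceeds the threshold.

    All parameter defaults are illustrative assumptions, not values
    taken from the abstract or its references.
    """
    frame_len = int(sample_rate * frame_ms / 1000)
    flags = [e > threshold_db for e in frame_energies_db(samples, frame_len)]

    segments = []
    run_start = None
    # The trailing False is a sentinel that closes a run ending at the last frame.
    for i, is_speech in enumerate(flags + [False]):
        if is_speech and run_start is None:
            run_start = i
        elif not is_speech and run_start is not None:
            # Discard runs shorter than min_frames (likely clicks or noise bursts).
            if i - run_start >= min_frames:
                segments.append((run_start * frame_len / sample_rate,
                                 i * frame_len / sample_rate))
            run_start = None
    return segments
```

Calling `detect_speech(samples, 8000)` on a recording returns a list of time intervals that the later stages (diarization, speaker recognition) would operate on. A fixed energy threshold like this is fragile on noisy, long-duration recordings, which is the kind of condition the dedicated SAD work cited in this abstract addresses.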
Using this information, the audio stream can be semantically segmented into a series of successive activities and conversations. Conversation turn-taking statistics are then computed to determine the nature of the interaction. Speech recognition and keyword spotting technology is used to track the conversation topic and to determine affect, such as sentiment. Nonverbal events such as laughter are also detected. The combination of this information gives insight into the quality and quantity of interaction [1].

RESULTS, DISCUSSION AND ILLUSTRATION: We have been developing a new corpus using NASA Apollo space mission data in order to research, develop and evaluate solutions [2]. As part of corpus development, we have been digitizing multi-track Apollo mission backroom loops (for the first time, to the best of our knowledge). This corpus will be made available for public release, and we will discuss various aspects of this collection. Additionally, we will showcase the performance improvements that we have achieved through specific algorithmic advancements for various speech tasks relevant to the proposed system. In particular, we will present a new method for SAD that reduces error rates by an order of magnitude for long-duration audio recordings [3]. This is an important advancement because SAD output drives downstream speech processing components such as speech and speaker recognition, and has a significant impact on overall system performance. We will also present a new method for characterizing activity in audio streams that can classify different types of activities, interactions and conversations with 80% accuracy. This is an important result because it allows us to divide a continuous audio file into semantic segments, which is essential for meaningful behavior and performance analysis. Furthermore, we will show a new method that helps in analyzing the impact of physical and cognitive stress on astronauts' speech [4].
Finally, we will present a new method for detecting topics and sentiment in conversation using Automatic Speech Recognition (ASR)/Keyword Spotting (KWS) technology [5,6]. We have used Apollo mission data to evaluate these speech systems, and we will share the corresponding results. Additionally, we will discuss details of the Apollo corpus that we are creating. We will also play several audio samples that demonstrate how the algorithms function as well as the challenges presented by the data.

REFERENCES:
[1] Hayley, H. and Gatica-Perez, D. (2010): IEEE Transactions on Multimedia, 12(6), 563-565.
[2] Sangwan, A. et al. (2013): Interspeech 2013, 1135-1139.
[3] Ziaei, A. et al. (2014): Interspeech 2014, 1544-1548.
[4] Yu, C. et al. (2014): Interspeech, 945-948.
[5] Kaushik, L. et al. (2013): Automatic Speech Recognition and Understanding (ASRU), 239-244.
[6] Hansen, J.H.L. et al. (1996): Speech Communication, 20(1-2), 151-173.