IEEE TRANSACTIONS ON SPEECH AND AUDIO PROCESSING, VOL. 8, NO. 1, JANUARY 2000


Guest Editorial Introduction to the Special Issue on Language Modeling and Dialogue Systems

THE TRANSACTIONS ON SPEECH AND AUDIO PROCESSING (and the TRANSACTIONS ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING before it) has been an important repository of research on the speech signal, including production, perception, analysis, synthesis, coding, enhancement, and recognition. However, language modeling and speech understanding activities have been underrepresented, despite the availability of editorial categories in these areas. Why is that? As a primary mode of human communication, speech is much more than a sequence of quasi-stationary signals, phones, or words. It is clear that higher level linguistic processes such as syntax and semantics play a vital role both as a source of constraint and as a means for understanding, especially in the context of a conversation. The goal of this Special Issue is to highlight current research activities involving speech and language, in order to raise awareness that this subject is of strong interest to this publication. Since we could not highlight all of the areas in this EDICS category, we focused on language modeling and dialogue systems. In this issue, we have eight papers that give the reader a broad overview of activities in these areas and provide citations to other related work in the field.

Currently, the dominant method for incorporating linguistic constraint into a speech recognizer is the n-gram language model. Due to well-known parameter estimation difficulties, n-grams typically provide only local word order probabilities (e.g., trigrams). Thus, methods that can capture longer distance relationships can potentially provide additional constraint to, and improve the performance of, a speech recognizer. In the first paper in this Special Issue, Bellegarda shows how he can improve the recognition performance of bigram and trigram language models on a large vocabulary dictation task by incorporating global semantic constraints captured via latent semantic analysis.
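As a concrete illustration of the local constraint an n-gram model provides, the sketch below estimates maximum-likelihood trigram probabilities from a toy corpus. The corpus and function names are invented for this illustration and are not drawn from any paper in this issue; production recognizers additionally apply smoothing and backoff to handle unseen n-grams.

```python
from collections import defaultdict

def train_trigrams(sentences):
    """Count trigrams and their bigram histories over padded sentences."""
    tri, bi = defaultdict(int), defaultdict(int)
    for words in sentences:
        padded = ["<s>", "<s>"] + words + ["</s>"]
        for i in range(2, len(padded)):
            tri[tuple(padded[i - 2:i + 1])] += 1
            bi[tuple(padded[i - 2:i])] += 1
    return tri, bi

def p_trigram(tri, bi, w1, w2, w3):
    # Maximum-likelihood estimate: P(w3 | w1, w2) = c(w1,w2,w3) / c(w1,w2)
    denom = bi[(w1, w2)]
    return tri[(w1, w2, w3)] / denom if denom else 0.0

corpus = [["the", "cat", "sat"], ["the", "cat", "ran"]]
tri, bi = train_trigrams(corpus)
print(p_trigram(tri, bi, "the", "cat", "sat"))  # prints 0.5
```

Note that the model assigns probability zero to any trigram unseen in training, which is precisely the estimation difficulty that smoothing and the longer-distance methods discussed in this issue aim to address.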
Over the past decade, maximum entropy (ME) language modeling has emerged as a powerful paradigm for modeling language. Although not subject to fragmentation, ME is still prone to overfitting of rare features. While n-gram smoothing has been studied extensively, the same cannot be said for ME models. Chen and Rosenfeld fill this void by providing a much needed overview of ME language modeling, surveying existing work in ME smoothing, and comparing the performance of several of these algorithms with smoothing techniques for conventional n-grams. They find that fuzzy ME smoothing performs as well as or better than all the other algorithms they considered.

As speech recognition moves from carefully controlled laboratory dictation to more practical uses, the technology must change to address new challenges. One such challenge is linguistic modeling of conversational speech. The Siu and Ostendorf paper describes several improvements on conventional n-gram models, designed specifically to address the idiosyncrasies of conversational speech. In addition to a systematic study of variable-length n-grams, the authors introduce context-dependent classes and a skipping mechanism to accommodate disfluencies. The result is a compact language model with a longer effective span.

In dialogue systems, the language used by subjects can vary significantly depending on the current dialogue state. Language usage can also vary over time. In the final language modeling paper in this Special Issue, Riccardi and Gorin report on a set of adaptive language modeling experiments performed on a large database of user transactions with their “How may I help you?” system. They examine differences in the language used in human-human and two human-machine data collection efforts, and develop an adaptive language model framework that reduces perplexity and recognition error rates, but still allows for unconstrained input at each stage in the dialogue.

The work of Ney et al. on statistical machine translation is a welcome new step in the direction pioneered by the IBM group in the early 1990s. Three different translation models are described, the differences among them lying mainly in the way the alignment process is treated. Among the contributions of this paper is the emphasis on careful simplification and decomposition of the various models in order to make the search tractable. This practical yet methodical approach is also used in dealing with the integration of the speech recognition component into the translation system.
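To make the smoothing idea concrete, here is a minimal sketch of a two-class maximum entropy model trained with a Gaussian prior on the parameters (an L2 penalty), one family of ME smoothing methods surveyed in the literature. The data, feature indices, and hyperparameter values are invented for illustration and are not from Chen and Rosenfeld's experiments: the point is only that a smaller prior variance sigma2 shrinks the weights of rarely observed features toward zero.

```python
import math

def train_me(data, n_feats, sigma2=1.0, lr=0.1, steps=500):
    """Gradient ascent on the penalized log-likelihood of a binary ME model.

    data: list of (active_feature_indices, label) pairs, label in {0, 1}.
    sigma2: variance of the zero-mean Gaussian prior; smaller = more smoothing.
    """
    w = [0.0] * n_feats
    for _ in range(steps):
        # Prior term of the gradient pulls every weight toward zero.
        grad = [-wi / sigma2 for wi in w]
        for feats, y in data:
            z = sum(w[f] for f in feats)
            p = 1.0 / (1.0 + math.exp(-z))  # model probability of label 1
            for f in feats:
                grad[f] += y - p            # observed minus expected count
        w = [wi + lr * g for wi, g in zip(w, grad)]
    return w

data = [([0], 1), ([0], 1), ([1], 0)]       # tiny invented training set
w_smooth = train_me(data, 2, sigma2=0.1)    # strong prior
w_loose = train_me(data, 2, sigma2=10.0)    # weak prior
print(abs(w_smooth[0]) < abs(w_loose[0]))   # prints True
```

The stronger prior keeps the weight on the rare positive feature small, trading some training-data fit for robustness, which is the essence of ME smoothing.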
The JUPITER dialogue system for weather information by Zue et al. is one of the best known examples of a working (and quite useful) conversational interface. Its strength lies, among other things, in early exposure to field conditions, which allowed data to be collected in large amounts from the intended users under realistic conditions. For this to be possible, many different language technologies had to be developed, tested, and integrated, ranging from telephone-based recognition and understanding to language generation and speech synthesis. The authors review the various aspects of the technologies and the system, and describe several design and evaluation methodologies that are likely to become standard in this fast growing field.

One of the challenges facing developers of dialogue systems is to produce robust performance while still maintaining a high level of naturalness and flexibility. The paper by Souvignier et al. presents two principles that have influenced the design of their systems: early integration of information sources, and delayed decision making. By describing and evaluating the component technologies used for tasks ranging from the travel domain, to automatic switchboards, to large-scale directory assistance, the authors show how these principles can be effectively embedded at many different levels in their dialogue systems.

In most current conversational systems, the dialogue strategy is hand-crafted by the system developer. This can be a time-consuming process, whose result may or may not generalize to different domains. In the final paper in this Special Issue, Levin et al. propose a stochastic formulation for dialogue that uses automatic learning methods to determine optimal parameter settings. By representing a dialogue strategy as a Markov decision process, the authors use reinforcement learning to find the structure of an information retrieval dialogue in an air travel domain. In their experiments they show that an automatically learned dialogue strategy is similar to one heuristically designed by several research groups. Machine learning of dialogue strategy will likely be the subject of much exciting research in the near future.

Putting together a Special Issue in a timely manner requires the coordinated effort of a large number of people. We would like to thank all the authors, reviewers, and publishing staff who helped make this Special Issue possible.
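The Markov-decision-process view of dialogue strategy can be illustrated in miniature. The sketch below is a toy constructed for this illustration, not the Levin et al. task, state space, or reward function: the state is the number of attributes acquired so far, "ask" costs a small penalty and acquires one attribute, and "close" ends the dialogue with a large reward only when both attributes are known. Tabular Q-learning recovers the obvious strategy: ask twice, then close.

```python
import random

def q_learn(episodes=2000, alpha=0.5, gamma=0.95, eps=0.2, seed=0):
    """Epsilon-greedy tabular Q-learning on a two-slot form-filling dialogue."""
    rng = random.Random(seed)
    actions = ("ask", "close")
    Q = {(s, a): 0.0 for s in range(3) for a in actions}
    for _ in range(episodes):
        s = 0  # dialogue starts with no attributes acquired
        while True:
            if rng.random() < eps:  # explore
                a = rng.choice(actions)
            else:                   # exploit current estimates
                a = max(actions, key=lambda x: Q[(s, x)])
            if a == "close":
                # Success only if both attributes were acquired.
                r = 10.0 if s == 2 else -10.0
                Q[(s, a)] += alpha * (r - Q[(s, a)])
                break
            # "ask" costs one unit and acquires one attribute (capped at 2).
            r, s2 = -1.0, min(s + 1, 2)
            target = r + gamma * max(Q[(s2, b)] for b in actions)
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            s = s2
    return Q

Q = q_learn()
policy = {s: max(("ask", "close"), key=lambda a: Q[(s, a)]) for s in range(3)}
print(policy)  # {0: 'ask', 1: 'ask', 2: 'close'}
```

In this toy the optimal strategy is known in advance; the interest of the stochastic formulation is that the same machinery applies when the state space, rewards, and user behavior make hand-crafting a strategy impractical.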

JAMES R. GLASS, Guest Editor Massachusetts Institute of Technology Cambridge, MA 02139 USA

RONALD ROSENFELD, Guest Editor Carnegie Mellon University Pittsburgh, PA 15213 USA

James R. Glass (M’78) received the B.Eng. degree from Carleton University, Ottawa, Ont., Canada, in 1982, and the M.S. and Ph.D. degrees in electrical engineering and computer science from the Massachusetts Institute of Technology (MIT), Cambridge, in 1984, and 1988, respectively. Since then, he has been with the MIT Laboratory for Computer Science, where he is currently a Principal Research Scientist and Associate Head of the Spoken Language Systems Group. His research interests include acoustic-phonetic modeling, speech recognition and understanding, and corpus-based speech synthesis. Dr. Glass served as a member of the IEEE Acoustics, Speech, and Signal Processing, Speech Technical Committee from 1992 to 1995. Since 1997, he has served as an Associate Editor for the IEEE TRANSACTIONS ON SPEECH AND AUDIO PROCESSING.

Ronald Rosenfeld (M’95–A’95) received the B.Sc. degree in mathematics and physics from Tel-Aviv University, Tel-Aviv, Israel, in 1985, and the M.Sc. and Ph.D. degrees in computer science from Carnegie Mellon University (CMU), Pittsburgh, PA, in 1991 and 1994, respectively. He is an Associate Professor in the School of Computer Science and the Graduate School of Industrial Administration, CMU. His research interests include statistical language modeling, human language technologies, speech recognition, and human-machine speech communication. Dr. Rosenfeld has served as an Associate Editor for the IEEE TRANSACTIONS ON SPEECH AND AUDIO PROCESSING since 1997. He is a National Science Foundation Graduate Fellow (1986–1990) and a recipient of the Allen Newell Medal for Research Excellence (1992).