
An Effective Algorithm for Automatic Detection and Exact Demarcation of Breath Sounds in Speech and Song Signals Dima Ruinskiy, Student Member, IEEE, and Yizhar Lavner, Member, IEEE

Abstract—Automatic detection of predefined events in speech and audio signals is a challenging and promising subject in signal processing. One important application of such detection is removal or suppression of unwanted sounds in audio recordings, for instance in the professional music industry, where the demand for quality is very high. Breath sounds, which are present in most song recordings and often degrade the aesthetic quality of the voice, are an example of such unwanted sounds. Another example is bad pronunciation of certain phonemes. In this paper, we present an automatic algorithm for accurate detection of breaths in speech or song signals. The algorithm is based on a template matching approach, and consists of three phases. In the first phase, a template is constructed from mel frequency cepstral coefficients (MFCCs) matrices of several breath examples and their singular value decompositions, to capture the characteristics of a typical breath event. Next, in the initial processing phase, each short-time frame is compared to the breath template, and marked as breathy or nonbreathy according to predefined thresholds. Finally, an edge detection algorithm, based on various time-domain and frequency-domain parameters, is applied to demarcate the exact boundaries of each breath event and to eliminate possible false detections. Evaluation of the algorithm on a database of speech and songs containing several hundred breath sounds yielded a correct identification rate of 98% with a specificity of 96%. Index Terms—Breath detection, event spotting in speech and audio, mel frequency cepstral coefficient (MFCC).

I. INTRODUCTION

AUTOMATIC detection of predefined events in speech and audio signals is a challenging and promising subject in the field of signal processing. In addition to speech recognition [1], [2], there are many applications of audio information retrieval [3], [4], and various tools and approaches have been tested for this task [5]–[7]. In the case of speech recordings, the events to be recognized can be part of the linguistic content of speech (e.g., words [8], syllables [9], and individual phonemes [10], [11]) or nonverbal

Manuscript received March 13, 2006; revised August 23, 2006. This work was supported in part by Waves Audio Israel. The associate editor coordinating the review of this manuscript and approving it for publication was Dr. Hiroshi Sawada. D. Ruinskiy was with the Department of Computer Science, Tel-Hai Academic College, Upper Galilee 12210, Israel. He is now with the Faculty of Mathematics and Computer Science, Feinberg Graduate School, Weizmann Institute of Science, Rehovot 76100, Israel. Y. Lavner is with the Computer Science Department, Tel-Hai Academic College, Upper Galilee 12210, Israel (e-mail: [email protected]). Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org. Digital Object Identifier 10.1109/TASL.2006.889750

sounds or cues, such as laughs [12], breaths [13], coughs, and others. The applications of the detection are various. In [14], vowel/ consonant recognition in speech served for differential timescaling of speech for the hearing impaired. Nonverbal sound recognition can be used for enhancing the description of situations and events in speech-to-text algorithms [15] or for improving the efficiency of human transcriptionists [16]. Other applications of event detection are the elimination of badly articulated phonemes and other sounds that degrade the quality of the acoustic signal [17] or accentuation of certain sounds. Due to the high-quality requirement in professional voice recordings, the sound engineers have to recognize places where the quality is degraded or undesired sounds are added. Since on many occasions conducting a new recording session is either expensive or impossible (for example, in historical recordings), the undesired sounds must be eliminated or attenuated manually, a process that can be tedious and time consuming. Therefore, an efficient automatic algorithm that detects and accurately demarcates the required sound can be of great advantage. An example of a sound that can degrade the quality of voice recordings is breath, which is inherently present in nearly all recordings, even those of professional singers and narrators. In professional audio, breaths are often considered unwanted sounds, which degrade the aesthetics of the voice, especially when modification of the dynamic range is applied and may unintentionally amplify them. In these cases, the sound engineer may want the breath sounds removed or suppressed significantly. However, in other situations, breath sounds may have their purpose in songs, for instance, as part of the emotional content, and then it is sometimes desirable to stress them and make them more expressed. Whatever the purpose is, accurate detection of the breath sounds is required for further manipulations, such as attenuation or amplification. In addition to its role in improving the quality of songs and speech recordings for the professional music industry, high-accuracy breath detection can be valuable in many other applications. In spontaneous speech recognition, for example, effects such as long pauses, word fragments, pause fillers, and breath sounds, have been shown to significantly degrade the recognition performance [18]. Accurate detection of breaths, among other nonlexical sounds, was shown to improve the recognition quality. Other applications are automatic segmentation of continuous speech, where breath sounds may serve as natural delimiters in utterances, automatic labeling of prosody patterns [19], and enhancement of speech to text applications by including nonlexical sounds.



Fig. 1. Basic block diagram of the algorithm. After preliminary classification as breathy/nonbreathy, a refinement of the detection is performed in the vicinity of sections initially marked as breathy.

In this paper, we present an efficient algorithm for automatic detection of breath sounds in song and speech recordings that can be used in a real-time environment (see Section V). The algorithm is based on a template-matching approach for the initial detection and on a multistage feature tracking procedure for accurate edge detection and elimination of false alarms. The latter procedure makes it possible to achieve very high accuracy in terms of edge marking, which is often required in speech recognition and professional music applications. Evaluation of the algorithm on a small database of recordings (see Section IV) showed that both the false negative rate and the false positive rate are very low. For the initial detection, a prototype breath template is constructed from a small number of breath examples. The features used for the template are the mel frequency cepstral coefficients (MFCCs) [20], which are known for their ability to distinguish between different types of audio data [21], [22]. The processed audio signal is divided into consecutive overlapping frames, and each is compared to the template. If the similarity exceeds a predefined threshold, the frame is considered “breathy,” i.e., a part of a breath sound. Whenever breaths are detected by the initial detection phase, further refinement is carried out by applying a feature-tracking procedure aimed at accurate demarcation of the breath boundaries, as well as elimination of false detections. The waveform features used for this procedure include short-time energy, zerocrossing rate, and spectral slope, in addition to the MFCC-based similarity measure. A schematic block diagram that describes the basic algorithm is shown in Fig. 1. A detailed description of

the processes in each block will be presented in the following sections. Although the classification of each frame as breathy/nonbreathy can be also performed by other standard machine learning approaches such as support vector machine (SVM) [23] or Gaussian mixture model (GMM) [24], we find that the template-matching approach used here has several advantages: it is simple and computationally efficient, and yet very accurate and reliable. In contrast, both SVM and GMM require more complex models with multiple parameters and assumptions, and therefore with higher time and space complexity [25]. The training procedure in the presented algorithm is very fast and achieves high accuracy even with very few training examples. This paper is organized as follows: In Section I-A, the physiological process of breath production and its relation to the detection of breath boundaries in speech signals are briefly described. Next, the main algorithm is presented: the template construction in Section II-A, the detection phase along with the audio features used in Section II-B, the computation of the breath similarity measure in Section II-C, and the edge detection algorithm in Section III. Following that, empirical results and performance evaluations of the proposed algorithm are presented in Section IV. Finally, some aspects of real-time implementation are discussed in Section V, followed by the conclusion. A. Approach to Breath Edge Detection Based on Physiological Aspects of Breath Production During breathing, various muscles are operating to change the intrapulmonary volume, and in consequence, the intrapul-


Fig. 2. Part of a voice waveform demonstrating a breath sound located between two voiced phonemes. The upper line marks the breath, characterized by higher energy near the middle and lower energy at the edges. The lower lines denote the silence periods separating the breath from the neighboring phonemes.

monary pressure, which affects the air flow direction. In inspiration, a coordinated contraction of the diaphragm and other muscles causes the intrapulmonary volume to increase, which translates into a decrease of the intrapulmonary pressure and causes air to flow into the lungs. In expiration, the chest cavity and intrapulmonary volumes decrease, followed by an increase in the intrapulmonary pressure, causing air to be exhaled through the trachea out of the lungs. Since several systems of muscles are involved, such changes in the air flow direction cannot be instantaneous. As a result, when inhaling occurs during speech or song, it requires a certain pause, which results in a period of silence between the breath and the preceding utterance, and a similar period between the breath and the following utterance. Measurements of breath events in many different speech sequences from different speakers have shown that the silence periods between a breath and the neighboring utterances are at least 20 ms long. This duration is sufficient for the silence period to be detected by energy tracking mechanisms. A typical breath event residing between two utterances is shown in Fig. 2. The silence periods before and after the breath are readily noticeable. Detecting these silence periods allows marking them as the breath edges, separating the breath from the neighboring phonemes. The attenuation of the breath in such cases creates a smooth and natural period of silence between the two utterances, without any artifacts. In the following sections, the details of the algorithm that detects and demarcates breath events in speech recordings will be presented.

II. INITIAL DETECTION ALGORITHM

The initial detection consists of two phases. The first is the "training" phase, during which the algorithm "learns" the features of typical breath sounds. The main feature used in this study is a short-time cepstrogram, consisting of MFCC vectors computed for consecutive short-time windows of the speech signal. In the training phase, the cepstrograms of several breath signals are computed, and a template cepstrogram matrix is constructed, representing a typical breath signal. The second phase is the detection phase, where a speech signal is processed and short-time cepstrogram matrices are computed for consecutive analysis frames. For each frame, a breath similarity measure, which quantifies the similarity between the frame's cepstrogram and the template cepstrogram, is computed and compared to a predefined threshold. In cases where the similarity measure is above the threshold, the frame is considered breathy (i.e., part of a breath event). Once the training phase is executed and a template is constructed, it can be used in multiple detection phases without retraining the system. Whereas the training phase is executed offline, the detection phase can be used in a real-time environment because the classification as breathy/nonbreathy is highly localized. More details on the real-time implementation of the algorithm and the required latency are presented in Section V.

A. Constructing the Template

The breath template is constructed using several breath examples, derived from one or more speakers. The template is expected to represent the most typical characteristics of a breath and to serve as a prototype, to which frames of the signal are compared in the detection phase. On the one hand, the template should contain the relevant data that distinguish the breath event from other phonemes and sounds; on the other hand, it should be compact for computational efficiency. Both requirements are satisfied by the MFCC, which represent the magnitude spectrum of each short-time voice signal compactly, with a small number of parameters [2], [20]. The efficiency of the MFCC in recognition of phonemes and other components of human speech is well established [12], [26], [27], and therefore they were chosen for template construction in this study. Several MFCC vectors are computed for each breath example, forming a short-time cepstrogram of the signal. The average cepstrogram of the breath examples is used as the template. The stages for construction of the template are as follows (Fig. 3).

1) Several signals containing isolated breath examples are selected, forming the example set. From each example, a section of fixed length, typically equal to the length of the shortest example in the set (about 100–160 ms), is derived. This length is used throughout the algorithm as the frame length (see Section II-C).

2) Each breath example is divided into short consecutive subframes, with a duration of 10 ms and a hop size of 5 ms. Each subframe is then pre-emphasized using a first-order difference filter ($H(z) = 1 - az^{-1}$, where $0 < a < 1$).

3) For each breath example, the MFCC are computed for every subframe, thus forming a short-time cepstrogram representation of the example. The cepstrogram is defined as a matrix whose columns are the MFCC vectors of the subframes. Each such matrix is denoted by $A_i$, $i = 1, \ldots, M$, where $M$ is the number of examples in the example set. The construction of the cepstrogram is demonstrated in Fig. 4.

4) For each column of the cepstrogram, DC removal is performed, resulting in the matrix $\bar{A}_i$.

5) A mean cepstrogram is computed by averaging the matrices of the example set, as follows:

$T = \frac{1}{M} \sum_{i=1}^{M} \bar{A}_i.$    (1)

This defines the template matrix $T$. In a similar manner, a variance matrix $V$ is computed, where the distribution of each coefficient is measured along the example set.

6) In addition to the template matrix, another feature vector is computed as follows: the matrices of the example set are concatenated into one matrix, and the singular value decomposition (SVD) of the resulting matrix is computed. Then, the normalized singular vector $u$ corresponding to the largest singular value is derived. Due to the information packing property of the SVD transform [28], the singular vector is expected to capture the most important features of the breath event, and thus improve the separation ability of the algorithm when used together with the template matrix in the calculation of the breath similarity measure of test signals (see Section II-C).
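The following is a minimal sketch of this training procedure in Python/NumPy, assuming the breath examples are already trimmed to a common frame length. The MFCC front end (librosa), the pre-emphasis coefficient of 0.97, and the choice of 13 coefficients are illustrative assumptions, not the authors' implementation.

```python
import numpy as np
import librosa

def cepstrogram(frame, sr, n_mfcc=13):
    """MFCC matrix (n_mfcc x n_subframes) of one analysis frame."""
    # first-order pre-emphasis difference filter (coefficient is an assumption)
    emphasized = np.append(frame[0], frame[1:] - 0.97 * frame[:-1])
    # 10-ms subframes with 5-ms hop, as in the text
    return librosa.feature.mfcc(y=emphasized, sr=sr, n_mfcc=n_mfcc,
                                n_fft=int(0.010 * sr), hop_length=int(0.005 * sr))

def build_template(breath_examples, sr):
    """Return template matrix T, variance matrix V and singular vector u."""
    ceps = [cepstrogram(x, sr) for x in breath_examples]
    # DC removal per column (subtract each MFCC vector's mean)
    ceps = [A - A.mean(axis=0, keepdims=True) for A in ceps]
    stack = np.stack(ceps)                    # shape: M x L x K
    T = stack.mean(axis=0)                    # eq. (1): mean cepstrogram
    V = stack.var(axis=0) + 1e-12             # per-coefficient variance
    # concatenate all examples and keep the singular vector of the largest
    # singular value (information-packing property of the SVD)
    U, s, _ = np.linalg.svd(np.hstack(ceps), full_matrices=False)
    u = U[:, 0] / np.linalg.norm(U[:, 0])
    return T, V, u
```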


Fig. 3. Schematic block diagram describing the construction of the template.

Fig. 4. Schematic description of the procedure for constructing the cepstrogram matrix. The frame is divided into overlapping subframes $S_j$, and for each subframe the MFCC vector $v_j$ is computed. These vectors form the columns of the cepstrogram matrix.

B. Detection Phase

The input for the detection algorithm is an audio signal (a monophonic recording of either speech or song, with no background music), sampled at 44 kHz. The signal is divided into consecutive analysis frames (with a hop size of 10 ms). For each frame, the following parameters are computed: the cepstrogram (MFCC matrix, see Fig. 4), short-time energy, zero-crossing rate, and spectral slope (see below). Each of these is computed over a window located around the center of the frame. A graphical plot showing the waveform of a processed signal, as well as some of the computed parameters, is shown in Fig. 5.

1) The MFCC matrix is computed as in the template generation process (see previous section). For this purpose, the length of the MFCC analysis window used for the detection phase must match the length of the frame derived from each breath example in the training phase.

2) The short-time energy is computed according to the following:

$E = \sum_{n=1}^{N} x^2(n)$    (2)

where $x(n)$ is the sampled audio signal, and $N$ is the window length in samples (corresponding to 10 ms). It is then converted to a logarithmic scale

$E_{\mathrm{dB}} = 10 \log_{10} E.$    (3)

3) The zero-crossing rate (ZCR) is defined as the number of times the audio waveform changes its sign, normalized by the window length in samples (corresponding to 10 ms)

$\mathrm{ZCR} = \frac{1}{N} \sum_{n=2}^{N} \mathbf{1}\{x(n)\,x(n-1) < 0\}.$    (4)

4) The spectral slope is computed by taking the discrete Fourier transform of the analysis window, evaluating its magnitude at two frequencies $f_1$ and $f_2$ (corresponding here to 11 and 22 kHz, respectively), and computing the slope of the straight-line fit between these two points. It is known [29] that in voiced speech most of the spectral energy is contained in the lower frequencies (below 4 kHz). Therefore, in voiced speech, the spectrum is expected to be rather flat between 11 and 22 kHz. In periods of silence, the waveform is close to random, which also leads to a relatively flat spectrum throughout the entire band. This suggests that the spectral slope in voiced/silence parts would yield low values, when measured as described previously. On the other hand, in breath sounds, as in most unvoiced phonemes, there is still a significant amount of energy in the middle frequency band (10–15 kHz) and relatively low energy in the high band (around 22 kHz). Thus, the spectral slope is expected to be steeper, and can be used to differentiate between voiced/silence and unvoiced/breath. As such, the spectral slope is used here as an additional parameter for identifying the edges of the breath (see Section III).
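As an illustration, these three waveform features can be computed per analysis window with a few lines of NumPy. The window handling and the two-point slope fit below are simplifications under the assumptions stated in the comments, not the authors' code.

```python
import numpy as np

def frame_features(x, sr):
    """x: 10-ms analysis window (1-D array) taken around the frame centre."""
    N = len(x)
    energy_db = 10.0 * np.log10(np.sum(x ** 2) + 1e-12)     # eqs. (2)-(3)
    zcr = np.count_nonzero(x[1:] * x[:-1] < 0) / N          # eq. (4)
    spectrum = np.abs(np.fft.rfft(x))
    freqs = np.fft.rfftfreq(N, d=1.0 / sr)
    f1, f2 = 11000.0, 22000.0                               # band edges quoted in the text
    m1 = spectrum[np.argmin(np.abs(freqs - f1))]
    m2 = spectrum[np.argmin(np.abs(freqs - f2))]
    slope = (m2 - m1) / (f2 - f1)                           # two-point straight-line fit
    return energy_db, zcr, slope
```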


Fig. 5. A: Original signal, along with the V/UV/S marks (stair graph), and detected breath events (short bars). B: Reciprocal of the breath similarity function (top), the threshold (middle line), and the energy function (bottom). C: Short-time cepstrogram plot for the signal in A. The vertical axis represents the MFC coefficients, and the horizontal axis is the time axis. The color represents the magnitude of the cepstral coefficients using a “hot” scale.

Fig. 6. Schematic block diagram describing the calculation of the two breath similarity measures $S_1$ and $S_2$ and their product, the final breath similarity measure $S$.

C. Computation of the Breath Similarity Measure

Once the aforementioned parameters are computed for a given frame, its short-time cepstrogram (MFCC matrix) is used for calculating its breath similarity measure. The similarity measure, denoted $S$, is computed from the cepstrogram of the frame $C$, the template cepstrogram $T$ (with $V$ being the variance matrix), and the singular vector $u$. The steps of the computation are as follows (Fig. 6).

1) The normalized difference matrix $D = (C - T) \oslash V$ is computed, where $\oslash$ denotes element-by-element division. The normalization by the variance matrix is performed in order to compensate for the differences in the distributions of the various cepstral coefficients.

2) The difference matrix is liftered by multiplying each column with a half-Hamming window that emphasizes the lower cepstral coefficients. It has been found in preliminary experiments that this procedure yields better separation between breath sounds and other sounds (see also [2]).

3) A first similarity measure $S_1$ is computed by taking the inverse of the sum of squares of all elements of the normalized difference matrix, according to the following equation:

$S_1 = \left( \sum_{j=1}^{K} \sum_{k=1}^{L} D_{kj}^2 \right)^{-1}$    (5)

where $K$ is the number of subframes, and $L$ is the number of MFC coefficients computed for each subframe. When the cepstrogram is very similar to the template, the elements of the difference matrix should be small, leading to a high value of this similarity measure. When the frame contains a signal that is very different from breath, the measure is expected to yield small values. This template matching procedure with a scaled Euclidean distance is essentially a special case of a two-class Gaussian classifier [30] with a diagonal covariance matrix. This is due to the computation of the MFCC, which involves a discrete cosine transform as its last step [20], known for its tendency to decorrelate the mel-scale filter log-energies [22].

4) A second similarity measure $S_2$ is computed by taking the sum of the inner products between the singular vector $u$ (see Section II-A) and the normalized columns of the cepstrogram. Since the singular vector is assumed to capture the important characteristics of breath sounds, these inner products (and, therefore, $S_2$) are expected to be small when the frame contains information from other phonemes.

5) The final breath similarity measure is defined as the product of the two measures: $S = S_1 \cdot S_2$. It was found experimentally that this combination of similarity measures yields better separation between breath and nonbreath than using just the difference matrix or the singular vector.
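A compact sketch of this computation is given below, assuming the template matrix, variance matrix, and singular vector produced by the training sketch above; the exact liftering window and the small regularization constants are assumptions.

```python
import numpy as np

def breath_similarity(C, T, V, u):
    """Breath similarity S of one frame cepstrogram C against template (T, V, u)."""
    L = C.shape[0]                              # number of MFC coefficients
    # 1) normalised difference matrix (element-by-element)
    D = (C - T) / V
    # 2) liftering: emphasise low cepstral coefficients with a half-Hamming window
    lifter = np.hamming(2 * L)[L:]              # decaying half of a Hamming window
    D = D * lifter[:, None]
    # 3) S1: inverse of the sum of squares of the liftered difference, eq. (5)
    S1 = 1.0 / (np.sum(D ** 2) + 1e-12)
    # 4) S2: sum of inner products of u with the normalised columns of C
    cols = C / (np.linalg.norm(C, axis=0, keepdims=True) + 1e-12)
    S2 = np.sum(u @ cols)
    # 5) final measure
    return S1 * S2
```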


The breath detection involves a two-step decision. The initial decision treats each frame independently of other frames and classifies it as breathy/nonbreathy based on its similarity measure $S$ (computed as explained above), energy, and zero-crossing rate. A frame is initially classified as breathy if all three of the following hold.

1) The breath similarity measure is above a given threshold. This threshold is initially set in the learning phase, during the template construction, when the breath similarity measure is computed for each of the examples. The minimum value of the similarity measures between each of the examples and the template is determined, denoted by $S_{\min}$. The threshold is set to a fixed fraction of $S_{\min}$. The logic behind this setting is that the frame-to-template similarity of breath sounds in general is expected to be somewhat lower than the similarity among the examples used to construct the template in the first place.

2) The energy is below a given threshold, which is chosen to be below the average energy of voiced speech (see Section III-A).

3) The zero-crossing rate is below a given threshold. Experimental data have shown that a ZCR above 0.25 (assuming a sampling rate of 44 kHz) is exhibited only by a number of unvoiced fricatives, and breath sounds have a much lower ZCR (see Section III-A).

Following the initial detection, a binary breathiness index is assigned to each frame: breathy frames are assigned index 1, whereas nonbreathy frames are assigned index 0. Whenever, at some point of the processing phase, there is a frame or a batch of frames classified as breathy, all the frames in their vicinity are examined more carefully, in order to reject possible false detections and to identify the edges of the breath accurately. This constitutes the second step in the decision. When the edges are identified, all frames between them are classified as breathy, and all other frames in their vicinity are classified as nonbreathy. Whenever an entire section is identified as a false alarm in the second decision step, all the frames within this section are classified as nonbreathy. The exact procedure by which edge searching and false detection elimination are performed is described in the following section. Frames that are marked as breathy can be further manipulated on the output stream, for example by being attenuated or emphasized.
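A sketch of the initial per-frame decision is shown below; the threshold values passed in are placeholders that would be set as described above (similarity threshold derived from the training examples, energy threshold below the average energy of voiced speech, ZCR threshold around 0.25).

```python
import numpy as np

def classify_frames(similarities, energies_db, zcrs,
                    sim_threshold, energy_threshold_db, zcr_threshold=0.25):
    """Return the binary breathiness index (1 = breathy) for each analysis frame."""
    breathy = ((np.asarray(similarities) > sim_threshold) &
               (np.asarray(energies_db) < energy_threshold_db) &
               (np.asarray(zcrs) < zcr_threshold))
    return breathy.astype(int)
```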


Fig. 7. Probability density function curves of short-time energy (in decibels) of breath events (dashed line) in comparison with that of voiced speech (solid line), measured from voices of several different speakers, using nonoverlapping windows of 10 ms.

III. EDGE DETECTION AND FALSE ALARM ELIMINATION

One of the purposes of applying breath detection in high-end recording studios is to produce voices in which the breaths are not audible. Although the algorithm described in the previous section detects breath events with very high sensitivity (i.e., very few false negatives), it is still somewhat susceptible to false positive detections. Furthermore, its time resolution is not always sufficient to detect the beginning and end of the breaths accurately. In some applications, this can lead to incomplete removal of breath sounds on the one hand, and to partial removal of nearby phonemes on the other hand. Such inaccuracies, as well as the presence of false detections, can introduce audible artifacts that degrade speech quality. In this section, we present two alternative algorithms designed to address both problems: the problem of false positives and the problem of accurate edge detection. The algorithms use various features of the sound waveform, such as energy and zero-crossing rate. In the general case, these features on their own cannot separate breath sounds from speech phonemes. Therefore, the edge detection algorithms are used only as a refinement and are invoked only when the initial decision step identifies some frames as breathy (see Section II-B). These algorithms have two possible outcomes: either the edges of the breath are accurately marked, or the entire event is rejected as a false detection. Either of the two algorithms presented can be used for the task, and both have been shown to perform well. We present both to provide deeper insight into the problem and the possible solutions. Comparative performance results of the two algorithms are described in Section IV.

A. General Approach to False Detection Elimination

In both of the edge detection algorithms presented here, similar criteria are used for rejection of false positives.

1) Preliminary Duration Threshold: A breath event is expected to yield a significant peak in the contour of the breath similarity measure function, i.e., a considerable number of frames that rise above the similarity threshold. If the number of such frames is too low, the event is likely to be a false detection.

2) Upper Energy Threshold: Typically, the local energy within a breath epoch is much lower than that of voiced speech and somewhat lower than that of most unvoiced speech (see Fig. 7). Hence, if some frames in the detected epoch, after edge marking, exceed a predefined energy threshold, the section should be rejected.

3) Lower ZCR Threshold: Since most breaths are unvoiced sounds, the ZCR during a breath event is expected to be in the same range as that of most unvoiced phonemes, i.e., higher than that of voiced phonemes. This was empirically verified. Therefore, if the ZCR throughout the entire marked section is beneath a lower ZCR threshold, the section will be rejected as a probable voiced phoneme.

4) Upper ZCR Threshold: It is known that certain unvoiced phonemes, such as fricative consonants, can exhibit a high ZCR (0.3–0.4 given a 44-kHz sampling frequency). Preliminary experiments showed that the maximum ZCR of breath sounds is considerably lower (see Fig. 8). An upper threshold on the ZCR can thus prevent false detections of certain fricatives as breaths.


Fig. 9. Energy contour of three potential candidates for being marked as breath events. The top threshold is the peak threshold, the bottom threshold is the edge threshold. (A) will be marked as breath, since its peak energy is below the peak threshold, and both its deeps are below the edge threshold. (B) will be rejected, because its edge energy exceeds the edge threshold. (C) will be rejected, because its peak energy exceeds the peak threshold and it is probably produced by a voiced phoneme. Fig. 8. Probability density function curves of ZCR of breath events (dashed line) in comparison with that of /s/ fricatives (solid line), measured from voices of several different speakers, using nonoverlapping windows of 10 ms.

5) Final Duration Threshold: A breath sound is typically longer than 100 ms. Therefore, if the detected breath event, after accurate edge marking, is shorter than that duration, it should be rejected. In practice, in order to account for very short breaths as well, the duration threshold may be set more permissively.

B. Edge Detection

The basic approach to accurate detection of the edges of the breath is based on the principles described in Section I-A. A genuine breath event is expected to exhibit a peak in the local energy function, accompanied by two noticeable deeps, one on each side of the peak, indicating periods of silence. Therefore, the task of the edge detection algorithm is to locate the peak and the two deeps near it. The edges of the breath can then be set where these deeps occur. In practice, the energy function may not be smooth enough, and the algorithm has to cope with spurious peaks and deeps. The key difference between the two algorithms presented below is the approach used to deal with them.

1) Edge Marking Using Double Energy Threshold and Deep Picking: One of the principles established in Section III-A is that in most cases, the local energy of breath sounds does not exceed a certain upper threshold, even at its highest point (referred to here as "the peak of the breath"). We denote this upper threshold as the "peak threshold" $T_{\mathrm{peak}}$. However, it is also expected that the edges of the breath will exhibit considerably lower energy, akin to that of silence (see Section I-A). It is reasonable to claim that if there are no frames in the detected event whose energy is lower than some silence threshold energy (denoted as the "edge threshold" $T_{\mathrm{edge}}$), the event is likely to be a false detection. Consequently, a section of the signal is marked as a breath event only if its peak energy does not exceed the peak threshold $T_{\mathrm{peak}}$, and its edge energy does not exceed the edge threshold $T_{\mathrm{edge}}$. The introduction of the edge energy threshold improves the algorithm's robustness against potential false positive detections (Fig. 9). An advantage of this double threshold algorithm is that it makes the actual edge marking very simple. It works as follows (Fig. 10).

Fig. 10. Peak of the breath and the corresponding edges, located at the deeps of the energy contour.

Let $n$ be the running index of the frame, representing its location along the time axis, and let $E(n)$ denote the short-time energy of frame $n$. Let $n_p$ be the frame with the highest energy in the section in question (the peak of the breath). Let $n_1$ and $n_2$ be the first frames where the energy falls below $T_{\mathrm{edge}}$, to the left and to the right of $n_p$, respectively. Let $n_0$ and $n_3$ be the first frames to the left of $n_1$ and to the right of $n_2$, respectively, where the energy rises again above $T_{\mathrm{edge}}$. Then, the entire sections between $n_0$ and $n_1$, and between $n_2$ and $n_3$, are close to silence. The edges of the breath are defined as the centers of the frames with the lowest energy in their respective silence sections, i.e., if we denote the left and right edges by $e_L$ and $e_R$, then

$e_L = \arg\min_{n_0 \le n \le n_1} E(n), \qquad e_R = \arg\min_{n_2 \le n \le n_3} E(n).$    (6)

A schematic block diagram of the double threshold edge detection procedure is depicted in Fig. 11. A simpler approach is to consider all peaks and deeps in the section in question and for each peak $p$ mark the most likely left and right edges, according to the following (Fig. 12).

1) Let $d_L$ and $d_R$ be the nearest deeps to the left and to the right of $p$, respectively.

2) Let $d_L'$ be the nearest deep to the left of $d_L$. If $E(d_L') < E(d_L)$, then set $d_L \leftarrow d_L'$. Repeat until $E(d_L') \ge E(d_L)$ or until there are no more deeps to the left.

3) Let $d_R'$ be the nearest deep to the right of $d_R$. If $E(d_R') < E(d_R)$, then set $d_R \leftarrow d_R'$. Repeat until $E(d_R') \ge E(d_R)$ or until there are no more deeps to the right.

4) Output the pair $(d_L, d_R)$.
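The double-threshold edge search of (6) can be sketched as follows; the peak and edge thresholds are assumed to be given, and a section with no silence on one side is rejected, in line with the false-detection criteria of Section III-A.

```python
import numpy as np

def mark_edges(energy, t_peak, t_edge):
    """energy: per-frame short-time energy of the candidate section (1-D array)."""
    energy = np.asarray(energy, dtype=float)
    n_p = int(np.argmax(energy))
    if energy[n_p] > t_peak:                    # probably a voiced phoneme
        return None

    def edge(direction):
        step = -1 if direction == "left" else 1
        i = n_p
        while 0 <= i + step < len(energy) and energy[i] >= t_edge:
            i += step                           # walk until the energy drops below t_edge
        if energy[i] >= t_edge:
            return None                         # no silence found on this side
        silence = [i]
        while 0 <= i + step < len(energy) and energy[i + step] < t_edge:
            i += step
            silence.append(i)                   # frames that stay below the edge threshold
        return silence[int(np.argmin(energy[silence]))]   # eq. (6): lowest-energy frame

    e_left, e_right = edge("left"), edge("right")
    if e_left is None or e_right is None:
        return None                             # reject as a false detection
    return e_left, e_right
```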


Fig. 11. (a) Schematic block diagram of the double threshold algorithm for edge detection. (b) Detailed description of the double-threshold edge detection procedure.

The deep-picking procedure results in a set of pairs $(d_L, d_R)$, each of which indicates a section starting and ending with energy deeps. Each of these sections is examined to see whether it violates any of the conditions mentioned in Section III-A. Among the remaining sections, we join those that overlap or share a common edge, to ensure that if the breath was divided into several sections due to spurious peaks/deeps in the energy contour, it will be joined into one.

A mechanism that resembles the method presented here, with two energy thresholds and a zero-crossing rate, was used in [31] for edge detection of words, in order to include all the phonetic components for speech recognition.


TABLE I RULES FOR DEEP PICKING PROCEDURE FOR EDGE MARKING

Fig. 12. Deep picking procedure for edge searching: the curve is the energy contour. Big dots show the peak $p$ and the final values of $(d_L, d_R)$. Both $d_L$ and $d_R$ are lower than their neighboring deeps, which are spurious deeps (smaller dots).

However, due to the different purpose, the actual algorithm is not the same. For example, in the latter, the thresholds were used strictly for indication of the boundaries, whereas in our algorithm their primary use is to avoid false detection of speech.

2) Edge Marking With Spurious Deep Elimination: A drawback of the previous procedure, which considers every peak and deep, is that a very noisy energy function can lead to incorrect detection of the edges. The algorithm described in this section attempts to deal with this problem by eliminating spurious deeps. The edge marking relies more on the original breathiness index (see Section II-C); namely, it searches for the edges of the breath inside the section that was originally marked as breathy, or in close vicinity to it. To reduce the effect of possible false detections, the binary vector of breathiness indices is first smoothed with a nine-point median filter. The smoothed contour is expected to contain a block of successive binary ones (presumably a breath epoch) amid a sequence of zeros. On rare occasions, it may contain two blocks of ones (in case of a false detection, or of two breath sounds being very close to each other). On such occasions, each block is treated separately. The block of ones indicates the approximate location of the breath, and the algorithm looks for the exact edges in the vicinity of this block. Let us denote the first frame index (representing its location along the time axis) of the block of ones as $b_1$ and its last frame index as $b_2$. For simplicity, we shall refer to the section $[b_1, b_2]$ as the "candidate section." The edge search is conducted by examining the section's energy contour and looking for deeps (local minima). Because of the high resolution of the energy tracking and the relatively short time windows (10 ms, see Section II-B), there are likely to be spurious deeps in the energy contour. To reduce the number of such deeps, the energy contour is prefiltered with a three-point running average filter. The smoothing may not eliminate all the spurious minima. Therefore, after prefiltering, the remaining deeps are divided into significant and insignificant. A given frame $d$ is defined as a significant deep if

$\max\{E(p_L), E(p_R)\} - E(d) > 0.25\,(E_{\max} - E_{\min})$    (7)

where $p_L$ and $p_R$ are the two energy peaks closest to $d$ on the left and right sides, respectively, and $E_{\max}$ and $E_{\min}$ are the global maximum and minimum of the energy contour function in some predefined vicinity of the candidate section. In other words, $d$ is considered significant if at least one of the energy values of $p_L$ and $p_R$ exceeds it by more than 25% of the dynamic range of the energy.
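A possible NumPy/SciPy sketch of the candidate-section extraction and the significance test of (7) is given below; the use of scipy.signal utilities and the treatment of the whole passed-in stretch as the "predefined vicinity" are simplifying assumptions.

```python
import numpy as np
from scipy.signal import medfilt, argrelextrema

def candidate_sections(breathy_index):
    """Blocks of ones in the 9-point-median-filtered breathiness index."""
    smoothed = medfilt(np.asarray(breathy_index, dtype=float), kernel_size=9)
    idx = np.flatnonzero(smoothed > 0.5)
    if idx.size == 0:
        return []
    splits = np.flatnonzero(np.diff(idx) > 1) + 1
    return [(int(block[0]), int(block[-1])) for block in np.split(idx, splits)]

def significant_deeps(energy):
    """Deeps satisfying eq. (7) within a stretch of the energy contour."""
    energy = np.asarray(energy, dtype=float)
    energy = np.convolve(energy, np.ones(3) / 3.0, mode="same")   # 3-point prefilter
    deeps = argrelextrema(energy, np.less)[0]
    peaks = argrelextrema(energy, np.greater)[0]
    dyn = energy.max() - energy.min()                             # local dynamic range
    keep = []
    for d in deeps:
        left, right = peaks[peaks < d], peaks[peaks > d]
        neighbours = ([energy[left[-1]]] if left.size else []) + \
                     ([energy[right[0]]] if right.size else [])
        if neighbours and max(neighbours) - energy[d] > 0.25 * dyn:   # eq. (7)
            keep.append(int(d))
    return keep
```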

Fig. 13. Block diagram showing the various steps of the edge detection algorithm with spurious deep elimination.

Having eliminated all the insignificant energy deeps, the algorithm checks which of the remaining significant deeps fall inside the candidate section $[b_1, b_2]$, and acts according to the rules in Table I. Finally, the section between the marked edges is checked once more to verify that it does not violate any of the conditions established in Section III-A, and if it does, it is rejected. A block diagram showing the algorithm step-by-step is given in Fig. 13. A variation of this algorithm uses spectral slope information, in addition to energy information, to search for the edges. It is based on the observation that there are usually significant differences in the steepness of the spectral slope between breath sounds and silence/voiced speech (see Section II-B). The spectral slope is expected to be steep near the middle of the breath and flat at the edges, suggesting that edges can be detected by applying similar deep picking to the spectral slope. Thus, the initial edge marking is based on the spectral slope deeps and later refined with the energy deeps. Using this algorithm for edge detection yielded very accurate results, as shown in the following section.


TABLE II PERFORMANCE EVALUATION WITH A SINGLE TEMPLATE FOR ALL EXAMPLES (M—MALE, F—FEMALE)


TABLE III PERFORMANCE EVALUATION WITH SEPARATE MALE (M)/FEMALE (F) TEMPLATES

Although the previous algorithm, which uses the double energy threshold, was found to be even less prone to errors and could achieve almost perfect results, it required a considerable amount of fine-tuning to obtain them.

IV. RESULTS AND EVALUATION

For the evaluation of the algorithm, we first constructed several breath templates and compared their performance. The templates were constructed from isolated breath events, derived from the voices of 14 singers (eight male and six female). Each template was generated using the breath signals of one voice. Mixing breaths from several voices was also attempted, but yielded no performance gain, and therefore was not used in the final evaluation. The test was carried out using 24 voices of professional singers and narrators, including both songs (a cappella, 22 recordings) and normal speech (two recordings). All the voices were sampled and digitized with a sampling frequency of 44 kHz. The total duration of the recordings was about 24 min and contained more than 330 breath events. In all the tests, the voices used for constructing the templates or for measuring

the distributions of the various classification parameters (see Section III-A) were excluded from the evaluation set. In order to evaluate the performance of the system, the breaths were hand-marked by two labelers. In addition, the results were confirmed by listening to the original passages, the processed passages with breaths suppressed and the detected breaths only. These ensure both the detection of the breath and the accurate marking of its edges. The results of one template (constructed using seven breath events of one male singer) are presented in Table II. As can be seen, the sensitivity (the number of correct detections divided by the total number of events present) of the algorithm using this template is 94.4% (304 correct detections out of 322 breath events), and its specificity (the number of correct detections divided by the total number of detections) is 96.3%. Better results were achieved when a combination of two templates was used,


TABLE IV COMPARISON BETWEEN THE TWO EDGE DETECTION ALGORITHMS

TABLE V CONTRIBUTION OF THE EDGE DETECTION ALGORITHM TO THE SYSTEM PERFORMANCE

each for its corresponding gender (Table III). The table shows a sensitivity of 97.6% and a specificity of 95.7% in 332 breaths (324 correct detections, of which 16 were detected only partially and 15 false detections). Comparable results were achieved in previous studies, where breath detection was used as part of an automatic system for labeling prosody patterns. In [32], a Bayesian classifier based on cepstral coefficients achieved detection rates of 73.2% to 91.3%, depending on whether the training set included passages from the same speech material as the test set or not. In [19], an extension of the latter system achieved sensitivity of 100% and specificity of 97% in speech corpora that included 364 breath events, but the accurate detection of the exact location of the breaths was not addressed by the authors. In an earlier paper [13], the authors mention that the breaths labeled by the algorithm are within 50 ms of those labeled by hand 95% of the time. While such accuracy can be acceptable for a prosody labeling application, it may be insufficient for other applications of breath detection mentioned in Section I. In the results we report here, “correct detection” means that the breath was detected and marked accurately and completely (i.e., the processed voice contained no audible traces of the breath). “False detection” includes both cases when the marked breath boundaries encroached into an adjacent speech segment and cases in which a breath was detected where it was not present (both of these were spotted by listening to the detected events, and by carefully examining the boundary marks). As such, it can be seen (Table III) that the breaths were detected with complete accuracy 94% of the time. Partial detections constitute only 5% of the total detections. The above results were achieved using the edge detection algorithm with spurious deep elimination (see Section III-B). This algorithm was chosen, because it requires less presetting to achieve its optimal performance. In Table IV, we provide a comparison between the results achieved with that algorithm and those achieved with the double energy threshold algorithm on a small selection of voices from four different speakers containing a total of 61 breaths. In both cases, the same template was used, constructed from the breaths of another speaker. As

can be seen, the performance is similar, although the double energy threshold algorithm is slightly better. The usefulness of the edge detection algorithm is demonstrated by testing the system twice on a small set of voices from different speakers, once with the edge detection enabled and once without it. The results are displayed in Table V. It can be seen that the edge detection algorithm contributes greatly by avoiding both partial detections and false positives. V. REAL-TIME IMPLEMENTATION OF THE ALGORITHM Implementation in a real-time environment puts very strict demands on the processing speed of the application and the latency required by it. Any real-time algorithm must be able to process data faster than the rate of new data arrival. Experimental tests have shown that the breath detection algorithm presented here is able to meet these demands, given the current state of hardware and software. The processing speed depends largely on the choice of length for the analysis windows and the hop size. Smaller hop size means that more frames need to be analyzed for a given time period, increasing the processing time. Similarly, longer analysis windows also slow down the operation, because more data need to be processed per single frame. Table VI shows the processing time as a function of the different window lengths, for a given signal. It can be seen that in all cases the processing time is well below the signal length, as required by real-time applications. Considering the fact that practical applications are coded using methods which are known to be far more efficient than those of MATLAB, the algorithm’s speed was found sufficient for real-time implementation. Another important factor for real-time implementation is the latency of the algorithm, i.e., the minimum required delay between the time data is received for processing and the time the processed data is ready for output. This latency must be bounded, which means that the processing must be localized—analysis of a given frame must be completed before the entire audio sequence is available. In our implementation, the classification of the frame as breathy or not is based only on the parameters of the frame


TABLE VI PROCESSING TIME OF THE ALGORITHM

itself and of frames in its vicinity. The latency, therefore, depends on the number of frames that are examined by the edge detection algorithm for breath boundary search. Our experiments have shown that accurate results can be achieved when the look-back and look-ahead periods for boundary search are around 200–400 ms (in each direction). The required delay in this case will be of the order of 400–800 ms. Modern audio processing applications, both hardware-based and software-based, are equipped with memory buffers that can easily store such an amount of audio data, thus making the algorithm usable in real-time. Indeed, the breath detection algorithm presented here was implemented as a real-time plug-in for an audio production environment (http://www.waves.com/content.asp?id=1749). VI. CONCLUSION In this paper, we presented an algorithm for automatic detection and demarcation of breath sounds in speech and song signals. The algorithm is based on a template-matching frame-based classifier using MFCC for the coarse detection and on a set of additional parameters in a multistage algorithm for the refinement of the detection. Tested on a selection of voices from different speakers, containing several hundreds of breath sounds, the algorithm showed very high sensitivity and specificity, as well as very accurate location of the breath edges (see Section IV). This level of performance cannot be achieved by any of the two components independently, because the frame-based classifier cannot provide the sufficient accuracy in edge detection, while the edge detection algorithm uses features that on their own cannot reliably distinguish between breath and nonbreath sounds (see Section III). It is the combination of the two that provides the necessary accuracy and robustness (see Table V). Although the current paper describes an algorithm for the detection of breath signals, a slightly modified version of this algorithm may be used for the detection of other sounds, such as certain phonemes. In preliminary experiments, it was shown to yield high-quality results in the detection of fricatives, such as /s/ and /z/, thus proving the feasibility of the general scheme for the broader task of event spotting. Our approach somewhat resembles that of the system described in [16] for suppressing pauses during the playback of
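For illustration, the latency bookkeeping implied by these figures is a one-line computation; the values below are the ranges quoted in the text, not measured numbers.

```python
HOP_MS = 10            # analysis hop size
LOOK_AHEAD_MS = 400    # upper end of the boundary-search look-ahead
LOOK_BACK_MS = 400     # upper end of the look-back kept in the buffer

frames_ahead = LOOK_AHEAD_MS // HOP_MS      # frames that must arrive before a decision
delay_ms = LOOK_BACK_MS + LOOK_AHEAD_MS     # order of the required output delay per the text
print(f"decision delayed by {frames_ahead} frames, ~{delay_ms} ms buffering")
```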


audio recordings for transcription. The latter also uses a twostage detection strategy, with a speech recognizer for the first stage and additional time-based rules for the second. The algorithm presented here is simpler, compared to a full-fledged speech recognizer; it is very efficient, with low computational complexity (suitable for real-time processing), and with a short and a simple procedure for presetting the system. These advantages make it a better choice for the task of breath detection in applications where no speech recognition is required (see Section I). The breath detection algorithm presented here was implemented as a real-time plug-in for an audio production environment. A limitation of the algorithm in its current form is that it performs well on monophonic audio signals (pure speech), but may not be suitable for polyphonic signals, containing music or other background sounds. This limitation will be addressed in future work. Additional directions of future research include testing different classifiers for the problem and extending the algorithm to the detection of other events in audio signals. ACKNOWLEDGMENT The authors would like to thank I. Neoran, Director of R&D of Waves Audio, for valuable discussions and ideas. The authors would also like to thank G. Speier for valuable ideas, M. Shaashua for helpful comments, and Y. Yakir for technical assistance. The authors are grateful to the anonymous reviewers for valuable comments and suggestions. REFERENCES [1] F. Jelinek, Statistical Methods for Speech Recognition. Cambridge, MA: MIT Press, 1998. [2] L. Rabiner and B. H. Juang, Fundamentals of Speech Recognition. Englewood Cliffs, NJ: Prentice-Hall, 1993. [3] J. T. Foote, “An overview of audio information retrieval,” Multimedia Syst., vol. 7, pp. 2–10, 1999. [4] E. Brazil, M. Fernstrom, G. Tzanetakis, and P. Cook, “Enhancing sonic browsing using audio information retrieval,” presented at the Int. Conf. Auditory Display (ICAD), Kyoto, Japan, 2002, unpublished. [5] G. Tzanetakis and P. Cook, “Audio Information Retrieval (AIR) tools,” presented at the Int. Symp. Music Information Retrieval (ISMIR), Plymouth, MA, 2000. [6] G. H. Li, D. F. Wu, and J. Zhang, “Concept framework for audio information retrieval: ARF,” J. Comput. Sci. Technol., vol. 18, pp. 667–673, 2003. [7] C. Spevak and E. Favreau, “Soundspotter—a prototype system for content based audio retrieval,” presented at the Int. Conf. Digital Audio Effects (DAFx-02), Hamburg, Germany, 2002. [8] P. Gelin and C. J. Wellekens, “Keyword spotting for video soundtrack indexing,” presented at the IEEE Int. Conf.Acoust., Speech, Signal Process. (ICASSP-96), 1996. [9] H. Sawai, A. Waibel, M. Miyatake, and K. Shikano, “Spotting Japanese CV-syllables and phonemes using the time-delay neural networks,” presented at the Int. Conf. Acoust., Speech, Signal Process. (ICASSP89), Glasgow, U.K., 1989. [10] D. Bauer, A. Plinge, and M. Finke, “Selective phoneme spotting for realisation of an /s, z, C, t/ transpose,” in Lecture Notes in Computer Science—ICHHP 2002. Linz, Austria: Springer, 2002, vol. 2398. [11] A. Plinge and D. Bauer, “Introducing restoration of selectivity in hearing instrument design through phoneme spotting,” in Assistive Technology: Shaping the Future, ser. Assistive Technology Research Series, G. M. Craddock, L. P. McCormack, R. B. Reilly, and H. Knops, Eds. Amsterdam, The Netherlands: IOS Press, 2003, vol. 11. [12] L. Kennedy and D. 
Ellis, “Laughter detection in meetings,” presented at the NIST ICASSP 2004 Meeting Recognition Workshop, Montreal, QC, Canada, 2004.


[13] P. J. Price, M. Ostendorf, and C. W. Wightman, “Prosody and parsing,” presented at the DARPA Workshop on Speech and Natural Language, Cape Cod, MA, 1989. [14] M. Covell, M. Withgott, and M. Slaney, “Mach1: Nonuniform time-scale modification of speech,” presented at the IEEE ICASSP-98, Seattle, WA, 1998. [15] L. Kennedy and D. Ellis, “Pitch-based emphasis detection for characterization of meeting recordings,” presented at the Automatic Speech Recognition Understanding Workshop (IEEE ASRU 2003), St. Thomas, VI, 2003. [16] C. W. Wightman and J. Bachenko, “Speech-recognition-assisted selective suppression of silent and filled speech pauses during playback of an audio recording,” U.S. Patent 6 161 087, Dec. 12, 2000. [17] P. A. A. Esquef, M. Karjalainen, and V. Välimäki, “Detection of clicks in audio signals using warped linear prediction,” presented at the 14th IEEE Int. Conf. Digital Signal Process. (DSP-02), Santorini, Greece, 2002. [18] J. Butzberger, H. Murveit, E. Shriberg, and P. Price, “Spontaneous speech effects in large vocabulary speech recognition applications,” presented at the Workshop on Speech and Natural Language, Harimman, New York, 1992. [19] C. W. Wightman and M. Ostendorf, “Automatic labeling of prosodic patterns,” IEEE Trans. Speech Audio Process., vol. 2, no. 4, pp. 469–481, Oct. 1994. [20] S. B. Davis and P. Mermelstein, “Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences,” IEEE Trans. Acoust., Speech, Signal Process., vol. ASSP-28, no. 4, pp. 357–366, Aug. 1980. [21] R. Mammone, X. Zhang, and R. Ramachandran, “Robust speaker recognition—A feature-based approach,” IEEE Signal Process. Mag., vol. 13, no. 5, pp. 58–71, Sep. 1996. [22] T. F. Quatieri, Discrete-Time Speech Signal Processing. Upper Saddle River, NJ: Prentice-Hall, 2001. [23] N. Cristianini and J. Shawe-Taylor, An Introduction to Support Vector Machines and Other Kernel-Based Learning Methods. Cambridge, U.K.: Cambridge Univ. Press, 2000. [24] D. A. Reynolds, “Speaker identification and verification using Gaussian mixture speaker models,” Speech Commun., vol. 17, pp. 91–108, 1995. [25] C. J. C. Burges, “Simplified support vector decision rules,” presented at the 13th Int. Conf. Machine Learning, Bari, Italy, 1996. [26] R. Cai, L. Lu, H. J. Zhang, and L. H. Cai, “Highlight sound effects detection in audio stream,” presented at the 4th IEEE Int. Conf. Multimedia and Expo, Baltimore, MD, 2003.

[27] M. Spina and V. Zue, “Automatic transcription of general audio data: preliminary analyses,” presented at the Int. Conf. Spoken Lang. Process., Philadelphia, PA, 1996. [28] S. Theodoridis and K. Koutroumbas, Pattern Recognition. London, U.K.: Academic, 1999. [29] L. R. Rabiner and R. W. Schafer, Digital Processing of Speech Signals. Englewood Cliffs, NJ: Prentice-Hall, 1978. [30] R. O. Duda, P. O. Hart, and D. G. Stork, Pattern Classification, 2nd ed. New York: Wiley, 2001. [31] L. R. Rabiner and M. R. Sambur, “An algorithm for determining the endpoints of isolated utterances,” Bell Syst. Tech. J., vol. 54, pp. 297–315, 1975. [32] C. W. Wightman and M. Ostendorf, “Automatic recognition of prosodic phrases,” presented at the IEEE Int. Conf Acoust., Speech, Signal Process., Toronto, ON, Canada, 1991.

Dima Ruinskiy (S’06) received the B.Sc. degree in computer science, specializing in signal processing, from Tel-Hai Academic College, Upper Galilee, Israel. He is currently pursuing the M.Sc. degree in computer science and applied mathematics at the Feinberg Graduate School, Weizmann Institute of Science, Rehovot, Israel, specializing in cryptanalysis of public key protocols.

Yizhar Lavner (M’01) received the Ph.D. degree from The Technion—Israel Institute of Technology, Haifa, in 1997. He has been with the Computer Science Department, Tel-Hai Academic College, Upper Galilee, Israel, since 1997, where he is now a Senior Lecturer. He also has been teaching in the Signal and Image Processing Laboratory (SIPL), Electrical Engineering Faculty, Technion, since 1998. His research interests include audio and speech signal processing, voice analysis and perception, and genomic signal processing.
