Multidimensional Humming Transcription Using Hidden Markov Models for Query by Humming Systems
Hsuan-Huei Shih, Shrikanth S. Narayanan and C.-C. Jay Kuo
The authors are with Integrated Media Systems Center and Department of Electrical Engineering-Systems, University of
Southern California, Los Angeles, CA 90089-2564, USA. E-mail: {hshih,shri,cckuo}@sipi.usc.edu.
Abstract

Providing natural and efficient access to fast-growing multimedia information, while accommodating a variety of user skills and preferences, is a critical aspect of content-based information mining. Query by humming provides a natural means for content-based retrieval from music databases. A statistical pattern recognition approach for recognizing hummed or sung melodies is reported in this paper. Being data-driven, the proposed system aims at providing a robust front end, especially for dealing with variability in users' productions. The segment of a note in the humming waveform is modeled by phone-level hidden Markov models (HMM), with data features such as energy measures modeled by Gaussian mixture models (GMM). The duration of each note segment is then labeled by a duration model, and the pitch of the note is modeled by a Gaussian mixture pitch model. Preliminary real-time recognition experiments were carried out on humming data obtained from eight users, and an overall recognition rate of around 81% is demonstrated.
I. INTRODUCTION

Content-based multimedia data retrieval is an emerging research area, and enabling natural interaction with multimedia databases is a critical component of such efforts. Music databases form a significant portion of media applications, and there is a great need for methods to index and interact with them. Querying music databases using human humming as the query key has recently gained attention as a viable option [1]. This requires signal processing that automatically maps human humming waveforms to symbol strings representing the underlying melody and duration contours.

This paper focuses on automatic humming recognition and transcription. In this work, we target a finer description of the melodies, that is, at the note level. The proposed statistical approach aims at providing note-level decoding. Since it is data-driven, it is more amenable to robust processing in terms of handling variability in humming. Conceptually, the approach tries to mimic a human's perceptual processing of humming rather than attempting to model the production of humming. Such statistical approaches have been a great success in automatic speech recognition and can be adopted and extended to recognize human humming and singing. Unlike prior deterministic methods for humming query systems, decoded notes can be streamed to the query module since real-time decoding can be achieved, enabling decoding and query to be conducted simultaneously. Another advantage of the proposed framework is the capability of adding and changing features for humming recognition by simply changing the models, without requiring the design of new algorithms. In particular, we focus on recognizing both the melody contour and the duration contour of a piece of humming in this
research. These two contours are combined to form a multidimensional humming transcription. The multidimensional transcription is assumed to be a sequence of notes in a piece of humming, defined by their detected duration changes and pitch intervals. A note segmentation model is defined by phone-level hidden Markov models with features modeled by Gaussian mixture models. A pitch model of a note is defined by pitch features and is also modeled by Gaussian mixture models. During the training phase, note and pitch models are trained with humming data obtained from human subjects. During recognition, the incoming humming waveform is first decoded with the trained note models for note segmentation, and then the pitch values of the segmented notes are detected with the trained pitch models. The detection process runs in real time.

The rest of the paper is organized as follows. An overview of previous work is given in Sec. II. In Sec. III, a brief development history and an overview of the proposed algorithm are provided. The details of the proposed algorithm are given in Sec. IV. Experimental results are given in Sec. V, and concluding remarks and future work are presented in Sec. VI.

II. BACKGROUND

The first Query by Humming system was proposed in 1995 by Ghias et al. [1], and Query by Humming (QBH) has since attracted researchers' attention. The input of a QBH system is a user's humming, and the output is a list of music pieces whose melodies are similar to the query humming. Fig. 1 shows a general QBH system. The music database contains the music pieces to be queried; currently, most QBH systems use MIDI files. The user's humming is passed to the humming transcription module, which converts the humming waveform into a musical query sequence. This musical query sequence can be a melody contour, a duration contour, or both. The query engine uses the musical query sequence to query the music database. Music database organization schemes help QBH systems query databases more efficiently, and many researchers are working on better ways to organize music databases, for example, by classifying music pieces into groups of different styles and genres. The aim of this paper is to propose a new approach for the humming transcription module that can provide more accurate musical query sequences than conventional approaches.

Most approaches to humming recognition that have been proposed for Query by Humming systems are based on non-statistical signal processing. They include time domain, frequency domain, and cepstral domain methods, with most focusing predominantly on the time domain. Ghias et al. [1] and Jang et al. [2] used autocorrelation to calculate pitch periods. McNab et al. [3-5] applied the Gold-Rabiner algorithm [6] to overlapping frames of a note segment, extracted by
energy-based segmentation. For every frame, the algorithm yielded the frequency of maximum energy, and histogram statistics of the frame-level values were used to decide the note frequency. Pauws [7] used a combination of time domain and frequency domain methods to detect onsets and pitches of notes. The algorithm combined short-term energy, a surf measure [8], high-frequency content, and pitch change to decide note onsets, and sub-harmonic summation (SHS) was used to estimate the pitch of the singing.

A major problem with non-statistical approaches is robustness to inter-speaker variability and other signal distortions. Users, especially those with minimal or no musical training, hum with varying levels of accuracy (in terms of pitch and rhythm). Hence, most deterministic methods tend to use only a coarse melodic contour, e.g., labeled in terms of rising/stable/falling relative pitch directions [1]. One justification for this representation is that humans are more sensitive to the highs and lows between adjacent pitches [9]. While this representation minimizes potential errors in the representation used for query and search, its scalability is limited. To overcome the limitation introduced by the three-step coarse melody representation, finer melody representations have been proposed. One example is a nine-step melody representation [7] which labeled pitch intervals as -4/-3/-2/-1/0/1/2/3/4 with a quantization level of 2 semitones; the assumption behind it is that pitch intervals larger than 5 semitones occur rarely (only about 10% of the time) in music pieces across the world. However, these representations are still too coarse to incorporate higher level music knowledge. Some systems proposed full-scale melody representations using a semitone as the quantization level with respect to absolute pitch values; in other words, they tried to match a pitch to a key on a piano keyboard. However, such representations lack the ability to deal with inter-speaker variability: different users may hum in different keys and at different tempos, and normalization across users and across the music pieces in the database is the main challenge of these approaches. Another problem with these non-statistical approaches is the lack of real-time processing ability, since most rely on utterance-level feature measurements that require buffering.

A statistical humming recognition system is proposed to address the aforementioned problems. Two similar efforts have been reported by Raphael [10, 11] and Durey et al. [12, 13]. Raphael addressed the segmentation problem of automatic musical accompaniment in [10]. Hidden Markov models were used to segment notes and rests in a music piece. The energy measure was adopted as the major feature, and pitch information was not used; therefore, the system could only determine when a note was played, not which note was played. In [11], Raphael extracted a rhythm transcription from a
music piece. He used the technique proposed in [10] to find note segments. The system required a complete set of possible measure positions, i.e., a note's onset position within a measure, and the time signature of the music piece. Durey et al. used hidden Markov models to spot melodies in music pieces. A small piece of query melody was passed into the melody spotting system, and a list of music pieces containing the query melody was returned. Their approach was similar to word spotting in a speech recognition system. They experimented with three different frequency feature sets in [12], i.e., autocorrelation, a 75-channel Fast Fourier Transform (FFT), and a 25-channel FFT. Later, they experimented with three other feature sets in [13], i.e., FFT, Mel-frequency cepstral coefficients (MFCC), and filter banks (FB). All these studies were based on instrumental or MIDI-generated music pieces that were accurate in tune and rhythm. Our proposed system aims at dealing with human humming, which is usually less accurate and has more variability. The proposed system detects not only when a note is hummed but also which note is hummed, and the features it uses cover acoustic shape information, segmentation information, and pitch information.

III. SYSTEM OVERVIEW

The system design is motivated by the statistical approach to automatic speech recognition. Humming and singing are similar to speech and may be considered another form of language whose grammar is guided by music theory; the notes of a music piece can be considered similar to words in speech. Because hidden Markov models (HMMs) are widely used in speech recognition systems, our first experiment used HMMs to model musical notes. Details of how they were chosen and how they are used are given in Sec. IV-C.1.

The goal of our first prototype was to extract the melody contour from an input humming signal, where the melody contour is a sequence of pitch intervals that indicate adjacent pitch changes in semitones. The first prototype was proposed in [14] and is shown in Fig. 2. A single-stage decoder was used to recognize humming signals, and a single HMM was used to model two attributes of a note, i.e., duration and pitch. By including the pitch information in the note HMMs, the first prototype had to deal with a large number of HMMs, one per pitch interval, and the amount of training data required to cover all possible pitch intervals became large. Furthermore, our first humming database did not have a sufficient amount of training data. To overcome these difficulties, two tasks were carried out. One was collecting a larger humming database. The other was designing a new system that handles the duration information and the pitch information of a note separately to make the melody more tractable. Better
transcriptions covering not only melody contours but also duration contours were needed. Our second prototype, proposed in [15], was developed to address these requirements, and the resulting transcription scheme provided a multidimensional description. Pitch features were separated from the data features used to represent a note. The system became a two-stage process that segments a note in the first stage and detects the pitch of the note in the second stage. MFCCs and the energy measure remained the features used to model a note. Since the pitch information was separated from the note model, only one note model was needed to find the start and the end of each note. Rests were treated as silences and were modeled by a silence model. Pitch models were used to detect the pitch of a note; details of the pitch features and models are presented in Sec. IV-D.

The system reported in this paper builds upon and expands the capability of our initial efforts. The new system is based on a multidimensional humming transcription scheme and is summarized in Fig. 3. As in any data-driven pattern recognition approach, models are derived from data representing the underlying classes for recognition. Details of database preparation are given in Sec. IV. The proposed algorithm can be divided into two stages. In the first stage, a humming piece is passed to the note decoder for note segmentation, and a duration label is assigned to each segmented note. Statistical modeling assumes the availability of signal-derived features that provide discriminability for recognition. Training data are windowed into frames, and two features are extracted from each frame: Mel-frequency cepstral coefficients and an energy measure. In the second stage, each segmented note of the humming piece is passed to the pitch detector for pitch tracking. The phone-level statistical modeling, feature selection, training, decoding, and labeling of the note segmentation stage are addressed in Sec. IV-C. The pitch feature selection, pitch analysis, and pitch model generation of the pitch tracking stage are described in Sec. IV-D. The generation of the multidimensional humming transcription is given in Sec. IV-E. Finally, the potential of music language modeling in humming recognition is addressed in Sec. IV-F.

IV. PROPOSED MULTIDIMENSIONAL HUMMING TRANSCRIPTION SYSTEM

A. Humming Recording

Since the proposed study is one of the early ones to use human humming, and since no humming database was available, creating a humming database became the first step. The preliminary database used in this work was collected from eight subjects, four female and four male. Users were asked to hum specific melodies using a stop consonant-vowel syllable, such as "da" or "la", as the basic sound unit; other sound units could also be used. The "da" sound was used predominantly as the
sound unit in our early investigation. Each person was asked to hum three different melodies: the ascending C major scale, the descending C major scale, and the short song "2 little tigers". The corresponding music scores are given in Fig. 4. The recordings were made with a high-quality close-talking Shure microphone (model SM12A-CN) at 44.1 kHz and high-quality recorders in a quiet office environment. Recorded signals were transferred to a computer and low-pass filtered at 8 kHz to reduce noise and other high-frequency components outside the normal human humming range. The signals were then downsampled to 16 kHz.

B. Data Transcription

A humming piece is assumed to be a sequence of notes. To enable supervised training, these notes were segmented and labeled by human listeners. Manual segmentation of notes was included to provide information for pitch modeling and for comparison against automatic methods. In practice, few people have perfect pitch and can hum a specific pitch, for example a true "A" note (440 Hz), at will. Therefore, using absolute pitch values to label a note was not deemed a viable option. Since the goal of this work is to map the humming signal to notes, a more robust and general method is to focus on relative changes in the pitch values of a melody contour. A note has two main attributes, namely pitch (measured by the fundamental frequency of voicing) and duration. Hence, pitch intervals (relative pitch values) were used to label a humming piece instead of absolute pitch values. The same argument also holds for note durations: human ears are sensitive to relative duration changes of notes, and keeping track of relative duration changes is more useful than keeping track of the exact duration of each note. Therefore, the duration models use relative duration changes.

Two different labeling conventions were considered for melody contours. The first one uses the first note's pitch as the reference to label the subsequent notes in the rest of the hummed piece. Let "R" represent notes that have the same pitch as the reference, and let "Dn" and "Un" represent notes that are lower or higher in pitch than the reference by n semitones. For example, a humming piece corresponding to do-re-mi-fa is labeled "R-U2-U4-U5", while a humming piece corresponding to do-ti-la-sol is labeled "R-D1-D3-D5", where "R" is the reference note, "U2" denotes a pitch higher than the reference by two semitones, and "D1" denotes a pitch lower than the reference by one semitone. The numbers following "D" or "U" are variable and depend on the humming data. Fig. 5 shows humming pieces of the ascending and descending C major scales, hand segmented and labeled using the first labeling convention. The labeling tool we used is HSLab, which is included in the publicly available Hidden Markov Model Toolkit (HTK) [16].
The second labeling convention is based on the rationale that a human is sensitive to the pitch variation with respect to the previous note rather than the first note. According to this convention, the humming piece for do-re-mi-fa is labeled "R-U2-U2-U1", and a humming piece corresponding to do-ti-la-sol is labeled "R-D1-D2-D2", where "R" labels the first note since it has no previous note to serve as a reference. Fig. 6 shows humming pieces of the ascending and descending C major scales, manually segmented and labeled using the second labeling convention. All humming data were labeled using both conventions. The transcriptions contained the labels as well as the start and end of each note; they were saved in separate files and used during supervised training of the note models and as reference transcriptions for evaluating recognition results. Although two labeling conventions were investigated, only the second one was adopted in the proposed system, since our initial experiments showed that it provided more robust results. Details are given in [14].
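To make the two conventions concrete, the following minimal sketch (the MIDI note numbers, function names, and example melody are ours, purely for illustration and not part of the original system) derives both label sequences from a list of absolute pitches:

```python
# Minimal sketch of the two labeling conventions. Pitches are given as
# MIDI note numbers (60 = middle C); the helper names and example melody
# are illustrative only.

def label(interval):
    """Map a signed semitone interval to an 'R' / 'Un' / 'Dn' label."""
    if interval == 0:
        return "R"
    return ("U" if interval > 0 else "D") + str(abs(interval))

def first_note_reference(midi_notes):
    """First convention: every note is labeled against the first note."""
    ref = midi_notes[0]
    return ["R"] + [label(n - ref) for n in midi_notes[1:]]

def previous_note_reference(midi_notes):
    """Second convention: every note is labeled against its predecessor."""
    return ["R"] + [label(cur - prev)
                    for prev, cur in zip(midi_notes, midi_notes[1:])]

do_re_mi_fa = [60, 62, 64, 65]                   # C4 D4 E4 F4
print(first_note_reference(do_re_mi_fa))         # ['R', 'U2', 'U4', 'U5']
print(previous_note_reference(do_re_mi_fa))      # ['R', 'U2', 'U2', 'U1']
```

Because the second convention only ever compares neighboring notes, it spreads the available training data over far fewer interval classes for the same database.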
C. Note Segmentation Stage

The first stage of the proposed algorithm is note segmentation, in which the notes of a humming piece are segmented. First, a feature set that can characterize a note is chosen. Next, the HMM topology for the note is defined. During the training phase, the notes' phone-level HMMs are trained using the selected feature set. The trained note models are then used by the decoder for note segmentation. Finally, the duration of each segmented note is labeled according to its relative duration change.

1) Feature Selection: The choice of good features is key to good humming recognition performance. Since human humming production is similar to the speech signal, features used to characterize phonemes in automatic speech recognition (ASR) are considered for modeling notes in humming recognition. Our base feature set includes mel-frequency cepstral coefficients (MFCC), an energy measure, and their first and second derivatives.
• Mel-Frequency Cepstral Coefficients (MFCCs): MFCCs are obtained through a nonlinear filterbank analysis motivated by the human hearing mechanism and are popular features in automatic speech recognition (ASR). The applicability of MFCCs to music modeling has been shown by Logan [17]. Cepstral analysis converts multiplicative signal components into additive ones. The vocal tract properties and the pitch period effects of a humming signal are multiplied together in the
spectrum domain. Since the vocal tract properties vary slowly, they fall in the low-quefrency area of the cepstrum (the frequency variable of the spectrum domain is called the quefrency in the cepstrum domain), whereas the pitch period effects are concentrated in the high-quefrency area. Applying low-pass liftering (filtering in the spectrum domain is called liftering in the cepstrum domain) to the Mel-frequency cepstral coefficients yields the vocal tract properties, which are used to estimate the acoustic spectral shapes of humming notes. Although applying high-pass liftering to the Mel-frequency cepstral coefficients yields the pitch period effects, the resolution is not sufficient to estimate the pitch of a note; other pitch tracking methods are therefore needed and are described in Sec. IV-D. For our analysis, 26 filterbank channels were chosen, and the first 12 MFCCs were selected as features.
• Energy Measure: Energy is an important feature in humming recognition and is especially useful for the temporal segmentation of notes. The log energy value is calculated from the input humming samples {s_n, n = 1, ..., N} via

E = \log \sum_{n=1}^{N} s_n^2.   (1)
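As a rough illustration of this front end, the sketch below (assuming the librosa library; the actual front end in the paper was built with HTK, so windowing and filterbank details here are only indicative) computes 12 MFCCs from a 26-channel mel filterbank and the per-frame log energy of Equation (1), and stacks them with first and second derivatives into a 39-element vector per frame:

```python
# Sketch of the 39-element frame-level front end (12 MFCCs, the log energy
# of Eq. (1), and their first and second derivatives). The paper's front end
# was built with HTK; this librosa/numpy version is illustrative only.
import numpy as np
import librosa

def humming_features(y, sr=16000, frame_ms=20, hop_ms=3):
    frame = int(sr * frame_ms / 1000)            # 320 samples at 16 kHz
    hop = int(sr * hop_ms / 1000)                # 48 samples at 16 kHz

    # 12 MFCCs from a 26-channel mel filterbank.
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=12,
                                n_fft=frame, hop_length=hop, n_mels=26)

    # Per-frame log energy, E = log sum_n s_n^2  (Eq. (1)).
    frames = librosa.util.frame(y, frame_length=frame, hop_length=hop)
    log_e = np.log(np.sum(frames ** 2, axis=0) + 1e-12)[np.newaxis, :]

    # Align frame counts (librosa pads the MFCC computation by default).
    T = min(mfcc.shape[1], log_e.shape[1])
    base = np.vstack([mfcc[:, :T], log_e[:, :T]])            # 13 x T

    feats = np.vstack([base,
                       librosa.feature.delta(base),           # first derivative
                       librosa.feature.delta(base, order=2)]) # second derivative
    return feats.T                                            # (T, 39)
```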
Typically, a distinct variation in the energy value occurs during the transition from one note to the next. This effect can be enhanced by asking users to hum basic sounds that combine a stop consonant and a vowel (e.g., "da", "la"). The log energy plot of a humming piece using "da" is shown in Fig. 7; the energy drops indicate note changes. The complete 39-element feature vector contains 12 MFCCs, 1 energy measure, and their first and second derivatives.

2) Hidden Markov Model: Hidden Markov models (HMMs) with Gaussian mixture models (GMMs) for the observations of each state were used to define a note model. Each note was modeled by a 3-state left-to-right HMM, as shown in Fig. 9. An input humming piece is segmented into frames, and features are extracted from each frame. In the first prototype, two possible sets of HMMs were derived in [14]: the first used the first note of the sequence as its reference, while the second used the previous note as the reference. The proposed system uses the previous note as the reference, and the note models are further divided into phone-level HMMs.
• Phone-Level Hidden Markov Model: Since notes in humming are typically uttered as consonant-vowel syllables, modeling the consonants and vowels separately and combining them into note models provides design scalability. Each note
was modeled by multiple phone-level HMMs. A phone-level HMM uses the same structure as the note-level HMM to characterize part of a note. The use of HMMs provides the ability to model the temporal aspects of a note, especially its time elasticity. The features corresponding to each state occupation in an HMM are modeled by a mixture of two Gaussians. The idea of using phone-level HMMs for a humming note is very similar to that used in speech recognition. Since a stop consonant and a vowel have quite different acoustic characteristics, two distinct phone-level HMMs are defined for "d" and "a": the HMM of "d" models the stop consonant of a humming note, and the HMM of "a" models the vowel. A humming note is then represented by the HMM of "d" followed by the HMM of "a".
• Silence Hidden Markov Model: A new silence model was designed to improve robustness, since background noise and other distortions may cause erroneous note segmentation. In the new silence model, extra transitions from state 1 to state 3 and from state 3 back to state 1 were added to the original 3-state left-to-right HMM. These transitions allow the model to absorb impulsive noise without exiting the silence model. In addition, a 1-state short pause "sp" model is created; this is a "tee-model", which has a direct transition from the entry node to the exit node, and its emitting state is tied to the center state (state 2) of the new silence model. The topology of the new silence model is shown in Fig. 8. A "Rest" in a melody is then represented by the silence HMM.
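The topologies described in the two items above can be summarized by their transition matrices. The sketch below is structural only: the state numbering follows the HTK convention of non-emitting entry and exit states, and the probability values are placeholders that would in practice be learned by Baum-Welch re-estimation.

```python
# Structural sketch of the HMM topologies, written as HTK-style transition
# matrices (state 0 = entry, states 1-3 emitting, state 4 = exit). The
# probabilities are placeholders; in the system they are learned by
# Baum-Welch re-estimation.
import numpy as np

# 3-state left-to-right phone HMM (used for "d" and "a").
note_phone = np.array([
    #  0    1    2    3    4
    [0.0, 1.0, 0.0, 0.0, 0.0],   # entry -> state 1
    [0.0, 0.6, 0.4, 0.0, 0.0],   # state 1: self-loop or advance
    [0.0, 0.0, 0.6, 0.4, 0.0],   # state 2: self-loop or advance
    [0.0, 0.0, 0.0, 0.6, 0.4],   # state 3: self-loop or exit
    [0.0, 0.0, 0.0, 0.0, 0.0],   # exit (non-emitting)
])

# Modified silence model: extra 1 -> 3 and 3 -> 1 transitions let the model
# absorb impulsive noise without exiting silence. The 1-state "sp" tee model
# (not shown) shares state 2 and has a direct entry-to-exit transition.
silence = note_phone.copy()
silence[1, 1:4] = [0.5, 0.4, 0.1]        # adds the 1 -> 3 skip
silence[3, 1:5] = [0.1, 0.0, 0.5, 0.4]   # adds the 3 -> 1 return

assert np.allclose(note_phone[1:4].sum(axis=1), 1.0)
assert np.allclose(silence[1:4].sum(axis=1), 1.0)
```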
3) Duration Model: Instead of using absolute duration values directly, the relative duration change is used in the labeling process. The relative duration change of a note is computed with respect to its previous note as

relative duration = \log_2\left(\frac{\text{current duration}}{\text{previous duration}}\right).   (2)
The system assumes that the shortest note in a humming piece is a 32nd note. A total of 11 duration models, corresponding to labels -5, -4, -3, -2, -1, 0, 1, 2, 3, 4, and 5, cover the possible changes between a whole note and a 32nd note.

4) Training Process: An efficient and robust re-estimation procedure is used to automatically determine the parameters of the note model. Given a sufficient amount of training data for a note, the constructed HMM can be used to represent that note. The HMM parameters are estimated during supervised training using the maximum likelihood approach with Baum-Welch re-estimation. The first step in determining the parameters of an HMM is to make a rough guess of their values; the Baum-Welch algorithm is then applied to these initial values to improve their accuracy in the maximum likelihood
sense. An initial 3-state left-to-right HMM is used in the first two Baum-Welch iterations to initialize the silence model; the tee-model (the "sp" model) extracted from the silence model and the backward 3-to-1 state transition are added after the second Baum-Welch iteration.

5) Recognition Process: In the recognition phase, the same frame size is used and the same features are extracted from the input humming piece. There are two steps in the recognition process: note decoding and duration labeling. To recognize an unknown note in the first step, the likelihood of each model generating that note is calculated, and the model with the maximum likelihood is chosen to represent the note. After a note is decoded, its duration is labeled accordingly.
• Note Decoding: The Viterbi algorithm [18] is used in the decoding process. The recognition problem is to find the model that is most likely to have generated the data.

• Duration Labeling: After a note is segmented, its relative duration change is calculated using Equation (2) and labeled according to the duration model. The duration label of a note segment is the integer closest to the calculated relative duration change; for example, if the relative duration change is 2.2, the note is labeled 2. The first note's duration label is set to "0", since no previous reference note exists.
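A minimal sketch of the duration labeling rule follows; the duration values used in the example are invented for illustration.

```python
# Sketch of the duration labeling rule: the relative duration change of
# Eq. (2) is rounded to the nearest integer label in [-5, 5]; the first
# note gets label 0. The durations (in seconds) are invented examples.
import math

def duration_labels(durations, min_label=-5, max_label=5):
    labels = [0]                                  # first note has no reference
    for prev, cur in zip(durations, durations[1:]):
        change = math.log2(cur / prev)            # Eq. (2)
        label = int(round(change))                # nearest duration model
        labels.append(max(min_label, min(max_label, label)))
    return labels

print(duration_labels([0.25, 0.25, 0.55, 0.25]))  # [0, 0, 1, -1]
```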
D. Pitch Tracking Stage

After a note is segmented from the humming piece, it is passed to the second stage to determine its pitch. The pitch detector decides the pitch of a segmented note based on statistical pitch models obtained off-line from the humming database. The implementation of each component of the pitch detector is given below.

1) Pitch Analysis: Short-time autocorrelation was chosen for pitch analysis; its main advantage is its relatively low computational cost compared with other pitch detection algorithms. A frame-based analysis is performed on each note segment with a frame size of 20 msec and a 10 msec frame overlap. Multiple frames of a segmented note are used for pitch model analysis. After applying autocorrelation to these frames, pitch features are extracted. The selected pitch features are the first harmonic of each frame, the pitch median of the note segment, and the pitch log standard deviation of the note segment.

2) Pitch Features: The first harmonic, also known as the fundamental frequency or the pitch, provides the most important pitch information. The pitch median gives the pitch of the whole note segment. Because
of noise, there is frame-to-frame variability in the detected pitch values within the same note segment. Taking their average is not a good choice, since outlying pitch values pull the mean away from the target value; the median pitch value of a note segment proved to be a better choice in our experiments. The outlying pitch values also affect the standard deviation of a note segment. To overcome this problem, these outlying values are moved back into the range where most pitch values lie. Since the smallest interval between two different notes is a semitone, pitch values that differ from the median by more than one semitone are considered to have drifted significantly and are moved back to the median before the standard deviation is calculated. Pitch values of notes are not linearly spaced in the frequency domain; they are linearly spaced in the log frequency domain, so it is more reasonable to compute the standard deviation on a log scale. Thus, the log pitch mean and the log standard deviation of a note segment are calculated in this work.

3) Pitch Model: Pitch models are used to measure the difference, in semitones, between two adjacent notes:

pitch interval = \frac{\log(\text{current pitch}) - \log(\text{previous pitch})}{\log \sqrt[12]{2}}.   (3)
The pitch models cover two octaves of pitch intervals, from D12 to U12 semitones. A pitch model has two attributes: the length of the interval (in semitones) and the pitch log standard deviation within the interval. The two attributes are modeled by a Gaussian function. The boundary information and the ground-truth pitch interval were obtained from the manual transcription. The pitch intervals and log standard deviations computed from the ground-truth segments are collected, and a Gaussian model is then generated from the collected information. Fig. 10 shows the Gaussian models of pitch intervals from D2 to U2. Due to the limited amount of training data, not every interval within the two octaves is observed, so pseudo models are generated to fill in the missing pitch models. The pseudo model for the nth interval is based on the pitch model of U1, with the mean of the pitch interval shifted to the predicted center of the nth interval.

4) Pitch Detector: The pitch detector decides the pitch change of a segmented note with respect to its previous note. The first note of a humming piece is always marked as the reference note, and its detection is, in principle, not needed; however, its pitch is still calculated as the reference in our experiments. The subsequent notes of the humming piece are detected by the pitch detector: their pitch intervals and pitch log standard deviations are calculated and used to select the model that gives the maximum likelihood value as the detected result.
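The following sketch outlines the pitch tracking computations described in this section: a crude autocorrelation pitch per frame, the note-level median and drift-corrected log standard deviation, the interval of Equation (3), and a maximum-likelihood choice among Gaussian pitch-interval models. The helper names are ours, and treating the two model attributes as independent Gaussians is a simplifying assumption.

```python
# Sketch of the pitch tracking computations: autocorrelation pitch per
# frame, note-level median, drift-corrected log standard deviation, the
# interval of Eq. (3), and a maximum-likelihood pick among Gaussian
# pitch-interval models. Helper names and model layout are ours; the two
# model attributes are treated as independent Gaussians for simplicity.
import numpy as np

def frame_pitch(frame, sr=16000, fmin=80.0, fmax=800.0):
    """Crude short-time autocorrelation pitch estimate for one frame."""
    frame = frame - frame.mean()
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    lo, hi = int(sr / fmax), int(sr / fmin)
    lag = lo + int(np.argmax(ac[lo:hi]))
    return sr / lag

def note_pitch_features(frame_pitches):
    """Median pitch and log-std after pulling drifted frames back."""
    p = np.asarray(frame_pitches, dtype=float)
    median = np.median(p)
    semitone = 2.0 ** (1.0 / 12.0)
    drifted = (p > median * semitone) | (p < median / semitone)
    p = np.where(drifted, median, p)       # clamp outliers to the median
    return median, np.std(np.log(p))

def pitch_interval(cur_pitch, prev_pitch):
    """Signed interval in semitones, Eq. (3)."""
    return (np.log(cur_pitch) - np.log(prev_pitch)) / np.log(2.0 ** (1.0 / 12.0))

def detect_interval(interval, log_std, models):
    """Pick the pitch model (D12..R..U12) with the highest likelihood.
    `models` maps a label to (mean_int, std_int, mean_logstd, std_logstd)."""
    def loglik(name):
        mi, si, ms, ss = models[name]
        return (-0.5 * ((interval - mi) / si) ** 2
                - 0.5 * ((log_std - ms) / ss) ** 2)
    return max(models, key=loglik)
```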
E. Transcription Generation

After the note segmentation stage and the pitch detection stage, a humming piece has all the information required for transcription. The transcription of the humming piece is a sequence of length N with two attributes per symbol, where N is the number of notes. The two attributes are the duration change (the relative duration) of a note and the pitch change (the pitch interval) of a note. A "Rest" note is labeled "Rest" in the pitch interval attribute, since it does not have a pitch value. The following example shows the first two bars of the song "Happy birthday to you"; the full music score and transcription are shown in Fig. 11.

Numerical music score:   | 1  1  2  | 1  4  3  |
N x 2 transcription:
  Duration changes:      | 0  0  1  | 0  0  1  |
  Pitch changes:         | R  R  U2 | D2 U5 D1 |
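A trivial sketch of how the two attribute streams are paired into the N x 2 transcription (using the labels of the two bars above; the tuple representation is our own choice) is given below.

```python
# Sketch of pairing the two attribute streams into the N x 2 transcription,
# using the labels of the two example bars above (the tuple form is ours).
duration_changes = [0, 0, 1, 0, 0, 1]
pitch_changes = ["R", "R", "U2", "D2", "U5", "D1"]

transcription = list(zip(duration_changes, pitch_changes))
print(transcription)
# [(0, 'R'), (0, 'R'), (1, 'U2'), (0, 'D2'), (0, 'U5'), (1, 'D1')]
```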
F. Music Language Modeling

It is interesting to apply a music language model to the proposed system. The underlying assumption is that music note sequences can be modeled using statistical information learned from music databases. This in turn allows us to predict the next note of a music sequence and helps improve the probability of correctly recognizing a note. The note sequence may contain the pitch information, the duration information, or both, and an N-gram model can be designed for each level of information. For example, an N-gram duration model can be applied in the note segmentation stage of the proposed system, an N-gram pitch model can be applied in the pitch tracking stage, and an N-gram pitch-duration model can be applied once a note's pitch and duration are recognized. Fig. 12 shows where the N-gram models can be applied in the proposed humming recognition system. Our objective here is to illustrate the potential of using a music language model to improve the recognition result. N-gram pitch models were used in the experiment; in particular, the back-off and discounting bigram (N-gram with N = 2) was chosen, with bigram probabilities calculated on a base-10 log scale. The twenty-five pitch models defined in Sec. IV-D.3 (D12, ..., R, ..., U12), covering intervals over two octaves, were used in the pitch detection process. Given the extracted pitch features of a note segment, the probability of each pitch model was calculated on the base-10 log scale.
Let i and j be indices of the pitch models, i.e., positive integers from 1 to 25. The following grammar rule is used to decide the most likely note sequence:

\max_i \; \left[ P_{\text{note}}(i) + \beta P_{\text{bigram}}(j, i) \right],   (4)

where P_note(i) is the probability of the segment being pitch model i, P_bigram(j, i) is the probability of pitch model i following pitch model j, and β is a scale factor that controls the weight of the bigram term in the selection of pitch models. Equation (4) chooses the pitch model with the greatest overall probability.
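A minimal sketch of this grammar rule follows; the probability tables are toy values for illustration only, not trained bigrams.

```python
# Sketch of the grammar rule of Eq. (4): given the previous pitch model j,
# pick the model i maximizing log10 P_note(i) + beta * log10 P_bigram(j, i).
# The probability tables below are toy values, not trained bigrams.
def apply_bigram(log_p_note, log_p_bigram, prev_model, beta=1.25):
    """log_p_note: {model: base-10 log acoustic score};
    log_p_bigram: {(prev_model, model): base-10 log bigram probability}."""
    return max(log_p_note,
               key=lambda i: log_p_note[i]
               + beta * log_p_bigram.get((prev_model, i), -99.0))

log_p_note = {"U2": -1.0, "U1": -1.1, "U3": -2.0}               # acoustic scores
log_p_bigram = {("R", "U2"): -0.4, ("R", "U1"): -1.5, ("R", "U3"): -1.2}
print(apply_bigram(log_p_note, log_p_bigram, prev_model="R"))   # -> U2
```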
V. EXPERIMENTAL RESULTS

A. Development Tools

A graphical user interface (GUI) program called "HTKEdit" was written based on the Hidden Markov Model Toolkit (HTK) [16]. The program can be used to train both word-level and phone-level note models as well as pitch models, and it was extended with pitch analysis tools for the pitch model study. It can use the trained note and pitch models to perform both off-line and real-time transcription and convert the hummed pieces into a music score. Fig. 13 shows a screenshot of the HTKEdit program. The "HummingDecoder" was written in Java to implement the proposed humming transcription system in a cross-platform manner. HummingDecoder can perform both off-line and real-time transcription but does not include a training mechanism. Fig. 14 shows a screenshot of the HummingDecoder program. Detailed descriptions of their functionality, together with downloads of HTKEdit and HummingDecoder, can be found in [19].
B. Experimental Setup

The proposed algorithm consists of two stages: note segmentation and pitch detection. For the note segmentation stage, the 39-element feature vector consists of 12 MFCCs, 1 energy measure, and their first and second derivatives. The frame size was chosen to be 20 msec and the frame skip 3 msec (i.e., two consecutive frames overlap by 17 msec). A 3-state left-to-right HMM with a two-component Gaussian mixture per state was used to model the notes, and each model was trained 10 times. For the pitch detection stage, the frame size was 20 msec and the frame skip 10 msec; only one Gaussian was used since the data set was quite limited.

Before presenting the experimental results, let us examine how the parameters were decided. As part of the initial experiments, we empirically investigated: (1) the choice of the frame size and the frame rate, (2) the need for manual segmentation, and (3) the weighting selection of models. For the sake of simplicity,
our experiments were performed with only two models, one for generic notes and one for silence; hence, they did not involve resolving individual notes using pitch information. Let us define the following quantities:

• N: the total number of notes in the reference transcription;
• D: the number of deletion errors;
• S: the number of substitution errors;
• I: the number of insertion errors.
Two performance measures, the correct recognition rate (CRR) and the accuracy rate (AR), are adopted for comparison:

CRR = \frac{N - D - S}{N},   AR = \frac{N - D - S - I}{N}.
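For completeness, the two measures can be computed directly from the error counts; the counts in the example below are illustrative only.

```python
# Sketch of the two performance measures; the counts are illustrative.
def correct_recognition_rate(N, D, S):
    return (N - D - S) / N

def accuracy_rate(N, D, S, I):
    return (N - D - S - I) / N

print(correct_recognition_rate(320, 2, 3))   # 0.984375
print(accuracy_rate(320, 2, 3, 10))          # 0.953125
```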
1) Frame Size and Frame Rate: An important aspect of front-end signal processing is the selection of an appropriate frame size and frame advancement step for maximum recognition accuracy. The frame size relates to the frequency resolution of the spectral analysis: a large window gives good frequency resolution but compromises the short-time stationarity assumption about the signal. The frame advancement step, i.e., the number of samples by which a frame is advanced, dictates the smoothness of the resulting short-time analysis. The choice of the frame size and the advancement step was made based on experimental evaluations. The recognition performance on the training set was used as the criterion for deciding the optimal parameters among a variety of frame size/advancement step values: 15/3, 15/2, 20/3, and 30/10 (msec). Across speakers, a frame advancement step of 3 msec and a frame size of 20 msec gave the best results, as shown in Table I. Table I was prepared with the objective of segmenting notes correctly; therefore, pitch information was not included in this experiment. We observed that, no matter how fast a person changes notes while humming or singing, 20 msec is about the upper bound; in other words, a humming piece can be assumed to be a stationary signal within a period of 20 msec. However, a note may still change within a 20 msec frame, since no synchronization between frames and notes was attempted; that is, the boundary between two adjacent notes may fall within the selected 20 msec frame. The 3 msec frame advancement step provides 85% overlap between frames and also provides smoothness across note changes. Thus, a frame advancement step of 3 msec and a frame size of 20 msec were adopted for the rest of our experiments.

2) Effect of Manual Segmentation in Training: The usefulness of bootstrapping a training model with manually segmented data was investigated by comparing it with a flat start approach. To segment data manually, a human listener has to mark the start and the end of each note in a humming piece. In the
flat start procedure, the first training iteration uses uniformly segmented data. Training and testing were done with a leave-one-out strategy that uses the data from all but one speaker for training and the remaining speaker's humming data for testing; that is, seven subjects were used for training and the remaining subject for testing. A 20-msec frame size and a 3-msec frame advancement step were used in the experiment, and several training iterations were run for both approaches. The Viterbi algorithm was used for decoding. Two performance measures are used: the correct recognition rate, which accounts for note deletion and substitution errors, and the accuracy rate, which also includes insertions. The results show similar convergence rates, with both methods converging in about 10 iterations, and the resulting recognition rates are also similar. In fact, models trained with the flat start provide slightly better recognition rates, probably because of the lack of consistency in the manually segmented notes: the errors between the exact note boundaries and the manually segmented boundaries vary considerably for a given humming piece, which affects the accuracy of the note models generated by the bootstrap procedure. Thus, it is concluded that handcrafted segmentation, which is a labor-intensive process, is not needed for note model training.

3) Weighting Selection of Models: Preliminary experiments showed a large number of insertions in the decoded results and, consequently, a poor accuracy rate. One way to deal with insertions is to penalize unnecessary transitions out of one note and into the next. The penalization is formulated as a re-calculated decoding likelihood S · x − P, where x is the original decoded likelihood, S is a scale factor, and P is the penalty. Table II gives results for different note insertion penalties. As shown, manipulating the note insertion penalty improves the insertion rate while maintaining a high correct recognition rate; a penalty of P = −30 gives the best results in this experiment. Another way to improve performance is to add a background noise model. One reason for the numerous insertion errors is background noise that causes spurious segments, so it is reasonable to use the noise model and the note models together during segment decoding. As shown in Table III, the use of the noise model improves the note recognition rate even without changing the penalty.

C. Experimental Results

The off-line recognition task was performed on the first prototype, the second prototype, and the proposed system. Real-time recognition experiments were performed on the second prototype
and the proposed system. Comparisons between the first labeling convention (using the first note as the reference) and the second labeling convention (using the previous note as the reference) are given in the results of the first prototype. The off-line recognition task was conducted with the leave-one-out method (trained on seven speakers and tested on one). For real-time recognition, eight speakers' humming data were used for training, and humming pieces from two additional participants, who were not in the humming database, were tested in real time. The results are given in Table IV. The first column of Table IV gives the number of pitch intervals modeled in each prototype. The first labeling convention of the first prototype had 15 pitch interval models (D12, D10, D8, D7, D5, D3, D1, R, U2, U4, U5, U7, U9, U11, U12), while the second labeling convention of prototype 1 had only 5 pitch interval models (D2, D1, R, U1, U2). For prototype 1 in Table IV, using the previous note as the reference gave better recognition results than using the first note as the reference. This can be explained by two factors. First, while humming, people are more sensitive to the pitch change between two consecutive notes than to the change with respect to the first note. Second, for the given size of the humming database, the second labeling convention had more training data per interval than the first. As a result, no further experiments with the first labeling convention were performed in the later systems, i.e., the second prototype and the proposed system. Both the second prototype and the proposed system had 25 pitch interval models covering D12, D11, ..., R, U1, ..., U12.

The results of real-time recognition are also listed in Table IV. Although each person's hummed renditions of the same melody vary, the phone-level HMMs improved the AR from 75.63% to 81.25%, as seen by comparing the second prototype with the proposed system. We make the following observations for the second prototype. At the note segmentation stage, the AR could be as high as 90%. However, the AR was calculated based on the number of notes in a test humming piece rather than on the actual segmented units. Consider a three-note humming piece that is decoded into three segmented units, where the first two come from the first note and the last one spans the second and third notes. In this case, the AR after note segmentation is 100%. However, when the pitch detection process is applied to the wrongly segmented units, it is difficult to correct the error made in the earlier stage: the first two detected pitches come from the first note, and the pitch of the last unit is obtained from the second and third notes. This was the major error encountered in our experiment, and it lowered the AR significantly after the pitch detection stage. Therefore, the new phone-level HMMs and the new silence model were designed to minimize such errors. The improvement of the AR obtained by reducing insertion errors
was achieved using the new phone-level HMMs and the new silence model. The last row of Table IV shows the improvement obtained using the bigram-based music language model. In the experiment, 17 short melodies with a total of 516 notes were used to build the bigram models. All melodies were converted into interval melody contours, which were then used to train the bigram models. Different sizes of the training melody set were experimented with. The grammar scale factor β, the bigram discount D, the threshold t, and the unigram floor count were set to 1.25, 0.5, 0, and 1, respectively.
VI. CONCLUSION AND FUTURE WORK

A new statistical approach to speaker-independent humming recognition was proposed in this work. Phone-level hidden Markov models were used to better characterize humming notes, and a robust silence (or "Rest") model was created to handle spurious note segments caused by background noise and other distortions. The features used in note modeling were extracted directly from the data, and the pitch features extracted from the humming used the previous note as the reference. A bigram-based music language model was also investigated. Preliminary experimental results showed that our approach is promising and merits further refinement.

A number of issues remain to be investigated as future extensions. First, the role of inter-note context should be further studied through context-dependent modeling of melody and tempo contours. An error made by the note decoder could be corrected by a tempo context model before the note is passed to the pitch detector, and the potential of a bigram-based music language model for melody contours has already been shown in the experimental results. Second, a more comprehensive database of human humming from a larger set of subjects should be gathered to enable detailed modeling and evaluation of the recognition performance. This is currently being carried out, and details of the new humming collection system can be found at [20]. Note also that, during the collection of the humming database, one subject's humming was deemed highly inaccurate by informal listening and was excluded from our experiments; the melodies hummed by this subject could not be recognized as the intended melodies by most listeners. Such subjects are often referred to as tone deaf, and most people cannot recognize melodies hummed by them. An automatic procedure is therefore needed to exclude this type of input when building the humming database. Third, it is worthwhile to consider an integrated note segmentation and pitch detection algorithm. Finally, the music retrieval performance based on the output of the humming recognizer should be investigated, especially from the viewpoint of recognition
errors. A long-term goal is to optimize the performance of the humming recognizer so as to maximize the overall retrieval accuracy.
ACKNOWLEDGEMENTS

The research has been funded in part by the Integrated Media Systems Center, a National Science Foundation Engineering Research Center, Cooperative Agreement No. EEC-9529152, in part by the National Science Foundation Information Technology Research Grant NSF ITR 53-4533-2720, and in part by ALi Microelectronics Corp. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect those of the National Science Foundation or ALi Microelectronics Corp.
REFERENCES

[1] A. Ghias, J. Logan, D. Chamberlin, and B. C. Smith, "Query by Humming: Musical information retrieval in an audio database," in Proceedings of the ACM Multimedia Conference 1995, San Francisco, California, November 1995.
[2] Jyh-Shing Roger Jang, Hong-Ru Lee, and Ming-Yang Kao, "Content-based music retrieval using linear scaling and branch-and-bound tree search," in Proceedings of the IEEE International Conference on Multimedia and Expo 2001 (ICME 2001), August 2001, pp. 405-408.
[3] R. J. McNab, L. A. Smith, and I. H. Witten, "Signal processing for melody transcription," in Proceedings of the 19th Australasian Computer Science Conference, 1996.
[4] R. J. McNab, L. A. Smith, I. H. Witten, C. L. Henderson, and S. J. Cunningham, "Towards the digital music library: Tune retrieval from acoustic input," in Digital Libraries Conference, 1996.
[5] R. J. McNab, L. A. Smith, I. H. Witten, and C. L. Henderson, "Tune retrieval in the multimedia library," Multimedia Tools and Applications, 2000.
[6] B. Gold and L. Rabiner, "Parallel processing techniques for estimating pitch periods of speech in the time domain," Journal of the Acoustical Society of America, vol. 46, pp. 442-448, 1969.
[7] Steffen Pauws, "CubyHum: A fully operational query by humming system," in International Symposium on Music Information Retrieval (ISMIR 2002), 2002, pp. 187-196.
[8] P. H. Sellers, "The theory and computation of evolutionary distances: Pattern recognition," Journal of Algorithms, vol. 1, pp. 359-373, 1980.
[9] D. J. Levitin, "Absolute memory for music pitch: Evidence from the production of learned melodies," Perception and Psychophysics, vol. 54, pp. 414-423, 1994.
[10] Christopher Raphael, "Automatic segmentation of acoustic musical signals using hidden Markov models," IEEE Transactions on Pattern Analysis and Machine Intelligence, pp. 360-370, 1999.
[11] Christopher Raphael, "Automated rhythm transcription," in International Symposium on Music Information Retrieval (ISMIR 2001), 2001.
[12] Adriane Swalm Durey and Mark A. Clements, "Melody spotting using hidden Markov models," in International Symposium on Music Information Retrieval (ISMIR 2001), 2001, pp. 109-117.
[13] Adriane Swalm Durey and Mark A. Clements, "Features for melody spotting using hidden Markov models," in Proceedings of the 2002 International Conference on Acoustics, Speech, and Signal Processing (ICASSP 2002), 2002.
[14] H.-H. Shih, S. S. Narayanan, and C.-C. J. Kuo, "An HMM-based approach to humming transcription," in Proceedings of the IEEE International Conference on Multimedia and Expo 2002 (ICME 2002), August 2002.
[15] H.-H. Shih, S. S. Narayanan, and C.-C. J. Kuo, "Multidimensional humming transcription using a statistical approach for query by humming systems," in Proceedings of the 2003 International Conference on Acoustics, Speech, and Signal Processing (ICASSP 2003), 2003.
[16] "Hidden Markov Model Toolkit," URL: http://htk.eng.cam.ac.uk/.
[17] Beth Logan, "Mel frequency cepstral coefficients for music modeling," in International Symposium on Music Information Retrieval (ISMIR 2000), 2000.
[18] A. J. Viterbi, "Error bounds for convolutional codes and an asymptotically optimum decoding algorithm," IEEE Transactions on Information Theory, vol. IT-13, pp. 260-267, 1967.
[19] “Hidden Markov Model Toolkit Editor: HTKEdit,” URL: http://sail.usc.edu/music/. [20] “USC Query by Humming project homepage,” URL: http://sail.usc.edu/music/.
TABLE I
CLOSED-SET CORRECT RECOGNITION RATES (CRR %) FOR VARIOUS FRAME SIZE/ADVANCEMENT STEP VALUES. (NOTE THAT PITCH INFORMATION WAS DISCARDED IN THIS EXPERIMENT, SINCE ITS ONLY CONCERN WAS CORRECTLY SEGMENTING NOTES.)
Frame size / advancement step (msec)   15/3     15/2     20/3     30/10
Female 1                               98.75    98.75    99.38    94.69
Female 2                              100      100      100      100
Female 3                              100      100       99.69    98.75
Female 4                               94.69    94.69    94.69    11.25
Male 1                                 63.75    99.69    99.38    99.69
Male 2                                100      100      100       96.56
Male 3                                 99.69   100      100       99.38
Male 4                                100      100      100       99.69
Male 5                                 85.62    99.06    98.75    95.31
Overall                                98.75    98.75    99.38    94.69
TABLE II
THE IMPACT OF THE PENALTY PARAMETER.
Penalty    CRR %    AR %
0          100      85.63
-10        100      89.38
-20        100      94.06
-30        100      96.88
-40        99.38    97.19
-50        98.13    97.19
-60        95.63    95.63
-70        94.06    94.06
-90        89.69    89.69
-100       87.81    87.81
TABLE III
THE CHOICE OF THE PENALTY PARAMETER IN CONJUNCTION WITH A NOISE MODEL FOR NOTE SEGMENTATION.
Penalty    CRR %    AR %
-30        95.00    95.00
-10        98.13    96.88
-1         93.13    95.63
0          98.13    95.31
4          99.69    96.56
5          100      95.94
8          100      94.38
TABLE IV
EXPERIMENTAL RESULTS OF THE FIRST PROTOTYPE, THE SECOND PROTOTYPE, THE PROPOSED SYSTEM, AND THE PROPOSED SYSTEM WITH THE MUSIC LANGUAGE MODEL (MLM), WHERE AR DENOTES THE ACCURACY RATE.

Prototype   Number of modeled intervals   First labeling, off-line AR %   Second labeling, off-line AR %   Second labeling, real-time AR %
1st         15/5                          51.18                           78.53                            N/A
2nd         25                            N/A                             75.63                            72.32
Latest      25                            N/A                             81.25                            80.45
MLM         25                            N/A                             86.88                            N/A
[Figure 1 shows blocks for the query humming, humming-to-music-contour conversion, the resulting music contours, the query engine, database organization, and the music database, grouped into audio signal processing, music database retrieval, and music database organization.]
Fig. 1. The functional block diagram of the Query by Humming system.
[Figure 2 shows the humming waveform passing through a decoder, driven by HMMs of notes obtained by HMM training on the humming database, to produce decoded note symbols.]
Fig. 2. The functional block diagram of the first prototype.
[Figure 3 shows the humming waveform entering the note decoder, which uses phone-level note models (obtained via HMM definition, feature selection, and HMM training on the humming database) and duration models to produce note segments; the pitch detector then uses pitch models (obtained via pitch analysis and pitch feature selection) to produce the humming transcription.]
Fig. 3. The functional block diagram of the proposed humming recognition system based on the multidimensional humming transcription scheme.
Fig. 4. The music scores of (a) the forward (ascending) and backward (descending) C major scales and (b) "2 Little Tigers".
Fig. 5. Illustration of the music scores and their labels using the first labeling convention for the ascending (left) and descending (right) C major scales.
Fig. 6. Music scores and labels using the second labeling convention for the ascending (left) and descending (right) C major scales.

Fig. 7. The log energy (log amplitude vs. time in seconds) of a humming piece using "da".
Fig. 8. The new silence model with a one-state short pause "sp" model tied to the center state (i.e., state 2).
Fig. 9. The note-level 3-state left-to-right hidden Markov model (entry state S(0), emitting states S(1)-S(3) with self-loops and observation densities b_i(o_t), and exit state S(4)).
Fig. 10. The Gaussian models of pitch intervals from D2 to U2 (probability density function vs. interval in semitones).
[Figure 11 lists the numerical music score of "Happy birthday to you" together with the corresponding rows of duration changes and pitch changes.]
Fig. 11. The music score, the numerical music score, and the humming transcription of the song "Happy birthday to you".
[Figure 12 shows a duration N-gram applied at the note decoder, a pitch N-gram applied at the pitch detector, and a combined pitch-and-duration N-gram applied to the resulting humming transcription.]
Fig. 12. The places where N-gram models can be applied in the proposed humming recognizer.
Fig. 13. A screenshot of HTKEdit with the humming recognizer and the music score editor.
Fig. 14. A screenshot of HummingDecoder with the music score editor.