Hierarchical Filtering Method for Content-based Music Retrieval via Acoustic Input
Jyh-Shing Roger Jang, Hong-Ru Lee
Multimedia Information Retrieval Laboratory, Computer Science Department, National Tsing Hua University, Taiwan
[email protected],
[email protected]
applications of karaoke or digital music libraries) based on the search of singers' names or keywords of titles/lyrics can be totally avoided.
ABSTRACT This paper presents an implementation of a content-based music retrieval system that can take a user's acoustic input (an 8-second clip of singing or humming) via a microphone and then retrieve the intended song from a database containing over 3000 candidate songs. The system, known as Super MBox, demonstrates the feasibility of real-time music retrieval with a high success rate. Super MBox first takes the user's acoustic input from a microphone and converts it into a pitch vector. Then a hierarchical filtering method (HFM) is used to first filter out 80% of the unlikely candidates and then compare the query input with the remaining 20% of the candidates in a detailed manner. The output of Super MBox is a ranked song list according to the computed similarity scores. A brief mathematical analysis of the two-step HFM is given in the paper to explain how to derive the optimum parameters of the comparison engine. The proposed HFM and its analysis framework can be directly applied to other multimedia information retrieval systems. We have tested Super MBox extensively and found that the top-20 success rate is over 85%, based on a dataset of about 2000 singing/humming clips from people with mediocre singing skills. Our studies demonstrate the feasibility of using Super MBox as a prototype for music search engines over the Internet and/or query engines in digital music libraries.
The operation of Super MBox can be divided into three steps: preprocessing, on-line processing, and postprocessing. For the preprocessing stage, we need to read each candidate song in the database and put the pitch/beat information into several indexed files for quick access during comparison. Usually the songs in the database are in MIDI (Music Instrument Digital Interface) format, which contains all the music elements of a song and is equivalent to the sheet music of the song. The pitch/beat information of a song is extracted from the MIDI's major track (or vocal track), which is defined as the track that, when played alone, can be identified immediately by a human who is familiar with the song's melody. The pitch/beat information of each candidate song is then converted into a pitch sequence where each pitch point corresponds to 1/16 second. This form of pitch sequence, called the middle format, is suitable for comparison with the pitch sequence extracted from the user's acoustic input. (Alternatively, we can employ a pitch determination algorithm to obtain the pitch sequence of each candidate song from its monophonic representation, such as the solo singing of a person or the playing of a single music instrument.) During the on-line processing step, the user can specify a query by singing or humming a piece of tune (that contains several notes for identifying a song) to a PC microphone directly. Several time-domain-based methods, such as autocorrelation or the average magnitude difference function, are used to find the most likely pitch sequence. At the same time, some heuristics (such as minimum signal intensity, pitch continuity, and the pitch range of human singing) are employed to eliminate unwanted/false pitch points which might result from either unvoiced segments of the acoustic input or undesirable effects of pitch doubling/halving.
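As a minimal sketch of the middle-format conversion described above (the function and variable names are ours, and we assume the notes have already been extracted from the vocal track as (semitone, duration-in-seconds) pairs):

```python
# Hypothetical sketch: expand (semitone, duration) notes from a vocal track
# into the "middle format": one pitch point per 1/16 second.
def notes_to_middle_format(notes, resolution=1.0 / 16):
    pitch_seq = []
    for semitone, duration in notes:
        n_points = int(round(duration / resolution))
        pitch_seq.extend([semitone] * n_points)  # hold the pitch for the note's duration
    return pitch_seq

# A 0.5-second A4 (semitone 69) followed by a 0.25-second B4 (semitone 71)
# yields 8 + 4 = 12 pitch points.
seq = notes_to_middle_format([(69, 0.5), (71, 0.25)])
```

The same fixed-resolution representation is produced for the query side, so the two pitch sequences can be compared directly.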
Keywords Content-based Music Retrieval, Audio Indexing and Retrieval, Nearest Neighbor Search, Pattern Recognition, Dynamic Programming, Dynamic Time Warping, Audio Signal Processing, Query by Singing
1. INTRODUCTION This paper presents a content-based music retrieval system called Super MBox, which allows the user to retrieve an intended song via her/his acoustic input. That is, the system’s intelligent comparison engine allows the user to retrieve songs based on a few notes sung or hummed naturally to the microphone. As a result, traditional ways of song retrieval (particularly in the
At the postprocessing step, the computed pitch sequence, together with related timing information, is transformed into the same middle format that was used in encoding the pitch/beat information of songs in the database. To account for key transposition, the middle format must be transformed into a representation that does not rely on the absolute values of the
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. MM'01, Sept. 30-Oct. 5, 2001, Ottawa, Canada. Copyright 2001 ACM 1-58113-394-4/01/0009...$5.00
* This is an on-going project supported by Cweb Inc. (http://www.4music.com.tw) and the Excellency Project sponsored by the National Science Council of Taiwan, to which the authors express many thanks.
identified pitch sequence. As a common practice in the literature, the transformation is usually accomplished by applying the difference operator at the note level [7][19][20][21][22]. We have tried the difference operator on the pitch sequence and found that it tends to amplify noise and degrade the system's performance. Thus we adopt a mean-adjustment method that shifts the input pitch sequence to have the same average value as that of each song in the database before invoking the comparison procedure. Moreover, we can apply other heuristics to find a better pitch shift amount for a better match in the comparison procedure. To account for tempo variation, the comparison procedure is based on dynamic time warping. The system then returns a ranked song list according to the computed similarity scores.
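The mean-adjustment step described above can be sketched as follows (a minimal illustration; the function name and the list-based pitch representation are ours, not the paper's):

```python
def mean_adjust(test, reference):
    # Shift the input (test) pitch sequence so that its average equals the
    # reference song's average, compensating for key transposition.
    offset = sum(reference) / len(reference) - sum(test) / len(test)
    return [p + offset for p in test]
```

For example, `mean_adjust([60, 62, 64], [65, 67, 69])` shifts every pitch point up by 5 semitones, yielding `[65.0, 67.0, 69.0]`.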
R. J. McNab et al. [19][20][21][22], in collaboration with the New Zealand Digital Library, have published several papers on their experiments with query by singing. They applied the Gold-Rabiner algorithm [8] for pitch tracking, and the pitch vector was then cut into notes based on energy levels and transition amounts. There were about 9400 songs in their database, and they were the first to put their system on the web. Their system, though lacking a performance evaluation in terms of recognition rate, is still considered an excellent example of content-based music retrieval for real-world applications. Both of the music retrieval systems published by Ghias et al. and McNab et al. are designed to extract music notes from the identified pitch vectors, and then comparison procedures based on the extracted notes are invoked to compute similarity scores. However, note extraction (or segmentation) is error prone due to the continuity of pitch vectors obtained from human acoustic input. Kosugi et al. [14][15][16] proposed the use of both tone transition and tone distribution to improve the performance of CBMR systems. Their system, called SoundCompass, has more than 10,000 songs in the database with a retrieval time of 1 second. However, to use their system, users are required to hum to the beats of a metronome; this is inconvenient for, and not always possible for, common users.
At the preprocessing step, we need human interaction to identify the vocal track for extracting the pitch/beat information from a song in MIDI format. Once the vocal track is specified, the pitch/beat transcription and the similarity score computation are totally automatic. With about 3000 songs in the database, the average response time of our system is about 10 seconds on a Pentium III 800 when the user specifies a query from the beginning of a song. The performance is satisfactory, as long as the users can sing or hum the intended song with more or less correct pitch.
The authors have also published their work [3][11][17] on content-based music retrieval. The focus of those publications (in particular the last one) is on the use of dynamic programming techniques for elastic matching in the comparison engine.
However, as the number of songs increases, or as a shorter response time is prescribed, we need a methodology to speed up the whole retrieval procedure. In this paper, we propose a hierarchical filtering method (HFM) that first filters out 80% of the unlikely candidates in a quick and dirty manner, and then compares the input pitch sequence with the remaining 20% of the candidates in a detailed manner. We shall also develop a probability model for HFM that gives us a guideline on how to select the optimum parameters of the comparison procedure in order to minimize the performance reduction due to limited response time.
A more comprehensive list of related work on content-based music retrieval can be found in the proceedings of the first International Symposium on Music Information Retrieval 2000 [10], held at Plymouth, Massachusetts. Most of the papers published at the symposium can be found at the web site:
http://ciir.cs.umass.edu/music2000/
2. RELATED WORK
3. BASIC OPERATIONS OF Super MBox
Content-based multimedia information retrieval is a popular and useful research topic, but most research is focused on the media of text, image and video [4]. Even though there are several publications on audio retrieval [5], research on music retrieval based on acoustic input is not common. However, as digital music spreads all over the Internet, requests for search/retrieval of digital music files are becoming more and more important.
In this section we shall explain the basic operations of Super MBox. In particular, we will focus on the use of dynamic time warping to deal with tempo variations, and key shifting procedure to take care of key transposition, both of which play a vital role in the comparison engine.
3.1 Input Collection
Ghias et al. [7] published one of the early papers on query by singing. They applied autocorrelation to obtain the fundamental frequency, and the pitch vector was then cut into notes. To accommodate the problem of different starting keys, the obtained notes were converted into a ternary contour of three characters: U (up, meaning this note is higher than the previous one), R (repeat, meaning this note is the same as the previous one), D (down, meaning this note is lower than the previous one). The comparison engine, based on the longest common subsequence, is then applied to the ternary contours to find the most likely song. However, due to the limited computing power at that time, their pitch tracking took 20-45 seconds, and there were only 183 songs in the database.
The acoustic input (which could be a clip of singing, humming, or the playing of a single music instrument) is recorded directly from a PC microphone with a duration of 8 seconds, a sample rate of 11025 Hz, 8-bit resolution and a single channel (mono). The system can also read the acoustic input from a monophonic WAV file directly. Before further processing, the acoustic input has to go through a low-pass filter at 1047 Hz in order to cut down unwanted high-frequency components which are not likely to be present in human singing.
3.2 Pitch Tracking Once a clean acoustic input of 8 seconds is obtained, we need to do pitch tracking in order to identify the pitch frequency with respect to time. The acoustic input is first put into frames of 512
points, with 340 points of overlap; this corresponds to 1/64 second for each pitch point. Then every 4 pitch points are averaged and merged into a single frequency, so the final pitch sequence is in the middle format with a time scale of 1/16 second.
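The averaging step above can be sketched as follows. This is a simplified illustration that assumes a per-frame pitch estimator has already produced one frequency estimate (in Hz) per 1/64-second frame; the result is also converted to MIDI-style semitones (69 = A4, 440 Hz):

```python
import math

def to_middle_format(frame_freqs):
    """frame_freqs: one pitch estimate (Hz) per 1/64-second frame."""
    # MIDI-style semitone: 69 corresponds to A4 (440 Hz).
    semitones = [69 + 12 * math.log2(f / 440.0) for f in frame_freqs]
    # Average every 4 consecutive pitch points -> one point per 1/16 second.
    return [sum(semitones[i:i + 4]) / 4
            for i in range(0, len(semitones) - 3, 4)]
```

Averaging in the semitone (log-frequency) domain keeps the representation consistent with the middle format used for the database songs.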
better performance. The disadvantage is that more computing power is allocated for processing more information.
3.4 Dynamic Time Warping To account for the tempo variation of each individual user, we can simply use dynamic time warping (DTW) [23] to compute the warping distance between the input pitch vector and that of each song in the database. Suppose that the input pitch vector (or test vector) is represented by t(i), i = 1, …, m, and the reference
There are plenty of algorithms in the literature for pitch tracking or pitch determination [1][8][9]. For our system, we have tried pitch tracking using two methods: 1. Autocorrelation function [23] 2. Average magnitude difference function (AMDF) [23] Both methods are performed in the time domain and have comparable performance. For software implementation on a current Pentium-based PC, both methods are fast enough for real-time pitch tracking. However, when considering chip/hardware implementation, AMDF is more favorable since it does not require multiplication operations.
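A minimal AMDF-based pitch estimator for a single frame might look like the following sketch (the parameter names and search range are our assumptions; the real system also applies the heuristics described earlier):

```python
def amdf_pitch(frame, fs=11025, f_lo=80, f_hi=1000):
    """Estimate the fundamental frequency (Hz) of one frame via AMDF."""
    best_lag, best_score = None, float("inf")
    # Search lags corresponding to the plausible singing range f_lo..f_hi.
    for lag in range(int(fs / f_hi), int(fs / f_lo) + 1):
        # Average magnitude difference: no multiplications needed.
        score = sum(abs(frame[i] - frame[i + lag])
                    for i in range(len(frame) - lag)) / (len(frame) - lag)
        if score < best_score:
            best_lag, best_score = lag, score
    return fs / best_lag
```

Note that, like autocorrelation, plain AMDF is prone to octave (pitch-halving) errors at lag multiples of the true period; the smoothing heuristics of Section 3.3 help suppress these.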
pitch vector (reference vector) by r(j), j = 1, …, n. These two vectors are not necessarily of the same size, and we can apply dynamic time warping to match each point of the test vector to that of the reference vector in an optimal way. That is, we want to construct an m × n DTW table D(i, j) according to the following forward dynamic programming algorithm:
After obtaining the pitch frequencies, we use the following formula to transform them into the semitone representation:

semitone = 69 + 12 × log2(frequency / 440)

The semitone representation used here is equivalent to the one used in MIDI format, where "69" represents A4 (A440, 440 Hz). Subsequent smoothing and comparison procedures are based on the semitones thus calculated.

Optimal value function: D(i, j) is the minimum distance accumulated from the left-most side (i = 1) of the DTW table to the current position (i, j).

Recurrence relation:
D(i, j) = d(i, j) + min { D(i−2, j−1), D(i−1, j−1), D(i−1, j−2) }
3.3 Pitch Smoothing
where d(i, j) is the node cost associated with t(i) and r(j), and can be defined as
The pitch sequence (or vector) obtained in the previous step may not be smooth enough due to the following reasons:
d(i, j) = | t(i) − r(j) | − η
1. Unvoiced segments and random noises may cause unreasonably high pitch. 2. A strong second harmonic can produce a pitch-doubling effect. To eliminate these undesirable pitch points, the system has a simple end-point detection scheme based on energy levels. If the energy level is lower than a threshold, the corresponding pitch semitones are set to zero. If the identified pitch semitones are higher than 84 (C6, or 1046.5 Hz in frequency) or lower than 40 (E2, or 82.4 Hz in frequency), they are also set to zero. Moreover, to ensure overall smoothness, we put the pitch sequence through a median filter of order 5.
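The range gating and median filtering just described can be sketched as follows (the energy-based end-point detection is omitted, and the function name is ours):

```python
def smooth_pitch(semitones, lo=40, hi=84, order=5):
    """Zero out pitches outside the singing range, then median-filter."""
    gated = [s if lo <= s <= hi else 0 for s in semitones]
    half = order // 2
    out = []
    for i in range(len(gated)):
        # Median of the window centered at i (truncated at the edges).
        window = sorted(gated[max(0, i - half):i + half + 1])
        out.append(window[len(window) // 2])
    return out
```

An isolated pitch-doubled point such as the 120 in `[60, 60, 60, 120, 60, 60, 60]` is first gated to zero and then removed entirely by the median filter.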
In the preceding formula, η is a small number whose function is to bias the search toward a longer matched path instead of a shorter one. Moreover, the recurrence relation shows that the optimal path exists only when the test input is within half to twice the size of the reference vector. This constraint is reasonable since it is rather difficult to sing/hum a song faster than twice or slower than half the speed of the intended song. Boundary conditions: The boundary conditions for the above recurrence can be expressed as
As opposed to some previous work [7][19][20][21][22] reported in the literature, we do not perform note segmentation. The reasons for not performing note segmentation are two-fold:
D(i, 1) = ∞, i = 2, …, m;  D(1, j) = | t(1) − r(j) |, j = 1, …, n.
1. Note segmentation is an error-prone process, and even a human expert cannot guarantee recovering the correct notes from an acoustic input.
The first equation ensures that the optimal DTW path never starts from the middle of the test vector. The second equation indicates that the optimal DTW path can start from anywhere in the middle of the reference vector. These conditions are reasonable since we would like to match the whole test vector to a part of the reference vector.
2. The users are not required to sing “ta” or “da” to facilitate note segmentation. Hence the users can sing naturally and continuously.
The cost of the optimal DTW path is defined as
As a result, we are using the whole pitch sequence in the comparison engine based on dynamic time warping, as described in the next section. Since the whole pitch sequence is used, all information can be kept for the comparison procedure to achieve
min_j D(m, j).
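The optimal value function, recurrence, boundary conditions, and final cost above can be sketched as follows (a hedged illustration: the table is stored 0-based, the default value of η is our assumption, and silence handling is omitted):

```python
def dtw_distance(t, r, eta=0.1):
    """DTW cost matching all of t to a part of r, per the text's recurrence."""
    m, n = len(t), len(r)
    INF = float("inf")
    D = [[INF] * n for _ in range(m)]
    for j in range(n):                    # D(1, j) = |t(1) - r(j)|: start anywhere in r
        D[0][j] = abs(t[0] - r[j])
    for i in range(1, m):
        for j in range(n):
            best = INF
            # Local path constraint: (i-2, j-1), (i-1, j-1), (i-1, j-2).
            for di, dj in ((2, 1), (1, 1), (1, 2)):
                if i - di >= 0 and j - dj >= 0 and D[i - di][j - dj] < best:
                    best = D[i - di][j - dj]
            if best < INF:
                D[i][j] = abs(t[i] - r[j]) - eta + best
    return min(D[m - 1])                  # min_j D(m, j): end anywhere in r
```

For a query `[1, 2, 3]` embedded exactly inside a reference `[5, 1, 2, 3, 9]`, the cost with η = 0 is zero, reflecting a perfect partial match.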
approach that shifts the entire input pitch vector to a suitable position that generates the minimum DTW distance. This key shifting procedure can be described as follows:
After finding the minimizing j, we can backtrack to obtain the entire optimal DTW path. A typical path in the DTW table is shown in the following plot:
1. Set initial parameters and make t and r zero mean:
[Figure: a typical optimal path in the DTW table]
span = 4,  center = 0,  t = t − mean(t),  r = r − mean(r)
(The last two equations make both t and r zero mean.)
2. Compute the DTW distances:
s_{-1} = dtw(r, t − span),  s_0 = dtw(r, t),  s_1 = dtw(r, t + span)
In the above plot, the optimal DTW path can start from any point on the left boundary and stop at any point on the right boundary. This implies that the user can sing/hum from anywhere in the middle of a song and DTW can still identify the intended song from the database. Due to the half-twice speed constraint, the optimal path will always stay within the upper and lower dashed lines.
3. Find the minimum DTW distance and update center: If s_{-1} = min{s_{-1}, s_0, s_1}, then center = center − span;
In practice, m = 16×8 = 128 since we usually have 8 seconds of recording and each pitch point corresponds to 1/16 second. (In fact, if we ignore the rest parts of the acoustic input, the value of m will be smaller.) For this study, we assume the user always sings/hums from the beginning of a song; therefore the DTW path always starts from the lower-left corner of the DTW table. The corresponding boundary conditions of this "match beginning" case are:
else if s_1 = min{s_{-1}, s_0, s_1}, then center = center + span.

4. Update span and check the stopping condition: If span > 2, set span = span/2 and go to step 2; otherwise stop the iteration.

In the most complicated setting, the key shifting operation involves 5 DTW distance computations. In other words, comparing an acoustic input with the 3000 songs in the database involves computing 15000 DTW tables, each of size around 128 × 179.
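The four steps above can be sketched as follows, assuming a `dtw(r, t)` distance function is available and writing "t + c" for shifting every pitch point of t by c semitones (a sketch of the search strategy, not the exact implementation):

```python
def key_shift_search(r, t, dtw, span=4):
    """Iteratively search for the pitch-shift amount minimizing dtw(r, t)."""
    mean = lambda v: sum(v) / len(v)
    t = [p - mean(t) for p in t]           # step 1: make both vectors zero mean
    r = [p - mean(r) for p in r]
    shift = lambda vec, c: [p + c for p in vec]
    center, best = 0.0, dtw(r, t)          # s_0 at the initial center
    while True:
        s_lo = dtw(r, shift(t, center - span))   # s_{-1}
        s_hi = dtw(r, shift(t, center + span))   # s_{+1}
        if s_lo < best:                    # step 3: move toward the smaller distance
            center, best = center - span, s_lo
        elif s_hi < best:
            center, best = center + span, s_hi
        if span > 2:                       # step 4: halve the span and repeat
            span = span / 2
        else:
            return center, best
```

With span starting at 4 and s_0 reused across iterations, this performs 1 + 2 + 2 = 5 DTW distance computations, matching the count stated above.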
D(i, 1) = ∞, i = 2, …, m;  D(1, j) = ∞, j = 2, …, n;  D(1, 1) = | t(1) − r(1) |.
To deal with the problem of key transposition, most related work applied the difference operator at the note level before invoking the comparison procedure [7][19][20][21][22]. However, the difference operator amplifies noise and did not lead to acceptable performance in our simulation.
Furthermore, we assume that the reference vector cannot be longer than 1.4 times the size of the test vector, hence n = 16×8×1.4 ≈ 179. Thus we need to construct a DTW table of size up to 128 × 179. The size of the DTW table could be smaller since we usually discard silence (or rests) in both the test and reference semitones. On the other hand, if we want the match to start from anywhere in the middle of a song, then for a typical song of 4 minutes, n = 16×60×4 = 3840. Hence the DTW table is
4. SPEEDUP MECHANISM: HIERARCHICAL FILTERING METHOD
huge, with size 128 × 3840. It is still doable; the only problem is that it takes a lot of computation time. Therefore, throughout this paper we shall assume all comparisons start from the beginning (or a designated position) of a song.
The primary goal of Super MBox is to find the songs in the database that are closest to the query input in terms of dynamic time warping distance. This is a typical problem of nearest neighbor search (NNS) and has been widely studied in the pattern recognition literature. A more formal definition of NNS can be
3.5 Key Shifting Besides constructing the DTW table for computing each similarity score, we still need to deal with the problem of key transposition, which is caused by the key variation of each individual user. For instance, a female singer usually has a higher key than a male singer. To deal with the problem, we first shift the test vector to have the same mean value as that of the reference vector. Moreover, to guarantee better performance, we can optionally take a heuristic
stated as follows. Given a set of n objects S = {o_1, o_2, …, o_n} and a query input q, find the object o_k in S such that dist(q, o_k) is the smallest among all dist(q, o_i), i = 1, …, n, i ≠ k, where dist(·,·) is a distance
The most detailed comparison procedure in Super MBox is summarized as follows:
function defined for any two objects. In symbols, k can be expressed by the following formula:
1. Query input of 8 seconds. 2. Pitch vector of length 128 (or equivalently, a pitch interval of 1/16 second). 3. Key shifting of 5 times. If we change any of the above parameters (for instance, from 8 seconds of recording to 5 seconds, from a 128-element pitch vector to 64 elements, or from 5 times of key shifting to 1), we will have a simpler comparison procedure. Now the problem becomes a constrained optimization problem: how do we arrange the parameters to achieve the optimal success rate while keeping the response time acceptable?
k = argmin_i dist(q, o_i)
where dist(q, o_i) is the distance between the query input q and object o_i in set S. Most speedup mechanisms [2][24] for NNS assume that the objects in S can be expressed as vectors in a high-dimensional space and that the distance function is simply the Euclidean distance. However, in Super MBox, neither the query input nor the candidate songs in the database can be viewed as fixed-length vectors in a high-dimensional space. Some other speedup mechanisms for NNS only require that the distance function in the metric space satisfy the triangle inequality: dist(a, b) + dist(b, c)
Suppose that in the first step of HFM, we employ a comparison procedure based only on the first d (0 ≤ d ≤ 8) seconds of recording. The computation time of the comparison procedure should be proportional to the length of recording; therefore it is reasonable to express the computation time (or response time) as follows:
≥ dist(a, c),
where a, b and c are three arbitrary objects in the metric space, and dist(·,·) is the distance function defined on the space. Representative approaches based on this assumption include the vantage-point tree search [6][26] and the branch-and-bound tree search [3]. Unfortunately, DTW as a distance function does not satisfy the triangle inequality.
time(d) = knd, where k is a proportionality constant and n is the number of candidate songs in the database. Moreover, we assume that the performance of the comparison procedure is denoted by a probability function perf(d, p), representing the probability
As a result, we cannot rely on conventional speedup mechanisms for the NNS problem, and we need to find alternatives that can speed up the comparison procedure in Super MBox. In particular, we adopt an intuitive hierarchical filtering method (HFM) that hierarchically filters out unlikely candidates in a quick and dirty manner. The concept of a basic two-step HFM can be summarized as follows:
that the correct song will appear in the top 100p% of the resulting ranked list when the length of the query input is d seconds. In the following discussion, we shall refer to d as the duration, p as the survival rate, and perf(d, p) as the corresponding success rate. Usually n (the number of candidate
1. Use a crude but fast comparison procedure CP_1 to compare the query input with all candidates and then filter out 80% of the unlikely candidates.
songs in the database) is large, and we assume perf(d, p) is independent of n. For instance, the top-20 success rate of a d-second query input, or equivalently, the probability of the intended song being ranked in the top 20 of the ranked list, can be denoted as
2. Use a detailed comparison procedure CP_2 to compare the query input with the remaining 20% of the candidates to obtain the final ranked result. As can be seen immediately, the above two-step HFM can be generalized into an ℓ-step HFM with the use of progressively
perf(d, 20/n).
complicated comparison procedures CP_i, i = 1, …, ℓ, and the
In the second step of HFM, we employ the query input of the whole duration, that is, 8 seconds, to compare with the remaining pn candidates. Hence the total computation time can be
corresponding survival rate (20% in the previous case) can be heuristically adjusted accordingly.
expressed as knd + 8knp, and the overall top-20 success rate
For the convenience of discussion, we shall assume the use of a two-step HFM. The basic concept of HFM is to filter out unlikely candidates quickly while keeping the potential nearest neighbor in the remaining set for detailed comparison. This approach cuts down the comparison time, hopefully without sacrificing performance too much. In the following, we shall carry out a mathematical analysis of the two-step HFM, with a view to seeking the optimum parameters for HFM that achieve a balance between the computation time and the performance of Super MBox.
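As a concrete sketch of the two-step HFM, assuming `crude` and `detailed` are two comparison procedures returning distances (e.g., DTW on a truncated query versus the full query; the function names and song-list representation are ours):

```python
def two_step_hfm(query, songs, crude, detailed, survival_rate=0.2):
    """songs: list of (name, pitch_vector) pairs; smaller distance = better."""
    # Step 1: rank all songs with the crude procedure, keep the top 100p%.
    rough = sorted(songs, key=lambda s: crude(query, s[1]))
    keep = max(1, int(len(songs) * survival_rate))
    survivors = rough[:keep]
    # Step 2: re-rank only the survivors with the detailed procedure.
    return sorted(survivors, key=lambda s: detailed(query, s[1]))
```

The survival rate (here 0.2, i.e., 20% of the candidates) is the parameter p analyzed below; the design trades a small risk of dropping the true song in step 1 for a large reduction in detailed comparisons.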
can be expressed as perf(d, p) · perf(8, 20/(pn)). In this expression of the overall top-20 success rate, perf(d, p) denotes the probability that the intended song of the d-second input query will appear in the top 100p% of the ranked list, and perf(8, 20/(pn))
Before employing HFM, we need to identify a set of progressively complicated comparison procedures CP_i, i = 1, …, ℓ. The term perf(8, 20/(pn)) denotes the probability that the 8-second input query will be ranked in the top 20 of the remaining pn songs in the second step. In other words, the product denotes the
probability that the intended song of the input query will appear in the top 20 of the resulting ranked list.
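To make the trade-off concrete, consider hypothetical values plugged into the expressions above (k is normalized to 1, and the perf values are assumed for illustration, not measured):

```python
# Hypothetical values: n = 3000 songs, k normalized to 1,
# step-1 duration d = 2 s, survival rate p = 0.2.
n, k = 3000, 1.0
d, p = 2.0, 0.2

full_cost = 8 * k * n                  # exhaustive 8-second comparison: 24000
hfm_cost = k * n * d + 8 * k * n * p   # step 1 + step 2: 6000 + 4800 = 10800
speedup = full_cost / hfm_cost         # roughly 2.2x faster

# Assumed (not measured) success rates: perf(2, 0.2) = 0.95 and
# perf(8, 20/(p*n)) = 0.90; the overall top-20 success rate is the product.
overall = 0.95 * 0.90                  # 0.855
```

Under these assumptions, HFM more than halves the total computation time at the cost of a modest drop in the overall success rate, which is exactly the balance the constrained optimization seeks.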