2012 IEEE International Conference on Multimedia and Expo Workshops

Query by Humming by Using Locality Sensitive Hashing based on Combination of Pitch and Note

Qiang Wang¹, Zhiyuan Guo¹, Gang Liu¹, Jun Guo¹, Yueming Lu²
¹ Pattern Recognition and Intelligent System Laboratory
² Key Laboratory of Trustworthy Distributed Computing and Service, Ministry of Education
¹,² Beijing University of Posts and Telecommunications, Beijing, China
E-mail: [email protected]

Abstract—Query by humming (QBH) is a technique used for content-based music information retrieval. It remains a challenging problem because of humming errors. In this paper a novel retrieval method called note-based locality sensitive hashing (NLSH) is presented and combined with pitch-based locality sensitive hashing (PLSH) to screen candidate fragments. The method extracts PLSH and NLSH vectors from the database to construct two indexes. In the retrieval phase, it extracts vectors from the query in the same way as during index construction and searches the indexes to obtain a list of candidates. Recursive alignment (RA) is then executed on these surviving candidates. Experiments are conducted on a database of 5,000 MIDI files with the 2010 MIREX-QBH query corpus. The results show that with the combination approach the relative improvements in mean reciprocal rank are 29.7% (humming from anywhere) and 23.8% (humming from the beginning), respectively, compared with the current state-of-the-art method.

Keywords—query by humming; locality sensitive hashing; recursive alignment; music information retrieval

I. INTRODUCTION

In recent years, a significant amount of research has focused on music database retrieval. Traditional music information retrieval systems require users to enter a title or artist name, and the need for content-based retrieval that can return results even when the user does not know the name of the song or artist has increased. Query by humming (QBH) systems address this need by allowing the user to simply hum a few seconds of melody to find the desired song. Content-based music information retrieval has developed rapidly along two lines: query-by-example [1] and QBH. Most QBH research uses melody information for retrieval. Ghias et al. [2] wrote one of the early QBH papers, encoding a pitch contour simply by recording the melodic direction pattern (up, down, repeat) and discarding duration. Later, melody was represented by pitch differences. Currently the most widely used representation is the pitch sequence shifted to zero mean, which is robust to key transposition.

Jang [3] proposed a QBH system using dynamic time warping (DTW) over frame-based pitch contours, which accommodates natural humming for better performance. He also introduced linear scaling (LS) and hidden Markov models (HMM) for melody matching [4][5]. With LS, the query pitch sequence is linearly scaled several times and compared to the music melody. DTW is a more effective approach to melody recognition in QBH, but it is time-consuming; the key insight of HMM matching is the same as that of DTW, as both are based on dynamic programming. This previous research used bottom-up methods. The first category of top-down methods was studied by Wu, who presented the recursive alignment (RA) algorithm and demonstrated its superiority experimentally [6]; RA pays more attention to long-distance information in human singing. In the second category of top-down methods, a QBH system based on the earth mover's distance (EMD) was proposed by Shen [7]; EMD calculates the minimum cost between the humming and music pieces under weight changes. Lyrics information can also be used in QBH systems [8][9]. Most relevant to this work, Ryynänen extracted pitch vectors using a fixed-length time window and used a pitch-based locality sensitive hashing (PLSH) method for matching [10].

In a QBH system, the uncertainty and fluctuation of queries caused by humming pose a great challenge. Pitch and note are two typical features used in QBH systems: pitch is the initial feature, extracted frame by frame from the query clip, while note is the post-processed feature. Most QBH systems use only one of these features. This paper presents a novel method called note-based locality sensitive hashing (NLSH) and combines it with pitch-based locality sensitive hashing (PLSH) to screen candidate fragments. NLSH-based retrieval ignores the duration of notes and captures the continuity of note values; it is robust to non-constant tempo variation and makes up for the deficiency of PLSH [10]. The combination of PLSH and NLSH (PNLSH) exploits the continuity of both pitch and note, selecting more accurate candidates and effectively increasing the recall rate. Finally, the paper uses recursive alignment (RA) [6] to sort the candidates and return the results. RA optimizes melody alignment in a top-down fashion, using LS as a subroutine, and is a better distance measure for melody matching.

The structure of this paper is as follows: the framework of the QBH system is described in Section II, Sections III and IV present the retrieval algorithms, Section V gives the experimental results and Section VI concludes.


II. STRUCTURE OF THE PROPOSED QBH SYSTEM

The procedure of the proposed QBH system is shown in Fig. 1. It contains two parts: i) the indexing of the music database and ii) the retrieval of humming clips, described as follows:
1. Indexing of the MIDI music database (off-line): for every MIDI file in the database, extract the notes, convert the notes to one-dimensional pitch sequences, and build the database indexes of pitch-based locality sensitive hashing (PLSH) and note-based locality sensitive hashing (NLSH).
2. Retrieval from the music database (on-line):
Step 1: extract pitches from the humming clip and post-process the pitches into notes.
Step 2: search for candidates by using the PLSH and NLSH indexes.
Step 3: calculate the similarity of every candidate with RA and sort the candidates to output the results.
The key idea of the proposed QBH system is as follows. First, fixed-dimensional vectors are extracted from pieces of 2-3 seconds in the music database and used to build the locality sensitive hashing (LSH) indexes. Second, vectors of the same dimension are extracted from the humming clip, and the LSH indexes are used to search for neighboring 2-3 second pieces. Third, these pieces are expanded to candidate fragments whose length is proportional to that of the humming clip. Finally, all the candidates are sorted by the RA algorithm. A hypothetical sketch of this flow is given below.
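The paper reports a C++ implementation but gives no code, so the following is a hypothetical C++ skeleton of the two-phase flow just described; all type and function names (QbhSystem, screenWithLsh, rankWithRa, and so on) are our own illustrative assumptions, not the authors' API.

#include <string>
#include <vector>

struct Fragment {
  std::string title;      // song the fragment comes from
  double startTime = 0;   // offset of the fragment within the song (seconds)
  double raDistance = 0;  // filled in by the ranking step
};

struct QbhSystem {
  // Off-line: extract notes/pitches from every MIDI file and build the
  // PLSH and NLSH indexes (Section III).
  void buildIndexes(const std::vector<std::string>& midiFiles) { (void)midiFiles; }

  // On-line: extract features from the hummed clip, screen candidates via
  // both indexes, then rank the survivors by recursive alignment (Section IV).
  std::vector<Fragment> query(const std::vector<double>& pitchContour) {
    std::vector<Fragment> candidates = screenWithLsh(pitchContour);
    rankWithRa(pitchContour, candidates);
    return candidates;
  }

 private:
  std::vector<Fragment> screenWithLsh(const std::vector<double>&) { return {}; }
  void rankWithRa(const std::vector<double>&, std::vector<Fragment>&) {}
};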

Figure 1. The framework of the proposed QBH system. The left part is off-line database processing, including melody extraction and PLSH/NLSH indexing. The right part is on-line retrieval, including pitch extraction, the conversion from pitch to note, candidate screening with LSH, RA calculation and the output of results.

III. OFF-LINE PROCESSING OF THE MIDI DATABASE

A. Extraction of MIDI Theme

The music database is composed of MIDI files. The melody in a MIDI file is a note sequence in the format (pi, ti), in which pi is the note value and ti is the duration. According to the frame shift used in the retrieval stage, the notes are converted to (pi, di), in which di is the number of continuous frames. Since we will use the PLSH index, all two-dimensional notes are converted to one-dimensional pitch sequences as follows:

(p1, d1), ..., (pi, di) → (p1 ... p1, p2 ..., pi ... pi, ...)   (1)

where each note value pi is repeated di times, once per frame.

B. LSH Indexing

1) PLSH Indexing: The problem of high-dimensional nearest neighbor retrieval arises in a large variety of database applications, and index-based locality sensitive hashing (LSH) is among the most effective methods [11][12]. By using LSH in a QBH system we can quickly screen candidate fragments and filter out unlikely songs. The PLSH points are extracted from MIDI files as follows. First we choose a window of fixed duration and extract a high-dimensional pitch vector at a fixed interval within the window from the pitch sequence (p1, p2, ..., pn) of the MIDI file. The window is then shifted step by step along the sequence and the next vector is extracted correspondingly. For example, if the frame shift is 40 milliseconds (25 pitches per second), the extracting interval is 2 (i.e., the sequence is down-sampled by a factor of 3), the window size is 2.4 seconds (so the PLSH points are 20-dimensional vectors) and the window shift is 0.6 seconds (15 pitches), then the PLSH vectors extracted from the original pitch sequence (p1, p2, ..., pn) are (p1, p4, ..., p58), (p16, p19, ..., p73), (p31, p34, ..., p88), and so on. Fig. 2 gives an example of PLSH vector extraction. In addition, we build another index that records the location of each vector in the original song, so that neighbor vectors found during retrieval can be expanded to candidate fragments: its key is the label of a PLSH vector and its value is a pair composed of the song title and the starting time of the first pitch in the vector.

Figure 2. An example of PLSH vector extraction. The frame shift is 0.04 seconds, the vector length is 20, the window size is 2.4 seconds and the window shift is 0.6 seconds. The left panel shows the pitch sequence of the MIDI music with an extracting interval of 2 and the right panel shows the final extracted PLSH vectors.
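To make the windowing concrete, here is a minimal sketch of the extraction loop under the example settings above (40 ms frames, a 2.4-second window of 60 frames, down-sampling by 3 to 20 dimensions, a 0.6-second shift of 15 frames). The zero-mean step anticipates the key normalization described in Section IV; the function name is illustrative.

#include <cstddef>
#include <vector>

std::vector<std::vector<double>> extractPlshVectors(
    const std::vector<double>& pitch,  // frame-level pitch sequence
    std::size_t window = 60, std::size_t step = 3, std::size_t shift = 15) {
  std::vector<std::vector<double>> vectors;
  for (std::size_t start = 0; start + window <= pitch.size(); start += shift) {
    std::vector<double> v;
    for (std::size_t i = 0; i < window; i += step)
      v.push_back(pitch[start + i]);        // p1, p4, ..., p58, then p16, ...
    double mean = 0;
    for (double x : v) mean += x;
    mean /= static_cast<double>(v.size());
    for (double& x : v) x -= mean;          // shift to zero mean: key invariance
    vectors.push_back(v);
  }
  return vectors;
}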

2) NLSH Indexing: It is difficult to hum or sing accurate pitches with precise durations, especially the durations. Users are likely to hum a little longer on a short note and a little shorter on a long note. In Fig. 3, the first and third circled parts are hummed shorter and the second part is hummed longer. Retrieving songs by note avoids the leakage caused by this nonlinear variation of note duration between humming clips and MIDI music, preserving the candidates that PLSH misses.

Figure 3. An example of humming errors in note duration. The left panel shows the pitch sequence of the MIDI file and the right panel shows the corresponding pitches extracted from the humming clip.

We therefore ignore the note durations and extract NLSH points as follows.

Taking into account only the note value, a fixed-dimensional note vector is extracted continuously from the note sequence (p1, d1), ..., (pi, di), .... If the dimension is 10 and the vector shift is 1, the NLSH vectors are (p1, p2, ..., p10), (p2, p3, ..., p11), and so on. Considering the errors of note segmentation in the pitch post-processing phase, such as two or more notes merged into one or one note cut into several, we extract NLSH vectors with a more robust method that tolerates segmentation mistakes. All long notes are split into multiple notes as follows:

(pi, di) ⇒ (pi, l), ..., (pi, l), (pi, di − r·l)   (2)

where the pair (pi, l) is repeated r times, r = ⌊di / l⌋, and l is the largest number of continuous frames allowed for one note. If di is bigger than l, the note is cut into more notes by (2); otherwise it remains intact. Then we extract the NLSH vectors and build the NLSH index. Fig. 4 shows an example of NLSH vector extraction. As with PLSH, we build another index to record the position of each NLSH vector in the original songs.

Figure 4. An example of NLSH vector extraction. The dimension is 10, the shift of vectors is 1 and the largest number of continuous frames l is 10. The left panel shows the original note sequence and the right panel shows the extracted NLSH vectors.
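A sketch of Eq. (2) followed by the sliding-window extraction, under the example settings (l = 10, dimension 10, vector shift 1); the names are illustrative, and in the full system the note values would also be shifted to zero mean as described in Section IV.

#include <cstddef>
#include <utility>
#include <vector>

using Note = std::pair<int, int>;  // (note value pi, frame count di)

// Eq. (2): any note longer than l frames is split into r = floor(di / l)
// pieces of length l plus a remainder, so NLSH vectors tolerate long notes
// that the singer or the segmenter cuts differently. A zero-length
// remainder (di an exact multiple of l) is simply dropped, our assumption.
std::vector<Note> splitLongNotes(const std::vector<Note>& notes, int l = 10) {
  std::vector<Note> out;
  for (auto [p, d] : notes) {
    int r = d / l;
    for (int k = 0; k < r; ++k) out.push_back({p, l});
    if (d - r * l > 0) out.push_back({p, d - r * l});
  }
  return out;
}

// NLSH vectors: durations are then discarded and a 10-dimensional window of
// consecutive note values is slid one note at a time.
std::vector<std::vector<int>> extractNlshVectors(const std::vector<Note>& notes,
                                                 std::size_t dim = 10) {
  std::vector<std::vector<int>> out;
  for (std::size_t i = 0; i + dim <= notes.size(); ++i) {
    std::vector<int> v;
    for (std::size_t k = 0; k < dim; ++k) v.push_back(notes[i + k].first);
    out.push_back(v);
  }
  return out;
}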

IV. ON-LINE RETRIEVAL

A. Feature Extraction

1) Pitch Extraction from Query Clips: The feature extracted for QBH is the pitch sequence. Many pitch tracking methods are described in the literature, such as autocorrelation and average magnitude difference. In this paper simplified inverse filter tracking [13] is used for pitch tracking, with a frame shift of 40 milliseconds. The pitch contour is then smoothed by a median filter of size 5, and the following equation transforms the pitches to the semitone representation consistent with the notes in MIDI files:

Semitone = log2(Pitch / 440) × 12 + 69   (3)
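A minimal sketch of this front end after the pitch tracker (the tracker itself [13] is outside the scope of the sketch), assuming frame pitches in Hz: a size-5 median filter followed by Eq. (3).

#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Eq. (3): map frequency in Hz to the MIDI semitone scale (A4 = 440 Hz = 69).
double hzToSemitone(double hz) {
  return std::log2(hz / 440.0) * 12.0 + 69.0;
}

// Median filter of size 5; frame edges are left unchanged in this sketch.
std::vector<double> medianFilter5(const std::vector<double>& x) {
  std::vector<double> y = x;
  for (std::size_t i = 2; i + 2 < x.size(); ++i) {
    double w[5] = {x[i - 2], x[i - 1], x[i], x[i + 1], x[i + 2]};
    std::sort(w, w + 5);
    y[i] = w[2];  // middle of the sorted 5-frame window
  }
  return y;
}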

2) Note Segmentation: To search songs by note we need to segment the pitches into notes. We segment notes by combining evidence from both the time and frequency domains [14]. The notes are used for NLSH-based retrieval. NLSH points are robust to mistakes in long-note segmentation: if a segmented note is too long, we split it into several notes. If, however, a long note is wrongly cut into many short ones whose numbers of duration frames are much smaller than l in (2), errors will occur in the NLSH point extraction phase, so short notes should be merged. If a note is short and the value difference between adjacent notes is small enough, it is merged with its neighbor. Fig. 5 shows an example of note mergence; the circled notes are merged to become more consistent with the correct notes.

Figure 5. An example of note mergence. The left panel shows the note sequence of the MIDI music, the middle panel shows the corresponding notes extracted from the humming clip and the right panel shows the merging result.

Since different people hum in different keys, key variation is handled by mean subtraction: all pitch sequences and note sequences extracted from both the music database and the humming clips, as well as the PLSH and NLSH points, are normalized to zero mean.

B. Retrieval of Candidate Fragments based on LSH

1) PLSH-based Retrieval: In the retrieval phase, we search for neighbors by using the LSH index and expand them to candidate fragments [10]. First, the pitches extracted from the humming clip are processed into PLSH vectors with the same method used for PLSH indexing. To account for tempo variation, the window size is scaled by factors from 0.6 to 1.6, so more pitch vectors are extracted and used to search for neighbors in the PLSH index. It is sufficient to keep the 2 or 3 nearest points for each vector extracted from the query clip. On the basis of their positions in the MIDI files, the window size and the length of the humming clip, all the neighbor points are expanded to candidate fragments in the format (p1, p2, ..., pn).

2) NLSH-based Retrieval: This paper presents a method of NLSH-based retrieval. In the retrieval stage, NLSH points are extracted from the query notes with the same method used for NLSH indexing. Neighbor vectors are then found in the NLSH index and expanded to candidate fragments. Prior to NLSH vector extraction, long notes are cut into multiple notes by (2). In this way the NLSH points are robust to cut-up mistakes on long notes, and they are also robust to errors in which short notes are merged into a long note. Tempo variation would normally require adjusting the note durations; since we use only the note values instead of the durations, we instead vary the largest number of continuous frames l in (2) over the range [0.6l, 1.4l] to obtain 5 sets of NLSH vectors.

3) Combination of PLSH and NLSH (PNLSH) based Retrieval: PLSH-based retrieval captures the continuity of pitch, but if the tempo is not constant, dimensional dislocation may occur in the PLSH points and some candidates are inevitably missed. NLSH-based retrieval ignores duration and captures the continuity of the note values; non-constant tempo variation does not affect the note values, so it makes up for the deficiency of PLSH. However, note segmentation mistakes may still cause candidates to be missed. This paper therefore combines PLSH and NLSH (PNLSH) to take advantage of both. First, the PLSH and NLSH indexes are built. Second, PLSH and NLSH vectors are extracted from the query. Third, neighbors are found through both indexes and expanded to candidate fragments. The candidate set is the union of the candidates screened by the two indexes, as sketched below. The combination makes use of the continuity of both pitch and note, increasing the recall rate and improving performance.
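The PNLSH combination step itself reduces to a set union over the fragments returned by the two indexes; a sketch, in which the Candidate structure and the de-duplication by (title, start time) are our assumptions:

#include <set>
#include <string>
#include <tuple>
#include <vector>

struct Candidate {
  std::string title;
  double startTime;  // where the fragment begins in the song (seconds)
  bool operator<(const Candidate& o) const {
    return std::tie(title, startTime) < std::tie(o.title, o.startTime);
  }
};

// A fragment missed by one feature can still survive via the other.
std::vector<Candidate> combineCandidates(const std::vector<Candidate>& fromPlsh,
                                         const std::vector<Candidate>& fromNlsh) {
  std::set<Candidate> merged(fromPlsh.begin(), fromPlsh.end());
  merged.insert(fromNlsh.begin(), fromNlsh.end());
  return {merged.begin(), merged.end()};
}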

C. Ranking Algorithm

Recursive alignment (RA) is employed to calculate the distance between the query clip and the candidate fragments, which are the expanded sequences of LSH neighbor points [6]. The algorithm divides the query and the music candidate into two parts each and looks for the division with the smallest joint distance; the cut-off point with the smallest joint distance is kept, and the process is then applied recursively to the two parts independently (see [6] for details). The distance evaluation is performed for every candidate fragment and the fragments with the smallest distances are output. RA optimizes the melody alignment problem in a top-down fashion, which is more capable of capturing long-distance information in human singing; it can be viewed as a heuristic search with a global pruning strategy. It is a satisfactory distance measure for melody matching and is used as the final ranking method. In the experiments the recursion depth is set to 3 and the division ratios are 1:2, 5:7, 1:1, 7:5 and 2:1.
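A minimal sketch of the recursive alignment idea under stated assumptions: the base case linearly rescales the query segment onto the candidate segment (the LS subroutine) and averages absolute pitch differences, and both sequences are cut at each of the five division ratios above; Wu's original algorithm [6] may differ in these details.

#include <algorithm>
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

// Base case: linear scaling (LS). The query segment is linearly rescaled
// onto the candidate segment and the mean absolute pitch difference is
// returned; an assumed, simplified base distance.
double lsDistance(const std::vector<double>& q, const std::vector<double>& c) {
  if (q.empty() || c.empty()) return 0.0;
  double sum = 0.0;
  for (std::size_t i = 0; i < c.size(); ++i) {
    std::size_t j = i * q.size() / c.size();  // linearly map c's index into q
    sum += std::fabs(q[j] - c[i]);
  }
  return sum / static_cast<double>(c.size());
}

std::vector<double> slice(const std::vector<double>& v, std::size_t a,
                          std::size_t b) {
  return {v.begin() + a, v.begin() + b};
}

// Try every pair of division ratios on the two sequences, keep the cut with
// the smallest joint distance, and recurse on the halves down to depth 3.
double raDistance(const std::vector<double>& q, const std::vector<double>& c,
                  int depth = 3) {
  if (depth == 0 || q.size() < 4 || c.size() < 4) return lsDistance(q, c);
  const double ratios[] = {1.0 / 3, 5.0 / 12, 0.5, 7.0 / 12, 2.0 / 3};
  double best = std::numeric_limits<double>::max();
  for (double rq : ratios) {
    std::size_t cq = static_cast<std::size_t>(rq * q.size());
    for (double rc : ratios) {
      std::size_t cc = static_cast<std::size_t>(rc * c.size());
      double d = (raDistance(slice(q, 0, cq), slice(c, 0, cc), depth - 1) +
                  raDistance(slice(q, cq, q.size()), slice(c, cc, c.size()),
                             depth - 1)) / 2.0;  // average the two halves
      best = std::min(best, d);
    }
  }
  return best;
}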

V. EXPERIMENTS

The evaluation measures are mean reciprocal rank (MRR) and the top-X hit rate. MRR is calculated from the reciprocal of the rank of the first correctly identified cover for each query. The top-X hit rate is the probability that the correct MIDI file is included in the X highest-ranked candidates among all the MIDI files.

A. Experimental Data

In our experiments, two corpora used in the MIREX (Music Information Retrieval Evaluation Exchange) 2010 QBH contest [15] are used to compare with other methods. One is the Think IT corpus provided by IOACAS, composed of 106 MIDI files and 355 sung queries; the queries were hummed from anywhere in the songs and their average duration is 12 seconds. The other is Jang's collection [15], consisting of 48 monophonic MIDI files and 4431 queries; all the queries were hummed from the beginning and last exactly 8 seconds. Since this study aims at a practical QBH system and people usually hum from anywhere in a melody, the experiments focus on the Think IT corpus. We add 5,000 noise MIDI files to obtain a large-scale music database.

B. Experiments for the Combination of PLSH and NLSH (PNLSH) on the Think IT Corpus

TABLE I. THE PERFORMANCE OF PLSH, NLSH AND PNLSH ON THE THINK IT CORPUS

Method    Top-1    Top-5    Top-10    Top-20    MRR
PLSH      35.2%    57.5%    64.2%     73.5%     0.445
NLSH      49.6%    72.1%    79.7%     83.5%     0.565
PNLSH     51.8%    74.1%    82.3%     88.7%     0.590

Table I shows the results of the PLSH, NLSH and PNLSH evaluation experiments. When only LSH is used, songs are sorted by the cumulative number of candidate fragments belonging to the same song, as sketched below. The performance of NLSH is better than that of PLSH because the candidates of NLSH are more precise, and fewer false candidates lead to better accuracy. The PNLSH combination achieves the best result owing to the complementary indexes. In this experiment the frame shift is 40 milliseconds, the NLSH points are 10-dimensional vectors and the largest number of continuous frames l is set to 10.

Fig. 6 shows the performance of PLSH+RA, NLSH+RA and PNLSH+RA, which use the PLSH index, the NLSH index or the PNLSH indexes to screen candidates and then sort the candidates by RA. The NLSH method emphasizes the variation of the note values instead of the durations, and it extracts fewer points from the query clip with fewer scaling factors, leading to fewer candidates and a lower recall rate (the recall rate is the percentage of queries for which the correct song is returned). So PLSH+RA is better than NLSH+RA. The combination of the PNLSH indexes reduces the leakage of candidates and achieves the best performance.
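For the LSH-only rows of Table I, the ranking reduces to counting how many retrieved fragments each song contributes; a sketch with illustrative names:

#include <algorithm>
#include <map>
#include <string>
#include <vector>

// Every retrieved candidate fragment casts one vote for the song it belongs
// to, and songs are sorted by vote count, highest first.
std::vector<std::string> rankByVotes(
    const std::vector<std::string>& candidateSongTitles) {
  std::map<std::string, int> votes;
  for (const std::string& title : candidateSongTitles) ++votes[title];

  std::vector<std::string> ranked;
  for (const auto& kv : votes) ranked.push_back(kv.first);
  std::sort(ranked.begin(), ranked.end(),
            [&votes](const std::string& a, const std::string& b) {
              return votes.at(a) > votes.at(b);  // more matching fragments first
            });
  return ranked;
}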


Figure 6. The performance of PLSH+RA, NLSH+RA and PNLSH+RA on Think IT corpus.



Figure 7. The results on Think IT corpus for different methods.



Figure 8. The results on Jang’s collection for different methods.


C. Experiments for Different Methods

We compared several approaches with the proposed PNLSH+RA on the two corpora. Figs. 7 and 8 give the experimental results on the Think IT corpus and Jang's collection, respectively. The proposed PNLSH+RA achieves the best accuracy, clearly better than the others. Compared with the baseline PLSH+RA method, which had achieved the state-of-the-art result, the relative improvement (RI) of MRR is as much as 29.7%, the top-1 RI is 23.6% and the top-5 RI is 30.6% on the Think IT corpus; on Jang's collection the relative improvement of MRR is 23.8%, the top-1 RI is 19.6% and the top-5 RI is 23.9%. With a C++ implementation running on a 2.13 GHz Xeon CPU, the mean retrieval time of the proposed PNLSH+RA method is 4.5 seconds on the Think IT corpus and 1.4 seconds on Jang's collection, of which PLSH+RA takes approximately 80% and NLSH+RA 20%. The combination method significantly improves the search accuracy at the cost of only about 25% more time.

VI. CONCLUSION

This paper presented a novel approach of note-based locality sensitive hashing (NLSH) for QBH systems and combined it with pitch-based locality sensitive hashing (PLSH) to screen candidate fragments. The distances between the candidates and the query clip are then calculated by RA and a ranked list of songs is returned. The proposed method is evaluated on the dataset used in MIREX 2010 for QBH. Experimental results show that the proposed method outperforms the others and achieves satisfactory results. Future work includes seeking more accurate melody matching algorithms and improving the retrieval speed.


ACKNOWLEDGMENT

This work was partially supported by the 111 Project under Grant No. B08004, a key project of the Ministry of Science and Technology of China under Grant No. 2012ZX03002019-002, the Innovation Fund of the Information and Communication Engineering School of BUPT in 2011, the High Technology Research and Development Program (863) of China under Grant No. 2011AA01A205, and the Next-Generation Broadband Wireless Mobile Communications Network Technology Key Project under Grant No. 2011ZX03002-005-01.


REFERENCES

[1] Q. Wang, G. Liu, Z. Guo, J. Guo, X. Chen, "Structural fingerprint based hierarchical filtering in song identification," Proc. IEEE International Conference on Multimedia & Expo (ICME), 2011.
[2] A. Ghias, J. Logan, D. Chamberlin, B.C. Smith, "Query by humming: musical information retrieval in an audio database," Proc. ACM Multimedia, 1995, pp. 231-236.
[3] J.R. Jang, M. Gao, "A query-by-singing system based on dynamic programming," Proc. Int. Workshop Intell. Syst. Resolutions, pp. 85-89, Dec. 2000.
[4] J.R. Jang, H. Lee, M. Kao, "Content-based music retrieval using linear scaling and branch-and-bound tree search," Proc. IEEE International Conference on Multimedia and Expo (ICME), August 2001.
[5] J.R. Jang, C. Hsu, H. Lee, "Continuous HMM and its enhancement for singing/humming query retrieval," Proc. ISMIR, 2005.
[6] X. Wu, M. Li, J. Liu, J. Yang, Y. Yan, "A top-down approach to melody match in pitch contour for query by humming," Proc. International Conference of Chinese Spoken Language Processing, 2006.
[7] S. Huang, L. Wang, S. Hu, H. Jiang, B. Xu, "Query by humming via multiscale transportation distance in random query occurrence context," Proc. IEEE International Conference on Multimedia & Expo (ICME), 2008.
[8] Z. Guo, Q. Wang, G. Liu, J. Guo, "A music retrieval system based on spoken lyric queries," International Journal of Advancements in Computing Technology, 2012, in press.
[9] Z. Guo, Q. Wang, G. Liu, J. Guo, Y. Lu, "A music retrieval system using melody and lyric," Proc. IEEE International Conference on Multimedia & Expo Workshops (ICMEW), 2012.
[10] M. Ryynänen, A. Klapuri, "Query by humming of MIDI and audio using locality sensitive hashing," Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2008, pp. 2249-2252.
[11] M. Datar, P. Indyk, N. Immorlica, V.S. Mirrokni, "Locality sensitive hashing scheme based on p-stable distributions," Proc. Annual Symposium on Computational Geometry (SOCG), June 2004.
[12] Q. Wang, Z. Guo, G. Liu, J. Guo, "Entropy based locality sensitive hashing," Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2012, pp. 1045-1048.
[13] J. Markel, "The SIFT algorithm for fundamental frequency estimation," IEEE Trans. Audio and Electroacoustics, vol. AU-20, no. 5, pp. 367-377, 1972.
[14] G. Velikic, E.L. Titlebaum, M.F. Bocko, "Musical note segmentation employing combined time and frequency analyses," Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2004.
[15] http://www.music-ir.org/mirex/wiki/2010:Query-by-Singing/Humming_Results
