Sub-Nyquist Audio Fingerprinting for Music Recognition

Kaichun K. Chang, Dept. of Computer Science, King’s College London, London, UK
Solon P. Pissis, Dept. of Computer Science, King’s College London, London, UK
Jyh-Shing R. Jang, Dept. of Computer Science, National Tsing Hua University, Hsinchu, Taiwan
Costas S. Iliopoulos, Dept. of Computer Science, King’s College London, London, UK & DEBII, Curtin University, Perth, Australia
Abstract—In recent years, Compressive Sampling (CS), a new research topic in signal processing, has piqued the interest of a wide range of researchers in different fields. In this paper, we present a sub-Nyquist Audio Fingerprinting (AF) system for music recognition, which utilizes CS theory to generate a compact audio fingerprint and to achieve a significant reduction of the dimensionality of the input signal. The presented experimental results demonstrate that by using the CS-based sub-Nyquist AF system, when downsampling to 30%, the average accuracy is 93.43% under various distorted environments, compared to Nyquist sampling methods. The advantages of the proposed process lie in its comparable performance under the sub-Nyquist sampling rate and its more compact audio fingerprint.
I. INTRODUCTION

It is well known that audio identification can be realized by different technologies, for example, fingerprinting, tagging, encryption, and watermarking. Compared to the other technologies, fingerprinting has several advantages. A brief summary of these realization technologies, along with their fundamental mechanisms, advantages and disadvantages, can be found in [1]. In this paper, we integrate CS technology into a sub-Nyquist AF system. In Section II, we demonstrate the capabilities of our approach by establishing a robust, compact and low-complexity AF system for identifying distorted audio material in large-scale music databases. In Section III, we provide some experimental results, which demonstrate the efficiency and the accuracy of the proposed system compared to the Nyquist AF system. Finally, we briefly conclude with some remarks on future work in Section IV.

A. Audio Fingerprinting

AF is an important technology for automatic music recognition. In general, an audio fingerprint can be defined as a compact representation of the audio content, which makes the recognition of unlabeled audio clips possible. Many applications of AF have been discussed, including identification, verification, and retrieval of audio materials, as well as music identification using cell phones, radio broadcast and peer-to-peer monitoring, audio connecting and file sharing [2].
Several practical requirements for realizing a successful AF system have been discussed [3]. First, reliable AF is desirable in automatic music recognition. The AF system should be capable of identifying distorted or corrupted audio clips in the case of degradations, i.e., the system should be very robust. Robust fingerprint extraction methods should be capable of dealing with severe degradations, such as audio compression, and the musical signal should be represented with a large signal-to-noise ratio. Second, the system should identify the clips in the database in a few seconds, i.e., it should be time efficient and compact in practice. The less sampled digital data we require, the less time is needed to process the data, and the more compact the data receiver devices can be. However, the sampling rate cannot be decreased indefinitely because of the Shannon sampling theorem: “Only if you sample densely enough (at the Nyquist rate), you can perfectly reconstruct the original analog data”. Finally, the system should be of low computational complexity, both in forming the fingerprints and in running the matching algorithm that finds the best matching item in the database.

The demands of various applications, and the difficulties in designing a compact and robust AF system, boost interest in this field, and some practical issues in the applications of AF have been studied (see [2]). An AF system consists of two phases: the preparation phase and the recognition phase. First, we construct the database by filling it with the audio fingerprints and the associated metadata of many audio clips. Then, the fingerprint of an unknown clip (or of a version of the clip distorted by compression or standard audio processing) is extracted and compared to those of the clips in the database. If the fingerprint of the unknown clip is in the database, it will be correctly identified by the matching procedure. The framework of the AF system is shown in Figure 1.
As mentioned above, requirements of a well designed AF system include high-recognition rate over a large number of items, robustness, and low time and computational complexity. An efficient AF system should be capable of identifying the item even when it is severely distorted by compression or other operations.
Fig. 1. Framework of the Audio Fingerprinting system.

B. Compressive Sampling

CS, a new framework for simultaneously sensing and compressing signals, has been developed recently. The basic concept in CS is that it is possible to exploit the sparsity present in signals to reduce the number of measurements needed for data acquisition, thus making low-rate sampling and high-resolution sensing possible. CS enables a large reduction in the sampling and computation cost at a sensor. Because fewer measurements are needed to characterize a signal, the storage and the transmission rate needed to handle the signal at the sensing point are also reduced. Candès and Wakin [4] showed that reconstructing a K-sparse signal x of length N requires M = O(K log N) measurements. However, when the signal is not sparse in the time domain, the process described above cannot be applied to it directly. In these circumstances, the signal can be represented in a basis Ψ obtained by some decomposition method, that is, x = Ψ · θ. Here, θ is the sparse representation of x in the Ψ transform domain. The measurement can be described as follows:

\[ y = \Phi \cdot x = \Phi \cdot \Psi \cdot \theta = A \cdot \theta \tag{1} \]

In Figure 2, y is the measurement, and A = Φ · Ψ is an M × N matrix, which is called the observation matrix.

Fig. 2. The linear measurement of Compressive Sampling.

The theory of CS was first proposed by Candès, Romberg, Tao and Donoho, who showed that a compressible signal can be precisely reconstructed from only a small set of random linear measurements (far fewer than the Nyquist rate requires), implying the potential for a dramatic reduction of sampling rates, power consumption, and computational complexity in digital data acquisition. Implementation of CS relies on two principles: sparsity of the signals of interest and incoherence of the sensing modality [4]. Nowadays, it is successfully used in signal processing, pattern recognition [5], wireless communication [6], [7], imaging [8], [9], channel estimation [10], face recognition [11], phonetic classification [12], sensor arrays [13], motion estimation [14], and music genre classification [15].

II. SUB-NYQUIST AUDIO FINGERPRINTING SYSTEM
A music object should be described by a set of its features, and one of the main challenges in music information retrieval is the choice of representation of the musical information within the system. There have been numerous approaches to deciding which features to retain and how to extract them. However, some traditional information, such as the artist and the title of the music object, is of very limited use in practical applications, since it cannot describe the music content itself. In order to obtain a more meaningful representation of the object, the metadata should contain semantic information about the music content. The first step in establishing our system is to convert the music object to a fixed file format and to divide it into frames in preprocessing, which eases the feature extraction process. In general, the size of the input data is reduced using downsampling and conversion to mono. Then, the content-based features are extracted and the feature vectors generated. Finally, a well-designed multiclass classifier can be used to classify the object using the feature vectors.

CS is a new field in signal processing, whose main idea is that a signal having a sparse representation can be captured from a small number of random linear projections onto a measurement basis [4]. After a random measurement process, the original signal can be reconstructed through a non-linear decoding scheme that uses the sparsity as a priori information for solving the inverse problem. Some dimension reduction algorithms have been proposed for AF, for example, Principal Component Analysis and the summation of the corresponding data between different frames. Experimental results show that these algorithms lower the identification accuracy slightly but reduce the search time considerably [16]. However, the signals are still sampled at a high rate to obtain a large amount of data, and dimension reduction methods are applied subsequently to discard much of the acquired data.
In other words, this kind of information processing approach is still inefficient. If we sample the received signal at a low rate directly, we not only save substantial time and hardware cost, but also make full use of the data. The traditional AF system uses a high-rate analog-to-digital converter to sample the analog signals, as prescribed by the Nyquist sampling theorem. We use CS theory to sample signals at a comparably low sampling rate. CS reduces the required sampling rate remarkably, without losing the information of the analog audio clips. The framework of the sub-Nyquist AF system proposed in this paper is shown in Figure 3.

Fig. 3. Framework of the proposed sub-Nyquist Audio Fingerprinting system.

Consider a real-valued signal x of length N that is K-sparse in a sparse basis matrix Ψ. Consider also an M × N measurement basis matrix Φ (here M ≪ N), where the rows of Φ are incoherent with the columns of Ψ. In matrix notation, we have x = Ψ · θ, in which θ can be approximated using only K (here K ≪ N) nonzero entries. CS theory states that such a signal x can be reconstructed by taking only M = O(K log N) linear nonadaptive measurements, which can be described as follows:

\[ y = \Phi \cdot x = A \cdot \theta \tag{2} \]

where y is an M × 1 sampled vector and A = Φ · Ψ is an M × N matrix. The reconstruction is equivalent to finding the signal’s sparse coefficient vector θ, which can be written as an ℓ0 optimization problem:

\[ \min \|\theta\|_0 \quad \text{s.t.} \quad y = \Phi \cdot x = A \cdot \theta \tag{3} \]
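The measurement model of Equation 2 and the recovery problem of Equation 3 can be illustrated with a small numerical sketch. Everything specific in it is an assumption made for illustration, not a detail taken from this paper: the DCT is used as the sparse basis Ψ, Φ is a Gaussian random matrix, the sizes N, M and K are arbitrary, and iterative hard thresholding (IHT) is used as one possible iterative thresholding algorithm that approximates the ℓ0 problem. The helper names `dct_basis`, `hard_threshold` and `iht` are hypothetical.

```python
import numpy as np

def dct_basis(n):
    """Return an n x n orthonormal matrix Psi whose columns are DCT-II atoms."""
    k = np.arange(n)[:, None]          # frequency index
    t = np.arange(n)[None, :]          # time index
    c = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * t + 1) * k / (2 * n))
    c[0, :] /= np.sqrt(2.0)            # scale the DC row so that c is orthonormal
    return c.T                         # synthesis form: x = Psi @ theta

def hard_threshold(v, k):
    """Keep the k largest-magnitude entries of v and zero the rest."""
    out = np.zeros_like(v)
    keep = np.argsort(np.abs(v))[-k:]
    out[keep] = v[keep]
    return out

def iht(y, a, k, n_iter=500):
    """Iterative hard thresholding for y = a @ theta with a k-sparse theta."""
    step = 1.0 / np.linalg.norm(a, 2) ** 2   # conservative step: 1 / (spectral norm)^2
    theta = np.zeros(a.shape[1])
    for _ in range(n_iter):
        theta = hard_threshold(theta + step * a.T @ (y - a @ theta), k)
    return theta

# Illustrative sizes (assumptions): M / N = 50% of the original samples.
rng = np.random.default_rng(0)
N, M, K = 256, 128, 3

psi = dct_basis(N)
theta_true = np.zeros(N)
support = rng.choice(N, size=K, replace=False)
theta_true[support] = rng.choice([-1.0, 1.0], size=K)
x = psi @ theta_true                              # signal that is K-sparse under Psi

phi = rng.standard_normal((M, N)) / np.sqrt(M)    # random measurement matrix Phi
y = phi @ x                                       # Equation 2: y = Phi x
A = phi @ psi                                     # observation matrix A = Phi Psi

theta_hat = iht(y, A, K)
x_hat = psi @ theta_hat                           # recovered high-rate signal
rel_err = np.linalg.norm(x_hat - x) / np.linalg.norm(x)
print(f"relative recovery error: {rel_err:.2e}")
```

The sketch uses a synthetic, exactly sparse signal so that recovery is easy to check. Real audio frames are only approximately sparse, so the recovery error would depend on the chosen basis and on the ratio M / N.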
From Equation 3, we can see that the implementation of compressive sampling of a signal can be divided into three steps: first, transform the signal x to some sparse domain to make it K-sparse under a certain dictionary; second, construct the observation matrix and apply it to the sparse transformation coefficients to obtain the measurements; third, apply an optimization algorithm to retrieve the sparse coefficients, and then recover the original signal x. Hence, there are three important things in realizing compressive sampling of a signal: finding a representation space in which the signal is sparse; determining a measurement matrix that is incoherent with the sparse basis; and using an efficient optimization algorithm to recover the signal from the measurements.

In Equation 4, let S(n) denote the original signal, S̃(n) the signal after the normalization, S_Max the maximum value of |S(n)|, and N the sampling rate.

\[ \tilde{S}(n) = \frac{S(n)}{S_{Max}} \cdot 10, \quad n = 1, 2, \ldots, N \tag{4} \]
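As a small illustration of Equation 4, the normalization can be sketched as follows. The function name `normalize_signal` and the handling of an all-zero input are assumptions made for this example, not details given in the paper.

```python
import numpy as np

def normalize_signal(s, scale=10.0):
    """Scale s so that its largest absolute value becomes `scale` (Equation 4)."""
    s = np.asarray(s, dtype=float)
    s_max = np.max(np.abs(s))          # S_Max in Equation 4
    if s_max == 0.0:
        return s.copy()                # assumption: an all-zero signal is left unchanged
    return s / s_max * scale

normalized = normalize_signal([1.0, -2.0, 0.5])
print(normalized)
```

After this step every clip has the same peak amplitude, so later energy-based features do not depend on the recording level.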
In our method, the sampled signals are first used to recover high-rate signals using the CS sequence. As soon as the high-rate sampling signals of the items in the database and of the unknown item are obtained, the classical AF schemes can be used. We address the audio fingerprinting recognition problem by finding a sparse representation of a given unknown audio clip, i.e., we represent the signal as a linear combination of the fewest training sampling signals. Some further processing, including the feature extraction of the items in the database, the feature extraction of the unknown clip, and the observation matrix, is then needed for constructing the audio fingerprint. For example, some characteristic features (or their combinations) for every time frame of the reconstructed high-rate sampling signal are tested for feature extraction, such as energy differences, MFCC, SFM and sound classes. Then, we exploit the CS paradigm to generate a compact fingerprint, and use it in the AF system. The AF features can produce a compact fingerprint by calculating a limited number of random projections from a time-frequency representation of the original audio clip.

In Equation 5, let k denote the index of a frame, S_k(n) the energy of a sample point in the frame k, N_p the number of the sample points in the frame, E_k the average value of the energy in the k-th frame, and N_f the number of the frames in the signal.

\[ E_k = \frac{\sum_{n=1}^{N_p} |S_k(n)|}{N_p}, \quad k = 0, 1, \ldots, N_f \tag{5} \]

In short, this is a sub-Nyquist sampling of analog audio signals, with the sampling rate being lower than the Shannon sampling rate. The obtained signal retains the information of a high-rate sampling signal, thus making AF possible. Its main advantage lies in its low-rate sampling, but its performance will not exceed the performance of high-rate sampling.

III. EXPERIMENTAL DATA

The experiments are divided into three parts. Section III-A deals with low-rate sampling. Section III-B deals with the bit error rate (BER) of the proposed system in comparison to the BER of the Nyquist AF system [2]. Section III-C compares the BER of the two systems under distorted environments.

A. Low-Rate Sampling

Significant reduction of the dimensionality of the input signal helps establish a compact and computationally efficient system. Our work was carried out by simulation, based on selected features and CS theory. First, we investigated the performance under different sampling rates. Using an audio clip sampled at 44.1 kHz with a duration of 3.3353 s as an example, the recovered signal and the squared errors are shown in Figure 4(a). Figure 4(b) shows the recovery error when downsampling to 50%, 30% and 10%.
From Figures 4(a) and (b), we can observe that when we sample the musical data at a sampling rate of more than 30% of the Nyquist sampling rate, we can recover the signal with tolerable error.

Fig. 4. Illustration of the recovered signal and error. (a) The recovered signal and the squared errors when downsampling to 30%. (b) The recovered error when downsampling to 50%, 30% and 10%.

B. Bit Error Rate

The proposed AF scheme is a CS-based sub-Nyquist AF system followed by a traditional feature extractor. The total number of audio clips in the database is 17,208, and the number of query clips is 100. These audio clips are chosen from the categories of rock, pop, country, classical, and jazz. Figures 5(a) and (b) give the low-rate sampling signals and the recovered high-rate sampling signals, respectively. From the Figures, we can see a remarkable reduction of the sampling rate.

Fig. 5. Illustration of the low-rate and the recovered high-rate sampling signals. (a) Low-rate sampling signals. (b) The recovered high-rate sampling signals.

Then, we compared the fingerprints of the 3.3353 second clip obtained at the Nyquist and the CS-based sampling rates. The results are shown in Figure 6. The raw sound data are doubles ranging from -1 to 1, representing 16-bit samples at the 44,100 Hz sampling rate. Figure 6(a) shows the vector fpData containing the sub-fingerprints (unsigned 32-bit integers), and Figure 6(b) shows the fpRank matrix, which describes the reliability of each bit in each sub-fingerprint. In this experiment, fpRank is a 2D matrix in which each row fpRank(i,1:32) is a set of 32 numbers ranking the bits from least to most reliable, i.e., fpRank(i,1) is the least reliable bit of the i-th sub-fingerprint, while fpRank(i,32) is the most reliable.

Fig. 6. Illustration of the fingerprints using the Nyquist sampling rate. (a) The fpData vector. (b) The fpRank matrix.

When using the CS-based sampling rate (with the sampling rate being 50%), the fpData and fpRank are shown in Figures 7(a) and (b), respectively. From the Figures, we observe that although only half of the sampling rate is used, the obtained sub-fingerprints differ very little.

Fig. 7. Illustration of the fingerprints using the CS-based sampling rate. (a) The fpData vector. (b) The fpRank matrix.

In our proposed CS-based sub-Nyquist AF system, the sampling rate is 30% of that of the original method. The optimization algorithm is an iterative thresholding algorithm. Exact search, with a Hamming distance of 0, is used. Figures 8 and 9 give the item’s information with the Nyquist and the CS-based sub-Nyquist AF systems, respectively. From the Figures, we can see that the obtained BER of the CS-based sub-Nyquist AF system is higher than that of the Nyquist one; however, both systems identify the item in the database correctly.

Fig. 8. The item’s information using the Nyquist Audio Fingerprinting system.

Fig. 9. The item’s information using the CS-based sub-Nyquist Audio Fingerprinting system.

C. Under Distorted Environment

The robustness of the proposed system is tested under the following conditions: additive white uniform noise, additive white Gaussian noise, linear speed change, and band-pass filter. The results are shown in Table I. We observe that the BER of the CS-based sub-Nyquist AF system is within 0.07 of that of the Nyquist one in every condition, and is lower under additive white Gaussian noise.
TABLE I
COMPARISON BETWEEN THE NYQUIST AND THE CS-BASED SUB-NYQUIST AUDIO FINGERPRINTING SYSTEMS.

Distortion                            Nyquist BER    Sub-Nyquist BER
Additive white uniform noise (0.5)    0.3619         0.3643
Additive white Gaussian noise (0.5)   0.4337         0.3915
Linear speed change (0.1)             0.1827         0.2513
Band-pass filter                      0.2875         0.3419
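For readers who want to reproduce a BER figure of the kind reported in Table I, the following sketch shows one common way to compute it. It assumes, in the style of [2], that the BER is the fraction of differing bits between two equally long blocks of 32-bit sub-fingerprints; the paper does not spell out its exact procedure, and the function name `bit_error_rate` is hypothetical.

```python
import numpy as np

def bit_error_rate(fp_a, fp_b):
    """Fraction of differing bits between two blocks of 32-bit sub-fingerprints."""
    a = np.atleast_1d(np.asarray(fp_a, dtype=np.uint32))
    b = np.atleast_1d(np.asarray(fp_b, dtype=np.uint32))
    if a.shape != b.shape:
        raise ValueError("fingerprint blocks must have the same length")
    diff = np.bitwise_xor(a, b)                          # 1-bits mark disagreements
    n_errors = int(np.unpackbits(diff.view(np.uint8)).sum())
    return n_errors / (32 * a.size)

# Two sub-fingerprints per block; exactly one of the 64 bits differs.
reference = [0x00000000, 0xFFFFFFFF]
query = [0x00000001, 0xFFFFFFFF]
ber = bit_error_rate(reference, query)
print(ber)
```

Under this definition, an exact search with a Hamming distance of 0 corresponds to requiring at least one sub-fingerprint of the query to equal a stored sub-fingerprint bit for bit, after which the BER over the whole block decides the match.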
Feature dimensionality reduction is crucial to establishing a compact system. Realizing low-rate sampling of a signal and searching for an efficient group of features are two very important aspects. Low-rate sampling is essential for the production of low-cost sampling devices, while an efficient representation of items brings high computational precision and efficiency. Both are of great significance for a compact AF system. In our experiments, the CS-based sub-Nyquist AF system is compact and can be implemented rapidly.

IV. CONCLUSION

Compressive sampling is attracting increasing interest from a very wide range of researchers in different fields. In this paper, we introduced CS theory to the AF system for music recognition by proposing a CS-based sub-Nyquist AF system. Experiments were conducted, and the results demonstrate the efficiency of the proposed system in reducing the sampling rate and in extracting musical features. For immediate future work, we will focus on the possibility of porting the proposed CS-based recognition to other music information retrieval tasks, such as onset detection, beat tracking, and tempo estimation.

REFERENCES

[1] V. Venkatachalam, L. Cazzanti and M. Wells, “Automatic identification of sound recordings,” IEEE Signal Process. Mag., vol. 21, no. 2, pp. 92-99, Mar. 2004.
[2] J. Haitsma and T. Kalker, “A highly robust audio fingerprinting system,” in Proc. 3rd Int. Conf. Music Information Retrieval, Oct. 2002, pp. 107-115.
[3] P. Cano, E. Batlle, T. Kalker and J. Haitsma, “A review of audio fingerprinting,” J. VLSI Signal Process., vol. 41, no. 3, pp. 271-284, Nov. 2005.
[4] E.J. Candès and M.B. Wakin, “An introduction to compressive sampling: A sensing/sampling paradigm that goes against the common knowledge in data acquisition,” IEEE Signal Processing Magazine, vol. 25, no. 2, pp. 21-30, Mar. 2008.
[5] M. Elad, “Optimized projections for compressed sensing,” IEEE Trans. on Signal Processing, vol. 55, no. 12, pp. 5695-5702, Dec. 2007.
[6] G. Tauböck and F. Hlawatsch, “A compressed sensing technique for OFDM channel estimation in mobile environments: Exploiting channel sparsity for reducing pilots,” in Proc. IEEE Int. Conf. on Acoustics, Speech, and Signal Processing (ICASSP), Las Vegas, Nevada, Apr. 2008, pp. 2885-2888.
[7] W.U. Bajwa, A. Sayeed and R. Nowak, “Compressed sensing of wireless channels in time, frequency, and space,” in Proc. Conf. on Signals, Systems, and Computers, Pacific Grove, California, Oct. 2008, pp. 2048-2052.
[8] J. Romberg, “Imaging via compressive sampling,” IEEE Signal Processing Magazine, vol. 25, no. 2, pp. 14-20, Mar. 2008.
[9] Z. Chen, J. Wen, Y. Han, J. Villasenor and S. Yang, “A compressive sensing image compression algorithm using quantized DCT and noiselet information,” in Proc. IEEE Int. Conf. on Acoustics, Speech, and Signal Processing (ICASSP), Dallas, Texas, U.S.A, Mar. 2010, pp. 1294-1297.
[10] W. Bajwa, J. Haupt, A. Sayeed and R. Nowak, “Joint source-channel communication for distributed estimation in sensor networks,” IEEE Transactions on Signal Processing, vol. 53, no. 10, pp. 3629-3653, Oct. 2007.
[11] J. Wright, A. Yang, A. Ganesh, S. Shastry and Y. Ma, “Robust face recognition via sparse representation,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 31, no. 2, pp. 210-227, Feb. 2009.
[12] T. Sainath, A. Carmi, D. Kanevsky and B. Ramabhadran, “Bayesian compressive sensing for phonetic classification,” in Proc. IEEE Int. Conf. on Acoustics, Speech, and Signal Processing (ICASSP), Dallas, Texas, U.S.A, Mar. 2010, pp. 4370-4373.
[13] Y. Yoon and M. Amin, “Through-the-wall radar imaging using compressive sensing along temporal frequency domain,” in Proc. IEEE Int. Conf. on Acoustics, Speech, and Signal Processing (ICASSP), Dallas, Texas, U.S.A, Mar. 2010, pp. 2806-2809.
[14] N. Jacobs, S. Schuh and R. Pless, “Compressive sensing and differential image motion estimation,” in Proc. IEEE Int. Conf. on Acoustics, Speech, and Signal Processing (ICASSP), Dallas, Texas, U.S.A, Mar. 2010, pp. 718-721.
[15] K. Chang, J. Jang and C.S. Iliopoulos, “Music genre classification via compressive sampling,” in Proc. Int. Soc. Music Information Retrieval, Amsterdam, the Netherlands, Aug. 2010, pp. 387-392.
[16] L. Shen, Y. Guan, Y. Wu and Y. Zhao, “Fast Audio Fingerprint Search Strategy for Song Identification,” in Proc. Int. Conf. Networking and Digital Society, Guiyang, Guizhou, China, May 2010, vol. 2, pp. 259-262.