ROBUSTNESS OF AUDIO FINGERPRINTING SYSTEMS FOR CONNECTED AUDIO APPLICATIONS

I. Marrakchi-Mezghani, M. Turki-Hadj Alouane, M. Jaidane-Saidane
Unité Signaux et Systèmes, ENIT, BP 37, 1002, Tunis, Tunisia
[email protected], [email protected]

ABSTRACT

Audio fingerprinting allows the identification of audio content regardless of the audio format and, especially, without the need for additional meta-data. Well-known applications of audio fingerprinting are connected audio services; music identification using a mobile phone is one such example. The robustness of audio fingerprinting to the distortions induced by the different components of the mobile network is important to guarantee good performance of the audio identification process. This paper focuses on the distortions that speech codecs induce on music signals during a radio-mobile transmission. The robustness of two selected fingerprinting systems is analyzed with respect to a real GSM speech codec. In particular, the analysis shows that rhythmical parameters are more robust than timbrical ones.

1. INTRODUCTION

An audio fingerprint can be seen as a short summary of an audio object. Many studies [1], [2] have developed algorithms for summarizing a long audio signal into a concise signature sequence, which can then be used to identify the original recording. Each audio fingerprint scheme extracts a set of descriptors intended to represent the structure of the audio signal well. Most audio fingerprint systems, as in [1], [4], generate a set of features representing timbre and texture: energy, ZCR (Zero Crossing Rate), spectral centroid, spectral flatness, 4 Hz modulation energy, Mel Frequency Cepstral Coefficients (MFCC), etc. An additional, novel set of features representing rhythmic structure and strength, such as BPM (Beats Per Minute) and beatedness, is proposed in [3], [2].
Audio fingerprinting is a powerful tool for identifying either streaming or file-based audio, using a database of fingerprints. Well-known applications of audio fingerprinting are the identification of songs or commercials being played on radio or television (broadcast monitoring), music identification using a cell phone, connected audio (radio enhanced with metadata corresponding to the content being played), assisting in watermarking applications, and checking the correct transfer of content over complex networks (broadcast transfer monitoring) [1], [4]. This paper focuses on connected audio applications such as music identification using a mobile phone. In particular, we are interested in the robustness of audio fingerprinting to the distortions induced by the speech codecs within the mobile network. The speech codecs of the mobile network are not fitted to music signals. Therefore, if the audio fingerprinting is not robust to such distortions, the audio recognition process may be impaired. In fact, if F denotes the fingerprint of the original signal and F^c the one related to the coded/decoded signal, one can write:

    F^c = F + B    (1)

where B is an additive noise due to the GSM codec. Such distortion can therefore be viewed as the classical distortion induced by a transmission channel. In this paper, two audio fingerprinting schemes are selected for the analysis: the Philips system presented in [1] and the Amadeus system developed by the MTG (Musical Technology Group) and presented in [2]. The selected systems operate in two different ways; indeed, the fingerprint computation differs from one system to the other. Therefore, their robustness to speech codec distortions is evaluated here by different criteria. This paper is organized as follows. The robustness analysis of a generic audio fingerprint system is presented in Section 2. Section 3 deals with the robustness of a more complete audio fingerprint extraction scheme.
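Under the channel model of Eq. (1), identification amounts to comparing the fingerprint block extracted from the received audio against stored fingerprint blocks, and accepting the best candidate only when its bit error rate stays below a decision threshold (α = 0.35 is the threshold used in [1]). A minimal sketch of this lookup; the in-memory dictionary layout and the function name `identify` are illustrative choices of ours, not part of the systems in [1], [2]:

```python
import numpy as np

def identify(Fc, database, alpha=0.35):
    """Match a received fingerprint block Fc (bit array, K x 32) against
    stored blocks, keeping the candidate with the lowest bit error rate."""
    best_name, best_ber = None, 1.0
    for name, F in database.items():
        # Bit error rate between stored and received fingerprint bits.
        b = float(np.abs(F.astype(int) - Fc.astype(int)).mean())
        if b < best_ber:
            best_name, best_ber = name, b
    # Accept only if the best match falls below the decision threshold.
    return (best_name, best_ber) if best_ber < alpha else (None, best_ber)
```

With roughly 10% of the bits flipped by channel distortion, the correct entry is still returned, since an unrelated fingerprint yields a bit error rate near 0.5.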

2. ROBUSTNESS OF A GENERIC AUDIO FINGERPRINT SCHEME

We start by analyzing the effects of the degradations introduced by the GSM codec on the fingerprint extraction scheme introduced in [1] and presented in Fig 1.

Fig. 1. A generic fingerprint extraction scheme [1]

2.1. Audio fingerprinting system description

The fingerprint extraction scheme extracts 32-bit sub-fingerprints for every interval of 11.6 ms. A fingerprint block consists of K = 256 subsequent sub-fingerprints, corresponding to a granularity of 3 s, which is the minimum audio length for identifying a piece of music. The received audio is first downsampled to a mono audio stream with a sampling rate of 8 kHz. The overlapping frames have a length of 0.37 s and are weighted by a Hamming window with an overlap factor of 31/32. In order to extract 32 sub-fingerprint values for every frame, 33 non-overlapping frequency bands are selected. These bands lie in the range from 300 Hz to 2 kHz. The Mel scale, an approximation of the Bark scale, is used in the frequency domain. Denoting by E(n,m) = Σ_{k=0}^{L−1} |X_{n,m}(k)|² the energy of band m of frame n, and by F(n,m) the m-th bit of the sub-fingerprint of frame n, the bits of the sub-fingerprint (see also the gray block in Fig 1, where T is a delay element) are determined via the following decision process:

    F(n,m) = 1  if F̃(n,m) > 0
    F(n,m) = 0  if F̃(n,m) ≤ 0    (2)

where:

    F̃(n,m) = E(n,m) − E(n−1,m) − E(n,m+1) + E(n−1,m+1)    (3)

The robustness of this audio fingerprinting algorithm under a speech codec is analyzed through the following scheme (Fig 2), applied to the coded/decoded (C/D) signal:

    x_n → [SPEECH CODEC (Half-Rate)] → x_n^c → [Fingerprinting] → F̃^c(n,m) → [Decision] → F^c(n,m)

Fig. 2. Fingerprint extraction scheme under GSM codec

F^c(n,m) is the sub-fingerprint of the coded/decoded signal. Such a scheme is interesting because it is comparable to the Bit Error Rate (BER) evaluation in a classical transmission context. The BER is calculated over the different sub-fingerprints:

    BER = (1/(32·K)) Σ_{n=1}^{K} Σ_{m=1}^{32} |F^c(n,m) − F(n,m)|    (4)

where K is the number of frames (we take K = 256). For the simulations, we have chosen a GSM half-rate speech codec for the 3rd generation. It is important to notice that the smaller the SNR of such a speech coder, the better the coder.

The analysis supposes that the coded/decoded signal and the original one are related by x_n^c = x_n + ε_n, where the original signal x_n and the coding noise ε_n are independent. We denote by X(k) (k = 0, ..., L−1) the Fourier transform (FFT) of the input signal x_n of length L = 3100, so that:

    FFT(x^c) = X^c(k) = FFT(x + ε) = X(k) + ε(k)

For the C/D signal, the energy of band m of frame n is:

    E^c(n,m) = Σ_{k=0}^{L−1} |X_{n,m}(k) + ε_{n,m}(k)|²    (5)

Due to the independence of the signals x_n and ε_n, we get in mean value:

    E(|X^c_{n,m}(k)|²) = E(|X_{n,m}(k)|²) + E(|ε_{n,m}(k)|²)

We denote b(n,m) = Σ_{k=0}^{L−1} |ε_{n,m}(k)|² and:

    B(n,m) = b(n,m) − b(n−1,m) − b(n,m+1) + b(n−1,m+1)    (6)

Consequently:

    F̃^c(n,m) = F̃(n,m) + B(n,m)    (7)

The additive noise B(n,m) can change the decision F(n,m) and consequently the sub-fingerprints of the C/D signal. Eq (7) is a particular case of the equivalent transmission problem presented in Eq (1).

2.2. Experimental robustness evaluation
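As a concrete illustration of this evaluation, the extraction of Eqs. (2)-(3) and the BER of Eq. (4) can be sketched as follows. This is a simplified re-implementation, not the exact code of [1]: the 0.37 s frames at 8 kHz, the 31/32 overlap and the 33 bands in 300 Hz-2 kHz follow the description above, while the Mel/Bark band mapping is approximated by geometrically spaced band edges:

```python
import numpy as np

def subfingerprints(x, fs=8000, n_bands=33, fmin=300.0, fmax=2000.0):
    """32-bit sub-fingerprints per frame, in the spirit of Eqs. (2)-(3)."""
    frame_len = int(0.37 * fs)                 # 0.37 s frames
    hop = frame_len // 32                      # overlap factor 31/32
    win = np.hamming(frame_len)
    edges = np.geomspace(fmin, fmax, n_bands + 1)  # log-spaced band edges
    freqs = np.fft.rfftfreq(frame_len, 1.0 / fs)
    n_frames = 1 + (len(x) - frame_len) // hop
    E = np.empty((n_frames, n_bands))          # band energies E(n, m)
    for n in range(n_frames):
        P = np.abs(np.fft.rfft(win * x[n * hop:n * hop + frame_len])) ** 2
        for m in range(n_bands):
            band = (freqs >= edges[m]) & (freqs < edges[m + 1])
            E[n, m] = P[band].sum()
    # Eq. (3): band-energy difference along time and frequency.
    Ft = E[1:, :-1] - E[:-1, :-1] - E[1:, 1:] + E[:-1, 1:]
    return (Ft > 0).astype(np.uint8)           # Eq. (2): decision bits F(n, m)

def ber(F, Fc):
    """Eq. (4): bit error rate between original and C/D sub-fingerprints."""
    return float(np.abs(F.astype(int) - Fc.astype(int)).mean())
```

With a real GSM half-rate codec in the loop, ber(subfingerprints(x), subfingerprints(codec(x))) reproduces the evaluation of Eq. (4); in this sketch any distortion, e.g. additive noise, can stand in for the codec.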

We evaluate in Table 1 the BER for different signal types. The threshold fixed by this extraction system is α = 0.35 [1]. From Table 1, we notice that for the different test signals, the BERs are below the proposed threshold. As expected, we note that:

    Signal   speech   singing speech   Piano   Violon   Classic   Dance
    BER      0.15     0.2              0.29    0.31     0.3       0.31

Table 1. BER for different signals under GSM codec

• the BER for speech is the smallest;
• the GSM codecs are adapted to speech properties and not to music ones. Consequently, the introduction of musical instruments makes the identification more difficult than for speech or singing speech.

Next, we evaluate the SNR (Signal to Noise Ratio) and the BER as a function of the sub-bands:

    SNR(m) = 10 log10( Σ_{n=1}^{K} |F̃(n,m)|² / Σ_{n=1}^{K} |B(n,m)|² )

and:

    BER(m) = (1/K) Σ_{n=1}^{K} |F(n,m) − F^c(n,m)|

In Fig 3, we plot the BER and the SNR as a function of the sub-bands for a musical piece of length 3 s.

Fig. 3. BER and SNR evaluation as a function of sub-bands for classic music sampled at fs = 8 kHz

We observe that:

• for low BER values, the SNR is relatively high, and vice versa;
• the BER is high for the lower and the upper sub-bands. Such a result depends on the GSM design: this coder is based on perceptual encoding, and in particular on a psychoacoustic model whose shape matches that of the BER as a function of the sub-bands;
• in light of this analysis, it seems interesting to choose the middle sub-bands to evaluate the sub-fingerprints F(n,m), in order to minimize the BER and thereby make the audio fingerprint extraction scheme more robust.

Such a robustness study is carried out, in the next section, for a more complete fingerprint extraction scheme.

3. ROBUSTNESS OF A MORE COMPLETE AUDIO FINGERPRINT SCHEME

The fingerprinting system developed by the MTG group [2] is an automatic genre classification system. It computes a group of descriptors which can be divided into two classes. The first one is related to timbrical characteristics (MFCC, spectral flatness, spectral centroid, ..., and also their first and second derivatives). The second, which is a new set of parameters, is related to the rhythm and tempo of a musical piece; these so-called rhythmical parameters include beatedness, BPM (Beats Per Minute), etc. [2]. These features are then mapped into a more compact representation using LDA (Linear Discriminant Analysis) in order to reduce the number of descriptors. Such a pre-classification improves the discrimination power of the traditional Hidden Markov Model (HMM) system located behind the LDA analysis [2]. Intuitively, rhythmical parameters are more robust to distortions than timbrical ones. Such features are commonly used for audio classification.

3.1. Experimental robustness evaluation

In this work, we focus on the robustness of the two sets of parameters by evaluating the following Signal to Noise Ratio (SNR):

    SNR = 10 log10( Σ_{n=1}^{K} F_n² / Σ_{n=1}^{K} (F_n − F_n^c)² )

where F_n is the parameter for frame n of the original signal and F_n^c is the same parameter for the GSM coded/decoded signal. The simulation conditions are as follows: the original signal is sampled at fs = 8 kHz, the frame length is 300 ms, the hopsize (the non-overlapping frame length) is 30 ms and a Hamming window is used. We have used 1 minute of audio recordings. We summarize in Table 2 the performance evaluation for some selected audio parameters.

    Signal     speech   singing speech   Piano   Violon   Classic   Dance
    Energy     3.55     2.66             2.85    2.15     1.25      0.55
    flatness   9.43     6.3              6.85    4.2      4.46      2.7
    centroid   5.00     6.00             −3.28   −8.00    −2.52     −3.46
    MFCC       −2.40    −3.00            −2.00   1.00     0.36      0.08
    ZCR        5.9      4.70             8.2     14.20    7.13      8.66
    4 Hz mod   26.37    24.44            23.96   23.70    19.46     19.80

Table 2. SNR (dB) comparison under the GSM coder for different timbrical descriptors and for various signal types

We notice that:

• the 4 Hz modulation parameter is the most robust one under such a codec, for the different types of signals, in both time and frequency scales. A speech signal has a characteristic energy modulation peak around the 4 Hz syllabic rate; as shown in Fig 4, speech tends to have more modulation energy at 4 Hz than music. Since this parameter is a representation of the temporal envelope, the codec effectively does not change the global envelope of audio signals;
• the MFCC and spectral centroid parameters are the features most sensitive to the GSM codec;
• Fig 5 shows that such a codec changes the PSD (Power Spectral Density) of the signals. Especially for music, which has relevant spectral components even at high frequencies, the codec erases some relevant peaks in the high frequencies.
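The contrast between the two parameter families can be reproduced with a small experiment. The sketch below computes a simple 4 Hz modulation energy (the fraction of envelope-spectrum energy near the 4 Hz syllabic rate) together with the parameter SNR of Section 3.1. The 300 ms frame and 30 ms hop follow the simulation conditions above, but the exact MTG descriptor definitions are not reproduced here, so this is only an illustrative variant:

```python
import numpy as np

def four_hz_modulation_energy(x, fs=8000, frame_len=2400, hop=240):
    """One 4 Hz modulation value per 300 ms analysis frame (sketch).

    The short-time energy envelope is measured on 30 ms sub-frames, then
    the fraction of envelope-spectrum energy near 4 Hz is returned.  The
    frequency resolution is coarse at this frame length; the values are
    only meant to separate strongly modulated from flat envelopes.
    """
    sub = int(0.030 * fs)                        # 30 ms envelope resolution
    vals = []
    for start in range(0, len(x) - frame_len + 1, hop):
        seg = x[start:start + frame_len]
        env = np.sqrt([np.mean(seg[i:i + sub] ** 2)
                       for i in range(0, frame_len - sub + 1, sub)])
        env = env - env.mean()                   # remove DC of the envelope
        spec = np.abs(np.fft.rfft(env)) ** 2
        f = np.fft.rfftfreq(len(env), sub / fs)  # envelope sampling rate
        band = (f >= 3.0) & (f <= 5.0)           # energy around 4 Hz
        vals.append(spec[band].sum() / (spec.sum() + 1e-12))
    return np.array(vals)

def parameter_snr_db(F, Fc):
    """SNR of Sec. 3.1: 10 log10( sum F_n^2 / sum (F_n - F_n^c)^2 )."""
    return 10.0 * np.log10(np.sum(F ** 2) / (np.sum((F - Fc) ** 2) + 1e-12))
```

A noise signal amplitude-modulated at 4 Hz (a crude stand-in for the syllabic envelope of speech) scores higher on this descriptor than unmodulated noise, and a small perturbation of the parameter sequence yields a higher SNR than a large one.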

Fig. 4. 4 Hz modulation energy parameter for Dance music (right) and speech (left), for the original signal and its coded/decoded version under a GSM codec

Fig. 5. Spectral centroid parameter for Dance music (right) and speech (left), for the original signal and its coded/decoded version under a GSM codec

Next, we investigate in Table 3 the beatedness parameter for some selected signals and the corresponding SNR (dB) evaluation. The beatedness is a measure of how strong the beats are in a music piece. In [2], it is computed as the spectral flatness of the sequence in the rhythm domain. The highest beatedness value is obtained for dance music (rhythmic) and the lowest for speech. Such information is very useful for automatic audio classification. From Tables 2 and 3 we can conclude that the selected rhythm parameter (beatedness in our study) is more robust than the timbrical ones over mobile phone codecs.

    Signal     speech   singing speech   Violon   Classic   Dance
    Original   0.19     0.32             0.42     0.60      3.15
    C/D        0.20     0.32             0.46     0.66      2.94
    SNR (dB)   47.50    47.00            21.34    21.34     23.67

Table 3. Beatedness comparison for the original signal and its coded/decoded version under a GSM codec

4. CONCLUSION

In this paper we analyzed the robustness of two audio fingerprint extraction schemes under GSM coders in the context of connected audio applications. As expected, rhythm parameters are more robust than timbrical ones in such a context. Thanks to this analysis, some scheme modifications are proposed in order to enhance the scheme robustness.

5. ACKNOWLEDGMENTS

This work was done in the context of a Tunisian-Spanish project with Pompeu Fabra University. The authors would like to thank Mr Enric Guaus for his useful help with the MTG programs for audio parameter extraction.

6. REFERENCES

[1] J. Haitsma, T. Kalker, "A highly robust audio fingerprinting system", International Symposium on Music Information Retrieval (ISMIR 2002), pp. 144.

[2] E. Battle, E. Guaus, "A non-linear rhythm-based style classification for broadcast speech-music discrimination", in Proc. AES 116th Convention, Berlin, Germany, 2004.

[3] G. Tzanetakis, P. Cook, "Musical genre classification of audio signals", IEEE Transactions on Speech and Audio Processing, vol. 10, no. 5, July 2002.

[4] P. Cano, E. Battle, H. Mayer, H. Neuschmied, "Robust sound modeling for song detection in broadcast audio", in Proc. AES 112th Convention, Munich, Germany, May 2002.