International Conference on Intelligent Information Hiding and Multimedia Signal Processing
Audio quality-based authentication using wavelet packet decomposition and best tree selection
Fang Chen1, Wei Li1*, Xiaoqiang Li2*
1. Department of Computer Science and Engineering, Fudan University, Shanghai 200433
2. School of Computer Science and Engineering, Shanghai University, Shanghai 200072
distortion caused by an incidental manipulation from that caused by a malicious manipulation. This intrinsic fuzziness makes soft authentication design challenging and, in most cases, likely ad hoc [1]. Soft authentication typically measures the distortion, under some metric, between a feature vector extracted from the dubious signal and the corresponding vector from the original signal, and compares it with a preset threshold to decide whether the challenged signal is authentic. There is typically no sharp boundary between authentic and inauthentic signals.
So far, almost all authentication algorithms target the image domain, and very few address audio. Here we briefly summarize a few representative audio authentication works from recent years. Radhakrishnan et al. [2] proposed an audio content authentication technique based on an invariant feature, i.e. the masking curve, shared by two perceptually similar audio signals. Experimental results show that this content-based hash can differentiate allowed signal processing operations such as MP3 compression from certain malicious operations that modify the perceptual content of the audio. Wu et al. [3, 4] proposed two fragile speech watermarking schemes for content integrity, i.e. exponential-scale odd/even modulation and linear additive watermarking, and further compared their ability to detect localized malicious alterations and to tolerate content-preserving operations in [5]. Both algorithms embed watermarks in the DFT magnitude domain and use a simplified masking model to achieve inaudibility. Both schemes can distinguish various content-preserving operations from malicious tampering. Exponential-scale odd/even modulation is better at detecting localized content alteration, while linear additive watermarking is more tolerant of low bit-rate speech coders such as CELP coders. Lu et al. proposed a cocktail watermarking scheme [6] that inserts two complementary modulated watermarks, a robust one and a fragile one, into the original audio to
Abstract
High-quality editing and processing of audio signals make authentication increasingly important for secure applications. This paper proposes an effective algorithm for quality-based soft music authentication using wavelet packet decomposition and entropy-based best subtree selection, in the setting of general entertainment applications. Experimental results show that, as measured by the perceptual evaluation of audio quality (PEAQ), most audio manipulations other than MP3 compression (at all tested bit rates) are not quality-preserving, since they introduce various special auditory effects. The algorithm also demonstrates good ability to recognize malicious operations such as random cropping and tampering.
1. Introduction
Modern audio editing and processing tools make high-quality forgery easy. For example, the semantic meaning of an audio clip can be altered simply by reordering or dropping a few small parts. Judging the authenticity of audio data by human perception alone is therefore far from sufficient, and tamper detection is increasingly essential to secure audio applications.
Multimedia authentication can be classified into hard and soft authentication. Hard authentication rejects any modification except lossless compression or format conversion. Soft authentication accepts certain incidental or admissible manipulations and rejects all others, which are called malicious manipulations. Soft authentication can be further divided into quality-based authentication, which rejects any manipulation that lowers the perceptual quality below an acceptable level, and content-based authentication, which rejects any manipulation that changes the semantic meaning of the content. The classification of manipulations as acceptable or unacceptable depends on the specific application. In many applications, it is often difficult to distinguish
978-0-7695-3278-3/08 $25.00 © 2008 IEEE DOI 10.1109/IIH-MSP.2008.19
Authorized licensed use limited to: FUDAN UNIVERSITY. Downloaded on November 25, 2008 at 00:55 from IEEE Xplore. Restrictions apply.
simultaneously achieve copyright protection and authentication.
In this paper, we focus on quality-based soft audio authentication for general entertainment applications and propose a novel robust signature scheme based on wavelet packet decomposition and least entropy-based best subtree selection. It is highly robust to MP3 compression, the most typical quality-preserving operation, and shows good ability to detect malicious manipulations such as tampering and random cropping. The rest of the paper is organized as follows. Section 2 briefly describes the related techniques, i.e. the wavelet packet transform and best subtree calculation. Section 3 details the robust signature extraction procedure. Experimental conditions and results are presented in Section 4, and Section 5 concludes with open problems and future research.
function along the wavelet packet coefficients, this algorithm leads to the best tree. This calculation preserves important coefficients and discards trivial ones, thus greatly reducing the total number of coefficients without losing much information, which facilitates the subsequent robust signature extraction.
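The top-down splitting rule described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the 'db1' (Haar) filter pair, the non-normalized Shannon entropy as the additive criterion, and a hypothetical `max_level` limit on tree depth. The function names are illustrative.

```python
import math

def haar_split(x):
    """One 'db1' (Haar) analysis step: return (approximation, detail)."""
    s = math.sqrt(2.0)
    approx = [(x[i] + x[i + 1]) / s for i in range(0, len(x) - 1, 2)]
    detail = [(x[i] - x[i + 1]) / s for i in range(0, len(x) - 1, 2)]
    return approx, detail

def shannon_entropy(c):
    """Additive, non-normalized Shannon entropy: -sum(c^2 * log(c^2))."""
    total = 0.0
    for v in c:
        e = v * v
        if e > 0.0:  # zero coefficients contribute nothing
            total -= e * math.log(e)
    return total

def best_tree(x, max_level):
    """Split a node only if the summed entropy of its two children is
    lower than its own entropy. Returns the leaves as (path, coefficients),
    where the path uses 'a' for approximation and 'd' for detail."""
    leaves = []
    stack = [("", list(x))]
    while stack:
        path, c = stack.pop()
        if len(path) < max_level and len(c) >= 2:
            a, d = haar_split(c)
            if shannon_entropy(a) + shannon_entropy(d) < shannon_entropy(c):
                stack.append((path + "d", d))
                stack.append((path + "a", a))
                continue
        leaves.append((path, c))
    return leaves

if __name__ == "__main__":
    for path, coeffs in sorted(best_tree([1.0] * 8, max_level=3)):
        print(path or "root", len(coeffs))
```

For a constant input the energy concentrates in the approximation branch, so only that branch keeps splitting while the all-zero detail nodes stay as leaves.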
3. Robust Signature Extraction
Based on the wavelet packet decomposition and best tree calculation introduced above, robust signature extraction consists of the following four phases:
(1). Preprocessing: All test music pieces are converted to 16-bit monophonic format sampled at 44.1 kHz.
(2). Framing: The input music signal is segmented into frames of 2048 samples in our experiments, and each frame is weighted by a Hamming window to reduce edge effects.
(3). Wavelet packet transform & best tree calculation: Wavelet packet decomposition with the ‘db1’ basis is applied to each frame, and the least entropy-based best tree is then calculated for further use.
(4). Robust signature calculation: For each frame, we sum the coefficients of all leaf nodes of the best tree and compare the sum with zero. If the sum is greater than zero, a bit ‘1’ is extracted from the frame; otherwise a bit ‘0’ is extracted, as shown in Equation 1.
2. Related Techniques: Wavelet Packet Transform and Best Subtree Selection
The wavelet packet transform (WPT) extends the conventional wavelet transform (WT): it splits not only the approximations but also the details, as shown in the left part of Figure 1, yielding more than $2^{2^{n-1}}$ different ways to encode the signal [7], where n is the decomposition level. Since the number of possible encodings is very large, it is of interest to find an optimal decomposition, like the one in the right part of Figure 1, with respect to a convenient criterion. The minimum entropy-based method is one effective algorithm for extracting the best subtree.
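As a rough check of how quickly this search space grows, the candidate decompositions can be counted with a simple recurrence: each node is either kept as a leaf or split into two children that each choose their own subtree. This counting argument and the function name `count_wp_bases` are illustrative additions rather than part of the paper.

```python
def count_wp_bases(n):
    """Number of admissible subtrees of a complete binary wavelet packet
    tree of depth n: A(0) = 1, A(n) = A(n - 1) ** 2 + 1."""
    count = 1
    for _ in range(n):
        count = count * count + 1
    return count

if __name__ == "__main__":
    # Columns: level n, number of candidate trees, the bound 2^(2^(n-1)).
    for n in range(1, 5):
        print(n, count_wp_bases(n), 2 ** (2 ** (n - 1)))
```

For the level-3 tree of Figure 1 this gives 26 candidate subtrees against a bound of 16, which is why an efficient local criterion is preferred over exhaustive search.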
$$E_k = \sum_{i} \sum_{n=1}^{N_k^i} \mathrm{coef}_k^i(n), \qquad h(k) = \begin{cases} 1, & \text{if } E_k > 0 \\ 0, & \text{otherwise} \end{cases} \tag{1}$$
where k denotes the k-th frame, i indexes the i-th leaf node of the best tree of this frame, and $\mathrm{coef}_k^i(n)$ and $N_k^i$ are the n-th coefficient and the total number of coefficients of the i-th leaf node, respectively. All such bits are concatenated to form the final robust signature of the piece of music, according to Equation 2.
Figure 1. Wavelet packet decomposition tree at level 3 and its optimal entropy based subtree

$$H = \{\, h(k) \mid h(k) \in \{0,1\},\ k \in [1, M] \,\} \tag{2}$$
where M is the total frame number of the input music clip.
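The four phases and Equations 1 and 2 can be put together in a short sketch. It is an illustration under stated assumptions, not the authors' code: frames are taken as non-overlapping, the tree depth `MAX_LEVEL` is set to 3 as in Figure 1 (the text gives no depth), the entropy is the non-normalized Shannon entropy, and the input is assumed to be already converted to mono samples at 44.1 kHz. All function names are hypothetical.

```python
import math

FRAME_LEN = 2048  # frame length stated in the text
MAX_LEVEL = 3     # assumption: depth shown in Figure 1

def hamming(n):
    """Hamming window of length n."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * i / (n - 1)) for i in range(n)]

def haar_split(x):
    """One 'db1' (Haar) analysis step: return (approximation, detail)."""
    s = math.sqrt(2.0)
    approx = [(x[i] + x[i + 1]) / s for i in range(0, len(x) - 1, 2)]
    detail = [(x[i] - x[i + 1]) / s for i in range(0, len(x) - 1, 2)]
    return approx, detail

def entropy(c):
    """Additive, non-normalized Shannon entropy: -sum(c^2 * log(c^2))."""
    total = 0.0
    for v in c:
        e = v * v
        if e > 0.0:
            total -= e * math.log(e)
    return total

def leaf_sum(c, level=0):
    """E_k of Equation 1: sum of all coefficients over the best-tree leaves."""
    if level < MAX_LEVEL and len(c) >= 2:
        a, d = haar_split(c)
        if entropy(a) + entropy(d) < entropy(c):
            return leaf_sum(a, level + 1) + leaf_sum(d, level + 1)
    return sum(c)

def signature(samples):
    """Signature H of Equation 2: one bit h(k) per full frame."""
    window = hamming(FRAME_LEN)
    bits = []
    for start in range(0, len(samples) - FRAME_LEN + 1, FRAME_LEN):
        frame = [s * w for s, w in zip(samples[start:start + FRAME_LEN], window)]
        bits.append(1 if leaf_sum(frame) > 0 else 0)
    return bits

if __name__ == "__main__":
    tone = [math.sin(2 * math.pi * 440 * t / 44100) for t in range(5 * FRAME_LEN)]
    print(signature(tone))
```

With non-overlapping frames, a 15 s clip at 44.1 kHz yields 661500 // 2048 = 322 bits, which agrees with the BER values in Table 2 (for example 24 / 322 = 7.45%).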
Starting with the root node, the best tree is calculated using the following scheme. A node N is split into two nodes N1 and N2 if and only if the sum of the entropy of N1 and N2 is lower than that of N. This is a local criterion based only on the information available at the node N. Several entropy type criteria can be used. If the entropy function is an additive
4. Experimental Results
The algorithm was applied to a set of music signals including pop, saxophone, rock, modern piano, guitar, and electronic organ (15s, mono, 16 bits/sample, 44.1
kHz). We used the widely used audio editing and processing tools ‘GoldWave’ and ‘CoolEdit’ to perform various audio signal manipulations and time domain synchronization processing, in order to check the algorithm's ability to differentiate content-preserving from malicious operations. Detailed experimental results for the rock clip are listed in Table 1 and Table 2 as an example.
(1). Performance under audio signal manipulations
Table 1 shows that the algorithm is highly robust to MP3 compression, the most important quality-preserving manipulation, at all tested bit rates. Other audio signal processing operations, such as echo, equalization, pitch changing, reverb, default noise reduction, and time-scale modification, mostly introduce more or less distinguishable special effects. They are therefore all deemed non quality-preserving manipulations in the sense of perceptual evaluation of audio quality (PEAQ), and accordingly they fail authentication.
(2). Performance under malicious tampering and random cropping
Here we consider two types of malicious manipulation: random cropping and local tampering. Table 2 shows that random cropping causes a series of consecutive errors following the cut position, which clearly distinguishes it from all other kinds of malicious or non quality-preserving operations. Malicious tampering, in contrast, is difficult to differentiate from quality-preserving operations using PEAQ and BER alone, because both exhibit a high PEAQ value and a low BER. When facing such detection results, we therefore have to compare the waveform of the dubious audio with that of the original audio. If only some local regions differ, the manipulation should be malicious tampering, whereas if small variations appear all along the time axis, it is a quality-preserving operation. Figure 2 shows an example.
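The BER figures and the error pattern discussed above can be reproduced from two signatures with a few lines of code. The sketch below is illustrative only: the helper names are hypothetical, frame numbers are assumed to be 1-based, the frame count M = 322 is inferred from 15 s at 44.1 kHz with 2048-sample frames rather than stated in the paper, and the all-zero reference signature is a placeholder. The error positions are those reported for random cropping in Table 2.

```python
def error_positions(sig_ref, sig_test):
    """1-based frame numbers where the two signatures disagree."""
    return [k for k, (a, b) in enumerate(zip(sig_ref, sig_test), start=1) if a != b]

def bit_error_rate(sig_ref, sig_test):
    """Fraction of signature bits that differ (BER)."""
    if len(sig_ref) != len(sig_test):
        raise ValueError("signatures must have the same length")
    return len(error_positions(sig_ref, sig_test)) / len(sig_ref)

def error_spread(positions):
    """(first, last, density) of the erroneous region, where density is
    the share of frames between the first and last error that are wrong."""
    if not positions:
        return None
    first, last = positions[0], positions[-1]
    return first, last, len(positions) / (last - first + 1)

# Error positions reported for random cropping in Table 2.
CROP_ERRORS = [249, 252, 254, 256, 257, 265, 266, 280, 282, 288, 290, 293,
               296, 297, 298, 300, 304, 307, 308, 311, 313, 314, 315, 318]
M = 322  # assumption: 15 s * 44100 Hz // 2048 samples per frame

if __name__ == "__main__":
    ref = [0] * M  # placeholder reference signature
    test = [1 if k in CROP_ERRORS else 0 for k in range(1, M + 1)]
    print(round(100 * bit_error_rate(ref, test), 2))  # 7.45, as in Table 2
    print(error_spread(error_positions(ref, test)))
```

A long run of errors that begins near the cut position points to cropping, while one or two isolated errors with a high PEAQ value call for the waveform comparison described in the text.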
changing strength of the same audio signal operation, how to mark tampered local regions more precisely, and how to lower the false authentication rate as much as possible.
Figure 2. Waveform comparison of malicious tampering and quality-preserving operations
6. References
[1] B. B. Zhu, M. D. Swanson, and A. H. Tewfik, “When seeing isn't believing,” IEEE Signal Processing Magazine, vol. 21, no. 2, pp. 40-49, March 2004.
[2] R. Radhakrishnan and N. Memon, “Audio content authentication based on psycho-acoustic model,” Proceedings of the Security and Watermarking of Multimedia Contents, San Jose, CA, February 2002.
[3] C. P. Wu and C. C. Kuo, “Fragile speech watermarking based on exponential scale quantization for tamper detection,” Proceedings of the IEEE International Conference on Acoustic, Speech and Signal Processing, Orlando, Florida, May 13-17, 2002.
[4] C. P. Wu and C. C. Kuo, “Fragile speech watermarking for content integrity verification,” Proceedings of the IEEE International Symposium on Circuits and Systems, Scottsdale, Arizona, May 26-29, 2002.
[5] C. P. Wu and C. C. Kuo, “Comparison of two speech content authentication approaches,” Proceedings of SPIE vol. 4675, Security and Watermarking of Multimedia Contents IV, pp. 158-169, 2002.
[6] C. S. Lu, H. Y. Mark Liao, and L. H. Chen, “Multipurpose audio watermarking,” Proceedings of 15th International Conference on Pattern Recognition, Barcelona, Spain, 2000, pp. 286-289.
[7] R. R. Coifman and M. V. Wickerhauser, “Entropy-based algorithms for best basis selection,” IEEE Transactions on Information Theory, vol. 38, no. 2, pp. 713-718, 1992.
5. Conclusion
This paper proposes an effective algorithm for quality-based soft music authentication using wavelet packet decomposition and minimum entropy-based best subtree selection. Experimental results show that, in this setting, most audio manipulations other than MP3 compression (at all tested bit rates) are not quality-preserving, since they introduce various special auditory effects. The algorithm also demonstrates good ability to recognize malicious operations such as random cropping and tampering. However, there is still considerable room for further improvement and research, for example, how to generate an authentication output that adapts with the
7. Acknowledgement
This work was supported by NSFC (60402008) and NKT R&DP (2007BAH09B03).
Table 1. Experimental results under audio signal processing

| Manipulation | PEAQ | Classification | BER | Authentication |
|---|---|---|---|---|
| MP3 (32K) | 0.208 | preserving | 0% | Passed |
| MP3 (48K) | 0.208 | preserving | 0% | Passed |
| MP3 (64K) | 0.208 | preserving | 0% | Passed |
| MP3 (96K) | 0.208 | preserving | 0% | Passed |
| MP3 (128K) | 0.208 | preserving | 0% | Passed |
| Echo (0.05ms) | -3.452 | non-preserving | 17.4% | Failed |
| Echo (0.1ms) | -3.624 | non-preserving | 16.8% | Failed |
| Echo (0.25ms) | -3.711 | non-preserving | 15.8% | Failed |
| Echo (0.5ms) | -3.735 | non-preserving | 19.3% | Failed |
| Equalization (boostbass) | -1.296 | non-preserving | 44.4% | Failed |
| Equalization (boostmid) | -2.120 | non-preserving | 47.5% | Failed |
| Equalization (boosttreble) | -1.957 | non-preserving | 48.4% | Failed |
| Noise reduction (default) | -1.922 | non-preserving | 24.5% | Failed |
| Noise reduction (Light hiss removal) | -0.897 | preserving | 4.3% | Passed |
| Pitch (97%, tempo unchanged) | -3.488 | non-preserving | 50.3% | Failed |
| Pitch (98%, tempo unchanged) | -3.450 | non-preserving | 48.4% | Failed |
| Pitch (99%, tempo unchanged) | -3.407 | non-preserving | 55.3% | Failed |
| Pitch (101%, tempo unchanged) | -3.449 | non-preserving | 47.2% | Failed |
| Pitch (102%, tempo unchanged) | -3.521 | non-preserving | 50.9% | Failed |
| Pitch (103%, tempo unchanged) | -3.542 | non-preserving | 46.9% | Failed |
| Pitch (C to B) | -3.831 | non-preserving | 50.3% | Failed |
| Pitch (C to D) | -3.475 | non-preserving | 50.5% | Failed |
| Pitch (Down one octave) | -3.715 | non-preserving | 55.9% | Failed |
| Pitch (Up one octave) | -2.103 | non-preserving | 47.2% | Failed |
| Reverb (Concert hall) | -3.533 | non-preserving | 22.4% | Failed |
| Reverb (Room) | -3.545 | non-preserving | 25.4% | Failed |
| Time-scale modification (97%) | -3.808 | non-preserving | 47.5% | Failed |
| Time-scale modification (98%) | -3.846 | non-preserving | 42.9% | Failed |
| Time-scale modification (99%) | -3.829 | non-preserving | 37.6% | Failed |
| Time-scale modification (101%) | -3.336 | non-preserving | 48.9% | Failed |
| Time-scale modification (102%) | -3.616 | non-preserving | 47.2% | Failed |
| Time-scale modification (103%) | -3.229 | non-preserving | 50.5% | Failed |
Table 2. Experimental results under malicious tampering and cropping

| Manipulation | PEAQ | Bit error number | Error positions (frame number) | BER | Authentication |
|---|---|---|---|---|---|
| Random cropping (500000 to 500200) | -0.430 | 24 | 249 252 254 256 257 265 266 280 282 288 290 293 296 297 298 300 304 307 308 311 313 314 315 318 | 7.45% | Failed |
| Tamper (400000 to 403000) | -0.376 | 2 | 195, 196 | 0.62% | Failed |
| Tamper (the 100th frame) | -0.570 | 1 | 100 | 0.31% | Failed |