www.ietdl.org Published in IET Information Security Received on 3rd December 2007 Revised on 3rd July 2008 doi: 10.1049/iet-ifs:20070145

ISSN 1751-8709

Robust audio watermarking using frequency-selective spread spectrum

H. Malik (1), R. Ansari (2), A. Khokhar (2)

(1) ECE Department, University of Michigan, Dearborn, USA
(2) ECE Department, University of Illinois, Chicago, USA
E-mail: hafi[email protected]
Abstract: A novel audio watermarking scheme based on frequency-selective spread spectrum (FSSS) technique is presented. Unlike most of the existing spread spectrum (SS) watermarking schemes that use the entire audible frequency range for watermark embedding, the proposed scheme randomly selects subband(s) signal(s) of the host audio signal for watermark embedding. The proposed FSSS scheme provides a natural mechanism to exploit the band-dependent frequency-masking characteristics of the human auditory system to ensure the fidelity of the host audio signal and the robustness of the embedded information. Key attributes of the proposed scheme include reduced host interference in watermark detection, better fidelity, secure embedding and improved multiple watermark embedding capability. To detect the embedded watermark, two blind watermark detection methods are examined, one based on normalised correlation and the other based on estimation correlation. Extensive simulation results are presented to analyse the performance of the proposed scheme for various signal manipulations and standard benchmark attacks. A comparison with the existing fullband SS-based schemes is also provided to show the improved performance of the proposed scheme.

1 Introduction

The growth of the Internet, the proliferation of low-cost and reliable storage devices, the deployment of seamless broadband networks, the availability of state-of-the-art digital media production and editing technologies and the development of efficient multimedia data compression schemes have led to widespread forgeries of digital documents and unauthorised sharing of digital data. As a result, the music industry alone claims multi-million illegal music downloads on the Internet every week [1]. This form of piracy has led to enormous annual revenue loss. It is therefore imperative to have robust technologies to protect copyrighted digital media from illegal sharing and tampering. Traditional digital data protection techniques, such as encryption and scrambling, cannot provide adequate protection of copyrighted digital content, as the content becomes unprotected once it is decrypted or unscrambled by users. Digital watermarking technology complements cryptography in protecting digital content even after it is deciphered [2].

Digital watermarking is the process of embedding authentication and ownership protection information (also known as a watermark) into digital content (or host data) for its protection and integrity against illegal sharing and tampering. Existing digital watermarking techniques can be classified into two main categories: (1) informed embedding, which exploits knowledge of the host signal during the watermark embedding process, for example, quantisation index modulation-based watermark embedding [2-4], and (2) blind embedding, which does not use knowledge of the host signal during the watermark embedding process, for example, additive spread spectrum (SS) based watermark embedding [2, 5-10]. Blind watermark embedding schemes such as SS-based watermarking schemes generally embed the watermark into the host data without exploiting the host information. The corresponding watermark detection schemes employ statistical information about the host signal to perform watermark detection, which is optimal or near-optimal in the maximum likelihood (ML) sense. Key merits of SS-based watermarking schemes include robustness against interference (i.e. anti-jamming capability), simplicity and low computational complexity of the watermark embedding and detection processes. In addition, in the case of SS-based watermarking, the decoding bit error probability at the detector varies smoothly from noise-free to severe channel distortion scenarios [11].

However, existing SS-based watermarking schemes [2, 5-10] have poor detection performance at the blind detector; that is, zero decoding error probability at the blind watermark detector is not possible. This is primarily because of strong host signal interference at the blind watermark detector (throughout the rest of the paper, watermark detector or detector implies a blind watermark detector unless otherwise specified), which limits detection performance. Low embedding capacity is another limitation of SS-based watermarking [2, 5-10]. The detection performance of existing SS-based watermarking schemes [2, 5-10] deteriorates further when they are used for multiple watermark embedding applications, because multiple watermarks increase the interference level at the detector. In addition, existing SS-based watermarking schemes [2, 5-10] are also prone to watermark estimation attacks.

The main motivation behind this work is to address the aforementioned limitations by developing an SS-based watermarking scheme that has low host interference at the detector, is capable of embedding multiple watermarks and is secure against watermark estimation attacks. In this paper, a frequency-selective spread spectrum (FSSS) based watermarking scheme for digital audio is proposed. Unlike most of the existing SS-based watermarking schemes that use the entire audible frequency range for watermark embedding, the proposed scheme uses only a fraction of the audible frequency range for watermark embedding.
The proposed scheme uses a secret key to randomly select subband signals of the host audio for watermark embedding; the number of subband signals selected depends on the target embedding capacity. The proposed FSSS scheme therefore provides a natural mechanism to exploit the band-dependent frequency-masking characteristics of the human auditory system (HAS) to ensure the fidelity of the host audio signal and the robustness of the embedded information. The scheme as characterised has the following advantages over existing SS-based watermarking schemes: (1) lower embedding distortion, (2) enhanced security, (3) improved multiple watermark embedding capability and (4) flexible control over fidelity and robustness of the embedded watermark for a given embedding capacity. The proposed method exploits the perceptual masking characteristics of the HAS to ensure fidelity and robustness of the embedded watermark. Two methods are considered for watermark detection: (1) normalised correlation-based detection [2] and (2) estimation-correlation-based detection. The correlation-based detector detects the presence or absence of the embedded watermark by comparing the correlation coefficient value between the test-audio and

the watermark (generated at the detector) against a decision threshold. The estimation-correlation detector, on the other hand, first estimates the embedded watermark from the test-audio, and correlation-based detection is then applied to the estimate. For watermark estimation, Wiener filtering or blind source separation (BSS) can be used; the proposed scheme uses BSS based on independent component analysis (ICA) to estimate the watermark from the watermarked audio signal. Fidelity performance of the proposed FSSS-based watermarking scheme is evaluated using both subjective and objective degradation measures. Robustness performance is evaluated against a variety of signal manipulations and standard benchmark attacks, including the addition of coloured and white noise, time- and frequency-scaling, resampling, requantisation, lossy compression and filtering (lowpass, highpass and bandpass). The robustness of the proposed scheme is also evaluated with the StirMark benchmark for audio [12, 13]. In addition, to provide a comparative analysis, the proposed FSSS-based scheme is compared with the existing SS-based audio watermarking schemes presented in [7, 8]. Performance comparison based on fidelity and robustness for a constant embedding capacity shows that the proposed scheme exhibits better fidelity as well as robustness than the schemes presented in [7, 8].

The rest of the paper is organised as follows. The basics of SS watermarking are discussed in Section 2. A brief overview of the human auditory model is presented in Section 3. The watermark embedding procedure of the proposed scheme is described in Section 4, and the corresponding watermark detection methods are discussed in Section 5.
Simulation results of the proposed audio watermarking scheme against different attacks, capacity analysis and comparison with existing audio watermarking schemes are presented in Section 6. The concluding remarks and future directions are outlined in Section 7.

2 Basics of SS-based watermarking

The SS-based watermarking system can be modelled as a classical secure communication system [2]. Fig. 1 shows the

Figure 1 Depiction of a perceptual-based data hiding system with blind receiver as a standard secure communication model

SS-based watermarking system modelled using a secure communication system. In Fig. 1, s ∈ R^N is a vector containing the coefficients of an appropriate transform of the host signal. It is assumed that the coefficients s[i], i = 0, 1, ..., N − 1, are a set of independent and identically distributed (i.i.d.) random variables with zero mean and variance σ_s². A watermark w ∈ R^N is generated using (1) an input message b ∈ {±1} and (2) a pseudo-random sequence w ∈ {±1}^N generated using a secret key K_w. We assume that b, w and s are mutually independent. The watermarked signal, x, is obtained by adding an amplitude-modulated watermark, w, to the host signal s. To ensure imperceptibility of the embedded watermark, the amplitude-modulated watermark is spectrally shaped according to the perceptual mask, a ∈ (0, 1)^N, estimated from the host signal, s, using the HAS. The watermarked signal x obtained using SS-based embedding can be expressed as

x = s + w    (1)

where w = a ⊙ w b, with ⊙ denoting the element-wise product of two vectors. The embedding distortion is expressed as

d_e = x − s    (2)

The mean-squared embedding distortion, generally serving as a fidelity measure for a given watermarking scheme, is expressed as

σ_d² = (1/N) ||d_e||² = (1/N) ||x − s||² = (1/N) ||w||²    (3)

= (1/N) Σ_{i=0}^{N−1} a[i]²    (4)

since w[i]²b² = 1, where ||·|| represents the Euclidean norm and a is the estimated masking threshold. Adversary attacks or distortion introduced by signal manipulations, n, can be modelled as additive channel noise, as shown in Fig. 1. The watermarked signal subject to adversary attacks or channel distortions, x̃, given as

x̃ = x + n    (5)

is processed at the detector to extract the embedded information. The SS-based watermarking schemes use a probabilistic characterisation of the host data to develop an optimal or near-optimal watermark detector (in the ML sense). Perez-Gonzalez et al. [11] have shown that SS-based watermarking schemes cannot achieve zero decoding bit error probability even in the absence of attack-channel distortion because of the presence of host-signal interference. Therefore the decoding bit error probability of SS-based watermarking schemes is inherently bounded by the host signal interference at the detector.

3 Watermarking using human auditory model

The basic principle of perception-based audio watermarking is to incorporate auditory masking to ensure fidelity and robustness of the embedded watermark. Auditory masking is a well-studied phenomenon of the HAS. Extensive work has been done over the years in understanding the characteristics of the HAS and applying this knowledge to audio perception, perception-based compression [14, 15] and perception-based audio watermarking [6, 7, 16]. Some aspects of the HAS relevant to our method are discussed in this section.

The human ear acts as a frequency analyser that maps signal frequencies to locations along the basilar membrane. The HAS is generally modelled as a nonuniform filter bank with logarithmically widening bandwidth towards higher frequencies [17]. The bandwidth of each filter is set according to the critical band, which is defined as 'the bandwidth in which subjective response changes abruptly' [17, 18]. The critical band rate (CBR) specifies the correspondence between frequencies pooled for perception and locations on the basilar membrane. The unit of the CBR is the Bark. The mapping between CBR (Bark) and frequency (kHz) has been approximated as [18]

V = 13 arctan(0.76 f) + 3.5 arctan((f / 7.5)²)    (6)

where V is the CBR in Barks and f the frequency in kHz.
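As a quick numerical check, the CBR mapping in (6) is easy to evaluate; the sketch below (the function name is ours) reproduces the familiar value of roughly 8.5 Bark at 1 kHz.

```python
import math

def cbr_bark(f_khz: float) -> float:
    """Critical band rate V in Bark for a frequency f given in kHz, per (6)."""
    return 13.0 * math.atan(0.76 * f_khz) + 3.5 * math.atan((f_khz / 7.5) ** 2)
```

The mapping is monotonically increasing, so it can also be inverted numerically when a Bark-to-kHz conversion is needed.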

The absolute threshold of hearing (ATH) at any frequency is the minimum intensity of sound at that frequency that a normal human ear can perceive. The ATH is an important characteristic of the HAS that is generally used for spectral shaping of the watermark before embedding to ensure inaudibility of the embedded watermark. If a sound has any frequency components with power levels below the ATH, then these components can be discarded without degrading the perceptual quality of the sound. The ATH curve is obtained from experimental studies conducted with several subjects and is modelled as [17]

T_a(f) = 3.64 f^(−0.8) − 6.5 e^(−0.6 (f − 3.2)²) + 10^(−3) f⁴    (7)

where T_a is the sound pressure level (SPL) in dB and f is the frequency in kHz.

Masking is a phenomenon in which a stronger audible signal (masker) makes a weaker audible signal (maskee) inaudible. Masking is an important characteristic of the HAS and is the basis of perceptual audio coding and watermarking systems. Masking depends on the temporal and spectral characteristics of both the masker and the maskee. Spectral masking, also known as simultaneous masking, is a frequency domain masking phenomenon in which a stronger audible signal renders inaudible another simultaneously occurring audible signal located within the same critical band. The spectral masking effect of a masker is stronger within a critical bandwidth and relatively weaker in the neighbouring critical bands. The masking threshold depends on the frequency, the SPL and the tone-like or noise-like structure of both the masker and the maskee [17]. Similarly, temporal masking is a time domain characteristic of the HAS, in which a masker makes normally audible signals inaudible when they occur in short intervals in the immediate vicinity of the masking signal [17]. The proposed FSSS-based audio watermarking scheme exploits the simultaneous masking characteristics of the HAS to ensure imperceptibility of the embedded information.
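The ATH curve in (7) can be sketched directly; the values below (the function name is ours) reflect its familiar shape: high at very low frequencies, a dip below 0 dB SPL near 3-4 kHz and a steep rise towards high frequencies.

```python
import math

def ath_spl_db(f_khz: float) -> float:
    """Absolute threshold of hearing (dB SPL) for a frequency f in kHz, per (7)."""
    return (3.64 * f_khz ** (-0.8)
            - 6.5 * math.exp(-0.6 * (f_khz - 3.2) ** 2)
            + 1e-3 * f_khz ** 4)
```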

4 Watermark embedding using FSSS

In this section, we briefly discuss salient features of the proposed FSSS-based watermarking scheme and describe the FSSS-based watermark embedding process. The proposed scheme decomposes an input audio signal (to be watermarked) into (p + 1) subband signals using an analysis filter bank (Fig. 2), where p is a positive integer. A subband signal, s, is selected using a subband selection key K_sb. The watermark is embedded into the selected subband according to (1).

A critical aspect of designing an SS-based watermarking system is to ensure fast and reliable synchronisation at the watermark detector. Existing SS-based watermarking schemes [2, 5-10] use either explicit or implicit synchronisation. It is important to mention that explicit synchronisation achieves better detection performance, but at the cost of lower effective embedding capacity, and is more prone to active adversary attacks. For example, an active adversary can use explicit synchronisation information to mount an effective desynchronisation attack by estimating the embedded watermark. To overcome the limitations of explicit synchronisation, we use implicit synchronisation, which relies on features of the host signal that are robust to watermarking attacks and invariant to common signal manipulations. For example, features of the host audio signal to which the HAS is sensitive, such as fast energy transition locations, high zero-crossing rate locations or high spectral flatness measure (SFM) locations in the input audio signal, can be used for implicit synchronisation [8]. Since the HAS is sensitive to these features, a desynchronisation attack that alters these locations will introduce noticeable distortion. We refer to these locations of implicit synchronisation as 'salient points' (SPs).

The proposed FSSS-based audio watermarking scheme uses the fast energy transition feature of the audio for implicit synchronisation. The SPs based on fast energy transitions are extracted using the method presented in [8]. To estimate SPs, the host audio signal is bandpass filtered to remove the frequency content to which the human ear is insensitive. For each audio index n, n = 1, 2, ..., N, the total energy of the r samples near index n is calculated as

E[n] = Σ_{i=0}^{r−1} x[n − i]²    (8)

and the energy transition ratio r[n] is calculated as

r[n] = E[n] / E[n − r]    (9)

Figure 2 Illustration of analysis filter bank for l = 5


The energy profile E[n], calculated using (8), and the energy transition ratio r[n] are compared against pre-defined thresholds, T_i, i = 1, 2, 3, 4, to estimate salient point locations. It is worth mentioning that the SP list estimated from a given audio clip directly depends on the set of thresholds used. An index n is labelled as a fast energy transition point if r[n] > T1 and E[n] > T2. Fast energy transition points separated by less than T3 are merged into one group, and only one salient point is estimated from each group of salient point candidates: the point with the strongest energy transition ratio in a group is selected as a salient point, provided that r[n] k_i > T4, where k_i is the number of samples in the ith group of fast transition points. Further details on the above SP estimation algorithm and its efficient implementation can be found in [8].

To illustrate the SP estimation procedure, we analysed the first audio clip, SQAM1.wav, from the sound quality assessment material (SQAM) database downloaded from http://sound.media.mit.edu/mpeg4/audio/sqam/. Fig. 3 gives the plots of the analysed audio clip in the time domain (Fig. 3 top), its estimated energy profile (Fig. 3 middle) and the energy transition ratio along with the estimated salient points (Fig. 3 bottom), using the procedure outlined above for the threshold settings {T1, T2, T3, T4} = {1.9, E_4s, 60, 120}, where E_4s denotes the average energy of a 4 s window around index n. It has been observed through experimentation that, on average, these threshold settings yielded 3-4 SPs/s over the database of 20 audio clips used for performance evaluation in this paper.

The proposed scheme estimates SPs using the method discussed above. Estimated SPs are used to segment the host audio into non-overlapping frames anchored around each SP.
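The SP estimation procedure above can be sketched as follows; the threshold values and window length here are illustrative stand-ins, not the paper's exact settings (in particular, T2 is a fixed constant rather than the adaptive E_4s).

```python
import numpy as np

def salient_points(x, r=512, t1=1.9, t2=1.0, t3=2000, t4=10.0):
    """Sketch of salient-point estimation via fast energy transitions.

    E[n] follows (8); the transition ratio follows (9). The parameters
    t1..t4 play the roles of T1..T4, but their values are illustrative.
    """
    # E[n]: energy of the r samples ending at index n, per (8)
    e = np.convolve(x ** 2, np.ones(r), mode="full")[: len(x)]
    # r[n]: ratio of the current window energy to the previous one, per (9)
    ratio = np.zeros_like(e)
    past = np.roll(e, r)
    valid = past > 0
    ratio[valid] = e[valid] / past[valid]
    ratio[: 2 * r] = 0.0  # discard start-up (and wrap-around) region
    cand = np.flatnonzero((ratio > t1) & (e > t2))
    # merge candidates closer than t3 samples; keep the strongest per group
    sps, group = [], []
    for n in cand:
        if group and n - group[-1] > t3:
            best = max(group, key=lambda m: ratio[m])
            if ratio[best] * len(group) > t4:  # the r[n] * k_i > T4 rule
                sps.append(best)
            group = []
        group.append(n)
    if group:
        best = max(group, key=lambda m: ratio[m])
        if ratio[best] * len(group) > t4:
            sps.append(best)
    return sps
```

On a clip that is silent and then switches to a steady tone, the sketch reports a single SP at the energy transition.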
The non-overlapping frames are then decomposed into (p + 1) subbands, one of which is selected using a private key, K_sb, to insert the watermark w in the frequency domain. The masking threshold is estimated from the selected subband using the HAS and is used to scale the pseudo-random sequence so that the embedding distortion remains below the masking threshold. The watermark is embedded into the selected subband signal using (1). The watermarked subband and the remaining (unwatermarked) subbands are then combined to synthesise (or reconstruct) the watermarked audio frame, and the watermarked frames are merged to generate the watermarked audio clip. The block diagram of the proposed FSSS-based watermark embedding scheme is shown in Fig. 4. The key steps in the proposed watermark embedding scheme are briefly discussed below:

SP extraction: In this preprocessing step, a set of SPs, {SP_i, i = 1, ..., M}, where M is the cardinality of the SP set, is extracted by analysing the host audio clip using the method presented above.

Figure 3 Salient point estimation using rapid energy transition: analysed audio clip (top), energy profile around processing point (middle) and energy ratio profile along with estimated SPs (bottom)

Segmentation: An audio frame consisting of N samples is selected around each SP_i, i = 1, ..., M.

Decomposition: Each frame is decomposed into (p + 1) subband signals of unequal bandwidth using an l-level analysis filter bank. The proposed analysis filter bank is designed incorporating factors such as security (against watermark estimation attacks), robustness (against lossy compression) and the fidelity of the embedded watermark. The proposed l-level analysis filter bank offers a tradeoff between the dyadic wavelet analysis filter bank and the wavelet packet analysis filter bank. An l-level decomposition of an audio signal using a dyadic wavelet analysis filter bank yields (l + 1) subband signals, whereas the corresponding l-level decomposition using a wavelet packet analysis filter bank produces 2^l subband signals. As decomposition using a standard dyadic filter bank yields fewer subbands, it provides lower security for a moderate value of l, that is, 1 ≤ l ≤ 5. On the other hand, decomposition using the wavelet packet bank does address the security issue, but the bandwidth of each subband is f_s / 2^(l+1) Hz, which is smaller than the bandwidth of a single critical band in the high-frequency range (f > 10 kHz) for a moderate value of l [17]. Moreover, to incorporate the spreading effect of the basilar membrane due to neighbouring critical bands accurately, each subband signal should fall into at least two critical bands. One possible analysis filter bank structure satisfying the above constraints is illustrated in Fig. 2, where the bandwidth of each subband covers at least three critical bands for l = 5, yielding nine subbands. The lower eight subbands (shaded in grey in Fig. 2) can be used for watermark embedding to ensure robustness against lossy compression attacks. The highest subband, sb_9, is not used for watermark embedding because of the low signal energy in this frequency band, especially when the sampling frequency f_s > 16 kHz.

Figure 4 Block diagram of the proposed FSSS-based audio watermark embedding scheme
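The subband-count trade-off discussed above is simple to tabulate; a minimal sketch (the function name is ours):

```python
def subband_tradeoff(l: int, fs: float):
    """Subband counts for l-level decompositions and the per-subband
    bandwidth of the wavelet packet case, as discussed above."""
    dyadic = l + 1                     # dyadic wavelet filter bank
    packet = 2 ** l                    # full wavelet packet filter bank
    packet_bw_hz = fs / 2 ** (l + 1)   # each packet subband spans fs/2^(l+1) Hz
    return dyadic, packet, packet_bw_hz
```

For l = 5 and fs = 44.1 kHz this gives 6 dyadic subbands against 32 packet subbands of about 689 Hz each, narrower than the critical bands above 10 kHz; the hybrid bank of Fig. 2 (nine subbands for l = 5) sits between the two extremes.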

Subband selection: A secret key K_sb_i is used to select a subband from the lower (2l − 2) subbands of the ith frame (the audio frame around the ith SP) for watermark embedding. Without loss of generality, we assume that only one subband is selected for watermark insertion. The subband selection key when one subband is selected per frame, K_sb^(1), can be expressed as K_sb = [K_sb_1 : ... : K_sb_i : ... : K_sb_M]. However, if p_1 subbands are selected per frame for watermark insertion, then the subband selection key length will be |K_sb^(p_1)| = p_1 |K_sb^(1)|.

Masking threshold estimation: To ensure imperceptibility of the embedded watermark, the pseudo-random sequence w is spectrally shaped using the masking threshold, T_m, estimated from the selected subband, s_i^j. The masking threshold is estimated using the MPEG layer III psychoacoustic model 1 [14, 15]. The masking threshold of the selected subband is estimated in the frequency domain, for which the discrete Fourier transform (DFT) is used to map the selected subband signal into the frequency domain. The DFT is used because of its efficient computation via the fast Fourier transform and its linear frequency scale; however, any mapping of the audio sequence from the sample domain to the frequency domain can be used. The masking threshold, T_m_i^j, for the jth subband of the ith frame (s_i^j) in the DFT domain (S_i^j) is estimated as follows.

The power spectrum of the selected subband is given as

P_i^j[k] = |S_i^j[k]|²    (10)

where S_i^j[k] is the DFT of s_i^j[n], defined as

S_i^j[k] = (1/√N_j) Σ_{n=0}^{N_j−1} s_i^j[n] e^(−ι2πnk/N_j),  0 ≤ k ≤ N_j − 1    (11)

The bin frequency, f_k^j, corresponding to the kth frequency index of the DFT is expressed as

f_k^j = k f^j / N_j    (12)

where f^j denotes the sampling frequency of the selected subband s_i^j. The linear frequency f_k^j (in Hz) is mapped to the CBR scale V (in Bark) using (6), that is

V_k^j = 13 arctan(0.76 f_k^j) + 3.5 arctan((f_k^j / 7.5)²),  k = 1, 2, ..., N_j    (13)

With an abuse of notation, (10) can be rewritten as

P_i^j{V_k} = P_i^j[k]    (14)

The set {V_k : k = 1, 2, ..., N_j} is partitioned into bins {L_m} corresponding to an overlap with the mth critical band. Let |L_m| be the number of frequency bins in the mth critical band of the selected subband; the total energy of the mth band is given by Σ_{V_k ∈ L_m} P_i^j{V_k}. Then the average energy in each critical band of the selected subband is calculated as

E_i^j[m] = (1/|L_m|) Σ_{V_k ∈ L_m} P_i^j{V_k},  m = 1, ..., m_t^j    (15)

where m_t^j is the total number of critical bands in the selected subband.

To incorporate the spreading effect of the basilar membrane, the energy per critical band, E_i^j[m], is convolved with the spreading function, B[m], of the basilar membrane. The spreading function can be approximated as [14, 17, 18]

B[m] = 15.91 + 7.5(m + 0.474) − 17.5 √(1 + (m + 0.474)²)  (dB)    (16)

where m = ..., −1, 0, 1, ... is the index of the neighbouring critical bands. The energy per critical band after taking into account the masking effect of neighbouring critical bands can be expressed as

Em_i^j[m] = E_i^j[m] * B[m]    (17)

where * denotes the convolution operation.

Next, we classify the signal content of the selected subband as either tone-like or noise-like. For this purpose, we employ the SFM, which is defined as the ratio of the geometric mean to the arithmetic mean of the energy per critical band [18, 19]. The SFM of the selected subband can be expressed as

SFM_i^j = 10 log10( (Π_{m=1}^{m_t^j} Em_i^j[m])^(1/m_t^j) / ((1/m_t^j) Σ_{m=1}^{m_t^j} Em_i^j[m]) )  (dB)    (18)

The SFM value is used to calculate the tonality factor γ, which determines the appropriate masking index model. The tonality factor of the selected subband can be expressed as [18, 19]

γ_i^j = min(SFM_i^j / SFM_max, 1)    (19)

where SFM_max = −60 dB. The tonality factor γ_i^j is used to calculate the masking energy offset O_i^j[m], defined as [18, 19]

O_i^j[m] = γ_i^j (14.5 + m) + 5.5 (1 − γ_i^j),  m = 1, ..., m_t^j    (20)

The masking energy offset and the energy per critical band after spreading are used to estimate a raw masking threshold, rm_i^j[m], which is expressed as

rm_i^j[m] = 10^(log10(Em_i^j[m]) − O_i^j[m]/10)    (21)

Now, the final masking threshold, T_m_i^j, is estimated from the raw masking threshold, rm_i^j[m], and the absolute threshold of hearing T_a (calculated using (7)). The final masking threshold of the selected subband is determined as

T_m_i^j[m] = max(rm_i^j[m], T_a)    (22)

To illustrate the masking threshold estimation process discussed above, the audio clip SQAM7.wav from the SQAM audio database is analysed. The estimated masking threshold, T_m_6^2, from the selected subband, s_6^2, and the corresponding power spectral density (PSD), P_6^2, are plotted in Fig. 5.

Watermark generation: A watermark is generated using a pseudo-random noise generator with a characteristic distribution. The secret key, K_w, is used as the seed of the pseudo-random noise generator.

Watermark embedding: The watermark embedding is performed in the frequency domain. The pseudo-random sequence, w, generated using the secret key, K_w, is spectrally shaped to ensure imperceptibility of the embedded watermark. The spectral weighting factor, a_i^j = T_m_i^j, is used to spectrally scale w, that is

W_i[k] = a_i^j[k] w_i[k] b    (23)

Finally, the watermark is added to the selected subband in the DFT domain as

X_i^j[k] = S_i^j[k] + W_i[k]    (24)

The corresponding time domain watermarked subband signal is obtained by taking the inverse discrete Fourier transform (IDFT), that is

x_i^j = IDFT(X_i^j)    (25)

Signal synthesis and frame merging: The watermarked and unwatermarked subbands are combined to synthesise a watermarked audio frame using a synthesis filter bank, which is a perfect reconstruction filter bank for the analysis filter bank illustrated in Fig. 2. These watermarked frames are then merged to give the watermarked audio data, x.

Figure 5 Watermark embedding using FSSS: the original audio frame at SP_6 (top:left), selected subband signal s_6^2 (top:right), PSD P_6^2 and corresponding estimated masking threshold T_m_6^2 (middle:right), watermarked subband x_6^2 (middle:left), watermarked audio frame (bottom:left) and embedding distortion due to watermark embedding (bottom:right)

To illustrate the watermark embedding process, the seventh audio clip of the SQAM database, SQAM7.wav, is processed using the proposed FSSS-based watermark embedding scheme. Fig. 5 illustrates the watermark embedding process for the 6th frame of SQAM7.wav: the frame of the audio clip to be watermarked around the 6th SP (top:left), the selected subband s_6^2 (top:right), the corresponding PSD and estimated masking threshold T_m_6^2 (middle:right), the watermarked subband (middle:left) and the watermarked frame in the time domain (bottom:left). Fig. 5 (bottom:right) shows the embedding distortion, in the time domain, introduced by watermark embedding in the selected frame.
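Steps (10)-(25) can be condensed into a rough single-subband sketch. Everything here is a simplification: the critical-band partition is a uniform stand-in for {L_m}, the absolute threshold of (7) is replaced by a constant floor, and the function and parameter names are ours.

```python
import numpy as np

def embed_subband(s, key, b=1, n_bands=12):
    """Hedged sketch of FSSS embedding in one selected subband signal s:
    a crude masking threshold per (10)-(22), then spectral shaping and
    additive embedding in the DFT domain per (23)-(25)."""
    N = len(s)
    S = np.fft.rfft(s) / np.sqrt(N)                      # (11)
    P = np.abs(S) ** 2                                   # (10)
    bands = np.array_split(np.arange(len(P)), n_bands)   # stand-in for {L_m}
    E = np.array([P[idx].mean() for idx in bands])       # (15)
    m = np.arange(-3, 4)                                 # (16) spreading fn
    B = 10 ** ((15.91 + 7.5 * (m + 0.474)
                - 17.5 * np.sqrt(1 + (m + 0.474) ** 2)) / 10)
    Em = np.convolve(E, B, mode="same")                  # (17)
    gm = np.exp(np.log(Em + 1e-30).mean())               # geometric mean
    sfm = 10 * np.log10(gm / Em.mean() + 1e-30)          # (18)
    gamma = min(sfm / -60.0, 1.0)                        # (19), SFM_max = -60 dB
    O = gamma * (14.5 + np.arange(1, n_bands + 1)) + 5.5 * (1 - gamma)  # (20)
    rm = 10 ** (np.log10(Em + 1e-30) - O / 10)           # (21)
    Tm = np.maximum(rm, 1e-10)                           # (22), constant floor
    rng = np.random.default_rng(key)                     # K_w as the seed
    w = rng.choice([-1.0, 1.0], size=len(P))
    alpha = np.repeat(np.sqrt(Tm), [len(idx) for idx in bands])
    X = S + alpha * w * b                                # (23)-(24)
    return np.fft.irfft(X, n=N) * np.sqrt(N)             # (25)
```

A full implementation would use the true critical-band edges from (6), the ATH of (7) and the paper's filter-bank analysis/synthesis around this core.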

5 Watermark detection

The watermark detection process can be considered as a binary hypothesis testing problem. The performance of a watermark detector is generally characterised by the false positive rate and the false negative rate. The event 'false positive' is an incorrect detection of the watermark in a test-audio clip that does not actually contain the watermark, and its probability is denoted by P_fp. Likewise, the event 'false negative' is a detection failure when a watermark is actually embedded in the test-audio clip, and its probability is denoted by P_fn.

The proposed scheme uses a blind watermark detection procedure; that is, the host signal is not used during the watermark detection process. To this end, two blind watermark detection schemes are considered: correlation-based detection and estimation-correlation-based detection. The motivation for using estimation-correlation-based detection is to gauge the detection performance of the proposed scheme under host interference suppression at the detector. For estimation-correlation-based detection, the embedded watermark is estimated from the test-audio clip using BSS based on an ICA framework. The ICA-based estimator exploits the mutual independence between the host signal and the embedded watermark. The estimated watermark, ŵ, is then applied to the correlation-based detection stage. In the following, we briefly outline the main steps of the detection process used for the proposed FSSS-based watermarking scheme.

† SP extraction: the received watermarked audio signal, x̃, is analysed to extract a set of SPs.

† Segmentation: an audio frame, x̃_i, consisting of N samples is selected around each SP_i, i = 1, 2, ..., M.

† Frame decomposition: each frame is then decomposed into (p + 1) subband signals using the proposed l-level analysis filter bank (as illustrated in Fig. 2).

† Subband selection: a private key, Ksb_i, is used to select the jth subband, x̃_i^j, from the lower (2^l − 1) subbands of the ith frame.

† Watermark detection: to detect the presence or absence of the embedded watermark, both the correlation-based detector and the estimation–correlation-based detector are considered. For correlation-based detection, a standard correlation-based detector using ML estimation [2, 5, 7–10, 16] is used; for estimation–correlation-based detection, the ICA-based detector presented in [20] is used. The ICA-based detector first estimates the embedded watermark from the test-audio clip using BSS based on the ICA framework; the estimated watermark, ŵ, is then applied to the correlation-based detector to decide on the presence or absence of the embedded watermark. Both detectors have their limitations: the detection performance of the correlation-based detector is limited by the host signal's interference, whereas the ICA-based detector is encumbered by the demanding computational requirements of watermark estimation. However, the ICA-based detector provides a significant performance improvement over the correlation-based detector because of its host signal interference suppression capability. The simulation results presented in Section 6 show that, for a given watermark strength, the ICA-based detector performs significantly better than its correlation-based counterpart.

The selected subband, x̃_i^j, is applied to the corresponding detector for watermark detection. Some details of the blind correlation-based and ICA-based detectors used for watermark detection from the test-audio clip are discussed next.

5.1 Normalised correlation-based watermark detection

The watermark detection problem can be formulated as the following binary hypothesis test

H_1: \tilde{x} = x + w,  H_0: \tilde{x} = x   (26)

To detect the embedded watermark using correlation-based detection, the DFT coefficients of the selected subband signal, \tilde{X}_i^j, are correlated with the watermark, W, generated at the detector using the secret key Kw and the masking threshold estimated from the received audio clip. The normalised correlation coefficient (or similarity measure) between the selected subband and the watermark is calculated as

x_w = \tilde{X}_i^j \cdot W   (27)

The correlation metric, t, of the selected subband is calculated as

t = \hat{\mu}_{x_w} \sqrt{N_j} / \hat{\sigma}_{x_w}   (28)

where \hat{\mu}_{x_w} is the sample mean estimated from the correlated subband signal x_w as

\hat{\mu}_{x_w} = (1/N) \sum_{n=0}^{N-1} x_w[n]   (29)

and \hat{\sigma}_{x_w} is the sample standard deviation, estimated as

\hat{\sigma}_{x_w} = \sqrt{ (1/N) \sum_{n=0}^{N-1} (x_w[n] - \hat{\mu}_{x_w})^2 }   (30)

To detect the embedded watermark, the correlation metric, t, is compared with the decision threshold, Th. Statistical analysis of the correlation metric is required to determine the optimal decision threshold, either in the Neyman–Pearson or in the Bayesian sense [21]. The correlation metric, t, is the sum of a large number of i.i.d. random variables; therefore, according to the central limit theorem (CLT), it is reasonable to approximate its distribution as normal, that is, N(\mu_0, \sigma_0) under the null hypothesis, H_0, and N(\mu_1, \sigma_1) under the alternative hypothesis, H_1, where \mu_0 = 0, \sigma_0 = \sigma_1 = 1 and \mu_1 can be estimated using the sample mean. The error probability, P_e, for equal priors, Pr{H_0} = Pr{H_1} = 1/2, is given by P_e = (1/2)(P_fp + P_fn). Analytically, P_fp and P_fn can be calculated as

P_{fp} = Q((Th - \mu_0)/\sigma_0)   (31)

P_{fn} = Q((Th - \mu_1)/\sigma_1)   (32)

where Q(x) is defined as

Q(x) = (1/\sqrt{2\pi}) \int_x^{\infty} e^{-t^2/2} \, dt   (33)

The decision threshold, Th, determined based on the Neyman–Pearson criterion [21], leads to an optimal P_fn for a given P_fp, which can be calculated using (31)

Th = Q^{-1}(P_{fp})   (34)

Equation (34) shows that the decision threshold, Th, is a function of the target false positive rate. The block diagram of the normalised correlation-based watermark detector (NCWD) used for the FSSS-based audio watermarking scheme is given in Fig. 6.
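The detection rule defined by (27)–(34) can be sketched numerically. The following is a minimal sketch, assuming real-valued subband coefficients and a zero-mean, unit-variance watermark; the function names are illustrative, not from the paper:

```python
import numpy as np
from statistics import NormalDist

def correlation_metric(subband, watermark):
    # Elementwise correlation signal x_w[n] (eq. (27)), its sample mean
    # (eq. (29)) and standard deviation (eq. (30)), combined into the
    # correlation metric t of eq. (28).
    xw = np.asarray(subband, float) * np.asarray(watermark, float)
    return xw.mean() * np.sqrt(len(xw)) / xw.std()

def decision_threshold(p_fp):
    # Th = Q^{-1}(P_fp), where Q(x) = 1 - Phi(x) (eqs. (33)-(34));
    # NormalDist().inv_cdf is the standard normal quantile function.
    return NormalDist().inv_cdf(1.0 - p_fp)
```

The watermark is declared present when t exceeds the threshold returned for the target false positive rate.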

5.2 ICA-based watermark detection

The ICA-based watermark detector (ICAWD) exploits the mutual independence between the watermark and the host audio signal for watermark estimation. In [20], we have shown that the watermark can be estimated using ICA from data hiding schemes obeying additive embedding, as long as the embedded watermark and the host signal are mutually independent and the watermark obeys a non-Gaussian distribution. The proposed FSSS-based scheme belongs to the SS-based watermarking class, and therefore the ICAWD proposed in [20] can be used to estimate the embedded watermark. In [20], we have also shown theoretically and verified experimentally that the ICAWD performs significantly better than the traditional correlation-based detector because of its host interference cancellation capability. The ICAWD used here for watermark detection from the test-audio assumes that the embedded watermark sequences have non-Gaussian distributions, a necessary requirement for the identifiability of the ICA model [22], and that the watermark is repeated at least twice, a requirement for BSS of an underdetermined mixture of heavy-tailed sources using the ICA framework [23]. The ICAWD proposed in [20] consists of two stages: (1) a watermark estimation stage and (2) a watermark decoding and/or detection stage. The watermark estimation stage estimates the embedded watermark, ŵ, from the watermarked audio, x̃, using BSS based on the ICA framework. The watermark detection stage detects the presence or absence of the embedded watermark by applying correlation-based detection to the estimated watermark, ŵ. Any BSS scheme based on ICA can be used during the watermark estimation stage. The simulation results presented in this paper, however, are obtained using the 'FastICA for noisy data' algorithm presented in [22, 24] for watermark estimation from the test-audio. This algorithm is selected here because of its better

Figure 6 Block diagram of the NCWD used for detection



Figure 7 Block diagram of the ICAWD used for detection

computational performance and separation quality compared with other well-known algorithms [25, 26]. For correlation-based detection, (27)–(32) are used to determine the presence or absence of the watermark in the estimated watermark, ŵ. The block diagram of the ICAWD used for watermark detection is given in Fig. 7.
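A full FastICA-based estimator is beyond the scope of a short snippet, but the estimation–correlation idea can be illustrated with a deliberately crude stand-in: because the watermark is repeated across two consecutive frames, averaging the frames attenuates the mutually independent host components before correlation. This sketch is an assumption-laden substitute for the FastICA stage, not the paper's algorithm:

```python
import numpy as np

def estimate_watermark(frame1, frame2):
    # Averaging two frames that carry the same watermark keeps the
    # watermark intact while halving the power of the mutually
    # independent host components -- a crude, illustrative stand-in
    # for the BSS/FastICA estimation stage of the ICAWD.
    return 0.5 * (np.asarray(frame1, float) + np.asarray(frame2, float))

def correlation_metric(x, w):
    # Same normalised correlation metric as in Section 5.1.
    xw = np.asarray(x, float) * np.asarray(w, float)
    return xw.mean() * np.sqrt(len(xw)) / xw.std()
```

Applying the correlation metric to the estimate rather than to a raw frame yields a larger detection statistic for the same watermark strength, which is the qualitative behaviour the ICAWD exploits.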

6 Performance analysis

The goal of this section is to provide a performance analysis of the proposed scheme in terms of capacity, fidelity and robustness. In addition, to provide a comparative analysis, the performance of the proposed scheme is compared with the existing SS-based audio watermarking schemes presented in [7, 8]. In the following sections, the robustness of the proposed scheme is analysed against various types of degradation (Section 6.1), a capacity analysis is provided in Section 6.2 and the comparison with the existing SS-based audio watermarking schemes is provided in Section 6.3. In this section, capacity is measured in terms of the number of bits per frame (also expressible as per sample capacity), fidelity is evaluated using both subjective and objective degradation, and robustness is measured in terms of the decoding bit error probability, P_e.

Data set: Experimental results presented here are based on a data set consisting of the SQAM audio database downloaded from [27] and the five audio clips listed in Table 1. All audio clips used for the experimental results are mono, sampled at 44.1 kHz with a resolution of 16 bits. In our experiments, the watermarks are generated and embedded following the procedure described in Section 3. A perceptual mask, Tm, is estimated using (22). This mask is then multiplied by 200 independently generated pseudo-random sequences, w[n], with zero mean and unit variance, to generate 200 independent watermarks. In the case of the ICAWD, the pseudo-random sequences w[n] follow a Laplacian distribution, that is

f_w(t) = (\beta/2) e^{-\beta |t|},  |t| < \infty   (35)

where \beta = \sqrt{2}/\sigma_w; for the correlation-based detector, w[n] follows a normal distribution. These 200 random watermarks were embedded in each audio clip according to (1), resulting in 4000 watermarked audio clips. Experimental results presented in the following sections are averaged over these 4000 watermarked audio clips.
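The watermark generation step described above can be sketched as follows; this is a minimal sketch assuming a precomputed masking threshold vector, and the function name and `key` parameter (standing in for the secret key Kw) are illustrative:

```python
import numpy as np

def generate_watermark(mask, sigma_w=1.0, dist="laplacian", key=0):
    """Draw a zero-mean pseudo-random sequence w[n] -- Laplacian with
    beta = sqrt(2)/sigma_w (eq. (35)) for the ICAWD, Gaussian for the
    correlation-based detector -- and shape it by the estimated
    masking threshold Tm (the 'mask' argument)."""
    rng = np.random.default_rng(key)            # 'key' plays the role of Kw
    n = len(mask)
    if dist == "laplacian":
        # numpy's Laplace 'scale' is 1/beta, so variance = 2*scale^2 = sigma_w^2
        w = rng.laplace(scale=sigma_w / np.sqrt(2.0), size=n)
    else:
        w = rng.normal(scale=sigma_w, size=n)
    return np.asarray(mask, float) * w
```

With a flat (all-ones) mask, the generated sequence has approximately zero mean and unit variance, matching the construction used in the experiments.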

Table 1 Audio clips used for performance analysis

Singer name and song title                  Genre                              Duration (s)
Backstreet Boys, I Want It That Way. . .    Pop (Pop1)                         22
Lata Mangeshkar, Kuch Na Kaho. . .          Melodic (Melodic)                  15
A. Bhosle & R. Sharma, Kahin Aag Lage. . .  Pop (Pop2)                         10
Nusrat F. A. Khan, Afreen Afreen. . .       Indian semi-classical (Classical)  20
Suzanne Vega, Tom's diner                   Female vocal (Vocal)               5

6.1 Robustness performance

To evaluate the robustness of the proposed watermarking scheme, we performed several experimental tests in which the watermarked audio is subjected to commonly encountered degradations. These degradations include addition of white and coloured noise, resampling, lossy compression (MPEG audio compression), filtering, time- and frequency-scaling, multiple watermarking and StirMark benchmark attacks for audio.

Parameter settings: The simulation results presented in this section are based on the following system settings:

† the salient point (SP) list was assumed to be available at the detector, and therefore the decoding bit error probability, P_e, presented here is due to decoding error only;

† the audio frame size (2^l N_1) was set to 2^13 for fs = 44.1 kHz;

† five-level wavelet decomposition was used, that is, l = 5, so that eight target subbands were available for watermark embedding;

† only one subband was selected at random from the eight target subbands for watermark embedding (except in the multiple watermark embedding case);

† the target false positive rate, P_fp, was set to 3.5 × 10^-4, which corresponds to the decoding threshold Th = 0.15 (using (34));

† the false positive rate was measured by applying the original (unwatermarked) music clips to the proposed detector; the average false positive rate over the 20 audio clips used for performance evaluation was 2.9 × 10^-4;

† robustness performance in terms of the average decoding bit error rate was calculated without channel coding; and

† in the case of the ICAWD, a watermark repeating factor of two was used during the watermark embedding process, that is, two consecutive audio frames were watermarked with the same watermark, w.

The robustness performance of the proposed scheme against common degradations under the above settings is discussed next.

6.1.1 Addition of white Gaussian noise: White Gaussian noise ranging from 0 to 200% of the power of the audio signal was added to the corresponding watermarked audio clips. The P_e, averaged over 4000 watermarked audio clips, for the ICAWD and the NCWD at different SNR values is plotted in Fig. 8, which shows that the ICAWD performs better than the NCWD.

Figure 8 Robustness performance for AWGN attack

The superior detection performance of the ICAWD over the NCWD can be attributed to its host signal interference cancellation capability. It can also be observed from Fig. 8 that, for SS-based watermarking, a very low decoding bit error probability is achievable even in the presence of noise with 60–70% of the power of the audio signal.

6.1.2 Resampling: To simulate the resampling attack, a watermarked audio signal was down-sampled to a sampling rate of fs/rf (where rf denotes the resampling factor) and then interpolated back to fs. Watermark detection was then applied to the resulting audio clips. The average P_e for rf = 2, ..., 10 is given in Fig. 9, which shows that the proposed scheme (using the ICAWD) can withstand resampling attacks with rf values up to 5 for each watermarked audio clip; similar decoding performance is achievable with the NCWD by using channel coding. Again, the ICAWD performs better than the NCWD owing to its host signal interference suppression capability. It is important to mention that the better detection performance of the ICAWD comes at the cost of

Figure 9 Robustness performance for resampling attack

security: the ICAWD requires repeated embedding (at least twice), which makes the embedded watermark more vulnerable to watermark estimation attacks than embedding without repetition.
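The resampling attack described above (decimate to fs/rf, interpolate back to fs) can be sketched with plain linear interpolation; this is a rough model under the assumption of no anti-aliasing filter, and the function name is illustrative:

```python
import numpy as np

def resample_attack(x, rf, fs=44100):
    """Down-sample a signal to fs/rf by decimation, then linearly
    interpolate back to fs. High-frequency content is distorted most;
    low-frequency content (where most watermark energy sits for the
    lower subbands) largely survives."""
    t = np.arange(len(x)) / fs       # original sampling instants
    t_low = t[::rf]                  # instants kept at rate fs/rf
    return np.interp(t, t_low, x[::rf])
```

For example, a 100 Hz sine attacked with rf = 5 is recovered almost exactly, consistent with the scheme surviving moderate resampling factors.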

6.1.3 Lossy compression: Lossy compression (MP3) is generally applied to digital audio in multimedia applications such as transmission and storage to reduce the bit rate. To test the survivability of the watermark, audio encoding/decoding was applied to the watermarked audio using the ISO/MPEG-1 Audio Layer III [14, 15] coder at bit rates of 32, 64, 96, 112, 128, 192, 256 and 320 kbit/s. The average P_e for these lossy compression attacks is given in Fig. 10. It can be observed from Fig. 10 that the detection performance of both detectors deteriorates as the bit rate of the encoder/decoder decreases; this is because of the stronger distortion introduced by the encoder at lower bit rates. Again, the ICAWD performs better than the NCWD.

6.1.4 Addition of coloured noise: To simulate an attack with coloured noise, white Gaussian noise was spectrally shaped according to the masking threshold estimated from the corresponding watermarked audio clip based on the HAS model [14, 15, 17]. This just-audible coloured noise was then added to the watermarked audio signal. The average P_e for the resulting audio clips is presented in Fig. 11. It can be observed from Fig. 11 that the NCWD performs poorly; this is because of the increase in the interference level, as the coloured noise is generated by a process almost identical to that of the watermark generation. The additive coloured noise therefore acts as a second watermark interfering with the watermark to be detected. On the other hand, the ICAWD handles such attacks efficiently because of its interference cancellation capability.

Figure 11 Robustness performance for filtering (LPF, HPF, BPF), rescaling (TSp, TSn, FSp, FSn), requantisation (Res) and coloured noise addition (ACNA) attacks

6.1.5 Rescaling: Rescaling attacks include time- and frequency-scaling. Time-scaling attacks can be used to desynchronise the watermark detector of an SS-based watermarking system. To test the robustness of the proposed scheme against time-scaling attacks, the watermarked audio clips were time-scaled with a time-scaling factor TS = ±1%. The detection performance under the time-scaling attack for both detection schemes, the ICAWD and the NCWD, is given in Fig. 11. Frequency-scaling attacks are generally used to deteriorate the detection performance of frequency-domain watermarking schemes. As the proposed scheme is also a frequency-domain scheme, it is reasonable to test its robustness against frequency-scaling attacks as well. To simulate a frequency-scaling attack, the watermarked audio clips were frequency-scaled with a frequency-scaling factor FS = ±1%; the detection performance for the resulting audio clips, for both detection schemes, is also given in Fig. 11. It can be observed from Fig. 11 that the proposed scheme can withstand rescaling attacks of TS = ±1% and FS = ±1% (especially with the ICAWD).

Figure 10 Robustness performance for MP3 compression attack

6.1.6 Filtering: To test the robustness of the proposed watermarking scheme against filtering attacks, the watermarked audio signals were subjected to lowpass filtering (LPF), highpass filtering (HPF) and bandpass filtering (BPF) attacks. The specifications of the filters used are:

1. LPF: cut-off frequency fc = 5 kHz with 12 dB/octave roll-off.

2. HPF: cut-off frequency fc = 1000 Hz with 12 dB/octave roll-off.

3. BPF: cut-off frequencies fc,low = 50 Hz and fc,up = 5.5 kHz with 12 dB/octave roll-off.

The detection performances for the LPF, HPF and BPF attacks are given in Fig. 11.

6.2 Capacity analysis

SS-based watermarking schemes are known for their low embedding capacity (< 1/1000 bits/sample). The existing SS-based watermarking schemes [2, 5–10, 16] spread one bit of information over several samples and achieve robustness at the cost of embedding capacity. The proposed FSSS-based scheme also belongs to the SS-based watermarking class and therefore exhibits low embedding capacity as well. To calculate the embedding capacity of the proposed FSSS-based scheme, it is assumed that the watermark, w, and the host audio, s, are i.i.d. Gaussian random variables, that is, w ~ N(0, \sigma_w^2) and s ~ N(0, \sigma_s^2), and that the watermarked audio signal, x, is subjected to an independent AWGN attack, v ~ N(0, \sigma_v^2). Costa has shown in his seminal work [28] that under an informed embedding scenario, that is, with the host signal used during the encoding process, the watermarking capacity approaches the capacity of the AWGN channel, that is

C_{Informed} = (1/2) \log_2 (1 + \sigma_w^2 / \sigma_v^2)   (36)

Here, (36) shows that the capacity depends on (\sigma_w^2 / \sigma_v^2), also known as the watermark-to-noise ratio in the data hiding literature. In the case of blind embedding, the capacity deteriorates further, as the host signal acts as an additional AWGN source. Therefore, the corresponding capacity for blind embedding can be calculated as

C_{SS} = (1/2) \log_2 (1 + \sigma_w^2 / (\sigma_v^2 + \sigma_s^2))   (37)

It is important to notice that, in the case of blind embedding, the quantity (\sigma_w^2 / \sigma_v^2), also known as the watermark-to-interference ratio, determines the overall capacity of a given blind watermarking scheme. The proposed FSSS-based watermarking scheme also uses blind watermark embedding; therefore (37) can be used to determine its capacity.

As the proposed scheme uses subbands of the host audio for watermark embedding, its capacity is a function of the number of subbands selected for watermark insertion and the allowed embedding distortion. Without loss of generality, it is assumed that only one subband (say the ith, 1 ≤ i ≤ p) is used to embed the watermark; under this scenario, the capacity of the proposed scheme can be expressed as

C_{FSSS} = (1/2) \log_2 (1 + \sigma_w^2 / (\sigma_{v_i}^2 + \sigma_{s_i}^2))   (38)

where \sigma_{v_i}^2 is the noise contribution in the ith subband.

When p_1, 1 ≤ p_1 ≤ p, subbands are used to embed the watermark, the total watermark embedding distortion is bounded by P such that

\sum_{i=1}^{p_1} \sigma_{d_i}^2 \le P   (39)

where \sigma_{d_i}^2 is the embedding distortion in the ith subband, calculated using (4). In this case, the capacity of the proposed scheme can be calculated using a parallel Gaussian channel model [29]

C_{FSSS} \le \sum_{i=1}^{p_1} (1/2) \log_2 (1 + \sigma_{d_i}^2 / (\sigma_{v_i}^2 + \sigma_{s_i}^2))   (40)

where \sum_i \sigma_{d_i}^2 = P. Equality is achieved when the embedded watermarks, (w_1, ..., w_{p_1}), are mutually independent. The problem of calculating the capacity of the proposed data hiding scheme thus reduces to finding the power allocation that maximises the capacity subject to the constraint \sum_i \sigma_{d_i}^2 = P. This constrained optimisation problem can be solved using Lagrange multipliers [29]. Moulin et al. [30] have proposed a framework based on 'spike models' to calculate the data-hiding capacity using parallel Gaussian channels; their framework allocates low or zero power to weak channels and uniform power to strong channels. This framework can be used to calculate the capacity of the proposed FSSS-based scheme, as the proposed scheme also uses parallel channels for watermark embedding. In addition, the strength of the embedded watermark is determined from the estimated masking threshold, Tm, of the selected subband; the proposed scheme therefore allocates watermark power adaptively for the selected subband. The proposed scheme, however, does not optimise watermark power allocation during the embedding process. In future work, we intend to investigate the capacity and security of the proposed scheme under optimal channel selection and power allocation.

6.2.1 Experimental results: This section provides a capacity–fidelity and capacity–robustness analysis of the proposed scheme for (1) variable power allocation and (2) constant power allocation strategies. These power allocation strategies make it possible to evaluate the capacity–robustness performance of the proposed scheme


Figure 12 Capacity – robustness performance for variable power allocation using ICAWD (left) and NCWD (right)

under constant fidelity (constant power allocation) and the capacity–fidelity performance under constant robustness (variable power allocation). To this end, up to seven subbands, 1 ≤ p_1 ≤ 7, are selected for watermark embedding under the two power allocation strategies. The robustness of the resulting watermarked audio clips is evaluated for an AWGN attack. For this analysis, the capacity of the proposed scheme is measured in bits per frame, where each frame consists of 2^13 samples, and five-level wavelet decomposition is applied for signal decomposition, which yields eight suitable subbands for watermark embedding. Under these settings, the proposed FSSS-based audio data hiding scheme can achieve an embedding capacity of 1 to 8 bits/frame, that is, approximately 0.12 × 10^-3 to 1.0 × 10^-3 bits/sample. Experimental results for the two power allocation strategies are as follows.

† Variable power allocation: in this case, the watermark embedding strength, \sigma_w^2, is determined by the selected subband. Each audio clip in the database was watermarked by inserting p_1 watermarks into p_1 subbands selected using the secret key, Ksb(i), i = 1, ..., p_1. The watermark strength for each subband was calculated from the masking threshold, Tm_i^j. The experimental results show that, on average, the embedding distortion in the low subband signals is larger than that in the high subband signals. This is because the high subband signals have relatively lower energy than the low subband signals and therefore admit only relatively weaker watermarks, which results in subband-dependent embedding distortion. To verify this observation, the watermark was embedded in p_1 subbands, p_1 = 1, ..., 7, and the corresponding embedding distortion, averaged over 4000 clips, was found to be \sigma_d^2 = {54.74, 49.68, 44.12, 42.29, 40.83, 39.62} (dB). This indicates that audio signals in general have higher energy in the low-frequency bands than in the high-frequency bands and that the HAS is relatively insensitive to distortion in stronger signals, which ultimately leads to stronger embedding in the lower subbands than in the higher subbands. The capacity–robustness plot for different embedding capacities and AWGN powers is given in Fig. 12, which shows that the proposed scheme exhibits approximately constant robustness under variable power allocation, with the power determined by the masking thresholds of the selected subbands. Fidelity assessment based on subjective degradation under variable power allocation also revealed that the watermarked audio clips were perceptually transparent for p_1 < 6 bits/frame, whereas higher embedding rates, p_1 ≥ 6, introduced perceptible (but not annoying) embedding distortion. Here, subjective degradation was assessed based on feedback from ten trained subjects (further details of the fidelity performance of the proposed scheme are provided in Section 6.3.2).

† Constant power allocation: in this case, the total embedding distortion is kept constant. There are many ways to set the total embedding distortion; however, to meet the fidelity requirement at a reasonable robustness level, we set the total watermark strength to the masking threshold estimated for the first subband, Tm_i^1. The first subband is chosen because the strongest watermark can be embedded there rather than in the higher subbands. The watermark strength for each subband to be watermarked was obtained by dividing its estimated masking threshold, Tm_i^j, j = 1, ..., 7, by the number of subbands to be watermarked, p_1. Simulation results show that the average embedding distortion remained approximately constant for p_1 = 1, ..., 7, \sigma_d^2 = {54.74, 55.32, 55.73, 55.93, 55.56, 55.97, 55.15} (dB), when p_1 bits of data were embedded per frame in every audio clip in the database. The capacity–robustness plot for different values of p_1 and AWGN powers is given in Fig. 13.



Figure 13 Capacity–robustness performance for constant power allocation using ICAWD (left) and NCWD (right)

It can be observed from Fig. 13 that the robustness performance deteriorates significantly at higher embedding capacities. This observation is not surprising: embedding more data while keeping the embedding distortion constant means decreasing the watermark strength accordingly, which ultimately degrades robustness. Fidelity assessment based on subjective degradation revealed that all watermarked audio clips were perceptually transparent. This is expected, as the embedded watermarks are scaled well below the masking threshold and hence are imperceptible. Here again, the subjective degradation was determined based on feedback from ten trained subjects. It is important to mention that the proposed scheme naturally fits the 'spike model' of Moulin et al. [30] when used in the variable power allocation mode, as the scheme adaptively determines the watermark strength based on the selected subband. In addition, the maximum achievable per sample capacity of the proposed FSSS scheme equals that of the SS-based audio watermarking scheme of Swanson et al. [7]. However, the proposed scheme exhibits better security, robustness and fidelity than the audio watermarking schemes presented in [7, 8] without compromising the embedding capacity. In the following sections, we provide a performance comparison of the proposed scheme with the existing SS-based schemes presented in [7, 8].
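The capacity expressions (37)–(40) above are straightforward to evaluate numerically. The following is a minimal sketch assuming the per-subband variances are known; the function names are illustrative:

```python
import numpy as np

def c_blind(var_w, var_v, var_s):
    # Blind SS capacity, eq. (37): the host variance adds to the noise.
    return 0.5 * np.log2(1.0 + var_w / (var_v + var_s))

def c_fsss(var_d, var_v, var_s):
    # Eq. (40): parallel Gaussian channels, one per selected subband;
    # var_d, var_v and var_s are per-subband arrays of the embedding
    # distortion, noise and host variances.
    var_d, var_v, var_s = map(np.asarray, (var_d, var_v, var_s))
    return float(np.sum(0.5 * np.log2(1.0 + var_d / (var_v + var_s))))
```

With a single subband, (40) reduces to the single-channel expression (38), and making the host term larger lowers the capacity, as the blind-embedding discussion predicts.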

6.3 Performance comparison

The performance of the proposed FSSS-based audio watermarking scheme is compared with existing SS-based audio watermarking schemes, namely the robust audio watermarking scheme of Wu et al. [8] and the SS-based audio watermarking scheme of Swanson et al. [7]. A comparison based on the fidelity and robustness of these schemes is provided here. For the sake of fair comparison, the same amount of data is embedded in each audio clip using all three schemes.

6.3.1 Parameter settings: Simulation results presented in this section are based on the following settings:

† the embedding capacity was set to 1 bit per 512 samples; for the audio watermarking schemes presented in [7, 8], one bit of information was embedded in every 512 samples, whereas for the proposed FSSS scheme one bit was embedded in every 256 samples with a repeating factor of 2;

† for the scheme of Wu et al. [8], q (the number of significant coefficients carrying the watermark) was set to 256;

† for FSSS embedding, the audio frame size (2^l N_1) was set to 2^13 for fs = 44.1 kHz;

† five-level wavelet decomposition was used;

† only one subband was selected at random from the eight target subbands for watermark embedding, except in the case of multiple watermark embedding;

† the target false positive rate, P_fp, was set to 3.5 × 10^-4, which corresponds to a decoding threshold Th = 0.15 (using (34));

† two hundred independent watermarks were embedded in each audio clip using each watermarking scheme; and

† the decoding bit error probability was averaged over 4000 watermarked audio clips.

6.3.2 Fidelity performance comparison: The objective of this section is to compare the fidelity of the proposed scheme with that of the existing audio watermarking schemes. Fidelity is evaluated using both objective and subjective degradation. The subjective degradation is assessed using feedback from ten trained subjects, whereas the objective degradation is calculated using the MSE distortion in the watermarked audio clip. Various distortion metrics have been proposed in the data hiding literature to gauge the fidelity of a given watermarking scheme [2], among which the MSE is commonly used for objective degradation. Although the MSE-based embedding distortion measure does not reflect the exact level of perceptual distortion, it does give a quantitative indication of the strength of the embedded watermark. The relative embedding distortion measure, also known as the signal-to-watermark ratio, is given as

D_{SWR} = 10 \log_{10} (\sigma_x^2 / \sigma_d^2)   (41)

where s 2e the MSE calculated using (4), is used to evaluate the objective fidelity performance of the proposed scheme. To calculate the fidelity performance, each audio clip listed in Table 1 was watermarked using the proposed FSSS-based watermarking scheme described in Section 3 using the same parameter settings as outlined in Section 6.3.1. The same set of audio clips was also watermarked using the audio watermarking schemes presented in [7, 8]. For the sake of fair comparison, the per sample embedding capacity for all three schemes was set to 1/512 bits per sample. The embedding distortion in terms of MSE because of each watermark embedding scheme is listed in Table 2. Table 2 shows that on average the proposed scheme has the best fidelity performance based on the MSE criterion, and the scheme by Swanson et al. [7] is the worst. The fidelity performance of these schemes based on subjective degradation was assessed using International Telecommunication Union (ITU) -R Recommendations BS.1116 [31]. During a subjective quality assessment test based on the ITU-R Recommendations BS.1116 [31], the listener is free to listen to any of three audio clips A, B and C. Audio clip A is known to be the reference and clips B and C may be either the reference clip or the test signal. In our subjective testing setup, the clips B and C were assigned randomly for a each trial. After training each subject, the subject was asked to rate clips B and C relative

Table 2 Watermark embedding distortion comparison

Embedding distortion (dB)
                      Pop1     Melodic   Pop2    Classical   Vocal
Swanson et al. [7]    27.85    22.82     24.48   1.7924      31.56
Wu et al. [8]         21.29    22.97     30.14   38.72       27.23
FSSS                  27.75    44.0      55.0    27.55       13.24
IET Inf. Secur., 2008, Vol. 2, No. 4, pp. 129 – 150 doi: 10.1049/iet-ifs:20070145

Table 3 Subjective fidelity performance

                      Pop1   Melodic   Pop2   Classical   Vocal
FSSS                  4.2    3.60      4.2    4.1         4.8
Swanson et al. [7]    3.1    3.2       3.0    3.2         3.0
Wu et al. [8]         2.8    2.8       2.5    2.0         1.9

to clip A, based on a continuous five-grade impairment scale, that is, 1.0 (very annoying), 2.0 (annoying), 3.0 (slightly annoying), 4.0 (perceptible but not annoying) and 5.0 (imperceptible), as required by ITU-R Recommendation BS.562-3 [32]. In our experimental setup, the original audio clip was used as the reference clip and the watermarked clip as the test clip; the reference clip and the test clip were assigned to B and C randomly with equal probability. The subjective fidelity performance of the five watermarked audio clips, based on the responses averaged over ten subjects and expressed on the ITU-R continuous five-grade impairment scale [32], is listed in Table 3. It can be observed from Table 3 that the scheme of Wu et al. [8] exhibits the worst fidelity performance. It is interesting to note that although the scheme of Wu et al. [8] exhibits better fidelity performance than the scheme of Swanson et al. [7] based on objective degradation, it exhibits inferior fidelity performance based on subjective degradation. This apparent inconsistency may be attributed to the fact that the scheme of Wu et al. [8] does not incorporate the HAS during the watermark embedding process.
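The objective measure used above, the SWR of (41), can be sketched in code. The following is an illustrative numpy snippet, not code from the paper: the function name swr_db and the synthetic signals are our own, and the MSE of the embedding perturbation stands in for σ_d^2.

```python
import numpy as np

def swr_db(x, y):
    """Signal-to-watermark ratio of (41): 10*log10(sigma_x^2 / sigma_d^2),
    where sigma_d^2 is the MSE between the host x and the watermarked y."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    mse = np.mean((x - y) ** 2)        # embedding distortion sigma_d^2
    return 10.0 * np.log10(np.var(x) / mse)

# A stronger (louder) watermark perturbation yields a lower SWR.
rng = np.random.default_rng(0)
host = rng.standard_normal(4096)
weak = swr_db(host, host + 0.01 * rng.standard_normal(4096))
strong = swr_db(host, host + 0.1 * rng.standard_normal(4096))
```

A higher SWR in Table 2 therefore corresponds to a weaker, less intrusive embedding perturbation.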

6.3.3 Robustness performance comparison: The goal of this section is to compare the robustness performance of the proposed scheme with that of existing audio watermarking schemes. Robustness performance is evaluated for standard benchmark attacks, such as the StirMark audio benchmark attacks [12, 13], and common audio degradations, such as addition of white Gaussian noise (AWGN), resampling, multiple watermark embedding, addition of coloured noise (ACNA), filtering, and time- and frequency-scaling attacks. To compare the robustness performance of the proposed scheme with the audio watermarking schemes presented in [7, 8], three sets of watermarked audio clips were generated by embedding 200 independent watermarks in every audio clip in the database; one set of watermarked audio clips was generated using the proposed FSSS-based scheme discussed in Section 4, and the remaining two sets using the schemes presented in [7, 8]. Watermarks were embedded using the same settings outlined in Sections 6.1.1 (for FSSS-based embedding) and 6.3.1 (for embedding using the schemes presented in [7, 8]). The watermarked audio clips were then subjected to the StirMark audio benchmark attacks [12] and common audio degradation attacks.
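The robustness metric reported throughout this section, the decoding bit error probability Pe, can be sketched as follows. This is an illustrative numpy computation; the bit-flip channel below is a hypothetical stand-in for an actual attack, not the paper's attack model.

```python
import numpy as np

def decoding_bit_error_probability(sent_bits, decoded_bits):
    """P_e: fraction of embedded watermark bits decoded incorrectly."""
    sent = np.asarray(sent_bits)
    decoded = np.asarray(decoded_bits)
    return float(np.mean(sent != decoded))

# P_e is reported averaged over many watermarked clips (as in Table 4);
# here a hypothetical channel flips each bit with probability 0.1.
rng = np.random.default_rng(1)
per_clip_pe = []
for _ in range(100):
    bits = rng.integers(0, 2, size=64)
    flips = rng.random(64) < 0.1
    decoded = np.where(flips, 1 - bits, bits)
    per_clip_pe.append(decoding_bit_error_probability(bits, decoded))
avg_pe = float(np.mean(per_clip_pe))
```

Pe = 0 means every embedded bit survived the attack, while Pe near 0.5 means the decoded bits are no better than chance.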

& The Institution of Engineering and Technology 2008

Table 4 Robustness performance comparison for StirMark audio benchmark attacks

Decoding bit error probability, Pe

StirMark attack     Swanson et al. [7]   Wu et al. [8]   FSSS with NCWD   FSSS with ICAWD
addbrumm_100        0.0387               0.173           0.088            0.0091
addbrumm_1100       0.0387               0.3503          0.088            0.0091
addbrumm_2100       0.0419               0.4355          0.088            0.0091
addbrumm_3100       0.0452               0.4854          0.1023           0.0091
addbrumm_4100       0.0588               0.5624          0.1257           0.0091
addbrumm_5100       0.0995               0.638           0.1412           0.0091
addbrumm_6100       0.1683               0.6517          0.1477           0.0091
addbrumm_7100       0.2993               0.7049          0.1904           0.0091
addbrumm_8100       0.478                0.8064          0.2228           0.0091
addbrumm_9100       0.5913               0.8325          0.2293           0.0234
addbrumm_10100      0.7886               0.8818          0.2293           0.0491
addfftnoise         1                    1               1                1
addnoise_100        0.0387               0.1875          0.088            0.0491
addnoise_300        0.0387               0.2089          0.088            0.0491
addnoise_500        0.0355               0.2595          0.088            0.0491
addnoise_700        0.0355               0.302           0.088            0.0634
addnoise_900        0.0355               0.3566          0.088            0.0634
addsinus            0.0387               0.372           0.088            0.0634
amplify             0.0387               0.1418          0.088            0.0491
compressor          0.0452               0.1418          0.088            0.0491
copysamples         1                    1               0.529            0.1749
cutsamples          1                    0.9146          0.791            0.4835
dynnoise            0                    0.1756          0.1056           0.0667
echo                0.2045               0.6069          0.0818           0.0667
exchange            1                    0.2231          0.1056           0.0818
extrastereo_30      0                    0.142           0.1056           0.0818
extrastereo_50      0                    0.142           0.1056           0.0818
extrastereo_70      0                    0.142           0.1056           0.0818
fft_hlpass          0.5513               0.1889          0.1074           0
fft_invert          1                    0.142           0.1056           0.0818
fft_real_reverse    0                    0.142           0.1056           0.0818
fft_stat1           0.2998               0.5884          0.1295           0.0238
fft_test            0.3301               0.6161          0.1056           0.0238
flippsample         0.3735               0.4466          0.1281           0.0725
invert              1                    0.199           0.088            0.0491
lsbzero             0                    0.142           0.1056           0.0818
normalize           0.0387               0.1418          0.088            0.0673
rc_highpass         0.0226               0.1325          0.0945           0.0491
rc_lowpass          0.115                0.1585          0.088            0
smooth              1                    1               1                1
resample            0.802                0.1494          0.1056           0
smooth2             1                    0.192           0.1056           0
stat1               0.1851               0.1305          0.1056           0
stat2               0.066                0.142           0.1056           0.0818
voiceremove         1                    1               1                1
zerocross           0.0387               0.313           0.088            0
zeroremove          1                    0.73            0.2759           0.0363
zerolength          1                    0.4239          0.2189           0.0607

† StirMark audio benchmark attacks: the watermarked audio clips were subjected to the StirMark audio benchmark attacks. The StirMark audio benchmark software, available at [13], was used with its default parameter settings. The decoding bit error probability, Pe, averaged over 100 watermarked audio clips, is given in Table 4 for the proposed FSSS-based scheme with the ICAWD and the NCWD and for the audio watermarking schemes presented in [7, 8]. It can be observed from Table 4 that the proposed FSSS-based scheme with either detector exhibits better detection performance than the schemes presented in [7, 8]. The better performance of the proposed scheme can be attributed to its better exploitation of the HAS.
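The NCWD referred to in Table 4 is based on normalised correlation. A simplified sketch of such a detector is given below; it uses a toy full-band additive-SS model rather than the paper's subband embedding, and the embedding strength 0.5 and threshold 0.1 are illustrative assumptions.

```python
import numpy as np

def nc_detect(received, watermark, threshold=0.1):
    """Declare the watermark present if the normalised correlation between
    the received signal and the candidate watermark exceeds a threshold."""
    r = np.asarray(received, dtype=float)
    w = np.asarray(watermark, dtype=float)
    rho = float(np.dot(r, w) / (np.linalg.norm(r) * np.linalg.norm(w)))
    return rho, rho > threshold

rng = np.random.default_rng(7)
host = rng.standard_normal(8192)            # host signal acts as noise
wm = rng.choice([-1.0, 1.0], size=8192)     # bipolar PN watermark
marked = host + 0.5 * wm                    # additive-SS embedding

rho_present, present = nc_detect(marked, wm)   # correlation is large
rho_absent, absent = nc_detect(host, wm)       # correlation is near zero
```

The host signal itself is the dominant noise term in rho, which is why the paper's reduced host interference improves detection.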

† Multiple watermark embedding: existing SS-based watermark embedding schemes [5-10] perform poorly under multiple watermark embedding attacks. As discussed in Section 4, the proposed scheme can simultaneously embed up to 2l - 1 watermarks without introducing inter-watermark interference. To test the performance of the proposed scheme and the counterpart audio watermarking schemes presented in [7, 8] for multiple watermark embedding, three distinct watermarks were embedded into 20 audio clips in the data set used. The three watermarks were generated using three independent secret keys (say Kw1, Kw2 and Kw3). To generate a watermarked audio clip using the scheme of Wu et al. [8] or the scheme of Swanson et al. [7], an input audio clip was processed three times to embed the three watermarks (say w1, w2 and w3). To embed three watermarks using the proposed scheme, three different subband selection keys, Ksb1, Ksb2 and Ksb3, were used to select three subbands, and the three watermarks w1, w2 and w3 were embedded in the selected subbands (sb1, sb2 and sb3) of the input audio clip. The detection performance of these three schemes, in terms of the detection probability Pd averaged over 4000 watermarked clips, is given in Table 5.

It can be observed from Table 5 that for the watermark embedding schemes presented in [7, 8] the most recently embedded watermark, that is, w3, was almost always detectable, whereas w2 was detectable with low probability and w1 was almost undetectable. This is primarily because, in multiple watermark embedding using the schemes presented in [7, 8], each newly embedded watermark adds interference to the previously embedded watermark(s); it is therefore hard to detect the previously embedded watermarks with these schemes. In the case of the proposed FSSS-based watermark embedding, however, the inter-watermark interference is zero as long as the number of simultaneously embedded watermarks does not exceed the number of subbands available for watermark embedding, which yields a high detection probability for all three watermarks. In addition, it can be observed from Table 5 that for the scheme of Wu et al. [8] w2 is detectable 70% of the time. This higher detection rate can be explained as follows: the scheme of Wu et al. [8] selects only 50% of the DFT coefficients, based on their energy, for watermark embedding, so when w2 is embedded followed by w3, it is unlikely that embedding w3 modifies the same coefficients carrying w2; this accounts for the higher detectability of w2. Conversely, the low detectability of w2 in the case of the scheme of Swanson et al. [7] arises because that scheme modifies all coefficients during watermark embedding, so embedding w3 after w2 using [7] will virtually eliminate w2.

† Audio manipulation attacks: the robustness performance of the proposed scheme is also compared with that of the watermarking schemes presented in [7, 8] for common audio manipulation attacks, such as AWGN with power 0-200% of that of the audio clip, resampling with a resampling factor rf = 1, ..., 10, filtering, ACNA, and time- and frequency-scaling attacks. The detection performance in terms of Pe averaged over 4000 watermarked audio clips is given in Figs. 14-16. The following observations can be made from Figs. 14-16: (1) the proposed scheme performs significantly better than the audio watermarking schemes presented in [7, 8]; (2) the scheme of Swanson et al. [7] outperforms the scheme of Wu et al. [8] for low-degradation attacks, whereas the scheme of Wu et al. [8], which modifies only high-energy coefficients for watermark embedding, can resist high-degradation attacks; (3) the scheme of Swanson et al. [7] can successfully resist an HPF attack (Fig. 16), because it spreads the watermark over the entire audible frequency range and HPF attacks with a cut-off frequency of 1000 Hz cannot eliminate the watermark completely; hence it has better detection performance there.

Figure 14 Robustness performance comparison for an AWGN attack

Figure 15 Robustness performance comparison for a rescaling attack

Figure 16 Robustness performance comparison for filtering (BPF, LPF, HPF), rescaling (TSp, TSn, FSp, FSn) and coloured noise addition attacks
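The zero inter-watermark interference property described above can be illustrated with a toy model in which three disjoint coefficient sets stand in for the key-selected subbands. The index sets, embedding strength 0.3 and correlation detector below are illustrative assumptions, not the paper's actual filter bank.

```python
import numpy as np

rng = np.random.default_rng(3)
host = rng.standard_normal(3000)

# Stand-ins for the key-selected subbands: three disjoint coefficient sets,
# one per subband selection key (Ksb1, Ksb2, Ksb3).
subbands = [np.arange(0, 1000), np.arange(1000, 2000), np.arange(2000, 3000)]
watermarks = [rng.choice([-1.0, 1.0], size=1000) for _ in range(3)]

marked = host.copy()
for idx, w in zip(subbands, watermarks):
    marked[idx] += 0.3 * w      # each watermark touches only its own subband

def ncorr(x, w):
    return float(np.dot(x, w) / (np.linalg.norm(x) * np.linalg.norm(w)))

# Embedding w2 and w3 leaves w1's subband coefficients untouched.
only_w1 = host.copy()
only_w1[subbands[0]] += 0.3 * watermarks[0]
interference_free = bool(np.array_equal(marked[subbands[0]],
                                        only_w1[subbands[0]]))

# All three watermarks remain detectable by subband-wise correlation.
rhos = [ncorr(marked[idx], w) for idx, w in zip(subbands, watermarks)]
```

Because the subbands are disjoint, later embeddings cannot overwrite earlier ones, unlike the full-band schemes of [7, 8].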

Table 5 Multiple watermark embedding performance comparison

Watermark detection performance, Pd
                      W1     W2     W3
Swanson et al. [7]    0      0.05   0.99
Wu et al. [8]         0      0.7    0.99
FSSS with NCWD        0.99   0.99   0.99
FSSS with ICAWD       0.99   0.99   0.99

7 Conclusion

A novel audio watermarking scheme based on SS is presented in this paper. The proposed scheme inherits the salient features of conventional SS-based watermarking schemes. In addition, FSSS-based watermarking offers a natural method to factor in the HAS properties. It introduces lower embedding distortion and provides more secure embedding than the existing SS-based audio watermarking schemes [5-10]. Simulation results show that the proposed scheme is robust to common intentional and unintentional watermarking attacks when an ICA-based watermark detector is used. The detection performance of the conventional normalised correlation-based detector can be improved further by employing channel coding.

8 References

[1] MP3's: 'Friend or foe?', http://www.freeessays.cc/db/48/tvh112.shtml, accessed on 23 June 2008

[2] COX I.J., MILLER M.L., BLOOM J.A.: 'Digital watermarking' (Morgan Kaufmann, San Francisco, 2001)

[3] EGGERS J., GIROD B.: 'Informed watermarking' (Kluwer Academic Publishers, Norwell, 2002)

[4] CHEN B., WORNELL G.W.: 'Quantization index modulation: a class of provably good methods for digital watermarking and information embedding', IEEE Trans. Inf. Theory, 2001, 47, (4), pp. 1423–1443

[5] COX I.J., KILIAN J., LEIGHTON T., SHAMOON T.: 'Secure spread spectrum watermarking for multimedia', IEEE Trans. Image Process., 1997, 6, (12), pp. 1673–1687

[6] WOLFGANG R.B., PODILCHUK C.I., DELP E.J.: 'Perceptual watermarks for digital images and video', Proc. IEEE, 1999, 87, (7), pp. 1108–1126

[7] SWANSON M.D., ZHU B., TEWFIK A.H., BONEY L.: 'Robust audio watermarking using perceptual masking', Signal Process., 1998, 66, (3), pp. 337–355

[8] WU C.-P., SU P.-C., KUO C.-C.J.: 'Robust audio watermarking for copyright protection'. Proc. SPIE's 44th Ann. Meet., Adv. Signal Process. Alg., Arch., Impl. IX, July 1999, vol. 3807, pp. 387–397

[9] MALVAR H.S., FLORENCIO D.A.F.: 'Improved spread spectrum: a new modulation technique for robust watermarking', IEEE Trans. Signal Process., 2003, 51, (4), pp. 898–905

[10] KIROVSKI D., MALVAR H.S.: 'Spread spectrum watermarking of audio signals', IEEE Trans. Signal Process., 2003, 51, (4), pp. 1020–1033

[11] PÉREZ-GONZÁLEZ F., BALADO F., HERNÁNDEZ J.R.: 'Performance analysis of existing and new methods for data hiding with known-host information in additive channels', IEEE Trans. Signal Process., 2003, 51, (4), pp. 960–980

[12] STEINEBACH M., LANG A., DITTMANN J., PETITCOLAS F.A.P.: 'StirMark benchmark: audio watermarking attacks based on lossy compression', Proc. SPIE Secur. Watermarking Multimedia, 2002, 4675, pp. 79–90

[13] StirMark Benchmark for Audio: available at http://amsl-smb.cs.uni-magdeburg.de/smfa/main.php, accessed on 23 June 2008

[14] NOLL P.: 'MPEG digital audio coding', IEEE Signal Process. Mag., 1997, 14, (5), pp. 59–81

[15] PAN D.: 'A tutorial on MPEG/audio compression', IEEE Multimedia Mag., 1995, 2, (2), pp. 60–74

[16] MALIK H., KHOKHAR A., ANSARI R.: 'Robust audio watermarking using frequency selective spread spectrum theory'. Proc. IEEE Int. Conf. Acoustics, Speech, Signal Processing (ICASSP'04), May 2004, pp. 385–388

[17] ZWICKER E., FASTL H.: 'Psychoacoustics: facts and models' (Springer-Verlag, Berlin, 1999)

[18] SPORER T., BRANDENBURG K.: 'Constraints of filter banks used for perceptual measurements', J. Audio Eng. Soc., 1995, 43, (3), pp. 107–115

[19] JOHNSTON J.D.: 'Transform coding of audio signals using perceptual noise criteria', IEEE J. Sel. Areas Commun., 1988, 6, (2), pp. 314–323

[20] MALIK H., KHOKHAR A., ANSARI R.: 'Improved watermark detection for spread-spectrum based watermarking using independent component analysis'. Proc. 5th ACM Workshop on Digital Rights Management (DRM'05), November 2005, pp. 102–111

[21] POOR H.V.: 'An introduction to signal detection and estimation' (Springer-Verlag, New York, 1994, 2nd edn.)

[22] HYVARINEN A.: 'Independent component analysis in the presence of Gaussian noise by maximizing joint likelihood', Neurocomputing, 1998, 22, (1–3), pp. 49–67

[23] HANSEN L.K., PETERSEN K.B.: 'Monaural ICA of white noise mixtures is hard'. Proc. Symp. ICA and Blind Source Separation (ICA2003), 2003, pp. 815–820

[24] HYVARINEN A.: 'Survey on independent component analysis', Neural Comput. Surv., 1999, 2, pp. 94–128

[25] GRIBONVAL R., BENAROYA L., VINCENT E., FEVOTTE C.: 'Proposals for performance measurement in source separation'. Proc. 4th Int. Symp. Independent Component Analysis and Blind Source Separation, April 2003, pp. 763–768

[26] LI Y., POWERS D., PEACH J.: 'Comparison of blind source separation algorithms', in MASTORAKIS N. (ED.): 'Advances in neural networks and applications' (WSES, 2000), pp. 18–21

[27] http://sound.media.mit.edu/mpeg4/audio/sqam/, accessed on 23 June 2008

[28] COSTA M.: 'Writing on dirty paper', IEEE Trans. Inf. Theory, 1983, 29, (3), pp. 439–441

[29] COVER T.M., THOMAS J.A.: 'Elements of information theory' (Wiley-Interscience, 1991)

[30] MOULIN P., MIHCAK M.K.: 'A framework for evaluating the data-hiding capacity of image sources', IEEE Trans. Image Process., 2002, 11, (9), pp. 1029–1042

[31] ITU-R Rec. BS.1116 (rev. 1): 'Methods for the subjective assessment of small impairments in audio systems including multichannel sound systems', Int. Telecommun. Union, 1997

[32] ITU-R Rec. BS.562-3: 'Subjective assessment of sound quality', Int. Telecommun. Union, 1997