Digital Signal Processing 19 (2009) 809–814
Contents lists available at ScienceDirect
Digital Signal Processing www.elsevier.com/locate/dsp
Robust audio watermarking using improved TS echo hiding
Yousof Erfani ∗, Shadi Siahpoush
Islamic Azad University–Dezful Branch, Dezful, Iran
Article info
Article history: Available online 16 April 2009
Keywords: Cepstrum; Echo hiding; Time spread echo hiding

Abstract
A novel, content-based improved time spread echo hiding method (ITS) is presented. The decoder determines the watermark bit from the value of a correlation amount, and the echo kernel embeds the watermark bit into the whole signal. The system is cepstral-content based: the contribution of the original signal's cepstrum to the decoder error is removed, which improves the detection rate considerably. Experimental results, expressed as detection error rates, show good robustness against common signal processing attacks in comparison with conventional echo hiding methods. Good results were also obtained for watermark inaudibility in a mean opinion score (MOS) test and in SNR comparisons. © 2009 Elsevier Inc. All rights reserved.
1. Introduction
Digital watermarking is an important technique for protecting digital media content, including audio, image, video and even text, through the insertion of a hidden copyright message into the media [1–6]. For audio content, several algorithms have been proposed, such as echo hiding [7–10], spread spectrum modulation [11–13], quantization index modulation (QIM) [14,15], pitch scaling [16] and time scale modification [17], among others [18,19]; there are also many attacks against these systems that expose their lack of robustness [20,21]. Because of their simplicity and the slight distortion they introduce to the original signal, echo hiding systems are advantageous for watermarking applications. In conventional echo hiding systems, a single [7] or double [9] echo is inserted into the audio signal, and the watermark bit is selected by the echo delay. The detector of such systems uses cepstrum analysis to detect the embedded echo delay(s) and, correspondingly, the watermark bits [7]. Despite the many benefits of single and double echo hiding systems, they have a serious problem: security. They embed only the watermark bit and do not use any symmetric or public key. Their decoder is open, so any unauthorized receiver (i.e. a pirate) can detect the watermark bit; hence these systems are not appropriate for fully robust watermarking [1,2].
Time spread echo hiding (TS) [8] is a stronger approach than conventional echo hiding. It solves the security problem of conventional echo hiding while preserving good audio quality. Despite these benefits, TS echo hiding systems still present two essential problems. First, the watermark does not permeate the entire signal; it is inserted after a particular delay in each segment. This can be an essential weakness for watermarking security [18].
The second problem is erroneous detection even when there are no attacks; as in conventional echo hiding, it is caused by the effect of the original audio signal on the detection process. Here, we solve the problems mentioned above by introducing a new TS echo hiding method, which differs substantially from the conventional TS echo watermarking system. To solve the first problem, the watermark bit is not tied to a delay but is a sign bit b ∈ {1, −1} to be detected. The receiver does not distinguish this embedded
∗ Corresponding author. E-mail addresses: [email protected] (Y. Erfani), [email protected] (S. Siahpoush).
1051-2004/$ – see front matter © 2009 Elsevier Inc. All rights reserved. doi:10.1016/j.dsp.2009.04.003
Y. Erfani, S. Siahpoush / Digital Signal Processing 19 (2009) 809–814
bit by locating a maximum peak in the correlation signal between the cepstrum and the pseudo-random noise. Instead, the proposed method detects the watermark bit from the value of the correlation at the receiver. To solve the second problem, we first change the encoder and decoder of the proposed system so that the effect of the original audio is eliminated at the detector; secondly, we move this effect to the encoder stage, as in ISS watermarking [12]. The detector then recovers the watermark bit without error in no-attack environments. As a result, robustness to signal processing attacks increases while audio quality is preserved, as the experimental results show. The imperceptibility of the proposed method is investigated via a listening test and SNR values. In Section 2, we discuss the basics of TS echo hiding watermarking. In Section 3, we present a new design for this method, and in Section 4 we improve the proposed system to an accurate content-based system. Experimental results are discussed in Section 5, and Section 6 concludes the paper.

2. TS echo hiding
In echo hiding systems, each segment of the original audio signal is convolved with a kernel to produce the watermarked signal. The kernel for TS [8] is
$$h(n) = \delta(n) + \alpha\, p(n-d). \tag{1}$$
Here p(n) is a pseudo-random noise sequence with amplitude ±1, δ(n) is the Dirac delta function, α is a small echo coefficient, and d is a delay chosen between two values according to whether a zero or a one bit is embedded. With this kernel, the added echo resembles a faint natural room echo of the original signal and is therefore less objectionable to the ear. Generating p(n) from a key by means of a linear shift register makes the algorithm key-dependent and secure. If x(n) denotes a segment of the original signal, the watermarked segment y(n) is
$$y(n) = x(n) + \alpha \sum_{i=1}^{N} p(i)\, x(n-d-i), \qquad 0 < \alpha \ll 1. \tag{2}$$
Here N is the length of p(n). We define the complex cepstrum transform as
$$c_y(n) = F^{-1}\{\log_e(F\{y[n]\})\}. \tag{3}$$
We can use this transform to detect the watermark at the receiver [8]:
$$c_y(n) = c_x(n) + \alpha\, p(n-d), \qquad \text{if } \alpha \ll 1/\max|P(w)|. \tag{4}$$
Here P(w) is the Fourier transform of p(n). After the authorized receiver has generated the correct p(n), the final step is to take the cross-correlation between c_y(n) and p(n):
$$cc(n) = c_y(n) \otimes p(n) = n_s(n) + \alpha\, p(n) \otimes p(n-d), \qquad n_s(n) = c_x(n) \otimes p(n). \tag{5}$$
Here ⊗ denotes cross-correlation, and cc(n) is expected to have a peak at d, so the receiver decides the embedded bit from the delay at which this peak occurs. A major problem is the first term of (4), c_x(n), which produces the noise term n_s(n) in (5) and may cause detection errors. A further problem is that the watermark is not embedded into the whole signal. To solve these two problems, we first introduce a novel TS echo hiding method in the next section and then improve its decoder.

3. Improved time spread echo hiding using real cepstrum
The encoder kernel and the encoded signal of (2) are changed to
$$h(n) = \delta(n) + \alpha\, b\, p(n), \tag{6}$$
$$y(n) = x(n) + \alpha\, b \sum_{i=1}^{N} p(i)\, x(n-i), \qquad 0 < \alpha \ll 1. \tag{7}$$
Unlike conventional TS echo hiding, the p(n) sequence is embedded into the entire original audio signal, from the first sample to the last. Here b ∈ {1, −1} is the bit to be embedded and later decoded, and N is the audio signal length. After applying the real cepstrum transform
$$\mathrm{Rceps}_y(n) = F^{-1}\{\log_e |F\{y[n]\}|\} \tag{8}$$
to the watermarked signal, we obtain [8]
$$\mathrm{Rceps}_y(n) = \mathrm{Rceps}_x(n) + \frac{\alpha b}{2}\, p(n) + \frac{\alpha b}{2}\, p(-n). \tag{9}$$
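The relation in (9) can be checked numerically. The following is a minimal sketch, not the authors' implementation: it assumes a white-noise host in place of real audio, a NumPy random generator in place of the keyed shift register, and a sequence p(n) that is shorter than the segment so that the p(n) and p(−n) terms do not overlap. The function names `real_cepstrum` and `embed_its` are illustrative.

```python
import numpy as np

def real_cepstrum(signal, nfft):
    """Real cepstrum as in (8): inverse FFT of the log magnitude spectrum."""
    spectrum = np.fft.rfft(signal, n=nfft)
    log_mag = np.log(np.abs(spectrum) + 1e-12)  # tiny offset guards against log(0)
    return np.fft.irfft(log_mag, n=nfft)

def embed_its(x, p, alpha, b):
    """Embed bit b in {+1, -1} with the kernel of (6): h = delta + alpha*b*p."""
    kernel = np.concatenate(([1.0], alpha * b * p))  # p(i) sits at lags i = 1..N
    return np.convolve(x, kernel)                    # linear convolution, as in (7)

rng = np.random.default_rng(0)           # fixed seed stands in for the secret key
x = rng.standard_normal(8192)            # synthetic host segment (assumption)
p = rng.choice([-1.0, 1.0], size=1023)   # +/-1 pseudo-random sequence
alpha, b = 0.006, -1

y = embed_its(x, p, alpha, b)
nfft = len(y)                            # same FFT length for host and watermarked
diff = real_cepstrum(y, nfft) - real_cepstrum(x, nfft)

# According to (9), diff[n] should be close to (alpha*b/2) * p[n] for n = 1..N.
slope = float(np.mean(diff[1:len(p) + 1] * p))
print(slope, alpha * b / 2)
```

The two printed values should agree closely as long as α·max|P(w)| stays well below 1, which is the small-echo condition stated with (4).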
Fig. 1. Proposed encoder for time spread echo hiding system, PNG is pseudo-random noise generator.
Fig. 2. Proposed decoder for time spread echo hiding system, PNG is pseudo-random noise generator.
We define the normalized correlation amount between two signals u(n) and v(n) as
$$C = \frac{1}{N} \sum_{n=1}^{N} u(n)\, v(n). \tag{10}$$
Since the correlation amount is computed only for n > 0, the rightmost term of (9) does not contribute to the correlation amount in (10). Computing the normalized correlation amount between Rceps_y(n) and p(n) gives
$$\begin{aligned}
C &= \frac{1}{N} \sum_{n=1}^{N} p(n)\, \mathrm{Rceps}_y(n) = \frac{1}{N} \sum_{n=1}^{N} \Big[ \mathrm{Rceps}_x(n)\, p(n) + \tfrac{1}{2}\, \alpha\, b\, p(n)\, p(n) \Big] \\
  &= \frac{1}{N} \sum_{n=1}^{N} \mathrm{Rceps}_x(n)\, p(n) + \tfrac{1}{2}\, \alpha\, b.
\end{aligned} \tag{11}$$
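The sign-based decision implied by (10) and (11) can be sketched as follows. This is an illustration under stated assumptions rather than the authors' decoder: the host is synthetic white noise, p(n) comes from a seeded NumPy generator, and `decode_bit` is a hypothetical helper name.

```python
import numpy as np

def normalized_correlation(u, v):
    """Normalized correlation amount of (10): (1/N) * sum(u[n] * v[n])."""
    return float(np.mean(u * v))

def decode_bit(y, p):
    """Return (bit, C); the bit is the sign of the correlation amount in (11)."""
    nfft = len(y)
    log_mag = np.log(np.abs(np.fft.rfft(y, n=nfft)) + 1e-12)
    rceps = np.fft.irfft(log_mag, n=nfft)                # real cepstrum of y
    c = normalized_correlation(rceps[1:len(p) + 1], p)   # lags n = 1..N only
    return (1 if c >= 0 else -1), c

rng = np.random.default_rng(1)
x = rng.standard_normal(8192)            # synthetic host segment (assumption)
p = rng.choice([-1.0, 1.0], size=1023)   # shared pseudo-random sequence
alpha = 0.006

decoded = {}
for b in (1, -1):
    y = np.convolve(x, np.concatenate(([1.0], alpha * b * p)))  # encoder of (7)
    decoded[b] = decode_bit(y, p)
    print(b, decoded[b])
```

In this sketch, C deviates from αb/2 by the host term of (11), so the decision is reliable only while that term stays smaller than α/2 in magnitude.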
The correlation amount in the last line of (11) has two terms. The left term is a noise term caused by the original signal; it is the source of error in the detection process. The right term carries the watermark bit b. The detector determines the watermark bit from the sign of the correlation amount. The larger the parameter α, the more robust the watermark, but the more audible it becomes. This system differs considerably from TS echo hiding: in the encoder, the watermark is spread over the whole signal and the watermark bit is a sign bit rather than a particular delay, and the decoder relies on a correlation amount instead of a peak. The encoder and decoder of the proposed system are shown in Figs. 1 and 2. In these figures, PNG (pseudo-random noise generator) is a block that uses some bits as a key to generate a pseudo-random stream. The left term of the decoding equation (11) is the main source of detection error, even in no-attack environments. We therefore remove it from the decoder and move it to the encoder, changing the encoder stage to
$$y(n) = x(n) + (\alpha\, b - \lambda) \sum_{i=1}^{N} p(i)\, x(n-i), \quad 0 < \alpha \ll 1, \qquad \lambda = \frac{1}{N} \sum_{n=1}^{N} \mathrm{Rceps}_x(n)\, p(n). \tag{12}$$
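The content-based encoder of (12) can be sketched in the same style. The code below is an assumption-laden illustration: the host is synthetic, the FFT length used to compute λ is chosen to match the length of the watermarked segment, and the helper names are hypothetical. It prints λ and the resulting correlation amount next to the target αb/2 so that the effect of the compensation can be inspected; it does not attempt to verify any closed-form result.

```python
import numpy as np

def cepstral_correlation(signal, p, nfft):
    """(1/N) * sum over n = 1..N of Rceps(n) * p(n), Rceps being the real cepstrum."""
    log_mag = np.log(np.abs(np.fft.rfft(signal, n=nfft)) + 1e-12)
    rceps = np.fft.irfft(log_mag, n=nfft)
    return float(np.mean(rceps[1:len(p) + 1] * p))

def embed_content_based(x, p, alpha, b):
    """Encoder of (12): the echo coefficient is alpha*b - lambda."""
    nfft = len(x) + len(p)                   # length of the watermarked segment
    lam = cepstral_correlation(x, p, nfft)   # host term lambda of (12)
    kernel = np.concatenate(([1.0], (alpha * b - lam) * p))
    return np.convolve(x, kernel), lam

rng = np.random.default_rng(2)
x = rng.standard_normal(8192)            # synthetic host segment (assumption)
p = rng.choice([-1.0, 1.0], size=1023)
alpha = 0.006

report = {}
for b in (1, -1):
    y, lam = embed_content_based(x, p, alpha, b)
    c = cepstral_correlation(y, p, len(y))
    report[b] = (lam, c)
    print(f"b={b:+d}  lambda={lam:+.5f}  C={c:+.5f}  target={alpha * b / 2:+.5f}")
```

Because λ depends on the host segment, the encoder must recompute it for every segment before embedding.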
For a low-distortion watermarking system with a less audible watermark, the value of |αb − λ| should be very small. If we rewrite the decoder equations (6), (7) and (8) for this system, the correlation amount becomes
$$C = \tfrac{1}{2}\, \alpha\, b. \tag{13}$$
It is clear from (13) that, in no-attack environments, the noise term has been removed from the correlation amount. Depending on b, the correlation amount is positive or negative, and hence the decoder recovers the embedded bit exactly. To avoid the distorting effect of the imaginary part of the correlation amount that the complex cepstrum would introduce in relation (11), we had to use the real cepstrum of the signal. Another approach for increasing the correlation amount at the decoder is to use the complex cepstrum and instead change the correlation amount to the
Table 1. λ and correlation amount comparison for 5 host signals when α = 0.006.

| Host signal | Genre           | λ (M1)  | λ (M2) | C (M1) | C (M2) |
|-------------|-----------------|---------|--------|--------|--------|
| S1          | Speech          | 0.00012 | 0.0018 | 0.003  | 0.006  |
| S2          | Persian singing | 0.00016 | 0.0021 | 0.003  | 0.006  |
| S3          | Violin          | 0.00095 | 0.003  | 0.003  | 0.006  |
| S4          | Orchestra       | 0.00048 | 0.0020 | 0.003  | 0.006  |
| S5          | Persian lute    | 0.00074 | 0.0023 | 0.003  | 0.006  |
real correlation amount, rewriting all formulas from the beginning of the previous section onwards. This approach is discussed in the next section.

4. Improved time spread echo hiding using complex cepstrum
The complex cepstrum was defined in (3). Applying this transform to relation (7) and expanding it gives [8]
$$c_y(n) = c_x(n) + \alpha\, b\, p(n). \tag{14}$$
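After computing the real part of the normalized correlation amount between c_y(n) and p(n), we have
$$\begin{aligned}
C &= \mathrm{Re}\Big\{ \frac{1}{N} \sum_{n=1}^{N} p(n)\, c_y(n) \Big\} = \mathrm{Re}\Big\{ \frac{1}{N} \sum_{n=1}^{N} \big[ c_x(n)\, p(n) + \alpha\, b\, p(n)\, p(n) \big] \Big\} \\
  &= \frac{1}{N}\, \mathrm{Re}\Big\{ \sum_{n=1}^{N} c_x(n)\, p(n) \Big\} + \alpha\, b.
\end{aligned} \tag{15}$$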
The encoder then has the same form as (12), while λ and C become
$$\lambda = \frac{1}{N}\, \mathrm{Re}\Big\{ \sum_{n=1}^{N} c_x(n)\, p(n) \Big\}, \qquad C = \alpha\, b. \tag{16}$$
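As a quick arithmetic check against Table 1, which uses α = 0.006 and |b| = 1, the noise-free correlation magnitudes given by (13) for the real cepstrum method (M1) and by (16) for the complex cepstrum method (M2) are:

```latex
\begin{aligned}
|C_{\mathrm{M1}}| &= \tfrac{1}{2}\,\alpha\,|b| = \tfrac{1}{2}\,(0.006) = 0.003, \\
|C_{\mathrm{M2}}| &= \alpha\,|b| = 0.006 .
\end{aligned}
```

These values agree with the C (M1) and C (M2) columns of Table 1.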
As in the previous section, the left term of (15), which is the λ value, must be eliminated at the decoder and transferred to the encoder. The correlation amount is twice that of the previous method. Since this value is a measure of system robustness, the complex cepstrum method is considerably better than the real cepstrum method of Section 3.

5. Proposed methods assessment and experimental result
By removing the effect of the original audio signal in the decoding stage, we obtain an accurate detector for TS echo hiding watermarking. The proposed algorithms are more robust against signal processing attacks because the effect of the original signal in the decoder, as a source of misdetection, is much larger than that of signal processing attacks. The system can be made more robust against signal processing attacks by increasing the value of α. Regarding audibility, we add λ, a value related to the cepstral content of the original signal, to the echo coefficients in the embedding stage. The value of λ for the two proposed systems is smaller than about 0.01, and the value of α is in this range too. We compare the two proposed methods in terms of λ and C using the experimental results shown in Table 1. The cepstrum of the original signal is a decreasing function of the segment size N: increasing N causes a slight change in the cepstrum of the original audio signal and a decrease in the λ value. Therefore, for large segment sizes, audio quality improves at the cost of a small increase in computational load. We use 5 audio clips in our experiments: a speech clip with long silences, a clip containing only Persian singing with no instruments, a clip containing a single discrete instrument (Tar, a Persian lute), a clip containing a continuous instrument (violin) and a clip containing an orchestra (many instruments). A duration of 10 s of each clip is used.
The clips are sampled at 44.1 kHz with 16-bit quantization. After segmentation and Hanning windowing to reduce artifacts between neighboring segments, we apply the proposed watermarking scheme to each 1 s segment. The results are averaged over all segments and all 5 audio clips. We compare conventional TS with our proposed methods in these experiments. We use 100 and 110 bits for zero-bit and one-bit embedding, respectively, in conventional TS echo hiding. We use α = 0.006 for both proposed systems. Table 1 shows λ and C, as measures of distortion and robustness, for the 5 kinds of audio clips and for the two proposed methods: M1, ITS using the real cepstrum, and M2, ITS using the complex cepstrum. The robustness and audibility results of the proposed method, in comparison with conventional TS echo hiding, are shown in Tables 2 and 3. Our experiments were done under the following conditions:
Table 2. Robustness of the proposed system, ITS (M2), against signal processing attacks; the values are bit error rates (BER).

| Attack option       | TS    | FB    | ITS:M2 |
|---------------------|-------|-------|--------|
| No attacks          | 12.5% | 17.5% | 0%     |
| Mp3 attack          | 45%   | 65%   | 47%    |
| Quantization attack | 17.5% | 10.5% | 5.5%   |
| Re-sampling         | 21%   | 25%   | 15%    |
| Noise attack        | 19.5% | 40%   | 12.5%  |
Table 3. Subjective test and SNR comparison.

| Test option | TS   | FB   | ITS:M2 |
|-------------|------|------|--------|
| SNR         | 23.3 | 19.1 | 22.7   |
| MOS         | 4.6  | 4.1  | 4.7    |
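The paper does not state how the SNR values in Table 3 were computed. A common definition, assumed here for illustration only, is the ratio of host signal energy to the energy of the embedding distortion, expressed in decibels. The function name `snr_db` is hypothetical.

```python
import numpy as np

def snr_db(original, watermarked):
    """Signal-to-noise ratio in dB, treating (watermarked - original) as noise.

    Assumed definition: 10 * log10(sum(x^2) / sum((y - x)^2)).
    """
    original = np.asarray(original, dtype=float)
    watermarked = np.asarray(watermarked, dtype=float)
    if original.shape != watermarked.shape:
        raise ValueError("signals must have the same length")
    noise = watermarked - original
    return float(10.0 * np.log10(np.sum(original ** 2) / np.sum(noise ** 2)))

# Toy check: a distortion of 10% of the signal amplitude.
t = np.arange(44100) / 44100.0
x = np.sin(2 * np.pi * 440.0 * t)
y = 1.1 * x                      # noise is 0.1 * x, so the energy ratio is 100
value = snr_db(x, y)
print(round(value, 3))           # about 20 dB
```

When the watermarked segment comes from a linear convolution and is longer than the host, it would need to be truncated to the host length before this comparison.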
No attacks: closed loop (decoding immediately after encoding);
Mp3 attack: compressing the watermarked signal with MPEG-1 Layer 3 (MP3) and converting it back to the original wave format;
Re-sampling: re-sampling the watermarked signal at a 16 kHz sampling rate;
Re-quantization: quantizing the watermarked signal with 8 bits;
Noise attack: adding zero-mean noise with a Gaussian probability density function to the watermarked signal.
The BER was calculated as
$$\mathrm{BER} = \frac{\text{Number of erroneously decoded bits}}{\text{Number of embedded bits for the clip}}. \tag{17}$$
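The bit error rate of (17) is a simple ratio of counts. The sketch below shows one way to compute it for bits in {+1, −1}; the function name and the example bit sequences are made up for illustration.

```python
import numpy as np

def bit_error_rate(embedded_bits, decoded_bits):
    """BER of (17): erroneously decoded bits divided by embedded bits."""
    embedded = np.asarray(embedded_bits)
    decoded = np.asarray(decoded_bits)
    if embedded.shape != decoded.shape:
        raise ValueError("bit sequences must have the same length")
    return float(np.mean(embedded != decoded))

# Illustrative sequences (not experimental data): 2 of 8 bits differ.
embedded = [1, -1, 1, 1, -1, -1, 1, -1]
decoded = [1, -1, -1, 1, -1, 1, 1, -1]
ber = bit_error_rate(embedded, decoded)
print(f"BER = {ber:.1%}")
```

With one bit per 1 s segment and 10 s per clip, as in the experiments described above, each clip contributes 10 bits to the denominator.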
We use the ABX test project [22] for the MOS test evaluation, assigning the MOS grade '5' to the original audio clips. As Table 2 shows, our system is error-free in no-attack environments. In addition to the good quality of the proposed system, its robustness against signal processing attacks is better than that of conventional TS for every attack in Table 2 except Mp3 compression, where the two are comparable (47% versus 45%).

6. Conclusion and future work
Conventional time spread (TS) echo hiding has two security problems. First, the watermark is not inserted into the whole of the original signal; second, watermark bit detection is erroneous even in a no-attack environment. In this paper, we first proposed a new TS echo hiding watermarking system that solves the first problem through essential changes to the encoder and decoder of TS echo hiding. We then proposed a content-based TS echo hiding system, built on the first method, that solves the second problem. In this system, we removed the effect of the original audio signal from the blind decoder and shifted it to the encoding stage, so that the receiver can detect the watermark bit without error. Good experimental results were obtained for robustness against attacks and for audio signal quality. The authors are currently working on improving the audio quality of the proposed algorithm, using the analysis-by-synthesis approach described in [23].

References
[1] I.J. Cox, M.L. Miller, J.A. Bloom, Watermarking applications and their properties, in: Proceedings of the International Conf. on Info. Technology, Coding and Computing, ITCC2000, 2000, pp. 6–10.
[2] I.J. Cox, G. Doërr, T. Furon, Watermarking is not cryptography, in: Proceedings of the 5th Int. Workshop on Digital Watermarking, 2006, pp. 1–15.
[3] M. Peinado, F.A.P. Petitcolas, D. Kirovski, Digital rights management for digital cinema, Multimedia Syst. 9 (3) (2003) 228–238.
[4] D. Kundur, Diversity in watermarking: Insights and implications, IEEE Trans. Multimedia 8 (4) (2001) 46–52.
[5] M. Ghanbari, Sh. Ghanbari, Impact of video watermarking on video compression efficiency, in: 13th Iranian Conf. on Electrical Eng., 2005, pp. 1–7.
[6] S.H. Low, N.F. Maxemchuk, Performance comparison of two text marking methods, IEEE J. Sel. Areas Commun. 16 (1998) 561–572.
[7] D. Gruhl, W. Bender, Echo hiding, in: Proc. Info. Hiding Workshop, 1996, pp. 295–315.
[8] B.S. Ko, R. Nishimura, Y. Suzuki, Time-spread echo method for digital audio watermarking, IEEE Trans. Multimedia 7 (2) (2005) 212–221.
[9] H.J. Kim, Y.H. Choi, A novel echo hiding scheme with backward and forward kernels, IEEE Trans. Circuit Syst. Technol. 13 (8) (2003) 885–889.
[10] H.O. Oh, J.W. Seok, J.W. Hong, D.H. Youn, New echo embedding technique for robust and imperceptible AW, in: Proc. ICASSP, 2001, pp. 509–513.
[11] D. Kirovski, H. Malvar, Robust spread spectrum audio watermarking, in: IEEE Int. Conf. on Acoustics, Speech, and Signal Process., 2001, pp. 1345–1348.
[12] H.S. Malvar, D. Florencio, Improved spread spectrum: A new modulation technique for robust watermarking, IEEE Trans. Signal Process. 52 (4) (2003) 898–905.
[13] I.J. Cox, J. Kilian, T. Leighton, T. Shamoon, Secure spread spectrum watermarking for multimedia, IEEE Trans. Image Process. 6 (1997) 1673–1687.
[14] B. Chen, G.W. Wornell, Quantization index modulation: A class of provably good methods for digital watermarking and information embedding, IEEE Trans. Inform. Theory 47 (4) (2001) 1423–1443.
[15] P. Moulin, Y. Wang, Improved QIM strategies for Gaussian watermarking, in: IWDW, 2005, pp. 372–386.
[16] S. Shin, O. Kim, J. Kim, J. Choi, A robust audio watermarking algorithm using pitch scaling, in: IEEE Int. Conf. Digital Signal Processing, 2002, pp. 701–704.
[17] M.F. Mansour, A.H. Tewfik, Audio watermarking by time scale modification, in: ICASSP, vol. 3, 2001, pp. 1353–1356.
[18] M. Arnold, Audio watermarking: Features, applications and algorithms, in: IEEE Int. Conf. Multimedia and Expo, 2000, pp. 1013–1016.
[19] N. Cvejic, Algorithms for audio watermarking and steganography, PhD thesis, Oulu University, 2004, pp. 20–50.
[20] F.A.P. Petitcolas, R.J. Anderson, M.G. Kuhn, Attacks on copyright marking systems, in: Info. Hiding, Int. Workshop, IH’98, 1998, pp. 219–239.
[21] F.A.P. Petitcolas, Watermarking schemes evaluation, IEEE Trans. Signal Process. 17 (5) (2000) 58–64.
[22] ITU-R Rec. BS.1116, Methods for the subjective assessment of small impairments in audio systems including multi-channel sound systems, International Telecomm Union, Geneva, Switzerland, 1994.
[23] W.C. Wu, O.T.C. Chen, An analysis-by-synthesis echo watermarking method, in: Proc. IEEE Int. Conf. on Multimedia and Expo, 2004, pp. 1935–1938.
Yousof Erfani received the MSc and BSc degrees in Secure Communication and Electrical Engineering from Sharif University of Technology, Tehran, Iran in 2002 and 2004, respectively. From 2001 to 2003 he was with Zaeim Electronic Research Center, where he worked on block ciphers, stream ciphers and digital signatures. In 2004, he joined the multimedia lab of the Electronic Research Center of Sharif University of Technology to develop echo cancellation methods for VoIP applications. From 2005 to 2007, he was with the Information Technology Department of Iran Telecomm Research Center, where he worked on multimedia security and watermarking. Since then, he has been with Azad University of Dezful, where he teaches and carries out research projects. His current research interests include digital signal processing, cryptography and multimedia security.

Shadi Siahpoush received the BSc degree in Electrical Engineering from Azad University of Dezful, Dezful, Iran, in 2008. Since then, she has been a research assistant and MSc student at the Multimedia Lab of Azad University of Dezful. Her research interests include audio, image and video processing.