International Journal of Innovative Computing, Information and Control Volume 6, Number 3(B), March 2010

© 2010 ICIC International ISSN 1349-4198 pp. 1209–1220

A DIGITAL WATERMARK FOR STEREO AUDIO SIGNALS USING VARIABLE INTER-CHANNEL DELAY IN HIGH-FREQUENCY BANDS AND ITS EVALUATION

Kazuhiro Kondo and Kiyoshi Nakagawa
Graduate School of Science and Engineering, Yamagata University
4-3-16 Jonan, Yonezawa, Yamagata 992-8510, Japan
[email protected]; [email protected]

Abstract. We propose a watermarking algorithm for stereo audio signals which embeds data in the delay between the high-frequency channel signals. Since the stereo image perception of the human auditory system is known to be relatively insensitive to phase in the high-frequency region, we replace the two high-frequency channel signals with one middle channel signal, and embed data as the delay between the channels. Blind detection of the embedded data is possible using the correlation between the high-frequency channels at the detector. The embedded data rate was 7 to 30 bps depending on the host audio signal. Embedded data was detected with few or no errors for added noise at 20 dB SNR and above. MP3 and AAC coders, as well as sample rate conversion (both down-sampling and up-sampling), did not affect the data. Low-pass and high-pass filtering also did not affect the data when the cut-off frequencies were set to leave a reasonable portion of the frequency range intact. The embedded audio quality was shown to be source-dependent, with quality equivalent to MP3 coded audio for some sources, but significantly lower for others.

Keywords: Audio signal, watermark, stereo, inter-channel delay, blind detection

1. Introduction. With the recent advances in multimedia signal processing technology, it has become relatively easy to edit, copy, store and transmit digital multimedia content. This situation calls for intellectual ownership protection, and for detection of and protection against unauthorized tampering. Digital watermarking, or information hiding, is a strong candidate for such protection. Digital content can be "marked" with copyright information using watermarks, for instance. Although the bulk of research in information hiding (watermarking) has been on images and video [1, 2, 3, 4], we recently see new proposals for information hiding in speech and audio [5]. Most of these take advantage of the human auditory system (HAS) in order to hide information in the host speech or audio signal without causing significant perceptual disturbances. Temporal masking and frequency masking [6] are two of the properties of the HAS most often utilized to hide information in audio signals without noticeable degradation. These are also used in the MPEG audio coding standards [7, 8, 9] to compress audio signals without noticeable coding noise. Phase coding has been regarded as an effective method to embed data with minimum impact on the perceived quality [1, 10]. The phase of the original audio is replaced with a reference phase pattern according to the data. Since the HAS is known to be relatively insensitive to small phase distortions [11, 12, 13], this method can potentially embed data without significantly affecting the host signal quality. In this paper, we use this property to code the embedded data in the relative phase between the two stereo channels. Thus, we exploit the characteristics of the HAS together with the redundancy of the stereo audio signal.


There have been some prior attempts to embed data into stereo signals. Megias et al. proposed a robust watermark for stereo audio signals using the well-known patchwork algorithm [14]. Their goal was to make the watermark resistant to the stereo attacks defined in the Stirmark benchmark. Thus, they first mixed down the stereo signals into a mono signal, and applied patchwork to this mono signal. Then, they added back the stereo components to this embedded mono signal. Their algorithm did prove to be robust to Stirmark stereo attacks, but it does not exploit the stereo redundancy to its full extent. Our proposal exploits the insensitivity of the HAS to phase in the high-frequency regions, and embeds data in the relative phase of the stereo signals in this region.

Takahashi et al. defined a family of data embedding methods using phase modulation [15]. They defined different embedding methods for fingerprints (FP, unique digital media ID codes), copyright management information (CMI), and copy control information (CCI), depending on the need for robustness and blind detection. The first two require the original signal for detection, i.e., non-blind detection; CCI was designed for blind detection. They modulated the phase of the stereo signals in a sinusoidal pattern, and the inter-channel offset of the sinusoids was used to embed the data. They found that if the frequency ranges used to detect phase were adaptively chosen, the embedded CCI can be robust. This algorithm is similar to our proposal. However, we limit our alteration to the high-frequency regions, where the HAS is known to be relatively insensitive to phase.

Modegi proposed an extremely robust data embedding algorithm for stereo signals whose payload can be detected from stereo signals picked up with a cell-phone integrated microphone [16]. They totally or partially suppressed one channel of the low-frequency band stereo signal according to the embedded data.
They found that even with crude microphones, the embedded data can mostly be recovered. They also claim that the audio quality does not suffer serious degradation, although they have not shown any formal subjective quality test results. The quality degradation probably will not be severe if presented over loudspeakers, which is their primary target environment. However, if presented over headphones, the degradation may be catastrophic. Our goal was to embed data with "modest" degradation even with headphones.

In the next section, the proposed watermark algorithm is described. In section 3, the audio clips used for watermark evaluation are described. In section 4, the bit rate at which data was embedded in each audio clip is evaluated. In section 5, the robustness evaluation tests and their results are described; we evaluated the robustness of the proposed algorithm to four types of disturbances: added noise, audio compression, sample rate conversion, and filtering. In section 6, the subjective quality evaluation of the embedded audio and its results are described. Finally, in section 7, some discussion and conclusions as well as suggestions for future research are given.

2. Digital Watermarking of Stereo Audio Signals. In this section, we propose an audio watermark which embeds data by delaying either the left- or the right-channel of the stereo high-frequency signals. It is well known that spatial localization of sound sources in the HAS depends on both amplitude and phase characteristics at low frequencies, but mostly on amplitude characteristics at high frequencies [7, 17]. This property of the HAS is exploited in the intensity stereo coding used by some audio codecs, e.g., MPEG 1 Layer 3 (MP3) coders [18]. We also take advantage of this property, and alter the phase of the high-frequency band in the stereo channels according to the embedded data [19]. The amplitude envelope of these signals is preserved in order to keep this alteration imperceptible.


Figure 1. Proposed watermark embedding scheme.

2.1. Embedding. Fig. 1 shows the proposed watermark embedding algorithm. In this algorithm, the high-frequency channels are mutually delayed by a fixed number of samples according to the embedded data. The input signal is first split into two subbands using the subband split filter. The cut-off frequency was set to approximately one-tenth of the full bandwidth, i.e., approximately 2 kHz. We used a 128-tap FIR filter designed using the FFT windowing method [20]. The left- and right-channel high-band signal powers, PLH and PRH, are calculated on a frame-by-frame basis. Then, the high-band signals are averaged to give one middle high-frequency channel. This signal is copied to both the left- and right-channels, and is delayed on a frame-by-frame basis according to the embedded data. We arbitrarily chose to delay the left-channel only to mark a '0', and the right-channel to mark a '1'. A frame was divided into two equal parts, and only the latter half of the frame was delayed according to the data, as shown in Fig. 2. Thus, the first half of the frame is never delayed. We chose this arrangement to enable frame synchronization: if we embed a known data pattern regularly, e.g., consecutive zeros, there will always be a transition in the left-channel delay at the middle and end of a frame, which should be easily detectable using correlation functions, for instance. The amplitudes of each channel were then adjusted to match the power of the original signals, PLH and PRH. These signals are then added back to the low-frequency components to obtain full-bandwidth stereo audio signals. We found that some of the frames did not contain enough high-frequency components. Obviously, we are not able to encode data into these frames, so we simply chose not to embed data in them; no delay was introduced in these frames. An example is shown in the third frame in Fig. 2.
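The per-frame embedding step above can be sketched as follows. This is a simplified sketch only, assuming an integer delay and numpy; the paper additionally smooths delay transitions with a raised-cosine pattern and adds marker noise, and all function and variable names here are ours, not the paper's:

```python
import numpy as np

def embed_frame(xlh, xrh, bit, D=3, power_floor=1e-6):
    """Embed one bit into a frame of high-band stereo samples.

    Both channels are replaced by their average; the latter half of one
    channel is delayed by D samples ('0' -> left delayed, '1' -> right
    delayed); each channel is then rescaled to its original frame power.
    Frames with too little high-band power are skipped (no bit embedded).
    """
    plh = np.mean(xlh ** 2)          # original left high-band power
    prh = np.mean(xrh ** 2)          # original right high-band power
    if plh < power_floor or prh < power_floor:
        return xlh.copy(), xrh.copy(), False   # skip low-energy frame

    mid = 0.5 * (xlh + xrh)          # single middle high-band channel
    half = len(mid) // 2

    def delayed(x):
        y = x.copy()
        # delay only the latter half; the first half stays undelayed
        # so that frame boundaries remain detectable
        y[half + D:] = x[half:len(x) - D]
        y[half:half + D] = x[half]   # crude hold over the D-sample gap
        return y

    yl = delayed(mid) if bit == 0 else mid.copy()
    yr = delayed(mid) if bit == 1 else mid.copy()

    # restore the original per-channel powers
    yl *= np.sqrt(plh / np.mean(yl ** 2))
    yr *= np.sqrt(prh / np.mean(yr ** 2))
    return yl, yr, True
```

The hard delay switch at mid-frame shown here is exactly the abrupt change the paper smooths with the raised-cosine pulse of Fig. 3.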
The added delay should be small enough to be imperceptible. We empirically chose a maximum delay D of 3 samples. Even with this small amount of delay, abrupt changes are noticeable. Therefore, we smoothed the delay changes according to a raised cosine-like pulse pattern, shown in Fig. 3. Non-integer delays were introduced using a sinc delay filter with impulse response h(n):

h(n) = sin(π(n − 1 − τ)) / (π(n − 1 − τ))    (1)

where n is the sample number, and τ is the delay in samples, which can be fractional. Also, as will be described in the next section, we will use correlation to detect delay between channel signals. In other words, we compare the cross-correlation between channels with delay and without delay. If the former is larger than the latter, we detect embedded data. However, in some frames, the host signal itself showed relatively flat correlation,

even for long lags. In these frames, we will not be able to detect delays reliably using correlation. Thus, we added a small amount of band-pass filtered noise as markers to enhance the difference between the correlation with no delay and with delay. The amount of added noise was empirically chosen as 10% of the original frame power. However, even with the addition of markers, some frames do not display enough correlation difference. We chose not to embed data in these rare frames altogether.

Figure 2. Example of embedded data and channel delay.
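The sinc delay filter of Eq. (1) can be realized as a truncated FIR interpolator. The sketch below uses numpy; the tap count and the Hamming window are our illustrative choices, not specified in the paper:

```python
import numpy as np

def frac_delay(x, tau, ntaps=65):
    """Delay signal x by tau samples (tau may be fractional) using a
    truncated, Hamming-windowed sinc filter, per Eq. (1).
    np.sinc(t) = sin(pi*t)/(pi*t), matching the equation's form."""
    center = (ntaps - 1) // 2
    n = np.arange(ntaps)
    h = np.sinc(n - center - tau) * np.hamming(ntaps)  # shifted sinc taps
    y = np.convolve(x, h)
    return y[center:center + len(x)]  # compensate the filter's bulk delay
```

For a slowly varying signal the output closely matches the analytically delayed waveform; accuracy degrades near Nyquist, which is why a windowed, reasonably long filter is used.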


Figure 3. Raised cosine pulse delay pattern.

2.2. Detection. Fig. 4 shows the proposed watermark detection algorithm. This algorithm does not require the original source as reference, and thus performs blind detection. The input stereo audio signal with embedded data is first split into subbands using the same band split filter as in the embedding process. The low-frequency components are discarded. The high-frequency components are delayed by D samples, and frame boundaries are synchronized. Three normalized correlation values for the latter half of the frame are calculated:

1. Correlation between left- and right-channel, both with no delay (Corr1)
2. Correlation between delayed left-channel and right-channel with no delay (Corr2)


3. Correlation between delayed right-channel and left-channel with no delay (Corr3)

If the first correlation (Corr1) is the largest of the three, we assume no data has been embedded. If the second correlation (Corr2) is the largest, we detect that a '1' has been embedded, and if the third correlation (Corr3) is the largest, we detect a '0'. If the high-frequency energy is below the threshold, we assume that no data has been embedded. Thus, the detection process is relatively simple. Frame synchronization may be somewhat expensive, but it is not required once synchronization has been established. Frame synchronization can be established with a sample-by-sample correlation analysis near the embedded sync pattern.

Figure 4. Proposed watermark detection scheme.

3. Audio Sources for Experimental Evaluations. We evaluated the proposed algorithm using five short stereo audio samples: one rock instrumental (abba) and four classical pieces (beethoven, mozart, piano, and trumpet). All sources were CD quality: sampled at 44.1 kHz, stereo, 16 bits/sample. The embedding frame length was set to 1000 samples, or about 23 msec. Most sources were selected from the SQAM audio quality evaluation source CD [21].

Table 1. Evaluated sound sources.

Notation    Description
abba        The Visitors (Rock), instrumental
beethoven   Ode to Joy (Classics), orchestra & chorus
mozart      Mozart K. 270 (Classics), wind instruments
piano       Schubert (Classics), piano solo
trumpet     Haydn (Classics), orchestra & trumpet solo
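The three-way correlation comparison of the detection step in Section 2.2 can be sketched as follows. This is a simplified numpy sketch assuming the same integer delay D and half-frame convention as the embedder; names and the power floor are ours:

```python
import numpy as np

def detect_frame(ylh, yrh, D=3, power_floor=1e-6):
    """Blindly detect one bit from a frame of high-band stereo samples.

    Compares three normalized correlations over the latter half of the
    frame and picks the largest, as in Fig. 4.
    Returns 0, 1, or None (no data / low-energy frame)."""
    half = len(ylh) // 2
    l, r = ylh[half:], yrh[half:]
    if np.mean(l ** 2) < power_floor or np.mean(r ** 2) < power_floor:
        return None

    def ncorr(a, b):
        return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)

    corr1 = ncorr(l, r)              # neither channel delayed
    corr2 = ncorr(l[:-D], r[D:])     # left delayed by D vs. right
    corr3 = ncorr(r[:-D], l[D:])     # right delayed by D vs. left
    best = max(corr1, corr2, corr3)
    if best == corr1:
        return None                  # no relative delay -> no data
    return 1 if best == corr2 else 0
```

Note the mapping matches the text: a right-channel delay at the embedder ('1') is realigned by delaying the left channel at the detector (Corr2), and vice versa for '0' (Corr3).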

4. Embedded Data Bit Rates. The number of embedded bits for each source is shown in Table 2. The bit rate is clearly source-dependent. It appears possible to embed more data into audio with higher energy, probably because we do not embed data in frames with low high-frequency energy.

5. Robustness Evaluations. We evaluated the robustness of our algorithm to four types of disturbances: additive white Gaussian noise, audio codecs, sample rate conversion, and filtering. We embedded three bit patterns in our host signals: pattern 1 was a random bit pattern with equal average numbers of '0' and '1'; pattern 2 was an alternating '0' and '1' bit pattern; and pattern 3 was pattern 2 phase-inverted, i.e., alternating '1' and '0'.


Table 2. Embedded bits and bit rates.

Source      Length [samples]   Embedded bits   Bit rate [bps]
abba        667500             265             17.5
beethoven   275500             184             29.4
mozart      627500             101             7.1
piano       544000             90              7.3
trumpet     611500             132             9.5
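The bit rates in Table 2 follow directly from the clip lengths at the 44.1 kHz sampling rate, as rate [bps] = bits × fs / samples:

```python
# Reproduce the bit rates in Table 2 from the listed lengths and bit
# counts, with fs = 44100 Hz (CD sampling rate).
fs = 44100
sources = {            # name: (length in samples, embedded bits)
    "abba":      (667500, 265),
    "beethoven": (275500, 184),
    "mozart":    (627500, 101),
    "piano":     (544000,  90),
    "trumpet":   (611500, 132),
}
rates = {name: bits * fs / n for name, (n, bits) in sources.items()}
# e.g., abba: 265 * 44100 / 667500 ≈ 17.5 bps, matching the table
```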


Figure 5. BER for random embedded bit pattern (pattern 1) with additive noise.

Since the proposed algorithm can show three error modes (substitution, where '0' is replaced with '1' and vice versa; deletion; and insertion errors), we evaluated bit errors after aligning the detected bit pattern with the embedded one using dynamic programming. We used sclite from NIST's scoring package, freely available on the net [22]. Most of the errors turned out to be deletion errors; substitution and insertion errors were seen only in rare cases.

5.1. Robustness to additive white Gaussian noise. Fig. 5 shows the bit error rate (BER) of all sources as a function of the signal-to-added-noise ratio when pattern 1 was embedded. There are differences in BER below an SNR of 15 dB, but all sources show BER close to 0% above 20 dB SNR. Note that added noise larger than that giving 20 dB SNR is quite audible and makes the audio data virtually worthless. Fig. 6 compares the BER for patterns 1 through 3 embedded into the trumpet clip. As shown in the figure, no significant difference is seen across the embedded data patterns.

5.2. Robustness to audio codecs. We also tested the robustness of the embedded data when coded with MPEG audio codecs. We encoded each source with the MPEG 1 Layer 3 (MP3) [8] and MPEG 4 AAC [9] codecs. LAME version 3.98 beta 5 at fixed rates of 96 and 128 kbps with joint stereo coding was used to code sources to MP3 and back to PCM [23]. Apple Computer's iTunes version 7.4.3.1 was used to code sources to AAC and back [24]; the Low Complexity Profile AAC was used at fixed rates of 96 and 128 kbps. Table 3 shows the BER when random bits were embedded (pattern 1). In the majority of cases, no bit errors were seen, suggesting that the algorithm is robust to MPEG coders.
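The substitution/deletion/insertion classification can be sketched with a standard edit-distance dynamic program; the following is a minimal stand-in for NIST's sclite alignment, not the tool itself:

```python
def align_errors(ref, hyp):
    """Count (substitutions, deletions, insertions) between an embedded
    (ref) and a detected (hyp) bit sequence via edit-distance DP."""
    n, m = len(ref), len(hyp)
    # dp[i][j] = minimum edit cost aligning ref[:i] with hyp[:j]
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0] = i
    for j in range(1, m + 1):
        dp[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    # backtrack to classify each error
    i, j, subs, dels, ins = n, m, 0, 0, 0
    while i > 0 or j > 0:
        if (i > 0 and j > 0 and
                dp[i][j] == dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])):
            subs += ref[i - 1] != hyp[j - 1]   # match costs 0, sub costs 1
            i, j = i - 1, j - 1
        elif i > 0 and dp[i][j] == dp[i - 1][j] + 1:
            dels += 1                          # ref bit missing from hyp
            i -= 1
        else:
            ins += 1                           # extra bit in hyp
            j -= 1
    return subs, dels, ins
```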


Figure 6. BER for various embedded bit patterns with additive noise (trumpet).

Table 3. BER of MP3 and AAC encoded audio (BER in %).

Source      MP3 96 kbps   MP3 128 kbps   AAC 96 kbps   AAC 128 kbps
abba        1.5           0.4            0.4           0.0
beethoven   0.6           0.0            0.0           0.0
mozart      0.0           1.0            0.0           0.0
piano       0.0           1.1            1.1           0.0
trumpet     0.8           0.0            0.8           0.0

Table 4 tabulates the bit error rate of MPEG-coded audio with various bit patterns embedded. The host audio signal was trumpet for all audio in the table. As can be seen, the bit error rate was nearly zero regardless of the embedded pattern.

Table 4. BER of MP3 and AAC encoded audio with various bit patterns (trumpet; BER in %).

Bit pattern   MP3 96 kbps   MP3 128 kbps   AAC 96 kbps   AAC 128 kbps
pattern 1     0.8           0.0            0.8           0.0
pattern 2     0.8           0.0            0.0           0.0
pattern 3     0.0           0.0            0.8           0.8

5.3. Robustness to sample rate conversion. We also tested the robustness of the embedded data to sample rate conversion. All samples were either down-sampled or up-sampled to the specified sample rate fs, and converted back to the original 44.1 kHz. The three selected rates for down-sampling were 22.05, 11.025 and 8 kHz. The sample rate 22.05 kHz is one-half of the original sample rate, and the effective bandwidth is halved; this is a simple 2-to-1 decimation. Likewise, 11.025 kHz is one-quarter of the original, and the bandwidth is one-quarter, but this is a simple 4-to-1 decimation. On the other hand, 8 kHz is not an integer fraction of the original rate; the signal is thus first up-sampled and then down-sampled, which potentially adds significant degradation in the process. Up-sampling to 48 kHz is also not an integer multiple, so the same multistage up-sampling and down-sampling is necessary. We used the built-in sample rate


conversion of the CoolEdit 2000 lite [25] (now called Adobe Soundbooth [26]) software. All filtering, including the anti-alias filters, was applied within this tool. Table 5 shows the BER when random bits were embedded. As shown in this table, sample rate conversion has no significant effect on the embedded bits. Surprisingly, down-sampling even to 8 kHz, less than one-fifth of the original rate, has no significant effect on the BER, which shows the significant robustness of the proposed algorithm to sample rate conversion.

Table 5. BER of sample rate-converted audio (BER in %).

Source      fs=22.05 kHz   11.025 kHz   8 kHz   48 kHz
abba        0.0            0.8          3.1     0.0
beethoven   0.0            0.0          0.0     0.0
mozart      0.0            0.0          1.0     0.0
piano       0.0            0.0          1.1     0.0
trumpet     0.0            0.0          0.0     0.0

5.4. Robustness to filtering. We tested the robustness of the embedded data to both high- and low-pass filtering. All samples were filtered using a 128-tap FIR filter, with coefficients designed using the FFT windowing method [20]. Table 6 shows the bit error rates for low-pass filtered audio with three different cut-off frequencies. At cut-off frequencies fc=11.025 and 5.5125 kHz, the BER was surprisingly low even though these filters leave only one-half and one-quarter of the bandwidth, respectively. At fc=2.21 kHz, however, the BER abruptly increases to almost 100%. Most of the errors with this filter were deletion errors. Since the high-frequency band above 2.2 kHz is used to embed data, most of this region was eliminated by this filter, and the embedded data became undetectable.

Table 6. BER of low-pass filtered audio (BER in %).

Source      fc=11.025 kHz   5.5125 kHz   2.21 kHz
abba        1.5             2.3          99.6
beethoven   0.0             0.0          98.3
mozart      0.0             0.0          100.0
piano       1.1             1.1          100.0
trumpet     0.0             0.0          89.7

Table 7 shows the bit error rates for high-pass filtered audio with three different cut-off frequencies. At a cut-off frequency of fc=2.21 kHz, the error rate was relatively low, although most of the dominant low-frequency components are lost. Since the data is embedded in high-frequency regions, the embedded data remained intact. However, with fc above 4.42 kHz, the error rates increase significantly. Most errors were again deletion errors. Note that all high-pass filtering eliminates most of the signal energy, since low-frequency energy dominates in all audio signals. The quality degrades significantly with all high-pass filters, making the filtered samples worthless in terms of audio quality.
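The window-method FIR design used for these filters (and the band split of Section 2.1) can be sketched as a Hamming-windowed ideal low-pass response. This is our illustrative stand-in for the FFT windowing routine of [20], whose details differ:

```python
import numpy as np

def lowpass_fir(numtaps, fc, fs):
    """Design a low-pass FIR filter by the window method: an ideal
    sinc response truncated to numtaps and shaped by a Hamming window,
    then normalized to unity gain at DC."""
    n = np.arange(numtaps)
    center = (numtaps - 1) / 2.0
    h = 2.0 * fc / fs * np.sinc(2.0 * fc / fs * (n - center))  # ideal LPF
    h *= np.hamming(numtaps)                                   # window
    return h / np.sum(h)                                       # 0 dB at DC

# approximately the paper's ~2 kHz band-split setting at 44.1 kHz
h = lowpass_fir(128, 2205.0, 44100.0)
```

With 128 taps and a Hamming window, the transition band is roughly 1 kHz wide here and stopband rejection is on the order of 50 dB, comfortably separating the low and high bands.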


Table 7. BER of high-pass filtered audio (BER in %).

Source      fc=2.21 kHz   4.42 kHz   5.5125 kHz
abba        1.5           41.7       73.0
beethoven   0.6           98.3       98.9
mozart      1.0           96.0       100.0
piano       1.1           100.0      100.0
trumpet     4.0           99.2       100.0


Figure 7. Subjective listening test results (abba).

6. Subjective Quality. The embedding process introduced a small but noticeable quality alteration for some of the sources, while for others, no degradation was perceived. We conducted MUlti Stimulus test with Hidden Reference and Anchor (MUSHRA) subjective tests [27] to quantify the degradations. We tested the audio clips listed in Table 1 with the proposed data embedding method using bit pattern 1 (wp1) and bit pattern 3 (wp3). For reference, we included the original audio (ref), 3.5 kHz low-pass filtered audio as the anchor (lp35), and MP3 transcoded audio at 128 kbps (m128), 96 kbps (m96), 64 kbps (m64), and 48 kbps (m48). Twelve subjects participated in the tests. Figs. 7 through 11 show the MUSHRA scores and the 95% confidence intervals for abba, beethoven, mozart, piano, and trumpet, respectively. Fig. 12 shows the average MUSHRA score across all sources. The embedded audio showed approximately the same quality as MP3 at 64 and 128 kbps for abba. For piano, however, the embedded audio showed significantly lower quality than MP3 at any rate. The quality of the other sources falls between these two. On average, the embedded quality is equivalent to MP3 coding at 48 kbps. The quality rating for this range of MUSHRA scores (40 to 60) corresponds to a "fair" rating. Thus, the embedded audio quality is apparently not transparent, but not objectionable either. The audio quality still needs significant improvement for wide application. The embedded quality does not seem to differ between the embedded bit patterns (pattern 1 and pattern 3), however.

7. Conclusions. We proposed an audio watermarking algorithm that exploits the fact that the stereo image perception of the auditory system is relatively unaffected by phase in the high-frequency regions. The high-frequency stereo signals were replaced with a single averaged middle channel, and data is encoded in the delays between the channels.
The detection of embedded data can be achieved without the original signal, i.e., blind detection. Correlations between the right- and left-channel high-frequency signals, both with and without delay, are compared to detect the embedded data. The proposed algorithm showed embedded data bit rates between 7 and 30 bps depending on the host audio signal. The algorithm was also tested for robustness to added noise,

Figure 8. Subjective listening test results (beethoven).

Figure 9. Subjective listening test results (mozart).

Figure 10. Subjective listening test results (piano).

Figure 11. Subjective listening test results (trumpet).

MPEG coding, sample rate conversion, and filtering. Little or no errors were seen for additive noise with SNR above 20 dB. MPEG 1 Layer 3 and MPEG AAC coding also did not cause many errors. The proposed watermarks were also surprisingly robust

against sample rate conversion and filtering.

Figure 12. Subjective listening test results (all sources).

The subjective quality of the data-embedded audio was shown to be source-dependent, with some sources showing quality equivalent to MP3 coded audio, while for others the quality was significantly lower.

The current proposal uses a high-frequency energy threshold to decide whether to embed data or not. This implies a variable bit rate, and also gives rise to insertion and deletion errors. Thus, we would like to explore algorithms which allow us to embed data in every frame, which in turn would enable a fixed bit rate. The proposed algorithm also uses blind detection. However, if the reference host signal is available, direct correlation with the reference high-frequency signals can be used for improved robustness in detection, with no need to change the embedding process. Moreover, we can possibly increase the embedded data bit rate if we assume informed (non-blind) detection by changing the embedding process, perhaps by increasing the frame rate or by using multiple delay values.

In this paper, we replaced the high-frequency stereo signals with a single mid-channel, and mutually delayed these channels to "mark" them with data. Although the power in each individual channel was adjusted to match the original high-frequency signals, this may not have been enough to preserve the stereo image, not to mention the audio quality. We can also adjust the inter-channel correlation (ICC) of the mid-channel to match the original channels, which should preserve the size, or "sharpness," of the stereo image.
Binaural Cue Coding (BCC) [28] and Parametric Stereo Coding [29], relatively new additions to the MPEG family of audio codecs that enable efficient coding of stereo audio signals, reduce the redundancy of stereo signals by coding them as a single mid-channel along with the ICC, the inter-channel level difference (ICLD), and the inter-channel time difference (ICTD). ICC can be adjusted by adding a small amount of echo to each channel. Since the relative polarity of the added echo in the channels is not critical, both in terms of quality and in the control of ICC, this polarity can be used to embed data. This is currently being investigated, and potential improvements in the audio quality will be reported in [30].

Acknowledgment. This work was partially supported by the Ojima Fund. The authors also thank the members of the Nakagawa and Kondo Laboratory of the Electrical Engineering Department, Yamagata University, for their time as listening test subjects, as well as for their discussions.

REFERENCES

[1] W. Bender, D. Gruhl, N. Morimoto, and A. Lu, Techniques for data hiding, IBM System Journal, vol. 35, no. 3 & 4, pp. 313-336, 1996.
[2] C.-C. Chang, C.-C. Lin, and Y.-S. Hu, An SVD oriented watermark embedding scheme with high qualities for the restored images, International Journal of Innovative Computing, Information and Control, vol. 3, no. 3, pp. 609-620, 2007.


[3] C.-C. Chen and D.-S. Kao, DCT-based zero replacement reversible image watermarking approach, International Journal of Innovative Computing, Information and Control, vol. 4, no. 11, pp. 3027-3036, 2008.
[4] T.-H. Chen, T.-H. Hung, G. Horng, and C.-M. Chang, Multiple watermarking based on visual secret sharing, International Journal of Innovative Computing, Information and Control, vol. 4, no. 11, pp. 3005-3026, 2008.
[5] N. Cvejic and T. Seppanen, Introduction to digital audio watermarking, in Digital Audio Watermarking Techniques and Technologies, N. Cvejic and T. Seppanen (eds.), Information Science Reference, Hershey, PA, 2007.
[6] E. Zwicker, Psychoakustik, Springer-Verlag, Heidelberg, Germany, 1982.
[7] J. D. Gibson, T. Berger, T. Lookabaugh, D. Lindberg, and R. L. Baker, Digital Compression for Multimedia, Morgan Kaufmann, San Francisco, CA, 1998.
[8] ISO/IEC JTC1/SC29/WG11, Coding of moving pictures and associated audio for digital storage media up to about 1.5 Mbit/s - Part 3: Audio, ISO/IEC International Std. 11172-3, 1999.
[9] ISO/IEC JTC1/SC29, Generic coding of moving pictures and associated audio information - Part 7: Advanced Audio Coding (AAC), ISO/IEC International Std. 13818-7, 2006.
[10] R. Nishimura and Y. Suzuki, Audio watermark based on periodical phase shift, J. Acoust. Soc. Japan, vol. 60, no. 5, pp. 268-272, 2004 (in Japanese).
[11] H. M. A. Malik, R. Ansari, and A. A. Khokhar, Robust data hiding in audio using allpass filters, IEEE Trans. on Audio, Speech, and Lang. Process., vol. 15, no. 4, pp. 1296-1304, 2007.
[12] R. Nishimura, M. Suzuki, and Y. Suzuki, Detection threshold of a periodic phase shift in music sound, Proc. International Congress on Acoustics, Rome, Italy, vol. IV, pp. 36-37, 2001.
[13] S. P. Lipshitz, M. Pocock, and J. Vanderkooy, On the audibility of midrange phase distortion in audio systems, J. Audio Eng. Soc., vol. 30, no. 9, pp. 580-595, 1982.
[14] D. Megias, J. Herrera-Joancomarti, and J. Minguillon, An audio watermarking scheme robust against stereo attacks, Proc. Multimedia and Security, Magdeburg, Germany, pp. 206-213, 2004.
[15] A. Takahashi, R. Nishimura, and Y. Suzuki, Multiple watermarks for stereo audio signals using phase-modulation techniques, IEEE Trans. Signal Process., vol. 53, no. 2, pp. 806-815, 2005.
[16] T. Modegi, Nearly lossless audio watermark embedding techniques to be extracted contactlessly by cell phone, Proc. of the International Conf. on Mobile Data Management, Nara, Japan, 2006.
[17] J. Breebart and C. Faller, Spatial Audio Processing, John Wiley & Sons, West Sussex, England, 2007.
[18] J. Herre, K. Brandenburg, and D. Lederer, Intensity stereo coding, Proc. of the 96th AES Convention, Amsterdam, Netherlands, preprint no. 3799, 1994.
[19] K. Kondo and K. Nakagawa, A digital watermark for stereo audio signals using variable inter-channel delay in high frequency bands, Proc. of the International Conf. on Intelligent Information Hiding and Multimedia Signal Processing, Harbin, China, pp. 624-627, 2008.
[20] Digital Signal Processing Committee, Programs for Digital Signal Processing, IEEE Press, Piscataway, NJ, 1979.
[21] European Broadcasting Union, Sound quality assessment material (SQAM), Compact Disk, 1998.
[22] The National Institute of Standards and Technology, SCTK, the NIST scoring toolkit, 2007. Available from http://www.nist.gov/speech/tools.
[23] The LAME Project, LAME ain't an MP3 encoder, 2008. Available from http://lame.sourceforge.net.
[24] Apple Inc., iTunes, 2008. Available from http://www.apple.com/itunes/download.
[25] Syntrillium Software Corp., CoolEdit 2000 Lite. Trial version available from http://www.5starshareware.com/Windows/Music/AudioComposers/cool-edit-download.html.
[26] Adobe Systems Inc., Adobe Soundbooth CS4, http://www.adobe.com/products/soundbooth/.
[27] ITU-R, ITU Recommendation BS.1534-1: Method for the subjective assessment of intermediate quality level of coding systems, 2003.
[28] C. Faller and F. Baumgarte, Binaural cue coding: A novel and efficient representation of spatial audio, Proc. International Conf. on Acoust., Sp. and Signal Process., Orlando, FL, vol. 2, pp. 1841-1844, 2002.
[29] J. Breebart, S. van de Par, A. Kohlrausch, and E. Schuijers, Parametric coding of stereo audio, EURASIP Journal on Applied Signal Processing, no. 9, pp. 1305-1322, 2005.
[30] K. Kondo, A data hiding method for stereo audio signals using the polarity of the inter-channel decorrelator, to be presented, Proc. of the International Conf. on Intelligent Information Hiding and Multimedia Signal Processing, Kyoto, Japan, 2009.