1296
IEEE TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING, VOL. 15, NO. 4, MAY 2007
Robust Data Hiding in Audio Using Allpass Filters

Hafiz M. A. Malik, Student Member, IEEE, Rashid Ansari, Fellow, IEEE, and Ashfaq A. Khokhar, Senior Member, IEEE
Abstract—A novel technique is proposed for data hiding in digital audio that exploits the low sensitivity of the human auditory system to phase distortion. Inaudible but controlled phase changes are introduced in the host audio using a set of allpass filters (APFs) with distinct parameters, i.e., pole-zero locations. The APF parameters are chosen to encode the embedded information. During detection, the power spectrum of the audio data is estimated in the z-plane away from the unit circle, and this power spectrum is used to estimate the APF pole locations for information decoding. Experimental results show that the proposed data hiding scheme can effectively withstand standard data manipulation attacks. Moreover, the proposed scheme is shown to embed 5–8 times more data than existing audio data hiding schemes while providing comparable perceptual performance and robustness.

Index Terms—Allpass filter (APF), data hiding, detection, embedding, fingerprinting, human auditory system, parametric signal modeling, watermarking.
I. INTRODUCTION
THE ever-growing digital piracy problem has spurred efforts to develop robust technologies that protect copyrighted digital media from illegal sharing and tampering. Content protection provided by traditional data protection techniques, such as encryption and scrambling, is inadequate, particularly once the digital data is decrypted or unscrambled. Emerging digital content protection technologies based on data hiding, such as digital watermarking and fingerprinting, used in conjunction with cryptography/scrambling, have the potential to protect digital content even after it is deciphered/unscrambled. Perception-based audio data hiding schemes generally exploit human auditory system (HAS) characteristics such as spectral masking, temporal masking, and/or the inaudibility of phase distortion [1], [5]–[7]. Existing audio data hiding schemes can be classified according to the underlying technique used for embedding data. Methods based on least-significant-bit dithering [9] embed information by replacing the least significant bits of the digital audio samples. Echo hiding methods [9] introduce inaudible echoes in the host audio signal based on the embedding message.

Manuscript received November 22, 2005; revised November 15, 2006. The associate editor coordinating the review of this manuscript and approving it for publication was Dr. George Tzanetakis. H. M. A. Malik was with the Department of Electrical and Computer Engineering, University of Illinois, Chicago, IL 60607 USA. He is now with the Department of Electrical and Computer Engineering, Stevens Institute of Technology, Hoboken, NJ 07030 USA (e-mail: [email protected]). R. Ansari is with the Department of Electrical and Computer Engineering, University of Illinois, Chicago, IL 60607 USA (e-mail: [email protected]). A. A. Khokhar is with the Department of Computer Science and Department of Electrical and Computer Engineering, University of Illinois, Chicago, IL 60607 USA (e-mail: [email protected]). Digital Object Identifier 10.1109/TASL.2007.894509

Perceptual masking methods [9], [13],
[14], spectrally shape the embedding message according to the HAS before embedding it in the host signal. In methods inspired by direct-sequence spread spectrum (DSSS) [9], [12], [15], a pseudorandom sequence is generated based on the embedding message and added to the host signal after spectral shaping. Data hiding techniques based on phase coding [9], [10] introduce controlled changes in the phase of the host signal for information encoding. These schemes perform well as far as the fidelity of the embedded data is concerned, but they exhibit low robustness against standard data manipulations and have limited embedding capacity. For example, the phase coding technique proposed in [9] embeds 16–32 bits of data in audio samples of 1-s duration, and its detection performance deteriorates rapidly in the presence of random noise. In general, low embedding capacity is the common shortcoming of existing data hiding schemes based on phase coding. In this paper, the aforementioned limitations of data hiding schemes based on phase coding are addressed by proposing a robust scheme for audio data hiding that uses a novel method of phase alteration. The proposed data hiding scheme uses a set of allpass filters (APFs) for embedding data. In order to improve embedding capacity, 1) a larger codebook of APF parameters and 2) partitioning of the host signal into multiple subband signals are used for embedding information. The embedded information is detected by first estimating the spectra of the processed audio using the chirp z-transform (CZT) computed over contours away from the unit circle. The detector uses a parametric signal model, such as an autoregressive (AR) or a moving-average (MA) model, for the data-embedded audio. In addition, the detector exploits the finite-length truncation (FLT) effect of the APF impulse response in estimating the spectra of the processed audio for APF parameter estimation.
Experimental results show that the proposed data hiding scheme effectively withstands standard data manipulation attacks. Moreover, the proposed scheme is shown to embed 5–8 times more data than existing audio data hiding schemes using binary codebooks while providing comparable perceptual performance and robustness in embedding information [9], [10], [12]–[14]. This embedding capacity improves a further two-to-three fold for 4-ary encoding/decoding at the expense of increased computational cost, again with comparable perceptual performance and robustness.

II. PERCEPTION OF PHASE DISTORTION IN AUDIO

For many years, the perception of phase distortion has been the subject of intense investigation in the acoustic engineering community. Phase distortion introduces considerable waveform degradation of signals without altering their spectral magnitude contents. The study by Lipshitz et al. [5] on the perception of phase distortion shows that at midrange frequencies
1558-7916/$25.00 © 2007 IEEE
phase distortion is audible for simple combinations of sinusoids. In addition, phase distortion in the midrange is also audible for some common acoustical signals. However, while simple anechoically generated acoustic signals display clear phase audibility on headphones, in music or speech signals phase distortion is generally inaudible, and its audibility is far more noticeable on headphones than on loudspeakers. The proposed data hiding scheme exploits the HAS characteristics of spectral masking and imperceptibility of phase distortion. Spectral masking refers to the inaudibility of magnitude distortion that lies below a masking threshold [1], and imperceptibility of phase distortion refers to the inaudibility of phase distortion introduced above the midrange frequencies [1], [5]. For robust performance, only a portion of the customary full range of audible frequencies, i.e., 20 Hz–20 kHz, is suitable for embedding data. For example, detection of small magnitude changes in the highest frequency range is unreliable due to the insignificant signal energy content in that range; lossy-compression schemes for digital audio, e.g., MP3, generally discard low-energy coefficients in this range [7]. On the other hand, phase distortion in the low frequency range is generally audible [1], [5]. As a result, an intermediate frequency band is considered suitable for embedding data. The proposed data hiding scheme is based on Lipshitz et al.'s [5] phase distortion perception model. In order to embed data, inaudible phase distortions are introduced in the host audio in the chosen frequency range using an Nth-order APF.
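As a quick numerical illustration of the premise above, the following sketch (our own addition, not code from the paper) filters white noise with a first-order allpass section and confirms that the magnitude spectrum is essentially unchanged even though the waveform itself changes substantially; the pole value a = 0.9 is illustrative.

```python
import numpy as np

# First-order allpass: H(z) = (-a* + z^-1) / (1 - a z^-1), |a| < 1.
# The pole location 'a' is an illustrative value, not one from the paper.
a = 0.9

rng = np.random.default_rng(0)
x = rng.standard_normal(4096)

# Direct-form IIR filtering: y[n] = -a*x[n] + x[n-1] + a*y[n-1]
y = np.zeros_like(x)
for n in range(len(x)):
    y[n] = -a * x[n]
    if n > 0:
        y[n] += x[n - 1] + a * y[n - 1]

X = np.abs(np.fft.rfft(x))
Y = np.abs(np.fft.rfft(y))
print("max relative magnitude change:", np.max(np.abs(Y - X) / np.max(X)))
print("max waveform change:", np.max(np.abs(y - x)))
```

Only edge (transient) effects keep the two magnitude spectra from matching exactly; the waveform, by contrast, is visibly different sample by sample.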
III. PROPERTIES OF APF

In this section, we briefly examine relevant properties of a stable APF whose transfer function is a finite-order rational function of the complex variable z. The magnitude of the frequency response of an APF is unity at all frequencies. The associated phase response and group delay can be adjusted through the pole-zero pair(s) of the transfer function. A causal APF introduces positive delay (envelope retardation), and its phase response is twice that of the associated all-pole filter [4]. The phase response of a causal APF is a monotonically nonincreasing function of frequency. The phase response of an APF can be used to approximate a specified phase characteristic, and likewise the associated group delay function a specified group delay characteristic. Let H(z) denote the transfer function of an Nth-order causal and stable allpass filter with magnitude response |H(e^{jω})| = 1 and phase response θ(ω). The frequency response of the system can be expressed as

H(e^{jω}) = |H(e^{jω})| e^{jθ(ω)} = e^{jθ(ω)}.   (1)

Distortionless processing by an allpass filter over a given frequency band requires that the magnitude response be constant and that the phase response be linear in frequency, i.e., θ(ω) = −τω with constant slope −τ, passing through the origin. When the magnitude and/or phase response fail to satisfy these conditions, the system introduces distortion in the signal [6], [7]. The phase
TABLE I APF PARAMETERS USED FOR BINARY ENCODING
TABLE II APF PARAMETERS USED FOR 4-ARY ENCODING
delay τ_p(ω) and the group delay τ_g(ω) are generally used to characterize the phase response of a given system, where

τ_p(ω) = −θ(ω)/ω   (2)

τ_g(ω) = −dθ(ω)/dω.   (3)

The transfer function of a first-order stable and causal APF with pole at z = a and associated zero at z = 1/a* can be expressed as

A(z) = (z^{−1} − a*)/(1 − a z^{−1})   (4)

where a = re^{jθ}, 0 < r < 1, a* is the complex conjugate of a, and the region of convergence is |z| > r. Similarly, the transfer function of a second-order stable and causal APF with poles at re^{jθ} and re^{−jθ}, and corresponding image zeros at (1/r)e^{jθ} and (1/r)e^{−jθ}, respectively, can be expressed as

H(z) = (r² − 2r cos θ z^{−1} + z^{−2})/(1 − 2r cos θ z^{−1} + r² z^{−2}).   (5)

Here, the filter parameters are r (the radial distance of the pole from the origin in the complex z-plane) and θ (the orientation of the pole-zero pair). APF characteristics such as the group delay depend on these filter parameters. The proposed audio data hiding scheme maps message symbols (e.g., 0 or 1) to a known set of APF parameters using a codebook. Each symbol is then embedded by applying the corresponding APF. Note that there is a one-to-one mapping between the embedding symbol and the APF parameters. The mapping from codebook symbols to the APF parameter pair (r, θ) is illustrated for the cases of binary and 4-ary codebooks in Tables I and II. The proposed data hiding scheme uses an Nth-order APF, where N is an even positive integer and the pole locations may be chosen in a variety of ways. Here, we confine our attention to a selection that was found to work well in practice. These Nth-order APFs are realized by placing multiple poles at a pair of conjugate locations, and they are characterized by the ordered pair (r, θ) for embedding information based
Fig. 1. Pole-zero layouts of H_l(z): l = 0, 1 (l = 0, 1, 2, 3) for binary (4-ary) encoding/decoding.

Fig. 2. Second-order APF H_l(z) phase response (left) and group-delay response (right) for binary encoding/decoding.

Fig. 3. Phase response (left) and group-delay response (right) of second-order APF H_l(z) used for the 4-ary encoding/decoding scheme.

on the message symbol. The Nth-order APF is realized by cascading N/2 second-order APFs of the form given in (5); the cascaded realization reduces the effect of coefficient quantization in the APF transfer function. The transfer function of an Nth-order APF, realized with N/2 sections of second-order causal and stable APFs, is expressed as

H_l(z) = ∏_{k=1}^{N/2} (r² − 2r cos θ_l z^{−1} + z^{−2})/(1 − 2r cos θ_l z^{−1} + r² z^{−2})   (6)

where l = 0, 1 for binary encoding and l = 0, 1, 2, 3 for 4-ary encoding. Here, H_l(z) has N/2 pairs of poles at re^{±jθ_l} and N/2 associated pairs of zeros at (1/r)e^{±jθ_l}, respectively. Moreover, in order to ensure maximum separation, and hence robustness, between different codewords, the pole-zero pairs of the APFs used for M-ary encoding should be separated by equal angles in the upper-half complex z-plane, with their corresponding conjugate pairs in the lower-half plane. In addition, the first and the last pairs should not be very close to the real axis: it was observed that pole pairs close to the real axis, i.e., with orientations near 0 or π, lead to severe magnitude distortion for a truncated impulse response. Separating these pole pairs from the real axis by a sufficient angle was found to work in practice. If a still higher-order APF is desirable, it should be constructed by cascading lower-order APFs.

The pole-zero layouts of the second-order APFs [defined by (5)] used for binary encoding are illustrated in Fig. 1 (first column), and the corresponding plots of the phase response and the group delay (based on Table I) are shown in Fig. 2. Similarly, the pole-zero layouts of the second-order APFs used for 4-ary encoding/decoding are illustrated in Fig. 1 (second and third columns), and the corresponding plots of the phase response and the group delay are given in Fig. 3. Fig. 1 shows that the proposed scheme selects an APF for embedding data based on the embedding information or encoding symbol. For example, in the binary embedding case, symbol “0” is encoded by setting the APF parameters to (0.9, θ_0) and symbol “1” by setting them to (0.9, θ_1).
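The symbol-to-APF mapping and the cascaded second-order realization in (6) can be sketched as follows. The pole radius r = 0.9 matches the binary example above, but the codebook angles and the helper names are our own illustrative choices, not the values of Tables I and II.

```python
import numpy as np
from scipy.signal import sosfilt, sosfreqz

r = 0.9  # pole radius used in the paper's binary example

def apf_sos(r, theta, n_sections):
    """N-th order APF (N = 2*n_sections) as a cascade of identical
    second-order allpass sections, each of the form (5):
    (r^2 - 2 r cos(theta) z^-1 + z^-2) / (1 - 2 r cos(theta) z^-1 + r^2 z^-2).
    Returned in scipy's SOS format, one row [b0 b1 b2 a0 a1 a2] per section."""
    b = [r**2, -2 * r * np.cos(theta), 1.0]
    a = [1.0, -2 * r * np.cos(theta), r**2]
    return np.tile(np.array(b + a), (n_sections, 1))

# Illustrative binary codebook: symbol -> pole orientation theta_l
# (the angles are our own choice, not the paper's Table I values).
codebook = {0: np.pi / 3, 1: 2 * np.pi / 3}

def embed_symbol(subband, symbol, n_sections=2):
    """Embed one symbol by allpass-filtering the selected subband signal."""
    return sosfilt(apf_sos(r, codebook[symbol], n_sections), subband)

# Sanity check: the cascade is allpass (unit magnitude response).
w, h = sosfreqz(apf_sos(r, codebook[0], 2), worN=512)
print("max deviation from unit magnitude:", np.max(np.abs(np.abs(h) - 1.0)))
```

The cascaded (SOS) form is exactly the structure the paper recommends for limiting coefficient-quantization effects, which is also why scipy's `sosfilt` is preferred over a single high-order transfer function.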
IV. DATA EMBEDDING USING WAVELET ANALYSIS AND ALLPASS FILTERS

The key steps of the proposed data-embedding process are listed below.
• The host signal is segmented into nonoverlapping frames of N samples each.
• Each frame is decomposed using a multilevel discrete wavelet packet analysis filter bank (DWPA-FB), and the subbands in the chosen frequency range are designated as suitable for data embedding.
• A target set of suitable subbands is selected using a secret key; the set size depends on the codebook size.
• One or more bits of information are embedded in each selected subband using the APF corresponding to the encoding symbol.
• The data-embedded subbands, together with the unused subbands, are put together in their respective order, and the matching discrete wavelet packet synthesis filter bank (DWPS-FB) is applied to generate the data-embedded frame.

The number of target subbands selected and the associated selection procedure depend on the data hiding application. For example, in the case of audio fingerprinting, which generally requires higher embedding capacity, all suitable subbands can be used for data embedding. On the other hand, for watermarking, only one or two subbands in the targeted frequency range can be used, which yields higher security and stronger embedding at lower embedding capacity. The details of audio watermarking based on the proposed scheme are provided in Section VII.

The proposed APF-based data hiding scheme is vulnerable to desynchronization attacks. In order to combat desynchronization attacks, we identify host-signal features that are robust to such attacks. Such features include fast-energy transition points,
high zero-crossing-rate locations, or high spectral-flatness-measure locations in the given audio clip. We refer to these implicit synchronization locations as sync points (SPs).

V. DATA DETECTION IN THE PROPOSED SCHEME

Data detection refers to the process of determining the presence or absence of the embedded information. In the proposed embedding framework, the detection process consists of audio segmentation, frame decomposition, parametric spectrum estimation using the CZT, filter parameter estimation, i.e., of the pole-zero locations (r, θ_l), and information decoding. The audio segmentation and frame decomposition steps are very similar to the embedding procedure. In the following, we provide details of the steps unique to the detection/decoding process.

A. Parametric Power Spectrum Estimation

We use parametric signal models, such as an autoregressive (AR) and/or a moving-average (MA) model of sufficient order [2], to estimate the power spectrum of the data-embedded subband signals. An AR process, x[n], can be represented as the output of an all-pole filter excited by unit-variance white noise. The estimated power spectrum of a pth-order AR process is

P̂_AR(f) = σ̂² / |1 + Σ_{k=1}^{p} â_k e^{−j2πfk}|²   (7)
where â_k and σ̂² are the estimates of the AR model parameters, determined using methods such as the autocorrelation, covariance, or modified covariance method, or Burg's algorithm [2]. Similarly, an MA process, x[n], can be generated by exciting a qth-order FIR filter with unit-variance white noise. The estimated power spectrum of a qth-order MA process is expressed as

P̂_MA(f) = σ̂² |1 + Σ_{k=1}^{q} b̂_k e^{−j2πfk}|²   (8)

where b̂_k are the estimates of the MA model parameters. The parameter estimates (â_k and σ̂² for the AR model, or b̂_k for the MA model) are used to estimate the power spectrum of the subband process. The CZT on a circular contour of radius r_c is used to evaluate the estimated power spectrum, which is then used to estimate the APF parameters (r, θ_l). Estimated spectra of the processed audio corresponding to different values of r_c exhibit minima around θ_l, which can be attributed to the finite-length truncation (FLT) effect of the APF impulse response.

B. Effect of FLT in the APF Impulse Response

The transfer function of a stable and causal first-order APF can be expressed as

A(z) = (z^{−1} − a*)/(1 − a z^{−1}),  a = re^{jθ}.   (9)

The above transfer function has an infinite-length impulse response (IIR). However, during the data-embedding process, only a finite-length segment is used, which in effect takes into account only a truncated APF impulse response. The FLT of an APF impulse response introduces distortion in the magnitude response around the pole-zero location frequency θ. The level of this magnitude distortion depends directly on the truncation length: our analysis shows that the distortion around θ decreases as the truncation length L increases, and vice versa. This phenomenon is illustrated here for a first-order APF.

Fig. 4. Magnitude response of the length-L truncated second-order AP function for different values of the truncation length L.

In order to determine the effect of a length-L truncation on the magnitude response of an APF, expand (9) as a power series in z^{−1}:

A(z) = −a* + (1 − |a|²) Σ_{n=1}^{∞} a^{n−1} z^{−n}   (10)

so that the impulse response is

h[0] = −a*,  h[n] = (1 − |a|²) a^{n−1}, n ≥ 1.   (11)

Splitting the series at n = L gives

A(z) = A_L(z) + (1 − |a|²) a^{L−1} z^{−L}/(1 − a z^{−1}).   (12)

Let us denote the first term on the right-hand side of (12) by A_L(z). Here, A_L(z) is the z-transform of the length-L truncated impulse response h_L[n], which can be expressed as

A_L(z) = −a* + (1 − |a|²) z^{−1} [1 − (a z^{−1})^{L−1}]/(1 − a z^{−1}).   (13)

It is observed that the factor 1 − (a z^{−1})^{L−1} introduces L − 1 zeros at z = re^{j(θ + 2πk/(L−1))}, k = 0, 1, …, L − 2, uniformly
Fig. 5. Power spectral analysis of unprocessed (first row) and processed (rows 2–5) audio using fourth-order (dotted) and 16th-order (solid) APFs with a = 0.95e^{jθ}.
distributed on the circle |z| = r. The zero at z = a, i.e., re^{jθ} (the k = 0 zero), cancels the pole at the same location. Therefore, the length-L truncated AP transfer function has altogether L − 1 effective zeros, of which L − 2 lie on the circle |z| = r. The remaining zero, near the image location (1/r)e^{jθ}, gives rise to minima around the pole-zero location frequency θ of the AP transfer function. Moreover, the minimum at θ is consistently manifested along the radial line from z = re^{jθ} to z = (1/r)e^{jθ}, i.e., with orientation θ.

The magnitude response of the length-L truncated approximate APF (LT-AAPF) approaches that of a true APF as the truncation length L is increased. The magnitude responses of a second-order LT-AAPF for different values of L are shown in Fig. 4. Fig. 4 shows that the magnitude response of the LT-AAPF exhibits distortion around the pole-zero location frequency θ, and that this distortion is governed by the impulse response truncation length L.

In our experiments, consistent minima in the estimated power spectra of the data-embedded audio are observed around the pole-zero location frequency of the APF used. This is illustrated in Fig. 5, which shows the power spectra of the unprocessed third subband of a four-level wavelet decomposition of a sample audio segment, calculated using the CZT (first row). The power spectra of the corresponding processed subband are shown in rows 2–5. The selected subband is processed using fourth-order (dotted line) and 16th-order (continuous line) APFs with different truncation lengths L and different contour radius values r_c. The first row of Fig. 5 shows that the power spectrum of the unprocessed audio has no local minimum around θ, whereas consistent minima in the estimated power spectra of the processed audio are observed (rows 2–5) around the orientation θ corresponding to the APF used. Several observations can be made from Fig. 5.

• FLT introduces distortion in the magnitude spectra of the processed audio, and it increases with the increase in segment length [Fig. 5 (rows 4–5)].
• The power spectrum of the processed audio calculated on contours away from the unit circle has a well-established local minimum around the pole-zero location frequency of the APF, which can be exploited to estimate the filter parameters.
• For the contour of Fig. 5 (row 3), the local minimum is more evident in the leftmost column, which may be attributed to the smaller truncation length of the APF impulse response.
• For the remaining contours, local minima are more pronounced for higher values of L.

Based on these observations, and to provide sufficient security against active adversary attacks, for a given APF a suitable audio frame size should be used for information embedding, so as to ensure negligible embedding distortion around θ in the estimated power spectrum of the processed audio segment. Otherwise, the APF parameters should be modified to ensure that there is negligible distortion around θ in the estimated power spectrum.

C. Allpass Filter Parameter Estimation

The APF parameters (r, θ_l) are estimated from the estimated power spectrum. Let us assume that r is known at the detector, so that only an estimate of the orientation θ of the pole-zero locations is required; this can be estimated from the estimated spectra of the processed data. The estimated power spectrum of the processed audio is calculated for different CZT contour radii (the simulation results presented in this paper are based on five contour values). Consistent minima are determined by searching for minima across the estimated spectra for the different contour values; for this, an exhaustive search based on the steepest gradient is applied. In addition, if more than one consistent minimum of the same strength appears in the estimated power spectra, the detector declares a decoding failure. For a more reliable estimate of θ, spectra based on both the AR model and the MA model can be used during the information detection process. For example, for a given audio segment, two different detectors can be run simultaneously, one based on the MA-estimated spectra and the other based on the AR-estimated spectra; only if both detectors produce the same decoded information is a successful detection declared.
A detection process relying on both estimated spectra will improve the overall false positive rate at the cost of a relatively higher false negative rate. The simulation results presented in this paper are based on the MA signal model only.
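To make the FLT-based detection idea concrete, the sketch below (our own illustration, simplified from the MA-model detector described above) evaluates the z-transform of a length-L truncated first-order APF response on a contour away from the unit circle and locates the spectral minimum, which falls at the embedding orientation θ. The parameter values are illustrative, and the pole radius r is assumed known at the detector, as in Section V-C.

```python
import numpy as np

r, theta = 0.9, np.pi / 3   # embedding APF pole (illustrative values)
L = 64                      # truncation length

# Length-L truncated impulse response of the first-order APF (11):
# h[0] = -a*, h[n] = (1 - |a|^2) a^(n-1) for n >= 1, with a = r e^{j theta}.
a = r * np.exp(1j * theta)
h = np.empty(L, dtype=complex)
h[0] = -np.conj(a)
h[1:] = (1 - abs(a) ** 2) * a ** np.arange(L - 1)

# Evaluate A_L(z) on the contour z = rho * e^{jw}, away from the unit circle
# and near the image-zero radius 1/r, where the dip at theta is deepest.
rho = 1.1
w = np.linspace(0, np.pi, 2048)
H = np.array([np.sum(h * (rho * np.exp(1j * wk)) ** -np.arange(L)) for wk in w])

# The magnitude dips near the pole-zero orientation theta
theta_hat = w[np.argmin(np.abs(H))]
print("true theta:", theta, "estimated:", theta_hat)
```

In the paper the same minimum is searched for in MA-model spectra of real audio across several contour radii; this sketch isolates the geometric reason the minimum exists at all.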
Multiple hypothesis testing based on the nearest neighbor is applied to decode the embedded information. For binary decoding it can be expressed as

m̂ = 0 if |θ̂ − θ_0| ≤ T_d;  m̂ = 1 if |θ̂ − θ_1| ≤ T_d;  e otherwise   (14)

where T_d is a predefined threshold, and e is the decoding failure error, which contributes to the false negative rate. Similar hypothesis testing can be formulated for 4-ary decoding. The values of the decoding threshold T_d are set based on the codebook used for the encoding/decoding process.

TABLE III DATA-EMBEDDING PERFORMANCE OF THE PROPOSED SCHEME BINARY EMBEDDING

VI. SIMULATION RESULTS

In this section, we evaluate the performance of the proposed APF-based embedder and detector in terms of embedding capacity (embedding rate), embedding distortion, decoding bit error probability, false negative bit rate, false positive bit rate, and robustness against desynchronization attacks. The simulation parameters used for the proposed encoder and decoder are as follows.
1) Five-level wavelet decomposition is used, producing eleven possible target subbands for data embedding, consisting of subband #5 to subband #15.
2) All available subbands (11 subbands per frame in total) are used for information embedding.
3) The decoding threshold T_d is set separately for binary and for 4-ary decoding.
4) The false positive bit rate is calculated using an original (unwatermarked) music clip as the input to the proposed detector.
5) The power spectrum of the processed audio used for information decoding is estimated with the MA signal model (Durbin's method [2] is used in our implementation).
6) The music clip “I Want It That Way” by the “Backstreet Boys” (original and watermarked), along with the other clips used in the simulations, is available online.¹
7) The relative embedding distortion of the proposed scheme is evaluated in terms of the signal-to-noise ratio, calculated as follows: let s[n] denote the original music signal and s̃[n] the data-embedded signal; the embedding distortion can be expressed as d[n] = s̃[n] − s[n], and the relative embedding distortion is calculated as

SNR (dB) = 10 log₁₀(P_s/P_d)   (15)

where P_s = Σ_n s²[n] and P_d = Σ_n d²[n].
8) The decoding bit error probability is defined as

P_b = (N_e − N_c)/N_e   (16)

where N_c is the number of bits correctly decoded and N_e is the number of embedded bits.

¹[Online] Available: http://www.multimedia.uic.edu/~hafiz/APF_DH.html
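Consistent with the distortion and error measures in items 7) and 8) above, the SNR of (15) and the bit error probability of (16) reduce to a few lines of code (the helper names are our own):

```python
import numpy as np

def embedding_snr_db(s, s_marked):
    """Relative embedding distortion as an SNR in dB: ratio of host-signal
    power to the power of the embedding distortion d[n] = s_marked[n] - s[n]."""
    d = s_marked - s
    return 10.0 * np.log10(np.sum(s ** 2) / np.sum(d ** 2))

def decoding_ber(embedded_bits, decoded_bits):
    """Fraction of embedded bits that were decoded incorrectly."""
    return np.mean(np.asarray(embedded_bits) != np.asarray(decoded_bits))

# Toy check with a known distortion level
s = np.sin(2 * np.pi * 440 * np.arange(1000) / 44100)
print(embedding_snr_db(s, s + 0.01 * s))         # 1% amplitude distortion -> 40 dB
print(decoding_ber([0, 1, 1, 0], [0, 1, 0, 0]))  # one error in four bits -> 0.25
```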
TABLE IV DATA-EMBEDDING PERFORMANCE OF THE PROPOSED SCHEME 4-ARY EMBEDDING
Tables III and IV show that the proposed embedding scheme is capable of embedding up to 122 bits and 243 bits per second of audio for binary and 4-ary encoding, respectively, while using only 11 of the 32 subbands of the decomposed audio frame. From these tables, it can also be observed that the order of the APF used for information embedding is an important performance parameter. For example, a higher-order APF gives a lower decoding bit error probability and a lower false negative bit rate, but at the cost of security and higher embedding distortion. Similarly, the subband frame size is another parameter that impacts performance: a smaller frame size gives higher embedding capacity, but at the cost of higher embedding distortion, lower security, and a higher decoding bit error probability. However, 4-ary encoding/decoding with a reasonably large frame size and a suitable filter order can be used to achieve the target performance level. The average false positive bit rate of the proposed detector for both decoding schemes was calculated by applying the detector to five unprocessed music clips and averaging the calculated false positive bit rates. This gave an average false positive rate of 2.5 × 10− for 4-ary decoding and 3.5 × 10− for binary decoding. In order to evaluate the performance of the proposed scheme against desynchronization attacks due to the addition of white Gaussian noise (AWGN), sync points (SPs) based on high-energy transition locations in the host audio that
TABLE V PERFORMANCE OF THE PROPOSED SCHEME AGAINST DESYNCHRONIZATION ATTACK
are robust to the AWGN attack were used for information embedding. A 22-s music clip, “I Want It That Way” by the “Backstreet Boys,” is used to estimate SPs with the algorithm presented in [12]. A salient point list consisting of 28 SPs is estimated for both the unprocessed and the corresponding processed music clip. In addition, three more lists are estimated by adding 10%, 20%, and 50% AWGN to the processed music clip. It has been observed that a few SPs are susceptible to additive noise: e.g., 2 of the 28 SPs disappeared from the SP list estimated from the processed music clip corrupted with 10% AWGN, and 4 additional SPs disappeared for the 50% AWGN attack. This indicates that, in order to combat desynchronization attacks, robust SPs, i.e., SPs that can survive desynchronization attacks, should be used for synchronization.

Table V shows the performance of the proposed detector against the desynchronization attack due to AWGN; the SPs robust to 10% AWGN were used for information embedding here. The simulation results presented in Table V are based on the same settings as the embedding-rate and embedding-distortion evaluations reported in Tables III and IV. In Table V, the noise power is given in percent, along with the gross bit error probability (bit errors due to decoding and desynchronization combined), the decoding bit error probability, and the bit error probability due to desynchronization alone. Table V shows that the proposed scheme can withstand desynchronization attacks. The synchronization error can be reduced further if SPs vulnerable to strong noise are also excluded from the SP list used for synchronization, but this improvement would come at the cost of embedding capacity.
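As a rough sketch of how energy-transition sync points might be located (the paper relies on the more elaborate salient-point algorithm of [12]; the frame size and threshold below are illustrative, not the paper's settings):

```python
import numpy as np

def sync_points(x, frame=1024, ratio=4.0):
    """Toy stand-in for salient-point extraction: flag frame boundaries where
    the short-term energy jumps by more than `ratio` relative to the previous
    frame. Fast energy transitions of this kind tend to survive additive
    noise, which is what makes them usable as sync points (SPs)."""
    n = len(x) // frame
    e = np.array([np.sum(x[i * frame:(i + 1) * frame] ** 2) for i in range(n)])
    eps = 1e-12  # guard against silent frames
    return [i * frame for i in range(1, n) if e[i] / (e[i - 1] + eps) > ratio]

# A quiet passage followed by a loud one yields one sync point at the onset.
rng = np.random.default_rng(1)
x = np.concatenate([0.01 * rng.standard_normal(4096), rng.standard_normal(4096)])
print(sync_points(x))  # -> [4096]
```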
This improved performance of the proposed detector against desynchronization attacks, relative to existing spread-spectrum-based schemes [12], [15], [18], can be attributed to information decoding based on the estimated power spectrum, which has low sensitivity to time shifts. The proposed scheme uses the estimated power spectrum of the selected subbands for information decoding, and it is observed that, in general, the power spectra of two audio segments of 186 ms (8196 samples at a 44 100-Hz sampling frequency) offset by 15 ms (or 640 samples) are very similar due to high temporal correlation.

VII. APPLICATION TO AUDIO WATERMARKING

The audio data hiding scheme presented in the earlier sections is generic and can be used for different audio data hiding applications such as steganography, watermarking, and fingerprinting. For audio steganography, the data-embedding and
detection procedures outlined in Sections IV and V can be used without any modification. However, for audio watermarking and fingerprinting applications, some modifications are needed. The adaptation of the proposed data hiding scheme to audio watermarking is outlined in this section, and the robustness of the resulting APF-based audio watermarking is evaluated against common audio degradations.

Digital watermarking is the process of embedding authentication and ownership-protection information (the watermark) into the digital content that needs protection. The watermark is a pseudorandom sequence generated using a secret key as the “seed” of the pseudorandom sequence generator. The proposed data hiding procedure is capable of embedding multiple watermarks simultaneously without increasing inter-watermark interference. In the following, we outline the key steps to realize watermarking using the proposed data hiding technique.
• A list of SPs is extracted from the host audio using the method described in [12], [17].
• An audio frame of N samples is selected around each salient point.
• Each frame is decomposed into subband signals using a multilevel DWPA-FB.
• One or more subbands in the chosen frequency range are selected for watermark embedding using a secret key; the collection of per-frame subband selections constitutes the subband-selection key for the entire audio clip.
• One or more bits of channel-encoded information (the watermark) are embedded in each selected subband using the corresponding APF.
• For each frame, all the subbands (watermarked and unwatermarked) are combined, and the frame is resynthesized using the matching discrete wavelet packet synthesis filter bank (DWPS-FB).

The watermark detection/extraction process is based on APF parameter estimation using the estimated power spectrum of the received audio.
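The key-dependent subband selection step can be sketched as follows; the keyed-PRNG construction and parameter names are our own illustration of the idea, not the paper's exact procedure.

```python
import numpy as np

def subband_selection_key(secret_key, n_frames, n_subbands=11, first_subband=5):
    """Derive a per-frame subband index from a secret key: frame i embeds in
    one of the 11 candidate subbands (#5..#15 in the five-level decomposition
    used in Section VI). Seeding a PRNG with the shared secret is one simple
    way to realize a key-dependent, reproducible selection."""
    rng = np.random.default_rng(secret_key)
    return first_subband + rng.integers(0, n_subbands, size=n_frames)

# The detector re-derives the identical selection from the shared secret key,
# so the selection never has to be transmitted alongside the audio.
k_embed = subband_selection_key(secret_key=0xC0FFEE, n_frames=8)
k_detect = subband_selection_key(secret_key=0xC0FFEE, n_frames=8)
print(k_embed, np.array_equal(k_embed, k_detect))
```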
In our case, the detection scheme is blind: it does not require the original audio clip for watermark detection/extraction.
1) Experimental Results: The results presented in this section are based on the following system settings. A five-level wavelet decomposition is applied to each audio frame, producing eleven candidate subbands for data embedding (subband 5 through subband 15). Decoding thresholds are set for both binary and 4-ary decoding. We assume that the list of SPs is available at the detector, so only the decoding error contributes to the bit error probability. A BCH (15,7,2) error-correcting code is used to encode the watermark, and an MA signal model is applied for parametric spectrum estimation. To test the robustness of the proposed watermarking scheme, we consider the following attacks and report results.
Addition of White Noise: White Gaussian noise with power ranging from 0% to 100% of the audio power is added to the watermarked audio signal, and the resulting audio is applied to the watermark detector. The decoding bit error probability of the recovered data, for different values of signal-to-noise ratio (SNR, in decibels) and for both encoding schemes, is given in Fig. 6. These results show that the proposed watermarking scheme can effectively withstand the additive noise attack for SNRs as low as 3.5 dB with either encoding scheme.
Fig. 6. Detection performance of the proposed audio watermarking scheme against the AWGN attack.
Lossy Compression: The watermarked audio signal is compressed using an MPEG layer-III coder [8] at 128 kb/s. The decoding bit error probability of the proposed scheme under the lossy compression attack, for both encoding schemes, is given in Fig. 7. We observe that the proposed audio watermarking scheme is least resistant to compression attacks.
Random Chopping: To test robustness against desynchronization attacks, two to five samples out of every 100 samples of the watermarked audio were dropped at random. The average decoding bit error probabilities for the resulting audio, for the two encoding schemes, are given in Fig. 7; the proposed scheme exhibits a very low decoding bit error probability against random sample dropping.
Time-Scaling: The watermarked audio is subjected to time-scaling attacks. The average decoding bit error probability for the resulting audio, for the two encoding schemes, is given in Fig. 7. The proposed scheme is vulnerable to time-scaling attacks: the high decoding bit error probability can be attributed to the significant shift in the data-embedding locations (the SPs) caused by time scaling. However, the resynchronization method used in Section V can be applied to combat such attacks.
Resampling: To simulate a resampling attack, the data-embedded audio is first decimated and then interpolated back to the original sampling rate. The resulting audio is applied to the watermark detector; the decoding bit error probability of the recovered data for the two encoding schemes is given in Fig. 7.
Filtering: The watermarked audio signal is subjected to lowpass, highpass, and bandpass filtering attacks, each with a 12-dB/octave rolloff. The performance results against these filtering attacks are also given in Fig. 7.
Fig. 7. Decoding bit error performance of the proposed watermarking scheme against lossy compression (MP3), resampling (Res), random sample drop (RSD), lowpass filtering (LPF), highpass filtering (HPF), and bandpass filtering (BPF) attacks.
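The noise-addition, sample-dropping, and resampling attacks described above are straightforward to reproduce. The following Python sketch shows one way to generate such attacked signals for testing a detector; the function names and the three-drops-per-100 default are illustrative and are not taken from the paper's test harness.

```python
import numpy as np
from scipy.signal import resample_poly

rng = np.random.default_rng(0)

def awgn_attack(x, snr_db):
    """Add white Gaussian noise at a target SNR (dB) relative to signal power."""
    p_sig = np.mean(x ** 2)
    p_noise = p_sig / (10.0 ** (snr_db / 10.0))
    return x + rng.normal(0.0, np.sqrt(p_noise), size=x.shape)

def random_drop_attack(x, drops_per_100=3):
    """Desynchronization attack: drop a few samples at random
    from every 100-sample block of the signal."""
    keep = np.ones(len(x), dtype=bool)
    for start in range(0, len(x), 100):
        block = np.arange(start, min(start + 100, len(x)))
        if len(block) > drops_per_100:
            keep[rng.choice(block, size=drops_per_100, replace=False)] = False
    return x[keep]

def resampling_attack(x):
    """Decimate by 2, then interpolate back to the original rate."""
    return resample_poly(resample_poly(x, 1, 2), 2, 1)
```

Feeding the outputs of these functions to the watermark detector and measuring the decoding bit error probability reproduces the style of experiment reported in Figs. 6 and 7.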
VIII. CONCLUSION
We have proposed a novel high-capacity data hiding scheme for digital audio based on controlled, inaudible phase distortion introduced into selected audio subband signals. The detection exploits the effect of the allpass filter's impulse response on the subband signals. The proposed scheme can be tailored to various audio data hiding applications such as steganography, watermarking, and fingerprinting. The proposed technique is robust against standard data manipulations, yielding a low decoding bit error probability; this performance can be improved further by using channel coding schemes with higher error-correction capabilities. The fidelity of the proposed scheme was evaluated with informal listening tests.
REFERENCES
[1] E. Zwicker and H. Fastl, Psychoacoustics: Facts and Models. Berlin, Germany: Springer-Verlag, 1999.
[2] M. H. Hayes, Statistical Digital Signal Processing and Modeling. New York: Wiley, 1996.
[3] L. R. Rabiner and B. Gold, Theory and Application of Digital Signal Processing. Englewood Cliffs, NJ: Prentice-Hall, 1975.
[4] H. J. Blinchikoff and A. I. Zverev, Filtering in the Time and Frequency Domains. New York: Wiley, 1976.
[5] S. P. Lipshitz, M. Pocock, and J. Vanderkooy, "On the audibility of midrange phase distortion in audio systems," J. Audio Eng. Soc., vol. 30, no. 9, pp. 580–595, Sep. 1982.
[6] D. Preis, "Phase distortion and phase equalization in audio signal processing—A tutorial review," J. Audio Eng. Soc., vol. 30, no. 11, pp. 774–794, Nov. 1982.
[7] J. A. Deer, P. J. Bloom, and D. Preis, "Perception of phase distortion in all-pass filters," J. Audio Eng. Soc., vol. 33, no. 10, pp. 782–786, Oct. 1985.
[8] D. Pan, "A tutorial on MPEG/audio compression," IEEE Multimedia Mag., vol. 2, no. 2, pp. 60–74, Summer 1995.
[9] W. Bender, D. Gruhl, N. Morimoto, and A. Lu, "Techniques for data hiding," IBM Syst. J., vol. 35, no. 3/4, pp. 313–336, 1996.
[10] Y. Yardimci, A. E. Cetin, and R. Ansari, "Data-hiding in speech using phase coding," in Proc. Eurospeech, 1997, pp. 1679–1683.
[11] D. Radakovic, "Data hiding in speech using phase coding," M.S. thesis, Elect. Comput. Eng. Dept., Univ. Illinois, Chicago, 1999.
[12] C.-P. Wu, P.-C. Su, and C.-C. J. Kuo, "Robust audio watermarking for copyright protection," in Proc. SPIE 44th Annu. Meeting, Adv. Signal Process. Alg., Arch., Impl. IX, 1999, vol. 3807, pp. 387–397.
[13] P. Bassia and I. Pitas, "Robust audio watermarking in the time domain," in Proc. 9th Eur. Signal Process. Conf., 1998, pp. 25–28.
[14] M. F. Mansour and A. H. Tewfik, “Time-scale invariant audio data embedding,” in Proc. IEEE Int. Conf. Multimedia and Expo (ICME’01), 2001, pp. 76–79. [15] D. Kirovski and H. S. Malvar, “Spread spectrum watermarking of audio signals,” IEEE Trans. Signal Process., vol. 51, no. 4, pp. 1020–1033, Apr. 2003. [16] H. Malik, A. Khokhar, and R. Ansari, “Robust audio watermarking using frequency selective spread spectrum theory,” in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process. (ICASSP’04), 2004, vol. 5, pp. 385–388. [17] R. Ansari, H. Malik, and A. Khokhar, “Data-hiding in audio using frequency-selective phase alteration,” in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process. (ICASSP’04), 2004, vol. 5, pp. 389–392. [18] H. Malik, A. Khokhar, and R. Ansari, “Robust data hiding for audio,” in Proc. IEEE Int. Conf. Multimedia and Expo (ICME’04), 2004, vol. 2, pp. 959–962.
Hafiz M. A. Malik (S'02) received the B.E. degree in electronics and communications engineering (with distinction) from the University of Engineering and Technology, Lahore, Pakistan, in 1999 and the Ph.D. degree in electrical and computer engineering from the University of Illinois, Chicago, in 2006. After completing the Ph.D. degree, he joined the Department of Electrical and Computer Engineering, Stevens Institute of Technology, Hoboken, NJ, where he is currently a Postdoctoral Research Fellow. His research interests are in the general areas of digital content protection and digital signal processing; his current research focuses on information security, steganography, steganalysis, statistical signal processing, audio analysis/synthesis, and digital forensic analysis. He has published more than 15 technical papers and book chapters in refereed conferences and journals in these areas. Dr. Malik served on the organizing committee of the special track on Doctoral Dissertation at the IEEE International Symposium on Multimedia (ISM) 2006, and he has been a member of the technical program committees of several conferences.
Rashid Ansari (S'78–M'81–SM'93–F'99) received the B.Tech. and M.Tech. degrees in electrical engineering from the Indian Institute of Technology, Kanpur, India, in 1975 and 1977, respectively, and the Ph.D. degree in electrical engineering and computer science from Princeton University, Princeton, NJ, in 1981. He has been at the University of Illinois, Chicago, since 1995. He is currently a Professor in the Department of Electrical and Computer Engineering, where he has also served as Director of Graduate Studies and as Interim Head. He was a Research Scientist at Bell Communications Research from 1987 to 1995, prior to which he served on the faculty of Electrical Engineering at the University of Pennsylvania. His research interests are in the general areas of signal processing and communications, and topics of current research include image and video processing and analysis, video compression, multimedia signal processing and communication, data hiding, multirate filter banks and wavelets, OFDM transmission, and speech and audio analysis. Prof. Ansari has been an Associate Editor of the IEEE TRANSACTIONS ON IMAGE PROCESSING, IEEE SIGNAL PROCESSING LETTERS, and IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS. He was a member of the editorial board of the Journal of Visual Communication and Image Representation (1989–1993). He served as a member of the Digital Signal Processing Technical Committee of the IEEE Circuits and Systems Society. He was a member of the program committees of several IEEE conferences, in particular the International Conference on Image Processing, and he served on the organizing and executive committees of the SPIE Visual Communication and Image Processing (VCIP) conferences. He was General Chair (jointly with M. J. T. Smith) of the 1996 SPIE/IEEE VCIP Conference.
Ashfaq A. Khokhar (S'92–M'93–SM'99) received the M.S. degree in computer engineering from Syracuse University, Syracuse, NY, in 1989 and the Ph.D. degree in computer engineering from the University of Southern California, Los Angeles, in 1993. After receiving the Ph.D. degree, he spent two years as a Visiting Assistant Professor in the Department of Computer Sciences and the School of Electrical and Computer Engineering, Purdue University, West Lafayette, IN. In 1995, he joined the Department of Electrical and Computer Engineering, University of Delaware, Newark, where he first served as Assistant Professor and then as Associate Professor. In Fall 2000, he joined the Department of Computer Science and the Department of Electrical and Computer Engineering, University of Illinois at Chicago (UIC), where he currently serves as a Professor. He has published over 100 technical papers and book chapters in refereed conferences and journals in the areas of wireless networks, multimedia systems, data mining, and high-performance computing. His research interests include digital rights management, multimedia systems, secure multimedia systems, data mining, wireless and sensor networks, and high-performance computing. Dr. Khokhar was a recipient of the NSF CAREER award in 1998. His paper entitled "Scalable S-to-P broadcasting in message passing MPPs" won the Outstanding Paper Award at the International Conference on Parallel Processing in 1996. He served as the Program Chair of the 17th Parallel and Distributed Computing Conference (PDCS), 2004, Vice Program Chair of the 33rd International Conference on Parallel Processing (ICPP), 2004, and General Chair of the Workshop on Frontiers of Information Technology, 2004.