Affine Resistant Digital Audio Watermarking Using Template Matching
Ding-Yun Chen, Ming Ouhyoung and Ja-Ling Wu, IEEE Senior Member Communications and Multimedia Lab. Department of Computer Science and Information Engineering National Taiwan University, Taipei, Taiwan {dynamic, ming, wjl}@cmlab.csie.ntu.edu.tw
(Revised version)
Abstract
In this paper, we present a novel approach for embedding a digital watermark inaudibly into an audio clip, in the time domain, according to the difference between the two half blocks of each block. The proposed scheme does not require any host-related information for watermark extraction. The embedded watermark is robust to common audio signal manipulations, such as MP3 compression, time shifting, cropping, time scaling, D/A and A/D conversion, insertion, deletion, re-sampling, re-quantization and filtering. Two kinds of information are hidden in the audio: the owner’s information and a synchronization template. The owner’s information is a binary image provided by the copyright owner, which can be words, numbers, a signature, a personal seal or an organization’s logo. The synchronization template is generated by a random number generator controlled by a secret key and is used to correct the signal skew caused by time shifting, cropping and time scaling attacks. The two kinds of information are combined and dispersed by another secret key before embedding. The proposed technique can be applied to automatically search for protected audio in an audio database (or on the World Wide Web) by first matching the synchronization template, and then showing the owner’s information if the audio is claimed to have been watermarked.
1. Introduction
Because digital multimedia content can be distributed much faster and more easily via the Internet and the Web, consumers are investing heavily in digital multimedia markets, such as audio, image and video. Digital storage and transmission make it trivial to reconstruct exact copies cheaply. Therefore, copyright protection of digital multimedia content is becoming an increasingly important issue for content providers and owners. A promising solution to this challenging task is digital watermarking. All watermarking methods share the same generic building blocks: an embedding system and an extraction system [1]. The embedding system imperceptibly and robustly embeds a watermark signal, and thus marks the host signal as having a specific property. The extraction system tests a suspected signal using correlation-based watermark matching. Several issues must be taken seriously into account in a watermark system [1,2]. First, the watermark must be imperceptible in order to preserve the original signal quality. Artifacts introduced by a watermarking process are not only annoying, but also reduce the commercial value of the original artwork. Second, the watermark must be robust to common audio signal manipulations, such as MP3 (MPEG-I Audio Layer 3) compression, time shifting, cropping, time scaling, filtering, etc. Third, the watermark must depend on secret keys to prevent unauthorized detection or removal of the embedded watermark. The space of secret keys must be large enough to ensure statistical safety. Fourth, the watermark system must be reliable. That is, the false alarm probability must be very small. Fifth, it is better not to require the host signal for watermark extraction, so that the watermark system can automatically search for protected audio in an audio database (or on the World Wide Web). Finally, a semantically meaningful watermark is more convincing for real applications.
Since, in daily life, one can claim a document, a creative work, and so on, by signing one’s signature, stamping a personal seal or printing an organization’s logo, such visually recognizable patterns are more intuitive for representing one’s identity than a sequence of random numbers. Digital watermarking techniques have recently attracted a great deal of attention in the literature and the research community. Most existing watermark schemes focus on images and videos [1,3,4,5]. A few audio watermark algorithms have been proposed [6-12]. Despite a few similarities, image and audio watermark systems differ significantly. For example, the human auditory system (HAS) is more sensitive than the human visual system (HVS) to signal discontinuity in time. The commonly performed signal manipulations are also different, such as rotation for images and re-quantization for audio. In general, one has to consider the characteristics of each medium when designing a watermark system for it.
In 1995, Cox et al. [13] proposed secure spread spectrum watermarking for multimedia, which treats the original signal as the communication channel and the watermark as the transmitted signal. Since then, many watermark systems have been proposed based on a similar idea [6,7,8]. In 1998, Swanson et al. [10] applied the temporal and frequency perceptual masking of the HAS to guarantee that the embedded noise-like watermark is inaudible and robust. They also introduced the notion of a dual watermark to solve the deadlock problem. The first watermark is robust and requires the original signal during detection. The second watermark need not be robust to editing of the audio segment, since it is meant to protect the audio clip that a pirate claims to be the original. Ikeda et al. [8] proposed the use of band-limited random sequences to embed a spread spectrum based watermark. They developed a systematic method for generating band-limited and orthonormal random sequences of any length. According to the results in [8], their method is robust against MP3 coding and decoding at more than 160 kbps when the center frequency of the sub-band is lower than 11 kHz. Seok et al. [7] analyzed audio segments by noise shaping in the frequency domain to maximize the watermark energy while retaining perceptual audio quality, and then embedded the watermark in the time domain. They applied whitening or de-correlation processes before performing correlation in the extraction phase. A problem inherent in block-based digital watermarking is block synchronization during extraction, i.e., the alignment between the test audio and the original audio. Most audio signal manipulations shift and/or scale the audio sequence by an unknown amount, which introduces block synchronization uncertainty during extraction. Once the block-by-block alignment goes wrong, the extraction is likely to give wrong results. Therefore, block synchronization plays an important role in correct extraction.
Several studies have addressed this problem in image watermarking. Pun et al. [4] used template matching before watermark extraction to achieve affine transformation resistant image watermarking. The template is defined as a sparse set of positions in the DFT domain. These positions depend on the owner’s key, and therefore it is possible to create a reference template during extraction. Chen et al. [14] embedded some marker bits (which play the same role as a template) for block synchronization and recovered the original DCT block by extracting the marker bits. On the audio side, several studies address the problem of block synchronization under time shifting. Pitas et al. [12] embedded a watermark in the time domain by slightly modifying the amplitude of each audio sample. The characteristics of this modification are determined both by the original signal and by the copyright owner’s key. The owner’s key generates the watermark based on a chaotic map. They examined all possible circular shifts by direct watermark matching without using a separate template. Haitsma et al. [9] embedded watermarks in the frequency domain, where the magnitudes of the Fourier coefficients were slightly modified. The corresponding watermark extraction was done by detecting the highest cross-correlation over all possible cyclic shifts and might be enhanced by using the
Symmetrical Phase Only Matched Filtering (SPOMF) approach. Wu et al. [11] combined salient point extraction with a Fourier transform domain watermark embedding procedure. The salient point extraction of the audio content is done during both the watermark embedding and the detection processes, so that synchronization is regained at each salient point. Li et al. [6] tried to track the synchronization of an audio signal using a unique tag (which behaves the same as a template). The tag is inserted into the audio signal at owner-specified locations (determined by a secret key) using the pre-masking and/or post-masking properties of the HAS in the time domain. The contribution of this paper is the development of a novel time domain block-based audio watermarking scheme with a synchronization template. The template and the owner’s information are combined as the watermark and embedded together. In the proposed approach, three kinds of secret keys are used to ensure the security of the watermarking scheme: the first is used to generate the template, the second to disperse the watermark signal, and the last to determine the block size. The owner’s information is in the form of a binary image provided by the copyright owner, which can be words, numbers, a signature, a personal seal or an organization’s logo. Before the owner’s information is extracted, block synchronization is performed by matching the embedded template, to resist signal skews caused by compression, time shifting, cropping and time scaling attacks. Another secret key recovers the shape of the owner’s information. To the best of our knowledge, the proposed work is the first digital audio watermark scheme in the literature that embeds a binary image as the owner’s information and uses a template for audio block synchronization. This paper is organized as follows.
Section 2 describes the proposed embedding approach and the syntactic structure of a watermark (the synchronization template and the owner’s information). The corresponding extraction scheme and the false alarm probability analysis are presented in Section 3. Section 4 shows some experimental and subjective test results. Finally, Section 5 concludes this paper.
2. The watermark embedding scheme
The proposed watermark embedding scheme can be viewed as adding an inaudible noise-like signal to the original audio in the time domain. Fig. 1 shows the diagram of the proposed embedding scheme. An input audio clip is first segmented into many non-overlapping slices. Each slice is an independent unit for synchronization using template matching. Then, each slice is segmented into several non-overlapping frames. A frame, which carries a fragment of the information bits, is an independent unit for watermark embedding. Next, each frame is segmented into several non-overlapping blocks; a block is the smallest unit, in which a single bit (0 or 1) is embedded. Section 2.1 describes the groundwork we use to embed a bit in a block. In the proposed scheme, it is not necessary to embed bits in all blocks. How to select a block for
embedding bits and how to perform bit embedding will also be explained in Section 2.1. A binary image of the owner’s information is embedded to claim the copyright in our watermark scheme. Another binary image, the template, is embedded for block synchronization and for indicating protection during extraction. The bit sequence table is generated from both binary images; each table entry provides a candidate bit sequence for embedding in a frame. Section 2.2 discusses the watermark formed by the two binary images, the secret keys and the bit sequence table in detail. Section 2.3 presents the conditions under which we can embed bits into a frame and the way to select an appropriate entry from the bit sequence table. Section 2.4 introduces the function of a slice, which comprises several frames and serves as a synchronization unit.
2.1 The groundwork of the proposed watermark embedding
This subsection describes the basis of our embedding scheme, that is, the way of embedding a bit in a block. A feature space is said to be good for watermarking if the corresponding feature is invariant, or approximately invariant, between the original and the attacked signals. In this paper, a reasonable and simple feature is chosen as the basis for our watermark embedding: the difference between the two half blocks of an audio block in the time domain. A bit, $w_b$, is used to indicate the polarity of a block by the following definition,
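$$w_b = \begin{cases} 1, & \text{if } D > 0 \\ 0, & \text{otherwise,} \end{cases} \qquad (1)$$

or,

$$w_b = \begin{cases} 0, & \text{if } D > 0 \\ 1, & \text{otherwise,} \end{cases} \qquad (2)$$

where

$$D = \sum_{n=1}^{bs/2} sample(n) - \sum_{n=bs/2+1}^{bs} sample(n), \qquad (3)$$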
sample(n) denotes the n-th audio sample within the block, and bs denotes the block size. Therefore, one can modify D, if needed, to embed 1 or 0 in the block. As for the choice of polarity, our system uses the same embedding criterion (i.e., eqn. (1) or eqn. (2)) for all bits within a slice. This issue will be described in Section 2.4. The sign of D in eqn. (3) is invariant (with very high probability) between the original signal and an attacked signal, such as an MP3 compressed one. Taking an audio clip as an example, Fig. 2 shows the invariant ratio for the sign of D before and after MP3 compression at 64 kbps, for different block sizes. As we can see, the sign of D is a rather robust and stable feature for audio watermarking against MP3 compression. In our system, the block size is also a secret number (secret key 3) during embedding and extracting, so that the attacker doesn’t
know the block size. In our implementation, the block size can be chosen from 50 to 200 samples, i.e., from 1.1 to 4.5 msec at a sampling rate of 44.1 kHz. Although the invariant ratio for the sign of D is high, the imperceptibility of a modification depends on the block data. In other words, only some blocks can be adjusted imperceptibly, and it is important for the embedding process to determine which blocks are adjustable and therefore suitable for watermark embedding. We filter out the blocks that are almost impossible to use for bit embedding. This rejection test is done as follows: a block is said to be not adjustable if it satisfies the following condition:

$$|D| > a_{max} \qquad (4)$$

or

$$s_{max} < t_{min}, \qquad (5)$$
where $a_{max}$ denotes the maximum inaudible adjustment, $s_{max}$ denotes the maximum value of the samples in the block, and $t_{min}$ is a threshold for silence detection. Eqn. (4) filters out the un-adjustable blocks; eqn. (5) filters out the silent blocks, which are easily attacked. A block in a frame is not necessarily adjusted, even if it passes the checks of eqn. (4) and eqn. (5). This issue will be detailed in Section 2.3. When a bit is embedded in a block, the magnitude of its D must exceed a value, $a_t$ (> 0), for robustness. Taking the criterion of eqn. (1) as an example, the adjusted value, A, of a block is defined as follows:

$$A = \begin{cases} -D + a_t, & \text{if } w_b = 1 \text{ and } (D < 0 \text{ or } |D| < a_t) \\ -D - a_t, & \text{if } w_b = 0 \text{ and } (D > 0 \text{ or } |D| < a_t) \\ 0, & \text{otherwise.} \end{cases} \qquad (6)$$
That is, after adjustment the value of D becomes D+A: A/2 is added to the first half block and A/2 is subtracted from the second half block. The situation is similar when using the criterion of eqn. (2). Furthermore, we must take care to distribute the value A/2 imperceptibly over a half block. We cannot add the same value to each sample within the half block, because this would introduce high frequency components due to the discontinuities at the boundaries between half blocks, and this artifact would be audible. This problem is called the block-effect of audio. In order to solve the block-effect problem, the adjusted values are weighted by a sine window to reduce the noise.
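The block-level embedding of this subsection can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, $s_{max}$ is taken as the largest absolute sample value, and the sine window is assumed to be a half-sine normalized to unit sum so that each half block changes by exactly A/2 in total (the text does not specify the window).

```python
import numpy as np

def half_block_difference(block):
    """D of eqn. (3): sum of the first half block minus sum of the second."""
    half = len(block) // 2  # an odd trailing sample is ignored in this sketch
    return float(np.sum(block[:half]) - np.sum(block[half:2 * half]))

def polarity_bit(d):
    """w_b under the criterion of eqn. (1)."""
    return 1 if d > 0 else 0

def is_adjustable(block, a_max, t_min):
    """A block is rejected if eqn. (4) or eqn. (5) holds."""
    d = half_block_difference(block)
    s_max = float(np.max(np.abs(block)))  # assumption: absolute peak value
    return not (abs(d) > a_max or s_max < t_min)

def adjustment(d, bit, a_t):
    """A of eqn. (6), criterion of eqn. (1).

    'D < 0 or |D| < a_t' is equivalent to D < a_t, and
    'D > 0 or |D| < a_t' is equivalent to D > -a_t.
    """
    if bit == 1 and d < a_t:
        return -d + a_t
    if bit == 0 and d > -a_t:
        return -d - a_t
    return 0.0

def embed_bit(block, bit, a_t):
    """Return a copy of the block whose D has been moved to D + A."""
    half = len(block) // 2
    a = adjustment(half_block_difference(block), bit, a_t)
    # Assumed window: half-sine over each half block, normalized to sum 1,
    # so the change tapers to (almost) zero at the half-block boundaries.
    window = np.sin(np.pi * (np.arange(half) + 0.5) / half)
    window /= window.sum()
    out = np.asarray(block, dtype=float).copy()
    out[:half] += (a / 2.0) * window
    out[half:2 * half] -= (a / 2.0) * window
    return out
```

With this normalization the first half block gains A/2 and the second loses A/2 in total, so the new difference is D+A as required, while individual samples near the half-block boundaries are changed very little.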
2.2 The watermark and the secret keys
This subsection describes the relation among the watermark, the secret keys and the bit sequence table. Two kinds of information are hidden in the audio: the owner’s information and a synchronization template. The template and the owner’s information together with the secret keys produce a bit sequence table, which is used for selecting a suitable bit sequence
when embedding bits in a frame. Fig. 3 shows the process. The copyright owner provides a 64×64 binary image as the owner’s information, which can be words, numbers, a signature, a personal seal or an organization’s logo. The template is used to restore the block synchronization lost through time shifting, cropping and time scaling attacks. The template is generated by a random number generator controlled by a secret key (key 1). Alternatively, secret key 1 can itself be a 128×64 binary image that is used directly as the template. Another secret key (key 2) disperses the owner information and the template, respectively, using a Linear Feedback Shift Register (LFSR) [15] to generate a random sequence for re-permuting the two binary images. The watermark is composed of the two dispersed binary images, and produces the bit sequence table. Each table entry consists of “watermark bits” (comprising “owner information bits” and “template bits”) and “position bits”, and provides a candidate bit sequence to be embedded in a frame. The table lists all bits of the watermark and their corresponding positions. In our implementation, a table entry contains 4 “owner information bits”, 8 “template bits”, and 10 “position bits” representing the corresponding positions. That is, the table has 2^10 = 1024 entries, and each entry has 22 bits. To avoid long runs of 0s or 1s in the position bits when embedding in a frame, each table entry is dispersed first. Fig. 4 shows an example of a table entry before dispersing. Instead of embedding all watermark bits in bit-by-bit order, we embed the combination of watermark bits and position bits together. The only drawback of using such a bit sequence table is the overhead of embedding the position information of each entry. However, it has several advantages.
First, the size of each independent embedding unit (a frame) is small, that is, the number of circular shifts to be examined for each unit is quite limited, so that block synchronization under time shifting can be reached quickly. Furthermore, because not all blocks are adjustable, a few frames might not be embedded; those frames can be easily skipped, since each frame is independent of the others. Finally, a frame selects the table entry that requires the minimum modification of its blocks. This approach minimizes the effect of watermark embedding, so that the embedding can be done inaudibly.
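The construction of the bit sequence table can be sketched as follows. The sizes (64×64 owner image, 128×64 template, 1024 entries of 4 + 8 + 10 bits) come from the text; everything else is an assumption made for illustration: the function names are hypothetical, the 16-bit LFSR taps (16, 14, 13, 11) and the "argsort of LFSR states" permutation are one possible way to re-permute with an LFSR, the bit order inside an entry (owner, template, position) is a guess since Fig. 4 is not reproduced here, and the additional dispersing of each entry is omitted.

```python
import numpy as np

def make_template(key1, shape=(128, 64)):
    """Template bits from a random number generator seeded by key 1."""
    return np.random.default_rng(key1).integers(0, 2, size=shape)

def lfsr16_states(seed, count):
    """First `count` states of a maximal-length 16-bit Fibonacci LFSR."""
    state = seed & 0xFFFF
    if state == 0:
        raise ValueError("LFSR seed must be non-zero")
    states = np.empty(count, dtype=np.int64)
    for k in range(count):
        bit = (state ^ (state >> 2) ^ (state >> 3) ^ (state >> 5)) & 1
        state = (state >> 1) | (bit << 15)
        states[k] = state
    return states

def disperse(image, key2):
    """Re-permute the image bits; returns (dispersed bits, permutation)."""
    flat = np.asarray(image).ravel()
    # The LFSR states are distinct within one period (65535 steps),
    # so sorting them yields a key-dependent permutation.
    perm = np.argsort(lfsr16_states(key2, flat.size))
    return flat[perm], perm

def build_table(owner_image, template_image, key2, entries=1024):
    """Rows of 22 bits: 4 owner bits, 8 template bits, 10 position bits."""
    owner, owner_perm = disperse(owner_image, key2)
    template, template_perm = disperse(template_image, key2)
    index = np.arange(entries)[:, None]
    position = (index >> np.arange(9, -1, -1)) & 1  # MSB-first entry index
    table = np.hstack([owner.reshape(entries, 4),
                       template.reshape(entries, 8),
                       position]).astype(np.uint8)
    return table, owner_perm, template_perm
```

Because every entry carries its own index in the position bits, an extractor that recovers a frame can place the 4 owner bits and 8 template bits without knowing which frame they came from; inverting the permutation with key 2 then restores the image layout.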
2.3 Embedding bits in frames
Section 2.1 described the way of embedding a bit in a block, while this subsection describes the way of embedding a bit sequence in frames. A slice is segmented into non-overlapping frames. Each frame is segmented into several non-overlapping blocks, and a suitable table entry is selected for embedding when a frame can be embedded. The bits of a table entry are embedded sequentially, one per block, in the frame. A frame cannot be embedded if too many of its blocks are not adjustable (as defined by eqn. (4) and eqn. (5)). In the extreme case, all blocks in a frame are not adjustable, that is, the polarity of each block cannot be changed. In this case, the polarity of blocks in the frame is
hard to match all bits of some table entry. Fortunately, in most cases, there are at least some adjustable blocks in a frame. So, in general, the polarity of blocks can be adjusted (with high probability) to match all bits of some table entry. However, the number of adjustable blocks that represent “watermark bits” in a frame may not be large enough to match all the “watermark bits” of a table entry. In this case, a table entry is selected that matches as many bits as possible. If there are many adjustable blocks in a frame, the correlation of the embedded watermark bits is high. When there are $N_{adj}$ adjustable blocks in a frame, the correlation of watermark bits is defined as:

$$C(N_{adj}) = \frac{\sum_{ij} e_{ij}\,\tilde{e}_{ij}}{\sum_{ij} e_{ij}^{2}}, \quad \text{when } i = N_{adj}, \qquad (7)$$
where $e_{ij}$ and $\tilde{e}_{ij}$ denote the polarities of the desired and the resulting embedded bits, respectively, in frames that have i adjustable blocks. Fig. 5 shows a typical example of the correlation of “watermark bits” for different numbers of adjustable blocks in a frame. For example, there are 324 frames that have 10 adjustable blocks, and the correlation is 0.83 in those frames. As we can see, the correlation is still high enough even if a few blocks cannot be embedded, that is, even if several error bits are embedded. So we define a frame to be adjustable if its number of adjustable blocks is larger than or equal to a threshold. That is,

$$N_{adj} \geq t, \qquad (8)$$
where $N_{adj}$ is the number of adjustable blocks in a frame and t is a threshold. In our implementation, there are 22 blocks in a frame, and we choose 5 as the threshold. A frame selects a bit sequence to be embedded from the bit sequence table. To minimize the modification of each block in a frame, the table entry with the minimum cost is selected. The cost, c, of a table entry is defined as follows:

$$c = Q \times \sum_{i} A^{2}(i), \qquad (9)$$
where A(i) denotes the adjusted value of the i-th block of a frame and is determined by eqn. (6); Q is a constant that will be described later. Most watermark bits are embedded repeatedly under the above selection scheme, but some watermark bits may never be chosen for embedding. In general, incomplete embedding of the watermark bits does not cause trouble in template matching during the extraction phase; however, the visual quality of the owner’s information may be degraded in this situation. To address this issue, a queue is used to increase the visual quality, i.e., to try to embed all bits of the watermark in the audio. The table entries that have been used in previous frames are recorded in the queue, so that entries not yet selected have higher selection priority in the next run. In eqn. (9), we set Q=1 if the table entry is not in the queue, and Q=2 otherwise. The queue size affects the visual quality, and may also affect the signal to noise ratio (SNR) of the extracted
signal. Fig. 6 shows an example of the effect of using different queue sizes. Although several bits of the watermark are still not embedded, the extracted owner information can be easily recognized by the naked eye. In our implementation, the total number of table entries is 1024. We choose 800 as the queue size.
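The entry selection of eqns. (8) and (9), including the queue penalty Q, can be sketched as below. This is only an illustration under assumptions: the function names are hypothetical, the criterion of eqn. (1) is used for eqn. (6), and non-adjustable blocks are assumed to be left untouched and to contribute nothing to the cost, which the text does not state explicitly.

```python
from collections import deque
import numpy as np

def adjustments(d, bits, a_t):
    """Vectorized eqn. (6) under the criterion of eqn. (1)."""
    d = np.asarray(d, dtype=float)
    bits = np.asarray(bits)
    up = (bits == 1) & (d < a_t)     # must raise D up to +a_t
    down = (bits == 0) & (d > -a_t)  # must lower D down to -a_t
    return np.where(up, a_t - d, np.where(down, -a_t - d, 0.0))

def select_entry(d, adjustable, table, queue, a_t, t=5):
    """Index of the minimum-cost table entry, or None if the frame is skipped.

    d          : D of every block in the frame (length 22 in the paper)
    adjustable : boolean mask of blocks passing eqns. (4) and (5)
    table      : (entries, blocks) array of 0/1 bits
    queue      : indices of recently used entries (penalized with Q = 2)
    """
    d = np.asarray(d, dtype=float)
    adjustable = np.asarray(adjustable, dtype=bool)
    if adjustable.sum() < t:          # eqn. (8): frame is not adjustable
        return None
    a = adjustments(d[None, :], table, a_t)
    a = a * adjustable[None, :]       # assumption: skip non-adjustable blocks
    q = np.ones(len(table))
    q[np.fromiter(queue, dtype=int)] = 2.0
    cost = q * np.sum(a ** 2, axis=1)  # eqn. (9)
    return int(np.argmin(cost))

# Usage sketch: a bounded queue of size 800 forgets the oldest entries.
recent = deque(maxlen=800)
```

After each embedded frame the chosen index would be appended to `recent`, so entries used recently cost twice as much and entries not yet embedded tend to be picked next.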
2.4 Synchronization Slice
This subsection describes the function of a slice, which comprises M frames (M=40 in our implementation) and behaves as a synchronization unit. The original audio is segmented into non-overlapping slices, and each slice is segmented into several contiguous non-overlapping frames, as detailed in Section 2.3. A slice is the basic synchronization unit for embedding and extracting, and is independent of the others. Instead of simply segmenting the original audio slice by slice, our system generates random starting points for the segmentation of each slice. The starting point of each slice is difficult for an attacker to detect, but it can be detected automatically by the block synchronization of the proposed extraction scheme using template matching. In addition, the selection of the starting point of each slice can be used to improve the audio quality and the correlation of the embedded watermark bits. Therefore, our system generates P random starting points for each slice, and selects the best one for embedding. Fig. 7 shows the SNR (defined in eqn. (10)) and normalized correlation (defined in eqn. (13)) of an example audio clip when using different values of P. Although the embedding time increases, the SNR and the correlation can be improved. For each slice, we can select the criterion of eqn. (1) or eqn. (2), and use the same criterion to embed all bits within the slice. Therefore, for each random starting point of each slice, we can also select the better of the two criteria, and improve the SNR and correlation. Our system can use different criteria for different slices, since we embed position bits and watermark bits individually into a frame. When the signs of the watermarked audio samples are changed, the extracting phase of our system will detect this and use the other criterion for extraction. Therefore, our system can still get the correct watermark bits. This situation may happen during D/A and A/D conversion.
Consequently, there are several advantages of synchronization using a slice. First, we can use a slice, rather than the whole audio, for block synchronization in the extracting phase. So our system can resist more attacks, such as insertion and deletion. Second, we can reduce the false alarm probability by using more embedded bits. In our implementation, there are 40 frames in a slice, that is, there are at most 320 template bits in a slice to reduce the false alarm probability. The analysis of the false alarm probability will be detailed in Section 3.3. Finally, the selection of random starting points for each slice can be used to increase the secrecy, quality and correlation of the watermarked audio.
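As a quick check of the numbers quoted in this section, the following sketch computes the size of one slice from the implementation parameters given in the text (40 frames per slice, 22 blocks per frame, 8 template bits per frame, 44.1 kHz sampling). It assumes that all frames in a slice are contiguous and of equal length; the function name is hypothetical.

```python
FRAMES_PER_SLICE = 40
BLOCKS_PER_FRAME = 22
TEMPLATE_BITS_PER_FRAME = 8
SAMPLE_RATE_HZ = 44100

def slice_dimensions(block_size):
    """Return (samples per slice, seconds per slice, max template bits)."""
    samples = FRAMES_PER_SLICE * BLOCKS_PER_FRAME * block_size
    seconds = samples / SAMPLE_RATE_HZ
    template_bits = FRAMES_PER_SLICE * TEMPLATE_BITS_PER_FRAME
    return samples, seconds, template_bits

for bs in (50, 200):  # the extremes of the block size range
    print(bs, slice_dimensions(bs))
```

Under these assumptions a slice spans 44,000 samples (about 1 second) at the smallest block size and 176,000 samples (about 4 seconds) at the largest, and carries at most 40 × 8 = 320 template bits, which matches the figure stated above.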
3. The watermark extraction scheme
In the proposed extraction scheme, block synchronization is done by template matching, and a binary image is extracted as the owner’s information. Fig. 8 shows the block diagram of the proposed extracting phase. Block synchronization plays an important role in correct extraction. Once the block-by-block alignment goes wrong, a wrong decision becomes much more likely. Block synchronization in our extraction scheme concerns not only the blocks, but also the frames and slices, because each frame embeds bits independently and each slice is composed of several frames. Once a slice is synchronized, the frames of the slice are synchronized, and the blocks of the frames are also synchronized. The template is generated, as in the embedding scheme described in Section 2.2, by a secret key (key 1) and is used to correct the signal skews caused by time shifting, cropping and time scaling attacks. Section 3.1 details the block synchronization. The proposed extraction scheme can be applied to automatically search for protected audio clips in an audio database (or on the World Wide Web) by matching the embedded template. If the audio is claimed to be protected, the watermark (extracted from frames) and another secret key (key 2) jointly recover the owner’s information. Section 3.2 introduces the extraction of the owner’s information. Furthermore, the watermark system should be reliable, and the false alarm probability of the extraction scheme is investigated in Section 3.3.
3.1 Block synchronization using template matching
A problem inherent in block-based digital watermarking is block synchronization during extraction, i.e., the alignment between the test audio and the original audio. Since a public watermarking system need not use the original audio for extraction, only the embedded template bits can be used for block synchronization.
The template is controlled by secret keys, and is the same as that of the embedding scheme described in Section 2.2. In this subsection, we first describe the approach to block synchronization against time shifting and cropping, and then that against time scaling. In our extraction scheme, the template is extracted first to reach block synchronization. The main idea of block synchronization is to search for the highest normalized correlation of template bits over all possible time shifts and scales within a slice, and to apply a multi-resolution approach to speed up the search. Using this approach, the false alarm rate can be kept reasonably small, as will be detailed in Section 3.3. Note that, in our system, the frames and blocks are synchronized if a slice is synchronized. So this subsection focuses only on the way of synchronizing the slices of an audio clip. Template bits are extracted according to the criterion of eqn. (1) or eqn. (2). Our system chooses the one that provides the highest normalized correlation, and applies it to all blocks of a slice. The normalized correlation (NC) is used to measure the match between the extracted template and the embedded template (decoded based on secret keys), and is defined
as:

$$NC = \frac{\sum_{i} t_i\,\tilde{t}_i}{\sum_{i} t_i^{2}} \qquad (10)$$
where $\tilde{t}_i$ denotes the extracted template bits, and $t_i$ denotes the corresponding template bits decoded from the secret keys. Note that $t_i$ and $\tilde{t}_i$ are represented by 1s or –1s, rather than 1s or 0s, in eqn. (10). Most audio signal manipulations shift the audio sequence by a random number of samples. A block-based watermark system must deal with this issue. Time shifting can be detected by template matching. The highest correlation, $R_f$, over all possible circular shifts (whose number equals the slice size) is defined as:
R f = max NCi , 0=i