IEEE TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING, VOL. 15, NO. 6, AUGUST 2007
The Replacement Attack Darko Kirovski, Fabien A. P. Petitcolas, and Zeph Landau
Abstract—Billions of dollars allegedly lost to piracy of multimedia have recently triggered the industry to rethink the way music and movies are distributed. As encryption is vulnerable to rerecording, currently all copyright protection mechanisms tend to rely on watermarking. A watermark is an imperceptive secret hidden in a host signal. In this paper, we analyze the security of multimedia copyright protection systems that use watermarks by proposing a new breed of attacks on generic watermarking systems. A typical replacement attack relies upon the observation that multimedia content is often highly repetitive. Thus, the attack procedure replaces each signal block with another, perceptually similar block computed as a combination of other similar blocks found either within the same media clip or within a library of media clips. Assuming the blocks used to compute the replacement are marked with distinct secrets, we show that if the computed replacement block is at some minimal distance from the original marked block, a large portion of the embedded watermark is removed. We describe the logistics of the attack and an exemplary implementation against a spread-spectrum data hiding technology for audio signals.
Index Terms—Fingerprinting, multimedia similarity, pattern matching, replacement attack, watermarking.
I. INTRODUCTION
SIGNIFICANTLY increased levels of multimedia piracy over the last decade have put the movie and music industry under pressure to deploy a standardized antipiracy technology. The problem of ensuring copyright of multimedia at the client side lies in the fact that traditional data protection technologies such as encryption or scrambling cannot be applied as they are prone to digital or analog rerecording (copying). Thus, almost all modern copyright protection mechanisms tend to rely to a certain extent on watermarks, imperceptive marks hidden in host signals. Two different types of protection systems have evolved over the past decade: content screening and fingerprinting. In a typical content screening system, the client’s media player searches the content for hidden information. If the secret mark is found, the player must verify, prior to playback, whether it has a license to play the content. By default, unmarked content is considered unprotected and is played without any barriers. In a forensic marking scenario, each distributed copy is marked with a unique fingerprint. Forensic analysis of pirated material is then performed on a trusted secure server in the presence of the original content. Here, players are not modified.
Both systems have inherent problems. Content screening demands public key watermarking [1], whereas fingerprinting suffers from exceptionally low collusion resistance [2]. In addition, the embedded watermark in both applications must survive an arbitrary signal processing attack which preserves perceptual fidelity of the targeted content. In this paper, we focus our attention on signal-processing types of attacks. Several adversarial mechanisms surveyed in [3] have been largely successful in setting up robustness benchmarks for watermarking technologies. However, none of the attack technologies that do not rely on having access to the watermark detector remove watermarks without any hope that an irreversible or preventing action is possible.
Manuscript received February 25, 2006; revised March 26, 2007. The associate editor coordinating the review of this manuscript and approving it for publication was Dr. Rudolf Rabenstein. D. Kirovski is with Microsoft Research, Redmond, WA 98052 USA (e-mail: [email protected]). F. A. P. Petitcolas is with the External Research Office, Microsoft Research, Cambridge CB3 0FB, U.K. (e-mail: [email protected]). Z. Landau was with Microsoft Research, Redmond, WA 98052 USA. He is now with the Math Department, City College of New York, New York, NY 10031 USA (e-mail: [email protected]). Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org. Digital Object Identifier 10.1109/TASL.2007.900088
II. RATIONALE BEHIND THE ATTACK
Recently, Kirovski and Petitcolas introduced the first version of the replacement attack with the aim of reducing the correlation of a watermarked signal with its watermark by replacing each original watermarked block of the multimedia signal with another perceptually similar block [4]. The replacement is computed as a least-squares approximation of that block based upon other signal blocks that are perceptually similar but not tainted with the same watermark bits as the original marked block. The original idea of using content self-similarity to attack watermarking systems was introduced independently by two research teams, [5], [6] and [7], [8]. Primitives similar to the ones used by the replacement attack have been used before in the watermark copy attack. This attack aims at estimating the watermark from one image and embedding the estimate into another image [9], [10]. While it is relatively straightforward to prevent the copy attack, e.g., by making the watermark content dependent, the replacement attack appears to be hard to prevent, and its effect on watermarks appears to be irreversible.
The rationale behind the attack is simple. The key observation is that the overwhelming majority of multimedia with perceptual value does not obey the memoryless source model. Strong self-similarity in both audio and, in particular, visual content exists locally and globally across media clips. While local correlations, such as the similarity of two consecutive frames in a video, are commonly exploited for compression, global correlations are rarely considered in signal processing primitives, mainly due to the significant computational complexity involved in their identification, which often prevents real-time applications.
The replacement attack adds marginal noise to the marked content. Although the noise appears to be zero-mean independent and identically distributed (i.i.d.) Gaussian, it wipes out an
1558-7916/$25.00 © 2007 IEEE
KIROVSKI et al.: REPLACEMENT ATTACK
enormous percentage of the correlation. An additive and truly Gaussian noise of equivalent variance would not alter the correlation beyond its original statistics. The surprising effect of the replacement attack stems from the redundancy which exists in multimedia content. By using this information, one can recreate the protected multimedia while being marginally dependent upon the originally embedded watermark. In the remainder of this section, we aim at describing the key effect of the replacement attack, while avoiding a detailed formal analysis for brevity.
A. Signal and Marking Model
First, we adopt the following simplified signal similarity model. We model the original content as a random variable of samples . We assume that all samples of are produced by a zero-mean -variance Gaussian source. We specifically do not adopt the memoryless source model for sample interdependency. Instead, we assume that each block of consecutive samples, , where , is comparable to other blocks , such that:
• and any of the blocks in do not overlap;
• , where is the deviation between the two vectors, and operator returns the -norm of a vector .
To model the existence of self-similarity in the multimedia content, we assume that is relatively small so that the replacement of with any does not cause perceptual1 change in the multimedia clip. Next, we assume that is mutually independent with respect to all blocks in other than the ones in . Hence, for example, a correlation of with another block is modeled using a sum-product of two i.i.d. random variables. This model is restrictively simple, yet sufficiently powerful for the context presented in this manuscript.
The original content is watermarked by adding a spread-spectrum sequence, , a random variable of i.i.d. chips: . For now, we assume that is independent with respect to in the corresponding signal representation domain.
For example, multiplicative watermarks are content dependent in the linear, but content independent in the logarithmic domain. The resulting clip, , is the only information available to the adversary.
Using a classical Neyman–Pearson hypothesis test, the detector decides that a certain signal is watermarked with if the correlation of , where . The detection threshold controls the tradeoff between the probabilities of false positive and false negative decisions. We recall from modulation and detection theory that the correlation detector is optimal in the class of linear detectors in the presence of i.i.d. noise [11]. For example, if , then . Also, for , we have . Since the noise in the detector is approximately Gaussian due to the Central Limit Theorem, the error probabilities of false negatives, , and positives, , are computed by integrating the tail of a corresponding Gaussian pdf (1) where is the variance of the noise in the forensic detector. If , we have . Under nonlinear attacks, the detector noise may not be of equal variance depending upon whether the content under test has or has not been watermarked. Typically, in both cases, the noise is approximately equivalent. Based on the statistics of spread-spectrum watermarking [12], by adding additive white Gaussian noise to the marked content, the expected normalized correlation value remains intact for any mean of and variance . Hence, such an attack in moderate amplitudes should minimally2 affect the watermark detection.
1 Euclidean distance is not a solid perceptual similarity metric. By using a better perceptual metric, one can improve the attack as higher noise could be tolerated while retaining the same perceptual quality.
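The behavior of the correlation detector under additive white Gaussian noise can be illustrated with a small simulation. The sketch below is only an illustration under stated assumptions: the names `mark`, `normalized_correlation`, and `detect`, the chip amplitude `delta`, the threshold `tau = 0.5`, and all numeric settings are hypothetical choices made here, not values or notation taken from the paper.

```python
import random

def mark(x, w, delta=1.0):
    """Add a spread-spectrum watermark chip by chip: y[i] = x[i] + delta * w[i]."""
    return [xi + delta * wi for xi, wi in zip(x, w)]

def normalized_correlation(z, w, delta=1.0):
    """Correlation of z with the chips w, scaled so a marked clip gives about 1."""
    return sum(zi * wi for zi, wi in zip(z, w)) / (delta * len(w))

def detect(z, w, tau=0.5, delta=1.0):
    """Threshold test: declare 'watermarked' if the correlation exceeds tau."""
    return normalized_correlation(z, w, delta) > tau

rng = random.Random(7)                      # fixed seed for repeatability
N = 50_000                                  # number of samples (assumed)
x = [rng.gauss(0.0, 10.0) for _ in range(N)]          # zero-mean Gaussian host
w = [rng.choice((-1.0, 1.0)) for _ in range(N)]       # i.i.d. +/-1 chips
y = mark(x, w)                                        # marked clip
noisy = [yi + rng.gauss(0.0, 5.0) for yi in y]        # marked clip plus AWGN

c_unmarked = normalized_correlation(x, w)   # fluctuates around 0
c_marked = normalized_correlation(y, w)     # fluctuates around 1
c_noisy = normalized_correlation(noisy, w)  # still around 1, slightly noisier
```

In this toy setting the added noise widens the spread of the detector output but leaves its expected value where it was, which is the property the text relies on.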
B. Alternate Attack Approaches
The goal of the adversary is to alter so that the resulting clip preserves the perceptual fidelity of , while preventing the detector from identifying . There are several standard approaches to performing this task. One is to moderately and randomly bend the space dimensions to make it difficult to position in the same direction as it had when embedded into [13], [14]. Such StirMark-like attacks are efficient in confusing most “blind” detectors [15], yet can be reversed in the presence of the original content in forensic scenarios [16]. In addition, regular structures in multimedia, such as rhythm in music, can be used to realign content before detection and, hence, thwart geometric attacks. Another approach is to estimate and remove the watermark from the content as . This type of attack has limited success if the watermark magnitude is low and/or is not redundantly embedded in [1]. Finally, if the watermark detector is available to the adversary, as in the case of content screening, she can launch a multitrial search for a watermark estimate , which quickly becomes accurate enough to cancel out the effect of [17]. Note that robustness to attacks requires awareness of all types of attacks simultaneously. For example, random space-bending (e.g., StirMark-like) attacks for audio can be prevented using redundant watermark embedding; however, such a solution is prone to the estimation attack [18].
C. Informal Discussion on the Replacement Attack
The core idea behind the replacement attack is to replace one block of data with another, perceptually similar block found elsewhere in the clip or database of clips. From the assumption that the chips of are mutually independent, it is easy to show that because of and . This means that the watermark is completely removed from the multimedia clip upon replacement.
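A toy experiment makes the contrast between block replacement and plain additive noise concrete. The sketch below assumes a deliberately simple similarity model, in which two blocks share the same underlying content and differ only by a small deviation and by independently drawn watermark chips. The helper names (`similar_marked_block`, `corr`) and all numeric values are hypothetical and chosen only for illustration.

```python
import math
import random

rng = random.Random(11)          # fixed seed for repeatability
n = 40_000                       # block length (assumed, large for stable statistics)
shared = [rng.gauss(0.0, 10.0) for _ in range(n)]   # content common to both blocks

def similar_marked_block():
    """Return (block, chips): shared content + small deviation + its own +/-1 chips."""
    chips = [rng.choice((-1.0, 1.0)) for _ in range(n)]
    block = [s + rng.gauss(0.0, 1.0) + c for s, c in zip(shared, chips)]
    return block, chips

def corr(z, chips):
    """Normalized correlation of a block with a chip sequence."""
    return sum(a * b for a, b in zip(z, chips)) / len(chips)

target, w_target = similar_marked_block()   # block under attack
substitute, _ = similar_marked_block()      # similar block, independent watermark

# Average per-sample power of the change introduced by the replacement.
diff_power = sum((a - b) ** 2 for a, b in zip(target, substitute)) / n

# Control: add truly Gaussian noise of the same power to the target instead.
awgn = [t + rng.gauss(0.0, math.sqrt(diff_power)) for t in target]

c_before = corr(target, w_target)        # close to 1
c_replaced = corr(substitute, w_target)  # close to 0: the watermark is gone
c_awgn = corr(awgn, w_target)            # still close to 1
```

Both modifications disturb the target by the same average power, yet only the replacement drives the correlation with the original chips toward zero.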
2 Additive white Gaussian noise only increases the variance at the detector, which, in the case of high , may cause strong offsets from the expected correlation, and consequently an erroneous detection. Typically, watermark length can be adjusted to tolerate sufficiently high noise magnitudes.
Typically, the removal is permanent as there are
far more combinations for “undoing” the attack, in case the detector is exhaustively trying to identify the original positioning of permuted blocks, than the achievable false positive rates. For example, for a multimedia signal that consists of a set of 1000 block 3-tuples (thus, for each block , which is a strong restriction) that can swap their positions, there are approximately 10^778 possible ordering combinations (each 3-tuple admits 3! = 6 orderings, and 6^1000 is about 10^778), a number that far outreaches the inverse of the false positive rates reported in the literature for watermarking audio or video [12], [18]. As we will see later, in this paper we propose a standard least-squares approximation technique to compute the replacement blocks, where it is exceedingly more difficult to perform this type of search as several blocks, not necessarily similar to the target, are combined to create the replacement.
The assumption that, for each media block , there exists a perceptually equivalent block in or a related multimedia database, such that , is overly optimistic. The adversary must launch a lengthy search process to identify content similarities. This can be performed, for example, by correlating the search database with the target block . For a sufficiently large database, this search can result in false positives. Here is an informal and simple example.
For a given block of samples and a multimedia database containing of samples, we assume that there are blocks similar to in and remaining blocks that are mutually i.i.d. with respect to . We denote the two sets of blocks as and respectively. A block is “similar” to if . The correlation of with a nonsimilar block can be approximated as a Gaussian random variable due to the Central Limit Theorem. In this case, the variance of the resulting variable equals .3 Among the possible such correlations in the database, the probability that the maximum correlation is smaller than or equal to is (2)
(3)
If a block selected as a false positive is used as a replacement for a given target , it is expected that . Although was not directly used in the construction of , was still correlated with , i.e., “transferred” the correlation with during the search process for . As only highly correlated blocks are used as replacements, it is expected that they only marginally remove the watermark correlation. Thus, the selection criteria should be such that for a given database length , . Consequently, resulting in high only blocks with similarity measure , should be selected as replacements.
Transferred correlation is a side effect of the attack proposal presented in this paper, where blocks from are used to create a least-squares approximation of . There are two sources of transferred correlation. First, the search for the basis blocks can return false positives. Next, the higher the , the closer the distance between and . More accurate approximation implies higher amounts of transferred correlation to , which can render the attack inefficient.
In summary, we recognize three key phenomena related to the replacement attack. First, by replacing a given signal block with another signal block that is perceptually similar but watermarked independently, one can remove the watermark in under relatively mild assumptions. Second, large media databases must be searched for replacements using appropriately sized blocks in order to avoid false positives. Third, any procedures that aim to reduce the distance between the replacement and the target block may be potentially a source of transferred correlation. The perplexing nature of these three phenomena defines the efficacy of the replacement attack. Analyzing them formally is a cumbersome task because of the difficulty of modeling similarity in multimedia. We simplify the effects of transferred correlation by enforcing the following relationship between the replacement and the target blocks (4)
Scalars and represent the bounds within which replacement blocks are identified. The purpose of the lower bound is to eliminate false positives due to block approximations that are too accurate. The upper bound aims at modeling the requirement for perceptual fidelity and also filtering out false positives stemming from elaborate searches. For example, for music, typically, dB and dB. This is a model that has proven to yield solid results for relatively small [4]. Finally, note that if for a given block , no replacement is found to satisfy the objective from (4), the adversary can launch a different type of attack for this block such as watermark estimation.
3 Var[w] = 1.
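The false-positive risk of the database search discussed in this section can be illustrated numerically. The sketch below is not a reconstruction of (2) and (3); it uses the standard order-statistics identity for the maximum of independent Gaussians, under the assumption that the correlations with nonsimilar blocks are independent, zero-mean, and share one standard deviation. The function name `prob_max_below` and its parameters `t`, `sigma`, and `m` are hypothetical.

```python
import math

def prob_max_below(t, sigma, m):
    """P[max of m independent N(0, sigma^2) correlations <= t] = Phi(t / sigma) ** m."""
    phi = 0.5 * (1.0 + math.erf(t / (sigma * math.sqrt(2.0))))  # Gaussian cdf at t
    return phi ** m

# Probability that at least one nonsimilar block exceeds a 3-sigma threshold
# as the number of searched blocks grows (illustrative values only).
for m in (10, 1_000, 100_000):
    p_false_positive = 1.0 - prob_max_below(3.0, 1.0, m)
    print(m, round(p_false_positive, 4))
```

Under these assumptions, a threshold that is safe for a small database becomes unreliable for a large one, which is why the selection criterion has to tighten as the database grows.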
In the remainder of this paper, we propose deterministic and randomized variations of the replacement attack; also, we address several tradeoffs and problems that inherently exist in realistic attack scenarios. Finally, we demonstrate how the attack can be applied on a spread-spectrum audio watermarking technology. We rely on empirical results to show the efficacy of several versions of the replacement attack.
III. REPLACEMENT ATTACK—LOGISTICS
The replacement attack is not limited to a type of content or to a particular watermarking algorithm. For example, systems that modulate secrets using spread-spectrum [12] and/or quantization index modulation (QIM) [19] are all prone to the replacement attack [8]. For brevity, the analysis of the attack in this paper is restricted to direct sequence spread-spectrum watermarks. In order to launch the attack successfully, the adversary does not need to know the details of the watermark codec. This assumption is convenient for the adversary compared to the knowledge mandated by other attacks mentioned in the previous section. Given a signal and a corresponding watermark , the attack performs the following steps:
1) partition into overlapping blocks of length ;
2) for each block , find a set of perceptually most similar blocks in that do not overlap ;
3) compute a replacement block as a least-squares linear approximation of using blocks from ;
4) replace with .
A quantifying measure of perceptual similarity is defined later in (5) in Section IV-B.
A. Attack Tradeoffs
Considering the issues related to the replacement attack and presented in the previous section, we identify several important tradeoff decisions that the adversary needs to make before applying the attack. The tradeoffs reflect on the following important performance metrics: reduction in correlation, distortion, and speed.
T.1 Point dimensionality: has a profound effect on as mentioned in the previous section. By increasing , the adversary reduces the likelihood that vectors in , as well as their linear combinations, are correlated with . On the other hand, significantly increased reduces the expectation on the cardinality of as it increases , thus reducing attack effectiveness. One heuristic is that should be maximized for a given clip so as to still produce “perceptually valid” matches.
T.2 Selection of : Increased improves all aspects of attack performance except distortion. This parameter should be maximized for a given perceptual quality.
T.3 Size of the Look-Up Media Library: determines the complexity of the attack and can significantly improve . As outlined in the previous section, large can cause false positives in the search; to address this issue, has to be raised.
IV. ATTACK STEPS
A. Step I: Signal Partitioning
For improved perceptual quality of the resulting multimedia clip, the protected signal is partitioned into a set of blocks , where each block , overlaps its neighbors and is windowed with an analysis windowing function that yields perfect reconstruction with its synthesis counterpart. With no loss of generality, we assume that is a one-dimensional signal such as audio.
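The partitioning step can be sketched in a few lines. The example below is a simplified time-domain stand-in, not the transform used in the experiments: it assumes 50% overlap and a sine window applied at both analysis and synthesis, a pair chosen here because its squared values sum to one across overlapping blocks. The function names `sine_window`, `partition`, and `overlap_add` are hypothetical.

```python
import math

def sine_window(length):
    """Sine window; with 50% overlap its squared values sum to 1 across blocks."""
    return [math.sin(math.pi * (i + 0.5) / length) for i in range(length)]

def partition(signal, length):
    """Split the signal into analysis-windowed blocks that overlap by half."""
    hop = length // 2
    win = sine_window(length)
    return [
        [signal[start + i] * win[i] for i in range(length)]
        for start in range(0, len(signal) - length + 1, hop)
    ]

def overlap_add(blocks, length, total):
    """Apply the synthesis window to each block and overlap-add into one signal."""
    hop = length // 2
    win = sine_window(length)
    out = [0.0] * total
    for k, block in enumerate(blocks):
        for i in range(length):
            out[k * hop + i] += block[i] * win[i]
    return out

signal = [math.sin(0.3 * i) + 0.1 * i for i in range(64)]
blocks = partition(signal, 16)
rebuilt = overlap_add(blocks, 16, len(signal))
# Samples covered by two blocks (indices 8..55 here) are reconstructed exactly.
```

In an attack, each block would be swapped for its computed replacement between `partition` and `overlap_add`; the windowing then smooths the transitions between substituted neighbors.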
Note that the goal of the windowing function is the same as with standard block-based signal processing of multimedia; the choice of this function is not expected to mask or perturb the underlying self-similarity that exists in the multimedia clip. In our experiments on audio, we used the modulated complex lapped transform [20] with a standard Hanning window.
B. Step II: Search for the Substitution Base
Finding perceptually similar blocks of certain music or video content is a challenging and computationally expensive task. In this paper, we restrict our focus to audio, although video is in many cases a much better source of repetitive content within a single recording. For example, within a common scene, its objects experience geometric transformations significantly more frequently than changes in appearance. In general, repetition is often a principal part of composing music and is a natural consequence of the fact that distinct instruments, voices, and tones are used to create a soundtrack. Thus, it is likely to find similarities within a single musical piece, an album of songs from a single author, or in instrument solos.
For each point , we want to find a set of best matched points with individual blocks in denoted as , where indexes the location of in . Before we define the search process, we adopt normalized and squared Euclidean distance between two -dimensional points and as a similarity metric (5)
Although in realistic attack scenarios the similarity function should be masked with the perceptual model for improved matching, in this section we disregard this effect. Next, note that maximized normalized correlation corresponds to minimal Euclidean distance in . Thus, the search for top matches can be conducted in the following way. We first compute the normalized convolution of in against each block with respect to . This can be done rather fast using the fast Fourier transform and the overlap-add fast convolution method [21]. The complexity of this step is . The top correlated blocks in that do not overlap constitute the substitution base for .
C. Step III: Computing the Replacement
This step of the algorithm is crucial as it resolves the tradeoffs outlined in Section II. First, we review the restrictions of the attack, which exactly apply only to algorithms A1 and A2. We restrict that each sample of the replacement block is at a “safe” and “perceptually valid” distance from the sample it is replacing in . More formally (6)
We discuss several randomized algorithms for computing such that the above constraints are satisfied. Algorithm A0 computes the replacement block as follows.
of points such that for We assume that it is given a set each point , the following relation holds: . The replacement block is computed from the selected blocks in such that its similarity with respect to is maximized. More formally, we construct a matrix where each row of this matrix represents one block from . We is minimized. The aim to compute a vector such that least-squares solution to this set of linear equations, commonly . A tempocalled pseudoinverse of , equals rary replacement block is now computed as . Three cases can occur. satisfies the re1) The temporary replacement block , in which case the quirements, e.g., replacement equals .
1926
IEEE TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING, VOL. 15, NO. 6, AUGUST 2007
2) We cannot construct a high-quality estimate of , i.e., is an empty set; in this case, we copy the source signal . 3) is too similar to , e.g., , in which case and a randomly chosen block of samples from are mixed as such that . enforces the desired similarity The mixing parameter in the last case if (7) where and under the assumption that and are mutually independent. Algorithm A1 has a different strategy in computing the replacement block in order to address the fact that A0 follows the requirement in (6) only globally but not per sample. Thus, A1 computes in several steps. In the first step, it selects random and distinct subsets of blocks from . We denote these subthrough . For each of these subsets, A1 computes the sets least squares approximation of . More formally, for a given we are seeking for such that is minimized. Optimal is computed simply as (8) which yields the following replacement candidate vector . Samples from can be categorized into two categories: ones that satisfy (6) and the ones that do not. To address this issue, we introduce a binary coverage matrix associated with each . We set to if sample satisfies (6) and vice versa. We define as effective the number of samples a given replacedimensionality covers . Heuristically, we are alment ready driven by the assumption that the larger the effective dimensionality, the stronger the effect of the attack on the resulting correlation. Hence, we can model the goal of our replacement algorithm as a combinatorial optimization problem. A1 aims to cover as many as possible samples from using as few as pos. This problem is sible vectors from the set better known as minimum cover and is NP-hard (SP5, [22]). A1 aims at solving this problem using a greedy heuristic which itthat cover maximum number eratively selects vectors from of remaining uncovered samples. 
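The greedy covering step of A1 can be sketched as follows. This is a minimal illustration, not the authors' implementation: the per-sample test standing in for (6) is assumed here to be a simple band `lo <= |candidate - target| <= hi`, and the names `coverage_sets`, `greedy_cover`, and `assemble` are hypothetical. Computing the candidate vectors themselves (the least-squares step) is outside this sketch.

```python
def coverage_sets(target, candidates, lo, hi):
    """For each candidate, the sample indices whose deviation from the target is in [lo, hi]."""
    return [
        {i for i, (t, c) in enumerate(zip(target, cand)) if lo <= abs(c - t) <= hi}
        for cand in candidates
    ]

def greedy_cover(cover):
    """Repeatedly pick the candidate that covers the most still-uncovered samples."""
    uncovered = set().union(*cover)          # samples that at least one candidate covers
    chosen, owner = [], {}
    while uncovered:
        best = max(range(len(cover)), key=lambda j: len(cover[j] & uncovered))
        gain = cover[best] & uncovered
        chosen.append(best)
        for i in gain:
            owner[i] = best                  # sample i will be taken from candidate 'best'
        uncovered -= gain
    return chosen, owner

def assemble(target, candidates, owner):
    """Build the replacement; samples no candidate covers keep the marked value."""
    return [candidates[owner[i]][i] if i in owner else target[i]
            for i in range(len(target))]

target = [0.0] * 6
candidates = [
    [1, 1, 1, 5, 5, 5],
    [5, 5, 1, 1, 5, 5],
    [5, 5, 5, 5, 1, 0],
]
cover = coverage_sets(target, candidates, lo=0.5, hi=2.0)
chosen, owner = greedy_cover(cover)
replacement = assemble(target, candidates, owner)
```

The greedy rule gives no optimality guarantee for the underlying NP-hard problem, but it is cheap and always terminates, since each round covers at least one new sample.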
There may be samples that cannot be covered by any ; their values are set to the corresponding values of the marked content in order to minimize distortion. Parameters , , and strongly influence the performance of A1. By increasing , we reduce the attack’s distortion at the cost of stronger correlation of each and . Empirically, we have obtained the best results for small , usually in the order of . Once is set, we determine the average effective dimensionality in . The higher the , the more candidate trials A1 can afford to test. Again empirically, we have achieved solid results with . Finally, the size of the substitution database is kept large at .
Algorithm A2 is a significantly slower, but still randomized, version of A1 and A0. It is based on the observation that the blocks in are highly redundant because they are searched using a common criterion [(5)]. Using only these blocks in linear combinations strongly restricts the search space. A better, but still not optimal, strategy for representing as accurately as possible using a constant number of blocks from is to use a variant of Gram–Schmidt orthonormalization (GSO) [23]. Hence, A2 iteratively performs the following process. It finds the first most similar point with . The is subtracted from as , where is a scalar equal to the normalized correlation of . In the subsequent iteration, A2 computes the similarity of with as described in Section IV-B, finds the best match , and subtracts it from the remainder as , where . This procedure is iterated while .
The above version of A2 has the problem that not all samples of the final replacement obey the constraint in (6). In order to address this problem, we adjust A2 to discard samples that satisfy (6) in the subsequent iterations. In addition, we consider the top similar blocks in each iteration as it is not the case that the closest point provably has the highest effective dimensionality. These two adjustments marry A1 with GSO to best describe A2. Similarly, one can advertise A1 as a low-cost version of GSO because it does not need to perform the similarity search in each iteration. In Fig. 1, we illustrate the key steps performed by each of the algorithms A0, A1, and A2. Finally, in this manuscript, we omit a detailed description of the algorithms for brevity. Instead, we present an application of the attack on an off-the-shelf spread-spectrum scheme for audio watermarking.
D. Step IV: Block Substitution
In the final step, each block of the original watermarked signal is replaced with the corresponding computed replacement to create the output media clip.
V. REPLACEMENT ATTACK FOR AUDIO
In this section, we demonstrate how the generic principles behind the replacement attack can be applied against an audio watermarking technology.
We first describe how an audio signal is partitioned and preprocessed for improved perceptual pattern matching. Next, we analyze the similarity function we used for our experiments. The effect of the replacement attack on direct-sequence spread-spectrum watermark detection is presented in the following sections.
A. Audio Processing for the Replacement Attack
The set of signal blocks is created from the coefficients of a modulated complex lapped transform (MCLT) [20]. The MCLT is a 2× oversampled DFT filter bank, used in conjunction with analysis and synthesis windows that provide perfect reconstruction. We consider MCLT analysis blocks with 2048 transform coefficients and an overlap. Since most psychoacoustic models operate in the frequency spectrum [20] and to address the natural property of audio that magnitudes in low frequencies are typically much stronger than in high frequencies, we launch the replacement attack in the logarithmic frequency domain, i.e., we used as the domain representation for the audio signal. We used decibels as a natural quantifier for this domain. Each block of coefficients is normalized and psychoacoustically masked using an off-the-shelf masking model [20]. Similarity is explored exclusively in the audible part of the frequency spectrum. Because of psychoacoustic masking, the actual similarity function in (5) is not commutative. A replacement block is always masked with the psychoacoustic mask of the replaced block. Fig. 2 illustrates the signal processing primitives used to prepare blocks of audio content for substitution.
Watermark length is assumed to be greater than one second. In addition, we assume that watermark chips may be replicated along the time axis at most for one second [18].4 Thus, we restrict that for a given block its potential substitution blocks are not searched within one second.
Fig. 1. Illustration of steps conducted by attack algorithms A0, A1, and A2.
Fig. 2. Block diagram of the signal processing primitives performed as pre- and postprocessing to the replacement attack.
B. Analysis of the Similarity Function
We performed several experiments in order to evaluate the effectiveness of the replacement attack. The first set of experiments aims at quantifying similarity between blocks of several audio clips marked with spread-spectrum watermarks of a dB magnitude. In all examples, block similarity is computed over the 2–7 kHz subband as watermark codecs commonly hide data in a subband that is not strongly distorted by compression and medium quality low- and high-pass filtering [18]. Fig. 3 shows the values of the similarity function for five 2048-long MCLT blocks at positions 122, 127, 132, 137, and 142 against a database of 240 blocks within one audio clip (techno music). We observe that throughout the database, four different pairs of blocks (circled in the subfigure) are found as similar below 4 dB to the pair of blocks with indices 127 and 137. All similar pairs of blocks preserve the same index distance as the target pair. This points to the fact that in many cases content similarity is not a result of coincidence, but a consequence of repetitive musical content.
Fig. 4 illustrates the probability that for a given 2048-long MCLT block , there exists another block within the same audio clip that is within dB, where is a real number. This experiment was conducted for five different types of audio content: techno, jazz, rock, vocals, and classical music. All content was sampled at 44.1 kHz and represented with 16 bits per sample. For this benchmark set of distinctly different musical pieces, we conclude that the average similarity for two randomly selected blocks within an audio clip is in the range of 6–8 dB. The probability of finding a similar block should rise proportionally to the size of the substitution database, especially if it consists of clips of the same genre/performer. Finally, note that electronically generated music (in our benchmark a techno song) is significantly more likely to contain perceptually correlated blocks than music that is performed by humans.
The second set of experiments explores the distortion that the replacement attack introduces. We consider three cases. In the first case, in the left subfigure of Fig. 5, we present the probability that the replacement block is at distance if equals the most similar block found in the substitution database (e.g., ). The right subfigure presents the same metric for the case when and is computed as described in Section IV-C. Finally, Table I quantifies the improvement in the average distortion as K increases from 1 to 100. We conclude that the replacement attack in our experimental setup induces between 1.5 and 3 dB of distortion noise with respect to the marked copy, a change in fidelity that most users are willing to tolerate for free content.
4 Higher level of redundancy may enable effective watermark estimation.
TABLE I. IMPROVEMENT IN SIGNAL DISTORTION DUE TO THE REPLACEMENT ATTACK AS PARAMETER K IS INCREASED. RESULTS ARE REPORTED ON THE dB SCALE. AVERAGE EFFECTIVE BLOCK DIMENSIONALITY IS N = 400.
Fig. 3. Music self-similarity: a similarity diagram for five different 2048-long MCLT blocks within a techno clip with 240 MCLT blocks. Zero-similarity denotes equality. The abscissa x denotes the index of a particular MCLT block. The ordinate denotes the similarity of the corresponding block x with respect to the selected five blocks with indices 122, 127, 132, 137, and 142.
Fig. 4. Music self-similarity: probability density function of the similarity function within an audio clip. Five different types of music are considered: rock, classical, jazz, vocals, and techno. A certain value x on the abscissa represents a histogram bin from x − 0.25 to x + 0.25 dB.
VI. EFFECT OF THE ATTACK ON WATERMARK DETECTION
A. Using the Algorithm A0
In order to evaluate the effect of a replacement attack on spread-spectrum watermarks, we conducted two experiments. For both experiments, we used spread-spectrum watermarks that spread over 240 consecutive 2048-long MCLT blocks (approximately 11 s long), where only the audible frequency magnitudes in the 2–7 kHz subband were marked. We did not use chip replication, as its effect on watermark detection is orthogonal with respect to the replacement attack. A variant of this technology has robustly detected watermarks under a large benchmark of individual signal processing primitives and attacks and, more importantly, their combinations [18].

Fig. 6 shows how the normalized correlation of a spread-spectrum watermark detector is affected by an increase of the parameter $K$. We performed the following experiment. We marked the first 240 2048-long MCLT blocks of five different songs (ranging from 3 to 5 min in duration) with a direct-sequence spread-spectrum watermark. The watermark amplitude was set to 1 dB. During the attack, we replaced each target block with its computed replacement block following the recipe presented in Section IV-C. In Fig. 6, we show two results. First, we show the average normalized correlation value (left ordinate) across ten different tests for watermark detection within marked content (curves marked WM) and within marked content attacked with our attack for several values of $K$ (curves marked RA). Second, we show on the right ordinate the signal distortion caused by the replacement attack: the minimal, average, and maximal distortion across all five audio clips. We can conclude from the diagram that:
• for small values of $K$, its increase results in greatly improved distortion metrics;
• for large values of $K$, the computed replacement vectors are too similar to the target blocks, which results in a lower effect on the normalized correlation.

Fig. 5. Probability density function of the similarity function evaluated on $(B, R)$ for two different cases: $K = 1$ (left) and $K = 10$ (right).

Fig. 6. Response of a spread-spectrum watermark detector to the replacement attack. The abscissa quantifies the change in parameter $K$ from 1 to 100 for a fixed watermark amplitude of 1 dB. The left ordinate shows the increase of the normalized correlation as $K$ increases. The results are obtained for five full songs in different genres. The right ordinate shows the corresponding minimal, maximal, and average distortion with respect to the set of benchmark clips due to the replacement attack.

Note that we have not limited the approximation accuracy in this experiment, in order to showcase the effect of accurate approximation. The randomized algorithms presented in
Section IV-C aim at balancing this effect with the goal of reducing the attack noise.

The power of the replacement attack is most notably observed by comparing the effect of adding a white Gaussian noise (AWGN) pattern of a certain standard deviation (in dB) to a replacement attack of equivalent distortion. Whereas the dramatic effect of replacement can be observed in Fig. 6, AWGN does not reduce the expected response of the correlation detector. Moreover, additive noise of 2–3 dB in the 2–7 kHz subband is a relatively tolerable modification. The correlation of the watermarked content with superimposed AWGN of a magnitude equivalent to the noise introduced by the replacement attack exhibits the nontrivial variance shown in Fig. 6. This variance stems from the fact that the audio sequence under test was relatively short. Correlation variance could be reduced by considering longer watermarks; however, our focus was on presenting experimental results on sequences as short as possible because of the cumbersome tests involved in obtaining the presented diagrams.

In listening tests, a distortion of 2.5 dB has already been found statistically inaudible by the Golden Ears test for a similar watermarking methodology [18]. A distortion of up to 4 dB is widely considered acceptable to most users in listening tests. The fact that, in the reported setting, all tested clips produced noise lower than 4 dB suggests that the resulting clips should be usable by a wide audience. The distortion computed with respect to the original content was between 0.17 dB (techno) and 0.68 dB (classical) higher than the reported distortion with respect to the watermarked clips. We conclude that content with stronger self-similarity exhibited a lower difference between the two distortion measures. Finally, this difference increased for all clips with the increase of $K$; we did not identify a case when it was higher than 0.9 dB.

B. Using the Algorithm A1

Table II shows the response of a spread-spectrum watermark detector to a version of the replacement attack guided by A1. Attack parameters are $K = 200$, $r = 5$, $c = 100$, a target distortion of 2.5 dB, and $\varepsilon = 4$ dB. The results are obtained for five full songs in different genres. Watermark amplitude equals one. The search space was drastically smaller than in the tests conducted for A0: only 10 s of audio from the same song was used. The table presents information averaged over 100 different tests for each test clip. It reports the signal variance after applying a moving average filter [18]. The row marked as $E[a \cdot w]$ is the correlation response obtained from the watermark detector and normalized with respect to the sample coverage. We also present the coverage: the total number of samples altered by the attack, the total number of audible samples, and their ratio. Finally, we demonstrate that the attack caused, on average, distortion marginally higher than the 2.5 dB target. For a search window almost 4 times smaller than in the previous subsection and a signal distortion of only 2.5 dB, we succeeded in almost completely removing the watermark in one song (93% watermark removal) and in removing more than half of the correlation in the remaining four clips.

Another important issue is the fact that the distortion introduced by the replacement attack is linearly proportional to the watermark amplitude. Clearly, with the increase of watermark amplitude, the search process of the replacement attack becomes harder for two reasons: 1) block contents become more randomized; 2) the substituted blocks are more correlated with the original blocks. On the other hand, we have empirically concluded that watermark amplitude affects the reduction of the normalized correlation minimally. Although stronger watermarks may sound like a solution to the replacement attack, high watermark amplitudes cannot be accepted for two reasons: first, the requirement for high-fidelity marked content, and second, the fact that strong watermarks can be efficiently estimated using an optimal watermark estimator [1], whose per-bit estimation error decreases exponentially with the watermark amplitude.

TABLE II. RESPONSE OF A SPREAD-SPECTRUM WATERMARK DETECTOR TO THE REPLACEMENT ATTACK. ATTACK PARAMETERS ARE $K = 200$, $r = 5$, $c = 100$, A TARGET DISTORTION OF 2.5 dB, AND $\varepsilon = 4$ dB. THE RESULTS ARE OBTAINED FOR FIVE FULL SONGS IN DIFFERENT GENRES. WATERMARK AMPLITUDE EQUALS ONE. THE SEARCH SPACE WAS DRASTICALLY SMALLER THAN IN THE TESTS CONDUCTED FOR A0, AT ONLY 10 s OF AUDIO FROM THE SAME SONG. THE TABLE PRESENTS INFORMATION COLLECTED FROM 100 DIFFERENT TESTS FOR EACH TEST CLIP: THE SIGNAL VARIANCE AFTER A MOVING AVERAGE FILTER; $E[a \cdot w]$, THE CORRELATION RESPONSE OBTAINED FROM THE WATERMARK DETECTOR AND NORMALIZED WITH RESPECT TO THE SAMPLE COVERAGE; THE TOTAL NUMBER OF SAMPLES ALTERED BY THE ATTACK; THE TOTAL NUMBER OF AUDIBLE SAMPLES; THEIR RATIO; AND, FINALLY, THE CONSEQUENT DISTORTION RESULTING FROM THE ATTACK.

VII. CONCLUSION

For any watermarking technology and any type of multimedia content, one powerful attack is to recreate the original recording,
i.e., to perform the music again or to capture an image of the same original visual scene. In this paper, we emulate this attack using a computing system. The replacement attack aims at replacing small pieces of the marked content with perceptually similar but unmarked (or differently marked) substitution blocks, created using a library of multimedia content. As we have shown, if the substitution blocks are constructed following specific constraints, they can be largely uncorrelated with the originally embedded watermark. The attack is more effective if the multimedia clip contains more redundancy within itself or with respect to a considered multimedia library. The overwhelming majority of multimedia clips do not conform to the memoryless source model. Thus, plenty of redundancy commonly exists in the content to facilitate the replacement attack.

We present one deterministic and two randomized algorithms for computing replacement blocks using a least-squares linear combination of signal blocks similar to the target block. Although the attack is generic and can be applied to all marking strategies, we demonstrate how it can be launched for audio content and a traditional watermarking modulation technology: direct-sequence spread spectrum.

From the presented experimental results, we conclude that an implementation of the replacement attack that considers a relatively small substitution database can create replacement blocks that are within a distance of only 1.5–3 dB of the target signal blocks. Such an attack removes approximately 50%–90% of the correlation between the watermark and the marked/attacked content. Our preliminary results demonstrate that the attack has a similar effect on other marking mechanisms such as QIM [8], [19].

We identify several possible prevention strategies against the replacement attack. For example, a data hiding primitive may identify distinct parts of the content at watermark embedding time and mark only these blocks.
This identification must be achieved at detection as well. Such a solution dramatically increases the complexity of both watermark embedding and detection. In addition, the “uniqueness identifier” cannot possibly have access to all related multimedia. Finally, the notion of similarity and all other details of the attack are defined by the adversary and cannot be known during watermark embedding or detection. For example, for audio, the adversary may launch three separate replacement attacks, each on a distinct frequency subband over blocks of varying size and position. Another strategy may be to create a content-dependent watermark, where perceptually similar blocks are embedded with similar watermarks. Just as with the previous strategy, the fact that the adversary has the freedom to choose the attack parameters makes the definition of content-dependent watermarks dependent upon the attack, which cannot be assumed. Next, content dependency can be achieved on a per-sample basis using a scheme such as QIM [19]. For a large quantization step, one may actually achieve robustness to the replacement attack at the expense of hard-to-tolerate low fidelity. Similarly, in the case of spread-spectrum watermarks, longer watermarks and increased detector sensitivity may enable watermark detection at lower thresholds. Unfortunately, such a solution requires significantly longer watermarks, which results in significantly lower robustness with respect to desynchronization attacks.
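The remark that longer watermarks permit lower detection thresholds can be illustrated with a small simulation. The sketch below assumes a zero-mean Gaussian host and a random ±1 chip sequence, neither of which is specified in the text, and the function name `detector_noise_std` is hypothetical. Under these assumptions, the spread of the correlation computed on unmarked content shrinks roughly as one over the square root of the watermark length.

```python
import numpy as np

def detector_noise_std(n_chips, host_std=1.0, trials=1000, seed=0):
    """Empirical standard deviation of the correlation on unmarked content.

    Host samples are modeled as zero-mean Gaussian and the watermark as
    random +/-1 chips; both are modeling assumptions for this sketch.
    """
    rng = np.random.default_rng(seed)
    w = rng.choice([-1.0, 1.0], size=(trials, n_chips))
    x = rng.normal(0.0, host_std, size=(trials, n_chips))
    # One correlation value per trial: mean of host * chip over the sequence.
    return float(np.std(np.mean(x * w, axis=1)))

short = detector_noise_std(256)
long_ = detector_noise_std(1024)
print(short, long_, short / long_)
```

With a four times longer sequence the detector noise is roughly halved in this model, so a threshold can be lowered accordingly; the sketch says nothing about the accompanying loss of robustness to desynchronization noted above.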
Finally, the replacement attack as described in this paper is not a panacea for removing watermarks from multimedia clips. We consider it another tool available to the adversary. When establishing the robustness of a particular watermarking scheme or when defining a signal, watermarking, and attack model, this attack should be considered in concert with others, such as geometric attacks or estimation.
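The least-squares construction of replacement blocks summarized in this section can be illustrated on synthetic data. The following Python sketch is not the authors' implementation: the Gaussian host model, the additive ±1 watermark, the way similar blocks are generated, and the names `replacement_block` and `simulate` are all assumptions chosen to keep the example small. Each target block is replaced by the least-squares combination of $K$ similar blocks that carry different, independent watermarks.

```python
import numpy as np

def replacement_block(target, similar_blocks):
    """Least-squares combination of the K rows of similar_blocks (K x N)
    that best approximates the target block (length N)."""
    coeffs, *_ = np.linalg.lstsq(similar_blocks.T, target, rcond=None)
    return similar_blocks.T @ coeffs

def simulate(n_blocks=60, n=400, k=10, delta=1.0,
             host_std=6.0, variation_std=1.0, seed=1):
    """Return (correlation before attack, correlation after attack,
    RMS distortion of the attacked blocks w.r.t. the marked blocks)."""
    rng = np.random.default_rng(seed)
    corr_marked = 0.0
    corr_replaced = 0.0
    sq_err = 0.0
    for _ in range(n_blocks):
        host = rng.normal(0.0, host_std, n)        # assumed host model
        w = rng.choice([-1.0, 1.0], n)             # watermark chips
        marked = host + delta * w
        # K similar blocks: same content up to a small variation, each
        # marked with its own independent watermark (an assumption).
        library = (host
                   + rng.normal(0.0, variation_std, (k, n))
                   + delta * rng.choice([-1.0, 1.0], (k, n)))
        repl = replacement_block(marked, library)
        corr_marked += marked @ w
        corr_replaced += repl @ w
        sq_err += np.sum((repl - marked) ** 2)
    norm = n_blocks * n * delta
    rms = float(np.sqrt(sq_err / (n_blocks * n)))
    return float(corr_marked / norm), float(corr_replaced / norm), rms

before, after, distortion = simulate()
print(before, after, distortion)
```

In this toy model the normalized correlation drops from roughly one to a small residual while the replacement stays close to the marked block, which mirrors the qualitative behavior reported above; the numbers themselves are not comparable to the audio experiments.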
REFERENCES
[1] D. Kirovski, H. Malvar, and Y. Yacobi, “A dual watermarking and fingerprinting system,” ACM Multimedia, pp. 372–81, 2002.
[2] D. Boneh and J. Shaw, “Collusion-secure fingerprinting for digital data,” IEEE Trans. Inf. Theory, vol. 44, no. 5, pp. 1897–1905, Sep. 1998.
[3] F. A. P. Petitcolas, R. J. Anderson, and M. G. Kuhn, “Attacks on copyright marking systems,” in Proc. Inf. Hiding Workshop, 1998, pp. 218–238.
[4] D. Kirovski and F. A. P. Petitcolas, “Replacement attack on arbitrary watermarking systems,” in Proc. ACM Workshop Digital Rights Management, 2002, pp. 177–189.
[5] C. Rey, G. Doërr, J.-L. Dugelay, and G. Csurka, “Toward generic image dewatermarking,” in Proc. ICIP, 2002, pp. 633–636.
[6] G. Doërr and J.-L. Dugelay, “New intra-video collusion attack using mosaicing,” in Proc. ICME, 2003, pp. 505–508.
[7] D. Kirovski and F. A. P. Petitcolas, “Blind pattern matching attack on audio watermarking systems,” in Proc. ICASSP, 2002, pp. 3740–3743.
[8] D. Kirovski and F. A. P. Petitcolas, “Blind pattern matching attack on watermarking systems,” IEEE Trans. Signal Process., vol. 51, no. , pp. 1045–1053, Apr. 2003.
[9] M. Kutter, S. Voloshynovskiy, and A. Herrigel, “Watermark copy attack,” SPIE Electron. Imaging, pp. 371–380, 2000.
[10] M. Holliman and N. Memon, “Counterfeiting attacks on oblivious block-wise independent invisible watermarking schemes,” IEEE Trans. Image Process., vol. 9, no. 3, pp. 432–441, Mar. 2000.
[11] H. L. Van Trees, Detection, Estimation, and Modulation Theory. New York: Wiley, 1968, pt. I.
[12] I. J. Cox, J. Kilian, T. Leighton, and T. Shamoon, “A secure, robust watermark for multimedia,” in Proc. Inf. Hiding Workshop, 1996, pp. 183–206.
[13] R. J. Anderson and F. A. P. Petitcolas, “On the limits of steganography,” IEEE J. Sel. Areas Commun., vol. 16, no. 4, pp. 474–481, May 1998.
[14] A. Briassouli and P. Moulin, “Detection-theoretic analysis of warping attacks in spread-spectrum watermarking,” in Proc. ICASSP, 2003, pp. 53–56.
[15] M. Kutter and F. A. P. Petitcolas, “A fair benchmark for image watermarking systems,” SPIE Security Watermarking Multimedia Contents, vol. 3657, pp. 226–239, 1999.
[16] D. Schonberg and D. Kirovski, “Fingerprinting and forensic analysis of multimedia,” ACM Multimedia, pp. 788–795, 2004.
[17] T. Kalker, J. Linnartz, and M. van Dijk, “Watermark estimation through detector analysis,” in Proc. ICIP, 1998, pp. 425–429.
[18] D. Kirovski and H. Malvar, “Robust covert communication over a public audio channel using spread spectrum,” in Proc. Inf. Hiding Workshop, 2001, pp. 354–368.
[19] B. Chen and G. W. Wornell, “Quantisation index modulation: A class of provably good methods for digital watermarking and information embedding,” IEEE Trans. Inf. Theory, vol. 47, no. 4, pp. 1423–1443, May 2001.
[20] H. Malvar, “A modulated complex lapped transform and its application to audio processing,” in Proc. ICASSP, 1999, pp. 1421–1424.
[21] A. V. Oppenheim and R. W. Schafer, Discrete-Time Signal Processing. Englewood Cliffs, NJ: Prentice-Hall, 1989.
[22] M. R. Garey and D. S. Johnson, Computers and Intractability. San Francisco, CA: W.H. Freeman, 1979.
[23] H. Cohen, A Course in Computational Algebraic Number Theory. New York: Springer-Verlag, 1993.
Darko Kirovski received the Ph.D. degree in computer science from the University of California, Los Angeles, in 2001. Since April 2000, he has been a Researcher at Microsoft Research, Redmond, WA. His research interests include reliable computing, system security, multimedia processing, and embedded system design. He has authored more than 100 journal and conference papers and filed more than 40 patents. Dr. Kirovski received the 1999 Microsoft Graduate Research Fellowship, the 2000 ACM/IEEE Design Automation Conference Graduate Scholarship, the 2001 ACM Outstanding Ph.D. Dissertation Award in Electronic Design Automation, and best paper awards at the ACM Multimedia 2002 and the IEEE MMSP 2006.
Fabien A. P. Petitcolas graduated in 1996 from École Centrale, Lyon, France. He then studied at the University of Cambridge under the guidance of Prof. R. Anderson for a doctorate, which was partly funded by the Intel Corporation and was awarded in 1999, when he joined Microsoft Research. He is with the External Research Office, Microsoft Research, Cambridge, U.K. He heads Intellectual Capital Development, which incorporates a number of programs to support outstanding scientists in Europe, the Middle East, and Africa, including the Microsoft Research European Ph.D. Scholarship and European Fellowship programs. Before joining the External Research Office, he was a Researcher within the Security Research Group.
Zeph Landau received the Ph.D. degree in mathematics from the University of California, Berkeley, in 1998. Since then, he has had postdoctoral positions in theoretical neuroscience at the University of California, San Francisco, in mathematics at the Mathematical Sciences Research Institute, in quantum computation at the University of California, Berkeley, and in signal processing at Microsoft Research (where this work was done). He is currently an Assistant Professor in the Math Department, The City College of New York. His research interests include signal processing, quantum computation, operator algebras, and theoretical neuroscience.