UNDERSTANDING GEOMETRIC MANIPULATIONS OF IMAGES THROUGH BOVW-BASED HASHING

S. Battiato, G. M. Farinella, E. Messina, G. Puglisi
Image Processing Laboratory, University of Catania, Italy
{battiato, gfarinella, emessina, puglisi}@dmi.unict.it
ABSTRACT

The increasing use of low-cost imaging devices and the innovations in media distribution technologies have generated a growing interest in technologies able to protect digital visual media against malicious manipulations of their content. One of the main problems addressed in this research area is the blind detection of traces of forgery on an image obtained through the Internet. Specifically, in this paper we consider the context of communications, where malicious image manipulations should be detected by a receiver. In the proposed method, an image hash based on the Bag of Visual Words paradigm is attached as a signature to the image before transmission. The forensic hash is then analyzed at destination to detect the geometric transformations which have been applied to the received image. This task is fundamental for further processing, which usually assumes that the received image is aligned with the original one, as in the case of tampering detection systems. Experiments show that the proposed approach outperforms state-of-the-art methods by a good margin in terms of performance.

Index Terms— Image forensics, Forensic hash, Bag of Visual Words, Tampering, Geometric transformations, Image validation and authentication

1. INTRODUCTION AND MOTIVATIONS

The increasing use of digital visual media in communications has a significant impact on today's society. Recent estimates about popular websites [1] report about 4 billion photographs on Flickr (≈ 4000 uploads per minute), 120 million videos on YouTube (≈ 20 hours of video uploaded per minute) and 15 billion images on Facebook (≈ 22,000 uploads per minute). Moreover, among the other archives used in image forensics, a huge amount of data comes from surveillance systems and from digital archives of probationary material used in courts of law. The need for methods to establish the validity and authenticity of an image received through an Internet communication is confirmed by many episodes that make questionable the use of digital images as evidence material [2].
To deal with this problem, different solutions have recently been proposed in the literature [3, 4, 5, 6]. Most of them share the same basic scheme: i) a hash code based on the content of the image to be sent is attached to the image; ii) the hash is analyzed at destination to verify the reliability of the received image. An image hash is a distinctive signature which represents the visual content of the image in a compact way (usually just a few bytes). The image hash should be robust against allowed operations and, at the same time, it should differ from the one computed on a different/tampered image. Image hashing techniques are considered extremely useful to validate the authenticity of an image received through the Internet. Beyond the binary decision task related to image authentication, in the application context of Forensic Science it is fundamental to provide scientific evidence about the history of the manipulations applied to the original image to obtain the one under analysis. In many cases the source image is unknown, and, as in the application context of this paper, all the information about the history of the image should be recovered through the short image hash signature, which makes the task more challenging (blind estimation of manipulations). The list of manipulations provides the end user with the information needed to decide whether the image can be trusted or not. In order to perform tampering localization¹, the receiver should be able to filter out all the geometric transformations (e.g., rotation, scaling) applied to the tampered image, so as to align the received image with the one at the sender. The alignment should be done in a blind way: at destination one can use only the received image and the image hash to deal with the alignment problem, since the reference image is not available. The challenging task of blindly understanding the geometric transformations occurred on a received image motivates this paper.

¹Tampering localization is the process of localizing the regions of the image that have been manipulated for malicious purposes to change the semantic meaning of the visual message.
[Fig. 1. Schema of the proposed approach. Offline: training images → feature extraction → clustering → codebook, shared by sender and receiver. Sender: feature selection and image hash hs, sent over the trusted network, while the image travels over the untrusted network where manipulation/tampering may occur. Receiver: hash hr, matching with hs, and alignment (θ̂, σ̂).]

Although different robust alignment techniques have been proposed by computer vision researchers [7, 8, 9, 10], these techniques are unsuitable in the context of forensic hashing, since the most important requirement is that the image signature should be as “compact” as possible to reduce the overhead of the network communications. To fit this requirement, the authors of [5] proposed to exploit information extracted through the Radon transform and scale space theory to estimate the parameters of the geometric transformations. To make the alignment phase more robust under manipulations such as cropping and tampering, an image hash based on robust invariant features has been proposed in [6]. The latter technique extended the idea previously proposed in [4] by employing the Bag of Visual Words (BOVW) model to represent the features to be used as image hash. The BOVW representation is useful to reduce the space needed for the image signature while maintaining the performances of the alignment component. Building on the technique described in [6], we propose a new method to detect the geometric manipulations occurred on an image starting from the hash computed on the original one. We exploit replicated visual words and a cascade of estimators to establish the geometric parameters (rotation and scale). As pointed out by the experimental results, our approach obtains the best results, with a significant margin in terms of estimation accuracy with respect to state-of-the-art methods.

The remainder of the paper is organized as follows: Section 2 presents the proposed framework. Section 3 reports the experiments and discusses the results. Finally, Section 4 concludes the paper with avenues for further research.
2. PROPOSED METHOD

As stated in the previous section, one of the common steps of tampering detection systems is the alignment of the received image. Image registration is crucial since all the other tasks (e.g., tampering localization) usually assume that the received image is aligned with the original one, and hence could fail if the registration is not properly done. Classical registration approaches [7, 8, 9, 10] cannot be directly employed in the context considered in this work due to the limited information that can be used (i.e., the original image is unknown and the image hash should be as short as possible). The schema of the system is shown in Fig. 1.

As in [6], we adopt a Bag of Visual Words based representation [11] to reduce the dimensionality of the feature descriptors to be used as hash for the alignment component. A codebook is generated by clustering a set of SIFT features [12] extracted from training images. The pre-computed codebook is shared between sender and receiver. It should be noted that the codebook is built only once and used for all the communications between a sender and a receiver (i.e., no extra overhead for each communication). The sender extracts SIFT features and sorts them in descending order with respect to their contrast values. Afterward, the top n SIFT features are selected, and each is associated to the id label of the closest visual word in the pre-computed codebook. Hence, the final hash signature for the alignment component is created: for each selected SIFT feature, the id label, the dominant direction θ, and the scale σ are considered. The hash signature computed at the source (hs), together with the corresponding image, is sent to the destination.
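As a concrete illustration, the sender-side procedure could be sketched as follows, assuming OpenCV's SIFT implementation and a pre-computed codebook array; the function name and the (id, θ, σ) layout are ours, not the authors' reference implementation.

```python
# Minimal sketch of the sender-side hash generation, assuming OpenCV SIFT
# and a codebook shared with the receiver (illustrative, not the paper's code).
import cv2
import numpy as np

def build_hash(image_gray, codebook, n=60):
    """codebook: (k, 128) array of visual words; returns n (id, theta, sigma) triplets."""
    sift = cv2.SIFT_create()
    keypoints, descriptors = sift.detectAndCompute(image_gray, None)
    if descriptors is None:
        return []
    # Sort by contrast (keypoint response), descending, and keep the top n.
    order = np.argsort([-kp.response for kp in keypoints])[:n]
    hash_entries = []
    for i in order:
        # id of the closest visual word in the shared codebook
        word_id = int(np.argmin(np.linalg.norm(codebook - descriptors[i], axis=1)))
        hash_entries.append((word_id, keypoints[i].angle, keypoints[i].size))
    return hash_entries
```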
[Fig. 2. (a) and (b): histograms of the dominant directions of Is and Ir; (c): histogram of the differences of directions over wrong matchings; (d) and (e): histograms of the scales of Is and Ir; (f): histogram of the scale ratios over wrong matchings.]

Our system assumes that the image is sent over a network consisting of possibly untrusted nodes, whereas the signature is sent upon request from a trusted authentication server which encrypts the hash in order to guarantee its integrity [3]. During the untrusted communication the image could be manipulated for malicious purposes. Once the image reaches the destination, the receiver generates the hash signature (hr) on the received image by using the same procedure employed by the sender. Afterward, the entries of the hashes hs and hr are matched considering the id values (see Fig. 1). The alignment is then performed by estimating two transformation parameters: the rotation angle θ̂ and the scaling factor σ̂. To this aim, we consider the differences between the dominant directions of the signature entries with corresponding id, Θ = {θr,i − θs,j | ids,i = idr,j; i, j = 1…n}, and the ratios of the scales of the matchings between hs and hr, Σ = {σr,i/σs,j | ids,i = idr,j; i, j = 1…n}.

The estimation of the geometric parameters proceeds with a cascade of simple estimators. A distribution of orientations is obtained by quantizing the values of Θ: a histogram with bins of regular size in the range [−180, 180] degrees is built. The bin with the highest frequency is selected, and the rotation estimate θ̂ is obtained as the mean of the values of Θ which fall into the selected bin. This step is also used to filter out the wrong matchings between the hash signatures hs and hr before estimating the scale parameter. Indeed, the scale estimate σ̂ is obtained by building a histogram with quantized bins in the range [0, maxσ], and taking the average of the scale ratios of the filtered matchings which fall into the bin with the highest frequency.
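A minimal sketch of this cascade estimation is given below, assuming the (id, θ, σ) hash entries of the previous snippet; the bin widths follow the values reported in Section 3 (7 degrees and 0.05), while the wrapping of angle differences into [−180, 180) is our own detail.

```python
# Sketch of the cascade estimator: histogram voting on theta first, then on
# the scale ratios of the matchings that survived the rotation step.
import numpy as np

def estimate_rotation_scale(hs, hr, theta_bin=7.0, sigma_bin=0.05, max_sigma=10.0):
    # All id-based matchings between the two hashes, replicated ids included.
    pairs = [(ts, ss, tr, sr)
             for (ids, ts, ss) in hs
             for (idr, tr, sr) in hr
             if ids == idr]
    if not pairs:
        return None  # "unmatched" image pair
    theta_diff = np.array([(tr - ts + 180.0) % 360.0 - 180.0
                           for ts, _, tr, _ in pairs])
    # Rotation: select the most voted bin and average the values falling in it.
    edges = np.arange(-180.0, 180.0 + theta_bin, theta_bin)
    hist, _ = np.histogram(theta_diff, bins=edges)
    b = int(np.argmax(hist))
    in_bin = (theta_diff >= edges[b]) & (theta_diff < edges[b + 1])
    theta_hat = theta_diff[in_bin].mean()
    # Scale: only the matchings surviving the rotation filter are kept.
    ratios = np.array([sr / ss for _, ss, _, sr in pairs])[in_bin]
    edges_s = np.arange(0.0, max_sigma + sigma_bin, sigma_bin)
    hist_s, _ = np.histogram(ratios, bins=edges_s)
    bs = int(np.argmax(hist_s))
    in_bin_s = (ratios >= edges_s[bs]) & (ratios < edges_s[bs + 1])
    sigma_hat = ratios[in_bin_s].mean()
    return theta_hat, sigma_hat
```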
It should be noted that some id values may appear more than once in hs and/or in hr. Even if a small number of SIFT features are selected during the image hash generation process, conflicts due to replicated ids can arise. For example, a selected SIFT keypoint may have no unique dominant direction [12]; in this case the different directions are coupled with the same descriptor, and hence will be considered more than once by the selection process, which generates many instances of the same id with different dominant directions. As experimentally demonstrated in the next section, by retaining the replicated visual words the accuracy of the geometric transformation estimation increases, and the number of “unmatched” images decreases (i.e., image pairs that the algorithm is not able to process because there are no matchings between hs and hr). Our approach considers all the possible matchings in order to preserve the useful information. The correct matchings are hence retained, but other wrong pairs could be generated. Since the noise introduced by considering correct and incorrect pairs can badly influence the final estimation results, the presence of possible noisy matchings should be taken into account during the estimation process. Our approach deals with the problem of wrong matchings by employing the estimators in cascade.
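As a toy illustration of the all-pairs matching policy just described (all numbers are made up), keeping replicates preserves the correct pair even when an id occurs several times:

```python
# id 12 occurs twice in hs (a multi-orientation keypoint); both occurrences
# are matched against the single id-12 entry of hr, so the correct pair is
# never discarded, at the cost of one extra wrong pair.
hs = [(12, 30.0, 2.0), (12, 95.0, 2.0), (7, 10.0, 1.5)]  # (id, theta, sigma)
hr = [(12, 75.0, 4.0), (7, 55.0, 3.0)]

pairs = [(s, r) for s in hs for r in hr if s[0] == r[0]]
print(len(pairs))  # 3: two id-12 pairs (one correct, one wrong) and one id-7 pair
```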
The rationale behind the usefulness of the cascade is illustrated through the following example. Fig. 2(a) and Fig. 2(d) report the distributions of the dominant directions and scales of the SIFT features extracted from an image Is, whereas Fig. 2(b) and Fig. 2(e) show those of a transformed version Ir obtained through scaling and rotation. In order to study the distributions related to the wrong matchings, all the matchings between the SIFT points of Is and those of Ir have been generated, and the correct matchings have been removed. The distributions of the subsets obtained from Θ and Σ by considering only the wrong matchings are reported in Fig. 2(c) and Fig. 2(f). Note that we are interested in understanding the distributions of the directions and scales coming from wrong matchings, since these can introduce errors during the estimation process and our aim is to filter them out. As can be seen from Fig. 2(c), the histogram of the differences of dominant directions (θid,r − θid,s) over the wrong matchings is quite uniform. This happens especially when no preferred directions exist in the considered image. Hence, the uniform bias introduced by the wrong matchings into the distribution of the values belonging to Θ does not influence too much the rotation estimation performed through the selection of the histogram bin with the highest frequency. On the contrary, a peak is clearly visible in the scale-ratio histogram (σid,r/σid,s) reported in Fig. 2(f). Hence, a direct estimation on the distribution of the values within Σ fails, since the estimator is strongly induced to privilege values around one.

Starting from the above considerations, we first extract the rotational information by making use of the distribution of the values in Θ. This estimate can be considered reliable even in the presence of replicated matchings (they typically distribute uniformly over the allowed range). Then, the scale estimation is performed on the distribution of scale ratios built by considering only the reliable matchings retained during the estimation of the rotational parameter (see Fig. 1). In the proposed approach, the rotation estimation step is hence used both to estimate the rotational parameter and to filter out outlier matchings in order to better estimate the scale. This cascade estimation allows us to retain the useful information coming from the replicated matchings without being influenced by the introduced wrong pairs. As pointed out by the results reported in the next section, the proposed cascade approach obtains a significant margin in terms of scale estimation accuracy with respect to the approach proposed in [6].

3. EXPERIMENTAL RESULTS

This section reports a number of experiments on which the proposed approach has been tested and compared with [6]. The tests have been performed considering a subset of the benchmark dataset used in [13] for scene classification purposes. This dataset is made up of 4485 images (average size of 244 × 272 pixels) belonging to fifteen different scene categories at a basic level of description: bedroom, suburb, industrial, kitchen, living room, coast, forest, highway, inside city, mountain, open country, street, tall building, office, store. The composition of the considered dataset provides the high scene variability needed to properly test methods in the context of application of this paper.
The training set used in our experiments is built through a random selection of 150 images from the aforementioned dataset: ten images have been randomly sampled from each scene category. The test set consists of 5250 images generated through the application of different transformations on the training images². The considered transformations are typically available in image manipulation software (Tab. 1). In accordance with [6], the following image manipulations have been applied: cropping, rotation, scaling, tampering, and JPEG compression. Tampering has been performed through the swapping of blocks (50 × 50) between two images randomly selected from the training set. Various combinations of the basic transformations have also been used to make the task more challenging. Considering the different parameter settings, each training image has 35 corresponding manipulated images in the test set.

Table 1. Image transformations.

  Operation     | Parameters
  --------------|---------------------------------------------
  Rotation (θ)  | 3, 5, 10, 30, 45 degrees
  Scaling (σ)   | factor = 0.5, 0.7, 0.9, 1.2, 1.5
  Cropping      | 19%, 28%, 36% of the entire image
  Tampering     | block size 50 × 50
  Compression   | JPEG, Q = 10
  Combinations  | various combinations of the above operations

²Training and test sets used for the experiments are available at http://iplab.dmi.unict.it/download/CPAF_ICME_2011/Dataset.zip.
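For illustration, the geometric manipulations of Tab. 1 could be generated with Pillow as sketched below; the interpretation of the cropping percentage as the fraction of each dimension removed is our assumption, and JPEG compression is applied at save time.

```python
# Sketch (assumed implementation, ours) of the manipulations listed in Tab. 1.
from PIL import Image

def make_variants(img):
    variants = {}
    for angle in (3, 5, 10, 30, 45):              # rotation (degrees)
        variants[f"rot{angle}"] = img.rotate(angle, expand=True)
    for f in (0.5, 0.7, 0.9, 1.2, 1.5):           # scaling factor
        w, h = img.size
        variants[f"scale{f}"] = img.resize((int(w * f), int(h * f)))
    for c in (0.19, 0.28, 0.36):                  # cropping (our interpretation)
        w, h = img.size
        dx, dy = int(w * c / 2), int(h * c / 2)
        variants[f"crop{c}"] = img.crop((dx, dy, w - dx, h - dy))
    return variants

# JPEG compression (Q = 10) is applied when saving, e.g.:
# variants["rot30"].save("rot30.jpg", quality=10)
```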
To demonstrate the usefulness of considering the rotation and scaling estimators in cascade, and to underline the contribution of the replicated visual words during the estimation, comparative tests on the aforementioned dataset have been performed by considering our method and the approach proposed in [6]. Although Lu et al. [6] state that further refinements are performed considering the points that occur more than once, they do not provide any implementation detail. In our tests we have hence considered two versions of [6], with and without replicated matchings. The approach proposed in [6] has been reimplemented. The RANSAC [14] thresholds needed to perform the geometric parameter estimation as proposed in [6] have been set to 3.5 degrees for the rotational model and to 0.025 for the scaling one. These thresholds have been obtained through data analysis (inlier and outlier distributions). In order to perform a fair comparison, the bin size of the histograms involved in our approach has been derived from the threshold values used by RANSAC. Specifically, the bin size used by our approach has been obtained by doubling the corresponding RANSAC thresholds: a histogram with bin size of 7 degrees ranging from −180 to 180 degrees has been used for the rotation estimation step, whereas a histogram with bin size equal to 0.05 ranging from 0 to maxσ = 10 was employed to estimate the scale. Finally, a codebook with 1000 visual words has been used to compare the approaches; the codebook has been learned through k-means clustering on all the SIFT features extracted from the training set.
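A sketch of this one-off codebook construction, assuming scikit-learn's k-means (the function and variable names are ours):

```python
# training_descriptors pools the 128-d SIFT descriptors of all training images.
import numpy as np
from sklearn.cluster import KMeans

def learn_codebook(training_descriptors, k=1000):
    """training_descriptors: (m, 128) array; returns the (k, 128) visual words."""
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(training_descriptors)
    return km.cluster_centers_
```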
Table 2. Comparison with respect to unmatched images.

  Number of SIFT     |   15   |   30   |   45   |   60
  -------------------|--------|--------|--------|-------
  Lu et al. [6]      | 6.74%  | 1.99%  | 0.88%  | 0.49%
  Proposed Approach  | 0.35%  | 0.05%  | 0.01%  | 0.01%

Table 3. Average rotational and scaling error.

  # SIFT | Unmatched |      Mean Error θ        |      Mean Error σ
         |           | Lu et al. [6] | Proposed | Lu et al. [6] | Proposed
  -------|-----------|---------------|----------|---------------|---------
    15   |  11.73%   |    14.5572    |  9.5464  |    0.1264     |  0.0631
    30   |   3.60%   |    14.3468    |  5.1619  |    0.1136     |  0.0438
    45   |   1.64%   |    13.5139    |  3.9650  |    0.1100     |  0.0330
    60   |   0.91%   |    13.7529    |  3.4326  |    0.1150     |  0.0292

[Fig. 3. REC curves comparison: accuracy versus absolute deviation for the rotation (θ) and scale (σ) estimates, Lu et al. [6] versus the proposed approach.]

[Fig. 4. Comparison on single transformations (60 SIFT): (a) average rotation error for varying rotation angle θ; (b) average scaling error for varying scale factor σ.]
First, let us examine the typical cases in which the considered approaches are not able to work. Two cases can be distinguished: i) no matchings are found between the hash built at the sender (hs) and the one computed by the receiver (hr); ii) all the matchings are replicated. The first problem can be mitigated by considering a higher number of features
(SIFT). The second one is solved only by allowing replicated matchings (see Section 2). As reported in Tab. 2, by increasing the number of SIFT points the number of unmatched images decreases for both approaches, but the percentage of images on which our algorithm is not able to work is much lower than that of [6] without replicated matchings.

Tab. 3 shows the results obtained in terms of rotational and scale estimation through the mean error. In order to properly compare the methods, the results have been computed taking into account only the images on which both approaches are able to work. The proposed approach outperforms [6], obtaining a considerable gain both in terms of rotational and scaling accuracy (Tab. 3). Moreover, the performance of our approach significantly improves as the number of extracted feature points (SIFT) increases. On the contrary, the technique in [6] is not able to properly exploit the additional information coming from the higher number of extracted points.

To better compare the methods, Regression Error Characteristic (REC) curves have been employed [15]. The REC curve is essentially the cumulative distribution function of the error; the area over the curve is a biased estimate of the expected error of the employed estimation model. In Fig. 3 the comparison through REC curves is shown. The results have been obtained considering sixty SIFT features to build the signatures of the training and test images.
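Since the REC curve is just the empirical cumulative distribution of the absolute error, it can be sketched in a few lines (the error values below are made up):

```python
# Minimal REC curve sketch, as plotted in Fig. 3.
import numpy as np

def rec_curve(errors):
    deviation = np.sort(np.abs(np.asarray(errors, dtype=float)))
    accuracy = np.arange(1, len(deviation) + 1) / len(deviation)
    return deviation, accuracy

dev, acc = rec_curve([0.3, 1.2, 4.8, 7.0, 12.5])
print(acc[dev <= 5.0].max())  # fraction of estimates within 5 degrees -> 0.6
```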
Table 4. Results of [6] modified to consider replicated matchings.

  # SIFT        |   15   |   30   |   45   |   60
  --------------|--------|--------|--------|-------
  Mean Error θ  | 8.6085 | 4.8084 | 3.7338 | 3.0924
  Mean Error σ  | 0.1602 | 0.1815 | 0.1988 | 0.2176
Additional experiments have been performed to examine the dependence of the average rotational and scaling errors on the rotation and scale transformation parameters, respectively. The results in Fig. 4(a) show that the rotational estimation error slightly increases with the rotation angle. For the scale transformation, the error is lower in the proximity of one (no scale change) and increases for scale factors higher or lower than one (Fig. 4(b)). It should be noted that the proposed approach obtains the best performances in all cases.

Finally, to further study the contribution of the replicated matchings, we performed some tests considering a modified version of the approach proposed in [6] in which replicated visual words have been introduced (Tab. 4). Although the modified approach obtains better performance than the original one in terms of rotational accuracy, it is not able to obtain satisfactory results in terms of scaling estimation: the wrong pairs introduced by the replication of matchings are not properly handled. On the contrary, our approach works properly both in terms of rotational and scaling estimation. As already stated in Section 2, the noise introduced by the replication of the matchings can be better handled in the rotational histogram than in the scale one. Hence, it is better to retrieve the scale information only at the end, on the more reliable matchings filtered through the rotational estimation step.

4. CONCLUSION AND FUTURE WORKS

The assessment of the reliability of an image received through the Internet is an important issue in today's society. In this paper the task of building a robust image signature to be used for the alignment component of distributed forensic systems has been addressed. The image registration step is crucial to perform further analyses (e.g., tampering localization) which are commonly used to establish whether a received image can be trusted. The proposed solution exploits the Bag of Visual Words representation and a cascade of estimators to establish the rotational and scaling factors. The application of the estimators in cascade allows us to retain the useful information carried by replicated matchings while avoiding the noisy information introduced by wrong pairs. Experiments have been performed on a representative dataset of scenes. The proposed framework outperforms a recently appeared technique by a significant margin in terms of estimation accuracy. Future works should concern a more in-depth analysis to establish the minimal number of SIFT features needed to guarantee an accurate estimation of the geometric transformations, and a study of the space complexity needed to represent the alignment component of the image signature. Moreover,
ad-hoc attacks [16] should be considered by testing the performances of the proposed framework against other methods.

5. REFERENCES

[1] J. F. Gantz, “The diverse and exploding digital universe: an updated forecast of worldwide information growth through 2011,” International Data Corporation White Paper, 2008.
[2] H. Farid, “Digital doctoring: how to tell the real from the fake,” Significance, vol. 3, no. 4, pp. 162–166, 2006.
[3] Y.-C. Lin, D. Varodayan, and B. Girod, “Image authentication based on distributed source coding,” in Proceedings of the IEEE International Conference on Image Processing, 2007.
[4] S. Roy and Q. Sun, “Robust hash for detecting and localizing image tampering,” in Proceedings of the IEEE International Conference on Image Processing, 2007, pp. 117–120.
[5] W. Lu, A. L. Varna, and M. Wu, “Forensic hash for multimedia information,” in Proceedings of the IS&T-SPIE Electronic Imaging Symposium - Media Forensics and Security, 2010, vol. 7541.
[6] W. Lu and M. Wu, “Multimedia forensic hash based on visual words,” in Proceedings of the IEEE International Conference on Image Processing, 2010, pp. 989–992.
[7] R. Szeliski and S. Kang, “Direct methods for visual scene reconstruction,” in Proceedings of the IEEE Workshop on Representations of Visual Scenes, 2000, pp. 26–33.
[8] H. Shum and R. Szeliski, “Construction of panoramic mosaics with global and local alignment,” International Journal of Computer Vision, vol. 36, no. 2, pp. 101–130, 2000.
[9] P. McLauchlan and A. Jaenicke, “Image mosaicing using sequential bundle adjustment,” Image and Vision Computing, vol. 20, pp. 751–759, 2002.
[10] M. Brown and D. Lowe, “Automatic panoramic image stitching using invariant features,” International Journal of Computer Vision, vol. 74, no. 1, pp. 59–73, 2007.
[11] R. Szeliski, Computer Vision: Algorithms and Applications, Springer, 2010.
[12] D. Lowe, “Distinctive image features from scale-invariant keypoints,” International Journal of Computer Vision, vol. 60, no. 2, pp. 91–110, 2004.
[13] S. Lazebnik, C. Schmid, and J. Ponce, “Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2006, pp. 2169–2178.
[14] M. A. Fischler and R. C. Bolles, “Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography,” Communications of the ACM, vol. 24, no. 6, pp. 381–395, 1981.
[15] J. Bi and K. P. Bennett, “Regression error characteristic curves,” in Proceedings of the International Conference on Machine Learning, 2003.
[16] T.-T. Do, E. Kijak, T. Furon, and L. Amsaleg, “Deluding image recognition in SIFT-based CBIR systems,” in Proceedings of the ACM Workshop on Multimedia in Forensics, Security and Intelligence, 2010, pp. 7–12.