Detecting Hidden Information in Images: A Comparative Study

Yanming Di, Huan Liu, Avinash Ramineni, and Arunabha Sen
Department of Computer Science and Engineering
Arizona State University, Tempe, AZ 85287
{yanming.di, huan.liu, avinash.ramineni, arunabha.sen}@asu.edu
Abstract

During the process of hiding information in a cover image, LSB-based steganographic techniques such as JSteg change the statistical properties of the cover image. Accordingly, such information hiding techniques are vulnerable to statistical attack. Understanding steganalysis methods and their effects can help in designing methods and algorithms that preserve data privacy. In this paper, we compare several steganalysis methods for attacking LSB-based steganographic techniques: logistic regression, the tree-based method C4.5, and the popular tool Stegdetect. Experimental results show that the first two methods, especially logistic regression, are able to detect hidden information with high accuracy. We also study the relationship between the number of attributes (the frequencies of quantized DCT coefficients) and the performance of a classifier.
1. Introduction

The last few years have seen a significant rise of interest among computer security researchers in the science of steganography. Steganography [8, 9] is the art of hiding and transmitting data through apparently innocuous carriers in an effort to conceal the existence of the secret data. The term steganography in Greek means covered writing, whereas cryptography means secret writing. Steganography differs from cryptography in that while the goal of a cryptographic system is to conceal the content of a message, the goal of steganography is to conceal the message's very existence. Steganography in essence camouflages a message, concealing the fact that any message is being carried at all. Steganography thus provides a plausible deniability to the secret communication which cryptography
does not provide. Covert information is not necessarily secure, and secure information is not necessarily covert: the goal of cryptography is the secure transfer of a secret message, whereas the goal of steganography is to make the transfer of a secret message undetectable. Understanding steganalysis methods and their effects can help in designing methods and algorithms that preserve data privacy. With the increase of digital content (and the distribution of multimedia data) on the Internet, steganography has become a topic of growing interest. A number of programs for embedding hidden messages in images and audio files are available [18]. Most of these steganographic methods modify the redundant bits in the cover medium (carrier) to hide the secret messages. Redundant bits are those bits that can be modified without degrading the quality of the cover medium. Replacing these redundant bits with message bits creates the stego medium. The modification of the redundant bits can change the statistical properties of the cover medium; as a result, statistical analysis may reveal the presence of hidden content. Detecting steganographic content is called steganalysis [7], defined as the art and science of breaking the security of steganographic systems. Since the goal of steganography is to conceal the existence of a secret message, a successful attack on a steganographic system consists of detecting that a certain file contains hidden information. Steganographic modifications in an image can be detected by testing its statistical properties: if those properties deviate from a given norm, the image can be identified as a stego image. In this paper we present two new methods for steganalysis. We attack the steganographic tool JSteg; there have been many previous attempts at breaking JSteg, but their accuracy has not been high.
The two methods presented here can further be extended to other tools such as JPHide and OutGuess [18]. The rest of the paper is organized as follows. Section 2 introduces the JSteg steganographic method, the idea of the statistical attack, and some related work. Section 3 presents two new methods for attacking LSB-based steganographic methods; we also discuss how knowledge of the distribution of the DCT coefficients helps in building our models. Section 4 reports the results of our experiments comparing the two new methods with the popular program Stegdetect [16]. Section 5 shows that careful model selection is needed to achieve high accuracy when using data mining and statistical methods for steganalysis. Section 6 concludes the paper.
2. Histogram Analysis

The JPEG image format [15] uses a discrete cosine transform (DCT) to transform each 8×8 block of source image pixels into DCT coefficients. The DCT coefficients $F(u,v)$ of an 8×8 block of image pixels $f(x,y)$ are given by

$$F(u,v) = \frac{1}{4}\, C(u)\, C(v) \sum_{x=0}^{7} \sum_{y=0}^{7} f(x,y) \cos\frac{(2x+1)u\pi}{16} \cos\frac{(2y+1)v\pi}{16},$$

where $C(u) = 1/\sqrt{2}$ for $u = 0$, $C(u) = 1$ otherwise, and $u, v = 0, 1, \ldots, 7$.

The coefficients are then quantized using a 64-element quantization table $Q(u,v)$ by the following operation:

$$F^{Q}(u,v) = \operatorname{round}\!\left(\frac{F(u,v)}{Q(u,v)}\right).$$

The least-significant bits (LSBs) of these quantized DCT coefficients can be used to embed hidden messages. JSteg hides data in JPEG image files by replacing the LSBs of the quantized DCT coefficients with secret message bits. Other similar steganography methods include JPHide and OutGuess 0.13.

Steganographic techniques like JSteg that modify the LSBs can be detected by analyzing the frequencies of the quantized DCT coefficient values. Modifying the least significant bit transforms one value into another that differs from it only by 1. These pairs of values are called PoVs in [19]. Westfeld and Pfitzmann [19] introduced a powerful statistical attack that can be applied to any steganographic technique in which a fixed set of PoVs are flipped into each other during the embedding of message bits. The insight behind their statistical attack is that if the message bits are equally distributed, modifying the least significant bits will reduce the difference in frequency between the two members of each PoV, which would otherwise be unequal with very high probability. This equalization can be detected by appropriate statistical tests. Based on this idea, Provos and Honeyman [16] carried out an extensive analysis of JPEG images using the steganalytic software Stegdetect, which is based on a Chi-square test. It is able to detect messages hidden in a JPEG image by steganography software such as JSteg, JPHide or OutGuess 0.13. However, the Chi-square test used in [16] does not produce results with very high accuracy. In this paper, we demonstrate that the histogram information can be used more effectively with logistic regression or a decision tree method, C4.5 [17].

Similar work in this area is reported by Berg et al. [1], Farid [4], and Zhang and Ping [22]. Berg et al. [1] used statistical learning methods in steganalysis; the attributes used in their learning procedure are unconditional entropy, conditional entropies and transition probabilities. Farid [4] built higher-order statistical models of images using a type of wavelet decomposition. Zhang and Ping [22] proposed an attack on the JSteg method based on a different Chi-square test. [5] is a very good survey of the state of the art in steganalysis.

3. Classification Methods

Distinguishing between images with and without hidden data can naturally be viewed as a classification problem. We refer to the two classes as stego and normal. We use a set of images as the training data to construct classifiers. When a classification algorithm is run on a data set (stego and normal images), it finds a boundary between the two classes and creates a model. Given a new set of images, the learned model can be used to predict the class to which each image belongs. Here we apply two classification methods to detecting hidden messages in JPEG images: the logistic regression method and the tree-based method C4.5. We discuss next the attributes to be used in the statistical models.
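As context for the classifiers that follow, the PoV equalization described in Section 2 can be simulated with a short sketch. This is hypothetical illustrative code, not JSteg's implementation: the coefficient stream and message bits are synthetic, and only a single PoV pair is shown.

```python
import random
from collections import Counter

def embed_lsb(coeffs, message_bits):
    """Replace the LSBs of usable quantized DCT coefficients with
    message bits, skipping the values 0 and 1 as JSteg does."""
    out = list(coeffs)
    bits = iter(message_bits)
    for i, c in enumerate(out):
        if c in (0, 1):
            continue
        b = next(bits, None)
        if b is None:
            break
        out[i] = (c & ~1) | b  # overwrite the least significant bit
    return out

random.seed(0)
# Synthetic coefficient stream in which the PoV (2, 3) is unbalanced.
coeffs = [random.choice([2] * 7 + [3] * 3) for _ in range(10000)]
message = [random.getrandbits(1) for _ in range(10000)]
stego = embed_lsb(coeffs, message)

before, after = Counter(coeffs), Counter(stego)
# Equally distributed message bits roughly equalize the pair frequencies,
# which is exactly the statistical footprint a detector can exploit.
```

The counts in `before` are strongly unbalanced between 2 and 3, while in `after` they are close to equal; this equalization is what the statistical tests and classifiers below detect.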
3.1. Predicting Variables

As discussed earlier, LSB-based steganography techniques such as JSteg insert information into images by replacing the LSBs of the quantized DCT coefficients with the secret text. This changes the frequencies of the quantized DCT coefficient values. Therefore, the frequencies of the quantized DCT coefficient values are natural candidates for predicting variables in the statistical models. However, for JPEG images, the DCT coefficients can have a wide range of values. Using the frequencies of all these values to build a model is not practical. More importantly, using more variables than needed may introduce several problems. For a regression model, it may lead to ill-conditioned matrices and unstable estimates of the model parameters. In general, adding attributes that are not really important to a statistical model is like adding noise to the model, thereby degrading its performance. Therefore careful model selection is very important.

Research shows that for normal images, the distribution of the DCT coefficients can be approximately modeled as a Laplacian or a generalized Gaussian distribution [3, 14, 10]. These models of the distribution of the DCT coefficients suggest that the frequencies of DCT coefficients with small values are more unevenly distributed: the difference between the frequencies of two adjacent small-magnitude values is generally much greater than the difference between the frequencies of two adjacent large-magnitude values. So the DCT coefficient values with small magnitude are more sensitive to information inserted using a JSteg-like method. (See Figure 2 in [16] for histograms of the DCT coefficients before and after messages are inserted into a JPEG image using a JSteg-like method.) We use $h_d$ to denote the frequency of the DCT coefficient value $d$, i.e., the number of DCT coefficients that take the value $d$. In this study, we use the six central frequencies (the frequencies of the six coefficient values closest to zero, excluding the values 0 and 1) to build our models. We do not use the frequencies of the values 0 and 1, since the JSteg method does not modify DCT coefficients with these values. Our experiments demonstrate (results presented in Section 5) that the use of more variables does not improve the performance of the models.

3.2. Logistic Regression

Logistic models are widely used in statistical applications where binary responses (two classes) occur. In our case, an image is either stego or normal. We can assume that the probability $p(x)$ of an image being stego is a function of some image characteristics $x$ (a vector); for example, $x$ can be the frequencies of the quantized DCT coefficient values. In the case of two classes, the logistic model has a very simple form. The probability function is modeled by

$$p(x) = \frac{\exp(\beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k)}{1 + \exp(\beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k)},$$

where the $x_i$'s are the components of the vector $x$. The model parameters $\beta_i$ are usually estimated by maximum likelihood. Note that the logit function $\operatorname{logit}(p) = \log\bigl(p/(1-p)\bigr)$ is monotone, and its inverse, the logistic function above, guarantees that the values of $p(x)$ lie between 0 and 1. More details on logistic regression can be found in [6]. We use the S-PLUS [2] function glm [12] to estimate the probability function $p(x)$.

Figure 1. The frequencies of the quantized DCT coefficients having values -1 and -2. Black dots are for the normal JPEG images; symbols 1, 2 and 3 are for stego images with 20, 100 and 200 bytes of hidden messages respectively.

As mentioned in the previous subsection, we use the six central frequencies as predicting variables. Adding more variables increases the complexity of the model but does not necessarily guarantee higher accuracy; in fact, we show in Section 5 that in some instances accuracy can actually deteriorate. We include only linear terms of the variables in the model, based on the following argument. Take the pair $h_{-1}$ and $h_{-2}$ for example. A plot of these pairs is shown in Figure 1. The plot suggests that in a normal JPEG image, the frequency of the DCT coefficients having the value -1 is almost always higher than that of the coefficients having the value -2. This holds because the distribution of each individual DCT coefficient tends to have a mode in the center (see [3, 14, 10] for modeling of the DCT coefficients). For a very large image with a large number of DCT coefficients, this trend should also hold for the other coefficient value pairs; for small images, however, it may occasionally fail. As the image sizes in our experiments were rather small, we chose the six central frequencies as predicting variables in the statistical models, because the trend appears to hold in this case. This important piece of extra information is not utilized by any known steganalysis method. An implication of this observation is that linear logistic models should work well in detecting JSteg-like steganographic techniques. The above discussion also suggests that the methods presented in this paper should work better for larger images.

Compared with a Chi-square test, logistic regression is
more refined. Logistic regression has several advantages:

1. It is fast. Logistic regression is implemented efficiently in almost all professional statistical packages, such as S-PLUS and SAS.

2. It is easy to interpret. We can derive a closed-form expression for the probability function $p(x)$.

3. It is flexible. We can adjust one simple parameter in the model to meet different accuracy needs.
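The paper estimates the model with the S-PLUS function glm; as an illustration of the maximum-likelihood idea, here is a minimal gradient-ascent sketch on toy data. This is hypothetical code for exposition (glm itself uses iteratively reweighted least squares), and the data are synthetic.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(xs, ys, lr=0.1, epochs=2000):
    """Fit p(x) = sigmoid(b0 + b . x) by stochastic gradient ascent
    on the log-likelihood of the two-class logistic model."""
    b0, b = 0.0, [0.0] * len(xs[0])
    for _ in range(epochs):
        for x, y in zip(xs, ys):
            err = y - sigmoid(b0 + sum(bi * xi for bi, xi in zip(b, x)))
            b0 += lr * err
            b = [bi + lr * err * xi for bi, xi in zip(b, x)]
    return b0, b

# Toy data: one feature standing in for a histogram frequency.
xs = [[0.0], [1.0], [2.0], [3.0]]
ys = [0, 0, 1, 1]  # class 1 plays the role of "stego"
b0, b = fit_logistic(xs, ys)
p_high = sigmoid(b0 + b[0] * 3.0)  # fitted probability near 1
p_low = sigmoid(b0 + b[0] * 0.0)   # fitted probability near 0
```

Because the logistic function is monotone, the fitted probabilities are always in (0, 1), which is the property noted above.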
3.3. Tree-based Method
Tree-based methods can also be used. We use a tree-based method, C4.5 [17], to fit a tree structure to the training data. C4.5 builds classification models called decision trees from the training data. Each internal node in the tree specifies a binary test on a single attribute, using thresholds on numeric attributes. If, as a result of the tests conducted at internal nodes, an image ends up in a leaf node where the majority of the images are stego, it is classified as a stego image; otherwise it is classified as a normal image. The tree is constructed by the following procedure:

1. Choose the best attribute for splitting the data into two groups at the root node.

2. Determine a splitting point by maximizing some specified criterion (say, information gain).

3. Recursively carry out the first two steps until the information gained by the process cannot be improved any further.

The information gained by splits can be used as the criterion for determining both the attributes and the splitting points. Once the tree is constructed, it can be used for classifying the test data. Tree-based methods can be more flexible than logistic regression: they make fewer assumptions about the data, so they can be generalized to other situations more easily. One disadvantage of the tree-based method is that the decision regions for classification are constrained to be hyper-rectangles, with boundaries parallel to the input variable axes. As in the case of the logistic model, the training data here also consist of the frequencies of the quantized DCT coefficient values (the six central frequencies) of the images, as attributes or predicting variables. We use the data mining tool WEKA [20] to run the C4.5 algorithm on the training data.
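Steps 1 and 2 of the procedure above, choosing an attribute and a splitting point by information gain, can be sketched as follows. This is toy illustrative code, not C4.5 itself, and the data are synthetic.

```python
import math

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    out = 0.0
    for c in set(labels):
        p = labels.count(c) / n
        out -= p * math.log2(p)
    return out

def best_split(rows, labels):
    """Return (attribute index, threshold, gain) maximizing the
    information gain of a binary split attr <= threshold."""
    base, best = entropy(labels), (None, None, -1.0)
    n = len(rows)
    for j in range(len(rows[0])):
        for t in sorted({r[j] for r in rows}):
            left = [y for r, y in zip(rows, labels) if r[j] <= t]
            right = [y for r, y in zip(rows, labels) if r[j] > t]
            if not left or not right:
                continue
            gain = (base - (len(left) / n) * entropy(left)
                         - (len(right) / n) * entropy(right))
            if gain > best[2]:
                best = (j, t, gain)
    return best

# Toy data: attribute 1 separates the two classes perfectly.
rows = [[5, 1], [6, 1], [5, 9], [7, 9]]
labels = ["normal", "normal", "stego", "stego"]
j, t, gain = best_split(rows, labels)
```

On this toy data the split found is on attribute 1 with a gain of one full bit; C4.5 applies this search recursively, as in step 3.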
4. Experiments and Results

We have a data set of 180 normal JPEG images downloaded from the Internet. All the images have been cropped to the same small size. We used the JSteg method to insert 20 bytes, 100 bytes, and 200 bytes of text messages into the images. According to the author of JSteg, the maximum size of the message that can be inserted in a cover image is a small fraction of the size of the image file; for some of our image files, this limit is only 200 bytes. The secret message used in our experiments for insertion into cover images was taken from Gutenberg's Etext of Shakespeare's First Folio.

We use 10-fold validation to compare the three steganalysis methods: logistic regression, the tree-based method C4.5, and the Stegdetect method. In each of our experiments, we take the original 180 JPEG images and one group of 180 images with embedded text messages; 200, 100 and 20 bytes of text were embedded in the cover images in experiments 1, 2 and 3 respectively. The 10-fold validation results are summarized in Tables 1, 2, 3 and Figure 2.

Figure 2. Boxplots of the error rates. The left three bars are error rates for experiment 1, the middle three are for experiment 2, and the right three are for experiment 3. Smaller values are better.

Table 1. Error rates for the logistic regression, the tree-based method C4.5 and the Stegdetect method in experiment 1 (in the stego images, 200 bytes of text message is embedded using JSteg). The mean and standard deviation of the error rates are shown at the bottom of the table.

run     logistic   tree       stegdetect
1       0.000000   0.055556   0.194444
2       0.000000   0.055556   0.305556
3       0.027778   0.083333   0.111111
4       0.000000   0.055556   0.250000
5       0.000000   0.000000   0.111111
6       0.000000   0.027778   0.194444
7       0.027778   0.194444   0.166667
8       0.000000   0.027778   0.222222
9       0.000000   0.027778   0.111111
10      0.000000   0.027778   0.277778
mean    0.005556   0.055556   0.194444
stdev   0.011712   0.053990   0.070516

Table 2. Error rates for the logistic regression, the tree-based method C4.5 and the Stegdetect method in experiment 2 (in the stego images, 100 bytes of text message is embedded using JSteg). The mean and standard deviation of the error rates for each method are shown at the bottom of the table.

run     logistic   tree       stegdetect
1       0.055556   0.227778   0.194444
2       0.083333   0.202222   0.250000
3       0.111111   0.202222   0.166667
4       0.055556   0.138889   0.250000
5       0.055556   0.222222   0.194444
6       0.000000   0.194444   0.138889
7       0.083333   0.444444   0.250000
8       0.055556   0.166667   0.222222
9       0.083333   0.305556   0.250000
10      0.000000   0.083333   0.333333
mean    0.058333   0.218778   0.225000
stdev   0.035741   0.098394   0.054700

Table 3. Error rates for the logistic regression, the tree-based method C4.5 and the Stegdetect method in experiment 3 (in the stego images, 20 bytes of text message is embedded using JSteg). The mean and standard deviation of the error rates for each method are shown at the bottom of the table.

run     logistic   tree       stegdetect
1       0.388889   0.527778   0.555556
2       0.305556   0.555556   0.527778
3       0.444444   0.555556   0.472222
4       0.500000   0.611111   0.416667
5       0.388889   0.527778   0.444444
6       0.388889   0.583333   0.305556
7       0.333333   0.416667   0.638889
8       0.416667   0.583333   0.555556
9       0.444444   0.507778   0.500000
10      0.333333   0.611111   0.555556
mean    0.394444   0.548000   0.497222
stdev   0.059720   0.057986   0.093008

From the results, we can see that the logistic regression method performs better than the other two methods in all three experiments. The tree-based method C4.5 performs better than the Stegdetect method in experiment 1; the performance of the latter two methods is similar in experiment 2, while in experiment 3 both fail to perform better than random guessing. The performance of the logistic regression method is noteworthy: when 20-byte messages are embedded in the images, its mean error rate is 0.394. This implies that even when only 20 bytes of message are embedded, the logistic regression method is able to perform better than
random guess, whereas the other two methods perform no better than random guess. In the tree-based method, the boundaries of the decision regions are constrained to be parallel to the input variable axes. However, as may be observed in Figure 1, the true boundary between the normal and the stego images in terms of these attributes is not parallel to the input variable axes. For this reason, the performance of the tree-based method is not as good as that of logistic regression. Our chosen methods do not rely on knowledge of the locations where the information is hidden; as such, they can be effectively utilized to break similar LSB-based methods that use random bit selection, e.g., OutGuess 0.13.
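For comparison, the Chi-square PoV test that underlies Stegdetect (Section 2) can be sketched as follows. This is an illustrative reimplementation under our own simplifying assumptions (PoV pairs of the form (2k, 2k+1), skipping the values 0 and 1), not Stegdetect's actual code; only the test statistic is computed, with small values indicating the frequency equalization typical of stego images.

```python
def pov_chi_square(freq):
    """Chi-square statistic over PoV pairs (2k, 2k+1) of a histogram
    of quantized DCT coefficient values.  Under embedding, the expected
    frequency of each member of a pair is the pair mean; a small
    statistic therefore suggests PoV equalization, i.e. a stego image."""
    stat, dof = 0.0, 0
    for v in sorted(freq):
        if v % 2 == 0 and v != 0 and (v + 1) in freq:  # skip the pair (0, 1)
            e = (freq[v] + freq[v + 1]) / 2.0
            if e > 0:
                stat += (freq[v] - e) ** 2 / e + (freq[v + 1] - e) ** 2 / e
                dof += 1
    return stat, dof

# Hypothetical histograms of quantized DCT coefficient values.
normal = {2: 700, 3: 300, 4: 400, 5: 180}
stego = {2: 505, 3: 495, 4: 292, 5: 288}
s_normal, _ = pov_chi_square(normal)
s_stego, _ = pov_chi_square(stego)
# s_stego is far smaller than s_normal: the PoVs have been equalized.
```

Unlike the classifiers above, this test reduces each image to a single statistic and a threshold, which is one reason its accuracy is limited.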
5. On the Number of Predicting Variables

We indicated in Section 3 that the use of excessive variables may not lead to better results; we illustrate this phenomenon with results from our experiments. In Figure 3 we present the results of experiments in which a varying number of predicting variables (2, 4, 6, ..., 20) was used instead of just the six central frequencies. The estimation error rates for the logistic regression method (10-fold cross validation) are summarized in Figure 3. The figure indicates that using only the two center frequencies may not capture all the information in the data, and increasing the number of variables (up to six) improves the accuracy. However, the use of more than six variables does not improve the accuracy, and causes slightly larger variance in the estimation error rates. We observe similar trends for the tree-based methods. In our experiments, therefore, we use a model with six predicting variables. More sophisticated feature selection algorithms can be found in the literature [6, 11, 13, 21]; we intend to explore whether applying these feature selection algorithms can lead to further performance improvement.

Figure 3. Comparison of logistic models with different numbers of predicting variables (2 to 20 on the horizontal axis; vertical axes: error rates). The sizes of the hidden messages in the stego images are 20, 100 and 200 bytes in panels (a), (b) and (c) respectively.
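The experiments in this section rest on k-fold cross-validation; the machinery can be sketched in a self-contained way as below. The data and the stand-in nearest-mean classifier are synthetic and hypothetical (standing in for the paper's S-PLUS and WEKA setups); the sketch only shows how fold-wise error rates are computed for a model built on the first m predicting variables.

```python
def k_fold_error(rows, labels, k, fit, predict):
    """Mean k-fold cross-validation error rate for a fit/predict pair."""
    n, errors = len(rows), 0
    for i in range(k):
        test_idx = set(range(i, n, k))  # simple deterministic folds
        train_x = [r for j, r in enumerate(rows) if j not in test_idx]
        train_y = [y for j, y in enumerate(labels) if j not in test_idx]
        model = fit(train_x, train_y)
        for j in sorted(test_idx):
            if predict(model, rows[j]) != labels[j]:
                errors += 1
    return errors / n

def make_fit(m):
    """Stand-in classifier: per-class means over the first m features."""
    def fit(xs, ys):
        means = {}
        for c in set(ys):
            pts = [x[:m] for x, y in zip(xs, ys) if y == c]
            means[c] = [sum(col) / len(pts) for col in zip(*pts)]
        return means
    return fit

def predict(means, x):
    """Assign the class whose mean is nearest on the fitted features."""
    def dist2(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return min(means, key=lambda c: dist2(means[c], x))

# Synthetic data in which only the first two features carry signal.
rows = [[i % 2, (i % 2) * 2.0, 0.0, 0.0] for i in range(40)]
labels = ["stego" if i % 2 else "normal" for i in range(40)]
err_two = k_fold_error(rows, labels, 10, make_fit(2), predict)
```

By sweeping m in `make_fit(m)` over 2, 4, 6, ..., one obtains an error-rate curve of the kind summarized in Figure 3.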
6. Conclusion

LSB-based steganographic techniques like JSteg change the statistical properties of the cover image when they embed a secret message in the image. Accordingly, such methods are vulnerable to statistical attack. Previous methods such as Stegdetect are based on a Chi-square test, and their accuracy leaves room for improvement: when the size of the hidden message is small, Stegdetect performs no better than random guessing. In this paper we have proposed two new steganalysis methods, based on logistic regression and on the tree-based method C4.5, for attacking LSB-based steganographic techniques. We conducted experiments to evaluate the performance of the two data mining techniques and compared them with the well-known method Stegdetect. The experiments demonstrated that the performance of the logistic-regression-based technique is very impressive: when a large amount of information is hidden, it detects it with very high accuracy, and even when the amount of hidden information is very small, it performs better than random guessing.
The tree-based method C4.5 outperforms Stegdetect in the experiment where a relatively large amount of information is hidden; however, it does not perform well when the amount of hidden information is small. We suggest that one reason C4.5 does not perform as well as logistic regression is that it tends to produce decision boundaries parallel to the input variable axes, which may not be appropriate in this case. We also pointed out that the number of attributes used in classification is related to a classifier's performance. Many present steganalysis methods do not treat this as a serious problem and hence tend to use all related attributes. However, using more variables than needed does not necessarily lead to good performance and may even significantly degrade the performance of a statistical learning model. When selecting the predicting variables for our model, we take the distribution of the DCT coefficients into consideration. Understanding steganalysis methods and their effects can help in designing methods and algorithms that preserve data privacy. Our experiments were carried out to break methods like JSteg that are employed to hide information. Our methods do not rely on the placement of the hidden information; therefore, they can be used without modification on LSB-based steganographic techniques that use random bit selection.
7. Acknowledgements The authors would like to thank Sidi Goutam and Amit Mandvikar for their help in this project. The authors also wish to thank the reviewers for their helpful comments in the preparation of this manuscript.
References

[1] G. Berg, I. Davidson, M.-Y. Duan, and G. Paul. Searching for hidden messages: automatic detection of steganography. In Proceedings of the 15th AAAI Conference on Innovative Applications of Artificial Intelligence (IAAI), 2003.
[2] J. M. Chambers and T. Hastie, editors. Statistical Models in S. London: Chapman & Hall, 1991.
[3] R. J. Clarke. Transform Coding of Images. London: Academic Press, 1985.
[4] H. Farid. Detecting hidden messages using higher-order statistical models. In International Conference on Image Processing (ICIP), Rochester, NY, 2002.
[5] J. Fridrich and M. Goljan. Practical steganalysis of digital images: state of the art. In Proc. SPIE Photonics West, Vol. 4675, Electronic Imaging 2002, Security and Watermarking of Multimedia Contents, San Jose, CA, pages 1-13, 2002.
[6] T. Hastie, R. Tibshirani, and J. H. Friedman. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. New York: Springer-Verlag, 2001.
[7] N. F. Johnson and S. Jajodia. Steganalysis of images created using current steganography software. In D. Aucsmith, editor, Information Hiding: Second International Workshop, volume 1525 of Lecture Notes in Computer Science, pages 273-289. Springer-Verlag, Berlin, Germany, 1998.
[8] D. Kahn. The Codebreakers: The Story of Secret Writing. Scribner, New York, NY, USA, 1996.
[9] D. Kahn. The history of steganography. In R. J. Anderson, editor, Information Hiding: First International Workshop, volume 1174 of Lecture Notes in Computer Science, pages 1-5. Springer-Verlag, Berlin, Germany, 1996.
[10] E. Y. Lam and J. W. Goodman. A mathematical analysis of the DCT coefficient distributions for images. IEEE Transactions on Image Processing, 9(10):1661-1666, 2000.
[11] H. Liu and H. Motoda. Feature Selection for Knowledge Discovery and Data Mining. Boston: Kluwer Academic Publishers, 1998.
[12] P. McCullagh and J. A. Nelder. Generalized Linear Models (Second edition). London: Chapman & Hall, 1989.
[13] A. Miller. Subset Selection in Regression. Chapman & Hall/CRC, 2nd edition, 2002.
[14] F. Müller. Distribution shape of two-dimensional DCT coefficients of natural images. Electronics Letters, 29(22):1935-1936, 1993.
[15] W. B. Pennebaker and J. L. Mitchell. JPEG Still Image Data Compression Standard. Van Nostrand Reinhold, New York, NY, USA, 1993.
[16] N. Provos and P. Honeyman. Detecting steganographic content on the Internet. Technical report, CITI, 2001.
[17] J. R. Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann, 1993.
[18] Steganography software. www.stegoarchive.com, 1997-2003.
[19] A. Westfeld and A. Pfitzmann. Attacks on steganographic systems. 1999.
[20] I. Witten and E. Frank. Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations. Morgan Kaufmann, 2000.
[21] L. Yu and H. Liu. Feature selection for high-dimensional data: a fast correlation-based filter solution. In T. Fawcett and N. Mishra, editors, Proceedings of the 20th International Conference on Machine Learning (ICML-03), pages 856-863, Washington, D.C., 2003. Morgan Kaufmann.
[22] T. Zhang and X. Ping. A fast and effective steganalytic technique against JSteg-like algorithms. In Proceedings of the ACM Symposium on Applied Computing, Florida, USA, March 2003.