A New Genetic Algorithm Approach for Secure JPEG Steganography
Amin Milani Fard
Software Simulation & Modeling Lab., Department of Computer Eng., Ferdowsi University, Mashhad, Iran
Mohammad-R. Akbarzadeh-T.¹
IEEE Senior Member, Department of Electrical Eng., Ferdowsi University, Mashhad, Iran
Farshad Varasteh-A.
Department of Applied Mathematics, Faculty of Mathematical Sciences, Ferdowsi University, Mashhad, Iran
[email protected]
[email protected]
[email protected]
Abstract - Steganography is the act of hiding a message inside another message in such a way that it can only be detected by its intended recipient. Naturally, there are security agents who would like to fight these data hiding systems by steganalysis, i.e. discovering covered messages and rendering them useless. There is currently no steganography system which can resist all steganalysis attacks. In this paper we propose a novel GA evolutionary process to make a secure steganographic encoding on JPEG images. Our steganography step is based on OutGuess, which is reported to be the least vulnerable steganographic system. A combination of the OutGuess steganalysis approach and the Mean Absolute Difference (MAD) as an image quality measure is used as the GA fitness function. The model presented here is based on JPEG images; however, the idea can potentially be used in other multimedia steganography as well.
Keywords: Steganography, Genetic Algorithm, Cover Image, JPEG format, Blockiness, OutGuess
I. INTRODUCTION
What you see is not always what you get! Thanks to new technology, steganography, the art of hiding messages inside other messages, is now gaining more popularity and is used on various media such as text, images, sound, and signals. However, none of the existing schemes can yet shield against all detection attacks. Using genetic algorithms, which are based on the mechanism of natural genetics and the theory of evolution, we can design a general method to guide the steganography process to the best position for data hiding.
II. IMAGE STEGANOGRAPHY
Image steganography takes advantage of the limited power of the human visual system (HVS). Unlike watermarks, which embed added information in every part of an image, here only the complex parts of the image hold added information [2]. Straight message insertion simply encodes every bit of information in the image. More complex encoding can embed the message only in "noisy" areas of the image, which attract less attention [3]. The least significant bit (LSB) insertion method is probably the best-known image steganography technique. Its main advantage is that the human eye is not able to notice the change; unfortunately, it is extremely vulnerable to attacks such as image manipulation. Masking and filtering techniques hide information by marking an image in a manner similar to paper watermarks. The fact that the HVS cannot detect slight changes in certain temporal domains of the image was exploited in [14], where a faint but perceptible signal is covered with another signal to make the first imperceptible. Masking techniques are better choices for lossy JPEG images than the LSB method because of their relative immunity to image operations such as compression and cropping [2].
Owing to its reasonable quality and small size, JPEG is the most common image format for web and local use. JPEG uses the discrete cosine transform (DCT) to transform successive 8×8 pixel blocks of the image into 64 DCT coefficients. Here, the LSBs of the quantized DCT coefficients are used as redundant bits. The modification of even a single DCT coefficient affects all 64 image pixels of its block. In some image formats such as GIF, the visual structure of the image exists to some degree in all bit layers of the image. Steganographic systems which modify these formats are mostly vulnerable to visual attacks [13]. This is not true of the JPEG format: as the modifications happen in the frequency domain rather than the spatial domain, there is no visual attack against it. Recently, several steganographic techniques for data hiding in JPEGs have been developed: JSteg [10], JP Hide&Seek [10], F5 [11], and OutGuess [12]. All these techniques manipulate the quantized DCT coefficients to embed the hidden message.
III. STEGANALYSIS APPROACHES
It has been shown that the modification of redundant bits changes the statistical properties of the cover image and can reveal the hidden content [3]. As images with hidden data are expected to have higher entropy than those without, some tests only measure the entropy of the redundant data. The simplest test measures the correlation towards one.
¹ Author is currently a visiting scholar at Berkeley Initiative on Soft Computing (BISC), UC Berkeley
Westfeld and Pfitzmann in [13] observed that embedding encrypted data into a GIF image changes the histogram of its colour frequencies. One property of encrypted data is that the one and the zero bits are equally likely. When using the LSB method to embed encrypted data into an image that contains colour A more often than colour B, colour A is changed more often to colour B. As a result, the difference in colour frequency between A and B is reduced by embedding. The same holds for JPEG images, except that the frequencies of the DCT coefficients are analyzed instead of the colour frequencies. Provos and Honeyman in [1] used a χ²-test to determine whether an image shows distortion from embedding hidden data. Using the generalized chi-square attack, JSteg and JP Hide&Seek with sequential message embedding are detectable [1,6,7]. However, this does not hold for F5 (as it decreases coefficient values by 1 if necessary and does not flip LSBs) or OutGuess (as it preserves first-order statistics). Thus almost all current JPEG steganography systems are detectable by statistical analysis, except the latest version of OutGuess (OutGuess 0.2) and F5. However, an effective approach to attack OutGuess and F5, which can also estimate the length of the message, was presented in [4]. We used a similar approach as part of our GA-based steganography method.
The OutGuess steganographic algorithm was proposed by Niels Provos in [12] to defeat the statistical chi-square attack [6]. In the first pass, OutGuess, similar to JSteg, embeds message bits randomly into the LSBs of coefficients while skipping 0's and 1's. In the second pass the image is processed again and corrections are made to the coefficients to make the stego image histogram match the cover image histogram. Since the χ² attack is based on first-order statistics, it cannot detect messages embedded by OutGuess [1]. OutGuess 0.2 preserves statistical properties and cannot be detected by the chi-square test. This approach improves the encoding step by using a pseudo-random number generator to select DCT coefficients at random. The least-significant bit of a selected DCT coefficient is replaced with encrypted message data. Fridrich and her colleagues in [4] described a threshold-free detection methodology for attacking all current steganographic methods that embed data by modifying quantized DCT coefficients, which can also estimate the length of the hidden message.
OutGuess preserves the first-order statistics of the DCT coefficients, so their steganalytic method is designed to be independent of the DCT histogram. The fact they used was that the embedding mechanism in OutGuess overwrites the LSBs. This means that embedding another message into the stego image will partially cancel out and will thus have a different effect on the stego image than on the cover image. The detection starts with identifying a macroscopic quantity S(p) that predictably changes with the length p of the embedded message. Using the values S(0) and S(1), it is possible to estimate the length of the embedded message. They took the increase in spatial blockiness as a function of p as the macroscopic quantity S, using the discontinuities along the boundaries of 8×8 pixel blocks as a quantity that increases with the hidden message's length. The discontinuities are measured by the blockiness formula, where g_{ij} are pixel values in an M×N grayscale image:
$$B = \sum_{i=1}^{\lfloor (M-1)/8 \rfloor} \sum_{j=1}^{N} \left| g_{8i,j} - g_{8i+1,j} \right| + \sum_{j=1}^{\lfloor (N-1)/8 \rfloor} \sum_{i=1}^{M} \left| g_{i,8j} - g_{i,8j+1} \right| \qquad (3.1)$$
Experiments in [4] show that the blockiness B increases monotonically with the number of flipped least-significant bits in the DCT coefficients. The first derivative decreases with the hidden message's length, which means the blockiness function's slope is maximal for the cover image and decreases for an image that already contains a message. Using the blockiness measurement, the algorithm to detect OutGuess [4] proceeds as follows:
1. Determine the blockiness Bs(0) of the decompressed stego image.
2. Using OutGuess, embed a maximal length message and calculate the resulting image's blockiness Bs(1).
3. Crop the stego image by four pixels to reconstruct an image similar to the cover image. Compress the resulting image with the same JPEG quantization matrix and calculate its blockiness B(0).
4. Using OutGuess, embed a maximal length message into the cropped image and calculate the blockiness B(1).
5. Using OutGuess, embed a maximal length message into the stego image from the previous step and compute the resulting blockiness B1(1).
6. The slope S0 = B(1) - B(0) is for the original cover image, and S1 = B1(1) - B(1) for the image with a maximal length message embedded (p = 1). The stego image's slope S = Bs(1) - Bs(0) lies between the two slopes S0 and S1.
Linear interpolation gives S = S0 - p(S0 - S1), and the hidden message's length is then estimated as
$$p = \frac{S_0 - S}{S_0 - S_1} \qquad (3.2)$$
where p = 0 corresponds to the cover image and p = 1 to an image with the maximal embedded message length. To remove the effect of the randomness used in the OutGuess embedding algorithm, the procedure is repeated 10 times and the average of the p-values is taken as the final estimate of the message length.
IV. GA BASED STEGANOGRAPHY FRAMEWORK
Here we propose a new genetic algorithm approach to find the best position for data embedding and also optimize the quality of the steganographic image. As this method is based upon the OutGuess algorithm, it is believed to be capable of resisting most of the common statistical attacks. In our optimization process we need to handle four conflicting goals: longer hidden message, higher image quality, better robustness and larger data capacity. The image quality enhancement method we used here is based on the scheme introduced in [9]. The basic concepts of GAs were originally developed by Holland [15] and later revised by Goldberg [16]. Goldberg showed that GAs are independent of any assumption about the search space and are based on the mechanism of natural genetics. The first step in modelling this problem as a GA problem is to determine the chromosome, the GA operators, and the fitness function.
a) Chromosome Design
In JPEG compression, the discrete cosine transform is applied to map the spatial domain into the frequency domain. JPEG processes the image in blocks of 8x8 pixels. Given a two-dimensional NxN image f(m,n), its discrete cosine transform c(u,v) is defined as:
$$c(u,v) = a(u)\,a(v) \sum_{m=0}^{N-1} \sum_{n=0}^{N-1} f(m,n) \cos\frac{(2m+1)u\pi}{2N} \cos\frac{(2n+1)v\pi}{2N},$$
where u, v = 0, 1, …, N-1 and
$$a(u) = \begin{cases} \sqrt{1/N}, & u = 0 \\ \sqrt{2/N}, & u = 1, 2, \ldots, N-1 \end{cases}$$
In the next step of image compression, the coefficients are quantized using a quantization matrix. In our proposed GA, a chromosome is encoded as an array of 64 genes containing the quantized DCT coefficients of one 8x8 pixel block of the image, i.e. the two-dimensional matrix of quantized coefficients is mapped into a one-dimensional vector.
15 | -1 | 0 | -1 | 2 | ... | -5 | 2 | 1 | 0 | 0 | 0
Figure 1: A sample chromosome with 64 genes
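The chromosome construction described above can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the function names are hypothetical, the quantization table is the commonly used JPEG luminance table (the paper does not say which table was used), the genes are read out in row-major order (the paper does not specify the ordering), and the usual JPEG level shift is omitted because the DCT formula above does not include it.

```python
import math

N = 8  # JPEG block size

# Assumption: standard JPEG luminance quantization table.
Q = [
    [16, 11, 10, 16, 24, 40, 51, 61],
    [12, 12, 14, 19, 26, 58, 60, 55],
    [14, 13, 16, 24, 40, 57, 69, 56],
    [14, 17, 22, 29, 51, 87, 80, 62],
    [18, 22, 37, 56, 68, 109, 103, 77],
    [24, 35, 55, 64, 81, 104, 113, 92],
    [49, 64, 78, 87, 103, 121, 120, 101],
    [72, 92, 95, 98, 112, 100, 103, 99],
]


def a(u):
    """Normalization factor a(u) of the DCT definition."""
    return math.sqrt(1.0 / N) if u == 0 else math.sqrt(2.0 / N)


def dct2(block):
    """Direct evaluation of c(u, v) for one N x N block f(m, n)."""
    coeffs = [[0.0] * N for _ in range(N)]
    for u in range(N):
        for v in range(N):
            total = 0.0
            for m in range(N):
                for n in range(N):
                    total += (block[m][n]
                              * math.cos((2 * m + 1) * u * math.pi / (2 * N))
                              * math.cos((2 * n + 1) * v * math.pi / (2 * N)))
            coeffs[u][v] = a(u) * a(v) * total
    return coeffs


def block_to_chromosome(block, qtable=Q):
    """Quantize the DCT coefficients and flatten them into 64 genes."""
    coeffs = dct2(block)
    # Assumption: row-major flattening of the quantized 8x8 matrix.
    return [round(coeffs[u][v] / qtable[u][v])
            for u in range(N) for v in range(N)]
```

Calling `block_to_chromosome` on a flat gray block yields a chromosome whose only non-zero gene is the first (DC) one, which matches the intuition that a constant block has no higher-frequency content.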
b) GA operations
For the crossover, one point in the selected chromosome is chosen along with the corresponding point in another chromosome, and the tails are exchanged. The mutation process inverts some bits and thereby produces new information. The only problem with mutation is that it may corrupt useful information. Therefore we used elitism, which means the best individual goes forward to the next generation without undergoing any change, so that the best information is kept.
c) Fitness function
Defining the fitness function, which guides the search toward the best solution, is one of the most important steps in designing a GA-based method. Pik-Wah [8] used an image quality indicator, the Mean Absolute Difference (MAD) over every 8x8 pixel region, to compute the objective function values for his GA-based watermark optimizer. Here we also used MAD (δ) to measure the difference between the original image and the steganographic image, which can be used to evaluate the quality of the final image. We used the message embedding positions as our search space and then applied the genetic operators to find the best combination of message and image. Our GA aims to improve the image quality and reduce the message detection probability. To this end, the fitness function combines the mean absolute difference, as the image quality factor, with the actual embedded message length (Alength) and the message length estimated using equation (3.2) (Elength), as the measure of detection probability. Thus the fitness function is defined as:
$$\delta = \frac{1}{64} \sum_{x=0}^{7} \sum_{y=0}^{7} \left| I'(x,y) - I(x,y) \right| \qquad (4.1)$$
$$\text{fitness} = \delta \cdot \frac{E_{length}}{A_{length}} \qquad (4.2)$$
where I and I' are the intensity values at the same pixel position within the image after and before embedding.
d) Algorithm design
After calculating the fitness function value for each parent chromosome, the algorithm generates N children. The lower a parent chromosome's fitness value, the higher the probability that it contributes one or more offspring to the next generation. After the operations are performed, some chromosomes might not satisfy the fitness criterion; the algorithm discards them and is left with M (M ≤ N) child chromosomes. The algorithm then selects the N chromosomes with the lowest fitness values from the M + N chromosomes (M children and N parents) to be the parents of the next generation. This process repeats until a certain number of generations have been processed, after which the best chromosome is chosen. Figure 2 shows our GA approach based on the standard OutGuess method. The algorithm replaces the LSBs of the DCT coefficients held in the chromosomes with message data. Lines 4 and 5 are the reproduction and elitism steps, lines 6-7 the crossover, and line 8 applies mutation to the selected chromosomes. In lines 9-16 the modified OutGuess algorithm produces the steganographic image. The image is then evaluated by the fitness function; if the result is satisfactory the loop ends, otherwise the chromosomes of the next generation are produced.
input: message, shared secret, cover image
output: steganographic image
 1  initialize PRNG with shared secret
 2  produce N initial parent chromosomes
 3  while done ≠ yes do
 4      produce N random children chromosomes
 5      pass the best individual to next generation
 6      randomly mating
 7      exchange parts of chromosomes
 8      mutate with rate = 1/100
 9      while data left to embed do
10          get chromosomes values DCT coefficient
11          if DCT ≠ 0 and DCT ≠ 1 then
12              get next LSB from message
13              replace DCT LSB with message bit
14          end if
15          insert DCT into steganographic image
16      end while
17      evaluate stego image by fitness function
18      if fitness satisfied then done = yes
19      else produce next generation chromosomes
20  end while
Figure 2: Our GA-based steganography algorithm
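The building blocks used by the loop in Figure 2 can be sketched in a few lines. The following is an illustrative sketch under stated assumptions, not the authors' C++ implementation: all function names are hypothetical, the OutGuess embedding itself is not reproduced (the five blockiness values are assumed to have been measured by the caller as in steps 1-5 of the detection procedure), and mutation is assumed to flip the least-significant bit of a gene, which is one possible reading of "inverting some bits".

```python
import random


def blockiness(g):
    """Blockiness B of (3.1) for a grayscale image given as a list of rows."""
    rows, cols = len(g), len(g[0])
    b = 0
    # Discontinuities across horizontal 8x8 block boundaries.
    for i in range(1, (rows - 1) // 8 + 1):
        for j in range(cols):
            b += abs(g[8 * i - 1][j] - g[8 * i][j])
    # Discontinuities across vertical 8x8 block boundaries.
    for j in range(1, (cols - 1) // 8 + 1):
        for i in range(rows):
            b += abs(g[i][8 * j - 1] - g[i][8 * j])
    return b


def estimate_relative_length(bs0, bs1, b0, b1, b1_1):
    """Relative message length p of (3.2) from measured blockiness values."""
    s0 = b1 - b0      # slope for the (approximated) cover image
    s1 = b1_1 - b1    # slope for a fully embedded image
    s = bs1 - bs0     # slope for the stego image under test
    return (s0 - s) / (s0 - s1)


def mad(block_a, block_b):
    """Mean absolute difference (4.1) between two 8x8 pixel blocks."""
    return sum(abs(block_b[x][y] - block_a[x][y])
               for x in range(8) for y in range(8)) / 64


def fitness(delta, e_length, a_length):
    """Fitness (4.2); lower values are better."""
    return delta * e_length / a_length


def crossover(a, b, rng):
    """Single-point crossover: exchange the tails of two chromosomes."""
    point = rng.randrange(1, len(a))
    return a[:point] + b[point:], b[:point] + a[point:]


def mutate(chrom, rate, rng):
    """Assumption: a mutation flips the least-significant bit of a gene."""
    return [gene ^ 1 if rng.random() < rate else gene for gene in chrom]


def next_parents(parents, children, score, n):
    """Keep the n lowest-scoring chromosomes out of parents and children."""
    return sorted(parents + children, key=score)[:n]
```

Because `next_parents` ranks parents and children together, the best individual found so far always survives, which gives the elitism of line 5 in Figure 2 without a separate step. A seeded `random.Random` instance can be passed as `rng` to make runs repeatable.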
V. IMPLEMENTATION
The proposed algorithm was implemented in C++ using the OutGuess 0.2 source code developed by Niels Provos [17] and Genetik, a C++ tool for floating point genetic algorithms developed by Noyan Turkkan [18]. Experiments were run under Red Hat Linux 9 on a desktop computer with a 2.5 GHz Pentium 4 CPU and 1 GB of RAM. The process took less than half a minute for each of our 200x200 pixel test images (Peppers, Lena, and Baboon), and all of them converged to the optimal fitness after about 50 generations. Table 1 shows the parameter values we set for our implementation.
TABLE 1: PARAMETER SETTINGS FOR GA-BASED APPROACH
Parameter                Value
Population size          100
Mutation probability     0.01
Crossover probability    0.9
Elitism probability      0.5
Number of generations    50
VI. CONCLUSION
In this paper a novel GA-based steganographic system is introduced which is intended to defeat almost all known steganalysis methods. The novelty lies in the application of a genetic algorithm to the JPEG steganography process. The method optimizes the locations in the cover image at which the message is embedded. The experimental results show that the method works properly and is considered to give an almost optimal solution.
Results for Peppers.jpg
Parameter         Value
MAD               11750
Message Size      470 bytes
Estimated Size    10 bytes
[Plot: Fitness versus Generation]
Results for Lena.jpg
Parameter         Value
MAD               14200
Message Size      550 bytes
Estimated Size    8 bytes
[Plot: Fitness versus Generation]
Results for Baboon.jpg
Parameter         Value
MAD               13875
Message Size      740 bytes
Estimated Size    8 bytes
[Plot: Fitness versus Generation]
ACKNOWLEDGMENT
The authors would like to thank the Young Iranian Elites Association NGO for supporting young Iranian students' scientific works and projects.
REFERENCES
[1] Provos N. and Honeyman P., "Detecting Steganographic Content on the Internet", Center for Information Technology Integration, University of Michigan, Technical Report 01-11, 2001.
[2] Sellars D., "An Introduction to Steganography", cs.uct.ac.za/courses/CS400W/NIS/papers99/dsellars/stego.html
[3] Johnson N. and Jajodia S., "Exploring steganography: Seeing the unseen", Computer, 31, no. 2:26-34, Feb. 1998.
[4] Fridrich J., Goljan M., and Hogea D., "Attacking the OutGuess", Proc. ACM Workshop Multimedia and Security 2002, ACM Press, 2002.
[5] Provos N. and Honeyman P., "Hide and Seek: An Introduction to Steganography", IEEE Security & Privacy, May/June 2003.
[6] Westfeld A. and Pfitzmann A., "Attacks on Steganographic Systems". In: Pfitzmann A. (eds.): 3rd International Workshop. Lecture Notes in Computer Science, Vol. 1768. Springer-Verlag, Berlin Heidelberg New York (2000).
[7] Westfeld A., "Detecting Low Embedding Rates", 5th Information Hiding Workshop, Netherlands, Oct. 7-9, 2002.
[8] Pik-Wah C., "Digital Video Watermarking Techniques for Secure Multimedia Creation and Delivery", A Thesis Submitted in Partial Fulfillment of the Requirements for the Degree of Master of Philosophy in Computer Science and Engineering, The Chinese University of Hong Kong, July 2004.
[9] Huang C. and Wu J., "A watermark optimization technique based on genetic algorithms", SPIE Electronic Imaging 2000, San Jose, Jan. 2000.
[10] Steganography software for Windows, http://members.tripod.com/steganography/stego/software.html
[11] Westfeld A., "High Capacity Despite Better Steganalysis (F5 - A Steganographic Algorithm)". Information Hiding, 4th International Workshop. Lecture Notes in Computer Science, Vol. 2137. Springer-Verlag, Berlin Heidelberg New York, 2001, pp. 289-302.
[12] Provos N., "Defending Against Statistical Steganalysis", Proc. 10th USENIX Security Symposium, Washington, 2001.
[13] Westfeld A. and Pfitzmann A., "Attacks on Steganographic Systems". In Proceedings of Information Hiding - Third International Workshop. Springer-Verlag, September 1999.
[14] Bassia P. and Pitas I., "Robust audio watermarking in the time domain", Findings report, Dept. of Informatics, University of Thessaloniki, 1998.
[15] J. H. Holland, "Adaptation in natural and artificial systems", Ann Arbor, MI: University of Michigan Press, 1975.
[16] D. E. Goldberg, "The genetic algorithms in search, optimization, and machine learning", New York: Addison-Wesley, 1989.
[17] http://www.outguess.org/download.php
[18] http://www.umoncton.ca/turk