A Novel Scheme of Webpage Information Hiding Based on Attributes
Dongsheng Shen
Hong Zhao
Dept. of Computer Science and Engineering Zhangzhou Normal University Zhangzhou, Fujian 363000, China
[email protected]
Dept. of Computer Science and Engineering Zhangzhou Normal University Zhangzhou, Fujian 363000, China
[email protected]
Abstract—Webpage-based information hiding technology uses a webpage as the cover object. Research in this area has led to several techniques, which can be broadly classified into those based on embedding invisible characters, those based on switching the uppercase-lowercase state of letters in tags, and those based on altering the order of attributes. This paper presents an information hiding scheme based on attribute order, which overcomes the weak imperceptibility and robustness of the traditional webpage information hiding algorithms. In the proposed scheme, a map between the set of all permutations of n attributes and the set of all (n-1)-bit binary strings is established first, and the embedding and extracting processes are then carried out according to this map. The experimental results and analysis show that the proposed scheme does not increase the size of the cover-webpage and offers better imperceptibility, robustness, and security than the traditional algorithms. The embedded capacity of the proposed scheme is large enough to embed a specified secret message, so the scheme can be used for webpage content protection and covert communication.
Keywords- webpage; information hiding; embed; extract; attribute order
I. INTRODUCTION
Information hiding technology, a new hotspot in information security, can hide a secret message inside images, video, audio, text, and other digital objects. The general model of hiding data in other data can be described as follows. The embedded data is the message that one wishes to send secretly. It is usually hidden in an innocuous message referred to as a cover-text, cover-image, or cover-audio as appropriate, producing the stego-text or other stego-object. A stego-key is used to control the hiding process so as to restrict detection and/or recovery of the embedded data to parties who know it [1]. In recent years, information hiding has become an active research field because it can secretly embed sensitive information in a non-secret cover-object, both to strengthen the security of encrypted information and to verify the authenticity of the cover-object [2].
Webpage-based information hiding technology uses a webpage as the cover-object. The secret data is embedded inside the HTML text of the webpage while the meaning of the original webpage text is preserved, and the result is transmitted as a stego-text. Documents written in HTML are widely distributed on the web. Therefore, in consideration of the technical validity of such a watch system, it is meaningful to examine the possibility of communication using HTML that is difficult to detect. Research in this area has led to several different techniques, which can be broadly classified into those based on embedding invisible characters, those based on switching the uppercase-lowercase state of letters in tags, and those based on altering the order of attributes [3-8]. The methods based on embedding invisible characters, which take a white space as “0” and a tab as “1” at the end of lines or sentences, are easy to implement, but they increase the size of the webpage and are vulnerable to attack by deleting excess white space and tabs. The methods based on switching the uppercase-lowercase state of letters in tags, which take uppercase letters as “1” and lowercase letters as “0”, can embed a large amount of secret data in the webpage without increasing its size, but the case-switched letters easily expose the embedded message.
According to the HTML specification [9], an HTML element may have associated properties, called attributes. The attribute/value pairs may appear in any order, so the attributes in an element’s start tag can be rearranged freely in an HTML document. In this paper, an information hiding scheme based on attribute order is presented, which overcomes the weak imperceptibility and robustness of the traditional webpage information hiding algorithms. First, we study the relation between different orderings of the attributes and binary bit strings; then we define two maps for information embedding and extracting; finally, the proposed scheme is built on these maps. The proposed scheme does not increase the size of the cover-webpage and offers better imperceptibility, robustness, and security than the traditional algorithms. The embedded capacity of the proposed scheme is large enough to embed a specified secret message, so the scheme can be used for webpage content protection and covert communication.
This paper is organized as follows. In Section 2, we give a brief introduction to the HTML specification. Section 3 explains the basic theory and the scheme of webpage information hiding based on attribute order. Experiments and discussion are presented in Section 4. Finally, the paper is concluded in Section 5.
978-1-4244-6943-7/10/$26.00 ©2010 IEEE
II. HTML CHARACTERISTICS
HTML, which stands for HyperText Markup Language, is the predominant markup language for web pages. It provides a means to create structured documents by denoting structural semantics for text such as headings, paragraphs, lists, links, quotes, and other items. It allows images and objects to be embedded and can be used to create interactive forms. Elements give structure to an HTML document and tell the browser how the page is to be presented. Generally, an element consists of a start tag, content, and an end tag. The element’s name appears in the start tag and the end tag. An HTML element may have associated properties, called attributes, which may have values (by default, or set by authors or scripts). Attribute/value pairs appear before the final “>” of an element’s start tag. Any number of (legal) attribute/value pairs, separated by spaces, may appear in an element’s start tag, and they may appear in any order [9]; for example, two start tags that contain the same attribute/value pairs in different orders are equivalent.
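As a minimal illustration of this equivalence, the following Python sketch parses two start tags with the standard-library `html.parser` module and compares the resulting attribute sets. The two `img` tags and their attribute values are hypothetical examples chosen for this sketch; they are not taken from the original paper.

```python
from html.parser import HTMLParser


class StartTagCollector(HTMLParser):
    """Collects (tag name, attribute dict) for every start tag that is parsed."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs in document order;
        # converting it to a dict discards the order.
        self.tags.append((tag, dict(attrs)))


def parse_start_tag(markup):
    """Return (tag name, attribute dict) of the first start tag in markup."""
    collector = StartTagCollector()
    collector.feed(markup)
    collector.close()
    return collector.tags[0]


# Hypothetical tags: same attribute/value pairs, different order.
tag_a = '<img src="logo.png" alt="Logo" width="120">'
tag_b = '<img width="120" src="logo.png" alt="Logo">'

same_element = parse_start_tag(tag_a) == parse_start_tag(tag_b)
print(same_element)
```

Because the comparison ignores attribute order, the two tags are treated as the same element, which is the degree of freedom the hiding scheme exploits.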
The proposed information hiding scheme is based on changing the order of the attribute/value pairs.
III. PROPOSED SCHEME OF INFORMATION HIDING
Figure 1. The proposed scheme of webpage information hiding. [Block diagram with the blocks Secret Message, Pre-process, Cover-webpage, Embed Algorithm, Extract Algorithm, Inverse Pre-process, and Extracted Message.]
A. Pre-processing
In the proposed information hiding scheme, we take a binary image as the secret message to be embedded. Let $W = \{w(i,j) : 1 \le i \le M,\ 1 \le j \le N\}$ be the binary image to be embedded within the cover-webpage, where $w(i,j) \in \{0,1\}$ is the pixel value at $(i,j)$. To remove the spatial correlation between the pixels of the binary image and to improve the security of the whole information hiding system, the secret image must first be scrambled. In our scheme, the secret binary image is scrambled from $W$ to $W'$ by the Arnold transform [10] and then rearranged into a one-dimensional vector $M$.
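The pre-processing step can be sketched as follows, assuming a square $N \times N$ image and the common form of the Arnold transform, $(x, y) \mapsto (x + y,\ x + 2y) \bmod N$. The function names, the iteration count `rounds` (which could serve as part of the stego-key), and the sample image are assumptions made for this sketch; the paper itself only cites [10] for the transform.

```python
def arnold_scramble(image, rounds):
    """Apply the Arnold map `rounds` times to a square image (list of rows)."""
    size = len(image)
    if any(len(row) != size for row in image):
        raise ValueError("this sketch only handles square images")
    current = [row[:] for row in image]
    for _ in range(rounds):
        scrambled = [[0] * size for _ in range(size)]
        for x in range(size):
            for y in range(size):
                # Pixel (x, y) moves to (x + y, x + 2y) mod size.
                scrambled[(x + y) % size][(x + 2 * y) % size] = current[x][y]
        current = scrambled
    return current


def arnold_unscramble(image, rounds):
    """Invert arnold_scramble by applying the inverse map `rounds` times."""
    size = len(image)
    current = [row[:] for row in image]
    for _ in range(rounds):
        restored = [[0] * size for _ in range(size)]
        for x in range(size):
            for y in range(size):
                # The pixel now at (x, y) came from (2x - y, y - x) mod size.
                restored[(2 * x - y) % size][(y - x) % size] = current[x][y]
        current = restored
    return current


def flatten(image):
    """Rearrange the image row by row into a one-dimensional bit vector."""
    return [pixel for row in image for pixel in row]


# Hypothetical 4 x 4 secret image W.
W = [
    [1, 0, 0, 0],
    [0, 1, 1, 0],
    [0, 0, 1, 0],
    [1, 0, 0, 1],
]
W_scrambled = arnold_scramble(W, rounds=3)
M = flatten(W_scrambled)
W_restored = arnold_unscramble(W_scrambled, rounds=3)
```

The inverse function corresponds to the "Inverse Pre-process" block of Figure 1: after extraction, the bit vector is reshaped and unscrambled with the same number of rounds.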
Let $T(a_1, a_2, \dots, a_n)$ be an n-attribute tuple associated with some HTML element A, where $T$ denotes the tag name of element A and $a_i$ denotes the i-th attribute/value pair in the tuple $(a_1, a_2, \dots, a_n)$. Suppose $PA_n$ is the set of all permutations of the n-attribute tuple $(a_1, a_2, \dots, a_n)$; obviously $|PA_n| = n!$. Since HTML attribute/value pairs may appear in any order in the HTML document, for any $(a_1, a_2, \dots, a_n), (a'_1, a'_2, \dots, a'_n) \in PA_n$ the tags $T(a_1, \dots, a_n)$ and $T(a'_1, \dots, a'_n)$ have the same effect in the webpage document and are presented identically by the browser. Let $|H|$ denote the number of tags in webpage $H$, let $T_j = T_j(a_1, a_2, \dots, a_{n_j})$ denote the j-th tag in $H$ and $|T_j|$ its number of attributes, where $1 \le j \le |H|$, and let $H(T) = \{T_j(a_1, a_2, \dots, a_{n_j}) : |T_j| \ge 2,\ 1 \le j \le |H|\}$. Our information hiding scheme is shown in Figure 1.
B. Algorithm
To describe the proposed scheme concisely, some functions are defined first. Let $PA_n$ be the set of all permutations of the n attributes $(a_1, a_2, \dots, a_n)$, and let $B_{n-1} = \{b_1 b_2 \cdots b_{n-1} \mid b_i = 0 \text{ or } b_i = 1\}$ be the set of all (n-1)-bit binary strings.
Definition 1. The function $f: PA_n \to B_{n-1}$ is defined by $b_1 b_2 \cdots b_{n-1} = f((a_1, a_2, \dots, a_n))$, where $b_1 b_2 \cdots b_{n-1} \in B_{n-1}$, $(a_1, a_2, \dots, a_n) \in PA_n$, and
$$b_i = \begin{cases} 1, & \text{if } a_i < a_{i+1} \\ 0, & \text{else} \end{cases} \qquad (1 \le i \le n-1).$$
(The relation symbol between $a_i$ and $a_{i+1}$ is not legible in the source text; a comparison $a_i < a_{i+1}$ under a fixed ordering of the attribute/value pairs is assumed here.)
According to Definition 1, $f$ is single-valued: for every $(a_1, a_2, \dots, a_n) \in PA_n$ there exists a unique $b_1 b_2 \cdots b_{n-1} \in B_{n-1}$ such that $b_1 b_2 \cdots b_{n-1} = f((a_1, a_2, \dots, a_n))$. Because $|PA_n| = n!$ exceeds $|B_{n-1}| = 2^{n-1}$ for $n \ge 3$, different permutations can share the same image under $f$.
Definition 2. The map $g: B_{n-1} \to PA_n$ is defined by $(a_1, a_2, \dots, a_n) = g(b_1 b_2 \cdots b_{n-1})$, where $b_1 b_2 \cdots b_{n-1} \in B_{n-1}$, $(a_1, a_2, \dots, a_n) \in PA_n$, and $b_1 b_2 \cdots b_{n-1} = f((a_1, a_2, \dots, a_n))$. According to this definition, $g$ is a one-to-many map: for an arbitrary $b_1 b_2 \cdots b_{n-1} \in B_{n-1}$ there may be multiple $(a_1, a_2, \dots, a_n) \in PA_n$ such that $(a_1, a_2, \dots, a_n) = g(b_1 b_2 \cdots b_{n-1})$, and $G = g(b_1 b_2 \cdots b_{n-1})$ is a subset of $PA_n$.
Definition 3. The function SubM() is defined by $B = b_1 b_2 \cdots b_k = \mathrm{SubM}(b_1 b_2 \cdots b_k b_{k+1} \cdots b_n, k)$; if $n < k$, then $b_{n+1} = \cdots = b_k = 0$.
Definition 4. The function TruncM() is defined by $M' = b_{k+1} \cdots b_n = \mathrm{TruncM}(b_1 b_2 \cdots b_k b_{k+1} \cdots b_n, k)$; if $n < k$, then $k = n$ is taken.
Definition 5. Let $M = b_1 b_2 \cdots b_k$ and $B = b'_1 b'_2 \cdots b'_j$ be two binary strings. The function CatM() is defined as follows:
$$M' = b_1 b_2 \cdots b_k\, b'_1 b'_2 \cdots b'_j = \mathrm{CatM}(M, B).$$
The proposed embedding algorithm and the proposed extracting algorithm are shown in Fig. 2 and Fig. 3, respectively.
Algorithm 1: The proposed embedding algorithm
In our scheme, the secret message is embedded into the cover-webpage by the following 4 steps:
Input: cover-webpage $H$, secret message $M$ to be embedded, and stego-key $K$.
Output: stego-webpage $H'$ with the secret message.
Step 1. Let $j = 1$;
Step 2. If $|T_j| \ge 2$, go to Step 3; else let $j = j + 1$ and go to Step 2;
Step 3. (1) Use function SubM() to get the first $|T_j| - 1$ bits
from $M$, denoted by $B^{(j)}$;
(2) Get the subset $G$ of $PA_{|T_j|}$ by using function $g$, i.e., $f((a_1, a_2, \dots, a_{|T_j|})) = B^{(j)}$ for every $(a_1, a_2, \dots, a_{|T_j|}) \in G$;
(3) Take a tuple $(a_1, a_2, \dots, a_{|T_j|}) \in G$ at random and let it replace $T_j$;
(4) Use function TruncM() to remove the first $|T_j| - 1$ bits from $M$ and store the result in $M$;
Step 4. If $|M| = 0$, the embedding procedure is complete; otherwise let $j = j + 1$ and go to Step 2.
Figure 2. The proposed embedding algorithm. [Flowchart of Algorithm 1.]
Figure 3. The proposed extracting algorithm. [Flowchart of Algorithm 2.]
C. The embedded capacity of the algorithm
Let $H$ be a webpage and $T_j = T_j(a_1, a_2, \dots, a_{n_j})$ be a tag of $H$. According to the proposed embedding algorithm, we can embed $n_j - 1$ bits of the secret message in it. Let $H(T) = \{T_j(a_1, a_2, \dots, a_{n_j}) : |T_j| \ge 2,\ 1 \le j \le |H|\}$ be the set of all tags in webpage $H$ whose number of attributes is larger than 1. Then the largest embedded capacity $LEC(H)$ of the proposed embedding algorithm can be calculated as follows:
Algorithm 2: The proposed extracting algorithm
The secret message is extracted from the stego-webpage as follows:
Input: stego-webpage $H'$ with the secret message, and stego-key $K$.
Output: secret message $M$.
Step 1. Let $j = 1$ and let $M$ be the empty binary string;
Step 2. If $|T_j| \ge 2$, go to Step 3; else let $j = j + 1$ and go to Step 2;
Step 3. Use function $f$ to get $B_j$, and
$$LEC(H) = \frac{1}{8} \sum_{T_j \in H(T)} \left(|T_j| - 1\right).$$
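The capacity formula can be illustrated with a short Python sketch that counts the attributes of every start tag using the standard-library `html.parser` module. The class and function names and the sample page are hypothetical; the sketch assumes that each tag with at least two attributes carries $|T_j| - 1$ bits and that the factor $1/8$ converts bits to bytes.

```python
from html.parser import HTMLParser


class AttributeCounter(HTMLParser):
    """Records the number of attributes of every start tag in a page."""

    def __init__(self):
        super().__init__()
        self.counts = []

    def handle_starttag(self, tag, attrs):
        self.counts.append(len(attrs))


def largest_embedded_capacity(html_text):
    """Return LEC(H) in bytes: one eighth of the sum of (|T_j| - 1)
    over all start tags with at least two attributes."""
    counter = AttributeCounter()
    counter.feed(html_text)
    counter.close()
    bits = sum(n - 1 for n in counter.counts if n >= 2)
    return bits / 8


# Hypothetical page: body (2 attributes), a (3), img (2), p (0).
sample_page = (
    '<html><body class="c" id="main">'
    '<a href="/" title="Home" target="_top">home</a>'
    '<img src="a.png" alt="a"><p>text</p>'
    '</body></html>'
)
sample_lec = largest_embedded_capacity(sample_page)
```

For the sample page the eligible tags contribute 1 + 2 + 1 = 4 bits, so the sketch reports half a byte.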
IV. EXPERIMENT AND DISCUSSION
A. Experiment result
To verify the proposed scheme, we randomly downloaded several web pages from different websites for the experiment. Embedding the secret message does not increase the size of the original webpage, and the original and stego webpages
$M = \mathrm{CatM}(M, B_j)$;
Step 4. Let $j = j + 1$ and go to Step 2.
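Algorithms 1 and 2 can be sketched together as a round trip in Python. This is a minimal sketch under several assumptions that are not stated in the paper: tags are modelled as (name, attribute list) pairs, attribute/value pairs are compared as strings with Python's `<` (standing in for the ordering used in Definition 1), the stego-key seeds the random choice from $G$, and the receiver knows the message length (Algorithm 2 as written gives no stopping rule). The function `g` enumerates all $n!$ orderings, which is only practical for tags with few attributes.

```python
import random
from itertools import permutations


def f(attrs):
    """Definition 1 (with '<' on strings assumed): one bit per adjacent pair."""
    return "".join("1" if a < b else "0" for a, b in zip(attrs, attrs[1:]))


def g(attrs, bits):
    """Definition 2: all orderings of attrs whose image under f equals bits."""
    return [p for p in permutations(attrs) if f(p) == bits]


def sub_m(message, k):
    """Definition 3: first k bits, zero-padded if the message is shorter."""
    return message[:k].ljust(k, "0")


def trunc_m(message, k):
    """Definition 4: drop the first k bits."""
    return message[k:]


def embed(tags, message, key):
    """Algorithm 1: reorder attributes of tags with >= 2 attributes."""
    rng = random.Random(key)  # assumption: the stego-key seeds the choice
    stego = []
    for name, attrs in tags:
        if len(attrs) >= 2 and message:
            bits = sub_m(message, len(attrs) - 1)
            attrs = list(rng.choice(g(attrs, bits)))
            message = trunc_m(message, len(attrs) - 1)
        stego.append((name, attrs))
    if message:
        raise ValueError("cover-webpage capacity is too small for the message")
    return stego


def extract(tags, length):
    """Algorithm 2: concatenate f over eligible tags, then keep `length` bits."""
    bits = "".join(f(attrs) for _, attrs in tags if len(attrs) >= 2)
    return bits[:length]


# Hypothetical cover-webpage with capacity 2 + 3 = 5 bits.
cover = [
    ("a", ['href="/"', 'class="nav"', 'title="Home"']),
    ("br", []),
    ("img", ['src="x.png"', 'alt="x"', 'width="1"', 'height="1"']),
]
secret = "10110"
stego = embed(cover, secret, key=2010)
recovered = extract(stego, len(secret))
```

Each stego tag contains exactly the same attribute/value pairs as the corresponding cover tag, only in a different order, so the page size is unchanged and extraction does not need the original webpage.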
show the same effect in the browser. The secret message can be extracted from the stego-webpage exactly, without the original webpage. Table I shows the largest embedded capacity (LEC) of the homepages of some popular websites.
TABLE I. THE LARGEST EMBEDDED CAPACITY (LEC) OF SOME POPULAR WEBSITES' HOMEPAGES
Homepage of Website        LEC (B)
www.qq.com                 55
www.sina.com.cn            187
www.sohu.com.cn            251
www.fjzs.edu.cn            49
www.microsoft.com/zh/cn    128
Note: visited 2010-07-2.
B. Discussion
The methods based on embedding invisible characters, which embed invisible characters at the end of lines or sentences, are easy to implement, but they increase the size of the webpage and are vulnerable to attack by deleting excess white space and tabs in the HTML document. The methods based on switching the uppercase-lowercase state of letters in tags can embed a large amount of secret data into the webpage without increasing its size, but the case-switched letters easily expose the embedded secret message, and the message is easily destroyed by changing all the letters of the tags to uppercase or lowercase. In comparison with these two kinds of methods, the proposed scheme does not increase the size of the cover-webpage and offers better imperceptibility, robustness, and security. Some methods for detecting hidden information in webpages exist [11], but to date no detection method targeting the proposed scheme has been reported.
V. CONCLUSION
Information hiding technology is widely applied to information security, copyright protection, covert communication, and so on. In this paper, a scheme of webpage information hiding based on attribute order is presented. In the proposed scheme, a map between the set of all permutations of n attributes and the set of all (n-1)-bit binary strings is established first. According to this map, the embedding and extracting processes are carried out. The proposed scheme does not increase the size of the cover-webpage and offers better imperceptibility, robustness, and security than the traditional algorithms. The embedded capacity of the proposed scheme is large enough to embed a specified secret message, so the proposed scheme can be used for webpage content protection and covert communication.
ACKNOWLEDGMENT
This paper is supported by the Fujian Province Foundation of Higher Education under Grant No. JK2010036, and the Fujian Province Foundation of Serving the Construction of the Economic Zone on the West Side of the Straits.
REFERENCES
[1] F.A.P. Petitcolas, R.J. Anderson, and M.G. Kuhn, “Information hiding - A survey”, Proceedings of the IEEE, July 1999, 87(7), pp. 1062-1078.
[2] W. Dai, Y. Yu, Y. Dai, and B. Deng, “Text steganography system using Markov chain source model and DES algorithm”, Journal of Software, July 2010, 5(7), pp. 785-792.
[3] X.G. Sui, H. Luo, “A new Steganography method based on hypertext”, In: Proc. of Asia-Pacific Radio Science Conference, Piscataway NJ, IEEE Press, pp. 181-184.
[4] C. John, “Hiding Binary Data in HTML Document [OL]”, http://www.codeproject.com/KB/security/steganodotnet13.aspx, (2010-07-2 visited).
[5] X. Sun, H. Huang, B. Wang, G. Sun, and J. Huang, “An Algorithm of Webpage Information Hiding Based on Equal Tag”, Journal of Computer Research and Development, 2007, 44(5), pp. 756-760 (in Chinese).
[6] Q. Zhao, H. Lu, “PCA-Based Web page Watermarking”, Pattern Recognition, 2007, 40, pp. 1334-1341.
[7] R. Yao, Q. Zhao, H. Lu, “A Novel Watermark Algorithm for Integrity Protection of XML”, International Journal of Computer Science and Network Security, February 2006, Vol. 6, No. 2B.
[8] Y. Shen, “A Scheme of Information Hiding Based on HTML Document”, Journal of Wuhan University, 2004, 50(sl), pp. 217-220 (in Chinese).
[9] “HTML 4.01 Specification [OL]”, http://www.w3.org/TR/REC-html401-19991224/, (2010-07-2 visited).
[10] W. Ding, W.-Q. Yan, D.-X. Qi, “Digital Image Scrambling Technology Based on Arnold Transformation”, Journal of Computer Aided Design & Computer Graphics, 2001, 13(4), pp. 338-341 (in Chinese).
[11] H. Huang, J. Tan, X. Sun, L. Liu, “Detection of Hidden Information in Webpage Based on Higher-Order Statistics”, In: IWDW 2008, LNCS 5450, 2009, pp. 293-302.