Proceedings of ICCNS 08 , 27-28 September 2008
Steganography in MS Word Document using its In-built Features
Mrs. V. S. Tidake, Prof. S. G. Pukale, Prof. M. L. Dhore
V. S. Tidake is with the NDMVPS’s College of Engineering, Nashik, and is a student of M.E. (CSE-IT) at the Vishwakarma Institute of Technology, Pune (e-mail: vaishalitidake@yahoo.co.in). Prof. S. G. Pukale is with the Vishwakarma Institute of Technology, Pune (e-mail: [email protected]). Prof. M. L. Dhore is with the Vishwakarma Institute of Technology, Pune (e-mail: [email protected]).

© 2008, Vishwakarma Institute of Technology, Pune, MS, INDIA

Abstract— There are plenty of text resources available for text steganography. Microsoft Word, being a commonly used communication medium, can be well utilized as a cover document to hide data. In this paper, a new steganographic method is presented which hides data in MS Word documents. It uses one special feature of Microsoft Word: change tracking. The process of data hiding is divided into two steps: message embedding and message extraction. On the sender’s side, a secret message is embedded inside a cover document to obtain a stegodocument. The position at which the data is embedded is decided depending on the data. The embedded changes are then revised back, which makes the cover document look normal and produces the stegodocument. On the receiver’s side, the hidden message is extracted from the stegodocument. The paper compares two encoding techniques used for message embedding, namely Huffman and block encoding.

Keywords— Text steganography, cover document, change tracking, message embedding, stegodocument, message extraction.

I. INTRODUCTION

Steganography is the art of sending hidden or invisible messages. The name comes from the Greek for “covered writing”. While much of modern steganography focuses on images, audio signals, and other digital data, there is also a plethora of text sources in which information can be hidden. Among the various ways of hiding information in text, there is a specific set of techniques that uses the linguistic structure of a text [9] as the space in which information is hidden. Text steganography uses text as the medium in which information is hidden. It can involve anything from changing the formatting of an existing text, to changing words within a text, to generating random character sequences or using context-free grammars to generate readable texts [10]. What all of these methods have in common is that the hidden messages are embedded in character-based text.

II. STEGANOGRAPHY USING CHANGE TRACKING

In the proposed steganographic method, a secret message is embedded inside a cover document D using change tracking [1] to obtain a stegodocument S. The process is divided into two stages, the degeneration stage and the revision stage, as shown in Fig. 1.

Fig. 1 Steganography using change tracking

The data embedding is done in such a way that the stegodocument appears to be the product of a collaborative writing effort. Text segments in the document are degenerated so that the document appears to be the work of an author with inferior writing skills, and the secret message is embedded in the choice of degenerations [1]. The degenerations are then revised back using the change tracking feature of MS Word, so that it appears as if an expert author is correcting the mistakes. The change tracking information contained in the stegodocument allows the recovery of the original cover document, the degenerated document and, hence, the secret message. The extra change tracking information is added during message embedding so that the document resembles a normal collaboration scenario.

As the input data consists of characters, it is first converted to binary data. Assume that the input message is converted to an m-bit stream M = b1 b2… bm, where each bi is a bit. It is converted to the following binary message: M’ = H b1 b2… bm P = b1’ b2’…, where the header H denotes the length m of the message and P denotes padding bits. This message M’ is embedded in the cover document D. The message bits can be embedded using different techniques. This paper concentrates on Huffman coding and block encoding. The position in the cover document where bits are embedded is called the embedding place. It is computed using the secret key K and the bit position in the message.

III. HUFFMAN CODING
This technique uses the probability of occurrence of each word to compute its Huffman code [11]. Words with low probabilities are assigned longer Huffman codes and words with high probabilities are assigned shorter ones.

A. Message embedding

Message embedding is performed in two stages: degeneration and revision. In the degeneration stage, the cover document D is first segmented, and some of its text segments are then degenerated. For a text segment d, the degeneration set Rd is defined to be the ordered set of possible degenerated text segments. Here the set of synonyms of a word is used as the degeneration database. Rd(j) denotes the jth element in Rd, and Pr{Rd(j)} denotes the probability of occurrence of Rd(j). The probabilities of occurrence are used during message embedding so that the system prefers substitutions that occur commonly and thus produces a more natural stegodocument.

Algorithm 1: Message Embedding using Huffman coding
Input: a cover document D partitioned into text segments d1, d2,…, dn; a character message to be embedded; and a secret key K.
Output: a stegodocument S.
Steps:
1) Convert the character message to binary as M’ = b1’ b2’ b3’…
2) Initialize the set of embedding places P to be empty. Also define an index p to denote the position of the message bit bp’ which is currently being encoded. Initially p is equal to 1.
3) Compute an embedding place i randomly using K such that 1≤i≤n and i is not in the set P. Add i to P.
4) Construct a Huffman tree T for the text segment di with degeneration set Rd of size c, using Pr{Rd(j)} as the initial weight of each node.
5) Degenerate the text segment di to di’ = Rd(j), where the degeneration choice j is determined by traversing the Huffman tree T from the root to a leaf node as dictated by the message bits starting at bp’. Advance p past the bits consumed.
6) Repeat Steps 3 to 5 until the entire message has been embedded.
7) Revise each previously degenerated text segment di’ back to di, with the revisions being tracked, to yield the stegotext segments si for all i in P.

B. Message Extraction

The change tracking information included in the stegodocument S allows simple recovery of the original document D and the degenerated document D’, from which the embedded message can be extracted.

Algorithm 2: Message Extraction
Input: a stegodocument S = {s1, s2,…, sn} and a secret key K.
Output: the extracted message in characters.
Steps:
1) Recover the original document D = {d1, d2,…, dn} and the degenerated document D’ = {d1’, d2’,…, dn’} from S using the change tracking information and the related operations provided by MS Word.
2) Initialize the set of embedding places P to be empty.
3) Define an index p which denotes the position of the message bit bp’ which is currently being decoded. Initially set p = 1.
4) Select the same embedding place i as in message embedding, using the key K and the set of embedding places P.
5) Construct a Huffman tree T for the text segment di with degeneration set Rd of size c, as described in Algorithm 1.
6) Determine the choice of degeneration j such that Rd(j) = di’.
7) Decode the message bits encoded in j by traversing the Huffman tree T from the root to the leaf node nj. The path traversed gives the bits embedded at that position. Convert the bits to the corresponding characters.
8) Repeat Steps 4 to 7 until the entire message has been extracted.

C. Illustration with example

The working of both algorithms is illustrated with an example in this section.

[a] Message embedding

Here the set of synonyms is used as the degeneration set. A synonym database is available from different resources such as the WordNet database [7]. In this paper the synonym set is constructed from the thesaurus available in MS Word itself. For example, let the text segment to be degenerated be d = “scheme”. Suppose the degeneration set of “scheme” contains the eight entries scheme, system, plan, method, format, idea, proposal and design. Their probabilities of occurrence can be calculated from any related database [8]. The synonyms of “scheme” and their respective probabilities are used to find the Huffman codes shown in Fig. 2.

j   Rd(j)      Huffman Code
1   Scheme     011
2   System     00
3   Plan       01001
4   Method     10
5   Format     110
6   Idea       0101
7   Proposal   01000
8   Design     111

Fig. 2 Huffman codes for synonyms of “scheme”

Using the occurrence probabilities, construct a Huffman tree T. Label each left branch 0 and each right branch 1, and construct the Huffman codes for all the leaf nodes, as shown in Fig. 2. Let the bits to be embedded at this position be 110…
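The choice of a degeneration from the bits to be embedded can be sketched in Python. This is an illustration only, not the authors' C# implementation. It copies the code table of Fig. 2 directly instead of rebuilding the tree from occurrence probabilities, which are not listed in the paper, and it relies on the fact that, for a prefix code, walking the tree from the root along the message bits is equivalent to finding the one codeword that is a prefix of the remaining bits. The names `degenerate` and `extract_bits` are hypothetical.

```python
# Huffman codes for the synonyms of "scheme", copied from Fig. 2.
HUFFMAN_CODES = {
    "scheme": "011", "system": "00", "plan": "01001", "method": "10",
    "format": "110", "idea": "0101", "proposal": "01000", "design": "111",
}


def degenerate(bits, codes):
    """Pick the synonym whose codeword is a prefix of `bits`.

    Returns (chosen_word, remaining_bits). Because the codewords form a
    prefix code, at most one of them can match.
    """
    for word, code in codes.items():
        if bits.startswith(code):
            return word, bits[len(code):]
    # Assumption: the caller pads the message so that a leaf is always reached.
    raise ValueError("not enough bits left to reach a leaf")


def extract_bits(degenerated_word, codes):
    """Return the bits that were embedded by choosing `degenerated_word`."""
    return codes[degenerated_word]


word, rest = degenerate("1100101", HUFFMAN_CODES)
print(word, rest)                          # format 0101
print(extract_bits(word, HUFFMAN_CODES))   # 110
```

With the bit stream 1100101, the first call selects “format” and leaves 0101, which a second call at the next embedding place would map to “idea”.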
Traversing the tree from the root along the branches 1, 1, 0 reaches the leaf node of “format”. Hence the text segment d = “scheme” is degenerated to the text segment d’ = “format”. The track changes feature of MS Word is then turned on and d’ = “format” is revised back to d = “scheme”. It will be shown by the stegotext as S = “formatscheme”, that is, “format” marked as a tracked deletion followed by “scheme” as a tracked insertion.

[b] Message extraction

Given a stegotext segment S = “formatscheme”, the original and the degenerated text segments can be recovered as di = “scheme” and di’ = “format” respectively. The Huffman tree T is constructed again from the given probabilities to obtain the same Huffman codes. Since the degenerated text segment is “format”, the tree is traversed from the root to the leaf node which denotes “format”. The path traversed gives the bits “110”, which means that the bits “110” were embedded at that position.

IV. BLOCK ENCODING
Block encoding is implemented by restricting the size of each synonym set to an integral power of 2. If the size of the set is 2^k, then k bits are used to encode each entry in the synonym database uniquely [12].

Algorithms for message embedding and message extraction

The algorithms are very similar to those used in Huffman coding. The only difference is that, instead of constructing Huffman codes, the synonyms in each set are uniquely represented by fixed-length bit sequences, as shown in the following example.

Illustration with example

Again consider the set of synonyms for “scheme”. As the size of the set is eight (that is, 2^3), three bits can be used to uniquely represent each entry in the set, as shown in Fig. 3.

j   Rd(j)      Block Code
1   Scheme     000
2   System     001
3   Plan       010
4   Method     011
5   Format     100
6   Idea       101
7   Proposal   110
8   Design     111

Fig. 3 Block codes for synonyms of “scheme”

a. Message embedding

Let the bits to be embedded next be 110… The set is searched for the block code 110, which denotes “proposal”. Hence the text segment d = “scheme” is degenerated to the text segment d’ = “proposal”. The track changes feature of MS Word is then turned on and d’ = “proposal” is revised back to d = “scheme”. It will be shown by the stegotext as S = “proposalscheme”.
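The block-code lookup used in this example can be sketched in Python. The sketch assumes, as Fig. 3 suggests, that the entry at position j in the ordered synonym set receives the k-bit binary representation of j - 1. The names `block_codes`, `embed_block` and `extract_block` are hypothetical, and the code is an illustration rather than the authors' implementation. It also inverts the mapping, which is the lookup needed when the bits are read back.

```python
# Ordered synonym set of "scheme", in the order listed in Fig. 3.
SYNONYMS = ["scheme", "system", "plan", "method",
            "format", "idea", "proposal", "design"]


def block_codes(synonyms):
    """Map each synonym to a fixed-width code; the set size must be 2**k, k >= 1."""
    k = len(synonyms).bit_length() - 1
    if k < 1 or len(synonyms) != 1 << k:
        raise ValueError("synonym set size must be a power of two, at least 2")
    # Position j (0-based here) becomes the k-bit binary string of j.
    return {word: format(j, f"0{k}b") for j, word in enumerate(synonyms)}


def embed_block(bits, synonyms):
    """Consume k bits and return (degenerated_word, remaining_bits)."""
    codes = block_codes(synonyms)
    k = len(next(iter(codes.values())))
    if len(bits) < k:
        raise ValueError("not enough bits left; pad the message")
    by_code = {code: word for word, code in codes.items()}
    return by_code[bits[:k]], bits[k:]


def extract_block(degenerated_word, synonyms):
    """Return the k bits that were embedded by choosing `degenerated_word`."""
    return block_codes(synonyms)[degenerated_word]


print(embed_block("110", SYNONYMS))          # ('proposal', '')
print(extract_block("proposal", SYNONYMS))   # 110
```

Both sides must list the synonyms in the same order, since the code of a word depends only on its position in the set.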
b. Message extraction

Given a stegotext segment S = “proposalscheme”, the original and the degenerated text segments can be recovered as di = “scheme” and di’ = “proposal” respectively. The same block codes are constructed again for the same synonym set of “scheme”. The key point is that each entry in the synonym set of “scheme” must be represented by the same block code at the time of message embedding and of message extraction. Since the degenerated text segment is “proposal”, it is searched for in the synonym set of “scheme” and its block code is read off. This gives the bits “110”, which means that the bits “110” were embedded at that position.

V. SECURITY CONSIDERATIONS AND LIMITATIONS
For every steganographic system, security is very important. The following security aspects are considered for the given system:
1. The synonym database used for degeneration and the secret key are agreed upon by the sender and the receiver beforehand.
2. The system is robust against statistical steganalysis [6] for the following reasons:
   a. In Huffman coding, degenerations are chosen according to their occurrence probabilities. Even if an adversary succeeds in obtaining the database, the occurrence frequencies cannot be found, because they may be computed from personal databases owned only by the sender and the receiver. In block encoding, the sequence of words in the database is needed to obtain the block codes.
   b. To ensure that the statistical properties of the degenerations in a stegodocument are close to those of a normal document, the message can be compressed or encrypted before embedding.
   c. To increase robustness in Huffman coding, the occurrence probability of a degeneration can be changed after it has been used once. The probability of the same word being selected again then decreases, and the desired statistical coherence with a normal document can be achieved.
3. The degeneration database can be modified dynamically after embedding secret data.
4. After embedding information in a stegodocument using the proposed method, the sender may manipulate the unused portions of the stegodocument.

The given system also has some limitations:
1. The degeneration set and the key must be known only to the sender and the receiver.
2. The change tracking information used for message embedding must not be disturbed by anybody, knowingly or unknowingly.
3. The degeneration database should be kept realistic.

VI. IMPLEMENTATION RESULTS

The system is implemented using Microsoft Word 2003 and C#. The automation techniques of Microsoft Word are also used for the implementation. The degeneration database is constructed using the thesaurus available in Microsoft Word 2003. The system is evaluated by comparing the results obtained using three coding techniques, namely Huffman, block and arithmetic coding, as shown in Fig. 4. The results show that the system performs better if block encoding is used for message embedding instead of Huffman coding. Further, if the message is compressed before embedding, the system performance improves and more data can be embedded. Here arithmetic encoding is used as the compression technique.
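The advantage of block encoding can be checked on the synonym set of “scheme” with a small calculation. This is a sketch under an assumption that is not stated in the paper: the message bits are uniformly random, which is approximately the case after compression or encryption. Under that assumption a Huffman leaf at depth L is reached with probability 2^-L and embeds L bits, so the expected number of bits per substitution is the sum of L / 2^L over all codewords. The function name `expected_bits` is hypothetical, and the figures concern only this one synonym set, not the measurements of Fig. 4.

```python
from fractions import Fraction

# Codewords from Fig. 2 (Huffman) and Fig. 3 (block) for the synonyms of "scheme".
HUFFMAN_CODEWORDS = ["011", "00", "01001", "10", "110", "0101", "01000", "111"]
BLOCK_CODEWORDS = [format(j, "03b") for j in range(8)]


def expected_bits(codewords):
    """Expected bits embedded per substitution for uniformly random message bits.

    A codeword of length L is selected with probability 2**-L and carries L bits.
    """
    return sum(Fraction(len(c), 2 ** len(c)) for c in codewords)


huffman_capacity = expected_bits(HUFFMAN_CODEWORDS)   # 43/16
block_capacity = expected_bits(BLOCK_CODEWORDS)       # 3
print(float(huffman_capacity), float(block_capacity))  # 2.6875 3.0
```

On this set the Huffman code embeds about 2.69 bits per substitution against 3 bits for the block code, which is consistent with the comparison reported above.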
Fig. 4 Comparison between encoding techniques

VII. CONCLUSION

Though the steganographic method presented in this paper focuses on Microsoft Word, the idea can also be applied to other communication media. The robustness of the system can be increased by increasing the randomness in the input and in the degeneration database. As the work appears to be the result of collaborative writing, it is less likely to come under close scrutiny. The results obtained from the implementation show that the embedding capacity of Huffman coding is lower than that of block encoding. Better results are obtained when the message is compressed using arithmetic encoding before embedding.

REFERENCES

[1] T.-Y. Liu and W.-H. Tsai, “A new steganographic method for data hiding in Microsoft Word documents by a change tracking technique.”
[3] F. A. P. Petitcolas, R. J. Anderson, and M. G. Kuhn, “Information hiding—A survey,” Proc. IEEE, vol. 87, no. 7, pp. 1062–1078, Jul. 1999.
[5] R. Stutsman, C. Grothoff, M. Attallah, and K. Grothoff, “Lost in just the translation,” in Proc. ACM Symp. Applied Computing, 2006, pp. 338–345.
[6] F. Johnson and S. Jajodia, “Steganalysis: The investigation of hidden information,” in Proc. IEEE Information Technology Conf., Syracuse, NY, Sep. 1998, pp. 113–116.
[7] WordNet v2.1, a lexical database for the English language. Princeton Univ., Princeton, NJ, 2005. http://wordnet.princeton.edu/
[8] Google, Google SOAP Search API (beta), [Online]. Available: http://www.seochat.com/c/a/Google-OptimizationHelp/Using-the-Google-SOAP-Search-AP
[9] K. Bennett, “Linguistic steganography: Survey, analysis, and robustness concerns for hiding information in text,” Purdue Univ., West Lafayette, IN, CERIAS Tech. Rep. 2004–13, May 2004.
[10] J. T. Brassil and N. F. Maxemchuk, “Copyright protection for the electronic distribution of text documents,” Proc. IEEE, vol. 87, no. 7, pp. 1181–1196, Jul. 1999.
[11] P. Wayner, “Mimic functions,” Crypt., vol. XVI, no. 3, pp. 193–214, 1992.
[12] M. Chapman, I. D. George, and R. Marc, “A practical and effective approach to large-scale automated linguistic steganography,” in Proc. Information Security Conf., Malaga, Spain, Oct. 2001, pp. 156–165.