Digital Text Watermarking: Secure Content Delivery and Data Hiding in Digital Documents
Muhammad Abdul Qadir & Ishtiaq Ahmad
Mohammad Ali Jinnah University
ABSTRACT Secure communication of data over public channels is one of the most important challenges. Efforts both to secure content and to break that security are very active. To reduce the chances of attack, security needs to be made invisible. There is a pressing need to preserve the originality, ownership information, and integrity of text documents in a way that cannot be identified by everyone. Watermarking documents is a step toward achieving these objectives. However, watermarking a plain text (ASCII) document so that the original text does not change, and so that the watermark is very difficult to break, is a great challenge. We have developed a novel encoding scheme that can be used to insert information in plain text without changing the text. A system has been developed based upon this encoding scheme. This paper describes the system and demonstrates its workings.
Author's Current Address: M.A. Qadir and I. Ahmad, Faculty of Engineering & Sciences, Mohammad Ali Jinnah University, Islamabad, Pakistan. Based on a presentation at Carnahan 2005. 0885/8985/06/ $17.00 © 2006 IEEE
INTRODUCTION A huge amount of confidential information is exchanged over the Internet, a publicly open medium, because it is the most cost-effective and widely available channel. This technological progress has also made digital data highly vulnerable to interception and subsequent unauthorized access or use, and has caused significant economic losses for content producers and rights holders. To protect data on public channels, security measures need to be incorporated into data communication systems that operate over the Internet. Consequently, there has been a flurry of research and development effort in information security, covering both security policies and security mechanisms. At the same time, security attacks in the form of eavesdropping, masquerading, tampering, and many other forms are becoming more sophisticated. One important question that needs to be answered for every security measure is: "How secure is the security policy and the underlying mechanisms to implement that policy?" Best practices for securing data are improving day by day, but the battle to secure critical data is far from won.
It is human nature to try to break anything that is kept secret, even without a specific reason. An attacker who is unable to remove the protection may try to destroy the content or block it from reaching its destination (if I am unable to see it, then no one should be able to use it). This leads to the conclusion that securing content in a way that makes the security visible to everyone is not a good approach. To reduce the possibility of attack, the security itself needs to be kept secret, i.e., invisible security. Valuable data can be inserted into multimedia documents in a way that cannot be spotted, i.e., imperceptible insertion of information into multimedia data [1, 2]. Data hiding is one of the most promising technologies for achieving the overall goal of secure delivery of information from its source to the authorized end-users. Insertion of information, such as a number or text, into multimedia contents through slight modification of the data that cannot be spotted is being used in several early applications [2, 3, 4, 5] (e.g., audio and video watermarking, broadcast monitoring, secret communication). It is relatively easy to insert information in images compared to plain text, as images contain a lot of redundant area [14] where valuable information can be inserted imperceptibly. There could be many other uses, such as copyright protection [8], integrity preservation, labeling, monitoring, tamper-proofing, and conditional access, especially if we can hide information in plain text documents.
Because of the intended application, watermarks must have at least three properties. A watermarking scheme must be robust: the watermark cannot be removed or destroyed without destroying the value of the watermarked document.
The original and watermarked documents should be perceptually identical. Unauthorized parties should not be able to determine the watermark or tamper with it. In the coming sections, after presenting a model of document preservation using watermarks, a benchmark for an ideal watermarking scheme is given. Strengths and weaknesses of existing text watermarking schemes are presented. Before explaining our solution, we briefly highlight the strengths of the spread spectrum technique.
"Watermarking text is very, very difficult," said Atallah. "It's much more difficult than watermarking images."
IEEE A&E SYSTEMS MAGAZINE, NOVEMBER 2006
Fig. 1. Spread spectrum-based watermarking (b: bit to be embedded)
WATERMARKING AND DOCUMENT PRESERVATION MODEL In digital watermarking, relevant information is embedded imperceptibly into a digital document. The embedded information is called a watermark. The watermark may contain information such as copyrights, ownership, timestamps, distribution points, document information, or any other information that needs to be preserved in the document. Even if the document is copied or redistributed, it should be possible to recover the hidden information from it.
Consider a possible model of a watermarking scheme and the way it protects a document. A content owner approaches a neutral, trustworthy registration authority. Depending on the nature of the document's content, or on any other selection criteria, the authority allots a unique registration number to the document and fixes the maximum size of the watermark that can be inserted imperceptibly. It may also archive the contents and the unique registration number for future reference. The content owner generates a suitable watermark, which can then be embedded within the data by a third party. The third party may preserve the watermark along with the document information so that it can act as a reference point in case of any dispute.
To further secure the embedded digital watermark, one or several secret, cryptographically secure keys have to be used. To ensure robustness against data manipulation and processing, it is helpful to keep the digital watermark very small and to distribute it redundantly in the host data. Thus, when extracting a digital watermark, a small sample of watermarked data is enough to establish its originality. If the recipient of the document wants to confirm its originality, the document can be sent to the trustworthy third party for verification. If multiple parties claim ownership of the document, the trustworthy third party can resolve the dispute by verifying the signature and the watermark hidden in the document. The confidence or originality measure depends on how closely the watermark and content-integrity information recovered from the document match the original watermark and integrity information stored by the third party.
The watermarking system can be used to send valuable information from the source to its destination hidden in a text document; in this case, the text document acts as a harmless carrier and no one would be able to spot that it carries such valuable information.
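The redundancy argument in the model above can be made concrete with a small sketch. It is only an illustration under stated assumptions, not the scheme used by the system: the function names `distribute`, `recover`, and `closeness` are hypothetical, the carrier is modelled as an abstract list of bit slots, and the detector is assumed to know at which slot the surviving fragment started.

```python
from collections import Counter

def distribute(watermark_bits, n_slots):
    """Repeat the watermark cyclically: slot i carries bit i mod len(watermark)."""
    k = len(watermark_bits)
    return [watermark_bits[i % k] for i in range(n_slots)]

def recover(fragment, start, k):
    """Majority-vote each watermark position from a fragment that began at slot `start`.

    Assumption: the slot alignment (`start`) is known to the detector.
    """
    votes = [Counter() for _ in range(k)]
    for offset, bit in enumerate(fragment):
        votes[(start + offset) % k][bit] += 1
    # Positions that were never observed are reported as None.
    return [v.most_common(1)[0][0] if v else None for v in votes]

def closeness(original, recovered):
    """Fraction of watermark positions that agree with the stored original."""
    matches = sum(1 for a, b in zip(original, recovered) if a == b)
    return matches / len(original)

watermark = [1, 0, 1, 1, 0, 0, 1, 0]
slots = distribute(watermark, 80)   # ten copies of an 8-bit watermark
fragment = slots[13:53]             # only half of the carrier survives
fragment[3] ^= 1                    # one surviving slot is corrupted
recovered = recover(fragment, start=13, k=len(watermark))
print(recovered == watermark, closeness(watermark, recovered))
```

A registration authority following this model could compare `closeness` against a threshold of its own choosing when deciding a dispute; the threshold itself is a policy decision that the model leaves open.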
BENCHMARK AND EXISTING TEXT WATERMARKING SCHEMES Let us define a benchmark for digital text watermarking while keeping in view all the necessary security concerns. A watermark must be language-independent, invisible, and robust, and must preserve data integrity and veracity. The algorithm must be simple and immune to attacks. Digital watermarking for text documents is discussed in [6, 7, 9-12]. Commonly used techniques for hiding watermark information are line-shift coding, word-shift coding, and feature coding [13]. These formatted-text methods can be defeated easily by copying and pasting the text using a new character font, and they cannot watermark plain text documents. Text watermarking based on semantics [14] is language-dependent, and in many legal and technical documents specific words cannot be replaced with synonyms. Its algorithms are also more complex in general.
SPREAD SPECTRUM TECHNIQUE OF WATERMARKING Many digital watermarking schemes deploy spread spectrum communication [2]. In such schemes, a watermark is embedded in a host signal in a pseudo-random fashion using pseudo-random noise (PRN). The watermark bits are mixed with a PRN-generated signal, and the resulting signal is inserted into the host signal. The PRN signal functions as a secret key. Figure 1 shows such a mechanism. The amplitude of the watermark signal is much less than that of the host signal, roughly 1% of the host's amplitude [2]. Because the valuable information in the watermark is spread in PRN fashion, it is very difficult to spot and identify the embedded signal. This specific PRN signal can later be detected by a correlation receiver or matched filter. An appropriate amplitude and number of added samples can make the probability of false-positive or false-negative detection low [2, 3]. It is also possible to subtract the PRN signal from the host data. In this case, the correlation receiver will calculate a high negative correlation in the detection process [3, 4]. Thus, by using either addition or subtraction, it is possible to convey one bit of information; by sequentially embedding several such bits, it is possible to convey arbitrary information.
OUR SOLUTION We propose a novel approach to text watermarking, based on an intelligent encoding scheme, that alters neither the syntax of the document nor its layout. It thus provides a layout- and format-independent technique in which information within the text is manipulated to hide certain information. We encode the information in the existing characters of the text in a way that does not change the document. Moreover, the hidden information is preserved by the document.
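As a rough illustration of the spread spectrum technique described in the preceding section (it is not the text-encoding scheme of this paper, whose details are not given here), the sketch below adds a key-seeded PRN signal to a host signal for a 1 bit, subtracts it for a 0 bit, and recovers each bit from the sign of the correlation. All names are hypothetical, the key is assumed to be an integer seed, the host is modelled as Gaussian samples, and the embedding strength `alpha` is deliberately larger than the small fraction of the host amplitude mentioned in the text so that a short example detects reliably.

```python
import random

def prn_sequence(key, n):
    """Key-seeded pseudo-random chips of +1/-1; the integer key is the shared secret."""
    rng = random.Random(key)
    return [rng.choice((-1.0, 1.0)) for _ in range(n)]

def embed_bit(host, bit, key, alpha):
    """Add the scaled PRN signal for bit 1, subtract it for bit 0."""
    sign = 1.0 if bit else -1.0
    chips = prn_sequence(key, len(host))
    return [x + sign * alpha * c for x, c in zip(host, chips)]

def detect_bit(received, key):
    """Correlation receiver: positive correlation means 1, negative means 0."""
    chips = prn_sequence(key, len(received))
    corr = sum(y * c for y, c in zip(received, chips)) / len(received)
    return (1 if corr > 0 else 0), corr

def embed_message(host, bits, key, alpha):
    """Give each bit its own block of host samples and its own derived key."""
    block = len(host) // len(bits)
    out = []
    for i, bit in enumerate(bits):
        out.extend(embed_bit(host[i * block:(i + 1) * block], bit, key + i, alpha))
    out.extend(host[len(bits) * block:])  # leftover samples stay unmarked
    return out

def detect_message(received, n_bits, key):
    """Run the correlation receiver block by block."""
    block = len(received) // n_bits
    return [detect_bit(received[i * block:(i + 1) * block], key + i)[0]
            for i in range(n_bits)]

noise = random.Random(7)
host = [noise.gauss(0.0, 1.0) for _ in range(80_000)]  # stand-in host signal
message = [1, 0, 1, 1, 0, 0, 1, 0]
marked = embed_message(host, message, key=2006, alpha=0.1)
print(detect_message(marked, len(message), key=2006) == message)
```

The host itself acts as noise for the correlator, so a weaker watermark needs proportionally more samples per bit: halving `alpha` requires about four times as many samples for the same detection margin.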
Fig. 2. Spread spectrum-based watermarking (b: bit to be embedded)
The overall architecture of the system is shown in Figure 2. The system has two parts, insertion of the watermark and detection of the watermark, which are explained in the following sections.
Watermark Insertion To preserve the integrity of a document, a document signature is generated by the document analyzer and signature generator (DAS). The DAS combines several algorithms that work on information about the author, the actual text, features of the text, etc. The output signature is unique in the context of a given text. The DAS also computes how large a watermark can be inserted in the text. These two parameters are determined by a trade-off between the robustness of document integrity and the size of the information to be hidden. To provide extra security, the watermarks (the signature and the user-specific watermark) can be encrypted with different keys according to the confidentiality level of the document. Spreading the processed watermark and signature over pseudo-random noise can provide another level of security. The spread data is then inserted into the original text stream by the invisible watermark insertion algorithm, using the intelligent encoding scheme. The beauty of this algorithm is that the output text stream (the watermarked text, s) is identical to the input text stream. Because the valuable information is completely hidden in the text, no one would have reason to disrupt or decrypt it, which is a major advantage of invisible security. Even if someone tries to break it, the multiple levels of security make doing so nearly impossible.
Depending upon the selected level of redundancy of watermark insertion, the watermark remains present even in a small piece of the text. This is the most important security factor for deterring illegal copying (cut and paste): the watermark travels with the copied text, and the originator information remains in it. Thus the scheme supports copyright protection, labeling, monitoring, and tamper detection, and facilitates conditional access. The scheme can also be used to preserve the integrity of legal documents, financial and other reports, poetry, theses, papers, etc.
Fig. 3. Text-based watermark embedding
Fig. 4. Transmission channel (source, channel plus attack noise)
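The payload handling around the insertion step can be sketched as follows. This is a minimal illustration under loud assumptions and not the DAS or the encoding scheme of the system: the signature is a truncated SHA-256 hash standing in for the DAS output, `keystream_xor` is a toy placeholder where a real cipher would be used, the key names `key1` and `key2` and all function names are hypothetical, and the invisible insertion into the text itself is left out because its details are not given here. The round trip also shows the inverse check a receiver could perform.

```python
import hashlib
import hmac

SIG_LEN = 8  # assumed signature size in bytes (hypothetical choice)

def document_signature(author: str, text: str) -> bytes:
    """Stand-in for the DAS: a truncated hash over author and text."""
    data = author.encode("utf-8") + b"\x00" + text.encode("utf-8")
    return hashlib.sha256(data).digest()[:SIG_LEN]

def keystream_xor(data: bytes, key: bytes) -> bytes:
    """Placeholder cipher (XOR with a hash-derived keystream). Not for real use."""
    out = bytearray()
    for i in range(0, len(data), 32):
        block = hashlib.sha256(key + (i // 32).to_bytes(4, "big")).digest()
        out.extend(b ^ k for b, k in zip(data[i:i + 32], block))
    return bytes(out)

def build_payload(author, text, user_mark: bytes, key1: bytes, key2: bytes) -> bytes:
    """Encrypt signature and user watermark with different keys, then join them."""
    sig = keystream_xor(document_signature(author, text), key1)
    mark = keystream_xor(user_mark, key2)
    return sig + mark

def split_and_verify(payload, author, received_text, key1, key2):
    """Split the payload, decrypt both parts, and compare signatures."""
    sig_part, mark_part = payload[:SIG_LEN], payload[SIG_LEN:]
    embedded_sig = keystream_xor(sig_part, key1)
    user_mark = keystream_xor(mark_part, key2)
    fresh_sig = document_signature(author, received_text)  # regenerated on receipt
    accepted = hmac.compare_digest(embedded_sig, fresh_sig)
    return accepted, user_mark

text = "Payment is due within thirty days."
payload = build_payload("M.A. Qadir", text, b"copy #17", b"key1", b"key2")
print(split_and_verify(payload, "M.A. Qadir", text, b"key1", b"key2"))
tampered = text.replace("thirty", "ninety")
print(split_and_verify(payload, "M.A. Qadir", tampered, b"key1", b"key2")[0])
```

Keeping the signature and the user watermark under separate keys means that a party holding only the user key can read the user watermark without being able to forge a signature that matches an altered text.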
Watermark Detection The detection mechanism is the semantic inverse of the insertion logic. The incoming text stream (y) is received from the channel and fed to the detector. The received signal y may not be identical to the original x: channel noise and deliberate attacks may transform x into y. The other input to the detector is the regenerated PRN signal. This subsystem can recover the desired signal from noisy data, even when the data has been mixed with noise in the communication channel. The system is also capable of recovering the watermark from partial contents of the original document, even when they are inserted in another document. The outputs of the Detect module are the recovered text (x̂) and the recovered embedded watermark (b̂). The DAS subsystem runs over the received text (x̂) and reproduces the document signature for comparison. This is the most significant part of the watermark detection process with respect to verifying document originality and the authenticity of its contents, and to copyright protection, labeling, monitoring, tamper detection, etc.
The extracted watermark goes through a splitting process that separates the signature part and the user watermark, BR and LR respectively. The user-inserted watermark, LR, is decrypted with its private key (key2) and made available for further processing by the user, whereas BR, the signature embedded by the insertion logic and carried along with the document, is decrypted with its own private key so that it is ready to be verified against the generated signature. The signature newly generated by the DAS subsystem is compared with the watermarked signature. Because the signature is unique for a given text stream, the system can decide whether to accept or reject the document's integrity. The received text also remains available in its original form for further use or redistribution, as we have not considered washing (cleaning) the watermark out of the received text stream.
Fig. 5. Text-based watermark detection
APPLICATIONS AND CONCLUSION We have successfully developed and implemented the watermark-encoding scheme and the system for plain text documents as described. This demonstrates that embedding information in plain text documents in a secure way is possible. The same scheme can be extended to Unicode and other formatted documents, too. Many applications require content and ownership protection, and the exchange of highly secure contents (steganography) can now be implemented using this scheme. Large corporations and government organizations need such applications.
REFERENCES
[1] R.J. Anderson and F.A.P. Petitcolas, (1999), Information Hiding & Digital Watermarking: An Annotated Bibliography, [Online], Available: http://www.cl.cam.ac.uk/~fapp2/steganography/bibliography.
[2] Henrique S. Malvar and Dinei A.F. Florencio, Improved Spread Spectrum: A New Modulation Technique for Robust Watermarking, in IEEE Trans. Signal Processing, Vol. 51, No. 4, pp. 898-905, April 2003.
[3] A.Z. Tirkel, C.F. Osborne and R.G. van Schyndel, Image watermarking - A spread spectrum application, in Proc. IEEE 4th International Symposium Spread Spectrum Technology Applications, Mainz, Germany, 1996, pp. 785-789.
[4] F. Hartung, J.K. Su and B. Girod, Spread spectrum watermarking: Malicious attacks and counterattacks, Proc. SPIE, Vol. 3657, pp. 147-155, January 1999.
[5] D. Kirovski and H. Malvar, Robust spread-spectrum audio watermarking, in Proc. International Conference on Acoustics, Speech, and Signal Processing, Salt Lake City, UT, May 2001.
[6] J. Brassil, S. Low, N. Maxemchuk and L. O'Gorman, Electronic Marking and Identification Techniques to Discourage Document Copying, IEEE Journal on Selected Areas in Communications, Vol. 13, No. 8, pp. 1495-1504, October 1995.
[7] J. Brassil and L. O'Gorman, Watermarking Document Images with Bounding Box Expansion, in Anderson, pp. 227-235.
[8] Special Issue on Copyright and Privacy Protection, IEEE Journal on Selected Areas in Communications, Vol. 16, No. 4, May 1998.
[9] S. Katzenbeisser and F.A.P. Petitcolas, Eds., Information Hiding Techniques for Steganography and Digital Watermarking, Boston, Artech House, 2000.
[10] S.H. Low and N.F. Maxemchuk, Performance Comparison of Two Text Marking Methods, in special issue [8], pp. 561-572.
[11] S.H. Low, N.F. Maxemchuk, J.T. Brassil and L. O'Gorman, Document Marking and Identification Using Both Line and Word Shifting, Proc. Infocom '95, Boston, MA, April 1995, pp. 853-860.
[12] S.H. Low, N.F. Maxemchuk and A.M. Lapone, Document Identification for Copyright Protection Using Centroid Detection, IEEE Transactions on Communications, Vol. 46, No. 3, pp. 372-383, March 1998.
[13] Ding Huang and Hong Yan, Interword Distance Changes Represented by Sine Waves for Watermarking Text Images, in IEEE Transactions on Circuits and Systems for Video Technology, Vol. 11, No. 12, pp. 1237-1245, December 2001.
[14] Mikhail Atallah and Victor Raskin, Purdue team develops watermark to protect electronic documents, 2001, [Online], http://news.uns.purdue.edu/html4ever/010427.Atallah.watermark.html.