Formatted Text Document Data Hiding Robust to Printing, Copying and Scanning
Dekun Zou and Yun Q. Shi
Department of Electrical and Computer Engineering, New Jersey Institute of Technology, Newark, New Jersey, USA
{dz6, shi}@njit.edu
Abstract— In this paper, a novel data hiding algorithm for formatted text documents, called the Inter-word Space Modulation (ISM) scheme, is proposed, in which the spaces between neighboring words are modulated to hide data. In contrast to the prior art, this method does not require the original document for hidden data extraction. The hidden data are robust to printing, copying and scanning. Our experiments show that after printing, ten rounds of repeated copying, and scanning, the hidden data can still be extracted without a single bit error. The scheme is expected to find wide application in secure document processing, including digital notarization.
I. INTRODUCTION
In recent years, digital data hiding has become a hot topic in the signal processing research community. A large number of algorithms have been proposed to hide data in images, video and audio. Several excellent survey articles can be found in the literature, e.g., [1]. However, relatively little work has been devoted to document data hiding, because the continuous-tone property inherent in images and videos does not exist in digital text documents. To be specific, a minor change to the pixel values of an image or of a frame in a video sequence will not cause visual artifacts. In text documents, however, changes to pixel values cause disturbing salt-and-pepper noise, because documents are represented in binary format, i.e., one pixel is represented by only one bit (for example, ‘1’ for white and ‘0’ for black). Therefore, special measures must be taken to hide data in text document images.

In [2], a block-based method to hide bits in document images was proposed. Certain patterns are predefined as “flippable” or “non-flippable”. In flippable patterns, certain pixels can be flipped without causing perceptible artifacts. Data are embedded by flipping those pixels so that the number of black pixels in a block is even or odd, according to whether bit ‘1’ or ‘0’ is to be embedded in the block. A more sophisticated method called boundary modification was proposed in [3]. One hundred pairs of five-pixel-long boundary patterns were defined. Each pair consists of two different patterns, an ‘A’ pattern and a ‘D’ pattern, which can be changed into one another when the pair is flipped. By flipping between the patterns, a bit can be embedded into a five-pixel-long boundary. These two methods can be applied to arbitrary text documents and have a capacity of hundreds to thousands of bits, depending on the size of the document. Nevertheless, for both schemes [2, 3], any distortion of the marked document makes correct retrieval of the hidden information impossible, let alone the distortions caused by copying, printing and scanning.

Three different methods for formatted text document data hiding, namely line shift coding, word shift coding and feature coding, were proposed in [4, 5, 6]. Line shift coding and word shift coding are robust to printing, copying and scanning to some extent. Their major drawback is that the original intact document is needed for hidden data extraction, and it may not be available in many cases. In [6], the authors did mention a baseline detection method for line shift coding which does not require the original document. However, as pointed out by the authors themselves, it is not reliable under printing, copying and scanning. Besides, its embedding capacity is only about one bit per two lines.

In this paper, a novel document data hiding algorithm named Inter-word Space Modulation (ISM) is proposed. The extraction of hidden data does not require the original cover document, and the hidden data are robust to printing, copying and scanning.

The rest of this paper is organized as follows. Section II explains the details of the ISM algorithm. Section III evaluates ISM and compares it with the prior art. Section IV concludes the paper.

0-7803-8834-8/05/$20.00 ©2005 IEEE.

II. THE PROPOSED ISM ALGORITHM
In this section, we discuss our novel data hiding algorithm for formatted text documents.
A. Inter-word Space
For a formatted text document, the inter-word space is defined as the space between adjacent words in a text row.
The spaces at the boundaries of the document are excluded from data hiding. Fig. 1 illustrates an example with nine inter-word spaces in the row. The actual length of each inter-word space, in pixels, is measured and labeled from $b_1$ to $b_9$. The total length of the inter-word spaces $\Phi$ of a row is calculated as
\[ \Phi = \sum_{i=1}^{k} b_i \tag{1} \]
where $b_i$ is the length of the $i$-th inter-word space and $k$ is the number of inter-word spaces in the text row.
Figure 1 Inter-word Space.
B. Modulation of the Inter-word Space
The basic idea is as follows. The inter-word spaces of a text row are divided into two sets, Set A and Set B, each containing the same number of inter-word spaces. In general, the sum of the lengths of all inter-word spaces in Set A (the total inter-word space of the set, for short), measured in pixels, is expected to be very close or even equal to that of Set B. By creating a detectable difference between the total inter-word spaces of the two sets, data hiding can be achieved. Fig. 2 depicts an example of grouping the inter-word spaces. As Fig. 2 shows, Set A and Set B must contain an equal number of elements. If the number of inter-word spaces is odd, one of them is left out; in this example it is the middle one, $b_5$.
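The grouping and modulation idea can be sketched in Python as follows. This is only an illustrative sketch under stated assumptions, not the authors' implementation: the function names `spread` and `embed_bit`, the parameter `eps` (the target difference between the two sets), the choice of the first and last halves of the row as Set A and Set B, and the rule of raising `eps` by one pixel when parity makes an exact integer split impossible are all assumptions made for this example.

```python
from typing import List


def spread(values: List[int], delta: int) -> List[int]:
    """Add `delta` pixels in total to `values`, as evenly as integers allow."""
    n = len(values)
    sign = 1 if delta >= 0 else -1
    q, r = divmod(abs(delta), n)
    # The first r elements absorb one extra pixel so the total change is exact.
    return [v + sign * (q + (1 if i < r else 0)) for i, v in enumerate(values)]


def embed_bit(spaces: List[int], bit: int, eps: int) -> List[int]:
    """Return new inter-word spaces whose Set A / Set B totals differ by +/-eps.

    Set A is the first half of the row and Set B the last half; with an odd
    number of spaces the middle one is left unchanged (assumed grouping).
    """
    k = len(spaces)
    n = k // 2
    if n == 0:
        raise ValueError("need at least two inter-word spaces")
    set_a, middle, set_b = spaces[:n], spaces[n:k - n], spaces[k - n:]
    total = sum(set_a) + sum(set_b)
    # Assumption: bump eps by one pixel if parity prevents an exact integer split.
    if (total + eps) % 2:
        eps += 1
    target_a = (total + eps) // 2 if bit == 1 else (total - eps) // 2
    delta = target_a - sum(set_a)
    new_a = spread(set_a, delta)
    new_b = spread(set_b, -delta)  # keeps the total row width unchanged
    if min(new_a + new_b) < 1:
        raise ValueError("eps is too large for this row")
    return new_a + middle + new_b
```

For a row with nine equal spaces of 10 pixels and `eps = 8`, embedding bit ‘1’ widens each space in Set A by one pixel and narrows each space in Set B by one pixel, so the row width is preserved while the two totals differ by 8 pixels.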
Figure 2 Grouping of Inter-word Spaces.
After grouping, the total lengths of the inter-word spaces of Set A and Set B are denoted by $\Phi_A$ and $\Phi_B$, respectively. Using the text row in Fig. 2 as an example, we have:
\[ \Phi_A = b_1 + b_2 + b_3 + b_4 \tag{2} \]
\[ \Phi_B = b_6 + b_7 + b_8 + b_9 \tag{3} \]
Now,
\[ \Phi = \Phi_A + \Phi_B + b_5 \tag{4} \]
During data embedding, the inter-word spaces are modified. We use $\Phi'_A$ to denote the new total inter-word space of Set A and $\Phi'_B$ that of Set B. After data embedding, the following conditions should be satisfied.
To embed bit ‘1’:
\[ \Phi'_A - \Phi'_B = \varepsilon \tag{5} \]
To embed bit ‘0’:
\[ \Phi'_A - \Phi'_B = -\varepsilon \tag{6} \]
and:
\[ \Phi'_A + \Phi'_B + b_5 = \Phi \tag{7} \]
where $\varepsilon$ is called the embedding strength. It is an integer that gives the difference between the total inter-word spaces of Set A and Set B after data hiding. Intuitively, a larger $\varepsilon$ makes the hidden data more robust but distorts the marked image more with respect to the original image. A smaller $\varepsilon$ gives the marked image higher visual quality but weaker robustness against printing, copying and scanning. The embedding strength $\varepsilon$ is spread evenly among the inter-word spaces of Set A and of Set B, respectively. Fig. 3 illustrates the text row after embedding compared with the row prior to embedding. In this example, bit ‘1’ is embedded and $b_5$ is kept unchanged. The total width of the text row remains the same before and after data embedding. This is an important constraint in data hiding, since it ensures the imperceptibility of the hidden data.
Figure 3 Illustration of ISM Embedding.
C. Hidden Data Extraction
Data extraction consists of the following steps.
1) Rectification of geometric distortion: Printing, scanning and copying often introduce geometric distortions, which destroy the synchronization needed for data extraction. We propose to use our prior work [7] to correct the geometric distortion. That is, a morphological scheme is used to extract useful features, while a multiresolution structure is used to reduce the computation involved in feature matching.
2) Denoising: Printing and copying often introduce small isolated black spots onto the document. They resemble pepper, so we refer to them as pepper-like noise. If pepper-like noise exists between words, it affects the detected number and lengths of the inter-word spaces, resulting in errors in data extraction. We examine the whole document and remove isolated black pixels, on the assumption that they are pepper-like noise. Note that punctuation marks such as periods differ from pepper-like noise although they look alike. In our experiments, we found that a period usually consists of a cluster of black pixels, whereas a single noisy spot has only one black pixel.
3) Horizontal profile: The document image is a matrix of pixels taking values ‘0’ or ‘1’. The horizontal profile is a graph in which the horizontal axis represents the vertical position of a line in the text document image, and the vertical axis represents the number of black pixels in that line. Fig. 4 shows a typical horizontal profile of a text document. A predetermined threshold η is set. If the number of black pixels in a line is larger than η, the line is considered to belong to a text row. A text row consists of a group of consecutive such lines, and two neighboring text rows are separated by multiple blank lines. A blank line is a line that contains no black pixels.
4) Vertical profile: After the text rows have been identified and separated from each other in the previous step, we work on a single text row. A vertical profile is generated for each text row. It is a graph in which the horizontal axis represents the pixel positions along the text row, and the vertical axis represents the number of black pixels at each position. The number of consecutive zero points in the vertical profile is the length of an inter-word space.
5) Hidden data extraction: At this point, we have obtained all the inter-word spaces, and the hidden data extraction is rather straightforward. For a given text row, the inter-word spaces are grouped into two sets, A and B, in the same way as in the data embedding stage. Then the total inter-word spaces of Set A and Set B, $\Phi_A$ and $\Phi_B$, are calculated. If $\Phi_A - \Phi_B > 0$, bit ‘1’ is extracted from this text row; otherwise, bit ‘0’ is extracted.
Figure 4 Horizontal profile. (The x-axis indicates the line position from top to bottom; the y-axis indicates the number of black pixels in the line.)

III. EXPERIMENTAL WORKS AND DISCUSSION
To test the effectiveness of the proposed Inter-word Space Modulation algorithm, we apply it to document images with various font sizes (from 9 to 20). The embedding strength ε is set to 10% of the total length of the inter-word spaces of the whole text row. As a result, ε is not constant but varies from row to row. Adapting ε to each row in this way ensures imperceptibility. Fig. 5(a) shows an original test document with font size 10, while Fig. 5(b) is the marked version. We may be able to see the difference when comparing them side by side. However, if only Fig. 5(b) is available, no artifact can be perceived. The marked document looks normal.

To test the robustness, the marked digital documents are printed using an HP Laserjet 8150DN. The hard copy is then copied using a Canon ImageRUNNER 330S. Multiple-generation copies are obtained by repeated copying. That is, the first copy is used as the source to generate the second copy on the same copy machine, and so on. Finally, the resulting copy is scanned back into digital form using an HP Scanjet 4400C. The experiments show that all of the hidden data can be extracted without any error for all of the test documents, with the number of copying rounds ranging from 0 to 10. After 10 rounds of copying, the quality of the document has degraded to such an extent that many letters cannot be recognized. At this point, there appears to be no need for further copying, since the document itself has lost its real value. Fig. 6 shows a document after 10 rounds of copying. Note that rotation has occurred during the repeated copying.

Table 1 lists the advantages and disadvantages of the proposed method and of the methods proposed in [4, 5, 6].

Table 1. Comparison among Different Methods.
Method | Require the original document? | Robust to printing, copying and scanning? | Embedding capacity
Word shifting | Yes | A little | High (1 bit/word)
Line shifting (using baseline) | No | Yes (but not good) | Low (0.5 bit/line)
Line shifting (using centroid) | Yes | Yes | Medium (1 bit/line)
Inter-word space modulation (ISM) | No | Yes | Medium (1 bit/line)
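The extraction steps of Section II-C (denoising, horizontal profile, vertical profile and the sign decision) can be sketched in Python as follows. This is a minimal sketch under stated assumptions rather than the authors' code: it skips the geometric rectification step, represents the image as nested lists with ‘0’ for black and ‘1’ for white, and uses hypothetical function names. The parameter `min_gap`, which separates inter-word spaces from narrower gaps between characters, is not described in the paper and is an assumption of this example, as is the use of the 8-neighborhood to decide whether a black pixel is isolated.

```python
from typing import List, Tuple

Image = List[List[int]]  # 0 = black, 1 = white


def remove_isolated_black(img: Image) -> Image:
    """Whiten black pixels that have no black pixel among their 8 neighbours."""
    h, w = len(img), len(img[0])
    out = [row[:] for row in img]
    for y in range(h):
        for x in range(w):
            if img[y][x] != 0:
                continue
            has_black_neighbour = any(
                img[ny][nx] == 0
                for ny in range(max(0, y - 1), min(h, y + 2))
                for nx in range(max(0, x - 1), min(w, x + 2))
                if (ny, nx) != (y, x)
            )
            if not has_black_neighbour:
                out[y][x] = 1
    return out


def text_rows(img: Image, eta: int) -> List[Tuple[int, int]]:
    """Horizontal profile: group consecutive lines with more than eta black pixels."""
    rows, start = [], None
    for y, line in enumerate(img):
        is_text = line.count(0) > eta
        if is_text and start is None:
            start = y
        elif not is_text and start is not None:
            rows.append((start, y))  # half-open range [start, y)
            start = None
    if start is not None:
        rows.append((start, len(img)))
    return rows


def inter_word_spaces(img: Image, top: int, bottom: int, min_gap: int) -> List[int]:
    """Vertical profile: lengths of interior zero runs at least min_gap wide."""
    width = len(img[0])
    profile = [sum(1 for y in range(top, bottom) if img[y][x] == 0)
               for x in range(width)]
    runs, run, seen_ink = [], 0, False
    for count in profile:
        if count == 0:
            run += 1
        else:
            # A run only counts once ink follows it, so both margins are skipped.
            if seen_ink and run >= min_gap:
                runs.append(run)
            seen_ink, run = True, 0
    return runs


def extract_bit(spaces: List[int]) -> int:
    """Sign decision: bit 1 if the Set A total exceeds the Set B total, else 0."""
    n = len(spaces) // 2
    phi_a = sum(spaces[:n])
    phi_b = sum(spaces[len(spaces) - n:])
    return 1 if phi_a - phi_b > 0 else 0
```

A caller would run `remove_isolated_black` once on the rectified scan, then apply `inter_word_spaces` and `extract_bit` to each `(top, bottom)` range returned by `text_rows`, obtaining one bit per text row.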
Several variations may affect the performance of the proposed ISM text document data hiding algorithm.
1) The grouping of Set A and Set B: In general, arbitrary grouping of these two sets in a text row is permitted as long as the two sets contain equal numbers of inter-word spaces. However, the selection method shown in Fig. 2 turns out to achieve stronger robustness. Note that if one inter-word space is blurred by noise and hence missed by the detector, a mistake may take place with random grouping. With the proposed grouping, a missed inter-word space may cause a space that originally belonged to Set A to be assigned to Set B. Even in this case, the detection will still work correctly, since the decision is made by comparing the total inter-word spaces of the two sets.
2) Error correction coding (ECC): If the number of text rows exceeds the number of payload bits, ECC encoding of the original information bits will increase the robustness.

IV. CONCLUSION
This paper proposed a novel data hiding method for formatted text documents. The hidden data can be extracted without using the original document. Experiments show that this method is quite robust to printing, scanning and copying. Compared with the prior art, it achieves stronger robustness and higher data embedding capacity with lower system complexity. It can be used in many applications, such as copyright protection, document authentication, and digital notarization.

ACKNOWLEDGMENT
This work is supported in part by New Jersey Commission of Science and Technology via New Jersey Center of Wireless Network and Internet Security (NJWINS).

REFERENCES
[1] F. Hartung and M. Kutter, “Multimedia watermarking techniques,” Proc. IEEE, July 1999, vol. 87, pp. 1079-1107.
[2] M. Wu, E. Tang, and B. Liu, “Data hiding in digital binary images,” Proc. IEEE Int’l Conf. on Multimedia and Expo, Jul 31-Aug 2, 2000, vol. 1, pp. 393-396.
[3] Q. Mei, E. K. Wong, N. Memon, “Data hiding in binary text documents,” Proc. SPIE, Security and Watermarking of Multimedia Contents III, Aug. 2001, vol. 4314, pp. 369-375.
[4] S. H. Low, N. F. Maxemchuk, J. T. Brassil, L. O'Gorman, “Document marking and identification using both line and word shifting,” IEEE INFOCOM, April 1995, vol. 2, pp. 853-860.
[5] S. H. Low, N. F. Maxemchuk, A. M. Lapone, “Document identification for copyright protection using centroid detection,” IEEE Transactions on Communications, March 1998, vol. 46, pp. 372-383.
[6] J. T. Brassil, S. H. Low, N. F. Maxemchuk, L. O'Gorman, “Electronic marking and identification techniques to discourage document copying,” IEEE Journal on Selected Areas in Communications, Oct. 1995, vol. 13, pp. 1495-1504.
[7] Y. Q. Shi, C. Chang, S. Lin, and W. Su, “Method and Apparatus for Rapid and Precision Detection of Omnidirectional Postnet Barcode Location,” US Patent, US 6,708,884 B1, March 23, 2004.
Figure 5 (a) Original document with font size 10.
Figure 5 (b) Watermarked document with font size 10. (Capacity: 1 bit per line)
Figure 6 Distorted document after 10 times of copying.