2010 International Conference on Pattern Recognition
Color Feature Based Approach for Determining Ink Age in Printed Documents Biswajit Halder Dept. of Information Technology Mallabhum Institute of Technology Bisnupur, WB, India Email:
[email protected]
Abstract
Answering a query such as when a particular document was printed is quite helpful in practice, especially for forensic purposes. This study attempts to develop a general framework that uses image processing and pattern recognition principles for ink age determination in printed documents. The approach first computationally extracts a set of suitable color features and then analyzes them to associate them with ink age. Finally, a neural network is designed and trained to determine the ages of unknown samples. The dataset used for the present experiment consists of cover pages of LIFE magazine published between the 1930’s and the 1970’s (five decades). Test results show that the approach offers a viable framework for involving machines in assisting human experts in determining the age of printed documents.
1. Introduction
Many documents such as legal deeds, certificates, university/college mark sheets, and bank notes are common lifelong security documents. Recently, forgery of such documents has increased substantially [1] because of advances in printing, scanning, and photocopying technologies. The untrained human eye cannot easily detect such fraudulent documents. Forensic experts try to ascertain the authenticity of these documents through several means [2], one of which is to check the relative or absolute age of a questioned document by determining the age of the printing ink on it. Ink age determination has therefore been a standard field of study in forensic science [3].
Ink age determination is a challenging problem because the documents in question may have been generated many decades or even centuries ago. Paper technologies, printing technologies, the chemical components of ink pigments, and the components of paper all change with time. Moreover, ink colors and color shades differ from one manufacturer to another. Also, oxidation over time changes the chemical constituents of the ink used in printing a document. So far, forensic experts have followed different chemical techniques [4, 5] for determining ink age. Some of the recently proposed methods can be found in [6,
1051-4651/10 $26.00 © 2010 IEEE DOI 10.1109/ICPR.2010.785
Utpal Garain Computer Vision & Pattern Recognition Unit Indian Statistical Institute 203, B.T. Road, Kolkata 700108, India Email:
[email protected]
7]. The essence of these techniques is to measure or observe several chemical properties, or changes in those properties when different chemicals are applied, and to take a decision based on these measurements or observations. For case-based experiments these techniques have proved their reliability to an extent. However, many of these methods are destructive in nature and are unsuitable when the document in question has to be preserved as it is. The methods are also not fully automated and therefore require human intervention. These shortcomings make the existing techniques somewhat inconvenient, especially when a large number of questioned documents have to be processed online. If a case has to be registered with the police department, which in turn asks for the help of forensic experts, every time a questioned document is encountered, the many steps involved make the entire process slow and less attractive. Therefore, a fully automated approach for ink age determination would be of great help, even to forensic experts for their own assistance.
This paper attempts to address this problem from an image processing and pattern recognition viewpoint. Only printed documents are considered in the present study. The method is based on the color features of the ink used in printing the document in question. The study assumes that (i) there are entries on the document, or on similar documents, whose dates are not questioned and that use the same ink formula as the questioned entry or document (more specifically, similar samples are assumed not to differ by more than five years in age), and (ii) the questioned and unquestioned ink entries occur on documents stored under the same conditions, so that the entries were exposed to the same environmental conditions.
The significant contributions of this paper include the use of color image analysis and pattern recognition techniques to design an automated approach for ink age determination.
Identifying the relevant color features and analyzing them statistically is another motivating aspect of this study. The features are simple, easily computable, and yet effective for the purpose. The experiments, conducted on publicly available samples such as LIFE magazine cover pages, bring out several interesting observations and open up new research avenues in the related field.
2. Our Approach
2.2 The Method
In our approach we attempt to closely follow the method that forensic experts use in determining ink age. The method is basically an analysis of ink colors, in which the changes of different color properties with age are noted. For example, the older a document is, the more yellowish and brownish it appears. Dark noise also increases with age. The color shade and brightness of a document are affected by its age, and this in turn affects the hue and saturation levels of the ink colors. However, instead of performing these analyses chemically and manually, we concentrate on statistical analysis of relevant color features that are computationally captured from scanned images. Apart from gray level analysis, we consider two other color spaces, namely RGB and HSV.
In our approach, we do not consider the entire cover page; rather, we concentrate on a region whose appearance remains the same in every issue of the magazine. Dealing with the entire cover page is not always effective, as the color combination changes from one issue to another, making relative comparison of different samples difficult. The magazine name (or the logo), on the other hand, is a good candidate for this purpose. In every issue it appears (ref. fig 1) in a rectangular area where the text is white on a red background. The area of this rectangular region is found to be large enough for the purpose of color analysis. After the region of interest (ROI) is extracted (manually, in the present study), it goes through a bi-clustering process to separate the foreground from the background. K-means is applied for bi-clustering, and the cluster consisting of white pixels (i.e. pixels relatively whiter than those in the other cluster) represents the foreground. This separation is done to study the effect of age on the foreground and the background colors separately. Features are extracted from the background and the foreground separately and analyzed, and the analysis results are then combined to take a decision. Human experts also perform the analysis in a similar way, i.e. by studying the background and the foreground colors and their changes separately. However, certain color features are extracted from the whole region, i.e. considering background and foreground together.
2.3 The Features
Features are extracted from the gray, RGB, and HSV color spaces. Initially, many more features than the ones used in this study were extracted; the following 33 real-valued features were finally chosen because of their significant discriminative power as observed on a training set. The choice of these features also involved consultation with forensic practitioners.
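The foreground/background separation of the ROI described in Section 2.2 can be illustrated with a minimal sketch. It assumes the ROI is already available as a list of (r, g, b) tuples. The function name `split_foreground_background`, the seeding of the two centers with the darkest and brightest pixels, and the toy pixel values are all illustrative assumptions and are not taken from the paper.

```python
def split_foreground_background(pixels, iterations=20):
    """Two-cluster K-means on (r, g, b) pixels; returns (foreground, background).

    The brighter (whiter) cluster is taken as the foreground, as in the text.
    """
    def brightness(p):
        return sum(p) / 3.0

    def dist2(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q))

    def mean(cluster):
        n = len(cluster)
        return tuple(sum(channel) / n for channel in zip(*cluster))

    # Assumption: seed the two centers with the darkest and brightest pixels.
    centers = [min(pixels, key=brightness), max(pixels, key=brightness)]
    for _ in range(iterations):
        clusters = [[], []]
        for p in pixels:
            nearest = 0 if dist2(p, centers[0]) <= dist2(p, centers[1]) else 1
            clusters[nearest].append(p)
        # Keep the old center if a cluster happens to be empty.
        new_centers = [mean(c) if c else centers[i] for i, c in enumerate(clusters)]
        if new_centers == centers:
            break
        centers = new_centers
    fg = 0 if brightness(centers[0]) > brightness(centers[1]) else 1
    return clusters[fg], clusters[1 - fg]


# Toy ROI: seven reddish background pixels and three near-white text pixels.
roi = [(200, 30, 40)] * 6 + [(250, 245, 240)] * 3 + [(190, 25, 35)]
foreground, background = split_foreground_background(roi)
```

On a real scan the pixel list would come from the manually cropped logo region; any image library could supply it, and none is assumed here.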
Average intensities: The first set of features includes the average gray levels of the entire ROI, the background pixels, and the foreground pixels. Next, average red, green and hue levels are computed from background and foreground separately. This gives nine (09) feature values (4 on background, 4 on foreground and 1 from the whole region).
Prominent intensities: The most prominent intensity levels are computed on four channels, namely gray, blue, hue, and saturation. The prominent intensity level of a channel is the value observed in the maximum number of pixels. Prominent gray intensity is computed from the whole ROI, and prominent blue, hue and saturation levels are computed from background and foreground pixels separately. This gives seven (07) features: three from the background, three from the foreground and one from both together.
Pixel profile: The most prominent intensities are recorded as explained above. The numbers of pixels showing these prominent intensities are also noted. These numbers are
Figure 1. (a) Cover page of an issue published in Sept., 1942, (b) Extracted region of interest for color analysis.
2.1 Dataset and the Task
As mentioned earlier, images of LIFE magazine cover pages printed during five decades (1930’s to 1970’s) are considered. We label the samples with their year of publication. Samples appearing within five (or fewer) successive years in each decade are taken in such a way that samples printed in different decades differ in age by five years or more. So the samples of the 30’s are selected from magazines published between 1935 and 1937, the samples of the 40’s are taken from magazines that appeared between 1944 and 1947, and so on. Issues between 1971 and 1972 are considered for the last, i.e. fifth, decade. Forty (40) samples are considered for each decade, which results in 200 samples. The task is defined as follows: given a sample page, how accurately can our system date it? For this, the dataset is divided into four sets, each containing 50 samples (10 for each decade). Two sets are used as training sets, the third as the validation set, and testing is done on the fourth set.
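A minimal sketch of the partitioning just described, assuming the 200 samples are identified only by their decade labels. The helper names `stratified_sets` and `rotate_roles`, the random seed, and the particular rotation order are illustrative assumptions. The rotation is one arrangement consistent with the four runs reported in Section 3.2 (each set is tested once, validated once, and used for training twice), not necessarily the one the authors used.

```python
import random
from collections import defaultdict


def stratified_sets(labels, n_sets=4, seed=0):
    """Partition sample indices into n_sets sets with equal counts per decade."""
    by_decade = defaultdict(list)
    for idx, decade in enumerate(labels):
        by_decade[decade].append(idx)
    rng = random.Random(seed)
    sets = [[] for _ in range(n_sets)]
    for indices in by_decade.values():
        rng.shuffle(indices)
        # Deal the shuffled samples of one decade round-robin into the sets.
        for pos, idx in enumerate(indices):
            sets[pos % n_sets].append(idx)
    return sets


def rotate_roles(n_sets=4):
    """Yield (train sets, validation set, test set) so each set is tested once."""
    for run in range(n_sets):
        test = run
        validation = (run + 1) % n_sets
        train = [s for s in range(n_sets) if s not in (test, validation)]
        yield train, validation, test


# 40 samples for each of the five decades, as in the dataset description.
labels = [d for d in ("30s", "40s", "50s", "60s", "70s") for _ in range(40)]
sets = stratified_sets(labels)
runs = list(rotate_roles())
```

Dealing each decade round-robin keeps every set at 10 samples per decade, which is the 2:1:1 split by construction.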
decades (30’s to 70’s) are considered. In total, the dataset contains 200 samples (40 for each decade).
normalized with respect to the total number of pixels, as the image size may vary slightly from one sample to another. For example, while computing the intensities on the background, the numbers of pixels are normalized with respect to the total number of background pixels, and similarly for the foreground. This pixel profile adds seven (07) more features.
Kurtosis of image colors: Apart from the highest intensity levels, we also measure the kurtosis to analyze whether variations are due to infrequent extreme deviations. Kurtosis is measured separately for blue, hue, and saturation levels on the background as well as on the foreground. We therefore get six (06) additional features from these measures.
Geometric mean: It is noted that the geometric means of the hue and saturation levels differ significantly between two pages if they differ in age. Therefore, we compute these means separately for background and foreground pixels. Four (04) more features are added to capture this aspect.
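The per-channel statistics listed above can be sketched as follows for a single channel of one pixel group (for example, the hue values of the background pixels). The function names are illustrative. The kurtosis convention (the plain fourth standardized moment, which is 3 for a normal distribution) and the requirement of strictly positive levels for the geometric mean are assumptions, since the text does not state how either was handled.

```python
import math
from collections import Counter


def prominent_level(values):
    """Most frequent level and the fraction of pixels showing it (pixel profile)."""
    level, count = Counter(values).most_common(1)[0]
    return level, count / len(values)


def kurtosis(values):
    """Fourth standardized moment; assumes the values are not all identical."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    return m4 / (m2 ** 2)


def geometric_mean(values):
    """Geometric mean; assumes strictly positive levels (zeros would need an offset)."""
    return math.exp(sum(math.log(v) for v in values) / len(values))


# Toy hue values for a handful of background pixels (illustrative only).
hue_bg = [10, 10, 10, 12, 14, 10, 11, 40]
level, share = prominent_level(hue_bg)
```

Applying these functions to the channels and pixel groups named in the text, together with the plain averages, yields the 9 + 7 + 7 + 6 + 4 = 33 values of the feature vector.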
3.1 Results of K-means
All 200 samples are first clustered using an unsupervised clustering method. The purpose of this clustering is to analyze the distribution of samples in the feature space. The K-means algorithm is used for this purpose, with Euclidean distance as the measure of similarity between two samples. The algorithm finds five clusters corresponding to the five decades. The K-means results are evaluated by counting the number of similar samples grouped together against the number of dissimilar samples contained in that group. Since each sample is tagged with its year of publication, evaluating the clustering results in this way is straightforward. Table 1 presents the evaluation of the K-means results. Since the cluster centers are initialized randomly, K-means was executed three times to obtain an average result. From Table 1, the overlap between clusters in the feature space is evident. This rules out designing a classification system based on linear decision functions. Another important observation is that the older the samples are, the less accurate their clustering is: e.g. the 30’s samples are clustered with 59.2% accuracy, which gradually improves to 79.2% for the 70’s samples. A similar trend is observed in determining age, as explained next. Our analysis found that the reason behind this trend is that the ageing effect on older documents is less consistent than it is on relatively newer samples.
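The clustering and its evaluation can be sketched as below. This is a plain K-means with Euclidean distance and centers initialized at randomly chosen samples, as the text describes. Mapping each cluster to its majority decade before counting "similar" versus "other" samples is an assumption, since the paper does not say how clusters were matched to decades. The function names and toy data are illustrative.

```python
import random
from collections import Counter


def kmeans(samples, k, iterations=50, seed=0):
    """Plain K-means with Euclidean distance; centers start at k random samples."""
    rng = random.Random(seed)
    centers = rng.sample(samples, k)
    assignment = [0] * len(samples)
    for _ in range(iterations):
        # Assign every sample to its nearest center.
        assignment = [
            min(range(k), key=lambda j: sum((a - b) ** 2 for a, b in zip(x, centers[j])))
            for x in samples
        ]
        # Move each center to the mean of its members.
        for j in range(k):
            members = [x for x, a in zip(samples, assignment) if a == j]
            if members:  # keep the old center if a cluster empties
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    return assignment


def cluster_composition(assignment, labels):
    """For each cluster: (majority label, # majority samples, # other samples)."""
    report = {}
    for j in sorted(set(assignment)):
        counts = Counter(l for a, l in zip(assignment, labels) if a == j)
        label, hits = counts.most_common(1)[0]
        report[j] = (label, hits, sum(counts.values()) - hits)
    return report


# Toy 2-D feature vectors; the paper uses 33-D vectors and k = 5.
toy = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.2), (5.0, 5.1), (5.2, 4.9), (4.9, 5.0)]
decades = ["30s", "30s", "30s", "70s", "70s", "70s"]
report = cluster_composition(kmeans(toy, k=2), decades)
```

Dividing the majority count of a cluster by the 40 samples of its decade gives the %Accuracy row of Table 1 (for instance 23.7 / 40 ≈ 59.2% for the 30’s).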
2.4 Dating of a Sample
After the features described above are extracted from the sample images, which are spread over five decades, the samples are to be dated. This is modeled as a 5-class pattern recognition problem, i.e. deciding which decade a particular document belongs to. Let $m_i$ be the number of samples of the i-th decade. In the feature space, these $m_i$ samples are expected to form a cluster ($C_i$). If any two clusters $C_i$ and $C_j$ are linearly separable, the task of decision-making becomes easier. If the $D_i(X)$ are linear decision functions, then a given document $X$ belongs to $C_i$ if $D_i(X) > D_j(X)$ ∀ j ≠ i. However, a simple investigation reveals that the clusters are not linearly separable. For this investigation, we implement a K-means algorithm and cluster the $N = \sum_{i=1}^{5} m_i$ labeled samples into five classes. The centers in the K-means algorithm are initialized by selecting five samples at random. Since the K-means results are affected by this initialization phase, K-means is executed more than once (three times) and the clustering results are investigated each time. This investigation reveals that the clusters always overlap and that it is therefore difficult to find the $D_i(X)$, i.e. linear decision functions. The classification accuracy is then checked with a Neural Network (NN)-based classifier. An MLP (Multi-Layer Perceptron) based NN structure is used. The MLP consists of 33 input nodes corresponding to the 33 dimensions of a feature vector. The output layer consists of five nodes corresponding to the five decades. The hidden layer, in the present experiment, is made of 3 nodes. A Gaussian Radial Basis Function, as explained in Section 3.2, is used as the activation function of the network.
3.2 Age Determination by Neural Network
Next, a Neural Network (NN)-based classifier, an MLP, is designed for determining the age of a sample. The generalized function of the NN is:
$$y = \sum_{j} w_j \phi_j(x) \qquad (1)$$
where $w_j$ are the weights (which are updated following the K-means optimization technique) and $\phi_j(x)$ is the transfer or activation function, which takes the form of a Gaussian Radial Basis Function (RBF) as follows:
$$\phi_j(x) = \exp\left(-\|x - c_j\|^2 / 2\sigma^2\right) \qquad (2)$$
where $c_j$ represents the center of the j-th cluster, $\sigma$ is the width, and $\|x - c_j\|^2$ is the square of the distance between the input feature vector $x$ and the cluster center for the radial basis function.
The set of 200 samples is divided into 4 sets to realize a four-fold experiment. The proportion in which samples appear in the training, validation and test data is 2:1:1 (i.e. training: 50%, validation: 25% and testing: 25%). Sets are
3. Experimental Results and Discussions
As mentioned before, images of old LIFE magazine cover pages, scanned at 100 dpi and available through the Google Books project [8], have been used here. Samples of five
be conducted next. It would be interesting to develop a computational aging model for printer inks and then estimate the aging parameters from samples. This model could later be employed to predict the age of a printed document. A similar attempt is reported in [6] from a chemical viewpoint. Extending the present study to handwritten manuscripts would be another motivating area of study.
selected in such a way that each set appears once as a test set and once as a validation set. To ensure that each set also appears twice as a training set, four different runs were executed. The result of this four-fold experiment is reported in Table 2. It shows that NN-based classification gives an average accuracy of 74.5% in determining ink age on the test set. Relatively newer samples show better performance than the older samples. The reason, as stated before, is that the aging process becomes more random as the ink grows older. Table 3 shows the confusion matrix for dating the samples (each sample appears once as a test sample). Here it is interesting to observe that the dating errors are not very random in nature: most misdated samples are assigned to a neighboring decade.
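The classifier behind these results follows Equations (1) and (2), and its forward pass can be sketched as below. This is a minimal sketch and not the authors' implementation. Equation (1) gives a single output $y$, so the use of one weight row per output node and the choice of the decade with the largest output are assumptions. The function names, toy centers, width and weights are illustrative only; training is not shown.

```python
import math


def rbf_activations(x, centers, sigma):
    """Equation (2): phi_j(x) = exp(-||x - c_j||^2 / (2 sigma^2)) for each center."""
    return [
        math.exp(-sum((a - b) ** 2 for a, b in zip(x, c)) / (2.0 * sigma ** 2))
        for c in centers
    ]


def rbf_outputs(x, centers, sigma, weights):
    """Equation (1) applied per output node: y_k = sum_j w_kj * phi_j(x)."""
    phi = rbf_activations(x, centers, sigma)
    return [sum(w * p for w, p in zip(row, phi)) for row in weights]


def predict_decade(x, centers, sigma, weights,
                   decades=("30's", "40's", "50's", "60's", "70's")):
    """Assumed decision rule: pick the decade whose output node is largest."""
    y = rbf_outputs(x, centers, sigma, weights)
    return decades[max(range(len(y)), key=y.__getitem__)]


# Toy setup: 2-D inputs, 3 hidden units, 5 outputs (the paper uses 33 inputs).
centers = [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0)]
weights = [[1, 0, 0], [0, 1, 0], [0, 0, 1], [0, 0, 0], [0, 0, 0]]
predicted = predict_decade((1.0, 1.0), centers, sigma=0.5, weights=weights)
```

In this toy setup the input coincides with the second center, so the second hidden unit responds with exactly 1 and dominates the outputs.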
6. References
[1] The Times of India, 1st March 2009, http://timesofindia.indiatimes.com/news/india/Beware-the-fakes-RBIteaches-kids/articleshow/4206645.cms
[2] J. Levinson, “Questioned Documents: A Lawyers Handbook,” Academic Press Inc, 2001.
[3] K. Koppenhaver, “Forensic Document Examination: Principles and Practice,” Humana Press, 2007.
[4] O. Hilton, “Scientific Examination of Questioned Documents,” New York: Elsevier Science Publishing Co., 1982.
[5] L.F. Stewart, “Ink Age Determination by Volatile Component Comparison - A Preliminary Study,” in Journal of Forensic Science, 30 (2), 1985.
[6] Céline Weyermann, “Mass Spectrometric Investigation of the Aging Process of Ballpoint Ink for the Examination of Questioned Documents,” PhD Thesis, Justus-Liebig-University Giessen, 2005.
[7] H.J. Bugler, H. Buchner, and A. Dallmayer, “Age Determination of Ballpoint Pen Ink by Thermal Desorption and Gas Chromatography-Mass Spectrometry,” in Journal of Forensic Sciences, 53 (4), 2008.
[8] Google books at http://books.google.com/
4. Conclusions and Future Scope of Research
An attempt is made to determine the ink age of printed documents. LIFE magazine cover pages are taken as a reference to show that a fully automated system based on color image analysis and pattern recognition principles could provide a viable framework for ink age determination. The method is fully automated. An overall accuracy of about 74.5% shows that the system can assist human experts to a great extent. However, many more experiments need to be done to establish this practice as an acceptable one. In fact, the findings of this study open up new research avenues in this domain. Rigorous experiments on a larger dataset, followed by establishing the statistical significance of the results, are to
Table 1. K-means Result
Distribution of samples in different clusters (Ci); #Samples per decade: 40

Iteration   C1 (30’s samples)   C2 (40’s samples)   C3 (50’s samples)   C4 (60’s samples)   C5 (70’s samples)
            # 30’s   Others     # 40’s   Others     # 50’s   Others     # 60’s   Others     # 70’s   Others
1           17       24         20       15         24       17         27       15         29       12
2           29       10         29       8          31       13         32       8          35       5
3           25       13         26       10         29       14         31       10         31       11
Average     23.7     15.7       25       11         28       14.7       30       11         31.7     9.3
%Accuracy   59.2%               62.5%               70%                 75%                 79.2%

Table 2. Accuracy for Ink Age determination by the Neural Net
Correct ink age dating on the test dataset (# test samples for each decade = 10)

Runs      30’s   40’s   50’s   60’s   70’s   Total
Run 1     5      7      9      8      8      37
Run 2     6      6      8      7      8      35
Run 3     8      7      6      9      9      39
Run 4     6      7      8      8      9      38
Average   6.25   6.75   7.75   8      8.5    149 (74.5%)

Table 3. Confusion Matrix for Age determination

Samples↓ / Dated as→   30’s   40’s   50’s   60’s   70’s
30’s                   25     5      6      2      2
40’s                   5      27     5      3      0
50’s                   1      4      31     4      0
60’s                   0      0      3      32     5
70’s                   0      0      0      6      34
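The accuracy figures can be cross-checked from the confusion matrix with a few lines of arithmetic. The sketch below assumes Table 3 is read with rows as the true decade and columns as the assigned decade, and that blank cells are zero. That reading makes every row sum to 40 and the diagonal match the per-decade totals of Table 2. The function names are illustrative.

```python
DECADES = ["30's", "40's", "50's", "60's", "70's"]

# Rows: true decade; columns: decade assigned by the network (blanks read as 0).
CONFUSION = [
    [25, 5, 6, 2, 2],
    [5, 27, 5, 3, 0],
    [1, 4, 31, 4, 0],
    [0, 0, 3, 32, 5],
    [0, 0, 0, 6, 34],
]


def per_class_accuracy(matrix):
    """Correctly dated share of each true decade (diagonal over row sum)."""
    return [row[i] / sum(row) for i, row in enumerate(matrix)]


def overall_accuracy(matrix):
    """Correctly dated share of all samples."""
    correct = sum(row[i] for i, row in enumerate(matrix))
    return correct / sum(sum(row) for row in matrix)


def within_one_decade(matrix):
    """Share of samples dated to the true decade or an adjacent one."""
    near = sum(v for i, row in enumerate(matrix)
               for j, v in enumerate(row) if abs(i - j) <= 1)
    return near / sum(sum(row) for row in matrix)


accuracy_by_decade = dict(zip(DECADES, per_class_accuracy(CONFUSION)))
overall = overall_accuracy(CONFUSION)
near_miss_share = within_one_decade(CONFUSION)
```

Under this reading, the overall value reproduces the 149 / 200 = 74.5% of Table 2, and `within_one_decade` quantifies how far the dating errors stay from the true decade.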