Sparse-parametric writer identification using heterogeneous feature groups
L.R.B. Schomaker, M. Bulacu and M. van Erp
AI Institute, Groningen University (RuG); NICI (Nijmegen Institute for Cognition and Information), The Netherlands
Problem
Traditional methods for forensic writer identification require considerable manual effort from human experts, who measure individual characters. With current background-removal methods, however, it has become feasible to apply automatic image-based features to regions of interest that describe the individuality of handwriting style. Nevertheless, a single feature representation cannot be expected to capture all particularities of writing style, so combination methods are needed. The application domain precludes training on large datasets, so sparse-parametric combination methods are preferred, which excludes MLP- or SVM-based combination functions.
f4: (continued Brush PDF) After summing all luminances, the accumulator window is normalized to a volume of 1, yielding a PDF for ink presence at stroke endings in any direction. This feature is not size invariant: the window of 15x15 pixels was chosen because it captures 6-7 pixel-wide ink traces (size normalization is assumed). Figure 2 displays the overall shape and subtle writer differences.
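The accumulate-and-normalize step can be illustrated with a minimal sketch. It assumes the stroke-ending windows have already been detected and are given as equally sized square grids in which larger values mean more ink; the function name `brush_pdf` and this input format are illustrative assumptions, not taken from the poster.

```python
def brush_pdf(windows):
    """Sum equally sized square windows and scale the result to a volume of 1."""
    size = len(windows[0])
    accumulator = [[0.0] * size for _ in range(size)]
    # Accumulate the ink values of every detected stroke-ending window.
    for window in windows:
        for y in range(size):
            for x in range(size):
                accumulator[y][x] += window[y][x]
    volume = sum(sum(row) for row in accumulator)
    if volume == 0:
        return accumulator  # no ink seen at all: leave the all-zero grid
    return [[value / volume for value in row] for row in accumulator]


# Toy 2x2 windows; the poster's windows are 15x15, i.e. 225 dimensions.
pdf = brush_pdf([[[1, 0], [0, 0]], [[1, 0], [0, 2]]])
```

Flattening the normalized grid row by row gives the feature vector that is compared between samples.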
Given a sample of unknown identity uniquely labeled
can construct a Borda rank-combination scheme. Assume a set of tween the unknown sample
and the reference set
for each feature group
which returns for each dimension in
set
with respect to
dexed
function
. Thus
is the distance between an unknown sample
operating on a tensor:
Feature  Name        Explanation                              Dim.  Distance
f1       ACF         Autocorrelation in horizontal raster     100   Euclid.
f2       VrunB       PDF of vertical run lengths of ink       100   χ²
f3       HrunW       PDF of horizontal run length of ’white’  100   χ²
f4       Brush       Ink-density PDF at stroke endings        225   χ²
f5       p(φ)        Edge-direction PDF                       16    Euclid.
f6       p(φ1, φ2)   Hinge angle combination PDF              464   χ²
f7       p(φ1, φ3)   Horiz. edge-angle co-occurrence          512   χ²
f8       WR          Writer: handedness, sex, age, style      16    Euclid.
f1: ACF, autocorrelation function of the horizontal raster, detects the presence of regularity in writing: regular vertical strokes will overlap in the original row and its horizontally shifted copy for offsets equal to integer multiples of the local wavelength. Every row of the image is shifted onto itself by a given offset and then the normalized dot product between the original row and the shifted copy is computed. The maximum offset (’delay’) corresponds to 100 pixels. All autocorrelation functions are then accumulated for all rows and the sum is normalized to obtain a zero-lag correlation of 1.
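The row-wise computation can be sketched as follows. This is a minimal illustration, assuming a binarized image given as a list of rows with ink = 1 and background = 0, and assuming that each row's dot product is normalized by that row's energy; the poster does not state the exact normalization, and the names `row_autocorrelation` and `acf_feature` are illustrative.

```python
def row_autocorrelation(row, max_lag):
    """Dot product of a row with its shifted copy, divided by the row energy."""
    energy = sum(v * v for v in row)
    result = []
    for lag in range(max_lag + 1):
        if energy == 0:
            result.append(0.0)  # empty row: contributes nothing
            continue
        dot = sum(row[i] * row[i + lag] for i in range(len(row) - lag))
        result.append(dot / energy)
    return result


def acf_feature(image, max_lag=100):
    """Accumulate row autocorrelations and scale so the zero-lag value is 1."""
    lags = min(max_lag, len(image[0]) - 1)
    total = [0.0] * (lags + 1)
    for row in image:
        for lag, value in enumerate(row_autocorrelation(row, lags)):
            total[lag] += value
    if total[0] == 0:
        return total
    return [value / total[0] for value in total]


# Three identical rows with a vertical stroke every 4 pixels.
image = [[1, 0, 0, 0] * 6 for _ in range(3)]
feature = acf_feature(image)
```

With strokes repeating every 4 pixels, the feature peaks at lags 4, 8, 12 and so on, which is the regularity the text describes.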
f5: p(φ), simple edge-direction PDF, is computed by considering the PDF of quantized directions of the Sobel edges in the image. Sixteen bins were used in the histogram (Figure 3).
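A minimal sketch of such a histogram is given below. Several details are assumptions rather than statements from the poster: the image is a grayscale grid given as a list of rows, weak gradients are skipped with a magnitude threshold, the edge direction is taken as the Sobel gradient rotated by 90 degrees, and directions are folded into the range [0, π) before binning. The name `edge_direction_pdf` is illustrative.

```python
import math


def edge_direction_pdf(image, n_bins=16, threshold=1.0):
    """Histogram of quantized Sobel edge directions, normalized to sum to 1."""
    height, width = len(image), len(image[0])
    histogram = [0.0] * n_bins
    for y in range(1, height - 1):
        for x in range(1, width - 1):
            # 3x3 Sobel responses in the x and y directions.
            gx = (image[y - 1][x + 1] + 2 * image[y][x + 1] + image[y + 1][x + 1]
                  - image[y - 1][x - 1] - 2 * image[y][x - 1] - image[y + 1][x - 1])
            gy = (image[y + 1][x - 1] + 2 * image[y + 1][x] + image[y + 1][x + 1]
                  - image[y - 1][x - 1] - 2 * image[y - 1][x] - image[y - 1][x + 1])
            if math.hypot(gx, gy) < threshold:
                continue  # flat region: no edge here
            # The edge runs perpendicular to the gradient; fold into [0, pi).
            angle = (math.atan2(gy, gx) + math.pi / 2) % math.pi
            index = min(int(angle / math.pi * n_bins), n_bins - 1)
            histogram[index] += 1.0
    total = sum(histogram)
    return [h / total for h in histogram] if total else histogram


# A vertical step edge: every detected edge pixel falls in the 90-degree bin.
step = [[0, 0, 0, 1, 1, 1] for _ in range(6)]
pdf = edge_direction_pdf(step)
```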
is known that axial pen force (’pressure’) is a highly informative signal in on-line writer identification [1]. In ink traces of ball-point pens, there exist lift-off and landing shapes in the form of blobs or tapering [2], due to the ink-depositing process during take-off and landing of the pen. A convolution window of 15x15 pixels was used, accumulating the local image only if the current region satisfies the constraints for a stroke ending. These constraints are a supraliminal ink intensity in the central pixel of the window, co-occurring with a long run of white pixels along at least 50% of the window perimeter, interrupted by one ink strip of at least 5% of the window perimeter (Figure 1).
We evaluated the effectiveness of different features for writer identification using the Firemaker data set [4]. A total of 251 Dutch subjects wrote four different A4 pages. On page 1 they were asked to copy a text presented as machine-printed characters. On page 2 they were asked to describe a given cartoon in their own words. The same kind of paper, pen and support were used for all subjects. The A4 sheets were scanned at 300 dpi, 8 bits/pixel gray-scale. Performance was tested using leave-one-out. For a query sample, the reference set will contain one matching sample of the same writer and 500 distractor samples by 250 other writers.
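The leave-one-out protocol can be sketched as a Top-N hit rate: each sample in turn is the query, all remaining samples form the reference set, and a hit is counted when a sample by the same writer appears among the N nearest neighbours. The sketch below is a minimal illustration; the function names, the (writer, feature vector) sample format and the default Euclidean distance are assumptions made for the example.

```python
def euclidean(a, b):
    """Plain Euclidean distance between two equally long vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


def top_n_rate(samples, n, distance=euclidean):
    """Fraction of queries whose writer occurs in the n nearest other samples."""
    hits = 0
    for query_index, (query_writer, query_vector) in enumerate(samples):
        # Leave the query out; everything else is the reference set.
        ranked = sorted(
            (distance(query_vector, vector), writer)
            for index, (writer, vector) in enumerate(samples)
            if index != query_index
        )
        if any(writer == query_writer for _, writer in ranked[:n]):
            hits += 1
    return hits / len(samples)


# Toy data: writers "a" and "b" are consistent, writer "c" is not.
toy = [("a", [0, 0]), ("a", [0, 0.9]), ("b", [5, 5]),
       ("b", [5, 6]), ("c", [0, 2]), ("c", [9, 9])]
top1 = top_n_rate(toy, 1)
top5 = top_n_rate(toy, 5)
```

With two pages per writer, as in the setup described above, each query has exactly one matching sample in the reference set.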
Figure 3. (left): Two handwriting samples from two different subjects. (right): We superimposed the polar diagrams of the edge-direction distribution corresponding to pages 1 and 2 contributed to our data set by each of the two subjects.
f6: p(φ1, φ2), hinge-angle combination PDF. In order to capture the curvature of the ink trace, which is very typical for different writers, another feature is needed, using local angles along the edges [3]. The computation of this feature is similar to the one previously described, but it has added complexity. The central idea is to consider the two edge fragments emerging from a central pixel and, subsequently, compute the joint probability distribution of the orientations of the two fragments of this ’hinge’. The final normalized histogram gives the joint probability distribution p(φ1, φ2), quantifying the chance of finding in the image two ”hinged” edge fragments oriented at the angles φ1 and φ2, respectively. The orientation is quantized in 16 directions for a single angle. We will consider only the non-redundant angles (φ2 > φ1) and we will also eliminate the cases when the ending pixels have a common side. Therefore the final number of combinations is $\binom{2n}{2} - 2n = n(2n - 3)$ (464 dimensions for n = 16). See Figure 4 for more details.
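The dimension count of 464 can be checked by brute force under one reading of the text. The sketch assumes that the two hinge legs each take one of 2n quantized directions around the central pixel (n = 16), that only unordered pairs with φ2 > φ1 are kept, and that "ending pixels with a common side" corresponds to two directions that are neighbours on the circle. These are assumptions, and the function name `count_hinge_bins` is illustrative.

```python
def count_hinge_bins(n):
    """Count direction pairs (phi1 < phi2) that are not circular neighbours."""
    directions = 2 * n
    count = 0
    for phi1 in range(directions):
        for phi2 in range(phi1 + 1, directions):
            gap = phi2 - phi1
            # Skip neighbouring directions, including the wrap-around pair.
            if gap == 1 or gap == directions - 1:
                continue
            count += 1
    return count


bins = count_hinge_bins(16)
```

Under these assumptions the count agrees with the closed form n(2n − 3).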
[Figure 6 plots: % correct writer in hit list (y-axis, 0 to 100) against rank, i.e. size of writer hit list (x-axis, 1 to 10); curves are labeled f1, +f2, ..., +f8.]
Figure 6. (left): Results for individual feature groups. (right): Results for sorted feature groups, using sequential Borda rank combination.
Recent tests with the Min operator, not reported here, indicate that this rule may be preferable to sequential Borda. Ongoing studies have shown still better identification performance if (a) feature vectors are computed separately from the upper and lower parts of lines of text [5], with additional improvement if (b) local component-shape features are used (Fig. 7).
[Figure 7 plots: two panels, ”Comparison split- vs entire-line features on lower case text” and ”Comparison split- vs entire-line features on UPPER case text”; x-axis: List size (1 to 50); y-axis: Probability of correct hit %; curves: p(phi1, phi2), p(phi1, phi3), p(brush), p(phi) and p(rl), each for split and entire lines.]
Figure 7. Recent results on refinement of angular features [5]
Conclusions
Localized, angular (co-)occurrences based on edges are very good features for writer identification.
Figure 4. (left): The computation of the angular ’hinge’ feature and (right): An example of a single-writer hinge PDF.
Actual forensic systems (System A: 34% (90%); System B: 65% (90%) for Top-1 (Top-10)), using only a subset of 100 from the same data, are largely outperformed by our method: 79% (95%).
f7: p(φ1, φ3), horizontal edge-angle co-occurrence. This feature is a variant of the edge-hinge feature, in that the combination of angles is computed along the rows of the image. For the angle φ1 of a found edge fragment, the co-occurrence probability is computed with the angles φ3 of fragments which are horizontally displaced from it (Figure 5).
f4: Brush, ink-density PDF at stroke endings. It
f2: VrunB, PDF of vertical run lengths in ink
f3: HrunW, PDF of horizontal run lengths in background pixels
Run lengths are determined on the binarized image, taking into consideration either the black pixels, corresponding to the ink-trace width distribution, or the white pixels, corresponding to the horizontal stroke and character-placement distribution for the writer. The histogram of run lengths is normalized and interpreted as a probability distribution. We use horizontal run lengths of up to 300 pixels (f3) and vertical run lengths (f2) of up to 100 pixels, i.e., the height of a written line in the data set used (resolution is 300 dpi). This feature is not size invariant. However, size normalization is not an issue in interactive writer search. The run-length PDFs provide orthogonal information to the directional features.
Results
where the combination operator returns a vector which has a monotonic relation to the combined rank vector. The output hit list contains the samples in the final rank order. In the regular Borda vote, the operator is the Sum function: ranks are summed per dimension before being re-sorted by the rank operator. However, many Borda-operator variants are known: Sum, Max, Median, Min, Majority, Plurality, etc. In this study, we tested the use of the Sum operator. The problem of the Sum operator is that all votes are treated equally. Since the Median did not improve on this, we applied the Sum rule in a sequential and cumulative fashion, from worst to best feature group. This is comparable to taking a weighted sum with rank weights that depend on the quality index of the feature group (1 = best) after optimal group reordering.
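The regular Sum rule and the sequential variant can be sketched as below. This is one possible reading, not the authors' implementation: each feature group supplies a vector of distances from the query to the same reference samples, ranks run from 1 (nearest), ties are broken by sample index, and "sequential and cumulative" is interpreted as adding one group at a time, worst first, and re-ranking after every addition. All function names are illustrative.

```python
def rank_vector(values):
    """Rank of each sample (1 = smallest value); ties broken by sample index."""
    order = sorted(range(len(values)), key=lambda i: (values[i], i))
    ranks = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        ranks[index] = rank
    return ranks


def borda_sum(distance_vectors):
    """Regular Borda vote: sum the ranks per sample, then re-rank the sums."""
    per_group = [rank_vector(distances) for distances in distance_vectors]
    return rank_vector([sum(column) for column in zip(*per_group)])


def sequential_borda(distance_vectors_worst_to_best):
    """Add groups one by one, worst first, re-ranking after each addition."""
    combined = rank_vector(distance_vectors_worst_to_best[0])
    for distances in distance_vectors_worst_to_best[1:]:
        ranks = rank_vector(distances)
        combined = rank_vector([c + r for c, r in zip(combined, ranks)])
    return combined


def hit_list(ranks):
    """Sample indices ordered from best (rank 1) to worst."""
    return sorted(range(len(ranks)), key=lambda i: ranks[i])


# Three feature groups (worst, middle, best) and four reference samples.
groups = [[0.1, 0.2, 0.3, 0.4], [0.4, 0.3, 0.2, 0.1], [0.3, 0.1, 0.4, 0.2]]
plain = borda_sum(groups)
sequential = sequential_borda(groups)
```

In the toy data the two weaker groups cancel each other out, so the sequential result follows the best group more closely than the plain sum does. Because each re-ranking compresses the earlier sums back to ranks, groups added later carry more weight.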
rank, in-
Data & Evaluation
Figure 2. Superimposed brush PDFs for two writers, and examples of an ”a” tail for two writers
Feature & Distance function Overview
and a known sample of
,
. Then a Borda rank combination scheme can be considered as a rank-combination
of a rank
guarantees that
according to values in , in ascending order. The dimension
, such that
(i.e, handwritten sample) its unique rank in the
be-
more, given that a vector of ranks will be denoted by , assume the availability of a rank operator
where
distance vectors
each dimension of the distance vectors corresponds to one and the same sample index. Further-
Figure 1. (left): A lower-case letter a with its tail stroke. (right): An example of detecting end strokes on the basis of a central inked pixel and a constrained ink and paper run-length configuration around the window border (actually 15x15 pixels).
, each
feature groups describing a sample, we
A number of feature groups have been selected for this experiment, on the basis of the literature and earlier work on on-line writer identification. Complementarity of the extracted information across the feature groups was an important design goal.
Table 1: Feature groups used for writer identification, and the distance function used between two samples. Colors correspond to performance-curve colors in Figure 6.
and a universe of samples of known writer identity
, and assuming there exist
value uniquely refers to a sample in
Method
Forensic writer search is similar to Information Retrieval, yielding a hit list, in this case of suspect documents, given a query in the form of a questioned script sample. Given the requirements, simple nearest-neighbour search is a viable solution. However, a proper distance function has to be identified. For the combination of results, rank combination (Borda) will be tested.
Borda Rank-Combination Schemes
For feature vectors which are PDFs, the χ² distance measure is mostly the natural choice.
In multiple feature groups where trained parametric combination cannot be applied, a sequential Borda approach which gives more weight to the better feature groups can be useful.
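A minimal sketch of a χ² distance between two PDF feature vectors follows. It uses one common symmetric form, summing (p − q)² / (p + q) over the bins and skipping bins that are empty in both vectors. The exact normalization used for the poster's results is not stated, so this form and the name `chi_square_distance` are assumptions.

```python
def chi_square_distance(p, q):
    """Symmetric chi-square distance between two equally long histograms."""
    total = 0.0
    for a, b in zip(p, q):
        if a + b > 0:  # bins empty in both histograms contribute nothing
            total += (a - b) ** 2 / (a + b)
    return total


same = chi_square_distance([0.5, 0.5], [0.5, 0.5])
disjoint = chi_square_distance([1.0, 0.0], [0.0, 1.0])
```

Compared with the Euclidean distance, this form gives differences in sparsely filled bins relatively more weight, which is one reason it is often chosen for normalized histograms.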
Figure 5. Computation of the horizontal (or vertical) edge-angle co-occurrence
f8: Writer characteristics (WR) is a ’pseudo’ feature vector, containing writer parameters which are often known in the application context (handedness, sex, age, style). Style may be one of Handprint, Cursive or Mixed. The parameters are represented as a bit vector. This feature is added to underscore the possibility of using heterogeneous sources of information in a rank-combination scheme.
References
[1] L.R.B. Schomaker and R. Plamondon, “The Relation between Pen Force and Pen-Point Kinematics in Handwriting,” Biological Cybernetics, vol. 63, pp. 277–289, 1990.
[2] D.S. Doermann and A. Rosenfeld, “Recovery of temporal information from static images of handwriting,” in Proc. of IEEE Conference on Computer Vision and Pattern Recognition, 1992, pp. 162–168.
[3] M. Bulacu, L. Schomaker, and L. Vuurpijl, “Writer identification using edge-based directional features,” in Proc. of ICDAR 2003, 2003, pp. 937–941.
[4] L.R.B. Schomaker and L.G. Vuurpijl, “Forensic writer identification [internal report for the Netherlands Forensic Institute],” Tech. Rep., Nijmegen: NICI, 2000.
[5] M. Bulacu and L. Schomaker, “Writer style from oriented edge fragments,” in Proc. of the 10th Int. Conference on Computer Analysis of Images and Patterns, 2003, pp. 460–469.
Acknowledgements: Thanks to the Netherlands Forensic Institute (NFI), which enabled us to set up a large data collection; to Katrin Franke of the Fraunhofer Institut (IPK), whose expertise in background removal revived the problem of writer identification in our lab; and to Louis Vuurpijl (NICI) for the data collection and preparatory research. This poster was presented at ICIP’2003, Barcelona.