Multiclass Learning for Writer Identification using Error-Correcting Codes

Utkarsh Porwal*, Chetan Ramaiah*, Ashish Kumar† and Venu Govindaraju*
*Department of Computer Science and Engineering, State University of New York at Buffalo, Amherst, NY 14260
Email: utkarshp, chetanra, [email protected]
†Indian Institute of Technology (BHU), Varanasi, India
Email: [email protected]

Abstract—Writer identification can be seen as a multi-class learning problem in which the writers are the classes. One of the fundamental approaches to solving a multi-class problem is to break it into binary classification tasks. In this work we propose a generic approach to multi-class classification using an ensemble of binary classifiers. We assign a distributed output representation to each class in the form of a codeword, and an ensemble of binary classifiers is created in which each classifier predicts one bit of the codeword. The actual label is determined by running the belief propagation algorithm on a graph constructed from the code matrix. We report experiments on the new publicly available IBM-UB-1 dataset for the task of writer identification to show the efficacy of our method.
I. INTRODUCTION
In writer identification, an automatic identification system identifies the writer of a handwritten document from a predefined list of writers. It is an important problem with many applications, for example in forensics and in handheld devices such as smart phones and tablets. The problem has drawn the interest of the research community both for its wide applicability and for its inherent challenges. To identify the writer of a handwritten text, one needs to learn information pertaining to the text as well as to the writer, and a great deal of work has been done by various research groups in the past. A typical approach is to extract relevant features and then learn a model that can predict the writer of future handwritten text. In this setting each writer can be considered a class, and the task of writer identification can be seen as a multi-class learning problem: the system should learn the attributes that distinguish writers (classes) and should be able to classify any new handwritten test sample (data point). In this work we propose an approach that treats writer identification as a multi-class learning problem. Our approach is inspired by communications, where a message is transmitted through a channel that can be noisy. Due to the noise, the message received at the other end may be corrupted. To overcome this problem, the message is encoded as a binary string known as a codeword, and some redundancy is introduced at the sender's end; this is called encoding the message. Because of the noise in the channel, some bits of the received message may be flipped. At the receiver's end, the encoded message is decoded using error-correcting code techniques, and the original message can be recovered because of the redundancy induced
before transmitting. Writer identification can also be viewed as a communication problem where identity of writer of any test sample is corrupted because of the noisy channel. The channel in this case consists of features extracted, algorithm used and training samples. The noise in this channel is because of several factors such as inductive bias of the learner. Features extracted might not be capturing all the information needed or training samples may not be sufficient. Therefore, we attempt to solve this problem using similar approach of error correcting output codes as shown by Dietterich et al.[1] in their seminal work. TABLE I. Classes 1 2 3 4 5
A CODE MATRIX FOR n = 5 AND k = 8
0 0 0 1 1
0 1 1 1 1
0 0 0 0 0
Codeword 0 0 1 0 0 1 0 1 1 0
0 1 0 1 0
0 1 1 0 0
1 0 0 1 0
In this approach a binary codeword of length k is assigned to each of the n classes, generating an n×k code matrix, and k binary classifiers are learned, one predicting the bit at each position for every test sample. Once the k binary classifiers have generated a k-dimensional probability vector, its distance to each of the n binary codewords is calculated and the test sample is assigned to the closest class (a minimal sketch of this pipeline follows at the end of this section). It should be noted that the measure of distance can vary, and the performance of this ensemble of binary classifiers is heavily contingent on the design of the n×k code matrix; these issues are discussed in detail in later sections. The organization of the paper is as follows. Section 2 provides an overview of related work on writer identification for handwritten documents. Section 3 illustrates the underlying principle of error-correcting output codes and explains the distributed output representation and the ensemble of binary classifiers. Section 4 outlines the proposed approach, explaining the code matrix design and the decoding process. Section 5 introduces the new IBM-UB-1 dataset, which can be used for various document recognition and retrieval tasks. Section 6 describes the experimental setup, including the features and classifiers used. Section 7 outlines the conclusion and future work.
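To make the pipeline concrete, the following is a minimal sketch of ECOC training and nearest-codeword decoding on synthetic data. It is an illustration, not the paper's exact pipeline: the random forest size, the synthetic data, and the L1 decoding distance are our assumptions.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def train_ecoc(X, y, M):
    """Train one binary classifier per column of the n x k code matrix M.

    y holds class ids 0..n-1; classifier l learns to predict bit M[y, l].
    """
    return [RandomForestClassifier(n_estimators=50).fit(X, M[y, l])
            for l in range(M.shape[1])]

def predict_ecoc(X, M, classifiers):
    """Assign each sample to the codeword nearest (in L1 distance) to its
    k-dimensional probability vector."""
    # column l: probability that bit l is 1 (assumes both bit values occur in column l)
    H = np.column_stack([clf.predict_proba(X)[:, 1] for clf in classifiers])
    D = np.abs(H[:, None, :] - M[None, :, :]).sum(axis=2)  # samples x classes
    return D.argmin(axis=1)

# usage on synthetic data: 5 classes with random length-8 codewords (same shape as Table I)
rng = np.random.default_rng(0)
y = rng.integers(0, 5, size=200)
X = rng.normal(size=(200, 16)) + y[:, None]       # class-dependent shift
while True:                                       # reject constant columns
    M = rng.integers(0, 2, size=(5, 8))
    if all(0 < M[:, l].sum() < 5 for l in range(8)):
        break
classifiers = train_ecoc(X, y, M)
print((predict_ecoc(X, M, classifiers) == y).mean())  # training accuracy
```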
II. RELATED WORK
Error-correcting code based approaches have been used in many applications[1][2][3][4], but to the best of our knowledge they have never been used for writer identification. State-of-the-art techniques for writer identification in offline handwritten documents have mainly focused on feature-based techniques, which can be broadly divided into two approaches. The first approach is text dependent[5], in which the features encapsulate the characteristics of the writer based on the same text content written by different writers. This approach is not extensible, as identical content written by all authors is seldom available. The second approach is based on text-independent features. It captures writer-specific properties, such as slant and loops, which are independent of the text written, and is better suited to real-world scenarios because it scales. Feature selection plays an important role in such techniques, and different aspects of handwritten text have been tried. Said et al.[6] considered each writer's handwriting as a different texture and applied texture recognition algorithms, such as multi-channel Gabor filtering, to the task of writer identification. Zois et al.[7] used morphological features and needed only a single word for identification, whereas Niels et al.[8] used allographic features for comparison. Bensefia et al.[9] performed writer identification by clustering the graphemes produced by a segmentation procedure. Likewise, statistical analysis of several features has been carried out, for example using the edge-hinge distribution[10]. Bulacu et al.[11][12] proposed two edge-based directional features to capture changes in the direction of writing samples.
III. ERROR CORRECTING CODES
In this section we define the method formally and discuss the details. Any learning problem can be defined as finding the target function f that produces the label y ∈ Y for a data point x ∈ X. The learning algorithm selected to obtain the target function produces a hypothesis h ∈ H closest to the target function by minimizing the error

err_Ω(h) := Pr_{(x,y)∼Ω} [h(x) ≠ y]    (1)

over a finite number of data points (x_i, y_i) drawn from some unknown probability distribution Ω. In the error-correcting output code approach, for an n-class problem (n > 2) a code matrix M ∈ {0, 1}^{n×k} is generated for some k. Each row r of the matrix is associated with one of the n labels y ∈ Y, where |Y| = n. Each class is thus assigned a binary codeword of length k, and k binary classifiers are learned, one per bit position. Each binary classifier generates the probability of its bit position being one, yielding a k-dimensional probability vector h(x) = (h_1(x), h_2(x), ..., h_k(x)), where h_l : X → R for l ∈ [1, k]. Note that the labeled samples provided to the binary classifiers are of the form (x, M(y, l)), as opposed to the form (x, y) used by a multi-class classifier. This is the encoding step, in which the actual class labels are encoded with binary code vectors. Once the probability vector is obtained, the next step is decoding: the actual class of the test data point must be decoded from the obtained probability vector. Different decoding techniques have been tried in the past. The most common is to calculate the distance of the test data point from all the class codewords and assign the data point to the class at the shortest distance. Here the distance measure d can be considered a collective loss of all the binary classifiers, i.e., the collective loss over all bit positions:

d_L(M(y, l), h(x)) = Σ_{l=1}^{k} L(M(y, l), h_l(x))    (2)

where L is a loss function. Different kinds of loss functions have been tried, such as the Hamming distance, the L1 and L2 norms, and the exponential loss[1][3]. The error-correcting output code approach can be seen as a unifying framework that includes the existing approaches to multi-class classification. Allwein et al.[3] showed that existing approaches such as one-against-all[13] and all-pairs[14] are special cases of the error-correcting output code framework. They defined the code matrix as M ∈ {−1, 0, +1}^{n×k}, where an entry M(y, l) of zero signifies that how binary classifier h_l classifies samples with label y is not important; in other words, classifier h_l does not care about data points with label y. In this framework the one-against-all approach is the special case where the code matrix M is an n × n matrix with all diagonal elements +1 and all other elements −1. The all-pairs approach is another special case, where M is an n × (n choose 2) matrix in which each column represents a distinct pair of classes (rows) (r_1, r_2): in that column, row r_1 has the entry +1, row r_2 has the entry −1, and all other rows have 0.
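As an illustration of these two special cases, here is a minimal sketch of the corresponding code matrix constructors (the function names are ours):

```python
import numpy as np
from itertools import combinations

def one_vs_all_matrix(n):
    """n x n code matrix: +1 on the diagonal, -1 everywhere else."""
    return 2 * np.eye(n, dtype=int) - 1

def all_pairs_matrix(n):
    """n x (n choose 2) code matrix: each column opposes one pair of
    classes (+1 and -1); every other class is 0 ('don't care')."""
    pairs = list(combinations(range(n), 2))
    M = np.zeros((n, len(pairs)), dtype=int)
    for col, (r1, r2) in enumerate(pairs):
        M[r1, col], M[r2, col] = 1, -1
    return M

print(one_vs_all_matrix(3))   # [[ 1 -1 -1] [-1  1 -1] [-1 -1  1]]
print(all_pairs_matrix(3).T)  # columns (1,-1,0), (1,0,-1), (0,1,-1)
```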
IV. PROPOSED APPROACH

A. Designing Code Matrix
In the proposed approach the set of all writers Y is divided into different subsets, each representing a super-class of writers. These super-classes are learned by an ensemble of base binary classifiers. The division of the writer set Y into k subsets (Y_1, Y_2, ..., Y_k) is such that each writer in Y can be uniquely represented by the intersection of the subsets it belongs to. Since each super-class is learned by a base binary classifier, it is important that the columns of the matrix be well separated, so that the ensemble of base classifiers captures different patterns: if two columns are identical or exactly complementary, the base classifiers corresponding to them will learn the same discriminant function. Another important property is writer separation. Intuitively, the codewords (rows) representing the writers must differ from one another. The primary reason is that error-correcting techniques work because of injected redundancy: as discussed in earlier sections, extra bits are added to the message before transmission so that, if errors occur, the actual message can still be recovered. Moreover, if the minimum
Hamming distance between any pair of codewords is d, the code can correct at least ⌊(d−1)/2⌋ errors. Therefore, the rows of the code matrix should have a good Hamming distance d between them (a sketch of these design checks follows below).
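A minimal sketch of the two design checks just described, for {0, 1} code matrices: minimum pairwise row distance, and degenerate columns (constant, duplicated, or complementary). The helper names are ours.

```python
import numpy as np
from itertools import combinations

def min_row_distance(M):
    """Minimum Hamming distance d between any pair of codewords (rows);
    the code corrects at least floor((d - 1) / 2) bit errors."""
    return min(int(np.sum(M[i] != M[j]))
               for i, j in combinations(range(len(M)), 2))

def degenerate_columns(M):
    """Indices of columns whose base classifier would learn nothing new:
    constant columns, and columns equal or complementary to an earlier one."""
    n, k = M.shape
    bad = {l for l in range(k) if len(np.unique(M[:, l])) == 1}
    for a, b in combinations(range(k), 2):
        if np.array_equal(M[:, a], M[:, b]) or np.array_equal(M[:, a], 1 - M[:, b]):
            bad.add(b)
    return sorted(bad)
```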
B. Decoding: Belief Propagation

Once the vector of confidence scores of all the base classifiers is obtained, the next step is to decode the actual writer of the data point. In the previous section we gave distance as a measure that minimizes the collective loss of all the binary classifiers. In computing this distance it was assumed that all classifiers have the same degree of accuracy in generating the confidence score. However, Guruswami et al.[15] argued that the performance of the ensemble of binary classifiers is reduced if the error correlation between the binary classifiers is high. They proposed an approach that weights the scores of the individual classifiers, called the maximum likelihood decoding scheme. In this scheme the probability of error p_i is calculated for every binary classifier at training time, and a vector h(x) = (h_1(x), h_2(x), ..., h_k(x)) with h_l : X → {−1, +1} is generated for any test data point x. If E is the set of positions where M(y, l) ≠ h_l(x) for l = 1, ..., k, then the new loss function for h_l : X → R is

L(M(y, l), h(x)) = ∏_{i∈E} p_i h_i(x) ∏_{i∉E} (1 − p_i) h_i(x)    (3)

In this work we look at the problem slightly differently. As we have seen, different subsets of writers are created, representing super-classes learned by the base classifiers. Since each writer class interacts with other writer classes as part of the different subsets, it is natural to leverage this interaction to obtain the actual target class for any test data point. We need a mechanism in which different classes interact with each other and express their confidence in each other, so as to determine the strongest candidate. This is achieved using the belief propagation (BP) algorithm[16][17], which can be used to determine the marginals of all the nodes in a graph. All the writers are considered nodes in a graph, and their marginals reflect the probability of the test data point having been written by each writer. This graph can be obtained directly from the code matrix: two writers are connected if they share the same super-label when presented to a binary classifier, and the edge weights are obtained by counting the number of times two writers share super-labels. Formally, the n × n adjacency matrix A is obtained as

∀i, j ∈ [1, n] : A(i, j) = Σ_{l=1}^{k} [M(i, l) = M(j, l)]    (4)

where [·] is 1 if the condition holds and 0 otherwise. Once the graph is constructed, the marginals can be calculated using the BP algorithm. In this work we implement a pairwise MRF model in which only unary and pairwise factors are considered. We take the values in the adjacency matrix as pairwise potentials, while the unary factors, i.e., the node potentials, are obtained by taking the dot product of every class codeword with the obtained probability vector:

∀i ∈ [1, n] : U(i) = ⟨M(i, :) · h(x)⟩    (5)

However, this is not the only way to calculate the unary potentials; they can also be calculated using different distance measures:

∀i ∈ [1, n] : U(i) = ‖M(i, :) − h(x)‖_p    (6)

where ‖·‖_p is the p-norm. Once the node potentials and pairwise potentials are in place, nodes can start passing messages expressing their belief in the marginal values of their neighbors:
m_new(i, j) = Σ_{x_i} A(i, j) × U(i) ∏_{k∈nbd(i)\j} m_old(k, i)    (7)
where nbd(i) denotes the neighbors of node i and m(i, j) is the message from node i to node j. Once the message passing converges, the marginals can be obtained as
P(i) ∝ U(i) ∏_{k∈nbd(i)} m(k, i)    (8)
If the graph has loops, this method approximates the marginals. Thus, using the BP algorithm we can obtain the probability of a test data point belonging to each of the writers, and these probabilities are computed while taking into account the effect of the different classes when grouped together.
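The following is a minimal numpy sketch of this decoding under a literal reading of Eqs. (4)-(8). The scalar message scheme, iteration count, and normalization are our assumptions, not the paper's implementation.

```python
import numpy as np

def bp_decode(M, h, n_iters=20, eps=1e-12):
    """Approximate writer marginals by loopy belief propagation on the
    writer graph built from the n x k binary code matrix M, given the
    k-dim vector h of base-classifier scores for one test sample."""
    n, k = M.shape
    # Eq. (4): edge weight = number of columns where writers i, j share a label.
    A = sum(np.equal.outer(M[:, l], M[:, l]).astype(float) for l in range(k))
    np.fill_diagonal(A, 0.0)
    # Eq. (5): unary potential as dot product of each codeword with the scores.
    U = M @ h
    nbrs = A > 0
    msgs = np.ones((n, n))  # msgs[i, j]: message from writer i to writer j
    for _ in range(n_iters):
        new = np.ones((n, n))
        for i, j in zip(*np.nonzero(nbrs)):
            # Eq. (7): collect messages into i from all neighbors except j.
            incoming = np.prod(msgs[nbrs[:, i], i]) / msgs[j, i]
            new[i, j] = A[i, j] * U[i] * incoming
        msgs = new / new.max()  # rescale to avoid overflow; use logs at scale
    # Eq. (8): the marginal of writer i gathers all of its incoming messages.
    P = U * np.array([np.prod(msgs[nbrs[:, i], i]) for i in range(n)])
    return P / (P.sum() + eps)
```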
V. DATASET
We conducted experiments on the publicly available IBM-UB-1[18] dataset, a bimodal dataset of online and offline handwritten documents containing two kinds of pages. The first kind, called summary pages, contains text written in the form of letters or cursive prose, as shown in Fig. 1. The second kind, called query pages, contains up to 25 words per page, as shown in Fig. 2. Every writer who wrote summary pages on a topic also wrote query pages describing the content in a few words, so there is a mapping between summary pages and query pages. The dataset is one of a kind in that it can be used for a variety of document recognition and retrieval tasks; for instance, the summary pages can act as a database of handwritten documents while the query pages are used to build a handwritten document retrieval engine. In this work we used only the summary pages, for the task of writer identification: 3714 summary pages written by 41 writers, scanned at 300 dpi as grayscale PNG images. More information about the dataset is available at [18].
TABLE II. PERFORMANCE OF THE WRITER IDENTIFICATION SYSTEM FOR DIFFERENT KINDS OF CODE MATRIX (accuracy, %)

Matrix Type           Fold 1        Fold 2        Fold 3        Fold 4        Fold 5        Fold 6        Avg
                      top   top 3   top   top 3   top   top 3   top   top 3   top   top 3   top   top 3   top   top 3
k = 100               68.75 83.45   59.35 76.44   65.00 78.67   67.43 81.76   71.34 85.82   69.39 84.98   66.87 81.85
Dense (k = 53)        63.38 77.08   55.85 70.80   60.18 73.92   62.79 77.50   65.71 80.15   64.50 78.48   62.06 76.32
Sparse (k = 80)       62.28 78.35   53.80 71.70   58.88 73.93   61.61 76.77   66.03 81.35   64.75 80.01   61.23 77.01
One vs all (k = n = 41) 60.47 75.94 51.67 69.66   56.69 73.27   60.67 75.84   63.59 79.36   62.54 77.79   59.27 75.31
Baseline:
kNN (k = 3)           54.26 -       49.22 -       54.47 -       54.90 -       57.75 -       57.06 -       54.61 -
Naive Bayes           52.41 -       45.28 -       48.56 -       49.97 -       52.67 -       54.38 -       50.55 -

Fig. 1. A summary page in the IBM-UB-1 dataset.

Fig. 2. A query page in the IBM-UB-1 dataset.

VI. EXPERIMENTS

In this work we used four kinds of random code matrices. The number of classes (n = 41) equals the number of writers in the dataset. In the first experiment we used k = 100 columns (chosen empirically), since a larger number of classifiers in the ensemble should result in better performance. In the second experiment we used a dense random code matrix M ∈ {−1, +1}^{n×k} with k = 10 log2(n). In the third experiment we used a sparse code matrix M ∈ {−1, 0, +1}^{n×k} with k = 15 log2(n), where each element of the matrix is 0 with probability 1/2 and −1 or +1 each with probability 1/4. In the last experiment we measured the performance of the system using the one-against-all approach, which is a special case of the error-correcting code framework. The results of all the experiments are shown in Table II. Experiments were conducted on different folds of the dataset; a sketch of the code matrix generation follows below.
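A minimal sketch of how the dense and sparse random code matrices just described can be generated. Truncating k to an integer reproduces the k = 53 and k = 80 reported in Table II; the truncation and the seeding are our assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 41  # number of writers

def dense_code_matrix(n, rng):
    """Dense random matrix M in {-1, +1}^(n x k) with k = 10 log2(n)."""
    k = int(10 * np.log2(n))  # = 53 for n = 41
    return rng.choice([-1, 1], size=(n, k))

def sparse_code_matrix(n, rng):
    """Sparse random matrix M in {-1, 0, +1}^(n x k) with k = 15 log2(n);
    each entry is 0 w.p. 1/2 and -1 or +1 each w.p. 1/4."""
    k = int(15 * np.log2(n))  # = 80 for n = 41
    return rng.choice([-1, 0, 1], size=(n, k), p=[0.25, 0.5, 0.25])

M_dense, M_sparse = dense_code_matrix(n, rng), sparse_code_matrix(n, rng)
# In practice one would also reject degenerate columns and near-duplicate
# rows, per the design guidelines of Section IV-A.
```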
First, rule lines are removed, and then the handwritten text lines in the pages are extracted using the method proposed by Shi et al.[19]. After line segmentation, features are extracted from each line image by dividing the image into 4×4 frames and extracting GSC features[20] from them. Each frame of the line image yields a 32-dimensional feature vector: the first 12 dimensions are gradient features, the next 12 structural features, and the last 8 concavity features. Therefore, each line image yields a 512-dimensional feature vector in which the first 192 dimensions correspond to gradient features, the next 192 to structural features, and the last 128 to concavity features (see the sketch after this paragraph). We used Random Forests as the base classifiers in all experiments. We calculated the unary potentials by taking the dot product of the codewords with the obtained probability vector; we also tried methods based on different distance measures, but the results did not vary much. In the last experiment we used the unary potentials directly as marginals, because in the one-against-all approach each subset of writers contains only one writer. As a baseline we compared our performance with two of the most common classifiers, kNN and Naive Bayes, using k = 3 for kNN and assuming a multivariate multinomial distribution on the data for Naive Bayes.
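For concreteness, here is a sketch of how the per-frame features could be assembled into the 512-dimensional line descriptor. The actual GSC extraction is described in [20]; the frame array layout assumed here is ours.

```python
import numpy as np

def assemble_line_features(frame_features):
    """Concatenate per-frame GSC features into one 512-dim line descriptor.

    frame_features: a 4 x 4 x 32 array, one 32-dim vector per frame, with
    dims 0-11 gradient, 12-23 structural, 24-31 concavity (per the text).
    """
    f = np.asarray(frame_features).reshape(16, 32)
    gradient = f[:, :12].ravel()      # 16 * 12 = 192 dims
    structural = f[:, 12:24].ravel()  # 16 * 12 = 192 dims
    concavity = f[:, 24:].ravel()     # 16 *  8 = 128 dims
    return np.concatenate([gradient, structural, concavity])  # 512 dims
```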
We report the accuracy of identifying the writer of the text (top) and the accuracy among the top 3 choices of the identification system (top 3). The results support the argument for using a large number of classifiers, as the best performance is obtained in the first experiment (k = 100). However, performance is comparable between the dense and sparse matrix experiments, despite the sparse matrix experiment having more classifiers in the ensemble. This can be explained by the fact that more writers are selected per subset in the dense matrix case: the interaction between writers is richer for the dense matrix, whereas the sparse matrix selects fewer writers per subset, which limits their interaction. This demonstrates the strength of the belief propagation algorithm for decoding: since it encapsulates the interaction between writers for every base classifier, it can decode the right information even with less redundancy, provided the interaction between the writers is rich.

VII. CONCLUSION AND FUTURE WORK
In this paper we proposed an approach to writer identification using error-correcting output codes. We conducted experiments on the publicly available IBM-UB-1 dataset for the task of writer identification using different random code matrices, generated by following the design guidelines for good code matrices. We used the belief propagation algorithm for decoding, which relaxes the assumption that every base classifier has the same degree of accuracy in generating its confidence score. In the future we would like to try real-valued code matrices instead of discrete ones. There is substantial scope in this approach of distributed output representation, and the results obtained can be improved further by gaining more insight into the structure of the data.
REFERENCES

[1] T. G. Dietterich and G. Bakiri, "Solving multiclass learning problems via error-correcting output codes," J. Artif. Int. Res., vol. 2, no. 1, pp. 263–286, Jan. 1995. [Online]. Available: http://dl.acm.org/citation.cfm?id=1622826.1622834
[2] S. Escalera, D. M. J. Tax, O. Pujol, P. Radeva, and R. Duin, "Subclass problem-dependent design for error-correcting output codes," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 30, no. 6, pp. 1041–1054, 2008.
[3] E. L. Allwein, R. E. Schapire, and Y. Singer, "Reducing multiclass to binary: a unifying approach for margin classifiers," J. Mach. Learn. Res., vol. 1, pp. 113–141, Sep. 2001. [Online]. Available: http://dx.doi.org/10.1162/15324430152733133
[4] S. Escalera, O. Pujol, and P. Radeva, "On the decoding process in ternary error-correcting output codes," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 32, no. 1, pp. 120–134, 2010.
[5] S. Al-Maadeed, "Text-dependent writer identification for Arabic handwriting," Journal of Electrical and Computer Engineering, vol. 2012, p. 8, 2012.
[6] H. Said, K. Baker, and T. Tan, "Personal identification based on handwriting," in Proc. Fourteenth International Conference on Pattern Recognition, vol. 2, Aug. 1998, pp. 1761–1764.
[7] E. N. Zois and V. Anastassopoulos, "Morphological waveform coding for writer identification," Pattern Recognition, vol. 33, no. 3, pp. 385–398, 2000.
[8] R. Niels, L. Vuurpijl, and L. Schomaker, "Introducing trigraph: trimodal writer identification," in Proc. European Network of Forensic Handwriting Experts, 2005.
[9] A. Bensefia, T. Paquet, and L. Heutte, "A writer identification and verification system," Pattern Recognition Letters, pp. 2080–2092, 2005.
[10] L. van der Maaten and E. Postma, "Improving automatic writer identification," in Proc. 17th Belgium-Netherlands Conference on Artificial Intelligence (BNAIC 2005), 2005, pp. 260–266.
[11] M. Bulacu and L. Schomaker, "Text-independent writer identification and verification using textural and allographic features," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 29, no. 4, pp. 701–717, Apr. 2007.
[12] M. Bulacu, L. Schomaker, and L. Vuurpijl, "Writer identification using edge-based directional features," in Proc. Seventh International Conference on Document Analysis and Recognition (ICDAR '03), vol. 2. Washington, DC, USA: IEEE Computer Society, 2003, pp. 937–. [Online]. Available: http://dl.acm.org/citation.cfm?id=938980.939525
[13] D. E. Rumelhart, G. E. Hinton, and R. J. Williams, "Learning internal representations by error propagation," in Parallel Distributed Processing: Explorations in the Microstructure of Cognition, vol. 1, D. E. Rumelhart, J. L. McClelland, and the PDP Research Group, Eds. Cambridge, MA, USA: MIT Press, 1986, pp. 318–362. [Online]. Available: http://dl.acm.org/citation.cfm?id=104279.104293
[14] T. Hastie and R. Tibshirani, "Classification by pairwise coupling," 1998.
[15] V. Guruswami and A. Sahai, "Multiclass learning, boosting, and error-correcting codes," in Proc. Twelfth Annual Conference on Computational Learning Theory (COLT '99). New York, NY, USA: ACM, 1999, pp. 145–155. [Online]. Available: http://doi.acm.org/10.1145/307400.307429
[16] K. Mimura, F. Cousseau, and M. Okada, "Belief propagation for error correcting codes and lossy compression using multilayer perceptrons," CoRR, vol. abs/1102.1497, 2011.
[17] K. P. Murphy, Y. Weiss, and M. I. Jordan, "Loopy belief propagation for approximate inference: an empirical study," CoRR, vol. abs/1301.6725, 2013.
[18] IBM-UB-1 Data Set. [Online]. Available: http://www.cubs.buffalo.edu/hwdata/
[19] Z. Shi, S. Setlur, and V. Govindaraju, "A steerable directional local profile technique for extraction of handwritten Arabic text lines," in ICDAR, 2009, pp. 176–180.
[20] J. Favata, G. Srikantan, and S. Srihari, "Handprinted character/digit recognition using a multiple feature/resolution philosophy," in Proc. Fourth International Workshop on Frontiers of Handwriting Recognition, 1994.