INTERDISCIPLINARY SCIENCE REVIEWS, Vol. 34 No. 2, June, 2009, 228–239

Where are the Search Engines for Handwritten Documents?

Tijn van der Zant, AI Dept., University of Groningen, The Netherlands

Sveta Zinger VCA Dept., Technical University of Eindhoven, The Netherlands

Lambert Schomaker AI Dept., University of Groningen, The Netherlands

Henny van Schie National Archive, The Hague, The Netherlands

Although the problems of optical character recognition for contemporary printed text have been resolved, for historical printed and handwritten connected-cursive text (i.e. western style writing) they have not. This does not mean that scanning historical documents is not useful. This article describes our research on retrieving digitized handwritten documents containing the information that the user is looking for, a task that is essential for optimizing the archive's work. We investigated how to process historical documents and their transcriptions so that a supercomputer could learn how to read. We applied artificial intelligence techniques to a large amount of image data and created a search engine. Our methods often require that the computer learns in interaction with a human. We have studied the requests of archive users in order to bring our research as close as possible to current information needs. Our system learns continuously, allowing constant improvement of the search results. User requests stimulated us to delve into an unsolved topic: searching for the most elusive knowledge in text, namely the names of people and places. The solutions are described in this article.

© Institute of Materials, Minerals and Mining 2009. Published by Maney on behalf of the Institute. DOI 10.1179/174327909X441126

1 Introduction
SCRATCH — SCRipt Analysis Tools for the Cultural Heritage — is an interdisciplinary project aiming at information retrieval from historical handwritten documents. Our collaborative research is intended to unite the

state-of-the-art achievements in pattern recognition, artificial intelligence and computational linguistics for the creation of a search engine for historical handwritten documents. The Dutch National Archive stores eight floors of handwritten documents; the total length of the bookshelves is about 100 kilometres. This rather large amount of handwritten documents is difficult, if not impossible, to search through, and serendipity plays a role in the finding of interesting historical documents. In the Netherlands alone, the total length of bookshelves with handwritten documents is estimated to be 700 km, give or take a few. Most of these archives are state archives and the collections can be many hundreds of years old. We work with the royal Dutch archive of the 'Cabinet of the Queen' (Kabinet der Koningin), containing royal decrees and decisions from the last centuries that were approved, or disapproved, by the Queen. Although all the methods in this article are scalable to large collections, we only had access to a single book. This means that we have little end-user feedback, since searching in a single book is not so exciting. There are about 180,000 people in the Netherlands doing research into their historical backgrounds, about one per cent of the population, who are desperately in need of digitized and searchable archives. We are developing 'googling' procedures for the collections of the Dutch National Archive, as part of a large national project (NWO/CATCH and MORPH) for improving access to cultural heritage information.

Automatic Handwriting Recognition (HWR) systems still face problems. Reading a randomly selected connected-cursive piece of text in an open context is not yet possible. The successful systems act on simpler forms of script, on restricted content (addresses and written bank drafts), or on a single writer. Our goal is to develop generic mechanisms that provide a reasonable to good performance on a wide range of script types.
Often, machine-learning techniques such as Hidden Markov Models (Rabiner 1989) or a combination of smart preprocessing and template matching (Mitoma et al. 2004) are used for HWR research. The focus of our research is different from those approaches. We do pay attention to novel ways of detecting what is written, but we think it is more important that users are able to train the computer systems doing the recognition, in order to improve the quality of the HWR system. The next section gives an introduction to the data collection and the difficulties that people face while working with historical handwritten texts. We then describe what kind of user requests arrive at the archive, followed by a general introduction to (our interpretation of) machine learning. Then, in order to give a feeling for the difficulties that computers face when deciphering the text, we explain what has to be done on the image processing side. This consists of several procedures which can roughly be put in three categories: preprocessing, feature extraction and classification. The next section explains the feedback procedures from the end users that retrain the classification software in order to obtain a quick bootstrapping. This speeds up the entire process and allows us to search for the more elusive items such

as family names, locations and even categories. Then we describe how the research we do correlates with the user requests. The research described in this article is work in progress. There are working demonstrations of the technologies that are being used, as well as interactive websites that people use for the transcriptions and training. On the other hand, there is no system that is available to the public (yet).

2 The 'Cabinet of the Queen'
The 'Cabinet of the Queen', in Dutch called 'Kabinet der Koningin' (KdK), is an administrative collection that started in the late eighteenth century and is still being updated. Most of this collection is handwritten cursive text. An example is shown in Figure 1. Usually a clerk writes for several years and the handwriting is reasonably stable. The collection consists of many hundreds of books in which entries to the actual documents are inscribed. The documents are stored, usually in boxes, in an 8-floor building with special atmospheric conditions to enhance their preservation. The total length of the shelves with these boxes is about three and a half kilometres. To find something in this collection means going through the entry books. Anyone dealing with this collection will have a very hard time finding anything. It is less difficult if one knows in what year the particular event happened, but even that still requires reading two and a half thousand pages. Any other type of search is nearly impossible. Because it is so difficult to search the KdK, the Dutch National Archive would like to digitize the collection in order to make it publicly available. But since there is no (commercial) software that can actually 'read' the text and transcribe it, which is needed to create a search engine, it is difficult to obtain funding for the digitization. It does not make sense if the user has to sift through hundreds of thousands of pages manually. On the other hand, the researchers need a lot of digitized and transcribed material if the machine learning software is to make anything out of the images. This ranges in the order of

figure 1 An example of the Royal Dutch archive of the Cabinet of the Queen.

hundreds of instances per word, which is unrealistic and would make the archive accessible only through the most frequently used words, which are the least informative. But to obtain so many instances per word, a lot of transcription has to be done, and this requires man-hours which are not available. This deadlock had to be overcome, and it is addressed in the CATCH/SCRATCH research project in the Netherlands.

3 User requests


The National Archive receives many queries from people and organizations searching for information. Looking for information in an archive is time consuming and often requires considerable effort from archivists. We explored the information retrieval tasks that the archive receives in order to identify what kind of information people would like to extract. Several thousand emails, mostly in English and Dutch, have been received by the National Archive since 2001. With the help of a domain expert — an archivist — we classified these queries and calculated statistics from them. Figure 2 shows our results for the queries written in English to the National Archive in the period from 2001 to 2005. Figure 3 shows that one question can contain several classes of queries. Our classification of queries is similar to the concept categories presented in Constantopoulos et al. (2002). Figure 4 shows our results for the queries written in Dutch in the same period. Many questions to the archive written in English concern genealogical research and often do not contain much information about the person in question. That is why the names of people are the most common class of queries. The information needs of the users

figure 2 Classification of the queries in English received by the National Archive during a 5 year period (2001–2005). Total number of queries is 927.


figure 3 Example of a query to the National Archive.


figure 4 Classification of the queries in Dutch received by the National Archive during the period 2001–2005. Total number of queries is 8248.


who write in Dutch are somewhat broader than those of the users writing in English; therefore, we included two new classes of queries: organizations and other queries. By 'other queries', we mean queries that may not be easily processed by a document retrieval system: for example, a request for some statistics over one or more centuries of Dutch history. We notice that queries about names of people, dates and geographical names together compose approximately 70% of all information retrieval requests.

4 Machine Learning: learning to learn
Our view of this type of pattern recognition (HWR) problem has changed over the past few years. Typically, the problem is viewed as a retrieval task,

for example, finding the word 'Napoleon' or 'Springfield'. But if machine learning is applied, because modelling by hand is impossible, then the machine learning algorithms require large amounts of data. In optical character recognition, for example, typically 5000 instances per letter per font type are needed to train the system. Since every person has a different writing style, this problem becomes intractable for the standard type of machine learning algorithms: every person would have to write the same letter 5000 times in order to provide enough samples. Moreover, the computer has no idea where a letter begins, because most handwriting is connected, i.e. most letters are joined up to other letters. Other methods had to be found. The solution created by our research group is a multi-stage learning system in which humans interact with the learning algorithms. Instead of cutting a piece of handwriting into semi-random pieces and then annotating the pieces (thousands per word-piece class), we have adopted interactive methods. Having the users' requests as a way of steering our research has proven very valuable, since it has pushed our methods to the maximum. We also started experimenting with what is called 'one-shot learning'. This means that only one example of a handwritten word is available to retrieve other words that look like it. The hit list of likely candidates is not bad, but also not good enough to present to the end user. It is good enough, though, to present to users who are training the machine learning algorithms. These 'professional' users click, on the Internet, on the words that are the same as the query. This gives the machine learning algorithm a larger training set, which is then used to improve the retrieval algorithm automatically. With the improved algorithm, a new hit list can be generated, which in turn is checked again by a human.
After a few of these iterations, most of the instances of that word have been found in the handwritten text, using the best of the human world (quickly assessing the truth of the suggestions of the machine learning algorithm) and of the machine learning world (sifting through large amounts of data to assess the similarity between the query and the data set) (Schomaker 2008). Perhaps the most interesting part of the process is that we were driven by user requests. Typical users are not interested in words which are relatively easy to detect, i.e. words with a high frequency, such as the Dutch versions of: the, a, there, this, that, etc. Users want to search for family names, locations and dates. These words have a very low frequency and are the most difficult to find. Our multi-stage system allows us to find these more elusive items.
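The one-shot retrieval step can be illustrated with a small sketch: given the feature vector of a single example word image, candidate word images are ranked by distance in feature space to produce a hit list. The vectors and the `rank_candidates` helper are hypothetical; the article does not specify the actual distance measure used.

```python
import numpy as np

def rank_candidates(query_vec, candidate_vecs):
    """Return candidate indices ordered nearest-first.

    query_vec: feature vector of the single example word ('one shot').
    candidate_vecs: one feature vector per candidate word image.
    """
    dists = np.linalg.norm(candidate_vecs - query_vec, axis=1)
    return np.argsort(dists)

# Toy usage: candidate 2 is most similar to the query.
query = np.array([1.0, 0.0, 1.0])
candidates = np.array([
    [0.0, 1.0, 0.0],
    [1.0, 1.0, 1.0],
    [1.0, 0.1, 0.9],
])
hit_list = rank_candidates(query, candidates)
print(hit_list.tolist())  # → [2, 1, 0]
```

The top of such a hit list would then be shown to the 'professional' users for confirmation, rather than to end users.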

5 Image processing

5.1 Preprocessing
Image processing consists of several stages: preprocessing, feature extraction and classification. These steps are explained in some detail to give an idea of the artificial systems that process the images. The first step is preprocessing.

This is explained with reference to Figure 1. Although this illustration is one of the better looking ones, it can be seen that the border on the left and the top is a bit darker than the centre due to weathering. On the right, the shadow of the centre of the book can cause trouble during classification, simply because it is dark but not text, which makes it text-like for many algorithms. The steps we use in the preprocessing are:

• subtract red from the image to remove most of the red border
• use a smart algorithm to create a grey scale image which should leave the text dark and the background white(ish)
• adjust the threshold on the values of the pixels to make the text black and the background white
• cut off the left- and rightmost parts of the image in order to remove the shadow from the centre of the book and the stapled pages on the other side
• count the number of black pixels in a row to determine whether there is text in that row and where the centre of the text is for a certain line
• remove noise (for example, single dark pixels with white around them)
• cut out the line strip
• use heuristics to cut the line strip into word(-like) pieces
• annotate the line strips using the Internet.

This might sound like a clear and easy algorithm, but it is not. It is a heuristic, which means that it only makes sense to humans. Computers do not understand the words written in the preprocessing steps above. For example, the sentence 'cut out the line strip' describes a pattern recognition procedure that took months to develop, using statistics on pixel data, and even after years of refinement mistakes are still made while cutting the text automatically into line strips. The preprocessing steps described above build on decades of research in computerized image processing (Helsper et al. 1993). If no satisfactory algorithm exists, it takes months to years to develop one.
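The first few steps above can be sketched with basic array operations. This is a minimal illustration, not the project's actual code: the threshold values and the row-projection heuristic below are assumptions chosen for the toy example.

```python
import numpy as np

def preprocess(rgb):
    """Toy version of the first preprocessing steps.

    rgb: H x W x 3 image array with values in 0..255.
    Returns a binary ink mask (True = ink) and the indices of
    rows that contain enough ink to count as a text line.
    """
    rgb = rgb.astype(float)
    # Suppress the reddish border: pixels where red dominates are lightened.
    redness = np.clip(rgb[..., 0] - rgb[..., 1:].mean(axis=-1), 0, None)
    grey = rgb.mean(axis=-1) + redness
    # Threshold: dark pixels become ink.
    ink = grey < 128
    # Row projection: a row with many black pixels is part of a text line.
    row_counts = ink.sum(axis=1)
    text_rows = np.where(row_counts > 0.05 * ink.shape[1])[0]
    return ink, text_rows

# Toy usage: a white page with one dark band of 'text'.
page = np.full((10, 20, 3), 255.0)
page[4:6] = 0.0
ink, rows = preprocess(page)
print(rows.tolist())  # → [4, 5]
```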

5.2 Creating feature vectors
What constitutes a feature vector? A feature vector is a list of numbers that describes (part of) an image. A feature vector can be compared to a signature. Most signatures from the same person look alike, but they can change over time. There is also the problem that signatures from two different people may look alike. Then the context becomes important, as in "is it the person with the name 'A' from Alphaville or the person with name 'A' from BetaVille?" A feature vector is useful when two items (signatures) from the same type or class (person) are close to each other in feature space. Although for us humans two words with the same syntax from two different writers are close to each other in our feature space (representations/the meaning of a word), they are probably very different in the feature space of the image, i.e. from the machine's point of view. A good example is the difference between a typical piece of

handwriting from a physician, which is often very difficult to read, and the handwriting from a girl who writes with a lot of curls and little hearts instead of dots above the letter 'i'. Although they might be writing down the same syntax, the image space is very different. Since computers only have the image space, it is our task to find transformations where these differences do not matter too much. On a more basic level, even the writing of a single writer differs according to physical condition, writing implement, size of writing, speed of writing and many other factors (Franke 2005). At the moment we are mainly researching intra-writer text, in order to extend the technology to inter-writer recognition methods later. In our research, several types of feature extraction methods are used. The simplest feature is the pixel in the image. This can be a colour pixel in, for example, Red/Green/Blue space, much like a television. It can also be a binarized pixel (black or white) as described in the previous section, which is already a higher level feature. The rest of this section describes one of the feature extraction methods, in which the text is cut into little pieces called fraglets, which are learned by a special type of neural network called a Kohonen map or self-organizing map (Schomaker et al. 2007). A Kohonen map is a statistical procedure that clusters information (Haykin 1999). The information in this case is the fraglets, as shown in Figure 5. The first step after the preprocessing is to cut the words into little pieces called fraglets. The procedure is to follow the black pixels, which are usually ink, and then cut the text into pieces at the bottom of the 'U' shape of the 'valleys' (see Figure 5; the arrows beneath the word-part 'veili' designate the bottom of these 'valleys'). These little pieces of ink are not letters or digits, although they could be. Often they are smaller than a complete letter.
The reason that it is not possible to cut out the letters immediately almost sounds like a riddle: 'To cut out a letter, the computer first has to recognize the letter, but in order to recognize the letter, it first has to cut out the letter.' The fraglets are used to train the Kohonen maps, which in turn are used as a code book during classification.
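A minimal sketch of such a codebook, assuming simple numeric fraglet features, is given below. The grid size, learning rate and neighbourhood width are illustrative choices for the toy example, not the parameters of the actual system.

```python
import numpy as np

rng = np.random.default_rng(0)

def train_som(data, grid=(4, 4), epochs=20, lr=0.5, sigma=1.0):
    """Tiny Kohonen map: a grid of codebook vectors fitted to the data.

    Each sample pulls its best-matching node (and, more weakly, that
    node's grid neighbours) towards itself.
    """
    coords = np.array([(i, j) for i in range(grid[0]) for j in range(grid[1])])
    codebook = rng.normal(size=(len(coords), data.shape[1]))
    for _ in range(epochs):
        for x in data:
            winner = np.argmin(np.linalg.norm(codebook - x, axis=1))
            grid_dist = np.linalg.norm(coords - coords[winner], axis=1)
            h = np.exp(-grid_dist ** 2 / (2 * sigma ** 2))
            codebook += lr * h[:, None] * (x - codebook)
    return codebook

def nearest_node(codebook, x):
    """Index of the codebook entry most similar to x."""
    return int(np.argmin(np.linalg.norm(codebook - x, axis=1)))

# Toy usage: two well-separated clusters of 'fraglet shapes' end up
# on different nodes of the map.
data = np.vstack([rng.normal(0.0, 0.1, (20, 2)),
                  rng.normal(5.0, 0.1, (20, 2))])
codebook = train_som(data)
a = nearest_node(codebook, np.array([0.0, 0.0]))
b = nearest_node(codebook, np.array([5.0, 5.0]))
print(a != b)
```

After training, each node of the map is a prototype shape, and a fraglet can be summarized by the index of its nearest node.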

5.3 Classification
There is a myriad of methods in Artificial Intelligence and Pattern Recognition for classification; for example, back-propagation neural networks, self-organizing neural networks, support vector machines, nearest-neighbour algorithms and K-means clustering (Duda et al. 2000). The common idea is that, during training, some form of common elements is found which can later be used in the classification process. Here follows an example. If a feature vector has N elements (for example N=4900 in the case of the Kohonen map from Figure 5), then it is called a feature vector with a dimensionality of N (4900). For a typical word, most of the dimensions are '0' and a few are '1'. For an image/instance that has to be classified, the preprocessing steps are applied to the image. The shape-nodes in the Kohonen map that are most similar to the query get the highest activation and are set to '1'. The remaining nodes get a '0' in the output vector.

figure 5 The top shows a word that is cut into pieces on the ‘valleys’ of the ink trail. Every piece is mapped to a position in the Kohonen map. The positions in the map determine the classification of the word. Part of the Kohonen map is enlarged to give an impression of the shape space the computer ‘thinks’ in. Note that this is radically different from the human ‘think space’ that we are accustomed to. The word-part is ‘veili’ from ‘veilingen’.
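The mapping from a word's fraglets to its binary output vector, as described above, can be sketched as follows. The tiny codebook and fraglet vectors here are made up for illustration; in the real system the codebook is the trained Kohonen map with thousands of nodes.

```python
import numpy as np

def word_vector(codebook, fraglets):
    """Binary word descriptor: one dimension per Kohonen-map node.

    Each fraglet sets its best-matching node to '1'; every other
    dimension of the output vector stays '0'.
    """
    vec = np.zeros(len(codebook), dtype=int)
    for f in fraglets:
        winner = np.argmin(np.linalg.norm(codebook - f, axis=1))
        vec[winner] = 1
    return vec

# Toy usage: three codebook 'shapes', a word made of two fraglets.
codebook = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]])
fraglets = np.array([[0.1, 0.0], [1.9, 2.1]])
vec = word_vector(codebook, fraglets)
print(vec.tolist())  # → [1, 0, 1]
```

Words built from similar fraglets thus get overlapping output vectors, which is what makes them comparable during classification.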

The following is an example which explains feature vector space in a more intuitive manner. In this example, persons are identified (classified) according to their position in a house. In a three-dimensional space, the coordinates (0, 0, 1) could translate, for example with the system (front door=0 ‖ back door=1; ground floor=0 ‖ first floor=1; North side=0 ‖ South side=1), into a position at the front door (0, *, *) on the ground floor (*, 0, *) on the South side (*, *, 1), where the asterisk stands for 'anything'. Some people might have an apartment on the first floor and will always have the feature vector (*, 1, *), no matter whether they are at the front door or

back door side (first asterisk) or at the North or South side (last asterisk). Some people might even have an apartment on the North side of the first floor; they have feature vector (*, 1, 0) whenever they are in their apartment. The classification methods try to learn where a certain person usually is, and then use the feature vector later to identify who is in the house. Of course this will fail miserably if that same person rents out the apartment or throws a party, and that is exactly the problem with classification. It is not certain that the feature vector (position in the house) that is being used can discriminate between the classes (who is in the house). There might be better representations; in this case, for example, colour of hair, height, colour of eyes and male/female. One of the jobs of scientists in Pattern Recognition is to find improved feature extraction methods (see Figure 5) in order to discriminate better with the classification method. Traditional methods in machine learning require large amounts of data per class. The theory predicts that more examples are required than the length of the feature vector. In handwriting recognition this is hardly possible, because there is no affordable method to get, for example, the word 'Napoleon' or 'Springfield' neatly cut out 10,000 times from at least 5000 different writers. Even when it is possible, in the case of very common words such as 'and', 'or', 'as', the results are not interesting. Hardly anybody is interested in finding the word 'and' in text, since it is non-discriminative for the meaning of the text. The user requests forced us to apply different methods, using feedback procedures to lift the quality of the classification out of the mud of many errors.
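The house example can be written out directly; the wildcard matching below is just a literal rendering of the asterisk notation above, with `None` playing the role of '*'.

```python
# Positions are (door, floor, side) bits: front door=0 / back door=1,
# ground floor=0 / first floor=1, North side=0 / South side=1.
# None stands for the asterisk ('anything') in a person's pattern.
def matches(pattern, position):
    """True if every non-wildcard element of the pattern agrees."""
    return all(p is None or p == x for p, x in zip(pattern, position))

def classify(position, patterns):
    """Names of everyone whose pattern matches the observed position."""
    return [name for name, pat in patterns.items() if matches(pat, position)]

patterns = {
    "visitor at the front door": (0, 0, None),  # front door, ground floor
    "first-floor tenant": (None, 1, None),      # (*, 1, *)
    "north-wing tenant": (None, 1, 0),          # (*, 1, 0)
}
# Someone is observed at the front door, first floor, North side:
print(classify((0, 1, 0), patterns))
# → ['first-floor tenant', 'north-wing tenant']
```

As in the text, the representation fails when two classes match the same positions: the position vector alone cannot then discriminate between them.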

6 Multi-stage and feedback procedures
The user requests forced our research to move in a direction which is uncommon in machine learning. Often large amounts of data are required for decent machine learning. Although there are many scanned documents, the problem is often that the annotation is not present, and if it is present, the annotation does not correlate with image space. For example, the annotation does not tell us where a new line has started in image space. This is understandable, since for humans it does not matter, because humans are relatively smart. But the computer has no idea where to find what. This meant that we had to build our data set from scratch. The procedures described below give an impression of the bootstrapping activities that are used to create large enough data sets to train on. We started with a fresh collection of scanned pages from a book from the KdK, as described previously. The pages were cut into line strips and these were correlated in pixel-shape space (Schomaker 2007). This gives an estimate of how much line strips look alike. As time passed, some line strips were annotated by users. If a word exists in the annotations, the corresponding line strips with the query word are retrieved and the correlations in pixel space are used to show correlated line strips. This immediately gives the user a way to search through the documents. It is also possible to click on an interesting line strip (for example the one which contains 'Napoleon') to

generate a hit list in the pixel/image space. If that particular word exists in the remainder of the documents, it often shows up in the first ten to twenty hits. We find this reasonable, but many users, accustomed to commercial search engines, expect better performance. A next step is that an algorithm estimates where words start and end, based on the annotation and the white space in the image. The cut-out words were again correlated in image space to create a list of computer generated hypotheses (Schomaker 2008). These 'hit lists' (as in a list generated by a search engine) were shown to users who did not have to annotate, which is a very time consuming activity, but who could just click an 'ok' or 'not ok' button. This effectively increases the number of items per class, which in turn increases the performance of classifiers. This cycle of: learn → show hit list → ask user → learn again → show new hit list → learn more → . . . continues and increases the performance of the reading system using minimal user input.
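The learn → show hit list → ask user → learn cycle can be sketched as below. The `oracle` callback stands in for the user's 'ok'/'not ok' click, and re-estimating the query as the mean of the confirmed examples is a simplification of the actual retraining step; both are assumptions for this toy version.

```python
import numpy as np

def feedback_cycle(query_vec, pool, oracle, rounds=3, top_k=5):
    """Grow a labelled set from one query with minimal user input.

    pool: feature vectors of candidate word images.
    oracle(i) -> bool simulates the user's click on candidate i.
    Each round shows the top_k not-yet-seen candidates nearest to
    the mean of the confirmed examples and keeps the confirmed ones.
    """
    confirmed = [query_vec]
    seen = set()
    for _ in range(rounds):
        centre = np.mean(confirmed, axis=0)
        order = np.argsort(np.linalg.norm(pool - centre, axis=1))
        shown = 0
        for i in order:
            if i in seen:
                continue
            seen.add(i)
            if oracle(i):
                confirmed.append(pool[i])
            shown += 1
            if shown == top_k:
                break
    return confirmed

# Toy usage: 10 true instances of the word near the query, 10 other words.
rng = np.random.default_rng(1)
pool = np.vstack([rng.normal(0.0, 0.2, (10, 2)),
                  rng.normal(3.0, 0.2, (10, 2))])
is_same_word = [True] * 10 + [False] * 10
found = feedback_cycle(np.zeros(2), pool, lambda i: is_same_word[i])
print(len(found))  # → 11: the query plus all ten confirmed instances
```

Each round of clicks enlarges the training set, which is what allows the hit lists to improve while the user never has to type an annotation.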

7 Conclusions
The two examples in the previous section are both data- and user-driven. The type of data we used is more difficult than what is used in most handwriting recognition research. Often, data sets are used which contain neatly cut out letters or words, where the irregularities have also been erased from the image. Our group is focused on using the computer as much as possible and a human expert only where needed. There is no manual cleaning of the data, because cleaning is not practical for a real-world data set. In the case of the KdK, there are millions of pages. Only automated procedures will work on these amounts of data. By actually talking to the end users and professional users (i.e. people working at the archive), we were able to create procedures that are already useful at a very early stage of the work. Sadly, we had only a single scanned book, so the system was not so interesting for the average end user. At the moment of writing, we are scanning more books and hope to process 40 books this year (2009). The user-driven aspect also gave us the idea to employ the user for further machine learning. This resulted in more methods where user input is put to work for machine learning, always with the focus on minimizing the amount of labour the user has to do (Van der Zant et al. 2008). The user-driven aspect also forced us to come up with 'something' early in the project, to demonstrate that there is progress and that users can influence this process. The answer to the question raised in the title is the following: search engines for handwritten text are under development, and trustworthy users are needed to assist in the training of the Artificial Intelligence. A first version of the search engine under development can be seen at http://www.ai.rug.nl/Monk (Van der Zant 2008). The name is based on the laborious writing of monks before the printing press was invented. Just like a monk, the Monk search engine also makes mistakes.
The site shows what an end user can do at the moment; the annotation and interactive part which is used for training is not (yet) public.

34-2-228-ISR 07 vdZant.indd 238

5/13/2009 6:33:11 PM

WHERE ARE THE SEARCH ENGINES FOR HANDWRITTEN DOCUMENTS?

239

Bibliography
Constantopoulos, P., M. Doerr, M. Theodoridou, and M. Tzobanakis. 2002. Historical documents as monuments and as sources. In: Proceedings of Computer Applications and Quantitative Methods in Archaeology Conference, CAA2002, Heraklion, Greece.
Duda, R.O., P.E. Hart, and D.G. Stork. 2000. Pattern classification. 2nd edition. John Wiley & Sons, Inc.
Franke, K. 2005. The influence of physical and biomechanical processes on the ink trace. PhD diss., University of Groningen.
Haykin, S. 1999. Chapter 9, 'Self-organizing maps'. In Neural networks — A comprehensive foundation. 2nd edition. Prentice-Hall.
Helsper, E., L. Schomaker, and H.L. Teulings. 1993. Tools for the recognition of handwritten historical documents. History and Computing 5(2): 88–93.
Mitoma, H., S. Uchida, and H. Sakoe. 2004. Online character recognition using eigen-deformations. In: Proceedings of the IWFHR, 3–8.
Rabiner, L.R. 1989. A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE 77(2): 257–86.
Schomaker, L. 2007. Retrieval of handwritten lines in historical documents. In: Proceedings of the International Conference on Document Analysis and Recognition (ICDAR).
Schomaker, L.R.B., K. Franke, and M. Bulacu. 2007. Using codebooks of fragmented connected-component contours in forensic and historic writer identification. Pattern Recognition Letters 28(6): 719–27.
Schomaker, L. 2008. Word mining in a sparsely-labeled handwritten collection. In: Proceedings of Document Recognition and Retrieval XV, IS&T/SPIE International Symposium on Electronic Imaging, San Jose.
Van der Zant, T. 2008. Large scale parallel document image processing. In: Proceedings of Document Recognition and Retrieval XV, IS&T/SPIE International Symposium on Electronic Imaging, San Jose.
Van der Zant, T., L.R.B. Schomaker, and A.A. Brink. 2008. Interactive evolutionary computing for the binarization of degenerated handwritten images. In: Proceedings of Document Recognition and Retrieval XV, IS&T/SPIE International Symposium on Electronic Imaging, San Jose.
