GCDB: A Character Database System

John Margaronis, Minas Christou, Ergina Kavallieratou, Theodoros Tzouramanis
Department of Information & Communication Systems Engineering, University of the Aegean, Karlovassi, Samos, 83200, Greece

{icsd01038, icsd01074, kavallieratou, ttzouram}@aegean.gr

ABSTRACT
In this article the GCDB (Greek Characters DataBase) System is presented. GCDB is a complete database system for storing images of Greek unconstrained handwritten characters. It is composed of three elements: a specialized input form, a database that contains the images of the filled forms, and the software that enters the data from the form into the database and retrieves them. The main purpose of this database system is the automatic storage and organization of images of Greek symbols and letters, with a view to their use by OCR (Optical Character Recognition) systems or other applications. The GCDB system is designed with future expansion in mind, providing an up-to-date database of Greek handwritten characters to cover the growing demands for offline character recognition of the Greek language.

Categories and Subject Descriptors
H.2.8 [Database Management]: Database Applications – image databases; I.4.6 [Image Processing and Computer Vision]: Segmentation – Edge and feature detection, Pixel classification; I.7.7 [Document and Text Processing]: Document Capture – optical character recognition (OCR), scanning.

General Terms
Experimentation

Keywords
OCR, Greek characters, unconstrained characters, handwritten characters.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. MOCR’09, July 25, 2009, Barcelona, Catalonia, Spain. Copyright 2009 ACM 1-58113-000-0/00/0004…$5.00.

1. INTRODUCTION
The need for databases containing images of characters other than Latin is increasing in the field of Optical Character Recognition (OCR) systems and in other applications. This holds especially for the Greek alphabet, which is widely used in science, particularly in mathematics and physics. The Greek and the Latin alphabets have some letters in common, but the Greek alphabet presents 22 characters (10 uppercase and 11 lowercase) that differ from the Latin characters, and many of these can vary in their written form, depending on the style of the writer’s hand.
While a number of different databases exist for the Latin alphabet, this is not the case for the Greek alphabet, for which only one such database exists, the GRUHD [1]. The GRUHD is similar to the NIST Special Database 19 [2] in the way its data are gathered. Both databases make use of a specialized input form: the printed form is given out to individual writers who fill it in. The form is subsequently scanned and the data are collected from the scanned images. The data of a database can also be collected by other means, for example online: the writers use a pen device to write on a specialized tablet from which the data are transferred to the database, e.g. the CEDAR [3]. Lastly, there are databases that receive their data from other sources such as companies or organizations, like UNIPEN [4], or from texts originating from a preexisting text collection, e.g. the IAM and the Lancaster-Oslo/Bergen database [5].
Since the GRUHD is no longer fed with data and since its architecture does not allow easy adaptation, a more up-to-date database of handwritten Greek characters is needed. Several basic principles underlie the design of the new database: the system should be easy to use and require a minimum of experience to handle; it should be easy to increase the amount of data stored in the database; and a minimal number of experienced personnel should be required to add new data. Another consideration in the design of the new database was how the system might be made easily adaptable to changes affecting the input form, to allow for flexibility and for an architecture that would permit future expansion.
Should the system need major software changes even in the case of a small change in the input form (for example, enlarging some fields), the users would either be trapped with a dysfunctional input form or let the system slide into uselessness. Our aim was to design a cost-effective system with an extended life expectancy and a capacity for fast and easy growth (in terms of data), while remaining as simple and effective as possible to prevent its falling into disuse.
The article unfolds as follows: the next section describes the input form which is used. Section 3 presents the database which is used by the system and the way the data are organized. Section 4 analyses the accompanying software and the data input processing method. Finally, the last section of the article draws some conclusions.

Figure 1: The input form used in GCDB.

2. INPUT FORMS
Paper input forms were chosen because they make it easier to obtain a larger number of samples from a larger number of writers: forms made of paper are easier to transport to different locations than equipment. In addition, a non-negligible number of writers feel negatively about using electronic equipment to provide samples of their handwriting. They are not familiar with the technology and would need time to get acquainted with it, a time cost that does not apply to paper forms, which people are used to (mostly from questionnaires) and accept more easily.
The input form is divided by horizontal lines into three sections (Figure 1). The lines are more than an aesthetic division of the page: the software uses them as guides in order to find the different fields and retrieve the data.

The first section of the page consists of checkboxes relating to information about the writer. These include his/her age and sex and a question asking whether the writer has filled in the same form before. This last checkbox was necessary because no field for the writer's name could be included, which makes it impossible to group the forms of the same writer later. The writers’ names are omitted for legal reasons, so as not to infringe privacy and data protection laws [6-7]. A field containing the name of the writer would require permission from the Hellenic Data Protection Agency, which is a time-consuming process without guarantees of permission being granted. The main reason for adding the question of whether the writer has filled in the same form before is to allow the users of the system to select samples from unique (i.e. different) writers.

The second section consists of a series of fields of single characters for the writers to fill in. All the fields are small boxes placed below the letter or character that the writer is requested to copy. The fields are the following:
• 25 fields for small letters, in random order.
• 25 fields for small letters, in alphabetical order.
• 24 fields for capital letters, in random order.
• 24 fields for capital letters, in alphabetical order.
• 7 fields for intonated vowels.
• 10 fields for numbers (digits) in logical order (from 0..9).
• 10 fields for numbers (digits) in random order.
• 6 fields for punctuation symbols.
Both orders, alphabetical and random, are used because people tend to adopt a slightly different style when they are writing text known in advance. Note that while there are 24 letters in the Greek alphabet, there are 25 small letter fields. This is because the small letter sigma is written differently depending on whether it stands at the beginning or in the middle of a word (‘σ’) or at the end of a word (‘ς’).

The third section of the input form consists of 30 fields for words. Half of the fields are words printed in capital letters and the other half are words printed in small letters. The words were selected to ensure that all the letters of the Greek alphabet were included at least once, and most of them are commonly used Greek words (Table 1).

Table 1: Common English translations of the Greek words in the form.

Words used      English translation
και             and
φαίνεται        appears to be, seems
μάλλον          perhaps
σώμα            body
για             for
γυναίκα         woman
σαστίζω         to perplex
όπως            as
ζημιά           damage
ψάχνω           to search, to look for
δημοκρατία      democracy
εξέλιξη         evolution
καθώς           as
βρέχω           to wet
δορυφόρος       satellite
ΕΛΛΑΔΑ          GREECE
ΜΑΝΙΑ           MANIA
ΕΝΤΟΣ           WITHIN
ΘΑΡΡΟΣ          COURAGE
ΓΙΑΤΙ           WHY
ΔΗΛΑΔΗ          NAMELY, THAT IS TO SAY
ΒΑΡΕΤΟΣ         BORING
ΤΩΡΑ            NOW
ΦΙΛΟΞΕΝΟΣ       HOSPITABLE
ΖΗΤΩ            TO ASK, TO REQUEST, TO LOOK FOR
ΚΑΛΟΣ           GOOD
ΟΧΙ             NO
ΕΥΤΥΧΩΣ         FORTUNATELY
ΨΑΡΙ            FISH
ΠΡΟΣ            TO, TOWARDS, IN THE DIRECTION OF
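The coverage property stated for the word fields (every letter of the Greek alphabet appears at least once) can be checked mechanically. The following is a minimal Python sketch, not part of the GCDB software: it strips accents, treats the final sigma ‘ς’ as ‘σ’, and reports which of the 24 letters are missing from a word list. The function names are illustrative assumptions; the word lists are those of Table 1.

```python
import unicodedata

# The 24 letters of the Greek alphabet (uppercase).
GREEK_UPPER = "ΑΒΓΔΕΖΗΘΙΚΛΜΝΞΟΠΡΣΤΥΦΧΨΩ"

# Word lists copied from Table 1 of the paper.
LOWER_WORDS = ["και", "φαίνεται", "μάλλον", "σώμα", "για", "γυναίκα",
               "σαστίζω", "όπως", "ζημιά", "ψάχνω", "δημοκρατία",
               "εξέλιξη", "καθώς", "βρέχω", "δορυφόρος"]
UPPER_WORDS = ["ΕΛΛΑΔΑ", "ΜΑΝΙΑ", "ΕΝΤΟΣ", "ΘΑΡΡΟΣ", "ΓΙΑΤΙ", "ΔΗΛΑΔΗ",
               "ΒΑΡΕΤΟΣ", "ΤΩΡΑ", "ΦΙΛΟΞΕΝΟΣ", "ΖΗΤΩ", "ΚΑΛΟΣ", "ΟΧΙ",
               "ΕΥΤΥΧΩΣ", "ΨΑΡΙ", "ΠΡΟΣ"]


def letters_used(words):
    """Return the set of base letters (uppercased, accents removed)."""
    used = set()
    for word in words:
        # NFD splits an accented vowel into base letter + combining mark.
        for ch in unicodedata.normalize("NFD", word):
            if not unicodedata.combining(ch):
                # upper() also maps the final sigma 'ς' to 'Σ'.
                used.add(ch.upper())
    return used


def missing_letters(words):
    """Return the letters of the alphabet that never occur in `words`."""
    return set(GREEK_UPPER) - letters_used(words)


if __name__ == "__main__":
    print("missing in small-letter words:", missing_letters(LOWER_WORDS))
    print("missing in capital-letter words:", missing_letters(UPPER_WORDS))
```

Running the check separately on the two halves of the list shows whether each case (small and capital) covers the alphabet on its own, which is the stronger property for collecting samples of every letter in both cases.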

3. DATABASE AND DATA ORGANIZATION
The storage of data is extremely important, and a good database design can speed up queries and data retrieval. The GCDB database was designed around two axes: the first was the expansion capabilities of the database and the second was the speed of queries and data retrieval [8]. The relationships diagram of the database tables is shown in Figure 2.

Figure 2: The relationships diagram of the GCDB database tables.

The database had to be designed in such a way that it would be easy to implement changes in the future (for example adding the names of the writers, adding more symbols, or even adding new prototype forms) while keeping the previous data and structure intact, so that applications and forms designed for the old database would still function. To achieve this, all data are stored in database tables that are joined with other tables through foreign keys referring to the primary keys of those tables. For example, there is a table named LETT_CAP_NORM which contains all the images of the capital letters in alphabetical order. The only additional information that this table contains is an index (ID field) attribute which serves as the primary key. This field is then assigned as a foreign key to a higher-level table called CONTAINS_CHAR, which keeps track of all character-like symbols (letters, numbers, punctuation symbols, etc.). This table contains only the index fields of the tables with character images and the index field of the FORMS table, which contains information regarding every single filled form. This structure allows us to add new attributes to existing tables, or even new tables for image data (e.g. to support letters of another language), with minimal changes to the database schema (e.g. we just add a few foreign keys to the CONTAINS_CHAR table), while keeping backwards compatibility.
Consideration was also given to the speed of queries and data retrieval. Thus, we created two main tables that store the primary keys of the tables whose main responsibility is the efficient storage and retrieval of the image data. These main tables are CONTAINS_CHAR and CONTAINS_WORD, the first for images of characters and the second for images of words. The reason behind this choice is that most users usually query one of these two categories (characters or words) and rarely both.
The WRITERS table includes information about the sex and the age categories of the writers. Although this information could have been kept in another table, such as the FORMS table, keeping it in a separate WRITERS table makes it easier to expand the database if the name of the writer needs to be included, provided that the legal issues mentioned above are resolved and the secure maintenance of the database has been set up. Should there be a need to add personal information (i.e. the name) of the writers in the future, and had the age and sex fields been recorded in a different table, we would have been unable to match the name of the writer with his or her other attributes (age, sex) while maintaining compatibility with the existing data in the database.
So far the database contains 428 filled forms from 305 unique writers, featuring 39823 letter images, 8426 number images, 12519 word images and 5696 other images, including special characters (the end-of-word sigma or intonated letters) and punctuation marks. Moreover, the ease with which the system can be used, allowing the fast and easy addition of data to the database, leads us to believe that the amount of stored data will increase steadily over time.
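The foreign-key design described in this section can be illustrated with a small, self-contained sketch using Python's built-in sqlite3 module. Only the table names (WRITERS, FORMS, LETT_CAP_NORM, CONTAINS_CHAR) come from the text; every column name, the link between FORMS and WRITERS, and the LETT_LATIN table are assumptions made purely for illustration, and the actual GCDB schema may differ. The sketch shows the backwards-compatibility argument: a query written against the original schema returns the same rows after a new image table and a new foreign key are added.

```python
import sqlite3

# Hypothetical columns; only the table names are taken from the paper.
SCHEMA = """
CREATE TABLE WRITERS (ID INTEGER PRIMARY KEY, SEX TEXT, AGE_GROUP TEXT);
CREATE TABLE FORMS (ID INTEGER PRIMARY KEY,
                    WRITER_ID INTEGER REFERENCES WRITERS(ID));
CREATE TABLE LETT_CAP_NORM (ID INTEGER PRIMARY KEY, ALPHA BLOB, BETA BLOB);
CREATE TABLE CONTAINS_CHAR (
    FORM_ID INTEGER REFERENCES FORMS(ID),
    LETT_CAP_NORM_ID INTEGER REFERENCES LETT_CAP_NORM(ID)
);
"""

# A query written against the original schema.
OLD_QUERY = """
SELECT f.ID, l.ALPHA
FROM CONTAINS_CHAR c
JOIN FORMS f ON f.ID = c.FORM_ID
JOIN WRITERS w ON w.ID = f.WRITER_ID
JOIN LETT_CAP_NORM l ON l.ID = c.LETT_CAP_NORM_ID
WHERE w.SEX = ?
"""


def build_demo_db():
    """Create an in-memory database holding one writer and one form."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.execute("INSERT INTO WRITERS VALUES (1, 'F', '20-30')")
    conn.execute("INSERT INTO FORMS VALUES (1, 1)")
    conn.execute("INSERT INTO LETT_CAP_NORM VALUES (1, ?, ?)",
                 (b"img-A", b"img-B"))
    conn.execute("INSERT INTO CONTAINS_CHAR VALUES (1, 1)")
    return conn


def extend_schema(conn):
    """Add a new image table (hypothetical LETT_LATIN) and link it."""
    conn.executescript("""
    CREATE TABLE LETT_LATIN (ID INTEGER PRIMARY KEY, A BLOB);
    ALTER TABLE CONTAINS_CHAR
        ADD COLUMN LETT_LATIN_ID INTEGER REFERENCES LETT_LATIN(ID);
    """)


conn = build_demo_db()
before = conn.execute(OLD_QUERY, ("F",)).fetchall()
extend_schema(conn)
after = conn.execute(OLD_QUERY, ("F",)).fetchall()
print(before == after)
```

Because the old query names its columns explicitly and the new foreign key is nullable, existing rows and existing queries are unaffected by the extension.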

4. GCDB SYSTEM DESCRIPTION
Software is necessary for the operation of such a database-driven system. It was created both for the input of data into the database and for their retrieval from it. The input software is responsible for taking a scanned image of an input form, finding the fields that were filled in by the writer and inserting the images in the respective fields of the database. The output software is a simple application which allows the user to select data from the categories of his choice (letters, words, numbers etc.) that comply with the criteria he has set (writer’s age, sex etc.), to retrieve those data and to save them in image form in a selected location.

4.1 System Main Characteristics
Three of the most important features of the GCDB System, ease of use, automation and flexibility, are reflected in the data input application. The input of the data is automated: the user only has to select a form, scan it and load the scanned image into the input application. The application handles everything else, from deskewing the image to finding the fields that were filled in and saving the images in the database. This automation, along with the fact that the input application supports batch processing of form images and is coupled with a simple and complete graphical user interface, makes it easy to use even by completely inexperienced users. The users merely need to select the images and then select the option to start the processing. The application also features some advanced options to allow customization and flexibility of the processing.

Figure 3: Data Extraction Procedure (flowchart: scanned form image, image binarization, skew correction, segmentation, storage in database).

As already mentioned, possible future changes of the input forms, as well as the full automation of the process, were taken into consideration in the system design. Even the slightest change in the input form (e.g. expanding a field) can lead to massive changes in the input process. To deal with this problem, it was decided that all information concerning the processing of the form would be stored in a separate parameter file. Furthermore, each prototype form has its own separate parameter file. The user can select a parameter file in the application, so that the same version of the software can process more than one prototype form. The term prototype form refers to the original input form that is given to writers to fill in. The scanned images of this form can be processed with the prototype form’s parameter file. Note that all the forms must be scanned at the same resolution, and that this resolution is taken into account when a parameter file is created. If the resolution changes, the processing of the form will produce invalid results.
The form parameter file contains two main types of information concerning the process. The first is all the data needed by the algorithms used in the application to process the form, e.g. the threshold used when converting the image to binary or the thresholds used by the algorithm which checks for empty fields (fields not filled in by the writer). The second type of information is the coordinates that are used by the application to localize the data fields. Working with a scanned medium means that there will always be variations from the original form, either as a small displacement of pixels or as a rotation of the original image. That alone makes the traditional origin of image coordinates (pixel [0, 0]) of little use.
To overcome this problem we used a fixed point in the form that could be located and used as the starting point for the calculation of the coordinates: the start of the first of the three long horizontal lines that separate the form into three sections. The relative coordinates of all fields from the start of the line will always be the same, no matter the degree of displacement or rotation. Thus, the coordinates of the fields are calculated by using the start of the line as reference point and are saved in the parameter file.
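The reference-point idea can be sketched as follows. This is a simplified illustration under stated assumptions, not the GCDB implementation: the image is a binary matrix (1 = black) that has already been deskewed, the reference point is taken as the start of the first row containing a sufficiently long run of black pixels, and `FieldSpec`, `find_reference`, `locate_field` and `crop` are hypothetical names. The offsets stored in each `FieldSpec` play the role of the relative coordinates kept in the form parameter file.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FieldSpec:
    """One parameter-file entry: offsets relative to the reference point."""
    name: str
    dx: int      # pixels to the right of the reference point
    dy: int      # pixels below the reference point
    width: int
    height: int


def find_reference(image, min_run):
    """Return (x, y) of the start of the first horizontal black run
    that is at least `min_run` pixels long, or None if there is none."""
    for y, row in enumerate(image):
        run_start, run_len = 0, 0
        for x, pixel in enumerate(row):
            if pixel:
                if run_len == 0:
                    run_start = x
                run_len += 1
                if run_len >= min_run:
                    return (run_start, y)
            else:
                run_len = 0
    return None


def locate_field(reference, spec):
    """Turn relative offsets into an absolute (left, top, right, bottom) box."""
    ref_x, ref_y = reference
    left, top = ref_x + spec.dx, ref_y + spec.dy
    return (left, top, left + spec.width, top + spec.height)


def crop(image, box):
    """Cut the boxed region out of a row-major binary image."""
    left, top, right, bottom = box
    return [row[left:right] for row in image[top:bottom]]
```

With this arrangement a scan that is shifted by a few pixels yields a different reference point but the same cropped field, since only the offsets relative to the line are stored.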

4.2 Process Description
The application starts with the input of a form image file (Figure 3). First, the image is binarized. The initial threshold is calculated according to [9]; it is stored in the form parameter file and can be changed if necessary. Next comes the skew correction of the image: the application detects the skew angle using the algorithm of [10] and corrects the image accordingly. Finally, the application proceeds to the segmentation of the data fields and the extraction of the writer’s information, and saves them appropriately (Section 3) in the database.
Two problems had to be faced here: dealing with empty fields (i.e. fields that the writer did not fill in) and extracting the writers’ information (marked checkboxes). Both problems were solved by counting the black pixels. A marked checkbox in the top section of the input form of Figure 1, which records the writer’s personal information, will always contain a larger number of black pixels than a checkbox which has remained unmarked, whatever kind of marking was used. Thus we count the number of black pixels: the checkbox with the largest number of black pixels is taken to be the one selected by the writer. The drawback of this method arises when a writer makes a mistake and attempts to correct it by smudging the wrong checkbox and then marking another one. The application will incorrectly select the smudge, as it will usually contain more black pixels than the marking does. Writers should be advised that, in case of a mistake, they should mark the correct field more heavily rather than create a smudge.
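The checkbox rule described above amounts to an argmax over black-pixel counts. The following minimal sketch assumes each checkbox has already been cropped to a binary matrix (1 = black); the function names and the labels used in the example are hypothetical.

```python
def count_black(field):
    """Number of black pixels in a binary field (1 = black, 0 = white)."""
    return sum(sum(row) for row in field)


def select_checkbox(checkboxes):
    """Return the label of the checkbox with the most black pixels.

    `checkboxes` maps a label to the cropped binary image of that box.
    """
    return max(checkboxes, key=lambda label: count_black(checkboxes[label]))


if __name__ == "__main__":
    boxes = {
        "male":   [[0, 0, 0], [0, 0, 0], [0, 0, 0]],   # unmarked
        "female": [[1, 0, 1], [0, 1, 0], [1, 0, 1]],   # marked with an X
    }
    print(select_checkbox(boxes))
```

The same function also reproduces the drawback noted in the text: a heavily smudged box contains more black pixels than a clean tick and would therefore win the comparison.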

Figure 4: Time distribution of the data extraction procedure (pie chart; legend: skew correction, segmentation, rest processes; slice labels: 12%, 1%, 87%).

The empty fields were handled in a similar way. Since the writer has not filled in the field, it should be empty, and thus the number of black pixels should theoretically be 0. In practice, however, there will always be a few black pixels. This made the calculation of a threshold harder, since it was not easy to estimate an exact lower limit in pixels. Therefore the ratio of black pixels to the total number of pixels of the field (black pixels / data field pixels) was calculated. A threshold based on this ratio proved more successful and accurate than one based on the absolute number of pixels. If the black pixel ratio is very low, the field is considered empty. Minimum ratios were calculated for each major category of fields (numbers, words, letters, etc.) of the form. Those ratios were selected by testing 50 random scanned forms and computing the black pixel ratios of all their fields.
The retrieval application is simpler. Its graphical user interface (Figure 5) allows the user to select the type of data and the criteria of his choice; the matching data are then retrieved and saved in the folder of his choice. All the selections are made via checkboxes and radio buttons, making it easy to select the required data. According to the user’s choices, the application dynamically creates an SQL query and retrieves the data from the database.
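Returning to the empty-field test described in this section, the ratio-based check can be sketched as follows. The threshold values in `MIN_RATIO` are placeholders chosen for illustration only; the paper does not report the actual ratios, and the function names are hypothetical.

```python
def black_ratio(field):
    """Black pixels divided by all pixels of a binary field (1 = black)."""
    total = sum(len(row) for row in field)
    if total == 0:
        raise ValueError("field has no pixels")
    return sum(sum(row) for row in field) / total


# Placeholder per-category minimum ratios (NOT the values used by GCDB).
MIN_RATIO = {"letters": 0.02, "numbers": 0.02, "words": 0.01}


def is_empty(field, category, min_ratio=MIN_RATIO):
    """A field counts as empty when its black ratio is below the
    minimum ratio of its category."""
    return black_ratio(field) < min_ratio[category]


if __name__ == "__main__":
    # A 10x10 field with a single stray black pixel (scanner noise).
    noisy = [[0] * 10 for _ in range(10)]
    noisy[3][4] = 1
    print(is_empty(noisy, "letters"))
```

Using a ratio rather than an absolute count makes one threshold applicable to fields of different sizes within the same category, which matches the motivation given in the text.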

Figure 5: The graphical interface of the output application.

Furthermore, the application offers two ways of arranging the results. The first is sorting by Form Name and the second is sorting by data Category. The first option uses the ID of each form and creates one folder for each form ID. In that folder the images of all the fields of that form are saved with names according to their respective categories. For example, in the folder Form_1 there will be images called LETT_CAP_NORM.ALPHA.bmp or APOSTROPHE.bmp etc. The second option creates one folder for each category and the corresponding image fields of each form are saved inside those folders. Using the previous example, a folder named APOSTROPHE and a folder named LETT_CAP_NORM.ALPHA will be created and the appropriate images named after the form ID will be saved inside each folder.
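The retrieval step, dynamic query construction followed by one of the two folder arrangements, can be sketched as below. This is an illustration under assumptions rather than the actual output application: the whitelists, the column names SEX and AGE_GROUP, and the file name `Form_<id>.bmp` used in the by-category arrangement are hypothetical, while the folder and file names of the by-form arrangement follow the example in the text.

```python
from pathlib import PurePosixPath

# Hypothetical whitelists; identifiers are never taken from raw user input.
ALLOWED_TABLES = {"CONTAINS_CHAR", "CONTAINS_WORD"}
ALLOWED_COLUMNS = {"SEX", "AGE_GROUP"}


def build_query(table, criteria):
    """Build a parameterized SELECT from the user's GUI selections.

    `criteria` maps a column name to the selected value, or to None when
    the user left that criterion unset.
    """
    if table not in ALLOWED_TABLES:
        raise ValueError(f"unknown table: {table}")
    clauses, params = [], []
    for column, value in criteria.items():
        if column not in ALLOWED_COLUMNS:
            raise ValueError(f"unknown column: {column}")
        if value is not None:
            clauses.append(f"{column} = ?")
            params.append(value)
    sql = f"SELECT * FROM {table}"
    if clauses:
        sql += " WHERE " + " AND ".join(clauses)
    return sql, params


def output_path(root, form_id, category, arrange_by):
    """Return where one field image is saved under either arrangement."""
    if arrange_by == "form":
        return PurePosixPath(root) / f"Form_{form_id}" / f"{category}.bmp"
    if arrange_by == "category":
        return PurePosixPath(root) / category / f"Form_{form_id}.bmp"
    raise ValueError("arrange_by must be 'form' or 'category'")
```

Values are passed as query parameters rather than interpolated into the SQL string, and table and column names are checked against fixed lists, which keeps dynamically built queries safe even though the selections come from a GUI.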

5. CONCLUSION
The GCDB System is a database system of Greek unconstrained handwritten characters. Its main characteristics are:
• Automation
• Ease of use
• Architecture open to possible future expansion
• Flexibility
Automation concerns the input of data into the database: the entire process is automated. Batch processing also makes it faster and allows the unattended processing of many forms.
Ease of use was also important to us. Simple and intuitive graphical user interfaces, integrated help files and accompanying documentation, including instructions for the creation of parameter files and for future expansion, make the system easy to use, maintain and expand.
The entire system was designed and coded in such a way that future expansions would be easy while maintaining backwards compatibility. Both the way in which the database was designed and the way the input software was written allow advanced users to expand the system easily and rapidly.
Finally, the flexibility of the system, which is especially evident in the form parameter file, allows the system to be adapted to the needs and challenges that may arise during its operation. An advanced user can easily change even the way the processing of the images works, in order to achieve different results. Separate form parameter files mean that different forms, or even forms with different processing parameters, can be processed. A user can even keep separate databases simultaneously according to his needs.
The drawback of this system is that, since the data extraction procedure is automatic, it does not check what is inserted in the database. As long as a field is not empty, whatever is written in it will be saved in the appropriate field of the database. Thus, if a writer makes a mistake in spelling a word, smudges a field, or even draws inside it, those images will be saved in the database. To avoid this, the user should check the contents of the form before scanning. This is a general drawback of automated procedures: automated systems can never reach the data quality of a system where the fields are checked and selected manually by a human being. However, the time, effort and cost required for manual selection are much greater.
The GCDB system is freely available via: http://www.icsd.aegean.gr/lecturers/kavallieratou/

6. REFERENCES
[1] Kavallieratou, E., Liolios, N., Koutsogeorgos, E., Fakotakis, N., Kokkinakis, G.: The GRUHD database of Modern Greek Unconstrained Handwriting. 6th Int. Conference on Document Analysis and Recognition (ICDAR 2001), Vol.1, pp. 561-565 (2001).
[2] Wilkinson, R., Geist, J., Janet, S., Grother, P., Burges, C., Creecy, R., Hammond, B., Hull, J., Larsen, N., Vogl, T., and Wilson, C.: The first census optical character recognition systems conf. #NISTIR 4912. The U.S Bureau of Census and the National Institute of Standards and Technology. Gaithersburg, MD. (1992).
[3] Hull, J.: A Database for Handwriting Recognition Research. IEEE Transactions on Pattern Analysis and Machine Intelligence, pp. 550-554 (1994).
[4] Guyon, I., Schomaker, L., Plamondon, R., Liberman, M. & Janet, S.: UNIPEN project of on-line data exchange and recognizer benchmarks. Proceedings of the 12th International Conference on Pattern Recognition (ICPR 1994), pp. 29-33 (1994).
[5] Marti, U., and Bunke, H.: A full English sentence database for off-line handwriting recognition. 5th Int. Conference on Document Analysis and Recognition, Bangalore, pp. 705-708 (1999).
[6] Law 2472/1997 (as amended) of the Greek Parliament.
[7] Directive 95/46/EC of the European Parliament.
[8] Elmasri, R., Navathe, S.B.: Fundamentals of Database Systems, 4th Edition, Addison Wesley (2004).
[9] Kavallieratou, E.: A Binarization Algorithm specialized on Document Images and Photos. Proceedings of 8th International Conference of Document Analysis and Recognition, Vol. I, pp. 463-467 (2005).
[10] Kavallieratou, E., Fakotakis, N., and Kokkinakis, G.: Skew Angle Estimation for Printed and Handwritten Documents using the Wigner-Ville Distribution. Image & Vision Computing, Vol.20, No.11, pp. 813-822 (2002).