A workflow management system to feed digital libraries: proposal and case study

Ángeles S. Places, Antonio Fariña, Miguel R. Luaces, Óscar Pedreira, Diego Seco
Database Laboratory, Facultade de Informática, University of A Coruña
Campus de Elviña, s/n, 15071 A Coruña, Spain
Tel.: +34 981 167 000 ext. 1306
email: {asplaces, fari, opedreira, luaces, dseco}@udc.es

ABSTRACT

Building a digital library of antique documents involves not only technical implementation issues, but also aspects related to the digitization of large collections of documents. Antique documents are usually delicate and need to be handled with care. Also, a poor state of preservation and the use of unrecognizable font types make automatic text recognition more difficult, hence requiring further human revision to correct the text. This makes the participation of experts in the digitization process mandatory and, therefore, costly. In this paper, we present a framework for managing the workflow of the digitization of large collections of antique documents. We describe the digitization process and a tool supporting all of its phases and tasks. We also present a case study in which we describe how the workflow management system was applied to the digitization of more than 10,000 documents from journals of the 19th century. In addition, we describe the resulting digital library, focusing on the most important technological issues.

Keywords: digital libraries, text retrieval, workflow management system.

1 Introduction

Interest and research in digital libraries have grown significantly, mainly due to advances in document digitization, information retrieval and web publishing technologies. There are many types of digital libraries. Among them, many are created by real or traditional libraries that digitize their collections and make them public through a digital library. In some cases, the motivation behind the construction of these digital libraries is more complex than the simple desire to provide web-based access to the documents. For instance, antique documents are usually kept in museums and libraries, and access to them is very restricted. Therefore, the publication of their digitized pages on the Internet serves the important purpose of providing access to these documents to the world community (Baird, 2003; Borgman, 2002). Additionally, it helps preserve such documents, preventing their loss due to their antiquity and fragility (Baird, 2003; Sankar, et al., 2006; Ross and Hedstrom, 2005). There are many examples of this type of digital library; among others, the Spanish National Library [1], the Library of Congress [2] (Arms, 2000), the Digital Library of India [3], the University of Chicago Library [4], or the Stanford Digital Repository [5] (Cramer, 2010).

[1] http://bdh.bne.es/
[2] http://memory.loc.gov/ammem/about/techIn.html
[3] http://www.dli.gov.in/
[4] http://www.lib.uchicago.edu/e/ets/eos/
[5] http://www.dlib.org/dlib/september10/cramer/09cramer.html

In the past decades, much research effort has been devoted to creating techniques and software that support the process of building digital libraries (Witten and Bainbridge, 2003). In some cases, the digital library is created from scratch with ad hoc software. In other cases, developers use software packages that provide an infrastructure for creating digital libraries, such as Greenstone [6], Fedora-Commons [7] or DSpace [8]. In any case, many researchers and developers consider that a more formal framework is needed. Standards are an important tool to achieve such formalization. Unfortunately, in the field of digital libraries, they usually deal only with data interchange (Van de Sompel and Lagoze, 2000; Library of Congress, 2007), although many researchers consider that standards need to cover a wider range of issues (Delos, 2008; CCSDS, 2002). As pointed out in (Ross, 2014), "after more than twenty years of research in digital curation and preservation, the actual theories, methods and technologies that can either foster or ensure digital longevity remain startlingly limited". The process of feeding digital libraries is one of the processes that should be formalized (Duguid, 1997; CCSDS, 2002), and some authors have been working in this direction (Buchanan, et al., 2005; CCSDS, 2002). For example, Ross (Ross, 2014) also points out that "automation in preservation" is one of the nine main themes of the digital-library research agenda for the coming years. Digitizing a large collection of documents to feed a digital library poses many problems if it is done without a tool supporting the process.
The construction of a repository requires the sequential execution of a set of activities on the source documents, with several people participating in each of them: scanning the physical documents, gathering and recording metadata, automatic text recognition with OCR (Optical Character Recognition) software, text revision to correct errors from the automatic recognition, and document storage and indexing. Most of these activities require the direct intervention of a person and may take a significant time to complete. Therefore, the digitization of the documents to be incorporated into the repository usually involves a high cost and the interaction of many people. In addition, in such a complex process, the lack of control over the workflow can result in dead times, errors in the results, and loss of data. In general, poor coordination of the people involved increases the overall cost and decreases the quality of the results. From our previous experience (Parama, et al., 2006; Places, et al., 2007), we identified typical errors such as digitizing the same document several times under different names, errors in following file naming conventions, publishing documents that had not yet been reviewed, and the loss of files or documents, among others. Since feeding a digital library requires a significant effort, the more automated tools that can be built and used, the better the use of human resources (McCray and Gallagher, 2001; Baird, 2003).

The problems described above may be amplified when the documents composing the repository are ancient physical documents. Their state of preservation is, in general, poor. For that reason, scanning has to be done carefully to avoid further damage to the documents. Also, the conversion of the obtained images to text through character recognition technologies becomes especially difficult due to the deterioration of the documents.
Therefore, the results of this task must be reviewed in order to correct possible errors. This process is sometimes carried out step by step by a small group of experts with deep knowledge of the documents, whose skills guarantee the quality of the results. However, when the digital library has to be built from thousands of documents, the creation of the document repository involves a large digitization process carried out by a large group of people who, due to financial restrictions, are not always so skilled (Chang and Hopkinson, 2006; Sankar, et al., 2006). One of the main challenges we have to face in the development of digital libraries that store ancient documents is the importance and complexity

[6] http://www.greenstone.org
[7] http://fedora-commons.org/
[8] http://www.dspace.org

of the digitization and processing of the documents. In these situations, the use of support tools to guide the workflow and to facilitate the labor of the workers is mandatory. Controlling the workflow inside the digitization team is a key factor in the success of the process. This control can be achieved using a workflow management tool specially designed for this purpose: a system to coordinate and control all the people involved, to monitor factors such as the current state of each page, to store intermediate results, to maintain statistics on the progress of each task, to control the average time needed to process each document, to track the people who have worked on each document, etc. In this paper we address these issues and propose a framework for effectively and efficiently feeding a digital library. The purpose of this workflow management system is to automatically coordinate all the activities involved in the digitization and indexing process. The main contributions of this paper are:

 We present a workflow management framework supporting the process of creating a document repository. This framework comprises a process covering the digitization, revision and editing activities, and a system architecture supporting that process. The proposal is based on an analysis of the potential problems that may arise during the digitization of large collections of documents, grounded in real cases and previous experiences. The framework improves the performance of the process and ensures that all required tasks are correctly performed, facilitating the work of the people involved in these activities.



 We have implemented a tool based on this proposal, called Digiflow. We present the details of this tool in the paper, relating each of its components to the corresponding part of the system architecture.



 Finally, we present the results of a real case study in which our framework was used to digitize a large collection of ancient documents of the Royal Galician Academy (RAG) in order to make them available online.

The rest of the paper is structured as follows. Section 2 describes related work, focusing on existing import and digitization tools for digital libraries. Section 3 presents our framework for feeding digital libraries: an analysis of the potential problems it addresses, its requirements, a system architecture, and a comparison of the proposed framework with other existing systems. In Section 4 we present Digiflow, a tool implementing the proposed framework; the description focuses on the implementation of each module of the system architecture presented in Section 3. Section 5 presents a case study on the application of Digiflow in a real scenario, in which the tool was used to digitize a collection of more than 10,000 ancient documents from the nineteenth century to be incorporated into the digital library of the RAG. Finally, we summarize the main conclusions and outline directions for future work.

2 Related work

In this section we review existing work tackling the process of building and feeding digital libraries. Some proposals, such as (Larson and Carson, 1999) and (Sankar, et al., 2006), guide the whole process, from the digitization of the documents to their publication in the digital library. Others, such as (Bainbridge, et al., 2003) and (Buchanan, et al., 2005), skip the digitization step, assuming the works were previously digitized, and focus on the remaining steps (these proposals can be seen as import tools).

2.1 Import tools

Bainbridge and his colleagues presented in (Bainbridge, et al., 2003) a tool called "Gatherer" that facilitates the entire process of building digital library collections. The tool was designed to feed a digital library built with Greenstone, though the underlying ideas are generally applicable. However, the tool (and the underlying procedure) does not take into account the digitization process, and assumes that the documents are already in electronic format. The authors explicitly

point out that the process of feeding a digital library usually starts with a digitization process, but they do not address it. The tool supports four administrative tasks:

 Copying documents from the computer file space into the new collection. Any existing metadata remains "attached" to these documents. Documents can also be harvested from the web.
 Enriching the documents by adding further metadata to individual documents or groups of documents.
 Designing the collection, that is, specifying its structure, organization, and presentation.
 Building the collection as a final step; the collection is built in Greenstone, and this step includes the indexation of the collection.

With the appearance of Greenstone 3, Buchanan and his colleagues (Buchanan, et al., 2005) presented a framework for building digital libraries with Greenstone 3, this time without any specific tool. Again, their starting point assumes that the source files are already in an electronic format. The process is more elaborate. It begins with an Expansion stage, where compressed source files are decompressed and links to web sites are expanded into lists of web pages; this stage gathers the source files for the next phases. The Recognition phase joins all the files that form a document; for example, the JPEG files included in a web page are considered part of the document that contains that page. The resulting documents are sent to the Encoding process, which converts them into the METS (Metadata Encoding and Transmission Standard) format (Library of Congress, 2007). The Extraction stage automatically processes the documents in order to extract information from them (e.g., title, keyphrases). Next, the Classification stage processes the documents using classifiers in order to assign each document to a node of a browsing structure. The Indexation phase can index the collection with different indexers.
Finally, a Validation process provides quality control. As can be seen, this process is general enough to be suitable for the automatic feeding of digital libraries in many different situations. However, we found that at least two processes are not considered. First, the process of scanning documents from their original versions (in many cases, in very bad condition and with an old typography). Second, dealing with those cases where electronic metadata are not available.
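As a rough illustration of this staged design, the process can be sketched as a chain of functions, one per stage. The stage names follow (Buchanan, et al., 2005); every function body below is an illustrative placeholder of ours, not Greenstone code.

```python
# Illustrative sketch of a staged ingestion pipeline in the style of
# Greenstone 3 (Buchanan, et al., 2005). Stage names come from the
# framework; all function bodies are placeholders, not Greenstone code.

def expansion(sources):
    # Decompress archives and expand links into lists of files.
    out = []
    for src in sources:
        out.extend(src["files"] if "files" in src else [src["name"]])
    return out

def recognition(files):
    # Group files that together form one logical document.
    docs = {}
    for f in files:
        docs.setdefault(f.rsplit(".", 1)[0], []).append(f)
    return list(docs.values())

def encoding(documents):
    # Convert each document into a standard representation (e.g. METS).
    return [{"format": "METS", "parts": d} for d in documents]

def extraction(documents):
    # Extract metadata such as a title from each document.
    for d in documents:
        d["title"] = d["parts"][0].rsplit(".", 1)[0]
    return documents

def classification(documents):
    # Assign each document to a node of a browsing structure.
    for d in documents:
        d["node"] = d["title"][0].upper()
    return documents

def indexation(documents):
    # Build the searchable index over the collection.
    return {d["title"]: d for d in documents}

def validation(index):
    # Final quality control before publication.
    assert all("node" in d for d in index.values())
    return index

def ingest(sources):
    docs = extraction(encoding(recognition(expansion(sources))))
    return validation(indexation(classification(docs)))
```

Each stage consumes the output of the previous one, which is precisely why the framework can be automated end to end once the sources are already electronic.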

2.2 Digitization tools

Larson and Carson (Larson and Carson, 1999) presented the feeding process of the Cheshire II project, which is composed of the following six stages:

1. Scanning each document. As a result, a directory named with the Electronic ID assigned to the document is created. The directory contains the files created by the scanning software and a file (bib.elib) with the associated metadata, introduced by the person responsible for the scanning procedure. Each page of the document is stored in a sequentially numbered TIFF file.
2. The TIFF files are converted into GIF.
3. The OCR (Optical Character Recognition) process is run. Two directories are produced: (a) OCR ASCII text, and (b) OCR XDOC, which contains word position information.
4. Each document is merged with its images; that is, links are inserted in the text to give access to the images.
5. The document is moved to its final location, from where the digital library will make it available.
6. The indexation process is run.
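The on-disk layout implied by stage 1 can be sketched as follows. The Electronic ID directory, the sequentially numbered TIFF files and the bib.elib metadata file come from the description above; the zero-padding width and the key-value format of the metadata file are our own illustrative choices.

```python
import os

def create_document_dir(root, electronic_id, metadata, num_pages):
    """Lay out a scanned document in the style described for Cheshire II:
    a directory named after the Electronic ID, a bib.elib metadata file,
    and one sequentially numbered TIFF file per page. The padding width
    and metadata format are illustrative, not Cheshire II's own."""
    doc_dir = os.path.join(root, electronic_id)
    os.makedirs(doc_dir, exist_ok=True)
    # Write the metadata recorded by the person doing the scanning.
    with open(os.path.join(doc_dir, "bib.elib"), "w", encoding="utf-8") as f:
        for key, value in metadata.items():
            f.write(f"{key}: {value}\n")
    pages = []
    for n in range(1, num_pages + 1):
        page = os.path.join(doc_dir, f"{n:04d}.tif")
        open(page, "wb").close()  # placeholder for the scanner's output
        pages.append(page)
    return pages
```

With such a deterministic layout, later stages (GIF conversion, OCR) can locate every input file without human intervention.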

The main drawback of this process is the absence of control mechanisms or tools that help carry out the tasks. As a result, many errors may arise due to the wrong placement of files, erroneous file names, scanning the wrong pages, etc.

Sankar and his colleagues (Sankar, et al., 2006) were involved in a project that aims to digitize one million books. In this case, they do face the problem of scanning the documents. They divided the digitization process into logical steps, which are distributed across several places, and they designed a tool to control the workflow. The phases of their workflow start with the scanning. Next, the scanned images are post-processed to remove noise and other artifacts. After cleaning the images, the OCR is run. Then, a quality check is performed. Finally, the resulting documents are stored in a web server. In (Sankar, et al., 2006) no further details about the workflow management are given, and only an automatic tool for quality control is presented. This quality control is based on automatically verifiable parameters such as dimensions, dpi, skew, etc. However, there is no quality control over the OCR process, and the authors therefore admit that, due to OCR errors, they store texts with some mistakes. Although the transcribed version of the text is only used to assist in the search process, this process is obviously burdened by the presence of such OCR errors. Thus, as the authors point out, they have to rely on the scanned version for presentation purposes.

The Stanford Digital Repository (SDR) (Cramer, 2010) is another representative system for building and feeding digital libraries, including the digitization activities. SDR allows the integration, in an institutional digital library, of contents coming from different sources (either internal or external).
Given the large scope of SDR, it also provides management functions that allow system administrators to control different activities, such as the progression of the feeding tasks. However, (Cramer, 2010) does not provide details on how the digitization process is managed. That is, SDR can accommodate the documents produced by other systems managing the digitization, but it does not directly support the digitization process itself.
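The kind of automatic quality control described in (Sankar, et al., 2006), based on verifiable parameters such as dimensions, dpi and skew, can be sketched as a simple per-page check; the thresholds below are purely illustrative and are not taken from their system.

```python
def check_scan_quality(width_px, height_px, dpi, skew_deg,
                       min_dpi=300, max_skew_deg=1.0, min_side_px=1000):
    """Return a list of quality problems detected for one scanned page.
    The thresholds are illustrative; a real project would tune them to
    its scanning hardware and document sizes."""
    problems = []
    if dpi < min_dpi:
        problems.append(f"resolution too low: {dpi} dpi < {min_dpi} dpi")
    if min(width_px, height_px) < min_side_px:
        problems.append(f"image too small: {width_px}x{height_px} px")
    if abs(skew_deg) > max_skew_deg:
        problems.append(f"page skewed by {skew_deg} degrees")
    return problems
```

Note that a check of this kind validates only the image; it says nothing about the correctness of the OCR output, which is exactly the gap our framework addresses with an explicit correction stage.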

3 A framework to feed digital libraries

In this section we present our framework for feeding digital libraries. First, we analyze the potential problems that may appear during the digitization of large collections of documents. From this analysis, we derive the requirements for a workflow system for managing digitization. We then present the system architecture of our framework. Finally, we compare our proposal with existing systems and models.

3.1 Problems in massive digitization of documents

During the digitization process, each page of the physical documents is scanned; the pages are then processed using OCR technologies for text extraction and, finally, revised and corrected in order to fix errors from the automatic text recognition. Other activities, such as metadata definition or document markup, are also necessary. Finally, the text is stored in the repository, and its content (text and images) is indexed and published. Taking into account that a collection can have hundreds of thousands of pages, the digitization of documents becomes a complex and costly process, usually requiring the intervention of experts with deep domain knowledge. From our previous experience in the development of digital libraries (Parama, et al., 2006; Places, et al., 2007), we have identified several typical problems in this process:

 Problems with the file naming protocol. Due to the high number of files to be managed during the digitization, such a protocol is necessary. When few people participate in the digitization, the file naming conventions are usually followed, and small errors can be easily managed. However, when tens of people work together, small errors are likely to appear, and managing them can waste a significant amount of time.

 Loss of files. Without support tools, each participant is responsible for the files obtained in each activity. If the management of hundreds or thousands of files is done manually, typical errors will frequently occur, such as overwriting files, or saving files with the

wrong name or in the wrong folder. If the participants' experience with computers is limited, these errors will be very common.

 Task specification. There are different ways to carry out a task, and a bad specification of the task parameters is another source of typical problems: for example, scanning with an incorrect page orientation, scanning two pages together instead of one, reviewing an already reviewed document, or entering document metadata that were already available in the database. These problems worsen when several people work on the same document.

 Lack of coordination. Coordination is difficult when a large group of people works on the project. Each person can be devoted to specific activities and have a different timetable. For example, one person can be responsible for scanning a document in the morning, while another deals with its correction in the afternoon. Effective task management is necessary to avoid dead times and the waste of resources.

 Ineffective resource control. Since the number of resources available for the digitization is limited, the lack of control can be a source of dead times in some activities of the workflow. For example, some workers may have to wait for free scanners or computers, or even for the availability of the physical document. In addition, without resource management, reports about the particular resources used in each activity are not available.

 Management of responsibilities. The correct definition of the person in charge of each task is also important, especially when checking the extracted texts is difficult and requires deep knowledge of the type of literature being digitized.

Perhaps most of these problems seem trivial and easy to solve.
However, taking into account that they can be repeated thousands of times during the whole digitization process, their consequences can have a great impact on factors such as the processing time and the quality of the digitized documents.

3.2 Requirements for the workflow management system

As a solution to the problems presented in the previous section, we describe here the requirements for the workflow management architecture we propose in this paper.

 Automated results management. As pointed out above, errors due to not following the file naming conventions, as well as the loss of files, are common when a group of people works together in a digitization chain. A workflow management system for document digitization must automatically manage the files produced in each activity. Thus, when a person starts a new task, the system must automatically provide the inputs needed for that task, which could be the outputs of previous tasks, without any human interaction. This avoids problems regarding the loss of intermediate products.

 Task control and monitoring. The system must provide the administrator with the tools needed to continuously monitor all the information about the state of each task: for example, the person assigned to the task, the progress of the results, and the recorded problems.

 Effective resource management. This requirement is related to the previous one. The system must continuously control the availability of the resources needed for each activity, immediately identifying and reporting potential conflicts between tasks. For example, if several documents are being scanned at the same time and a rescanning is needed to correct OCR errors, the system must identify a time slot in which the hardware will be available.

 Work dedication reporting. It is important to be able to generate reports about indicators such as the average time devoted to each task, the average number of pages processed in a period of time, the number of corrections made on the results of the OCR process (that is, the number of errors found and fixed in the OCR output), the average dedication of each person in a given period of time, etc.



 Product quality control. Although research in OCR is continuously reducing the error rate, the output of OCR systems is still far from perfect (Kolak, et al., 2003; Banerjee, et al., 2009). This is especially harmful when we deal with ancient documents. Therefore, the review of the results obtained from the digitization is very important. The system must facilitate this review process by providing the reviewer with both the image and the extracted text, and by ensuring that the document is not published until the review is successfully finished.

3.3 System architecture

According to Hollingsworth (1995), workflow is the computerized facilitation or automation of a business process, in whole or in part, and it is concerned with the automation of procedures where documents, information, or tasks are passed between participants following a defined set of rules to achieve, or contribute to, an overall business goal. Collaborative workflow systems automate business processes where a group of people participates to achieve a common goal (Aalst and Hee, 2002; Fischer, 2003). This type of business process involves a chain of activities where the documents, which hold the information, are processed and transformed until that goal is achieved. As the problem of feeding digital libraries fits this model perfectly, we based the architecture of our system on it.

We can differentiate three user profiles involved in the repository building scenario:

 Administrator. Administrators are responsible for the digitization process as a whole. They assign tasks to the different workers and control the state of each digitized document.

 Advanced users. The advanced users are the people in charge of carrying out critical activities such as the metadata storage or the review of the texts obtained from the OCR processes. The rationale behind this user type is that these tasks usually require a thorough knowledge of the documents (for example, a deep knowledge of the Galician literature of the 19th century is needed to review this type of documents).

 Standard users. The standard users are the workers who carry out tasks such as scanning or the OCR correction. This role is played by users with some knowledge of the document domain, but without any responsibility for the management of the system (for example, grant-holding students could carry out these activities).
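The three profiles can be sketched as a role-to-activity mapping. The activity names below correspond to the modules of the architecture; the exact assignment of activities to roles is an illustrative reading of the profile descriptions, not a specification of Digiflow.

```python
# Sketch of the role model described above. Activity names follow the
# modules of the architecture; the permission assignment is an
# illustrative reading of the profile descriptions, not a specification.
ACTIVITIES = {"metadata", "scanning", "ocr", "correction",
              "markup", "indexing", "administration"}

PERMISSIONS = {
    # Standard users: routine digitization tasks.
    "standard": {"scanning", "ocr", "correction"},
    # Advanced users: also critical, knowledge-intensive tasks.
    "advanced": {"scanning", "ocr", "correction", "metadata", "markup"},
    # Administrators: everything, including workflow administration.
    "administrator": ACTIVITIES,
}

def can_perform(role, activity):
    """Return True if the given role is allowed to run the activity."""
    return activity in PERMISSIONS.get(role, set())
```

An authorization module of this kind is what lets the system show each user only the features he or she needs.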

Figure 1. System architecture.

Figure 1 shows the overall system architecture. When defining it, we followed the recommendations of the Workflow Reference Model (Aalst and Hee, 2002), a commonly accepted framework for the design and development of workflow management systems, intended to accommodate the variety of implementation techniques and operational environments that characterize this technology. Thus, although we defined this architecture for the implementation of a specific system, it can be used in other environments and situations. As shown in Figure 1, the authentication and authorization module is in charge of authenticating the workers who want to access the system. Each user has a system role depending on the tasks he/she is going to work on, and based on this role the authorization module grants the user access only to the needed features. The system architecture includes a module for each of the activities involved in the creation of the repository:

 Metadata storage: this subsystem is in charge of the introduction and storage of the metadata of each document (title, author, year, source, etc.), following any desired format, such as Dublin Core or MARC (Machine-Readable Cataloguing). This task is performed by the advanced users of the system.

 Scanning: this subsystem provides access to the scanning hardware and software, and it is responsible for managing the specification of the scanning parameters of each document (for example, options like scanning two pages at the same time, landscape/portrait orientation, resolution, number of colors, etc.).

 OCR: it provides access to the OCR software and obtains the scanned images needed as input for this activity, so it is not necessary to retrieve them manually. The module automatically stores the results.

 Correction: this module provides the reviewer with both the image and the extracted text. Corrections of the extracted text can be carried out if necessary.
 Markup: it provides the tools used for marking up the text with metadata such as the title, author, page, etc.

 Indexing and Web publishing: once the document is ready for publication, this module is in charge of indexing its content, using information retrieval techniques that provide efficient search functionality, and of publishing it on the Web.

 Workflow administration module: this subsystem is in charge of managing the workflow. It also provides reporting tools for monitoring purposes. The data regarding the digitization chain are stored in what we call the workflow database.

Note that the system architecture assumes the use of different repositories and databases. An image repository, a text repository, and a document database store the scanned images and the texts extracted from them. An index is built over the document database and the text repository to support efficient searches. Finally, the workflow database stores the information about the digitization (lists of tasks, state of each document, etc.).
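The per-document information that the workflow database must track can be sketched as a small state machine. The state names below mirror the activities of the architecture; the actual states used in our implementation are not detailed in this paper, so this is only an illustration.

```python
# Sketch of the per-document state tracking the workflow database must
# support. State names mirror the activities of the architecture; the
# actual states used by the implementation are not given in the paper.
TRANSITIONS = {
    "registered": "scanned",      # metadata stored, ready to scan
    "scanned":    "recognized",   # OCR has been run
    "recognized": "corrected",    # OCR errors reviewed and fixed
    "corrected":  "marked_up",    # structural metadata added
    "marked_up":  "published",    # indexed and available on the web
}

class Document:
    def __init__(self, doc_id):
        self.doc_id = doc_id
        self.state = "registered"
        self.history = []          # (state, worker) pairs, for reporting

    def advance(self, worker):
        """Move the document to its next state, recording who did it."""
        if self.state not in TRANSITIONS:
            raise ValueError(f"{self.doc_id} is already published")
        self.history.append((self.state, worker))
        self.state = TRANSITIONS[self.state]
        return self.state
```

Because every transition is recorded with the worker who performed it, the same structure supports both the publication guarantee (a document cannot reach "published" without passing through "corrected") and the work dedication reports described in the requirements.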

3.4 Comparison

In this section, we compare our framework and the tool implementing it (Digiflow, which we present in the next section) with the systems and frameworks presented in Section 2.

[Table 1: a grid marking, for each system (Ch II, Gatherer, GS 3, DLI, SDR and DigiFlow), which of the stages MMS, Exp, Sca, OCR, COCR, Enc, AMS, Mark, Class, Idx, Val and Sto it supports.]

Table 1. Comparison of the reviewed systems. Ch II: Cheshire II; GS 3: Greenstone 3 framework; DLI: Digital Library of India; SDR: Stanford Digital Repository; DigiFlow: the framework we propose in this paper.

Table 1 shows, for each framework/system, the steps it includes. Each column corresponds to one of the following stages:

 MMS: manual metadata storage, or extraction from previously existing electronic metadata files.
 Exp: expansion, which includes decompression and URL expansion.
 Sca: scanning.
 OCR: Optical Character Recognition.
 COCR: correction of OCR errors.
 Enc: encoding, which translates the documents into a standard representation such as METS.
 AMS: automatic metadata extraction and storage.
 Mark: markup.
 Class: classification, which uses metadata to place a document within the browsing structures.
 Idx: indexation.
 Val: validation, which checks issues such as skew, dpi and other parameters.
 Sto: storage, from where the digital library will make the document available.

From Table 1, we can conclude that our focus is on the quality of the scanning process. Our framework and the associated tool (Digiflow, which will be reviewed in Section 4) represent the only approach that considers the correction of the unavoidable OCR errors. Other existing systems, like Greenstone, put their emphasis on automatic ingestion by means of a bulk process; this requirement conflicts with the scanning process, which requires much human interaction. Our proposal and the tool implementing it put the emphasis on digital library feeding processes in which the documents go through a complex scanning, text recognition, correction, and indexing process, as is the case in digital libraries built for cultural heritage preservation. Due to the necessary participation of experts and the complexity of the process, our framework aims at making this process manageable, efficient, and effective.

4 Digiflow: A tool for building document repositories

The framework presented in the previous section was applied in the implementation of Digiflow, a workflow management system supporting the creation of digital libraries. This tool provides an integrated environment where all the tasks necessary to create a document repository and feed a digital library can be executed. The application provides users with all the tools needed to carry out each task, without needing to use other software applications or to manually manage the results of each task. Digiflow is focused on the digitization of documents, which we will also call works in the rest of the paper. A work can be a book, a volume of a journal, or any other unit on which the digitization process can be performed. In the development of Digiflow we addressed the main problems that arise during the massive digitization of documents. In the description of Digiflow, we first present the digitization workflow it supports. Then, we present how the tool was designed and developed, and how the different modules of the architecture presented in the previous section were addressed.

4.1 User profiles and responsibilities

According to the proposal we presented in the previous section, Digiflow distinguishes between three different user profiles: administrator, advanced, and standard users.

- Administrator: the administrator profile is responsible for the administrative activities of the digitization process, such as creating new digitization works in the system and defining the required tasks (subtasks, priority, etc.). The administrator can also access a set of monitoring tools that allows him/her to supervise the whole process (task revision, reports about the progress of a work, work done by each user, alerts of problems, etc.). Digiflow allows the administrator to avoid problems related to the lack of coordination, and it makes effective resource control easier, since the administrator can modify the priority of any task or even change the user that should perform it. Digiflow also allows the administrator to know the list of tasks assigned to or completed by each user, hence supporting an effective management of responsibilities. In addition, it provides the capability to generate reports about work dedication. Details about the user interface provided by Digiflow to the administrator can be seen in Section 5.2.2.
- Advanced and standard users: they constitute a lower-level profile and have access to the basic functionalities of Digiflow. Digiflow shows these users a list of the pending tasks they have to perform, and for each task they are completely guided through the process by the instructions provided by the system. Among other tasks, advanced users are in charge of the metadata of each work, such as the title or the author, but also metadata of more direct interest for the digitization process, such as the orientation of the pages and other parameters about how the scanning should be done. Once this is done, Digiflow can guide a standard user through the digitization of a work. For example, when a scanning task is being performed, Digiflow indicates which page of a book must be scanned and how (orientation, two pages at a time, etc.). The user only has to place the book on the scanner and push a "scan" button. After that, Digiflow automatically saves the scanned page in the corresponding repository with the proper name. Therefore, problems related to file naming, or to the loss of files in the system, are avoided. Other basic tasks within the basic functionalities of Digiflow are the OCR and the correction of the text obtained for a given page. In the former task, Digiflow automatically fetches the previously scanned page and launches the OCR process. Then, following the guidelines to ensure effective product quality control, it presents both the text and the scanned page in parallel so that the corresponding user can validate or modify the text if needed. Again, the result is automatically stored in the corresponding text repository.

In the next subsection we focus on the flow of activities carried out during the creation of a document repository according to the system architecture defined in Section 3.3.

4.2 Digitization workflow

The UML activity diagram of Figure 2 shows the activities involved in the Digiflow digitization workflow, and the order in which those activities must be carried out to create the document repository. Each of the activities in the diagram is a stage of the workflow. Activities can have different execution modes, that is, an activity can be carried out using different applications. In addition, activities can be either divisible or indivisible. An activity that can be done by more than one user is called a divisible activity. These activities are divided into tasks, which are performed by only one user. Next, we describe the activities involved in the workflow:

Figure 2. Digiflow workflow for the creation of a document repository.

1. Start the workflow with a work: this activity, which is carried out by an administrator user, marks the beginning of the workflow for a specific work. It consists of the creation of the work in the system and the assignment of the metadata storage activity to a user.
2. Metadata storage: the first step when processing a new work is to enter its metadata into the system. This includes, for example, the name of a book, its authors, the number of pages, the expected orientation of the pages for the scanning process, etc. Digiflow provides the users with specific forms to carry out this activity. It is not possible to proceed with the flow until the metadata of a given work has been entered into the system, since it is necessary to assign the remaining activities to particular tasks. The current implementation of Digiflow only allows the users to enter the metadata manually; it does not support automatic import from other existing information sources. The motivation for this design decision is that automatic metadata import was not a requirement in the potential use scenarios we faced with Digiflow. However, this module could be modified to automatically import metadata from other sources, since the architecture and design of the tool allow us to replace or modify the implementation of a module without affecting the rest of the system.

3. Work configuration: this activity consists of the generation of the tasks necessary to complete the digitization. In the case of a divisible activity, the system allows generating different tasks and assigning them to different users.
4. Scanning a page: the tasks associated to this activity are performed by either standard or advanced users. We used a UML expansion area (the area surrounded by a dotted line) in the activity diagram to represent the repetitive process of scanning each page composing a work. This UML notation also indicates that the three activities inside the expansion area (scanning, OCR, and correction) can be done in parallel by more than one person when possible (that is, these are divisible activities):
   a) Scanning: this activity comprises the creation of the digital images that correspond to each page in a work. As expected, the system frees the users from the task of assigning a name and a storage location to those images.
   b) OCR: this activity involves the application of an OCR process on the images generated in the previous activity. The OCR software used in the first release of Digiflow was OmniPage Pro.
   c) Correction: in this activity, a user revises the results obtained from the OCR activity. OCR tools do not always provide the expected results, especially if the typography of the work is not standard or if the quality of the original document is poor. Therefore, it becomes necessary to manually review all the pages to find and correct the mistakes. The result of these tasks is an OmniPage file ("opd" file in the rest of the paper) that includes the image, the associated text, and the coordinates of each word within the image obtained from the scanner. This is the source for the text repository, the image repository, and the indexing subsystem of Digiflow (described below).
   The scanning-OCR-correction group of activities is the part of the workflow demanding the most time from the users. In the next section, we will show how Digiflow guides the users through this part of the process in a real scenario.
5. Checking: this activity involves a second revision of the pages to verify the correctness of the process. Its purpose is to add an additional guarantee of correctness before publishing the works on the web.
6. Indexing and web publishing: the obtained data (consisting of an image and/or transcribed text) are finally indexed and published on the web. Note that after the previous acquisition steps, we obtain both an image repository and a text repository. In addition, in the OCR phase our system is also able to provide the coordinates of each word within its corresponding image of the document. Therefore, we can build an index on the text that enables retrieving the documents containing a given word, and additionally we can mark the positions where such a pattern occurs within the corresponding source images. This allows the publishing system not only to permit access to the repositories, but also to boost search capabilities. More details are provided in Section 5.3.
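The stage ordering just described, including the loop from checking back to correction, can be sketched as a small state machine. The names and function below are purely illustrative (Digiflow itself was written in C#); they are not the tool's actual code:

```python
# Illustrative sketch of the workflow ordering described above;
# the identifiers are our own, not Digiflow's API.
STAGES = ["start", "metadata", "configuration",
          "scanning", "ocr", "correction",       # divisible, per-page stages
          "checking", "indexing_and_publishing"]

def next_stage(current, needs_correction=False):
    """Return the stage that follows `current`; the checking stage loops
    back to correction when errors are found."""
    if current == "checking" and needs_correction:
        return "correction"
    i = STAGES.index(current)
    return STAGES[i + 1] if i + 1 < len(STAGES) else None
```

For instance, `next_stage("checking", needs_correction=True)` yields `"correction"`, mirroring the [Needs correction] guard in the activity diagram, while a work that passes checking proceeds to indexing and web publishing.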

4.3 Digiflow architecture, design and development

In Section 3 we presented a framework and a general system architecture for a digitization workflow management system without tying it to a particular technology. Digiflow refines this system architecture with particular technologies and design decisions. In this subsection we describe how the design and development of Digiflow addresses each of the components defined in the general system architecture, and we explain the reasons for certain implementation decisions. Figure 3 shows a detailed architecture diagram of Digiflow. The lower part of the diagram shows how all the data generated during the digitization process are persistently stored. Digiflow uses three storage subcomponents:



- Document and workflow database: all data related to the documents' metadata and to the management of the workflow are stored in a relational database. In particular, we used Microsoft SQL Server. Note, however, that this component can be easily replaced by any other DBMS.
- Scanning and OCR repository: the images obtained from the scanning of the works, and the corresponding opd files obtained from the OCR software, are stored in an external file repository.
- Indexing and publishing: this module is responsible for storing and indexing the text of the works. One of our goals in the development of the system was to provide users with a set of rich search functionalities. Since Digiflow was designed to be used especially with ancient documents, we wanted, for example, to be able to show search results on the original images of the works. The implementation of this module is based on modifications we developed on the open source indexing library Lucene9. The details of this module will be presented in the last subsection of this section. Since this module can be of interest for other developments even if the rest of Digiflow is not used, it was developed as a separate component that can be used alone. That is, the rest of the tool communicates with this module to enter the texts into it when the scanning, text recognition, and correction activities have been correctly completed.

Figure 3. Digiflow architecture and implementation.

The remaining modules shown in the architecture implement the functionalities presented to the different users of the system:

- Workflow management: this module implements all the workflow controls, guiding the users through the digitization process and providing access to the rest of the modules.
- Metadata: this module allows the users to enter the metadata of the works into the system. It currently supports the storage of metadata related to literary works and periodical publications, such as journals. In the case of literary works, Digiflow allows entering data about the authors, the literary work itself, and each of its pages. In the case of periodical publications, it allows storing data related to each journal, the volumes of each journal, the numbers composing each volume, the articles published in each of the numbers, and finally, each page of an article.
- Scanning and OCR: this module encapsulates all the details needed to access the scan and OCR functions through the OmniPage suite. The purpose of this module is to act as a black box hiding all low-level details and providing a simple interface to the rest of the modules. Digiflow was implemented in C# and the communication with the OmniPage suite was implemented through OLE (Object Linking and Embedding) and COM (Component Object Model) automation components.
- Correction: this module allows the users to access the opd files resulting from the scanning and OCR module, and to revise and correct those files.
- Indexing and publishing: this small module interacts with the storage module in charge of indexing and publishing, which we describe in more detail in the next subsection.

9 http://lucene.apache.org/core/

Another aspect related to the design and development of Digiflow is our choice of technologies. We implemented Digiflow in C# using the Microsoft .NET platform. We also used the OmniPage OCR suite and the Lucene open source indexing library. Some of these decisions resulted from constraints on the project sponsor's available technological environment. However, the architecture and design of the system allow any of its modules to be replaced without affecting the rest. Some modules can be replaced more easily than others. For example, the database containing the document metadata and the workflow management data could easily be replaced by another relational DBMS. Other modules, such as the one managing the interaction with the scanning and OCR software, would have to be reimplemented to adapt Digiflow to a platform with a different operating system and scanning software. It is also important to note that the scope of Digiflow is only that of managing the digitization process. That is, it does not provide a public digital library that the general public can directly access. In the next section we present how a large collection of ancient documents was digitized with Digiflow, and how a public digital library (developed in a different technology, Java) was built on the result.
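The replaceability argument can be illustrated with a thin interface placed around the OCR engine. The sketch below is hypothetical Python (Digiflow talks to OmniPage from C# over OLE/COM); the class and method names are ours, not the tool's:

```python
from abc import ABC, abstractmethod

class OcrEngine(ABC):
    """Boundary behind which the scanning-and-OCR module hides the concrete
    OCR suite. Porting to another platform means writing one new subclass."""
    @abstractmethod
    def recognize(self, image_path):
        """Return the recognized text and per-word image coordinates."""

class StubEngine(OcrEngine):
    # Stand-in for a real engine such as OmniPage; returns canned output.
    def recognize(self, image_path):
        return {"text": "el patriota",
                "words": [("el", (10, 12, 24, 30)),
                          ("patriota", (30, 12, 96, 30))]}

def digitize_page(engine, image_path):
    # The rest of the workflow depends only on the OcrEngine interface,
    # never on a concrete suite.
    return engine.recognize(image_path)["text"]
```

Because `digitize_page` depends only on the abstract interface, swapping OmniPage for another suite would not ripple through the rest of the system.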

4.4 Digiflow search and indexing capabilities

The text retrieval subsystem of Digiflow is based on an inverted index built with the Lucene open source library for text indexing. Since this module of the system can be useful in other developments without using the rest of the tool, it was implemented as a separate software component. In this way, the texts obtained by using Digiflow are entered into this module, which acts as a black box that indexes the documents and their corresponding images, and provides a set of search functionalities. In order to build such an inverted index, the opd files generated by the OmniPage software are processed first. These opd files have three components: text, image, and the coordinates of each word in the image. After preprocessing the opd files, we transform them into XML files, which are the source for the process that constructs the inverted file. This translation makes the manipulation of the obtained information much easier. As we show in Figure 4, a document is represented in Lucene as an instance of a class Document that aggregates a collection of objects belonging to the class Field. Each field contains a name and a string of characters. Examples of names could be title, author, etc. The text of the work is always one of these fields. The exact list of fields is chosen by the developer for each case.

Figure 4. Representation of a document (literary work in our case) in Lucene.
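The Document/Field aggregation of Figure 4 can be mimicked in a few lines. The sketch below is plain Python imitating Lucene's classic model, not Lucene's real (Java) API, and the field names and identifier value are invented for the example:

```python
class Field:
    """A named string of characters, e.g. ("title", "..."), ("content", "...")."""
    def __init__(self, name, value):
        self.name, self.value = name, value

class Document:
    """Aggregates a collection of fields, as in Lucene's model."""
    def __init__(self, *fields):
        self.fields = list(fields)
    def add(self, field):
        self.fields.append(field)
    def get(self, name):
        # First field with that name, or None if absent.
        return next((f.value for f in self.fields if f.name == name), None)

# A work as indexed by Digiflow: its database identifier plus its content
# ("work_id" and "RAG-0042" are hypothetical).
doc = Document(Field("work_id", "RAG-0042"),
               Field("content", "a beautiful house close to our house"))
```

In Digiflow only two fields are attached to each document, as described next: the content of the work and its identifier in the digital library database.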

In the case of Digiflow, each edition of a literary work or each number of a journal is represented by an object of the Document class. The fields we associated to each document are the content of the work and the identifier of the work in the database of the digital library. In order to obtain the content (that will be indexed), the XML files containing the text of the literary work or journal article have to be pre-processed to remove the XML tags. Additionally, the resulting text is also converted to lower case. Once we have the document objects of all the literary works and journal articles, it is possible to build the inverted index. For each word in the vocabulary (the list of all different words that appear in the indexed collection of documents), the index stores the list of documents where that word occurs. In addition to this list of documents, the inverted index stores other additional information depending on the nature of the indexed text. When designing this module, we wanted to support the cases in which the text could be either transcribed text or text included in an image obtained from a scanner. If the indexed text is plain transcribed text, the inverted index stores the relative positions of each word inside each document. In the case of scanned images, instead of storing the relative position of each word, the inverted index keeps the position (coordinates) of each word in each scanned image. During the search process, once the inverted index has supplied the documents where a word or phrase occurs, the system presents to the user the sections (a page or a group of pages, in the case of literary works, or articles, in the case of a periodical) where the word or phrase occurs. From here on, the process that computes the response to a given query follows two different paths depending on the nature of the text.

4.4.1 Plain text retrieval

The global search process for plain text (that is, text not coming from a scanned image) can be seen in Figure 5.
From the index, as we have already pointed out, we obtain the identifiers of the literary works containing the searched patterns. With those identifiers the system accesses the metadata of the literary works to build a list of retrieved works. The user selects one of these literary works, and then the system shows the list of pages where the pattern occurs. Unfortunately, the inverted index is not enough to generate this list of pages, since it only stores which documents contain the searched words and the relative positions of the words inside those documents. Relative positions do not represent the exact physical position of a word, but the order of that word within the text: the first word in the text is numbered 1, the second one 2, and so on. Relative positions are used to search for phrases, where the searched words must be present in a certain order in the text, but they are not useful for knowing the physical position of the words. Thus, we cannot know the page and the exact position of the occurrences of a word. To find the pages containing a given pattern, a pattern-matching algorithm is therefore needed to find the first occurrence of the pattern in each page, if it exists. Once the first occurrence is found, that page is reported as one of the pages including the pattern, and the search skips the rest of the page to continue from the beginning of the next page.

Figure 5. Search system process in plain text.
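The page-reporting step can be sketched as follows: each page is reported at most once, and scanning a page stops at its first occurrence. Python's built-in `find` stands in for the dedicated pattern-matching algorithms; the function itself is illustrative, not the system's code:

```python
def pages_with_pattern(pages, pattern):
    """pages: the per-page texts of one literary work, in order.
    Returns the 1-based numbers of the pages containing the pattern; only
    the first occurrence in each page is sought, after which the rest of
    that page is skipped and the search continues on the next page."""
    hits = []
    for number, text in enumerate(pages, start=1):
        if text.find(pattern) != -1:   # first occurrence is enough
            hits.append(number)
    return hits
```

Pages without any occurrence are simply omitted from the list shown to the user.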

After the user selects one page, the system highlights the searched word(s). Now, the system must search the text for all the occurrences of the pattern. Again, a pattern matching algorithm is used, and in this case the whole page is always processed. We performed a study including some of the most well-known pattern matching algorithms to choose the most suitable ones for our system (Places, et al., 2007). Finally, we decided to use the Backward Nondeterministic Dawg Matching algorithm (BNDM) for patterns of up to 32 characters and Knuth-Morris-Pratt (KMP) for longer patterns (Navarro, et al., 2002). This system was empirically shown to be very effective (Places, et al., 2007), with response times below 1 millisecond for typical searches including one word. Table 2 shows the average time needed to search for phrases composed of 1, 2, or 4 words.

Search type       Average time (milliseconds)
Simple words      < 1.00
2-word phrases    44.75
4-word phrases    62.63

Table 2. Average time for 100 random searches.
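The length-based dispatch between the two matchers can be sketched as below. KMP is implemented in full; BNDM, being bit-parallel and more involved, is stood in for by a naive scan, so only the selection rule and the KMP branch reflect the actual design:

```python
def kmp_search(text, pattern):
    """All occurrence positions of pattern in text (Knuth-Morris-Pratt)."""
    # Build the failure function (longest proper border of each prefix).
    fail = [0] * len(pattern)
    k = 0
    for i in range(1, len(pattern)):
        while k and pattern[i] != pattern[k]:
            k = fail[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        fail[i] = k
    # Scan the text, never backing up in it.
    out, k = [], 0
    for i, c in enumerate(text):
        while k and c != pattern[k]:
            k = fail[k - 1]
        if c == pattern[k]:
            k += 1
        if k == len(pattern):
            out.append(i - k + 1)
            k = fail[k - 1]
    return out

def search(text, pattern):
    # The paper's rule: BNDM for patterns of up to 32 characters, KMP beyond.
    # A naive scan stands in for BNDM in this sketch.
    if len(pattern) <= 32:
        return [i for i in range(len(text)) if text.startswith(pattern, i)]
    return kmp_search(text, pattern)
```

The 32-character threshold matches the machine-word limit that makes bit-parallel algorithms like BNDM attractive for short patterns.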

4.4.2 Scanned text retrieval

When the search is performed over images, no text is available to perform searches inside them (the text is obtained during the OCR, used for indexing, and finally discarded); consequently, pattern matching algorithms are not applicable. Instead, we use the inverted index to look up the exact location of each word in the images. The process is depicted in Figure 6.

Figure 6. Search system in image text.

For each occurrence of a given word, the inverted index contains: the journals or newspapers where it occurs, the numbers and the pages inside those numbers, and the coordinates inside the pages. Observe that in this case the inverted index does not store the relative position of the word inside the document text, since phrase searches are not considered. Once the system retrieves the coordinates of the searched words within the scanned image, it generates a new image with the searched words surrounded by colored rectangles. This image can finally be sent to the client browser.
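Producing the result image from the stored coordinates amounts to painting one rectangle per occurrence. The toy below outlines boxes on a character grid rather than a real scanned page (Digiflow draws colored rectangles on the page image itself), so it only illustrates how the (x0, y0, x1, y1) word coordinates are consumed:

```python
def highlight(width, height, boxes):
    """Outline each word box with '#' on a blank width x height 'image'.
    boxes: (x0, y0, x1, y1) word coordinates, as kept in the inverted index."""
    img = [[" "] * width for _ in range(height)]
    for x0, y0, x1, y1 in boxes:
        for x in range(x0, x1 + 1):        # top and bottom edges
            img[y0][x] = img[y1][x] = "#"
        for y in range(y0, y1 + 1):        # left and right edges
            img[y][x0] = img[y][x1] = "#"
    return ["".join(row) for row in img]
```

In the real system the same loop would run over pixel coordinates with an image library, and the annotated image is what the client browser receives.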

5 Case study: A digital library with ancient documents for the RAG

In this section we present a case study in which we show how Digiflow was used to digitize a collection of 10,000 documents from the 19th century. After briefly presenting the context in which the case study was carried out, we show how the digitization workflow was managed with Digiflow. Then, we present statistics on the performance obtained in the digitization. Finally, we show how the final digital library was built, and the search functionalities it provides by relying on the indexing and searching module of Digiflow.

5.1 Context: digital library requirements and settings

The Royal Galician Academy (RAG) is a scientific organization whose main objective is the study of Galician culture and especially the defense and promotion of the Galician language, an official language that descends from an ancient language called Galician-Portuguese and that is nowadays spoken in Galicia (a region in the north-west of Spain) by around 2.5 million people (85% of the population). The RAG has built a digital library (accessible at http://www.lbd.udc.es/RAG-20042012/Hemeroteca/) containing literary works, newspapers, and periodical journals, all of them of great cultural value. The newspapers and periodicals section of this library is of particular interest because it is mainly composed of journals from the 19th century. These journals (some of them the only existing copy) constitute a valuable patrimony that portrays the historical, social, and economic situation in Galicia in the last centuries. Due to their antiquity and poor state of preservation (an example is shown in Figure 7), these copies cannot be accessed by the general public. In order to preserve them and to make them available to researchers, the RAG decided to create a digital library.

Figure 7. Images from “El Patriota Compostelano” (1810).

Digiflow was used to support the digitization process of both periodicals and literary works. In Section 5.2 we show how Digiflow guides the different users through the process. Note that the RAG digital library contains two versions of the original literary works: i) plain transcribed (error-free) text, and ii) scanned images of the original text. Due to financial restrictions, for periodicals only the scanned images were created (no corrections are performed after the OCR phase, and the obtained text is not kept). Apart from the support for the digitization process, some of the most interesting aspects of the RAG digital library are related to the search capabilities built upon the Digiflow indexing and publishing subsystem. In addition to the typical searches based on metadata, a retrieval system was included that permits performing content-based searches on both transcribed text and images. We describe the search subsystem in Section 5.3.

5.2 Feeding the repositories of the RAG digital library with Digiflow

The first step in the process of introducing a work into the digital library is to store its metadata in the appropriate database. At the time of starting the creation of the RAG digital library, the traditional RAG library already had an electronic catalogue. Therefore, we decided that, to save time and avoid errors, it was worth developing an ad hoc system for extracting the data from the RAG catalogue in order to feed the metadata database of Digiflow. Once the metadata was stored, the scanning process could begin.

5.2.1 Using Digiflow to create the RAG repositories

As we explained in the previous section, after the administrator users register in the system the different works to be processed, the other participants log into the system to access the tasks that were assigned to them. Once a user is validated by the system, a table appears showing the pending tasks assigned to that user. By clicking on one of them, the corresponding interface for performing the task is displayed. For example, a scanning task is carried out in a window like the one shown in Figure 8. The window shows the user which page must be placed in the scanner. Once the user places the page in the scanner and presses the central button, the scanning process of the page starts. The result of this process is shown to the user, who has to confirm whether it is correct or not. If the quality is poorer than expected, the page can be scanned again. Once the user confirms that it is correct, the obtained page is automatically stored in its correct location. Notice that, in order to scan a page and store the result, the user only has to press a button. The user does not have to worry about where to store the page nor about the name of the obtained file: everything is automatically managed by the system.

Figure 8. Scanning task, showing the pending tasks, the type of digitization (in this case, double-sided), the work and the page(s) to be scanned, and the button that starts the scanning.

When a user chooses a correction task among the pending tasks, the system shows the user a window like the one in Figure 9. The window displays the page that is going to be corrected, and by just clicking the button named "correct", OmniPage is started.

Figure 9. Correction task, showing the work and the page(s) to be corrected.

Finally, the window in Figure 10 is displayed. The system shows the scanned text in the upper part of the interface and its transcription in the bottom part. If the user finds an error, the transcribed text can be replaced by the correct version. By clicking on a word either in the upper part (image) or in the lower part (transcribed text), the other version of the word is highlighted in the corresponding part of the interface.

Figure 10. Scanned and transcribed text during the review.

Figure 8, Figure 9, and Figure 10 show the user interfaces corresponding to the three main activities discussed in the system architecture: scanning, OCR, and correction. These are the most common activities, since they must be repeated for each source page in the processed work. The first two are solved with just a pair of clicks, because the rest of the work is automatically done by the system. In the last one, the user only has to focus on correcting the words; the rest of the work (starting OmniPage, opening the scanned image and the transcribed text, and saving the result) is automatically arranged by Digiflow.

5.2.2 Using Digiflow to monitor and control the digitization process

The administrators of Digiflow can obtain summarized information about the process in order to control and improve the workflow. Unexpected situations in workflow management systems are likely to appear (Mourão, et al., 2003), and it is impossible to predict every possible cause of failure or exception during the design of the system. In Digiflow, the approach chosen to address these deviations is the adoption of an adaptive workflow system, which provides the system administrator with tools to correct such failures if they occur. In the presence of these situations, the administrator can change the system behavior. In order to manage the workflow, the administrator can benefit from three crucial aspects that are controlled by the system: the status of the open works in the system, the status of the workflow tasks, and the work of the users of the system. Regarding the first of these aspects, the system provides a group of reports, like the one shown at the top of Figure 11, where all the works in the system and the status of the activities assigned to those works can be seen. By using this report, the system administrator can know which works are completed, who is working on each work, and how much time has been spent processing each work.
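The administrator interventions of the adaptive workflow, namely moving a stalled task to another user or bumping its priority, can be sketched minimally. The structure, field names, and user names below are hypothetical, not Digiflow's actual schema:

```python
class Task:
    """A unit of work assigned to exactly one user (hypothetical structure)."""
    def __init__(self, work, activity, user, priority=0, status="pending"):
        self.work, self.activity = work, activity
        self.user, self.priority, self.status = user, priority, status

def reassign(task, user=None, priority=None):
    """Adaptive-workflow intervention: an administrator changes the assignee
    and/or the priority of a task to work around a bottleneck."""
    if user is not None:
        task.user = user
    if priority is not None:
        task.priority = priority
    return task

# An administrator moves a correction task from one user to another
# and raises its priority ("maria" and "xoan" are invented names).
t = Task("El Patriota Compostelano", "correction", user="maria")
reassign(t, user="xoan", priority=5)
```

The task's status is untouched by the intervention; only who performs it, and how urgently, changes.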

Figure 11. Reports.

The system also allows the administrator to watch the tasks that are currently being performed by the users of the system. At the bottom of Figure 11, a report is shown with the pending tasks assigned to each user. By means of these reports, the system administrator can know the workload of the users.

Figure 12. Work revision.

Figure 13. Task revision.

Apart from reports, the administrator is also given other tools to know the state of the system. Examples of these functionalities are shown in Figure 12 and Figure 13. In order to solve bottlenecks, it is possible to modify the status of each workflow task, its priority, or the user who is in charge of its execution. The system also offers reports that show the number of hours that each user of the system has worked (see Figure 14).

Figure 14. Interface to show the periods worked by a given user.

5.2.3 Summary of the digitization process

During the digitization process we gathered data on the performance obtained. Table 3 shows the results obtained by a group of twenty graduates in Galician Arts and Philology using the system during 5 months. The first column indicates the activity, the second column shows the total number of processed pages, the third column presents the total amount of hours devoted to each activity, and finally, the fourth column gives the performance in pages per hour. Without Digiflow, this process would have been longer and it would probably have included many errors.

ACTIVITIES          PAGES    HOURS      PAGES/HOUR
Metadata storage    13304    135.99     97.83
Scanning            13304    255.77     52.01
OCR                 13093    380.83     34.38
Correction          12192    4402.87    2.77

Table 3. Statistics on the digitization process.

5.3 Search support in the RAG digital library

One of the goals in the development of the RAG digital library was to provide advanced search functionality, that is, not only the typical search based on the metadata of the works, but also the capability of searching literary works by their content, taking advantage of the digital nature of the stored documents.

5.3.1 Description of the metadata model

Figure 15 and Figure 16 show Entity-Relationship diagrams for two types of works, namely journals and literary works. In the case of journals, DigiFlow allows us to store all information related to the journal (title, first and last dates in which the journal was published, and ISSN if applicable), each of its volumes (title and number of pages), the numbers composing each volume (title and date of publication), the articles published in each number (title, authors, and pages), and each of the pages of the articles (page identifier and a path to the image of the scanned page).

Figure 15. Entity-Relationship diagram for the newspapers and periodicals database.
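The journal hierarchy of Figure 15 can be expressed as a small set of nested value types. The sketch below is illustrative Java (the language of the RAG implementation); the class and attribute names are ours and need not match DigiFlow's internal model:

```java
import java.time.LocalDate;
import java.util.List;

// Illustrative model of the journal hierarchy in Figure 15.
// Names are ours; DigiFlow's internal classes may differ.
public class JournalModel {
    public record Page(String id, String imagePath) {}                   // scanned page
    public record Article(String title, List<String> authors,
                          int initialPage, int endPage,
                          List<Page> pages) {}                           // article in a number
    public record JournalNumber(String id, String title, LocalDate date,
                                List<Article> articles) {}               // number in a volume
    public record Volume(String id, String title, int numberOfPages,
                         List<JournalNumber> numbers) {}                 // volume of a journal
    public record Journal(String title, String issn,
                          LocalDate startDate, LocalDate endDate,
                          List<Volume> volumes) {}                       // the journal itself
}
```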

In the case of general literary works, DigiFlow allows us to store information related to the authors of the work (name, surname, dates of birth and death, a biography of the author and even a photograph), the work itself (title and genre), and each of its pages (order of the pages, and links to their corresponding images).


Figure 16. Entity-Relationship diagram for the database of literary works.

5.3.2 Indexing and searching

As explained in the previous section, DigiFlow provides search capabilities that allow the user to locate the documents in which a query appears, showing both the text of the document and the image of the page in which the query terms appear. The works digitized with DigiFlow are stored in a module devoted to indexing and publishing. We developed this module by extending the Lucene inverted index so that it also stores the coordinates of each word in the scanned image the word comes from. As we will see later in this section, this allows us to show the results of a search directly on the scanned images of the relevant works. The inverted index is constructed using the stored metadata and the opd files produced by the scanning process. This module implements the functionality to perform content-based queries. In the next section, we describe in more detail the text retrieval module and the public web interface.
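The idea behind the extended index can be illustrated with plain Java collections rather than Lucene itself: every posting pairs the page where a word occurs with the word's bounding box in the scanned image, which is what later allows hits to be drawn directly on the page image. The class below is a didactic sketch, not our actual Lucene extension:

```java
import java.util.*;

// Didactic sketch of a coordinate-aware inverted index: each posting
// stores not only the page a word occurs in, but also the bounding box
// of the word in the scanned image. The real system extends Lucene.
public class CoordIndex {
    public record Box(int x, int y, int w, int h) {}
    public record Posting(String pageId, Box box) {}

    private final Map<String, List<Posting>> postings = new HashMap<>();

    // Called once per recognized word while parsing the OCR output.
    public void add(String word, String pageId, Box box) {
        postings.computeIfAbsent(word.toLowerCase(), k -> new ArrayList<>())
                .add(new Posting(pageId, box));
    }

    // Returns every occurrence of the word, with its image coordinates.
    public List<Posting> search(String word) {
        return postings.getOrDefault(word.toLowerCase(), List.of());
    }
}
```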


Figure 17. RAG digital library architecture.

The RAG digital library was mainly implemented in Java, and it accesses the underlying database (MS SQL Server in its current implementation) and a set of file repositories. Figure 17 shows the general architecture of the RAG digital library. It is fully modularized and comprises three main subsystems. The first module is a web interface to manage the digital library; this module is only used by authorized users. Administrators can introduce changes in the RAG digital library, such as adding news, sections, and new works. The second module is in charge of the public web interface. The third module is a text retrieval module based on an inverted index built using the Lucene libraries.

5.3.3 Search interfaces

As explained above, the RAG digital library provides typical searches using metadata, that is, searches by author, title, editor, etc. It also supports searches by content, that is, it is possible to find works containing a list of words. The RAG digital library provides different search interfaces; Figure 18 shows an example of a metadata search. These interfaces differ depending on the type of work being searched. In fact, literary works and periodical works have their own subsections inside the digital library and, due to the peculiarities of each type of work, the interfaces inside those subsections are slightly different.


Figure 18. List of newspapers and journals sorted by name.

5.3.3.1 Searching a literary work

To describe the process that follows a query, we consider content-based searches, since they are more complex and include all the stages of the simpler ones. Once the query is issued, the system returns a list of works matching the query. When the user selects a literary work, its index card is displayed (see Figure 19). The index card indicates the available digital versions of the work, which can be scanned images and/or transcribed text. The user selects the desired version, and the system then presents an index to access individual pages or groups of pages of the work (see Figure 20). In the case of searches by content, the user might be interested in checking only the pages that contain the words specified in the query. Observe in Figure 20 that some groups of pages have an asterisk to their right, meaning that the group contains the searched words. By clicking on the label, the system gives access to those pages.

Figure 19. An index card of a literary work.

Figure 20. Index of pages. Asterisks indicate pages that contain the searched patterns.

Figure 21. A page with marked words (“Galicia”, “amar”, and “terra”).

Figure 22. An image page with marked terms (“Revista” and “Galicia”).
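The rectangle marking visible in Figure 22 can be sketched with standard Java2D: given the word coordinates recovered from the index, each occurrence is framed on a copy of the page image. The `Box` record and method names below are our own illustration, not DigiFlow's code:

```java
import java.awt.*;
import java.awt.image.BufferedImage;
import java.util.List;

// Sketch of the "marked" display mode: occurrences of a searched word
// are framed with rectangles of a given color on a copy of the page.
public class HitMarker {
    public record Box(int x, int y, int w, int h) {}

    public static BufferedImage mark(BufferedImage page,
                                     List<Box> hits, Color color) {
        BufferedImage out = new BufferedImage(
                page.getWidth(), page.getHeight(), BufferedImage.TYPE_INT_RGB);
        Graphics2D g = out.createGraphics();
        g.drawImage(page, 0, 0, null);        // copy the scanned page
        g.setColor(color);
        g.setStroke(new BasicStroke(2));
        for (Box b : hits)                    // frame each occurrence
            g.drawRect(b.x(), b.y(), b.w(), b.h());
        g.dispose();
        return out;
    }
}
```

In the digital library each distinct query word would be assigned its own color, so all occurrences of the same word share one rectangle color.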

Continuing with a content-based search, when the displayed version of the work is plain text, the system highlights the searched words with colors (see Figure 21). In the case of the image version of a work, the system allows the user to display the images with or without marking the searched words. If the marked option is chosen, the image is displayed with the searched words surrounded by colored rectangles. All occurrences of the same word share the same rectangle color, as can be observed in Figure 22.

5.3.3.2 Searching a periodical work

The process of searching a periodical work differs depending on the type of search. If the search is carried out through metadata, the user starts by selecting a newspaper or journal. Then, the list of numbers of that newspaper or journal is displayed. Once the user selects a number, the system displays the list of articles in that number. Finally, by clicking on one of the article names, the system displays its contents. If the search is done by content, the articles containing the searched words are displayed directly (see Figure 23), and the user can access them without selecting the journal and number.

Figure 23. List of journal articles containing the word “Galicia”.

There is no transcribed text version of the periodical works. Therefore, the interface for visualizing a periodical work is similar to that of the scanned literary works.

5.3.3.3 Other searches: the RAG catalogue

Another service included in the digital library is the possibility of querying the actual catalogue of the library by means of two interfaces: simple and advanced. This service is important because the RAG digital library does not contain the whole collection of the RAG library.

6 Conclusions and future work

The creation of a document repository is not a simple process. It requires the coordination of people and tools to carry out every activity that is part of the process. These activities include the digitization of documents, optical character recognition, correction of the results, and indexing to support content-based searches. Support tools that facilitate the work of each participant and ensure the quality of the results are necessary for those processes to be carried out correctly and efficiently.

The proposed workflow strategies and system architecture support the control and coordination of the people and tasks involved in the digitization process. The use of this architecture automates the completion of error-prone activities and improves both the performance of the digitization process and the quality of the results. This architecture was applied to the design and development of DigiFlow, a collaborative workflow management system designed to create document repositories. The system was built as a desktop application that provides an integrated environment for the execution of all the tasks needed to create a digital library. DigiFlow was successfully applied to the building of the digital library of the RAG. In this paper, we also presented some remarkable technological issues addressed in the RAG digital library, which can be of interest for any team facing the challenge of building a digital library.

As future work, we want to adapt our current system so that the transcribed plain texts are maintained in compressed form. There are compression techniques (Moura, et al., 2000; Brisaboa, et al., 2007) that allow searching the compressed text up to eight times faster than searching the plain version, while compressing the text to around 30% of its original size. These techniques are of particular interest because they can be integrated with an inverted index. In our case, the document-grained inverted index can be built on the compressed documents; during searches, the efficiency of pattern-matching algorithms over the compressed text would speed up retrieval. In addition, thanks to the properties of these compressors, once an occurrence is found during the sequential search, decompression can start from that position for presentation purposes; it is not necessary to decompress the whole document from the beginning.
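To give the flavor of the cited word-based compression schemes, the toy codec below assigns each distinct word a variable-length byte codeword whose last byte has its high bit set, so codewords are self-delimiting and a query word can be compressed once and matched directly against the compressed bytes. This is a didactic simplification (it ranks words by order of arrival rather than by frequency, and omits the boundary checks of the published algorithms):

```java
import java.io.ByteArrayOutputStream;
import java.util.*;

// Toy word-based codec in the spirit of the cited techniques:
// each distinct word gets a byte codeword ending in a byte with
// its high bit set, so word boundaries are self-delimiting.
public class WordCodec {
    private final Map<String, byte[]> code = new HashMap<>();
    private final List<String> words = new ArrayList<>();

    // Assigns the next free codeword to unseen words. The real schemes
    // assign shorter codewords to more frequent words.
    public byte[] codeword(String word) {
        return code.computeIfAbsent(word, w -> {
            int rank = words.size();
            words.add(w);
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            while (rank >= 128) {              // continuation bytes: high bit 0
                out.write(rank % 128);
                rank = rank / 128 - 1;
            }
            out.write(rank | 0x80);            // last byte: high bit 1
            return out.toByteArray();
        });
    }

    // Replaces each whitespace-separated word by its codeword.
    public byte[] compress(String text) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (String w : text.split("\\s+"))
            out.writeBytes(codeword(w));
        return out.toByteArray();
    }
}
```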

7 References

Aalst, W.M.P. and Hee, K.M. (2002), Workflow Management: Models, Methods, and Systems, MIT Press, Cambridge, MA.

Arms, C. R. (2000), “Keeping Memory Alive: Practices for Preserving Digital Content at the National Digital Library Program of the Library of Congress”, RLG DigiNews, Vol 4 No 3, available at: http://www.rlg.org/legacy/preserv/diginews/diginews4-3.html#feature1 (accessed 11 May 2007).

Baeza-Yates, R. and Ribeiro-Neto, B. (1999), Modern Information Retrieval, Addison-Wesley, New York, NY.

Bainbridge, D., Thompson, J. and Witten, I. H. (2003), “Assembling and Enriching Library Collections”, Proceedings of JCDL’03: Joint Conference on Digital Libraries, May 27-31, Houston, Texas, USA.

Baird, H. S. (2003), “Digital Libraries and Document Image Analysis”, Proceedings of the Seventh International Conference on Document Analysis and Recognition, August 3-6, Edinburgh, UK.

Banerjee, J., Namboodiri, A. and Jawahar, C. (2009), “Contextual Restoration of Severely Degraded Document Images”, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2009, June 20-25, Miami, FL, pp. 517-524.

Borgman, C. (1999), “What are digital libraries? Competing visions”, Information Processing and Management, Vol 35 No 3, pp. 227-243.

Borgman, C. (2002), “Challenges in Building Digital Libraries for the 21st Century”, Proceedings of the 5th International Conference on Asian Digital Libraries, ICADL 2002, December 11-14, Singapore, pp. 1-13.

Brisaboa, N. R., Fariña, A., Navarro, G. and Paramá, J. R. (2007), “Lightweight Natural Language Text Compression”, Information Retrieval, Vol 10 No 1, Springer, Netherlands, pp. 1-33.

Buchanan, G., Bainbridge, D. and Don, K. J. (2005), “A New Framework for Building Digital Library Collections”, Proceedings of JCDL’05: Joint Conference on Digital Libraries, June 7-11, Denver, Colorado, USA.

CCSDS: Consultative Committee for Space Data Systems (2002), “Reference Model for an Open Archival Information System (OAIS)”, available at: http://public.ccsds.org/publications/archive/650x0m1.pdf (accessed January 2014).

Chang, N. and Hopkinson, A. (2006), “Reskilling staff for digital libraries”, Digital Libraries: Achievements, Challenges and Opportunities, Lecture Notes in Computer Science, Vol. 4312, Springer-Verlag, Berlin, pp. 531-532.

Cramer, T. and Kott, K. (2010), “Designing and Implementing Second Generation Digital Preservation Services: A Scalable Model for the Stanford Digital Repository”, D-Lib Magazine, Vol 16 No 9/10, available at: http://www.dlib.org/dlib/september10/cramer/09cramer.html

Delos (2008), “A Reference Model for Digital Library Management Systems”, available at: http://www.delos.info/index.php?option=com_content&task=view&id=345&Itemid= (accessed January 2014).

Duguid, P. (1997), “Report of the Santa Fe Planning Workshop on Distributed Knowledge Work Environments: Digital Libraries”, School of Information, University of Michigan.

Ellis, C. A. and Keddara, K. (2000), “A Workflow Change Is a Workflow”, Business Process Management, Models, Techniques, and Empirical Studies, Lecture Notes in Computer Science, Vol. 1806, Springer-Verlag, Berlin, pp. 201-217.

Fischer, L. (ed.) (2003), Workflow Handbook 2003, Workflow Management Coalition, Future Strategies, Lighthouse Point, Florida.

Gonçalves, M., Fox, E., Watson, L. and Kipp, N. (2001), “Streams, structures, spaces, scenarios, societies (5S): A formal model for digital libraries”, Technical Report TR-0112, Virginia Tech, Blacksburg, VA.

Gonçalves, M., Mather, P., Wang, J., Zou, Y., Luo, M., Richardson, R., Shen, R., Xu, L. and Fox, E. (2002), “Java MARIAN: From an OPAC to a modern digital library system”, Proceedings of the 9th International Symposium on String Processing and Information Retrieval (SPIRE 2002), Lecture Notes in Computer Science, Vol. 2476, Springer-Verlag, Berlin, pp. 194-209.

Hollingsworth, D. (1995), “WFMC Reference Model”, January 1995, available at: www.wfmc.org/standards/docs/tc003v11.pdf (accessed January 2014).

Kolak, O., Byrne, W. J. and Resnik, P. (2003), “A Generative Probabilistic OCR Model for NLP Applications”, Proceedings of HLT-NAACL, May 27-June 1, Edmonton, Canada.

Larson, R. and Carson, C. (1999), “Information Access for A Digital Library: Cheshire II and the Berkeley Environmental Digital Library”, Proceedings of ASIS’99, October 31-November 4, Washington D.C., USA.

Lesk, M. (1997), Practical Digital Libraries: Books, Bytes, and Bucks, Morgan Kaufmann Publishers, San Mateo, CA.

Library of Congress (2007), “Metadata Encoding and Transmission Standard (METS)”, available at: http://www.loc.gov/standards/mets/

Lucene (2006), Lucene project, available at: http://lucene.apache.org/ (accessed January 2014).

McCray, A.T. and Gallagher, M.E. (2001), “Principles for digital library development”, Communications of the ACM, Vol 44 No 4, ACM, New York, NY, pp. 49-54.

Moura, E. S., Navarro, G., Ziviani, N. and Baeza-Yates, R. (2000), “Fast and flexible word searching on compressed text”, ACM Transactions on Information Systems, Vol 18 No 2, ACM, New York, NY, pp. 113-139.

Mourão, H. and Antunes, P. (2003), “Workflow Recovery Framework for Exception Handling: Involving the User”, Groupware: Design, Implementation, and Use, 9th International Workshop, CRIWG 2003, Lecture Notes in Computer Science, Vol. 2806, Springer-Verlag, Berlin, pp. 159-167.

Navarro, G. and Raffinot, M. (2002), Flexible Pattern Matching in Strings, Cambridge University Press, Cambridge.

Paramá, J. R., Places, A. S., Brisaboa, N. R. and Penabad, M. R. (2006), “The Design of a Virtual Library of Emblem Books”, Software: Practice and Experience, Vol 36 No 5, John Wiley & Sons, Sussex, England, pp. 473-494.

Places, A. S., Brisaboa, N. R., Fariña, A., Luaces, M. R., Paramá, J. R. and Penabad, M. R. (2007), “The Galician Virtual Library”, Online Information Review, Vol 31 No 3, Emerald Group Publishing Limited, Yorkshire, England, pp. 333-352.

Ross, S. and Hedstrom, M. (2005), “Preservation research and sustainable digital libraries”, International Journal on Digital Libraries, Vol 5 No 4, Springer, pp. 317-324.

Ross, S. (2014), “Digital preservation, archival science and methodological foundations for digital libraries”, New Review of Information Networking, Vol. 17, Taylor & Francis Group, pp. 43-68.

Sankar, K. P., Ambati, V., Pratha, L. and Jawahar, C. V. (2006), “Digitizing a Million Books: Challenges for Document Analysis”, Proceedings of Document Analysis Systems, DAS 2006, Lecture Notes in Computer Science, Vol. 3872, Springer-Verlag, Berlin, pp. 425-436.

Van de Sompel, H. and Lagoze, C. (2000), “The Santa Fe Convention of the Open Archives Initiative”, D-Lib Magazine, Vol 6 No 2, available at: http://www.dlib.org/dlib/february00/vandesompel-oai/02vandesompel-oai.html (accessed January 2014).

Vázquez, E., Places, A. S., Fariña, A., Brisaboa, N. R. and Paramá, J. R. (2005), “Recuperación de Textos en la Biblioteca Virtual Galega” [Text Retrieval in the Galician Virtual Library], Revista IEEE América Latina, Vol 3 No 1, IEEE Press.

Witten, I. H. and Bainbridge, D. (2003), How to Build a Digital Library, Morgan Kaufmann Publishers, San Mateo, CA.
