Argentinean Historical Heritage Project

María Feldgen, Osvaldo Clúa
Laboratorio de Sistemas Distribuidos Heterogéneos
Facultad de Ingeniería, Universidad de Buenos Aires
{mfeldgen,oclua}@ieee.org

ABSTRACT

In this paper, we describe the digitization effort of the Argentinean Heritage Project from its beginning, when handmade prototypes were used to identify the needs and characteristics of Archive organizations, up to its present-day form: a framework of automated, Web-operated and platform-independent tools that assist historians in building and maintaining digital libraries suited to their research needs. We show how low-cost, labor-intensive digital library building is possible using standard formats and tools.

Keywords

Digital preservation, Web enabled technology, XML, Archiving.

INTRODUCTION

The "Instituto de Historia Argentina y Americana Dr. Emilio Ravignani" belongs to the "Facultad de Filosofía y Letras" of the "Universidad de Buenos Aires". It runs several research projects on Argentine and Hispanic-American history. The Institute is the guardian of several archives of historical documents spanning from the 18th century to the present day. The collections were donated by notaries public, lawyers, governors or their relatives, and include old facsimiles and unique handwritten books. In addition, the Institute publishes periodicals on historical research, acts as a consultant and reference site in close contact with the "Archivo General de la Nación" and with museums that have custody of manuscripts and other written historical corpora, and takes part in cooperation projects with several American and European universities. The "Proyecto Patrimonio Histórico" (Historical Heritage Project) was created to meet the need for preservation and availability of these assets.

Fernando Boro, Juan José Santos
Instituto de Historia Argentina y Americana "Dr. E. Ravignani"
Facultad de Filosofía y Letras, Universidad de Buenos Aires
[email protected]

After the chemical stabilization of the media, an overall consultation policy had to be designed, seeking the best way to guarantee both the preservation of the material and access to it. Faced with the "digitize or microfilm" question [1], the decision was in favor of digital preservation. To ensure the usefulness of the digital images, the needs of the users had to be clearly defined. To fulfill them, an appropriate infrastructure supporting the digitizing, conversion, management and delivery of contents had to be assembled [2][3]. To allow for a multidisciplinary approach, the Institute began to work in close cooperation with the Distributed and Heterogeneous Systems Laboratory of the School of Engineering of the University of Buenos Aires.

THE ENVIRONMENT OF THE PROJECT

Libraries and Archives aim at different goals. Because of the cost and work involved in creating a digital collection, some selection must be made. In a Library, the guiding principle is the expected life of the information contained in the objects to be digitized. This involves the historical value of the objects and the access people want to have to them; not all the objects in a Library can be considered treasures. In an Archive, all the objects are there to be preserved. In this case the selection process aims to choose first the most brittle and most consulted material [4]. This issue also affects the technical decisions about digitization quality [5], making the whole process manpower intensive in the quality control phase. It also implies serious care in handling and storing the material, and a lot of diplomacy in the inter-institutional and interpersonal work with its guardians. In an Archive, the material is used mainly by scholars and researchers looking for new evidence or for support for their theories. Scholars are used to considering the information they access as a part of the whole they are building. It is very important not to disrupt or bias their work with interpretations that may suggest or hinder some explanation. The access system to the digitized collection is therefore of great concern. The active participation of scholars and information systems professionals in designing such access is essential.

This access system must also be preserved along with the digitized material. Before digitization, cataloging was a very important byproduct of the preservation effort (and, when money was at a real premium, often overlooked). Digitization creates a new object to be preserved, and the access system is part of this new object. The design of the access system is tied to the available computing technology and to the technology that its users (mainly historians) have access to.

ORIGINAL INFORMATION ORGANIZATION

The most accessed documents refer to events, military transactions, business records, correspondence, memoranda, notes, diaries, accounts and other data and draft documents made by appointment of the Spanish Crown or the local government. Historians prefer to work with a true copy of the document rather than with some sort of transcription. This information was originally kept in different types of document corpora, which we grouped according to their access and reference properties.

Personal Archives

These are archives filed by notaries public as required by law, or kept by politicians who retained copies of their papers, which were later donated by their heirs. They consist mainly of documents with few pages and with different page sizes and formats. For example, letters from the 18th century up to the early and middle 20th century were written on a sheet of paper in landscape position and folded to make four usable sides. The original filing order is regarded as very important by historians because it reflects the chronological processing and the relationship between documents in the same archive, as seen by the original officer. To keep this ordering, pages were numbered using several different formats within the same file. There are letters with four numbers on each sheet of paper, because of the folding mentioned above, and letters with two or even only one number. Blank pages are sometimes numbered and sometimes not. All these data have to be maintained and presented to the scholars as they read the documents for research and reference purposes.

Judicial Transcript Archives

In Argentina, all court trials and hearings were recorded in writing until the late 1960s. Many of the 19th century transcripts are kept in courthouses and are frequently read by scholars and researchers because they mirror the usages and practices of different parts of the country. They resemble current judicial transcripts, with well-defined sections according to the character of the trial, reflecting its different stages. All the pages are numbered on only one side of the paper but, as the same document was reused, there are several renumbering schemes, all of them valid. Sections, which may share a page, are used in indexes and tables of contents. These numbering schemes, with some additions (front/rear side of the page), are used as references and must be kept in the index schema associated with the archive.

Journals, periodicals, pamphlets, etc.

Several of these documents used to be sewn together, forming "books" or "booklets", in order to preserve and archive them. Here the ordering of documents within the "book" is not important, and documents are read individually or in chronological order. Some of the individual documents have indexes or tables of contents; the "book", on the other hand, never has one. "Books" are kept untouched and preserved as a whole because of their fragility but, when they are digitized, the documents are handled as separate units and the "book" data is retained only as a "storage" reference. Researchers and scholars refer to and ask for the documents individually, using original page numbers, indexes and/or tables of contents. Because of the original numbering format, blank pages are numbered. Handwritten books also form a unit of access, like each of the documents mentioned above. In this paper we call these units of access "books", irrespective of whether they are periodicals, pamphlets, journals or true books.

PROJECT DEVELOPMENT

We assembled a team with the Director of the Institute, Prof. J.C. Chiaramonti, three historians and two computer engineers, and developed a prototype with some small corpora containing no more than 600 documents each. As our primary goals, we decided to implement a query and read system

• usable with no special training and with no non-standard hardware or software requirements;

• which will provide a way for geographically distributed scholars to access the system and the documents, hopefully removing their need to access the original paper form of the document.

We then chose a Web-based technology, in which scholars access a digitized image of the document through a catalog organized in the same way as the current catalogs. After developing the prototype, in which the pages and links were built by hand by the historians in a long and labor-intensive activity that lasted six months, we were able to state:

• what the primary users of the system wanted to have: the document page along with the catalog data (if pertinent), and simple ways of turning the pages and navigating back and forth between documents and catalog entries;

• how they want it to be seen: they want a legible copy and do not mind the paper format or size, nor time-added details such as renumbering schemas or stains;

• how they want to search: they want to find the archive's table of contents and the catalog data in the original order. They want access to tables of contents in judicial transcripts and books only when these exist in the original document;

• they want to have access to the system from their workplace, even though they have little (and really slow) Internet access.

Fig. 1. A Catalog of Judicial Transcripts (City of Dolores, Buenos Aires)

As a first result, we decided to maintain a full-quality preservation image and to build several publication images from each original page. The different publication images are processed and enhanced according to different goals. Some examples are:

• In-house access (within the Institute) using adequate hardware and screens.

• Restricted Internet access offering good reading quality, with some images watermarked, minimizing communication costs (to suit the low standard speeds of our country's links) and not allowing quality printing, to prevent unauthorized copies.

• Password-protected paid Internet access to a high-quality digital image system.

• Stand-alone use, allowing publication on CD-ROM or DVD.
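The idea of one preservation image feeding several publication images can be pictured with a small planning sketch. It is only an illustration under stated assumptions: the profile names, pixel limits, JPEG quality values and the `target_size` and `plan_publication_images` helpers are hypothetical and are not taken from the project, which does not report its actual parameters.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PublicationProfile:
    """One kind of publication image derived from the preservation image."""
    name: str
    max_long_side: int   # longest side of the derived image in pixels (assumed)
    jpeg_quality: int    # encoder quality setting (assumed)
    watermark: bool


# Hypothetical profiles mirroring the four goals listed above.
PROFILES = [
    PublicationProfile("in_house", 2400, 90, False),
    PublicationProfile("internet_restricted", 800, 60, True),
    PublicationProfile("internet_full", 2400, 90, False),
    PublicationProfile("stand_alone", 1600, 80, False),
]


def target_size(width, height, max_long_side):
    """Scale (width, height) down so the longest side fits max_long_side.

    The aspect ratio is kept and the image is never enlarged.
    """
    long_side = max(width, height)
    if long_side <= max_long_side:
        return width, height
    scale = max_long_side / long_side
    return max(1, round(width * scale)), max(1, round(height * scale))


def plan_publication_images(page_name, width, height):
    """Return one (file name, size, profile) plan per publication profile."""
    plans = []
    for profile in PROFILES:
        size = target_size(width, height, profile.max_long_side)
        plans.append((f"{profile.name}/{page_name}.jpg", size, profile))
    return plans


if __name__ == "__main__":
    for file_name, size, profile in plan_publication_images("f0001", 3500, 5000):
        print(file_name, size, "watermark" if profile.watermark else "plain")
```

The sketch only plans file names and sizes; the actual resampling, enhancement and watermarking would be done by whatever image processing tool is available.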

This calls for an automated production system that derives the different to-be-published archival Web systems from the original preservation system. Such a system makes it possible to devote the always scarce manpower to the one-time generation of high-quality preservation images, with enough effort left for the quality control of the product. The production system has to be flexible enough to accommodate the evolution of what is standard in the digital world.

Fig. 2. A page of a Judicial Transcript with its catalog and index data

From the point of view of the team's work breakdown, the prototype allowed us to foresee:

• the special characteristics each archive imposes on the job;

• time requirements;

• common mistakes, how to detect them and how much time to allot for their correction;

• additional information related to the digital nature of the preservation, needed to extend the useful life of the information resources [6].

SYSTEM IMPLEMENTATION

We decided to use Web standard formats to implement the system [7]. We used the EAD DTD (Encoded Archival Description, Document Type Definition) [8]. In its base document of June 1998, the EAD is defined as a set of rules specifying the access system associated with the information, to allow for searching, retrieval and exchange among different platforms. These rules are written in the machine-processing-oriented form known as a Standard Generalized Markup Language (SGML) Document Type Definition (DTD) [9]. We used the EAD DTD Tag Library, a natural-language translation of the rules that expresses the DTD structure by explaining the relationships between elements, specifying where the elements may be used and describing how they may be modified with attributes. For further information on the EAD DTD, the reader is referred to [8]. As the EAD is compatible with both the Extensible Markup Language (XML) [10][11] and existing SGML software, it permits the creation of new finding aids and the conversion of existing ones.

The XML coding following the EAD DTD guidelines is done by CGI programs. This choice was made in order to use the existing equipment, with no need for Java-enabled browsers and with minimal impact on the server. The chosen language was GNU Pascal (gpc) [12]. Catalogs must be made in "pure" HTML (no style sheets) [13] because historians and their institutions use old computers and old versions of operating systems and browsers. They see no advantage in following the continuous hardware and software changes, which bring no important benefit for their primary task (historical research). On the other hand, catalog versions using CSS (Cascading Style Sheets), for the Internet and for future compatibility, are also built using the system. XSL (Extensible Stylesheet Language) [14] was the choice for converting XML documents into a format recognizable by a browser.

The digitized images and the catalog entries of one archive are combined by the CGI programs with the capture and image processing parameters to form one base XML file for each publication type (in-house, Internet restricted, Internet full, stand-alone, etc.). There is only one EAD DTD file, valid for all classes of publications and archives. Using different XSL style sheets, one for each desired layout format, and J. Clark's XT processor [15], the HTML structure for the archive is obtained. Using a different style sheet, the preservation and physical management information is obtained (CD number, backup information, preservation image information, original location, capture data, etc.).
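As a rough illustration of the "one XML source, several layouts" idea described above, the sketch below renders a small EAD-style fragment either as "pure" HTML or as a variant that links a style sheet. It uses Python's standard `xml.etree.ElementTree` module as a stand-in for the XSL style sheets and the XT processor used in the project. The element names (`archdesc`, `did`, `unittitle`, `unitdate`, `dsc`, `c01`) come from EAD, but the sample records, the `render_catalog` function and the `catalog.css` file name are assumptions made for this example.

```python
import xml.etree.ElementTree as ET
from html import escape

# A tiny hand-written fragment using EAD element names.
# The records are illustrative only, not taken from the real catalog.
EAD_SAMPLE = """\
<ead>
  <archdesc level="fonds">
    <did>
      <unittitle>Archivo Vicente A. Echeverria</unittitle>
      <unitdate>1749-1847</unitdate>
    </did>
    <dsc>
      <c01 level="item" id="doc0001">
        <did>
          <unittitle>Documento 1</unittitle>
          <unitdate>1749</unitdate>
        </did>
      </c01>
      <c01 level="item" id="doc0002">
        <did>
          <unittitle>Documento 2</unittitle>
          <unitdate>1750</unitdate>
        </did>
      </c01>
    </dsc>
  </archdesc>
</ead>
"""


def render_catalog(ead_xml, with_css=False):
    """Render the catalog page for one archive as an HTML string.

    with_css=False gives the "pure" HTML layout for old browsers;
    with_css=True adds a link to a (hypothetical) catalog.css file.
    """
    root = ET.fromstring(ead_xml)
    archive = root.find("archdesc/did")
    title = escape(archive.findtext("unittitle", default=""))
    dates = escape(archive.findtext("unitdate", default=""))

    lines = ["<html>", "<head>", f"<title>{title}</title>"]
    if with_css:
        lines.append('<link rel="stylesheet" href="catalog.css">')
    lines += ["</head>", "<body>", f"<h1>{title} ({dates})</h1>", "<ul>"]

    # One catalog entry per first-level component, linking to its page.
    for item in root.iterfind("archdesc/dsc/c01"):
        item_title = escape(item.findtext("did/unittitle", default=""))
        item_date = escape(item.findtext("did/unitdate", default=""))
        href = escape(item.get("id", "")) + ".html"
        lines.append(f'<li><a href="{href}">{item_title}</a>, {item_date}</li>')

    lines += ["</ul>", "</body>", "</html>"]
    return "\n".join(lines)


if __name__ == "__main__":
    print(render_catalog(EAD_SAMPLE))
```

Keeping the layout decision in a single parameter mirrors, in a very reduced form, the choice of one style sheet per desired layout: the catalog data is written once and only the rendering changes.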

Fig. 4 The first page of the Archive named "ECHE".

Fig. 5 The first page of a multipage document from the Archive named "ECHE", with catalog data and page index.

[Figure 3: flow diagram. Recoverable labels: Owner/Guardian environment (MINI DIGITAL LIBRARY); Source Documents; Scanning & image processing; Scanning activity log; TIF Files (Preservation); JPG Files (Internet), JPG Files (Intranet), (etc.); MARC / MicroISIS Data; OLD Catalog text Files; Meta Data, Legal Restrictions, etc.; Document selection, Titles, etc.; Owner/guardian DATA; WEB tools (CGI Programs); EAD DTD; ECHE.XML; XSLT Parser XT; DOC.XSL (Documents, Books, Transcripts); ECHE.HTM (Catalog); JPG Files (pages); Files (ADM).]

Fig. 3 Building the Digital Archive named "ECHE".

Figure 3 depicts the whole process, beginning with the original documents and their MicroISIS catalog data. Transcripts and Books are obtained by a similar process. Fragments of the resulting XML file and of the DOC.XSL style sheet are included as appendixes. The following figures show the finished product.
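The merging step at the center of Figure 3, where catalog entries, image files and capture parameters are combined into one base XML file per publication type, might look roughly like the sketch below. The project does this with CGI programs written in GNU Pascal; Python is used here only for illustration. The `build_base_xml` function, the dictionary layout of the entries, the `dpi` and `format` capture parameters and the placement of the `note` and `dao` elements are all assumptions, and the output has not been validated against the EAD DTD.

```python
import xml.etree.ElementTree as ET


def build_base_xml(archive_name, publication_type, entries, capture):
    """Combine catalog entries, image names and capture data into one XML text.

    entries: list of dicts with a "title" and a list of "images" (file names).
    capture: dict of capture/processing parameters (hypothetical keys).
    """
    ead = ET.Element("ead")
    archdesc = ET.SubElement(ead, "archdesc", level="fonds")
    did = ET.SubElement(archdesc, "did")
    ET.SubElement(did, "unittitle").text = archive_name

    # Capture parameters kept as a plain note (assumed placement).
    note = ET.SubElement(did, "note")
    ET.SubElement(note, "p").text = "; ".join(
        f"{key}={value}" for key, value in sorted(capture.items())
    )

    dsc = ET.SubElement(archdesc, "dsc", type="in-depth")
    for entry in entries:
        component = ET.SubElement(dsc, "c01", level="item")
        item_did = ET.SubElement(component, "did")
        ET.SubElement(item_did, "unittitle").text = entry["title"]
        # One digital object reference per page image of this publication type.
        for image in entry["images"]:
            ET.SubElement(item_did, "dao", href=f"{publication_type}/{image}")

    return ET.tostring(ead, encoding="unicode")


if __name__ == "__main__":
    sample_entries = [{"title": "Documento 1", "images": ["f0001.jpg", "f0002.jpg"]}]
    print(build_base_xml("ECHE", "internet_restricted", sample_entries, {"dpi": 300}))
```

Calling the function once per publication type with the same entries yields the separate base files, each pointing at its own set of derived images.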

INFORMATION ARCHITECTURE OF THE RESULTING DIGITAL LIBRARY

The described system can be applied to the collections and archives of different owners, forming a Digital Library. The information about owners, divisions and guardians must be retained, and the Digital Library maintains this distinction as its grouping criterion, so that the structure can be exported to different servers or sites according to policy decisions. Each of these units can be regarded as a mini digital library. Each owner has its own Web and working directories where the corresponding files are stored. These hold the data needed for building and maintenance, image files in different formats, catalogs and related information converted to XML files. There are also Web-operated CGI programs to add legal restrictions, metadata [16], special titles, acknowledgments and other data. Other CGI programs are used to select which images and catalog entries form the input to the system to obtain a specific digital archive.

There are also Web-operated CGI programs to rename, add, reorder and renumber pages, and to process blank pages, which we never digitize but which are recognized by the system; this eases the scanning process, which calls for a correlative numbering scheme. Some programs help in the quality control and error control process, with the goal of maintaining a direct relationship between the digital and the physical numbering schemes.

Fig. 8 Assigning numbers to blank pages to keep the original numbering schema.

Naming

In the owner's base directory, the name of the catalog to build is used as the name of the base subdirectory of the HTML structure, of the XML file and of the HTML catalog file. With the exception of books, there is one HTML file and one image file with the same name. Book pages are built dynamically using a cover page with XML data, resembling the Electronic Binding Project of the Berkeley Digital Library SunSITE [17]. As requested by scholars, catalog and index data are included in each page.

Fig. 6 A catalog of Periodicals (books) showing its cover page and the MicroISIS or MARC data. In this case there is no index information.

Fig. 7 A dynamic page of one of the periodicals.

CONCLUSIONS

Digitization is a valid alternative in preservation tasks. It also appears as a starting point for the development of several different publications and research products.

Mixed teams and the use of prototypes are indispensable for establishing the user needs and the ways of interacting with the information. It is a time-consuming task, but it is rewarded by the translation of the user needs into the Digital Library design, improving its usability. During the prototyping experience, a high degree of involvement from the user community is necessary. In our case, historians actually designed the prototype pages using "pure" HTML. Of course, they did so in close interaction with computing professionals. Afterwards, the computing professionals defined the necessary XML files using the standard EAD DTD and derived the XSL style sheets. Neither could the computing professionals alone have established the basic usability, nor could the computer-versed historians have developed the XML Web machinery.

Writing the XML files is a time-consuming, error-prone task. Web-operated CGI programs can be used to create these files in a consistent way. They also make this task accessible to the historians in charge of the collections, in their own technical jargon. Changing the outputs using XSL style sheets can be done in little time and enables a trial-and-error process with the user, so using XSL is a must.

Using the automated process, generating a catalog and its documents from the "raw" images and catalog data is a quick task, which allows for higher quality and design standards through trial and error. On an "old", non-dedicated Pentium 120 running Linux, the whole process takes no more than 15 minutes for 6000 images, including the generation of the corresponding digital preservation information. The prototyping activity, generating only the HTML output for 600 images, took six months.

Digitization appears as a solution when wide access to the information is necessary, cost is at a premium and the necessary expertise and manpower are available, as they are in a University. It also allows a continuous development of the final products in step with the progress of Internet technology [18]. The whole project was carried out using freely available software (most of it GNU style) and off-the-shelf (and relatively old) hardware. This enabled us to concentrate spending on input devices such as scanners and cameras.

The multidisciplinary approach is a highly rewarding one. Each professional learns from the others' fields and enhances his or her vision and performance. The next step is more challenging for all the people involved: designing a search system for historians. As stated in the free software we used, we all had a real lot of fun developing the system.

ACKNOWLEDGMENTS

The project is funded in part by the "Fundación Antorchas" and is operated by the Universidad de Buenos Aires. The authors wish to thank the people of the Library of Congress and of Columbia University (NY) in general and, in particular (in order of visit), Dr. E. Larson, L. E. Brooks, D. Bell-Russel, A. Wells, A. Cook, J. Pull, Dr. H. Klein, Dr. P.M. Graham, A. Bukowky, Dr. C. Dutschke and their teams for their patient discussion of the subject.

NOTE TO REVIEWERS

The final version will be reviewed by the English professors, who are on their summer vacation now.

REFERENCES

1. Boro, F. ¿Microfilm o preservación digital en bibliotecas y archivos? Ciencia Hoy, vol. 11, Nr. 66, Dec. 2001.
2. Department of Preservation and Conservation at Cornell University. Digital Imaging Tutorial. http://www.library.cornell.edu/preservation/tutorial/toc.html accessed January 2002.
3. Waters, D. and Garrett, J. Preserving Digital Information: Report of the Task Force on Archiving Digital Information. Washington D.C.: Commission on Preservation and Access and Research Libraries Group, 1996.
4. Hazen, D., Horrell, J. and Merrill-Oldham, J. Selecting Research Collections for Digitization. Council on Library and Information Resources (CLIR) Publication and Resources, Aug. 1998, available at: http://www.clir.org/pubs/reports/pub74.html
5. Kenney, A.R. and Chapman, S. Digital Resolution Requirements for Replacing Text-Based Material: Methods for Benchmarking Image Quality. Washington D.C.: Commission on Preservation and Access, 1995.
6. Conway, P. The Relevance of Preservation in a Digital World. Northeast Document Conservation Center, Technical Leaflet, Section 5, Leaflet (Winter 1999), 1-10.
7. Manola, F. Technologies for a Web Object Model. IEEE Internet Computing, vol. 3, Nr. 1, Jan. 1999, pages 38-47.
8. Encoded Archival Description (EAD) Document Type Definition (DTD), Version 1.0, Technical Documents No. 1, 2 and 3. Society of American Archivists and the Library of Congress, 1998. http://lcweb.loc.gov/ead/ accessed January 2002.
9. SGML: ISO 8879. ISO (International Organization for Standardization). ISO 8879:1986(E). Information processing -- Text and Office Systems -- Standard Generalized Markup Language (SGML). First edition, 1986-10-15. [Geneva]: International Organization for Standardization, 1986.
10. Extensible Markup Language (XML) 1.0, W3C Recommendation 6 October 2000. http://www.w3.org/TR/2000/REC-xml-20001006 accessed January 2002.
11. Khare, R. and Rifkin, A. XML: A Door to Automated Web Applications. IEEE Internet Computing, vol. 1, Nr. 4, Jul. 1997, pages 78-87.
12. GNU. The GNU Pascal. http://agnes.dida.physik.uni-essen.de/~gnu-pascal/ accessed January 2002.
13. HTML 4.01 Specification, W3C Recommendation 24 December 1999. http://www.w3.org/TR/1999/REC-html401-19991224 accessed January 2002.
14. Extensible Stylesheet Language Transformations (XSLT). http://www.w3.org/TR/1999/REC-xslt-19991116 accessed January 2002.
15. Clark, J. XT. http://www.jclark.com/xt/
16. Dublin Core Metadata Initiative, available at http://dublincore.org/documents/2001/04/12/usageguide/ or http://purl.org/dc/ accessed January 2002.
17. Pollock, A. and Pitti, D. The Electronic Binding Project (Ebind). UC Berkeley, 1996. http://sunsite.berkeley.edu/Ebind/ accessed January 2002.
18. Gellersen, H. W. and Gaedke, M. Object-Oriented Web Application Development. IEEE Internet Computing, vol. 3, Nr. 1, Jan. 1999, pages 60-68.

APPENDIX A: FRAGMENT OF THE ECHE.XML FILE

eche Instituto de Historia Argentina y Americana "Dr. E. Ravignani" - Projecto de recuperacion y preservacion de Patrimonio Historico Facultad de Filosofia y Letras de la Universidad de Buenos Aireso Dn Vicente Echeverria 1749-1810
Correspondencia recibida y emitida y documentos manuscritos de Dn Vicente Echeverria
Datos descriptivos con un esqueleto de markup derivados a partir de un catalogo WordPerfect, con datos adicionales y datos provenientes del sistema de imagenes creado por programas CGI del sistema de documentos manuscritos en GNUPascal, dentro del Proyecto Patrimonio Histórico 26 de marzo de 2001 Archivo Vicente A. Echeverría 1749-1847 Legajo Nro. 1 1749-1810 3523 fojas 1 caja Biblioteca Historia Argentina 24 Informacion para lectores del archivo
No hay restricciones de acceso a este archivo

Reproduccion de las imagenes del archivo solamente con autorizacion del Instituto Dr. E. Ravignani
Documento 1: .- Don Francisco Ramos al Cabildo de la Ciudad de Buenos Aires: Referente al destino del importe de la sal traida de las salinas para el abasto de la ciudad. Buenos Aires, año de 1749. F. 1. Folios

1

2

Folio 1

Height865 Width875

APPENDIX B: FRAGMENT OF THE DOCUMENTS XSL STYLE SHEET

This template builds the Catalog Page

Created file

This template builds the Documents

Catálogo de Documentos

pages with the images