A process and tool for the conversion of MARC records to a normalized FRBR implementation

Trond Aalberg

Norwegian University of Science and Technology, Department of Computer and Information Science
Abstract. This paper presents a generic process and a tool for the conversion of MARC-based bibliographic records to the ER-based model of the Functional Requirements for Bibliographic Records. The interpretation of a record, the construction of a new set of records and the final normalization are decomposed into a series of steps that are implemented in the tool using XSL transformations. The purpose of the tool is to support researchers and developers who want to explore FRBR or develop solutions for using FRBR with existing MARC-based bibliographic catalogues.
1 Introduction
The Functional Requirements for Bibliographic Records (FRBR), published by the International Federation of Library Associations and Institutions in 1998 [8], is a major contribution to the next generation of bibliographic catalogues and may significantly change the way bibliographic agencies create, maintain and exchange information in the future. Existing library catalogues are to a large extent based on the MARC format, and naturally many libraries would like to implement support for FRBR in their existing bibliographic information systems. Although many projects have explored the use of FRBR in different contexts and some tools exist, there is little support for the systematic processing of information in MARC records into a representation that directly reflects the entities, attributes and relationships of the FRBR model. Researchers and developers beginning work in this area typically need to reinvent the conversion process and write their own interpretation software due to the lack of reusable solutions.

This paper presents an approach for processing MARC records into a normalized set of FRBR records. The different steps needed in the conversion process are identified, and a tool that implements this process is presented. The conversion tool is based on the use of XML and XSL transformations and supports reuse across catalogues by separating the rules and conditions that govern the conversion from the general control structures that can be applied to any MARC-based catalogue.
2 MARC and FRBR
Most of today's bibliographic information is based on a common framework of rules and formats such as the International Standard Bibliographic Description (ISBD), the Anglo-American Cataloguing Rules (AACR2), and the MARC format. The main components of a MARC record are the leader, the control fields and the data fields. The leader and the control fields contain fixed-length data elements identified by relative character position, whereas the data fields may contain variable-length data elements identified by subfield codes. The data fields of a MARC record typically represent a logical grouping of the data, and the subfields represent the various attributes describing the logical unit. Data fields may additionally have indicators at the beginning of the field that supplement the data or are used to interpret the data found in the field (a small example record illustrating this structure is shown at the end of this section).

There are actually a number of MARC formats in use, but different formats are often inspired by each other or are extensions or subsets of other MARC formats. Many catalogues are based on either MARC 21 or UNIMARC, but there are still many national or vendor-specific versions of the MARC format in use. Additionally, we often find that libraries use the same format in different ways and/or use national or local adaptations of the cataloguing rules, and for this reason even libraries that officially share the same format may need different rules in a conversion of the catalogue.

The aim of FRBR is to establish a precisely stated and commonly shared understanding of what it is that the bibliographic record aims to provide information about. This is defined by the use of an entity-relationship model (ER model) that defines the key entities of interest to users of bibliographic data. The entities work, expression, manifestation and item are the core of the model and reflect the products of intellectual and artistic endeavor at different levels of abstraction. The entities person and corporate body represent the various actors of concern in bibliographic descriptions. The model additionally defines the attributes for describing these entities and the relationships that may exist between entities.

The process of applying FRBR as an implementation model for existing catalogues is often referred to as "FRBRization", and studies and experimental applications have been reported by many. The identification of FRBR entities in catalogues has been explored e.g. in [1–3, 5, 9]. A few tools for experimenting with the FRBR model are available, such as the FRBR display tool made available by The Library of Congress Network Development and MARC Standards Office [10] and the workset algorithm developed by OCLC [7]. Most experiments and tools are quite incomplete and only partially show the application of FRBR in library catalogues, but a few more extensive implementations of FRBR are available, such as OCLC's FictionFinder [6] and VTLS' library system Virtua [12]. Except for the mapping between MARC 21 and FRBR produced for The Library of Congress Network Development and MARC Standards Office [11], no attempts have so far been made to formalize the process of conversion between MARC and FRBR, and the work reported in this paper contributes towards this.
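To make the record structure described above concrete, the sketch below shows a small MarcXchange-style record with a leader, a control field and two data fields carrying indicators and subfields. The field values, the leader and 008 contents, and the namespace declaration are illustrative assumptions rather than data taken from an actual catalogue.

<record xmlns="info:lc/xmlns/marcxchange-v1" format="MARC21" type="Bibliographic">
  <leader>00000nam a2200000 a 4500</leader>
  <!-- Control field: fixed-length data elements identified by character position. -->
  <controlfield tag="008">970110s1996    xxk           000 1 eng  </controlfield>
  <!-- Data fields: two indicators followed by variable-length subfields. -->
  <datafield tag="100" ind1="1" ind2=" ">
    <subfield code="a">Pratchett, Terry</subfield>
  </datafield>
  <datafield tag="245" ind1="1" ind2="0">
    <subfield code="a">Hogfather /</subfield>
    <subfield code="c">Terry Pratchett</subfield>
  </datafield>
</record>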
3 Interpreting MARC records
The transformation from MARC to FRBR is a complex task that in many ways differs from a simple sequential transformation of records in one format into equivalent records in a different format. The process described in this paper aims to produce a normalized set of FRBR records, and by this we mean that each entity instance should finally be described in only one record with a proper set of relationships to other entities. In the context of FRBR, each MARC record may be seen as a self-contained universe of entities, attributes and relationships. At the most generic level, the process of interpreting a MARC record consists of (1) identifying the various entities described in the record, (2) selecting the fields that describe each entity and (3) finding the relationships between entities. This approach can be used to decompose each record into a corresponding set of interrelated entities, but because many records may contain descriptions of the same entity (e.g. an author with multiple publications will be described in many records), a conversion process additionally needs to (4) support normalization by finding and merging equivalent records.
3.1 Identifying entities
Identifying entities is a process that includes inspecting a MARC record to determine which entities are described in the record and what role each entity has in its relationships to other entities. This is not a trivial process, but due to the logical grouping of data in a MARC record, certain fields will reflect specific FRBR entities and the roles of these entities. A record may e.g. include person entries in both the 100 and 600 fields. These tags represent the same kind of entity, but the entities have different roles: the former is the author of a work and the latter is the subject of a work. Persons, corporate bodies and works are typically identified by the presence of specific fields such as main entry fields (1XX), title fields (24X) or added entry fields (7XX). Additional persons, corporate bodies and works can be identified in some of the subject access fields (6XX) and series added entry fields (8XX). Expression entities are often considered to be more vaguely defined in a MARC record due to the lack of specific fields for expression titles, but they can on the other hand be derived from work entities already identified. If a work is identified by the presence of a 240 entry, an expression can be identified based on the same field as well. Finally, the fact that each record corresponds to a manifestation can be used to identify manifestation entities, although there may be a need to consider special cases based on the rules for cataloguing multi-volume manifestations etc. Items are typically listed using holdings information fields, and an item can usually be identified for each such entry.

The identification of entities can be formalized using a set of conditions for testing whether an entity is present in the record or not. Due to the many different occurrences of entities, this is most conveniently solved by defining a condition for each of the possible entity occurrences.
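As a minimal sketch of how such conditions might be expressed in XSLT (the implementation technique used by the tool described in section 4), the stylesheet below tests for a person occurring as creator in a 100 field and as subject in 600 fields. The mx prefix is bound to the MarcXchange namespace, and the entity output element with its role labels is an assumption made for the illustration, not the tool's actual output markup.

<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:mx="info:lc/xmlns/marcxchange-v1">

  <xsl:template match="mx:record">
    <!-- Condition: a person entity, in the role of creator, occurs when the
         record contains a 100 main entry field. -->
    <xsl:if test="mx:datafield[@tag='100']">
      <entity type="person" role="creator">
        <xsl:value-of select="mx:datafield[@tag='100']/mx:subfield[@code='a']"/>
      </entity>
    </xsl:if>
    <!-- Condition: an additional person occurs, in the role of subject,
         for each 600 personal name subject entry. -->
    <xsl:for-each select="mx:datafield[@tag='600']">
      <entity type="person" role="subject">
        <xsl:value-of select="mx:subfield[@code='a']"/>
      </entity>
    </xsl:for-each>
  </xsl:template>

</xsl:stylesheet>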
3.2 Assigning attributes
FRBR defines a comprehensive set of attributes for the entities, based on what is typically reflected in bibliographic records. On the other hand, the model does not define the various possible data elements of an attribute in the same way as a MARC format does. A possible solution to this discrepancy is to maintain the subfield structure from the MARC record but additionally associate FRBR attribute names with the subfields. Many subfields can only be assigned to a single entity occurring in the record. If a work is identified by the presence of a 130 field, the mapping for this entity will include the 130 subfields and possibly other fields that are interpreted as describing the work identified in the 130 field. In some cases subfields can be assigned to several entity occurrences, such as the language code that describes the language of all expressions identified in a record (e.g. analytical entries). Sometimes the assignment even has to be based on the actual data found in a subfield, e.g. if a subfield contains information that in some cases belongs to an expression and in other cases belongs to the manifestation. The selection of which attributes are associated with which entity can basically be defined in a mapping table that describes which data fields/subfields belong to which entity occurrence, together with additional conditions for determining whether an assignment should be made or not.
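A sketch of one such assignment, for a work identified by a 130 field, is shown below. Keeping the MARC field and subfield structure while attaching an attribute label follows the approach described above; the frbr-attribute attribute name and the output elements are assumptions made for the illustration.

<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:mx="info:lc/xmlns/marcxchange-v1">

  <!-- Copy the title subfield of the 130 field into the work description,
       keeping the MARC field/subfield structure but adding an (assumed)
       FRBR attribute name as a label. -->
  <xsl:template match="mx:datafield[@tag='130']" mode="work-attributes">
    <datafield tag="130">
      <xsl:for-each select="mx:subfield[@code='a']">
        <subfield code="a" frbr-attribute="title of the work">
          <xsl:value-of select="."/>
        </subfield>
      </xsl:for-each>
    </datafield>
  </xsl:template>

</xsl:stylesheet>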
3.3 Establishing relationships
The interpretation of relationships between entities can either be based on the implicit roles of the entities occurring in a record, or it can be based on explicit information about roles and relationships found in indicators, relator codes or field linking subfields. Essentially this is a process that must be based on a definition of which kinds of relationships may exist between entities. For each kind of relationship it is necessary to know the conditions for when a relationship can be identified, as well as the condition for determining which target entity the relationship points to.
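The fragment below sketches one such relationship rule: when a record contains both a 100 field (a person) and a 240 field (a work), a relationship from the person to the work is created, using a temporary key derived from the work title as the target expression. The relation element, the key expression and the relationship label are illustrative assumptions, not the rules actually shipped with the tool.

<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:mx="info:lc/xmlns/marcxchange-v1">

  <xsl:template match="mx:record" mode="person-relationships">
    <!-- Relationship condition: both a person (100) and a work (240) occur. -->
    <xsl:if test="mx:datafield[@tag='100'] and mx:datafield[@tag='240']">
      <relation type="is creator of">
        <!-- Target expression: a temporary key built from the work title,
             used later to resolve the relationship to the work record. -->
        <xsl:value-of select="concat('work:',
            normalize-space(mx:datafield[@tag='240']/mx:subfield[@code='a']))"/>
      </relation>
    </xsl:if>
  </xsl:template>

</xsl:stylesheet>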
3.4 Normalizing the result
The process outlined so far is only concerned with the conversion of a single MARC record into a corresponding set of interrelated entities, without considering other records in the collection. To achieve a final set of interrelated entities with a consistent set of relationships between all entities in the whole collection, the output from the previous interpretation must be normalized. By this we mean that equivalent entities need to be merged to avoid redundant information and a fragmented network of relationships. This process is mainly a question of equivalence between records. In some cases already existing identifiers may be used: if two records have the same identifier, they describe the same entity instance and can be merged. Most entities, however, do not have proper identifiers, and in this case records must be compared in a way that can be used to determine whether the records describe e.g. the same work. If two equivalent records are found, the merging process must create a new record that maintains the relationships found in both records and additionally create a new description that includes the union of the distinct data fields and values found in both records.
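A sketch of such a merge in XSLT 1.0 is shown below, assuming that the interpretation step has produced entity records carrying an entitykey attribute that identifies equivalent entities (an assumption made for the example). Muenchian grouping is used to emit one merged record per distinct key; collapsing duplicate field values inside a merged record is left out of the sketch.

<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- Index all interpreted entity records by their (assumed) entity key. -->
  <xsl:key name="by-key" match="entity" use="@entitykey"/>

  <!-- Assumed input shape: a collection element containing entity records
       with entitykey and type attributes and datafield/relation children. -->
  <xsl:template match="/collection">
    <collection>
      <!-- Muenchian grouping: process only the first record with each key ... -->
      <xsl:for-each select="entity[generate-id() =
                                   generate-id(key('by-key', @entitykey)[1])]">
        <entity entitykey="{@entitykey}" type="{@type}">
          <!-- ... and copy the fields and relationships of every equivalent
               record into the merged result. -->
          <xsl:copy-of select="key('by-key', @entitykey)/*"/>
        </entity>
      </xsl:for-each>
    </collection>
  </xsl:template>

</xsl:stylesheet>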
4 The frbrization tool
The conversion process described in the previous section is implemented as a conversion tool based on the use of XML and XSL transformations. The interpretation and creation of FRBR records is performed by the use of XSLT, while other parts of the conversion are solved by a program written in Java. The conversion tool accepts records in the MarcXchange [4] format, and the output is a set of records in a format that uses the same field and subfield structure but with additional elements and attributes for FRBR relationships and types. The conversion process is decomposed into a preprocessing step, a main conversion and a final postprocessing step. The tool is illustrated in figure 1 and example records are found in figure 2.
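As an illustration of what such an output record might look like, the sketch below shows a work record that keeps the MARC field and subfield structure and adds an entity type, an entity key and a relationship. All element and attribute names here are assumptions made for the example; the actual element names of the tool's output format may differ.

<!-- Illustrative output record; element and attribute names are assumed. -->
<record type="work" entitykey="work:Hogfather">
  <datafield tag="240" ind1="1" ind2="0">
    <subfield code="a" frbr-attribute="title of the work">Hogfather</subfield>
  </datafield>
  <relation type="has creator">person:Pratchett, Terry</relation>
</record>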
4.1 The rule base
The various conditions, rules and other data that are needed to define the conversion for a specific catalogue are stored in a database. The purpose of this is to support reuse of the tool across catalogues and to facilitate consistency across the many rules that are used in the conversion. The database schema is illustrated in figure 1 and consists of an entity mapping table that contains the variable data for the various occurrences of entities. For each kind of entity occurrence different rules need to be defined, and this table will for this reason contain a number of entries (e.g. different entity types for works identified by 130, 240, 245, 600$t, 630, 700$t, etc.). The attribute mapping table defines the mapping between MARC and FRBR attributes for each entity occurrence type, and the relationship mapping table contains the relationship types that can exist for an entity occurrence, the conditions for when a relationship exists and an expression for which entity occurrence(s) to relate to.

The rule base is used to generate XSLT templates. One template is created for each entry in the entity mapping table. Each template follows the same control structure and includes the code needed to test for the presence of an entity, select and copy attributes, and test for and create possible relationships. A simplified example of such a template is illustrated in figure 2.
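As a rough sketch of what such a generated template might look like, the stylesheet below covers a hypothetical entity mapping entry for works identified by a 240 field and follows the control structure described above: test for the presence of the entity, select and copy attributes, and test for and create possible relationships. The output elements, the key expression and the relationship label are assumptions for the illustration, not a reproduction of the templates generated by the tool.

<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:mx="info:lc/xmlns/marcxchange-v1">

  <!-- Hypothetical template for one entity mapping entry: works identified
       by a 240 uniform title field. -->
  <xsl:template match="mx:record" mode="work-240">
    <!-- Test for the presence of the entity. -->
    <xsl:for-each select="mx:datafield[@tag='240']">
      <entity type="work">
        <!-- Intermediary key built from the field content (assumed expression). -->
        <xsl:attribute name="entitykey">
          <xsl:value-of select="concat('work:',
              normalize-space(mx:subfield[@code='a']))"/>
        </xsl:attribute>
        <!-- Select and copy the attributes mapped to this entity occurrence. -->
        <xsl:copy-of select="mx:subfield[@code='a'] | mx:subfield[@code='l']"/>
        <!-- Test for and create possible relationships. -->
        <xsl:if test="../mx:datafield[@tag='100']">
          <relation type="has creator">
            <xsl:value-of select="concat('person:',
                normalize-space(../mx:datafield[@tag='100']/mx:subfield[@code='a']))"/>
          </relation>
        </xsl:if>
      </entity>
    </xsl:for-each>
  </xsl:template>

</xsl:stylesheet>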
4.2 Preprocessing
The first step in the actual conversion is a preprocessing step that is introduced to enable different kinds of processing that are more conveniently applied in advance rather than during the actual frbrization. Some formats may for example use
[Figure 1: overview of the conversion tool. A catalogue is exported to MarcXchange and preprocessed; XSLT entity templates are generated from the conversion rules and applied to create intermediary keys, copy attributes and generate relationships into FRBR XML; a postprocessing step creates final entity keys and finds and merges identical entities before the result is loaded into the FRBR database. The rule base consists of an Entity_mapping table (Template_identifier, FRBR_entity_type, FOREACH_field_expression, Entity_tempid_expression, Entity_key_expression), with each entry having any number of Relationship_mapping entries (FRBR_relationship_type, Target_template_identifier, IF_expression, FOREACH_target_expression) and Attribute_mapping entries (FRBR_entity_type, FRBR_attribute_type, MARC_field, MARC_subfield, IF_expression).]

Fig. 1. The conversion tool outlined