DELOS Deliverable 6.10.1
Project no.507618 DELOS A Network of Excellence on Digital Libraries Instrument: Network of Excellence Thematic Priority: IST-2002-2.3.1.12 Technology-enhanced Learning and Access to Cultural Heritage
Deliverable 6.10.1: Report on Automated re-Appraisal: Managing Archives in Digital Libraries
Due date of deliverable:
Actual submission date: 30 January 2008
Start Date of Project: 01 January 2004
Duration: 48 Months
Organisation Name of Lead Contractor for this Deliverable: University of Glasgow
Final
Project co-funded by the European Commission within the Sixth Framework Programme (2002-2006)
Dissemination Level: PU (Public)
Catalogue Entry

Title: Report on Automated re-Appraisal: Managing Archives in Digital Libraries
Creator: Gillian Oliver
Creator: Seamus Ross
Creator: Maria Guercio
Creator: Cristina Pala
Subject: Appraisal; Digital Records
Description: Report on methodologies for determining the significance of digital information objects, with recommendations for automation of the appraisal function.
Publisher: DELOS NoE for HATII at the University of Glasgow
Contributor: Milena Dobreva
Date:
ISBN: 2-912335-40-X
Type: Text
Format:
Language: English
Rights: © HATII at the University of Glasgow
Citation Guideline:
Oliver, G., Ross, S., Guercio, M., Pala, C.: Report on Automated re-Appraisal: Managing Archives in Digital Libraries (Glasgow: DELOS NoE, January 2008).
Contents

Executive Summary
1.0 Introduction
2.0 Key Concepts and Definition of Terms
  2.1 Determining Significance
  2.2 Information Management
  2.3 Item-Level Appraisal
3.0 International Standards
  3.1 ISO 14721 OAIS Reference Model
  3.2 ISO 15489 Records Management
  3.3 ISO 23081 Metadata for Records
4.0 Current Practices
  4.1 Information for Awareness/Entertainment
  4.2 Information for Accountability
    4.2.1 Current Appraisal Strategies
    4.2.2 The Process
5.0 Current Research
  5.1 InterPARES
  5.2 Paradigm
  5.3 Metadata
  5.3 Appraisal in a Digital World
6.0 Issues
  6.1 Is it Necessary to Determine the Significance of Digital Information?
  6.2 Is the Process of Determining Significance Fundamentally Flawed?
7.0 Principles, Requirements and Criteria for the Appraisal of Digital Objects
8.0 Automation
  8.1 Using Metadata and Genres to Determine Significance
    8.1.1 Genres
  8.1 Models of Automation
9.0 Summary of Recommendations
10.0 Conclusions
11.0 References
Appendix 1: Summary of Findings from InterPares
Appendix 2: Source Documents Used to Provide Initial List of Criteria
Tables

Table 1: Specific factors considered in the appraisal of digital information objects
Table 2: Crosswalk showing relationship between appraisal factors and questions that could be answered using metadata values
Table 3: Metadata elements that could be used to assist in appraisal decision making

Figures

Figure 1: Factors influencing the determination of value of information
Figure 2: Differentiation of purpose for the two main communities involved in information management
Figure 3: Potential appraisal points in relation to OAIS Model
Figure 4: Categories used for the grouping of appraisal criteria
Figure 5: Appraisal criteria that comprise each category
Figure 6: Application areas for appraisal criteria documented in Table 1
Executive Summary

This Task will investigate how the process of automatic re-appraisal of digital holdings (resulting in either the disposal or the retention of an object) may be effectively handled in the context of digital libraries. At different times after material has been ingested into a repository (which is what a digital library is), it may be necessary to re-assess whether the ingested material should be retained or disposed of. Given that this activity will often concern a substantial quantity of digital objects, it would be sensible to automate such a re-appraisal process, identifying which objects need to be removed on the basis of pre-defined criteria.

This Task started in JPA3, and work on the Task capitalized on the results of other tasks in the cluster (T6.4, T6.5, T6.6 and T6.7). Our first activity has been to examine the appraisal function in digital archives and libraries, in order to identify the basis on which the rules for retention (or disposal) have so far been defined and applied. A significant event contributing to this was the Appraisal in the Digital World Conference held in Rome, November 15-17, 2007. Discussions at this conference, which involved a range of international speakers, informed the findings of this report.

Given the proliferation of digital information of all types and the challenges of preservation, identifying what subset of that information is actually worth keeping is critical. This report investigates the relevance of processes to determine what is ‘worth keeping’ for digital libraries and suggests ways in which technology can be used to automate those processes. The findings are particularly applicable to documents created in uncontrolled environments and to libraries.

Appraisal, the determination of the worth of preserving information, continues to be significant in the digital environment. Furthermore, the concept is applicable beyond the recordkeeping domain in which it originated. A number of strategies have been identified to undertake appraisal, any one of which, or a combination of which, may be appropriate to a specific information community or domain.

In considering the automation of the appraisal function in the context of a digital library or archives, it becomes clear that this will include the assessment of individual items. Automation enables a level of granularity that is rarely, if at all, possible with manual appraisal methods, without losing sight of the place of items within the aggregation to which they belong. The results of item-level assessment can inform the overall appraisal determination.
Research underway on metadata extraction, together with the structurational view of genres, shows a great deal of promise for the digital library developer and user communities. In addition, the technological possibilities now present to facilitate input of other voices into the selection of information that has value for communities open up a way forward to a new information age, one that need no longer be exclusively defined by dominant societal forces. As a result of our analysis of the approaches to appraisal we identified a series of appraisal criteria and structured these so that we can represent them as appraisal rules. Rules are susceptible to representation as active knowledge components. In considering the next steps, this representation suggests three models of automation:
Hybrid: A combination of manual and automated decision making. For instance, application of functional appraisal methodology supplemented by subsequent automated triage to determine the feasibility of preservation at the item level.
Appraisal engine: Where a document is submitted to an appraisal engine for analysis using a combination of text mining and rule‐based reasoning.
Profiler: The development of a prototype to review a variety of information object types (image, document, dataset for example) and apply appraisal rules, probably again using rule‐based reasoning methodologies.
It is critical, though, that when digital objects such as documents are selected for destruction or retention, the reasons for the decision are recorded. One of the strengths of automation is that it provides this chain of evidence.
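The idea of appraisal rules that leave a chain of evidence can be illustrated with a minimal sketch. It is not a design proposed in this report: the rule names, the metadata keys (legal_hold, format), the list of preservable formats and the outcome labels (retain, dispose, refer) are all assumptions made for illustration. Every rule that fires is logged as a reason, the first rule to give a verdict determines the outcome, and an item on which no rule fires is referred to a human appraiser, in the spirit of the hybrid model.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Callable, Optional

# Assumption: an illustrative list of formats a repository can preserve.
PRESERVABLE_FORMATS = {"pdf", "tiff", "xml"}


@dataclass
class Item:
    identifier: str
    metadata: dict


@dataclass
class Rule:
    name: str
    # Returns "retain", "dispose", or None when the rule has no opinion.
    test: Callable[[Item], Optional[str]]


@dataclass
class Decision:
    identifier: str
    outcome: str    # "retain", "dispose", or "refer" (to a human appraiser)
    reasons: list   # one entry per rule that fired: the chain of evidence
    decided_at: str


# Hypothetical rules, listed in priority order.
RULES = [
    Rule("legal-hold",
         lambda i: "retain" if i.metadata.get("legal_hold") else None),
    Rule("unsupported-format",
         lambda i: "dispose"
         if i.metadata.get("format") not in PRESERVABLE_FORMATS else None),
]


def appraise(item: Item, rules=RULES) -> Decision:
    outcome = "refer"
    reasons = []
    for rule in rules:
        verdict = rule.test(item)
        if verdict is None:
            continue
        # Record every rule that fired, not only the deciding one.
        reasons.append(f"{rule.name} -> {verdict}")
        if outcome == "refer":
            outcome = verdict  # the first rule with a verdict decides
    return Decision(item.identifier, outcome, reasons,
                    datetime.now(timezone.utc).isoformat())
```

Storing the returned Decision alongside the object, or in place of a disposed object, would preserve the record of why each disposal decision was taken.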
1.0 Introduction

Given the proliferation of digital information of all types and the challenges of preservation, identifying what subset of that information is actually worth keeping is critical. The aim of this report is twofold. Firstly, it is to establish the relevance of processes to determine what is ‘worth keeping’ for digital libraries. Secondly, it is to make recommendations for the use of technology to improve the efficiency and effectiveness of decision-making.

The report begins with an overview of key concepts and clarification of terminology. This is followed by an account of current practices in the main information communities and findings from major research projects, including InterPARES and Paradigm. Key principles as distilled from the literature are identified. The final section considers possibilities for automation and what kind of experimentation is needed if we are to really develop automated appraisal in digital libraries and archives.
2.0 Key Concepts and Definition of Terms

Key concepts underlying this report relate to determining the significance of information, the different purposes for which information is managed in digital repositories, and item-level appraisal.

One of the particular challenges of this report has been associated with terminology. Attempting to apply a methodology (which is not standardized even within its “home” domain of archival science) to information managed by other communities provides rich potential for misunderstandings and confusion, not to mention conflict. The theoretical framework of the information continuum is one device used to address this (see 2.2 below). In addition, definition of the most contentious terms is provided here and reiterated in the body of the text where necessary. However, to achieve real progress in this area of determining the significance of digital objects, consideration should be given to achieving consensus and agreement between all disciplines as to appropriate terminology.
R2.0.1 A glossary should be developed of terminology relating to the entities and processes associated with determining the significance of information. Definitions should be acceptable from the perspective of all information management occupations.
2.1 Determining Significance

Determining the significance, value or worth of information has always been a fundamental concept for memory institutions, including libraries, archives and museums, and continues to be problematic for digital collections (Pymm, 2006). ‘Appraisal’ is the methodology used in recordkeeping to determine the significance of records, resulting in the designation of some as worthy of long-term preservation. The term ‘appraisal’ is used in this report to apply to the process of determining significance of any information object.

Adding to a collection always has resource implications, including, for instance, initial purchase, cost of processing, storage and so on. At various points, decisions that should be underpinned by an assessment of the value of an item or an aggregation relative to the costs involved are made – although this may not be made explicit. Decisions relate to whether or not to acquire, and then at different stages whether or not to retain as part of the collection. The act of determining what has significance, or what is worth keeping, has been recognized as a complex act influenced by ideological, political, economic, cultural and social factors (Lloyd, 2007).
Figure 1: Factors influencing the determination of value of information (ideological, social, political, cultural and economic factors)
The perspective taken in this report recognises this complexity and suggests that it is now possible to formulate ways of addressing these issues with the use of technology.
R2.1.1 Technological solutions to determining the significance of information must take into account ideological, political, economic, cultural and social factors.
2.2 Information Management – Different Purposes, Different Perspectives

The purpose for which information is managed is critical in determining the approach to its management. The Information Continuum model (IC) (Schauder, Stillman, & Johanson, 2005) provides a useful framework for analysis of activities undertaken by the different professional communities of librarians and archivists. The primary purpose of activities in the library community is the management of information for awareness or entertainment. This has been re-stated in the community networking context as information to maximise opportunity, and information to enhance living (Schauder et al., 2005). The primary purpose of the recordkeeping community is the management of information for accountability – or information to minimise risk (Schauder et al., 2005).
Figure 2 Differentiation of purpose for the two main communities involved in information management
The identification of a primary purpose for a community (or information type) does not imply that other purposes are excluded or are not present, but reflects a greater emphasis accorded to the purpose designated as likely to be primary. Digital repositories cannot always be easily assigned to either the library or the recordkeeping domain, and the situation is further confused because the terms ‘digital library’ and ‘digital archive’ often appear to be used interchangeably. This report therefore refers to ‘information for awareness’ and ‘information for accountability’. An example of the practical application of this theory to a university repository environment is provided by Andrew Treloar and colleagues at Monash University (Treloar, Groenewegen, & Harboe-Ree, 2007).
R2.2.1 Appraisal methodologies must be “fit for purpose” – i.e., take into account the purpose(s) for which information is being managed: accountability, awareness and/or entertainment.
2.3 Item-Level Appraisal

‘Item’ is used in this report as a term that includes simple and complex information objects – e.g. document, record, website, image. The common characteristic of these information objects is that intellectually they are regarded as a single entity. This concept is taken from the information object that is defined in the DELOS Digital Library Reference Model (Candela et al., 2007, p.75). Here, an information object is described as potentially being a multimedia and multi-type object with parts, such as a sound recording with slides, political and economic data with interactive simulations, or a data stream representing the pool of data continuously measured by a sensor. Information objects belong to collections, or sets of resources (Candela et al., 2007, p.80).

The concept of item-level appraisal is a contentious one for many records managers and archivists, as a key motivator for many of the advances in appraisal theory (see section 4.2 Information for Accountability) has been the increase in the sheer quantity of records generated, and the impossibility of reviewing those records individually. However, working from the premise that processes can be automated means that review at item level is feasible. The opportunity now exists to build pragmatic tools to be used in conjunction with advances in archival appraisal theory; the challenge lies in achieving the appropriate balance.

Appraisal at item level does not imply that relationships between items will be disregarded or destroyed. On the contrary, the assumption underlying all references to item-level appraisal is that the metadata of digital objects will reflect those contextual relationships in a way that could not be envisaged in the paper world. As a consequence of this assumption, although not strictly in scope for this particular report, section 5.3 Metadata and section 8.1.1 Genres contain recommendations relating to description.

R2.3.1 Item-level appraisal should be considered as a tool to be used in the context of an appropriate theoretical framework, and does not imply the destruction of contextual relationships.
3.0 International Standards

The standards that are relevant to the discussion in this report are those relating to the Open Archival Information System (OAIS) reference model (International Organization for Standardization, 2003), records management (International Organization for Standardization, 2001) and metadata for records (International Organization for Standardization, 2006).
3.1 ISO 14721 OAIS Reference Model

In terms of the OAIS reference model (International Organization for Standardization, 2003), the process of determining significance should commence as one of the pre-ingest activities. It can be envisaged as taking place as part of the preliminary phase of the producer-archive interface methodology, although it is not made explicit in this standard (Consultative Committee for Space Data Systems, 2004). In addition, further checks on information objects which will contribute to final appraisal decisions could be carried out as part of ingest functionality. Re-appraisal, however, could be considered as an activity associated with Preservation Planning.
Figure 3: The red stars indicate potential appraisal points in relation to OAIS Model
R3.1.1 Appraisal may take place prior to ingest, on ingest and/or as part of Preservation Planning functionality.
3.2 ISO 15489 Records Management

This standard (International Organization for Standardization, 2001) provides a high-level framework for recordkeeping and establishes benchmarks for good records management practice. If digital records are created and maintained in accordance with this standard (and ISO 23081), appraisal strategies are likely to be top-down (see section 4.2.1 below) and item-level review may not be required. In this report, the standard has been used to define a record, and the characteristics of records (usability, authenticity, reliability, integrity).
3.3 ISO 23081 Metadata for Records

In conjunction with ISO 15489, ISO 23081 is critical for best practice in records management. The standard is unequivocal about the importance of metadata in recordkeeping, that is, the records management and archival areas of activity:

“…metadata are structured or semi-structured information that enables the creation, registration, classification, access, preservation and disposition of records through time and within and across domains …[and] can be used to identify, authenticate, and contextualize records and the people, processes and systems that create, manage, maintain and use them and the policies that govern them.” (International Organization for Standardization, 2006).
The standard describes how metadata must be initially assigned at the point of creation of a record, and then layers should continue to be assigned, either automatically or manually, reflecting different contexts, usages, systems, as necessary. Without metadata, authenticity cannot be assessed.
R3.3.1 Records created and maintained in accordance with ISO 15489 and ISO 23081 may not require appraisal at item level.
4.0 Current Practices

Whether information is retained and managed in order to provide awareness/entertainment or accountability has influenced the different approaches that have been taken to determining significance by librarians and archivists.
4.1 Information for Awareness/Entertainment

Consideration of the significance of items in library collections has been the subject of very little debate (Lloyd, 2007; Pymm, 2006). One suggested reason is that very few libraries build collections for permanent retention (Pymm, 2006). This points to a key factor that characterises library selection: there is a likelihood that it will be focused on a current, known, user group with identified needs. The extent to which this is the case will vary according to library type and purpose. A consequence of the focus on current needs is that determining significance for libraries may be an ongoing matter consisting of two activities: selection, and de-selection (weeding) when information is no longer required.

Library concerns relating to the selection of materials have resulted in the development of solutions that have focused on the analysis of collections. Conspectus was one such solution, a methodology developed by the Research Libraries Group (RLG) for the assessment of collections in research libraries in the 1980s and subsequently adopted by many countries and other library types. Libraries acquire information products – unlike records, information products are not unique, so a key concern has to be the consideration of library resources at a level higher than the individual institution. Library collections in aggregation should encompass the universe of knowledge, but unless there is systematic and careful collaboration some subject areas will be characterised by a duplication of resources while others will be poorly represented. Use of the Conspectus methodology enabled libraries both to assess the extent and level of subject coverage, and to contribute to national assessments. The methodology was complex and labour-intensive, and the approach was seen as becoming less and less relevant in the context of the increasing ubiquity of digital information (Burke, 1998; OCLC, 2007).
Consequently, at the end of the 1990s Conspectus was discontinued and RLG focused its attention on improving electronic access to collections, and getting more resources online (OCLC, 2007). The primacy of concerns about access is reflected in the literature relating to the changing role of collection development in the digital age, where the establishment of purchasing consortia is a key objective (Dorner, 2004). Collaborative approaches to collection development continue to be encouraged as a means of avoiding duplication of effort while maintaining sufficient technical and/or geographical redundancy (Day, Pennock, & Allinson, 2007). These authors suggest that collection development policies need to specify object types (file formats) as well as content types (for example, peer reviewed articles, dissertations). We need to understand the nature of entities that we are dealing with, but it is more useful to think about this in terms of representation and encoding, rather than file formats.

The need to capture information from the World Wide Web has led to some new developments for libraries. A collection development approach has been advocated, and a template for development plans has been developed and trialled (Murray & Hsieh, 2006; Murray & Phillips, 2007). Cobb and colleagues distinguish two models for selection of web content for digital libraries: those centred on the item (biblio-centric) and those centred on technology (techno-centric)1. In the biblio-centric model each item is assessed in accordance with rigorous criteria relating to its relevance to the collection. This labour-intensive approach results in high quality, low volume collections. The techno-centric approach emphasises comprehensive collection building using software such as a web crawler. The end result of this, it is suggested, places the burden of selection on the end-user rather than the curating institution (Cobb, Pearce-Moses, & Surface, 2005). Applying the archival principles of provenance and original order, they suggest, offers a middle ground worthy of exploration. This ensures that aggregations, rather than individual documents, are the focus of effort. (See also discussion relating to national libraries and legal deposit, below.)
1 A third model identified by Pearce-Moses and Kaczmarek focused on the development of standards and metadata schema and collaboration with webmasters. This was found to be unsuccessful, due in part to lack of understanding on the part of the webmaster and also to high turnover (Pearce-Moses & Kaczmarek, 2005).

In the analogue world, the information collected and organised by libraries was likely to be clearly structured and identifiable by bibliographic data. This bibliographic data has appeared in the publications themselves since the 1960s (‘cataloguing in publication’). In the digital environment this may still be the case, but there will also be increasing volumes of much less formally structured information, such as websites, that will be of value to library collections. It is this increasingly ubiquitous nature of “library” information and the consequent exponential increase in acquisition of digital information by libraries that suggest the methodologies of archival science are applicable, or at least worthy of scrutiny.

Activities undertaken by national libraries are of particular interest, as the scope and scale of operations imply that manual selection procedures are unlikely to be sustainable. The parameters for collection development for many national libraries are established by legal deposit regimes – for example, legal requirements on publishers in that country to deposit one or more copies of publications. However, in the digital world not only is the definition of ‘published’ problematic, the locus of publication becomes less and less clear.

“… the concept of ‘national’ publications is becoming increasingly ambiguous in a world in which management and service delivery of publications may occur in a number of locations” (JISC, 2007).
Nonetheless, a state of the art report into current practice by national libraries in the digital preservation sphere found that most of the 15 countries surveyed were at least exploring the extension of existing legal deposit legislation to encompass digital objects (Verheul, 2006). This survey also found that although currently all 15 libraries accept all formats, most libraries showed awareness of the need to limit or regulate file formats accepted into their collections. An overall finding was that in this specific national library domain, there will be an increasing emphasis on developing selection methodologies. The two main reasons for this are that there will be a need to establish limits from a storage perspective and also because of the costs involved in long‐term preservation (Verheul, 2006). New Zealand’s legal deposit legislation does extend to digital objects. A strategy with the potential for automation recommended to assist with collection development in this setting is ‘nominated automated deposit’. Four categories of nominated deposit can be identified: solicited/requested by the library, provided on a contractual basis, initiated by creator and initiated by the general public (Ross, 2003, p. 21). The selection guidelines developed by the National Library of Australia for online publications provide insight into the particular challenges faced by libraries. These challenges include, for instance, the definition of a ‘publication’, definition of Australian content, the problem of multiple versions and particularly the need to define the parameters of a publication.
In this latter case there is a need to establish parameters to take internal and external links into account (National Library of Australia, 2005). David Bearman has recently suggested that universal capture is the only way in which the costs associated with selection at an individual institutional level can be minimised and the problem of many copies of some things and none of others can be addressed (Bearman, 2005). (For further information relating to the costs associated with selection see Ross, 2003, p.45 and discussion of the SEEDS cost estimation model.) The onus for selection then would rest with the user, and the librarians would focus on access concerns. That model does not appear to be particularly far-fetched given initiatives such as Google Books2 and UNESCO’s World Digital Library3. Even given a universal capture model, a key role can be seen for continued evaluation or re-appraisal of the objects in that global store to ensure that ongoing preservation is feasible.
4.2 Information for Accountability Records, the information objects that are the concern of archivists, can be defined as the evidence of business transactions (International Organization for Standardization, 2001). Records are, therefore, ubiquitous – a record is (or should be) created each time an interaction takes place. Records can range from the most mundane – bus or train tickets or till receipts for instance ‐ to the specialised and influential such as high level policy documents. The context of a record, including (but not limited to) documentation of who created it, why, and with what purpose is as critical as the informational content of the record. A very small percentage of these records are preserved for long term. The need to decide which those records are, which will be of interest and value to
2 “a project to digitize the world's books in order to make them easier for people to find and buy” (http://books.google.com/googlebooks/newsviews/)
3 “The World Digital Library initiative will digitize unique and rare materials from libraries and other cultural institutions around the world and make them available free of charge on the Internet. These materials include manuscripts, maps, books, musical scores, sound recordings, films, prints and photographs.” http://portal.unesco.org/en/ev.phpURL_ID=40277&URL_DO=DO_TOPIC&URL_SECTION=201.html
future generations, has led to the development of appraisal methodologies. ‘Appraisal’ has been defined as ‘making a judgement or estimation of the worthiness of continued preservation of records’ (InterPARES, 2000b, p.69). A key difference between archival appraisal and library selection is the requirement for archivists to predict future usage: “The essential problem in appraisal is to learn how archivists can move from what we can know to some valid projection of what we apparently cannot know, that is, from what we can know about the documents to suppositions about their continuing value.” (Eastwood, 1993, p.112).
The need to undertake appraisal activities became increasingly acute as quantities of records being created grew exponentially in the first half of the 20th century. It simply was neither possible nor desirable to keep everything – the resource implications for management and storage would be unsustainable. In the United States during the early 1950s Theodore Schellenberg devised a system of values as the basis for appraisal of government records (Schellenberg, 2003). This appraisal system was enormously influential not only in the United States, but also in other English speaking countries, and it is still used today, despite vigorous criticism4. Schellenberg’s system identified two types of value that could be accorded to records – primary and secondary. The primary value is the value of the record to the organisation that created it. The nature of this value could be either administrative (to support the long‐term business of the organisation); legal (to establish obligations and protect legal rights); or fiscal (to provide evidence of the receipt and use of funds). The secondary value is the value of the record to other users. The nature of this secondary value could be either evidential or informational. Evidential value exists if the record provides documentation of the ways in which the organisation functioned, its history or structure. Informational value means that the content would be significant to researchers because of the information provided about persons, places or subjects. This multi‐faceted approach to defining value endeavours to take into account the requirements of a variety of future users.
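For readers considering automation, the two‐tier value system described above can be written down as a small data structure. The following is only a sketch under stated assumptions: the class and field names are hypothetical, and the rule that any secondary value flags a series for archival review is an illustrative assumption, not part of Schellenberg’s theory.

```python
from dataclasses import dataclass, field
from enum import Enum


class PrimaryValue(Enum):
    """Value of a record to the organisation that created it."""
    ADMINISTRATIVE = "administrative"
    LEGAL = "legal"
    FISCAL = "fiscal"


class SecondaryValue(Enum):
    """Value of a record to users other than the creator."""
    EVIDENTIAL = "evidential"
    INFORMATIONAL = "informational"


@dataclass
class ValueAssessment:
    """Hypothetical container for the values accorded to one record series."""
    series_title: str
    primary: set = field(default_factory=set)
    secondary: set = field(default_factory=set)

    def flag_for_archival_review(self) -> bool:
        # Illustrative assumption: any secondary value triggers a review.
        return bool(self.secondary)


ledger = ValueAssessment("Accounts ledger", primary={PrimaryValue.FISCAL})
minutes = ValueAssessment(
    "Board minutes",
    primary={PrimaryValue.ADMINISTRATIVE},
    secondary={SecondaryValue.EVIDENTIAL},
)
```

Recording the values explicitly, rather than only the final decision, keeps the reasoning behind an appraisal available for later re‐appraisal.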
4 For example: “I feel it is essential that Canadian archivists realize that the traditional approach to appraisal no longer works …” (Cook, 1992, p. 182).
Schellenberg’s theory has been the subject of much debate in the archival literature, resulting in other attempts to devise appropriate methodologies. Of particular interest to our investigation of the automation of appraisal is work undertaken to model the elements that need to be considered when undertaking appraisal (Boles & Young, 1985). These authors identified three interrelated categories of elements, each of which should be applied in turn. The first of these is value of information, and it includes components encompassing circumstances of creation, analysis of content and use of the records. The next category introduces consideration of cost implications into appraisal decision making. (Boles and Young attribute the origins of this idea to a government archivist, G. Philip Bauer (Bauer, 1946).) The final category considers the implications of the appraisal recommendations, i.e., whether the impact will be positive or negative on the repository. In 1989, David Bearman challenged existing appraisal methodologies, arguing that these approaches to appraisal are doomed to failure because of three factors (Bearman, 1989). Firstly, records must have been created and maintained as records until the archivist appears to conduct appraisal, possibly at a much later stage. Secondly, the process is ‘people intensive’ – too much human expertise is required. The third and most significant reason for failure is that “we cannot know from examining records what societal requirements would be satisfied by their retention or destruction” (Bearman, 1995, p.383).
Bearman’s proposal to address this was that selection should be based on business function (see also 4.2.1.2 below) and guided by the principles of risk management (Bearman, 1995). Other strategies he identified were that others should do the selecting (replacing the review of records by archivists with high level negotiated agreements of required outcomes), that selection should be carried out automatically based on metadata, and that public interests should inform appraisal decisions (Bearman, 1995, pp. 399‐400).
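The idea of selection carried out automatically on the basis of metadata and business function can be illustrated with a minimal sketch. It assumes that each record carries a metadata element naming the business function that produced it, and that a negotiated agreement has already mapped functions to outcomes. The function names, outcome labels and mapping below are hypothetical examples, not drawn from Bearman’s work.

```python
from dataclasses import dataclass

# Hypothetical outcome of a high-level negotiated agreement:
# business function -> selection decision.
FUNCTION_DECISIONS = {
    "policy-development": "retain",
    "facilities-management": "destroy",
}


@dataclass(frozen=True)
class RecordMetadata:
    identifier: str
    business_function: str


def select(record, decisions=FUNCTION_DECISIONS, default="review"):
    """Return the decision for a record based only on its function metadata.

    Records whose function is not covered by the agreement fall back to
    manual review rather than being silently kept or destroyed.
    """
    return decisions.get(record.business_function, default)


memo = RecordMetadata("r-001", "policy-development")
invoice = RecordMetadata("r-002", "catering")
```

The fallback to review is a design choice made here for caution: it keeps human judgement in the loop for anything the agreement did not anticipate.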
4.2.1 Current Appraisal Strategies
The International Council on Archives (ICA), the professional organization representing the global archival community, has identified five appraisal strategies or approaches, which can be used in combination with each other if required. These strategies are inventory, functional, theme or territory, risk assessment and business systems design (Committee on Appraisal, 2003).
4.2.1.1 Inventory
This is a bottom‐up, records‐centric approach. It involves identifying and listing all records created by an organization. The listing will include information relating to the creation of the records (who and why), date ranges, volumes, uses and content. Retention periods can then be assigned, and those records worthy of long‐term or permanent retention identified. Problems with this approach are that it is extremely labour intensive and of course the resulting schedule or inventory has to be kept up‐to‐date to reflect changes in recordkeeping practices. It is still very widely applied, however, particularly in the United States and it has been adapted for use in the digital environment. For instance, the United States Geological Survey (USGS) has developed an online survey form to collect information about individual record series or data sets5. Similarly, the United Kingdom’s National Archives provides guidelines for inventory of digital records (EROS, 1999), despite the fact that their new appraisal policy (The National Archives, 2004) takes a macro approach (see below).
4.2.1.2 Functional/Macro Approach
As discussed above, a functional approach to appraisal was first advocated by David Bearman (Bearman, 1989). It is a top‐down approach and involves analysis of the functions of an organisation or society to determine which functions are likely to create and maintain records of long‐term value. The terms functional and macro appraisal are sometimes used interchangeably, but there are key distinctions. Functional appraisal is commonly used to specify analysis that takes place within the organisation. Macro appraisal, however, as the name suggests, involves taking a step back and considering functions within a broader context. It has been defined as “…a planned, strategic, holistic, systematic and comparative approach to researching and identifying society’s need for records.” (Cunningham & Oswald, 2005).
5 The USGS Records Appraisal Tool can be seen at http://eros.usgs.gov/government/RAT/tool.php
This approach has been prompted by the need to develop appraisal methodology that could be applied to the increasing volumes of records created in society, and different variations have been developed in different national jurisdictions (see, for example, the detailed accounts of macroappraisal practices in Australia (Cunningham & Oswald, 2005), the Netherlands (Jonker, 2005) and New Zealand (Roberts, 2005)). A criticism of the functional/macro approach is that records having secondary informational value beyond the creating and maintaining organisation may not be identified as being appropriate for long‐term preservation (Committee on Appraisal, 2003) – see 4.2.1.3 Documentation Strategy below for a hybrid solution to this problem.
4.2.1.3 Documentation of a theme or a territory
The third approach identified in the ICA manual focuses on a subject or geographic area. The strategy involves the identification of all owners of relevant recordkeeping systems (for example, public and private archives) and potential users of the records. The manual’s assessment of this approach is that it is ‘slow and resource intensive’ (Committee on Appraisal, 2003). Documentation strategy has been explored in depth in North America as an alternative to the Schellenbergian value system. Terry Cook provides a concise overview of dimensions of this discussion, and pros and cons of the approach by way of a critique of Helen Samuels’ keynote address on this topic to the Association of Canadian Archivists conference (Cook, 1992). Documentation strategy is currently practiced in Germany. There, a cross‐archive (for example, public and private) approach, also referred to as ‘vertical and horizontal’, takes into account functional perspectives as well as ‘textual and content‐oriented aspects’ (Kretschmar, 2005).
4.2.1.4 Risk Assessment
This approach uses the identification of risks to set priorities and make decisions. The ICA manual highlights relevant risks at organisational level, i.e. the risks to an organisation if records of a particular function are not appraised (Committee on Appraisal, 2003). Although not specifically archival, the work of the preservation community in assessing the degree of preservation risk associated with digital objects is of particular interest when considering the automation of appraisal. The Digital Asset Assessment Tool (DAAT) Project set out to produce a tool that could be used by collecting institutions such as libraries and archives to assess which
digital assets were at greatest risk and to take action accordingly (Pinsent & Ashley, 2006). The resulting tool is dependent on data collected manually. However, an automated workflow that assesses and reports on preservation risk has been developed and tested on a digital archive (Anderson, Frost, Hoebelheinrich, & Johnson, 2005).
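To make the notion of an automated risk assessment concrete, the sketch below ranks digital objects by a simple additive risk score. It is not the DAAT method or the workflow reported by Anderson et al.; the risk factors and their weights are hypothetical and would need to be set by the repository concerned.

```python
from dataclasses import dataclass

# Hypothetical risk factors and weights; a real repository would derive
# these from its own preservation policy.
RISK_WEIGHTS = {
    "undocumented_format": 3,
    "no_migration_tool": 3,
    "obsolete_media": 2,
    "missing_metadata": 1,
}


@dataclass(frozen=True)
class DigitalObject:
    identifier: str
    risk_factors: frozenset


def risk_score(obj, weights=RISK_WEIGHTS):
    """Sum the weights of the risk factors present; unknown factors count 0."""
    return sum(weights.get(factor, 0) for factor in obj.risk_factors)


def rank_by_risk(objects):
    """Return the objects ordered from highest to lowest risk score."""
    return sorted(objects, key=risk_score, reverse=True)


tape = DigitalObject("obj-1", frozenset({"undocumented_format", "obsolete_media"}))
report = DigitalObject("obj-2", frozenset({"missing_metadata"}))
ranked = rank_by_risk([report, tape])
```

Because the score depends only on recorded characteristics, it could in principle be recomputed whenever those characteristics change, for example when a migration tool becomes available.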
4.2.1.5 Business Systems Design
The final strategy identified by the ICA is a holistic approach to records management in accordance with the international standard for records management, ISO 15489. This approach involves incorporating appraisal decisions in the design of business systems. If this is done, then as a consequence it is possible to envisage disposition being carried out automatically. This is the theoretical basis underlying the development of DIRKS (State Records Authority of New South Wales, 2007). (See also 4.2.2.1 Appraisal as Part of the Business Process.)
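If appraisal decisions are built into a business system as retention rules, the date on which disposition falls due can be calculated rather than decided case by case. The following sketch assumes a rule consisting of a retention period in years and a final action; these names and the rule structure are hypothetical and are not taken from ISO 15489 or DIRKS.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class RetentionRule:
    """Hypothetical rule attached to a class of records at design time."""
    retention_years: Optional[int]  # None means retain permanently
    action: str                     # e.g. "destroy" or "transfer"


def disposition_due(closed_on: date, rule: RetentionRule) -> Optional[date]:
    """Return the date on which the rule's action falls due, if ever."""
    if rule.retention_years is None:
        return None
    target_year = closed_on.year + rule.retention_years
    try:
        return closed_on.replace(year=target_year)
    except ValueError:
        # 29 February closed dates in a non-leap target year fall on 28 February.
        return closed_on.replace(year=target_year, day=28)


def is_due(closed_on: date, rule: RetentionRule, today: date) -> bool:
    """True when the retention period has elapsed by the given date."""
    due = disposition_due(closed_on, rule)
    return due is not None and today >= due


seven_years = RetentionRule(7, "destroy")
permanent = RetentionRule(None, "transfer")
```

A system could run `is_due` on a schedule and queue the due records for the stated action, leaving authorisation of the action itself to whatever approval process the organisation requires.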
4.2.2 The Process
Definitions of appraisal in the archival context emphasise the use of clearly specified criteria and requirements in determining the value of records in order to ensure that the process is as objective as possible: “Records appraisal is the process of determining the archival value and ultimate disposition of records. Appraisal decisions are based on a number of criteria including the historical, legal, administrative, and financial value of the records”6
and according to the Paradigm project7 (see also section 5.2.1 below), “archival theory extends this definition to include the policies and procedures used by an archivist to identify, evaluate, and authenticate records, in all formats, which have enduring value to records creators, institutions, researchers, and society. Appraisal in a paper-based archive traditionally takes place once a record is no longer current, but determination of how long a record should be retained can take place before creation for some kinds of records”
6 http://www.jisc.ac.uk/whatwedo/programmes/programme_preservation.aspx
7 http://www.paradigm.ac.uk/workbook/appraisal/index.html
The process of determining significance consists of these distinct stages:
appraisal, which includes identifying the selection (or retention) rules and subsequently the evaluation of the value of information or digital objects. Evaluation can be based on different methods, such as analyzing the context in which the digital objects will be created (e.g. the business activities) or analyzing the content.
selection, which includes the attribution of values to the digital objects; this could be part of the first stage.
disposal, which is the actual application of the appraisal and selection decisions, that is, keeping or destroying the information.
re‐appraisal, which may occur on ingest, as part of preservation planning, or on access.
As previously indicated, in the archival sector appraisal is part of a selection process made up of specific activities (selection, appraisal, and disposition as destruction or preservation). Appraisal should be conducted on the basis of well defined principles and criteria, as further developed with reference to the non‐electronic environment. Specifically, appraisal should be carried out when the digital resources are still in their active phase, as near to the time of creation as possible. Managing the appraisal function implies using and maintaining a huge amount of information, which includes the decisions taken in the past (with reference to the various responsibilities involved and the strategies and procedures developed), the contextual information related to the records (the juridical, documentary and technological contexts), and the values established for the records and for their preservation feasibility (in terms of cost and in terms of preserving the authenticity of the records). The feasibility of preserving the records depends strictly on the capacity to preserve their essential digital components, that is, those that confer the records’ identity and ensure their integrity, now and in the future. This information (which includes content and the data/metadata necessary to organise, structure or render the content of the records) has to be structured and articulated so as to support decisions about the present and future capacity to preserve the digital components which constitute the record’s identity and ensure its integrity. This effort includes at least three phases:
determine which elements can validate authenticity
identify where these elements are manifested (in which digital components) and what is the technical information relevant for their preservation
reconcile these preservation requirements with the financial and technical capacities of the repository
As clearly testified by the flexibility required in the preservation process, appraisal is a relevant active component of this process and it includes a higher level of responsibility than in the past. The quality of the preservation is strictly connected with the quality of an early appraisal. The more complex and rich the digital data to be preserved (as in the scientific world), the more relevant is the active appraisal here described which includes crucial tools whose automation8 will ensure the success of the preservation itself:
criteria and policies able to orient a neutral approach,
auditing and validating procedures
contextual information automatically extracted and preserved.
Appraisal processes and the resulting decisions need to be transparent and accountable and could be made at the following points in time:
before recordkeeping systems are designed ‐ Whenever systems designers know the requirements for creating, maintaining and disposing of records over time, appraisal strategies can be built into recordkeeping systems.
before records creation ‐ Early appraisal informs managers of the risks they face if records are not created for long term preservation. It also prevents the accumulation of un‐appraised records.
8 See http://eros.usgs.gov/government/RAT/tool.asp where the USGS Scientific Records Appraisal Tool is described (see CODATA-ERPANET workshop, The selection, appraisal and retention of digital scientific data…cit., p.13).
before disposal ‐ It is standard practice to require appraisals to be conducted before classes of records are authorised for disposal.
when required ‐ Some records may be subject to numerous appraisal processes over time.
4.2.2.1 Appraisal as Part of the Business Process
When appraisal is carried out as part of a business process it involves the following decisions:
what records should be created to document a business activity
how long those records should be retained.
Appraisal should be a planned, transparent and accountable process of research, analysis, evaluation and consultation. As part of business process analysis, appraisal is done to achieve benefits such as:
reduction of information overload through identification, segregation and elimination of non‐critical records;
protection and awareness of legal, financial and community interests, rights, entitlements and obligations of organisations and individuals;
preservation of corporate and cultural memory;
reduced storage and maintenance costs through timely disposal of records that are no longer required.
5.0 Current Research
Research exploring the issues relating to the determination of significance of digital objects has been largely carried out by the information for accountability community, so attention has focused on appraisal methodology. In the paper world appraisal and selection processes are conducted manually and require an enormous amount of effort. In a digital environment new approaches are necessary. This section summarises findings from InterPARES and Paradigm, as well as developments involving metadata. InterPARES and Paradigm have been singled out for particular attention because findings from both these projects are particularly important. InterPARES is significant because this was truly a global endeavour, and therefore included a multiplicity of theoretical views. Paradigm deserves close scrutiny as its focus is on the difficulties encountered with respect to personal papers – i.e., documents created and stored in uncontrolled environments. The section concludes by considering the Archives in a Digital World conference held in Rome in 2007.
5.1 InterPARES
The concept of appraisal in the digital environment was a major focus of research for InterPARES (The International Research on Permanent Authentic Records in Electronic Systems). One output was a review of the English‐language literature relating to the appraisal of electronic records (InterPARES, 2000a). This literature review formed the basis for subsequent work, in particular the development of a model to identify the activities involved in selection and acquisition and identification of issues specific to the appraisal of electronic records. Of particular interest was the conclusion that only two specific criteria could be established to cover all possible appraisal situations:
“…first, the requirements for assessing authenticity as part of assessing the value of electronic records; and, second, the concepts that have been developed for determining the record elements to be preserved and for identifying the digital components to be preserved as part of determining the feasibility of preservation.” (InterPARES, 2000b, p.97)
The project also concluded that an appraisal function that is valuable and necessary for the digital environment should be seen as a new activity strictly related to the early control of the creation of digital resources and as an integrated part of the complex preservation processes necessary in the new information systems. This activity, as part of the preservation function, requires a clear and well
developed definition of records/resources characteristics (such as authenticity and usability) and it includes also:
analysis of the feasibility of preservation,
regular monitoring (as an activity which includes continuing internal changes in the decision process related to the appraisal and transfer),
a continuing self‐documented approach with specific reference also to the technological context (i.e. definition of the formats for transfer and its requirements).
In this perspective, appraisal is not only unavoidable, but it is an inextricable part of the effort dedicated to preserving digital memories. (See Appendix 1 for a summary of findings from InterPARES.) It is now imperative to validate the findings of InterPARES by experimentation.
5.2 Paradigm
The Paradigm Project (dedicated to the preservation of digital archives of individuals and small organisations) explores whether appraisal of digital records is a worthwhile exercise9:
“Trends in the digital world seem to reject the practice of actively organising our digital collections by choosing what to keep and what to discard. Declining storage costs and improved discovery seem to have rendered appraisal and disposal needless”
This applies specifically – as the authors of this useful report stress – when the resources preserved have a poor structure and lack a functional organization, so that detailed analysis and description at the time of transfer/acquisition would
9 See http://www.paradigm.ac.uk/workbook/appraisal/index.html. The question has been at the centre of the discussion within the DELOS workgroup on digital libraries preservation. An active discussion took place and different views were expressed in preparing this study, which tries to take into account the different perspectives expressed on this occasion. For an historical perspective on digital appraisal in the US, see Linda J. Henry, An historical perspective on appraisal of electronic records, 1968-1998. SAA Annual meeting, Session 47, September 2000 (unpublished paper) and Ead., Schellenberg in Cyberspace, in “The American Archivist”, 61 (Fall 1998), p. 317.
require an enormous amount of effort and time. The conclusion is an acceptance of the possibility that
“in future we might only appraise and catalogue very important collections, similar to the way in which only very high value manuscripts are catalogued to piece level”, and that appraisal would affect only those recordkeeping systems that are well organised through a functional classification based on business processes.
According to this project (and reiterated in the summing up of the ERPANET and CODATA workshop in Lisbon10) the basic reason for appraisal is still a pragmatic question of quantity and cost. The future need for appraisal, selection and destruction is foreseen as a consequence of unresolved financial issues. The growth of digital content (per byte or per object) will not be in step with the declining costs of devices and of back‐up routines and other system administration tasks needed to create a huge amount of “preservation metadata” required to sustain the long‐term life of the resources, specifically in the case of complex and compound objects (websites, email systems), and of undocumented, obscure or no longer supported formats at the time of acquisition11.
10 CODATA-ERPANET workshop, The selection, appraisal and retention of digital scientific data, Lisbon 15-17 December 2003. Final report, cit.
11 In the Paradigm project report this aspect is well defined, with a list of objects whose specific technical characteristics could seriously affect the cost of preservation and require an appraisal decision. Objects that might be more expensive to retain include: “Complex or compound objects, such as websites or email archives
Objects in undocumented formats
Objects in obscure formats
Objects in formats unsupported by a community or vendor at the time of acquisition
Objects in formats for which no migration/emulation tools exist
Objects in formats unknown/unsupported by preservation registries and tools
5.3 Metadata
The existence of metadata, data about data, is a key concern in any consideration of determining significance or value. The more information we have about an object, the greater the potential is for improved decision‐ making. Much research into metadata has originated in the archival or ‘information for accountability’ community, for instance Monash University’s Clever Recordkeeping Metadata project, which explored the automatic creation of metadata, and its reuse across systems (Evans, McKemmish, & Bhoday, 2005). However, the value of metadata remains unproven (PREMIS Working Group, 2004). Appraisal, determining the value of a particular digital object or class of objects, requires metadata that describes
“the context, content and structure of records and their management through time” (International Organization for Standardization, 2001). Other information communities also recognize the importance of having adequate metadata, and significant steps are being taken by researchers at the Digital Curation Centre (DCC). In recognition of the fact that the creation of digital objects is increasing at an exponential rate and the manual collection of metadata is unsustainable, the DCC research is focusing on the automatic extraction of metadata (see, for example, (Kim & Ross, 2006)). (See also section 8.1 Using Metadata and Genres.) As discussed above in section 2.3, the existence of metadata reflecting relationships and levels of aggregation is critical in any consideration of item‐
Objects for which the repository has no preservation strategy
Objects which are encrypted, password protected or subject to digital rights mechanisms
Objects on old or obsolete media
Objects without metadata
Objects which require software licences for access or manipulation”.
level appraisal for records. David Bearman recognized the potential for the use of metadata to show relationships in 1985: "…our current description practices focus on capturing content of records, and on describing existing arrangement and highly general context, when what we need is highly specific metadata about transaction contexts which would provide us with what we need to know about content and structure (including, but not limited to, arrangement). …An archival strategy for documentation is to automatically capture metadata required to ensure evidence, to manage programs and to support use after analysis of functional requirements for recordkeeping, business process, and user needs." (pp 393-394)
R5.3.1 Metadata showing relationships and levels of aggregation of records should be used to automatically generate description for archival repositories
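Recommendation R5.3.1 can be illustrated with a minimal sketch of how a hierarchical description might be generated from relationship metadata. It assumes that each unit of description carries an identifier, a title and the identifier of its parent aggregation (or none for a top‐level series). The tuple layout and the indented text output are hypothetical choices made for illustration, not an archival description standard.

```python
from collections import defaultdict


def build_description(entries):
    """Generate an indented outline from (identifier, title, parent) tuples.

    `parent` is the identifier of the containing aggregation, or None for
    a top-level unit. Siblings are listed in identifier order.
    """
    children = defaultdict(list)
    titles = {}
    for identifier, title, parent in entries:
        titles[identifier] = title
        children[parent].append(identifier)

    lines = []

    def walk(node, depth):
        # Two spaces of indentation per level of aggregation.
        lines.append("  " * depth + titles[node])
        for child in sorted(children[node]):
            walk(child, depth + 1)

    for root in sorted(children[None]):
        walk(root, 0)
    return lines


entries = [
    ("s1", "Series: Policy files", None),
    ("f1", "File: Transport policy", "s1"),
    ("i1", "Item: Draft memo", "f1"),
    ("f2", "File: Health policy", "s1"),
]
outline = build_description(entries)
```

The same parent links that produce the outline also make it possible to see which aggregation an individual item belongs to when item‐level appraisal is being considered.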
5.3 Appraisal in a Digital World
This conference, held in Rome in 2007, was a ground‐breaking endeavour. It brought together academics and practitioners and, most importantly, was inter‐disciplinary, thus allowing the exchange of ideas and practice across information domains. International speakers included Luciana Duranti, Terry Eastwood, Ken Thibodeau and Jason Baron. Papers are yet to be published, but discussions at the conference and subsequently have informed the development of recommendations in this report.
6.0 Issues
Given increasing capabilities for search and retrieval, improving quality and decreasing costs of storage devices and the greater complexity and consequent higher costs involved in evaluating a huge amount of digital resources, a fundamental issue debated is whether or not it is necessary to expend effort in determining the value or significance of information. A further concern is that the very act of nominating some information as worthy of preservation and not others means that the resulting bodies of knowledge will not be truly representative of all voices within a community.
6.1 Is it necessary to determine the significance of digital information?
Organisations repeatedly take decisions on what information (or digital objects) should be preserved and for how long. Criteria governing what to keep and what to discard are usually based upon such factors as organisational needs/objectives, juridical requirements, and information value that are relevant to the business context of the organization (whether a library, a public sector institution or a commercial company). This is happening in government organizations, business companies and memory organisations including, recently, digital libraries and digital archives. The reason for this is that preserving too much digital material is not cost‐effective. It could be said, as a consequence of this approach, that the main reasons in favour of appraisal rest on the lack of detailed control in the creation phase of digital objects, specifically over their technological contexts. The paradox could be that a well organised recordkeeping system with easy and well detailed retrieval tools, built at the creation stage with an integrated preservation plan, would not need to be evaluated and disposed of for preservation purposes.
A similar conclusion can be recognised in the ICA Guidelines on Appraisal. Emphasis is on the difficulties of conducting appraisal on the digital heritage of small organizations, often characterized by the absence of systematic filing and naming rules (p. 22): “Where electronic documents exist with little organisation or structure linking them together in meaningful collections or groupings, appraisal will be difficult. This will be the case where, for instance:
electronic documents are held on a shared local network drive with no systematic organisation or structure in the filing or folder hierarchy
files and folders are created directly by end users with no established naming conventions, resulting in names that are ambiguous, mysterious or misleading
electronic documents are held in a document management system that relies upon search technology alone to bring together sets of related records.”
The lack of these relevant attributes is the result of weak recordkeeping systems, with a consequent
lack of consistency in the allocation of individual records and in the development of series,
loss of information related to the original context, and
loss of the original organization and its meaning.
The only possibility for appraisal (but also for preservation) in this case is a record‐by‐record analysis, “time‐consuming and resource‐intensive” and of course “unlikely to be cost effective” (p. 22). Drawing on the long experience of the US National Archives, Linda Henry determined that the characteristics to evaluate for appraisal (in the absence of best practice in the creation of resources) include the resources’ manipulability, volume, linkage duplication, micro‐level, readability, hardware and software documentation, format independency. In any case the content analysis is the most relevant aspect for taking the final decision and
the lack of organization and indexing is, in the end, the main component to evaluate12.
A different answer to the basic question can be found if appraisal is regarded not as a pragmatic solution for space and redundancy but as part of a digital preservation process. To develop a more comprehensive answer to the question, it is important to ensure a consistent theoretical approach based on the idea that, as part of a business process, appraisal is done – as already stressed in this report ‐ to achieve benefits such as:
reduction of information overload through identification, segregation and elimination of non‐critical resources;
protection and awareness of legal, financial and community interests, rights, entitlements and obligations of organisations and individuals;
preservation of corporate and cultural memory;
reduction of storage and maintenance costs through timely disposal of materials that are no longer required by the creator.
6.2 Is the Process of Determining Significance Fundamentally Flawed?
A recent discussion paper which has provoked debate claims that appraisal as currently practised is faulty, because it is based on the assumption that it is possible to select the information that will have greatest value to a community in the future (Neumayer & Rauber, 2007). The paper argues that, by its very nature, appraisal is an exercise in censorship, and has been used by repressive regimes to control access to knowledge. Although a seasoned archivist or records manager can identify (and some have identified) inherent flaws in these arguments and in the proposed solution (random sampling of all digital information, regardless of origins or purpose), the underlying concern needs
12 See Linda Henry, An historical perspective on appraisal of electronic records, 1968-1998. SAA Annual meeting, Session 47, September 2000, cit.
to be explored further, even though the question concerns not the very nature of appraisal but the social organisation of memory preservation at national level and the specific mandates of the institutions dedicated to it. Minority or disadvantaged groups have traditionally not been well represented in the historical record, as a consequence of the nature of the records created by public institutions and of the weak protection given to private records. Facilitating the involvement of all societal stakeholders in determining the significance of information would help to address this issue. In the archival domain, there has been a call for participatory appraisal to ensure that the needs of marginalised communities are met (Shilton & Srinivasan, 2007). In the case study described in their paper, the methodology involves the establishment and ongoing development of community ontologies. For digital records and other content, the fact that access can now be made widely available means that views from outside traditional libraries or archives can be incorporated into the decision-making process. (See recommendation R8.0.1.)
7.0 Principles, Requirements and Criteria for the Appraisal of Digital Objects
In the recordkeeping context, if a robust system is in place, findings from a functional appraisal strategy can be used to provide the framework (a disposal schedule) for technological controls that distinguish those records which need to be retained for long periods of time. The implementation of a disposal schedule does, however, require manual intervention, especially when a functional organization of the records system is not provided at the creation phase. Consequently the emphasis in this part of the report is on determining significance at item level, as it is here that real possibilities for automation can be seen.

The final report of the Paradigm Project suggests that top-down appraisal (macro or functional) will need to be carried out initially, before a bottom-up approach focusing on individual items can proceed. It is also possible to envisage an initial triage stage if the data to be appraised is sufficiently disorganised. For instance, a poorly organised shared drive of legacy records could initially be scanned at item level to identify any digital material that has the potential for long-term preservation, before any other analysis work takes place. It has also been suggested that genre could be used as a basis for appraisal where metadata is absent or limited (Underwood, Isbell, & Underwood, 2007).

Findings from the InterPARES Appraisal Task Force indicate very clearly that only two criteria can be established to cover all situations: firstly, the requirements for assessing authenticity, and secondly, determination of the feasibility of preservation (InterPARES, 2000b). Both of these criteria can be applied at item level and can therefore usefully be considered significant potential areas for automation.
To supplement this, a representative collection of published policies, guidelines and case studies was consulted in order to identify a list of specific criteria that have been used in a range of different settings. Subsequent analysis showed that the criteria could be clustered into the following categories:
Content: comprises appraisal criteria which involve assessment of the informational content of the item, series or collection.
Contextual: comprises criteria relating to an assessment of the context in which the item, series or collection was created.
Evidence: comprises criteria which provide evidence of activities and/or functions.
Operational: comprises appraisal criteria which contribute to assessment of the implications of long-term preservation for the collecting agency.
Societal: comprises appraisal criteria which relate to the external societal/national information management infrastructure, including legislative and ethical concerns.
Technical: comprises appraisal criteria which relate to technical characteristics or features of the record or data.

Figure 4: Categories used for the grouping of appraisal criteria

Figure 5: Appraisal criteria that comprise each category
The results of this analysis are shown in Table 1. All criteria either can be or have been applied at item level, but it must be emphasized that most are interrelated and will be part of a contextual analysis. In this sense items are rarely isolated even in a completely uncontrolled environment. The source documents used to compile this table are a selection of publications in English comprising policy and guidelines and reports of practice in different information environments (see Appendix 2 for the scope of each document used). This combination of literature types was used to try to assemble a representative, rather than comprehensive, list of criteria from a range of different settings and to provide a basis for starting to identify categories. In fact it proved quite challenging to find sources that provided sufficient detail (as opposed to an overview of principles) about the selection process undertaken, but there are doubtless others and this may be an area that receives increasing attention in the future13.
13 For instance, subsequent correspondence with John Faundeen of USGS identified the following appraisal factors: Is the data raw or minimally processed? Do the files contain non-archival records? Has the data successfully undergone the peer review process? Is compression used? Is the data classified (by governments, e.g. as 'secret')? Is the author (creator) reputable? Are the [scientific] parameters useful outside the project that created the data? Are the records in a discernible order? What language is used? If analog and digital versions exist, which is better, or do both have to be kept? What are the accession or disposition costs? What was the data collection method? Is space available to accommodate the collection? Further analysis is required to determine which should be incorporated into existing factors, and which merit articulation as additional criteria.
Figure 6: Application areas for appraisal criteria documented in Table 1
R7.0.1 Further analysis of policies, guidelines and reports of practice should take place in order to build a comprehensive database of criteria used in the appraisal of digital objects, linked not only to domain of practice but also to sector of activity and country.
Table 1: Specific criteria considered in the appraisal of digital information objects

Category | Appraisal Criteria | Source document domain (policies, guidelines or reports of application)
Content | Comprehensiveness – e.g. whether information covers a complete population or not | Geospatial data [13]
Content | Coverage – e.g. the spatial area covered | Geospatial data [13]
Content | Growth – will information continue to grow or is object complete? | Geospatial data [9, 13]
Content | Relationships – Are there existing relationships to information already held in the repository? This will include consideration of any dependencies/interdependencies. For example, does one item constitute a finding aid to another item? | Social science data [3]; records [4]; geospatial data [13]; websites [6]
Content | Reliability – whether the information is likely to be accurate or authoritative. ‘Contents can be trusted as a full and accurate representation of the transactions, activities or facts to which they attest’ (ISO15489) | Geospatial data [13]; publications [10]
Content | Significance – importance of information content for current and future research needs | Social science data [3]; records [4]; publications [10]; websites [6]; geospatial data [13]
Content | Time – Period of time covered, e.g. creation date and end date | Social science data [3]; records [4]; geospatial data [13]
Content | Uniqueness – whether the resource represents unique information. This includes consideration of whether duplicates exist or information is also available in other media | Social science data [3]; records [4, 5]; publications [10]; geospatial data [13]; websites [6]
Content | Usability – Accessibility of content, e.g. are appropriate manuals available to decipher information | Geospatial data [13]; social science data [3]; records [4]; websites [12]
Contextual | Documentation – Accompanying technical documentation explains how data collected etc. | Social science data [3]; geospatial data [13]
Contextual | Provenance – Appropriateness of provenance to collection (e.g. ‘within state’, or existence of relationship with donor) | Records; geospatial data [9]; websites [6, 12]; publications [10]
Contextual | Significance – Significance of source/context of data/records | Social science data [3]; records [4]
Contextual | Usage – Frequency of use | Geospatial data [13]
Evidence | Accountability – Provide defence of agency against charges of fraud/misrepresentation | Geospatial data [13]
Evidence | Artefact – Provide evidence of way in which organization functioned (e.g. how technology incorporated into business, how web used as communication tool) | Websites [6]
Evidence | Authenticity – Whether the object is what it purports to be, to have been created or sent by the person purported to have created or sent it; to have been created or sent at the time purported | Records [1, 2]; geospatial data [13]
Evidence | Precedence – Documentation of decisions that set precedent | Records [4]
Operational | Costs – Costs involved in long-term maintenance | Social science data [3]; records [4]
Operational | Collection – Fit with existing collection policy | Geospatial data [13]
Operational | Mission – Fit with organizational mission | Geospatial data [13]
Operational | Potential – e.g. repurposing: possibilities for use in ‘value added products’ | Geospatial data [9]
Operational | Replaceability – e.g. can information be replicated, cost of replicating information; value of information vs costs of preservation | Records [4]; geospatial data [13]
Societal | Ethics – are there ethical implications that will influence decision making, for example any reasons why records or information should not be retained? | Geospatial data [13]
Societal | Intrinsic value – e.g. aesthetic or artistic quality, experimental use of new technology | Records [4]; geospatial data [13]; publications [10]
Societal | Legal considerations – e.g. privacy, data protection legislation prohibiting retention | Geospatial data [13]
Societal | Representativeness – either of sectors of community, or statistically | Websites [12]
Technical | Functionality – Retention of behaviour, e.g. has look and feel been retained? | Records [4, 8]; geospatial data [13]
Technical | Integrity of records – should be complete and unaltered, and have been protected against unauthorized modification | Geospatial data [13]
Technical | Rights issues – e.g. copyright | Geospatial data [9, 13]; records
Technical | Risk – Degree of risk to content | Social science data [3]
Technical | Size of object/volume of records | Social science data [3]; records [4]
Technical | Usability of records – should be able to be located, retrieved, presented and interpreted | Geospatial data [13]; social science data [3]; records [4]; websites [12]

Sources: 1 (InterPARES, 2000b); 2 (Eastwood, 2003); 3 (Data Preservation Alliance for the Social Sciences (DataPASS)); 4 (National Archives and Records Administration, 2007); 5 (Thomas, 2007); 6 (Grotke & Ruth, 2007); 8 (InterPARES, 2002); 9 (Morris, 2006); 10 (National Library of Australia, 2005); 11 (Murray & Phillips, 2007); 12 (Lala & Joe, 2006); 13 (United States Geological Survey, 2007)
8.0 Automation
Identifying the criteria used to make appraisal decisions about digital objects, and linking those criteria with specific metadata, implies that it may be possible to build an appraisal engine to automate this process. The Boles and Young attempt to codify appraisal (see section 4.2) has been criticised on the grounds that implementation would be too difficult (Bearman, 1995, p. 392), rather than because there are inherent problems in the definition of the appraisal elements. Anne Gilliland-Swetland has commented that there is insufficient agreement in the archiving community to allow complete codification of appraisal (Gilliland-Swetland, 1995). This is echoed in the InterPARES recommendations, which distinguished only two ‘universal’ factors: authenticity and feasibility of preservation (InterPARES, 2000b). Terry Cook has provided trenchant criticism of the taxonomic approach to appraisal on the grounds that there are simply too many records to appraise (Cook, 1992).

However, if a much more flexible approach to codification is envisaged, based on genre, configurable to different institutional and cultural contexts, and applied within an overarching, top-down appraisal framework, a way forward can begin to be seen for the information for accountability community. Automating this process will facilitate its application, and a solution starts to emerge to the impossibility of reviewing excessive and ever-increasing numbers of records on a record-by-record basis. To enable appraisal/selection and disposal, the following activities are necessary:
extract information about the digital objects (about content, context, or both): extraction tools provide an excellent mechanism for obtaining information about the content, the context of origin and the technical nature of a digital object or of any aggregation of objects; by using knowledge representation and processing techniques it would be possible to automate the appraisal decision-making processes;
analyse this information based upon appraisal criteria14;
attribute appropriate appraisal/selection/disposal metadata to the digital object15;
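The three activities listed above can be pictured as a small pipeline. The following is a minimal sketch only, not a design from this report: the `DigitalObject` class, the function names, the two example rules and the `review` / `retain-candidate` labels are all hypothetical and chosen purely for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class DigitalObject:
    """Hypothetical stand-in for an item awaiting appraisal."""
    identifier: str
    metadata: dict                                  # extracted content/context/technical facts
    appraisal: dict = field(default_factory=dict)   # decision metadata, filled in later


def extract(identifier, raw):
    """Activity 1: gather whatever metadata the extraction tools produced."""
    return DigitalObject(identifier, dict(raw))


def analyse(obj, rules):
    """Activity 2: apply each rule; a rule returns a finding string or None."""
    findings = []
    for rule in rules:
        finding = rule(obj.metadata)
        if finding is not None:
            findings.append(finding)
    return findings


def attribute(obj, findings):
    """Activity 3: record the outcome as appraisal metadata on the object."""
    obj.appraisal["findings"] = findings
    obj.appraisal["action"] = "review" if findings else "retain-candidate"
    return obj


# Illustrative rules only; real criteria would come from institutional policy.
def missing_creator(md):
    return "no creator recorded" if not md.get("creator") else None


def unknown_format(md):
    return "format not identified" if md.get("format") in (None, "unknown") else None


item = extract("doc-001", {"creator": "", "format": "application/pdf"})
item = attribute(item, analyse(item, [missing_creator, unknown_format]))
print(item.appraisal)
```

Keeping the rules as separate functions is one way to let each institution configure its own criteria without changing the pipeline itself.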
On the basis of the previous analysis, some indicators that will affect appraisal in the digital environment can be identified, namely the quality of recordkeeping practice and the timing of appraisal. Specific factors to be noted are:
in a digital environment, a large amount of metadata for the description and preservation of digital objects must be created and maintained from the point of creation of the resources, to ensure correct acquisition by the digital repository; this implies new procedures and a change in the chain of responsibilities;
it may be necessary to develop a re-appraisal strategy within the repository, both to ensure a first rudimentary process and to refine that process once archival analysis has provided the information required for a detailed evaluation;
impact of appraisal methods: for instance, where a snapshot approach is used (for websites), the appraisal policy should be in place at the same time as the preservation strategy;
financial and staffing constraints: the budget constraints governing digital repositories should be given more attention than in the paper world. The lack of funding for cataloguing and descriptive activity implies that appraisal and disposal should take place as early as possible, before the creation of a submission package for the repository (and hence before the preparation of Archival Information Packages (AIPs) as defined by the OAIS model);
the requirement for a higher degree of documentation and information at an early stage and throughout the management of the resources.
15 This activity could be considered part of the definition of the functionalities of a digital library system as defined in the DELOS report prepared by Volker Herrmann and Manfred Thaller, Integrating preservation aspects into the design process of Digital Libraries, available at http://www.dpc.delos.info/private/output/DELOS_WP6_d651_finalv3_5__cologne.pdf.

All the metadata collected for conducting the appraisal of electronic records are crucial documentation and information for ensuring the adequacy of the preservation function. They include all the main information required for carrying out the appraisal as previously described, such as:
information on the creation context of the records (juridical-administrative, procedural, provenancial, documentary);
information on the technological context of the records, relevant both for assessing the authenticity of the records and for evaluating the feasibility of the preservation processes;
information related to the appraisal decision itself: the documentation needed to justify the decision is relevant not only for the accountability of the creator but also to support the preservation process at any time;
information related to the continuous monitoring of the appraised records, with specific reference to the technological context and its evolution.
This information has to be maintained in association with the records themselves. In large part (with the exception of the documentation on the technological context required in the course of the preservation process) it exists while the records are active and disappears when the records are removed from the active recordkeeping system. For this reason it should be collected automatically in the creation process (for instance in connection with the classification schemes), packaged with the records/documents themselves as soon as possible, and transferred to the preserver. For the same reason it is increasingly necessary, as all the projects have stressed, to implement tools, or better still meta-tools, able to:
translate the information and documentation created in the course of the production and management of digital resources into metadata;
transfer these metadata into a representation and technical environment that can be shared across domains and activities, by using and mapping metadata registry systems;
ensure the correct mapping of metadata registries to support interoperability, specifically among the various levels of business systems and the different management phases of the digital resources.
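As an illustration of the mapping step described above, the sketch below translates field names from one business system into the element names of a shared registry. It is an assumption-laden example: the `CROSSWALK` table, all field and element names, and the `to_registry` function are invented here and do not come from any particular registry or standard.

```python
# Hypothetical crosswalk from a business system's field names to the
# element names of a shared metadata registry. All names are invented.
CROSSWALK = {
    "doc_author": "creator",
    "created_on": "date_created",
    "file_plan_code": "business_function",
    "mime": "format",
}


def to_registry(record, crosswalk):
    """Return (mapped, unmapped).

    mapped uses registry element names; unmapped keeps any field with no
    crosswalk entry, so that no creation-time information is silently lost.
    """
    mapped, unmapped = {}, {}
    for name, value in record.items():
        if name in crosswalk:
            mapped[crosswalk[name]] = value
        else:
            unmapped[name] = value
    return mapped, unmapped


source = {
    "doc_author": "Finance Unit",
    "created_on": "2007-03-14",
    "file_plan_code": "FIN/02",
    "retention_hint": "7y",
}
mapped, unmapped = to_registry(source, CROSSWALK)
print(mapped)
print(unmapped)
```

Returning the unmapped fields separately, rather than discarding them, reflects the concern in the text that contextual information disappears once records leave the active system.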
The other key consideration to be explored in any investigation of the benefits of automating appraisal is the establishment of channels to facilitate input into decision‐making by other relevant communities. Whereas in the paper world this would have been extremely difficult to accomplish, in the digital environment the reverse is true. Not only can access to information under review be enabled beyond
the memory institution, but also current technologies allow for efficient data collection and analysis. Methodologies used will have to be tailored to suit the needs of specific communities and stakeholders.

R8.0.1 Strategies making use of web-enabled communication channels should be investigated to enable input from all community stakeholders in determining what information is significant and worth preserving.
8.1 Using Metadata and Genres to Determine Significance
The appraisal criteria identified in Table 1 were analysed and, where possible, broken down into specific questions about an item that could be answered at least partially using metadata16. To seasoned archivists these questions will no doubt appear trivial or simplistic. Our aim, however, is to investigate to what extent they could be automated. The answers to these questions provide essential information for the decision-making tools supporting appraisal in the digital repository workflow and thus help to minimize the human intervention required. The questions shown in the following table should therefore be regarded as an initial trial, for use in a future mapping study of existing digital object processing technologies that could be used for automated decision making.
16 The input from students of HATII’s 2007-8 Management, Creation, and Preservation of Digital Materials course into the identification of questions and metadata elements is gratefully acknowledged.
Table 2: Crosswalk showing relationship between appraisal criteria and questions that could be answered using metadata values

Category | Appraisal Criteria | Specific questions arising from this, that could be answered using metadata
Content | Comprehensiveness | -
Content | Coverage | What is the spatial area covered?
Content | Growth | Is there an ongoing record of modifications?
Content | Relationships | Are there any relationships to existing items? Have similar objects/records been ingested previously?
Content | Reliability | Was it created by an authoritative person/unit?
Content | Significance | What is it about? What genre is it?
Content | Time | What timeframe does it cover?
Content | Uniqueness | Is the item a duplicate? Is the information available in different media?
Content | Usability | Is the language comprehensible?
Contextual | Documentation | -
Contextual | Provenance | Is provenance appropriate?
Contextual | Significance | What business function/organization was the item created for?
Contextual | Usage | How often has it been accessed?
Evidence | Accountability | -
Evidence | Artefact | -
Evidence | Authenticity | Is it identifiable?
Evidence | Precedence | -
Operational | Collection | -
Operational | Costs | Are there special hardware requirements? Are there special software requirements? Are additional metadata elements required?
Operational | Mission | -
Operational | Potential | -
Operational | Replaceability | -
Societal | Ethics | -
Societal | Intrinsic | Is the creator significant? Was it created at a significant time?
Societal | Legal | Does content include personal information? Is creating agency subject to legislative requirements?
Societal | Representativeness | -
Technical | Functionality | Is it static or dynamic? Is it simple or complex? How sticky is the metadata?
Technical | Integrity | Has item been tampered with? Is it complete and uncorrupted? Are there security controls?
Technical | Rights | Can it be accessed? Can it continue to be accessed? Are there restrictions on who can access?
Technical | Risk | What format is it in?
Technical | Size | How big is it?
Technical | Usability | Can it be accessed?
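A crosswalk of this kind can also be held as a simple data structure, which makes it straightforward to list the criteria for which no metadata question has yet been formulated. The sketch below is illustrative only: it encodes just the Evidence and Operational rows of Table 2, and the `CROSSWALK` and `uncovered` names are hypothetical.

```python
# Excerpt of Table 2 (Evidence and Operational rows only), held as
# (category, criterion) -> questions answerable from metadata.
# An empty list means no such question has been formulated.
CROSSWALK = {
    ("Evidence", "Accountability"): [],
    ("Evidence", "Artefact"): [],
    ("Evidence", "Authenticity"): ["Is it identifiable?"],
    ("Evidence", "Precedence"): [],
    ("Operational", "Collection"): [],
    ("Operational", "Costs"): [
        "Are there special hardware requirements?",
        "Are there special software requirements?",
        "Are additional metadata elements required?",
    ],
    ("Operational", "Mission"): [],
    ("Operational", "Potential"): [],
    ("Operational", "Replaceability"): [],
}


def uncovered(crosswalk):
    """Return the criteria that have no metadata question attached."""
    return [criterion for criterion, questions in crosswalk.items() if not questions]


gaps = uncovered(CROSSWALK)
print(f"{len(gaps)} of {len(CROSSWALK)} criteria in this excerpt have no metadata question")
for category, criterion in gaps:
    print(f"  {category}: {criterion}")
```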
It can be seen from this table that it does not appear possible at this stage to fulfil all requirements for appraisal at item level by using metadata. More analysis is required to investigate this fully, and to determine whether alternative approaches using automation could address the gaps. At this stage a way forward would seem to be to envisage multiple models for the automation of appraisal (see 8.2 below).

R8.1.1 Further analysis of appraisal criteria is required in order to formulate appraisal rules. A wide range of input from practitioners and academics in different domains is required, so consideration should be given to a Delphi study and/or a series of focus groups.

Returning to those areas where questions answerable with metadata can be formulated (see Table 2), consideration was given to how significant the answers to those questions would be, given that superficially they appear very simplistic. One approach was to determine whether or not the answers would contribute to one or other of the two InterPARES ‘universal’ criteria: the authenticity of the object, and the feasibility of digital preservation for that object. The results of this last stage of analysis are shown in Table 3. This table introduces a further categorization, this time of metadata type. The categories applied are those defined in the DELOS Digital Library Reference Model (Candela et al., 2007, p.78):
Syntactic: Metadata that provides information about the syntax or structure of the information object. For example, creation date, size, file format.
Semantic: Metadata that provides information about the content of the information object. For example, name of creating agency, business function, keywords.
Contextual: Metadata that provides information not related to either the semantics or the syntax of the information object. For example, background information relating to the establishment of the creating agency of an information object, or digital rights restrictions.
Table 3: Metadata elements that could be used to assist in appraisal decision making

Category | Appraisal Question | Possible metadata element(s) | Metadata Category | Why is this important? | Comments
Content | What spatial area is covered? | Descriptive – place names | Semantic | Fit with acquisitions/collections policy | -
Content | Is there an ongoing record of modifications? | Audit log | Contextual | May indicate item is incomplete | -
Content | Are there any relationships to existing items? | Matter, agency, etc. | Semantic | Coherency of collection; finding aids to other records | This will be a question of matching content of metadata fields.
Content | Have similar objects/records been ingested before? | Agency, actors | Semantic | May influence determination of feasibility, build coherent body of information | Matching metadata elements may answer this question. Specification of which elements will vary according to domain and jurisdiction.
Content | Is the language comprehensible? | Character set | Semantic | May be an indication of usability, and also fit with collection | -
Content | What is it about? | Title, descriptors | Semantic | May indicate possibility of significant content | -
Content | What genre is it? | Document genre | Syntactic | May indicate possibility of significant content | -
Content | What timeframe does it cover? | Dates | Syntactic | Extent of time may be an indication of value | Value will vary according to domain, but this may be a key consideration.
Content | Is the item a duplicate? | Checksum | Syntactic | Impact on feasibility assessment | File synchronisation may also be required – see http://www.cis.upenn.edu/~bcpierce/unison/
Content | Is information available in different media? | Identifier elements, e.g. author, ISBN | Semantic | May indicate duplication, so will be influential in feasibility assessment | Appropriate elements will vary according to domain and type of information.
Contextual | Is provenance appropriate? | Publisher/creator or other indicator of place such as url | Semantic | Fit with acquisitions/collections policy | -
Contextual | What business function/organization was the item created for? | Actors, agency | Semantic | May provide an indicator of likely value especially if functional approach is used | -
Contextual | How often has it been accessed? | Audit log | Contextual | May be a weighting factor in cost benefit analysis | To be used with care – may not be significant at all in certain domains, e.g. recordkeeping.
Evidence | Is it identifiable? | Actors, dates, matter, etc. | Syntactic & semantic | Contributes to assessment of authenticity | The existence of certain metadata elements can be used to answer this question. The specification of which elements will vary according to domain and jurisdiction.
Operational | Are there special hardware requirements? | File format | Syntactic | Influential in assessing feasibility of preservation | -
Operational | Are there special software requirements? | File format | Syntactic | Influential in assessing feasibility of preservation | -
Operational | Are additional metadata elements required? | All | Semantic, syntactic, contextual | Influential in assessing feasibility of preservation | There are likely to be substantial cost implications if metadata has to be manually assigned.
Societal | Is the creator significant? | Creator | Contextual | May be an indicator of intrinsic value | Particularly applicable to born-digital artworks.
Societal | Was it created at a significant time? | Dates | Contextual | May be an indicator of intrinsic value | For instance – an organisation’s first website; early examples of born-digital art.
Societal | Does content include personal information? | Descriptive – personal names | Semantic | May be legal barriers to retention or accessibility | -
Societal | Is creating agency subject to legislative requirements? | Agency | Semantic | May be legal requirements to retain and make accessible | -
Technical | Is it static or dynamic? | File format | Syntactic | May influence determination of feasibility of preservation | Format will provide an indicator that may partially answer this question, but further analysis may also be required in order to discover embedded formats, e.g. spreadsheet in word document.
Technical | Is item simple or complex? | File format | Syntactic | Impact on feasibility assessment | Format will provide an indicator that may partially answer this question, but further analysis may also be required in order to discover embedded formats.
Technical | How sticky is the metadata? | All | Syntactic | Impact on feasibility assessment | -
Technical | Has it been tampered with? | Audit log; electronic signature/seal | Syntactic | Contribute to assessment of authenticity | -
Technical | Is it complete and uncorrupted? | Fixity check | Syntactic | Contribute to assessment of authenticity and determine feasibility of preservation | -
Technical | Can it be accessed? | Rights, permissions, etc.; format type | Contextual & syntactic | Will influence determination of feasibility of preservation | -
Technical | Can it continue to be accessed? | Rights | Contextual | Impact on feasibility assessment | If restrictions apply, now or in the future, it may not be worth preserving.
Technical | Are there restrictions on who can access? | Rights | Contextual | Impact on feasibility assessment | -
Technical | What format is it in? | Format type | Syntactic | May influence determination of feasibility, and prioritisation | -
Technical | How big is it? | File size | Syntactic | May influence determination of feasibility of preservation | -
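Two of the questions in Table 3 ("Is the item a duplicate?" and "Is it complete and uncorrupted?") rest on checksums, which lend themselves readily to automation. The following is a minimal sketch under stated assumptions: SHA-256 is used as one possible checksum algorithm, content is held in memory as bytes, and the function names and sample holdings are hypothetical.

```python
import hashlib
from collections import defaultdict


def sha256_of(data):
    """Return the SHA-256 digest of the content as a hexadecimal string."""
    return hashlib.sha256(data).hexdigest()


def find_duplicates(items):
    """items maps identifier -> content bytes.

    Returns a list of groups, each group being the sorted identifiers
    whose content has an identical checksum.
    """
    by_digest = defaultdict(list)
    for identifier, data in items.items():
        by_digest[sha256_of(data)].append(identifier)
    return [sorted(ids) for ids in by_digest.values() if len(ids) > 1]


def fixity_ok(data, recorded_digest):
    """Compare current content against a digest recorded earlier."""
    return sha256_of(data) == recorded_digest


holdings = {
    "a.txt": b"minutes of meeting",
    "copy-of-a.txt": b"minutes of meeting",
    "b.txt": b"agenda",
}
print(find_duplicates(holdings))

recorded = sha256_of(holdings["b.txt"])
print(fixity_ok(b"agenda", recorded), fixity_ok(b"agenda (edited)", recorded))
```

A checksum match identifies only bit-identical copies; the same information held in a different format or medium would need the identifier-based matching noted in the table.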
The applicability and relative significance of the factors listed in Table 3 will vary according to domain and to the policy of the specific collecting body. It is suggested therefore that weightings should be assigned to each factor, appropriate to the organisation concerned. Genres should also be considered as another factor in determining weightings.
R8.1.1 A ranking system for appraisal factors must be developed. The system should be flexible enough to allow customization for different organizational settings and to take into account the purpose for which the information is being managed
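One way to picture such a configurable ranking system is as a weighted score whose weights differ by organisational setting. The sketch below is purely illustrative: the factor names, the two settings and every weight value are invented for the example, and a real scheme would need to be derived from the further analysis recommended above.

```python
# Hypothetical factor weights for two settings; all values are invented.
# Usage is weighted at zero for the records office, echoing the caution in
# Table 3 that access frequency may not be significant in recordkeeping.
WEIGHTS = {
    "records_office": {"authenticity": 0.5, "feasibility": 0.3, "usage": 0.0, "uniqueness": 0.2},
    "digital_library": {"authenticity": 0.2, "feasibility": 0.3, "usage": 0.3, "uniqueness": 0.2},
}


def significance(scores, weights):
    """Weighted mean of factor scores in [0, 1]; missing factors count as 0."""
    total = sum(weights.values())
    if total == 0:
        raise ValueError("weights must not all be zero")
    return sum(w * scores.get(name, 0.0) for name, w in weights.items()) / total


item_scores = {"authenticity": 1.0, "feasibility": 0.5, "usage": 0.8, "uniqueness": 0.0}
for setting, weights in WEIGHTS.items():
    print(setting, round(significance(item_scores, weights), 2))
```

The same item scores differently in the two settings, which is the behaviour the recommendation asks for; genre could be introduced as a further key selecting the weight set.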
8.1.1 Genres
The concept of genre is an important one for information communities, and is very significant for digital libraries in particular. A genre can be broadly defined as a socially recognised communication norm, and examples of genres encompass the whole gamut of communication from text messages to scholarly publications. Possible benefits of applying the genre concept to appraisal include a preliminary categorisation by document type, which could facilitate metadata extraction and/or the assignment of weightings to various metadata elements. Other benefits for digital libraries/archives include enriched description and understanding of the digital object itself and of the context of its creation and use, and ultimately improved access to information.

There is a lack of consensus in the literature on the definition of genre (Kim & Ross, 2007). Attempts to provide a universal classification, however, are likely to result in an overly simplistic approach to genre definition. Awareness of the different purposes for which information is being managed will enable a much richer and more useful definition of genre. It is suggested, therefore, that at least two analytical frameworks should be used: one for managing information for accountability, and the other for managing information for awareness or entertainment.

Attempts to classify genres range from a simple categorisation to a much more complex multi-faceted approach. Proponents of a multi-faceted approach point to the difficulty of assigning a single genre to some digital documents (Santini, 2007) or to the need to incorporate contextual information (Crowston & Kwasnik, 2003; Yoshioka, Herman, Yates, & Orlikowski, 2001). Appropriate classification is a critical first step in any experimental attempt to automate identification: if an appropriate framework is not used, much valuable research runs the risk of being dismissed as overly simplistic or reductionist.
8.1.1.1 Information for Accountability
Wanda Orlikowski, Joanne Yates and colleagues at MIT have undertaken a number of studies of genres from a structurational perspective, in other words, viewing genres as socially recognised communicative transactions that, as they are enacted over time, also become organising structures and templates for behaviour. See, for example, their analysis of business presentations and the use of PowerPoint (Yates & Orlikowski, 2007). Their analysis is potentially very useful for the recordkeeping community for two main reasons: firstly, its acknowledgement of the importance of context; secondly, its development of the notion of a genre system. For the recordkeeping community, the primary purpose of the information stored in a digital archive will be to serve as evidence for accountability purposes, so the context of the genre is of critical concern. A proposed multi-dimensional taxonomy for organisational genres (Yoshioka et al., 2001) may go some way towards addressing this issue. This taxonomy is based on the analysis of the following six genre dimensions:
The purpose (why)
The content (what)
The timing (when)
The location (where)
The participants (who)
The structure and media (how)
In reference to the interrogative pronouns, this taxonomy is referred to as “5W1H” (Yoshioka et al., 2001). It has been suggested that these dimensions can be used as a framework for gathering genre‐based metadata (Honkaranta, 2003b). The 5W1H taxonomy encompasses both genres and genre systems. The concept of genre system is very important, as it ensures consideration of documents as components of a communicative action, and not solely as discrete objects. For instance, a meeting genre system might comprise an invitation to attend, an agenda and minutes (Osterlund, 2007; Yates & Orlikowski, 2007). This genre system is akin to the notion of aggregation of records, so is extremely useful to retain in any consideration of the application of genre theory to digital archives.
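To make the idea of 5W1H-based metadata and genre systems concrete, the sketch below models a genre as a record with the six dimensions and a genre system as an ordered group of genres. The class names, field values and the `missing_dimensions` helper are hypothetical illustrations; only the six dimensions and the meeting example (invitation, agenda, minutes) come from the text above.

```python
from dataclasses import dataclass, field
from typing import List

# The six 5W1H dimensions named in the taxonomy.
DIMENSIONS = ("why", "what", "when", "where", "who", "how")


@dataclass
class Genre:
    """Genre-based metadata for one document type (illustrative structure)."""
    name: str
    why: str = ""    # purpose
    what: str = ""   # content
    when: str = ""   # timing
    where: str = ""  # location
    who: str = ""    # participants
    how: str = ""    # structure and media

    def missing_dimensions(self) -> List[str]:
        """List the dimensions for which no metadata has been gathered yet."""
        return [d for d in DIMENSIONS if not getattr(self, d)]


@dataclass
class GenreSystem:
    """A group of genres that together make up one communicative action."""
    name: str
    genres: List[Genre] = field(default_factory=list)

    def genre_names(self) -> List[str]:
        return [g.name for g in self.genres]


# The meeting genre system from the text; the field values are invented.
meeting = GenreSystem(
    name="meeting",
    genres=[
        Genre("invitation", why="convene participants", who="chair, invitees"),
        Genre("agenda", why="structure the discussion", what="items for discussion"),
        Genre("minutes", why="record decisions", what="decisions and actions",
              when="after the meeting", where="shared drive", who="secretary",
              how="text document"),
    ],
)
print(meeting.genre_names())
print(meeting.genres[0].missing_dimensions())
```

Keeping the documents grouped under a `GenreSystem` mirrors the aggregation of records: an appraisal decision can then be taken for the meeting as a whole rather than for each document in isolation.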
8.1.1.2 Information for Awareness/Entertainment
Research being undertaken at the School of Information Studies, Syracuse University, explores the utility of genre in assisting access to information in digital collections. These researchers argue that information about the genre, as well as the subject, of a document helps to improve the precision of searches (Crowston & Kwasnik, 2003; Kwasnik, Crowston, Nilan, & Roussinov, 2001). This research is very much grounded in a library perspective, as is clear from the examples used and from the historical overview of genre usage in tools such as the Dewey Decimal Classification and Library of Congress Subject Headings. There is clear recognition of the need for a multi-faceted classification to recognise both the form and the function of a genre, as well as “the numerous clues and components that allow us to discriminate one genre from another” (Crowston & Kwasnik, 2003, p. 356). A further publication on the taxonomy is currently being finalised (see note 17).
It is worth noting that both communities emphasise the importance of user involvement in genre identification (Crowston & Kwasnik, 2003; Honkaranta, 2003a, 2003b).
R8.1.2 Further investigation of the genre, and genre system concept should be undertaken, with a view to determining appropriate taxonomies for the different information domains.
R8.1.3 A ranking system for genres should be developed in conjunction with taxonomies. The system should be flexible enough to allow customization for different organizational settings and to take into account the purpose for which the information is being managed.
R8.1.4 The potential for using genre to enhance archival description and cataloguing of publications should be investigated.
8.2 Models of Automation
As a result of the analysis of the approaches to appraisal discussed in section 7.0, we identified a series of appraisal criteria and structured these so that they can be represented as appraisal rules. Rules are susceptible to representation as active knowledge components. This representation then suggests three models of automation:
Hybrid: This model would use technology to carry out specific tasks, but within an overarching top-down appraisal strategy requiring human decision-making, or automated application of a retention and disposal schedule. For instance, application of functional appraisal methodology supplemented by subsequent automated triage to determine the feasibility of preservation at the item level.
Appraisal engine: Where a document is submitted to an appraisal engine for analysis using a combination of text mining and rule-based reasoning.
Profiler: The development of a prototype to review a variety of information object types (image, document, dataset for example) and apply appraisal rules, probably again using rule-based reasoning methodologies.
17 Email from Kevin Crowston, 23 Nov 07
A wholly automated approach to appraisal can at this stage only be envisaged where a top‐down appraisal strategy is not required, i.e. when managing information for awareness and/or entertainment, rather than information for evidential purposes. However, Maria Esteva’s concept of a natural electronic archive, and appraisal using social networking analysis and text mining is an interesting initiative that will be worth monitoring for further development (Esteva, 2007). R8.2.1 Three models of automation are identified for further investigation:
Hybrid: A combination of manual and automated decision making. For instance, application of functional appraisal methodology supplemented by subsequent automated triage to determine the feasibility of preservation at the item level.
Appraisal Engine: Where a text document is submitted to an appraisal engine for analysis using a combination of text mining and rule‐based reasoning.
Profiler: The development of a prototype to review a variety of information object types (image, document, dataset for example) and apply appraisal rules.
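As an illustration of the hybrid model, the sketch below first applies a human-made, function-level decision (a retention and disposal schedule) and only then runs automated item-level triage rules on items marked for retention. Everything in it is a hypothetical assumption made for the example: the `AppraisalRule` structure, the rule conditions, the format list, the schedule entries and the outcome labels are not defined in this report.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class AppraisalRule:
    """An appraisal criterion expressed as an executable rule."""
    name: str
    condition: Callable[[Dict], bool]  # True when the rule fires for an item
    outcome: str                       # decision returned when it fires


# Hypothetical item-level triage rules about preservation feasibility.
TRIAGE_RULES: List[AppraisalRule] = [
    AppraisalRule(
        "format not supported by the repository",
        lambda item: item.get("format") not in {"pdf", "txt", "tiff"},
        "review",
    ),
    AppraisalRule(
        "integrity check failed",
        lambda item: not item.get("checksum_ok", False),
        "review",
    ),
]


def appraise(item: Dict, schedule: Dict[str, str], rules=TRIAGE_RULES) -> str:
    """Combine a top-down schedule decision with automated triage."""
    # Step 1: top-down decision made by people, keyed by business function.
    decision = schedule.get(item.get("function"))
    if decision is None:
        return "review"      # no human decision exists: refer to an archivist
    if decision != "retain":
        return decision      # e.g. "dispose" under the schedule
    # Step 2: automated triage of items the schedule says to retain.
    for rule in rules:
        if rule.condition(item):
            return rule.outcome
    return "retain"


schedule = {"governance": "retain", "catering": "dispose"}
print(appraise({"function": "governance", "format": "pdf", "checksum_ok": True}, schedule))
print(appraise({"function": "governance", "format": "wpd", "checksum_ok": True}, schedule))
```

The automated step never overrides the schedule; it only flags retained items whose preservation may not be feasible, so that a person makes the final call. An appraisal engine or profiler would replace the simple conditions with text mining or type-specific analysis.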
9.0 Summary of Recommendations
Specific recommendations to further develop automated appraisal (or re-appraisal) are listed below. Discussion of these recommendations is provided in the body of the text; the recommendation number is a guide to location within the report.
R2.0.1 A glossary should be developed of terminology relating to the entities and processes associated with determining the significance of information. Definitions should be acceptable from the perspective of all information management occupations.
R2.1.1 Technological solutions to determining the significance of information must take into account ideological, political, economic, cultural and social factors.
R2.2.1 Appraisal methodologies must be “fit for purpose”, i.e., take into account the purpose(s) for which information is being managed: accountability, awareness and/or entertainment.
R2.3.1 Item-level appraisal should be considered as a tool to be used in the context of an appropriate theoretical framework, and does not imply the destruction of contextual relationships.
R3.1.1 Appraisal may take place prior to ingest, on ingest and/or as part of Preservation Planning functionality.
R3.3.1 Records created and maintained in accordance with ISO 15489 and ISO 23081 may not require appraisal at item level.
R5.3.1 Metadata showing relationships and levels of aggregation of records should be used to automatically generate description for archival repositories.
R7.0.1 Further analysis should take place of policies, guidelines and reports of practice to determine a comprehensive database of criteria used in appraisal of digital objects, linked not only to domain of practice but also to sector of activity and country.
R8.0.1 Strategies making use of web-enabled communication channels should be investigated to enable input from all community stakeholders in determining what information is significant and worth preserving.
R8.1.1 Further analysis of appraisal criteria is required in order to formulate appraisal rules. A wide range of input from practitioners and academics in different domains is required, so consideration should be given to a Delphi study and/or a series of focus groups.
R8.1.2 Further investigation of the genre, and genre system concept should be undertaken, with a view to determining appropriate taxonomies for the different information domains.
R8.1.3 A ranking system for genres should be developed in conjunction with taxonomies. The system should be flexible enough to allow customization for different organizational settings and to take into account the purpose for which the information is being managed.
R8.1.4 The potential for using genre to enhance archival description and cataloguing of publications should be investigated.
R8.2.1 Three models of automation are identified for further investigation:
Hybrid: A combination of manual and automated decision making. For instance, application of functional appraisal methodology supplemented by subsequent automated triage to determine the feasibility of preservation at the item level.
Appraisal Engine: Where a text document is submitted to an appraisal engine for analysis using a combination of text mining and rule‐based reasoning.
Profiler: The development of a prototype to review a variety of information object types (image, document, dataset for example) and apply appraisal rules.
10.0 Conclusions
In the digital environment, appraisal is an activity at risk; in reality, the lack of active appraisal puts preservation itself at risk. The basic requirements identified here include:
early intervention, within the design of resource creation (within the recordkeeping system, in the case of records)
neutrality of principles and procedures, as the guarantee of appraisal's role in supporting the right to research
the capacity to ensure that corporate memory is a significant (contextual) memory for the creator and for the social community.
We conclude that appraisal, the determination of the worth of preserving information, continues to be significant in the digital environment. Furthermore, the concept is applicable beyond the recordkeeping domain that initiated it. A number of strategies for undertaking appraisal have been identified, any one of which, or a combination of which, may be appropriate to a specific information community or domain. In considering the automation of the appraisal function in the context of a digital library or archive, the focus is likely to be on the assessment of individual items. The results of this assessment will contribute to the overall appraisal determination.
Our analysis of the approaches to appraisal resulted in the identification of a series of appraisal criteria, which have been structured so that they can be represented as appraisal rules. Rules are susceptible to representation as active knowledge components. In considering the next steps, this representation suggests three models of automation: hybrid, appraisal engine and profiler.
Research underway on the automation of metadata extraction in conjunction with genre identification, together with the structurational view of genres, shows a great deal of promise for the digital archives community. In addition, the technological possibilities now available for bringing other voices into the selection of information that has value for communities open up a way forward to a new information age, one that need no longer be exclusively defined by dominant societal forces.
11.0 References
Anderson, R., Frost, H., Hoebelheinrich, N., & Johnson, K. (2005). The AIHT at Stanford University: Automated Preservation Assessment of Heterogeneous Digital Collections. D-Lib Magazine, 11(12). http://www.dlib.org/dlib/december05/johnson/12johnson.html
Bauer, G. P. (1946). The appraisal of current and recent records. Staff Information Circulars, 13, 2.
Bearman, D. (1989). Archival methods. Archives and Museum Informatics Technical Report.
Bearman, D. (1995). Archival strategies. American Archivist, 58(Fall), 380-413.
Bearman, D. (2005). Addressing selection and digital preservation as systemic problems. Paper presented at the Preserving the digital heritage: principles and policies, The Hague. http://www.unesco.nl/images/preserving_the_digital_heritage.pdf
Boles, F., & Young, J. M. (1985). Exploring the black box: The appraisal of university administrative records. American Archivist, 48(2), 121-140.
Burke, J. (1998). Renovating Conspectus for the digital era: applied at Queensland University of Technology. Paper presented at the 9th Biennial VALA, Melbourne. http://www.nla.gov.au/libraries/hosted/embracin.html
Candela, L., Castelli, D., Ferro, N., Ioannidis, Y., Koutrika, G., Meghini, C., et al. (2007). The Digital Library Reference Model: Foundations for Digital Libraries. Pisa: DELOS. http://www.delos.info/files/pdf/ReferenceModel
Cobb, J., Pearce-Moses, R., & Surface, T. (2005, April 26). ECHO Depository Project. Paper presented at the IS&T Archiving Conference, Washington, DC. http://www.ndiipp.uiuc.edu/pdfs/IST2005paper_final.pdf
Committee on Appraisal. (2003). Manual on appraisal [draft]: International Council on Archives. http://www.ica.org/en/node/30417
Consultative Committee for Space Data Systems. (2004). Producer-Archive Interface Methodology Abstract Standard. Washington, DC: NASA. http://public.ccsds.org/publications/archive/651x0b1.pdf
Cook, T. (1992). Documentation strategy. Archivaria (34), 181-191.
Crowston, K., & Kwasnik, B. H. (2003). Can document-genre metadata improve information access to large digital collections? Library Trends, 52(2), 345-361.
Cunningham, A., & Oswald, R. (2005). Some functions are more equal than others: the development of a macroappraisal strategy for the National Archives of Australia. Archival Science, 5, 163-184.
Data Preservation Alliance for the Social Sciences (DataPASS). Appraisal guidelines. http://www.icpsr.umich.edu/DATAPASS/pdf/appraisal.pdf
Day, M., Pennock, M., & Allinson, J. (2007). Co-operation for digital preservation and curation: collaboration for collection development in institutional repository networks. Paper presented at the DigCCurr2007: An International Symposium in Digital Curation, Chapel Hill, NC. http://www.ils.unc.edu/digccurr2007/papers/dayPennock_paper_93.pdf
Dorner, D. G. (2004). The impact of digital information resources on the roles of collection managers in research libraries. Library Collections, Acquisitions, & Technical Services, 28, 249-274.
Duranti, L. (1994). The concept of appraisal in archival science. American Archivist, 57, 328-344.
Eastwood, T. (1993). How goes it with appraisal? Archivaria (36), 111-121.
Eastwood, T. (2003). What archivists have learned about appraisal of digital records. Paper presented at the International Workshop on the selection, appraisal and retention of digital scientific data, Lisbon, Portugal. http://www.erpanet.org/events/2003/lisbon/presentations/Terry%20Eastwood%20paper.pdf
EROS. (1999). Inventory, appraisal and disposal. In Guidelines for management, appraisal and preservation of electronic records (2nd ed., Vol. 2: Procedures): The National Archives. http://www.nationalarchives.gov.uk/electronicrecords/advice/guidelines.htm
Esteva, M. (2007). Bits and pieces of text: appraisal of a natural electronic archive. Paper presented at the Digital Humanities 2007. from http://www.digitalhumanities.org/dh2007/abstracts/xhtml.xq?id=136
Evans, J., McKemmish, S., & Bhoday, K. (2005). Create once, use many times: The clever use of recordkeeping metadata for multiple archival purposes. Archival Science, 5, 17-42.
Gilliland-Swetland, A. (1995). Development of an expert assistant for archival appraisal of electronic communications: An exploratory study. Unpublished Ph.D., University of Michigan.
Grotke, A., & Ruth, J. E. (2007). Selecting and managing content captured from the web: Expanding curatorial expertise and skills in building Library of Congress web archives. Paper presented at the DigCCurr2007: An International Symposium in Digital Curation, Chapel Hill, NC. http://www.ils.unc.edu/digccurr2007/papers/grotkeRuth_paper_9-3.pdf
Honkaranta, A. (2003a, April 23-26). Developing Document and Content Management in Enterprises Using a "Genre Lens". Paper presented at the Proceedings of the 5th International Conference on Enterprise Information Systems, Angers, France. http://www.cc.jyu.fi/~ankarjal/ICEIS2003_GenreDM.pdf
Honkaranta, A. (2003b, June 16-17). Evaluating the 'genre lens' for analyzing requirements for content assembly. Paper presented at the Eighth CAiSE/IFIP8.1 International Workshop on Evaluation of Modeling Methods in Systems Analysis and Design (EMMSAD '03), Velden, Austria. http://www.ad.jyu.fi/users/a/ankarjal/EMMSAD2003.pdf
International Organization for Standardization. (2001). Information and documentation - Records Management - Part 1: General (No. ISO 15489-1:2001). Geneva: ISO.
International Organization for Standardization. (2003). Space data and information transfer systems -- Open archival information system -- Reference model (No. ISO 14721:2003). Geneva: ISO.
International Organization for Standardization. (2006). Information and documentation - Records management processes - Metadata for records. Part 1: Principles (No. ISO 23081-1:2006). Geneva: ISO.
InterPARES. (2000a). Appendix 3 Appraisal of electronic records: A review of the literature in English. In The long-term preservation of authentic electronic records: Findings of the InterPares project. http://www.interpares.org/book/interpares_book_l_app03.pdf
InterPARES. (2000b). Appraisal Task Force Report. In The long-term preservation of authentic electronic records: Findings of the InterPares project: University of British Columbia. http://www.interpares.org/book/interpares_book_e_part2.pdf
InterPARES. (2002). Appendix 2 Requirements for assessing and maintaining the authenticity of electronic records. In The long-term preservation of authentic electronic records: Findings of the InterPares project. http://www.interpares.org/book/interpares_book_k_app02.pdf
JISC. (2007). e-Journals: Archiving and Preservation Briefing paper. from http://www.jisc.ac.uk/publications/publications/pub_ejournalspreservationbp.aspx
Jonker, A. E. M. (2005). Macroappraisal in the Netherlands. The first ten years, 1991-2001, and beyond. Archival Science, 5, 203-218.
Kim, Y., & Ross, S. (2006). Genre classification in automated ingest and appraisal metadata. Paper presented at the European Conference on Research and Advanced Technology for Digital Libraries (ECDL), Alicante, Spain. http://eprints.erpanet.org/110/
Kim, Y., & Ross, S. (2007). "The naming of cats": Automated genre classification. International Journal of Digital Curation, 2(1), 49-62. http://www.ijdc.net/ijdc/article/view/24/27
Kretschmar, R. (2005). Archival appraisal in Germany: A decade of theory, strategies, and practices. Archival Science, 5, 219-238.
Kwasnik, B. H., Crowston, K., Nilan, M., & Roussinov, D. (2001). Identifying document genre to improve web search effectiveness. Bulletin of the American Society for Information Science and Technology, 23-26.
Lala, V., & Joe, S. (2006). Web archiving at the National Library of New Zealand. Paper presented at the LIANZA, Wellington. http://www.lianza.org.nz/library/files/store_013/WebArchives_VLala.pdf
Lloyd, A. (2007). Guarding against collective amnesia? Making significance problematic: An exploration of issues. Library Trends, 56(1), 53-65.
Morris, S. (2006, March 27). Identification, selection and appraisal within the North Carolina Geospatial Data Archiving Project (NCGDAP). Paper presented at the Digital Preservation in the State Government: Best Practices Exchange. http://www.lib.ncsu.edu/ncgdap/presentations/StateArchIDSelectionfinal.ppt
Murray, K., & Hsieh, I. K. (2006). Collection Planning Guidelines. from http://web3.unt.edu/webatrisk/reports/cpg_final_31may2006.pdf
Murray, K., & Phillips, M. (2007). Collaborations, best practices, and collection development for born-digital and digitized materials. Paper presented at the DigCCurr2007: An International Symposium in Digital Curation, Chapel Hill, NC. http://www.ils.unc.edu/digccurr2007/papers/murrayPhillips_paper_9-3.pdf
National Archives and Records Administration. (2007). Strategic directions: appraisal policy. from http://www.archives.gov/records-mgmt/initiatives/appraisal.html
National Library of Australia. (2005). Online Australian publications: selection guidelines for archiving and preservation by the National Library of Australia. from http://pandora.nla.gov.au/selectionguidelines.html
Neumayer, R., & Rauber, A. (2007). Why appraisal is not 'utterly' useless and why it's not the way to go either: A provocative position paper: Digital Preservation Europe. http://www.digitalpreservationeurope.eu/publications/position/appraisal_final.pdf
OCLC. (2007). Creating the conspectus. from http://www.oclc.org/programs/ourwork/past/conspectus.htm
Osterlund, C. (2007). Genre combinations: A window into dynamic communication practices. Journal of Management Information Systems, 23(4), 81-108.
Pearce-Moses, R., & Kaczmarek, J. (2005). An Arizona model for preservation and access of web documents. DttP: Documents to the People, 33(1), 17-24. www.ndiipp.uiuc.edu/pdfs/azmodel.pdf
Pinsent, E., & Ashley, K. (2006). Digital Asset Assessment Tool (DAAT) project. London: University of London Computer Centre. http://www.jisc.ac.uk/publications/publications/pub_ejournalspreservationbp.aspx
PREMIS Working Group. (2004). Implementing preservation repositories for digital materials. Mountain View, CA. www.oclc.org/research/projects/pmwg/surveyreport.pdf
Pymm, B. (2006). Building collections for all time: the issue of significance. Australian Academic and Research Libraries (AARL), 37(1), 61-73.
Roberts, J. (2005). Macroappraisal Kiwi style: Reflections on the impact and future of macroappraisal in New Zealand. Archival Science, 5, 185-201.
Ross, S. (2003). Digital Library Development Review. Wellington: National Library of New Zealand. http://www.natlib.govt.nz/catalogues/library-documents/digitallibrary-development-review/?searchterm=ross&body_language=
Santini, M. (2007). Characterizing genres of web pages: Genre hybridism and individualization. Paper presented at the Proceedings of the 40th Hawaii International Conference on System Sciences. http://csdl2.computer.org/comp/proceedings/hicss/2007/2755/00/27550071.pdf
Schauder, D., Stillman, L., & Johanson, G. (2005). Sustaining a community network: the information continuum, e-democracy and the case of VICNET. Journal of Community Informatics, 1(2). http://www.cijournal.net/index.php/ciej/article/view/239/203
Schellenberg, T. R. (2003). Modern Archives: Principles and Techniques. Chicago: Society of American Archivists.
Shilton, K., & Srinivasan, R. (2007). Participatory appraisal and arrangement for multicultural archival collections. Archivaria (63), 87-101.
State Records Authority of New South Wales. (2007). The DIRKS manual - strategies for documenting government business. rev., from http://www.records.nsw.gov.au/recordkeeping/dirks-manual_4226.asp
The National Archives. (2004). Appraisal policy. from http://www.nationalarchives.gov.uk/recordsmanagement/selection/appraisal.htm
Thomas, S. (2007). Paradigm: A practical approach to the preservation of personal digital archives. Oxford. http://www.paradigm.ac.uk/projectdocs/jiscreports/ParadigmFinalReportv1.pdf
Treloar, A., Groenewegen, D., & Harboe-Ree, C. (2007). The data curation continuum: Managing data objects in institutional repositories. D-Lib Magazine, 13(9/10). http://www.dlib.org/dlib/september07/treloar/09treloar.html
Underwood, W., Isbell, S., & Underwood, M. (2007). Grammatical induction and recognition of the documentary form of records. Paper presented at the DigCCurr2007, Chapel Hill, NC. http://www.ils.unc.edu/digccurr2007/papers/underwood_paper_4-5.pdf
United States Geological Survey. (2007). Records appraisal tool. from http://eros.usgs.gov/government/ratool/view_questions.php
Verheul, I. (2006). Networking for digital preservation: Current practice in 15 national libraries. Muenchen: Saur. http://www.ifla.org/VI/7/pub/IFLAPublication-No119.pdf
Yates, J., & Orlikowski, W. (2007). The PowerPoint Presentation and its corollaries: How genres shape communicative action in organizations. In M. Zachry & C. Thralls (Eds.), Communicative Practices in Workplaces and the Professions: Cultural Perspectives on the Regulation of Discourse and Organizations. Amityville, NY: Baywood Publishing.
Yoshioka, T., Herman, G., Yates, J., & Orlikowski, W. (2001). Genre taxonomy: A knowledge repository of communicative actions. ACM Transactions on Information Systems, 19(4), 431-456.
Appendix 1: Summary of Findings from InterPARES
Archival appraisal can be considered a type of preservation function for digital records. General principles are as follows (see note 18).
In the archival sector, appraisal is part of a selection process made up of specific activities (selection, appraisal, and disposition as destruction or preservation). Appraisal should be conducted on the basis of well-defined principles and criteria, as further developed with reference to the non-electronic environment. Specifically, appraisal should be carried out while the digital resources are still in their active phase, as near to the time of creation as possible.
The management of the appraisal function implies the use and maintenance of a large amount of information, including the decisions taken in the past (with reference to the various responsibilities involved and the strategies and procedures developed), the contextual information related to the records (the juridical, documentary and technological contexts), and the values established for the records and for the feasibility of their preservation (in terms of cost and in terms of preserving the authenticity of the records).
The feasibility of preserving records depends strictly on the capacity to preserve their essential digital components, those able, now and in the future, to confer identity on the records and to ensure their integrity. This information (which includes content and the data/metadata necessary to organise, structure or render the content of the records) has to be structured and articulated so as to enable decisions about the present and future capacity to preserve the digital components which constitute the record's identity and ensure its integrity. This effort includes at least three phases:
determine which elements are able to make the authenticity presumable
identify where these crucial elements are manifested (in which digital components) and what technical information is relevant for their preservation
reconcile these preservation requirements with the financial and technical capacities of the repository
18 The assumptions included in the following paragraphs are basically a synthesis of the main results of the Appraisal Task Force of the InterPARES project. They could be considered a common conceptual framework in the international archival community. See National and multinational team report. Italian research team report, in The long-term preservation of authentic electronic records: findings of the InterPARES project, Luciana Duranti editor, San Miniato (PI), Archilab, 2005. This paragraph also considers and integrates the evolution of the Monash Clever Recordkeeping Metadata (CRKM) project as presented by S. McKemmish, J. Evans, K. Bhoday, Create Once, Use Many Times: the Clever Use of Recordkeeping Metadata for Multiple Archival Purposes, at the International Conference of ICA in Vienna, 23-29 August 2004, in the session: Smart Metadata and the Archives of the Future.
As the flexibility required in the preservation process clearly shows, appraisal is an active component of this process and carries a higher level of responsibility than in the past. The quality of preservation is closely tied to the quality of early appraisal. The more complex and rich the digital data to be preserved (as in the scientific world), the more relevant is the active appraisal described here, which includes crucial tools whose automation (see note 19) will ensure the success of the preservation itself:
criteria and policies able to orient a neutral approach,
auditing and validating procedures
contextual information automatically extracted and preserved.
19 See http://eros.usgs.gov/government/RAT/tool.asp, where the USGS Scientific Records Appraisal Tool is described (see Codata-ERPANET workshop, The selection, appraisal and retention of digital scientific data… cit., p. 13).
Appendix 2: Source Documents Used to Provide Initial List of Criteria
1. InterPARES. (2000). Appraisal Task Force Report. In The long-term preservation of authentic electronic records: Findings of the InterPares project: University of British Columbia. http://www.interpares.org/book/interpares_book_e_part2.pdf
This summarises the overall findings on appraisal from InterPARES 1, detailing requirements to assess authenticity. InterPARES findings are applicable to records of all types. This source was used as it is so significant and influential, and referred to throughout our report.
2. Eastwood, T. (2003). What archivists have learned about appraisal of digital records. Paper presented at the International Workshop on the selection, appraisal and retention of digital scientific data, Lisbon, Portugal. http://www.erpanet.org/events/2003/lisbon/presentations/Terry%20Eastwood%20paper.pdf
Terry Eastwood was chair of the InterPARES Appraisal Task Force. This paper discusses the InterPARES findings, and is particularly relevant to our report in that consideration is given to applying those findings outside the archival domain, to scientific data.
3. Data Preservation Alliance for the Social Sciences (DataPASS). Appraisal guidelines. http://www.icpsr.umich.edu/DATAPASS/pdf/appraisal.pdf
DataPASS is a major US collaborative project with partners from the academic sector and the National Archives and Records Administration, supported by the Library of Congress. Activities involve surveying important research in the social sciences, as well as other sources of information about potential acquisitions, and identifying content that should be preserved, including public and private sources of data. Appraisal standards have been developed to guide this process. This source was utilised because it focuses specifically on a significant and specialised type of data.
4. National Archives and Records Administration. (2007). Strategic directions: appraisal policy. http://www.archives.gov/records-mgmt/initiatives/appraisal.html
The NARA appraisal policy is very clearly written and provides useful explanations for appraisal criteria. One of the key factors influencing its use as a source for our report is its currency: the policy was published in September 2007.
5. Thomas, S. (2007). Paradigm: A practical approach to the preservation of personal digital archives. Oxford. http://www.paradigm.ac.uk/projectdocs/jiscreports/ParadigmFinalReportv1.pdf
The appraisal criteria used in the Paradigm Project are referred to in our report. This document provides a very detailed view of specific issues with personal papers, i.e. records created in uncontrolled environments. The appraisal criteria used were devised within that context, which provides a unique perspective. The fact that this was a British project was also significant.
6. Grotke, A., & Ruth, J. E. (2007). Selecting and managing content captured from the web: Expanding curatorial expertise and skills in building Library of Congress web archives. Paper presented at the DigCCurr2007: An International Symposium in Digital Curation, Chapel Hill, NC. http://www.ils.unc.edu/digccurr2007/papers/grotkeRuth_paper_9-3.pdf
A case study of work undertaken in web archiving at the Library of Congress that provides some brief, but useful, discussion about the development of specific appraisal criteria.
7. InterPARES. (2002). Appendix 2 Requirements for assessing and maintaining the authenticity of electronic records. In The long-term preservation of authentic electronic records: Findings of the InterPares project. http://www.interpares.org/book/interpares_book_k_app02.pdf
As with the first source listed above, the significance and influence of InterPARES findings determined the inclusion of this resource.
8. Morris, S. (2006, March 27). Identification, selection and appraisal within the North Carolina Geospatial Data Archiving Project (NCGDAP). Paper presented at the Digital Preservation in the State Government: Best Practices Exchange. http://www.lib.ncsu.edu/ncgdap/presentations/StateArchIDSelectionfinal.ppt
This PowerPoint presentation provides useful detail of the appraisal criteria applied to geospatial data in a specific archiving project.
9. National Library of Australia. (2005). Online Australian publications: selection guidelines for archiving and preservation by the National Library of Australia. From http://pandora.nla.gov.au/selectionguidelines.html
These relatively recent guidelines were included for analysis because of their specific focus on publications. The fact that they originated from Australia was also important, as this introduced another perspective into the mix.
10. Murray, K., & Phillips, M. (2007). Collaborations, best practices, and collection development for born‐digital and digitized materials. Paper presented at the DigCCurr2007: An International Symposium in Digital Curation, Chapel Hill, NC http://www.ils.unc.edu/digccurr2007/papers/murrayPhillips_paper_9-3.pdf
A report that describes a survey undertaken of curators and librarians and the subsequent development of web collection plans. This highlights and discusses the specific concerns identified, and was included as a source in our report because of the input from a number of organisations.
11. Lala, V., & Joe, S. (2006). Web archiving at the National Library of New Zealand. Paper presented at the LIANZA Conference, Wellington. http://www.lianza.org.nz/library/files/store_013/WebArchives_VLala.pdf
A case study of web archiving in New Zealand, which includes some discussion of the specific criteria used in appraisal.
12. United States Geological Survey. (2007). Records appraisal tool. from http://eros.usgs.gov/government/ratool/view_questions.php
An extensive list of questions used to gather the data necessary to determine the value of geospatial data. This was particularly useful, as the criteria are very specific.