Location and Format Independent Distributed Annotations for Collaborative Research

Fabio Corubolo, Paul B. Watry and John Harrison

University of Liverpool, Liverpool, L69 3DA, United Kingdom
{corubolo, p.b.watry, john.harrison}@liverpool.ac.uk
Abstract. This paper describes the development of a distributed annotation system which enables collaborative document consultation and creates new access to otherwise hard to index digital documents. It takes annotations one step further: not only are the same types of annotations available across file formats, but robust references to the documents introduce format and location independence, and allow annotations to be attached even when the document has been modified. These features are achieved using digital library standards, and do not require modification of the original documents or impose further restrictions, making the system infrastructure independent. Integration into the Kepler workflow system allows workflow results to be annotated, and allows the automatic creation and indexing of annotations in document oriented workflows, which can be used as a flexible way to archive and index collections in the Cheshire3 search engine.
1 Introduction
In an era when digital documents are to a great extent replacing paper, there is a strong need for improved annotation tools which cover a range of annotation types, include good authoring tools, and work on a variety of common document formats. The primary aim of this work is to use digital library resources as the basis for collaborative research; therefore, the investigation has looked into how existing digital library developments can be used to support distributed, spontaneous collaborations. In particular, it has examined technologies which enable research community users to annotate documents and other people's data and to share these annotations with others in a simple, spontaneous way. The result will support research collaborations within scholarly communities which are intellectually cohesive but geographically distributed.

Our work builds on and extends Multivalent annotations [1], which allow users to annotate shared documents in numerous ways and to share these annotations without any special prior arrangement or significant systems overhead. It also creates new access to otherwise hard to index digital documents, such as images. The system, developed in the context of the JISC funded VRE programme, takes annotations one step further: not only are the same types of annotations available across file formats, but robust references to the documents introduce format and location independence, and allow annotations to be attached even when the document has been modified, thanks to a novel use of lexical signatures [2]. These features do not require modification of the original documents or impose further restrictions, and thus can be adopted without any additional infrastructure. The system can be inserted in many contexts, including situations where the original files do not support annotations or must remain intact, as in a digital preservation environment. Also, casual users need not feel intimidated when adding annotations, as there is no risk of altering the original document.

Integration of the key components into the Kepler workflow system [3] introduces the idea of annotating workflow results, and allows the automatic creation and indexing of annotations in document oriented workflows, which can be used as a flexible way to archive and index collections in the Cheshire3 search engine [4]. Due to the modular structure of the system, it will be possible to integrate alternative software components (e.g. a different workflow, database or document browser).
2 Methodology
The methodology followed aims to maintain a clear separation between the components and to adopt the relevant standards:

• Web services for the Cheshire3 workflow connector and the lexical signature service. These services are application specific, enabling reuse at a service level.
• The SRW standard search protocol for searching and retrieving the annotations.
• XML Schema to formally describe the annotations. The structure has been built to be extensible and application-agnostic.
• XML Digital Signatures to assure provenance and authenticity. The signature is applied optionally by the client (the Fab4/Multivalent browser) to the entire annotation.

The annotation schema consists of an envelope for the annotation body, which is considered to be application dependent (in XML, text or binary format). The envelope contains:

• The generic annotation metadata, using Dublin Core and some specific metadata (annotation format, generating application, nature of the annotation).
• The digital signature applied to the annotation as a whole, so that both the body and the metadata are digitally signed.
• The annotated resource element, the main feature identifying the referenced document. This consists of multiple identifiers, permitting different levels of attachment. These include the document URI, binary digest, lexical signature [2], and textual contents digest.

Together, the different identifiers allow the annotations to be attached according to different rules, which can be defined by the user (for example: attach to the same exact document, same location, same content, or similar document, in case of partial changes to the content). Format independence is achieved using the textual content digest, which normally does not change across file formats. An SRW query to the database system then retrieves all the annotations for a specific textual content. Other advanced methods, involving the document structure, will be implemented in the future.
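The attachment rules described above can be illustrated with a minimal Python sketch. It is not the Fab4/Multivalent implementation: the hash algorithm (SHA-256), the text normalization (whitespace collapsing and lower-casing), the rule names, and the `ResourceIds`, `identify` and `matches` helpers are all assumptions made for illustration, since the paper does not specify them. The lexical signature is left out here for brevity.

```python
import hashlib
from dataclasses import dataclass


def binary_digest(data: bytes) -> str:
    """Digest of the raw file bytes; differs between formats (assumed SHA-256)."""
    return hashlib.sha256(data).hexdigest()


def text_digest(text: str) -> str:
    """Digest of the extracted text after an assumed normalization step."""
    normalized = " ".join(text.split()).lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


@dataclass(frozen=True)
class ResourceIds:
    """Hypothetical stand-in for the annotated resource element."""
    uri: str
    binary: str
    text: str


def identify(uri: str, raw: bytes, extracted_text: str) -> ResourceIds:
    return ResourceIds(uri, binary_digest(raw), text_digest(extracted_text))


def matches(annotated: ResourceIds, current: ResourceIds, rule: str) -> bool:
    """Decide whether an annotation attaches to the current document."""
    if rule == "same_document":   # byte-identical copy, wherever it resides
        return annotated.binary == current.binary
    if rule == "same_location":   # same URI, whatever the contents
        return annotated.uri == current.uri
    if rule == "same_content":    # same text, whatever the file format
        return annotated.text == current.text
    raise ValueError(f"unknown rule: {rule}")


# Two renditions of the same text: placeholder bytes stand in for real files.
pdf = identify("http://example.org/paper.pdf",
               b"%PDF-1.4 placeholder bytes",
               "Robust  annotations\nfor documents")
txt = identify("file:///tmp/paper.txt",
               b"Robust annotations for documents",
               "Robust annotations for documents")

for rule in ("same_document", "same_location", "same_content"):
    print(rule, matches(pdf, txt, rule))
```

In this sketch only the `same_content` rule attaches an annotation made on the PDF rendition to the text rendition, which corresponds to the format independence obtained from the textual content digest.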
Fig. 1. The system connection diagram shows the interaction between the three main components. The infrastructure independence is highlighted by the use of web services.
Use cases considered during the development include peer review, dissemination of knowledge bases by scholars, and virtual collaboration environments for students and researchers. An exemplar, based on the AHDS-derived “Designing Shakespeare” collection, has been developed; the Tavistock Institute has conducted a user study involving a community of students, researchers, and systems administrators.
3 Results
This system is now included in the default distribution of the Fab4 browser, which is publicly available [5]. The annotations are robustly attached, and are thus:

• Location independent: the same file will always share the same notes, independently of where it resides (web server, local file system, email attachment etc.).
• Format independent: a PDF and text version of the same document share the same annotations.
• Robust to document changes: the same annotations can be attached to a document even if its contents are modified.

The annotations are always distributed to all the copies of the document, without the need to redistribute or modify the original file, a great advantage for spontaneous collaboration. This differs from other annotation systems, which apply the notes to the original file and require the file to be redistributed after every annotation. Furthermore, the annotations are robustly attached to the contents of the document, using Robust Locations [6].
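The robustness to document changes listed above relies on lexical signatures [2]: a handful of terms that are frequent in the document but rare elsewhere, so that small edits tend to leave the signature unchanged. The following Python sketch shows the general idea using a plain TF-IDF ranking. The scoring formula, the tokenizer, the `DOC_FREQ` table and `CORPUS_SIZE` are made-up assumptions for illustration and do not reproduce the lexical signature service used by the system.

```python
import math
import re
from collections import Counter

# Hypothetical document frequencies: in how many of CORPUS_SIZE documents
# each term occurs. A real service would obtain these from a large corpus.
CORPUS_SIZE = 1000
DOC_FREQ = {
    "to": 900, "by": 800, "using": 600, "documents": 300, "attach": 100,
    "robustly": 40, "annotations": 20, "signatures": 15, "lexical": 10,
    "multivalent": 2,
}


def lexical_signature(text, doc_freq, corpus_size, k=5):
    """Return the k terms with the highest TF-IDF score (ties broken by term)."""
    terms = re.findall(r"[a-z]+", text.lower())
    tf = Counter(terms)

    def score(term):
        idf = math.log(corpus_size / (1 + doc_freq.get(term, 0)))
        return tf[term] * idf

    return sorted(tf, key=lambda t: (-score(t), t))[:k]


def overlap(sig_a, sig_b):
    """Jaccard overlap between two signatures, between 0.0 and 1.0."""
    a, b = set(sig_a), set(sig_b)
    return len(a & b) / max(len(a | b), 1)


original = "Multivalent annotations attach to documents using lexical signatures"
revised = ("Multivalent annotations attach robustly to documents "
           "by using lexical signatures")

sig_original = lexical_signature(original, DOC_FREQ, CORPUS_SIZE, k=4)
sig_revised = lexical_signature(revised, DOC_FREQ, CORPUS_SIZE, k=4)
print(sig_original, sig_revised, overlap(sig_original, sig_revised))
```

With these invented frequencies the revised sentence yields the same four-term signature as the original, so an annotation keyed on the signature would still find the modified document. A "similar document" rule could accept any overlap above a chosen threshold rather than requiring an exact match.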
Fig. 2. A view of an annotated Open Document File in Fab4/Multivalent. In the annotation list on the left, the trusted annotations are highlighted.
The digital signature and a trust system guarantee the secure attribution and originality of the annotations, so that the provenance can be trusted and proved. This could be further extended to enable the application of trusted actions to documents in a peer review system (e.g. “approved for publishing”) or in other similar use cases. A search interface, built on the Cheshire3 system, allows retrieval of the annotations and, through them, of the referenced documents. This in fact creates new paths to the retrieval of digital objects.

Acknowledgements: This work was supported by the JISC VRE programme.
References

1. Phelps, T., Wilensky, R.: Multivalent Annotations. In Procs. First European Conference on Research and Advanced Technology for Digital Libraries, 1997.
2. Phelps, T., Wilensky, R.: Robust Hyperlinks: Cheap, Everywhere, Now. In Lecture Notes in Computer Science. Proceedings of Digital Documents and Electronic Publishing, 2000.
3. The Kepler Project: http://kepler-project.org/
4. The Cheshire3 Information Framework: http://www.cheshire3.org/
5. The Liverpool VRE project web pages: http://bodoni.lib.liv.ac.uk/VRE/
6. Phelps, T., Wilensky, R.: Robust Intra-document Locations: http://www9.org/w9cdrom/312/312.html