Preserving Change: Observations on Weblog Preservation

Yunhyong Kim
Humanities Advanced Technology and Information Institute, University of Glasgow, Glasgow, UK
[email protected]

Seamus Ross
Humanities Advanced Technology and Information Institute, University of Glasgow, Glasgow, UK & Faculty of Information, University of Toronto, Toronto, Ontario, Canada
[email protected]

ABSTRACT
In this article, we revisit concepts introduced within the digital preservation literature, such as the Open Archival Information System (OAIS) reference model and Preservation Metadata: Implementation Strategies (PREMIS), to examine their continued applicability to the preservation of dynamic web content such as weblogs.

Categories and Subject Descriptors
H.3.7 [Information Storage and Retrieval]: Digital Libraries – Standards.

General Terms
Management, Design, Human Factors, Standardization, Theory.

Keywords
digital preservation, digital curation, designated community, authenticity, intellectual entity, archive, web archive, blog, weblog

INTRODUCTION
Current preservation approaches tend to be largely data-object oriented, relying on the notion that data can be reasonably reduced to a manageable discrete set of objects accompanied by formal syntactic, semantic and pragmatic attributes that constitute the original object’s content and characteristics necessary for validating authenticity, managing rights, and enabling access and use (e.g. see [1], [6]). Now, the dynamic web environment (e.g. blogs, wikis, networking platforms) enables us to capture data objects at finer levels of communicative granularity. Continuing to capture each of these bits as a discrete entity/object imposes independent object identities on pieces of information that, in the past, would have only been considered to have meaning as part of the whole intellectual process. It may be time to re-examine the established approaches to determine whether they are still valid in the context of web archiving initiatives (e.g. the Minerva Project¹, Internet Archive², UK Web Archive³, Arcomem⁴, BlogForever⁵, Memento Project⁶, LiWA⁷) that have increasingly been trying to create solutions for archiving social media.

OAIS: A BRIEF SUMMARY
The Reference Model for an Open Archival Information System (OAIS) [1] is a conceptual model for a preservation-aware archival system developed by the Consultative Committee for Space Data Systems (CCSDS) (accepted as an ISO standard in 2003⁸). It has been adopted by several well-known preservation projects in recent years (e.g. CASPAR⁹, SHAMAN¹⁰, SHERPA DP2¹¹ and the Planets Interoperability Framework [8]). To be compliant with the model (see [1]), “the OAIS must: 1) negotiate for and accept appropriate information from information producers; 2) obtain sufficient control of the information needed to ensure long-term preservation; 3) determine which communities should become the Designated Community and, therefore, should be able to understand the information provided; 4) ensure that the information to be preserved is independently understandable to the Designated Community; 5) follow documented policies and procedures which ensure that the information is preserved against all reasonable contingencies, and which enable the information to be disseminated as authenticated copies of the original, or as traceable to the original; and, 6) make the preserved information available to the Designated Community.”¹²
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. iPRES2011, Nov. 1–4, 2011, Singapore. Copyright 2011 National Library Board Singapore & Nanyang Technological University

1 http://lcweb2.loc.gov/diglib/lcwa/html/lcwa-home.html
2 http://www.archive.org
3 http://www.webarchive.org.uk
4 http://www.arcomem.eu
5 http://www.blogforever.eu
6 http://www.mementoweb.org/
7 http://www.liwa-project.eu/
8 ISO/DIS 14721
9 http://www.casparpreserves.eu
10 http://shaman-ip.eu/shaman/
11 http://www.sherpadp.org.uk/sherpadp2.html
12 This content from [1] has been condensed to save space.

PREMIS DATA MODEL
The PREMIS (Preservation Metadata: Implementation Strategies) working group was sponsored by OCLC Online Computer Library Center and Research Libraries Group (RLG) to develop a core set of preservation metadata applicable to a wide range of digital preservation contexts. The resulting standard [6] was intended to comply with the OAIS model (Section 2.1), while targeting the metadata that capture preservation processes, such as the preservation level associated with an object. While descriptive and technical metadata are also key concepts in the standard, PREMIS recommends the use of previous standards to meet requirements for these, focusing on preservation levels and processes, rights, and object properties and relations to be preserved. A large amount of the effort in PREMIS remains with object modeling. While notions of agents, events and rights are discussed within the standard, detailed information is not provided. The model relies on the concept of an intellectual entity as a single intellectual unit to be managed within the archive.
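As an illustration only, the entities named in this section (intellectual entity, object, event, agent, rights) can be sketched as Python data classes. All class and field names below are hypothetical simplifications introduced for this example; they are not semantic units taken from the PREMIS Data Dictionary [6], and the relationships are reduced to the minimum needed to show a blog treated as a single intellectual unit.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import List


@dataclass
class Agent:
    """A person, organisation, or tool that takes part in an event (assumed shape)."""
    name: str


@dataclass
class RightsStatement:
    """A simplified permission attached to an object (hypothetical fields)."""
    basis: str
    permitted_acts: List[str]


@dataclass
class Event:
    """A preservation action performed on an object by an agent."""
    event_type: str
    when: datetime
    agent: Agent


@dataclass
class DigitalObject:
    """A discrete unit of stored information, e.g. one harvested blog post."""
    identifier: str
    preservation_level: str
    events: List[Event] = field(default_factory=list)
    rights: List[RightsStatement] = field(default_factory=list)


@dataclass
class IntellectualEntity:
    """A single intellectual unit managed by the archive, e.g. a whole blog."""
    title: str
    objects: List[DigitalObject] = field(default_factory=list)


def events_for(entity: IntellectualEntity) -> List[Event]:
    """Collect every event recorded against the objects of one entity."""
    return [event for obj in entity.objects for event in obj.events]


# Hypothetical example data: one blog, one harvested post, one capture event.
crawler = Agent(name="example-crawler")
post = DigitalObject(identifier="blog/post-1", preservation_level="full")
post.events.append(
    Event("capture", datetime(2011, 11, 1, tzinfo=timezone.utc), crawler)
)
blog = IntellectualEntity(title="Example Blog", objects=[post])
print([event.event_type for event in events_for(blog)])
```

In this sketch the blog is the single intellectual entity, and each harvested post is an object that carries its own events and rights. Everything recorded about change hangs off an object, which reflects the object-centred emphasis described above.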
OBSERVATIONS ON WEB ARCHIVING
While many web archives have claimed compliance with the OAIS model (Section 2.1), this can be accepted only on the most generous terms: 1) while access can be blocked, there is almost never any explicit negotiation between information producers and existing web archives: the information is obtained through procedures for “copying the website” [7]; 2) the lack of negotiation means that the archive’s rights to manipulate harvested pages for preservation purposes become ambiguous, and introduces an unpredictable gap between the archive’s “authentic copy” and the material at the time of creation; 3) by equating inaction of creators with permission for the archive to retain the material, the integrity of the archive’s content is put at risk, as any material (e.g. an image within a blog post) may later be requested to be removed; 4) the selection of a designated community is also largely washed over in the web context: in the case of blogs, there is no clear long-term readership, as evidenced by the constantly fluctuating statistics available through search services such as Technorati¹³; 5) the long-term deterioration of integrity (through missing objects and lapsed URLs) will result in semantic gaps in the knowledge base; 6) the notion of an intellectual entity is also blurred (e.g. see [2]): new blog posts are added to blogs periodically, previously submitted posts and comments are modified, deleted, and rearranged, changing rights, content and semantics.

As a solution for point 6), some have introduced the notion of archiving versions at varying times as independent intellectual entities. Others have tried to break down the blog into smaller intellectual entities (e.g. posts, comments, embedded objects). This approach could lead to: 1) an unmanageable increase in data storage, 2) many instances of semantically incomplete objects (posts often make sense only in the context of other posts, and even more so for comments and embedded images), and, 3) millions of objects with minor differences between them.

TOWARD PRESERVING CHANGE
We emphasise the predominance of change as a core characteristic of today’s digital information environment. Change has, of course, always been an integral part of digital information. As we access, save, and transmit information, we cause change and deterioration. To ensure that information does not change from its original state has become essentially impossible [5]. The core purpose of preservation is not to capture the illusory static steps in between changes, but to ensure that we capture the change itself, and preserve how changes might propagate other changes. The time dimension in the preservation of webpages has already been recognised¹⁴,¹⁵ but the current paradigm is to understand change as the time-stamped objects in selected states. Ontologies have been proposed to capture events and relations between objects (e.g. see Event Ontology¹⁶ used to represent musical performance; ABC Ontology proposed for preservation [3]). While ontologies provide a step in the right direction, they still describe transitions of object states. Our contention is that objects are symptoms of dynamic processes generated by the medium through which they are broadcast. These need to be captured as recurring patterns within medium dependent event windows that go beyond object boundaries (Figure 3.1).

Figure 3.1. Windows of size 3 surrounding physical object events (POE) and other events (OE).

To quote Marshall McLuhan: “the medium is the message” [4]. There is semantics beyond the content of a message: the emergence of so many different channels of communication (e.g. blogs, Twitter and Facebook) may be a testament to the part that the medium plays in conveying meaning and purpose.

ACKNOWLEDGMENTS
The research leading to the discussion in this paper was conducted as part of the BlogForever project funded by the European Union's Seventh Framework Programme (FP7-ICT-2009-6) under grant agreement n° 269963.

REFERENCES
[1] CCSDS (2002) “Reference Model for an Open Archival Information System (OAIS)”, CCSDS 650.0-B-1 (2002): http://public.ccsds.org/publications/archive/650x0b1.pdf
[2] Hank, C., Choemprayong, S., and Sheble, L. (2007) “Blogger perceptions on digital preservation.” In Proceedings of the 7th ACM/IEEE-CS joint conference on Digital libraries (JCDL '07). ACM, New York, NY, USA. http://doi.acm.org/10.1145/1255175.1255276
[3] Lagoze, C. and Hunter, J. (2002) “The ABC Ontology and Model”, Journal of Digital Information, Vol 2, No 2, http://journals.tdl.org/jodi/article/viewArticle/44
[4] McLuhan, M. (1964) Understanding Media. Routledge, London.
[5] Montague, L., Nicchiarelli, E., Matthezing, H., Kummer, R., Puhl, J., & Roberts, B. (2010b) “The concept of significant properties”, The National Archives, UK & The Austrian National Library
[6] PREMIS Editorial Committee (2011) PREMIS Data Dictionary for Preservation Metadata Version 2.1, PREMIS Editorial Committee: http://www.loc.gov/standards/premis/v2/premis-2-1.pdf
[7] Roche, Xavier (2006) “Copying Websites”, Web Archiving, Julien Masanes (ed.) Database Management and Information Retrieval, Springer, Pp 93-114: http://www.springer.com/computer/database+management+%26+information+retrieval/book/978-3-540-23338-1
[8] Wilson, Carl (2008) “Planets Interoperability Framework Guidelines for Service Wrapping”, Planets Project, England: http://sherpa.bl.uk/113/01/Planets_IF6D2_GuidelinesForServiceWrapping_Ext.pdf

13 http://technorati.com/
14 Compare with approaches at http://www.mementoweb.org/
15 Denev et al. (2011) The SHARC framework for data quality in web archiving. VLDB Journal, 20(2):183–207.
16 http://motools.sourceforge.net/event/event.html