A thinking model for digital libraries
Elsbeth Kwant
Koninklijke Bibliotheek
P.O. Box 90407
NL-2509 LK The Hague
0031-70-3140333
[email protected]

ABSTRACT
This paper describes a thinking model for digital libraries, defining the roles between author and reader for the digital age. It is meant to aid recognition of these roles, and of the changes in them, for different actors in the communications circuit at different moments in time. As such it can help institutions to make better decisions based on a clearer understanding of their position in the value chain. The goal of communication is to connect the immaterial idea of the author with the immaterial mind of the reader, but for that to happen, the material stage of publication is essential. Between the immaterial phases of the book's conception and reception, there are three important steps: the production of the work, the publication of the work and access to the work. The model represents three key perspectives: that of the information, that of the user and that of the actors in the communications chain. A case study for the Koninklijke Bibliotheek, National Library of the Netherlands, shows the model in action in aiding understanding and decision-making. Finally, this paper offers some perspectives for the future.
Categories and Subject Descriptors
H.3.7 Digital Libraries

General Terms
Theory

Keywords
Communication circuit, digital library, author, reader, publisher, bookseller, library, Koninklijke Bibliotheek

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Conference’10, Month 1–2, 2010, City, State, Country. Copyright 2014 ACM 1-58113-000-0/00/0010 …$15.00.

1. INTRODUCTION
The Koninklijke Bibliotheek, National Library of the Netherlands, connects people and information. Starting from this mission statement, we have been rethinking our position in the information landscape in the digital age. In a changing environment, having a thorough understanding of that position is essential to keep our eye on the goal, and not just on the ball.

We are very aware of the fact that the information landscape is still in transition from a primarily physical focus to a mainly digital one. As the image below shows, if you plot the history of Western book production so far on a twenty-four hour clock, we have been digitizing metadata for no more than half an hour, and digitizing content for only about ten minutes.

Figure 1 Parchment to Portal. Textual transmission in the West

When textual transmission changed from manuscript to printed book, the transition lasted around fifty years, a period now called the incunabula period. The first printed books aimed to look as much like manuscripts as possible. Gradually, the book as we know it found its form, including indexes and page numbers. 1 We are now still in the ‘webincunabula’ period. Many services are still based firmly in the physical paradigm and are in fact digital translations of physical concepts. This phase is now commonly known as ‘Horseless Carriage Syndrome’. Slowly we will see new standards evolve. For our thinking model we started ab initio, with the authors and the readers, and have tried to define the roles in between. These are in some ways similar to the Communications circuit that Robert Darnton defined in 1982 2, but there are some significant
differences as well. An early critique of this model by Thomas R. Adams and Nicolas Barker 3 states that a communications circuit should focus on the text as the prime mover, not on the people facilitating the movement. I follow in that vein, always excepting the author and the reader. In 2000 Adriaan van der Weel presented a paper 4 at the Society for the History of Authorship, Reading and Publishing rethinking this model for the digital age, which shows quite a few of the issues any model of the roles between authors and readers should take into account. The thinking for the present model has been done from a library environment, but as D.F. McKenzie (quoted by Adriaan van der Weel) states, this could ‘include verbal, visual, oral and numeric data, in the form of maps, prints, and music, of archives of recorded sound, of films, videos, and any computer-stored information, everything in fact from epigraphy to the latest forms of discography’. 5 I think we might conclude that the model from ‘creator’ to ‘consumer’ works for any ‘work’-driven approach to the information landscape. This means it has its application not only for the whole digital book environment, but also for the cultural heritage field in a broader sense. The model presented below not only echoes Darnton, but is also related to the Unesco Culture Cycle (2009). 6 However, in this paper I will use the terms author and reader as pars pro toto for creator and consumer.

Sections 2 and 3, on the information and the user perspectives, contain a description of the proposed model. They serve as an introduction to the model and can stand alone to offer more insight into the value chain involved in the steps between author and user. An example of the practical application of the model can be found in section 4. Issues addressed include the importance of managing copyright, keeping the total body of knowledge available, using persistent identifiers, user trends, linked data and standardisation versus differentiation.

2. THE INFORMATION PERSPECTIVE
The model is primarily concerned with the information perspective as defined by Shannon, i.e. the importance or meaning of the works going through the cycle is not important to the working of the machinery. However, the reasons to put the machinery in motion at all are of course dependent on those concepts of importance and meaning. The cycle represents a value chain in which each step builds on the previous one.

The first step in the model is the creation of a ‘work’, an expression of an idea or information (1). In this step, the author acquires copyright which, under the current law, lasts until seventy years after his or her death. Then an agency imparts value to such a work by selecting it for publication (2). This agency can be a publisher, who takes a commercial risk, or (usually in a second iteration of the cycle) a public or private digitization agency. This last form implies re-issuing a publication, in bulk or per item, usually after the economic value has become close to nil (out-of-commerce). This agency adds value through standardization, producing a publishable object and metadata and providing storage for publication. The first iteration usually concerns single works, which still have to prove themselves to an audience and have to provide a return on investment for the publisher. The second iteration is a republication of a work (classics, for instance) and is usually less commercially viable. The product of this step is usually protected by copyright law as a compilation (database rights). 7

1
And in a broader sense what Adriaan van der Weel calls ‘The Order of the Book’ in Changing our Textual Minds, Manchester University Press, 2011. www.let.leidenuniv.nl/wgbw/research/Weel.../Weel_Changing_ MUP.pdf
2
Robert Darnton, "What Is the History of Books?", Daedalus, Summer 1982, pp. 65-83; repr. in Robert Darnton, The Kiss of Lamourette: Reflections in Cultural History, pp. 107-35, at p. 112.
3
Thomas R. Adams and Nicolas Barker. "A New Model for the Study of the Book." A Potencie of Life: Books in Society. Ed. Nicolas Barker. London: British Library, 1993. 5-43.
4
Adriaan van der Weel, ´The Communications Circuit Revisited´ (paper delivered at the SHARP conference, Mainz, July 2000), http://www.let.leidenuniv.nl/English/B&P/Eltext/CCRev.html, retrieved 28-2-2014.
5
D.F. McKenzie, "The Book as an Expressive Form", in his Bibliography and the Sociology of Texts, The Panizzi Lectures 1985, London, 1986, pp. 4-5
6
http://unesdoc.unesco.org/images/0019/001910/191061e.pdf
7
I leave out self-publishing for the moment, as the element of selection there comes later in the process, at the distribution stage, and is done by the reader.

Figure 2 The Digital Object Cycle

The third step is the actual publication. This too is subject to copyright law, and can only be done when the original author or rightsholder has given permission to publish (right to publish). The presentation of the work can take different forms: an online representation (streaming access) or a download link (in different formats, such as PDF, EPUB or XML). Publication is preferably accompanied by a persistent identifier (such as a DOI), so the object can be found in the same location through different services at different times. The right to publish is then mirrored by a right to access from the other side, the end-user interface.

In the Unesco Culture Cycle, this stage is not represented. I believe it is crucial to our understanding of the communications cycle: it represents the availability of a work (after publication, and before distribution). In the physical world it was so strongly related to a copy that had to be distributed that there was no need to define it as a step in itself. In the digital world, publishing a work also means hosting and storing the object for access. Warehousing the object as booksellers and libraries used to do at the services level
(4, having something in stock), has now moved to this step (3). Therefore in the digital domain this step marks the end of the flow of the digital object: a work has been published and is available for reading. But it does not end the communication chain. To engage a reader, it still has to be found. The work itself does not move to the next step; only the descriptive metadata about the work do (information such as title, author, perhaps a preview).

Without an interface for an end-user (4), the work has little chance of reaching its target, the reader. There are many different end-user interfaces, which serve larger or smaller audiences and contain links to more or fewer objects. The value proposition involved in selecting objects deemed to satisfy a certain user or user group is of prime importance here. Traditionally this step involves searching (usually on the basis of metadata only, sometimes in a full-text index) and evaluating the results (the classical roles of booksellers or library catalogues). However, any end-user interface that links to a published work belongs in this step, and for most people the discovery layer only contains a single access point: Google. This has been neatly argued by Lorcan Dempsey: ‘access and discovery have now scaled to the level of the network: they are web scale’. 8

And last, but definitely not least, comes the reader (5). The reader accesses a published work by searching or browsing the discovery layer (4). When he or she has found a result or stumbled across a link to a published work, the next step is delivery (3), or the right to access. When the user accesses a work, he or she may add value or be inspired to create something new. The reader then becomes an author (1) and a new cycle starts. A party imparts value to the user-generated content and decides to add this material to its repository, negotiates the right to publish, etc.

In 1941, the brilliant and versatile Dorothy L. Sayers published a work called The Mind of the Maker.
9 It offers earthly creation as a metaphor for the Holy Trinity. The book as it is in the mind of the author - the ideal book - stands for God the Father, the Creator (1). The incarnation of the book, the book in its material form, represents God the Son (3). And finally, the book as it becomes in the mind of the reader reflects the Holy Spirit (5). I rather like this idea of a Holy Trinity of communication. The goal of communication is to connect the immaterial idea of the author with the immaterial mind of the reader, but for that to happen, the material stage of publication is essential. What makes it even more powerful is that no one aspect in this Holy Trinity of communication is more important than another: a reader cannot do without a book, a writer is incomplete without a reader. Or as Sayers puts it ‘each [is] equally in itself the whole work, whereof none can exist without other’.

8
Lorcan Dempsey, ‘Thirteen Ways of Looking at Libraries, Discovery, and the Catalog: Scale, Workflow, Attention.’ EDUCAUSE Review Online, (Monday, December 10). http://www.educause.edu/ero/article/thirteen-ways-lookinglibraries-discovery-and-catalog-scale-workflow-attention. This article is a slightly amended version of a contribution of the same name to Sally Chambers, ed., Catalogue 2.0: The Ultimate User Experience (London: Facet Publishing, 2013).
9
Dorothy L. Sayers, The Mind of the Maker, 1941.
In the terms of this model we need two other stages to flank the central incarnated work (3). The first is production (2), the second distribution (4). It would probably be pushing the metaphor too far to call these Mary (2) and the Apostles (4), though I rather like the idea that they are one and many… But for the time being let us go with the two concepts Michael Bhaskar devised in his recent work The Content Machine: filtering and amplification, which he sees as the core roles of publishing. 10 Filtering implies aspects of selection, structuring the form, adding authority and branding. Amplification is in short the difference between leaving a published work on a park bench, or telling the world about it (aiming to let the copy reach as many readers as possible). Michael Bhaskar has coined these phrases to provide a theory of publishing, but these roles apply to different intermediaries active between author and reader, as we will see in section 4.
3. THE USER’S PERSPECTIVE
The model is not only concerned with the information, but also with the goal of the communication: connecting minds. If we follow the dark blue arrows in fig. 2 that go the other way round, we see the user’s perspective. He or she (5) starts searching at the interface (4) which he or she credits with the highest likelihood of success (either in discovery or in delivery). An end-user interface might just describe a single work, or provide a search interface for a number of works together (such as a bookseller’s or library catalogue).

There has, of course, been a fundamental change in how a user values different interfaces. In the pre-digital world, aggregation of goods took place close to customers and the resulting supply of goods was usually taken for granted (you made do with the book the store had in stock). The visibility of out-of-stock material has grown immensely with the web. With it the focus of customers has changed. They now usually have different search strategies, tailored to their specific needs at a specific time: social media, newspapers and Goodreads for recommendations, Google for general search, Wikipedia for general information, and dedicated services (such as PubMed) for specific search.

The Ithaka report 11 shows that for academic literature, 80% of university staff use either a specific electronic research resource or a general purpose search engine, such as Google. For students, general search engines are even more important to discover academic literature. If we look at the broader public, PewInternet offers some insight. 12 Most people discover books based on word-of-mouth (5), followed by recommendations from online bookstores or other websites (4). A quarter or less discover books through physical bookstores or libraries. We see the twin strands of centralisation (to find a single point of access to the unwieldy bulk of information) and fragmentation (to
10
Michael Bhaskar, The Content Machine: Towards a Theory of Publishing from the Printing Press to the Digital Network, 2013.
11
http://www.sr.ithaka.org/research-publications/us-facultysurvey-2012
12
http://libraries.pewinternet.org/2012/06/22/libraries-patronsand-e-books/
offer specific services to user groups that used to be too small to cater to) that characterize the internet at work.

When the reader finds a work he or she is interested in, he or she moves to delivery (3), in this case the right to access. When the resource is free of copyright, this is usually a non-commercial transaction (for instance open access publications or digitized material in Project Gutenberg, Google Books or HathiTrust). If copyright still holds, there is a commercial transaction, either a purchase by the individual, or a negotiated right to access (for instance through a library license). There is a serious issue here of illegal access to published material, which threatens authors as much as intermediaries. E-material is often seen as too expensive, both by libraries, who see the big deals as unreasonably priced, and by readers buying ebooks through online retailers. 13
4. THE INSTITUTIONAL PERSPECTIVE
For figure 3 we have taken the circular model presented in figure 2 and flattened it to accommodate more information on the actors in the different domains. The actors mentioned are of course just examples, but they may help you to see where your own position in the cycle lies. If you spend most of your time talking to authors and arranging copyright, you will be in the second column. If your focus is on customer needs and habits and marketing metadata, you are probably in the fourth column.
A very common combination is combining all access (4 and 3, discovery and delivery / search and retrieval) in one service. This is the traditional role of booksellers and libraries: providing a selection of published material and an end-user interface to access them. In the digital age, we would call this role content aggregation. JSTOR is a prime example – they do not own copyright (right to publish), but ‘obtain licenses to preserve and make published content available digitally from publishers’. At the Koninklijke Bibliotheek we host Delpher, a national infrastructure for the publication (3) of digitized books, newspapers and journals, complemented by a search interface (4). This is a model I think we are moving away from. The hosting of content (3) is a different role and asks for different skills from marketing and packaging (4). Most publishing agencies (2) also have a native platform (3) which provides search facilities (4) (such as Science Direct). And some booksellers (4), notably Amazon, also have a publishing unit (2 and 3). These are examples of the blurring boundaries between different intermediaries. As many book historians have shown these boundary shifts (as those between printer-booksellers, publishers and printers and now publishers and booksellers), have been common in the history of the book, but haven’t changed the communication of minds through mediation fundamentally. I do believe however that for new concepts, the boundary between the published object (3) and the packaging for the end user (4) will emerge as a logical cut-off point. We already see this happening with metadata-aggregators for
Figure 3 The digital object cycle with actors

13
Other issues that threaten authors and intermediaries are overproduction, disintermediation, pressures on reading time and the decline of long-form reading.
discovery 14. These fulfil the role of combining different sets of object identifiers and metadata, but without actually holding the content. This puts aggregators such as Europeana and the Digital Public Library of America firmly in column 4. Sometimes aggregators also negotiate the right of access, such as public libraries in the Netherlands.

I would like to emphasize that both the second and the fourth columns imply selection, or filtering. In the first case, it is filtering for production: what is relevant to invest in. In the second case it has to do with finding an audience: which combination of content is likely to appeal to my user. Both are informed decisions, but they are usually not taken by the same people. This difference in the skill sets needed to fulfil the different roles between author and reader is, I believe, a strong indicator of where future boundaries will lie.
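The placement of actors in columns described in this section can be written down as a small data structure. The sketch below is only an illustration of the model, not part of it: the class and function names are assumptions introduced here, and the column assignments simply repeat the examples given in the text (JSTOR and Delpher combining 3 and 4, Amazon covering 2, 3 and 4, Europeana and the Digital Public Library of America in column 4).

```python
from enum import IntEnum


class Step(IntEnum):
    """The five steps of the cycle, numbered as in the text."""
    AUTHOR = 1        # creation of a work
    PRODUCTION = 2    # selection and production (filtering)
    PUBLICATION = 3   # hosting and storing the published object
    DISTRIBUTION = 4  # end-user interface, discovery (amplification)
    READER = 5        # reception by the reader


# Example actors and the columns the text places them in.
# The dictionary itself is an illustrative assumption, not a complete map.
ACTORS = {
    "JSTOR": {Step.PUBLICATION, Step.DISTRIBUTION},
    "Delpher": {Step.PUBLICATION, Step.DISTRIBUTION},
    "Amazon": {Step.PRODUCTION, Step.PUBLICATION, Step.DISTRIBUTION},
    "Europeana": {Step.DISTRIBUTION},
    "Digital Public Library of America": {Step.DISTRIBUTION},
}


def actors_in(step):
    """Return the names of all example actors active in the given column."""
    return sorted(name for name, steps in ACTORS.items() if step in steps)


def holds_content(actor):
    """An actor holds content only if it is active in column 3."""
    return Step.PUBLICATION in ACTORS[actor]


if __name__ == "__main__":
    print(actors_in(Step.PUBLICATION))
    print(holds_content("Europeana"))
```

Asking which actors are active in column 3 but not in column 4, or the reverse, is one way to make the cut-off point between the published object and its packaging visible for a given set of services.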
4.1 Case Study Koninklijke Bibliotheek
As an example of how this model has influenced our thinking at the Koninklijke Bibliotheek, I would like to use our position in the information landscape as a case study. It also offers an opportunity to add another dimension to the model: that of time. The model is primarily concerned with the instant flow of information from an author to a reader. However, there are also dimensions of time involved in that flow. Adams and Barker accounted for that by adding the element of survival to their amended version of the communications circuit.

The KB is a deposit library, both for national heritage, and for the publications of international academic publishers. I see the deposit role as a time-out for publications that belongs in the third column. Publishers deposit their published works at a library. Our role is to keep the copy safe and to make sure future generations can access it as well. We also offer onsite access to these collections. That means we store the items, but provide access without amplification. If and when the original publisher cannot or will not provide a copy in the third column any more, deposit libraries can bring copies back into circulation (3).

That is exactly what we are doing by digitizing our collections. We have historically built up a huge collection of publications. In 2010 the KB announced its ambition to digitize not only the collections of the KB, but all books, newspapers and periodicals printed in the Netherlands since 1470. To achieve this goal we work with public and private partners, such as university libraries, Google and ProQuest. The process of digitization comes in column 2, and the model helps explain why we put so much more time and effort into talking to the rightsholders of column 1 than before. The digitized product comes in column 3.
14
This does not deal with production systems such as a Central Bibliographic System (CBS), but with aggregators such as Europeana or the Digital Public Library of America (DPLA).

The third column contains the total published body of knowledge, from which different services in column four select for their respective users. In 2012 we got an interesting perspective on the perceived value of keeping this body of knowledge available. The Koninklijke Bibliotheek wanted to publish pre-1940 journals via
an opt-out system, in the face of the impossibility of finding the rights holders for millions of journal pages. Though the KB explicitly recognized the moral rights of authors, the KB thought the public value of providing access to the digitized material was greater than the private right of these authors, given the fact that any author still alive would be over 93, and the printed publications did not represent any commercial value any more. Some contemporary authors reacted with outrage: 'KB steals from the elderly' was a headline on one website, but others reacted differently. They claimed that access to this body of knowledge (or even more recent material, which they would actually hold rights over) was so helpful to them that it would easily outweigh any remuneration due to them as authors. In the end, we were able to arrange a collective licensing deal with the Dutch societies for authors' and illustrators' rights, which acknowledged the rights of the authors and balanced them with the public value of publishing the historical journals.

For digitized material we also want to amplify (4). By digitizing out-of-print and out-of-commerce works, we give authors a new lease of life. In November 2013, the KB and the university libraries in the Netherlands launched Delpher, a national infrastructure for digitized material for libraries, which provides access to more than 1 million historical books, newspapers and journals. The model helped us to focus on getting both aspects of the service right: a publication platform (3) with persistent identifiers, and a full-text search environment (4). The publication platform will remain central to our mission; perhaps the search function will in time be taken over by others or integrated in another library service.

We do believe our own services have a role to play too. As a national library we provide the right to access scholarly information.
From 2015 we will also have responsibility for the Dutch Public Library network and the National Digital Library. This will provide an end-user interface to all the library collections in the Netherlands. In our thinking about the way we could best serve the library community as a whole, it has really helped to realize that whilst we at the KB are primarily concerned with content (columns 2 and 3), the public libraries are much stronger in their end-user reach (column 4). This helps us to build on each other’s strengths. And in our new role, the focus on providing context and guidance (also very much a column 4 activity) will become even more important than it is now.
4.2 The red line and the next step
Though we are very proud of Delpher, the model helped us to see that our goal (amplifying the use of the collection) would be better served if we did not only concentrate on our own service(s). Most potential users will not come to the library, but to Google or other large intermediaries. The line between columns three and four is red to reflect the importance of providing technical support to access column three material through different column four services. The red line contains a set of Application Programming Interfaces such as OAI-PMH, link resolvers and Search/Retrieve via URL (SRU). By using these in a transparent way, we can offer the data (at least the material that is free of copyright) to any service that wants to help us amplify the content. Data services, as we call this, have on the basis of this model gained priority over our traditional library services. By making sure that Google can index it, that Europeana aggregates the metadata and that anyone can add the
metadata to their own discovery tools or other services, we want to amplify our content. Not necessarily through our own services, but through the services our customers know and love. Having a Wikipedian in Residence is another example of amplifying our content through larger intermediaries.

Expanding on this red line, I believe there is an exciting new opportunity just around the corner, well suited to heritage institutions. It has to do with context. Libraries traditionally had a role that echoes Google’s mission ‘to organize the world's information’. This mission has become even more important now that abundance has replaced scarcity. Brian O’Leary puts this nicely in his essay ‘Context, not container’: ‘When there was only the Gutenberg Bible, we didn’t need Dewey. When booksellers were smaller and largely independent, we didn’t have much need for BISAC codes. And before online sales made almost every book in print evident and available, ONIX was an unattended luxury. […] Simply put: Content abundance is the precursor to the development (and maintenance) of context.’ 15

A very rich vein of context is provided by Linked Data. Linked Data add value on different levels of the content cycle. They are primarily useful for discovery (as they are meant to be). An example is enriching a search query (for instance by expanding it with different ways of spelling names, such as Shakespeare) or adding access points to data (such as providing specific geolocations to information). But at the stage of production or maintenance of metadata (2) trusted external information can also be very helpful, for instance in determining the date of death of an author, and so providing information about the copyright status of a work.

In 2010 Tim Berners-Lee suggested a 5 star deployment scheme to enhance the discoverability of data:
★ Available on the web (whatever format) but with an open licence, to be Open Data
★★ Available as machine-readable structured data (e.g. excel instead of image scan of a table)
★★★ as (2) plus non-proprietary format (e.g. CSV instead of excel)
★★★★ All the above plus: Use open standards from W3C (RDF and SPARQL) to identify things, so that people can point at your stuff
★★★★★ All the above, plus: Link your data to other people’s data to provide context

For the last two steps RDF is used, a W3C standard that not only connects two concepts, but also names the relationship between them. In the Linked Data Cloud, DBpedia (a database generated from Wikipedia) and Freebase (now owned by Google) are already major players. Wikidata is a more recent initiative that looks promising.
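To make the idea of named relationships concrete, and to show how a linked date of death can inform the copyright status of a work, here is a minimal sketch in plain Python. It does not use an RDF library; the `ex:` identifiers, the author, the year of death and the function names are all hypothetical, and the rule applied (a work is out of copyright once more than seventy years have passed since the year of the author's death) is a simplification of the term mentioned in section 2, not legal advice.

```python
from typing import NamedTuple, Optional


class Triple(NamedTuple):
    """A subject, a named relationship, and an object, as in RDF."""
    subject: str
    predicate: str
    obj: str


# Hypothetical data for illustration only: identifiers and values are made up.
TRIPLES = [
    Triple("ex:work/1", "ex:creator", "ex:author/1"),
    Triple("ex:author/1", "ex:name", "Example Author"),
    Triple("ex:author/1", "ex:deathYear", "1930"),
]

COPYRIGHT_TERM_YEARS = 70  # term after the author's death, as stated in section 2


def value_of(subject, predicate, triples=TRIPLES) -> Optional[str]:
    """Return the first object linked to subject by predicate, or None."""
    for triple in triples:
        if triple.subject == subject and triple.predicate == predicate:
            return triple.obj
    return None


def out_of_copyright(work, current_year, triples=TRIPLES) -> Optional[bool]:
    """Follow work -> creator -> death year; None means 'status unknown'."""
    author = value_of(work, "ex:creator", triples)
    if author is None:
        return None
    death_year = value_of(author, "ex:deathYear", triples)
    if death_year is None:
        return None
    return current_year > int(death_year) + COPYRIGHT_TERM_YEARS


if __name__ == "__main__":
    print(out_of_copyright("ex:work/1", 2014))
```

Returning `None` when a link is missing keeps "unknown" separate from "still in copyright", which matters when rights holders cannot be traced.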
15
Brian O’Leary, ‘Context, not Container’, in: Book: A Futurist’s Manifesto, edited by Hugh McGuire and Brian O’Leary. http://book.pressbooks.com/chapter/context-not-containerbrian-oleary
For historical information this is a very powerful way to amplify content by deep linking. Linking named entities (especially about who, what, where and when) to external concepts makes the linked resource more valuable. It is a role, like digitization of out-of-print material, that provides new and exciting possibilities for cultural heritage institutions.
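The interfaces on the red line mentioned earlier in this section are plain HTTP requests, which is what makes them easy for other services to use. As a rough sketch, the code below only builds request URLs for an OAI-PMH harvest and an SRU search; it does not contact any server. The base URLs are placeholders and the function names are assumptions; the parameter names (`verb`, `metadataPrefix`, `set`, `operation`, `version`, `query`, `maximumRecords`) are the standard ones from the two protocols.

```python
from urllib.parse import urlencode

# Placeholder endpoints: replace with the addresses a real service documents.
OAI_BASE = "https://example.org/oai"
SRU_BASE = "https://example.org/sru"


def oai_list_records_url(base=OAI_BASE, metadata_prefix="oai_dc", set_spec=None):
    """Build an OAI-PMH ListRecords request for harvesting metadata."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if set_spec:
        params["set"] = set_spec  # optional: restrict the harvest to one set
    return f"{base}?{urlencode(params)}"


def sru_search_url(query, base=SRU_BASE, maximum_records=10):
    """Build an SRU searchRetrieve request for a query string."""
    params = {
        "operation": "searchRetrieve",
        "version": "1.2",
        "query": query,
        "maximumRecords": maximum_records,
    }
    return f"{base}?{urlencode(params)}"


if __name__ == "__main__":
    print(oai_list_records_url())
    print(sru_search_url("Shakespeare"))
```

An aggregator in column 4 would typically use the first kind of request to collect metadata in bulk, and a discovery tool the second to search on demand, while the objects themselves stay on the publication platform in column 3.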
5. EPILOGUE
Naturally, a thinking model for digital libraries can never be wholly comprehensive. Or, as a colleague once said when I entered happily with my newest visualisation: ‘You do realise this is just a model, not the world?’ It is a wise remark, reminding us that a model is only useful to serve our understanding of the world, not to rule it. With that caveat, I think it can help address the challenges all intermediaries in the content chain are experiencing. If you know what it is you are actually doing, how you might best achieve it will be easier to work out. That makes it possible to focus on things that will serve you in the long term instead of being swamped by short-term problems.

As intermediaries, we serve the Holy Trinity of Communication. It is a world of shifting sands and moving floodlines, exciting, ever-changing, but rooted in the firm soil of connecting minds.
6. ACKNOWLEDGMENTS
I would like to thank Lily Knibbeler, without whom the model would have never reached its present form, and several colleagues at the Koninklijke Bibliotheek who have shared their insights and comments. In preparing this paper, Adriaan van der Weel and Marco de Niet have been very helpful by commenting on earlier stages.