JeromeDL - Reconnecting Digital Libraries and the Semantic Web

Sebastian Ryszard Kruk
Digital Enterprise Research Institute, National University of Ireland, Galway, Ireland
[email protected]

Stefan Decker
Digital Enterprise Research Institute, National University of Ireland, Galway, Ireland
[email protected]

Lech Zieborak
Main Library, Gdansk University of Technology, Gdansk, Poland
[email protected]

ABSTRACT

In recent years more and more information has been made available on the Web. High-quality information is often stored in dedicated digital libraries, which are on their way to becoming expanding islands of well-organized information. However, managing this information still poses challenges. The Semantic Web provides technologies that help to meet these challenges. In this article we present JeromeDL, a full-fledged open-source digital library system developed by us, which exemplifies how digital libraries benefit from the Semantic Web. We define and evaluate how browsing and searching based on the semantic descriptions of resources and users improve the usability of a digital library, and how digital libraries can be interconnected to exchange semantic descriptions.

Keywords

digital library, semantic web, ontology, searching, P2P networks, query expansion, MARC21, BibTEX, DublinCore, annotation

Copyright is held by the author/owner(s). WWW2005, May 10–14, 2005, Chiba, Japan.

1. INTRODUCTION

Digital libraries are facing challenges in managing an ever-increasing amount of resources, caused by the increase of investment in research and development as well as the trend to produce more and more written information. Several solutions to these challenges have been developed. The most popular approach is to organize the collection according to similarities of the resources. The tree-structured classification schemes used to coordinate materials on the same and related subjects lead to the hierarchical organization of catalogs. The most popular classification schemata (e.g., the Dewey Decimal Classification (DDC)[6][35] and the Universal Decimal Classification (UDC)[21]) provide an additional number identification to improve the selection of the category. The catalog approach to managing resources has been successfully adopted in on-line directories like Yahoo![37] or the Open Directory Project[36]. However, the constant growth of the Internet proved the catalog fashion of organizing the enormous database to be inadequate. As an answer to the need for getting a grip on the information on the Internet, a number of search engines have emerged. Although some of the search engines come up with very sophisticated algorithms like PageRank[4], they still do not seem to be sufficient to always find high quality information.

The Semantic Web effort, which partially originated from the digital library community, is providing technology that can potentially be applied to the problem of managing resources. In this paper we present JeromeDL, an open-source digital library which uses Semantic Web technology to provide better access to its resources.

1.1 Contributions

The paper makes the following contributions to the field of digital libraries and the Semantic Web:

• We present JeromeDL, a sophisticated open-source digital library. We detail JeromeDL's architecture and exemplify which parts of JeromeDL benefit from the Semantic Web.

• We show how personal profile information based on Semantic Web data can be exploited to gather information about user preferences.

• We show how the preference information can be used in a semantic search algorithm which delivers better results when searching JeromeDL.

1.2 Related work

1.2.1 Open-source digital libraries

There are a number of existing implementations of digital library systems, such as DSpace1, e-Prints2, or Greenstone3. In this article we describe the benefits of the semantic features implemented in the JeromeDL system. In contrast to the listed systems, JeromeDL has been designed with Semantic Web technologies like RDF, FOAF, and ontologies in mind.

1 http://www.dspace.org/
2 http://www.eprints.org/
3 http://www.greenstone.org/

1.2.2 Social network based collaborative filtering

Recent approaches in collaborative filtering include social network information[18]. JeromeDL extends those approaches with an automated interest collection mechanism tailored towards digital libraries.

1.2.3 Search engines

Search engines provided by digital libraries differ from web search engines in many aspects.

• Digital libraries are closed systems, since all resources are usually added directly to the system repository. Therefore crawling features are not required.

• Apart from the usual full-text index, the search is performed on the resource description, which is very often provided explicitly. These conditions result in a higher quality of the search process.

• Digital library users are more demanding than web search engine users. In many cases digital libraries are considered an outreach of classical libraries.

In order to reach the status of trust and reliability that classical "bricks and mortar" libraries have, digital libraries require effective information access facilities[13]. The latest studies on the search process in digital libraries revealed the importance of query personalization. The query expansion process implemented in JeromeDL tends to be more flexible and automated than existing solutions[12], since it expands queries based on user profile information consisting of distributed, private bookshelves and usage statistics.

1.3 Outline of the paper

The following section discusses the architecture and the feature set of the JeromeDL system in the context of generic implementations of digital library systems. In section 3 we present how a digital library can benefit from semantic descriptions. We present the architecture of JeromeDL, an e-library with semantics that utilizes semantic technologies. Then, we discuss how advanced bookmark management and user-oriented profiling based on semantic descriptions improve search results. Finally, we present the semantic search algorithm implemented in the JeromeDL system, which realizes the previously described concepts.

2. DIGITAL LIBRARY ARCHITECTURE AND THE SEMANTIC WEB

2.1 A generic digital library architecture

Figure 1 shows a generic digital library infrastructure, following the architecture described in [10]. A digital library system contains a user interface and middleware, like the classic three-tier architecture. The data engine handles catalog information along with the resources. Additionally there is a special interface for librarians to manage the content and catalog descriptions. To enable interoperability between library systems a communication interface is defined as well. The following section instantiates this generic digital library architecture for JeromeDL.

Figure 1: The architecture of a generic digital library system

2.2 JeromeDL

JeromeDL is a joint project between the Main Library of Gdansk University of Technology4 and DERI.International5. The main requirements for JeromeDL from librarians and library users were:

• support the legacy of classic libraries (e.g., antique books),
• provide user-oriented browsing features,
• allow efficient searching,
• cover security and accounting constraints,
• support multiple formats of resources,
• enable communication with other digital library systems,
• utilize the latest results in Semantic Web and communication and information management research.

4 http://www.bg.pg.gda.pl/
5 http://www.deri.org/

2.3 Architecture of the JeromeDL system

Figure 2 instantiates the generic digital library architecture of Figure 1 for JeromeDL, taking Semantic Web developments (such as metadata descriptions based on RDF, FOAF, and ontologies) into account. JeromeDL's middleware implements features like viewing resources, searching and browsing (see 3.3 and 3.4), users' profile management based on FOAF (see 3.2), and resources management (see 2.5). The description and content of the resources, e.g., the fulltext index of the resource's content, the MARC21 and BibTEX bibliographic descriptions, and the semantic description according to the Jerome ontology, are held in several stores. Apart from textual resources, the JeromeDL system has been designed to also handle collections of scans of old books and other binary resources such as Macromedia Flash presentations. A communication link to the outside world enables searching in a network of digital libraries.

The database content of the JeromeDL system is rendered in XHTML following an HTTP request. To administrate the content of the database and to describe the resources in the database, a stand-alone application, JeromeAdmin, is utilized. JeromeAdmin communicates with the main system through the RMI (Remote Method Invocation) protocol. The content of the JeromeDL database can be searched not only through the web pages of the digital library but also from other digital libraries and other web applications, through a special web services interface based on the Extensible Library Protocol (ELP)[23] (see 3.5).

Figure 2: The architecture of the JeromeDL system

2.4 Providing access to resources

Although even "bricks and mortar" libraries were initially meant to handle only books, users usually find a far larger variety of resources. Following the same paradigm, the JeromeDL system supports not only resources in PDF or RTF formats, but also other multimedia content (e.g. Flash presentations). To support both flexibility and specialization in resource handling, each resource has its own description of its structure. If possible, the resource is stored in the XSL:FO format, which allows the reader to choose between different rendered types, e.g. PDF, RTF, HTML. The content of the resources can additionally be protected from printing or copying (if applicable) with an ACL (access control list) attached to the resource.

Very often a digital library user searching for information is flooded with an abundance of inadequate query results. Many attempts have been made to limit the number of answers of the query process. The research on the usability of search features covers a large variety of approaches, from word-sense disambiguation based on boolean expressions and lexicons[22] to dynamic query user interfaces[28].

Apart from providing access to the resources in the database, a digital library system is expected to provide resource discovery and navigation features[32][20]. The JeromeDL system aims to deliver high quality searching (discovery) and browsing (navigation) features. The diversity of metadata utilized in our digital library system reflects the sophistication of the search algorithm. To achieve result sets closest to the user's requirements, the concepts of content, meaning and user preferences have been adapted for the semantic search algorithm (see 3.3 and 3.4). While the user is browsing JeromeDL's databases, the semantically enabled user profile is annotated with statistical information. The profiles are then utilized in the search process (cf. 3.2).

2.5 Resources management in JeromeDL

The most convenient way to add resources to the JeromeDL system and describe them is by using the stand-alone Java administration application, JeromeAdmin. The JeromeAdmin application provides an interface for generating the description of resources and uploading the content of the resource to the system database. With JeromeAdmin an administrator can attach the MARC21 and BibTEX bibliographic descriptions as well as ontological annotations (see 3.1) to the resource. Users are able to submit resources to JeromeDL as well. In this case a two-stage process is required. First a user submits the content of the resource and the set of descriptions. Then the administrator approves and finalizes the submission of the resource to the database.

3. SEMANTIC DESCRIPTIONS IN DIGITAL LIBRARIES

The aim of the Semantic Web initiative is to provide standards and infrastructure that help machines understand the content they are dealing with. Using these standards and infrastructure, knowledge bases can be constructed that contain high quality structured content usable for various purposes. So far there are several approaches to constructing the knowledge base for digital libraries. Catalog and fulltext indexing can be considered one of these approaches. But to provide full support for machine-based reasoning, and hence the ability to perform human-like interactions with readers, the knowledge base must be equipped with semantics.

Bibliographic descriptions like MARC21[11] or BibTEX[19] are effective in applications where human interactions are possible. For example, a MARC21 description consists of a few keywords and free text values, without a controlled vocabulary. So for inquiring information about a resource or placing the bibliographic annotation into the digital library, one can utilize these descriptions, as they are interpretable by human beings. However, machines are not able to utilize much of a MARC21 description. Therefore other description formalisms are necessary, and ontology formalisms as provided by the Semantic Web effort are a promising path.

3.1 Semantic description of resources

Digital libraries need to take the rich history of libraries and the plethora of existing resources into account. Therefore introducing ontologies to the digital libraries domain[34] requires compatibility with already existing bibliographic description formalisms [17][15]. To provide machine understandable descriptions, we need to imitate the human process of understanding and association, which means that a concept needs to be interlinked with other relevant concepts. The main problem with building an ontology for the library domain is that the information resources collected in digital libraries cover all kinds of human activities, which would require a bibliographic ontology to cover everything as well. Fortunately the main purpose of such an ontology is still the annotation of resources, and it is not necessary to completely capture the content. Therefore the JeromeDL Core ontology only needs a common high level core, which is used to capture the essence of bibliographic annotation.

A good starting point for building an ontology for bibliographic description is the DublinCore Metadata[7]. In fact the idea of the Semantic Web emerged from digital library activities like DublinCore. However, the description values in Dublin Core are plain text descriptions, not other objects. This makes it difficult to interlink Dublin Core metadata with other Semantic Web data (e.g., to use a FOAF:Person as the value of the Dublin Core creator attribute). Modifications of the DublinCore Metadata in the Jerome ontology therefore include the permission of structured values as well as additional definitions of keywords and catalog classifications (domains of interest) of the resource.

Figure 3: Relations between major concepts in Jerome ontology

Each keyword concept is connected to other concepts with the following properties:

• hypernyms – concepts with more general meaning,
• hyponyms – concepts with more precise meaning,
• synonyms – concepts with similar meaning,
• homonyms – concepts with the same spelling but different meaning,
• semantic fields – concepts that can be found in the same context,
• categorization – the categories (domains of interest) where this concept is most common.

Additionally each concept has a list of the most often used lexical variants of the word. This helps to obtain the stem for the concept when the user provides a different variant of it. WordNet6 can be used as a starting point for the JeromeDL [8][22] ontology. However, some properties defined for the keyword concept in the Jerome ontology are not accessible within the WordNet ontology.

Each resource can also be described in MARC21 and BibTEX format. The next sections describe how the JeromeDL system benefits from the semantic descriptions, which is a unique feature among digital libraries.

3.2 A Social Network approach to Collaborative Filtering for Digital Libraries

A classic library requires its users to identify themselves so that a resource can be assigned to them. Digital resources on the Internet are usually easily duplicable, and for this reason in most cases the reader does not have to identify himself before viewing the content. However, a reader can benefit from identifying himself, e.g. by gaining access to protected resources, or by making use of additional features when browsing or searching a digital library's archive.

On-line communities introduced the idea of connecting users registered within one system[24][25] with each other. To manage the users' profiles in the JeromeDL system the FOAFRealm7 [14] library has been created. The FOAFRealm library implements the standard J2EE authentication mechanism based on the FOAF (Friend-of-a-Friend)8 metadata, which enables readers registered in the digital library system to point to other readers as friends and also enables one to reuse registration information.

6 http://www.cogsci.princeton.edu/~wn/
7 http://www.foafrealm.org/
8 http://www.foaf-project.org/

Registered readers are able to annotate and evaluate the resources stored in the JeromeDL database. JeromeDL's FOAF-based user management also supports the creation of a personal bookshelf (sometimes called a "personal digital library" [31][33]), a tree-structured collection of bookmarks handled by the digital library system. Readers can link categories created and managed by friends into their own bookmark structure. Additionally, when a user registers with the JeromeDL system and wants to benefit from searching based on personal preferences (see 3.4), JeromeDL creates a personal profile of user activities in the system based on the profiles of the indicated friends. To identify the categories a user is interested in, information is collected on

• previously read books,
• resources linked in the personal bookshelf (bookmarks),
• annotated resources,
• highly evaluated resources.

Each resource is described by some categories. Based on the collected categories JeromeDL identifies the categories a reader is interested in (see Fig. 4).
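As an illustration of the interest collection described above, the following is a minimal sketch in Python. It assumes a simple frequency count over the categories of the four evidence sources; the paper does not state how the sources are weighted, so the equal weighting, the `Resource` class and the `favourite_categories` function are hypothetical names introduced only for this example.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class Resource:
    """A library resource with its catalog categories (hypothetical model)."""
    uri: str
    categories: list = field(default_factory=list)


def favourite_categories(read, bookmarked, annotated, highly_rated, top_n=2):
    """Count category occurrences across the four evidence sources and
    return the top_n most frequent categories (equal weights are an assumption)."""
    counts = Counter()
    for source in (read, bookmarked, annotated, highly_rated):
        for resource in source:
            counts.update(resource.categories)
    return [category for category, _ in counts.most_common(top_n)]


# Made-up resources, used only to exercise the function.
p2p = Resource("urn:example:1", ["distributed systems", "P2P"])
rdf = Resource("urn:example:2", ["semantic web"])
dl = Resource("urn:example:3", ["digital libraries", "semantic web"])

profile = favourite_categories(read=[p2p, rdf], bookmarked=[rdf],
                               annotated=[dl], highly_rated=[rdf, dl])
print(profile)
```

A resource that appears in several sources contributes once per source, so a book that was read, bookmarked and highly evaluated counts three times towards its categories.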

3.3 Browsing semantics and collaborative catalog - personal library

Catalogs in digital libraries help to organize resources according to the implemented classification schemata. One of the first and still popular classification schemata is the Dewey Decimal Classification [6]. On the basis of this scheme many new schemata have been created, e.g. the Universal Decimal Classification [21]. The assumption that everyone classifies resources the same way does not hold. Two different people can see the same concept in two different ways, which can often lead to confusion. This is why the concept of electronic bookmarks is so widely adopted on the Internet. Everyone can organize already seen resources the way he perceives the world [31].

To bookmark a resource, the reader has to find either the resource itself or another reader that has already done so. We assume that two friends share similar interests, so one can probably find the resource he is looking for in the bookmarks of his friends. The users of the digital library can browse the bookmarks of their friends and link some folders (categories) into their own structure. As a result one can have a catalog containing bookmarks created and maintained by others, whom he assumes to have higher expertise on the given subject (see Fig. 4). Readers can also state how similar their interests are to those of their friends. Each of the categories created by the reader has its own ACL (access control list) that defines which friends are able to see or use the content of this category. The ACL entries are based on the distance and similarity level between the issuer of the category and the user who wants to read the content of this category.
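The ACL check based on distance and similarity can be sketched as follows. This is only an illustration under stated assumptions: the paper does not specify how similarity levels are combined along a chain of friends, so the sketch multiplies them, and the function name `can_access`, the friendship dictionary and the thresholds are all hypothetical.

```python
def can_access(friends, owner, reader, max_distance, min_similarity):
    """Return True if some friendship path from owner to reader has at most
    max_distance hops and a product of similarity levels of at least
    min_similarity (multiplying the levels is an assumption of this sketch)."""

    def search(node, hops, similarity, visited):
        if node == reader:
            return similarity >= min_similarity
        if hops == max_distance:
            return False
        for friend, level in friends.get(node, {}).items():
            if friend not in visited and search(
                    friend, hops + 1, similarity * level, visited | {friend}):
                return True
        return False

    return search(owner, 0, 1.0, {owner})


# Hypothetical friendship graph: owner -> {friend: stated similarity level}.
friends = {
    "alice": {"bob": 0.9, "carol": 0.4},
    "bob": {"dave": 0.8},
    "carol": {"dave": 0.9},
}

print(can_access(friends, "alice", "dave", max_distance=2, min_similarity=0.5))
print(can_access(friends, "alice", "dave", max_distance=1, min_similarity=0.5))
```

In this made-up graph, "dave" can read a category issued by "alice" through "bob" when two hops are allowed, but not when the ACL entry limits the distance to direct friends.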

Figure 4: The user's profile in JeromeDL

3.4 Search algorithm based on semantic description

Since the content of the digital library is constantly being expanded, there will always be resources in the digital library that have not been classified in a personal bookshelf. For this reason a search over the resources must be possible. The search algorithm of JeromeDL consists of three major steps (see Fig. 5). Each step requires different metadata sources that describe the resources.

step A – the first step is the fulltext index search on the resources' contents and users' annotations on resources,

step B – the next step is the bibliographic description search consisting of MARC21 and BibTEX formats,

step C – the last step is a user-oriented search with semantics, based on the semantic description of the resources and information about the categories that the user who issued the query is most interested in.

Figure 5: The semantic search algorithm phases

Query object

When issuing a query to JeromeDL a reader has to submit a query object (usually using an HTML form). An example of a query object that holds the information provided by the user is presented in Figure 6. Each query contains several entries which state what information the search algorithm should look for in the resource description. The reader can choose from the Dublin Core Metadata properties, the jeromedl:keywords property and a special property that indicates the content of the resource (fulltext). Each property contains the list of possible values that the user expects to find in the description of the resource. The user can specify which values are required in the description of the resource (mustExist), and which should not exist in the description (mustNotExist). Additionally each value may have a ranking value specified (ranking), so results containing a desired value are ranked higher. The value can be either a single word or a phrase. In the latter case, it is possible to define the maximum distance that the desired words may have in the description (proximity).

    IsSemanticQuery   true
    IsConjunction     false
    property          name="keywords"
                        value=P2P (mustExists)
                        value="Semantic Web" (ranking=10)
    property          name="category"
                        value=AI (mustNotExists)
    fulltext          value="semantic routing" (proximity=4)

Figure 6: The semantic search query object

In addition to the information on what the algorithm should look for, the reader can specify whether the query expansion with semantics should be performed (IsSemanticQuery) and whether all the values of the properties should be treated as a conjunctive query (IsConjunction). The structured nature of the query object makes it possible to represent the information in XML format, so the query can easily be supplied by a web service (see 3.5).
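To make the structure of the query object concrete, the following sketch models the example of Figure 6 in Python and serializes it to XML. The element and attribute names in the XML are assumptions made for this example only; they are not the actual ELP schema. The sketch also treats the fulltext entry as a property named "fulltext", following the description above.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Value:
    """One expected value with its optional modifiers."""
    text: str
    must_exist: bool = False
    must_not_exist: bool = False
    ranking: Optional[int] = None
    proximity: Optional[int] = None


@dataclass
class Property:
    name: str
    values: list = field(default_factory=list)


@dataclass
class QueryObject:
    is_semantic_query: bool
    is_conjunction: bool
    properties: list = field(default_factory=list)

    def to_xml(self) -> str:
        # Element and attribute names are hypothetical, not the ELP schema.
        root = ET.Element("query", {
            "isSemanticQuery": str(self.is_semantic_query).lower(),
            "isConjunction": str(self.is_conjunction).lower(),
        })
        for prop in self.properties:
            prop_el = ET.SubElement(root, "property", {"name": prop.name})
            for value in prop.values:
                value_el = ET.SubElement(prop_el, "value")
                value_el.text = value.text
                if value.must_exist:
                    value_el.set("mustExist", "true")
                if value.must_not_exist:
                    value_el.set("mustNotExist", "true")
                if value.ranking is not None:
                    value_el.set("ranking", str(value.ranking))
                if value.proximity is not None:
                    value_el.set("proximity", str(value.proximity))
        return ET.tostring(root, encoding="unicode")


# The query object of Figure 6.
query = QueryObject(
    is_semantic_query=True,
    is_conjunction=False,
    properties=[
        Property("keywords", [Value("P2P", must_exist=True),
                              Value("Semantic Web", ranking=10)]),
        Property("category", [Value("AI", must_not_exist=True)]),
        Property("fulltext", [Value("semantic routing", proximity=4)]),
    ],
)
xml_text = query.to_xml()
print(xml_text)
```

Keeping the modifiers as optional attributes of each value means that a plain keyword query needs nothing beyond the value text itself.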

Result object

Figure 7 gives an example of a result object. A result object contains information about the resources that have been found and additional information on the query process that has been executed.

    resource
        uri=http://jeromedl.org/show?id=2134556
        title="EDUTELLA: A P2P Networking Infrastructure Based on RDF"
        author="Wolfgang Nejdl, Boris Wolf, Changtao Qu, Stefan Decker, Michael Sintek, Ambjörn Naeve, Mikael Nilsson, Matthias Palmer, Tore Risch"
        categories=[distributed systems, ...]
        keywords=[P2P, RDF]
        summary="Metadata for the World Wide Web is important, but metadata for P2P"
        bookType=pdf
        hits=3
    resource
        ...
    ...
    info
        "Category value semantic web is to general, try to select more specific to get better results"
    info
        ...
    ...

Figure 7: The semantic search result object

Each of the resources is described with

• the URI of the resource,
• title and authors,
• categorizations and keywords,
• summary (digest),
• information on the type of the resource (like XSL:FO, SWF, an antique book's page scans), and
• the ranking of the resource in this result set.

During the search process there are some situations that the user should be informed about, e.g. that the reader has provided keywords that were too general and that these have automatically been made more specific to match her interests. This information is also included in the result object (see Fig. 7).

Search algorithm with semantics

The search algorithm with semantics implemented in the JeromeDL system (see pseudocode in Fig. 8) processes the query object according to the flow depicted in Fig. 5 and returns a set of result objects. JeromeDL's search algorithm was designed with the following goals in mind:

• the query should return resources whose descriptions do not directly contain the required values, e.g. resources about Californian Condors should be returned as an answer to the query "carnivores in the U.S."[9];

• the meaning of values provided in the query should be resolved in the context of the user's interests, e.g. a teenager looking for "a star" most likely has something different in mind than a person interested in astronomy.

These goals can be achieved by combining fulltext search with searching the bibliographic descriptions and semantic descriptions of resources. The semantically enabled search phase includes the preparation of the RDF query and query extrapolation based on the user's interests.

Fulltext search. The FULLTEXT_QUERY procedure (phase A) searches for the resources by using a full-text index of the content and the digest of the resources. If the text index is incomplete (e.g. the resource is a collection of scanned pages of an old book), the annotations provided by the readers are also taken into account. The reader can specify in his profile the boundaries of the network of friends whose annotations should be used during the search process.

Bibliographic descriptions. To provide support for legacy bibliographic description formats like MARC21 or BibTEX, a digital library system needs to utilize the information provided by these descriptions. The JeromeDL system manages these descriptions in XML formats, so the second phase (phase B) of the searching process requires an XQuery performed on the XML database. The query on the XML:DB engine is performed by the XMLQ procedure, which uses different templates for searching in the MARC-XML and BibTEXML formats.

RDF query templates. In the last phase (phase C) of the search process the RDFQ procedure performs a query on the RDF storage. A keyword search does not directly translate to a query on the RDF storage. The reader uses literals but does not know where in the resource's description these literals appear, since the reader is shielded from the complexities of the RDF representation. So a list of RDQL templates is used to find the required information. There are 4 possible situations that are covered by the template matching algorithm:

• rdf:Resource[type:A] --B--> rdfs:Literal[=C].
  SELECT ?s WHERE (?s, <rdf:type>, <A>), (?s, <B>, 'C')

• rdf:Resource[type:A] --B--> rdf:Resource[type:C] --jeromedl:name--> rdfs:Literal[=D].
  SELECT ?s WHERE (?s, <rdf:type>, <A>), (?s, <B>, ?z), (?z, <rdf:type>, <C>), (?z, <jeromedl:name>, 'D')

• rdf:Resource[type:A] --B--> rdfs:Container[type:C] --> rdfs:Literal[=D].
  SELECT ?s WHERE (?c, <rdf:type>, <C>), (?c, ?i, 'D'), (?s, <rdf:type>, <A>), (?s, <B>, ?c)

• rdf:Resource[type:A] --B--> rdfs:Container[type:C] --> rdf:Resource[type:D] --jeromedl:name--> rdfs:Literal[=E].
  SELECT ?s WHERE (?z, <rdf:type>, <D>), (?z, <jeromedl:name>, 'E'), (?c, <rdf:type>, <C>), (?c, ?i, ?z), (?s, <rdf:type>, <A>), (?s, <B>, ?c)

However, information provided by the reader in the query object may not contain the exact values of the Literal properties, e.g. the user might search for the author providing the forename only. To handle such incomplete queries, the semantic data is also indexed with the fulltext indexing engine. The RDF query process is performed in two steps:

1. the fulltext query name:"incomplete" is performed; the result is a URI of the RDF resource that has been indexed in the fulltext engine.

2. a modified set of templates is applied in the RDF query: each entry (?z, , 'incomplete') is replaced with (?z, , ).

The two-step approach speeds up the search process, since the RDF queries do not contain literal comparisons. The query processing is also less dependent on the lexical variations of the provided literals.

Because an ontology described with RDF Schema cannot state what type of resources an rdfs:Container object holds, the information about each property is stored in a map (see Table 1). This makes it possible to determine quickly which of the templates to use and how to fill them in with the proper information. If the last of the entries specified by the mapping is not an rdfs:Literal instance, it is assumed that each resource has a property jeromedl:name that points to the appropriate rdfs:Literal instance. This assumption is always true within the Jerome ontology. This approach makes it easy to include more semantic description sources during the search process.

Table 1: Sample entries in the property description map

    property               description
    title                  Literal
    publisher              Publisher
    keywordLexicalForms    Container, Literal
    author                 Container, Author

    procedure SEMANTIC_SEARCH ( QO ) : RO
        // – fulltext search – phase A –
        RO.results ← FULLTEXT_QUERY(QO.fulltext);
        // – properties search –
        for each p ∈ QO.properties do
        begin
            if p.name == "keyword" then
                for each v ∈ p.values do
                    v ← GET_SIMPLE_FORM(v);
            RO ← FIND_RESOURCES(p);
        end
    end procedure SEMANTIC_SEARCH

    procedure FIND_RESOURCES ( property ) : RO
        // – phase B –
        RO.results ← XMLQ(property, Type.Marc21);
        RO.results ← RO.results ∨ XMLQ(property, Type.BibTEX);
        // – phase C –
        RO.results ← RO.results ∨ RDFQ(property);
        if not SizeOf(RO.results) ∈ <MIN, MAX> then
            RO ← EXPAND_QUERY(property, RO);
    end procedure FIND_RESOURCES

    procedure EXPAND_QUERY ( property, RO ) : RO
        arrayOfValues ← FIND_CONCEPTS(property.values, property.name, SizeOf(RO.results));
        arrayOfCategories ← FAVOURITE_CATEGORIES(Context.USER);
        for each v ∈ arrayOfValues do
            if not v.isInCategory(arrayOfCategories) then
                removeFrom(arrayOfValues, v);
        property.values ← arrayOfValues;
    end procedure EXPAND_QUERY

    procedure FIND_CONCEPTS ( values, p_name, size ) : values
        select property for expansion p from p_name and size
            (size < MIN or size > MAX; see Table 2)
        values ← values·p;
    end procedure FIND_CONCEPTS

Figure 8: Pseudocode for the query processing
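The role of the property description map in choosing an RDQL template can be sketched as follows. This is a hedged illustration, not the JeromeDL implementation: the map entries are the sample entries of Table 1, while the function name `build_rdql`, the `jeromedl:` prefix on property and type names, the resource type `Book`, the container type `rdf:Bag`, and the use of `rdf:type` for the type constraints are all assumptions made for this example. Literal escaping is omitted.

```python
# Sample entries of the property description map (Table 1).
PROPERTY_MAP = {
    "title": ["Literal"],
    "publisher": ["Publisher"],
    "keywordLexicalForms": ["Container", "Literal"],
    "author": ["Container", "Author"],
}


def build_rdql(resource_type, prop, literal, ns="jeromedl", container_type="rdf:Bag"):
    """Pick one of the four templates from the shape of the map entry and
    fill it in. Names and prefixes are hypothetical; no escaping is done."""
    path = PROPERTY_MAP[prop]
    via_container = path[0] == "Container"
    target = path[-1]
    ends_in_literal = target == "Literal"

    # What the property (or the container member) points to.
    member = f"'{literal}'" if ends_in_literal else "?z"

    patterns = [f"(?s, <rdf:type>, <{ns}:{resource_type}>)"]
    if via_container:
        patterns.append(f"(?s, <{ns}:{prop}>, ?c)")
        patterns.append(f"(?c, <rdf:type>, <{container_type}>)")
        patterns.append(f"(?c, ?i, {member})")
    else:
        patterns.append(f"(?s, <{ns}:{prop}>, {member})")

    if not ends_in_literal:
        # Non-literal targets are matched through their jeromedl:name property.
        patterns.append(f"(?z, <rdf:type>, <{ns}:{target}>)")
        patterns.append(f"(?z, <{ns}:name>, '{literal}')")

    return "SELECT ?s WHERE " + ", ".join(patterns)


title_query = build_rdql("Book", "title", "Edutella")
author_query = build_rdql("Book", "author", "Stefan Decker")
print(title_query)
print(author_query)
```

Because the template is derived from the shape of the map entry, describing a new property only requires adding an entry to the map, which matches the extensibility argument made above.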

Semantically enabled query extrapolation. If the size of the result set is outside the predefined range <MIN, MAX>, the EXPAND_QUERY procedure is called.

Table 2: The semantic query expansion properties

    property.name    size < MIN         size > MAX
    keyword          hypernym           hyponym
                     synonym            homonym
                     semantic field     categorization
    author           otherNames         categorization
    publisher        categorization
    category         super-category     subcategory

The semantically enabled query extrapolation phase can be considered a tailoring phase. Many of the resources that fulfill the query object have been found in the previous two phases of the search algorithm. However, some relevant resources have been omitted, whereas some non-relevant resources have been added to the result of the first two phases. The information about the readers' interests and the semantic descriptions of the resources is exploited to tailor the result set by performing three kinds of actions:

• remove the resources that do not satisfy the reader's requirements,
• add the resources that were omitted in the previous phases,
• rank the resources in the result set to best suit the reader's interests.

The query expansion replaces the original values of the given property in the query object with values retrieved from querying the semantic description of the resources (see Table 2), e.g. a keyword can be changed to the list of its synonyms. The query expansion is performed iteratively. The decision which property to choose for the query expansion in an iteration depends on the number of results (size) received from the previous iteration. The list of possible values that could replace the original values in the query object is then tailored. The categorization statistics that were collected in the reader's profile (see Fig. 4 and Fig. 5) are taken into account. The resources that have been bookmarked, annotated or evaluated by the user provide additional input to the statistical analysis of the reader's preferences.
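The iterative expansion can be illustrated with a small sketch that reuses the "star" example from the goals of the search algorithm. It is not the JeromeDL implementation: the `expand_query` function, the toy index, the concept dictionary and the rule of cycling through the candidate relations round by round are assumptions of this example, and only the keyword and category rows of Table 2 are included.

```python
# Subset of Table 2: relation to follow when the result set is too small or too large.
EXPANSION = {
    "keyword": {"too_few": ["hypernym", "synonym", "semantic field"],
                "too_many": ["hyponym", "homonym", "categorization"]},
    "category": {"too_few": ["super-category"],
                 "too_many": ["subcategory"]},
}


def expand_query(values, prop_name, search, concepts, favourites,
                 min_size, max_size, max_rounds=3):
    """Replace the query values with related concepts until the result size
    falls into [min_size, max_size]. Candidates outside the reader's
    favourite categories are dropped. Cycling through the relations by
    round number is an assumption of this sketch."""
    results = search(values)
    for round_no in range(max_rounds):
        if min_size <= len(results) <= max_size:
            break
        direction = "too_few" if len(results) < min_size else "too_many"
        options = EXPANSION[prop_name][direction]
        relation = options[round_no % len(options)]
        candidates = [c for v in values for c in concepts.get((v, relation), [])]
        kept = [term for term, category in candidates if category in favourites]
        if not kept:
            break
        values = kept
        results = search(values)
    return values, results


# Toy data: a keyword index and a tiny concept map of (term, category) pairs.
index = {"star": {"u1", "u2", "u3", "u4", "u5"},
         "sun": {"u1"},
         "celebrity": {"u4", "u5"}}
concepts = {("star", "hyponym"): [("sun", "astronomy"),
                                  ("celebrity", "entertainment")]}


def search(values):
    """Union of the index entries for all values."""
    return set().union(*(index.get(v, set()) for v in values))


values, results = expand_query(["star"], "keyword", search, concepts,
                               favourites={"astronomy"}, min_size=1, max_size=2)
print(values, results)
```

For a reader whose profile favours astronomy, the overly general keyword "star" is narrowed to "sun", while the entertainment sense is filtered out by the favourite categories.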

Extrapolated profile. In many cases, especially when the reader has just registered with the JeromeDL system, the profile information is likely to be incomplete. To provide a search experience that suits new users as well, JeromeDL extrapolates user profiles. Each user defines a list of friends with the FOAF vocabulary, which places her in a network of friendships. We assume that the closer other readers are with respect to friendship connections, the more similar their interests are. Thus a reader’s profile consists not only of her own activities but, to some extent, also of the activities of her friends. With the concept of the extrapolated user profile, even new users have domain-of-interest information available. During the search process, the search engine is able to exploit categories defined by the reader’s friends.

3.5 Searching in the distributed environment of digital libraries network

A recent trend in digital libraries is to connect multiple digital libraries into federations. Each digital library, apart from delivering discovery and navigation features, provides the ability to search other digital library systems. JeromeDL supports federated digital libraries by providing a communication infrastructure for a distributed network of independent digital libraries (L2L) [38][23], similar to communication in a P2P network. Different digital libraries can use different metadata standards. This also implies a variety of available protocols: Z39.50, DIENST [5], OAI [3], SDLIP [2]. The solution implemented in JeromeDL exploits the fact that, during the search process, all information can be managed in XML form, from the query object to the result object (see Section 3.4). Deploying XML enabled us to build a prototype of a SOAP-based protocol, the Extensible Library Protocol (ELP) [23]. The use of Web Services for building the P2P network of digital libraries will make it possible to connect JeromeDL in the future to ongoing projects like OCKHAM (http://www.ockham.org/) [1]. The idea of ELP is to allow communication in the heterogeneous environment of digital libraries. Each library has to know about at least one other digital library in order to connect to the L2L network. The minimal requirement imposed on a digital library is support for at least DublinCore metadata. If two digital libraries describe their resources with semantics, as JeromeDL does, the communication between them is automatically upgraded to the semantic description level, which allows the search algorithm with semantics to be used in L2L communication.

4. EVALUATION OF THE SEARCH ALGORITHM WITH SEMANTICS

The aim of the search algorithm presented in the previous section is to reflect the readers’ expectations and to reduce the time required to find the specified resources in JeromeDL. An evaluation of the search algorithm needs to cover both computable efficiency measures and the users’ level of satisfaction.

4.1 Definition of evaluation experiment for the search algorithm

The JeromeDL search algorithm utilizes three types of information:
• implicit descriptions, including semantic descriptions;
• descriptions provided by readers: annotations, personal bookshelves, history of usage;
• information about relations between readers.

To evaluate the whole search subsystem of JeromeDL, we propose a staged experiment that covers all aspects of usability. In each experiment, the efficiency measures precision, recall and waste are computed. The experiment scenario consists of the following stages:
1. Fill the database of the JeromeDL system with at least 100 resources and provide MARC21, BibTeX and semantic descriptions.
2. Present the system to the users, who will browse the categories that interest them.
3. Experiment 1: Readers query the system twice: with and without the semantic query expansion. Using the knowledge of the digital library’s content gained during browsing, they calculate the metrics precision, recall and waste for each query result.
4. Readers register with the JeromeDL system and continue browsing its content, annotating some resources and creating personal bookshelves.
5. Experiment 2: Readers perform the queries once again, compute the metrics and compare them to the metrics obtained in experiment 1.
6. Each reader indicates his or her friends registered in the JeromeDL system.
7. Readers assign ACLs to the categories in their personal bookshelves and link categories created by their friends into their own personal bookshelves.
8. Experiment 3: Readers perform the queries for the last time and compare the results with those of the previous experiments.
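The efficiency measures computed in each experiment can be sketched as a small Python helper. Precision and recall follow their standard definitions. The text does not define waste, so the formula below (the share of retrieved resources that are not relevant) is an assumption made for illustration; the definition in [34] may differ. The function name and the sample identifiers are hypothetical.

```python
from typing import Dict, Iterable


def evaluate(retrieved: Iterable[str], relevant: Iterable[str]) -> Dict[str, float]:
    """Compute precision, recall and (assumed) waste for one query result."""
    retrieved_set, relevant_set = set(retrieved), set(relevant)
    hits = len(retrieved_set & relevant_set)
    # Precision: fraction of retrieved resources that are relevant.
    precision = hits / len(retrieved_set) if retrieved_set else 0.0
    # Recall: fraction of relevant resources that were retrieved.
    recall = hits / len(relevant_set) if relevant_set else 0.0
    # ASSUMPTION: waste is taken as the fraction of retrieved resources
    # that are not relevant; this is not confirmed by the source.
    waste = (len(retrieved_set) - hits) / len(retrieved_set) if retrieved_set else 0.0
    return {"precision": precision, "recall": recall, "waste": waste}


# Hypothetical query result compared against a known expected result set.
metrics = evaluate(retrieved=["a", "b", "c", "d"], relevant=["a", "b", "e"])
```

Running the helper once for a query with the semantic phase and once without it gives the pair of values that the experiments compare.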

4.2 Simple efficiency measurement

In order to measure the improvement in efficiency of the semantically enabled search algorithm, the database of the prototype system was filled with 100 resources. Each resource was described with a semantic description according to the Jerome ontology and indexed in the Lucene indexing engine (if the textual content of the resource was available). After a short period of browsing through the JeromeDL catalog, we performed the first experiment, consisting of 50 queries. Each query was processed with and without the semantic phase. In each case the expected result set was known.

To evaluate the gain in efficiency produced by the semantic phase of the search process, three metrics were calculated: precision, recall and waste [34].

The results show that the semantic phase of the search algorithm improves the results by 60% compared to the search process without the semantic (user-oriented tailoring) phase (see Fig. 9).

Figure 9: The simple efficiency measurement of the search algorithm with semantics

4.3 Future planned evaluations

On the basis of the cooperation between the Main Library of the Gdansk University of Technology (GUT) and DERI.International, two instances of JeromeDL will be connected by ELP. The GUT instance contains mainly antique books and abstracts of some books published by the local publisher. The DERI.International instance will contain mostly articles written by DERI researchers. Together, the two databases will hold more than 200 resources. Some of the resources will be described with MARC21 (mainly the GUT database content) or BibTeX. The antique books will be indexed with the fulltext index engine only on the summary fields. During the extended evaluation, a group of people from both collaborating institutes will use the system for 3 weeks, following the presented scenario. At the end of each week they will perform the experiments. After the evaluation, the participants will answer a simple questionnaire that gives an overview of the subjective usability of the JeromeDL system.

5. FUTURE WORK

The evaluation of the JeromeDL search algorithm revealed that the results depend strongly on the semantic parts of the resources’ descriptions. This leads to the conclusion that a better quality of the semantic description will result in a higher efficiency of the search process. To increase the benefits of using semantic descriptions, the MarcOnt initiative (http://www.marcont.org/) [15] has been started. The goal of this initiative is to provide an ontology that would finally unite the digital library and Semantic Web worlds by defining concepts and relations that uphold the legacy of widely known bibliographic description formats like MARC21 or BibTeX. The requirement for DublinCore metadata in distributed searching (L2L networks) will make it possible to connect the ELP-based network of digital libraries to OAI (Open Archives Initiative). The work started by the Library of Congress on MARC21-based web services gives reason to expect that this technology will also enable communication with ELP-based digital libraries. To simplify the use of the distributed environment of digital libraries, current work aims to connect the L2L network to e-Learning environments [16], online communities and P2P networks. To overcome the problems that can arise in a P2P network of digital libraries (L2L network), semantic routing algorithms can be applied. Possibilities include HyperCuP [29][30] and categorization-based multicasting. These would also improve the scalability of the L2L network by limiting the required bandwidth in the nodes [27][26].

6. CONCLUSIONS

In this paper we presented JeromeDL, a digital library that deploys Semantic Web technology for user management and search. The FOAF vocabulary is used to gather information for user profile management, and semantic descriptions are utilized in the search procedure. JeromeDL is actively deployed in several installations and is continually enhanced with semantic features. JeromeDL is implemented in Java and available under an open-source licence (see footnote 12). Parties interested in setting up JeromeDL are invited to join our library P2P network.

12 http://elvis-dl.sf.net

7. REFERENCES

[1] Ockham initiative grant proposal. http://wiki.osuosl.org/download/attachments/527/ockham.pdf.
[2] The simple digital library interoperability protocol. http://www-diglib.stanford.edu/ testbed/doc2/SDLIP/.
[3] The open archives initiative protocol for metadata harvesting. http://www.openarchives.org/OAI/openarchivesprotocol.html, February 2003.
[4] S. Brin and L. Page. The anatomy of a large-scale hypertextual Web search engine. Computer Networks and ISDN Systems, 30(1–7):107–117, 1998.
[5] J. Davis, D. Fielding, C. Lagoze, and R. Marisa. Dienst protocol specification. http://www.cs.cornell.edu/cdlrg/dienst/protocols/DienstProtocol.htm.
[6] M. Dewey. A Classification and Subject Index for Cataloguing and Arranging the Books and Pamphlets of a Library Dewey Decimal Classification. guternberg.net, http://www.gutenberg.net/catalog/world/ readfile?fk files=59063, 2004.
[7] DublinCore Initiative, http://dublincore.org/documents/dces/. Dublin Core Metadata Element Set, Version 1.1: Reference Description.
[8] C. Fellbaum. WordNet: an electronic lexical database, 1998.
[9] M. Frauenfelder. A smarter web. Technology review, Ontoprise, http://www.ontoprise.de/documents/A Smarter Web.pdf, November 2001.
[10] J. Frew, M. Freeston, N. Freitas, L. L. Hill, G. Janee, K. Lovette, R. Nideffer, T. R. Smith, and Q. Zheng. The Alexandria digital library architecture. In Proceedings of the Second European Conference on Research and Advanced Technology for Digital Libraries, pages 61–73. Springer-Verlag, 1998.
[11] D. A. Fritz and R. Fritz. MARC21 for Everyone: a practical guide. The American Library Association, 2003.
[12] G. Koutrika and Y. Ioannidis. Rule-based query personalization in digital libraries. International Journal on Digital Libraries, (4):60–63, July 2004.
[13] S. R. Kruk. Advanced search and browsing in digital libraries. In ISWC, 2004.
[14] S. R. Kruk. Foaf-realm - control your friends’ access to the resource. In FOAF Workshop proceedings, http://www.w3.org/2001/sw/Europe/events/foafgalway/papers/fp/foaf realm/, 2004.
[15] S. R. Kruk. Marcont initiative. Technical report, DERI.Galway, Ireland, http://www.marcont.org/, 10 2004. Bibliographic description and related tools utilising Semantic Web technologies.
[16] S. R. Kruk, A. Kwoska, and L. Kwoska. Metadito - multimodal messanging platform for e-learning. In International Workshop on Intelligent Media Technology for Communicative Intelligence, pages 84–87. Polish-Japanese Institute of Information Technology, PJIIT - Publishing House, 2004.
[17] S. R. Kruk and M. Synak. Jeromedl - e-library with semantics. Technical report, DERI.NUIG - Ireland; Gdansk University of Technology - Poland, http://www.jeromedl.org/, 09 2004.
[18] C. Lam. Snack: incorporating social network information in automated collaborative filtering. In Proceedings of the 5th ACM conference on Electronic commerce, pages 254–255. ACM Press, 2004.
[19] L. Lamport. LaTeX: A Document Preparation System. Addison-Wesley, 1986.
[20] N. Lossau. Search engine technology and digital libraries. D-Lib Magazine, 10(5), June 2004.
[21] I. Mcilwaine. The Universal Decimal Classification: guide to its use, volume no P035 of UDC Publication. The Hague: UDC Consortium, 2000.
[22] D. I. Moldovan and R. Mihalcea. Using wordnet and lexical operators to improve internet searches. IEEE Internet Computing, 4(1):34–43, January 2000.
[23] M. Okraszewski and H. Krawczyk. Semantic web services in l2l. In T. Klopotek, Wierzchon, editor, Intelligent Information Processing and Web Mining, pages 349–357. Polish Academy of Science, Springer, May 2004. Proceedings of the International IIS: IIPWM’04 Conference held in Zakopane, Poland, May 17-20, 2004.
[24] I. O’Murchu, J. Breslin, and S. Decker. Community portal survey. DERI Research Report - Semantic Web Portal Project D9, DERI.Galway, April 2004.
[25] I. O’Murchu, J. G. Breslin, and S. Decker. Online social and business networking communities. In Proceedings of the Workshop on the Application of Semantic Web Technologies to Web Communities, Valencia, Spain, August 2004. 16th European Conference on Artificial Intelligence 2004 (ECAI 2004).
[26] Paul Perry. Scalable p2p search. http://www.paulperry.net/notes/p2p.asp.
[27] J. Ritter. Why gnutella can’t scale. no, really. http://www.darkridge.com/ jpr5/doc/gnutella.html, February 2001.
[28] E. T. S. Greene, C. Plaisant, B. Shneiderman, L. Olsen, G. Major, and S. Johns. The end of zero-hit queries: Query previews for nasa’s global change master directory. International Journal on Digital Libraries, 2(2-3):79–90, 1999.
[29] M. Schlosser, M. Sintek, S. Decker, and W. Nejdl. Ontology-based search and broadcast in hypercup. In International Semantic Web Conference, Sardinia, http://www-db.stanford.edu/ schloss/docs/HyperCuPPosterAbstract-ISWC2002.pdf, 2002.
[30] M. Schlosser, M. Sintek, S. Decker, and W. Nejdl. Hypercup–hypercubes, ontologies and efficient search on p2p networks. In Third International Workshop on Agents and Peer-to-Peer Computing, July 2004.
[31] J. W. Schmidt, G. Schrder, C. Niedere, and F. Matthes. Linguistic and architectural requirements for personalized digital libraries. International Journal on Digital Libraries, 1:89–104, 1997.
[32] B. Shneiderman, D. Byrd, and W. B. Croft. Clarifying search: a user-interface framework for text searches. D-Lib Magazine, January 1997.
[33] M. Sugimoto, N. Katayama, and A. Takasu. A system for constructing private digital libraries through information space exploration. International Journal on Digital Libraries, 2(1):54–66, October 1998. ISSN: 1432-5012 (Paper) 1432-1300 (Online).
[34] P. C. Weinstein and W. P. Birmingham. Creating ontological metadata for digital library content and services. International Journal on Digital Libraries, 2(1):20–37, October 1998. ISSN: 1432-5012 (Paper) 1432-1300 (Online).
[35] Wikipedia. Dewey decimal classification. http://en.wikipedia.org/wiki/Dewey Decimal Classification.
[36] Wikipedia. Open directory project. http://en.wikipedia.org/wiki/Open Directory Project, 2004.
[37] Wikipedia. Yahoo! http://en.wikipedia.org/wiki/Yahoo%21, 2004.
[38] J. Zieliński, S. R. Kruk, and H. Krawczyk. Usługi webowe dla zastosowań l2l (web services for l2l). Technologie Informacyjne Zeszyty Naukowe Wydziału Elektroniki, Telekomunikacji i Informatyki (Information Technologies, Lecture Notes Faculty of Electronics, Telecomunications and Computer Science), 1(1):155–163, 2003.