Using Linked Open Data in Recommender Systems

Ladislav Peska
Faculty of Mathematics and Physics, Charles University in Prague
Malostranske namesti 25, Prague, Czech Republic
[email protected]

Peter Vojtas
Faculty of Mathematics and Physics, Charles University in Prague
Malostranske namesti 25, Prague, Czech Republic
[email protected]

ABSTRACT
In this paper, we present our work in progress on using LOD to enhance recommendation on existing e-commerce sites. We consider an e-commerce website employing content-based or hybrid recommendation. Such recommending algorithms need relevant object attributes to produce useful recommendations. However, in some domains, usable attributes may be difficult to fill in manually, yet they are accessible from the LOD cloud. A pilot study was conducted on the domain of secondhand bookshops. In this domain, recommending is extraordinarily difficult because of the high ratio of objects to users, the lack of significant attributes and the limited availability of items. The applicability of both collaborative filtering and content-based recommendation is questionable under these conditions. We queried both the Czech and English language editions of DBPedia to obtain additional information about objects (books) and used various recommending algorithms to learn user preferences. Our approach is general and can be applied to other domains as well. The proposed methods were tested in an off-line recommendation scenario with promising results; however, many challenges remain for future work, including more complex algorithm analysis, improving the SPARQL queries, and improving the DBPedia matching rules and resource identification.

Categories and Subject Descriptors
H.3.3 [Information Systems]: Information Search and Retrieval - Information Filtering

General Terms
Measurement, Human Factors, Experimentation.

Keywords
Recommender systems, Linked Open Data, DBPedia, content-based attributes, e-commerce, VSM, CBMF.

1. INTRODUCTION
Recommending on the web is both an important commercial application and a popular research topic. The amount of data on the web grows continuously and it is virtually impossible for a human to process it directly. The volume of trading activity on the web has also been increasing steadily for several years.

Various tools, ranging from keyword search engines to e-shop parameter search, were adopted to decrease information overload. Although such tools are definitely useful, users must specify in detail what they want. Recommender systems are complementary to this scenario, as they mostly focus on serendipity: finding surprising, unknown but interesting items and presenting them to the user.

Our aim is to enhance recommender systems in e-commerce domains, more specifically for small to medium enterprises without a dominant position on the market. This brings unique challenges such as users' unwillingness to register, scarcity of explicit feedback, low consumption rate, low user loyalty, a small number of visited objects, etc. Thus we focus on learning from implicit user feedback, algorithms with a fast learning curve, using additional content information, etc.

Many recommender systems, algorithms and methods have been presented so far. Initially, the majority of research effort was spent on collaborative systems and explicit user feedback. Collaborative recommender systems, however, suffer greatly from insufficient data (referred to as the cold start problem). The importance of the cold start problem rises with higher fluctuation of either objects or users; as for objects, news servers or auction servers are common examples. Yet another challenge is posed by domains where only a single piece of each product is available (e.g. real estate agencies, secondhand shops in general, or dating sites), because feedback received on an already sold item cannot be reused without some measure of inter-object similarity. High user fluctuation and low loyalty are common working conditions for the majority of small e-commerce websites: users tend to buy only a single product and often never come back. The recommendation task under such conditions becomes extremely challenging, but on the other hand, well-designed and informative recommendations may provide the competitive advantage needed to distinguish our service from others.

1.1 Our motivation
Despite the widespread adoption of recommender systems, there are still domains where creating useful recommendations is very difficult:

- Auction servers or used-goods shops often have only one object of a given type available, which prevents us from applying collaborative filtering directly.
- For some domains, e.g. films, news articles or books, it is difficult to define and maintain all important attributes, which hinders content-based methods.
- Websites with a relatively high ratio of #objects to #users will generally suffer more from the cold start and new user problems.

For our paper, we chose the secondhand bookshop domain, which includes all the above-mentioned difficulties. The main problem of the book domain is that similarity between books is often hidden in a vast number of attributes (characters appearing, location, art form or period, similarity of authors, writing style, etc.). Although these attributes are difficult to design and fill in, it is not impossible. However, in most cases only one copy of a book is available in stock, so creating a new record must be fast and simple enough that a potential purchase can eventually cover the cost of the work. Collaborative recommender systems can be used, but their efficiency is hindered both by the high ratio of #objects to #users and by the fact that each object can be purchased only once. In our previous work [12] we presented an effective method for querying DBPedia in order to enhance content-based object attributes and improve recommendation accuracy. Our main goal for this stage of the ongoing work is to evaluate different LOD data sources, queries and ways of incorporating RDF into the object attributes, as well as to experiment with various algorithms using this data.

This paper follows on from and extends our previous work [12] by using different data sources and more suitable recommending algorithms.

1.2 Main contribution
The main contributions of this paper are:

- Proposing a system for on-line querying of Linked Open Data to enrich objects' content information.
- Experiments with a real-world e-commerce dataset evaluating the added value of the mined data.
- Identifying key problems hindering the development of LOD-enhanced recommender systems.

The rest of the paper is organized as follows: a review of related work is given in section 2. In section 3 we describe how to incorporate LOD into recommender systems in general, and section 4 discusses the specific needs of the secondhand-bookshop domain. Section 5 presents experiments held on a Czech secondhand bookshop, the selection of recommending algorithms and the evaluation scenario, and discusses the results. Finally, section 6 concludes our paper and points to some future work.

2. RELATED WORK
The Semantic Web (especially the Linked Open Data cloud), with its continuously increasing amount of data, is a great opportunity to enrich information and to find connections among the various objects of our recommender system. Whether it is the similarity of two holiday destination reviews or the parameters of a recent product, the answers might be reachable through the LOD. However, there are still numerous limitations, such as missing statistics about LOD, searching among LOD datasets, or the reliability and recency of the data.

Although the areas of both recommender systems and the Semantic Web have been extensively studied, papers involving both topics were not very common until quite recently. There were several approaches to building a recommender system for a particular domain based solely on LOD datasets. Di Noia et al. [1] aimed to develop a content-based recommender system for the movie domain based solely on three different LOD datasets. Similarly, A. Passant [9] developed dbRec, a music recommender system based on the DBPedia dataset. Among other work connecting the areas of recommender systems and LOD, we can mention the paper by Heitmann and Hayes [6], using Linked Data to cope with the new-user, new-item and sparsity problems, and Golbeck and Hendler's work on FilmTrust [5]: movie recommendations enhanced by FOAF trust information from a semantic social network. A very stimulating recent event was the LOD-Enabled Recommender Systems Challenge held as a part of the ESWC 2014 conference [2], which led to numerous interesting ideas in the field of connecting the Semantic Web and recommender systems, e.g. [11], [14].

3. LINKED OPEN DATA AND RECOMMENDER SYSTEMS
In this section, we briefly describe our architecture for collecting LOD and its usage in a recommender system. The architecture itself is rather simple and straightforward, but we believe that this chapter can be interesting from the software engineering point of view, as we try to point out several issues which should be borne in mind.

Figure 1: Top-level architecture of the system: 1A is the original e-commerce system with a content-based or hybrid recommender; 1B is its extension for querying and storing LOD datasets.

The top-level architecture is shown in Figure 1. The system maintains connections to one or more SPARQL endpoints of various LOD datasets; the connections are usually REST APIs or simple HTTP services. Whenever an object of the system is created or edited, the system automatically queries each SPARQL connection with the object specification and possibly also other restrictions. Resources are then filtered, modified to fit the predefined object attributes and stored in the system's triple store (a relational database table was sufficient during our experiments), or alternatively mapped into the object-attribute structure immediately. Each object is also queried periodically via batch queries, as the LOD datasets can provide more information over time. We have also considered using local copies of LOD datasets. This approach would, however, result in an excessive burden on both data storage and system maintenance, and would also prevent us from using up-to-date data. The outlined system, on the other hand, is quite sensitive to the quality of service of each LOD dataset and, of course, we are limited to datasets with a working SPARQL endpoint.
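Such a create/edit hook can be sketched in a few lines. The following is only a minimal illustration under our assumptions: the SPARQLWrapper client library, an SQLite table named lod_triples acting as the simple triple store, and endpoint, table and variable names of our choosing; it is not the authors' implementation.

```python
import sqlite3
from SPARQLWrapper import SPARQLWrapper, JSON

# Illustrative endpoint list; the paper queries the Czech and English DBPedia.
ENDPOINTS = ["http://cs.dbpedia.org/sparql", "http://dbpedia.org/sparql"]

def fetch_triples(endpoint, query):
    """Run one SPARQL query and return (subject, predicate, object) tuples;
    assumes the query projects variables named ?s ?p ?o."""
    sparql = SPARQLWrapper(endpoint)
    sparql.setQuery(query)
    sparql.setReturnFormat(JSON)
    bindings = sparql.query().convert()["results"]["bindings"]
    return [(b["s"]["value"], b["p"]["value"], b["o"]["value"]) for b in bindings]

def on_object_saved(db, object_id, query):
    """Hook invoked whenever an e-shop object is created or edited."""
    for endpoint in ENDPOINTS:
        try:
            triples = fetch_triples(endpoint, query)
        except Exception:
            # The approach depends on each endpoint's quality of service;
            # failing endpoints are skipped and retried by the periodic batch job.
            continue
        db.executemany(
            "INSERT INTO lod_triples(object_id, subject, predicate, object) "
            "VALUES (?, ?, ?, ?)",
            [(object_id, s, p, o) for s, p, o in triples])
    db.commit()

# The "triple store" can indeed be a single relational table:
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE lod_triples(object_id, subject, predicate, object)")
```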

3.1 Using LOD in Recommender System
Unless we want to use specific graph-based similarity measures, e.g. SimRank [7], we need to map RDF into the object-attribute structure. In our previous work [12], we used the following mapping (a brief code sketch follows the list):

- RDF Subject ≈ recommendable object (e.g. a product of an e-shop)
- RDF Predicate ≈ attribute name
- Set of RDF Objects (with the same Subject and Predicate) ≈ attribute value
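As a concrete illustration of this mapping, the following minimal Python sketch turns the (subject, predicate, object) triples of one recommendable object into the attribute-name/attribute-value structure; the example triples are ours, not taken from the paper's dataset.

```python
from collections import defaultdict

def triples_to_attributes(triples):
    """Map the RDF triples of one recommendable object (one RDF Subject)
    to {attribute name: set of attribute values}."""
    attributes = defaultdict(set)
    for _subject, predicate, obj in triples:
        # RDF Predicate ~ attribute name; the set of RDF Objects sharing
        # that predicate forms the attribute value.
        attributes[predicate].add(obj)
    return dict(attributes)

# Illustrative input for one book:
book_triples = [
    ("dbpedia:Hogfather", "dbpedia-owl:author", "dbpedia:Terry_Pratchett"),
    ("dbpedia:Hogfather", "dbpedia-owl:literaryGenre", "dbpedia:Fantasy"),
    ("dbpedia:Hogfather", "dbpedia-owl:precededBy", "dbpedia:Feet_of_Clay"),
]
print(triples_to_attributes(book_triples))
```

A per-attribute similarity metric and a per-attribute importance weight can then be defined over these value sets, as described in the text.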

In this way, we can define a different similarity metric for each attribute, set different importance weights for various attributes, etc. This approach was used in our previous work [12]. However, there are also other variants of incorporating RDF, each with certain (dis)advantages. The first maps RDF into binary attributes as follows:

- RDF Subject ≈ recommendable object
- Pair (RDF Predicate, RDF Object) ≈ binary attribute

Mapping RDF into binary attributes is suitable for some recommending algorithms, namely VSM or Content-boosted Matrix Factorization (CBMF) [4], but we need to cope e.g. with discretizing numeric attributes. Also, the number of attributes can grow unbounded, raising the time demands. It is also possible to omit the RDF Predicate, which slightly reduces the number of attributes, reduces errors caused by the redundancy of the DBPedia ontology, and allows us to match relevant objects referred to under different predicates (e.g. dbpedia-owl:precededBy and dbpedia-owl:followedBy for book series), but on the other hand could introduce noise if the same RDF Object (especially a literal) is used in different roles (both binary variants are sketched in code after this list):

- RDF Subject ≈ recommendable object
- RDF Object ≈ binary attribute
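The two binary variants can be sketched in the same way; the helper below is a minimal, assumption-laden illustration (attribute names are simply concatenated strings), not a prescribed encoding.

```python
def binary_attributes(triples, keep_predicate=True):
    """Map the triples of one object to a set of binary attribute names.

    keep_predicate=True:  attribute = (Predicate, Object) pair
    keep_predicate=False: attribute = RDF Object only, so e.g. precededBy and
    followedBy links to the same book collapse into a single attribute.
    """
    if keep_predicate:
        return {f"{p}={o}" for _s, p, o in triples}
    return {o for _s, _p, o in triples}
```

Each object is then represented by the indicator vector over the union of these attribute sets, which is the input form expected by VSM or CBMF-style algorithms; numeric literals would have to be discretized first, as noted above.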

Our final dataset was a compromise between the two latter mappings, where some predicates were grouped together, as described in section 4.2.

4. USING LINKED OPEN DATA IN SECONDHAND BOOKSHOP DOMAIN
In our previous work [12] on the same secondhand-bookshop domain, we targeted the Czech edition of DBPedia to query for additional information about books. In this work, we aimed to explore other datasets as well. There are two key limitations for any considered dataset. The availability of a SPARQL endpoint is a prerequisite of our framework. The second limitation is the lack of a common unique identifier. For instance, many older books do not possess an ISBN at all, and not all books in, e.g., DBPedia have ISBN information available. Thus we need to rely on language-specific information such as the title of the book or the author's name. The possibility of machine translation was discussed, but our preliminary experiments did not support this method, as many book titles cannot be translated literally. Also, the use of neologisms (e.g. Hogfather by Pratchett) or unusual phrases makes translation very difficult. Instead of translation, we opted for adding the English edition of DBPedia, which contains an rdfs:label predicate with the name of the resource in each language that has a corresponding Wikipedia page. So, for a correct mapping, there must exist both an English and a Czech Wikipedia page for the book, and the English one must be correctly typed (as dbpedia-owl:WrittenWork). At first sight, this might seem an inferior approach to searching the Czech DBPedia directly, which only requires an existing, correctly typed Czech Wikipedia page. However, as our preliminary study showed, most of the books are incorrectly identified within the Czech DBPedia (the wiki page does not contain the relevant infobox, so the resource is typed only as owl:Thing and all domain-dependent attributes are missing; see e.g. http://cs.dbpedia.org/page/Matka_(drama)). For performance reasons, we used the non-standard Virtuoso bif:contains filter and did not set additional conditions on the RDF triples. The mined data were post-filtered and some predicates were merged. The SPARQL queries for both the Czech and English DBPedia are described in subsection 4.1. The comparison of the Czech and English DBPedia with regard to coverage, as well as the modifications of the dataset, is described in subsection 4.2.
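The cross-lingual identification described above could look roughly like the following query against the English DBPedia: a resource typed as a written work whose Czech rdfs:label matches the book title. This is our hedged reconstruction (prefix declarations, variable names and the exact label filter are assumptions), not the query used in the experiments.

```python
# Rough reconstruction of the cross-lingual lookup on the English DBPedia.
CROSS_LINGUAL_QUERY = """
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX dbpedia-owl: <http://dbpedia.org/ontology/>
SELECT DISTINCT ?book WHERE {
  ?book a dbpedia-owl:WrittenWork .
  ?book rdfs:label ?label .
  FILTER (lang(?label) = "cs" && str(?label) = "Otec Prasátek")
}
"""
```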

4.1 Querying Czech and English DBPedia
In our previous work [12] we used a combination of author name and book title keyword search together with a correct rdf:type check. Although this approach provided the best possible precision of correctly matched entities, only 1% of books and 6.5% of authors were identified this way. In this work we experimented with queries with higher coverage of the target domain and still reasonable precision. First, we decided to query for books and authors separately, as corresponding Wikipedia pages for each resource might not exist. For the Czech DBPedia we designed two sets of queries. A simple keyword search for the author name and the book title, respectively, should have the highest coverage, but also the highest ambiguity (Figure 2a); let us call them keyword queries. The second set of queries adds an rdf:type check: dbpedia-owl:Person for authors and dbpedia-owl:WrittenWork for books (Figure 2b); let us call them typed queries. The typed queries were also applied to the English DBPedia, but due to execution time limits we could not perform keyword queries on the English DBPedia.
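Since Figure 2 is not reproduced in this text, the following snippet gives only an approximate reconstruction of the two query types against the Czech DBPedia endpoint, using the Virtuoso-specific bif:contains filter mentioned above; the SELECT shape, variable names and prefix handling are our assumptions.

```python
from SPARQLWrapper import SPARQLWrapper, JSON

CS_ENDPOINT = "http://cs.dbpedia.org/sparql"

# (a) Keyword query: plain full-text match on the label
#     (highest coverage, highest ambiguity).
KEYWORD_QUERY = """
SELECT DISTINCT ?s ?p ?o WHERE {
  ?s rdfs:label ?label .
  ?label bif:contains "'Otec Prasátek'" .
  ?s ?p ?o .
}
"""

# (b) Typed query: the same search restricted to resources typed as WrittenWork.
TYPED_QUERY = """
SELECT DISTINCT ?s ?p ?o WHERE {
  ?s rdf:type dbpedia-owl:WrittenWork .
  ?s rdfs:label ?label .
  ?label bif:contains "'Otec Prasátek'" .
  ?s ?p ?o .
}
"""

def run(query, endpoint=CS_ENDPOINT):
    """Execute a query and return the raw JSON bindings.
    Note: common prefixes are predefined on the public Virtuoso endpoints;
    otherwise explicit PREFIX declarations must be added to the queries."""
    sparql = SPARQLWrapper(endpoint)
    sparql.setQuery(query)
    sparql.setReturnFormat(JSON)
    return sparql.query().convert()["results"]["bindings"]
```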


Figure 2: Examples of SPARQL queries: (a) keyword query, (b) typed query for the book “Otec Prasátek” (“Hogfather” in English).

4.2 Comparing Czech and English DBPedia
After receiving RDF triples from DBPedia, we performed post-filtering based on RDF Predicates. Only predicates relevant for the similarity computation were kept, and some predicates were grouped together to ensure a reasonable compromise between maximizing the capability to detect similarities, minimizing the introduced noise, and building a common space for book-based and author-based RDF triples. The predicate groups are as follows (we do not show the predicates from the Czech DBPedia for the sake of clarity, as they are mostly only translated versions of the English ones):

- Abstract: tokenized dbpedia:abstract of the resource.
- Publisher: dbpedia:publisher.
- Related genres: dbpedia:genre, literaryGenre.
- Related persons: dbpedia:Author, Writer, Director, Artists, Translator, Characters; influencedBy, influences, influenced.
- Location: dbpedia:country, birthPlace, deathPlace, nationality.
- Language: dbpedia:language.
- Relevant Books: dbpedia:subsequentWork, previousWork, followedBy, precededBy, notableWork, knownFor.
- Related categories: dbpedia:category, movement, wikiPageWikiLink, dc:subject.
- Author occupation: dbpedia:occupation.

The resulting features were filtered according to their minimal coverage in the dataset. We can surely omit attributes with only a single occurrence within the dataset, as they have no effect on the recommending algorithms, and attributes with fewer than K occurrences have only a small impact on the results. We followed our previous experience [11] and discarded all attributes with K
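To make the post-filtering step concrete, here is a minimal sketch under our assumptions: predicates are mapped to the group names listed above via a hand-written dictionary (only a few entries shown), attributes are binary (group, object) pairs in the spirit of section 3.1, and the occurrence threshold is a free parameter, since the exact K used in the experiments is not restated here.

```python
from collections import Counter

# A small excerpt of the predicate grouping listed above (keys are illustrative).
PREDICATE_GROUPS = {
    "dbpedia:genre": "Related genres",
    "dbpedia:literaryGenre": "Related genres",
    "dbpedia:Author": "Related persons",
    "dbpedia:influencedBy": "Related persons",
    "dbpedia:precededBy": "Relevant Books",
    "dbpedia:followedBy": "Relevant Books",
}

def group_and_filter(object_triples, min_occurrences):
    """Map predicates to their group, build binary (group, object) attributes,
    and drop attributes occurring in fewer than min_occurrences objects."""
    object_attributes = {}
    for obj_id, triples in object_triples.items():
        attrs = set()
        for _s, p, o in triples:
            group = PREDICATE_GROUPS.get(p)
            if group is not None:  # predicates outside the groups are post-filtered out
                attrs.add((group, o))
        object_attributes[obj_id] = attrs
    counts = Counter(a for attrs in object_attributes.values() for a in attrs)
    frequent = {a for a, c in counts.items() if c >= min_occurrences}
    return {obj: attrs & frequent for obj, attrs in object_attributes.items()}
```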