Enhancing Recommender System with Linked Open Data

Ladislav Peska and Peter Vojtas
Faculty of Mathematics and Physics, Charles University in Prague
Malostranske namesti 25, Prague, Czech Republic
peska|[email protected]

Abstract. In this paper, we present an innovative method that uses Linked Open Data (LOD) to improve content-based recommender systems. We have selected the domain of secondhand bookshops, where recommending is extraordinarily difficult because of the high ratio of objects to users, the lack of significant attributes and the small number of identical items in stock. These difficulties prevent us from successfully applying both collaborative and common content-based recommenders. We have queried the Czech language mutation of DBPedia in order to receive additional attributes of objects (books) and reveal nontrivial connections between them. Our approach is general and can be applied to other domains as well. Experiments show that enhancing a recommender system with LOD can significantly improve its results in terms of object similarity computation and top-k object recommendation. The main drawback hindering widespread adoption of such systems is probably the missing data about a considerable portion of objects, which can however vary across domains and improve over time.

Keywords: Recommender systems, Linked Open Data, implicit user preference, content based similarity

1 Introduction

Recommending and estimating user preferences on the web are both an important commerce application and an interesting research topic. The amount of data on the web grows continuously and it is impossible for a human to process it directly. Keyword search engines were adopted to fight information overload, but despite their undoubted successes, they have certain limitations. Recommender systems can complement on-site search engines, especially when the user does not know exactly what he/she wants.

Many recommender systems, algorithms and methods have been presented so far. Initially, the majority of research effort was spent on collaborative systems and explicit user feedback. Collaborative recommender systems suffer from three well-known problems: the cold start, new object and new user problems. The new user / new object problem is a situation where the recommending algorithm is incapable of making a relevant prediction because of insufficient feedback about the current user / object. The cold start problem refers to the situation shortly after deployment of a recommender system, when it cannot predict relevantly because it has insufficient data in general.

The new object problem becomes even more important when there is only a limited number of items per object type and object fluctuation is thus higher. Using attributes of the objects (and hence content-based or hybrid recommender systems) can speed up the learning curve and reduce both the cold start and new object problems. Various domains, however, differ greatly in how many and how useful attributes can be provided in machine readable form.

The Linked Open Data (LOD) is a community project aiming to provide free and public data under the principles of linked data [3]. LOD datasets are available on the web under an open license in a machine readable format (RDF triples) and linked together with unique URIs. DBPedia [2] is one of the cornerstones of the LOD cloud. It extracts data available on Wikipedia topic pages (primarily Wikipedia infoboxes), publishes it as LOD and hence provides data about a wide range of topics. Several language mutations of DBPedia are available, varying both in the size of the corresponding Wikipedia language mutation and in the completeness of the transcription rules converting Wikipedia pages into RDF.

1.1 Motivation, problem domain

Despite the widespread adoption of recommender systems, there are still domains where creating useful recommendations is very difficult.
• Auction servers or used goods shops often have only one object of a given type available, which prevents us from applying collaborative filtering directly.
• For some domains, e.g. films, news articles or books, it is difficult to define and maintain all important attributes, hindering content-based methods.
• Websites with a relatively high ratio between #objects / #users will generally suffer more from the cold start and new user problems.

For our study, we have selected the secondhand bookshop domain, which includes all the mentioned difficulties. The main problem of the book domain is that similarity between books is often hidden in a vast number of attributes (characters appearing, location, art form or period, similarity of authors, writing form etc.). Although those attributes are difficult to design and fill, it is not impossible. But in most cases, only one copy of a book is available in the secondhand bookshop, so creating a new record must be fast and simple enough that a potential purchase could eventually cover the costs of the work. For example, the average book price in our dataset is approximately 7 EUR, so the potential income per sold book is 2-3 EUR. Collaborative recommender systems can be used, but their efficiency is hindered both by the high ratio between #objects / #users and by the fact that each object can be purchased only once. On the other hand, Wikipedia covers the book domain quite well, so our main research question is whether we can effectively use information available in the Linked Open Data cloud (e.g. DBPedia) to improve recommendation in difficult domains such as the secondhand bookshop.

1.2 Main contribution

The main contributions of our work are:
• Proposing an on-line method to automatically enrich object attributes with Linked Open Data and enhance content-based recommending.
• Experimental evaluation against both content-based and collaborative filtering on secondhand bookshops – a domain with high object fluctuation, a high objects/users ratio and few meaningful attributes.
• Identifying key problems hindering further development of similar systems.

The rest of the paper is organized as follows: we finish section 1 with a review of some related work. In section 2 we describe in general how LOD datasets can be used to enhance recommender systems, and section 3 describes the implementation for the Czech secondhand bookshop domain. Section 4 contains the results of our experiments and finally section 5 concludes the paper and points out some future work.

1.3 Related work

Although the areas of recommender systems and Linked Open Data have been extensively studied recently, papers involving both topics are rather rare. The closest to our work is the research by Di Noia et al. [4], who developed a content-based recommender system for the movie domain based solely on several LOD datasets. Similarly, A. Passant [12] developed dbrec – a music recommender system based on the DBPedia dataset. The main difference between these approaches and ours is that we start with a real system and use LOD only as an instrument to improve recommendations on it. Objects of the system do not have to correspond to the data available in LOD, so we need to cope with problems like the absence of a unique mapping, missing records etc. Among other work connecting the areas of recommender systems and LOD, we would like to mention the paper by Heitmann and Hayes [7], using Linked Data to cope with the new-user, new-item and sparsity problems, and Golbeck and Hendler's work on FilmTrust [6]: movie recommendations enhanced by FOAF trust information from a semantic social network.

As for recommender systems, we suggest the papers by Adomavicius and Tuzhilin [1] or Konstan and Riedl [9] for an overview. Many recommending algorithms aim to decompose the user's preference of an object into preferences of the object's attributes, e.g. Eckhardt [5], which parallels our content-based similarity method. In our recent work [13], we present several methods of combining attribute preferences together, and analogically we plan to experiment with e.g. t-conorms to combine attribute similarities. We suggest the paper by Bizer, Heath and Berners-Lee [3] as a good introduction to Linked Open Data. Our current experiment was based on the Czech version of DBPedia¹, originally developed by Chris Bizer et al. [2].

¹ http://cs.dbpedia.org

2 Enhancing Content-based Recommending with Linked Data

In this section, we describe the general principles of enhancing recommender systems with data from the LOD cloud. We want to stress that one of our intentions was to keep the architecture simple and straightforward to enable its further adoption; several extensions are possible. The top-level architecture is shown in Figure 1.

The system maintains one or more SPARQL endpoints to various LOD datasets. The connections are usually REST APIs or simple HTTP services. Whenever an object of the system is created or edited, the system automatically queries each SPARQL connection with a unique object identifier, or with the best possible object specification if no common unique ID is available (see Figure 2 for an example of a SPARQL query to Czech DBPedia). An important step when using more than one LOD dataset is matching identical resources. The difficulty of this step depends highly on the datasets, but mostly involves (recursively) traversing owl:sameAs links. Resources are then stored in the system's triple store, which can be quite simple (a relational database table was sufficient during our experiment).
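To make the storage step concrete, the following is a minimal sketch of such a simple relational triple store (the schema and all names are illustrative assumptions, not the exact implementation used in our system):

```python
# Minimal relational triple store: one table holding the (predicate, object)
# pairs fetched from LOD endpoints for each system object (book).
import sqlite3

conn = sqlite3.connect("lod_cache.db")
conn.execute("""
    CREATE TABLE IF NOT EXISTS rdf_triples (
        object_id  INTEGER NOT NULL,  -- internal id of the e-shop object
        predicate  TEXT    NOT NULL,  -- RDF predicate URI
        rdf_object TEXT    NOT NULL,  -- RDF object (URI or literal)
        PRIMARY KEY (object_id, predicate, rdf_object)
    )
""")

def store_triples(object_id, triples):
    """Idempotently store (predicate, rdf_object) pairs for one object,
    so periodic re-querying can simply re-run this function."""
    conn.executemany(
        "INSERT OR IGNORE INTO rdf_triples VALUES (?, ?, ?)",
        [(object_id, p, o) for p, o in triples],
    )
    conn.commit()
```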

Figure 1: Top-level architecture of enhancing a recommender system with LOD. Figure 1A represents the original e-commerce system with a content-based recommender; 1B is its extension for querying and storing LOD datasets.

Each object of the system should also be queried periodically, as the LOD datasets can provide more information over time. We have also considered using local copies of the LOD datasets. This approach would however place an excessive burden on both data storage and system maintenance, and would also prevent us from using up-to-date datasets.

There are several possible approaches to mapping RDF triples onto a binary (object, attribute) relation. Each (RDF predicate, RDF object) pair (or RDF object only, if we wish to omit predicates) can be identified as an attribute of the RDF subject (an object of the system), forming a Boolean matrix of attributes A of size |objects| × |RDF objects|. This approach leads to various methods of matrix factorization [10]. The strength of such an approach is that each resource is considered separately, so more important resources may have a more significant impact.

Resource similarities, e.g. SimRank [8], can also be applied. However, the number of resources can grow without limit, which could make such a solution computationally infeasible. Our approach is to identify each RDF predicate as an attribute of the RDF subject (an object of the system). The value of such an attribute is then the set of all relevant RDF objects. Two sets can be compared relatively fast with e.g. the Jaccard similarity $|a_x \cap a_y| / |a_x \cup a_y|$. The total number of attributes remains limited, which allows us to use e.g. a linear combination as the similarity metric. We lose the ability to weight each resource separately, but we can still weight each RDF predicate. The method can be further simplified by omitting the RDF predicate and using the set of all RDF objects as a single attribute.
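A minimal sketch of this representation follows (illustrative only): each object's triples are grouped by predicate into set-valued attributes, which can then be compared with the Jaccard similarity.

```python
from collections import defaultdict

def attributes_from_triples(triples):
    """Group (predicate, rdf_object) pairs of one system object into
    set-valued attributes, one attribute per RDF predicate."""
    attrs = defaultdict(set)
    for predicate, rdf_object in triples:
        attrs[predicate].add(rdf_object)
    return attrs

def jaccard(a, b):
    """|a intersect b| / |a union b| for two attribute value sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0
```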

2.1 Content based similarity

The key part of any content-based recommender system is defining the similarity of its objects. We expect each object of the system to be represented by its attribute vector. Attributes can be of one of the following types:
• Numerical
• String
• Unordered (enumeration, category etc.)
• Set

Our content-based similarity is based on the idea of the two-step preference model published by our research group [5]. The method works in two steps: for objects x and y, it first computes a local similarity $\mu_a : D_a^2 \to [0,1]$ for each attribute $a \in A$. The global similarity of objects S is then defined as a linear combination of the local similarities:

$$S(x, y) = \sum_{a \in A} w_a \cdot \mu_a(a_x, a_y) \qquad (1)$$

The choice of a proper local similarity measure may depend on the domain and the nature of each attribute. As we aim to keep our method domain independent, we have chosen rather simple, widely adopted metrics instead of ones closely specialized to a particular domain. For objects x, y and the values of their attribute $a_x$ and $a_y$, the local similarity $\mu_a$ is:

• The normalized absolute value of the difference for numerical attributes:

$$\frac{|a_x - a_y|}{\max_{\forall objects\ o}(a_o)} \qquad (2)$$

• The relative Levenshtein distance for string attributes:

$$1 - \frac{levenshtein(a_x, a_y)}{\max(length(a_x), length(a_y))} \qquad (3)$$

• Equality for unordered attributes.
• The Jaccard similarity for set attributes:

$$\frac{|a_x \cap a_y|}{|a_x \cup a_y|} \qquad (4)$$

Metrics (2), (3) and (4) allow us to compute the local similarity smoothly, which can be beneficial as we expect a rather high #objects/#attributes ratio. It is possible to compare all attributes in terms of equality, but this would not make much sense for many real-world attributes (e.g. a small difference in price or issue date implies high similarity). For string attributes, our aim was both to cope with minor typos and inconsistencies (e.g. Arthur Conan Doyle vs. A. C. Doyle) and to track series of books (usually containing a common substring). There are analogies to this behavior in other domains as well.

The weight $w_a$ of each local similarity $\mu_a$ is set by linear regression, maximizing the similarity of objects x, y co-occurring in the list of preferred objects $P_u$ for some user u and minimizing it otherwise. The optimized formula is defined as follows:

$$\arg\min_w RSS(w) = \sum_{x,y \in O} \left( \pi(x,y) - w^T \mu(x,y) \right)^2 \qquad (5)$$

$$\text{where } \pi(x,y) = \begin{cases} 1 & \text{if } \exists u : x, y \in P_u \\ 0 & \text{otherwise} \end{cases}$$
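The following sketch puts formulas (1)-(5) together (assuming numpy; the attribute plumbing around these functions is our illustrative assumption):

```python
import numpy as np

def sim_numeric(x, y, max_value):
    """Metric (2): normalized absolute difference of numerical attributes."""
    return abs(x - y) / max_value

def levenshtein(s, t):
    """Classic single-row dynamic-programming edit distance."""
    d = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        prev, d[0] = d[0], i
        for j, ct in enumerate(t, 1):
            prev, d[j] = d[j], min(d[j] + 1, d[j - 1] + 1, prev + (cs != ct))
    return d[len(t)]

def sim_string(x, y):
    """Metric (3): relative Levenshtein similarity."""
    longest = max(len(x), len(y))
    return (1.0 - levenshtein(x, y) / longest) if longest else 1.0

def sim_unordered(x, y):
    """Equality of unordered attributes."""
    return 1.0 if x == y else 0.0

def sim_set(x, y):
    """Metric (4): Jaccard similarity of set attributes."""
    union = x | y
    return len(x & y) / len(union) if union else 0.0

def global_similarity(mu, w):
    """Formula (1): linear combination of local similarities."""
    return float(np.dot(w, mu))

def learn_weights(mu_matrix, pi):
    """Formula (5): least-squares fit of w; mu_matrix holds one row of local
    similarities per object pair, pi the co-visitation indicator per pair."""
    w, *_ = np.linalg.lstsq(mu_matrix, pi, rcond=None)
    return w
```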

2.2 Content-based and collaborative top-n recommending

It remains to describe the procedure of deriving recommendations. For an arbitrary fixed user u, the set of his/her preferred objects $P_u$ and the set of candidate objects $C_u$, the recommending algorithm computes the similarity $S(o_i, o_j)$ for each $o_i \in P_u$ and $o_j \in C_u$. Candidate objects are then ranked according to their average similarity:

$$Rank(o_j) = Rank\left( AVG(S(o_i, o_j);\ \forall o_i \in P_u) \right) \qquad (6)$$

The similarity measure S can be one of:
• The content-based similarity described in section 2.1. In the experiments, this method is further divided according to the usage of LOD.
• Item-to-item collaborative similarity as described in [11]. The similarity of objects x and y is measured as the cosine similarity of the vectors of users $U_x$ and $U_y$ preferring x and y respectively:

$$S_{Cosine} = \frac{U_x \cdot U_y}{\|U_x\| \, \|U_y\|} \qquad (7)$$
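A sketch of the ranking step (6) together with the item-to-item cosine similarity (7) over binary user-preference vectors follows (names such as users_of are illustrative assumptions):

```python
import math

def rank_candidates(preferred, candidates, S):
    """Formula (6): rank candidates by average similarity to the train set."""
    score = {c: sum(S(p, c) for p in preferred) / len(preferred)
             for c in candidates}
    return sorted(candidates, key=score.get, reverse=True)

def cosine_item_similarity(users_x, users_y):
    """Formula (7): for binary preference vectors the dot product equals the
    overlap size and each norm is the square root of the set size."""
    if not users_x or not users_y:
        return 0.0
    return len(users_x & users_y) / math.sqrt(len(users_x) * len(users_y))
```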

In a production recommender system, we should also take into account other metrics like diversity, novelty or serendipity, or pre-filter the candidate objects, but for the purpose of our study we focus on similarity only.

2.3 Hybrid top-n recommending

In order to benefit from both content-based and collaborative filtering, we have also implemented a hybrid approach enhancing content-based filtering with collaborative data: the collaborative similarity of objects (7) is used as an additional attribute of the content-based similarity (we expand the vectors w and µ from equation (5)).
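In code, the hybrid extension is just one extra entry in the µ vector, e.g. (a sketch using hypothetical helpers from the previous listings):

```python
def hybrid_local_similarities(x, y, content_mus, users_of):
    """Expand the content-based local similarities with the collaborative
    similarity (7); the weight vector w in (5) grows by one entry too."""
    mu = list(content_mus(x, y))
    mu.append(cosine_item_similarity(users_of[x], users_of[y]))
    return mu
```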

3 Using Linked Open Data for a secondhand bookshop

The first domain-dependent task is typically selecting a proper LOD dataset. There are several datasets in the LOD cloud² containing information about books; we can mention e.g. DBPedia³, Freebase⁴, the Europeana project⁵ or the British National Bibliography (BNB)⁶. One of our concerns was the quality of the available data. From this point of view, although BNB provides good coverage, it contains only a little more data (mostly about authors) than what is already available within the bookshop. The main drawback of most datasets is, however, the language: without a common identifier, it is difficult to connect data between various languages (e.g. Czech as the bookshop language and English as the foremost LOD language). ISBN does not work well in this case, as it identifies each issue, publisher or language version separately, and we also have to deal with books from the pre-ISBN era. The best way to identify data records is textual comparison of the book title and the author name. So our key constraint for selecting a dataset is the availability of book titles (and person names) in Czech. Another constraint, given by the architecture, is the availability of a SPARQL endpoint in order to query effectively for the demanded data. Given the constraints, we identified the Czech DBPedia dataset as the best option. In the future, we plan to use automated translation services to be able to query non-Czech datasets as well.

3.1 Querying Czech DBPedia

An example of a SPARQL query to Czech DBPedia is shown in Figure 2a, and a portion of the result in Figure 2b. Some restrictions on RDF predicates were applied thereafter to filter out useless data; we do not show them for the sake of clarity. In order to access more of the available data, we have also conducted a separate author search. Czech DBPedia in its current state unfortunately does not directly support object categories (purl:subject in English DBPedia). It contains explicit links to other Wikipedia pages (dbpedia:wikiPageWikiLink), where categories can be extracted by the "Category:" prefix. However, the dataset does not contain their supercategories, which makes them far less useful, so we decided not to distinguish them from other links. We also collect data from preceding/following books in a series, source country, genre, the translator of the book, author occupation, influences etc.

Our dataset contains information about 8802 objects (books). By querying Czech DBPedia, we were able to gather data about 87 books and the authors of 577 books, altogether recording additional information about 7.3% of the books. This small coverage, as we investigated, was caused mainly by missing infoboxes on the vast majority of books in the Czech Wikipedia, causing DBPedia to fail to identify them as books. The problem could be bypassed by traversing categories/supercategories, which is however not possible in the current version of Czech DBPedia. We are working with the Czech DBPedia administrators to eliminate this problem.

² http://datahub.io/group/lodcloud
³ http://dbpedia.org
⁴ http://www.freebase.com/
⁵ http://datahub.io/dataset/europeana-lod
⁶ http://bnb.data.bl.uk/

Figure 2: Example of a SPARQL query about The Return of the King (in Czech "Návrat krále") by J. R. R. Tolkien and a portion of the returned data.
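Since the figure itself is not reproduced here, the following is a minimal sketch of such a label-based lookup against the endpoint (assuming the SPARQLWrapper library; the exact query used in the experiment differs):

```python
from SPARQLWrapper import SPARQLWrapper, JSON

sparql = SPARQLWrapper("http://cs.dbpedia.org/sparql")  # Czech DBPedia endpoint
sparql.setQuery("""
    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
    SELECT ?predicate ?object WHERE {
        ?book rdfs:label "Návrat krále"@cs ;
              ?predicate ?object .
    }
""")
sparql.setReturnFormat(JSON)
# Each binding is one (predicate, object) pair describing the matched book.
for row in sparql.query().convert()["results"]["bindings"]:
    print(row["predicate"]["value"], row["object"]["value"])
```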

4 Experiments

In order to demonstrate the feasibility of our approach, we performed a series of off-line experiments on real users of a Czech secondhand bookshop. The objects (books) have the following content-based attributes: book name, author name, book category and book price. The experimental website does not provide an interface to collect explicit feedback (e.g. object rating), so we had to rely on implicit feedback only. For the purpose of the experiments, we consider a page-view action (user u visits object o) as an expression of positive user preference. The list of preferred objects $P_u$ is then precisely the list of visited objects. We have also restricted the domain of books to only those for which we were able to mine some data from DBPedia. Both the similarity metrics and the recommending algorithm can however predict for all objects. Finally, we use all collected RDF objects as the value (set) of a single attribute to show that even the sole presence of additional data can improve recommendations.

The experiment had two phases. First, we compared the differences between similarities of objects visited by the same user, which we believe should be more similar than objects not visited together. The first phase gives us a rough picture of how well various similarity methods can explain user behavior. Thereafter, in the second phase, we compared the methods in a more realistic top-k object recommendation scenario.

4.1 Similarity of objects

In the first phase of our experiment, our aim was to demonstrate that adding semantic information from LOD can improve a recommender system's capability to distinguish between similar and non-similar objects from the user's perspective. For each pair of objects visited by the same user, we computed the content-based similarity either including or not including DBPedia data. Similarities of mutually visited objects were then compared against the average object similarity in the general population, as we want to test their resolving power.

Experiment results are shown in Table 1. The results corroborate our assumption that additional DBPedia data can help to distinguish between similar and dissimilar, and thus also between preferred and non-preferred, objects.

Table 1. Results for similarity of objects: the table shows the average similarity of mutually visited objects S_visited, of the general population S_general, and their quotient, for content-based similarity with and without DBPedia resources. Both the Content only and Content with DBPedia methods have S_visited significantly higher than S_general (p-value < 10⁻⁷). Content with DBPedia also has a significantly higher S_visited / S_general ratio (p-value < 10⁻⁷).

Similarity Method            S_visited   S_general   S_visited / S_general
Content-based only           0.261701    0.101041    2.59005
Content-based with DBPedia   0.339476    0.084035    4.03970

4.2 Top-n recommending

Although similarity can provide a rough picture of the benefits of external Linked Data, the real usage scenario is different. Most typically, we have a user currently visiting an object or a category and we want to recommend him/her some other options; we expect that the user has already visited a few objects.

Our dataset contains implicit feedback from the users. For an arbitrary fixed user, his/her sequence of visited objects (let n be its size) was sorted according to the timestamp. Then we split the sequence and denote the first k objects as the train set and the nonempty remaining portion of the sequence as the test set. Note that only users with at least two visited objects qualify for the experiment. Train set sizes were set from 1 up to 5, because not enough users would qualify for larger train sets. For the purpose of the top-n recommending experiment, we define the train set as the preferred objects P_u and all objects except those from the train set as the candidate objects: C_u = O \ P_u. We compute the list of recommended objects as described in (6). The following methods were tested:
• Collaborative filtering, forming our baseline.
• Content-based filtering with no additional data.
• Content-based filtering with DBPedia data handled as a single attribute (referred to as Content + DBPedia in the results).
• The hybrid approach described in 2.3 with DBPedia data (referred to as Hybrid + DBPedia in the results).

Only the feedback available at the time when the current user visits the last object from the train set is used. This simulates the real situation (we cannot use data which will be created in the "future"). We consider it a success if the methods rank the objects from the test set "well enough". In other words, the recommending methods should be able to predict the objects which the user will visit in the "future". In order to compare the recommending methods, we have set three success metrics (a sketch of the evaluation protocol follows the list):
• Average position of the test set objects in the recommended list.
• Normalized Discounted Cumulative Gain (nDCG), where relevance is defined as 1 for objects from the test set and 0 otherwise.
• Presence of test set object(s) in the top 10, as this is a typical size of the list of recommended objects shown to the user.
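A minimal sketch of this evaluation protocol follows (illustrative; recommended is the full ranked candidate list produced by (6), positions are 0-based):

```python
import math

def temporal_split(visits, k):
    """visits: list of (timestamp, object_id); the first k visits in time
    order form the train set P_u, the rest the test set."""
    ordered = [obj for _, obj in sorted(visits)]
    return ordered[:k], ordered[k:]

def avg_position(recommended, test):
    """Average position of test set objects in the recommended list."""
    return sum(recommended.index(o) for o in test) / len(test)

def ndcg(recommended, test):
    """nDCG with binary relevance: 1 for test set objects, 0 otherwise."""
    dcg = sum(1.0 / math.log2(i + 2)
              for i, o in enumerate(recommended) if o in test)
    idcg = sum(1.0 / math.log2(i + 2) for i in range(len(test)))
    return dcg / idcg

def present_at_top10(recommended, test):
    """Presence of at least one test set object among the top 10."""
    top10 = set(recommended[:10])
    return any(o in top10 for o in test)
```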



Figure 3 shows the results of the experiment: 3a, 3b and 3c show the trend of the metrics with enlarging train set sizes; 3d contains average data over all users. Both the content-based and hybrid methods greatly outperform item-item collaborative filtering in all observed metrics (p-value < 10⁻⁷ according to the Tukey HSD test). The added value of the Content + DBPedia method was observed mainly for the smaller train set sizes. It dominates over the Content method consistently in both nDCG (p-value: 0.065) and presence at top-10 (p-value: 0.0399); however, the absolute difference became rather small with growing train set sizes, indicating that the additional data was most useful for new users of the system. The Hybrid + DBPedia method dominates significantly in average position (p-value ≤ 5.1·10⁻⁴) but has only a small effect on the other metrics, suggesting that it mainly improves the position of badly ranked objects. It might still be useful if the candidate set C_u is pre-filtered.

[Figure 3 panels 3a (average nDCG), 3b (average position) and 3c (presence at top-10) are plots over train set sizes 1-5 and are not reproduced here; panel 3d is the following table.]

3d: Average values of metrics

Method              nDCG     Position   P@10
Collaborative       0.2218   255.21     0.1086
Content + DBPedia   0.2901   181.78     0.2245
Content only        0.2713   189.45     0.1733
Hybrid + DBPedia    0.2961   138.34     0.2210

Figure 3: Recommending experiment results: all metrics shown according to the train set size per user. Figure 3a displays nDCG, 3b average position and 3c the relative presence of preferred objects in the top 10. Figure 3d shows average values over all train set sizes. In nDCG, Content + DBPedia outperformed the Content only method, but with rather low significance (p-value: 0.065). P@10 provides similar, somewhat more conclusive results (p-value: 0.0399). In average position, Hybrid + DBPedia dominates over all other methods (p-value ≤ 5.1·10⁻⁴).

An interesting phenomenon was the decrease in performance of all methods with increasing train set size. This might be partially caused by an inappropriately chosen aggregation function (average), supported by the assumption that objects visited further in the past have only limited impact on future object visits. Further result analysis also showed that the majority of users who received the best recommendations had only a small number of visited objects and thus did not qualify for larger train sets. We will further investigate whether this is only a coincidence and how to avoid such a problem.

5 Conclusions and Future work

In this paper, we aimed to improve recommendations on problematic e-commerce systems by using additional data collected from the Linked Open Data cloud. We have shown several difficulties which current recommender systems may encounter, and a domain (secondhand bookshops) where all those difficulties can be found. We have presented a method for enhancing a content-based recommender system with data from the LOD cloud. This on-line method is straightforward and quite general, so it can be applied to various domains. The off-line experiments, held on real visitors of a Czech secondhand bookshop, corroborate our assumption that enhancing recommender systems with LOD data can improve recommendation quality.

One of the main drawbacks of our approach is the low coverage of objects within Czech DBPedia. Our future work therefore involves improving the Czech DBPedia mapping rules, refining the SPARQL queries used during the experiment and exploring other available datasets. The possibility of automatic translation of book names should also be considered. Last but not least, the experiments have shown unexpected behavior with increasing train set sizes, which should be further investigated.

Acknowledgments. This work was supported by grants SVV-2013-267312, P46 and GAUK-126313.

References

1. Adomavicius, G. & Tuzhilin, A.: Toward the next generation of recommender systems: A survey of the state-of-the-art and possible extensions. IEEE Trans. Knowl. Data Eng., 2005, vol. 17, 734-749
2. Bizer, C.; Cyganiak, R.; Auer, S. & Kobilarov, G.: DBpedia.org - Querying Wikipedia like a Database. WWW2007, Banff, Canada, May 2007
3. Bizer, C.; Heath, T. & Berners-Lee, T.: Linked Data - The Story So Far. Int. J. Semantic Web Inf. Syst., 2009, 5, 1-22
4. Di Noia, T.; Mirizzi, R.; Ostuni, V. C.; Romito, D. & Zanker, M.: Linked open data to support content-based recommender systems. In Proc. of I-SEMANTICS '12, ACM, 2012, 1-8
5. Eckhardt, A.: Similarity of users' (content-based) preference models for collaborative filtering in few ratings scenario. Expert Syst. Appl., 2012, vol. 39, 11511-11516
6. Golbeck, J. & Hendler, J.: FilmTrust: movie recommendations using trust in web-based social networks. In Proc. of CCNC 2006, IEEE, 2006, vol. 1, 282-286
7. Heitmann, B. & Hayes, C.: Using Linked Data to Build Open, Collaborative Recommender Systems. AAAI Spring Symposium: Linked Data Meets Artificial Intelligence, 2010
8. Jeh, G. & Widom, J.: SimRank: a measure of structural-context similarity. KDD '02, ACM, 2002, 538-543
9. Konstan, J. & Riedl, J.: Recommender systems: from algorithms to user experience. User Modeling and User-Adapted Interaction, 2012, vol. 22, 101-123
10. Koren, Y.; Bell, R. & Volinsky, C.: Matrix Factorization Techniques for Recommender Systems. Computer, IEEE, 2009, 42, 30-37
11. Linden, G.; Smith, B. & York, J.: Amazon.com recommendations: item-to-item collaborative filtering. Internet Computing, IEEE, 2003, 7, 76-80
12. Passant, A.: dbrec - Music Recommendations Using DBpedia. International Semantic Web Conference (2), Springer, LNCS, 2010, 209-224
13. Peska, L. & Vojtas, P.: Negative Implicit Feedback in E-commerce Recommender Systems. To appear in WIMS 2013, http://www.ksi.mff.cuni.cz/~peska/wims13.pdf
