Recoo∗: A Recommendation System for YouTube RSS Feeds†
Anita Krishnakumar
Jack Baskin School of Engineering University of California, Santa Cruz 1156 High Street Santa Cruz, California
[email protected]
ABSTRACT
Today’s internet users face a bewildering number of choices when looking for information about a product, movie, music, video, news, blog, restaurant, etc. It is difficult for users to find the information that is most relevant to their needs and interests. Most websites provide RSS (Really Simple Syndication) feeds, a family of Web feed formats used to publish frequently updated content such as blog entries, news headlines, podcasts, music and videos. These feeds deliver new content to users on the topics in which they are interested. However, the number of RSS feeds and the amount of information channeled through them are also growing rapidly. Recommender systems are considered a potential and powerful solution to this ubiquitous information overload problem, as they offer users a more intelligent and personalized mechanism to seek out new information. In this paper, we describe “Recoo”, a recommender system developed to recommend items from YouTube RSS feeds to users. As media content becomes more popular and more widely used for education as well as entertainment, a system that recommends such content should be very useful.
Keywords
Recommendation Systems, Information Filtering, Personalization, RSS Feeds, YouTube
Categories and Subject Descriptors H.3.3 [Information Storage and Retrieval]: Information Retrieval and Search—Information Filtering; H.2.8 [Database Management]: Database Applications—Data Mining
General Terms
Design, Experimentation
∗ Copyright © University of California, Santa Cruz 2007. All rights reserved.
† Graduate student researcher, IRKM (Information Retrieval and Knowledge Management) Laboratory, University of California, Santa Cruz
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. UCSC IRKM 2007 UC Santa Cruz, California, USA
1. INTRODUCTION
We often find it necessary to make choices without sufficient background information about the available alternatives. In day-to-day life, we seek recommendations from other people through word of mouth, reviews, recommendation letters, surveys, etc. Recommender systems augment this aspect of natural social life by providing automatic recommendations based on the background knowledge the system has about the users and the items. The purpose of a recommender system is to eliminate the need to browse the entire item space by presenting the user with items of interest early on. Recommender systems are a useful alternative to search algorithms since they help users discover items they might not have found otherwise. Interestingly, recommender systems are often implemented using search engines that index non-traditional data. Recommender systems can greatly influence the way people seek information and are used in a wide variety of applications today, from recommending electronic items, books, news and blogs to music, videos, restaurants and movies.
With the increasing number of information sources, there are several challenges in providing users with useful recommendations. RSS (Really Simple Syndication) provides a convenient way to syndicate information from a variety of sources, including news stories, updates to a web site or important bulletins. But it is not limited to news: anything that can be broken down into discrete items can be syndicated via RSS. For example, the recent changes page of a wiki, blog or website, the revision history of a book, or the latest videos uploaded to a website can also be served to users via RSS feeds.
Media content such as music and videos is becoming more popular on the web, and much of it is freely available online. YouTube, Blip.tv, VideoEgg, DailyMotion and Google Video are examples of video sharing websites where users can view, upload and share video clips.
‘Recoo’ is a recommender system that recommends items from YouTube RSS feeds to users based on user profile and feed item information. Recoo monitors YouTube RSS feeds for updates and recommends items that might be of interest to its users. The user’s profile is built over time and is used to make recommendations. The design and architecture of Recoo are described in this paper.
2. LITERATURE SURVEY
A recommender system[11, 1] is a specific type of information filtering technique that attempts to present items (movies, music, books, news, web pages) that are likely to interest the user. Typically, a recommender system compares the user’s profile to some reference characteristics. These characteristics may come from the item itself (the content-based filtering approach[2]) or from the user’s social environment (the collaborative filtering approach[3]). Recommender systems are widely used on websites to suggest items that users may find interesting; Amazon, Netflix, MovieLens, Pandora, Last.fm and WhatShouldIReadNext all use them to serve users.
Collaborative filtering systems recommend objects to a target user based on the opinions of other users, by considering how much the target user and the other users have agreed on other objects in the past[4, 9]. The rating of an item is predicted from the ratings given to that item by similar users: the weighted average of the similar users’ ratings is used as the prediction.
A content-based filtering system[10] selects items based on the correlation between the content of the items and the user’s preferences, as opposed to a collaborative filtering system, which chooses items based on the correlation between people with similar preferences. It makes recommendations by comparing a user profile with the content of each item in the collection. Formally, an item is described as a vector of components. The components are derived from the content of the items or from information about the user’s preferences. The user profile is represented with the same components and is built up by analyzing the content of items that the user found interesting. These items are identified using either explicit or implicit feedback. Explicit feedback requires the user to evaluate items by providing ratings and reviews, whereas implicit feedback is obtained by observing the user’s actions.
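To make the content-based approach concrete, the following is a minimal sketch and not the Recoo implementation. It assumes a plain bag-of-words representation and cosine similarity as the measure of correlation between profile and item; the function names are hypothetical.

```python
import math
from collections import Counter


def vectorize(text):
    """Represent a text as a bag-of-words term-frequency vector."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse term vectors."""
    dot = sum(weight * b[term] for term, weight in a.items() if term in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def build_profile(liked_texts):
    """Build a user profile by summing the vectors of items the user liked."""
    profile = Counter()
    for text in liked_texts:
        profile.update(vectorize(text))
    return profile


def rank_items(profile, items):
    """Return item ids ordered by similarity to the profile, best first.

    `items` maps an item id to its description text.
    """
    scored = [(cosine(profile, vectorize(text)), item_id)
              for item_id, text in items.items()]
    ordered = sorted(scored, key=lambda pair: (-pair[0], pair[1]))
    return [item_id for _, item_id in ordered]
```

Because the profile and the items share the same components (terms), the comparison reduces to a vector similarity; other weightings such as TF-IDF could be substituted without changing the structure.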
Content-based filtering recommender systems[12] do not scale well to large item bases. Collaborative filtering systems do not depend on the semantics of the items under consideration; they automate the recommendation process based on user opinions. While collaborative filtering algorithms are promising for implementing large-scale recommender systems, their recommendations for new users, about whom little is known, tend to be poor. They are also vulnerable to manipulation by malicious users.
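The weighted-average prediction used in collaborative filtering can be sketched as follows. This is an illustration under stated assumptions: Pearson correlation over co-rated items is assumed as the agreement measure, only positively correlated users contribute, and the data layout (a dictionary of per-user rating dictionaries) is hypothetical.

```python
import math


def pearson(ratings_a, ratings_b):
    """Pearson correlation of two users over the items both have rated."""
    common = sorted(set(ratings_a) & set(ratings_b))
    if len(common) < 2:
        return 0.0
    mean_a = sum(ratings_a[i] for i in common) / len(common)
    mean_b = sum(ratings_b[i] for i in common) / len(common)
    cov = sum((ratings_a[i] - mean_a) * (ratings_b[i] - mean_b) for i in common)
    var_a = sum((ratings_a[i] - mean_a) ** 2 for i in common)
    var_b = sum((ratings_b[i] - mean_b) ** 2 for i in common)
    if var_a == 0 or var_b == 0:
        return 0.0
    return cov / math.sqrt(var_a * var_b)


def predict(target, item, all_ratings):
    """Predict the target user's rating of an item as the similarity-weighted
    average of the ratings given by positively correlated users.

    Returns None when no similar user has rated the item.
    """
    weighted_sum = 0.0
    weight_total = 0.0
    for user, ratings in all_ratings.items():
        if user == target or item not in ratings:
            continue
        similarity = pearson(all_ratings[target], ratings)
        if similarity <= 0:
            continue
        weighted_sum += similarity * ratings[item]
        weight_total += similarity
    if weight_total == 0:
        return None
    return weighted_sum / weight_total
```

The `None` result for a user with no overlapping ratings illustrates the new-user weakness mentioned above: without shared history there is nothing to weight.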
3. REALLY SIMPLE SYNDICATION (RSS)
3.1 Introduction
RSS[7] is the most common XML content-syndication standard. It is one of a new breed of technologies that is contributing to the ever-expanding dominance of the Web as the pre-eminent, global information medium. It is intimately connected with - though not bound to - social environments such as blogs and wikis, annotation tools such as del.icio.us, Flickr, YouTube and many other websites which are reshaping and redefining our view of the Web that has been built up and sustained over the last few years. RSS provides a synopsis, or snapshot, of the current state
of a website with simple titles and links. While titles and links are the joints that articulate an RSS feed, they can be freely embellished with textual descriptions and richer metadata annotations. RSS usually functions as a signal of change on a distant website. Syndication and annotation are the order of the day and are beginning to herald a new immediacy in communications and information provision. An RSS document, which is also called a “feed”, “web feed”, or “channel”, contains either a summary of content from an associated web site or the full text. RSS makes it possible for people to keep up with their favorite web sites in an automated manner that’s easier than checking them manually.
3.2 RSS Format
A sample RSS 2.0 file is shown below:

    <?xml version="1.0"?>
    <rss version="2.0">
      <channel>
        <title>Liftoff News</title>
        <link>http://liftoff.msfc.nasa.gov/</link>
        <description>Liftoff to Space Exploration</description>
        <language>en-us</language>
        <pubDate>Tue, 10 Jun 2003 04:00:00</pubDate>
        <lastBuildDate>Tue, 10 Jun 2003 09:41:01</lastBuildDate>
        <docs>http://blogs.law.harvard.edu/tech/rss</docs>
        <generator>Weblog Editor 2.0</generator>
        <managingEditor>[email protected]</managingEditor>
        <webMaster>[email protected]</webMaster>
        <item>
          <title>Astronauts’ Dirty Laundry</title>
          <link>http://liftoff.msfc.nasa.gov/laundry.asp</link>
          <description>Compared to earlier spacecraft, the International Space Station has many luxuries, but laundry facilities are not one of them. Instead, astronauts have other options.</description>
          <pubDate>Tue, 20 May 2003 08:56:02 GMT</pubDate>
          <guid>http://liftoff.msfc.nasa.gov/20.htm#item70</guid>
        </item>
        . . . .
      </channel>
    </rss>

RSS feeds have their own internal structure. At its most basic, a feed consists of a channel, with its own attributes, and a number of items contained within the channel, each with their own individual attributes. The items inside an RSS feed are simple links to other resources, with varying amounts of description associated with each item. Hence, RSS feeds are typically used with systems in which content can be segmented into discrete sections or objects that can be linked. An RSS 2.0 feed can contain any number of items. Each item can carry a <guid> element, which stands for Globally Unique Identifier and is essentially a string that uniquely identifies the item. RSS, a constrained form of XML, owes much of its popularity to its ability to include additional metadata[5].
Figure 1: A brief overview of the Recoo system
Figure 2: User Actions Use Case
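The channel and item structure described above can be read with a few lines of code. The sketch below uses Python's standard `xml.etree.ElementTree` module; the embedded feed is a hypothetical example written for this illustration, and falling back to the link when an item has no guid is an assumption, not something the paper specifies.

```python
import xml.etree.ElementTree as ET

# Hypothetical feed used only for this sketch.
SAMPLE_FEED = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Example Channel</title>
    <link>http://example.com/</link>
    <description>Hypothetical feed</description>
    <item>
      <title>First video</title>
      <link>http://example.com/watch?v=1</link>
      <description>A short clip.</description>
      <guid>http://example.com/watch?v=1</guid>
    </item>
    <item>
      <title>Second video</title>
      <link>http://example.com/watch?v=2</link>
      <description>Another clip.</description>
    </item>
  </channel>
</rss>"""


def parse_feed(xml_text):
    """Return the channel title and a list of item dictionaries."""
    root = ET.fromstring(xml_text)
    channel = root.find("channel")
    items = []
    for item in channel.findall("item"):
        link = item.findtext("link", default="")
        items.append({
            # Assumption: use the link as identifier when no guid is given.
            "guid": item.findtext("guid", default=link),
            "title": item.findtext("title", default=""),
            "link": link,
            "description": item.findtext("description", default=""),
        })
    return channel.findtext("title", default=""), items


channel_title, feed_items = parse_feed(SAMPLE_FEED)
```

Each resulting dictionary corresponds to one discrete, linkable item of the feed.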
3.3 YouTube RSS Feeds
YouTube[8] offers several RSS feeds for categorized groups of videos, such as recently uploaded, top viewed, top favorites, top rated, most discussed and the YouTube blog. It also allows users to create customized feeds for specific users and tags. Users can subscribe to these feeds using an RSS reader.
4. RECOO: SYSTEM OVERVIEW
An overview of the Recoo recommender system is presented in this section. Recoo is a recommender system intended to recommend to its users video links available from the RSS feeds of YouTube. The Recoo system has a typical client-server framework. The server fetches the RSS feeds provided by YouTube, as well as several other customized feeds from the website, and builds the data collection repository. The data collection repository contains the items fetched from the feeds, stored as individual documents. Users of the Recoo system can log on as clients to the recommendation server and view the recommendations made by the system. First-time users can provide tags or keywords that describe the kind of videos they might be interested in viewing. The system stores the feedback information provided by the user and builds the user profile. While building the user’s profile, a distinction is made between explicit and implicit forms of data collection.
Examples of explicit data collection include the following:
• Asking a user to rate an item
• Asking a user to rank a collection of items from most favorite to least favorite
• Presenting two items to users and asking them to choose the one they prefer
• Asking users to create a list of items that they like
Examples of implicit data collection include the following:
• Observing the items that a user views online
• Analyzing items/user viewing times
• Keeping a record of the items that a user viewed
• Analyzing the user’s social network and discovering similar likes and dislikes
The recommender system compares the collected data to similar data collected from others and calculates a list of recommended items for the user. The learned user profile is used to refine the recommendations so that they match the user’s preferences better. Figure 1 shows a brief overview of the Recoo system.
5. SYSTEM DESIGN
5.1 Use Cases
In this section we present a graphical overview of the functionality provided by the Recoo system in terms of actors and their goals, represented as use cases. The actors identified in the Recoo system are the Users, the System and the Administrators. Figure 2 shows the Use Case diagram for the User actions. The Users should be able to create an account for themselves in the system and view the recommendations provided by the recommender. They should also be able to provide feedback to the system by rating and reviewing items and by specifying their interests initially. Figure 3 shows the Use Case diagram for the Monitoring Subsystem. The monitoring subsystem is used by the system administrators. The system should provide functionality for the administrators to troubleshoot the system when faults occur. The administrators should be able to manage data sources and analyze the system performance. They might also want to modify the recommendation system to include new functionality or to implement a different recommendation algorithm. Figure 4 shows the Use Case diagram for the Recommendation Subsystem. The users must be able to read reviews and provide feedback to the system. The system should acquire new data items by crawling the web for RSS feeds. The fetched feeds should be processed, and their items extracted, indexed and stored in the data collection repository. The system should monitor user actions, since this is how it receives implicit feedback. The system also logs its performance and generates reports and evaluations that system administrators can review to monitor the system.
Figure 3: System Administrator Actions Use Case
Figure 4: System Actions Use Case
Table 2: Site Information
Site ID | Site URL
Table 3: RSS Feed Information
RSS ID | RSS URL | Site ID
5.2 Class Diagram
Figure 5 shows the class diagram describing the structure of the Recoo YouTube RSS feed recommender system by showing the system’s classes, their attributes, and the relationships between the classes. Most of the system functionalities have been represented in the diagram. The section about system components describes the components of the system in detail.
5.3 Database Schema
Information about users and items needs to be stored in the recommender system. Some of this information is stored in a MySQL database, and the rest is stored on disk and indexed.
Table 1: User Information
User ID | Password | User Features... | Learned Profile
Information that is used frequently and modified often is stored in the MySQL database, as reading from a file takes more time than querying a database. User information such as the user ID, the password and user features that are modified frequently, like the learned user profile, is stored in the User Information table, as shown in Table 1.
Item information is stored in three different tables. The Site Information table, shown in Table 2, stores the URLs of the sites from which RSS feeds are fetched. The Site ID uniquely identifies each entry in the Site Information table. The RSS Feed Information table stores the URLs of the RSS feeds from the websites listed in the Site Information table. Table 3 shows the structure of the RSS Feed Information table. As described in the section about RSS feeds, an RSS feed contains several items. Each RSS feed is parsed and the items are extracted. Metadata about each item is retrieved from the feed and stored as item features. The guid (Globally Unique Identifier) element is used as the Item ID to uniquely identify each item. The item ratings and reviews are also stored in the MySQL database. The description text of each item is stored in a file and indexed using the Lemur toolkit[6]. Table 4 shows the structure of the Item Information table.
The feedback that the user gives about an item is stored in the User Feedback table. The feedback obtained from the user may be explicit (for example, ratings) or implicit (for example, time spent on an item, click-through data, etc.). Table 5 shows the structure of the User Feedback Information table. The recommendation server predicts ratings for items not yet seen by the user and presents the user with the top items he or she might like. The predictions generated for each user are stored in the Generated Recommendations table, as shown in Table 6.
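The six tables described above could be laid out roughly as follows. This is a sketch only: it uses Python's built-in `sqlite3` module as a stand-in for MySQL so that it runs without a server, the column names and types are assumptions derived from the table headings, and storing a password hash rather than the raw password is also an assumption.

```python
import sqlite3

# Hypothetical schema mirroring Tables 1 to 6; names and types are assumed.
SCHEMA = """
CREATE TABLE site_info (
    site_id  INTEGER PRIMARY KEY,
    site_url TEXT NOT NULL
);
CREATE TABLE rss_feed_info (
    rss_id  INTEGER PRIMARY KEY,
    rss_url TEXT NOT NULL,
    site_id INTEGER NOT NULL REFERENCES site_info(site_id)
);
CREATE TABLE user_info (
    user_id         INTEGER PRIMARY KEY,
    password_hash   TEXT NOT NULL,
    learned_profile TEXT
);
CREATE TABLE item_info (
    item_id     TEXT PRIMARY KEY,  -- the item's guid
    rss_id      INTEGER NOT NULL REFERENCES rss_feed_info(rss_id),
    title       TEXT,
    item_rating REAL
);
CREATE TABLE user_feedback (
    user_id           INTEGER NOT NULL REFERENCES user_info(user_id),
    item_id           TEXT NOT NULL REFERENCES item_info(item_id),
    user_rating       REAL,
    implicit_feedback REAL,
    PRIMARY KEY (user_id, item_id)
);
CREATE TABLE generated_recommendations (
    user_id          INTEGER NOT NULL REFERENCES user_info(user_id),
    item_id          TEXT NOT NULL REFERENCES item_info(item_id),
    predicted_rating REAL NOT NULL,
    PRIMARY KEY (user_id, item_id)
);
"""


def create_database():
    """Create an in-memory database containing the six tables."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    return conn


def top_recommendations(conn, user_id, limit=5):
    """Return the highest predicted (item_id, rating) pairs for a user."""
    cursor = conn.execute(
        "SELECT item_id, predicted_rating FROM generated_recommendations "
        "WHERE user_id = ? ORDER BY predicted_rating DESC LIMIT ?",
        (user_id, limit),
    )
    return cursor.fetchall()
```

Keeping the predictions in their own table means the front end only needs a single ordered query to show a user's top items.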
6. SYSTEM COMPONENTS
The Recoo system consists of several components that work together to run the recommender system. Each component is described briefly in the following sections.
Table 4: Item Information
Item ID | RSS ID | Item Features... | Item Rating
Figure 5: Class Diagram of the Recoo Recommendation System
Table 5: User Feedback
User ID | Item ID | User Rating | Implicit Feedback
Table 6: Generated Recommendations
User ID | Item ID | Predicted Rating
6.1 Web Server
Recoo has been designed as a web-based application with a typical client-server architecture. The server runs the recommendation system, and clients connect to it over the internet. The Recoo recommender system for YouTube RSS feeds uses an Apache web server running CGI scripts.
6.2 User Manager
The User Manager handles all tasks related to the users. It can add and delete users in the Recoo recommender system. It creates a profile for each user and stores information about the user in the MySQL database. The User Manager monitors the user’s actions and other feedback and stores them to build the profile of the user. It also stores the ratings given by the user. In addition, it handles user authentication and maintains the tables containing the user information on the MySQL server.
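One way the monitored actions might be turned into a value that can sit next to explicit ratings is sketched below. The paper does not specify how implicit feedback is scored, so the 1 to 5 scale, the weighting of viewing time and click-through, and both function names are assumptions made purely for illustration.

```python
def implicit_score(view_seconds, video_seconds, clicked):
    """Map implicit signals onto a hypothetical 1-5 scale.

    The fraction of the video watched contributes up to 3 points and a
    click-through contributes 1 point, on top of a base score of 1.
    Returns None when the video length is unknown or invalid.
    """
    if video_seconds <= 0:
        return None
    watched_fraction = min(view_seconds / video_seconds, 1.0)
    return 1.0 + 3.0 * watched_fraction + (1.0 if clicked else 0.0)


def effective_rating(explicit_rating, implicit):
    """Prefer the user's explicit rating; fall back to the implicit score."""
    if explicit_rating is not None:
        return explicit_rating
    return implicit
```

For example, watching half of a video after clicking through yields 3.5 under these assumed weights, which would only be used if the user gave no explicit rating.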
6.3 RSS Crawler
The RSS Crawler is a specialized web spider that periodically crawls specific sites for RSS feeds with new information. The crawler keeps track of the sites crawled and the RSS feeds fetched, so that duplicate feeds are not fetched. The RSS Crawler also parses the RSS feeds, extracts the individual items from them, and stores the items in the data collection repository. The RSS Crawler manages the Site Information and RSS Feed Information tables in the MySQL database.
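The duplicate-avoidance behaviour of a periodic crawl can be illustrated with the sketch below. It is one possible approach rather than the Recoo crawler: it assumes duplicates are detected per item by guid, keeps the seen identifiers in memory, and takes the fetching function as a parameter so that the example runs without network access. All names are hypothetical.

```python
class FeedTracker:
    """Remembers which item identifiers have already been stored."""

    def __init__(self):
        self._seen_guids = set()

    def filter_new(self, items):
        """Return only the items whose guid has not been seen before."""
        fresh = []
        for item in items:
            if item["guid"] not in self._seen_guids:
                self._seen_guids.add(item["guid"])
                fresh.append(item)
        return fresh


def crawl_once(feed_urls, fetch_items, tracker):
    """Run one crawl pass over the given feeds.

    `fetch_items` is a callable that maps a feed URL to a list of item
    dictionaries, each with at least a "guid" key.
    """
    new_items = []
    for url in feed_urls:
        new_items.extend(tracker.filter_new(fetch_items(url)))
    return new_items
```

Calling `crawl_once` repeatedly on a schedule returns only items that appeared since the previous pass, even when the same video is listed in several feeds.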
6.4 Item Manager
The Item Manager extracts metadata from the items fetched by the RSS Crawler. It stores part of the information about each item in the MySQL Item Information table and the rest on disk, where its contents are indexed by the Index Manager.
6.5 Index Manager
When a new item is fetched by the RSS Crawler, the Item Manager calls the Index Manager to index the text data about the item and store the information. The Index Manager uses the Lemur Index Manager to index the description of the item. The index is updated whenever a new item arrives.
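The idea of updating an index as each item arrives can be shown with a small inverted index. This sketch is a pure-Python stand-in and does not use or imitate the Lemur toolkit's API; the class name, the whitespace tokenization and the all-terms-must-match search are assumptions.

```python
from collections import defaultdict


class InvertedIndex:
    """Maps each term to the set of item ids whose description contains it."""

    def __init__(self):
        self._postings = defaultdict(set)

    def add(self, item_id, description):
        """Index one item's description as soon as the item arrives."""
        for term in set(description.lower().split()):
            self._postings[term].add(item_id)

    def search(self, query):
        """Return the ids of items containing every term of the query."""
        terms = query.lower().split()
        if not terms:
            return set()
        result = None
        for term in terms:
            ids = self._postings.get(term, set())
            result = set(ids) if result is None else result & ids
        return result
```

Because `add` touches only the postings of the new item's terms, the index never needs to be rebuilt when the crawler delivers fresh items.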
6.6 Recommendation Server
The recommendation server predicts ratings for the items not yet seen by each user, based on a hybrid of collaborative filtering techniques (such as user-user similarity and item-item similarity) and content-based filtering methods. The recommendation server is designed so that it can also use different and more complex models for recommendation. The predicted ratings are stored in a MySQL table, and the User Manager picks the top few recommendations to present to the user. The recommendations are refreshed periodically, as new predictions are made.
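A hybrid of the two kinds of score, followed by the selection of the top few items, could look like the sketch below. The paper does not state how the collaborative and content-based predictions are combined, so the linear blend, the `alpha` weight and the function names are assumptions.

```python
def hybrid_score(cf_score, content_score, alpha=0.5):
    """Blend a collaborative and a content-based prediction.

    `alpha` is a hypothetical weight on the collaborative score. When one
    of the two predictions is unavailable (None), the other is used alone.
    """
    if cf_score is None:
        return content_score
    if content_score is None:
        return cf_score
    return alpha * cf_score + (1.0 - alpha) * content_score


def top_n(predictions, n=3):
    """Return the n (item_id, score) pairs with the highest scores.

    `predictions` maps item ids to predicted ratings; ties are broken by
    item id so that the result is deterministic.
    """
    ordered = sorted(predictions.items(), key=lambda pair: (-pair[1], pair[0]))
    return ordered[:n]
```

Falling back to the content-based score when no collaborative prediction exists is one way a hybrid can still serve new users who have no rating history.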
6.7 Media Server
Currently we do not fetch the videos linked from the RSS items. We intend to pre-fetch them onto a media server to provide streaming video content to users, or to pre-fetch videos onto the user’s device so that they are available even when the user is offline. However, this is possible only once Recoo is designed as a stand-alone application, which is the next phase of this project.
6.8 Front End
The front end is developed using PHP scripts and provides the user with an interface to connect to the recommendation system and view recommendations. As mentioned above, we plan to make Recoo a stand-alone application that does not depend on the web browser.
7. CONCLUSIONS AND FUTURE WORK
A system for recommending items from YouTube RSS feeds has been described and built. A hybrid of collaborative and content-based filtering approaches is used for generating recommendations. More complex models can also be used with our system. This project is part of continuing thesis research, and we intend to work on different parts of the system and improve it in the future. We plan to extend the system so that it can be used for applications on mobile phones, making use of the user-context information available from the phone to generate better recommendations. We also plan to turn this system into a generalized toolkit that developers can modify to develop different applications using the same underlying framework.
8. REFERENCES
[1] G. Adomavicius and A. Tuzhilin. Toward the next generation of recommender systems: A survey of the state-of-the-art and possible extensions. IEEE Transactions on Knowledge and Data Engineering, 17(6):734–749, 2005.
[2] M. Balabanović and Y. Shoham. Fab: content-based, collaborative recommendation. Commun. ACM, 40(3):66–72, 1997.
[3] D. Goldberg, D. Nichols, B. M. Oki, and D. Terry. Using collaborative filtering to weave an information tapestry. Commun. ACM, 35(12):61–70, 1992.
[4] N. Good, B. J. Schafer, J. A. Konstan, A. Borchers, B. M. Sarwar, J. L. Herlocker, and J. Riedl. Combining collaborative filtering with personal agents for better recommendations. In AAAI/IAAI, pages 439–446, 1999.
[5] T. Hammond, Hannay, and B. Lund. The role of RSS in science publishing: Syndication and annotation on the web. D-Lib Magazine, 10(12), December 2004.
[6] http://www.lemurproject.org/. The Lemur toolkit for language modeling and information retrieval.
[7] http://www.rssboard.org/rss-specification.
[8] http://www.youtube.com.
[9] W. S. Lee. Collaborative learning for recommender systems. In Proc. 18th International Conf. on Machine Learning, pages 314–321. Morgan Kaufmann, San Francisco, CA, 2001.
[10] M. J. Pazzani. A framework for collaborative, content-based and demographic filtering. Artificial Intelligence Review, 13(5-6):393–408, 1999.
[11] P. Resnick and H. R. Varian. Recommender systems. Commun. ACM, 40(3):56–58, 1997.
[12] R. van Meteren and M. van Someren. Using content-based filtering for recommendation.