A Recommender System for YouTube Based on its Network of Reviewers

Song Qin, Ronaldo Menezes, Marius Silaghi
Florida Tech, Department of Computer Sciences, Melbourne, Florida, USA
[email protected], {rmenezes,msilaghi}@cs.fit.edu

Abstract: Social network studies are becoming increasingly popular and have been applied to several fields, such as law enforcement, marketing, the spread of disease, and the improvement of organizational performance. One area that is yet to be explored is harnessing the power of social networks as recommender systems. The idea that users may provide other users with recommendations that are more relevant than those of naive approaches has long been known. However, the approaches currently implemented are based on simple relationships such as the co-purchase of similar items. Similarly, video websites would like to suggest related videos to users to maximize the time they spend on their sites. It is therefore crucial that sites like YouTube provide users with recommendations that are relevant to them. Moreover, given the large number of videos on YouTube, a good recommender system may reduce the effort users spend finding the videos that interest them most. Existing recommender systems for YouTube are typically based on finding similarities between the videos’ textual features (video titles or tag annotations) and matching these features to tags in the users’ profiles or in the videos they are currently watching. This approach is very limited because it restricts the users’ interests to the theme of the video being played or to the preferences in their profiles. For example, if a user is watching a video about football, it is assumed that the user would like recommendations for other videos about football. In this paper, we attempt to extract information about video relationships using a network formed from reviews left as comments on YouTube videos. We create a network of videos called the YouTube Recommender Network (YRN) and use complex network analysis on this network as the basis of a recommender system. Our results show that our list of recommended videos is more diverse than the ones based on textual information. Our YRN provides diversity and captures other important characteristics such as high rating and most-viewed count.

I. INTRODUCTION

Since the advent of the Internet, users have found ways to communicate and form groups. Initially we had Bulletin Board Systems (BBS), then Usenet groups, and more recently we have witnessed a flood of online social network websites attracting millions of users (e.g. Facebook, MySpace, Orkut, YouTube, Twitter). Facebook alone claims to have more than 350 million registered users. Given the widespread use of online social networks, we have seen a growing interest in social network studies in several fields such as law enforcement [1], marketing [2], spread of diseases [3], and organizational performance improvement [4]. Together with the increased use of online social networks, we have also seen an increase in the amount of information being added, making it hard to find the desired information. On sites like YouTube¹, users are presented with recommended lists of videos based on many criteria. What is surprising is that YouTube does not seem to use the power of its social network of users to better recommend videos. This paper proposes a mechanism to create a ranking of videos from the information extracted from the social network of users, more specifically a social network formed from users who write reviews about videos.

¹ YouTube refers to the website www.youtube.com

Founded in February 2005, YouTube is the leader in online video sharing [5]. YouTube allows people to easily upload and share video clips through websites, mobile devices, and e-mails. People watch hundreds of millions of videos a day on YouTube and upload hundreds of thousands of videos daily; 20 hours of video are added to the site every minute.

A recommender system for online video networks is an information filtering system that lists (ranked or not) videos considered interesting to a given user. The typical approaches to finding related videos are based on customer profiles or on looking for videos with textual features (usually video titles or tag annotations) similar to those of the video being watched. In both cases, the current approaches (as described in Section II) are based solely on textual tags, meaning that a user watching a video on classical music is restricted to suggestions on music-related topics, because other topics would not share similar tags. We believe this approach is not always correct, since there could be a correlation in which classical music lovers tend to like the same non-music topic. Moreover, if examples like this exist, tags are insufficient to capture such relationships.

This paper delves into the idea of recommending videos based on information extracted from the network of reviewers. We claim that this is an interesting approach because users review diverse topics that interest them, and not only topics with similar tags.

II. RELATED WORK

YouTube has received a lot of attention from the research community in recent years. Mislove et al. [6] studied friendship relationships between social network users. They crawled accessible user links based on friendship between users on several online social network websites, including Flickr, YouTube, LiveJournal and Orkut, and made a large-scale measurement of the structure of these networks. The dataset contains 11.3 million users and 328 million links. Their results confirmed the small-world and scale-free properties of online social networks. This work suggests that we should find the same properties in our network of YouTube videos.

Paolillo [7] studied friendship and its correlation to the tags used on videos uploaded to YouTube. The results showed that YouTube has a socially cohesive core of producers of mixed content, with smaller cohesive groups around Korean music and anime music videos. They also showed that YouTube producers are strongly linked to others producing similar content. Besides these results, the study notes that friendship is not the only relationship that structures YouTube interaction; commenting is another important one, which was left as future study.

Since YouTube has a very large collection of videos, it is natural to help users find the videos that interest them most. There are a few approaches to building recommender systems. Mei et al. [8] proposed an online video recommender system called VideoReach that suggests videos according to the video currently being viewed, without the need for user profiles. The relevance of videos is determined by their textual features (tags, keywords): videos with common textual features are considered related. This is the most common approach for recommender systems. Krishnakumar [9] built an online video recommender system called Recoo. In Recoo, a profile is built for each user, and data on the user’s interests is explicitly collected from the user. Recoo compares the collected data to similar data collected for other users and calculates a list of recommended items. The learned user profile is used to refine the recommendations so that they better match the user’s preferences. As we argued before, most recommender systems are based on a comparison between video tags or on learning and refining user profiles.
In VideoReach [8], videos are related if they have relevant textual features (video tags or keywords). Users may get many videos related to the current one, but the approach does not rank them. In Recoo [9], a profile must be built for each user before the system can work. This approach does not work when users do not want to create a profile, something that is quite common on the Internet.

We propose a way to generate a YouTube Recommender Network (YRN) that relates videos based on the comments left on them. The nodes represent the videos, and an edge is established between two nodes if the same user comments on both of them. Edges are undirected and weighted by the number of different users who commented on both videos. The assumption is that two videos are more related if the edge between them has a heavier weight. In order to recommend videos to users without profiles or tags, we propose a way to quantify the importance of a node in the YRN by assigning each node a utility value that is calculated based on the node’s position/role in the network. We also characterize nodes based on the community structure of the YRN, where videos in the same community are assumed to be related. We demonstrate that the network formed from these videos contains many communities that are distinct from the tags in the videos, demonstrating that the tags may not be a good indication of the strength of relation.

III. OUR METHOD

Our data has been collected using the YouTube API [10], which lets developers incorporate YouTube functionality into their own applications. The API contains functions to search for videos, retrieve standard feeds, and retrieve related videos. The standard feeds are lists of video entries through which one can get information about videos based on some criterion. The available feeds are: “Top rated”, “Top favorites”, “Most viewed”, “Most popular”, “Most recent”, “Most discussed”, “Most responded”, “Recently featured” and “Videos for mobile phones”. The API allows for integration between the developer and YouTube by providing the developer with access to the video and user information stored on YouTube. Using the API, we crawled the YouTube website and retrieved videos from the standard feeds, together with a limited number of comments associated with each video. Each video on YouTube is uniquely identified by an 11-character string.

The API imposes a limit on the number of comments that can be retrieved for a given video. Though this is not well documented by the YouTube API, we found that we can only download up to 25 comments per video. As a result, this study deals with a sample of the comments, and it is likely that the real network is more connected than the one we analyze. However, as demonstrated in Figure 1, the YRN retains its scale-free property despite the sampling.

A. Building the YRN

In order to build a network of videos, the first problem we face is deciding how to collect the data. The YouTube API offers us the ability to work with video feeds. These feeds are lists of videos that are part of YouTube. Most people are familiar with the recommended list on YouTube; that list is an example of a video feed in the API.
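The network definition given above (videos as nodes, an undirected edge weighted by the number of distinct users who commented on both videos) can be sketched in a few lines of Python. This is only an illustration, not the crawler or the code used in the study: the function name `build_yrn`, the `(video_id, user_id)` input format, and the sample data are assumptions made for the example.

```python
from itertools import combinations


def build_yrn(comments):
    """Build a weighted, undirected co-commenter network.

    comments: iterable of (video_id, user_id) pairs (hypothetical input format).
    Returns a dict of dicts: graph[a][b] is the number of distinct users
    who commented on both video a and video b.
    """
    videos_by_user = {}
    graph = {}
    for video_id, user_id in comments:
        # A set ignores repeated comments by the same user on the same video.
        videos_by_user.setdefault(user_id, set()).add(video_id)
        # Every video becomes a node, even if it ends up with no edges.
        graph.setdefault(video_id, {})

    for videos in videos_by_user.values():
        # Each user adds 1 to the weight of every pair of videos they commented on.
        for a, b in combinations(sorted(videos), 2):
            graph[a][b] = graph[a].get(b, 0) + 1
            graph[b][a] = graph[b].get(a, 0) + 1
    return graph


# Hypothetical sample: six users comment on both A and B, one of them also on C.
sample_comments = (
    [("A", f"user{i}") for i in range(6)]
    + [("B", f"user{i}") for i in range(6)]
    + [("C", "user0"), ("A", "user0"), ("D", "user9")]
)
yrn = build_yrn(sample_comments)
print(yrn["A"]["B"])
```

In this sample, videos A and B share six commenters, so the edge between them carries weight 6, while video D has no shared commenters and remains an isolated node.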
For our purposes we harvest the following feeds and use them as the starting point for the crawler: “Top rated”, “Top favorites”, “Most viewed”, “Most popular”, “Most recent”, “Most discussed”, “Most responded”, “Recently featured” and “Videos for mobile phones”. The crawler visits each video in these feeds and stores its unique identifier and the IDs of the users who commented on the video (up to 25 comments). To create the network itself, we take the video IDs and make them nodes in the YRN. Two nodes are linked if our database shows that both videos have been commented on by the same user ID. Each edge is weighted by the number of commenters the two videos have in common. For instance, if the same 6 unique users have commented on both video A and video B, the weight of the edge between A and B is 6.

IV. NETWORK MEASUREMENTS

Figure 4 shows the YRN we used in this study. There are many measurements that can be performed on a network, but we limit our discussion to a few that are used to characterize networks as scale-free, as well as measurements that are useful to the recommender system we propose: degree distribution, average shortest path, clustering coefficient and community identification. All measurements have been performed using Network Workbench [11], a tool designed specifically for the analysis of network characteristics. To characterize our network we first look at its degree distribution. The degree distribution gives the probability P(k) that a given node will have degree k.
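As an illustration of the definition of P(k), the following minimal Python sketch computes an empirical degree distribution from an adjacency structure. The study itself used Network Workbench; the function name `degree_distribution`, the dict-of-dicts graph format, and the toy graph are assumptions made for this example.

```python
from collections import Counter


def degree_distribution(graph):
    """Return {k: P(k)}, the fraction of nodes that have degree k.

    graph: dict mapping each node to a dict (or set) of its neighbors.
    """
    degrees = [len(neighbors) for neighbors in graph.values()]
    counts = Counter(degrees)
    total = len(degrees)
    return {k: counts[k] / total for k in sorted(counts)}


# Hypothetical toy network: one hub connected to three leaves.
toy_graph = {
    "hub": {"v1": 1, "v2": 1, "v3": 1},
    "v1": {"hub": 1},
    "v2": {"hub": 1},
    "v3": {"hub": 1},
}
p = degree_distribution(toy_graph)
print(p)
```

Plotting such a distribution on log-log axes is a common way to inspect whether it is compatible with a power law.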

[Figure 1: plot titled “Degree Distribution”. x-axis: k (degree), 0 to 140; y-axis: p(k), the probability of a node having degree k, 0 to 0.3.]

Fig. 1. The degree distribution of the YRN with exponent $\lambda = 2.3$. The power-law degree distribution shows that the YRN is scale-free.

Figure 1 displays the degree distribution for our YRN. The network is clearly not random, since its degree distribution follows a power law (random networks have a Poisson degree distribution). Networks that present this type of distribution belong to the class of complex networks; more specifically, they are defined as networks in which $P(k) \sim k^{-\lambda}$. Barabási and Albert [12] have demonstrated that in many real networks the value of $\lambda$ is invariant to the size of the network and that for most real-world networks $2 \le \lambda \le 3$. For the YRN we have $\lambda = 2.3$.

Two other important characteristics in the analysis of networks are the average path length, $\ell$, and the clustering coefficient, $C$, as they tell us whether a network has small-world characteristics. The small-world phenomenon was first introduced by Milgram [13] with his famous six-degrees-of-separation experiment. A network is considered to be small-world if its average path length, $\ell$, grows logarithmically as a function of the number of nodes in the network. Small-world networks have a small diameter and high clustering. For the YRN we have $\ell \approx 2.61$, which is relatively small for the size of the network (198 nodes). The network clustering coefficient is $C \approx 0.51$, which is much higher than that of an equivalent random network. The conclusion is that the YRN is a small-world network, which is important because it means that from a given video most other videos can be reached and hence considered in a recommender system.

Many networks have densely interconnected group structures. These structures are more common in social networks and are called communities. Community identification is very useful because nodes in a network belonging to the same community tend to have similar properties. In the context of a recommender system, the community structure may help the system restrict the recommendations to certain categories given by the communities. This may be more effective than trusting tags, since tags tend to be static and are generally not standardized. There are many algorithms for finding communities. A widely accepted one was proposed by Palla et al. [14] and is generally referred to as the clique-percolation algorithm for overlapping communities. This algorithm has been implemented by the same authors in a tool called CFinder. After running CFinder we found that the network has many communities, as described in Figure 2.
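The two small-world quantities discussed above, the average path length and the clustering coefficient, can be sketched as follows. This is a hedged illustration rather than the Network Workbench procedure: it assumes an unweighted view of the network (path length counted in hops), averages path lengths over reachable pairs only, and uses the average of local clustering coefficients, which may differ from the exact definitions used by the tool. The function names and the toy graph are hypothetical.

```python
from collections import deque
from itertools import combinations


def average_path_length(graph):
    """Mean hop distance over all ordered pairs of mutually reachable nodes."""
    total, pairs = 0, 0
    for source in graph:
        # Breadth-first search gives shortest hop counts from the source.
        dist = {source: 0}
        queue = deque([source])
        while queue:
            node = queue.popleft()
            for neighbor in graph[node]:
                if neighbor not in dist:
                    dist[neighbor] = dist[node] + 1
                    queue.append(neighbor)
        total += sum(dist.values())
        pairs += len(dist) - 1  # exclude the source itself
    return total / pairs if pairs else 0.0


def clustering_coefficient(graph):
    """Average, over all nodes, of the fraction of neighbor pairs that are linked."""
    if not graph:
        return 0.0
    coefficients = []
    for neighbors in graph.values():
        k = len(neighbors)
        if k < 2:
            coefficients.append(0.0)  # assumption: such nodes count as 0
            continue
        links = sum(1 for a, b in combinations(neighbors, 2) if b in graph[a])
        coefficients.append(2 * links / (k * (k - 1)))
    return sum(coefficients) / len(coefficients)


# Hypothetical toy network: a triangle (a, b, c) with one pendant node d.
toy = {
    "a": {"b", "c"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"c"},
}
apl = average_path_length(toy)
cc = clustering_coefficient(toy)
print(apl, cc)
```

A network is usually called small-world when the first value stays small relative to the number of nodes while the second is well above that of a random network with the same number of nodes and edges.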


[...] 100 to indicate that the ranking is greater than the 100 that can be checked on the website.

Video ID      Rank  U(n_i)       Rate  ATTR   ATMD   ATMV  ATF
_OBlgSz8sSM   1     0.022565321  4.63  > 100  14     2     2
dMH0bHeiRNg   2     0.021773555  4.66  > 100  9      3     1
FzRH3iTQPrk   3     0.019794141  4.87  > 100  > 100  34    16
Hr0Wv5DJhuk   4     0.019398258  4.20  > 100  8      5     > 100
DMs-p5y6cvo   5     0.019398258  4.10  > 100  > 100  24    > 100
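The ranking in the table corresponds to sorting videos by utility value, highest first. The sketch below illustrates only that sorting step; the utility values are copied from the table, while the function name `global_recommendation` and the `top_n` parameter are assumptions for the example, and the computation of U(n_i) itself is not reproduced here.

```python
def global_recommendation(utility, top_n=None):
    """Return video IDs ordered from highest to lowest utility value.

    utility: dict mapping video ID to its utility value U(n_i).
    top_n: optional cap on the number of recommendations (hypothetical parameter).
    """
    ranked = sorted(utility.items(), key=lambda item: item[1], reverse=True)
    return [video_id for video_id, _ in ranked[:top_n]]


# Utility values taken from the table above, deliberately listed out of order.
utility = {
    "FzRH3iTQPrk": 0.019794141,
    "Hr0Wv5DJhuk": 0.019398258,
    "_OBlgSz8sSM": 0.022565321,
    "DMs-p5y6cvo": 0.019398258,
    "dMH0bHeiRNg": 0.021773555,
}
ranking = global_recommendation(utility)
print(ranking[:3])
```

The last two videos have identical utility values, so their relative order depends on how ties are broken; here Python's stable sort keeps their input order.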

Fig. 6. Behavior of the rate of a video as a function of the degree of nodes in the YRN. The inset graph shows the rate distribution (from 1 to 5) of the videos in the YRN. Our sample has more videos with a higher rate because of the feeds available to us in the YouTube API. These tend to return videos with a higher rate, but not exclusively, as can be seen here.

Fig. 4. YRN of YouTube videos generated by the Network Workbench software. The nodes are videos from the YouTube website. An edge is established between two videos if they are commented on by the same YouTube user.

As we observed from Figure 6 (outset), the degree is not linearly related to the average rate of the video. It can be observed that for low degrees the rate is nearly constant.

[Figure 5: plot titled “Degree vs. View Count”. x-axis: k (degree), 0 to 60; y-axis: vc (view count), 0 to 2e+08.]

Fig. 5. Growth of view count as a function of the degree of nodes in the YRN.

We stated before that the YRN contains 198 nodes. The rate given by YouTube for these videos ranges from 1 to 5, but most of the videos we downloaded to build the YRN come from feeds that tend to include videos with a high rate. Figure 6 (inset) shows the distribution of videos in the YRN based on their rating.

Our system also allows for local-view recommendation, which is used to recommend videos related to the one being watched. When a user is watching a video, a list of videos she may like is recommended. Although we cannot be sure of the exact algorithm used by YouTube, its recommendations appear to be based on tags, since they are almost always videos in the same category or with similar titles. Using the YRN, we can instead use the adjacency of a node, ranked by the strength of the connection to the other videos. So when a user is watching video n, only the nodes connected to the node representing n will be in the recommended list, hence the name local-view recommendation. The strength of the connection between two nodes is given by the weight of the edge connecting them: the larger the weight, the stronger the relation. The recommended list is sorted from the highest rank to the lowest. We hypothesize that if the user likes the video she is watching, there is a high probability that she will like the videos in the local-view recommendation list. Figure 7 shows a snapshot of the adjacency of a node in the YRN. Node FzRH3iTQPrk is ranked 3rd globally.
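The local-view recommendation described above amounts to listing the neighbors of the watched video in decreasing order of edge weight. A minimal sketch, assuming the network is stored as a dict of dicts of edge weights; the function name `local_view_recommendation`, the `top_n` parameter, and the video IDs in the toy network are hypothetical.

```python
def local_view_recommendation(graph, video_id, top_n=None):
    """Return the neighbors of video_id sorted by edge weight, heaviest first.

    graph: dict of dicts, where graph[a][b] is the weight of edge (a, b).
    Videos that are not in the network yield an empty list.
    """
    neighbors = graph.get(video_id, {})
    ranked = sorted(neighbors.items(), key=lambda item: item[1], reverse=True)
    return [neighbor for neighbor, _ in ranked[:top_n]]


# Hypothetical toy network around the video being watched.
toy_yrn = {
    "watched": {"vidA": 5, "vidB": 20, "vidC": 1},
    "vidA": {"watched": 5},
    "vidB": {"watched": 20},
    "vidC": {"watched": 1},
}
recommended = local_view_recommendation(toy_yrn, "watched")
print(recommended)
```

Only videos adjacent to the watched one are considered, which is what distinguishes this list from the global ranking by utility value.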

[Figure 7: subgraph centered on node "FzRH3iTQPrk"; the edges to its neighbors are labeled with weights between 1.0 and 20.0.]

Fig. 7. A subgraph in YRN. Node with video FzRH3iTQPrk (yellow node) is surrounded by nodes connected to it. The video IDs of the yellow node’s neighbors are not shown here.

Note that the edges connecting a node to its adjacency are weighted. These weights represent the similarity of the two nodes according to the co-occurrence of comments by the same user. Therefore, when using local-view recommendation, our system lists this adjacency in order of similarity. Note that our approach is orthogonal to the use of tags, meaning that tags can still be used to filter our recommendations if necessary.

VI. CONCLUSION AND FUTURE WORK

In this paper, we proposed the construction of a YouTube Recommender Network (YRN) and a recommender system derived from it. The YRN is created from data collected from the YouTube website using its API. The data is composed of a number of videos and the comments on each video. The nodes represent the videos, and an edge is established between two nodes if there is a user who commented on both of them. The YRN is undirected and weighted. The edge weight represents the number of different users who commented on both videos. After the YRN was generated, we observed scale-free and small-world properties in our network, along with a number of communities. We demonstrated that the distribution of tags inside communities is diverse and follows a power law. The weight of an edge predicts the strength of the relation between the two nodes it connects. We also introduced a utility value which represents the importance of a node: the higher the utility value, the more important the node is.

Finally, we proposed a way to build a recommender system derived from our YRN. First, videos are recommended to users from the highest utility value to the lowest, and we are able to recommend categorized videos to users through the community characteristics of our YRN: global recommendation. Second, when a user is watching a video, we can recommend the other nodes connected to it: local recommendation.

In this paper we have worked with a small network due to the limitations of the YouTube API. We are currently looking at the possibility of using other datasets and augmenting them with information about the IDs of the users who wrote comments. A promising dataset is the one made available by Cha et al. [16], which contains more than 400,000 videos.

Evaluation is also an important issue that we did not tackle in this paper. When dealing with recommendations, we should evaluate whether our recommendation is more effective than others. This is a tricky issue because user surveys can introduce bias; the development of a survey and the way it is applied should be carefully handled to avoid this. We are currently looking into this issue to see if we can devise a method to evaluate the effectiveness of our recommendation system without the use of surveys. Ideally, we would like to have a metric that could be quantified without having to ask users directly whether they liked the recommendation or not.

REFERENCES

[1] V. Furtado, A. Melo, A. L. V. Coelho, and R. Menezes, “A crime simulation model based on social networks and swarm intelligence,” in Proceedings of the 2007 ACM Symposium on Applied Computing (SAC), 2007, pp. 56–57.
[2] P. Arabie and Y. Wind, “Marketing and social networks,” in Advances in social network analysis: Research in the social and behavioral sciences, S. Wasserman and J. Galaskiewicz, Eds. Sage Press, 1994, pp. 254–273.
[3] A. S. Klovdahl, “Social networks and the spread of infectious diseases: The AIDS example,” Social Science and Medicine, vol. 21, no. 11, pp. 1203–1216, 1985.
[4] B. Collingsworth and R. Menezes, “Identification of social tension in organizational networks,” in Complex Networks, ser. Studies in Computational Intelligence, S. Fortunato, G. Mangioni, R. Menezes, and V. Nicosia, Eds. Springer Verlag, 2009, pp. 209–223.
[5] Google Inc., “YouTube Fact Sheet,” http://www.youtube.com/t/fact_sheet, 2006.
[6] A. Mislove, M. Marcon, K. Gummadi, P. Druschel, and B. Bhattacharjee, “Measurement and analysis of online social networks,” in Proceedings of the 7th ACM SIGCOMM conference on Internet measurement. ACM, 2007, pp. 29–42.
[7] J. Paolillo, “Structure and network in the YouTube core,” in Hawaii International Conference on System Sciences. IEEE Computer Society, 2008, pp. 146–156.
[8] T. Mei, B. Yang, X. Hua, L. Yang, S. Yang, and S. Li, “VideoReach: an online video recommendation system,” in Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval. ACM, 2007, pp. 767–768.
[9] A. Krishnakumar, “Recoo: A Recommendation System for Youtube RSS Feeds,” University of California, Santa Cruz, Tech. Rep., 2007.
[10] Google Inc., “YouTube data API,” http://code.google.com/apis/youtube/overview.html, 2009.
[11] NWB Team, “Network workbench tool,” Indiana University, Northeastern University and University of Michigan, Tech. Rep., 2006, http://nwb.slis.indiana.edu.
[12] A.-L. Barabási and R. Albert, “Emergence of scaling in random networks,” Science, vol. 286, no. 5439, pp. 509–512, Oct. 1999.
[13] S. Milgram, “The small world problem,” Psychology Today, vol. 2, no. 1, pp. 60–67, 1967.
[14] G. Palla, I. Derenyi, I. Farkas, and T. Vicsek, “Uncovering the overlapping community structure of complex networks in nature and society,” Nature, vol. 435, no. 7043, pp. 814–818, June 2005.
[15] X. Cheng, C. Dale, and J. Liu, “Statistics and social network of YouTube videos,” in Proc. of IEEE IWQoS. Citeseer, 2008.
[16] M. Cha, H. Kwak, P. Rodriguez, Y.-Y. Ahn, and S. Moon, “I Tube, You Tube, Everybody Tubes: Analyzing the World’s Largest User Generated Content Video System,” in ACM Internet Measurement Conference, October 2007.