Clustering Approach to Collaborative Filtering Using Social Networks

Emir Cogo and Dzenana Donko
Faculty of Electrical Engineering, University of Sarajevo, Bosnia and Herzegovina
[email protected], [email protected]

Abstract - This paper presents the results of using clustering to improve collaborative filtering. Clusters of users are created from friendship links within a social network using the Markov Cluster Algorithm (MCL). The clusters are then used to predict user choices using item based collaborative filtering with cosine similarity. Based on the results of analyzing different cluster sizes, a new algorithm is proposed that saves time and memory resources.

Keywords - Clustering; Graph clustering; Collaborative filtering; Social networks; MCL algorithm

I. INTRODUCTION

In the last five years the number of social network users and the influence of social networks in the world have increased rapidly. This growth accelerated with the appearance of smartphones, which gave regular users easier access to the Internet and to social networks. Many users use the Internet only because of social networks, and the use of social networks for communication has surpassed email communication [1]. A social aspect is present in many applications and web pages. Many business solutions and Customer Relationship Management (CRM) applications have started using social networks because of the convenience they offer. Social networks are used intensively for marketing, and large amounts of resources are spent on new methods that use information gathered from social networks to recommend the content most likely to fit users' needs [2].

Data mining is widely used in marketing. One of the best known cases is amazon.com and its patent for collaborative filtering, which enables precise recommendations to users based on their purchases. The method was so successful that many companies now compete to provide the best recommendations in order to significantly increase their number of returning customers.

Clustering groups objects into clusters based on their similarity, which makes it possible to notice connections that were not visible before. Social networks are convenient for clustering because there are numerous ways to form groups that fit the defined problem, and they have a lot of potential for improving the results of collaborative filtering.

II. GOAL, DATASET AND ALGORITHMS

The main goal of this paper is to show how clustering can improve the results of collaborative filtering when resource limitations are of concern. Several groups within a social network are selected for the collaborative filtering tests to see whether the number of true predictions is larger than the number obtained when using collaborative filtering over the whole set of users. Based on the analyzed results, a new algorithm is developed that combines the advantages of clustering and collaborative filtering.

A. Dataset

The idea of the "2nd International Workshop on Information Heterogeneity and Fusion in Recommender Systems" was that students, researchers, professors and professionals from universities and industry meet and explore the application of heterogeneous information in recommender systems, i.e. how information about the users' network and their activities can be used to improve the results or to speed up the processing time of recommendation algorithms. Several datasets were announced, such as the Movielens dataset, which provides data on films reviewed by users, and the Bookcrossing dataset, which provides lists of read books. From these datasets the Last.fm dataset was selected to test the algorithm proposed in this paper. This dataset is maintained by the GroupLens group and is located at http://www.grouplens.org/node/462.

Last.fm is a web service that enables users to add music they are listening to into a list and then receive recommendations for music that could be of interest to them based on that list. Last.fm also enables users to create a public profile with their personal and music data and to share it with other users. It also provides information on artists and music, as well as many statistics such as the most listened artists and groups [3].

The dataset contains 92,800 artist listening records from 1892 users, together with information about user friendship. It is composed of two main tables. The first table contains the user, the listened artist and the number of times songs by that artist were played. The second table contains pairs of users that are friends. Users' personal data is hidden, and users are instead represented by id numbers [4].

For Last.fm there is a continuous effort to find a better recommendation algorithm. This challenge has been launched by various websites that provide similar services in order to collect a large number of ideas for better recommendation approaches. The task is as follows: given the songs played by one half of the users, find the songs that will be listened to by the users in the second half. For this challenge there are many supporting datasets that sponsors have set up in order to find potential algorithms. The dataset used here contains everything that is required for the application of collaborative filtering and clustering.

There are many published papers that present possible solutions of the original problem. The winning approach uses classic item based collaborative filtering with additional modifications that improved the accuracy of the result, including the use of prediction based on the similarity between users [5]. Because the offered dataset includes detailed metadata of songs and artists, the data made available for the competition has also proved useful for testing many other problems. One of these problems is the prediction of the year in which a song was released based on its content [6]. This paper presents one approach to improving algorithms for predicting songs in users' song lists.

B. Clustering algorithm

User friendship data can be represented as a graph, where nodes represent users and links between nodes represent friendship between users [7]. The clustering algorithm was supposed to find tightly connected groups of users within the dataset. Users were added to groups based on the number of shared friends. The algorithm chosen for this kind of clustering was the Markov Cluster Algorithm (MCL), since it is a graph clustering algorithm and is based on the number of connections between nodes.

The MCL algorithm is based on two concepts: random walks and Markov chains [8]. The idea of random walks is that every graph contains clusters with many edges within them and few edges between them. When randomly traversing (walking) from node to node, there is a bigger chance of staying within a cluster than of moving to another cluster.
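The random walk intuition can be illustrated with a small sketch. The graph below (two triangles joined by a single bridge edge), the function names `step` and `stay_probability`, and the choice of a uniform transition probability over neighbors are all assumptions made for illustration; they are not taken from the dataset or from the original experiments.

```python
from fractions import Fraction

def step(dist, adj):
    """Advance a probability distribution over nodes by one random-walk step.

    Each node passes its probability mass to its neighbors in equal shares.
    """
    new = {node: Fraction(0) for node in adj}
    for node, prob in dist.items():
        for neighbor in adj[node]:
            new[neighbor] += prob / len(adj[node])
    return new

def stay_probability(adj, start, cluster, steps):
    """Probability that a walk started at `start` is inside `cluster` after `steps` steps."""
    dist = {node: Fraction(0) for node in adj}
    dist[start] = Fraction(1)
    for _ in range(steps):
        dist = step(dist, adj)
    return sum(dist[node] for node in cluster)

# Hypothetical toy graph: clusters {0, 1, 2} and {3, 4, 5} joined by the edge 2-3.
toy_graph = {
    0: [1, 2],
    1: [0, 2],
    2: [0, 1, 3],
    3: [2, 4, 5],
    4: [3, 5],
    5: [3, 4],
}

for k in (1, 2, 3):
    print(k, stay_probability(toy_graph, 0, {0, 1, 2}, k))
```

In this toy graph most of the probability mass remains in the starting cluster after a few steps, because only one of the edges leaving the cluster's nodes crosses to the other cluster.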
With many random walks there will be a large flow within a cluster, and that flow can be used to determine clusters [9]. Random walks on a graph are calculated using Markov chains. First, the graph is represented as a matrix, then loops are added on each node and the matrix is normalized. Random walks are generated by the expansion and inflation operators [10]. The expansion operator takes the e-th power of the matrix. This operator groups regions of nodes into a single group. The inflation operator raises each element of a column to the i-th power and then divides each element by the sum of the column. This operator strengthens the bonds between strongly connected neighboring nodes and weakens the links between weakly connected neighboring nodes [8]. The inflation parameter defines the granularity of the clustering. By choosing different e and i parameters, different sizes of clusters are achieved.

In the MCL algorithm, cluster sizes cannot be predetermined and the resulting groups will vary in size. They can only be influenced by these parameters. Three clustering cases were used for analysis: large groups (e = 3, i = 3, presented in Fig. 1) with group sizes 912 (blue cluster), 638 (green cluster) and 80 (red and light blue clusters); medium groups (e = 2, i = 2) with one group of 488 users and groups in the range from 80 to 100 users; and small groups (e = 2, i = 3) below 80 users. With these groups, roughly all group sizes in the range from 5 to 912 users are covered.

Figure 1. Large group size clustering

It is important to note that running the clustering algorithm over 2,100 users took about 14 seconds. For very large datasets, such as the real Last.fm database of about 40 million users, the algorithm would take much longer. With additional optimization and parallelization the running time could be reduced to an acceptable level, but it would still be slow. For this reason, in this paper we propose an algorithm that can perform the grouping without depending on the number of users.

C. Collaborative filtering algorithm

There are two types of collaborative filtering algorithms: user based collaborative filtering and item based collaborative filtering [11]. User based collaborative filtering makes predictions based on user item selection by searching for users with similar attributes and then offering the items that those similar users picked. This method is based on two assumptions: that users have similar preferences over time and that these preferences are consistent and stable. Item based collaborative filtering algorithms make predictions based on item similarity. When a user picks an item, its similarity with other items is calculated and the user is presented with similar items [11].

The collaborative filtering algorithm used for analysis is classical item based collaborative filtering using cosine similarity [12]:

$$\mathrm{Similarity}(x, y) = \cos\varphi = \frac{x \cdot y}{\lVert x \rVert \, \lVert y \rVert}$$

User based collaborative filtering is not used because clustering already groups similar users. Clustering brings the aspect of user based collaborative filtering into item based collaborative filtering so that their combined strategies can be analyzed.

III. ANALYSIS AND RESULTS

The collaborative filtering algorithm was first analyzed on the whole dataset to get a base result against which to compare the cluster results. The algorithm was then tested on each cluster. The analysis was done using the following rules: for every user within the dataset the most listened artist is picked and all other listened artists in that user's list are erased. Then a prediction of the top 5 artists is calculated using the chosen collaborative filtering algorithm and compared to the erased listened artists. The number of hits is then counted in a range from 0 hits (all artist predictions missed) to 5 hits (all five predictions were true).

When using collaborative filtering on all users, the results were as follows: zero hits 23.22%, one hit 17.74%, two hits 17.09%, three hits 14.65%, four hits 13.13% and five hits 14.16%. The percentages were fairly evenly distributed across the number of hits when collaborative filtering was used on the whole dataset.

Clusters were analyzed by applying collaborative filtering to each group and then taking the average value, but for more detailed results individual cluster results were also analyzed. Results on medium sized clusters were: zero hits 31.42%, one hit 22.24%, two hits 18.8%, three hits 14.18%, four hits 7.63% and five hits 5.69%. The overall results were not as good as on the whole dataset and they were not evenly distributed. The percentages decreased as the number of hits increased, but they were relatively close considering that most of the groups had about 80 users, which is much smaller than the 1800-user dataset. Results were roughly the same for small and large groups.

Analysis of the individual clusters indicated interesting results. While some of the clusters had slightly worse results than collaborative filtering on the whole dataset, other clusters had much better results (one cluster of 22 users had 4.76% misses and 42% five hits). This shows that results vary from group to group and that better results can be achieved by selecting the right group. Based on these results one would think that by selecting any group of users, the same results would be achieved.
For this reason, special groups of the most distant users were formed and the algorithm was tested on them. Results were much worse than for the normal groups: 58% were zero hits and 42% were predictions with at least one hit.

IV. PROPOSED ALGORITHM

Based on the analysis, a new algorithm for grouping users is proposed. For each user, his friends are added to the group with a group value of 1. Then, every friend of a friend is added to the group if he is new, or otherwise gets his group value increased. This way the most tightly connected nodes rank highest in the group. When the number of users exceeds the desired group size, users are sorted by their group value, the first n (desired size) users are picked and the collaborative filtering algorithm is applied to them. The algorithm does not depend on the size of the dataset and is very scalable, since it stops at the same point regardless of the dataset size. It could be used together with collaborative filtering in a fast changing social network, or to avoid long calculations on large datasets.

This algorithm was tested for the n parameter of 30, 100 and 500 users, as shown in Fig. 2. A parameter of 30 users gives a group that is closely tied to the user. A parameter of 100 users gives a scattered group next to the basic group, while a parameter of 500 users expands this group with users that are located in the main cluster. For almost every user, a parameter of 500 users returns a group from the main cluster due to its high mutual connectivity and connections with other nodes of the cluster.

Figure 2. Group of 30, 100 and 500 users

The algorithm was applied to every user within the whole dataset. Results of the new algorithm are presented in Table I. The obtained results are approximately the same for all three group sizes, and they are very close to the results of collaborative filtering applied to the whole dataset. The search algorithm together with collaborative filtering takes one second for 30 users, three seconds for 100 users and ten seconds for 500 users. Taking into consideration that the computer used does not have the performance of servers that are normally used for this kind of processing, this is a satisfactory execution time. In the table, the first column n gives the number of users in a group, while the other columns give the share of users for each number of hits (from 0 to 5), i.e. 0.1276 corresponds to 12.76%.

From the above it can be concluded that it is sufficient to take 30 users to obtain the desired results. The whole dataset uses 2100 users, and this approach uses a much smaller set of users to get the same results. With a very large number of users, as is the case for the last.fm page (the number of users reached 40 million in 2012), memory limitations and time constraints become a problem. This algorithm is more efficient than traditional collaborative filtering, since it will not have memory or processing time issues with really large datasets. With this algorithm the execution time will be the same in the case of 2100 users and of 40 million users.

TABLE I. RESULTS AFTER APPLYING NEW ALGORITHM

n     0 hits   1 hit    2 hits   3 hits   4 hits   5 hits
30    0.2742   0.1942   0.1502   0.1276   0.1258   0.1276
100   0.283    0.1885   0.1397   0.1212   0.1218   0.1455
500   0.2586   0.17     0.1582   0.1346   0.1262   0.1520
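The group-building step described in Section IV can be sketched as follows. This is only an illustration under stated assumptions, not the authors' implementation: the function name `build_group`, the dictionary representation of the friendship table, the level-by-level expansion, the exclusion of the target user from his own group, and breaking ties by user id are all choices made here.

```python
from collections import Counter

def build_group(friends, user, n):
    """Return up to n users ranked by how often they are reached from `user`.

    friends: dict mapping a user id to a list of friend ids (assumed layout).
    Each time a user is reached over a friendship link, his group value
    grows by 1, so tightly connected users end up with the highest values.
    """
    scores = Counter()      # group value per candidate user
    visited = {user}        # users whose friend lists are already queued
    frontier = [user]       # users to expand at the current distance
    # Assumption: stop expanding once at least n candidates were collected.
    while frontier and len(scores) < n:
        next_frontier = []
        for current in frontier:
            for friend in friends.get(current, ()):
                if friend == user:
                    continue            # the target user is not a candidate
                scores[friend] += 1     # new member gets 1, known member +1
                if friend not in visited:
                    visited.add(friend)
                    next_frontier.append(friend)
        frontier = next_frontier
    # Highest group value first; ties broken by user id (assumption).
    ranked = sorted(scores, key=lambda u: (-scores[u], u))
    return ranked[:n]

# Hypothetical friendship table for illustration only.
example_friends = {1: [2, 3], 2: [1, 3, 4], 3: [1, 2, 4], 4: [2, 3, 5], 5: [4]}
print(build_group(example_friends, 1, 3))
```

Because the expansion stops as soon as enough candidates have been collected, the amount of work depends on n and on the local neighborhood of the user rather than on the total number of users, which is the property the section relies on. The returned group would then be passed to the collaborative filtering step.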

V. CONCLUSION

The results given in this paper show a big potential in the application of clustering to improve the results of collaborative filtering, but only when processing time and memory resource limitations are considered. The proposed algorithm can easily be applied to any social network, and it gives fast results for large datasets without requiring powerful computer resources. Since some of the clusters show better results, there is a possibility that more accurate results can be achieved with the right cluster, possibly using parameters other than friendships between users.

VI. FURTHER RESEARCH

Further research will include a new dataset with more information about users, so that other factors can be used to create a better grouping than the one produced by the given algorithm. The algorithm would sort users based on the new attributes but would still acquire users based on their friendship. This would probably improve the accuracy of the algorithm while preserving its scalability.

REFERENCES

[1] http://www.digitalbuzzblog.com/infographic-the-growth-of-social-media2011, 2013
[2] http://www.greenbook.org/marketing-research.cfm/social-mediaopportunities-for-market-research-37076, 2013
[3] http://www.last.fm/about, 2013
[4] http://www.grouplens.org/node/462, 2013
[5] F. Aiolli, A Preliminary Study on a Recommender System for the Million Songs Dataset Challenge, Preference Learning: Problems and Applications in AI (PL-12), ECAI-12 Workshop, Montpellier, 2012
[6] Thierry Bertin-Mahieux, Daniel P. W. Ellis, Brian Whitman, Paul Lamere, The Million Song Dataset, ISMIR, 2011
[7] Elisa Schaeffer, Graph clustering, Elsevier, 2007
[8] Van Dongen, Graph Clustering by Flow Simulation, PhD Thesis, University of Utrecht, The Netherlands, 2000
[9] Pavel Berkhin, Survey of clustering data mining techniques, Grouping Multidimensional Data, 2006
[10] Nir Ailon, Steve Chien, Cynthia Dwork, On Clusters in Markov Chains, Latin, 2006
[11] J. Ben Schafer, Dan Frankowski, Jon Herlocker, and Shilad Sen, Collaborative Filtering Recommender Systems, Lecture Notes in Computer Science, 2007
[12] http://mines.humanoriented.com/classes/2010/fall/csci568/portfolio_exports/sphilip/cos, 2013