Artificial Life and Robotics (2018) 23:249–254 https://doi.org/10.1007/s10015-017-0424-8
ORIGINAL ARTICLE
A real-time recommendation engine using lambda architecture Thanisa Numnonda1 Received: 18 April 2017 / Accepted: 11 December 2017 / Published online: 29 December 2017 © ISAROB 2017
Abstract In data science, recommendation is one of the most popular methodologies and has been deployed in many industries. However, one of the most challenging problems today is how to recommend items over massive streaming data. This paper therefore proposes a real-time recommendation engine using the Lambda architecture. The Apache Hadoop and Apache Spark frameworks were used in this research to process the MovieLens datasets from GroupLens research, comprising 100 K and 20 M ratings. Using the alternating least squares (ALS) and k-means algorithms, the top K recommended movies and the top K trending movies for each user were produced as results. Additionally, the mean squared error (MSE) and the within-cluster sum of squared errors (WCSS) were computed to evaluate the performance of the ALS and k-means algorithms, respectively. The results are acceptable since the MSE and WCSS values are low compared to the size of the data; however, they can still be improved by tuning some parameters. Keywords Apache Spark · Big data 2.0 · Collaborative filtering · Clustering · Lambda architecture · Real-time recommendation
1 Introduction Big data has been widely discussed since technology related to information architecture improved and frameworks such as Apache Hadoop were introduced. In the era of big data 1.0, we dealt with large sets of complex data, and most of the analytics was batch processing. Nowadays, however, more real-time data arrives from emerging technologies such as social media and the internet of things. Real-time big data analytics is therefore considered more challenging and is now discussed as big data 2.0. One example of big data 2.0 analytics is a recommendation system, which must analyze large streaming data. A recommender system [1] can help users choose items among an overwhelming number of alternatives. Internet companies such as Netflix, Amazon.com, and YouTube This work was presented in part at the 22nd International Symposium on Artificial Life and Robotics, Beppu, Oita, January 19–21, 2017. * Thanisa Numnonda
[email protected]
have found that providing personalized product recommendations to customers is increasingly valuable [2]. Various techniques and tools have therefore been proposed to produce more accurate recommendations. However, one of the most challenging problems today is how to recommend currently trending items to customers over massive streaming data, in addition to recommending items based on historical data. Real-time recommendations can be considerably difficult to compute quickly due to the large volume of user and item data. This paper therefore proposes a methodology for a real-time recommendation engine using the Lambda architecture. The proposed methodology comprises three layers: (1) a batch layer, (2) a speed layer, and (3) a serving layer. In the rest of this paper, Sect. 2 provides background on the collaborative filtering technique using the alternating least squares (ALS) algorithm, the clustering technique using the k-means algorithm, Apache Hadoop, Apache Spark, Apache Kafka, and the Lambda architecture. Related work is discussed in Sect. 3. The implementation on Hadoop clusters, the datasets, and sample codes are presented in Sect. 4. The results are shown in Sect. 5, and the conclusion is given in the last section.
Faculty of Information Technology, King Mongkut’s Institute of Technology Ladkrabang, Bangkok, Thailand
2 Background work
This research first used two main machine learning algorithms: ALS and k-means. Two large datasets were used, so big data technologies such as Apache Hadoop, Apache Spark, and Apache Kafka were employed. Finally, the Lambda architecture was applied to achieve real-time recommendations. This section provides brief background on each of these.

2.1 Collaborative filtering using ALS

Recommender systems are widely studied, and many methods are available, including collaborative filtering (CF) and content-based filtering [3]. CF has been very successful in both research and practice as a recommender system technique. It is simply a mechanism to filter massive amounts of data based on the previous interactions of a large number of users. There are two well-known types of CF: user-based and item-based. In the user-based approach, a user's previous preferences and the preferences of like-minded users are used to predict unknown ratings and recommend good candidate items to a given user. However, as the numbers of users and items grow, the user-item preferences represented in a two-dimensional rating matrix (with users as rows and items as columns) can become extremely large. Moreover, there are normally only a few available ratings in the matrix, which can lead to lower prediction accuracy. CF can therefore easily suffer from scalability and sparsity problems. To address these problems, matrix factorization (MF) [4], one of the most used and best-performing methods for CF, is applied. MF decomposes a large user-item association matrix X into lower-dimensional user-factor (U) and item-factor (V) matrices such that X ≈ UV, where X ∈ R^{m×n}, U ∈ R^{m×k}, and V ∈ R^{k×n}. It also provides modeling for various real-life situations. This method therefore helps not only with scalability but also with prediction accuracy. A popular approach to compute the missing data in the two MF matrices is ALS. ALS works by iteratively solving a series of least-squares regression problems [5]. In each iteration, one of the two dependent parameters is fixed to solve for the second one; the solved parameter is then, in turn, treated as fixed while the first one is updated. This process is repeated until the two parameters stop changing. By fixing one or the other, each step becomes a simple least-squares solution. This algorithm is available in Spark MLlib, which is used in this research.

2.2 Clustering using k-means

Clustering is used to group similar users. k-means is one of the simplest unsupervised learning algorithms that solves the well-known clustering problem. The procedure follows a simple and easy way to classify a given dataset through a certain number of clusters (assume k clusters) [6]. The algorithm aims at minimizing an objective function, in this case a squared-error function, shown in Eq. (1):
J = \sum_{j=1}^{k} \sum_{i=1}^{n} \left\| x_i^{(j)} - c_j \right\|^2,   (1)
where ‖x_i^{(j)} − c_j‖ is a chosen distance measure between a data point x_i^{(j)} and the cluster center c_j, so that J indicates the overall distance of the n data points from their respective cluster centers. In this research, the k-means algorithm is used to partition the dataset built by the ALS model into a small number of clusters. Items in the same cluster are kept as close to each other as possible by minimizing the distance between them.
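To make Sects. 2.1 and 2.2 concrete, the following plain-Python sketch runs the same two-step pipeline on toy data: it alternates the two least-squares updates of ALS, then clusters the resulting user factors with k-means. This is illustrative only; the paper uses Spark MLlib, and the tiny rating matrix, the rank-1 factors, k = 2, and the iteration counts here are all assumptions.

```python
# Sketch of the two-step pipeline: (1) rank-1 ALS on a tiny rating matrix,
# (2) k-means (k = 2) on the resulting user factors. All values illustrative.
X = [
    [5.0, 4.0, None],   # generous raters
    [4.0, None, 5.0],
    [1.0, None, 2.0],   # harsh raters
    [None, 1.0, 1.0],
]
m, n = len(X), len(X[0])
u, v = [1.0] * m, [1.0] * n

# Step 1: ALS. Alternately fix v and solve for each u[i], then fix u and
# solve for each v[j]; only observed (non-None) ratings contribute.
for _ in range(20):
    for i in range(m):
        obs = [j for j in range(n) if X[i][j] is not None]
        u[i] = sum(X[i][j] * v[j] for j in obs) / sum(v[j] ** 2 for j in obs)
    for j in range(n):
        obs = [i for i in range(m) if X[i][j] is not None]
        v[j] = sum(X[i][j] * u[i] for i in obs) / sum(u[i] ** 2 for i in obs)

# Step 2: k-means on the (1-D) user factors, minimizing the objective J of Eq. (1).
centers = [min(u), max(u)]
for _ in range(10):
    clusters = [[] for _ in centers]
    for x in u:
        nearest = min(range(len(centers)), key=lambda c: (x - centers[c]) ** 2)
        clusters[nearest].append(x)
    centers = [sum(c) / len(c) if c else centers[k] for k, c in enumerate(clusters)]

# J from Eq. (1): sum of squared distances to the assigned cluster centers
J = sum(min((x - c) ** 2 for c in centers) for x in u)
print(u, centers, round(J, 4))
```

With rank-1 factors each update has the closed form u_i = Σ_j x_ij v_j / Σ_j v_j²; Spark's ALS solves the analogous k-dimensional normal equations per user and per item.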
2.3 Apache Hadoop Apache Hadoop [7] is a scalable, fault-tolerant distributed system for data storage and processing. It is open source and distributed under the Apache license. Core Hadoop comprises two main components: the Hadoop distributed file system (HDFS) [8] for storing data and MapReduce [9] for processing it. In addition, Hadoop has many related services that enhance its capabilities. For example, Hive, Pig, and Impala can be used as alternative processing languages; Flume, Sqoop, and Kafka can be used for data ingestion into Hadoop storage; and HBase can be used as a Hadoop NoSQL database. Apache Hadoop was used in this research to store a large-scale semi-structured dataset, the MovieLens dataset. In addition, the related Hadoop services Apache Kafka and Apache Spark were used to process the dataset.
2.4 Apache Spark and Spark MLlib Apache Spark is an open-source framework for distributed processing with in-memory, fault-tolerant data structures. It integrates easily with Apache Hadoop and uses a primary abstraction called the resilient distributed dataset (RDD) [10]. It can be up to 100 times faster than Hadoop MapReduce for real-time data processing. Apache Spark can handle many input/output data sources, such as RDBMS, cloud storage, NoSQL, and HDFS. Programming languages can also be
chosen from Scala, Java, or Python. The Apache Spark platform comprises many processing modules, such as Spark SQL, Spark Streaming, GraphX, and MLlib. In this research, we used Spark MLlib for the machine learning algorithms and Spark Streaming for processing the real-time data.

2.5 Apache Kafka

Apache Kafka is a distributed publish-subscribe messaging system initially developed at LinkedIn. It is designed for processing real-time activity stream data, e.g., logs and metrics collections. It enables near real-time access to any data source and allows us to build real-time analytics. Apache Kafka supports both queue and topic semantics, and it runs as a cluster comprising one or more servers, each called a broker. In the Kafka messaging concept, a producer publishes messages into a Kafka topic, and consumers that subscribe to the topic pull and process the feed of published messages. In this research, Apache Kafka was used to feed the real-time data stream for processing.

2.6 Lambda architecture

The main purpose of this research is to provide a recommendation system over both historical data and real-time feeding data. Therefore, the Lambda architecture [11], which is suitable for real-time big data systems, is chosen. It processes scalable new data using three major layers: a batch layer, a speed layer, and a serving layer. The batch layer takes the raw datasets, pre-computes batch views from them using Apache Spark, and sends the views to the serving layer. The speed layer uses new data to create real-time views. The two kinds of views are then merged and indexed so that they can be queried with low latency, as shown in Fig. 1.

Fig. 1 General Lambda architecture

3 Related work

Gong [12] utilizes both user clustering and item clustering in CF to solve the scalability and sparsity problems. The k-means algorithm clusters the users into groups represented by cluster centers; the similarity between a target user and the cluster centers is computed for user clustering, and item clustering then provides a personalized recommendation to the target user. The performance is evaluated through the mean absolute error (MAE), which is shown to be lower than that of traditional CF, so the scalability and prediction accuracy problems are improved. Phorasim et al. [13] propose a movie recommender system using CF and the k-means clustering algorithm: users are clustered into groups using the Euclidean distance, and CF is used to recommend interesting items to users, reducing the time consumption and increasing the prediction accuracy of the CF algorithm. Zhou et al. [14] propose ALS with weighted regularization (ALS-WR); this strategy not only enables parallel computation in each MapReduce phase but also maximizes data locality to minimize the communication cost. Phulari et al. [15] present ClubCF, a clustering-based CF for searching big data application queries. The AHC hierarchical clustering algorithm clusters big data applications before the CF approach performs service recommendation; experiments show reduced computation time, higher prediction accuracy, and less impact from the sparsity problem. Liu et al. [16] use Apache Mahout for a parallel item-based CF algorithm; using the MovieLens and Douban datasets, the experimental results show that this parallel algorithm outperforms serial traditional CF algorithms. Dutta et al. [17] provide real-time techniques and a number of real-world use cases of big data analytics, and Huang et al. [18] present TencentRec, which captures users' real-time interests through practical algorithm implementations and significantly mitigates the sparsity problem.

4 Implementation

The proposed Lambda architecture for real-time recommendation is shown in Fig. 2. It consists of three main layers, and each layer uses a different algorithm to produce recommendation results. In the batch layer, the historical rating data is processed with the ALS algorithm to recommend the top K items for each user; the recommendation result from the batch layer is thus a user preference based on historical data. The ALS result also provides item ratings for all users, and these are used to cluster user preferences via the k-means algorithm. In the speed layer, the items selected in real time are grouped by cluster, and the top K trending items per cluster are found; the recommendation result from the speed layer is based on streaming data. Finally, the serving layer combines the top K recommended items from the batch layer with the top K trending items from the speed layer. The combination of the two layers is straightforward, since all recommendations from both layers are simply shown together. The recommendation from each layer serves a different purpose for a viewer: the batch layer recommends based on the viewer's own viewing history, while the speed layer recommends based on what other users are currently viewing. The proposed architecture is based on the Apache Hadoop and Spark frameworks and their ecosystem, where the historical data is stored in Hadoop HDFS storage, the real-time data is streamed using an Apache Kafka server, and the algorithms are implemented with Apache Spark, Spark MLlib, and Spark Streaming.

Fig. 2 Proposed Lambda architecture
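The serving-layer combination described above can be sketched in plain Python. The user ids, movie ids, and cluster assignments below are hypothetical, and the merge (batch list first, then trending items, with duplicates dropped) is one simple way to realize the "show all recommendations from both layers" rule:

```python
# Serving-layer sketch: combine per-user batch recommendations (from ALS)
# with per-cluster trending items (from the speed layer).

# Hypothetical batch-layer output: top-K movie ids per user
batch_top_k = {42: [318, 858, 50], 7: [296, 593, 318]}
# Hypothetical k-means assignment: user id -> cluster id
user_cluster = {42: 3, 7: 1}
# Hypothetical speed-layer output: top-K trending movie ids per cluster
trending_top_k = {3: [603, 318], 1: [778, 296]}

def serve(user_id):
    """Return batch recommendations followed by the user's cluster-trending
    items, dropping duplicates while preserving order."""
    merged = (batch_top_k.get(user_id, [])
              + trending_top_k.get(user_cluster.get(user_id), []))
    seen, out = set(), []
    for movie in merged:
        if movie not in seen:
            seen.add(movie)
            out.append(movie)
    return out

print(serve(42))
```

An unknown user simply falls back to the trending list of no cluster (an empty result here); a production serving layer would typically fall back to global trending items instead.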
4.1 Hadoop clusters and architecture

A big data architecture on a public cloud, Amazon Web Services (AWS), was used in this research to prove the conceptual architecture. It is a highly scalable platform comprising three components: (1) a Hadoop cluster, (2) a Kafka server, and (3) a data server. The Hadoop cluster was installed on five m3.2xlarge AWS virtual servers, each with 4 vCPUs, 15 GB of memory, and 500 GB of SSD storage. The Hadoop cluster is based on the Cloudera Express edition. In this research, it is used for storing the movie rating data in HDFS and processing the data using Apache Spark. The Kafka server was installed on a single m3.xlarge AWS virtual server; it is a broker that receives streaming messages from the data server and sends them to the Hadoop cluster through a Kafka topic.

4.2 Dataset

The MovieLens datasets from GroupLens research were used in this research. The first dataset has 20 M ratings and 465,000 tag applications applied to 27,000 movies by 138,000 users. The second dataset has 100,000 ratings from 1000 users on 1700 movies.

4.3 Sample codes

Spark MLlib and Spark Streaming were used to implement the machine learning algorithms. The programs were written in Scala, as shown in the following two sample codes.

/* SAMPLE ALS CODE */
import org.apache.spark.mllib.recommendation.Rating
import org.apache.spark.mllib.recommendation.ALS

// Load the ratings file from HDFS into the Spark cluster
val rawData = sc.textFile("hdfs:///user/cloudera/movielens/ratings.csv")
// Drop the header row and keep the first three columns: user, movie, rating
val rawData_head = rawData.first()
val newData = rawData.filter(row => row != rawData_head)
val rawRatings = newData.map(_.split(",").take(3))
val ratings = rawRatings.map { case Array(user, movie, rating) =>
  Rating(user.toInt, movie.toInt, rating.toDouble)
}
// Train the ALS model (rank = 50, 10 iterations, lambda = 0.01)
val model = ALS.train(ratings, 50, 10, 0.01)
// Save the trained model
model.save(sc, "target/tmp/ratingALS")
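The factor vectors learned by ALS can be turned into a top-K ranking by scoring each item with a dot product against the user's factor vector. The plain-Python sketch below illustrates this scoring step with hypothetical 3-dimensional factors (in the actual pipeline, Spark MLlib's model holds the factors and can produce recommendations directly):

```python
# Top-K ranking sketch: score items for one user by the dot product of
# ALS factor vectors, then keep the K highest-scoring items.
user_factor = [0.8, -0.2, 0.5]      # hypothetical user factor vector
item_factors = {                    # hypothetical item factor vectors
    101: [1.0, 0.0, 0.2],
    102: [-0.5, 0.3, 0.1],
    103: [0.6, -0.4, 0.9],
}

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

K = 2
scores = {item: dot(user_factor, f) for item, f in item_factors.items()}
top_k = sorted(scores, key=scores.get, reverse=True)[:K]
print(top_k)
```

In practice items the user has already rated would be filtered out before taking the top K.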
/* SAMPLE K-MEANS CLUSTERING CODE */
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.clustering.{KMeans, KMeansModel}

// Extract the user feature vectors from the trained ALS model
val userFactors = model.userFeatures.map { case (id, factor) =>
  (id, Vectors.dense(factor))
}
val userVectors = userFactors.map(_._2)
// Set the training parameters for the k-means algorithm
val numClusters = 15
val numRuns = 3
val numIterations = 10
// Train the k-means model
val userClusterModel = KMeans.train(userVectors, numClusters, numIterations, numRuns)
// Save the trained model
userClusterModel.save(sc, "target/tmp/userCluster")
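Once users are assigned to clusters, the speed layer's trending computation reduces to counting recent view events grouped by the viewer's cluster. A plain-Python sketch with hypothetical events (the paper does this over a Kafka stream with Spark Streaming):

```python
from collections import Counter, defaultdict

# Hypothetical stream of (user_id, movie_id) view events
events = [(1, 318), (2, 318), (3, 296), (1, 296), (4, 318), (5, 50)]
# Hypothetical k-means assignment of users to clusters
user_cluster = {1: 0, 2: 0, 3: 1, 4: 0, 5: 1}

# Count views per movie within each cluster
counts = defaultdict(Counter)
for user, movie in events:
    counts[user_cluster[user]][movie] += 1

# Top-K trending movies for each cluster
K = 2
trending = {c: [m for m, _ in counter.most_common(K)]
            for c, counter in counts.items()}
print(trending)
```

In a streaming setting these counts would be maintained over a sliding window so that "trending" reflects only recent activity.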
Table 1 Performance of the system

Dataset   MSE      WCSS
100 K     0.0842   1784.2385
20 M      0.3129   906132.6935

Table 2 Processing time of the system

Dataset   Algorithm   Processing time (s)
100 K     ALS         30.74
100 K     k-means     2.87
20 M      ALS         159.73
20 M      k-means     5.61

Table 3 WCSS with different numbers of clusters

Number of clusters   100 K       20 M
5                    1881.8513   951321.3243
10                   1832.5318   922621.3825
15                   1784.2385   906132.6935
20                   1751.1305   894798.9972
25                   1718.2053   885883.6966
30                   1679.0972   880493.0582
5 Results

The mean squared error (MSE) and the within-cluster sum of squared errors (WCSS) were computed to evaluate the performance of the ALS algorithm and the k-means algorithm, respectively. The MSE is a direct measure of the reconstruction error of the user-item rating matrix from the ALS algorithm; it assesses the quality of a predictor and is estimated by Eq. (2):

MSE = \frac{1}{n} \sum_{i=1}^{n} \left( Y_i - \hat{Y}_i \right)^2,   (2)

where Y_i is an observed value, \hat{Y}_i is the corresponding predicted value, and n is the number of predictions. The WCSS is the sum of squared distances from each point in a cluster to its cluster center; the k-means algorithm seeks the partition S that minimizes it, as estimated by Eq. (3):

\arg\min_{S} \sum_{i=1}^{k} \sum_{x \in S_i} \left\| x - \mu_i \right\|^2,   (3)

where S = \{S_1, S_2, \ldots, S_k\} is the set of clusters and \mu_i is the mean of the points in S_i. When the number of clusters is 15, the performance and processing time of the system are given in Tables 1 and 2, respectively. The results are acceptable since the MSE and WCSS values are low compared to the size of the data; however, they can still be improved by tuning some parameters. In this research, we also varied the number of clusters in the k-means algorithm, and Table 3 shows that the WCSS improves as the number of clusters increases.
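As a quick numeric illustration of Eq. (2), the MSE of a handful of predicted ratings can be computed as below; the observed and predicted values are hypothetical, not taken from the experiments:

```python
# MSE sketch, Eq. (2): mean of squared differences between observed
# ratings Y and predicted ratings Y_hat.
Y     = [5.0, 3.0, 4.0, 1.0]   # hypothetical observed ratings
Y_hat = [4.8, 3.1, 3.5, 1.4]   # hypothetical predicted ratings

n = len(Y)
mse = sum((y - yh) ** 2 for y, yh in zip(Y, Y_hat)) / n
print(round(mse, 4))
```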
6 Conclusion

A real-time recommendation engine using the Lambda architecture has been proposed in this paper. The architecture comprises three layers: (1) a batch layer, (2) a speed layer, and (3) a serving layer. The proposed engine was implemented on a Hadoop cluster and tested with the MovieLens datasets from GroupLens research, comprising 100 K and 20 M ratings. The proposed architecture and algorithms were demonstrated using the Hadoop and Spark frameworks: the historical data is stored in Hadoop HDFS storage, the real-time data is streamed using the Kafka server, and the algorithms are implemented with Spark, Spark MLlib, and Spark Streaming. The results were shown in terms of the top K recommended movies and the top K trending movies for each user. In addition, the MSE and the WCSS were computed to evaluate the performance of the ALS algorithm and the k-means algorithm, respectively.
References
1. Kantor PB, Rokach L, Ricci F, Shapira B (2011) Recommender systems handbook. Springer, Berlin
2. Linden G, Smith B, York J (2003) Amazon.com recommendations: item-to-item collaborative filtering. IEEE Internet Comput 7:76–80
3. Aggarwal CC (2016) Recommender systems. Springer, Switzerland
4. Koren Y, Bell R, Volinsky C (2009) Matrix factorization techniques for recommender systems. Computer 42(8):30–37
5. Pentreath N (2015) Machine learning with Spark. Packt Publishing, Birmingham
6. Panigrahi S, Lenka RK, Stitipragyan A (2016) A hybrid distributed collaborative filtering recommender engine using Apache Spark. International workshop on big data and data mining challenges on IoT and pervasive systems (BigD2M 2016), pp 1000–1006
7. Karanth S (2014) Mastering Hadoop. Packt Publishing, Birmingham
8. Shvachko K (2010) The Hadoop distributed file system. In: Proceedings of the 2010 IEEE 26th symposium on mass storage systems and technologies (MSST'10), pp 1–10
9. Dean J, Ghemawat S (2004) MapReduce: simplified data processing on large clusters. OSDI
10. Zaharia M, Chowdhury M, Das T, Dave A, Ma J, McCauley M, Franklin MJ, Shenker S, Stoica I (2012) Resilient distributed datasets: a fault-tolerant abstraction for in-memory cluster computing. In: Proceedings of the 9th USENIX conference on networked systems design and implementation, USENIX Association
11. Marz N, Warren J (2013) Big Data: principles and best practices of scalable real-time data systems. O'Reilly Media, Newton
12. Gong S (2010) A collaborative filtering recommendation algorithm based on user clustering and item clustering. JSW 5(7):745–752
13. Phorasim P, Yu L (2016) Movies recommendation system using collaborative filtering and k-means. Int J Adv Comput Res 7(29):52
14. Zhou Y, Wilkinson D, Schreiber R, Pan R (2008) Large-scale parallel collaborative filtering for the Netflix prize. Algorithmic aspects in information and management. Springer, Berlin, pp 337–348
15. Phulari SV, Shah PP, Kalpande AD, Pawar VA (2016) Clustering and filtering approach for searching big data application query. Int J Eng Sci Innov Technol 5(1):197–204
16. Liu Q, Xiaobing L (2015) A new parallel item-based collaborative filtering algorithm based on Hadoop. JSW 10(4):416–426
17. Dutta K, Jayapal M (2015) Big data analytics in real time systems. In: Big data analytics seminar, pp 1–13
18. Huang Y, Cui B, Zhang W, Jiang J, Xu Y (2015) TencentRec: real-time stream recommendation in practice. SIGMOD'15