Designing Personalized Recommender Systems Dr. Satya Gautam Vadlamudi Principal Data Scientist Capillary Technologies
Outline
● How to build a ratings-based personalized recommender
● How to build a top-N personalized recommender
● How to build a content-based personalized recommender
Ratings based personalized recommender
Problem Statement: Given that a user has rated some products/movies (say, on a scale of 1 to 5), the objective is to predict the rating that the user would give to a new product.
What do we have
A sparse user-item ratings matrix: rows are users U1, U2, U3, ..., Um and columns are products P1, P2, P3, P4, P5, ..., Pn. Known cells hold ratings on the 1 to 5 scale (for example 5, 3.5, 1); cells marked "?" are the ratings to be predicted.
What else do we have
● A lot of content (the entire movie!)
● Cast & crew
● User reviews
● Critics' reviews
● CONTEXT
● And so much more...
And let’s not forget the humans!
User profile:
● Thinks a lot
● Watches 2 movies a year
● May like or hate the same movie
● Depends on CONTEXT
And more such profiles, of billions of reco-hungry humans
Don’t try this at home! (or even on supercomputers)
Make each pixel of every frame of each movie, at 4K resolution, a feature, and train a deep learning model on ratings data from billions of people on millions of movies (and use all available web/video data of each human too).
Is this the future though?
So how do we approach solving this problem?
Let’s get back to the present!
You guessed it right!
1. If someone likes some product lines, say dramas (has given them high ratings), then give dramas high scores for that user (Content Filtering)
2. Predict a user's rating based on how users similar to her have rated that product/movie (User-User Collaborative Filtering)
3. Predict a user's rating of a movie based on its relationship to the movies the user has rated in the past (Item-Item Collaborative Filtering)
Collaborative Filtering
Two types:
1. Neighbourhood methods
2. Latent factor models
Neighbourhood methods:
1. User User CF
2. Item Item CF
Latent factor models:
1. Matrix factorization with explicit feedback/with implicit feedback
2. Restricted Boltzmann Machines
Matrix factorization
For user u and item i, predict the rating r̂_ui as the dot product of the user's affinities to the latent factors (p_u) and the latent factors' affinities to the item (q_i):
$\hat{r}_{ui} = q_i^T p_u$ (Equation 1)
where p_u and q_i are real-valued vectors of size f (the number of latent factors).
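A minimal sketch of the prediction step, assuming the factor vectors are already known. The function name `predict_rating` and the example affinities are hypothetical and chosen only for illustration.

```python
def predict_rating(p_u, q_i):
    """Predicted rating: dot product of user factors p_u and item factors q_i."""
    if len(p_u) != len(q_i):
        raise ValueError("p_u and q_i must both have f entries")
    return sum(a * b for a, b in zip(p_u, q_i))


# Hypothetical affinities for f = 2 latent factors.
p_u = [1.5, 0.5]  # user's affinity to each latent factor
q_i = [2.0, 1.0]  # each latent factor's affinity to the item
rating = predict_rating(p_u, q_i)
print(rating)  # 1.5 * 2.0 + 0.5 * 1.0 = 3.5
```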
Latent Factors example
How to learn the latent factor affinities
Given a set K of records, each containing u, i, and r, learn p and q by solving the equation below:
$\min_{p, q} \sum_{(u,i) \in K} (r_{ui} - q_i^T p_u)^2 + \lambda (\|q_i\|^2 + \|p_u\|^2)$ (Equation 2)
where λ is the regularization parameter.
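The objective can be evaluated directly for any candidate p and q. The sketch below assumes the usual regularized squared-error form from the matrix factorization literature, with the λ penalty applied once per record; the function name `objective` and the sample records are hypothetical.

```python
def objective(records, P, Q, lam):
    """Regularized squared error over the known (u, i, r) records."""
    total = 0.0
    for u, i, r in records:
        pred = sum(a * b for a, b in zip(P[u], Q[i]))
        penalty = sum(a * a for a in P[u]) + sum(b * b for b in Q[i])
        total += (r - pred) ** 2 + lam * penalty
    return total


# Hypothetical records and factor vectors (f = 2).
records = [("U1", "P1", 5.0), ("U2", "P1", 1.0)]
P = {"U1": [2.0, 1.0], "U2": [1.0, 0.0]}
Q = {"P1": [2.0, 1.0]}
loss = objective(records, P, Q, lam=0.1)
print(loss)
```

A lower value means the factors reproduce the known ratings more closely without growing too large.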
How do we solve Equation 2
1. Stochastic Gradient Descent (SGD)
2. Alternating Least Squares (ALS)
3. Singular Value Decomposition (SVD)
4. And more...
SGD basic idea: for each known rating, compute the prediction error e_ui = r_ui - q_i^T p_u, then move p_u and q_i a small step in the direction that reduces it.
SVD basic idea: M = UΣV*, where M is m×n, U is m×m, Σ is m×n, V is n×n, and V* is the conjugate transpose of V
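A rough sketch of SGD for this problem, assuming the standard per-record update (step against the gradient of the regularized squared error). The learning rate, regularization value, epoch count, initialization range, and sample ratings are all assumptions made for illustration, not values from the talk.

```python
import math
import random


def predict(p, q):
    return sum(a * b for a, b in zip(p, q))


def rmse(records, P, Q):
    errors = [(r - predict(P[u], Q[i])) ** 2 for u, i, r in records]
    return math.sqrt(sum(errors) / len(errors))


def sgd_train(records, f=2, lr=0.02, lam=0.05, epochs=300, seed=0):
    """Learn user factors P and item factors Q from (u, i, r) records."""
    rng = random.Random(seed)
    P, Q = {}, {}
    for u, i, _ in records:
        if u not in P:
            P[u] = [rng.uniform(0.1, 1.0) for _ in range(f)]
        if i not in Q:
            Q[i] = [rng.uniform(0.1, 1.0) for _ in range(f)]
    data = list(records)
    for _ in range(epochs):
        rng.shuffle(data)
        for u, i, r in data:
            p, q = P[u], Q[i]
            err = r - predict(p, q)
            for k in range(f):
                p_k, q_k = p[k], q[k]  # use the old values for both updates
                p[k] += lr * (err * q_k - lam * p_k)
                q[k] += lr * (err * p_k - lam * q_k)
    return P, Q


# Hypothetical ratings on a 1 to 5 scale.
records = [("U1", "P1", 5.0), ("U1", "P3", 3.5), ("U2", "P1", 1.0),
           ("U2", "P2", 3.0), ("U3", "P2", 4.0), ("U3", "P3", 2.0)]
P0, Q0 = sgd_train(records, epochs=0)  # untrained starting point
P, Q = sgd_train(records)
print(rmse(records, P0, Q0), rmse(records, P, Q))
```

The second printed value is the training error after the updates, which should sit below the first.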
Alternating Least Squares (ALS)
Equation 2 is not convex, since both q and p are unknown.
Steps:
1. Fix p (Eq. 2 becomes convex), and optimize for q using least squares
2. Fix q (Eq. 2 becomes convex), and optimize for p using least squares
3. Repeat steps 1 & 2 until Eq. 2 converges
● Easy to massively parallelize
● Can handle implicit data better than SGD
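The three steps can be sketched in plain Python as follows. This is an illustrative version under stated assumptions: the λ penalty is applied once per record (so each least-squares solve uses λ times the number of observed ratings on its diagonal), a fixed number of sweeps stands in for a convergence check, and the helper names and sample ratings are hypothetical.

```python
import random


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def objective(records, P, Q, lam):
    return sum((r - dot(P[u], Q[i])) ** 2 + lam * (dot(P[u], P[u]) + dot(Q[i], Q[i]))
               for u, i, r in records)


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[k]] for k, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][c] * x[c] for c in range(r + 1, n))) / M[r][r]
    return x


def least_squares(observed, fixed, lam, f):
    """Best factor vector for one user (or item) with the other side held fixed."""
    A = [[lam * len(observed) if a == b else 0.0 for b in range(f)] for a in range(f)]
    rhs = [0.0] * f
    for other, r in observed:
        v = fixed[other]
        for a in range(f):
            rhs[a] += r * v[a]
            for b in range(f):
                A[a][b] += v[a] * v[b]
    return solve(A, rhs)


def als(records, f=2, lam=0.1, sweeps=10, seed=0):
    rng = random.Random(seed)
    by_user, by_item = {}, {}
    for u, i, r in records:
        by_user.setdefault(u, []).append((i, r))
        by_item.setdefault(i, []).append((u, r))
    P = {u: [rng.uniform(0.1, 1.0) for _ in range(f)] for u in by_user}
    Q = {i: [0.0] * f for i in by_item}
    history = []
    for _ in range(sweeps):
        for i, observed in by_item.items():  # step 1: fix p, solve for q
            Q[i] = least_squares(observed, P, lam, f)
        for u, observed in by_user.items():  # step 2: fix q, solve for p
            P[u] = least_squares(observed, Q, lam, f)
        history.append(objective(records, P, Q, lam))  # step 3: track Eq. 2
    return P, Q, history


records = [("U1", "P1", 5.0), ("U1", "P3", 3.5), ("U2", "P1", 1.0),
           ("U2", "P2", 3.0), ("U3", "P2", 4.0), ("U3", "P3", 2.0)]
P, Q, history = als(records)
print(history[0], history[-1])
```

Because each solve touches only one user or one item, the inner loops are independent of each other, which is what makes the method easy to parallelize.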
Adding more to ALS
● Adding biases
● Adding input sources (helps with cold start)
● Temporal dynamics
● Inputs with varying confidence levels (some records are more reliable)
Sample Latent Factor model learned
Netflix Prize Competition (2006)
● Training data set of 100,480,507 ratings that 480,189 users gave to 17,770 movies
● ~1.4M quiz set, ~1.4M test set
● Winning solution: linear combination of 100+ algorithms!
Evaluation metrics
● Dev/Test set setup: use timeline information
● Coverage: for what percentage of the users whose history is available are we able to generate ratings?
● Accuracy: RMSE between the test data ratings and the predicted ratings
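A small sketch of these three pieces, under assumptions: records carry a timestamp as a fourth field, predictions are keyed by (user, item), and coverage counts a user as covered when at least one rating was predicted for them. All names and sample values are hypothetical.

```python
import math


def time_split(records, cutoff):
    """Split (user, item, rating, timestamp) records: train before cutoff, test from it on."""
    train = [rec for rec in records if rec[3] < cutoff]
    test = [rec for rec in records if rec[3] >= cutoff]
    return train, test


def rmse(actual, predicted):
    """Root mean squared error over the test pairs that received a prediction."""
    keys = [k for k in actual if k in predicted]
    if not keys:
        raise ValueError("no overlapping (user, item) pairs to evaluate")
    return math.sqrt(sum((actual[k] - predicted[k]) ** 2 for k in keys) / len(keys))


def coverage(users_with_history, predicted):
    """Percentage of users with history who received at least one predicted rating."""
    covered = {user for user, _ in predicted}
    return 100.0 * len(covered & set(users_with_history)) / len(users_with_history)


actual = {("U1", "P2"): 4.0, ("U2", "P3"): 2.0, ("U3", "P1"): 5.0}
predicted = {("U1", "P2"): 3.0, ("U2", "P3"): 3.0}
print(rmse(actual, predicted))                          # 1.0
print(coverage(["U1", "U2", "U3", "U4"], predicted))    # 50.0
```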
DEMO - Anshu Kumar
Top-N recommender system
Problem Statement: Given that a user has purchased some products/movies, the objective is to predict a list of the top N products that the user would be most interested in.
Neighbourhood Methods
User User Collaborative Filtering
Input: User-item matrix with ratings/purchase history
Steps:
1. Fit: Learn the user-user correlation matrix
2. Transform: Generate the personalized top-N list
User-user correlation (normalize your data first). Pearson correlation can be used:
$sim(u, v) = \frac{\sum_i (r_{ui} - \bar{r}_u)(r_{vi} - \bar{r}_v)}{\sqrt{\sum_i (r_{ui} - \bar{r}_u)^2} \sqrt{\sum_i (r_{vi} - \bar{r}_v)^2}}$
where the sums run over the items i rated by both users u and v, and r̄_u is the mean of user u's ratings over those items.
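As an illustration of the fit step, the sketch below computes the Pearson correlation between two users. It assumes the correlation is taken over co-rated items only, with means computed over those items, and returns 0 when fewer than two items overlap or a user has no variance; these choices and the sample ratings are assumptions.

```python
import math


def pearson(ratings_a, ratings_b):
    """Pearson correlation over the items both users have rated."""
    common = sorted(set(ratings_a) & set(ratings_b))
    if len(common) < 2:
        return 0.0
    mean_a = sum(ratings_a[i] for i in common) / len(common)
    mean_b = sum(ratings_b[i] for i in common) / len(common)
    num = sum((ratings_a[i] - mean_a) * (ratings_b[i] - mean_b) for i in common)
    den_a = math.sqrt(sum((ratings_a[i] - mean_a) ** 2 for i in common))
    den_b = math.sqrt(sum((ratings_b[i] - mean_b) ** 2 for i in common))
    if den_a == 0 or den_b == 0:
        return 0.0
    return num / (den_a * den_b)


# Hypothetical ratings: u2 rates one point below u1, u3 rates in the opposite direction.
u1 = {"P1": 5, "P2": 3, "P3": 4}
u2 = {"P1": 4, "P2": 2, "P3": 3, "P4": 5}
u3 = {"P1": 1, "P2": 5, "P3": 3}
print(pearson(u1, u2), pearson(u1, u3))
```

Subtracting each user's mean is the normalization step: u2 is a harsher rater than u1 but still correlates perfectly with her.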
Selecting Neighbourhoods
● All neighbours
● Random K neighbours
● Top K neighbours
● Neighbours who meet a minimum similarity threshold
Fewer neighbours -> lower coverage, but also less noise from dissimilar neighbours
Typically, about 50 neighbours
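The neighbour selection options can be sketched as below (random selection is omitted). The scoring rule in `top_n`, summing the similarities of the neighbours who bought an item, is one simple assumed choice rather than the method used in the talk; the similarity values and purchases are hypothetical.

```python
def select_neighbours(similarities, k=None, min_sim=None):
    """similarities: {neighbour: similarity to the target user}.

    With no arguments, all neighbours are kept; k keeps the top k;
    min_sim keeps those at or above a similarity threshold.
    """
    ranked = sorted(similarities.items(), key=lambda kv: (-kv[1], kv[0]))
    if min_sim is not None:
        ranked = [(v, s) for v, s in ranked if s >= min_sim]
    if k is not None:
        ranked = ranked[:k]
    return ranked


def top_n(target_items, neighbours, purchases, n):
    """Rank unseen items by the summed similarity of the neighbours who bought them."""
    scores = {}
    for neighbour, sim in neighbours:
        for item in purchases[neighbour]:
            if item not in target_items:
                scores[item] = scores.get(item, 0.0) + sim
    return sorted(scores, key=lambda item: (-scores[item], item))[:n]


sims = {"U1": 0.9, "U2": 0.4, "U5": 0.1}
purchases = {"U1": {"P1", "P2"}, "U2": {"P2", "P3"}, "U5": {"P4"}}
owned = {"P1"}
print(top_n(owned, select_neighbours(sims), purchases, 3))               # all neighbours
print(top_n(owned, select_neighbours(sims, k=2), purchases, 3))          # top 2 neighbours
print(top_n(owned, select_neighbours(sims, min_sim=0.5), purchases, 3))  # threshold
```

The shrinking lists show the trade-off: fewer neighbours produce fewer candidate items but drop the weakly related ones first.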
Exercise - UUCF
Find ranked lists for U3 & U4
(Table: binary user-item purchase matrix with rows U1 to U5 and columns P1 to P5; a 1 marks a purchase.)
Exercise - UUCF with 2 neighbours
Find ranked lists for U3 & U4
(Table: binary user-item purchase matrix with rows U1 to U5 and columns P1 to P5; a 1 marks a purchase.)
Item Item Collaborative Filtering
UUCF drawbacks:
1. Users’ tastes change fast
2. Users watch relatively few movies out of the whole movie set, leading to sparse data, and few or no recommendations for many users
● Item-item affinity/correlation is much more stable
● Item-item correlations can be learned even from sparse data
IICF
Input: User-item matrix with ratings/purchase history
Steps:
1. Fit: Learn the item-item correlation matrix
2. Transform: Generate the personalized top-N list (neighbourhoods are used as in UUCF)
Item-item correlation (normalize your data first). Pearson correlation can be used, computed between pairs of item columns instead of pairs of user rows.
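A sketch of the fit and transform steps on binary purchase data. It assumes missing entries count as 0, correlates full item columns, and scores each unseen item by summing its correlation with the items the user already has; the purchase matrix and function names are hypothetical.

```python
import math


def pearson_vec(x, y):
    """Pearson correlation between two equal-length numeric vectors."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    num = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    den = (math.sqrt(sum((a - mean_x) ** 2 for a in x))
           * math.sqrt(sum((b - mean_y) ** 2 for b in y)))
    return num / den if den else 0.0


def fit_item_item(matrix, items):
    """Fit: matrix is {user: {item: 1}}; returns {(i, j): correlation} for i != j."""
    users = sorted(matrix)
    columns = {i: [matrix[u].get(i, 0) for u in users] for i in items}
    return {(i, j): pearson_vec(columns[i], columns[j])
            for i in items for j in items if i != j}


def recommend(user_items, sims, items, n):
    """Transform: rank unseen items by summed correlation with the user's items."""
    scores = {j: sum(sims[(i, j)] for i in user_items)
              for j in items if j not in user_items}
    return sorted(scores, key=lambda j: (-scores[j], j))[:n]


# Hypothetical purchase history.
matrix = {"U1": {"P1": 1, "P2": 1}, "U2": {"P1": 1, "P2": 1},
          "U3": {"P1": 1}, "U4": {"P3": 1}}
items = ["P1", "P2", "P3"]
sims = fit_item_item(matrix, items)
print(recommend({"P1"}, sims, items, 1))  # recommendation for U3, who owns only P1
```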
Exercise - IICF
Find ranked lists for U3 & U4
(Table: binary user-item purchase matrix with rows U1 to U5 and columns P1 to P5; a 1 marks a purchase.)
Evaluation metrics
● Dev/Test set setup: use timeline information
● Coverage: for what percentage of the users whose history is available are we able to generate recommendations?
● Accuracy: Hitrate@n (say, n = 5)
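Definitions of hit rate vary; the sketch below assumes a user counts as a hit when at least one of their actual test-period purchases appears in their top-n list, and reports the fraction of such users. The sample lists are hypothetical.

```python
def hit_rate_at_n(recommendations, actual, n=5):
    """Fraction of test users with at least one actual purchase in their top-n list."""
    users = [u for u in actual if u in recommendations]
    if not users:
        return 0.0
    hits = sum(1 for u in users if set(recommendations[u][:n]) & set(actual[u]))
    return hits / len(users)


recommendations = {"U1": ["A", "B", "C"], "U2": ["D", "E", "F"]}
actual = {"U1": ["C"], "U2": ["Z"]}
print(hit_rate_at_n(recommendations, actual, n=3))  # 0.5
print(hit_rate_at_n(recommendations, actual, n=2))  # 0.0
```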
Sanity Sample User
User history for July’16:
● BREAKING BAD; S2: MA15+ 2009
● BREAKING BAD; S3: MA15+ 2010
● BREAKING BAD; S1: MA15+ 2008
● GRIMM; S2: MA15+ 2013
● GRIMM; S1: M15 2012
● GRIMM; S3: MA15+ 2014
Recommendations suggested by Capillary Tech. for Aug’16:
1. BREAKING BAD; S4: MA15+ 2011
2. BREAKING BAD; S5: MA15+ 2012
3. BREAKING BAD; FINAL SEASON
4. GRIMM; S4 MA15+ 2015
5. TEEN WOLF; S5 P2: MA15+ 2015
Actual user purchases in Aug’16:
● BREAKING BAD; S4: MA15+ 2011
● BREAKING BAD; S5: MA15+ 2012
● BREAKING BAD; FINAL SEASON
DEMO - Shashi Kumar
Content based recommender system
Problem Statement: Given that a user has purchased some products/movies, the objective is to predict a list of the top N products that the user would be most interested in.
Say that only a single user's data is available, along with product meta-data.
Content Filtering
TFIDF
● Term Frequency (TF) = number of occurrences of a term in the document/product description/user purchase history
● Inverse Document Frequency (IDF) = log(#documents / #documents with the term) (higher when fewer documents contain the term)
● TFIDF = TF * IDF
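These definitions translate almost directly into code. The sketch below assumes raw counts for TF and the natural logarithm for IDF (the base is not specified above); the two sample documents are hypothetical.

```python
import math
from collections import Counter


def tfidf(documents):
    """documents: {doc_id: list of terms}. Returns {doc_id: {term: TF * IDF}}."""
    n_docs = len(documents)
    doc_freq = Counter()
    for terms in documents.values():
        doc_freq.update(set(terms))  # count each term once per document
    weights = {}
    for doc_id, terms in documents.items():
        tf = Counter(terms)
        weights[doc_id] = {t: tf[t] * math.log(n_docs / doc_freq[t]) for t in tf}
    return weights


documents = {"d1": ["drama", "comedy"], "d2": ["drama", "action", "action"]}
weights = tfidf(documents)
print(weights["d1"]["drama"])   # 0.0: "drama" appears in every document
print(weights["d2"]["action"])  # 2 * log(2)
```

A term that appears in every document gets weight 0, which is how common terms are downgraded.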
What does it do
● Automatically downgrades stopwords and common terms
● Promotes core terms over incidental ones
Drawback: If a core term is not used much in the document, it receives little weight
Vector space model
Each keyword is a dimension
Steps:
1. Learn p_u using TFIDF
2. Learn q_i using TFIDF
3. (Normalize if needed)
4. Compute Pearson correlation/cosine similarity to rank
Limitation: Cannot handle interdependencies. For example, someone may like Shahrukh in romantic movies and Salman in action movies, but not the other way around.
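The four steps can be sketched end to end as follows. The sketch assumes the user profile p_u is the sum of the TFIDF vectors of the items in the user's history, and uses cosine similarity, which folds the normalization step into the ranking. The movie ids, keywords, and function names are hypothetical.

```python
import math
from collections import Counter


def idf_table(item_keywords):
    n_items = len(item_keywords)
    doc_freq = Counter(k for kws in item_keywords.values() for k in set(kws))
    return {k: math.log(n_items / count) for k, count in doc_freq.items()}


def item_vector(keywords, idf):
    tf = Counter(keywords)
    return {k: tf[k] * idf[k] for k in tf}


def cosine(a, b):
    dot = sum(w * b.get(k, 0.0) for k, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank_for_user(history, item_keywords, n):
    idf = idf_table(item_keywords)
    q = {i: item_vector(kws, idf) for i, kws in item_keywords.items()}  # step 2: q_i
    p_u = {}                                                            # step 1: p_u
    for i in history:
        for k, w in q[i].items():
            p_u[k] = p_u.get(k, 0.0) + w
    candidates = [i for i in q if i not in history]
    # Steps 3 and 4: cosine normalizes both vectors, then rank by similarity.
    return sorted(candidates, key=lambda i: (-cosine(p_u, q[i]), i))[:n]


movies = {"M1": ["comedy", "drama"], "M2": ["action", "drama"],
          "M3": ["comedy", "crime"], "M4": ["action", "history"]}
print(rank_for_user(["M1"], movies, 2))  # ['M2', 'M3']
```

Each keyword contributes independently to the score, which is why this model cannot express that a user likes one keyword only in combination with another.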
Exercise - TFIDF
Give recommendations for U3 & U4

User | Movie
U1 | Dangal
U1 | Hindi Medium
U2 | Bahubali 2
U3 | Bahubali 2
U3 | Hindi Medium
U3 | The Ghazi Attack
U4 | Dangal
U4 | Toilet - Ek Prem Katha
U5 | Hindi Medium

Movie | Keywords
Toilet - Ek Prem Katha | Comedy, Drama
Hindi Medium | Comedy, Drama
Bahubali 2 | Action, Adventure, Drama
Jolly LLB 2 | Comedy, Crime, Drama
The Ghazi Attack | Action, Drama, History
Dangal | Action, Biography, Drama
DEMO - Sanket Sahu
References
1. https://www.coursera.org/specializations/recommender-systems - Univ. of Minnesota, Prof. Joseph A Konstan, Dr. Michael D Ekstrand
2. Koren, Yehuda, Robert Bell, and Chris Volinsky. "Matrix factorization techniques for recommender systems." Computer 42.8 (2009).
3. J. Bennett and S. Lanning, "The Netflix Prize," KDD Cup and Workshop, 2007; www.netflixprize.com.
4. D. Goldberg et al., "Using Collaborative Filtering to Weave an Information Tapestry," Comm. ACM, vol. 35, 1992, pp. 61-70.
5. Salakhutdinov, Ruslan, Andriy Mnih, and Geoffrey Hinton. "Restricted Boltzmann machines for collaborative filtering." Proceedings of the 24th International Conference on Machine Learning. ACM, 2007.
Thank You