A Probabilistic Approach for Modelling User Preferences in Recommender Systems: A Case Study on IBM Watson Analytics

Parisa Lak, Can Kavaklioglu, Mefta Sadat, Andriy Miranskyy, Ayse Basar Bener
Data Science Lab, Ryerson University, 350 Victoria St., Toronto, Canada
[email protected]

Martin Petitclerc
IBM Watson Analytics, 140 Grande Allee East, Quebec City, Canada
[email protected]

Graham Wills
IBM, 71 S Wacker Dr, Chicago, USA
[email protected]
ABSTRACT Recommender systems (RS) provide personalized recommendations to users based on their historical preferences. User preferences are either provided explicitly in the form of ratings or derived implicitly from the user's interaction with the application. Modeling user preferences from implicit information is an important task when designing recommender systems, as this information is the input to the RS. A user's interaction with the system is not always consistent; therefore, preferences modeled from that interaction are affected by personal biases, which may change over time. To account for the uncertainty in users' decisions, we propose a probabilistic approach to model user preferences. We test the proposed preference model in a case study of IBM Watson Analytics (WA) with real-life data on user interactions. The model is employed by a collaborative filtering matrix factorization algorithm to produce recommendations. We also propose an evaluation metric to compare the results of our proposed system with the current system as well as with a baseline model. The results show significant improvement of the proposed probabilistic model over the baseline model in terms of precision, recall, and Discounted Cumulative Gain (DCG).
Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the owner/author(s). CASCON, Toronto, Ontario, Canada © 2017 Copyright held by the owner/author(s). 123-4567-24-567/08/06...$15.00 DOI: 10.475/123 4

ACM Reference format: Parisa Lak, Can Kavaklioglu, Mefta Sadat, Martin Petitclerc, Graham Wills, Andriy Miranskyy, and Ayse Basar Bener. 2017. A Probabilistic Approach for Modelling User Preferences in Recommender Systems. In Proceedings of CASCON, Toronto, Ontario, Canada, 2017, 10 pages. DOI: 10.475/123 4
1 INTRODUCTION
IBM Watson Analytics (WA) is an application that helps users find new patterns within datasets they provide [12]. The system offers users visualization recommendations based on the dataset column headers (i.e., data fields), and it also allows users to ask questions in natural language to narrow down the recommendation options. The current system provides recommendations using rule-based strategies driven by the analysis of dataset column headers. This rule-based approach produces the same set of suggestions for a user selecting a dataset on the very first page of the application (i.e., the starting point), and it provides a similar set of suggestions to all users of the same dataset, regardless of their preferences. Moreover, when a large dataset with many columns is provided, the rule-based approach fails to analyze all possible options. We believe that a machine learning algorithm that learns from a user's prior interaction with the system may generate more accurate and personalized recommendations. We refer to recommender systems (RS) as a set of learning algorithms used to filter information and mitigate the information overload problem [16]. In this context, an RS learns from a user's prior preferences to generate a list of items relevant to the user's interests. The concept of
recommender systems has been widely investigated [1]. Several algorithms and techniques have been proposed to predict users' preferences and provide them with lists of items of interest to them. User preferences can be provided explicitly by users through ratings, or they can be derived implicitly from users' interaction with the application. Capturing user preferences explicitly is expensive and sometimes impossible, as providing such data requires more effort from users. Even when this information is available, it is limited, since it is not possible to get a user's opinion on all available items. On the other hand, implicit information is inexpensive to gather but hard to interpret and model. Common forms of implicit information used to model user preferences are clicks, time spent on a page, mouse cursor or eye movement, etc. [6]. In our case study on WA, the user's selection behavior is the only implicit source of information available. The user preference model built from implicit information is the primary input for recommender systems that rely on this form of user opinion [15]; modeling this information is therefore one of the key factors in predicting users' preferences. We initially model user preferences with a heuristic-based approach. In this model, we consider the frequency with which an item is selected or not selected by a user to distinguish between levels of user preference over different items. This model is based purely on heuristics and on the assumption that the more frequently a user selects (or does not select) an item, the more preferred (or not preferred) that item is. However, users do not always prefer the same set of items. This implies that there are uncertainties in user preferences that should be taken into consideration when modeling them.
Therefore, the main research question (RQ) we address in this work is: how can we account for uncertainty in user behavior in Watson Analytics recommendations? To address this question, we propose a novel approach that formulates users' preferences probabilistically. The performance of this model is then tested using a collaborative filtering approach; more specifically, we used matrix factorization to test the proposed user preference modeling approach against a baseline model (used by the current rule-based system) as well as against the heuristic-based modeling approach. We used common evaluation metrics such as precision and recall to fairly evaluate the performance of the different systems, and we also analyzed the results using the Discounted Cumulative Gain (DCG) metric. The results show that the proposed probabilistic modeling significantly outperforms both the heuristic-based approach and the baseline model. The rest of this paper is organized as follows. In Section 2, we outline the methodology of our work, which includes the problem formulation followed by data extraction and model building, as well as the evaluation of our model and the recommender system algorithm. We summarize our results in Section 3, followed by the threats to the validity of our study in Section 4. Related work from the recommender systems literature on implicit information is outlined in Section 5. We present the conclusions of this work along with future work directions in Section 6.
2 METHODOLOGY
In this section, we outline the details of the methodology we followed in this work. Figure 1 provides an overview of our methodology [3]. We start by formulating and mapping the WA recommendation problem to the context of recommender systems. We then explain the data extraction process, followed by the details of user preference modeling based on implicit information cues. The recommender system algorithm used in this study and its implementation are then explained. Finally, we detail the evaluation metrics used to measure and compare the performance of the different systems.
Figure 1: Overview of our methodology [3]
2.1 Problem Formulation
Figure 2 provides an illustrative example of the current recommendations provided by WA; the terminology used throughout this work is highlighted in the figure. In this study, we focus on the recommendations provided at the "starting point", i.e., the recommendations shown to the user before they ask any question. There are always ten recommendations, in positions 0 to 9, at the starting point. Each recommendation includes "data fields", which are the headers of the provided dataset. Recommendations also contain one of the 11 possible "visualization types" appropriate for visualizing the selected data fields.
Figure 2: Illustrative example of IBM Watson Analytics starting point recommendations

There are some drawbacks to the current rule-based recommendation approach. The system does not provide personalized recommendations: it provides the same set of recommendations to all users of a given dataset. It also does not learn users' preferences to generate personalized recommendations based on their previous choices. Moreover, the current rule-based system cannot handle the analysis and ranking of a large number of options at the starting point in a timely manner. The problem diminishes as the user provides questions
to narrow down the possible options on later pages. We believe that learning user preferences and ranking items according to predicted preference may address these problems. Learning algorithms that solve problems of this kind are categorized under recommender systems. A good recommender system should be able to learn from a user's prior preferences to generate a relevant list of items. In such a framework, we deal with tuples of user, item, and preference, and recommendations are provided by predicting and ranking items based on the user's historical preferences. To formulate WA's recommendation problem in a learning-based recommender systems framework, we followed the definition of machine learning problems provided by Mitchell [13]: "A computer program is said to learn from experience E, with respect to some class of task T with a performance measure P, if its performance at task T measured by P improves by learning the experience E". For WA, the task is the recommendation of data fields along with their visualization types; the experience is derived from the user's interaction with the system, which reveals user preferences through selection and not-selection actions; and the performance of the system can be measured in two ways: either with the performance metrics commonly used in machine learning, or through user studies and the evaluation of user satisfaction. In this study, we use common prediction metrics to compare our proposed system with the current rule-based recommendation engine; the evaluation of user satisfaction will be addressed in later studies with an online implementation of the user preference learning model. We define the items in WA as the combination of a meaningful visualization type along with data fields; that is, an item is a tuple <data field(s), visualization type>.
Based on the historical data, the maximum number of data fields presented with one visualization is 4, and there are 11 possible visualization types available in the first version of the WA application. This gives an estimate of the number of possible items for each dataset. In this study, we did not investigate all possible options, since we had no information about items that were never shown to a user; in effect, we excluded cold-start items from the current analysis, leaving their investigation to further studies. We focused only on the items that were previously presented to users and were either selected or not selected by them. These are not limited to the items provided at the starting point but also include the items recommended after users provide their questions. This inclusion provides more user-item interaction information and enriches the user preference modeling.
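The item enumeration described above (up to 4 data fields combined with one visualization type) can be sketched as follows. This is an illustrative sketch, not the WA implementation; the column and chart names are hypothetical, and only 3 of the 11 visualization types are listed.

```python
from itertools import combinations

VIS_TYPES = ["bar", "line", "pie"]  # illustrative subset; WA v1 has 11 types
MAX_FIELDS = 4                      # at most 4 data fields per visualization

def enumerate_items(columns, vis_types=VIS_TYPES, max_fields=MAX_FIELDS):
    """Enumerate candidate <data field(s), visualization type> items."""
    items = []
    for k in range(1, max_fields + 1):
        for fields in combinations(sorted(columns), k):
            for vis in vis_types:
                items.append((fields, vis))
    return items

# 3 columns give 3 + 3 + 1 = 7 non-empty field subsets, each paired with 3 types
items = enumerate_items(["sales", "year", "region"])
```

Note the combinatorial growth: a dataset with c columns yields on the order of c^4 field subsets, which is one reason a rule-based engine struggles to rank all options for large datasets in a timely manner.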
2.2 Data Extraction
IBM Watson Analytics maintains a logging database based on ElasticSearch. Kibana is used to query the ElasticSearch engine, view logs, and explore events in real time. However, specific user actions and the complete set of Watson-generated recommendations are not available in this interface. For this reason, a data-extraction application was developed in Java to query complete logs from the Logstash servers. These logs contain all the user actions and recommendations generated by Watson Analytics. Information about all the other events
that were triggered while the user was interacting with the system is also stored in the logs. To extract the specific selection actions and recommended visualizations, we had to parse the extracted JSON logs and filter out unnecessary information. Two Python scripts that parse over a month of data were developed for this purpose. The physical size of the unfiltered data was 40 gigabytes. This dataset was extracted from users interacting with WA version 1 during October 2015.
Figure 3: Overview of datasets extracted from IBM Watson Analytics Logstash

As illustrated in Figure 3, one database was created from all the recommendations that were provided to the users, which we refer to as the "Questions dataset". A second database was created from the selection actions, deemed the "Selected dataset" in Figure 3; it includes the specifications of the recommendations selected by each user. We also had access to some high-level user information, such as account type and email address provider, stored in the "User dataset". All databases were merged into one large database (i.e., a CSV file) for further analysis. The final dataset contains 716,945 observations and 17 variables. Each observation corresponds to a recommendation generated by Watson Analytics. One subset of the variables defines the characteristics of the visualization, such as the visualization type (e.g., bar chart), the caption of the visualization (e.g., "what is the breakdown of sales by year"), the list of attributes (i.e., data fields) corresponding to the visualization (e.g., sales or year), and the position of the visualization in the list of recommendations. The selection action is stored in a Boolean variable that specifies whether the visualization was clicked by the user. The other subset of the variables stores information about the user; this information is obfuscated so that it is not possible to decode and trace back the users. In total, there are 3,771 unique users in the dataset. This dataset was extensively investigated in our prior work [12]. During that data exploration [12], we noticed that some users are test users, so those instances were removed in order to retain real user actions without noise. We also removed all users with no selection actions (i.e., users who selected none of the recommended items).
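The filtering steps above can be sketched as follows. This is a simplified, self-contained sketch over parsed log records; the record field names (`user`, `item`, `selected`) and the sample values are hypothetical, not the actual WA log schema.

```python
def filter_users(records, test_users):
    """Drop test users, then drop users who never selected any recommendation."""
    records = [r for r in records if r["user"] not in test_users]
    selectors = {r["user"] for r in records if r["selected"]}
    return [r for r in records if r["user"] in selectors]

log = [
    {"user": "u1", "item": "i1", "selected": True},
    {"user": "u1", "item": "i2", "selected": False},
    {"user": "u2", "item": "i1", "selected": False},   # no selections: dropped
    {"user": "test", "item": "i3", "selected": True},  # test user: dropped
]
real = filter_users(log, test_users={"test"})  # keeps only u1's observations
```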
This filtering generates a dataset of more than 150K observations with 971 real users and 34,794 items, based on the definition of an item in Section 2.1. There are 49,125 unique user-item pairs, of which 3,618 were selected with frequencies greater than or equal to one; the rest were not selected (i.e., the item was recommended to the user more than once and was not selected at any instance). These final data are used to model user preferences based on the 'selection' and 'not selection' actions.

2.3 Modeling User Preferences

The implicit user preference model is the primary source of information provided to a learning-based recommender system algorithm [15]. The learning algorithm provides recommendations based on this primary information; therefore, user preference modeling can be considered the core of the design of any recommender system based on implicit user preferences. In this study, we refer to the preference of user u for item i as the rating r_ui. Consider a dataset D = { r_ui : 1 ≤ u ≤ n, 1 ≤ i ≤ m, (u,i) ∈ O }, where n is the number of users, m is the number of items, and O denotes the set of user-item pairs for which a user preference (i.e., rating) is available. Dataset D is a sample of the n × m rating matrix R, where r_ui, the entry in the u-th row and i-th column of R, is user u's preference for item i. In practice, D contains a small fraction of the entries of R.

User preferences normally come in two main forms, implicit or explicit (refer to Section 5.2). User preferences in WA are not reported explicitly by users and hence must be derived from an implicit cue in the user's interaction with the application; in the WA case study, users' selection behavior serves as the implicit preference cue. Several approaches have been proposed in the literature to model and use implicit user preferences in a recommender systems framework. In this study, we propose a probabilistic approach that models user preferences by considering the uncertainty in user behavior. To motivate this model, we outline the alternative models and the pros and cons of each. We then evaluate and compare the performance of our proposed model, the heuristic-based approach, and the baseline model, which is the current WA recommendation engine.

Binary Model: This model treats a selection action as a positive preference and a not-selection action as a negative preference. In this model, r_ui can only take the value 1 for selection and 0 for not selection:

    r_ui = { 1, if item i is selected by user u;
             0, if item i is NOT selected by user u.    (1)

The problem with this model is that all non-selected items are treated as definitively irrelevant; items that were not selected will therefore never be recommended, and only previously selected items are offered to the user over and over. Also, all selected items have the same positive preference value of 1 and hence cannot be ranked by their predicted r_ui. Figure 4 shows the distribution of the defined binary ratings. As illustrated in this figure, this model also suffers from a class imbalance problem, which leads to overfitting on the majority class during prediction.

Figure 4: Bar plot illustrating the number of observations for each binary rating. 0 shows negative preference (not selected items) and 1 shows positive preference (selected items)

Heuristic-Based Model: This model uses the frequency of selection or not-selection to assign preference levels. We refer to the number of times a user u selects an item i as fs_ui and the
number of times a user u does not select an item i as fns_ui. The assumption is that the more frequently a user selects an item, the higher the user's preference for that item; conversely, the more often an item is offered to a user without being selected, the higher the user's negative preference towards that item. Figure 5 shows the number of observations for each user-item pair selection frequency. As illustrated in this figure, the distribution of the number of data points across frequency levels is highly skewed. To avoid data sparsity in each frequency-level class (based on the raw frequency) and to make the data less skewed, we use the log transformation of the frequencies (i.e., log fs_ui and log fns_ui). We also assumed that the recommended position of an item might affect the user's selection behavior. This assumption is based on the concept of the primacy effect [2, 17]: a primacy effect occurs when information is encountered in an order that has a strong impact on judgment and memory. According to Rundus [17], early items are memorized because they are given higher levels of attention and hence have a higher chance of being selected, clicked, or purchased by a user. Atkinson and Shiffrin similarly indicate that early items receive higher levels of rehearsal [2] due to the primacy effect. This provides evidence that the recommendations generated by the WA system in the primary positions are seen more often and have a higher chance of being selected by users. Based on this concept, we consider not-selected items in the first four positions (i.e., on the first page of the application / starting point) to have higher negative preferences than the rest of the items in lower positions. We also consider the selection of an item in those prime positions (i.e., the top-4 positions) to have a lower positive preference value than the selection of items in lower positions.
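A minimal sketch of the frequency-and-primacy scheme described above, assuming hypothetical function and variable names; the subsequent shift to a positive 1-5 scale is omitted. A selected item's log2-frequency rating is reduced by 1 in the top-4 positions, while a not-selected item's negative rating is deepened by 1 there.

```python
import math

TOP4 = range(4)  # prime positions 0-3 at the starting point

def heuristic_rating(freq, selected, position):
    """log2-frequency preference, adjusted for the primacy effect."""
    r = math.log2(freq)
    if selected:
        # a top-4 selection is "cheaper", so it earns a lower positive rating
        return r - 1 if position in TOP4 else r
    # a top-4 non-selection is a stronger negative signal
    return -r - 1 if position in TOP4 else -r

heuristic_rating(4, True, 0)   # selected 4 times in a prime position
heuristic_rating(4, False, 0)  # ignored 4 times in a prime position
```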
The computation of rating scales for this model can be summarized as follows:
    r+_uip = { log2(fs_uip) − 1,  if item i is selected by user u in the top-4 positions;
               log2(fs_uip),      if item i is selected by user u in lower positions.    (2)

    r−_uip = { −log2(fns_uip) − 1,  if item i is NOT selected by user u in the top-4 positions;
               −log2(fns_uip),      if item i is NOT selected by user u in lower positions.    (3)

Figure 5: Bar plot illustrating the number of observations for each positive frequency. The horizontal axis shows the number of times an item was selected by a user; the vertical axis shows the number of observations for that user-item pair selection frequency

We then shift both positive and negative rating values towards the positive scale by adding the absolute value of the minimum of the negative ratings (i.e., r_ui = r_ui + |min(r−_ui)|). To have less sparse data at each preference level, we set boundaries on the preferences to obtain a 5-point rating scale. The final number of observations for each preference level (i.e., rating) is provided in Figure 6. This model serves as an alternative to our proposed probabilistic model. The rating scales resulting from this modeling approach are then passed to our recommender system algorithm for prediction; the final results are reported in Section 3.

Figure 6: Bar plot illustrating the number of observations for general ratings in the heuristic-based model. The horizontal axis shows the rating assigned based on the frequency of selection and not-selection as well as the primacy effect; the vertical axis shows the number of user-item pair observations for each rating level

Proposed Probabilistic Model: User preferences are not always consistent throughout a user's interactions with an application. These uncertainties in user preferences can be modeled using probabilistic approaches. Our proposed probabilistic model also incorporates the primacy effect, reflecting the results of our prior work [12]: it considers user selection behavior as a function of recommendation position [12] to generate the prior probability of selection. Figure 7 illustrates the number of times users select items at each recommendation position.

Figure 7: Bar plot illustrating users' selection behavior by recommendation position. The horizontal axis shows the position of the recommendation; the vertical axis shows the number of selection observations for that position

In our proposed model, we considered a prior selection probability (psp) for each item and user based on the recommendation position. To represent the selection trend with respect to the item's position with a probability distribution, we used a Gaussian mixture whose parameters were approximated to imitate the actual user selection behavior derived from the distribution shown in Figure 7. The mixture consists of two Gaussian distributions with a standard deviation of four and means of zero and twelve, respectively (i.e., N(0,4) + N(12,4)). The distribution of this Gaussian mixture is illustrated in Figure 8.

Figure 8: Probability distribution of the Gaussian mixture used to model user selection behavior based on recommendation position

This probability distribution provides the maximum probability of an item being selected by a user based on its position; we refer to this value as G_uip. To account for the uncertainty in user selection behavior, we generate a random value from a uniform distribution, denoted H_ui. The final prior selection probability of item i recommended at position p to user u is then psp_uip = H_ui × G_uip for each observation. The final ratings r_ui are computed from psp_uip and the frequency of selection or not-selection of item i by user u at position p, denoted fs_uip and fns_uip for selected and not-selected items, respectively. When an item was shown to a user multiple times and was selected in some observations but not in others, we considered only the frequency of selection. The final r_ui is computed as follows:

    r+_uip = fs_uip × (1 − psp_uip),  if item i is selected by user u;
    r−_uip = fns_uip × psp_uip,       otherwise.    (4)

The first condition in Equation 4 computes r_ui for the selected items. The assumption is that items in the prime positions have a higher probability of being selected (i.e., psp_uip is relatively higher) than items in other positions; the adjusting factor (1 − psp_uip) therefore diminishes the effect of the selection frequency for prime positions and increases it for items in other positions. For example, consider an item offered to a user 20 times at position 0 and selected most of the time. The psp for this position is high, so the adjusting factor reduces the effect of the selection frequency on r+_uip. Conversely, an item offered to a user 20 times at position 5 and selected often (fs_uip is high) shows the user's strong preference for that item (i.e., r+_uip should be high); the psp for position 5 is relatively low, so the adjusting factor increases the effect of the selection frequency on r+_uip compared with the prior scenario.

The second condition in Equation 4 computes r−_uip for items that were not selected (i.e., user u's degree of non-preference for item i). Consider an item shown to a user 20 times at position 5: the psp for this position is low, and multiplying it by the not-selection frequency diminishes the effect of the negative frequency on r_ui. This means the user's degree of non-preference for that item is not high, as the non-selection might simply be because the user never saw the item. In contrast, if an item is shown to a user 20 times at position 0 and is not selected, the item has a higher degree of non-preference; the psp for the prime position is high, and hence r_ui, the product of psp and the not-selection frequency, is higher in this case than in the former example. With this strategy, D becomes a set of probabilities for both selected and not-selected items.

One shortcoming of working with implicit feedback is that information is only provided for the positive preferences [15]; consequently, the modeling of positive feedback is more accurate and held with more confidence than the modeling of negative preferences. To account for this confidence in positive preferences, we use another adjusting factor: we add the mean probability of the not-selected items to the probabilities of the selected items (i.e., r+_uip = r+_uip + mean(r−_uip)). We then set equal intervals over the probabilities to generate 1-5 rating scales. Figure 9 shows the number of observations for each rating scale.

Figure 9: Bar plot illustrating the number of observations for each rating level using the proposed probabilistic approach to model user preferences

The final preference of user u for item i, regardless of position, is the mean of user u's preferences for item i across the different positions p. Matrix D, which contains all observed user-item preferences, is the input to our recommender system algorithm. The system learns from these preferences and provides preference predictions for the unobserved items.
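A sketch of the probabilistic rating computation described in this section. The Gaussian mixture N(0,4) + N(12,4) follows the text, but the equal 0.5 component weights, the function names, and the seeded uniform draw are assumptions of this sketch.

```python
import math
import random

def gaussian_pdf(x, mu, sigma):
    """Density of N(mu, sigma) at x."""
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

def g_uip(position):
    """G_uip: mixture of N(0,4) and N(12,4) over recommendation positions
    (Figure 8); equal 0.5 weights are an assumption of this sketch."""
    return 0.5 * (gaussian_pdf(position, 0, 4) + gaussian_pdf(position, 12, 4))

def probabilistic_rating(freq, selected, position, rng=random):
    """Equation 4: psp_uip = H_ui * G_uip, then weight the (non-)selection frequency."""
    psp = rng.random() * g_uip(position)   # H_ui is a uniform random draw
    if selected:
        return freq * (1 - psp)            # damp frequent picks in prime positions
    return freq * psp                      # amplify non-selection in prime positions

# a prime-position pick is damped slightly; a prime-position skip weighs more
r_pos0 = probabilistic_rating(20, True, 0, rng=random.Random(0))
```

Because G is largest near position 0, a not-selected item shown there receives a larger negative-preference magnitude than the same item ignored at position 5, matching the reasoning around Equation 4.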
2.4 Prediction
To test our proposed user preference modeling approach, we used a recommender system algorithm as our prediction model. We consider the Matrix Factorization (MF) algorithm, which has been omnipresent in the recommender systems literature since its achievements in the Netflix Prize competition in 2006 [10]. One of the most important factors differentiating MF from other algorithms is that it scales well to large datasets and parallel computing, which matters for applications such as WA. The details of this model are explained in this section. Typically, in MF, users and items are modeled by factor vectors p_u and q_i, respectively [10]. The predicted rating r̂_ui of user u for item i is

    r̂_ui = q_i^T · p_u.    (5)
Because calculating predictions from only the available known entries is prone to overfitting, a regularization term is normally added to the model. λ in Equation 6 is the regularization factor; its value was set to 0.001 in our analysis (based on the hyperparameter optimization discussed below):

    r̂_ui = q_i^T · p_u + λ(||q_i||^2 + ||p_u||^2).    (6)
To learn the factor vectors for both users and items, the system minimizes the regularized squared error on the set of known ratings (i.e., the training set) using an optimization method. In this study we use the Stochastic Gradient Descent (SGD) optimization algorithm, which has proven successful in this context [10]. The algorithm iterates through the observations in order to minimize the prediction error on the user preferences. The bias feature idea in the context of matrix factorization was first introduced by Paterek [14]; accounting for these biases improves accuracy drastically. We use deviation-based biases in this study. User biases are quantified as the deviation of the mean of a specific user's preference degrees (i.e., ratings) from the global mean of all user preferences. These biases account for users who tend to give extremely high or extremely low preference degrees in general. Item biases are, analogously, the deviation of the mean of an item's preference degrees from the mean over all items' preference degrees. Including this bias also alleviates the impact of overly popular items on the rating predictions. In this study, the item bias is given a lower effect on predictions than the user bias, because there are often not enough preference measures for a single item, so this bias can generate a high or low penalty value that interferes with prediction accuracy. The refined matrix factorization prediction with basic biases is given by

\hat{r}_{ui} = q_i^T p_u + b_i + b_u + \mu, (7)
where µ is the average rating, and b_u and b_i are the user and item biases, respectively.
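The biased prediction of Equation 7 is a one-liner in code; the sketch below assumes user and item biases stored in arrays, which is our illustrative layout rather than anything specified in the paper.

```python
import numpy as np

def predict(u, i, mu, b_user, b_item, P, Q):
    """Biased matrix-factorization prediction of Eq. (7):
    r_hat_ui = mu + b_u + b_i + q_i^T p_u.

    mu: global mean rating
    b_user, b_item: arrays of user and item biases
    P, Q: user and item factor matrices
    """
    return mu + b_user[u] + b_item[i] + P[u] @ Q[i]
```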
CASCON, 2017, Toronto ,Ontario, Canada
While learning, the regularization factor should also be applied to the biases. Hence, the system learns by minimizing the squared error function:

\min \sum_{(u,i) \in K} (r_{ui} - q_i^T p_u - b_i - b_u - \mu)^2 + \lambda (\|q_i\|^2 + \|p_u\|^2 + b_i^2 + b_u^2). (8)

The Stochastic Gradient Descent (SGD) algorithm loops through all user preferences (i.e., ratings) in the training set and calculates the estimated preference for each user-item pair. The error associated with the prediction, e_{ui}, is then calculated, and the parameters p_u and q_i are modified by a magnitude proportional to a predefined learning rate γ in the opposite direction of the gradient; b_u and b_i are updated using the same learning rate. The updates are summarized as follows:

p_u ← p_u + γ (e_{ui} · q_i − λ · p_u),
q_i ← q_i + γ (e_{ui} · p_u − λ · q_i),
b_i ← b_i + γ (e_{ui} − λ · b_i),
b_u ← b_u + γ (e_{ui} − λ · b_u). (9)

These updates are repeated until convergence. Several hyperparameters must be set at the initialization of this algorithm. These hyperparameters, including the number of iterations, the number of features, the regularization rate, and the learning rate, should be optimized before the actual evaluation of the algorithm [11]. The hyperparameters were set to minimize the prediction error on the validation set. We used 5-fold cross-validation for the evaluation of the model; 5-fold cross-validation was preferred over 10-fold due to data sparsity. In each fold, the 20% test part was set aside while tuning the hyperparameters, and the rest of the data was divided into a 70% training set and a 30% validation set. When evaluating the system, we used the combined training and validation parts as the training set and the test instances for testing purposes.
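The SGD training loop above can be sketched as follows. The regularization factor λ = 0.001 is the value quoted in the text; the learning rate, number of factors, and number of epochs here are illustrative defaults (the paper tunes them on a validation set), and the data layout is our own.

```python
import numpy as np

def sgd_mf(ratings, n_users, n_items, k=10, gamma=0.005, lam=0.001,
           n_epochs=20, seed=0):
    """Illustrative biased-MF training via the SGD updates of Eq. (9).

    ratings: list of (user, item, rating) training triples.
    Returns the learned (mu, b_user, b_item, P, Q).
    """
    rng = np.random.default_rng(seed)
    P = 0.1 * rng.standard_normal((n_users, k))   # user factors
    Q = 0.1 * rng.standard_normal((n_items, k))   # item factors
    bu = np.zeros(n_users)                        # user biases
    bi = np.zeros(n_items)                        # item biases
    mu = np.mean([r for _, _, r in ratings])      # global mean

    for _ in range(n_epochs):
        for u, i, r in ratings:
            e = r - (mu + bu[u] + bi[i] + P[u] @ Q[i])  # e_ui
            bu[u] += gamma * (e - lam * bu[u])
            bi[i] += gamma * (e - lam * bi[i])
            pu_old = P[u].copy()                  # use old p_u for the q_i step
            P[u] += gamma * (e * Q[i] - lam * P[u])
            Q[i] += gamma * (e * pu_old - lam * Q[i])
    return mu, bu, bi, P, Q
```

On a tiny synthetic rating set the training error drops well below the global-mean baseline after a few thousand epochs, which is the qualitative behavior the convergence criterion relies on.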
2.5 Evaluation
There are several metrics to evaluate the performance of a recommender system [7, 16]. Some researchers have argued that the bottom-line measure of recommender system success should be user satisfaction [7]; e-commerce companies measure this value through the number of sold and not returned items, while other companies explicitly ask for their users' opinions. Beyond user satisfaction, commonly used metrics for the performance of machine learning algorithms can be categorized in three groups: (1) predictive accuracy metrics, (2) classification accuracy metrics, and (3) ranking accuracy metrics [7].

(1) Predictive accuracy metrics. The most common accuracy metric is Root Mean Square Error (RMSE). We used this measure to set the hyperparameters of our algorithm. RMSE over a validation set T of observations is calculated as follows:

RMSE = \sqrt{\frac{1}{|T|} \sum_{(u,i) \in T} (\hat{r}_{ui} - r_{ui})^2}, (10)

where \hat{r}_{ui} is the predicted rating on item i for user u and r_{ui} is the actual rating for item i by user u.

(2) Classification accuracy metrics. The classification accuracy metrics measure the quality of the set of recommendations [4]. A common pair of classification accuracy metrics is precision and recall [4]. Precision is commonly defined as the ratio of relevant recommended items to the total number of recommended items; recall is the ratio of relevant recommended items to the total number of relevant items. In this study, we consider an item relevant if it is selected from the top 4 recommended positions (i.e., the first page of the starting point recommendations). Selections from lower positions are relevant items that were not recommended on the first page (i.e., the top 4 positions). If none of the top 4 recommendations is selected, we count a non-relevant instance that was recommended. Formally, precision and recall are computed as follows:

Precision = TP / (TP + FP),
Recall = TP / (TP + FN), (11)

where TP is the number of selections in the top-4 positions, FP is the number of non-selections in the top-4 positions, and FN is the number of selections from lower positions or of items that were not recommended.

(3) Ranking accuracy metrics. To evaluate the quality of the ranked list of recommendations, ranking metrics are used. The most commonly used ranking accuracy metric is Discounted Cumulative Gain (DCG), which builds on the idea that highly relevant items appearing lower in a recommendation list should be penalized: the graded relevance value is reduced logarithmically, proportional to the position of the result. The discounted cumulative gain accumulated at a particular rank position p is defined as follows:

DCG_p = \sum_{i=1}^{p} \frac{rel_i}{\log_2(i + 1)}. (12)

We consider an item relevant if it was selected; therefore rel_i = 1 if the item in position i is selected. The higher the DCG, the better the list of recommended items.
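The three metric families above can be computed as in the sketch below. Note that the treatment of FP (one count per session whose first page contains no selected item) is our reading of the paper's top-4 convention, not something the paper states in code form.

```python
import math

def rmse(pairs):
    """Eq. (10): pairs is an iterable of (predicted, actual) ratings."""
    pairs = list(pairs)
    return math.sqrt(sum((p - a) ** 2 for p, a in pairs) / len(pairs))

def precision_recall(selected_positions, k=4):
    """Precision/recall under the paper's top-k convention (k = 4).

    A selection at rank <= k is a true positive; a selection below
    rank k counts as a false negative and, because the first page then
    contained no relevant item, also as a false positive (our
    interpretation of the paper's definitions)."""
    tp = sum(1 for pos in selected_positions if pos <= k)
    fn = sum(1 for pos in selected_positions if pos > k)
    fp = fn
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

def dcg(relevances):
    """Eq. (12): relevances[i-1] is rel_i; rel_i = 1 when the item
    at rank i was selected."""
    return sum(rel / math.log2(rank + 1)
               for rank, rel in enumerate(relevances, start=1))
```

For example, selections at ranks 1, 5, and 2 across three sessions give precision = recall = 2/3 under this convention, and a relevance vector [1, 0, 0] gives a DCG of 1.0.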
3 DISCUSSION OF THE RESULTS
In this section we summarize the performance evaluation results on the dataset of one month of actual user interactions with the WA system. We compare the results of the proposed probabilistic model with the heuristic-based model, as well as with the baseline, which is the current state of the WA application. As stated in the previous section, the rating scales prepared through the heuristic-based model and the proposed probabilistic model are passed to a collaborative filtering recommender system, namely matrix factorization. The hyperparameters for the heuristic-based and probabilistic models are identical, so that the systems are comparable, and are set in the validation phase. The final results are reported on the test instances. We compute precision and recall according to the definitions in Equation 11; the results are reported for the top 10 recommendations. Table 1 summarizes the results. As shown in this table, the proposed model significantly outperforms the WA system in both precision and recall, with p-values of 1.27 × 10⁻⁶ and 2.28 × 10⁻⁶, respectively. The proposed probabilistic model also significantly outperforms the heuristic-based model in terms of precision, with a p-value of 2.3 × 10⁻⁵. However, the t-test does not report a significant difference in recall between our proposed model and the alternative
Table 1: Mean precision and mean recall with 5-fold cross-validation for the current WA system, the heuristic-based model, and the proposed probabilistic model, using the matrix factorization algorithm and ranking based on predicted rating.

                                 Mean Precision   Mean Recall
Current WA system                     0.324          0.451
Heuristic-based model                 0.363          0.719
Proposed probabilistic model          0.396          0.725
heuristic-based model; the p-value for this claim is 0.55. This result stems from the way the MF algorithm pushes lower-rated items towards the end of the ranked list. If both modeling approaches treat negative preferences in a similar manner, the proportion of false negatives to true positives is similar for both, and hence their recall does not differ significantly. However, the system that models the positive instances more precisely pushes the positive preferences into the higher positions. Therefore, although the proposed model does not differ significantly from the heuristic-based model in terms of recall, it still outperforms that model in terms of precision. To confirm the superiority of the probabilistic model over the heuristic-based model, we used a second measure, DCG, which evaluates the recommendation list provided by a recommender system. Table 2 summarizes the mean DCG results.

Table 2: Mean DCG for the current WA system, the heuristic-based model, and the proposed probabilistic model, using the matrix factorization algorithm and ranking based on predicted rating.

                                 Mean DCG
Current WA system                 362.49
Heuristic-based model             577.77
Proposed probabilistic model      621.77
As shown in this table, the proposed probabilistic model outperforms both the baseline WA system and the alternative heuristic-based model in terms of DCG. The difference is significant, with p-values of 1.27 × 10⁻⁷ against the baseline WA system and 3.3 × 10⁻⁴ against the heuristic-based model. The findings reported in this section provide evidence that the proposed probabilistic model is a successful approach to modeling user preferences for this system. The proposed model is not only superior in finding the most realistic model of preferences, but also outperforms both the heuristic-based model and the current WA system in terms of the accuracy of the recommended items.
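The significance claims above rest on t-tests over per-fold metric values. A minimal paired t-statistic sketch is shown below; the fold-level precision values used in the usage example are made up for illustration and are not the paper's actual per-fold numbers.

```python
import math
import statistics

def paired_t(xs, ys):
    """Paired t-statistic for comparing one metric (e.g. precision)
    across the same cross-validation folds under two systems.

    xs, ys: per-fold metric values for the two systems.
    Returns (t_statistic, degrees_of_freedom)."""
    diffs = [x - y for x, y in zip(xs, ys)]
    n = len(diffs)
    mean_d = statistics.mean(diffs)
    sd_d = statistics.stdev(diffs)  # sample std dev of the differences
    return mean_d / (sd_d / math.sqrt(n)), n - 1
```

With 5-fold cross-validation the test has 4 degrees of freedom; the resulting t value would then be compared against the t-distribution to obtain a p-value.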
4 THREATS TO VALIDITY
Threats to the validity of the current case study include external, construct, internal, and conclusion threats.

External validity: We extracted the dataset for this application ourselves. We removed the test users, so we believe that the analysis was performed on real users who interact with the system on a daily basis. However, since we only use one dataset in this case study, the results of the analysis are not generalizable. We plan to use other datasets from the second version of the application to address this threat in later studies.
Construct validity: In this study, "user preferences" is our dependent variable. We consider user-item pairs as our unique observations, and user preferences are modeled based on the frequency with which a user-item pair is observed in our dataset. However, the position of a recommended item affects user preferences and should therefore be considered a confounding factor. To account for this threat, we consider user-item-position triples as our unique observations, so that user preferences are derived from all three factors: the user, the item, and the item's position.

Internal validity: We used different statistical software packages to perform our analysis. R was used for data cleaning and pre-processing; the MF algorithm was implemented in Matlab. The scripts for data pre-processing and the MF algorithm were written by the authors, so threats to the validity of the results due to these scripts can be addressed by ongoing debugging. We also used common evaluation metrics to measure the performance of our system, and employed multiple measures to minimize this threat.

Conclusion validity: User preference modeling based on implicit information is unique to each application. The conclusions of this study are drawn from the specifications of the WA application and should be extended only cautiously to other, similar applications.
5 RELATED WORK
A learning-based recommender system refers to a set of machine learning algorithms that take a user's historical preferences in an application as input and predict what might be of interest in the user's next interaction with the application. Historical preferences are either provided by the user as explicit information (e.g., movie ratings) or as implicit information (e.g., clicks); implicit user preferences must be derived from the user's interaction with the application. In the WA application, users do not explicitly provide their opinion on the recommendations presented to them, so their preferences must be derived from their selection behavior. In this section, we review related work covering the basics of recommender systems as well as benchmark studies that model implicit information about user preferences for recommender systems.
5.1 Recommender Systems
In the era of big data and information overload, recommender systems are designed to simplify the task of selecting products and services. A significant body of work on these systems has been generated in both academia and industry during the last two decades [1]. Despite the advancements in computational power and the incorporation of cognitive computing techniques into recommender systems, the area still attracts much attention [1], because ongoing challenges remain in the design and implementation of recommender systems, such as the cold-start problem, data sparsity, and implicit user preference modeling. In recommender systems, users commonly provide their opinion about an item, and the system then uses this information to make future predictions. These predictions can be performed using three main techniques: (1) collaborative filtering techniques, (2) content-based techniques, and (3) hybrid techniques [9].
Systems that use collaborative filtering techniques produce personalized item recommendations based on patterns of explicit or implicit users’ opinion without the need for exogenous information about items or users. These systems rely on various types of inputs representative of users’ opinion on items. The basic assumption for recommendation using collaborative filtering approach is that users who like the same thing are likely to feel similarly towards other things [18]. The two primary areas for collaborative filtering methods are neighborhood models and latent factor models [10]. The neighborhood models compute the relationship between items or users in terms of the ratings that have been assigned to them to provide recommendations; whereas, latent factor models attempt to model users and items with latent features according to the user-item pairs ratings. In short, the neighborhood models statistically calculate the similarity between users or items according to rating trends. Latent factor models assume that rating trends can be explained by hidden factors describing users or items. The most commonly used latent factor model is Matrix Factorization (MF) that is the main algorithm used in this work. The mathematics and details of this algorithm are provided in Section 2.4.
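For contrast with the latent-factor approach used in this paper, the statistical similarity computation at the heart of a neighborhood model can be sketched as an item-item cosine similarity over the rating matrix. This is illustrative only; the paper itself uses MF, not a neighborhood model.

```python
import numpy as np

def item_cosine_similarity(R):
    """Item-item cosine similarity from a user-by-item rating
    matrix R, with 0 denoting an unrated entry. A neighborhood
    model would recommend items similar to those a user rated
    highly; compare with the latent-factor approach of Sec. 2.4."""
    norms = np.linalg.norm(R, axis=0)  # column (item) norms
    norms[norms == 0] = 1.0            # guard against empty items
    return (R.T @ R) / np.outer(norms, norms)
```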
5.2 Implicit User Preferences
As stated in the previous section, users' opinions are either provided explicitly through ratings or must be derived implicitly from their interaction with the application at hand. Regardless of the selected technique or algorithm, user preferences modeled from implicit information are the primary source of input for recommender systems [15]. Implicit information is rich, and since it does not depend on the user's effort, it is available in larger volumes; hence, user preference data are less sparse when implicit information is used. However, such data are messy and noisy and cannot be used in their original format [8]. There are several approaches in the literature for transforming implicit user information into meaningful user preferences for recommender systems. One of the first papers investigating the use of implicit information is the work by Claypool et al. [6], which compared the impact of different sources of implicit information on the accuracy of recommender systems. In [8], the authors treat the availability of implicit feedback as a confidence level rather than as pure preference information; however, this confidence level does not account for the uncertainty in human behavior, as we do in our work. Choi et al. [5] use a function to transform a user's implicit information into explicit ratings for recommender systems: they first compute the absolute preference of a user, relative to the user's purchase frequency in isolation, and then compute the relative user preference based on other users' preferences. The explicit rating derived from the implicit information is then a function of all user preferences, based on purchase frequency. This model likewise does not account for the uncertainty in user behavior through the probabilistic nature of the user's actions. Peska and Vojtas [15] used multiple sources of implicit information to generate a preference relation model.
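The confidence-level idea of Hu et al. [8] mentioned above is a simple linear transformation of the observed interaction frequency. The sketch below follows the form c_ui = 1 + α · f_ui from their paper; α = 40 is the value they suggest, but any concrete choice here is illustrative.

```python
def confidence(frequency, alpha=40.0):
    """Confidence level of Hu et al. [8] for implicit feedback:
    c_ui = 1 + alpha * f_ui, where f_ui is the observed interaction
    frequency of user u with item i. Unobserved pairs keep the
    baseline confidence of 1 rather than being treated as disliked."""
    return 1.0 + alpha * frequency
```

Note the contrast with the present paper: this transformation maps frequency to a deterministic confidence weight, whereas the probabilistic model proposed here keeps the uncertainty in the user's behavior explicit.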
They focused on modeling the user's preferences based on different sources of information and used this model to enhance their prediction algorithm. In our case study, the only source of information is the user's selection behavior.
6 CONCLUSION AND FUTURE DIRECTION
In this study we modeled users' preferences based on their selection behavior. The modeling was designed to account for uncertainties in user behavior using a probabilistic approach. The outcome of the modeling process was used in a collaborative filtering algorithm to predict user preferences, and the resulting predictions and rankings (i.e., recommendations) were evaluated in terms of precision and recall as well as Discounted Cumulative Gain. The proposed probabilistic approach to modeling user preferences significantly outperforms both the current system and the heuristic-based approach in terms of precision and DCG, which shows that the proposed approach successfully models users' positive preferences. The alternative heuristic-based model performs similarly to the probabilistic model in terms of recall. This is because both models use MF for prediction, so the lower predicted ratings are pushed towards the lower rank positions; the number of relevant items in the lower positions therefore does not differ significantly between the two systems when both approaches model negative preferences similarly. In the future, we will use Bayesian inference approaches to process the proposed probabilistic models; we believe that a probabilistic inference method applied to the probabilistic user preference model would account even better for the uncertainties in user behavior. The evaluation measures used in this study are common in the recommender system literature, with some modifications to map them to our problem. As part of our future work, we plan to devise a novel evaluation metric that can distinguish differences in the performance of each system with higher accuracy.
7 ACKNOWLEDGEMENTS
This research is funded in part by IBM CAS Project 919, and NSERC CRD Grant 490782 2015.
REFERENCES
[1] Gediminas Adomavicius and Alexander Tuzhilin. 2005. Toward the next generation of recommender systems: A survey of the state-of-the-art and possible extensions. IEEE Transactions on Knowledge and Data Engineering 17, 6 (2005), 734–749.
[2] Richard C. Atkinson and Richard M. Shiffrin. 1968. Human memory: A proposed system and its control processes. Psychology of Learning and Motivation 2 (1968), 89–195.
[3] Ayse Bener, A. Misirli, Bora Caglayan, Ekrem Kocaguneli, and Gul Calikli. 2015. Lessons learned from software analytics in practice. In The Art and Science of Analyzing Software Data, 1st ed. Elsevier, Waltham, 453–489.
[4] Jesús Bobadilla, Fernando Ortega, Antonio Hernando, and Abraham Gutiérrez. 2013. Recommender systems survey. Knowledge-Based Systems 46 (2013), 109–132.
[5] Keunho Choi, Donghee Yoo, Gunwoo Kim, and Yongmoo Suh. 2012. A hybrid online-product recommendation system: Combining implicit rating-based collaborative filtering and sequential pattern analysis. Electronic Commerce Research and Applications 11, 4 (2012), 309–317.
[6] Mark Claypool, Phong Le, Makoto Wased, and David Brown. 2001. Implicit interest indicators. In Proceedings of the 6th International Conference on Intelligent User Interfaces. ACM, 33–40.
[7] Jonathan L. Herlocker, Joseph A. Konstan, Loren G. Terveen, and John T. Riedl. 2004. Evaluating collaborative filtering recommender systems. ACM Transactions on Information Systems (TOIS) 22, 1 (2004), 5–53.
[8] Yifan Hu, Yehuda Koren, and Chris Volinsky. 2008. Collaborative filtering for implicit feedback datasets. In Proceedings of the Eighth IEEE International Conference on Data Mining (ICDM '08). IEEE, 263–272.
[9] Yehuda Koren and Robert Bell. 2011. Advances in collaborative filtering. In Recommender Systems Handbook. Springer, 145–186.
[10] Yehuda Koren, Robert Bell, and Chris Volinsky. 2009. Matrix factorization techniques for recommender systems. Computer 42, 8 (2009), 30–37.
[11] Parisa Lak, Bora Caglayan, and Ayse Basar Bener. 2014. The impact of basic matrix factorization refinements on recommendation accuracy. In Proceedings of the 2014 IEEE/ACM International Symposium on Big Data Computing. IEEE Computer Society, 105–112.
[12] Parisa Lak, Mefta Sadat, Carl Julien Barrelet, Martin Petitclerc, Andriy Miranskyy, Craig Statchuk, and Ayse Basar Bener. 2016. Preliminary investigation on user interaction with IBM Watson Analytics. (2016).
[13] Tom M. Mitchell. 1997. Machine Learning. McGraw Hill, Burr Ridge, IL.
[14] Arkadiusz Paterek. 2007. Improving regularized singular value decomposition for collaborative filtering. In Proceedings of KDD Cup and Workshop, Vol. 2007. 5–8.
[15] Ladislav Peska and Peter Vojtas. 2017. Using implicit preference relations to improve recommender systems. Journal on Data Semantics 6, 1 (2017), 15–30.
[16] Francesco Ricci, Lior Rokach, and Bracha Shapira. 2011. Introduction to recommender systems handbook. In Recommender Systems Handbook. Springer, 1–35.
[17] Dewey Rundus. 1971. Analysis of rehearsal processes in free recall. Journal of Experimental Psychology 89, 1 (1971), 63.
[18] J. Ben Schafer, Dan Frankowski, Jon Herlocker, and Shilad Sen. 2007. Collaborative filtering recommender systems. In The Adaptive Web. Springer, 291–324.