2014 IEEE International Conference on Data Mining

Online Nonparametric Max-Margin Matrix Factorization for Collaborative Prediction

Zhi Qiao1,2, Peng Zhang3, Wenjia Niu1, Chuan Zhou1, Peng Wang1, Li Guo1
1 Institute of Information Engineering, Chinese Academy of Sciences
2 Institute of Computing Technology, Chinese Academy of Sciences
3 QCIS, University of Technology Sydney, Australia
{qiaozhi,wangpeng}@nelmail.iie.ac.cn, [email protected], {niuwenjia, zhouchuan, guoli}@iie.ac.cn

Abstract—Max-margin matrix factorization (M3F) has been widely applied in collaborative filtering for personalized recommendation. Nonparametric M3F models represent the latest progress of the M3F methods: they can auto-select the number of factors by using nonparametric techniques. However, existing nonparametric M3F methods assume that a collection of user rating data can be fully obtained before training, which makes them inapplicable to on-the-fly recommender systems where user ratings arrive continuously. In this paper, we present a new, efficient online nonparametric M3F model for flexible recommendation. Specifically, we design an online nonparametric M3F model (OnM3F) based on online Passive-Aggressive learning and solve the corresponding optimization problem by online stochastic gradient descent. Empirical studies on four large real-world data sets verify the effectiveness of the proposed method.

I. INTRODUCTION

Matrix factorization plays an important role in latent factor models for collaborative prediction [1], [2], [3], [4], [5], [6], [7], [8]. Given a user-item preference matrix $Y \in \mathbb{R}^{N \times M}$, where $N$ and $M$ are the sizes of the user set and the item set, matrix factorization aims to find a low-rank matrix $X \in \mathbb{R}^{N \times M}$ that approximates the observed entries of $Y$ and reconstructs the missing entries. However, in the collaborative prediction setting, only a portion of the entries of $Y$ are observed, and the low-rank matrix $X$ minimizing the sum-squared distance to the observed entries cannot be computed by a singular value decomposition. In fact, finding a low-rank approximation to a partially observed matrix is a difficult non-convex problem. Maximum Margin Matrix Factorization (M3F) was first proposed by Srebro et al. [9]; it constrains the norms of $U$ and $V$ instead of their dimensionality, which corresponds to constraining the trace norm (the sum of singular values) of $X$. M3F can be formulated as a semi-definite program (SDP) and solved by standard SDP solvers, but current SDP solvers can only handle M3F problems on matrices of dimensionality up to a few hundred. Hence, Rennie and Srebro [10] presented a direct gradient-based optimization method for M3F to enable fast collaborative prediction. One common problem in latent factor models is determining the number of factors, which is unknown a priori. A typical solution relies on a general model selection procedure, e.g., cross-validation, which explicitly enumerates and compares many candidate models and can thus be computationally expensive. Hence, the work in [11], [12] presented a nonparametric max-margin matrix factorization for collaborative prediction, which solves the self-adaptive selection of the number of factors and achieves better performance.

However, the nonparametric M3F learning method and other M3F-based collaborative filtering approaches assume that a collection of user rating data is given a priori, and the models have to be re-trained whenever new training data arrive. Such approaches are impractical for real-world recommender systems, where training data often arrive sequentially as new users are added daily or even hourly and new product items are offered dynamically. Traditional batch learning methods are non-scalable due to their highly expensive re-training cost. This calls for an efficient and scalable nonparametric M3F learning technique for collaborative filtering in real-world recommender systems.

Online learning is an effective way to handle large-scale data, especially streaming data. Among popular online algorithms, online Passive-Aggressive (PA) learning [13] provides a generic framework for stream data learning, with many successful applications [14], [15]. The work [16] further presented online Bayesian Passive-Aggressive (BayesPA) learning as a new framework for Bayesian online inference. Online learning techniques have also been applied to matrix factorization: the works [17], [18], [19] presented online collaborative learning based on basic matrix factorization, the work [20] suggested an efficient document clustering method via online nonnegative matrix factorization, and online learning for sparse matrix factorization was proposed in [21].

However, compared to existing online matrix factorization methods, online learning for nonparametric M3F has not been addressed before, probably due to the following non-trivial technical challenges:



• Different from existing matrix factorization models, M3F uses a low-norm constraint and leads to a discriminant function for each matrix entry, which brings a new challenge for online learning.

• The predefined size of the latent dimensions often deviates from the optimal value. Thus, how to perform online M3F learning with automatic dimension selection is another challenge.

• Online learning aims to learn incrementally from data. Compared with traditional parametric learning based on batch processing, how to update parameters rapidly is also a challenge.

In this paper, we study the new problem of online nonparametric M3F learning for collaborative prediction. The method incrementally learns from streaming rating data by using nonparametric M3F models. Specifically, we build an online nonparametric M3F model following the online Passive-Aggressive (PA) learning method, and use an online gradient descent algorithm for fast parameter learning. The paper makes the following contributions:

• We are the first to study the problem of using online learning to enhance nonparametric M3F models for recommendation.

• We propose a new online nonparametric M3F learning model, i.e., the OnM3F model.

• We present an online gradient descent algorithm for model learning and parameter updating.

Our study explores how to update nonparametric M3F models in an online manner and how to apply them to adaptive collaborative filtering.

The rest of the paper is organized as follows. Section II reviews related work. Section III introduces the preliminaries of online nonparametric M3F learning, covering the M3F model, online learning based on PA, and mini-batch learning. Section IV presents the proposed online nonparametric M3F method, OnM3F. Section V proposes some simplified online learning methods for general M3F formulations. Section VI reports empirical results on four popular collaborative filtering data sets and demonstrates the efficiency of our method. Section VII concludes the paper.

II. RELATED WORK

This section briefly reviews closely related work, including the M3F models, batch collaborative filtering models, and online collaborative filtering models.

Collaborative filtering. One of the state-of-the-art methods for regular CF tasks is the latent factor or matrix factorization method [22], [23], [24], [25], [26], [27], [28]. The key idea of the latent factor model is that the similarity between users and items can be simultaneously induced by hidden lower-dimensional structures in the data. For example, the rating that a user gives to a movie might depend on a few implicit factors such as the user's taste across various movie genres. M3F is a special matrix factorization technique that has mainly been applied in collaborative filtering. Although batch algorithms for matrix factorization have shown great success, they generally suffer from high time complexity and memory cost, and are thus non-scalable for building on-the-fly recommender systems.

M3F and nonparametric M3F. Srebro et al. [9] introduced the formulation termed M3F. Studies based on this model have been applied to real-world applications such as collaborative filtering and image recognition [10], [29]. The approach is inspired by, and has strong connections to, large-margin linear discrimination. M3F can be formulated as a semi-definite program (SDP) and learned using standard SDP solvers. Rennie and Srebro [10] then proposed a fast batch learning method for M3F. Recently, several enhanced batch M3F models have been proposed, such as probabilistic M3F, Bayesian M3F and nonparametric M3F [11], [12], which can adapt to real data. However, all existing methods are based on offline batch learning and are impractical for real-world applications where training data arrive sequentially as data streams.

Online collaborative filtering. Online collaborative filtering (OCF) has received emerging attention recently. The work in [19] is perhaps the earliest, casting OCF as an online ranking problem. The work most relevant to our study is online collaborative filtering by stochastic gradient descent [17] and its improvement in [18], which addresses multi-task collaborative filters for on-the-fly recommendation systems. However, as a special matrix factorization technique, M3F has not been considered for online learning.

In the following section, we first introduce M3F in detail. Since our aim is online nonparametric learning for M3F, and our proposed method is mainly based on the online PA framework, we also describe online learning methods based on PA learning.

III. PRELIMINARY

In this section, we first introduce the regular collaborative filtering task. Then, we introduce the M3F models. Next, we introduce online learning methods, covering online Passive-Aggressive (PA) learning and online Bayesian PA learning. Finally, to improve the stability of the online learning algorithm, mini-batches are introduced.

A. Problem Definition

In a collaborative filtering task, some users from a total of $N$ users rate some products (items) from $M$ products. These ratings form an incomplete matrix $Y \in \mathbb{R}^{N \times M}$, where $y_{ij}$ is the rating on the $j$-th item given by the $i$-th user. There are two scenarios for the rating values: binary ratings and ordinal ratings. For binary ratings, $y_{ij} \in \{+1, -1\}$ represents, e.g., users who buy/do not buy or watch/do not watch. For ordinal ratings, $y_{ij} \in \{1, 2, \dots, L\}$ represents discrete, ordered user preferences over items. Hence, our model needs to handle both rating scenarios. The goal of collaborative filtering is to predict the unknown ratings based on the given scores. For this task, maximum margin matrix factorization learns the latent structures by factorizing the rating matrix into a user matrix $U \in \mathbb{R}^{N \times K}$ and an item matrix $V \in \mathbb{R}^{M \times K}$, and has been widely applied in collaborative prediction.
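To make the data layout concrete, the sketch below shows how the partially observed matrix $Y$ is typically represented as (user, item, rating) triples; in the online setting of this paper, these triples arrive as a stream rather than as a complete matrix. The values are toy examples, not taken from any of the data sets used later.

```python
# The observed entries of Y as (user, item, rating) triples; the index set I
# used in the optimization problems below is simply the set of (user, item)
# pairs that appear here.
ratings_stream = [
    (0, 3, 5),   # user 0 rates item 3 with 5 (ordinal scale 1..L)
    (2, 1, 1),   # user 2 rates item 1 with 1
    (0, 7, 4),
]
observed = {(i, j): y for (i, j, y) in ratings_stream}
```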

B. M3F

M3F [9] extends the matrix factorization model by adopting a sparsity-inducing norm regularizer as a low-norm constraint. The model can handle the binary or discrete ordinal data that are typical in recommendation systems. For the binary case, where $y_{ij} \in \{\pm 1\}$ and one predicts by $\hat{y}_{ij} = \mathrm{sign}(x_{ij})$, the optimization problem of M3F is defined as

$$\operatorname*{argmin}_{X} \; \|X\|_* + C \sum_{(i,j) \in I} h(y_{ij} x_{ij}), \qquad (1)$$

where $\|X\|_*$ is the trace norm of $X$, $I$ is the index set of the observed entries, $C$ is a trade-off constant, and $h(x) = \max(0, 1-x)$ is the hinge loss. Problem (1) can be equivalently formulated as a semi-definite program (SDP) and learned by standard SDP solvers, but this approach is often very slow and scales only to thousands of users and items. An alternative M3F model based on a variational formulation of the nuclear norm was then proposed in [10]; it solves an equivalent problem on the factorized form $X = UV^T$ instead,

$$\operatorname*{argmin}_{U,V} \; \|UV^T\|_* + C \sum_{(i,j) \in I} h(y_{ij} U_i V_j^T), \qquad (2)$$

where $U \in \mathbb{R}^{N \times K}$ and $V \in \mathbb{R}^{M \times K}$ are interpreted as the user factor matrix and the item factor matrix respectively, and $K$ is the number of latent factors. We use $U_i$ to denote the $i$-th row of $U$, i.e., the factor vector of the $i$-th user, and $V_j$ likewise. By momentarily ignoring the non-differentiability of the hinge loss at 1, or by replacing the hinge loss with a smooth surrogate, a gradient descent solver has been developed that scales to millions of users and items [10].

Ordinal ratings. The data sets we more frequently encounter in real recommendation applications are discrete and ordinal, $y \in \{1, 2, 3, \dots, R\}$. We use the same strategy as in [10] to define the loss function. Specifically, we introduce thresholds $\theta_0 \le \theta_1 \le \theta_2 \le \dots \le \theta_R$, where $\theta_0 = -\infty$ and $\theta_R = +\infty$, to discretize the real line into $R$ intervals. The prediction rule accordingly becomes

$$\hat{y}_{ij} = \max\{r \mid U_i V_j^T \ge \theta_{ir}\} + 1. \qquad (3)$$
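As a quick illustration of the prediction rule (3), the sketch below implements it directly in NumPy; the toy sizes and random factors are illustrative assumptions, not values from the paper.

```python
import numpy as np

def predict_ordinal(U, V, theta, i, j):
    """Eq. (3): the predicted rating is 1 plus the largest r such that
    U_i V_j^T >= theta_{ir}; theta[i] holds the R-1 finite thresholds of
    user i in ascending order (theta_{i0} = -inf is implicit)."""
    score = U[i] @ V[j]
    return int(np.sum(theta[i] <= score)) + 1  # thresholds passed, plus one

rng = np.random.default_rng(0)
N, M, K, R = 5, 7, 3, 5
U, V = rng.normal(size=(N, K)), rng.normal(size=(M, K))
theta = np.sort(rng.normal(size=(N, R - 1)), axis=1)  # ascending per user
print(predict_ordinal(U, V, theta, i=0, j=0))         # integer rating in 1..R
```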

In a hard-margin setting, we would require

$$\theta_{i,y_{ij}-1} + 1 \;\le\; U_i V_j^T \;\le\; \theta_{i,y_{ij}} - 1. \qquad (4)$$

Adding slack in a soft-margin setting, we define the loss as

$$\ell(U, V; y_{ij}) \;=\; \sum_{r=1}^{y_{ij}-1} h(U_i V_j^T - \theta_{ir}) \;+\; \sum_{r=y_{ij}}^{R-1} h(\theta_{ir} - U_i V_j^T) \;=\; \sum_{r=1}^{R-1} h\big(T_{ij}^r (\theta_{ir} - U_i V_j^T)\big), \qquad (5)$$

where

$$T_{ij}^r = \begin{cases} +1 & r \ge y_{ij} \\ -1 & r < y_{ij} \end{cases}$$
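The following sketch computes the ordinal loss (5) using the sign variable $T_{ij}^r$; the names follow the paper's notation, and the function interface is an illustrative assumption.

```python
import numpy as np

def hinge(x):                            # h(x) = max(0, 1 - x)
    return np.maximum(0.0, 1.0 - x)

def ordinal_loss(score, y, theta_i):
    """Eq. (5): score = U_i V_j^T, y in {1..R},
    theta_i = (theta_{i1}, ..., theta_{i,R-1}) in ascending order."""
    r = np.arange(1, len(theta_i) + 1)   # r = 1, ..., R-1
    T = np.where(r >= y, 1.0, -1.0)      # T_ij^r = +1 if r >= y_ij, else -1
    return float(np.sum(hinge(T * (theta_i - score))))

print(ordinal_loss(score=0.3, y=3, theta_i=np.array([-2.0, -0.5, 1.0, 2.5])))
```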

Hence, for ordinal ratings, the optimization problem becomes

$$\operatorname*{argmin}_{U,V} \; \|UV^T\|_* + C \sum_{(i,j) \in I} \sum_{r=1}^{R-1} h\big(T_{ij}^r (\theta_{ir} - U_i V_j^T)\big). \qquad (6)$$

C. Online Learning Based on the Passive-Aggressive Idea

Generally, user rating data are generated continuously, and the model needs to learn from a sequence of training data. To capture the information embedded in a newly received rating, a batch model has to be retrained on all available data, which is often time-consuming.

The goal of online supervised learning is to minimize the cumulative loss of a prediction function built from sequentially arriving training samples. Online Passive-Aggressive (PA) algorithms [13] achieve this goal by updating a parameterized model $w$ (e.g., the weights of a linear SVM) in an online manner, with losses computed from the arriving data $\{x_t\}_{t \ge 0}$ and the corresponding responses $\{y_t\}_{t \ge 0}$. The loss $\ell_\epsilon(w; x_t, y_t)$ can be the hinge loss $(\epsilon - y_t w^T x_t)_+$ for binary classification or the $\epsilon$-insensitive loss $(|y_t - w^T x_t| - \epsilon)_+$ for regression, where $\epsilon$ is a hyper-parameter and $(x)_+ = \max(0, x)$. The Passive-Aggressive update rule is derived by defining the new weight $w_{t+1}$ as the solution to the following optimization problem:

$$\min_{w} \; \frac{1}{2} \|w - w_t\|^2 \quad \text{s.t.} \quad \ell_\epsilon(w; x_t, y_t) = 0. \qquad (7)$$

Intuitively, if $w_t$ suffers no loss on the new data, i.e., $\ell_\epsilon(w_t; x_t, y_t) = 0$, the algorithm passively assigns $w_{t+1} = w_t$; otherwise, it aggressively projects $w_t$ onto the feasible zone of parameter vectors that attain zero loss. With provable bounds, Crammer et al. [13] show that online PA algorithms can achieve results comparable to the optimal classifier $w^*$. To handle inseparable training samples, soft-margin constraints are often adopted, and the resulting learning problem is

$$\min_{w} \; \frac{1}{2} \|w - w_t\|^2 + C\, \ell_\epsilon(w; x_t, y_t), \qquad (8)$$

where $C$ is a positive regularization parameter. For problems (7) and (8) with samples arriving one at a time, closed-form solutions can be derived as in [13].
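For concreteness, the sketch below shows the standard closed-form PA-I update for binary classification from [13], to which problems (7)-(8) reduce when samples arrive one at a time; the step-size formula is the known PA-I solution, while the variable names and default $C$ are illustrative.

```python
import numpy as np

def pa1_update(w, x, y, C=0.1):
    """One PA-I step: tau = min(C, loss / ||x||^2), w <- w + tau * y * x."""
    loss = max(0.0, 1.0 - y * (w @ x))  # hinge loss of the current weights
    if loss == 0.0:
        return w                         # passive: constraint already satisfied
    tau = min(C, loss / (x @ x))         # aggressive: bounded projection step
    return w + tau * y * x

w = np.zeros(4)
w = pa1_update(w, x=np.array([1.0, 0.0, -1.0, 0.5]), y=+1)
```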

Moreover, instead of updating a point estimate of $w$, online Bayesian PA (BayesPA) [16] sequentially infers a new posterior distribution $q_{t+1}(w)$, either parametric or nonparametric, on the arrival of new data $(x_t, y_t)$ by solving the following optimization problem:

$$\min_{q(w) \in \mathcal{F}_t} \; \mathrm{KL}\big[q(w)\,\|\,q_t(w)\big] - \mathbb{E}_{q(w)}\big[\log p(x_t \mid w)\big] \quad \text{s.t.} \quad \ell_\epsilon\big[q(w); x_t, y_t\big] = 0, \qquad (9)$$

where $\mathcal{F}_t$ is a distribution family. In other words, we obtain a posterior distribution $q_{t+1}(w)$ in the feasible zone that is not only close to $q_t(w)$ under the commonly used KL-divergence, but also has a high likelihood on the new data. As a result, if the Bayes rule already gives a posterior distribution $q_{t+1}(w) \propto q_t(w)\, p(x_t \mid w)$ that suffers no loss (i.e., $\ell_\epsilon = 0$), BayesPA passively updates the posterior by just following the Bayes rule; otherwise, BayesPA aggressively projects the new posterior onto the feasible zone of posteriors that attain zero loss. Note that when no likelihood is defined (e.g., $p(x_t \mid w)$ is independent of $w$), BayesPA passively sets $q_{t+1}(w) = q_t(w)$ if $q_t(w)$ suffers no loss; we call this non-likelihood BayesPA.

To adapt M3F to such scenarios, it is better to train the model in an online manner that incrementally adapts the model to newly observed ratings. In Section IV, we present our online nonparametric M3F learning algorithms.

D. Mini-Batch Learning

Online learning usually introduces noise when processing data points one by one. To improve the stability of the online learning algorithm, a useful technique for reducing noise in the data is the use of mini-batches, as in [16]: practitioners typically use multiple samples at a time to compute gradients. In our case, suppose that at time $t$ we have a mini-batch of data points, denoted as $Y^t = \{y_{ij}\}$. Our model then performs online updates based on the mini-batch of observations instead of a single data point at each time stamp of the stream.
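A minimal sketch of this mini-batch regime is given below; `update_model` is a hypothetical stand-in for one OnM3F update on a batch $Y^t$, and the batch size is an illustrative choice.

```python
def stream_minibatches(rating_stream, model, update_model, batch_size=64):
    """Buffer ratings arriving one by one and update on full mini-batches Y_t."""
    batch = []
    for (i, j, y_ij) in rating_stream:
        batch.append((i, j, y_ij))
        if len(batch) == batch_size:   # one online update per mini-batch
            update_model(model, batch)
            batch = []
    if batch:                          # flush the final partial batch
        update_model(model, batch)
```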

IV. ONLINE LEARNING FOR NONPARAMETRIC M3F

The nonparametric learning is based on nonparametric Bayesian learning [11]. For nonparametric M3F, we first introduce a probabilistic formulation of M3F, the probabilistic max-margin matrix factorization (PM3F). Then, we discuss an existing batch learning method for nonparametric PM3F. Finally, we propose our online learning method for nonparametric probabilistic M3F, the online nonparametric max-margin matrix factorization (OnM3F).

A. PM3F

In probabilistic matrix factorization (PMF), we treat $U$ and $V$ as random variables whose joint prior distribution is denoted by $p_0(U, V)$. The goal is then to infer the posterior distribution $p(U, V)$ after a set of observations has been provided. The same idea applies to PM3F. We first consider the binary case, where $y_{ij}$ takes values in $\{\pm 1\}$:

$$\min_{p(U,V)} \; \mathrm{KL}\big(p(U,V)\,\|\,p_0(U,V)\big) + C \sum_{(i,j) \in I} \ell_\epsilon(U_i, V_j; y_{ij}). \qquad (10)$$

Usually, $U = \{U_1, U_2, \dots, U_N\}$ and $V = \{V_1, V_2, \dots, V_M\}$ are given, and we can define the prediction function $U_i V_j^T$ to represent user $i$'s rating of item $j$.

Furthermore, as both $U_i$ and $V_j$ are random variables, we need to handle the uncertainty when deriving a prediction rule. Following the principle of maximum entropy discrimination (MED) learning [31], we use the expected hinge loss of a Gibbs classifier that randomly draws a classifier $(U_i, V_j) \sim p(U_i, V_j)$ and makes predictions by the rule $\hat{y}_{ij} = \mathrm{sign}\, \mathbb{E}_{p(U_i,V_j)}[U_i V_j^T]$.

Then, following the principle of Bayesian learning, we define PM3F as solving the following optimization problem:

$$\min_{p(U,V)} \; \mathrm{KL}\big(p(U,V)\,\|\,p_0(U,V)\big) + C \sum_{(i,j) \in I} h\big(y_{ij}\, \mathbb{E}_{p(U_i,V_j)}[U_i V_j^T]\big). \qquad (11)$$

Note that our probabilistic formulation is more general than the original M3F model, which is in fact a special case of PM3F under a standard Gaussian prior and a mean-field assumption on $p(U, V)$.

For ordinal ratings, we apply the Gibbs classifier to the ordinal rating prediction loss:

$$\min_{p(U,V)} \; \mathrm{KL}\big(p(U,V)\,\|\,p_0(U,V)\big) + C \sum_{(i,j) \in I} \sum_{r=1}^{R-1} h\big(T_{ij}^r\, \mathbb{E}_p[\theta_{ir} - U_i V_j^T]\big). \qquad (12)$$

B. Nonparametric PM3F

A common problem of finite factor-based models such as M3F and PM3F is that the number of latent factors $K$ needs to be explicitly selected. We therefore introduce the nonparametric maximum margin matrix factorization algorithm of [11]. Without loss of generality, we consider learning a binary coefficient matrix $Z \in \{0,1\}^{N \times \infty}$. For finite-sized binary matrices, the prior may be defined by a Beta-Bernoulli process [32], while in the infinite case, $Z$ is allowed to have an infinite number of columns. In the nonparametric matrix factorization model, the Indian Buffet Process (IBP) is used as the prior over unbounded binary matrices [33]. Furthermore, we focus on its stick-breaking construction, which facilitates the development of efficient inference algorithms. Specifically, let $\pi_k \in (0,1)$ be a parameter associated with each column of $Z$ (with respect to its left-ordered equivalence class). Then the IBP prior can be described by the following generative process:

$$\pi_1 = \nu_1, \qquad \pi_k = \nu_k \pi_{k-1} = \prod_{i=1}^{k} \nu_i \quad (k = 1, \dots, \infty), \qquad z_{ik} \sim \mathrm{Bernoulli}(\pi_k) \;\; \text{i.i.d. for } i = 1, \dots, N, \qquad (13)$$

where each $\nu_k \sim \mathrm{Beta}(\alpha, 1)$. This process results in a descending sequence of $\pi_k$: given a finite data set ($N < \infty$), the probability of seeing the $k$-th factor decreases exponentially with $k$. The number of active factors $K_+$ follows a $\mathrm{Poisson}(\alpha H_N)$ distribution, where $H_N$ is the $N$-th harmonic number. Alternatively, we can use a Beta process prior over $Z$, as in [33].

We place an isotropic Gaussian prior over the item factor matrix $V$ and follow the above probabilistic framework to perform max-margin training, with $U$ replaced by $Z$. The stick-breaking construction of the IBP prior results in the following augmented nonparametric PM3F problem for binary data:

$$\min_{p(\nu,Z,V)} \; \mathrm{KL}\big(p(\nu,Z,V)\,\|\,p_0(\nu,Z,V)\big) + C \sum_{(i,j) \in I} h\big(y_{ij}\, \mathbb{E}_p[Z_i V_j^T]\big). \qquad (14)$$

For ordinal ratings, we augment the nonparametric PM3F problem as

$$\min_{p(\nu,Z,V,\theta)} \; \mathrm{KL}\big(p(\nu,Z,V,\theta)\,\|\,p_0(\nu,Z,V,\theta)\big) + C \sum_{(i,j) \in I} \sum_{r=1}^{R-1} h\big(T_{ij}^r\, \mathbb{E}_p[\theta_{ir} - Z_i V_j^T]\big). \qquad (15)$$

Besides adopting the same prior assumptions for $\nu$, $Z$ and $V$, we assume the prior $p(\theta)$ to be

$$\theta_{ir} \sim \mathcal{N}(\rho_r, \sigma^2) \quad \text{i.i.d. for } i = 1, \dots, N, \; r = 1, \dots, R-1, \qquad (16)$$

where $\rho_1 < \dots < \rho_{R-1}$ are specified as prior guidance towards an ascending sequence of large-margin thresholds.
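To illustrate the stick-breaking construction (13), the sketch below draws a truncated sample of $Z$; the truncation level $K$ mirrors the truncated mean-field assumption used in the next subsection, and the sizes are toy values.

```python
import numpy as np

def sample_ibp_stick_breaking(N, K, alpha, seed=0):
    """Truncated draw from the IBP prior (13): nu_k ~ Beta(alpha, 1),
    pi_k = prod_{i<=k} nu_i, z_ik ~ Bernoulli(pi_k)."""
    rng = np.random.default_rng(seed)
    nu = rng.beta(alpha, 1.0, size=K)
    pi = np.cumprod(nu)                         # descending column probabilities
    Z = (rng.random((N, K)) < pi).astype(int)   # one Bernoulli draw per entry
    return Z, pi

Z, pi = sample_ibp_stick_breaking(N=10, K=20, alpha=3.0)
print(Z.sum(axis=0))   # later columns are active ever more rarely
```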

C. OnM3F

Generally, the online Bayesian PA model can be used for online probabilistic model learning. Suppose the current mini-batch of observations at time $t$ is $Y^t = \{y_{ij}\}$. We obtain the following objective function for online nonparametric maximum margin matrix factorization learning:

$$\min_{p(\nu,Z,V)} \; \mathrm{KL}\big(p(\nu,Z,V)\,\|\,p_t(\nu,Z,V)\big) + C \sum_{y_{ij} \in Y^t} h\big(y_{ij}\, \mathbb{E}_p[Z_i V_j^T]\big). \qquad (17)$$

Now, we briefly discuss how to perform learning and inference in OnM3F. Specifically, we introduce a simple variational inference method to approximate the optimal posterior, which turns out to perform well in practice. We make the following truncated mean-field assumption:

$$p(\nu, Z, V) \approx p_\tau(\nu)\, p_\psi(Z)\, p_\phi(V) = \prod_{i=1}^{K} p_{\tau_i}(\nu_i) \times \prod_{i=1}^{N} \prod_{k=1}^{K} p_{\psi_{ik}}(z_{ik}) \times \prod_{m=1}^{M} p_{\phi_m}(V_{m\cdot}), \qquad (18)$$

where the component distributions are

$$p_{\tau_i}(\nu_i) = \mathrm{Beta}(\nu_i; \tau_{i1}, \tau_{i2}), \qquad p_{\psi_{ik}}(z_{ik}) = \mathrm{Bernoulli}(z_{ik}; \psi_{ik}), \qquad p_{\phi_m}(V_{m\cdot}) = \mathcal{N}(V_{m\cdot}; \phi_m, \sigma I), \qquad (19)$$

and $K$ is the predefined truncation level. The problem can then be solved by an iterative procedure that alternates between optimizing one component at a time, as outlined below.

Infer $p(V)$: The linear discriminant function and the isotropic Gaussian prior on $V$ lead to an isotropic Gaussian posterior $p(V) = \prod_{j=1}^{M} \mathcal{N}(V_j \mid \phi_j, \sigma I)$, where the mean vectors $\phi_j$ can be obtained by solving $M$ independent binary SVM problems:

$$\min_{\phi_j} \; \frac{1}{2\sigma^2} \|\phi_j - \phi_j^t\|^2 + C \sum_{i \,\mid\, y_{ij} \in Y^t} h\big(y_{ij}\, \psi_{i\cdot}^T \phi_j\big). \qquad (20)$$

Infer $p(Z)$: The problem of inferring $p(Z)$ decomposes into $N$ independent convex optimization problems:

$$\min_{\psi_{i\cdot}} \; \sum_{k=1}^{K} \mathbb{E}_{p(Z)}\Big[\log \frac{p(z_{ik})}{p_t(z_{ik})}\Big] + C \sum_{j \,\mid\, y_{ij} \in Y^t} h\big(y_{ij}\, \psi_{i\cdot}^T \phi_j\big), \qquad (21)$$

where $\mathbb{E}_{p(Z)}[\log \frac{p(z_{ik})}{p_t(z_{ik})}] = \psi_{ik} \log \psi_{ik} + (1-\psi_{ik}) \log(1-\psi_{ik}) - \psi_{ik} \log \psi_{ik}^t - (1-\psi_{ik}) \log(1-\psi_{ik}^t)$. We cannot simply take the gradient of this objective and set it to zero to obtain the optimal $\psi$, due to the special form of the hinge loss. Hence, we use the subgradient technique of [11] to approximately solve for $\psi_{i\cdot}$. We first replace the objective with the following approximation:

$$\min_{\psi_{i\cdot}} \; \sum_{k=1}^{K} \mathbb{E}_{p(Z)}\Big[\log \frac{p(z_{ik})}{p_t(z_{ik})}\Big] + C \sum_{j \,\mid\, y_{ij} \in Y^t} \big(1 - y_{ij}\, \psi_{i\cdot}^T \phi_j\big). \qquad (22)$$

The gradient of this approximate objective can be computed easily.

Infer $p(\theta)$: $p(\theta)$ remains an isotropic Gaussian, $p(\theta) = \prod_{i=1}^{N} \prod_{r=1}^{R-1} \mathcal{N}(\theta_{ir} \mid \rho_{ir}, \sigma^2)$, and the mean $\rho_{ir}$ of each component is the solution to the problem

$$\min_{\rho_{ir}} \; \frac{1}{2\sigma^2} (\rho_{ir} - \rho_{ir}^t)^2 + C \sum_{j \,\mid\, y_{ij} \in Y^t} h\big(T_{ij}^r (\rho_{ir} - \psi_{i\cdot}^T \phi_j)\big). \qquad (23)$$

D. Parameter Updates

We use online gradient descent for parameter updates, as in [32]. The elementary online gradient descent algorithm is obtained by dropping the global operation of the batch gradient descent algorithm: instead of considering the gradient of the loss over the complete training set, each iteration of online gradient descent uses a single sample. In this way, the convergence speed of OnM3F can be improved considerably.

In particular, given a single prediction problem of user $i$ on item $j$, the algorithm first makes a prediction of the rating. After the true rating $y_{ij}$ is revealed, the algorithm suffers a loss, and the parameters are updated as

$$\phi_j = \phi_j - 2\lambda\, \frac{\phi_j - \phi_j^t}{\sigma^2} - \lambda\, \frac{\partial\, C\, h(y_{ij}\, \psi_{i\cdot}^T \phi_j)}{\partial \phi_j}, \qquad (24)$$

$$\psi_{ik} = 1 - \frac{1}{1 + e^f}, \qquad \text{where } f = \log \frac{\psi_{ik}^t}{1 - \psi_{ik}^t} + C \sum_{j \,\mid\, y_{ij} \in Y^t} y_{ij}\, \phi_{jk}. \qquad (25)$$

For ordinal ratings, the updating rules for $\psi_{i\cdot}$ and $\phi_j$ are analogous to those for binary ratings, with the corresponding loss function. For $\theta_{ir}$, the updating rule is

$$\rho_{ir} = \rho_{ir} - \lambda \Big( \frac{\rho_{ir} - \rho_{ir}^t}{\sigma^2} + C\, \frac{\partial h\big(T_{ij}^r (\rho_{ir} - \psi_{i\cdot}^T \phi_j)\big)}{\partial \rho_{ir}} \Big). \qquad (26)$$

Algorithm 1 summarizes the detailed procedure of OnM3F for model updating.

Algorithm 1: The OnM3F algorithm
Input: the rating data at time $t+1$, $Y_{t+1} = \{y_{ij} \mid \text{newly arriving rating } y_{ij}\}$; the size of the rating data $S = |Y_{t+1}|$; the truncation level $K$; the user latent coefficients $\psi^t = \{\psi_{ik}^t \mid i = 1, \dots, N; \; k = 1, \dots, K\}$; the item latent factors $\Phi^t$; the thresholds $\rho^t = \{\rho_r^t \mid r = 1, \dots, R-1\}$
Output: $\psi^{t+1}$, $\Phi^{t+1}$, $\rho^{t+1}$

01 If t == 0
02   Learn $\psi^{t+1}$, $\Phi^{t+1}$, $\rho^{t+1}$ using the batch nonparametric M3F algorithm
03 Else
04   Initialize $\psi = \psi^t$, $\Phi = \Phi^t$, $\rho = \rho^t$
05   For c = 1 to S
06     Get the c-th rating $y_{ij}$ from $Y_{t+1}$
07     For k = 1 to K
08       Update $\psi_{ik}$ according to Equation (25)
09     End For
10     Update $\phi_j$ according to Equation (24)
11     Update $\rho_{i y_{ij}}$ according to Equation (26)
12   End For
13   $\psi^{t+1} = \psi$, $\Phi^{t+1} = \Phi$, $\rho^{t+1} = \rho$
14 End If
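A minimal sketch of one inner step of Algorithm 1 on a single binary rating is given below, wiring together the updates (24)-(25); the explicit hinge subgradient and the default hyperparameter values are illustrative assumptions.

```python
import numpy as np

def onm3f_step(psi, phi, psi_t, phi_t, i, j, y, lam=0.1, C=0.1, sigma=1.0):
    """One OnM3F update on binary rating y_ij (Eqs. (24)-(25)).
    psi: N x K Bernoulli means in (0, 1); phi: M x K Gaussian means;
    psi_t/phi_t hold the parameters from before the current mini-batch."""
    # Eq. (25): closed-form update of psi_i from its previous value
    f = np.log(psi_t[i] / (1.0 - psi_t[i])) + C * y * phi[j]
    psi[i] = 1.0 - 1.0 / (1.0 + np.exp(f))
    # Eq. (24): subgradient step on phi_j; the hinge is active when margin < 1
    active = (y * (psi[i] @ phi[j])) < 1.0
    grad = -C * y * psi[i] if active else np.zeros_like(phi[j])
    phi[j] = phi[j] - 2.0 * lam * (phi[j] - phi_t[j]) / sigma**2 - lam * grad
    return psi, phi
```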

V. ONLINE LEARNING FOR GENERAL M3F FORMULATIONS

Nonparametric PM3F is a variant of the basic M3F model and an improved formulation of PM3F. In this section, we derive simplified online learning methods for these two general formulations.

A. Online Learning for M3F

We simply extend the online PA learning framework to online max-margin matrix factorization (OM3F) learning for collaborative prediction in the recommendation task. First, we consider binary ratings $y \in \{\pm 1\}$. Suppose the newly arriving rating set is $Y^t = \{y_{ij} \in \{\pm 1\}\}$, where the rating $y_{ij}$ indicates that user $i$ rates item $j$ at time $t$. We define the loss function $\ell(U_i, V_j; y_{ij}) = h(y_{ij} U_i V_j^T)$ for each rating. Then, according to online Passive-Aggressive learning, the update for these ratings is

$$(U_i^{t+1}, V_j^{t+1}) = \operatorname*{argmin}_{U_i, V_j} \; C \sum_{y_{ij} \in Y^t} h(y_{ij} U_i V_j^T) + \sum_{i=1}^{N} \|U_i^t - U_i\|^2 + \sum_{j=1}^{M} \|V_j^t - V_j\|^2. \qquad (27)$$

The first term in the above equation is the hinge loss on the new observations, and the remaining two terms are the corresponding regularization, keeping the new factors close to the previous estimates. The parameter $C$ is a trade-off constant.

Parameter updates. We use online gradient descent for parameter updates. In particular, given a single prediction problem of user $i$ on item $j$, the algorithm first makes a prediction $\hat{y}_{ij} = U_i V_j^T$. After the true rating $y_{ij}$ is revealed, the algorithm suffers a loss and updates

$$U_i = U_i - 2\lambda (U_i - U_i^t) - \lambda\, \frac{\partial\, C\, h(y_{ij} U_i V_j^T)}{\partial U_i}, \qquad (28)$$

$$V_j = V_j - 2\lambda (V_j - V_j^t) - \lambda\, \frac{\partial\, C\, h(y_{ij} U_i V_j^T)}{\partial V_j}, \qquad (29)$$

where $\lambda$ is a predefined step size. For $\theta_{ir}$, the updating rule is

$$\theta_{ir} = \theta_{ir} - \lambda\, C\, T_{ij}^r\, h'\big(T_{ij}^r (\theta_{ir} - U_i V_j^T)\big). \qquad (30)$$

Algorithm 2 summarizes the detailed procedure of OM3F for model updating.

Algorithm 2: The OM3F algorithm
Input: the rating data at time $t+1$, $Y_{t+1} = \{y_{ij} \mid \text{newly arriving rating } y_{ij}\}$; the size of the rating data $S = |Y_{t+1}|$; the user latent factors $U^t$; the item latent factors $V^t$; the thresholds $\Theta^t = \{\theta_{ir} \mid i \in \{1, \dots, N\},\; r \in \{1, \dots, R-1\}\}$
Output: $U^{t+1}$, $V^{t+1}$, $\Theta^{t+1}$

01 If t == 0
02   Learn $U^{t+1}$, $V^{t+1}$, $\Theta^{t+1}$ using M3F
03 Else
04   Initialize $U = U^t$; $V = V^t$; $\Theta = \Theta^t$
05   For c = 1 to S
06     Get the c-th rating $y_{ij}$ from $Y_{t+1}$
07     Update $U_i$ according to Equation (28)
08     Update $V_j$ according to Equation (29)
09     Update $\theta_{i y_{ij}}$ according to Equation (30)
10   End For
11 End If
12 $U^{t+1} = U$; $V^{t+1} = V$; $\Theta^{t+1} = \Theta$
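The sketch below implements one OM3F update on a binary rating following Eqs. (28)-(29); the explicit hinge subgradient and the default hyperparameters are illustrative assumptions.

```python
import numpy as np

def om3f_step(U, V, U_t, V_t, i, j, y, lam=0.1, C=0.1):
    """One OM3F step on binary rating y_ij (Eqs. (28)-(29)); U_t, V_t hold
    the factors from before the current batch, lam is the step size."""
    if y * (U[i] @ V[j]) < 1.0:          # hinge subgradient is active
        gU, gV = -C * y * V[j], -C * y * U[i]
    else:
        gU = gV = 0.0
    U[i] = U[i] - 2.0 * lam * (U[i] - U_t[i]) - lam * gU
    V[j] = V[j] - 2.0 * lam * (V[j] - V_t[j]) - lam * gV
    return U, V
```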

B. Online Learning for PM3F

We first consider binary ratings, where $y_{ij} \in \{\pm 1\}$. Suppose the current mini-batch of observations at time $t$ is $Y^t = \{y_{ij}\}$. Applying online Bayesian PA learning, we define online probabilistic max-margin matrix factorization (OPM3F) learning as solving the following optimization problem:

$$\min_{p(U,V)} \; \mathrm{KL}\big(p(U,V)\,\|\,p_t(U,V)\big) + C \sum_{y_{ij} \in Y^t} h\big(y_{ij}\, \mathbb{E}_p[U_i V_j^T]\big). \qquad (31)$$

We first assume that there is no likelihood term connecting parameters and observations. Then, if we assume $p(U,V) = p(U)\,p(V)$ and $p_t(U,V) = \prod_i p(U_i \mid \Phi_i^t, I) \prod_j p(V_j \mid \Psi_j^t, I)$, we can show that $p(U) = \prod_i p(U_i \mid \Phi_i, I)$ and $p(V) = \prod_j p(V_j \mid \Psi_j, I)$, so that problem (31) reduces to

$$\min_{\Phi, \Psi} \; \frac{1}{\sigma^2} \Big( \|\Phi^t - \Phi\|_F^2 + \|\Psi^t - \Psi\|_F^2 \Big) + C \sum_{y_{ij} \in Y^t} h\big(y_{ij}\, \Phi_i \Psi_j^T\big). \qquad (32)$$

In Eq. (31), the first part, $\mathrm{KL}(p(U,V)\,\|\,p_t(U,V))$, can be calculated as follows:

$$\mathrm{KL}\big(p(U,V)\,\|\,p_t(U,V)\big) = \mathbb{E}_{p(U)p(V)}\Big[ \sum_{i=1}^{N} \log \frac{\exp\big(-\frac{1}{\sigma^2}\|U_i - \Phi_i\|^2\big)}{\exp\big(-\frac{1}{\sigma^2}\|U_i - \Phi_i^t\|^2\big)} + \sum_{j=1}^{M} \log \frac{\exp\big(-\frac{1}{\sigma^2}\|V_j - \Psi_j\|^2\big)}{\exp\big(-\frac{1}{\sigma^2}\|V_j - \Psi_j^t\|^2\big)} \Big] = \frac{1}{\sigma^2} \Big( \sum_{i=1}^{N} \|\Phi_i - \Phi_i^t\|^2 + \sum_{j=1}^{M} \|\Psi_j - \Psi_j^t\|^2 \Big). \qquad (33)$$

Parameter updates. We use online gradient descent for parameter updates in OPM3F. In particular, given a single prediction problem of user $i$ on item $j$, the algorithm first makes a prediction $\hat{y}_{ij} = \mathbb{E}_{p(U_i,V_j)}[U_i V_j^T]$. After the true rating $y_{ij}$ is revealed, the algorithm suffers a loss and updates

$$\Phi_i = \Phi_i - \lambda\, \frac{2}{\sigma^2} (\Phi_i - \Phi_i^t) - \lambda\, \frac{\partial\, C\, h(y_{ij}\, \Phi_i \Psi_j^T)}{\partial \Phi_i}, \qquad (34)$$

$$\Psi_j = \Psi_j - \lambda\, \frac{2}{\sigma^2} (\Psi_j - \Psi_j^t) - \lambda\, \frac{\partial\, C\, h(y_{ij}\, \Phi_i \Psi_j^T)}{\partial \Psi_j}, \qquad (35)$$

where $\lambda$ is a predefined step size.

A similar process applies to ordinal ratings. Because of the thresholds $\theta$, we take $\theta_{ir} \sim \mathcal{N}(\rho_r, \sigma^2)$ as in the OnM3F model. We then obtain the corresponding optimization problem

$$\min_{p(U,V)} \; \mathrm{KL}\big(p(U,V)\,\|\,p_t(U,V)\big) + C \sum_{y_{ij} \in Y^t} \sum_{r=1}^{R-1} h\big(T_{ij}^r\, \mathbb{E}_p[\theta_{ir} - U_i V_j^T]\big), \qquad (36)$$

where $\rho_1 < \dots < \rho_{R-1}$ are specified as a prior towards an ascending sequence of large-margin thresholds. For $\rho_r$, the updating rule is

$$\rho_r = \rho_r - \lambda\, C\, T_{ij}^r\, h'\big(T_{ij}^r (\rho_r - \Phi_i \Psi_j^T)\big). \qquad (37)$$

Algorithm 3 summarizes the detailed procedure of OPM3F for model updating.

Algorithm 3: The OPM3F algorithm
Input: the rating data at time $t+1$, $Y_{t+1} = \{y_{ij} \mid \text{newly arriving rating } y_{ij}\}$; the size of the rating data $S = |Y_{t+1}|$; the user latent factor matrix $\Phi^t$; the item latent factor matrix $\Psi^t$; the threshold vector $\rho^t = \{\rho_r^t \mid r = 1, \dots, R-1\}$
Output: $\Phi^{t+1}$, $\Psi^{t+1}$, $\rho^{t+1}$

01 If t == 0
02   Learn $\Phi^{t+1}$, $\Psi^{t+1}$, $\rho^{t+1}$ using M3F
03 Else
04   Initialize $\Phi = \Phi^t$; $\Psi = \Psi^t$; $\rho = \rho^t$
05   For c = 1 to S
06     Get the c-th rating $y_{ij}$ from $Y_{t+1}$
07     Update $\Phi_i$ according to Equation (34)
08     Update $\Psi_j$ according to Equation (35)
09     Update $\rho_{y_{ij}}$ according to Equation (37)
10   End For
11 End If
12 $\Phi^{t+1} = \Phi$; $\Psi^{t+1} = \Psi$; $\rho^{t+1} = \rho$



VI. EXPERIMENTS

In this section, we evaluate the proposed online learning algorithm OnM3F on collaborative filtering tasks. We first analyze the sensitivity of the model with respect to different batch sizes. Then, we test and compare the algorithms on four real data sets. Finally, we evaluate the convergence rate of the proposed model.

Data sets. We conduct experiments on four publicly available data sets that are widely used as benchmarks in collaborative filtering. All data sets can be downloaded from the GroupLens Research website. Table I summarizes the statistics of the data sets, where the rating matrix density is defined as the fraction of observed ratings out of the total number of elements in the rating matrix.

TABLE I. THE STATISTICS OF THE DATA SETS.

Datasets       | #Users | #Items | #Ratings   | Rating Scale | Density
Movielens 100k | 943    | 1,682  | 100,000    | 1-5          | 6.3%
HetRec 2011    | 2,113  | 10,109 | 855,598    | 1-5          | 4.0%
Movielens 1M   | 6,040  | 3,900  | 1,000,209  | 1-5          | 4.2%
Movielens 10M  | 71,567 | 10,681 | 10,000,054 | 1-5          | 1.3%

Compared algorithms. We compare the proposed online learning algorithms with typical online collaborative filtering methods. Specifically, the compared algorithms in our experiments include:

• OCF: the online collaborative filtering algorithm via online gradient descent in [17];
• DA-OCF: the dual-averaging method for online collaborative filtering in [17];
• OM3F: the online maximum margin matrix factorization learning in Algorithm 2;
• OPM3F: the online probabilistic maximum margin matrix factorization learning in Algorithm 3;
• OnM3F: the online nonparametric maximum margin matrix factorization learning in Algorithm 1.

Experimental configuration. For each original data set, we first randomly split it into a test set and a training set with partition ratio 0.1. We then randomly split each training set into a base set and an online set with partition ratio 0.3. In our experiments, we first apply the corresponding batch model to the base set to learn a base model; the online learning method is then applied to the online set to incrementally update the base model. After the online learning process, the learnt model is evaluated on the test set.

Parameter setup. The learning rate $\lambda$ controls the speed of model training, but convergence becomes difficult if it is too large. In this work, the learning rate is set to 0.1 for OnM3F and the benchmarks. The trade-off constant $C$ is also set to 0.1 for OM3F, OPM3F and OnM3F. The baseline methods need a predefined latent factor dimensionality; we set the number of latent dimensions to 300 for OCF, DA-OCF, OM3F and OPM3F. We also set the truncation level $K$ to 300.

Measures. To compare performance, we use two measures:

• The Root Mean Squared Error (RMSE), a popular measure of the difference between the predicted values and the true values we actually observed:
$$\mathrm{RMSE} = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \hat{x}_i)^2}{n}}.$$

• The Mean Absolute Error (MAE), which measures how close the forecasts are to the eventual outcomes:
$$\mathrm{MAE} = \frac{1}{n} \sum_{i=1}^{n} |x_i - \hat{x}_i|.$$

In both equations, $\hat{x}_i$ is a predicted value and $x_i$ is the true value.
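A small sketch of the two measures, assuming `x_true` holds the held-out ratings and `x_pred` the corresponding model predictions:

```python
import numpy as np

def rmse(x_true, x_pred):
    x_true, x_pred = np.asarray(x_true, float), np.asarray(x_pred, float)
    return float(np.sqrt(np.mean((x_true - x_pred) ** 2)))

def mae(x_true, x_pred):
    x_true, x_pred = np.asarray(x_true, float), np.asarray(x_pred, float)
    return float(np.mean(np.abs(x_true - x_pred)))

print(rmse([4, 3, 5], [3.8, 3.4, 4.1]), mae([4, 3, 5], [3.8, 3.4, 4.1]))
```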

A. Batch Size Impacts

Table II presents the accuracy (RMSE) of OnM3F on the four real data sets under different batch sizes, where $K = 300$. We use the corresponding batch learning algorithm on the base set and the online gradient descent algorithm on the online set, and then test the learnt model on the test set. We find that on every data set the accuracy improves as the batch size decreases.

TABLE II. COMPARISONS OF THE PROPOSED OnM3F ALGORITHM UNDER DIFFERENT MINI-BATCH SIZES.

Datasets       | size=16 | size=64 | size=256 | size=1024 | size=2048
Movielens 100k | 0.807   | 0.828   | 0.847    | 0.853     | 0.857
HetRec 2011    | 0.833   | 0.846   | 0.871    | 0.879     | 0.885
Movielens 1M   | 0.891   | 0.909   | 0.921    | 0.927     | 0.934
Movielens 10M  | 0.881   | 0.893   | 0.911    | 0.919     | 0.922

B. Performance Evaluation

We evaluate the algorithms on the four data sets under latent factor dimensionality $k = 300$ and truncation level $K = 300$. Table III summarizes the average performance of the algorithms.

TABLE III. COMPARISONS AMONG THE ALGORITHMS.

Datasets       | Algorithms | RMSE  | MAE    | Time
Movielens 100k | OCF        | 0.933 | 0.816  | 0.34
               | DA-OCF     | 1.021 | 0.891  | 0.68
               | OM3F       | 0.879 | 0.751  | 0.54
               | OPM3F      | 0.852 | 0.739  | 0.56
               | OnM3F      | 0.847 | 0.735  | 0.73
HetRec 2011    | OCF        | 0.923 | 0.8891 | 3.15
               | DA-OCF     | 1.011 | 0.8659 | 6.25
               | OM3F       | 0.881 | 0.8021 | 4.98
               | OPM3F      | 0.866 | 0.7122 | 5.11
               | OnM3F      | 0.871 | 0.7003 | 7.21
Movielens 1M   | OCF        | 0.987 | 0.889  | 3.65
               | DA-OCF     | 1.106 | 0.9576 | 7.38
               | OM3F       | 0.941 | 0.8831 | 5.33
               | OPM3F      | 0.937 | 0.805  | 5.42
               | OnM3F      | 0.921 | 0.798  | 8.21
Movielens 10M  | OCF        | 0.992 | 0.907  | 39.5
               | DA-OCF     | 1.105 | 0.963  | 80.1
               | OM3F       | 0.927 | 0.832  | 61.5
               | OPM3F      | 0.919 | 0.824  | 66.9
               | OnM3F      | 0.911 | 0.821  | 88.7

First, compared with the OCF and DA-OCF approaches, the three proposed online learning algorithms based on M3F achieve better results, with smaller RMSE and MAE values in all cases. This shows that M3F, by considering low-norm factorizations with strong connections to large-margin linear discrimination, outperforms the basic matrix factorization methods.

Second, the proposed OPM3F algorithm is better than OM3F. Bayesian matrix factorization methods scale linearly with the number of observations and perform well on large, sparse and imbalanced data sets; M3F is a special kind of matrix factorization method, and its Bayesian variant retains this characteristic.

Third, the proposed OnM3F algorithm is better than the OM3F and OPM3F algorithms. In those models, the latent dimensionality is selected by experience, whereas nonparametric learning can automatically set the latent dimensionality and thus improve the model.

Fourth, the time costs of OM3F and OPM3F are higher than those of the basic matrix factorization methods, since processing the discrimination term incurs extra time. The time cost of OnM3F is noticeably heavier than that of OM3F and OPM3F: during online nonparametric learning, the non-gradient-descent part of the model update costs more time.

C. Convergence Evaluation

To further evaluate the three online M3F learning algorithms, Figure 1 shows the online convergence of all the algorithms. The results show that all compared algorithms converge, but at different speeds; OnM3F outperforms the other algorithms and finally attains the best convergence result. In practice, data are often very sparse, so when the data size is small, the accuracy is low. Online learning tolerates this accuracy loss to obtain high speed, and as stream data arrive, the accuracy improves. Overall, an effective online updating algorithm is particularly important for tackling the sparsity challenge.

Fig. 1. The convergence rate evaluation of the online learning algorithms: (a) RMSE, Movielens 100k; (b) RMSE, HetRec 2011; (c) RMSE, Movielens 1M; (d) RMSE, Movielens 10M.

VII. CONCLUSION

In this paper, we studied the new problem of online learning for nonparametric M3F models. Specifically, we proposed an online nonparametric learning model, OnM3F, which can continuously auto-select the number of latent factors, together with an online gradient descent algorithm. Experiments on four large real-world data sets demonstrated the effectiveness of the proposed method.

With the development of mobile social network applications, many new factors, such as social relationships and geographical information, can be used to model real-world data for collaborative recommendation. In the future, we will study the impact of these new social factors on online learning.

VIII. ACKNOWLEDGEMENT

This work was supported by the NSFC (No. 61370025, 61103518), 863 projects (No. 2012AA012502, 2011AA01A103), 973 project (No. 2013CB329606), the Strategic Leading Science and Technology Projects of the Chinese Academy of Sciences (No. XDA06030200), and Australia ARC Discovery Project (DP140102206).

REFERENCES

[1] M.-H. Park, J.-H. Hong, and S.-B. Cho, "Location-based recommendation system using Bayesian user's preference model in mobile devices," in Proceedings of the 4th International Conference on Ubiquitous Intelligence and Computing, 2007.
[2] Y. Takeuchi and M. Sugimoto, "An outdoor recommendation system based on user location history," in Proceedings of the 3rd International Conference on Ubiquitous Intelligence and Computing, 2006.
[3] V. W. Zheng, Y. Zheng, X. Xie, and Q. Yang, "Collaborative location and activity recommendations with GPS history data," in Proceedings of the 19th International Conference on World Wide Web, 2010.
[4] H. Wang, M. Terrovitis, and N. Mamoulis, "Location recommendation for location-based social networks," in Proceedings of the ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems, 2010.
[5] W. Pan, E. W. Xiang, and Q. Yang, "Transfer learning in collaborative filtering with uncertain ratings," in Proceedings of the Twenty-Sixth AAAI Conference on Artificial Intelligence, 2012.
[6] Z. Qiao, P. Zhang, J. He, Y. Cao, C. Zhou, and L. Guo, "Combining geographical information of users and content of items for accurate rating prediction," in Proceedings of the 23rd International World Wide Web Conference, 2014.
[7] W.-J. Li and D.-Y. Yeung, "Social relations model for collaborative filtering," in Proceedings of the Twenty-Fifth AAAI Conference on Artificial Intelligence, 2011.
[8] P. Zhang, C. Zhou, P. Wang, J. Gao, X. Zhu, and L. Guo, "E-Tree: An efficient indexing structure for ensemble models on data streams," IEEE Transactions on Knowledge and Data Engineering, vol. 26, no. 3, 2014.
[9] N. Srebro, J. Rennie, and T. Jaakkola, "Maximum-margin matrix factorization," in Advances in Neural Information Processing Systems (NIPS), 2005.
[10] J. D. M. Rennie and N. Srebro, "Fast maximum margin matrix factorization for collaborative prediction," in Proceedings of the 22nd International Conference on Machine Learning, 2005.
[11] M. Xu, J. Zhu, and B. Zhang, "Nonparametric max-margin matrix factorization for collaborative prediction," in Advances in Neural Information Processing Systems (NIPS), 2012.
[12] M. Xu, J. Zhu, and B. Zhang, "Fast max-margin matrix factorization with data augmentation," in Proceedings of the 30th International Conference on Machine Learning (ICML), 2013.
[13] K. Crammer, O. Dekel, J. Keshet, S. Shalev-Shwartz, and Y. Singer, "Online passive-aggressive algorithms," Journal of Machine Learning Research, 2006.
[14] J. Ma, L. K. Saul, S. Savage, and G. M. Voelker, "Beyond blacklists: Learning to detect malicious web sites from suspicious URLs," in Proceedings of the ACM SIGKDD Conference, 2009.
[15] Z. Wang and S. Vucetic, "Online passive-aggressive algorithms on a budget," in Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, 2010.
[16] T. Shi and J. Zhu, "Online Bayesian passive-aggressive learning," in Proceedings of the 31st International Conference on Machine Learning (ICML), 2014.
[17] G. Ling, H. Yang, I. King, and M. R. Lyu, "Online learning for collaborative filtering," in Proceedings of the IEEE World Congress on Computational Intelligence, 2012.
[18] J. Wang, S. C. Hoi, P. Zhao, and Z.-Y. Liu, "Online multi-task collaborative filtering for on-the-fly recommender systems," in Proceedings of the 7th ACM Conference on Recommender Systems, 2013.
[19] J. Abernethy, K. Canini, J. Langford, and A. Simma, "Online collaborative filtering," University of California at Berkeley, Technical Report, 2007.
[20] F. Wang, P. Li, and A. C. König, "Efficient document clustering via online nonnegative matrix factorizations," in Proceedings of the Eleventh SIAM International Conference on Data Mining, 2011.
[21] J. Mairal, F. Bach, J. Ponce, and G. Sapiro, "Online learning for matrix factorization and sparse coding," Journal of Machine Learning Research, 2010.
[22] Z. Qiao, P. Zhang, Y. Cao, C. Zhou, and L. Guo, "Combining heterogeneous social and geographical information for event recommendation," in Proceedings of the 28th AAAI Conference on Artificial Intelligence, 2014.
[23] Y. Koren, R. Bell, and C. Volinsky, "Matrix factorization techniques for recommender systems," IEEE Computer, 2009.
[24] G. Linden, B. Smith, and J. York, "Amazon.com recommendations: Item-to-item collaborative filtering," IEEE Internet Computing, 2003.
[25] G. Adomavicius and A. Tuzhilin, "Toward the next generation of recommender systems: A survey of the state-of-the-art and possible extensions," IEEE Transactions on Knowledge and Data Engineering, 2005.
[26] B. Sarwar, G. Karypis, J. Konstan, and J. Riedl, "Item-based collaborative filtering recommendation algorithms," in Proceedings of the Tenth International World Wide Web Conference, 2001.
[27] J. S. Breese, D. Heckerman, and C. Kadie, "Empirical analysis of predictive algorithms for collaborative filtering," in Proceedings of the Fourteenth Conference on Uncertainty in Artificial Intelligence, 1998.
[28] J. L. Herlocker, J. A. Konstan, L. G. Terveen, and J. T. Riedl, "Evaluating collaborative filtering recommender systems," ACM Transactions on Information Systems (TOIS), 2004.
[29] D. Xia, "Sparse learning via maximum margin matrix factorization," School of Mathematics, CSE program, Georgia Institute of Technology.
[30] J. D. M. Rennie, "Smooth hinge classification," Massachusetts Institute of Technology, 2005.
[31] T. Jaakkola, M. Meila, and T. Jebara, "Maximum entropy discrimination," Massachusetts Institute of Technology, 1999.
[32] Y. Ying and M. Pontil, "Online gradient descent learning algorithms," University College London.
[33] F. Doshi-Velez, K. T. Miller, J. Van Gael, and Y. W. Teh, "Variational inference for the Indian buffet process," University of Cambridge Technical Report, 2009.