
IEEE TRANSACTIONS ON NEURAL NETWORKS, VOL. 22, NO. 2, FEBRUARY 2011

Client–Server Multitask Learning from Distributed Datasets

Francesco Dinuzzo, Gianluigi Pillonetto, Member, IEEE, and Giuseppe De Nicolao, Senior Member, IEEE

Abstract—A client–server architecture to simultaneously solve multiple learning tasks from distributed datasets is described. In such an architecture, each client corresponds to an individual learning task and the associated dataset of examples. The goal of the architecture is to perform information fusion from multiple datasets while preserving the privacy of individual data. The role of the server is to collect data in real time from the clients and codify the information in a common database. This information can be used by all the clients to solve their individual learning tasks, so that each client can exploit the information content of all the datasets without actually having access to the private data of others. The proposed algorithmic framework, based on regularization and kernel methods, uses a suitable class of "mixed effect" kernels. The methodology is illustrated through a simulated recommendation system, as well as an experiment involving pharmacological data coming from a multicentric clinical trial.

Index Terms—Collaborative filtering, conjoint analysis, inductive transfer, kernel methods, learning to learn, multitask learning, population methods, recommender systems, regularization theory.

I. INTRODUCTION

THE solution of learning tasks by joint analysis of multiple datasets is receiving increasing attention in different fields and under various perspectives. Indeed, the information provided by data for a specific task may serve as a domain-specific inductive bias for the others. Combining datasets to solve multiple learning tasks is an approach known in the machine learning literature as multitask learning or learning to learn [1]–[7]. In this context, the analysis of the inductive transfer process and the investigation of general methodologies for the simultaneous learning of multiple tasks are important topics of research. Many theoretical and experimental results

Manuscript received October 16, 2009; accepted November 10, 2010. Date of publication December 13, 2010; date of current version February 9, 2011. This work was supported in part by the PRIN Project Sviluppo di Nuovi Metodi e Algoritmi per l'Identificazione, la Stima Bayesiana e il Controllo Adattativo e Distribuito, and the Progetto di Ateneo under Project CPDA090135/09, funded by the University of Padova, and by the European Community's Seventh Framework Program under Agreement FP7-ICT-223866-FeedNetBack.
F. Dinuzzo is with the Department of Mathematics, University of Pavia, Pavia 27100, Italy. He is also with the Risk and Security Study Center, Istituto Universitario di Studi Superiori, Pavia 27100, Italy (e-mail: [email protected]).
G. Pillonetto is with the Department of Information Engineering, University of Padova, Padova 35131, Italy (e-mail: [email protected]).
G. De Nicolao is with the Department of Computer Engineering and Systems Science, University of Pavia, Pavia 27100, Italy (e-mail: [email protected]).
Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.
Digital Object Identifier 10.1109/TNN.2010.2095882

support the intuition that, when relationships exist between the tasks, simultaneous learning performs better than separate (single-task) learning [8]–[17]. Theoretical results include the extension to the multitask setting of generalization bounds and of the notion of Vapnik–Chervonenkis dimension [18]–[20], as well as a methodology for learning multiple tasks exploiting unlabeled data (the so-called semisupervised setting) [21].

Information fusion from different but related datasets is widespread in econometrics and marketing analysis, where the goal is to learn user preferences by analyzing both user-specific information and information from related users (see [22]–[25]). The so-called conjoint analysis aims to determine the features of a product that most influence customers' decisions. On the web, collaborative approaches to estimating user preferences have become standard methodologies in many commercial systems and social networks, under the name of collaborative filtering or recommender systems (see [26]). Pioneering collaborative filtering systems include Tapestry [27], GroupLens [28], [29], ReferralWeb [30], and People Helping One Another Know Stuff [31]. More recently, the collaborative filtering problem has been attacked with machine learning methodologies such as Bayesian networks [32], Markov chain Monte Carlo algorithms [33], mixture models [34], dependency networks [35], and maximum margin matrix factorization [36].

Biomedicine is another field in which the importance of combining datasets is especially evident. In pharmacological experiments, very few training examples are typically available for a specific subject due to technological and ethical constraints [37], [38]. To obviate this problem, the so-called population method has been studied and applied with success since the 1970s in pharmacology [39]–[41].
Population methods belong to the family of so-called mixed-effect statistical methods [42], and are based on the knowledge that subjects, albeit different, belong to a population of similar individuals, so that data collected from one subject may be informative with respect to the others [43], [44]. Classical approaches postulate finite-dimensional nonlinear dynamical systems whose unknown parameters can be determined by means of optimization algorithms [45]–[48]. Other strategies include Bayesian estimation with stochastic simulation [49]–[51] and nonparametric regression [52]–[56]. In the machine learning literature, much attention has been given in the last few years to techniques based on regularization, such as kernel methods [57] and Gaussian processes [58]. The regularization approach is powerful and theoretically sound, having its mathematical foundation in the theory

1045–9227/$26.00 © 2010 IEEE


of inverse problems, statistical learning theory, and Bayesian estimation [59]–[64]. The flexibility of kernel engineering allows the estimation of functions defined on generic sets from arbitrary sources of data. The methodology has recently been extended to the multitask setting. In [65], a general framework to solve multitask learning problems using kernel methods and regularization has been proposed, relying on the theory of reproducing kernel Hilbert spaces (RKHS) of vector-valued functions [66].

In many applications (e-commerce, social network data processing, recommender systems), real-time processing of examples is required. Online multitask learning schemes find their natural application in data mining problems involving very large datasets, and are therefore required to scale well with the number of tasks and examples. In [67], an online taskwise algorithm to solve multitask regression problems has been proposed. The learning problem is formulated in the context of online Bayesian estimation (see [68], [69]), and Gaussian processes with suitable covariance functions are used to characterize a nonparametric mixed-effect model. One of the key features of the algorithm is its capability to exploit shared inputs between the tasks in order to reduce computational complexity. However, the algorithm in [67] has a centralized structure in which tasks are sequentially analyzed, and it addresses neither architectural issues regarding the flow of information nor privacy protection.

In this paper, multitask learning from distributed datasets is addressed using a client–server architecture. In our scheme, clients are in a one-to-one correspondence with tasks and their individual databases of examples. The role of the server is to collect examples from different clients in order to summarize their informative content. When a new example associated with any task becomes available, the server executes an online update algorithm.
While in [67] different tasks are sequentially analyzed, the architecture presented in this paper can process examples coming in any order from different learning tasks. The summarized information is stored in a disclosed database whose content is available for download, enabling each client to compute its own estimate exploiting the informative content of all the other datasets. Particular attention is paid to confidentiality issues, which are especially important in commercial and recommender systems (see [70], [71]). First, we require that each specific client cannot access other clients' data. In addition, individual datasets cannot be reconstructed from the disclosed database. Two kinds of clients are considered: active and passive clients. An active client sends its data to the server, thus contributing to the collaborative estimate. A passive client only downloads information from the disclosed database without sending its data. A regularization problem is considered in which a mixed-effect kernel is used to exploit relationships between the tasks. Though specific, the mixed-effect nonparametric model is quite flexible, and its usefulness has been demonstrated in several works [54], [55], [67], [72].

This paper is organized as follows. Multitask learning with regularized kernel methods is presented in Section II, in which a class of mixed-effect kernels is also introduced. In Section III, an efficient centralized offline algorithm for


multitask learning is described that solves the regularization problem of Section II. In Section IV, a rather general client–server architecture is described that is able to efficiently solve online multitask learning from distributed datasets. The server-side algorithm is derived and discussed in Section IV-A, while the client-side algorithm for both active and passive clients is derived in Section IV-B. In Section V, a simulated music recommendation system is employed to test the performance of our algorithm. The application to the analysis of real data from a multicentric clinical trial for an antidepressant drug is discussed in Section VI. Section VII concludes this paper. The Appendix contains technical lemmas and proofs.

A. Notational Preliminaries

1) X denotes a generic set with cardinality |X|.
2) A vector is an element a ∈ X^n (an object with one index). Vectors are denoted by lowercase bold characters. Vector components are denoted by the corresponding non-bold letter with a subscript (e.g., a_i denotes the ith component of a).
3) A matrix is an element A ∈ X^{n×m} (an object with two indices). Matrices are denoted by uppercase bold characters. Matrix entries are denoted by the corresponding non-bold letter with two subscripts (e.g., A_{ij} denotes the entry in place (i, j) of A).
4) Vectors y ∈ R^n are associated with column matrices, unless otherwise specified.
5) For all n ∈ N, let [n] := {1, 2, . . . , n}.
6) Let I denote the identity matrix of suitable dimension.
7) Let e_i ∈ R^n denote the ith element of the canonical basis of R^n (all zeros with a 1 in position i):

   e_i := (0 · · · 1 · · · 0)^T.

8) An (n, p) index vector is an object k ∈ [n]^p.
9) Given a vector a ∈ X^n and an (n, p) index vector k, let

   a(k) := (a_{k_1} · · · a_{k_p}) ∈ X^p.

10) Given a matrix A ∈ X^{n×m} and two index vectors k^1 and k^2, which are (n, p_1) and (m, p_2) index vectors, respectively, let

    A(k^1, k^2) := [ A_{k^1_1 k^2_1}   · · ·  A_{k^1_1 k^2_{p_2}} ]
                   [       ⋮           ⋱          ⋮          ]
                   [ A_{k^1_{p_1} k^2_1} · · · A_{k^1_{p_1} k^2_{p_2}} ]  ∈ X^{p_1 × p_2}.

11) Finally, let

    A(:, k^2) := A([n], k^2),   A(k^1, :) := A(k^1, [m]).
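The index-vector notation of items 8)–11) maps directly onto array indexing; the following is a minimal sketch (not from the paper) using NumPy, where indices are 0-based rather than the 1-based convention used above:

```python
import numpy as np

a = np.array([10, 20, 30, 40])      # a vector a in X^n, n = 4
k = np.array([2, 0, 0])             # an (n, p) index vector, p = 3

# a(k) := (a_{k_1}, ..., a_{k_p})
print(a[k])                         # -> [30 10 10]

A = np.arange(12).reshape(3, 4)     # a matrix A in X^{3x4}
k1 = np.array([0, 2])               # a (3, 2) index vector
k2 = np.array([1, 3])               # a (4, 2) index vector

# A(k1, k2): rows selected by k1, columns selected by k2
print(A[np.ix_(k1, k2)])
# A(:, k2) and A(k1, :)
print(A[:, k2])
print(A[k1, :])
```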

Notice that vectors, as defined in this paper, are not necessarily elements of a vector space. The definition of "vector" adopted in this paper is similar to that used in standard object-oriented programming languages such as C++.

II. PROBLEM FORMULATION

Let m ∈ N denote the total number of tasks. For task j, a vector of ℓ_j input–output pairs S_j ∈ (X × R)^{ℓ_j} is available:

S_j := ((x_{1j}, y_{1j}) · · · (x_{ℓ_j j}, y_{ℓ_j j}))


sampled from a distribution P_j on X × R. The aim of a multitask regression problem is learning m functions f_j : X → R, such that the expected errors with respect to some loss function L

∫_{X×R} L(y, f_j(x)) dP_j

are small.

Task labeling is a simple technique to reduce multitask learning problems to single-task ones. A task label is an integer t_i ∈ [m] that identifies the task to which the ith example belongs. The overall available data can be viewed as a set of triples S ∈ ([m] × X × R)^ℓ, where ℓ := Σ_{j=1}^m ℓ_j is the overall number of examples:

S := ((t_1, x_1, y_1) · · · (t_ℓ, x_ℓ, y_ℓ)).

The correspondence between the dataset S_j and S can be expressed by using an (ℓ, ℓ_j) index vector k^j such that

t_{k^j_i} = j,   i ∈ [ℓ_j].   (1)

For instance, k^2_3 denotes the index in S of the third example of the dataset S_2.

In this paper, predictors are searched within suitable hypothesis spaces called RKHS, associated with positive-definite kernel functions. A positive-definite kernel is a function that can be represented as an inner product in some Hilbert space F as follows:

K(x_1, x_2) = ⟨Φ(x_1), Φ(x_2)⟩_F   (2)

where Φ : X → F is an implicitly defined map that extracts a (possibly infinite) vector of features from a given pattern x ∈ X. The value (2) can be seen as a measure of similarity between the two patterns x_1 and x_2. A positive kernel can also be interpreted as the covariance function of a suitable stochastic process defined over X. In this respect, there is an extensive literature on the connections between regularized kernel methods and Bayesian estimation (see [58], [62]).

An RKHS is a Hilbert space H associated with a positive kernel K such that the following reproducing property holds:

f(x) = ⟨f, K(x, ·)⟩_H   ∀ f ∈ H.
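Property (2) can be illustrated with a toy finite-dimensional feature map. The map Φ(x) = (x, x²) below is a hypothetical example chosen for illustration, not one used in the paper; it induces the kernel K(x_1, x_2) = x_1 x_2 + x_1² x_2²:

```python
import numpy as np

def phi(x):
    # explicit (finite) feature map: Phi(x) = (x, x^2)
    return np.array([x, x**2])

def K(x1, x2):
    # the kernel induced by phi, written in closed form
    return x1 * x2 + (x1**2) * (x2**2)

# the kernel value equals the inner product of features, as in (2)
x1, x2 = 1.5, -2.0
assert np.isclose(K(x1, x2), phi(x1) @ phi(x2))
```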

In the regularization approach, the predictor is obtained by minimizing a cost function, which is the sum of a data fit loss term and a regularization penalty, thus balancing fitting of training data with "smoothness" of the solution. The approach can be extended to multitask learning by defining kernels whose domain also includes the task labels. In such a case, the similarity between pattern x_1 of a task labeled t_1 and pattern x_2 of a task labeled t_2 can be expressed as K((x_1, t_1), (x_2, t_2)). Let H denote an RKHS of functions defined on the enlarged input set X × [m] with kernel K, and consider the following regularization problem:

f̂ = arg min_{f ∈ H} ( Σ_{i=1}^ℓ (y_i − f(x_i, t_i))² / (2w_i) + (λ/2) ‖f‖²_H )

where λ ≥ 0 is the regularization parameter, and w is a suitable vector of positive weights: w := (w_1, . . . , w_ℓ) > 0. Here, prediction errors on training data are penalized by weighted squared losses, and regularization is imposed by penalizing functions whose squared norm is too high.

In this paper, the focus is on a class of mixed-effect kernels, whose values K((x_1, t_1), (x_2, t_2)) can be expressed as a convex combination of a task-independent contribution and a task-specific one:

K((x_1, t_1), (x_2, t_2)) = α K̄(x_1, x_2) + (1 − α) Σ_{j=1}^m K_T^j(t_1, t_2) K̃_j(x_1, x_2),   0 ≤ α ≤ 1.   (3)

The rationale behind such a structure is that predictors for a generic task j will result in the sum of a common task function and an individual shift [see (4) below]. The kernels K̃_j can possibly be all distinct. On the other hand, the K_T^j are positive-semidefinite "selector" kernels defined as

K_T^j(t_1, t_2) = { 1, t_1 = t_2 = j; 0, otherwise }.

The so-called representer theorem, which is a central result in the theory of regularized kernel methods (see [73], [74]), gives the following expression for the optimum f̂:

f̂(x, t) = Σ_{i=1}^ℓ a_i K((x, t), (x_i, t_i)).

The estimate f̂_j := f̂(·, j) for the jth task is defined to be the function obtained by plugging the corresponding task label t = j into the previous expression. As a consequence of the structure of K, the expression of f̂_j decouples into two parts

f̂_j(x) = f̄(x) + f̃_j(x)   (4)

where

f̄(x) = α Σ_{i=1}^ℓ a_i K̄(x_i, x),

f̃_j(x) = (1 − α) Σ_{i ∈ k^j} a_i K̃_j(x_i, x).

It can be shown (see [62]) that the optimal coefficient vector a solves the linear system

(K + λW) a = y   (5)

where W = diag(w), and the kernel matrix K contains the kernel evaluated at all the data:

K_{ij} = K((x_i, t_i), (x_j, t_j)).

From (3), it follows that K has the following structure:

K = α K̄ + (1 − α) Σ_{j=1}^m I(:, k^j) K̃_j(k^j, k^j) I(k^j, :)   (6)

where

K̄_{ij} = K̄(x_i, x_j),   (K̃_k)_{ij} = K̃_k(x_i, x_j).
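To make the structure of (3), (5), and (6) concrete, here is a small numerical sketch. The Gaussian K̄, the linear K̃ (shared by all tasks), the data, and the values of α and λ are all assumptions chosen for illustration, not choices made in the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
alpha, lam = 0.5, 0.1

def Kbar(x1, x2):
    # task-independent (average) kernel, here a hypothetical Gaussian kernel
    return np.exp(-(x1 - x2) ** 2)

def Ktilde(x1, x2):
    # task-specific kernel, here the same linear kernel for every task
    return x1 * x2

x = rng.normal(size=6)             # inputs
t = np.array([0, 0, 1, 1, 2, 2])   # task labels, m = 3 tasks
y = rng.normal(size=6)             # outputs
w = np.ones(6)                     # weights

# kernel matrix with the mixed-effect structure of (3)/(6)
Kb = Kbar(x[:, None], x[None, :])
Kt = Ktilde(x[:, None], x[None, :]) * (t[:, None] == t[None, :])
K = alpha * Kb + (1 - alpha) * Kt

# solve the linear system (5)
a = np.linalg.solve(K + lam * np.diag(w), y)

# decomposition (4) at a test point, for task j = 0
xs, j = 0.3, 0
fbar = alpha * np.sum(a * Kbar(x, xs))
ftilde = (1 - alpha) * np.sum(a[t == j] * Ktilde(x[t == j], xs))
full = np.sum(a * (alpha * Kbar(x, xs) + (1 - alpha) * (t == j) * Ktilde(x, xs)))
assert np.isclose(fbar + ftilde, full)
```

The final assertion checks that the representer-theorem expansion evaluated with the full kernel indeed splits into the average-task part f̄ and the individual shift f̃_j, as stated in (4).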

Function f¯ is independent of the task label and can be regarded as the estimate of an average task, whereas f˜j is the estimate of an individual shift. The value α is related to the “shrinking” of the individual estimates toward the average task. When α = 1, the same function is learned for all the

tasks, as if all examples were referred to a unique task (pooled approach). On the other hand, when α = 0, all the tasks are learned independently (separate approach), as if the tasks were not related at all.

Algorithm 1 Centralized offline algorithm
1: for j = 1 : m do
2:   R^j ← ((1 − α) K̃_j(k^j, k^j) + λ W(k^j, k^j))^{−1}
3: end for
4: Compute the factorization K̆ = LDL^T
5: y̆ ← Σ_{j=1}^m L^T(:, h^j) R^j y^j
6: H ← (D^{−1} + α Σ_{j=1}^m L^T(:, h^j) R^j L(h^j, :))^{−1}
7: z ← H y̆
8: ă ← solution to DL^T ă = z
9: for j = 1 : m do
10:  a^j ← R^j (y^j − α L(h^j, :) z)
11: end for

Fig. 1. Client–server scheme. [The figure shows client j holding its dataset S_j and sending triples (x_i, y_i, w_i) to the server; the server maintains the disclosed database (x̆, y̆, H, L, D) and the undisclosed database (h^j, y^j, w^j, R^j, for j = 1, . . . , m).]

III. COMPLEXITY REDUCTION

In many applications of multitask learning, some or all of the input data x_i are shared between the tasks, so that the number of different basis functions appearing in the expansion (4) may be considerably less than ℓ. As explained below, this feature can be exploited to derive efficient incremental online algorithms for multitask learning. Let

a^j := a(k^j),   y^j := y(k^j),   w^j := w(k^j)

where the index vectors k^j are defined in (1). Introduce the vector of unique inputs x̆ ∈ X^n such that

x̆_i ≠ x̆_j,   ∀ i ≠ j

where n < ℓ denotes the number of unique inputs. For each task j, a new (n, ℓ_j) index vector h^j can be defined such that

x_{ij} = x̆_{h^j_i},   i ∈ [ℓ_j].

The information contained in the index vectors is equivalently expressed by a binary matrix P ∈ {0, 1}^{ℓ×n}, such that

P_{ij} = { 1, x_i = x̆_j; 0, otherwise }.   (7)

We have the following decompositions:

K̄ = P K̆ P^T,   K̆ := LDL^T   (8)

where L ∈ R^{n×r} and D ∈ R^{r×r} are suitable rank-r factors, D is diagonal, and K̆ ∈ R^{n×n} is the kernel matrix associated with the set of unique inputs: K̆_{ij} = K̄(x̆_i, x̆_j). If K̄ is strictly positive, L can be taken as a full-rank lower triangular Cholesky factor (see [75]). Letting

ă := P^T a   (9)

the optimal estimates (4) can be rewritten in a compact form

f̂_j(x) = α Σ_{i=1}^n ă_i K̄(x̆_i, x) + (1 − α) Σ_{i=1}^{ℓ_j} a^j_i K̃_j(x̆_{h^j_i}, x).

The next result shows that the coefficient vectors a and ă can be obtained by solving a system of linear equations involving only "small-sized" matrices, so that complexity depends on the number of unique inputs rather than the overall number of examples.

Theorem 1: The coefficient vectors a and ă, defined in (5) and (9), can be computed by Algorithm 1.

Algorithm 1 is an offline (centralized) procedure whose computational complexity scales as O(n³m). In the next section, we derive a client–server online version of Algorithm 1 that preserves this complexity bound. Typically, this is much better than O(ℓ³), which is the worst-case complexity of directly solving (5).

IV. CLIENT–SERVER ONLINE ALGORITHM

The main objective of the present section is to derive an incremental distributed version of Algorithm 1 having a client–server architecture. It is assumed that each client is associated with a different learning task, so that the terms "task" and "client" will be used interchangeably. The role of the server is twofold.
1) Collecting triples (x_i, y_i, w_i) (input–output–weight) from the clients and updating online all matrices and coefficients needed to compute estimates for all the tasks.
2) Publishing sufficient information so that any client (task) j can independently compute its estimate f̂_j, possibly without sending data to the server.
On the other hand, each client j can perform two kinds of operations.
1) Sending triples (x_i, y_i, w_i) to the server.
2) Receiving information from the server sufficient to compute its own estimate f̂_j.
To preserve privacy, each client can neither access other clients' data nor reconstruct their individual estimates. With reference to the matrices y̆, H, and R^j, whose definitions are given inside Algorithm 1, we have the following scheme (see Fig. 1).
1) Undisclosed information: h^j, y^j, w^j, R^j, for j ∈ [m].
2) Disclosed information: x̆, y̆, H.

The server update algorithm represents an incremental implementation of lines 1–6 of Algorithm 1, having access


Algorithm 2 Server: receive (x_i, y_i, w_i) from client j and update the database
1: s = find(x_i, x̆)
2: if (s = n + 1) then
3:   n ← n + 1
4:   x̆ ← [x̆; x_i]
5:   y̆ ← [y̆; 0]
6:   k ← ker(x_i, x̆; K̄)
7:   r ← solution to LDr = k([n − 1])
8:   β ← k_n − r^T D r
9:   H ← diag(H, β)
10:  D ← diag(D, β)
11:  L ← [L 0; r^T 1]
12: end if
13: p = find(x_i, x̆(h^j))
14: if (p = ℓ_j + 1) then
15:   ℓ_j ← ℓ_j + 1
16:   h^j ← [h^j; s]
17:   y^j ← [y^j; y_i]
18:   w^j ← [w^j; w_i]
19:   k̃ ← (1 − α) · ker(x_i, x̆(h^j); K̃_j)
20:   u ← [R^j k̃([ℓ_j − 1]); −1]
21:   γ ← 1/(λ w_i − u^T k̃)
22:   µ ← γ u^T y^j
23:   R^j ← diag(R^j, 0)
24: else
25:   u ← R^j(:, p)
26:   w^j_p ← w^j_p w_i / (w^j_p + w_i)
27:   y^j_p ← y^j_p + (w^j_p / w_i)(y_i − y^j_p)
28:   γ ← −(λ (w^j_p)² / (w^j_p − w_i) + R^j_pp)^{−1}
29:   µ ← w^j_p (y_i − y^j_p)/(w_i − w^j_p) + γ u^T y^j
30: end if
31: R^j ← R^j + γ uu^T
32: v ← L^T(:, h^j) u
33: y̆ ← y̆ + µ v
34: q ← H v
35: H ← H − (q q^T)/((αγ)^{−1} + v^T q)

to all the information (disclosed and undisclosed). Server-side computations are described in Section IV-A. The client-side algorithm represents the computation of lines 7–11 of Algorithm 1 distributed among the clients, where access to undisclosed information is denied. Client-side computations are described in Section IV-B.

A. Server Side

In order to formulate the algorithm in compact form, it is useful to introduce the functions "find" and "ker." Let

A(x) := {i : x_i = x}.

Algorithm 3 Server (Case 1)
1: s = find(x_i, x̆)
2: p = find(x_i, x̆(h^j))
3: u ← R^j(:, p)
4: w^j_p ← w^j_p w_i / (w^j_p + w_i)
5: y^j_p ← y^j_p + (w^j_p / w_i)(y_i − y^j_p)
6: γ ← −(λ (w^j_p)² / (w^j_p − w_i) + R^j_pp)^{−1}
7: µ ← w^j_p (y_i − y^j_p)/(w_i − w^j_p) + γ u^T y^j
8: R^j ← R^j + γ uu^T
9: v ← L^T(:, h^j) u
10: y̆ ← y̆ + µ v
11: q ← H v
12: H ← H − (q q^T)/((αγ)^{−1} + v^T q)

For any p, q ∈ N, x ∈ X, x ∈ X^p, and y ∈ X^q, let

find : X × X^p → [p + 1]

find(x, x) := { p + 1, A(x) = Ø; min A(x), A(x) ≠ Ø }

ker(·, ·; K) : X^p × X^q → R^{p×q}

ker(x, y; K)_{ij} := K(x_i, y_j).
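The "find" and "ker" helpers can be sketched as follows. This is an illustrative implementation, not the paper's code; outputs of find are 1-based to match the definitions above:

```python
def find(x, xs):
    # A(x) = set of (1-based) positions of x in xs; return p + 1 when x is absent,
    # otherwise the smallest matching position min A(x)
    matches = [i + 1 for i, xi in enumerate(xs) if xi == x]
    return min(matches) if matches else len(xs) + 1

def ker(xs, ys, K):
    # evaluate the kernel K on all pairs, giving a p x q matrix
    return [[K(xi, yj) for yj in ys] for xi in xs]

xs = [3.0, 1.0, 3.0]
assert find(3.0, xs) == 1      # min A(x) for a repeated input
assert find(2.0, xs) == 4      # p + 1 signals a new input
assert ker([1.0, 2.0], [3.0], lambda a, b: a * b) == [[3.0], [6.0]]
```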

The complete computational scheme is reported in Algorithm 2. The initialization is defined by resorting to empty matrices, whose manipulation rules can be found in [76]. In particular, all the matrices and vectors are initialized to the empty matrix. It is assumed that the function "ker" returns an empty matrix when applied to empty matrices. Algorithm 2 is mainly based on the use of matrix factorizations and the matrix manipulation lemmas in the Appendix. The rest of this section is an extensive proof devoted to showing that the algorithm correctly updates all the relevant quantities when a new triple (x_i, y_i, w_i) becomes available from task j. Three cases are possible.
1) The input x_i is already among the inputs of task j.
2) The input x_i is not among the inputs of task j, but can be found in the common database x̆.
3) The input x_i is new.

1) Case 1: Repetition within Inputs of Task j: The input x_i has been found in x̆(h^j), so that it is also present in x̆. Thus, the flow of Algorithm 2 can be equivalently reorganized as in Algorithm 3. Let r denote the number of triples of the type (x, y_i, w_i) belonging to task j. These data can be replaced by a single triple (x, y, w) without changing the output of the algorithm. Letting

w := (Σ_{i=1}^r 1/w_i)^{−1},   y := w Σ_{i=1}^r y_i/w_i

the part of the empirical risk associated with these data can be rewritten as

Σ_{i=1}^r (y_i − f_j(x))²/(2w_i) = (1/2) Σ_{i=1}^r ( y_i²/w_i − 2 f_j(x) y_i/w_i + f_j(x)²/w_i )
= (f_j(x)² − 2 f_j(x) y)/(2w) + Σ_{i=1}^r y_i²/(2w_i)
= (y − f_j(x))²/(2w) + A

where A is a constant independent of f. To recursively update w and y when a repetition is detected, notice that

w^{r+1} = (Σ_{i=1}^{r+1} 1/w_i)^{−1} = (1/w^r + 1/w_{r+1})^{−1} = w^r w_{r+1}/(w^r + w_{r+1}) = w^r − (w^r)²/(w^r + w_{r+1})

y^{r+1} = w^{r+1} Σ_{i=1}^{r+1} y_i/w_i = w^{r+1} (y^r/w^r + y_{r+1}/w_{r+1}) = y^r + (w^{r+1}/w_{r+1})(y_{r+1} − y^r).

The variations can be expressed as

Δw^{r+1} = −(w^r)²/(w^r + w_{r+1}) = (w^{r+1})²/(w^{r+1} − w_{r+1})

Δy^{r+1} = (w^{r+1}/w_{r+1})(y_{r+1} − y^r) = w^{r+1} (y_{r+1} − y^{r+1})/(w_{r+1} − w^{r+1}).
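The batch and recursive formulas above can be checked numerically. The following sketch (not from the paper) merges three hypothetical weighted measurements, first with the batch definitions and then with the recursions used by the algorithm:

```python
ws = [0.5, 2.0, 1.0]    # hypothetical weights w_i
ys = [1.0, -1.0, 3.0]   # hypothetical outputs y_i

# batch formulas: w = (sum 1/w_i)^-1, y = w * sum(y_i / w_i)
w_batch = 1.0 / sum(1.0 / wi for wi in ws)
y_batch = w_batch * sum(yi / wi for yi, wi in zip(ys, ws))

# recursive updates: w <- w*w_i/(w + w_i), then y <- y + (w_new/w_i)*(y_i - y)
w, y = ws[0], ys[0]
for wi, yi in zip(ws[1:], ys[1:]):
    w = w * wi / (w + wi)
    y = y + (w / wi) * (yi - y)

assert abs(w - w_batch) < 1e-12
assert abs(y - y_batch) < 1e-12
```

This is exactly the pair of updates performed in lines 4 and 5 of Algorithm 3, applied to the pth datum of task j.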

By applying these formulas to the pth data of task j, lines 4 and 5 of Algorithm 3 are obtained. In particular, the variations of the coefficients w^j_p and y^j_p after the update can be written as follows:

Δw^j_p = (w^j_p)²/(w^j_p − w_i),   Δy^j_p = w^j_p (y_i − y^j_p)/(w_i − w^j_p).

At this point, we also need to modify the matrix R^j, since its definition (see Algorithm 1, line 2) depends on the weight coefficient w^j_p. To check that R^j is correctly updated by lines 3, 6, and 8 of Algorithm 3, observe that, applying Lemma 1, we have

R^j ← ((R^j)^{−1} + λ Δw^j_p e_p e_p^T)^{−1}
    = R^j − (R^j e_p e_p^T R^j)/(λ Δw^j_p + e_p^T R^j e_p)
    = R^j + γ uu^T.
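The step above is a rank-one (Sherman–Morrison-type) update. Since the Appendix with Lemma 1 is not included in this excerpt, the identity checked numerically below is our reading of it, stated as an assumption: for an invertible symmetric A and scalar c, (A^{−1} + c e_p e_p^T)^{−1} = A − (A e_p e_p^T A)/(1/c + e_p^T A e_p):

```python
import numpy as np

rng = np.random.default_rng(1)
M = rng.normal(size=(4, 4))
A = M @ M.T + 4 * np.eye(4)     # symmetric positive definite, stands in for R^j
p, c = 2, 0.7                   # hypothetical index and scalar perturbation
e = np.zeros(4)
e[p] = 1.0

# left: invert after the rank-one change of the inverse
left = np.linalg.inv(np.linalg.inv(A) + c * np.outer(e, e))

# right: closed-form rank-one correction, no extra inversion needed
u = A @ e                        # u = A(:, p) for symmetric A
right = A - np.outer(u, u) / (1.0 / c + A[p, p])

assert np.allclose(left, right)
```

This is why the server can refresh R^j in O(ℓ_j²) with the cached column u = R^j(:, p), instead of re-inverting the matrix from scratch.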

Consider now the update of y̆. Since y^j has already been modified, the previous y^j is given by y^j − e_p Δy^j_p. Recalling the definition of y̆ in Algorithm 1, and the update for R^j, we have

y̆ ← Σ_{k ≠ j} L^T(:, h^k) R^k y^k + L^T(:, h^j)(R^j + γ uu^T) y^j.

Algorithm 4 Server (Case 2)
1: s = find(x_i, x̆)
2: p = find(x_i, x̆(h^j))
3: ℓ_j ← ℓ_j + 1
4: h^j ← [h^j; s]
5: y^j ← [y^j; y_i]
6: w^j ← [w^j; w_i]
7: k̃ ← (1 − α) · ker(x_i, x̆(h^j); K̃_j)
8: u ← [R^j k̃([ℓ_j − 1]); −1]
9: γ ← 1/(λ w_i − u^T k̃)
10: R^j ← diag(R^j, 0) + γ uu^T
11: v ← L^T(:, h^j) u
12: µ ← γ u^T y^j
13: y̆ ← y̆ + µ v
14: q ← H v
15: H ← H − (q q^T)/((αγ)^{−1} + v^T q)

Using the definition of µ and u in Algorithm 3, we have

(R^j + γ uu^T) y^j = (R^j + γ uu^T)(y^j − Δy^j_p e_p + Δy^j_p e_p)
= R^j (y^j − Δy^j_p e_p) + (Δy^j_p + γ u^T y^j) u
= R^j (y^j − Δy^j_p e_p) + µ u

so that

y̆ ← y̆ + µ L^T(:, h^j) u.

By defining v as in line 9 of Algorithm 3, the update of line 10 is obtained. Finally, we show that H is correctly updated. Recall from Algorithm 1 that

H = (D^{−1} + α Σ_{j=1}^m L^T(:, h^j) R^j L(h^j, :))^{−1}.

In view of lines 8 and 9 of Algorithm 3, we have

H ← (H^{−1} + αγ vv^T)^{−1}.

Lines 11 and 12 follow by applying Lemma 1 to the last expression.

2) Case 2: Repetition in x̆: Since x_i belongs to x̆ but not to x̆(h^j), we have

s ≠ n + 1,   p = ℓ_j + 1.

The flow of Algorithm 2 can be organized as in Algorithm 4. First, the vectors h^j, y^j, and w^j must be properly enlarged as in lines 3–6. Then, recalling the definition of R^j in Algorithm 1, we have

(R^j)^{−1} ← [ (R^j)^{−1}        k̃([ℓ_j − 1]) ]
             [ k̃([ℓ_j − 1])^T   k̃_{ℓ_j} + λ w_i ].

The update for R^j in lines 7–10 is obtained by applying Lemma 2 with A = (R^j)^{−1}. Consider now the update of y̆. Recall that h^j and y^j have already been updated. By the definition of y̆ and in view of line 10 of Algorithm 4, we have

y̆ ← Σ_{k ≠ j} L^T(:, h^k) R^k y^k + L^T(:, h^j)(diag(R^j, 0) + γ uu^T) y^j
  = y̆ + γ (u^T y^j) L^T(:, h^j) u.


Algorithm 5 Server (Case 3)
1: n ← n + 1
2: x̆ ← [x̆; x_i]
3: k ← ker(x_i, x̆; K̄)
4: r ← solution to LDr = k([n − 1])
5: β ← k_n − r^T D r
6: D ← diag(D, β)
7: L ← [L 0; r^T 1]
8: y̆ ← [y̆; 0]
9: H ← diag(H, β)
10: Call Algorithm 4.

The update in lines 11–13 immediately follows. Finally, the update in lines 14 and 15 for H follows by applying Lemma 2, as in Case 1.

3) Case 3: x_i Is a New Input: Since x_i is a new input, we have s = n + 1, p = ℓ_j + 1. The flow of Algorithm 2 can be reorganized as in Algorithm 5. The final part of Algorithm 5 coincides with Algorithm 4. In addition, the case of a new input also requires updating the factors D and L. If K̄ is strictly positive, then D is diagonal and L can be chosen as a lower triangular Cholesky factor. If K̄ is not strictly positive, other kinds of decompositions can be used. In particular, for the linear kernel K̄(x_1, x_2) = x_1^T x_2 over R^r, D and L can be taken, respectively, equal to the identity and x̆. Recalling (8), we have

K̆ ← [ K̆               k([n − 1]) ]
     [ k([n − 1])^T     k_n        ]
   = [ LDL^T            k([n − 1]) ]
     [ k([n − 1])^T     k_n        ]
   = [ L   0 ] [ D  0 ] [ L   0 ]^T
     [ r^T 1 ] [ 0  β ] [ r^T 1 ]

with r and β as in lines 4 and 5. Finally, the updates for y̆ and H are similar to those of the previous Case 2, once the enlargements in lines 8 and 9 are made.

B. Client Side

It turns out that each client can compute its own estimate f̂_j by only accessing the disclosed data x̆, y̆, and H. It is not even necessary to know the overall number m of tasks, nor their kernels K̃_j. First of all, we show that the vector ă can be computed by knowing only disclosed data. From Algorithm 1, we have

DL^T ă = H y̆.

Evidently, the right-hand side can be computed by just knowing y̆ and H. Moreover, from (8) one can see that the factors L and D can be computed by knowing the matrix K̆ which, in turn, can be computed by just knowing the vector x̆. In practice, the factors L and D can be incrementally computed by lines 1–7 of Algorithm 6. Notice that ă is only needed at prediction

Algorithm 6 (Client j) Receive x̆, y̆, and H, and evaluate ă and a^j
1: for i = 1 : n do
2:   k ← ker(x̆_i, x̆([i]); K̄)
3:   r ← solution to LDr = k([i − 1])
4:   β ← k_i − r^T D r
5:   D ← diag(D, β)
6:   L ← [L 0; r^T 1]
7: end for
8: if passive then
9:   for i = 1 : ℓ_j do
10:    Run Algorithm 2 with (x_{ij}, y_{ij}, w_{ij}).
11:  end for
12: end if
13: z ← H y̆
14: ă ← solution to DL^T ă = z
15: a^j ← R^j (y^j − α L(h^j, :) z)

time. While it is possible to keep it updated on the server, this would require solving a linear system for each new example, to compute quantities that are possibly overwritten before being used. To reduce the computational load on the server, it is preferable that each client computes these quantities by itself whenever it needs predictions.

As mentioned in the introduction, two kinds of clients are considered.
1) An active client j, which sends its own data to the server.
2) A passive client j, which does not send its data.
Passive clients must run a local version of the update algorithm before obtaining coefficient vectors, since the information contained in the disclosed database does not take into account their own data. For this reason, they need to know the matrix H and the vector y̆ separately. Once the disclosed data and the vector ă have been obtained, each client still needs the individual coefficient vector a^j in order to perform predictions. In this respect, notice that line 10 of Algorithm 1 decouples with respect to the different tasks:

a^j ← R^j (y^j − α L(h^j, :) z)

so that a^j can be computed by knowing only disclosed data together with the private data of task j. This is the key feature that allows a passive client to perform predictions without disclosing its private data while exploiting the information contained in all the other datasets. Client-side computations are summarized in Algorithm 6.

V. ILLUSTRATIVE EXAMPLE: MUSIC RECOMMENDATION

In this section, the proposed algorithm is applied to a simulated music recommendation problem, in order to predict the preferences of several virtual users with respect to a set of artists. Artist data were obtained from the May 2005 AudioScrobbler Database dump,¹ which is the last dump released

¹Available at http://www-etud.iro.umontreal.ca/∼bergstrj/audioscrobbler_data.html.


Fig. 2. Music recommendation experiment: example of artist tagging. [The figure shows the 19 tags considered in this experiment (50s, 60s, 70s, 80s, Classical, Hip-Hop, blues, country, electronic, indie rock, jazz, latin, pop, punk, reggae, rnb, rock, soul, world), illustrated with the tag profile of Bob Dylan.]

by AudioScrobbler/LastFM under a Creative Commons license. LastFM is an internet radio station that provides individualized broadcasts based on user preferences. The database dump includes user playcounts and artist names, so that artists can be sorted according to their overall number of playcounts. After sorting by decreasing playcounts, the 489 top-ranking artists were selected. The input set X is therefore a set of 489 items:

X = {Bob Marley, Madonna, Michael Jackson, . . .}.

The tasks are associated with user preference functions. More precisely, the normalized preferences of user j over the entire set of artists are expressed by functions f_j : X → R that are to be learnt from data. The simulated music recommendation system relies on music type classification expressed by means of tags (rock, pop, etc.). The ith artist is associated with a vector z_i ∈ [0, 1]^19 of 19 tags, whose values were obtained by querying the LastFM² web site on September 22, 2008. In Fig. 2, the list of the tags considered in this experiment, together with an example of artist tagging, is reported. The vectors z_i were normalized to lie on the unit hypersphere, i.e., ‖z_i‖_2 = 1. Tag information is used to build a mixed-effect kernel over X, where

K̄(x_i(z_i), x_j(z_j)) = e^{z_i^T z_j},   K̃(x_i(z_i), x_j(z_j)) = z_i^T z_j.

The above kernels were also used to generate synthetic users. First, an "average user" was generated by drawing a function g : X → R from a Gaussian process with zero mean and autocovariance K̄. Then, m = 10^5 virtual user preferences were generated as

f_j = 0.25 g + 0.75 g̃_j

where the g̃_j were drawn from a Gaussian process with zero mean and autocovariance K̃. For the jth virtual user, ℓ_j = 5 artists x_{ij} were uniformly randomly sampled from the input set X, and the corresponding noisy outputs y_{ij} were generated as

y_{ij} = f_j(x_{ij}) + ε_{ij}

²Available at http://www.lastfm.com.


Fig. 3. Music recommendation experiment: average TOP20HITS and root mean squared error (RMSE) against α and λ.

where the ε_{ij} are i.i.d. Gaussian errors with zero mean and standard deviation σ = 0.01. Performance is evaluated by both the average RMSE

RMSE = (1/m) Σ_{j=1}^m √( (1/|X|) Σ_{i=1}^{|X|} (f_j(x_i) − f̂_j(x_i))² )

and the average number of hits within the top 20 ranked artists, defined as

TOP20HITS = (1/m) Σ_{j=1}^m hits20_j,   hits20_j :=
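The two performance metrics can be sketched in code. The definition of hits20_j is cut off in the source; the overlap-of-top-20-sets reading used below is an assumption, not the paper's text:

```python
import numpy as np

def rmse(F_true, F_est):
    # F_*: (m, |X|) arrays of true/estimated preferences over all artists;
    # average over users of the per-user root mean squared error
    return np.mean(np.sqrt(np.mean((F_true - F_est) ** 2, axis=1)))

def top20hits(F_true, F_est):
    # assumed reading: count how many of the estimated top-20 artists
    # for each user also appear in that user's true top-20
    hits = []
    for f, fhat in zip(F_true, F_est):
        top_true = set(np.argsort(f)[-20:])
        top_est = set(np.argsort(fhat)[-20:])
        hits.append(len(top_true & top_est))
    return np.mean(hits)

# sanity check on random preferences for 5 hypothetical users and 489 artists
rng = np.random.default_rng(2)
F = rng.normal(size=(5, 489))
assert rmse(F, F) == 0.0
assert top20hits(F, F) == 20.0
```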
