
TSINGHUA SCIENCE AND TECHNOLOGY, ISSN 1007-0214, 04/11, pp. 257-264, Volume 19, Number 3, June 2014

Semiparametric Preference Learning

Yi Zhen, Yangqiu Song, and Dit-Yan Yeung

Abstract: Unlike traditional supervised learning problems, preference learning learns from data available in the form of pairwise preference relations between instances. Existing preference learning methods are either parametric or nonparametric in nature. We propose in this paper a semiparametric preference learning model, abbreviated as SPPL, with the aim of combining the strengths of the parametric and nonparametric approaches. SPPL uses multiple Gaussian processes which are linearly coupled to determine the preference relations between instances. SPPL is more powerful than previous models while keeping the computational complexity low (linear in the number of distinct instances). We devise an efficient algorithm for model learning. Empirical studies have been conducted on two real-world data sets showing that SPPL outperforms related preference learning methods.

Key words: semiparametric learning; preference learning; Gaussian process

1 Introduction

Traditional supervised learning problems such as classification and regression learn a predictive model from training data which are available as a collection of instances and their corresponding target values. In recent years, a new supervised learning problem called preference learning has emerged and aroused intense research interest in the machine learning community[1-5]. Unlike traditional supervised learning problems, the training data in preference learning are in the form of pairwise preference relations between instances associated with a partial or total order relation. The goal is to learn the underlying preference relations between instances or the ranking of instances from the training data. The problem of preference learning arises in a variety of applications, including web search, recommender systems, decision making, and drug discovery.

Various methods have been developed for preference learning, ranging from classification-based approaches[4, 6, 7], boosting-based approaches[3, 8], and SVM-based approaches[2, 9] to probabilistic kernel approaches[5, 10-12]. Most of the existing methods assume that there is an underlying score associated with each instance and that the preference relation between two instances can be determined by their scores. Existing methods can be roughly divided into two categories according to the way the underlying scores are determined. One category consists of parametric methods which utilize a parametric score function and learn the function parameters from the training data. A popular method belonging to this category is Ranking SVM (RankSVM)[2]. The second category consists of nonparametric methods which define a distribution over the score function and learn the score function values directly from the training data. A representative method belonging to this category is based on Gaussian Processes (GPs)[13], which will be referred to as PGP[5] in the sequel. As we know, parametric models are usually faster to train but less expressive, while nonparametric models have stronger expressive power but higher computational cost for model training.

Because of their strong expressive power and probabilistic interpretation, GP-based models have received much attention recently[5, 10]. However, previous GP-based models use only a single GP. Despite their simplicity, they may not be powerful enough for preference learning.

In this paper, we propose a novel semiparametric preference learning model, abbreviated as SPPL, to combine the strengths of the parametric and nonparametric approaches. SPPL uses multiple GPs to predict the instance scores while keeping the computational complexity low (linear in the number of distinct instances). The semiparametric nature of SPPL is attributed to the existence of both a nonparametric component (multiple GPs) and a parametric component (a vector of linear combination weights). In essence, SPPL is a semiparametric generalization of PGP, since PGP uses only one GP to predict the unobserved instance scores. To the best of our knowledge, SPPL is the first semiparametric model for preference learning. We will perform comparative studies using real-world data sets to compare SPPL with both parametric and nonparametric methods.

Yi Zhen is with College of Computing, Georgia Institute of Technology, Atlanta, GA 30332, USA. E-mail: yzhen@cc.gatech.edu.
Yangqiu Song is with Department of Computer Science, University of Illinois at Urbana-Champaign, Urbana, IL 61801, USA. E-mail: [email protected].
Dit-Yan Yeung is with Department of Computer Science and Engineering, Hong Kong University of Science and Technology, Clear Water Bay, Kowloon, Hong Kong, China. E-mail: [email protected].
To whom correspondence should be addressed.
Manuscript received: 2014-04-18; accepted: 2014-04-22

2 Semiparametric Preference Learning

Suppose we have a set of $N$ distinct instances, $\{\mathbf{x}_i \in \mathcal{X} \mid i = 1, \ldots, N\}$, where $\mathcal{X}$ denotes the input space to which the instances belong. We use a matrix $\mathbf{X}$ to denote the $N$ instances with the $i$-th column of $\mathbf{X}$ denoted by $\mathbf{x}_i$. In addition, we define a preference matrix $\mathbf{O}$ with $(i, j)$-th entry $O_{ij} = 1$ if input $\mathbf{x}_i$ is considered better than $\mathbf{x}_j$ according to the preference relation ($\mathbf{x}_i \succ \mathbf{x}_j$) and $O_{ij} = 0$ otherwise. In practice, $\mathbf{O}$ is not fully observed and we use $\mathcal{O}$ to denote the set of observed preferences in $\mathbf{O}$ and $|\mathcal{O}|$ to denote the size of $\mathcal{O}$. Our goal is to predict the unobserved preferences in $\mathbf{O}$ by learning a predictive model from the observed preferences.

2.1 Model

It is common to assume that every instance $\mathbf{x}$ is associated with a score $s$ and the preference relation can be determined from the scores, that is, $\mathbf{x}_i \succ \mathbf{x}_j \iff s_i > s_j$. The main idea of our model is to assume that there exists an unobserved (latent) $D \times 1$ vector $\mathbf{f}_i$ associated with instance $\mathbf{x}_i$ and the score of $\mathbf{x}_i$ can be computed as $s_i = \mathbf{w}^T \mathbf{f}_i$, where $\mathbf{w}$ is a $D \times 1$ vector of combination weights and $\mathbf{w}^T$ denotes the transpose of $\mathbf{w}$. Instead of directly modeling the score function itself as a random process, we model the score function as a parametric function of several factors, each of which is modeled by a random process. We believe this hierarchical or hybrid modeling scheme will lead to a model with stronger expressive power.

More specifically, we assume that the elements in $\mathbf{f}$ follow independent GP priors conditioned on $\mathbf{x}$, i.e., there are $D$ random functions $\{f_d : \mathcal{X} \mapsto \mathbb{R} \mid d = 1, \ldots, D\}$ such that $\mathbf{f}_i = (f_1(\mathbf{x}_i), f_2(\mathbf{x}_i), \ldots, f_D(\mathbf{x}_i))^T$ and $f_d$ is distributed as a GP. Since our model contains both a parametric component ($\mathbf{w}$) and a nonparametric component ($\{f_d\}$), it belongs to the family of semiparametric models and hence the name.

Without loss of generality, we assume the GP priors on the latent functions $\{f_d\}$ to have zero means. Hence, all the GPs can be fully specified by a covariance matrix defined by any Mercer kernel function[14]. Hereafter we assume that the covariance matrices of different latent functions are the same for clarity, although they may be different in our model. Furthermore, we impose a Gaussian prior on $\mathbf{w}$. Thus the prior density functions of the latent variables and the combination weights are given as follows:

$$p(\mathbf{F}) = \prod_{d=1}^{D} \mathcal{N}(\mathbf{F}_d \mid \mathbf{0}, \Sigma), \qquad p(\mathbf{w}) = \mathcal{N}(\mathbf{w} \mid \mathbf{0}, \sigma_w^2 \mathbf{I}_D),$$

where we use an $N \times D$ matrix $\mathbf{F}$ to denote the latent vectors of all the $N$ instances with the $i$-th row of $\mathbf{F}$ being $\mathbf{f}_i^T$, $\mathbf{F}_d$ denotes the $d$-th column of $\mathbf{F}$, $\mathcal{N}(\cdot)$ stands for a normal (or Gaussian) distribution, $\mathbf{I}_D$ is a $D \times D$ identity matrix, $\Sigma$ is the covariance matrix defined by any valid kernel function with each element $\Sigma_{ij} = k(\mathbf{x}_i, \mathbf{x}_j)$ where $k(\cdot, \cdot)$ is the kernel function used to define the GPs, and $\sigma_w^2$ is the variance term. Here we use the same kernel function for different latent functions. We note that the model can be more expressive if different kernel functions are utilized.

Following the idea used in Ref. [5], we define the following probability function of a preference pair for the noise-free case,

$$P_{\text{ideal}}(\mathbf{x}_i \succ \mathbf{x}_j \mid \mathbf{f}_i^T \mathbf{w}, \mathbf{f}_j^T \mathbf{w}) = \begin{cases} 1, & \text{if } (\mathbf{f}_i - \mathbf{f}_j)^T \mathbf{w} > 0, \\ 0, & \text{otherwise,} \end{cases}$$

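where $\mathbf{x}_i \succ \mathbf{x}_j$ indicates that instance $\mathbf{x}_i$ is better than instance $\mathbf{x}_j$ and $\mathbf{f}_i$ and $\mathbf{f}_j$ are the corresponding latent vectors. We further assume that the latent preference scores ($s = \mathbf{w}^T \mathbf{f}$) are corrupted by Gaussian noise and define the following noisy model for the $s$-th preference pair:

$$p(\mathbf{x}_{u_s} \succ \mathbf{x}_{v_s} \mid \mathbf{f}_{u_s}, \mathbf{f}_{v_s}, \mathbf{w}) = \iint P_{\text{ideal}}(\mathbf{x}_{u_s} \succ \mathbf{x}_{v_s} \mid \mathbf{f}_{u_s}^T \mathbf{w} + \delta_{u_s}, \mathbf{f}_{v_s}^T \mathbf{w} + \delta_{v_s}) \, \mathcal{N}(\delta_{u_s} \mid 0, \sigma_o^2) \, \mathcal{N}(\delta_{v_s} \mid 0, \sigma_o^2) \, d\delta_{u_s} \, d\delta_{v_s} = \Phi\left( \frac{(\mathbf{f}_{u_s} - \mathbf{f}_{v_s})^T \mathbf{w}}{\sqrt{2} \sigma_o} \right) \tag{1}$$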

where $\mathbf{f}_{u_s}$ and $\mathbf{f}_{v_s}$ are the latent vectors of the two instances involved in the $s$-th preference pair, $\sigma_o^2$ is the noise variance, and $\Phi(z) = \int_{-\infty}^{z} \mathcal{N}(\gamma \mid 0, 1) \, d\gamma$ is the cumulative distribution function of the standard Gaussian distribution. In the following, for notational simplicity, we will denote $z_s = \frac{(\mathbf{f}_{u_s} - \mathbf{f}_{v_s})^T \mathbf{w}}{\sqrt{2} \sigma_o}$.

Similar to previous works[5, 10], we assume that preference relations are independent given latent vectors $\mathbf{F}$ and combination weights $\mathbf{w}$. Therefore, we have the following preference likelihood function:

$$p(\mathcal{O} \mid \mathbf{F}, \mathbf{w}) = \prod_{s=1}^{M} \Phi(z_s),$$

where $\mathcal{O}$ represents the observed preference relations and $M = |\mathcal{O}|$ is the number of observed preferences.

Let $\theta$ denote a vector of all the hyperparameters, i.e., $\sigma_w$, $\sigma_o$, and $\Sigma$. The total likelihood w.r.t. (with respect to) the hyperparameters can be expressed as follows:

$$p(\mathbf{F}, \mathbf{w}, \mathcal{O} \mid \theta) = p(\mathbf{F} \mid \theta) \, p(\mathbf{w} \mid \theta) \, p(\mathcal{O} \mid \mathbf{F}, \mathbf{w}, \theta) \tag{2}$$

Figure 1 shows the graphical model of SPPL.

2.2 Learning

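We use maximum a posteriori (MAP) estimation to find the point estimates of $\mathbf{F}$ and $\mathbf{w}$, i.e.,

$$\{\mathbf{F}^*, \mathbf{w}^*\} = \arg\max_{\mathbf{F}, \mathbf{w}} \; p(\mathbf{F}, \mathbf{w} \mid \mathcal{O}) \tag{3}$$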


Fig. 1 Graphical model of SPPL, in which O is a set of observed variables, F is a group of latent vectors, w is a vector of combination weights and the others are hyperparameters.
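The noisy preference model of Eq. (1) and the preference likelihood can be illustrated with a short Python sketch that uses only the standard library. The function names (`std_normal_cdf`, `score`, `preference_prob`, `log_likelihood`) and the toy values of the latent vectors and weights are assumptions made here for illustration; they do not come from the paper.

```python
import math

def std_normal_cdf(z):
    """Phi(z), the standard Gaussian CDF, via the complementary error function."""
    return 0.5 * math.erfc(-z / math.sqrt(2.0))

def score(f, w):
    """Latent score s = w^T f for one instance."""
    return sum(fd * wd for fd, wd in zip(f, w))

def preference_prob(f_u, f_v, w, sigma_o):
    """Eq. (1): probability that instance u is preferred to instance v."""
    z = (score(f_u, w) - score(f_v, w)) / (math.sqrt(2.0) * sigma_o)
    return std_normal_cdf(z)

def log_likelihood(F, w, pairs, sigma_o):
    """ln p(O | F, w): sum of ln Phi(z_s) over observed pairs (u_s, v_s)."""
    return sum(math.log(preference_prob(F[u], F[v], w, sigma_o))
               for u, v in pairs)

# Toy example with D = 2 latent functions and N = 3 instances (illustrative values).
F = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
w = [0.8, 0.2]
p_01 = preference_prob(F[0], F[1], w, sigma_o=1.0)  # instance 0 preferred to 1
p_10 = preference_prob(F[1], F[0], w, sigma_o=1.0)  # the reverse preference
ll = log_likelihood(F, w, [(0, 1), (0, 2)], sigma_o=1.0)
```

Because the model is a probit link on the score difference, the two directions of a pair have probabilities that sum to one. For a small noise level such as the value of 0.01 used in the experiments, a strongly violated pair can drive the probability to zero in floating point, so a practical implementation would need a numerically safer log-CDF than this sketch provides.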

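The solution to Eq. (3) is the minimizer of the following negative log-likelihood function:

$$\begin{aligned} L(\mathbf{F}, \mathbf{w}) &= -\ln \left[ p(\mathbf{F}) \, p(\mathbf{w}) \, p(\mathcal{O} \mid \mathbf{F}, \mathbf{w}) \right] \\ &= -\ln p(\mathbf{F}) - \ln p(\mathbf{w}) - \ln p(\mathcal{O} \mid \mathbf{F}, \mathbf{w}) \\ &= \frac{1}{2} \operatorname{tr}(\Sigma^{-1} \mathbf{F} \mathbf{F}^T) + \frac{\mathbf{w}^T \mathbf{w}}{2 \sigma_w^2} - \sum_{s=1}^{M} \ln \Phi(z_s) + C_0 \end{aligned} \tag{4}$$

where $C_0$ is a constant independent of $\mathbf{F}$ and $\mathbf{w}$. In the following, we adopt an alternating method to learn $\mathbf{w}$ and $\mathbf{F}$, that is, we alternate between learning $\mathbf{w}$ while fixing $\mathbf{F}$ and learning $\mathbf{F}$ while fixing $\mathbf{w}$.

2.2.1 Learning w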

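Based on Eq. (4), the objective function w.r.t. $\mathbf{w}$ can be expressed as follows,

$$L_1(\mathbf{w}) = -\sum_{s=1}^{M} \ln \Phi(z_s) + \frac{\mathbf{w}^T \mathbf{w}}{2 \sigma_w^2} + C_1 \tag{5}$$

where $C_1$ is a constant independent of $\mathbf{w}$. The gradient vector and Hessian matrix of $L_1(\mathbf{w})$ w.r.t. $\mathbf{w}$ are

$$\frac{\partial L_1}{\partial \mathbf{w}} = -\sum_{s=1}^{M} \frac{1}{\sqrt{2} \sigma_o} \frac{\phi(z_s)}{\Phi(z_s)} \mathbf{b}_s + \frac{1}{\sigma_w^2} \mathbf{w} \tag{6}$$

$$\frac{\partial^2 L_1}{\partial \mathbf{w} \, \partial \mathbf{w}^T} = \frac{1}{\sigma_w^2} \mathbf{I}_D + \sum_{s=1}^{M} \frac{1}{2 \sigma_o^2} \left[ \frac{\phi^2(z_s)}{\Phi^2(z_s)} + \frac{z_s \phi(z_s)}{\Phi(z_s)} \right] \mathbf{b}_s \mathbf{b}_s^T \triangleq \mathbf{H}_w \tag{7}$$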

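where $\phi(\gamma) = \mathcal{N}(\gamma \mid 0, 1)$ is the standard Gaussian density function and $\mathbf{b}_s = \mathbf{f}_{u_s} - \mathbf{f}_{v_s}$ is a difference vector between the two latent vectors associated with the $s$-th preference pair. As a result, we can easily observe that the following property holds.

Lemma 1 $L_1(\mathbf{w})$ is convex w.r.t. $\mathbf{w}$.

Proof The matrix $\frac{1}{\sigma_w^2} \mathbf{I}_D$ is positive definite. Let $\mathbf{Q}_s = \frac{1}{2 \sigma_o^2} \left[ \frac{\phi^2(z_s)}{\Phi^2(z_s)} + \frac{z_s \phi(z_s)}{\Phi(z_s)} \right] \mathbf{b}_s \mathbf{b}_s^T$ and let $\mathbf{y}$ denote a $D \times 1$ vector. Because

$$\mathbf{y}^T \mathbf{Q}_s \mathbf{y} = \alpha \, \mathbf{y}^T \mathbf{b}_s \mathbf{b}_s^T \mathbf{y} = \alpha (\mathbf{y}^T \mathbf{b}_s)^2 \geq 0,$$

where $\alpha = \frac{1}{2 \sigma_o^2} \left[ \frac{\phi^2(z_s)}{\Phi^2(z_s)} + \frac{z_s \phi(z_s)}{\Phi(z_s)} \right] > 0$, $\mathbf{Q}_s$ is positive semidefinite and then

$$\mathbf{H}_w = \frac{1}{\sigma_w^2} \mathbf{I}_D + \sum_{s=1}^{M} \mathbf{Q}_s \succ 0,$$

because the sum of a positive definite matrix and a positive semidefinite matrix is positive definite. Since the Hessian matrix of $L_1(\mathbf{w})$ w.r.t. $\mathbf{w}$ is positive definite, $L_1(\mathbf{w})$ is convex w.r.t. $\mathbf{w}$ and this completes the proof.

We use Newton's method to find the optimal $\mathbf{w}$ iteratively. At iteration $t$, the update equation is

$$\mathbf{w}(t) = \mathbf{w}(t-1) - \eta_1 \left[ \mathbf{H}_w(t-1) \right]^{-1} \left. \frac{\partial L_1}{\partial \mathbf{w}} \right|_{\mathbf{w}(t-1)} \tag{8}$$

where $\eta_1$ is the step size which is found by performing line search.

2.2.2 Learning F

Based on Eq. (4), the objective function w.r.t. $\mathbf{F}$ can be written as follows,

$$L_2(\mathbf{F}) = \frac{1}{2} \sum_{i,j} \lambda_{ij} \mathbf{f}_i^T \mathbf{f}_j - \sum_{s=1}^{M} \ln \Phi(z_s) + C_2 \tag{9}$$

where $[\lambda_{ij}]_{i,j=1}^{N} = \Sigma^{-1}$ and $C_2$ is a constant independent of $\mathbf{F}$. The gradient vector and Hessian matrix of $L_2(\mathbf{F})$ w.r.t. $\mathbf{f}_i$ are given by

$$\frac{\partial L_2}{\partial \mathbf{f}_i} = \sum_{j} \lambda_{ij} \mathbf{f}_j - \sum_{s=1}^{M} \frac{\tau_s(i)}{\sqrt{2} \sigma_o} \frac{\phi(z_s)}{\Phi(z_s)} \mathbf{w} \tag{10}$$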

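$$\frac{\partial^2 L_2}{\partial \mathbf{f}_i \, \partial \mathbf{f}_i^T} = \lambda_{ii} \mathbf{I}_D + \sum_{s=1}^{M} \frac{\tau_s(i)^2}{2 \sigma_o^2} \left[ \frac{\phi^2(z_s)}{\Phi^2(z_s)} + \frac{z_s \phi(z_s)}{\Phi(z_s)} \right] \mathbf{w} \mathbf{w}^T \triangleq \mathbf{H}_i \tag{11}$$

where $\tau_s(i)$ is an indicator function which is $+1$ if $u_s = i$, $-1$ if $v_s = i$, and $0$ otherwise. The following property holds.

Lemma 2 $L_2(\mathbf{F})$ is convex w.r.t. $\mathbf{f}_i$.

Proof The matrix $\lambda_{ii} \mathbf{I}_D$ is positive definite. Let $\Omega = \sum_{s=1}^{M} \frac{\tau_s(i)^2}{2 \sigma_o^2} \left[ \frac{\phi^2(z_s)}{\Phi^2(z_s)} + \frac{z_s \phi(z_s)}{\Phi(z_s)} \right] \mathbf{w} \mathbf{w}^T$ and let $\mathbf{y}$ denote a $D \times 1$ column vector. It is obvious that the following holds:

$$\mathbf{y}^T \Omega \mathbf{y} = \beta \, \mathbf{y}^T \mathbf{w} \mathbf{w}^T \mathbf{y} = \beta (\mathbf{y}^T \mathbf{w})^2 \geq 0,$$

where $\beta = \sum_{s=1}^{M} \frac{\tau_s(i)^2}{2 \sigma_o^2} \left[ \frac{\phi^2(z_s)}{\Phi^2(z_s)} + \frac{z_s \phi(z_s)}{\Phi(z_s)} \right] \geq 0$, with equality only when instance $i$ appears in no observed pair. Since $\Omega$ is positive semidefinite, we have

$$\mathbf{H}_i = \lambda_{ii} \mathbf{I}_D + \Omega \succ 0.$$

Since the Hessian matrix of $L_2(\mathbf{f}_i)$ w.r.t. $\mathbf{f}_i$ is positive definite, $L_2(\mathbf{f}_i)$ is convex w.r.t. $\mathbf{f}_i$ and this completes the proof.

We propose to update each row of $\mathbf{F}$ sequentially to speed up the optimization procedure. For each $\mathbf{f}_i$, the update equation for $\mathbf{f}_i$ at iteration $t$ is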

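$$\mathbf{f}_i(t) = \mathbf{f}_i(t-1) - \eta_2 \left[ \mathbf{H}_i(t-1) \right]^{-1} \left. \frac{\partial L_2}{\partial \mathbf{f}_i} \right|_{\mathbf{F}(t-1)}, \quad i = 1, \ldots, N \tag{12}$$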

where $\eta_2$ is the step size found by performing line search.

Based on the latent variables learned, we can easily compute the scores $s_i = \mathbf{f}_i^T \mathbf{w}$ and use them to predict the unobserved preference relations. The proposed algorithm is summarized in Algorithm 1.

2.3 Time complexity and convergence

In Algorithm 1, the time complexity is $O((M + D^2)D)$ for updating $\mathbf{w}$ and $O((M + D)D^2)$ for updating $\mathbf{f}_i$. As a result, the algorithm achieves time complexity $O(N(M + D)D^2)$ to update both $\mathbf{w}$ and $\mathbf{F}$ in one iteration. Because $M$ and $D$ are often very small in practice, our algorithm scales linearly with the number of distinct instances $N$. Moreover, our algorithm is not sensitive to initialization and converges very fast because each subproblem is convex.
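The convexity of both subproblems rests on the scalar factor $\phi^2(z)/\Phi^2(z) + z\phi(z)/\Phi(z)$ in Eqs. (7) and (11) being positive. The following Python sketch is a quick numerical sanity check of that factor on a grid, not a proof; the function names and the grid range are choices made here for illustration only.

```python
import math

def phi(z):
    """Standard Gaussian density."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def Phi(z):
    """Standard Gaussian CDF."""
    return 0.5 * math.erfc(-z / math.sqrt(2.0))

def curvature(z):
    """phi^2/Phi^2 + z*phi/Phi, i.e. the second derivative of -ln Phi(z)."""
    r = phi(z) / Phi(z)
    return r * (r + z)

# Evaluate on z in [-20, 20]; beyond this range plain floating point
# becomes unreliable, which is a limitation of the sketch, not of the lemma.
grid = [k / 10.0 for k in range(-200, 201)]
values = [curvature(z) for z in grid]
all_positive = all(v > 0.0 for v in values)
```

At $z = 0$ the factor equals $2/\pi$, and it stays strictly positive across the grid, which is consistent with the positive semidefinite rank-one terms used in the proofs of Lemmas 1 and 2.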

3 Related Work

The task of ranking[15], which aims to learn a function to rank a group of instances, can be considered a general form of preference learning. The main difference between preference learning and ranking is in the form of the training data. While the training data for preference learning are pairwise preferences, the training data for ranking may exist in different forms, including pairwise preferences. One promising application of ranking is called learning to rank, which has recently received a lot of attention from both the machine learning and information retrieval communities. Depending on the training data in the learning procedure, most existing algorithms for learning to rank can be divided into three major categories: individual approaches[9, 16, 17], pairwise approaches[2, 8, 18, 19], and listwise approaches[20-23]. Readers are referred to Ref. [24] for a detailed survey of learning to rank.

Another important application of ranking and preference learning is label ranking, in which each instance is associated with a total or partial order of a fixed set of labels and the goal is to predict the ranking of labels given an instance[3, 5, 25, 26].

Our model belongs to the family of latent factor models which use low-dimensional latent factors for modeling complex problems. These models have recently received a lot of attention in machine learning, for example in relational learning[27, 28] and kernel learning[29]. Most of the existing models are nonparametric in nature and they incur high computational cost despite their flexibility. Teh et al.[30] proposed a semiparametric latent factor model. However, their model deals with regression problems while our SPPL model is for preference learning.

Algorithm 1 Algorithm for SPPL
INPUT:
    X: input instances
    O: observed preference relations
PROCEDURE:
    Initialize F, w.
    repeat
        Fix F, update w using Eq. (8).
        Fix w,
        for i = 1 to N do
            update f_i using Eq. (12).
        end for
    until convergence w.r.t. the objective function in Eq. (4).
    return score vector s = Fw.
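Algorithm 1 can be sketched in Python with NumPy as follows. This is a minimal illustration under several assumptions that are not taken from the paper: the function names (`sppl_fit`, `update_w`, `update_f`, `damped_newton`) are hypothetical, the line search is plain step halving, the initialization uses small random values, a small jitter is added before inverting the covariance matrix, and the toy RBF kernel and preference pairs are made up. No safeguard is included against underflow of $\Phi$ for strongly violated pairs.

```python
import math
import numpy as np

SQRT2 = math.sqrt(2.0)
_erfc = np.vectorize(math.erfc, otypes=[float])

def Phi(z):
    """Standard Gaussian CDF, elementwise."""
    return 0.5 * _erfc(-np.asarray(z) / SQRT2)

def phi(z):
    """Standard Gaussian density, elementwise."""
    return np.exp(-0.5 * z ** 2) / math.sqrt(2.0 * math.pi)

def z_values(F, w, pairs, sigma_o):
    """z_s for every observed pair (u_s, v_s)."""
    return (F[pairs[:, 0]] - F[pairs[:, 1]]) @ w / (SQRT2 * sigma_o)

def objective(F, w, Lam, pairs, sigma_o, sigma_w):
    """Eq. (4) without the constant C_0; Lam is the inverse of Sigma."""
    with np.errstate(divide="ignore"):
        log_lik = np.sum(np.log(Phi(z_values(F, w, pairs, sigma_o))))
    return 0.5 * np.trace(Lam @ F @ F.T) + w @ w / (2 * sigma_w ** 2) - log_lik

def damped_newton(x, grad, hess, f):
    """x - eta * H^{-1} g, halving eta until f does not increase (assumed line search)."""
    direction = np.linalg.solve(hess, grad)
    f0, eta = f(x), 1.0
    while eta > 1e-8:
        cand = x - eta * direction
        if f(cand) <= f0:
            return cand
        eta *= 0.5
    return x

def update_w(F, w, Lam, pairs, sigma_o, sigma_w):
    B = F[pairs[:, 0]] - F[pairs[:, 1]]                       # rows are b_s
    z = z_values(F, w, pairs, sigma_o)
    r = phi(z) / Phi(z)
    grad = -(B.T @ r) / (SQRT2 * sigma_o) + w / sigma_w ** 2  # Eq. (6)
    alpha = (r ** 2 + z * r) / (2 * sigma_o ** 2)
    hess = np.eye(len(w)) / sigma_w ** 2 + (B.T * alpha) @ B  # Eq. (7)
    return damped_newton(w, grad, hess,
                         lambda v: objective(F, v, Lam, pairs, sigma_o, sigma_w))

def update_f(i, F, w, Lam, pairs, sigma_o, sigma_w):
    # Indicator of Eq. (10): +1 if u_s = i, -1 if v_s = i, 0 otherwise.
    tau = (pairs[:, 0] == i).astype(float) - (pairs[:, 1] == i).astype(float)
    z = z_values(F, w, pairs, sigma_o)
    r = phi(z) / Phi(z)
    grad = Lam[i] @ F - np.sum(tau * r) / (SQRT2 * sigma_o) * w  # Eq. (10)
    beta = np.sum(tau ** 2 * (r ** 2 + z * r)) / (2 * sigma_o ** 2)
    hess = Lam[i, i] * np.eye(len(w)) + beta * np.outer(w, w)    # Eq. (11)

    def f(v):
        G = F.copy()
        G[i] = v
        return objective(G, w, Lam, pairs, sigma_o, sigma_w)

    return damped_newton(F[i].copy(), grad, hess, f)

def sppl_fit(K, pairs, D=2, sigma_o=0.1, sigma_w=1.0, max_iter=100, tol=1e-8, seed=0):
    pairs = np.asarray(pairs)
    N = K.shape[0]
    Lam = np.linalg.inv(K + 1e-6 * np.eye(N))   # jitter: an added numerical safeguard
    rng = np.random.default_rng(seed)
    F = 0.1 * rng.standard_normal((N, D))       # initialization is an assumption
    w = 0.1 * rng.standard_normal(D)
    history = [objective(F, w, Lam, pairs, sigma_o, sigma_w)]
    for _ in range(max_iter):
        w = update_w(F, w, Lam, pairs, sigma_o, sigma_w)          # fix F, update w
        for i in range(N):                                        # fix w, update rows of F
            F[i] = update_f(i, F, w, Lam, pairs, sigma_o, sigma_w)
        history.append(objective(F, w, Lam, pairs, sigma_o, sigma_w))
        if history[-2] - history[-1] < tol:
            break
    return F @ w, history                        # score vector s = Fw

# Toy data: five instances on a line, each preferred to its left neighbour.
x = np.arange(5.0)
K = np.exp(-0.5 * (x[:, None] - x[None, :]) ** 2)   # illustrative RBF kernel
pairs = [(1, 0), (2, 1), (3, 2), (4, 3)]
scores, history = sppl_fit(K, pairs)
```

Here `history` records the value of Eq. (4), up to the constant, after each outer iteration. Because every accepted step is checked by the step-halving search, the recorded values do not increase, which gives a simple way to monitor the stopping criterion of Algorithm 1.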

4 Experiments

4.1 Data sets

We have conducted a set of experiments on two benchmark data sets, OHSUMED and TREC, from the LETOR 3.0 repository[31], which is publicly available for machine learning and information retrieval research.

The OHSUMED data set[31] originated from the OHSUMED document collection[32] which consists of 348 566 document records from 270 medical journals. There are in total 106 queries, each of which is associated with about 150 documents. Each document is annotated with one of three relevance levels, definitely relevant, partially relevant, and irrelevant, according to the corresponding query. In total, the data set contains 16 140 query-document pairs with the corresponding relevance labels. Each query-document pair is represented by a feature vector of 45 dimensions, the definition of which can be found in Ref. [31].

The TREC data set[31] originated from the TREC2004 web track[33] which consists of 1 053 110 HTML documents crawled from the .gov domain in January 2002. There are in total 75 queries, each of which is associated with about 1000 documents. There are two relevance levels, relevant and irrelevant, w.r.t. the query. Similarly, each query-document pair is represented by a feature vector of 64 dimensions, the definition of which can be found in Ref. [31].

4.2 Results

4.2.1 Relationship with PGP

As discussed above, SPPL is a generalization of PGP by using multiple GPs. Thus we expect SPPL to degenerate to PGP when we set $D = 1$ and fix $\mathbf{w}$. To verify model equivalence under this special setting of SPPL, we compare the error rates on test preference pairs of the two models on the same data sets. Since the goal of preference learning is to predict the preference relations of instance pairs, error rate is the most common evaluation measure and will be adopted in the following experiments.

The data sets are generated as follows. First, we randomly select 5 queries together with their associated documents from the OHSUMED data set. Then, for each query, we randomly select half of the preference pairs as a fixed test set and randomly select 10 and 20 pairs (with 10 repeats) from the remaining preference pairs as two groups of training sets. It should be noted that such experimental settings are very common not only in preference learning research but also in practice[5, 34].

We then apply SPPL ($D = 1$) and PGP on the data sets. We use the original implementation of PGP provided by the authors. The code can be downloaded from http://www.gatsby.ucl.ac.uk/chuwei/plgp.htm. For fair comparison, both SPPL ($D = 1$) and PGP use the linear kernel and the same hyperparameter value $\sigma_o = 0.01$. Because we use MAP estimation for SPPL ($D = 1$), we also use MAP estimation for PGP in our experiments. We note that in the Bayesian formulation of PGP, the hyperparameters are learned from data and hence better performance can be expected.

The average error rates as well as standard deviations, averaged over 5 queries and 10 different training sets for each query, are reported in Table 1. As we can see from Table 1, the $D = 1$ special case of SPPL is comparable with PGP in terms of the prediction accuracy although their optimization procedures are quite different. Based on the above results, we simply regard the performance of SPPL with $D = 1$ as a performance indicator of PGP with MAP estimation in the subsequent experiments.

Table 1 Equivalence between SPPL ($D = 1$) and PGP (test error rate, mean ± standard deviation).

|O|    PGP (MAP)          SPPL (D = 1)
10     0.2112 ± 0.0526    0.2072 ± 0.0524
20     0.1293 ± 0.0216    0.1180 ± 0.0579

4.2.2 Comparison with other approaches

We now compare SPPL with RankSVM (a parametric model) and PGP (a nonparametric model). The data sets used here are generated in a similar way to that described in Section 4.2.1. The hyperparameters of SPPL are set to $D = 10$, $\sigma_o = 0.01$, $\sigma_w = 0.1$ for the OHSUMED data set and $D = 15$, $\sigma_o = 0.01$, $\sigma_w = 0.1$ for the TREC data set. We perform SPPL ($D = 1$) with $\sigma_w \in \{10, 1, 0.1, 0.01\}$ on an additional data set, generated from OHSUMED, and select $\sigma_w = 0.01$ which achieves the best performance. In the following, we will not tune any parameter of SPPL. For fairness, we use the Bayesian formulation of PGP in the experiments. Besides, we tune the parameters of RankSVM to achieve the best performance. In Fig. 2, we plot the average error rates as well as standard deviations of the three models over 5 random queries and 10 random training sets for each query.

For the OHSUMED data set, Fig. 2a shows that SPPL consistently outperforms PGP, which in turn is consistently better than RankSVM. For the TREC data set, Fig. 2b shows that SPPL always outperforms PGP except when $|\mathcal{O}| = 5$, i.e., when the training set is very small. The error rates in Fig. 2 indicate that the TREC task is easier than the OHSUMED task. This may explain why RankSVM (parametric model) can achieve performance comparable to that of PGP (nonparametric model) in Fig. 2b even though PGP is more flexible than RankSVM.

[Figure: error rate versus number of training preference pairs (5, 10, 15, 20, 30) for RankSVM, PGP, and SPPL; panel (a) OHSUMED, panel (b) TREC.]

Fig. 2 Comparison of SPPL with RankSVM and PGP.

4.2.3 Effect of latent dimensionality

To examine the effect of the dimensionality of the latent factors $\mathbf{F}$, we conduct a set of experiments with latent dimensionality $D \in \{1, 5, 10, 15, 20, 25, 30\}$ while keeping the other hyperparameters fixed. The data sets used are the same as those in Section 4.2.2. The hyperparameters are set to $|\mathcal{O}| = 10$, $\sigma_o = 0.01$, $\sigma_w = 0.1$ for both data sets. The error curves, averaged over 5 random queries and 10 different training sets for each query, are plotted for OHSUMED and TREC separately in Fig. 3. From Fig. 3, we can see that as $D$ increases, the error rates for both OHSUMED and TREC decrease, and the error rate for OHSUMED decreases faster than that for TREC. Besides, the performance becomes stable when $D$ exceeds a certain value (about 10 for OHSUMED and 15 for TREC). Since a larger value of $D$ will incur higher computational cost, we should adopt a moderate $D$ value in practice.

[Figure: error rate versus number of dimensions (5 to 30) for OHSUMED and TREC.]

Fig. 3 Effect of latent dimensionality on error rate.
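The evaluation protocol used in Section 4.2 (a random half of the preference pairs held out for testing, a small number of training pairs drawn from the rest, and error rate as the fraction of mis-ordered test pairs) can be sketched in Python as follows. The sketch assumes that preference pairs are derived from graded relevance labels within one query, which the text implies but does not spell out; the function names and the toy labels are hypothetical.

```python
import random

def preference_pairs(relevance):
    """All ordered pairs (u, v) such that document u is more relevant than v."""
    n = len(relevance)
    return [(u, v) for u in range(n) for v in range(n)
            if relevance[u] > relevance[v]]

def split_pairs(pairs, n_train, rng):
    """Hold out a random half as the test set; draw n_train pairs from the rest."""
    shuffled = pairs[:]
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    test, pool = shuffled[:half], shuffled[half:]
    train = rng.sample(pool, n_train)
    return train, test

def error_rate(scores, test_pairs):
    """Fraction of test pairs (u, v) whose scores do not rank u above v."""
    wrong = sum(1 for u, v in test_pairs if not scores[u] > scores[v])
    return wrong / len(test_pairs)

# Toy query with six documents: 2 = definitely relevant, 1 = partially, 0 = irrelevant.
relevance = [2, 0, 1, 0, 2, 1]
pairs = preference_pairs(relevance)
train, test = split_pairs(pairs, n_train=5, rng=random.Random(0))
err_true_labels = error_rate(relevance, test)               # scoring by the labels
err_reversed = error_rate([-r for r in relevance], test)    # scoring in reverse
```

In the experiments the test half is fixed per query and the training draw is repeated 10 times, so a fuller version would call the sampling step repeatedly against the same test set and average the resulting error rates.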


5 Conclusions

We have proposed a semiparametric model for preference learning and have compared it empirically with state-of-the-art parametric and nonparametric models. Experiments conducted on two benchmark data sets show that our model has promising performance. In our future work, we will consider learning the hyperparameters from data by integrating out the latent variables. Moreover, we would like to apply SPPL to other ranking problems such as learning to rank.

References

[1] J. Fürnkranz and E. Hüllermeier, Preference Learning. Springer-Verlag, 2010.
[2] T. Joachims, Optimizing search engines using clickthrough data, in Proc. of SIGKDD, 2002, pp. 133-142.
[3] O. Dekel, C. D. Manning, and Y. Singer, Log-linear models for label ranking, in Proc. of NIPS, 2004, pp. 209-216.
[4] F. Aiolli and A. Sperduti, Learning preferences for multiclass problems, in Proc. of NIPS, 2005, pp. 17-24.
[5] W. Chu and Z. Ghahramani, Preference learning with Gaussian processes, in Proc. of ICML, 2005, pp. 137-144.
[6] S. Har-Peled, D. Roth, and D. Zimak, Constraint classification: A new approach to multiclass classification and ranking, in Proc. of NIPS, 2003, pp. 785-792.
[7] P. Li, C. Burges, and Q. Wu, McRank: Learning to rank using multiple classification and gradient boosting, in Proc. of NIPS, 2008, pp. 897-904.
[8] Y. Freund, R. Iyer, R. E. Schapire, and Y. Singer, An efficient boosting algorithm for combining preferences, Journal of Machine Learning Research, vol. 4, pp. 933-969, 2003.
[9] W. Chu and S. S. Keerthi, New approaches to support vector ordinal regression, in Proc. of ICML, 2005, pp. 145-152.
[10] W. Chu and Z. Ghahramani, Gaussian processes for ordinal regression, Journal of Machine Learning Research, vol. 6, pp. 1-48, 2005.
[11] B. Eric, N. D. Freitas, and A. Ghosh, Active preference learning with discrete choice data, in Proc. of NIPS, 2008, pp. 409-416.
[12] E. Bonilla, S. Guo, and S. Sanner, Gaussian process preference elicitation, in Proc. of NIPS, 2010, pp. 262-270.
[13] C. E. Rasmussen and C. K. I. Williams, Gaussian Processes for Machine Learning. The MIT Press, 2006.
[14] B. Schölkopf and A. Smola, Learning with Kernels. The MIT Press, 2002.
[15] W. W. Cohen, R. E. Schapire, and Y. Singer, Learning to order things, in Proc. of NIPS, 1998, pp. 451-457.
[16] K. Crammer and Y. Singer, Pranking with ranking, in Proc. of NIPS, 2002, pp. 641-648.
[17] A. Shashua and A. Levin, Ranking with large margin principle: Two approaches, in Proc. of NIPS, 2003, pp. 937-944.
[18] C. Burges, T. Shaked, E. Renshaw, A. Lazier, M. Deeds, N. Hamilton, and G. Hullender, Learning to rank using gradient descent, in Proc. of ICML, 2005, pp. 89-96.
[19] Y. Yue, T. Finley, F. Radlinski, and T. Joachims, A support vector method for optimizing average precision, in Proc. of SIGIR, 2007, pp. 271-278.
[20] Z. Cao, T. Qin, T.-Y. Liu, M.-F. Tsai, and H. Li, Learning to rank: From pairwise approach to listwise approach, in Proc. of ICML, 2007, pp. 129-136.
[21] F. Xia, T.-Y. Liu, J. Wang, W. Zhang, and H. Li, Listwise approach to learning to rank: Theory and algorithm, in Proc. of ICML, 2008, pp. 1192-1199.
[22] M. N. Volkovs and R. S. Zemel, Boltzrank: Learning to maximize expected ranking gain, in Proc. of ICML, 2009, pp. 1089-1096.
[23] H. Valizadegan, R. Jin, R. Zhang, and J. Mao, Learning to rank by optimizing NDCG measure, in Proc. of NIPS, 2009, pp. 1883-1891.
[24] T.-Y. Liu, Learning to rank for information retrieval, Foundations and Trends in Information Retrieval, vol. 3, no. 3, pp. 225-331, 2009.
[25] E. Hüllermeier, J. Fürnkranz, W. Cheng, and K. Brinker, Label ranking by learning pairwise preferences, Artificial Intelligence, vol. 172, no. 16-17, pp. 1897-1916, 2008.
[26] W. Cheng, J. Hühn, and E. Hüllermeier, Decision tree and instance-based learning for label ranking, in Proc. of ICML, 2009, pp. 161-168.
[27] W. Chu, V. Sindhwani, Z. Ghahramani, and S. S. Keerthi, Relational learning with Gaussian processes, in Proc. of NIPS, 2007, pp. 289-296.
[28] R. Silva, W. Chu, and Z. Ghahramani, Hidden common cause relations in relational learning, in Proc. of NIPS, 2008, pp. 1345-1352.
[29] W.-J. Li, Z. Zhang, and D.-Y. Yeung, Latent Wishart processes for relational kernel learning, in Proc. of AISTATS, 2009, pp. 336-343.
[30] Y. W. Teh, M. Seeger, and M. I. Jordan, Semiparametric latent factor models, in Proc. of AISTATS, 2005, pp. 333-340.
[31] T. Qin, T.-Y. Liu, J. Xu, and H. Li, LETOR: A benchmark collection for research on learning to rank for information retrieval, Information Retrieval, vol. 13, no. 4, pp. 346-374, 2010.
[32] W. Hersh, C. Buckley, T. J. Leone, and D. Hickam, OHSUMED: An interactive retrieval evaluation and new large test collection for research, in Proc. of SIGIR, 1994, pp. 192-201.
[33] N. Craswell and D. Hawking, Overview of the TREC-2004 web track, in NIST Special Publication, 2004, pp. 500-261.
[34] K. Kersting and Z. Xu, Learning preferences with hidden common cause relations, in Proc. of ECML/PKDD, 2009, pp. 676-691.

Yi Zhen is currently a research scientist with the College of Computing in Georgia Institute of Technology. He received his BS degree from Nankai University, MEng degree from Tsinghua University, and PhD degree from the Hong Kong University of Science and Technology in 2003, 2008, and 2012, respectively. His major research interests are in machine learning and data mining, and their applications to social and healthcare informatics.

Yangqiu Song is currently a postdoctoral researcher at University of Illinois at Urbana-Champaign. He received his BEng and PhD degrees from Tsinghua University, China in 2003 and 2009, respectively. His current research focuses on using machine learning and data mining to extract and infer insightful knowledge from big data, including the techniques of large scale learning algorithms, natural language understanding, text mining, and knowledge engineering.

Dit-Yan Yeung is a professor of computer science and engineering in the Hong Kong University of Science and Technology. He received his BEng degree in electrical engineering and MPhil degree in computer science from the University of Hong Kong and PhD degree in computer science from the University of Southern California in 1983, 1985, and 1989, respectively. His current research interests are primarily in computational and statistical approaches to machine learning and artificial intelligence as well as their application to computer vision and social computing problems.