Topographic organization of user preference patterns in collaborative filtering

Peter Tiňo
School of Computer Science, University of Birmingham, Edgbaston, Birmingham B15 2TT, UK

Gabriela Polčicová
Faculty of Electrical Engineering and Information Technology, Slovak University of Technology, Ilkovičova 3, 81219 Bratislava, Slovakia

Abstract

We introduce topographic versions of two latent class models for collaborative filtering. Topographic organization of the latent classes makes orientation in the rating/preference patterns captured by the latent classes easier and more systematic. Furthermore, since we deal with probabilistic models of the data, we can readily use tools from probability and information theory to interpret and visualize the information extracted by the model. We apply our models to a large collection of user ratings for films.
1 Introduction
When deciding which book to read, which movie to watch, or which Web page to visit, we often rely on advice given by other people [6]. To help people share their evaluations, several techniques for leveraging existing user data (in the form of ratings/profiles/logs) have been proposed. Among the most popular is collaborative filtering (CF) [4].
There are two dominant approaches to CF, namely memory-based approaches [12] and the (more principled) latent class models (LCM) presented in [6]. The underlying principle of the former approach is that, in order to recommend new items to a given user, the ratings/judgments of people in the database with "similar" interests are used. The latter approach is based on Probabilistic Latent Semantic Analysis [5]. Its main advantage is that it automatically discovers preference patterns in user profile data without suffering from the flaw of memory-based approaches: their inability to account for the fact that one person can be a reliable recommender for another person on a subset of items, but not necessarily on all possible items [6]. Latent classes in the LCM can be viewed as representing "common interest patterns" across users in the database. In order to understand the structure extracted by the LCM and learn from it, one would need to carefully examine all the latent classes, which can become a tiring and time-consuming process. One possibility to deal with this problem is presented in this paper: we introduce a topographic organization into the latent space by organizing the latent classes on an easily readable two-dimensional grid, while imposing the constraint that "similar" rating patterns should lie "close" to each other. The topographic organization enables one to quickly detect classes of interest when searching for relevant film and user rating patterns.

The paper is organized as follows: After introducing latent class models for user ratings in Section 2, we endow their latent space with a topology in Section 3. Section 4 is devoted to the formulation of the expectation-maximization algorithm for training our models. In Section 5 we apply the models to a large collection of user ratings for films. Section 6 concludes the paper by summarizing the key messages of this study.
2 Latent class models for user ratings
In this section we briefly describe the latent class approaches to modeling user ratings introduced in [6]. There are three sets we will work with: the set U of users, the set Y of films, and the set V of rating values used by users to evaluate films. In general, given a set A, we denote its cardinality by |A|. We would like to predict the rating v_{u,y} ∈ V given by a user u ∈ U to a film (in general, an item) y ∈ Y. With each triplet (u, y, v_{u,y}) we associate a latent variable (class) z_{u,y} ∈ Z = {1, 2, ..., K} that "explains" why the user u rates the film y by v_{u,y}. Triplets (u, y, v_{u,y}) are either actually observed or (in most cases) just hypothetical entities. The latent variables z ∈ Z index "abstract" classes of users in two types of models:

• Type I – given a film y, all users from class z tend to adopt the same (class-specific) rating pattern, expressed through the conditional distribution P(v|y, z) over evaluations from V. Given a user u and a film y, the probability of vote v is modeled as

P(v|y, u) = \sum_{z \in Z} P(v|y, z) P(z|u),    (1)

where P(z|u) is the probability that the user u "participates" in class z.

• Type II – all users from class z tend to adopt the same preferences over the [rating, film] pairs (v, y). Here we predict the rating in conjunction with a selection of films [6]. Given a user u, the probability of a pair (v, y) is modeled as

P(v, y|u) = \sum_{z \in Z} P(v, y|z) P(z|u).    (2)
Actually, classes z are not meant to represent a clear-cut clustering of users according to their voting preferences; rather, they express "common interest patterns" among the users found in the ratings [6]. P(z|u) then represents to what extent the user u participates in the common interest pattern z. Given a set of observation triplets, the free parameters of the model, P(z|u) and P(v|y, z), are determined by an expectation-maximization procedure outlined in [6].
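To make the two model types concrete, here is a minimal sketch of the predictive distributions (1) and (2), assuming the free parameters are stored as dense NumPy arrays; all array and function names are illustrative, not from the paper.

```python
import numpy as np

# Illustrative parameter shapes (not from the paper):
#   P_v_yz[v, y, z] = P(v | y, z)  -- type I rating patterns per class
#   P_vy_z[v, y, z] = P(v, y | z)  -- type II joint patterns per class
#   P_z_u[z, u]     = P(z | u)     -- class memberships per user

def predict_type1(P_v_yz, P_z_u, y, u):
    """Eq. (1): P(v | y, u) = sum_z P(v | y, z) P(z | u), a vector over ratings v."""
    return P_v_yz[:, y, :] @ P_z_u[:, u]

def predict_type2(P_vy_z, P_z_u, u):
    """Eq. (2): P(v, y | u) = sum_z P(v, y | z) P(z | u), a matrix over (v, y)."""
    return P_vy_z @ P_z_u[:, u]
```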
3 Introducing a topology into the latent space
While the latent classes z ∈ Z provide a concise description of "common interest patterns" across the users u ∈ U, keeping a large number |Z| = K of classes (and hence having a finer description of the data set) is inappropriate: it may be unreasonable to expect that one examines all the classes in search for relevant films and user rating patterns. One possibility to deal with this problem is to organize the classes z ∈ Z on an easily readable two-dimensional grid, while imposing the constraint that "similar" rating patterns z should lie "close" to each other. Many variants of such topographic representations of data patterns can be found in the machine learning literature. Perhaps the most famous example is the Kohonen self-organizing map (SOM) [10, 11]. More recently, statistically principled reformulations and extensions of SOMs appeared in e.g. [1, 9]. Of particular interest to us is the link between SOMs and vector quantization through noisy communication channels established in [2, 7, 13]. Briefly, in the information-theoretic interpretation of SOM, the topological organization of codebook vectors (which correspond to nodes (classes) on the latent grid) emerges through non-uniformity of the channel noise: to minimize the average quantization error at the receiving end of the communication channel, the codebook vectors that are more likely to be corrupted into each other should represent "similar" data patterns, i.e. should lie "close" to each other in the data space. Using the channel noise methodology, we endow the latent class model for user ratings (1) with a topological organization by

1. formally placing the latent variables z ∈ Z on a regular two-dimensional grid in [-1, 1]^2, and

2. introducing a "channel noise"

P(z_2|z_1) = \frac{\exp\left( -\|z_1 - z_2\|^2 / 2\sigma^2 \right)}{\sum_{z \in Z} \exp\left( -\|z_1 - z\|^2 / 2\sigma^2 \right)}.    (3)
For latent classes z_1 and z_2 lying close to each other on the grid in [-1, 1]^2 (with respect to the metric induced by the Euclidean norm ‖·‖), the probability of corrupting one into the other in the transmission process through the communication channel is high. The parameter σ > 0 determines the "specificity" of the topological neighborhood of class z_1: low values of σ correspond to sharply peaked, localized transition probabilities concentrated on close neighbors of z_1, while large values of σ induce broad neighborhoods spanning large areas of the latent grid. It will be convenient to work with two copies Z_Y and Z_Z of the latent space Z. For
each user u ∈ U we imagine that the film-conditional ratings v (type I) or pairs (v, y) (type II) are generated as follows:

1. randomly generate a class index z_Y ∈ Z_Y by sampling the user-conditional probability distribution P(·|u) on Z_Y;

2. instead of using the class z_Y to index the "common interest patterns" P(v|y, z_Y) (type I) or P(v, y|z_Y) (type II), transmit the class identification z_Y through a noisy communication channel, receiving a (possibly different) class index z_Z ∈ Z_Z with probability P(z_Z|z_Y);

3. generate a film-conditional rating v with probability P(v|y, z_Z) (type I) or a pair (v, y) with probability P(v, y|z_Z) (type II).

The models for user ratings now have the form (see eqs. (1), (2))

P(z_Z|u) = \sum_{z_Y \in Z_Y} P(z_Z|z_Y) P(z_Y|u),    (4)

P(v|y, u) = \sum_{z_Z \in Z_Z} P(v|y, z_Z) P(z_Z|u)    [type I]    (5)

P(v, y|u) = \sum_{z_Z \in Z_Z} P(v, y|z_Z) P(z_Z|u)    [type II]    (6)
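A minimal sketch of the latent grid and the channel noise (3), together with the marginalization (4), is given below; the grid size, default parameter values and all names are illustrative.

```python
import numpy as np

def make_grid(k=4):
    """Place K = k*k latent classes on a regular k x k grid in [-1, 1]^2."""
    coords = np.linspace(-1.0, 1.0, k)
    xx, yy = np.meshgrid(coords, coords)
    return np.column_stack([xx.ravel(), yy.ravel()])   # shape (K, 2)

def channel_noise(grid, sigma=0.5):
    """Eq. (3): P(z2 | z1) as a Gaussian kernel over squared grid distances,
    normalized over the received class z2 for every sent class z1."""
    sq_dist = ((grid[:, None, :] - grid[None, :, :]) ** 2).sum(axis=-1)
    kernel = np.exp(-sq_dist / (2.0 * sigma ** 2))
    return kernel / kernel.sum(axis=1, keepdims=True)  # rows z1, columns z2

# Eq. (4): P(zZ | u) = sum_{zY} P(zZ | zY) P(zY | u), for all users at once,
# with P_zY_u of shape (K, n_users):
#   P_zZ_u = channel_noise(make_grid()).T @ P_zY_u
```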
4 Parameter estimation
Given a set D = {(u_1, y_1, v_1), ..., (u_N, y_N, v_N)} of N observation triplets (u, y, v_{u,y}), the log-likelihood of the data D is

L = \sum_{u} \sum_{y \in Y_u} \log P(v_{u,y}|y, u)    [type I]    (7)

L = \sum_{u} \sum_{y \in Y_u} \log P(v_{u,y}, y|u)    [type II]    (8)

where Y_u is the set of films evaluated by the user u,

Y_u = \{y \in Y \mid (u, y, v_{u,y}) \in D\}.    (9)
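As a sketch, the type I log-likelihood (7) can be evaluated directly from the observed triplets; D, P_v_yz and P_zZ_u are the illustrative structures used in the earlier sketches.

```python
import numpy as np

def log_likelihood_type1(D, P_v_yz, P_zZ_u):
    """Eq. (7), using the topographic predictive distribution (5):
    L = sum over triplets (u, y, v) of log sum_{zZ} P(v | y, zZ) P(zZ | u)."""
    return sum(np.log(P_v_yz[v, y, :] @ P_zZ_u[:, u]) for u, y, v in D)
```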
Following [6], we denote by ρ(v, y, z_Z) the probabilities P(v|y, z_Z) and P(v, y|z_Z) in type I and type II models, respectively.
The free parameters of the model, P(z_Y|u) and ρ(v, y, z_Z), are fitted to the data D by maximizing the likelihood L via the Expectation-Maximization (EM) algorithm [3, 14]. The EM algorithm is a standard method for maximum likelihood estimation in latent variable models. It consists of repeated alternation of two steps until a convergence criterion is satisfied. In the E-step, the expected values of the unobserved (latent) variables are computed given the current parameter values:

P(z_Y | y, u, v) = \frac{P(z_Y|u) \sum_{z_Z} \rho(v, y, z_Z) P(z_Z|z_Y)}{\sum_{z'_Y} P(z'_Y|u) \sum_{z_Z} \rho(v, y, z_Z) P(z_Z|z'_Y)},    (10)

P(z_Z | y, u, v) = \frac{\rho(v, y, z_Z) \sum_{z_Y} P(z_Z|z_Y) P(z_Y|u)}{\sum_{z'_Z} \rho(v, y, z'_Z) \sum_{z_Y} P(z'_Z|z_Y) P(z_Y|u)}.    (11)
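A sketch of the E-step for a single observed triplet (u, y, v): rather than evaluating (10) and (11) separately, one can form the joint posterior over (z_Y, z_Z) and marginalize, which yields exactly those two quantities. The array names (rho, noise, P_zY_u) are illustrative.

```python
import numpy as np

def e_step(rho, noise, P_zY_u, u, y, v):
    """Posteriors (10) and (11) for one triplet (u, y, v).
    rho[v, y, zZ] = rho(v, y, zZ); noise[zY, zZ] = P(zZ | zY);
    P_zY_u[zY, u] = P(zY | u)."""
    # Joint posterior over (zY, zZ), proportional to P(zY|u) P(zZ|zY) rho(v,y,zZ).
    joint = P_zY_u[:, u, None] * noise * rho[v, y, None, :]
    joint /= joint.sum()
    post_zY = joint.sum(axis=1)   # eq. (10): marginalize out zZ
    post_zZ = joint.sum(axis=0)   # eq. (11): marginalize out zY
    return post_zY, post_zZ
```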
In the M-step, we re-estimate the free parameters of the model by maximizing the expected complete-data log-likelihood evaluated in the E-step:

\rho(v, y, z_Z) = P(v|y, z_Z) = \frac{\sum_{u \in U_{v,y}} P(z_Z | y, u, v)}{\sum_{v'} \sum_{u \in U_{v',y}} P(z_Z | y, u, v')}    [type I]    (12)

\rho(v, y, z_Z) = P(v, y|z_Z) = \frac{\sum_{u \in U_{v,y}} P(z_Z | y, u, v)}{\sum_{v',y'} \sum_{u \in U_{v',y'}} P(z_Z | y', u, v')}    [type II]    (13)

where U_{v,y} is the set of users that evaluated film y with rating v,

U_{v,y} = \{u \in U \mid (u, y, v) \in D\}.    (14)
Recall that |Y_u| denotes the number of films evaluated by the user u. The class memberships are estimated as

P(z_Y|u) = \frac{\sum_{y \in Y_u} P(z_Y | y, u, v_{u,y})}{|Y_u|}.    (15)
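A corresponding sketch of the M-step for the type I model accumulates the E-step posteriors over all triplets in D (eqs. (12) and (15)); the shapes and names below are again illustrative.

```python
import numpy as np

def m_step_type1(D, post_zZ, post_zY, n_users, n_films, n_ratings, K):
    """post_zZ[n], post_zY[n]: E-step posteriors for the n-th triplet in D."""
    rho = np.zeros((n_ratings, n_films, K))
    P_zY_u = np.zeros((K, n_users))
    films_per_user = np.zeros(n_users)
    for n, (u, y, v) in enumerate(D):
        rho[v, y, :] += post_zZ[n]      # numerator of eq. (12)
        P_zY_u[:, u] += post_zY[n]      # numerator of eq. (15)
        films_per_user[u] += 1.0        # |Y_u|
    # Eq. (12): normalize over ratings v for each (y, zZ).
    rho /= np.maximum(rho.sum(axis=0, keepdims=True), 1e-12)
    # Eq. (15): divide each user's accumulated posteriors by |Y_u|.
    P_zY_u /= np.maximum(films_per_user, 1.0)
    return rho, P_zY_u
```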
A detailed derivation of the update equations can be found in the Appendix. The EM algorithm is initialized in the following way: First, the data D is transformed into a full evaluation matrix D̃ by treating each non-vote as a default vote v_0. (This step is needed since typically not all users vote on all films.) In effect, for initialization purposes, we impose a new evaluation set Ṽ = V ∪ {v_0}. We train on D̃ a SOM sharing the neural field topology with our topographic latent class model. Then,
the user-conditional latent priors P(z|u) are estimated from the trained SOM by consulting the memberships of users u in the clusters defined by the nodes z ∈ Z = {1, 2, ..., K} of the SOM. The binary memberships are further "softened" by the following transformation: P(z|u) = A if the user u belongs to the cluster defined by the node z of the SOM; P(z|u) = (1 − A)/(K − 1) otherwise. The parameter A is set by postulating that if u belongs to the cluster z, P(z|u) should be B > 1 times higher than P(z'|u) for all the other z'. Hence, A = B/(K − 1 + B). Having trained the SOM, we return to the original set D. Rating patterns ρ(v, y, z) = P(v|y, z) in type I models are estimated by calculating, for each film y and latent class z, the empirical distribution of ratings for film y by the users belonging to the SOM class z. Due to data sparseness, we smooth the empirical estimates by applying the Laplace correction (see e.g. [8]):

P(v|y, z) = \frac{N(v, y, z) + m}{m|V| + \sum_{v' \in V} N(v', y, z)},    (16)
where m is a positive number (usually m = 1 or m = |V|^{-1}; here we use the latter choice) and N(v, y, z) is the number of times in the data set D that users belonging to the SOM center z rated the film y by v. In the Dirichlet prior interpretation of the multinomial distribution P(v|y, z), the parameter m can be viewed as the effective number of times each rating value v ∈ V was used to rate the film y by the collection of users from class z prior to the evaluations collected in our data set D. In models of type II, ρ(v, y, z) = P(v, y|z) is estimated by calculating, for each latent class z, the empirical distribution of [rating, film] pairs (v, y) across the users belonging to the SOM class z. The Laplace correction now reads:

P(v, y|z) = \frac{N(v, y, z) + m}{m|V||Y| + \sum_{v' \in V} \sum_{y' \in Y} N(v', y', z)}.    (17)
The parameter m (a suggested value is now (|V||Y|)^{-1}) can be considered the effective number of times each [rating, film] pair was considered by the collection of users from class z prior to creating the data set D. The EM algorithm converges to a local maximum of the likelihood surface and can be quite sensitive to initialization. Self-organizing maps are involved in initiating the EM updates because pure vector quantization, which does not reflect the latent space channel noise in the neural field topology of the quantization centers, would be a suboptimal initialization.
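The two initialization steps described above can be sketched as follows: softening the binary SOM memberships into P(z|u), and the Laplace-corrected estimate (16) of the type I rating patterns. The inputs som_class (the SOM node of each user) and counts (the array N(v, y, z)) are illustrative.

```python
import numpy as np

def init_class_priors(som_class, K, B=5.0):
    """Softened memberships: P(z|u) = A for u's own SOM cluster and
    (1 - A)/(K - 1) otherwise, with A = B / (K - 1 + B)."""
    n_users = len(som_class)
    A = B / (K - 1 + B)
    P_z_u = np.full((K, n_users), (1.0 - A) / (K - 1))
    P_z_u[som_class, np.arange(n_users)] = A
    return P_z_u

def laplace_type1(counts, m=None):
    """Eq. (16): counts[v, y, z] = N(v, y, z); by default m = |V|^{-1}."""
    n_ratings = counts.shape[0]
    if m is None:
        m = 1.0 / n_ratings
    return (counts + m) / (m * n_ratings + counts.sum(axis=0, keepdims=True))
```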
5 Experiments
In this section we demonstrate the topographic versions of the latent class models for user ratings. The purpose of this section is to sketch the possibilities of using our models to "dig out" and understand user rating and interest patterns hidden in a large database. We used a 4×4 regular latent grid in [-1, 1]^2 of K = 16 "abstract" classes z. To initialize the EM algorithm, we set the parameter B to 5. The spread parameter σ of the channel noise (3) was set to 0.5.
5.1 Data
We experimented with the publicly available EachMovie dataset (http://www.research.compaq.com/SRC/eachmovie/) containing ratings for movies. The data set contains ratings by 61,265 users for 1,623 movies. Ratings are on the scale V = {1, 2, ..., 6}. The default non-vote rating was set to v_0 = 0. We selected the set of the 100 most rated movies. The number of users that rated at least one movie from the selected set was 60,895. To describe the latent organization developed by the models, it is convenient to label classes on the grid according to their position, such that the top left class is [1, 1].
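A sketch of this preprocessing step, assuming the ratings have already been parsed into (user, movie, rating) triplets; the parsing of the raw EachMovie files is not shown.

```python
from collections import Counter

def select_most_rated(triplets, top_n=100):
    """Keep only the ratings for the top_n most rated movies; the users who
    rated at least one of them are implicit in the remaining triplets."""
    counts = Counter(y for _, y, _ in triplets)
    keep = {y for y, _ in counts.most_common(top_n)}
    return [(u, y, v) for u, y, v in triplets if y in keep]
```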
5.2 Model of type II
Table 1 shows the 5 most probable movies for each class (grid point), i.e. the movies with the largest P(y|z) = \sum_v P(v, y|z). We observe that rating patterns in the database impose an organization of the latent classes such that movies associated with the same grid point are in the same or similar genres (one movie is usually categorized in more than one genre). For example, movies associated with the top left ([1,1]) grid point are action thrillers, and movies associated with the bottom right ([4,4]) grid point are romantic comedies. The top right movies (grid point [1,4]) can be classified as adventure and drama movies, and movies associated with the bottom left ([4,1]) grid point are action crime films.
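A sketch of how a table like Table 1 can be produced from a fitted type II model: rank the films within each class by P(y|z) = \sum_v P(v, y|z). P_vy_z is the illustrative array used in the earlier sketches.

```python
import numpy as np

def top_films_per_class(P_vy_z, top_n=5):
    """Return, for each latent class z, the indices of the top_n films
    with the largest P(y | z) = sum_v P(v, y | z)."""
    P_y_z = P_vy_z.sum(axis=0)                    # shape (n_films, K)
    return np.argsort(-P_y_z, axis=0)[:top_n].T   # shape (K, top_n)
```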
Figure 1(a) shows the normalized entropies

H[P(y|z)] = -\sum_{y \in Y} P(y|z) \log_{|Y|} P(y|z)

of the class-conditional film distributions P(y|z). Grid points on the right-hand side of the grid have higher entropy than those on the left-hand side. It follows that the choice of movies from romantic dramas and comedies is in general more "ambiguous" than the choice from action thrillers and crime films. This reveals a tendency of the users to be slightly more picky when it comes to action films. Figures 1(b) and 1(c) show the modes and means of P(v|z), respectively. There is a tendency towards more positive (higher) ratings in classes organized along the edges of the latent grid. This contrasts with the more negative (lower) ratings in classes from the grid interior. In contrast to H[P(y|z)], the normalized entropies

H[P(v|z)] = -\sum_{v \in V} P(v|z) \log_{|V|} P(v|z)

of P(v|z) = \sum_y P(v, y|z) for classes on the right-hand side of the grid are smaller than those on the left-hand side (see Figure 1(d)). This means that rating recommendations for action thrillers and crime films are less specific than those for romantic dramas and comedies. In other words, as far as the "quality" of films (judged by user ratings) is concerned, people overall tend to have more crisp opinions about romantic films than about action productions. However, as seen in Figure 1(a), when it comes to the choice of individual films, the users appear to be more selective on action films. The exception is grid point [3,3], which "explains" crime films, dramas and mysteries.
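A sketch of the normalized entropy used in Figure 1, with the logarithm taken to the base of the support size so that values fall in [0, 1].

```python
import numpy as np

def normalized_entropy(p, eps=1e-12):
    """H[p] = -sum_i p_i log_n p_i for a distribution p over n outcomes."""
    p = np.asarray(p, dtype=float)
    p = p / p.sum()
    return float(-(p * np.log(p + eps)).sum() / np.log(len(p)))
```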
5.3 Models of type I
In contrast to type II models, models of type I are not appropriate for visualization of P(y|z) via tables of most probable movies. Type I models are more suitable for highly interactive and selective inspection of recommendation patterns in the database. They are also appropriate in cases where P(y|u) is an interesting quantity to model, irrespective of the actual rating v_{u,y} given to y by u [6]. For example, P(y|u) may model the fact that not all movies are equally available to all users.
Figure 1: Model of type II. Shown are the entropies H[P(y|z)] (a) and H[P(v|z)] (d), as well as the modes (b) and means (c) of the class-conditional rating distributions P(v|z).

Here we demonstrate that "similar" movies y* have similar rating distributions P(v|y*, z). As an example we take the movies Ghost and Pretty Woman, both romantic comedies. The means of P(v|y*, z) for the two films are shown in Figures 2(a) and (b). Very different rating distributions P(v|y*, z) are obtained for the crime horror movie The Silence of the Lambs (see Figure 2(c)). However, the rating patterns in Figure 2(c) are close to those for the action thriller Pulp Fiction in Figure 2(d). Users concentrated in class [1,3] seem to be positively tuned towards comedies and to dislike scary films, while users in class [4,4] seem to have the opposite appetite. Class [1,4] represents an interesting mixture of users with less crisp attitudes towards comedies and thrillers.
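The per-class means plotted in Figure 2 can be computed, as a sketch, from the type I rating patterns; P_v_yz and rating_values are illustrative names.

```python
import numpy as np

def mean_ratings_for_film(P_v_yz, y, rating_values=np.arange(1, 7)):
    """Mean of P(v | y, z) for film y, one value per latent class z."""
    return rating_values @ P_v_yz[:, y, :]   # shape (K,)
```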
[1,1] The Rock; Independence Day; Mission: Impossible; Eraser; The Nutty Professor
[1,2] 12 Monkeys; Mr. Holland's Opus; The Birdcage; Broken Arrow; Leaving Las Vegas
[1,3] Goldeneye; Beauty and the Beast; Jumanji; The American President; Babe
[1,4] Jurassic Park; Forrest Gump; Speed; The Firm; The Lion King
[2,1] Toy Story; Grumpier Old Men; Sabrina; Father of the Bride Part II; Star Wars
[2,2] Fargo; Sense and Sensibility; Dead Man Walking; Leaving Las Vegas; The Birdcage
[2,3] Get Shorty; The Usual Suspects; Seven; Clueless; Quiz Show
[2,4] Braveheart; Interview with the Vampire; Outbreak; Waterworld; Addams Family Values
[3,1] Batman (1989); Apollo 13; True Lies; Dances With Wolves; Pulp Fiction
[3,2] Dances With Wolves; Ace Ventura: Pet Detective; Batman (1989); Apollo 13; Pulp Fiction
[3,3] The Usual Suspects; The Piano; Quiz Show; Seven; Schindler's List
[3,4] Legends of the Fall; Disclosure; Clueless; Nell; French Kiss
[4,1] Die Hard: With a Vengeance; Ace Ventura: Pet Detective; Batman Forever; Clear and Present Danger; Dances With Wolves
[4,2] Beauty and the Beast; Aladdin; The Shawshank Redemption; Batman Forever; Star Trek: Generations
[4,3] Jurassic Park; Forrest Gump; Terminator 2: Judgment Day; The Lion King; The Fugitive
[4,4] Sleepless in Seattle; Pretty Woman; Four Weddings and a Funeral; Ghost; In the Line of Fire

Table 1: The 5 most probable movies for each latent class (type II), labeled [row, column] on the 4×4 grid.
Figure 2: Model of type I. Shown are the means of the class-and-film-conditional rating distributions P(v|y*, z) for the following films y*: Ghost (a), Pretty Woman (b), The Silence of the Lambs (c) and Pulp Fiction (d).
6 Conclusion
We have developed topographic versions of two latent class models for collaborative filtering. They offer an opportunity to inspect preference patterns in rating databases in a variety of ways. Since our system is a probabilistic model of the data, we can readily use tools from probability and information theory to interpret and visualize the trends captured by the abstract latent classes. In this respect, we have by no means exhausted the full range of choices. For example, besides moments, modes and entropies, we could use mutual information to detect dependencies between ratings and films in the type II model. Topographic organization of the latent classes makes orientation in those trends easier and more systematic. We demonstrated the system on a large collection of user ratings for films.

There are many possible ways of extending our system. For example, the probabilistic character of the model allows for a principled construction of a hierarchy of such models (see e.g. [15]), where we would be able to interactively "zoom in" on interesting user groups and rating patterns. This is a matter for future work.
References

[1] C.M. Bishop, M. Svensén, and C.K.I. Williams. GTM: The generative topographic mapping. Neural Computation, 10(1):215–235, 1998.
[2] J. Buhmann and H. Kühnel. Vector quantization with complexity costs. IEEE Transactions on Information Theory, 39(4):1133–1145, 1993.
[3] A.P. Dempster, N.M. Laird, and D.B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society B, 39(1):1–38, 1977.
[4] D. Goldberg, D. Nichols, B. Oki, and D. Terry. Using collaborative filtering to weave an information tapestry. Communications of the ACM, 35(12):61–70, 1992.
[5] T. Hofmann. Probabilistic latent semantic analysis. In Proceedings of the 15th Conference on Uncertainty in Artificial Intelligence, pages 289–296, 1999.
[6] T. Hofmann. Learning what people (don't) want. In L. De Raedt and P. Flach, editors, 12th European Conference on Machine Learning (ECML), pages 214–225. Springer, 2001.
[7] T. Hofmann and J.M. Buhmann. Competitive learning algorithms for robust vector quantization. IEEE Transactions on Signal Processing, 46(6):1665–1675, 1998.
[8] F. Jelinek. Statistical Methods for Speech Recognition. MIT Press, Cambridge, MA, 1998.
[9] A. Kabán and M. Girolami. A combined latent class and trait model for the analysis and visualization of discrete data. IEEE Transactions on Pattern Analysis and Machine Intelligence, 23(8):859–872, 2001.
[10] T. Kohonen. Self-organizing formation of topologically correct feature maps. Biological Cybernetics, 43:59–69, 1982.
[11] T. Kohonen. The self-organizing map. Proceedings of the IEEE, 78(9):1464–1479, 1990.
[12] J. Konstan, B. Miller, D. Maltz, J. Herlocker, L. Gordon, and J. Riedl. GroupLens: Applying collaborative filtering to Usenet news. Communications of the ACM, 40(3):77–87, 1997.
[13] S.P. Luttrell. Hierarchical vector quantization. IEE Proceedings I, 136(6):405–413, 1989.
[14] R.M. Neal and G.E. Hinton. A view of the EM algorithm that justifies incremental, sparse, and other variants. In M.I. Jordan, editor, Learning in Graphical Models, pages 355–368. Kluwer Academic Publishers, Dordrecht, 1998.
[15] P. Tiňo and I. Nabney. Hierarchical GTM: Constructing localized non-linear projection manifolds in a principled way. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24(5):639–656, 2002.
Appendix

To derive the update equations for the model free parameters P(z_Y|u) and ρ(v, y, z_Z) in the Expectation-Maximization framework, we introduce (latent) indicator variables \delta^{u,y,v}_{z_Y,z_Z} for couples of latent classes (z_Y, z_Z) ∈ Z_Y × Z_Z: \delta^{u,y,v}_{z_Y,z_Z} = 1 if, in the process of generating the rating v for the film y by the user u, the latent class z_Y was first selected and then transmitted through the noisy channel as the class z_Z; otherwise \delta^{u,y,v}_{z_Y,z_Z} = 0. The complete-data log-likelihood reads:

L_C = \sum_{u \in U} \sum_{y \in Y_u} \sum_{z_Y \in Z_Y} \sum_{z_Z \in Z_Z} \delta^{u,y,v}_{z_Y,z_Z} \log\{\rho(v_{u,y}, y, z_Z) P(z_Z|z_Y) P(z_Y|u)\}.    (18)
It will be more convenient to represent L_C as

L_C = \sum_{n=1}^{N} \sum_{u \in U} \Delta^u_{u_n} \sum_{y \in Y} \Delta^y_{y_n} \sum_{v \in V} \Delta^v_{v_n} \sum_{z_Y \in Z_Y} \sum_{z_Z \in Z_Z} \delta^{u,y,v}_{z_Y,z_Z} \log\{\rho(v, y, z_Z) P(z_Z|z_Y) P(z_Y|u)\},    (19)

where \Delta^\alpha_a = 1 if and only if a = \alpha, otherwise \Delta^\alpha_a = 0. We also write

\Delta^{\alpha_1,...,\alpha_m}_{a_1,...,a_m} = \prod_{i=1}^{m} \Delta^{\alpha_i}_{a_i},

so that

L_C = \sum_{n=1}^{N} \sum_{u,y,v} \Delta^{u_n,y_n,v_n}_{u,y,v} \sum_{z_Y,z_Z} \delta^{u,y,v}_{z_Y,z_Z} \log\{\rho(v, y, z_Z) P(z_Z|z_Y) P(z_Y|u)\}.    (20)
M-step

The M-step equations maximize the expected complete-data log-likelihood (the expectation taken with respect to the posterior distribution of the hidden variables \delta^{u,y,v}_{z_Y,z_Z} on Z_Y × Z_Z), extended with Lagrange multiplier terms to account for proper normalization:

\langle L_C \rangle = \sum_{n=1}^{N} \sum_{u,y,v} \Delta^{u_n,y_n,v_n}_{u,y,v} \sum_{z_Y,z_Z} P(z_Y, z_Z | y, u, v) \{\log \rho(v, y, z_Z) + \log P(z_Z|z_Y) + \log P(z_Y|u)\} + \sum_u \lambda_u \left( \sum_{z_Y} P(z_Y|u) - 1 \right) + T(\rho),    (21)
where

T(\rho) = \sum_{y,z_Z} \lambda_{y,z_Z} \left( \sum_v P(v|y, z_Z) - 1 \right)    [type I]

T(\rho) = \sum_{z_Z} \lambda_{z_Z} \left( \sum_{v,y} P(v, y|z_Z) - 1 \right)    [type II].    (22)
By setting \partial \langle L_C \rangle / \partial \rho(v, y, z_Z) = 0, determining the corresponding Lagrange multiplier, and realizing that \sum_{z_Y} P(z_Y, z_Z | y, u, v) = P(z_Z | y, u, v), we arrive at the update equation for ρ(v, y, z_Z):

\rho(v, y, z_Z) = P(v|y, z_Z) = \frac{\sum_{u \in U_{v,y}} P(z_Z | y, u, v)}{\sum_{v'} \sum_{u \in U_{v',y}} P(z_Z | y, u, v')}    [type I]    (23)

\rho(v, y, z_Z) = P(v, y|z_Z) = \frac{\sum_{u \in U_{v,y}} P(z_Z | y, u, v)}{\sum_{v',y'} \sum_{u \in U_{v',y'}} P(z_Z | y', u, v')}    [type II]    (24)

where U_{v,y} is the set of users that evaluated film y with rating v,

U_{v,y} = \{u \in U \mid (u, y, v) \in D\}.    (25)
The update equation for P(z_Y|u) is derived by setting \partial \langle L_C \rangle / \partial P(z_Y|u) = 0, determining the Lagrange multiplier λ_u, and realizing that \sum_{z_Z} P(z_Y, z_Z | y, u, v_{u,y}) = P(z_Y | y, u, v_{u,y}) and \sum_{y \in Y_u} \sum_{z'_Y} P(z'_Y | y, u, v) = |Y_u|, where |Y_u| is the number of films evaluated by the user u:

P(z_Y|u) = \frac{\sum_{y \in Y_u} P(z_Y | y, u, v_{u,y})}{|Y_u|}.    (26)
E-step for type I models

We estimate

P(z_Y | y, u, v) = \frac{P(v| y, u, z_Y) P(z_Y | y, u)}{\sum_{z'_Y} P(v| y, u, z'_Y) P(z'_Y | y, u)}.

By the model assumptions, P(z_Y | y, u) = P(z_Y|u) and

P(v| y, u, z_Y) = \sum_{z_Z} P(v, z_Z | y, u, z_Y) = \sum_{z_Z} P(v| y, u, z_Y, z_Z) P(z_Z | y, u, z_Y) = \sum_{z_Z} P(v| y, z_Z) P(z_Z|z_Y),

and so

P(z_Y | y, u, v) = \frac{P(z_Y|u) \sum_{z_Z} \rho(v, y, z_Z) P(z_Z|z_Y)}{\sum_{z'_Y} P(z'_Y|u) \sum_{z_Z} \rho(v, y, z_Z) P(z_Z|z'_Y)}.    (27)

Analogously,

P(z_Z | y, u, v) = \frac{P(v| y, u, z_Z) P(z_Z | y, u)}{\sum_{z'_Z} P(v| y, u, z'_Z) P(z'_Z | y, u)},

where P(v| y, u, z_Z) = P(v| y, z_Z) and

P(z_Z | y, u) = \sum_{z_Y} P(z_Z, z_Y | y, u) = \sum_{z_Y} P(z_Z | z_Y, y, u) P(z_Y | y, u) = \sum_{z_Y} P(z_Z|z_Y) P(z_Y|u).

Hence,

P(z_Z | y, u, v) = \frac{\rho(v, y, z_Z) \sum_{z_Y} P(z_Z|z_Y) P(z_Y|u)}{\sum_{z'_Z} \rho(v, y, z'_Z) \sum_{z_Y} P(z'_Z|z_Y) P(z_Y|u)}.    (28)
E-step for type II models

We write

P(z_Y | y, u, v) = \frac{P(v, y| u, z_Y) P(z_Y | u)}{\sum_{z'_Y} P(v, y| u, z'_Y) P(z'_Y | u)},

where

P(v, y| u, z_Y) = \sum_{z_Z} P(v, y, z_Z | u, z_Y) = \sum_{z_Z} P(v, y| u, z_Y, z_Z) P(z_Z | u, z_Y) = \sum_{z_Z} P(v, y| z_Z) P(z_Z|z_Y),

and so P(z_Y | y, u, v) is updated as prescribed by eq. (27). Analogously,

P(z_Z | y, u, v) = \frac{P(v, y| u, z_Z) P(z_Z | u)}{\sum_{z'_Z} P(v, y| u, z'_Z) P(z'_Z | u)},

where P(v, y| u, z_Z) = P(v, y| z_Z) and

P(z_Z | u) = \sum_{z_Y} P(z_Z, z_Y | u) = \sum_{z_Y} P(z_Z | z_Y, u) P(z_Y | u) = \sum_{z_Y} P(z_Z|z_Y) P(z_Y|u).

Hence, (28) is the update equation for P(z_Z | y, u, v).