Bayesian Rose Trees
Yee Whye Teh, Charles Blundell and Katherine Heller
Gatsby Computational Neuroscience Unit, University College London
March 2010, Edinburgh
1 / 25
Representational Structures
- Clustering.
- Hierarchical representations or trees.
- Overlapping clusters.
- Low dimensional embeddings.
- Distributed representations or latent variable models.
2 / 25
Hierarchical Clustering
- Linkage algorithms.
- Maximum likelihood, MAP, maximum parsimony [Vinokourov and Girolami 2000, Segal and Koller 2002, Friedman 2003].
- Bayesian models [Williams 2000, Neal 2003, Teh et al. 2008].
- Bayesian hierarchical clustering (BHC) [Heller and Ghahramani 2005].
3 / 25
Non-binary Hierarchical Clusterings
[Figures (two overlays): hierarchical clusterings of data items (rows) by features (columns); colour scales from -250 to 0 and from -180 to 0.]
4 / 25
Bayesian Rose Trees
- Allows non-binary trees when the data support them.
- Computationally efficient.
- Likelihood-based, probabilistic approach.
- The most likely tree should offer a simple explanation of the data.
5 / 25
Tree-Consistent Partitions

[Figure: a rose tree over the leaves a, b, c, d, e, f; its internal nodes correspond to the clusters a b c and a b c d e f, giving the partitions a b c|d|e|f and a|b|c|d|e|f.]

An internal node means: the data at its leaves are more similar. Each internal node denotes:
1. a cluster of its leaves;
2. its children, which further partition the cluster into smaller subclusters.
A Bayesian rose tree represents a set of partitions of the data (enumerated by the sketch below):
part(T) = {leaves(T)} ∪ { e_1 | e_2 | e_3 | ⋯ : T_k ∈ ch(T), e_k ∈ part(T_k) }
[Heller and Ghahramani 2005]
6 / 25
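As a concrete illustration of the recursive definition of part(T), here is a minimal Python sketch (not from the slides) that enumerates the tree-consistent partitions of a rose tree; the nested-dict tree layout and the `leaf`/`node` constructors are assumptions made for the example.

```python
from itertools import product

def leaf_labels(tree):
    """Labels of the leaves under this node."""
    if not tree["children"]:
        return [tree["label"]]
    return [l for c in tree["children"] for l in leaf_labels(c)]

def partitions(tree):
    """part(T) = {leaves(T)} ∪ {e1|e2|... : Tk ∈ ch(T), ek ∈ part(Tk)}.
    A partition is a frozenset of blocks; a block is a frozenset of labels."""
    whole = frozenset([frozenset(leaf_labels(tree))])
    if not tree["children"]:
        return [whole]
    parts = [whole]
    # one partition for every choice of a partition from each child
    for combo in product(*(partitions(c) for c in tree["children"])):
        parts.append(frozenset().union(*combo))
    return parts

leaf = lambda x: {"label": x, "children": []}
node = lambda *cs: {"label": None, "children": list(cs)}

# The slide's example: a root with children (a b c), d, e, f.
T = node(node(leaf("a"), leaf("b"), leaf("c")), leaf("d"), leaf("e"), leaf("f"))
for P in partitions(T):
    print(" | ".join("".join(sorted(b)) for b in P))
```

For the slide's tree this prints exactly the three partitions listed above: a b c d e f, a b c|d|e|f and a|b|c|d|e|f.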
Likelihood of Clusters, Partitions and Trees

Cluster: a b c | d | e | f
A cluster is a set of data items. We use an exponential family distribution to model the cluster:
p(D|θ) = exp( θ^T ∑_{x∈D} s(x) − |D| A(θ) )
Using a conjugate prior for θ, we can marginalize out θ:
q(D) = ∫ p(D|θ) p(θ) dθ
Example: product of Beta-Bernoullis:
q(D) = ∏_{i=1}^d p(D_i | α_i, β_i) = ∏_{i=1}^d Beta(α_i + n_i^D, β_i + N^D − n_i^D) / Beta(α_i, β_i)
7 / 25
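A minimal sketch of the Beta-Bernoulli cluster score above, assuming binary feature vectors and scipy for the log-Beta function; the symmetric default α = β = 1 is an illustrative choice, not something specified on the slide.

```python
import numpy as np
from scipy.special import betaln

def log_q(D, alpha=1.0, beta=1.0):
    """log q(D) for a cluster D (rows = items, columns = binary features):
    q(D) = prod_i Beta(alpha_i + n_i, beta_i + N - n_i) / Beta(alpha_i, beta_i),
    with theta integrated out under independent Beta priors."""
    D = np.asarray(D)
    N = D.shape[0]          # number of items in the cluster
    n = D.sum(axis=0)       # per-feature counts of ones
    return float(np.sum(betaln(alpha + n, beta + N - n) - betaln(alpha, beta)))

print(log_q([[1, 0, 1], [1, 1, 1], [1, 0, 0]]))   # a cluster of three items
```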
Likelihood of Clusters, Partitions and Trees

Partition: a b c | d | e | f
A partition is a separation of the data set into clusters. We model each cluster independently, so the likelihood of a partition is:
r({D_1 | D_2 | …}) = ∏_j q(D_j)
Example: r(a b c | d | e | f) = q(a b c) q(d) q(e) q(f)
8 / 25
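The partition likelihood is then just a product of per-cluster scores; a tiny sketch, taking any cluster score (for instance the `log_q` sketched above) as an argument:

```python
def log_r(blocks, log_q):
    """log r(D1 | D2 | ...) = sum_j log q(Dj): clusters are modelled independently."""
    return sum(log_q(Dj) for Dj in blocks)

# e.g. log_r([D_abc, D_d, D_e, D_f], log_q) corresponds to r(a b c|d|e|f) = q(a b c) q(d) q(e) q(f)
```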
Likelihood of Clusters, Partitions and Trees

Tree: {a b c d e f, a b c|d|e|f, a|b|c|d|e|f}
A tree is treated as a mixture of partitions. The likelihood of a tree will be a convex combination of partition likelihoods:
s(T) = ∑_{P ∈ part(T)} m_T(P) r(P)
Example:
s(T) = m_T(a b c d e f) r(a b c d e f) + m_T(a b c|d|e|f) r(a b c|d|e|f) + m_T(a|b|c|d|e|f) r(a|b|c|d|e|f)
9 / 25
Likelihood of Clusters, Partitions and Trees

Tree: {a b c d e f, a b c|d|e|f, a|b|c|d|e|f}
To make computations tractable, we define the tree likelihood in a recursive fashion (a log-space sketch follows below):
s(T) = ∑_{P ∈ part(T)} m_T(P) r(P)
     = π_T q(leaves(T)) + (1 − π_T) ∏_{T_i ∈ ch(T)} s(T_i)
where the first term accounts for the cluster of all the leaves and the second for the partitions contributed by the children.
10 / 25
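A hedged sketch of this recursion in Python, in log space for numerical stability. The nested-dict tree layout and the cluster score `log_q` follow the earlier sketches, and π_T uses the γ parameterisation introduced two slides later; these are illustrative choices rather than the slides' implementation.

```python
import math

def node_data(tree):
    """All data rows stored at the leaves below this node."""
    if not tree["children"]:
        return tree["data"]
    return [row for c in tree["children"] for row in node_data(c)]

def log_s(tree, log_q, gamma=0.5):
    """log s(T) = log[ pi_T q(leaves(T)) + (1 - pi_T) prod_{Ti in ch(T)} s(Ti) ].
    Internal nodes are assumed to have at least two children."""
    if not tree["children"]:                 # a single leaf: s(T) = q({x})
        return log_q(tree["data"])
    pi = 1.0 - (1.0 - gamma) ** (len(tree["children"]) - 1)
    log_cluster = math.log(pi) + log_q(node_data(tree))
    log_split = math.log1p(-pi) + sum(log_s(c, log_q, gamma) for c in tree["children"])
    top = max(log_cluster, log_split)        # log-sum-exp of the two mixture terms
    return top + math.log(math.exp(log_cluster - top) + math.exp(log_split - top))

leaf = lambda x: {"children": [], "data": [x]}
node = lambda *cs: {"children": list(cs), "data": None}
```

Combined with the `log_q` sketch above, for example `log_s(node(node(leaf([1,0,1]), leaf([1,1,1]), leaf([1,0,0])), leaf([0,1,0])), log_q)` scores a small four-item tree, mirroring the worked example on the next slide.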
Likelihood of Clusters, Partitions and Trees

Tree: {a b c d e f, a b c|d|e|f, a|b|c|d|e|f}
Example:
[Figure: the example rose tree, whose root has children T_abc, d, e, f, with T_abc having children a, b, c.]
s(T_abc) = π_abc q(D_abc) + (1 − π_abc) s(T_a) s(T_b) s(T_c)
         = π_abc q(D_abc) + (1 − π_abc) q(x_a) q(x_b) q(x_c)
s(T_abcdef) = π_abcdef q(D_abcdef) + (1 − π_abcdef) s(T_abc) q(x_d) q(x_e) q(x_f)
11 / 25
An End to Needless Cascades

Define mixing proportions with a parameter 0 < γ < 1:
π_T = 1 − (1 − γ)^{|ch(T)| − 1}
Suppose r(a b c|d) > r(a b|c|d) [and the other partitions that split a, b, c]. Compare the mixture weights m(S, T) of the two trees (a sketch that computes them follows below):

Cascading binary tree (((a b) c) d):
partition S    m(S, T)
a b c d        γ
a b c|d        (1 − γ) γ
a b|c|d        (1 − γ)^2 γ
a|b|c|d        (1 − γ)^3

Collapsed rose tree ((a b c) d):
partition S    m(S, T)
a b c d        γ
a b c|d        (1 − γ) (1 − (1 − γ)^2)
a|b|c|d        (1 − γ)^3

12 / 25
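A short sketch (again an illustration, not the authors' code) that enumerates m_T(P) under π_T = 1 − (1 − γ)^{|ch(T)|−1}; running it on the cascading binary tree and on the collapsed rose tree reproduces the two tables above, and the weights sum to one in both cases.

```python
from itertools import product

def leaves(tree):
    return [tree["label"]] if not tree["children"] else \
           [l for c in tree["children"] for l in leaves(c)]

def mixture_weights(tree, gamma):
    """Return {tree-consistent partition: m_T(partition)}."""
    whole = frozenset([frozenset(leaves(tree))])
    if not tree["children"]:
        return {whole: 1.0}
    pi = 1.0 - (1.0 - gamma) ** (len(tree["children"]) - 1)
    weights = {whole: pi}
    child_tables = [list(mixture_weights(c, gamma).items()) for c in tree["children"]]
    for combo in product(*child_tables):
        blocks = frozenset().union(*(p for p, _ in combo))
        w = 1.0 - pi
        for _, wk in combo:
            w *= wk
        weights[blocks] = weights.get(blocks, 0.0) + w
    return weights

leaf = lambda x: {"label": x, "children": []}
node = lambda *cs: {"label": None, "children": list(cs)}

g = 0.3
cascade = node(node(node(leaf("a"), leaf("b")), leaf("c")), leaf("d"))
collapsed = node(node(leaf("a"), leaf("b"), leaf("c")), leaf("d"))
for T in (cascade, collapsed):
    table = mixture_weights(T, g)
    for P, w in table.items():
        print(" | ".join("".join(sorted(b)) for b in P), round(w, 4))
    print("sum =", round(sum(table.values()), 4))   # 1.0 for both trees
```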
Complexity of Maximising s(T)

There are too many rose trees T for an exhaustive search for the highest s(T). With L leaves there are:
- Binary trees: 2^{O(L log L)}
- Rose trees: 2^{O(L log L + L)}
[Figure: the number of binary trees and rose trees, and their ratio, as a function of the number of leaves (up to about 120 leaves, on a log scale reaching 10^{234}).]
13 / 25
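For intuition about these counts, a small sketch (my own, under the assumption that every internal node of a rose tree has at least two children) that counts both families exactly for small L: rooted binary trees number (2L − 3)!!, and rose trees are counted by recursing over the block sizes of the root's split.

```python
from functools import lru_cache
from math import factorial

def binary_trees(n):
    """Rooted binary trees with n labelled leaves: (2n - 3)!!"""
    out = 1
    for k in range(3, 2 * n - 2, 2):
        out *= k
    return out

def int_partitions(n, max_part=None):
    """Integer partitions of n as non-increasing tuples."""
    max_part = n if max_part is None else max_part
    if n == 0:
        yield ()
        return
    for first in range(min(n, max_part), 0, -1):
        for rest in int_partitions(n - first, first):
            yield (first,) + rest

@lru_cache(maxsize=None)
def rose_trees(n):
    """Rose trees with n labelled leaves (every internal node has >= 2 children)."""
    if n == 1:
        return 1
    total = 0
    for sizes in int_partitions(n):
        if len(sizes) < 2:
            continue                        # the root must actually split the leaves
        ways = factorial(n)                 # set partitions with these block sizes
        for s in sizes:
            ways //= factorial(s)
        for m in {s: sizes.count(s) for s in set(sizes)}.values():
            ways //= factorial(m)
        subtrees = 1
        for s in sizes:
            subtrees *= rose_trees(s)
        total += ways * subtrees
    return total

for L in (4, 8, 12):
    print(L, binary_trees(L), rose_trees(L))
```

Both counts grow super-exponentially, which is why the next slide resorts to a greedy search rather than enumeration.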
Construction by Greedy Model Selection

1. Let T_i = {x_i} for all i.
2. For every ordered pair of trees (T_i, T_j) and each possible joined/absorbed tree T_m, pick the T_m with the highest
   L(T_m) = log [ s(T_m) / ( s(T_i) s(T_j) ) ]
3. Join/absorb T_i and T_j into T_m.
4. Repeat 2 and 3 until one tree remains.
5. Repeatedly contract edges if this increases s(T).
A sketch of the greedy loop follows below.
[Figure: the join operation makes T_i and T_j the two children of a new node T_m; the absorb operation instead places T_j alongside the children T_a, …, T_e of T_i, avoiding an extra level.]
14 / 25
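A join-only sketch of the greedy loop, with the tree score passed in as a callable (for example the `log_s` recursion sketched earlier). Absorbs, the final edge-contraction pass and the caching needed to reach the stated O(L^2 log L) cost are omitted, so treat this as an outline of the procedure rather than the authors' algorithm.

```python
def greedy_rose_tree(items, log_stree):
    """Greedy agglomeration: start from singleton trees and repeatedly merge the pair
    (Ti, Tj) whose joined tree Tm maximises L(Tm) = log s(Tm) - log s(Ti) - log s(Tj)."""
    forest = [{"children": [], "data": [x]} for x in items]
    scores = [log_stree(t) for t in forest]
    while len(forest) > 1:
        best = None
        for i in range(len(forest)):
            for j in range(i + 1, len(forest)):
                joined = {"children": [forest[i], forest[j]], "data": None}
                score = log_stree(joined)
                gain = score - scores[i] - scores[j]
                if best is None or gain > best[0]:
                    best = (gain, i, j, joined, score)
        _, i, j, joined, score = best
        forest = [t for k, t in enumerate(forest) if k not in (i, j)] + [joined]
        scores = [s for k, s in enumerate(scores) if k not in (i, j)] + [score]
    return forest[0]

# e.g. greedy_rose_tree(binary_rows, lambda t: log_s(t, log_q)) with the earlier sketches
```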
Bayesian Hierarchical Clustering

Relationship between BRT and BHC:
- BHC produces binary trees; BRT can produce non-binary trees.
- BRT and one version of BHC interpret trees as mixtures over partitions. In the other version, BHC is interpreted as approximate inference in a DP mixture:
  - It uses a different π_T, related to the DP clustering prior.
  - BHC includes many partitions in its model, as this encourages a tighter bound on the marginal probability under the DP mixture.
  - Unfortunately this leads to overly complicated models with many more partitions than necessary.
  - We found that this tends to produce trees with inferior likelihoods.
[Heller and Ghahramani 2005]
15 / 25
Results (anecdotal)

BHC (DP): log likelihood -1418, 468,980,051 partitions.
BHC (fixed): log likelihood -1266, 908,188,506 partitions.
BRT: log likelihood -1258, 1,441 partitions.
[Figure: the three trees over the same set of everyday objects and animals (chisel, clamp, drill, ..., lion, tiger).]
16 / 25
Results (anecdotal)

[Figure: close-ups of the BHC (DP), BHC (fixed) and BRT trees over the vehicle, food and animal items.]
17 / 25
Results (quantitative)

Log likelihood:

Data set     BHC (DP)               BHC (fixed)            BRT
toy          −230.278 ± 0.000       −169.395 ± 0.000       −167.301 ± 0.000
spambase     −2301.59 ± 141.996     −1920.19 ± 123.075     −1905.626 ± 122.5
digits024    −4106.458 ± 77.272     −3708.448 ± 66.958     −3692.174 ± 65.499
digits       −4369.455 ± 81.491     −3900.203 ± 72.093     −3881.593 ± 71.553
newsgroups   −11439.463 ± 24.428    −10710.756 ± 26.767    −10687.927 ± 28.157

[Figures: purity and hierarchical F2-measure (scale 0 to 1) for BHC (DP), BHC (fixed) and BRT on each of the five data sets.]
18 / 25
Mixtures of Gaussian Process Experts

Mixtures of GPs are a simple way to construct nonparametric density regression models: a type of dependent Dirichlet process mixture. MCMC inference can be very time consuming.

model   lik     #parts       #nodes
GP      −1698   1            1
BHC     −691    47 × 10^9    132
BRT     −509    32 × 10^5    53

[Figure: the data with the experts recovered by a single GP, by BHC and by BRT.]
[MacEachern 1999, Rasmussen and Ghahramani 2002] 19 / 25
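To plug GP experts into the same machinery, each cluster's score q(D) becomes a GP marginal likelihood over the (x, y) pairs at its leaves. A minimal numpy sketch with a squared-exponential kernel; the kernel choice and hyperparameters are illustrative assumptions (the slides do not specify them), and the density over the inputs x is omitted.

```python
import numpy as np

def gp_log_marginal(x, y, lengthscale=1.0, signal=1.0, noise=0.1):
    """log N(y | 0, K + noise^2 I) with a squared-exponential kernel on x:
    the per-cluster score for one GP expert."""
    x = np.asarray(x, dtype=float).reshape(-1, 1)
    y = np.asarray(y, dtype=float).ravel()
    sqdist = (x - x.T) ** 2
    K = signal ** 2 * np.exp(-0.5 * sqdist / lengthscale ** 2) + noise ** 2 * np.eye(len(y))
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    return (-0.5 * y @ alpha
            - np.log(np.diag(L)).sum()
            - 0.5 * len(y) * np.log(2.0 * np.pi))

# Each leaf holds one (x, y) pair; a cluster's rows are stacked and scored together.
```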
Mixtures of Gaussian Process Experts

model   lik    #parts
GP      −300   1
BHC     −116   53 × 10^14
BRT     26     4

[Figure: the experts recovered by a single GP, by BHC and by BRT on a second data set.]
20 / 25
Discussion

A hierarchical clustering model that:
- allows arbitrary branching structure;
- uses this flexibility to find simpler models that better explain the data;
- finds good trees in O(L^2 log L) time (the same as BHC).
There are other models of non-binary trees, so far unexplored for hierarchical clustering, such as Λ-coalescents and Gibbs fragmentation trees. [Pitman 1999, McCullagh et al. 2008]
21 / 25
Thanks
22 / 25
References I

Friedman, N. (2003). PCluster: Probabilistic agglomerative clustering of gene expression profiles. Technical Report 2003-80, Hebrew University.

Heller, K. A. and Ghahramani, Z. (2005). Bayesian hierarchical clustering. In Proceedings of the International Conference on Machine Learning, volume 22.

MacEachern, S. (1999). Dependent nonparametric processes. In Proceedings of the Section on Bayesian Statistical Science. American Statistical Association.

McCullagh, P., Pitman, J., and Winkel, M. (2008). Gibbs fragmentation trees. Bernoulli, 14(4):988–1002.

Neal, R. M. (2003). Density modeling and clustering using Dirichlet diffusion trees. In Bayesian Statistics, volume 7, pages 619–629.

23 / 25
References II

Pitman, J. (1999). Coalescents with multiple collisions. Annals of Probability, 27:1870–1902.

Rasmussen, C. E. and Ghahramani, Z. (2002). Infinite mixtures of Gaussian process experts. In Advances in Neural Information Processing Systems, volume 14.

Segal, E. and Koller, D. (2002). Probabilistic hierarchical clustering for biological data. In RECOMB '02: Proceedings of the Sixth Annual International Conference on Computational Biology, pages 273–280, New York, NY, USA. ACM.

Teh, Y. W., Daume III, H., and Roy, D. M. (2008). Bayesian agglomerative clustering with coalescents. In Advances in Neural Information Processing Systems, volume 20.

Vinokourov, A. and Girolami, M. (2000). A probabilistic hierarchical clustering method for organizing collections of text documents. In Proceedings of the International Conference on Pattern Recognition, volume 2, page 2182.

24 / 25
References III
Williams, C. K. I. (2000). A MCMC approach to hierarchical mixture modelling. In Advances in Neural Information Processing Systems, volume 12.
25 / 25