Probabilistic Data Mining

Lehel Csató
Faculty of Mathematics and Informatics, Babeş–Bolyai University, Cluj-Napoca
November 2010

Topics: Modelling data. Latent variable models. Estimation: maximum likelihood, maximum a-posteriori, Bayesian estimation. Unsupervised methods: general concepts, principal components, independent components, mixture models.
Outline

1 Modelling Data: Motivation, Machine Learning, Latent variable models
2 Estimation methods: Maximum Likelihood, Maximum a-posteriori, Bayesian Estimation
3 Unsupervised Methods: General concepts, Principal Components, Independent Components, Mixture Models
Motivation for Data Mining

Data mining is not:
  SQL and relational database applications;
  storage technologies;
  cloud computing.

Data mining:
  The extraction of knowledge or information from an ever-growing collection of data.
  "Advanced" search capability that enables one to extract patterns useful in providing models for:
  1 characterising,
  2 predicting, and
  3 exploiting the data.
Data mining applications

  Identifying targets for vouchers or frequent-flier bonuses, e.g. in telecommunications.
  "Basket analysis" – correlation-based analysis leading to recommendations of new items (Amazon.com).
  (Semi-)automated fraud/virus detection: guards that protect against procedural or other misuse of a system.
  Forecasting, e.g. the energy consumption of a region, for optimising coal/hydro plants or planning.
  Exploiting textual databases – the Google business: answering user queries; placing content-sensitive ads (Google AdSense).
The need for data mining

"Computers have promised us a fountain of wisdom but delivered a flood of data."
"The amount of information in the world doubles every 20 months." (Frawley, Piatetsky-Shapiro, Matheus, 1991)

A competitive market environment requires sophisticated – and useful – algorithms.

Data acquisition and storage are ubiquitous; algorithms are required to exploit them. The algorithms that exploit this data-rich environment usually come from the machine learning domain.
Machine learning

Historical background / motivation:
  Huge amounts of data that should be processed automatically;
  Mathematics provides general solutions, i.e. solutions not tailored to a given problem;
  Need for a "science" that uses mathematical machinery to solve practical problems.
Definitions for Machine Learning

Machine learning: a collection of methods (from statistics and probability theory) to solve problems met in practice:
  noise filtering for non-linear regression and/or non-Gaussian noise;
  classification: binary, multi-class, partially labelled;
  clustering, inversion problems, density estimation, novelty detection.

Generally, we need to model the data.
Modelling Data

[Figure: samples (x_1, y_1), (x_2, y_2), ..., (x_N, y_N) observed around a function f(x).]

Real world: there "is" a function y = f(x).
Observation process: a corrupted datum is collected for a sample x_n:

  t_n = y_n + ε        (additive noise)
  t_n = h(y_n, ε)      (h – distortion function)

Problem: find the function y = f(x).
Latent variable models

[Figure: from the data set (x_1, y_1), ..., (x_N, y_N), inference within a function class F and an observation process yields f*(x).]

  Data set – collected.
  Assume a function class: polynomial, Fourier expansion, wavelet;
  Observation process – encodes the noise;
  Find the optimal function from the class.
Latent variable models II

We have the data set D = {(x_1, y_1), ..., (x_N, y_N)}.

Consider a function class, e.g.:

  (1)  F = { w^T x + b | w ∈ R^d, b ∈ R }
  (2)  F = { a_0 + Σ_{k=1}^K a_k sin(2πkx) + Σ_{k=1}^K b_k cos(2πkx) | a, b ∈ R^K, a_0 ∈ R }

Assume an observation process: y_n = f(x_n) + ε with ε ∼ N(0, σ²).
Latent variable models III

1 The data set: D = {(x_1, y_1), ..., (x_N, y_N)}.
2 Assume a function class: F = { f(x, θ) | θ ∈ R^p }, with F a polynomial family, etc.
3 Assume an observation process and define a loss function L(y_n, f(x_n, θ)).
  For Gaussian noise: L(y_n, f(x_n, θ)) = (y_n − f(x_n, θ))².
Outline

1 Modelling Data
2 Estimation methods: Maximum Likelihood, Maximum a-posteriori, Bayesian Estimation
3 Unsupervised Methods
Parameter estimation

Estimating parameters means finding the optimal value of θ:

  θ* = arg min_{θ ∈ Ω} L(D, θ)

where Ω is the domain of the parameters and L(D, θ) is a "loss function" for the data set.
Example:

  L(D, θ) = Σ_{n=1}^N L(y_n, f(x_n, θ))
Maximum Likelihood Estimation

L(D, θ) – the (negative log-)likelihood, or loss, function.

Maximum likelihood estimation of the model:

  θ* = arg min_{θ ∈ Ω} L(D, θ)

Example – regression with quadratic loss (from the factorisation of the likelihood over the data):

  L(D, θ) = Σ_{n=1}^N (y_n − f(x_n, θ))²

Drawback: can produce a perfect fit to the data – over-fitting.
Example of an ML estimate

[Figure: scatter plot of height h (cm) against weight w (kg); h from 140 to 190, w from 50 to 110.]

We want to fit a model to the data:
  linear model: h = θ_0 + θ_1 w;
  log-linear model: h = θ_0 + θ_1 log(w);
  higher-order polynomials, e.g. h = θ_0 + θ_1 w + θ_2 w² + θ_3 w³ + ...
M.L. for linear models I

Assume a linear model for the x → y relation:

  f(x_n | θ) = Σ_{ℓ=1}^d θ_ℓ x_ℓ    with    x = [1, x, x², log(x), ...]^T

and a quadratic loss for D = {(x_1, y_1), ..., (x_N, y_N)}:

  E_2(D | f) = Σ_{n=1}^N (y_n − f(x_n | θ))²
M.L. for linear models II

Minimisation:

  Σ_{n=1}^N (y_n − f(x_n | θ))² = (y − Xθ)^T (y − Xθ)
                                = θ^T X^T X θ − 2 θ^T X^T y + y^T y

Solution (setting the gradient to zero):

  0 = 2 X^T X θ − 2 X^T y    ⇒    θ = (X^T X)^{-1} X^T y

where y = [y_1, ..., y_N]^T and X = [x_1, ..., x_N]^T are the transformed data.
M.L. for linear models III

Generalised linear models:
  Use a set of basis functions Φ = [φ_1(·), ..., φ_M(·)].
  Project the inputs into the space spanned by Im(Φ).
  Use a parameter vector of length M: θ = [θ_1, ..., θ_M]^T.
  The model is f(x) = Σ_m θ_m φ_m(x), with θ_m ∈ R.

The optimal parameter vector is

  θ* = (Φ^T Φ)^{-1} Φ^T y
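A minimal sketch of the least-squares fit above, assuming Python with NumPy; the polynomial basis functions and the synthetic data are illustrative choices, not part of the lecture.

```python
import numpy as np

# Synthetic 1-D data (illustrative): y = sin(2x) + noise.
rng = np.random.default_rng(0)
x = np.linspace(0, 3, 40)
y = np.sin(2 * x) + 0.1 * rng.standard_normal(x.size)

def design_matrix(x, degree=3):
    # Columns are phi_0(x)=1, phi_1(x)=x, ..., phi_degree(x)=x^degree.
    return np.vander(x, degree + 1, increasing=True)

Phi = design_matrix(x)

# Maximum-likelihood (least-squares) solution: theta* = (Phi^T Phi)^{-1} Phi^T y.
# lstsq is the numerically stable way to evaluate this expression.
theta_ml, *_ = np.linalg.lstsq(Phi, y, rcond=None)

y_fit = Phi @ theta_ml               # fitted values f(x_n | theta*)
residual = np.sum((y - y_fit) ** 2)  # quadratic loss E_2(D | f)
print(theta_ml, residual)
```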
Maximum Likelihood – Summary

There are many candidate model families:
  the degree of a polynomial specifies a model family;
  the order of a Fourier expansion;
  a mixture of {log, sin, cos, ...} is also a family.

Selecting the "best family" is a difficult modelling problem.
In maximum likelihood there is no control over how well a family suits a given data set.
Rule of thumb: keep the number of parameters smaller than √(#data).
Maximum a-posteriori I

The generalised linear model is powerful – it can be extremely complex;
with no complexity control there is an overfitting problem.

Aim: include knowledge in the inference process.
Our beliefs are reflected by the choice of the candidate functions.

Goal:
  specify prior knowledge using probabilities;
  use probability theory for consistent estimation;
  encode the observation noise in the model.
Maximum a-posteriori – Data/noise

Probabilistic data description: how likely is it that θ generated the data?

  y = f(x)      ⇔  y − f(x) ∼ δ_0
  y = f(x) + ε  ⇔  y − f(x) ∼ N

Gaussian noise: y − f(x) ∼ N(0, σ²), i.e.

  P(y | f(x)) = 1/(√(2π) σ) · exp( −(y − f(x))² / (2σ²) )
Maximum a-posteriori – Prior

William of Ockham (1285–1349) – principle:
  "Entities should not be multiplied beyond necessity."

Also known as (wiki...):
  the "principle of simplicity" – KISS,
  "When you hear hoofbeats, think horses, not zebras".

Simple models ≈ small number of parameters (L0 norm), in practice replaced by the L2 norm.

Probabilistic representation:

  p_0(θ) ∝ exp( −||θ||² / (2σ_0²) )
M.A.P. – Inference

Probabilities are assigned to D via the (log-)likelihood function:

  P(y_n | x_n, θ, F) ∝ exp[ −L(y_n, f(x_n, θ)) ]

Prior probabilities for θ:

  p_0(θ) ∝ exp( −||θ||² / (2σ_0²) )

A-posteriori probability:

  p(θ | D, F) = P(D | θ) p_0(θ) / p(D | F)

where p(D | F) is the probability of the data for a given family.
M.A.P. – Inference II

M.A.P. estimation finds the θ with largest posterior probability:

  θ*_MAP = arg max_{θ ∈ Ω} p(θ | D, F)

Example: with loss L(y_n, f(x_n, θ)) and a Gaussian prior:

  θ*_MAP = arg max_{θ ∈ Ω} [ K − (1/2) Σ_n L(y_n, f(x_n, θ)) − ||θ||² / (2σ_0²) ]

For σ_0² = ∞ this reduces to maximum likelihood (after a change of sign and max → min).
M.A.P. – Example I

[Figure: degree-6 polynomial fits to the training data, shown together with the true function, for noise deviations 10⁻³ and 10⁻².]
M.A.P. – Linear models I

[Figure: height h (cm) against weight w (kg) with degree p = 10 polynomial fits for decreasing prior widths.]

Aim: test different levels of flexibility with p = 10.
Prior width: σ_0² = 10⁶, 10⁵, 10⁴, 10³, 10², 10¹, 10⁰.
M.A.P. – Linear models II

  θ*_MAP = arg max_{θ ∈ Ω} [ K − (1/2) Σ_n E_2(y_n, f(x_n, θ)) − ||θ||² / (2σ_0²) ]

Transform into vector notation:

  θ*_MAP = arg max_{θ ∈ Ω} [ K − (1/2) (y − Xθ)^T (y − Xθ) − θ^T θ / (2σ_0²) ]

and solve for θ by differentiation:

  X^T (y − Xθ) − (1/σ_0²) I_d θ = 0
  θ*_MAP = ( X^T X + (1/σ_0²) I_d )^{-1} X^T y

Again M.L. for σ_0² = ∞.
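A sketch of the M.A.P. solution for the linear model (ridge regression), assuming NumPy; the design matrix, targets, and prior widths are illustrative. Setting σ_0² very large recovers the M.L. solution, as noted above.

```python
import numpy as np

def map_linear_fit(X, y, sigma0_sq):
    """MAP estimate theta = (X^T X + I/sigma0^2)^{-1} X^T y."""
    d = X.shape[1]
    A = X.T @ X + np.eye(d) / sigma0_sq
    return np.linalg.solve(A, X.T @ y)

# Illustrative use: polynomial features of degree 10, echoing the slides' p = 10 example.
rng = np.random.default_rng(1)
w = np.linspace(50, 110, 30)                        # "weights" (inputs)
h = 120 + 0.6 * w + 2.0 * rng.standard_normal(30)   # "heights" (targets), synthetic
X = np.vander((w - w.mean()) / w.std(), 11, increasing=True)

for sigma0_sq in [1e6, 1e3, 1e0]:
    theta = map_linear_fit(X, h, sigma0_sq)
    print(sigma0_sq, round(float(np.linalg.norm(theta)), 2))  # shrinkage grows as sigma_0^2 shrinks
```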
M.A.P. – Summary

Maximum a-posteriori models:
  allow for the inclusion of prior knowledge;
  may protect against overfitting;
  can measure the fitness of the family to the data – a procedure called M.L. type II.
M.A.P. application – M.L. II

Idea: instead of computing the most probable value of θ, we measure the fit of the model family F to the data D:

  P(D | F) = Σ_{θ_ℓ ∈ Ω} p(D, θ_ℓ | F) = Σ_{θ_ℓ ∈ Ω} p(D | θ_ℓ, F) p_0(θ_ℓ | F)

For the Gaussian noise case and a polynomial of order K:

  log P(D | F) = log ∫_{Ω_θ} dθ p(D | θ, F) p_0(θ | F)
               = log N(y | 0, Σ_X)
               = −(1/2) [ N log(2π) + log|Σ_X| + y^T Σ_X^{-1} y ]

where Σ_X = I_N σ_n² + X Σ_0 X^T, with X = [x^0, x^1, ..., x^K] and Σ_0 = diag(σ_0², σ_1², ..., σ_K²) = σ_p² I_{K+1}.
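A sketch of the evidence computation above for a polynomial family of order K under Gaussian noise, assuming NumPy; the data, noise variance and prior variance are illustrative.

```python
import numpy as np

def log_evidence(x, y, K, sigma_n_sq, sigma_p_sq):
    """log P(D|F) = log N(y | 0, Sigma_X), with Sigma_X = sigma_n^2 I + sigma_p^2 X X^T."""
    X = np.vander(x, K + 1, increasing=True)        # columns x^0, x^1, ..., x^K
    N = len(y)
    Sigma_X = sigma_n_sq * np.eye(N) + sigma_p_sq * (X @ X.T)
    sign, logdet = np.linalg.slogdet(Sigma_X)
    quad = y @ np.linalg.solve(Sigma_X, y)
    return -0.5 * (N * np.log(2 * np.pi) + logdet + quad)

# Compare polynomial orders by their evidence (illustrative data).
rng = np.random.default_rng(2)
x = np.linspace(-1, 1, 25)
y = 1.0 - 2.0 * x + 0.5 * x**3 + 0.05 * rng.standard_normal(x.size)
for K in range(1, 7):
    print(K, round(log_evidence(x, y, K, sigma_n_sq=0.05**2, sigma_p_sq=1.0), 2))
```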
M.A.P. ⇒ M.L. II

[Figure: log P(D | k_poly) as a function of log(σ_p²), plotted for polynomial families of order k = 1, ..., 10.]

Aim: compare the different polynomial families by their evidence.
Bayesian estimation – Intro

M.L. and M.A.P. estimates provide single solutions.
Point estimates lack an assessment of (un)certainty.

Better solution: for a query x*, the system output is probabilistic:

  x*  ⇒  p(y* | x*, F)

Tool: go beyond the M.A.P. solution and use the a-posteriori distribution of the parameters.
Bayesian estimation II

We again use Bayes' rule:

  p(θ | D, F) = P(D | θ) p_0(θ) / p(D | F)    with    p(D | F) = ∫_Ω dθ P(D | θ) p_0(θ)

and exploit the whole posterior distribution of the parameters.

A-posteriori parameter estimates:
we operate with p_post(θ) := p(θ | D, F) and use the total probability rule

  p(y* | D, F) = Σ_{θ_ℓ ∈ Ω_θ} p(y* | θ_ℓ, F) p_post(θ_ℓ)

in assessing the system output.
Bayesian estimation – Example I

Given the data D = {(x_1, y_1), ..., (x_N, y_N)}, estimate the linear fit

  y = θ_0 + Σ_{i=1}^d θ_i x_i := θ^T x,    with θ = [θ_0, θ_1, ..., θ_d]^T and x = [1, x_1, ..., x_d]^T.

Gaussian distributions for noise and prior:

  ε = y_n − θ^T x_n ∼ N(0, σ_n²),    θ ∼ N(0, Σ_0)
Bayesian estimation – Example II

Goal: compute the posterior distribution p_post(θ).

  p_post(θ) ∝ p_0(θ | Σ_0) Π_{n=1}^N P(y_n | θ^T x_n)

  −2 log p_post(θ) = K_post + (1/σ_n²) (y − Xθ)^T (y − Xθ) + θ^T Σ_0^{-1} θ
                   = θ^T [ (1/σ_n²) X^T X + Σ_0^{-1} ] θ − (2/σ_n²) θ^T X^T y + K'_post
                   = (θ − μ_post)^T Σ_post^{-1} (θ − μ_post) + K''_post

and by identification

  Σ_post = [ (1/σ_n²) X^T X + Σ_0^{-1} ]^{-1}    and    μ_post = Σ_post X^T y / σ_n²
Bayesian estimation – Example III

Bayesian linear model: the posterior distribution of the parameters is a Gaussian with

  Σ_post = [ (1/σ_n²) X^T X + Σ_0^{-1} ]^{-1}    and    μ_post = Σ_post X^T y / σ_n²

Point estimates are recovered as special cases:
  M.L.: take Σ_0 → ∞ and keep only μ_post;
  M.A.P.: approximate the distribution by a single value at its maximum, i.e. μ_post.
Bayesian estimation – Example IV

Prediction for a new value x*: use the likelihood P(y* | x*, θ, F), the posterior for θ, and Bayes' rule. The steps:

  p(y* | x*, D, F) = ∫_{Ω_θ} dθ p(y* | x*, θ, F) p_post(θ | D, F)
                   = ∫_{Ω_θ} dθ exp{ −(1/2) [ K* + (y* − θ^T x*)²/σ_n² + (θ − μ_post)^T Σ_post^{-1} (θ − μ_post) ] }
                   = ∫_{Ω_θ} dθ exp{ −(1/2) [ K* + y*²/σ_n² − a^T C^{-1} a + Q(θ) ] }

where

  a = x* y* / σ_n² + Σ_post^{-1} μ_post,    C = x* x*^T / σ_n² + Σ_post^{-1}
Bayesian estimation – Example V

Integrating out the quadratic in θ gives the predictive distribution at x*:

  p(y* | x*, D, F) = exp[ −(1/2) ( K* + (y* − x*^T μ_post)² / (σ_n² + x*^T Σ_post x*) ) ]
                   = N( y* | x*^T μ_post, σ_n² + x*^T Σ_post x* )

With the predictive distribution we can:
  measure the variance of the prediction at each point: σ*² = σ_n² + x*^T Σ_post x*;
  sample from the parameters and plot the candidate predictors.
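A minimal sketch of the posterior and predictive formulas above, assuming NumPy; the prior covariance Σ_0 and the toy data are illustrative.

```python
import numpy as np

def bayes_linear(X, y, sigma_n_sq, Sigma_0):
    """Posterior: Sigma_post = (X^T X / sigma_n^2 + Sigma_0^{-1})^{-1},
       mu_post = Sigma_post X^T y / sigma_n^2."""
    Sigma_post = np.linalg.inv(X.T @ X / sigma_n_sq + np.linalg.inv(Sigma_0))
    mu_post = Sigma_post @ X.T @ y / sigma_n_sq
    return mu_post, Sigma_post

def predict(x_star, mu_post, Sigma_post, sigma_n_sq):
    """Predictive mean x*^T mu_post and variance sigma_n^2 + x*^T Sigma_post x*."""
    mean = x_star @ mu_post
    var = sigma_n_sq + x_star @ Sigma_post @ x_star
    return mean, var

# Illustrative data: y = 1 + 2x with noise, features [1, x].
rng = np.random.default_rng(3)
x = rng.uniform(-1, 1, 20)
X = np.column_stack([np.ones_like(x), x])
y = 1.0 + 2.0 * x + 0.1 * rng.standard_normal(20)

mu_post, Sigma_post = bayes_linear(X, y, sigma_n_sq=0.1**2, Sigma_0=np.eye(2))
print(predict(np.array([1.0, 0.5]), mu_post, Sigma_post, 0.1**2))
```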
Bayesian example – Error bars

[Figure: degree-6 polynomial fit with noise variance σ² = 1; the predictive error bars are the symmetric thin lines around the mean prediction.]
Bayesian example – Predictive samples

[Figure: predictive samples; third-order polynomials are used to approximate the data.]
Bayesian estimation – Problems

When computing p_post(θ | D, F) we assumed that the posterior can be represented analytically.
In general this is not the case.

Approximations are needed for
  the posterior distribution,
  the predictive distribution.

In Bayesian modelling an important issue is how we approximate the posterior distribution.
Bayesian estimation – Summary

Complete specification of the model:
  can include prior beliefs about the model.
Accurate predictions:
  can compute the posterior probabilities at each test location.
Computational cost:
  using the models for prediction can be difficult and expensive in time and memory.

Bayesian models: flexible and accurate – if priors about the model are used.
Outline

1 Modelling Data
2 Estimation methods
3 Unsupervised Methods: General concepts, Principal Components, Independent Components, Mixture Models
Unsupervised setting

Data can be unlabelled, i.e. no values y are associated with the inputs x.
We want to "extract" information from D = {x_1, ..., x_N}.

We assume that the data – although high-dimensional – span a manifold of much smaller dimension.
The task is to find the subspace corresponding to the data span.
Models in unsupervised learning

Again the model of the data is important:
  Principal Components;
  Independent Components;
  Mixture models.

[Figures: three two-dimensional data clouds (x_1, x_2) illustrating each model.]
The PCA model I

Simple data structure: a spherical cluster that is
  translated,
  scaled,
  rotated.

[Figure: two-dimensional data cloud with its principal axes.]

We aim to find the principal directions of the data spread.
Principal direction: the direction u along which the data preserves most of its variance.
The PCA model II

Principal direction:

  u = argmax_{||u||=1} (1/2N) Σ_{n=1}^N (u^T x_n − u^T x̄)²

We pre-process so that x̄ = 0. Replacing the empirical average with the covariance Σ_x and adding a Lagrange multiplier λ for the constraint:

  u = argmax_{u, λ} (1/2) u^T Σ_x u − (λ/2) (||u||² − 1)

Differentiating w.r.t. u:

  Σ_x u − λ u = 0
The PCA model III

The optimal solution must obey

  Σ_x u = λ u

i.e. the eigendecomposition of the covariance matrix: (λ*, u*) is an eigenvalue–eigenvector pair of the system.

Substituting back, the value of the objective is λ*.  ⇒  The optimum is attained at λ* = λ_max.

Principal direction: the eigenvector u_max corresponding to the largest eigenvalue of the system.
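A small sketch, assuming NumPy, checking that the variance of the projection onto the top eigenvector of the covariance equals the largest eigenvalue, as derived above; the 2-D data is illustrative.

```python
import numpy as np

rng = np.random.default_rng(4)
# Illustrative 2-D cloud: a stretched and rotated Gaussian.
Z = rng.standard_normal((500, 2)) * np.array([3.0, 0.5])
angle = np.deg2rad(30)
R = np.array([[np.cos(angle), -np.sin(angle)], [np.sin(angle), np.cos(angle)]])
X = Z @ R.T
X = X - X.mean(axis=0)                      # pre-process: zero mean

Sigma_x = X.T @ X / len(X)                  # empirical covariance
eigvals, eigvecs = np.linalg.eigh(Sigma_x)  # eigenvalues in ascending order
u_max = eigvecs[:, -1]                      # eigenvector of the largest eigenvalue

# Variance of the projection onto u_max equals lambda_max (up to sampling noise).
print(np.var(X @ u_max), eigvals[-1])
```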
The PCA model – Data mining I

How is this used in data mining? Assume that the data is:
  jointly Gaussian: x ∼ N(m_x, Σ_x);
  high-dimensional;
  only a few (here 2) directions are relevant.

[Figure: high-dimensional Gaussian data with two relevant directions.]
The PCA model – Data mining II

How is this used in data mining?
  Subtract the mean.
  Compute the eigendecomposition.
  Select the K eigenvectors corresponding to the K largest eigenvalues.
  Compute the K projections: z_nℓ = x_n^T u_ℓ.

The projection uses the matrix P := [u_1, ..., u_K]:

  Z = X P

and z_n is used as a compact representation of x_n.
The PCA model – Data mining III

Reconstruction:

  x'_n = Σ_{ℓ=1}^K z_nℓ u_ℓ,    or, in matrix notation,    X' = Z P^T

PCA projection analysis:

  E_PCA = (1/N) Σ_{n=1}^N ||x_n − x'_n||² = (1/N) tr[ (X − X')^T (X − X') ]
        = tr[ Σ_x − P Σ_z P^T ]
        = tr[ U (diag(λ_1, ..., λ_d) − diag(λ_1, ..., λ_K, 0, ..., 0)) U^T ]
        = tr[ U^T U diag(0, ..., 0, λ_{K+1}, ..., λ_d) ]
        = Σ_{ℓ=1}^{d−K} λ_{K+ℓ}
The PCA model – Data mining IV

PCA reconstruction error – the error made using only the first K PCA directions:

  E_PCA = Σ_{ℓ=1}^{d−K} λ_{K+ℓ}

PCA properties:
  the PCA system is orthonormal: u_ℓ^T u_r = δ_{ℓr};
  reconstruction is fast;
  the spherical assumption is critical.
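A sketch of the full PCA pipeline from the slides above, assuming NumPy: center, eigendecompose, keep K directions, project, reconstruct, and compare the reconstruction error with the sum of the discarded eigenvalues. The random data is illustrative.

```python
import numpy as np

def pca_reconstruct(X, K):
    Xc = X - X.mean(axis=0)                        # subtract the mean
    Sigma = Xc.T @ Xc / len(Xc)                    # empirical covariance
    eigvals, eigvecs = np.linalg.eigh(Sigma)       # ascending order
    order = np.argsort(eigvals)[::-1]              # sort descending
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    P = eigvecs[:, :K]                             # P = [u_1, ..., u_K]
    Z = Xc @ P                                     # projections z_n
    X_rec = Z @ P.T + X.mean(axis=0)               # reconstruction x'_n
    err = np.mean(np.sum((X - X_rec) ** 2, axis=1))
    return err, eigvals[K:].sum()                  # E_PCA vs. sum of discarded eigenvalues

rng = np.random.default_rng(5)
X = rng.standard_normal((1000, 5)) @ np.diag([3.0, 2.0, 1.0, 0.3, 0.1])
print(pca_reconstruct(X, K=2))                     # the two numbers should agree
```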
PCA application – USPS I

USPS digits – a testbed for several models.
PCA application – USPS II

USPS characteristics:
  handwritten data, centered and scaled;
  ≈ 10,000 items of 16 × 16 grayscale images.

[Figure: cumulative normalised eigenvalue sum k_r = Σ_{ℓ=1}^r λ_ℓ (in %) against r.]

Conclusions for the USPS set:
  the normalised λ_1 = 0.24  ⇒  u_1 accounts for 24% of the variance;
  at r ≈ 10 more than 70% of the variance is explained;
  at r ≈ 50 more than 98%  ⇒  50 numbers instead of 256.
PCA application – USPS III

Visualisation application: the data visualised along the first two eigendirections.

[Figure: USPS digits projected onto the first two principal components.]
PCA application – USPS IV

Visualisation application: detail of the projection.
The ICA model I – Motivation

Start from PCA: x = P z is a generative model for the data.

We assumed that
  z are i.i.d. Gaussian random variables, z ∼ N(0, diag(λ_ℓ));
  ⇒ the components of x are not independent;
  ⇒ z are Gaussian sources.

In most real data:
  the sources are not Gaussian,
  but the sources are independent.
We exploit that!
The ICA model II

The model assumption:

  x = A s

where s are independent sources and A is a linear mixing matrix.

We look for a matrix B that recovers the sources:

  s' := B x = B (A s) = (B A) s

i.e. (B A) is the identity up to a permutation and scaling, but independence is retained.
The ICA model III

In practice: s' := B x with s = [s_1, ..., s_K] all independent sources.

Independence test: the KL divergence between the joint distribution and the product of the marginals, e.g. for two sources

  B = argmin_{B ∈ SO_d} KL( p(s_1, s_2) || p(s_1) p(s_2) )

where SO_d is the group of matrices with |B| = 1.

In ICA we look for the matrix B that minimises

  ∫_Ω dp(s) log p(s_1, ..., s_d) − Σ_ℓ ∫_{Ω_ℓ} dp(s_ℓ) log p(s_ℓ)
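A minimal sketch of blind source separation in this spirit, assuming scikit-learn's FastICA implementation is available (the lecture's figures use the original FastICA package); the source signals and mixing matrix are illustrative.

```python
import numpy as np
from sklearn.decomposition import FastICA

rng = np.random.default_rng(6)
t = np.linspace(0, 8, 2000)

# Two independent, non-Gaussian sources (illustrative): a sine and a square wave.
s1 = np.sin(2 * t)
s2 = np.sign(np.sin(3 * t))
S = np.column_stack([s1, s2])

A = np.array([[1.0, 0.5], [0.4, 1.2]])   # mixing matrix
X = S @ A.T                              # observed mixtures x = A s

ica = FastICA(n_components=2, random_state=0)
S_hat = ica.fit_transform(X)             # recovered sources (up to permutation/scaling)
print(np.corrcoef(S.T, S_hat.T).round(2))
```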
The KL-divergence – Detour

Kullback-Leibler divergence:

  KL(p || q) = Σ_x p(x) log( p(x) / q(x) )

  is zero if and only if p = q;
  is not a measure of distance (but close to it!);
  is efficient to work with when exponential families are used.

Short proof that KL(p || q) ≥ 0:

  0 = log 1 = log( Σ_x q(x) ) = log( Σ_x p(x) · q(x)/p(x) ) ≥ Σ_x p(x) log( q(x)/p(x) ) = −KL(p || q)

(using Jensen's inequality for the concave log), hence KL(p || q) ≥ 0.
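A tiny sketch, assuming NumPy, of the discrete KL divergence above, checking non-negativity and asymmetry on two illustrative distributions.

```python
import numpy as np

def kl(p, q):
    """KL(p || q) = sum_x p(x) log(p(x)/q(x)) for discrete distributions without zeros."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    return float(np.sum(p * np.log(p / q)))

p = np.array([0.5, 0.3, 0.2])
q = np.array([0.4, 0.4, 0.2])
print(kl(p, q), kl(q, p))   # both non-negative, generally different (not symmetric)
print(kl(p, p))             # zero iff p = q
```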
ICA Application – Data

Separation of source signals.

[Figure: the observed mixtures m_1, ..., m_4.]
ICA Application – Results

Results of the separation.

[Figure: the recovered sources, obtained with the FastICA package.]
Applications of ICA

  Cocktail party problem: separating noisy, multiple sources from multiple observations.
  Fetal ECG: separating the ECG signal of a fetus from its mother's ECG.
  MEG recordings: separating MEG "sources".
  Financial data: finding hidden factors in financial data.
  Noise reduction: noise reduction in natural images.
  Interference removal: interference removal from CDMA (code-division multiple access) communication systems.
The mixture model – Introduction

The data structure is more complex: there is more than a single source for the data.

[Figure: two-dimensional data with several clusters.]

The mixture model:

  P(x) = Σ_{k=1}^K π_k p_k(x | μ_k, Σ_k)        (1)

where
  π_1, ..., π_K – the mixing components;
  p_k(x | μ_k, Σ_k) – the density of a component.
The components are usually called clusters.
The mixture model – Data generation

The generation process reflects the assumptions about the model.

Data generation:
  first we select the component,
  then we sample from the component's density function.

When modelling data we do not know:
  which point belongs to which cluster;
  what the parameters of each density function are.
The mixture model – Example I

Old Faithful geyser in Yellowstone National Park. Characterised by:
  intense eruptions;
  differing times between them.

Rule: the duration is 1.5 to 5 minutes, and the length of an eruption helps determine the interval.
  If an eruption lasts less than 2 minutes, the interval will be around 55 minutes.
  If the eruption lasts 4.5 minutes, the interval may be around 88 minutes.
The mixture model – Example II

[Figure: interval between eruptions (minutes) against eruption duration (minutes) for the Old Faithful data.]

  The longer the duration, the longer the interval.
  The linear relation I = θ_0 + θ_1 d is not the best fit.
  There are only very few eruptions lasting ≈ 3 minutes.
The mixture model I

Assumptions:
  We know the family of the individual density functions: these densities are parametrised by a few parameters.
  The densities are easily identifiable: if we knew which data point belongs to which cluster, each density function would be easy to identify.

Gaussian densities are often used – they fulfill both "conditions".
The mixture model II

The Gaussian mixture model:

  p(x) = π_1 N_1(x | μ_1, Σ_1) + π_2 N_2(x | μ_2, Σ_2)

For known densities (centres and ellipses):

  p(k | x_n) = p(k) N_k(x_n | μ_k, Σ_k) / Σ_ℓ p(ℓ) N_ℓ(x_n | μ_ℓ, Σ_ℓ)

i.e. we know the probability that a data point comes from cluster k.

For D we collect these values in a table:

  x      p(x|1)   p(x|2)
  x_1    γ_11     γ_12
  ...    ...      ...
  x_N    γ_N1     γ_N2

γ_nℓ – the responsibility of x_n in cluster ℓ.

[Figure: Old Faithful data coloured by responsibility, shaded from red to green.]
The mixture model III

When the γ_nℓ are known, the parameters are computed using the data weighted by their responsibilities:

  (μ_k, Σ_k) = argmax_{μ, Σ} Π_{n=1}^N ( N_k(x_n | μ, Σ) )^{γ_nk}    for all k,

that is, (μ_k, Σ_k) is obtained by maximising Σ_n γ_nk log N(x_n | μ, Σ).

When making inference we have to find both the responsibility vectors and the parameters of the mixture.

Given data D, iterate:
  initial guess: (μ_1, Σ_1), ..., (μ_K, Σ_K);
  re-estimate the responsibilities γ_n1, ..., γ_nK for each x_n;
  re-estimate the parameters (μ_1, Σ_1), ..., (μ_K, Σ_K) and (π_1, ..., π_K).
The mixture model – Summary

Responsibilities γ: the additional latent variables needed to help the computation.

In the mixture model the goal is to
  fit the model to the data,
  decide which submodel gets a particular data point,
achieved by maximisation of the log-likelihood function.
The EM algorithm I

  (π, Θ) = argmax Σ_n log[ Σ_ℓ π_ℓ N_ℓ(x_n | μ_ℓ, Σ_ℓ) ]

where
  Θ = [μ_1, Σ_1, ..., μ_K, Σ_K] is the vector of parameters;
  π = [π_1, ..., π_K] are the shares of the factors.

Problem with the optimisation: the parameters are not separable due to the sum within the logarithm.
Solution: use an approximation.
The EM algorithm II

  log P(D | π, Θ) = Σ_n log[ Σ_ℓ π_ℓ N_ℓ(x_n | μ_ℓ, Σ_ℓ) ] = Σ_n log[ Σ_ℓ p_ℓ(x_n, ℓ) ]

Use Jensen's inequality:

  log( Σ_ℓ p_ℓ(x_n, ℓ | θ_ℓ) ) = log( Σ_ℓ q_n(ℓ) · p_ℓ(x_n, ℓ | θ_ℓ) / q_n(ℓ) )
                              ≥ Σ_ℓ q_n(ℓ) log( p_ℓ(x_n, ℓ | θ_ℓ) / q_n(ℓ) )

for any distribution [q_n(1), ..., q_n(K)].
Jensen Inequality – Detour

[Figure: a concave f(z) with points z_1, z_2 and γ_2 = 0.75, comparing f(γ_1 z_1 + γ_2 z_2) with γ_1 f(z_1) + γ_2 f(z_2).]

Jensen's Inequality: for any concave f(z), any z_1 and z_2, and any γ_1, γ_2 > 0 such that γ_1 + γ_2 = 1:

  f(γ_1 z_1 + γ_2 z_2) ≥ γ_1 f(z_1) + γ_2 f(z_2)
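A one-line numerical check, assuming NumPy, of Jensen's inequality for the concave log with illustrative z_1, z_2 and weights.

```python
import numpy as np

z1, z2 = 1.0, 9.0
g1, g2 = 0.25, 0.75                       # gamma_1 + gamma_2 = 1
lhs = np.log(g1 * z1 + g2 * z2)           # f(gamma_1 z_1 + gamma_2 z_2)
rhs = g1 * np.log(z1) + g2 * np.log(z2)   # gamma_1 f(z_1) + gamma_2 f(z_2)
print(lhs, rhs, lhs >= rhs)               # True: log is concave
```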
The EM algorithm III

  log( Σ_ℓ p_ℓ(x_n, ℓ | θ_ℓ) ) ≥ Σ_ℓ q_n(ℓ) log( p_ℓ(x_n, ℓ) / q_n(ℓ) )

for any distribution q_n(·). Replacing the likelihood with the right-hand side, we have

  log P(D | π, Θ) ≥ Σ_n Σ_ℓ q_n(ℓ) log( p_ℓ(x_n | θ_ℓ) / q_n(ℓ) ) = L

and therefore the optimisation w.r.t. the cluster parameters separates:

  ∂L/∂θ_ℓ = 0    ⇒    0 = Σ_n q_n(ℓ) ∂ log p_ℓ(x_n | θ_ℓ) / ∂θ_ℓ

For distributions from the exponential family this optimisation is easy.
The EM algorithm IV

Any set of distributions q_1(·), ..., q_N(·) provides a lower bound to the log-likelihood.
We should choose the distributions that are closest to the current parameter set.

Assume the parameters have the current value θ^0 and minimise the difference

  log p(x_n | θ^0) − L_n = Σ_ℓ q_n(ℓ) log p(x_n | θ^0) − Σ_ℓ q_n(ℓ) log( P(x_n, ℓ | θ^0) / q_n(ℓ) )
                        = Σ_ℓ q_n(ℓ) log( q_n(ℓ) p(x_n | θ^0) / P(x_n, ℓ | θ^0) )

Observe that by setting

  q_n(ℓ) = P(x_n, ℓ | θ^0) / p(x_n | θ^0)

every log term vanishes and the difference becomes Σ_ℓ q_n(ℓ) · 0 = 0.
The EM algorithm V

The EM algorithm:
  Init – initialise the model parameters;
  E step – compute the responsibilities γ_nℓ = q_n(ℓ);
  M step – for each component ℓ solve

    0 = Σ_n q_n(ℓ) ∂ log p_ℓ(x_n | θ_ℓ) / ∂θ_ℓ

  Repeat – go to the E step.
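A compact sketch of the EM loop above for a two-component Gaussian mixture in 1-D, assuming NumPy; the initialisation and the synthetic data are illustrative (two clusters, loosely echoing the Old Faithful durations).

```python
import numpy as np

def normal_pdf(x, mu, var):
    return np.exp(-(x - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

rng = np.random.default_rng(7)
x = np.concatenate([rng.normal(2.0, 0.3, 120), rng.normal(4.3, 0.4, 180)])

# Init: mixing weights, means, variances.
pi = np.array([0.5, 0.5])
mu = np.array([1.0, 5.0])
var = np.array([1.0, 1.0])

for _ in range(50):
    # E step: responsibilities gamma_{nk} proportional to pi_k N_k(x_n | mu_k, var_k).
    dens = np.stack([pi[k] * normal_pdf(x, mu[k], var[k]) for k in range(2)], axis=1)
    gamma = dens / dens.sum(axis=1, keepdims=True)
    # M step: weighted maximum-likelihood updates for each component.
    Nk = gamma.sum(axis=0)
    pi = Nk / len(x)
    mu = (gamma * x[:, None]).sum(axis=0) / Nk
    var = (gamma * (x[:, None] - mu) ** 2).sum(axis=0) / Nk

print(pi.round(2), mu.round(2), var.round(2))
```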
EM application

[Figures (I–IV): the Old Faithful data, interval between eruptions against duration, shown at successive stages of the EM fit.]
References

J. M. Bernardo and A. F. Smith. Bayesian Theory. John Wiley & Sons, 1994.
C. M. Bishop. Pattern Recognition and Machine Learning. Springer Verlag, New York, N.Y., 2006.
T. M. Cover and J. A. Thomas. Elements of Information Theory. John Wiley & Sons, 1991.
A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, series B, 39:1–38, 1977.
T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer Verlag, 2001.