Probabilistic Data Mining

Lehel Csató
Faculty of Mathematics and Informatics, Babeş–Bolyai University, Cluj-Napoca
November 2010

Topics: Modelling data. Latent variable models. Estimation: maximum likelihood, maximum a-posteriori, Bayesian estimation. Unsupervised methods: general concepts, principal components, independent components, mixture models.
Outline

1 Modelling Data: Motivation, Machine Learning, Latent variable models
2 Estimation methods: Maximum Likelihood, Maximum a-posteriori, Bayesian Estimation
3 Unsupervised Methods: General concepts, Principal Components, Independent Components, Mixture Models
Motivation for Data Mining

Data mining is not:
  SQL and relational database applications;
  storage technologies;
  cloud computing.

Data mining:
  The extraction of knowledge or information from an ever-growing collection of data.
  "Advanced" search capability that enables one to extract patterns useful in providing models for:
  1 characterising,
  2 predicting, and
  3 exploiting the data.
Data mining applications

  Identifying targets for vouchers or frequent-flier bonuses, e.g. in telecommunications.
  "Basket analysis" – correlation-based analysis leading to recommendations of new items (Amazon.com).
  (Semi-)automated fraud/virus detection: guards that protect against procedural or other misuse of a system.
  Forecasting, e.g. the energy consumption of a region, for optimising coal/hydro plants or planning.
  Exploiting textual databases – the Google business: answering user queries; placing content-sensitive ads (Google AdSense).
The need for data mining

"Computers have promised us a fountain of wisdom but delivered a flood of data."
"The amount of information in the world doubles every 20 months." (Frawley, Piatetsky-Shapiro, Matheus, 1991)

A competitive market environment requires sophisticated – and useful – algorithms.

Data acquisition and storage are ubiquitous; algorithms are required to exploit them. The algorithms that exploit this data-rich environment usually come from the machine learning domain.
Machine learning

Historical background / motivation:
  Huge amounts of data that should be processed automatically;
  Mathematics provides general solutions, i.e. solutions not tailored to a given problem;
  Need for a "science" that uses mathematical machinery to solve practical problems.
Definitions for Machine Learning

Machine learning: a collection of methods (from statistics and probability theory) to solve problems met in practice:
  noise filtering for non-linear regression and/or non-Gaussian noise;
  classification: binary, multi-class, partially labelled;
  clustering, inversion problems, density estimation, novelty detection.

Generally, we need to model the data.
Modelling Data

[Figure: samples (x_1, y_1), (x_2, y_2), ..., (x_N, y_N) observed around a function f(x).]

Real world: there "is" a function y = f(x).
Observation process: a corrupted datum is collected for a sample x_n:

  t_n = y_n + ε        (additive noise)
  t_n = h(y_n, ε)      (h – distortion function)

Problem: find the function y = f(x).
Latent variable models

[Figure: from the data set (x_1, y_1), ..., (x_N, y_N), inference within a function class F and an observation process yields f*(x).]

  Data set – collected.
  Assume a function class: polynomial, Fourier expansion, wavelet;
  Observation process – encodes the noise;
  Find the optimal function from the class.
Latent variable models II

We have the data set D = {(x_1, y_1), ..., (x_N, y_N)}.

Consider a function class, e.g.:

  (1)  F = { w^T x + b | w ∈ R^d, b ∈ R }
  (2)  F = { a_0 + Σ_{k=1}^K a_k sin(2πkx) + Σ_{k=1}^K b_k cos(2πkx) | a, b ∈ R^K, a_0 ∈ R }

Assume an observation process: y_n = f(x_n) + ε with ε ∼ N(0, σ²).
Latent variable models III

1 The data set: D = {(x_1, y_1), ..., (x_N, y_N)}.
2 Assume a function class: F = { f(x, θ) | θ ∈ R^p }, with F a polynomial family, etc.
3 Assume an observation process and define a loss function L(y_n, f(x_n, θ)).
  For Gaussian noise: L(y_n, f(x_n, θ)) = (y_n − f(x_n, θ))².
Outline

1 Modelling Data
2 Estimation methods: Maximum Likelihood, Maximum a-posteriori, Bayesian Estimation
3 Unsupervised Methods
Parameter estimation

Estimating parameters means finding the optimal value of θ:

  θ* = arg min_{θ ∈ Ω} L(D, θ)

where Ω is the domain of the parameters and L(D, θ) is a "loss function" for the data set.
Example:

  L(D, θ) = Σ_{n=1}^N L(y_n, f(x_n, θ))
Maximum Likelihood Estimation

L(D, θ) – the (negative log-)likelihood, or loss, function.

Maximum likelihood estimation of the model:

  θ* = arg min_{θ ∈ Ω} L(D, θ)

Example – regression with quadratic loss (from the factorisation of the likelihood over the data):

  L(D, θ) = Σ_{n=1}^N (y_n − f(x_n, θ))²

Drawback: can produce a perfect fit to the data – over-fitting.
Example of an ML estimate

[Figure: scatter plot of height h (cm) against weight w (kg); h from 140 to 190, w from 50 to 110.]

We want to fit a model to the data:
  linear model: h = θ_0 + θ_1 w;
  log-linear model: h = θ_0 + θ_1 log(w);
  higher-order polynomials, e.g. h = θ_0 + θ_1 w + θ_2 w² + θ_3 w³ + ...
M.L. for linear models I

Assume a linear model for the x → y relation:

  f(x_n | θ) = Σ_{ℓ=1}^d θ_ℓ x_ℓ    with    x = [1, x, x², log(x), ...]^T

and a quadratic loss for D = {(x_1, y_1), ..., (x_N, y_N)}:

  E_2(D | f) = Σ_{n=1}^N (y_n − f(x_n | θ))²
M.L. for linear models II

Minimisation:

  Σ_{n=1}^N (y_n − f(x_n | θ))² = (y − Xθ)^T (y − Xθ)
                                = θ^T X^T X θ − 2 θ^T X^T y + y^T y

Solution (setting the gradient to zero):

  0 = 2 X^T X θ − 2 X^T y    ⇒    θ = (X^T X)^{-1} X^T y

where y = [y_1, ..., y_N]^T and X = [x_1, ..., x_N]^T are the transformed data.
M.L. for linear models III

Generalised linear models:
  Use a set of basis functions Φ = [φ_1(·), ..., φ_M(·)].
  Project the inputs into the space spanned by Im(Φ).
  Use a parameter vector of length M: θ = [θ_1, ..., θ_M]^T.
  The model is f(x) = Σ_m θ_m φ_m(x), with θ_m ∈ R.

The optimal parameter vector is

  θ* = (Φ^T Φ)^{-1} Φ^T y
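A minimal sketch of the least-squares fit above, assuming Python with NumPy; the polynomial basis functions and the synthetic data are illustrative choices, not part of the lecture.

```python
import numpy as np

# Synthetic 1-D data (illustrative): y = sin(2x) + noise.
rng = np.random.default_rng(0)
x = np.linspace(0, 3, 40)
y = np.sin(2 * x) + 0.1 * rng.standard_normal(x.size)

def design_matrix(x, degree=3):
    # Columns are phi_0(x)=1, phi_1(x)=x, ..., phi_degree(x)=x^degree.
    return np.vander(x, degree + 1, increasing=True)

Phi = design_matrix(x)

# Maximum-likelihood (least-squares) solution: theta* = (Phi^T Phi)^{-1} Phi^T y.
# lstsq is the numerically stable way to evaluate this expression.
theta_ml, *_ = np.linalg.lstsq(Phi, y, rcond=None)

y_fit = Phi @ theta_ml               # fitted values f(x_n | theta*)
residual = np.sum((y - y_fit) ** 2)  # quadratic loss E_2(D | f)
print(theta_ml, residual)
```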
Maximum Likelihood – Summary

There are many candidate model families:
  the degree of a polynomial specifies a model family;
  the order of a Fourier expansion;
  a mixture of {log, sin, cos, ...} is also a family.

Selecting the "best family" is a difficult modelling problem.
In maximum likelihood there is no control over how well a family suits a given data set.
Rule of thumb: keep the number of parameters smaller than √(#data).
Maximum a-posteriori I

The generalised linear model is powerful – it can be extremely complex;
with no complexity control there is an overfitting problem.

Aim: include knowledge in the inference process.
Our beliefs are reflected by the choice of the candidate functions.

Goal:
  specify prior knowledge using probabilities;
  use probability theory for consistent estimation;
  encode the observation noise in the model.
Maximum a-posteriori – Data/noise

Probabilistic data description: how likely is it that θ generated the data?

  y = f(x)      ⇔  y − f(x) ∼ δ_0
  y = f(x) + ε  ⇔  y − f(x) ∼ N

Gaussian noise: y − f(x) ∼ N(0, σ²), i.e.

  P(y | f(x)) = 1/(√(2π) σ) · exp( −(y − f(x))² / (2σ²) )
Maximum a-posteriori – Prior

William of Ockham (1285–1349) – principle:
  "Entities should not be multiplied beyond necessity."

Also known as (wiki...):
  the "principle of simplicity" – KISS,
  "When you hear hoofbeats, think horses, not zebras".

Simple models ≈ small number of parameters (L0 norm), in practice replaced by the L2 norm.

Probabilistic representation:

  p_0(θ) ∝ exp( −||θ||² / (2σ_0²) )
M.A.P. – Inference

Probabilities are assigned to D via the (log-)likelihood function:

  P(y_n | x_n, θ, F) ∝ exp[ −L(y_n, f(x_n, θ)) ]

Prior probabilities for θ:

  p_0(θ) ∝ exp( −||θ||² / (2σ_0²) )

A-posteriori probability:

  p(θ | D, F) = P(D | θ) p_0(θ) / p(D | F)

where p(D | F) is the probability of the data for a given family.
M.A.P. – Inference II

M.A.P. estimation finds the θ with largest posterior probability:

  θ*_MAP = arg max_{θ ∈ Ω} p(θ | D, F)

Example: with loss L(y_n, f(x_n, θ)) and a Gaussian prior:

  θ*_MAP = arg max_{θ ∈ Ω} [ K − (1/2) Σ_n L(y_n, f(x_n, θ)) − ||θ||² / (2σ_0²) ]

For σ_0² = ∞ this reduces to maximum likelihood (after a change of sign and max → min).
M.A.P. – Example I

[Figure: degree-6 polynomial fits to the training data, shown together with the true function, for noise deviations 10⁻³ and 10⁻².]
M.A.P. – Linear models I

[Figure: height h (cm) against weight w (kg) with degree p = 10 polynomial fits for decreasing prior widths.]

Aim: test different levels of flexibility with p = 10.
Prior width: σ_0² = 10⁶, 10⁵, 10⁴, 10³, 10², 10¹, 10⁰.
M.A.P. – Linear models II

  θ*_MAP = arg max_{θ ∈ Ω} [ K − (1/2) Σ_n E_2(y_n, f(x_n, θ)) − ||θ||² / (2σ_0²) ]

Transform into vector notation:

  θ*_MAP = arg max_{θ ∈ Ω} [ K − (1/2) (y − Xθ)^T (y − Xθ) − θ^T θ / (2σ_0²) ]

and solve for θ by differentiation:

  X^T (y − Xθ) − (1/σ_0²) I_d θ = 0
  θ*_MAP = ( X^T X + (1/σ_0²) I_d )^{-1} X^T y

Again M.L. for σ_0² = ∞.
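A sketch of the M.A.P. solution for the linear model (ridge regression), assuming NumPy; the design matrix, targets, and prior widths are illustrative. Setting σ_0² very large recovers the M.L. solution, as noted above.

```python
import numpy as np

def map_linear_fit(X, y, sigma0_sq):
    """MAP estimate theta = (X^T X + I/sigma0^2)^{-1} X^T y."""
    d = X.shape[1]
    A = X.T @ X + np.eye(d) / sigma0_sq
    return np.linalg.solve(A, X.T @ y)

# Illustrative use: polynomial features of degree 10, echoing the slides' p = 10 example.
rng = np.random.default_rng(1)
w = np.linspace(50, 110, 30)                        # "weights" (inputs)
h = 120 + 0.6 * w + 2.0 * rng.standard_normal(30)   # "heights" (targets), synthetic
X = np.vander((w - w.mean()) / w.std(), 11, increasing=True)

for sigma0_sq in [1e6, 1e3, 1e0]:
    theta = map_linear_fit(X, h, sigma0_sq)
    print(sigma0_sq, round(float(np.linalg.norm(theta)), 2))  # shrinkage grows as sigma_0^2 shrinks
```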
M.A.P. – Summary

Maximum a-posteriori models:
  allow for the inclusion of prior knowledge;
  may protect against overfitting;
  can measure the fitness of the family to the data – a procedure called M.L. type II.
M.A.P. application – M.L. II

Idea: instead of computing the most probable value of θ, we measure the fit of the model family F to the data D:

  P(D | F) = Σ_{θ_ℓ ∈ Ω} p(D, θ_ℓ | F) = Σ_{θ_ℓ ∈ Ω} p(D | θ_ℓ, F) p_0(θ_ℓ | F)

For the Gaussian noise case and a polynomial of order K:

  log P(D | F) = log ∫_{Ω_θ} dθ p(D | θ, F) p_0(θ | F)
               = log N(y | 0, Σ_X)
               = −(1/2) [ N log(2π) + log|Σ_X| + y^T Σ_X^{-1} y ]

where Σ_X = I_N σ_n² + X Σ_0 X^T, with X = [x^0, x^1, ..., x^K] and Σ_0 = diag(σ_0², σ_1², ..., σ_K²) = σ_p² I_{K+1}.
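A sketch of the evidence computation above for a polynomial family of order K under Gaussian noise, assuming NumPy; the data, noise variance and prior variance are illustrative.

```python
import numpy as np

def log_evidence(x, y, K, sigma_n_sq, sigma_p_sq):
    """log P(D|F) = log N(y | 0, Sigma_X), with Sigma_X = sigma_n^2 I + sigma_p^2 X X^T."""
    X = np.vander(x, K + 1, increasing=True)        # columns x^0, x^1, ..., x^K
    N = len(y)
    Sigma_X = sigma_n_sq * np.eye(N) + sigma_p_sq * (X @ X.T)
    sign, logdet = np.linalg.slogdet(Sigma_X)
    quad = y @ np.linalg.solve(Sigma_X, y)
    return -0.5 * (N * np.log(2 * np.pi) + logdet + quad)

# Compare polynomial orders by their evidence (illustrative data).
rng = np.random.default_rng(2)
x = np.linspace(-1, 1, 25)
y = 1.0 - 2.0 * x + 0.5 * x**3 + 0.05 * rng.standard_normal(x.size)
for K in range(1, 7):
    print(K, round(log_evidence(x, y, K, sigma_n_sq=0.05**2, sigma_p_sq=1.0), 2))
```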
M.A.P. ⇒ M.L. II

[Figure: log P(D | k_poly) as a function of log(σ_p²), plotted for polynomial families of order k = 1, ..., 10.]

Aim: compare the different polynomial families by their evidence.
Bayesian estimation – Intro

M.L. and M.A.P. estimates provide single solutions.
Point estimates lack an assessment of (un)certainty.

Better solution: for a query x*, the system output is probabilistic:

  x*  ⇒  p(y* | x*, F)

Tool: go beyond the M.A.P. solution and use the a-posteriori distribution of the parameters.
Bayesian estimation II

We again use Bayes' rule:

  p(θ | D, F) = P(D | θ) p_0(θ) / p(D | F)    with    p(D | F) = ∫_Ω dθ P(D | θ) p_0(θ)

and exploit the whole posterior distribution of the parameters.

A-posteriori parameter estimates:
we operate with p_post(θ) := p(θ | D, F) and use the total probability rule

  p(y* | D, F) = Σ_{θ_ℓ ∈ Ω_θ} p(y* | θ_ℓ, F) p_post(θ_ℓ)

in assessing the system output.
Bayesian estimation – Example I

Given the data D = {(x_1, y_1), ..., (x_N, y_N)}, estimate the linear fit

  y = θ_0 + Σ_{i=1}^d θ_i x_i := θ^T x,    with θ = [θ_0, θ_1, ..., θ_d]^T and x = [1, x_1, ..., x_d]^T.

Gaussian distributions for noise and prior:

  ε = y_n − θ^T x_n ∼ N(0, σ_n²),    θ ∼ N(0, Σ_0)
Bayesian estimation – Example II

Goal: compute the posterior distribution p_post(θ).

  p_post(θ) ∝ p_0(θ | Σ_0) Π_{n=1}^N P(y_n | θ^T x_n)

  −2 log p_post(θ) = K_post + (1/σ_n²) (y − Xθ)^T (y − Xθ) + θ^T Σ_0^{-1} θ
                   = θ^T [ (1/σ_n²) X^T X + Σ_0^{-1} ] θ − (2/σ_n²) θ^T X^T y + K'_post
                   = (θ − μ_post)^T Σ_post^{-1} (θ − μ_post) + K''_post

and by identification

  Σ_post = [ (1/σ_n²) X^T X + Σ_0^{-1} ]^{-1}    and    μ_post = Σ_post X^T y / σ_n²
Bayesian estimation – Example III

Bayesian linear model: the posterior distribution of the parameters is a Gaussian with

  Σ_post = [ (1/σ_n²) X^T X + Σ_0^{-1} ]^{-1}    and    μ_post = Σ_post X^T y / σ_n²

Point estimates are recovered as special cases:
  M.L.: take Σ_0 → ∞ and keep only μ_post;
  M.A.P.: approximate the distribution by a single value at its maximum, i.e. μ_post.
Bayesian estimation – Example IV

Prediction for a new value x*: use the likelihood P(y* | x*, θ, F), the posterior for θ, and Bayes' rule. The steps:

  p(y* | x*, D, F) = ∫_{Ω_θ} dθ p(y* | x*, θ, F) p_post(θ | D, F)
                   = ∫_{Ω_θ} dθ exp{ −(1/2) [ K* + (y* − θ^T x*)²/σ_n² + (θ − μ_post)^T Σ_post^{-1} (θ − μ_post) ] }
                   = ∫_{Ω_θ} dθ exp{ −(1/2) [ K* + y*²/σ_n² − a^T C^{-1} a + Q(θ) ] }

where

  a = x* y* / σ_n² + Σ_post^{-1} μ_post,    C = x* x*^T / σ_n² + Σ_post^{-1}
Bayesian estimation – Example V

Integrating out the quadratic in θ gives the predictive distribution at x*:

  p(y* | x*, D, F) = exp[ −(1/2) ( K* + (y* − x*^T μ_post)² / (σ_n² + x*^T Σ_post x*) ) ]
                   = N( y* | x*^T μ_post, σ_n² + x*^T Σ_post x* )

With the predictive distribution we can:
  measure the variance of the prediction at each point: σ*² = σ_n² + x*^T Σ_post x*;
  sample from the parameters and plot the candidate predictors.
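A minimal sketch of the posterior and predictive formulas above, assuming NumPy; the prior covariance Σ_0 and the toy data are illustrative.

```python
import numpy as np

def bayes_linear(X, y, sigma_n_sq, Sigma_0):
    """Posterior: Sigma_post = (X^T X / sigma_n^2 + Sigma_0^{-1})^{-1},
       mu_post = Sigma_post X^T y / sigma_n^2."""
    Sigma_post = np.linalg.inv(X.T @ X / sigma_n_sq + np.linalg.inv(Sigma_0))
    mu_post = Sigma_post @ X.T @ y / sigma_n_sq
    return mu_post, Sigma_post

def predict(x_star, mu_post, Sigma_post, sigma_n_sq):
    """Predictive mean x*^T mu_post and variance sigma_n^2 + x*^T Sigma_post x*."""
    mean = x_star @ mu_post
    var = sigma_n_sq + x_star @ Sigma_post @ x_star
    return mean, var

# Illustrative data: y = 1 + 2x with noise, features [1, x].
rng = np.random.default_rng(3)
x = rng.uniform(-1, 1, 20)
X = np.column_stack([np.ones_like(x), x])
y = 1.0 + 2.0 * x + 0.1 * rng.standard_normal(20)

mu_post, Sigma_post = bayes_linear(X, y, sigma_n_sq=0.1**2, Sigma_0=np.eye(2))
print(predict(np.array([1.0, 0.5]), mu_post, Sigma_post, 0.1**2))
```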
Bayesian example – Error bars

[Figure: degree-6 polynomial fit with noise variance σ² = 1; the predictive error bars are the symmetric thin lines around the mean prediction.]
Bayesian example – Predictive samples

[Figure: predictive samples; third-order polynomials are used to approximate the data.]
Bayesian estimation – Problems

When computing p_post(θ | D, F) we assumed that the posterior can be represented analytically.
In general this is not the case.

Approximations are needed for
  the posterior distribution,
  the predictive distribution.

In Bayesian modelling an important issue is how we approximate the posterior distribution.
Bayesian estimation – Summary

Complete specification of the model:
  can include prior beliefs about the model.
Accurate predictions:
  can compute the posterior probabilities at each test location.
Computational cost:
  using the models for prediction can be difficult and expensive in time and memory.

Bayesian models: flexible and accurate – if priors about the model are used.
Outline

1 Modelling Data
2 Estimation methods
3 Unsupervised Methods: General concepts, Principal Components, Independent Components, Mixture Models
Unsupervised setting

Data can be unlabelled, i.e. no values y are associated with the inputs x.
We want to "extract" information from D = {x_1, ..., x_N}.

We assume that the data – although high-dimensional – span a manifold of much smaller dimension.
The task is to find the subspace corresponding to the data span.
Models in unsupervised learning

Again the model of the data is important:
  Principal Components;
  Independent Components;
  Mixture models.

[Figures: three two-dimensional data clouds (x_1, x_2) illustrating each model.]
The PCA model I

Simple data structure: a spherical cluster that is
  translated,
  scaled,
  rotated.

[Figure: two-dimensional data cloud with its principal axes.]

We aim to find the principal directions of the data spread.
Principal direction: the direction u along which the data preserves most of its variance.
The PCA model II

Principal direction:

  u = argmax_{||u||=1} (1/2N) Σ_{n=1}^N (u^T x_n − u^T x̄)²

We pre-process so that x̄ = 0. Replacing the empirical average with the covariance Σ_x and adding a Lagrange multiplier λ for the constraint:

  u = argmax_{u, λ} (1/2) u^T Σ_x u − (λ/2) (||u||² − 1)

Differentiating w.r.t. u:

  Σ_x u − λ u = 0
The PCA model III

The optimal solution must obey

  Σ_x u = λ u

i.e. the eigendecomposition of the covariance matrix: (λ*, u*) is an eigenvalue–eigenvector pair of the system.

Substituting back, the value of the objective is λ*.  ⇒  The optimum is attained at λ* = λ_max.

Principal direction: the eigenvector u_max corresponding to the largest eigenvalue of the system.
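A small sketch, assuming NumPy, checking that the variance of the projection onto the top eigenvector of the covariance equals the largest eigenvalue, as derived above; the 2-D data is illustrative.

```python
import numpy as np

rng = np.random.default_rng(4)
# Illustrative 2-D cloud: a stretched and rotated Gaussian.
Z = rng.standard_normal((500, 2)) * np.array([3.0, 0.5])
angle = np.deg2rad(30)
R = np.array([[np.cos(angle), -np.sin(angle)], [np.sin(angle), np.cos(angle)]])
X = Z @ R.T
X = X - X.mean(axis=0)                      # pre-process: zero mean

Sigma_x = X.T @ X / len(X)                  # empirical covariance
eigvals, eigvecs = np.linalg.eigh(Sigma_x)  # eigenvalues in ascending order
u_max = eigvecs[:, -1]                      # eigenvector of the largest eigenvalue

# Variance of the projection onto u_max equals lambda_max (up to sampling noise).
print(np.var(X @ u_max), eigvals[-1])
```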
The PCA model – Data mining I

How is this used in data mining? Assume that the data is:
  jointly Gaussian: x ∼ N(m_x, Σ_x);
  high-dimensional;
  only a few (here 2) directions are relevant.

[Figure: high-dimensional Gaussian data with two relevant directions.]
The PCA model – Data mining II

How is this used in data mining?
  Subtract the mean.
  Compute the eigendecomposition.
  Select the K eigenvectors corresponding to the K largest eigenvalues.
  Compute the K projections: z_nℓ = x_n^T u_ℓ.

The projection uses the matrix P := [u_1, ..., u_K]:

  Z = X P

and z_n is used as a compact representation of x_n.
The PCA model – Data mining III

Reconstruction:

  x'_n = Σ_{ℓ=1}^K z_nℓ u_ℓ,    or, in matrix notation,    X' = Z P^T

PCA projection analysis:

  E_PCA = (1/N) Σ_{n=1}^N ||x_n − x'_n||² = (1/N) tr[ (X − X')^T (X − X') ]
        = tr[ Σ_x − P Σ_z P^T ]
        = tr[ U (diag(λ_1, ..., λ_d) − diag(λ_1, ..., λ_K, 0, ..., 0)) U^T ]
        = tr[ U^T U diag(0, ..., 0, λ_{K+1}, ..., λ_d) ]
        = Σ_{ℓ=1}^{d−K} λ_{K+ℓ}
The PCA model – Data mining IV

PCA reconstruction error – the error made using only the first K PCA directions:

  E_PCA = Σ_{ℓ=1}^{d−K} λ_{K+ℓ}

PCA properties:
  the PCA system is orthonormal: u_ℓ^T u_r = δ_{ℓr};
  reconstruction is fast;
  the spherical assumption is critical.
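A sketch of the full PCA pipeline from the slides above, assuming NumPy: center, eigendecompose, keep K directions, project, reconstruct, and compare the reconstruction error with the sum of the discarded eigenvalues. The random data is illustrative.

```python
import numpy as np

def pca_reconstruct(X, K):
    Xc = X - X.mean(axis=0)                        # subtract the mean
    Sigma = Xc.T @ Xc / len(Xc)                    # empirical covariance
    eigvals, eigvecs = np.linalg.eigh(Sigma)       # ascending order
    order = np.argsort(eigvals)[::-1]              # sort descending
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    P = eigvecs[:, :K]                             # P = [u_1, ..., u_K]
    Z = Xc @ P                                     # projections z_n
    X_rec = Z @ P.T + X.mean(axis=0)               # reconstruction x'_n
    err = np.mean(np.sum((X - X_rec) ** 2, axis=1))
    return err, eigvals[K:].sum()                  # E_PCA vs. sum of discarded eigenvalues

rng = np.random.default_rng(5)
X = rng.standard_normal((1000, 5)) @ np.diag([3.0, 2.0, 1.0, 0.3, 0.1])
print(pca_reconstruct(X, K=2))                     # the two numbers should agree
```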
PCA application – USPS I

USPS digits – a testbed for several models.
PCA application – USPS II

USPS characteristics:
  handwritten data, centered and scaled;
  ≈ 10,000 items of 16 × 16 grayscale images.

[Figure: cumulative normalised eigenvalue sum k_r = Σ_{ℓ=1}^r λ_ℓ (in %) against r.]

Conclusions for the USPS set:
  the normalised λ_1 = 0.24  ⇒  u_1 accounts for 24% of the variance;
  at r ≈ 10 more than 70% of the variance is explained;
  at r ≈ 50 more than 98%  ⇒  50 numbers instead of 256.
PCA application – USPS III

Visualisation application: the data visualised along the first two eigendirections.

[Figure: USPS digits projected onto the first two principal components.]
PCA application – USPS IV

Visualisation application: detail of the projection.
The ICA model I – Motivation

Start from PCA: x = P z is a generative model for the data.

We assumed that
  z are i.i.d. Gaussian random variables, z ∼ N(0, diag(λ_ℓ));
  ⇒ the components of x are not independent;
  ⇒ z are Gaussian sources.

In most real data:
  the sources are not Gaussian,
  but the sources are independent.
We exploit that!
The ICA model II

The model assumption:

  x = A s

where s are independent sources and A is a linear mixing matrix.

We look for a matrix B that recovers the sources:

  s' := B x = B (A s) = (B A) s

i.e. (B A) is the identity up to a permutation and scaling, but independence is retained.
The ICA model III

In practice: s' := B x with s = [s_1, ..., s_K] all independent sources.

Independence test: the KL divergence between the joint distribution and the product of the marginals, e.g. for two sources

  B = argmin_{B ∈ SO_d} KL( p(s_1, s_2) || p(s_1) p(s_2) )

where SO_d is the group of matrices with |B| = 1.

In ICA we look for the matrix B that minimises

  ∫_Ω dp(s) log p(s_1, ..., s_d) − Σ_ℓ ∫_{Ω_ℓ} dp(s_ℓ) log p(s_ℓ)
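A minimal sketch of blind source separation in this spirit, assuming scikit-learn's FastICA implementation is available (the lecture's figures use the original FastICA package); the source signals and mixing matrix are illustrative.

```python
import numpy as np
from sklearn.decomposition import FastICA

rng = np.random.default_rng(6)
t = np.linspace(0, 8, 2000)

# Two independent, non-Gaussian sources (illustrative): a sine and a square wave.
s1 = np.sin(2 * t)
s2 = np.sign(np.sin(3 * t))
S = np.column_stack([s1, s2])

A = np.array([[1.0, 0.5], [0.4, 1.2]])   # mixing matrix
X = S @ A.T                              # observed mixtures x = A s

ica = FastICA(n_components=2, random_state=0)
S_hat = ica.fit_transform(X)             # recovered sources (up to permutation/scaling)
print(np.corrcoef(S.T, S_hat.T).round(2))
```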
The KL-divergence – Detour

Kullback-Leibler divergence:

  KL(p || q) = Σ_x p(x) log( p(x) / q(x) )

  is zero if and only if p = q;
  is not a measure of distance (but close to it!);
  is efficient to work with when exponential families are used.

Short proof that KL(p || q) ≥ 0:

  0 = log 1 = log( Σ_x q(x) ) = log( Σ_x p(x) · q(x)/p(x) ) ≥ Σ_x p(x) log( q(x)/p(x) ) = −KL(p || q)

(using Jensen's inequality for the concave log), hence KL(p || q) ≥ 0.
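A tiny sketch, assuming NumPy, of the discrete KL divergence above, checking non-negativity and asymmetry on two illustrative distributions.

```python
import numpy as np

def kl(p, q):
    """KL(p || q) = sum_x p(x) log(p(x)/q(x)) for discrete distributions without zeros."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    return float(np.sum(p * np.log(p / q)))

p = np.array([0.5, 0.3, 0.2])
q = np.array([0.4, 0.4, 0.2])
print(kl(p, q), kl(q, p))   # both non-negative, generally different (not symmetric)
print(kl(p, p))             # zero iff p = q
```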
ICA Application – Data

Separation of source signals.

[Figure: the observed mixtures m_1, ..., m_4.]
ICA Application – Results

Results of the separation.

[Figure: the recovered sources, obtained with the FastICA package.]
Applications of ICA

  Cocktail party problem: separating noisy, multiple sources from multiple observations.
  Fetal ECG: separating the ECG signal of a fetus from its mother's ECG.
  MEG recordings: separating MEG "sources".
  Financial data: finding hidden factors in financial data.
  Noise reduction: noise reduction in natural images.
  Interference removal: interference removal from CDMA (code-division multiple access) communication systems.
The mixture model – Introduction

The data structure is more complex: there is more than a single source for the data.

[Figure: two-dimensional data with several clusters.]

The mixture model:

  P(x) = Σ_{k=1}^K π_k p_k(x | μ_k, Σ_k)        (1)

where
  π_1, ..., π_K – the mixing components;
  p_k(x | μ_k, Σ_k) – the density of a component.
The components are usually called clusters.
The mixture model – Data generation

The generation process reflects the assumptions about the model.

Data generation:
  first we select the component,
  then we sample from the component's density function.

When modelling data we do not know:
  which point belongs to which cluster;
  what the parameters of each density function are.
The mixture model – Example I

Old Faithful geyser in Yellowstone National Park. Characterised by:
  intense eruptions;
  differing times between them.

Rule: the duration is 1.5 to 5 minutes, and the length of an eruption helps determine the interval.
  If an eruption lasts less than 2 minutes, the interval will be around 55 minutes.
  If the eruption lasts 4.5 minutes, the interval may be around 88 minutes.
The mixture model – Example II

[Figure: interval between eruptions (minutes) against eruption duration (minutes) for the Old Faithful data.]

  The longer the duration, the longer the interval.
  The linear relation I = θ_0 + θ_1 d is not the best fit.
  There are only very few eruptions lasting ≈ 3 minutes.
The mixture model I

Assumptions:
  We know the family of the individual density functions: these densities are parametrised by a few parameters.
  The densities are easily identifiable: if we knew which data point belongs to which cluster, each density function would be easy to identify.

Gaussian densities are often used – they fulfill both "conditions".
The mixture model II

The Gaussian mixture model:

  p(x) = π_1 N_1(x | μ_1, Σ_1) + π_2 N_2(x | μ_2, Σ_2)

For known densities (centres and ellipses):

  p(k | x_n) = p(k) N_k(x_n | μ_k, Σ_k) / Σ_ℓ p(ℓ) N_ℓ(x_n | μ_ℓ, Σ_ℓ)

i.e. we know the probability that a data point comes from cluster k.

For D we collect these values in a table:

  x      p(x|1)   p(x|2)
  x_1    γ_11     γ_12
  ...    ...      ...
  x_N    γ_N1     γ_N2

γ_nℓ – the responsibility of x_n in cluster ℓ.

[Figure: Old Faithful data coloured by responsibility, shaded from red to green.]
The mixture model III

When the γ_nℓ are known, the parameters are computed using the data weighted by their responsibilities:

  (μ_k, Σ_k) = argmax_{μ, Σ} Π_{n=1}^N ( N_k(x_n | μ, Σ) )^{γ_nk}    for all k,

that is, (μ_k, Σ_k) is obtained by maximising Σ_n γ_nk log N(x_n | μ, Σ).

When making inference we have to find both the responsibility vectors and the parameters of the mixture.

Given data D, iterate:
  initial guess: (μ_1, Σ_1), ..., (μ_K, Σ_K);
  re-estimate the responsibilities γ_n1, ..., γ_nK for each x_n;
  re-estimate the parameters (μ_1, Σ_1), ..., (μ_K, Σ_K) and (π_1, ..., π_K).
The mixture model – Summary

Responsibilities γ: the additional latent variables needed to help the computation.

In the mixture model the goal is to
  fit the model to the data,
  decide which submodel gets a particular data point,
achieved by maximisation of the log-likelihood function.
The EM algorithm I

  (π, Θ) = argmax Σ_n log[ Σ_ℓ π_ℓ N_ℓ(x_n | μ_ℓ, Σ_ℓ) ]

where
  Θ = [μ_1, Σ_1, ..., μ_K, Σ_K] is the vector of parameters;
  π = [π_1, ..., π_K] are the shares of the factors.

Problem with the optimisation: the parameters are not separable due to the sum within the logarithm.
Solution: use an approximation.
The EM algorithm II

  log P(D | π, Θ) = Σ_n log[ Σ_ℓ π_ℓ N_ℓ(x_n | μ_ℓ, Σ_ℓ) ] = Σ_n log[ Σ_ℓ p_ℓ(x_n, ℓ) ]

Use Jensen's inequality:

  log( Σ_ℓ p_ℓ(x_n, ℓ | θ_ℓ) ) = log( Σ_ℓ q_n(ℓ) · p_ℓ(x_n, ℓ | θ_ℓ) / q_n(ℓ) )
                              ≥ Σ_ℓ q_n(ℓ) log( p_ℓ(x_n, ℓ | θ_ℓ) / q_n(ℓ) )

for any distribution [q_n(1), ..., q_n(K)].
Jensen Inequality – Detour

[Figure: a concave f(z) with points z_1, z_2 and γ_2 = 0.75, comparing f(γ_1 z_1 + γ_2 z_2) with γ_1 f(z_1) + γ_2 f(z_2).]

Jensen's Inequality: for any concave f(z), any z_1 and z_2, and any γ_1, γ_2 > 0 such that γ_1 + γ_2 = 1:

  f(γ_1 z_1 + γ_2 z_2) ≥ γ_1 f(z_1) + γ_2 f(z_2)
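A one-line numerical check, assuming NumPy, of Jensen's inequality for the concave log with illustrative z_1, z_2 and weights.

```python
import numpy as np

z1, z2 = 1.0, 9.0
g1, g2 = 0.25, 0.75                       # gamma_1 + gamma_2 = 1
lhs = np.log(g1 * z1 + g2 * z2)           # f(gamma_1 z_1 + gamma_2 z_2)
rhs = g1 * np.log(z1) + g2 * np.log(z2)   # gamma_1 f(z_1) + gamma_2 f(z_2)
print(lhs, rhs, lhs >= rhs)               # True: log is concave
```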
The EM algorithm III

  log( Σ_ℓ p_ℓ(x_n, ℓ | θ_ℓ) ) ≥ Σ_ℓ q_n(ℓ) log( p_ℓ(x_n, ℓ) / q_n(ℓ) )

for any distribution q_n(·). Replacing the likelihood with the right-hand side, we have

  log P(D | π, Θ) ≥ Σ_n Σ_ℓ q_n(ℓ) log( p_ℓ(x_n | θ_ℓ) / q_n(ℓ) ) = L

and therefore the optimisation w.r.t. the cluster parameters separates:

  ∂L/∂θ_ℓ = 0    ⇒    0 = Σ_n q_n(ℓ) ∂ log p_ℓ(x_n | θ_ℓ) / ∂θ_ℓ

For distributions from the exponential family this optimisation is easy.
The EM algorithm IV

Any set of distributions q_1(·), ..., q_N(·) provides a lower bound to the log-likelihood.
We should choose the distributions that are closest to the current parameter set.

Assume the parameters have the current value θ^0 and minimise the difference

  log p(x_n | θ^0) − L_n = Σ_ℓ q_n(ℓ) log p(x_n | θ^0) − Σ_ℓ q_n(ℓ) log( P(x_n, ℓ | θ^0) / q_n(ℓ) )
                        = Σ_ℓ q_n(ℓ) log( q_n(ℓ) p(x_n | θ^0) / P(x_n, ℓ | θ^0) )

Observe that by setting

  q_n(ℓ) = P(x_n, ℓ | θ^0) / p(x_n | θ^0)

every log term vanishes and the difference becomes Σ_ℓ q_n(ℓ) · 0 = 0.
The EM algorithm V

The EM algorithm:
  Init – initialise the model parameters;
  E step – compute the responsibilities γ_nℓ = q_n(ℓ);
  M step – for each component ℓ solve

    0 = Σ_n q_n(ℓ) ∂ log p_ℓ(x_n | θ_ℓ) / ∂θ_ℓ

  Repeat – go to the E step.
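A compact sketch of the EM loop above for a two-component Gaussian mixture in 1-D, assuming NumPy; the initialisation and the synthetic data are illustrative (two clusters, loosely echoing the Old Faithful durations).

```python
import numpy as np

def normal_pdf(x, mu, var):
    return np.exp(-(x - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

rng = np.random.default_rng(7)
x = np.concatenate([rng.normal(2.0, 0.3, 120), rng.normal(4.3, 0.4, 180)])

# Init: mixing weights, means, variances.
pi = np.array([0.5, 0.5])
mu = np.array([1.0, 5.0])
var = np.array([1.0, 1.0])

for _ in range(50):
    # E step: responsibilities gamma_{nk} proportional to pi_k N_k(x_n | mu_k, var_k).
    dens = np.stack([pi[k] * normal_pdf(x, mu[k], var[k]) for k in range(2)], axis=1)
    gamma = dens / dens.sum(axis=1, keepdims=True)
    # M step: weighted maximum-likelihood updates for each component.
    Nk = gamma.sum(axis=0)
    pi = Nk / len(x)
    mu = (gamma * x[:, None]).sum(axis=0) / Nk
    var = (gamma * (x[:, None] - mu) ** 2).sum(axis=0) / Nk

print(pi.round(2), mu.round(2), var.round(2))
```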
EM application

[Figures (I–IV): the Old Faithful data, interval between eruptions against duration, shown at successive stages of the EM fit.]
References

J. M. Bernardo and A. F. Smith. Bayesian Theory. John Wiley & Sons, 1994.
C. M. Bishop. Pattern Recognition and Machine Learning. Springer Verlag, New York, N.Y., 2006.
T. M. Cover and J. A. Thomas. Elements of Information Theory. John Wiley & Sons, 1991.
A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, series B, 39:1–38, 1977.
T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer Verlag, 2001.