Representing and comparing probabilities: Part 2
Arthur Gretton
Gatsby Computational Neuroscience Unit, University College London
UAI, 2017
Testing against a probabilistic model
Statistical model criticism

    MMD(P, Q) = \|f^*\|_{\mathcal{F}} = \sup_{\|f\|_{\mathcal{F}} \le 1} \left[ E_Q f - E_P f \right]

[Figure: densities p(x) and q(x), and the witness function f*(x)]

f* is the witness function.
Can we compute MMD with samples from Q and a model P?
Problem: we usually can't compute E_P f in closed form.
Stein idea

To get rid of E_P f in
    \sup_{\|f\|_{\mathcal{F}} \le 1} \left[ E_Q f - E_P f \right],
we define the Stein operator
    [T_P f](x) = \frac{1}{p(x)} \frac{d}{dx} \big( f(x)\, p(x) \big).
Then E_P T_P f = 0, subject to appropriate boundary conditions.
(Oates, Girolami, Chopin, 2016)
Stein idea: proof

    E_P [T_P f] = \int \frac{1}{p(x)} \frac{d}{dx}\big( f(x)\, p(x) \big)\, p(x)\, dx
                = \int \frac{d}{dx}\big( f(x)\, p(x) \big)\, dx
                = \big[ f(x)\, p(x) \big]_{-\infty}^{\infty} = 0
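As a quick numerical sanity check of this identity (not part of the slides), the sketch below integrates [T_P f](x) p(x) over a wide interval for a standard normal p and a smooth, decaying f; the function choices and helper names are illustrative assumptions.

import numpy as np
from scipy.integrate import quad

# Standard normal model p(x) and a smooth, decaying test function f(x) (illustrative choices).
p = lambda x: np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)
f = lambda x: np.sin(x) * np.exp(-0.1 * x**2)

def stein_op(x, eps=1e-5):
    # [T_p f](x) = (1/p(x)) * d/dx [ f(x) p(x) ], via a central finite difference.
    return (f(x + eps) * p(x + eps) - f(x - eps) * p(x - eps)) / (2 * eps * p(x))

# E_P[T_p f] = integral of [T_p f](x) p(x) dx, which should vanish up to numerical error.
value, _ = quad(lambda x: stein_op(x) * p(x), -10, 10)
print(value)  # approximately 0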
Kernel Stein Discrepancy

Stein operator:
    T_P f = \partial_x f + f\, \partial_x (\log p)

Kernel Stein Discrepancy (KSD):
    KSD(P, Q; \mathcal{F}) = \sup_{\|g\|_{\mathcal{F}} \le 1} \big[ E_Q T_P g - E_P T_P g \big] = \sup_{\|g\|_{\mathcal{F}} \le 1} E_Q T_P g

[Figure: densities p(x) and q(x), and the Stein witness function g*(x)]
Kernel Stein Discrepancy

Closed-form expression for KSD: given Z, Z' ~ Q,
    KSD(P, Q; \mathcal{F}) = E_Q\, h_p(Z, Z'),
where
    h_p(x, y) := \partial_x \log p(x)\, \partial_y \log p(y)\, k(x, y)
               + \partial_y \log p(y)\, \partial_x k(x, y)
               + \partial_x \log p(x)\, \partial_y k(x, y)
               + \partial_x \partial_y k(x, y)
and k is the RKHS kernel for \mathcal{F}.
(Chwialkowski, Strathmann, G., ICML 2016; Liu, Lee, Jordan, ICML 2016)

Only depends on the kernel and on \partial_x \log p(x): we do not need to normalize p, or to sample from it.
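A minimal sketch of this closed form in one dimension, assuming a Gaussian kernel and a standard normal model (so \partial_x \log p(x) = -x); the function names, bandwidth, and V-statistic averaging over all pairs are my own choices, not from the slides.

import numpy as np

def ksd_vstat(z, score_p, sigma=1.0):
    # V-statistic estimate of E_q h_p(Z, Z') for 1-D samples z ~ q.
    # score_p(x) returns d/dx log p(x) -- the only thing needed from the model.
    z = np.asarray(z, dtype=float)
    s = score_p(z)                             # s_i = d/dx log p(z_i)
    d = z[:, None] - z[None, :]                # pairwise differences z_i - z_j
    k = np.exp(-d**2 / (2 * sigma**2))         # Gaussian kernel k(z_i, z_j)
    dk_dx = -d / sigma**2 * k                  # d/dx k(x, y) at (z_i, z_j)
    dk_dy = d / sigma**2 * k                   # d/dy k(x, y)
    dk_dxdy = (1.0 / sigma**2 - d**2 / sigma**4) * k
    h = (s[:, None] * s[None, :] * k
         + s[None, :] * dk_dx
         + s[:, None] * dk_dy
         + dk_dxdy)
    return h.mean()

# Example: model p = N(0, 1), so d/dx log p(x) = -x; samples from a slightly shifted q.
rng = np.random.default_rng(0)
z = rng.normal(loc=0.5, scale=1.0, size=500)
print(ksd_vstat(z, score_p=lambda x: -x))      # positive when q differs from p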
Statistical model criticism

Chicago crime data.
[Figure: crime data modelled by a Gaussian mixture with two components, together with its Stein witness function; then a Gaussian mixture with ten components, with its Stein witness function]
Code: https://github.com/karlnapf/kernel_goodness_of_fit
Kernel Stein Discrepancy

Further applications: evaluation of approximate MCMC methods.
(Chwialkowski, Strathmann, G., ICML 2016; Gorham, Mackey, ICML 2017)

What kernel to use? The inverse multiquadric kernel,
    k(x, y) = \big( c + \|x - y\|_2^2 \big)^{\beta}   for \beta \in (-1, 0).
Testing statistical dependence
Dependence testing

Given: samples from a distribution P_{XY}.
Goal: are X and Y independent?

X: dog images (not reproduced here); Y: text descriptions, e.g.
  "A large animal who slings slobber, exudes a distinctive houndy odor, and wants nothing more than to follow his nose."
  "Their noses guide them through life, and they're never happier than when following an interesting scent."
  "A responsive, interactive pet, one that will blow in your ear and follow you everywhere."
Text from dogtime.com and petfinder.com.
MMD as a dependence measure?

Could we use MMD?
    MMD(\underbrace{P_{XY}}_{P},\ \underbrace{P_X P_Y}_{Q};\ \mathcal{H})
We don't have samples from Q := P_X P_Y, only pairs \{(x_i, y_i)\}_{i=1}^n drawn i.i.d. from P_{XY}.
Solution: simulate Q with pairs (x_i, y_j) for j \neq i.
What kernel should we use for the RKHS \mathcal{H}?
MMD as a dependence measure

Kernel k on images, with feature space \mathcal{F}.
Kernel l on captions, with feature space \mathcal{G}.
Kernel on image-text pairs: are images and captions similar?
MMD as a dependence measure

Given: samples from a distribution P_{XY}.
Goal: are X and Y independent?

    MMD^2(\hat{P}_{XY}, \hat{P}_X \hat{P}_Y; \mathcal{H}) := \frac{1}{n^2} \mathrm{trace}(KL)
(K and L column centered)
MMD as a dependence measure

Two questions:
- Why the product kernel? There are many ways to combine kernels; why not, e.g., a sum?
- Is there a more interpretable way of defining this dependence measure?
Finding covariance with smooth transformations

Illustration: two variables with no correlation but strong dependence.
[Figure: scatter plot of the raw variables, correlation 0.00; after smooth transformations of each variable, the transformed variables have correlation 0.90]
Define two spaces, one for each witness

Function in \mathcal{F}:   f(x) = \sum_{j=1}^{\infty} f_j \varphi_j(x)   (feature map \varphi)
Kernel for RKHS \mathcal{F} on \mathcal{X}:   k(x, x') = \langle \varphi(x), \varphi(x') \rangle_{\mathcal{F}}

Function in \mathcal{G}:   g(y) = \sum_{j=1}^{\infty} g_j \psi_j(y)   (feature map \psi)
Kernel for RKHS \mathcal{G} on \mathcal{Y}:   l(y, y') = \langle \psi(y), \psi(y') \rangle_{\mathcal{G}}
The constrained covariance

The constrained covariance is
    COCO(P_{XY}) = \sup_{\|f\|_{\mathcal{F}} \le 1,\ \|g\|_{\mathcal{G}} \le 1} \mathrm{cov}\big[ f(x), g(y) \big]
                 = \sup_{\|f\|_{\mathcal{F}} \le 1,\ \|g\|_{\mathcal{G}} \le 1} E_{xy}\!\left[ \Big( \sum_{j=1}^{\infty} f_j \varphi_j(x) \Big) \Big( \sum_{j=1}^{\infty} g_j \psi_j(y) \Big) \right]

[Figure: as on the previous slide, correlation 0.00 for the raw variables and 0.90 after smooth transformations]

Fine print: the feature mappings \varphi(x) and \psi(y) are assumed to have zero mean.

Rewriting:
    E_{xy}[f(x)\, g(y)] = f^\top \underbrace{E_{xy}\big[ \varphi(x)\, \psi(y)^\top \big]}_{C_{\varphi(x)\psi(y)}}\; g

COCO: the largest singular value of the feature covariance C_{\varphi(x)\psi(y)}.
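To make "largest singular value of the feature covariance" concrete, here is a small sketch with explicit, finite-dimensional feature maps (a few sinusoids, an illustrative stand-in for the RKHS features); it centres the features, forms the empirical cross-covariance, and takes its top singular value. All names and choices here are mine.

import numpy as np

def coco_explicit(x, y, n_feat=5):
    # Largest singular value of the empirical feature covariance C_{phi(x) psi(y)},
    # with illustrative sine/cosine feature maps; features are empirically centered,
    # matching the zero-mean fine print on the slide.
    freqs = np.arange(1, n_feat + 1)
    phi = np.column_stack([np.sin(f * x) for f in freqs] + [np.cos(f * x) for f in freqs])
    psi = np.column_stack([np.sin(f * y) for f in freqs] + [np.cos(f * y) for f in freqs])
    phi -= phi.mean(axis=0)
    psi -= psi.mean(axis=0)
    C = phi.T @ psi / len(x)                     # empirical cross-covariance in feature space
    return np.linalg.svd(C, compute_uv=False)[0]

# Example: uncorrelated but strongly dependent data (points near a circle).
rng = np.random.default_rng(0)
t = rng.uniform(0, 2 * np.pi, size=1000)
x = np.cos(t) + 0.1 * rng.normal(size=1000)
y = np.sin(t) + 0.1 * rng.normal(size=1000)
print(np.corrcoef(x, y)[0, 1], coco_explicit(x, y))  # near-zero correlation, nonzero COCO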
Computing COCO in practice

Given a sample \{(x_i, y_i)\}_{i=1}^n drawn i.i.d. from P_{XY}, what is the empirical \widehat{COCO}?

\widehat{COCO} is the largest eigenvalue \gamma_{\max} of
    \begin{bmatrix} 0 & \frac{1}{n} KL \\ \frac{1}{n} LK & 0 \end{bmatrix}
    \begin{bmatrix} \alpha \\ \beta \end{bmatrix}
    = \gamma
    \begin{bmatrix} K & 0 \\ 0 & L \end{bmatrix}
    \begin{bmatrix} \alpha \\ \beta \end{bmatrix},
where K_{ij} = k(x_i, x_j) and L_{ij} = l(y_i, y_j).

Witness functions (singular vectors):
    f(x) \propto \sum_{i=1}^{n} \alpha_i\, k(x_i, x),    g(y) \propto \sum_{i=1}^{n} \beta_i\, l(y_i, y)

Fine print: the kernels are computed with empirically centered features \varphi(x) - \frac{1}{n}\sum_{i=1}^{n} \varphi(x_i) and \psi(y) - \frac{1}{n}\sum_{i=1}^{n} \psi(y_i).

A.G., A. Smola, O. Bousquet, R. Herbrich, A. Belitski, M. Augath, Y. Murayama, J. Pauls, B. Schoelkopf, and N. Logothetis, AISTATS 2005.
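A sketch of the empirical COCO computation via the generalized eigenvalue problem above, using centred Gram matrices. The Gaussian kernels, bandwidth, and the small ridge added to keep the right-hand-side matrix positive definite are my own choices.

import numpy as np
from scipy.linalg import eigh

def center(G):
    # Center a Gram matrix: G~ = H G H with H = I - (1/n) 11^T.
    n = G.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    return H @ G @ H

def coco_empirical(K, L, ridge=1e-8):
    # Largest generalized eigenvalue of
    #   [[0, KL/n], [LK/n, 0]] v = gamma [[K, 0], [0, L]] v
    # with centered Gram matrices; a small ridge keeps the right-hand side positive definite.
    n = K.shape[0]
    K, L = center(K), center(L)
    A = np.block([[np.zeros((n, n)), K @ L / n],
                  [L @ K / n, np.zeros((n, n))]])
    B = np.block([[K + ridge * np.eye(n), np.zeros((n, n))],
                  [np.zeros((n, n)), L + ridge * np.eye(n)]])
    return eigh(A, B, eigvals_only=True)[-1]

# Example with Gaussian kernels on 1-D data (kernel choice is illustrative).
rng = np.random.default_rng(0)
x = rng.uniform(-2, 2, size=300)
y = np.sin(3 * x) + 0.2 * rng.normal(size=300)
gauss = lambda a, sigma=1.0: np.exp(-(a[:, None] - a[None, :])**2 / (2 * sigma**2))
print(coco_empirical(gauss(x), gauss(y)))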
What is a large dependence with COCO?

[Figure: a smooth density and a rough density, each shown with 500 samples]
The density takes the form
    p_{XY}(x, y) \propto 1 + \sin(\omega x) \sin(\omega y).
Which of these is the more "dependent"?
Finding covariance with smooth transformations

[Figures: for each ω, the sample scatter plot and the smooth transformations of each variable, with the correlations they achieve]
Case ω = 1:  correlations 0.50 and 0.31,  COCO: 0.09
Case ω = 2:  correlations 0.54 and 0.02,  COCO: 0.07
Case ω = 3:  correlations 0.44 and 0.03,  COCO: 0.04
Case ω = 4:  correlations 0.25 and 0.05,  COCO: 0.02
Case ω = ??: correlations 0.14 and 0.01,  COCO: 0.02
Case ω = 0, uniform noise! (shows bias): correlations 0.14 and 0.01,  COCO: 0.02
Dependence largest at "low" frequencies

As the dependence is encoded at higher frequencies, the smooth mappings f, g achieve lower linear dependence.
Even for independent variables, COCO will not be zero at finite sample sizes, since f and g will find some mild linear dependence (bias).
This bias decreases with increasing sample size.
Can we do better than COCO?

A second example with zero correlation.
First singular value of the feature covariance C_{\varphi(x)\psi(y)}: COCO_1 = 0.11.
[Figure: raw correlation 0.00; the first pair of witness functions achieves correlation 0.80]
Second singular value of the feature covariance C_{\varphi(x)\psi(y)}: COCO_2 = 0.06.
[Figure: the second pair of witness functions achieves correlation 0.37]
The Hilbert-Schmidt Independence Criterion

Writing the i-th singular value of the feature covariance C_{\varphi(x)\psi(y)} as
    \gamma_i := COCO_i(P_{XY}; \mathcal{F}, \mathcal{G}),
define the Hilbert-Schmidt Independence Criterion (HSIC):
    HSIC^2(P_{XY}; \mathcal{F}, \mathcal{G}) = \sum_{i=1}^{\infty} \gamma_i^2.

AG, O. Bousquet, A. Smola, and B. Schoelkopf, ALT 2005; AG, K. Fukumizu, C.H. Teo, L. Song, B. Schoelkopf, and A. Smola, NIPS 2007.

HSIC is MMD with a product kernel:
    HSIC^2(P_{XY}; \mathcal{F}, \mathcal{G}) = MMD^2(P_{XY}, P_X P_Y; \mathcal{H}),
where \kappa\big( (x, y), (x', y') \big) = k(x, x')\, l(y, y').
Asymptotics of HSIC under independence

Given a sample \{(x_i, y_i)\}_{i=1}^n drawn i.i.d. from P_{XY}, what is the empirical \widehat{HSIC}?

Empirical HSIC (biased):
    \widehat{HSIC} = \frac{1}{n^2} \mathrm{trace}(KL),
    K_{ij} = k(x_i, x_j),  L_{ij} = l(y_i, y_j)  (K and L computed with empirically centered features).

Statistical testing: given P_{XY} = P_X P_Y, what is the threshold c_\alpha such that P(\widehat{HSIC} > c_\alpha) < \alpha for small \alpha?

Asymptotics of \widehat{HSIC} when P_{XY} = P_X P_Y:
    n\, \widehat{HSIC} \to \sum_{l=1}^{\infty} \lambda_l z_l^2,   z_l \sim N(0, 1) i.i.d.,
where
    \lambda_l\, \psi_l(z_j) = \int h_{ijqr}\, \psi_l(z_i)\, dF_{i,q,r},
    h_{ijqr} = \frac{1}{4!} \sum_{(t,u,v,w)}^{(i,j,q,r)} \big( k_{tu} l_{tu} + k_{tu} l_{vw} - 2 k_{tu} l_{tv} \big).
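A minimal sketch of the biased empirical HSIC, computed as (1/n^2) trace of the product of centred Gram matrices (equivalent to using empirically centred features); the Gaussian kernels and bandwidth on 1-D data are illustrative choices.

import numpy as np

def hsic_biased(x, y, sigma=1.0):
    # Biased empirical HSIC: (1/n^2) trace(K~ L~), with K~ = HKH, L~ = HLH.
    n = len(x)
    H = np.eye(n) - np.ones((n, n)) / n
    K = np.exp(-(x[:, None] - x[None, :])**2 / (2 * sigma**2))
    L = np.exp(-(y[:, None] - y[None, :])**2 / (2 * sigma**2))
    return np.trace((H @ K @ H) @ (H @ L @ H)) / n**2

rng = np.random.default_rng(0)
x = rng.normal(size=400)
print(hsic_biased(x, x + 0.3 * rng.normal(size=400)))  # dependent pair: larger value
print(hsic_biased(x, rng.normal(size=400)))            # independent pair: close to zero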
A statistical test

Given P_{XY} = P_X P_Y, what is the threshold c_\alpha such that P(\widehat{HSIC} > c_\alpha) < \alpha for small \alpha (the probability of a false positive)?

Original time series:
    X1 X2 X3 X4 X5 X6 X7 X8 X9 X10
    Y1 Y2 Y3 Y4 Y5 Y6 Y7 Y8 Y9 Y10
Permutation:
    X1 X2 X3 X4 X5 X6 X7 X8 X9 X10
    Y7 Y3 Y9 Y2 Y4 Y8 Y5 Y1 Y6 Y10

Null distribution via permutation:
- Compute HSIC for \{x_i, y_{\sigma(i)}\}_{i=1}^n, where \sigma is a random permutation of the indices \{1, \dots, n\}. This gives HSIC for independent variables.
- Repeat for many different permutations to get an empirical CDF.
- The threshold c_\alpha is the (1 - \alpha) quantile of the empirical CDF.
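A sketch of this permutation threshold, assuming precomputed Gram matrices K and L (from whichever kernels are in use); permuting the y sample corresponds to permuting the rows and columns of L. Function names and defaults are mine.

import numpy as np

def hsic_stat(K, L):
    # Biased HSIC from precomputed Gram matrices: (1/n^2) trace(K~ L~).
    n = K.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    return np.trace((H @ K @ H) @ (H @ L @ H)) / n**2

def hsic_permutation_test(K, L, alpha=0.05, n_perm=500, seed=0):
    # Permute the y sample (rows and columns of L), recompute HSIC under each
    # permutation, and take the (1 - alpha) quantile as the threshold c_alpha.
    rng = np.random.default_rng(seed)
    stat = hsic_stat(K, L)
    null_stats = []
    for _ in range(n_perm):
        perm = rng.permutation(K.shape[0])
        null_stats.append(hsic_stat(K, L[np.ix_(perm, perm)]))
    c_alpha = np.quantile(null_stats, 1 - alpha)
    return stat, c_alpha, stat > c_alpha  # reject independence when stat > c_alpha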
Application: dependence detection across languages

Testing task: detect dependence between English and French text.

X (English): "Honourable senators, I have a question for the Leader of the Government in the Senate"
Y (French):  "Honorables sénateurs, ma question s'adresse au leader du gouvernement au Sénat"

X (English): "No doubt there is great pressure on provincial and municipal governments"
Y (French):  "Les ordres de gouvernements provinciaux et municipaux subissent de fortes pressions"

X (English): "In fact, we have increased federal investments for early childhood development."
Y (French):  "Au contraire, nous avons augmenté le financement fédéral pour le développement des jeunes"

Text from the aligned Hansards of the 36th parliament of Canada, https://www.isi.edu/natural-language/download/hansard/
Application: dependence detection across languages

Testing task: detect dependence between English and French text.
k-spectrum kernel, k = 10, sample size n = 10.

    \widehat{HSIC} = \frac{1}{n^2} \mathrm{trace}(KL)   (K and L column centered)
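A rough sketch of a k-spectrum kernel (the inner product of length-k substring counts); the example sentences are toy stand-ins for the Hansard extracts, and the implementation details are not taken from the slides.

import numpy as np
from collections import Counter

def spectrum_features(text, k=10):
    # Counts of all length-k substrings: the k-spectrum representation of a string.
    return Counter(text[i:i + k] for i in range(len(text) - k + 1))

def spectrum_kernel(s, t, k=10):
    # k-spectrum kernel: inner product of the two k-mer count vectors.
    cs, ct = spectrum_features(s, k), spectrum_features(t, k)
    return sum(cs[u] * ct[u] for u in cs if u in ct)

# Toy Gram matrix over a few sentences (stand-ins for the five-line extracts).
lines = ["Honourable senators, I have a question for the Leader",
         "No doubt there is great pressure on provincial governments"]
K = np.array([[spectrum_kernel(a, b) for b in lines] for a in lines])
print(K)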
Application: dependence detection across languages

Results (for \alpha = 0.05):
- k-spectrum kernel: average Type II error 0
- Bag-of-words kernel: average Type II error 0.18

Settings: five-line extracts, averaged over 300 repetitions, for the "Agriculture" transcripts. Similar results for the Fisheries and Immigration transcripts.
Testing higher order interactions
Detecting higher order interaction

How to detect V-structures with pairwise weak individual dependence?
[Figure: V-structure graph with X and Y as parents of Z]
Detecting higher order interaction

How to detect V-structures with pairwise weak individual dependence?
    X ⊥⊥ Y,   Y ⊥⊥ Z,   X ⊥⊥ Z
[Figure: V-structure graph; scatter plots of X1 vs Y1, Y1 vs Z1, X1 vs Z1, and X1*Y1 vs Z1]
    X, Y \sim N(0, 1) i.i.d.,   Z \mid X, Y \sim \mathrm{sign}(XY)\, \mathrm{Exp}\!\left(\tfrac{1}{\sqrt{2}}\right)
Fine print: faithfulness is violated here!
V-structure discovery

[Figure: V-structure graph with X and Y as parents of Z]
Assume X ⊥⊥ Y has been established.

The V-structure can then be detected by:
- A consistent conditional independence test: H_0: X ⊥⊥ Y | Z  [Fukumizu et al. 2008, Zhang et al. 2011]
- A factorisation test: H_0: (X, Y) ⊥⊥ Z ∨ (X, Z) ⊥⊥ Y ∨ (Y, Z) ⊥⊥ X  (multiple standard two-variable tests)

How well do these work?
Detecting higher order interaction

Generalise the earlier example to p dimensions:
    X ⊥⊥ Y,   Y ⊥⊥ Z,   X ⊥⊥ Z
[Figure: V-structure graph; scatter plots of X1 vs Y1, Y1 vs Z1, X1 vs Z1, and X1*Y1 vs Z1]
    X, Y \sim N(0, 1) i.i.d.,   Z \mid X, Y \sim \mathrm{sign}(XY)\, \mathrm{Exp}\!\left(\tfrac{1}{\sqrt{2}}\right),
    X_{2:p},\, Y_{2:p},\, Z_{2:p} \sim N(0, I_{p-1}) i.i.d.
Fine print: faithfulness is violated here!
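A sketch of how such data could be generated. The slide does not say whether 1/sqrt(2) is a rate or a scale for the exponential, so treating it as the scale is an assumption, as are the function names.

import numpy as np

def sample_v_structure(n, p=1, seed=0):
    # First coordinates follow the V-structure example: X1, Y1 ~ N(0, 1),
    # Z1 | X1, Y1 = sign(X1 * Y1) * E with E exponential (parameter 1/sqrt(2) used as scale).
    # Remaining p-1 coordinates of X, Y, Z are independent N(0, I) noise.
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(n, p))
    Y = rng.normal(size=(n, p))
    Z = rng.normal(size=(n, p))
    E = rng.exponential(scale=1 / np.sqrt(2), size=n)
    Z[:, 0] = np.sign(X[:, 0] * Y[:, 0]) * E
    return X, Y, Z

X, Y, Z = sample_v_structure(500, p=3)
print(np.corrcoef(X[:, 0], Z[:, 0])[0, 1])             # pairwise: weak dependence
print(np.corrcoef(X[:, 0] * Y[:, 0], Z[:, 0])[0, 1])   # X1*Y1 vs Z1: clear dependence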
V-structure discovery

[Figure: results for the CI test of X ⊥⊥ Y | Z from Zhang et al. (2011), and for a factorisation test, n = 500]
Lancaster interaction measure

The Lancaster interaction measure of (X_1, \dots, X_D) \sim P is a signed measure \Delta_L P that vanishes whenever P can be factorised non-trivially.

D = 2:   \Delta_L P = P_{XY} - P_X P_Y
D = 3:   \Delta_L P = P_{XYZ} - P_X P_{YZ} - P_Y P_{XZ} - P_Z P_{XY} + 2 P_X P_Y P_Z

[Figure: each term depicted as a graph over X, Y, Z]

Case of X ⊥⊥ (Y, Z): \Delta_L P = 0 (substituting P_{XYZ} = P_X P_{YZ}, P_{XZ} = P_X P_Z and P_{XY} = P_X P_Y, the terms cancel).
Lancaster interaction measure

    (X, Y) ⊥⊥ Z ∨ (X, Z) ⊥⊥ Y ∨ (Y, Z) ⊥⊥ X  \Rightarrow  \Delta_L P = 0.
...so what might be missed?
Lancaster interaction measure

    \Delta_L P = 0  \not\Rightarrow  (X, Y) ⊥⊥ Z ∨ (X, Z) ⊥⊥ Y ∨ (Y, Z) ⊥⊥ X

Example:
    P(0,0,0) = 0.2   P(0,0,1) = 0.1   P(1,0,0) = 0.1   P(1,0,1) = 0.1
    P(0,1,0) = 0.1   P(0,1,1) = 0.1   P(1,1,0) = 0.1   P(1,1,1) = 0.2
A kernel test statistic using the Lancaster measure

Construct a test by estimating \|\Delta_L P\|^2_{\mathcal{H}}, where \kappa = k \otimes l \otimes m:
    \|P_{XYZ} - P_{XY} P_Z\|^2_{\mathcal{H}} = \langle P_{XYZ}, P_{XYZ} \rangle_{\mathcal{H}} - 2 \langle P_{XYZ}, P_{XY} P_Z \rangle_{\mathcal{H}} + \langle P_{XY} P_Z, P_{XY} P_Z \rangle_{\mathcal{H}}
A kernel test statistic using the Lancaster measure

[Table: V-statistic estimators of the inner products \langle \cdot, \cdot \rangle_{\mathcal{H}} (terms in P_X P_Y P_Z omitted)]
H is the centering matrix I - n^{-1} \mathbf{1}\mathbf{1}^\top.

Lancaster interaction statistic (D. Sejdinovic, AG, W. Bergsma, NIPS 2013):
    \|\widehat{\Delta_L P}\|^2_{\mathcal{H}} = \frac{1}{n^2} \big( HKH \circ HLH \circ HMH \big)_{++}
(the subscript ++ denotes the sum over all entries of the matrix)
This is an empirical joint central moment in the feature space.
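A minimal sketch of the statistic above, given precomputed Gram matrices K, L, M for the three variables; the helper name and the Gaussian-kernel example are mine.

import numpy as np

def lancaster_stat(K, L, M):
    # (1/n^2) * sum of all entries of (HKH) o (HLH) o (HMH),
    # where o is the elementwise product and H the centering matrix.
    n = K.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    Kc, Lc, Mc = H @ K @ H, H @ L @ H, H @ M @ H
    return (Kc * Lc * Mc).sum() / n**2

# Example with Gaussian kernels on 1-D samples following the V-structure construction.
gauss = lambda a, sigma=1.0: np.exp(-(a[:, None] - a[None, :])**2 / (2 * sigma**2))
rng = np.random.default_rng(0)
x, y = rng.normal(size=300), rng.normal(size=300)
z = np.sign(x * y) * rng.exponential(scale=1 / np.sqrt(2), size=300)
print(lancaster_stat(gauss(x), gauss(y), gauss(z)))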
V-structure discovery

[Figure: results for the Lancaster test, the CI test of X ⊥⊥ Y | Z from Zhang et al. (2011), and a factorisation test, n = 500]
Interaction for D ≥ 4

Interaction measure valid for all D (Streitberg, 1990):
    \Delta_S P = \sum_{\pi} (-1)^{|\pi| - 1} (|\pi| - 1)!\, J_\pi P
For a partition \pi, J_\pi associates to the joint distribution the corresponding factorisation, e.g.,
    J_{13|2|4} P = P_{X_1 X_3} P_{X_2} P_{X_4}.

[Figure: the number of partitions of \{1, \dots, D\} (the Bell numbers) grows super-exponentially with D]
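The number of partitions of {1, ..., D} is the Bell number B_D; a small sketch (my own, using the Bell triangle) reproduces the growth shown in the plot.

def bell(n):
    # Number of partitions of {1, ..., n}, computed with the Bell triangle.
    row = [1]
    for _ in range(n - 1):
        nxt = [row[-1]]
        for v in row:
            nxt.append(nxt[-1] + v)
        row = nxt
    return row[-1]

print([bell(d) for d in range(1, 11)])  # 1, 2, 5, 15, 52, 203, 877, 4140, 21147, 115975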
Part 4: Advanced topics
Advanced topics

- Testing on time series
- Testing for conditional dependence
- Regression and conditional mean embedding
Measures of divergence

[Figures]
Sriperumbudur, Fukumizu, G, Schoelkopf, Lanckriet (2012)
Co-authors

From Gatsby: Kacper Chwialkowski, Wittawat Jitkrittum, Bharath Sriperumbudur, Heiko Strathmann, Dougal Sutherland, Zoltan Szabo, Wenkai Xu.
External collaborators: Kenji Fukumizu, Bernhard Schoelkopf, Alex Smola.

Questions?