Representing and comparing probabilities: Part 2
Arthur Gretton
Gatsby Computational Neuroscience Unit, University College London
UAI, 2017
Testing against a probabilistic model
Statistical model criticism

    MMD(P, Q) = \|f^*\|_{\mathcal{F}} = \sup_{\|f\|_{\mathcal{F}} \le 1} \left[ E_Q f - E_P f \right]

[Figure: densities p(x) and q(x), and the witness function f*(x)]

f* is the witness function.
Can we compute MMD with samples from Q and a model P?
Problem: we usually can't compute E_P f in closed form.
Stein idea

To get rid of E_P f in
    \sup_{\|f\|_{\mathcal{F}} \le 1} \left[ E_Q f - E_P f \right],
we define the Stein operator
    [T_P f](x) = \frac{1}{p(x)} \frac{d}{dx} \big( f(x)\, p(x) \big).
Then E_P T_P f = 0, subject to appropriate boundary conditions.
(Oates, Girolami, Chopin, 2016)
Stein idea: proof

    E_P [T_P f] = \int \frac{1}{p(x)} \frac{d}{dx}\big( f(x)\, p(x) \big)\, p(x)\, dx
                = \int \frac{d}{dx}\big( f(x)\, p(x) \big)\, dx
                = \big[ f(x)\, p(x) \big]_{-\infty}^{\infty} = 0
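As a quick numerical sanity check of this identity (not part of the slides), the sketch below integrates [T_P f](x) p(x) over a wide interval for a standard normal p and a smooth, decaying f; the function choices and helper names are illustrative assumptions.

import numpy as np
from scipy.integrate import quad

# Standard normal model p(x) and a smooth, decaying test function f(x) (illustrative choices).
p = lambda x: np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)
f = lambda x: np.sin(x) * np.exp(-0.1 * x**2)

def stein_op(x, eps=1e-5):
    # [T_p f](x) = (1/p(x)) * d/dx [ f(x) p(x) ], via a central finite difference.
    return (f(x + eps) * p(x + eps) - f(x - eps) * p(x - eps)) / (2 * eps * p(x))

# E_P[T_p f] = integral of [T_p f](x) p(x) dx, which should vanish up to numerical error.
value, _ = quad(lambda x: stein_op(x) * p(x), -10, 10)
print(value)  # approximately 0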
Kernel Stein Discrepancy

Stein operator:
    T_P f = \partial_x f + f\, \partial_x (\log p)

Kernel Stein Discrepancy (KSD):
    KSD(P, Q; \mathcal{F}) = \sup_{\|g\|_{\mathcal{F}} \le 1} \big[ E_Q T_P g - E_P T_P g \big] = \sup_{\|g\|_{\mathcal{F}} \le 1} E_Q T_P g

[Figure: densities p(x) and q(x), and the Stein witness function g*(x)]
Kernel Stein Discrepancy

Closed-form expression for KSD: given Z, Z' ~ Q,
    KSD(P, Q; \mathcal{F}) = E_Q\, h_p(Z, Z'),
where
    h_p(x, y) := \partial_x \log p(x)\, \partial_y \log p(y)\, k(x, y)
               + \partial_y \log p(y)\, \partial_x k(x, y)
               + \partial_x \log p(x)\, \partial_y k(x, y)
               + \partial_x \partial_y k(x, y)
and k is the RKHS kernel for \mathcal{F}.
(Chwialkowski, Strathmann, G., ICML 2016; Liu, Lee, Jordan, ICML 2016)

Only depends on the kernel and on \partial_x \log p(x): we do not need to normalize p, or to sample from it.
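A minimal sketch of this closed form in one dimension, assuming a Gaussian kernel and a standard normal model (so \partial_x \log p(x) = -x); the function names, bandwidth, and V-statistic averaging over all pairs are my own choices, not from the slides.

import numpy as np

def ksd_vstat(z, score_p, sigma=1.0):
    # V-statistic estimate of E_q h_p(Z, Z') for 1-D samples z ~ q.
    # score_p(x) returns d/dx log p(x) -- the only thing needed from the model.
    z = np.asarray(z, dtype=float)
    s = score_p(z)                             # s_i = d/dx log p(z_i)
    d = z[:, None] - z[None, :]                # pairwise differences z_i - z_j
    k = np.exp(-d**2 / (2 * sigma**2))         # Gaussian kernel k(z_i, z_j)
    dk_dx = -d / sigma**2 * k                  # d/dx k(x, y) at (z_i, z_j)
    dk_dy = d / sigma**2 * k                   # d/dy k(x, y)
    dk_dxdy = (1.0 / sigma**2 - d**2 / sigma**4) * k
    h = (s[:, None] * s[None, :] * k
         + s[None, :] * dk_dx
         + s[:, None] * dk_dy
         + dk_dxdy)
    return h.mean()

# Example: model p = N(0, 1), so d/dx log p(x) = -x; samples from a slightly shifted q.
rng = np.random.default_rng(0)
z = rng.normal(loc=0.5, scale=1.0, size=500)
print(ksd_vstat(z, score_p=lambda x: -x))      # positive when q differs from p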
Statistical model criticism

Chicago crime data.
[Figure: crime data modelled by a Gaussian mixture with two components, together with its Stein witness function; then a Gaussian mixture with ten components, with its Stein witness function]
Code: https://github.com/karlnapf/kernel_goodness_of_fit
Kernel Stein Discrepancy

Further applications: evaluation of approximate MCMC methods.
(Chwialkowski, Strathmann, G., ICML 2016; Gorham, Mackey, ICML 2017)

What kernel to use? The inverse multiquadric kernel,
    k(x, y) = \big( c + \|x - y\|_2^2 \big)^{\beta}   for \beta \in (-1, 0).
Testing statistical dependence
Dependence testing

Given: samples from a distribution P_{XY}.
Goal: are X and Y independent?

X: dog images (not reproduced here); Y: text descriptions, e.g.
  "A large animal who slings slobber, exudes a distinctive houndy odor, and wants nothing more than to follow his nose."
  "Their noses guide them through life, and they're never happier than when following an interesting scent."
  "A responsive, interactive pet, one that will blow in your ear and follow you everywhere."
Text from dogtime.com and petfinder.com.
MMD as a dependence measure?

Could we use MMD?
    MMD(\underbrace{P_{XY}}_{P},\ \underbrace{P_X P_Y}_{Q};\ \mathcal{H})
We don't have samples from Q := P_X P_Y, only pairs \{(x_i, y_i)\}_{i=1}^n drawn i.i.d. from P_{XY}.
Solution: simulate Q with pairs (x_i, y_j) for j \neq i.
What kernel should we use for the RKHS \mathcal{H}?
MMD as a dependence measure

Kernel k on images, with feature space \mathcal{F}.
Kernel l on captions, with feature space \mathcal{G}.
Kernel on image-text pairs: are images and captions similar?
MMD as a dependence measure

Given: samples from a distribution P_{XY}.
Goal: are X and Y independent?

    MMD^2(\hat{P}_{XY}, \hat{P}_X \hat{P}_Y; \mathcal{H}) := \frac{1}{n^2} \mathrm{trace}(KL)
(K and L column centered)
MMD as a dependence measure

Two questions:
- Why the product kernel? There are many ways to combine kernels; why not, e.g., a sum?
- Is there a more interpretable way of defining this dependence measure?
Finding covariance with smooth transformations

Illustration: two variables with no correlation but strong dependence.
[Figure: scatter plot of the raw variables, correlation 0.00; after smooth transformations of each variable, the transformed variables have correlation 0.90]
Define two spaces, one for each witness

Function in \mathcal{F}:   f(x) = \sum_{j=1}^{\infty} f_j \varphi_j(x)   (feature map \varphi)
Kernel for RKHS \mathcal{F} on \mathcal{X}:   k(x, x') = \langle \varphi(x), \varphi(x') \rangle_{\mathcal{F}}

Function in \mathcal{G}:   g(y) = \sum_{j=1}^{\infty} g_j \psi_j(y)   (feature map \psi)
Kernel for RKHS \mathcal{G} on \mathcal{Y}:   l(y, y') = \langle \psi(y), \psi(y') \rangle_{\mathcal{G}}
The constrained covariance

The constrained covariance is
    COCO(P_{XY}) = \sup_{\|f\|_{\mathcal{F}} \le 1,\ \|g\|_{\mathcal{G}} \le 1} \mathrm{cov}\big[ f(x), g(y) \big]
                 = \sup_{\|f\|_{\mathcal{F}} \le 1,\ \|g\|_{\mathcal{G}} \le 1} E_{xy}\!\left[ \Big( \sum_{j=1}^{\infty} f_j \varphi_j(x) \Big) \Big( \sum_{j=1}^{\infty} g_j \psi_j(y) \Big) \right]

[Figure: as on the previous slide, correlation 0.00 for the raw variables and 0.90 after smooth transformations]

Fine print: the feature mappings \varphi(x) and \psi(y) are assumed to have zero mean.

Rewriting:
    E_{xy}[f(x)\, g(y)] = f^\top \underbrace{E_{xy}\big[ \varphi(x)\, \psi(y)^\top \big]}_{C_{\varphi(x)\psi(y)}}\; g

COCO: the largest singular value of the feature covariance C_{\varphi(x)\psi(y)}.
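To make "largest singular value of the feature covariance" concrete, here is a small sketch with explicit, finite-dimensional feature maps (a few sinusoids, an illustrative stand-in for the RKHS features); it centres the features, forms the empirical cross-covariance, and takes its top singular value. All names and choices here are mine.

import numpy as np

def coco_explicit(x, y, n_feat=5):
    # Largest singular value of the empirical feature covariance C_{phi(x) psi(y)},
    # with illustrative sine/cosine feature maps; features are empirically centered,
    # matching the zero-mean fine print on the slide.
    freqs = np.arange(1, n_feat + 1)
    phi = np.column_stack([np.sin(f * x) for f in freqs] + [np.cos(f * x) for f in freqs])
    psi = np.column_stack([np.sin(f * y) for f in freqs] + [np.cos(f * y) for f in freqs])
    phi -= phi.mean(axis=0)
    psi -= psi.mean(axis=0)
    C = phi.T @ psi / len(x)                     # empirical cross-covariance in feature space
    return np.linalg.svd(C, compute_uv=False)[0]

# Example: uncorrelated but strongly dependent data (points near a circle).
rng = np.random.default_rng(0)
t = rng.uniform(0, 2 * np.pi, size=1000)
x = np.cos(t) + 0.1 * rng.normal(size=1000)
y = np.sin(t) + 0.1 * rng.normal(size=1000)
print(np.corrcoef(x, y)[0, 1], coco_explicit(x, y))  # near-zero correlation, nonzero COCO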
Computing COCO in practice

Given a sample \{(x_i, y_i)\}_{i=1}^n drawn i.i.d. from P_{XY}, what is the empirical \widehat{COCO}?

\widehat{COCO} is the largest eigenvalue \gamma_{\max} of
    \begin{bmatrix} 0 & \frac{1}{n} KL \\ \frac{1}{n} LK & 0 \end{bmatrix}
    \begin{bmatrix} \alpha \\ \beta \end{bmatrix}
    = \gamma
    \begin{bmatrix} K & 0 \\ 0 & L \end{bmatrix}
    \begin{bmatrix} \alpha \\ \beta \end{bmatrix},
where K_{ij} = k(x_i, x_j) and L_{ij} = l(y_i, y_j).

Witness functions (singular vectors):
    f(x) \propto \sum_{i=1}^{n} \alpha_i\, k(x_i, x),    g(y) \propto \sum_{i=1}^{n} \beta_i\, l(y_i, y)

Fine print: the kernels are computed with empirically centered features \varphi(x) - \frac{1}{n}\sum_{i=1}^{n} \varphi(x_i) and \psi(y) - \frac{1}{n}\sum_{i=1}^{n} \psi(y_i).

A.G., A. Smola, O. Bousquet, R. Herbrich, A. Belitski, M. Augath, Y. Murayama, J. Pauls, B. Schoelkopf, and N. Logothetis, AISTATS 2005.
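A sketch of the empirical COCO computation via the generalized eigenvalue problem above, using centred Gram matrices. The Gaussian kernels, bandwidth, and the small ridge added to keep the right-hand-side matrix positive definite are my own choices.

import numpy as np
from scipy.linalg import eigh

def center(G):
    # Center a Gram matrix: G~ = H G H with H = I - (1/n) 11^T.
    n = G.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    return H @ G @ H

def coco_empirical(K, L, ridge=1e-8):
    # Largest generalized eigenvalue of
    #   [[0, KL/n], [LK/n, 0]] v = gamma [[K, 0], [0, L]] v
    # with centered Gram matrices; a small ridge keeps the right-hand side positive definite.
    n = K.shape[0]
    K, L = center(K), center(L)
    A = np.block([[np.zeros((n, n)), K @ L / n],
                  [L @ K / n, np.zeros((n, n))]])
    B = np.block([[K + ridge * np.eye(n), np.zeros((n, n))],
                  [np.zeros((n, n)), L + ridge * np.eye(n)]])
    return eigh(A, B, eigvals_only=True)[-1]

# Example with Gaussian kernels on 1-D data (kernel choice is illustrative).
rng = np.random.default_rng(0)
x = rng.uniform(-2, 2, size=300)
y = np.sin(3 * x) + 0.2 * rng.normal(size=300)
gauss = lambda a, sigma=1.0: np.exp(-(a[:, None] - a[None, :])**2 / (2 * sigma**2))
print(coco_empirical(gauss(x), gauss(y)))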
What is a large dependence with COCO?

[Figure: a smooth density and a rough density, each shown with 500 samples]
The density takes the form
    p_{XY}(x, y) \propto 1 + \sin(\omega x) \sin(\omega y).
Which of these is the more "dependent"?
Finding covariance with smooth transformations

[Figures: for each ω, the sample scatter plot and the smooth transformations of each variable, with the correlations they achieve]
Case ω = 1:  correlations 0.50 and 0.31,  COCO: 0.09
Case ω = 2:  correlations 0.54 and 0.02,  COCO: 0.07
Case ω = 3:  correlations 0.44 and 0.03,  COCO: 0.04
Case ω = 4:  correlations 0.25 and 0.05,  COCO: 0.02
Case ω = ??: correlations 0.14 and 0.01,  COCO: 0.02
Case ω = 0, uniform noise! (shows bias): correlations 0.14 and 0.01,  COCO: 0.02
Dependence largest at "low" frequencies

As the dependence is encoded at higher frequencies, the smooth mappings f, g achieve lower linear dependence.
Even for independent variables, COCO will not be zero at finite sample sizes, since f and g will find some mild linear dependence (bias).
This bias decreases with increasing sample size.
Can we do better than COCO?

A second example with zero correlation.
First singular value of the feature covariance C_{\varphi(x)\psi(y)}: COCO_1 = 0.11.
[Figure: raw correlation 0.00; the first pair of witness functions achieves correlation 0.80]
Second singular value of the feature covariance C_{\varphi(x)\psi(y)}: COCO_2 = 0.06.
[Figure: the second pair of witness functions achieves correlation 0.37]
The Hilbert-Schmidt Independence Criterion

Writing the i-th singular value of the feature covariance C_{\varphi(x)\psi(y)} as
    \gamma_i := COCO_i(P_{XY}; \mathcal{F}, \mathcal{G}),
define the Hilbert-Schmidt Independence Criterion (HSIC):
    HSIC^2(P_{XY}; \mathcal{F}, \mathcal{G}) = \sum_{i=1}^{\infty} \gamma_i^2.

AG, O. Bousquet, A. Smola, and B. Schoelkopf, ALT 2005; AG, K. Fukumizu, C.H. Teo, L. Song, B. Schoelkopf, and A. Smola, NIPS 2007.

HSIC is MMD with a product kernel:
    HSIC^2(P_{XY}; \mathcal{F}, \mathcal{G}) = MMD^2(P_{XY}, P_X P_Y; \mathcal{H}),
where \kappa\big( (x, y), (x', y') \big) = k(x, x')\, l(y, y').
Asymptotics of HSIC under independence

Given a sample \{(x_i, y_i)\}_{i=1}^n drawn i.i.d. from P_{XY}, what is the empirical \widehat{HSIC}?

Empirical HSIC (biased):
    \widehat{HSIC} = \frac{1}{n^2} \mathrm{trace}(KL),
    K_{ij} = k(x_i, x_j),  L_{ij} = l(y_i, y_j)  (K and L computed with empirically centered features).

Statistical testing: given P_{XY} = P_X P_Y, what is the threshold c_\alpha such that P(\widehat{HSIC} > c_\alpha) < \alpha for small \alpha?

Asymptotics of \widehat{HSIC} when P_{XY} = P_X P_Y:
    n\, \widehat{HSIC} \to \sum_{l=1}^{\infty} \lambda_l z_l^2,   z_l \sim N(0, 1) i.i.d.,
where
    \lambda_l\, \psi_l(z_j) = \int h_{ijqr}\, \psi_l(z_i)\, dF_{i,q,r},
    h_{ijqr} = \frac{1}{4!} \sum_{(t,u,v,w)}^{(i,j,q,r)} \big( k_{tu} l_{tu} + k_{tu} l_{vw} - 2 k_{tu} l_{tv} \big).
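A minimal sketch of the biased empirical HSIC, computed as (1/n^2) trace of the product of centred Gram matrices (equivalent to using empirically centred features); the Gaussian kernels and bandwidth on 1-D data are illustrative choices.

import numpy as np

def hsic_biased(x, y, sigma=1.0):
    # Biased empirical HSIC: (1/n^2) trace(K~ L~), with K~ = HKH, L~ = HLH.
    n = len(x)
    H = np.eye(n) - np.ones((n, n)) / n
    K = np.exp(-(x[:, None] - x[None, :])**2 / (2 * sigma**2))
    L = np.exp(-(y[:, None] - y[None, :])**2 / (2 * sigma**2))
    return np.trace((H @ K @ H) @ (H @ L @ H)) / n**2

rng = np.random.default_rng(0)
x = rng.normal(size=400)
print(hsic_biased(x, x + 0.3 * rng.normal(size=400)))  # dependent pair: larger value
print(hsic_biased(x, rng.normal(size=400)))            # independent pair: close to zero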
A statistical test

Given P_{XY} = P_X P_Y, what is the threshold c_\alpha such that P(\widehat{HSIC} > c_\alpha) < \alpha for small \alpha (the probability of a false positive)?

Original time series:
    X1 X2 X3 X4 X5 X6 X7 X8 X9 X10
    Y1 Y2 Y3 Y4 Y5 Y6 Y7 Y8 Y9 Y10
Permutation:
    X1 X2 X3 X4 X5 X6 X7 X8 X9 X10
    Y7 Y3 Y9 Y2 Y4 Y8 Y5 Y1 Y6 Y10

Null distribution via permutation:
- Compute HSIC for \{x_i, y_{\sigma(i)}\}_{i=1}^n, where \sigma is a random permutation of the indices \{1, \dots, n\}. This gives HSIC for independent variables.
- Repeat for many different permutations to get an empirical CDF.
- The threshold c_\alpha is the (1 - \alpha) quantile of the empirical CDF.
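A sketch of this permutation threshold, assuming precomputed Gram matrices K and L (from whichever kernels are in use); permuting the y sample corresponds to permuting the rows and columns of L. Function names and defaults are mine.

import numpy as np

def hsic_stat(K, L):
    # Biased HSIC from precomputed Gram matrices: (1/n^2) trace(K~ L~).
    n = K.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    return np.trace((H @ K @ H) @ (H @ L @ H)) / n**2

def hsic_permutation_test(K, L, alpha=0.05, n_perm=500, seed=0):
    # Permute the y sample (rows and columns of L), recompute HSIC under each
    # permutation, and take the (1 - alpha) quantile as the threshold c_alpha.
    rng = np.random.default_rng(seed)
    stat = hsic_stat(K, L)
    null_stats = []
    for _ in range(n_perm):
        perm = rng.permutation(K.shape[0])
        null_stats.append(hsic_stat(K, L[np.ix_(perm, perm)]))
    c_alpha = np.quantile(null_stats, 1 - alpha)
    return stat, c_alpha, stat > c_alpha  # reject independence when stat > c_alpha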
Application: dependence detection across languages

Testing task: detect dependence between English and French text.

X (English): "Honourable senators, I have a question for the Leader of the Government in the Senate"
Y (French):  "Honorables sénateurs, ma question s'adresse au leader du gouvernement au Sénat"

X (English): "No doubt there is great pressure on provincial and municipal governments"
Y (French):  "Les ordres de gouvernements provinciaux et municipaux subissent de fortes pressions"

X (English): "In fact, we have increased federal investments for early childhood development."
Y (French):  "Au contraire, nous avons augmenté le financement fédéral pour le développement des jeunes"

Text from the aligned Hansards of the 36th parliament of Canada, https://www.isi.edu/natural-language/download/hansard/
Application: dependence detection across languages

Testing task: detect dependence between English and French text.
k-spectrum kernel, k = 10, sample size n = 10.

    \widehat{HSIC} = \frac{1}{n^2} \mathrm{trace}(KL)   (K and L column centered)
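A rough sketch of a k-spectrum kernel (the inner product of length-k substring counts); the example sentences are toy stand-ins for the Hansard extracts, and the implementation details are not taken from the slides.

import numpy as np
from collections import Counter

def spectrum_features(text, k=10):
    # Counts of all length-k substrings: the k-spectrum representation of a string.
    return Counter(text[i:i + k] for i in range(len(text) - k + 1))

def spectrum_kernel(s, t, k=10):
    # k-spectrum kernel: inner product of the two k-mer count vectors.
    cs, ct = spectrum_features(s, k), spectrum_features(t, k)
    return sum(cs[u] * ct[u] for u in cs if u in ct)

# Toy Gram matrix over a few sentences (stand-ins for the five-line extracts).
lines = ["Honourable senators, I have a question for the Leader",
         "No doubt there is great pressure on provincial governments"]
K = np.array([[spectrum_kernel(a, b) for b in lines] for a in lines])
print(K)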
Application: dependence detection across languages

Results (for \alpha = 0.05):
- k-spectrum kernel: average Type II error 0
- Bag-of-words kernel: average Type II error 0.18

Settings: five-line extracts, averaged over 300 repetitions, for the "Agriculture" transcripts. Similar results for the Fisheries and Immigration transcripts.
Testing higher order interactions
Detecting higher order interaction

How to detect V-structures with pairwise weak individual dependence?
[Figure: V-structure graph with X and Y as parents of Z]
Detecting higher order interaction

How to detect V-structures with pairwise weak individual dependence?
    X ⊥⊥ Y,   Y ⊥⊥ Z,   X ⊥⊥ Z
[Figure: V-structure graph; scatter plots of X1 vs Y1, Y1 vs Z1, X1 vs Z1, and X1*Y1 vs Z1]
    X, Y \sim N(0, 1) i.i.d.,   Z \mid X, Y \sim \mathrm{sign}(XY)\, \mathrm{Exp}\!\left(\tfrac{1}{\sqrt{2}}\right)
Fine print: faithfulness is violated here!
V-structure discovery

[Figure: V-structure graph with X and Y as parents of Z]
Assume X ⊥⊥ Y has been established.

The V-structure can then be detected by:
- A consistent conditional independence test: H_0: X ⊥⊥ Y | Z  [Fukumizu et al. 2008, Zhang et al. 2011]
- A factorisation test: H_0: (X, Y) ⊥⊥ Z ∨ (X, Z) ⊥⊥ Y ∨ (Y, Z) ⊥⊥ X  (multiple standard two-variable tests)

How well do these work?
Detecting higher order interaction

Generalise the earlier example to p dimensions:
    X ⊥⊥ Y,   Y ⊥⊥ Z,   X ⊥⊥ Z
[Figure: V-structure graph; scatter plots of X1 vs Y1, Y1 vs Z1, X1 vs Z1, and X1*Y1 vs Z1]
    X, Y \sim N(0, 1) i.i.d.,   Z \mid X, Y \sim \mathrm{sign}(XY)\, \mathrm{Exp}\!\left(\tfrac{1}{\sqrt{2}}\right),
    X_{2:p},\, Y_{2:p},\, Z_{2:p} \sim N(0, I_{p-1}) i.i.d.
Fine print: faithfulness is violated here!
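A sketch of how such data could be generated. The slide does not say whether 1/sqrt(2) is a rate or a scale for the exponential, so treating it as the scale is an assumption, as are the function names.

import numpy as np

def sample_v_structure(n, p=1, seed=0):
    # First coordinates follow the V-structure example: X1, Y1 ~ N(0, 1),
    # Z1 | X1, Y1 = sign(X1 * Y1) * E with E exponential (parameter 1/sqrt(2) used as scale).
    # Remaining p-1 coordinates of X, Y, Z are independent N(0, I) noise.
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(n, p))
    Y = rng.normal(size=(n, p))
    Z = rng.normal(size=(n, p))
    E = rng.exponential(scale=1 / np.sqrt(2), size=n)
    Z[:, 0] = np.sign(X[:, 0] * Y[:, 0]) * E
    return X, Y, Z

X, Y, Z = sample_v_structure(500, p=3)
print(np.corrcoef(X[:, 0], Z[:, 0])[0, 1])             # pairwise: weak dependence
print(np.corrcoef(X[:, 0] * Y[:, 0], Z[:, 0])[0, 1])   # X1*Y1 vs Z1: clear dependence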
V-structure discovery

[Figure: results for the CI test of X ⊥⊥ Y | Z from Zhang et al. (2011), and for a factorisation test, n = 500]
Lancaster interaction measure

The Lancaster interaction measure of (X_1, \dots, X_D) \sim P is a signed measure \Delta_L P that vanishes whenever P can be factorised non-trivially.

D = 2:   \Delta_L P = P_{XY} - P_X P_Y
D = 3:   \Delta_L P = P_{XYZ} - P_X P_{YZ} - P_Y P_{XZ} - P_Z P_{XY} + 2 P_X P_Y P_Z

[Figure: each term depicted as a graph over X, Y, Z]

Case of X ⊥⊥ (Y, Z): \Delta_L P = 0 (substituting P_{XYZ} = P_X P_{YZ}, P_{XZ} = P_X P_Z and P_{XY} = P_X P_Y, the terms cancel).
Lancaster interaction measure

    (X, Y) ⊥⊥ Z ∨ (X, Z) ⊥⊥ Y ∨ (Y, Z) ⊥⊥ X  \Rightarrow  \Delta_L P = 0.
...so what might be missed?
Lancaster interaction measure

    \Delta_L P = 0  \not\Rightarrow  (X, Y) ⊥⊥ Z ∨ (X, Z) ⊥⊥ Y ∨ (Y, Z) ⊥⊥ X

Example:
    P(0,0,0) = 0.2   P(0,0,1) = 0.1   P(1,0,0) = 0.1   P(1,0,1) = 0.1
    P(0,1,0) = 0.1   P(0,1,1) = 0.1   P(1,1,0) = 0.1   P(1,1,1) = 0.2
A kernel test statistic using the Lancaster measure

Construct a test by estimating \|\Delta_L P\|^2_{\mathcal{H}}, where \kappa = k \otimes l \otimes m:
    \|P_{XYZ} - P_{XY} P_Z\|^2_{\mathcal{H}} = \langle P_{XYZ}, P_{XYZ} \rangle_{\mathcal{H}} - 2 \langle P_{XYZ}, P_{XY} P_Z \rangle_{\mathcal{H}} + \langle P_{XY} P_Z, P_{XY} P_Z \rangle_{\mathcal{H}}
A kernel test statistic using the Lancaster measure

[Table: V-statistic estimators of the inner products \langle \cdot, \cdot \rangle_{\mathcal{H}} (terms in P_X P_Y P_Z omitted)]
H is the centering matrix I - n^{-1} \mathbf{1}\mathbf{1}^\top.

Lancaster interaction statistic (D. Sejdinovic, AG, W. Bergsma, NIPS 2013):
    \|\widehat{\Delta_L P}\|^2_{\mathcal{H}} = \frac{1}{n^2} \big( HKH \circ HLH \circ HMH \big)_{++}
(the subscript ++ denotes the sum over all entries of the matrix)
This is an empirical joint central moment in the feature space.
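A minimal sketch of the statistic above, given precomputed Gram matrices K, L, M for the three variables; the helper name and the Gaussian-kernel example are mine.

import numpy as np

def lancaster_stat(K, L, M):
    # (1/n^2) * sum of all entries of (HKH) o (HLH) o (HMH),
    # where o is the elementwise product and H the centering matrix.
    n = K.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    Kc, Lc, Mc = H @ K @ H, H @ L @ H, H @ M @ H
    return (Kc * Lc * Mc).sum() / n**2

# Example with Gaussian kernels on 1-D samples following the V-structure construction.
gauss = lambda a, sigma=1.0: np.exp(-(a[:, None] - a[None, :])**2 / (2 * sigma**2))
rng = np.random.default_rng(0)
x, y = rng.normal(size=300), rng.normal(size=300)
z = np.sign(x * y) * rng.exponential(scale=1 / np.sqrt(2), size=300)
print(lancaster_stat(gauss(x), gauss(y), gauss(z)))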
V-structure discovery

[Figure: results for the Lancaster test, the CI test of X ⊥⊥ Y | Z from Zhang et al. (2011), and a factorisation test, n = 500]
Interaction for D ≥ 4

Interaction measure valid for all D (Streitberg, 1990):
    \Delta_S P = \sum_{\pi} (-1)^{|\pi| - 1} (|\pi| - 1)!\, J_\pi P
For a partition \pi, J_\pi associates to the joint distribution the corresponding factorisation, e.g.,
    J_{13|2|4} P = P_{X_1 X_3} P_{X_2} P_{X_4}.

[Figure: the number of partitions of \{1, \dots, D\} (the Bell numbers) grows super-exponentially with D]
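The number of partitions of {1, ..., D} is the Bell number B_D; a small sketch (my own, using the Bell triangle) reproduces the growth shown in the plot.

def bell(n):
    # Number of partitions of {1, ..., n}, computed with the Bell triangle.
    row = [1]
    for _ in range(n - 1):
        nxt = [row[-1]]
        for v in row:
            nxt.append(nxt[-1] + v)
        row = nxt
    return row[-1]

print([bell(d) for d in range(1, 11)])  # 1, 2, 5, 15, 52, 203, 877, 4140, 21147, 115975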
Part 4: Advanced topics
Advanced topics

- Testing on time series
- Testing for conditional dependence
- Regression and conditional mean embedding
Measures of divergence

[Figures]
Sriperumbudur, Fukumizu, G, Schoelkopf, Lanckriet (2012)
Co-authors

From Gatsby: Kacper Chwialkowski, Wittawat Jitkrittum, Bharath Sriperumbudur, Heiko Strathmann, Dougal Sutherland, Zoltan Szabo, Wenkai Xu.
External collaborators: Kenji Fukumizu, Bernhard Schoelkopf, Alex Smola.

Questions?