Representing and comparing probabilities

Arthur Gretton
Gatsby Computational Neuroscience Unit, University College London

UAI, 2017

Comparing two samples

Given: samples from unknown distributions P and Q.
Goal: do P and Q differ?

An example: two-sample tests

Have: two collections of samples X, Y from unknown distributions P and Q.
Goal: do P and Q differ?

[Figures: MNIST samples; samples from a GAN.]

Significant difference between the GAN samples and MNIST?
T. Salimans, I. Goodfellow, W. Zaremba, V. Cheung, A. Radford, Xi Chen, NIPS 2016.

Testing goodness of fit

Given: a model P and samples from Q.
Goal: is P a good fit for Q?

[Figure: Chicago crime data; the model is a Gaussian mixture with two components.]

Testing independence

Given: samples from a joint distribution $P_{XY}$.
Goal: are X and Y independent?

[Figure: paired samples, images X with text descriptions Y, e.g. "A large animal who slings slobber, exudes a distinctive houndy odor, and wants nothing more than to follow his nose. Their noses guide them through life, and they're never happier than when following an interesting scent. A responsive, interactive pet, one that will blow in your ear and follow you everywhere."]

Text from dogtime.com and petfinder.com

Outline: part 1

Two-sample testing
  Test statistic: Maximum Mean Discrepancy (MMD)...
    ...as a difference in feature means
    ...as an integral probability metric (not just a technicality!)
  Statistical testing with the MMD
  Troubleshooting GANs with MMD

Outline: part 2

Goodness-of-fit testing
  The kernel Stein discrepancy
Dependence testing
  Dependence using the MMD
  Dependence using feature covariances
  Statistical testing
Additional topics

Maximum Mean Discrepancy


Feature mean difference

Simple example: two Gaussians with different means.
Answer: t-test.

[Figure: densities of two Gaussians $P_X$ and $Q_X$ with different means.]

Feature mean difference

Two Gaussians with the same means, different variance.
Idea: look at the difference in means of features of the random variables.
In the Gaussian case: second-order features of the form $\varphi(x) = x^2$.

[Figures: densities of two Gaussians $P_X$ and $Q_X$ with different variances, and the densities of the feature $X^2$.]

Feature mean difference

Gaussian and Laplace distributions with the same mean and the same variance.
Detect the difference in means using higher-order features... an RKHS.

[Figure: Gaussian and Laplace densities $P_X$ and $Q_X$.]

Infinitely many features using kernels

Kernels: dot products of features.

Feature map $\varphi(x) \in \mathcal{F}$,
$$\varphi(x) = [\ldots\ \varphi_i(x)\ \ldots] \in \ell_2$$

For a positive definite kernel $k$,
$$k(x, x') = \langle \varphi(x), \varphi(x') \rangle_{\mathcal{F}}$$

Exponentiated quadratic kernel:
$$k(x, x') = \exp\left(-\frac{\|x - x'\|^2}{2\sigma^2}\right)$$

Infinitely many features $\varphi(x)$, yet the dot product is available in closed form!

Features: Gaussian Processes for Machine Learning, Rasmussen and Williams, Ch. 4.
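To make the closed-form dot product concrete, here is a minimal sketch of a kernel-matrix computation (assuming NumPy; the helper name gaussian_kernel and its bandwidth argument sigma are illustrative, not from the slides). Later sketches in this section reuse this helper.

import numpy as np

def gaussian_kernel(X, Y, sigma=1.0):
    """Exponentiated quadratic kernel matrix K[i, j] = exp(-||X[i] - Y[j]||^2 / (2 sigma^2)).
    X and Y are arrays with one observation per row."""
    sq_dists = (
        np.sum(X ** 2, axis=1)[:, None]
        + np.sum(Y ** 2, axis=1)[None, :]
        - 2.0 * X @ Y.T
    )
    return np.exp(-sq_dists / (2.0 * sigma ** 2))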

Infinitely many features of distributions

Given a Borel probability measure $P$ on $\mathcal{X}$, define the feature map (mean embedding) of the probability $P$,
$$\mu_P = [\ldots\ \mathbf{E}_P[\varphi_i(X)]\ \ldots]$$

For a positive definite kernel $k(x, x')$,
$$\langle \mu_P, \mu_Q \rangle_{\mathcal{F}} = \mathbf{E}_{P,Q}\, k(X, Y) \quad \text{for } X \sim P \text{ and } Y \sim Q.$$

Fine print: the feature map $\varphi(x)$ must be Bochner integrable for all probability measures considered. Always true if the kernel is bounded.

The maximum mean discrepancy

The maximum mean discrepancy is the distance between feature means:
$$\mathrm{MMD}^2(P, Q) = \|\mu_P - \mu_Q\|^2_{\mathcal{F}}$$
$$= \langle \mu_P, \mu_P \rangle_{\mathcal{F}} + \langle \mu_Q, \mu_Q \rangle_{\mathcal{F}} - 2 \langle \mu_P, \mu_Q \rangle_{\mathcal{F}}$$
$$= \underbrace{\mathbf{E}_P\, k(X, X')}_{(a)} + \underbrace{\mathbf{E}_Q\, k(Y, Y')}_{(a)} - \underbrace{2\,\mathbf{E}_{P,Q}\, k(X, Y)}_{(b)}$$

(a) = within-distribution similarity, (b) = cross-distribution similarity.

Illustration of MMD

Dogs ($= P$) and fish ($= Q$) example revisited.
Each entry of the kernel matrix is one of $k(\mathrm{dog}_i, \mathrm{dog}_j)$, $k(\mathrm{dog}_i, \mathrm{fish}_j)$, or $k(\mathrm{fish}_i, \mathrm{fish}_j)$.

Illustration of MMD

The empirical maximum mean discrepancy:
$$\widehat{\mathrm{MMD}}^2 = \frac{1}{n(n-1)} \sum_{i \neq j} k(\mathrm{dog}_i, \mathrm{dog}_j) + \frac{1}{n(n-1)} \sum_{i \neq j} k(\mathrm{fish}_i, \mathrm{fish}_j) - \frac{2}{n^2} \sum_{i,j} k(\mathrm{dog}_i, \mathrm{fish}_j)$$
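A minimal sketch of this estimator (assuming NumPy and the gaussian_kernel helper sketched earlier; mmd2_estimate is an illustrative name), with the within-sample sums excluding the diagonal exactly as in the $1/(n(n-1))$ terms above:

import numpy as np

def mmd2_estimate(X, Y, kernel):
    """Empirical MMD^2 between samples X ~ P and Y ~ Q (one observation per row)."""
    n, m = len(X), len(Y)
    Kxx = kernel(X, X)
    Kyy = kernel(Y, Y)
    Kxy = kernel(X, Y)
    # Within-sample terms exclude the diagonal, matching the 1/(n(n-1)) sums.
    term_xx = (Kxx.sum() - np.trace(Kxx)) / (n * (n - 1))
    term_yy = (Kyy.sum() - np.trace(Kyy)) / (m * (m - 1))
    term_xy = 2.0 * Kxy.mean()
    return term_xx + term_yy - term_xy

For example, mmd2_estimate(X, Y, lambda A, B: gaussian_kernel(A, B, sigma=1.0)) for two sample arrays with one observation per row.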

MMD as an integral probability metric

Are P and Q different?

[Figure: samples from P and Q.]

MMD as an integral probability metric

Integral probability metric: find a "well-behaved" function $f(x)$ to maximize
$$\mathbf{E}_P f(X) - \mathbf{E}_Q f(Y)$$

[Figure: a smooth function $f(x)$ evaluated over the samples from P and Q.]

MMD as an integral probability metric

Maximum mean discrepancy: a smooth function distinguishing P from Q,
$$\mathrm{MMD}(P, Q; \mathcal{F}) := \sup_{\|f\| \le 1} \left[ \mathbf{E}_P f(X) - \mathbf{E}_Q f(Y) \right]$$
($\mathcal{F}$ = unit ball in an RKHS).

[Figure: witness function f for the Gauss and Laplace densities.]

Functions are linear combinations of features:
$$f(x) = \langle f, \varphi(x) \rangle_{\mathcal{F}}$$

Expectations of functions are linear combinations of expected features,
$$\mathbf{E}_P f(X) = \langle f, \mathbf{E}_P \varphi(X) \rangle_{\mathcal{F}} = \langle f, \mu_P \rangle_{\mathcal{F}}$$
(always true if the kernel is bounded).

For a characteristic RKHS $\mathcal{F}$, $\mathrm{MMD}(P, Q; \mathcal{F}) = 0$ iff $P = Q$.

Other choices for the witness function class:
  Bounded continuous [Dudley, 2002]
  Bounded variation 1 (Kolmogorov metric) [Müller, 1997]
  Bounded Lipschitz (Wasserstein distances) [Dudley, 2002]

Integral prob. metric vs feature difference

The MMD:
$$\mathrm{MMD}(P, Q; \mathcal{F}) = \sup_{f \in \mathcal{F}} \left[ \mathbf{E}_P f(X) - \mathbf{E}_Q f(Y) \right]$$
$$= \sup_{f \in \mathcal{F}} \langle f, \mu_P - \mu_Q \rangle_{\mathcal{F}} \qquad \text{(using } \mathbf{E}_P f(X) = \langle \mu_P, f \rangle_{\mathcal{F}})$$
$$= \|\mu_P - \mu_Q\|_{\mathcal{F}}$$

[Figure: the witness f for the Gauss and Laplace densities; the supremum is attained at $f^*$.]

The function view and the feature view are equivalent.
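The final equality follows in one line from Cauchy-Schwarz (a short derivation, using the unit-ball constraint from the definition):
$$\sup_{\|f\|_{\mathcal{F}} \le 1} \langle f, \mu_P - \mu_Q \rangle_{\mathcal{F}} \;\le\; \|\mu_P - \mu_Q\|_{\mathcal{F}}, \qquad \text{with equality at } f^* = \frac{\mu_P - \mu_Q}{\|\mu_P - \mu_Q\|_{\mathcal{F}}}.$$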

Construction of MMD witness

Construction of the empirical witness function (proof: next slide!)

Observe $X = \{x_1, \ldots, x_n\} \sim P$
Observe $Y = \{y_1, \ldots, y_n\} \sim Q$

[Figure: the empirical witness function, evaluated at a location v, witness(v).]

Derivation of empirical witness function

Recall the witness function expression:
$$f^* \propto \mu_P - \mu_Q$$

The empirical feature mean for P:
$$\hat{\mu}_P := \frac{1}{n} \sum_{i=1}^n \varphi(x_i)$$

The empirical witness function at v:
$$f^*(v) = \langle f^*, \varphi(v) \rangle_{\mathcal{F}} \propto \langle \hat{\mu}_P - \hat{\mu}_Q, \varphi(v) \rangle_{\mathcal{F}} = \frac{1}{n} \sum_{i=1}^n k(x_i, v) - \frac{1}{n} \sum_{i=1}^n k(y_i, v)$$

We don't need explicit feature coefficients $f^* := [f_1\ f_2\ \ldots]$.
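A minimal sketch of evaluating this witness at a set of query locations (assuming NumPy and the gaussian_kernel helper sketched earlier; witness_at is an illustrative name):

import numpy as np

def witness_at(V, X, Y, kernel):
    """Empirical MMD witness evaluated at locations V, up to the normalising constant."""
    # Mean kernel similarity of each location v to the X sample, minus that to the Y sample.
    return kernel(V, X).mean(axis=1) - kernel(V, Y).mean(axis=1)

# Example: evaluate the witness on a grid for 1-d samples X ~ P, Y ~ Q.
# grid = np.linspace(-6, 6, 200)[:, None]
# w = witness_at(grid, X, Y, lambda A, B: gaussian_kernel(A, B, sigma=1.0))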

Two-Sample Testing


A statistical test using MMD

The empirical MMD:
$$\widehat{\mathrm{MMD}}^2 = \frac{1}{n(n-1)} \sum_{i \neq j} k(x_i, x_j) + \frac{1}{n(n-1)} \sum_{i \neq j} k(y_i, y_j) - \frac{2}{n^2} \sum_{i,j} k(x_i, y_j)$$

How does this help decide whether $P = Q$?

Perspective from statistical hypothesis testing:

Null hypothesis $H_0$: $P = Q$
  ...we should see $\widehat{\mathrm{MMD}}^2$ "close to zero".
Alternative hypothesis $H_1$: $P \neq Q$
  ...we should see $\widehat{\mathrm{MMD}}^2$ "far from zero".

Want a threshold $c_\alpha$ for $\widehat{\mathrm{MMD}}^2$ to get false positive rate $\alpha$.

Behaviour of $\widehat{\mathrm{MMD}}^2$ when $P \neq Q$

Draw $n = 200$ i.i.d. samples from P and Q (Laplace distributions with different y-variance).
  First draw: $\sqrt{n} \times \widehat{\mathrm{MMD}}^2 = 1.2$
Draw $n = 200$ new samples from P and Q:
  Second draw: $\sqrt{n} \times \widehat{\mathrm{MMD}}^2 = 1.5$
Repeat this 150, 300, 3000 times...

[Figures: scatter plots of the two samples, and the histogram of the $\sqrt{n} \times \widehat{\mathrm{MMD}}^2$ values filling in as the experiment is repeated.]
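This repeated-draw experiment is easy to reproduce (a sketch only: it uses one-dimensional Laplace samples with different scales in place of the slides' two-dimensional example, and the gaussian_kernel / mmd2_estimate helpers sketched earlier):

import numpy as np

rng = np.random.default_rng(0)
n, n_repeats = 200, 300
kern = lambda A, B: gaussian_kernel(A, B, sigma=1.0)

stats = []
for _ in range(n_repeats):
    # One-dimensional Laplace samples with different scales, so P != Q.
    X = rng.laplace(scale=1.0, size=(n, 1))
    Y = rng.laplace(scale=1.5, size=(n, 1))
    stats.append(np.sqrt(n) * mmd2_estimate(X, Y, kern))

# The histogram of these values concentrates away from zero and, per the next slide,
# is approximately Gaussian for large n.
print(np.mean(stats), np.std(stats))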

Asymptotics of $\widehat{\mathrm{MMD}}^2$ when $P \neq Q$

When $P \neq Q$, the statistic is asymptotically normal,
$$\frac{\widehat{\mathrm{MMD}}^2 - \mathrm{MMD}^2(P, Q)}{\sqrt{V_n(P, Q)}} \xrightarrow{D} \mathcal{N}(0, 1),$$
where the variance $V_n(P, Q) = O(n^{-1})$.

[Figures: two Laplace distributions with different variances; the empirical PDF of the statistic with a Gaussian fit.]

Behaviour of $\widehat{\mathrm{MMD}}^2$ when $P = Q$

What happens when P and Q are the same?

Behaviour of $\widehat{\mathrm{MMD}}^2$ when $P = Q$

Case of $P = Q = \mathcal{N}(0, 1)$.

[Figure: histogram of the statistic over repeated draws when $P = Q = \mathcal{N}(0, 1)$.]

Asymptotics of $\widehat{\mathrm{MMD}}^2$ when $P = Q$

When $P = Q$, the statistic has asymptotic distribution
$$n \widehat{\mathrm{MMD}}^2 \sim \sum_{l=1}^{\infty} \lambda_l \left[ z_l^2 - 2 \right]$$
where
$$\lambda_l \psi_l(x') = \int_{\mathcal{X}} \underbrace{\tilde{k}(x, x')}_{\text{centred}} \psi_l(x)\, dP(x)$$
and $z_l \sim \mathcal{N}(0, 2)$ i.i.d.

[Figure: the resulting null distribution of the statistic.]

A statistical test

A summary of the asymptotics, and the test construction: (G., Borgwardt, Rasch, Schoelkopf, and Smola, JMLR 2012)

[Figure: the null and alternative distributions of the test statistic.]

How do we get test threshold $c_\alpha$?

Original empirical MMD for dogs and fish:
$$\widehat{\mathrm{MMD}}^2 = \frac{1}{n(n-1)} \sum_{i \neq j} k(x_i, x_j) + \frac{1}{n(n-1)} \sum_{i \neq j} k(y_i, y_j) - \frac{2}{n^2} \sum_{i,j} k(x_i, y_j)$$

[Figure: the kernel matrix partitioned into the $k(x_i, x_j)$, $k(y_i, y_j)$, and $k(x_i, y_j)$ blocks.]

How do we get test threshold $c_\alpha$?

Permuted dog and fish samples (merdogs):
$$\widehat{\mathrm{MMD}}^2 = \frac{1}{n(n-1)} \sum_{i \neq j} k(\tilde{x}_i, \tilde{x}_j) + \frac{1}{n(n-1)} \sum_{i \neq j} k(\tilde{y}_i, \tilde{y}_j) - \frac{2}{n^2} \sum_{i,j} k(\tilde{x}_i, \tilde{y}_j)$$

Permutation simulates $P = Q$.

Demonstration of permutation estimate of null

Null distribution estimated from 500 permutations, $P = Q = \mathcal{N}(0, 1)$.

[Figure: histogram of the permutation null distribution.]

Use the $1 - \alpha$ quantile of the permutation distribution for the test threshold $c_\alpha$.
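A minimal sketch of the permutation threshold (assuming NumPy and the mmd2_estimate / gaussian_kernel helpers sketched earlier; permutation_threshold is an illustrative name):

import numpy as np

def permutation_threshold(X, Y, kernel, alpha=0.05, n_perms=500, rng=None):
    """Estimate the (1 - alpha) quantile of the MMD^2 null via sample permutation."""
    rng = np.random.default_rng() if rng is None else rng
    pooled = np.concatenate([X, Y], axis=0)
    n = len(X)
    null_stats = []
    for _ in range(n_perms):
        idx = rng.permutation(len(pooled))        # relabel the pooled sample at random
        Xp, Yp = pooled[idx[:n]], pooled[idx[n:]]
        null_stats.append(mmd2_estimate(Xp, Yp, kernel))
    return np.quantile(null_stats, 1.0 - alpha)

# Reject H0: P = Q if mmd2_estimate(X, Y, kernel) on the original labelling exceeds this threshold.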

How to choose the best kernel


Optimizing kernel for test power

The power of our test ($\Pr_1$ denotes probability under $P \neq Q$):
$$\Pr_1\left( n \widehat{\mathrm{MMD}}^2 > \hat{c}_\alpha \right) \to \Phi\left( \frac{\mathrm{MMD}^2(P, Q)}{\sqrt{V_n(P, Q)}} - \frac{c_\alpha}{n \sqrt{V_n(P, Q)}} \right)$$
where $\Phi$ is the CDF of the standard normal distribution, and $\hat{c}_\alpha$ is an estimate of the test threshold $c_\alpha$.

The variance under $H_1$ decreases as $\sqrt{V_n(P, Q)} \sim O(n^{-1/2})$: the first term inside $\Phi(\cdot)$ is $O(n^{1/2})$, the second is $O(n^{-1/2})$. For large $n$, the second term is negligible!

To maximize test power, maximize
$$\frac{\mathrm{MMD}^2(P, Q)}{\sqrt{V_n(P, Q)}}$$
(Sutherland, Tung, Strathmann, De, Ramdas, Smola, G., ICLR 2017)

Code: github.com/dougalsutherland/opt-mmd
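A rough sketch of this criterion for bandwidth selection (an illustration only, not the ICLR 2017 method: it replaces the paper's closed-form variance estimator and regularisation with a crude block-resampling estimate of the standard deviation; helper names are illustrative):

import numpy as np

def power_criterion(X, Y, sigma, n_blocks=10):
    """Crude estimate of MMD^2 / std(MMD^2) for one bandwidth, via block resampling."""
    k = lambda A, B: gaussian_kernel(A, B, sigma=sigma)
    blocks = []
    for Xb, Yb in zip(np.array_split(X, n_blocks), np.array_split(Y, n_blocks)):
        blocks.append(mmd2_estimate(Xb, Yb, k))
    blocks = np.array(blocks)
    return blocks.mean() / (blocks.std() + 1e-8)  # small constant guards against division by zero

# Choose the bandwidth with the largest criterion on a held-out "training" split,
# then run the permutation test on the remaining data with that bandwidth.
# best_sigma = max(candidate_sigmas, key=lambda s: power_criterion(X_train, Y_train, s))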

Graphical illustration

Reminder: maximising test power is the same as minimizing false negatives.

[Figure: the null and alternative distributions of the test statistic with the threshold between them.]

Troubleshooting for generative adversarial networks

[Figures: MNIST samples; samples from a GAN; the learned ARD map.]

Power for the optimized ARD kernel: 1.00 at $\alpha = 0.01$.
Power for the optimized RBF kernel: 0.57 at $\alpha = 0.01$.

Troubleshooting generative adversarial networks


MMD for GAN critic

Can you use the MMD as a critic to train GANs?
Can you train convolutional features as input to the MMD critic?

[Figures: excerpts from an ICML 2015 paper and a UAI 2015 paper addressing these questions.]

MMD for GAN critic: 2017 update

[Figures: excerpts from 2017 work on MMD-based critics, including an "Energy Distance Kernel".]

An adaptive, linear time distribution metric


Reminder: witness function for MMD

$\hat{\mu}_P(v)$: mean embedding of P
$\hat{\mu}_Q(v)$: mean embedding of Q

$$\mathrm{witness}(v) = \hat{\mu}_P(v) - \hat{\mu}_Q(v)$$

[Figure: the witness function evaluated at a location v.]

Distinguishing Feature(s)

Take the square of the witness, $\mathrm{witness}^2(v)$ (we only worry about its amplitude).

New test statistic: $\mathrm{witness}^2$ at a single location $v$.
  Linear time in the number $n$ of samples.
  ...but how to choose the best feature $v$?

Best feature = the $v$ that maximizes $\mathrm{witness}^2(v)$??

Distinguishing Feature(s)

[Figures: the empirical $\mathrm{witness}^2(v)$ for sample sizes $n = 3$, $n = 50$, and $n = 500$, and the population $\mathrm{witness}^2$ function alongside the densities $P(x)$ and $Q(y)$; several candidate locations $v$ are marked.]

Variance of witness function

Variance at $v$ = variance of $X$ at $v$ + variance of $Y$ at $v$.

ME statistic:
$$\hat{\lambda}_n(v) := n\, \frac{\mathrm{witness}^2(v)}{\mathrm{variance}(v)}$$
Jitkrittum, Szabo, Chwialkowski, G., NIPS 2016

[Figures: $\mathrm{witness}^2(v)$ together with the variance of $X$ at $v$, the variance of $Y$ at $v$, their sum, and the resulting $\hat{\lambda}_n(v)$.]

The best location is the $v$ that maximizes $\hat{\lambda}_n$.
Improve performance using multiple locations $\{v_j\}_{j=1}^J$.

Properties of the ME Test

Can use $J$ features $V = \{v_1, \ldots, v_J\}$.

Under $H_0$: $P = Q$, asymptotically $\hat{\lambda}_n(V)$ follows $\chi^2(J)$ for any $V$.
  Rejection threshold is $T_\alpha = (1 - \alpha)$-quantile of $\chi^2(J)$.

Under $H_1$: $P \neq Q$, it follows $\mathbb{P}_{H_1}(\hat{\lambda}_n)$ (unknown).
  But asymptotically $\hat{\lambda}_n \to \infty$. Consistent test.

Test power = probability of rejecting $H_0$ when $H_1$ is true.

[Figure: the $\chi^2(J)$ null density, the threshold $T_\alpha$, and the distribution $\mathbb{P}_{H_1}(\hat{\lambda}_n)$ under the alternative.]

Theorem: Under $H_1$, optimization of $V$ (by maximizing $\hat{\lambda}_n$) increases the (lower bound of the) test power.

Runtime: $O(n)$ for both testing and optimization.
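A sketch of the statistic with $J$ locations, following the construction in the NIPS 2016 paper: pair the samples, form the $J$-dimensional differences of kernel similarities to the test locations, and take a covariance-normalised quadratic form (assuming NumPy, equal sample sizes, and the gaussian_kernel-style helper from earlier; me_statistic and reg are illustrative names, with reg a small regulariser for numerical stability):

import numpy as np

def me_statistic(X, Y, V, kernel, reg=1e-5):
    """ME statistic for J test locations V (one location per row); X and Y have equal size n."""
    n = len(X)
    # Feature differences z_i in R^J: similarities of x_i to each v_j minus those of y_i.
    Z = kernel(X, V) - kernel(Y, V)                          # shape (n, J)
    z_bar = Z.mean(axis=0)
    S = np.cov(Z, rowvar=False) + reg * np.eye(Z.shape[1])   # regularised covariance of the z_i
    return n * z_bar @ np.linalg.solve(S, z_bar)

# Reject H0: P = Q at level alpha when the statistic exceeds the chi-squared quantile,
# e.g. scipy.stats.chi2.ppf(1 - alpha, df=V.shape[0]).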

Distinguishing Positive/Negative Emotions

+: happy, neutral, surprised
-: afraid, angry, disgusted

35 females and 35 males (Lundqvist et al., 1998).
$48 \times 34 = 1632$ dimensions. Pixel features. Sample size: 402.

[Figures: test power on the + vs. - problem for a random feature, the proposed (optimized) feature, and the quadratic-time MMD; the learned feature location shown as an image.]

The proposed test achieves maximum test power in time $O(n)$.
Informative features: differences at the nose, and smile lines.

Jitkrittum, Szabo, Chwialkowski, G., NIPS 2016

Code: https://github.com/wittawatj/interpretable-test

Co-authors

From Gatsby:
  Kacper Chwialkowski
  Wittawat Jitkrittum
  Bharath Sriperumbudur
  Heiko Strathmann
  Dougal Sutherland
  Zoltan Szabo
  Wenkai Xu

External collaborators:
  Kenji Fukumizu
  Bernhard Schoelkopf
  Alex Smola

Questions?