Nonparametric probability metrics with applications in classification and data visualization Alberto Muñoz, Universidad Carlos III de Madrid

[email protected] Gabriel Martos, Pontificia Universidad Católica de Valparaíso [email protected] Javier González, Sheffield Institute for Translational Neuroscience j.h.gonzalez@sheffield.ac.uk

Abstract

In this paper we introduce a family of nonparametric probability metrics. We propose to estimate the nonparametric distance between probability measures using estimated density level sets. Simulated and real data sets are used to test the performance of the proposed nonparametric metrics, in particular to solve classification, data visualization and hypothesis testing problems.

Keywords: Probability metrics, distributions, density level sets, classification, data

visualization, hypothesis testing.

1 Introduction

A probability metric, also known as a statistical distance, is a functional that quantifies the dissimilarity of two random quantities, in particular two probability measures. Examples of the use of probability metrics in Statistics are homogeneity, independence and goodness of fit tests; for example, goodness of fit tests make use of the χ2 distance or the Kolmogorov metric. Probability metrics are also extensively used in several applications related to Machine Learning and Pattern Recognition, for instance in Clustering Banerjee et al. (2005), Image Analysis Rubner et al. (2000), Bioinformatics Minas et al. (2013); Saez et al. (2014), Time Series Analysis Ryabko and Mary (2013); Moon et al. (1995) or Text Mining Lebanon (2006), just to name a few.

The ζ-divergences Csiszár and Shields (2004) constitute an interesting class of metrics for probability measures (PMs). Consider two PMs, say P and Q, defined on a measurable space (X, F, µ), where X is the sample space, F the σ-algebra of measurable subsets of X and µ : F → R+ the Lebesgue measure. For a convex function ζ and assuming that P is absolutely continuous with respect to Q, the ζ-divergence from P to Q is defined by:

d_ζ(P, Q) = ∫_X ζ(dP/dQ) dQ.   (1)

Some well known particular cases are in order: for ζ(t) = |t − 1|/2 we obtain the Total Variation metric; ζ(t) = (t − 1)^2 yields the χ2-distance; ζ(t) = (√t − 1)^2 yields the Hellinger distance. A second important class of dissimilarities between PMs is the family of Bregman divergences. Consider a continuously-differentiable, real-valued and strictly convex function ϕ; then the Bregman divergence between P and Q is:

d_ϕ(P, Q) = ∫_X (ϕ(fP) − ϕ(fQ) − (fP − fQ) ϕ′(fQ)) dµ,   (2)

where fP and fQ represent the density functions of P and Q respectively, and ϕ′(fQ) is the derivative of ϕ evaluated at fQ (see Frigyik et al., 2008; Cichocki and Amari, 2010 for further details). Some examples of Bregman divergences: ϕ(t) = t^2 yields the Euclidean distance between fP and fQ (in L2); ϕ(t) = t log(t) yields the Kullback-Leibler (KL) divergence; and for ϕ(t) = − log(t) we obtain the Itakura-Saito distance. In general d_ζ and d_ϕ are not metrics because of the lack of symmetry and because they do not necessarily satisfy the triangle inequality.

A third interesting class of distances for PMs are the integral probability metrics (IPM) Müller (1997). Consider a class of real-valued bounded measurable functions on X, say H, and define the IPM between P and Q as

d_H(P, Q) = sup_{f ∈ H} | ∫ f dP − ∫ f dQ |.   (3)

If we choose H as the space of bounded functions such that h ∈ H if ‖h‖∞ ≤ 1, then d_H is the Total Variation metric; when H = {∏_{i=1}^d 1[(−∞, xi)] : x = (x1, . . . , xd) ∈ R^d}, d_H is the Kolmogorov distance; if H = {e^{√−1 ⟨ω, ·⟩} : ω ∈ R^d}, the metric computes the maximum difference between characteristic functions. Sriperumbudur et al. (2010) propose to choose H as a Reproducing Kernel Hilbert Space and study conditions on H to obtain proper metrics d_H.
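As a point of intuition, the discrete analogues of these divergences are straightforward to compute. The sketch below (pure Python; the function name and example vectors are ours, not part of the paper) evaluates the discrete counterpart of Equation (1) for the three convex functions ζ listed above.

```python
import math

def f_divergence(p, q, zeta):
    """Discrete analogue of Eq. (1): sum over x of zeta(p(x)/q(x)) * q(x)."""
    return sum(zeta(pi / qi) * qi for pi, qi in zip(p, q) if qi > 0)

# Two toy probability vectors on a common 3-point support.
p = [0.2, 0.5, 0.3]
q = [0.25, 0.25, 0.5]

tv = f_divergence(p, q, lambda t: abs(t - 1) / 2)                    # Total Variation
chi2 = f_divergence(p, q, lambda t: (t - 1) ** 2)                    # chi^2 distance
hellinger2 = f_divergence(p, q, lambda t: (math.sqrt(t) - 1) ** 2)   # squared Hellinger
```

Note that `tv` coincides with half the L1 distance between the two vectors, a standard identity for the Total Variation choice of ζ.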

In practice, the problem in implementing the metrics described above is that we do not know explicitly the underlying distribution of the data at hand, and we need to compute a distance between probability measures by using a finite data sample. In this context, the computation of a distance between PMs that relies on nonparametric density estimates is often computationally difficult, and the rate of convergence of the estimated distance is usually slow Stone (1980). To avoid the estimation of the density function, we adopt the perspective of the Generalized Function theory of Schwartz (Strichartz, 1994; Zemanian, 1965), where a function is not specified by its values but by its evaluations against a collection of smooth and compactly supported functions in a space of test functions. We then derive a family of (semi)metrics for PMs from the metric structure inherited from the inner product in the space of test functions. The computation of the proposed metrics for PMs is based on the estimation of density level sets, avoiding in this way the difficult task of density estimation. In this work we extend the seminal idea in Muñoz et al. (2012) by proposing a general definition of nonparametric metrics for PMs and establishing the properties of this family of nonparametric metrics. We also provide weighting schemes that significantly improve the discrimination power of the proposed metrics, as well as novel applications of the family of nonparametric metrics for PMs in the context of classification and data visualization. The article is organized as follows: in Section 2 we introduce the family of nonparametric metrics and their empirical estimators based on the estimation of density level sets. In Section 3 we illustrate the theory with simulations and real data applications in classification and visualization. Finally, Section 4 concludes our work.

2 Nonparametric probability metrics and level sets

Throughout this paper we consider a measure space (X, B, µ), where X ⊂ R^d is a compact set of a real vector space (a non-restrictive assumption in real scenarios, see for instance Moguerza and Muñoz (2006)), B is the Borel σ-algebra in X, and µ : B → R+ the Borel σ-finite measure on X. A PM P is a σ-finite measure absolutely continuous w.r.t. µ; then there exists a measurable function fP : X → R+ (the density function) such that P(A) = ∫_A fP dµ, where fP = dP/dµ is the Radon-Nikodym derivative.

In order to motivate the discussion of nonparametric metrics for PMs, we consider the general framework of Schwartz distributions. A distribution is a continuous linear functional on a set D of test functions, where the usual choice for D is a subset of C∞(X) (Strichartz, 1994). A probability measure can be regarded as a generalized function P : D → R by defining P(φ) = ∫ φ dP = ⟨φ, fP⟩. When the density function fP ∈ D, then fP acts as the representer in the Riesz representation theorem: P(·) = ⟨·, fP⟩. Note that we do not need to impose that fP ∈ D; only the integral ⟨φ, fP⟩ should be properly defined for every φ ∈ D. In this way two given continuous linear functionals, say P1 and P2, will be identical (similar) if they act identically (similarly) on every φ ∈ D. For instance, if we choose φ(x) = x, then P1(φ) = ⟨fP1, x⟩ = µP1, and if P1 and P2 are 'similar' then µP1 ≃ µP2 because P1 and P2 are continuous functionals. The same argument applies to higher order moments: take φ(x) = x^r for r ∈ N+; if P1 and P2 are 'similar' then µ^(r)_P1 ≃ µ^(r)_P2. For φ_ξ(x) = e^{ixξ}, ξ ∈ R, we obtain the Fourier transform of the probability measure (the characteristic function in Statistics), given by f̂(ξ) = ⟨P, e^{ixξ}⟩ = ∫ e^{ixξ} dP. Thus, two PMs can be identified in terms of functional evaluations, for appropriately chosen test functions, as the following definition suggests.
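To make the "evaluation against test functions" viewpoint concrete, the following sketch (our own illustration, not part of the paper's method) approximates ⟨P, φ⟩ = ∫ φ dP by sample averages, recovering moments and the characteristic function of a standard normal distribution.

```python
import cmath
import math
import random

def evaluate(sample, phi):
    """Monte-Carlo evaluation of <P, phi> = integral of phi dP from a sample of P."""
    return sum(phi(x) for x in sample) / len(sample)

rng = random.Random(42)
sample = [rng.gauss(0.0, 1.0) for _ in range(50_000)]   # sample from P = N(0, 1)

mean = evaluate(sample, lambda x: x)                        # phi(x) = x  -> mu_P
second_moment = evaluate(sample, lambda x: x * x)           # phi(x) = x^2
cf_half = evaluate(sample, lambda x: cmath.exp(1j * 0.5 * x))  # characteristic fn at xi = 0.5
```

For N(0, 1) the exact values are 0, 1 and e^{−ξ²/2} = e^{−0.125} respectively, and the sample averages land close to them.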

Definition (Identification of PMs). Let D be a set of test functions and P and Q two PMs defined on the measure space (X, B, µ). Then we say that P = Q on D if:

⟨P, φ⟩ = ⟨Q, φ⟩,  for all φ ∈ D.

The key point in our approach is that if we appropriately choose a finite subset of test functions {φ1, . . . , φm} ⊂ D, we can compute the distance between the probability measures by calculating a finite number of functional evaluations.

2.1 Level sets for probability measures

The ν-minimum volume set of a PM P with density function fP is defined by Sν(fP) = {x ∈ X | fP(x) ≥ αν}, where αν is chosen such that P(Sν(fP)) = ∫_{Sν} dP = 1 − ν, with 0 < ν < 1.

Definition (Asymptotically dense sequence). Given m ∈ N+, we define the asymptotically dense sequence ν^m = {ν0, . . . , νm} as an ordered sequence where 0 = ν0 < ν1 < . . . < νm = 1 and the elements of the sequence ν^m constitute an asymptotically dense set in [0, 1].

As an example, consider the sequence ν^m := {i/m}_{i=0}^m, an asymptotically dense sequence uniformly distributed in [0, 1]. With respect to an asymptotically dense sequence ν^m, the corresponding ν-minimum volume sets are decreasing, that is, Sνi(fP) ⊆ Sνi−1(fP). We define the νi-level (Borel) set as A_i^(m)(P) = Sνi−1(fP) − Sνi(fP) for i ∈ {1, . . . , m}. To simplify the notation, in what follows we omit the superscript (m) in the νi-level set relative to a PM. Next we connect the definition of ν-level sets with the space of test functions D. We choose D = Cc(X), the space of all compactly supported and piecewise continuous functions on X, as the space of test functions. Notice that D is dense in Lp. Given two PMs P and Q, we consider the indicator functions of ν-level sets as the family of test functions {φi}_{i∈I} ⊆ D, and then define distances between P and Q by weighting terms of the type d(⟨P, φi⟩, ⟨Q, φi⟩) for i ∈ I, where d is some distance function.

2.2 Nonparametric metrics for probability measures

Let DX be the set of σ-finite probability measures defined on X and fix the asymptotically dense sequence ν^m. Then define φ : DX → D such that for P ∈ DX, φi(P) = 1[Ai(P)], where φi ∈ D. We propose distances of the form Σ_{i=1}^m wi d(φi(P), φi(Q)). Several instances of different distance functionals d(·, ·) can be considered; for example, with d(φi(P), φi(Q)) = ‖φi(P) − φi(Q)‖2, d is the L2_µ distance between the indicator functions of the ith ν-level sets of the PMs P and Q. We are particularly interested in the measure of the standardized symmetric difference:

d(φi(P), φi(Q)) = µ(Ai(P) △ Ai(Q)) / µ(Ai(P) ∪ Ai(Q)).

This motivates the definition of the ν-level set semi-metric as follows.

Definition (Weighted ν-level set semi-metric). Fix an asymptotically dense sequence ν^m. The family of ν-level set distances between P and Q is defined by

d_{ν^m}(P, Q) = Σ_{i=1}^m wi d(φi(P), φi(Q)) = Σ_{i=1}^m wi µ(Ai(P) △ Ai(Q)) / µ(Ai(P) ∪ Ai(Q)),   (4)

where w1, . . . , wm ∈ R+ and µ is the ambient measure. Equation (4) can be interpreted as a weighted sum of Jaccard distances Deza and Deza (2009) between the sets Ai(P) and Ai(Q). For m ≫ 0, when P ≈ Q, then d_{ν^m}(P, Q) ≈ 0 since µ(Ai(P) △ Ai(Q)) ≈ 0 for all i ∈ {1, . . . , m}. To clarify this idea, assume that ‖fP(x) − fQ(x)‖∞ ≤ ε; since P = Q in the limit ε → 0, then µ(Ai(P) △ Ai(Q)) → 0 as ε → 0 for all i ∈ {1, . . . , m}, because otherwise this would contradict the fact that P = Q in the limit. The semi-metric proposed in Equation (4) constitutes a proper metric when m → ∞ (see the technical appendix for further details).
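Under the counting measure, each term of Equation (4) is a classical Jaccard distance between finite sets, so a minimal sketch of the level-set distance (our own illustration; function names are ours) is:

```python
def jaccard_distance(A, B):
    """mu(A symmetric-difference B) / mu(A union B) for finite sets (counting measure)."""
    A, B = set(A), set(B)
    union = A | B
    return len(A ^ B) / len(union) if union else 0.0

def level_set_distance(levels_P, levels_Q, weights):
    """Eq. (4): weighted sum of Jaccard distances between matched level sets."""
    return sum(w * jaccard_distance(a, b)
               for w, a, b in zip(weights, levels_P, levels_Q))
```

For instance, with the uniform weights wi = 1/m of weighting scheme 0, the distance between the level-set sequences [{1,2,3}, {4,5}] and [{2,3}, {4,6}] is (1/2)(1/3) + (1/2)(2/3) = 1/2.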

2.3 On the estimation of nonparametric metrics for PMs

We can calculate d_{ν^m} in Equation (4) only when we know the distribution functions of P and Q. In the typical context of data analysis, two samples sP and sQ drawn from P and Q respectively are available, and we need to define an estimator for d_{ν^m} based on these samples. Consider estimators Âi(P) = Ŝνi−1(fP) − Ŝνi(fP); then we can estimate d_{ν^m}(P, Q) by

d̂_{ν^m}(P, Q) = Σ_{i=1}^m wi µ(Âi(P) △ Âi(Q)) / µ(Âi(P) ∪ Âi(Q)).   (5)

2.3.1 Estimation of density level sets

In order to compute the estimated distance in Equation (5) we first need an estimator for the sequence of ν-level sets {Âi(P), Âi(Q)}_{i=1}^m. To carry out this estimation, we consider two random samples sP and sQ of sizes n_sP and n_sQ drawn from the PMs P and Q respectively, and the One-Class Neighbor Machine (OCNM) Muñoz and Moguerza (2006). The OCNM solves the following optimization problem:

max_{ρ,ξ}  νnρ − Σ_{i=1}^n ξi
s.t.  g(xi) ≥ ρ − ξi,  ξi ≥ 0,  i = 1, . . . , n,   (6)

where g(x) = M(x, sn) is a sparsity measure, ξi with i = 1, . . . , n are slack variables, ν ∈ [0, 1] is the threshold parameter that verifies P(Sν) = 1 − ν, and ρ is a decision value which, induced by the sparsity measure, determines whether a given point belongs to the support of the distribution. The OCNM then estimates density contour clusters around the modes of the distribution for a fixed 0 ≤ ν ≤ 1. In Table 1 we present the algorithmic version of the OCNM to compute the sequence of ν-level sets.

Table 1: Estimation of ν-level sets of a PM P.

INPUTS: sP (of size n), ν^m and g(x)

1: FOR i in 1 TO m:
2:   Choose ν = νi ∈ ν^m.
3:   Consider the order induced in the sample sP by the sparsity measure g(x), that is, g(x(1)) ≤ · · · ≤ g(x(n)), where x(i) denotes the ith sample point in the order induced by g.
4:   DO ρ*_n = g(x(νn)) if νn ∈ N, ρ*_n = g(x([νn]+1)) otherwise, where [x] stands for the largest integer not greater than x.
5:   Define hn(x) = sign(ρ*_n − g(x)) and DO Ŝνi(fP) = {x : hn(x) ≥ 0}.
6:   DO Âi(P) = Ŝνi−1(fP) − Ŝνi(fP).

OUTPUT: The sequence {Âi(P)}_{i=1}^m

The region Ŝν(fP) estimated with the OCNM converges in probability to Sν(fP) (Muñoz and Moguerza (2006), Theorem 1, p. 477). Notice that Âi(P) ⊆ sP and ∪_{i=1}^m Âi(P) = sP. In the supplementary material we present more details about the estimation of the ν-level sets. The execution time required to compute the sequence of ν-level sets grows at a rate of order O(dn^2), where n represents the size of the data sample sP and d stands for the dimension of the data at hand. The same procedure applies for the estimation of the ν-level set sequence of the PM Q. It is important to mention that the effectiveness of the estimation procedure is not affected by multimodal and/or asymmetric distributions, as can be seen (together with more details about the estimation procedure) in Muñoz and Moguerza (2006).
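A compact one-dimensional sketch of Table 1 follows (our own illustration; the function names are ours). We use the mean k-nearest-neighbour distance as one concrete choice of the sparsity measure M(x, sn): points with small g lie in dense regions, so, following the convention P(Sν) = 1 − ν, Ŝν keeps roughly the (1 − ν)n densest points and the Âi are the successive "shells".

```python
def knn_sparsity(x, sample, k=3):
    """Sparsity measure g(x): mean distance from x to its k nearest neighbours."""
    d = sorted(abs(x - y) for y in sample)
    return sum(d[1:k + 1]) / k          # d[0] == 0 is the point itself

def level_sets(sample, nus, k=3):
    """Split the sample into estimated nu-level sets A_i = S_{nu_{i-1}} - S_{nu_i},
    for an ordered sequence nus = [0 = nu_0, ..., nu_m = 1]."""
    ordered = sorted(sample, key=lambda x: knn_sparsity(x, sample, k))  # densest first
    n = len(sample)
    cuts = [round((1 - nu) * n) for nu in nus]      # |S_nu| ~ (1 - nu) * n
    return [ordered[cuts[i]:cuts[i - 1]] for i in range(1, len(nus))]
```

The returned sets partition the sample: the first shell A_1 collects the sparsest points (the tails) and the last shell the densest region around the modes.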

2.3.2 Covering approximation

After the estimation of the ν-level sets of both distributions, one could think that the remaining task is simply to plug these estimates into Equation (5). Notice that µ(Âi(P) ∪ Âi(Q)) equals the total number of points in Âi(P) ∪ Âi(Q), i.e. |Âi(P) ∪ Âi(Q)|, the counting measure of the union. Regarding the numerator in Equation (5), one is tempted to estimate µ(Ai(P) △ Ai(Q)), the area of the region Ai(P) △ Ai(Q), by |Âi(P) − Âi(Q)| + |Âi(Q) − Âi(P)|. However this is incorrect, since there will probably be no points in common between Âi(P) and Âi(Q), that is, Âi(P) ∪ Âi(Q) = Âi(P) △ Âi(Q). For this reason, we propose a covering estimate for the numerator and the denominator in Equation (5). The set Âi(P) is always a subset of the sample sP drawn from the PM P, and we denote this set of sample points by sÂi(P) from now on. We reserve the notation Âi(P) for the covering estimation of Ai(P) defined by ∪_j^n B(xj, rÂi(P)), where xj ∈ sÂi(P) and B(xj, rÂi(P)) are closed balls with centres at xj and (fixed) radius rÂi(P) Devroye and Wise (1980). The radius is chosen to be constant (for data points in sÂi(P)) because we can assume that the density is approximately constant inside the region Âi(P) when m ≫ 0 (m being the size of the asymptotically dense sequence ν^m). For instance, in the experimental section we fix rÂi(P) as the median distance between the points that belong to the set sÂi(P).

To illustrate the covering estimation procedure, in Figure 1 (a) we show data points corresponding to two ν-level sets A and B: sA (red points) and sB (blue points), respectively. In Figure 1 (b), Â is the covering estimation of set A made up of the union of balls centred at the red data points in sA, that is, Â = ∪_j^n B(xj, rA) → A as rA(n) → 0 and n·rA(n) → ∞ Chevalier (1976); Devroye and Wise (1980). Figure 1 (c) can be interpreted equivalently regarding the covering estimate B̂. Then the magnitude of the estimated symmetric difference between the two sets is computed as the number of sample points in sB not belonging to the covering estimate of A, plus the number of points in sA not belonging to the covering estimate of B. To make the computation

Figure 1: Set estimate of the symmetric difference. (a) Data samples sA (red) and sB (blue). (b) sB minus the covering Â: blue points. (c) sA minus the covering B̂: red points. Blue points in (b) plus red points in (c) are the estimate of A △ B.

explicit, consider x ∈ A, y ∈ B and define

I_{rA,rB}(x, y) = 1[B(x,rA)](y) + 1[B(y,rB)](x) − 1[B(x,rA)](y) · 1[B(y,rB)](x),

where I_{rA,rB}(x, y) = 1 when y belongs to the covering Â, x belongs to the covering B̂, or both events happen. Thus, if we define

I(A, B) = Σ_{x∈A} Σ_{y∈B} I_{rA,rB}(x, y),

we are able to estimate the symmetric difference by

µ̂(A △ B) = µ̂(A ∪ B) − µ̂(A ∩ B) = |A ∪ B| − I(A, B).
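In one dimension the covering estimate admits a short sketch (our own illustration; B(c, r) is the interval [c − r, c + r] and the function names are ours):

```python
def covered(z, centers, r):
    """True if z falls in the covering: the union of closed balls B(c, r)."""
    return any(abs(z - c) <= r for c in centers)

def sym_diff_estimate(sA, sB, rA, rB):
    """Covering estimate of mu(A symmetric-difference B): points of sB outside
    the covering of A plus points of sA outside the covering of B."""
    return (sum(1 for y in sB if not covered(y, sA, rA))
            + sum(1 for x in sA if not covered(x, sB, rB)))
```

For example, with sA = [0.0, 0.1, 0.2], sB = [0.15, 3.0] and rA = rB = 0.2, only the isolated point 3.0 escapes both coverings, so the estimate is 1.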

2.3.3 Weights for the ν-level set distances

In this section we define different weighting schemes for the family of distances defined by Equation (4). Denote by nÂi(P) the number of data points in sÂi(P), by nÂi(Q) the number of data points in sÂi(Q), by rÂi(P) the (fixed) radius for the covering Âi(P), and by rÂi(Q) the (fixed) radius for the covering Âi(Q). We define the following weighting schemes:

Weighting Scheme 0: Choose wi in Equation (4) as wi = 1/m for i = 1, . . . , m; that is, wi is constant and independent of the sample data.

Weighting Scheme 1: Make wi in Equation (4) equal to:

wi = (1/m) Σ_{x ∈ sÂi(P)} Σ_{y ∈ sÂi(Q)} (1 − I_{rÂi(P), rÂi(Q)}(x, y)) · ‖x − y‖2 / (nÂi(P) · nÂi(Q)).

Therefore wi is a weighted average of distances between a point of sÂi(P) and a point of sÂi(Q), where ‖x − y‖2 is taken into account only if I_{rÂi(P), rÂi(Q)}(x, y) = 0.

Weighting Scheme 2: Choose wi in Equation (4) as:

wi = (1/m) max_{x ∈ sÂi(P), y ∈ sÂi(Q)} { (1 − I_{rÂi(P), rÂi(Q)}(x, y)) ‖x − y‖2 },

which accounts for the radius of the set Ai(P) △ Ai(Q).

Weighting Scheme 3: Based on the Hausdorff distance between the sets Ai(P) − Ai(Q) and Ai(Q) − Ai(P). Choose wi in Equation (4) as:

wi = (1/m) Ĥ(sÂi(Q) − Âi(P), sÂi(P) − Âi(Q)),

where Ĥ(X̂, Ŷ) denotes the estimated Hausdorff distance (finite sample version) between the sets X̂ and Ŷ.

In the experimental section, we denote by LS(0) the level set distance obtained when weighting scheme 0 is used; similarly, LS(1), LS(2) and LS(3) correspond to the level set distance when weighting schemes 1, 2 and 3 are used, respectively.
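A 1-D sketch of weighting schemes 1 and 2 (our own function names; balls become intervals, so the uncovered condition reduces to |x − y| exceeding both radii):

```python
def uncovered(x, y, rA, rB):
    """1 - I_{rA,rB}(x, y): the pair counts only if y lies outside B(x, rA)
    and x lies outside B(y, rB)."""
    return abs(x - y) > rA and abs(x - y) > rB

def weight_scheme_1(sA, sB, rA, rB, m):
    """Scheme 1: average distance over uncovered cross pairs, scaled by 1/(m |sA| |sB|)."""
    total = sum(abs(x - y) for x in sA for y in sB if uncovered(x, y, rA, rB))
    return total / (m * len(sA) * len(sB))

def weight_scheme_2(sA, sB, rA, rB, m):
    """Scheme 2: largest uncovered cross-pair distance, scaled by 1/m."""
    gaps = [abs(x - y) for x in sA for y in sB if uncovered(x, y, rA, rB)]
    return max(gaps, default=0.0) / m
```

For sA = [0.0], sB = [1.0, 0.05] with rA = rB = 0.1 and m = 1, only the pair (0.0, 1.0) is uncovered, so scheme 1 gives 1.0/2 = 0.5 and scheme 2 gives 1.0.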

3 Experimental work

In this section we compare the discrimination power of the proposed metrics with other distances for PMs. As a reference benchmark, we consider distances belonging to the main types of PM metrics: the Kullback-Leibler (KL) divergence Nguyen et al. (2010) (a ζ-divergence and also a Bregman divergence), the t-test (T) statistic (Hotelling test in the multivariate case), the Maximum Mean Discrepancy (MMD) distance Gretton et al. (2012); Sriperumbudur et al. (2010) and the Energy distance Székely and Rizzo (2004), the last two being IPMs Sejdinovic et al. (2013).


Figure 2: Mixture of a Normal and a Uniform Distribution and a Gamma distribution.

3.1 Artificial data

3.1.1 Homogeneity tests

The first experiment concerns a homogeneity test between two populations: a mixture of a Normal and a Uniform distribution, P = αN(µ = 1, σ² = 1) + (1 − α)U(a = 1, b = 8) with α = 0.7, and a Gamma distribution Q = γ(shape = 1, scale = 2). In Figure 2 we show the corresponding density functions for P and Q; notice that first and second order moments are quite similar in this case (µP = 2.05 ≃ µQ = 2 and σ²P = 3.975 ≃ σ²Q = 4), and additionally both distributions are strongly asymmetric. We generate two random i.i.d. samples of size 100 from P and Q respectively, and the goal of this experiment is to test the null hypothesis H0 : P = Q with the samples at hand. In order to compute a p-value for the KL-divergence, T, Energy, MMD and the proposed LS distances we run a permutation test based on 1000 random permutations of the original sample data. In the case of the Kolmogorov-Smirnov, χ2 and Wilcoxon tests we report the p-value given by these tests. Results are displayed in Table 2: only the LS distances are able to reject the null hypothesis at a 5% significance level.

3.1.2 Discrimination between normal distributions

In this experiment we quantify the power of the proposed PM distances to test the null hypothesis H0 : P = Q when P and Q are multivariate normal distributions. To this end, we generate an i.i.d. random sample, denoted snP, of size n = 100d, where d stands for the dimension, from a multivariate normal distribution P = N(0, Id). In order to estimate a threshold value under H0, we then generate another 1000 random

Table 2: Hypothesis test between a mixture of Normal and Uniform distributions and a Gamma distribution.

Metric               Parameters   p-value   Reject?
Kolmogorov-Smirnov                0.211     No
t-test                            0.503     No
χ2 test                           0.993     No
Wilcoxon test                     0.355     No
KL                   k = 25       0.134     No
T                                 0.502     No
Energy                            0.170     No
MMD                               0.099     No
LS(0)                m = 15       0.030     Yes
LS(1)                m = 15       0.040     Yes
LS(2)                m = 15       0.050     Yes
LS(3)                m = 15       0.050     Yes

samples of size 100d from a N(0, Id) distribution. Next we calculate the distances between each of these 1000 samples and the first generated data sample to obtain the 95% distance percentile, denoted here as d_{H0}^{95%}.

Now define δ = δ·(1, . . . , 1) ∈ R^d. We start with δ = 0 and increase δ by 0.001. For each δ we generate an i.i.d. random sample of size 100d, denoted snQ, drawn from the Q = N(δ, Id) distribution. The goal of this experiment is to test the difference between the distributions by using the random samples snP and snQ, drawn from P and Q, respectively. If d(snP, snQ) > d_{H0}^{95%}, we conclude that the distance under consideration is able to discriminate between both populations (we reject H0). To track the power of the test, we repeat this process 1000 times and fix δ* to the current δ value if the distance is above the threshold d_{H0}^{95%} in 90% of the cases. Thus we are calculating the minimal value δ* required by each metric in order to discriminate between the populations with a 95% confidence level (type I error = 5%) and a 90% sensitivity level (type II error = 10%). In Table 3 we report the minimum distance (δ*√d) between the distribution centers required by each metric to discriminate in different dimensions (notice that smaller values imply better discrimination power). The T statistic outperforms KL and MMD. However, the Energy distance works even better than the T statistic in dimensions 1 to 4. The LS(0) distance works similarly to T and Energy up to dimension 10 and tends to improve on LS(3) after dimension 3. The LS(2) distance works similarly to LS(1), the best distance in terms of discrimination power, up to dimension 4.

In a second scenario for this experiment we consider again normal populations but different variance-covariance matrices. Define an expansion factor δ ∈ R and increase

Table 3: δ*√d for a 5% type I and 10% type II errors.

Metric   d=1    d=2    d=3    d=4    d=5    d=10   d=15   d=20   d=50   d=100
KL       0.870  0.636  0.433  0.430  0.402  0.474  0.542  0.536  0.495  0.470
T        0.490  0.297  0.286  0.256  0.246  0.231  0.201  0.182  0.153  0.110
Energy   0.460  0.287  0.284  0.256  0.250  0.234  0.203  0.183  0.158  0.121
MMD      0.980  0.850  0.650  0.630  0.590  0.500  0.250  0.210  0.170  0.130
LS(0)    0.490  0.298  0.289  0.252  0.241  0.237  0.220  0.215  0.179  0.131
LS(1)    0.455  0.283  0.268  0.240  0.224  0.221  0.174  0.178  0.134  0.106
LS(2)    0.455  0.283  0.268  0.240  0.229  0.231  0.232  0.223  0.212  0.134
LS(3)    0.470  0.284  0.288  0.300  0.291  0.237  0.240  0.225  0.219  0.141

Table 4: (1 + δ*) for a 5% type I and 10% type II errors.

Metric   d=1    d=2    d=3    d=4    d=5    d=10   d=15   d=20   d=50   d=100
KL       3.000  1.700  1.250  1.180  1.175  1.075  1.055  1.045  1.030  1.014
T        −      −      −      −      −      −      −      −      −      −
Energy   1.900  1.600  1.450  1.320  1.300  1.160  1.150  1.110  1.090  1.030
MMD      6.000  4.500  3.500  2.900  2.400  1.800  1.500  1.320  1.270  1.150
LS(0)    1.850  1.450  1.300  1.220  1.180  1.118  1.065  1.040  1.030  1.012
LS(1)    1.700  1.350  1.150  1.120  1.080  1.050  1.033  1.025  1.015  1.009
LS(2)    1.800  1.420  1.180  1.150  1.130  1.080  1.052  1.030  1.025  1.010
LS(3)    1.800  1.445  1.250  1.210  1.180  1.120  1.115  1.090  1.050  1.040

δ by 0.001 (starting from 0) in order to determine the smallest δ* required for each metric to reject the null hypothesis H0 : P = Q by using two data samples snP and snQ (of equal size n = 100d), drawn from P = N(0, Id) and Q = N(0, (1 + δ)Id) respectively. If d(snP, snQ) > d_{H0}^{95%}, where d_{H0}^{95%} is the threshold value for this simulation scenario, we conclude that the distance under consideration is able to discriminate between both populations. To make the process as independent as possible from randomness, we repeat the testing process 1000 times and fix δ* (reported in Table 4) to the current δ value if d(snP, snQ) > d_{H0}^{95%} in 90% of the cases, as was done in the previous experiment. There are no entries in Table 4 for the T statistic because it was not able to distinguish between the considered populations in any of the considered dimensions. The MMD distance shows poor discrimination power in this experiment. We can see that the proposed LS(1) distance is better than the other metrics considered in this experiment in all dimensions, with LS(2) and LS(3) showing similar performance in second place, and LS(0) and KL similar performance in third place among the metrics with the best discrimination power.


Figure 3: Real image (a) and sampled image (b) of a heart in the MPEG7 CE-Shape-1 database.

3.2 Two case-studies

3.2.1 Data visualization and shape classification

As an application of the preceding theory to data visualization and classification problems we consider the MPEG7 CE-Shape-1 Latecki et al. (2000), a well-known shape database. We select four different classes of shapes from the database: hearts, cups, hammers and bones. For each object class we choose 3 images in the following way: 2 standard images plus an extra image that exhibits some distortion or rotation (12 images in total). In order to represent each shape we do not follow the usual approach in pattern recognition, which consists in representing each image by a feature vector. Instead, we look at the image as a PM (a cloud of points in R2) according to the following procedure: each image is transformed to a binary image where each pixel assumes the value 1 (white points region) or 0 (black points region), as in Figure 3 (a). For each image i of size Ni × Mi (row and column numbers of pixels, respectively), we generate a uniform sample of size Ni·Mi allocated over the positions of the shape image i. To obtain the cloud of points as in Figure 3 (b), we retain only those points which fall into the white region (image body), whose intensity gray levels are larger than a threshold fixed at 0.99. This procedure yields between one and two thousand points per image representation, depending on the image, as can be seen in Figure 3 (b). After rescaling and centering, we compute the 12 × 12 image distance matrices using the LS(1) distance and the KL divergence, and then compute Euclidean coordinates for the images via Multidimensional Scaling (MDS). The 2D representation obtained via MDS is shown in Figure 4. It is apparent that the LS distance produces an MDS map coherent with human image perception (Figure 4 (a)). This does not happen for the rest of the tested metrics, in particular for the KL divergence, as shown in Figure 4 (b).

Figure 4: Multidimensional Scaling representation of the objects based on (a) the LS(1) distance and (b) the KL divergence.

3.2.2 Testing statistical significance in microarray experiments

Here we present an application of the proposed LS distance in the field of Bioinformatics. The data set we analyse comes from an experiment in which the time to respiratory recovery in ventilated post-trauma patients is studied. Affymetrix U133+2 micro-arrays were prepared at days 0, 1, 4, 7, 14, 21 and 28. In this analysis, we focus on a subset of 48 patients which were originally divided into two groups: "early recovery" patients (group G1), who recovered ventilation prior to day seven, and "late recovery" patients (group G2), those who recovered ventilation after day seven. The sizes of the groups are 22 and 26 respectively. It is of clinical interest to find differences between the two groups of patients. In particular, the initial goal of this study was to test the association of inflammation on day one with subsequent respiratory recovery. In this experiment we will show how the proposed distance can be used in this context to test statistical differences between the groups and also to identify the genes with the largest effect on post-trauma recovery.

From the original data set¹ we select the sample of 675 probe sets corresponding to those genes whose GO annotation includes the term "inflammatory". To do so we use a query (July 2012) on the Affymetrix web site (www.affymetrix.com/index.affx). The idea of this search is to obtain a pre-selection of the genes involved in post-trauma recovery in order to avoid working with the whole human genome.

Figure 5 shows the heat map of day one gene expression for the 48 patients (columns) over the 675 probe-sets. By using a hierarchical procedure, it is apparent

¹ Available at http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE13488


Figure 5: Affymetrix U133+2 micro-array data from the post-trauma recovery experiment. On top, a hierarchical cluster of the patients using the Euclidean distance is included. At the bottom of the plot the grouping of the patients is shown: 1 for "early recovery" patients and 2 for "late recovery" patients.

that the two main clusters we find do not correspond to the two groups of patients in the experiment. However, the first cluster (Figure 5, left) contains mainly patients from the "early recovery" group (approx. 65%), whereas the second cluster (Figure 5, right) is mainly made up of patients from the "late recovery" group (approx. 60%). This lack of balance suggests a different pattern of gene expression between the two groups of patients.

To test if statistical differences exist between the groups G1 and G2 we define, inspired by Hayden et al. (2009), a statistical test based on the LS distance proposed in this work. To this end, we identify each patient i with a probability distribution Pi. The expressions of the 675 genes across the probe-sets are assumed to be samples of such distributions. Ideally, if gene expression is not related to the recovery speed then all distributions Pi should be equal (H0). On the other hand, assume that the expression of a gene or a group of genes effectively changes between "early" and "late" recovery patients. Then, the distributions Pi will be different between patients belonging to groups G1 and G2 (H1). To validate or reject the previous hypothesis, consider the proposed LS(1) distance d̂ν(Pi, Pj) defined in Equation (4) for two patients i and j. Denote by

∆1 = (1/(22·(22 − 1))) Σ_{i,j ∈ G1} d̂ν(Pi, Pj),   ∆2 = (1/(26·(26 − 1))) Σ_{i,j ∈ G2} d̂ν(Pi, Pj),   (7)
0.10 density−frequency

0.05 0.00

0.05 0.00

density− frequency

0.10

0.15

late recovery

0.15

early recovery

−5

0

5

10

15

−5

log2(expression)

0

5

10

15

log2(expression)

Figure 6: Gene density proles (in logarithmic scale) of the two groups of patients in the sample. The 50 most signicant genes were used to calculate the proles with a kernel density estimator.

and

\Delta_{12} = \frac{1}{22 \cdot 26} \sum_{i \in G_1,\, j \in G_2} \hat{d}_\nu(P_i, P_j),    (8)

the averaged α-level set distances within and between the groups of patients. Using the previous quantities we define a distance between the groups G1 and G2 as

\Delta^* = \Delta_{12} - \frac{\Delta_1 + \Delta_2}{2}.    (9)
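The averaged distances in Equations (7)-(8) and the statistic (9) can be computed directly from a matrix of pairwise distances. The sketch below is illustrative, not the authors' implementation; the symmetric distance matrix D (with entries d̂ν(Pi, Pj)) and the group index lists are assumed inputs:

```python
import numpy as np

def group_statistic(D, g1, g2):
    """Delta* = Delta_12 - (Delta_1 + Delta_2)/2 from a symmetric
    pairwise-distance matrix D; g1, g2 index the two patient groups."""
    g1, g2 = np.asarray(g1), np.asarray(g2)
    n1, n2 = len(g1), len(g2)
    # Within-group averages over the n(n-1) ordered pairs i != j (Eq. 7);
    # the zero diagonal contributes nothing to the sums.
    d1 = D[np.ix_(g1, g1)].sum() / (n1 * (n1 - 1))
    d2 = D[np.ix_(g2, g2)].sum() / (n2 * (n2 - 1))
    # Between-group average over all n1*n2 pairs (Eq. 8)
    d12 = D[np.ix_(g1, g2)].sum() / (n1 * n2)
    # Group separation statistic (Eq. 9)
    return d12 - 0.5 * (d1 + d2)
```

For the experiment above, g1 and g2 would hold the indices of the 22 "early recovery" and 26 "late recovery" patients, respectively.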

Notice that if the distributions are equal between the groups, then ∆∗ will be close to zero. On the other hand, if the distributions are similar within the groups and different between them, then ∆∗ will be large. To test whether ∆∗ is large enough to be considered statistically significant we need the distribution of ∆∗ under the null hypothesis. Unfortunately, this distribution is unknown and some re-sampling technique must be used. In this work we approximate it by calculating a sequence of distances ∆∗(1), . . . , ∆∗(N), where each ∆∗(k) is the distance between the groups G1 and G2 under a random permutation of the patients. For a total of N permutations, then

p\text{-value} = \frac{\#\left[\Delta^{*(k)} \geq \Delta^* : k = 1, \ldots, N\right]}{N},    (10)

where #[Θ] denotes the number of times the condition Θ is satisfied, is a one-sided p-value of the test.
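The permutation scheme behind Equation (10) can be sketched as follows. The function and variable names are illustrative (not the authors' implementation); D is again an assumed matrix of pairwise LS distances and labels a 0/1 group indicator vector over the patients:

```python
import numpy as np

def delta_star(D, labels):
    """Delta* of Eq. (9) for a 0/1 label vector over a distance matrix D."""
    g1, g2 = np.where(labels == 0)[0], np.where(labels == 1)[0]
    n1, n2 = len(g1), len(g2)
    d1 = D[np.ix_(g1, g1)].sum() / (n1 * (n1 - 1))
    d2 = D[np.ix_(g2, g2)].sum() / (n2 * (n2 - 1))
    d12 = D[np.ix_(g1, g2)].sum() / (n1 * n2)
    return d12 - 0.5 * (d1 + d2)

def permutation_pvalue(D, labels, n_perm=10_000, seed=0):
    """One-sided p-value of Eq. (10): the fraction of label permutations
    whose Delta* is at least as large as the observed one."""
    rng = np.random.default_rng(seed)
    observed = delta_star(D, labels)
    count = 0
    for _ in range(n_perm):
        perm = rng.permutation(labels)       # random reassignment of patients
        if delta_star(D, perm) >= observed:  # Delta*(k) >= Delta*
            count += 1
    return count / n_perm
```

A well-separated pair of groups yields a small p-value, while exchangeable groups yield p-values spread over (0, 1], as expected under H0.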

(a) Heat-map of the top-50 ranked genes (rows). A hierarchical cluster of the patients is included on top. The patient labels are detailed at the bottom of the plot. (b) P-values obtained by the proposed α-level set distance based test for different samples with an increasing number of genes.

Figure 7: Heat-map of the top-50 ranked genes and p-values for different samples.

We apply the previous LS(1) distance based test (weighting scheme 1 with 10,000 permutations) using the values of the 675 probe-sets and we obtain a p-value = 0.1893. This result suggests that no differences exist between the groups. The test for microarrays proposed in Hayden et al. (2009) also confirms this result with a p-value of 0.2016. The reason for this a priori unexpected result is that, if differences between the groups exist, they are probably hidden by a main group of genes with similar behaviour between the groups. To validate this hypothesis, we first rank the set of 675 genes in terms of their individual variation between groups G1 and G2. To do so, we use the p-values of individual mean-difference t-tests. Then we consider the top-50 ranked genes and apply the α-level set distance test. The obtained p-value is 0.010, indicating a significant difference in the gene expression of the top-50 ranked genes. In Figure 6 we show the estimated density profiles of the patients using a kernel estimator. It is apparent that the profiles differ between groups, as is reflected in the obtained results. In Figure 7 we show the heat-map calculated using the selection of 50 genes. Note that a hierarchical clustering using the Euclidean distance, the most widely used technique to study the existence of groups in microarray data, is not able to accurately reflect the existence of the two groups even for the most influential genes. To conclude the analysis we go beyond the initial 50-gene analysis. We aim to obtain the whole set of genes in the original sample for which differences between groups remain significant. To do so, we sequentially add to the first 50-gene sample the next-highest ranked genes and apply the LS distance based test to the augmented data sets. The p-values of this analysis are shown in Figure 7. They increase as more genes are included in the sample. With a type-I error of 5%, statistical differences are found for the first 75 genes. With a 10% type-I error, we are still able to find differences between groups with the first 110 genes. This result shows that differences between early and late recovery trauma patients exist and are caused by the top-110 ranked genes of the Affymetrix U133+2 micro-arrays (filtered by the query "inflammatory"). By considering each patient as a probability distribution, the LS distance has been used to test differences between groups and to identify the most influential genes in the sample. This shows the ability of the newly proposed distance to provide new insights in the analysis of biological data.
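The rank-then-augment procedure described above can be sketched as follows; rank_genes and augmented_sets are hypothetical helper names, and the per-gene ranking uses standard two-sample t-tests as in the text (the LS distance test itself would then be run on each augmented gene set):

```python
import numpy as np
from scipy.stats import ttest_ind

def rank_genes(X1, X2):
    """Rank genes by two-sample t-test p-values between the groups.

    X1, X2: (patients x genes) expression matrices for groups G1 and G2.
    Returns gene indices ordered from most to least significant."""
    _, pvals = ttest_ind(X1, X2, axis=0)  # one test per gene (column)
    return np.argsort(pvals)

def augmented_sets(order, start=50, stop=140):
    """Gene index sets of increasing size: top-start, top-(start+1), ...,
    top-stop, following the ranking in `order`."""
    return [order[:k] for k in range(start, stop + 1)]
```

Applying the distance-based test to each set returned by augmented_sets reproduces the sequence of p-values plotted in Figure 7(b).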

4    Conclusions

In this paper we present a new family of nonparametric (semi)metrics for probability measures. The estimation of these distances between probability measures requires neither parametric assumptions nor explicit density estimation, which is a clear advantage over most well-established probability metrics treated in the literature. A wide range of experiments has been used to study the performance of the new family of distances. Using synthetically generated data, we have shown that the proposed distances exhibit superior discrimination power compared with other well-known metrics and classical methods in statistics. Regarding practical applications, the proposed distances have proven to be competitive in classification and data visualization problems. They also represent a novel way to identify genes and discriminate between groups of patients in micro-array experiments.

Acknowledgements This work was partially supported by projects MIC 2012/00084/00, ECO2012-38442, SEJ2007-64500, MTM2012-36163-C06-06, DGUCM 2008/00058/002 and MEC 2007/04438/001.

Technical appendix

Identification of PMs: If the set of test functions D contains the indicator functions of the ν-level sets, then D suffices to discriminate probability measures.

Proof: Consider two σ-finite PMs defined on the measure space (X, B, µ), say P and Q, and denote by φi(P) and φi(Q) the indicator functions of the i-th ν-level (Borel) sets of P and Q, respectively. If P = Q, then φi(P) = φi(Q) for all i ∈ {1, . . . , m} and d_{ν_m}(P, Q) = 0 (for all m ∈ N+). If P ≠ Q, then for some m > M (the value of M possibly determined by the ν_m sequence), φi(P) ≠ φi(Q) for at least one i ∈ {1, . . . , m}, and hence d_{ν_m}(P, Q) > 0 for all m > M. Therefore lim_{m→∞} d_{ν_m}(P, Q) = 0 if and only if P = Q.

Corollary: The semi-metric d_{ν_m}(P, Q) converges to a proper metric when m, the size of the asymptotically dense sequence ν_m, tends to ∞.

Properties of the semi-metric: We claim that the semi-metric presented in Equation (4) is a proper distance.

Proof: By definition, the semi-metric in Equation (4) is non-negative, d_{ν_m}(P, Q) ≥ 0, and as proved above lim_{m→∞} d_{ν_m}(P, Q) = 0 if and only if P = Q. It is symmetric by construction, d_{ν_m}(P, Q) = d_{ν_m}(Q, P), it obeys the triangle inequality (for a formal proof of this property refer to Deza and Laurent (1996), p. 118: the biotope transform), and it is invariant under affine transformations (the counting measure is affine invariant).

Supplementary material and source code:

The R code implementing the method proposed in the article is available in the supplementary material. The experiment presented in Section 3.1.1 is also included within the code. In the supplementary material we also present a brief study of the computational time required to estimate the proposed distances using the algorithm presented in Table 1.

References

Banerjee, A., Merugu, S., Dhillon, I. and Ghosh, J. (2005): Clustering with Bregman Divergences. Journal of Machine Learning Research 6, pp. 1705-1749.

Chevalier, J. (1976): Estimation du support et du contour du support d'une loi de probabilité. Annales de l'IHP Probabilités et statistiques 12, pp. 339-364.

Cichocki, A. and Amari, S. (2010): Families of Alpha- Beta- and Gamma- Divergences: Flexible and Robust Measures of Similarities. Entropy 12, pp. 1532-1568.

Csiszár, I. and Shields, P. (2004): Information Theory and Statistics: A Tutorial. Foundations and Trends in Communications and Information Theory.

Devroye, L. and Wise, G.L. (1980): Detection of abnormal behavior via nonparametric estimation of the support. SIAM J. Appl. Math. 38, pp. 480-488.

Deza, M. and Deza, E. (2009): Encyclopedia of Distances. Springer.

Deza, M. and Laurent, M. (1996): Geometry of Cuts and Metrics. Springer.

Frigyik, B.A., Srivastava, S. and Gupta, M.R. (2008): Functional Bregman Divergences and Bayesian Estimation of Distributions. IEEE Transactions on Information Theory 54(11), pp. 5130-5139.

Gretton, A., Borgwardt, K.M., Rasch, M.J., Schölkopf, B. and Smola, A. (2012): A kernel two-sample test. Journal of Machine Learning Research 13(1), pp. 723-773.

Hayden, D., Lazar, P. and Schoenfeld, D. (2009): Assessing Statistical Significance in Microarray Experiments Using the Distance Between Microarrays. PLoS ONE 4(6): e5838.

Latecki, L.J., Lakamper, R. and Eckhardt, U. (2000): Shape Descriptors for Non-rigid Shapes with a Single Closed Contour. IEEE Conference on Computer Vision and Pattern Recognition, pp. 424-429.

Lebanon, G. (2006): Metric Learning for Text Documents. IEEE Trans. on Pattern Analysis and Machine Intelligence 28(4), pp. 497-508.

Minas, C., Curry, E. and Montana, G. (2013): A distance-based test of association between paired heterogeneous genomic data. Bioinformatics 29(20), pp. 2555-2563.

Moguerza, J.M. and Muñoz, A. (2006): Support vector machines with applications. Statistical Science, pp. 322-336.

Moon, Y., Rajagopalan, B. and Lall, U. (1995): Estimation of mutual information using kernel density estimators. Physical Review E 52(3), pp. 2318-2321.

Muñoz, A. and Moguerza, J.M. (2006): Estimation of High-Density Regions using One-Class Neighbor Machines. IEEE Trans. on Pattern Analysis and Machine Intelligence 28(3), pp. 476-480.

Muñoz, A., Martos, G., Arriero, J. and Gonzalez, J. (2012): A new distance for probability measures based on the estimation of level sets. Artificial Neural Networks and Machine Learning – ICANN 2012, pp. 271-278.

Müller, A. (1997): Integral Probability Metrics and Their Generating Classes of Functions. Advances in Applied Probability 29(2), pp. 429-443.

Nguyen, X., Wainwright, M.J. and Jordan, M.I. (2010): Estimating divergence functionals and the likelihood ratio by convex risk minimization. IEEE Transactions on Information Theory 56(11), pp. 5847-5861.

Rubner, Y., Tomasi, C. and Guibas, L.J. (2000): The earth mover's distance as a metric for image retrieval. International Journal of Computer Vision 40(2), pp. 99-121.

Ryabko, D. and Mary, J. (2013): A binary-classification-based metric between time-series distributions and its use in statistical and learning problems. Journal of Machine Learning Research 14(1), pp. 2837-2856.

Sáez, C., Robles, M. and García-Gómez, J.M. (2014): Stability metrics for multi-source biomedical data based on simplicial projections from probability distribution distances. Statistical Methods in Medical Research.

Sejdinovic, D., Sriperumbudur, B., Gretton, A. and Fukumizu, K. (2013): Equivalence of Distance-Based and RKHS-Based Statistics in Hypothesis Testing. The Annals of Statistics 41(5), pp. 2263-2291.

Sriperumbudur, B.K., Gretton, A., Fukumizu, K. and Schölkopf, B. (2010): Hilbert Space Embeddings and Metrics on Probability Measures. Journal of Machine Learning Research 11, pp. 1297-1322.

Stone, C.J. (1980): Optimal rates of convergence for nonparametric estimators. The Annals of Statistics, pp. 1348-1360.

Strichartz, R.S. (1994): A Guide to Distribution Theory and Fourier Transforms. World Scientific.

Székely, G.J. and Rizzo, M.L. (2004): Testing for Equal Distributions in High Dimension. InterStat.

Zemanian, A.H. (1965): Distribution Theory and Transform Analysis. Dover.