DENSITY LEVEL SETS: ASYMPTOTICS, INFERENCE, AND VISUALIZATION

By Yen-Chi Chen, Christopher R. Genovese, Larry Wasserman

arXiv:1504.05438v1 [stat.ME] 21 Apr 2015

Carnegie Mellon University

April 22, 2015

We derive asymptotic theory for the plug-in estimate for density level sets under Hausdorff loss. Based on the asymptotic theory, we propose two bootstrap confidence regions for level sets. The confidence regions can be used to perform tests for anomaly detection and clustering. We also introduce a technique to visualize high dimensional density level sets by combining mode clustering and multidimensional scaling.

1. Introduction. In this paper we study the problem of estimating the level set $D \equiv D(\lambda) = \{x : p(x) = \lambda\}$ of a density $p$. Estimating the density level set has wide applications, including anomaly detection (outlier detection) (Breunig et al., 2000; Hodge and Austin, 2004), two-sample comparison (Duong et al., 2009), binary classification (Mammen et al., 1999) and clustering (Rinaldo and Wasserman, 2010; Rinaldo et al., 2010). Figures 3 and 6 show the types of confidence sets and visualization methods we will develop in this paper.

A common approach to estimating $D$ is to use a plug-in estimate $\hat D_n = \{x : \hat p_n(x) = \lambda\}$, where $\hat p_n$ is the kernel density estimator or some other density estimator. There is a large literature on level sets (and upper level sets $\{x : p(x) \ge \lambda\}$) concerning the consistency and rate of convergence (Polonik, 1995; Tsybakov, 1997; Walther, 1997; Cadre, 2006; Cuevas et al., 2006) and minimaxity (Singh et al., 2009) of such estimates under various error metrics.

On the other hand, there are few results about statistical inference for level sets (Jankowski and Stanberry, 2012; Mammen and Polonik, 2013). Statistical inference is challenging since the estimand is a set and the estimator is a random set (Molchanov, 2005), and it is hard to describe the asymptotic behavior of a random set. Mason et al. (2009) show asymptotic normality for the upper level set in the metric defined by the measure of the set difference. However, it is unclear how to derive a confidence set from this result.

Another difficulty with level set estimation is that we cannot directly visualize the level sets in high dimensions. A common remedy is to construct a level set tree (Stuetzle, 2003; Klemelä, 2004, 2006, 2009; Stuetzle and Nugent, 2010; Kent et al., 2013), which traces how the connected components of the upper level set bifurcate as we gradually increase $\lambda$.
The level set tree only reflects the topology of the connected components of the level set; geometric information is lost.

In this paper, we propose solutions to all the above issues. Our main contributions can be summarized as follows.

1. We derive the limiting distribution of $\mathrm{Haus}(\hat D_n, D_h)$, where $D_h$ is a smoothed version of $D$ (Theorem 3).
2. We propose two bootstrap-based methods to construct confidence regions for $D_h$ (Section 4).
3. We prove the validity of the bootstrap (Theorems 4 and 5).
4. We show that the confidence regions for $D_h$ can be inverted into hypothesis tests for clustering and anomaly detection (Theorem 6).
5. We derive valid confidence sets for the upper and lower density level sets (Theorem 7).
6. We propose a visualization technique that preserves the geometric information of density level sets (Section 5).

MSC 2010 subject classifications: Primary 62G20; secondary 62G10, 62G15
Keywords and phrases: Nonparametric inference, asymptotic theory, level set, anomaly detection, visualization

Related Work. Early work on density level sets focuses on proving consistency or rates of convergence under various metrics. See e.g. Polonik (1995); Tsybakov (1997); Walther (1997); Cuevas et al. (2006); Rinaldo and Wasserman (2010). However, none of these provides a limiting distribution for the density level sets. To our knowledge, the only literature concerning limiting distributions is Mason et al. (2009). They prove asymptotic normality under a generalized integrated distance. However, this metric cannot be used to construct a confidence set for density level sets, so its applicability is limited. Estimating the level set is also related to support estimation; see e.g. Cuevas and Rodríguez-Casal (2004) and Cuevas (2009). Jankowski and Stanberry (2012) and Mammen and Polonik (2013) both provide methods for constructing confidence sets for the density level sets using the variation of the density function. Our approach is similar to theirs, but our main method is based on the Hausdorff distance between level sets. We will compare their methods to ours in Section 4.1.2.

Outline. We begin with a short introduction to the density level set along with some useful geometric concepts in Section 2. In Section 3, we derive the limiting distribution of the Hausdorff distance between the estimated level set and the true level set. In Section 4, we construct a valid confidence set for density level sets and show that this confidence set is related to hypothesis tests for clustering and anomaly detection.
In Section 5, we propose a visualization method that uses mode clustering and multidimensional scaling to visualize high dimensional density level sets. We summarize our results and discuss related problems in Section 6. We provide R code for the proposed visualization algorithm at http://www.stat.cmu.edu/~yenchic/HDLV.zip.

2. Technical Background.

2.1. Level Sets. Let $X_1, \cdots, X_n$ be a random sample from an unknown, continuous density $p(x)$. We define the level set

(1)    $D \equiv D(\lambda) = \{x : p(x) = \lambda\}$

for some $\lambda > 0$. Note that in some papers, the upper level set $\{x : p(x) \ge \lambda\}$ is called a level set. The level sets are the boundaries of the upper level sets under smoothness assumptions (for instance, assumption (G)). We assume that $\lambda$ is a fixed, positive value. A plug-in estimate for $D$ based on the kernel density estimator (KDE) is

(2)    $\hat D_n \equiv \hat D_n(\lambda) = \{x : \hat p_n(x) = \lambda\}$,

where

$\hat p_n(x) = \frac{1}{nh^d} \sum_{i=1}^{n} K\left(\frac{\|x - X_i\|}{h}\right)$.

In this paper, we focus on inference for the smoothed version of $p$, which is denoted as

(3)    $p_h = p \star K_h = E(\hat p_n)$,

where $K_h(x) = \frac{1}{h^d} K\left(\frac{\|x\|}{h}\right)$ and $\star$ denotes convolution. We define

(4)    $D_h = \{x : p_h(x) = \lambda\}$

as the density level set for $p_h$. Note that although we focus on estimating $D_h$, we allow $h = h_n \to 0$ as $n \to \infty$.

There are several advantages to targeting $p_h$ instead of $p$. First, the density level set for $p$ may not be well-defined; this occurs when the distribution concentrates around some low dimensional structure, so that it may not have a density function. In contrast, $p_h$ is always well-defined and smooth whenever $K$ is smooth. This is a major motivation for using $p_h$ over $p$, as mentioned in Rinaldo and Wasserman (2010). Second, estimating $p_h$ via a plug-in KDE allows us to focus on the stochastic part of the variation, which makes the analysis much simpler. When $h$ is small and $p$ is smooth, the difference between $p$ and $p_h$ is $O(h^2)$ which, for geometric inference, is generally not of practical significance. Third, estimating $p_h$ by the KDE has a much faster rate of convergence ($\sqrt{n}$ rate) when $h$ is fixed.

2.2. Geometric Concepts. Let $\pi_A(x)$ be the projection of a point $x$ onto a set $A$. Note that $\pi_A(x)$ may not be unique. The distance induced by the projection is

(5)    $d(x, A) = \inf\{\|x - y\|_2 : y \in A\} = \|x - \pi_A(x)\|_2$.
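The plug-in estimate (2) and the projection distance (5) can be illustrated with a minimal sketch. It assumes $d = 1$ and a Gaussian kernel, and it approximates the level set by locating sign changes of $\hat p_n - \lambda$ on a grid with linear interpolation. The function names (`gaussian_kde`, `level_set_1d`, `dist_to_set`) and the grid resolution are illustrative choices, not part of the paper.

```python
import math


def gaussian_kde(data, h):
    """Return the 1-D KDE x -> (1/(n h)) * sum_i K((x - X_i)/h), Gaussian K (assumed)."""
    norm = 1.0 / (len(data) * h * math.sqrt(2.0 * math.pi))

    def p_hat(x):
        return norm * sum(math.exp(-0.5 * ((x - xi) / h) ** 2) for xi in data)

    return p_hat


def level_set_1d(f, lam, lo, hi, m=2000):
    """Approximate {x in [lo, hi] : f(x) = lam} by sign changes of f - lam on a grid."""
    xs = [lo + (hi - lo) * k / m for k in range(m + 1)]
    vals = [f(x) - lam for x in xs]
    roots = []
    for k in range(m):
        a, b = vals[k], vals[k + 1]
        if a == 0.0:
            roots.append(xs[k])
        elif a * b < 0.0:
            # Linear interpolation between the two grid points that bracket the crossing.
            roots.append(xs[k] + (xs[k + 1] - xs[k]) * a / (a - b))
    if vals[m] == 0.0:
        roots.append(xs[m])
    return roots


def dist_to_set(x, A):
    """Projection distance d(x, A) = inf{|x - y| : y in A} for a finite set A."""
    return min(abs(x - y) for y in A)
```

In one dimension the level set is a finite set of points, which keeps the sketch short; in higher dimensions a contour-extraction routine would play the role of `level_set_1d`.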

A common measure of distance between two sets is the Hausdorff distance. The Hausdorff distance between two subsets of $\mathbb{R}^d$ is given by

(6)    $\mathrm{Haus}(A, B) = \inf\{\varepsilon : A \subset B \oplus \varepsilon \text{ and } B \subset A \oplus \varepsilon\} = \max\left\{\sup_{x \in B} d(x, A), \sup_{x \in A} d(x, B)\right\}$,

where $A \oplus \varepsilon = \bigcup_{x \in A} B(x, \varepsilon)$ and $B(x, \varepsilon) = \{y : \|x - y\| \le \varepsilon\}$. The Hausdorff distance is a generalized version of the $L_\infty$ metric for sets.

Now we introduce the concept of reach (Federer, 1959; Cuevas, 2009) (also known as condition number (Niyogi et al., 2008) or minimal feature size (Chazal and Lieutier, 2005)). The reach of a set $M$ is the largest distance from $M$ such that every point within this distance of $M$ has a unique projection onto $M$, i.e.,

(7)    $\mathrm{reach}(M) = \sup\{r : \pi_M(x) \text{ is unique } \forall x \in M \oplus r\}$.
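For finite point sets, the max-sup form of the Hausdorff distance in (6) can be computed directly. The sketch below assumes that each set is given as a non-empty list of coordinate tuples (for example, points sampled along a level set); the function names are illustrative.

```python
import math


def dist_to_set(x, A):
    """d(x, A) = min over y in A of the Euclidean distance ||x - y||."""
    return min(math.dist(x, y) for y in A)


def hausdorff(A, B):
    """Haus(A, B) = max{ sup_{x in B} d(x, A), sup_{x in A} d(x, B) } for finite sets."""
    return max(
        max(dist_to_set(x, A) for x in B),
        max(dist_to_set(x, B) for x in A),
    )
```

Both one-sided suprema are needed: a set can lie entirely within a small neighbourhood of another while the converse fails.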

Note that $\pi_A(x)$ is unique if $0 < d(x, A) \le \mathrm{reach}(A)$. Another way to understand the reach is that it is the largest radius of a ball that can freely move along $M$; see Figure 1 for an example. In some cases, the reach is the same as the smallest radius of curvature on $M$. The reach plays a key role in relating the Hausdorff distance to the empirical process. Note that the reach is closely related to 'rolling properties' and 'α-convexity'; see Cuevas (2009) and Appendix A of Pateiro-López (2008).

Finally, two smooth sets $A$ and $B$ are called normal compatible (Chazal et al., 2007) if the projections between $A$ and $B$ are one-to-one and onto; see Figure 2 for an example. When $A$ and $B$ are normal compatible, the Hausdorff distance is

(8)    $\mathrm{Haus}(A, B) = \sup_{x \in B} d(x, A) = \sup_{x \in A} d(x, B)$.

FIG 1. An illustration for reach. The reach is the largest radius for a ball that can freely move along the set $M$. In (a), the radius of the pink ball is equal to the reach. In (b), the radius is too large, so the ball cannot pass the small gap on $M$.

FIG 2. An example of two normal compatible curves. Panel (a): each thin red line indicates the projection from a point of $M$ onto $M'$. Panel (b): each thin black line indicates the projection from a point of $M'$ onto $M$. When these projections are one-to-one and onto, we say $M$ and $M'$ are normal compatible to each other.

3. Asymptotic Theory. We begin by defining notation that will be used in this paper. Let $\mathbf{BC}^r$ denote the collection of functions (including both univariate and multivariate functions) with bounded continuous derivatives up to the $r$-th order. For a smooth multivariate function $f : \mathbb{R}^d \mapsto \mathbb{R}$ with $f \in \mathbf{BC}^r$, we denote the elementwise max norm of the $r$-th derivative by $\|f\|_{r,\max}$. For instance,

$\|f(x)\|_{1,\max} = \max_{1 \le i \le d} \left|\frac{\partial f(x)}{\partial x_i}\right|, \qquad \|f(x)\|_{2,\max} = \max_{1 \le i, j \le d} \left|\frac{\partial^2 f(x)}{\partial x_i \partial x_j}\right|$.

We define the sup norm using derivatives up to the $r$-th order as

(9)    $\|f\|^*_{r,\max} = \max\left\{\sup_{x \in \mathbb{K}} \|f(x)\|_{\ell,\max} : \ell = 0, \cdots, r\right\}$.

Note that for a vector $v \in \mathbb{R}^d$, the norm $\|v\|$ is the usual Euclidean norm.

Assumptions.

(G) Let $D(q) = \{x \in \mathbb{K} : q(x) = \lambda\}$ be the level set for a density $q$. There are $\delta_0, g_0 > 0$ such that for all $x \in D(q) \oplus \delta_0$, we have $\|\nabla q(x)\| > g_0$.

(K1) The kernel function $K \in \mathbf{BC}^3$ is symmetric, non-negative and

$\int x^2 K^{(\alpha)}(x)\,dx < \infty, \qquad \int \left(K^{(\alpha)}(x)\right)^2 dx < \infty$

for all $\alpha = 0, 1, 2, 3$.

(K2) The kernel function $K$ and its partial derivatives satisfy condition K1 in Giné and Guillou (2002). Specifically, let

(10)    $\mathcal{K} = \left\{ y \mapsto K^{(\alpha)}\left(\frac{x - y}{h}\right) : x \in \mathbb{R}^d, h > 0, \alpha = 0, 1, 2 \right\}$.

We require that $\mathcal{K}$ satisfies

(11)    $\sup_P N\left(\mathcal{K}, L_2(P), \varepsilon \|F\|_{L_2(P)}\right) \le \left(\frac{A}{\varepsilon}\right)^v$

for some positive numbers $A, v$, where $N(T, d, \varepsilon)$ denotes the $\varepsilon$-covering number of the metric space $(T, d)$, $F$ is the envelope function of $\mathcal{K}$, and the supremum is taken over the whole $\mathbb{R}^d$. The $A$ and $v$ are usually called the VC characteristics of $\mathcal{K}$. The norm $\|F\|_{L_2(P)} = \sup_P \int |F(x)|^2 dP(x)$.

Assumption (G) appears in Molchanov (1990); Tsybakov (1997); Walther (1997); Molchanov (1998); Cadre (2006); Mammen and Polonik (2013); Laloë and Servien (2013). For a smooth density $q$, (G) holds whenever the specified level $\lambda$ does not coincide with the density value at a critical point. Assumption (K1) guarantees that the variance of the KDE is bounded and ensures that $p_h \in \mathbf{BC}^3$. This assumption is very common in the statistical literature; see e.g. Wasserman (2006). Assumption (K2) regularizes the complexity of the kernel function so that the supremum norm for kernel functions and their derivatives can be bounded in probability. Similar assumptions appear in Einmahl and Mason (2005) and Genovese et al. (2014). The Gaussian kernel and many compactly supported kernels satisfy both assumptions.

An immediate result from assumption (G) is the smoothness of the density level set.

LEMMA 1 (Smoothness Theorem). Assume a density $p \in \mathbf{BC}^2$ satisfies condition (G) and let $D$ denote the level set for $p$ at $\lambda$. Then

$\mathrm{reach}(D) \ge \min\left\{\frac{\delta_0}{2}, \frac{g_0}{\|p\|^*_{2,\max}}\right\}$.

Moreover, let $q$ be another density function and define $D(q)$ as the level set for $q$ at level $\lambda$. When $\|p - q\|^*_{2,\max}$ is sufficiently small,

1. Condition (G) holds for $q$.
2. $\mathrm{reach}(D(q)) = \min\left\{\frac{\delta_0}{2}, \frac{g_0}{\|p\|^*_{2,\max}}\right\} + O(\|p - q\|^*_{2,\max})$.
3. $D$ and $D(q)$ are normal compatible.

The proof is in Section 7. Lemma 1 is very similar to Theorems 1 and 2 in Walther (1997). Essentially, this lemma shows that the level set $D$ is smooth and that, whenever two smooth densities are sufficiently close, their level sets are both smooth and close to each other, and the normal projections between them are one-to-one and onto.

Given a collection of functions $\mathcal{F} = \{f_t : \mathbb{R}^d \mapsto \mathbb{R} : t \in T\}$, where $T$ is some index set, the empirical process $\mathbb{G}_n$ is defined as

$\mathbb{G}_n(f) = \frac{1}{\sqrt{n}} \sum_{i=1}^{n} \left(f(X_i) - E(f(X_1))\right), \qquad f \in \mathcal{F}$.

The following lemma links the empirical process with the projection distance $d(x, \hat D_n)$.

LEMMA 2 (Empirical Approximation). Assume that (K1–K2) hold and that (G) holds for $p_h$. Let $D_h$ and $\hat D_n$ be the density level sets with level $\lambda$ for $p_h$ and $\hat p_n$. Define the function

$f_x(y) = \frac{1}{\sqrt{h^d}\,\|\nabla p_h(x)\|} K\left(\frac{x - y}{h}\right)$

with $x \in D_h$. Then

$\sup_{x \in D_h} \left| \frac{\mathbb{G}_n(f_x) - \sqrt{nh^d} \cdot d(x, \hat D_n)}{\sqrt{nh^d} \cdot d(x, \hat D_n)} \right| = O(\|\hat p_n - p_h\|^*_{1,\max}) = O_P\left(\sqrt{\frac{\log n}{nh^{d+2}}}\right)$.

REMARK 1. Lemma 2 implies that for each $x \in D_h$, $d(x, \hat D_n)$ converges to a mean 0 Gaussian process. We can use $E\left(d^2(x, \hat D_n)\right)$ as a measure of local uncertainty (this is analogous to the mean squared error) and apply the bootstrap to estimate this quantity. This is called the (local) uncertainty measure in Chen et al. (2014b).

REMARK 2. Lemma 2 shows that the projected distance to the level sets can be approximated by a stochastic process (empirical process) defined on a smooth manifold. The properties of a stochastic process (or, more generally, a random field) defined on a smooth manifold are one of the central topics in stochastic geometry. For a more involved discussion of random fields and geometry, we refer to Adler and Taylor (2009).

Lemma 2 shows that the projection distance can be approximated by an empirical process on certain functions $f_x$, where $x \in D_h$. The level set $D_h$ now acts as an index set. Thus, we define the function space

(12)    $\mathcal{F} = \left\{ f_x(y) \equiv \frac{1}{\sqrt{h^d}\,\|\nabla p_h(x)\|} K\left(\frac{x - y}{h}\right) : x \in D_h \right\}$

and define a Gaussian process $\mathbb{B}$ on $\mathcal{F}$ such that for all $f_1, f_2 \in \mathcal{F}$,

(13)    $\mathbb{B}(f_1) \overset{d}{=} N(0, E(f_1(X_1)^2)), \qquad \mathrm{Cov}(\mathbb{B}(f_1), \mathbb{B}(f_2)) = E(f_1(X_1) f_2(X_1))$.

THEOREM 3 (Asymptotic Theory). Assume that (K1–K2) hold and that (G) holds for $p_h$. Let $D_h$ and $\hat D_n$ be the density level sets with level $\lambda$ for $p_h$ and $\hat p_n$. Then the Hausdorff distance satisfies

$\sup_t \left| P\left(\sqrt{nh^d}\,\mathrm{Haus}(\hat D_n, D_h) < t\right) - P\left(\sup_{f \in \mathcal{F}} |\mathbb{B}(f)| < t\right) \right| \le O\left(\left(\frac{\log^4 n}{nh^d}\right)^{1/8}\right)$,

where $\mathcal{F}$ is defined in equation (12) and $\mathbb{B}$ is a Gaussian process defined on $\mathcal{F}$ satisfying equation (13).

Theorem 3 shows that the Hausdorff distance $\mathrm{Haus}(\hat D_n, D_h)$ can be approximated by a maximum over a certain Gaussian process. Note that we cannot directly use this theorem to construct a confidence set for $D_h$, since the Gaussian process is defined on $D_h$, which is unknown. Later we will use the bootstrap to approximate this limiting distribution and construct a confidence set.

The random variable $\sup_{f \in \mathcal{F}} |\mathbb{B}(f)|$ follows an extreme value type distribution. However, writing down the explicit form of this distribution is not very helpful for statistical inference, since it involves unknown quantities and the convergence to the distribution is notoriously slow. Instead, we will use the bootstrap to approximate the distribution of $\mathrm{Haus}(\hat D_n, D_h)$. This avoids the unknown quantities and converges much faster.

4. Statistical Inference.

4.1. Confidence Sets. We now show that we can construct confidence sets for $D_h$ by the bootstrap. A set $S_{n,1-\alpha}$ is called an asymptotically valid confidence set for $D_h$ if

(14)    $P(D_h \subset S_{n,1-\alpha}) = 1 - \alpha + O(r_n)$,

where $r_n \to 0$ as $n \to \infty$. We propose two methods for constructing a confidence set and we will show that they are both asymptotically valid.

4.1.1. Method 1: Variation of the Level Sets. The first approach is to use the Hausdorff distance between the level sets. Let $W_n = \mathrm{Haus}(\hat D_n, D_h)$ and define

$w_{1-\alpha} = F^{-1}_{W_n}(1 - \alpha)$,

where $F_A$ denotes the cdf of a random variable $A$. Then, it is easy to see that

(15)    $P(D_h \subset \hat D_n \oplus w_{1-\alpha}) \ge 1 - \alpha$.

We use the bootstrap to estimate $w_{1-\alpha}$. Let $X_1^*, \cdots, X_n^*$ be a bootstrap sample from $X_1, \cdots, X_n$. Let $\hat p_n^*$ denote the KDE using the bootstrap sample, and let $\hat D_n^*$ be the corresponding level set. We define $W_n^* = \mathrm{Haus}(\hat D_n^*, \hat D_n)$ and

(16)    $\hat w_{1-\alpha} = F^{-1}_{W_n^*}(1 - \alpha)$.

Then the bootstrap confidence set is $\hat D_n \oplus \hat w_{1-\alpha}$.

THEOREM 4. Assume that (K1–K2) hold and that (G) holds for $p_h$. Let $D_h$, $\hat D_n$ and $\hat D_n^*$ be the density level sets with level $\lambda$ for $p_h$, $\hat p_n$ and $\hat p_n^*$. Then there exists $\mathcal{X}_n$ such that

$\sup_t \left| P\left(\sqrt{nh^d}\,\mathrm{Haus}(\hat D_n^*, \hat D_n) < t \,\middle|\, X_1, \cdots, X_n\right) - P\left(\sqrt{nh^d}\,\mathrm{Haus}(\hat D_n, D_h) < t\right) \right| \le O\left((\|\hat p_n - p_h\|_{\max})^{1/8}\right)$

for all $(X_1, \cdots, X_n) \in \mathcal{X}_n$, and $P(\mathcal{X}_n) \ge 1 - 3e^{-nh^{d+2}\tilde A_0}$ for some constant $\tilde A_0$. Thus,

$P\left(D_h \subset \hat D_n \oplus \hat w_{1-\alpha}\right) \ge 1 - \alpha + O\left(\left(\frac{\log n}{nh^d}\right)^{1/8}\right)$.

An intuitive explanation for Theorem 4 is that as $n$ goes to infinity, the bootstrap process converges to the same Gaussian process as the empirical process. Thus, they share the same Berry-Esseen bound.
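The bootstrap recipe of Method 1 can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes $d = 1$, a Gaussian kernel, and a level set approximated by interpolated crossings of $\hat p_n - \lambda$ on a user-supplied grid. The names `kde`, `level_set`, `hausdorff` and `bootstrap_level_set_band` are hypothetical, and the empirical quantile uses a simple order statistic.

```python
import math
import random


def kde(data, h):
    """1-D Gaussian-kernel KDE (kernel choice is an assumption of this sketch)."""
    c = 1.0 / (len(data) * h * math.sqrt(2.0 * math.pi))
    return lambda x: c * sum(math.exp(-0.5 * ((x - xi) / h) ** 2) for xi in data)


def level_set(f, lam, grid):
    """Interpolated crossings of f - lam on the grid (approximates {x : f(x) = lam})."""
    vals = [f(x) - lam for x in grid]
    out = []
    for k in range(len(grid) - 1):
        a, b = vals[k], vals[k + 1]
        if a * b < 0.0:
            out.append(grid[k] + (grid[k + 1] - grid[k]) * a / (a - b))
    return out


def hausdorff(A, B):
    """Hausdorff distance between finite subsets of the real line."""
    if not A or not B:
        return math.inf  # convention used here when a level set is empty
    d = lambda x, S: min(abs(x - y) for y in S)
    return max(max(d(x, A) for x in B), max(d(x, B) for x in A))


def bootstrap_level_set_band(data, h, lam, grid, alpha=0.1, n_boot=200, seed=0):
    """Return (D_hat, w_hat); the confidence set is D_hat enlarged by w_hat."""
    rng = random.Random(seed)
    D_hat = level_set(kde(data, h), lam, grid)
    W_star = []
    for _ in range(n_boot):
        star = [rng.choice(data) for _ in data]  # resample with replacement
        W_star.append(hausdorff(level_set(kde(star, h), lam, grid), D_hat))
    W_star.sort()
    # Empirical (1 - alpha) quantile of the bootstrap Hausdorff distances.
    w_hat = W_star[min(n_boot - 1, math.ceil((1.0 - alpha) * n_boot) - 1)]
    return D_hat, w_hat
```

In this one-dimensional setting the resulting confidence set is the union of the intervals $[x - \hat w_{1-\alpha}, x + \hat w_{1-\alpha}]$ over the points $x$ of the estimated level set.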

4.1.2. Method 2: Variation of the Density. The second approach is to use the supremum norm of the KDE and impose an upper and lower bound around the density level. This idea is very similar to the work in Jankowski and Stanberry (2012) and Mammen and Polonik (2013). Let $M_n = \sup_{x \in \mathbb{K}} |\hat p_n(x) - p_h(x)|$ and

$m_{1-\alpha} = F^{-1}_{M_n}(1 - \alpha)$.

Define

(17)    $C_{n,1-\alpha} = \{x \in \mathbb{K} : |\hat p_n(x) - \lambda| \le m_{1-\alpha}\}$.

It is easy to verify that

(18)    $P(D_h \subset C_{n,1-\alpha}) \ge 1 - \alpha$.

Again we use the bootstrap to estimate the quantile. Recall that $\hat p_n^*$ is the KDE based on the bootstrap sample. We define $M_n^* = \sup_{x \in \mathbb{K}} |\hat p_n^*(x) - \hat p_n(x)|$ and set

(19)    $\hat m_{1-\alpha} = F^{-1}_{M_n^*}(1 - \alpha)$.

Then the confidence set is

(20)    $\hat C_{n,1-\alpha} = \{x \in \mathbb{K} : |\hat p_n(x) - \lambda| \le \hat m_{1-\alpha}\}$.

THEOREM 5. Assume that (K1–K2) hold and that (G) holds for $p_h$. Let $D_h$, $\hat D_n$ and $\hat D_n^*$ be the density level sets with level $\lambda$ for $p_h$, $\hat p_n$ and $\hat p_n^*$. Then

$P\left(D_h \subset \hat C_{n,1-\alpha}\right) \ge 1 - \alpha + O\left(\left(\frac{\log^4 n}{nh^d}\right)^{1/8}\right)$.
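Method 2 in equations (19) and (20) can be sketched in the same spirit. The code below is an illustration under stated assumptions: $d = 1$, a Gaussian kernel, and the supremum over $\mathbb{K}$ replaced by a maximum over a finite grid. The function names are hypothetical.

```python
import math
import random


def kde(data, h):
    """1-D Gaussian-kernel KDE (kernel choice is an assumption of this sketch)."""
    c = 1.0 / (len(data) * h * math.sqrt(2.0 * math.pi))
    return lambda x: c * sum(math.exp(-0.5 * ((x - xi) / h) ** 2) for xi in data)


def sup_norm_confidence_set(data, h, lam, grid, alpha=0.1, n_boot=200, seed=0):
    """Return (C_hat, m_hat): grid points with |p_hat(x) - lam| <= m_hat, and m_hat."""
    rng = random.Random(seed)
    f_hat = kde(data, h)
    p_hat = [f_hat(x) for x in grid]
    M_star = []
    for _ in range(n_boot):
        star = [rng.choice(data) for _ in data]  # resample with replacement
        f_star = kde(star, h)
        # Grid approximation of sup_x |p*_n(x) - p_hat_n(x)|.
        M_star.append(max(abs(f_star(x) - p) for x, p in zip(grid, p_hat)))
    M_star.sort()
    # Empirical (1 - alpha) quantile of the bootstrap sup-norm deviations.
    m_hat = M_star[min(n_boot - 1, math.ceil((1.0 - alpha) * n_boot) - 1)]
    C_hat = [x for x, p in zip(grid, p_hat) if abs(p - lam) <= m_hat]
    return C_hat, m_hat
```

Because the maximum runs over the whole grid rather than only over points near the level set, the threshold tends to be conservative, which matches the comparison of the two methods made in Remark 4.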

The proof of this theorem is simply an application of the Gaussian approximation to $M_n$ and $M_n^*$ (see e.g. Chernozhukov et al. (2014b)), so that $\sqrt{nh^d} M_n$ and $\sqrt{nh^d} M_n^*$ have the same Berry-Esseen bound as in Theorem 4. By (18), we obtain the desired result.

Now we compare our approach to Jankowski and Stanberry (2012) and Mammen and Polonik (2013). Jankowski and Stanberry (2012) propose to construct a confidence set of the form

$C_n = \{x : \lambda - \ell_n \le \hat p_n(x) \le \lambda + \tau_n\}$

with some $\ell_n, \tau_n \to 0$. They require that $\sqrt{nh^d}(\hat p_n - p_h)$ converges weakly to a random field. This is true when $h$ is fixed but not attainable if we allow $h = h_n \to 0$. Note that the convergence to a random field is a reasonable assumption for image data, which is the main research target in Jankowski and Stanberry (2012).

Mammen and Polonik (2013) construct a confidence set using an approach similar to method 2, but they focus on the original level set $D = \{x : p(x) = \lambda\}$ rather than the smoothed version $D_h$. Instead of taking the supremum deviation of the density over the whole support $\mathbb{K}$, they propose to focus on the region $x \in D \triangle \hat D_n$, i.e.,

$R_n = \sup_{x \in D \triangle \hat D_n} |\hat p_n(x) - p(x)|$,

where $A \triangle B = \{x : x \in A, x \notin B\} \cup \{x : x \in B, x \notin A\}$ is the symmetric difference between sets. Then they use the upper quantile of $R_n$ to construct a confidence set of a form similar to (17) and apply the bootstrap to estimate the quantile. Their bootstrap consistency relies on Neumann's method (Proposition 3.1 in Neumann (1998)), and they assume that $h$ converges fast enough that one can ignore the bias in estimating the original density $p$. In fact, under their assumptions, our proposed bootstrap confidence sets (from both method 1 and method 2) are also consistent for the original level set $D$, since the bias converges faster than the stochastic variation. The method in Mammen and Polonik (2013) should have higher power than our method 2, since they take the supremum over a smaller region. Note that we can also apply Neumann's method to prove consistency. But this requires that $h$ converges to 0 even when we focus on the smoothed version $D_h$, and the optimal rate is $O\left(n^{-\frac{1}{d+4}} \log n\right)$ under $h = O\left(n^{-\frac{1}{d+4}}\right)$.

REMARK 3. We may use a variance stabilizing transform to obtain an adaptive confidence set, using an idea similar to Chernozhukov et al. (2012). The variance of $\hat p_n(x)$ is proportional to $p(x)$. Thus, we may use

$V_n^* = \sup_{x \in \mathbb{K}} \frac{|\hat p_n^*(x) - \hat p_n(x)|}{\sqrt{\hat p_n(x)}}, \qquad \hat v_{1-\alpha} = F^{-1}_{V_n^*}(1 - \alpha)$

and set $\hat v_{1-\alpha} \times \sqrt{\hat p_n(x)}$ as an adaptive threshold for constructing the confidence set. Namely, the adaptive confidence set is given by

$\hat C^*_{n,1-\alpha} = \left\{x \in \mathbb{K} : |\hat p_n(x) - \lambda| \le \hat v_{1-\alpha} \times \sqrt{\hat p_n(x)}\right\}$.

Following the same approach as in the proof of Theorem 5, we can show that $\hat C^*_{n,1-\alpha}$ has asymptotically $1 - \alpha$ coverage.

REMARK 4. Both method 1 and method 2 generate confidence sets with asymptotically valid coverage. Figure 3 compares the 90% confidence sets from both methods on the Old Faithful dataset. Method 1 (left panel; blue regions) is clearly superior to method 2 (right panel; gold regions) in the sense that its confidence set is much smaller. The main reason is that both methods use the maximum over certain empirical processes, but the two processes are defined on different function spaces. Method 1 only takes the supremum over a small function space $\mathcal{F}$, in which the index set contains only points on the level set $D_h$. Method 2, however, takes the maximum over a large function space whose index set is the whole space $\mathbb{K}$. Thus, we expect the second method to have a wider confidence set.

REMARK 5. The rate $O\left(\left(\frac{\log^4 n}{nh^d}\right)^{1/8}\right)$ may not be optimal. Chernozhukov et al. (2014) apply an induction technique that gives a rate of order $n^{-1/6}$ for the Gaussian approximation. Although this is not mentioned explicitly in that paper, we believe that a similar technique applies to the empirical process. Thus, the rate in Theorems 3, 4 and 5 can be further refined to $O\left(\left(\frac{\log^4 n}{nh^d}\right)^{1/6}\right)$.

FIG 3. An example of 90% confidence sets using the variation of the level set (method 1; blue regions in left panel) and the supremum variation of the density (method 2; gold regions in right panel). The dataset is the Old Faithful dataset. The supremum variation of the density (right panel) is so large that the confidence set covers a wide region. In contrast, the variation of the level sets (left panel) gives a much tighter confidence set.

FIG 4. An example of a density level set and confidence regions for the Old Faithful data. Left: density level set (the thick black contour denotes the specified level $\lambda$). Right: 90% confidence regions for the density level sets. We also have 90% confidence that (1) all the yellow regions have density above $\lambda$, (2) all green regions have density below $\lambda$ and (3) the level sets $\{x : p_h(x) = \lambda\}$ are within the blue regions. Note that the yellow and green regions are the collections of $x$ for which we reject $H_{in,0}(x)$ and $H_{out,0}(x)$, respectively, while simultaneously controlling the significance level at $\alpha = 10\%$.

4.2. Hypothesis Tests for Clustering and Anomaly Detection. The confidence set from the previous section is related to hypothesis tests in the following scenarios. Assume the density level $\lambda$ is fixed and consider an arbitrary point $x$. A natural question is whether the density at this point, $p(x)$, is greater than the given level $\lambda$. In the framework of level set clustering (using connected components of the upper level set to cluster data points), this question is equivalent to asking whether we have evidence that the point $x$ belongs to some cluster. Framing this question in terms of hypothesis tests, we are conducting a local hypothesis test

(21)    $H_{in,0}(x) : p(x) \le \lambda, \qquad H_{in,A}(x) : p(x) > \lambda$.

If we reject $H_{in,0}(x)$ at a certain significance level, we have evidence that $x$ belongs to some cluster. For literature about upper level set clustering, we refer to Hartigan (1975); Polonik (1995); Rinaldo and Wasserman (2010); Rinaldo et al. (2010); Steinwart (2011); Kent et al. (2013); Balakrishnan et al. (2013).

Similarly, for the given point $x$, we may also want to know if we have evidence that the density $p(x)$ is below $\lambda$. This is related to the problem of anomaly detection (outlier detection), which is an important topic in pattern recognition, artificial intelligence and machine learning (Desforges et al., 1998; Breunig et al., 2000; He et al., 2003; Hodge and Austin, 2004; Jiang et al., 2008; Chhabra et al., 2008; Kloft et al., 2009; Chandola et al., 2009). An anomaly point (or an outlier) is a point in a low density region. We are performing the following local hypothesis test:

(22)    $H_{out,0}(x) : p(x) \ge \lambda, \qquad H_{out,A}(x) : p(x) < \lambda$.

If we reject $H_{out,0}(X_i)$, then we have strong evidence that $X_i$ is an anomaly.

When we want to test just a few points, we can do local tests for each point and control the family-wise error rate to control the type 1 error rate. However, in most cases, we are interested in many or even an infinite number of points (like a region). We would then conduct the local test for every point, which makes it difficult to control the type 1 error simultaneously. A remedy is to invert the confidence set constructed in the previous section. Theorem 6 shows how one can invert the confidence set to obtain tests that guarantee the type 1 error is controlled simultaneously for every point. Note that the lower level set $V_h = \{x \in \mathbb{K} : p_h(x) \le \lambda\}$ and the upper level set $L_h = \{x \in \mathbb{K} : p_h(x) \ge \lambda\}$ are those regions where we should not reject $H_{in,0}$ and $H_{out,0}$, respectively.

THEOREM 6. Let $\hat S_{n,1-\alpha}$ be a confidence set for $D_h$ from either method 1 or method 2 in the previous section. Let the coverage of $\hat S_{n,1-\alpha}$ be $1 - \alpha + O(r_n)$, where $r_n \to 0$ by Theorems 4 and 5. Then the decision rules

$T_{in,n}(x) = 1(\hat p_n(x) \ge \lambda \wedge x \notin \hat S_{n,1-\alpha})$
$T_{out,n}(x) = 1(\hat p_n(x) \le \lambda \wedge x \notin \hat S_{n,1-\alpha})$

control the type 1 error simultaneously for all $x \in \mathbb{K}$. Namely, for the lower level set $V_h$ and upper level set $L_h$, we have

$P(T_{in,n}(x) = 1 \;\; \forall x \in V_h) = \alpha + O(r_n)$
$P(T_{out,n}(x) = 1 \;\; \forall x \in L_h) = \alpha + O(r_n)$.
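The decision rules of Theorem 6 translate directly into code. The sketch below assumes that the density estimate at a point and its membership in the confidence set are already available; `classify_points` additionally assumes the Method 2 set from (20), so that membership reduces to $|\hat p_n(x) - \lambda| \le \hat m_{1-\alpha}$. The function names and string labels are illustrative.

```python
def decision_rules(p_hat_x, lam, in_conf_set):
    """Return (T_in, T_out) for one point x, following the rules in Theorem 6."""
    t_in = int(p_hat_x >= lam and not in_conf_set)
    t_out = int(p_hat_x <= lam and not in_conf_set)
    return t_in, t_out


def classify_points(p_hat_values, lam, m_hat):
    """Label points using the Method 2 set {x : |p_hat(x) - lam| <= m_hat}."""
    labels = []
    for p in p_hat_values:
        t_in, t_out = decision_rules(p, lam, abs(p - lam) <= m_hat)
        if t_in:
            labels.append("above")      # reject H_in,0: evidence of high density
        elif t_out:
            labels.append("below")      # reject H_out,0: evidence of an anomaly
        else:
            labels.append("undecided")  # x lies inside the confidence set
    return labels
```

Points labelled "undecided" are exactly those inside the confidence set for the level set, where neither null hypothesis is rejected.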

Figure 4 shows an example. The blue regions are 90% confidence sets for the level sets $D_h$. The yellow regions and green regions are the rejection regions for $H_{in,0}$ and $H_{out,0}$, respectively.

The above hypothesis tests can be inverted to construct confidence sets for the upper level set $L_h$ and the lower level set $V_h$. The simultaneous control over type 1 error becomes a coverage guarantee.

THEOREM 7. Let $T_{in,n}(x)$ and $T_{out,n}(x)$ be defined as in Theorem 6. Then the sets

(23)    $\hat R_{low,n,1-\alpha} = \{x \in \mathbb{K} : T_{in,n}(x) = 0\}, \qquad \hat R_{upp,n,1-\alpha} = \{x \in \mathbb{K} : T_{out,n}(x) = 0\}$

are asymptotic $100(1 - \alpha)\%$ confidence sets for the lower level set $V_h$ and the upper level set $L_h$. That is,

$P\left(V_h \subset \hat R_{low,n,1-\alpha}\right) = 1 - \alpha + O(r_n)$
$P\left(L_h \subset \hat R_{upp,n,1-\alpha}\right) = 1 - \alpha + O(r_n)$.

In Figure 4, a 90% confidence region for the upper level set $L_h$ is the union of the yellow and blue regions; similarly, a 90% confidence region for the lower level set $V_h$ is the union of the green and blue regions. Thus, we have 90% confidence that all yellow regions are above $\lambda$ and that the true high density regions are contained in the yellow and blue regions.

REMARK 6. In some cases, we may be interested in only a single point $x$ and would like to know if $p_h(x)$ is higher (or lower) than a prescribed value $\lambda$. In this case, we can do a local test by using the fact that

(24)    $\sqrt{nh^d}\left(\frac{\hat p_n(x) - p_h(x)}{\sqrt{p_h(x)}}\right) \overset{d}{\to} N(0, \sigma^2(K))$,

where $\sigma^2(K)$ is a known quantity that depends only on the kernel function $K$. Assume that the null hypothesis is $H_0 : p_h(x) = \lambda$. Then the test statistic is

(25)    $T_n(x) = \sqrt{nh^d}\left(\frac{\hat p_n(x) - \lambda}{\sqrt{\lambda}}\right)$

and we reject $H_0$ if $|T_n(x)| > z_{1-\alpha/2}$. When we only want to do a local test or to test only a few points, the simultaneous tests may lack power and the above local normality test is preferred.

REMARK 7. An alternative method is to control the false discovery rate (Benjamini and Hochberg, 1995) rather than the nominal significance level. We can use the local normality test in (25), compare it to the z-score to obtain the corresponding p-value for every point, and then control the false discovery rate. See Duong (2013) for an application of this idea to a two-sample density comparison problem (this is equivalent to the problem of finding the density level set; see Section 6).
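The local test of Remark 6 and the false discovery rate control of Remark 7 can be sketched as below. The statistic follows (25); the two-sided p-value takes the scale `sigma` as a parameter, with the default `sigma=1.0` corresponding to a direct comparison with the standard normal quantile, which is an assumption of this sketch rather than a statement about $\sigma^2(K)$. The step-up procedure is the standard Benjamini-Hochberg rule; all function names are illustrative.

```python
import math


def local_test_statistic(p_hat_x, lam, n, h, d):
    """T_n(x) = sqrt(n h^d) * (p_hat_n(x) - lam) / sqrt(lam), as in equation (25)."""
    return math.sqrt(n * h ** d) * (p_hat_x - lam) / math.sqrt(lam)


def two_sided_p_value(t, sigma=1.0):
    """P(|Z| > |t|) for Z ~ N(0, sigma^2); sigma is supplied by the user."""
    return math.erfc(abs(t) / (sigma * math.sqrt(2.0)))


def benjamini_hochberg(p_values, q=0.1):
    """Return a list of booleans: True where the hypothesis is rejected at FDR level q."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    k_max = 0
    for rank, i in enumerate(order, start=1):
        # Largest rank k with p_(k) <= q * k / m determines the rejection cutoff.
        if p_values[i] <= q * rank / m:
            k_max = rank
    rejected = [False] * m
    for i in order[:k_max]:
        rejected[i] = True
    return rejected
```

A typical use would compute `local_test_statistic` at each query point, convert the statistics to p-values, and pass the list to `benjamini_hochberg`.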

FIG 5. An example of a density level set and mode clustering. Left: density level set (the thick black contour denotes the specified level); the four red dots are the local modes. Middle: basins of attraction for each local mode, intersected with the density level set. Right: graph representation. There are two connected components. The large connected component contains three local modes $m_{1,1}, m_{1,2}, m_{1,3}$, with the pairs $(m_{1,1}, m_{1,2})$ and $(m_{1,2}, m_{1,3})$ connected.

5. Visualization for High Dimensional Level Sets. A limitation of using density level sets in data analysis is the dimension of the data. When each observation consists of more than 3 variables (i.e., d > 3), we are unable to visualize the level sets directly. All we can do is use approaches that visualize a high dimensional structure while preserving certain information.

A common visualization technique is the density tree (Stuetzle, 2003; Klemelä, 2004; Kent et al., 2013; Balakrishnan et al., 2013). The density tree considers several density levels, say $\lambda_1 < \cdots < \lambda_K$, and computes the number of connected components of the upper density level set at each level. As we increase the density level, some connected components may disappear or split into several smaller connected components. A connected component disappears when the density level rises above the density of every point in the component, and it splits when the density level passes the density value at a saddle point. Using the changes in the connected components, we can construct a tree structure to visualize the level set. See Stuetzle (2003); Klemelä (2004); Kent et al. (2013); Balakrishnan et al. (2013) for more details.

A problem with density trees is that they only show topological information. That is, the tree only shows the connected components at each density level. Other information, such as the size of clusters, the relative position of clusters and how clusters are connected to each other, is lost. To construct a visualization that preserves more information, we combine the ideas from level set clustering and mode clustering.

5.1. Mode Clustering and Density (Upper) Level Set. Given a smooth function p, mode clustering works as follows (Fukunaga and Hostetler, 1975; Cheng, 1995; Comaniciu and Meer, 2002; Li et al., 2007; Chacón, 2012; Chen et al., 2014c). We form a partition of K based on the gradient field $g \equiv \nabla p$. For each $x \in K$, we define a gradient flow $\pi_x : [0, \infty) \mapsto K$ by

$$\pi_x(0) = x, \qquad \pi_x'(t) = g(\pi_x(t)).$$

That is, $\pi_x(t)$ starts at x and moves along the gradient of p. We define the destination for $\pi_x(t)$ as $\mathrm{dest}(x) = \lim_{t\to\infty} \pi_x(t)$. Let $\mathcal{M}$ be the collection of all local modes of p. Then it is easy to see that $\mathrm{dest}(x) \in \mathcal{M}$ except for a set B with Lebesgue measure 0 (this set corresponds to the boundaries of clusters). For each mode $m_j \in \mathcal{M}$, we define its basin of attraction as $A_j = \{x \in K : \mathrm{dest}(x) = m_j\}$.
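The destination map can be approximated numerically. The following is a minimal sketch, assuming a Gaussian kernel density estimate in place of p and using the mean shift iteration (one of the approaches cited above) as a discretized version of the gradient flow. The function name `mean_shift_dest`, the toy data and the bandwidth are illustrative assumptions and do not come from the paper.

```python
import math

def mean_shift_dest(x, data, h, tol=1e-8, max_iter=1000):
    """Approximate dest(x) for a Gaussian KDE with bandwidth h.

    Each step moves x to the kernel-weighted average of the data,
    which is an ascent step on the estimated density.
    """
    x = list(x)
    for _ in range(max_iter):
        # Gaussian kernel weight of every data point relative to x
        weights = [
            math.exp(-sum((a - b) ** 2 for a, b in zip(x, xi)) / (2 * h * h))
            for xi in data
        ]
        total = sum(weights)
        new_x = [
            sum(w * xi[k] for w, xi in zip(weights, data)) / total
            for k in range(len(x))
        ]
        shift = math.sqrt(sum((a - b) ** 2 for a, b in zip(new_x, x)))
        x = new_x
        if shift < tol:  # the iterate has (numerically) stopped moving
            break
    return x

# Hypothetical one-dimensional sample with two well separated groups.
toy_data = [(-0.5,), (0.0,), (0.5,), (9.5,), (10.0,), (10.5,)]
dest_left = mean_shift_dest((0.5,), toy_data, h=1.0)
dest_right = mean_shift_dest((9.5,), toy_data, h=1.0)
```

Points whose iterates end at the same location (up to a small tolerance) are assigned to the same basin of attraction, which gives the mode clustering of the sample.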

The regions $A_1, \cdots, A_k$ are the clusters generated by mode clustering.

Now we recall three facts about a general upper density level set $L = \{x : p(x) \geq \lambda\}$ (cf. Figure 5, left and middle panels):

1. L can be factorized into several connected components. Namely,
$$L = \bigcup_{\ell=1}^{K} C_\ell,$$
where the $C_\ell$ are disjoint, connected compact sets under regularity conditions.
2. Each $C_\ell$ contains at least one local mode.
3. If $C_\ell$ contains s local modes, then
$$C_\ell = C^\lambda_{\ell,1} \cup \cdots \cup C^\lambda_{\ell,s} \cup B,$$
where $C^\lambda_{\ell,j}$ is the basin of attraction of a local mode $m_{\ell,j}$ intersected with the level set L, and B is the set of boundaries of the basins, which has Lebesgue measure 0. Namely, $C^\lambda_{\ell,j} = L \cap A_k$ for some k.

Thus, the upper level sets are covered by the basins of attraction of the local modes. To represent a level set, we may use a graph $G = (V, E)$ in which each vertex corresponds to a local mode within L and the edges represent the connections between local modes. We add an edge between a pair of local modes $(m_{\ell,j}, m_{\ell,k})$ when the corresponding basins $C^\lambda_{\ell,j}$ and $C^\lambda_{\ell,k}$ share a boundary, i.e., $\bar{C}^\lambda_{\ell,j} \cap \bar{C}^\lambda_{\ell,k} \neq \emptyset$. Note that two local modes have an edge only if they are in the same connected component. Figure 5 provides an example of L, its connected components, the basins of attraction and the corresponding graph.

5.2. Visualization Algorithm. Our visualization algorithm is given in Algorithm 1. Here are some detailed implementation steps. We first perform mode clustering to obtain the local modes $m_1, \cdots, m_k$. One may use the mean shift algorithm (Fukunaga and Hostetler, 1975; Cheng, 1995; Comaniciu and Meer, 2002) to obtain them. To each local mode $m_\ell$ we assign an index

$$r_\ell(\lambda) = \frac{n_\ell(\lambda)}{n}, \qquad n_\ell(\lambda) = \sum_{i=1}^{n} 1\left(\widehat{\mathrm{dest}}(X_i) = m_\ell \wedge \hat{p}_n(X_i) \geq \lambda\right). \tag{26}$$

That is, $n_\ell(\lambda)$ is the number of data points that are assigned to $m_\ell$ by mode clustering and have density greater than or equal to λ. If $r_\ell(\lambda) > 0$, we create a circle around $m_\ell$ with radius proportional to $r_\ell(\lambda)$ (if $r_\ell(\lambda) = 0$, we ignore this local mode). Essentially, $r_\ell(\lambda) = 0$ if the density at the mode (and over its basin of attraction) is below λ. We connect two local modes $m_\ell$ and $m_j$ if they belong to the same connected component of $\hat{L}_n$ at level λ and their basins of attraction above the level set share the same boundary (see e.g. Figure 5, middle and right panels). We may adjust the width of the line connecting $m_\ell$ and $m_j$ to be proportional to $r_\ell(\lambda) + r_j(\lambda)$.
Now, given several density levels, say $\lambda_1 < \cdots < \lambda_K$, we can overlay the visualizations from the previous paragraph from $\lambda_1$ to $\lambda_K$ to create a tomographic visualization of the clusters. This gives a visualization for the density level sets. Figure 6 gives an example of visualizing level sets for a 6-dimensional and a 10-dimensional simulated dataset at different density levels. The data are from Chen et al. (2014c).
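The index (26), evaluated at several levels for the overlay, can be sketched as follows. This is only an illustration under stated assumptions: the mode assignment of each data point and the estimated density at each data point are taken as given inputs, and the function name `mode_indices` and the toy numbers are hypothetical.

```python
from collections import Counter

def mode_indices(mode_labels, density_values, level):
    """Return {mode label: r_l(level)} following equation (26).

    mode_labels[i]    -- label of the mode that data point i is assigned to
    density_values[i] -- estimated density at data point i
    """
    n = len(mode_labels)
    # n_l(level): points assigned to mode l whose density is at least `level`
    counts = Counter(
        label
        for label, dens in zip(mode_labels, density_values)
        if dens >= level
    )
    return {label: counts[label] / n for label in sorted(set(mode_labels))}

# Hypothetical inputs: six points assigned to three modes.
toy_labels = [0, 0, 0, 1, 1, 2]
toy_density = [0.30, 0.22, 0.05, 0.18, 0.12, 0.04]
toy_levels = [0.1, 0.2]

# One set of indices per level; these are overlaid in the tomographic plot.
overlay = {lam: mode_indices(toy_labels, toy_density, lam) for lam in toy_levels}
```

Modes whose index is 0 at a given level are simply not drawn at that level, so the circles shrink and eventually vanish as the level increases.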


FIG 6. A visualization example. Top: the data for the first three dimensions. We add an additional 3 and 7 dimensional Gaussian noise to each data point. Bottom: a visualization for level sets at different values (legend in both panels: 0.1, 0.3, 0.5, 0.7 and 0.9 times p_max). Bottom left: a 6 dimensional dataset. Bottom right: a 10 dimensional dataset.


Algorithm 1 Visualization for a single level set
Input: Data $\{X_1, \ldots, X_n\}$, density level λ, smoothing parameter h.
1. Compute the kernel density estimator $\hat{p}_n$.
2. Find the modes $m_1, \cdots, m_k$ of $\hat{p}_n$ (one can apply the mean shift algorithm).
3. Apply multidimensional scaling to the modes to project them into $\mathbb{R}^2$.
4. For each local mode $m_\ell$, compute its index (26):
$$r_\ell(\lambda) = \frac{n_\ell(\lambda)}{n}, \qquad n_\ell(\lambda) = \sum_{i=1}^{n} 1\left(\widehat{\mathrm{dest}}(X_i) = m_\ell \wedge \hat{p}_n(X_i) \geq \lambda\right).$$
5. If $r_\ell(\lambda) > 0$, create a circle around $m_\ell$ with radius proportional to $r_\ell(\lambda)$.
6. Connect two local modes $m_\ell$ and $m_j$ if
(1) they belong to the same connected component of $\hat{L}_n$ at level λ, and
(2) their basins of attraction above the level set share the same boundary.
7. Adjust the width of the line connecting $m_\ell$ and $m_j$ to be proportional to $r_\ell(\lambda) + r_j(\lambda)$.
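Step 3 of Algorithm 1 does not specify which multidimensional scaling variant is used. As a hedged illustration, the sketch below implements classical (Torgerson) scaling, which is an assumption on our part: it double-centers the squared distance matrix of the modes and extracts the two leading eigenpairs by power iteration. The helper names `_top_eigenpair` and `classical_mds_2d` and the toy modes are hypothetical.

```python
import math

def _top_eigenpair(matrix, n_iter=500):
    """Power iteration for the dominant eigenpair of a symmetric PSD matrix."""
    n = len(matrix)
    vec = [float(i + 1) for i in range(n)]  # fixed, non-constant start vector
    value = 0.0
    for _ in range(n_iter):
        new = [sum(matrix[i][j] * vec[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(v * v for v in new))
        if norm < 1e-12:  # nothing left to extract in this direction
            return 0.0, [0.0] * n
        vec = [v / norm for v in new]
        value = norm
    return value, vec

def classical_mds_2d(points):
    """Embed points into the plane with classical multidimensional scaling."""
    n = len(points)
    # Squared Euclidean distances between all pairs of points
    d2 = [[sum((a - b) ** 2 for a, b in zip(p, q)) for q in points] for p in points]
    row_mean = [sum(row) / n for row in d2]
    total_mean = sum(row_mean) / n
    # Double centering turns squared distances into a Gram matrix
    gram = [
        [-0.5 * (d2[i][j] - row_mean[i] - row_mean[j] + total_mean) for j in range(n)]
        for i in range(n)
    ]
    coords = [[0.0, 0.0] for _ in range(n)]
    for axis in range(2):
        value, vec = _top_eigenpair(gram)
        scale = math.sqrt(max(value, 0.0))
        for i in range(n):
            coords[i][axis] = scale * vec[i]
        # Deflate so the next pass finds the following eigenpair
        gram = [
            [gram[i][j] - value * vec[i] * vec[j] for j in range(n)]
            for i in range(n)
        ]
    return coords

# Hypothetical modes in R^5 that happen to lie in a two-dimensional plane.
toy_modes = [(0, 0, 0, 0, 0), (3, 0, 0, 0, 0), (0, 4, 0, 0, 0), (3, 4, 0, 0, 0)]
embedded = classical_mds_2d(toy_modes)
```

When the modes lie exactly in a plane, as in the toy input, the planar embedding reproduces all pairwise distances; in general the embedding only approximates them, which is the information the visualization is meant to preserve.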

FIG 7. A visualization of confidence sets for $L_h$ for the example of Figure 6. Left: the 6 dimensional dataset. Right: the 10 dimensional dataset. We pick the density level $\lambda = 0.1 \times \|\hat{p}_n\|_{\max}$ and display confidence sets under α = (50%, 80%, 90%, 95%, 99.9%).


5.3. Visualizing Confidence Sets. A modification of the visualization algorithm for various density level sets allows us to visualize high dimensional confidence sets for a level set at a given level λ. In particular, we use method 2 (maximal deviation for the density). The main reason is that this confidence set under different α is just the density level set at different λ. To be more specific, assume that we are interested in visualizing K distinct confidence sets for $L_h$ with $\alpha_1 < \cdots < \alpha_K$. Recall that $\hat{M}_n^* = \sup_{x \in K} |\hat{p}_n^*(x) - \hat{p}_n(x)|$ is the bootstrap supremum deviation for the KDE and $\hat{m}_{1-\alpha} = F^{-1}_{\hat{M}_n^*}(1 - \alpha)$ is the quantile. When we want to visualize confidence sets for $L_h$ at different significance levels, we pick

$$\lambda_1 = \lambda + \hat{m}_{1-\alpha_1}, \; \cdots, \; \lambda_K = \lambda + \hat{m}_{1-\alpha_K}$$

and use the visualization algorithm to create a tomographic visualization. On the other hand, if we want to visualize the confidence sets for $V_h$, we use

$$\lambda_1 = \lambda - \hat{m}_{1-\alpha_1}, \; \cdots, \; \lambda_K = \lambda - \hat{m}_{1-\alpha_K}$$

and apply the visualization algorithm again. Figure 7 provides an example of visualizing confidence sets for the 6 and 10 dimensional simulation data in Figure 6.

REMARK 8. We can also use method 1 (variation of the level sets) and combine it with the basins of attraction to visualize the confidence sets. We use method 1 to construct the inner/outer confidence sets, find their connected components, partition them by the basins of attraction of the local modes, and use multidimensional scaling to visualize the result in low dimensions.

6. Discussion. In this paper, we derive the limiting distribution for the density level set under Hausdorff loss. This result immediately allows us to construct a confidence set for the level set. We propose to construct the confidence sets by two bootstrapping methods and we show the consistency of our methods. We illustrate that the confidence set can be linked to two types of hypothesis tests; one is related to level set clustering and the other is related to anomaly detection. Finally, we propose a visualization method using the properties of mode clustering that allows us to see (upper) density level sets in high dimensions. Here we discuss some topics related to the level set.

6.1. Extension. In this paper, we focused on density level sets, but all of our theorems and lemmas can be applied to the following problems (Mammen and Polonik, 2013).

• Generalized Level Set. Let q(x) be a known function. The generalized level set is defined as
$$D = \{x : p(x) = q(x)\}. \tag{27}$$
The usual level set is a special case ($q(x) = \lambda$) of this problem.

• Two sample comparison. The level set problem is also linked to the two sample comparison. Our data consists of two samples; one sample is from an unknown density $p_1$ and the other sample is from an unknown density $p_2$. We are interested in the region where $p_1 = p_2$:
$$D = \{x : p_1(x) = p_2(x)\}. \tag{28}$$

• Bayes Classifier. Consider a binary classification problem such that the class label $Y \in \{1, 2\}$ and the feature X satisfies
$$p_{X|Y=1}(x) = p_1(x), \qquad p_{X|Y=2}(x) = p_2(x).$$


Let the marginal probabilities for the class Y be $P(Y = 1) = \pi_1$ and $P(Y = 2) = \pi_2$. Under 0-1 loss, the decision boundary of the Bayes classifier is
$$D = \{x : \pi_1 p_1(x) = \pi_2 p_2(x)\}. \tag{29}$$
This problem is also equivalent to the density level set problem.

6.2. Inference for Level Sets under Increasing Dimensions. We have assumed that the dimension is fixed. To extend the inference to increasing dimension, there are several issues we need to address:

• Concentration of measure under increasing dimensions. The asymptotic theory for level sets depends heavily on the concentration of probability for density estimation. Most concentration inequalities we have applied (e.g. Talagrand's inequality) are for fixed dimensions. We need to modify these inequalities to account for the growth in dimension.

• Nonparametric rate of convergence. Another issue is that the nonparametric rate of convergence is slow when we allow h to decrease with the sample size. The optimal rate of convergence for the nonparametric kernel density estimate under $L_2$ error is $O\left(n^{-\frac{2\beta}{2\beta+d}}\right)$ under a Hölder β-smoothness condition. As $d = d_n \to \infty$, this rate is extremely slow. A way to circumvent this problem is, as treated in this paper, to focus on estimating the smoothed version of the density (and the corresponding level set) and fix a lower bound on h so that it will not converge to 0.

• Computational complexity. To conduct statistical inference, we need to compute the deviation of the level sets or densities. Since the level sets are (d − 1)-dimensional objects and the density function is defined on the full d-dimensional space, the computational cost increases exponentially as the dimension increases.

7. Proofs.

THEOREM 8 (Theorem 2 in Cuevas et al. (2006)). Assume (K1–2) and (G). Then we have
$$\mathrm{Haus}\left(\hat{D}_n, D_h\right) = O\left(\|\hat{p}_n - p_h\|_{1,\max}\right).$$

THEOREM 9 (Talagrand's inequality; version of Theorem 12 in Chen et al. (2014a)). Assume (K1–2). Then for each t > 0 there exists some $n_0$ such that whenever $n > n_0$, we have
$$P\left(\|\hat{p}_n - p_h\|^*_{\ell,\max} > t\right) \leq (\ell + 1)\, e^{-t n h^{d+2\ell} A_1},$$
for some constant $A_1$ and ℓ = 0, 1, 2. Moreover,
$$E\left(\|\hat{p}_n - p_h\|^*_{2,\max}\right) = O\left(\sqrt{\frac{\log n}{n h^{d+4}}}\right).$$

PROOF FOR LEMMA 1. We first prove the lower bound for $\mathrm{reach}(D_h)$ and then we prove the three additional assertions.

Part 1: Lower bound on reach. We prove this by contradiction. Take x near D such that
$$d(x, D) < \min\left\{\frac{\delta_0}{2}, \frac{g_0}{\|p\|^*_{2,\max}}\right\}. \tag{30}$$
We assume that x has two projections onto D, denoted as b and c. Since $b, c \in D$, $p(b) - \lambda = p(c) - \lambda = 0$ so that $p(b) - p(c) = 0$. Now by Taylor's theorem
$$|(b - c)^T \nabla p(b)| = |p(b) - p(c) - (b - c)^T \nabla p(b)| \leq \frac{1}{2} \|b - c\|^2 \|p\|_{2,\max}. \tag{31}$$
Now by the nature of projection, we can find a constant $t_b \in \mathbb{R}$ such that $x - b = t_b \nabla p(b)$. Together with (31),
$$2|(b - c)^T (x - b)| = 2|(b - c)^T \nabla p(b)\, t_b| = 2|(b - c)^T \nabla p(b)|\,|t_b| \leq \|p\|_{2,\max} \|b - c\|^2 |t_b|. \tag{32}$$
Since both b and c are projection points from x onto D, $\|x - b\| = \|x - c\|$. Thus, we have
$$0 = \|x - c\|^2 - \|x - b\|^2 = \|b - c\|^2 + 2(b - c)^T (x - b) \geq \|b - c\|^2 - \|p\|_{2,\max} \|b - c\|^2 |t_b| = \|b - c\|^2 \left(1 - \|p\|_{2,\max} |t_b|\right). \tag{33}$$
Recall that $d(x, D) \leq g_0 / \|p\|_{2,\max}$, and
$$\frac{g_0}{\|p\|_{2,\max}} > d(x, D) = \|x - b\| = \|t_b \nabla p(b)\| = |t_b| \|\nabla p(b)\| \geq |t_b| g_0 \tag{34}$$
so that $|t_b| \|p\|_{2,\max} < 1$. Note that the lower bound $g_0$ in the last inequality holds because $d(x, D) < \delta_0/2$, so it follows from assumption (G). Plugging this result into the last equality of (33), we conclude that $\|b - c\| = 0$. This shows b = c, so the projection is unique. Thus, whenever $d(x, D) < \min\{\delta_0/2,\; g_0/\|p\|^*_{2,\max}\}$, x has a unique projection onto D, and we have proved the lower bound on reach.

Part 2: The three assertions. The first assertion is trivially true when $\|p - q\|^*_{2,\max}$ is sufficiently small since assumption (G) only involves gradients (first derivatives).

The second assertion follows from the lower bound on reach. By assertion 1, (G) holds for q. The lower bound on reach is determined by the gradient and the second derivatives, so we have the prescribed bound.


The third assertion follows from Theorem 1 in Chazal et al. (2007), which states that if two (d − 1)-dimensional smooth manifolds $M_1$ and $M_2$ have Hausdorff distance less than $(2 - \sqrt{2}) \min\{\mathrm{reach}(M_1), \mathrm{reach}(M_2)\}$, then $M_1$ and $M_2$ are normal compatible to each other. Now by Theorem 8, the Hausdorff distance between D and D(q) is at rate $O(\|p - q\|_{1,\max})$, so this assertion is true when $\|p - q\|_{2,\max}$ is sufficiently small.

PROOF OF LEMMA 2. Let $x \in D_h$. We define $\Pi(x) \in \hat{D}_n$ to be the projection of x onto $\hat{D}_n$. By Lemma 1 and Theorem 8, when $\|\hat{p}_n - p_h\|^*_{2,\max}$ is sufficiently small, $d(x, \hat{D}_n) \leq \mathrm{Haus}(D_h, \hat{D}_n)$ so that $\Pi(x)$ is unique. Thus, we assume $\Pi(x)$ is unique.

Now since $\Pi(x) \in \hat{D}_n$ and $x \in D_h$, $\hat{p}_n(\Pi(x)) - p_h(x) = 0$. Thus, by Taylor's theorem
$$\hat{p}_n(x) - p_h(x) = \hat{p}_n(x) - \hat{p}_n(\Pi(x)) = (x - \Pi(x))^T \left(\nabla \hat{p}_n(\Pi(x)) + O(\|x - \Pi(x)\|)\right). \tag{35}$$
Note that $x - \Pi(x)$ is normal to $\hat{D}_n$ at $\Pi(x)$ so that it points in the same direction as $\nabla \hat{p}_n(\Pi(x))$. Thus, (35) can be rewritten as
$$\hat{p}_n(x) - p_h(x) = \|x - \Pi(x)\| \left(\|\nabla \hat{p}_n(\Pi(x))\| + O(\|x - \Pi(x)\|)\right). \tag{36}$$
By Taylor's theorem, $\nabla \hat{p}_n(\Pi(x))$ is close to $\nabla p_h(x)$ in the sense that
$$\nabla \hat{p}_n(\Pi(x)) = \nabla p_h(x) + O\left(\|\hat{p}_n - p_h\|^*_{1,\max}\right). \tag{37}$$
In addition, $O(\|x - \Pi(x)\|)$ is bounded by $O(\mathrm{Haus}(\hat{D}_n, D_h))$, which is at rate $O(\|\hat{p}_n - p_h\|^*_{1,\max})$ due to Theorem 8. Putting these together in (36), we conclude
$$\hat{p}_n(x) - p_h(x) = \|x - \Pi(x)\| \left(\|\nabla p_h(x)\| + O\left(\|\hat{p}_n - p_h\|^*_{1,\max}\right)\right) = d(x, \hat{D}_n) \left(\|\nabla p_h(x)\| + O\left(\|\hat{p}_n - p_h\|^*_{1,\max}\right)\right). \tag{38}$$
Note that the left hand side can be written as
$$\hat{p}_n(x) - p_h(x) = \frac{1}{n h^d} \sum_{i=1}^{n} K\left(\frac{x - X_i}{h}\right) - \frac{1}{h^d} E\left(K\left(\frac{x - X_i}{h}\right)\right) = \frac{1}{\sqrt{n}\, h^d}\, G_n(\tilde{f}_x), \tag{39}$$
where $\tilde{f}_x(y) = K\left(\frac{x - y}{h}\right)$. Plugging (39) into the left hand side of (38), dividing both sides by $\|\nabla p_h(x)\|$ and setting
$$f_x(y) = \frac{\tilde{f}_x(y)}{\sqrt{h^d}\, \|\nabla p_h(x)\|},$$
we obtain
$$\frac{\left|\frac{1}{\sqrt{n h^d}}\, G_n(f_x) - d(x, \hat{D}_n)\right|}{d(x, \hat{D}_n)} = O\left(\|\hat{p}_n - p_h\|^*_{1,\max}\right). \tag{40}$$
This holds uniformly for all $x \in D_h$. Recall that the definition of $\mathcal{F}$ is
$$\mathcal{F} = \left\{ f_x(y) \equiv \frac{1}{\sqrt{h^d}\, \|\nabla p_h(x)\|} K\left(\frac{x - y}{h}\right) : x \in D_h \right\}.$$
So we conclude
$$\sup_{x \in D_h} \frac{\left|\frac{1}{\sqrt{n h^d}}\, G_n(f_x) - d(x, \hat{D}_n)\right|}{d(x, \hat{D}_n)} = O\left(\|\hat{p}_n - p_h\|^*_{1,\max}\right).$$


PROOF FOR THEOREM 3. The proof of Theorem 3 follows the same procedure as the proof of Theorem 6 in Chen et al. (2014b). The proof contains two parts: Gaussian approximation and anti-concentration.

Part 1: Gaussian approximation. Basically, we will show that
$$\sqrt{n h^d}\, \mathrm{Haus}(\hat{D}_n, D_h) \approx \sup_{f \in \mathcal{F}} |G_n(f)| \approx \sup_{f \in \mathcal{F}} |B(f)|,$$
where B is a Gaussian process defined in (13).

First, when $\|\hat{p}_n - p_h\|$ is sufficiently small, $\hat{D}_n$ and $D_h$ are normal compatible to each other by Lemma 1. Then by the property of normal compatibility,
$$\sup_{x \in D_h} d(x, \hat{D}_n) = \mathrm{Haus}(\hat{D}_n, D_h). \tag{41}$$
Thus, the difference satisfies
$$\begin{aligned}
\left|\sqrt{n h^d}\, \mathrm{Haus}(\hat{D}_n, D_h) - \sup_{f \in \mathcal{F}} |G_n(f)|\right| &= \left|\sqrt{n h^d} \sup_{x \in D_h} d(x, \hat{D}_n) - \sup_{f \in \mathcal{F}} |G_n(f)|\right| \\
&\leq \frac{\sup_{x \in D_h} \left|\frac{1}{\sqrt{n h^d}}\, G_n(f_x) - d(x, \hat{D}_n)\right|}{\frac{1}{\sqrt{n h^d}}} \\
&\leq \sup_{x \in D_h} \frac{\left|\frac{1}{\sqrt{n h^d}}\, G_n(f_x) - d(x, \hat{D}_n)\right|}{d(x, \hat{D}_n)} \\
&= O\left(\|\hat{p}_n - p_h\|^*_{1,\max}\right).
\end{aligned} \tag{42}$$
Note that the last two steps follow from the fact that $d(x, \hat{D}_n) \leq O\left(\frac{1}{\sqrt{n h^d}}\right)$. By Theorem 9, the above result implies
$$P\left(\left|\sqrt{n h^d}\, \mathrm{Haus}(\hat{D}_n, D_h) - \sup_{f \in \mathcal{F}} |G_n(f)|\right| > t\right) \leq 2 e^{-t n h^{d+2} A_2} \tag{43}$$
for some constant $A_2$.

Now by Theorem 3.1 in Chernozhukov et al. (2014c), there exists some random variable $\mathbf{B} \overset{d}{=} \sup_{f \in \mathcal{F}} |B(f)|$ such that for all $\gamma \in (0, 1)$ and n sufficiently large,
$$P\left(\left|\sup_{f \in \mathcal{F}} |G_n(f)| - \mathbf{B}\right| > A_3 \frac{\log^{2/3}(n)}{\gamma^{1/3} (n h^d)^{1/6}}\right) \leq A_4 \gamma. \tag{44}$$
Combining equations (43) and (44) and picking $t = 1/\sqrt{n h^{d+2}}$, we have that for n sufficiently large and $\gamma \in (0, 1)$,
$$P\left(\left|\sqrt{n h^d}\, \mathrm{Haus}(\hat{D}_n, D_h) - \mathbf{B}\right| > A_3 \frac{\log^{2/3}(n)}{\gamma^{1/3} (n h^d)^{1/6}} + \frac{1}{\sqrt{n h^{d+2}}}\right) \leq A_4 \gamma + 2 e^{-\sqrt{n h^{d+2}}\, A_2}. \tag{45}$$

Part 2: Anti-concentration. To obtain the desired Berry-Esseen bound, we apply the anti-concentration inequality in Chernozhukov et al. (2014b) and Chernozhukov et al. (2014a).


LEMMA 10 (Modification of Corollary 2.1 in Chernozhukov et al. (2014b)). Let $X_t$ be a Gaussian process with index $t \in T$, and with semi-metric $d_T$ such that $E(X_t) = 0$, $E(X_t^2) = 1$ for all $t \in T$. Assume that $\sup_{t \in T} X_t < \infty$ a.s. and there exists a random variable Y such that $P\left(\left|Y - \sup_{t \in T} |X_t|\right| > \eta\right) < \delta(\eta)$. If $A(|X|) = E\left(\sup_{t \in T} |X_t|\right) < \infty$, then
$$\sup_t \left|P(Y < t) - P\left(\sup_{t \in T} |X_t| < t\right)\right| \leq A_5 (\eta + \delta(\eta))\, A(|X|)$$
for some constant $A_5$.

By definition $\mathbf{B} \overset{d}{=} \sup_{f \in \mathcal{F}} |B(f)|$ so that $E(\mathbf{B}) < \infty$ (since we do not include h in $\mathcal{F}$, see equation (12)). From Lemma 10 and equation (45), there exists some constant $A_6$ such that
$$\sup_t \left|P\left(\sqrt{n h^d}\, \mathrm{Haus}(\hat{D}_n, D_h) < t\right) - P\left(\sup_{f \in \mathcal{F}} |B(f)| < t\right)\right| \leq A_6 \left(A_3 \frac{\log^{2/3}(n)}{\gamma^{1/3} (n h^d)^{1/6}} + \frac{1}{\sqrt{n h^{d+2}}} + A_4 \gamma + 2 e^{-\sqrt{n h^{d+2}}\, A_2}\right). \tag{46}$$
Now pick $\gamma = \left(\frac{\log^4 n}{n h^d}\right)^{1/8}$ and use the fact that $\frac{1}{\sqrt{n h^{d+2}}}$ and $2 e^{-\sqrt{n h^{d+2}}\, A_2}$ converge faster than the other terms; we obtain the desired rate.

PROOF FOR THEOREM 4. This proof follows the same strategy as the proof of Theorem 7 in Chen et al. (2014b). We prove the Berry-Esseen type bound first and then show that the coverage is consistent. We prove the Berry-Esseen bound in two simple steps: Gaussian approximation and support approximation. Let $\mathcal{X}_n = \{(X_1, \cdots, X_n) : \|\hat{p}_n - p_h\|^*_{2,\max} \leq \eta_0\}$ for some small $\eta_0$ so that whenever our data is within $\mathcal{X}_n$, (G) holds for $\hat{p}_n$. By Lemma 1, such an $\eta_0$ exists and by Theorem 9 we have $P(\mathcal{X}_n) \geq 1 - 3 e^{-n h^{d+4} \tilde{A}_0}$ for some constant $\tilde{A}_0$. Thus, we assume our original data $X_1, \cdots, X_n$ is within $\mathcal{X}_n$.

Step 1: Gaussian approximation. Let $\hat{P}_n$ and $\hat{P}_n^*$ be the empirical measure and the bootstrap empirical measure. A crucial observation is that for a function $f_x(y) = K\left(\frac{x - y}{h}\right)$,
$$\hat{P}_n(f_x) = \int K\left(\frac{x - y}{h}\right) d\hat{P}_n(y) = h^d\, \hat{p}_n(x). \tag{47}$$
Also note
$$\hat{P}_n^*(f_x) = \int K\left(\frac{x - y}{h}\right) d\hat{P}_n^*(y) = h^d\, \hat{p}_n^*(x). \tag{48}$$
Therefore, for the bootstrap empirical process $G_n^* = \sqrt{n}(\hat{P}_n^* - \hat{P}_n)$,
$$\hat{p}_n^*(x) - \hat{p}_n(x) = \frac{1}{\sqrt{n}\, h^d}\, G_n^*(f_x). \tag{49}$$


Thus, if we sample from $\hat{P}_n$ and consider estimating $\hat{p}_n$ by $\hat{p}_n^*$, we are following exactly the same procedure as estimating $p_h$ by $\hat{p}_n$. Therefore, Lemma 2 and Theorem 3 hold for approximating $\mathrm{Haus}(\hat{D}_n^*, \hat{D}_n)$ by the maximum of a Gaussian process. The difference is that the Gaussian process is defined on
$$\mathcal{F}_n = \left\{ f_x(y) \equiv \frac{1}{\sqrt{n h^d}\, \|\nabla \hat{p}_n(x)\|} K\left(\frac{x - y}{h}\right) : x \in \hat{D}_n \right\} \tag{50}$$
since the "parameter (level set)" being estimated is $\hat{D}_n$ (the estimator is $\hat{D}_n^*$). Note that $\mathcal{F}_n$ is very similar to $\mathcal{F}$ except that the denominator is slightly different and the support $\hat{D}_n$ differs from $D_h$. That is, we have
$$\sup_t \left|P\left(\sqrt{n h^d}\, \mathrm{Haus}(\hat{D}_n^*, \hat{D}_n) < t \,\Big|\, X_1, \cdots, X_n\right) - P\left(\sup_{f \in \mathcal{F}_n} |B_n(f)| < t \,\Big|\, X_1, \cdots, X_n\right)\right| \leq O\left(\left(\frac{\log^4 n}{n h^d}\right)^{1/8}\right), \tag{51}$$
where $B_n$ is a Gaussian process on $\mathcal{F}_n$ such that for any $f_1, f_2 \in \mathcal{F}_n$,
$$E(B_n(f_1) | X_1, \cdots, X_n) = 0, \qquad \mathrm{Cov}(B_n(f_1), B_n(f_2) | X_1, \cdots, X_n) = \frac{1}{n} \sum_{i=1}^{n} f_1(X_i) f_2(X_i). \tag{52}$$

Step 2: Support approximation. In this step, we will show that
$$\sup_{f \in \mathcal{F}_n} |B_n(f)| \approx \sup_{f \in \mathcal{F}} |B_n(f)| \approx \sup_{f \in \mathcal{F}} |B(f)|. \tag{53}$$
The first approximation can be obtained by using the Gaussian comparison lemma (Theorem 2 in Chernozhukov et al. (2014a); also see Lemma 17 in Chen et al. (2014b)). The argument is the same as Step 3 in the proof of Theorem 8 in Chen et al. (2014b), so we omit the details. Essentially, given any ε > 0, we can construct a pair of balanced ε-nets for both $\mathcal{F}$ and $\mathcal{F}_n$, denoted as $\{g_1, \cdots, g_K\}$ and $\{g_1^n, \cdots, g_K^n\}$, so that $\max_j \|g_j - g_j^n\|^*_{\max} = O(\|\hat{p}_n - p_h\|^*_{1,\max})$. Then this ε-net leads to
$$\sup_t \left|P\left(\sup_{f \in \mathcal{F}_n} |B_n(f)| < t \,\Big|\, X_1, \cdots, X_n\right) - P\left(\sup_{f \in \mathcal{F}} |B_n(f)| < t \,\Big|\, X_1, \cdots, X_n\right)\right| \leq O\left(\left(\|\hat{p}_n - p_h\|^*_{1,\max}\right)^{1/6}\right). \tag{54}$$
The difference between $\sup_{f \in \mathcal{F}} |B_n(f)|$ and $\sup_{f \in \mathcal{F}} |B(f)|$ is small since these two Gaussian processes differ only in their covariance, and as $n \to \infty$ the covariance converges at rate $\sqrt{n}$, so we can neglect the difference between them.

Thus, combining (51) and (54) and the argument from the previous paragraph, we conclude
$$\sup_t \left|P\left(\sqrt{n h^d}\, \mathrm{Haus}(\hat{D}_n^*, \hat{D}_n) < t \,\Big|\, X_1, \cdots, X_n\right) - P\left(\sup_{f \in \mathcal{F}} |B(f)| < t\right)\right| \leq O\left(\left(\frac{\log^4 n}{n h^d}\right)^{1/8}\right) + O\left(\left(\|\hat{p}_n - p_h\|^*_{1,\max}\right)^{1/6}\right). \tag{55}$$


Now compare the above result to Theorem 3 and use the fact that the first big-O term dominates the second term (the first is of rate −1/8 in n but the second is of rate −1/6 by Theorem 9); we conclude the result for the first assertion.

For the coverage, let $W_n = \mathrm{Haus}(\hat{D}_n, D_h)$ and $w_{n,1-\alpha} = F^{-1}_{W_n}(1 - \alpha)$. Since $D_h \subset \hat{D}_n \oplus \mathrm{Haus}(\hat{D}_n, D_h)$, we have
$$P(D_h \subset \hat{D}_n \oplus w_{n,1-\alpha}) = 1 - \alpha. \tag{56}$$
Now by the first assertion, $w_{n,1-\alpha}$ and the bootstrap estimate $w^*_{n,1-\alpha}$ differ at rate $O\left(\left(\frac{\log^4 n}{n h^d}\right)^{1/8}\right)$, which completes the proof.

PROOF FOR THEOREM 6. We first prove the case for $T_{in,n}$. All we need to show is that

(C) asymptotically, if there exists $x \in V_h = \{x : p_h(x) \leq \lambda\}$ with $T_{in,n}(x) = 1$, then $\hat{S}_{n,1-\alpha}$ does not contain the true $D_h$.

We prove this by contradiction. Assume that there exists some $x \in V_h = \{x : p_h(x) \leq \lambda\}$ with $T_{in,n}(x) = 1$ and that $\hat{S}_{n,1-\alpha}$ contains the true $D_h$. By the definition of $T_{in,n}$, we have
$$\hat{p}_n(x) > \lambda, \qquad x \notin \hat{S}_{n,1-\alpha}. \tag{57}$$
We assume $p_h(x) < \lambda$; otherwise we have already met a contradiction ($p_h(x) = \lambda$ implies $x \in D_h$ but $\hat{S}_{n,1-\alpha}$ does not contain x).

Since in both method 1 and method 2 the size of the confidence set $\hat{S}_{n,1-\alpha}$ is shrinking, we further assume (C1) $\hat{S}_{n,1-\alpha}$ does not contain any local mode of either $\hat{p}_n$ or $p_h$, and (C2) $\|\hat{p}_n - p_h\|_{\max} < |p_h(m) - \lambda|$ for all local modes m of $p_h$. Assumption (G) implies that λ cannot be the density level of any critical point, so these two assumptions are asymptotically correct. In fact, (C1) and (C2) hold whenever $\|\hat{p}_n - p_h\|_{1,\max}$ is sufficiently small. By Theorem 9, this holds with probability greater than $1 - 2 e^{-n h^{d+2} \tilde{A}_1}$ for some constant $\tilde{A}_1$, so we can always assume this.

Let $\mathcal{C}(x)$ be a connected set containing x such that for every $y \in \mathcal{C}(x)$, $\hat{p}_n(y) > \lambda$ and $y \notin \hat{S}_{n,1-\alpha}$. Since $\mathcal{C}(x)$ does not intersect $\hat{S}_{n,1-\alpha}$ and it contains at least one point x with density $\hat{p}_n > \lambda$, by (C1) and (C2) $\mathcal{C}(x)$ must contain some local modes of $\hat{p}_n$ and $p_h$ whose densities are above λ. Since $\mathcal{C}(x)$ is connected, there exists a path π(t) connecting x and a local mode m of $p_h$ such that x, m and the whole path are all within $\mathcal{C}(x)$. Namely, π(0) = x, π(∞) = m and $\pi(t) \in \mathcal{C}(x)$ for all t ≥ 0. However, the density along this path changes continuously and
$$p_h(\pi(0)) = p_h(x) < \lambda, \qquad p_h(\pi(\infty)) = p_h(m) > \lambda \tag{58}$$
so that there exists some point $\pi(t_0)$ such that $p_h(\pi(t_0)) = \lambda$. Since $\pi(t) \in \mathcal{C}(x)$, this implies that $\mathcal{C}(x)$ contains some points on the level set $D_h$, but $\mathcal{C}(x)$ does not intersect $\hat{S}_{n,1-\alpha}$. This contradicts the assumption that $\hat{S}_{n,1-\alpha}$ contains $D_h$. Thus, we have proved assertion (C).

Now the coverage of $\hat{S}_{n,1-\alpha}$ is $1 - \alpha + O(r_n)$. By assertion (C), we have
$$P\left(T_{in,n}(x) = 1 \text{ for some } x \in V_h\right) \leq 1 - P(D_h \subset \hat{S}_{n,1-\alpha}) = \alpha + O(r_n),$$
which is what we desire. For the case of $T_{out,n}$, the proof is essentially the same, so we omit it.

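PROOF FOR THEOREM 7. We first observe an important fact: the three sets
$$\{x \in K : T_{in,n}(x) = 1\}, \qquad \{x \in K : T_{out,n}(x) = 1\}, \qquad \hat{S}_{n,1-\alpha}$$
form a partition of K. Thus,
$$\begin{aligned}
\hat{R}_{low,n,1-\alpha} &= \{x \in K : T_{in,n}(x) = 0\} = \{x \in K : T_{out,n}(x) = 1\} \cup \hat{S}_{n,1-\alpha}, \\
\hat{R}_{upp,n,1-\alpha} &= \{x \in K : T_{out,n}(x) = 0\} = \{x \in K : T_{in,n}(x) = 1\} \cup \hat{S}_{n,1-\alpha}.
\end{aligned} \tag{59}$$
Now observe that the following events are equivalent to each other:
$$\begin{aligned}
\{\exists x \in V_h : T_{in,n}(x) = 1\}^C &\equiv \{\forall x \in V_h : T_{in,n}(x) = 0\} \\
&\equiv \{\forall x \in V_h : T_{out,n}(x) = 1 \vee x \in \hat{S}_{n,1-\alpha}\} \\
&\equiv \{V_h \subset \hat{R}_{low,n,1-\alpha}\} \qquad \text{(by (59))}.
\end{aligned} \tag{60}$$
Thus, by Theorem 6,
$$\begin{aligned}
P\left(V_h \subset \hat{R}_{low,n,1-\alpha}\right) &= P\left(\{\exists x \in V_h : T_{in,n}(x) = 1\}^C\right) \\
&= 1 - P\left(T_{in,n}(x) = 1 \text{ for some } x \in V_h\right) \\
&= 1 - \alpha + O(r_n).
\end{aligned} \tag{61}$$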

Similarly, we can get the same result for $P\left(L_h \subset \hat{R}_{upp,n,1-\alpha}\right)$, which completes the proof.

References.

R. J. Adler and J. E. Taylor. Random fields and geometry, volume 115. Springer Science & Business Media, 2009.
S. Balakrishnan, S. Narayanan, A. Rinaldo, A. Singh, and L. Wasserman. Cluster trees on manifolds. In Advances in Neural Information Processing Systems, pages 2679–2687, 2013.
Y. Benjamini and Y. Hochberg. Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal Statistical Society. Series B (Methodological), pages 289–300, 1995.
M. M. Breunig, H.-P. Kriegel, R. T. Ng, and J. Sander. Lof: identifying density-based local outliers. In ACM sigmod record, volume 29, pages 93–104. ACM, 2000.
B. Cadre. Kernel estimation of density level sets. Journal of multivariate analysis, 2006.
J. E. Chacón. Clusters and water flows: a novel approach to modal clustering through morse theory. arXiv preprint arXiv:1212.1384, 2012.
V. Chandola, A. Banerjee, and V. Kumar. Anomaly detection: A survey. ACM Computing Surveys (CSUR), 41(3):15, 2009.
F. Chazal and A. Lieutier. The lambda-medial axis. Graphical Models, 2005.
F. Chazal, A. Lieutier, and J. Rossignac. Normal-map between normal-compatible manifolds. International Journal of Computational Geometry and Applications, 2007.
Y.-C. Chen, C. R. Genovese, R. J. Tibshirani, and L. Wasserman. Nonparametric modal regression. arXiv preprint arXiv:1412.1716, 2014a.
Y.-C. Chen, C. R. Genovese, and L. Wasserman. Asymptotic theory for density ridges. arXiv preprint arXiv:1406.5663 (To appear in the Annals of Statistics), 2014b.
Y.-C. Chen, C. R. Genovese, and L. Wasserman. Enhanced mode clustering. arXiv preprint arXiv:1406.1780, 2014c.


Y. Cheng. Mean shift, mode seeking, and clustering. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 17(8):790–799, 1995.
V. Chernozhukov, E. Kocatulum, and K. Menzel. Inference on sets in finance. arXiv preprint arXiv:1211.4282, 2012.
V. Chernozhukov, D. Chetverikov, and K. Kato. Comparison and anti-concentration bounds for maxima of gaussian random vectors. Probability Theory and Related Fields, pages 1–24, 2014a.
V. Chernozhukov, D. Chetverikov, K. Kato, et al. Anti-concentration and honest, adaptive confidence bands. The Annals of Statistics, 42(5):1787–1818, 2014b.
V. Chernozhukov, D. Chetverikov, K. Kato, et al. Gaussian approximation of suprema of empirical processes. The Annals of Statistics, 42(4):1564–1597, 2014c.
V. Chernozhukov, D. Chetverikov, and K. Kato. Central limit theorems and bootstrap in high dimensions. arXiv preprint arXiv:1412.3661, 2014.
P. Chhabra, C. Scott, E. D. Kolaczyk, and M. Crovella. Distributed spatial anomaly detection. In INFOCOM 2008. The 27th Conference on Computer Communications. IEEE. IEEE, 2008.
D. Comaniciu and P. Meer. Mean shift: A robust approach toward feature space analysis. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 24(5):603–619, 2002.
A. Cuevas. Set estimation: Another bridge between statistics and geometry. Boletín de Estadística e Investigación Operativa, 2009.
A. Cuevas and A. Rodríguez-Casal. On boundary estimation. Advances in Applied Probability, pages 340–354, 2004.
A. Cuevas, W. Gonzalez-Manteiga, and A. Rodriguez-Casal. Plug-in estimation of general level sets. Aust. N. Z. J. Stat., 2006.
M. Desforges, P. Jacob, and J. Cooper. Applications of probability density estimation to the detection of abnormal conditions in engineering. Proceedings of the Institution of Mechanical Engineers, Part C: Journal of Mechanical Engineering Science, 212(8):687–703, 1998.
T. Duong. Local significant differences from nonparametric two-sample tests. Journal of Nonparametric Statistics, 25(3):635–645, 2013.
T. Duong, I. Koch, and M. Wand. Highest density difference region estimation with application to flow cytometric data. Biometrical Journal, 51(3):504–521, 2009.
U. Einmahl and D. M. Mason. Uniform in bandwidth consistency for kernel-type function estimators. The Annals of Statistics, 2005.
H. Federer. Curvature measures. Trans. Am. Math. Soc, 93, 1959.
K. Fukunaga and L. Hostetler. The estimation of the gradient of a density function, with applications in pattern recognition. Information Theory, IEEE Transactions on, 21(1):32–40, 1975.
C. R. Genovese, M. Perone-Pacifico, I. Verdinelli, L. Wasserman, et al. Nonparametric ridge estimation. The Annals of Statistics, 42(4):1511–1545, 2014.
E. Gine and A. Guillou. Rates of strong uniform consistency for multivariate kernel density estimators. In Annales de l'Institut Henri Poincare (B) Probability and Statistics, 2002.
J. A. Hartigan. Clustering algorithms. Wiley series in probability and mathematical statistics. 25 cm. 351 p., 1975.
Z. He, X. Xu, and S. Deng. Discovering cluster-based local outliers. Pattern Recognition Letters, 24(9):1641–1650, 2003.
V. J. Hodge and J. Austin. A survey of outlier detection methodologies. Artificial Intelligence Review, 22(2):85–126, 2004.
H. Jankowski and L. Stanberry. Confidence regions in level set estimation. Preprint, 2012.
Z. Jiang, W. Luosheng, F. Yong, and Y. C. Xiao. Intrusion detection based on density level sets estimation. In Networking, Architecture, and Storage, 2008. NAS'08. International Conference on, pages 173–174. IEEE, 2008.
B. P. Kent, A. Rinaldo, and T. Verstynen. Debacl: A python package for interactive density-based clustering. arXiv preprint arXiv:1307.8136, 2013.
J. Klemelä. Visualization of multivariate density estimates with level set trees. Journal of Computational and Graphical Statistics, 13(3), 2004.
J. Klemelä. Visualization of multivariate density estimates with shape trees. Journal of Computational and Graphical Statistics, 15(2):372–397, 2006.
J. Klemelä. Smoothing of multivariate data: density estimation and visualization, volume 737. John Wiley & Sons, 2009.
M. Kloft, S. Nakajima, and U. Brefeld. Feature selection for density level-sets. In Machine Learning and Knowledge Discovery in Databases, pages 692–704. Springer, 2009.
T. Laloe and R. Servien. Nonparametric estimation of regression level sets. Journal of the Korean Statistical Society, 2013.
J. Li, S. Ray, and B. G. Lindsay. A nonparametric statistical approach to clustering via mode identification. Journal of Machine Learning Research, 8(8):1687–1723, 2007.
E. Mammen and W. Polonik. Confidence regions for level sets. Journal of Multivariate Analysis, 122:202–214, 2013.
E. Mammen, A. B. Tsybakov, et al. Smooth discrimination analysis. The Annals of Statistics, 27(6):1808–1829, 1999.
D. M. Mason, W. Polonik, et al. Asymptotic normality of plug-in level set estimates. The Annals of Applied Probability, 19(3):1108–1142, 2009.
I. Molchanov. Theory of random sets. Springer-Verlag London Ltd., 2005.
I. S. Molchanov. Empirical estimation of distribution quantiles of random closed sets. Theory of Probability and Its Applications, 1990.
I. S. Molchanov. A limit theorem for solutions of inequalities. Board of the Foundation of the Scandinavian Journal of Statistics, 1998.
M. H. Neumann. Strong approximation of density estimators from weakly dependent observations by density estimators from independent observations. Annals of Statistics, pages 2014–2048, 1998.
P. Niyogi, S. Smale, and S. Weinberger. Finding the homology of submanifolds with high confidence from random samples. Discrete Computational Geometry, 2008.
B. Pateiro-López. Set estimation under convexity type restrictions. PhD thesis, Universidade de Santiago de Compostela, 2008.
W. Polonik. Measuring mass concentrations and estimating density contour clusters - an excess mass approach. The Annals of Statistics, 1995.
A. Rinaldo and L. Wasserman. Generalized density clustering. The Annals of Statistics, 2010.
A. Rinaldo, A. Singh, R. Nugent, and L. Wasserman. Stability of density-based clustering. arXiv:1011.2771v1, 2010.
A. Singh, C. Scott, R. Nowak, et al. Adaptive hausdorff estimation of density level sets. The Annals of Statistics, 37(5B):2760–2782, 2009.
I. Steinwart. Adaptive density level set clustering. In COLT, pages 703–738, 2011.
W. Stuetzle. Estimating the cluster tree of a density by analyzing the minimal spanning tree of a sample. Journal of classification, 20(1):025–047, 2003.
W. Stuetzle and R. Nugent. A generalized single linkage method for estimating the cluster tree of a density. Journal of Computational and Graphical Statistics, 19(2), 2010.
A. B. Tsybakov. On nonparametric estimation of density level sets. The Annals of Statistics, 1997.
G. Walther. Granulometric smoothing. The Annals of Statistics, 1997.
L. Wasserman. All of Nonparametric Statistics. Springer-Verlag New York, Inc., 2006.

DEPARTMENT OF STATISTICS
CARNEGIE MELLON UNIVERSITY
5000 FORBES AVE.
PITTSBURGH, PA 15213
E-MAIL: [email protected]