Computational geometry and statistical depth measures
Diane L. Souvaine
Computer Science Department, Tufts University
www.cs.tufts.edu/research/geometry
Joint work with Eynat Rafalin
DLS/EKR 7/15/03
Outline of talk
Computational geometry for statistical applications
– Background
– Examples of collaboration between the two communities
– A basic technique: the duality transform
Applications:
– Least Median of Squares (LMS) regression in optimal time
– Halfspace depth contours in optimal time
– Simplicial depth
– A computational tool for depth-based statistical analysis
Directions for future collaboration
Computational Geometry
Deals with problems that require geometric algorithms for their solutions.
The systematic study of algorithms and data structures for geometric objects, with a focus on exact algorithms that are asymptotically fast.
The initial emphasis is on exact algorithms: only once these have been obtained and refined, and are still too slow, does attention move to approximation algorithms.
Multivariate analysis by data depth
Data depth: a way of measuring how deep a given point x in R^d is relative to a probability distribution F or relative to a given data cloud.
Examples:
– Halfspace (location) depth (Hodges 55, Tukey 75)
– Simplicial depth (Liu 90)
– Convex hull peeling depth (Barnett 76, Eddy 82)
– Regression depth (Rousseeuw & Hubert 99)
– Mahalanobis depth (Mahalanobis 36)
– Oja depth (Oja 83)
The concept provides a center-outward ordering of points.
Non-parametric, multivariate statistics.
There are many important and challenging problems at the interface of geometry and statistics.
The continuous and finite sample case
Most depth functions are defined with respect to a probability distribution F, considering {X1, ..., Xn} random observations from F.
The finite sample version of the depth function is obtained by replacing F by Fn, the empirical distribution of the sample {X1, ..., Xn}.
In general, computational geometers study the finite sample case!
Examples of collaboration between the two communities
History
– Shamos, Geometry and statistics: problems at the interface, 1976
– Bentley & Shamos, A problem in multivariate statistics: algorithm, data structure and applications, 1977
Least Median of Squares (LMS) Regression
– The LMS line can be computed in 2D in O(n^2) time [Edelsbrunner, Souvaine 90]. Earlier result: [Souvaine, Steele 87]
– Practical approximation algorithm [Mount, Netanyahu, Romanik, Silverman, Yu]
Collaboration – halfspace depth
The depth of a single point can be computed in O(n log n) time [Rousseeuw & Ruts 1996]. The lower bound is Ω(n log n) [Aloupis, Cortes, Gomez, Soss, Toussaint 02]
Computing the 2D Tukey median can be done in O(n log^5 n) time [Matousek 1991]; this was improved to O(n log^4 n) [Langerman, Steiger 00]
Computing all 2D depth contours can be done in O(n^2) time using duality & topological sweep [Miller, Ramaswami, Rousseeuw, Sellares, Souvaine, Streinu, Struyf, 01]
Another approach for computing depth contours uses parallel arrangement construction [Fukuda & Rosta, 02]
Halfspace depth contours can be computed for display in 2D using hardware-assisted computation [Krishnan, Mustafa, ]
More collaboration – halfspace depth & center points
Center points are points of depth >= n/(d+1)
– Computing a center point in 2D
  • Matousek, 91
  • In linear time: Jadhav & Mukhopadhyay, 94
– In 3D [Naor & Sharir 90]
– Center points in high dimensions [Amenta, Bern, Eppstein & Teng 00]
– Approximation [Clarkson, Eppstein, Miller, Sturtivant & Teng 96]
More collaboration – regression depth
The regression depth of a hyperplane relative to a point set is the minimum number of points crossed by the hyperplane in any continuous motion to the vertical [Rousseeuw & Hubert 99]
It was generalized to multivariate regression depth [Bern & Eppstein 01, 02]
The deepest regression line in 2D can be found in O(n log^2 n) time [van Kreveld, Mitchell, Rousseeuw, Sharir, Snoeyink, Speckman, 99].
A theoretical algorithm uses parametric search and achieves O(n log n) [Langerman, Steiger, 00]
Approximation of the deepest flat can be done using parametric sampling [Steiger & Wenger 98], and a (1 + ε) approximation is also achievable [Bern & Eppstein 02]
More collaboration – simplicial depth
The simplicial depth of a point p relative to a set S is the fraction of the closed simplices given by d+1 of the data points that contain p [Liu 90].
The depth of a single point can be computed in 2D in O(n log n) time [Gil, Steiger, Wigderson 92], [Khuller, Mitchell 90], [Rousseeuw, Ruts 96].
This matches the lower bound [Khuller, Mitchell 90], [Aloupis, Cortes, Gomez, Soss, Toussaint 02]
In 3D the depth of a point can be computed in O(n^2) time [Gil, Steiger, Wigderson 92], [Cheng & Ouyang 01]
The depth of all n points can be computed in 2D in O(n^2) time [Gil, Steiger, Wigderson 92], [Khuller, Mitchell 90]
The simplicial median in 2D can be computed in O(n^4) time [Aloupis, Langerman, Toussaint 01]
Least Median of Squares Regression
Ordinary least sum of squares: low breakdown point
Least median of squares: high breakdown point
Given a set of points, find a line such that the median of the squares of the residuals is minimized.
Equivalently, find two parallel lines at minimum vertical distance from each other with half of the data points contained in the slab they define.
An O(n^2 log n) time algorithm for computing the LMS line in R^2 [Souvaine, Steele 87]
An O(n^2) algorithm using duality and topological sweep [Edelsbrunner, Souvaine 90]
Points and lines
It is hard to find an order in a set of points. An arrangement of lines is easier.
A set of points can be transformed into an arrangement of lines, preserving important properties, using duality:
a point (a,b) -> (under T) a line y = ax+b
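Duality
[Figure: primal plane with points (1,2), (2,1), (3,0), (4,-1) and lines l: y = -x+3, m: y = -2x+2; dual plane with lines TA: y = x+2, TB: y = 2x+1, TC: y = 3x, TD: y = 4x-1 and points (1,3), (2,2)]
Primal -> Dual (under T)
a point (a,b) -> a line y = ax+b
a line y = cx+d -> ?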
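Duality
[Figure: primal plane with points (1,2), (2,1), (3,0), (4,-1) and lines l: y = -x+3, m: y = -2x+2; dual plane with lines TA: y = x+2, TB: y = 2x+1, TC: y = 3x, TD: y = 4x-1 and points (1,3), (2,2)]
Primal -> Dual (under T)
a point (a,b) -> a line y = ax+b
a line y = cx+d -> a point (-c, d)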
Duality
A transformation which maps points in the plane to lines and vice versa.
Primal -> Dual
a point (a,b) -> a line y = ax+b
a line y = cx+d -> a point (-c, d)
The transformation preserves slope, vertical distance, and the above/below relationship.
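The mapping on this slide can be checked numerically. The following is a minimal sketch, not code from the talk: the function names are hypothetical, lines are represented as (slope, intercept) pairs, and the sample values are the ones shown in the duality figure. It illustrates that the signed vertical distance between a point and a line carries over to the dual (with the roles of point and line exchanged), which is what preserves incidence and the above/below relationship.

```python
def dual_of_point(a, b):
    """Point (a, b) -> line y = a*x + b, returned as (slope, intercept)."""
    return (a, b)

def dual_of_line(c, d):
    """Line y = c*x + d -> point (-c, d)."""
    return (-c, d)

def height_above(point, line):
    """Signed vertical distance of a point above a line (0 means incident)."""
    x, y = point
    slope, intercept = line
    return y - (slope * x + intercept)

# Values read off the duality figure: four points on l: y = -x + 3.
points = [(1, 2), (2, 1), (3, 0), (4, -1)]
l = (-1, 3)
m = (-2, 2)

for p in points:
    for line in (l, m):
        primal = height_above(p, line)
        # In the dual the point becomes a line and the line becomes a point,
        # so the same quantity appears with the opposite sign.
        dual = -height_above(dual_of_line(*line), dual_of_point(*p))
        assert primal == dual
```

Because all four points lie on l, their dual lines all pass through the dual point of l, which is (1, 3).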
LMS
[Figure: LMS in the primal (labels A, B, C, x, y, z, l) and in the dual (labels TA, TB, TC, Tx, Ty, Tl)]
LMS
[Figure: LMS in the primal (labels A, B, C, x, y, z, l) and in the dual (labels TA, TB, TC, Tx, Ty, Tl)]
Provable characteristics of LMS: the LMS line bisects a slab bounded by 2 parallel lines, one of which goes through 2 data points and the other goes through 1 data point.
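The slab characterization suggests a simple brute-force search, sketched below. This is not the O(n^2 log n) or O(n^2) algorithm cited in the talk: it tries the slope of every pair of data points (since one bounding line of the optimal slab passes through two of them) and, for each slope, finds the narrowest vertical window of intercepts covering h points, for roughly O(n^3 log n) time. The function name is hypothetical, and the coverage h = floor(n/2) + 1 is an assumption, since "half of the data points" can be formalized in more than one way. Integer coordinates are assumed so that the arithmetic stays exact.

```python
from fractions import Fraction
from itertools import combinations

def lms_line_bruteforce(points, h=None):
    """Return (slab height, slope, intercept) of the narrowest slab found."""
    n = len(points)
    if h is None:
        h = n // 2 + 1  # assumed meaning of "half of the data points"
    best = None
    for (x1, y1), (x2, y2) in combinations(points, 2):
        if x1 == x2:
            continue  # vertical direction is not a candidate slope
        slope = Fraction(y2 - y1, x2 - x1)
        # Intercept of the line with this slope through each data point.
        intercepts = sorted(y - slope * x for x, y in points)
        for k in range(n - h + 1):
            lo, hi = intercepts[k], intercepts[k + h - 1]
            if best is None or hi - lo < best[0]:
                # The LMS line bisects the slab.
                best = (hi - lo, slope, (lo + hi) / 2)
    return best

# Five points on y = x plus two outliers: the fit ignores the outliers.
data = [(0, 0), (1, 1), (2, 2), (3, 3), (4, 4), (5, 0), (6, 20)]
result = lms_line_bruteforce(data)
```

On this sample the result is a slab of height 0 around the line y = x, which illustrates the high breakdown point mentioned earlier: the two outliers do not move the fit.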
Sweeping an arrangement of lines
Vertical line sweep
– Report all intersection pairs
– sorted in order of x coordinate
– O(n^2 log n) time and O(n) space
Topological line sweep
– Report all intersection pairs
– according to a partial order related to the levels of the arrangement
– O(n^2) time and O(n) space
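A vertical line sweep over an arrangement can be sketched as follows. This is an illustrative simplification under stated assumptions, not the implementation behind the cited results: lines are (slope, intercept) pairs with integer values, all slopes are distinct, and no three lines meet in a point. The function name is hypothetical. The sweep keeps the top-to-bottom order of the lines and a heap of crossings between neighbours; because this version leaves stale heap entries in place instead of deleting them, it shows the O(n^2 log n) time behaviour but does not strictly meet the O(n) space bound.

```python
import heapq
from fractions import Fraction

def sweep_intersections(lines):
    """Report crossings as (x, upper_index, lower_index), sorted by x."""
    n = len(lines)
    # Far to the left, the line with the smallest slope is on top.
    order = sorted(range(n), key=lambda i: lines[i][0])
    pos = {line: k for k, line in enumerate(order)}
    heap = []

    def schedule(k):
        """Queue the crossing of the neighbours at positions k and k+1."""
        if 0 <= k < n - 1:
            u, l = order[k], order[k + 1]
            (mu, bu), (ml, bl) = lines[u], lines[l]
            if ml > mu:  # the lower line rises faster, so they still cross
                heapq.heappush(heap, (Fraction(bl - bu, mu - ml), u, l))

    for k in range(n - 1):
        schedule(k)
    events = []
    while heap:
        x, u, l = heapq.heappop(heap)
        k = pos[u]
        if k + 1 >= n or order[k + 1] != l:
            continue  # duplicate entry for a crossing already handled
        order[k], order[k + 1] = l, u  # the two lines swap places
        pos[l], pos[u] = k, k + 1
        events.append((x, u, l))
        schedule(k - 1)
        schedule(k + 1)
    return events

sample_lines = [(0, 0), (1, 0), (-1, 2), (2, -5)]
crossings = sweep_intersections(sample_lines)
```

The topological sweep avoids the heap entirely by processing crossings in any order consistent with the arrangement, which is where the log factor is saved.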
Halfspace depth
The depth of a point p:
– The minimum number of points of the data set S lying in any closed halfspace determined by a line through p
Question: how can the halfspace depth contours be computed efficiently? (naive cost per point: O(n^3))
The depth of a point p is the minimum number of points of a given set S lying in any closed halfplane bounded by a line through p
[Figure: data points A, B, C, D, E, F, G and a query point p]
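The definition can be evaluated directly in 2D. The sketch below is a simple O(n^2) brute force, not one of the O(n log n) algorithms cited earlier; the function name and the sample coordinates are hypothetical. It relies on the observation that the count of points in a closed halfplane only changes when the boundary line passes through a data point, so it is enough to look at each line through p and a data point q, rotated infinitesimally in either direction. Integer coordinates keep the orientation tests exact.

```python
def halfspace_depth(p, points):
    """Minimum number of points in any closed halfplane bounded by a line through p."""
    coincident = sum(1 for r in points if r == p)
    best = len(points)
    for q in points:
        if q == p:
            continue
        dx, dy = q[0] - p[0], q[1] - p[1]
        left = right = ahead = behind = 0
        for r in points:
            vx, vy = r[0] - p[0], r[1] - p[1]
            cross = dx * vy - dy * vx
            if cross > 0:
                left += 1
            elif cross < 0:
                right += 1
            elif (vx, vy) != (0, 0):
                # r lies on the line through p and q, on one of its two rays.
                if dx * vx + dy * vy > 0:
                    ahead += 1
                else:
                    behind += 1
        # A tiny rotation pushes one ray to each side, so a closed halfplane
        # picks up one open side, one ray, and every copy of p itself.
        best = min(best, coincident + min(left, right) + min(ahead, behind))
    return best

# Hypothetical data: four corners of a square and one interior point.
sample = [(0, 0), (8, 0), (8, 8), (0, 8), (2, 4)]
```

For this sample, a corner of the square has depth 1, the interior data point (2, 4) has depth 2, and a point outside the convex hull such as (-1, -1) has depth 0.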
The depth of a point p: the minimum number of points of S lying in any closed halfspace determined by a line through p
A line through p -> a point P' in the dual
k points in the halfplane above the line passing through p -> k lines above the point P'
To count how many lines lie above a point on another line -> look at the level
Levels in an arrangement
[Figure: data points A, B, C, D, E, F, G and a point p in the primal; the dual lines TA, TB, TC, TD, TE, TF, TG]
The depth of a point p: the minimum number of points of S lying in any closed halfspace determined by a line through p
A line through p -> a point P' in the dual
k points in the halfplane above the line passing through p -> k lines above the point P'
To count how many lines lie above a point on another line -> look at the level
The minimum number of points lying in any closed halfspace determined by a line through p -> the minimum level over the points P' on the dual line of p
Half-space depth computation
The depth of a point in the primal is the minimum level of its dual line.
Computing the k-th half-space depth contour = finding the k-th level in the dual.
This can be done in O(n^2) time using an algorithm called topological sweep [Miller, Ramaswami, Rousseeuw, Sellares, Souvaine, Streinu, Struyf, 01]
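The statement that depth equals the minimum level along the dual line can be illustrated with a small sketch. It is not the topological sweep algorithm cited on the slide: it simply walks along the dual line of p, samples one position in every interval between consecutive crossings with the other dual lines, and counts the lines above and below. The function name and the sample data are hypothetical, and integer coordinates are assumed so that the crossings are exact fractions.

```python
from fractions import Fraction

def depth_via_dual(p, points):
    """Halfspace depth of p as the minimum level along its dual line."""
    a, b = p
    # Dual line of (ai, bi) minus dual line of p, at abscissa t:
    # (ai - a) * t + (bi - b). Positive means the data line is above.
    diffs = [(ai - a, bi - b) for ai, bi in points]
    breaks = sorted({Fraction(-db, da) for da, db in diffs if da != 0})
    if breaks:
        samples = [breaks[0] - 1, breaks[-1] + 1]
        samples += [(s + t) / 2 for s, t in zip(breaks, breaks[1:])]
    else:
        samples = [Fraction(0)]  # no crossings: one interval only
    best = len(points)
    for t in samples:
        above = sum(1 for da, db in diffs if da * t + db > 0)
        below = sum(1 for da, db in diffs if da * t + db < 0)
        on = len(points) - above - below  # copies of p itself
        best = min(best, min(above, below) + on)
    return best

sample = [(0, 0), (8, 0), (8, 8), (0, 8), (2, 4)]
```

Each sampled abscissa t corresponds to one primal line through p with slope -t, so the loop is a dual-side restatement of trying every direction through p.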
[Figure: data points A, B, C, D, E, F, G in the primal and their dual lines TA, TB, TC, TD, TE, TF, TG]
[Figure: data points A, B, C, D, E, F, G with the depth 1 and depth 2 contours marked, and their dual lines TA, TB, TC, TD, TE, TF, TG]
All the halfspace depth contours in R^2 can be computed in O(n^2) time using topological sweep [Miller, Ramaswami, Rousseeuw, Sellares, Souvaine, Streinu, Struyf, 01]
[Figure: data points A, B, C, D, E, F in the primal and the dual lines TA, TB, TC, TD, TE, TF, TG]
Duality in 3D
Primal -> Dual (under T)
a point (a,b,c) -> a plane z = ax+by+c
Halfspace depth in R^d
The depth of a point p is the minimum number of points of a given set S lying in any closed halfspace bounded by a hyperplane through p
Simplicial depth [Liu 90]
The simplicial depth of a point x w.r.t. a data set S in R^d is the fraction of the closed simplices defined by points in S that contain x [Liu 90]
[Figure: triangle A, B, C with query points x1, x2, x3]
SD(x1) = 1, SD(x2) = 1, SD(x3) = 0
Dimension -> Simplex
R^1 -> line segment
R^2 -> triangle
R^3 -> tetrahedron
Simplicial depth [Liu 90]
[Figure: data points A, B, C, D, E; each open region is marked with its depth (.3, .4 or .5)]
Total number of simplices = $\binom{5}{3}$ = 10
Depth of open regions: marked
Depth of each of A, B, C, D = $\binom{4}{2}$/10 = .6
Depth of E = .8
Depth of a position on AE = .5
Simplicial depth for the finite sample case
The simplicial depth of a point x w.r.t. a data set S in R^d is the fraction of the closed simplices defined by points in S that contain x [Liu 90]
Problems:
– Fails to satisfy [Zuo & Serfling 00]
  • Maximality: the depth function should attain its maximum value at the center
  • Monotonicity: as a point x moves away from the `deepest point' along any fixed ray through the center, the depth at x should decrease monotonically
– The depth of positions on facets causes discontinuities in the depth function
Tweaked simplicial depth [Burr, Rafalin, Souvaine 03]
Averaging the number of closed and open simplices containing x
[Figure: data points A, B, C, D, E; each open region is marked with its depth (.3, .4 or .5)]
Total number of simplices = $\binom{5}{3}$ = 10
Depth of open regions: marked (unchanged)
Depth of each of A, B, C, D: was $\binom{4}{2}$/10 = .6, now .3
Depth of E: was .8, now .5
Depth of a position on AE: was .5, now .35
Revised definition [Burr, Rafalin, Souvaine 03]
Given a data set S = {X1, ..., Xn} in R^d, the simplicial depth of a point x is the average of the fraction of closed simplices containing x and the fraction of open simplices containing x
Equivalently, $SD_{BRS}(S; x) = \rho(S,x) + \frac{1}{2}\sigma(S,x)$, taken as a fraction of the total number of simplices $\binom{n}{d+1}$
ρ(S,x): the number of simplices with data points as vertices which contain x in their open interior
σ(S,x): the number of simplices with data points as vertices which contain x in their boundary
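Both the original and the revised definition can be evaluated by enumerating triangles in 2D. The sketch below is a brute force over all triples, not one of the faster algorithms cited earlier. The function names are hypothetical, and so are the coordinates: they are not taken from the slide, but are one arrangement of five points with E strictly inside the quadrilateral ABCD and off both diagonals. Degenerate (collinear) triples are skipped, which is an assumption of this sketch.

```python
from fractions import Fraction
from itertools import combinations

def orient(a, b, c):
    """Twice the signed area of triangle abc (positive if counterclockwise)."""
    return (b[0] - a[0]) * (c[1] - a[1]) - (b[1] - a[1]) * (c[0] - a[0])

def simplicial_depths(x, points):
    """Return (original closed-simplex depth, revised depth) of x in 2D."""
    rho = sigma = total = 0
    for a, b, c in combinations(points, 3):
        if orient(a, b, c) == 0:
            continue  # assumption: collinear triples are ignored
        total += 1
        signs = [orient(a, b, x), orient(b, c, x), orient(c, a, x)]
        if any(s > 0 for s in signs) and any(s < 0 for s in signs):
            continue  # x is outside this triangle
        if 0 in signs:
            sigma += 1  # x on the boundary (an edge or a vertex)
        else:
            rho += 1    # x in the open interior
    original = Fraction(rho + sigma, total)
    revised = Fraction(2 * rho + sigma, 2 * total)  # (rho + sigma/2) / total
    return original, revised

# Hypothetical coordinates for A, B, C, D and the interior point E.
A, B, C, D, E = (0, 0), (8, 0), (8, 8), (0, 8), (2, 4)
data = [A, B, C, D, E]
depth_A = simplicial_depths(A, data)
depth_E = simplicial_depths(E, data)
depth_on_AE = simplicial_depths((1, 2), data)  # midpoint of segment AE
```

With these coordinates the pairs (original, revised) come out as (3/5, 3/10) for A, (4/5, 1/2) for E, and (1/2, 7/20) for the midpoint of AE, in line with the .6/.3, .8/.5 and .5/.35 values listed on the preceding slide.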
Properties of the revised definition
– Reduces to the original definition for continuous distributions and for points lying in the interior of cells
– Keeps the ranking order of data points
– Corrects the irregularity at boundaries of simplices, making the depth of a point on the boundary between two cells the average of the depth of the two cells
– Invariant under dimension change for R^1, R^2
– Fixes Zuo & Serfling's counterexamples
– Can be calculated using the existing algorithms, with slight modifications
The connection between halfspace depth and simplicial depth
Assume a data set {x1, ..., xn} and a point x with half-space depth h.
To compute the simplicial depth of x [Rousseeuw, Ruts 96]: a data triangle xi, xj, xk excludes x iff there exists an angle smaller than π which contains all three rays from x through xi, xj, xk, i.e. two of the points are in the halfplane defined by the third point and x.
Therefore, the number of data triangles containing x (and from it the simplicial depth of x) can be computed from:
$\binom{n}{3} - \sum_{i=1}^{n} \binom{h_i}{2}$
where h_i is the number of data points in the halfplane defined by x_i and x.
The connection, continued
To bound the number of data points in any halfplane defined by a line through x:
– h is an achievable lower bound (halfspace depth)
– n/2 is an achievable upper bound (ham-sandwich cut theorem)
For every h