Journal of Machine Learning Research 1 (2017) 1-48
Model Selection via VC Dimension

Merlin Mpoudeu                                        [email protected]
Department of Statistics
University of Nebraska-Lincoln
Lincoln, NE 68503, USA

Bertrand Clarke                                       [email protected]
Department of Statistics
University of Nebraska-Lincoln
Lincoln, NE 68503, USA

Editor: Leslie Pack Kaelbling
Abstract

We develop an objective function that can be readily optimized to give an estimator of the Vapnik-Chervonenkis dimension for regression problems. We verify that our estimator is consistent and performs well in simulations. We use our estimator on two datasets, both acknowledged to be difficult, and see that it gives results that are comparable in quality to, if not better than, established techniques such as the Bayes information criterion, two forms of empirical risk minimization, and two sparsity methods.

Keywords: Vapnik-Chervonenkis dimension, model selection, Bayesian information criterion, sparsity methods, empirical risk minimization
1. Complexity and Model Selection

Model selection is often the first problem that must be formulated and addressed when analyzing data. In M-closed problems, see Bernardo (1994), the analyst posits a list of models and assumes one of them is true. In such cases, model selection is any procedure that uses data to identify one of the models on the model list. There is a vast literature on model selection in this context, including information-based methods such as the Akaike Information Criterion (AIC) and the Bayes information criterion (BIC), residual-based methods such as Mallows' Cp or branch and bound, and codelength methods such as the two-stage coding proposed by Barron and Cover (1991) or Wallace and Freeman (1987). We also have computational search methods such as simulated annealing and genetic algorithms. In addition, there are a variety of non-parametric or non-linear model selection methods such as recursive partitioning, deep learning and kernel methods.

A less well developed approach to model selection is via complexity as assessed by the Vapnik-Chervonenkis Dimension (VCD), here denoted by h. Its earliest usage seems to be in Vapnik and Chervonenkis (1968). A translation into English was published as Vapnik and Chervonenkis (1971). The VCD was initially called an index of a collection of sets with respect to a sample and was developed to provide sufficient conditions for the uniform convergence of empirical distributions. It was extended to provide a sense of dimension for function spaces, particularly for functions that represented classifiers. After two decades
of development this formed the foundation for the field of Statistical Learning Theory, extensively treated in Vapnik (1998).

Although the VCD goes back to 1968, it wasn't until Vapnik et al. (1994) that a method for estimating h was proposed in the classification context. Specifically, given a collection C of classifiers, Vapnik et al. (1994) tried to estimate VCD(C) by deriving an objective function based on the expected value of the maximum difference between two empirical losses (EMDBTEL). The two empirical losses came from dividing the dataset into a first and second set. The objective function proposed by Vapnik et al. (1994) depends on h, the sample size n, and several constants they claimed were universal. They also proposed an algorithm to estimate h given a class C of classifiers. This algorithm treated possible sample sizes as design points n_1, n_2, ..., n_L and requires one level of bootstrapping.

Despite the formulation of Vapnik et al. (1994), the objective function was over-complex and the algorithm did not systematically approximate the expected maximum difference between two empirical losses. We comment that this seems to be due to instability arising from the numerical maximization. Vapnik and his collaborators suggested a fix for the instability that they thought would make the expected difference between two empirical losses maximal. We do not use this here. Moreover, it is unclear if this 'fix' will work in classification, let alone regression.

One of the other inputs to the Vapnik et al. (1994) method is the choice of design points. Choosing the design points is a nontrivial source of variability in the estimate of h. So, Shao et al. (2000) proposed an algorithm, based on extensive simulations, to generate optimum values of n_1, n_2, ..., n_L, given L. They found that non-uniform values of the n_l's gave better results than the uniform n_l's proposed by Vapnik et al. (1994). Most recently, in a pioneering paper that deserves more recognition than it has received, McDonald et al. (2011) established the consistency of ĥ for h in the classification context, using the formulation of Vapnik et al. (1994) to obtain the estimator ĥ.

The main reason the Vapnik et al. (1994) estimator for h did not become more widely used despite McDonald et al. (2011) is, we suggest, that it was too unstable and dependent on design points (whose effect was unclear). In addition, as we will see, the universal constants in Vapnik et al. (1994) are not in fact universal, at least for the regression problems we examine here, and the bound on the EMDBTEL seems to be too loose. This means that the Vapnik et al. (1994) estimator ĥ is not as effective as it needs to be for reliably good performance in real settings.

The main contribution of this paper is to extend Vapnik et al. (1994)'s methodology and McDonald et al. (2011)'s consistency theorem so that estimating h becomes effective in real regression problems. Thus, we derive an objective function for estimating h in the regression setting that provides, we think, a tighter bound on a modified form of the EMDBTEL. Aside from changing the form of the EMDBTEL, we do not assume that any constants are universal; we optimize over them. To convert from classification to regression, we discretize the loss used for regression into m intervals (the case m = 1 would then apply to classification), and introduce an extra layer of bootstrapping so the quantity we empirically optimize accurately represents the quantity we derive theoretically.
This extra layer of bootstrapping stabilizes our estimator of h and appears to improve its asymptotic performance in the sense of reducing its dependency on the n_l's.
If the models are nested in order of increasing VCD, it is straightforward to choose the model with VCD closest to our estimate ĥ. Otherwise, we can convert a non-nested problem to the nested case by ordering the inclusion of the covariates using a shrinkage method such as SCAD (see Fan and Li (2001)) or correlation (see Fan and Lv (2008)), and use our ĥ as before. We compare our model selection using VCD to several other methods including Vapnik et al. (1994)'s original method, two forms of empirical risk minimization, BIC, SCAD, and ALASSO (see Zou (2006)). Our general findings indicate that in realistic settings, model selection via estimated VCD, when properly done, is fully competitive with existing methods and, unlike them, rarely if ever gives aberrant results.

The remainder of this manuscript is structured as follows. In Sec. 2 we present the main theory justifying our approach. In Sec. 2.1 we discretize bounded loss functions so that upper bounds for the distinct regions of the expected supremal difference of empirical losses can be derived. In Sec. 2.2 we derive an estimator of the VCD. In Sec. 3 we extend McDonald et al. (2011)'s consistency theorem to our estimator of the VCD from Sec. 2. In Sec. 4 we use our estimator of the VCD on nested synthetic datasets. In this context, we compare our method to BIC and to two versions of empirical risk minimization (ERM) that would be considered standard in computer science. We also discuss the comparative performance of our method relative to Vapnik's original method and show computationally, as a generality, that the dependency on design points decreases as the sample size increases. In Sec. 5 we use our estimator on two real datasets, namely Tour de France and Abalone. In this section our comparisons also include simplifying non-nested model lists by using correlation, SCAD, and ALASSO. In Sec. 6, we discuss the proper interpretation of VCD for model selection in the context of structural risk minimization.
2. Deriving an optimality criterion for estimating VCD

We are going to bound the Expected Maximum Difference Between Two Empirical Losses (EMDBTEL) for the case of regression. In fact, we convert the regression problem into m classification problems by discretizing the empirical loss. We will use this bound to derive an estimator of the VCD for the class of linear functions. In Sec. 2.1, we present our extension of the Vapnik et al. (1994) bounds and in Sec. 2.2, we present an estimator of h.

2.1 Extension of Vapnik-Chervonenkis bounds to regression

Let Z = (x, y) be a pair of observations and write Z^1 = (z_1, z_2, ..., z_n) and Z^2 = (z_{n+1}, z_{n+2}, ..., z_{2n}) for two vectors of n independent and identically distributed (IID) copies of Z. Let

Q_1(z_1, \alpha_1) = L(y, f(x, \alpha_1))  \quad\text{and}\quad  Q_2(z_2, \alpha_2) = L(y, f(x, \alpha_2))

be two bounded real-valued loss functions, where \alpha_i \in \Lambda, i = 1, 2, an index set, and assume \forall \alpha_i, 0 \le Q_i(z_i, \alpha_i) \le B_i for some B_i \in \mathbb{R}, i = 1, 2. Consider the discretization of Q_i using m disjoint intervals (with union [0, B_i)) given by

Q^*_{ij}(z_i, \alpha_i, m) = \begin{cases} \dfrac{(2j+1)B_i}{2m}, & \text{if } Q_i(z, \alpha) \in I_j = \left[\dfrac{jB_i}{m}, \dfrac{(j+1)B_i}{m}\right),\ i = 1, 2; \\ 0, & \text{otherwise,} \end{cases}   (1)
where j = 0, 1, ..., m-1. Now, consider indicator functions for Q_i being in an interval of the same form. That is, let

\chi_{I_{ij}}(Q_i(z, \alpha, m)) = \begin{cases} 1, & \text{if } Q_i(z, \alpha) \in I_j = \left[\dfrac{jB_i}{m}, \dfrac{(j+1)B_i}{m}\right),\ i = 1, 2; \\ 0, & \text{otherwise,} \end{cases}   (2)

and write

n^*_{1j} = \sum_{i=1}^{n} \chi_{I_{1j}}\left(Q_1(z_i^2, \alpha_1, m)\right)  \quad\text{and}\quad  n^*_{2j} = \sum_{i=n+1}^{2n} \chi_{I_{2j}}\left(Q_2(z_i^1, \alpha_2, m)\right)
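For concreteness, the following is a minimal numpy sketch of this discretization step, assuming the per-point losses for a half-sample have already been computed; the function name and toy loss array are illustrative, not from the paper.

import numpy as np

def discretize_losses(losses, B, m):
    """Map each per-point loss in [0, B) to the midpoint of its interval I_j (as in (1))
    and count how many losses land in each of the m disjoint intervals (as in (2))."""
    edges = np.linspace(0.0, B, m + 1)                     # interval endpoints jB/m
    j = np.clip(np.digitize(losses, edges) - 1, 0, m - 1)  # interval index of each loss
    midpoints = (2 * j + 1) * B / (2 * m)                  # Q*_ij values
    counts = np.bincount(j, minlength=m)                   # n*_ij for j = 0, ..., m-1
    return midpoints, counts

# toy usage with simulated losses standing in for Q_1 on one half of the sample
rng = np.random.default_rng(0)
loss1 = rng.uniform(0, 2, size=50)
B1 = loss1.max() * 1.0001                                  # any bound with losses in [0, B1)
mid1, n1 = discretize_losses(loss1, B1, m=10)
print(n1, n1.sum())                                        # per-interval counts sum to n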
for the number of data points whose losses land inside the interval I_j in the first and second half of the sample of size n, respectively. The empirical loss for each model will now be written as follows:

\nu^*_{1j}(z^1, \alpha_1, m) = \frac{n^*_{1j}\, Q^*_{1j}(z^1, \alpha_2, m)}{n}, \qquad \nu^*_{2j}(z^2, \alpha_2, m) = \frac{n^*_{2j}\, Q^*_{2j}(z^2, \alpha_1, m)}{n}.

Note that to find \nu^*_{1j}(z^2, \alpha_1, m) we use data from the second set and to find \nu^*_{2j}(z^1, \alpha_2, m) we use data from the first set.

Proposition 1 The sequence of measurable functions \langle Q^*(\cdot, \alpha, m)\rangle_{m \ge 1} = \langle \sum_{j=0}^{m-1} Q^*_j(z, \alpha, m)\rangle_{m \ge 1} converges to Q(\cdot, \alpha) \in [0, B) a.e. in the underlying probability of the measure space.

The proof, which is easy, shows the decomposition of the regression problem into m classification problems is asymptotically valid.

Let Z^1 = (Z_1, ..., Z_n) and Z^2 = (Z_{n+1}, ..., Z_{2n}) be the first and second half of the sample of size 2n, respectively, and write

\nu^*_{1j}(z^1, \alpha, m) = \frac{n^*_{1j}\, Q^*_j(z^1, \alpha, m)}{n} \quad\text{and}\quad \nu^*_{2j}(z^2, \alpha, m) = \frac{n^*_{2j}\, Q^*_j(z^2, \alpha, m)}{n},

for the empirical risk using Q^*(z, \alpha, m) on the first and second half of the sample, respectively, for the j-th interval. Also, let

\nu_1(z^1, \alpha) = \frac{1}{n}\sum_{i=1}^{n} Q(z_i^1, \alpha) \quad\text{and}\quad \nu_2(z^2, \alpha) = \frac{1}{n}\sum_{i=n+1}^{2n} Q(z_i^2, \alpha)

be the empirical risk using the first and second half of the sample, respectively. To begin to control the expected supremal difference between bounded loss functions, let \epsilon > 0 and define the events

A_{\epsilon,m} = \left\{ z^{2n} : \sup_{\alpha_1, \alpha_2} \left( \nu^*_1(z^2, \alpha_1) - \nu^*_2(z^1, \alpha_2) \right) \ge \epsilon \right\}   (3)

where

\nu^*_1(z^2, \alpha_1) = \sum_{j=0}^{m-1} \nu^*_{1j}(z^2, \alpha_1, m), \qquad \nu^*_2(z^1, \alpha_2) = \sum_{j=0}^{m-1} \nu^*_{2j}(z^1, \alpha_2, m).
Since A_\epsilon is defined on the entire range of our loss function, and we want to partition the range into m disjoint intervals, let

A_{\epsilon,m} = \left\{ z^{2n} \ \Big|\ \sup_{\alpha_1, \alpha_2 \in \Lambda} \left( \sum_{j=0}^{m-1}\nu^*_{1j}(z^2,\alpha_1) - \sum_{j=0}^{m-1}\nu^*_{2j}(z^1,\alpha_2) \right) \ge \epsilon \right\}
 \subseteq \left\{ z^{2n} \ \Big|\ \exists\, j :\ \sup_{\alpha_1, \alpha_2 \in \Lambda}\left( \nu^*_{1j}(z^2,\alpha_1,m) - \nu^*_{2j}(z^1,\alpha_2,m)\right) \ge \frac{\epsilon}{m} \right\}
 \subseteq \bigcup_{j=0}^{m-1}\left\{ z^{2n} \ \Big|\ \sup_{\alpha_1, \alpha_2 \in \Lambda}\left( \nu^*_{1j}(z^2,\alpha_1,m) - \nu^*_{2j}(z^1,\alpha_2,m)\right) \ge \frac{\epsilon}{m} \right\}
 \subseteq \bigcup_{j=0}^{m-1} A_{\epsilon,m,j},

where A_{\epsilon,m,j} = \{ z^{2n} \mid \sup_{\alpha_1, \alpha_2 \in \Lambda}( \nu^*_{1j}(z^2,\alpha_1,m) - \nu^*_{2j}(z^1,\alpha_2,m)) \ge \epsilon/m \}. The suprema over \Lambda within A_{\epsilon,m,j} will be achieved at

\alpha^*_j = \alpha^*_j(z^{2n}) = \arg\sup_{(\alpha_1, \alpha_2)\in\Lambda} \left( \nu^*_{1j}(z^2, \alpha_1, m) - \nu^*_{2j}(z^1, \alpha_2, m) \right).
Next, for any fixed z^{2n}, and any given \alpha_j, form the vector

(Q^*(z_1, \alpha_j, m), Q^*(z_2, \alpha_j, m), \ldots, Q^*(z_{2n}, \alpha_j, m))

of the middle values of the intervals I_j, for j = 0, 1, 2, \ldots, m-1. For any \alpha_j and \alpha_{j'}, write

\alpha_j \sim \alpha_{j'} \iff (Q^*(z_1, \alpha_j, m), \ldots, Q^*(z_{2n}, \alpha_j, m)) = (Q^*(z_1, \alpha_{j'}, m), \ldots, Q^*(z_{2n}, \alpha_{j'}, m)).

So, for any fixed Z^{2n} = z^{2n} it is seen that \sim is an equivalence relation on \Lambda and therefore partitions \Lambda into disjoint equivalence classes. Denote the number of these classes by N^\Lambda_j and write N^\Lambda_j = N^\Lambda_j(Z^{2n}) = N^\Lambda_j(z_1, z_2, \ldots, z_{2n}). We define values \alpha^*_{jk} as the canonical representatives of the equivalence classes, where k \in K_j indexes the k-th equivalence class, i.e.,

\alpha^*_{jk} = \arg\sup_{(\alpha_1, \alpha_2)\in\Lambda_k} \left( \nu^*_{1j}(z^2, \alpha_1, m) - \nu^*_{2j}(z^1, \alpha_2, m) \right).

Clearly, \#(K_j) = N^\Lambda_j(Z^{2n}) and K_j is treated simply as an index set. To make use of the above partitioning of A_\epsilon, consider mapping the space Z^{2n} onto itself using the (2n)! distinct permutations T_i. Then, if f is integrable with respect to the distribution function of Z_i, its Riemann-Stieltjes integral satisfies

\int_{Z^{2n}} f(Z^{2n})\, dF(Z^{2n}) = \int_{Z^{2n}} f(T_i Z^{2n})\, dF(Z^{2n}),
and this gives

\int_{Z^{2n}} f(Z^{2n})\, dF(Z^{2n}) = \int_{Z^{2n}} \frac{\sum_{i=1}^{(2n)!} f(T_i Z^{2n})}{(2n)!}\, dF(Z^{2n}).   (4)

Our first main result is the following.

Theorem 2 Let \epsilon \ge 0, m \in \mathbb{N}, and h = VCD\{Q(\cdot, \alpha) : \alpha \in \Lambda\}. If h is finite, then

P(A_{\epsilon,m}) \le 2m \left(\frac{2ne}{h}\right)^h \exp\left(-\frac{\epsilon^2 n}{m^2}\right).   (5)
Proof Let \Delta^*_j(T_i Z^{2n}, \alpha^*_j, m) = \nu^*_{1j}(T_i Z_2, \alpha^*_{1j}, m) - \nu^*_{2j}(T_i Z_1, \alpha^*_{2j}, m), where \alpha_j = (\alpha_{1j}, \alpha_{2j}) and \alpha^*_j = \{(\alpha_{1j}, \alpha_{2j}) : \arg\max \Delta_j(T_i Z, \alpha_j, m)\}. Using some manipulations, we have

P(A_{\epsilon,m}) \le P\left(\bigcup_{j=0}^{m-1} A_{\epsilon,m,j}\right) \le \sum_{j=0}^{m-1} P(A_{\epsilon,m,j})
 = \sum_{j=0}^{m-1} P\left(\left\{ Z^{2n} :\ \sup_{\alpha_1, \alpha_2 \in \Lambda}\left(\nu^*_{1j}(Z_2, \alpha_1, m) - \nu^*_{2j}(Z_1, \alpha_2, m)\right) \ge \frac{\epsilon}{m}\right\}\right).

Continuing the equality gives that the RHS equals

\sum_{j=0}^{m-1} P\left(\left\{ Z^{2n} :\ \sup_{\alpha_1, \alpha_2 \in \Lambda}\left(\nu^*_{1j}(T_i Z_2, \alpha_1, m) - \nu^*_{2j}(T_i Z_1, \alpha_2, m)\right) \ge \frac{\epsilon}{m}\right\}\right)
 = \sum_{j=0}^{m-1} P\left(\left\{ Z^{2n} :\ \nu^*_{1j}(T_i Z_2, \alpha^*_{1j}, m) - \nu^*_{2j}(T_i Z_1, \alpha^*_{2j}, m) \ge \frac{\epsilon}{m}\right\}\right)
 = \sum_{j=0}^{m-1} P\left(\left\{ Z^{2n} :\ \Delta^*_j(T_i Z, \alpha^*_j, m) \ge \frac{\epsilon}{m}\right\}\right)
 = \frac{1}{(2n)!}\sum_{j=0}^{m-1}\sum_{i=1}^{(2n)!} P\left(\left\{ Z^{2n} :\ \Delta^*_j(T_i Z, \alpha^*_j, m) \ge \frac{\epsilon}{m}\right\}\right)
 = \frac{1}{(2n)!}\sum_{j=0}^{m-1}\sum_{i=1}^{(2n)!} \int I_{\{Z^{2n} :\ \Delta^*_j(T_i Z, \alpha^*_j, m) \ge \epsilon/m\}}(z^{2n})\, dP(Z^{2n}).   (6)
Let the equivalence classes in \Lambda under \sim be denoted \Lambda_k. Then, the equivalence classes [\alpha^*_{jk}] for the \alpha^*_{jk}'s provide a partition for \Lambda. That is, \Lambda = \bigcup_{k=1}^{N^\Lambda_j(z^{2n})} [\alpha^*_{jk}] because \alpha^*_{jk} \in \Lambda_k and hence [\alpha^*_{jk}] = \Lambda_k. In addition, \alpha^*_{jk} is the maximal value of \alpha_{1j} and \alpha_{2j} in the k-th
equivalence class. So,

I_{\{Z^{2n} :\ \Delta_j(T_i Z, \alpha^*_j, m) \ge \epsilon\}}(z^{2n}) \le I_{\{Z^{2n} :\ \Delta^*_j(T_i Z, \alpha^*_{1j}, m) \ge \epsilon/m\}}(z^{2n}) + \cdots + I_{\{Z^{2n} :\ \Delta^*_j(T_i Z, \alpha^*_{j N^\Lambda_j(z^{2n})}, m) \ge \epsilon/m\}}(z^{2n})
 = \sum_{k=1}^{N^\Lambda_j(z^{2n})} I_{\{Z^{2n} :\ \Delta^*_j(T_i Z, \alpha^*_{kj}, m) \ge \epsilon/m\}}(z^{2n})   (7)
where

A_{\epsilon,m,j,k} = \left\{ Z^{2n} :\ \Delta^*_j(T_i Z, \alpha^*_{kj}, m) \ge \frac{\epsilon}{m} \right\}
 = \left\{ Z^{2n} :\ \sup_{\alpha_{1j}, \alpha_{2j} \in \Lambda_k}\left(\nu^*_{1j}(T_i Z_2, \alpha_{1j}, m) - \nu^*_{2j}(T_i Z_1, \alpha_{2j}, m)\right) \ge \frac{\epsilon}{m} \right\}.
Now, using (7), (6) is bounded by

P(A_{\epsilon,m}) \le \frac{1}{(2n)!}\sum_{j=0}^{m-1}\sum_{i=1}^{(2n)!}\int \sum_{k=1}^{N^\Lambda_j(z^{2n})} I_{\{Z^{2n} :\ \Delta^*_j(T_i Z, \alpha^*_{kj}, m) \ge \epsilon/m\}}(z^{2n})\, dP(z^{2n})
 = \sum_{j=0}^{m-1}\int \sum_{k=1}^{N^\Lambda_j(z^{2n})} \left[\frac{1}{(2n)!}\sum_{i=1}^{(2n)!} I_{\{Z^{2n} :\ \Delta^*_j(T_i Z, \alpha^*_{kj}, m) \ge \epsilon/m\}}(z^{2n})\right] dP(z^{2n}).   (8)
The expression in square brackets in (8) is the fraction of the (2n)! permutations T_i of Z^{2n} for which A_{\epsilon,m,j,k} is closed under T_i for any fixed equivalence class \Lambda_k. Following Vapnik (1998), Sec. 4.13, it equals

\Gamma_j = \sum_{k} \frac{\binom{m^*_j}{k}\binom{2n - m^*_j}{m^*_j - k}}{\binom{2n}{n}}, \qquad \text{where the sum is over } \left\{ k :\ \left|\frac{k}{n} - \frac{m^*_j - k}{n}\right| \ge \frac{\epsilon}{m} \right\} \text{ and } m^*_j = n^*_{1j} + n^*_{2j}.

Here, \Gamma_j is the probability of choosing exactly k sample data points whose losses fall in interval I_j in the first and second half of the sample, respectively, such that A_{\epsilon,m,j,k} holds, and m^*_j is the number of data points from the first and the second half of the sample whose losses land inside interval j. From Sec. 4.13 in Vapnik (1998), we have \Gamma_j \le 2\exp\left(-\frac{\epsilon^2 n}{m^2}\right). So,
using this in (8) gives that P(A_{\epsilon,m}) is upper bounded by

\sum_{j=0}^{m-1}\int \sum_{k=1}^{N^\Lambda_j(z^{2n})} 2\exp\left(-\frac{\epsilon^2 n}{m^2}\right) dP(z^{2n})
 = 2\exp\left(-\frac{\epsilon^2 n}{m^2}\right)\sum_{j=0}^{m-1}\int N^\Lambda_j(z^{2n})\, dP(z^{2n})
 = 2\exp\left(-\frac{\epsilon^2 n}{m^2}\right)\sum_{j=0}^{m-1} E\left(N^\Lambda_j(Z^{2n})\right).   (9)

Theorem 4.3 from Vapnik (1998), p. 145, gives

H_{ann}(Z^{2n}) = \ln E\left(N^\Lambda_j(Z^{2n})\right) \le G(2n) \le h\ln\left(\frac{2ne}{h}\right) \;\Rightarrow\; E\left(N^\Lambda_j(Z^{2n})\right) \le \left(\frac{2ne}{h}\right)^h.
Using this m times in (9) gives the Theorem.

We will use Theorem 2 to give an upper bound on the unknown true risk via the following Propositions. Let R(\alpha_k) be the true unknown risk at \alpha_k and R_{emp}(\alpha_k) be the empirical risk at \alpha_k.

Proposition 3 With probability 1 - \eta, the inequality

R(\alpha_k) \le R_{emp}(\alpha_k) + m\sqrt{\frac{1}{n}\log\left(\frac{2m}{\eta}\left(\frac{2ne}{h}\right)^h\right)}   (10)

holds simultaneously for all functions Q(z, \alpha_k), k = 1, 2, \ldots, K.

This inequality follows from the additive Chernoff bound and suggests that the best model will be the one that minimizes the RHS of inequality (10). The use of inequality (10) in model selection is called empirical risk minimization (ERM) and we denote it by ERM_1.

Proposition 4 With probability 1 - \eta, the inequality

R(\alpha_k) \le R_{emp}(\alpha_k) + \frac{m^2}{2n}\log\left(\frac{2m}{\eta}\left(\frac{2ne}{h}\right)^h\right)\left(1 + \sqrt{1 + \frac{4nR_{emp}(\alpha_k)}{m^2\log\left(\frac{2m}{\eta}\left(\frac{2ne}{h}\right)^h\right)}}\right)   (11)

holds simultaneously for all K functions in the set Q(z, \alpha_k), k = 1, 2, \ldots, K.

This follows from the multiplicative Chernoff bound and suggests that the best model will be the one that minimizes the right hand side (RHS) of (11). The use of (11) in model selection is another form of empirical risk minimization, here denoted ERM_2. The proofs of Propositions 3 and 4 are easy and can be found in Mpoudeu (2017).
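As a minimal sketch (not code from the paper), the two penalized criteria can be evaluated directly from (10) and (11), as reconstructed above, once an empirical risk, an estimate ĥ, n, m and a threshold η are supplied; the toy model sizes and risks in the usage lines are illustrative.

import numpy as np

def erm1(r_emp, h, n, m, eta):
    """Additive Chernoff bound: RHS of (10)."""
    log_term = np.log(2 * m / eta) + h * np.log(2 * n * np.e / h)
    return r_emp + m * np.sqrt(log_term / n)

def erm2(r_emp, h, n, m, eta):
    """Multiplicative Chernoff bound: RHS of (11)."""
    log_term = np.log(2 * m / eta) + h * np.log(2 * n * np.e / h)
    pen = (m ** 2) * log_term / (2 * n)
    return r_emp + pen * (1 + np.sqrt(1 + 4 * n * r_emp / ((m ** 2) * log_term)))

# toy usage: compare candidate models by their penalized risks
for q, r in [(3, 0.9), (4, 0.5), (5, 0.48)]:   # (model size, empirical risk), made up
    print(q, erm1(r, h=q + 1, n=400, m=10, eta=0.05),
             erm2(r, h=q + 1, n=400, m=10, eta=0.05))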
Theorem 5

1. If h < \infty, we have

E\left(\sup_{\alpha_1, \alpha_2 \in \Lambda}\left(\nu^*_1(z^2, \alpha_1) - \nu^*_2(z^1, \alpha_2)\right)\right) \le m\sqrt{\frac{1}{n}\ln\left(2m^3\left(\frac{2ne}{h}\right)^h\right)} + \frac{1}{\sqrt{m\, n\, \ln\left(2m^3\left(\frac{2ne}{h}\right)^h\right)}}.   (12)

2. If h \le \infty, and

D_p(\alpha) = \int_0^\infty \sqrt[p]{P\{Q(z, \alpha) \ge c\}}\, dc < \infty

where 1 < p \le 2 is some fixed parameter, we have

E\left(\sup_{\alpha_1, \alpha_2 \in \Lambda}\left|\nu_1(Z_2, \alpha_1) - \nu_2(Z_1, \alpha_2)\right|\right) \le \frac{2^{2.5 + \frac{1}{p}}\, D_p(\alpha^*)}{n^{1 - \frac{1}{p}}\sqrt{h\ln\frac{ne}{h}}} + \frac{16 \cdot 2^{2.5 + \frac{1}{p}}\, D_p(\alpha^*)}{n^{1 - \frac{1}{p}}\sqrt{h\ln\frac{ne}{h}}}.   (13)

3. Assume that h \to \infty, n/h \to \infty, m \to \infty, \ln(m) = o(n), and

D_p(\alpha) = \int_0^\infty \sqrt{P\{Q(z, \alpha) \ge c\}}\, dc < \infty

where p = 2. Then we have that

E\left(\sup_{\alpha_1, \alpha_2 \in \Lambda}\left|\nu_1(Z_2, \alpha_1) - \nu_2(Z_1, \alpha_2)\right|\right) \le \min\left(1, 8D_p(\alpha^*)\right)\sqrt{\frac{h}{n}\ln\frac{2ne}{h}}.   (14)

Proof The proof of Theorem 5 can be found in Mpoudeu (2017), Appendices A1-A3. It rests on using the integral of probabilities identity.
2.2 An Estimator of the VCD

Recall that from Theorem 5, the upper bound is

\Phi_h(n) = \min\left(1, 8D_p(\alpha^*)\right)\sqrt{\frac{h}{n}\log\frac{2ne}{h}}.   (15)

This is meaningfully different from the form derived in Vapnik et al. (1994) and studied in McDonald et al. (2011). Moreover, although \min(1, 8D_p(\alpha^*)) does not affect the optimization, it might not be the best constant for the inequality in (14). So, we replace it with an arbitrary constant c over which we optimize to make our upper bound as tight as possible. In contrast to Vapnik et al. (1994), c is data driven, not 'universal'. We let c vary from 0.01 to 100 in steps of size 0.01. However, we have observed in practice that the best value of ĉ is usually between 1 and 8. The technique that we use to estimate ĥ is also different from that in Vapnik et al. (1994). Indeed, our Algorithm #1 in Sec. 2.2 below accurately encapsulates the way the LHS of (14) is formed, unlike the algorithm in Vapnik et al. (1994).
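As a small sketch (the function name is ours, not the paper's), the upper bound with the free constant c and the grid over c just described can be written as:

import numpy as np

def phi(h, n, c):
    """Upper bound (15) with min(1, 8 D_p(alpha*)) replaced by a free constant c."""
    return c * np.sqrt((h / n) * np.log(2 * n * np.e / h))

# the data-driven constant is chosen on the grid described in the text
c_grid = np.arange(0.01, 100.0 + 1e-9, 0.01)
print(phi(h=10, n=400, c=1.0), len(c_grid))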
Figure 1: Perspective plot of the RHS as a function of VCD and sample size.
In particular, we use two bootstrapping procedures, one as a proxy for calculating expectations and the second as a proxy for calculating a maximum. Moreover, we split the dataset into two subsets; using the first subset we fit model I and with the second we fit model II. To explain how we find our estimate of the RHS of (14) from Theorem 5, we start by replacing the sample size n in (15) with a specified value of a design point, so that the only unknown is h. Thus, formally, we replace (15) by

\Phi^*_h(n_l) = \hat{c}\sqrt{\frac{h}{n_l}\log\frac{2n_l e}{h}},

where ĉ is the optimal data-driven constant. If we knew the left hand side (LHS) of (14), even computationally, we could use it to estimate h. However, in general we don't know the LHS of (14). Instead, we generate one observation of the form

\hat{\xi}(n_l) = E\left(\sup_{\alpha_1, \alpha_2 \in \Lambda}\left(\nu^*_1(z_2, \alpha_1) - \nu^*_2(z_1, \alpha_2)\right)\right) = \Phi^*_h(n_l) + \epsilon(n_l)   (16)

for each design point n_l by bootstrapping. In (16), we assume \epsilon(n_l) has mean zero but an otherwise unknown distribution. We can therefore obtain a list of values of ξ̂(n_l) for the elements of N_L. Our algorithm is as follows:

Algorithm #1:

Inputs: A collection of regression models G = {g_\beta : \beta \in B}, a dataset, two integers b_1 and b_2 for the numbers of bootstrap samples, an integer m for the number of disjoint intervals used to discretize the losses, and a set of design points N_L = {n_1, n_2, ..., n_L}.

1. For each l = 1, 2, ..., L do;
2. Take a bootstrap sample of size 2n_l (with replacement) from our dataset;
3. Randomly divide the bootstrap data into two groups G_1 and G_2 of size n_l each;
4. Fit two models, one for G_1 and one for G_2;
5. The mean squared error of each model is calculated using the covariates and the response from the other group, thus: MSE_1 = (predict(Model_1, X_2) - Y_2)^2 and MSE_2 = (predict(Model_2, X_1) - Y_1)^2;
6. Discretize the loss function, i.e., discretize MSE_1 and MSE_2 into m disjoint intervals;
7. Estimate \nu^*_{1j}(Z_2, \alpha_1, m) and \nu^*_{2j}(Z_1, \alpha_2, m) using MSE_1 and MSE_2 respectively for each interval;
8. Compute the difference \nu^*_{1j}(z_2, \alpha_1, j) - \nu^*_{2j}(z_1, \alpha_2, j);
9. Repeat steps 1-7 b_1 times, take the mean interval-wise and sum it across all intervals so we have

\hat{\xi}_i(n_l) = \sum_{j=0}^{m-1} \text{mean}\left(\nu^*_{1j}(z_2, \alpha_1, j) - \nu^*_{2j}(z_1, \alpha_2, j)\right);

10. Repeat steps 1-8 b_2 times and calculate

\hat{\xi}(n_l) = \frac{1}{b_2}\sum_{i=1}^{b_2}\hat{\xi}_i(n_l).
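The following is a minimal, self-contained sketch of Algorithm #1 for linear models. The use of ordinary least squares via lstsq, the per-point squared errors standing in for the losses, and the common bound B per bootstrap replicate are our implementation choices, not specifications from the paper.

import numpy as np

def half_sample_losses(X1, y1, X2, y2):
    """Fit OLS on each half and return per-point squared errors on the other half (MSE_1, MSE_2)."""
    def ols_losses(X_tr, y_tr, X_te, y_te):
        Xd_tr = np.column_stack([np.ones(len(y_tr)), X_tr])
        Xd_te = np.column_stack([np.ones(len(y_te)), X_te])
        beta, *_ = np.linalg.lstsq(Xd_tr, y_tr, rcond=None)
        return (y_te - Xd_te @ beta) ** 2
    return ols_losses(X1, y1, X2, y2), ols_losses(X2, y2, X1, y1)

def nu_star(losses, B, m, n):
    """Per-interval empirical losses nu*_j: (count in I_j) * (midpoint of I_j) / n."""
    edges = np.linspace(0.0, B, m + 1)
    j = np.clip(np.digitize(losses, edges) - 1, 0, m - 1)
    counts = np.bincount(j, minlength=m)
    midpoints = (2 * np.arange(m) + 1) * B / (2 * m)
    return counts * midpoints / n

def xi_hat(X, y, n_l, m=10, b1=50, b2=50, rng=None):
    """Bootstrap estimate xi_hat(n_l) of the LHS of (14), following steps 1-10."""
    rng = np.random.default_rng() if rng is None else rng
    outer = []
    for _ in range(b2):
        diffs = np.zeros((b1, m))
        for b in range(b1):
            idx = rng.integers(0, len(y), size=2 * n_l)   # bootstrap sample of size 2 n_l
            half = rng.permutation(2 * n_l)
            i1, i2 = idx[half[:n_l]], idx[half[n_l:]]
            mse1, mse2 = half_sample_losses(X[i1], y[i1], X[i2], y[i2])
            B = max(mse1.max(), mse2.max()) * 1.0001      # common loss bound (a choice we make)
            diffs[b] = nu_star(mse1, B, m, n_l) - nu_star(mse2, B, m, n_l)
        outer.append(diffs.mean(axis=0).sum())            # mean interval-wise, then sum
    return float(np.mean(outer))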
Note that this algorithm is parallelizable because different n_l can be sent to different nodes to speed up the process of estimating ξ̂(·) for all n_l. After obtaining ξ̂(n_l) for each value of n_l, we estimate h_T by minimizing the squared distance between ξ̂(n_l) and \Phi^*_h(n_l). Our objective function is

f_{n_l}(h) = \sum_{l=1}^{|N_L|}\left(\hat{\xi}(n_l) - \hat{c}\sqrt{\frac{h}{n_l}\log\frac{2n_l e}{h}}\right)^2,   (17)

where |N_L| is the number of design points. Optimizing (17) usually only leads to numerical solutions and, in our work below, we set b_1 = b_2 = W for convenience.
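As a sketch, the minimization of (17) can be carried out by an exhaustive grid over h and the constant c; the c grid below matches the 0.01-to-100 grid described in Sec. 2.2, while the h grid and function names are our illustrative choices.

import numpy as np

def phi_star(h, n_l, c):
    return c * np.sqrt((h / n_l) * np.log(2 * n_l * np.e / h))

def estimate_h(xi_vals, design_points, h_grid=None, c_grid=None):
    """Return (h_hat, c_hat) minimizing the squared distance in (17)."""
    design_points = np.asarray(design_points, dtype=float)
    xi_vals = np.asarray(xi_vals, dtype=float)
    h_grid = np.arange(1, 201) if h_grid is None else h_grid
    c_grid = np.arange(0.01, 100.01, 0.01) if c_grid is None else c_grid
    best = (np.inf, None, None)
    for h in h_grid:
        preds = phi_star(h, design_points, 1.0)
        a, b, d = xi_vals @ xi_vals, xi_vals @ preds, preds @ preds
        sse = a - 2 * c_grid * b + (c_grid ** 2) * d     # (17) for every c at once
        k = int(np.argmin(sse))
        if sse[k] < best[0]:
            best = (sse[k], int(h), float(c_grid[k]))
    return best[1], best[2]

# usage: h_hat, c_hat = estimate_h([xi_hat(X, y, nl) for nl in NL], NL)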
3. Proof of Consistency

Here, we provide a proof of consistency of our ĥ for h_T. In most aspects, the structure of this proof should be credited to McDonald et al. (2011). Details on the quantities we use can be found in van de Geer (2000), Chaps. 2 and 3. Our contribution is to adapt McDonald et al. (2011) to our stable estimator for the regression context. We begin with the following two results adapted from McDonald et al. (2011). Intuitively, the \epsilon_i's represent the \xi_i's in Sec. 2.2.

Lemma 6 Let t \ge 0 and let \bar{\epsilon}(n_l) = \frac{1}{W}\sum_{i=1}^{W}\epsilon_i(n_l), where the \epsilon_i(n_l)'s are independent for any given n_l. Then at any design point n_l, we have

E\left(\exp\left(t\,\bar{\epsilon}(n_l)\right)\right) \le \exp\left(\frac{t^2 B^2}{8Wm^2}\right).   (18)

Proof Fix j = 0, 1, 2, \ldots, m-1. Now, for any t \ge 0 and i \ge 0, and \epsilon_i(n_l) \in \left[\frac{jB}{m}, \frac{(j+1)B}{m}\right), using Hoeffding's inequality we have

E\left(\exp\left(t\,\epsilon_i(n_l)\right)\right) \le \exp\left(\frac{t^2 B^2}{8m^2}\right).

Thus, we have

E\left(\exp\left(t\,\bar{\epsilon}(n_l)\right)\right) = E\left(\exp\left(\frac{t}{W}\sum_{i=1}^{W}\epsilon_i(n_l)\right)\right)
 = E\left(\prod_{i=1}^{W}\exp\left(\frac{t}{W}\epsilon_i(n_l)\right)\right)
 = \prod_{i=1}^{W} E\left(\exp\left(\frac{t}{W}\epsilon_i(n_l)\right)\right)
 \le \prod_{i=1}^{W}\exp\left(\frac{t^2 B^2}{8W^2 m^2}\right)
 = \exp\left(\frac{t^2 B^2}{8Wm^2}\right).   (19)
Proposition 7 Suppose that \{\bar{\epsilon}_l \equiv \bar{\epsilon}(n_l),\ l = 1, 2, \ldots, k\} is a set of random variables satisfying Lemma 6. Then for any \gamma \in \mathbb{R}^k and \rho \ge 0, we have

P\left(\left|\sum_{l=1}^{k}\bar{\epsilon}_l\gamma_l\right| \ge \rho\right) \le 2\exp\left(-\frac{2Wm^2\rho^2}{B^2\sum_{l=1}^{k}\gamma_l^2}\right).   (20)

Proof

P\left(\left|\sum_{l=1}^{k}\bar{\epsilon}_l\gamma_l\right| \ge \rho\right) \le 2P\left(\sum_{l=1}^{k}\bar{\epsilon}_l\gamma_l \ge \rho\right) = 2P\left(\exp\left(t\sum_{l=1}^{k}\bar{\epsilon}_l\gamma_l\right) \ge \exp(t\rho)\right)
 \le \frac{2E\left(\exp\left(t\sum_{l=1}^{k}\bar{\epsilon}_l\gamma_l\right)\right)}{\exp(t\rho)} = \frac{2E\left(\prod_{l=1}^{k}\exp\left(t\,\bar{\epsilon}_l\gamma_l\right)\right)}{\exp(t\rho)}
 = \frac{2\prod_{l=1}^{k}E\left(\exp\left(t\,\bar{\epsilon}_l\gamma_l\right)\right)}{\exp(t\rho)} \le \frac{2\prod_{l=1}^{k}\exp\left(\frac{t^2 B^2\gamma_l^2}{8Wm^2}\right)}{\exp(t\rho)}
 = 2\exp\left(\frac{t^2 B^2}{8Wm^2}\sum_{l=1}^{k}\gamma_l^2 - t\rho\right).   (21)

Since (21) is true for all t, the RHS of (21) attains its minimum for

t = \frac{8Wm^2\rho}{2B^2\sum_{l=1}^{k}\gamma_l^2}.   (22)

Replacing t by its value in (21) gives the statement of the proposition.
Let Q = Q_L be the empirical norm corresponding to the empirical L_2 inner product defined by

\langle f, g\rangle_Q = \frac{1}{L}\sum_{l=1}^{L} f(n_l)\, g(n_l).

Definition 8 Let \Phi = \{\phi_{h,c} : I \times H \longrightarrow \mathbb{R}\} where I \in [a, b] \subset \mathbb{R}^+ and h \in (1, M] \subset \mathbb{N}. For the purpose of our proof, initially we fix c so \phi = \phi(h). For s = 0, 1, 2, \ldots, let \{\phi^s_j\}_{j=1}^{N_s} be the minimal 2^{-s}R-covering set of (2^{-s}R, \Phi, \|\cdot\|_Q) for given s \in I and R > 0. Now, N_s = N(2^{-s}R, \Phi, Q) is the smallest number of balls of radius 2^{-s}R that cover \Phi. For each h, there exists a \phi^s_j \in \{\phi^s_1, \phi^s_2, \ldots, \phi^s_{N_s}\} such that \|\phi_h - \phi^s_j\|_Q \le 2^{-s}R. Because \Phi is compact we choose R > 0 so that \sup_h \|\phi_h\|_Q \le R. Now consider the class \Phi(R) = \{\phi_h \in \Phi :\ \|\phi_h - \phi_{h_T}\|_Q \le R\}, i.e., the elements of \Phi within R of \phi_{h_T}. Obviously j = j(h) and, as derived in Theorem 5, let

\phi_h(n) = c\sqrt{\frac{h}{n}\log\frac{2ne}{h}}.   (23)

Let \phi^s_j and \phi^{s-1}_j be indexed so that the j-th element of \{\phi^s_1, \phi^s_2, \ldots, \phi^s_{N_s}\} and the j-th element of \{\phi^{s-1}_1, \phi^{s-1}_2, \ldots, \phi^{s-1}_{N_{s-1}}\} are as close as possible. Using chaining, we have \phi^S_j = \sum_{s=1}^{S}\left(\phi^s_j - \phi^{s-1}_j\right). By the triangle inequality, we have

\|\phi^s_j - \phi^{s-1}_j\| \le \|\phi^s_j - \phi_h\|_{Q_n} + \|\phi_h - \phi^{s-1}_j\|_{Q_n} \le 2^{-s}R + 2^{-s+1}R = 3\cdot 2^{-s}R.   (24)

Theorem 9 Given the conclusion of Proposition 7, suppose that h_T \in (0, M]. Then,

P\left(\|\phi_{\hat{h}} - \phi_{h_T}\|_Q \ge \delta\right) \le 2\sum_{s=1}^{S}\exp\left(-\frac{Wm^2 L\delta^4}{18B^2\, 2^{-2s}R^2}\right).   (25)

Proof By construction, we have

\|\hat{\xi}(n_l) - \phi_{\hat{h}}\|^2 \le \|\hat{\xi}(n_l) - \phi_{h_T}\|^2.   (26)

Expanding both sides of (26), we have

\|\hat{\xi}(n_l)\|^2 - 2\langle\hat{\xi}(n_l), \phi_{\hat{h}}\rangle + \|\phi_{\hat{h}}(n_l)\|^2 \le \|\hat{\xi}(n_l)\|^2 - 2\langle\hat{\xi}(n_l), \phi_{h_T}\rangle + \|\phi_{h_T}(n_l)\|^2
 \;\Rightarrow\; \|\phi_{\hat{h}}(n_l)\|^2 - \|\phi_{h_T}(n_l)\|^2 \le 2\langle\hat{\xi}(n_l), \phi_{\hat{h}}(n_l) - \phi_{h_T}(n_l)\rangle.   (27)

Using (16), (27) becomes

\|\phi_{\hat{h}}(n_l)\|^2 - \|\phi_{h_T}(n_l)\|^2 \le 2\langle\phi_{h_T}(n_l) + \bar{\epsilon}_l, \phi_{\hat{h}}(n_l) - \phi_{h_T}(n_l)\rangle
 = 2\langle\phi_{h_T}(n_l), \phi_{\hat{h}}(n_l)\rangle - 2\|\phi_{h_T}(n_l)\|^2 + 2\langle\bar{\epsilon}_l, \phi_{h_T}(n_l) - \phi_{\hat{h}}(n_l)\rangle
 \;\Rightarrow\; \|\phi_{\hat{h}}(n_l)\|^2 - 2\langle\phi_{h_T}(n_l), \phi_{\hat{h}}(n_l)\rangle + \|\phi_{h_T}(n_l)\|^2 \le 2\langle\bar{\epsilon}_l, \phi_{h_T}(n_l) - \phi_{\hat{h}}(n_l)\rangle
 \;\Rightarrow\; \|\phi_{\hat{h}}(n_l) - \phi_{h_T}(n_l)\|^2 \le 2\langle\bar{\epsilon}_l, \phi_{h_T}(n_l) - \phi_{\hat{h}}(n_l)\rangle.   (28)
Let

R\left(2^{\frac{s+1}{2}}\delta\right) = \left\{\phi_h :\ 2^{\frac{s}{2}}\delta \le \|\phi_h(n_l) - \phi_{h_T}(n_l)\| \le 2^{\frac{s+1}{2}}\delta\right\},

and assume \phi_{\hat{h}}(n_l) \in R\left(2^{\frac{s+1}{2}}\delta\right). Let \phi^{s+1}_{\hat{h}} = \phi_{\hat{h}} \in R\left(2^{\frac{s+1}{2}}\delta\right) and define \phi^1_{\hat{h}} = \phi_{h_T}, where the \phi^s_{\hat{h}}'s are canonical representations of an N_s(2^{-s}R) covering of the compact set \Phi(R). Using (28) and Proposition 7, we have

P\left(\|\phi_{\hat{h}}(n_l) - \phi_{h_T}(n_l)\| > \delta\right) = P\left(\|\phi_{\hat{h}}(n_l) - \phi_{h_T}(n_l)\|^2 > \delta^2\right)
 \le P\left(\langle\bar{\epsilon}_l, \phi_{h_T}(n_l) - \phi_{\hat{h}}(n_l)\rangle > \frac{1}{2}\delta^2\right)
 \le P\left(\frac{1}{L}\sum_{l=1}^{L}\bar{\epsilon}_l\left(\phi_{h_T}(n_l) - \phi_{\hat{h}}(n_l)\right) > \frac{1}{2}\delta^2\right)
 \le \sum_{s=1}^{S} P\left(\frac{1}{L}\sum_{l=1}^{L}\bar{\epsilon}_l\left(\phi_{h_T}(n_l) - \phi_{\hat{h}}(n_l)\right) > \frac{1}{2}\delta^2\right).   (29)

Using the identity \phi_{h_T}(n_l) - \phi_{\hat{h}}(n_l) = \sum_{s=1}^{S}\left(\phi^s_{\hat{h}}(n_l) - \phi^{s-1}_{\hat{h}}(n_l)\right), based on the rings R(2^{\frac{s+1}{2}}\delta), the RHS of (29) is upper bounded by

\sum_{s=1}^{S} P\left(\frac{1}{L}\sum_{l=1}^{L}\bar{\epsilon}_l\left(\phi^s_{\hat{h}}(n_l) - \phi^{s-1}_{\hat{h}}(n_l)\right) > \frac{1}{2}\delta^2\right)
 = 2\sum_{s=1}^{S}\exp\left(-\frac{2Wm^2 L^2\delta^4}{4B^2\sum_{l=1}^{L}\left(\phi^s_{\hat{h}}(n_l) - \phi^{s-1}_{\hat{h}}(n_l)\right)^2}\right)
 = 2\sum_{s=1}^{S}\exp\left(-\frac{2Wm^2 L^2\delta^4}{4B^2 L\|\phi^s_{\hat{h}}(n_l) - \phi^{s-1}_{\hat{h}}(n_l)\|^2}\right)
 \le 2\sum_{s=1}^{S}\exp\left(-\frac{Wm^2 L\delta^4}{18B^2\, 2^{-2s}R^2}\right).   (30)
As a consequence, we can show that our ĥ is consistent. Suppose that \phi_h(\cdot) is Lipschitz, i.e., \forall n, \exists \kappa = \kappa(n) so that \kappa(n)|h - h'| \le |\phi_h(n_l) - \phi_{h'}(n_l)|, where \kappa(n) is bounded on compact sets. Since the form of \phi_h(n_l) is known from (15), it is clear that the uniform Lipschitz condition we have assumed actually holds at least for appropriately chosen compact sets. We also observe that for c \in C there exists a neighborhood B(c, \epsilon_l) on which (25) is true. Cover C \times H by balls of the form B(c, \epsilon_l) \times \{h\}; finitely many will be enough since C \times H is compact.

Theorem 10 Given that the assumptions of Theorem 9 hold, we have

P\left(\left|\hat{h} - h_T\right| \ge \delta\right) \le 2\sum_{s=1}^{S}\exp\left(-\frac{Wm^2 L\delta^4\kappa^4}{18B^2\, 2^{-2s}R^2}\right),   (31)

where \kappa = \sqrt{\frac{1}{L}\sum_{l=1}^{L}\kappa(n_l)^2}.
Proof Given that \phi_h(n_l) is Lipschitz, we have for any h, h' that

\kappa(n_l)|h - h'| \le |\phi_h(n_l) - \phi_{h'}(n_l)|.   (32)

So, using (32) we have

\sqrt{\frac{1}{L}\sum_{l=1}^{L}\kappa(n_l)^2}\,\left|h - h'\right| \le \sqrt{\frac{1}{L}\sum_{l=1}^{L}\left(\phi_h(n_l) - \phi_{h'}(n_l)\right)^2} = \|\phi_h(n_l) - \phi_{h'}(n_l)\|.   (33)

Let \kappa = \sqrt{\frac{1}{L}\sum_{l=1}^{L}\kappa(n_l)^2}. Using Theorem 9 and (33), we have

P\left(\left|\hat{h} - h_T\right| \ge \delta\right) \le P\left(\|\phi_{\hat{h}}(n_l) - \phi_{h_T}(n_l)\|_Q \ge \delta\kappa\right) \le 2\sum_{s=1}^{S}\exp\left(-\frac{Wm^2 L\delta^4\kappa^4}{18B^2\, 2^{-2s}R^2}\right).   (34)
We have not derived a standard error (SE) for ĥ. However, as a pragmatic point, bootstrap samples of the data can be used to generate multiple values of ĥ, say ĥ_1, ĥ_2, ..., ĥ_M, that can be used to obtain an empirical SE for ĥ.
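A sketch of that pragmatic step follows; estimate_h_from_data stands for the whole pipeline (Algorithm #1 plus the minimization of (17)) and is an assumed helper, not something given here.

import numpy as np

def bootstrap_se_of_h(X, y, estimate_h_from_data, M=30, rng=None):
    """Empirical SE of h_hat over M bootstrap copies of the data."""
    rng = np.random.default_rng() if rng is None else rng
    h_vals = []
    for _ in range(M):
        idx = rng.integers(0, len(y), size=len(y))   # resample rows with replacement
        h_vals.append(estimate_h_from_data(X[idx], y[idx]))
    return float(np.std(h_vals, ddof=1))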
4. Simulation studies

For any model, we can evaluate the LHS of (14) from Theorem 5 by Algorithm #1 in Sec. 2.2. Then, we can use nonlinear regression in (17) to find ĥ. So, it is seen that ĥ is a function of the conjectured model. In principle, for any given model class, the VCD can be found. This is particularly easy for linear models: Anthony and Bartlett (2009) show the VCD for linear models is just the number of parameters for a saturated model.

Since our goal is to estimate the true VCD, when a conjectured model P(\cdot \mid \beta) is linear and correct, we expect VCD(P(\cdot \mid \beta)) \cong ĥ. By the same logic, if P(\cdot \mid \beta) is far from the true model, we expect VCD(P(\cdot \mid \beta)) \ll ĥ or VCD(P(\cdot \mid \beta)) \gg ĥ. This suggests we estimate h_T by seeking

\hat{h} = \arg\min_{k}\left|VCD(P_k(\cdot \mid \beta)) - \hat{h}_k\right| \le t,   (35)

where \{P_k(\cdot \mid \beta) \mid k = 1, 2, \ldots, K\} is some set of models, ĥ_k is calculated using model k, and t is a positive and usually small number such that t \le 2. In the case of linear models, with q = 1, 2, \ldots, Q explanatory variables, we get

\hat{h} = \arg\min_{q}\left|q - \hat{h}_q\right| \le t,   (36)

where ĥ_q is the estimated VCD for the model of size q. Note that (35) can identify a good model even when consistency fails. The reason is that (35) only requires a minimum at the VCD, not convergence to the true VCD, which may be any model under consideration. Here, we use a variation on (36) by choosing the smallest local minimum of |q - ĥ_q|, effectively setting t = 0. In practice this amounts to imposing parsimony at the possible cost of allowing a little flexibility via t > 0.

Our simulations are based on linear models, since for these we know the VCD equals the number of parameters in the model. To establish notation, we write the regression function as a linear combination of the covariates X_j, j = 0, 1, \ldots, p,

y = f(x, \beta) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p = \sum_{j=0}^{p}\beta_j x_j.
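A minimal sketch of generating data from this linear model is shown below; the settings σ_ε = 0.4, σ_β = 3 and σ_x = 2 are restated from the simulation description in the next paragraph, while the function name and the fixed seed are our illustrative choices.

import numpy as np

def simulate_linear_data(n, p, sigma_eps=0.4, mu=5.0, sigma_beta=3.0, sigma_x=2.0, seed=0):
    """Generate (X, y) from y = beta_0 + sum_j beta_j x_j + eps, then center and scale."""
    rng = np.random.default_rng(seed)
    beta = rng.normal(mu, sigma_beta, size=p + 1)          # beta_0, ..., beta_p
    X = rng.normal(mu, sigma_x, size=(n, p))               # covariates x_1, ..., x_p
    y = beta[0] + X @ beta[1:] + rng.normal(0.0, sigma_eps, size=n)
    X = (X - X.mean(axis=0)) / X.std(axis=0)               # center and scale, as in the text
    y = (y - y.mean()) / y.std()
    return X, y

X, y = simulate_linear_data(n=400, p=15)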
Given a dataset \{(x_i, Y_i), i = 1, 2, \ldots, n\}, the matrix representation is Y_{n\times 1} = X_{n\times p}\beta_{p\times 1} + \epsilon_{n\times 1}, where X_{n\times p} = (1_n, (x_{ij})_{j=1,i=1}^{p,n}), \beta = (\beta_0, \beta_1, \ldots, \beta_p)^T, and \epsilon_{n\times 1} is an n-dimensional column vector of mean zero. Now, the least squares estimator \hat{\beta} is given by \hat{\beta} = (X'X)^{-1}X'Y.

Our simulated data is analogous. We write Y = \beta_0 x_0 + \beta_1 x_1 + \cdots + \beta_p x_p + \epsilon where \epsilon \sim N(0, \sigma_\epsilon = 0.4), x_0 = 1, \beta_j \sim N(\mu = 5, \sigma_\beta = 3) for j = 1, 2, \ldots, p, x_j \sim N(\mu = 5, \sigma_x = 2), and all are independent. We center and scale all our variables, including the response. Initially, we use a nested sequence of model lists. If our covariates were highly correlated, before applying our method we could de-correlate them by sphering.

4.1 Analysis of Synthetic data

In Subsec. 4.1.1, we begin by presenting simulation results to verify that our estimator for VCD is consistent for the VCD of the true model. Of course, since our results are only simulations, we do not always get perfect consistency; sometimes our ĥ is off by one. (As we note later, this can often be corrected if the sample size is larger.) In Subsec. 4.1.2, we will look at simulations where results do not initially appear to be consistent with the theory. However, we show that for larger values of p, larger values of n are needed. Also, as p increases, we must choose n_l's that are properly spread out over [0, n], and these n_l's seem to matter less as n \to \infty. We suggest this is necessary because (14) is only an upper bound that tightens as n increases.

4.1.1 Some first examples

We implement simulations for model sizes p = 15, 30, 40, and 50 and we present the results for all cases. For p = 15, 30, the choices of the parameters in our simulations are all the same: the sample size is n = 400; the design points are N_L = {50, 100, 150, 200, 250, 300, 400}; m = 10; and the number of bootstrap samples is W = 50. For p = 40, 50, and 60, the choices of the parameters in our simulations are all the same: the sample size is n = 600; the design points are N_L = {75, 150, 225, 300, 375, 450, 525, 600}; m = 10; and the number of bootstrap samples is W = 50. For these cases, we fit two sets of models; the first set uses a subset of our covariates to estimate the VCD, and in the second set we added some decoys (their corresponding \beta's in the generation of the response are zeros). Outputs of the simulations are given in Figures 2-6. ERM̂_1 and ERM̂_2 use the point where the sharpest decrease occurs, and BIC is simply minimized to identify a good model.

Figure 2: Estimates of ĥ, ERM̂_1, ERM̂_2 and BIC for p = 15, σ_ε = 0.4, σ_β = 3, σ_x = 2. (a) Values of ERM̂_1, ERM̂_2 and BIC; (b) Estimates of ĥ.

Figure 3: Estimates of ĥ, ERM̂_1, ERM̂_2 and BIC for p = 30, σ_ε = 0.4, σ_β = 3, σ_x = 2. (a) Values of ERM̂_1, ERM̂_2 and BIC; (b) Estimates of ĥ.

By examining Figures 2 to 6, we see that, for each given true model of pre-specified size, we fitted a list of nested models. These Figures also contain graphs of ERM_1, ERM_2 and BIC that indicate which models should be chosen under those criteria respectively. We see that when the size of the conjectured model is strictly less than that of the true model, the estimated VCD equals the minimum value of the design points, and the values of ERM̂_1, ERM̂_2 and BIC are extremely high. Furthermore, these latter values typically decrease as the conjectured models become similar to the true model. For this range of model sizes, when the conjectured model exactly matches the true model, the estimated VCD (ĥ) is closest to the true value. The biggest discrepancy (of size 2) occurs for p = 50; by contrast, for every other case the difference between the true value and the estimated VCD is at most one.
Figure 4: Estimates of ĥ, ERM̂_1, ERM̂_2 and BIC for p = 40. (a) Values of ERM̂_1, ERM̂_2 and BIC; (b) Estimates of ĥ.

Figure 5: Estimates of ĥ, ERM̂_1, ERM̂_2 and BIC for p = 50. (a) Values of ERM̂_1, ERM̂_2 and BIC; (b) Estimates of ĥ.

Figure 6: Estimates of ĥ, ERM̂_1, ERM̂_2 and BIC for p = 60. (a) Values of ERM̂_1, ERM̂_2 and BIC; (b) Estimates of ĥ.

From Figures 2-5, the behavior of ERM̂_1, ERM̂_2 and BIC is the same. In fact, when the conjectured model is a subset of the true model, we see a consistent and substantial decrease of ERM̂_1, ERM̂_2 and BIC, and a sudden drop in these statistics when the conjectured model perfectly matches the true model. This sudden drop can be regarded as an indicator of the true model. After this point, ERM̂_1 usually flatlines. However, ERM̂_2 can still often have its smallest value occur at the sudden drop because ERM̂_2 often increases (albeit slowly) from its minimum as p increases. We also see that BIC has good power of discrimination since its smallest values occur at the true model. As we add decoys, the values of BIC are bigger than those of the true model, although sometimes not by much.

The smallest discrepancy between the size h_T of the model and ĥ usually occurs at the true model. This indicates that ĥ is consistent for the true model. In addition, ĥ generally increases as the size of the model becomes bigger, although in some cases past a certain point it may flatline as well. The problem with flatlining or even decreasing past a certain value of h is worse when N is not large enough relative to p.

In Figure 6, ERM̂_1, ERM̂_2 and BIC behave as before. In fact, we observe a decrease as the conjectured model becomes similar to the true model and there is a big drop as the conjectured model exactly matches the true model. However, at the true model, there is a big discrepancy between ĥ and h_T. We suggest that this discrepancy occurs because the sample size is too small compared to p and the choice of the design points is poor.

For the present, we note that Figure 6 gives us estimates of the true model using ERM̂_1 and ERM̂_2 when the sample size is N = 600, and N_L takes on values from 75 to 600 in steps of size 75. We observe that ERM̂_1 and ERM̂_2 have a very low power of discrimination between models. Later, in Figure 9, where the sample size is N = 2000 and N_L = {500, 700, 1000, 1500, 2000}, we see that after the sudden drop in the estimates of ERM̂_1 and ERM̂_2 (that is an indicator of the true model), ERM̂_2 tends to discriminate better than ERM̂_1.
        Fig. 2b   Fig. 3b   Fig. 4b   Fig. 5b   Fig. 6b
n/p       27        13        15        12        10

Table 1: Relative increase of the sample size given the size of the model.

Table 1 gives the ratio of the sample size to the size of the model. In fact, from Fig. 2b to Fig. 6b, we see that the higher n/p is, the better the discrimination of ĥ over models is.

We argue that estimating VCD directly is better than using ERM̂_1 or ERM̂_2. There are several reasons. First, the computation of ERM̂_1 and ERM̂_2 requires ĥ. It also requires that a threshold η be chosen (see Propositions 3 and 4) and is more dependent on m than ĥ is. Being more complicated than ĥ, ERM̂_1 and ERM̂_2 will break down faster than ĥ. This is seen, for instance, in tables of Mpoudeu (2017), Chap. 3, and the discussion there. More generally, we argue that ERM̂_1 and ERM̂_2 break down faster than ĥ with increasing p, if the sample size is held constant. Otherwise put, ERM̂_1 and ERM̂_2 are less efficient than ĥ.

To end this subsection, we note that results in Mpoudeu (2017) show that our conclusions are qualitatively the same if σ_ε, σ_x or σ_β are varied.
4.1.2 Dependency on The Sample Size and Design Points

Our goal here is to show how we can improve the quality of our estimate by increasing n or tuning the design points. In Sec. 4.1.1, we started observing the effect of sample size on the quality of our estimates. Here, we emphasize both the sample size and the effect of design points. We perform simulations for model size p = 60.

Figures 6 and 7 give estimates of ĥ, the upper bounds of the true unknown risk using Propositions 3 (ERM̂_1) and 4 (ERM̂_2), and the BIC as before, for small sample sizes N = 600, 700 and their corresponding design points. Given that the size of the conjectured model is strictly less than the size of the true model, ĥ is equal to the smallest design point. However, when the conjectured model exactly matches the true model, ĥ ≈ 50 underestimates h_T = 61. When the conjectured model is more complex than the true model, we see that ĥ still underestimates h_T in most cases. Our observations about ERM̂_1, ERM̂_2 and BIC remain the same as before.

Figure 8 gives estimates of ĥ, ERM̂_1, ERM̂_2 and BIC when n = 700, N_L = {100, 200, 300, 400, 500, 600, 700} and the model size is p = 60. We see that ĥ = 57; this estimate is closer to the true value than that from Figure 6. We do not observe any change in the qualitative behavior of ERM̂_1, ERM̂_2 and BIC. This shows that small changes in sample size or design points may have large numerical effects on the values of ĥ, ERM̂_1, ERM̂_2 and BIC when n is too small.

Figure 9 is qualitatively the same as Figure 6. The difference is the sample size and design points. In fact, in Figure 9, the sample size is n = 2000 and the design points are N_L = {500, 700, 1000, 1500, 2000}, whereas in Figure 6, n = 600 and the design points vary from 75 to 600 in steps of 75. The behavior of ĥ, ERM̂_1, ERM̂_2 and BIC remains the same as previously described. We infer from this that as the sample size increases, ĥ moves closer to its true value. In these figures, the design points have also shifted. This leads us to suggest that to get the optimal estimate ĥ, not only must n increase, the values of the design points must also increase so that the range of values in the set of design points covers [0, n].

When comparing Figure 10 to Figure 6, the difference is that in Figure 10, n = 2000 whereas in Figure 6, n = 600. We observe that when there is a big enough increase in n relative to the design points (N_L being constant, but small compared to the sample size), ĥ converges to the true value, but at a lower rate: ĥ only moved from 48 to 50. Even so, in both cases the model identified is very close to the true model. Indeed, in Figure 6, ĥ identifies p = 61 or 62 and in Figure 10, ĥ identifies p = 61.

Figure 8 and Figure 11 are almost the same despite the substantial increase in n. In Figure 11, the sample size is n = 2000, and in Figure 8 the sample size is n = 700, though the design points N_L = {100, 200, 300, 400, 500, 600, 700} are the same for the two figures. We observe that the qualitative behavior of ERM̂_1, ERM̂_2 and BIC is unchanged and ĥ at the true model is nearly the same for both simulations and close to the true VCD. In both cases, the model identified by ĥ is qualitatively the same and close to the true model. We also note that |q − ĥ_q| tends to increase as the size of the wrong model increases. This comparison suggests that if the design points are large relative to p then ĥ can be found accurately without necessitating large sample sizes, i.e., for well chosen design points, n/p ≈ 15 will be sufficient. (Usually, one wants 10 data points per parameter. Here, we recommend 15 because we are doing model selection as well as parameter estimation.)
Figure 7: Estimates of ĥ, ERM̂_1, ERM̂_2 and BIC for p = 70. (a) Values of ERM̂_1, ERM̂_2 and BIC for p = 70; (b) Estimate of ĥ for p = 70.

Figure 8: Estimates of ĥ, ERM̂_1, ERM̂_2 and BIC for p = 60, ĥ = 57. (a) Values of ERM̂_1, ERM̂_2 and BIC for p = 60; (b) Estimate of ĥ for p = 60.

We leave the question of optimally choosing the design points as future work even though:
1. We have conjectured that design points should be spread over [0, n].
2. More design points should be in the upper half of the interval so that the size of the design points should track n.
3. The effect of design points seems to attenuate as n → ∞.
5. Analysis of more complex datasets

The goal of this section is to evaluate our method on two real datasets: the Tour de France and Abalone datasets.¹ The analysis of the Abalone dataset will be more extensive than that

¹ Tour de France data was collected by Bertrand Clarke. More information can be found at http://www.letour.fr/
Figure 9: Estimates of ĥ, ERM̂_1, ERM̂_2 and BIC for p = 60, ĥ = 59. (a) Values of ERM̂_1, ERM̂_2 and BIC for p = 60; (b) Estimates of ĥ for p = 60.

Figure 10: Estimates of ĥ, ERM̂_1, ERM̂_2 and BIC for p = 60. (a) Values of ERM̂_1, ERM̂_2 and BIC for p = 60; (b) Estimates of ĥ for p = 60.

Figure 11: Estimates of ĥ, ERM̂_1, ERM̂_2 and BIC for p = 60. (a) Values of ERM̂_1, ERM̂_2 and BIC for p = 60; (b) Estimates of ĥ for p = 60.
of the Tour de France because its sample size is much larger, 4177 versus 103, and the data can be assumed independent.

We start this section by giving some information about our Tour de France dataset in Sec. 5.1. Then, in Sec. 5.1.1 we analyze the Tour de France dataset using a model list based on Year and Distance. This class is a sequence of nested models. We evaluate our method by comparing ĥ to BIC, ERM̂_1 and ERM̂_2. In Subsec. 5.1.2, we look at the effect of outliers on the estimates ĥ, ERM̂_1 and ERM̂_2.

5.1 Tour de France Data

The full dataset has n = 103 data points. The data points are dependent because many cyclists competed in the Tour for more than one year. Here we ignore the dependence structure for ease of exposition. Each data point has a value of the response variable, the average speed in kilometers per hour (km/h) of the winner (Speed) of the Tour from 1903 to 2016. However, during World Wars 1 and 2 there was no Tour de France, so we do not have data points for those years. We also see the effect of World War I on the speed of the winner of the Tour: the lowest speeds were after World War 1, probably due to casualties. After World War II, there was also a decrease in average winning speed, but the decrease was less than that after World War 1. There is a curvilinear relationship between Speed and Year (Y). We also note a linear relationship between Speed and Distance, and that the variability of Speed increases with the Distance (D). The Tour de France dataset also has information on the age of the winner (A), the number of stages won by the winner (S), and the distance in km (D) of the Tour de France. We observed that most of the winners of the Tour had ages ranging from 25 to 30 years and the number of stages that they won ranged between 0 and 1.

5.1.1 Analysis of a nested collection of model lists of Tour de France

We identify a nested model list using Y, D, Y², D² and Y:D as covariates. Because the size of the dataset is small, we can only use a small model list. We order the variables using the SCAD shrinkage method because it perturbs parameter estimates the least and satisfies an oracle property. Under SCAD, the order of inclusion of variables is Y, D, D², Y², and Y:D. We therefore fit five different models.
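As an illustration of how such a nested list can be scored (a sketch with placeholder variable names, not the authors' code), each nested OLS model can be fit and its BIC computed directly; ĥ, ERM̂_1 and ERM̂_2 would come from the procedures of Secs. 2-3.

import numpy as np

def ols_bic(X, y):
    """Gaussian BIC = n*log(RSS/n) + k*log(n) for an OLS fit with intercept."""
    Xd = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(Xd, y, rcond=None)
    rss = float(np.sum((y - Xd @ beta) ** 2))
    n, k = len(y), Xd.shape[1]
    return n * np.log(rss / n) + k * np.log(n)

# hypothetical arrays year, dist, speed stand for the Tour de France data;
# columns are in the SCAD inclusion order Y, D, D^2, Y^2, Y:D
# full = np.column_stack([year, dist, dist**2, year**2, year*dist])
# bics = [ols_bic(full[:, :q + 1], speed) for q in range(full.shape[1])]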
Model                  ĥ     ERM̂_1    ERM̂_2    BIC
Y                      20    316.74   495.22   419.14
Y, D                   20    218.11   451.06   411.13
Y, D, D²               20    176.09   317.27   365.68
Y, D, D², Y²           20    172.75   312.89   368.24
Y, D, D², Y², Y:D      20    172.05   311.97   372.43

Table 2: Direct implementation of Vapnik's method for the nested models. Note that 20 is the smallest design point, a typical problem for this method.
Model                  ĥ    ERM̂_1   ERM̂_2   BIC
Y                      4    16.42   44.95   79.67
Y, D                   4    15.10   42.83   71.66
Y, D, D²               4    11.21   36.37   26.21
Y, D, D², Y²           4    11.09   36.16   28.77
Y, D, D², Y², Y:D      4    11.06   36.11   32.96

Table 3: Estimates of ĥ, ERM̂_1, ERM̂_2 and BIC of the nested models using our method.

The estimate of VCD requires that we choose some values to use in the analysis. Since the size of our dataset is 103, we set m = 10, we choose to vary N_L from 20 to 100 by 10, and we set W = 50.

Tables 2 and 3 give us estimates of ĥ, ERM̂_1, ERM̂_2 and BIC of the nested models using Vapnik et al. (1994)'s original algorithm and our Algorithm #1, respectively. It is seen that Vapnik's original method is helpful only if it is reasonable to surmise that there are 15 missing variables. Our method uniquely identifies one of the models on the list. Even though there is likely no model for the Tour de France dataset that is accurate to infinite precision, our method is giving a useful result.

In Table 3, we see ĥ = 4 no matter which model is chosen. This indicates that we need 4 explanatory variables to explain our response. Thus, the best model is the one with Y, D, D², and Y². However, if we use BIC for model selection we select the model with Y, D, and D². We prefer the model chosen by ĥ because there does not appear to be as strong a curvilinear relationship between speed and distance as there appears to be between speed and Year, e.g., via Y². We attribute the slightly lesser performance of BIC to the fact that its derivation rests heavily on the assumption that the data are independent. Minimizing ERM̂_1 and ERM̂_2 leads to the model with five parameters. There is nothing a priori wrong with this, but a smaller model (of size 4 using ĥ) is preferred when justifiable. Alternatively, we may regard the differences among the 3rd, 4th and 5th models as trivial for ERM̂_1 and ERM̂_2, so they effectively lead to the model with Y, D and D² as terms, since ERM̂_1 and ERM̂_2 both have a large decrease from the 2-term to the 3-term model. That is, they give the same result as BIC, which we think is inferior to the model chosen by ĥ.

5.1.2 Analysis of The Tour de France dataset with outliers removed

The observations just after World War I may be outliers. Let us see how our estimate behaves after removing these observations.

The process of analyzing this reduced dataset is the same. We identify the nested model lists by SCAD; then, for each model in the class, we estimate h and obtain ERM̂_1, ERM̂_2 and BIC. Under SCAD, the order of inclusion of our covariates is: Y, D², D, Y² and Y:D. This order is different from when we use all data points. Recall that, when we used all data points, D was included before D² and D² was included after Y². With this new ordering we fit 5 different models.

From Table 4, if we choose a model using ĥ, we get the same answer as in Sec. 5.1.1, the model with four variables: Y, D², D, Y². The interaction between Year and distance (Y:D) is not included because of the low correlation between Speed and Y:D (-0.08).
Model Size             ĥ    ERM̂_1   ERM̂_2   BIC
Y                      4    12.87   40.72   44.55
Y, D²                  4    12.01   39.26   37.84
Y, D², D               4    11.66   38.36   37.41
Y, D², D, Y²           4    11.48   38.34   39.20
Y, D², D, Y², Y:D      4    11.35   38.13   41.83

Table 4: Nested models using Year and Distance as covariates with outliers removed.
Also, as before, BIC indicates a model of size 3 having Y, D², and D as covariates, and ERM̂_1 and ERM̂_2 chose a model of size 5. The reasoning in Subsec. 5.1.1 for why we think that the model chosen by ĥ is best continues to hold.

5.2 Analysis of the Abalone dataset

The Abalone dataset was first presented in Nash et al. (1994) and can be freely downloaded from http://archive.ics.uci.edu/ml/datasets/abalone. Abalone has been widely used in statistics and in machine learning as a benchmark dataset. It is known to be very difficult to analyze as either a classification or a regression problem. Our goal in this section is to see how our method performs and to compare the results to other model selection techniques such as SCAD, ALASSO, and BIC.

5.2.1 Descriptive Analysis of Abalone dataset

The Abalone dataset has 4177 observations and 8 covariates. Sex is a nominal variable with 3 categories: Male, Female and Infant. Length (mm) is the longest shell measurement; Diameter (mm); Height (mm) is the height measured with the meat; Whole weight (grams) is the whole weight of the abalone; Shucked weight (grams) is the weight of the meat; Viscera weight (grams) is the gut weight after bleeding; Shell weight is the shell weight after being dried. The response variable, i.e., the Y, is Rings. The number of rings is roughly the age of an abalone.

Fig. 12 shows pairwise plots of the form Rings versus covariates. We see that no matter which covariates are chosen, the variability in the Rings increases as the size of the covariates increases. We also observe that there is likely to be a curvilinear relationship between Rings and the covariates. However, it is so weak for Rings vs Length that linear terms in these covariates may be adequate. We have left 3 scatter plots out since we get near duplicates due to collinearity between covariates. For instance, Rings vs Diameter is nearly the same as Rings vs Length; also Rings vs Shucked weight is nearly the same as Rings vs Viscera weight and Rings vs Whole weight.

5.2.2 Statistical Analysis of the Abalone data

The model that we use to estimate the complexity of the response 'Rings' is a linear combination of all the variables. To accomplish this, we first order the inclusion of variables in the model using correlation; see Fan and Lv (2008).
Figure 12: Scatter plots of Rings vs Height, Length, Shell weight, and Shucked weight, by Sex. (a) Rings vs Height; (b) Rings vs Length; (c) Rings vs Shell weight; (d) Rings vs Shucked weight.
Under the correlation between Rings and each of the explanatory variables, the order of inclusion of variables is as follows: Shell weight, Diameter, Height, Length, Whole weight, Viscera weight and Shucked weight. Using this ordering, we fit seven different models, estimated ĥ, and found values of ERM̂_1, ERM̂_2 and BIC. These values are in Table 5. We also compare our method to other model selection techniques based on sparsity such as SCAD and ALASSO.
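A sketch of this correlation screening step (in the spirit of Fan and Lv, 2008) is shown below, assuming the Abalone covariates are already in a numeric matrix X with a matching list of names; both the matrix and the function name are assumptions for illustration.

import numpy as np

def order_by_correlation(X, y, names):
    """Rank covariates by |corr(y, x_j)|, largest first."""
    Xc = (X - X.mean(axis=0)) / X.std(axis=0)
    yc = (y - y.mean()) / y.std()
    corr = Xc.T @ yc / len(y)                     # sample correlations with the response
    order = np.argsort(-np.abs(corr))
    return [names[j] for j in order], corr[order]

# the q-th nested model then uses the first q covariates in this ordering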
Model Size                                   ĥ    ERM̂_1   ERM̂_2   BIC
Shell weight                                 8    26315   26524   19567
Shell weight, Diameter                       8    22839   23033   18983
Shell weight, Diameter, Height               8    26045   26253   19540
Shell weight, Diameter, Height, Length       8    25745   25952   19500
5                                            9    23477   23684   19124
6                                            8    20389   20573   18551
7                                            8    20507   20696   18575

Table 5: Nested models using covariates for Abalone.
From Table 5, we observe that ĥ = 8 except for the model of size 5. We regard ĥ = 9 for a model of size 5 as a random fluctuation since it is close to 8 and our method, while stable, is not perfectly so. The full model is chosen using ĥ because it has the smallest distance between its size and ĥ. However, the fact that the model is first order of size 7 and ĥ = 8 suggests there may be a missing variable in the dataset, i.e., unavoidable bias. We see some variability in the estimates of ERM̂_1 and ERM̂_2 as we include variables in the model. We also see a drop when Diameter is included, and the values go back up when Height and Length are included. There is a big decrease at the 6th model and a slight increase at the 7th model. This observation is similar for BIC. So ERM̂_1, ERM̂_2 and BIC pick the model of size 6 while our method picks the model of size seven and suggests there is a bias from at least one missing variable. We regard the results from ĥ as more plausible physically since several of the variables are highly correlated.

Next we turn to the results of a sparsity-driven analysis. Since there are seven explanatory variables and n = 4177, sparsity per se is not necessarily an important property for a model to have. Hence, we present these results for comparative purposes only.

First suppose SCAD is used as a model selection technique. The optimal value of λ is found to be λ̂ = 0.0027. With this value of λ̂, the best model must have 6 variables. The variables that enter the model, in order, are Shell weight, Shucked weight, Height, Diameter, Viscera weight, and Whole weight. Thus, under SCAD, we are led to the model

\widehat{Rings} = 0.36\cdot Diameter + 0.15\cdot Height + 1.40\cdot Whole\ weight - 1.39\cdot Shucked\ weight - 0.34\cdot Viscera\ weight + 0.37\cdot Shell\ weight.   (37)

An analogous analysis under ALASSO leads us to the same six terms and the model is:

\widehat{Rings} = 3 + 11.62\cdot Diameter + 11.69\cdot Height + 9.21\cdot Whole\ weight - 20.24\cdot Shucked\ weight - 9.79\cdot Viscera\ weight + 8.63\cdot Shell\ weight.   (38)

Even though (37) and (38) have the same terms, the coefficients are very different. This may occur because there is a high correlation between covariates. Note that the models in (37) and (38) both include Shucked weight but neither includes Length, whereas the models chosen by BIC, ERM̂_1 and ERM̂_2 include Length but not Shucked weight. That is, the sparsity models use the same variables, ERM̂_1, ERM̂_2 and BIC use the same variables (albeit a different set), and ĥ includes all the variables, suggesting that some are missing. As before, we regard the model chosen by ĥ as the most reasonable.
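For completeness, one common way to fit an adaptive LASSO of the kind behind (38) is to rescale columns by OLS-based weights and run an ordinary LASSO; the sketch below is generic (the penalty value is illustrative), not the authors' implementation.

import numpy as np
from sklearn.linear_model import Lasso, LinearRegression

def adaptive_lasso(X, y, alpha=0.01, gamma=1.0):
    """Adaptive LASSO via the usual reweighting trick: scale columns by |OLS coef|^gamma."""
    w = np.abs(LinearRegression().fit(X, y).coef_) ** gamma   # data-driven weights
    Xw = X * w                                                # rescaled design
    fit = Lasso(alpha=alpha, max_iter=50000).fit(Xw, y)
    coef = fit.coef_ * w                                      # map back to the original scale
    return coef, fit.intercept_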
6. General Conclusions

Sec. 2 presents the derivation of the objective function we used to estimate the VCD. Our estimation procedure involves two bootstrap steps and one optimization over an arbitrary constant. In Sec. 3 we gave conditions under which our estimator is consistent. In Sec. 4, we presented simulations, suggesting a rate for consistency in terms of sample size and number of parameters as well as providing guidelines for selecting design points. Our examples were limited to linear models but extend readily to other model classes for which the VCD can be determined. In Sec. 5, we used our method on two real datasets, arguing that, as in the simulations, using estimated VCD to do model selection gives better performance than several other standard methods.

Finally, we would like to situate our general approach in the context of structural risk minimization. When we minimize an objective function, the intuition is that this corresponds to finding a decision problem, or more precisely a model class, that has the fastest convergence to zero of its expected supremal difference of cumulative empirical losses and hence would be the 'right' problem for us to solve if only we knew it. Therefore, even though the model class is only implicitly defined, we can take its h_T or, empirically, ĥ, as a lower bound on the VCD of model classes. Essentially, this assumes that the true model class may be extremely large, so we are seeking a good stopping point along a sequence of models of increasing size. Thus, in the examples presented here, when our objective function has a unique minimum that can be taken as an estimate of ĥ, we announce its value. However, models are rarely true to infinite precision. Rather, we look for an element in a sequence of models that, for finite sample sizes, corresponds to a useful lower bound on h_T, implicitly allowing that models with ĥ greater than the 'h_T' we have surmised may be more valid as n increases.
Acknowledgments

The authors gratefully acknowledge support from NSF grant #DMS-1419754 and thank the Holland Computing Center for invaluable computational support. The authors also thank Daniel McDonald for helpful conversations.
References

Martin Anthony and Peter L. Bartlett. Neural Network Learning: Theoretical Foundations. Cambridge University Press, 2009.

Andrew R. Barron and Thomas M. Cover. Minimum complexity density estimation. IEEE Transactions on Information Theory, 37(4):1034-1054, 1991.

José M. Bernardo and Adrian F. M. Smith. Bayesian Theory. Wiley Series in Probability and Statistics, 1994.

Jianqing Fan and Runze Li. Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association, 96(456):1348-1360, 2001. doi: 10.1198/016214501753382273. URL http://dx.doi.org/10.1198/016214501753382273.

Jianqing Fan and Jinchi Lv. Sure independence screening for ultrahigh dimensional feature space. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 70(5):849-911, 2008.

Daniel J. McDonald, Cosma Rohilla Shalizi, and Mark Schervish. Estimated VC dimension for risk bounds. arXiv preprint arXiv:1111.3404, 2011.

Merlin Tchamche Mpoudeu. Use of Vapnik-Chervonenkis Dimension in Model Selection. PhD thesis, University of Nebraska-Lincoln, 2017.

Warwick J. Nash, Tracy L. Sellers, Simon R. Talbot, Andrew J. Cawthorn, and Wes B. Ford. The Population Biology of Abalone (Haliotis species) in Tasmania. I. Blacklip Abalone (H. rubra) from the North Coast and Islands of Bass Strait. Technical report, Department of Primary Industry and Fisheries, Tasmania, 1994.

Xuhui Shao, Vladimir Cherkassky, and William Li. Measuring the VC-dimension using optimized experimental design. Neural Computation, 12(8):1969-1986, 2000.

Sara van de Geer. Empirical Processes in M-estimation. Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge University Press, Cambridge, 2000.

Vladimir Vapnik. Statistical Learning Theory. Wiley, New York, 1998.

Vladimir Vapnik and Alexey Ya. Chervonenkis. On the uniform convergence of relative frequencies of events to their probabilities. In Soviet Math. Dokl., volume 9, pages 915-918, 1968.

Vladimir Vapnik and Alexey Ya. Chervonenkis. On the uniform convergence of relative frequencies to their probabilities. In Soviet Math. Dokl., volume 9, pages 915-918, 1971.

Vladimir Vapnik, Esther Levin, and Yann Le Cun. Measuring the VC-dimension of a learning machine. Neural Computation, 6(5):851-876, 1994.

Chris S. Wallace and Peter R. Freeman. Estimation and inference by compact coding. Journal of the Royal Statistical Society, Series B (Methodological), pages 240-265, 1987.

Hui Zou. The adaptive lasso and its oracle properties. Journal of the American Statistical Association, 101(476):1418-1429, 2006.