Sankhyā: The Indian Journal of Statistics 1995, Volume 57, Series A, Pt. 2, pp. 316-332
A NEAREST NEIGHBOUR CLASSIFICATION RULE FOR MULTIPLE OBSERVATIONS BASED ON A SUB-SAMPLE APPROACH∗

By S.C. BAGUI, The University of West Florida, K.L. MEHRA, University of Alberta, and M.S. RAO, Osmania University

SUMMARY. In this paper, a 1-nearest neighbour (NN) classification rule for classifying m multiple observations into one of s populations is proposed by suitably sub-grouping the training sample observations. This classification rule generalizes the one by Cover and Hart (1967) for a single observation to m multiple observations. The asymptotic risk of the proposed rule is derived and the bounds thereon in terms of the corresponding Bayes risk are shown to be parallel to those obtained by Cover and Hart (1967) for the case m = 1. A proposed cross-validation type estimator of the asymptotic risk is shown to be asymptotically unbiased and consistent. Further, the results of a Monte Carlo study are reported to assess the performance of the proposed rule and to compare it with that of the 1st-stage rank nearest neighbour (RNN) rule for multiple observations (Bagui, 1989) in small sample situations.

Paper received August 1992; revised July 1993.
AMS (1990) subject classification. Primary 62H30; secondary 62F15, 62G10, 62G20.
Key words. Bayes risk, classification, discrimination, misclassification, nearest neighbor, repeated measurements.
∗ The research was supported in part by a CRF-NSERC Grant from the University of Alberta and the grant No. A-3061 from the Natural Sciences and Engineering Research Council of Canada.
1. Introduction
In this paper we address the problem of classifying nonparametrically a set of m (> 1) independent and identically distributed (i.i.d.) observations into one of the s populations π_1, π_2, ..., π_s based on a training sample from a mixture of these s populations obtained under a Bayesian model. For the case m = 1, there have been many nonparametric classification procedures proposed in the literature, which can be broadly classified under (i) nearest neighbour (NN)
rules (Fix and Hodges (1951)); (ii) rules based on density estimates (Van Ryzin (1966)); (iii) rules based on distances between empirical c.d.f.'s (Matusita (1956)); (iv) rules based on the empirical Bayes approach (Johns and Robbins (1961)); (v) rules based on ranks (Das Gupta (1964)); and lastly, (vi) rules based on tolerance regions (Quesenberry and Gessaman (1968)). For m > 1, a rather naive way to classify these multiple observations would be to repeat m times one of the above single-observation classification procedures. However, the drawback of this repeating procedure is that it does not utilize the full information available in the data and is, therefore, clearly inappropriate. Das Gupta (1964) considered the problem of classifying nonparametrically a random sample of i.i.d. observations based on the Kolmogorov distance and the Wilcoxon statistic and showed the consistency of the rules proposed. However, such rules may be appropriate only when both the size of the random sample to be classified and the sizes of the training samples are adequately large. In the parametric case, the problem of classifying multiple observations has been studied by Choi (1972) and Gupta and Logan (1990) assuming that the populations are normal. For the nonparametric situation, when the distributional structures of the populations are unknown, a satisfactory simple classification rule for multiple observations should be a welcome addition to the literature, with many potential practical applications. For example: (i) Biomedical field: vascular basement membrane (BM) thickening is a variable phenomenon in the capillaries of diabetic patients, and a random sample of independent thickness measurements on a patient may be used to classify the patient into a suitable category describing the patient's advanced state of disease. (ii) Community health: to determine whether a community is infected by a virus or not, a sample of the virus count on m randomly selected individuals from the community can be used for classifying the community into various categories signifying the extent of infection. (iii) Job applicant evaluation: multiple observations can be obtained by administering similar tests to candidates at different time points, and the data may be used more effectively to classify applicants into various selection categories. (iv) Pattern recognition: let P_1 and P_2 be two patterns from which suitable training samples are available, and let P^1, P^2, ..., P^m be m i.i.d. observations on a certain pattern P_0 under study, which is known to be similar to either P_1 or P_2. One may use the sample P = {P^1, P^2, ..., P^m} to make a decision regarding the identity of P_0.

The object of the present paper is to propose and study an NN-type classification rule, using sub-samples drawn from the given (identified) training samples. This rule may be viewed as a generalization of the Cover and Hart (1967) 1-NN rule from 1 to m multiple observations. Their model follows the Bayesian set-up, and the proposed 1-NN rule assigns a single observation Z to π_i if the NN (measured by a distance function d) of Z belongs to the i-th population π_i, i = 1, 2, ..., s. Although NN-type classification rules were first introduced by Fix and Hodges (1951), Cover and Hart (1967) were the first
to obtain an appropriate upper bound on the limiting NN risk R_{(s)}(1), given by

$$R^*_{(s)}(1) \le R_{(s)}(1) \le R^*_{(s)}(1)\Bigl(2 - \frac{s}{s-1}\,R^*_{(s)}(1)\Bigr), \qquad \ldots (1.1)$$
where R^*_{(s)}(1) is the corresponding (minimum) Bayes risk (see (2.5) below). Subsequently, Devroye (1981) obtained a similar upper bound for the corresponding k-NN asymptotic risk, which in a "certain" sense is the best possible. For the case s = 2 and m = 1, Das Gupta and Lin (1980) also proposed certain rank-NN (RNN) multistage rules and showed that their 1-stage rule was asymptotically equivalent to the 1-NN rule of Cover and Hart (1967). For other aspects of the problem dealt with by Cover and Hart (1967) and Devroye (1981), see Wagner (1971), Fritz (1975), Hand (1981) and Xiru (1985). The main result of this paper, namely, a generalization of the bounds (1.1) to the following bounds on the asymptotic NN-risk function R_{(s)}(m),

$$R^*_{(s)}(m) \le R_{(s)}(m) \le R^*_{(s)}(m)\Bigl(2 - \frac{s}{s-1}\,R^*_{(s)}(m)\Bigr), \qquad \ldots (1.2)$$

is achieved in Section 2, where R^*_{(s)}(m) is the corresponding Bayes risk given in (2.5). The bounds in (1.2) for general m are the precise counterpart of those in (1.1) for the m = 1 case; the bounds in (1.1) are well known in the standard nearest neighbour theory. In Section 3, a cross-validation estimator of the asymptotic risk R_{(s)}(m) is proposed and its consistency established. Section 4 contains some supporting small sample Monte Carlo results, and Section 5 some concluding remarks.
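For orientation, specializing (1.2) to the two-population case s = 2 (worked out here purely as an illustration) gives

$$R_{(2)}(m) \le 2\,R^*_{(2)}(m)\bigl(1 - R^*_{(2)}(m)\bigr),$$

so that, for example, a Bayes risk of R^*_{(2)}(m) = 0.10 caps the asymptotic sub-sample NN risk at 0.18.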
2. Asymptotic NN-risk and bounds
2.1. Preliminaries: Model and the assumptions. Let (X_1, θ_1), (X_2, θ_2), ..., (X_n, θ_n) be a random (identified) i.i.d. training sample from a mixture of the s populations, taking values in ℜ^p × {1, 2, ..., s}, where θ_i = j indicates that X_i comes from the population π_j, which has prior probability ξ_j^* and density f_j, j = 1, 2, ..., s; let n_j denote the number of training observations with θ_i = j, so that n = n_1 + ... + n_s. From the training sample form all ordered sub-samples Y_k = (X_{i_1}, ..., X_{i_m}) of size m, k = 1, 2, ..., N, N = ^nP_m, and attach to each Y_k the label

$$\tau_k = j \ \text{ if } \theta_{i_1} = \theta_{i_2} = \ldots = \theta_{i_m} = j, \quad \text{and} \quad \tau_k = 0 \ \text{ otherwise.} \qquad \ldots (2.1)$$

The vector to be classified is Z = (Z_1, ..., Z_m), whose components are i.i.d. observations from one of the s populations, so that the corresponding labels satisfy θ_1 = θ_2 = ... = θ_m. The relevant (conditioned) sample space is

$$S = \{(y, \tau) : \tau > 0\}. \qquad \ldots (2.1a)$$
In fact, S may be considered as a conditioned sample space, conditioned by the fact that the m components of each Y are identically distributed. Thus, we consider the sequence {(Y_k, τ_k) | τ_k > 0; k = 1, 2, ..., N} as an identically distributed training sample, with N = ^nP_m (and only N' = Σ_{j=1}^{s} ^{n_j}P_m of the Y_k's having τ_k's positive), to be used to classify Z. Note that, conditionally given n = (n_1, n_2, ..., n_s)', there are k_m(n) = [n_1/m] + ... + [n_s/m] mutually independent elements in the above set, and k_m(n) →_{a.s.} ∞ as n → ∞ (in fact, k_m(n) ≥ (n/m) − s → ∞, as n → ∞).

Classification rule. Denote by Y_{n0} = Y_{n0}(Y_1, Y_2, ..., Y_N) the NN of Z among those Y_k's with τ_k's positive, that is,

$$\min_{k \in \{1 \le k' \le N : \tau_{k'} > 0\}} \|Y_k - Z\| = \|Y_{n0} - Z\|, \qquad \ldots (2.2)$$
where ∥·∥ is the Euclidean norm. Suppose Y_{n0} is identified as coming from the population corresponding to τ_{n0}, say π_j; then the sub-sample classification rule (δ_n) classifies Z to the population π_j, j = 1, 2, ..., s. For practical applications of the proposed sub-sample classification rule, the preceding algorithm may be explicitly described as follows: Given an identified training sample from the mixture of the s populations π_1, π_2, ..., π_s, first separate the observations corresponding to each of the s populations. This yields s individual samples corresponding to the s populations, n_j denoting the size of the sample from the population π_j, j = 1, 2, ..., s. Then, for each j, identify all possible ^{n_j}P_m sub-samples of size m from the π_j sample, j = 1, 2, ..., s.
Now, regarding the totality N' = Σ_{j=1}^{s} ^{n_j}P_m of all sub-samples of size m (drawn from the individual population samples) as constituting a new transformed single (vector) training sample, use the conventional 1-NN rule for classifying the multiple observation vector z = (z_1, z_2, ..., z_m).

We have assumed above (in the first paragraph of this section) that the θ_i's corresponding to the components of Z = (Z_1, ..., Z_m) satisfy θ_1 = θ_2 = ... = θ_m = τ. Given this event, say A = {θ_1 = θ_2 = ... = θ_m}, with P(A) = Σ_{j=1}^{s} ξ_j^{*m},

$$P(\tau = j \mid A) = \frac{\xi_j^{*m}}{\sum_{j'=1}^{s} \xi_{j'}^{*m}} = \xi_j, \ \text{say}. \qquad \ldots (2.3)$$
Thus, under this assumption, i.e., given the event A (which fact will be notationally suppressed throughout, so that all probability calculations below are conditional given A), the posterior probabilities η_j^*(z) are given by

$$\eta_j^*(z) = P(\tau = j \mid Z = z) = \frac{\xi_j^{*m} \prod_{i=1}^{m} f_j(z_i)}{\sum_{j'=1}^{s} \xi_{j'}^{*m} \prod_{i=1}^{m} f_{j'}(z_i)}, \quad j = 1, 2, \ldots, s. \qquad \ldots (2.3a)$$
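As a simple illustration (worked here for concreteness), for s = 2 with equal priors ξ_1^* = ξ_2^*, the common factor ξ^{*m} cancels in (2.3a) and

$$\eta_1^*(z) = \frac{\prod_{i=1}^{m} f_1(z_i)}{\prod_{i=1}^{m} f_1(z_i) + \prod_{i=1}^{m} f_2(z_i)}, \qquad \eta_2^*(z) = 1 - \eta_1^*(z),$$

so the posterior comparison reduces to a comparison of the two likelihood products of the m observations.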
It should be pointed out at this point again that all results and probability and risk computations below are under the assumption and notation explained in (2.3) and (2.3a) above. Accordingly, A will remain suppressed henceforth. Thus, if we decide to classify Z to the population j, then the conditional risk given Z = z (and A) is given by

$$\gamma_j(z) = E(L(\tau, j) \mid Z = z) = \sum_{j'=1}^{s} \eta_{j'}^*(z)\, L(j', j),$$

where L is a suitable non-negative loss function, and the Bayes decision rule δ^* chooses the population π_j for which γ_j(z), j = 1, ..., s, is minimum. Using δ^*, the conditional Bayes risk γ^*(z) can be written as γ^*(z) = min_j {Σ_{j'=1}^{s} η_{j'}^*(z) L(j', j)},
and the overall (minimum) expected risk R^*_{(s)}(m), called the Bayes risk, is given by

$$R^*_{(s)}(m) = E[\gamma^*(z)] = \int \gamma^*(z)\, f^*(z)\, dz, \qquad \ldots (2.3b)$$

where f^*(z) = Σ_{i=1}^{s} ξ_i ∏_{l=1}^{m} f_i(z_l) and dz = dz_1 ... dz_m. Taking the 0-1 loss
function, we express the conditional and overall minimum Bayes risks as

$$\gamma^*(z) = \min\bigl(1 - \eta_1^*(z), \ldots, 1 - \eta_s^*(z)\bigr), \qquad \ldots (2.4)$$
and

$$R^*_{(s)}(m) = \int \min\bigl(1 - \eta_1^*(z), \ldots, 1 - \eta_s^*(z)\bigr) f^*(z)\, dz = \int \min\Bigl(\sum_{j \ne 1} \xi_j \prod_{i=1}^{m} f_j(z_i), \ldots, \sum_{j \ne s} \xi_j \prod_{i=1}^{m} f_j(z_i)\Bigr) dz, \qquad \ldots (2.5)$$
respectively. Suppose the proposed NN rule δ_n classifies Z to the population τ_{n0}. We define the NN risk of δ_n by R_{(s)}(m; δ_n) = E[L(τ, τ_{n0})], and the corresponding asymptotic NN risk by

$$R_{(s)}(m) = \lim_n R_{(s)}(m; \delta_n) = \lim_n P[\tau \ne \tau_{n0}] \quad \text{for the 0-1 loss function } L. \qquad \ldots (2.6)$$
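For concreteness, the following minimal sketch (ours, not the authors' code; univariate observations, the Euclidean norm on the stacked m-vector, and all function and variable names are illustrative assumptions) carries out the sub-sample 1-NN rule (2.2): every ordered sub-sample of size m from each identified class is treated as one training vector, and Z is assigned the class of the nearest such vector.

```python
import itertools
import numpy as np

def subsample_1nn_classify(z, samples):
    """Classify the m-vector z by the sub-sample 1-NN rule (2.2): every
    ordered sub-sample of size m from each identified class is one training
    vector; z gets the class of the nearest such vector (Euclidean norm on
    the stacked vector).  Illustrative sketch only."""
    z = np.asarray(z, dtype=float)
    m = z.size
    best_label, best_dist = None, np.inf
    for label, x in samples.items():
        # every ordered sub-sample (permutation) of size m from this class
        for sub in itertools.permutations(np.asarray(x, dtype=float), m):
            d = np.linalg.norm(np.array(sub) - z)
            if d < best_dist:
                best_dist, best_label = d, label
    return best_label

# toy usage: two univariate normal populations, n_1 = n_2 = 10, m = 3
rng = np.random.default_rng(0)
train = {1: rng.normal(0.0, 1.0, 10), 2: rng.normal(2.0, 1.0, 10)}
z = rng.normal(2.0, 1.0, 3)        # the multiple observations to classify
print(subsample_1nn_classify(z, train))
```

With n_j around 10 and m ≤ 3, as in the Monte Carlo study of Section 4, the ^{n_j}P_m sub-samples per class remain manageable; for larger samples the enumeration grows quickly, which is the computational cost noted in the concluding remarks.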
2.2. Bounds on R_{(s)}(m). First we prove two lemmas: the first of these establishes the almost sure convergence of Y_{n0} to Z, and the second the probabilistic relationships among the positive τ_k's, k = 1, 2, ..., N. These lemmas are needed for Theorem 2.1 and also for establishing the consistency of the proposed asymptotic NN-risk estimator in Section 3.

Lemma 2.1. Let Z and Y_1, Y_2, ..., Y_N be the random vectors as defined above in (2.1a), where k_m(n) of the Y_k's with τ_k > 0 are mutually independent and k_m(n) →_{a.s.} ∞ as n → ∞. Let Y_{n0} denote the nearest neighbour of Z from the set {Y_1, Y_2, ..., Y_N}, as defined in (2.2). Then, as n → ∞,

$$Y_{n0} \to Z \quad \text{a.s.} \qquad \ldots (2.7)$$
Proof. For any ε > 0 and m fixed, we have

$$\begin{aligned}
P\{\|Y_{n0} - Z\| > \varepsilon\} &= P\Bigl\{\min_{\{j : \tau_j > 0\}} \|Y_j - Z\| > \varepsilon\Bigr\} = P\Bigl[\bigcap_{\{k : \tau_k > 0\}} \{\|Y_k - Z\| > \varepsilon\}\Bigr] \\
&\le E\bigl(P\{\|Y_1 - Z\| > \varepsilon, \tau_1 > 0\}\bigr)^{k_m(n)} \le \bigl(P\{\|Y_1 - Z\| > \varepsilon, \tau_1 > 0\}\bigr)^{(n-ms)/m} \to 0, \qquad \ldots (2.8)
\end{aligned}$$
as n → ∞. Since ∥Y_{n0} − Z∥ is nonincreasing in n, it follows that

$$\lim_{n \to \infty} P\Bigl[\bigcup_{\nu \ge n} \{\|Y_{\nu 0} - Z\| > \varepsilon\}\Bigr] = \lim_{n \to \infty} P\bigl[\|Y_{n0} - Z\| > \varepsilon\bigr],$$

so that by (2.8) we have Y_{n0} → Z a.s. as n → ∞. □
For stating Lemma 2.2, we introduce some notation: for a ν, 0 ≤ ν ≤ m, define the event
$$A_\nu = \{(k, k') : 1 \le k, k' \le N, \text{ with } \tau_k > 0,\ \tau_{k'} > 0 \text{ and } Y_k \text{ and } Y_{k'} \text{ have } \nu \text{ components in common}\}, \qquad \ldots (2.9)$$

and the equivalent one

$$A'_\nu = \{(l, l') : \tau_k = \tau_l > 0,\ \tau_{k'} = \tau_{l'} > 0,\ k \leftrightarrow l = (i_1, i_2, \ldots, i_m),\ k' \leftrightarrow l' = (i'_1, i'_2, \ldots, i'_m) \text{ such that } l \text{ and } l' \text{ have } \nu \text{ components in common}\}. \qquad \ldots (2.9a)$$
Then clearly A_0 (A'_0) consists of all pairs of distinct labels k (l) and k' (l') with τ_k = τ_l > 0, τ_{k'} = τ_{l'} > 0 and k ↔ l and k' ↔ l' having no components in common; the set A_m (A'_m) is the degenerate case ν = m, in which k = k'. We shall show in Lemma 3.1 below that the cardinality #(A_ν) = #(A'_ν) = O_p(n^{2m−ν}), as n → ∞. We now state

Lemma 2.2. For any fixed pair (k, k') ∈ A_ν (or (l, l') ∈ A'_ν) and any pair (j, j'), 1 ≤ j, j' ≤ s, we have on the event A_ν, 0 ≤ ν ≤ m:

(i) If ν = 0 and k ≠ k',

$$P(\tau_k = j, \tau_{k'} = j' \mid \tau_k > 0, \tau_{k'} > 0) = \xi_j \xi_{j'} = \xi_j^{*m} \xi_{j'}^{*m} \Bigl/ \Bigl(\sum_{t=1}^{s} \xi_t^{*m}\Bigr)^2;$$
(ii) If 1 ≤ ν ≤ m − 1 and k ≠ k',

$$P(\tau_k = j, \tau_{k'} = j \mid \tau_k > 0, \tau_{k'} > 0) = \xi_j^{*2m-\nu} \Bigl/ \sum_{t=1}^{s} \xi_t^{*2m-\nu};$$
(iii) P(τ_k = j | τ_k > 0) = ξ_j, where ξ_j is defined in (2.3) above.

Proof. First note that for any (k, k') ∈ A_0 and 1 ≤ j, j' ≤ s,

$$P(\tau_k = j, \tau_{k'} = j' \mid \tau_k > 0, \tau_{k'} > 0) = \frac{P(\tau_k = j, \tau_{k'} = j')}{\sum_{j, j'=1}^{s} P(\tau_k = j, \tau_{k'} = j')}, \qquad \ldots (2.10)$$
where for j ≠ j',

$$\begin{aligned}
P(\tau_k = j, \tau_{k'} = j') &= E\{P(\tau_k = j, \tau_{k'} = j' \mid n_j, n_{j'})\} = E\bigl\{({}^{n_j}P_m)({}^{n_{j'}}P_m) \big/ {}^{n}P_{2m}\bigr\} \\
&= ({}^{n}P_{2m})^{-1} \sum_{n_j, n_{j'}=m}^{n-m} ({}^{n_j}P_m)({}^{n_{j'}}P_m) \binom{n}{n_j,\, n_{j'}} \xi_j^{*n_j} \xi_{j'}^{*n_{j'}} (1 - \xi_j^* - \xi_{j'}^*)^{n - n_j - n_{j'}} = \xi_j^{*m} \xi_{j'}^{*m}, \qquad \ldots (2.11)
\end{aligned}$$
the last equality following by elementary computations; similarly, for j = j',

$$\begin{aligned}
P(\tau_k = j, \tau_{k'} = j) &= E\,P[\tau_k = j, \tau_{k'} = j \mid n_j] = E\Bigl[\binom{n_j}{m,\, m}(m!)^2\Bigr] \Big/ {}^{n}P_{2m} \\
&= ({}^{n}P_{2m})^{-1} \sum_{n_j=m}^{n} \binom{n_j}{m,\, m}(m!)^2 \binom{n}{n_j} \xi_j^{*n_j} (1 - \xi_j^*)^{n - n_j} \qquad \ldots (2.11a) \\
&= \xi_j^{*2m}.
\end{aligned}$$
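(The "elementary computations" in (2.11) and (2.11a) amount to the standard factorial-moment identity for the multinomial distribution, recorded here for convenience: if (n_1, ..., n_s) is multinomial(n; ξ_1^*, ..., ξ_s^*), then

$$E\bigl[({}^{n_j}P_{r})({}^{n_{j'}}P_{r'})\bigr] = ({}^{n}P_{r+r'})\, \xi_j^{*r}\, \xi_{j'}^{*r'} \quad (j \ne j'), \qquad E\bigl[{}^{n_j}P_{r}\bigr] = ({}^{n}P_{r})\, \xi_j^{*r};$$

taking r = r' = m gives (2.11), and taking r = 2m gives (2.11a).)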
From (2.11) and (2.11a), we obtain

$$\sum_{j, j'=1}^{s} P(\tau_k = j, \tau_{k'} = j') = \Bigl(\sum_{j=1}^{s} \xi_j^{*m}\Bigr)^2, \qquad \ldots (2.12)$$
so that part (i) of the lemma follows from (2.10), (2.11) and (2.12). For part (ii), arguing as for part (i), for any pair (k, k') ∈ A_ν and for any 1 ≤ j ≤ s,

$$P(\tau_k = j, \tau_{k'} = j \mid \tau_k > 0, \tau_{k'} > 0) = \frac{E\,P(\tau_k = j, \tau_{k'} = j \mid n_j)}{\sum_{j=1}^{s} E\,P(\tau_k = j, \tau_{k'} = j \mid n_j)}, \qquad \ldots (2.13)$$
with

$$\begin{aligned}
E\,P(\tau_k = j, \tau_{k'} = j \mid n_j) &= E\Bigl[\binom{n_j}{m,\, m-\nu}\, m!\,(m-\nu)!\Bigr] \Big/ {}^{n}P_{2m-\nu} \\
&= ({}^{n}P_{2m-\nu})^{-1} \sum_{n_j=m}^{n} \binom{n_j}{m,\, m-\nu} (m!)(m-\nu)! \binom{n}{n_j} \xi_j^{*n_j} (1 - \xi_j^*)^{n - n_j} \qquad \ldots (2.14) \\
&= \xi_j^{*2m-\nu};
\end{aligned}$$

(2.13) and (2.14) yield the proof of part (ii). The proof of part (iii) follows analogously from that of part (ii) by setting ν = m therein, this being the degenerate case when k = k'. The proof is complete. □

Remark 2.1. From Lemma 2.2 (i) and equation (2.3a), it follows that for any (k, k') ∈ A_0 and any 1 ≤ j, j' ≤ s,

$$P(\tau_k = j, \tau_{k'} = j' \mid Y_k = y_k, Y_{k'} = y_{k'}, \tau_k > 0, \tau_{k'} > 0)$$
$$= \frac{\xi_j \xi_{j'}\, f_j^*(y_k)\, f_{j'}^*(y_{k'})}{\sum_{j, j'=1}^{s} \xi_j \xi_{j'}\, f_j^*(y_k)\, f_{j'}^*(y_{k'})} \qquad \ldots (2.15)$$

$$= P(\tau_k = j \mid Y_k = y_k, \tau_k > 0) \cdot P(\tau_{k'} = j' \mid Y_{k'} = y_{k'}, \tau_{k'} > 0) = \eta_j^*(y_k)\, \eta_{j'}^*(y_{k'}),$$

where f_j^*(y) = ∏_{i=1}^{m} f_j(y_i). It should be pointed out that equation (2.15) is a consequence of (k, k') ∈ A_0 and the resulting independence of (Y_k, τ_k) and (Y_{k'}, τ_{k'}). In general, however, for an equation of the type (2.15) to hold, simply (k, k') ∈ A_0 and the conditional independence of τ_k and τ_{k'} given (Y_k, Y_{k'}) is all that is required (see (2.18) below). Now we state a theorem that provides the generalized bounds on R_{(s)}(m):

Theorem 2.1. Under the model and the assumptions of subsection 2.1, the limiting NN-risk R_{(s)}(m) of the "sub-sample" classification rule {δ_n} (defined by (2.2)) is given by
$$R_{(s)}(m) = \lim_{n \to \infty} R_{(s)}(m; \delta_n) = \lim_{n \to \infty} E\Bigl\{\sum_{j \ne j'} \eta_j^*(z)\, \eta_{j'}^*(z)\Bigr\} \qquad \ldots (2.16)$$
and satisfies the bounds

$$R^*_{(s)}(m) \le R_{(s)}(m) \le R^*_{(s)}(m)\Bigl(2 - \frac{s}{s-1}\,R^*_{(s)}(m)\Bigr), \qquad \ldots (2.17)$$
where R^*_{(s)}(m) is the Bayes risk given by (2.5).

Proof. The conditional NN risk of {δ_n} given Z = z and Y_{n0} = y_{n0} is given by

$$\begin{aligned}
\gamma(z; y_{n0}) &= E\{L(\tau, \tau_{n0}) \mid z, y_{n0}\} = P(\tau \ne \tau_{n0} \mid z, y_{n0}) \\
&= \sum_{j \ne j'} P(\tau = j, \tau_{n0} = j' \mid z, y_{n0}) \\
&= \sum_{j \ne j'} P(\tau = j \mid z)\, P(\tau_{n0} = j' \mid y_{n0}) \qquad \ldots (2.18) \\
&= \sum_{j \ne j'} \eta_j^*(z)\, \eta_{j'}^*(y_{n0}),
\end{aligned}$$
the last but one equality in (2.18) following from the conditional independence of τ and τ_{n0} given (z, y_{n0}) (the dependence between τ and τ_{n0} is only through Z and Y_{n0}). Now, assuming w.l.o.g., with probability one, that z_1, z_2, ..., z_m are continuity points of f_j, j = 1, ..., s, and that f_j(z_i) > 0 for j = 1, ..., s and
i = 1, 2, ..., m, it follows that z is a continuity point of η_j^*, j = 1, 2, ..., s, so that, as n → ∞, γ(z, Y_{n0}) → γ(z) a.s., where

$$\gamma(z) = \sum_{i \ne j} \eta_i^*(z)\, \eta_j^*(z) = 1 - \sum_{j=1}^{s} \eta_j^{*2}(z). \qquad \ldots (2.19)$$
By letting η_t^*(z) = max_j {η_j^*(z)}, the conditional Bayes risk γ^*(z) in (2.4) can be written as γ^*(z) = 1 − max_j{η_j^*(z)} = 1 − η_t^*(z), so that in view of the Cauchy-Schwarz inequality we get

$$(s-1) \sum_{j \ne t} \eta_j^{*2}(z) \ge \Bigl[\sum_{j \ne t} \eta_j^*(z)\Bigr]^2 = [1 - \eta_t^*(z)]^2 = \gamma^{*2}(z).$$
Now, adding (s − 1)η_t^{*2}(z) to both sides in the above, we have

$$\sum_{j=1}^{s} \eta_j^{*2}(z) \ge \frac{(\gamma^*(z))^2}{s-1} + (1 - \gamma^*(z))^2,$$

so that from (2.19), we get

$$\gamma(z) \le 2\gamma^*(z) - \frac{s}{s-1}(\gamma^*(z))^2. \qquad \ldots (2.20)$$
Now, by the Lebesgue dominated convergence theorem, (2.20) and Jensen's inequality, we obtain

$$R_{(s)}(m) = \lim_n E\{\gamma(Z; Y_{n0})\} = E\{\gamma(Z)\} \le R^*_{(s)}(m)\Bigl(2 - \frac{s}{s-1}\,R^*_{(s)}(m)\Bigr).$$
This proves the right inequality in (2.17). The left inequality follows by noting from (2.19) that

$$\gamma(z) = 1 - \sum_{j=1}^{s} \eta_j^{*2}(z) \ge 1 - \max_j \eta_j^*(z) \sum_{j=1}^{s} \eta_j^*(z) = 1 - \max_j \eta_j^*(z) = \gamma^*(z)$$

and integrating both sides w.r.t. the marginal density f^*(z) of Z given in (2.3a). This proves the left inequality R^*_{(s)}(m) ≤ R_{(s)}(m). The proof is complete. □
3. Estimation of asymptotic risk R_{(s)}(m)
Let {(Y_k, τ_k) : τ_k > 0, k = 1, ..., N}, N = ^nP_m, be the "transformed" training sample, as defined by (2.1) and (2.1a), of identically distributed random vectors, and let τ'_{kn} be the NN estimate of τ_k (> 0), for 1 ≤ k ≤ N, based on the balance of the transformed "training" sample. The proposed cross-validation estimator for the asymptotic NN-risk is then defined by

$$\hat{p}_n = \frac{1}{N'} \sum_{k=1}^{N} I_{[\tau'_{kn} \ne \tau_k,\ \tau'_{kn} > 0,\ \tau_k > 0]}. \qquad \ldots (3.1)$$

Further, set

$$U_n = \frac{1}{N'} \sum_{k=1}^{N} E\bigl(I_{[\tau'_{kn} \ne \tau_k,\ \tau'_{kn} > 0,\ \tau_k > 0]}\bigr), \qquad \ldots (3.2)$$
where I_B denotes the indicator function of a set B, and E[·] (P[·]) the conditional expectation (probability) given n = (n_1, n_2, ..., n_s)'. In (3.7) and (3.8) below, it is proved that U_n →_{a.s.} R_{(s)}(m) and E[U_n] → R_{(s)}(m) as n → ∞. The latter convergence entails immediately that E(p̂_n) → R_{(s)}(m), so that p̂_n is an asymptotically unbiased estimator of the asymptotic NN risk R_{(s)}(m). We shall establish in Theorem 3.1 below that p̂_n, in fact, is a mean square (and, therefore, in probability) consistent estimator of R_{(s)}(m), as n → ∞. To accomplish this, we first state a straightforward result (Lemma 3.1 below), for which we need some additional notation: First recall the notation A_ν, 0 ≤ ν ≤ m, defined in (2.9) above. For a ν, 0 ≤ ν ≤ m, and (k, k') ∈ A_ν, define for a ν' with 0 ≤ ν + ν' ≤ 2m the event

$$A^*_{\nu\nu', kk'} = \{(l_k, l'_k, l_{k'}, l'_{k'}) : (Y_k, Y'_{kn}) \text{ and } (Y_{k'}, Y'_{k'n}) \text{ have } \nu + \nu' \text{ components in common, where } Y_k = Y_{l_k},\ Y'_{kn} = Y_{l'_k},\ Y_{k'} = Y_{l_{k'}},\ Y'_{k'n} = Y_{l'_{k'}}\}. \qquad \ldots (3.3)$$

Also note that since Y_k and Y'_{kn} and, similarly, Y_{k'} and Y'_{k'n} have no components in common, for each ν, 0 ≤ ν ≤ m, and pair (k, k') ∈ A_ν, the collection {A^*_{\nu\nu', kk'} : 0 ≤ ν' ≤ 2m − ν} is exhaustive.
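To illustrate (3.1) computationally, here is a minimal sketch (ours; it takes "the balance of the training sample" to mean leaving out only the sub-sample Y_k itself, and all names are illustrative) of the cross-validation NN-risk estimate p̂_n for univariate data.

```python
import itertools
import numpy as np

def cv_nn_risk_estimate(samples, m):
    """Cross-validation estimate p-hat_n of the asymptotic NN risk, as in (3.1):
    leave each sub-sample Y_k out, find its nearest neighbour among the
    remaining sub-samples with positive labels, and record a misclassification
    whenever the two labels disagree.  Sketch only; leaving out just Y_k itself
    is one reading of "the balance of the training sample"."""
    Y, tau = [], []
    for label, x in samples.items():
        for sub in itertools.permutations(np.asarray(x, dtype=float), m):
            Y.append(np.array(sub))
            tau.append(label)
    Y, tau = np.stack(Y), np.array(tau)
    errors = 0
    for k in range(len(Y)):
        d = np.linalg.norm(Y - Y[k], axis=1)
        d[k] = np.inf                      # exclude Y_k itself
        nn = int(np.argmin(d))             # tau[nn] plays the role of tau'_{kn}
        errors += int(tau[nn] != tau[k])
    return errors / len(Y)                 # average over the N' sub-samples

rng = np.random.default_rng(1)
train = {1: rng.normal(0.0, 1.0, 8), 2: rng.normal(2.0, 1.0, 8)}
print(cv_nn_risk_estimate(train, m=2))
```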
Lemma 3.1. (i) Conditionally given n, the cardinalities #(A_ν) of the sets A_ν, 0 ≤ ν ≤ m, are given by

$$\begin{aligned}
\#(A_0) &= \sum_{j=1}^{s} ({}^{n_j}P_{2m}) + \sum_{j \ne j'} ({}^{n_j}P_m)({}^{n_{j'}}P_m); \\
\#(A_\nu) &= \sum_{j=1}^{s} ({}^{n_j}P_m) \binom{m}{\nu} \binom{n_j - m}{m - \nu} (m!) \quad \text{for } 1 \le \nu \le m-1; \\
\#(A_m) &= \sum_{j=1}^{s} ({}^{n_j}P_m);
\end{aligned}$$
(ii) consequently (unconditionally), #(A_ν) = O_p(n^{2m−ν}), 0 ≤ ν ≤ m;

(iii) conditionally given n, the cardinalities #(A^*_{\nu\nu', kk'}) of the sets A^*_{\nu\nu', kk'}, 0 ≤ ν + ν' ≤ 2m, satisfy #(A^*_{\nu\nu', kk'}) = O(max_{1 ≤ j ≤ s} n_j^{4m−ν−ν'}), and consequently (unconditionally) #(A^*_{\nu\nu', kk'}) = O_p(n^{4m−ν−ν'}), as n → ∞, 0 ≤ ν + ν' ≤ 2m.

Proof. Part (i) of the lemma follows from the definitions of the sets A_ν, 0 ≤ ν ≤ m, and straightforward combinatorial reasoning; part (ii) follows as a consequence in view of (n_j/n) →_{a.s.} ξ_j for 1 ≤ j ≤ s, as n → ∞. For part (iii) one can easily see that, conditionally given n,
$$\begin{aligned}
\#(A^*_{00,kk'}) &= \sum_{j=1}^{s} ({}^{n_j}P_{4m}) + \sum_{j \ne j'} \bigl(({}^{n_j}P_{3m})({}^{n_{j'}}P_m) + ({}^{n_j}P_{2m})({}^{n_{j'}}P_{2m})\bigr) \\
&\quad + \sum_{j \ne j' \ne j''} ({}^{n_j}P_m)({}^{n_{j'}}P_{2m})({}^{n_{j''}}P_m) \\
&\quad + \sum_{j \ne j' \ne j'' \ne j'''} ({}^{n_j}P_m)({}^{n_{j'}}P_m)({}^{n_{j''}}P_m)({}^{n_{j'''}}P_m) \\
&= O(n^{4m}).
\end{aligned}$$

Similarly, #(A^*_{\nu\nu', kk'}) = O(n^{4m−ν−ν'}), 1 ≤ ν + ν' ≤ 2m. The proof is complete. □
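As a sanity check on the combinatorics (a small brute-force verification of Lemma 3.1(i) and of the identity noted in Remark 3.1 below; the code and its names are ours and purely illustrative), one can enumerate all ordered sub-samples for tiny class sizes and count shared components directly:

```python
from itertools import permutations
from math import comb, factorial, perm

def check_cardinalities(ns, m):
    """Brute-force check of Lemma 3.1(i) and the Remark 3.1 identity for small
    class sizes ns = (n_1, ..., n_s).  Illustrative sketch only."""
    subs = []                                  # all ordered sub-samples with positive label
    for j, nj in enumerate(ns):
        subs.extend(permutations([(j, i) for i in range(nj)], m))
    n_prime = len(subs)                        # N' = sum_j  n_j P m
    shared = [0] * (m + 1)                     # ordered pairs k != k', by shared components
    for a in range(n_prime):
        for b in range(n_prime):
            if a != b:
                shared[len(set(subs[a]) & set(subs[b]))] += 1
    # cardinalities claimed in Lemma 3.1(i)
    f0 = (sum(perm(nj, 2 * m) for nj in ns)
          + sum(perm(ni, m) * perm(nj, m) for ni in ns for nj in ns)
          - sum(perm(nj, m) ** 2 for nj in ns))
    f_nu = [sum(perm(nj, m) * comb(m, v) * comb(nj - m, m - v) * factorial(m)
                for nj in ns if nj >= m) for v in range(1, m)]
    f_m = sum(perm(nj, m) for nj in ns)        # #(A_m) = N'
    assert shared[0] == f0
    assert all(shared[v] == f_nu[v - 1] for v in range(1, m))
    assert shared[m] + n_prime == factorial(m) * f_m
    assert f0 + sum(f_nu) + factorial(m) * f_m == n_prime ** 2   # Remark 3.1
    return n_prime

print(check_cardinalities((4, 3), 2))          # all checks pass; prints N' = 18
```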
Remark 3.1. One can easily verify that Σ_{ν=0}^{m−1} #(A_ν) + (m!)#(A_m) = N'^2; the factor (m!) in the second term of the above identity appears since the set A_m corresponds to the degenerate case k = k' in the definition of the sets A_ν in (2.9). A similar identity can be verified for the cardinalities #(A^*_ν) of the sets A^*_ν.

Theorem 3.1. Under the model and assumptions of Theorem 2.1,

$$E[\hat{p}_n - R_{(s)}(m)]^2 = o(1), \quad \text{as } n \to \infty,$$

where p̂_n and R_{(s)}(m) are defined in (3.1) and (2.6), respectively.
Proof. Using (3.1) and (3.2), we obtain

$$E[\hat{p}_n - R_{(s)}(m)]^2 \le 2E[\hat{p}_n - U_n]^2 + 2E[U_n - R_{(s)}(m)]^2, \qquad \ldots (3.4)$$
where for the second term in (3.4) we note, using the conditional independence of τ_k and τ'_{kn} given (Y_k, Y'_{kn}), that

$$\begin{aligned}
U_n &= \frac{1}{N'} \sum_{k=1}^{N} E\bigl\{P[\tau_k \ne \tau'_{kn}, \tau_k > 0, \tau'_{kn} > 0 \mid Y_k, Y'_{kn}]\bigr\} \\
&= \frac{1}{N'} \sum_{k=1}^{N} \sum_{j \ne j'} E\bigl\{P[\tau_k = j, \tau'_{kn} = j' \mid Y_k, Y'_{kn}]\bigr\} \\
&= \frac{1}{N'} \sum_{k=1}^{N} \sum_{j \ne j'} E\Biggl\{ \frac{({}^{n_j}P_m)({}^{n_{j'}}P_m) f_j^*(Y_k) f_{j'}^*(Y'_{kn})}{\sum_{t \ne t'} ({}^{n_t}P_m)({}^{n_{t'}}P_m) f_t^*(Y_k) f_{t'}^*(Y'_{kn}) + \sum_{t=1}^{s} ({}^{n_t}P_{2m}) f_t^*(Y_k) f_t^*(Y'_{kn})} \Biggr\} \\
&= \frac{1}{N'} \sum_{k=1}^{N} \sum_{j \ne j'} E\bigl\{ \Psi_n(n, Y_k, Y'_{kn})\, \bar\eta_{jn}(Y_k)\, \bar\eta_{j'n}(Y'_{kn}) \bigr\}, \ \text{say}, \qquad \ldots (3.5)
\end{aligned}$$

where f_j^*(·) is given in (2.15) and, in view of [n_j/n] → ξ_j, 1 ≤ j ≤ s, and Y'_{kn} →_{a.s.} Y_k, as n → ∞ (Lemma 2.1),
$$\Psi_n(n, Y_k, Y'_{kn}) = \frac{\sum_{t \ne t'} ({}^{n_t}P_m)({}^{n_{t'}}P_m) f_t^*(Y_k) f_{t'}^*(Y'_{kn}) + \sum_{t=1}^{s} ({}^{n_t}P_m)^2 f_t^*(Y_k) f_t^*(Y'_{kn})}{\sum_{t \ne t'} ({}^{n_t}P_m)({}^{n_{t'}}P_m) f_t^*(Y_k) f_{t'}^*(Y'_{kn}) + \sum_{t=1}^{s} ({}^{n_t}P_{2m}) f_t^*(Y_k) f_t^*(Y'_{kn})} \to_{a.s.} 1, \quad \text{as } n \to \infty, \qquad \ldots (3.6)$$
and

$$\lim_{n \to \infty} \bar\eta_{j'n}(Y'_{kn}) = \lim_{n \to \infty} \frac{({}^{n_{j'}}P_m)\, f_{j'}^*(Y'_{kn})}{\sum_{t=1}^{s} ({}^{n_t}P_m)\, f_t^*(Y'_{kn})} = \eta_{j'}^*(Y_k) \quad \text{a.s.}, \qquad \ldots (3.6a)$$

$$\lim_{n \to \infty} \bar\eta_{jn}(Y_k) = \eta_j^*(Y_k) \quad \text{a.s.} \qquad \ldots (3.6b)$$
Thus, by (3.6) to (3.6b) and the bounded convergence theorem, all terms in the first summation on the RHS of (3.5) converge a.s. and, therefore, so does their average:

$$U_n \to_{a.s.} R_{(s)}(m), \qquad \ldots (3.7)$$
as n → ∞. Thus, by the dominated convergence theorem,

$$E[U_n] \to R_{(s)}(m) \qquad \ldots (3.8)$$

and

$$E[U_n - R_{(s)}(m)]^2 \to 0, \qquad \ldots (3.8a)$$
as n → ∞. We now deal with the first term in (3.4). First note that, writing A^*_{0,kk'} for A^*_{00,kk'}, we have

$$\begin{aligned}
E(\hat{p}_n - U_n)^2 &= E\Biggl\{ \frac{1}{N'^2} \sum_{k=1}^{N} E\Bigl( I_{(\tau_k \ne \tau'_{kn},\, \tau_k > 0,\, \tau'_{kn} > 0)} - P(\tau_k \ne \tau'_{kn}, \tau_k > 0, \tau'_{kn} > 0) \Bigr)^2 \\
&\qquad + \frac{1}{N'^2} \sum_{k \ne k'} \Bigl( P(\tau_k \ne \tau'_{kn}, \tau_{k'} \ne \tau'_{k'n}, \tau_k > 0, \tau'_{kn} > 0, \tau_{k'} > 0, \tau'_{k'n} > 0) \\
&\qquad\qquad - P(\tau_k \ne \tau'_{kn}, \tau_k > 0, \tau'_{kn} > 0)\, P(\tau_{k'} \ne \tau'_{k'n}, \tau_{k'} > 0, \tau'_{k'n} > 0) \Bigr) \Biggr\} \\
&\le E\Bigl\{ (4N')^{-1} + N'^{-2} \sum_{(k,k') \in A_0} \bigl( P(\tau_k \ne \tau'_{kn}, \tau_{k'} \ne \tau'_{k'n}) - P(\tau_k \ne \tau'_{kn})\, P(\tau_{k'} \ne \tau'_{k'n}) \bigr) + 2\bigl(\#(A_0^c)/N'^2\bigr) \Bigr\} \\
&\le E\Bigl\{ N'^{-2} \sum_{(k,k') \in A_0} \bigl( E\,P(A^*_{0,kk'}, \tau_k \ne \tau'_{kn}, \tau_{k'} \ne \tau'_{k'n} \mid Y_k, Y'_{kn}, Y_{k'}, Y'_{k'n}) \\
&\qquad\qquad - E\,P(\tau_k \ne \tau'_{kn} \mid Y_k, Y'_{kn})\, E\,P(\tau_{k'} \ne \tau'_{k'n} \mid Y_{k'}, Y'_{k'n}) \bigr) \Bigr\} + O(n^{-1}), \qquad \ldots (3.9)
\end{aligned}$$

the last inequality in (3.9) following by Lemma 3.1 parts (ii) and (iii), which imply, respectively, that

$$[\#(A_0^c)/N'^2] = O_p(n^{-1}) \qquad \ldots (3.10)$$

and

$$P(A^{*c}_{0,kk'}) = \#(A^{*c}_{0,kk'}) \Bigl/ \bigl[\#(A^*_{0,kk'}) + \#(A^{*c}_{0,kk'})\bigr] = O_p(n^{-1}), \qquad \ldots (3.10a)$$

as n → ∞. Further, for the first term in the last summation in (3.9), we have, using (3.10a) again,

$$\begin{aligned}
E\,P[A^*_{0,kk'}, \tau_k \ne \tau'_{kn}, \tau_{k'} \ne \tau'_{k'n} \mid Y_k, Y'_{kn}, Y_{k'}, Y'_{k'n}]
&= E\,P[A^*_{0,kk'}, \tau_k \ne \tau'_{kn} \mid Y_k, Y'_{kn}]\, E\,P[A^*_{0,kk'}, \tau_{k'} \ne \tau'_{k'n} \mid Y_{k'}, Y'_{k'n}] + O_p(n^{-1}) \\
&= E\,P[\tau_k \ne \tau'_{kn} \mid Y_k, Y'_{kn}]\, E\,P[\tau_{k'} \ne \tau'_{k'n} \mid Y_{k'}, Y'_{k'n}] + O_p(n^{-1}). \qquad \ldots (3.11)
\end{aligned}$$

In view of (3.10) and (3.11), and the fact that the order terms in (3.10a) and, therefore, in (3.11) are bounded, the inequality (3.9) yields
$$E(\hat{p}_n - U_n)^2 = O(n^{-1}). \qquad \ldots (3.12)$$

The assertion of the theorem thus follows from (3.4), (3.8a) and (3.12). □
4. Monte Carlo results
We have studied above, in Sections 2 and 3, the asymptotic properties of the proposed sub-sample classification rule and of its asymptotic risk estimator, but these properties may not hold in small sample situations. In this section we report the results of a small sample Monte Carlo study of the error-rate of the proposed sub-sample 1-NN rule in comparison with that of the first stage rank nearest neighbour (RNN) rule described below. In this simulation study we consider only the classification of multiple univariate observations between two populations (i.e. s = 2).

First stage RNN rule (see Bagui, 1989): Let {X_{11}, X_{12}, ..., X_{1n_1}} and {X_{21}, X_{22}, ..., X_{2n_2}} be random samples from two univariate populations π_1 and π_2, respectively. A random sample Z = {Z_1, Z_2, ..., Z_m} is taken from one of these two populations. Combine the X_{1i}'s, X_{2j}'s and Z's and arrange them in increasing order; the left- and right-hand rank nearest neighbours of Z_l, for each l = 1, 2, ..., m, are identified. Then classify Z to the population π_1 with probability 1 if the number of RNNs from π_1 exceeds the number of RNNs from π_2; classify Z to π_1 or π_2 with probability 1/2 each if the two counts are equal; and otherwise classify Z to π_2 with probability 1.

For simplicity, we take n_1 = n_2 = 10 and vary m from 1 to 3. For given X_{1i}'s and X_{2j}'s we classify 200 random samples Z = (Z_1, ..., Z_m), 100 from each population, using the sub-sample procedure and the 1st-stage RNN rule. The average proportion of misclassifications (APM) was calculated for both the sub-sample procedure and the 1st-stage RNN rule for different pairs of distributions. The results are given in Tables 4.1, 4.2 and 4.3.
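The following sketch (ours, and only one reading of the rule above: rank neighbours are taken from the training observations, skipping any intervening Z's; all names are illustrative) shows how the 1st-stage RNN classification used in the comparison can be carried out.

```python
import numpy as np

def first_stage_rnn_classify(z, x1, x2, rng):
    """1st-stage RNN rule as described above (our sketch): pool x1, x2 and the
    Z's, sort them, and for each Z_l count whether its left and right rank
    nearest neighbours among the training observations come from pi_1 or pi_2;
    classify by majority count, breaking ties at random."""
    pool = ([(float(v), 1) for v in x1] + [(float(v), 2) for v in x2]
            + [(float(v), 0) for v in z])          # label 0 marks the Z's
    pool.sort(key=lambda t: t[0])                  # combined sample in increasing order
    counts = {1: 0, 2: 0}
    for i, (_, lab) in enumerate(pool):
        if lab != 0:
            continue
        for j in range(i - 1, -1, -1):             # left rank nearest neighbour
            if pool[j][1] != 0:
                counts[pool[j][1]] += 1
                break
        for j in range(i + 1, len(pool)):          # right rank nearest neighbour
            if pool[j][1] != 0:
                counts[pool[j][1]] += 1
                break
    if counts[1] != counts[2]:
        return 1 if counts[1] > counts[2] else 2
    return int(rng.choice([1, 2]))                 # tie: either population with probability 1/2

# toy usage mirroring the study: n_1 = n_2 = 10, m = 3
rng = np.random.default_rng(2)
x1, x2 = rng.normal(0, 1, 10), rng.normal(1, 1, 10)
print(first_stage_rnn_classify(rng.normal(1, 1, 3), x1, x2, rng))
```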
Table 4.1. COMPARISONS BETWEEN THE SUBSAMPLE PROCEDURE AND THE 1ST-STAGE RNN RULE FOR DIFFERENT PAIRS OF NORMAL DISTRIBUTIONS

        N(0,1) vs. N(1,1)    N(0,1) vs. N(2,1)    N(0,1) vs. N(3,1)
  m       I        II          I        II          I        II
  1     0.3300   0.3500      0.3500   0.3600      0.1000   0.1175
  2     0.2600   0.3375      0.2450   0.2975      0.0250   0.0525
  3     0.2650   0.3000      0.1500   0.2600      0.0200   0.0400

I = APM's for the sub-sample procedure; II = APM's for the 1st-stage RNN rule.

Table 4.2. COMPARISONS BETWEEN THE SUBSAMPLE PROCEDURE AND THE 1ST-STAGE RNN RULE FOR DIFFERENT PAIRS OF LAPLACE AND LOGISTIC DISTRIBUTIONS
        LA(0,1) vs. LA(1,1)   LA(0,1) vs. LA(2,1)   LG(0,1) vs. LG(2,1)
  m        I        II           I        II           I        II
  1      0.3600   0.3900       0.2350   0.2550       0.3600   0.3625
  2      0.2700   0.2800       0.1700   0.1500       0.3500   0.3300
  3      0.2600   0.2950       0.1750   0.1350       0.2600   0.3125
Table 4.3. COMPARISONS BETWEEN THE SUBSAMPLE PROCEDURE AND THE 1ST-STAGE RNN RULE FOR PAIRS NORMAL VS. LAPLACE AND NORMAL VS. LOGISTIC

        N(0,1) vs. LA(1,1)    N(0,1) vs. LA(2,1)    N(0,1) vs. LG(2,1)
  m        I        II           I        II           I        II
  1      0.4150   0.4450       0.1700   0.1950       0.3450   0.3500
  2      0.2850   0.3425       0.1750   0.1875       0.2400   0.2375
  3      0.3350   0.3600       0.0100   0.1025       0.1600   0.2225

5. Concluding remarks
In the case m = 1, I and II above correspond, respectively, to the Cover and Hart (1967) and the Das Gupta and Lin (1980) single observation 1-NN and RNN classification rules. For increasing m, the APM generally decreases in both situations. We also note from the tables above that the sub-sample procedure has generally performed better than the 1st-stage RNN rule, with a few exceptions (underlined in the tables). In moderately large or large sample cases, the sub-sample procedure involves considerable computational work; one may, accordingly, wish to use the 1st-stage RNN rule in such cases, as it is simple to employ and performs reasonably well in moderately large or large samples (see Bagui, 1989). However, the sub-sample approach seems to be preferable otherwise.

Acknowledgements. The authors wish to thank an anonymous referee for his constructive suggestions in improving the presentation and the readability of the paper.

References
Bagui, S.C. (1989). Nearest neighbor classification rules for multiple observations. Ph.D. thesis, University of Alberta, Edmonton, Canada.
Breiman, L., Friedman, J.H., Olshen, R.A. and Stone, C.J. (1984). Classification and Regression Trees. Wadsworth, Belmont, Calif.
Choi, S.C. (1972). Classification of multiply observed data. Biometrische Zeitschrift, 14, 8-11.
Cover, T.M. and Hart, P.E. (1967). Nearest neighbor pattern classification. IEEE Trans. Inform. Theory, IT-13, 21-26.
Das Gupta, S. (1964). Nonparametric classification rules. Sankhyā A, 26, 25-30.
Das Gupta, S. and Lin, H.E. (1980). Nearest neighbor rules for statistical classification based on ranks. Sankhyā A, 42, 219-230.
Devroye, L. (1981). On the asymptotic probability of error in nonparametric discrimination. Ann. Statist., 9, 1320-1327.
Fix, E. and Hodges, J.L. (1951). Nonparametric discrimination: Consistency properties. U.S. Air Force School of Aviation Medicine, Report No. 4, Randolph Field, Texas.
Fritz, J. (1975). Distribution-free exponential error bound for nearest neighbor pattern classification. IEEE Trans. Inform. Theory, IT-21, 552-557.
Gupta, A.K. and Logan, T.P. (1990). On a multiple observations model in discriminant analysis. J. Statist. Comput. Simul., 34, 119-132.
Hand, D.J. (1981). Discrimination and Classification. John Wiley and Sons, New York.
Johns, M.V. (1961). An empirical Bayes approach to nonparametric two-way classification. In Studies in Item Analysis and Prediction, Ed. H. Solomon. Stanford University Press, Stanford, California.
Matusita, K. (1956). Decision rule, based on the distance, for the classification problem. Ann. Inst. Statist. Math., 8, 67-77.
Quesenberry, C.P. and Gessaman, M.P. (1968). Nonparametric discrimination using tolerance regions. Ann. Math. Statist., 39, 664-673.
Van Ryzin, J. (1966). Bayes risk consistency of classification procedures using density estimation. Sankhyā A, 26, 25-30.
Wagner, T.J. (1971). Convergence of the nearest neighbor rule. IEEE Trans. Inform. Theory, IT-17, 566-571.
Xiru, C. (1985). Exponential posterior error bound for the k-NN discrimination rule. Scientia Sinica A, XXVII, 673-682.
Department of Mathematics and Statistics The University of West Florida Pensacola, FL 32514 USA.