On Strong Consistency of Model Selection in Classification

Joe Suzuki, Member, IEEE
Abstract—This paper considers model selection in classification. In many applications, such as pattern recognition, probabilistic inference using a Bayesian network, and prediction of the next outcome in a sequence based on a Markov chain, the conditional probability $P(Y=y \mid X=x)$ of class $y \in \mathcal{Y}$ given attribute value $x \in \mathcal{X}$ is utilized. By model we mean the equivalence relation in $\mathcal{X}$: for $x, x' \in \mathcal{X}$,

$$x \sim x' \iff P(Y=y \mid X=x) = P(Y=y \mid X=x') \quad \text{for all } y \in \mathcal{Y}.$$

By classification we mean that the number of such equivalence classes is finite. We estimate the model from $n$ samples $z^n = (x_i, y_i)_{i=1}^{n} \in (\mathcal{X} \times \mathcal{Y})^{n}$, using information criteria of the form empirical entropy $H$ plus penalty term $(k/2)d_n$ (the model minimizing $H + (k/2)d_n$ is the estimated model), where $k$ is the number of independent parameters in the model and $\{d_n\}_{n=1}^{\infty}$ is a real nonnegative sequence such that $\limsup_n d_n/n = 0$. For autoregressive processes, although the definitions of $H$ and $k$ are different, it is known that the estimated model almost surely coincides with the true model as $n \to \infty$ if $\{d_n\}_{n=1}^{\infty} > \{2\log\log n\}_{n=1}^{\infty}$, and that it does not if $\{d_n\}_{n=1}^{\infty} < \{2\log\log n\}_{n=1}^{\infty}$ (Hannan and Quinn). Whether the same property holds for classification was an open problem. This paper solves the problem in the affirmative.

Index Terms—Error probability, Hannan and Quinn's procedure, law of the iterated logarithm, Kullback–Leibler divergence, model selection, strong consistency.
I. INTRODUCTION

Let $X$ and $Y$ be random variables taking values in the finite sets $\mathcal{X}$ and $\mathcal{Y}$, respectively. We express the conditional probability of $Y$ given $X$ using a model, which is defined as an equivalence relation in $\mathcal{X}$: for $x, x' \in \mathcal{X}$,

$$x \sim x' \iff P(Y=y \mid X=x) = P(Y=y \mid X=x') \quad \text{for all } y \in \mathcal{Y}. \qquad (1)$$

If we fix a model $m$, $\mathcal{X}$ can be divided into equivalence classes such that, for each $y \in \mathcal{Y}$, the conditional probability $P(Y=y \mid X=x)$ depends on $x$ only through the class containing $x$ for model $m$. We denote by $\mathcal{M}$ the set of such models. By classification we mean that the conditional probability can be expressed by a model with finitely many equivalence classes, which is assumed throughout the current paper. (Notice that the random variables are discrete.)

We consider the following problem: given a finite sequence $z^n = (x_i, y_i)_{i=1}^{n}$ such that 1) $\{(X_i, Y_i)\}_{i=1}^{\infty}$ is a stationary ergodic process, and 2) the pairs $(X_i, Y_i)$ satisfy (1) for some model; and given a subset $\mathcal{M}'$ of $\mathcal{M}$ containing the true model $m^*$, we estimate $m^*$ as well as the associated conditional probabilities. For given $m^*$ and $\mathcal{M}'$, we divide $\mathcal{M}'$ into three sets:

1) the true model $m^*$ itself;
2) the set of overestimating models;
3) the set of underestimating models.
Manuscript received February 12, 2005; revised April 26, 2006. The material in this paper was presented in part at the 6th International Workshop on Artificial Intelligence and Statistics, January 1997. The author is with the Department of Mathematics, Osaka University, Osaka 560-0043, Japan. Communicated by P. L. Bartlett, Associate Editor for Pattern Recognition, Statistical Learning, and Inference. Digital Object Identifier 10.1109/TIT.2006.883611
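For a concrete reading of (1), the following sketch (Python; the function name `model_from_conditional` and the toy probabilities are illustrative choices, not taken from the paper) groups the attribute values whose conditional distributions coincide, so that a model is represented simply as a partition of $\mathcal{X}$.

```python
from collections import defaultdict

def model_from_conditional(cond):
    """Group attribute values x by their conditional distribution P(Y=. | X=x).

    `cond` maps each x to a dict {y: P(Y=y | X=x)}.  Two values x, x' fall into
    the same equivalence class exactly when these dicts coincide, which is
    relation (1).  The partition is returned as a list of sets of x's."""
    groups = defaultdict(set)
    for x, dist in cond.items():
        key = tuple(sorted(dist.items()))   # hashable signature of P(Y=. | X=x)
        groups[key].add(x)
    return list(groups.values())

# Toy conditional probability with X = {0, 1, 2, 3} and Y = {0, 1}:
cond = {0: {0: 0.3, 1: 0.7}, 1: {0: 0.3, 1: 0.7},
        2: {0: 0.9, 1: 0.1}, 3: {0: 0.5, 1: 0.5}}
print(model_from_conditional(cond))   # three equivalence classes: {0, 1}, {2}, {3}
```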
Example 1: Suppose that $\mathcal{X}$, $\mathcal{Y}$, and the conditional probability $P(Y=y \mid X=x)$ are such that ... . Then the three sets consist of ..., ..., and ... .

Example 2: Suppose that $\mathcal{X}$, $\mathcal{Y}$, and $P(Y=y \mid X=x)$ are such that ... . Then the three sets consist of ..., ..., and ... .

If the estimated model is in the second set, the model is classified as overestimated; and if the estimated model is in the third set, the model is classified as underestimated (Atkinson [2]).

Let ... for ... . For ... and ... such that ..., we define ... with respect to ... and ... by
(2)
We define the Kullback divergence by

...

for ... . Throughout the paper, we assume the following.

Assumption 1: For each ..., the stationary ergodic process ... satisfies ... for all ... and ... .

Then, apparently

(3)
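For orientation, the conditional form of the Kullback divergence, comparing the true conditional distribution with the one induced by a candidate model, can be sketched as follows (Python; `kullback` and its arguments are illustrative names, and the exact weighting used in the definition above is not reproduced here).

```python
import math

def kullback(px, p_true, p_model):
    """Conditional Kullback divergence  sum_x P(x) sum_y P*(y|x) log(P*(y|x)/Q(y|x)).

    `px` maps x -> P(x); `p_true` and `p_model` map x -> {y: probability}.
    This is the usual conditional form, shown only as an illustration."""
    d = 0.0
    for x, w in px.items():
        for y, p in p_true[x].items():
            if p > 0.0:
                d += w * p * math.log(p / p_model[x][y])
    return d
```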
Given a sequence $z^n = (x_i, y_i)_{i=1}^{n}$, we define ..., where ... and ... otherwise.

In this paper, we select the model that minimizes the quantity

$$H + \frac{k}{2} d_n \qquad (4)$$

where $\{d_n\}_{n=1}^{\infty}$ is a real nonnegative sequence, $\log$ denotes the natural logarithm, $H$ is the so-called empirical entropy of $y^n$ given $x^n$, and $k$ is interpreted as the number of independent parameters because, for each equivalence class, the number of probabilities to be specified is $|\mathcal{Y}| - 1$. This paper analyzes the model selection procedures based on information criteria of the form (4) in a unified manner, instead of considering each information criterion separately, such as Akaike's information criterion (AIC, $d_n = 2$) [1], the minimum description length (MDL) principle (Rissanen [17], $d_n = \log n$), the Bayesian information criterion (BIC, $d_n = \log n$) [19], etc. For MDL and BIC, (4) expresses a codelength of the observed sequence up to a constant when model $m$ is assumed, while AIC expresses an unbiased estimator of the average code length with respect to model $m$. These information criteria thus admit information-theoretic interpretations.

Let ... and ... . We derive (in Section IV) the asymptotic exact error probability in model selection for each $\{d_n\}_{n=1}^{\infty}$ in (4). Then, if ... converges to ... almost surely, from (3), the discriminant between ... and ... is asymptotically larger than that between ... and ... . We show in Section IV that the probability of selecting an overestimated model is expressed in terms of ... when ..., while that of selecting an underestimated model diminishes exponentially when ... . In general, if the error probability vanishes as $n \to \infty$, then we say the sequence $\{d_n\}$ is weakly consistent. Moreover, if, with probability one, the estimated model coincides with $m^*$ for all but finitely many $n$, we say the sequence is strongly consistent.

Let $\mathcal{D}$ be the set of real nonnegative sequences $\{d_n\}_{n=1}^{\infty}$ such that $\limsup_n d_n/n = 0$. We define the partial order in $\mathcal{D}$: for any $\{d_n\}, \{d'_n\} \in \mathcal{D}$,

1) $\{d_n\} > \{d'_n\}$ if ... ;
2) $\{d_n\} < \{d'_n\}$ if ... ;
3) $\{d_n\} = \{d'_n\}$ if ... .

One checks, for instance, that $\{2\} < \{2\log\log n\}$ and $\{2\log\log n\} < \{\log n\}$.

The climax of this paper (in Section V) is the derivation of the smallest $\{d_n\} \in \mathcal{D}$ that satisfies strong consistency for classification, i.e., the classification counterpart of Hannan and Quinn's information criterion [10] that was provided for autoregressive processes. More precisely, the problem is whether $\{2\log\log n\}_{n=1}^{\infty}$ satisfies the following:

1) if $\{d_n\}_{n=1}^{\infty} > \{2\log\log n\}_{n=1}^{\infty}$, then the model selection is strongly consistent for ... and ... ;
2) if $\{d_n\}_{n=1}^{\infty} < \{2\log\log n\}_{n=1}^{\infty}$, then the model selection is not strongly consistent for ... and ..., when ... ;

where ... ranges over all stationary ergodic processes satisfying Assumption 1. We solve this problem in the affirmative.
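For concreteness, a minimal sketch of the selection rule (4) follows (Python; the function names are illustrative, a model is represented as a map from attribute values to equivalence-class labels, and, following the remark above, $k$ is taken as the number of classes times $|\mathcal{Y}|-1$).

```python
import math
from collections import Counter

def empirical_entropy(samples, model):
    """H = -sum_{a,y} c(a,y) log(c(a,y)/c(a)), where a = model(x) is the
    equivalence class of the attribute value x and c(.) are counts over the samples."""
    joint = Counter((model(x), y) for x, y in samples)
    marg = Counter(model(x) for x, _ in samples)
    return -sum(c * math.log(c / marg[a]) for (a, y), c in joint.items())

def criterion(samples, model, n_classes, n_labels, d_n):
    """Criterion (4): H + (k/2) d_n, with k = n_classes * (n_labels - 1)."""
    k = n_classes * (n_labels - 1)
    return empirical_entropy(samples, model) + 0.5 * k * d_n

def select_model(samples, candidates, n_labels, d_n):
    """Return the (model, n_classes) pair minimizing criterion (4)."""
    return min(candidates,
               key=lambda mc: criterion(samples, mc[0], mc[1], n_labels, d_n))

# The penalty sequences discussed in the text, for sample size n:
n = 1000
penalties = {"AIC": 2.0, "MDL/BIC": math.log(n),
             "Hannan-Quinn": 2.0 * math.log(math.log(n))}
```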
II. EXAMPLES

A. Learning Bayesian Network Structure From Data

We claim that the problem of learning a Bayesian network structure from data [6], [23] fits this setting. Suppose we are given $N$ random variables $X_1, \dots, X_N$ such that each $X_j$ takes values in a finite set, and that the marginal distribution of $(X_1, \dots, X_N)$ is expressed by

$$P(X_1 = x_1, \dots, X_N = x_N) = \prod_{j=1}^{N} P\bigl(X_j = x_j \mid X_{\pi(j)} = x_{\pi(j)}\bigr) \qquad (5)$$

with parent sets $\pi(j) \subseteq \{1, \dots, N\} \setminus \{j\}$, i.e., the occurrence of each $X_j$ depends on those of its parents. A Bayesian network is a directed acyclic graph with such nodes and edges directed from the parents to $X_j$. If the Bayesian network structure is given, the marginal distribution is determined once the conditional probabilities in (5) are specified for all values of the variables. The problem is to estimate the structure from $n$ examples, assuming that the examples are independently and identically distributed according to a distribution in the form of (5), since the structure divides, for each $X_j$, the values of the remaining variables into states. For each $X_j$, the set of such states may be regarded as a model in the sense of Section I, with $X_j$ playing the role of $Y$; thus, in order to estimate the structure, we iterate a model selection procedure for each $X_j$.
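A sketch of the per-variable procedure described above (Python; the representation of the data as dicts, the fixed variable ordering, and the cap on the number of parents are simplifying assumptions of mine, not part of the paper's setting).

```python
import math
from collections import Counter
from itertools import combinations

def score(data, child, parents, d_n):
    """Criterion (4) for one variable: empirical entropy of `child` given the joint
    state of `parents`, plus (k/2) d_n with k = (#parent states) * (|child values| - 1).
    `data` is a list of dicts mapping variable name -> observed value."""
    joint = Counter((tuple(row[p] for p in parents), row[child]) for row in data)
    marg = Counter(tuple(row[p] for p in parents) for row in data)
    h = -sum(c * math.log(c / marg[s]) for (s, _), c in joint.items())
    k = len(marg) * (len({row[child] for row in data}) - 1)
    return h + 0.5 * k * d_n

def learn_structure(data, order, d_n, max_parents=2):
    """For each variable (taken in a fixed ordering), choose the parent subset of its
    predecessors minimizing the criterion; returns a dict child -> tuple of parents."""
    structure = {}
    for i, child in enumerate(order):
        preds = order[:i]
        candidates = [c for r in range(min(len(preds), max_parents) + 1)
                      for c in combinations(preds, r)]
        structure[child] = min(candidates,
                               key=lambda ps: score(data, child, ps, d_n))
    return structure
```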
B. Order Identification of Ergodic Markov Processes

We also claim that the problem of order identification of ergodic Markov processes [9] fits this setting, provided the alphabet is finite. A sequence $x_1, x_2, \dots$ over a finite alphabet is said to be emitted according to a Markov process of order $\ell$ if the conditional probability of $x_i$ given the past is determined by the previous $\ell$ symbols, for finite $\ell$. Then there exist finitely many states, each of which is determined by the previous sequence of length $\ell$; since the alphabet is finite, the number of states is finite. The problem is to identify the Markov order from a given sequence that has been emitted by an ergodic Markov process of that order. In order to decide the states for the previous sequences of length less than $\ell$ (including the empty sequence), we need knowledge of the initial part of the sequence so that the state can be determined even for $i \le \ell$. However, since we mainly focus on the properties when $n$ is large, the effect of the initial part may be negligible when analyzing the error probability and verifying strong consistency. A Markov process is said to be ergodic if the transition probability matrix $P$ among the states satisfies $P^t > 0$ for some integer $t$, where the entries of $P$ are the transition probabilities from one state to another and $P^t > 0$ means that all the elements of the matrix $P^t$ are positive. Since the Markov process is ergodic, ... . Let ... and ... be the true and estimated orders, and ... and ... be the corresponding models. Then, one checks ... .
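A corresponding sketch for Markov order identification (Python; the names are illustrative, and the initial segment is handled simply by starting at position $\ell$, in line with the remark above that the boundary effect is negligible for large $n$).

```python
import math
from collections import Counter

def order_criterion(seq, order, alphabet_size, d_n):
    """Criterion (4) for a candidate Markov order: empirical entropy of the next
    symbol given the previous `order` symbols, plus (k/2) d_n with
    k = alphabet_size**order * (alphabet_size - 1)."""
    pairs = [(tuple(seq[i - order:i]), seq[i]) for i in range(order, len(seq))]
    joint, marg = Counter(pairs), Counter(s for s, _ in pairs)
    h = -sum(c * math.log(c / marg[s]) for (s, _), c in joint.items())
    k = alphabet_size ** order * (alphabet_size - 1)
    return h + 0.5 * k * d_n

def estimate_order(seq, max_order, alphabet_size, d_n):
    """Return the order in {0, ..., max_order} minimizing the criterion."""
    return min(range(max_order + 1),
               key=lambda L: order_criterion(seq, L, alphabet_size, d_n))
```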
III. PREVIOUS WORK

Several results not assuming a finite number of equivalence classes have been reported, such as nonparametric density estimation [12], estimation with histogram construction [24], etc. However, the purpose of model selection there is not to find the true model itself because, without finiteness, we cannot generally capture the true model even for large $n$: the selected model continues to change as $n$ goes to infinity so that the estimated density is as correct as possible. Then, careful control of the growth of model complexity is required. However, in this paper, since the number of equivalence classes is assumed finite for all models considered, the number of states does not grow with the sample size.

For model selection, several strategies can be considered. A possible solution may be derived from the idea of complexity regularization, in which the sum of a term corresponding to empirical error and a term corresponding to the "complexity" of a model is minimized. A similar idea has been investigated in various statistical problems by, e.g., Akaike [1], Schwarz [19], Rissanen [17], [18], and Shibata [20].

In the problem of model selection in autoregressive processes, the definitions of $H$ and $k$ are different, although the information criteria are given in the form of (4). An autoregressive process is defined as

$$X_t = c + \sum_{i=1}^{k} a_i X_{t-i} + \epsilon_t$$

where $\epsilon_t$ occurs independently, $c$ is a constant, and $a_1, \dots, a_k$ are coefficients. If the order $k$ is known, the coefficients to be estimated, $a_1, \dots, a_k$, are calculated by the Yule–Walker equations, irrespective of $c$, and the variance is usually estimated by

$$\hat\sigma_k^2 = \frac{1}{n}\sum_{t}\Bigl(X_t - \hat c - \sum_{i=1}^{k}\hat a_i X_{t-i}\Bigr)^2.$$

Then, the information criteria are expressed by

$$\frac{n}{2}\log\hat\sigma_k^2 + \frac{k}{2} d_n$$

where we regard $H = (n/2)\log\hat\sigma_k^2$ and $k$ as the number of coefficients. Shibata [20] pointed out that, for autoregressive processes, AIC asymptotically provides an efficient estimator although it does not satisfy even weak consistency of model selection. For autoregressive processes, the conditions for weak consistency are $d_n \to \infty$ and $d_n/n \to 0$ [20]. On the other hand, Hannan and Quinn [10] derived from the law of the iterated logarithm (LIL) for autoregressive processes the smallest penalty that satisfies strong consistency, which suggests that for autoregressive processes the conditions for strong consistency are $\{d_n\}_{n=1}^{\infty} > \{2\log\log n\}_{n=1}^{\infty}$ and $\limsup_n d_n/n = 0$. In this paper, we focus on classification rather than on autoregressive processes.

Also, some Markov order estimators that are not based on information criteria have been proposed. Let $\ell^*$ be the true order of a Markov process. Merhav, Gutman, and Ziv [15] attempted to minimize the error probability for models with order smaller than $\ell^*$ while keeping the error-probability decay exponent for models with order larger than $\ell^*$ at a given prescribed level. However, if the prescribed level is large, even weak consistency fails. If (4) is applied, for the probability of error of the first kind to decay exponentially, as seen in Section V, $d_n$ should grow linearly in $n$, which seems to be too restrictive to make that of the second kind desirable.
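A numerical sketch of the autoregressive criterion above (Python with NumPy; `yule_walker` is a textbook implementation, the constant term is removed by centering the data, and taking $H = (n/2)\log\hat\sigma_k^2$ follows the display above as an assumption of this sketch).

```python
import numpy as np

def yule_walker(x, order):
    """AR(order) coefficients and innovation variance from sample autocovariances."""
    x = np.asarray(x, dtype=float) - np.mean(x)   # centering removes the constant term
    n = len(x)
    acov = np.array([x[: n - h] @ x[h:] / n for h in range(order + 1)])
    if order == 0:
        return np.array([]), acov[0]
    R = np.array([[acov[abs(i - j)] for j in range(order)] for i in range(order)])
    a = np.linalg.solve(R, acov[1: order + 1])
    return a, acov[0] - a @ acov[1: order + 1]

def ar_criterion(x, order, d_n):
    """(n/2) log(estimated variance) + (order/2) d_n, the autoregressive form of (4)."""
    n = len(x)
    _, sigma2 = yule_walker(x, order)
    return 0.5 * n * np.log(sigma2) + 0.5 * order * d_n

def select_ar_order(x, max_order, d_n):
    """Return the order in {0, ..., max_order} minimizing the criterion."""
    return min(range(max_order + 1), key=lambda k: ar_criterion(x, k, d_n))
```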
Merhav [16] extended the result to the case of the independent identically distributed (i.i.d.) exponential family of distributions, which includes autoregressive processes. Furthermore, Ziv and Merhav [25] extended the result to the case of finite-state models and also dealt with identification of the number of states for hidden Markov processes. On the other hand, Liu and Narayan [14] improved Merhav, Gutman, and Ziv's scheme [15] so that strong consistency is satisfied while the probability of error of the first kind is ... . Also, Finesso [9] and Liu and Narayan [14] dealt with identification of the number of states for hidden Markov processes based on the maximum redundancy for hidden Markov processes (Csiszar [7]).

For Markov order estimation, Finesso [9] derived a similar result: the model that minimizes ... for each $n$ almost surely converges to the true model, where ... is a function specified in [9]. However, ... does not coincide with ..., whereas both compensation terms are ... . Worse, the function depends on ..., i.e., on the number of independent parameters for the true model [9]. Instead, Finesso [9] also showed that the model selection that minimizes ..., where ... is any strictly increasing function of $n$, satisfies strong consistency of model selection under ... . In this sense, strong consistency has been proved for the MDL principle and BIC. Kieffer [13] considered the model selection that minimizes

(6)
where ... is the length of the maximum likelihood code with respect to the parameter (Shtarkov [21]), and ... is a strictly increasing function of $n$ specified in [13]. Kieffer [13] and Csiszar [8] proved strong consistency of the model selection for Markov and hidden Markov processes, not assuming a finite number of states. However, that model selection does not provide us any suggestion for the present problem, since the compensation term is ... .

IV. ERROR PROBABILITIES

In what follows, we derive the error probabilities both for the overestimated and for the underestimated models.

A. Overestimated Models

One of the main technical tools in this paper is to relate ... with the mutual entropy metric ... . This is a well-known problem in information theory (see Cover and Thomas [5, p. 333]).

Theorem 1: For ... and any ...,

...

for large $n$, where $\Gamma(\cdot)$ is the Gamma function, ... is the incomplete Gamma function with ..., and ... is a constant depending on ... such that ... implies ... .

Proof of Theorem 1: Let $\chi^2_f$ denote the chi-square distribution with $f$ degrees of freedom, and let ... denote that a random variable is distributed according to ... . Since ... and ..., with ... and ..., where ... is the probability density function of the $\chi^2_f$ distribution, it suffices to derive, for large $n$,

(7)

Let ... for ..., ..., and ... . We derive

(8)

where ..., ... . Since

(9)

Equation (8) implies (7).
Let

(10)

so that we may express ... as ..., where ... . We prove (8) by showing the following two propositions.

Proposition 1: For ..., ... almost surely converges to zero as $n \to \infty$.

Proposition 2: There exist ..., with ..., such that

(11)

... satisfies

(12)

and, on the other hand, ... and ..., otherwise,

(13)

where ... is a function of ... such that ... .

If Proposition 2 is true, ..., ... are independent for each ..., and each ... is a linear combination of ..., which is asymptotically normal. Hence, Propositions 1 and 2 imply (7).

Proof of Proposition 1: Since ... and ... almost surely as $n \to \infty$ for ... and ..., the denominator of ... converges to six almost surely. The numerator can be expressed as

(14)

where ... denotes the expectation. The probability of ... is computed as shown in (15) below, where ... is the error function. Since (15) is summable, (14) holds with probability one by the Borel–Cantelli lemma. Since ... is arbitrary, the numerator almost surely converges to zero.

(15)
Proof of Proposition 2: From

(16)

and

(17)

the eigenvalues of the matrix ... are ... if ... and ... if ..., so that one eigenvalue is ... and the other is zero. One checks that

(18)

is the eigenvector with the eigenvalue ... . (Note that ... does not depend on ... .) For the other eigenvectors ... with the eigenvalue ..., ..., if we express ... as ..., only

(19)

is required, where ... is the ... unit matrix. Then,

(20)

where ... . Let

(21)

From (16) and (18), the first row of ... is ... . We put the remaining rows as a matrix ... . Then,

(22)

From (19), (20), (21), and (22), we obtain

(23)

For ..., the eigenvalues of the matrix ... are ... while the other one is ... . One checks that

(24)

is the eigenvector with the eigenvalue ... . For the other eigenvectors ... with the eigenvalue ..., ..., if we express ... as ..., only

(25)

is required. Then, the first column of ... is ..., and we put the remaining columns as a matrix ... . From the construction, ... coincides with ... expressed by (11). Furthermore, from (25),

(26)

(In general, ... if the multiplications ... and ... exist.) From (17), (23), and (26), (12) follows. On the other hand, for each ... and ..., the variance of ... is 1 and they are independent of each other. From (19), for each ... and ..., the variance of ... is 1 and they are independent of each other since ..., ..., are orthogonal. This shows (13). Q.E.D.

For the order identification of ergodic Markov processes, P. Billingsley derived (7) [3, p. 18], although Theorem 1 applies to more general problems, including pattern recognition and learning Bayesian network structure. However, Theorem 1 plays a rather preliminary role in this paper: the main goal is to prove that $\{2\log\log n\}_{n=1}^{\infty}$ is the smallest sequence in $\mathcal{D}$ that makes the model selection strongly consistent (Theorem 3). The material used in the proof of Theorem 1 will give a useful tool for solving the problem (in Section V).
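On the reading that Theorem 1 identifies the asymptotic overestimation probability with a chi-square tail probability whose degrees of freedom equal the increase in the number of parameters, the following Monte Carlo sketch (Python/NumPy; the toy model, sample sizes, and names are mine) compares the rate at which criterion (4) prefers a spurious two-class model with the tail probability $P(\chi^2_1 \ge d_n)$.

```python
import numpy as np

rng = np.random.default_rng(0)

def entropy_term(labels):
    """-sum_y c(y) log(c(y)/|group|) for one equivalence class (natural logarithm)."""
    counts = np.bincount(labels, minlength=2)
    total = counts.sum()
    return -sum(c * np.log(c / total) for c in counts if c)

def overestimation_rate(n, d_n, trials=1000):
    """Fraction of trials in which a spurious 2-class model beats the true 1-class model.

    Here X carries no information about Y (true model: one equivalence class);
    the competing model splits X into two classes, adding delta_k = 1 parameter,
    so criterion (4) prefers it exactly when 2 * (H_true - H_over) > d_n."""
    hits = 0
    for _ in range(trials):
        x = rng.integers(0, 2, size=n)
        y = rng.integers(0, 2, size=n)
        h_true = entropy_term(y)
        h_over = entropy_term(y[x == 0]) + entropy_term(y[x == 1])
        if 2.0 * (h_true - h_over) > d_n:
            hits += 1
    return hits / trials

def chi2_tail(dof, threshold, m=200_000):
    """Monte Carlo estimate of P(chi^2_dof >= threshold)."""
    return float(np.mean((rng.standard_normal((m, dof)) ** 2).sum(axis=1) >= threshold))

n = 2000
d_n = 2.0 * np.log(np.log(n))     # the Hannan-Quinn-type penalty
print(overestimation_rate(n, d_n), chi2_tail(1, d_n))
```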
B. Underestimated Models

Theorem 2: For ...,

...

almost surely as $n \to \infty$. (27)

In particular, ... almost surely as $n \to \infty$ for any ... .

Proof of Theorem 2: For ... and ...,

(28)
For ... and ... such that ...,

(29)

so that (28) has been applied. Hence, ... is in ..., where ..., and ... almost surely as $n \to \infty$. Since ..., ..., ..., and ..., the claim follows. Q.E.D.
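On the reading that Theorem 2 states that, for an underestimating model, the difference of empirical entropies per sample converges almost surely to a strictly positive Kullback divergence, a small simulation (Python; the two-class example and the names are mine) illustrates this behavior.

```python
import math
import random
from collections import Counter

random.seed(1)

def emp_entropy(pairs):
    """H = -sum_{a,y} c(a,y) log(c(a,y)/c(a)) over (class, label) pairs."""
    joint, marg = Counter(pairs), Counter(a for a, _ in pairs)
    return -sum(c * math.log(c / marg[a]) for (a, y), c in joint.items())

def sample(n):
    """Two x-classes with different conditional probabilities of Y = 1."""
    xs = [random.randint(0, 1) for _ in range(n)]
    ys = [1 if random.random() < (0.9 if x == 0 else 0.2) else 0 for x in xs]
    return xs, ys

for n in (10**3, 10**4, 10**5):
    xs, ys = sample(n)
    h_true = emp_entropy(list(zip(xs, ys)))       # the true two-class model
    h_under = emp_entropy([(0, y) for y in ys])   # underestimating one-class model
    print(n, (h_under - h_true) / n)              # approaches a positive constant
```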
V. STRONG CONSISTENCY OF MODEL SELECTION

Theorem 3 suggests that $\{2\log\log n\}_{n=1}^{\infty}$ is the smallest sequence in $\mathcal{D}$ that makes the model selection procedure strongly consistent, for classification as well as for autoregressive processes.

Theorem 3:
1) If $\{d_n\}_{n=1}^{\infty} > \{2\log\log n\}_{n=1}^{\infty}$, then the model selection is strongly consistent for ... and ... ;
2) If $\{d_n\}_{n=1}^{\infty} < \{2\log\log n\}_{n=1}^{\infty}$, then the model selection is not strongly consistent for ... and ..., when ... ;
where ... ranges over all stationary ergodic processes satisfying Assumption 1.

Proof of Theorem 3: From Theorem 2, for each underestimated model, ... almost surely as $n \to \infty$ if ... . We show that, for each overestimated model,

(30)

almost surely as $n \to \infty$. The proof is based on a simplified version of Kolmogorov's law of the iterated logarithm.

Lemma 1 ([22, p. 273]): Let ... be identically distributed with ... for some ..., and ... . Then

(31)

We define ... as

(32)

Then, from the definitions of the matrices ..., ... almost surely, where (28) and (29) have been used. Since ..., ..., ..., and ..., we obtain

(33)

almost surely as $n \to \infty$. On the other hand, from a similar discussion in [11, pp. 1076–1077], (33) implies

(34)

almost surely as $n \to \infty$, from which (30) follows.

If $\{d_n\}_{n=1}^{\infty} > \{2\log\log n\}_{n=1}^{\infty}$, then almost surely

(35)
Since $\{d_n\}_{n=1}^{\infty} > \{2\log\log n\}_{n=1}^{\infty}$, ..., so that the right-hand side of (35) is strictly positive for large $n$. Hence, almost surely ... .

On the other hand, suppose $\{d_n\}_{n=1}^{\infty} < \{2\log\log n\}_{n=1}^{\infty}$ and ... . Let ... and ... be such that ..., ..., ... . Then, for ... such that ... and ..., almost surely ... . Hence, ... infinitely many times with probability one. Q.E.D.
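The role of $2\log\log n$ can be seen numerically: by the law of the iterated logarithm (Lemma 1), centered partial sums fluctuate on the scale $\sqrt{2\sigma^2 n\log\log n}$, so a penalty growing more slowly than $2\log\log n$ is overwhelmed by these fluctuations infinitely often, while a faster-growing penalty is not. A one-path sketch follows (Python; a $\pm 1$ walk with variance one, names mine).

```python
import math
import random

random.seed(0)

def lil_ratio(n_max):
    """|S_n| / sqrt(2 n log log n) along one +-1 random-walk path (mean 0, variance 1)."""
    s, ratios = 0, []
    for n in range(1, n_max + 1):
        s += random.choice((-1, 1))
        if n >= 3:                       # log log n is positive only for n >= 3
            ratios.append(abs(s) / math.sqrt(2 * n * math.log(math.log(n))))
    return ratios

r = lil_ratio(200_000)
print(max(r[10_000:]))   # of order one, as the law of the iterated logarithm predicts
```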
REFERENCES

[1] H. Akaike, “A new look at the statistical model identification,” IEEE Trans. Automat. Contr., vol. AC-19, pp. 716–723, 1974.
[2] A. C. Atkinson, “A method for discriminating between models,” J. Roy. Statist. Soc., Ser. B, vol. 32, pp. 323–353, 1970.
[3] P. Billingsley, Statistical Inference for Markov Processes. Chicago, IL: The University of Chicago Press, 1961.
[4] H. Cramer, Mathematical Methods of Statistics. Princeton, NJ: Princeton Univ. Press, 1946.
[5] T. Cover and J. Thomas, Elements of Information Theory. New York: Wiley, 1991.
[6] G. F. Cooper and E. Herskovits, “A Bayesian method for the induction of probabilistic networks from data,” Machine Learn., vol. 9, pp. 309–347, 1992.
[7] I. Csiszar, “Information theoretical methods in statistics,” in Class Notes. College Park, MD: Univ. Maryland, 1990.
[8] I. Csiszar and P. C. Shields, “The consistency of the BIC Markov order estimator,” Ann. Statist., vol. 28, no. 6, Dec. 2000.
[9] L. Finesso, “Consistent Estimation of the Order for Markov and Hidden Markov Chains,” Ph.D. dissertation, University of Maryland, College Park, MD, 1990.
[10] E. J. Hannan and B. G. Quinn, “The determination of the order of an autoregression,” J. Roy. Statist. Soc., Ser. B, vol. 41, pp. 190–195, 1979.
[11] E. J. Hannan, “The estimation of the order of an ARMA process,” Ann. Statist., vol. 8, no. 5, pp. 1071–1081, 1980.
[12] P. Hall and E. Hannan, “On stochastic complexity and nonparametric density estimation,” Biometrika, vol. 75, pp. 705–714, 1988.
[13] J. C. Kieffer, “Strongly consistent code-based identification and order estimation for constrained finite-state model classes,” IEEE Trans. Inf. Theory, vol. 39, pp. 893–902, 1993.
[14] C. Liu and P. Narayan, “Order estimation and sequential universal data compression of a hidden Markov source by the method of mixtures,” IEEE Trans. Inf. Theory, vol. 40, pp. 1167–1180, 1994.
[15] N. Merhav, M. Gutman, and J. Ziv, “On the estimation of the order of a Markov chain and universal data compression,” IEEE Trans. Inf. Theory, vol. 35, pp. 1014–1019, 1989.
[16] N. Merhav, “The estimation of model order in exponential families,” IEEE Trans. Inf. Theory, vol. 35, pp. 1109–1113, 1989.
[17] J. Rissanen, “Modeling by shortest data description,” Automatica, vol. 14, pp. 465–471, 1978.
[18] ——, “Stochastic complexity and modeling,” Ann. Statist., vol. 14, pp. 1080–1100, 1986.
[19] G. Schwarz, “Estimating the dimension of a model,” Ann. Statist., vol. 6, pp. 461–464, 1978.
[20] R. Shibata, “Selection of the order of autoregressive model by Akaike’s information criterion,” Biometrika, vol. 63, pp. 117–126, 1976.
[21] Y. M. Shtarkov, “Universal sequential coding of single messages,” Probl. Inf. Trans., vol. 16, pp. 175–186, 1987.
[22] W. F. Stout, Almost Sure Convergence. New York: Academic, 1974.
[23] J. Suzuki, “A construction of Bayesian networks from databases based on the MDL principle,” in Proc. 1993 Uncertainty in Artificial Intelligence Conf., 1993, pp. 266–273.
[24] B. Yu and T. Speed, “Data compression and histograms,” Prob. Theory Rel. Fields, vol. 92, pp. 195–229, 1992.
[25] J. Ziv and N. Merhav, “Estimating the number of states of a finite state source,” IEEE Trans. Inf. Theory, vol. 38, pp. 61–65, 1992.