On Strong Consistency of Model Selection in Classification

Joe Suzuki, Member, IEEE
Abstract—This paper considers model selection in classification. In many applications, such as pattern recognition, probabilistic inference using a Bayesian network, and prediction of the next outcome in a sequence based on a Markov chain, the conditional probability $P(Y=y \mid X=x)$ of class $y \in \mathcal{Y}$ given attribute value $x \in \mathcal{X}$ is utilized. By model we mean the equivalence relation in $\mathcal{X}$: for $x, x' \in \mathcal{X}$,

$$x \sim x' \iff P(Y=y \mid X=x) = P(Y=y \mid X=x') \quad \text{for all } y \in \mathcal{Y}.$$

By classification we mean that the number of such equivalence classes is finite. We estimate the model from $n$ samples $z^n = (x_i, y_i)_{i=1}^{n} \in (\mathcal{X} \times \mathcal{Y})^{n}$, using information criteria of the form empirical entropy $H$ plus penalty term $(k/2)d_n$ (the model minimizing $H + (k/2)d_n$ is the estimated model), where $k$ is the number of independent parameters in the model and $\{d_n\}_{n=1}^{\infty}$ is a real nonnegative sequence such that $\limsup_n d_n/n = 0$. For autoregressive processes, although the definitions of $H$ and $k$ are different, it is known that the estimated model almost surely coincides with the true model as $n \to \infty$ if $\{d_n\}_{n=1}^{\infty} > \{2\log\log n\}_{n=1}^{\infty}$, and that it does not if $\{d_n\}_{n=1}^{\infty} < \{2\log\log n\}_{n=1}^{\infty}$ (Hannan and Quinn). Whether the same property holds for classification was an open problem. This paper solves the problem in the affirmative.

Index Terms—Error probability, Hannan and Quinn's procedure, law of the iterated logarithm, Kullback–Leibler divergence, model selection, strong consistency.
I. INTRODUCTION

Let $X$ and $Y$ be random variables taking values in the finite sets $\mathcal{X}$ and $\mathcal{Y}$, respectively. We express the conditional probability of $Y$ given $X$ using a model, which is defined as an equivalence relation in $\mathcal{X}$: for $x, x' \in \mathcal{X}$,

$$x \sim x' \iff P(Y=y \mid X=x) = P(Y=y \mid X=x') \quad \text{for all } y \in \mathcal{Y}. \qquad (1)$$

If we fix a model $m$, $\mathcal{X}$ can be divided into equivalence classes such that, for each $y \in \mathcal{Y}$, the conditional probability $P(Y=y \mid X=x)$ depends on $x$ only through the class containing $x$ for model $m$. We denote by $\mathcal{M}$ the set of such models. By classification we mean that the conditional probability can be expressed by a model with finitely many equivalence classes, which is assumed throughout the current paper. (Notice that the random variables are discrete.)

We consider the following problem: given a finite sequence $z^n = (x_i, y_i)_{i=1}^{n}$ such that 1) $\{(X_i, Y_i)\}_{i=1}^{\infty}$ is a stationary ergodic process, and 2) the pairs $(X_i, Y_i)$ satisfy (1) for some model; and given a subset $\mathcal{M}'$ of $\mathcal{M}$ containing the true model $m^*$, we estimate $m^*$ as well as the associated conditional probabilities. For given $m^*$ and $\mathcal{M}'$, we divide $\mathcal{M}'$ into three sets:

1) the true model $m^*$ itself;
2) the set of overestimating models;
3) the set of underestimating models.
Manuscript received February 12, 2005; revised April 26, 2006. The material in this paper was presented in part at the 6th International Workshop on Artificial Intelligence and Statistics, January 1997. The author is with the Department of Mathematics, Osaka University, Osaka 560-0043, Japan. Communicated by P. L. Bartlett, Associate Editor for Pattern Recognition, Statistical Learning, and Inference. Digital Object Identifier 10.1109/TIT.2006.883611
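For a concrete reading of (1), the following sketch (Python; the function name `model_from_conditional` and the toy probabilities are illustrative choices, not taken from the paper) groups the attribute values whose conditional distributions coincide, so that a model is represented simply as a partition of $\mathcal{X}$.

```python
from collections import defaultdict

def model_from_conditional(cond):
    """Group attribute values x by their conditional distribution P(Y=. | X=x).

    `cond` maps each x to a dict {y: P(Y=y | X=x)}.  Two values x, x' fall into
    the same equivalence class exactly when these dicts coincide, which is
    relation (1).  The partition is returned as a list of sets of x's."""
    groups = defaultdict(set)
    for x, dist in cond.items():
        key = tuple(sorted(dist.items()))   # hashable signature of P(Y=. | X=x)
        groups[key].add(x)
    return list(groups.values())

# Toy conditional probability with X = {0, 1, 2, 3} and Y = {0, 1}:
cond = {0: {0: 0.3, 1: 0.7}, 1: {0: 0.3, 1: 0.7},
        2: {0: 0.9, 1: 0.1}, 3: {0: 0.5, 1: 0.5}}
print(model_from_conditional(cond))   # three equivalence classes: {0, 1}, {2}, {3}
```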
Example 1: Suppose that $\mathcal{X}$, $\mathcal{Y}$, and the conditional probability $P(Y=y \mid X=x)$ are such that ... . Then the three sets consist of ..., ..., and ... .

Example 2: Suppose that $\mathcal{X}$, $\mathcal{Y}$, and $P(Y=y \mid X=x)$ are such that ... . Then the three sets consist of ..., ..., and ... .

If the estimated model is in the second set, the model is classified as overestimated; and if the estimated model is in the third set, the model is classified as underestimated (Atkinson [2]).

Let ... for ... . For ... and ... such that ..., we define ... with respect to ... and ... by
(2)
We define the Kullback divergence by

...

for ... . Throughout the paper, we assume the following.

Assumption 1: For each ..., the stationary ergodic process ... satisfies ... for all ... and ... .

Then, apparently

(3)
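For orientation, the conditional form of the Kullback divergence, comparing the true conditional distribution with the one induced by a candidate model, can be sketched as follows (Python; `kullback` and its arguments are illustrative names, and the exact weighting used in the definition above is not reproduced here).

```python
import math

def kullback(px, p_true, p_model):
    """Conditional Kullback divergence  sum_x P(x) sum_y P*(y|x) log(P*(y|x)/Q(y|x)).

    `px` maps x -> P(x); `p_true` and `p_model` map x -> {y: probability}.
    This is the usual conditional form, shown only as an illustration."""
    d = 0.0
    for x, w in px.items():
        for y, p in p_true[x].items():
            if p > 0.0:
                d += w * p * math.log(p / p_model[x][y])
    return d
```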
Given a sequence $z^n = (x_i, y_i)_{i=1}^{n}$, we define ..., where ... and ... otherwise.

In this paper, we select the model that minimizes the quantity

$$H + \frac{k}{2} d_n \qquad (4)$$

where $\{d_n\}_{n=1}^{\infty}$ is a real nonnegative sequence, $\log$ denotes the natural logarithm, $H$ is the so-called empirical entropy of $y^n$ given $x^n$, and $k$ is interpreted as the number of independent parameters because, for each equivalence class, the number of probabilities to be specified is $|\mathcal{Y}| - 1$. This paper analyzes the model selection procedures based on information criteria of the form (4) in a unified manner, instead of considering each information criterion separately, such as Akaike's information criterion (AIC, $d_n = 2$) [1], the minimum description length (MDL) principle (Rissanen [17], $d_n = \log n$), the Bayesian information criterion (BIC, $d_n = \log n$) [19], etc. For MDL and BIC, (4) expresses a codelength of the observed sequence up to a constant when model $m$ is assumed, while AIC expresses an unbiased estimator of the average code length with respect to model $m$. These information criteria thus admit information-theoretic interpretations.

Let ... and ... . We derive (in Section IV) the asymptotic exact error probability in model selection for each $\{d_n\}_{n=1}^{\infty}$ in (4). Then, if ... converges to ... almost surely, from (3), the discriminant between ... and ... is asymptotically larger than that between ... and ... . We show in Section IV that the probability of selecting an overestimated model is expressed in terms of ... when ..., while that of selecting an underestimated model diminishes exponentially when ... . In general, if the error probability vanishes as $n \to \infty$, then we say the sequence $\{d_n\}$ is weakly consistent. Moreover, if, with probability one, the estimated model coincides with $m^*$ for all but finitely many $n$, we say the sequence is strongly consistent.

Let $\mathcal{D}$ be the set of real nonnegative sequences $\{d_n\}_{n=1}^{\infty}$ such that $\limsup_n d_n/n = 0$. We define the partial order in $\mathcal{D}$: for any $\{d_n\}, \{d'_n\} \in \mathcal{D}$,

1) $\{d_n\} > \{d'_n\}$ if ... ;
2) $\{d_n\} < \{d'_n\}$ if ... ;
3) $\{d_n\} = \{d'_n\}$ if ... .

One checks, for instance, that $\{2\} < \{2\log\log n\}$ and $\{2\log\log n\} < \{\log n\}$.

The climax of this paper (in Section V) is the derivation of the smallest $\{d_n\} \in \mathcal{D}$ that satisfies strong consistency for classification, i.e., the classification counterpart of Hannan and Quinn's information criterion [10] that was provided for autoregressive processes. More precisely, the problem is whether $\{2\log\log n\}_{n=1}^{\infty}$ satisfies the following:

1) if $\{d_n\}_{n=1}^{\infty} > \{2\log\log n\}_{n=1}^{\infty}$, then the model selection is strongly consistent for ... and ... ;
2) if $\{d_n\}_{n=1}^{\infty} < \{2\log\log n\}_{n=1}^{\infty}$, then the model selection is not strongly consistent for ... and ..., when ... ;

where ... ranges over all stationary ergodic processes satisfying Assumption 1. We solve this problem in the affirmative.
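For concreteness, a minimal sketch of the selection rule (4) follows (Python; the function names are illustrative, a model is represented as a map from attribute values to equivalence-class labels, and, following the remark above, $k$ is taken as the number of classes times $|\mathcal{Y}|-1$).

```python
import math
from collections import Counter

def empirical_entropy(samples, model):
    """H = -sum_{a,y} c(a,y) log(c(a,y)/c(a)), where a = model(x) is the
    equivalence class of the attribute value x and c(.) are counts over the samples."""
    joint = Counter((model(x), y) for x, y in samples)
    marg = Counter(model(x) for x, _ in samples)
    return -sum(c * math.log(c / marg[a]) for (a, y), c in joint.items())

def criterion(samples, model, n_classes, n_labels, d_n):
    """Criterion (4): H + (k/2) d_n, with k = n_classes * (n_labels - 1)."""
    k = n_classes * (n_labels - 1)
    return empirical_entropy(samples, model) + 0.5 * k * d_n

def select_model(samples, candidates, n_labels, d_n):
    """Return the (model, n_classes) pair minimizing criterion (4)."""
    return min(candidates,
               key=lambda mc: criterion(samples, mc[0], mc[1], n_labels, d_n))

# The penalty sequences discussed in the text, for sample size n:
n = 1000
penalties = {"AIC": 2.0, "MDL/BIC": math.log(n),
             "Hannan-Quinn": 2.0 * math.log(math.log(n))}
```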
II. EXAMPLES

A. Learning Bayesian Network Structure From Data

We claim that the problem of learning a Bayesian network structure from data [6], [23] fits this setting. Suppose we are given $N$ random variables $X_1, \dots, X_N$ such that each $X_j$ takes values in a finite set, and that the marginal distribution of $(X_1, \dots, X_N)$ is expressed by

$$P(X_1 = x_1, \dots, X_N = x_N) = \prod_{j=1}^{N} P\bigl(X_j = x_j \mid X_{\pi(j)} = x_{\pi(j)}\bigr) \qquad (5)$$

with parent sets $\pi(j) \subseteq \{1, \dots, N\} \setminus \{j\}$, i.e., the occurrence of each $X_j$ depends on those of its parents. A Bayesian network is a directed acyclic graph with such nodes and edges directed from the parents to $X_j$. If the Bayesian network structure is given, the marginal distribution is determined once the conditional probabilities in (5) are specified for all values of the variables. The problem is to estimate the structure from $n$ examples, assuming that the examples are independently and identically distributed according to a distribution in the form of (5), since the structure divides, for each $X_j$, the values of the remaining variables into states. For each $X_j$, the set of such states may be regarded as a model in the sense of Section I, with $X_j$ playing the role of $Y$; thus, in order to estimate the structure, we iterate a model selection procedure for each $X_j$.
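A sketch of the per-variable procedure described above (Python; the representation of the data as dicts, the fixed variable ordering, and the cap on the number of parents are simplifying assumptions of mine, not part of the paper's setting).

```python
import math
from collections import Counter
from itertools import combinations

def score(data, child, parents, d_n):
    """Criterion (4) for one variable: empirical entropy of `child` given the joint
    state of `parents`, plus (k/2) d_n with k = (#parent states) * (|child values| - 1).
    `data` is a list of dicts mapping variable name -> observed value."""
    joint = Counter((tuple(row[p] for p in parents), row[child]) for row in data)
    marg = Counter(tuple(row[p] for p in parents) for row in data)
    h = -sum(c * math.log(c / marg[s]) for (s, _), c in joint.items())
    k = len(marg) * (len({row[child] for row in data}) - 1)
    return h + 0.5 * k * d_n

def learn_structure(data, order, d_n, max_parents=2):
    """For each variable (taken in a fixed ordering), choose the parent subset of its
    predecessors minimizing the criterion; returns a dict child -> tuple of parents."""
    structure = {}
    for i, child in enumerate(order):
        preds = order[:i]
        candidates = [c for r in range(min(len(preds), max_parents) + 1)
                      for c in combinations(preds, r)]
        structure[child] = min(candidates,
                               key=lambda ps: score(data, child, ps, d_n))
    return structure
```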
B. Order Identification of Ergodic Markov Processes

We also claim that the problem of order identification of ergodic Markov processes [9] fits this setting, provided the alphabet is finite. A sequence $x_1, x_2, \dots$ over a finite alphabet is said to be emitted according to a Markov process of order $\ell$ if the conditional probability of $x_i$ given the past is determined by the previous $\ell$ symbols, for finite $\ell$. Then there exist finitely many states, each of which is determined by the previous sequence of length $\ell$; since the alphabet is finite, the number of states is finite. The problem is to identify the Markov order from a given sequence that has been emitted by an ergodic Markov process of that order. In order to decide the states for the previous sequences of length less than $\ell$ (including the empty sequence), we need knowledge of the initial part of the sequence so that the state can be determined even for $i \le \ell$. However, since we mainly focus on the properties when $n$ is large, the effect of the initial part may be negligible when analyzing the error probability and verifying strong consistency. A Markov process is said to be ergodic if the transition probability matrix $P$ among the states satisfies $P^t > 0$ for some integer $t$, where the entries of $P$ are the transition probabilities from one state to another and $P^t > 0$ means that all the elements of the matrix $P^t$ are positive. Since the Markov process is ergodic, ... . Let ... and ... be the true and estimated orders, and ... and ... be the corresponding models. Then, one checks ... .
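A corresponding sketch for Markov order identification (Python; the names are illustrative, and the initial segment is handled simply by starting at position $\ell$, in line with the remark above that the boundary effect is negligible for large $n$).

```python
import math
from collections import Counter

def order_criterion(seq, order, alphabet_size, d_n):
    """Criterion (4) for a candidate Markov order: empirical entropy of the next
    symbol given the previous `order` symbols, plus (k/2) d_n with
    k = alphabet_size**order * (alphabet_size - 1)."""
    pairs = [(tuple(seq[i - order:i]), seq[i]) for i in range(order, len(seq))]
    joint, marg = Counter(pairs), Counter(s for s, _ in pairs)
    h = -sum(c * math.log(c / marg[s]) for (s, _), c in joint.items())
    k = alphabet_size ** order * (alphabet_size - 1)
    return h + 0.5 * k * d_n

def estimate_order(seq, max_order, alphabet_size, d_n):
    """Return the order in {0, ..., max_order} minimizing the criterion."""
    return min(range(max_order + 1),
               key=lambda L: order_criterion(seq, L, alphabet_size, d_n))
```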
III. PREVIOUS WORK

Several results not assuming a finite number of equivalence classes have been reported, such as nonparametric density estimation [12], estimation with histogram construction [24], etc. However, the purpose of model selection there is not to find the true model itself because, without finiteness, we cannot generally capture the true model even for large $n$: the selected model continues to change as $n$ goes to infinity so that the estimated density is as correct as possible. Then, careful control of the growth of model complexity is required. However, in this paper, since the number of equivalence classes is assumed finite for all models considered, the number of states does not grow with the sample size.

For model selection, several strategies can be considered. A possible solution may be derived from the idea of complexity regularization, in which the sum of a term corresponding to empirical error and a term corresponding to the "complexity" of a model is minimized. A similar idea has been investigated in various statistical problems by, e.g., Akaike [1], Schwarz [19], Rissanen [17], [18], and Shibata [20].

In the problem of model selection in autoregressive processes, the definitions of $H$ and $k$ are different, although the information criteria are given in the form of (4). An autoregressive process is defined as

$$X_t = c + \sum_{i=1}^{k} a_i X_{t-i} + \epsilon_t$$

where $\epsilon_t$ occurs independently, $c$ is a constant, and $a_1, \dots, a_k$ are coefficients. If the order $k$ is known, the coefficients to be estimated, $a_1, \dots, a_k$, are calculated by the Yule–Walker equations, irrespective of $c$, and the variance is usually estimated by

$$\hat\sigma_k^2 = \frac{1}{n}\sum_{t}\Bigl(X_t - \hat c - \sum_{i=1}^{k}\hat a_i X_{t-i}\Bigr)^2.$$

Then, the information criteria are expressed by

$$\frac{n}{2}\log\hat\sigma_k^2 + \frac{k}{2} d_n$$

where we regard $H = (n/2)\log\hat\sigma_k^2$ and $k$ as the number of coefficients. Shibata [20] pointed out that, for autoregressive processes, AIC asymptotically provides an efficient estimator although it does not satisfy even weak consistency of model selection. For autoregressive processes, the conditions for weak consistency are $d_n \to \infty$ and $d_n/n \to 0$ [20]. On the other hand, Hannan and Quinn [10] derived from the law of the iterated logarithm (LIL) for autoregressive processes the smallest penalty that satisfies strong consistency, which suggests that for autoregressive processes the conditions for strong consistency are $\{d_n\}_{n=1}^{\infty} > \{2\log\log n\}_{n=1}^{\infty}$ and $\limsup_n d_n/n = 0$. In this paper, we focus on classification rather than on autoregressive processes.

Also, some Markov order estimators that are not based on information criteria have been proposed. Let $\ell^*$ be the true order of a Markov process. Merhav, Gutman, and Ziv [15] attempted to minimize the error probability for models with order smaller than $\ell^*$ while keeping the error-probability decay exponent for models with order larger than $\ell^*$ at a given prescribed level. However, if the prescribed level is large, even weak consistency fails. If (4) is applied, for the probability of error of the first kind to decay exponentially, as seen in Section V, $d_n$ should grow linearly in $n$, which seems to be too restrictive to make that of the second kind desirable.
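A numerical sketch of the autoregressive criterion above (Python with NumPy; `yule_walker` is a textbook implementation, the constant term is removed by centering the data, and taking $H = (n/2)\log\hat\sigma_k^2$ follows the display above as an assumption of this sketch).

```python
import numpy as np

def yule_walker(x, order):
    """AR(order) coefficients and innovation variance from sample autocovariances."""
    x = np.asarray(x, dtype=float) - np.mean(x)   # centering removes the constant term
    n = len(x)
    acov = np.array([x[: n - h] @ x[h:] / n for h in range(order + 1)])
    if order == 0:
        return np.array([]), acov[0]
    R = np.array([[acov[abs(i - j)] for j in range(order)] for i in range(order)])
    a = np.linalg.solve(R, acov[1: order + 1])
    return a, acov[0] - a @ acov[1: order + 1]

def ar_criterion(x, order, d_n):
    """(n/2) log(estimated variance) + (order/2) d_n, the autoregressive form of (4)."""
    n = len(x)
    _, sigma2 = yule_walker(x, order)
    return 0.5 * n * np.log(sigma2) + 0.5 * order * d_n

def select_ar_order(x, max_order, d_n):
    """Return the order in {0, ..., max_order} minimizing the criterion."""
    return min(range(max_order + 1), key=lambda k: ar_criterion(x, k, d_n))
```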
Merhav [16] extended the result to the case of the independent identically distributed (i.i.d.) exponential family of distributions, which includes autoregressive processes. Furthermore, Ziv and Merhav [25] extended the result to the case of finite-state models and also dealt with identification of the number of states for hidden Markov processes. On the other hand, Liu and Narayan [14] improved Merhav, Gutman, and Ziv's scheme [15] so that strong consistency is satisfied while the probability of error of the first kind is ... . Also, Finesso [9] and Liu and Narayan [14] dealt with identification of the number of states for hidden Markov processes based on the maximum redundancy for hidden Markov processes (Csiszar [7]).

For Markov order estimation, Finesso [9] derived a similar result: the model that minimizes ... for each $n$ almost surely converges to the true model, where ... is a function specified in [9]. However, ... does not coincide with ..., whereas both compensation terms are ... . Worse, the function depends on ..., i.e., on the number of independent parameters for the true model [9]. Instead, Finesso [9] also showed that the model selection that minimizes ..., where ... is any strictly increasing function of $n$, satisfies strong consistency of model selection under ... . In this sense, strong consistency has been proved for the MDL principle and BIC. Kieffer [13] considered the model selection that minimizes

(6)
where ... is the length of the maximum likelihood code with respect to the parameter (Shtarkov [21]), and ... is a strictly increasing function of $n$ specified in [13]. Kieffer [13] and Csiszar [8] proved strong consistency of the model selection for Markov and hidden Markov processes, not assuming a finite number of states. However, that model selection does not provide us any suggestion for the present problem, since the compensation term is ... .

IV. ERROR PROBABILITIES

In what follows, we derive the error probabilities both for the overestimated and for the underestimated models.

A. Overestimated Models

One of the main technical tools in this paper is to relate ... with the mutual entropy metric ... . This is a well-known problem in information theory (see Cover and Thomas [5, p. 333]).

Theorem 1: For ... and any ...,

...

for large $n$, where $\Gamma(\cdot)$ is the Gamma function, ... is the incomplete Gamma function with ..., and ... is a constant depending on ... such that ... implies ... .

Proof of Theorem 1: Let $\chi^2_f$ denote the chi-square distribution with $f$ degrees of freedom, and let ... denote that a random variable is distributed according to ... . Since ... and ..., with ... and ..., where ... is the probability density function of the $\chi^2_f$ distribution, it suffices to derive, for large $n$,

(7)

Let ... for ..., ..., and ... . We derive

(8)

where ..., ... . Since

(9)

Equation (8) implies (7).
Let

(10)

so that we may express ... as ..., where ... . We prove (8) by showing the following two propositions.

Proposition 1: For ..., ... almost surely converges to zero as $n \to \infty$.

Proposition 2: There exist ..., with ..., such that

(11)

... satisfies

(12)

and, on the other hand, ... and ..., otherwise,

(13)

where ... is a function of ... such that ... .

If Proposition 2 is true, ..., ... are independent for each ..., and each ... is a linear combination of ..., which is asymptotically normal. Hence, Propositions 1 and 2 imply (7).

Proof of Proposition 1: Since ... and ... almost surely as $n \to \infty$ for ... and ..., the denominator of ... converges to six almost surely. The numerator can be expressed as

(14)

where ... denotes the expectation. The probability of ... is computed as shown in (15) below, where ... is the error function. Since (15) is summable, (14) holds with probability one by the Borel–Cantelli lemma. Since ... is arbitrary, the numerator almost surely converges to zero.

(15)
Proof of Proposition 2: From

(16)

and

(17)

the eigenvalues of the matrix ... are ... if ... and ... if ..., so that one eigenvalue is ... and the other is zero. One checks that

(18)

is the eigenvector with the eigenvalue ... . (Note that ... does not depend on ... .) For the other eigenvectors ... with the eigenvalue ..., ..., if we express ... as ..., only

(19)

is required, where ... is the ... unit matrix. Then,

(20)

where ... . Let

(21)

From (16) and (18), the first row of ... is ... . We put the remaining rows as a matrix ... . Then,

(22)

From (19), (20), (21), and (22), we obtain

(23)

For ..., the eigenvalues of the matrix ... are ... while the other one is ... . One checks that

(24)

is the eigenvector with the eigenvalue ... . For the other eigenvectors ... with the eigenvalue ..., ..., if we express ... as ..., only

(25)

is required. Then, the first column of ... is ..., and we put the remaining columns as a matrix ... . From the construction, ... coincides with ... expressed by (11). Furthermore, from (25),

(26)

(In general, ... if the multiplications ... and ... exist.) From (17), (23), and (26), (12) follows. On the other hand, for each ... and ..., the variance of ... is 1 and they are independent of each other. From (19), for each ... and ..., the variance of ... is 1 and they are independent of each other since ..., ..., are orthogonal. This shows (13). Q.E.D.

For the order identification of ergodic Markov processes, P. Billingsley derived (7) [3, p. 18], although Theorem 1 applies to more general problems, including pattern recognition and learning Bayesian network structure. However, Theorem 1 plays a rather preliminary role in this paper: the main goal is to prove that $\{2\log\log n\}_{n=1}^{\infty}$ is the smallest sequence in $\mathcal{D}$ that makes the model selection strongly consistent (Theorem 3). The material used in the proof of Theorem 1 will give a useful tool for solving the problem (in Section V).
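On the reading that Theorem 1 identifies the asymptotic overestimation probability with a chi-square tail probability whose degrees of freedom equal the increase in the number of parameters, the following Monte Carlo sketch (Python/NumPy; the toy model, sample sizes, and names are mine) compares the rate at which criterion (4) prefers a spurious two-class model with the tail probability $P(\chi^2_1 \ge d_n)$.

```python
import numpy as np

rng = np.random.default_rng(0)

def entropy_term(labels):
    """-sum_y c(y) log(c(y)/|group|) for one equivalence class (natural logarithm)."""
    counts = np.bincount(labels, minlength=2)
    total = counts.sum()
    return -sum(c * np.log(c / total) for c in counts if c)

def overestimation_rate(n, d_n, trials=1000):
    """Fraction of trials in which a spurious 2-class model beats the true 1-class model.

    Here X carries no information about Y (true model: one equivalence class);
    the competing model splits X into two classes, adding delta_k = 1 parameter,
    so criterion (4) prefers it exactly when 2 * (H_true - H_over) > d_n."""
    hits = 0
    for _ in range(trials):
        x = rng.integers(0, 2, size=n)
        y = rng.integers(0, 2, size=n)
        h_true = entropy_term(y)
        h_over = entropy_term(y[x == 0]) + entropy_term(y[x == 1])
        if 2.0 * (h_true - h_over) > d_n:
            hits += 1
    return hits / trials

def chi2_tail(dof, threshold, m=200_000):
    """Monte Carlo estimate of P(chi^2_dof >= threshold)."""
    return float(np.mean((rng.standard_normal((m, dof)) ** 2).sum(axis=1) >= threshold))

n = 2000
d_n = 2.0 * np.log(np.log(n))     # the Hannan-Quinn-type penalty
print(overestimation_rate(n, d_n), chi2_tail(1, d_n))
```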
B. Underestimated Models

Theorem 2: For ...,

...

almost surely as $n \to \infty$. (27)

In particular, ... almost surely as $n \to \infty$ for any ... .

Proof of Theorem 2: For ... and ...,

(28)
For ... and ... such that ...,

(29)

so that (28) has been applied. Hence, ... is in ..., where ..., and ... almost surely as $n \to \infty$. Since ..., ..., ..., and ..., the claim follows. Q.E.D.
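On the reading that Theorem 2 states that, for an underestimating model, the difference of empirical entropies per sample converges almost surely to a strictly positive Kullback divergence, a small simulation (Python; the two-class example and the names are mine) illustrates this behavior.

```python
import math
import random
from collections import Counter

random.seed(1)

def emp_entropy(pairs):
    """H = -sum_{a,y} c(a,y) log(c(a,y)/c(a)) over (class, label) pairs."""
    joint, marg = Counter(pairs), Counter(a for a, _ in pairs)
    return -sum(c * math.log(c / marg[a]) for (a, y), c in joint.items())

def sample(n):
    """Two x-classes with different conditional probabilities of Y = 1."""
    xs = [random.randint(0, 1) for _ in range(n)]
    ys = [1 if random.random() < (0.9 if x == 0 else 0.2) else 0 for x in xs]
    return xs, ys

for n in (10**3, 10**4, 10**5):
    xs, ys = sample(n)
    h_true = emp_entropy(list(zip(xs, ys)))       # the true two-class model
    h_under = emp_entropy([(0, y) for y in ys])   # underestimating one-class model
    print(n, (h_under - h_true) / n)              # approaches a positive constant
```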
V. STRONG CONSISTENCY OF MODEL SELECTION

Theorem 3 suggests that $\{2\log\log n\}_{n=1}^{\infty}$ is the smallest sequence in $\mathcal{D}$ that makes the model selection procedure strongly consistent, for classification as well as for autoregressive processes.

Theorem 3:
1) If $\{d_n\}_{n=1}^{\infty} > \{2\log\log n\}_{n=1}^{\infty}$, then the model selection is strongly consistent for ... and ... ;
2) If $\{d_n\}_{n=1}^{\infty} < \{2\log\log n\}_{n=1}^{\infty}$, then the model selection is not strongly consistent for ... and ..., when ... ;
where ... ranges over all stationary ergodic processes satisfying Assumption 1.

Proof of Theorem 3: From Theorem 2, for each underestimated model, ... almost surely as $n \to \infty$ if ... . We show that, for each overestimated model,

(30)

almost surely as $n \to \infty$. The proof is based on a simplified version of Kolmogorov's law of the iterated logarithm.

Lemma 1 ([22, p. 273]): Let ... be identically distributed with ... for some ..., and ... . Then

(31)

We define ... as

(32)

Then, from the definitions of the matrices ..., ... almost surely, where (28) and (29) have been used. Since ..., ..., ..., and ..., we obtain

(33)

almost surely as $n \to \infty$. On the other hand, from a similar discussion in [11, pp. 1076–1077], (33) implies

(34)

almost surely as $n \to \infty$, from which (30) follows.

If $\{d_n\}_{n=1}^{\infty} > \{2\log\log n\}_{n=1}^{\infty}$, then almost surely

(35)
Since $\{d_n\}_{n=1}^{\infty} > \{2\log\log n\}_{n=1}^{\infty}$, ..., so that the right-hand side of (35) is strictly positive for large $n$. Hence, almost surely ... .

On the other hand, suppose $\{d_n\}_{n=1}^{\infty} < \{2\log\log n\}_{n=1}^{\infty}$ and ... . Let ... and ... be such that ..., ..., ... . Then, for ... such that ... and ..., almost surely ... . Hence, ... infinitely many times with probability one. Q.E.D.
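The role of $2\log\log n$ can be seen numerically: by the law of the iterated logarithm (Lemma 1), centered partial sums fluctuate on the scale $\sqrt{2\sigma^2 n\log\log n}$, so a penalty growing more slowly than $2\log\log n$ is overwhelmed by these fluctuations infinitely often, while a faster-growing penalty is not. A one-path sketch follows (Python; a $\pm 1$ walk with variance one, names mine).

```python
import math
import random

random.seed(0)

def lil_ratio(n_max):
    """|S_n| / sqrt(2 n log log n) along one +-1 random-walk path (mean 0, variance 1)."""
    s, ratios = 0, []
    for n in range(1, n_max + 1):
        s += random.choice((-1, 1))
        if n >= 3:                       # log log n is positive only for n >= 3
            ratios.append(abs(s) / math.sqrt(2 * n * math.log(math.log(n))))
    return ratios

r = lil_ratio(200_000)
print(max(r[10_000:]))   # of order one, as the law of the iterated logarithm predicts
```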
REFERENCES

[1] H. Akaike, “A new look at the statistical model identification,” IEEE Trans. Automat. Contr., vol. AC-19, pp. 716–723, 1974.
[2] A. C. Atkinson, “A method for discriminating between models,” J. Roy. Statist. Soc., Ser. B, vol. 32, pp. 323–353, 1970.
[3] P. Billingsley, Statistical Inference for Markov Processes. Chicago, IL: The University of Chicago Press, 1961.
[4] H. Cramer, Mathematical Methods of Statistics. Princeton, NJ: Princeton Univ. Press, 1946.
[5] T. Cover and J. Thomas, Elements of Information Theory. New York: Wiley, 1991.
[6] G. F. Cooper and E. Herskovits, “A Bayesian method for the induction of probabilistic networks from data,” Machine Learn., vol. 9, pp. 309–347, 1992.
[7] I. Csiszar, “Information theoretical methods in statistics,” in Class Notes. College Park, MD: Univ. Maryland, 1990.
[8] I. Csiszar and P. C. Shields, “The consistency of the BIC Markov order estimator,” Ann. Statist., vol. 28, no. 6, Dec. 2000.
[9] L. Finesso, “Consistent Estimation of the Order for Markov and Hidden Markov Chains,” Ph.D. dissertation, University of Maryland, College Park, MD, 1990.
[10] E. J. Hannan and B. G. Quinn, “The determination of the order of an autoregression,” J. Roy. Statist. Soc., Ser. B, vol. 41, pp. 190–195, 1979.
[11] E. J. Hannan, “The estimation of the order of an ARMA process,” Ann. Statist., vol. 8, no. 5, pp. 1071–1081, 1980.
[12] P. Hall and E. Hannan, “On stochastic complexity and nonparametric density estimation,” Biometrika, vol. 75, pp. 705–714, 1988.
[13] J. C. Kieffer, “Strongly consistent code-based identification and order estimation for constrained finite-state model classes,” IEEE Trans. Inf. Theory, vol. 39, pp. 893–902, 1993.
[14] C. Liu and P. Narayan, “Order estimation and sequential universal data compression of a hidden Markov source by the method of mixtures,” IEEE Trans. Inf. Theory, vol. 40, pp. 1167–1180, 1994.
[15] N. Merhav, M. Gutman, and J. Ziv, “On the estimation of the order of a Markov chain and universal data compression,” IEEE Trans. Inf. Theory, vol. 35, pp. 1014–1019, 1989.
[16] N. Merhav, “The estimation of model order in exponential families,” IEEE Trans. Inf. Theory, vol. 35, pp. 1109–1113, 1989.
[17] J. Rissanen, “Modeling by shortest data description,” Automatica, vol. 14, pp. 465–471, 1978.
[18] ——, “Stochastic complexity and modeling,” Ann. Statist., vol. 14, pp. 1080–1100, 1986.
[19] G. Schwarz, “Estimating the dimension of a model,” Ann. Statist., vol. 6, pp. 461–464, 1978.
[20] R. Shibata, “Selection of the order of autoregressive model by Akaike’s information criterion,” Biometrika, vol. 63, pp. 117–126, 1976.
[21] Y. M. Shtarkov, “Universal sequential coding of single messages,” Probl. Inf. Trans., vol. 16, pp. 175–186, 1987.
[22] W. F. Stout, Almost Sure Convergence. New York: Academic, 1974.
[23] J. Suzuki, “A construction of Bayesian networks from databases based on the MDL principle,” in Proc. 1993 Uncertainty in Artificial Intelligence Conf., 1993, pp. 266–273.
[24] B. Yu and T. Speed, “Data compression and histograms,” Prob. Theory Rel. Fields, vol. 92, pp. 195–229, 1992.
[25] J. Ziv and N. Merhav, “Estimating the number of states of a finite state source,” IEEE Trans. Inf. Theory, vol. 38, pp. 61–65, 1992.