From: AAAI Technical Report WS-94-03. Compilation copyright © 1994, AAAI (www.aaai.org). All rights reserved.

Selection of Probabilistic Measure Estimation Method based on Recursive Iteration of Resampling Methods

Shusaku Tsumoto and Hiroshi Tanaka
Department of Informational Medicine, Medical Research Institute, Tokyo Medical and Dental University
1-5-45 Yushima, Bunkyo-ku, Tokyo 113 Japan
TEL: +81-3-3813-6111 (6159), FAX: +81-3-5684-3618
email: {tsumoto, tanaka}@tmd.ac.jp

Abstract

One of the most important problems in rule induction methods is how to estimate the reliability of the induced rules, which is a semantic part of knowledge to be estimated from finite training samples. In order to estimate errors of induced results, resampling methods, such as cross-validation and the bootstrap method, have been introduced. However, while the cross-validation method obtains better results in some domains, the bootstrap method calculates better estimation in other domains, and it is very difficult to choose between the two methods. In order to reduce these disadvantages, we introduce recursive iteration of resampling methods (RECITE). RECITE consists of the following four procedures: First, it randomly splits the training samples (S0) into two equal parts, one for new training samples (S1) and the other for new test samples (T1). Second, rules are induced from S1, and several estimation methods, given by users, are executed by using S1. Third, the rules are tested by T1, and the test error is compared with each estimator. The second and the third procedures are repeated a certain number of times given by users. Then the estimation method which gives the best estimator is selected as the most suitable estimation method. Finally, we use this estimation method for S0 and derive the estimators of statistical measures from the original training samples. We apply this RECITE method to three original medical databases and seven UCI databases. The results show that this method gives the best selection of estimation methods in almost all the cases.

1 Introduction

One of the most important problems in rule induction methods is how to estimate the reliability of the induced results, which is a semantic part of knowledge to be induced from finite training samples. In order to estimate errors of induced results, resampling methods, such as cross-validation and the bootstrap method, are introduced. However, while the cross-validation method obtains better results in some domains, the bootstrap method calculates better estimation in other domains, and it is very difficult to choose between the two methods. In order to reduce these disadvantages, we introduce recursive iteration of resampling methods (RECITE). RECITE consists of the following four procedures: First, it randomly splits the training samples (S0) into two equal parts, one for new training samples (S1) and the other for new test samples (T1). Second, rules are induced from S1, and several estimation methods, given by users, are executed by using S1. Third, the rules are tested by T1, and the test error is compared with each estimator. The second and the third procedures are repeated a certain number of times given by users. Then the estimation method which gives the best estimator is selected as the most suitable estimation method. Finally, we use this estimation method for S0.

We apply this RECITE method to three original medical databases and seven UCI databases. The results show that this method gives the best selection of estimation methods in almost all the cases.

The paper is organized as follows: in Section 2, we introduce our rule induction method based on rough sets, called PRIMEROSE. Section 3 shows the characteristics of resampling methods. In Section 4, we discuss the strategy of RECITE and illustrate how it works. Section 5 gives experimental results. Finally, in Section 6 we discuss the problems of our work.
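To make the four procedures above concrete, the following is a minimal sketch of the selection loop, not code from the paper. The hooks `induce_rules`, `test_error`, and the entries of `estimators` are hypothetical placeholders for a rule induction system and its evaluation, and the use of accumulated absolute deviation as the comparison criterion is an assumption.

```python
import random

def recite(samples, induce_rules, test_error, estimators, n_iter=10, seed=0):
    """Select the estimation method whose estimate best tracks the observed test error."""
    rng = random.Random(seed)
    deviations = {name: 0.0 for name in estimators}
    for _ in range(n_iter):
        shuffled = list(samples)
        rng.shuffle(shuffled)
        half = len(shuffled) // 2
        s1, t1 = shuffled[:half], shuffled[half:]        # new training (S1) / new test (T1) samples
        rules = induce_rules(s1)
        observed = test_error(rules, t1)                 # error of the induced rules on T1
        for name, estimate in estimators.items():
            deviations[name] += abs(estimate(s1) - observed)
    best = min(deviations, key=deviations.get)
    # Finally, apply the selected estimation method to the original training samples S0.
    return best, estimators[best](samples)
```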


In this paper, it is notable that we apply resampling methods not to gain predictive accuracy of induced rules, but to estimate more accurate statistical measures of the induced rules. So our methodology is quite different from the ordinary usage of resampling methods in the machine learning community. However, in the field of statistics, our methodology is more popular than the above usage.

2 PRIMEROSE

In this paper, we use a rule induction system based on rough sets for our experiments. However, our RECITE method does not depend on the rough set model, and we can also apply it to other rule induction systems, such as AQ [12] and ID3 [16]. In this section, we briefly introduce the PRIMEROSE method.

2.1 Probabilistic Extension of Rough Sets

Rough set theory was developed and rigorously formulated by Pawlak [15]. This theory can be used to acquire certain sets of attributes which would contribute to class classification and can also evaluate how precisely these attributes are able to classify data. We are developing an extension of the original rough set model to the probabilistic domain, which we call PRIMEROSE (Probabilistic Rule Induction Method based on ROugh Sets) [22, 23]. This extension is very similar to the concepts of Ziarko's VPRS model [26, 27, 28]. The PRIMEROSE algorithm is executed as follows: first, we calculate primitive clusters, each of which consists of the samples supported by the same equivalence relation. Then we remove redundant attribute pairs from the total equivalence relations if they do not contribute to increasing the classification rate, which we call Cluster-Based Reduction. Repeating these procedures, we finally get minimal equivalence relations, called minimal reducts. These equivalence relations can be regarded as premises of rules, so we derive the minimal rules by using the above reduction technique. Next, we estimate two statistical measures of the induced rules by the cross-validation method and the bootstrap method. Combining these measures with the induced results, we obtain probabilistic rules, whose form is defined in Subsection 2.2. For further information on the extension of the rough set model, readers could refer to [23, 27, 28].

In this paper, we use some notations from rough sets, which make our discussion clearer. For example, we denote a set which supports an equivalence relation R_i by [x]_{R_i}, and we call it an indiscernible set. For example, if an equivalence relation R is supported by a set {1,2,3}, then [x]_R is equal to {1,2,3} ([x]_R = {1,2,3}). Here we use {1,2,3} as a set of training samples, and each number, say "1", denotes the record number of a sample. For example, "3" in {1,2,3} refers to the sample whose record number is three. In the context of rule induction, R_i represents a combination of attribute-value pairs, which corresponds to the complexes of selectors in terms of the AQ method [12]. Furthermore, [x]_{R_i} means the set which satisfies such attribute-value relations. This set corresponds to a partial star of the AQ method, which supports the complexes of the selectors. For more information on rough sets and on rule induction based on rough sets, readers might refer to [15, 26].
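As a small illustration of indiscernible sets (not taken from the paper), the sketch below groups record numbers that share the same values on the chosen attributes, so records with identical attribute-value combinations fall into the same equivalence class [x]_R. The toy table and attribute names are invented for illustration only.

```python
from collections import defaultdict

def indiscernible_sets(table, attributes):
    """Group record numbers by their values on the given attributes."""
    classes = defaultdict(set)
    for record_no, row in table.items():
        key = tuple(row[a] for a in attributes)
        classes[key].add(record_no)
    return list(classes.values())

# Hypothetical training samples, keyed by record number.
table = {
    1: {"nature": "throbbing", "jolt": "yes"},
    2: {"nature": "throbbing", "jolt": "yes"},
    3: {"nature": "persistent", "jolt": "no"},
}
print(indiscernible_sets(table, ["nature", "jolt"]))  # [{1, 2}, {3}]
```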


2.2 Definition of Probabilistic Rules

We use the definition of probabilistic measures of diagnostic rules which Matsumura et al. [9] introduced for the development of a medical expert system, RHINOS (Rule-based Headache and facial pain INformation Organizing System). These diagnostic rules, called "inclusive rules", are formulated in terms of rough set theory as follows:

Definition 1 (Definition of Probabilistic Rules) Let R_i be an equivalence relation and D denote a set whose elements belong to one class and which is a subset of U. A probabilistic rule of D is defined as a tuple, < D, R_i, SI(R_i, D), CI(R_i, D) >, where R_i, SI, and CI are defined as follows. R_i is a conditional part of a class D, given as a combination of attribute-value pairs.

SI and CI are defined as:

$$SI(R_i, D) = \frac{\mathrm{card}\,\{([x]_{R_i} \cap D) \cup ([x]_{\hat{R}_i} \cap D^c)\}}{\mathrm{card}\,\{[x]_{R_i} \cup [x]_{\hat{R}_i}\}},$$

$$CI(R_i, D) = \frac{\mathrm{card}\,\{([x]_{R_i} \cap D) \cup ([x]_{\hat{R}_i} \cap D^c)\}}{\mathrm{card}\,\{D \cup D^c\}},$$

where D^c or [x]_{\hat{R}_i} consists of the unobserved future cases of a class D or those which satisfy R_i, respectively.

In the above definition, unobserved future cases means all possible future cases. So we consider an infinite number of cases, which is called the total population in the community of statistics. SI (Satisfactory Index) denotes the probability that a patient has the disease with this set of manifestations, and CI (Covering Index) denotes the ratio of the number of the patients who satisfy the set of manifestations to that of all the patients having this disease. Note that SI(R_i, D) is equivalent to the accuracy of R_i. For example, let us show an inclusive rule of common migraine as follows:

If history: paroxysmal, jolt headache: yes, nature: throbbing or persistent, prodrome: no, intermittent symptom: no, persistent time: more than 6 hours, and location: not eye, then we suspect common migraine (SI=0.9, CI=0.75).

Then SI=0.9 denotes that we can diagnose common migraine with probability 0.9 when a patient satisfies the premise of this rule. And CI=0.75 suggests that this rule covers only 75% of the total samples which belong to the class of common migraine. A total rule of D is given by R = \vee_i R_i, and then total SI (tSI) and total CI (tCI) are defined as tSI(R, D) = SI(\vee_i R_i, D) and tCI(R, D) = CI(\vee_i R_i, D), respectively. Since the above formulae include unobserved cases, we are forced to estimate these measures from the training samples. For this purpose, we introduce cross-validation and the bootstrap method to generate "pseudo-unobserved" cases from these samples, as shown in the next section.
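As a sketch (not from the paper), the apparent versions of SI and CI can be computed from training samples alone by dropping the unobserved terms D^c and [x]_{\hat{R}_i} of Definition 1. The set sizes below are made up so that they reproduce the SI=0.9, CI=0.75 values of the migraine example.

```python
# Apparent SI and CI over the training samples only; sets hold record numbers.
def apparent_si(x_r, d):
    """Apparent accuracy: fraction of cases matching R_i that belong to D."""
    return len(x_r & d) / len(x_r)

def apparent_ci(x_r, d):
    """Apparent coverage: fraction of cases of D that match R_i."""
    return len(x_r & d) / len(d)

x_r = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10}        # cases satisfying the premise R_i (hypothetical)
d   = {1, 2, 3, 4, 5, 6, 7, 8, 9, 11, 12}    # cases of the class D (hypothetical)
print(apparent_si(x_r, d), apparent_ci(x_r, d))  # 0.9 0.75
```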

3 Resampling Estimation Methods

The above equation (the definition of SI) is rewritten as:

$$
SI(R_i, D) = \frac{\mathrm{card}\,[x]_{R_i}}{\mathrm{card}\,([x]_{R_i} \cup [x]_{\hat{R}_i})} \cdot \frac{\mathrm{card}\,([x]_{R_i} \cap D)}{\mathrm{card}\,[x]_{R_i}}
+ \frac{\mathrm{card}\,[x]_{\hat{R}_i}}{\mathrm{card}\,([x]_{R_i} \cup [x]_{\hat{R}_i})} \cdot \frac{\mathrm{card}\,([x]_{\hat{R}_i} \cap D^c)}{\mathrm{card}\,[x]_{\hat{R}_i}}
= \varepsilon_{R_i}\,\alpha_{R_i} + (1 - \varepsilon_{R_i})\,\hat{\alpha}_{R_i},
$$

where \varepsilon_{R_i} denotes the ratio of training samples to the total population, which consists of both training samples and future cases, \alpha_{R_i} denotes the apparent accuracy, and \hat{\alpha}_{R_i} denotes the accuracy of classification for unobserved cases. This is the fundamental formula of accuracy (SI). Resampling methods focus on how to estimate \varepsilon_{R_i} and \hat{\alpha}_{R_i}, and make some assumptions about these parameters. Under those assumptions, we obtain the formulae of several estimation methods. In the following subsections, due to the limitation of space, we restrict the discussion to the three main methods: the cross-validation method, the bootstrap method, and the 0.632 estimator. Other methods, such as the jackknife method [2] and generalized cross-validation [4], can also be discussed in the same framework.
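The decomposition above is a weighted average of the apparent accuracy and the accuracy on unobserved cases. The following small numeric check (with made-up counts: 10 training and 40 future cases satisfy R_i, of which 9 and 28 belong to D) confirms that both forms give the same value.

```python
# Numeric check of SI = eps * alpha + (1 - eps) * alpha_hat with hypothetical counts.
n_train, n_future = 10, 40
hit_train, hit_future = 9, 28

eps = n_train / (n_train + n_future)      # ratio of training samples to total population
alpha = hit_train / n_train               # apparent accuracy
alpha_hat = hit_future / n_future         # accuracy on unobserved cases

si_direct = (hit_train + hit_future) / (n_train + n_future)
si_decomposed = eps * alpha + (1 - eps) * alpha_hat
assert abs(si_direct - si_decomposed) < 1e-12  # both equal 0.74
```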

3.1 Cross-Validation Method

The cross-validation method for error estimation is performed as follows: first, the whole set of training samples L is split into V blocks: {L_1, L_2, ..., L_V}. Second, we repeat V times the procedure in which we induce rules from the training samples L - L_i (i = 1, ..., V) and examine the accuracy a_i of the rules using L_i as test samples. Finally, we derive the whole accuracy a by averaging a_i over i, that is, a = \sum_{i=1}^{V} a_i / V (this method is called V-fold cross-validation). Therefore, we can use this method for the estimation of CI and SI by replacing the calculation of a by that of CI and SI, and by regarding the test samples as unobserved cases. This method does not use training samples to estimate measures, so in this case we can regard the following approximation as an assumption in applying this method: if the unobserved cases are expected to be much larger in number than the training samples, then the above formulae can be approximated as follows:

$$SI(R_i, D) \approx \frac{\mathrm{card}\,([x]_{\hat{R}_i} \cap D^c)}{\mathrm{card}\,[x]_{\hat{R}_i}}.$$

The main problems of cross-validation are how to choose the value of V and the high variability of estimates, or the large mean squared errors of the cross-validation estimates. The first problem suggests that, as the value of V increases, the estimates get closer to the apparent ones and the variance becomes smaller. We discuss this phenomenon in [22, 23], in which we conclude that the choice of V depends on our strategy. If it is desirable to avoid overestimation of the statistical measures, we can safely choose 2-fold cross-validation, whose estimators are asymptotically equal to predictive estimators for a completely new pattern of data, as shown in [3, 4]. In order to solve the second problem, the repeated cross-validation method has recently been introduced [24]. In this method, cross-validation is executed repeatedly (say, 100 times), and the estimates are averaged over all the trials. This iteration makes the variances lower, as shown in [23, 24]. For detailed information about these problems and methods, readers might refer to [23, 24]. Since our strategy is to avoid high variability, we choose the repeated 2-fold cross-validation method and the repeated 10-fold cross-validation method in this paper.
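The sketch below shows one way repeated V-fold cross-validation (here V=2 by default) could be organized: the estimate is averaged over many random splits to lower its variance, as discussed above. It is not the paper's implementation; `induce_rules` and `accuracy_on` are assumed user-supplied hooks for rule induction and for measuring SI, CI, or accuracy on the held-out fold.

```python
import random

def repeated_cv(samples, induce_rules, accuracy_on, v=2, repeats=100, seed=0):
    """Average the V-fold cross-validation estimate over repeated random splits."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(repeats):
        shuffled = list(samples)
        rng.shuffle(shuffled)
        folds = [shuffled[i::v] for i in range(v)]       # V roughly equal blocks
        fold_acc = []
        for i in range(v):
            train = [x for j, f in enumerate(folds) if j != i for x in f]
            rules = induce_rules(train)
            fold_acc.append(accuracy_on(rules, folds[i]))  # held-out fold as "unobserved" cases
        estimates.append(sum(fold_acc) / v)
    return sum(estimates) / repeats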

3.2 The Bootstrap Method

On the other hand, the bootstrap method is executed as follows: first, we create an empirical probability distribution (F_n) from the original training samples. Second, we use Monte Carlo methods and randomly generate training samples by using F_n. Third, rules are induced by using these newly generated training samples. Finally, these results are tested by the original training samples, and statistical measures, such as error rates, are calculated. We iterate these four steps a finite number of times. Empirically, it is known that about 200 repetitions are sufficient for estimation. This method uses training samples to estimate measures, so in this case we use the equation given in Subsection 3.1. For example, let {1,2,3,4,5} be the original training samples. From this population, we make training samples, say {1,1,2,3,3}. The induced result is equivalent to that of {1,2,3}. Since the original training samples are used as test samples, {1,2,3} yields an apparent accuracy, and {4,5} yields a predictive estimator for completely new samples, which can be regarded as test samples generated by cross-validation. In this case, \varepsilon is estimated as 3/5. We repeat this procedure and take the average over the whole results, which makes this estimation more accurate. That is, \hat{\varepsilon} and ...
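As a minimal sketch of the bootstrap loop just described (not the paper's code): resample the training set with replacement, induce rules from the resample, and evaluate them on the original samples, repeating about 200 times. `induce_rules` and `accuracy_on` are again assumed hooks.

```python
import random

def bootstrap_estimate(samples, induce_rules, accuracy_on, repeats=200, seed=0):
    """Average, over bootstrap resamples, the accuracy measured on the original samples."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(repeats):
        resample = [rng.choice(samples) for _ in samples]  # e.g. {1,1,2,3,3} drawn from {1,2,3,4,5}
        rules = induce_rules(resample)
        estimates.append(accuracy_on(rules, samples))      # test on the original training samples
    return sum(estimates) / repeats
```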
