2002 SRCCS International Statistical Workshop on 19-20 June, 2002 at Seoul National University

U-boosting method for classification and information geometry

Shinto Eguchi
Institute of Statistical Mathematics, Japan
and Department of Statistical Science, Graduate University of Advanced Studies

SUMMARY We aim to extend AdaBoost to U-boost within the paradigm of building a stronger classification machine from weak learning machines. The Bregman divergence leads to the U-boost class in the framework of information geometry for the space of finite measures on the label set. For the sequential U-boost learning algorithm proposed here we show a common feature of this class: each iteration step is associated with the least favorable distribution for the best classifier selected in the previous step. In each iteration step we observe that the two adjacent classifiers and the initial classifier form a right triangle in the scale of the Bregman divergence, which leads to a mild convergence property as in the EM algorithm. A statistical discussion elucidates the optimality structure of the U-boost class under a probabilistic assumption on the training data.

1 Introduction

Novel methodology for classification and pattern recognition has been developed along different directions over the past ten years. At the present stage we face a major breakthrough in solving much more difficult problems beyond conventional paradigms; see McLachlan (1992), Bishop (1995) and Hastie et al. (2001). In pattern recognition for digital imaging, on-line data processing is still one of these difficult problems. For example, based on pixel datasets taken by a TV video recorder at the corner of a city centre, it is almost impossible to detect a specified person's figure in real time. In genome data there are confounded and complex aspects arising from genetic polymorphism among human populations. The key subjects in this context are related to information from large-scale gene expression studies, including quantitative trait loci (QTL) analysis, single nucleotide polymorphism (SNP) data and especially microarray data, which have emerged from new technology for typing genes in massive clusters. If one succeeds in detecting all the associations of gene functions with quantitative characteristics for each individual, it will have a great impact on fundamental notions in medical science and lead to great changes in therapies for disease. In practice, informative data are hidden among trivially meaningless data, and it is still an open problem how to extract the mine hidden in the hull of data; further development of classification methods is expected to be challenged by these unsolved problems. Recent developments in classification methods have been made in the direction of statistical learning theory; see Vapnik (1999) and Hastie et al. (2001). One of the most fundamental objectives is to build algorithms that learn predictive factors in a statistical learning architecture. Several important approaches have been proposed and implemented as feasible computational algorithms. One promising direction is the boosting method, which is characterized by combining weak learning machines. The aim of combining is to make a new, stronger machine through the learning process, in line with ensemble learning theory. An alternative direction is motivated by the idea of maximizing the margin generated by a classification rule, which is called the support vector machine. This is implemented in a feasible form by the use of mathematical programming, including quadratic programming, and by embedding into a higher-dimensional space via the theory of reproducing kernel Hilbert spaces. However, it may be difficult to understand this approach by probabilistic arguments.

AdaBoost is one of the boosting algorithms in ensemble learning methods; see Schapire (1990), Freund and Schapire (1997) and Schapire et al. (1998) for the original idea and Pepe and Thompson (2000) for related motivation. The key idea of AdaBoost is to reweight the examples in accordance with the performance of the best machine tuned in each step, and to pass to the next step error rates renewed by this reweighting. In the updated step, correctly classified examples are weighted less while wrongly classified examples are weighted more. In the first round all the examples are given uniform weights, but from the second round onward the weighting changes drastically according to the status of the previous round. In each step the best machine is selected by its performance in classifying the examples that were wrongly answered in the previous step. Accordingly, the best machine in one step becomes the worst machine in the next step. At the final stage all the best machines are combined into a new classification machine with weights reflecting their respective performance. In this paper we would like to shed light on statistical properties of AdaBoost. The paper is organized as follows. Section 2 gives an overview of AdaBoost and a geometric understanding of the relation between AdaBoost and logistic classification rules. In Section 3 a class of boosting methods, called U-boost, is proposed in terms of the Bregman divergence class. Section 4 investigates statistical performance in the class of U-boosting rules. In Section 5 some comments on this proposal and on future problems to be challenged are given.

2 AdaBoost

2.1 Algorithm

Let us give a brief review of AdaBoost. We consider a structure of classification methods in which any element x in the space X of feature vectors is to be predicted to have group attribution y in a label set Y. For convenience we often consider the case in which y has a binary label with values −1 and +1. For given examples {(x_i, y_i) : i = 1, ..., n} we take a set of weak learning machines {f_j(x) : 1 ≤ j ≤ J}, where we suppose f_j(x) ∈ {−1, 1}. AdaBoost is defined by the following algorithm:

1. For t = 1 let w_t(i) = 1/n for i = 1, ..., n.

2. The t-th weighted error rate is

\epsilon_t(f) = \sum_{i=1}^n I(f(x_i) \neq y_i) \, w_t(i).   (1)

3-a. Find f_t = \arg\min_{1 \leq j \leq J} \epsilon_t(f_j).

3-b. Set \beta_t = \frac{1}{2} \log \frac{1 - \epsilon_t(f_t)}{\epsilon_t(f_t)}.

3-c. The t-th weight w_t(i) is updated to w_{t+1}(i) \propto w_t(i) \exp\{-y_i \beta_t f_t(x_i)\}.

4. The final answer is given by voting over all the steps, f = \mathrm{sgn}\big( \sum_{t=1}^T \beta_t f_t(x) \big).

In this recursive iteration we assume ε_t(f_t) ≤ 1/2; otherwise it suffices to change f_t into −f_t. Hence the weight w_t(i) is boosted up by exp(β_t) if y_i ≠ f_t(x_i), while it is boosted down by exp(−β_t) otherwise. Apparently this seems to be an ad hoc algorithm that merely alternates the logarithmic and exponential functions in stages 3-b and 3-c. However, we observe that

\epsilon_{t+1}(f_t) = \frac{1}{2} \quad (\forall \, t = 1, ..., T-1),   (2)

which implies that the best machine f_t in the t-th step becomes the worst machine when it is assessed by the updated weighted error rate. In this way the algorithm updates the weights into the least favorable distribution for the best choice of each step.
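The following NumPy sketch puts the four steps above into code. It is only an illustration: the pool of weak machines is assumed to be given as a small list of callables (here threshold stumps), and all names and data-generating choices are hypothetical rather than part of the original algorithm description.

```python
import numpy as np

def adaboost(X, y, stumps, T=20):
    """A minimal sketch of the algorithm above (steps 1-4).

    X : (n, d) feature array, y : (n,) labels in {-1, +1},
    stumps : a fixed pool of weak machines given as callables f_j(X) -> {-1,+1}^n,
    T : number of boosting rounds.  Returns the coefficients beta_t and the
    indices of the selected machines."""
    n = len(y)
    w = np.full(n, 1.0 / n)                        # step 1: w_1(i) = 1/n
    preds = np.array([f(X) for f in stumps])       # precompute f_j(x_i)
    betas, chosen = [], []
    for _ in range(T):
        errs = np.array([np.sum(w * (p != y)) for p in preds])   # step 2, eq. (1)
        j = int(np.argmin(errs))                   # step 3-a
        eps = float(np.clip(errs[j], 1e-12, 1 - 1e-12))
        beta = 0.5 * np.log((1.0 - eps) / eps)     # step 3-b (a negative beta plays
                                                   # the role of replacing f_t by -f_t)
        w = w * np.exp(-y * beta * preds[j])       # step 3-c
        w = w / w.sum()
        # eq. (2): the selected machine is least favorable for the new weights,
        # i.e. np.isclose(np.sum(w * (preds[j] != y)), 0.5) holds
        betas.append(beta); chosen.append(j)
    return betas, chosen

def predict(X, stumps, betas, chosen):
    return np.sign(sum(b * stumps[j](X) for b, j in zip(betas, chosen)))   # step 4

# toy usage with threshold stumps as the weak machines (illustrative choices)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = np.sign(X[:, 0] + 0.5 * rng.normal(size=200)); y[y == 0] = 1.0
stumps = [lambda Z, d=d, s=s: s * np.sign(Z[:, d] + 1e-12)
          for d in range(2) for s in (-1.0, 1.0)]
betas, chosen = adaboost(X, y, stumps, T=10)
print(np.mean(predict(X, stumps, betas, chosen) == y))
```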

Let us discuss the exponential loss

L_{\exp}(F) = \frac{1}{n} \sum_{i=1}^n \exp\{-y_i F(x_i)\}   (3)

and consider the best update from F(x) to F(x) + βf(x). We get

L_{\exp}(F + \beta f) = L_{\exp}(F)\{ e^{-\beta} + \epsilon(f)(e^{\beta} - e^{-\beta}) \},   (4)

where ε(f) is the weighted error rate defined by

\epsilon(f) = \frac{1}{n} \sum_{i=1}^n \frac{I(f(x_i) \neq y_i) \exp\{-y_i F(x_i)\}}{L_{\exp}(F)}.   (5)

Therefore the best update merging F with f is

F(x) + \frac{1}{2} \log \frac{1 - \epsilon(f)}{\epsilon(f)} \, f(x).   (6)

Hence AdaBoost is an algorithm that sequentially optimizes the exponential loss. Let F_t(x) = \sum_{s=1}^t \beta_s f_s(x). Then we get a monotonically decreasing sequence {L_exp(F_t) : t = 1, ..., T}; the improvement in the t-th step is L_{\exp}(F_t) = 2\sqrt{\epsilon_t(f_t)\{1-\epsilon_t(f_t)\}} \, L_{\exp}(F_{t-1}), which follows from (4) and (6). Friedman et al. (2000) discuss this aspect in the light of the generalized additive model. If one considers the empirical loss

L^{\rm emp}_{\exp}(\beta) = \frac{1}{n} \sum_{i=1}^n \exp\{-y_i \beta^T f(x_i)\},   (7)

then the parallel optimization can be implemented by the iteratively reweighted least squares (IRLS) algorithm in the context of binary regression; see McCullagh and Nelder (1989) and Eguchi and Copas (2002). On the other hand, the logistic discriminant function is frequently applied to various types of data, especially in the statistical community. The logistic loss is

L^{\rm emp}_{\log}(\beta) = \frac{1}{n} \sum_{i=1}^n \frac{\exp\{-y_i \beta^T f(x_i)\}}{\exp\{\beta^T f(x_i)\} + \exp\{-\beta^T f(x_i)\}}.   (8)

This loss function L^{\rm emp}_{\log}(\beta) is a natural reflection of the parametric model for the conditional distribution P(Y = y | X) of Y given X; see Cornfield (1962), Cox (1966) and Efron (1975). In fact L^{\rm emp}_{\log}(\beta) is proportional to the conditional log-likelihood function with factor −1/n. In the usual argument of asymptotic theory the estimator

\hat{\beta} = \arg\min_{\beta} L^{\rm emp}_{\log}(\beta)   (9)

is shown to be efficient by a Cramér-Rao type inequality. The relation between the exponential and logistic losses is also discussed in Friedman et al. (2000). We now review information geometric understandings of this relation; see Lebanon and Lafferty (2001).
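Before turning to the geometric view, the following small NumPy snippet evaluates the two empirical losses (7) and (8) on synthetic data; the data-generating choices are arbitrary and only meant to make the comparison concrete.

```python
import numpy as np

rng = np.random.default_rng(0)
n, J = 200, 5
fvals = rng.choice([-1.0, 1.0], size=(n, J))          # f_j(x_i), values in {-1, +1}
beta = rng.normal(size=J)
y = np.sign(fvals @ rng.normal(size=J) + 0.3 * rng.normal(size=n))
y[y == 0] = 1.0                                       # avoid zero labels

margin = y * (fvals @ beta)                           # y_i * beta^T f(x_i)
L_exp = np.mean(np.exp(-margin))                      # exponential loss (7)
L_log = np.mean(np.exp(-margin)
                / (np.exp(margin) + np.exp(-margin))) # logistic loss (8)
print(L_exp, L_log)
```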

2.2 AdaBoost and logistic classification

Let G be a fixed marginal distribution of the feature vector x throughout our discussion; in practice G will be taken to be the empirical distribution. We consider the space of all finite measures over the label set Y,

\mathcal{M} = \{ m(y|x) : \sum_{y \in \mathcal{Y}} m(y|x) < \infty \ (\text{a.e. } x) \},   (10)

and the subspace

\mathcal{P} = \{ m(y|x) \in \mathcal{M} : \sum_{y \in \mathcal{Y}} m(y|x) = 1 \ (\text{a.e. } x) \}.   (11)

The KL divergence extended over M is given by

KL(m, \mu) = \int_{\mathcal{X}} \sum_{y \in \mathcal{Y}} \Big[ m(y|x) \log \frac{m(y|x)}{\mu(y|x)} - m(y|x) + \mu(y|x) \Big] G(dx).   (12)
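As a concrete illustration of (12), the following sketch evaluates the extended KL divergence when G is taken as the empirical distribution of the observed feature vectors (as is done in Section 2.3); the function name and the toy measures are illustrative assumptions of the sketch.

```python
import numpy as np

def kl_extended(m, mu):
    """Extended KL divergence (12) between finite measures on the label set,
    with G taken as the empirical distribution of the observed feature vectors.
    m, mu : arrays of shape (n, |Y|) with strictly positive entries."""
    m, mu = np.asarray(m, float), np.asarray(mu, float)
    return (m * np.log(m / mu) - m + mu).sum(axis=1).mean()

# the divergence vanishes iff the measures coincide, and it is defined even
# when the rows do not sum to one (i.e. on M rather than P)
m = np.array([[0.2, 0.8], [1.5, 0.5]])
print(kl_extended(m, m))          # 0.0
print(kl_extended(m, m + 0.1))    # > 0
```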

Hereafter we fix a measure m_0(y|x) in P and µ_0(y|x) in M. Let f(x, y) = (f_1(x, y), ..., f_T(x, y))^T be the vector of weak learners. Then we denote

E(f, m_0) = \Big\{ m \in \mathcal{M} : \int_{\mathcal{X}} \sum_{y \in \mathcal{Y}} m(y|x) \{ f(x, y) - \tilde{f}(x) \} \, G(dx) = 0 \Big\},   (13)

where

\tilde{f}(x) = \sum_{y \in \mathcal{Y}} m_0(y|x) f(x, y).   (14)

By definition E(f, m_0) is of codimension T in the full space M and m_0(y|x) ∈ E(f, m_0). Let m_1 and m_2 be in E(f, m_0). Then we observe that for any positive numbers α_1 and α_2,

(\alpha_1 m_1 + \alpha_2 m_2) \in E(f, m_0),   (15)

so E(f, m_0) is a convex cone. Let m ∈ M, and define an expectation operator E_m by

E_m\{a(X, Y)\} = \int_{\mathcal{X}} \sum_{y \in \mathcal{Y}} a(x, y) \, p(y|x) \, G(dx),   (16)

where

p(y|x) = \frac{m(y|x)}{\sum_{y' \in \mathcal{Y}} m(y'|x)}.   (17)

Thus E_m is the statistical expectation under the joint distribution with conditional distribution p(y|x) and marginal distribution G. Then we can rewrite E(f, m_0) in the expectation-constraint form

E(f, m_0) = \{ m \in \mathcal{M} : E_m\{f(X, Y)\} = E_{m_0}\{f(X, Y)\} \}.   (18)

However, the original linear form is helpful for the subsequent discussion, as noted in Lebanon and Lafferty (2001). We consider the optimization problem

\min_{m \in E(f, m_0)} KL(m, \mu_0).   (19)

By the usual discussion the Lagrangian is

L(m, \beta) = \int_{\mathcal{X}} \sum_{y \in \mathcal{Y}} \Big[ m(y|x) \Big\{ \log \frac{m(y|x)}{\mu_0(y|x)} - \beta^T \{ f(x, y) - \tilde{f}(x) \} \Big\} - m(y|x) \Big] G(dx).   (20)

The variational equilibrium condition leads to

\log \frac{m(y|x)}{\mu_0(y|x)} - \beta^T \{ f(x, y) - \tilde{f}(x) \} = 0 \ (\text{a.e. } x).   (21)

Hence the optimization problem (19) yields

m(y|x) = \mu_0(y|x) \exp[\beta^T \{ f(x, y) - \tilde{f}(x) \}],   (22)

the associated model of which is

M(f, m_0) = \{ \mu_0(y|x) \exp[\beta^T \{ f(x, y) - \tilde{f}(x) \}] : \beta \in \mathbb{R}^T \}.   (23)

The problem dual to (19) reduces to

\min_{\mu \in M(f, m_0)} KL(m_0, \mu).   (24)

Therefore we conclude the following equivalence between the two optimization problems.

Theorem A (Lebanon and Lafferty, 2001).

\mu^* = \arg\min_{m \in E(f, m_0)} KL(m, \mu_0) = \arg\min_{\mu \in M(f, m_0)} KL(m_0, \mu).   (25)

This discussion carries over in parallel to the subspace P obtained by projection from M. Note that over P the KL divergence reduces to

KL(p, q) = \int_{\mathcal{X}} \sum_{y \in \mathcal{Y}} p(y|x) \log \frac{p(y|x)}{q(y|x)} \, G(dx).   (26)

These optimization problems can be reduced to various boosting and logistic classification rules.

2.3 Empirical loss

Let (x_i, y_i), i = 1, ..., n, be the training data. One of the most advantageous points of the KL divergence is that it readily admits an empirical form based on the training data. For this we select the marginal distribution G of the feature vector as the empirical distribution

G = \frac{1}{n} \sum_{i=1}^n \delta_{(x_i, y_i)}(x, y) \quad (\text{Kronecker's } \delta).   (27)

In accordance with this setting we reduce m_0(y|x) to

m_0(y|x) = \begin{cases} \delta_{y_i}(y) & \text{if } x = x_i, \\ 0 & \text{otherwise}. \end{cases}   (28)

Thus \tilde{f}(x) defined in (14) is

\tilde{f}(x) = \begin{cases} f(x_i, y_i) & \text{if } x = x_i, \\ 0 & \text{otherwise}. \end{cases}   (29)

The estimative space E(f, m_0) has the empirical form

E^{\rm emp}(f, m_0) = \Big\{ m \in \mathcal{M} : \frac{1}{n} \sum_{i=1}^n \sum_{y \in \mathcal{Y}} m(y|x_i) \{ f(x_i, y) - f(x_i, y_i) \} = 0 \Big\}.   (30)

This empirical reduction connects the discussion with practical situations in classification tasks.

Example 1 (AdaBoost). The dual optimization problem reduces to the minimization of the exponential loss

L^{\rm emp}_{\exp}(\beta) = \frac{1}{n} \sum_{i=1}^n \sum_{y \in \mathcal{Y}} \exp[\beta^T \{ f(x_i, y) - f(x_i, y_i) \}]   (31)

with respect to β. If y is binary with values ±1, then the loss further simplifies to L^{\rm emp}_{\exp}(\beta) as defined in (7), where we set f(x, y) = \frac{1}{2} y f(x).
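The following small numerical check illustrates the binary reduction: with Y = {−1, +1} and f(x, y) = ½ y f(x), the multiclass loss (31) equals (7) up to an additive constant (the y = y_i terms each contribute exp(0) = 1), so the two losses share the same minimizer. The data below are synthetic and purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
n, J = 100, 4
fvals = rng.choice([-1.0, 1.0], size=(n, J))     # f_j(x_i)
y = rng.choice([-1.0, 1.0], size=n)
beta = rng.normal(size=J)

# multiclass form (31) with Y = {-1, +1} and f(x, y) = y * f(x) / 2
loss31 = np.mean([sum(np.exp(beta @ (0.5 * lab * fvals[i] - 0.5 * y[i] * fvals[i]))
                      for lab in (-1.0, 1.0)) for i in range(n)])
# binary form (7)
loss7 = np.mean(np.exp(-y * (fvals @ beta)))
print(loss31, 1.0 + loss7)   # equal: the y = y_i terms contribute a constant 1
```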

Example 2. We impose the additional constraint that \sum_{y \in \mathcal{Y}} m(y|x_i) = 1 for every i = 1, ..., n. This leads to the parametric density

q(y|x, \beta) = \frac{1}{Z(x)} \mu_0(y|x) \exp\{\beta^T f(x, y)\},   (32)

where Z(x) is the normalizing factor,

Z(x) = \sum_{y \in \mathcal{Y}} \mu_0(y|x) \exp\{\beta^T f(x, y)\}.   (33)

We observe that this is exactly the polytomous logistic model with minimal sufficient statistics f_j(x, y). If y is binary the loss is just L^{\rm emp}_{\log}(\beta) as defined in (8).
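A minimal sketch of the density (32) and the normalizing factor (33), assuming a small finite label set; the toy weak learners, the base measure µ_0 and all names are illustrative choices, not part of the original formulation.

```python
import numpy as np

def q(y_index, x_feats, beta, mu0, f):
    """The parametric density (32): q(y|x, beta) proportional to
    mu0(y|x) * exp(beta^T f(x, y)), normalized by Z(x) of (33).
    y_index : index of the label, x_feats : feature vector, beta : (T,) vector,
    mu0 : (|Y|,) base measure at x, f : callable f(x, y_index) -> (T,) vector."""
    scores = np.array([mu0[k] * np.exp(beta @ f(x_feats, k))
                       for k in range(len(mu0))])
    Z = scores.sum()                     # normalizing factor (33)
    return scores[y_index] / Z

# toy check with three labels and two weak learners (illustrative choices)
f = lambda x, k: np.array([x[0] * (k == 0) - x[0] * (k == 2), x[1] * k])
x = np.array([0.4, -1.2]); beta = np.array([1.0, 0.5]); mu0 = np.ones(3) / 3
probs = [q(k, x, beta, mu0, f) for k in range(3)]
print(probs, sum(probs))                 # a proper distribution over Y
```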

We now consider a sequential version of Theorem A. Let the t-th weak learning machine f_t(x, y) be fixed, and set

M(f_t, m_0) = \{ \mu_{t-1}(y|x) \exp[\beta_t \{ f_t(x, y) - \tilde{f}_t(x) \}] : \beta_t \in \mathbb{R} \}   (34)

and

E(f_t, m_0) = \Big\{ m \in \mathcal{M} : \int_{\mathcal{X}} \sum_{y \in \mathcal{Y}} m(y|x) \{ f_t(x, y) - \tilde{f}_t(x) \} \, G(dx) = 0 \Big\}.   (35)

A direct application of Theorem A leads to

\mu_{t+1} = \arg\min_{m \in E(f_t, m_0)} KL(m, \mu_0) = \arg\min_{\mu \in M(f_t, m_0)} KL(m_0, \mu),   (36)

and in fact

E(f_t, m_0) \cap M(f_t, m_0) = \{ \mu_{t+1} \}.   (37)

We observe that

KL(m_0, \mu_t) = KL(m_0, \mu_{t+1}) + KL(\mu_t, \mu_{t+1}),   (38)

which implies a monotonically decreasing sequence {KL(m_0, µ_t) : t ≥ 1}. The subspace E(f_t, m_0) orthogonally intersects M(f_t, m_0) at the singleton {µ_{t+1}}. A set of right triangles connecting m_0, µ_t and µ_{t+1} for t ≥ 1 is associated with the process of this learning algorithm. Let us return to the AdaBoost method as given in Example 1, where y takes values ±1. Then the AdaBoost algorithm is the same as this algorithm, with the relation

w_t(i) = \mu_t(y_i|x_i),   (39)

if one again considers the empirical formulation as above. We will extend this understanding to a wider class given by the Bregman divergence rather than the KL divergence.

3 U-boost

We have reviewed geometric understandings of AdaBoost by using the KL divergence. Let us extend them to the Bregman divergence. Let H(z) be a convex function. Over the space M the Bregman divergence is defined by

D_H(m, \mu) = \int_{\mathcal{X}} \sum_{y \in \mathcal{Y}} d_H(m(y|x), \mu(y|x)) \, G(dx),   (40)

where h(z) = H'(z) and

d_H(m, \mu) = H(h^{-1}(\mu)) - H(h^{-1}(m)) - m\{ h^{-1}(\mu) - h^{-1}(m) \}.   (41)

By definition D_H satisfies the first axiom of a distance, that is, D_H(m, µ) ≥ 0 and

D_H(m, \mu) = 0 \iff m = \mu \ (\text{a.e. } x).   (42)

If one chooses H(z) = exp(z), then the Bregman divergence reduces to the extended KL divergence defined in (12). Other typical examples are given as follows.

Example 3. The β-divergence is generated by

H_\beta(z) = \frac{(\beta z + 1)^{(\beta+1)/\beta}}{\beta + 1}.   (43)

The functional form is

D_\beta(m, \mu) = \int_{\mathcal{X}} \sum_{y \in \mathcal{Y}} \Big[ \frac{m(y|x)\{ m(y|x)^{\beta} - \mu(y|x)^{\beta} \}}{\beta} - \frac{m(y|x)^{\beta+1} - \mu(y|x)^{\beta+1}}{\beta + 1} \Big] G(dx).   (44)

Example 4. The generating function is

H_\eta(z) = \exp(z) - \eta z,   (45)

so that the generated functional is

D_\eta(m, \mu) = \int_{\mathcal{X}} \sum_{y \in \mathcal{Y}} \Big[ \{ m(y|x) + \eta \} \log \frac{m(y|x) + \eta}{\mu(y|x) + \eta} - m(y|x) + \mu(y|x) \Big] G(dx).   (46)

We observe that

D_\eta(m, \mu) = KL(m + \eta, \mu + \eta).   (47)
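The following sketch evaluates the pointwise terms of the divergences (12), (44) and (46) and checks numerically that the β-divergence recovers the extended KL divergence as β → 0 and that the identity (47) holds; the sums play the role of the integral under an empirical G, and the numerical values are arbitrary.

```python
import numpy as np

def d_kl(m, mu):                      # pointwise extended KL term, cf. (12)
    return m * np.log(m / mu) - m + mu

def d_beta(m, mu, b):                 # pointwise beta-divergence term, cf. (44)
    return m * (m**b - mu**b) / b - (m**(b + 1) - mu**(b + 1)) / (b + 1)

def d_eta(m, mu, eta):                # pointwise eta-divergence term, cf. (46)
    return (m + eta) * np.log((m + eta) / (mu + eta)) - m + mu

m, mu = np.array([0.3, 1.2, 0.7]), np.array([0.5, 0.9, 0.7])
print(d_beta(m, mu, 1e-6).sum(), d_kl(m, mu).sum())    # beta -> 0 recovers KL
print(d_eta(m, mu, 0.2).sum(),
      d_kl(m + 0.2, mu + 0.2).sum())                   # identity (47)
```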

Let us now extend the discussion from the KL divergence to the Bregman divergence. Consider the problem

\min_{m \in E(f, m_0)} D_H(m, \mu_0),   (48)

where E(f, m_0) is defined in (13). The Lagrangian is given by

L_H(m, \beta) = \int_{\mathcal{X}} \sum_{y \in \mathcal{Y}} \Big[ m(y|x) \big\{ h^{-1}(m(y|x)) - h^{-1}(\mu_0(y|x)) - \beta^T \{ f(x, y) - \tilde{f}(x) \} \big\} - H(h^{-1}(m(y|x))) \Big] G(dx).   (49)

The variational argument yields

h^{-1}(m(y|x)) - h^{-1}(\mu_0(y|x)) - \beta^T \{ f(x, y) - \tilde{f}(x) \} = 0 \ (\text{a.e. } x).   (50)

Hence we get a parametric model

M_H(f, m_0) = \big\{ h\big( h^{-1}(\mu_0(y|x)) + \beta^T \{ f(x, y) - \tilde{f}(x) \} \big) : \beta \in \mathbb{R}^T \big\}.   (51)

The optimization problem (48) has the dual version

\min_{\mu \in M_H(f, m_0)} D_H(m_0, \mu).   (52)

Through this discussion we get an extended theorem, from the KL divergence to the Bregman divergence.

Theorem B.

\mu^*_H = \arg\min_{m \in E(f, m_0)} D_H(m, \mu_0) = \arg\min_{\mu \in M_H(f, m_0)} D_H(m_0, \mu).   (53)

See Murata et al. (2002) for a detailed discussion. This extension is fruitful for giving a variety of boosting methods. Let U(z) = H(h^{-1}(z)). We propose U-boost as follows. Steps 1, 2, 3-a and 4 are exactly the same as in the AdaBoost algorithm. The remaining two stages are changed into

3-bU.  \beta_t = \arg\min_{\beta} \frac{1}{n} \sum_{i=1}^n U\big( F_{t-1}(x_i) y_i + \beta f_t(x_i) y_i \big),   (54)

3-cU.  w_{t+1}(i) = U'\big( F_{t-1}(x_i) y_i + \beta_t f_t(x_i) y_i \big).   (55)

By definition we observe that

\epsilon_{t+1}(f_t) = \frac{1}{2}   (56)

for any t = 1, ..., T. In this way any U-boost algorithm shows the same least-favorable-distribution learning property as AdaBoost.
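A hedged sketch of the modified stages 3-bU and 3-cU. The sign convention here is an assumption taken from the binary reduction of the U-loss (64), i.e. U is applied to the negative margin −y_i F(x_i), so that U(z) = exp(z) recovers the AdaBoost weights; the grid search replaces the one-dimensional minimization and all names are illustrative.

```python
import numpy as np

def u_boost(X, y, stumps, U, dU, T=20, beta_grid=np.linspace(0.0, 3.0, 301)):
    """A sketch of steps 3-bU and 3-cU with a generic convex U.

    U, dU : callables for U and its derivative U'.  The positive grid assumes
    the selected machine beats random guessing; the weighted error rate of
    step 2 is computed with the current (unnormalized) weights."""
    n = len(y)
    preds = np.array([f(X) for f in stumps])
    F = np.zeros(n)
    w = np.full(n, 1.0 / n)
    betas, chosen = [], []
    for _ in range(T):
        errs = np.array([np.sum(w * (p != y)) for p in preds]) / w.sum()  # step 2
        j = int(np.argmin(errs))                                          # step 3-a
        losses = [np.mean(U(-y * (F + b * preds[j]))) for b in beta_grid] # step 3-bU
        beta = float(beta_grid[int(np.argmin(losses))])
        F = F + beta * preds[j]
        w = dU(-y * F)                                                    # step 3-cU
        betas.append(beta); chosen.append(j)
    return betas, chosen, np.sign(F)                                      # step 4

# U(z) = exp(z) reproduces AdaBoost; other convex U give other members of the class
# betas, chosen, yhat = u_boost(X, y, stumps, U=np.exp, dU=np.exp)
```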

We also consider a sequential version of Theorem B; the result is quite parallel to Theorem B. For the t-th weak learning machine f_t(x, y), let

M_H(f_t, m_0) = \big\{ h\big( h^{-1}(\mu_{t-1}(y|x)) + \beta_t \{ f_t(x, y) - \tilde{f}_t(x) \} \big) : \beta_t \in \mathbb{R} \big\},   (57)

where E(f_t, m_0) is the same as in (35). Theorem B leads to

\mu_{t+1} = \arg\min_{m \in E(f_t, m_0)} D_H(m, \mu_0) = \arg\min_{\mu \in M_H(f_t, m_0)} D_H(m_0, \mu),   (58)

and in fact

E(f_t, m_0) \cap M_H(f_t, m_0) = \{ \mu_{t+1} \},   (59)

so that

D_H(m_0, \mu_t) = D_H(m_0, \mu_{t+1}) + D_H(\mu_t, \mu_{t+1}).   (60)

The subspace E(f_t, m_0) orthogonally intersects M_H(f_t, m_0) at the singleton {µ_{t+1}}. We have a convergence result for the U-boost algorithm via this geometric view; see Amari (1995) for a similar argument for the EM algorithm.

Theorem C.

D_H(m_0, \mu_t) = D_H(m_0, \mu_{t+1}) + D_H(\mu_t, \mu_{t+1}).   (61)

The proof follows from a direct observation from Theorem B. In fact we get that

D_H(m_0, \mu_t) - \{ D_H(m_0, \mu_{t+1}) + D_H(\mu_t, \mu_{t+1}) \} = \int_{\mathcal{X}} \sum_{y \in \mathcal{Y}} \{ m_0(y|x) - \mu_t(y|x) \} \{ h^{-1}(\mu_t(y|x)) - h^{-1}(\mu_{t+1}(y|x)) \} \, G(dx),   (62)

which equals

\int_{\mathcal{X}} \sum_{y \in \mathcal{Y}} \{ m_0(y|x) - \mu_t(y|x) \} \, \beta_t \{ f_t(x, y) - \tilde{f}_t(x) \} \, G(dx),   (63)

which vanishes because of Theorem B. The proof is complete.

Let us return to the empirical setting of examples (x_i, y_i), i = 1, ..., n, as discussed before Examples 1 and 2. We have confirmed that AdaBoost is driven by the exponential loss. Alternatively, U-boost is driven by the U-loss function

L_U(\beta) = \frac{1}{n} \sum_{i=1}^n \sum_{y \in \mathcal{Y}} U(\beta^T \{ f(x_i, y) - f(x_i, y_i) \}),   (64)

and the U-boost algorithm is a sequential optimization of this U-loss function.
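For concreteness, the empirical U-loss (64) can be evaluated as follows; the array layout is an assumption of this sketch, and U = exp recovers the exponential loss (31).

```python
import numpy as np

def u_loss(beta, fvals, y_idx, U=np.exp):
    """Empirical U-loss (64)/(65): (1/n) sum_i sum_y U(beta^T {f(x_i,y) - f(x_i,y_i)}).
    fvals : array of shape (n, |Y|, T) with entries f(x_i, y);
    y_idx : (n,) array of observed label indices."""
    n = fvals.shape[0]
    diff = fvals - fvals[np.arange(n), y_idx][:, None, :]   # f(x_i, y) - f(x_i, y_i)
    return np.mean(np.sum(U(diff @ beta), axis=1))
```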

4 Statistical structure in the U-boosting class

The Bregman divergence offers a much broader choice of statistical methodology than AdaBoost for classification. The key idea presented here is the empirical loss

L_U(\beta) = \frac{1}{n} \sum_{i=1}^n \sum_{y \in \mathcal{Y}} U(\beta^T \{ f(x_i, y) - f(x_i, y_i) \}).   (65)

A statistical consideration naturally leads to the abstract loss

L^{\rm abs}_U(\beta) = \sum_{y \in \mathcal{Y}} E\big[ U(\beta^T \{ f(X, y) - f(X, Y) \}) \big],   (66)

where E denotes the statistical expectation with respect to the joint distribution of X and Y. If the training dataset {(x_i, y_i) : i = 1, ..., n} follows this joint distribution, then

L_U(\beta) \to L^{\rm abs}_U(\beta) \ (\text{a.s.})   (67)

as n increases to infinity. Alternatively, the true log-likelihood function is

\lambda(x, y) = \log p(y|x),   (68)

where p(y|x) is the conditional probability of Y = y given X = x. It leads to the abstract loss

L^{\rm abs}_U(\lambda) = \sum_{y \in \mathcal{Y}} E\big[ U(\lambda(X, y) - \lambda(X, Y)) \big].   (69)

In this formulation we observe the following.

Theorem D.

L^{\rm abs}_U(F) \geq L^{\rm abs}_U(\lambda) \quad (\forall \, F).   (70)

(Proof). The difference of the two losses is expressed via the conditional expectation as

L^{\rm abs}_U(F) - L^{\rm abs}_U(\lambda) = E\Big[ \sum_{y \in \mathcal{Y}} \sum_{k \in \mathcal{Y}} \{ U(F_{ky}) - U(\lambda_{ky}) \} \, p(k|X) \Big],   (71)

where F_{ky} = \beta^T \{ f(X, k) - f(X, y) \} and \lambda_{ky} = \lambda(X, k) - \lambda(X, y). Hence we get

L^{\rm abs}_U(F) - L^{\rm abs}_U(\lambda) = E\Big[ \sum_{y \in \mathcal{Y}} \sum_{k \in \mathcal{Y}} \int_{\lambda_{ky}}^{F_{ky}} U'(z) \{ e^{z - \lambda_{ky}} - 1 \} \, dz \Big].   (72)

We conclude the assertion of Theorem D since the bracketed term in (72) is nonnegative with probability one.

It follows from Theorem D that the global minimum of the U-loss function is attained when F is a monotone transform of the log-likelihood λ. In other words, the Bayes rule gives the unique optimum of the U-loss function.

5 Discussion and future problems

Let us give an overview of more general applications to statistical methodology beyond the U-boosting method for classification. For a general statistical model f(y), the Bregman divergence from a data distribution with density g(y) to the model is

D_H(g, f) = \int d_H(g(y), f(y)) \, \nu(dy),   (73)

where ν is a carrier measure and d_H is defined in (41). The minimum divergence method readily yields an empirical form for a given dataset y_i, i = 1, ..., n, as follows:

L_H(\theta) = \frac{1}{n} \sum_{i=1}^n h^{-1}(f(y_i)) - \int H(h^{-1}(f(y))) \, \nu(dy).   (74)

(74)

Since the first term of LH (θ) approximates 

h−1 (f (y))g(y)ν(dy).

(75)

we observe that the minimization via the Bregman divergence is approximated by that via the empirical loss LH (θ). See Eguchi and Kano (2001) and Jones et al. (2001) for general discussions. For the principal component analysis (PCA) the η-divergence is specially applied to data with outliers, cf. Higuchi and Eguchi (2001) and Kamiya and Eguchi (2001). For the independent component analysis (ICA) the β divergence is applied from an invariance point of view, cf. Minami and Eguchi (2002). The main focused point in these contexts is to investigate the relation of the robustness and model efficiency. The conventional method via the KL divergence can be drastically robustified if one replace the KL divergence into the β or η divergence. In fact in the context of PCA or ICA the informative component to be detected is sensitively lost by the conventional method if a few outliers occur in the hull of acceptable data. On the other hand, the minimum β and η divergence methods are quite robust against outliers, in which the theoretical aspect is pointed out the boundedness of the influence functions. Let us back to the situation of classification rule. We have to reconsider the meaning of outlier in contrast with the usual case including the PCA and ICA. The label set 14
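As an illustration of the empirical form (74), the following sketch performs minimum β-divergence estimation for a Gaussian location model and contrasts it with the sample mean (the KL/maximum-likelihood answer) in the presence of gross outliers. The Gaussian model, the value β = 0.5, the grid search and the contamination setup are all assumptions of this sketch, not part of the paper.

```python
import numpy as np

def l_h_beta(theta, ys, b=0.5, sigma=1.0):
    """Empirical objective (74) for the Gaussian location model f(y) = N(y; theta, sigma^2)
    with the beta-divergence generator H_beta of (43):
    h^{-1}(f) = (f^b - 1)/b  and  H(h^{-1}(f)) = f^{b+1}/(b+1).
    The integral term has a closed form for the Gaussian (constant in theta)."""
    f = np.exp(-(ys - theta) ** 2 / (2 * sigma ** 2)) / np.sqrt(2 * np.pi * sigma ** 2)
    first = np.mean((f ** b - 1.0) / b)
    integral = (2 * np.pi * sigma ** 2) ** (-b / 2) * (b + 1) ** (-1.5)
    return first - integral

rng = np.random.default_rng(3)
ys = np.concatenate([rng.normal(0.0, 1.0, 100), np.full(10, 8.0)])  # 10 gross outliers

grid = np.linspace(-2, 4, 601)
theta_beta = grid[np.argmax([l_h_beta(t, ys) for t in grid])]
print(ys.mean(), theta_beta)   # the sample mean is dragged towards the outliers;
                               # the minimum beta-divergence estimate stays near 0
```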

For principal component analysis (PCA) the η-divergence is especially suited to data with outliers, cf. Higuchi and Eguchi (2001) and Kamiya and Eguchi (2001). For independent component analysis (ICA) the β-divergence is applied from an invariance point of view, cf. Minami and Eguchi (2002). The main focus in these contexts is to investigate the relation between robustness and model efficiency. The conventional method based on the KL divergence can be drastically robustified if one replaces the KL divergence by the β- or η-divergence. In fact, in the context of PCA or ICA, the informative component to be detected is easily lost by the conventional method if a few outliers occur in the hull of acceptable data. On the other hand, the minimum β- and η-divergence methods are quite robust against outliers; the theoretical basis for this is the boundedness of the influence functions. Let us return to the situation of classification rules. We have to reconsider the meaning of an outlier in contrast with the usual cases, including PCA and ICA. The label set Y is finite, so any influence function is bounded. We need a formulation of probabilistic contamination in the labeling of feature vectors rather than a formulation of an outlying distribution. We will present a reasonable formulation associated with the η-divergence in the near future.

References

Adams, N.M. and Hand, D.J. (1999). Comparing classifiers when the misclassification costs are uncertain. Pattern Recognition 32, 1139-1147.

Adams, N.M. and Hand, D.J. (2000). Improving the practice of classifier performance assessment. Neural Computation 12, 305-311.

Amari, S. (1995). Information geometry of the EM and em algorithms for neural networks. Neural Networks 8, 1379-1408.

Bishop, C. (1995). Neural Networks for Pattern Recognition. Clarendon Press, Oxford.

Cornfield, J. (1962). Joint dependence of risk of coronary heart disease on serum cholesterol and systolic blood pressure: a discriminant function approach. Fed. Amer. Socs. Exper. Biol. Proc. Suppl. 11, 58-61.

Cox, D. R. (1966). Some procedures associated with the logistic qualitative response curve. In Research Papers on Statistics: Festschrift for J. Neyman, F. N. David (Ed.). New York: Wiley, pp. 55-71.

Efron, B. (1975). The efficiency of logistic regression compared to normal discriminant analysis. J. Amer. Statist. Assoc. 70, 892-898.

Eguchi, S. and Copas, J. (1998). A class of local likelihood methods and near-parametric asymptotics. J. Royal Statist. Soc. B 60, 709-724.

Eguchi, S. and Copas, J. (2001). Recent developments in discriminant analysis from an information geometric point of view. J. Korean Statist. Soc. 30, 247-264. (Special issue for the 30th anniversary of the Korean Statistical Society.)

Eguchi, S. and Copas, J. (2002a). A class of logistic type discriminant functions. Biometrika 89, 1-22.

Eguchi, S. and Copas, J. (2002b). Interpreting Kullback-Leibler divergence with the Neyman-Pearson lemma. Preprint.

Fisher, R. A. (1936). The use of multiple measurements in taxonomic problems. Annals of Eugenics 7, 179-188.

Freund, Y. and Schapire, R. E. (1997). A decision-theoretic generalization of on-line learning and an application to boosting. J. Computer and System Sciences 55, 119-139.

Friedman, J., Hastie, T. and Tibshirani, R. (2000). Additive logistic regression: a statistical view of boosting. Ann. Statist. 28, 337-407.

Hand, D. J. and Henley, W. E. (1997). Statistical classification methods in consumer credit scoring: a review. J. Roy. Statist. Soc. A 160, 523-541.

Hastie, T., Tibshirani, R. and Friedman, J. (2001). The Elements of Statistical Learning. Springer, New York.

Higuchi, I. and Eguchi, S. (1998). The influence function of principal component analysis by self-organizing rule. Neural Computation 10, 1435-1444.

Jones, M. C., Hjort, N. L., Harris, J. R. and Basu, A. (2001). A comparison of related density-based minimum divergence estimators. Biometrika 88, 865-873.

Kamiya, H. and Eguchi, S. (2001). A class of robust principal component vectors. J. Multivariate Analysis 77, 239-269.

Lebanon, G. and Lafferty, J. (2001). Boosting and maximum likelihood for exponential models. To appear in Advances in Neural Information Processing Systems (NIPS) 14. (http://www-2.cs.cmu.edu/~lafferty/)

McCullagh, P. and Nelder, J. A. (1989). Generalized Linear Models. Chapman and Hall, London.

McLachlan, G. J. (1992). Discriminant Analysis and Statistical Pattern Recognition. Wiley, New York.

Minami, M. and Eguchi, S. (2002). Robust blind source separation by beta-divergence. Neural Computation 14, in press.

Murata, N., Eguchi, S., Takenouchi, T. and Kanamori, T. (2002). Information geometry of U-boost and Bregman divergence. In preparation.

Pepe, M. S. and Thompson, M. L. (2000). Combining diagnostic test results to increase accuracy. Biostatistics 1, 123-140.

Schapire, R. (1990). The strength of weak learnability. Machine Learning 5, 197-227.

Schapire, R., Freund, Y., Bartlett, P. and Lee, W. (1998). Boosting the margin: a new explanation for the effectiveness of voting methods. Ann. Statist. 26, 1651-1686.

Vapnik, V. N. (1999). The Nature of Statistical Learning Theory. Springer, New York.
