Relationship between Loss Functions and Confirmation Measures

Krzysztof Dembczyński¹, Salvatore Greco², Wojciech Kotłowski¹, and Roman Słowiński¹,³

¹ Institute of Computing Science, Poznań University of Technology, 60-965 Poznań, Poland
{kdembczynski, wkotlowski, rslowinski}@cs.put.poznan.pl
² Faculty of Economics, University of Catania, 95129 Catania, Italy
[email protected]
³ Institute for Systems Research, Polish Academy of Sciences, 01-447 Warsaw, Poland
Abstract. In this paper, we present the relationship between loss functions and confirmation measures. We show that population minimizers for weighted loss functions correspond to confirmation measures. This result can be used in the construction of machine learning methods, in particular ensemble methods.
1 Introduction
Let us define the prediction problem in a similar way as in [4]. The aim is to predict the unknown value of an attribute y (sometimes called output, response variable or decision attribute) of an object using the known joint values of other attributes (sometimes called predictors, condition attributes or independent variables) x = (x_1, x_2, \ldots, x_n). We consider the binary classification problem, in which we assume that y ∈ {−1, 1}. All objects for which y = −1 constitute decision class Cl_{-1}, and all objects for which y = 1 constitute decision class Cl_1. The goal of a learning task is to find a function F(x) (in general, F(x) \in \mathbb{R}) that minimizes the expected loss (prediction risk):

F^*(x) = \arg\min_{F(x)} E_{yx} L(y, F(x)),   (1)

where L(y, F(x)) is a loss function, i.e., the penalty for predicting F(x) when the actual value is y. The most natural loss function in binary classification is the 0-1 loss:

L_{0-1}(y, F(x)) = \begin{cases} 0 & \text{if } yF(x) > 0, \\ 1 & \text{if } yF(x) \le 0. \end{cases}   (2)
It is possible to use other loss functions than (2). Each of these functions has some interesting properties. One of them is the population minimizer of the prediction risk. By conditioning (1) on x (i.e., factoring the joint distribution P(y, x) = P(x)P(y|x)), we obtain:

F^*(x) = \arg\min_{F(x)} E_x E_{y|x} L(y, F(x)).   (3)

It is easy to see that it suffices to minimize (3) pointwise:

F^*(x) = \arg\min_{F(x)} E_{y|x} L(y, F(x)).   (4)
The solution of the above is called the population minimizer. In other words, it answers the question: what does minimization of the expected loss estimate on the population level? Let us recall that the population minimizer for the 0-1 loss function is:

F^*(x) = \begin{cases} 1 & \text{if } P(y = 1|x) \ge \tfrac{1}{2}, \\ -1 & \text{if } P(y = -1|x) > \tfrac{1}{2}. \end{cases}   (5)

From the above, it is easy to see that by minimizing the 0-1 loss function one estimates a region of the predictor space in which class Cl_1 is observed with a higher probability than class Cl_{-1}. Minimization of some other loss functions can be seen as estimation of the conditional probabilities P(y = 1|x) (see Section 2).

On the other hand, Bayesian confirmation measures (see, for example, [5,9]) have received special attention in knowledge discovery [7]. A confirmation measure c(H, E) says to what degree a piece of evidence E confirms (or disconfirms) a hypothesis H. It is required to satisfy:

c(H, E) \begin{cases} > 0 & \text{if } P(H|E) > P(H), \\ = 0 & \text{if } P(H|E) = P(H), \\ < 0 & \text{if } P(H|E) < P(H), \end{cases}   (6)

where P(H) is the probability of hypothesis H and P(H|E) is the conditional probability of hypothesis H given evidence E. In Section 3, two confirmation measures of particular interest are discussed.

In this paper, we present the relationship between loss functions and confirmation measures. The motivation of this study is the question: what is the form of the loss function for estimating a region of the predictor space in which class Cl_1 is observed with positive confirmation, or alternatively, for estimating a confirmation measure for given x and y? In the following, we show that population minimizers for weighted loss functions correspond to confirmation measures. Weighted loss functions are often used in the case of imbalanced class distribution, i.e., when the probabilities P(y = 1) and P(y = −1) are substantially different. This result is described in Section 4. The paper is concluded in the last section.
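To make the sign convention (6) concrete, the following minimal Python sketch evaluates the simple difference measure d(H, E) = P(H|E) − P(H), a well-known confirmation measure used here purely as an illustration (it is not one of the measures analyzed in Section 3); the function name and probability values are ours.

```python
# Minimal illustration of the sign convention (6) for a confirmation measure.
# The difference measure d(H, E) = P(H|E) - P(H) is used only as an example.

def difference_confirmation(p_h_given_e: float, p_h: float) -> float:
    """Positive if E confirms H, zero if E is neutral, negative if E disconfirms H."""
    return p_h_given_e - p_h

# Evidence raises the probability of H from 0.3 to 0.7 -> positive confirmation.
print(difference_confirmation(0.7, 0.3))   # 0.4 > 0: E confirms H
# Evidence leaves the probability unchanged -> zero confirmation.
print(difference_confirmation(0.3, 0.3))   # 0.0: E is neutral with respect to H
# Evidence lowers the probability -> negative confirmation (disconfirmation).
print(difference_confirmation(0.1, 0.3))   # -0.2 < 0: E disconfirms H
```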
Fig. 1. The most popular loss functions (0-1 loss, exponential loss, binomial log-likelihood loss, and squared-error loss) plotted against the margin yF(x) (figure prepared in R [10]; a similar figure may be found in [8], also prepared in R)
2 Loss Functions
There are different loss functions used in prediction problems (for a wide discussion see [8]). In this paper, besides the 0-1 loss, we consider the following three loss functions for binary classification:

– exponential loss:
L_{exp}(y, F(x)) = \exp(-yF(x)),   (7)

– binomial negative log-likelihood loss:
L_{log}(y, F(x)) = \log(1 + \exp(-2yF(x))),   (8)

– squared-error loss:
L_{sqr}(y, F(x)) = (y - F(x))^2 = (1 - yF(x))^2.   (9)
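For concreteness, the following minimal Python sketch writes the losses (2) and (7)–(9) as functions of the margin yF(x), as plotted in Figure 1; the function names are ours and the snippet is only an illustration.

```python
import numpy as np

# Losses (2), (7), (8), (9) written as functions of the margin m = y * F(x).
def zero_one_loss(m):            # (2)
    return np.where(m > 0, 0.0, 1.0)

def exponential_loss(m):         # (7)
    return np.exp(-m)

def binomial_loglik_loss(m):     # (8)
    return np.log(1.0 + np.exp(-2.0 * m))

def squared_error_loss(m):       # (9): (y - F(x))^2 = (1 - yF(x))^2 for y in {-1, 1}
    return (1.0 - m) ** 2

margins = np.linspace(-2, 2, 5)
for loss in (zero_one_loss, exponential_loss, binomial_loglik_loss, squared_error_loss):
    print(loss.__name__, np.round(loss(margins), 3))
```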
These loss functions are presented in Figure 1. Exponential loss is used in AdaBoost [6]. Binomial negative log-likelihood loss is common in statistical approaches; it is also used in Gradient Boosting Machines [3]. The reformulation of the squared-error loss (9) is possible because y ∈ {−1, 1}. Squared-error loss is not a monotonically decreasing function of yF(x): for values yF(x) > 1 it increases quadratically. For this reason, this loss function has to be used very carefully in classification tasks.
The population minimizers for these loss functions are as follows:

F^*(x) = \arg\min_{F(x)} E_{y|x} L_{exp}(y, F(x)) = \frac{1}{2} \log \frac{P(y = 1|x)}{P(y = -1|x)},

F^*(x) = \arg\min_{F(x)} E_{y|x} L_{log}(y, F(x)) = \frac{1}{2} \log \frac{P(y = 1|x)}{P(y = -1|x)},

F^*(x) = \arg\min_{F(x)} E_{y|x} L_{sqr}(y, F(x)) = P(y = 1|x) - P(y = -1|x).
From these formulas, it is easy to obtain the values of P(y = 1|x).
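These closed forms can be checked numerically; the sketch below minimizes each conditional risk E_{y|x} L(y, F(x)) over a grid of F values for an assumed value of P(y = 1|x) and compares the minimizer with the formulas above (the grid search and the chosen probability are our own illustration).

```python
import numpy as np

p1 = 0.8                                  # assumed P(y = 1 | x); P(y = -1 | x) = 1 - p1
F_grid = np.linspace(-5, 5, 200001)

def conditional_risk(loss, F):
    """E_{y|x} L(y, F(x)) = P(y=1|x) L(1, F) + P(y=-1|x) L(-1, F), for losses of the margin."""
    return p1 * loss(F) + (1 - p1) * loss(-F)

risks = {
    "exp": conditional_risk(lambda m: np.exp(-m), F_grid),
    "log": conditional_risk(lambda m: np.log(1 + np.exp(-2 * m)), F_grid),
    "sqr": conditional_risk(lambda m: (1 - m) ** 2, F_grid),
}

half_log_odds = 0.5 * np.log(p1 / (1 - p1))
print("exp minimizer:", F_grid[risks["exp"].argmin()], "expected:", half_log_odds)
print("log minimizer:", F_grid[risks["log"].argmin()], "expected:", half_log_odds)
print("sqr minimizer:", F_grid[risks["sqr"].argmin()], "expected:", 2 * p1 - 1)
```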
3 Confirmation Measures
There are two confirmation measures of particular interest:

l(H, E) = \log \frac{P(E|H)}{P(E|\neg H)},   (10)

f(H, E) = \frac{P(E|H) - P(E|\neg H)}{P(E|H) + P(E|\neg H)},   (11)

where H is a hypothesis and E is a piece of evidence. Measures l and f satisfy two desired properties:

– hypothesis symmetry: c(H, E) = −c(¬H, E) (for details, see for example [5]),
– monotonicity property M defined in terms of rough set confirmation measures (for details, see [7]).

Let us remark that in the binary classification problem, one tries to predict the value y ∈ {−1, 1} for a given x. In this case, the evidence is x, and the hypotheses are y = −1 and y = 1. Confirmation measures (10) and (11) then take the following form:

l(y = 1, x) = \log \frac{P(x|y = 1)}{P(x|y = -1)},   (12)

f(y = 1, x) = \frac{P(x|y = 1) - P(x|y = -1)}{P(x|y = 1) + P(x|y = -1)}.   (13)
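As an illustration, the following Python sketch computes l(y = 1, x) and f(y = 1, x) from two class-conditional densities; the Gaussian densities and all names are assumptions made only for this example.

```python
import math

def l_measure(p_x_given_pos: float, p_x_given_neg: float) -> float:
    """Confirmation measure (12): log-likelihood ratio of the evidence x."""
    return math.log(p_x_given_pos / p_x_given_neg)

def f_measure(p_x_given_pos: float, p_x_given_neg: float) -> float:
    """Confirmation measure (13): normalized difference of the likelihoods."""
    return (p_x_given_pos - p_x_given_neg) / (p_x_given_pos + p_x_given_neg)

def gaussian_pdf(x: float, mean: float, std: float) -> float:
    return math.exp(-0.5 * ((x - mean) / std) ** 2) / (std * math.sqrt(2 * math.pi))

# Illustrative class-conditional densities: Cl_1 centered at +1, Cl_-1 at -1.
x = 0.5
p_pos = gaussian_pdf(x, mean=+1.0, std=1.0)
p_neg = gaussian_pdf(x, mean=-1.0, std=1.0)
print("l(y=1, x) =", l_measure(p_pos, p_neg))   # > 0: x confirms y = 1
print("f(y=1, x) =", f_measure(p_pos, p_neg))   # same sign, bounded in [-1, 1]
```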
4 Population Minimizers for Weighted Loss Functions
In this section, we present our main results, which show the relationship between loss functions and confirmation measures. We prove that population minimizers for weighted loss functions correspond to confirmation measures. Weighted loss functions are often used in the case of imbalanced class distribution, i.e., when the probabilities P(y = 1) and P(y = −1) are substantially different. A weighted loss function can be defined as follows:

L^w(y, F(x)) = w \cdot L(y, F(x)),

where L(y, F(x)) is one of the loss functions presented above. Assuming that P(y) is known, one can take w = 1/P(y), and then:

L^w(y, F(x)) = \frac{1}{P(y)} \cdot L(y, F(x)).   (14)
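A minimal Python sketch of (14), assuming the class prior P(y = 1) is known; the helper names are ours.

```python
import math

# Weighted loss (14): scale a base loss by 1 / P(y), where the class prior
# P(y) is assumed to be known (e.g., estimated from the training data).
def make_weighted_loss(base_loss, p_pos: float):
    """Return L^w(y, F) = L(y, F) / P(y) for a base loss written as L(y, F)."""
    priors = {+1: p_pos, -1: 1.0 - p_pos}
    def weighted_loss(y: int, F: float) -> float:
        return base_loss(y, F) / priors[y]
    return weighted_loss

exp_loss = lambda y, F: math.exp(-y * F)               # exponential loss (7)
w_exp_loss = make_weighted_loss(exp_loss, p_pos=0.9)   # imbalanced priors: 0.9 vs 0.1

# The loss for the rare class (y = -1) is scaled by 1/0.1 = 10, vs 1/0.9 for y = +1.
print(w_exp_loss(+1, 0.0), w_exp_loss(-1, 0.0))
```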
In the proofs presented below, we use the following well-known facts: the Bayes theorem, P(y = 1|x) = P(y = 1 ∩ x)/P(x) = P(x|y = 1)P(y = 1)/P(x), as well as the identities P(y = 1) = 1 − P(y = −1) and P(y = 1|x) = 1 − P(y = −1|x).

Let us consider the following weighted 0-1 loss function:

L^w_{0-1}(y, F(x)) = \frac{1}{P(y)} \cdot \begin{cases} 0 & \text{if } yF(x) > 0, \\ 1 & \text{if } yF(x) \le 0. \end{cases}   (15)

Theorem 1. The population minimizer of E_{y|x} L^w_{0-1}(y, F(x)) is:

F^*(x) = \begin{cases} 1 & \text{if } P(y = 1|x) \ge P(y = 1), \\ -1 & \text{if } P(y = -1|x) > P(y = -1), \end{cases} = \begin{cases} 1 & \text{if } c(y = 1, x) \ge 0, \\ -1 & \text{if } c(y = -1, x) > 0, \end{cases}   (16)

where c is any confirmation measure.

Proof. We have that

F^*(x) = \arg\min_{F(x)} E_{y|x} L^w_{0-1}(y, F(x)).
The prediction risk is then:

E_{y|x} L^w_{0-1}(y, F(x)) = P(y = 1|x) L^w_{0-1}(1, F(x)) + P(y = -1|x) L^w_{0-1}(-1, F(x))
= \frac{P(y = 1|x)}{P(y = 1)} L_{0-1}(1, F(x)) + \frac{P(y = -1|x)}{P(y = -1)} L_{0-1}(-1, F(x)).

This is minimized by any F(x) > 0 if P(y = 1|x)/P(y = 1) ≥ P(y = −1|x)/P(y = −1), and by any F(x) < 0 if P(y = 1|x)/P(y = 1) < P(y = −1|x)/P(y = −1) (in other words, only the sign of F(x) matters). From P(y = 1|x)/P(y = 1) ≥ P(y = −1|x)/P(y = −1), we have that:

\frac{P(y = 1|x)}{P(y = 1)} \ge \frac{1 - P(y = 1|x)}{1 - P(y = 1)},

which finally gives P(y = 1|x) ≥ P(y = 1), i.e., c(y = 1, x) ≥ 0. Analogously, from P(y = 1|x)/P(y = 1) < P(y = −1|x)/P(y = −1), we obtain P(y = −1|x) > P(y = −1), i.e., c(y = −1, x) > 0. From the above, we get the thesis. □

From the above theorem, it is easy to see that minimization of L^w_{0-1}(y, F(x)) results in estimating a region of the predictor space in which class Cl_1 is observed with positive confirmation.
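Theorem 1 can be checked numerically: for the weighted 0-1 loss, the risk of predicting +1 is P(y = −1|x)/P(y = −1) and the risk of predicting −1 is P(y = 1|x)/P(y = 1), so the comparison below reproduces the decision rule (16); the probability values are chosen arbitrarily for illustration.

```python
# Minimal numerical check of Theorem 1 with arbitrarily chosen probabilities.
def weighted_zero_one_risks(p1_given_x: float, p1: float):
    risk_predict_pos = (1 - p1_given_x) / (1 - p1)   # cost of being wrong on class -1
    risk_predict_neg = p1_given_x / p1               # cost of being wrong on class +1
    return risk_predict_pos, risk_predict_neg

for p1_given_x, p1 in [(0.4, 0.3), (0.2, 0.3), (0.3, 0.3)]:
    r_pos, r_neg = weighted_zero_one_risks(p1_given_x, p1)
    best = +1 if r_pos <= r_neg else -1
    confirmation_sign = p1_given_x - p1   # same sign as c(y = 1, x) by property (6)
    print(f"P(y=1|x)={p1_given_x}, P(y=1)={p1}: predict {best:+d}, "
          f"sign of c(y=1, x) = {confirmation_sign:+.2f}")
```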
In the following theorems, we show that minimization of a weighted version of the exponential, the binomial negative log-likelihood, and the squared-error loss function gives an estimate of a particular confirmation measure, l or f.

Let us consider the following weighted exponential loss function:

L^w_{exp}(y, F(x)) = \frac{1}{P(y)} \exp(-yF(x)).   (17)
Theorem 2. The population minimizer of E_{y|x} L^w_{exp}(y, F(x)) is:

F^*(x) = \frac{1}{2} \log \frac{P(x|y = 1)}{P(x|y = -1)} = \frac{1}{2} l(y = 1, x).   (18)

Proof. We have that

F^*(x) = \arg\min_{F(x)} E_{y|x} L^w_{exp}(y, F(x)).

The prediction risk is then:

E_{y|x} L^w_{exp}(y, F(x)) = P(y = 1|x) L^w_{exp}(1, F(x)) + P(y = -1|x) L^w_{exp}(-1, F(x))
= \frac{P(y = 1|x)}{P(y = 1)} \exp(-F(x)) + \frac{P(y = -1|x)}{P(y = -1)} \exp(F(x)).

Let us compute the derivative of the above expression:

\frac{\partial E_{y|x} L^w_{exp}(y, F(x))}{\partial F(x)} = -\frac{P(y = 1|x)}{P(y = 1)} \exp(-F(x)) + \frac{P(y = -1|x)}{P(y = -1)} \exp(F(x)).

Setting the derivative to zero, we get:

\exp(2F(x)) = \frac{P(y = 1|x) P(y = -1)}{P(y = -1|x) P(y = 1)},

F(x) = \frac{1}{2} \log \frac{P(y = 1|x) P(y = -1)}{P(y = -1|x) P(y = 1)} = \frac{1}{2} l(y = 1, x). □
Let us consider the following weighted binomial negative log-likelihood loss function:

L^w_{log}(y, F(x)) = \frac{1}{P(y)} \log(1 + \exp(-2yF(x))).   (19)

Theorem 3. The population minimizer of E_{y|x} L^w_{log}(y, F(x)) is:

F^*(x) = \frac{1}{2} \log \frac{P(x|y = 1)}{P(x|y = -1)} = \frac{1}{2} l(y = 1, x).   (20)

Proof. We have that

F^*(x) = \arg\min_{F(x)} E_{y|x} L^w_{log}(y, F(x)).

The prediction risk is then:

E_{y|x} L^w_{log}(y, F(x)) = P(y = 1|x) L^w_{log}(1, F(x)) + P(y = -1|x) L^w_{log}(-1, F(x))
= \frac{P(y = 1|x)}{P(y = 1)} \log(1 + \exp(-2F(x))) + \frac{P(y = -1|x)}{P(y = -1)} \log(1 + \exp(2F(x))).

Let us compute the derivative of the above expression:

\frac{\partial E_{y|x} L^w_{log}(y, F(x))}{\partial F(x)} = -2 \frac{P(y = 1|x)}{P(y = 1)} \cdot \frac{\exp(-2F(x))}{1 + \exp(-2F(x))} + 2 \frac{P(y = -1|x)}{P(y = -1)} \cdot \frac{\exp(2F(x))}{1 + \exp(2F(x))}.

Setting the derivative to zero, we get:

\exp(2F(x)) = \frac{P(y = 1|x) P(y = -1)}{P(y = -1|x) P(y = 1)},

F(x) = \frac{1}{2} \log \frac{P(y = 1|x) P(y = -1)}{P(y = -1|x) P(y = 1)} = \frac{1}{2} l(y = 1, x). □
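Theorems 2 and 3 can be verified numerically in the same spirit: the sketch below minimizes the weighted exponential and weighted log-likelihood conditional risks over a grid of F values and compares the minimizers with (1/2) l(y = 1, x), computed via the Bayes theorem from assumed probabilities (the grid search and all values are our own illustration).

```python
import numpy as np

# Arbitrary illustrative probabilities (assumed known for the check).
p1_given_x, p1 = 0.6, 0.2                 # P(y=1|x), P(y=1)
pm1_given_x, pm1 = 1 - p1_given_x, 1 - p1

F = np.linspace(-5, 5, 200001)

risk_exp = p1_given_x / p1 * np.exp(-F) + pm1_given_x / pm1 * np.exp(F)        # (17)
risk_log = (p1_given_x / p1 * np.log(1 + np.exp(-2 * F))
            + pm1_given_x / pm1 * np.log(1 + np.exp(2 * F)))                   # (19)

# By the Bayes theorem, P(x|y=1)/P(x|y=-1) = (P(y=1|x)/P(y=1)) / (P(y=-1|x)/P(y=-1)).
half_l = 0.5 * np.log((p1_given_x / p1) / (pm1_given_x / pm1))

print("weighted exp minimizer:", F[risk_exp.argmin()], "expected 0.5*l:", half_l)
print("weighted log minimizer:", F[risk_log.argmin()], "expected 0.5*l:", half_l)
```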
Let us consider the following weighted squared-error loss function:

L^w_{sqr}(y, F(x)) = \frac{1}{P(y)} (y - F(x))^2.   (21)
Theorem 4. The population minimizer of E_{y|x} L^w_{sqr}(y, F(x)) is:

F^*(x) = \frac{P(x|y = 1) - P(x|y = -1)}{P(x|y = 1) + P(x|y = -1)} = f(y = 1, x).   (22)

Proof. We have that

F^*(x) = \arg\min_{F(x)} E_{y|x} L^w_{sqr}(y, F(x)).

The prediction risk is then:

E_{y|x} L^w_{sqr}(y, F(x)) = P(y = 1|x) L^w_{sqr}(1, F(x)) + P(y = -1|x) L^w_{sqr}(-1, F(x))
= \frac{P(y = 1|x)}{P(y = 1)} (1 - F(x))^2 + \frac{P(y = -1|x)}{P(y = -1)} (1 + F(x))^2.

Let us compute the derivative of the above expression:

\frac{\partial E_{y|x} L^w_{sqr}(y, F(x))}{\partial F(x)} = -2 \frac{P(y = 1|x)}{P(y = 1)} (1 - F(x)) + 2 \frac{P(y = -1|x)}{P(y = -1)} (1 + F(x)).

Setting the derivative to zero, we get:

F(x) = \frac{P(y = 1|x)/P(y = 1) - P(y = -1|x)/P(y = -1)}{P(y = 1|x)/P(y = 1) + P(y = -1|x)/P(y = -1)} = \frac{P(x|y = 1) - P(x|y = -1)}{P(x|y = 1) + P(x|y = -1)} = f(y = 1, x). □
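An analogous numerical check of Theorem 4, again with arbitrarily chosen probabilities and a grid search used only for illustration:

```python
import numpy as np

# Arbitrary illustrative probabilities (assumed known for the check).
p1_given_x, p1 = 0.6, 0.2
pm1_given_x, pm1 = 1 - p1_given_x, 1 - p1

F = np.linspace(-1, 1, 200001)
risk_sqr = p1_given_x / p1 * (1 - F) ** 2 + pm1_given_x / pm1 * (1 + F) ** 2   # (21)

# f(y=1, x) via the Bayes theorem: P(x|y=±1) is proportional to P(y=±1|x)/P(y=±1).
a, b = p1_given_x / p1, pm1_given_x / pm1
f_measure = (a - b) / (a + b)

print("weighted sqr minimizer:", F[risk_sqr.argmin()], "expected f:", f_measure)
```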
5 Conclusions
We have proven that population minimizers for weighted loss functions correspond directly to confirmation measures. This result can be applied in the construction of machine learning methods, for example, ensemble classifiers producing a linear combination of base classifiers. In particular, considering an ensemble of decision rules [1,2], the sum of outputs of the rules that cover x can be interpreted as an estimate of a confirmation measure for x and the predicted class, as sketched below. Our future research will concern investigation of general conditions that a loss function has to satisfy to be used in estimation of confirmation measures.
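A possible reading of this interpretation is sketched below, under the assumption that the ensemble F(x) has been fitted by minimizing the weighted exponential loss (17), so that by Theorem 2 its output estimates (1/2) l(y = 1, x); the rules, attribute names, and outputs are entirely hypothetical and are not taken from [1,2].

```python
# Hypothetical sketch: if an ensemble F(x) = sum of rule outputs is fitted with
# the weighted exponential loss (17), then by Theorem 2 its value can be read as
# an estimate of 0.5 * l(y = 1, x). The rules below are invented for illustration.
rules = [
    (lambda x: x["age"] > 40,     0.7),   # (condition on x, rule output when it covers x)
    (lambda x: x["income"] < 2.0, -0.4),
    (lambda x: x["tenure"] > 5,   0.2),
]

def ensemble_score(x: dict) -> float:
    """Sum of outputs of the rules covering x."""
    return sum(output for condition, output in rules if condition(x))

x = {"age": 45, "income": 1.5, "tenure": 3}
score = ensemble_score(x)          # estimate of 0.5 * l(y = 1, x)
estimated_l = 2.0 * score          # estimate of l(y = 1, x)
print(f"F(x) = {score:.2f}, estimated l(y=1, x) = {estimated_l:.2f}, "
      f"confirmation is {'positive' if estimated_l > 0 else 'non-positive'}")
```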
References

1. Błaszczyński, J., Dembczyński, K., Kotłowski, W., Słowiński, R., Szeląg, M.: Ensemble of Decision Rules. Research Report RA-011/06, Poznań University of Technology (2006)
2. Błaszczyński, J., Dembczyński, K., Kotłowski, W., Słowiński, R., Szeląg, M.: Ensembles of Decision Rules for Solving Binary Classification Problems with Presence of Missing Values. In: Greco, S., et al. (eds.): Rough Sets and Current Trends in Computing. Lecture Notes in Artificial Intelligence 4259, Springer-Verlag (2006) 318–327
3. Friedman, J. H.: Greedy Function Approximation: A Gradient Boosting Machine. The Annals of Statistics 29 (5) (2001) 1189–1232
4. Friedman, J. H.: Recent Advances in Predictive (Machine) Learning. Dept. of Statistics, Stanford University, http://www-stat.stanford.edu/~jhf (2003)
5. Fitelson, B.: Studies in Bayesian Confirmation Theory. Ph.D. Thesis, University of Wisconsin, Madison (2001)
6. Freund, Y., Schapire, R. E.: A Decision-Theoretic Generalization of On-Line Learning and an Application to Boosting. Journal of Computer and System Sciences 55 (1997) 119–139
7. Greco, S., Pawlak, Z., Słowiński, R.: Can Bayesian Confirmation Measures Be Useful for Rough Set Decision Rules? Engineering Applications of Artificial Intelligence 17 (2004) 345–361
8. Hastie, T., Tibshirani, R., Friedman, J. H.: The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer (2003)
9. Kyburg, H.: Recent Work in Inductive Logic. In: Lucey, K. G., Machan, T. R. (eds.): Recent Work in Philosophy. Rowman and Allanheld, Totowa, NJ (1983) 89–150
10. R Development Core Team: R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna (2005), http://www.R-project.org