Introducing Cost-Sensitive Neural Networks

Chunru Wan (1), Lipo Wang (1), and Kai Ming Ting (2)

(1) School of Electrical and Electronic Engineering, Nanyang Technological University, Block S2, Nanyang Avenue, Singapore 639798. Email: {ecrwan,elpwang}@ntu.edu.sg; http://www.ntu.edu.sg/home/{ecrwan,elpwang}

(2) School of Computing and Mathematics, Deakin University, 662 Blackburn Road, Clayton, Victoria 3168, Australia. Email: [email protected]; http://www.cm.deakin.edu.au/~kmting

Abstract

In many real-world problems, such as financial analysis and medical diagnosis, different errors in prediction usually lead to different costs. For example, the cost of making an error in predicting the future of a million-dollar investment is higher than for a thousand-dollar investment, and the cost of misdiagnosing a person with a serious disease as healthy is much greater than that of misdiagnosing a healthy person as ill. This important aspect has largely been ignored in applications of artificial neural networks. In this paper, we propose cost-sensitive neural networks (CSNNs) to address these issues.

1 Standard Back-Propagation Networks

The total input to an artificial neuron is [2]
$$h = \sum_{j=1}^{N_I} w_j V_j , \qquad (1)$$
where $N_I$ is the number of inputs to the neuron (the dimension of the input vector), $\{w_1, w_2, \ldots, w_{N_I}\}$ are the weights or synapses of the neuron, and $\{V_1, V_2, \ldots, V_{N_I}\}$ are the individual inputs to the neuron, either from other neurons or from external sources of input. The neuron then determines its output $a$ according to
$$a = f(h + b) , \qquad (2)$$
where $b$ is the bias of the neuron and $f$ is usually a nonlinear function which will be specified later.

Let us consider a layer of $N_H$ neurons. All neurons receive an input vector (pattern) $\{\xi_1, \xi_2, \ldots, \xi_{N_I}\}$. Neuron $j$ in this layer has weights $\{w_{j1}, w_{j2}, \ldots, w_{jN_I}\}$. Hence the total input to neuron $j$ is
$$h_j = \sum_{k=1}^{N_I} w_{jk} \xi_k , \qquad (3)$$
which produces output
$$V_j = g(h_j) = g\Big(\sum_{k=1}^{N_I} w_{jk} \xi_k\Big) , \qquad (4)$$
where
$$g(x) = f(x + b) . \qquad (5)$$

Now let us connect a second layer of $N_O$ neurons on top of this first layer of $N_H$ neurons to form a feedforward neural network. The weight connecting neuron $i$ ($i = 1, 2, \ldots, N_O$) in the second layer and neuron $j$ ($j = 1, 2, \ldots, N_H$) in the first layer is $W_{ij}$. Hence neuron $i$ in the second layer receives a total input
$$h_i = \sum_{j=1}^{N_H} W_{ij} V_j = \sum_{j=1}^{N_H} W_{ij}\, g\Big(\sum_{k=1}^{N_I} w_{jk} \xi_k\Big) , \qquad (6)$$
and produces output
$$O_i = g(h_i) = g\Big(\sum_{j=1}^{N_H} W_{ij} V_j\Big) = g\Big(\sum_{j=1}^{N_H} W_{ij}\, g\Big(\sum_{k=1}^{N_I} w_{jk} \xi_k\Big)\Big) . \qquad (7)$$
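As a concrete illustration of eqs. (1)-(7), the following NumPy sketch computes the output of such a two-layer network for a single input pattern. It is a minimal sketch under simplifying assumptions: the bias terms are taken as absorbed into $g$ (eq. 5), the sigmoid gain is fixed at 1, and the function and variable names are ours, not the paper's.

```python
import numpy as np

def sigmoid(h, beta=1.0):
    # Eq. (14): g(h) = 1 / (1 + exp(-beta * h))
    return 1.0 / (1.0 + np.exp(-beta * h))

def forward(xi, w, W):
    """Forward pass of the two-layer feedforward network of eqs. (6)-(7).

    xi : input pattern (xi_1, ..., xi_NI), shape (N_I,)
    w  : input-to-hidden weights w_jk, shape (N_H, N_I)
    W  : hidden-to-output weights W_ij, shape (N_O, N_H)
    """
    h_hidden = w @ xi        # eq. (3): h_j = sum_k w_jk * xi_k
    V = sigmoid(h_hidden)    # eq. (4): V_j = g(h_j)
    h_out = W @ V            # eq. (6): h_i = sum_j W_ij * V_j
    O = sigmoid(h_out)       # eq. (7): O_i = g(h_i)
    return O, V, h_hidden, h_out
```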

Suppose for input pattern $\vec{\xi}^{\,\mu}$ used during training, the desired output pattern is $\vec{\zeta}^{\,\mu}$. We shall use the superscript $\mu$ to denote training pattern $\mu$, where $\mu = 1, 2, \ldots, N_p$ and $N_p$ is the number of input-output training pairs. The objective of training is to minimize the error between the actual output $\vec{O}^{\mu}$ and the desired output $\vec{\zeta}^{\,\mu}$. A commonly used error measure or cost function is
$$E = \frac{1}{2} \sum_{\mu i} \big[\zeta_i^{\mu} - O_i^{\mu}\big]^2 , \qquad (8)$$
or, by substituting eq. 7 into eq. 8,
$$E = \frac{1}{2} \sum_{\mu i} \Big[\zeta_i^{\mu} - g\Big(\sum_j W_{ij}\, g\Big(\sum_k w_{jk} \xi_k^{\mu}\Big)\Big)\Big]^2 .$$

Back-propagation uses a gradient descent algorithm to learn the weights. For the hidden-to-output connections we have
$$\Delta W_{ij} = -\eta \frac{\partial E}{\partial W_{ij}} = \eta \sum_{\mu} \big[\zeta_i^{\mu} - O_i^{\mu}\big]\, g'(h_i^{\mu})\, V_j^{\mu} = \eta \sum_{\mu} \delta_i^{\mu} V_j^{\mu} , \qquad (9)$$
where $\eta$ is called the learning rate and we have defined
$$\delta_i^{\mu} = g'(h_i^{\mu})\big[\zeta_i^{\mu} - O_i^{\mu}\big] . \qquad (10)$$
For the input-to-hidden connections, we have
$$\Delta w_{jk} = -\eta \frac{\partial E}{\partial w_{jk}} = -\eta \sum_{\mu} \frac{\partial E}{\partial V_j^{\mu}} \frac{\partial V_j^{\mu}}{\partial w_{jk}} = \eta \sum_{\mu i} \big[\zeta_i^{\mu} - O_i^{\mu}\big]\, g'(h_i^{\mu})\, W_{ij}\, g'(h_j^{\mu})\, \xi_k^{\mu} = \eta \sum_{\mu} \delta_j^{\mu} \xi_k^{\mu} , \qquad (11)$$
where we have defined
$$\delta_j^{\mu} = g'(h_j^{\mu}) \sum_i \delta_i^{\mu} W_{ij} . \qquad (12)$$

We see that eq. 9 and eq. 11 have exactly the same form, only with different definitions of the $\delta$'s. In general, for a feedforward neural network with an arbitrary number of layers, suppose layer $p$ receives input from layer $q$, which can be either a hidden layer or the external input. Then the gradient descent learning rule for layer $p$ can always be written as
$$\Delta w_{pq} = \eta \sum_{\mu} \delta_p^{\mu} V_q^{\mu} , \qquad (13)$$
where $\delta_p$ represents the error at the output of layer $p$ and $V_q$ is the input to layer $p$ from layer $q$. If the layer concerned is the final (or top) layer of the network, $\delta$ is given by eq. 10, which represents the error between the desired and the actual output. If the layer concerned is one of the hidden layers, $\delta$ needs to be calculated with a propagating rule such as eq. 12. The most popular nonlinear function for a neuron is the sigmoid function
$$g(h) = \frac{1}{1 + e^{-\beta h}} , \qquad (14)$$
where $\beta$ is called the gain.
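To make the update rules concrete, here is a minimal sketch of eqs. (9)-(13) in their per-pattern (iterative) form, i.e., dropping the sum over $\mu$ and updating the weights after each pattern. It reuses the hypothetical `sigmoid` and `forward` helpers from the sketch above and again assumes a gain-one sigmoid without separate bias terms.

```python
import numpy as np

def sigmoid_prime(h, beta=1.0):
    # Derivative of the sigmoid of eq. (14): g'(h) = beta * g(h) * (1 - g(h))
    g = sigmoid(h, beta)
    return beta * g * (1.0 - g)

def sbp_step(xi, zeta, w, W, eta=0.1):
    """One standard back-propagation step for a single pattern (eqs. 9-13)."""
    O, V, h_hidden, h_out = forward(xi, w, W)
    delta_out = sigmoid_prime(h_out) * (zeta - O)             # eq. (10)
    delta_hid = sigmoid_prime(h_hidden) * (W.T @ delta_out)   # eq. (12)
    W += eta * np.outer(delta_out, V)                         # eq. (9)
    w += eta * np.outer(delta_hid, xi)                        # eq. (11)
    return w, W
```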

2 Cost-Sensitive Back-Propagation

In conventional back-propagation, errors made with respect to different patterns are assumed to be equally costly, as shown in the cost function given by eq. 8. We now write a cost-sensitive cost function as follows:
$$E = \frac{1}{2} \sum_{\mu i} \lambda^{\mu} \big[\zeta_i^{\mu} - O_i^{\mu}\big]^2 , \qquad (15)$$
where $\lambda^{\mu}$ is a cost-dependent factor. The standard back-propagation situation (eq. 8) is recovered if we let
$$\lambda^{\mu} = 1 \quad \text{for all } \mu . \qquad (16)$$
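Eq. (15) translates directly into a weighted sum of squared errors; a minimal sketch (the function name and array shapes are our assumptions):

```python
import numpy as np

def cs_cost(O, zeta, lam):
    """Cost-sensitive error of eq. (15).

    O, zeta : actual and desired outputs, shape (N_p, N_O)
    lam     : cost-dependent factors lambda^mu, shape (N_p,)
    """
    # E = 0.5 * sum_mu lam^mu * sum_i (zeta_i^mu - O_i^mu)^2
    return 0.5 * np.sum(lam[:, None] * (zeta - O) ** 2)
```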

With eq. 15, we can easily generalize the standard back-propagation (SBP) algorithm to cost-sensitive situations. By going through the same derivations as above, we can show that the cost-sensitive cost function given by eq. 15 is minimized by the same back-propagation algorithm, but with eq. 13 modified as follows:
$$\Delta w_{pq} = \eta \sum_{\mu} \lambda^{\mu} \delta_p^{\mu} V_q^{\mu} . \qquad (17)$$
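In code, the cost-sensitive update of eq. (17) in its per-pattern form differs from the standard step only in the effective learning rate; a sketch reusing the hypothetical `sbp_step` above:

```python
def csbp_step(xi, zeta, w, W, lam_mu, eta=0.1):
    # Eq. (17), per-pattern form: identical to the standard step,
    # except the learning rate eta is scaled by the pattern's cost factor lambda^mu.
    return sbp_step(xi, zeta, w, W, eta=lam_mu * eta)
```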

Other quantities such as the $\delta$'s and $V$'s are calculated in the same way as in the standard back-propagation case. In an iterative implementation, i.e., when all weights are updated after each training pattern is presented, the above cost-sensitive back-propagation (CSBP) can be realized by simply replacing the learning rate $\eta$ of the standard back-propagation case with a cost-sensitive learning rate $\eta \lambda^{\mu}$ for each training pattern $\mu$. In CSBP, "more important" pattern classes with larger cost factors ($\lambda$) thus have larger learning rates than the "less important" classes with smaller cost factors.

Let us consider a case with only two pattern classes. Suppose $\lambda^{(1)} = 3\lambda^{(2)} = 3$, i.e., making an error in classifying a pattern of class 1 is three times as costly as making an error in classifying a pattern of class 2. CSBP then requires that the learning rate for class 1 be $3\eta$ if the learning rate for class 2 is $\eta$. This is roughly equivalent to using the SBP, i.e., the same learning rate for all classes, and presenting each training pattern in class 2 only once to the network, but presenting each training pattern in class 1 three times.

Suppose there are a total of $N_p$ input-output pattern pairs for training and there are $N_{cl}$ classes (kinds) of patterns. In particular, there are $N_k$ training patterns for class $k$, where $k = 1, 2, \ldots, N_{cl}$. In this paper, we assume that there is an equal number of training patterns for each class, i.e.,
$$N_k = N_o \qquad (18)$$
for all $k = 1, 2, \ldots, N_{cl}$. Hence
$$\sum_{k=1}^{N_{cl}} N_k = N_p = N_{cl} N_o . \qquad (19)$$

Since in the SBP cost function (eq. 8) the sum of all coefficients in front of $[\zeta_i^{\mu} - O_i^{\mu}]^2$ is
$$\sum_{\mu} 1 = N_p , \qquad (20)$$
it is reasonable to require that the same condition be satisfied in the CSBP. This can be achieved by choosing
$$\lambda^{\mu} = N_{cl} \, \frac{C_{k(\mu)}}{\sum_{k=1}^{N_{cl}} C_k} , \qquad (21)$$
where $C_k$ is the cost factor assigned to class $k$ and $k(\mu)$ denotes the class to which pattern $\mu$ belongs.

The SBP is recovered if all $C$'s are the same (thus all $\lambda$'s are 1). Although we are not aware of any prior work on cost-sensitive neural networks, cost-sensitive classification trees have been studied by Turney [4] and Ting [3].
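A short sketch of the normalization in eq. (21): given per-class cost factors $C_k$ and the class label $k(\mu)$ of each training pattern, the resulting $\lambda^{\mu}$ sum to $N_p$ when all classes have the same number of patterns (the function and variable names are ours).

```python
import numpy as np

def cost_factors(C, labels):
    """Eq. (21): lambda^mu = N_cl * C_{k(mu)} / sum_k C_k.

    C      : per-class cost factors, shape (N_cl,)
    labels : class index k(mu) of each training pattern, shape (N_p,)
    """
    C = np.asarray(C, dtype=float)
    return len(C) * C[labels] / C.sum()

# Example with the two-class cost settings used in Section 3:
# C = [0.5, 0.5] gives lambda = 1 for every pattern (SBP),
# while C = [0.2, 0.8] gives lambda = 0.4 for class 1 and 1.6 for class 2.
```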

3 Simulation Results

Assume that two classes of patterns are uniformly distributed in two intersecting circles in a plane. We randomly generate 600 training patterns (characterized by their coordinates in the plane) for each of the two classes. These patterns are used to train the neural network under different cost-factor settings. The neural network has an input layer of two neurons, one hidden layer of three neurons with sigmoid transfer functions, and an output layer of one neuron with a sigmoid transfer function. Three different cost-factor settings are used in the simulation study; the cost factors are set as follows:

- Case 1. $C_1 = C_2 = 0.5$: This corresponds to the standard back-propagation (SBP) algorithm.
- Case 2. $C_1 = 0.2$, $C_2 = 0.8$: This sets a higher cost for class 2 than for class 1.
- Case 3. $C_1 = 0.8$, $C_2 = 0.2$: This sets a higher cost for class 1 than for class 2.
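For concreteness, the data generation can be sketched as follows; the circle centres and radii are illustrative assumptions only (the paper states merely that the two classes are uniformly distributed over two intersecting circles), and the random seed and helper names are ours.

```python
import numpy as np

def sample_circle(center, radius, n, rng):
    # Uniform sampling inside a circle via the sqrt-radius trick.
    r = radius * np.sqrt(rng.uniform(size=n))
    theta = rng.uniform(0.0, 2.0 * np.pi, size=n)
    return np.column_stack([center[0] + r * np.cos(theta),
                            center[1] + r * np.sin(theta)])

rng = np.random.default_rng(0)
# Hypothetical intersecting circles; 600 training patterns per class.
X1 = sample_circle((0.0, 0.0), 1.0, 600, rng)   # class 1
X2 = sample_circle((1.5, 0.0), 1.0, 600, rng)   # class 2
```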

After the network is trained, we randomly generate another 600 test patterns for each class to test the recognition rate of the neural network. The results are shown in Figure 1, where we have computed the decision boundary of the neural network for the different cases: the solid line shows the decision boundary for standard back-propagation (Case 1), the dotted line shows the decision boundary for Case 2, and the dashed line that for Case 3. The correct recognition rates for the three neural networks on the same set of test patterns are given in Table 1. As we can see from Figure 1 and Table 1, the recognition rates for the "important" classes (with higher cost) are higher than the recognition rates for the "less important" classes. The decision boundaries in Figure 1 also clearly reveal this result.


Figure 1: Two classes are represented by triangles and circles, respectively.

Table 1: Recognition rates for test patterns

Case                          Class 1            Class 2            Total
Case 1 (C1 = C2 = 0.5)        561/600 (93.5%)    571/600 (95.2%)    1132/1200 (94.3%)
Case 2 (C1 = 0.2, C2 = 0.8)   542/600 (90.3%)    594/600 (99.0%)    1136/1200 (94.7%)
Case 3 (C1 = 0.8, C2 = 0.2)   599/600 (99.8%)    533/600 (88.8%)    1132/1200 (94.3%)

4 Conclusions

In this paper, we proposed cost-sensitive neural networks (CSNNs) to address the important issue that different errors in prediction usually lead to different costs. A training algorithm is derived based on the back-propagation learning rule, according to a cost-sensitive error function. Computer simulations were carried out on a simple classification problem. Comparisons with the standard back-propagation network showed the desired behavior, i.e., the CSNN demonstrated higher recognition rates (or lower classification errors) on the more important classes.

We plan to carry out more theoretical studies and computer simulations on the CSNN. Future work will also include applications of the CSNN to many real-world problems, such as financial analysis and medical diagnosis. For example, the cost of making an error in predicting the future of a million-dollar investment is higher than for a thousand-dollar investment, and the cost of misdiagnosing a person with a serious disease as healthy is much greater than that of misdiagnosing a healthy person as ill. Our CSNN is expected to give lower overall costs in these and other cost-sensitive situations.

References

[1] S.E. Fahlman, "Fast-learning variations on back-propagation: an empirical study," in Proc. 1988 Connectionist Models Summer School (Pittsburgh, 1988), D. Touretzky, G. Hinton, and T. Sejnowski, Eds., pp. 38-51, Morgan Kaufmann, San Mateo, 1989.

[2] D.E. Rumelhart, J.L. McClelland, and the PDP Research Group, Parallel Distributed Processing, The MIT Press, Cambridge, Massachusetts, 1986.

[3] K.M. Ting, "Inducing cost-sensitive trees via instance weighting," in Proceedings of the Second European Symposium on Principles of Data Mining and Knowledge Discovery, LNAI-1510, pp. 139-147, 1998. (Available at http://www3.cm.deakin.edu.au/~kmting/publications.html)

[4] P.D. Turney, "Cost-sensitive classification: empirical evaluation of a hybrid genetic decision tree induction algorithm," Journal of Artificial Intelligence Research, vol. 2, pp. 369-409, 1995.