Cost Sensitive Learning of Deep Feature Representations from Imbalanced Data

arXiv:1508.03422v1 [cs.CV] 14 Aug 2015
Salman H. Khan, Mohammed Bennamoun, Ferdous Sohel
School of CSSE, UWA

Roberto Togneri
School of EECE, UWA

{salman.khan,ferdous.sohel,mohammed.bennamoun}@uwa.edu.au

[email protected]

Abstract
In real-world object detection and classification tasks, data for common classes appears frequently (majority) while rarer classes have a much lower representation (minority). With such imbalanced data, it becomes challenging for a classifier to learn equally good boundaries for the majority and the minority classes. In this work, we propose a cost-sensitive deep neural network which can automatically learn robust feature representations for both the majority and minority classes. During training, our learning procedure involves an alternating joint optimization of the class-dependent costs and the neural network parameters. The proposed approach is applicable to both binary and multi-class problems without any modification. Moreover, it does not add to the computational load during training, because the original data distribution is not disturbed. We report experiments on four major datasets relating to image classification and show that the proposed approach significantly outperforms the baseline procedures. Comparisons with popular data sampling techniques and cost-sensitive classifiers demonstrate the superior performance of the proposed method.

Figure 1: Examples of popular classification datasets in which the number of images varies sharply across different classes. This class imbalance poses a challenge for classification and representation learning algorithms. (a) Word cloud: MIT-67; (b) class frequencies: MIT-67; (c) word cloud: Caltech-101; (d) class frequencies: Caltech-101. (Histogram panels, which plot the number of images against the distinct classes, are omitted here.)

1. Introduction

In most real-world classification problems, the available data follows a long-tail distribution, i.e., a few object classes are abundant while others have only a limited representation. This behaviour is termed the 'class-imbalance' problem, and it is inherently manifested in nearly all collected object databases. We call a multi-class dataset 'imbalanced' or 'skewed' if some of the minority classes in the training set are heavily under-represented compared to the majority classes. This skewed distribution of class instances forces the classification algorithm to be biased towards the majority classes, and as a result the concepts related to the minority classes are not adequately learned.

The class imbalance problem is of particular importance in real-world scenarios where it is undesirable to mis-classify examples from an infrequent but important minority class. For example, a particular cancerous lesion may appear only rarely during medical image analysis, but we cannot afford to mis-classify it into the benign category. Similarly, in a continuous surveillance task, a suspicious activity may happen only occasionally, but it must not go unnoticed by the monitoring system. In other application domains, e.g., object classification, the correct classification of minority class samples is as important as that of the majority class samples. Despite the pertinence of the class imbalance problem to practical computer vision, very few research works have appeared on this topic in recent years. Real-world class imbalance is avoided in nearly all competitive datasets during the evaluation and training procedures. For example, in the popular image classification datasets (such as Caltech-101/256, MIT-67, ImageNet and CIFAR-10/100), the collectors have made efforts to ensure that either all classes have a minimum representation of sufficient data, or the experimental protocols are reshaped to use an equal number of images for all classes during training and testing [27, 5, 22]. This approach seems reasonable when a dataset contains only a few classes which are equally likely to appear in practical scenarios (e.g., digits in MNIST). However, as the number of classes in object datasets increases, it is becoming impractical to provide equal representations of all classes in the training and testing subsets. For example, in a fine-grained marine categorization dataset, an extinct fish species will probably have a low representation compared to more abundant species.

In this work, we aim to address the problem of class imbalance for deep neural networks, which are becoming increasingly popular for the traditional classification and recognition tasks. While previous works such as [6, 20, 50, 33] only disturb the training data distribution to learn better classifiers, we directly modify the learning procedure to incorporate class-dependent costs during training. For this purpose, we introduce cost-sensitive versions of three widely used loss functions for Convolutional Neural Networks (CNNs). We analyze the effect of these modified loss functions on the back-propagation algorithm by deriving relations for the propagated gradients. Further, we propose an algorithm for the joint alternate optimization of the network parameters and the class-sensitive costs. The proposed algorithm works for both binary and multi-class classification problems without modification. We also show that the introduction of class-sensitive costs does not significantly affect the training and testing time of the original network (Sec. 5). The proposed approach has been extensively tested on four major classification datasets and has been shown to outperform the baseline procedures (Sec. 5.2).

The remainder of this paper is organized as follows. We briefly discuss the related work in the next section. In Sec. 3, we introduce our proposed approach, and we provide details of the CNN architecture in Sec. 4. Experiments and results are summarized in Sec. 5, and the paper concludes in Sec. 6 with an outlook towards future work.

2. Related Work

Previous research on the class imbalance problem has concentrated mainly on two levels: the data level and the algorithmic level [11]. Below, we briefly discuss the different research efforts to tackle the class imbalance problem. It must be noted, however, that the majority of these prior works only address the two-class imbalance problem, whereas our approach works equally well for binary and multi-class cases.

Data-level approaches manipulate the class representations in the original dataset by either over-sampling the minority classes or under-sampling the majority classes to make the resulting data distribution balanced [11].

However, these techniques change the original distribution of the data and therefore introduce drawbacks. While under-sampling can discard potentially useful majority class data, over-sampling makes training computationally burdensome by artificially increasing the size of the training set. Furthermore, over-sampling is prone to over-fitting when exact copies of minority class samples are replicated randomly [6, 11]. To address this over-fitting problem, Chawla et al. [6] introduced SMOTE, a method which generates new instances by linear interpolation between closely lying minority class samples. These synthetically generated minority class instances may, however, lie inside the convex hull of the majority class instances, a phenomenon known as over-generalization. Over the years, several variants of the SMOTE algorithm have been proposed to solve this problem [42]. For example, Borderline SMOTE [14] only over-samples the minority class samples that lie close to the class boundaries. Safe-level SMOTE [2] carefully generates synthetic samples in so-called safe regions, where the majority and minority class regions do not overlap. Local neighbourhood SMOTE [29] considers the neighbouring majority class samples when generating synthetic minority class samples and reports better performance than the earlier variants of SMOTE. Combinations of under- and over-sampling procedures (e.g., [1, 18, 33]) to balance the training data have also been shown to perform well. However, these approaches suffer from the increased computational cost of data pre-processing and of learning a classification model. Moreover, they are not straightforwardly applicable to multi-class problems.

Algorithm-level approaches directly modify the learning procedure to improve the sensitivity of the classifier towards minority classes. Zhang et al. [50] first divided the data into smaller balanced subsets, followed by intelligent sampling and cost-sensitive SVM learning to deal with the imbalance problem. A neuro-fuzzy modeling procedure was introduced in [10] to perform leave-one-out cross-validation on imbalanced datasets. A scaling kernel along with the standard SVM was used in [49] to improve the generalizability of classifiers learned on skewed datasets. Li et al. [24] gave more importance to minority class samples by setting weights with AdaBoost during the training of an extreme learning machine (ELM). An ensemble of soft-margin SVMs was formed via boosting to perform well on both majority and minority classes [40]. These previous works hint towards the use of distinct costs for different training examples to improve the performance of the learning algorithm. However, they do not address class imbalance learning for CNNs [49, 44, 45]. Furthermore, they are mostly limited to binary class problems [17, 40] and do not explore computer vision tasks, which inherently have imbalanced class distributions.

3. Problem Formulation

We address the class imbalance problem during the training of CNNs. For this purpose, we introduce a cost-sensitive error function which can be expressed as the mean loss over the training set:

E(\theta, \psi) = \frac{1}{M} \sum_{i=1}^{M} \ell\big(d^{(i)}, y_{\theta,\psi}^{(i)}\big),   (1)

where the predicted output (y) of the penultimate layer is parameterized by θ (network weights and biases) and ψ (class-sensitive costs), M is the total number of training examples, d ∈ {0, 1}^{1×N} is the desired output (s.t. \sum_n d_n := 1) and N denotes the total number of neurons in the output layer. For the sake of brevity, we will not explicitly mention the dependence of y on the parameters (θ, ψ) in the following discussion. Note that the error is higher when the model performs poorly on the training set. The objective of the learning algorithm is to find the optimal parameters (θ*, ψ*) which give the minimum possible cost E* (Eq. (1)). Therefore, the optimization objective is given by:

(\theta^*, \psi^*) = \arg\min_{\theta, \psi} E(\theta, \psi).   (2)

The loss function ℓ(·) in Eq. (1) can be any suitable loss such as the Mean Square Error (MSE), the Support Vector Machine (SVM) hinge loss or the Cross Entropy (CE) loss (also called the soft-max log loss). The cost-sensitive versions of these loss functions are discussed below.

(a) Cost Sensitive MSE loss: This minimizes the squared error between the predicted output and the desired ground-truth and can be expressed as follows:

\ell(d^{(i)}, y^{(i)}) = \frac{1}{2} \sum_{n} \big(d_n^{(i)} - y_n^{(i)}\big)^2,   (3)

where y_n^(i) is related to the output of the previous layer o_n^(i) via the logistic function,

y_n^{(i)} = \frac{1}{1 + \exp\big(-o_n^{(i)} \xi(d_n^{(i)})\big)},   (4)

where ξ is the class-sensitive penalty which depends on the desired class of the i-th training sample. The effect of this cost on the back-propagation algorithm is discussed in Sec. 3.1.

(b) Cost Sensitive SVM hinge loss: This maximizes the margin between each pair of classes and can be expressed as follows:

\ell(d^{(i)}, y^{(i)}) = -\sum_{n} \max\big(0,\, 1 - (2d_n^{(i)} - 1)\, y_n^{(i)}\big),   (5)

where y_n^(i) can be represented in terms of the previous layer output o_n^(i) and the cost ξ as follows:

y_n^{(i)} = o_n^{(i)}\, \xi(d_n^{(i)}).   (6)

The effect of the introduced cost on the gradient computation is discussed in Sec. 3.2.

(c) Cost Sensitive CE loss: This maximizes the closeness of the prediction to the desired output and is given by:

\ell(d^{(i)}, y^{(i)}) = -\sum_{n} d_n^{(i)} \log y_n^{(i)},   (7)

where y_n^(i) incorporates the class-dependent cost (ξ) and is related to the previous layer output o_n^(i) via the soft-max function,

y_n^{(i)} = \frac{\xi(d_n^{(i)}) \exp(o_n^{(i)})}{\sum_{k} \xi(d_k^{(i)}) \exp(o_k^{(i)})}.   (8)

The effect of the modified CE loss on the back-propagation algorithm is discussed in Sec. 3.3.

Relation between ψ and ξ: Note that the parameter vector ψ defines a class-dependent penalty which strives to deal with the class imbalance by imposing a high cost for the mis-classification of minority class samples. Based on the learned parameters ψ, we define a cost matrix ξ which is then used to reshape the errors in the loss functions (Eqs. (3)-(7)). In the cost matrix ξ, each class (c_p) has a specific cost relationship with another class (c_q) based on the overall distribution in the training set. Therefore, there are N^2 parameters in the cost matrix ξ. A closer analysis shows that there are only N free parameters¹ in ξ, which can be set using the learned parameters ψ. The relationship between ψ and ξ can be defined as follows:

\xi(p, q) = \begin{cases} \max(\psi_p, \psi_q), & p \neq q,\ (p, q) \in c, \\ \psi_p, & p = q,\ p \in c. \end{cases}   (9)

For clarity of exposition, ξ is shown as a function of the ground-truth d_n^(i) in Eqs. (3)-(8). The indices (p, q) can be found from the ground-truth vector using the following relations: p = n, q = \arg\max_n d_n^{(i)}.

¹A formal proof can be found in the supplementary material.
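To make the above definitions concrete, the following NumPy sketch builds the cost matrix ξ from a vector ψ according to Eq. (9) and evaluates the cost-sensitive softmax and CE loss of Eqs. (7)-(8). It is an illustrative re-implementation under our own assumptions (array shapes, helper names), not the authors' code.

```python
import numpy as np

def cost_matrix(psi):
    """Eq. (9): xi[p, q] = max(psi_p, psi_q) for p != q, and psi_p on the diagonal."""
    xi = np.maximum.outer(psi, psi)   # off-diagonal entries
    np.fill_diagonal(xi, psi)         # diagonal entries
    return xi

def cosen_softmax(o, xi_col):
    """Eq. (8): softmax whose scores are weighted by xi(d_n) = xi[n, true_class]."""
    z = xi_col * np.exp(o - o.max())  # subtracting max(o) only adds numerical stability
    return z / z.sum()

def cosen_ce_loss(o, d, xi_col):
    """Eq. (7): cost-sensitive cross-entropy loss for a single sample."""
    y = cosen_softmax(o, xi_col)
    return -np.sum(d * np.log(y + 1e-12))

# Toy example with N = 3 classes; the minority class (index 2) receives a larger psi.
psi = np.array([1.0, 1.0, 2.5])
xi = cost_matrix(psi)
o = np.array([2.0, 1.0, 0.5])         # scores from the penultimate layer
d = np.array([0.0, 0.0, 1.0])         # one-hot ground truth (minority class)
q = int(np.argmax(d))                 # q = arg max_n d_n, as in the text above
print(cosen_ce_loss(o, d, xi[:, q]))
```

The column xi[:, q] collects ξ(p, q) for p = 1..N, i.e., the class-dependent weights that modulate every output neuron for a sample whose true class is q.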

Algorithm 1 Iterative optimization for parameters (θ, ψ)
Input: Training set (x, d), validation set (xV, dV), max. epochs (E), learning rate for θ (γθ), learning rate for ψ (γψ)
Output: Learned parameters (θ*, ψ*)
 1: Net ← construct-CNN()
 2: θ ← initialize-Net(Net)                 ▷ random initialization
 3: ψ ← 1, val-err ← 1
 4: for e ∈ [1, E] do                       ▷ number of epochs
 5:     gradψ ← compute-grad(x, d, F(ψ))    ▷ Eq. (12)
 6:     ψ* ← update-CostParams(ψ, γψ, gradψ)
 7:     ψ ← ψ*
 8:     for b ∈ [1, B] do                   ▷ number of batches
 9:         outb ← forward-pass(xb, db, Net, θ)
10:         gradb ← backward-pass(outb, xb, db, Net, θ, ψ)
11:         θ* ← update-NetParams(Net, θ, γθ, gradb)
12:         θ ← θ*
13:     end for
14:     val-err* ← forward-pass(xV, dV, Net, θ)
15:     if val-err* > val-err then
16:         γψ ← γψ * 0.01                  ▷ decrease step size
17:         val-err ← val-err*
18:     end if
19: end for
20: return (θ*, ψ*)
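The Python skeleton below mirrors the alternation of Algorithm 1. The network object `net` (with forward/backward/update/evaluate methods), the mini-batch iterator and the normalization of the class histogram h are our own assumptions; the ψ step uses the fact that the gradient of F(ψ) in Eq. (10) is −(h − ψ), as stated in Eq. (12).

```python
import numpy as np

def iterate_minibatches(x, d, batch_size=64):
    # Simple sequential mini-batch iterator (illustrative only).
    for i in range(0, len(x), batch_size):
        yield x[i:i + batch_size], d[i:i + batch_size]

def train_cosen_cnn(net, x, d, x_val, d_val, epochs, lr_theta, lr_psi, num_classes):
    """Alternating optimization of theta (inside `net`) and psi, following Algorithm 1.
    `net` is a hypothetical object exposing forward/backward/update/evaluate."""
    h = np.bincount(d.argmax(axis=1), minlength=num_classes).astype(float)
    h /= h.sum()                      # class histogram (normalizing it is our assumption)
    psi = np.ones(num_classes)        # psi <- 1
    best_val_err = 1.0                # val-err <- 1

    for epoch in range(epochs):
        # One gradient step on F(psi) of Eq. (10): grad F(psi) = -(h - psi).
        psi = psi - lr_psi * (-(h - psi))

        # SGD over mini-batches with the cost-sensitive loss (Sec. 3).
        for xb, db in iterate_minibatches(x, d):
            out = net.forward(xb, psi)
            grads = net.backward(out, db, psi)
            net.update(grads, lr_theta)

        # Shrink the psi step size when the validation error worsens.
        val_err = net.evaluate(x_val, d_val, psi)
        if val_err > best_val_err:
            lr_psi *= 0.01            # decrease step size
            best_val_err = val_err

    return net, psi
```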

Learning Optimal Parameters: When using any of the above-mentioned loss functions (Eqs. (3)-(7)), our goal is to jointly learn the hypothesis parameters θ and the class-dependent loss function parameters ψ. For the joint optimization, we alternately solve for both types of parameters by keeping one fixed and minimizing the cost with respect to the other (Algorithm 1). Specifically, for the optimization of θ, we use stochastic gradient descent (SGD) with the back-propagation of error (Eq. (1)). Next, to optimize ψ, we again use the gradient descent algorithm to calculate the direction of the step to update the parameters. The following cost function is used for the gradient computation to update ψ:

F(\psi) = \frac{1}{2} \sum_{n} (h_n - \psi_n)^2, \qquad n \in [1, C],   (10)

where C is the total number of distinct classes in the training set and h denotes the histogram vector which encodes the distribution of classes in the training set. The resulting minimization objective to find the optimal ψ* can be expressed as:

\psi^* = \arg\min_{\psi} F(\psi).   (11)

It is important to note that the cost function F(ψ) is independent of the input x^(i) and the network parameters θ. Also, the expression in Eq. (10) can be understood as a squared ℓ2 norm of the difference between the vectors h and ψ. Instead of an ℓ2 norm, it may be of interest to use some other distance measure (e.g., the Manhattan distance) in Eq. (10), but we do not explore this possibility because it is out of the scope of the current work. In order to optimize the cost function in Eq. (11), we use the gradient descent algorithm, which computes the direction of the update step as follows:

\nabla F(\psi) = \nabla \left( \frac{1}{2} (h - \psi)(h - \psi)^T \right) = (h - \psi) J_\psi^T = -(h - \psi) \mathbf{1}^T,   (12)

where J denotes the Jacobian matrix. Next, we discuss the impact of the modified loss functions on the gradient computation in the back-propagation algorithm.

3.1. Cost Sensitive MSE

During supervised training, the MSE loss minimizes the mean squared error between the predicted weighted outputs of the model y^(i) and the ground-truth labels d^(i) across the entire training set (Eq. (3)). The modification of the loss function changes the gradient computed during the back-propagation algorithm. Therefore, for the output layer, the gradient at each neuron is given by:

\frac{\partial \ell(d^{(i)}, y^{(i)})}{\partial o_n^{(i)}} = -\big(d_n^{(i)} - y_n^{(i)}\big) \frac{\partial y_n^{(i)}}{\partial o_n^{(i)}}.

The output y_n^(i) for the cost-sensitive MSE loss is defined as:

y_n^{(i)} = \frac{1}{1 + \exp\big(-o_n^{(i)} \xi(d_n^{(i)})\big)}.

The partial derivative can be calculated as follows:

\frac{\partial y_n^{(i)}}{\partial o_n^{(i)}} = \frac{\xi(d_n^{(i)}) \exp\big(-o_n^{(i)} \xi(d_n^{(i)})\big)}{\big(1 + \exp(-o_n^{(i)} \xi(d_n^{(i)}))\big)^2}
= \frac{\xi(d_n^{(i)})}{\exp\big(o_n^{(i)} \xi(d_n^{(i)})\big) + \exp\big(-o_n^{(i)} \xi(d_n^{(i)})\big) + 2}
= \frac{\xi(d_n^{(i)})}{\big(1 + \exp(o_n^{(i)} \xi(d_n^{(i)}))\big)\big(1 + \exp(-o_n^{(i)} \xi(d_n^{(i)}))\big)},

which simplifies to

\frac{\partial y_n^{(i)}}{\partial o_n^{(i)}} = \xi(d_n^{(i)})\, y_n^{(i)} \big(1 - y_n^{(i)}\big).

The derivative of the loss function is therefore given by:

\frac{\partial \ell(d^{(i)}, y^{(i)})}{\partial o_n^{(i)}} = -\xi(d_n^{(i)})\, \big(d_n^{(i)} - y_n^{(i)}\big)\, y_n^{(i)} \big(1 - y_n^{(i)}\big).   (13)
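As a sanity check on Eq. (13), the short NumPy snippet below evaluates the cost-sensitive sigmoid output of Eq. (4), the analytic gradient of the modified MSE loss, and a central finite-difference approximation; the two agree to numerical precision. The example values are arbitrary and only for illustration.

```python
import numpy as np

def cosen_sigmoid(o, xi):
    # Eq. (4): logistic output modulated by the class-dependent cost xi(d_n)
    return 1.0 / (1.0 + np.exp(-o * xi))

def mse_loss(o, d, xi):
    # Eq. (3) evaluated on the cost-sensitive output of Eq. (4)
    y = cosen_sigmoid(o, xi)
    return 0.5 * np.sum((d - y) ** 2)

def mse_grad(o, d, xi):
    # Eq. (13): d loss / d o_n = -xi * (d_n - y_n) * y_n * (1 - y_n)
    y = cosen_sigmoid(o, xi)
    return -xi * (d - y) * y * (1.0 - y)

o = np.array([0.8, -1.2, 0.3])    # outputs of the previous layer (toy values)
d = np.array([1.0, 0.0, 0.0])     # one-hot ground truth
xi = np.array([1.7, 1.0, 1.0])    # class-dependent costs xi(d_n)

eps = 1e-6
numeric = np.array([(mse_loss(o + eps * e, d, xi) - mse_loss(o - eps * e, d, xi)) / (2 * eps)
                    for e in np.eye(3)])
print(np.allclose(mse_grad(o, d, xi), numeric))   # True: analytic gradient matches Eq. (13)
```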

3.2. Cost Sensitive SVM Hinge Loss

For the SVM hinge loss function given in Eq. (5), the directional derivative can be computed at each neuron as follows:

\frac{\partial \ell(d^{(i)}, y^{(i)})}{\partial o_n^{(i)}} = -\big(2d_n^{(i)} - 1\big)\, \frac{\partial y_n^{(i)}}{\partial o_n^{(i)}}\, \mathbb{I}\big\{1 > y_n^{(i)} (2d_n^{(i)} - 1)\big\},

where \mathbb{I}(\cdot) denotes the indicator function. After simplification we get:

\frac{\partial \ell(d^{(i)}, y^{(i)})}{\partial o_n^{(i)}} = -\big(2d_n^{(i)} - 1\big)\, \xi(d_n^{(i)})\, \mathbb{I}\big\{1 > y_n^{(i)}\, \xi(d_n^{(i)})\, (2d_n^{(i)} - 1)\big\}.   (14)

3.3. Cost Sensitive CE loss

The cost-sensitive softmax log loss function is defined in Eq. (7). Next, we show that the introduction of a cost in the CE loss does not change the gradient formulas; the cost is rather incorporated implicitly in the softmax output y_m^(i).

Proposition 1. The introduction of a class imbalance cost ξ(·) in the softmax loss (ℓ(·) in Eq. (7)) does not affect the computation of the gradient during the back-propagation process.

Proof. We start with the calculation of the partial derivative of the softmax neuron with respect to its input:

\frac{\partial y_n^{(i)}}{\partial o_m^{(i)}} = \frac{\partial}{\partial o_m^{(i)}} \left( \frac{\xi(d_n^{(i)}) \exp(o_n^{(i)})}{\sum_k \xi(d_k^{(i)}) \exp(o_k^{(i)})} \right).   (15)

Now, two cases can arise here, either m = n or m ≠ n. We first solve for the case when n = m:

\frac{\partial y_n^{(i)}}{\partial o_m^{(i)}} = \xi(d_m^{(i)})\, \frac{\exp(o_m^{(i)}) \sum_k \xi(d_k^{(i)}) \exp(o_k^{(i)}) - \xi(d_m^{(i)}) \exp(2 o_m^{(i)})}{\left( \sum_k \xi(d_k^{(i)}) \exp(o_k^{(i)}) \right)^2}.

After simplification we get:

\frac{\partial y_n^{(i)}}{\partial o_m^{(i)}} = y_m^{(i)} \big(1 - y_m^{(i)}\big), \qquad \text{s.t. } m = n.

Next, we solve for the case when n ≠ m:

\frac{\partial y_n^{(i)}}{\partial o_m^{(i)}} = -\frac{\xi(d_m^{(i)})\, \xi(d_n^{(i)})\, \exp(o_m^{(i)}) \exp(o_n^{(i)})}{\left( \sum_k \xi(d_k^{(i)}) \exp(o_k^{(i)}) \right)^2},

which after simplification gives:

\frac{\partial y_n^{(i)}}{\partial o_m^{(i)}} = -y_m^{(i)} y_n^{(i)}, \qquad \text{s.t. } m \neq n.

The loss function can be differentiated as follows:

\frac{\partial \ell(y^{(i)}, d^{(i)})}{\partial o_m^{(i)}} = -\sum_n d_n^{(i)} \frac{1}{y_n^{(i)}} \frac{\partial y_n^{(i)}}{\partial o_m^{(i)}}.

By combining the above two expressions, the derivative of the loss function can be represented as:

\frac{\partial \ell(y^{(i)}, d^{(i)})}{\partial o_m^{(i)}} = -d_m^{(i)} \big(1 - y_m^{(i)}\big) + \sum_{n \neq m} d_n^{(i)} y_m^{(i)} = -d_m^{(i)} + \sum_n d_n^{(i)} y_m^{(i)}.

Since d^(i) is defined as a probability distribution over all output classes (\sum_n d_n^{(i)} = 1), we obtain:

\frac{\partial \ell(y^{(i)}, d^{(i)})}{\partial o_m^{(i)}} = -d_m^{(i)} + y_m^{(i)}.

This result is the same as in the case when the CE loss does not contain any cost-sensitive parameters. Therefore, the costs affect the softmax output y_m^(i), but the gradient formulas remain unchanged.
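The result of Proposition 1 can also be verified numerically: with the cost-sensitive softmax of Eq. (8), the gradient of the CE loss with respect to the penultimate-layer outputs is still y − d. The NumPy check below uses arbitrary toy values and is only a sketch of that verification.

```python
import numpy as np

def cosen_softmax(o, xi):
    # Eq. (8): softmax whose scores are weighted by the class-dependent costs
    z = xi * np.exp(o - o.max())
    return z / z.sum()

def ce_loss(o, d, xi):
    # Eq. (7): cross-entropy of the cost-sensitive softmax output
    return -np.sum(d * np.log(cosen_softmax(o, xi)))

o = np.array([1.5, -0.3, 0.2, 0.9])   # penultimate-layer outputs (toy values)
d = np.array([0.0, 1.0, 0.0, 0.0])    # one-hot ground truth
xi = np.array([1.0, 2.3, 1.0, 1.4])   # class-dependent costs xi(d_k)

eps = 1e-6
numeric = np.array([(ce_loss(o + eps * e, d, xi) - ce_loss(o - eps * e, d, xi)) / (2 * eps)
                    for e in np.eye(4)])
analytic = cosen_softmax(o, xi) - d   # gradient predicted by Proposition 1: y - d
print(np.allclose(numeric, analytic)) # True: the costs only enter through y
```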

In our experiments, we only report performances with the cost-sensitive CE loss function, because the CE loss has been shown to outperform the other two loss functions in most cases [21]. Moreover, it avoids the learning slow-down problem of the MSE loss [31].

4. Convolutional Neural Network

We use a deep CNN to learn robust feature representations for the task of image classification. The network architecture consists of a total of 18 weight layers (see Fig. 2 for details). Our architecture is similar to the state-of-the-art CNN (configuration D) proposed in [35], except that our architecture has two extra fully connected layers before the output layer and the proposed loss layer is cost sensitive. Since there is a huge number of parameters (∼139 million) in the network, it is not possible to learn all of them from scratch using a relatively small number of images. We therefore initialize the first 16 layers of our model with the pre-trained model of [35] and set random weights for the last two fully connected layers. We then train the full network with a relatively higher learning rate to allow a change in the network parameters. Note that the cost-sensitive (CoSen) CNN is trained with the modified cost functions introduced in Eqs. (3)-(8). The CNN trained without the cost-sensitive loss layer is used as the baseline CNN in our experiments (Sec. 5).

Figure 2: The CNN architecture used in this work. It consists of 18 weight layers.
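As an illustration of the architecture described above, the following PyTorch sketch keeps the 16 pre-trained weight layers of a VGG-16 (configuration D) model from torchvision and appends two extra, randomly initialized fully connected layers, the second of which is the output layer fed to the cost-sensitive loss of Sec. 3. This is our own approximation of the described setup; the hidden layer size, the activation after the pre-trained network and the number of output classes are assumptions, not details taken from the paper.

```python
import torch.nn as nn
from torchvision import models

def build_cosen_cnn(num_classes):
    """18-weight-layer sketch: pre-trained VGG-16 (13 conv + 3 FC layers)
    followed by two randomly initialized fully connected layers."""
    # In recent torchvision versions use models.vgg16(weights="IMAGENET1K_V1") instead.
    vgg16 = models.vgg16(pretrained=True)
    extra_fc = nn.Sequential(
        nn.ReLU(inplace=True),
        nn.Linear(1000, 1000),            # extra FC layer (17th weight layer, size assumed)
        nn.ReLU(inplace=True),
        nn.Dropout(p=0.5),
        nn.Linear(1000, num_classes),     # output layer (18th weight layer)
    )
    return nn.Sequential(vgg16, extra_fc)

model = build_cosen_cnn(num_classes=100)  # e.g., a CIFAR-100-sized output layer
```

During fine-tuning, the whole network is updated (no layers are frozen), matching the statement above that the full network is trained with a relatively higher learning rate.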

5. Experiments and Results

The class imbalance problem is present in nearly all real-world object and image datasets. This is not caused by flawed data collection; it simply reflects the natural frequency patterns of different object classes in real life. For example, a bed will be clearly visible in nearly every bedroom scene, but a baby cot is likely to appear far less frequently. We perform experiments on four popular classification datasets, two of which have an equal representation of each class in the training and testing splits. For the other two, the class distributions are not balanced, but the experimental protocols are defined in such a way as to use an equal number of images for all classes during training. We report our performances on the standard splits, on deliberately deformed splits and on the original data distributions (for the case of the imbalanced datasets). Since our training procedure requires a small validation set (Algorithm 1), we use ∼5% of the training data in each experiment as a held-out validation set. We explain the datasets and our experimental settings in the next section.

5.1. Datasets and Experimental Settings

Object Classification: Caltech-101 contains a total of 9,144 images, divided into 102 categories (101 objects + background). The number of images per category varies between 31 and 800 (mean: 90, median: 59). The dataset is therefore originally imbalanced, but the standard protocol uses 30 or 15 images per category during training, and testing is performed on the rest of the images (max. 50). We perform experiments using the standard split as well as 60%/40% and 30%/70% train/test splits.

Scene Classification: MIT-67 consists of 15,620 images belonging to 67 classes. The number of images per class varies between 101 and 738 (mean: 233, median: 157). The standard protocol uses a subset of 6,700 images (100 per class) for training and evaluation to make the distribution uniform. We will, however, evaluate our approach both on the standard split (80%/20% train/test) and on the complete dataset with train/test splits of 60%/40% and 30%/70%.

Handwritten Digit Classification: MNIST consists of 70,000 images of digits (0-9). Out of the total, 60,000 images are used for training (∼600/class) and the remaining 10,000 for testing (∼100/class). We evaluate our approach on the standard split as well as on deliberately imbalanced splits. To imbalance the training distribution, we reduce the representation of the even or the odd classes to only 25% and 10% of their images.

Image Classification: CIFAR-100 contains 60,000 images belonging to 100 classes (600 images/class). The standard train/test split for each class is 500/100 images. We evaluate our approach on the standard split as well as on artificially imbalanced splits. To imbalance the training distribution, we reduce the representation of the even or the odd classes to only 25% and 10% of their images.
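The deliberately imbalanced protocols used for MNIST and CIFAR-100 can be reproduced by sub-sampling the chosen classes. The NumPy sketch below keeps a given fraction (e.g., 10% or 25%) of the training images of every even-numbered (or odd-numbered) class and all images of the remaining classes; the function name, random seed and stand-in labels are our own choices.

```python
import numpy as np

def imbalance_indices(labels, reduce_even=True, keep_fraction=0.10, seed=0):
    """Return training indices in which the even (or odd) classes are reduced to
    `keep_fraction` of their images, leaving the other classes untouched."""
    rng = np.random.default_rng(seed)
    keep = []
    for c in np.unique(labels):
        idx = np.flatnonzero(labels == c)
        if (c % 2 == 0) == reduce_even:   # classes selected for reduction
            n_keep = max(1, int(round(keep_fraction * len(idx))))
            idx = rng.choice(idx, size=n_keep, replace=False)
        keep.append(idx)
    return np.sort(np.concatenate(keep))

# Example: reduce the even MNIST classes to 10% of their training images.
labels = np.random.randint(0, 10, size=60000)   # stand-in for the real label array
train_idx = imbalance_indices(labels, reduce_even=True, keep_fraction=0.10)
```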

5.2. Results and Comparisons

For the two balanced datasets, MNIST and CIFAR-100, we report our results in Tables 1 and 2 on the standard splits along with the deliberately imbalanced splits. To imbalance the training distributions, we experimented with the full training data for the even classes and only 25% and 10% of the data of the odd classes. Similarly, we experimented by keeping the full representation of the odd classes and reducing the representation of the even classes to only 25% and 10%. Our results show that we perform identically to the baseline method when the distribution is balanced, but as the amount of imbalance increases, our approach shows significant improvements over the baseline CNN, which is trained without the cost-sensitive loss layer. We also compare with the state-of-the-art techniques which report results on the standard split² and demonstrate that our performances are better than or comparable to them. Note that for the MNIST digit dataset, nearly all the top-performing approaches use distortions (affine and/or elastic) and augmentation to get a significant boost in performance. In contrast, our baseline and cost-sensitive CNNs do not use any form of distortions/augmentation during the training and testing procedures on MNIST.

²Note that the standard split on Caltech-101 and MIT-67 is different from the original data distribution (see Sec. 5.1 for details).

Table 1: Evaluation on the MNIST database.

Methods (using stand. split)           | Performance
Conv DBN [23]                          | 99.2%
Deep learning via embedding [43]       | 98.5%
Deeply Supervised Nets [22]            | 99.6%
Our approach (↓)                       | Baseline CNN | CoSen CNN
Stand. split (∼600 trn, ∼100 tst)      | 99.3%        | 99.3%
Low rep. (10%) of odd digits           | 97.6%        | 98.5%
Low rep. (10%) of even digits          | 97.1%        | 98.3%
Low rep. (25%) of odd digits           | 98.1%        | 98.9%
Low rep. (25%) of even digits          | 97.8%        | 98.5%

Table 2: Evaluation on the CIFAR-100 database.

Methods (using stand. split)           | Performance
Network in Network [27]                | 64.3%
Tree based Priors [37]                 | 63.1%
Probabilistic Maxout Network [36]      | 61.9%
Maxout-Networks [13]                   | 61.4%
Stochastic Pooling [47]                | 57.5%
Representation Learning [28]           | 60.8%
Deeply Supervised Nets [22]            | 65.4%
Our approach (↓)                       | Baseline CNN | CoSen CNN
Stand. split (500 trn, 100 tst)        | 65.2%        | 65.2%
Low rep. (10%) of odd classes          | 55.0%        | 60.1%
Low rep. (10%) of even classes         | 53.8%        | 59.7%
Low rep. (25%) of odd classes          | 57.7%        | 61.5%
Low rep. (25%) of even classes         | 57.4%        | 61.6%

We also experiment on the two popular classification datasets which are originally imbalanced, but for which the standard protocols use an equal number of images for all training classes. For example, 30 or 15 images are used for training on Caltech-101, while 80 images per category are used for training on MIT-67. We report our results on the standard splits (Tables 3, 4) to compare with the state-of-the-art approaches, and show that our results are superior to the state of the art on MIT-67 and competitive on the Caltech-101 dataset. Note that the best-performing SPP-net [16] uses multiple sizes of Caltech-101 images during training. In contrast, we only use a single consistent size during training and testing. We also experiment with the original imbalanced data distributions to train the CNN with the modified loss function. For the original data distributions, we use both 60%/40% and 30%/70% train/test splits to show our performance under a variety of train/test distributions. Moreover, with these imbalanced splits, we further decrease the data of the odd and even classes to just 10%, respectively, and observe a better relative performance of our proposed approach compared to the baseline method.

Table 3: Evaluation on the Caltech-101 database († figures reported in [4]).

Methods (using stand. split)           | 15 trn. samples | 30 trn. samples
Multiple Kernels [39]                  | 71.1 ± 0.6      | 78.2 ± 0.4
LLC† [41]                              | −               | 76.9 ± 0.4
Imp. Fisher Kernel† [32]               | −               | 77.8 ± 0.6
SPM-SC [46]                            | 73.2            | 84.3
DeCAF [9]                              | −               | 86.9 ± 0.7
Zeiler & Fergus [48]                   | 83.8 ± 0.5      | 86.5 ± 0.5
Chatfield et al. [5]                   | −               | 88.3 ± 0.6
SPP-net [16]                           | −               | 91.4 ± 0.7
Our approach (↓)                       | Baseline CNN    | CoSen CNN
Stand. split (15 trn. samples)         | 87.1%           | 87.1%
Stand. split (30 trn. samples)         | 90.8%           | 90.8%
Org. data distribution (60%/40% split) | 88.1%           | 89.2%
Low rep. (10%) of odd classes          | 77.4%           | 83.0%
Low rep. (10%) of even classes         | 76.1%           | 82.2%
Org. data distribution (30%/70% split) | 85.5%           | 87.9%
Low rep. (10%) of odd classes          | 74.6%           | 80.3%
Low rep. (10%) of even classes         | 75.2%           | 80.9%

Table 4: Evaluation on the MIT-67 database.

Methods (using stand. split)           | Performance
Spatial Pooling Regions [26]           | 50.1%
VC + VQ [25]                           | 52.3%
CNN-SVM [34]                           | 58.4%
Improved Fisher Vectors [19]           | 60.8%
Mid Level Representation [8]           | 64.0%
Multiscale Orderless Pooling [12]      | 68.9%
Our approach (↓)                       | Baseline CNN | CoSen CNN
Stand. split (80 trn, 20 tst)          | 70.9%        | 70.9%
Org. data distribution (60%/40% split) | 70.7%        | 73.2%
Low rep. (10%) of odd classes          | 50.4%        | 57.0%
Low rep. (10%) of even classes         | 50.1%        | 56.4%
Org. data distribution (30%/70% split) | 61.9%        | 66.3%
Low rep. (10%) of odd classes          | 38.7%        | 44.8%
Low rep. (10%) of even classes         | 37.2%        | 43.5%

The comparisons with the best approaches for class-imbalance learning are shown in Table 5. Note that we used a high degree of imbalance for all four datasets to clearly show the impact of the class imbalance approaches (Fig. 4). For valid comparisons, our experimental procedure was kept as close as possible to that of the proposed CoSen CNN. For example, for the CoSen Support Vector Machine (SVM) and Random Forest (RF) classifiers, we used the 4096-dimensional features extracted from the pre-trained deep CNN (D) [35]. Similarly, for over- and under-sampling, we use the same 4096-dimensional features, which have been shown to perform well on other classification datasets. We report comparisons with all types of data sampling techniques, i.e., over-sampling (SMOTE [6]), under-sampling (Random Under-Sampling, RUS [30]) and hybrid sampling (SMOTE-RSB∗ [33]). Note that despite the simplicity of the approaches [6, 30], they have been shown to perform very well on imbalanced datasets [11, 15]. We also compare with the cost-sensitive versions of popular classifiers (weighted SVM [38] and weighted RF [7]). For the case of the weighted SVM, we used the standard implementation of LIBSVM [3] and set the class-dependent costs based on the proportion of each class in the training set. Our proposed approach demonstrates a significant improvement over all of the cost-sensitive class imbalance methods.

Table 5: Comparisons of our approach with the state-of-the-art class-imbalance approaches. The experimental protocols used for each dataset are shown in Fig. 4. With highly imbalanced training sets, our approach significantly outperforms the other data sampling and cost-sensitive classifiers on all four classification datasets.

Datasets (imbalanced protocols) | Experimental setting    | Over-sampling (SMOTE [6]) | Under-sampling (RUS [30]) | Hybrid-sampling (SMOTE-RSB∗ [33]) | CoSen SVM (WSVM [38]) | CoSen RF (WRF [7]) | Baseline CNN | CoSen CNN
MNIST                           | 10% of odd classes      | 94.5% | 92.1% | 96.0% | 96.8% | 96.3% | 97.6% | 98.5%
CIFAR-100                       | 10% of odd classes      | 32.2% | 28.8% | 37.5% | 39.9% | 39.0% | 55.0% | 60.1%
Caltech-101                     | 60% trn, 10% of odd cl. | 67.7% | 61.4% | 68.2% | 70.1% | 68.7% | 77.4% | 83.0%
MIT-67                          | 60% trn, 10% of odd cl. | 33.9% | 28.4% | 34.0% | 35.5% | 35.2% | 50.4% | 57.0%

Figure 4: The imbalanced training set distributions used for the comparisons reported in Table 5: (a) MNIST training set, (b) CIFAR-100 training set, (c) MIT-67 training set, (d) Caltech-101 training set. (Histogram panels, which plot the number of images against the distinct classes, are omitted here; best viewed when enlarged.)

Figure 3: Trend of decrease in the training and validation error on the CIFAR-100 dataset (even classes unchanged, odd classes reduced to 10%) for the baseline and cost-sensitive CNNs. (Plot omitted; it shows the training and validation classification errors of the CoSen CNN and the Baseline CNN over 80 epochs.)
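For reference, a weighted SVM baseline of the kind used in Table 5 (a LIBSVM-backed classifier with class-dependent costs set from the class proportions) can be approximated with scikit-learn's SVC and its `class_weight` argument, as sketched below. The inverse-frequency weighting rule shown here is one common choice and only approximates the protocol of the paper; the deep features and labels are placeholders.

```python
import numpy as np
from sklearn.svm import SVC

def class_weights_from_frequencies(labels):
    """Weight each class by the inverse of its proportion in the training set
    (one simple way of setting class-dependent mis-classification costs)."""
    classes, counts = np.unique(labels, return_counts=True)
    weights = counts.sum() / (len(classes) * counts)
    return dict(zip(classes.tolist(), weights.tolist()))

# Placeholders standing in for the 4096-dimensional CNN features and their labels.
features = np.random.randn(500, 4096)
labels = np.random.randint(0, 10, size=500)

svm = SVC(kernel="linear", class_weight=class_weights_from_frequencies(labels))
# Equivalently, class_weight="balanced" applies the same inverse-frequency rule.
svm.fit(features, labels)
```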

Timing Comparisons: The introduction of the class-dependent costs did not prove to be prohibitive during the training of the CNN. For example, on an Intel quad-core i7-4770 CPU (3.4 GHz) with 32 GB RAM and an Nvidia GeForce GTX 660 card (2 GB), it takes 73.59 s to run one epoch with the class-sensitive parameters and 71.87 s without them. At test time, the CoSen CNN requires the same resources as the baseline CNN, because no extra computations are involved during testing.

6. Conclusion

We proposed a cost-sensitive deep CNN to deal with the class-imbalance problem commonly found in real-world datasets. We analysed three commonly used loss functions and introduced class-dependent costs for each case. Furthermore, we proposed an alternating optimization procedure to efficiently learn the class-dependent costs as well as the network parameters. Our results on four popular classification datasets show that the modified loss functions perform very well on both the majority and the minority classes. In the future, we will incorporate inter-class relationships in the learned cost function.

References

[1] G. E. Batista, R. C. Prati, and M. C. Monard. A study of the behavior of several methods for balancing machine learning training data. ACM SIGKDD Explorations Newsletter, 6(1):20–29, 2004.
[2] C. Bunkhumpornpat, K. Sinapiromsaran, and C. Lursinsap. Safe-level-SMOTE: Safe-level-synthetic minority over-sampling technique for handling the class imbalanced problem. In AKDDM, pages 475–482. Springer, 2009.
[3] C.-C. Chang and C.-J. Lin. LIBSVM: a library for support vector machines. TIST, 2(3):27, 2011.
[4] K. Chatfield, V. Lempitsky, A. Vedaldi, and A. Zisserman. The devil is in the details: an evaluation of recent feature encoding methods. 2011.
[5] K. Chatfield, K. Simonyan, A. Vedaldi, and A. Zisserman. Return of the devil in the details: Delving deep into convolutional nets. arXiv:1405.3531, 2014.
[6] N. V. Chawla, K. W. Bowyer, L. O. Hall, and W. P. Kegelmeyer. SMOTE: synthetic minority over-sampling technique. JAIR, 16(1):321–357, 2002.
[7] C. Chen, A. Liaw, and L. Breiman. Using random forest to learn imbalanced data. University of California, Berkeley, 2004.
[8] C. Doersch, A. Gupta, and A. A. Efros. Mid-level visual element discovery as discriminative mode seeking. In NIPS, pages 494–502, 2013.
[9] J. Donahue, Y. Jia, O. Vinyals, J. Hoffman, N. Zhang, E. Tzeng, and T. Darrell. DeCAF: A deep convolutional activation feature for generic visual recognition. In ICML, pages 647–655, 2014.
[10] M. Gao, X. Hong, and C. J. Harris. Construction of neurofuzzy models for imbalanced data classification. TFS, 22(6):1472–1488, 2014.
[11] V. Garcia, J. Sanchez, J. Mollineda, R. Alejo, and J. Sotoca. The class imbalance problem in pattern classification and learning. In Congreso Espanol de Informatica, pages 1939–1946, 2007.
[12] Y. Gong, L. Wang, R. Guo, and S. Lazebnik. Multi-scale orderless pooling of deep convolutional activation features. In ECCV, pages 392–407. Springer International Publishing, 2014.
[13] I. Goodfellow, D. Warde-Farley, M. Mirza, A. Courville, and Y. Bengio. Maxout networks. In ICML, pages 1319–1327, 2013.
[14] H. Han, W.-Y. Wang, and B.-H. Mao. Borderline-SMOTE: a new over-sampling method in imbalanced data sets learning. In AIC, pages 878–887. Springer, 2005.
[15] H. He and E. A. Garcia. Learning from imbalanced data. TKDE, 21(9):1263–1284, 2009.
[16] K. He, X. Zhang, S. Ren, and J. Sun. Spatial pyramid pooling in deep convolutional networks for visual recognition. arXiv:1406.4729, 2014.
[17] K. Huang, H. Yang, I. King, and M. R. Lyu. Learning classifiers from imbalanced data based on biased minimax probability machine. In CVPR, volume 2, pages II–558. IEEE, 2004.
[18] P. Jeatrakul, K. W. Wong, and C. C. Fung. Classification of imbalanced data by combining the complementary neural network and SMOTE algorithm. In NIP, pages 152–159. Springer, 2010.
[19] M. Juneja, A. Vedaldi, C. Jawahar, and A. Zisserman. Blocks that shout: Distinctive parts for scene classification. In CVPR, pages 923–930. IEEE, 2013.
[20] S. H. Khan, M. Bennamoun, F. Sohel, and R. Togneri. Automatic feature learning for robust shadow detection. In CVPR, pages 1939–1946. IEEE, 2014.
[21] C.-Y. Lee, S. Xie, P. Gallagher, Z. Zhang, and Z. Tu. Deeply-supervised nets. arXiv:1409.5185, 2014.
[22] C.-Y. Lee, S. Xie, P. Gallagher, Z. Zhang, and Z. Tu. Deeply-supervised nets. In AISTATS, 2015.
[23] H. Lee, R. Grosse, R. Ranganath, and A. Y. Ng. Convolutional deep belief networks for scalable unsupervised learning of hierarchical representations. In ICML, pages 609–616. ACM, 2009.

[24] K. Li, X. Kong, Z. Lu, L. Wenyin, and J. Yin. Boosting weighted ELM for imbalanced learning. Neurocomputing, 128:15–21, 2014.
[25] Q. Li, J. Wu, and Z. Tu. Harvesting mid-level visual concepts from large-scale internet images. In CVPR, pages 851–858. IEEE, 2013.
[26] D. Lin, C. Lu, R. Liao, and J. Jia. Learning important spatial pooling regions for scene classification. 2014.
[27] M. Lin, Q. Chen, and S. Yan. Network in network. In Statistical Language and Speech Processing, 2013.
[28] T.-H. Lin and H. Kung. Stable and efficient representation learning with nonnegativity constraints. In ICML, pages 1323–1331, 2014.
[29] T. Maciejewski and J. Stefanowski. Local neighbourhood extension of SMOTE for mining imbalanced data. In CIDM, pages 104–111. IEEE, 2011.
[30] I. Mani and I. Zhang. kNN approach to unbalanced data distributions: a case study involving information extraction. In WLID, 2003.
[31] M. A. Nielsen. Neural networks and deep learning. Determination Press, 1, 2014.
[32] F. Perronnin, J. Sánchez, and T. Mensink. Improving the Fisher kernel for large-scale image classification. In ECCV, pages 143–156. Springer, 2010.
[33] E. Ramentol, Y. Caballero, R. Bello, and F. Herrera. SMOTE-RSB*: a hybrid preprocessing approach based on oversampling and undersampling for high imbalanced data-sets using SMOTE and rough sets theory. KIS, 33(2):245–265, 2012.
[34] A. S. Razavian, H. Azizpour, J. Sullivan, and S. Carlsson. CNN features off-the-shelf: an astounding baseline for recognition. In CVPRw, pages 512–519. IEEE, 2014.
[35] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv:1409.1556, 2014.
[36] J. T. Springenberg and M. Riedmiller. Improving deep neural networks with probabilistic maxout units. ICLRs, 2014.
[37] N. Srivastava and R. R. Salakhutdinov. Discriminative transfer learning with tree-based priors. In NIPS, pages 2094–2102, 2013.
[38] Y. Tang, Y.-Q. Zhang, N. V. Chawla, and S. Krasser. SVMs modeling for highly imbalanced classification. TSMC, 39(1):281–288, 2009.
[39] A. Vedaldi, V. Gulshan, M. Varma, and A. Zisserman. Multiple kernels for object detection. In ICCV, pages 606–613. IEEE, 2009.
[40] B. X. Wang and N. Japkowicz. Boosting support vector machines for imbalanced data sets. KIS, 25(1):1–20, 2010.
[41] J. Wang, J. Yang, K. Yu, F. Lv, T. Huang, and Y. Gong. Locality-constrained linear coding for image classification. In CVPR, pages 3360–3367. IEEE, 2010.
[42] K.-J. Wang, B. Makond, K.-H. Chen, and K.-M. Wang. A hybrid classifier combining SMOTE with PSO to estimate 5-year survivability of breast cancer patients. ASC, 20:15–24, 2014.
[43] J. Weston, F. Ratle, H. Mobahi, and R. Collobert. Deep learning via semi-supervised embedding. In Neural Networks: Tricks of the Trade, pages 639–655. Springer, 2012.
[44] G. Wu and E. Y. Chang. Class-boundary alignment for imbalanced dataset learning. In ICML workshop, pages 49–56, 2003.
[45] G. Wu and E. Y. Chang. KBA: Kernel boundary alignment considering imbalanced data distribution. TKDE, 17(6):786–795, 2005.
[46] J. Yang, K. Yu, Y. Gong, and T. Huang. Linear spatial pyramid matching using sparse coding for image classification. In CVPR, pages 1794–1801. IEEE, 2009.
[47] M. D. Zeiler and R. Fergus. Stochastic pooling for regularization of deep convolutional neural networks. arXiv:1301.3557, 2013.
[48] M. D. Zeiler and R. Fergus. Visualizing and understanding convolutional networks. In ECCV, pages 818–833. Springer, 2014.
[49] Y. Zhang, P. Fu, W. Liu, and G. Chen. Imbalanced data classification based on scaling kernel-based support vector machine. NCA, 25(3-4):927–935, 2014.
[50] Y. Zhang and D. Wang. A cost-sensitive ensemble method for class-imbalanced datasets. In AAA, volume 2013. Hindawi, 2013.