Weakly supervised classification with bagging in fisheries acoustics

R. Lefort, R. Fablet, J-M. Boucher

French research institute for exploitation of the sea (Ifremer), STH/LTH, Technopôle Brest Iroise, 29280 PLOUZANE, France
[email protected], http://www.ifremer.fr

Institut Telecom / Telecom Bretagne / Lab-Sticc / UEB, Technopôle Brest Iroise, CS 83818, 29238 Brest cedex, France
[email protected], [email protected], [email protected], http://www.telecom-bretagne.eu
Abstract— Statistical learning allows the construction of a probabilistic classification model. In the supervised case, the model is estimated from a labelled dataset, i.e. each observation has a known label. In the weakly supervised case, the label is not exactly known; in our setting, only the probability that an observation belongs to each class is known, so the label of each training sample is a probability vector. The methods developed in this paper are applied to object recognition in images. These images contain objects that must be classified according to their class membership. The ground truth is the relative proportion of the classes in each labelled image; this global proportion yields a probability-vector label for each training object. The originality of this paper lies in combining weakly labelled data with several probabilistic discriminative models that are aggregated using a bagging technique [1]. Two classification models (Bayesian and discriminative) are compared on oceanographic data. The objective is to recognize the species of fish schools in acoustic images, where the relative class proportions of the labelled images are given by successive trawl catches. The results show that the discriminative model is more robust than the Bayesian model, and the contribution of bagging is demonstrated for the discriminative model.
I. INTRODUCTION

Computing power keeps increasing and, with it, both the need for and the ability to develop reliable and robust automatic classification techniques. In this context, many automatic classification procedures have appeared. Some of them are very well known, such as Bayesian techniques that may rely on the Expectation-Maximization (EM) procedure [2], discriminative techniques using Support Vector Machines (SVM) [3], decision trees and random forests [4]. However, all of these methods were developed for the supervised learning setting, which consists in learning a classification model from a training data set composed of N instances {xn}n∈[1...N] and associated labels {yn}n∈[1...N], where yn ∈ [1...I] and I is the number of classes. In the semi-supervised case [5], a set of unlabelled instances is added to the previous training set, so that the final training set contains both labelled and unlabelled data. The contribution of these unlabelled data is important given the difficulty of obtaining labelled data. In weakly supervised learning there is a lack of information, such as missing
features or weak labels. For example, a weak label can indicate the set of classes that could possibly be attributed to an instance [6] [7]. In this case, the label is a binary vector yn = [yn1 ... ynI] such that yni = 1 if class i may be associated with instance n, and yni = 0 otherwise. One also speaks of weakly supervised or partially supervised learning when training instances are not expressed in the same feature space or when features are missing; for these issues, methods such as the Dempster-Shafer theory are employed [8]. In this paper we focus on a weakly supervised setting that can be viewed as a generalization of the semi-supervised case: the label indicates the class probabilities of the instance. Thus, a label vector πn = [p(yn = 1) ... p(yn = I)] is associated with the instance xn, where p(yn = i) is the probability that xn belongs to class i. Our interest in weakly supervised learning comes from fisheries acoustics surveys. From echograms acquired with an echosounder mounted on the hull of an oceanographic vessel, images representing fish schools in the water column are obtained (Fig. 1). The goal is to assess the fish school biomass per species in the images. These surveys are carried out once a year and make it possible to analyse and follow fish stocks in a given area. In practice, the assessment is performed by experts who convert the energy of the observed schools into biomass [9]. The backscattering strength differs between species, so experts first classify fish schools according to their species. We aim to develop computer tools that automate the classification step and help experts in their interpretation tasks. The ground truth is obtained by successive trawl catches, associating the catch, which gives a sample of the relative species proportions at the fishing point, with the image obtained at the same point. The training data set is then constituted by fish schools described in the feature space with morphological or backscattering strength parameters, and by the relative species proportions in each image. This global proportion information, which can be viewed as a prior for each instance, provides labels for the fish schools: the label is the probability that the schools in the considered image belong to each species. The issue addressed in [6] is very close to this problem in the sense that the training data is
constituted by images containing instances to be classified, but, instead of class proportions, only the presence or absence of each class in each image is known. Concerning notations, K denotes the number of training images. Image k contains N(k) objects described in the feature space by {xkn}n∈[1...N(k)]. Each image is associated with a prior vector πk whose components πki = p(ykn = i), i ∈ [1...I], indicate the probability that class i is present in image k, with Σi πki = 1. The training data set can then be written as {xkn, πk}k∈[1...K]; n∈[1...N(k)]. In the case of fisheries acoustics surveys, πk is directly given by the trawl catches. The parameters Θ of the classification models are estimated in the training step, and the assignment probability p(ykn = i | xkn, Θ) is computed in the classification step; the class attributed to a test instance corresponds to the largest component of the assignment probability vector. In this paper, two classification models are presented: a Bayesian model and a discriminative model. For the first one, the parameters of a Gaussian mixture model must be estimated, while for the second one the goal is to find a separating hyperplane for each class (one against the others). For the discriminative model, the issue is extended to the non-linear case by considering a non-linear mapping with the aid of the kernel trick. For this model, a bagging technique [1] is used in order to improve the results. We briefly present the Bayesian model in section II and the discriminative model in section III. Before the last section dedicated to results and conclusions, we present the bagging procedure (section IV).
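As an illustration of the notation above (one feature matrix and one proportion vector per training image, plus the final decision rule), here is a minimal Python sketch; the class and function names are hypothetical and not taken from the paper.

```python
import numpy as np
from dataclasses import dataclass

@dataclass
class TrainingImage:
    X: np.ndarray    # (N_k, d): features of the N(k) schools detected in image k
    pi: np.ndarray   # (I,): class proportions pi_k given by the trawl catch, summing to 1

def decide(posterior: np.ndarray) -> int:
    """Classification step: keep the class with the largest assignment probability."""
    return int(np.argmax(posterior))
```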
II. BAYESIAN MODEL

A model similar to the one proposed in [6] is considered. It mainly relies on modelling the data likelihood for each class as a Gaussian mixture such that:

$$p(y_{kn}=i \,|\, x_{kn}, \Theta) \propto \pi_{ki} \sum_{m=1}^{M} \rho_{im}\, \mathcal{N}(x_{kn} \,|\, \mu_{im}, \sigma^2_{im}) \qquad (1)$$
The training consists in estimating the parameters Θ = {ρi1 ... ρiM, µi1 ... µiM, σ²i1 ... σ²iM}. ρim is the proportion of mode m of class i and N(xkn | µim, σ²im) is the normal distribution of mode m of class i with mean µim and variance σ²im. Note that for test images containing objects to be classified, there is no label, which amounts to the uniform prior πki = 1/I whatever i. In the training step, the EM procedure [2] is adapted to the proportion case in order to estimate Θ. Contrary to the common procedure, which considers a single hidden variable indicating the mode, the proportion case requires two hidden variables: ykn and skni, where skni = m indicates that the object xkn is assigned to mode m of the multi-modal distribution of class i. This leads to two nested EM procedures. The overall procedure, similar to the one described in [6], is detailed in [10]. We now sum up the steps that are iterated until convergence:
E step:

$$\tau_{kni} = \frac{\pi_{ki}\, p(x_{kn}\,|\,y_{kn}=i,\Theta)}{\sum_{l} \pi_{kl}\, p(x_{kn}\,|\,y_{kn}=l,\Theta)}$$

M step: another EM procedure is carried out in order to estimate the remaining parameters.

M-E step:

$$\gamma_{knim} = \frac{\rho_{im}\, \mathcal{N}(x_{kn}\,|\,s_{kni}=m,\Theta)}{\sum_{l=1}^{M} \rho_{il}\, p(x_{kn}\,|\,s_{kni}=l,\Theta)}$$

M-M step: the parameters Θ are updated:

$$\rho_{im} = \frac{\sum_{k}\sum_{n}\tau_{kni}\,\gamma_{knim}}{\sum_{k}\sum_{n}\tau_{kni}}, \qquad
\mu_{im} = \frac{\sum_{k}\sum_{n}\tau_{kni}\,\gamma_{knim}\, x_{kn}}{\sum_{k}\sum_{n}\tau_{kni}\,\gamma_{knim}}, \qquad
\sigma^2_{im} = \frac{\sum_{k}\sum_{n}\tau_{kni}\,\gamma_{knim}\,(x_{kn}-\mu_{im})(x_{kn}-\mu_{im})^T}{\sum_{k}\sum_{n}\tau_{kni}\,\gamma_{knim}}$$
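To make these update rules concrete, here is a minimal sketch (not the authors' implementation) of one EM iteration for the proportion-labelled Gaussian mixture, assuming one-dimensional features and scalar variances per mode for readability; the function name and data layout are hypothetical, and the variance update reuses the current means for simplicity.

```python
import numpy as np
from scipy.stats import norm  # 1-D Gaussian densities for readability

def em_iteration(images, rho, mu, sigma2):
    """images: list of (X_k, pi_k) pairs; rho, mu, sigma2: arrays of shape (I, M)."""
    I, M = rho.shape
    num_rho = np.zeros((I, M)); den_rho = np.zeros(I)
    num_mu = np.zeros((I, M)); num_s2 = np.zeros((I, M)); den_mode = np.zeros((I, M))
    for X_k, pi_k in images:
        for x in X_k:
            dens = norm.pdf(x, loc=mu, scale=np.sqrt(sigma2))             # N(x | mu_im, sigma2_im)
            lik = (rho * dens).sum(axis=1)                                # p(x | y_kn = i, Theta)
            tau = pi_k * lik
            tau = tau / tau.sum()                                         # E step: tau_kni
            gamma = rho * dens / (rho * dens).sum(axis=1, keepdims=True)  # M-E step: gamma_knim
            w = tau[:, None] * gamma                                      # tau_kni * gamma_knim
            num_rho += w;      den_rho += tau
            num_mu += w * x;   den_mode += w
            num_s2 += w * (x - mu) ** 2
    # M-M step: weighted updates of the mixture parameters
    return num_rho / den_rho[:, None], num_mu / den_mode, num_s2 / den_mode
```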
Fig. 1. 3-D image obtained from an echo sounder. All objects in the water column appear as fish schools that must be classified according to their species. Classification is feasible because the shape of the schools, their backscattering strength and their position in the water column differ between species.

III. DISCRIMINATIVE MODEL

A. Linear model
Discriminative models are stated as an explicit parameterization of the classification; here they are defined as probabilistic versions of linear discriminative classifiers. As proposed by [6], they can be written as:

$$p(y_{kn}=i \,|\, x_{kn},\Theta) \propto F(\langle \omega_i, x_{kn}\rangle + b_i) \qquad (2)$$
where ⟨ωi, xkn⟩ + bi is the distance to the separating hyperplane defined by ⟨ωi, xkn⟩ + bi = 0 in the feature space. The model parameters Θ are given by {ωi, bi}i∈[1...I]. F is an increasing function, typically an exponential or a continuous stepwise function. Hereafter, F is chosen to be the exponential function:

$$p(y_{kn}=i \,|\, x_{kn},\Theta) = \frac{\exp(\langle \omega_i, x_{kn}\rangle + b_i)}{\sum_{l} \exp(\langle \omega_l, x_{kn}\rangle + b_l)} \qquad (3)$$
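A minimal sketch of the posterior of Eq. (3), i.e. a softmax over the signed distances to the I hyperplanes; the parameter names W and b are hypothetical.

```python
import numpy as np

def discriminative_posterior(x, W, b):
    """W: (I, d) hyperplane normals, b: (I,) offsets; returns p(y = i | x, Theta) for all i."""
    scores = W @ x + b
    scores = scores - scores.max()   # numerical stabilisation of the exponential
    p = np.exp(scores)
    return p / p.sum()
```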
In the training step, the parameters Θ are initialized with a weighted Fisher algorithm [11]. To simplify, the problem is reduced to one class against the others. Considering two groups, the separating hyperplane coefficients are expressed as ωi = (Vi1 + Vi2)⁻¹ (mi1 − mi2), where Vi1 is the variance of class i, Vi2 the variance of the group formed by all the data except class i, mi1 the mean of class i and mi2 the mean of the group formed by all the data except class i. The originality lies in the way mi1, mi2, Vi1 and Vi2 are computed, namely as weighted sums whose weight coefficients are given by the labels:

$$m_{i1} = \frac{\sum_{k=1}^{K}\sum_{n=1}^{N(k)} \pi_{ki}\, x_{kn}}{\sum_{k=1}^{K}\sum_{n=1}^{N(k)} \big(1-\delta(\pi_{ki})\big)}, \qquad
m_{i2} = \frac{\sum_{k=1}^{K}\sum_{n=1}^{N(k)} (1-\pi_{ki})\, x_{kn}}{\sum_{k=1}^{K}\sum_{n=1}^{N(k)} \big(1-\delta(1-\pi_{ki})\big)} \qquad (4)$$

$$V_{i1} = \frac{\sum_{k=1}^{K}\sum_{n=1}^{N(k)} \pi_{ki}\,(x_{kn}-m_{i1})(x_{kn}-m_{i1})^T}{\sum_{k=1}^{K}\sum_{n=1}^{N(k)} \big(1-\delta(\pi_{ki})\big)} \qquad (5)$$

$$V_{i2} = \frac{\sum_{k=1}^{K}\sum_{n=1}^{N(k)} (1-\pi_{ki})\,(x_{kn}-m_{i2})(x_{kn}-m_{i2})^T}{\sum_{k=1}^{K}\sum_{n=1}^{N(k)} \big(1-\delta(1-\pi_{ki})\big)} \qquad (6)$$
where δ(πki) = 0 everywhere, except when πki = 0, in which case δ(πki) = 1. Once the initialization is done, the coefficients are refined by minimizing an error criterion with a standard gradient descent in order to find better coefficients Θ̃:

$$\tilde\Theta = \arg\min_{\Theta} \sum_{k} D\big(\tilde\pi_k(\Theta), \pi_k\big) \qquad (7)$$

where π̃k(Θ) is the vector of estimated priors of the acoustic backscattering strength relative to the different species classes and D is a distance between the observed and estimated priors. Among the different distances between likelihood functions, the Bhattacharyya distance [12] is chosen:

$$D\big(\tilde\pi_k(\Theta), \pi_k\big) = \frac{1}{K}\sum_{k=1}^{K}\sqrt{\tilde\pi_k(\Theta)\cdot\pi_k} \qquad (8)$$
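A minimal sketch of the weighted Fisher initialisation of Eqs. (4)-(6), assuming the same per-image data layout as in the earlier sketches (one pair (X_k, pi_k) per training image, with X_k of shape (N(k), d)); the gradient refinement of Eq. (7) is omitted and the bias term is not handled here.

```python
import numpy as np

def weighted_fisher_init(images, i, d):
    """images: list of (X_k, pi_k); i: class index; d: feature dimension. Returns omega_i."""
    m1 = np.zeros(d); m2 = np.zeros(d); n1 = 0.0; n2 = 0.0
    for X_k, pi_k in images:
        w = pi_k[i]
        m1 += w * X_k.sum(axis=0);        n1 += (w > 0) * len(X_k)   # 1 - delta(pi_ki)
        m2 += (1 - w) * X_k.sum(axis=0);  n2 += (w < 1) * len(X_k)   # 1 - delta(1 - pi_ki)
    m1 /= n1; m2 /= n2
    V1 = np.zeros((d, d)); V2 = np.zeros((d, d))
    for X_k, pi_k in images:
        w = pi_k[i]
        V1 += w * (X_k - m1).T @ (X_k - m1)        # numerator of Eq. (5)
        V2 += (1 - w) * (X_k - m2).T @ (X_k - m2)  # numerator of Eq. (6)
    V1 /= n1; V2 /= n2
    return np.linalg.solve(V1 + V2, m1 - m2)       # omega_i = (V_i1 + V_i2)^{-1} (m_i1 - m_i2)
```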
B. Non linear model

Before calculating the parameters of the linear model, we investigate a non-linear mapping using the kernel trick [3] [10]: the original feature space, in which linear separation is difficult, is mapped into a space where a linear solution is easier to obtain. Here, the Gaussian kernel is chosen.

IV. BAGGING

To improve the classification performance, a bagging technique [1] [13] is associated with the discriminative model in the weakly supervised case. Bagging is an ensemble meta-algorithm that improves classification and regression models in terms of stability and accuracy. The procedure consists in choosing a random part of the training set and generating a classifier from the sampled training instances. This is done J times, leading to J classification results for the considered test object. The decision is typically taken by a vote, but our classifiers give probabilistic results, so the probabilistic outputs are combined by averaging over all the classifiers. If Θj denotes the parameters of the j-th generated classifier and p(ykn = i | xkn, Θj) the probability that the test instance xkn belongs to class i according to this classifier, the final probabilistic classification result is given by:

$$p(y_{kn}=i \,|\, x_{kn},\Theta) = \frac{1}{J}\sum_{j=1}^{J} p(y_{kn}=i \,|\, x_{kn},\Theta_j) \qquad (9)$$
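A minimal sketch of the bagging scheme of Eq. (9): J weakly supervised classifiers are trained on random subsets of the training images and their posteriors are averaged. The callables `train_classifier` and `posterior`, the sampling fraction `frac` and the sampling without replacement are placeholders and assumptions, not the authors' exact choices.

```python
import numpy as np

def bagged_posterior(x, images, train_classifier, posterior, J=100, frac=0.8, seed=None):
    """Average the posteriors of J classifiers trained on random subsets of the images."""
    rng = np.random.default_rng(seed)
    probs = []
    for _ in range(J):
        idx = rng.choice(len(images), size=max(1, int(frac * len(images))), replace=False)
        theta_j = train_classifier([images[k] for k in idx])   # classifier j on the sampled images
        probs.append(posterior(x, theta_j))                    # p(y = i | x, Theta_j)
    return np.mean(probs, axis=0)                              # Eq. (9): average over the J classifiers
```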
V. RESULTS

In order to assess our methods and to simulate the lack of ground truth at the individual object level, training and test images have been randomly generated from a known database, i.e. a base of objects with known associated classes. The performance variability strongly depends on the random choice of objects in the original data set. We must add that the performance of the classification models is closely connected to the complexity of the class mixture in the training images (Fig. 2): the model parameters are easily estimated when one class dominates, but things become more difficult as the class proportions approach equiprobability. This variability is handled by running several experiments: the global procedure (random choice of training instances in the database, simulation of training and test images, learning of the classification parameters, and test) is carried out 100 times, and the final result is the mean correct classification rate over the 100 experiments. The number of classes per training image degrades performance: the correct classification rate decreases when the number of species per image rises. Results are therefore shown as a function of the number of classes per training image. Images containing one, two, three or four classes have been generated. Note that when the training images contain a single class, i.e. the true class is known for every instance, this corresponds to the supervised case. The classification rate is defined as follows: it is the mean classification rate over all the classes.
Fig. 2. Mean correct classification rate as a function of the class proportion complexity in the training images. 24 training images have been generated with simulated data. The mean correct classification rate strongly depends on the chosen mixture: it goes from 80% when one class dominates to 40% when the classes are equiprobable.
Fig. 3. Mean correct classification rate as a function of the number of classes in the training images. Results are shown for the Bayesian model, the discriminative model without bagging and the discriminative model with bagging.
We average the diagonal components of the confusion matrix. This can be viewed as the capability to discriminate between the classes. Another option would be to divide the number of correctly classified instances by the total number of instances, but this can be misleading when one class contains few instances compared to the others. Performances are shown in Fig. 3 for data coming from fisheries acoustics surveys. The dataset comprises fish schools of four species identified by experts. School parameters are extracted automatically, giving a set of 20 descriptors such as length, height, depth, energy, etc. [9] [14]. The performance confirms that the higher the number of species per training image, the worse the results. Concerning robustness, i.e. the ability to keep equivalent performance as the number of classes per training image grows, the discriminative model outperforms the Bayesian model: the discriminative model remains nearly stable from the supervised case to the two-class situation, while the Bayesian model decreases by 15%. This can be explained by the fact that the data overlap and the number of features is high. As for the classification rate, and for the same reasons as previously, the discriminative model outperforms the Bayesian model, by around 20% in the two-class configuration and 3% in the supervised case. The contribution of bagging is shown in Fig. 3, where we compare the performance of the discriminative model with and without bagging. For the bagging, 100 discriminative classifiers have been generated with a 0.8% sample proportion of the training data base. Results show that the mean correct classification rate is improved by 5% with 3 or 4 classes per image and by 2% with two-class images.
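A minimal sketch of the evaluation metric described above: the mean of the per-class correct classification rates, i.e. the average of the diagonal of the row-normalised confusion matrix, which is less sensitive to unbalanced class sizes than the raw accuracy.

```python
import numpy as np

def mean_correct_classification_rate(y_true, y_pred, I):
    """y_true, y_pred: iterables of class indices in [0, I); returns the mean per-class rate."""
    cm = np.zeros((I, I))
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1
    per_class = np.diag(cm) / cm.sum(axis=1)   # correct classification rate for each class
    return per_class.mean()
```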
REFERENCES

[1] L. Breiman, "Bagging predictors," Machine Learning, pp. 123–140, 1996.
[2] A. Dempster, N. Laird, and D. Rubin, "Maximum likelihood from incomplete data via the EM algorithm," Journal of the Royal Statistical Society, Series B, vol. 39(1), pp. 1–38, 1977.
[3] B. Schölkopf and A. Smola, "Learning with kernels," The MIT Press, 2002.
[4] L. Breiman, J. Friedman, R. Olshen, and C. Stone, "Classification and regression trees," Wadsworth & Brooks, 1984.
[5] O. Chapelle, B. Schölkopf, and A. Zien, "Semi-supervised learning," MIT Press, 2006.
[6] C. M. Bishop and I. Ulusoy, "Generative versus discriminative methods for object recognition," IEEE Computer Society Conference on CVPR, vol. 2, pp. 258–265, 2005.
[7] M. Weber, M. Welling, and P. Perona, "Unsupervised learning of models for recognition," ECCV, vol. 1, pp. 18–32, 2000.
[8] E. Côme, L. Oukhellou, T. Denoeux, and P. Aknin, "Learning from partially supervised data using mixture models and belief functions," Pattern Recognition, vol. 42(3), pp. 334–348, 2009.
[9] C. Scalabrin and J. Massé, "Acoustic detection of the spatial and temporal distribution of fish shoals in the Bay of Biscay," Aquatic Living Resources, vol. 6, pp. 269–283, 1993.
[10] R. Fablet, R. Lefort, C. Scalabrin, J. Massé, and J-M. Boucher, "Weakly supervised learning using proportion based information: an application to fisheries acoustic," ICPR, 2008.
[11] R. Fisher, "The use of multiple measurements in taxonomic problems," Annals of Eugenics, pp. 179–188, 1936.
[12] A. Bhattacharyya, "On a measure of divergence between two statistical populations defined by probability distributions," Bull. Calcutta Math. Soc., vol. 35, pp. 99–109, 1943.
[13] S. Kotsiantis and P. Pintelas, "Combining bagging and boosting," International Journal of Computational Intelligence, vol. 1(4), pp. 324–333, 2004.
[14] C. Scalabrin, N. Diner, A. Weill, A. Hillion, and M.-C. Mouchot, "Narrowband acoustic identification of monospecific fish shoals," ICES Journal of Marine Science, vol. 53, no. 2, pp. 181–188, 1996.