Data-Efficient Weakly Supervised Learning for Low-Resource Audio Event Detection Using Deep Learning

Veronica Morfi*, Dan Stowell

arXiv:1807.06972v1 [cs.SD] 17 Jul 2018

Machine Listening Lab, Centre for Digital Music (C4DM), Queen Mary University of London, UK
*[email protected]

Abstract

We propose a method to perform audio event detection under the common constraint that only limited training data are available. In training a deep learning system to perform audio event detection, two practical problems arise. Firstly, most datasets are “weakly labelled”, having only a list of the events present in each recording without any temporal information for training. Secondly, deep neural networks need a very large amount of labelled training data to achieve good-quality performance, yet in practice it is difficult to collect enough samples for most classes of interest. In this paper, we propose a data-efficient training scheme for a stacked convolutional and recurrent neural network. The network is trained in a multi instance learning setting, for which we introduce a new loss function that leads to improved training compared to the usual approaches for weakly supervised learning. We successfully test our approach on a low-resource dataset that lacks temporal labels, for bird vocalisation detection.

1 Introduction

In recent decades, there has been an increasing number of audio datasets with labels assigned to them to indicate the presence or absence of a specific event type; this is related to the tagging of audio recordings [17, 36, 37, 1]. However, in many cases these labels do not contain any information about the temporal location of each event or the number of occurrences in a recording. This type of label, which we will refer to as weak, lacks any temporal information. Collecting and annotating data with strong labels, labels that contain temporal information about the events, to train sound event detectors is a time-consuming task involving a lot of manual labour. On the other hand, collecting weakly labelled data takes much less time, since the annotator only has to mark the active sound event classes and not their exact boundaries.

In comparison to supervised techniques that are trained on strong labels, there has been relatively little work on learning to perform audio event detection using weakly labelled data. In [4, 30, 8] the authors exploit weak labels for birdsong detection and bird species classification, while in [32] singing voice is pinpointed from weakly labelled examples. Furthermore, in [2] the authors train a network that performs automatic scene transcription from weak labels, and in [12] audio from YouTube videos is used to train and compare different previously proposed convolutional neural network architectures for audio event detection and classification. Finally, in [19, 20] the authors use weakly labelled data for audio event detection in order to move from weak labels to strong labels. Most of these methods formulate the provided weak labels of the recordings into a multi instance learning (MIL) problem.

Furthermore, depending on the audio event, only limited data may be available. Some classes of interest may have very rare occurrences, making it hard to collect enough samples for them. We refer to datasets that have this type of rare event and also contain limited amounts of data as low-resource. An important example of this comes from wildlife monitoring, where many species may be only rarely encountered or have vocalisations that are rarely used. So far, monitoring of avian populations has been performed via manual surveying, sometimes even including the help of volunteers due to the challenging scales of the data [13, 14]. Early studies mainly focused on detecting the bird species present in a recording, and small datasets were used in order to properly classify a recording. These datasets were usually noise-free and/or manually segmented and only contained a small number of species. Methods with fewer limitations were introduced in more recent studies [21, 5, 4, 23, 34]. However, these methods are only used for predicting the species present in the recordings and are not sufficient for detecting the exact times of the vocalisations.

Machine learning has experienced strong growth in recent years, due to increased dataset sizes and computational power, and to advances in deep learning methods that can learn to make predictions in extremely nonlinear problem settings [22]. However, a large amount of data is needed in order to train a neural network that can achieve good-quality performance. In the task of birdsong detection and classification, even though reliable automated identification algorithms that perform comparably to an expert observer are still non-existent [6], many authors have proposed methods for bird audio detection [1, 28] and bird species classification, e.g. in the context of the LifeCLEF classification challenges [11, 9, 10] and more [31, 16]. However, more work is needed to address the problem of identifying all species and the exact times of their vocalisations in noisy recordings containing multiple birds. Moreover, these tasks need to be achieved with minimal manual intervention, in particular without manual segmentation of recordings into audio events, using only the weak labels of the recordings.

Furthermore, automated detection and identification of animal vocalisations is a challenging task. There is a high diversity of animal vocalisations, both in the types of the basic syllables and in the way they are combined [33, 18]. Also, there is noise present in most habitats, and many bird communities contain multiple bird species that can potentially have overlapping vocalisations [25, 26, 27]. Finally, vocalisations in reference databases (e.g. Xeno-Canto, https://www.xeno-canto.org/) are typically based on targeted recordings, hence lack both the biological and technical variation present in the field data to be classified. Thus, the datasets used for training models that perform birdsong transcription are usually low-resource, with only weak labels provided, making this an ideal setting to test our method on.

In this paper, we propose a network that uses a low-resource birdsong dataset in an efficient way in order to perform audio event detection using only the weak labels provided for each recording during training. This network, a stacked convolutional and recurrent neural network (cf. [2]), is trained in a MIL setting for which we propose a new loss function that outperforms the most commonly used losses when trying to derive strong labels from weakly labelled data. We successfully train an acoustic event detector using weakly labelled data to predict the temporal location of occurrences of events in the test data.

The rest of the paper is structured as follows: Section 2 describes the multi instance learning setting, and Section 3 presents our method. The evaluation follows in Section 4, with the discussion and conclusions in Section 5.

2 Multi Instance Learning

The concept of multi instance learning (MIL) was first properly developed in [7] for drug activity detection. MIL is described in terms of bags, with a bag being a collection of instances. The existing labels are attached to the bags, rather than the individual instances within them. Positive bags have at least one positive instance, an instance for which the target class is active. On the other hand, negative bags contain only negative instances, i.e. the target class is not active in them. A negative bag is thus pure while a positive bag is presumably impure, since the latter most likely contains both positive and negative instances. Hence, all instances in a negative bag can be uniquely assigned a negative label, but for a positive bag this cannot be done: there is no direct knowledge of whether an instance in a positive bag is positive or negative. Thus, it is the bag-label pairs, and not the instance-label pairs, which form the training data, and from which a classifier that classifies individual instances must be learned.

Let the training data be composed of $N$ bags, i.e. $\{B_1, B_2, \ldots, B_N\}$, where the $i$-th bag is composed of $M_i$ instances, i.e. $\{B_{i1}, B_{i2}, \ldots, B_{iM_i}\}$, and each instance is a $p$-dimensional feature vector, e.g. the $j$-th instance of the $i$-th bag is $[B_{ij1}, B_{ij2}, \ldots, B_{ijp}]^T$. We represent the bag-label pairs as $(B_i, Y_i)$, where $Y_i \in \{0, 1\}$ is the bag label for bag $B_i$; $Y_i = 0$ denotes a negative bag and $Y_i = 1$ denotes a positive bag.

One naïve but commonly used way of inferring the individual instances' labels from the bag labels is assigning the bag label to each instance of that bag: we refer to this method as false strong labelling. In the MIL setting, when there is a positive bag $B_i$, all of its instances are assigned a positive label, i.e. $y_{ij} = 1$, where $y_{ij}$ is the label of the $j$-th instance of bag $B_i$. On the other hand, if bag $B_i$ is a negative bag, all of its instances are assigned a negative label, i.e. $y_{ij} = 0$. During training, a neural network in the MIL setting with false strong labels tries to minimise the average divergence between the network output for each instance and the false strong labels assigned to them, identically to an ordinary supervised learning scenario.

It is evident that the false strong labelling approach is an approximation of the loss for a strong label prediction task, hence it has some disadvantages. When using false strong labels, some kind of early stopping is necessary, since achieving perfect accuracy would mean predicting every instance of a positive bag as positive. However, there is no clear way of defining a specific point for early stopping; this is an issue that all methods in the MIL setting face. As mentioned before, a positive bag might include both positive and negative instances, yet false strong labels will force the network towards positive predictions for both. Additionally, when using false strong labels there is an imbalance of positive and negative instance labels compared to the true strong labels, since a substantial amount of negative instances are considered positive during training. Finally, a negative instance may appear in both a negative and a positive recording; due to the false labelling of negative instances as positive in positive bags, the network may not learn the proper prediction for this kind of instance.

As an alternative to false strong labels, one can attempt to infer the labels of individual instances in bag $B_i$ by making a few educated assumptions. The most common ones are: if $Y_i = 0$, all instances of bag $B_i$ are negative, hence $y_{ij} = 0, \forall j$; if $Y_i = 1$, at least one instance label of bag $B_i$ is equal to one. For all instances of bag $B_i$, this relation between the bag label and the instance labels can be written simply as:

$$ Y_i = \max_j y_{ij} \qquad (1) $$
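As an illustration (a toy example, not taken from any dataset used here), the two labelling views can be expressed in a few lines of NumPy:

```python
import numpy as np

# Hypothetical instance labels y_ij for one bag of six instances.
y_instances = np.array([0, 0, 1, 0, 1, 0])

# Equation (1): the bag label is the maximum over the instance labels.
Y_bag = y_instances.max()                      # -> 1, a positive bag

# False strong labelling: the weak (bag) label is copied to every instance,
# so all six instances, including the four truly negative ones, become positive.
false_strong = np.full_like(y_instances, Y_bag)
```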

The conventional way of training a neural network for strong labelling is to provide instance-specific (strong) labels for a collection of training instances. Training is performed by updating the network weights to minimise the average divergence between the network output in response to these instances and the desired output, the ground truth of the training instances. In the MIL setting, using equation (1) to define a characteristic of the strong labels, we must modify the manner in which the divergence to be minimised is computed so that it uses only weak labels, as proposed in [38]. Let $o_{ij}$ represent the output of the network for input $B_{ij}$, the $j$-th instance in $B_i$, the $i$-th bag of training instances. We define the bag-level divergence for bag $B_i$ as:

$$ E_i = \frac{1}{2} \Big( \max_{1 \le j \le M_i} (o_{ij}) - Y_i \Big)^2 \qquad (2) $$

where $Y_i$ is the label assigned to bag $B_i$. The overall divergence on the training set is obtained by summing the divergences of all the bags in the set:

$$ E = \sum_{i=1}^{N} E_i \qquad (3) $$
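As an illustration, the bag-level objective of equations (2) and (3) can be sketched as follows (one possible implementation, assuming the instance outputs of each bag are stacked into a matrix):

```python
import numpy as np

def bag_level_divergence(bag_labels, instance_outputs):
    """Equations (2)-(3): E_i = 0.5 * (max_j o_ij - Y_i)^2, summed over all bags.

    bag_labels:       shape (N,), the bag labels Y_i in {0, 1}
    instance_outputs: shape (N, M), the network outputs o_ij per instance
    """
    bag_max = instance_outputs.max(axis=1)        # max_j o_ij for each bag
    per_bag = 0.5 * (bag_max - bag_labels) ** 2   # E_i
    return per_bag.sum()                          # E
```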

Equation (2) indicates that if at least one instance of a positive bag is perfectly predicted as positive, or all the instances of a negative bag are perfectly predicted as negative, then the error on the concerned bag is zero and the weights of the network will not be updated. Otherwise, the weights will be updated according to the error on the instance whose actual output is maximal among all the instances in the bag. Note that such an instance is typically the easiest to predict as positive for a positive bag, while it is the most difficult to predict as negative for a negative bag. This may seem to set a low burden on producing a positive output but a strong burden on producing a negative output. However, as indicated in [3], the value of a bag is fully determined by its instance with the maximal output, regardless of how many truly positive or negative instances are in the bag; therefore, the burden on producing a positive or negative output is in fact not unbalanced, at least at the bag level.

However, at the instance level, when using max to compute the loss, only one instance per bag contributes to the gradient, which may lead to inefficient training. Additionally, as mentioned earlier, in positive bags the network only has to accurately predict the label of the easiest positive instance to reach perfect accuracy, and thus does not pay as much attention to the rest of the positive instances, which might be harder to detect accurately.

In the present work, we perform audio event detection by training a neural network using weakly labelled data in a MIL setting. However, instead of using either of the two most common approaches in MIL, we introduce a new loss function at the bag level that does not have the disadvantages of the above methods and leads to improved training, as explained in more detail in Section 3.

3 Method

In our approach, in order to perform audio event detection we use a neural network with stacked convolutional and recurrent layers that can capture both the dependencies of the events across frequency and the temporal structure. We calculate the audio features used for training every 23 milliseconds by using overlapping Hamming windows on 5-second input recordings. This results in T = 432 frames of features. The neural network maps these features into strong labels, meaning it can predict the transcription of the recordings. The network prediction is a continuous value in the range [0, 1] for each frame of the input features. A prediction value equal to or greater than 0.5 denotes the presence of the audio event in question, while a prediction less than 0.5 signifies its absence. The most important part of our approach is the loss function, which makes it possible to learn strong labels from weakly labelled training data, and so we turn to this first.

3.1 Loss Function

We compute the loss for our approach in a multi instance learning (MIL) setting. MIL uses bags of instances and their labels in order to train a network at the bag level and produce predictions at the instance level. In our case a bag is a recording and the bag label is the weak label of this recording. A bag is positive if the audio event in question is present in the bag and negative otherwise. As mentioned in Section 2, the two usual approaches to weak-to-strong training in a MIL setting are either using false strong labelling, or training on the max prediction of the instances of a bag. However, both of these approaches have disadvantages (see Section 2), specific examples of which we will present in more detail in Section 4. In this paper, we propose an alternative way of training a network in a MIL setting that does not face the same limitations as the above methods.

Firstly, we want all predictions to weigh in on the backpropagated gradient, and not just the one with the maximum value, as is the case with MIL using max. In [24], the authors proposed the “noisy-or” pooling function to be used instead of max. However, noisy-or has been shown not to perform as well as max for audio event detection [35]. As discussed in [35], a significant problem with noisy-or is that the label of a bag is computed via the product of the instance labels, as seen in equation (4). This calculation relies heavily on the assumed conditional independence of instance labels, an assumption which is highly inaccurate in audio event detection. Furthermore, this can lead the system to believe a bag is positive even though all its instances are negative.

$$ Y_i = 1 - \prod_{1 \le j \le M_i} (1 - y_{ij}) \qquad (4) $$
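For reference, the noisy-or bag probability of equation (4) amounts to the following pooling over the instance predictions (an illustrative sketch, not the implementation of [24] or [35]):

```python
import numpy as np

def noisy_or_pooling(instance_probs):
    """Equation (4): bag-level probability under noisy-or pooling.

    instance_probs: shape (N, M), per-instance probabilities y_ij in [0, 1].
    Returns one bag-level probability per bag, shape (N,).
    """
    return 1.0 - np.prod(1.0 - instance_probs, axis=1)
```

Note how, with many instances, even small per-instance probabilities drive the product towards zero and the bag probability towards one, which is exactly the failure mode described above.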

Using all instances in a bag for the computation of the error and the backpropagated gradient is important, since the network ideally should acquire some knowledge from every instance in every epoch. However, it is hard to find an elegant theoretical interpretation of the characteristics of the instances in a bag. On the other hand, simpler assumptions about these characteristics can achieve a similar effect. One approach is to consider the mean of the instance predictions of a bag. If a bag is negative the mean should be zero, while if it is positive it should be greater than zero. The true mean is unknown in weakly labelled data. A naïve assumption is to presume that a specific event will be present in a recording approximately half of the time. Even though this is not true all of the time, it takes into consideration the predictions for all instances, and also introduces a bias into the loss that will keep producing gradient for training even after the max term has reached its perfect accuracy. However, this is indeed a naïve assumption that will guide the network to predict a balanced amount of positives and negatives, which may make it more sensitive to all kinds of audio events, even when they are not the ones in question.

Another simple yet accurate assumption is that on both negative and positive recordings the minimum prediction at the instance level should be zero. It is possible for a positive recording to have no negative frames, but this is extremely rare in practice. This assumption can be used in synergy with max and mean to enforce the prediction of negative instances even on positive recordings and to limit the bias that is introduced by considering the mean in the computation of the loss.


We train a network with a loss function that takes into account all the above-mentioned assumptions: we compute the max, mean and min of the predictions of a recording and, depending on whether the recording is positive or negative, measure their divergence from different targets. The proposed loss function is computed as:

$$ \mathrm{Loss} = \frac{1}{3} \Big( \mathrm{bin\_cr}\big(\max_j(o_{ij}), Y_i\big) + \mathrm{bin\_cr}\big(\mathrm{mean}_j(o_{ij}), \tfrac{Y_i}{2}\big) + \mathrm{bin\_cr}\big(\min_j(o_{ij}), 0\big) \Big) \qquad (5) $$

where $\mathrm{bin\_cr}(x, y)$ is a function that computes the binary cross-entropy between $x$ and $y$, $o_{ij}$ are all the predicted strong labels of bag $B_i$, with $j = 1, \ldots, M_i$ and $M_i$ the total number of instances in the bag, and $Y_i$ is the label of the bag. We refer to this as the MIL setting using MMM.

For negative recordings, equation (5) will compute the binary cross-entropy between the max, mean and min of the predictions of the instances of a bag $B_i$ and zero. This denotes that the predictions for all instances of a negative recording should be zero. On the other hand, for positive recordings the predictions should span the full dynamic range from zero to one, biased towards a similar amount of positive and negative instances. Our proposed loss function is designed to balance the positive and negative predictions in a bag, resulting in a network that retains the flexibility to learn from harder-to-predict positive instances even after many epochs. This is due to the fact that there are no obvious local minima to get stuck in, as in the max case. In our experiments all three terms of the loss function have the same weight, namely $\frac{1}{3}$; however, this may be changed after more experiments are conducted.
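A minimal sketch of how the MMM loss of equation (5) could be implemented as a frame-wise Keras loss is given below; the framework (TensorFlow/Keras) and the tensor shapes are assumptions, since the implementation is not specified here.

```python
import tensorflow as tf

def mmm_loss(y_true, y_pred):
    """MMM loss (equation (5)) computed per recording (bag).

    y_true: (batch, T, 1) weak label replicated over the T frames,
            so the bag label Y_i is recovered as a per-recording max.
    y_pred: (batch, T, 1) frame-wise sigmoid outputs o_ij.
    """
    Y = tf.reduce_max(y_true, axis=1)          # bag label Y_i, shape (batch, 1)
    pred_max = tf.reduce_max(y_pred, axis=1)
    pred_mean = tf.reduce_mean(y_pred, axis=1)
    pred_min = tf.reduce_min(y_pred, axis=1)

    bce = tf.keras.losses.binary_crossentropy
    return (bce(Y, pred_max)
            + bce(Y / 2.0, pred_mean)
            + bce(tf.zeros_like(Y), pred_min)) / 3.0
```

For a negative bag all three targets are zero, while for a positive bag the max is pushed towards one, the mean towards one half and the min towards zero, as described above.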

3.2 Features

As input to our proposed method, log mel-band energy is extracted from the audio in 23 ms Hamming windows with 50% overlap; the librosa Python library (https://librosa.github.io/librosa/index.html) is used for this. In total, 40 mel bands are used in the 0–44100 Hz range. For a given 5-second audio input, the feature extraction produces a 432x40 output (T = 432).
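A minimal sketch of this feature extraction using librosa is shown below; the FFT size (23 ms at 44.1 kHz is roughly 1024 samples) and the padding of shorter clips to 5 seconds are assumptions, since only the window length, overlap and number of mel bands are stated above.

```python
import librosa

def extract_log_mel(path, sr=44100, n_fft=1024, hop_length=512, n_mels=40):
    """Log mel-band energies: ~23 ms Hamming windows with 50% overlap, 40 bands."""
    y, sr = librosa.load(path, sr=sr)
    y = librosa.util.fix_length(y, size=5 * sr)          # pad/trim to 5 seconds
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=n_fft,
                                         hop_length=hop_length,
                                         n_mels=n_mels, window='hamming')
    return librosa.power_to_db(mel).T                    # shape roughly (T, 40)
```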

3.3 Neural Network Architecture

We use a stacked convolutional and recurrent neural network architecture (cf. [2]) in order to predict the strong labels of a recording. Figure 1 depicts the overall block diagram of the proposed method. The log mel-band energy features extracted from the audio are fed to a stacked convolutional and recurrent neural network, which sequentially produces the predicted strong labels for each recording.

Figure 1: Audio event detection network architecture

The input to the proposed network is a T x 40 feature matrix such as the example shown in Figure 1. The convolutional layers at the beginning of the network are in charge of learning the local shift-invariant features of this input. We use a 3x3 receptive field and set the padding arguments to ‘same’ in order to maintain the same size as the input in all our convolutional layers. A max-pooling operation is performed along the frequency axis after every convolutional layer to reduce the dimension of the feature matrix while preserving the number of frames T. The output of the convolutional part of the network is then fed to bi-directional gated recurrent units (GRUs) with tanh activation to learn the temporal structure of the audio events. Next, we apply time-distributed dense layers to reduce the feature-length dimensionality. Note that the time resolution of T frames is maintained in both the GRU and dense layers. A sigmoid activation is used in the last time-distributed dense layer to produce a binary prediction of whether there is an event present in each time frame. This prediction layer outputs the strong labels for a recording; the dimensions of each prediction are T x 1. Finally, we calculate the loss on this output as explained in Section 3.1.
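One possible Keras instantiation of this architecture is sketched below; the filter counts, number of convolutional blocks, pooling sizes and layer widths are assumptions, since the description above fixes only the 3x3 receptive fields, the ‘same’ padding, the frequency-only max-pooling, the bidirectional GRUs with tanh activation, the time-distributed dense layers and the sigmoid output.

```python
from tensorflow.keras import layers, models

T, F = 432, 40                              # frames x mel bands

inputs = layers.Input(shape=(T, F, 1))
x = inputs
for pool in (5, 4, 2):                      # pool only along the frequency axis
    x = layers.Conv2D(64, (3, 3), padding='same', activation='relu')(x)
    x = layers.BatchNormalization()(x)
    x = layers.MaxPooling2D(pool_size=(1, pool))(x)

x = layers.Reshape((T, -1))(x)              # keep all T frames, merge freq x channels
x = layers.Bidirectional(layers.GRU(64, activation='tanh', return_sequences=True))(x)
x = layers.TimeDistributed(layers.Dense(32, activation='relu'))(x)
outputs = layers.TimeDistributed(layers.Dense(1, activation='sigmoid'))(x)   # T x 1

model = models.Model(inputs, outputs)
```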

4 Evaluation

4.1 Dataset and Network Settings

In order to test our approach on a low-resource dataset we use the training dataset provided for the Neural Information Processing Scaled for Bioacoustics (NIPS4B) bird song competition of 2013 (http://sabiod.univ-tln.fr/nips4b/challenge1.html), which contains 687 recordings with a maximum length of 5 seconds each. The recordings in the NIPS4B dataset have already been weakly labelled. There is a total of 87 possible events (labels), with each being active in only 7 to 20 recordings. Such a dataset can be considered low-resource for a few reasons. First, the total amount of training audio is less than one hour. Also, there are 87 possible labels that have very sparse activations, with 7 to 20 positive recordings each. In order to use the data provided by the NIPS4B 2013 training dataset efficiently, we first consider all 87 unique labels as one general label, ‘bird’, and train an audio event detection network for this audio event.

Another limitation of this dataset is the imbalance of positive and negative recordings: out of the whole dataset (687 recordings) only 100 are labelled as negative. This caused some of our original experiments to classify almost all bags as positive. An easy way to solve the positive-negative imbalance is to force the network to have the same amount of positive and negative recordings in each mini-batch. Since the number of negative recordings is much smaller than that of positive ones, we randomly repeat the negative recordings in each mini-batch. We refer to this as Half and Half (HnH) training. This provides a training set that is balanced at the bag level, but not necessarily at the instance level.

For the following experiments, we split the NIPS4B 2013 training dataset into a training set and a testing set. Only the weak labels of the training set of the NIPS4B 2013 bird song competition were released, hence we could only use these recordings. We acquired the strong labels of most of these recordings via manual annotations, in order to use them for evaluating our system, and have uploaded our transcriptions online (https://figshare.com/articles/Transcriptions_of_NIPS4B_2013_Bird_Challenge_Training_Dataset/6798548). For our training set the first 500 recordings of the NIPS4B 2013 training dataset are used, while the rest are included in our testing set, excluding 14 recordings for which confident strong annotation could not be attained; those 14 recordings were added to our training set.

Finally, the following fine-tuning and settings were chosen for the network: batch normalization after each convolutional layer, L2 kernel regularization of 0.01 in each layer, HnH training, and a reducing learning rate schedule starting from 1e-5, dropping to 1e-6 after 30 epochs and to 1e-7 after 60 epochs until the end of training, which for the examples presented in this paper was at 1400 epochs.
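A minimal sketch of HnH mini-batch construction is shown below; the batch size and the use of a Python generator are illustrative assumptions.

```python
import numpy as np

def hnh_batches(features, bag_labels, batch_size=16, seed=0):
    """Yield mini-batches with equal numbers of positive and negative bags,
    re-drawing the scarce negative recordings with replacement."""
    rng = np.random.default_rng(seed)
    pos_idx = np.flatnonzero(bag_labels == 1)
    neg_idx = np.flatnonzero(bag_labels == 0)
    half = batch_size // 2
    while True:
        pos = rng.choice(pos_idx, size=half, replace=False)
        neg = rng.choice(neg_idx, size=half, replace=True)   # repeated negatives
        idx = rng.permutation(np.concatenate([pos, neg]))
        yield features[idx], bag_labels[idx]
```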

4.2 False Strong Labelling

As mentioned in Section 2, one of the most common ways of training networks from weak to strong labels is to generate false strong labels for training by replicating the weak labels for every time frame of the audio and using them as strong labels. To demonstrate the problems with this approach, before evaluating the improved MIL loss functions we show illustrative examples of predictions produced by a network trained with false strong labelling.

We train our network using these false strong labels for 3000 epochs, using binary cross-entropy as the loss function and the Adam optimizer [15]. Figure 2 shows some of the resulting transcriptions predicted by the network. The main issue one can notice is that using false strong labels tends to push all the results closer to one. Some structure is apparent in the transcriptions, so the network is indeed able to differentiate between positive and negative instances to some degree; however, all results for positive recordings are above the 0.5 threshold. We attribute this primarily to the nature of the false strong labels: for a positive recording all time frames are labelled as positive. Furthermore, since perfect accuracy in this setting does not correspond to the actual task, it becomes extremely unclear when training should be stopped.

Figure 2: Mel spectrograms used as input for testing and their transcription predictions after 3000 epochs of training the network when the loss function uses false strong labels.

4.3 MIL using different loss functions

We train four different networks using the loss functions described in Section 3 in order to compare their performance and the impact each term has on the predictions of our audio event detector.
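The comparison uses F1-score, precision, recall and ROC AUC computed against the manual strong annotations of the test recordings; an illustrative way to compute such frame-level metrics with scikit-learn (an assumption, since the evaluation tooling is not specified here) is:

```python
import numpy as np
from sklearn.metrics import f1_score, precision_score, recall_score, roc_auc_score

def frame_level_metrics(y_true, y_score, threshold=0.5):
    """y_true: binary frame labels; y_score: network outputs in [0, 1]."""
    y_pred = (np.asarray(y_score) >= threshold).astype(int)
    return {
        'F1-score': f1_score(y_true, y_pred),
        'Precision': precision_score(y_true, y_pred),
        'Recall': recall_score(y_true, y_pred),
        'ROC AUC': roc_auc_score(y_true, y_score),
    }
```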

Method      F1-score   Precision   Recall   ROC AUC
MMM         74.4%      76.8%       72.2%    87.6%
Max         73.0%      86.5%       63.3%    88.2%
Max Mean    73.6%      76.3%       71.1%    88.6%
Max Min     70.7%      89.2%       58.5%    87.7%

Table 1: Comparison of metrics computed for four different loss functions at each method's best F1-score value.

Table 1 presents the metrics [29] computed for networks using four different loss functions at their best training epoch, selected according to F1-score. Method Max is trained with the conventional MIL loss, using the max of the instance predictions to predict the label of a bag. The MMM method is trained with our proposed loss (equation (5)), which takes into consideration the max, mean and min of the instance predictions of a bag. We also test the performance of our audio event detector when trained with a loss derived from a combination of the different terms we introduce in the MMM loss. Hence, we train method Max Mean with $\mathrm{Loss} = \frac{1}{2}(\mathrm{bin\_cr}(\max_j(o_{ij}), Y_i) + \mathrm{bin\_cr}(\mathrm{mean}_j(o_{ij}), \frac{Y_i}{2}))$ and Max Min with $\mathrm{Loss} = \frac{1}{2}(\mathrm{bin\_cr}(\max_j(o_{ij}), Y_i) + \mathrm{bin\_cr}(\min_j(o_{ij}), 0))$. This was done in order to examine the influence each term of our proposed loss has on the results.

We notice that the methods that have the mean term in their loss function (MMM and Max Mean) tend to have a lower precision rate. As we discussed in Section 3, the mean term imposes a bias on the loss that might result in more false positive predictions. It is possible that decreasing the weight of the mean term in the loss function would reduce this bias accordingly; however, more experiments are needed in order to determine that. By comparing the results of MMM and Max Mean, one can notice that when the min term is added (MMM versus Max Mean) the influence of the mean is reduced while the recall, i.e. the proportion of true positives detected, is increased.

By studying the progress of some of these metrics (F1-score and ROC AUC) over the epochs of training, one can notice that the longer we train the methods whose loss includes the mean term, the worse the overall results become. This brings up another issue of the MIL setting, which is defining when one should stop training. In Figure 3, one can notice that after a certain number of epochs both the F1-score and the ROC AUC start to decrease or plateau. It is also interesting that the F1-score results do not correspond exactly with the ROC AUC results. This is due to the fact that ROC AUC only considers whether a randomly chosen positive instance has a higher predicted value than a randomly chosen negative one, without taking into account whether either of these instances was correctly classified. On the other hand, F1-score is computed from recall and precision, which take into account the actual value predicted for each instance. Hence, the F1-score is the more appropriate measure for this task, since it evaluates whether the system outputs are well calibrated for the detection of audio events.

Another interesting aspect to study is a comparison of the individual results of the conventional max loss and our proposed MMM loss. Figures 4 and 5 depict a couple of positive recordings from our testing set and the transcriptions predicted by both methods. It is evident from these examples, particularly from Figure 4, that once the max loss reaches perfect accuracy for a bag it ignores the harder-to-predict positive events. In this example, the network correctly predicts the three most prominent events and then ignores all other events between them. The network trained with the MMM loss, on the other hand, is starting to pick out some of the harder-to-detect events, due to the gradient provided by the mean term. However, due to the nature of the mean, some of the negative instances may end up being classified as positive. As seen in Figure 5b, the network has predicted values around 0.2 for some negative instances between 2.5 and 3.5 seconds; after more training, some of these instances may end up being wrongly predicted as positive. As mentioned above, it is possible that decreasing the weight of the mean term in our loss function would decrease the false positive rate. Overall, as depicted in Figure 3, our network trained with the MMM loss function attains the strongest results, although determining a criterion for stopping training remains an open problem in the weak learning scenario.

Figure 3: Progress of metrics for our testing set through epochs

Figure 4: Predicted transcription, after 800 epochs of training, of recording nips4b_birds_trainfile623 of our testing set. (a) depicts the results of our network trained with the max loss; (b) depicts the results of our network trained with the MMM loss.

Figure 5: Predicted transcription, after 800 epochs of training, of recording nips4b_birds_trainfile667 of our testing set. (a) depicts the results of our network trained with the max loss; (b) depicts the results of our network trained with the MMM loss.

5 Conclusions

In this paper, we propose a method to perform audio event detection in a MIL setting, introducing a new loss function that takes into account all the instance predictions. Our method is tested on a low-resource dataset with only weakly labelled training data, performing bird vocalisation detection. We compare the different components of our loss function and characterise the influence each of them has on the training of the network. Depending on the task, different combinations of our loss function components might present a more suitable solution than the previously used false strong labelling or max approaches in the MIL setting. Additionally, our method is not tailored to any specific type of audio event, hence there is reason to believe it can be used for any kind of audio event detection task. Finally, further experiments are needed to define the best way to weight the three components of our loss function and to determine an appropriate early-stopping criterion.

References

[1] S. Adavanne, K. Drossos, E. Çakir, and T. Virtanen. Stacked convolutional and recurrent neural networks for bird audio detection. In 2017 25th European Signal Processing Conference (EUSIPCO), pages 1729–1733, Aug 2017.
[2] S. Adavanne and T. Virtanen. Sound event detection using weakly labeled dataset with stacked convolutional and recurrent neural network. 2017.
[3] R. Amar, D. R. Dooly, S. A. Goldman, and Q. Zhang. Multiple-instance learning of real-valued data. In Proceedings of the Eighteenth International Conference on Machine Learning, ICML '01, pages 3–10, San Francisco, CA, USA, 2001. Morgan Kaufmann Publishers Inc.
[4] F. Briggs, B. Lakshminarayanan, L. Neal, X. Fern, R. Raich, S. J. K. Hadley, A. S. Hadley, and M. G. Betts. Acoustic classification of multiple simultaneous bird species: A multi-instance multi-label approach. Journal of the Acoustical Society of America, 131:4640–4650, 2014.
[5] T. Damoulas, S. Henry, A. Farnsworth, M. Lanzone, and C. P. Gomes. Bayesian classification of flight calls with a novel dynamic time warping kernel. In ICMLA, pages 424–429. IEEE Computer Society, 2010.
[6] U. M. de Camargo, P. Somervuo, and O. Ovaskainen. Protax-sound: A probabilistic framework for automated animal sound identification. PLOS ONE, 12(9):1–15, 09 2017.
[7] T. G. Dietterich, R. H. Lathrop, and T. Lozano-Pérez. Solving the multiple instance problem with axis-parallel rectangles. Artificial Intelligence, 89(1):31–71, 1997.
[8] L. Fanioudakis and I. Potamitis. Deep networks tag the location of bird vocalisations on audio spectrograms. arXiv:1711.04347, 2017.
[9] H. Goëau, H. Glotin, W.-P. Vellinga, R. Planqué, and A. Joly. LifeCLEF Bird Identification Task 2016: The arrival of Deep Learning. 2016.
[10] H. Goëau, H. Glotin, W.-P. Vellinga, R. Planqué, and A. Joly. LifeCLEF Bird Identification Task 2017. 2017.
[11] H. Goëau, H. Glotin, W.-P. Vellinga, R. Planqué, A. Rauber, and A. Joly. LifeCLEF Bird Identification Task 2015. 2015.
[12] S. Hershey, S. Chaudhuri, D. P. W. Ellis, J. F. Gemmeke, A. Jansen, R. C. Moore, M. Plakal, D. Platt, R. A. Saurous, B. Seybold, M. Slaney, R. J. Weiss, and K. Wilson. CNN architectures for large-scale audio classification. In 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 131–135, March 2017.
[13] A. Johnston, S. Newson, K. Risely, A. J. Musgrove, D. Massimino, S. Baillie, and J. W. Pearce-Higgins. Species traits explain variation in detectability of UK birds. Bird Study, 61:340–350, 07 2014.
[14] J. Kamp, S. Oppel, H. Heldbjerg, T. Nyegaard, and P. F. Donald. Unstructured citizen science data fail to detect long-term population declines of common birds in Denmark. Diversity and Distributions, 22:1024–1035, 07 2016.
[15] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. 2014.
[16] E. Knight, K. C. Hannah, G. J. Foley, C. D. Scott, R. Brigham, and E. Bayne. Recommendations for acoustic recognizer performance assessment with application to five common automated signal recognition programs. Avian Conservation and Ecology, 12, 11 2017.
[17] Q. Kong, Y. Xu, W. Wang, and M. D. Plumbley. A joint detection-classification model for audio tagging of weakly labelled data. In 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 641–645, March 2017.
[18] D. Kroodsma. The Singing Life of Birds: The Art and Science of Listening to Birdsong. Houghton Mifflin, 2005.
[19] A. Kumar and B. Raj. Audio event detection using weakly labeled data. In Proceedings of the 2016 ACM on Multimedia Conference, MM '16, pages 1038–1047, New York, NY, USA, 2016. ACM.
[20] A. Kumar and B. Raj. Deep CNN framework for audio event recognition using weakly labeled web data. 2017.
[21] B. Lakshminarayanan, R. Raich, and X. Fern. A syllable-level probabilistic framework for bird species identification. In ICMLA. IEEE, 2009.
[22] Y. LeCun, Y. Bengio, and G. Hinton. Deep learning. Nature, 521(7553):436–444, 5 2015.
[23] C.-H. Lee, C.-C. Han, and C.-C. Chuang. Automatic classification of bird species from their sounds using two-dimensional cepstral coefficients. Trans. Audio, Speech and Lang. Proc., 16(8):1541–1550, Nov. 2008.
[24] D. Liu, Y. Zhou, X. Sun, Z. Zha, and W. Zeng. Adaptive pooling in multi-instance learning for web video annotation. In 2017 IEEE International Conference on Computer Vision Workshops (ICCVW), pages 318–327, Oct 2017.
[25] D. Luther. Signaller: receiver coordination and the timing of communication in Amazonian birds. Biology Letters, 4:651–654, 2008.
[26] D. Luther and R. Wiley. Production and perception of communicatory signals in a noisy environment. Biology Letters, 5:183–187, 2009.
[27] K. Pacifici, T. R. Simons, and K. H. Pollock. Effects of vegetation and background noise on the detection process in auditory avian point-count surveys. The Auk, 125(4):998–998, 2008.
[28] T. Pellegrini. Densely connected CNNs for bird audio detection. In 2017 25th European Signal Processing Conference (EUSIPCO), pages 1734–1738, Aug 2017.
[29] C. J. V. Rijsbergen. Information Retrieval. Butterworth-Heinemann, Newton, MA, USA, 2nd edition, 1979.
[30] J. F. Ruiz-Muñoz, M. Orozco-Alzate, and G. Castellanos-Dominguez. Multiple instance learning-based birdsong classification using unsupervised recording segmentation. In Proceedings of the 24th International Conference on Artificial Intelligence, IJCAI'15, pages 2632–2638. AAAI Press, 2015.
[31] J. Salamon and J. P. Bello. Deep convolutional neural networks and data augmentation for environmental sound classification. IEEE Signal Processing Letters, 24:279–283, 2017.
[32] J. Schlüter. Learning to pinpoint singing voice from weakly labeled examples. In Proceedings of the 17th International Society for Music Information Retrieval Conference (ISMIR 2016), New York, USA, 2016.
[33] T. Scott Brandes. Automated sound recording and analysis techniques for bird surveys and conservation. Bird Conservation International, 18:S163–S173, 09 2008.
[34] D. Stowell and M. D. Plumbley. Automatic large-scale classification of bird sounds is strongly improved by unsupervised feature learning. PeerJ, 2:e488, 2014.
[35] Y. Wang, J. Li, and F. Metze. Comparing the max and noisy-or pooling functions in multiple instance learning for weakly supervised sequence learning tasks. 2018.
[36] Y. Xu, Q. Huang, W. Wang, P. Foster, S. Sigtia, P. J. B. Jackson, and M. D. Plumbley. Unsupervised feature learning based on deep models for environmental audio tagging. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 25(6):1230–1241, June 2017.
[37] Y. Xu, Q. Kong, Q. Huang, W. Wang, and M. D. Plumbley. Convolutional gated recurrent neural network incorporating spatial features for audio tagging. In 2017 International Joint Conference on Neural Networks (IJCNN), pages 3461–3466, 2017.
[38] Z.-H. Zhou and M.-L. Zhang. Neural networks for multi-instance learning. In Proceedings of the International Conference on Intelligent Information Technology, Beijing, China, pages 455–459, 2002.