ACTIVE LEARNING FOR SOUND EVENT CLASSIFICATION BY CLUSTERING UNLABELED DATA
Zhao Shuyang, Toni Heittola, Tuomas Virtanen
Department of Signal Processing, Tampere University of Technology, Finland

Background
Training a sound event classifier requires annotated recordings:
Audio data is easy to collect.
Annotation is time-consuming.
Idea: Utilize abundant unlabeled data to optimize the effectiveness of the annotation effort.

Evaluation
Dataset: UrbanSound8K
8732 labeled sound segments.
10 sound event classes in open urban space.
Cross-validation: 10-fold.

Proposed method
Medoid-based active learning (MAL): partition the data and annotate only the medoid segments, the centroids of the clusters.

Setup
Labels are produced by simulating a limited number of labeling responses (the labeling budget), according to the ground truth.
Number of clusters is set to 1/4 of the number of unlabeled data points.
The supervised learning setup (features and model) follows the UrbanSound SVM baseline. Features are various statistics of the MFCCs within a segment: mean, median, variance, minimum, maximum, skewness, etc.
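As a rough illustration of segment-level MFCC statistics, the sketch below computes the six statistics named above for each coefficient. It assumes the MFCC frames have already been extracted by some front end and are given as a list of equal-length vectors; the function name `segment_features` is hypothetical, and the "etc." in the feature list means the actual baseline feature set is larger than what is shown here.

```python
import statistics


def segment_features(frames):
    """Summarize one segment's MFCC frames with per-coefficient statistics.

    frames: list of frames, each a list of MFCC values (assumed precomputed).
    Returns a flat list: for each coefficient, mean, median, variance,
    minimum, maximum, skewness (population variance and skewness).
    """
    dim = len(frames[0])
    feats = []
    for d in range(dim):
        col = [frame[d] for frame in frames]
        mean = statistics.fmean(col)
        var = statistics.pvariance(col, mu=mean)
        std = var ** 0.5
        # Skewness is undefined for a constant coefficient; use 0.0 there.
        if std == 0:
            skew = 0.0
        else:
            skew = sum(((x - mean) / std) ** 3 for x in col) / len(col)
        feats.extend([mean, statistics.median(col), var, min(col), max(col), skew])
    return feats
```

Each segment then maps to one fixed-length vector regardless of its duration, which is what a classifier such as an SVM needs as input.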
The proposed method is compared with reference methods: random sampling (baseline), certainty-based active learning (CRTAL), and semi-supervised learning (SSL).
Experiments are repeated five times and the average performance is reported.
Medoids are assured to span different local distributions. A labeled medoid can be used to derive predicted labels for other cluster members.
Result
Figure 1: Overview of the proposed method. Medoid segments are marked with a red border. Annotated labels are filled with black and predicted labels are filled with grey.
Notable technical details
In the clustering stage, each sound segment is represented by a single Gaussian, based on its MFCC statistics. Segment-to-segment dissimilarity is measured by the symmetric KL divergence.
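The dissimilarity computation can be sketched as follows. This is a minimal example that assumes diagonal-covariance Gaussians, which the poster does not specify; with full covariances the formula would involve matrix inverses and traces instead. The function names and the variance floor are illustrative choices, not taken from the original work.

```python
def fit_diag_gaussian(frames):
    """Fit a diagonal Gaussian (per-dimension mean and variance) to a segment."""
    n = len(frames)
    dim = len(frames[0])
    mean = [sum(f[d] for f in frames) / n for d in range(dim)]
    var = [sum((f[d] - mean[d]) ** 2 for f in frames) / n for d in range(dim)]
    return mean, var


def symmetric_kl(p, q, floor=1e-8):
    """Symmetric KL divergence KL(p||q) + KL(q||p) between diagonal Gaussians.

    The log-determinant terms of the two directions cancel, leaving only
    variance ratios and the squared mean difference.
    """
    (mean_p, var_p), (mean_q, var_q) = p, q
    total = 0.0
    for m1, v1, m2, v2 in zip(mean_p, var_p, mean_q, var_q):
        v1 = max(v1, floor)  # guard against zero variance (assumed safeguard)
        v2 = max(v2, floor)
        diff2 = (m1 - m2) ** 2
        total += 0.5 * (v1 / v2 + v2 / v1 - 2.0 + diff2 * (1.0 / v1 + 1.0 / v2))
    return total


def dissimilarity_matrix(segments):
    """Pairwise symmetric KL divergence between all segments."""
    gaussians = [fit_diag_gaussian(frames) for frames in segments]
    n = len(gaussians)
    return [[symmetric_kl(gaussians[i], gaussians[j]) for j in range(n)]
            for i in range(n)]
```

Because the measure is symmetric and zero on the diagonal, the resulting matrix can be passed directly to a clustering routine that works on pairwise dissimilarities.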
Figure 2: Classification accuracy as a function of labeling budget, simulated using an oracle annotator.
The classification accuracy is improved by 8% when the labeling budget is lower than 10% of the unlabeled data.
The proposed method saves 50% to 60% of the labeling budget needed to reach the same accuracy as the best reference method.

Conclusions
The proposed method effectively saves labeling budget for sound event classification.
Initialization of medoids is based on farthest-first traversal, starting from a random point.
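Farthest-first traversal can be sketched on a precomputed dissimilarity matrix as below. The function name, the `seed` parameter, and the handling of already chosen points are assumptions made for illustration; only the idea of starting from a random point and repeatedly picking the point farthest from all chosen ones comes from the description above.

```python
import random


def farthest_first(dissim, k, seed=None):
    """Pick k initial medoid indices by farthest-first traversal.

    dissim: square matrix (list of lists) of pairwise dissimilarities.
    """
    n = len(dissim)
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and the number of points")
    rng = random.Random(seed)
    first = rng.randrange(n)  # random starting point
    chosen = [first]
    chosen_set = {first}
    # nearest[i] = dissimilarity from point i to its closest chosen medoid
    nearest = list(dissim[first])
    while len(chosen) < k:
        candidates = [i for i in range(n) if i not in chosen_set]
        nxt = max(candidates, key=lambda i: nearest[i])
        chosen.append(nxt)
        chosen_set.add(nxt)
        nearest = [min(nearest[i], dissim[nxt][i]) for i in range(n)]
    return chosen
```

With the setup described on this poster, `k` would be a quarter of the number of unlabeled segments, for example `farthest_first(dissim, len(dissim) // 4)`.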
Future: Study with different datasets, especially larger scale datasets.
An annotated label overrules predicted labels on the same segment.
Future: Study the weak-annotator case, simulated using real human labeling responses.
Produced labels are used to construct training examples for supervised learning.
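The label production step can be illustrated with a small sketch. It assumes a clustering result given as a per-segment cluster index plus a list of medoid indices, and a dictionary of labels obtained from the annotator; all names here are hypothetical. Each annotated medoid passes its label to its cluster members as a predicted label, and any directly annotated label takes precedence.

```python
def produce_labels(assignment, medoids, annotated):
    """Derive labels for segments from annotated medoids.

    assignment: assignment[i] is the cluster index of segment i.
    medoids:    medoids[c] is the segment index of the medoid of cluster c.
    annotated:  dict mapping segment index -> label given by the annotator.
    Returns a dict mapping segment index -> label; segments in clusters
    whose medoid is not yet annotated are left out.
    """
    labels = {}
    for seg, cluster in enumerate(assignment):
        medoid = medoids[cluster]
        if medoid in annotated:
            labels[seg] = annotated[medoid]  # predicted label
    labels.update(annotated)  # annotated labels overrule predicted ones
    return labels


def build_training_set(features, labels):
    """Pair feature vectors with produced labels for supervised learning."""
    return [(features[seg], labels[seg]) for seg in sorted(labels)]
```

For example, with two clusters of three segments each and only the first medoid annotated, the three members of the first cluster receive that label and the second cluster contributes no training examples yet.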
K-medoids clustering is based on the dissimilarity matrix.
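A minimal sketch of k-medoids on a dissimilarity matrix is given below. The poster does not say which k-medoids variant is used, so this example assumes the simple alternating scheme (assign points to the nearest medoid, then re-pick each cluster's medoid) rather than PAM; the function names and the `max_iter` parameter are illustrative.

```python
def assign_to_nearest(dissim, medoids):
    """Assign every point to the cluster of its nearest medoid."""
    return [min(range(len(medoids)), key=lambda c: dissim[i][medoids[c]])
            for i in range(len(dissim))]


def k_medoids(dissim, initial_medoids, max_iter=100):
    """Alternating k-medoids using only pairwise dissimilarities.

    dissim:          square matrix (list of lists) of dissimilarities.
    initial_medoids: list of starting medoid indices.
    Returns (medoids, assignment).
    """
    medoids = list(initial_medoids)
    for _ in range(max_iter):
        assignment = assign_to_nearest(dissim, medoids)
        new_medoids = []
        for c, old in enumerate(medoids):
            members = [i for i, a in enumerate(assignment) if a == c]
            if not members:
                new_medoids.append(old)  # keep the medoid of an empty cluster
                continue
            # New medoid: the member with the smallest total dissimilarity
            # to the other members of its cluster.
            best = min(members, key=lambda m: sum(dissim[m][j] for j in members))
            new_medoids.append(best)
        if new_medoids == medoids:
            break  # converged
        medoids = new_medoids
    return medoids, assign_to_nearest(dissim, medoids)
```

Since only the matrix is used, the routine works with any dissimilarity measure, including ones that are not derived from vector coordinates.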