Monaural Blind Source Separation in the context of Vocal Detection
Bernhard Lehner, Gerhard Widmer
Department of Computational Perception, JKU Linz
[email protected]
Introduction
- We evaluate the usefulness of four monaural blind source separation (BSS) methods in the context of vocal detection (VD).
- BSS methods: Adaptive REpeating Pattern Extraction Technique (aREPET), Kernel Additive Modelling (KAM), Flexible Audio Source Separation Toolbox (FASST), Robust Principal Component Analysis (RPCA).
- First set of experiments: What is the best strategy to utilise BSS as pre-processing to improve VD?
- Second experiment: Can we improve BSS estimates by post-processing them according to the VD output?
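To illustrate the idea behind one of the methods: RPCA-style separation decomposes a magnitude spectrogram into a low-rank part (repetitive accompaniment) plus a sparse part (vocals). The following is a minimal NumPy sketch of that low-rank + sparse decomposition via alternating truncated SVD and soft-thresholding; it is a simplified stand-in, not the exact RPCA algorithm evaluated in the poster, and all names and parameters are our own illustration.

```python
import numpy as np

def soft_threshold(x, tau):
    """Shrink entries towards zero; only strong outliers survive (sparse part)."""
    return np.sign(x) * np.maximum(np.abs(x) - tau, 0.0)

def lowrank_sparse(M, rank=1, tau=0.5, iters=20):
    """Alternate a rank-truncated SVD (low-rank 'accompaniment') with
    soft-thresholding of the residual (sparse 'vocal' component)."""
    S = np.zeros_like(M)
    for _ in range(iters):
        U, s, Vt = np.linalg.svd(M - S, full_matrices=False)
        L = (U[:, :rank] * s[:rank]) @ Vt[:rank, :]
        S = soft_threshold(M - L, tau)
    return L, S

# Toy magnitude "spectrogram": rank-1 repetitive background + a sparse burst.
rng = np.random.default_rng(0)
background = np.outer(rng.random(64), rng.random(200))
vocals = np.zeros((64, 200))
vocals[10, 50:60] = 5.0
M = background + vocals
L, S = lowrank_sparse(M, rank=1, tau=0.5)
```

After a few iterations, `S` concentrates on the burst while `L` recovers the repetitive background, which is the intuition behind using RPCA to isolate singing voice.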
Strategies to improve VD
- Foreground Separation: features only from the estimated vocals.
- Foreground Concatenation: features from both the mixed and the estimated vocals (yields a double-sized feature vector).
- Foreground Enhancement: remixing the estimated vocals with the original audio signal before feature extraction.
- We compare two state-of-the-art feature sets (IC14 and OS11) by feeding them to Random Forest (RF) and Support Vector Machine (SVM) classifiers.

Results
BSS for Pre-processing to improve VD:

Internal Data Set (framesize = 200 ms)
                 RF                           SVM
                 accuracy     F-measure       accuracy     F-measure
                 IC14  OS11   IC14  OS11      IC14  OS11   IC14  OS11
MIX              .837  .795   .846  .814      .855  .807   .863  .819
VOC              .916  .910   .920  .905      .939  .949   .943  .951
aREPETmix        .768  .756   .800  .781      .783  .742   .797  .789
aREPETsep        .841  .796   .850  .810      .861  .811   .866  .822
FASSTmix         .732  .670   .682  .603      .751  .711   .756  .686
FASSTsep         .826  .778   .835  .795      .845  .791   .854  .803
KAMmix           .752  .736   .773  .738      .631  .577   .728  .709
KAMsep           .826  .786   .835  .798      .849  .805   .855  .815
RPCAmix          .752  .691   .788  .763      .620  .563   .704  .703
RPCAsep          .845  .797   .851  .809      .861  .820   .867  .828

Table: Results of Foreground Separation. MIX: trained and tested with mixed audio; VOC: trained with mixed audio, tested with pure vocals. METHODmix: trained with mixed audio, tested with separated vocals; METHODsep: trained and tested with separated vocals.

Internal Data Set (framesize = 200 ms)
                 RF                           SVM
                 accuracy     F-measure       accuracy     F-measure
                 IC14  OS11   IC14  OS11      IC14  OS11   IC14  OS11
MIX              .837  .795   .846  .814      .855  .807   .863  .819
MIX+VOC          .960  .985   .962  .986      .976  .984   .977  .985
MIX+aREPET       .845  .800   .853  .817      .865  .825   .872  .834
MIX+FASST        .842  .798   .850  .816      .863  .825   .871  .835
MIX+KAM          .844  .800   .853  .815      .871  .830   .877  .839
MIX+RPCA         .850  .806   .858  .822      .870  .833   .877  .841

Table: Results of Foreground Concatenation. The classifier is given a double-sized vector containing the features from the mixed and the separated audio signal. MIX+VOC: concatenating features from the real vocals to simulate perfect separation.

Internal Data Set (framesize = 200 ms)
                 RF                           SVM
                 accuracy     F-measure       accuracy     F-measure
                 IC14  OS11   IC14  OS11      IC14  OS11   IC14  OS11
MIX              .837  .795   .846  .814      .855  .807   .863  .819
VOC −6dB         .880  .861   .886  .868      .907  .869   .911  .874
VOC 6dB          .937  .943   .940  .944      .960  .945   .961  .946
aREPET −6dB      .844  .792   .852  .809      .862  .807   .869  .818
aREPET 6dB       .845  .799   .854  .813      .867  .813   .874  .823
FASST −6dB       .844  .795   .852  .812      .861  .805   .868  .817
FASST 6dB        .844  .799   .852  .815      .864  .811   .871  .822
KAM −6dB         .845  .801   .854  .816      .866  .815   .873  .825
KAM 6dB          .845  .803   .854  .816      .870  .821   .876  .829
RPCA −6dB        .847  .803   .855  .817      .868  .817   .874  .826
RPCA 6dB         .850  .809   .858  .821      .873  .821   .878  .829

Table: Results of Foreground Enhancement. The classifier is given the features extracted from a signal where the separated vocals are remixed with the original audio signal. VOC: using the real vocals instead of the separated ones.

Utilising VD to improve BSS
- Vocals: mute the vocal estimate at non-vocal parts.
- Background: select the original signal at non-vocal parts.
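The Foreground Enhancement and Foreground Concatenation strategies from the pre-processing experiments can be sketched in a few lines of NumPy. The function names and the interpretation of the ±6 dB condition as a linear gain on the vocal estimate are our assumptions for illustration, not the poster's implementation.

```python
import numpy as np

def enhance_foreground(mix, vocal_estimate, gain_db):
    """Foreground Enhancement: remix the vocal estimate with the original
    signal, scaling the estimate by gain_db (e.g. -6 dB or +6 dB)."""
    gain = 10.0 ** (gain_db / 20.0)
    return mix + gain * vocal_estimate

def concat_features(features_mix, features_vocals):
    """Foreground Concatenation: one double-sized vector built from the
    features of the mixed and the separated signal."""
    return np.concatenate([features_mix, features_vocals])

# Toy example: 1 s of audio at 8 kHz; pretend the 440 Hz tone is the vocal.
t = np.linspace(0, 1, 8000, endpoint=False)
mix = np.sin(2 * np.pi * 220 * t) + 0.5 * np.sin(2 * np.pi * 440 * t)
vocal_est = 0.5 * np.sin(2 * np.pi * 440 * t)     # stand-in for a BSS estimate
enhanced = enhance_foreground(mix, vocal_est, gain_db=6.0)

# A hypothetical 30-dimensional feature set per signal -> 60-dim input vector.
double_vec = concat_features(np.zeros(30), np.ones(30))
```

Features (IC14 or OS11) would then be extracted from `enhanced` instead of `mix`, or the classifier would be fed `double_vec` directly.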
VD for Post-processing to improve BSS:

Figure: Example of RPCA-separated singing voice. The upper subplot shows the mixed signal (grey) and the embedded vocals (black). The lower subplot shows the vocals estimated by RPCA (grey) and the vocal estimates partially muted according to our VD (black).

Figure: RPCA vocal estimation evaluation results. A: raw RPCA output; B: VD-post-processed output; C: post-processed using the ground truth. The global measure OPS indicates better performance for the post-processed output. The higher performance regarding interferences (IPS) is caused by the parts that are muted when our VD classifies them as non-vocal.

Discussion
- Website with examples: www.cp.jku.at/misc/ismir2015bss/
- All four BSS methods show very similar characteristics regarding the (only limited) improvement of VD results.
- However, by utilising the VD output, we could improve the BSS estimates for both the vocals (useful for artist recognition) and the background (useful for karaoke track creation).
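The VD-based post-processing (mute the vocal estimate at non-vocal parts; fall back to the original signal for the background) can be sketched as follows. This is a minimal sketch assuming one boolean VD decision per fixed-length frame; the function names and interface are our illustration, not the paper's code.

```python
import numpy as np

def mute_nonvocal(vocal_estimate, vd_labels, frame_len):
    """Zero out the vocal estimate in frames the VD marks as non-vocal.
    vd_labels holds one boolean per frame (True = vocal)."""
    out = vocal_estimate.copy()
    for i, is_vocal in enumerate(vd_labels):
        if not is_vocal:
            out[i * frame_len:(i + 1) * frame_len] = 0.0
    return out

def background_from_vd(mix, background_estimate, vd_labels, frame_len):
    """Use the original signal at non-vocal parts, the BSS background otherwise."""
    out = background_estimate.copy()
    for i, is_vocal in enumerate(vd_labels):
        if not is_vocal:
            out[i * frame_len:(i + 1) * frame_len] = mix[i * frame_len:(i + 1) * frame_len]
    return out

# Toy example: two 4-sample frames, the second classified as non-vocal.
vocal_est = np.ones(8)
mix = np.full(8, 2.0)
bg_est = np.zeros(8)
vd = [True, False]
muted_vocals = mute_nonvocal(vocal_est, vd, frame_len=4)
background = background_from_vd(mix, bg_est, vd, frame_len=4)
```

Muting per VD decision trivially removes interference in non-vocal frames, which is consistent with the higher IPS reported for the post-processed output.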