A SOURCE/FILTER MODEL WITH ADAPTIVE CONSTRAINTS FOR NMF-BASED SPEECH SEPARATION

Damien Bouvier, Nicolas Obin, Marco Liuni, Axel Roebel
IRCAM - UMR STMS IRCAM-CNRS-UPMC, Paris, France

ABSTRACT

This paper introduces a constrained source/filter model for semi-supervised speech separation based on non-negative matrix factorization (NMF). The objective is to inform the NMF with prior knowledge about speech, providing a physically meaningful speech separation. To do so, a source/filter model (referred to as the Instantaneous Mixture Model, or IMM) is integrated into the NMF. Furthermore, constraints are added to the IMM-NMF in order to control the behaviour of the NMF during separation and to enforce its physical meaning. In particular, a speech-specific constraint, based on the source/filter coherence of speech, and a method for the automatic adaptation of the constraints' weights during separation are presented. Moreover, the proposed source/filter model is semi-supervised: during training, one filter basis is estimated for each phoneme of a speaker; during separation, the estimated filter bases are then used in the constrained source/filter model. An experimental evaluation of speech separation was conducted on the TIMIT speaker database mixed with various environmental background noises from the QUT-NOISE database. This evaluation showed that the use of adaptive constraints increases the performance of the source/filter model for speaker-dependent speech separation, and compares favorably to fully-supervised speech separation.

Index Terms: speech separation, non-negative matrix factorization, source/filter model, constraints.

1. INTRODUCTION

Speech separation consists in separating a speech signal from a background environment, referred to as noise, which is defined as everything but the speaker of interest (i.e., environmental sounds such as background non-speech sounds or background speech).
Speech separation is essential for further speech processing in real-world speech technologies, such as speech recognition, speaker recognition, speaker localization, and audio multimedia technologies for speech extraction and remixing. Audio source separation methods have recently been introduced for speech separation, in which the audio signal is described as the sum of two sources: a speech signal and a background noise signal. In particular, the non-negative matrix factorization (NMF) of an audio signal is extremely popular for source separation, and has been widely used in recent years for speech separation [1, 2, 3, 4, 5, 6]. In its original formulation, the NMF decomposition of an audio signal is strictly unsupervised [7]. Over the last decade, audio and speech separation has massively converged toward informed audio source separation, in order to provide prior knowledge about the audio sources to be separated [8]. In the context of speech separation, two main trends co-exist: semi-supervised speech separation uses prior knowledge about speech only [5, 6], while supervised speech separation adds prior knowledge about the
background environment [3, 9]. The latter case remains extremely limited when the background environment is unknown, which is the case in most real-world applications. In this context, semi-supervised speech separation is the most common approach. Its main advantage is that robust prior knowledge about speech can be exploited, whereas integrating prior knowledge about the background environment is clearly unrealistic given the extreme variability of background environments. For semi-supervised speech separation, experience in speech processing and speech recognition can be exploited: a source/filter model can be used to inform NMF-based separation [1, 2]. Also, a universal speech model (USM) has been proposed for speaker-independent speech separation [3], in which a speaker can be represented by a combination of the bases of the most similar speakers. A real-time implementation of the USM has recently been proposed for on-line background noise estimation [4]. Furthermore, hidden Markov models (HMM) have been added to NMF speech separation, in order to construct a language model [5] and to use prior text information for speech separation [6]. Finally, deep neural networks (DNN) have been successfully introduced for speech separation [10], and deep-NMF [11] has been proposed in order to integrate the advantages of DNNs within the NMF framework.

In this paper, we propose a constrained source/filter model for semi-supervised and physically-motivated NMF-based speech separation. To do so, a source/filter NMF model is described in Section 2. Then, constraints are added to this model in Section 3, in which a speech-specific constraint and a method for the automatic adaptation of the constraints' weights during separation are proposed. An experiment is conducted in Section 4 in order to explore the use of constraints within the source/filter model, with comparison to state-of-the-art speech separation methods.

2. NMF AND SOURCE/FILTER MODEL

2.1. NMF principle

Let V denote our observation matrix, with only non-negative coefficients (for audio, usually the STFT magnitude of the observed mixture signal); the NMF consists of finding the best approximation, given a chosen cost C:

V ≈ WH   (1)

where W and H also contain only non-negative coefficients. W represents a dictionary matrix and H is the activation matrix (it can be seen as the gains of the projection of V onto the space defined by W). Afterwards, source separation can be performed using Wiener filters [12].
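As a concrete illustration of the decomposition in (1), the following minimal NumPy sketch (not the authors' code) fits an NMF with the standard Itakura-Saito multiplicative updates — the cost used later in this paper — and builds per-source Wiener masks from a partition of the components. The helper names `is_nmf` and `wiener_masks`, the random initialization, and the fixed iteration count are illustrative assumptions:

```python
import numpy as np

def is_nmf(V, K, n_iter=200, eps=1e-12, seed=0):
    """Fit V ~= W @ H (rank K) with Itakura-Saito multiplicative updates."""
    rng = np.random.default_rng(seed)
    F, N = V.shape
    W = rng.random((F, K)) + eps
    H = rng.random((K, N)) + eps
    for _ in range(n_iter):
        WH = W @ H + eps
        # IS update for H: ratio of negative/positive gradient parts, cf. rule (3)
        H *= (W.T @ (V / WH**2)) / (W.T @ (1.0 / WH))
        WH = W @ H + eps
        # Same multiplicative rule applied to W
        W *= ((V / WH**2) @ H.T) / ((1.0 / WH) @ H.T)
    return W, H

def wiener_masks(W, H, groups):
    """Per-source Wiener masks, one per group of component indices; masks sum to ~1."""
    WH = W @ H + 1e-12
    return [(W[:, g] @ H[g, :]) / WH for g in groups]
```

Applying each mask to the complex STFT of the mixture and inverting gives the separated signals.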
Fréquence
ex
ex
Fréquence
Fréquence
uité desNMF filtres, dictionnaire des filtres est décomposition. représenté comme une semi-supervised one where webase can matrices. injectdu knowledge about the NMF for speech Wefactorisation usedspeech an informed Durrieu et al. [1] proposent en matrices non-négatives spécifique au ; W ex contient deschaque différentes n?) to thetation?) fortospeech separation. We separation. used anune informed construites préalablement NMF, restent fixes durant les combiitérations where ⇥ 2.2. represents either de W and H, ⌦for denotes thetoutes Hadamard (physical model, speaker information, information, ...) but à la source/filtre de la voix et de intégration dans le et modèle la NMF ;speech enfin nous dressons N A source/filter-NMF Voix Source semi-supervized NMF in order totextual perform a speaker-dependent Wson 2.2. A source/filter-NMF for speech 4 -supervized NMF in order to perform a speaker-dependent linéaire de filtres élémentaires venant d’un banc de filtre. La Figure 1.2 illustre cette Les règles de mise à jour multiplicatives spécifiques à ce modèle, déduites à partir de (1.9), product and the division is pointwise. use no priorspour on the noise source/filtre une tâche dearticle séparation deis=l’instrument principal enenmusique . ¢ sont à(non latype. fois des excitations et=séparation des excitations bruitées. Lemodel, Tableau 1.2 résume le rôle unthistableau succinct) différentes méthodes de de parole, se ¢concentrant speech task. The main contribution thisharmonique article ch separation task. The main contribution ofphysical this des isof(interpreIn separation paper, our goal is to give a meaning Speech can be modelized using a source/filter with a glottal can be modelized using a source/filter model, with a glottal Voix Source Filt décrites Annexe A.2. ex position. thetation?) formalization ofdes a en speech-specific constraint in Speech pre-existing ormalization of aspectrogramme speech-specific constraint in pre-existing to the NMF for speech separation. 
We used utilisant an informed principalement sur lesd’amplitude méthodes la NMF. différentes matrices. délisent le de la2.2. voix (ou deconvoluted l’instrument àtract. comme excitation it is easy to see that A source/filter-NMF for speech excitation convoluted by the vocal tract.bySotheit vocal is easy toséparer) seeSothat source/filter model for NMF-based speechaseparation. semi-supervized NMF in order to perform speaker-dependent e/filter model for NMF-based speech separation. ex in order to fully represent potential speech signals, we will need in order to fully represent all potential speech signals, wewe will need LesTherègles desources mise à jour spécifiques àall modèle, à partir de (1.9), sont N this uit d’unspeech spectrogramme des filtres separation task. main contribution article multiplicatives isparIn un Chapitre 3and order spectrogramme to refine the representation ofdes thece speech, used déduites the. La factoHof ¢ = very large dictionaries, so loose advantage theintegrated rank-reduction advantage of= very large dictionaries, andforso looseproposed the rank-reduction ofa the formalization ofdécrites a speech-specific constraint inthe pre-existing source/filter model NMF in [12], and we ex The paper is organized as follows: first, we will describe the en Annexe A.2. The paper is organized as follows: first, we will describe Temps Temps 1.3 Autres méthodes de séparation de parole the order this problem, the Instant nsionclassique de lamodel NMF est alors appliqué sur ettoforNMF. .InMais, pour assurer une certaine NMF. In order get this to problem, used the Instant 1.1 La factorisation entheTemps matrices non-négatives source/filter model for NMF-based speech separation. learning step thearound vocal filters. 
Ifget V around is awe STFT magnitude of awe used Chapitre 3 inclusion of the source/filter model in the NMF then of the source/filter in the NMF framework; thenframework; ¢ = ¢ Mixture (IMM) proposed Mixture speech Modelsignal, (IMM) for in [5] for : the NMF in [5] : = weproposed canModel modeled itthe as NMF follows: Figure – Décomposition source/filtre du spectrogramme d’une voix uité des filtres, chaque du dictionnaire des est représenté comme une combiwill explain structure our algorithm the wethewill the structure ofand ourparticularly algorithm particularly the 1.2filtres Theexplain paper of is organized asbase follows: first, we willand describe thetravaux Pour pouvoir mieux situer nos par rapport à la littérature, nous dressons ici un bref Figure 1.2 – Modèle de décomposition source/filtre proposé dans [1] pour la NMF ex ex Le principe de la NMF, introduit tout d’abord par Paatero et Tapper [1] et popularisé par Évaluation expérimentale oped constraint; afterwards, we will describe experimental developed constraint; afterwards, will describe our experimental V framework; V = V = V VV = ⌦ ⌦⇣ V⇣V V ⌘ ⌦ inclusion of the source/filter model we inour the NMF then V ⌘ 1.2 ⇣ illustre ⌘ techniques Figure 1.2d’un –theModèle décomposition source/filtre proposé dans [1] pour la NMF filtres élémentaires venant de filtre. La Figure descriptif des différents algorithmes de banc séparation de laex◊ parole, en se sur les ex F ,linéaire and finally wede will discuss thestructure results. 1.3 Autres méthodes de séparation parole we will the ofest our algorithm and particularly ex W ex focalisant cH setup, andexplain finally we will discuss the results. Lee et Seung [2], d’approcher une matrice V de de N coefficients positifs ou nulscette Temps Temps c ⌦ H ⌦ de cdes (4)H ' taille (W'des H(W ){zH ⌦ })àex(W W ' H ) W le produit d’un spectrogramme sources par un spectrogramme filtres . 
La facto≥ | constraint; afterwards,Tout we willcedescribe our experimental Évaluation expériment | {z } (5) qui est considéré comme bruit (i.e. l’accompagnement musical pour Durrieu et al., | {z } | {z } (5) ex position.developed | {z } par W |par {z assurer } les risationdu classique la NMF est alors appliqué sur; on peut et . Mais, pour une travaux certaine pas seule intégration modèledesource/filtre dans la NMF citer exemple setup, and finally4.weCe willn’est discuss the la results. excitation excitation Figure – Décomposition d Temps Tem Tempscomme excitation Tout cemieux qui estsituer considéré comme bruit (i.e. l’accompagnement musical pour Durrieu et bref al.,source/filtre du spectrogramme fond sonore pour nous) est représenté en utilisant le décomposition classique de la une NMF. Le1.2 excitation continuité des filtres, base du des filtres représenté 2. NMF AND2. SOURCE/FILTER MODEL V ƒ = chaque W H (1.1) ex dictionnaire = ANDle SOURCE/FILTER MODEL Pour pouvoir nos travaux par rapport àbase lales⌦est littérature, nous dressons ici un deNMF Virtanen et Klapuri [12] qui proposent une nouvelle factorisation où chaque spectrale est décomposée en –H Modèle decombidécomposition ex¢ ' (W exce Hchapitre, ) ' ⌦ Wévaluons U ex H Dans nous différents proposés sur une tâche source/filtre de séparation proposé dans [1] po (W H ) Figure W 1.2 Ualgorithmes ex
excitation
excitation
naison linéaire de est élémentaires venant d’un de filtre. La Figure 1.3 modèle s’écrit :qui where Vcomme and V bruit are respectively the banc amplitude spectrogram of Tout ce est considéré (i.e. l’accompagnement musical pour Durrieu et cette al., de le obtenu fondd’amplitude sonore pour nous) représenté en utilisant lefiltre. décomposition classique NMF. Le source/filtre Figure 1.2illustre – –Modèle de propo produit d’un spectre d’une excitation harmonique par celui d’un 2.un NMF SOURCE/FILTER MODEL 1.2 Décomposition source/filtre du spectrogramme d’une voix ex en seFigure deou lapart parole danstaille un environnement descriptif des algorithmes de séparation de la sursource/filtre lesladécomposition techniques décomposition. où W et HAND sont deux matrices àsonore coefficients positifs nuls, de respective FModèle ◊K etdeKfocalisant ◊N the anden source filter part, W exparole, and bruité. H the basic ex NMF principle ex excitation ledifférents fond pour nous) est V représenté utilisant le décomposition de la. NMF. Le 2.1. NMF principle ledécomposition produit d’un spectrogramme des sources par un[1] spectrogramme des Figure 1.2 – proposé dans pour la NMF ex ex N classique where V and are respectively the amplitude spectrogram of modèle obtenu s’écrit : ex c where V and V are respectively the amplitude spectrogram of = ¢ + H Dans ce chapitre, nous évaluons les différents algorithmes A s’écrit Tout qui"approchée considéré comme bruit (i.e.source/filtre l’accompagnement musical po NMF decomposition forexVFigure ,!and W and H ex 2.1. La NMF principle exce1.2 ! (approximation?) "W NMF revient donc ànon-negative chercher, pour un excitation K :préalablement choisi, la meilleure matrice modèle obtenu –exest Modèle décomposition danspour [1p risation classique de la NMF est alors appliqué sur proposé et . 
Mais, (1.11) ex de the part and source filter part, and H the basic ex basic ex N N V denotesLet ourVobservation matrix, with co-non-negative the excitation part(approximation?) and source filter part, W and Hest the basic de laexemple parolecomme dans un environnement bruité. the NMF decomposition for V . In order 4. Ceonly n’est pas seule intégration du modèle source/filtre dans la NMF ; on peut citer par les = (W H ) ¢ U H + W H ex travaux denotes our observation matrix, withlaonly coex W N Tout ce qui considéré bruit (i.e. l’accompagnemen le fond sonore pour nous) est représenté en utilisant le décomposition classiqu ex continuité des filtres, chaque basepar du un dictionnaire des filtres est représent ledecompoproduit spectrogramme des sources spectrogramme des filtres . La = ¢ + exco- decomposition ex d’un de Damien rang maximum K, selon une fonction de coût choisie colonne de la matrice W est NMF (approximation?) for V , and V Let Vusually denotes ourSTFT observation matrix, with only non-negative ex ; chaque ex N c ents (for audio, the magnitude of the observed Bouvier 7 M2 ATIAM Rapport de Stage W Tout ce "qui estoù considéré comme bruit (i.e. H decomposition to=impose NMF smoothness on the! filter spectrum, we+decompose forW V classique , !and Vdespectrale decompoHestl’accompagnement ¢ !(approximation?) "chaque "(1.11) exen musical pour Durrieu de Virtanen et Klapuri quiobserved proposent une nouvelle factorisation base décomposée efficients (for audio, usually STFT magnitude of the (1.11) !obtenu "as naison linéaire de élémentaires venant d’un bancpour de Laune Figu risation NMF est filtres alors appliqué sur . Mais, assurer c modèle :pour efficients (for audio, usually thethe STFT magnitude of ex the[12] ex exthe activation N la N le fond sonore nous) est représenté enet utilisant le filtre. 
décompos is the as H matrix of the filter basis, ex N”atomic” Ns’écrit ure); the NMF consists of finding theoù best approximation: product W Uis isU aW dictionary ofce smooth Nfollows: unethe base de décomposition de ,observed etsition H correspond àex l’encodage de selon cette décomposi= (W H ¢ H + H sition H is U the activation matrix of the filter basis, = (W H )is,aswhere ¢)follows: W H + W H Tout qui est considéré comme bruit l’accompagnement musical leW fond pour est représenté enbase utilisant le(i.e. décomposition de la une NM les matrices , d’amplitude H représentent respectivement les temporelles des mixture); NMF consists finding the spectre bestH approximation: ! sonore " activations !nous) "W unlaproduit d’un d’une excitation harmonique par celui d’un filtre. décomposition. continuité des filtres, chaque du dictionnaire des filtres est classique représenté comme mixture);the NMF consists ofof finding the best approximation: W etisH afilters dictionary of smooth ”atomic” filters and U is the coeffiex ex matrix N N ex and U is the coefficient combining those elementary H ex ex N H )¢ W U ”atomic” H modèle + naison W H obtenu s’écrit : H exW V isƒa(W dictionary of smooth filters and U is the coeffition. =est représenté ¢ utilisant linéaire deAdictionnaire filtres élémentaires venant La Figure des filtres bruit, W estaNun dictionnaire d’excitations, W est un modèle obtenu s’écrit :pour fond sonore nous) en le filtre. décomposition clas filters to form vocal filter. Fig. 1 le shows aninto example of the overall cient combining those elementary filters the filter basis. !d’un banc de " + ! N 1.3Nillustr " V ' W H excitations, (1) et du ex matrix V '' W H les (1) cient matrix combining those filters into the filter basis. ex temporelles W H Damien (1) où matrices H et H représentent respectivement activations ex elementary " !+ N W " L’idée principale NMF est laBouvier contrainte de positivité des qui oblige l’estimation source/filter model for NMF. 
M2N,décomposition. ATIAM - bases Rapport de Stage (W Hàƒex(W ) ex¢Hexex ) ¢ W!Wdes U¢ H deVdes filtres élémentaires, UH contient les combinaisons linéaires construisant filtres W In, [5], the authors and W and letting allles ex matrices, ex8fixed N les ex ex=de N W HN H + = modèle obtenu s’écrit : VInimposed ƒ[5], W H ¢ W U ¢ H +W Wfixed HA = ex V U H + the authors imposed and W and W , letting all ex =with + W W ¢ est N filtres ex pas ! " ⇣ un ⌘ bases a contains cost C, where W andfaire H coefficients, also contains non-negative coefexcitations, des et du bruit, estsame dictionnaire d’excitations, e W and Hgiven also non-negative aadditive, others matrices free; weW kept the construction methods, ! un dictionnaire des données à contains se manière de sorte que l’annulation entre de W n’est partir dede W , given et coefficients, W est le dictionnaire des bases de bruit. Les WATIAM etexW -with sont ex ex Nmatrices Nconstruction ex de ex"Stage! N N " Damien Bouvier 7dans M2 Rapport where Wrepresent H aalso non-negative given a of aVothers matrices free; we kept the same exW ' (W H )⌦ U H Hexcitation (5) exexmethods, N W ficients. represent a dictionary matrix andH H is is the the activation = (WexU¢ Hfiltres )respectivement UN les Hactivation + Figure 1.2 –Wand Modèle de matrix décomposition source/filtre proposé [1] pour la NMF U the =ex (W Hexex )et ¢ H +¢à WWH ex N D(V |W H). W dictionary and addition learning step forHW U . For W+les , the où matrices H H H représentent ex N + ex;,= de filtres élémentaires, U contient les combinaisons linéaires construisant les bases de V ƒ W U construites préalablement à la NMF, et restent fixes durant toutes les itérations W contient possible (contrairement à l’analyse en composantes principales). V ƒ W H ¢ W U H + W H cost D(V |Wthe H). Wofrepresent matrix and H is the the addition of aand learning step for U . WFor W , the H excitation ex !dictionnaire d’excitations, " ! 
NW Ne ation matrix containing gains the basis aindictionary W. dictionary, we incorporated voiced unvoiced basis. N ex ex ex excitations, des filtres et du bruit, W est un à lathefois des des bruitées. Lede Tableau 1.2 résume rôle activation matrix containing gains of excitations the . W we*incorporated voiced and basis. partir de basis W in, W etharmonique est et ledictionary, dictionnaire des bases bruit. Les matrices W )respectivement et¢NW Wsont ex N (W ) excitations U les Hactivations + W H H ex le H où les matrices H ,unvoiced H et H= représentent temporell ex filtres
où les matrices H , H ex
et H
représentent respectivement l
W ,U V ƒles Uex H de filtres élémentaires, Uet linéaires construisant les W Htoutesleles différentes matrices. excontient audio, ladesmatrice V bruit est une représentation dufixes signal à analyser, plus construites préalablement à latemps-fréquence NMF, et restent durant itérations ;W Wcombinaisons contient ut ce qui est En considéré comme (i.e. l’accompagnement musical Durrieu al., ex excitations, despour filtres et duexbruit, estetNun dictionnaire d’excitations, W est un dictio N NW excitations, des filtres du bruit, est un d’excit W ) * H ) représentent * W partir de NW , partir etLe W est le dictionnaire deslebases de dictionnaire bruit. Lesles matrices W Les règles de mise àexcitations jourà multiplicatives spécifiques à ce modèle, déduites à de (1.9), sont ex et ex Nbruitées. où les matrices , H et H respectivement activat souvent la Transformée de Fourier Court-Terme (TFCT) ou la transformée à Q-constant. à la fois des harmonique des excitations Tableau 1.2 résume rôle W , H de , W filtres , U classique ,H , W , H de W ,U élémentaires, U contient combinaisons linéaires construisant les bases decon fi U les Le W sonore pour nous) est représenté en utilisant le décomposition la NMF. de filtres élémentaires, U contient les combinaisons linéaires construites préalablement à la NMF, et restent fixes durant toutes les itératio ex décrites en Annexe A.2. ex Puisque l’on nécessite coefficients positifs ou nuls, la matrice une excitations, desN filtres du bruit, W est un d’excitations, W desdes différentes matrices. partirconsidérée de W , etest W estpuissance le et dictionnaire des de dictionnaire bruit. Les Nmatrices W ex et W ) exbases * ex excitationsdes N obtenu s’écrit à la fois desUexcitations des bruitées. LedeTableau partir de W U, Hetharmonique W N est le, H dictionnaire bases bruit. W et ,W ,U ,H ,W ,H ex L1 positive: de l’amplitude, e.g. 
dans cas de la STFT : de filtres élémentaires, contient les combinaisons linéaires construisant Les règles de le mise spécifiques à ce modèle, déduites à partir de (1.9), sont préalablement à la NMF, et restent fixes durant toutes les itérations ; W co N à jour multiplicatives construites H 3.1 Bases de données des différentes matrices. construites préalablement à la NMF, et restent fixes durant toute N ex Ndes Net à la fois excitations harmonique des excitations bruitées. Le Tableau 1.2 résume Figure 1.2 – Modèle de décomposition source/filtre proposé dans [1] pour la NMF – séparation partir de W , et W est le dictionnaire des bases de bruit. Les matrice décrites en ¢ Annexe A.2. W H Autres méthodes de de parole = 1.3 + Vf n!= |TFCT{x}f n | " , – > 0! (1.2)multiplicatives Les"règles de mise àexcitations jour spécifiques modèle, déduites à paL à préalablement la fois desconstruit harmonique etfixes desà ce excitations bruitées. Afin d’évaluer tâche proposé, nous avons uneNMF, base deet données de signaux de (1.11) 3.1 Bases de données des ladifférentes construites à la restent durant toutes les itéra ex ex N N matrices. décrites enutilisé Annexe ex de parole TIMIT [26], et comme base N commeA.2. = (W H ) ¢ W U H + W H paroles bruitées. Pour cela, nous avons base W des différentes matrices. Pour rapport ààlalalittérature, nousmultiplicatives dressons ici un brefet des Dans ce cadre là, Wpouvoir est un mieux dictionnaire des travaux sources par sonores, chaque colonne un spectre ≥situer nos Les règles deconstruction mise jour spécifiques à ce modèle, déduites à partir de (1.9 desàétant excitations harmonique excitations bruitées. Leconstruit Tablea de bruit celle utilisé pour lafois de QUT-NOISE-TIMIT [27]. 
Afin d’évaluer la tâche proposé, nous avons un W ut ce qui est considéré l’accompagnement Durrieu et desbruit différents algorithmes de séparation lamusical parole, enpour seLes focalisant sur les règles de mise àal., jour multiplicatives spécifiques à ce modèle, de puissance –,descriptif etcomme H1.3 représente les(i.e. activations temporelles dede ces sources. Autres méthodes de séparation de parole décrites endifférentes Annexe A.2. Ntechniques ex H paroles bruitées. Pour cela, nous avons utilisé comme base de des matrices. ex N matrices H nous) , H est et 4.H représentent les activations temporelles des sonore pour représenté en intégration utilisant décomposition classique de la3600 NMF. TIMIT une base de 1.3 parole comprenant phrases enet divisées par locu- dede parole Tout cerespectivement qui du estle considéré commeestbruit (i.e. musical pour Durrieu al., en Annexe A.2. deanglais bruit de celle utilisé pour360 la construction QUT-NOISE-TI Autres méthodes séparation Ce n’est pas la seule modèle source/filtre dans lal’accompagnement NMF ; ondécrites peut citer par exemple lesLe travaux Les règles de à10jour multiplicatives à ; ce le fond sonore pour est différents. représenté enpar utilisant le décomposition dedescribed la NMF. Le teurs Chaque locuteur aà enregistré phrases, dont 2dressons sont pourbref tous cesmodèle, déduites à Fig. 1. Illustration of thenous) source/filter decomposition for the IMM-NMF inidentiques (4).spécifiques H N classique Pour pouvoir mieux situer nos travaux rapport lamise littérature, nous ici un de Virtanen etex Klapuri [12] qui proposent une nouvelle factorisation où chaque base spectrale est décomposée en ions, des s’écrit filtres du bruit, W est un dictionnaire d’excitations, W est un dictionnaire obtenu : Fonctions 1.1.1 et de coût modèle obtenu d’une s’écrit excitation : 2 phrasesharmonique exposent les de prononciations des e nous serviront base décrites enparole, Annexe A.2. 
TIMIT une base dedeparole 3600nous phrases un produit d’un spectre parAutres celui d’un filtre. 1.3toutes méthodes delocuteurs, séparation de parole Pour pouvoir travaux par rapport à lacomprenant littérature, dr descriptif des d’amplitude différents algorithmes de séparation devariantes la enmieux sedefocalisant lesesttechniques ≥situer nossur d’apprentissage. Les autres phrases servent deN base test. teurs différents. Chaque locuteur a enregistré 10 phrases, don es élémentaires, contient combinaisons linéaires les bases de filtres à exconstruisant W exla les = ¢ ! N descriptif + Dans U la littérature, factorisation (1.1) se fait en minimisant des différents algorithmes de séparation de la parole, en se focalisant " ! " = ¢ + 1.3 Autres méthodes de séparation de (1.11) QUT-NOISE-TIMIT contient des enregistrements longue durée de différents environnements so2 phrases exposent toutes les variantes de prononciations des ex N N ex ! seule "Itakura-Saito " Vmieux =bruit. (Wmodèle H exLes ) !source/filtre ¢ Pour W pouvoir U Hdans + W≥situer H; Non 4. Ce n’est pas la intégration du la NMF peut citer par les travaux ex nos travaux parexemple rapport à la littérature, nous dressons ici u (1.11) de W , et WIn est Damien le dictionnaire des bases W et W sont audio, usual costs are Kullback-Leiber (KL)de and where and V- Φ are respectively theunmagnitude STFT of the nores (cuisine, rue, matrices café, intérieur d’une voiture, ...). Lorsque nécessaire, apprentissage dul’accompagnement ex ex N factorisation N M2 d’apprentissage. Les autres phrases servent de base demusical test. Bouvier 7 nouvelle Rapport dequi Stage Tout ce est modèle considéré comme bruit (i.e. deH Virtanen et Klapuri qui proposent une oùW chaque base spectrale est décomposée en = (W ) ¢ W U H + W H 4. 
CeATIAM n’est pas laex seule intégration du source/filtre dans lafocalisant NMF ; on peut citertech par arg[12] min D(V |W H) (1.3) ex ex descriptif des différents algorithmes de séparation de la parole, en se sur les 1.3 Autres méthodes de séparation de parole ex N bruit est fait sur les 5 premières minutes du fichier. QUT-NOISE-TIMIT contient des enregistrements longue dur (IS) divergence which are both limit cases of the β-divergence excitation part and the filter part, W and H are the standard où les matrices H , H et H représentent respectivement les activations temporelles des le fond sonore pour nous) est représenté en utilisant le décomposition clas uites préalablement à la[13], NMF, et restent fixes durant toutes les itérations ; W contient W ,HØ0 Pour pouvoir mieux situer nos travaux par rapport à la littéra de Virtanen et Klapuri [12] qui proposent une nouvelle factorisation où chaque base spectral un produit d’un spectre d’amplitude d’une excitation harmonique par celui d’un filtre. ex ex nores (cuisine, rue, café, d’unemusical voiture,pour ...). Durrieu Lorsqu obtenu s’écrit : bruit Tout ce quimodèle est modèle considéré comme (i.e.la l’accompagnement (respectively for β N = 1 and β = 0). In this paper, we use the IS4. Ce n’est NMF decomposition for speech excitations Vintérieur (where ladescriptif seule intégration du source/filtre dans NMF ; on peut citer parparole, exemple les unpas produit d’un spectre d’amplitude d’une excitation harmonique par celui d’un filtre. ex dessuivante différents algorithmes séparation deW Notre baseLe de test est construite deletemporelles la résume manière : est bruit est fait sur lesde 5 premières minutes dula fichier. smatrices des excitations et desDamien excitations bruitées. 
2. A SOURCE/FILTER MODEL FOR NMF

2.1. NMF

The NMF approximates the magnitude spectrogram V by the product WH of two non-negative matrices, estimated by minimizing the Itakura-Saito (IS) divergence between V and WH:

    C = D_IS(V | WH) = Σ_{f,n} d_IS(V_{fn} | (WH)_{fn})    (2)

with d_IS(x|y) = x/y − log(x/y) − 1. Moreover, we use the artificial noise floor introduced in [14] in order to control the noise robustness of the IS-NMF.

The solution of the NMF problem, using the β-divergence, can be efficiently obtained by applying an iterative algorithm derived from a gradient descent technique [13]. The i-th iteration is based on the application, on both W and H, of the following multiplicative rule:

    Θ^(i+1) ← Θ^(i) ⊗ ∇⁻_{Θ^(i)}C / ∇⁺_{Θ^(i)}C    (3)

where Θ represents either W or H, ∇⁺_{Θ^(i)}C and ∇⁻_{Θ^(i)}C are the positive and negative parts of the gradient of the cost C with respect to Θ^(i), ⊗ denotes the Hadamard product, and the division is point-wise.

2.2. A source/filter-NMF for speech

In order to provide an explicit representation of speech, we use the source/filter model for NMF proposed in [2], in which a training step is added to estimate the vocal filters. The NMF source/filter decomposition of the STFT magnitude V^S of a speech signal can be expressed as:

    V^S = V^ex ⊗ V^Φ ≈ (W^ex H^ex) ⊗ (Ŵ^Φ H^Φ)    (4)
            excitation       filter

where W^ex is a fixed dictionary of excitations (periodic and noisy bases), Ŵ^Φ contains the speech filters, and H^ex and H^Φ are the activation matrices of the excitations and of the filters, respectively. In order to ensure the smoothness of the speech filters Ŵ^Φ, we further decompose Ŵ^Φ as the product W^Φ U^Φ, where W^Φ is a fixed dictionary of smooth "atomic" filters (here, Hann windows) and U^Φ is the coefficient matrix linearly combining those elementary filters to form a speech filter. Figure 1 illustrates the architecture of the source/filter model for NMF.

For speech separation, the observed signal V is assumed to be a mixture of a speech signal V^S and a background noise signal V^N. V can then be approximated by Ṽ as follows:

    V ≈ Ṽ = (W^ex H^ex) ⊗ (Ŵ^Φ H^Φ) + W^N H^N    (5)

in which the background noise signal V^N is expressed using a standard NMF decomposition W^N H^N. Following the denomination used in [2], we will refer to this mixture decomposition as the "Instantaneous Mixture Model for NMF" (IMM-NMF).

2.3. Semi-supervision of the IMM-NMF

In the IMM-NMF, the speech filters Ŵ^Φ of a speaker are explicitly represented by the coefficient matrix U^Φ. In [2], this matrix was directly estimated from the observed signal V, thus fully unsupervised. Here, we propose to estimate the speech filters from clean speech signals of a speaker. To do so, the speech filter matrix V^Φ is first estimated by a spectral envelope estimation algorithm [15], and then approximated by the NMF filter decomposition:

    V^Φ ≈ W^Φ U^Φ H^Φ    (6)

Furthermore, we used phonetic information in order to train phonemes separately and to have one basis for each speech filter (i.e. each phoneme).
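As a sanity check of the shapes involved, the mixture model of Eqs. (4)-(6) can be sketched in NumPy; all dimensions below are illustrative assumptions, not values from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
F, N = 257, 100                       # assumed STFT size: frequency bins x frames
Kex, Ka, Kf, Kn = 350, 50, 40, 10    # excitations, atomic filters, speech filters, noise bases

W_ex = rng.random((F, Kex))    # fixed excitation dictionary (periodic + noisy bases)
H_ex = rng.random((Kex, N))    # excitation activations
W_phi = rng.random((F, Ka))    # fixed dictionary of smooth "atomic" filters
U_phi = rng.random((Ka, Kf))   # coefficients combining atomic filters into speech filters
H_phi = rng.random((Kf, N))    # filter activations
W_n = rng.random((F, Kn))      # noise dictionary
H_n = rng.random((Kn, N))      # noise activations

# Eq. (6): speech filters are linear combinations of the atomic filters
W_phi_hat = W_phi @ U_phi
# Eqs. (4)-(5): speech = excitation (x) filter (Hadamard product), plus additive noise
V_speech = (W_ex @ H_ex) * (W_phi_hat @ H_phi)
V_tilde = V_speech + W_n @ H_n
print(V_tilde.shape)           # (257, 100)
```

Every factor is non-negative, so the approximation Ṽ is non-negative as well, as required by the NMF framework.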
3. SOURCE/FILTER MODEL UNDER CONSTRAINTS

3.1. Constrained-NMF

The main objective of this work is to inform NMF speech separation with a physical model of speech. For this purpose, we use constraints in the NMF to penalize solutions not respecting the speech model. Accordingly, the cost C is modified with the addition of the constraint penalty cost P:

    C = D_IS(V | Ṽ) + µP    (7)

with µ a positive value determining the weight of the constraint. The new multiplicative update for the i-th iteration is then of the form:

    Θ^(i+1) ← Θ^(i) ⊗ [∇⁻_{Θ^(i)}D_IS + µ∇⁻_{Θ^(i)}P] / [∇⁺_{Θ^(i)}D_IS + µ∇⁺_{Θ^(i)}P]    (8)

for any non-fixed matrix Θ in the model. In our method, we use several constraints and sum their values in order to obtain the total penalty cost P.

Fig. 2. Schematic example of realistic (green marks) and unrealistic (red crosses) combinations of excitations (top) and speech filters (bottom).
3.2. State-of-the-art constraints

We used three constraints from the literature:

• the sparsity constraint described in [16], defined as the negative normalized ℓ1-norm, in order to promote the activation, at any given time, of a single excitation basis and a single filter basis;

• the normalized decorrelation constraint, based on the correlation measure proposed in [17], in order to penalize simultaneous activation between bases;

• a smoothness constraint, proposed in [18], to prevent filter activations from jumping from one phoneme to another between frames.

3.3. Source/filter coherence constraint

The first contribution of this paper consists in the elaboration of a speech-specific source/filter coherence constraint for the IMM-NMF. This constraint is based on the fact that each phoneme corresponds to a match between one excitation and one corresponding filter:

• vocal filters corresponding to voiced phonemes will always be used simultaneously with a periodic excitation;

• vocal filters corresponding to unvoiced phonemes will always be used with a noisy excitation.

Because the IMM-NMF allows unrealistic combinations leading to audible artefacts, we propose a source/filter coherence constraint which aims to avoid such combinations between excitation and filter (see Figure 2). This constraint is inspired by the normalized decorrelation constraint, and is expressed as follows:

    P_φ = Σ_{k∈periodic, l∈unvoiced} [H^ex (H^Φ)^T]_{kl} / (‖H^ex_k‖_{ℓ2} ‖H^Φ_l‖_{ℓ2})
        + Σ_{k∈noise, l∈voiced} [H^ex (H^Φ)^T]_{kl} / (‖H^ex_k‖_{ℓ2} ‖H^Φ_l‖_{ℓ2})    (9)

The left term of the sum is a measure of the correlation between periodic excitation bases and filter bases corresponding to unvoiced phonemes, normalized by their power; the right term is the same measure, but between noisy excitation bases and filter bases corresponding to voiced phonemes.

3.4. Adaptation of the constraints' weights

One of the main limitations of constrained-NMF is the difficulty of finding a good constraint weight for speech separation. A small weight will null the effect of the constraint, while a strong weight will over-consider the constraint over the reconstruction cost, and thus lead to a wrong solution depending on the (generally random) initialization. The second contribution of this paper consists in adapting the constraint's weight at each iteration during speech separation: from small to strong, depending on the evolution of the reconstruction (i.e. the evolution rate of the β-divergence value). At the i-th iteration, the weight is updated as:

    µ^(i) = µ_max · D_IS(V | Ṽ^(i−1)) / D_IS(V | Ṽ^(i−2))    (10)

where µ is initialized at 0 for the first two iterations, and afterwards varies in the interval [0, µ_max], where µ_max is a chosen value. The stronger the β-divergence diminishes, the smaller the constraint; the less it diminishes, the higher the constraint. Figure 3 shows an example of the effect of this adaptive method on the evolution of the costs.
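The coherence penalty of Eq. (9) is straightforward to compute from the activation matrices; a sketch, where the index sets of periodic/noisy excitations and voiced/unvoiced filters are assumed to be known from the dictionary construction:

```python
import numpy as np

def coherence_penalty(H_ex, H_phi, periodic, noise, voiced, unvoiced, eps=1e-12):
    """Source/filter coherence penalty of Eq. (9): normalized correlation
    between periodic excitations and unvoiced filters, plus noisy
    excitations and voiced filters (index lists are assumed known)."""
    C = H_ex @ H_phi.T                         # [H_ex (H_phi)^T]_{kl}
    n_ex = np.linalg.norm(H_ex, axis=1) + eps  # ||H_k^ex||_2
    n_phi = np.linalg.norm(H_phi, axis=1) + eps
    Cn = C / np.outer(n_ex, n_phi)
    return Cn[np.ix_(periodic, unvoiced)].sum() + Cn[np.ix_(noise, voiced)].sum()

# toy example: 4 excitation bases (2 periodic, 2 noisy), 3 filter bases
# (2 voiced phonemes, 1 unvoiced), 30 frames of random activations
rng = np.random.default_rng(2)
H_ex = rng.random((4, 30))
H_phi = rng.random((3, 30))
p = coherence_penalty(H_ex, H_phi, periodic=[0, 1], noise=[2, 3],
                      voiced=[0, 1], unvoiced=[2])
```

By construction the penalty is zero exactly when no unrealistic pair (periodic/unvoiced or noisy/voiced) is ever active in the same frames, which is the behaviour the constraint enforces.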
Fig. 3. Evolution of the reconstruction cost (in green), the constraint cost (in red) and the total cost (in blue) with (right graph) and without (left graph) the weight adaptation method, as a function of the iteration number.
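The weight schedule of Eq. (10) needs nothing more than the history of reconstruction costs; a minimal sketch (the function name is ours):

```python
def adaptive_weight(d_hist, mu_max):
    """Constraint weight of Eq. (10): mu is 0 for the first two iterations,
    then mu_max scaled by the ratio of the two most recent reconstruction
    costs. d_hist lists the D_IS(V|V~) values of past iterations."""
    if len(d_hist) < 2:
        return 0.0
    return mu_max * d_hist[-1] / d_hist[-2]

# fast convergence (cost halves) -> weak constraint
print(adaptive_weight([10.0, 5.0], mu_max=2.0))   # 1.0
# near-stagnation -> weight close to mu_max
print(adaptive_weight([5.0, 4.9], mu_max=2.0))    # ~1.96
```

Since the multiplicative updates keep the reconstruction cost non-increasing, the ratio stays below 1 and µ indeed remains in [0, µ_max].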
Algorithms   |  Training   | IMM- | Constraints | Adaptation | −6 dB     | +0 dB     | +6 dB     |  Mean
             | Speech Noise| NMF  |             |            | SDR  PESQ | SDR  PESQ | SDR  PESQ | SDR  PESQ
ASNA [19]    |   ✓     ✓   |      |             |            | 5.8  2.00 | 10.7 2.44 | 15.0 2.85 | 10.5 2.43
V-IMM [2]    |             |  ✓   |             |            | 4.4  1.22 |  7.8 1.54 |  9.7 1.82 |  7.3 1.52
Proposed # 1 |   ✓         |  ✓   |             |            | 4.0  1.91 |  9.1 2.30 | 13.0 2.62 |  8.7 2.28
Proposed # 2 |   ✓         |  ✓   | SoA         | without    | 4.1  1.91 |  9.2 2.30 | 12.8 2.61 |  8.7 2.27
Proposed # 3 |   ✓         |  ✓   | coherence   | without    | 5.0  1.94 |  9.0 2.24 | 11.1 2.46 |  8.4 2.21
Proposed # 4 |   ✓         |  ✓   | all         | without    | 5.2  1.92 |  8.9 2.23 | 10.9 2.44 |  8.3 2.20
Proposed # 5 |   ✓         |  ✓   | SoA         | with       | 4.1  1.91 |  9.2 2.30 | 13.0 2.62 |  8.7 2.28
Proposed # 6 |   ✓         |  ✓   | coherence   | with       | 5.2  2.01 |  9.8 2.34 | 12.8 2.59 |  9.3 2.31
Proposed # 7 |   ✓         |  ✓   | all         | with       | 5.4  2.01 |  9.8 2.35 | 12.9 2.62 |  9.4 2.33

Table 1. Results from the experimental evaluation. SoA refers to the state-of-the-art constraints (decorrelation, sparsity and smoothness), coherence to the proposed source/filter coherence constraint, and adaptation to the adaptive weight method.

4. EXPERIMENT

4.1. Experimental setups

An experiment was conducted to evaluate the performance of the semi-supervised and constrained source/filter model for speech separation. The benchmark includes the semi-supervised source/filter model, with variants on the use of the constraints (with/without constraints, state-of-the-art constraints vs. source/filter coherence constraint, and with/without the constraint adaptation), compared to the state-of-the-art unsupervised V-IMM source/filter model [2] (originally developed for singing voice / music separation) and the supervised ASNA algorithm [19] (see Table 1 for details).

The database used for the experiment is a mix of the TIMIT speech database for clean speech [20] and the QUT-NOISE database for environmental background noises [21]. We used 20 TIMIT speakers (10 women and 10 men), with 10 sentences each: 2 sentences, shared among all the speakers, were used for training, and the 8 remaining sentences, different for all speakers, were used as the test set for speech separation. We mixed those 160 test sentences with 4 different background noises from the QUT-NOISE database (city street, home kitchen, car window, cafe) and white noise, using 3 signal-to-noise ratios (SNR) (−6 dB, 0 dB, +6 dB), resulting in 2,400 mixture signals. For the training, the 2 shared sentences were used for each speaker for the semi-supervised and the supervised algorithms, and one 5 s extract of the background noise (different from the one used for mixing) was used for the supervised algorithm. The performance of the speech separation was measured with the signal-to-distortion ratio (SDR) [22] (in dB) and the perceptual evaluation of speech quality (PESQ) [23].
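Mixing at a prescribed SNR, as done to build the −6/0/+6 dB test sets, amounts to scaling the noise before adding it; a sketch of this standard recipe (not the authors' exact pipeline):

```python
import numpy as np

def mix_at_snr(speech, noise, snr_db):
    """Scale the noise so that the speech/noise power ratio equals snr_db,
    then add it to the speech (time-domain signals of equal length)."""
    p_s = np.mean(speech**2)
    p_n = np.mean(noise**2)
    gain = np.sqrt(p_s / (p_n * 10.0**(snr_db / 10.0)))
    return speech + gain * noise

rng = np.random.default_rng(3)
s = rng.standard_normal(16000)   # 1 s of "speech" at 16 kHz (toy stand-in)
n = rng.standard_normal(16000)   # 1 s of "noise"
mix = mix_at_snr(s, n, snr_db=6.0)
```

The gain follows from requiring p_s / (gain² · p_n) = 10^(snr_db/10), so the achieved SNR matches the target exactly for any pair of signals.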
All benchmark speech separation algorithms were based on the STFT magnitude of the audio signal, using a Hamming window of 64 ms and a hop size of 32 ms. For training, the filter dictionary W^Φ was created with 50 Hann windows linearly spaced from 0 to 8000 Hz, and the excitation dictionary W^ex was created with 250 periodic bases (spanning every twentieth of a tone between 80 and 350 Hz) and 100 white noise bases. For testing, the maximum number of NMF iterations was set to 100, and we used the IS divergence with a noise floor of −60 dB (see [14] for details). Various constraint weights (from 10^−2 to 10^3) and numbers of background noise bases (from 5 to 100, used for all algorithms) were tested.
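The dictionary sizes quoted above can be reproduced directly: a spacing of a twentieth of a tone is a frequency ratio of 2^(1/120) (a tone being two semitones), and the atomic filters are Hann-shaped bumps on the frequency axis. The bin count and window support below are our assumptions:

```python
import numpy as np

# Periodic-excitation fundamentals: every twentieth of a tone from 80 to 350 Hz.
step = 2.0 ** (1.0 / 120.0)                      # 1/20 tone = 1/10 semitone
n_f0 = int(np.log2(350.0 / 80.0) * 120.0) + 1    # number of steps that fit
f0 = 80.0 * step ** np.arange(n_f0)
print(len(f0))                                   # 256

# Atomic filter dictionary W_phi: 50 Hann bumps linearly spaced on [0, 8000] Hz,
# sampled on an assumed 257-bin frequency axis.
freqs = np.linspace(0.0, 8000.0, 257)
centers = np.linspace(0.0, 8000.0, 50)
half = centers[1] - centers[0]                   # assumed half-width: one spacing
d = np.abs(freqs[:, None] - centers[None, :])
W_phi = np.where(d < half, 0.5 * (1.0 + np.cos(np.pi * d / half)), 0.0)
```

The fundamental count comes out near the 250 quoted in the text; the exact number depends on whether the endpoints are included.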
4.2. Results and Discussion

Table 1 summarizes the scores obtained for the benchmark algorithms, optimized for the state-of-the-art algorithms, and sharing the same optimal setup for all of the proposed algorithms. Firstly, the semi-supervised source/filter algorithm (# 1) improves the performance over the standard unsupervised source/filter model, which confirms the importance of training the filter dictionary for speech separation. Secondly, the use of constraints in the source/filter model without adaptation does not improve speech separation (# 1 vs. # 2, # 3, and # 4). In contrast, the use of constraints with adaptation (# 5, # 6, and # 7) substantially improves speech separation. This shows the importance of the adaptation during speech separation, which gradually increases the importance of the constraints depending on the convergence of the separation; this is especially true when using all constraints together (# 7). A comparison of the constraints reveals that the state-of-the-art constraints (decorrelation, sparsity and smoothness) have a small effect (# 5), whereas the source/filter coherence constraint provides a strong effect, and is more efficient at high SNR (# 6). Finally, the semi-supervised constrained source/filter algorithm shows encouraging performance compared to the state-of-the-art algorithms: it stands in between the unsupervised algorithm (V-IMM) and the supervised algorithm (ASNA). In particular, the semi-supervised algorithm is close to the supervised algorithm, without any prior knowledge about the nature of the noise environment. This confirms the importance of using prior knowledge on speech for informed speech separation.

5. CONCLUSION

In this paper, we presented a semi-supervised method for speech separation, based on a constrained source/filter model for NMF-based speech separation, with the addition of a speech-specific constraint and the adaptive weighting of the constraints during separation. An experimental validation proved the efficiency of the constraints for speech separation, and further indicates the importance of prior knowledge about speech and of physically-motivated speech separation. Further research will focus on the integration of a source/filter model for text-informed speech separation [5, 6] and speaker-independent speech separation (Universal Speech Model [3]), and on the unsupervised estimation of the background noise [4, 14].
6. REFERENCES

[1] T. Virtanen and A. Klapuri, "Analysis of polyphonic audio using source-filter model and non-negative matrix factorization," in Advances in Models for Acoustic Processing, Neural Information Processing Systems Workshop, 2006.
[2] J.-L. Durrieu, A. Ozerov, C. Févotte, G. Richard, and B. David, "Main instrument separation from stereophonic audio signals using a source/filter model," in European Signal Processing Conference (EUSIPCO), 2009, pp. 15–19. [Online]. Available: http://www.durrieu.ch/phd/eusipco09/
[3] D. L. Sun and G. J. Mysore, "Universal speech models for speaker independent single channel source separation," in IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2013, pp. 141–145.
[4] F. G. Germain and G. J. Mysore, "Speaker and noise independent online single-channel speech enhancement," in IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2015, pp. 71–75.
[5] G. J. Mysore and P. Smaragdis, "A non-negative approach to language informed speech separation," in Latent Variable Analysis and Signal Separation. Springer, 2012, pp. 356–363.
[6] L. Le Magoarou, A. Ozerov, and N. Q. Duong, "Text-informed audio source separation. Example-based approach using non-negative matrix partial co-factorization," Journal of Signal Processing Systems, pp. 1–15, 2014.
[7] D. D. Lee and H. S. Seung, "Learning the parts of objects by non-negative matrix factorization," Nature, vol. 401, no. 6755, pp. 788–791, 1999.
[8] A. Liutkus, J.-L. Durrieu, L. Daudet, and G. Richard, "An overview of informed audio source separation," in International Workshop on Image Analysis for Multimedia Interactive Services (WIAMIS), 2013, pp. 1–4.
[9] T. Virtanen, B. Raj, J. F. Gemmeke et al., "Active-set Newton algorithm for non-negative sparse coding of audio," in IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2014, pp. 3092–3096. [Online]. Available: http://www.esat.kuleuven.be/psi/spraak/downloads/
[10] Y. Wang and D. Wang, "A neural network for time-domain signal reconstruction: Towards improving the perceptual quality of supervised speech separation," Department of Computer Science and Engineering, The Ohio State University, Tech. Rep., 2014.
[11] J. Le Roux, J. R. Hershey, and F. Weninger, "Deep NMF for speech separation," in IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2015, pp. 66–70.
[12] L. Benaroya, F. Bimbot, and R. Gribonval, "Audio source separation with a single sensor," IEEE Transactions on Audio, Speech, and Language Processing, vol. 14, no. 1, pp. 191–199, 2006.
[13] C. Févotte and J. Idier, "Algorithms for nonnegative matrix factorization with the β-divergence," Neural Computation, vol. 23, no. 9, pp. 2421–2456, 2011.
[14] A. Roebel, J. Pons, M. Liuni, and M. Lagrange, "On automatic drum transcription using non-negative matrix deconvolution and Itakura-Saito divergence," in IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2015, pp. 414–418.
[15] F. Villavicencio, A. Röbel, and X. Rodet, "Improving LPC spectral envelope extraction of voiced speech by true-envelope estimation," in IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), vol. 1, 2006, pp. 869–872.
[16] C. Joder, F. Weninger, D. Virette, and B. Schuller, "A comparative study on sparsity penalties for NMF-based speech separation: Beyond Lp-norms," in IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2013, pp. 858–862.
[17] S. Z. Li, X. W. Hou, H. Zhang, and Q. Cheng, "Learning spatially localized, parts-based representation," in IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), vol. 1, 2001, pp. I-207.
[18] T. Virtanen, "Monaural sound source separation by nonnegative matrix factorization with temporal continuity and sparseness criteria," IEEE Transactions on Audio, Speech, and Language Processing, vol. 15, no. 3, pp. 1066–1074, 2007.
[19] T. Virtanen, J. F. Gemmeke, and B. Raj, "Active-set Newton algorithm for overcomplete non-negative representations of audio," IEEE Transactions on Audio, Speech, and Language Processing, vol. 21, no. 11, pp. 2277–2289, 2013.
[20] V. Zue, S. Seneff, and J. Glass, "Speech database development at MIT: TIMIT and beyond," Speech Communication, vol. 9, no. 4, pp. 351–356, 1990.
[21] D. B. Dean, S. Sridharan, R. J. Vogt, and M. W. Mason, "The QUT-NOISE-TIMIT corpus for the evaluation of voice activity detection algorithms," in Proceedings of Interspeech, 2010, pp. 3110–3113.
[22] E. Vincent, R. Gribonval, and C. Févotte, "Performance measurement in blind audio source separation," IEEE Transactions on Audio, Speech, and Language Processing, vol. 14, no. 4, pp. 1462–1469, 2006.
[23] A. W. Rix, J. G. Beerends, M. P. Hollier, and A. P.
Hekstra, “Perceptual evaluation of speech quality (PESQ) - a new method for speech quality assessment of telephone networks and codecs,” in IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), vol. 2, 2001, pp. 749–752.