filter model with adaptive constraints for

1 downloads 0 Views 4MB Size Report
Mar 24, 2016 - WΦ. UΦ. Trained. Hex. HΦ. Free. ICASSP 2016. D. Bouvier, N. Obin, A. Roebel, M. Liuni. Source/filter model in NMF-based speech separation.
A source/filter model with adaptive constraints for NMF-based speech separation Damien Bouvier1 , Nicolas Obin1 , Axel Roebel1 , Marco Liuni1 1 IRCAM, UPMC - Sorbonne Universités, CNRS

International Conference on Acoustics, Speech and Signal Processing 24 March 2016

State of the art

Proposed method

Experimental evaluation

State of the art • • •

Speech separation using NMF Semi-supervised NMF Source/filter model

Proposed method • • •

Semi-supervised constrained NMF Contribution 1: speech-specific source/filter coherence constraint Contribution 2: adaptive weight method

Experimental evaluation • • •

ICASSP 2016

Experiment description Effect of weight’s adaptation Algorithm comparison

D. Bouvier, N. Obin, A. Roebel, M. Liuni

Source/filter model in NMF-based speech separation

1 / 14

State of the art

Proposed method

Experimental evaluation

Speech separation using NMF

Signal has only 2 sources: speech and background sound

SPEAKER

Supervised algorithms ä [Mysore and Smaragdis, 2012]: language model ä [Virtanen et al., 2013]: new updates using Newton algorithm ä [Sun and Mysore, 2013]: Universal Speech Model (USM)

Semi-supervised algorithms ä [Germain and Mysore, 2015]: USM & online noise adaptation

Unspervised algorithms (but informed) ä [Le Magoarou et al., 2014]: use of textual information ä [Durrieu et al., 2009]: source/filter model for NMF

ICASSP 2016

D. Bouvier, N. Obin, A. Roebel, M. Liuni

Source/filter model in NMF-based speech separation

2 / 14

State of the art

Proposed method

Experimental evaluation

Semi-supervised NMF  argmin W S ,W N ,H N ≥0

e ) with C(V |V

e = W SHS + W N HN V W S learned

(1)

Training Speech

NMF

STFT

SPEAKER

WS

Separation SPEAKER

+

Noisy Speech

STFT

NMF W S, H S, W S, H N

Speech

Wiener Filtering SPEAKER

ISTFT

ICASSP 2016

D. Bouvier, N. Obin, A. Roebel, M. Liuni

Source/filter model in NMF-based speech separation

Noise

3 / 14

State of the art

Proposed method

Experimental evaluation

Source/filter model [Durrieu et al., 2009] source



glottis

ICASSP 2016

D. Bouvier, N. Obin, A. Roebel, M. Liuni

filter

- vocal tract

=

sound

- speech

Source/filter model in NMF-based speech separation

4 / 14

State of the art

Proposed method

Experimental evaluation

Source/filter model [Durrieu et al., 2009] ∗

source glottis

=

filter

.∗

V ex ICASSP 2016

.∗

D. Bouvier, N. Obin, A. Roebel, M. Liuni

sound

- speech

- vocal tract

=



=

Source/filter model in NMF-based speech separation

V 4 / 14

State of the art

Proposed method

Experimental evaluation

Source/filter model [Durrieu et al., 2009] ∗

source glottis

=

filter

H ex



W ex

bΦ W

.∗

(W ex ICASSP 2016

sound

- speech

- vocal tract

H ex )

.∗

D. Bouvier, N. Obin, A. Roebel, M. Liuni

=

bΦ W





=

Source/filter model in NMF-based speech separation

V 4 / 14

State of the art

Proposed method

Experimental evaluation

Source/filter model [Durrieu et al., 2009] ∗

source

=

filter

glottis H ex



W ex





.∗

(W ex ICASSP 2016

sound

- speech

- vocal tract

H ex )

.∗

=



D. Bouvier, N. Obin, A. Roebel, M. Liuni







=

Source/filter model in NMF-based speech separation

V 4 / 14

State of the art

Proposed method

Experimental evaluation

Source/filter model [Durrieu et al., 2009] ∗

source

=

filter

glottis H ex



sound

- speech

- vocal tract HΦ

Fixed

W ex



.∗

(W ex ICASSP 2016

H ex )

.∗

=



D. Bouvier, N. Obin, A. Roebel, M. Liuni







=

Source/filter model in NMF-based speech separation

V 4 / 14

State of the art

Proposed method

Experimental evaluation

Source/filter model [Durrieu et al., 2009] ∗

source

=

filter

glottis H ex



sound

- speech

- vocal tract HΦ

Fixed Trained W ex



.∗

(W ex ICASSP 2016

H ex )

.∗

=



D. Bouvier, N. Obin, A. Roebel, M. Liuni







=

Source/filter model in NMF-based speech separation

V 4 / 14

State of the art

Proposed method

Experimental evaluation

Source/filter model [Durrieu et al., 2009] ∗

source

=

filter

glottis H ex



W ex

HΦ Fixed Trained Free



.∗

(W ex ICASSP 2016

H ex )

.∗

sound

- speech

- vocal tract

=



D. Bouvier, N. Obin, A. Roebel, M. Liuni







=

Source/filter model in NMF-based speech separation

V 4 / 14

State of the art

Proposed method

Experimental evaluation

State of the art • • •

Speech separation using NMF Semi-supervised NMF Source/filter model

Proposed method • • •

Semi-supervised constrained NMF Contribution 1: speech-specific source/filter coherence constraint Contribution 2: adaptive weight method

Experimental evaluation • • •

ICASSP 2016

Experiment description Effect of weight’s adaptation Algorithm comparison

D. Bouvier, N. Obin, A. Roebel, M. Liuni

Source/filter model in NMF-based speech separation

4 / 14

State of the art

Proposed method

Experimental evaluation

Semi-supervised constrained NMF Physically-informed model ( argmin H ex ,H Φ ,W N ,H N ≥0

e ) with C(V |V

e = W ex H ex ⊗ W Φ U Φ H Φ + W N H N V W ex and W Φ fixed U Φ learned

(2)

But still no physically-coherent behavior.

ICASSP 2016

D. Bouvier, N. Obin, A. Roebel, M. Liuni

Source/filter model in NMF-based speech separation

5 / 14

State of the art

Proposed method

Experimental evaluation

Semi-supervised constrained NMF Physically-informed model ( argmin H ex ,H Φ ,W N ,H N ≥0

e ) with C(V |V

e = W ex H ex ⊗ W Φ U Φ H Φ + W N H N V W ex and W Φ fixed U Φ learned

(2)

But still no physically-coherent behavior.

Constraints for controlling its behavior Weight parameter

e) = C(V |V

e) D(V |V

| {z }

| {z }

Total cost

ICASSP 2016

+

Reconstruction cost

D. Bouvier, N. Obin, A. Roebel, M. Liuni

z}|{ λ

P(Θ)

(3)

| {z }

Constraint penalty

Source/filter model in NMF-based speech separation

5 / 14

State of the art

Proposed method

Experimental evaluation

Semi-supervised constrained NMF Physically-informed model ( argmin H ex ,H Φ ,W N ,H N ≥0

e ) with C(V |V

e = W ex H ex ⊗ W Φ U Φ H Φ + W N H N V W ex and W Φ fixed U Φ learned

(2)

But still no physically-coherent behavior.

Constraints for controlling its behavior Weight parameter

e) = C(V |V

e) D(V |V

| {z }

| {z }

Total cost

+

Reconstruction cost

z}|{ λ

P(Θ)

(3)

| {z }

Constraint penalty

New multiplicative update rules : Θ(i+1) ←− Θ(i) ⊗

− ∇− Θ D + λ∇Θ P + ∇+ Θ D + λ∇Θ P

∀Θ ∈



H ex , H Φ , W N , H N



(4)

Constraints from literature [Bertin, 2009]: Sparsity, Decorrelation, Smoothness. ICASSP 2016

D. Bouvier, N. Obin, A. Roebel, M. Liuni

Source/filter model in NMF-based speech separation

5 / 14

ASNASpeech [14] V-IMM #31nous l’utilisons #ASNA 2 [14] V-IMM # 4[7]3 3les ##51activation # 3# 7DE #des 4 évolutions #7 bases ayant similaires, même INTÉGRATION CONTRAINTES CHAPITR Speech3[7] 3#ici3 pour 33éliminer 3#32# 6 des33 33 3#35 3# 6 Training Training avec des échelles d’amplitudes différentes. State of the art Experimental evaluation Speech 3Noise Noise 3 Speech 3 3 Proposed 3 3 method 3 33 33 33 3 3 3 3 Training Training NoiseIMM-NMF 3 Noise 3 IMM-NMF 3 33 33 33 33 33 33 33 3 Même laSoA contrainte est Constraints Constraints IMM-NMF Coherence SoA 33Coherence All All SoAdeavec Coherence Allplus IMM-NMF 3 3 3 3 SoA 3 3si 33Coherence 3décorrélation 3 decorr 3 utile pour 3 assurer un bo Figure 2.5 3 –3 Comparaison W après séparation et All sans nous l’utilisons ici pour éliminer les activation Adaptation Adaptation Constraints Without Without adaptation weight adaptation WithAll weight Withadaptation weightSoA adaptation Constraints SoA Coherence Allweight SoA Coherence SoA Coherence All Coherencedes bases All ayant des é des échelles d’amplitudes différentes. Adaptation Adaptation Without weight adaptation With avec weight Without adaptation weight adaptation With weight adaptation SDR SDR 5.8 5.8 4.4 4.44.0 4.1 4.0 4.15.0 5.05.2 4.1 5.2 4.15.2 5.25.4 5.4 6dB 6dB 2.00 SDR 2.00 1.22 1.91 1.91 1.94 1.92 1.92 2.015.2 2.01 2.01 2.01 SDR 5.8PESQ PESQ 4.4 4.0 1.22 4.1 5.8 5.0 1.91 4.45.2 1.91 4.1 4.0 1.94 4.15.2 1.91 5.05.4 1.91 4.1 5.2 5.4 6dB 6dB Figure 2.5 – Comparaison W après2.01 séparation avec e PESQ 2.00 1.22 1.91 1.91 Contrainte 2.00 1.94 9.2 1.22 1.92cohérence 1.91 1.91 9.0 1.91 2.01 9.2 1.94 2.01 9.29.8 1.92 1.91 2.01 2.2.3 de source/filtre SDR SDR 10.7 PESQ 10.7 7.8 7.89.1 9.1 9.29.0 8.9 8.9 9.8 9.8 9.8 +0dB +0dB PESQ PESQ 2.44 SDR 2.44 1.54 2.30 2.30 2.24 2.23 2.23 2.348.9 2.34 2.35 2.35 SDR 10.7 7.8 9.1 1.54 9.2 10.7 9.0 2.30 7.88.9 2.30 9.2 9.1 2.24 9.29.8 2.30 9.09.8 2.30 9.2 9.8 9.8 +0dB +0dB En linguistique, on distingue deux grands type de phonèmes : PESQ 2.44 1.54 PESQ 2.30 2.30 2.44 2.24 1.54 2.23 2.30 2.30 2.30 2.34 2.24 2.35 2.23 2.30 2.34 2.35 SDR SDR 15.0 15.0 9.7 9.7 13.0 12.8 13.0 12.8 11.1 11.1 10.9 13.0 10.9 13.0 12.8 12.8 12.9 12.9 +6dB +6dB PESQ PESQ 2.85 SDR 2.85 1.82 1.82 2.62 2.62 2.46 2.44 2.44 2.59 2.62 2.62 SDR 15.0 9.7 13.0 12.8 15.0 11.1 2.61 9.7 10.92.61 13.0 13.0 2.46 12.8 12.8 2.62 11.1 12.92.62 10.9 2.59 13.0 12.8 12.9 +6dB +6dB Periodic Noisy PESQ 2.85 1.82 PESQ 2.62 • 2.61 2.85 2.46 voisées, 1.82 2.44 2.62 2.62 2.2.3 2.61 2.59à Contrainte 2.62 2.44 2.62 venant 2.59 2.62 les phonèmes qui correspondent une2.46 excitation harmonique des cordes de cohérence source/filtre SDR SDR 10.5 10.5 7.3vocales7.3 8.7 8.7 8.7 exemple 8.78.4 les voyelles 8.48.3 /a/8.7 8.3 8.79.3 9.39.4/m/ ou9.4 ; ce sont par ou /e/ ou les nasales /n/. Mean Mean PESQ PESQ 2.43 SDR 2.43 1.52 2.28 2.27 2.28 2.21 2.21 2.20 2.28 2.20 2.28 2.31distingue 2.31 2.33 2.33 SDR 10.5 7.3 8.7 1.52 10.5 8.4 non-voisées, 7.38.3 2.27 8.7 8.7 8.7 9.3linguistique, 8.49.4excitation 8.3 bruitée 8.7deux 9.3par 9.4phonèmes : • 8.7 les phonèmes qui correspondent à une ; cegrands sont exemple En on type de Mean Mean PESQ 2.43 1.52 PESQ 2.28 2.27 2.43 occlusives 2.21 1.52 2.20 2.28 2.31ou certaines 2.21 2.33 fricatives, 2.20 2.28 certaines , comme2.28 /b/ ou 2.27 /d/, comme /s/2.31 ou /f/. 2.33

Contribution 1: speech-specific source/filter coherence constraint Problem: unrealistic source/filter combinations possible

Amplitude (dB)

• les phonèmes voisées, qui correspondent à une excitation harm Table 1. Table Results 1. from Results the from experimental the experimental validation. validation. SoA SoA to refers thetests, State-of-the-Art to nous the State-of-the-Art constraints (decorrelation, (decorrelation, and sparsity smoothness) and smoothness) Lors desrefers premiers avons constraints pu observer que ;des filtressparsity vocaux associés à des phonèmes vocales ce sont par exemple les voyelles /a/ ou /e/ ou les nasa whereas Coherence refers to 1. the refers proposed toSoA thefrom proposed source/filter source/filter coherence coherence constraint. Bold values Bold are values the best arescores the best inconstraints scores the proposed in the proposed methods.methods. able 1. Resultswhereas from theCoherence experimental Table validation. Results refers thetoexperimental the State-of-the-Art constraints SoA refers (decorrelation, to the State-of-the-Art sparsity and smoothness) sparsity and smoothness) voisés étaientvalidation. activées enconstraint. même temps que des excitations et(decorrelation, inversement avec des filtres • les phonèmesbruitées, non-voisées, qui correspondent à une excitation br hereas Coherence refers to the proposed whereas source/filter Coherence coherence refers to constraint. the proposed Bold source/filter values are the coherence best scores constraint. in the proposed Bold values methods. are the best scores in the proposed methods. vocaux de phonèmes non-voisés. Malheureusement, celaocclusives résultait ,en la présence très certaines comme /b/ oud’artefacts /d/, ou certaines fricatives audibles, et baissait la qualité de la reconstruction. the algorithms the algorithms using theusing SDR the criteria SDRfor criteria BSS for defined BSS indefined [17] in [17] and3 pour chaque 3 Puisque nous and connaissons, base de W , le phonème correspondant, nous avons Lors des premiers tests, nous avons pu observer que des filtres vocau the PESQ, thewhich PESQ, is which a MOS ismeasure a MOS measure defined indefined [18]. in [18]. 8 8 e algorithms using the SDR criteria the for algorithms BSS defined using in [17] the and SDR criteria 3 for BSS defined in [17] and 3 pensé à créer une contrainte de cohérence source/filtre pour empêcher l’apparition de ce phénovoisés e PESQ, which is aThe MOS measure defined the PESQ, in were [18]. which ison a MOS measure 8STFT defined in [18]. of 8 étaient activées en même temps que des excitations bruitées, et proposed The proposed algorithms algorithms usedwere the usedSTFT on the magnitude magnitude of mène. observed the observed signals, using signals, a Hamming using Hamming window of window msof and 64aon ms hop andSTFT a hopmagnitude of vocaux de phonèmes non-voisés. Malheureusement, cela résultait en l The proposedthe algorithms were used onThe the proposed STFTa magnitude algorithms of 64 were used the audibles, et baissait la qualité de la reconstruction. size using of 32size ofThe 32 ms. maximum The maximum number number iteration ofa Hamming iteration of the NMF ofwindow the wasNMF was e observed signals, ams. Hamming the window observed of 64 signals, msof and using a hop of 64 ms and a hop 6. CONCLUSION 6. CONCLUSION /s/ pour chaque base de W , le phonème c fixed to 100, fixed and to 100, we size used and the IS divergence the divergence with noise with floor aofnoise of floorofofthe 6. ze of 32 ms. The maximum number of iteration ofwe 32used ms. of the TheIS NMF maximum was anumber iteration NMF was Puisque nous connaissons, CONCLUSION 6. CONCLUSION pensé à créer une contrainte de cohérence source/filtre pour empêche 60dB (see 60dB The [8]). filterbank filterbank W was W created was created 50 Hann with 50 Hann xed to 100, and we used the[8]). IS (see divergence fixed toThe with 100, aand noise we floor used of the with IS divergence with a this noise floor of In paper, In this we paper, presented we presented a new semi-supervised a new semi-supervised method for method speech for speech /a/ mène. spaced windowlinearly spaced from linearly 0(see tofrom 8000 0 Hz to and the Hz excitation and the excitation dicdic60dB (see [8]).window The filterbank W was 60dB created [8]). with The 508000 Hann filterbank W was created with 50 Hann separation, separation, a constrained using a constrained NMF NMF a source/filter with a source/filter decomposidecomposiIn this paper, we presented a using new semi-supervised In this paper, method wewith presented for speech a new semi-supervised method for speech ex ex tionary W tionary was made was up with 250 up with periodic 250 periodic basis basis (spanning every every indow spaced linearly from 0 toW 8000 window Hzmade andspaced the excitation linearly from dic- (spanning 0 to 8000 Hz and the excitation diction from tion the from literature. the literature. The main The contribution main contribution are a new are speecha new speech- decomposiseparation, using a constrained NMF with separation, a source/filter using a decomposiconstrained NMF with a source/filter Voiced ( /i/ ) periodic Unvoiced ( /s/ ) ensuring of tone250 between ofperiodic tone 80 between and 350 80 Hz) and and 350every 100 Hz)with white and 100 noise white basis. noise basis. onary W ex wastwentieth made uptwentieth with tionary basis W ex (spanning was made up 250 basis (spanning every specific constraint, specific ensuring phonetic phonetic coherence, coherence, and adaptive and an adaptive tion from the literature. The mainconstraint, contribution tion from the are literature. a new speechThe mainancontribution are a new speechwentieth of tone between 80 and 350 Hz) twentieth and 100 ofwhite tone between noise basis. 80 and 350 Hz) and 100 white noise basis. weight method weight for method constraints. for constraints. specific constraint, ensuring phonetic specific coherence, constraint, and an ensuring adaptive phonetic coherence, and an adaptive /u/ /a/ method for constraints. weight method for constraints. weight Frequency (Hz) 5. RESULTS 5. RESULTS AND DISCUSSION AND DISCUSSION 7. REFERENCES 7. REFERENCES 5. RESULTS AND DISCUSSION 5. RESULTS AND DISCUSSION s /i/ 7. REFERENCES The results Thethat results we show that we here show are here the one are obtained the one obtained for optimized for optimized 7. REFERENCES [1]forG.optimized J.[1]Mysore G. J. and Mysore P. Smaragdis, and P. Smaragdis, “A non-negative “A non-negative approachapproach values ofvalues constraint of constraint weight and weight number of number noise here basis; of noise those basis; values values he results thatWe we show here want are the The one obtained thatand for we optimized show are the onethose obtained only toresults allow: a [1] G. J. Mysore and /u/ to language to language informed informed speech separation,” speech separation,” LatentinVariable Latent Variable P. Smaragdis, [1] “A G. non-negative J. Mysore approach and P. inSmaragdis, “A non-negative approach were found were by found testing by on testing a separate on a and separate smaller and database. smaller database. alues of constraint weight and numbervalues of noise of constraint basis; those weight values and number of noise basis; those values Analysis Analysis andseparation,” Signal andto Separation. Signal Separation. Springer, Springer, 2012, pp. 2012,in pp. tosmaller language informed speech language in Latent informed Variable speech separation,” Latent Variable ere found by testing on a separate and were smaller found database. by testing on a separate and database. Table 1 shows Table 1 our shows experimental our experimental results for results all the for tested all the algotested algo• periodic excitation with adequate filter (e.g. vowels, voice consonants) 356–363. 356–363. [Online]. [Online]. Available: Available: https://ccrma.stanford.edu/ https://ccrma.stanford.edu/ Analysis and Signal Separation. Analysis Springer, and 2012, Signal pp. Separation. Springer, 2012, pp. i s rithms. Firstly, rithms.we Firstly, can see we that can see baseline thattested our baseline (# 1) results has (# bet1) Table 1 shows our experimental results Table for our all 1 shows the ouralgorithm experimental algo- algorithm forhas allbetthe tested algo⇠gautham/Site/lva ⇠gautham/Site/lva ica 2012.html ica 2012.html 356–363. [Online]. Available: https://ccrma.stanford.edu/ 356–363. [Online]. Available: https://ccrma.stanford.edu/ •results noisy with adequate filter (e.g., consonants) tercan ter than results theexcitation algorithm than thealgorithm algorithm V-IMM, as1) expected as expected byour thebaseline inclusion by thealgorithm inclusion thms. Firstly, we see that our baseline rithms. Firstly, we (#V-IMM, can has seebetthat (# 1) unvoiced has bet⇠gautham/Site/lva ica 2012.html ⇠gautham/Site/lva ica 2012.html u a [2] L. Le [2] Magoarou, L. Le Magoarou, A. Ozerov, A. and Ozerov, N. Q. and Duong, N. Q. “Text-informed Duong, “Text-informed of aalgorithm learning of a V-IMM, step learning in the step in the than IMM-NMF. r results than the ter asIMM-NMF. expected results by the the algorithm inclusion V-IMM, as expected by the inclusion source audio separation. source separation. example-based example-based using non[2] L. Le A. Ozerov, and N.[2] Q. L. Duong, Le Magoarou, “Text-informed A.approach Ozerov, approach and N. nonQ. using Duong, “Text-informed f a learning step in the IMM-NMF. ofifawe learning stepalgorithms in thesimilar IMM-NMF. Secondly, Secondly, if we compare compare algorithms with withconstraints, similar constraints, weMagoarou, we audio negative we negative matrix partial matrix co-factorization,” partial Journal Journal of Signal Signal using nonsource separation. example-based audio approach source using separation. nonexample-based approach observe that observe our adaptive that our adaptive weight method weight method the separation theaudio separation re-de reLe principe cette contrainte serait id’empêcher les co-factorization,” activations simultanées deofcertaines Secondly, if we compare algorithms with Secondly, similar constraints, if we improve compare weimprove algorithms with similar constraints, ex Processing Processing Systems, Systems, pp. 1–15, pp. 2014. 1–15, 2014. partial co-factorization,” negative Journal matrix of Signal co-factorization,” Journal sults for sults all SNR, formethod all especially SNR, especially when theseparation when coherence the weight coherence constraint isnegative used. ismatrix used. lignesmethod deconstraint H et H queseparation l’on connaît ; nous avons donc pensé àpartial utiliser une formule basée sur of Signal bserve that our adaptive weight observe improve thatthe our adaptive reimprove the reICASSP 2016 N. A.algorithm Roebel, M. Liuni Source/filter NMF-based speechpp.separation 6 / 14 Systems, pp. 1–15,model 2014. inProcessing Systems, 1–15, 2014. We can also WeD. see canBouvier, that alsothe seebest thatObin, algorithm the best we proposed we proposed is the oneProcessing isusing the one using

State of the art

Proposed method

Experimental evaluation

Contribution 1: speech-specific source/filter coherence constraint

New constraint (that requires phoneme-labelled spectral filter basis)

Pφ (H ex , H Φ )

ICASSP 2016

D. Bouvier, N. Obin, A. Roebel, M. Liuni

(5)

Source/filter model in NMF-based speech separation

7 / 14

State of the art

Proposed method

Experimental evaluation

Contribution 1: speech-specific source/filter coherence constraint

New constraint (that requires phoneme-labelled spectral filter basis)

h Pφ (H ex , H Φ ) =

X

H ex H Φ

T

i kl

(5)

k∈periodics l∈unvoiced

ä

ICASSP 2016

: measure of correlation

D. Bouvier, N. Obin, A. Roebel, M. Liuni

Source/filter model in NMF-based speech separation

7 / 14

State of the art

Proposed method

Experimental evaluation

Contribution 1: speech-specific source/filter coherence constraint

New constraint (that requires phoneme-labelled spectral filter basis)

h Pφ (H ex , H Φ ) =

X k∈periodics l∈unvoiced

ä ä

ICASSP 2016

H ex H Φ

T

i kl

kHkex k` kHlΦ k` 2

(5)

2

: measure of correlation : normalized

D. Bouvier, N. Obin, A. Roebel, M. Liuni

Source/filter model in NMF-based speech separation

7 / 14

State of the art

Proposed method

Experimental evaluation

Contribution 1: speech-specific source/filter coherence constraint

New constraint (that requires phoneme-labelled spectral filter basis)

h Pφ (H ex , H Φ ) =

X k∈periodics l∈unvoiced

ä ä ä

ICASSP 2016

H ex H Φ

T

h

i kl

kHkex k` kHlΦ k` 2

2

+

X k∈noisy l∈voiced

H ex H Φ

T

i kl

kHkex k` kHlΦ k` 2

(5) 2

: measure of correlation : normalized : for both type of unwanted combination

D. Bouvier, N. Obin, A. Roebel, M. Liuni

Source/filter model in NMF-based speech separation

7 / 14

State of the art

Proposed method

Experimental evaluation

Contribution 2: adaptive weight method Main issue with constrained NMF Adjusting the weight parameter λ : • if too small, no effect is visible; • if too big, convergence becomes extremely sensitive to initialization (which is typically random).

ICASSP 2016

D. Bouvier, N. Obin, A. Roebel, M. Liuni

Source/filter model in NMF-based speech separation

8 / 14

State of the art

Proposed method

Experimental evaluation

Contribution 2: adaptive weight method Main issue with constrained NMF Adjusting the weight parameter λ : • if too small, no effect is visible; • if too big, convergence becomes extremely sensitive to initialization (which is typically random).

Idea Adjust the constraint weight at each iteration of the NMF: • constraint relaxed during strong evolution of the reconstruction cost; • constraint enforced when the reconstruction is more stable;

ICASSP 2016

D. Bouvier, N. Obin, A. Roebel, M. Liuni

Source/filter model in NMF-based speech separation

8 / 14

State of the art

Proposed method

Experimental evaluation

Contribution 2: adaptive weight method Main issue with constrained NMF Adjusting the weight parameter λ : • if too small, no effect is visible; • if too big, convergence becomes extremely sensitive to initialization (which is typically random).

Idea Adjust the constraint weight at each iteration of the NMF: • constraint relaxed during strong evolution of the reconstruction cost; • constraint enforced when the reconstruction is more stable;

Adaptive method λ(i) = λMax



ICASSP 2016

e (i−1) ) D(V |V

e ) the reconstruction cost D(V |V λ ∈ [0 λMax ]

D. Bouvier, N. Obin, A. Roebel, M. Liuni

(6)

e (i−2) ) D(V |V 

Source/filter model in NMF-based speech separation

8 / 14

State of the art

Proposed method

Experimental evaluation

State of the art • • •

Speech separation using NMF Semi-supervised NMF Source/filter model

Proposed method • • •

Semi-supervised constrained NMF Contribution 1: speech-specific source/filter coherence constraint Contribution 2: adaptive weight method

Experimental evaluation • • •

ICASSP 2016

Experiment description Effect of weight’s adaptation Algorithm comparison

D. Bouvier, N. Obin, A. Roebel, M. Liuni

Source/filter model in NMF-based speech separation

8 / 14

State of the art

Proposed method

Experimental evaluation

Experiment description

Database TIMIT [Zue et al., 1990] : 20 speakers, 2 learnings sentences and 8 test sentences QUT-NOISE [Dean et al., 2010] : 5 types of noise Mixed at 3 Signal-to-Noise Ratio (−6dB, +0dB and +6dB)

Benchmark SoA Proposed

• • • • • •

ASNA [Virtanen et al., 2013] IMM [Durrieu et al., 2009] S-IMM: without contrainsts SC-IMM1: state-of-the art contrainsts SC-IMM2: source/filter coherence constraint SC-IMM3: all constraints

supervised unsupervised semi-supervised

Measures • SDR : Signal to Distortion Ratio (in dB) • PESQ : Perceptual Evaluation of Speech Quality (from 1 (bad) to 5 (excellent))

ICASSP 2016

D. Bouvier, N. Obin, A. Roebel, M. Liuni

Source/filter model in NMF-based speech separation

9 / 14

State of the art

Proposed method

Experimental evaluation

Effect of weight’s adaptation

SNR

−6dB +0dB +6dB Mean

With adaptation

Measure

Without adaptation

SC-IMM1

SC-IMM2

SC-IMM3

SC-IMM1

SC-IMM2

SC-IMM3

SDR (dB) PESQ SDR (dB) PESQ SDR (dB) PESQ

4.1 1.91 9.2 2.30 13.0 2.62

5.2 2.01 9.8 2.34 12.8 2.59

5.4 2.01 9.8 2.35 12.9 2.62

4.1 1.91 9.2 2.30 12.8 2.61

5.0 1.94 9.0 2.24 11.1 2.46

5.2 1.92 8.9 2.23 10.9 2.44

SDR (dB) PESQ

8.7 2.28

9.3 2.31

9.4 2.33

8.7 2.27

8.4 2.21

8.3 2.20

⇒ adaptation gives best results

ICASSP 2016

D. Bouvier, N. Obin, A. Roebel, M. Liuni

Source/filter model in NMF-based speech separation

10 / 14

State of the art

Proposed method

Experimental evaluation

Algorithm comparison

−6dB +0dB +6dB Mean

ICASSP 2016

Algorithms

Measure

SNR

ASNA

IMM

S-IMM

SC-IMM1

SC-IMM2

SC-IMM3

SDR (dB) PESQ SDR (dB) PESQ SDR (dB) PESQ

5.8 2.00 10.7 2.44 15.0 2.85

4.4 1.22 7.8 1.54 9.7 1.82

4.0 1.91 9.1 2.30 13.0 2.62

4.1 1.91 9.2 2.30 13.0 2.62

5.2 2.01 9.8 2.34 12.8 2.59

5.4 2.01 9.8 2.35 12.9 2.62

SDR (dB) PESQ

10.5 2.43

7.3 1.52

8.7 2.28

8.7 2.28

9.3 2.31

9.4 2.33

D. Bouvier, N. Obin, A. Roebel, M. Liuni

Source/filter model in NMF-based speech separation

11 / 14

State of the art

Proposed method

Experimental evaluation

Algorithm comparison

−6dB +0dB +6dB Mean

Algorithms

Measure

SNR

ASNA

IMM

S-IMM

SC-IMM1

SC-IMM2

SC-IMM3

SDR (dB) PESQ SDR (dB) PESQ SDR (dB) PESQ

5.8 2.00 10.7 2.44 15.0 2.85

4.4 1.22 7.8 1.54 9.7 1.82

4.0 1.91 9.1 2.30 13.0 2.62

4.1 1.91 9.2 2.30 13.0 2.62

5.2 2.01 9.8 2.34 12.8 2.59

5.4 2.01 9.8 2.35 12.9 2.62

SDR (dB) PESQ

10.5 2.43

7.3 1.52

8.7 2.28

8.7 2.28

9.3 2.31

9.4 2.33

⇒ supervisation helps separation

ICASSP 2016

D. Bouvier, N. Obin, A. Roebel, M. Liuni

Source/filter model in NMF-based speech separation

11 / 14

State of the art

Proposed method

Experimental evaluation

Algorithm comparison

−6dB +0dB +6dB Mean

Algorithms

Measure

SNR

ASNA

IMM

S-IMM

SC-IMM1

SC-IMM2

SC-IMM3

SDR (dB) PESQ SDR (dB) PESQ SDR (dB) PESQ

5.8 2.00 10.7 2.44 15.0 2.85

4.4 1.22 7.8 1.54 9.7 1.82

4.0 1.91 9.1 2.30 13.0 2.62

4.1 1.91 9.2 2.30 13.0 2.62

5.2 2.01 9.8 2.34 12.8 2.59

5.4 2.01 9.8 2.35 12.9 2.62

SDR (dB) PESQ

10.5 2.43

7.3 1.52

8.7 2.28

8.7 2.28

9.3 2.31

9.4 2.33

⇒ best proposed algorithm: SC-IMM3 (with all constraints)

ICASSP 2016

D. Bouvier, N. Obin, A. Roebel, M. Liuni

Source/filter model in NMF-based speech separation

11 / 14

State of the art

Proposed method

Experimental evaluation

Algorithm comparison Algorithms

Measure

SNR

−6dB +0dB +6dB Mean

ASNA

IMM

S-IMM

SC-IMM1

SC-IMM2

SC-IMM3

SDR (dB) PESQ SDR (dB) PESQ SDR (dB) PESQ

5.8 2.00 10.7 2.44 15.0 2.85

4.4 1.22 7.8 1.54 9.7 1.82

4.0 1.91 9.1 2.30 13.0 2.62

4.1 1.91 9.2 2.30 13.0 2.62

5.2 2.01 9.8 2.34 12.8 2.59

5.4 2.01 9.8 2.35 12.9 2.62

SDR (dB) PESQ

10.5 2.43

7.3 1.52

8.7 2.28

8.7 2.28

9.3 2.31

9.4 2.33

⇒ better than Durrieu & close of Virtanen in low SNRs Speech Noise Text : "Computers are being used to keep branch inventories at more workable levels."

ICASSP 2016

D. Bouvier, N. Obin, A. Roebel, M. Liuni

Source/filter model in NMF-based speech separation

11 / 14

State of the art

Proposed method

Experimental evaluation

Conclusion

Summary • Semi-supervised speech separation • Source/filter model

Contributions • Weight adaptation method for constraints • Source/filter coherence constraint for speech • Good results close to literature in supervised separation

Further research • Speaker-independant model [Sun and Mysore, 2013] • Integration of a language model [Mysore and Smaragdis, 2012] • Integration of a noise adaptation method [Roebel et al., 2015]

ICASSP 2016

D. Bouvier, N. Obin, A. Roebel, M. Liuni

Source/filter model in NMF-based speech separation

12 / 14

Bibliography

Bibliography I Gautham J. Mysore and Paris Smaragdis. A non-negative approach to language informed speech separation. In Latent Variable Analysis and Signal Separation, pages 356–363. Springer, 2012. Tuomas Virtanen, Jort F Gemmeke, and Bhiksha Raj. Active-set Newton algorithm for overcomplete non-negative representations of audio. IEEE Transactions on Audio, Speech, and Language Processing, 21(11):2277–2289, 2013. Dennis L. Sun and Gautham J. Mysore. Universal speech models for speaker independent single channel source separation. In IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), pages 141–145, 2013. François G. Germain and Gautham J. Mysore. Speaker and noise independent online single-channel speech enhancement. In IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), pages 71–75, 2015. Luc Le Magoarou, Alexey Ozerov, and Ngoc QK Duong. Text-informed audio source separation. Example-based approach using non-negative matrix partial co-factorization. Journal of Signal Processing Systems, pages 1–15, 2014. Jean-Louis Durrieu, Alexey Ozerov, Cédric Févotte, Gaël Richard, and Bertrand David. Main instrument separation from stereophonic audio signals using a source/filter model. In European Signal Processing Conference (EUSIPCO), pages 15–19, 2009. URL http://www.durrieu.ch/phd/eusipco09/. ICASSP 2016

D. Bouvier, N. Obin, A. Roebel, M. Liuni

Source/filter model in NMF-based speech separation

13 / 14

Bibliography

Bibliography II

Nancy Bertin. Les factorisations en matrices non-négatives. Approches contraintes et probabilistes, application à la transcription automatique de musique polyphonique. PhD thesis, Télécom ParisTech, 2009. Victor Zue, Stephanie Seneff, and James Glass. Speech database development at MIT: TIMIT and beyond. Speech Communication, 9(4):351–356, 1990. David B. Dean, Sridha Sridharan, Robert J. Vogt, and Michael W. Mason. The QUT-NOISE-TIMIT corpus for the evaluation of voice activity detection algorithms. Proceedings of Interspeech 2010, pages 3110–3113, 2010. Axel Roebel, Jordi Pons, Marco Liuni, and Mathieu Lagrange. On automatic drum transcription using non-negative matrix deconvolution and Itakura-Saito divergence. In IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), pages 414–418, 2015.

ICASSP 2016

D. Bouvier, N. Obin, A. Roebel, M. Liuni

Source/filter model in NMF-based speech separation

14 / 14

Suggest Documents