Changing Timbre and Phrase in Existing Musical Performances as You Like
— Manipulations of Single Part Using Harmonic and Inharmonic Models —

Naoki Yasuraoka, Takehiro Abe, Toru Takahashi, Katsutoshi Itoyama, Tetsuya Ogata, Hiroshi G. Okuno

Dept. of Intelligence Science and Technology, Graduate School of Informatics, Kyoto University, Sakyo-ku, Kyoto 606-8501, Japan

ABSTRACT

This paper presents a new music manipulation method that can change the timbre and phrases of an existing instrumental performance in a polyphonic sound mixture. This method consists of three primitive functions: 1) extracting and analyzing a single instrumental part from polyphonic music signals, 2) mixing the instrument timbre with another, and 3) rendering a new phrase expression for another given score. The resulting customized part is re-mixed with the remaining parts of the original performance to generate new polyphonic music signals. A single instrumental part is extracted by using an integrated tone model that consists of harmonic and inharmonic tone models with the aid of the score of the single instrumental part. The extraction incorporates a residual model for the single instrumental part in order to avoid crosstalk between instrumental parts. The extracted model parameters are classified into their averages and deviations. The former is treated as instrument timbre and is customized by mixing, while the latter is treated as phrase expression and is customized by rendering. We evaluated our method in three experiments. The first experiment focused on the introduction of the residual model, and it showed that the model parameters are estimated more accurately by 35.0 points. The second focused on timbral customization, and it showed that our method is more robust by 42.9 points in spectral distance compared with a conventional sound analysis-synthesis method, STRAIGHT. The third focused on the acoustic fidelity of the customized performance, and it showed that rendering phrase expression according to the note sequence leads to a more accurate performance by 9.2 points in spectral distance in comparison with a rendering method that ignores the note sequence.

Categories and Subject Descriptors
H.5.5 [Sound and Music Computing]: Signal analysis, synthesis, and processing

General Terms
Algorithms, Design

Keywords
music manipulation, performance rendering, signal processing, sound source extraction, timbre mixing

1. INTRODUCTION

One of the dreams in computer-supported music listening is to customize an existing music performance by editing it as the user likes. The user may replace an instrumental sound with another to mix a new instrument timbre. He/she may also replace phrases with others to render a new phrase expression. For example, by replacing a musical instrument typically used in rock music, such as electric guitar, electric bass, or keyboard, with one used in classical music, such as violin, upright bass, or piano, the user could enjoy a classical remix of the original music performance. In addition, by extracting the phrase expression related to the score from a tune played by his/her favorite guitarist and rendering this expression according to another score, he/she could enjoy various phrases virtually played by his/her favorite guitarist.

Yoshii's Drumix [1] and Itoyama's Instrument Equalizer [2] are examples of music customization applications based on source separation for two-mixed polyphonic sound mixtures. Drumix enables the user to change the volume of the drum part and to replace its tone with other recorded sound samples, and Instrument Equalizer can change the volume of any part. Unfortunately, these applications do not change musical elements other than the volume.

This paper presents a new music manipulation method that changes the timbre and phrase of an existing instrumental performance in a polyphonic sound mixture by incorporating score information about the performance. Figure 1 depicts an overview of our approach. We define instrument timbre as a musical component that is determined by the musical instrument and does not change during the performance. The deviated component is defined as phrase expression, which changes mainly according to the musical score. Accordingly, the system has the following three primitive functions: 1) extracting and analyzing a single instrumental part, 2) manipulating timbre by mixing one instrument timbre with another, and 3) manipulating phrase, i.e., changing the score being played, by rendering an appropriate phrase expression for the user-specified score. The resulting changed part is re-mixed with the remaining parts of the original musical piece to generate new polyphonic music signals.


Figure 1: System overview
Figure 2: Flowchart of our approach

We call the instrumental part the user customizes the "source part." This work contributes to the following applications:

1. Automatic musical medley generator: The user specifies a set of musical pieces and their playing order, and this application produces a series of musical pieces by connecting them smoothly in terms of chord, phrase, and timbre. Our method enables smooth connections by synthesizing chords and phrases, incorporating some automatic composing technique.

2. Phrase training tool with favorite-artist sound: The user can train fingering or bowing using his/her favorite artist's performance. Given CDs and their musical scores, our method can synthesize a performance that is virtually played by the favorite player.

3. Source creation tool for expressive instrument synthesizers such as Synful [3]: The synthesizer can be created from monaural musical audio signals instead of studio-equipped solo recordings. This application could reduce the costs of preparing sound sources. (Synful is an instrument synthesizer with automatic expression and smart articulation based on realistic instrument simulations of the transitions between notes, as well as the timbres of each individual note.)

4. Timbre- and phrase-changeable remastering tool: This tool enables a recording engineer to reproduce a two-mixed old recording as a trendier and unique one, compared with a conventional mastering tool that can control only loudness or frequency response. For example, a vintage crunch guitar sound can be changed into a modern overdrive one, and well-known solo phrases can be replaced with newer ones. (Mastering is a process for audio CD production that includes signal restoration operations for each musical piece such as leveling, fading in and out, and noise reduction. Remastering means creating new masters and implies sound quality enhancement of a previously existing recording.)

There are many related studies that deal with the analysis, manipulation, and synthesis required for this work. Many studies on music content analysis (pitch detection, instrument recognition, beat tracking, etc.) deal with computers recognizing the characteristics of timbre or phrase expression [4-6]. However, most of them utilize features that cannot be easily re-synthesized. For example, although Mel-frequency cepstral coefficients (MFCC) were found to be a well-performing feature set for musical instrument recognition [7], they do not keep sufficient information for reproducing audio signals. In contrast, several sound manipulation and synthesis methods such as the vocoder (especially STRAIGHT [8]) and the phase vocoder [9] can efficiently manipulate the pitch and duration of an instrumental or vocal sound. However, these methods are not designed to represent timbre or phrase expression explicitly, and hence they are limited to only simple manipulations. Harmonic-temporal clustering (HTC) [10] and the sinusoidal model [11, 12] are based on semi-parametric models that can easily analyze and synthesize instrumental sounds. Rendering of phrase expression of musical performances has also been widely researched. However, many researchers in performance rendering focus mainly on volume and timing, because of a lack of analysis at the acoustic-signal-processing level and missing information on the deviations in timbre [13, 14]. Some systems modify the phrase expression of recorded performance sound [15, 16], but they are not designed for modifying performances in a polyphonic audio mixture.

Our method is formulated as analysis, manipulation, and synthesis of the power spectrogram. We define a mathematical musical-tone model that represents each single tone by referring to Itoyama's integrated model [2], which is the sum of a parametric harmonic-structure model and a non-parametric inharmonic-structure model similar to the HTC model. The extraction of the source part from polyphonic music is then regarded as an estimation of these model parameters. The instrument timbre is regarded as the average parameters of the tone models, and the phrase expression is regarded as deviations from the average tone-model parameters. We call the averages the instrument model and the deviations the expression model. A flowchart of our formulation is shown in Figure 2. The extraction process using this tone model embodies both single-part extraction and model parameter estimation within the same framework and is one of our method's advantages over other source separation methods such as blind sound source separation [17]. From the viewpoint of the structure of the tone model, our integrated tone model is semi-parametric and therefore can easily be transformed according to the timbre or phrase expression, unlike the other sound manipulation methods mentioned above. Our model incorporates several new ideas for efficient implementation:

1. We introduce a residual model to extract and analyze one specific instrumental part by using only the music score of the source part. Although our method needs score information for proper analysis, preparing a complete score including all instrument parts is difficult for a typical user who is not an expert in acoustics or music. We aim to use only the source part's score, consisting solely of the attack time and pitch name information (e.g., C3 and A4) for each note. This is an important revision with respect to Itoyama's Instrument Equalizer, whose need for almost the whole musical score is a difficult point, because its volume control function is based on separating all parts in the musical piece. The residual model tries to absorb the spectrum that the score does not convey about the source part, i.e., the accompaniment and reverberation.

2. To ensure that the analysis is robust against repeated mixing (i.e., mixing of already mixed sounds), we add to our tone model a parameter that represents inharmonicity: the degree to which the frequencies of overtones depart from whole multiples of the fundamental frequency [18, 19]. Defining the tone model as a semi-parametric one including the inharmonicity coefficient maintains acoustic quality and enables more accurate mixing. This is because such a model can align the timbral characteristics of two or more sounds better than a non-parametric model such as STRAIGHT can, for which repeated mixing degrades acoustic quality [20].

3. Rendering a new phrase expression for a given score is done taking into account the pitch-name transition sequence. In order to re-synthesize the audio signal, phrase expressions must be handled as acoustic characteristics and must carry appropriate information for synthesis. This means that the sound should be represented by time-series characteristics, and it is therefore important to maintain their time continuity.

2. MUSICAL-TONE MODEL BASED ON TIMBRAL FEATURES

The key to developing a method of music customization is how the instrumental sound is represented. We represent a musical tone as a mathematical model so that the parameters of the model can represent timbral features such as pitch and brightness. This enables flexible manipulation of musical sound as well as accurate analysis and synthesis. For example, if pitch is explicitly defined in the tone model, we can easily synthesize a vibrato sound from a normal sound, which would be difficult to do if the normal sound were simply represented as a temporal waveform. We designed the tone model by referring to the acoustic-psychology finding that auditory differences between timbres tend to be caused by differences in the distribution of each harmonic and inharmonic component [21]. The model is based on the integrated model [2] and adds a coefficient of inharmonicity, B, to represent the inharmonicity of string instruments such as pianos and guitars. This tone model represents the power spectrogram, M(t, f), of a specific musical tone as the sum of a harmonic-structure tone model, M^(H)(t, f), and an inharmonic-structure tone model, M^(I)(t, f), with amplitudes w^(H) and w^(I):

    M(t, f) = w^{(H)} M^{(H)}(t, f) + w^{(I)} M^{(I)}(t, f),   (1)

where t and f correspond to time and frequency. M^(H)(t, f) is defined as a weighted Gaussian mixture model that has n parametric harmonic overtones:

    M^{(H)}(t, f) = \sum_n M^{(H)}(n, t, f) = \sum_n F^{(H)}(n, t, f)\, E^{(H)}(n, t),   (2)

    F^{(H)}(n, t, f) = \frac{v(n)}{\sqrt{2\pi\sigma^2}} \exp\!\left[ -\frac{(f - \mu(n, t))^2}{2\sigma^2} \right],   (3)

    \mu(n, t) = n\, \mu(t) \sqrt{1 + B n^2},   (4)

where n is an index of harmonics. F^(H)(n, t, f) and E^(H)(n, t) correspond to the spectral and temporal envelopes of the harmonic components, as shown in Figures 3 and 4. v(n) is the relative strength between harmonic peaks. µ(n, t) corresponds to the mean of a Gaussian defined variably at each time and thus represents the trajectory of each harmonic peak along the time axis. σ corresponds to the standard deviation of the Gaussian and represents the width of each peak on the frequency axis. Equation (4) is the theoretical formula of inharmonicity; it expresses µ(n, t) in terms of the fundamental frequency (F0) µ(t) and the coefficient of inharmonicity B. M^(I)(t, f) is expressed as a non-parametric model that consists of the product of the relative strength along the frequency axis, F^(I)(t, f), and the temporal trajectory, E^(I)(t):

    M^{(I)}(t, f) = F^{(I)}(t, f)\, E^{(I)}(t).   (5)

Figure 3: Spectral envelope
Figure 4: Temporal envelope

All of the parameters noted above are positive. In extracting an instrumental part from a polyphonic audio signal, we use the tone model, M_k(t, f), to represent the power spectrogram of the k-th musical note we want to extract. The following constraints guarantee uniqueness of the parameters of each harmonic and inharmonic model:

    \sum_n v(n) = 1, \qquad \forall n: \int E^{(H)}(n, t)\, dt = T,   (6a)

    \int E^{(I)}(t)\, dt = T, \qquad \forall t: \int F^{(I)}(t, f)\, df = 1.   (6b)
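To make the model concrete, the following sketch evaluates the integrated tone model of Equations (1)-(5) on a discrete time-frequency grid. It is only an illustration under our own assumptions: the array shapes, function names, and frame-based sampling are not part of the paper.

```python
import numpy as np

def harmonic_model(freqs, v, mu_t, sigma, B, E_H):
    """Sketch of Eqs. (2)-(4): weighted-Gaussian harmonic structure.
    freqs: (F,) frequency bins in Hz
    v:     (N,) relative harmonic strengths (sums to 1)
    mu_t:  (T,) F0 trajectory per frame; sigma: peak width; B: inharmonicity
    E_H:   (N, T) temporal envelope of each harmonic
    """
    N = len(v)
    M_H = np.zeros((len(mu_t), len(freqs)))
    for n in range(1, N + 1):
        # Eq. (4): overtone trajectory stretched by the inharmonicity term
        mu_n = n * mu_t * np.sqrt(1.0 + B * n**2)                    # (T,)
        # Eq. (3): Gaussian spectral envelope around each overtone
        F_H = (v[n - 1] / np.sqrt(2 * np.pi * sigma**2)
               * np.exp(-(freqs[None, :] - mu_n[:, None])**2 / (2 * sigma**2)))
        M_H += F_H * E_H[n - 1][:, None]                             # Eq. (2)
    return M_H

def tone_model(freqs, w_H, w_I, v, mu_t, sigma, B, E_H, F_I, E_I):
    """Eq. (1): harmonic part plus the non-parametric inharmonic part of Eq. (5)."""
    M_H = harmonic_model(freqs, v, mu_t, sigma, B, E_H)
    M_I = F_I * E_I[:, None]                                         # Eq. (5)
    return w_H * M_H + w_I * M_I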

2.1 Instrument Models and Expression Models

2.1.1 Instrument Model

The instrument model M̄_p(t, f) is defined as the average of the parameters of the set of tone models with the same pitch name, p. The reason we average with respect to each pitch name is to reflect the dependency of timbral characteristics on the pitch of the instrument sound, such as the relative strength between harmonic peaks, v(n), and the ratio of the powers of the harmonic and inharmonic components, w^(H)/w^(I) [22, 23]. When taking the average of two or more tone models, their durations have to be aligned. However, simply expanding or contracting them is not proper, because the volume excitation in the attack and release segments and the pitch trajectory particular to a musical instrument and player do not vary much with the duration. Distorting these features badly affects the perception of loudness or the auditory impression, e.g., vibrato articulation. To avoid changing the models' timbre, we propose a method of matching the temporal envelopes of the tone models being averaged. First, we define the end of a sharp emission of energy as an onset t^(on), and the start of a sharp decline in energy as an offset t^(off) (Figure 5). We implement these parameters as follows:

    t^{(on)} \equiv \min \mathcal{T}, \qquad t^{(off)} \equiv \max \mathcal{T},   (7a)

    \mathcal{T} = \left\{ t \,\middle|\, \left|\frac{d\tilde{E}^{(H)}(t)}{dt}\right| \le \varepsilon,\ \tilde{E}^{(H)}(t) \ge \kappa \right\},   (7b)

Figure 5: Onset and offset

where κ is a threshold that determines the vibration energy of the instrument sound, and Ẽ^(H)(t) is the average of all temporal envelopes in the harmonics. These equations mean that t^(on) is the minimum member of the set of times, T, in which the energy of the sound reaches a sufficient level and its variation is small, and t^(off) is the maximum member of that set.

The instrument model is calculated by making use of the offset information. Let P_p be the set of note models with pitch name p. Among the parameters of the instrument model, the non-time-series parameters w̄^(H)_p, w̄^(I)_p, v̄_p(n), and B_p are calculated by simply averaging their values over the relevant tone models. The time-series parameters µ̄_p(t), Ē^(H)_p(n, t), Ē^(I)_p(t), and F̄^(I)_p(t, f) are calculated by aligning the models' envelopes after the offset time to the offset of the longest model (with duration T̄_p) in P_p. This operation is defined by the following expressions, where we use the alternative expression ᾱ_p(t) to denote the parameters:

    \bar{\alpha}_p(t) = \frac{\sum_{k \in P_p} \left( \alpha_k(t)\, W^{(1)}_k(t) + \alpha_k\!\left(t - (\bar{T}_p - T_k)\right) W^{(2)}_k(t) \right)}{\sum_{k \in P_p} \left( W^{(1)}_k(t) + W^{(2)}_k(t) \right)},   (8)

where T_k represents the duration of the k-th tone model, and

    W^{(1)}_k(t) = \begin{cases} 1 & (t < t^{(off)}_k) \\ 0 & (t^{(off)}_k \le t) \end{cases},   (9a)

    W^{(2)}_k(t) = \begin{cases} 0 & (t < t^{(off)}_k + \bar{T}_p - T_k) \\ 1 & (t^{(off)}_k + \bar{T}_p - T_k \le t) \end{cases}.   (9b)

The denominator in Equation (8) represents how many models are added at each time, so that the whole equation is an average. (As this process divides the envelopes into two parts, in practice some smoothing is needed around the point of division.) The onset and offset of an instrument model are recalculated anew. The remaining parameter σ is determined by the window function used in the short-time Fourier transform (STFT) analysis; thus, it is not necessary in the instrument model.
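The offset-aligned averaging of Equations (8)-(9) can be sketched as follows. Frame-based indexing and the helper name are our own assumptions, and the smoothing around the split point mentioned above is omitted.

```python
import numpy as np

def offset_aligned_average(envelopes, offsets, T_bar):
    """Sketch of Eqs. (8)-(9): average time-series parameters of tone models
    with the same pitch name, aligning everything after each model's offset
    to the offset of the longest model.
    envelopes: list of 1-D arrays alpha_k(t), one value per frame
    offsets:   list of offset frame indices t_k^(off)
    T_bar:     frame count of the longest model in the pitch class
    """
    num = np.zeros(T_bar)
    den = np.zeros(T_bar)
    for alpha, t_off in zip(envelopes, offsets):
        T_k = len(alpha)
        shift = T_bar - T_k
        t = np.arange(T_bar)
        w1 = (t < t_off).astype(float)             # Eq. (9a): head, before the offset
        w2 = (t >= t_off + shift).astype(float)    # Eq. (9b): release, shifted to the end
        head = np.where(t < T_k, alpha[np.minimum(t, T_k - 1)], 0.0)
        tail = alpha[np.clip(t - shift, 0, T_k - 1)]
        num += head * w1 + tail * w2
        den += w1 + w2
    return num / np.maximum(den, 1e-12)            # Eq. (8)
```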

2.1.2 Expression Model

We define the expression model, M̌_k(t, f), as the ratio of each tone model to the instrument model. Among the parameters of the expression model, the non-time-series parameters w̌^(H)_k, w̌^(I)_k, and v̌_k(n) are determined as the ratio of each tone model to the instrument model. The time-series parameters µ̌_k(t), Ě^(H)_k(n, t), Ě^(I)_k(t), and F̌^(I)_k(t, f) are calculated by aligning the instrument model's envelopes after the offset time to the current tone model. This is defined by the following expression, where we use the alternative expression α̌_k(t):

    \check{\alpha}_k(t) = \frac{\alpha_k(t)}{\bar{\alpha}_p(t)\, W^{(1)}_k(t) + \bar{\alpha}_p\!\left(t - (\bar{T}_p - T_k)\right) W^{(2)}_k(t)}.   (10)

We do not need the ratio of B to that of the instrument model, because it does not depend on the expressiveness of the performance but rather on the kind of instrument and the pitch.
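The following sketch illustrates the time-series part of Equation (10). It assumes a per-frame representation and follows the textual description above for the alignment direction (the tail of the expression comes from the offset-aligned release of the instrument model); the helper name and clipping are ours.

```python
import numpy as np

def expression_envelope(alpha_k, alpha_bar_p, t_off_k, T_bar):
    """Sketch of Eq. (10): time-series expression parameter as the ratio of a
    tone-model envelope to the offset-aligned instrument-model envelope.
    alpha_k:     (T_k,) envelope of the k-th tone model
    alpha_bar_p: (T_bar,) corresponding instrument-model envelope
    t_off_k:     offset frame of the k-th tone model
    T_bar:       length of the instrument model (longest model of the pitch class)
    """
    T_k = len(alpha_k)
    shift = T_bar - T_k
    t = np.arange(T_k)
    # Denominator of Eq. (10): instrument-model head before the offset,
    # and its release segment re-aligned to the current tone model's offset.
    denom = np.where(t < t_off_k,
                     alpha_bar_p[t],
                     alpha_bar_p[np.clip(t + shift, 0, T_bar - 1)])
    return alpha_k / np.maximum(denom, 1e-12)
```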

3. IMPLEMENTATION

3.1 Extraction and Analysis of a Specific Instrumental Part

This section describes how to extract a specific instrumental part from a polyphonic audio mixture and how to estimate the tone-model parameters that represent the extracted part. Here, extraction means decomposing the polyphonic power spectrogram, S(t, f), into a source instrument part and a residual part.

3.1.1 Formulation

The extraction and parameter estimation are done by alternately iterating two phases:

1. Distribution: derive the most reliable power-spectrogram distribution functions from the tone models, and
2. Adaptation: derive the most reliable tone-model parameters from the extracted spectrogram.

The iteration is terminated when the tone models become close to the extracted spectrogram.

We define three distribution functions: ∆^(M)_k(t, f), ∆^(H)_k(n, t, f), and ∆^(I)_k(t, f). The first function decomposes S(t, f) into the spectrogram of each individual single tone. The other two decompose each single-tone spectrogram into its harmonic and inharmonic components. Therefore, the decomposed spectrograms are:

    S^{(M)}_k(t, f) = \Delta^{(M)}_k(t, f)\, S(t, f),   (11a)
    S^{(H)}_k(n, t, f) = \Delta^{(H)}_k(n, t, f)\, S^{(M)}_k(t, f),   (11b)
    S^{(I)}_k(t, f) = \Delta^{(I)}_k(t, f)\, S^{(M)}_k(t, f),   (11c)

where S^(M)_k(t, f), S^(H)_k(n, t, f), and S^(I)_k(t, f) correspond to the distributed single-tone spectrogram, that of its n-th harmonic component, and that of its inharmonic component, respectively.

The residual model, M^(R)(t, f), represents the remaining spectrogram that the tone models cannot or should not express, i.e., other instrument sounds and reverberation. We introduce another distribution function for the residual model, ∆^(R)(t, f), which decomposes S(t, f) into the remaining spectrogram, S^(R)(t, f):

    S^{(R)}(t, f) = \Delta^{(R)}(t, f)\, S(t, f).   (12)

This residual model is what distinguishes our method from the existing sound source separation method [2].

S(t, f) and the distributed spectrograms S^(M)_k(t, f), S^(H)_k(n, t, f), S^(I)_k(t, f), and S^(R)(t, f) are related by

    S(t, f) = \sum_k S^{(M)}_k(t, f) + S^{(R)}(t, f) = \sum_k \left( \sum_n S^{(H)}_k(n, t, f) + S^{(I)}_k(t, f) \right) + S^{(R)}(t, f).   (13)

In this case, the distribution functions need to satisfy the following conditions:

    \forall t, f: \sum_k \Delta^{(M)}_k(t, f) + \Delta^{(R)}(t, f) = 1,   (14a)
    \forall k, t, f: \sum_n \Delta^{(H)}_k(n, t, f) + \Delta^{(I)}_k(t, f) = 1.   (14b)

For all t, f, n, and k, the distribution functions are restricted to be at least 0 and at most 1.
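A minimal sketch of the distribution phase is given below. It assumes that the distribution functions are taken proportional to the weighted model spectrograms — a Wiener-filter-like choice that also satisfies Equation (14) by construction; the paper derives the actual update by minimizing the KL-divergence cost introduced next under the Lagrange-multiplier method. The hierarchical split of Equations (11a)-(11c) is collapsed here into a single proportional split, and the array layout is our own assumption.

```python
import numpy as np

def distribution_step(S, M_H, M_I, M_R, w_H, w_I, gamma=(1.0, 1.0, 1.0)):
    """Sketch of the distribution phase (cf. Eqs. (11)-(14)).
    S:   (T, F) polyphonic power spectrogram
    M_H: (K, N, T, F) harmonic models; M_I: (K, T, F) inharmonic models
    M_R: (T, F) residual model; w_H, w_I: (K,) amplitudes
    gamma: (gamma_H, gamma_I, gamma_R) weights with gamma_H > gamma_I > gamma_R
    """
    g_H, g_I, g_R = gamma
    num_H = g_H * w_H[:, None, None, None] * M_H      # per note, per harmonic
    num_I = g_I * w_I[:, None, None] * M_I            # per note, inharmonic
    num_R = g_R * M_R
    total = num_H.sum(axis=(0, 1)) + num_I.sum(axis=0) + num_R + 1e-12
    # Proportional distribution functions (combined Delta^(M)*Delta^(H) etc.)
    D_H, D_I, D_R = num_H / total, num_I / total, num_R / total
    # Decomposed spectrograms fed to the adaptation phase
    return D_H * S, D_I * S, D_R * S
```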

The distribution phase updates the distribution functions to minimize the cost function Q_dist while fixing the parameters of the tone models. Q_dist is defined as the sum of the Kullback-Leibler divergences (D_KL) from the tone models to the distributed spectrograms:

    Q_{dist} = \sum_k \Big( \sum_n D_{KL}\big( S^{(H)}_k(n,t,f) \,\|\, \gamma^{(H)} w^{(H)} M^{(H)}_k(n,t,f) \big)
             + D_{KL}\big( S^{(I)}_k(t,f) \,\|\, \gamma^{(I)} w^{(I)} M^{(I)}_k(t,f) \big) \Big)
             + D_{KL}\big( S^{(R)}(t,f) \,\|\, \gamma^{(R)} M^{(R)}(t,f) \big),   (15)

where γ^(H), γ^(I), and γ^(R) (> 0) are weight parameters. The update result can be controlled by assigning distinct values to the weights at first and gradually moving them to 1 in each iteration. Our method sets them to satisfy γ^(H) > γ^(I) > γ^(R) in order to obtain a preferable extraction result, that is, 1) to prevent the inharmonic models or the residual model from absorbing even the harmonic components of the source part, and 2) to prevent the residual model from absorbing even the spectrum of the source part. This update is an optimization problem under the constraints of Equation (14); thus, it can be solved by using the Lagrange-multiplier method.

The adaptation phase updates the tone-model parameters w^(H)_k, σ, µ_k(t), B_k, E^(H)_k(n, t), w^(I)_k, E^(I)_k(t), and F^(I)_k(t, f) by minimizing the following cost function, Q_adapt, while fixing the distribution functions:

    Q_{adapt} = \sum_k \Big( \sum_n D_{KL}\big( S^{(H)}_k(n,t,f) \,\|\, w^{(H)} M^{(H)}_k(n,t,f) \big)
              + D_{KL}\big( S^{(I)}_k(t,f) \,\|\, w^{(I)} M^{(I)}_k(t,f) \big)
              + \beta^{(HE)} D_{KL}\big( \tilde{E}^{(H)}_k(t) \,\|\, E^{(H)}_k(n,t) \big)
              + \beta^{(IFS)} D_{KL}\big( \tilde{F}^{(I)}_k(t,f) \,\|\, F^{(I)}_k(t,f) \big)
              + \beta^{(V)} D_{KL}\big( \bar{v}_p(n) \,\|\, v_k(n) \big)
              + \beta^{(IE)} D_{KL}\big( \bar{E}^{(I)}_p(t) \,\|\, E^{(I)}_k(t) \big) \Big)
              + D_{KL}\big( S^{(R)}(t,f) \,\|\, M^{(R)}(t,f) \big),   (16)

where Ẽ^(H)_k(t), F̃^(I)_k(t, f), v̄_p(n), and Ē^(I)_p(t), together with the weights β^(HE), β^(IFS), β^(V), and β^(IE), are the variables of the additional cost terms. These cost terms work as constraints that make each model parameter close to the corresponding variable. Ẽ^(H)_k(t) is the average of all temporal envelopes in the harmonics and is used to reduce the envelopes' variation on the frequency axis, because it is v_k(n) that should represent global variation on the frequency axis. F̃^(I)_k(t, f), which is obtained by smoothing F^(I)_k(t, f) along the frequency axis, also prevents the inharmonic model from containing a harmonic component. The other two cost terms make the parameters close to the corresponding parameters of the instrument models defined in Section 2.1.1. They work to keep each tone model expressing a similar sound throughout an individual performance. Every tone model in the source part has the same σ because σ is considered to be the same throughout the performance.

3.1.2 Theoretical Foundations

Our extraction process can be interpreted as an expectation-maximization (EM) algorithm in which local convergence of the parameters is guaranteed, for the following reason. The integrated model expressing both harmonic and inharmonic components is based on harmonic-temporal structured clustering (HTC) [10], where the power spectrogram is regarded as a two-dimensional (time and frequency) probability distribution and each tone model is regarded as a mixture of probability distributions. The distribution functions of our method correspond to the weights of the individual kernels. Consequently, the cost functions Q_dist and Q_adapt play almost the same role as the Q-function of the EM algorithm, although they have some additions (i.e., γ in Q_dist and the cost terms in Q_adapt). Unfortunately, the EM algorithm cannot determine which kernel in the tone model represents which component of the source power spectrogram. This means that although the mixture model expressed by the tone models fits the source spectrogram well, the fitting result of each kernel does not always match its "meaning"; for example, there is a possibility that M^(I)(t, f) represents the whole source spectrogram. Each kernel is required to play its proper role in reflecting timbral characteristics, and thus we impose the constraints in the parameter update process.

3.2 Mixing Instrumental Timbres

This section describes how to mix two or more instrument timbres by mixing their instrument model parameters in an arbitrary mixing ratio. To mix multiple instrument models, the onset and offset times of all models must be aligned to capture the temporal changes of timbre. The mixing process has four steps:

1. Analyze the instrument models through all sounds (if the mixed sound has only one note, its tone model is substituted for the instrument model),
2. Calculate a new onset and offset by averaging these values over all the models,
3. Adjust each model's duration to the user-specified one by expanding or contracting the distance between the new onset and offset, and
4. Mix the model parameters with arbitrary mixing ratios.

We do not change the duration of the inharmonic tone models in step 3, because the inharmonic component characterizes the timbre in the attack segment, and changing it alters the timbre. We define the method of mixing several instrument models as a linear mixture:

    \alpha^{(P)} = \sum_m \phi_m \alpha^{(m)}.   (17)

Here α is a convenient alternative expression for each parameter, and α^(P) represents a mixed instrument model. φ_m (with Σ_m φ_m = 1) is the mixing ratio. When 0 < φ_m < 1, this process is an interpolation, while when φ_m > 1 or φ_m < 0, it is an extrapolation. The user can also freely manipulate the timbre by mixing with different ratios with respect to each model parameter. A log-scale mixture may also be used, because it reflects the auditory property that humans perceive the logarithm of sound energy.
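A minimal sketch of the linear mixture in Equation (17) follows. The dictionary-of-arrays representation of an instrument model is our own assumption, and the envelopes are assumed to be duration-aligned already, as in steps 1-3 above.

```python
import numpy as np

def mix_instrument_models(models, ratios, log_scale=False):
    """Sketch of Eq. (17): linear (or optional log-scale) mixture of instrument models.
    models: list of dicts mapping parameter names (e.g. 'v', 'w_H', 'E_H')
            to NumPy arrays of matching shape
    ratios: mixing ratios phi_m; sum(ratios) == 1 gives a pure interpolation
    """
    mixed = {}
    for name in models[0]:
        stack = np.stack([m[name] for m in models])                  # (M, ...)
        phi = np.asarray(ratios).reshape(-1, *([1] * (stack.ndim - 1)))
        if log_scale:
            # variant mentioned in the text: mix log-energies instead
            mixed[name] = np.exp(np.sum(phi * np.log(stack + 1e-12), axis=0))
        else:
            mixed[name] = np.sum(phi * stack, axis=0)                # alpha^(P)
    return mixed

# e.g., an even blend of two (hypothetical) instrument-model dicts:
# blended = mix_instrument_models([piano_model, guitar_model], [0.5, 0.5])
```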

3.3 Rendering Phrase Expression

This section describes how to render a new expressive performance for an arbitrary score by using a set of trained expression models. The rendering is done by reconstructing the expression models so that they reflect the characteristics of the given score. The expression model parameters for each note of the given score are rendered by using trained expression models that have a note-sequence pattern similar to that of the target note. The note-sequence pattern is defined as the pitch-name transition pattern of the notes. Let p_l be the pitch name of the l-th note of the given score. Each note belongs to two sequences of length 2: one is the pair (p_{l-1}, p_l), and the other is the pair (p_l, p_{l+1}). We fetch the two matched expression models and interpolate them so that they evolve smoothly along the time axis and maintain the continuity of the model parameters:

    \check{\alpha}_l(t) = \left(1 - \frac{t}{T_l}\right) \check{\alpha}^{(1)}_l(t) + \frac{t}{T_l}\, \check{\alpha}^{(2)}_l(t),   (18)

where l and T_l correspond to the index of the note in the given score and the duration of the l-th note, respectively. α̌^(1)_l(t) is an expression model whose duration has been adjusted to T_l in advance. It is selected from the analyzed models by using the following rules: 1) its pitch-name transition pattern (p_{k-1}, p_k) should be the most similar to (p_{l-1}, p_l), and 2) its difference in durations, |T_{k-1} - T_{l-1}| + |T_k - T_l|, should be the minimum. α̌^(2)_l(t) is selected in the same way. In contracting the expression models' envelopes to adjust their durations, we move all model envelopes after the offset by using the method described in Section 2.1.1. In contrast, we cannot apply this method to expand the envelopes; instead, we expand the envelopes only between the onset and offset to avoid accidentally changing the timbre by moderating the attack and release segments. If there is no pitch-name transition pattern matching that of the note in the given score, we use an expression model whose transition pattern is most similar and manipulate its µ(t) to fit the required pitch.
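The selection and crossfade of Equation (18) can be sketched as below. The library structure and function names are hypothetical illustrations, and the duration adjustment described in Section 2.1.1 is assumed to have been applied beforehand.

```python
import numpy as np

def select_model(library, transition, duration_pair):
    """Pick the trained expression model whose pitch-name transition matches
    `transition` (falling back to the full library if none matches) and whose
    note durations are closest (selection rules 1-2 above).
    library: list of dicts with keys 'transition', 'durations', 'envelope'."""
    matches = [m for m in library if m['transition'] == transition] or library
    return min(matches,
               key=lambda m: abs(m['durations'][0] - duration_pair[0])
                           + abs(m['durations'][1] - duration_pair[1]))

def render_note_expression(library, p_prev, p_cur, p_next,
                           T_prev, T_cur, T_next, n_frames):
    """Sketch of Eq. (18): crossfade the expression models matched to the
    (p_{l-1}, p_l) and (p_l, p_{l+1}) transitions over the note duration."""
    m1 = select_model(library, (p_prev, p_cur), (T_prev, T_cur))
    m2 = select_model(library, (p_cur, p_next), (T_cur, T_next))
    a1 = m1['envelope'][:n_frames]          # assumed already adjusted to T_l
    a2 = m2['envelope'][:n_frames]
    t = np.linspace(0.0, 1.0, n_frames)     # t / T_l
    return (1.0 - t) * a1 + t * a2
```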

3.4 Musical Instrument Sound Synthesis

The process of re-synthesizing the signal for a manipulated performance is as follows. First, we obtain the tone models for the given score, M_l(t, f), by multiplying the instrument model, M̄_p(t, f), by the calculated expression model, M̌_l(t, f). Second, we synthesize the signal for each tone model by adding a harmonic signal, which is synthesized from the harmonic component, M^(H)_l(t, f), by using the sinusoidal model [24], and an inharmonic signal, which is synthesized from the inharmonic component, M^(I)_l(t, f), by using the overlap-add method [25] (commonly used to transform a spectrogram into a signal). Finally, we obtain the signal for the whole instrumental performance by summing all single-tone signals.
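The two synthesis paths can be sketched as follows. This is only an illustrative simplification: frame-to-sample expansion is naive, the inharmonicity stretch of Equation (4) is omitted, and the inharmonic part is given random phase, which is one common choice for overlap-add resynthesis of a magnitude spectrogram rather than the paper's exact procedure.

```python
import numpy as np

def synthesize_harmonic(mu_t, amps, sr, hop):
    """Additive (sinusoidal-model-style) synthesis of the harmonic component.
    mu_t: (T,) per-frame F0 in Hz; amps: (N, T) per-harmonic amplitudes."""
    f0 = np.repeat(mu_t, hop)                       # naive frame-to-sample expansion
    out = np.zeros(len(f0))
    for n in range(1, amps.shape[0] + 1):
        a = np.repeat(amps[n - 1], hop)
        phase = 2 * np.pi * np.cumsum(n * f0) / sr  # accumulated instantaneous phase
        out += a * np.sin(phase)
    return out

def synthesize_inharmonic(mag, hop):
    """Overlap-add of the inharmonic magnitude spectrogram with random phase.
    mag: (T, F) magnitude frames; window length inferred from F."""
    T, F = mag.shape
    win_len = 2 * (F - 1)
    window = np.hanning(win_len)
    out = np.zeros(T * hop + win_len)
    for i in range(T):
        spec = mag[i] * np.exp(1j * 2 * np.pi * np.random.rand(F))
        out[i * hop:i * hop + win_len] += np.fft.irfft(spec, n=win_len) * window
    return out
```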

4. EXPERIMENTAL EVALUATIONS

We conducted three experimental evaluations to confirm how well the three functions work. The experiments evaluate the effectiveness of 1) introducing a residual model for extracting a single instrumental part, 2) mixing instrument timbres by using the semi-parametric tone model, and 3) rendering a new phrase expression according to the note sequence.

4.1 Evaluation 1: Instrument Part Extraction

This evaluation confirms whether instrument part extraction using the residual model is better than extraction without it, called the baseline method. Both methods extract several instrument parts from a polyphonic mixture, and the accuracies of the extracted sounds are compared.

4.1.1 Experimental Conditions

Twelve standard MIDI files (SMFs) were excerpted from the RWC Music Database: Jazz and Classical [26], and 33 instrument parts were extracted from polyphonic audio mixtures of these musical pieces, as shown in Table 1. Each instrument part in the SMFs was synthesized using a MIDI synthesizer so that we could evaluate the distances between the extracted signals and the original ones. The musical pieces contain two or more instrument parts to be extracted, and each part was extracted independently; e.g., the flute part of Classical No. 12 was extracted from the whole mixture of the piece by using only the score of the flute.

Table 1: List of musical pieces and parts used in Evaluation 1
  Classical  No. 12  FL, CB, VC, VL, VN1, VN2
  Classical  No. 13  VC, VL, VN1, VN2
  Classical  No. 16  CL, VC, VL, VN1, VN2
  Classical  No. 37  PN, VN
  Classical  No. 39  PN, VN
  Classical  No. 42  HP, VC
  Jazz       No. 22  TP, PN
  Jazz       No. 24  SA, PN
  Jazz       No. 32  VI, PN
  Jazz       No. 33  FL, PN
  Jazz       No. 34  FL, PN
  Jazz       No. 41  SA, PN
  (FL: Flute, CB: Contrabass, VC: Cello, VL: Viola, VN: Violin, CL: Clarinet, PN: Piano, TP: Trumpet, SA: Alto Sax, VI: Vibraphone)

We evaluated the accuracy of the part extraction by calculating the linear spectral distance between the extracted instrument signal and the original one:

    D_{Lin.} = \frac{1}{T} \iint \left( S_{ext}(t, f) - S_{real}(t, f) \right)^2 dt\, df,   (19)

where S_ext(t, f) represents the spectrogram of the extracted sound and S_real(t, f) represents the spectrogram of the original one.
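In discrete form, Equation (19) reduces to a duration-normalized sum of squared spectrogram differences; a minimal sketch (using bin sums as the discrete integral) is:

```python
import numpy as np

def linear_spectral_distance(S_ext, S_real, frame_rate):
    """Sketch of Eq. (19): mean squared difference between two power
    spectrograms on the same grid, normalized by the signal duration T.
    S_ext, S_real: (T_frames, F) power spectrograms; frame_rate in frames/s."""
    duration = S_ext.shape[0] / frame_rate
    return np.sum((S_ext - S_real) ** 2) / duration
```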

4.1.2 Experimental Results and Discussion

Figure 6(a) shows the average distances of our method, of the harmonic component extracted by the baseline method (baseline-harmonic), and of all sounds extracted by the baseline method (baseline-all), classified into jazz and classical pieces. (These distances are normalized by the distance of the AS part of Jazz No. 24 for our method, because we focus on the relative distances to the baseline method.) The distances of our method are shorter by 42.9%, 23.1%, and 35.0% for the jazz pieces, the classical pieces, and the overall average, respectively, in comparison with baseline-harmonic. This result shows that the residual model enables us to extract the source part more accurately when the scores of the other parts are unavailable. The baseline method cannot ignore sounds that do not appear in the given score; therefore, almost all sounds in the polyphonic mixture are distributed to the source part, especially to the inharmonic models, which increases the distance from the original sound. All three distances for the jazz pieces are larger than those for the classical pieces. This is because the jazz pieces include a drum part, which consists almost entirely of inharmonic sounds. It is difficult to separate the inharmonic components of each instrument from a polyphonic audio mixture accurately because the shape of these components cannot be easily modeled.

Figure 6(b) indicates one of the limits of our approach: a decayed-tone instrument part with low volume is difficult to extract. The figure depicts the relationship between the extraction performance and the degree of sound mixture of each extracted part. The extracted parts are classified into two categories: sustained-tone instruments such as violin, and decayed-tone instruments such as piano. The horizontal axis of the figure is the power ratio, which is defined as the ratio of the temporal mean of the whole parts' spectral power to the mean of the source part's power. The vertical axis is the spectral distance of each musical piece. The distance of the decayed-tone instruments becomes larger depending on the power ratio. This result shows that extracting a relatively low-volume, decayed-tone instrument, indicated by a red dotted circle in the figure, is difficult. This is because their volumes may become much smaller while decaying.

Figure 6: Average spectral distances in Evaluation 1 — (a) classified by genre; (b) plotted by power ratio between the source sound and the whole mixture

4.2 Evaluation 2: Mixing of Timbre

The most important aspect of manipulating timbre is that the manipulations should be properly reflected in the results, while the results should not include any accidental or unintended changes. For example, if we intend to change a piano sound into a guitar sound but instead obtain a sound similar to that of a violin, then the sound should be considered useless no matter how expressively it is presented. Therefore, we evaluate how reproducible the manipulated sounds are by performing operations that change a sound into another instrument's and then change that sound back into the original. This experiment has four steps (see Figure 7):

1. Analyze models of two different instrumental sounds,
2. Interpolate these two models by using two pairs of mixing ratios, (φ, 1 − φ) and (ψ, 1 − ψ) (0 < φ, ψ < 1), thereby generating two intermediate sounds,
3. Extrapolate these two intermediate sounds so that they return to the two original sounds, using the appropriate mixing ratios, and synthesize them, and
4. Calculate the distances between the original sounds and the remixed ones.

Figure 7: Process of Evaluation 2

4.2.1 Experimental Conditions

We focus on the acoustic fidelity of our timbre-mixing method; therefore, we do not extract the instrumental sounds from a polyphonic audio mixture but instead use independently recorded instrumental single tones from the RWC Music Database: Musical Instrument Sound [27]. The instrument models are thus considered to be equal to the tone models. The instruments and the pitches used for the experiments are listed in Table 2. These sounds are played with normal articulation. We selected low- and high-pitched instrumental sounds for each category, except for the piano. Our method is compared with STRAIGHT, another sound analysis-mixing-synthesis method [8].

Table 2: Kinds of instrument sounds used in Evaluation 2
  Category         Instrument             Pitches (Hz)
  Struck string    Piano (PF)             55, 110, 220, 440, 880
  Plucked string   Acoustic guitar (AG)   110, 220, 440
  Plucked string   Electric bass (EB)     55, 110
  Bowed string     Violin (VN)            220, 440, 880
  Bowed string     Contrabass (CB)        55, 110, 220
  Brasswind        Trumpet (TR)           220, 440, 880
  Brasswind        Tuba (TU)              55, 110, 220
  Woodwind         Alto sax (AS)          220, 440, 880
  Woodwind         Baritone sax (BS)      110, 220

To evaluate the quality of the remixed sounds, we calculated the distances between the remixed and the original sounds using three criteria:

1. Linear spectral distance (described in Section 4.1.1),

2. Log-scale spectral distance,

    D_{Log.} = \frac{1}{T} \iint \left( \log_{10} \frac{S_{syn}(t, f)}{S_{real}(t, f)} \right)^2 dt\, df,   (20)

3. Equivalent-rectangular-bandwidth (ERB) weighted spectral distance [28],

    D_{ERB.} = \frac{1}{T} \iint \zeta(f) \left( \log_{10} \frac{S_{syn}(t, f)}{S_{real}(t, f)} \right)^2 dt\, df,   (21)

    \zeta(f) = 21.4 \log_{10}\!\left( \frac{4.37 f}{1000} + 1 \right),   (22)

where S_syn(t, f) represents the spectrogram of the remixed sound and S_real(t, f) represents the spectrogram of the original sound. The ERB-weighted spectral distance satisfactorily explains human auditory perception. We used nine different mixing ratios {0.1, 0.2, ..., 0.9} in step 2. Each sound pair in the experiment was selected to have the same fundamental frequency.
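The two log-domain criteria can be computed together; a minimal sketch, with the same discrete-integral simplification as above and a small epsilon added as our own numerical safeguard, is:

```python
import numpy as np

def log_and_erb_distances(S_syn, S_real, freqs, frame_rate, eps=1e-12):
    """Sketch of Eqs. (20)-(22): log-scale spectral distance and its
    ERB-weighted variant. freqs: (F,) bin frequencies in Hz."""
    duration = S_syn.shape[0] / frame_rate
    log_ratio_sq = np.log10((S_syn + eps) / (S_real + eps)) ** 2
    d_log = np.sum(log_ratio_sq) / duration                      # Eq. (20)
    zeta = 21.4 * np.log10(4.37 * freqs / 1000.0 + 1.0)          # Eq. (22)
    d_erb = np.sum(zeta[None, :] * log_ratio_sq) / duration      # Eq. (21)
    return d_log, d_erb
```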

4.2.2 Experimental Results and Discussion

Table 3 lists the average distances between the remixed and original sounds over all evaluated pairs. (All the distances in Table 3 and Figures 8-10 are normalized by the piano's distance for our method, because we focus on the relative distances.) Compared with the values for STRAIGHT, the linear spectral distance, the log-scale spectral distance, and the ERB-weighted spectral distance of our method are reduced by 42.9%, 58.9%, and 65.9%, respectively. This improvement is attributed to the parametric treatment; that is, our method controls the timbre of a sound with parametric characteristics such as the coefficient of inharmonicity, whereas STRAIGHT mixes non-parametric characteristics such as the STRAIGHT spectrogram.

Table 3: Average distances in Evaluation 2
  Criteria      Lin.   Log.    ERB
  Our method    1.01   0.938   1.02
  STRAIGHT      1.78   2.28    3.00

For detailed analysis, we classify the results by pitch, by instrument, and by difference of mixing ratios, as shown in Figures 8, 9, and 10, respectively. The difference of mixing ratios is defined as the difference between the two mixing ratios, |φ − ψ|. The smaller it is, the closer the two mixed sounds become, and thus the more difficult it is to remix them back to the original two sounds. In most cases the distances with our method are smaller than those with STRAIGHT, which demonstrates our method's robustness. Figure 8 shows that the distances of high-pitched sounds with our method are considerably smaller than those with STRAIGHT. Figure 9 shows that the distances with our method do not vary much across the instruments, compared with those of STRAIGHT. Figure 10 also shows that the distances with our method are much smaller than those with STRAIGHT, except for the linear spectral distance at a large difference of mixing ratios.

Figure 8: Average distance classified by pitch — (a) our method; (b) STRAIGHT
Figure 9: Average distance classified by instrument — (a) our method; (b) STRAIGHT
Figure 10: Average distance classified by difference of mixing ratios — (a) our method; (b) STRAIGHT

The spectral distances of the sound at 55 Hz and of the electric bass with our method are larger than those with STRAIGHT. This is another limit of our method: a performance of a low-pitched instrument, especially under about 100 Hz, is not accurately analyzed. The harmonic model is not suitable for analyzing the spectrograms of low-pitched sounds for two reasons:

1. The harmonic peaks crowd together on the frequency axis. Since the harmonic component is based on a Gaussian mixture model, the analysis becomes more difficult as the harmonic peaks on the frequency axis get closer.
2. Each harmonic peak fluctuates rapidly on the time axis because of the inertia of the instrument's body. The temporal amplitude envelope of a low-pitched instrumental sound changes rapidly, which makes the harmonic model difficult to fit to the observed spectrogram.

Human subjects listened to the resulting sounds generated by our method and by STRAIGHT and felt that the sounds output by our method were closer to the original ones. The inharmonic component in the middle-to-low frequency range of the sounds output by STRAIGHT felt particularly different from that of the original. This difference may explain why the ERB-weighted spectral distance of STRAIGHT is large, since this distance metric emphasizes differences in the middle-to-low frequency range.

4.3 Evaluation 3: Performance Rendering

We evaluate the acoustic fidelity of musical performance signals rendered with our method. The following steps are conducted on every real performance, and the evaluations are based on five-fold cross-validation (see Figure 11):

1. Analyze 4/5 of the entire performance,
2. Render the remaining 1/5 by using the analyzed data, and
3. Calculate the distance between the original real performance and the rendered one.

Figure 11: Process of Evaluation 3

4.3.1 Experimental Conditions

We use ten unaccompanied monophonic musical audio signals. Three violin [VN1-3], three flute [FL1-3], and three cello [VC1-3] pieces were selected from commercial audio CDs, and the remaining one [VN1'] (the same piece as VN1) was our own recording of an expert violinist. Since this evaluation focuses on the quality of the rendering method, we use unaccompanied monophonic musical pieces instead of instrumental sounds extracted from a polyphonic audio mixture. Since the evaluation should be independent of any particular recording condition, we selected a set of different players and CDs for each instrument.

As the baseline method for comparison, we use a method that does not take the sequence of single tones into account; it renders a new sound in a performance by selecting and assigning one analyzed tone model with the nearest pitch name and duration. The distances between a synthesized performance and the real one are calculated using two criteria. The first criterion is a set of two distances:

1. Harmonic distance,

    D_{H.} = \frac{1}{K} \sum_k \frac{1}{T_k} \sum_n \left( v_{syn,k}(n) - v_{real,k}(n) \right)^2 \int \left( E^{(H)}_{syn,k}(n, t) - E^{(H)}_{real,k}(n, t) \right)^2 dt,   (23)

2. Inharmonic distance,

    D_{I.} = \frac{1}{K} \sum_k \frac{1}{T_k} \int \left( E^{(I)}_{syn,k}(t) - E^{(I)}_{real,k}(t) \right)^2 \left( \int \left( F^{(I)}_{syn,k}(t, f) - F^{(I)}_{real,k}(t, f) \right)^2 df \right) dt,   (24)

where K represents the number of musical tones in the given score. The model parameters with subscript "real" denote a tone model analyzed directly from the real performance audio signal, while those with subscript "syn" denote the tone model reconstructed by using our method or the baseline method. The set of these two distances gives the similarity of musical performance between the real sound and the synthesized one, because the tone-model parameters are designed to represent the timbral characteristics of an instrument performance. Note that the quality of the tone-model parameters depends on the accuracy of our parameter estimation method described in Section 3. The second criterion is the linear spectral distance described in Section 4.1.1. It is used in order to evaluate the acoustic fidelity of the rendered performance signal.
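For a single note k, the two tone-model distances of Equations (23)-(24) can be sketched as below; the frame step dt and bin step df, and the per-note decomposition, are our own discretization assumptions, and the per-piece values average these over the K notes.

```python
import numpy as np

def harmonic_distance_k(v_syn, v_real, E_syn, E_real, T_k, dt):
    """Sketch of Eq. (23) for one note.
    v_*: (N,) harmonic strengths; E_*: (N, T) harmonic temporal envelopes."""
    env_term = np.sum((E_syn - E_real) ** 2, axis=1) * dt       # inner integral over t
    return np.sum((v_syn - v_real) ** 2 * env_term) / T_k

def inharmonic_distance_k(E_syn, E_real, F_syn, F_real, T_k, dt, df):
    """Sketch of Eq. (24) for one note.
    E_*: (T,) inharmonic temporal envelopes; F_*: (T, F) inharmonic spectral shapes."""
    freq_term = np.sum((F_syn - F_real) ** 2, axis=1) * df      # inner integral over f
    return np.sum((E_syn - E_real) ** 2 * freq_term) * dt / T_k
```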

4.3.2 Experimental Results and Discussion

Figure 12 shows the harmonic and inharmonic distances of each musical piece. Our method outperforms the baseline method in the harmonic and inharmonic distances by 26.3% and 21.5%, respectively. Human subjects listened to the rendered sounds and felt that the sound output by our method was closer to the original sound, especially when changing from one note to the next. This superiority of our method is attributed to taking the note sequence into account.

Figure 12: Distances in each musical piece in Evaluation 3 — (a) harmonic distance; (b) inharmonic distance

Figure 13(a) shows the spectral distances for each musical piece. We also plot the spectral distance between the real sound and one synthesized from the model parameters with subscript "syn" by a method, called the closed method, that only analyzes the tone models and re-synthesizes them directly, incorporating any relevant score information. Our method improves the spectral distance by 9.2% over that of the baseline method. This indicates that our method can render the performance more faithfully in terms of acoustic fidelity, even though our method needs more manipulation of the model parameters, described in Equations (8)-(10) and (18), than the baseline method does. It is also found that the degree of improvement in the spectral distance is smaller than that in either the harmonic or the inharmonic distance. This implies that the reproducibility of the tone-model parameters does not completely contribute to the reproducibility of the performance. Ideally, the spectral distance with our rendering method should be in the same range as that of the closed method; the result indicates that there is still room for improvement.

Figure 13(b) indicates another limit related to this insufficient improvement: the acoustic fidelity of our method may worsen rather than improve if there is a large amount of data for analysis. Figure 13(b) depicts how the improvement over the baseline method changes in terms of the amount of musical notes. Its horizontal axis is the average occurrence of same-pitch notes, defined as (1/C) Σ_p |P_p|, where C is the number of pitch names that occur at least once in the musical piece. This value relates to the degree of manipulation in calculating the instrument model based on Equation (8). A musical piece whose average occurrence of same-pitch notes is larger than about 30, indicated by a red dotted circle, cannot be synthesized well. This is because our current implementation of mixing and rendering uses pre-defined manipulations of the model parameters, which degrade the quality of the sound if the amount of manipulation is large. This problem might be avoided by exploiting some statistical modeling, such as a hidden Markov model, for the tone model definition and its manipulations.

Figure 13: Distances in each musical piece in Evaluation 3 — (a) spectral distance; (b) relation between data size and improvement

5. CONCLUSIONS

This paper has presented a method of changing the timbre and phrases of one instrumental part extracted from a polyphonic sound mixture. We defined a mathematical musical-tone model by referring to acoustic psychology. The model is composed of an instrument model and an expression model, on the assumption that an instrumental performance consists of instrument timbre and phrase expression. To extract and analyze tone-model parameters from a sound mixture, we used a residual model to separate the power spectrogram of the source instrumental part from the other parts. The sound analysis and mixing functions considered the inharmonicity and temporal change pattern of instrumental sounds in order to avoid distorting the acoustic quality during repeated sound mixing. To render a phrase expression for a given score, we reconstructed the expression models from analyzed data according to the similarities of the pitch-name transition patterns of note sequences.

Our future work includes the following. 1) Improving the method of extracting low-volume, decayed-tone instrument parts. The residual model could be estimated better by incorporating a melody extraction method for the non-source melody parts and a statistical dereverberation method [29] for robust suppression of reverberation. 2) Improving the method of analyzing and manipulating low-pitched musical-instrument sounds. Using a longer analysis frame may be promising, but it cannot be used to analyze the steep amplitude envelopes that piano and guitar sounds have. 3) Improving the method of rendering phrase expression when a large amount of data is available. We plan to introduce some statistical modeling, such as a hidden Markov model, for the tone model definition and its manipulations. It may also enable extracting a performance while adapting the attack times and durations of the notes.

Some demonstration sounds and pieces obtained by our method in the experiments are available at: http://winnie.kuis.kyoto-u.ac.jp/Demo/ACMMM09/.

6. ACKNOWLEDGMENTS

This work was partially supported by a JSPS Grant-in-Aid for Scientific Research (S), the Global Center of Excellence Program (GCOE), and the CrestMuse Project of the Japan Science and Technology Agency (JST).

7. REFERENCES

[1] K. Yoshii, M. Goto, K. Komatani, T. Ogata, and H. G. Okuno. Drumix: An audio player with real-time drum-part rearrangement functions for active music listening. The Journal of Information Processing Society of Japan, 48(3):1229-1239, 2007.
[2] K. Itoyama, M. Goto, K. Komatani, T. Ogata, and H. G. Okuno. Integration and adaptation of harmonic and inharmonic models for separating polyphonic musical signals. In Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing, pages 57-60, 2007.
[3] E. Lindemann. Music synthesis with reconstructive phrase modeling. IEEE Signal Processing Magazine, 24(2):80-91, March 2007.
[4] A. P. Klapuri. Multiple fundamental frequency estimation based on harmonicity and spectral smoothness. IEEE Transactions on Speech and Audio Processing, 11(6):804-816, Nov. 2003.
[5] T. Kitahara, M. Goto, K. Komatani, T. Ogata, and H. G. Okuno. Instrogram: A new musical instrument recognition technique without using onset detection nor F0 estimation. In Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing, volume 5, pages 229-232, May 2006.
[6] M. Goto and Y. Muraoka. Beat tracking based on multiple-agent architecture: A real-time beat tracking system for audio signals. In Proc. Second International Conference on Multiagent Systems, pages 103-110, 1996.
[7] A. Eronen. Comparison of features for musical instrument recognition. In Proc. IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, pages 19-22, 2001.
[8] H. Kawahara. STRAIGHT, exploration of the other aspect of vocoder: Perceptually isomorphic decomposition of speech sounds. Acoustic Science and Technology, 27(6):349-353, 2006.
[9] M. Slaney, M. Covell, and B. Lassiter. Automatic audio morphing. In Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing, pages 1001-1004, 1996.
[10] H. Kameoka, T. Nishimoto, and S. Sagayama. A multipitch analyzer based on harmonic temporal structured clustering. IEEE Transactions on Audio, Speech and Language Processing, 15(3):982-994, 2007.
[11] E. Tellman, L. Haken, and B. Holloway. Timbre morphing of sounds with unequal number of features. J. Audio Eng. Soc., 43(9):678-689, 1995.
[12] R. J. McAulay and T. F. Quatieri. Pitch estimation and voicing detection based on a sinusoidal speech model. In Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing, pages 249-252 vol. 1, Apr. 1990.
[13] G. Widmer. Modeling the rational basis of musical expression. Computer Music Journal, 19(2):76-96, 1995.
[14] T. Suzuki. A case based approach to the generation of musical expression. In Proc. International Joint Conference on Artificial Intelligence, pages 642-648, 1999.
[15] J. L. Arcos, R. L. de Mántaras, and X. Serra. Saxex: A case-based reasoning system for generating expressive musical performances. In Proc. International Computer Music Conference, pages 329-336, 1997.
[16] S. Canazza, G. De Poli, C. Drioli, A. Roda, and A. Vidolin. Modeling and control of expressiveness in music performance. Proceedings of the IEEE, 92(4):686-701, Apr. 2004.
[17] M. Casey and A. Westner. Separation of mixed audio sources by independent subspace analysis. In Proc. International Computer Music Conference, pages 154-161, 2000.
[18] H. Fletcher, E. Blackham, and R. Stratton. Quality of piano tones. The Journal of the Acoustical Society of America, 34(6):749-761, 1962.
[19] N. H. Fletcher and T. D. Rossing. The Physics of Musical Instruments. Springer, second edition, 1997.
[20] T. Takahashi, H. Kawahara, and T. Irino. Evaluation of iterative analysis-by-synthesis speech sounds using STRAIGHT. In Proc. of Autumn Meeting of Acoust. Soc. Japan, pages 289-290, 2007 (in Japanese).
[21] J. M. Grey. Multidimensional perceptual scaling of musical timbres. The Journal of the Acoustical Society of America, 61(5):1270-1277, 1977.
[22] J. Marozeau, A. de Cheveigné, S. McAdams, and S. Winsberg. The dependency of timbre on fundamental frequency. The Journal of the Acoustical Society of America, 114(5):2946-2957, 2003.
[23] T. Abe, K. Itoyama, K. Yoshii, K. Komatani, T. Ogata, and H. G. Okuno. Analysis-and-manipulation approach to pitch and duration of musical instrument sounds without distorting timbral characteristics. In Proc. Digital Audio Effects, pages 249-256, 2008.
[24] R. McAulay and T. Quatieri. Speech analysis/synthesis based on a sinusoidal representation. IEEE Transactions on Acoustics, Speech, and Signal Processing, 34(4):744-754, 1986.
[25] M. Portnoff. Implementation of the digital phase vocoder using the fast Fourier transform. IEEE Transactions on Acoustics, Speech, and Signal Processing, 24(3):243-248, 1976.
[26] M. Goto, H. Hashiguchi, T. Nishimura, and R. Oka. RWC music database: Popular, classical, and jazz music databases. In Proc. International Symposium on Music Information Retrieval, pages 287-288, October 2002.
[27] M. Goto, H. Hashiguchi, T. Nishimura, and R. Oka. RWC music database: Music genre database and musical instrument sound database. In Proc. International Symposium on Music Information Retrieval, pages 229-230, October 2003.
[28] R. D. Patterson. Auditory filter shapes derived with noise stimuli. The Journal of the Acoustical Society of America, 59(3):640-654, 1976.
[29] T. Yoshioka, T. Nakatani, and M. Miyoshi. Integrated speech enhancement method using noise suppression and dereverberation. IEEE Transactions on Audio, Speech and Language Processing, 17(2):231-246, Feb. 2009.