THE REASSIGNED BANDWIDTH-ENHANCED METHOD OF ADDITIVE SYNTHESIS
BY
KELLY RAYMOND FITZ
B.S., University of Illinois at Urbana-Champaign, 1990
M.S., University of Illinois at Urbana-Champaign, 1992
THESIS
Submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy in Electrical Engineering in the Graduate College of the University of Illinois at Urbana-Champaign, 1999
Urbana, Illinois
THE REASSIGNED BANDWIDTH-ENHANCED METHOD OF ADDITIVE SYNTHESIS

Kelly Raymond Fitz, Ph.D.
Department of Electrical Engineering
University of Illinois at Urbana-Champaign, 1999
Dr. Lippold Haken, Advisor

ABSTRACT

We introduce a highly manipulable, high-fidelity additive sound model capable of representing transient sounds and sounds having significant nonsinusoidal energy. We represent sounds as a collection of bandwidth-enhanced partials having sinusoidal and noise-like characteristics. Partials are defined by a trio of synchronized breakpoint envelopes specifying the time-varying amplitude, center frequency, and noise content. Breakpoints for the partial envelopes are obtained by following ridges on a time-frequency surface computed by the method of reassignment. The Reassigned Bandwidth-Enhanced Model yields greater resolution in time and frequency than conventional additive techniques, and preserves noise-like and transient signals, even in modified reconstruction.
To my parents, Ronald and Karen Fitz, for their love and support.
ACKNOWLEDGMENTS Many people have contributed greatly to this research. I cannot begin to acknowledge them all. I am especially grateful to Lippold Haken, my friend and thesis advisor, for his stubbornness and dedication to this research, and for coming through for me every single time. The technical contributions of Bill Walker, who was there at the very beginning, and Bryan Holloway, who was there at the very end, were invaluable, but I am most grateful to them for their friendship. The greatest debt of gratitude I owe to my best friend, partner, and wife, Ulrike Axen, and to our daughter, Margot Fitz Axen, who are my constant inspiration.
TABLE OF CONTENTS

CHAPTER                                                                PAGE

1 INTRODUCTION . . . . . 1

2 SHORT-TIME FOURIER ANALYSIS . . . . . 5
  2.1 The Fourier Transform . . . . . 5
  2.2 Discrete Time . . . . . 7
  2.3 The Discrete Fourier Transform . . . . . 9
  2.4 The Short-Time Fourier Transform . . . . . 10
  2.5 Windowing . . . . . 12
  2.6 Reconstruction from the Short-Time Fourier Transform . . . . . 19
  2.7 The Phase Vocoder . . . . . 23
3 THE BASIC SINUSOIDAL MODEL . . . . . 25
  3.1 Overview of the Basic Sinusoidal Model . . . . . 25
  3.2 Development of the Basic Sinusoidal Model . . . . . 28
  3.3 Validity of the Sinusoidal Model . . . . . 33
  3.4 The Importance of Phase . . . . . 34
  3.5 Improving Frequency Estimates . . . . . 35
4 BANDWIDTH-ENHANCED SINUSOIDAL MODELING . . . . . 39
  4.1 Noise Representation in Sinusoidal Models . . . . . 39
  4.2 Hybrid Sound Models . . . . . 42
  4.3 Bandwidth-Enhanced Synthesis . . . . . 43
    4.3.1 Bandwidth enhancement using phase modulation . . . . . 48
  4.4 The Bandwidth-Enhanced Sinusoidal Model . . . . . 50
5 BANDWIDTH ASSOCIATION . . . . . 56
  5.1 Energy Reassociation . . . . . 56
    5.1.1 Thresholding, hysteresis, and energy reassociation . . . . . 60
  5.2 Bandwidth Association . . . . . 63
    5.2.1 Energy matching strategies . . . . . 64
    5.2.2 Loudness matching . . . . . 68
6 TIME-FREQUENCY REASSIGNMENT . . . . . 80
  6.1 The Method of Reassignment . . . . . 81
  6.2 Sharpening Transients . . . . . 85
  6.3 Partial Cropping . . . . . 88
  6.4 Phase Maintenance . . . . . 91
  6.5 Multiple Transient Events . . . . . 93
  6.6 Reassigned Bandwidth-Enhanced Additive Modeling . . . . . 95
7 EVALUATION AND DIRECTIONS FOR FUTURE RESEARCH . . . . . 99

APPENDIX A AUDITORY EXPERIMENTS . . . . . 102
  A.1 Flute Tone . . . . . 103
  A.2 Cello Tone . . . . . 103
  A.3 Flutter-Tongued Flute Tone . . . . . 108
  A.4 Bongo Roll . . . . . 116
  A.5 Orchestral Gong . . . . . 124
  A.6 Alto Saxophone Phrase . . . . . 125
  A.7 Piano and Soprano Saxophone Duet . . . . . 133
  A.8 French Speech . . . . . 134

REFERENCES . . . . . 146

VITA . . . . . 150
LIST OF FIGURES

Figure                                                                 Page

2.1 Two short-time Fourier magnitude plots (also called spectrograms) of a cello tone. . . . . . 13
2.2 Time-domain shape of several members of the Kaiser window family having the same length N. . . . . . 16
2.3 Fourier magnitude spectra of the Kaiser window for several values of the shaping parameter. . . . . . 17
2.4 Two identical, bell-shaped analysis windows superimposed on a (relatively) low-frequency sine wave. . . . . . 17
2.5 A system for block-wise short-time Fourier analysis and modified reconstruction employing block DFT analysis and overlap-add synthesis. . . . . . 21
2.6 A system for short-time Fourier analysis and modified resynthesis by filterbank summation. . . . . . 22
3.1 The basic MQ sinusoidal analysis process. . . . . . 26
3.2 Data produced by an MQ-style sinusoidal analysis of a cello tone, pitch G2 (G below low C). . . . . . 27
3.3 Refining peak frequency estimates using parabolic interpolation of the magnitude spectrum. . . . . . 36

4.1 Sinusoidal analysis of a breathy flute tone having a fundamental frequency of approximately 293 Hz. . . . . . 41
4.2 Block diagram of a narrow-band noise generator. . . . . . 44
4.3 Several overlapping bell-shaped magnitude spectra (dotted lines), similar to spectra of generators described by Equation (4.2), combine to produce a wideband noise spectrum (solid line). . . . . . 46
4.4 Block diagram of the Bandwidth-Enhanced Oscillator. . . . . . 46
4.5 Spectra for partials having different amounts of spectral line widening due to bandwidth enhancement. . . . . . 47
4.6 Constrained sinusoidal analysis data for the breathy flute previously shown in Figure 4.1. . . . . . 52
4.7 Three-dimensional spectrogram of a sinusoidal flute synthesis, with time increasing from front to back. . . . . . 54
4.8 Three-dimensional spectrogram of a bandwidth-enhanced flute synthesis, with time increasing from front to back. . . . . . 55
4.9 Three-dimensional spectrogram of the flute recording analyzed to produce the sinusoidal data plotted in Figure 4.6. . . . . . 55
5.1 Graph of sinusoidal partials for the breathy flute sound, previously plotted in Figure 4.1. . . . . . 58
5.2 Graph of pruned, bandwidth-enhanced partials for the breathy flute sound. . . . . . 59
5.3 Graph of noise amplitude envelopes for bandwidth-enhanced partials between 1600 Hz and 3000 Hz in the pruned flute representation shown in Figure 5.2. . . . . . 61
5.4 Graph of sinusoidal amplitude envelopes for bandwidth-enhanced partials between 1600 Hz and 3000 Hz in the pruned flute representation shown in Figure 5.2. . . . . . 61
5.5 A partial amplitude envelope (solid line) superimposed on a static amplitude threshold (dashed line). . . . . . 62
5.6 A partial amplitude envelope (solid line) superimposed on a time-varying amplitude threshold (dashed line). . . . . . 63
5.7 A hypothetical short-time magnitude spectrum, with significant sinusoidal components indicated by X's at the magnitude peaks, and bandwidth association regions identified by vertical dashed lines. . . . . . 66
5.8 Equal loudness curves, also called Fletcher-Munson curves, relating loudness level to intensity across the range of audible frequencies and audible sound pressure levels. . . . . . 69
5.9 Log-linear plot of loudness in sones against loudness level. . . . . . 71
5.10 Log-linear plot of the critical bandwidth function, relating bark frequency to frequency in hertz. . . . . . 72
5.11 Tapered weighting functions due to overlapping bandwidth association regions. . . . . . 73
5.12 Onset of a cello tone reconstructed from bandwidth-enhanced analysis data. . . . . . 76
5.13 Synthesized noise in the onset of a cello tone reconstructed from bandwidth-enhanced analysis data. . . . . . 76
5.14 Spectrogram of the cello tone reconstructed from bandwidth-enhanced analysis data magnified in Figure 5.12. . . . . . 77
5.15 Spectrogram of a cello tone reconstructed from purely sinusoidal analysis data. . . . . . 77
5.16 Spectrogram of the source cello tone analyzed and reconstructed in Figures 5.14 and 5.15. . . . . . 78

6.1 Window functions used by the three short-time transforms used to compute reassigned times and frequencies. . . . . . 83
6.2 Comparison of time-frequency data included in common representations. . . . . . 84
6.3 Two windowed short-time waveforms that are not distinguished in traditional short-time analysis methods. . . . . . 85
6.4 Two long analysis windows superimposed at different times on a square wave signal with an abrupt turn-on. . . . . . 86
6.5 Onset of square wave reconstruction without reassignment. . . . . . 87
6.6 Onset of square wave reconstruction with reassignment. . . . . . 87
6.7 Time-frequency analysis data points for an abrupt square wave onset, as shown in Figure 6.4. . . . . . 87
6.8 Onset of square wave reconstruction with reassignment and removal of unreliable partial parameter estimates. . . . . . 89
6.9 Time-frequency coordinates of data from two reassigned bandwidth-enhanced analyses before (a) and after (b) removal of off-center components clumped together at partial births. . . . . . 89
6.10 Data from analysis of a square wave using a 20-dB relative partial amplitude threshold and hysteresis without time-frequency reassignment and removal of unreliable data points. . . . . . 90
6.11 Data from analysis of a square wave using a 20-dB relative partial amplitude threshold and hysteresis with time-frequency reassignment and removal of unreliable data points. . . . . . 91
6.12 Reconstructed square waveform from five harmonic partials without reassignment or removal of off-center components using linear frequency interpolation. . . . . . 93
6.13 Reconstructed square waveform from five harmonic partials without reassignment or removal of off-center components using cubic phase interpolation. . . . . . 93
6.14 Reconstructed square waveform, frequencies shifted by 10%, from five harmonic partials without reassignment or removal of off-center components using cubic phase interpolation. . . . . . 94
6.15 Reconstructed square waveform from five harmonic partials with reassignment and removal of off-center components and linear frequency interpolation. . . . . . 94
6.16 Reconstructed square waveform, frequencies shifted by 10%, from five harmonic partials with reassignment and removal of off-center components and linear frequency interpolation. . . . . . 94
6.17 Time-frequency plot of reassigned bandwidth-enhanced analysis data for one strike in a bongo roll. . . . . . 96
6.18 Time-frequency plot of reassigned bandwidth-enhanced analysis data for one strike in a bongo roll with partials broken at components having large time corrections, and far off-center components removed. . . . . . 96
6.19 Waveform plot for two strikes in a bongo roll reconstructed from reassigned bandwidth-enhanced data. . . . . . 98
6.20 Waveform plot for two strikes in a bongo roll reconstructed from nonreassigned bandwidth-enhanced data, synthesized using cubic phase interpolation to maintain phase accuracy. . . . . . 98
6.21 Plot of the source waveform for the bongo strikes analyzed and reconstructed in Figures 6.19 and 6.20. . . . . . 98
A.1 Waveform and spectrogram plots for a flute tone, pitch D4 (D above middle C). . . . . . 104
A.2 Waterfall plot for a flute tone, pitch D4 (D above middle C). . . . . . 104
A.3 Plot of reassigned bandwidth-enhanced analysis data for a flute tone, pitch D4 (D above middle C). . . . . . 105
A.4 Waveform and spectrogram plots for a reconstruction of the flute tone plotted in Figure A.1 from the reassigned bandwidth-enhanced analysis data plotted in Figure A.3. . . . . . 106
A.5 Waterfall plot for a reconstruction of the flute tone plotted in Figure A.1 from the reassigned bandwidth-enhanced analysis data plotted in Figure A.3. . . . . . 106
A.6 Waveform and spectrogram plots for a reconstruction of the flute tone plotted in Figure A.1 from reassigned non-bandwidth-enhanced analysis data. . . . . . 107
A.7 Waterfall plot for a reconstruction of the flute tone plotted in Figure A.1 from reassigned non-bandwidth-enhanced analysis data. . . . . . 107
A.8 Waveform plot for synthesized noise in a reconstruction of the flute tone plotted in Figure A.1 from the reassigned bandwidth-enhanced analysis data plotted in Figure A.3. . . . . . 108
A.9 Waveform and spectrogram plots for a cello tone, pitch D-sharp 3 (D below middle C). . . . . . 109
A.10 Waterfall plot for a cello tone, pitch D-sharp 3 (D below middle C). . . . . . 109
A.11 Plot of reassigned bandwidth-enhanced analysis data for a cello tone, pitch D-sharp 3 (D below middle C). . . . . . 110
A.12 Waveform and spectrogram plots for a reconstruction of the cello tone plotted in Figure A.9 from the reassigned bandwidth-enhanced analysis data plotted in Figure A.11. . . . . . 111
A.13 Waterfall plot for a reconstruction of the cello tone plotted in Figure A.9 from the reassigned bandwidth-enhanced analysis data plotted in Figure A.11. . . . . . 111
A.14 Waveform and spectrogram plots for a reconstruction of the cello tone plotted in Figure A.9 from reassigned non-bandwidth-enhanced analysis data. . . . . . 112
A.15 Waterfall plot for a reconstruction of the cello tone plotted in Figure A.9 from reassigned non-bandwidth-enhanced analysis data. . . . . . 112
A.16 Waveform plot for synthesized noise in a reconstruction of the cello tone plotted in Figure A.9 from the reassigned bandwidth-enhanced analysis data plotted in Figure A.11. . . . . . 113
A.17 Waveform and spectrogram plots for the first 300 ms of a reconstruction of the cello tone plotted in Figure A.9 from the reassigned bandwidth-enhanced analysis data plotted in Figure A.11. . . . . . 113
A.18 Waveform and spectrogram plots for synthesized noise in the first 300 ms of a reconstruction of the cello tone plotted in Figure A.9 from the reassigned bandwidth-enhanced analysis data plotted in Figure A.11. . . . . . 114
A.19 Waveform and spectrogram plots for the first 300 ms of a reconstruction of the cello tone plotted in Figure A.9 from reassigned sinusoidal analysis data without bandwidth enhancement. . . . . . 114
A.20 Waveform and spectrogram plots for a flutter-tongued flute tone, pitch E4 (E above middle C). . . . . . 115
A.21 Waterfall plot for a flutter-tongued flute tone, pitch E4 (E above middle C). . . . . . 116
A.22 Plot of reassigned bandwidth-enhanced analysis data for a flutter-tongued flute tone, pitch E4 (E above middle C). . . . . . 117
A.23 Waveform and spectrogram plots for a reconstruction of the flutter-tongued flute tone plotted in Figure A.20 from the reassigned bandwidth-enhanced analysis data plotted in Figure A.22. . . . . . 118
A.24 Waterfall plot for a reconstruction of the flutter-tongued flute tone plotted in Figure A.20 from the reassigned bandwidth-enhanced analysis data plotted in Figure A.22. . . . . . 118
A.25 Waveform and spectrogram plots for a reconstruction of the flutter-tongued flute tone plotted in Figure A.20, analyzed using a long window that smears out the flutter effect. . . . . . 119
A.26 Waveform and spectrogram plots for a bongo roll. . . . . . 120
A.27 Waterfall plot for a bongo roll. . . . . . 120
A.28 Plot of reassigned bandwidth-enhanced analysis data for a bongo roll. . . . . . 121
A.29 Waveform and spectrogram plots for a reconstruction of the bongo roll plotted in Figure A.26 from the reassigned bandwidth-enhanced analysis data plotted in Figure A.28. . . . . . 122
A.30 Waterfall plot for a reconstruction of the bongo roll plotted in Figure A.26 from the reassigned bandwidth-enhanced analysis data plotted in Figure A.28. . . . . . 122
A.31 Waveform plot for a reconstruction of the two strikes in the bongo roll plotted in Figure A.26 from the reassigned bandwidth-enhanced analysis data plotted in Figure A.28. . . . . . 123
A.32 Waveform plot for a reconstruction of the two strikes in the bongo roll plotted in Figure A.26 from nonreassigned, non-bandwidth-enhanced analysis data, synthesized using cubic phase interpolation to maintain phase accuracy. . . . . . 123
A.33 Plot of the source waveform for the bongo strikes analyzed and reconstructed in Figures A.31 and A.32. . . . . . 123
A.34 Waveform plot for the two bongo strikes plotted in Figure A.33 reconstructed from the reassigned bandwidth-enhanced analysis data plotted in Figure A.28 with time dilation by a factor of two. . . . . . 123
A.35 Waveform plot for the two bongo strikes plotted in Figure A.33 reconstructed from nonreassigned, bandwidth-enhanced analysis data with time dilation by a factor of two. . . . . . 124
A.36 Waveform plot for synthesized noise in a reconstruction of the bongo tone plotted in Figure A.26 from the reassigned bandwidth-enhanced analysis data plotted in Figure A.28. . . . . . 124
A.37 Waveform and spectrogram plots for a gong strike. . . . . . 125
A.38 Waterfall plot for a gong strike. . . . . . 126
A.39 Plot of reassigned bandwidth-enhanced analysis data for a gong strike. . . . . . 127
A.40 Plot of the reassigned bandwidth-enhanced gong analysis data plotted in Figure A.39 with all partials shorter than 100 ms pruned and their energy redistributed as bandwidth among the remaining partials. . . . . . 128
A.41 Waveform and spectrogram plots for a reconstruction of the gong strike plotted in Figure A.37 from the pruned reassigned bandwidth-enhanced analysis data plotted in Figure A.40. . . . . . 129
A.42 Waterfall plot for a reconstruction of the gong strike plotted in Figure A.37 from the pruned reassigned bandwidth-enhanced analysis data plotted in Figure A.40. . . . . . 129
A.43 Waveform and spectrogram plots for a short alto saxophone phrase. . . . . . 130
A.44 Waterfall plot for a short alto saxophone phrase. . . . . . 130
A.45 Plot of reassigned bandwidth-enhanced analysis data for a short alto saxophone phrase. . . . . . 131
A.46 Waveform and spectrogram plots for a reconstruction of the alto saxophone phrase plotted in Figure A.43 from the reassigned bandwidth-enhanced analysis data plotted in Figure A.45. . . . . . 132
A.47 Waterfall plot for a reconstruction of the alto saxophone phrase plotted in Figure A.43 from the reassigned bandwidth-enhanced analysis data plotted in Figure A.45. . . . . . 132
A.48 Waveform plot for a reconstruction of the two notes in the alto saxophone phrase plotted in Figure A.43 from the reassigned bandwidth-enhanced analysis data plotted in Figure A.45. . . . . . 133
A.49 Waveform plot for a reconstruction of the two notes in the alto saxophone phrase plotted in Figure A.43 from nonreassigned analysis data. . . . . . 133
A.50 Waveform and spectrogram plots for a piano and soprano saxophone duet. . . . . . 135
A.51 Waterfall plot for a piano and soprano saxophone duet. . . . . . 135
A.52 Plot of reassigned bandwidth-enhanced analysis data for a piano and soprano saxophone duet. . . . . . 136
A.53 Waveform and spectrogram plots for a reconstruction of the piano and soprano saxophone duet plotted in Figure A.50 from the reassigned bandwidth-enhanced analysis data plotted in Figure A.52. . . . . . 137
A.54 Waterfall plot for a reconstruction of the piano and soprano saxophone duet plotted in Figure A.50 from the reassigned bandwidth-enhanced analysis data plotted in Figure A.52. . . . . . 137
A.55 Magnified plot of reassigned bandwidth-enhanced analysis data for the sharp attack of a note in a piano and soprano saxophone duet. . . . . . 138
A.56 Waveform plot for the onset of one note in a piano and soprano saxophone duet. . . . . . 139
A.57 Waveform plot for the onset of one note in a piano and soprano saxophone duet reconstructed from reassigned bandwidth-enhanced analysis data with partial breaking at points of large time correction. . . . . . 139
A.58 Waveform plot for the onset of one note in a piano and soprano saxophone duet reconstructed from reassigned bandwidth-enhanced analysis data without partial breaking. . . . . . 139
A.59 Waveform plot for the onset of one note in a piano and soprano saxophone duet reconstructed with time dilation by a factor of 1.5 from reassigned bandwidth-enhanced analysis data with partial breaking at points of large time correction. . . . . . 139
A.60 Waveform plot for the onset of one note in a piano and soprano saxophone duet reconstructed with time dilation by a factor of 1.5 from reassigned bandwidth-enhanced analysis data without partial breaking. . . . . . 140
A.61 Waveform plot for the onset of one note in a piano and soprano saxophone duet reconstructed with time dilation by a factor of 3.0 from reassigned bandwidth-enhanced analysis data with partial breaking at points of large time correction. . . . . . 140
A.62 Waveform plot for the onset of one note in a piano and soprano saxophone duet reconstructed with time dilation by a factor of 3.0 from reassigned bandwidth-enhanced analysis data without partial breaking. . . . . . 140
A.63 Waveform and spectrogram plots for a fragment of French speech. . . . . . 141
A.64 Waterfall plot for a fragment of French speech. . . . . . 141
A.65 Plot of reassigned bandwidth-enhanced analysis data for a fragment of French speech. . . . . . 142
A.66 Spectrogram plot for a fragment of French speech magnified to show the pitch variation due to the speaker's inflection. . . . . . 143
A.67 Plot of reassigned bandwidth-enhanced analysis data for a fragment of French speech magnified to show the partial tracking of the inflection visible in Figure A.66. . . . . . 144
A.68 Waveform and spectrogram plots for a reconstruction of the fragment of French speech plotted in Figure A.63 from the reassigned bandwidth-enhanced analysis data plotted in Figure A.65. . . . . . 145
A.69 Waterfall plot for a reconstruction of the fragment of French speech plotted in Figure A.63 from the reassigned bandwidth-enhanced analysis data plotted in Figure A.65. . . . . . 145
CHAPTER 1
INTRODUCTION

Musical applications of signal processing are distinguished by their emphasis on modification and transformation. Computer representations of sounds are commonly employed for such disparate purposes as pitch correction, noise reduction, and synthesis of musical timbres. The emphasis on manipulation puts unique demands on the capabilities of general-purpose digital sound models.

The characterization of a signal as a sum of sinusoids is fundamental to many areas of signal processing, particularly audio signal processing. The theoretical bases for sinusoidal decomposition can be found in any rudimentary signal processing text [1, 2]. Sinusoidal decomposition of audio signals has a long history in the investigation of the perception of musical sounds [3, 4] and forms the basis for modern psychoacoustic research on the perception of pitch and timbre [5, 6]. Sum-of-sines representations have many convenient features for sound processing and manipulation that have made them the basis for many audio signal processing applications.

Digital computers have greatly facilitated time-variant audio analysis. Initially, much of this analysis was focused on resolving the time-varying characteristics of pitched musical instrument tones using a pitch-synchronous sum-of-sines model, meaning that a sampled waveform is analyzed in blocks that are an integral number of pitch periods in length. This structure admits a standard Fourier series representation, wherein the sound is represented by a collection of sinusoidal components corresponding approximately to harmonics of the tone [7, 8, 9]. The component amplitudes and frequencies are time-variant, though the frequencies deviate only minimally from the harmonic frequencies.
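The pitch-synchronous scheme described above can be sketched in a few lines: when each analysis block spans exactly one pitch period, bin k of a period-length DFT measures harmonic k directly. This is an illustrative sketch under that assumption, not the algorithm of any particular cited work; all names are our own.

```python
import numpy as np

def pitch_synchronous_harmonics(x, period, num_harmonics):
    """Estimate time-varying harmonic amplitudes by analyzing a sampled
    waveform in blocks exactly one pitch period long.  Each block admits
    a Fourier series: DFT bin k corresponds directly to harmonic k."""
    frames = []
    for start in range(0, len(x) - period + 1, period):
        block = x[start:start + period]
        spectrum = np.fft.rfft(block)
        # Scale so each value is the peak amplitude of harmonic k.
        amps = 2.0 * np.abs(spectrum[1:num_harmonics + 1]) / period
        frames.append(amps)
    return np.array(frames)  # shape: (num_blocks, num_harmonics)

# A synthetic 100 Hz tone sampled at 8 kHz: the period is 80 samples.
sr, f0 = 8000, 100
t = np.arange(sr) / sr
x = np.sin(2 * np.pi * f0 * t) + 0.5 * np.sin(2 * np.pi * 2 * f0 * t)
amps = pitch_synchronous_harmonics(x, period=80, num_harmonics=3)
```

For this stationary tone every block yields the same amplitudes (1.0, 0.5, and 0.0 for the third harmonic); for a real instrument tone the rows trace the time-varying harmonic envelopes, which is exactly what the early pitch-synchronous analyses recorded.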
Maintaining the relationship between the pitch period and the length of the analysis blocks is problematic for sounds with widely varying pitch and is clearly impractical for nonharmonic sounds. More recent methods of sinusoidal modeling depend less on harmonicity and represent sounds as a collection of sinusoidal components having arbitrary time-varying frequencies and amplitudes [10, 11, 12, 13], but they are still limited in application to sounds that are locally periodic. A popular modern sinusoidal model is presented in Chapter 3 of this dissertation.

Our research is driven by the allure of a flexible, general-purpose digital sound model, applicable to a wide variety of sounds, enabling time- and frequency-scale modification, timbre morphing, and other abstract timbral modifications. We present a representation that shares many of the desirable properties of traditional sinusoidal models, but is able to represent a much greater variety of sounds without an increase in complexity or loss of manipulability.

The Reassigned Bandwidth-Enhanced Model is similar in spirit to traditional sinusoidal models in that a waveform is modeled as a collection of components, called partials, having time-varying amplitude and frequency envelopes. Our partials are not strictly sinusoidal, however. We developed a technique called bandwidth enhancement to combine sinusoidal energy and noise energy into a single partial having time-varying amplitude, frequency, and bandwidth parameters [14]. We developed the method of time-frequency reassignment, an adaptation of a technique previously used to improve the readability of spectrogram plots [15, 16], to improve the time and frequency estimates used to define our partial parameter envelopes. The sharpened time-frequency estimates allow us to represent transient and short-duration waveforms very accurately, and to preserve transient shapes in modified reconstruction.
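To make the idea of a bandwidth-enhanced partial concrete, one such partial might be rendered from its three synchronized envelopes as below. This is only an illustration of the concept: the actual Bandwidth-Enhanced Oscillator is developed in Chapter 4, and the one-pole noise filter and energy-preserving mixing rule used here are our own simplifications, not the oscillator's defined structure.

```python
import numpy as np

def render_partial(times, amps, freqs, bandwidths, sr=44100, seed=0):
    """Render one bandwidth-enhanced partial from three synchronized
    breakpoint envelopes: amplitude, center frequency (Hz), and
    bandwidth (the fraction of the partial's energy that is noise)."""
    n = int(times[-1] * sr)
    t = np.arange(n) / sr
    a = np.interp(t, times, amps)        # amplitude envelope
    f = np.interp(t, times, freqs)       # frequency envelope
    k = np.interp(t, times, bandwidths)  # noisiness: 0 = sine, 1 = noise
    # Low-pass noise modulator: one-pole smoothing of white noise.
    noise = np.random.default_rng(seed).standard_normal(n)
    for i in range(1, n):
        noise[i] = 0.995 * noise[i - 1] + 0.005 * noise[i]
    noise /= np.max(np.abs(noise))
    # Mix a steady carrier with a noise-modulated carrier so that the
    # bandwidth envelope controls the noise fraction of the energy.
    mod = np.sqrt(1.0 - k) + np.sqrt(2.0 * k) * noise
    phase = 2.0 * np.pi * np.cumsum(f) / sr
    return a * mod * np.cos(phase)

# A 440 Hz partial that decays over 0.5 s while growing noisier.
y = render_partial([0.0, 0.5], [0.8, 0.0], [440.0, 440.0], [0.0, 0.6])
```

The point of the construction is that a single component degrades gracefully from pure sinusoid to narrow-band noise as its bandwidth envelope rises, so noisy sounds need no separate model alongside the partials.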
The body of this document is divided into two background chapters and three chapters presenting our own work, organized as follows. Chapter 2 presents a review of short-time Fourier analysis, the basis for sinusoidal decomposition of time-varying signals, and fundamental to much of audio signal processing. The short-time Fourier transform is the fundamental analytical tool in our algorithms, and in most sinusoidal modeling techniques. Chapter 3 describes a basic sinusoidal sound model, first presented by McAulay and Quatieri [10], that is representative of modern sinusoidal representations.
Chapter 4 introduces the notion of bandwidth enhancement and its application to the basic sinusoidal model. Bandwidth enhancement fundamentally changes the basic model by using nonsinusoidal components, allowing it to represent a greater variety of sounds, including noisy sounds, but retains the structure of the basic model. Chapter 5 describes the extensions to basic sinusoidal modeling techniques used to generate bandwidth-enhanced representations. Chapter 6 introduces the use of the method of reassignment [16] for sharpening the time-frequency characteristics of our model. This technique, which we call time-frequency reassignment, addresses the fundamental tradeoff between time and frequency resolution in short-time Fourier analysis by extracting temporal information from the short-time phase spectrum. The final chapter, Chapter 7, considers the evaluation of our representation, and directions for future work.

At the risk of belaboring the obvious, it may be useful to introduce some terms used frequently in this dissertation that may have unintended or conflicting connotations from other fields of research or from common use. We refer to audio signals as waveforms or sounds, but generally we prefer the term sound when we wish to stress perceptual characteristics, and use the term waveform to refer to an arbitrary signal. We use the terms model and representation interchangeably, but restrict our use of the former because it often implies more about the representation of higher level structure and aggregate behavior, issues which we do not address in this research. We use the term partial to denote a primitive component of a sound model. Historically, sound models have used sinusoids as primitive components, but we use the same term to refer to our model components, though they are not sinusoidal.
In some discussions, sinusoidal components in harmonic representations of pitched sounds are called partials, but our use of the term implies nothing about the harmonicity of the analyzed sound, or about the frequency distribution of the components. When describing the auditory results of our experiments, we use the term artifact to refer to perceived elements and characteristics that are recognizable as foreign and undesirable, and the
more general term fidelity to describe the ability of a reconstruction to accurately represent all perceived elements and characteristics of its source sound, without omission and without artifacts. In discussions of experimental results throughout this presentation, it should be recognized that no mathematical, graphical, or textual description can fully convey the quality or characteristics of sound synthesis. The success of this research can only be judged by auditory evaluation of experimental results. Many of the source sounds used in our examples were taken from the McGill University Master Samples (MUMS) compact discs [17]. Waveform, spectrogram, and waterfall plots were produced using the SoundMaker software application [18]. Sinusoidal analysis data plots were produced using unreleased development versions of our Loris software application [19] and its predecessor, the Lemur software application [20], with a PostScript generation module written by Edwin Tellman.
CHAPTER 2
SHORT-TIME FOURIER ANALYSIS

Most sinusoidal models are based on digital short-time spectral analysis of the audio waveform. The short-time Fourier transform (STFT) is widely used for analysis of time-varying signals, and is covered in detail in most digital signal processing texts (for example [1, 2]). The presentation of the STFT in this chapter is intended to introduce the terminology and the theoretical basis for much of the work in the succeeding chapters. This discussion is not intended to be rigorous or exhaustive, since the concepts presented are covered in great detail in the many texts devoted to digital signal processing, and will be familiar to many readers.
2.1 The Fourier Transform

The idea of decomposing sound into primitive components can be traced to the 19th-century mathematician G. S. Ohm, who asserted that there is only one kind of periodic vibration with no harmonics above the prime tone, and all musical (periodic) tones can be composed from a sum of such tones [3]. Helmholtz called these tones simple tones, and defined compound tones as combinations of simple tones. He also described numerous inharmonic secondary tones produced by membranes, metal plates, and elastic rods [4]. Jean Baptiste Fourier showed mathematically that any periodic waveform can be produced by a discrete, though possibly infinite, sum of harmonically related sinusoids (called the Fourier series, though it had been used occasionally by Euler and other 18th-century mathematicians), and that a particular waveform is described by a unique set of amplitudes and phases. The Fourier series
representation of a continuous function of time x(t) that is periodic with period T is written

x(t) = \sum_{k=-\infty}^{\infty} c_k \, e^{j 2\pi k t / T}.    (2.1)
The magnitude of the kth complex Fourier coefficient c_k corresponds to the amplitude weighting of a sinusoidal component at a frequency k times the fundamental frequency (the inverse of the period T), and the phase of c_k is a phase offset for that component. The Fourier series coefficients can be computed by evaluating

c_k = \frac{1}{T} \int_{0}^{T} x(t) \, e^{-j 2\pi k t / T} \, dt.    (2.2)
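As a concrete illustration of Equation (2.2), the following sketch (not part of the original presentation; all names and values are illustrative) numerically approximates the Fourier series coefficients of a simple periodic waveform and recovers the expected amplitude weighting:

```python
import numpy as np

# Numerical check of Equation (2.2): the Fourier series coefficients of a
# periodic waveform recover the amplitudes of its sinusoidal components.
T = 1.0                      # period in seconds
N = 4096                     # integration grid points over one period
t = np.arange(N) * (T / N)   # one full period, endpoint excluded

# x(t) = 3 cos(2 pi t / T): expect |c_1| = |c_{-1}| = 1.5, all other c_k = 0.
x = 3.0 * np.cos(2 * np.pi * t / T)

def fourier_coefficient(x, t, T, k):
    """Approximate c_k = (1/T) * integral over one period of
    x(t) exp(-j 2 pi k t / T) dt, using the rectangle rule."""
    integrand = x * np.exp(-2j * np.pi * k * t / T)
    return np.mean(integrand)

c1 = fourier_coefficient(x, t, T, 1)
c2 = fourier_coefficient(x, t, T, 2)
print(abs(c1), abs(c2))  # approximately 1.5 and 0.0
```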
A periodic waveform must repeat itself exactly each period, so its Fourier series representation must comprise only sinusoids having a whole number of periods in the fundamental period T. This is equivalent to saying that the sinusoidal frequencies must be whole-number multiples of the fundamental frequency; hence, the series representation. To extend the Fourier series reconstruction of Equation (2.1) to general waveforms without restriction on the sinusoidal frequencies, the sum over a discrete set of coefficients is replaced by an integral of a continuous function of frequency

x(t) = \frac{1}{2\pi} \int_{-\infty}^{\infty} X(\omega) \, e^{j\omega t} \, d\omega.    (2.3)
Conceptually, X(\omega) is a complex function of radian frequency specifying the weighting of each sinusoidal component in the representation of the time function x(t) as a sum of sinusoids of all frequencies. This function of frequency, called the Fourier transform of x(t), can be computed from the time waveform by

X(\omega) = \int_{-\infty}^{\infty} x(t) \, e^{-j\omega t} \, dt.    (2.4)
Each of these 19th-century transforms has the useful property of being one-to-one, or injective, so there exists a unique Fourier representation for any waveform. The inverse Fourier transform is also one-to-one, so the waveform can be unambiguously reconstructed from its Fourier representation. However, both transforms are continuous in the time domain. In order to process audio
signals digitally, the waveforms must be sampled at a finite rate, and discrete-in-time versions of the Fourier transform must be employed.
2.2 Discrete Time

When a continuous-time audio waveform, described by a function of a continuous time variable and written x(t), is sampled at a constant rate of one sample every T_S seconds, the sampled waveform can be written as a function of a discrete time index n:

x_n = x(t_0 + n T_S), \quad n = -\infty \ldots -1, 0, 1 \ldots \infty    (2.5)

where t_0 is an arbitrary time reference. The sampling rate, or sampling frequency f_S, is equal to 1/T_S samples per second (hertz). In discrete-time signal processing, it is often useful to use radian frequencies, expressed in units of radians per sample, instead of absolute frequencies. Radian frequencies are related to absolute frequencies through the relation

\omega = \frac{2\pi f}{f_S}    (2.6)

and the sampling frequency corresponds to a radian frequency of 2\pi radians per sample. Radian frequency greatly simplifies the notation of digital signals, since a sinusoidal waveform

x_n = \sin\!\left(\frac{2\pi n f}{f_S}\right)    (2.7)

can be written

x_n = \sin(\omega n).    (2.8)

Whereas a continuous-time audio waveform may contain energy at any frequency, a sampled waveform is comprised only of energy at frequencies below the so-called Nyquist frequency or Nyquist rate, which is equal to half of the sampling frequency. More precisely, for real-valued waveforms, samples of a sinusoid of frequency f cannot be distinguished from samples of a sinusoid of frequency f_S - f; the samples are identical [2, Chapter 3]. In fact, all sinusoids at frequencies

f + k f_S, \quad k = -\infty \ldots -1, 0, 1 \ldots \infty    (2.9)

have the same samples and are aliases of each other. This is made clear using radian frequencies, since

x_n = \sin\!\left(\frac{2\pi n [f + k f_S]}{f_S}\right) = \sin([\omega + 2\pi k] n) = \sin(\omega n + 2\pi k n) = \sin(\omega n).
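A short numerical sketch (illustrative values only, not from the original text) confirms the aliasing relation of Equation (2.9): sinusoids whose frequencies differ by a whole multiple of the sampling frequency produce identical samples.

```python
import numpy as np

# Aliasing check: samples of sinusoids at f and f + k*fS are identical.
fS = 8000.0          # sampling frequency in Hz (illustrative)
f = 440.0            # a frequency below the Nyquist frequency fS/2
n = np.arange(64)    # sample indices

x_base = np.sin(2 * np.pi * n * f / fS)
x_alias = np.sin(2 * np.pi * n * (f + 2 * fS) / fS)   # k = 2

print(np.allclose(x_base, x_alias))  # True: the samples are indistinguishable
```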
By convention, we assume that sampled waveforms are bandlimited to (-f_S/2, f_S/2) (negative frequencies are needed to describe sinusoids of differing phases), and pains are taken in the sampling process to ensure that energy at frequencies higher than the Nyquist frequency is removed before samples are computed. The discrete-time Fourier transform (DTFT) shares most of the properties of its continuous-time counterpart and is computed from a sampled waveform by

X(\omega) = \sum_{n=-\infty}^{\infty} x_n \, e^{-j\omega n}.    (2.10)

Like the continuous-time Fourier transform, the DTFT is one-to-one, and the original sampled waveform can be recovered from

x_n = \frac{1}{2\pi} \int_{-\pi}^{\pi} X(\omega) \, e^{j\omega n} \, d\omega.    (2.11)

Due to aliasing, it is clear that the discrete-time Fourier transform is periodic with period equal to the sample rate, or 2\pi radians per sample.
2.3 The Discrete Fourier Transform

For digital signal processing, the frequency domain of the DTFT is difficult to work with, since it is continuous and, therefore, presents the same problems as continuous-time waveforms. The discrete Fourier transform (DFT) maps a finite-length sampled waveform to a finite, discrete sum of sinusoids. The DFT, X_k, of a discrete waveform x_n of length N samples is computed as

X_k = \sum_{n=0}^{N-1} x_n \, W_N^{kn}    (2.12)

where, for convenience, the complex exponential is abbreviated

W_N = e^{-j 2\pi / N}    (2.13)

as in [1] (some others, such as [21] and [22], use the complex conjugate of this abbreviation). Comparing Equation (2.12) with Equation (2.10), it is apparent that for a sampled waveform x_n that is nonzero on the range n = 0, 1 \ldots N-1 and zero everywhere else, the DFT is equivalent to N samples of the DTFT at frequencies

\omega_k = \frac{2\pi k}{N}, \quad k = 0, 1 \ldots N-1.    (2.14)

Like the continuous Fourier representations, the discrete Fourier transform and its inverse are one-to-one. The inverse DFT, by which the time domain samples may be unambiguously reconstructed from their DFT, is written

x_n = \frac{1}{N} \sum_{k=0}^{N-1} X_k \, W_N^{-kn}.    (2.15)

The discrete frequency spectrum provided by the DFT offers the same advantages for digital signal processing as discrete-time (sampled) waveforms. The well-known fast Fourier transform is a very efficient algorithm for computing a DFT of highly composite length.
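The one-to-one property of Equations (2.12) and (2.15) is easy to verify numerically. The sketch below (not part of the original text) uses numpy's FFT, which follows the same W_N = e^{-j 2\pi/N} sign convention as Equation (2.13) and includes the 1/N factor in the inverse:

```python
import numpy as np

# Round-trip check: DFT (2.12) followed by inverse DFT (2.15) recovers
# the original waveform exactly, so no information is lost.
rng = np.random.default_rng(0)
N = 256
x = rng.standard_normal(N)          # an arbitrary length-N waveform

X = np.fft.fft(x)                   # Equation (2.12)
x_rec = np.fft.ifft(X)              # Equation (2.15), including the 1/N factor

print(np.allclose(x, x_rec))        # True
```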
2.4 The Short-Time Fourier Transform

The discrete Fourier transform provides a unique, complex-valued spectral portrait of a sampled sound. The Fourier transform and its inverse are one-to-one, so no information is lost in the transformation. However, the temporal evolution of the waveform, which is critical to the aural perception of real sounds [23], is captured in the phase spectrum in a way that is nonintuitive and very difficult to interpret. In order to obtain a spectral portrait with an explicit time dimension, one can compute, as a function of time, the Fourier spectrum of the waveform as viewed through a narrow temporal aperture that, at any time, admits only the most recent waveform samples. The aperture may be implemented using a sliding window function, which is defined by a smooth, nonzero function over a finite number of samples, say N, and zero everywhere else. That is,

h_l = \begin{cases} \text{nonzero} & l = 0, 1 \ldots N-1 \\ 0 & \text{everywhere else.} \end{cases}    (2.16)

A discussion of the details of the window function is deferred until Section 2.5. For now, it is sufficient to note that a window function is used to isolate N samples of the original waveform by point-by-point multiplication. The short-time Fourier transform (STFT) as a function of discrete time n and discrete frequency k is defined as

X_{k,n} = \sum_{l=-\infty}^{\infty} h_{n-l} \, x_l \, W_N^{lk}    (2.17)
where W_N is the complex exponential defined in Equation (2.13). Introducing a change of variable, m = l - n, and realizing that the window function defined in Equation (2.16) has finite support, Equation (2.17) can be rewritten

X_{k,n} = W_N^{nk} \sum_{m=-\infty}^{\infty} h_{-m} \, x_{m+n} \, W_N^{mk}    (2.18)
        = W_N^{nk} \, \tilde{X}_{k,n}    (2.19)

where \tilde{X}_{k,n} is defined as the DFT of the windowed, finite-length short-time sequence \tilde{x}_{m,n}, i.e.,

\tilde{X}_{k,n} = \sum_{m=0}^{N-1} h_{-m} \, x_{m+n} \, W_N^{mk}    (2.20)
               = \sum_{m=0}^{N-1} \tilde{x}_{m,n} \, W_N^{mk}    (2.21)

where

\tilde{x}_{m,n} = h_{-m} \, x_{m+n}.    (2.22)
Thus the short-time Fourier transform for any fixed time index n can be interpreted as the phase-shifted DFT of a finite-length short-time sequence. Equation (2.17) can also be rewritten as a convolution of the window function and the frequency-shifted analysis waveform, i.e.,

X_{k,n} = h_n *_n \left( x_n W_N^{nk} \right)    (2.23)

where *_n denotes convolution with respect to the time index n, and the frequency index k is constant. The kth short-time Fourier transform coefficient, viewed as a function of time, can therefore be interpreted as the output of a linear time-invariant filter, having impulse response defined by the window function h_n, excited with the modulated analysis waveform. The modulation by W_N^{nk} has the effect of shifting the spectrum of the analysis waveform at frequency \omega_k to 0. In order that the kth short-time coefficient represent only energy near frequency \omega_k, the analysis window should have a low-pass response. Equation (2.23), therefore, describes a bank of band-pass filters having center frequencies \omega_k. In light of this interpretation, the sequence of short-time Fourier coefficients corresponding to a particular frequency index k is often called a channel, and the window function is often referred to as the analysis filter. The two interpretations of the STFT suggested by Equations (2.21) and (2.23) are equivalent; one describes a sequence of fixed-time operations (DFTs), the other a set of fixed-frequency operations (a bank of bandpass filters). The former interpretation is usually reflected in digital implementations of the STFT, since very efficient algorithms exist for computing the DFT. The filter bank interpretation is conceptually useful and contributes to the discussion of window function specification for short-time Fourier analysis in Section 2.5.
The magnitude of the short-time Fourier transform is often called the spectrogram and is often used in sound modeling and analysis, particularly for speech. A spectrogram of a cello tone is shown in Figure 2.1. Magnitude is indicated by gray level. Strong harmonic components in the tone are visible as dark horizontal lines. Noise and low-frequency rumble in the recording are visible as light gray components. The magnitude spectrum is not a complete portrait of the sound, however. Critical temporal information is encoded in the short-time phase spectrum, but since the phase spectrum is visually uninformative and very difficult to interpret, it is rarely viewed.
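A minimal spectrogram computation in the fixed-time (sequence-of-DFTs) interpretation of Equation (2.21) can be sketched as follows. This is an illustrative sketch only: the window choice, sizes, and test signal are not prescribed by this chapter.

```python
import numpy as np

# Short-time Fourier magnitude (spectrogram) sketch: window N samples,
# take a DFT, hop forward, repeat. All parameter choices are illustrative.
def spectrogram(x, N=512, hop=128):
    window = np.hanning(N)                      # any smooth finite window works
    frames = []
    for start in range(0, len(x) - N + 1, hop):
        segment = window * x[start:start + N]   # short-time sequence (2.22)
        frames.append(np.abs(np.fft.rfft(segment)))
    return np.array(frames)                     # shape: (num_frames, N//2 + 1)

fS = 8000.0
t = np.arange(2 * int(fS)) / fS                 # two seconds of signal
x = np.sin(2 * np.pi * 440.0 * t)               # a steady 440 Hz tone
S = spectrogram(x)
peak_bin = int(np.argmax(S.mean(axis=0)))
print(peak_bin * fS / 512)                      # near 440 Hz
```

Plotting the rows of S against time, with magnitude mapped to gray level, yields an image like Figure 2.1.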
2.5 Windowing

Ideally, STFT coefficients should be well-localized in time and frequency. That is, the STFT coefficient \tilde{X}_{k,n} should represent the energy in the waveform x_n only at the kth coefficient frequency 2\pi k / N and only at time n. Practically, time and frequency resolution in short-time Fourier analysis is governed by the choice of analysis window. Equation (2.23) shows that the STFT is convolutional in the time domain; it describes the convolution of the signal with a bandpass impulse response created by modulating the window function. Equation (2.21) describes the STFT as the DFT of a product of two waveforms, the window h_n, and the signal x_n. Since the Fourier transform of a product of two signals is the convolution of the transforms of the signals (for a discussion of Fourier transform properties, see, for example, [1]), it is apparent that the STFT is convolutional in the frequency domain as well. In order to minimize the smearing due to convolution, and thereby maximize the time and frequency resolution of the short-time Fourier estimates, the window function should be highly concentrated in both time and frequency domains. The window should form the narrowest possible temporal aperture so that past and future samples interfere minimally with STFT estimates at any particular time. The window's magnitude spectrum should be sufficiently narrow that spectral energy represented in one STFT channel interferes minimally with STFT estimates in nearby channels.
Figure 2.1 Two short-time Fourier magnitude plots (also called spectrograms) of a cello tone. Magnitude is indicated by gray level. Strong harmonic components in the tone are visible as dark horizontal lines. Noise in the bow attack and low-frequency rumble in the recording are visible as light gray components between the harmonic components. White regions of the plots have very low spectral energy. The two plots differ only in their frequency scale; (a) plots frequencies up to 8000 Hz while (b) plots frequencies up to 2000 Hz.
The narrowest function in any domain is an impulse function. The discrete-time impulse \delta_k is defined by

\delta_k = \begin{cases} 1 & k = 0 \\ 0 & \text{otherwise.} \end{cases}    (2.24)

Convolution with an impulse is an identity operation, that is,

x_m * \delta_m = x_m.    (2.25)

Thus, for optimal resolution in time and frequency, the analysis window function should be impulse-like in both domains. Unfortunately, it is not possible for a waveform to be impulsive in both time and frequency. It is a property of the Fourier transform that an impulse in the time domain corresponds to a constant function in the frequency domain, and vice versa. In fact, a finite signal in one domain corresponds to an infinite signal in the other domain, and moreover, the more concentrated the signal energy is in one domain, the more it spreads out in the other domain. For example, an impulse (at zero frequency) in the frequency domain corresponds to an infinite, constant time domain signal (also called DC), and a frequency-shifted impulse in the frequency domain corresponds to an infinite, constant signal modulated by a sinusoid. Conversely, a time domain impulse corresponds to a constant in the frequency domain; all frequencies are present at equal intensity. Thus, there is an inherent tradeoff between time and frequency resolution in short-time Fourier analysis. The sinc function, defined

\mathrm{sinc}(x) = \frac{\sin(\pi x)}{\pi x},    (2.26)

has an ideal low-pass frequency response, i.e., it is bandlimited and flat in its passband. A filter bank constructed from a sinc impulse response would represent all spectral energy, even energy between frequency samples, in exactly one Fourier coefficient. But being bandlimited (finite in frequency), the sinc function, like the impulse, is infinite in the time domain. Since the window function is applied in the time domain, it is chosen to be finite in time, as described in Equation (2.16), and its particular shape in its range of support is chosen to provide a balance of time and frequency resolution, according to the application. Various window
functions have been proposed whose shapes are more or less sinc-like. Windows that are narrow or concentrated in frequency offer reduced temporal resolution, because more waveform samples are averaged in a single spectrum, corresponding to a wider aperture through which the signal spectrum is viewed. Other windows, more concentrated in time, and thus less concentrated in frequency, introduce interference between channels in the short-time spectrum. The interference (called main lobe interference) is most easily understood in terms of the filterbank description of the STFT given in Equation (2.23). If the width of the main lobe of the analysis window's magnitude response exceeds the frequency difference between adjacent spectral channels (2\pi/N), then the bandpass filters in the filter bank have overlapping pass bands. Each STFT channel, then, includes energy leaking from adjacent channels, so that nearby spectral peaks are not completely resolved. Sidelobe interference is the result of the analysis filter having finite rejection in the stop band, so that stop band energy leaks into other STFT channels, even channels that are distant in frequency. Sidelobe interference always occurs to a greater or lesser extent, even when there is no main lobe overlap, except in the (extremely rare) case of a perfectly harmonic waveform and a perfectly tuned transform, in which the signal's harmonic frequencies correspond exactly to frequency samples in the transform. In the case of a perfectly harmonic waveform, a rectangular window, defined

h_k = \begin{cases} 1 & k = 0, 1 \ldots N-1 \\ 0 & \text{otherwise} \end{cases}    (2.27)

can be used. The rectangular function in Equation (2.27) and the sinc function in Equation (2.26) comprise a Fourier transform pair. The sinc has a flat, bandlimited (i.e., rectangular) frequency response. Correspondingly, the rectangular window has a sinc-shaped frequency response. Though it is not bandlimited, the transform of the rectangular window has zero magnitude at all harmonic frequencies, so if the waveform has energy only at exactly harmonic frequencies, then there will be no sidelobe interference. Few waveforms of interest are perfectly harmonic, so in general, sidelobe interference must be minimized by choosing a window with very small sidelobes, or equivalently, an analysis filter having very good stop-band rejection. The rectangular window has relatively high sidelobes, so it is rarely used, except for pitch-synchronous analysis of very strongly harmonic sounds.
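The sidelobe behavior described above can be measured directly by sampling each window's magnitude response with a zero-padded FFT. The sketch below (illustrative only; the β value and the peak-sidelobe search are this sketch's own choices) compares the rectangular window with a member of the Kaiser family used later in this chapter:

```python
import numpy as np

# Compare peak sidelobe levels of a rectangular window (Equation (2.27))
# and a Kaiser window. Zero padding samples each magnitude response densely.
N = 127
pad = 8192

def peak_sidelobe_db(w):
    W = np.abs(np.fft.rfft(w, pad))
    W /= W.max()
    # Walk down the main lobe to its first local minimum (the first null),
    # then take the largest value beyond it: the peak sidelobe.
    k = 1
    while W[k + 1] < W[k]:
        k += 1
    return 20 * np.log10(W[k:].max())

rect = np.ones(N)
kaiser = np.kaiser(N, beta=8.0)
print(peak_sidelobe_db(rect))    # about -13 dB: high sidelobes
print(peak_sidelobe_db(kaiser))  # well below -50 dB: much better rejection
```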
Figure 2.2 Time domain shape of several members of the Kaiser window family having the same length N. The shaping parameter \beta governs the tradeoff between time and frequency resolution by controlling the slope of the window shoulders. See also Figure 2.3.

The window functions used in this research are members of the Kaiser window family [24, 25]. The Kaiser window is parameterized by a shaping factor \beta that governs the slope of the (time domain) shoulders, thereby providing control over the tradeoff between time and frequency resolution. Figures 2.2 and 2.3 show the relationship between shoulder slope, or time domain impulsiveness, main lobe width, and sidelobe leakage for a Kaiser window of a particular length. The length of the analysis window determines a lower bound on the frequency at which significant spectral energy can be accurately represented. The window must be sufficiently long relative to the period of the lowest frequency component that phase differences (time offsets) are not interpreted as amplitude differences. Figure 2.4 shows two identical windows superimposed on a low-frequency sine wave of constant amplitude. One is centered near a peak in the waveform, the other near a zero-crossing. Because they span only a fraction of a period of the sine wave, these two windows will yield different estimates of the amplitude of the sinusoid. In general, the analysis window must be long enough to span several periods of the lowest-frequency component present in the analyzed waveform. The exact number of periods depends on the particular window shape. In order to preserve the phase of the analyzed signal, we prefer a zero phase window, that is, a window function that does not alter the phase of the analyzed waveform. Although other formulations are possible, a zero-phase response can be most easily obtained by using an odd-length symmetric
Figure 2.3 Fourier magnitude spectra of the Kaiser window for several values of the shaping parameter \beta. The corresponding time domain plots are shown in Figure 2.2. Windows that are more concentrated in time (having higher values of \beta) are less concentrated in frequency and have less sidelobe leakage.

Figure 2.4 Two identical, bell-shaped analysis windows superimposed on a (relatively) low-frequency sine wave. Though the amplitude of the sinusoid is constant, the window on the left clearly includes more signal energy than the window on the right. These windows are too short to yield an accurate estimate of the parameters of this sine wave.
window centered at the origin. Symmetric windows have the conceptually pleasing property that samples before and after the center of the window are of equal significance. The Fourier transform phase is referenced to the beginning (first sample, in the discrete case) of the transform. Since time is usually referenced to the peak of the analysis window, it is necessary to align the center sample of the window with the first sample of the transform. This can be accomplished by modulating the Fourier components by W_N^{k(N/2)} = (-1)^k, corresponding to a phase shift of half the length of an odd-length window of N samples. Since a phase shift corresponds to circular rotation in the time domain, the Fourier transform phase can also be corrected by rotating the windowed time domain waveform by N/2 samples [22]. So far, it has been assumed that the length of the DFT in Equation (2.21) is equal to the length of the nonzero region of the window function, but this need not be so. In fact, for purposes of making the analysis data easier to interpret, it is common to obtain a more densely-sampled spectrum by zero padding the windowed short-time analysis waveform. This is equivalent to extending the tails of the analysis window with zeros. Since the padded analysis window admits no additional samples of the waveform (only those corresponding to nonzero window samples), the frequency spectrum sampled by the DFT is unaltered, and no additional temporal smearing is introduced; only the density of frequency samples changes. Zero padding has the effect of performing bandlimited interpolation of the spectrum. It does not increase the resolution of the spectrum (an increase in frequency resolution can only be achieved at the expense of temporal resolution); it merely samples the same spectrum more densely, making spectral features easier to identify and estimate accurately [1].
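The two properties of zero padding described above (the padded transform samples the same spectrum, only more densely, and the denser sampling makes spectral features easier to locate) can be demonstrated numerically. This sketch is illustrative; the signal frequency of 5.3 bins is chosen to fall between DFT samples.

```python
import numpy as np

# Zero padding performs bandlimited interpolation of the spectrum.
N = 64
n = np.arange(N)
x = np.hanning(N) * np.sin(2 * np.pi * 5.3 * n / N)  # windowed off-bin sinusoid

X = np.fft.fft(x)              # N frequency samples
X_pad = np.fft.fft(x, 8 * N)   # 8N samples of the same underlying spectrum

# Every 8th sample of the padded transform is an original sample.
print(np.allclose(X, X_pad[::8]))   # True

# The densely sampled spectrum locates the off-bin peak more precisely.
k_coarse = np.argmax(np.abs(X[:N // 2]))
k_fine = np.argmax(np.abs(X_pad[:4 * N]))
print(k_coarse, k_fine / 8)         # bin 5, versus roughly 5.3 "bins"
```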
This is particularly advantageous in algorithms based on spectral feature extraction, such as the technique of McAulay and Quatieri, described in Chapter 3. It is also possible to use a window that is longer than the transform. This technique, called time-domain aliasing, can be used to improve the accuracy of estimates at the DFT bin frequencies without increasing the amount of computation. Time-domain aliasing requires a more general definition of the short-time waveform than the one given in Equation (2.22). Time-domain aliasing is not used in this research, and is beyond the scope of this paper, but it is commonly used in Phase Vocoder implementations (see Section 2.7), and is detailed in [26].
The complete set of short-time Fourier coefficients \tilde{X}_{k,n} represents an N-fold increase in the amount of data over the sampled waveform. In fact, each set of N coefficients corresponding to time n = m introduces only one new sample from the waveform and omits only one waveform sample used to compute the coefficients at time m - 1. This suggests that the information in the original waveform can be captured in a downsampled transform. This downsampling can be achieved by computing coefficients only for times n = rS, where r is an integer and S is the downsampling factor, also called the hop size. In the case of S = N, there is no overlap in consecutive analysis windows, all samples of the waveform are used exactly once, and the number of coefficients is exactly equal to the number of samples in the waveform. This is, of course, the minimum amount of STFT data that can be used to exactly reconstruct the original waveform. In practice, however, overlapping analysis windows are used to make the short-time representation more robust. It has been shown that if the hop size is equal to the reciprocal of the main lobe width of the analysis window in frequency samples, then the STFT coefficient representation is robust to simple modifications [27].
2.6 Reconstruction from the Short-Time Fourier Transform

The two equivalent formulations of the STFT in Equations (2.21) and (2.23) suggest two equivalent techniques for reconstructing a sampled waveform in modified or unmodified form from its short-time Fourier coefficients. Equation (2.21) suggests a synthesis based on block-by-block inversion of the discrete Fourier transform coefficients. Successive blocks of samples can then be overlapped and summed in such a way as to reverse the effects of downsampling and windowing. The synthesis formula for the so-called overlap-add synthesis method is

\hat{x}_n = \sum_{s=-\infty}^{\infty} f_{n-sR'} \, \frac{1}{N} \sum_{k=0}^{N-1} \hat{X}_{k,sR'} \, W_N^{-nk}    (2.28)

where \hat{X}_{k,n} represents a (possibly modified) set of spectral coefficients, R' is the synthesis hop size, which may be different from the analysis hop size R, and f_n represents the synthesis window, which complements the analysis window. The constraints and specifications for the synthesis window are derived in [21], and Allen [27] showed that for a properly sampled (in time) STFT and a properly
normalized window function, a rectangular synthesis window can be used. A complete system for block-wise analysis and modified reconstruction employing the overlap-add synthesis technique is depicted in Figure 2.5, adapted from Crochiere [22]. As suggested by the notation in Equation (2.28), and as shown in Figure 2.5, it is possible to modify the short-time Fourier coefficients before resynthesis. The STFT must be sampled more densely in time (i.e., using a smaller hop size) in order to prevent the introduction of artifacts heard as reverberation in syntheses of modified short-time spectra [27]. Moreover, modified reconstruction requires that extra care be taken to preserve phase coherence between consecutive blocks of synthesized samples. Short-time spectral modifications may affect the cumulative phase travel of individual partials, causing interference between corresponding components having different phase portraits in the overlap region of successive synthesis blocks. Any spectral modification that may change the amount of phase travel due to periodic oscillation (for example, changing the time scale by using a synthesis hop size R' ≠ R) will introduce phase incoherence in the overlap regions if the short-time phases are not properly recomputed. Crochiere [22] showed that the issue of phase coherence arises naturally from the description of the STFT as a sequence of DFTs, each having its time reference in the middle of its analysis window, and can be addressed by assigning a fixed time reference to all the coefficients. This can be accomplished by preserving the phase shift in Equation (2.19) through the process of spectral modification, and demodulating the modified coefficients before performing overlap-add reconstruction, as shown in Figure 2.5. Frequency domain modulation by W_N^{sRk} corresponds to the time shift employed in Equation (2.22) to align sample n = sR of the waveform with the center of the analysis window.
This modulation shifts the coefficient phase by an amount equal to the phase travel due to periodic oscillation at frequency 2\pi k / N over sR samples. The process of computing the total phase of the coefficients, including the travel due to periodic oscillation, is called phase unwrapping. If the spectral modifications are commutable with the phase shift, then the phase can be unwrapped by replacing the phase shifts (-1)^k W_N^{sRk} and (-1)^k W_N^{-sR'k} in Figure 2.5 with a single phase modification W_N^{s(R-R')k}. If, additionally, R = R' (and only under these conditions), then phase unwrapping is unnecessary. Phase unwrapping ensures phase
[Figure 2.5 block diagram: the input $x_n$ is buffered $M$ samples at a time, shifted in $R$ samples per block, windowed by $h_n$, and transformed by the DFT; the coefficients $X_{k,sR}$ are phase-shifted by $(-1)^k W_N^{sRk}$, subjected to short-time spectral modification, demodulated by $(-1)^k W_N^{-sR'k}$, inverse transformed, windowed by $f_m$, and overlap-added into an $M$-sample buffer, shifting out $R'$ samples per block to form $\hat{x}_m$.]
Figure 2.5 A system for block-wise short-time Fourier analysis and modified reconstruction employing block DFT analysis and overlap-add synthesis. If the phase modification operation commutes with the short-time spectral modification, then the phase modifications may be condensed, or even eliminated if the hop sizes $R$ and $R'$ are equal. Adapted from Crochiere [22].
[Figure 2.6 block diagram: the input $x_n$ is demodulated by $W_N^{kn}$ for each channel $k$, filtered by the analysis filter $h_n$, and decimated $R$:1 to produce the coefficients $X_{k,sR}$; after short-time spectral modification, the coefficients $\hat{X}_{k,sR'}$ are interpolated 1:$R'$, filtered by the synthesis filter $f_n$, remodulated by $W_N^{-kn}$, and summed to form the output $\hat{x}_m$.]
Figure 2.6 A system for short-time Fourier analysis and modified resynthesis by filterbank summation. Upsampled, modified coefficients excite the synthesis filter, and the filter outputs are remodulated and summed. Downsampling is effected by decimation (computing only every $R$th coefficient sample). Upsampling is effected by interpolation (inserting $R' - 1$ zeros between modified coefficient samples). Adapted from [22] and [21].

coherence between consecutive synthesis frames, but does not preserve the overall phase portrait, or the temporal envelope shape, under modified reconstruction. Preservation of the temporal envelope is critical for some sounds and is addressed in Chapter 6. In the method of synthesis by filterbank summation, corresponding to the STFT interpretation in Equation (2.23), the interpolated (upsampled) and possibly modified coefficients excite the synthesis filter, and the filter outputs are remodulated by the coefficient frequencies and summed. The filterbank summation method, depicted in Figure 2.6, is exactly equivalent to the overlap-add synthesis method [28]. Phase incoherence cannot be introduced using this structure because there are no overlap regions. Modulation by $(-1)^k$, needed in the overlap-add algorithm to align the short-time phases in the analysis window, is unnecessary, but note that the analysis and synthesis filters, $h_n$ and $f_n$, are noncausal.
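As a concrete illustration of the block-wise analysis and overlap-add reconstruction scheme of Figure 2.5, the following sketch (Python with NumPy; not from the thesis, and the function name is ours) analyzes a signal with a periodic Hann window, whose overlapped copies at hop $R = N/2$ sum to one, and resynthesizes it by inverse DFT and overlap-add. With unmodified coefficients and $R' = R$, interior samples are recovered exactly, as the text asserts.

```python
import numpy as np

def stft_ola_roundtrip(x, N=256, R=128):
    """Analyze x block-wise with a periodic Hann window and the DFT,
    then resynthesize by inverse DFT and overlap-add.  With R = N/2 the
    window satisfies the constant-overlap-add condition, so unmodified
    coefficients reconstruct the interior of x exactly."""
    w = 0.5 - 0.5 * np.cos(2 * np.pi * np.arange(N) / N)  # periodic Hann
    n_frames = (len(x) - N) // R + 1
    y = np.zeros(len(x))
    for s in range(n_frames):
        seg = x[s * R : s * R + N] * w           # windowed block
        X = np.fft.fft(seg)                       # short-time spectrum X_{k,sR}
        # (short-time spectral modifications would go here; phase coherence
        #  between overlapping blocks must then be maintained, as described above)
        y[s * R : s * R + N] += np.fft.ifft(X).real   # overlap-add
    return y
```

Samples near the edges lack full window overlap and are attenuated; only the interior region is reconstructed exactly.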
2.7 The Phase Vocoder

The Phase Vocoder is a technique for analyzing and modifying sampled waveforms, widely used in a variety of domains, including musical composition and sound design. It is based on the Channel Vocoder algorithm, which is essentially an implementation of discrete short-time Fourier analysis and synthesis, described in Sections 2.4 and 2.6, respectively. The Phase Vocoder models a waveform as a sum of sinusoids having time-varying amplitudes and frequencies, and distinguishes itself from the Channel Vocoder and the STFT by using the unwrapped short-time phases to correct the partial frequencies, instead of relying on the bin frequencies $\omega_k$ [29]. For strongly harmonic sounds, the length of the analysis window is tuned to be, as nearly as possible, a small whole-number (two or three) multiple of the pitch period. If the signal is exactly harmonic and its fundamental frequency is an integral divisor of the sample rate, then the fundamental and each of its harmonics will coincide exactly with frequency samples provided by the STFT. If the fundamental frequency is slowly and slightly time-varying, or the window length does not correspond exactly to an integer number of periods, then the Phase Vocoder algorithm, which corrects the partial frequencies according to the difference in consecutive unwrapped partial phases, will give improved estimates of the partial frequencies. The Heterodyne Filter method of analysis and synthesis, used by J. Beauchamp for time-variant synthesis of musical instrument tones [7] and employed by J. Grey in an extensive study of timbre perception [9], is mathematically equivalent to a Phase Vocoder implementation using analysis windows tuned to a single pitch period. This method works well for isolated, quasi-harmonic tones without vibrato and having very little (less than 2%) deviation in fundamental frequency [30], although some improvements can be made to relax those constraints somewhat [26].
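The Phase Vocoder's frequency correction can be sketched as follows, assuming a stationary sinusoid and two analysis frames a hop of $R$ samples apart: the principal value of the difference between the measured phase increment and the expected phase travel $\omega_k R$ refines the bin frequency $2\pi k/N$. (A Python/NumPy illustration; the function and parameter names are our own, not the thesis's.)

```python
import numpy as np

def princarg(phi):
    """Wrap a phase value to the principal range (-pi, pi]."""
    return phi - 2 * np.pi * np.round(phi / (2 * np.pi))

def corrected_frequency(x, k, N, R, fs):
    """Phase-vocoder frequency correction for bin k: the deviation of the
    measured phase increment over R samples from the bin's expected phase
    travel refines the bin frequency 2*pi*k/N (radians per sample).
    Returns the corrected frequency estimate in Hz."""
    w = np.hanning(N)
    X1 = np.fft.fft(x[:N] * w)                  # frame starting at sample 0
    X2 = np.fft.fft(x[R:R + N] * w)             # frame starting at sample R
    wk = 2 * np.pi * k / N                      # bin frequency (rad/sample)
    dphi = np.angle(X2[k]) - np.angle(X1[k])    # measured phase increment
    dev = princarg(dphi - wk * R)               # deviation from expected travel
    return (wk + dev / R) * fs / (2 * np.pi)    # corrected frequency in Hz
```

The deviation is recoverable only when $|\omega_{\text{true}} - \omega_k| R < \pi$, which is why the STFT must be sampled densely enough in time.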
The Phase Vocoder and related methods are, by nature, pitch-synchronous. The fundamental frequency must be known a priori or computed by some other algorithm at analysis time [30]. Note that the distinction between the Phase Vocoder and the short-time Fourier transform is largely one of presentation. For quasi-harmonic waveforms, the frequency-tracked harmonic partial interpretation of the Phase Vocoder coefficients is conceptually useful and may be significant in the domain of modifications to the short-time spectra. Mathematically, however, there is no difference between storing the short-time phases and storing their time derivatives (the corrected short-time frequencies). The Phase Vocoder is commonly implemented using a fast Fourier transform algorithm, with long (many pitch periods) analysis windows whose length is unrelated to the period length [26]. In such cases, and in the modeling of nonharmonic waveforms, the distinction is a matter of taste. The short-time Fourier transform is unsatisfying as a sound model because it, like a sampled time domain waveform, is sampled at time-frequency loci dictated by the parameters of the analysis rather than by any quality of the analyzed waveform. For example, for a particular configuration of the transform, the STFT of an instrument tone has data at the same time-frequency coordinates as the STFT of the same amount of silence. The representation of sound by time-varying frequency and amplitude trajectories has proven very useful, however, and the algorithms discussed in the chapters that follow are all based on short-time Fourier decomposition, and process the short-time data to yield a more responsive and broadly applicable, higher-level sound model.
CHAPTER 3
THE BASIC SINUSOIDAL MODEL

McAulay and Quatieri proposed a sinusoidal sound model based on the short-time Fourier transform [10]. In their method, often called the MQ method, a sound is modeled as a collection of sinusoidal components, called "tracks." A track describes the time-varying amplitude and frequency of a single sinusoidal component (a "partial") in the analyzed waveform. We distinguish tracks, the MQ model components, from partials, a more general term for frequency-localized components of an audio waveform. In contrast to other short-time Fourier methods, such as the Phase Vocoder, the McAulay and Quatieri algorithm estimates the parameters of the underlying sinusoidal components in the model, rather than retaining the raw short-time spectra. Tracks are constructed from sinusoidal parameter estimates derived from the short-time spectral data. A collection of tracks represents the sinusoidal energy in the analyzed waveform. The tracks in the MQ model are not constrained to be quasi-harmonic, and they may start and stop at any time during the analyzed waveform. The MQ model is thus applicable to a wide range of sounds, including polyphonic and nonharmonic sounds, which are not accommodated by pitch-synchronous Fourier techniques. For quasi-stationary sounds (sounds that are not very noise-like and do not have rapidly changing spectral characteristics), a perceptually complete representation can be constructed using the MQ method.
3.1 Overview of the Basic Sinusoidal Model

The basic MQ analysis technique is depicted in Figure 3.1. To capture the spectral evolution of an audio waveform, short-time Fourier spectra are computed for a sequence of overlapping windowed
[Figure 3.1 diagram: a windowed segment of the waveform is Fourier transformed; data are retained at peaks of the short-time Fourier magnitude spectrum; peaks in consecutive frames are linked in the track formation stage, with track births and deaths; the example plot of MQ analysis data shows frequency (0 to 16 kHz) against time (0.0 to 2.6 s).]
Figure 3.1 The basic MQ sinusoidal analysis process. Data extracted from short-time Fourier magnitude peaks are collected in analysis frames, and peaks in consecutive frames are linked to form tracks. Data from a sinusoidal analysis of a flute tone are shown, with frequency envelopes plotted against time and amplitude indicated by gray level; higher-amplitude tracks are darker. See also Figure 3.2 for a detailed plot of MQ analysis data.

segments of the source waveform. The short-time magnitude spectra are searched for peaks, which represent concentrations of spectral energy at the time corresponding to the center of the short-time analysis window. The peak amplitude, frequency, and phase data are extracted from the short-time spectra and collected into analysis frames, each frame comprising the peak data extracted from a single short-time analysis window. Peaks in consecutive frames are compared, and those that are sufficiently close in frequency are linked to form tracks. A track is a sequence of peaks describing the time-varying frequency and amplitude of a single sinusoidal component of the analyzed waveform. A track can also be considered to be a pair of envelopes defining the track's frequency and amplitude evolution. The
[Figure 3.2 plot: track frequency envelopes, frequency (0 to 3000 Hz) against time (0.00 to 4.20 s).]
Figure 3.2 Data produced by an MQ-style sinusoidal analysis of a cello tone, pitch G2 (G below low C). Track frequency envelopes are plotted against time. Track amplitude is indicated by gray level; higher-amplitude tracks are darker. For clarity, only tracks below 3 kHz are plotted.

track frequency and amplitude envelopes are constructed by interpolation of the frequency and amplitude of the constituent peaks. In general, the number of peaks will vary from frame to frame with the spectral character of the sound, so it will not be possible to find a match for every peak in every frame. When no successor can be found for a peak because there are no unmatched peaks of similar frequency in the next short-time spectrum, that peak marks the terminus, or death, of a track, and the disappearance of that component from the sinusoidal model. Similarly, when a peak has no predecessor, it marks the birth of a track and the appearance of a new component in the model. Sound is generally rendered from MQ analysis data by direct control of a bank of sinusoidal oscillators (one per track) by the track frequency and amplitude envelopes, although it is also possible to use an overlap-add technique. Data from an MQ-style sinusoidal analysis of a cello tone, played without vibrato on the open G string at pitch G2 (G below low C), having a fundamental frequency of approximately 98 Hz, taken from the McGill University Master Samples compact discs [17, Disc 9 Track 75 Index 1], are shown in Figure 3.2. Track frequency envelopes are plotted against time. Track amplitude is indicated by gray level; higher-amplitude tracks are darker.
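The peak-picking stage of the analysis described above might be sketched as follows (a simplified Python/NumPy illustration, not the thesis's implementation; the function and parameter names are ours): one frame of the waveform is windowed and transformed, and amplitude, frequency, and phase data are retained only at local maxima of the magnitude spectrum above a threshold relative to the strongest bin.

```python
import numpy as np

def pick_peaks(x, fs, N=1024, threshold_db=-25.0):
    """One frame of MQ-style analysis: window, transform, and retain
    amplitude/frequency/phase data only at magnitude-spectrum peaks
    above a threshold (dB relative to the largest magnitude sample)."""
    w = np.hanning(N)
    X = np.fft.fft(x[:N] * w)
    mag = np.abs(X[:N // 2])                    # positive-frequency bins
    floor = np.max(mag) * 10 ** (threshold_db / 20)
    peaks = []
    for k in range(1, N // 2 - 1):
        if mag[k] > mag[k - 1] and mag[k] >= mag[k + 1] and mag[k] > floor:
            peaks.append({"freq": k * fs / N,   # bin frequency in Hz
                          "amp": mag[k],
                          "phase": np.angle(X[k])})
    return peaks
```

In a complete analyzer these per-frame peak lists would then be passed to the track-formation stage.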
3.2 Development of the Basic Sinusoidal Model

The MQ sinusoidal model, originally developed for speech waveforms [10], is based on a conventional speech model consisting of an excitation signal $e(t)$ and a vocal tract modeled as a linear time-varying filter having the time-varying transfer function
$$H(\omega, t) = M(\omega, t)\, e^{j\Phi(\omega, t)} \tag{3.1}$$
where $M(\omega, t)$ is the time-varying magnitude response of the filter, and $\Phi(\omega, t)$ is the time-varying phase shift introduced by the filter. In the MQ algorithm, the excitation is modeled as a sum of sinusoids having arbitrary frequencies, amplitudes, and phases, as previously proposed by Hedelin [31]:

$$e(t) = \sum_{l=1}^{L(t)} a_l(t)\, \exp\left\{ j \left( \int_0^t \omega_l(\tau)\, d\tau + \phi_l \right) \right\} \tag{3.2}$$
where for the $l$th component, $a_l(t)$ is the time-varying amplitude, $\omega_l(\tau)$ is the time-varying frequency, and $\phi_l$ is the phase offset, or the phase at time $t = 0$. Since the vocal tract is modeled as a linear filter, the excitation and vocal tract models can be collapsed into a single sum-of-sinusoids model:

$$s(t) = \sum_{l=1}^{L(t)} A_l(t)\, e^{j\theta_l(t)} \tag{3.3}$$
The sinusoidal amplitudes combine the time-varying amplitudes from the excitation model $a_l(t)$ with the time-varying magnitude response of the vocal tract filter $M(\omega, t)$:

$$A_l(t) = a_l(t)\, M\!\left[\omega_l(t), t\right] \tag{3.4}$$
The sinusoidal phase functions combine the time-varying frequency and phase offset from the excitation model $\phi_l$ with the time-varying phase response of the vocal tract filter $\Phi(\omega, t)$ and the rapid phase change due to the sinusoidal oscillation at time-varying frequency $\omega_l(t)$:

$$\theta_l(t) = \int_0^t \omega_l(\tau)\, d\tau + \Phi\!\left[\omega_l(t), t\right] + \phi_l \tag{3.5}$$
To simplify the generally difficult problem of finding optimal sinusoidal parameters, the waveform is broken into analysis frames, and the parameters are assumed to be constant over the duration of the analysis frame. (To support the speech model, it is further assumed that the parameters are constant over the duration of the impulse response of the vocal tract transfer function, because the filter response has been collapsed into the sinusoidal model.) This gives a model of a single frame of the speech waveform consisting of a sum of sinusoids having arbitrary but constant amplitudes and frequencies, and having arbitrary starting phases. The waveform in a particular frame $s$ is then modeled by

$$y_{n,s} = \sum_{l=1}^{L_s} A_{l,s}\, e^{j\left(n\omega_{l,s} + \phi_{l,s}\right)} \tag{3.6}$$
where $A_{l,s}$ is the amplitude and $\phi_{l,s}$ the phase offset for the $l$th of $L_s$ sinusoidal components in frame $s$. Finding a set of sinusoid parameters that minimizes the mean squared error with respect to the measured waveform is difficult in general. To gain insight, two simplifying assumptions are made. First, assume that the speech waveform is perfectly voiced, and therefore perfectly harmonic. Moreover, assume that the pitch period is known, and that the analysis window length is a multiple of the pitch period. Under these assumptions, the error function simplifies greatly. The optimal estimate for a perfectly harmonic waveform is comprised of sinusoids at the harmonic frequencies, having amplitudes and phases obtained by evaluating the discrete Fourier transform at the samples corresponding to those harmonic frequencies. Under these ideal conditions, the frequencies of the harmonics correspond exactly to pulse-like peaks in the continuous magnitude spectrum of the speech waveform. The optimal estimator for a single frame can be generalized to consist of sinusoids at frequencies corresponding to magnitude peaks in the short-time Fourier spectrum, having amplitudes and phases obtained by evaluating the STFT at those peaks. This generalized interpretation can be extended to other, less ideal kinds of sounds. If the speech is not perfectly voiced, but still has a peaky, pulse-like magnitude spectrum (a spectrum in which most of the energy is concentrated near strong magnitude peaks), then, provided that the analysis window is long enough, we can still obtain sinusoidal parameter estimates by evaluating the STFT at its magnitude peaks. The analysis window must be chosen to prevent
interference between spectral peaks from compromising the peak parameter estimates. As discussed in Section 2.5, the main lobe of the window's magnitude spectrum must be narrower than the frequency separation of any pair of spectral peaks. For a rectangular window, the window length $N$ is constrained by

$$\left| \omega_{i,s} - \omega_{j,s} \right| \geq \frac{4\pi}{N+1} \tag{3.7}$$

for all peaks $i \neq j$ in frame $s$ [10, p. 746]. As described in Section 2.5, sidelobe interference may also compromise peak parameter estimates, even when the peaks are widely separated in frequency, but this effect can be minimized by a proper choice of window function, at the expense of greater window length. To the extent that interference between spectral peaks can be minimized, optimal sinusoidal parameter estimates may be obtained by evaluating the STFT at samples corresponding to peaks in the magnitude spectrum. Expansion of this representation to include waveforms lacking impulsive magnitude spectra, such as unvoiced speech, requires a Karhunen-Loève expansion for noise-like signals. It can be shown that if a sufficient density of spectral peaks is retained in the analysis data, then the waveform can be represented by a collection of weighted sinusoids at the peak frequencies [32]. The frequency distribution of the spectral peaks must be such that the power spectral density changes slowly from peak to peak, and McAulay and Quatieri show that this peak density is achievable in practice [10, p. 747]. The model of a single frame of the speech waveform as a sum of sinusoids has been justified, but the sinusoidal parameters and, in general, the number of sinusoids will vary from frame to frame. Since the goal is to model the entire waveform as a collection of sinusoids with time-varying parameters, some means of associating peaks in one frame with peaks in another is needed to complete the model. Broadly speaking, it is desired to connect each peak in a frame with the peak of nearest frequency in the succeeding frame.
The spectral parameters (frequency and amplitude) of the waveform are assumed to be quasi-stationary, that is, slowly varying over the span of an analysis window. Slow parameter variation will cause corresponding peaks in successive analysis frames to differ slightly in their parameters. In order to trace the frequency and amplitude evolution of a slowly varying component, it is necessary to link corresponding peaks in consecutive analysis frames to form tracks. The number of
components may vary from frame to frame, as slow variations cause components to fade in and out, or to be obscured by stronger components, so some matching algorithm is needed to establish the correspondences between peaks in different analysis frames. McAulay and Quatieri describe an algorithm for matching peaks in each frame to the peak in the succeeding frame that is nearest in frequency, resolving collisions in favor of the match of least frequency difference, and restricting the frequency difference between matched peaks to a specified maximum [10]. This strategy does not provide an optimal set of matches (finding optimal matches is difficult in general), but it does ensure that peaks in adjacent frames that are very close in frequency will be matched, and many of the remaining peaks will be matched with peaks that are reasonably close in frequency. After links have been established, there will still be some peaks in the earlier frame that were not matched to peaks in the later frame. These peaks mark the death of a sinusoidal component, and are matched to peaks having the same frequency and zero amplitude that are inserted in the later frame. Similarly, there will be peaks in the later frame that were not matches for any peaks in the earlier frame. These peaks mark the birth of a new sinusoid, and are matched to peaks having the same frequency and zero amplitude that are inserted in the earlier frame. While it is possible to employ an overlap-add synthesis method similar to the one associated with the Phase Vocoder and other short-time Fourier representations, it is more intuitive to render an MQ representation using a method of direct parameter interpolation, which achieves higher fidelity than overlap-add, especially at lower frame rates, by avoiding the problem of phase incoherence in the overlap regions. To interpolate the sinusoidal parameters directly, it is necessary to resolve the frequency and phase constraints at the frame boundaries.
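The peak-matching strategy described above can be sketched roughly as follows (a simplified greedy variant for illustration, not McAulay and Quatieri's exact algorithm; the function and parameter names are ours): candidate matches within a maximum frequency difference are accepted in order of increasing difference, so collisions resolve in favor of the closest match, and leftover peaks mark track deaths and births.

```python
def match_peaks(freqs_a, freqs_b, max_df=50.0):
    """Simplified MQ-style peak matching between consecutive frames.
    Candidate pairs within max_df (Hz) are accepted greedily in order of
    increasing frequency difference.  Unmatched peaks in frame A mark
    track deaths; unmatched peaks in frame B mark track births."""
    pairs = sorted(
        ((abs(fa - fb), i, j)
         for i, fa in enumerate(freqs_a)
         for j, fb in enumerate(freqs_b)
         if abs(fa - fb) <= max_df),
        key=lambda t: t[0])
    matched_a, matched_b, links = set(), set(), []
    for _, i, j in pairs:
        if i not in matched_a and j not in matched_b:
            links.append((i, j))          # link peak i in A to peak j in B
            matched_a.add(i)
            matched_b.add(j)
    deaths = [i for i in range(len(freqs_a)) if i not in matched_a]
    births = [j for j in range(len(freqs_b)) if j not in matched_b]
    return links, deaths, births
```

In a full implementation, dying tracks would be extended with a zero-amplitude peak in the later frame, and born tracks with one in the earlier frame, exactly as the text describes.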
The sinusoidal phase function and its derivative, the instantaneous frequency, are constrained to match the estimates at the frame boundaries. Straightforward linear interpolation of the track frequency cannot, in general, satisfy all of these constraints. Moreover, the discrete Fourier transform phases are reported modulo $2\pi$ (on the range $[0, 2\pi)$ or $[-\pi, \pi)$), but the time between frames is much longer than the length of a period of the high-frequency tracks, so the cumulative phase travel over a frame duration is greater than the reported phase by some integer multiple of $2\pi$. The peak phases, therefore, need to be unwrapped to reflect the cumulative phase travel of the sinusoid over the duration of the frame. This unwrapping is analogous
to the phase unwrapping described in Section 2.6, but is complicated slightly by the variation in component frequency. In the short-time Fourier transform, corresponding components always have the same frequency index $k$ and, therefore, the same frequency $2\pi k/N$, whereas the MQ component frequencies are time-variant. If the phase for peak $l$ in frame $s$ is $\theta_{l,s}$, then the cumulative phase travel for that track in frame $s$ is given by

$$\Delta\theta_{l,s}(T) = \theta_{l,s+1} - \theta_{l,s} + 2\pi M \tag{3.8}$$

where $M$ is an integer and $T$ is the duration of the frame. Almeida and Silva proposed to resolve the overspecification of the phase in a purely harmonic sinusoidal model by postulating a cubic function for the instantaneous partial phase [33]. The $2\pi$ uncertainty leads to a family of cubic phase functions

$$\theta_{l,s}(t) = \theta_{l,s} + \omega_{l,s}\, t + \alpha t^2 + \beta t^3 \tag{3.9}$$

where $\theta_{l,s}$ is the phase at the start of the frame, $\omega_{l,s}$ is the instantaneous track frequency at the start of the frame, and $\alpha$ and $\beta$ are coefficients satisfying the matrix equation
$$\begin{bmatrix} \alpha(M) \\ \beta(M) \end{bmatrix} = \begin{bmatrix} \dfrac{3}{T^2} & -\dfrac{1}{T} \\ -\dfrac{2}{T^3} & \dfrac{1}{T^2} \end{bmatrix} \begin{bmatrix} \theta_{l,s+1} - \theta_{l,s} - \omega_{l,s} T + 2\pi M \\ \omega_{l,s+1} - \omega_{l,s} \end{bmatrix} \tag{3.10}$$
The maximally smooth cubic phase function is found by minimizing the integral of the squared magnitude of the second derivative of $\theta_{l,s}(t)$ with respect to $M$. It can be shown [10] that the minimizing value of $M$ is

$$M^* = \frac{1}{2\pi} \left[ \left( \theta_{l,s} + \omega_{l,s} T - \theta_{l,s+1} \right) + \left( \omega_{l,s+1} - \omega_{l,s} \right) \frac{T}{2} \right] \tag{3.11}$$

and that the maximally smooth phase function, having coefficients corresponding to $M$ equal to the integer closest to $M^*$, unwraps the phase so as to yield the smoothest possible phase trajectory that can satisfy the frequency and phase estimates at the frame boundaries.
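Equations (3.9) through (3.11) translate directly into code. The sketch below (Python; the function name is ours) computes $M^*$ and the cubic coefficients, producing a phase function that matches the measured phase and frequency estimates at both frame boundaries.

```python
import numpy as np

def cubic_phase(theta1, omega1, theta2, omega2, T):
    """Maximally smooth cubic phase interpolation across one frame of
    duration T samples, after McAulay and Quatieri.  Returns (alpha,
    beta, M) for theta(t) = theta1 + omega1*t + alpha*t**2 + beta*t**3,
    where theta(T) = theta2 + 2*pi*M and theta'(T) = omega2."""
    # Value of M minimizing the integrated squared second derivative
    # (Equation (3.11)), rounded to the nearest integer:
    m_star = ((theta1 + omega1 * T - theta2)
              + (omega2 - omega1) * T / 2) / (2 * np.pi)
    M = round(m_star)
    # Coefficients from the 2x2 linear system (Equation (3.10)):
    d1 = theta2 - theta1 - omega1 * T + 2 * np.pi * M
    d2 = omega2 - omega1
    alpha = 3.0 / T**2 * d1 - 1.0 / T * d2
    beta = -2.0 / T**3 * d1 + 1.0 / T**2 * d2
    return alpha, beta, M
```

The boundary conditions can be checked directly: the cubic's value at $T$ equals the unwrapped end phase $\theta_2 + 2\pi M$, and its derivative at $T$ equals $\omega_2$.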
3.3 Validity of the Sinusoidal Model

The sinusoidal model developed in Section 3.2 was shown to be a valid representation of a broad class of sounds, subject to certain constraints. It is assumed that the parameters of the speech model, the glottal excitation signal and the vocal tract transfer function, are constant over the duration of an analysis frame, so that the parameters can be estimated for each individual frame. More generally, this implies that the spectral character of the analyzed waveform must not change rapidly relative to the length of the analysis window. Naturally, if the spectrum changes considerably over the duration of the analysis frame, then it cannot be represented by a set of constant sinusoidal parameters, and the frame-based representation is not valid for such a waveform. It was initially assumed, for simplicity, that the waveform to be analyzed was perfectly voiced, and that the analysis window was tuned to a multiple of the pitch period. Under this assumption, the purely harmonic waveform could be optimally estimated by its magnitude-spectral peaks. It was then shown that the spectral peaks comprise an optimal estimator for waveforms that are not perfectly voiced, provided that the analysis window is long enough that the peak amplitude estimates are not compromised by interference from neighboring peaks. Specifically, neighboring peaks must be sufficiently distant that peak magnitude estimates are not corrupted by main lobe interference. Additionally, for finite-length windows, sidelobe interference must be minimized by careful choice of the window function. It was further shown that a sufficiently high density of magnitude-spectral peaks can represent noise-like waveforms, and that, in general, such a density can be achieved using the MQ analysis method.
It should be noted, however, that the validity of such a representation does not guarantee manipulability of the resulting data; in fact, it will be shown in later chapters that sinusoidal representations of noise-like sounds are fragile and inflexible. The basic McAulay-Quatieri sinusoidal model was developed as a mid-rate speech coding algorithm, but was found to be a suitable model for a wide range of sounds, including a great variety of musical sounds [10]. A very similar algorithm was developed for musical applications by Smith and Serra [11] and formed the basis for their work on a deterministic-plus-stochastic waveform decomposition and the Spectral Modeling Synthesis software [12]. Although speech is widely recognized
to be a nonlinear process [34], the MQ method models speech as a linear process operating on a sinusoidal excitation signal. The fidelity of the MQ representation of speech, and of a variety of other sounds that are not well modeled as linear systems, is thus somewhat surprising, and cannot be attributed to any physical accuracy of the model.
3.4 The Importance of Phase

The MQ representation is a phase-correct sinusoidal waveform model, meaning that the phase measured in the short-time analysis is preserved in the representation. It has often been argued that the ear is insensitive to phase, and that the sinusoidal model can therefore be simplified to use only the amplitudes and frequencies of the constituent sinusoids. Eliminating the phase parameter would greatly simplify the synthesis by obviating the cubic phase interpolation algorithm. In practice, however, the importance of phase varies with context and application, and phase cannot, in general, be discarded from the model. Experimental results in speech synthesis showed that reconstruction from a magnitude-only sinusoidal model (i.e., one that does not retain phase data) changed the character of the speech signal, especially for low-pitched speakers, and introduced an objectionable "tonal" quality in noisy speech [10]. It has been noted that the length of the analysis windows, and the lack of control over the phase relation between partials in a magnitude-only reconstruction, render these systems inadequate for time-scale modification of short-duration complex waveforms, because preservation of the temporal envelope of the individual partials does not prevent deformation of the temporal envelope of the composite waveform [35]. (Temporal envelope preservation under transformation is discussed in Chapter 6.) This also explains the failure of magnitude-only models to reproduce noise-like sounds, especially under time-scale modification. Similarly, in applications involving musical sounds, it has been observed that the importance of phase in the model depends greatly on the application and the nature of the analyzed waveform [11].
3.5 Improving Frequency Estimates

The basic sinusoidal algorithm of McAulay and Quatieri is unique in its use of estimated partial frequencies. Other short-time Fourier methods, such as the phase vocoder, employ all complex samples, or amplitude-phase pairs, from the short-time frequency spectrum in their representation. The frequency corresponding to each transform sample is implicitly determined from the sample rate of the source waveform and the number of points in the short-time transform. By contrast, the MQ representation retains data only at the spectral peaks, and the time-varying component frequency is explicit. For perfectly harmonic sounds and tuned analysis windows, as in the ideal voiced speech assumption used in the development of the sinusoidal model above, the spectral peaks will correspond exactly to samples of the short-time spectrum, so the samples and their corresponding frequencies correspond exactly to the sinusoidal parameters of the ideal estimator. In general, however, the peaks in the continuous magnitude spectrum will not correspond exactly to samples in the short-time spectrum, and the sinusoidal parameters must be estimated. The basic MQ sinusoidal model uses the short-time spectral samples directly, but there are techniques available for improving the sinusoidal parameter estimates. Zero padding of the short-time analysis windows was introduced in Section 2.5 as a computationally inexpensive method for improving frequency estimates by interpolating the magnitude spectrum. Time domain zero padding causes the frequency spectrum to be more densely sampled, and thus facilitates estimation of the peak frequencies. Zero padding does not, however, improve the resolution in the frequency domain, or allow peaks to be resolved that would otherwise be obscured by window spectrum interference from neighboring peaks.
Smith and Serra describe a method of refining frequency estimates using parabolic spectral interpolation in the context of the software application PARSHL, which employed an algorithm similar to the MQ method for the analysis of musical sounds [11]. Parabolic interpolation of the magnitude spectrum is an inexpensive method for refining the peak frequency estimates with greater precision than is practical with zero padding. A parabola is fit to three samples in the vicinity of a peak in the magnitude spectrum: the peak sample and its two neighbors. The peak frequency estimate is then computed from the fractional sample location of the vertex of the parabola.
[Figure 3.3 diagram: a parabola fit to three magnitude samples $\alpha$, $\beta$, $\gamma$ at $x$-coordinates $-1$, $0$, and $1$, with its vertex at $x = p$.]
Figure 3.3 Refining peak frequency estimates using parabolic interpolation of the magnitude spectrum.

A new Cartesian coordinate system is defined such that the tallest (largest magnitude) spectral sample has $x$ (horizontal) coordinate 0, and the neighboring samples have $x$ coordinates $-1$ and $1$, as shown in Figure 3.3. The corresponding magnitudes are $\alpha$, $\beta$, and $\gamma$. A parabola of the form
$$y(x) = a\,(x - p)^2 + b \tag{3.12}$$
is fit to the three magnitude-spectral samples in this new coordinate system. The vertex coordinate $p$ is found by solving

$$p = \frac{1}{2}\, \frac{\alpha - \gamma}{\alpha - 2\beta + \gamma} \tag{3.13}$$

If $k$ is the sample number corresponding to the peak sample value $\beta$, then the estimate of the fractional sample location of the true peak is

$$k^* = k + p \tag{3.14}$$
and the estimate of the true peak frequency is $\frac{f_s k^*}{N}$, where $f_s$ is the sampling frequency and $N$ is the length of the short-time analysis window. The peak amplitude estimate may also be refined by evaluating the parabola in Equation (3.12) at $p$. This is equivalent to solving

$$y(p) = \beta - \frac{1}{4} \left( \alpha - \gamma \right) p \tag{3.15}$$
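Equations (3.13) through (3.15) can be applied directly to three samples of the magnitude spectrum. The following sketch (Python; the function name is ours, not Smith and Serra's) returns the refined frequency and amplitude estimates for a peak at bin $k$.

```python
def parabolic_peak(mag, k, fs, N):
    """Refine a magnitude-spectrum peak at bin k by fitting a parabola
    to the peak sample and its two neighbors (parabolic interpolation,
    after Smith and Serra).  Returns the refined frequency estimate in
    Hz and the refined peak magnitude."""
    alpha, beta, gamma = mag[k - 1], mag[k], mag[k + 1]
    # Vertex offset p in (-1/2, 1/2), Equation (3.13):
    p = 0.5 * (alpha - gamma) / (alpha - 2 * beta + gamma)
    freq = (k + p) * fs / N                    # refined frequency, Eq. (3.14)
    amp = beta - 0.25 * (alpha - gamma) * p    # parabola height, Eq. (3.15)
    return freq, amp
```

For samples that lie exactly on a parabola, the method is exact; for window main lobes it is a close approximation whose bias shrinks with zero padding.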
It has been stated that $y(p)$ can be computed separately for the real and imaginary parts of the complex short-time spectrum to yield the complex peak estimate, from which the peak magnitude and phase are computed [11]. However, the simplification given in Equation (3.15) is valid only at the vertex $p$ of the parabola in Equation (3.12). Since a parabola with vertex $p$ will not, in general, fit the samples of the real and imaginary spectra, parabola coefficients will have to be computed independently for those spectra and the resulting parabolas evaluated at $p$, the location of the peak in the magnitude spectrum. In other words, it is necessary to find coefficients for
$$y_{\Re}(x) = a_{\Re}\,(x - p_{\Re})^2 + b_{\Re}$$

and