Preservation of Local Sound Periodicity with Variable-Rate Video

D. Knox, T. Itagaki, I. Stewart, A. Nesbitt, I.J. Kemp
Dept. of Engineering, Glasgow Caledonian University, Glasgow G4 0BA, UK

Abstract – A method of preserving the pitch of sound during variable-rate video playback is suggested. This is an important factor when audio content is monitored for cueing purposes. Methods of treating a signal's frequency and time representations separately are considered with a view to performing time-scale modification while preserving local periodicity (pitch). Particular emphasis is placed upon granulation in time of a sampled source – a technique based upon Dennis Gabor's landmark papers of 1946 and 1947 and developed further in the field of computer music. This relatively simple technique requires no prior signal analysis and is therefore a less computationally expensive way of achieving the goals stated above, an important consideration given the need for real-time implementation. The process does, however, introduce some distortion, and investigation into how this may be minimised is necessary to produce acceptable results.

Introduction

Vari-speed functions in video playback systems rarely allow simultaneous playback of the accompanying audio. Audio output is usually muted during such operations (Watkinson 1994). If audio were output, the varying speed of the video would transpose the pitch content of the audio signal, resulting in a loss of intelligibility. This may not present much of a problem where strong visual cues in the source material guide the operator to the desired location, but long, uninterrupted footage of, for example, political addresses or interviews is a different matter. In such situations clear audible cues provide invaluable information for locating specific segments.

Time/Frequency Representation

The above situation is a clear demonstration of the reciprocal relationship between time and frequency: simple time-scale modification of a signal inversely affects that signal's local periodicity. Hence a means of treating the time and frequency representations of a signal separately is desirable. Gabor (1946) theorised a method by which any signal could be described by, or expanded into, a series of time-frequency units he termed logons, which could be plotted on a two-dimensional time-frequency grid. This essentially aimed to present the first true time and frequency representation of a signal – in contrast to what Gabor termed the 'essentially timeless' Fourier Transform, which assumes any input is a combination of related sinusoids of infinite duration (Gabor 1947). The grid is produced by windowing the input signal into a series of segments with Gaussian amplitude envelopes. Each segment or 'grain' has a specific duration in time and bandwidth in the frequency domain, so each point on the grid represents a unit or 'quantum' of time-frequency energy. Gabor also drew on psychoacoustic considerations to identify lower bounds for his quanta; in particular, he proposed that there is a minimum duration below which the ear ceases to perceive frequency content and simply registers a noise or click. Distorting or warping the analysis grid in either of its planes performs independent modification of either time or frequency content. An interesting point is that an infinite stretch in the frequency plane produces the signal's (timeless) Fourier Transform, whereas an infinite stretch in the time plane produces a pure time representation (Gabor 1947). This can be seen as an interpretation of Heisenberg's uncertainty principle, which states that one cannot be arbitrarily precise in representations of both time and frequency.
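For illustration only (these expressions do not appear in the original text and follow standard formulations; the symbols x, y, a, g, c_{mn}, \Delta t and \Delta f are introduced here), the reciprocity and the Gabor grid can be written explicitly. Playing a recording x(t) back at a times its normal speed gives y(t) = x(at), with spectrum

    Y(f) = \frac{1}{|a|} X\!\left(\frac{f}{a}\right),

so durations are scaled by 1/a while every frequency component is transposed by the factor a. Gabor's expansion represents a signal as a sum of Gaussian-windowed logons on the time-frequency grid,

    s(t) = \sum_{m,n} c_{mn}\, g(t - m\,\Delta t)\, e^{\,j 2\pi n\,\Delta f\, t}, \qquad \Delta t\,\Delta f = 1,

where g(t) is a Gaussian envelope, and with Gabor's definitions of effective duration and bandwidth each logon obeys the uncertainty relation \Delta t\,\Delta f \ge 1/2.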

Time-Scale Modification Techniques

The short-time representation of signals implicit in Gabor's theories has been adopted widely in time-frequency analysis/resynthesis methods. Although these techniques are concerned primarily with accurate frequency representation of continuous signals, various transformations of the analysis data are possible. Windowing the signal in the Short-Time Fourier Transform (STFT) has the obvious practical advantages of speeding up the transform calculation and providing a degree of localisation in time of the signal's frequency content. Alteration of the shape or envelope of the window is mainly concerned with producing clean, clear spectra at the transform output (Ifeachor and Jervis 1993). The STFT also forms the basis of the Phase Vocoder (Dolson 1986), and Linear Predictive Coding (LPC) likewise manipulates analysis data in windowed form. These techniques produce frames of analysis data from which the original signal may be reconstructed. Varying the spacing between these frames on output allows time-scale modification whilst preserving signal pitch; because this change in spacing alters the phase continuity of the signal, phase values must be adjusted before reconstruction (Dolson 1986). A minimal sketch of this frame-respacing approach is given below.

The Wavelet Transform (Kronland-Martinet 1988) addresses the time-frequency resolution limitations of fixed-size windowing by using analysing windows (wavelets) that dilate according to the frequency being analysed. The advantages of this technique are well documented (Kronland-Martinet 1988, Ellis 1992). Time-scale modification by manipulation of analysis data is also possible with the Wavelet Transform, although complications arise regarding how the analysis data grid may be transformed for effective time-scale modification (Ellis 1992).

Granulation of sound is perhaps the technique most directly influenced by Gabor's work. Developments in granular methods since Gabor's papers are detailed in Roads (1985, 1996). Investigations into real-time implementations of granular methods have mostly been in the area of sound synthesis, in particular the work of Truax (1988, 1994). Time-compression or expansion by granulation is achieved by extracting windowed grains from sampled real sound and re-ordering them in time.
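To make the frame-respacing idea concrete, the following sketch (Python with NumPy) is a minimal phase vocoder along the lines described by Dolson (1986). It is illustrative only, not part of the original work; the FFT size and hop lengths are arbitrary choices, and window normalisation is omitted for brevity.

import numpy as np

def pv_time_stretch(x, stretch, n_fft=2048, hop_a=512):
    """Time-stretch x by 'stretch' (>1 lengthens) while preserving pitch.

    Frames are analysed every hop_a samples and resynthesised every
    hop_s = round(hop_a * stretch) samples; bin phases are advanced so the
    underlying sinusoids stay continuous across the new frame spacing.
    """
    hop_s = int(round(hop_a * stretch))
    win = np.hanning(n_fft)
    n_frames = 1 + (len(x) - n_fft) // hop_a
    frames = np.array([np.fft.rfft(win * x[i * hop_a : i * hop_a + n_fft])
                       for i in range(n_frames)])
    # Nominal phase advance of each bin over one analysis hop.
    expected = 2 * np.pi * np.arange(frames.shape[1]) * hop_a / n_fft
    phase = np.angle(frames[0])
    out = np.zeros(n_fft + hop_s * (n_frames - 1))
    out[:n_fft] = win * np.fft.irfft(np.abs(frames[0]) * np.exp(1j * phase), n_fft)
    for k in range(1, n_frames):
        # Deviation of the measured phase advance from the nominal one,
        # wrapped to [-pi, pi).
        dphi = np.angle(frames[k]) - np.angle(frames[k - 1]) - expected
        dphi = (dphi + np.pi) % (2 * np.pi) - np.pi
        # Advance the accumulated phase by the true bin frequency over the
        # (longer or shorter) synthesis hop, then overlap-add the frame.
        phase = phase + (expected + dphi) * (hop_s / hop_a)
        frame = np.abs(frames[k]) * np.exp(1j * phase)
        out[k * hop_s : k * hop_s + n_fft] += win * np.fft.irfft(frame, n_fft)
    return out

Note that everything before the phase-adjustment loop is analysis; it is this analysis cost that granulation, discussed next, avoids.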

For simple time-scale compression by granulation, grain segments are drawn from the source and reconstructed in the fashion illustrated in Fig. 1.

Fig. 1. Selection and reconstruction of grains for compression in time.

All the techniques discussed except granulation require prior signal analysis to produce frames of data for manipulation. The majority of the processing requirement in these techniques is taken up by the analysis itself, whereas altering the frame reconstruction times is a relatively simple, less computationally expensive task (Truax 1994). Analysis with the Wavelet Transform is more time-consuming than a standard FFT (Roads 1996) and involves more detailed consideration regarding time-scale modification (Ellis 1992). Hardware optimised for such methods speeds processing time (Ifeachor and Jervis 1993), but at increased financial cost. Granulation, in comparison, requires no analysis stage and is therefore worth investigating as a computationally and financially inexpensive alternative to the above methods. However, since no phase information is available for adjustment to compensate for the signal's re-ordering in time, the process introduces a series of distortions.

Distortions

On reconstruction (Fig. 1), adjacent grains overlap at their envelopes' –3dB (half-power) points, and the ratio of period A to period B dictates the degree of time-compression or expansion. For speed of calculation the trapezoidal window (Truax 1988) is suggested.
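By way of illustration (this is a sketch, not the authors' granulation program), time-scale modification by granulation with a trapezoidal grain envelope might be implemented as follows. The grain and ramp lengths, and the identification of period A with the input grain spacing and period B with the output spacing, are assumptions made for the example.

import numpy as np

def trapezoid(length, ramp):
    """Trapezoidal grain envelope: linear fade-in and fade-out of 'ramp' samples."""
    env = np.ones(length)
    slope = np.linspace(0.0, 1.0, ramp, endpoint=False)
    env[:ramp] = slope
    env[-ramp:] = slope[::-1]
    return env

def granulate(x, rate, grain_len=3072, ramp=512):
    """Time-compress (rate > 1) or expand (rate < 1) x without transposing pitch.

    Grains are read from the source every hop_in samples (period A, assumed)
    and overlap-added to the output every hop_out samples (period B, assumed),
    so rate = hop_in / hop_out.  Adjacent output grains cross-fade over 'ramp'
    samples, where the two linear fades sum to approximately unity gain.
    """
    env = trapezoid(grain_len, ramp)
    hop_out = grain_len - ramp
    hop_in = int(round(hop_out * rate))
    n_grains = 1 + (len(x) - grain_len) // hop_in
    y = np.zeros(hop_out * (n_grains - 1) + grain_len)
    for k in range(n_grains):
        src = x[k * hop_in : k * hop_in + grain_len]
        y[k * hop_out : k * hop_out + grain_len] += env * src
    return y

At an assumed 44.1 kHz sample rate the default grain length of 3072 samples corresponds to roughly 70 ms, within the long-grain range discussed below. The linear cross-fade here sums to unity amplitude; other envelopes or overlap points (such as the –3dB points mentioned above) trade amplitude flatness against power flatness in the overlap region.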

It is at the grain overlap points that the aforementioned phase discontinuity is apparent, in addition to amplitude variations producing an audible beating effect and comb-filter-like spectral artefacts (Roads 1985). Granular synthesis methods seek to control these artefacts to produce aurally interesting results: Itagaki (1998) describes granular synthesis as the use of short grains (and varying grain envelopes) to create such effects for compositional use. From a signal processing point of view these effects (due to the use of windows in spectral analysis) represent unwanted distortion, and sound granulation therefore uses longer grains to minimise them and so uphold signal integrity (Itagaki 1998). A granulation program was written by the author to investigate these effects, and time-scale modifications were carried out. Initial results using long grains (50-90 ms) on speech sources confirm the observations in Itagaki (1998) as regards the need for long grains. However, shorter grains are required for time-compression if sibilance and transients (and therefore intelligibility) are to be maintained, so the artefacts of the short grain must be tolerated to achieve this clarity. Spectrograms of the sampled word 'wow' processed by the author's granulation program (Fig. 2) illustrate both the preservation of pitch and the production of spectral artefacts (sidebands) by the process.

Fig. 2. Spectrograms (frequency against time in samples): original signal (a); playback at 2x speed (decimation by a factor of 2) in (b) shows pitch transposition of the 800 Hz and 1400 Hz elements of the original signal to 1600 Hz and 2800 Hz respectively; the time-compressed granulation equivalent in (c) preserves these components but also introduces unwanted artefacts. A 1024-point FFT with a Blackman-Harris window was used.

The artefacts apparent in Fig. 2(c) are the price paid for the processing-speed advantage gained by omitting signal analysis. Further investigation into these effects is necessary if optimal output quality is to be achieved.
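Spectrograms of this kind can be reproduced along the following lines. The sketch is illustrative only: the 44.1 kHz sample rate and 75% frame overlap are assumptions, and only the 1024-point FFT and Blackman-Harris window are taken from the Fig. 2 caption; granulate() refers to the sketch given earlier.

import numpy as np
from scipy.signal import get_window, spectrogram

def analyse(signal, fs=44100):
    """Spectrogram with the parameters stated for Fig. 2:
    1024-point FFT and Blackman-Harris window (overlap is an assumed value)."""
    win = get_window('blackmanharris', 1024)
    f, t, S = spectrogram(signal, fs=fs, window=win, noverlap=768)
    return f, t, 10.0 * np.log10(S + 1e-12)   # power in dB

# Usage (assumed variable names): compare the original, the 2x-decimated
# playback and the time-compressed granulation of the same sample.
# for name, sig in [('original', x), ('decimated', x[::2]), ('granulated', granulate(x, 2.0))]:
#     f, t, S_db = analyse(sig)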

Sound with Variable-Rate Video

Application of this technique to sound with variable-rate video addresses the problems outlined in the introduction. The author's granulation program performs time-scale modification of stored samples that are processed in system memory and written back to disk; real-time operation, however, is necessary in this particular application of the technique. To support his ground-breaking papers, Gabor developed a system to granulate an optical film soundtrack using a revolving drum with a slit and a photo-electric cell that could sample segments of the original source at different rates – termed a 'Kinematical Frequency Converter' (Gabor 1946). The implementation discussed here represents a new application of these ideas to tape and digital video sources. The basic premise for variable-rate video, regardless of the medium format, remains the same: altering the linear tape speed in conjunction with periodic repeating or skipping of fields of video information implements slow-motion or shuttle play.

Audio information may be interleaved in helical tracks or recorded on separate, linear tracks along the tape length, depending on the tape format. If the audio is in linear format, modification of the tape speed results in pitch transposition; if it is in helical format, the skipping or repeating of fields results in 'bursts' of audio with discontinuities (Watkinson 1994). The selection of grains from linear-format audio closely resembles the granulation scheme discussed earlier, but interleaved audio in helical tracks raises extra points: the audio data is already presented in segments, produced by reading periodic fields of data. Once synchronisation and identification data are removed from these segments, the source can be windowed and reconstructed for continuous output. Although the mechanical aspects of this discussion do not apply to digital video displayed on a PC, the simple repetition or omission of frames of video still applies.

Summary

Granulation has an advantage over its peer technologies regarding speed of computation (Truax 1994). Investigation is required into how the artefacts produced by the technique may be minimised. Initial tests suggest that acceptable results are possible, and the speed of computation offered suggests the technique is suited to the real-time implementation required by the variable-rate video application described here.

Conclusion

An application of this technique to sound with variable-rate video allows preservation of local signal periodicity in real time, thus providing an intelligible audio cue source. It is suggested that this may be achieved with a single, inexpensive DSP device, as about 20-30 MIPS is required for two-channel sound granulation (Bartoo et al. 1994, Itagaki 1998). It is intended that this implementation will be developed to provide full, professional jog and shuttle control of video with pitch-preserved sound, with automatic grain parameter selection optimised for the corresponding playback rate.
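As a rough consistency check on this figure (the 48 kHz professional sampling rate is an assumption, not stated in the text): two channels at 48 kHz require 2 x 48,000 = 96,000 output samples per second, so

    (20-30) x 10^6 instructions/s / 96,000 samples/s ≈ 210-310 instructions per output sample,

a modest per-sample budget for grain windowing and overlap-add.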

References

Arfib, D. 1991. 'Analysis, Transformation and Resynthesis of Musical Sounds with the Help of a Time-Frequency Representation'. In G. De Poli, A. Piccialli and C. Roads, eds. Representations of Musical Signals. Cambridge, Massachusetts: The MIT Press. pp. 87-117.

Atal, B.S., and S.L. Hanauer. 1971. 'Speech Analysis and Synthesis by Linear Prediction of the Speech Wave', The Journal of the Acoustical Society of America, 50 (2): 637-655.

Bartoo, T., R. Ovans, D. Murphy, and B. Truax. 1994. 'Granulation and Time-Shifting of Sampled Sounds in Real-Time with a Quad DSP Audio Computer System', Proceedings of the 1994 International Computer Music Conference. pp. 335-337.

Dolson, M. 1986. 'The Phase Vocoder: A Tutorial', Computer Music Journal, 10 (4): 14-27.

Ellis, D. 1992. 'Timescale Modifications and Wavelet Representations', Proceedings of the 1992 International Computer Music Conference. pp. 6-9.

Gabor, D. 1946. 'Theory of Communication', Journal of the Institution of Electrical Engineers Part III, 93: 429-457.

Gabor, D. 1947. 'Acoustical Quanta and the Theory of Hearing', Nature, 159 (4044): 591-594.

Harris, F.J. 1978. 'On the Use of Windows for Harmonic Analysis with the Discrete Fourier Transform', Proceedings of the IEEE, 66 (1): 51-83.

Ifeachor, E.C., and B.W. Jervis. 1993. Digital Signal Processing: A Practical Approach. Wokingham, England; Reading, Massachusetts: Addison-Wesley.

Itagaki, T. 1998. 'Real-Time Sound Synthesis on a Multi-Processor Platform', PhD Thesis, Durham University.

Jones, D.L., and T.W. Parks. 1988. 'Generation and Combination of Grains for Music Synthesis', Computer Music Journal, 12 (2): 27-34.

Kronland-Martinet, R. 1988. 'The Wavelet Transform for the Analysis, Synthesis and Processing of Speech and Musical Sounds', Computer Music Journal, 12 (4): 11-20.

Roads, C. 1985. 'Granular Synthesis of Sound'. In C. Roads and J. Strawn, eds. Foundations of Computer Music. Cambridge, Massachusetts: The MIT Press. Originally published in Computer Music Journal, 2 (2): 61-62, 1978.

Roads, C. 1996. The Computer Music Tutorial. Cambridge, Massachusetts: The MIT Press.

Truax, B. 1988. 'Real-Time Granular Synthesis with a Digital Signal Processor', Computer Music Journal, 12 (2): 14-26.

Truax, B. 1994. 'Discovering Inner Complexity: Time Shifting and Transposition with a Real-Time Granulation Technique', Computer Music Journal, 18 (2): 38-48.

Watkinson, J. 1994. The Art of Digital Video, 2nd Edition. Oxford; Boston: Focal Press.
