
19th INTERNATIONAL CONGRESS ON ACOUSTICS MADRID, 2-7 SEPTEMBER 2007

Timbral adjectives for the control of a music synthesizer

PACS: 43.75.-z

Howard, David M., Disley, Alastair C., Hunt, Andrew D.
Audio Lab, Intelligent Systems Research Group, Department of Electronics, University of York, Heslington, York, YO10 5DD, United Kingdom; [email protected]

ABSTRACT

Music synthesis systems are typically controlled by setting parameters such as filter settings, resonance controls, modulation settings and different "raw" waveforms. For many performers and composers, these are not obvious ways of thinking about the sound generation process in a creative manner; they are much more likely to describe their sound requirements in terms of timbral adjectives such as bright, mellow, brash, warm and fuzzy. A pilot music synthesis system based on Pure Data (PD) will be described, which makes use of timbral adjectives as its controllers. These are mapped to the underlying acoustic variations which control the PD synthesis system. The mapping between the timbral adjectives and the acoustic nature of the sound is based on a series of listening tests to establish which adjectives are most commonly favoured by practising musicians, and how they are used to describe typical sounds of existing musical instruments. Multidimensional scaling and principal component analysis have been employed to link these data in terms of how the adjectives most appropriately describe the sounds, and this provides the basis for the mapping itself. The mapping techniques will be explored and the basic design of the synthesis system will be described.

INTRODUCTION

It can be argued that it is theoretically possible with digital technology to reproduce any sound, but this of course requires the appropriate sequence of waveform samples. In practice, finding the appropriate sequence is a formidable and generally non-intuitive task, which can be aided through knowledge of acoustics and psychoacoustics. For musicians, terms such as pitch, loudness and timbre are very familiar and commonly used in practice. Whilst pitch and loudness have formally defined scales of "low" to "high" through which they are defined, there is no single subjective rating scale against which timbre judgements can be made. The American National Standards Institute [1] formally defines timbre as: "Timbre is that attribute of auditory sensation in terms of which a listener can judge two sounds similarly presented and having the same loudness, pitch and duration as being dissimilar." In other words, two sounds that are perceived as being different but which have the same perceived loudness, pitch and duration are said to differ by virtue of their timbre. The timbre of a note is the characteristic through which a listener recognises which instrument is playing.

Timbral descriptors are often employed when describing the differences between sounds. For example, the definition given in [2] encompasses such descriptors: "Timbre means tone quality - coarse or smooth, ringing or more subtly penetrating, 'scarlet' like that of a trumpet, 'rich brown' like that of a cello, or 'silver' like that of the flute ... The one and only factor in sound production which conditions timbre is the presence or absence, or relative strength or weakness, of overtones". The specific timbral descriptions used when describing the differences between sounds can be unique to an individual, and their relationship to the acoustic nature of the sounds may or may not bear scientific scrutiny. Miranda et al.
[3] propose a new taxonomy for the timbres produced by the Chaosynth software synthesiser, which is based on cellular automata. They require this because Chaosynth can produce extremely complex sounds; examples of the defined classes of sound in their proposed taxonomy include chaotic, explosive and general textures. They have further observed that "potential users have found it very hard to explore its possibilities as there is no clear referential framework to hold on to when designing sounds".

The use of everyday adjectives to describe sounds has been studied, and some have been found to have common usage amongst a majority of listeners. Timbre's inherent multidimensionality is well known [4]. Several studies have examined the relationship between high-level descriptors and the sound of particular instruments. For example, Nykänen and Johansson [5] list ten common timbral descriptors used by Swedish saxophone players. Disley and Howard [6] used similar methods to gather English words describing pipe organ ensembles, and refined these in subsequent listening tests [7] to arrive at an uncorrelated and consistently understood subset of descriptors for use in that context: thin, flutey, warm, bright and clear. Other studies have gathered timbral descriptors without musical stimuli; Moravec et al. [8] collected words from Czech musicians and developed a subset based on frequency of occurrence.

A number of studies have looked for generally applicable auditory cues and acoustic factors for individual timbral adjectives in isolation; these are summarised on pages 76 to 81 of [9]. Von Bismarck [10] summarises timbral scales from many sources and derives a subset of four (dull-sharp, compact-scattered, full-empty and colourless-colourful), but Kendall and Carterette [11,12] question both the relevance of these scales to the sounds of real instruments and the wisdom of assuming semantic opposition between words, going on to demonstrate more success with scales of "x" to "not-x".

The timbral adjectives "bright" and "clear" are quite commonly used to describe sounds. Increasing brightness appears to be associated with two key spectral effects: raising the cut-off frequency of a low-pass filter [13-18] or raising the frequency of the spectral centroid [10,17,19-21]. Clarity tends to be increased when the number of harmonics is decreased, harmonic density is expanded or the spectral slope is decreased [16]; when the amplitude of the second harmonic is increased, since it octave-doubles the fundamental [22]; or when high frequency energy is raised [23]. (Illustrative computations of the spectral centroid and spectral slope are sketched below.)

Any timbral descriptions should be considered in terms of the perceptual judgements that underpin them. These will have been directly influenced by the particular sound set used to elicit them in the first place, perhaps as a function of the overall set of sounds from which they came. Previous work with the sounds of the pipe organ [6,7] provided feasibility evidence and relevant experience for the move from the rather restricted timbral space occupied by the pipe organ to that occupied by the sounds of other musical instruments. This was the motivation behind the work described here, in which a particular set of instrumental sounds is employed for the listening tests; there is therefore always the potential for the sounds to inspire their own set of timbral descriptors. However, the importance of any relationships established between particular timbral adjectives and their acoustic manifestations must not be overlooked.

Timbral descriptors that are commonly employed by musicians are explored and used to segregate the acoustic attributes of the sounds to which they are applied. Listening tests are used in which musicians apply timbral adjectives to a set of sounds, and the results are analysed using multidimensional scaling and principal component analysis to establish which adjectives account for the greatest degree of segregation across the listeners for the sounds under investigation.
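The two spectral measures most often linked to brightness, the spectral centroid and the spectral slope, are straightforward to compute. The sketch below, in Python with NumPy, is purely illustrative and not taken from any of the cited studies; the function name, windowing choice and the linear-fit definition of slope are assumptions.

```python
import numpy as np

def spectral_centroid_and_slope(signal, sample_rate):
    """Illustrative spectral statistics often linked to 'brightness'.

    Centroid: amplitude-weighted mean frequency of the spectrum.
    Slope: gradient of a least-squares line fitted to the magnitude
    spectrum (one of several definitions found in the literature).
    """
    # Window the signal to reduce spectral leakage, then take the FFT.
    windowed = signal * np.hanning(len(signal))
    magnitudes = np.abs(np.fft.rfft(windowed))
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / sample_rate)

    centroid = np.sum(freqs * magnitudes) / np.sum(magnitudes)
    slope = np.polyfit(freqs, magnitudes, deg=1)[0]
    return centroid, slope

# A sawtooth-like tone has a higher centroid than a sine at the same
# pitch, matching the intuition that it sounds brighter.
sr = 44100
t = np.arange(int(1.5 * sr)) / sr
sine = np.sin(2 * np.pi * 392 * t)
saw = sum(np.sin(2 * np.pi * 392 * k * t) / k for k in range(1, 20))
print(spectral_centroid_and_slope(sine, sr))
print(spectral_centroid_and_slope(saw, sr))
```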
The adjectives that provide the greatest degree of segregation are then considered as candidates to provide user controls in a synthesis system. Progress towards the realisation of such a sound synthesis system is described.

LISTENING TESTS

A pilot test was undertaken to explore the appropriateness of a chosen set of timbral adjectives and instrumental stimuli. This was followed by the main tests, whose aim was to establish a consensus from listeners in terms of the timbral adjectives chosen and the way they are applied to instrumental sounds.

Pilot listening test

Sixteen musical listeners (defined as having an involvement in an occupation related to music or music technology; 13 male, 3 female; average age 36 yrs., range 21-63 yrs.) took part in the pilot experiment (UK: 7; USA: 4; Canada, China, Germany, Italy and Sweden: 1 each). Their task was to listen to ten test stimuli (listed in table I) one at a time and rate each using a set of ten timbral adjectives. The stimuli were taken from a Yamaha XG module and consisted of: piano, xylophone, bell, taiko drum, flute, oboe, viola, brass, sawtooth and sinewave. The stimuli were captured as 1.5 second mono PCM samples (44.1kHz sampling rate, 16 bit resolution) on the note G3 (MIDI note 67, nominally 392Hz), and each was adjusted to have no vibrato or use of multiple or stereo voices. (A sketch of how a tone in this sample format could be generated follows the table.)

Table I. Sound stimuli and timbral adjectives used in the pilot listening test.

    timbral adjectives        sound stimuli
    bright     harsh          piano        oboe
    clear      dull           xylophone    viola
    warm       nasal          bell         brass
    thin       metallic       taiko drum   sawtooth
    flutey     colourful      flute        sinewave
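For illustration, the following minimal Python sketch generates a tone in the same format as the pilot stimuli (1.5 seconds, mono, 44.1 kHz, 16-bit PCM, nominally 392 Hz). It is hypothetical: the actual stimuli were captured from a Yamaha XG module rather than synthesised, and the file name and fade length here are the sketch's own assumptions.

```python
import wave
import numpy as np

def write_test_stimulus(path, freq_hz=392.0, duration_s=1.5, sr=44100):
    """Write a mono 16-bit PCM test tone in the pilot stimulus format.

    Illustrative only: the paper's stimuli were sampled from a Yamaha
    XG module, not generated like this.
    """
    t = np.arange(int(duration_s * sr)) / sr
    tone = 0.8 * np.sin(2 * np.pi * freq_hz * t)

    # Short linear fade-in/out (10 ms, an assumed value) to avoid clicks.
    fade = int(0.01 * sr)
    tone[:fade] *= np.linspace(0.0, 1.0, fade)
    tone[-fade:] *= np.linspace(1.0, 0.0, fade)

    pcm = (tone * 32767).astype(np.int16)
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)   # mono
        wav.setsampwidth(2)   # 16-bit resolution
        wav.setframerate(sr)  # 44.1 kHz sampling rate
        wav.writeframes(pcm.tobytes())

write_test_stimulus("sinewave_392hz.wav")  # hypothetical file name
```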

The adjectives (listed in table I) were provided on an eleven-point slider scale via a graphical user interface (see figure 1). Users were also asked to indicate how confident they felt about using each adjective in the context of each sound, and they were given the opportunity to suggest any other adjectives they might use and to provide general comments. The tests were conducted over the internet.

The results suggested that flutey was a poor discriminator, since most sounds were not flutey apart from the flute and, to a much lesser degree, the sinewave; flutey was therefore discarded for the main experiment. Colourful attracted the lowest level of listener confidence, and so it too was discarded. Listeners also indicated that it would be perfectly reasonable to use up to 15 adjectives in any future test, and that the use of synthetic material was not ideal.

Figure 1: Example section of the user response sheet for the pilot experiment.

Main listening test

It was decided that fifteen adjectives could be employed, so having discarded flutey and colourful, the following were added for the main test: wooden, rich, gentle, ringing, pure, percussive and evolving. The stimuli consisted of twelve acoustic instrument samples selected from the MUMS (McGill University Master Samples) library, three from each of four categories (strings, brass, woodwind and percussion), as follows: Viola Bowed, Viola Pizzicato, Electric Guitar, Tenor Trombone, Bach Trumpet, Trumpet Harmon, Flute Vibrato, Alto Saxophone, Oboe, Hamburg Steinway, Tubular Bells and Xylophone.

Twenty-nine female and thirty male listeners took part in the main test. In order to validate the use of the internet, twenty-three (18-22 yrs.) took the test ten times as a control group, and the remaining thirty-six (19-27 yrs.) took it once. All listeners used high quality headphones. Members of the control group each took the test five times in the presence of an investigator (controlled) and five times alone using their own computer (uncontrolled); the remaining listeners took the test once using their own computers (uncontrolled). All subjects were native English speakers and students or staff in Music and Music Technology at a UK university. Subjects were paid for their participation and were free to withdraw at any point. Full experimental details are given in [24]. The order of stimulus presentation was varied for the control group and no significant effect was found.

The use of the internet for testing

The results from the control group confirmed that the use of the internet in the context of these listening tests had no adverse effect on the results. The average difference in results between the controlled and uncontrolled situations was 0.5% of the rating scale range; since the scale itself had 11 points and was therefore quantised in 10% increments, this difference is not significant. A slight increase in self-evaluated confidence in the uncontrolled situation was observed, but this was also found to be non-significant.

Results from the main listening test

The adjectives that showed the highest agreement between listeners were: bright, percussive, gentle, harsh and warm; those with the least were: nasal, ringing, metallic, wooden and evolving. The adjective that elicited the least confidence was evolving, whilst the highest confidence ratings were given to clear, percussive, ringing and bright.

Table 2: Principal component analysis results for the main listening test.
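To make the kind of analysis behind table 2 concrete, the sketch below runs a principal component analysis over a stimulus-by-adjective rating matrix using scikit-learn. The ratings here are randomly generated placeholders with the experiment's dimensions (twelve stimuli, fifteen adjectives, an 11-point scale); this is not the authors' analysis code.

```python
import numpy as np
from sklearn.decomposition import PCA

# Placeholder rating matrix: one row per stimulus, one column per
# adjective, values on the 11-point (0-10) scale. Real data would be
# the listener ratings from the main test.
rng = np.random.default_rng(0)
ratings = rng.integers(0, 11, size=(12, 15)).astype(float)

adjectives = ["bright", "clear", "warm", "thin", "harsh", "dull",
              "nasal", "metallic", "wooden", "rich", "gentle",
              "ringing", "pure", "percussive", "evolving"]

# Fit four components; the paper reports that four components
# explained 91.7% of the variance in the real listening-test data.
pca = PCA(n_components=4)
pca.fit(ratings)
print("cumulative explained variance:",
      pca.explained_variance_ratio_.cumsum())

# The loadings indicate which adjectives dominate each component
# (e.g. bright/thin/harsh versus dull/warm/gentle on the first).
for i, component in enumerate(pca.components_):
    top = np.argsort(np.abs(component))[::-1][:3]
    print(f"PC{i + 1}:", [adjectives[j] for j in top])
```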

The overall objective of this experiment was to establish a set of timbral adjectives that can be used as high-level controls for a synthesis system, but there was no intention to make use of all fifteen adjectives for this purpose, which would make the user interface for the instrument overly complex. Some of the adjectives are likely to be describing similar aspects of the timbral difference between sounds, and so there is a potential degree of redundancy. Cluster analysis (CA) and principal component analysis (PCA) provide means of establishing any such commonalities, and both were applied to the listening test results.

The CA results suggested that bright and clear, and warm and gentle, indicate very similar aspects of the sounds in the context of the particular set of samples used in the listening experiment. The PCA results indicated that 91.7% of the variance can be explained by four principal components (see table 2): the first has bright, thin and harsh at one end and dull, warm and gentle at the other; the second has pure and percussive at one end and nasal at the other; the third is mainly accounted for by metallic to wooden; and the fourth by evolving. These data suggest that pure and rich are not good discriminators for this data set. The highest standard deviations between listener results were found for nasal, ringing, metallic, wooden and evolving, suggesting that these are not being used consistently and would therefore not be particularly suitable as universal timbral descriptors. This leaves bright, clear, warm, thin, harsh, dull, percussive and gentle as candidates for the user controls of a synthesis system.

SYNTHESIZER DESIGN

The design of a synthesizer that uses these remaining eight adjectives as its user controls requires mapping data to indicate the effect that each will have on the acoustic output of the synthesis system. A further listening test, based on the classic experiment by Grey [4], was carried out in which listeners compared pairs of sounds. A multidimensional scaling analysis of the results produced two dimensions. The main adjective describing changes along dimension 1 was percussive, whilst those for dimension 2 were bright, thin and harsh to warm, dull and gentle. The stimuli were analysed acoustically in terms of: spectral centroid, spectral slope, spectral smoothness, attack time, decay time, and the ratio of the energy in the first harmonic (fundamental) to that in harmonics 5 to 8. It was found that dimension 1 related to spectral centroid, spectral slope, spectral smoothness, attack time and the f0:harmonics 5-8 ratio, whilst dimension 2 related to the f0:harmonics 5-8 ratio, spectral centroid and decay time.
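As an illustration of two of these acoustic measures, the sketch below estimates the attack time and the ratio of fundamental energy to that in harmonics 5 to 8 for a sampled note. The 10%-90% envelope thresholds and the fixed ±20 Hz analysis bands are assumptions made for the sketch, not the paper's documented procedure.

```python
import numpy as np

def attack_time(signal, sr, lo=0.1, hi=0.9):
    """Seconds for the amplitude to rise from 10% to 90% of its peak
    (these thresholds are a common convention, assumed here)."""
    env = np.abs(signal)
    peak = env.max()
    t_lo = np.argmax(env >= lo * peak)  # first crossing of the low threshold
    t_hi = np.argmax(env >= hi * peak)  # first crossing of the high threshold
    return (t_hi - t_lo) / sr

def f0_to_upper_harmonics_ratio(signal, sr, f0, width=20.0):
    """Energy near the fundamental divided by energy near harmonics
    5-8, summed over +/- `width` Hz bands (an assumed band width)."""
    spectrum = np.abs(np.fft.rfft(signal * np.hanning(len(signal)))) ** 2
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / sr)

    def band_energy(centre_hz):
        return spectrum[np.abs(freqs - centre_hz) <= width].sum()

    upper = sum(band_energy(k * f0) for k in range(5, 9))
    return band_energy(f0) / upper

# Quick check on a synthetic 392 Hz tone with eight equal-amplitude
# harmonics (a real instrument sample would be more informative).
sr = 44100
t = np.arange(int(1.5 * sr)) / sr
tone = sum(np.sin(2 * np.pi * 392 * k * t) for k in range(1, 9))
print(attack_time(tone, sr))
print(f0_to_upper_harmonics_ratio(tone, sr, 392.0))
```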

Figure 2: High-level timbral adjective controlled synthesizer schematic.

These data provide the basis for the first implementation of the synthesis system, for which a schematic is shown in figure 2. The user will make use of eight sliders labelled with the eight adjectives. The outputs from these sliders will be mapped by varying degrees to the acoustic characteristics listed for the two dimensions, and the synthesis engine will be controlled directly in terms of these acoustic characteristics. The nature of the acoustic analysis suggests that a harmonic additive synthesis system is the most appropriate for this first implementation, and for programming and distribution convenience, the widely used freeware system Pure Data (PD) [25] was selected as the synthesis engine. Additive synthesis can be achieved by generating every harmonic independently and summing the results; in this way, each harmonic can be specified in terms of its own individual amplitude envelope and frequency. The MIDI (Musical Instrument Digital Interface) protocol will be used to control the PD implementation, enabling musical notes to be specified using an appropriate MIDI controller or sequencer.

CONCLUSIONS

A new synthesizer controlled by high-level timbral adjectives has been specified that makes use of PD and MIDI. The adjectives have been determined through listening tests that explored how musicians use timbral adjectives, and these have been correlated with the acoustic changes apparent in a set of musical sounds. The synthesizer will be realised in PD and beta tested by practising musicians to establish whether or not such a control system can be usefully employed in practice. It has the potential to make the selection of sounds for music making a much more high-level process, enabling musicians to work in a greater timbral space in a far more musically intuitive manner than is possible at present.


ACKNOWLEDGEMENTS

The authors thank the subjects who took part in the listening tests. This project was supported by grant number GR/T06384/01 from the Engineering and Physical Sciences Research Council (EPSRC), UK.

REFERENCES

[1] ANSI, USA Standard Acoustical Terminology (including Mechanical Shock and Vibration), Technical Report S1.1-1960, American National Standards Institute, New York (1960, revised 1976).
[2] P.A. Scholes, The Oxford Companion to Music, London: Oxford University Press (1970).
[3] E.R. Miranda, J. Correa and J. Wright, Categorising complex dynamic sounds, Organised Sound, 5, No.2 (2000) 95-102.
[4] J.M. Grey, An Exploration of Musical Timbre, PhD thesis, Stanford University (1975).
[5] A. Nykänen and Ö. Johansson, Development of a language for specifying saxophone timbre, Proceedings of the Stockholm Music Acoustics Conference (SMAC03), Stockholm, Sweden, 2 (2003) 647-650.
[6] A.C. Disley and D.M. Howard, Timbral semantics and the pipe organ, Proceedings of the Stockholm Music Acoustics Conference (SMAC03), Stockholm, Sweden, 2 (2003) 607-610.
[7] A.C. Disley and D.M. Howard, Spectral correlates of timbral semantics relating to the pipe organ, Proceedings of the Baltic-Nordic Acoustics Meeting, HUT, Helsinki (2004).
[8] O. Moravec and J. Štěpánek, Verbal description of musical sound timbre in Czech language, Proceedings of the Stockholm Music Acoustics Conference (SMAC03), Stockholm, Sweden, 2 (2003) 643-645.
[9] D.P. Creasey, An exploration of sound timbre using perceptual and time-varying frequency spectrum techniques, DPhil thesis, The University of York (1998).
[10] G. von Bismarck, Timbre of steady sounds: a factorial investigation of its verbal attributes, Acustica, 30, No.3 (1974) 149-159.
[11] R.A. Kendall and E.C. Carterette, Verbal Attributes of Simultaneous Wind Instrument Timbres: I. von Bismarck's Adjectives, Music Perception, 10, No.4 (1993) 445-468.
[12] R.A. Kendall and E.C. Carterette, Verbal Attributes of Simultaneous Wind Instrument Timbres: II. Adjectives Induced from Piston's "Orchestration", Music Perception, 10, No.4 (1993) 469-502.
[13] C. Dodge and T.A. Jerse, Computer Music: Synthesis, Composition and Performance, Schirmer Books, London (1985).
[14] S. De Furia and J. Scacciaferro, The Sampling Book, Third Earth Publishing, Pompton Lakes, New Jersey (1987).
[15] S.L. Campbell, Uni- and Multidimensional Identification of Rise Time, Spectral Slope, and Cutoff Frequency, Journal of the Acoustical Society of America, 96, No.3 (1994) 1380-1387.
[16] R. Ethington and B. Punch, SeaWave: A System for Musical Timbre Description, Computer Music Journal, 18, No.1 (1994) 30-39.
[17] R. Vertegaal and E. Bonis, ISEE: An Intuitive Sound Editing Environment, Computer Music Journal, 18, No.2 (1994) 21-29.
[18] F. Lerdahl, Timbral Hierarchies, Contemporary Music Review, 2 (1987) 135-160.
[19] W.H. Lichte, Attributes of Complex Tones, Journal of Experimental Psychology, 28, No.6 (1941) 455-480.
[20] J.-C. Risset and D.L. Wessel, Exploration of Timbre by Analysis and Synthesis, in D. Deutsch (ed.), The Psychology of Music, Academic Press, London (1982).
[21] D. Keislar, T. Blum, J. Wheaton and W. Wold, Audio Analysis for Content-Based Retrieval, Proceedings of the International Computer Music Conference (Banff) (1995) 199-202.
[22] J. Jeans, Science and Music, Cambridge University Press (1961).
[23] L.N. Solomon, Search for Physical Correlates to Psychological Dimensions of Sounds, Journal of the Acoustical Society of America, 31, No.4 (1959) 492-497.
[24] A. Disley, D.M. Howard and A.D. Hunt, Musicians' use of timbral adjectives, Proceedings of the Institute of Acoustics, 28, No.1 (2006) 670-679.
[25] www.pure-data.org
