© Springer – to appear in LNAI as Proc. 5th Int. Gesture Workshop 2003
Conducting Audio Files via Computer Vision

Declan Murphy, Tue Haste Andersen, and Kristoffer Jensen

Computer Science Department, University of Copenhagen
{declan,haste,krist}@diku.dk
Abstract. This paper presents a system to control the playback of audio files by means of the standard classical conducting technique. Computer vision techniques are developed to track a conductor’s baton, and the gesture is subsequently analysed. Audio parameters are extracted from the sound-file and are further processed for audio beat tracking. The sound-file playback speed is adjusted in order to bring the audio beat points into alignment with the gesture beat points. The complete system comprises all the parts necessary to simulate an orchestra reacting to a conductor’s baton.
1 Introduction
The technique of classical conducting is the most established protocol for mapping free gesture to orchestral music. Having developed since the baroque period out of the practical necessity for an ensemble to play together in time, it has evolved into the most sophisticated such mapping [1] of gesture to music despite the recent flourish of research activity in this area [2].

The presented system consists of a computer vision part to track the gestures of an ordinary baton, a second part that extracts a parameter from the audio in order to track the beat, and a third part that performs time scaling of the audio. A schematic overview of the complete system appears in Fig. 1.

Gesture tracking is performed using a combination of computer vision techniques and a 3D model of the baton. The system outputs the beat time points based on the position and velocity of the baton tip. Two modes are employed: a seek mode to locate the baton initially and a track mode to follow the baton position in time.

Audio beat estimation is based on the extraction of an audio parameter, combined with a high-level beat model. The peaks of the extracted parameter indicate note onsets that are related to the beats of the music. The beat estimation is improved by updating a beat probability vector with the note onsets.

The coupling of conductor’s beats and audio beats is done using two different approaches. The first approach, viz. an event based coupling of the beats, is possible if the overall latency of the system is low and the music consists of sparse, event based material, while the second approach of delayed handling is suitable in all situations.

Max Mathews’ Radio Baton [5] from 1991 is generally regarded as the first computerised baton conducting system. It monitors the positions of two batons
[Fig. 1 block diagram: Baton Tracker, Hand Tracker and Gesture Interpretation (EyesWeb); MIDI; Conducting/Audio Coupler, Audio Beat Extraction, Audio File Database and Audio Stretching (Mixxx)]
Fig. 1. The complete system. It consists of a computer vision module implemented in EyesWeb [3] and an audio processing system implemented in Mixxx [4].
which are fitted with radio transmitters, by means of directional antennas. The Buchla Lightning-II [6] is a versatile MIDI device consisting of a pair of wands fitted with infra-red sources whose locations are monitored by a special remote unit. Borchers et al. [7] use the Lightning-II with pre-stretched audio tracks across a range of pitches, so that the audio can be scaled live more easily and with better quality. Magnetic sensors were used by Ilmonen and Takala [8] for tracking the conductor’s movements, and they used artificial neural networks for determining the conductor’s tempo. The Digital Baton developed by Marrin and Paradiso [9] adopts a hybrid approach of accelerometers, pressure sensors and an infra-red LED mounted in the tip, which is monitored by a position-sensitive photo-diode behind a camera lens. Marrin later developed the Conductor’s Jacket [10], which is designed to be worn by a conductor during performance and is fitted with sensors to monitor electromyography (a measure of muscle tension), breathing, heart rate, skin conductivity and temperature. CCD cameras were used by Carosi and Bertini [11] to trigger computer generated parts while conducting an ensemble. Segen et al. use two cameras for segmentation and apply a linear filter to boundaries represented as a curvature plot [12] to track a baton in order to conduct a ballet dancing avatar [13]. Further systems are briefly reviewed in [14].

Disadvantages with the instrumented batons are that they are fatiguing for prolonged use, that they do not feel like “the real thing”, and that they involve special hardware. Most of the vision based systems are relatively crude by comparison in terms of their precision: they use heuristics to extract the main beat of an up-down movement rather than tracking and recognising the various beat gestures. The system presented here may be used with an ordinary conductor’s baton and inexpensive CMOS cameras. Furthermore, this system is independent of musical style in that it can be used with any sound file having a clear beat.
2 The Nature of Conducting Gestures
In the light of gesture research, conducting is a rather interesting example because it carries multiple layers of meaning at the same time, and by virtue of its combination of both formal technique and subjective body language. Indeed, almost the entire gamut of a gesture taxonomy can be present at a given moment: from control gestures with a clearly defined meaning associated with a clearly defined movement, through to the equivalent of co-verbal gestures, through to emotive gesticulation.

Indeed, following McNeill [15], there are direct equivalents of the full range of co-verbal gestures. The baton gestures qualify as both beats and cohesives: the physical form¹ and semantic content are canonically that of a beat as it punctuates and emphasises the underlying pulse of the music, while they are cohesive in their form by being repetitious and in their function by tying together temporally disjoint yet structurally related points. The non-baton hand generally executes iconics and metaphorics: an iconic could typically be a mimesis of bowing action by way of instruction to the strings, while a metaphoric could typically be an upward movement of the upturned palm calling for a crescendo. Both hands and eyes execute deictics: the baton and eyes to give cues, and the non-baton hand by pointing directly to the intended recipient. The overall body language of the conductor generally sets the desired mood for the musicians’ phrasing, whilst the specific gestures – how the conducting patterns are executed – give more technical lower-level instruction to the musicians.

There is a great deal to be said about how to conduct well from a musicological point of view. This involves being aware of the most salient musical events at every stage of the piece, an understanding of phrasing, orchestration and the physical constraints of all instruments involved, good social communication and an understanding of rehearsal group dynamics, amongst other attributes. See, for example, [16]. Rudolf made a comprehensive survey of orchestral conducting technique in his classic text [1], in which he defines the formal grammar of beat structure and how it may be embellished by gestural variations, again illustrating how conducting functions simultaneously both linguistically and as co-verbal gestures.

This paper presents a prototype system for conducting audio files by extracting the tempo information from the user’s baton gestures and by scaling the audio playback accordingly. A more complete conducting system is the subject of the first author’s current research. This involves a representation of the score with beat pattern annotations (which is realistic since the players and conductor know what to expect from each other and from the music through their rehearsals), a subsystem for recognising the deviations from the expected tempo and volume as indicated by the sampled baton locations, and live generated output. This in turn forms part of the authors’ research on how gesture may be used for composition and performance of music.
¹ Indeed, the “beat” gesture has been termed the “baton” by several authors.
3 Baton Tracking
The tracker has two modes of operation: seek mode and track mode. In seek mode, the system tries to locate the baton without any knowledge from previous frames. In track mode, the system knows where to expect to find the baton, and so is more tolerant of blurred or noisy images and less likely to switch to tracking something other than the targeted baton. The system starts in seek mode and switches automatically into track mode when it is sure that the baton has been localised. It switches back into seek mode if the track mode loses confidence in its result or upon momentary request of the user.

A camera is placed directly in front of the user. If a second camera is available, it takes a profile view of the conductor and thus always sees the baton in its full length and always has a clear view of the vertical component of the baton’s trajectory. The first (and perhaps only) camera faces the conductor directly, and thus has a better view of the baton’s trajectory in both horizontal and vertical components, but only sees the baton’s length as ranging between zero and less than its actual length, which makes the recognition more difficult. Once the baton has been located in both views, a three dimensional reconstruction of the baton is modeled, and the correspondence of the two views is a tracking criterion in subsequent frames.

It is assumed that there is only one baton in view, and that it is uniformly bright against a reasonably contrasting background. Once tracked, the baton reduces to a single point: its tip pointing away from the conductor. Periodic samples of the tip’s trajectory are sent to the gesture interpretation algorithm. In the following sections, a brief introduction to the baton tracker is given; a more complete description appears in [17].
3.1 Seek Mode
In seek mode, the tracker begins with Canny edge detection [18], which identifies and follows pixels using a hysteresis function on blurred images. This outputs a number of pixel-wide line segments corresponding to the edges of the baton and other objects in the image. The stationary baton has a very characteristic edge trace of two parallel line segments of the same length, running side by side, a fixed perpendicular distance apart. These are detected by a finite state machine. In a typical image, however, there will be many other edges satisfying this criterion (coming from the user’s body and the background), and the real baton edges will be fragmented. Nevertheless, this substantially and efficiently reduces the search area. To identify the baton, a trace along pixels of maximum intensity is made on a Gaussian blurred copy of the original image, starting at locations midway between each candidate pair of line segments in the edge detection image. See Fig. 2. A count is made of the number of edge pixels flanking these traces, and the trace of greatest count (not exceeding the maximum length) is subjected to gauging techniques as in [19].
Fig. 2. Illustrating (left) how the seek mode uses the extracted edges (lower plane) to guide an intensity trace (upper plane) and (right) how ranking the flanking edges avoids the disadvantages of using both edges and intensity.
During development, it was found that the above-mentioned maximum intensity tracing, when applied to all image pixels without the previous edge detection stages, resulted in spurious quantisation along digitisation boundaries and an over-sensitivity to non-baton traces despite elaborate gauging. On the other hand, working with edge detection alone requires elaborate filtering to establish “straightness” (both geometric methods and the Hough transform were tried) and even more elaborate techniques in order to deal with the branching of detected edges (where edges overlap) and the gaps that are invariably left. The presented combination method uses a very simple and efficient finite state machine to filter the detected edges, and by collating this with maximal tracing it simultaneously avoids spurious tracing and the complexity of curved, branching and broken edges.
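The following minimal sketch illustrates the flavour of this seek-mode pipeline under stated assumptions: it uses OpenCV, substitutes a probabilistic Hough transform for the paper’s finite-state-machine edge filter, and samples the blurred intensity along the midline of each candidate pair rather than following a true maximum-intensity trace; all function names and thresholds are illustrative only.

```python
import cv2
import numpy as np

def seek_baton(gray, min_len=40, gap_range=(2.0, 14.0)):
    """Seek-mode sketch: Canny edges -> candidate parallel segment pairs ->
    intensity/edge score along the midline of each pair -> best candidate."""
    blurred = cv2.GaussianBlur(gray, (5, 5), 1.5)
    edges = cv2.Canny(blurred, 50, 150)
    segs = cv2.HoughLinesP(edges, 1, np.pi / 180, threshold=30,
                           minLineLength=min_len, maxLineGap=3)
    if segs is None:
        return None
    segs = segs[:, 0, :].astype(float)             # rows of (x1, y1, x2, y2)

    def angle(s):                                   # orientation in [0, pi)
        return np.arctan2(s[3] - s[1], s[2] - s[0]) % np.pi

    def midpoint(s):
        return np.array([(s[0] + s[2]) / 2.0, (s[1] + s[3]) / 2.0])

    best, best_score = None, -np.inf
    for i in range(len(segs)):
        for j in range(i + 1, len(segs)):
            a, b = segs[i], segs[j]
            if abs(angle(a) - angle(b)) > 0.05:     # not parallel enough
                continue
            gap = np.linalg.norm(midpoint(a) - midpoint(b))
            if not gap_range[0] <= gap <= gap_range[1]:
                continue
            # Score the centre line of the pair by blurred intensity plus
            # flanking edge pixels (a stand-in for the ranked trace).
            centre = (midpoint(a) + midpoint(b)) / 2.0
            direc = np.array([np.cos(angle(a)), np.sin(angle(a))])
            half = np.hypot(a[2] - a[0], a[3] - a[1]) / 2.0
            score = 0.0
            for t in np.linspace(-half, half, max(int(2 * half), 2)):
                x, y = np.round(centre + t * direc).astype(int)
                if 0 <= y < gray.shape[0] and 0 <= x < gray.shape[1]:
                    score += blurred[y, x] / 255.0 + edges[y, x] / 255.0
            if score > best_score:
                best_score, best = score, (a, b)
    return best   # the winning pair of edge segments, or None
```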
3.2 Track Mode
A problem with baton tracking by camera is that, if the baton is relatively close to the camera, it gives blurred images when it moves quickly (see Fig. 3), whereas if it is relatively far away from the camera, it cannot be seen well enough due to its slight cross-section. The solution pursued is a hybrid approach combining a more tolerant version of the above seek mode with optical flow [20]. The “slack seek” tracking is suitable for when the baton has little or no motion as seen by the camera. Optical flow calculates a vector field corresponding to the apparent motion of brightness patterns between successive frames as seen by the camera. It diminishes to background noise for a stationary baton, but it is immune from motion blur and thus very suitable for tracking higher speed movement. Both of these methods start with knowledge of the expected location and velocity as calculated from previous frames. If a profile camera is used, the expected location of the baton in the frontal image is known with greater likelihood and accuracy.
Fig. 3. Note how a moving baton appears as a wedge with an intensity gradient in the camera frame (right) giving the general form illustrated with an artificial border (left).
After optical flow is calculated, the sum of the scalar (or dot) products of each vector with its immediate neighbours is calculated and thresholded. This was found to be rather more noise resilient than simply taking the magnitudes for revealing the baton’s location. A further advantage to using optical flow is that it gives a measure of the baton’s velocity. A local average is taken of a small region of non-border pixels close to the tip, and this value is passed on to the gesture analysis algorithm.
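As a rough illustration of this motion cue, the sketch below computes a dense optical flow field and scores each pixel by the sum of dot products of its flow vector with those of its four neighbours. It is an assumption-laden simplification: Farnebäck’s dense flow stands in for the Lucas–Kanade tracking cited above, the centroid of the coherent-motion region stands in for the tracked tip, and the threshold is arbitrary.

```python
import cv2
import numpy as np

def locate_moving_tip(prev_gray, cur_gray, thresh=1.0):
    """Track-mode motion cue (sketch): dense optical flow, then for each
    pixel the sum of dot products of its flow vector with its four
    neighbours; fast coherent motion such as the baton scores high."""
    # Farnebäck dense flow (positional args: pyr_scale, levels, winsize,
    # iterations, poly_n, poly_sigma, flags).
    flow = cv2.calcOpticalFlowFarneback(prev_gray, cur_gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    fx, fy = flow[..., 0], flow[..., 1]
    coherence = np.zeros_like(fx)
    for dy, dx in ((-1, 0), (1, 0), (0, -1), (0, 1)):
        coherence += (fx * np.roll(np.roll(fx, dy, axis=0), dx, axis=1) +
                      fy * np.roll(np.roll(fy, dy, axis=0), dx, axis=1))
    mask = coherence > thresh
    if not mask.any():
        return None, None           # no significant coherent motion this frame
    ys, xs = np.nonzero(mask)
    tip = (xs.mean(), ys.mean())    # centroid of the coherent-motion region
    velocity = (fx[mask].mean(), fy[mask].mean())
    return tip, velocity
```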
3.3 Gesture Recognition
At this stage, the baton has been reduced to a single tracked point in a plane and the objective is to follow where this point is in relation to a standard conductor’s beat pattern. The user is instructed to face and gesture directly towards the frontal camera so that the tracked point’s trajectory lies approximately in a plane parallel to the camera image plane. This avoids unnecessary camera 3D calibration and projective geometry. The user is also expected to conduct such that the field of beating occupies most of the image frame while being contained within it. Figure 4 shows just two of the many standard conductor’s beat patterns from Rudolf [1]. The standard patterns along with their standard variations are encoded as parametric template functions of an even tempo. The periodic updates of position and velocity from the tracker are used to monitor the user’s execution of the beat pattern. Changes in tempo, dynamics and registration are simultaneously resolved and output via MIDI to the conducting/audio coupler.
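The sketch below gives one possible shape for such a template follower. It is purely illustrative and not the encoding used in the system: a toy four-beat layout stands in for Rudolf’s patterns, the tempo is assumed even, and the phase is advanced by a greedy local search rather than the simultaneous resolution of tempo, dynamics and registration described above.

```python
import numpy as np

# Toy beat-point layout for a four-beat pattern (conductor's view), with
# straight-line interpolation between beat points. Illustrative only.
BEAT_POINTS = np.array([[0.0, -1.0],   # beat 1: bottom
                        [-0.6, -0.2],  # beat 2
                        [0.6, -0.2],   # beat 3
                        [0.0, 0.8]])   # beat 4: top

def template(phase):
    """Baton-tip position for phase in [0, 1), one bar at even tempo."""
    seg = (phase % 1.0) * 4
    i = int(seg) % 4
    t = seg - int(seg)
    return (1 - t) * BEAT_POINTS[i] + t * BEAT_POINTS[(i + 1) % 4]

def follow(samples, search=0.08, step=0.005):
    """Advance a phase estimate to best match each incoming tip sample and
    report the sample indices at which a beat point is crossed."""
    phase, beats = 0.0, []
    for k, p in enumerate(samples):
        candidates = np.arange(phase, phase + search, step)
        dists = [np.linalg.norm(template(c) - np.asarray(p)) for c in candidates]
        new_phase = float(candidates[int(np.argmin(dists))])
        if int(new_phase * 4) != int(phase * 4):   # crossed a beat point
            beats.append(k)
        phase = new_phase
    return beats
```

The intervals between successive beat crossings would then give the tempo changes that are output via MIDI to the conducting/audio coupler.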
4 Audio Beat Estimation
The beat estimated from the conductor’s movements must now be coupled with the beat of the music, in order to perform the necessary time scaling of the audio. Audio beat estimation is non-trivial, in particular taking into account problems such as weak/strong beats, tempo change, and off-beat rhythmic information. In this work beat is defined to represent what humans perceive as a binary regular pulse underlying the music. The beat in music is often marked by transient sounds, e.g. note onsets of instruments.
Fig. 4. The standard 4-beat light-staccato (left) and expressive-legato (right) conducting beat patterns (conductor’s view). For light-staccato, the baton pauses at each beat point with a quick flick in-between. For expressive-legato, the baton passes through the beat points without stopping, with a more tense movement in-between.
Some onset positions may correspond to the position of a beat, while other onsets fall off beat. By detecting the onsets in the acoustic signal, and using this as input to a beat induction model, the beat is estimated. The system presented here estimates the beat from a sampled waveform in real-time. It consists of a feature extraction module, from which note onset information is extracted, and a beat induction model based on a probability function that gives beat probability as a function of the note onset intervals. The probability function ensures a stable beat estimation, while local variations are detected directly from the estimated note onsets.
4.1 Parameter Extraction
The parameter is estimated from the given sound files, and used to detect note onsets. The note onset detection is based on the assumption that the attack has more high frequency energy than the sustain and release of a note. This is generally true for most musical instruments. To capture the start of each new note, peaks are detected on the maximum of the time derivative of the parameter. In [21] a number of different parameters were compared. It was shown that the high frequency content (HFC) [22] is the most appropriate parameter for the detection of note onsets, and it is therefore used in the onset detection routine. The HFC is calculated as the sum of the magnitudes of a Short Time Fourier Transform, weighted by the frequencies squared. A short segment of the time derivative HFC is shown in Fig. 5 (top), from which the note onsets are clearly visible.
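A minimal sketch of this onset cue, assuming only NumPy, is given below; the frame size, hop size and the peak-picking threshold (mean plus two standard deviations of the derivative) are illustrative choices, not the values used in the system.

```python
import numpy as np

def hfc_onsets(signal, sr, frame=1024, hop=512, thresh=None):
    """HFC per frame: STFT magnitudes weighted by frequency squared.
    Peaks of the positive time derivative mark candidate note onsets."""
    window = np.hanning(frame)
    n_frames = 1 + (len(signal) - frame) // hop
    freqs = np.fft.rfftfreq(frame, 1.0 / sr)
    weights = freqs ** 2
    hfc = np.empty(n_frames)
    for i in range(n_frames):
        seg = signal[i * hop:i * hop + frame] * window
        hfc[i] = np.sum(weights * np.abs(np.fft.rfft(seg)))
    d = np.diff(hfc, prepend=hfc[0])           # time derivative of the HFC
    if thresh is None:
        thresh = d.mean() + 2 * d.std()        # illustrative threshold
    # Local maxima of the derivative above the threshold.
    onsets = [i for i in range(1, n_frames - 1)
              if d[i] > thresh and d[i] >= d[i - 1] and d[i] > d[i + 1]]
    times = np.array(onsets) * hop / sr
    return times, hfc
```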
4.2 Beat Probability Vector
It is not unreasonable to assume that any given melody will have an underlying beat, which is perceived through recurring note onset intervals. This is modeled in the beat probability vector, in which note onset intervals have their weight increased by a Gaussian shape each time the interval occurs.
Fig. 5. Short example of HFC, including the detected peaks at time tk (top), and selection of beats in the beat probability vector (bottom). For each new peak, a number of previous intervals are scaled and added to the beat probability vector. If an interval is found near the maximum of the beat probability vector, it is selected as the current interval.
This is a variation of the algorithm presented in [23]. To maintain a dynamic behavior, the memory weights are scaled down at each time step. In [23] and [24], weights at multiples of the interval are also increased. Since the intervals are found from the audio file in this work, the erroneous intervals are generally not multiples of the beat. Another method must therefore be used to identify the important beat interval.

To avoid a situation where spurious peaks create a maximum in the probability vector with an interval that does not match the current beat, the vector is updated in a novel way. By weighting each new note and taking previous note onsets into account, the probability vector of time intervals H(Δt) is updated at each note onset time t_k, with the N previous weighted intervals that lie within the allowed beat interval:

H(\Delta t) = W^{t_k - t_{k-1}} H(\Delta t) + \left(1 - W^{t_k - t_{k-1}}\right) \sum_{i=1}^{N} w_k \, w_{k-i} \, G(t_k - t_{k-i}, \Delta t),

where G is a Gaussian shape which is non-zero over a limited range centered around t_k - t_{k-i}, W is the time scaling factor that ensures a dynamic behavior of the beat probability vector, and w_k is the value of the HFC at time t_k. This model gives a strong indication of note boundaries at common intervals of the analysed music, which permits the identification of the current beat. An example of the beat probability vector is shown in Fig. 5 (bottom) together with an illustration of how it is created from the peaks of the HFC.
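A small sketch of this update rule, assuming NumPy, is shown below; the decay factor W, the window N and the Gaussian width are illustrative values, and the HFC values of the onsets are used as the weights w_k, as in the formula above.

```python
import numpy as np

def update_beat_vector(H, dt_grid, onset_times, onset_weights,
                       W=0.9, N=4, sigma=0.01):
    """One update of the beat probability vector H (defined over the interval
    grid dt_grid, in seconds) at the latest onset t_k: decay the old vector,
    then add Gaussian bumps at the intervals to the N previous onsets,
    weighted by the product of the corresponding HFC values."""
    assert len(onset_times) >= 2, "need at least two onsets"
    t_k, w_k = onset_times[-1], onset_weights[-1]
    decay = W ** (t_k - onset_times[-2])           # W^(t_k - t_{k-1})
    H = decay * H
    for i in range(1, min(N, len(onset_times) - 1) + 1):
        interval = t_k - onset_times[-1 - i]
        if dt_grid[0] <= interval <= dt_grid[-1]:  # allowed beat interval
            bump = np.exp(-0.5 * ((dt_grid - interval) / sigma) ** 2)
            H = H + (1 - decay) * w_k * onset_weights[-1 - i] * bump
    return H
```

The current beat interval can then be read off near the maximum of H, as illustrated in Fig. 5 (bottom).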
5 System Construction
The gesture tracking and analysis algorithms are implemented as EyesWeb blocks [25]. EyesWeb provides a suitable platform for framegrabber camera input and MIDI output, while serving as a highly adaptable environment for prototyping [3]. The MIDI output is directed to Mixxx [4], but it can be used in other systems; in particular, it is intended to be used in PatternPlay [26].

Mixxx is an open source digital DJ system, developed to perform interaction studies in relation to the DJ performance scenario. Mixxx emulates two turntables and a mixer by integrating two sound file players with functionality similar to that of an analogue mixer. Mixxx has been designed to be easily extendible through a modular sound processing system, and by a flexible mapping of control values to incoming MIDI messages. Mixxx includes an implementation of a phase vocoder for time scaling, described in Sect. 5.2, and audio beat extraction as described in Sect. 4. The coupling of the MIDI based timer events and the audio beats is described in Sect. 5.1.
5.1 Coupling
A certain latency of the system, measured from the moment the conductor marks a beat to the moment the corresponding beat in the music is heard, is unavoidable. The hardware and operating system latencies of audio playback are at best around 3 ms, but the camera and image streaming have larger latency. Depending on the total latency of the system and the music material, different approaches for synchronization can be taken.

In the event based approach the beat is played as soon as possible, thereby cutting or extending audio by performing interpolation over very short periods of time. This solution works well if the latency of the overall system is sufficiently low and the music consists of relatively sparse percussive instruments. The event based approach is similar to how sequences of MIDI files are played back in other baton control systems. If, however, the latency is large, this approach is not possible.

In the delayed approach the audio tempo is adjusted one beat behind the baton beats. This results in a delayed response, but it permits the playback of beats synchronously with the conductor’s beats. The audio is scaled according to the current tempo, and played back. If the conductor changes the tempo, the playback speed of the audio is scaled so that the next beat falls at the estimated point in time of the new baton beat. Because of latency, the audio tempo must overcompensate during one beat so that the next beat falls in synchronization. An illustration of the relative playback speed when the tempo is changed is shown in Fig. 6. This approach is described in more detail in [7].
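The core of the delayed approach can be summarised by a single rate computation, sketched below under the assumption that the next audio beat position and the predicted time of the conductor’s next beat are known; the clamping range and all names are illustrative.

```python
def delayed_coupling_rate(audio_pos_s, next_audio_beat_s,
                          now_s, predicted_gesture_beat_s,
                          min_rate=0.5, max_rate=2.0):
    """Choose the playback speed that makes the next audio beat arrive
    exactly at the conductor's predicted next beat. If the conductor has
    slowed down, more real time remains and the rate drops below 1.0
    (overcompensating for latency, as in Fig. 6); if the conductor has
    sped up, the rate rises above 1.0."""
    audio_left = next_audio_beat_s - audio_pos_s        # audio seconds to the beat
    real_left = predicted_gesture_beat_s - now_s        # real seconds until the beat
    if real_left <= 0:
        return max_rate                                 # already late: go as fast as allowed
    return min(max(audio_left / real_left, min_rate), max_rate)
```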
5.2 Time Scaling
Fig. 6. Example of relative playback speed adjustments over time in delayed latency handling. Because the conductor’s beat falls later than expected, the current audio beat is out of phase. To correct for this, the sound is played at an even lower speed so the next audio beat is in phase with the conductor’s new tempo.

The naïve approach of changing the speed of audio playback has the effect of also changing the pitch correspondingly, as with a record player (although this pitch alteration is highly undesirable for orchestral music, it is perfectly normal for turntablists). Different approaches exist to try to keep the pitch as it was. The approach used here, and implemented in Mixxx, is based on the phase vocoder [27]. It is a combination of time and frequency processing techniques. It works by processing the sound in blocks, whose sizes are determined by a non-overlapping sliding window of length R_a. Each block is passed through an FFT followed by an inverse FFT of a different size R_s. To form a smooth audio track from the resulting blocks, each block is divided with the same type of window as used in the analysis. Interpolation between consecutive blocks is performed by adding them together using a triangular window, as described in [28]. The overlap factor corresponds to the block size used in the Mixxx playback system. By controlling the scale factor R_a/R_s, the audio is stretched or compressed to match the conductor’s beat and tempo prediction messages received via MIDI.
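For illustration, a textbook phase-vocoder time stretch in the spirit of Dolson’s tutorial [27] is sketched below; it is not the Mixxx implementation (in particular it omits the triangular-window block interpolation of [28] and any window-overlap normalisation), and the FFT size and hop are arbitrary choices.

```python
import numpy as np

def phase_vocoder_stretch(x, rate, n_fft=2048, hop=512):
    """Time-stretch x while roughly preserving pitch. rate > 1 plays the
    material faster (shorter output), rate < 1 slower (longer output)."""
    win = np.hanning(n_fft)
    # Short-time Fourier transform with a fixed analysis hop.
    starts = np.arange(0, len(x) - n_fft, hop)
    stft = np.stack([np.fft.rfft(win * x[i:i + n_fft]) for i in starts], axis=1)

    # Read the analysis frames at fractional steps of size `rate`.
    steps = np.arange(0, stft.shape[1] - 1, rate)
    bin_adv = 2 * np.pi * np.arange(n_fft // 2 + 1) * hop / n_fft  # expected phase advance
    phase = np.angle(stft[:, 0])
    out = np.zeros(n_fft + hop * len(steps))

    for t, step in enumerate(steps):
        lo = int(step)
        frac = step - lo
        # Interpolated magnitude, accumulated synthesis phase, overlap-add.
        mag = (1 - frac) * np.abs(stft[:, lo]) + frac * np.abs(stft[:, lo + 1])
        out[t * hop:t * hop + n_fft] += win * np.fft.irfft(mag * np.exp(1j * phase))
        # Phase propagation: measured advance minus expected, wrapped to [-pi, pi].
        dphi = np.angle(stft[:, lo + 1]) - np.angle(stft[:, lo]) - bin_adv
        dphi -= 2 * np.pi * np.round(dphi / (2 * np.pi))
        phase += bin_adv + dphi
    return out
```

Here the ratio of the analysis step (rate times the hop) to the synthesis hop plays the role of the scale factor R_a/R_s described above.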
6 Conclusion
In this paper a complete system for the simulation of an orchestra’s reaction to a conductor’s baton is presented. It consists of a vision based baton tracking system, an automatic real-time audio beat estimation module, and a system for coupling the conducting beats with the audio beats by scaling the audio playback accordingly.

The baton tracking system successfully tracks the velocity of the tip of the baton using one or two cameras. The baton may be mapped to a 3D model to improve the stability of the estimated position. The system operates in two modes: an initial seek mode to retrieve a start position of the baton, and a subsequent tracking mode. The tip of the baton is subsequently used to estimate velocity and beat time points.

The audio beat estimation module is based on a computationally simple parameter. The peaks of the estimated parameter are used for identifying the
temporal location of the beats, and the errors are removed by weighting the estimated intervals and merging them into a running beat probability vector.

Two approaches for mapping baton beats to audio beats are used. If the visual and audio systems operate at low latencies, an event based coupling of the beats is used. Otherwise a delayed approach is employed. In this way, the vision based system corresponds to the musicians observing the conductor, the audio beat estimation module corresponds to the conductor listening to the orchestra, and the coupling and time-scaling modules correspond to the musicians performing the music. The visual tracking is freely available as EyesWeb modules from [25]. Mixxx is available from [29].

Further work includes refinement of the conducting gesture recognition algorithm and its inclusion into a more complete conducting system. To provide an improved orchestral simulation, expressive parameters other than beat could be included in the modelling. By extracting such information from the audio, or by providing it through manual annotation, it could be used in the rendering of the audio playback, thereby giving the conductor control of these other expressive parameters.
References

1. Rudolf, M.: The Grammar of Conducting: A Comprehensive Guide to Baton Technique and Interpretation. Third edn. Macmillan (1993)
2. Wanderley, M., Battier, M., eds.: Trends in Gestural Control of Music. IRCAM (2000)
3. Camurri, A., Hashimoto, S., Ricchetti, M., Trocca, R.: EyesWeb – towards gesture and affect recognition in dance/music interactive systems. Computer Music Journal 24 (2000) 57–69. www.musart.dist.unige.it/site_inglese/research/r_current/eyesweb.html
4. Andersen, T.H.: Mixxx: Towards novel DJ interfaces. In: New Interfaces for Musical Expression, Montreal, Canada (2003) 30–35
5. Boulanger, R., Mathews, M.: The 1997 Mathews radio-baton improvisation modes. In: Proceedings of the International Computer Music Conference, Thessaloniki, Greece, ICMA (1997) 395–398
6. Rich, R., Buchla, D.: Lightning II. Electronic Musician 12 (1996) 118–124
7. Borchers, J.O., Samminger, W., Mühlhäuser, M.: Engineering a realistic real-time conducting system for the audio/video rendering of a real orchestra. In: Fourth International Symposium on Multimedia Software Engineering, California, USA, IEEE MSE (2002)
8. Ilmonen, T., Takala, T.: Conductor following with artificial neural networks. In: Proceedings of the International Computer Music Conference, Beijing, China, ICMA (1999) 367–370
9. Marrin, T., Paradiso, J.: The digital baton: a versatile performance instrument. In: Proceedings of the International Computer Music Conference, Thessaloniki, Greece, ICMA (1997) 313–316
10. Marrin, T.: Inside the Conductor’s Jacket: analysis, interpretation and musical synthesis of expressive gesture. Ph.D. thesis, MIT (2000)
11. Carosi, P., Bertini, G.: The light baton: A system for conducting computer music performance. In: Proceedings of the International Computer Music Conference, San Francisco, CA, USA (1992) 73–76
12. Segen, J., Kumar, S., Gluckman, J.: Visual interface for conducting virtual orchestra. In: Proceedings of the International Conference on Pattern Recognition (ICPR). Volume 1, Barcelona, Spain, IEEE (2000) 1276–1279
13. Segen, J., Majumder, A., Gluckman, J.: Virtual dance and music conducted by a human conductor. In Gross, M., Hopgood, F.R.A., eds.: Eurographics. Volume 19(3), EACG (2000)
14. Gerver, R.: Conducting algorithms. WWW (2001) http://www.stanford.edu/~rgerver/conducting.htm
15. McNeill, D.: Hand and Mind: What Gestures Reveal About Thought. University of Chicago Press (1992)
16. Humphries, L.: What to think about when you conduct: Perception, language, and musical communication. WWW (2000) http://www.ThinkingApplied.com
17. Murphy, D.: Tracking a conductor’s baton. In Olsen, S., ed.: Proceedings of the 12th Danish Conference on Pattern Recognition and Image Analysis. Volume 2003/05 of DIKU report, Copenhagen, Denmark, DSAGM, HCØ Tryk (2003) 59–66
18. Canny, J.: A computational approach to edge detection. IEEE Transactions on Pattern Analysis and Machine Intelligence PAMI-8 (1986) 679–698
19. Murphy, D.: Extracting arm gestures for VR using EyesWeb. In Buyoli, C.L., Loureiro, R., eds.: Workshop on Current Research Directions in Computer Music, Barcelona, Spain, Audiovisual Institute, Pompeu Fabra University (2001) 55–60
20. Lucas, B., Kanade, T.: An iterative image registration technique with an application to stereo vision. In: Proceedings of the 7th International Joint Conference on Artificial Intelligence (IJCAI) (1981) 674–679
21. Jensen, K., Andersen, T.H.: Real-time beat estimation using feature selection. Volume 2771 of Lecture Notes in Computer Science, Springer-Verlag (2003)
22. Masri, P., Bateman, A.: Improved modelling of attack transients in music analysis-resynthesis. In: Proceedings of the International Computer Music Conference, Hong Kong (1996) 100–104
23. Desain, P.: A (de)composable theory of rhythm. Music Perception 9 (1992) 439–454
24. Jensen, K., Murphy, D.: Segmenting melodies into notes. In Olsen, S., ed.: Proceedings of the 10th Danish Conference on Pattern Recognition and Image Analysis. Volume 2001/04 of DIKU report, Copenhagen, Denmark, DSAGM, HCØ Tryk (2001) 115–119
25. Murphy, D.: Baton tracker. WWW (2002) Includes user guide. http://www.diku.dk/~declan/projects/baton-tracker.html
26. Murphy, D.: Pattern play. In Smaill, A., ed.: Additional Proceedings of the 2nd International Conference on Music and Artificial Intelligence. On-line technical report series of the Division of Informatics, University of Edinburgh, Scotland, UK (2002) http://dream.dai.ed.ac.uk/group/smaill/icmai/b06.pdf
27. Dolson, M.: The phase vocoder: A tutorial. Computer Music Journal 10 (1986) 14–27
28. Rodet, X., Depalle, P.: Spectral envelopes and inverse FFT synthesis. In: Proceedings of the 93rd AES Convention, San Francisco, USA (1992) Preprint 3393
29. Andersen, T.H., Andersen, K.H.: Mixxx. WWW (2003) http://mixxx.sourceforge.net/