The Automated Music Transcription Problem

Stephen W. Hainsworth† and Malcolm D. Macleod‡
† Cambridge University Engineering Department, Cambridge, CB2 1PZ, UK
‡ QinetiQ, Malvern, WR14 3PS, UK
[email protected], [email protected]

Abstract

Automated transcription of polyphonic music is a difficult task which has been attempted over the last 30 years by a number of researchers with only partial success. The problem is highly complex and is not simple even for humans. In fact, for humans transcription is a specialised task which can generally only be undertaken successfully after years of musical training. One facet of this paper concerns a study of 19 musicians who responded to a survey as to how they transcribed music. Differences were found, but a consistent pattern could be seen. The results are placed within the context of music psychology and the point is made that transcription does not sit well in this environment - transcription is in fact the breaking down of the overall musical scene to examine individual elements, which is not a normal listening task. After reviewing existing automated transcription methods, a scheme motivated by the human approach is proposed. This first examines more global musical features and uses these to build up a musical context to aid the resolution of more detailed and possibly ambiguous structures. While the details of the implementation of each individual process are beyond the scope of this paper, some issues surrounding it are also discussed.
1 Introduction
One of the "grand challenges" facing the computational music field is automated transcription (footnote 1) of complex polyphonic musical audio; a number of people have investigated this problem over what is now nearly 30 years, but a solution still seems remote. This is partly because music, by its very nature, is incredibly diverse: in style (compare Baroque counterpoint with late 1990s UK Garage music), sonic texture and performance characteristics there can be huge variations, to say nothing of the intrinsic properties of the recording medium and any subsequent compression system, which have a significant effect on the audio signal. These all have to be considered in a transcription system if it is to be successful.

Footnote 1: Transcription is sometimes called dictation by musicians; transcription can also be the process of copying music from one written format to another, possibly involving arrangement.

The intention of this paper is not to present a technical solution to this problem but to re-evaluate the situation as a conjunction of music psychology, music theory, engineering and computer science. Specifically, parallels with the methods employed by humans will be made. However, it is noted that transcription is a very specialised task which can only be undertaken after considerable musical training; as a result, some of the models for human cognition of music have more relevance than others. These issues are addressed and a proposal for an algorithmic approach to music transcription is described.

However, before discussing the philosophy of polyphonic transcription, it should first be defined carefully. A simple definition is that it is the process of seeking to determine all the notes sounded in a musical audio sample and notating these in a conventionally recognisable form. It is possible, however, to break the process down into two separate goals [10]. The aim of the first is to extract the pitches of notes and their timings from an audio signal, as well as identifying which instrument generated them. The second part is to represent this in a form which is understandable to musicians and which could be used to recreate the original audio. Implied in the latter process is the analysis of the original expressive performance, which has to be factored out of the representation in order to convert from notes to score. The first process can be thought of as moving from audio to a MIDI or piano-roll type representation, while the latter converts the piano-roll to a traditional score (or equivalent).

Section 2 of this paper will introduce some of the more influential methods previously tried for automated transcription. However, a main argument of this paper is that these all try to extract detailed information (i.e. the notes which are present) without first establishing the contextual information which human musicians build up to help resolve ambiguities or unclear passages. To investigate the processes used by human musicians when performing transcription, an informal study was undertaken, the results of which are presented in section 3. These are discussed in the context of current music psychology understanding in section 4, making a second point: that transcription is a very specialised task for humans and therefore maybe does not fit well within the context of generalised music perception. Next, section 5 introduces a potential model for the computer implementation of a transcription scheme. Insights as to the differences between the ideal human transcription method and the computer model are discussed in section 6, before conclusions are drawn in section 7.
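To make the two representations concrete, the following sketch (in Python; all type and function names are invented for illustration and are not taken from any cited system) shows a piano-roll style note event and a crude quantisation of it onto a beat grid, which is the kind of step needed to move from the first goal towards the second.

from dataclasses import dataclass

@dataclass
class NoteEvent:               # stage one: piano-roll / MIDI-like note
    onset_s: float             # performed onset time, seconds
    duration_s: float          # sounded length, seconds
    midi_pitch: int            # 60 = middle C
    instrument: str            # e.g. "piano"

@dataclass
class ScoreNote:               # stage two: score-like note, expressive timing removed
    bar: int
    beat: int                  # metrical position within the bar (1-based)
    duration_beats: float      # notated length, quantised to quarter-beats here
    midi_pitch: int

def quantise(note: NoteEvent, beat_times: list, beats_per_bar: int = 4) -> ScoreNote:
    """Snap a performed note onto a given beat grid (beat_times in seconds)."""
    idx = min(range(len(beat_times)), key=lambda i: abs(beat_times[i] - note.onset_s))
    beat_len = beat_times[1] - beat_times[0] if len(beat_times) > 1 else 0.5
    return ScoreNote(bar=idx // beats_per_bar + 1,
                     beat=idx % beats_per_bar + 1,
                     duration_beats=max(round(note.duration_s / beat_len * 4) / 4, 0.25),
                     midi_pitch=note.midi_pitch)

# e.g. a note played slightly after the third beat of the first bar, lasting about a beat
print(quantise(NoteEvent(1.02, 0.48, 67, "piano"), beat_times=[0.0, 0.5, 1.0, 1.5, 2.0]))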
2 Review of Transcription Models
Before discussing new ways to look at automated transcription, it is obviously beneficial to examine methods tried to date. In general, there have been three approaches to the problem - bottom-up heuristic solutions, blackboard approaches and model-based methods.
All of the earliest methods can be thought of as bottom-up methods, in that no higher-level information was utilised beyond basic principles of musical sound. Analysis of polyphonic musical audio can be traced back to the mid 1970s and the work of Moorer [48]. His methods were harmonic-comb related, with the 'greatest common harmonic' located at each step. Two-voice polyphony was the greatest considered, with many limitations (e.g. no vibrato or glissando, and no unison, octave or twelfth ratios between the voices). Contemporary with Moorer was Piszczalski [56], who presented an algorithm for predicting pitch from frequency ratios. Pairs of partials identified in a short-time spectrum were compared to find a potential harmonic ratio. If a simple ratio was found, the two were assigned tentative harmonic numbers as part of a series, and weightings were then given to an overall hypothesis. This was performed recursively until all pairs had been considered, and the most likely pitches were then calculated.

Following these early approaches, there have been a number of people who have worked upon transcription with essentially bottom-up systems; Maher [43], Klapuri [34, 35] and Virtanen [71] are just some of these. A complete literature review of all the research is beyond the scope of this article, but a comprehensive review can be found in [26].
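As a loose illustration of this kind of pairwise-ratio reasoning (not a reconstruction of Piszczalski's published algorithm; the tolerance and weighting choices below are arbitrary), one might score candidate fundamentals as follows:

from collections import defaultdict

def pitch_hypotheses(partials, max_harmonic=10, tol=0.03):
    """Score candidate f0s by finding pairs of partials in simple harmonic ratios."""
    scores = defaultdict(float)
    for i, fa in enumerate(partials):
        for fb in partials[i + 1:]:
            lo, hi = sorted((fa, fb))
            for m in range(1, max_harmonic):               # lo assumed to be harmonic m
                for n in range(m + 1, max_harmonic + 1):   # hi assumed to be harmonic n
                    f0 = lo / m
                    if abs(hi / n - f0) / f0 < tol:        # simple ratio found
                        scores[round(f0, 1)] += 1.0 / n    # weight low-order ratios more
    return sorted(scores.items(), key=lambda kv: -kv[1])

# e.g. partials of a 220 Hz tone plus one spurious peak
print(pitch_hypotheses([220.0, 440.0, 660.0, 880.0, 1234.0])[:3])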
2.1 Blackboard Methods
An enduringly popular approach over the last decade has been the blackboard method. Developed at the University of Massachusetts as a general signal processing tool, the Integrated Processing and Understanding of Signals (IPUS) blackboard scheme [41] was later applied specifically to computational auditory scene analysis [37]. The basic philosophy is to extract information from the (audio) signal with a frame-by-frame scheme using a series of signal processing algorithms (SPAs) and then to find the parameters that describe this signal, via the use of competing 'knowledge sources' and multi-level hypotheses. Thus, IPUS is a framework for combining bottom-up processing of the SPAs with top-down prior or global information (encoded in the knowledge sources) to choose between hypotheses (termed 'sources of uncertainty' or SOUs) at different levels of the blackboard. Also, the system has a dynamic element in that the processing parameters can be adapted by the higher levels to better resolve any ambiguities. This is accomplished through an interactive architecture of discrepancy detection, diagnosis and signal re-processing (see figure 1). While the IPUS scheme was never specifically applied to musical audio transcription, Martin [44], Godsmark [23] and various researchers at Queen Mary, University of London [2, 47, 57] have all used blackboard systems for musical transcription purposes.

Blackboard systems, on first glance, seem a good way of combining top-down priors with bottom-up information flow. However, they are heuristic in their approach and the performance is very dependent upon the design characteristics. They are also inflexible: there is little scope for resolving poor data, as they are deterministic in action and will always arrive at the same answer given a data set. Thus, in simple systems, they will probably arrive at the correct result, but with complex input such as polyphonic audio, it is unlikely that they will ever have the complexity required to resolve ambiguities.
Figure 1: The blackboard framework of IPUS [41].
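A highly simplified skeleton of such a blackboard control loop might look as follows (class and function names are invented for illustration; IPUS itself is considerably more elaborate, with explicit planning and differential diagnosis stages):

# Minimal blackboard skeleton: hypotheses live on the blackboard at several
# levels; knowledge sources propose or refine hypotheses; a control loop
# detects discrepancies and may request signal re-processing with adapted
# parameters. Names are illustrative, not taken from IPUS.
class Blackboard:
    def __init__(self, levels):
        self.hyps = {lvl: [] for lvl in levels}   # e.g. "partials", "notes", "chords"

    def post(self, level, hypothesis):
        self.hyps[level].append(hypothesis)

class KnowledgeSource:
    def applicable(self, bb): return False        # can this KS act on the current state?
    def act(self, bb): pass                       # add/refine/remove hypotheses

def run(blackboard, knowledge_sources, detect_discrepancy, reprocess, max_cycles=100):
    for _ in range(max_cycles):
        ks = next((k for k in knowledge_sources if k.applicable(blackboard)), None)
        if ks is None:
            break                                  # nothing left to do
        ks.act(blackboard)
        sou = detect_discrepancy(blackboard)       # a 'source of uncertainty', if any
        if sou is not None:
            reprocess(blackboard, sou)             # re-run an SPA with new parameters
    return blackboard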
2.2 Model-Based Algorithms
A more rigorous method for combining bottom-up processing with top-down prior expectation is the model-based approach. The designer explicitly sets down the structural relationships which are assumed to exist within the data and formulates the model with these in mind. An inference algorithm is then defined to process the signal and select the model parameters which best fit the data. Obviously, if the model does not reflect the content of the data, the process will fail. This is the single greatest criticism of model-based methods, but generally the model is simplified until it encompasses the majority of data cases it will meet. However, as pointed out in the introduction, music by its very nature will always have exceptions to the rule.

The earliest model-based methods of music processing are those of Kashino [33], who used a Bayesian formulation to integrate information from different levels. A three-tier hypothesis structure was created, with single frequency components at the lowest level, followed by notes and then chords at the top. Bottom-up and top-down processing was carried out using Bayes' rule to provide support for hypotheses from different levels, while knowledge sources also provided prior information.
A success rate of about 50% was given for the system (called OPTIMA), but the papers quoted here were very unclear on the exact method used for the Bayesian formulation. A later system designed by Kashino used a time-domain spectral template matching method [31, 32]. A Bayesian network was used to integrate musical prior information (likelihood of an interval occurring, similarity of a tone series to those of the model, and 'musical context') when choosing between implied hypotheses (though how a set of possible hypotheses was derived was not explained). Simple examples consisting of piano, violin and flute gave an accuracy rated at 89%.

Sterian [65] produced another model-based approach to polyphonic transcription. He used Pielemeier's modal transform [55] as a front end, the peaks of which were tracked through frames by Kalman filtering [3]. Following this, the tracks were associated into notes by Bayesian methods using grouping rules such as common onset and harmonicity. Multiple hypotheses were maintained via the multiple hypothesis tracking (MHT) algorithm [3], which provides a deterministic framework for hypothesis propagation and deletion. Several assumptions were made which simplified the calculation needed but may have affected the performance: firstly, tracks were assigned to one note only. This meant that shared harmonics were not considered as such (though a fix was included to allow for other notes to have power at the shared frequency). Also, the Kalman filtering stage assumed a fairly simple model for partial evolution, discounting vibrato, for instance (though this may also have been a consideration due to the use of the modal transform). Reasonable success was found with up to four-part brass ensembles (79.5% accuracy was given in [63] for 4-note polyphony).

Walmsley [73] used Markov chain Monte Carlo (MCMC) methods [21] to attempt the transcription task. Working directly with the audio signal, he used harmonic sinusoidal basis functions in a general linear model (GLM) environment, with an unknown number of notes and an unknown number of harmonics per note, to build a generative representation of the signal. The model was constructed initially on an unlinked frame-by-frame basis with no information carried between frames. This produced reasonable results, but with the inclusion of global parameters controlling the evolution of data over multiple frames, performance was improved. Godsill & Davy [22] continued this research and improved the model to take account of amplitude evolution within a single frame and also inharmonicity. Their simulations are computationally intensive but produce accurate results for traditionally difficult problems (e.g. fifths).

Recent work by Cemgil [10] has also covered model-based transcription. Cemgil's proposal used a generative model for producing a MIDI-type representation from a score, thereby including expressive performance characteristics. A signal model was then used to produce the waveform, given the MIDI. There are a large number of parameters to estimate in the overall model and Cemgil gives no methodology or results on the general estimation of these; however, a simple example was presented where a high proportion of the parameters were assumed to be known.
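The harmonic-basis idea underlying several of these models can be illustrated with a toy example. The sketch below fits harmonic sine and cosine amplitudes to one audio frame by ordinary least squares for a fixed candidate fundamental; the cited work instead treats the fundamentals, the number of notes and the number of harmonics as unknowns within a Bayesian model, so this conveys the flavour of the signal model only, not of the inference.

import numpy as np

def harmonic_fit(frame, fs, f0, n_harm=10):
    """Least-squares fit of harmonic sinusoids to a frame for a known candidate f0."""
    t = np.arange(len(frame)) / fs
    cols = []
    for h in range(1, n_harm + 1):
        cols.append(np.cos(2 * np.pi * h * f0 * t))
        cols.append(np.sin(2 * np.pi * h * f0 * t))
    G = np.column_stack(cols)                      # general linear model basis
    theta, *_ = np.linalg.lstsq(G, frame, rcond=None)
    residual = frame - G @ theta
    return theta, float(residual @ residual)       # amplitudes and residual energy

# Example: a synthetic 196 Hz (G3) tone gives a much lower residual at the
# true f0 than at a wrong candidate, so the residual can score candidate pitches.
fs = 8000
t = np.arange(2048) / fs
frame = sum((0.7 ** h) * np.sin(2 * np.pi * h * 196.0 * t) for h in range(1, 6))
_, r_true = harmonic_fit(frame, fs, 196.0)
_, r_false = harmonic_fit(frame, fs, 260.0)
print(r_true < r_false)   # expect True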
2.3 Discussion
On examination, most of the transcription methods described above start at the same place: they attempt to extract all the notes from the outset. This is akin to jumping in at the deep end, and it is remarkable that the level of success has been as high as it is. It is certainly not how trained human musicians approach a transcription task, and this motivates the discussion presented later in this article. Before moving on, however, some common themes and factors will be briefly elucidated.

Firstly, the terms "bottom-up" and "top-down" appeared several times; bottom-up generally refers to the flow of signal information upwards through a set of increasingly complex representations, each of which builds on the decisions made in the one below. Top-down flow is the inclusion of prior knowledge of the (musical) structures which are expected to be present. Finding a good way of combining both of these in a system is undoubtedly hard, and this is why many systems do not include more than very basic top-down principles of good continuity and basic harmonicity.

There are then several factors involved in the transcription problem which have affected all approaches and continue to have a bearing. The first is computing power: Moore's Law states that computing power doubles every eighteen months, which means that within even a few years, processing methods that were previously impractical become standard tools. This is evidenced in the current move towards stochastic Monte Carlo methods (e.g. [9, 14, 73]) compared with earlier, less computationally expensive algorithms. The second is the gradual improvement in low-level processing techniques; methods of extracting information from the time waveform or the frequency spectrum are always changing and advancing. A third factor is the steady accumulation of knowledge about the human brain and its cognition/processing of music.
3 Study of Human Music Transcription
As mentioned above, the computer-based methods for transcription are contrary to the approach taken by humans. No one transcribes anything but the simplest music in a single pass: even the story of Mozart transcribing Allegri's Miserere after one listening has now been shown to be apocryphal. Therefore, to explore the process by which humans transcribe music, an informal study was undertaken, mostly via an email questionnaire. Musicians who had undertaken transcription were asked to answer three questions (along with some background queries):

1. Briefly describe the process by which you tackle a transcription problem (specifically, think about the order in which you build up the mental representation of the music).

2. How do you attempt to transcribe individual "lines" which are not the dominant tune? (e.g. a sax riff or a cello sub-melody)
3. Do you think that you ultimately have an accurate representation of the notes, or is what you have transcribed an "approximation" which would sound very similar but has elements of your own "composition" in it?

Nineteen responses were gathered. By definition, all responders were highly trained musicians with many years of formal training. Five were under the age of 25, ten were between 25 and 30, while the remainder were 30 or older. Ten had studied music to at least degree level, while the others were all semi-professional musicians. Three responded that they had perfect pitch. Thirteen used a musical instrument as an aid to transcription, while the other six used just mental rehearsal. Only four used technology as an aid: one used a tape recorder to play back at half speed, while the others used Sibelius or similar for automated audio generation of the transcribed notes.

A big dividing line in the responses was found between those who were trying to achieve a faithful transcription of exactly what is on a recording (one respondent recently transcribed some of the Horowitz piano variations; another produced an accurate transcription of Nina Simone's recording of For All We Know) and others who were interested primarily in arranging a recording for possibly slightly different musical forces. The latter usually answered the third question with the response that they aimed for a faithful transcription of the important features of the piece (rhythmic, melodic and chordal), but that it would often contain elements of their own arranging. An extreme case of arrangement was one respondent who had transcribed a big band tune (Robbie Williams - Have You Met Miss Jones?) where it was deemed impossible to hear out all the details of each horn part. Chords were transcribed accurately, but the arrangement was arbitrary as to note-instrument assignments.

Despite the general nature of the questions and the differences between eventual aims, a consistent pattern emerges: the first thing that humans do is make a rough sketch of the piece, often not even on manuscript paper. The song is broken down into sections, repetitions are noted and maybe key phrases are marked with nothing more than contour lines. Often, next comes a chord scheme or possibly the bass line. Melody follows and then any counter-melodies, though the schemes employed by different people diverge somewhat by this point. To the second question, many reported that they heard out inner harmonies or counter-melodies by repeated listening, building up a mental representation and then playing this on an instrument, sometimes along with the original audio. Many reported that this process was often highly informed by context garnered from the chords and other previously transcribed lines, as well as musical knowledge regarding "what notes should be present."

Implicit in even the early steps of transcription are a series of stages which human musicians take for granted as being almost trivial; beat tracking is a prime example of this. Style detection (informing many later decisions with heavy prior expectations) is another, as is instrument identification within the recording.
A few of these low-level tasks have been tackled with computer algorithms and some success has been achieved, but instrument identification in polyphonic spectra is a prime example of one which has not even been attempted.
4 Discussion of Human Transcription Schemes
So how do the results of this survey fit in with established music psychology? Firstly, it is worth bearing in mind that transcription is a highly specialised task which is only possible after a number of years of training; does it even fit in with the generally accepted principles of music perception? The ability to delve into a complex sonic spectrum and hear individual parts is something which untrained listeners are often unable to do (footnote 2).

Footnote 2: Indeed, an informal listening test where a non-musician was asked to identify the instruments present in a rock ensemble resulted in the detection of the prominent instruments; however, the bass was mentioned only "because I know there's probably one there" and the Hammond organ was missed entirely.

A very important issue is that of genre: a musician familiar with a given genre will generally be able to produce a transcription more quickly and accurately. However, there is an important point in that writing down a score using traditional music notation conventions implicitly involves converting a genre from its native medium to Western traditional values; this may not be a sufficiently rich language to represent some styles of music (microtonal Indian music, for instance). Despite these problems, the focus in this article will remain deliberately broad and will try not to consider just 4-part Bach chorales!
4.1 Music Psychology Research
One branch of music psychology which is very relevant to the problem of transcription is that of music structure theory. After Schenkerian analysis, which can probably be thought of as the first real attempt at a psychoacoustic explanation of the musical listening experience, the most influential is Lerdahl and Jackendoff's Generative Theory of Tonal Music (GTTM) [40]. This is a rule-based approach where the musical input is parsed in an analogous manner to the generative linguistics of Chomsky [11]. There are unbreakable 'well-formedness' rules and flexible 'preference' rules which are used to model four descriptions of the music listening experience: 1) grouping structure, a hierarchical segmentation into sections/phrases; 2) rhythm/metrical structure; 3) time span reduction, which is a way of drawing out the most important events and linking them together; and 4) prolongation reduction, which models the longer-term structure and the ebb and flow of musical tension. The four 'domains' are linked and together attempt to explain, in a hierarchical and reductionist fashion, some of the perceptions of a trained musical listener versed in Western tonal music.

One successor to the GTTM is Deliège [15], who proposed a 'cue abstraction' process whereby listeners extract cues commensurate with their level of knowledge and use these to build up a global structure representation. She produced a set of rules to this effect and tested them on several examples. Another is Temperley [67], who took a more computational standpoint and implemented a number of systematic rules on computer to account for such experiences as metrical structure, pitch spelling, tonal centre and melodic phrase structure. Meredith [46] pointed out that the use of a MIDI input to the program is somewhat at odds with Temperley's stated aim of providing a cognitive account of the musical listening process, but nevertheless the ideas and implementation are useful. Indeed, this could be a criticism of many of the cognitive models - the input is often a symbolic music representation, which presumes that a (trained) listener is able to implicitly transcribe the audio before parsing its structure. An alternative implication is that what is being analysed is the music structure, rather than the perception of this (though this is not a bad thing in its own right). Cook [12] terms this charge 'theorism', in that many cognition models rely too much on accepted music theory. Scheirer [59] also makes a similar point regarding the use of notated music as a starting point for cognitive research. Narmour [49, 50] presented an approach which accounts for some of this criticism; his approach is based upon relative relations and builds models of perceptual 'closure' based upon combinations and chains of note relations. This is perceptually more plausible. Parncutt [53] also produced a model biased towards cognition and away from structural concerns. Based on Terhardt's pitch perception model [69], Parncutt formulated a theory for the 'stability' of chord percepts in terms of their roots; a non-inverted major chord is thus very stable, while others such as the Tristan chord are not. He went on to account for the listening process being a constant flux of chords with varying stability.

Another approach is that of Bregman [4], who was very much influenced by the Gestalt principles proposed in vision processing. Gestalt theory is a bottom-up process whereby elements are grouped into wholes by rules such as proximity, continuity and similarity. In vision this is a natural representation but, as Bregman asked, "What about hearing?". In the temporal domain this is not an easy question to answer, so attention was turned to the time-frequency domain as represented by the spectrogram. There are still problems, as components from different streams (as Bregman termed musical sources) can combine in the spectrogram at the same location, thereby making analysis very difficult. Bregman postulated a great number of rules and described experiments to test them, by means of which he explained a number of simple auditory experiences. He termed the framework auditory scene analysis. Several researchers later attempted to produce working computer implementations of this, calling it computational auditory scene analysis: Ellis's [18] focus was upon general auditory scenes (e.g. a city street), while Brown [5] concentrated more upon speech and musical examples. Slaney [60] criticises what he terms 'pure audition', i.e. bottom-up cognitive models where there is little top-down flow of information, citing Brown [5] as an example of this. Slaney prefers an interactive system, an example of which is the dynamic attending model of Jones [30].
Here the listener makes implicit predictions about what they expect to hear next, which are then validated or contradicted by the sensory input. Wrigley [75] also examined interactive listening, but from a computational standpoint, working on auditory attention. These approaches are summarised in figure 2 but form a mere snapshot of music psychology. However, the question of how relevant they are to human transcription remains.

Figure 2: Diagrammatic representation of a snapshot of cognition research, placing the cited authors along axes of structure, computation and cognition.
5 Model for Automated Polyphonic Transcription
Having discussed some of the issues surrounding previous approaches to transcription with computers and examined how humans approach a similar task, this section proposes a novel potential scheme for the former motivated by the latter. This proposed methodology suggests the use of global music contextual information to aid subsequent processing of successively more detailed structures. Thus, the fundamental difficulty of collectively modelling the whole transcription problem is avoided by building up the representation from simpler, more global structures.
Figure 3: Proposed scheme for automated polyphonic transcription. The layers run from global tasks (style detection; beat tracking; approximate instrument ID), through more detailed tasks (structural representation; chords/tonal context; detailed phrase-by-phrase analysis of individual lines - melody, bass, sub-melodies), to further tasks (performance analysis; post-processing/score generation; arrangement).
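A minimal sketch of how the layered scheme of figure 3 might be organised in software is given below; the stage names follow the figure, the function bodies are placeholders, and the whole thing is illustrative rather than a proposed implementation (in particular, the feedback paths discussed in the next paragraph are omitted).

# Each stage reads the audio plus the context accumulated by the stages below
# it and adds its own findings to that context. Names are illustrative only.
def detect_style(audio, ctx):         ctx["style"] = "unknown"   # broad genre only
def track_beats(audio, ctx):          ctx["beats"] = []          # beat/bar grid (seconds)
def identify_instruments(audio, ctx): ctx["instruments"] = []    # approximate ID
def sketch_structure(audio, ctx):     ctx["sections"] = []       # e.g. verse/chorus, repeats
def detect_chords(audio, ctx):        ctx["chords"] = []         # chord per beat or bar
def transcribe_lines(audio, ctx):     ctx["notes"] = []          # melody, bass, inner lines
def post_process(audio, ctx):         ctx["score"] = None        # spelling, notation, layout

PIPELINE = [detect_style, track_beats, identify_instruments,     # global tasks
            sketch_structure, detect_chords,                     # more detailed tasks
            transcribe_lines, post_process]                      # further tasks

def transcribe(audio):
    ctx = {}                          # contextual information flowing up the model
    for stage in PIPELINE:
        stage(audio, ctx)             # ideally, later stages would also feed corrections back
    return ctx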
The main features of the model are that it is a multiple-sweep system and that, while signal information is fed to each individual process, there is a flow of musical contextual information up through the algorithm. This is motivated by the study of human transcription, which revealed that humans also use a multiple-sweep process, building up contextual information before embarking upon the inference of complex and/or obscured musical constructs. Also, it should be noted at the outset that many of the later, more involved processes can provide information which aids refinement of decisions at the more generic levels; thus a system of feeding back information and adapting earlier decisions in the light of this new data should ideally be included.

Figure 3 gives a graphical representation of the model stages. At the bottom of the diagram are generic, global processes, generally performed intrinsically by humans with little thought or effort. The further up, the harder the task and the more musical knowledge and skill is required. Within each task, there is scope for a wide variety of individual approaches and this proposal does not go into detail as to how each sub-problem should be tackled; this is an extensive task which will be left for future work.

To illustrate the contextual information flow, style identification, which is placed at the bottom of the model, does not require accurate beat estimation or chord recognition for adequate performance. Indeed, a study by Perrot & Gjerdingen [54] found that humans were able to make a reliable judgement of style from just 0.25 s of musical audio. Beat tracking, which is represented as one level above style detection, can function without knowledge of the genre, but performance should be improved by including this information in an algorithm (footnote 3). The following paragraphs describe the various modules and cite researchers who have investigated the individual areas.

Footnote 3: Consider a beat tracker designed for dance music which is then applied to choral music: there is very little likelihood it will function well.

It is proposed that the first stage is style detection. At this level, style (or genre) refers to little more than a broad categorisation as to whether the piece is a pop song with drums, classical chamber music, jazz, etc. The more specific the classification, the more accurately the subsequent processes can be tuned. For instance, there is no point in looking for electric guitar in Renaissance chamber music; less obviously, it is also senseless to look for a piano. Tzanetakis & Cook [70] described a study of genre classification. If full genre analysis is possible, so much the better, though it is speculated that accurate automated genre analysis requires more detailed inference (footnote 4).

Footnote 4: An example of this is Dixon [17], who uses patterns detected by a beat tracker to attempt highly specific recognition of style in ballroom dance music.

Following genre analysis are instrument identification and beat tracking. The former is useful for many later processes, most obviously in separating harmonic spectra (footnote 5). To date, research has been centred on single monophonic note examples [28, 45], and the task of identifying instruments in rich polyphonic spectra has not been attempted; this would seem a promising though challenging area for research.

Footnote 5: Klapuri [34] and Eronen [19] argue for instrument models, while Kashino [32] instead uses adaptive templates.
Beat tracking, on the other hand, is the single most important 'intrinsic' task within a general framework for musical audio understanding. Given that most events occur 'on the beat', if a rhythmic framework is in place then this can be a vital aid to later processing. In fact, music psychology methods such as the GTTM [40] implicitly require a metrical framework to be established before some of the rules can be applied. Examples of beat tracking research are Goto et al. [24], Laroche [39], Cemgil et al. [9], Klapuri [36] and Hainsworth & Macleod [27].

After the above levels come the first which require some musical training. A structural representation is defined to be a sketch of the musical form (e.g. intro, verse, chorus, etc., or exposition, development, recapitulation, coda) showing the number of bars and any repetitions. The chord structure is a low-level form of transcription but has received little attention (footnote 6). It is also the first level where the traditional rules of musical harmony can truly be brought to bear: Parncutt [53] and the GTTM [40] both have chord development ideas within their analysis schemata. Another illustration of the contextual flow is that chord detection can be made significantly more robust if beats and bars have already been correctly marked - the chord is usually constant for at least a whole beat, often for whole bars or more, and is most likely to change at bar lines.

Footnote 6: Examples of work in this area are [51, 66].

The main process of transcription is then described by a complex series of interacting processes. Each line is analysed separately and the pitch and rhythm notated. Selective attention, the ability to focus upon individual event streams within the overall sensory stimulus (as investigated in a computational framework by Wrigley [75]), could be useful here. A similar idea in an engineering paradigm is independent component analysis (ICA), where a mathematical separation of a number of sources is considered. For ICA the sources must be independent [8], but in music this is not the case. Some investigations into ICA for music are given in [1, 61], and in the more usual speech separation context by Davies [13]. Also important for separating spectra are instrument models informed by the earlier decisions on which instruments are present. The chordal context is of much use for many styles (though maybe not with all, e.g. jazz), as is the rhythmic framework (though, of course, the circular argument can be applied that chord changes are needed to fully describe the rhythmic framework in many cases). The models proposed in the cognitive/music-structural research fields are also likely to be helpful in resolving ambiguities. It is noted once again that many of the music structural models require a rhythmic framework to be present before they can be applied, which is another example of global information being required to aid more detailed processing.

Finally, there are post-processing issues such as pitch spelling [7] and other notational issues [6], the issue of resolving performance characteristics (dynamics, tone, etc.) [58] and the arranging of the original performance for different musical forces.
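As an illustration of how the rhythmic context can stabilise chord detection, the sketch below pools a chroma (pitch-class energy) representation over each beat before matching chord templates. The chroma frames and beat times are assumed to be supplied by earlier stages, and the template set is limited to major and minor triads purely for brevity; this is an illustrative sketch, not a specific published method.

import numpy as np

def triad_templates():
    """12-dimensional binary templates for major and minor triads, one per root."""
    temps = {}
    for root in range(12):
        for name, intervals in (("maj", (0, 4, 7)), ("min", (0, 3, 7))):
            v = np.zeros(12)
            v[[(root + i) % 12 for i in intervals]] = 1.0
            temps[(root, name)] = v / np.linalg.norm(v)
    return temps

def chords_per_beat(chroma_frames, frame_times, beat_times):
    """chroma_frames: (n_frames, 12) array; frame_times: (n_frames,) array of seconds."""
    temps = triad_templates()
    labels = []
    for b0, b1 in zip(beat_times[:-1], beat_times[1:]):
        sel = (frame_times >= b0) & (frame_times < b1)
        if not np.any(sel):
            labels.append(None)
            continue
        pooled = chroma_frames[sel].mean(axis=0)            # average over the whole beat
        pooled /= (np.linalg.norm(pooled) + 1e-9)
        labels.append(max(temps, key=lambda k: float(pooled @ temps[k])))
    return labels   # one (root, quality) label per beat; smoothing across bars could follow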
6 Discussion of Model
There is no doubt that automated transcription is hard: it takes humans many years of musical training and experience to be able to delve into and analyse a complex musical structure. Teaching this to a computer will no doubt take as much effort.

There are some important caveats to the above discussion. The first and most obvious is that the contextual information flowing up through the algorithm must be correct for it to be of any use. This is akin to Klapuri's statement [34, p68], in the context of top-down/bottom-up processing, that "Without being able to reliably extract information at the lowest level, no amount of higher level processing is going to resolve a musical signal." The second is that information also ideally has to flow down from detailed processes to aid the refining of decisions at more global levels; the architecture should therefore be interacting and capable of adaptation (footnote 7). The final caveat relates to the fact that the system performance is limited by the models and methods used to implement individual processes; if a model is poor, the algorithm will not function well and, conversely, if the data is highly unusual, operation will be unsatisfactory.

Footnote 7: Sterian & Wakefield [64] actually argue for the opposite of this, where an isolated black-box approach is taken to facilitate the development of individual systems.

The issue of style-specific processing was mooted above and deserves further discussion. It is the contention here that general, all-encompassing models are preferable to highly genre-specific ones; a generic model can then be 'tweaked' to improve performance with examples from individual styles if a style analysis has been undertaken. However, it is unlikely that a completely generic model which copes with all styles will be found - genre knowledge will probably have to be included to make the problems tractable.
6.1 Relationship to Music Psychology
As has been explained above, transcription is not an intrinsic listening skill but one which is learned over time by human musicians. Some of the research avenues discussed in section 4 are therefore more relevant than others. The GTTM of Lerdahl & Jackendoff [40] is a prime example of pertinent research because it attempts to "account for the musical intuitions of a listener versed in a particular idiom" [29], and the fundamental rules they expound should form a good encoding of some of the prior knowledge available to musicians. On the other hand, Narmour's research [49, 50] is not particularly relevant to transcription because it focuses more on the basic listening experience, "capturing the everyday experience of competent music listeners" [50, p278]. Similarly, Krumhansl's tone profiles [38], for finding which notes in the chromatic scale perceptually fit best into a given tonal context, are less likely to be useful than statistical studies of the intervallic relations in music (e.g. Fucks [20] or Ortmann [52]).
Another point is the potential use of auditory models as a front end for a transcription system; in order to be psychologically plausible, these sometimes discard information (e.g. at high frequencies) or combine it into sub-optimal representations. An example of this is constant-Q analysis, where the width of the frequency bands increases with centre frequency. When it is considered that the harmonics of a note are linearly spaced in frequency, it is clear that constant-Q methods will blur the higher harmonic information; it can be shown that simple Fourier methods outperform constant-Q approaches for frequency resolution [72]. Thus psychologically plausible methods are not necessarily the best approach when the aim is to produce a working system - "the more information the better" is a good rule here.

A similar point can be considered by looking at the terms of Terhardt [68], who defined three representations of music: symbolic (the score or equivalent), acoustic (the sound waveform and any direct observation of it) and auditory (perceptions induced in the mind of a listener). An engineering-oriented computer transcription system aims to convert from the second of these to the first, but not necessarily via the third. Artificial intelligence would be more interested in inferring the auditory perceptions from the acoustic waveform, which is a separate task.
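A small numerical example of the resolution point (using the common constant-Q relation Q = 1/(2^(1/b) - 1) for b bins per octave; the quarter-tone spacing and the choice of a low bass note are illustrative assumptions):

# With quarter-tone spacing, Q is roughly 34, i.e. each band is about 3% of its
# centre frequency wide. The harmonics of a note, by contrast, are spaced
# linearly at f0 Hz, so above roughly the Q-th harmonic adjacent harmonics
# fall into the same band.
Q = 1 / (2 ** (1 / 24) - 1)          # ~34 for quarter-tone resolution
f0 = 55.0                            # a low bass note (A1)
for h in (5, 20, 40):
    fc = h * f0                      # frequency of the h-th harmonic
    bw = fc / Q                      # constant-Q bandwidth at that frequency
    merged = "merged with neighbours" if bw > f0 else "still resolved"
    print(f"harmonic {h:2d}: {fc:6.0f} Hz, band ~{bw:4.1f} Hz wide -> {merged}")
# A plain DFT with a fixed analysis window keeps the same (linear) resolution
# at all frequencies, so it does not lose the upper harmonics in this way.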
6.2 The Limitations of the Computer
There are a number of areas where a computer model of the transcription process might well diverge from the human ideal; these are often due to the limitations of the computer when compared to the evolved complexity of the human mind. A prime example of this is the ordering of melody perception and chordal context within the above model. A human listener will usually perceive the melody first and foremost and, when considering a 4-part chorale, this is one of the first things a human will extract. Indeed, in this situation it can be argued that chords are simply a function of the notes sounding at any one time and can often be described by just the melody and bass. Why, then, the positioning of chordal context first in the scheme? There are two arguments, the first of which is based around a more general approach to genre - in rock and pop, chords play a much more significant structural role within the music. Similarly in jazz, the chord progression is often all that maintains the form of the piece. In both these situations it is helpful to maintain a chord progression to aid the analysis of individual streams (footnote 8). The second consideration is algorithmic - harmonics of individual notes coincide with each other, and many of the methods employed to examine the frequency structure of musical audio (the Fourier transform, autocorrelation, etc.) are unable to distinguish the individual harmonics. However, it seems highly likely that the pattern they form can be recognised as one of a small number of basic chords. Figure 4 gives an illustration of this.

Footnote 8: Indeed, sometimes it is impossible to hear which individual notes are being played on instruments such as a guitar or piano, and only the chord and a rough guess at the particular voicing can be made.
Figure 4: Illustration of the chord detection principle. Without knowing the exact notes present, it is possible to construct a chord template and match it to the actual spectrum. In each plot, the upper, solid line is the spectrum under consideration (a G major chord) while the lower, dashed line is a constructed chord template, made simply by convolving the harmonics of notes with a Gaussian function and summing these. The top plot shows a G major chord template while the lower shows a C major chord template. It can clearly be seen that the G major template matches better, despite the fact that the notes used in the template are not identical to those in the original spectrum.
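A sketch of the template-matching idea behind figure 4 is given below. The number of harmonics, the harmonic roll-off and the Gaussian width are arbitrary illustrative choices and are not those used to generate the figure; the test simply checks that a G major template correlates better than a C major one with a G major spectrum built from slightly different notes.

import numpy as np

def chord_template(midi_notes, freqs, n_harm=8, width_hz=15.0):
    """Sum of Gaussian-smoothed harmonics of each chord note on a frequency grid."""
    spec = np.zeros_like(freqs)
    for m in midi_notes:
        f0 = 440.0 * 2 ** ((m - 69) / 12)           # MIDI note number -> fundamental (Hz)
        for h in range(1, n_harm + 1):
            spec += (0.8 ** h) * np.exp(-0.5 * ((freqs - h * f0) / width_hz) ** 2)
    return spec / np.linalg.norm(spec)

def match(spectrum, template):
    """Normalised correlation between an observed magnitude spectrum and a template."""
    s = spectrum / (np.linalg.norm(spectrum) + 1e-12)
    return float(s @ template)

# Observed spectrum: a G major chord built the same way, but from a different
# voicing than either template uses.
freqs = np.linspace(0, 2000, 4000)
observed = chord_template([55, 59, 62, 67], freqs)          # G3, B3, D4, G4
g_major = chord_template([43, 55, 59, 62], freqs)           # G2, G3, B3, D4
c_major = chord_template([48, 52, 55, 60], freqs)           # C3, E3, G3, C4
print(match(observed, g_major) > match(observed, c_major))  # expect True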
Melody perception provides another interesting algorithmic issue: humans are very able to hear the top note in an ensemble, whereas a computer has difficulties in doing so because the harmonics of the lower parts obscure those of the melody. Goto [25] produced a successful algorithm for extracting the melody and bass lines from musical audio, which showed that it is possible to do this. However, one thing is certain - the human transcriber is constantly refining his or her decisions on the basis of each new piece of musical information they extract.

One area in which the computer has an advantage is pitch extraction. The harmonics in the audio signal can be labelled as having an identifiable frequency. Humans have more problems with this, and the categorisation of tonal sensation is an active area of research in psychology [16]. Of course, there are a small number of musicians who have perfect pitch, and this is often a great aid in transcription; many musicians also have an ingrained sense of relative pitch. These are not normal listening activities and again motivate the use of music structural research models (e.g. the GTTM [40]) over those focusing on psychoacoustic principles (e.g. Narmour [50]).
7 Conclusions
Transcription is undoubtedly a very hard process and the existing methods only scratch the surface of human ability. This paper has discussed some of these approaches and also mentioned some of the large body of relevant literature in music psychology. A small-scale study into human transcription processes was described, the conclusion from which was that musicians treat a number of fundamental processes as trivial (even though they are not trivial for computers) and then build up the mental representation of the music in layers of complexity. Crucially, contextual information is gathered and used to aid the more in-depth analysis of exactly which notes are present. Depending on the aim and the style of music being transcribed, methods obviously vary in the details.

The problem is then how to implement this on a computer. The assumption made in this article is that one is interested in producing a working system which does the best possible job, rather than in directly modelling the human cognition process. The latter will probably ultimately result in a better system, but it will be a good number of years before AI is sufficiently advanced for this to be a reality. Thus, a scheme was proposed in section 5 which accounts for some of the elements of human transcription; specifically, contextual information is accreted through increasing layers of complexity. The inference processes (such as genre analysis, beat tracking, etc.) which were often not mentioned by musicians who responded to the survey (presumably because they are trivially easy for humans and are undertaken instinctively) come first, before the information gleaned from these is used as a prior source of information for later, more involved processes.
It was also pointed out that prior musical information (in this case, pertaining more to the actual structural relationships present than to how they are perceived by listeners) would need to be included in the algorithms for accurate analysis to be possible. This is a non-trivial task in itself.

So what use is a transcription algorithm? Given the stated aim of producing a working system, it may not shed much light upon the reverse problem of understanding music perception. However, as pointed out, transcription is not a normal operation in the listening process but rather a specialised function. The basic components of music perception are used, but the locks which normally bind them together are broken and the individual atomic elements of music are discriminated and analysed. Perhaps an algorithmic transcription system might have something to contribute to the understanding of this. More immediately, a practical application is in ethnomusicology, where transcriptions of music passed on by aural tradition are desired, and in jazz, where transcription of solos has long been used as a learning aid [74]. While the exact mechanics of the automated transcription system components are beyond the scope of this paper, it is hoped that this paper will serve to link some disparate ideas together and inspire further work.
References

[1] S. Abdallah. Towards Music Perception by Redundancy Reduction and Unsupervised Learning in Probabilistic Models. PhD thesis, Dept. E.E., King's College, London, 2003.
[2] J.P. Bello Correa. Towards the Automated Analysis of Simple Polyphonic Music: A Knowledge Based Approach. PhD thesis, Queen Mary, University of London, January 2003.
[3] S.S. Blackman and R. Popoli. Design and Analysis of Modern Tracking Systems. Artech House, 1999.
[4] A.S. Bregman. Auditory scene analysis. In McAdams and Bigand, editors, Thinking in Sound, chapter 2, pages 10–36. OUP, 1993.
[5] G.J. Brown and M. Cooke. Perceptual grouping of musical sounds - a computational model. J. of New Music Research, 23(2):107–132, 1994.
[6] E. Cambouropoulos. From MIDI to traditional musical notation. In Proc. AAAI Workshop on Artificial Intelligence and Music, 2000.
[7] E. Cambouropoulos. Automatic pitch spelling: from numbers to sharps and flats. In Proc. 8th Brazilian Symposium on Computer Music, 2001.
[8] J-F. Cardoso. Blind signal separation: statistical principles. Proc. IEEE, 9(10):2009–25, October 1998.
[9] A.T. Cemgil and B. Kappen. Monte Carlo methods for tempo tracking and rhythm quantization. J. Artificial Intelligence Research, 18:45–81, 2003.
[10] A.T. Cemgil, B. Kappen, and D. Barber. Generative model based polyphonic music transcription. In IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, New Paltz, NJ, 2003.
[11] N. Chomsky. Logical Structures of Linguistic Theory. PhD thesis, MIT, 1955.
[12] N. Cook. Perception: a perspective from music theory. In R. Aiello and J. Sloboda, editors, Musical Perceptions. OUP, 1994.
[13] M. Davies. Audio source separation. In Mathematics in Signal Processing V. OUP, 2001.
[14] M. Davy and S.J. Godsill. Bayesian harmonic models for musical pitch estimation and analysis. Technical Report CUED/F-INFENG/TR.431, CUED, November 2002.
[15] I. Deliège, M. Mélen, D. Stammers, and I. Cross. Musical schemata in real-time listening to a piece of music. Music Perception, 14(2):117–60, Winter 1996.
[16] D. Deutsch, editor. The Psychology of Music. Academic Press, 2nd edition, 1999.
[17] S. Dixon. Classification of dance music by periodicity pattern. In Proc. ISMIR, 2003.
[18] D.P.W. Ellis. Prediction driven computational auditory scene analysis. PhD thesis, Media Laboratory, MIT, June 1996.
[19] A. Eronen and A. Klapuri. Musical instrument recognition using cepstral coefficients and temporal features. In Proc. ICASSP, 2000.
[20] W. Fucks. Mathematical analysis of formal structure of music. IRE Trans. on Information Theory, pages 224–228, 1962.
[21] W.R. Gilks, S. Richardson, and D.J. Spiegelhalter. Markov Chain Monte Carlo in Practice. Chapman and Hall, 1996.
[22] S.J. Godsill and M. Davy. Bayesian harmonic models for musical signal analysis. In 7th Valencia Int. Meeting on Bayesian Statistics, 2002.
[23] D. Godsmark and G.J. Brown. A blackboard architecture for computational auditory scene analysis. Speech Communication, 27:351–366, 1999.
[24] M. Goto. An audio-based real-time beat tracking system for music with or without drum-sounds. J. of New Music Research, 30(2):159–71, 2001.
[25] M. Goto and S. Hayamizu. A real-time music scene description system: detecting melody and bass lines in audio signals. In Proc. IJCAI Workshop on CASA, pages 31–40, 1999.
[26] S.W. Hainsworth. Analysis of musical audio for polyphonic transcription. 1st Year Ph.D. Report; available online at http://www-sigproc.eng.cam.ac.uk/~swh21, August 2001.
[27] S.W. Hainsworth and M.D. Macleod. Beat tracking with particle filtering algorithms. In IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, Mohonk, NY, 2003.
[28] P. Herrera, X. Amatriain, E. Batlle, and X. Serra. Towards instrument segmentation for music content description: a critical review of instrument classification techniques. In Proc. Int. Symp. on Music Information Retrieval, 2000.
[29] R. Jackendoff and F. Lerdahl. A grammatical parallel between music and language. In M. Clynes, editor, Music, Mind, and Meaning, chapter 5, pages 83–117. Plenum Press, 1982.
[30] M.R. Jones and M. Boltz. Dynamic attending and responses to time. Psychological Review, 96(3):459–91, 1989.
[31] K. Kashino and H. Murase. Music recognition using note transition context. In Proc. ICASSP, volume VI, pages 3593–6, 1998.
[32] K. Kashino and H. Murase. A sound source identification system for ensemble music based on template matching and music stream extraction. Speech Communication, 27(3-4):337–49, 1999.
[33] K. Kashino, K. Nakadai, T. Kinoshita, and H. Tanaka. Organisation of hierarchical perceptual sounds. In Proc. IJCAI Workshop on CASA, volume 1, pages 158–64, 1995.
[34] A. Klapuri. Automatic Transcription of Music. Master's thesis, Audio Research Group, University of Tampere, Finland, 1998.
[35] A. Klapuri. Pitch estimation using multiple independent time-frequency windows. In IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, pages 115–8, 1999.
[36] A. Klapuri. Musical meter estimation and music transcription. In Proc. Cambridge Music Processing Colloquium, pages 40–45, 2003.
[37] F. Klassner, V. Lesser, and H. Nawab. The IPUS blackboard architecture as a framework for computational auditory scene analysis. In Proc. IJCAI Workshop on CASA, 1995.
[38] C.L. Krumhansl. Cognitive Foundations of Musical Pitch. Oxford University Press, 1990.
[39] J. Laroche. Estimating tempo, swing and beat locations in audio recordings. In IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, pages 135–8, 2001.
[40] F. Lerdahl and R. Jackendoff. A Generative Theory of Tonal Music. MIT Press, 1983.
[41] V. Lesser, H. Nawab, I. Gallastegi, and F. Klassner. IPUS: an architecture for integrated signal processing and signal interpretation in complex environments. In Proc. AAAI, pages 249–55, 1993.
[42] H.C. Longuet-Higgins. Artificial intelligence and musical cognition. Phil. Trans. R. Soc. Lond. A., 349:103–13, 1994.
[43] R.C. Maher. Fundamental frequency estimation of musical signals using a two-way mismatch procedure. J. Acoust. Soc. Am., 95(4):2254–63, April 1994.
[44] K.D. Martin. A blackboard system for automatic transcription of simple polyphonic music. Technical Report TR.385, Media Laboratory, MIT, 1996.
[45] K.D. Martin. Sound Source Recognition: a Theory and Computational Model. PhD thesis, Media Lab, MIT, June 1999.
[46] D. Meredith. A review of Temperley "The Cognition of Basic Musical Structures". Musicae Scientiae, 6(2), 2002.
[47] G. Monti and M. Sandler. Automatic polyphonic piano note extraction using fuzzy logic in a blackboard system. In Proc. Digital Audio Effects Workshop (DAFx), pages 39–44, 2002.
[48] J.A. Moorer. On the segmentation and analysis of continuous musical sound by digital computer. PhD thesis, CCRMA, Stanford University, 1975.
[49] E. Narmour. The Analysis and Cognition of Basic Melodic Structures. Univ. Chicago Press, 1989.
[50] E. Narmour. The Analysis and Cognition of Melodic Complexity. Univ. Chicago Press, 1992.
[51] S.H. Nawab, S.A. Ayyash, and R. Wotiz. Identification of musical chords using constant-Q spectra. In Proc. ICASSP, 2001.
[52] O. Ortmann. On the melodic relativity of tones. Psychological Monographs, 35(162), 1926.
[53] R. Parncutt. Harmony: a Psychoacoustical Approach. Berlin: Springer-Verlag, 1989.
[54] D. Perrot and R. Gjerdingen. Scanning the dial: an exploration of factors in identification of musical style. In Proc. Society for Music Perception and Cognition, page 88, 1999. Abstract only.
[55] W.J. Pielemeier and G.H. Wakefield. A high-resolution time-frequency representation for musical instrument signals. J. Acoust. Soc. Am., 99(4):2382–96, April 1996.
[56] M. Piszczalski and B. Galler. Predicting musical pitch from component frequency ratios. J. Acoust. Soc. Am., 66(3):710–20, September 1979.
[57] M.D. Plumbley, S.A. Abdallah, J.P. Bello, M.E. Davies, G. Monti, and M.B. Sandler. Automatic music transcription and audio source separation. Cybernetics and Systems, 33(6):603–27, 2002.
[58] E.D. Scheirer. Extracting expressive performance information from recorded music. Master's thesis, Media Lab, MIT, September 1995.
[59] E.D. Scheirer. Music Listening Systems. PhD thesis, Media Lab, MIT, June 2000.
[60] M. Slaney. A critique of pure audition. In D.F. Rosenthal and H.G. Okuno, editors, Computational Auditory Scene Analysis, chapter 3. Lawrence Erlbaum Associates, 1998.
[61] P. Smaragdis and J.C. Brown. Non-negative matrix factorization for polyphonic music transcription. In IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, 2003.
[62] M. Steedman. The well-tempered computer. Phil. Trans. R. Soc. Lond. A., 349:115–31, 1994.
[63] A. Sterian, M. Simoni, and G. Wakefield. Model based musical transcription. In Proc. International Computer Music Conference, 1999.
[64] A. Sterian and G.H. Wakefield. Robust automated music transcription systems. In Proc. International Computer Music Conference, 1996.
[65] A.D. Sterian. Model-based Segmentation of Time-frequency Images for Musical Transcription. PhD thesis, MusEn Project, University of Michigan, Ann Arbor, 1999.
[66] B. Su and S-K. Jeng. Multi-timbre chord classification using wavelet transform and self-organized map neural networks. In Proc. ICASSP, 2001.
[67] D. Temperley. The Cognition of Basic Musical Structures. MIT Press, 2001.
[68] E. Terhardt. Impact of computers on music. In M. Clynes, editor, Music, Mind, and Meaning, chapter 18, pages 353–69. Plenum Press, 1982.
[69] E. Terhardt, G. Stoll, and M. Seewann. Pitch of complex signals according to virtual-pitch theory: tests, examples, and predictions. J. Acoust. Soc. Am., 71(3):671–8, March 1982.
[70] G. Tzanetakis and P. Cook. Musical genre classification of audio signals. IEEE Trans. Speech and Audio Processing, 10(5):293–302, July 2002.
[71] T. Virtanen and A. Klapuri. Separation of harmonic sounds using linear models for the overtone series. In Proc. ICASSP, 2002.
[72] I. Vun. Wavelet-based musical pitch estimation. CUED, unpublished paper, May 2000.
[73] P.J. Walmsley. Signal Separation of Musical Instruments - Simulation-based Methods for Musical Signal Decomposition and Transcription. PhD thesis, Cambridge University Engineering Department, September 2000.
[74] A.N. White. A treatise on transcription. Washington D.C., 1978.
[75] S.N. Wrigley. A model of auditory attention. Technical Report CS-00-07, SPandH, Dept of Computer Science, Sheffield University, 2000.