A Factored Language Model of Quantized Pitch and Duration

Xiao Li, Gang Ji and Jeff Bilmes
Electrical Engineering Department, University of Washington, Seattle
{lixiao,gang,bilmes}@ee.washington.edu

Abstract
This paper investigates a novel statistical approach to music classification that utilizes recent technology developed in the domain of natural language processing. Specifically, we investigate the use of factored language models (FLMs) for the task of producing conditional probability distributions to model origin-specific folk songs. In our model, pitch cluster and quantized duration are employed as the two fundamental factors in a musical unit. The structure of our FLM is empirically chosen from data to minimize perplexity. We apply this collection of FLMs to the task of folk song classification. Our experiments show classification accuracy of 72.5% on a data set of European folk songs from 6 different regions.

1 Introduction

In recent times, there has been a renewed interest in computational methods for music processing and analysis, involving a variety of applications such as music classification, music search, generation, categorization, and retrieval. This interest has been motivated in part by the proliferation of the world-wide web, where music-related applications are now not only prevalent but also profitable. Statistical approaches are particularly appealing due to the exploding availability of digitized and MIDI music archives which can be used as corpora for automatic training.

In this work, we study the extent to which modern statistical language modeling techniques can be applied to written (or, more specifically, both pitch and time quantized) music. While such representations of music have certainly lost most of what provides it its expressive and emotional feel (which can be captured by tempo curves (Fraisse 1983; Jaffe 1985), patterns of timing deviations (Bilmes 1992; Bilmes 1993), loudness, and timbre), there is still much information regarding the song and its categorization just in its quantized representation.

A further reason our approach is viable is that such descriptions of music share certain, though not all, characteristics with textual forms of human language. For example, many music representations utilize only a finite set of "linguistic" units, i.e. notes, and music has its own linguistic structure dictated by the harmonic patterns of a cluster of notes sounded in either unison, sequentially, or with an offset pattern. Indeed, linguistic analyses have been used to study music for some time (Lerdahl and Jackendoff 1983). Statistical music processing, moreover, has long accompanied that of natural language processing (Conklin 2003). In the 1990s, improved methods for statistical language modeling were developed, making higher-order n-grams tractable (Goodman 2001). Researchers in music modeling have also investigated various forms of Markov model (Brooks et al. 1993; Conklin 1995; Ponsford et al. 1999; Pickens 2000) and, more recently, random fields (Lavrenko and Pickens 2003). In general, these techniques share the fact that they operate on quantized music, creating a "word token" (e.g. a pitch or a chord) for each time interval. The word tokens are then modeled using Markov chains or random fields, where the melodic characteristics are parameterized. Some of the above approaches, however, lack an explicit note length (or duration) model, something that is obviously crucial for determining both a song and its style.

In this paper, we utilize a technique recently introduced in the natural language processing community, namely factored language models (FLMs) and generalized parallel backoff (GPB) (Bilmes and Kirchoff 2003), and apply it to folk song classification. Unlike some of the linguistic approaches mentioned above, this approach is strictly finite-state. Specifically, an FLM represents a probability distribution over a sequence of words by assuming that each distribution over a word factorizes into the word's constituent parts (such as parts of speech, or other tags). This allows a more fine-grained (and generalizable) representation of language than if we were to model words as only atomic units.

An additional contribution, moreover, of this work is to view pitch (and pitch cluster) and note length as two constituent factors of a musical "word", and to utilize an FLM to capture the melodic and rhythmic characteristics in parallel. In particular, we show that perplexity can be maximally reduced when a generalized backoff strategy is employed as opposed to a normal backoff order.
[Figure 1: Standard western musical notation (score of the folk song "Dusty Miller", in 3/4 time)]
We apply our FLM to the problem of folk song classification; in fact, the FLM represents a generative model that can be seen as an extension of the HMM-based system of (Chai and Vercoe 2001). Our experiments show classification accuracy of 72.5% on a data set of European folk songs from 6 different regions.

The remainder of the paper is organized as follows: Section 2 provides background regarding ABC notation, language modeling, and factored language models. Section 3 describes how we tokenize written music for language modeling, presents an FLM for music as well as its backoff schemes, and introduces a natural classification strategy. Section 4 discusses evaluation experiments and results, and Section 5 concludes.
2 Background

2.1 ABC music notation

The ABC music notation, originally developed by Chris Walshaw (Walshaw 2006), is a simple representation of music in ASCII. ABC has become popular since its introduction in 1991 because of its easy readability by both humans and machines. Music written in ABC format can be efficiently stored, manipulated, and transported, and can easily be played by a computer or converted into a music score. An ABC file always starts with a header specifying the key and time signatures, and sometimes provides additional information, such as the title and the origin of the music. In the body of an ABC file, the capital letters A-G denote the seven natural pitches, and prefixes, if necessary, are added to indicate accidentals. A lower-case letter, or a letter with a suffix, denotes the corresponding pitch at a different octave. Letters between a pair of square brackets indicate a chord. For example, a two-octave C major scale starting from middle C is CDEFGABcdefgabc. A note length is denoted by a fraction following the pitch symbol. Figure 2 shows the folk song in Figure 1 notated in ABC format. Please refer to (Walshaw 2006) for additional details regarding this notation.

We have built our corpus of folk songs using a standard collection which is available on the web in ABC notation (see Section 4.1). We use music available in this notation since it is freely available in the public domain.
X:1            % tune no 1
T:Dusty Miller % title
K:G            % key signature
M:3/4          % time signature
B>cd BAG|FA Ac BA|B>cd BAG|DG GB AG:|
Bdd gfg|aA Ac BA|Bdd gfa|gG GB AG:|
BG G/2G/2G BG|FA Ac BA|BG G/2G/2G BG|DG GB AG:|
Figure 2: ABC music notation
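To make the pitch and length conventions concrete, here is a minimal Python sketch (ours, not the authors' parser) that reads pitch and length from a simplified ABC fragment. The regular expression and the default unit note length of an eighth note are simplifying assumptions, and chords, bars, repeats, and broken-rhythm markers are ignored.

import re

# A minimal sketch of reading pitch and length from a simplified ABC fragment.
# Handles accidental prefixes (^, _, =), octave suffixes (' and ,), rests (z),
# and fractional lengths relative to the unit note length.
NOTE_RE = re.compile(r"([_^=]?)([A-Ga-gz])([',]*)(\d+/\d+|/\d+|\d+|/)?")

def parse_abc_notes(fragment, unit_length=0.125):
    notes = []
    for accidental, letter, octave, length in NOTE_RE.findall(fragment):
        if not length:
            mult = 1.0                    # no length field: one unit note
        elif length == "/":
            mult = 0.5
        elif length.startswith("/"):
            mult = 1.0 / float(length[1:])
        elif "/" in length:
            num, den = length.split("/")
            mult = float(num) / float(den)
        else:
            mult = float(length)
        notes.append((accidental + letter + octave, mult * unit_length))
    return notes

# Eighth notes G, A, B, c; a quarter note d; a sixteenth note e; an eighth-note
# B flat in the lower octave:
print(parse_abc_notes("GABc d2 e/2 _B,"))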
2.2 Language modeling

Many different statistical models of language have been proposed and utilized over the years, including parse-tree based (Manning and Schutze 1999) and finite-state (Jelinek 1997). Parse-tree based models assume language is described by a generative grammar of some form. Finite-state models, however, assume that language has been generated by some form of Markov chain, and for various reasons related to ease of training and the existence of powerful smoothing techniques, the finite-state models have been most successful in speech recognition, statistical machine translation, and a variety of other language processing applications. The underlying goal is to produce a joint distribution over a sequence of words w_1, ..., w_T. It is most often the case that the chain rule of probability is used to factor this joint distribution thus:

p(w_1, \ldots, w_T) = \prod_t p(w_t \mid w_{1:t-1}) = \prod_t p(w_t \mid h_t)

where h_t = w_{1:t-1} is the history.¹ In a standard n-gram based model, it is assumed that the immediate past (i.e., the preceding n-1 words) is sufficient to capture the entire history up to the current word. In a trigram (n = 3) model, for example, the goal is to produce conditional probability representations of the form p(w_t | w_{t-1}, w_{t-2}). Of course, even for a number as low as n = 3 it is still impossible to produce a reasonable purely maximum likelihood estimate for such a model, as often the number of words in the vocabulary exceeds 150,000 (leading to what would be a 150,000³-sized table). Therefore, a number of smoothing techniques have been developed (Jelinek 1997; Goodman 2001), and the backoff approach to smoothing has been the most successful in this endeavor. A backoff model works by smoothing a higher-order model (such as p(w_t | w_{t-1}, w_{t-2})) with a lower-order model (such as p(w_t | w_{t-1})). In the usual language model case, there is an obvious choice as to which lower-order model to choose (i.e., it would be less beneficial to smooth next with, say, p(w_t | w_{t-2})). The order in which we "drop" variables (in this case, w_{t-2} and then w_{t-1}) is called the backoff order. Further details of this approach are described in Section 2.3.

¹ We use standard Matlab notation m:n, where n ≥ m, to denote the integer range m, m+1, ..., n.
A standard measure to evaluate language model performance is perplexity. Given a language model p_θ(w_t | h_t) whose parameters θ have been trained using only a training corpus, perplexity evaluates that language model by looking at its average score on a separate test corpus. Specifically, if D = {(w_t, h_t)}_{t=1}^{T} is a test corpus, perplexity is defined as:

\mathrm{ppl} = \left( \prod_{t \in D} p_\theta(w_t \mid h_t) \right)^{-1/T}
The perplexity of a language model is commonly used to produce a rapid estimate of a language model's generalization ability, the assumption being that if the model on average produces high probability for an unseen data set, then it is doing well (Goodman 2001). Perplexity can also be viewed as the difficulty of the corpus, indicating roughly the number of typical next words given the history. In any event, high probabilities will lead to a low perplexity, the lowest possible perplexity being unity, meaning that all words are predicted perfectly with probability one. When used to evaluate language models, therefore, lower perplexity is better.

One last aspect of language modeling must be addressed before proceeding. As mentioned above, it is common to have as many as 150,000 or more words in the vocabulary. It is moreover often the case that a testing corpus contains words that never occurred in the training corpus. Such words are called "unknown words". Ordinarily, if no smoothing were applied to the language model, any such word would be given a probability of zero, rendering any sentence in the test corpus that contained such a word to have probability zero. Backoff smoothing, however, can help in this situation. The assumption is that unknown words have statistical properties similar to those of rare words, since if the training set is large, not seeing particular words strongly indicates that they are rare. We thus create a new word called unk during training to represent all unknown words, and give it a probability similar to that of the collection of rare words. Whenever an unknown word is encountered in the test corpus, that word is treated as if it were mapped to unk, and given unk's probability. This way, even when unknown words exist, we will get a reasonable estimate of their probability. This property of backoff models will be crucial to us in the music domain when we encounter chords or chord sequences in the test data that were never seen in the training data.
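As a concrete illustration (our sketch, not part of the original system), the following computes corpus perplexity for a smoothed model, mapping unseen words to an unk token as described above; the `model` and `vocab` arguments are hypothetical stand-ins for a trained backoff model and its training vocabulary.

import math

def perplexity(model, test_corpus, vocab, unk="<unk>"):
    """Perplexity of a smoothed model on a list of (word, history) pairs.

    model(word, history) is assumed to return p(word | history) > 0 after
    backoff smoothing; words unseen in training are first mapped to `unk`.
    """
    total_log_prob, count = 0.0, 0
    for word, history in test_corpus:
        w = word if word in vocab else unk
        total_log_prob += math.log(model(w, history))
        count += 1
    # ppl = (prod_t p(w_t | h_t))^(-1/T), computed in log space for stability
    return math.exp(-total_log_prob / count)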
2.3 Factored language models

Factored language models and generalized parallel backoff were originally proposed by (Bilmes and Kirchoff 2003) as a model for natural language processing (e.g., speech recognition, statistical machine translation, etc.). In an FLM, a word is seen as a collection or bundle of K (parallel) "factors", so that w_t ≡ {f_t^1, f_t^2, ..., f_t^K}. Factors can be anything, including morphological classes, stems, roots, and other such features in highly inflected languages (such as Arabic or Finnish), or data-driven word classes or semantic features useful for sparsely inflected languages (such as English). In an FLM setting, the language model is given by

p(w_t \mid h_t) = \prod_k p(f_t^k \mid \pi(f_t^k))    (1)

where π(f_t^k) is the set of variables that f_t^k can immediately interact with (in Bayesian network terminology, π(f_t^k) is the set of parents of f_t^k).

Unlike in standard n-gram language modeling, with an FLM there is no immediately obvious best backoff order in which the variables within π(f_t^k) should be dropped. The reason is that in the standard language model case, we are certain that all parent variables correspond to a different time point in the history, and information almost always decreases with increasing time. In the FLM case, however, certain variables might occur at the same time. Moreover, the variables might have different numbers of values (their cardinality), and knowing the outcome of a high-cardinality variable might provide more information than knowing the outcome of a low-cardinality variable.

There are several solutions to this problem. One solution is to design procedures to intelligently search over model structures for one that works well on a development data set (Duh and Kirchhoff 2004). Another solution, proposed in the original FLM paper (and which we evaluate here), is to back off simultaneously "in parallel", effectively skirting the decision (Bilmes and Kirchoff 2003). Here, potentially all backoff orders are simultaneously utilized to ultimately produce a backoff score. This was called "generalized parallel backoff", and can be described by the following equation (for three variables f, f_1, and f_2):

p_{GPB}(f \mid f_1, f_2) = \begin{cases} d_{N(f, f_1, f_2)}\, p_{ML}(f \mid f_1, f_2) & \text{if } N(f, f_1, f_2) > \tau \\ \alpha(f_1, f_2)\, g(f, f_1, f_2) & \text{otherwise} \end{cases}

where d_{N(f, f_1, f_2)} is a discount factor, p_ML is the maximum likelihood distribution over its arguments, α(f_1, f_2) is a weighting function (to ensure normalization), and g(f, f_1, f_2) is an arbitrary non-negative backoff function of its three arguments. Different generalized backoff procedures are achieved by using different g-functions. For example, g(f, f_1, f_2) = p_GPB(f | f_1) and g(f, f_1, f_2) = p_GPB(f | f_2) correspond to two different backoff paths, and parallel backoff is obtained, say, using g(f, f_1, f_2) = [p_GPB(f | f_1) + p_GPB(f | f_2)]/2. FLMs were recently added to a well-known open-source language modeling toolkit (Stolcke 2002).
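To illustrate the recursion (a sketch under our own assumptions, not the toolkit's implementation), generalized parallel backoff can be written as follows. The `counts`, `p_ml`, `discount`, and `alpha` arguments are hypothetical stand-ins for quantities that a real implementation estimates during training, and the mean used to combine the parallel paths is just one choice of g-function.

def p_gpb(f, parents, counts, p_ml, discount, alpha, tau=0):
    """Generalized parallel backoff over a tuple of parent values.

    counts[(f, parents)] plays the role of N(f, f1, f2); p_ml is the
    maximum-likelihood estimate; discount and alpha are assumed to be
    estimated elsewhere (e.g. by Kneser-Ney discounting) so that the
    resulting distribution normalizes.
    """
    if not parents:                        # nothing left to drop: unigram
        return p_ml(f, ())
    if counts.get((f, parents), 0) > tau:  # seen often enough: discounted ML
        return discount(f, parents) * p_ml(f, parents)
    # Back off "in parallel": drop each parent in turn and combine the
    # lower-order scores; the mean shown here is one possible g-function.
    reduced = [parents[:i] + parents[i + 1:] for i in range(len(parents))]
    g = sum(p_gpb(f, r, counts, p_ml, discount, alpha, tau)
            for r in reduced) / len(reduced)
    return alpha(parents) * g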
3 Music Tokenization and Modeling

The question next becomes, given music represented in ABC format, how to convert it into a sequence of parallel lexical elements that can be directly analyzed by an FLM. This tokenization of ABC files essentially means that we extract both "word" tokens and produce "sentence" boundaries. One method of extracting word tokens from music is to temporally quantize the piece and then create a word token for each time interval (Ponsford et al. 1999; Lavrenko and Pickens 2003), such as the tactus or tatum (Bilmes 1993). The word token, then, is either the note or the chord played during that interval, and one word occurs at each such interval. In our work, we explicitly model both the note event and the note length, and produce sequence models over the two. Here we make a simple analogy between the terms used in natural language processing and language modeling and those used in our music model.

Word: A musical "word" is defined as a note consisting of two factors: a pitch and a length. We use the term "note event" throughout this paper to indicate either a single note or a cluster of notes (chord) occurring at the same time. Since there are 12 notes per octave, spanning 7.3 octaves gives 88 individual notes. If we allow chords of up to 10 notes to be sounded simultaneously, there would be (ignoring unlikely chords) about \binom{88}{10} ≈ 4.5 × 10^{12} possible words! This is of course quite a large set, but the complexity of the model (like in any backoff-based model) grows only with the size of the vocabulary that exists in the training set. As mentioned above, words in the testing set not found in training are mapped to unk. This makes such representations of music fit well into the statistical language modeling framework. In addition to the above, we consider a rest as a distinct note event. Each note event has a duration, or "note length." We use an absolute duration expressed as a fraction of a whole note. It is worth clarifying that notes can actually move independently: for example, two eighth notes can be played in synchronization with a quarter note of a different pitch. It is still possible to tokenize and thus represent such events by expanding our lexicon. This scenario, however, does not occur in our folk song corpus.

Sentence: The language model systems we have employed require the text to be broken into a set of sentences, each of which consists of a sequence of words. Our "musical sentence" is defined as a musical phrase (Section 2). Since automatic phrase segmentation is beyond the scope of this work, we herein utilize a simple but effective procedure where the notes between two double bar lines or repeat signs constitute a phrase. By doing this, we are ignoring the fact that a phrase can sometimes span repeat signs, and we are not utilizing measure information. While this is obviously a simplification, our results still perform well for music origin classification.

Document: A musical "document" is simply a complete piece of music.

After tokenization, each musical sentence is converted into a sequence of note events and their durations. We have written a parser that converts ABC files into tokenized music documents based on the above definitions; a sketch of the resulting token format is given below. In addition, the parser takes into account key and time signature information. Each word token is then represented in the form "u_t : v_t", where u_t is a random variable denoting the note event at position t, and v_t denotes the duration of that event. For example, "F#:0.125" denotes a middle F sharp with an eighth-note duration, and "CE:0.25" denotes a chord with a quarter-note duration. Additionally, start- and end-of-sentence tokens are added as in standard text language models.
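The sketch below (ours; the input format is an assumption) shows this token format: it converts a parsed phrase into "note-event:duration" word tokens with the usual sentence boundary markers.

def tokenize_phrase(notes):
    """Convert a parsed phrase, a list of (note_event, duration) pairs, into
    the "u_t : v_t" word tokens used by the language model.

    A note event is a pitch name, a chord such as "CE", or "rest"; the
    duration is an absolute fraction of a whole note (0.125 = eighth note).
    """
    tokens = ["<s>"]                        # start-of-sentence marker
    for event, duration in notes:
        tokens.append(f"{event}:{duration}")
    tokens.append("</s>")                   # end-of-sentence marker
    return tokens

# An eighth-note F#, a quarter-note C-E chord, and a quarter-note rest:
print(tokenize_phrase([("F#", 0.125), ("CE", 0.25), ("rest", 0.25)]))
# ['<s>', 'F#:0.125', 'CE:0.25', 'rest:0.25', '</s>']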
3.1 Statistical music model

As mentioned above, the word w_t at index t is a vector of two factors, the note event u_t and the note length v_t. Note that the index t does not refer to a specific time interval, but rather to the position in the sequence of (note event, note length) pairs. This is a significant difference from the work of (Ponsford et al. 1999) and (Lavrenko and Pickens 2003). Given these two factors, our task is twofold: (1) induce an appropriate model structure for the FLM; and (2) find the best backoff path for that structure. Both tasks ideally should reduce language model perplexity. The word-sequence probability is factorized as follows:

p(w_{1:T}) = p(u_{1:T}, v_{1:T}) = \prod_{t=1}^{T} p(v_t \mid \pi(v_t)) \prod_{t=1}^{T} p(u_t \mid \pi(u_t))    (2)

where π(u_t) denotes u_t's parents (and similarly for π(v_t)). Equation (2) essentially results in two language models decoupled by factorization: one for the note event and one for the duration. Language model perplexity, in this case, therefore simply equals the product of the perplexities of the note event model and the note length model. This factorization enables us to study these two statistical models separately; the optimal language model is simply the combination of their best models. We define the parent sets of a note event and a note length as follows:

\pi(u_t) \subseteq \{u_{t-1}, \ldots, u_{t-N}, v_t, v_{t-1}, \ldots, v_{t-M+1}\}    (3)

\pi(v_t) \subseteq \{v_{t-1}, \ldots, v_{t-K}, u_{t-1}, u_{t-2}, \ldots, u_{t-J}\}    (4)

where N and K are the numbers of parents from homogeneous factors, while M and J are the numbers of parents from heterogeneous factors. In this setting, v_t can be a parent of u_t but not vice versa. Figure 3 shows such a graphical model where (N, M, K, J) = (2, 1, 2, 0). In practice, the best order can be empirically obtained by evaluating the perplexity on a development set.
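A minimal sketch of such an order search (ours, not the authors' procedure) appears below; `train_flm` and `dev_perplexity` are hypothetical wrappers around the toolkit's training and development-set evaluation.

import itertools

def select_orders(train_flm, dev_perplexity, max_n=7, max_m=3, max_k=7, max_j=2):
    """Grid-search the parent-set sizes of Equations (3) and (4).

    Because Equation (2) decouples the note event model and the note length
    model, the two searches are independent and the overall perplexity is
    the product of the two best values.
    """
    best = {}
    for factor, (max_homo, max_hetero) in [("event", (max_n, max_m)),
                                           ("length", (max_k, max_j))]:
        candidates = itertools.product(range(1, max_homo + 1),
                                       range(0, max_hetero + 1))
        scored = [(dev_perplexity(train_flm(factor, h, x)), h, x)
                  for h, x in candidates]
        best[factor] = min(scored)   # (perplexity, homogeneous, heterogeneous)
    return best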
[Figure 3: Graphical model representation of the FLM, with note length variables v_t and note event variables u_t for t = 1:T]
[Figure 4: Different backoff strategies in the FLM. (a) A single backoff path: P(u_t | u_{t-1}, u_{t-2}, v_t) → P(u_t | u_{t-1}, u_{t-2}) → P(u_t | u_{t-1}) → P(u_t). (b) Generalized parallel backoff: P(u_t | u_{t-1}, u_{t-2}, v_t) backs off in parallel to P(u_t | u_{t-1}, u_{t-2}) and P(u_t | u_{t-1}, v_t), whose scores are combined before backing off further to P(u_t | u_{t-1}) and P(u_t)]
3.2 Backoff

It is still necessary to determine the optimal backoff order and/or backoff strategy for our music model. One natural strategy is to drop parents in the reverse order of Equations (3) and (4). In other words, the most distant heterogeneous factor is always the first to be removed from the parent set. After all heterogeneous factors are dropped, the standard n-gram backoff path is used. Figure 4 (a) shows an example of the note event model using such a backoff path. In this example, the prediction of note event u_t is based on three parents: the two previous note events u_{t-2}, u_{t-1}, and the current note length v_t. Using the above backoff scheme, we first drop v_t to form a pitch trigram. Therein, we implicitly assume that the most distant heterogeneous factor of a variable, conditioned on the other parent factors, has the least information about this variable. This assumption, however, is not necessarily true, especially for higher-order models: it is completely possible that the most recent heterogeneous factor has higher predictive power than a homogeneous factor far back in the history, especially for more structured musical forms.

To compensate for this problem, we utilize the GPB strategy. At each stage of the backoff graph, we can drop either the most distant heterogeneous factor or the most distant homogeneous factor. Instead of choosing one single backoff path, we allow these two paths to be combined using the mean, the maximum, or a weighted sum (Bilmes and Kirchoff 2003). This scheme is illustrated in Figure 4 (b), where two parallel paths are chosen to back off the probability p(u_t | u_{t-1}, u_{t-2}, v_t). This probability hence becomes a combination of two scores, namely p(u_t | u_{t-1}, v_t) and p(u_t | u_{t-2}, u_{t-1}). Regarding the combination method, we found that taking the mean of the two log scores worked well for our tasks. In addition to backoff, discounting techniques are also used to further improve performance. We empirically found that in all cases, modified Kneser-Ney or Witten-Bell discounting (Chen and Goodman 1999; Goodman 2001) with interpolation worked very well.
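As a small illustration (our sketch), combining two parallel backoff scores by the mean of their log scores amounts to taking their geometric mean:

import math

def combine_parallel_mean_log(p_drop_hetero, p_drop_homo):
    """Combine two parallel backoff scores, e.g. p(u_t | u_{t-1}, u_{t-2}) and
    p(u_t | u_{t-1}, v_t), by taking the mean of their log scores, i.e. the
    geometric mean of the probabilities."""
    return math.exp(0.5 * (math.log(p_drop_hetero) + math.log(p_drop_homo)))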
3.3 FLM-based classification

The goal of the music classification problem in this work is to automatically identify the origin of a music piece. A generative-model-based approach is a natural strategy. The idea is very simple: a music piece is classified based on how likely it is to have been generated from each origin-specific language model. Mathematically, if we denote the class of musical origin as c, the classification problem can be solved using the Bayes decision rule:

\hat{c} = \arg\max_c p(c \mid w_{1:T}) = \arg\max_c p(w_{1:T} \mid c)\, p(c)    (5)

In this paper, we assign equal priors p(c) to all musical origins. Moreover, we train separate FLMs as in Equation (2) for each origin to produce a representation of p(w_{1:T} | c). To study the roles that pitch and length play in classification, we also investigate the cases where a musical piece is classified based only on the note event model or only on the length model, leading to one of the following decision rules:

\hat{c} = \arg\max_c \prod_t p(u_t \mid \pi(u_t), c)    (6)

\hat{c} = \arg\max_c \prod_t p(v_t \mid \pi(v_t), c)    (7)
We compare results of these methods in the next section.
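A sketch of this decision rule under equal priors (ours, not the actual implementation, which uses the SRILM FLM tools); `origin_models` is a hypothetical mapping from each origin to its trained note event and note length models.

import math

def classify(piece, origin_models):
    """Decision rule of Equation (5) with equal priors p(c).

    `piece` is a list of (u_t, u_parents, v_t, v_parents) tuples, and
    `origin_models[c]` is assumed to be a pair of callables (p_u, p_v)
    giving the origin-specific note event and note length probabilities.
    """
    def log_likelihood(p_u, p_v):
        return sum(math.log(p_u(u, pu)) + math.log(p_v(v, pv))
                   for u, pu, v, pv in piece)
    return max(origin_models,
               key=lambda c: log_likelihood(*origin_models[c]))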
4 Evaluation

Our evaluation methodology is as follows. First, in order to get an idea of a good overall backoff structure, we train our proposed language models with different structures on a grand training set consisting of musical pieces from all possible origins. These models are then evaluated (using perplexity) on a development set. Second, we build a number of origin-specific language models using the promising structures, and evaluate them in classification experiments. Note that an origin-specific language model is produced simply by training on pieces only from the given origin. In all experiments, we use the FLM extensions (Bilmes and Kirchoff 2003) to the SRILM toolkit (Stolcke 2002).
4.1 Corpus

For the music modeling experiments, we collected from the web over 14,000 folk songs that were transcribed into ABC format. These folk songs originated from various European countries. We chose to utilize the ABC format only due to its wide availability in vast quantities on the web; there is no reason, however, that our approach could not be applied to any type of quantized music or to other folk-song corpora such as the Essen collection (Schaffrath 1995). All the files were parsed and tokenized as described in Section 3. For the perplexity experiments, we allocated 9,585 documents, with 25K sentences and 1M word tokens, to the training set, and 4,511 documents, with 13K sentences and 500K word tokens, to the development set. The vocabulary size of the note event is around 310 and that of the note length is 36.

For the classification experiments, we extracted ABC files from 6 European countries/regions. Table 1 shows the origin classes and the amount of data in each class. We divided the data into 6 subsets and conducted 6-fold cross-validation classification experiments. In other words, for each origin class, we trained an origin-specific language model using data from 5 subsets, and the remaining subsets from all origin classes were combined to evaluate classification accuracy. Exact information regarding the precise definition of the corpus files, the data allocation, and the data folds themselves can be accessed from (Li 2006).²

² By making the exact folds we used available on the web, we hope other researchers will be able to evaluate their methodologies on exactly the same training/test corpora on the music classification task and will thus be able to compare with our results exactly.

Origin        # pieces  # sents  # tokens
England       1070      2526     105K
Ireland       879       2230     95K
Scotland      603       1325     73K
France        418       632      32K
Scandinavia   627       1720     67K
S.E. Europe   127       461      16K

Table 1: Data distribution for the classification corpus

4.2 Language modeling of folk songs

As described in Section 2.3, we explore the best language model structure for the note event model and that for the note length model in parallel. First, we investigated the cases where a single backoff path, as depicted in Figure 4 (a), is used for smoothing. Figure 5 shows the perplexity values of the note event model evaluated on the development set as we increase N and M of Equation (3). For example, N=2 with "no v parents" (M=0) means a trigram note event model.

[Figure 5: Note event model perplexities (PPL vs. N, for "no v parents", M=1, M=2, and M=3)]
The figure shows that N=3, M=1 is a good configuration for predicting melodic information, while a further increase in order did not bring any significant improvement. This configuration corresponds to a 4-gram note event model, enhanced by one parent from the duration factor. Similarly, Figure 6 shows the perplexity values of the note length model as we increase K and J. The figure shows that adding heterogeneous parents to a length variable only harms the performance. A 6-gram model, where K=5 and J=0, seems the best for predicting note duration. The resulting best FLM perplexity on the development set is 6.35 × 2.19 = 13.91.

Second, we adopted the GPB strategy, as depicted in Figure 4 (b), in an attempt to further reduce perplexity. Figure 7 shows the performance of the note event model with and without generalized backoff. Since M=1, 2, 3 had very close performance for the note event model in Figure 5, we picked M=3 to represent all three, denoted as "M=1/2/3 with single backoff" in Figure 7. Due to computational limitations, we only tried orders up to N=7. As shown in the figure, applying the generalized backoff scheme greatly improved the performance, especially for the higher-order models. The perplexity of the entire FLM, as a result, was further reduced to 5.17 × 2.19 = 11.32.
[Figure 6: Length model perplexities (PPL vs. K, for "no u parents", J=1, and J=2)]

[Figure 7: Note event model perplexities with generalized backoff (PPL vs. N, for "M=1/2/3 with single backoff", "M=1 with general backoff", "M=2 with general backoff", and "M=3 with general backoff")]

4.3 Folk song classification

As mentioned earlier, we performed 6-fold cross-validation to evaluate classification accuracy. The results reported below are average accuracies and standard deviations over these 6 folds. Note that the model maximizing the likelihood is not necessarily the one maximizing the classification accuracy. Finding the most discriminative model is itself a research area, involving the study of discriminative structure learning and discriminative training. In this work, we do not focus our attention on finding the best discriminative model, but rather on empirically evaluating several FLM structures on a classification task.

We first conducted experiments using the note event model only, according to Equation (6), and the results are summarized in Table 2. As a baseline, we generated note event bigram and trigram models (without heterogeneous parents), and used the standard n-gram backoff path. We then conducted experiments using the best model structure obtained in Figure 5, where N=3, M=1. This model outperformed the trigram model significantly. Finally, we used the best model structure with the GPB strategy applied, i.e. N=7 and M=3. However, there was a non-trivial reduction in classification accuracy compared to the previous model structure. One reason for this degradation might be that the amount of origin-specific data is much less than the data from all possible origins (what we used in the perplexity experiments), and this penalizes higher-order models. In fact, we found that the perplexity had actually increased on some origin-specific data sets when using the (N=7, M=3) configuration with GPB. Another observation from the experiments is that using GPB did not affect classification performance significantly, although it did lower the perplexities on the grand data set. As mentioned, the best smoothing strategy might not yield the model with the best discrimination ability.

note event model      single backoff    general. backoff
N=1, M=0 (bigram)     51.58 ± 1.42%
N=2, M=0 (trigram)    62.70 ± 1.18%
N=3, M=1              66.11 ± 2.04%     66.54 ± 1.59%
N=7, M=3              63.96 ± 0.81%     63.64 ± 1.32%

Table 2: Average accuracies and standard deviations using the note event model only

Second, Table 3³ shows a study of the importance of the note-event/length model to classification. The three rows in the table correspond to the three classification methods in Equations (6), (7) and (5), respectively. The highest accuracy of 72.53% was achieved using the complete model.

                        single backoff    general. backoff
note-event (N=3, M=1)   66.11 ± 2.04%     66.54 ± 1.59%
length (K=5, J=0)       57.23 ± 1.18%
note-event+length       71.48 ± 1.30%     72.53 ± 1.30%

Table 3: Average accuracies and standard deviations of single and combined models

³ Missing cells in both tables correspond to results that are not applicable for that row/column. For example, N=1, M=0 does not yield a distribution for which GPB would be ideal.

The confusion matrix in Table 4 (obtained from our best results) suggests that most of the errors came from England, Ireland and Scotland. This is reasonable since they are geographically close and their music shares common characteristics. Furthermore, as these songs are generally very short, consisting of only two musical sentences on average, classification becomes a harder problem. Our classifier nonetheless performed quite well.
%        Eng.    Ir.     Scot.   Fr.     Scand.  S.E.
Eng.     65.51   13.73   10.75   2.43    7.10    0.47
Ir.      14.22   74.85   7.85    0.80    2.16    0.11
Scot.    16.59   9.95    79.97   0.17    2.16    0.17
Fr.      14.34   1.20    1.91    72.05   8.60    1.91
Scand.   11.80   0.64    1.75    1.59    82.94   1.27
S.E.     8.55    0.76    0.79    6.28    10.25   73.36

Table 4: Confusion matrix averaged over 6 folds
5 Conclusions and Discussions

In this work, we applied natural language processing techniques to written, time-quantized music. Specifically, we viewed a musical note as two factors, a note event and a length, and constructed a factored language model accordingly. Experiments on a collection of European folk songs showed that folk music is fairly predictable, with a perplexity of only 5.17 × 2.19 = 11.32. Furthermore, a generative-FLM approach achieved significantly better performance than a trigram note event model on a folk song classification task.

Our work utilized relatively little music knowledge. In particular, we used a simplistic notion of musical phrase, and bundled all chords together as integers. We moreover did not use any subtleties that are, in real music, likely to be strong indicators of musical origin and style. We envision further improvements in music classification by incorporating additional such information into our model. For instance, tempo variation, deviations, loudness, measure boundaries, accents, gracings, timing subtleties, as well as key and time signatures might all be important indicators of musical origin. Finally, it would be helpful to study discriminative model structures and training techniques to improve results further.

The authors wish to thank Mari Ostendorf for discussions regarding this work.
References

Bilmes, J. (1993). Timing is of the essence: Perceptual and computational techniques for representing, learning, and reproducing expressive timing in percussive rhythm. Master's thesis, MIT, Cambridge, MA.
Bilmes, J. and K. Kirchoff (2003). Factored language models and generalized parallel backoff. In HLT-NAACL.
Bilmes, J. A. (1992). A model for musical rhythm. In ICMC Proceedings, San Jose, CA.
Brooks, F., A. Hopkins, P. Neumann, and W. Wright (1993). Machine Models of Music.
Chai, W. and B. Vercoe (2001). Folk music classification using hidden Markov models. In Intl. Conf. on AI.
Chen, S. F. and J. Goodman (1999). An empirical study of smoothing techniques for language modeling. Computer, Speech and Language, 359-394.
Conklin, D. (1995). Multiple viewpoint systems for music prediction. Journal of New Music Research 24(1), 51-73.
Conklin, D. (2003). Music generation from statistical models. In the AISB 2003 Symp. on Artificial Intelligence and Creativity in the Arts and Sciences, pp. 30-35.
Duh, K. and K. Kirchhoff (2004). Automatic learning of language model structure. In Proceedings of COLING 2004, Geneva, Switzerland.
Fraisse, P. (1983). Rhythm and tempo. In D. Deutch (Ed.), The Psychology of Music. Academic Press.
Goodman, J. (2001). A bit of progress in language modeling. Computer, Speech and Language, 403-434.
Jaffe, D. (1985). Ensemble timing in computer music. CMJ 9(4), 38-48.
Jelinek, F. (1997). Statistical Methods for Speech Recognition. MIT Press.
Lavrenko, V. and J. Pickens (2003). Music modeling with random fields. In ACM SIGIR, pp. 389-390.
Lerdahl, F. and R. Jackendoff (1983). A Generative Theory of Tonal Music. MIT Press.
Li, X. (2006). Folk tune corpus for music modeling. http://ssli.ee.washington.edu/~lixiao/music.htm.
Manning, C. D. and H. Schutze (1999). Foundations of Statistical Natural Language Processing. MIT Press.
Pickens, J. (2000). A comparison of language modeling and probabilistic text information retrieval approaches to monophonic music retrieval. In Intl. Symp. on Music Information Retrieval.
Ponsford, D., G. Wiggins, and C. Mellish (1999). Statistical learning of harmonic movement. Journal of New Music Research 28(2).
Schaffrath, H. (1995). The Essen Folksong Collection in the Humdrum Kern Format. D. Huron (Ed.). Menlo Park, CA: Center for Computer Assisted Research in the Humanities.
Stolcke, A. (2002). SRILM: an extensible language modeling toolkit. In Proc. Intl. Conf. on Spoken Language Processing, Volume 2, pp. 901-904.
Walshaw, C. (2006). The ABC musical notation language. http://staffweb.cms.gre.ac.uk/~c.walshaw/abc/.