IEEE TRANSACTIONS ON MULTIMEDIA
Music information retrieval using social tags and audio

Mark Levy, Student Member, IEEE, and Mark Sandler, Senior Member, IEEE
Abstract— In this paper we describe a novel approach to applying text-based information retrieval techniques to music collections. We represent tracks with a joint vocabulary consisting of both conventional words, drawn from social tags, and audio muswords, representing characteristics of automatically-identified regions of interest within the signal. We build vector space and latent aspect models indexing words and muswords for a collection of tracks, and show experimentally that retrieval with these models is extremely well-behaved. We find in particular that retrieval performance remains good for tracks by artists unseen by our models in training, and even if tags for their tracks are extremely sparse.

Index Terms— social tags, audio, music, information retrieval.

This work was supported by EPSRC grant EP/E017614/1 (Online Music Recognition And Searching). M. Sandler is with the Centre for Digital Music, Department of Electronic Engineering, Queen Mary, University of London, London E1 4NS, U.K. M. Levy was with the Centre when this research was undertaken. He is now with Last.fm, Karen House, 111 Baches Street, London N1 6DL, U.K. (e-mail: [email protected]; [email protected]).
I. INTRODUCTION
Large collections of musical audio have become commonplace in recent years, and methods of searching for tracks, whether for identification, recommendation, music discovery, playlist generation, or simply to navigate huge collections, have become correspondingly important. Search systems based on audio analysis have been extremely successful for fingerprinting, i.e. searching for exact or near-exact matches to a particular recording [1], [2]. The remaining tasks, however, have yet to benefit significantly from methods using audio information, and widely-used interfaces to music collections, such as iTunes, still depend heavily on conventional artist, album and title metadata. Some progress has been made on classification of musical audio into genre and mood classes, and a few initial studies have attempted more general automatic description using a larger vocabulary [3], [4], [5]. The accuracy of such methods remains modest, however, with typical classification rates as low as 40% on realistic collections [6], and annotation recall and precision considerably lower still [3]. A small number of music search systems taking advantage of audio analysis are currently attempting to find a foothold in the world of commercial music distribution, for example those of musicIP, one llama and isophonics.1 The dominant forces in the space of music recommendation and discovery, however, with tens of millions of active users
between them, are Pandora and Last.fm.2 Pandora's search system is based on expert-generated descriptions, using a rich vocabulary of musical terms, for each track in its collection; Last.fm uses a combination of collaborative filtering and analysis of user-supplied tags for artists, albums and tracks. While both these systems work well in practice, some issues remain. Expert description looks unlikely to scale with the growth in the amount of recorded music available, while social tags and collaborative filtering suffer from problems such as cold start, where no data is available for new tracks, spam, etc. This looks to leave a significant role for audio information, if it can be integrated effectively into a suitable joint system without losing the advantages of text-based retrieval.

In this paper we propose a straightforward way to search music collections using descriptions and audio, by jointly indexing a vocabulary of conventional words in social tags for each track, and one of audio muswords representing each track's significant musical content. In particular this enables us to take advantage of established text information retrieval (IR) methods to search for tracks similar either to a given query song, or to a free text query such as "laid back piano jazz". In recent work [7], [8] we showed that IR systems indexing tags alone could outperform all previous methods on various simple retrieval tasks, and we extend this here to our joint vocabulary of words and muswords. We evaluate the performance of two different IR models on a collection of several thousand tracks and find that using audio information in this framework can indeed improve search results when tags are scarce, and that retrieval remains good for tracks by artists unseen during model training.

The remainder of this paper is organised as follows: in Section II we describe the particular characteristics of social tags for music that motivate a text IR approach; in Section III we introduce our vocabulary of audio muswords; in Section IV we outline the IR models which we evaluate in the experimental framework described in Section V; we give results in Section VI; finally we draw parallels with related work in Section VII and draw conclusions in Section VIII.

1 http://www.musicip.com, http://www.onellama.com, http://isophonics.net
2 http://www.pandora.com, http://last.fm

II. SOCIAL TAGS FOR MUSIC
Social or collaborative tags are brief descriptions supplied by a community of internet users to aid navigation through large collections of media [9], [10]. Although there are usually no restrictions on the text that can be used as a tag, the shared purpose of creating a usable navigation system makes it attractive for users to select tags which others are already using.
A common expectation is therefore that new tags will enter the vocabulary in an "organic" fashion as they become adopted by significant numbers of users, leading gradually to the development of a folksonomy, i.e. a full-scale taxonomy of music reflecting current usage amongst the user community. This view of tags informs most current tag-based search interfaces, which highlight the most widely-used tags for the page or item in question, and offer a naive search facility based on direct matching of tags. Tags for music, however, frequently do not fit this model, perhaps because tagging so readily invites the expression of personal or "tribal" responses to particular songs or performers which are so central to the role of music in people's lives.

We aggregated 667,900 tags for 31,359 individual tracks by 5,265 artists. The tags were downloaded from the last.fm3 and MyStrands4 web services between March and August 2007. Simple statistics of these tags show that they are far from constituting a vocabulary of basic concepts, even allowing for a large amount of error, subjectivity or other statistical noise.

3 http://ws.audioscrobbler.com
4 https://www.musicstrands.com

In the first place, tags for music are often discursive, as illustrated in Fig. 1, which shows the number of tags in our data set against their length in words. We observe that over a third of the tags consist of three or more words, while over 10% contain five or more words: these are frequently complete phrases.

Fig. 1. Tag lengths (number of distinct tags against tag length in words).

Secondly, the vocabulary of tags shows no sign of converging to a stable taxonomy as the number of tags grows. Rather the vocabulary grows according to the power law, known as Heaps' Law, characteristic of ordinary text documents, as shown in the log-log plot of Fig. 2. Heaps' Law is given by

M = kT^b   (1)

where M is the vocabulary size and T is the total number of terms observed, and k and b are constants for the given collection of documents. The vocabulary growth which we observe for tags for music fits very closely to b = 0.42 once we consider a large number of tags, in line with typical values seen in standard text corpora [11].

Fig. 2. Tag vocabulary growth obeys Heaps' Law (log vocabulary size against log collection size in tracks).

Table I shows the first few tags we downloaded containing the term 80s, illustrating the freedom with which words are combined even in short tags.

TABLE I
Some tags containing the term 80s

80s; 80s rock; My 80s memories; 80s y 90s; 80s and 90s; 60s 70s 80s rock; 80s and 90s rnb; 80s wave; 80s-90s; 80s Music; flya 80s; Decade: 80s; 80s Classic; we love the 80s; 80s magic; big-hair 80s; 20 songs mix : 80s Hits; golden 80s; 80s alternative; ilx 80s poll; The 80s was not a dead decade; pop 80s; 80s soundtracks; 80s Pop; 80s throwback; 80s songs i love

We therefore represent the set of tags for each track in our dataset as a bag-of-words (BOW), so our underlying textual data takes the form of a document-term matrix N of co-occurrence counts n(t, w), representing the number of times we see the word w in tags applied to track t. We follow the standard IR approach to text documents, tokenizing each tag with a standard stop-list (to remove common words such as 'it', 'and', 'the', etc.). We do not use a stemmer, because of the idiosyncratic vocabulary of social tagging and the large number of words used as proper nouns (particularly artist names). Working with words rather than tags nonetheless goes some way towards capturing the common meaning of alternate forms such as 'female vocalist', 'female vocals', 'good female vocals', 'sexy female vocals', 'lovely female vocals', etc.

In practice we have access only to partial information about the number of times that each tag has been applied to a given track. The Last.fm web service gives integer percentages relative to the most frequently applied tag, with the frequency of relatively rare tags rounded down to zero: this enforces a form of editorial censorship in the tag clouds shown on web pages, where by convention the font size for each tag is proportional to its count, i.e. tags with zero counts are simply not displayed. The MyStrands web service gives only a list of tags applied to each track, with no information about their relative popularity. Our initial work in [7] showed that even a rough measure of tag frequencies improves search performance, and we therefore restrict our data in our experiments to the Last.fm tags, using the counts as published but simply incrementing them to expose all the tags, including those with zero counts, to our models. Formally we set

n(t, w) = \sum_{g \in G_{t,w}} f(g) + 1

where G_{t,w} is the set of distinct tags applied to track t and containing word w, and f(g) is the frequency of tag g according to the Last.fm web service.
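To make this construction concrete, the following Python sketch builds per-track word counts from a toy in-memory structure of (tag, published count) pairs. The input shape, the tiny stop-list and the exact placement of the increment (applied to each published tag count) are illustrative assumptions rather than the authors' implementation.

import re
from collections import defaultdict

# A tiny illustrative stop-list; a standard (much larger) one is used in the paper.
STOP_WORDS = {"the", "and", "it", "a", "of", "to", "in"}

def tokenize(tag):
    """Lower-case a tag and split it into words, dropping stop words (no stemming)."""
    return [w for w in re.findall(r"[a-z0-9']+", tag.lower()) if w not in STOP_WORDS]

def build_document_term_counts(track_tags):
    """track_tags: {track_id: [(tag_string, published_count), ...]}.

    Returns n[t][w]: the published counts f(g) of the tags containing w are
    incremented and summed, so that zero-count tags are still exposed to the model.
    """
    n = defaultdict(lambda: defaultdict(int))
    for t, tags in track_tags.items():
        for tag, f in tags:
            for w in set(tokenize(tag)):
                n[t][w] += f + 1   # increment so zero-count tags contribute
    return n

# Hypothetical toy input in the shape described above.
example = {"track-1": [("80s rock", 100), ("we love the 80s", 0)]}
counts = build_document_term_counts(example)
print(dict(counts["track-1"]))   # e.g. {'80s': 102, 'rock': 101, 'we': 1, 'love': 1}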
A. Tag sparsity

Despite the huge and growing number of tags available, there are reasons to expect that the distribution of social tags will remain highly uneven in practice, with many sparsely-tagged or untagged tracks in any large collection. Firstly, new music is constantly being created, leading to the well-known cold start problem: tracks can be tagged only as listeners discover them, but untagged new tracks remain invisible within systems that depend on tags to give search results or recommendations. Secondly, recent research [12] has highlighted the correlation between a listener's liking for a particular track and their willingness to supply a rating for it: listeners are much more likely to rate a track which they like or (somewhat less often) dislike strongly. Ratings for tracks that are new to a particular listener are therefore not missing at random (NMAR). We can expect a similar relationship to exist for tagging, with tracks that provoke only mild feelings of affection in their listeners remaining sparsely-tagged, even if they have obvious characteristics that could be described in words. In particular we expect that there will be a clear difference between the distribution of tags for tracks by mainstream and by new or niche artists.

This uneven distribution of tags between 'haves' and 'have-nots' can be clearly observed in our dataset, as illustrated in Fig. 3, which shows the number of artists as a function of the mean number of tags applied to their tracks. Roughly a third of our 5,265 artists have received no tags for any of their tracks, while even amongst the artists with tagged tracks, roughly a third have no more than five distinct tags per track on average. The cold start and NMAR issues evident here will give real-world music recommendation or search systems based on tags an inbuilt conservative bias towards tracks by well-known and well-liked artists. While this is a reasonable starting point for a usable system, the ability to suggest a large variety of tracks, in particular including little-known music, is clearly also valuable. This provides a practical motivation to extend our models by incorporating information drawn directly from the audio signal. It also suggests a realistic framework for evaluating the contribution of such audio information to both the quality and variety of results returned to a set of search queries: we develop this in Section V.

Fig. 3. Artist tag distribution (number of artists against mean tags per track: 0, 1-5, 6-10, 11-30, >30).

III. CREATING A VOCABULARY OF MUSWORDS
In [8] we applied standard information retrieval models to the tag document-term matrix N and found that they had extremely attractive properties. Even the simplest vector space models position tracks in a space which is extremely well-organised by artist and genre, while latent semantic models can learn a wide range of familiar and readily meaningful semantic aspects relating to genre, nationality, era, mood, etc. Our aim here is to incorporate audio information into these models by representing audio features as a set of "audio words" extending the vocabulary of conventional words. A simple method of this kind, using vector quantisation (VQ) to discretise the features and treating the resulting VQ codebook as the vocabulary of audio words, was first proposed in a somewhat different context by Vignoli and Pauws in [13], where a discrete representation was chosen as the basis of a similarity metric for audio tracks because of its computational efficiency in relation to existing methods. In [13], a single Self-Organising Map (SOM) trained on features drawn from all tracks in the collection to be indexed was used for VQ. Features from each track were mapped onto the indices of their best-matching SOM units, and the indices for each track recorded in a histogram. A distance between tracks could then be computed by comparing histograms with a suitable measure: Vignoli and Pauws proposed Kullback-Leibler divergence. We investigated this representation in comparison to a number of other lightweight audio similarity measures in [14]. Despite finding a more effective distance measure to compare the histograms than that used by Vignoli and Pauws, our results showed that tracks were poorly organised in the resulting similarity space: in particular using this discretisation degraded results in comparison to similarity measures computed directly on the underlying features. A particular concern in constructing our vocabulary of audio muswords here is that tracks should be no worse organised by muswords in a simple vector space model than when using a state-of-the-art similarity measure directly on the features.
Audio features intended to model perceptual characteristics of music have been widely studied in the context of automatic genre classification, with features for a particular track typically modelled as a so-called bag-of-frames, i.e. all frames in the track are modelled but with no consideration of their temporal sequence. While the bag-of-frames (BOF) model works well for classification of non-musical audio such as natural ambient soundscapes, detailed studies by Aucouturier in [15] and [16] highlight its shortcomings in relation to music. In particular Aucouturier observes ([15], p.889) that with BOF algorithms,

"frames contribute to the simulation of the auditory sensation in proportion of their statistical predominance in the global frame distribution. In other words, the perceptive saliency of sound events is modeled as their statistical typicality... The above-presented results establish, as expected, that the mechanism of auditory saliency implicitly assumed by the BOF approach does not hold for polyphonic music signals: For instance, frames in statistical minority have a crucial importance in simulating perceptive judgments."

Aucouturier hypothesises that higher-level features are required to improve classification performance on musical audio. In our work, this problem is compounded by the obvious mismatch between semantics and either individual audio frames or track-level models. While fully addressing these issues remains well beyond the scope of this paper, we use them to motivate a novel approach to audio feature modelling, based on an initial step in which we identify regions of interest within each track. We make the following simple assumptions:
1) semantics apply naturally to music at the phrase level (a single track can contain both harsh and gentle sections);
2) semantics are associated with particular events within the music (rather than with individual audio frames);
3) significant musical events will be perceptually prominent by design (both composer and performer devote their skill to bringing this about).
We consequently extract muswords for a track by first identifying musical events within it, and then discretising timbral and rhythmic features for each region found. We note that this perspective differs from previous work on semantic music search and annotation, in which semantics are associated either with every frame of audio [17], [3] or with randomly selected segments [5].

A. Finding regions of interest

A number of methods have been proposed to find representative thumbnail segments of musical audio tracks, typically based on a first step in which the repetition structure of the track is estimated [18], [19], [20], [21], [22], [23]. We review these approaches in our own contributions to this literature [24], [25], [26]. While some of these structural segmentation algorithms have been shown to be effective in locating chorus sections in conventional pop tracks, notably [20], they are not suitable for our purposes here, in particular because the
initial analysis of repetition structures within a track is too computationally expensive to scale to large music collections. Assumption (3) above, on the other hand, suggests a straightforward and computationally scalable method to locate musical events by finding perceptually prominent regions of interest within the signal. We generate candidate regions by finding local maxima in a boundary function which compares perceptual features in a short "present" window with those in its "history", representing everything heard so far in the song. The length of the window is chosen to correspond roughly to the length of a short phrase of music.

Fig. 4 shows an overview of the process. We first extract perceptually-motivated audio features for the whole track. We then pass a fixed-length window along the track, comparing the distribution of features in the window to their distribution in the time-decayed history (i.e. from the beginning of the track to the start of the window) with a probabilistic distance measure. The distance of the window from its history gives us a boundary function, expressing the contrast between them, and consequently, given assumption (3), the likelihood of an event beginning at the start of the window. We smooth the boundary function with a median filter to eliminate noise from local contrast, and peak-pick to give a set of candidate event start times. Finally we normalise for the degree of local contrast within each track by discarding candidates whose boundary function is less than the mean value over the whole track. We return windows beginning at each of the remaining event start times as the track's regions of semantic interest.

Fig. 4. Locating regions of interest. From top to bottom: the audio signal; perceptual features (MFCCs); the moving window and its Hamming-decayed history; the unsmoothed boundary function; the smoothed, mean-subtracted boundary function and the found event start times.
In our current implementation we use the first twenty Mel-Frequency Cepstral Coefficients (including the 0-th coefficient) as our perceptual audio features, extracted from audio downsampled to 22.05kHz and mixed to mono, with a frame and hop size of 4096 samples. Our moving window has a length of 5 seconds. We estimate the distribution of MFCCs in the window p and the history q by fitting a single Gaussian to features in each of them, weighting features in the history with a half-Hamming window extending back to the start of the track, so that features from the distant past are gradually "forgotten". We measure the distance between the two Gaussians with a symmetrised Kullback-Leibler divergence

2KL_s(p||q) = 2KL(p||q) + 2KL(q||p)
            = \mathrm{tr}(\Sigma_q^{-1}\Sigma_p + \Sigma_p^{-1}\Sigma_q) - 2d + (\mu_p - \mu_q)^T(\Sigma_q^{-1} + \Sigma_p^{-1})(\mu_p - \mu_q)   (2)

where the two Gaussians are given by p(x) = \mathcal{N}(x; \mu_p, \Sigma_p) and q(x) = \mathcal{N}(x; \mu_q, \Sigma_q), and d is the dimensionality of the features. This boundary function is smoothed with a median filter of length 2 seconds. Finally after peak-picking we prune candidates that are within two window lengths of each other, retaining the one with the higher boundary function value. Each region of interest found in a track is mapped onto muswords for independent vocabularies of timbral and rhythmic characteristics.
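The following numpy sketch shows one way to compute the divergence of (2) and the resulting boundary function. The helper names, the diagonal regularisation of the covariances, and the default window sizes (roughly 27 and 11 frames, corresponding to the 5 s window and 2 s median filter at the hop size above) are our own assumptions rather than the authors' code; peak-picking and the pruning of nearby candidates are omitted.

import numpy as np
from scipy.ndimage import median_filter
from scipy.signal.windows import hamming

def fit_gaussian(frames, weights=None):
    """Weighted mean and (regularised) covariance of MFCC frames (rows)."""
    w = np.ones(len(frames)) if weights is None else weights
    w = w / w.sum()
    mu = w @ frames
    centred = frames - mu
    cov = (centred * w[:, None]).T @ centred + 1e-6 * np.eye(frames.shape[1])
    return mu, cov

def sym_kl(p, q):
    """2*KL_s between Gaussians p = (mu_p, cov_p) and q = (mu_q, cov_q), Eq. (2)."""
    (mu_p, cov_p), (mu_q, cov_q) = p, q
    d = len(mu_p)
    icov_p, icov_q = np.linalg.inv(cov_p), np.linalg.inv(cov_q)
    diff = mu_p - mu_q
    return (np.trace(icov_q @ cov_p + icov_p @ cov_q) - 2 * d
            + diff @ (icov_q + icov_p) @ diff)

def boundary_function(mfccs, win=27, smooth=11):
    """Contrast between a sliding 'present' window and its Hamming-decayed history.

    mfccs: (n_frames, n_coeffs) array for one track.
    """
    n = len(mfccs)
    bf = np.zeros(n)
    for start in range(win, n - win):
        history = mfccs[:start]
        # half-Hamming weights: frames from the distant past are gradually "forgotten"
        weights = hamming(2 * start)[:start]
        p = fit_gaussian(mfccs[start:start + win])
        q = fit_gaussian(history, weights)
        bf[start] = sym_kl(p, q)
    bf = median_filter(bf, size=smooth)
    return bf - bf.mean()   # candidate event starts are kept only where this is positive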
B. Creating timbre muswords

Our underlying timbral feature for each region of interest is the same feature that we used when computing the boundary function for event-finding described above, i.e. the mean and variance of the first twenty MFCCs. In this Subsection we describe two alternative methods of representing these features as muswords. The methods are evaluated comparatively in Section VI-A.

1) VQ method: We concatenate means and variances into a single 40-dimensional feature for each region of interest. Following our work in [14], we train a single SOM on features from our collection of tracks, first normalising each feature dimension to have zero mean and unit variance. We use a SOM with 1000 hexagonal units arranged in a rectangular 50 x 20 grid, so that each unit represents one timbre musword. A single musword for each region of interest in a track is then created by finding its best matching unit in the trained SOM.

2) Random Vocabulary method: When we listened back to them, regions mapped to the same musword by the VQ method were often surprisingly disparate, suggesting that the VQ was not capturing coherent timbral states effectively. We therefore developed an alternative mapping based closely on the timbral distance measure in (2), which is known to be relatively well-behaved [14]. We first select 1000 regions of interest at random from our collection of tracks, and consider these directly as comprising our vocabulary of timbre muswords. We then map a region of interest with features x not onto integer counts, but instead onto a vector of continuous relevance scores \{r_i^{(x)}\} with r_i^{(x)} \in (0, 1] for all i, based on the distance of the region to each musword in the vocabulary. The score for musword i is given by

r_i^{(x)} = \frac{1}{1 + KL_s(x||y_i)}   (3)

where y_i are the features for musword i, and KL_s(\cdot||\cdot) is the symmetrised Kullback-Leibler divergence given in (2). Finally we compute the relevance scores for a track by summing the scores for all of its regions of interest.

Because each region of interest is mapped onto a score for every musword in the timbre vocabulary, in general this representation is no longer sparse. This is a disadvantage for IR models, where the computational complexity is proportional to the total number of non-zero (mus)word counts. We therefore increase sparsity by zeroing small scores for timbre muswords in this representation. Specifically we set scores for track j to zero when they are less than \tau \max_i r_i^{(j)}, where r_i^{(j)} is the track's total relevance score for musword i. We discuss the choice of the threshold \tau in the next Section.
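A sketch of the Random Vocabulary scoring of (3) and the τ-sparsification just described, treating each region as a diagonal Gaussian (mean and variance vectors), which matches the feature above; the data structures and function names are illustrative assumptions, and the factor of 0.5 converts 2KL_s to KL_s.

import numpy as np

def sym_kl_diag(mu_p, var_p, mu_q, var_q):
    """2*KL_s of Eq. (2) for Gaussians with diagonal covariances (variance vectors)."""
    d = len(mu_p)
    diff = mu_p - mu_q
    return (np.sum(var_p / var_q + var_q / var_p) - 2 * d
            + np.sum(diff ** 2 * (1.0 / var_q + 1.0 / var_p)))

def region_scores(x, vocab):
    """Relevance scores r_i = 1 / (1 + KL_s(x || y_i)) of Eq. (3) for one region.

    x and each vocabulary entry are (mean, variance) pairs for the region's MFCCs.
    """
    return np.array([1.0 / (1.0 + 0.5 * sym_kl_diag(x[0], x[1], y[0], y[1]))
                     for y in vocab])

def track_scores(regions, vocab):
    """Total relevance scores for a track: sum over its regions of interest."""
    return sum(region_scores(x, vocab) for x in regions)

def sparsify(scores, tau=0.6):
    """Zero any score below tau times the track's maximum (tau = 0.6 in the text)."""
    return np.where(scores >= tau * scores.max(), scores, 0.0)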
C. Creating rhythm muswords

Our rhythmic feature for each region of interest is the thresholded autocorrelation of an onset detection function introduced by Davies and Plumbley in [27]:
A(l) = \sum_{m=1}^{L} \frac{\tilde{\Gamma}(m)\,\tilde{\Gamma}(m-l)}{|l - L|}, \quad l = 1, \ldots, L   (4)

where L = 144 samples and \tilde{\Gamma}(\cdot) is an adaptively-thresholded onset detection function based on complex spectral difference (see [27] and [28] for full details). This feature was found in [27] to give good results in a classification task for different styles of ballroom dance music. We follow the VQ approach as for timbre muswords, training a 50 x 20 SOM on these 144-dimensional features and mapping each region of interest onto its best matching unit. By inspection, the VQ method appears to be reasonably successful for rhythm, i.e. regions mapped to the same unit frequently have the same tempo and rhythmic character.
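A small numpy sketch of the autocorrelation of (4), assuming the adaptively-thresholded onset detection function Γ̃ for a region is already available as a 1-D array (its computation follows [27], [28] and is not reproduced here); the handling of the l = L term, where the denominator vanishes, is our own choice.

import numpy as np

def rhythm_feature(odf, L=144):
    """Thresholded-ODF autocorrelation A(l), l = 1..L, as in Eq. (4).

    odf: 1-D adaptively-thresholded onset detection function (length >= L).
    """
    a = np.zeros(L)
    for l in range(1, L + 1):
        if l == L:
            a[l - 1] = 0.0           # denominator |l - L| is zero; no valid products either
            continue
        prod = odf[l:L] * odf[:L - l]   # Gamma(m) * Gamma(m - l) for valid m
        a[l - 1] = prod.sum() / abs(l - L)
    return a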
IV. MUSIC RETRIEVAL WITH IR MODELS
In principle we can integrate descriptions and audio features simply by concatenating words and muswords into a single extended vocabulary, so our track representation is a bag-of-words-and-muswords (BOW+M). We observe, however, that because we are not really counting words in documents, the "counts" for the two types of word in this representation have dissimilar, and essentially arbitrary, ranges. A consequence of this is that we have to choose a scaling for counts for muswords relative to those for words. We discuss this in Section VI-B. We apply two different models to the document-term matrix N, which we now assume contains counts for words and muswords for each track in our collection.
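As a concrete picture of the joint representation, the sketch below concatenates a track's word counts with its musword scores after rescaling the musword block to a chosen multiple of the mean word count (the scale factor examined in Section VI-B). In the paper the means are collection-wide; estimating them from a single track, and the function signature itself, are simplifying assumptions.

import numpy as np

def join_bow_and_muswords(word_counts, musword_scores, scale_factor=1.0,
                          mean_word_count=None, mean_musword_score=None):
    """Concatenate word counts and musword scores for one track into a BOW+M vector."""
    w = np.asarray(word_counts, dtype=float)
    m = np.asarray(musword_scores, dtype=float)
    # collection-wide means can be passed in; fall back to per-track means otherwise
    mean_w = w.mean() if mean_word_count is None else mean_word_count
    mean_m = m.mean() if mean_musword_score is None else mean_musword_score
    if mean_m > 0:
        m = m * (scale_factor * mean_w / mean_m)
    return np.concatenate([w, m])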
A. Vector space model

In the well-known vector space model [29], a weighting scheme is applied to the entries of the document-term matrix N, and a distance measure between vectors of weighted counts \hat{n}(t, w) is chosen as the matching function between documents (tracks) and queries. Queries can be either free combinations of words or, in the query by example scenario characteristic of music applications such as playlist generation and recommendation, tracks themselves, represented by their term vectors, i.e. their entire vectors of weighted counts. We use the standard tf-idf (term frequency - inverse document frequency) weighting

\hat{n}(t, w) = n(t, w) \log \frac{N}{df_w}   (5)

where N is the total number of tracks and df_w is the number of tracks tagged with word w. To compare queries and documents we use a standard matching function, cosine distance

s(t, q) = \frac{\sum_w \hat{n}(t, w)\,\hat{n}(q, w)}{\sqrt{\sum_w \hat{n}(t, w)^2}\,\sqrt{\sum_w \hat{n}(q, w)^2}}   (6)
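The weighting of (5) and matching function of (6) amount to a few lines of numpy; the dense tracks-by-terms matrix and the guard against zero document frequencies below are purely illustrative (a sparse matrix would be used for a real collection).

import numpy as np

def tfidf(counts):
    """Apply the tf-idf weighting of Eq. (5) to a (tracks x terms) count matrix."""
    n_tracks = counts.shape[0]
    df = np.count_nonzero(counts, axis=0)        # document frequency of each term
    idf = np.log(n_tracks / np.maximum(df, 1))   # guard against unused terms
    return counts * idf

def cosine_matches(weighted, query_vec):
    """Cosine matching of Eq. (6) between a query vector and every track."""
    norms = np.linalg.norm(weighted, axis=1) * np.linalg.norm(query_vec)
    sims = weighted @ query_vec / np.maximum(norms, 1e-12)
    return np.argsort(-sims)                     # track indices, best match first

# Query by example: a track's own tf-idf vector is used as the query, e.g.
#   weighted = tfidf(N); ranking = cosine_matches(weighted, weighted[query_track])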
B. Aspect model

The aspect model, also known as Probabilistic Latent Semantic Analysis (PLSA) [30], is a simple probabilistic model in which words are associated with documents via latent semantic classes or aspects. The aspects, representing base concepts for the particular document domain, are learned by fitting the model to a set of training documents. In the aspect model represented graphically in Fig. 5, we associate a latent class variable z \in Z = \{z_1, ..., z_K\} with each occurrence of a word w \in W = \{w_1, ..., w_M\} in the tags for track t \in T = \{t_1, ..., t_N\}. The model can then be defined generatively as follows:
• select a track t with probability P(t),
• select a latent class z with probability P(z|t),
• select a word w with probability P(w|z).
The joint probability model for the observed data is given by

P(t, w) = P(t)P(w|t) = P(t) \sum_{z \in Z} P(w|z)P(z|t)   (7)

Fig. 5. Aspect model (graphical model linking tracks t, latent classes z and words w).
To fit the model to a collection of training tracks we maximise the log-likelihood

L = \sum_{t \in T} \sum_{w \in W} n(t, w) \log P(t, w)   (8)
  = \sum_{t \in T} n(t) \left[ \log P(t) + \sum_{w \in W} \frac{n(t, w)}{n(t)} \log \sum_{z \in Z} P(w|z)P(z|t) \right]

where n(t) is the total number of words in tags for track t, using the Expectation Maximization (E-M) algorithm, alternating the following steps [31]:

E-step:

P(z|t, w) = \frac{P(w|z)P(z|t)}{\sum_{z'} P(w|z')P(z'|t)}   (9)

M-step:

P(w|z) = \frac{\sum_t n(t, w)P(z|t, w)}{\sum_{t, w'} n(t, w')P(z|t, w')}   (10)

P(z|t) = \frac{\sum_w n(t, w)P(z|t, w)}{n(t)}   (11)
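A compact illustration of the E-M updates (9)-(11), together with the folding-in step used below for validation tracks and queries (word probabilities held fixed); the dense (tracks x words x aspects) posterior array, the random initialisation and the fixed iteration counts are for clarity only and ignore the sparsity that gives the O(RK) complexity discussed below.

import numpy as np

def train_plsa(N, K, iters=100, seed=0):
    """Fit P(w|z) and P(z|t) to a (tracks x words) count matrix N by E-M."""
    rng = np.random.default_rng(seed)
    T, W = N.shape
    p_w_z = rng.random((K, W)); p_w_z /= p_w_z.sum(axis=1, keepdims=True)
    p_z_t = rng.random((T, K)); p_z_t /= p_z_t.sum(axis=1, keepdims=True)
    for _ in range(iters):
        # E-step (9): posterior P(z|t,w), shape (T, W, K)
        joint = p_z_t[:, None, :] * p_w_z.T[None, :, :]
        p_z_tw = joint / np.maximum(joint.sum(axis=2, keepdims=True), 1e-12)
        # M-step (10): P(w|z) proportional to sum_t n(t,w) P(z|t,w)
        num_wz = np.einsum('tw,twk->kw', N, p_z_tw)
        p_w_z = num_wz / np.maximum(num_wz.sum(axis=1, keepdims=True), 1e-12)
        # M-step (11): P(z|t) = sum_w n(t,w) P(z|t,w) / n(t)
        p_z_t = (np.einsum('tw,twk->tk', N, p_z_tw)
                 / np.maximum(N.sum(axis=1, keepdims=True), 1e-12))
    return p_w_z, p_z_t

def fold_in(n_q, p_w_z, iters=20):
    """Learn P(z|q) for an unseen track or text query q with P(w|z) held fixed."""
    K = p_w_z.shape[0]
    p_z_q = np.full(K, 1.0 / K)
    for _ in range(iters):
        joint = p_z_q[:, None] * p_w_z                          # shape (K, W)
        p_z_qw = joint / np.maximum(joint.sum(axis=0, keepdims=True), 1e-12)
        p_z_q = (p_z_qw * n_q[None, :]).sum(axis=1) / max(n_q.sum(), 1e-12)
    return p_z_q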
We avoid overfitting the training data by early stopping, based on the likelihood of a validation set of tracks which we hold out from the training set. After each iteration we fold in the validation tracks to learn their aspect probabilities P(z|t). Folding in is achieved as follows: we perform a fixed number of E-M iterations on P(z|t) for tracks t in the validation set, following (9) and (11), but with the word probabilities P(w|z) held fixed to the values learned from the main training set. We then compute the log-likelihood of the validation set according to the model, stopping when it fails to increase from the previous iteration of the main E-M process. In practice the E- and M-steps can be interleaved, giving training a computational complexity of O(RK), where R is the number of observed document/term pairs, i.e. the number of non-zero entries of N.

To do retrieval with a trained aspect model, we first fold in a text query or a track outside the training set q, following the same procedure used on the validation set, to compute its aspect probabilities P(z|q). We can then use cosine distance as our matching function between the K-dimensional vectors of aspect probabilities. The formulation of the aspect model also makes it possible to use a probabilistic similarity measure, estimating P(q|t) directly for each track t in the collection.

C. Training an aspect model on words and muswords

In [8] we found that the aspects learned by models trained on words from tags were semantically coherent: high probability words for a given aspect clearly related to a common domain concept, such as a genre, era, nationality, particular artist, etc. We therefore experimented with two methods of training aspect models on words and muswords. As well as conventional training on the joint vocabulary, we also implemented a two-stage training method, as suggested in [32] where aspect models are applied to image annotation. In the two-stage training, semantic aspects are first learned by training on words only; the P(z|t) are then held fixed during a further set of E-M iterations in which the P(w|z) are learned for the muswords. Finally the word and musword probabilities P(w|z) are weighted by the total word and musword counts respectively, and normalised to sum to unity. This two-stage training ensures that the aspects remain semantically coherent,
while further tracks, particularly those that are sparsely- or untagged, can be folded in to the model using both words and muswords.

V. EXPERIMENTAL FRAMEWORK

We evaluate within a query by example framework, to learn to what extent the representation of tracks within each model respects traditional catalogue organisation by artist and genre. To allow comparison with previous work, in [8] we selected a test set T of 1561 tracks from our full dataset to replicate the experimental set-up used in a series of influential papers following [33], in which artist-artist similarities were calculated for a set of 224 well-known artists split equally over 14 mainstream genres. The genre labels for each artist in this list were chosen by comparing editorial labels from the All Music Guide, Yahoo! LAUNCHcast and other sources, and can therefore be considered authoritative in comparison with individual tags [34]. Audio was not available for all the tracks in T, so for our experiments here we pruned it to create a reduced set Ta of 928 test tracks with audio, with between 25 and 98 tracks for each of the 14 labelled genres. We evaluate artist retrieval over 105 artists with between 4 and 12 tracks each in Ta.

In order to study the ability of our models to generalise to unseen tracks, we select a training set which has no artists in common with the test set. In a practical application this scenario would arise if it was undesirable to retrain the model even following the arrival of tracks by hundreds of new artists, perhaps because of computational expense or the difficulty of making updates to data used in a live search engine. More importantly, it provides a good test of whether learned aspects capture significant basic musical concepts, rather than depending on artist names and associated highly specific vocabulary. We restrict the training set to tracks tagged with at least 30 distinct words, resulting in a set ADW of 5,064 artist-disjoint, well-tagged tracks. Vocabulary sizes and data densities for the test and training sets, after tokenizing tags for all tracks with a standard stop-list, are shown in Table II. We also show the proportion of the vocabulary of tags for the test set which is indexed in the training set: two thirds of the words applied to the test tracks do not occur in tags for the artist-disjoint training tracks, showing the extent of the artist-specific vocabulary which we exclude when learning models from the training sets. This makes us reasonably confident that models learned on the training set which continue to perform well on the test set have indeed captured some genuine underlying semantics of tags for music.

TABLE II
Test and training sets

                        tracks   vocab. size   data density (%)   % of test vocab.
Ta (test)               928      8946          0.50               100
ADW (artist-disjoint)   5064     25591         0.33               35

To simulate a more realistic scenario in which tracks for some artists are sparsely-tagged, as discussed in Section II, we use a cross-validation framework as follows:
1) the test set artists are split into three folds at random.
For each fold in turn:
2) the tag words for each track by the artists in the current fold are sorted by their count;
3) all but the top κ words for each track are masked by setting their counts to zero;
4) query by example is evaluated for all tracks in the test set.

The three-fold harness both allows cross-validation and reproduces approximately the distribution of tags which we observed in our full dataset, in which tracks by a third of all artists have been tagged with only some small number κ of words. A possible consequence of the uneven distribution of tags is that search results may effectively segregate tracks by sparsely- and well-tagged artists. Besides means and standard errors for genre and artist retrieval precision over the three folds, we therefore report a measure of track integration: the proportion of sparsely-tagged tracks appearing in the top ten search results for well-tagged query tracks, and vice versa.

VI. RESULTS

In general we show per-word mean Average Precision (mAP), averaged over the sets of artist and genre labels. The AP for a particular query is calculated as

AP = \frac{\sum_{r=1}^{N} P(r)\,rel(r)}{R}   (12)

where P(r) is the precision at rank r, rel(r) is 1 if the document at rank r is relevant (i.e. is labelled with the same genre/artist as the query) and 0 otherwise, R is the total number of relevant documents, and N is the total number of documents in the collection. AP therefore measures the average precision over the ranks at which each relevant track is retrieved. The per-word mean AP for a particular genre or artist label is the mAP over all queries labelled with that term. Besides being a standard IR performance metric (which has become consensual in parallel literature in the field of image retrieval), mAP rewards the retrieval of relevant tracks ahead of irrelevant ones, and is consequently an extremely good indicator of how the semantic space is organised by each model. We also show the precision at rank 5 for genre labels, and the r-precision for artist identity, i.e. the precision at rank r, where r is the total number of tracks by the query artist in the collection. These two figures give a measure of the performance at high ranks, reflecting the results that would be seen in practice by the user of a search engine.
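The AP of (12) and the per-label mean AP can be computed directly from a ranking; the ranking and relevance structures below are illustrative.

import numpy as np

def average_precision(ranked_ids, relevant_ids):
    """Eq. (12): sum of P(r) over the ranks of relevant documents, divided by R."""
    relevant = set(relevant_ids)
    hits, total = 0, 0.0
    for r, doc in enumerate(ranked_ids, start=1):
        if doc in relevant:
            hits += 1
            total += hits / r          # precision at a rank where rel(r) = 1
    return total / len(relevant) if relevant else 0.0

def mean_average_precision(per_query_results):
    """mAP for one genre/artist label: the mean AP over all queries carrying it.

    per_query_results: iterable of (ranked_ids, relevant_ids) pairs.
    """
    aps = [average_precision(ranked, rel) for ranked, rel in per_query_results]
    return float(np.mean(aps)) if aps else 0.0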
A. The musword representation

We evaluate our musword representations in a simple vector space model of the bag-of-muswords (BOM) for each track, with tf-idf weighting, where document frequencies for each musword are computed over the test set. Table III gives average genre and artist retrieval precision figures using each track in the test set as the query in a query-by-example scenario. The best BOM results are shown in bold.

Besides comparing the BOM with timbre muswords created by the VQ and Random Vocabulary (RV) methods described in Section III, we give results for three baseline methods. For our primary baseline we evaluate content-based retrieval using a state-of-the-art distance measure directly on the underlying timbral audio features: we use symmetrised Kullback-Leibler divergence on single Gaussians fitted to MFCCs from the whole of each track [35], [14]. We also show results for a random baseline, and for a BOW vector space model.

TABLE III
BOM retrieval performance

                              genre prec. at 5   genre mAP   artist r-prec.   artist mAP
BOM: rhythm                   0.322              0.121       0.233            0.203
BOM: VQ timbre                0.387              0.168       0.251            0.228
BOM: VQ timbre + rhythm       0.379              0.165       0.247            0.227
BOM: RV timbre                0.462              0.203       0.286            0.269
BOM: RV timbre + rhythm       0.439              0.196       0.278            0.256
baseline: random              0.262              0.099       0.208            0.175
baseline: timbre similarity   0.461              0.187       0.304            0.288
baseline: BOW                 0.939              0.774       0.581            0.629

The results in Table III show that timbre muswords created by the RV method are significantly more effective than those created by VQ. The organisation of our test tracks in a simple BOM model using these muswords is similar to using a state-of-the-art similarity measure directly on the underlying features: genre retrieval is marginally better in the BOM model, and artist retrieval slightly worse. Rhythm muswords, while inducing some organisation on the collection when compared with a random baseline, reduce retrieval performance, and we therefore do not use them in our remaining experiments.

Fig. 6 shows retrieval results using timbre muswords created by the RV method, with the counts produced by this method sparsified to varying degrees. As discussed in Section III-B, this method creates distance-based soft counts for each track for every musword in the vocabulary, causing problems of scalability. In this experiment we reduce the data density by zeroing all counts for each track which are less than some proportion τ of the highest one. Fig. 6 shows that we can reduce musword data density to under 10% with no significant loss in retrieval performance: in practice we set τ = 0.6, giving a data density of 7.4%, i.e. on average we retain only 74 of the original 1000 muswords for each track.

Fig. 6. BOM retrieval performance vs data density. The sparsification threshold τ takes values 0.9, 0.8, ..., 0.1.

Fig. 7. BOW+M retrieval performance (genre and artist precision against scale factor).

B. Scaling word counts

In our BOW+M representation, count values for words and muswords are computed by different means, and have no natural scaling with respect to one another. Specifically our counts for conventional words depend on Last.fm's unexplained normalisation of the number of times a tag has been applied to any particular track, while our counts for muswords result from the particular discretisation method used to map features onto muswords. Fig. 7 shows retrieval performance as we vary the scaling of counts in a simple vector space BOW+M model: the scale factor shown is the ratio between the mean count for muswords and that for words. A scale factor of zero corresponds to
discarding muswords completely, i.e. using a baseline BOW model. In Table IV we illustrate the top ten search results returned by the joint vector space model for some example query tracks at several scale factors. Using a scale factor of 1.0, the retrieval performance is slightly lower than the BOW baseline, but, as the examples in Table IV show, search results in this model are largely acceptable, although by no means identical to those returned by searching on words only. With a scale factor of 3.0, however, objective retrieval performance is reduced significantly, and the search results include more surprises. A more detailed examination of the examples in the third column of Table IV is informative. At first glance, the jazz tracks returned for Joni Mitchell's 'Both Sides Now' are poor matches, because Mitchell is most often labelled as a folk singer (as she is in our genre groundtruth). 'Both Sides Now', however, is the title track of an album of classic jazz songs, and the pianist on the album is none other than Herbie Hancock, whose 'Tell Me a Bedtime Story' is the fourth result here. Radiohead's brit rock classic 'Karma Police' is
a slow, minor key song with a bittersweet character, a guitar and piano accompaniment, with prominent cymbal hits in the mix. Out-of-genre search results for this track include a pop song, Robbie Williams' 'She's the One', and a classic punk track, 'London Calling' by The Clash: both of these, however, share some obvious musical characteristics with the query. The remaining unexpected results, on the other hand, are plainly poor, such as a Mozart mass movement returned for a track by Moby, or ABBA and the death metal band Sepultura to match Sonic Youth's guitar-laden experimental noise rock. These observations, while of course subjective, illustrate the effect in practice of retrieving tracks that share muswords with a query track: we see pleasing results that are not returned by a search on words alone, but at the cost of many false positives.

TABLE IV
Example search results

scale factor = 0.0:
Joni Mitchell: Both Sides Now Joni Mitchell: Free Man In Paris Joni Mitchell: You Turn Me On I'm A Radio Leonard Cohen: Sisters of Mercy Leonard Cohen: Story of Isaac Leonard Cohen: Famous Blue Raincoat Pete Seeger: Little boxes Leonard Cohen: First We Take Manhattan Bob Dylan: Mr. Tambourine Man Steeleye Span: All Around My Hat Radiohead: Karma Police Weezer: No Other One Radiohead: We Suck Young Blood Radiohead: A Wolf at the Door Radiohead: A Wolf At The Door Foo Fighters: Up in Arms Sonic Youth: Tunic (Song for Karen) Smashing Pumpkins: Disarm Weezer: Beverly Hills Jane's Addiction: Just Because Moby: My Weakness Aphex Twin: Xtal Moby: My Beautiful Blue Sky Moby: Natural Blues Moby: Sunday (The Day Before My Birthday) Aphex Twin: Avril 14th Moby: Honey Aphex Twin: Kladfvgbung Micshk Kraftwerk: Computerliebe Aphex Twin: Btoum-Roumada Sonic Youth: 'Cross the Breeze Sonic Youth: Tunic (Song for Karen) Sonic Youth: Disappearer Sonic Youth: Master-Dik Sonic Youth: Tunic Sonic Youth: Mary-Christ Radiohead: Karma Police Weezer: No Other One Weezer: Beverly Hills Radiohead: A Wolf at the Door Slayer: Jesus Saves Slayer: Raining Blood Slayer: Angel of Death Slayer: Altar of Sacrifice Anthrax: Caught in a Mosh Slayer: The Antichrist Sepultura: Endangered Species Anthrax: I Am the Law Anthrax: Got the Time Sepultura: Itsári

scale factor = 1.0:
Joni Mitchell: Both Sides Now Joni Mitchell: You Turn Me On I'm A Radio Joni Mitchell: Free Man In Paris Leonard Cohen: Sisters of Mercy Leonard Cohen: Famous Blue Raincoat Pete Seeger: Little boxes Leonard Cohen: Bird on the Wire Leonard Cohen: Story of Isaac Steeleye Span: Gaudete Bob Dylan: Blowin' in the Wind Radiohead: Karma Police Weezer: No Other One Radiohead: A Wolf at the Door Radiohead: We Suck Young Blood Foo Fighters: Up in Arms Smashing Pumpkins: Disarm Radiohead: A Wolf At The Door Sonic Youth: Tunic (Song for Karen) Smashing Pumpkins: Bullet With Butterfly Wings Weezer: Beverly Hills Moby: My Weakness Aphex Twin: Xtal Moby: My Beautiful Blue Sky Aphex Twin: Avril 14th Moby: Natural Blues Moby: Sunday (The Day Before My Birthday) Aphex Twin: Kladfvgbung Micshk Moby: Honey Aphex Twin: Btoum-Roumada Underworld: Mmm Skyscraper I Love You Sonic Youth: 'Cross the Breeze Sonic Youth: Tunic (Song for Karen) Sonic Youth: Tunic Foo Fighters: Best of You Sonic Youth: Disappearer Weezer: No Other One The Smiths: The Headmaster Ritual Foo Fighters: Burn Away Smashing Pumpkins: Bullet With Butterfly Wings Sonic Youth: Mary-Christ Slayer: Jesus Saves Slayer: Raining Blood Slayer: Altar of Sacrifice Slayer: Angel of Death Slayer: The Antichrist Anthrax: Caught in a Mosh Anthrax: Got the Time Sepultura: Itsári Anthrax: Madhouse Megadeth: A Tout Le Monde

scale factor = 3.0:
Joni Mitchell: Both Sides Now Joni Mitchell: You Turn Me On I'm A Radio Thelonious Monk: Thelonious Herbie Hancock: Tell Me A Bedtime Story Dave Brubeck: Blue Rondo a la Turk Leonard Cohen: Bird on the Wire Steeleye Span: Gaudete Bob Dylan: Like a Rolling Stone Leonard Cohen: Everybody Knows Leonard Cohen: Famous Blue Raincoat Radiohead: Karma Police The Smiths: There Is a Light That Never Goes Out Smashing Pumpkins: Disarm Radiohead: A Wolf at the Door Weezer: No Other One The Smiths: Last Night I Dreamt... Radiohead: Myxomatosis Robbie Williams: She's the One The Clash: London Calling Robbie Williams: Something Beautiful Moby: My Weakness Aphex Twin: Xtal Moby: My Beautiful Blue Sky Aphex Twin: Avril 14th Aphex Twin: Kladfvgbung Micshk Aphex Twin: Btoum-Roumada Underworld: Mmm Skyscraper I Love You Moby: Natural Blues Moby: Sunday (The Day Before My Birthday) Wolfgang Amadeus Mozart: Agnus Dei Sonic Youth: 'Cross the Breeze Dead Kennedys: Holiday in Cambodia Deep Purple: The Battle Rages On Sonic Youth: Tunic Sonic Youth: Tunic (Song for Karen) Foo Fighters: Best of You Sepultura: Endangered Species ABBA: So Long Foo Fighters: Burn Away The Smiths: The Headmaster Ritual Slayer: Jesus Saves Slayer: Raining Blood Slayer: Altar of Sacrifice Slayer: The Antichrist Slayer: Angel of Death Megadeth: A Tout Le Monde Anthrax: Madhouse Pantera: Strength Beyond Strength Anthrax: Got the Time Pantera: New Level
The scale factor for musword counts serves effectively as a system parameter, controlling the influence of the audio content analysis on search results. Indeed one possibility in a practical search system would be to allow the user to vary this parameter at search time, controlling the balance between audio-based music discovery, with its increased risk of inexplicable 'clunkers', and purely word-based search with its tendency to recommend the obvious.

It is possible to avoid the issue of scaling counts altogether by using more sophisticated models, such as an extended three-way aspect model, in which words and muswords are treated as being generated independently for each track. This has its own problems, however, most significantly a mismatch between our observations and the model structure. The underlying co-occurrence data for such a model is a set of triples; in reality we do not know the association of individual muswords for a track with any of the particular words describing it, because tags are applied to whole tracks rather than to temporal regions of the audio signal. We leave further study of this approach for future work.
C. The effect of tag sparsity

Figs 8-11 show cross-validation results for query by example on our test set using the BOW+M vector space model. The plots show how search results are affected by tag sparsity, and how they vary as we use words only (scale factor = 0), words plus muswords with counts scaled to have the same mean (scale factor = 1), and words plus muswords scaled to have more influence (scale factor = 3). The x-axis shows the number of words remaining after masking to simulate sparse tagging, i.e. all but the indicated number κ of tag words are masked for tracks by the artists in each fold. The rightmost value of each curve corresponds to using all tag words for each track, i.e. it shows performance in the ideal scenario where tag sparsity is not an issue.

We can draw several significant conclusions from these results. Firstly, Figs 8 and 9 show that tracks remain highly organised in a BOW+M model even when tags are scarce: although it helps to have many words for each track, retrieval remains at state-of-the-art levels as long as we have more than one word for each track. Even with only a single word available for a third of our test tracks, performance far exceeds content-based methods, such as the baseline method shown in Table III. This shows the 'wisdom of crowds' in action: by inspection the most frequently applied word in tags for a track is usually an appropriate genre label. Secondly, incorporating muswords into the model can actually increase retrieval performance when only a single word is available for a third of the tracks, as long as the counts are scaled appropriately. In particular artist organisation increases significantly when we introduce muswords, taking advantage of the so-called 'album effect', i.e. the ability of content-based representations to match highly similar tracks.

We see from Figs 10 and 11 that tag sparsity does cause some segregation in the vector space model. In particular we observe that on average there is less than one well-tagged track in the top ten search results for query tracks tagged with only a single word, while untagged tracks can be almost completely segregated from tagged ones. We can reduce segregation by increasing the scale factor for musword counts, but the cost is a major fall in retrieval performance, suggesting that most of the results reducing segregation are spurious anyway.

Fig. 8. BOW+M genre retrieval performance with sparse tags (mAP against number of words unmasked, for scale factors 0, 1 and 3).

Fig. 9. BOW+M artist retrieval performance with sparse tags (mAP against number of words unmasked, for scale factors 0, 1 and 3).

Fig. 10. BOW+M integration with sparse tags: well-tagged queries (integration against number of words unmasked, for scale factors 0, 1 and 3).

Fig. 11. BOW+M integration with sparse tags: sparsely-tagged queries (integration against number of words unmasked, for scale factors 0, 1 and 3).

D. Aspect models

We evaluate a set of aspect models with increasing numbers of aspects in the same three-fold framework. The models were trained on the artist-disjoint training set of 5064 well-tagged tracks ADW. Audio was available for 2824 of these tracks; we trained on all available words and muswords for each training track, scaling musword counts to have the same mean as word counts. For each fold of the test set Ta we mask either all or all but one of the tag word counts for the relevant tracks before folding in the whole test set.

The retrieval results given in Figs 12 and 13 show that aspect models trained by conventional E-M over the joint vocabulary perform poorly. Two-stage training, on the other hand, where we learn the aspects themselves from tag words only, gives retrieval performance only slightly below that of the vector space model, while solving the segregation of well- and sparsely-tagged tracks, as illustrated by Figs 14 and 15. For clarity the plots show mean AP only; with all but one word masked for tracks in each fold, the genre precision at 5 with the two-stage model reached 0.86 while the artist r-precision reached 0.44. We observe further that we achieve these results despite adopting the extreme scenario in which none of our test artists were present in the training set. While this scenario gives us confidence that our models have indeed learned some semantics, in a practical application it can be avoided by a variety of means including training on the whole dataset if computational resources permit, representative subsampling of tracks, vocabulary pruning or incremental training with the use of approximate direct parameter updates if necessary. We find that retrieval performance with aspect models equals or exceeds that of the vector space model when the training set includes tracks by test artists.

Fig. 12. Aspect model genre retrieval performance with sparse tags (1-stage and 2-stage training, against number of aspects).

Fig. 13. Aspect model artist retrieval performance with sparse tags (1-stage and 2-stage training, against number of aspects).

Fig. 14. Aspect model integration with sparse tags: well-tagged queries (1-stage and 2-stage training, against number of aspects).

Fig. 15. Aspect model integration with sparse tags: sparsely-tagged queries (1-stage and 2-stage training, against number of aspects).
words, a timbral similarity metric is employed to smooth word counts by weighted averaging over acoustically similar tracks. Web-mining text for large numbers of tracks suffers from huge vocabulary sizes, even compared with social tags, as irrelevant content is inevitably included in the text to be indexed, making dimension reduction of some kind essential. Timbral similarity is used indirectly to prune the word counts for each track, retaining only the words that discriminate most effectively between a group of timbrally neighbouring tracks and a group of distant ones. The vocabulary that remains after this track-specific pruning is clearly highly fitted to the training set, and external queries have to be folded in by a process of massive expansion. In the current implementation, queries are first submitted to Google, then the top 10 pages returned are downloaded, and all their text aggregated, before finally indexing against the model vocabulary. The model is evaluated in a free text query scenario, using Last.fm tags for each track directly as groundtruth, with best mean AP of 0.26 over a set of 227 test queries including genre and other terms. In [44] an aspect model is trained jointly on audio features and user ratings for tracks, and the trained model is used to rec-
IEEE TRANSACTIONS ON MULTIMEDIA
12
0.35
0.3
integration
0.25
0.2
0.15
0.1 1−stage unmasked queries (1 word) 2−stage unmasked queries (1 word) 2−stage unmasked queries (0 words)
0.05
0
0
Fig. 14.
100
200
300
400 500 600 number of aspects
700
800
900
1000
Aspect model integration with sparse tags: well-tagged queries
0.5 0.45 0.4
integration
0.35 0.3
recommended tracks were completely unrated, and (contrary to our experimental setup) all the tracks in the dataset were used in training, so there is no guarantee that these results will generalise to practical scenarios in which many artists are unseen during training. While there is little other literature applying formal IR systems to music, some interesting recent work extends the classification paradigm which has dominated recent research in music retrieval to a wider vocabulary. In particular in [5] a bank of classifiers is trained to output artist-level autotags for 60 largely genre tags. The individual classifiers output three classes (“a lot ”, “some”, “none”) for the relevance of each tag, with an accuracy ranging from 53% to 82% when their frame-level predictions are aggregated on a persong basis. Combining these autotags with real Last.fm tags produces a small improvement in artist recommendations relative to a groundtruth based on a simple collaborative filtering algorithm. In related work [3] a bank of Gaussian Mixture Models trained on data from questionnaire answers is used to auto-annotate tracks with the 10 most likely terms from a vocabulary of 135 musical concepts given their audio content. The machine annotations have precision of 27% and recall of 16% averaged over the vocabulary, treating a version of the questionnaire data as groundtruth: this is perhaps too low for the output to be useful in a retrieval context.
VIII. CONCLUSIONS
This paper introduces a method of finding regions of interest within a track that, while only a first simple implementation of the approach, leads to an effective discretisation of audio as a vocabulary of muswords. Query by example using these muswords is more successful than with previous discrete representations, equalling the performance of an effective similarity measure applied directly to the underlying audio features. We index a joint vocabulary of conventional words, drawn from social tags, and muswords with vector space and probabilistic aspect models, and demonstrate how a scaling factor for word counts serves as a system parameter controlling the influence of audio over retrieval results; a minimal sketch of such a joint representation is given at the end of this section. These models provide effective retrieval even under realistic conditions of tag sparsity: in particular, retrieval is excellent as long as two or more tags are available for each track, with the inclusion of audio making no significant difference to performance in such cases. Retrieval is improved by indexing audio when fewer tags are available, as is the case in current real-world tagging systems, and indexing audio also helps to avoid segregation between sparsely and well-tagged tracks.
Social tags for music are increasingly being used in research, principally as a direct groundtruth for classification and retrieval tasks [5], [43], [45]. Most existing studies acknowledge that real tags for music are far from being idealised class labels, leading to a perceived need to “normalise away” the subjectivity and informality that in fact typify social tags for music. The methods we develop here, building on our work in [7], [8], outline an approach that can make good use of tags for music as they really are.
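As a concrete illustration of the joint word/musword representation summarised above, the following minimal sketch shows how scaling the tag-word counts controls the relative influence of audio in a cosine-similarity retrieval. The parameter name `alpha` is an assumption, and tf-idf weighting is omitted; this is not our exact implementation.

```python
import numpy as np

def joint_vector(word_counts, musword_counts, alpha=1.0):
    """Concatenate tag-word and musword counts into one joint vector,
    scaling the word counts by alpha (assumed parameter name) to control
    the relative influence of audio; tf-idf weighting is omitted here."""
    return np.concatenate([alpha * np.asarray(word_counts, dtype=float),
                           np.asarray(musword_counts, dtype=float)])

def cosine(u, v):
    """Cosine similarity between two joint vectors."""
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    return float(u @ v / denom) if denom > 0 else 0.0

# Toy example: two tracks with 4 tag-word counts and 3 musword counts each.
a = joint_vector([2, 0, 1, 0], [5, 1, 0], alpha=2.0)
b = joint_vector([1, 1, 0, 0], [4, 0, 1], alpha=2.0)
print(cosine(a, b))
```

Larger values of the scaling factor emphasise the tag words, while small values leave retrieval driven mainly by the muswords.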
Future work includes evaluating the use of our models for music retrieval with free text queries, and extending the aspect model to index words and audio independently.
ACKNOWLEDGMENT
The authors would like to thank Alexei Yavlinsky, Doug Turnbull, Elias Pampalk and Paul Lamere for many useful insights, Matthew Davies for the use of his rhythmic feature extraction code, and the anonymous reviewers for their helpful comments.
REFERENCES
[1] J. Haitsma and T. Kalker, “A highly robust audio fingerprinting system,” in Proc. ISMIR, 2002.
[2] A. Wang, “The Shazam music recognition service,” Commun. ACM, vol. 49, no. 8, pp. 44–48, 2006.
[3] D. Turnbull, L. Barrington, D. Torres, and G. Lanckriet, “Semantic annotation and retrieval of music and sound effects,” IEEE Transactions on Audio, Speech and Language Processing, vol. 16, no. 2, pp. 467–476, February 2008.
[4] B. Whitman, “Learning the meaning of music,” Ph.D. dissertation, MIT, 2005.
[5] D. Eck, P. Lamere, T. Bertin-Mahieux, and S. Green, “Automatic generation of social tags for music recommendation,” in Neural Information Processing Systems Conference (NIPS) 20, 2007.
[6] E. Pampalk, “Computational models of music similarity and their application to music information retrieval,” Ph.D. dissertation, Vienna University of Technology, 2006.
[7] M. Levy and M. Sandler, “A semantic space for music derived from social tags,” in Proc. ISMIR, 2007.
[8] M. Levy and M. B. Sandler, “Learning latent semantic models for music from social tags,” submitted to Journal of New Music Research.
[9] X. Wu, L. Zhang, and Y. Yu, “Exploring social annotations for the semantic web,” in Proc. World Wide Web Conference, 2006.
[10] S. Golder and B. Huberman, “Usage patterns of collaborative tagging systems,” Journal of Information Science, vol. 32, pp. 198–208, 2006.
[11] C. D. Manning, P. Raghavan, and H. Schütze, Introduction to Information Retrieval. Cambridge University Press, 2008.
[12] B. Marlin, R. Zemel, S. Roweis, and M. Slaney, “Collaborative filtering and the missing at random assumption,” in Proc. 23rd Conference on Uncertainty in Artificial Intelligence, 2007.
[13] F. Vignoli and S. Pauws, “A music retrieval system based on user-driven similarity and its evaluation,” in Proc. ISMIR, 2005.
[14] M. Levy and M. Sandler, “Lightweight measures for timbral similarity of musical audio,” in Proc. ACM Multimedia, 2006.
[15] J.-J. Aucouturier, B. Defreville, and F. Pachet, “The bag-of-frames approach to audio pattern recognition: A sufficient model for urban soundscapes but not for polyphonic music,” Journal of the Acoustical Society of America, vol. 122, no. 2, pp. 881–891, 2007.
[16] J.-J. Aucouturier, “Ten experiments on the modelling of polyphonic timbre,” Ph.D. dissertation, University of Paris 6, 2006.
[17] L. Barrington, A. Chan, D. Turnbull, and G. Lanckriet, “Audio information retrieval using semantic similarity,” in Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2007.
[18] N. Maddage, X. Changsheng, M. Kankanhalli, and X. Shao, “Content-based music structure analysis with applications to music semantics understanding,” in Proc. 6th ACM SIGMM International Workshop on Multimedia Information Retrieval, October 2004.
[19] L. Lu, M. Wang, and H. Zhang, “Repeating pattern discovery and structure analysis from acoustic music data,” in Proc. 6th ACM SIGMM International Workshop on Multimedia Information Retrieval, October 2004.
[20] M. Goto, “A chorus-section detecting method for musical audio signals,” in Proc. ICASSP, vol. V, 2003, pp. 437–440.
[21] W. Chai and B. Vercoe, “Music thumbnailing via structural analysis,” in Proc. ACM Multimedia, 2003, pp. 223–226.
[22] J. Paulus and A. Klapuri, “Music structure analysis by finding repeated parts,” in Proc. 1st ACM Audio and Music Computing Multimedia Workshop, 2006.
[23] Y. Shiu, H. Jeong, and C. J. Kuo, “Similarity matrix processing for music structure analysis,” in Proc. 1st ACM Audio and Music Computing Multimedia Workshop, 2006.
[24] M. Levy, M. Sandler, and M. Casey, “Extraction of high-level musical structure from audio data and its application to thumbnail generation,” in Proc. ICASSP, 2006.
[25] M. Levy and M. Sandler, “New methods in structural segmentation of musical audio,” in Proc. European Signal Processing Conference, 2006.
[26] M. Levy and M. Sandler, “Structural segmentation of musical audio by constrained clustering,” IEEE Transactions on Audio, Speech and Language Processing, vol. 16, no. 2, pp. 318–326, 2008.
[27] M. Davies and M. Plumbley, “Exploring the effect of rhythmic style classification on automatic tempo estimation,” submitted to EUSIPCO, 2008.
[28] J. P. Bello, C. Duxbury, M. E. Davies, and M. B. Sandler, “On the use of phase and energy for musical onset detection in the complex domain,” IEEE Signal Processing Letters, vol. 11, no. 6, pp. 553–556, 2004.
[29] G. Salton, A. Wong, and C. S. Yang, “A vector space model for automatic indexing,” Communications of the ACM, vol. 18, no. 11, pp. 613–620, 1975.
[30] T. Hofmann, “Probabilistic latent semantic analysis,” in Uncertainty in Artificial Intelligence, UAI ’99, Stockholm, 1999.
[31] T. Hofmann, “Unsupervised learning by probabilistic latent semantic analysis,” Machine Learning, vol. 42, pp. 177–196, 2001.
[32] F. Monay and D. Gatica-Perez, “Modeling semantic aspects for cross-media image indexing,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2007, IDIAP-RR 05-56.
[33] P. Knees, E. Pampalk, and G. Widmer, “Artist classification with web-based data,” in Proc. ISMIR, 2004.
[34] P. Knees, “Automatische Klassifikation von Musikkünstlern basierend auf Web-Daten,” Master’s thesis, Vienna University of Technology, November 2004.
[35] M. Mandel and D. Ellis, “Song-level features and SVMs for music classification,” in Proc. ISMIR, 2005.
[36] Y. Mori, H. Takahashi, and R. Oka, “Image-to-word transformation based on dividing and vector quantizing images with words,” in Proc. International Workshop on Multimedia Intelligent Storage and Retrieval Management, 1999.
[37] K. Barnard and D. Forsyth, “Learning the semantics of words and pictures,” in Proc. IEEE International Conference on Computer Vision, 2001.
[38] J. Jeon, V. Lavrenko, and R. Manmatha, “Automatic image annotation and retrieval using cross-media relevance models,” in Proc. ACM SIGIR Conference, 2003.
[39] D. Blei and M. Jordan, “Modeling annotated data,” in Proc. ACM SIGIR Conference, 2003.
[40] P. Quelhas, F. Monay, J.-M. Odobez, D. Gatica-Perez, and T. Tuytelaars, “A thousand words in a scene,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 29, no. 9, pp. 1575–1589, 2007.
[41] O. Gillet and G. Richard, “Comparing audio and video segmentations for music videos indexing,” in Proc. ICASSP, 2006.
[42] D. Turnbull, G. Lanckriet, E. Pampalk, and M. Goto, “A supervised approach for detecting boundaries in music using difference features and boosting,” in Proc. International Conference on Music Information Retrieval (ISMIR), September 2007.
[43] P. Knees, T. Pohle, M. Schedl, and G. Widmer, “A music search engine built upon audio-based and web-based similarity measures,” in Proc. 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR ’07), Amsterdam, the Netherlands, July 2007.
[44] K. Yoshii, M. Goto, K. Komatani, T. Ogata, and H. G. Okuno, “An efficient hybrid music recommender system using an incrementally trainable probabilistic generative model,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 16, no. 2, pp. 435–447, February 2008.
[45] G. Geleijnse, M. Schedl, and P. Knees, “The quest for ground truth in musical artist tagging in the social web era,” in Proc. ISMIR, 2007.
Mark Levy (born 1963) is in the MIR team at Last.fm. He studied mathematics and music at Cambridge University, followed by musicology at King’s College London, and more recently computer science at Birkbeck and at Queen Mary, University of London, where he was a Research Assistant in the Centre for Digital Music, and where this research was undertaken. His research interests include music search and recommendation, automatic description and structural segmentation of musical audio. Mark is also well known as a performer on the viola da gamba, having made recordings for most of the major labels and given concerts throughout Europe, and he can often be heard on BBC radio and television, and on the soundtracks of movies including The Governess, A Knight’s Tale and Titus.
Mark Sandler (born 1955) is Professor of Signal Processing at Queen Mary, University of London, and Director of the Centre for Digital Music. He received the BSc and PhD degrees from the University of Essex, UK, in 1978 and 1984, respectively. He has published over 300 papers in journals and conferences. He is a Senior Member of the IEEE, a Fellow of the IEE, and a Fellow of the Audio Engineering Society. He is a two-time recipient of the IEE A. H. Reeves Premium Prize.