Jornada de Seguimiento de Proyectos, 2009
Programa Nacional de Tecnologías Informáticas
Computer Music Semantic Processing (PROSEMUS)
TIN2006-14932-C02

Xavier Serra*
Universitat Pompeu Fabra
Rafael Ramirez**
Universitat Pompeu Fabra
Jose Manuel Iñesta***
Universidad de Alicante

Abstract

The aim of the proposed project is to work towards the automatic semantic description of digital music. The project intends to contribute to bridging the current semantic gap in music information and to apply the results to content-based music processing. Specifically, in this project we have been working on (1) the analysis and manipulation of low-level audio descriptors by spectral modelling analysis and synthesis techniques, (2) the extraction of high-level musical attributes from these low-level audio descriptors, (3) the study and development of pattern recognition and machine learning techniques applied to sequential data for building semantic models of different aspects of music and, based on these semantic models, (4) the development of prototypes of music personalization and postproduction systems.
1 Project aims

With the explosion and rapid expansion of the music collections available on the Internet and, more generally, of music in digital formats, a key challenge in the area of music information management is the automation of the annotation, indexing and organization of music based on its semantic content (Grosky and Zhao, 2002). One of the main deficiencies of current music organization and processing systems is the semantic gap between the low-level features or content descriptors that can usually be obtained automatically and the richness of musical information. The aim of this project is to contribute to filling that semantic gap and thus open the door to a new generation of content-based music processing systems. Specifically, the project intends to:

1.1 Develop techniques for the extraction, analysis and manipulation of low-level audio descriptors based on spectral modelling analysis and synthesis techniques.
* Email: [email protected]
** Email: [email protected]
*** Email: [email protected]
1.2 Develop techniques for the computation of high-level musical attributes from these low-level audio descriptors.

1.3 Develop pattern recognition techniques applied to sequential data for building semantic models of different aspects of music.

1.4 Based on these semantic models, develop prototypes of music mining and postproduction systems.
2 Project progress

2.1 Universitat Pompeu Fabra

Most of the objectives assigned to Universitat Pompeu Fabra (UPF) have already been fulfilled. They mainly involved corpora acquisition, low-level and high-level audio descriptor extraction, and pattern recognition techniques. Here is a brief overview of the tasks carried out.

2.1.1 Corpora acquisition

A corpus of monophonic saxophone and violin audio recordings has been recorded by our technicians at the UPF and used to test our algorithms (both feature extraction and pattern recognition). We have also recorded a large number of isolated saxophone and violin notes for synthesis purposes. In addition, a repository of audio files has been collected and analyzed with the different techniques developed in the project.

2.1.2 Low-level audio descriptor extraction and analysis

We have applied spectral analysis to the computation of low-level frame-based descriptors such as energy and an estimate of the fundamental frequency; a sketch of these computations is given after the list below. The energy descriptor is computed in the spectral domain, using the values of the amplitude spectrum at each analysis frame. In addition, energy is computed in different frequency bands as defined in (Klapuri, 1999), and these values are used by the note segmentation algorithm. For the estimation of the instantaneous fundamental frequency we use a harmonic matching model derived from the Two-Way Mismatch (TWM) procedure (Maher, 1994). After a first test of this implementation, several improvements were made to the original algorithm to deal with some of its errors:
• Peak selection: a peak selection routine has been added in order to eliminate spectral peaks corresponding to noise. The selection is done according to a masking threshold around each of the maximum-magnitude peaks. The shape of the masking threshold depends on the peak amplitude, and uses three different slopes depending on the frequency distance to the peak frequency.
• Context awareness: we take into account previous values of the fundamental frequency estimate and instrument dependencies to obtain a better adapted result.
• Noise gate: a noise gate based on low-level signal descriptors is applied to detect silences, so that the estimation is only performed in non-silent segments of the sound.
2.1.3 High-level musical attributes computation

We compute note-level descriptors using the note boundaries and the low-level descriptor values. The low-level descriptors associated with a note segment are computed by averaging the frame values within that segment. Pitch histograms have been used to compute the note pitch and the fundamental frequency that represents each note segment, following (McNab, 1996); this avoids taking erroneous frames into account when computing the mean fundamental frequency. It is widely recognized that humans listen to and perform music in terms of a number of abstract musical structures. Thus, in order to perform a musical analysis of our corpus, and as a first step towards providing an abstract structure for the recordings under study, we decided to use Narmour's theory of the perception and cognition of melodies (Narmour, 1990; Narmour, 1991) to analyze the performances. Once we segment the audio signal into notes, we characterize each note in terms of its internal features. The proposed intra-note segmentation method is based on the study of the energy envelope contour of the note. Once onsets and offsets are located, we study the instantaneous energy values of the analysis frames corresponding to each note, analyzing the curvature of the envelope and characterizing its shape in order to estimate the limits of the intra-note segments. We use this characterization to group notes into perceptually similar note clusters.
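As an illustration of the histogram-based note pitch computation, the sketch below estimates a note's representative fundamental frequency from its frame-level f0 values: a plain mean would be biased by octave errors and noisy frames, so only the frames falling into the most populated histogram bin are averaged. The bin width and reference frequency are our own assumptions.

```python
import numpy as np

def note_pitch_from_frames(frame_f0s):
    """Representative f0 of a note segment via a pitch histogram over its frames."""
    f0s = np.asarray([f for f in frame_f0s if f])  # drop silent/unvoiced frames
    if f0s.size == 0:
        return None
    # Work on a logarithmic (cents) axis so bins correspond to pitch intervals.
    cents = 1200.0 * np.log2(f0s / 440.0)
    # Roughly one bin per third of a semitone over the observed range.
    n_bins = max(1, int((cents.max() - cents.min()) / 33.0) + 1)
    hist, edges = np.histogram(cents, bins=n_bins)
    k = int(np.argmax(hist))
    # Average only the frames in the most populated bin, ignoring outlier frames.
    in_bin = (cents >= edges[k]) & (cents <= edges[k + 1])
    return 440.0 * 2.0 ** (cents[in_bin].mean() / 1200.0)
```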
2.1.4 High-level programming language for sound manipulation

The objective of this task was to design and implement a high-level programming language for the manipulation of different aspects of music, e.g. musical processes and music processing components. We have designed a first-order-logic-based concurrent programming language in which the ordering of events is stated declaratively. So far we have applied the language to specifying and executing musical processes, and we plan to use it as a coordination language for distributed sound processing modules.

2.1.5 Pattern recognition for music modeling

In this task we have worked on modeling different aspects of a performed musical piece using machine learning and pattern recognition techniques (Mitchell, 1997). We have concentrated on the study of expressive music performance (with music postproduction applications) and on its application to performer identification (with music mining applications). We have based our work on expressive performances of classical and folk violin, as well as jazz saxophone. We have developed models for these instruments and styles based on genetic programming and inductive logic programming techniques, capable of both generating and explaining music performances. In addition, we have successfully applied the resulting models to the task of automatic performer identification by combining supervised and unsupervised learning.
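As a rough illustration of how supervised and unsupervised learning can be combined for performer identification, the sketch below clusters note-level descriptors (unsupervised), summarizes each performance as a histogram over those clusters, and identifies a performer by nearest-neighbour matching of histograms (supervised). This is our own minimal scheme for exposition, not the project's actual models, which are based on genetic programming and inductive logic programming.

```python
import numpy as np

def kmeans(X, k, iters=50, seed=0):
    """Plain k-means over note-level descriptors (rows of the float matrix X)."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        labels = np.argmin(((X[:, None] - centroids[None]) ** 2).sum(-1), axis=1)
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = X[labels == j].mean(axis=0)
    return centroids

def performance_signature(notes, centroids):
    """Histogram of note-cluster occupancies: a fixed-length performance descriptor."""
    labels = np.argmin(((notes[:, None] - centroids[None]) ** 2).sum(-1), axis=1)
    return np.bincount(labels, minlength=len(centroids)) / len(labels)

def identify_performer(query_notes, training, centroids):
    """1-NN over signatures; 'training' maps performer name -> note descriptor matrix."""
    q = performance_signature(query_notes, centroids)
    return min(training,
               key=lambda name: np.linalg.norm(
                   performance_signature(training[name], centroids) - q))
```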
2.2 Universidad de Alicante

Most of the objectives posed in the project proposal for the University of Alicante (UA) subproject have already been fulfilled. They were mainly related to the pattern recognition techniques involved in the solutions needed. Here is a brief overview of the subtasks carried out.

2.2.1 Corpora acquisition

A MIDI repository of around 200,000 MIDI files has been collected and is now undergoing categorization by domain, genre, style and author. This involves the (semi-)automatic separation of prototypes, versions of prototypes and duplicates for each music piece. This is a long-term task, but some intermediate corpora have been generated in order to test our algorithms, for example a Jazz vs. Classical training set and a Jazz (pre-bop / bop / bossa nova) vs. Popular (pop / blues / Celtic) vs. Academic (baroque / classical / romantic) training set. Melody-tagged corpora have also been developed, and a set of popular melodies performed by different players has been built. As stated, this is an ongoing task, but the corpora developed so far are sufficient to meet our objectives, so we can consider this task finished for the present project.

2.2.2 Design and evaluation of data representations

This was a key task for the UA subproject, since good data representations are crucial for the remaining phases of the systems. A number of developments have been carried out: global (vectorial) and local (n-words and n-grams) statistical descriptors of melodies, local statistical descriptors of chord sequences (n-words and n-grams), and tree-based representations have been developed and tested with different degrees of success. Only the global statistical analysis of polyphonic lines remains to be tested, due to open problems in how to collapse all the polyphonic tracks of a MIDI file into a single polyphonic line; we hope to solve this in the next few months. A graphical interface for the automatic identification of the melody track in MIDI files has been developed using these techniques and is currently available for free download from the project's website.

2.2.3 Adaptive metrics

The objectives of this part were to extend the conditional stochastic edit distance to transducers with memory and to measure differences between tree structures; an illustrative sketch of edit distances applied to melodies follows below. These aims have already been fulfilled in collaboration with researchers at the University of Saint-Etienne (France) and the University of Marseille (France). Both the theoretical aspects and the practical applications to musical data have already been published (see below), so we can consider this task finished.
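To give a concrete feel for these representations and metrics, the sketch below encodes melodies as pitch-interval sequences (a simple local representation related to the n-gram descriptors of 2.2.2) and compares them with the classical fixed-cost edit distance. The project's adaptive metrics learn the edit costs stochastically; here the costs are fixed constants for illustration.

```python
def intervals(midi_pitches):
    """Encode a melody as its pitch-interval sequence (transposition-invariant)."""
    return [b - a for a, b in zip(midi_pitches, midi_pitches[1:])]

def edit_distance(s, t, sub_cost=1, indel_cost=1):
    """Classical dynamic-programming edit distance between two symbol sequences."""
    m, n = len(s), len(t)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        d[i][0] = i * indel_cost
    for j in range(1, n + 1):
        d[0][j] = j * indel_cost
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            d[i][j] = min(d[i - 1][j] + indel_cost,      # deletion
                          d[i][j - 1] + indel_cost,      # insertion
                          d[i - 1][j - 1]
                          + (0 if s[i - 1] == t[j - 1] else sub_cost))  # substitution
    return d[m][n]

# Two versions of the same tune, one transposed: the interval distance stays 0.
theme   = [60, 62, 64, 65, 67]   # C D E F G
variant = [62, 64, 66, 67, 69]   # same contour, a whole tone higher
print(edit_distance(intervals(theme), intervals(variant)))  # -> 0
```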
2.2.4 Analysis of dynamic and cooperative classification techniques

The main objectives of this task were addressed in the early stages of the project, together with those of subtask 2.2.2. Different techniques for combining classifiers, both to combine different classification paradigms and to combine partial decisions, have been studied and reported; a sketch of such a combination scheme follows below. The remaining work concerns strategies for combining partial or local decisions into general, global ones. Multimodality is also an open problem, since we need to integrate decisions made from different sources of information such as melodies, chords and rhythm. These problems are addressed in the prototype development task.

2.2.5 Efficiency improvements in decision making

Scalability to large databases is necessary in this project due to the high computational cost of some of the metrics used (for example, the edit distance between trees). We have developed and published efficient k-NN search methods for classification, and we have reported good performance for approximate tree edit distances used to find melody variations and for music retrieval. Although more realistic scenarios should be tried, we can consider the objectives of this task fulfilled to the degree aimed at in the project. The tasks currently under development mainly concern the prototypes of a transcription system and of a music genre classification graphical interface. The technician working on the project is currently integrating the different data representations and techniques needed and designing the interfaces.
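The sketch below shows one simple way to fuse partial decisions from source-specific classifiers (melody, chords, rhythm) into a global decision by weighted voting over per-class confidences. The classifier names, scores and weights are hypothetical; the project studies more elaborate combination strategies.

```python
from collections import defaultdict

def combine_decisions(decisions, weights=None):
    """Late fusion: each classifier returns a dict of class -> confidence;
    a weighted sum of confidences yields the global decision."""
    weights = weights or {name: 1.0 for name in decisions}
    totals = defaultdict(float)
    for name, scores in decisions.items():
        for label, score in scores.items():
            totals[label] += weights[name] * score
    return max(totals, key=totals.get)

# Hypothetical partial decisions from three source-specific genre classifiers.
decisions = {
    "melody": {"jazz": 0.6, "classical": 0.4},
    "chords": {"jazz": 0.3, "classical": 0.7},
    "rhythm": {"jazz": 0.8, "classical": 0.2},
}
print(combine_decisions(decisions))  # -> 'jazz'
```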
3 Results Indicators

3.1 Publications (UPF + UA)

Journals (+ LNCS)

1. Ramirez, R., Perez, A., Kersten, S., Rizo, D., Román, P., Iñesta, J.M. (2009). Modeling Violin Expressive Performance, Intelligent Data Analysis (accepted).
2. Perez, C., Rizo, D., Iñesta, J.M., Kersten, S., Ramirez, R. (2009). Genre Classification of Music by Tonal Harmony, Intelligent Data Analysis (accepted).
3. Ramirez, R., Hazan, A., Maestre, E., Serra, X. (2008). A Genetic Rule-based Expressive Performance Model for Jazz Saxophone, Computer Music Journal, 32(1), pp. 38-50.
4. Ramirez, R., Maestre, E., Gomez, E., Pertusa, A., Serra, X. (2007). Performance-based Interpreter Identification in Saxophone Audio Recordings, IEEE Transactions on Circuits and Systems for Video Technology, 17(3), pp. 356-364.
5. Ramirez, R., Hazan, A. (2006). A Tool for Generating and Explaining Expressive Music Performances of Monophonic Jazz Melodies, International Journal on Artificial Intelligence Tools, 15(4), pp. 673-691.
6. Serrà, J., Gómez, E., Herrera, P., Serra, X. (2008). Chroma Binary Similarity and Local Alignment Applied to Cover Song Identification, IEEE Transactions on Audio, Speech and Language Processing, 16(6), pp. 1138-1151.
7. Purwins, H., Herrera, P., Grachten, M., Hazan, A., Marxer, R., Serra, X. (2008). Computational Models of Music Perception and Cognition I: The Perceptual and Cognitive Processing Chain, Physics of Life Reviews, 5(3), pp. 151-168.
8. Purwins, H., Grachten, M., Herrera, P., Hazan, A., Marxer, R., Serra, X. (2008). Computational Models of Music Perception and Cognition II: Domain-Specific Music Processing, Physics of Life Reviews, 5(3), pp. 169-182.
9. Ramirez, R., Puiggros, M. (2007). A Machine Learning Approach to Detecting Instantaneous Cognitive States from fMRI Data, Lecture Notes in Artificial Intelligence, 4426, Springer.
10. Ramirez, R., Puiggros, M. (2007). A Genetic Programming Approach to Feature Selection and Classification of Instantaneous Cognitive States, Lecture Notes in Computer Science, 4448, Springer.
11. Ramirez, R., Puiggros, M., Gomez, E. (2006). Modeling Expressive Music Performance in Bassoon Audio, Lecture Notes in Control and Information Sciences, 345, pp. 951-958.
12. Hazan, A., Ramirez, R. (2006). Modeling Expressive Performance: a Regression Tree Approach Based on Strongly Typed Genetic Programming, Lecture Notes in Computer Science, 3907, pp. 676-687.
13. Pérez-Sancho, C., Rizo, D., Iñesta, J.M. (2009). Genre Classification Using Chords and Stochastic Language Models, Connection Science (accepted).
14. Habrard, A., Iñesta, J.M., Rizo, D., Sebban, M. (2008). Melody Recognition with Learned Edit Distances, Lecture Notes in Computer Science, 5342, pp. 86-96.
15. Pérez-Sancho, C., Rizo, D., Iñesta, J.M. (2008). Stochastic Text Models for Music Categorization, Lecture Notes in Computer Science, 5342, pp. 55-64.
16. Ponce de León, P.J., Iñesta, J.M. (2007). A Pattern Recognition Approach for Music Style Identification Using Shallow Statistical Descriptors, IEEE Transactions on Systems, Man and Cybernetics, Part C, 37, pp. 248-257.
17. Serrano, J.F., Iñesta, J.M. (2007). Comparación de representaciones interválicas hansonianas para recuperación de información musical, Revista Iberoamericana de Inteligencia Artificial, 11, pp. 7-15.
Book Chapters

1. Ramirez, R., Hazan, A., Serra, X. (2008). Evolutionary Expressive Music Performance Modeling. In: The Art of Evolution, Springer, ISBN 978-3-540-72876-4.
2. Ramirez, R., Hazan, A., Maestre, E., Serra, X. (2006). A Data Mining Approach to Expressive Music Performance Modeling. In: Multimedia Data Mining, Springer, ISBN 1-84628-436-8, pp. 379-399.
3. Ponce de León, P.J., Iñesta, J.M., Rizo, D. (2008). Mining Digital Music Score Collections: Melody Extraction and Genre Recognition. In: Pattern Recognition, ch. 25, In-Tech, ISBN 978-3-902613-24-4, pp. 559-590.
4. Rizo, D., Lemström, K., Iñesta, J.M. Tree Representation in Combined Polyphonic Music Comparison. In: Genesis of Meaning of Sound and Music, Springer (accepted).
International Conferences

1. Ramirez, R. (2008). An fMRI Study on Attentive Music Listening, The Neurosciences and Music.
2. Kersten, S., Maestre, E., Ramirez, R. (2008). Concatenative Synthesis of Expressive Saxophone Performance, Sound and Music Computing Conference.
3. Marinescu, M., Ramirez, R. (2008). Expressive Performance in the Human Tenor Voice, Sound and Music Computing Conference.
4. Perez, A., Maestre, E., Ramirez, R., Kersten, S. (2008). Expressive Irish Fiddle Performance Model Informed with Bowing, International Computer Music Conference.
5. Ramirez, R., Perez, A., Maestre, E., Kersten, S., Rizo, D., Roman, P., Iñesta, J.M. (2008). Modeling Celtic Violin Expressive Performance, International Workshop on Machine Learning and Music, International Conference on Machine Learning.
6. Ponce de León, P., Rizo, D., Ramirez, R. (2008). Melody Characterization by a Fuzzy Rule System, International Workshop on Machine Learning and Music, International Conference on Machine Learning.
7. Perez, A., Ramirez, R., Kersten, S. (2008). Modeling Moods in Violin Performances, Sound and Music Computing Conference.
8. Ponce de León, P., Rizo, D., Ramirez, R., Iñesta, J.M. (2008). Melody Characterization by a Genetic Fuzzy System, Sound and Music Computing Conference.
9. Ramirez, R., Perez, A., Kersten, S., Maestre, E. (2008). Performer Identification in Celtic Violin Recordings, International Conference on Music Information Retrieval.
10. Ramirez, R., Puiggros, M. (2007). An Evolutionary Computation Approach to Cognitive States Classification, IEEE Congress on Evolutionary Computation, IEEE Press, Singapore.
11. Ramirez, R., Puiggros, M. (2007). Automatic Classification of Instantaneous Auditory Cognitive States, Conference of the Society for Music Perception and Cognition, Montreal.
12. Ramirez, R., Cecilla, E. (2007). Learning to Decode Instantaneous Cognitive States from Brain Images, Frontiers in the Convergence of Bioscience and Information Technologies, IEEE Computer Society Press, Korea.
13. Ramirez, R., Santosa, A. (2007). A Framework for Separation of Concerns in Concurrent Programming, IEEE International Computer Software and Applications Conference, Beijing.
14. Ramirez, R. (2007). Automatically Detecting Cognitive States: a Speech and Music Case Study, International Conference on Language and Music as Cognitive Systems, Cambridge, UK.
15. Ramirez, R., Maestre, E. (2007). Identifying Saxophonists from Their Playing Styles, 30th AES Conference, Finland.
16. Ramirez, R., Hazan, A. (2007). A Genetic Rule-based Expressive Performance Model for Jazz Saxophone, IJCAI International Workshop on Artificial Intelligence and Music, India.
17. Ramírez, R. (2006). A Logic-based Language for Modeling and Verifying Musical Processes, Proceedings of the International Computer Music Conference, New Orleans, ICMA Press.
18. Maestre, E., Hazan, A., Ramirez, R. (2006). Using Concatenative Synthesis for Expressive Performance in Jazz Saxophone, Proceedings of the International Computer Music Conference, New Orleans, ICMA Press.
19. Hazan, A., Grachten, M., Ramirez, R. (2006). Evolving Performance Models by Performance Similarity: Beyond Note-to-Note Transformations, Proceedings of the International Conference on Music Information Retrieval, Canada.
20. Ramirez, R., Puiggros, M. (2006). Training Classifiers to Detect Instantaneous Musical Cognitive States, International Conference on Music Perception and Cognition, Italy.
21. Hazan, A., Ramirez, R. (2006). Modeling Expressive Performance Using Consistent Evolutionary Regression Trees, Proceedings of the ECAI Workshop on Evolutionary Computation.
22. Ramirez, R. et al. (2006). A Sequential Covering Evolutionary Algorithm for Expressive Music Performance, Proceedings of the Conference on Innovative Applications of Artificial Intelligence, Boston, AAAI Press.
23. Puiggròs, M., Gómez, E., Ramirez, R., Serra, X., Bresin, R. (2006). Automatic Characterization of Ornamentation from Bassoon Recordings for Expressive Synthesis, Proceedings of the 9th International Conference on Music Perception and Cognition, Bologna, Italy.
24. Iñesta, J.M., Ponce de León, P.J., Heredia-Agoiz, J.L. (2008). A Ground-Truth Experiment on Melody Genre Recognition in Absence of Timbre, Proceedings of the 10th International Conference on Music Perception and Cognition (ICMPC10), pp. 758-761.
25. Illescas, P.R., Rizo, D., Iñesta, J.M. (2008). Learning to Analyse Tonal Music, Proceedings of the International Workshop on Machine Learning and Music (MML 2008), pp. 25-26.
26. Ponce de León, P.J., Rizo, D., Ramirez, R., Iñesta, J.M. (2008). Melody Characterization by a Genetic Fuzzy System, Proceedings of the 5th Sound and Music Computing Conference, Martin Supper and Stefan Weinzierl (eds.), Universitätsverlag der TU Berlin, pp. 15-23.
27. Pertusa, A., Iñesta, J.M. (2008). Multiple Fundamental Frequency Estimation Using Gaussian Smoothness, Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP 2008), pp. 105-108.
28. Rizo, D., Lemström, K., Iñesta, J.M. (2008). Tree Structured and Combined Methods for Comparing Metered Polyphonic Music, Proceedings of Computer Music Modeling and Retrieval 2008 (CMMR'08), pp. 263-278.
29. Espí, D., Ponce de León, P.J., Pérez-Sancho, C., Rizo, D., Iñesta, J.M., Moreno-Seco, F., Pertusa, A. (2007). A Cooperative Approach to Style-Oriented Music Composition, Proceedings of the International Workshop on Artificial Intelligence and Music (MUSIC-AI), pp. 25-36.
30. Illescas, P.R., Rizo, D., Iñesta, J.M. (2007). Harmonic, Melodic, and Functional Automatic Analysis, Proceedings of the 2007 International Computer Music Conference, pp. 165-168.
31. Lidy, T., Rauber, A., Pertusa, A., Iñesta, J.M. (2007). Improving Genre Classification by Combination of Audio and Symbolic Descriptors Using a Transcription System, Proceedings of the 8th International Conference on Music Information Retrieval (ISMIR 2007), pp. 61-66.
32. Pertusa, A., Iñesta, J.M. (2007). Multiple Fundamental Frequency Estimation Based on Spectral Pattern Loudness and Smoothness, MIREX 2007.
33. Ponce de León, P.J., Iñesta, J.M., Rizo, D. (2007). Towards a Human-Friendly Melody Characterization by Automatically Induced Rules, Proceedings of the 8th International Conference on Music Information Retrieval (ISMIR 2007), pp. 437-440.
3.2 Collaboration with other research groups

Universitat Pompeu Fabra has established collaborations with academic institutions such as City University London, Cambridge University, Queen's University Belfast, the University of Athens, and the National University of Singapore. We are currently working closely with members of these universities on musical pattern mining, musical analysis, evolutionary musical composition, violin expression, and fMRI decoding.

The University of Alicante has established collaborations with academic institutions such as the University of Saint-Etienne, the University of Marseille, City University London, the University of Helsinki, the Technical University of Vienna, and IRCAM-CNRS Paris. We are currently working closely with members of these universities and research centers on pattern recognition, machine learning, musical analysis, sound analysis, and comparison metrics.
3.3 Staff training

3.3.1 Universitat Pompeu Fabra

PhD Theses

• Emilia Gomez (2006). Tonal Description of Music Audio Signals. Ph.D. dissertation, UPF.
• Cano, P. (2007). Content-Based Audio Search: from Fingerprinting to Semantic Audio Retrieval. Ph.D. dissertation, UPF.
DEA Theses

• Amaury Hazan (2006). Computational Modeling of Expressive Music Performance: New Machine Learning Approaches for Dealing with Real-World Data. UPF, Barcelona.
• Esteban Maestre (2006). Coding Instrumental Gestures: Towards a Quantitative Description of Instrumental Gestures in Excitation-Continuous Musical Instruments. UPF, Barcelona.
Others

• Stefan Kersten, an engineer in computing, was initially hired as a research assistant on the project and is currently enrolled as a first-year PhD student in the computing and digital communication programme of the UPF. His research is devoted to the design and implementation of pattern recognition techniques for sequences.
3.3.2 Universidad de Alicante

PhD Theses

• Carlos Pérez Sancho, 2009. Stochastic Language Models for Music Information Retrieval (scheduled for spring 2009).
• Antonio Pertusa Ibañez, 2009. Computationally Efficient Methods for Polyphonic Music Transcription (expected for summer-autumn 2009).

Three more theses are under development and are expected to be finished during 2009-2010.

DEA Theses

• Plácido Illescas Casanova, 2007. Análisis Tonal Automático. UA, Alicante.

Four more theses are under development and are expected to be finished during 2009.
3.4 Technology transfer

Part of the work in PROSEMUS has been motivated by a long-term research relationship with the company Yamaha. In this context we have been investigating violin expressive performance, and we plan to apply the results to the generation of expressive music using the Yamaha violin synthesizer, also developed at the UPF.

The creation of the UPF spin-off company BMAT has motivated BMAT's collaboration with the project. One of BMAT's research lines is the development of recommender systems, and the PROSEMUS results on similarity measures for musical sequences are of particular interest in this context.
3.5 Joint activities

The following two international events have been co-organized:

• International Workshop on Machine Learning and Music 2008, ICML 2008, Helsinki, Finland.
• International Workshop on Artificial Intelligence and Music 2007, IJCAI 2007, Hyderabad, India.
4 References

(Grosky, 2002) Grosky, W.I., Zhao, R. (2002). Negotiating the semantic gap: from feature maps to semantic landscapes. Pattern Recognition, 35(3), pp. 593-600.

(Klapuri, 1999) Klapuri, A. (1999). Sound Onset Detection by Applying Psychoacoustic Knowledge. Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

(Maher, 1994) Maher, R.C., Beauchamp, J.W. (1994). Fundamental frequency estimation of musical signals using a two-way mismatch procedure. Journal of the Acoustical Society of America, 95, pp. 2254-2263.

(McNab, 1996) McNab, R.J., Smith, Ll.A., Witten, I.H. (1996). Signal Processing for Melody Transcription. SIG working paper, vol. 95-22.

(Mitchell, 1997) Mitchell, T.M. (1997). Machine Learning. McGraw-Hill.

(Narmour, 1990) Narmour, E. (1990). The Analysis and Cognition of Basic Melodic Structures: The Implication-Realization Model. University of Chicago Press.

(Narmour, 1991) Narmour, E. (1991). The Analysis and Cognition of Melodic Complexity: The Implication-Realization Model. University of Chicago Press.