Grouping, Similarity and the Recognition of Rhythmic Structure

Tillman Weyde
University of Osnabrück, Research Department of Music and Media Technology
[email protected]

Abstract

Recognition of musical structure is based on grouping and on the similarity of groups, which are interdependent. A system for the recognition of the rhythmic motif structure of note sequences is introduced; it determines grouping and group relations between two sequences (comparison) or within one sequence (analysis). The system makes use of knowledge and of learning from data, and it is open to the integration of different features and rules.
1 Introduction
The recognition of musical structures is an important task for musical applications. It is needed in interactive music performance systems, musical training, music analysis and database retrieval. Since music unfolds in time, it contains some rhythmic structure, at least in a very general sense. Hence the recognition of musical objects generally involves the recognition of rhythmic structure. Computer-based research in musical rhythm has mainly been concerned with periodicity, beat tracking, finding metrical structure and quantization. Yet for the recognition of musical structure it is necessary to discover musical units and their relations. Metrical structure and grouping structure are largely independent, and musical units are determined by Gestalt features and relations. Rhythmic motifs can be categorized by assigning metrical interpretations, but this is not a sufficient basis for the recognition of rhythmic patterns. Recognition of musical structure relies on the recognition of musical units and their relations, such as repetition, variation or contrast on various levels. Notes are grouped into musical motifs, and groups of motifs form phrases within which motifs may be related. The relations of motifs obviously depend on the grouping, but grouping also depends on motif relations: repetitions and similarities are perceived immediately, and motif relations influence segmentation. Although there is a large body of research on rhythm perception (e.g. Deutsch 1986; Handel 1989; Desain and Windsor 2000),
a generally accepted theory of musical rhythm has not yet been developed. Accordingly, a model for the recognition of rhythmic structure should be open to integrating different concepts, knowledge and learning from data. In this paper a system is introduced for the recognition of structure on the basis of grouping and similarity, and some of the measurements we use are described. It allows modeling constraints on grouping and relations, defining arbitrary input features for processing, integrating a priori knowledge, and learning from examples. The system architecture is modular and open to different types of constraint models, feature extraction, similarity models and learning algorithms.
2 Approach

The basic idea is to determine groupings for sequences of notes and similarity relations between groups, and to rate their quality. We can use one sequence or two different sequences, yielding an analysis or a comparison, respectively. Each note of the input is assigned to at most one group (of the input itself or of a given model), providing an interpretation, i.e. a combination of grouping and assignments. Figure 1 shows the grouping of a note sequence as a phrase consisting of two motifs. Groups of two sequences are assigned in figure 2. The similarity relations are determined on a note-by-note basis, as can be seen in figure 3. This allows us to gain information about the differences between particular groups or notes. The structuring of a single sequence is shown in figure 4. Our computational approach is a generate-and-test design: we generate all relevant interpretations and choose the best one. This entails two main tasks: finding quality measures for grouping and similarity, and managing computational complexity. The generated interpretations are rated for their quality of grouping, for similarity on the group level, and on the structural level, i.e. for the whole sequence. The rating module is a hybrid neuro-fuzzy system that allows us to integrate knowledge and
to use learning from examples. The system does not work directly on the data but on features extracted from the sequences and interpretations. Knowledge is integrated in the definition of features and by defining fuzzy-logical rules, which are used to calculate ratings from the features. The system can be trained with examples of interpretations. Computational complexity is reduced by perceptually motivated filtering of groupings and by algorithmic optimization.
Figure 1: Grouping
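The generate-and-test design described above can be sketched in a few lines of Python. This is an illustrative simplification, not the actual implementation: it enumerates only contiguous segmentations, takes an arbitrary quality function `rate`, and ignores the filtering and assignment steps.

```python
def enumerate_groupings(notes, max_len=5):
    """Enumerate all contiguous segmentations of a note sequence into
    groups of at most max_len notes. This is a simplification: the real
    system also allows unassigned notes and filters implausible groups."""
    n = len(notes)
    results = []

    def extend(start, groups):
        if start == n:
            results.append([list(g) for g in groups])
            return
        for end in range(start + 1, min(start + max_len, n) + 1):
            groups.append(notes[start:end])
            extend(end, groups)
            groups.pop()

    extend(0, [])
    return results


def best_interpretation(notes, rate):
    """Generate-and-test: rate every candidate grouping with the given
    quality function and return the best one."""
    return max(enumerate_groupings(notes), key=rate)
```

For example, with a toy rating function that prefers interpretations of two groups, `best_interpretation([1, 2, 3, 4], lambda gs: -abs(len(gs) - 2))` returns a two-group segmentation. The combinatorial growth of `enumerate_groupings` is exactly why the perceptual filtering discussed below matters.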
3 Feature extraction
Finding adequate input features is a crucial point in modeling rhythm recognition with adaptive systems, since their performance and adaptation are very sensitive to the representation of the input data. Low-level representations of musical data are usually not suitable, since they contain too little structural information. We extract and process features on two levels:

• motif level features are computed individually for each input group

• phrase level features are computed from the groups and their relations
Figure 2: Group relations
Group features are group length and duration, as well as several measurements of precision, correctness and tempo for assigned input groups. Input-level features are tempo and segmentation qualities, which are computed from the group-level features, and order, which reflects how well the group assignments preserve serial order.
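As a rough illustration of the two feature levels, the following Python sketch computes group length and duration from hypothetical (onset, velocity, duration) note tuples, and a phrase-level order feature as the fraction of assignment pairs that preserve serial order. The tuple layout and formulas are assumptions, not the system's definitions.

```python
def group_features(group):
    """Motif-level features for a group of (onset, velocity, duration)
    tuples: length (number of notes) and duration (time span covered,
    including the last note's duration)."""
    onsets = [note[0] for note in group]
    span = (onsets[-1] + group[-1][2]) - onsets[0]
    return {"length": len(group), "duration": span}


def order_feature(assignments):
    """Phrase-level 'order' feature: the fraction of assignment pairs
    (input group index, model group index) that preserve serial order.
    This formalisation is an assumption based on the description."""
    pairs = [(a, b) for i, a in enumerate(assignments)
             for b in assignments[i + 1:]]
    if not pairs:
        return 1.0
    kept = sum(1 for (i1, m1), (i2, m2) in pairs
               if (i1 < i2) == (m1 < m2))
    return kept / len(pairs)
```

Under this definition, an assignment that maps groups in their original order scores 1.0, while a fully reversed assignment scores 0.0.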
Figure 3: Note assignments

Figure 4: Analysis of a note sequence

4 Grouping
There are perceptual constraints on the length and duration of perceptual groups. The number of events in a group is restricted, as has been well known since Miller (1956). The maximal number is an adjustable parameter in our system, and a setting of 4 or 5 has proven adequate, which agrees with the literature (Handel and Todd 1981; Swain 1986). Empirical evidence suggests that the durations of perceptual groups lie in a range of approximately 0.5 to 2 seconds (Seifert, Olk, and Schneider 1995). It is also known that temporal proximity of events is an important factor in segmentation: relatively long distances between events tend to end a group (Handel 1973). Since grouping by temporal proximity is dominant over accent grouping (Deutsch 1986), we can filter out segmentations that grossly contradict grouping by proximity. We also assume that groups containing only one element should not occur unless that element is at a considerable distance from its neighbor notes (Lerdahl and Jackendoff 1983); otherwise such groups are filtered out.
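These constraints translate directly into a filter over candidate groups. The following Python sketch applies the event-count and duration limits and the proximity criterion; the thresholds are the adjustable parameters mentioned above, and the onset representation (seconds) is an assumption.

```python
def plausible_group(onsets, max_notes=5, min_span=0.5, max_span=2.0):
    """Perceptual filter for candidate groups: at most 4-5 events
    (Miller 1956) and a time span of roughly 0.5 to 2 seconds
    (Seifert, Olk, and Schneider 1995). The limits are adjustable
    parameters, as in the described system."""
    if not onsets or len(onsets) > max_notes:
        return False
    span = onsets[-1] - onsets[0]
    return min_span <= span <= max_span


def proximity_ok(onsets, next_onset):
    """Temporal-proximity check (Handel 1973): reject a group if some
    gap inside it is larger than the gap to the following note."""
    gaps = [b - a for a, b in zip(onsets, onsets[1:])]
    boundary = next_onset - onsets[-1]
    return not gaps or max(gaps) <= boundary
```

Applying such filters before rating prunes the vast majority of the segmentations a generate-and-test search would otherwise have to evaluate.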
5 Similarity
Similarity is a Gestalt feature of rhythmic groups. Gestalt was defined by von Ehrenfels (1890) as the quality of a melody that is invariant under transposition. Analogous to the transposition of a melody, we can transform rhythms by changing the tempo, the position within the musical context, or the timing of notes within the rhythm. Depending on the amount of transformation, the rhythm will be regarded by a listener as the same rhythm or a similar one. Apart from these mainly geometrically motivated transformations there are also structural transformations such as inserting or omitting notes. These are usually dealt with in pattern matching and have also been applied to music (Orpen and Huron 1992). We use them with the velocity and duration of the inserted notes as the amount, and we differentiate between addition/subtraction, where the time structure is unchanged, and insertion/deletion, where the following notes are moved by the length of the inserted or deleted note. We calculate the input features for similarity by performing the transformations necessary to transform one group into the other, and take the amounts of the transformations needed as the basis for the input data. We use input features for the position, loudness and duration of notes, which sum the squares of the differences as correctness values. The application of transformations also yields information about the tempo of groups, the timing of individual notes and the relative position of the groups. For every assigned input group, tempo variants are calculated, based on pairs of assigned notes as fixed points; different choices of the two fixed notes yield different tempo variants. The process is visualized in figure 5. The result allows measuring the deviation of the group from its expected position, the tempo deviation, and the deviations of the individual notes with respect to timing, loudness and length.
Only the tempo variant which produces the best similarity rating is used for further calculation in order to reduce calculation time.
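The tempo variant calculation can be sketched as follows: two fixed notes determine a linear (scale-and-shift) map of the group's onsets onto the model, and the sum of squared onset differences serves as the deviation measure. The formulas are plausible reconstructions from the description, not the published ones.

```python
def tempo_variant(group, model, i, j):
    """Scale and shift the onsets of `group` so that its notes i and j
    coincide with the model's notes i and j (the two fixed points).
    Returns the transformed onsets and the tempo factor."""
    factor = (model[j] - model[i]) / (group[j] - group[i])
    return [model[i] + (t - group[i]) * factor for t in group], factor


def timing_correctness(transformed, model):
    """Sum of squared onset differences, an assumed form of the
    timing correctness feature mentioned in the text."""
    return sum((a - b) ** 2 for a, b in zip(transformed, model))


def best_tempo_variant(group, model):
    """Try all pairs of fixed points and keep the variant with the
    lowest timing deviation, since only the best-rated variant is
    used for further calculation."""
    best = None
    for i in range(len(group)):
        for j in range(i + 1, len(group)):
            onsets, factor = tempo_variant(group, model, i, j)
            cost = timing_correctness(onsets, model)
            if best is None or cost < best[0]:
                best = (cost, factor, onsets)
    return best
```

For a group played at exactly half the model's tempo, every choice of fixed points yields the same variant with factor 2 and zero timing deviation; with imprecise timing, different fixed-point pairs yield different costs and the cheapest one is kept.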
6 Structure
When trying to recognize rhythmic structure, we have to perform two tasks: grouping the notes into meaningful units and determining the relations between groups, and both are highly interdependent. We model this interdependence by integrating grouping and similarity quality into the quality rating of an interpretation, which represents the whole rhythmic structure. At the moment we only integrate similarity relations between groups. This could be extended to other Gestalt features and relations that might be relevant for recognition and analysis, such as contrast, Prägnanz, closure, or emotional qualities.

a) Original

b) Addition removed

c) Position adjusted

d) Tempo equalized

Figure 5: Calculation of tempo variants

Similarity relations, however, seem to be fundamental for musical structure. The similarity-based principles of repetition and variation are predominant at least in the western music tradition, although this may not be a universal.
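A minimal sketch of how grouping quality and similarity quality might be combined into one interpretation rating follows. The weighted-average form and the weight `w` are assumptions made for illustration; the actual system combines these values with fuzzy rules that can be learned from data.

```python
def interpretation_quality(grouping_quality, similarity_qualities, w=0.5):
    """Combine the grouping quality of an interpretation with the
    similarity ratings of its group relations into a single score.
    Both inputs are assumed to lie in [0, 1]."""
    if similarity_qualities:
        similarity = sum(similarity_qualities) / len(similarity_qualities)
    else:
        similarity = 0.0
    return w * grouping_quality + (1 - w) * similarity
```

Because the combined score depends on both terms, a segmentation that slightly worsens grouping quality can still win if it creates strongly similar groups, which is exactly the interdependence the model is meant to capture.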
7 Knowledge and Learning
The interpretations are rated by a neuro-fuzzy system using Fuzzy-Prolog (Nauck, Klawonn, and Kruse 1996), extended by a list-processing feature. In a Fuzzy-Prolog system, vague knowledge can easily be modeled by defining fuzzy rules. The system structure is very flexible and allows tree-like structures as well as distributed representations. Fuzzy-Prolog rules can be interpreted as a feed-forward neural net and can be trained by error backpropagation. The system defined by the rules can thus be trained by examples to prefer certain interpretations over others; each example gives a target interpretation for a given input. More details on the rules and the net structure can be found in Weyde (2001) and Weyde (2000).
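To make the mechanism concrete, the sketch below evaluates a single fuzzy rule as the minimum over triangular membership degrees, a common neuro-fuzzy choice. The feature names and membership shapes here are hypothetical, not taken from the actual rule base, and the real system additionally makes the rule weights trainable by backpropagation.

```python
def triangular(a, b, c):
    """Return a triangular fuzzy membership function with support
    (a, c) and peak at b."""
    def mf(x):
        if x <= a or x >= c:
            return 0.0
        return (x - a) / (b - a) if x <= b else (c - x) / (c - b)
    return mf


def rule_activation(features, antecedents):
    """Evaluate one fuzzy rule as the conjunction (minimum) of the
    membership degrees of its antecedents, e.g. a rule like
    'IF precision is high AND tempo deviation is medium THEN ...'."""
    return min(mf(features[name]) for name, mf in antecedents.items())
```

Because `min` and the triangular membership functions are piecewise differentiable, a network of such rules can be trained with gradient-based methods much like an ordinary feed-forward net.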
8 Evaluation
The program RhythmScan is an experimental application of the described system. It offers segmentation, comparison and analysis modes with a graphical user interface. Figure 6 shows how expert samples can be provided by drawing groupings and relations. The system has been trained and tested with segmentation tasks that simulate the subjective rhythmization of uniform sequences and randomly generated patterns. It has also been trained and tested with inaccurate interpretations of rhythmic patterns that contained tempo fluctuations, missing and added notes, and imprecise timing. The system can be trained to assign and segment the trained samples correctly and performs acceptably on test sets. Although training was successful with different training sets, it did not always succeed for combined sets. The weight vectors after training showed no uniform trend; the capacity of the rule-based net seems not to be large enough. This could be solved by adding further rules or by using a standard net that initially contains more connections than needed and eliminates unnecessary weights during training. A drawback of that option is that there is no straightforward way to interpret the weights after training, which can easily be done with rule weights.

Figure 6: RhythmScan user interface

9 Conclusions

Recognition of rhythmic motif structures can be successfully modeled on the basis of similarity and segmentation. The integration of knowledge and learning gives the opportunity to use existing theories and rules as well as examples. The results of experiments with the implementation are satisfactory, although not yet ready for real-world applications. The integration of more rules or a different net structure should be a future goal to enhance system performance. When changing the net structure, it would be desirable to retain the interpretability of the weights in the trained system. The system could then also be used to test the validity of rules in certain contexts, but we need to process larger sets of data than we currently do in order to draw any conclusions of that type. Overall, the approach of integrating knowledge and machine learning to model the recognition of rhythmic structure can be rewarding both for music research and for use in interactive music software.

References
Desain, P. and L. Windsor (Eds.) (2000). Rhythm Perception and Production. Lisse: Swets and Zeitlinger.

Deutsch, D. (1986). Auditory pattern recognition. In K. R. Boff, L. Kaufman, and J. P. Thomas (Eds.), Handbook of Perception and Human Performance: Cognitive Processes and Performance, Volume 2, Chapter 32, pp. 32-1–32-49. New York: John Wiley and Sons.

Handel, S. (1973). Temporal segmentation of repeating auditory patterns. Journal of Experimental Psychology 101, 46–54.

Handel, S. (1989). Listening: An Introduction to the Perception of Auditory Events. Cambridge, Massachusetts: MIT Press.

Handel, S. and P. Todd (1981). Segmentation of sequential patterns. Journal of Experimental Psychology: Human Perception and Performance 7(1), 41–55.

Lerdahl, F. and R. Jackendoff (1983). A Generative Theory of Tonal Music. Cambridge, Mass.: The MIT Press.

Miller, G. A. (1956). The magical number seven, plus or minus two: Some limits on our capacity for processing information. Psychological Review 63(2), 81–97.

Nauck, D., F. Klawonn, and R. Kruse (1996). Neuronale Netze und Fuzzy-Systeme (2nd ed.). Computational Intelligence. Braunschweig: Vieweg.

Orpen, K. S. and D. Huron (1992, Fall). The measurement of similarity in music: A quantitative approach for non-parametric representations. Computers in Music Research, 1–44.

Seifert, U., F. Olk, and A. Schneider (1995). On rhythm perception: Theoretical issues, empirical findings. Journal of New Music Research 24(2), 164–195.

Swain, J. P. (1986). The need for limits in hierarchical theories of music. Music Perception 4(1), 121–148.

von Ehrenfels, C. (1890). Über Gestaltqualitäten. Vierteljahresschrift für wissenschaftliche Philosophie 14, 249–292.

Weyde, T. (2000). Recognition of rhythmic structure with a neuro-fuzzy-system. In C. Woods, G. B. Luck, R. Brochard, F. Seddon, and J. A. Sloboda (Eds.), Proceedings of the Sixth International Conference on Music Perception and Cognition, Keele, Staffordshire, UK, pp. 1467–1477. Department of Psychology, Keele University. CD-ROM (pdf, html).

Weyde, T. (2001). Knowledge and learning based segmentation and recognition of rhythm using fuzzy-prolog. In Proceedings of the 8th Journées d'Informatique Musicale, pp. 173–182.