A SCHEME FOR SEARCH SPACE REDUCTION IN CBMIR USING 'MELAKARTHA RAGA' CLASSIFICATION SCHEME

Vijayaditya Peddinti
Technical Associate, TechMahindra, Sharda Center, Pune, Maharashtra-411004, India
[email protected]

Prof. Vijaykumar Chakka
Asst. Professor, DA-IICT, Gandhinagar, Gujarat-382007, India
vijaykumar_chakka@daiict.ac.in

Abstract - Global features of a melody, such as its scale, in addition to local features like pitch, interval contour, etc., help to characterize the melody better. They improve the classification, querying, segmentation and retrieval of music data. In this paper the use of the traditional Carnatic raga classification system, 'melakartha', is examined for reducing the search space of an audio query in a polyphonic acoustic database. The method proposed for melakartha raga identification does not require temporal information on the occurrence of the notes. Hence the computation is drastically reduced, as the musical audio piece need not be analyzed using small 30 ms windows. This also enables the use of spectral analysis methods with high resolution. Using the 'melakartha' system the songs in the database are classified based upon the notes that are predominantly used in their melodies. The proposed system suggests a search path within the classified groups for every audio query, to achieve a quick 'first hit'. The search space for a trained singer's audio query was reduced to nearly 2% of the test database.

Index Terms – Content based Music Information Retrieval, Search space reduction, Polyphonic acoustic database, Melodic Mode identification.

Introduction

The area of Content Based Music Information Retrieval (CBMIR) has received much attention in recent years, as a result of the explosion of music data available to users through various sources. The traditional method of searching for audio files based upon keywords is no longer viable. However, searching the entire content of the audio with the similarity measures currently available, such as approximate string matching of transcribed songs, DTW, etc., is a computationally intensive and time-consuming process. This necessitates the development of efficient search space reduction methods. Yang et al. [1] propose one such system, which reduces the search space by clustering the data based on the rhythm of the songs. According to the rhythm pattern, a Korean folk song database was divided into 5 clusters. The rhythm pattern of the audio query is identified and the query is matched only with the songs within that particular cluster. Lee et al. [2] proposed a two-step hierarchical filtering method (HFM) that reduces the search space to nearly 20%. In the first step a crude comparison procedure filters out nearly 80% of the unlikely candidates, and in the next step the query input is compared to the remaining 20% of the candidates in a detailed manner. Neither of the above methods makes use of music scale knowledge. Zhu et al. [3] proposed a music scale modeling technique to tackle the key transposition issue, reducing the key search space during melody matching, and demonstrated the benefits of applying music scale knowledge. They also improved their method in [4] to detect multiple keys. However, both systems work only on MIDI and monophonic data. In [5] Zhu et al. proposed a method for music key detection in musical audio, making use of the constant-Q transform (CQT) for spectral analysis. However, they too performed the spectral analysis on small windows and later summarized this data to determine the scale of the musical audio piece. In this paper a method for classifying the songs in the database based upon their probable 'melakartha' raga is proposed. The proposed system eliminates the necessity of pitch tracking for the classification procedure. Furthermore, the knowledge of the exact positions of the frequency components


of interest, gained as a result of the proposed technique, makes it possible to adopt other means of spectral analysis that offer variable resolution, as well as methods that give the energy only in the required frequency bands, when the signal is later analyzed for temporal information. The proposed system reduces the search space to one of the total 72 groups for a perfect query and a perfect classification, which translates to 1.3%, assuming the database contains a nearly equal number of songs in all the groups. For an untrained query the system can reduce the search space to 22 of the 72 groups (translating to 30% of the test database) on average. The search space reduction system can be employed in any CBMIR system irrespective of the similarity measures or the features used by the system.

The Composition of the Music Database

The 'Melakartha' system described in [5] is a note-based raga (melodic mode) classification system in Carnatic music. It is very similar to a genus-species classification system. Out of the 12 notes in an octave, 7 or fewer are selected to produce a raga. Based on these 7 notes the ragas can be classified into 72 groups. Further, within each group there are many types of 'asampurna' (incomplete) ragas, based upon the deletion of some of the seven notes. So within each group the ragas can be further clustered based upon the selection of notes from the 7 pertaining to the 'janaka' (parent) raga, leading to a finer classification and hence greater search space reduction. Carnatic music follows the just intonation system. Arvindh Krishnaswamy [8] suggests that just intonation intervals are very close to equally tempered intervals with respect to the variances in intonation. Also, almost all of the scales used in Western music fall within the 72 groups. This facilitates the inclusion of songs belonging to other styles of music in the database. Experimental results suggest that most popular songs, such as rock and Indian film songs, can also be included in the database. Further, a 73rd group is created for songs whose placement in the predefined 72 groups is highly ambiguous. Thus the proposed classification system can be applied to a music database containing songs belonging to all types of classical and popular music.
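
The 72 groups follow directly from the combinatorial structure of the melakartha scheme. The following is a minimal sketch, assuming the standard construction with Sa and Pa fixed, two variants of Ma, six Ri-Ga combinations and six Da-Ni combinations; the semitone indices are the usual equal-tempered positions of the 12 svarasthanas and are used here only as labels, since the paper itself works with just intonation ratios.

```python
from itertools import product

# Semitone indices (0 = Sa) of the variable notes, assuming the standard
# melakartha construction; these labels are not taken from the paper.
RI_GA = [(1, 2), (1, 3), (1, 4), (2, 3), (2, 4), (3, 4)]        # 6 Ri-Ga choices
MA = [5, 6]                                                     # Ma1, Ma2
DA_NI = [(8, 9), (8, 10), (8, 11), (9, 10), (9, 11), (10, 11)]  # 6 Da-Ni choices

def melakartha_groups():
    """Enumerate the 72 seven-note sets; Sa (0) and Pa (7) are fixed."""
    groups = []
    for (ri, ga), ma, (da, ni) in product(RI_GA, MA, DA_NI):
        groups.append((0, ri, ga, ma, 7, da, ni))
    return groups

groups = melakartha_groups()
assert len(groups) == 72
```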

Identification of the Feature

The classification of a song under the proposed system requires the identification of the notes used in the composition of the song. Various note identification methods were explored for this purpose; however, many of these methods require pitch tracking results for transcription. The proposed classification system does not require the temporal information of the occurrence of notes and is concerned only with the presence of the notes in the song. Hence a spectral analysis of the entire song suffices for classification. This enables the use of power spectral estimation methods that offer high resolution, as the number of time windows to be analyzed is drastically reduced. It also allows the pitch tracking phase to be eliminated from the system, and with it the complexities of polyphonic pitch tracking, such as the identification of the various sources. For the purpose of note identification the assumptions made are the same as those for polyphonic pitch tracking, as explained in [9], viz. the main melody has a higher intensity than the accompaniment and, in the case of stereo recordings, is balanced between the left and the right channels. These assumptions are reasonable in the case of pop and rock music and may also hold for traditional music and for a Western concerto; they hold very well for melody-based music like Carnatic music. Therefore the typical approach to this task is to extract the main voice and consider the background accompaniment as noise. Hence the main task is to select the frequency with the strongest harmonic content, in terms of the number of overtones present and the energy of these overtones, as the fundamental frequency (Sa) of the source singing or playing the main melody.


Feature extraction

As it is required to capture the frequency content of the whole song, a PSD estimate of the entire data is taken. The typical length of an audio file is about 5 minutes at a sampling rate of 44.1 kHz, which results in 13,230,000 samples to be analyzed. Hence the Welch algorithm proposed in [10] is used for estimating the power spectral density of the audio data, as it gives a smoother spectrum, simplifying peak identification. According to this algorithm the whole data is divided into a number of overlapping frames and a Hamming window is applied to each frame. The PSDs of the frames are averaged to get the PSD estimate of the whole data. A 1 Hz resolution is necessary, as much of the energy is concentrated in the lower frequencies, where the frequency difference between two notes is of the order of 4 Hz. Thus 22050 frequency points are required, since the audio is sampled at 44.1 kHz. Hence a window length of 22050 samples is selected, with an overlap of 11025 samples, as suggested in [11]. Even this resolution is not expensive, as the time windows are large.
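
A minimal sketch of this step is given below, assuming 44.1 kHz audio in a placeholder file "song.wav" and using scipy.signal.welch with the window length and overlap stated above.

```python
import numpy as np
from scipy.io import wavfile
from scipy.signal import welch

# Hypothetical input file; the paper assumes 44.1 kHz audio.
fs, x = wavfile.read("song.wav")
if x.ndim == 2:                 # stereo: average the two channels
    x = x.mean(axis=1)
x = x.astype(np.float64)

# Welch PSD over the whole song: 22050-sample Hamming windows, 50% overlap,
# as described above; freqs/psd are reused in the later sketches.
freqs, psd = welch(x, fs=fs, window="hamming", nperseg=22050, noverlap=11025)
```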

FIGURE 1: PSD OF A TYPICAL SONG

FIGURE 2: PSD AFTER THE APPLICATION OF HPS

The selection of the required features from this PSD necessitates the identification of the fundamental with the strongest harmonic content, which is identified as 'Sa'. The Harmonic Product Spectrum (HPS) algorithm [12] is used for this purpose. The spectrum is downsampled to compress it; in a spectrum compressed by a factor r, the r-th harmonic falls in the place of the fundamental frequency. The HPS algorithm measures the maximum coincidence of harmonics according to equation (1) for each spectral frame X(ω).

Y ( )   | X ( ) |

Y  max{Y (i)}

R

r 1

^

(1)

(2)

where R is the number of harmonics to be considered (R = 5, for example), and the frequency ωi lies in the range of possible fundamental frequencies. The resulting Y(ω) is shown in Figure 2. The four peaks in the graph are all harmonically related to the first peak; hence the first peak is selected as the fundamental frequency of the source playing the main melody. The selected frequency, its harmonics, and the frequencies related to the selected frequency through just intonation ratios (shown in Table I) are identified. The presence of peaks in the actual PSD at these frequencies is checked and their amplitudes are measured. In the octaves of the fundamental frequency and its harmonics, the presence of peaks at any of the 12 corresponding note frequencies indicates the existence of those notes in the composition of the song. Furthermore, as only the peaks at the required frequencies are considered, the peaks formed by other sources cannot cause confusion in the decision making process. Hence the detection of notes in polyphonic audio is simplified.
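
The following is a minimal sketch of equations (1) and (2) applied to the song-level PSD from the previous sketch; the search range fmin/fmax for the candidate fundamentals is an assumed choice, as no explicit limits are given here.

```python
import numpy as np

def harmonic_product_spectrum(psd, R=5):
    """Eq. (1): multiply the spectrum by its downsampled copies."""
    n = len(psd) // R
    hps = np.copy(psd[:n])
    for r in range(2, R + 1):
        hps *= psd[: n * r : r][:n]
    return hps

def find_sa(freqs, psd, fmin=80.0, fmax=1000.0, R=5):
    """Eq. (2): pick the candidate fundamental with the maximum HPS value.
    fmin/fmax bound the search to plausible tonic frequencies (assumed values)."""
    hps = harmonic_product_spectrum(psd, R)
    mask = (freqs[: len(hps)] >= fmin) & (freqs[: len(hps)] <= fmax)
    candidates = np.where(mask)[0]
    return freqs[candidates[np.argmax(hps[candidates])]]

# Example use with the PSD from the previous sketch:
# sa = find_sa(freqs, psd)
```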


TABLE I: JUST INTONATION RATIOS

FIGURE 3: PSD OF A TYPICAL HUMMED QUERY WITH PEAKS IDENTIFIED

Classification

According to the peaks at the required frequencies in each of the octaves, a decision is made on placing the song in one of the 72 groups:

- If there are more than 7 peaks in an octave, the other octaves are checked for peaks at the corresponding positions. If no peaks are detected at the extra frequencies, these are eliminated from the note selection process and the classification is done.
- If more than 7 peaks (say 8) are present in all the octaves, the song is placed in every group corresponding to a selection of 7 notes from those 8 frequencies.
- If all 12 notes are present in the song, it is placed in all the groups.
- If the note intervals used in the song match neither just intonation nor equal temperament intervals, the song is placed in the 73rd group.

Hence the classification covers songs belonging to all styles of music; a minimal sketch of the group assignment is given below.
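
As a sketch of this assignment, the detected notes are assumed to be represented as semitone indices relative to Sa, and melakartha_groups() from the earlier sketch is reused; both are illustrative assumptions, not details taken from the paper.

```python
def candidate_groups(detected_notes, groups):
    """Return indices of melakartha groups whose note sets contain every
    detected note; `detected_notes` is a set of semitone indices relative
    to Sa (an assumed representation)."""
    detected = set(detected_notes)
    matches = [i for i, g in enumerate(groups) if detected <= set(g)]
    # If nothing matches (e.g. intervals that are neither just-intonation
    # nor equal-tempered), fall back to the 73rd "ambiguous" group,
    # indexed here as 72.
    return matches if matches else [72]
```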

Query Processing

The proposed system accepts hummed, whistled, sung and even polyphonic queries. The PSD of the query is taken and the HPS algorithm is applied as described above. Based on the selected frequency, the required note frequencies are identified. The presence of peaks in the PSD is then detected and these peaks are passed to a fitting function. The fitting function tries to fit these peaks to each of the 72 note selections, and the error in each case is noted. In a typical query by a musically trained subject the average error in humming is 0.28 semitones per note transition, lower than the 1.16 semitones per transition observed for an untrained subject. Figure 3 shows a typical untrained user query; though the identified peaks are not the same as the calculated ones, they are close to their actual notes and there is no confusion in the note selection. If the query is recorded from a source playing the song, even these errors can be eliminated.
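
The fitting function is not spelled out above; the following is a minimal sketch assuming the error is the average distance, in semitones, from each query peak to the nearest note of a group, with `ratios` holding the just intonation ratios of Table I indexed by semitone.

```python
import numpy as np

def fitting_error(peak_freqs, sa, group, ratios):
    """Average distance (in semitones) from each query peak to the nearest
    note of the group; the error measure itself is an assumed choice."""
    # Cents of each group note above Sa, within one octave.
    note_cents = np.array([1200.0 * np.log2(ratios[n]) for n in group])
    errors = []
    for f in peak_freqs:
        c = (1200.0 * np.log2(f / sa)) % 1200.0   # fold the peak into one octave
        d = np.abs(note_cents - c)
        d = np.minimum(d, 1200.0 - d)             # wrap-around distance
        errors.append(d.min() / 100.0)            # convert cents to semitones
    return float(np.mean(errors))
```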

Search Space Reduction and Search Path

The audio query is also classified among the 72 groups, as previously described. The groups it can get classified into are defined as the search space of the audio query. For a perfectly classified song and a query from that song by a trained subject, the search space is reduced to one of the 72 groups (i.e., less than 2% of the test database). The initial search is carried out only in this part of the database and the results are displayed. If the user is not satisfied with the results, an alternate search path is generated to run a


search on the remaining database. The search path contains the remaining of the 72 clusters in increasing order of the fitting error (calculated above) for the particular query. This alternative search procedure was designed taking into consideration the errors in the query rendition by the user. Hence, if the user does not find the required song in the first search, an exhaustive search of the remaining database along the search path is done, eliminating the possibility of missing the cluster of the required song in the total search.
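
A minimal sketch of the search path construction, building on fitting_error() above (the ordering by increasing error is as described; the function names are placeholders):

```python
def search_path(peak_freqs, sa, groups, ratios):
    """Rank all groups by ascending fitting error: the first entry is the
    'first hit' group, the rest form the fallback search path."""
    errors = [fitting_error(peak_freqs, sa, g, ratios) for g in groups]
    return sorted(range(len(groups)), key=lambda i: errors[i])
```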

Experimental results

The songs chosen for experimentation belonged to various styles of music: traditional Carnatic songs, film songs from various eras, rock and instrumental songs. A total of 50 songs were selected. Table II shows the percentage of songs classified into one or more groups.

TABLE II: CLASSIFICATION RESULTS

No. of groups placed in:   3      2      1      0
Percentage of songs:       32%    39%    28%    1%*

*1% of the songs could not be classified correctly as they had large segments containing non-melodic content, like the rendition of dialogues in the song.

The proposed system reduced the search space to 1 of the total 72 groups for an audio query closely resembling the original song and a perfect classification, which translates to 1.3%, assuming the database contains a nearly equal number of songs in all the groups. For a query by an untrained singer, the system reduced the search space to 22 of the 72 groups (translating to 30% of the test database) on average. The search space reduction system can be employed in any CBMIR system irrespective of the similarity measures or the features used. The audio query had the required song in the search space in 80% of the cases. For the remaining 20% the song was found in the first three clusters of the alternate search path.

Conclusion

The search space reduction method worked well on both polyphonic and monophonic audio recordings. The proposed system, though simple, has significant implications: similarity measures that offer high retrieval accuracy even for erroneous queries, but require a large amount of time and resources due to their complex procedures, can now be used in CBMIR systems. To the best of our knowledge, a reduction of the search space to 2% of the database size has not been achieved by any other method. The two-level search process implemented in the system ensures that the queried songs, if present, are retrieved as fast as possible, and also that the required songs are not missed in the search due to the highly erroneous nature of an untrained query.

Acknowledgements

I would like to thank Mr. B. Ravi Ganesh for explaining to me the nuances of classical music and making this project possible. I specially thank Mr. Arvindh Krishnaswamy and Mr. Sinith M.S. for their guidance.

References

[1] ChulYong Yang, JongTak Shin, JinWook Kim, HangJoon Kim, "Korean Folk Song Retrieval Using Rhythm Pattern Classification", Proceedings of the Fifth International Symposium on Signal Processing and its Applications, 1999.
[2] Jyh-Shing Roger Jang, Hong-Ru Lee, "Hierarchical Filtering Method for Content-based Music Retrieval via Acoustic Input", Proceedings of the Ninth ACM International Conference on Multimedia, 2001, pp. 401-410.

[3] Yongwei Zhu, Mohan Kankanhalli, "Music Scale Modeling for Melody Matching", Proceedings of the Eleventh ACM International Conference on Multimedia, November 2-8, 2003, pp. 359-362.
[4] Yongwei Zhu, Mohan S. Kankanhalli, Sheng Gao, "Music Key Detection for Musical Audio", Proceedings of the 11th International Multimedia Modeling Conference (MMM'05), 12-14 Jan. 2005, pp. 30-37.
[5] P. Sambamurthy, "South Indian Music", Vols. 1-6, The Indian Music Publishing House, Madras.
[6] Ramesh Mahadevan, "A Gentle Introduction to South Indian Classical Music", Parts 1-4. http://www.ecse.rpi.edu/Homepages/shivkuma/personal/music/basics/ramesh/gentle-introramesh-mahadevan-I.htm
[7] Arvindh Krishnaswamy, "Application of Pitch Tracking to South Indian Classical Music", Proceedings of the 2003 IEEE International Conference on Acoustics, Speech, and Signal Processing, Vol. 5, pp. 557-560.
[8] Nicola Orio, "Music Retrieval: A Tutorial and Review", Foundations and Trends in Information Retrieval, Vol. 1, No. 1 (November 2006), pp. 1-90.
[9] P. D. Welch, "The Use of Fast Fourier Transform for the Estimation of Power Spectra: A Method Based on Time Averaging Over Short, Modified Periodograms", IEEE Transactions on Audio and Electroacoustics, Vol. AU-15 (June 1967), pp. 70-73.
[10] John G. Proakis, Dimitris G. Manolakis, "Digital Signal Processing: Principles, Algorithms and Applications", 3rd edition, Prentice Hall.
[11] Patricio de la Cuadra, Aaron Master, Craig Sapp, "Efficient Pitch Detection Techniques for Interactive Music", Proceedings of the 2001 International Computer Music Conference, Havana, Cuba, 2001, pp. 403-406.

