International Journal of Electronics Engineering, 2 (2), 2010, pp. 299 – 303

Monophony/Polyphony Classification System using Fourier of Fourier Transform

Kalyani Akant¹, Rajesh Pande², and S.S. Limaye³

¹ Department of Electronics and Communication Engineering, Manoharbhai Patel Institute of Engineering and Technology, Gondia, Maharashtra, India. E-mail: [email protected]
² Department of Electronics Engineering, Shri Ramdeobaba Kamla Nehru Engineering College, Nagpur, Maharashtra, India. E-mail: [email protected]
³ Department of Electronics Engineering, Jhulelal Institute of Technology, Nagpur, Maharashtra, India. E-mail: [email protected]

Abstract: In this paper we propose a method for distinguishing monophonic signals from polyphonic ones using the Fourier of Fourier Transform (FFT2). Pitch estimation for monophonic signals is much simpler than for polyphonic signals, and prior knowledge of the number of notes played (in the polyphonic case) facilitates multi-pitch estimation. Since different methods may be used for pitch estimation in the monophonic and polyphonic contexts, identifying a signal as monophonic or polyphonic becomes essential. Investigating the harmonic pattern of the sound in the frequency domain gives us the fundamental frequency (pitch). The periodicity of the Fourier transform is detected by again taking its Fourier transform, to obtain the "Fourier of Fourier transform" [7] (FFT2). Classification is based on the fact that music signals are harmonic. For monophonic signals we get a series of peaks in the FFT2 domain with a nearly constant bin spacing related to the pitch of the single note, whereas for polyphonic signals this regularity is disturbed, as the FFT2 spectrum contains multiple series of peaks corresponding to multiple notes. We have tested our method on the database available at http://theremin.music.uiowa.edu/ [15].

Keywords: Monophony, Polyphony, Fourier of Fourier Transform, Pitch.

INTRODUCTION

Many methods have been proposed in the literature for pitch estimation [1]. In the monophonic case the pitch is easier to determine than in the polyphonic case; the problem of pitch estimation for monophonic signals is considered solved, whereas multi-pitch estimation is still a challenging issue. Monophonic pitch estimation methods include time-domain methods [2], [3], [4] and frequency-domain methods [5], [6], [7]. Methods for pitch estimation in the polyphonic context include [8], [9], and [10]. In [11], monophony/polyphony classification is based on a confidence indicator introduced by de Cheveigné [12]: the short-term mean and variance of this indicator are calculated, the bivariate repartition of these two parameters is modeled with a Weibull bivariate distribution for each class, and the classification is made by computing the likelihood over one second for each class and taking the best one. The problem of singing voice detection in monophonic and polyphonic contexts is addressed in [13], where the method of de Cheveigné [12] is again used to classify the signal as monophonic or polyphonic.

Our method is based on the fact that music signals are harmonic. The harmonicity is detected in the FFT2 [7] domain. For monophonic signals, all peaks in the FFT2 spectrum obey a harmonic relation. For polyphonic signals, the FFT2 spectrum is a mixture of multiple series of harmonic peaks

corresponding to multiple notes, and hence not all peaks follow a single harmonic relation. By checking whether all peaks are harmonically related, the signal is classified as monophonic or polyphonic. Our main objective is singing voice detection in mono recordings, in view of query-by-humming applications; the present method is a part of that objective.

This paper is organized as follows. Section 1 presents the details of the Fourier of Fourier Transform. The proposed method is explained in Section 2. Results and conclusions are given in Sections 3 and 4, respectively.

1. FOURIER OF FOURIER TRANSFORM

In our analysis we use two Fourier transforms in sequence, referred to as the Fourier of Fourier Transform (FFT2). Our method works very well for harmonic sounds, i.e. sounds rich in harmonics; it is not suited to pure sinusoids. The Fourier transform, FT (the first Fourier transform of the signal), of a typical musical sound has a series of peaks in its magnitude spectrum corresponding to the harmonics of the sound, at frequencies close to multiples of the fundamental frequency F. The peak at the fundamental frequency may not always be dominant, so a single Fourier transform is insufficient to identify the correct peak. The Fourier of Fourier Transform is of great interest in locating this peak, which helps to overcome the possibility of octave error. To compute the Fourier of Fourier Transform, we first compute the magnitude spectrum of the Fourier transform of the signal, and then the magnitude spectrum of the Fourier transform of that magnitude spectrum. Note that this transform is not the same as the well-known "cepstrum", which is the (inverse) Fourier transform of the logarithm of the spectrum resulting from the Fourier transform. Figure 1 shows the FT of piano C# of the 5th octave; it has a series of uniformly spaced peaks corresponding to the harmonics of the fundamental frequency.

Fig. 1: Fourier Transform of Piano C# of 5th Octave
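The two-step computation just described can be sketched in a few lines of numpy; this is a minimal illustration (the function name is ours, not from the paper):

import numpy as np

def fft2_magnitude(frame, n1=None, n2=None):
    # Magnitude spectrum of the FFT of the magnitude spectrum of the
    # frame -- the Fourier of Fourier Transform (FFT2). Unlike the
    # cepstrum, no logarithm is taken between the two transforms.
    n1 = len(frame) if n1 is None else n1
    n2 = n1 if n2 is None else n2
    spec = np.abs(np.fft.fft(frame, n=n1))   # first magnitude spectrum
    return np.abs(np.fft.fft(spec, n=n2))    # magnitude of its FFT

With n1 = n2 = 2·len(frame), this matches the zero-padded transform sizes used in Section 2.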

In Fig. 1 we can clearly see that the peak corresponding to the fundamental frequency is not dominant. If the fundamental frequency is F, the distance between two consecutive peaks corresponds to a period of ∆1 bins, where:

∆1 = N1 · (F / Fs)    ... (1)

N1: size of the first Fourier transform; Fs: sampling frequency.

The first peak is at bin 0 and corresponds to the DC level. The difference between the second peak (shown by an arrow in Fig. 1) and the first peak is ∆1 bins. Figure 2 shows the spectrum of the Fourier of Fourier Transform of piano C# of the 5th octave. This FFT2 spectrum also contains a series of peaks. Here too, the first peak is at bin 0 and corresponds to the DC level; the second peak is shown by an arrow in Fig. 2. The distance between two consecutive peaks corresponds to a period of ∆2 bins, where:

∆2 = N2 / ∆1    ... (2)

N2: size of the second Fourier transform.

From Eqs (1) and (2), we get

∆2 = N2 / (N1 · F / Fs)    ... (3)

If the sizes of the first and second Fourier transforms are the same (N2 = N1), the fundamental frequency F is given by:

F = Fs / ∆2    ... (4)

Fig. 2: Fourier of Fourier Transform of Piano C# of 5th Octave
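As a quick worked illustration of Eqs (1), (2) and (4), take a standard-tuned C#5 (F ≈ 554.37 Hz; the figures actually use a slightly mistuned piano) with Fs = 44100 Hz and N1 = N2 = 4096:

∆1 = 4096 × (554.37 / 44100) ≈ 51.5 bins
∆2 = 4096 / 51.5 ≈ 79.6 bins
F = 44100 / 79.6 ≈ 554 Hz

so the spacing of the FFT2 peaks directly recovers the fundamental.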

1.1 Advantage of FFT2 Over FT

The peaks in FFT2 are more widely spaced, as illustrated in Table 1. Here the 12 notes of the 4th octave are analyzed with a sampling frequency of 44100 Hz and an FFT size of 4096, and the bin index numbers in the FT and FFT2 domains are tabulated. (Note that, due to slight mistuning of the piano, A has a frequency of 442.8272 Hz rather than 440 Hz.) The frequency of each note is found by applying parabolic interpolation [14] to the peak found in FFT2, as sketched below.
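The interpolation step can be sketched as follows (a minimal numpy version of the quadratic fit from [14]; the function name is ours):

import numpy as np

def refine_peak(spec, k):
    # Parabolic interpolation around bin k of a magnitude spectrum, as
    # in Smith & Serra [14]: fit a parabola through the peak bin and
    # its two neighbours and return its fractional vertex position.
    a, b, c = spec[k - 1], spec[k], spec[k + 1]
    return k + 0.5 * (a - c) / (a - 2.0 * b + c)

# With N2 = N1, Eq. (4) then gives the note frequency from the refined
# FFT2 peak: F = fs / refine_peak(fft2_spec, k_peak). Bin indices here
# are 0-based; 1-based Matlab-style indices, as apparently used in
# Table 1, need the K - 1 correction mentioned later in Section 2.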

Table 1
Indices of Harmonics in Terms of Bins in FT and FFT2 for Notes in the 4th Octave

Musical note    Index in FT    Index in FFT2    Frequency of note    MIDI note number
C               25             168              263.3223 Hz          60
C#              26             159              278.9644 Hz          61
D               27             150              295.4930 Hz          62
Eb              29             142              312.7933 Hz          63
E               31             134              331.7331 Hz          64
F               33             126              351.5081 Hz          65
F#              35             119              372.4047 Hz          66
G               37             113              394.5282 Hz          67
Ab              39             106              418.4121 Hz          68
A               41             101              442.8272 Hz          69
Bb              44             95               469.2249 Hz          70
B               46             90               497.1536 Hz          71

We observe from the above table that in FT there is a difference of only one or two bins per semitone, while in FFT2 the difference is five to nine bins. Also, as we move to lower octaves, the index in FT decreases while the index in FFT2 increases. For some pairs of consecutive semitones in the third octave the indices in FT are the same even though their actual frequencies differ; estimating the fundamental frequency without parabolic interpolation would give the same value for both semitones. Hence it is parabolic interpolation that plays the key role in finding the correct frequency of such semitones. Another feature of FFT2 is its ability to detect peaks of harmonics corresponding to multiple pitches.

1.2 Ability of FFT2 to Detect Multiple Pitches

In the FFT2 domain the spectral peaks are not as closely spaced as in the FT domain, so a peak detector can locate them without ambiguity. Figure 3 shows the FFT2 spectrum when A flat and C sharp of the 4th octave are played together. This ability of FFT2 is of great interest for music segregation in a polyphonic environment.
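A deliberately simple peak detector therefore suffices in the FFT2 domain. A possible sketch using scipy (the height and spacing gates are our assumptions, not values from the paper):

import numpy as np
from scipy.signal import find_peaks

def fft2_peak_bins(fft2_spec, n, min_height_frac=0.01, min_spacing=4):
    # Bin numbers of local maxima in fft2_spec[0 : n + 1]; because FFT2
    # peaks are widely spaced, a plain local-maximum search with modest
    # height and spacing gates is enough to list them unambiguously.
    s = np.asarray(fft2_spec)[: n + 1]
    peaks, _ = find_peaks(s, height=min_height_frac * s.max(),
                          distance=min_spacing)
    return peaks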


2. PROPOSED METHOD

Step 1: A signal frame of size N is selected and its FFT2 is computed. The sizes of the first and second FTs were chosen as 2N to improve frequency resolution. In our case, N = 2048.

Step 2: The bin numbers of all peaks in the FFT2 spectrum from 0 to N are stored in a vector V = {V1, V2, ..., Vn}.

Step 3: The bin number of the maximum amplitude in the FFT2 spectrum is detected; denote it by K. (If bin 0 is at index 1, as in Matlab, K should be taken as K – 1.)

Step 4: If the singing voice lies in the frequency range f1 to f2, the FFT2 bin number of the maximum amplitude will lie between Fs/f2 and Fs/f1. All bin numbers of V in this range are stored in a vector X = {X1, X2, ..., Xm}.

Step 5: From X, those bin numbers whose peak values are less than 30% of the peak value at K are rejected. Let the remaining bin numbers be Y = {Y1, Y2, ..., Yi}.

Step 6: It is then checked whether each bin number Yj + K (or Yj + K – 1 in Matlab), for 1 ≤ j ≤ i, falls within the vicinity Vl – 5 to Vl + 5 of some element Vl of V (1 ≤ l ≤ n). If so, the signal is monophonic; otherwise it is polyphonic.

The above condition is strict for monophonic signals, so the probability of misclassifying monophonic signals is higher than that of misclassifying polyphonic ones. We have therefore tested our method on a large database of monophonic signals [15].

3. RESULTS

The algorithm is explained using the following examples. For the frame in Fig. 4, K = 195, so K – 1 = 194.

V = {1, 49, 97, 145, 195, 245, 292, 340, 391, 441, 487, 535, 586, 636, 682, 730, 781, 831, 877, 925, 976, 1027, 1073, 1119, 1171, 1222, 1269, 1314, 1366, 1418, 1464, 1508, 1559, 1616, 1661, 1698, 1735, 1768, 1800, 1832, 1907, 1955, 2006}.

We considered f1 = 100 Hz and f2 = 800 Hz. With Fs = 44100 Hz, vector X in Step 4 contains the bin numbers from 55 to 441, so X = {97, 145, 195, 245, 292, 340, 391, 441} and Y = {145, 195, 245, 340, 391, 441}. Step 6 is then performed; the check is illustrated in Table 2.

Fig. 4: FFT2 Spectrum of One Frame of a Note of Frequency 227.0427 Hz

Fig. 3: FFT2 Spectrum when A Flat and C Sharp of the 4th Octave are Played Together

The condition in Step 6 is satisfied, hence the signal is monophonic.
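Putting Steps 1–6 together, one possible per-frame implementation reads as follows. This is our sketch of the procedure, with an assumed peak-height gate rather than the authors' exact peak detector; the tolerance tol mirrors the ±5-bin vicinity of Step 6:

import numpy as np
from scipy.signal import find_peaks

def classify_frame(frame, fs=44100.0, f1=100.0, f2=800.0, tol=5):
    n_frame = len(frame)                      # N = 2048 in the paper
    n_fft = 2 * n_frame                       # Step 1: FT sizes of 2N
    spec = np.abs(np.fft.fft(frame, n=n_fft))
    fft2 = np.abs(np.fft.fft(spec, n=n_fft))[: n_frame + 1]  # bins 0..N

    # Step 2: bin numbers of all FFT2 peaks, skipping the DC bin
    # (the 1% height gate is an assumption).
    v, _ = find_peaks(fft2[1:], height=0.01 * fft2[1:].max())
    v = v + 1                                 # compensate for the slice

    # Steps 3-4: strongest bin K, searched in the pitch-period range
    # Fs/f2 .. Fs/f1, and all detected peaks X in that range.
    lo, hi = int(fs / f2), int(fs / f1)
    k = lo + int(np.argmax(fft2[lo : hi + 1]))
    x = v[(v >= lo) & (v <= hi)]

    # Step 5: keep only peaks at least 30% as high as the peak at K.
    y = x[fft2[x] >= 0.3 * fft2[k]]

    # Step 6: every retained peak, shifted up by K bins, must land
    # within +/- tol bins of some detected peak in V.
    for yj in y:
        if not np.any(np.abs(v - (yj + k)) <= tol):
            return "polyphonic"
    return "monophonic"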


Table 2
Illustration of Step 6 for a Monophonic Signal

Yj     Yj + K – 1    Falls within Vl – 5 to Vl + 5 for some l (1 ≤ l ≤ n)?
145    339           Yes (element 340 in V)
195    389           Yes (element 391 in V)
245    439           Yes (element 441 in V)
340    534           Yes (element 535 in V)
391    585           Yes (element 586 in V)
441    635           Yes (element 636 in V)

For the frame in Fig. 5, K = 330, so K – 1 = 329.

V = {1, 44, 83, 126, 202, 246, 285, 330, 380, 421, 468, 534, 579, 617, 663, 721, 762, 804, 847, 882, 925, 1005, 1052, 1096, 1139, 1179, 1212, 1253, 1294, 1334, 1379, 1418, 1455, 1495, 1536, 1579, 1620, 1662, 1700, 1726, 1765, 1808, 1858, 1894, 1926, 1963, 2002, 2038}.

We again considered f1 = 100 Hz and f2 = 800 Hz. With Fs = 44100 Hz, vector X in Step 4 contains the bin numbers from 55 to 441, so X = {83, 126, 202, 246, 285, 330, 380, 421} and Y = {83, 126, 202, 246, 285, 330, 380}. Step 6 is then performed; the check is illustrated in Table 3.

Fig. 5: FFT2 Spectrum of One Frame of a Polyphonic Signal

Table 3
Illustration of Step 6 for a Polyphonic Signal

Yj     Yj + K – 1    Falls within Vl – 5 to Vl + 5 for some l (1 ≤ l ≤ n)?
83     412           No
126    455           No
202    531           Yes (element 534 in V)
246    575           Yes (element 579 in V)
285    614           Yes (element 617 in V)
330    659           Yes (element 663 in V)
380    709           No

The condition in Step 6 is not satisfied, hence the signal is polyphonic.

The accuracy of our algorithm is tested using the global error rate:

Error = number of misclassified seconds / total number of seconds.

We observed that the error decreases for frames with larger signal amplitude. The amplitude of a frame is calculated by summing the modulus of each sample value in the frame. We therefore run the algorithm only on those frames whose amplitude exceeds a threshold; the larger the threshold, the smaller the error. This effect is shown in Table 4, where Threshold / Maximum amplitude of signal = 0 means the threshold is 0, so the algorithm is run over the entire signal. All files are available at [15].

Table 4
% Error

Name of file              Threshold / Maximum amplitude of signal    % error
AltoFlute_ff_C4B4         0                                          6.84
                          0.7                                        0
BassFlute_pp_C4B4         0                                          5.19
                          0.7                                        0
Bassoon_pp_C4B4           0                                          2.83
                          0.7                                        0
EbClar_pp_C4B4            0                                          4.07
                          0.7                                        0
Flute_novib_pp_B3B4       0                                          12.2
                          0.7                                        0
Horn_pp_C4B4              0                                          1.74
                          0.7                                        0
TenorTrombone_pp_C4B4     0                                          6.8
                          0.7                                        0
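The amplitude gating behind Table 4 can be sketched as below; we interpret "maximum amplitude of signal" as the largest frame amplitude, which is our assumption:

import numpy as np

def frame_amplitude(frame):
    # Frame amplitude as defined above: sum of absolute sample values.
    return float(np.sum(np.abs(frame)))

def frames_above_threshold(signal, frame_len=2048, rel_threshold=0.7):
    # Keep only the frames whose amplitude exceeds rel_threshold times
    # the maximum frame amplitude; rel_threshold = 0 keeps every frame
    # with non-zero amplitude (the "entire signal" rows of Table 4).
    starts = range(0, len(signal) - frame_len + 1, frame_len)
    frames = [signal[i : i + frame_len] for i in starts]
    amps = np.array([frame_amplitude(f) for f in frames])
    return [f for f, a in zip(frames, amps) if a > rel_threshold * amps.max()]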

4. CONCLUSION AND FUTURE WORK

Real-world signals are noisy, and our algorithm may fail on noisy signals. The signal should therefore be band-pass filtered (100 – 800 Hz) prior to applying the algorithm, so as to reject peaks corresponding to noise. This algorithm will be merged with our main goal: pitch tracking of the singing voice in a polyphonic context. Monophonic pitch tracking is simple and requires less time, so once the signal has been classified at each frame, a different pitch-tracking algorithm will be run for each class.

REFERENCES

[1] Zhenyu Zhao and Lyndon J. Brown, "Musical Pitch Tracking using Internal Model Control Based Frequency Cancellation", 42nd IEEE Conference on Decision and Control, 5, December 2003, pp. 5544 – 5548.
[2] L.R. Rabiner et al., "A Comparative Performance Study of Several Pitch Detection Algorithms", IEEE Trans. ASSP, 24 (5), pp. 399 – 418, October 1976.
[3] J.C. Brown and M.S. Puckette, "Calculation of a Narrowed Autocorrelation Function", J. Acoust. Soc. Am., 85 (4), pp. 1595 – 1601, April 1989.
[4] J.C. Brown and B. Zhang, "Musical Frequency Tracking using the Methods of Conventional and Narrowed Autocorrelation", J. Acoust. Soc. Am., 89 (5), pp. 2346 – 2354, May 1991.
[5] M. Piszczalski and B.A. Galler, "Predicting Musical Pitch from Component Frequency Ratios", J. Acoust. Soc. Am., 66 (3), pp. 710 – 721, September 1979.
[6] J.C. Brown, "Musical Fundamental Frequency Tracking using a Pattern Recognition Method", J. Acoust. Soc. Am., 92 (3), pp. 1394 – 1402, September 1992.
[7] Sylvain Marchand, "An Efficient Pitch-Tracking Algorithm using a Combination of Fourier Transforms", Proceedings of the COST G-6 Conference on Digital Audio Effects (DAFX-01), Limerick, Ireland, December 6 – 8, 2001.
[8] P.J. Walmsley, S.J. Godsill, and P.J.W. Rayner, "Polyphonic Pitch Tracking using Joint Bayesian Estimation of Multiple Frame Parameters", Proc. 1999 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, New Paltz, New York, October 17 – 20, 1999, pp. 119 – 122.
[9] A.P. Klapuri, "Multiple Fundamental Frequency Estimation Based on Harmonicity and Spectral Smoothness", IEEE Transactions on Speech and Audio Processing, 11 (6), 2003, pp. 804 – 816.
[10] Chunghsin Yeh, A. Röbel, and X. Rodet, "Multiple Fundamental Frequency Estimation of Polyphonic Music Signals", IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP '05), 3, pp. iii/225 – iii/228.
[11] H. Lachambre, R. André-Obrecht, and J. Pinquier, "Monophony vs Polyphony: A New Method Based on Weibull Bivariate Models", Content-Based Multimedia Indexing (CBMI '09), 2009, pp. 68 – 72.
[12] A. de Cheveigné and H. Kawahara, "YIN, a Fundamental Frequency Estimator for Speech and Music", Journal of the Acoustical Society of America, 111 (4), pp. 1917 – 1930, April 2002.
[13] Hélène Lachambre, Régine André-Obrecht, and Julien Pinquier, "Singing Voice Detection in Monophonic and Polyphonic Contexts", 17th European Signal Processing Conference (EUSIPCO 2009).
[14] J.O. Smith and X. Serra, "PARSHL: An Analysis/Synthesis Program for Non-Harmonic Sounds Based on a Sinusoidal Representation", Proceedings of the 1987 International Computer Music Conference, International Computer Music Association, San Francisco, 1987, pp. 290 – 297.
[15] http://theremin.music.uiowa.edu/.