Use of Transforms for Indexing in Audio Databases

S.R. Subramanya (1), Abdou Youssef (2), Bhagirath Narahari (2), Rahul Simha (3)

(1) Department of Computer Science, University of Missouri-Rolla, Rolla, MO 65409
(2) Department of EE and CS, The George Washington University, Washington, DC 20052
(3) Department of Computer Science, College of William and Mary, Williamsburg, VA 23185
Abstract

The phenomenal increases in the amounts of audio data being generated, processed, and used in several computer applications have necessitated the development of audio database systems with newer features, such as content-based queries and similarity searches, to manage and use such data. Fast and accurate retrievals for content-based queries are crucial for such systems to be useful. Efficient content-based indexing and similarity searching schemes are key to providing fast and relevant data retrievals. This paper studies and evaluates the different parameters involved in an indexing scheme that uses transforms (as used in signal processing) to generate indices of audio data, which are then used in search and retrieval. A comparison of the performance of the indexing scheme under different parameter settings is presented.
1 Introduction

There has been an explosive growth in the generation and use of audio data (together with other multimedia data such as video and images) in several computer applications, and this is expected to continue at an accelerated pace in the near future. Audio databases are being developed to facilitate effective management and use of enormous amounts of audio data. For audio/multimedia databases, keyword-based indexing is very limiting and content-based indexing is essential. This is due to several reasons, including (1) the inexact nature and the richness of expression of audio/multimedia data, (2) the subjective interpretations of such data, and (3) the difficulty of formulating exact keyword-based queries that describe the 'qualities' of the required objects. Since content-based retrievals in audio/multimedia databases are unrestricted, unanticipated, and usually search the entire collection of data, good indices that characterize the data features well while still being of low dimensionality are desirable, as they facilitate accurate and fast retrievals. Considerable attention has been devoted to the problem of content-based image indexing ([1, 2, 3, 4, 5]) and video indexing ([6, 7, 8, 9, 10]), but comparatively little to content-based audio indexing. Although a tremendous amount of work has been done in the analysis, synthesis, recognition, coding, and transmission of speech, and in the analysis and synthesis of music, the on-line storage of digital audio data in computers and its organization as databases with query and retrieval functions are just beginning and are expected to grow tremendously. Example applications of audio databases are in digital libraries, the entertainment industry, forensic laboratories, and virtual reality, among others. Although transforms have been routinely used in signal processing and data compression, their use in audio/multimedia database indexing has not been extensively explored. In this paper, we focus on the issues involved in the derivation of indices from raw audio data using transforms. The organization of the index in a suitable multidimensional index structure and the search schemes are not addressed here.
[Figure 1: Index extraction from audio data. Pipeline: Audio Data -> Blocking/Segmentation -> Apply Transform on Blocks/Segments -> Coefficient Selection (plus locations, etc.) -> Insert into Index Structure]

The next section presents a few representative research efforts in the indexing of audio data. Section 3 briefly describes the indexing of audio data using transforms and enumerates the issues involved in such a scheme. Section 4 describes a scheme using audio segments as indices. Experimental results are given in Section 5, followed by conclusions.
2 Schemes for Audio Data Indexing

We can broadly classify the commonly used schemes for indexing audio data as follows:
1. Indexing based on signal statistics. In this scheme, signal statistics such as (1) the mean, (2) the variance, (3) zero-crossings, (4) the autocorrelation, and (5) histograms of samples or of differences of samples, computed either on the whole data or on blocks of data, are used (a small sketch of such per-block statistics is given after this list).

2. Indexing based on acoustical attributes. In this scheme, acoustical attributes such as pitch, loudness, brightness, and harmonicity are derived from measurable quantities of the audio signal. Statistical analyses (mean, variance, etc.) are then applied to these attributes to derive feature vectors, which are used in indexing and searching [14, 15].

3. Transform-based indexing. In this scheme, a transform (such as the DCT) is applied to (blocks of) the data, and suitable coefficients are selected and used as indices for search and retrieval [16]. Issues such as the appropriate transform to use, the block size, and the number of coefficients to retain need to be addressed.

Methods 1 and 2 have drawbacks: the first method does not give very accurate results for highly varying signals, and the second method is better than the first but requires extensive computations. Also, the first two methods are not robust to noise, whereas the DCT-based method is more resilient to noise.
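As an illustration of the first scheme, the following sketch (not from the paper; the block size and the particular statistics retained are illustrative choices) computes simple per-block statistics that could serve as a crude feature vector:

```python
import numpy as np

def block_statistics(signal, block_size=1024):
    """Per-block signal statistics (scheme 1): mean, variance, zero-crossing count.

    `signal` is a 1-D array of audio samples; `block_size` is an illustrative choice.
    """
    features = []
    for start in range(0, len(signal) - block_size + 1, block_size):
        block = signal[start:start + block_size]
        zero_crossings = int(np.count_nonzero(np.diff(np.sign(block))))
        features.append((block.mean(), block.var(), zero_crossings))
    return np.array(features)
```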
3 Transform-based Indexing

The essential idea behind the transform-based indexing scheme for audio data is to apply a suitable transform to the data, which yields a set of transform coefficients. An index entry for each audio file is created by selecting an appropriate subset of the transform coefficients and retaining a pointer to the original audio block. Clearly, this index occupies less space than the original data and allows for faster searching. The outline of the scheme is illustrated in Fig. 1. (The reason for blocking/segmentation is explained in Sections 3.3.3 and 4.)
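A minimal sketch of this pipeline, assuming a DCT applied to fixed-size blocks and magnitude-based coefficient selection (the block size, retention fraction, and function names are illustrative, not prescribed by the paper):

```python
import numpy as np
from scipy.fft import dct

def build_index(signal, file_id, block_size=1024, keep=0.1):
    """Build index entries for one audio file.

    Each entry holds a pointer back to the source (file_id, block number) plus
    the locations and values of the retained DCT coefficients for that block.
    """
    n_keep = max(1, int(keep * block_size))
    entries = []
    for block_no, start in enumerate(range(0, len(signal) - block_size + 1, block_size)):
        coeffs = dct(signal[start:start + block_size], norm='ortho')  # transform the block
        locs = np.argsort(np.abs(coeffs))[-n_keep:]                   # largest-magnitude coefficients
        entries.append((file_id, block_no, locs, coeffs[locs]))
    return entries
```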
3.1 Brief overview of transforms

Transforms such as the Discrete Fourier Transform (DFT) and the Discrete Cosine Transform (DCT) are applied to signals (time domain: audio; spatial: images; spatio-temporal: video) to transform the data to the frequency domain, sometimes referred to as the spectrum of the signal. Specifically, given a vector X = (x_1, x_2, ..., x_N) representing a discrete signal, the application of a transform yields a vector Y = (y_1, y_2, ..., y_N) of transform coefficients such that the original signal X can be recovered from Y by applying the inverse transform. The transform/inverse-transform pairs for the DFT and DCT can be found in any book on signal processing (for example, [12]). The transformed data has several desirable properties.
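As a quick illustration of this invertibility, a sketch using SciPy's orthonormal DCT-II/IDCT pair (the test signal and its length are arbitrary):

```python
import numpy as np
from scipy.fft import dct, idct

x = np.random.default_rng(0).standard_normal(512)  # arbitrary discrete signal X
y = dct(x, norm='ortho')                            # forward transform: coefficients Y
x_rec = idct(y, norm='ortho')                       # inverse transform recovers X
print(np.allclose(x, x_rec))                        # True (up to floating-point error)
```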
3.2 Advantages of transform-based indexing

Transforms such as the DFT and DCT possess the following desirable properties:

Energy compaction. Only a few coefficients of Y are significantly large, while the remaining coefficients are near zero. Thus only a few transform coefficients suffice as the index, resulting in a compact index size. This is easily seen in Figure 2.

Robustness to noise. Coefficients corresponding to high frequencies and noise are clustered in recognizable locations in Y. These can be selectively removed to eliminate noise, resulting in more accurate retrievals.

Correspondence with the human auditory system. Most of the energy is concentrated in a few coefficients of Y at recognizable locations, corresponding to the lower-frequency parts of the sound spectrum, to which the human ear is more sensitive.

Independence of transforms from signals. These transforms are expressible in simple matrix form: Y = TX, where X is the original signal in column form, Y is the transform output in column form, and T is a non-singular matrix that characterizes the transform and is independent of the signals.

Fast computation. Fast algorithms are available for computing these transforms; the transform of an N-point signal can be computed in O(N log N) time. The availability of special-purpose hardware for their computation leads to further speedup.
[Figure 2: DCT coefficients of a few blocks of two different audio signals. Each panel shows a block of the original signal alongside its DCT coefficients; only a few coefficients are significantly large.]
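The energy compaction property can be quantified with a small sketch such as the following, which measures the fraction of a block's energy captured by its largest-magnitude DCT coefficients (the 10% retention level is only an example):

```python
import numpy as np
from scipy.fft import dct

def energy_fraction(block, keep=0.1):
    """Fraction of a block's total energy held by its keep*N largest-magnitude DCT coefficients."""
    coeffs = dct(np.asarray(block, dtype=float), norm='ortho')
    energy = np.sort(coeffs ** 2)[::-1]           # per-coefficient energies, largest first
    n_keep = max(1, int(keep * energy.size))
    return energy[:n_keep].sum() / energy.sum()
```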
3.3 Issues in Transform-Based Indexing

The major parameters in a transform-based indexing scheme are: (1) the choice of transform, (2) whether to transform the whole signal or windows of the signal, (3) the type of window, (4) the window size, (5) the coefficient selection method, and (6) the appropriate number of coefficients. The following subsections briefly describe these issues and their different options. Section 5 gives more results related to the performance of the indexing scheme under the different options for these parameters.
3.3.1 Choice of transform
Different transforms, such as the DCT, DFT, Hadamard, and Haar transforms, were experimentally evaluated for their suitability for indexing audio data. The performance metrics were (1) PSNR (Peak Signal to Noise Ratio) and (2) retrieval precision. The PSNR is defined between the original signal A = {a_1, a_2, ..., a_N} and the reconstructed signal A' = {a'_1, a'_2, ..., a'_N}, which is obtained by applying the inverse transform to only the few retained coefficients (the unretained coefficients being treated as zero). The PSNR is given by

    PSNR = 20 log10 (s / RMSE),

where s is the peak sample value possible and RMSE is the root mean squared error,

    RMSE = sqrt( (1/N) * sum_{i=1}^{N} (a_i - a'_i)^2 ).

Retrieval precision is defined as the ratio

    (Number of Relevant Objects Returned) / (Total Number of Objects Returned),

which is determined by retrieving matching data for queries-by-example, using the transform coefficients as indices in the search process.
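A sketch of the two metrics as defined above; it uses the maximum absolute sample of the original as the peak s when no peak is supplied (an assumption), and it assumes the relevance judgments needed for precision are available from elsewhere:

```python
import numpy as np

def psnr(original, reconstructed, peak=None):
    """PSNR = 20 log10(s / RMSE); if the true peak value s is unknown,
    the maximum absolute sample of the original is used (an assumption)."""
    original = np.asarray(original, dtype=float)
    reconstructed = np.asarray(reconstructed, dtype=float)
    s = np.abs(original).max() if peak is None else peak
    rmse = np.sqrt(np.mean((original - reconstructed) ** 2))
    return 20 * np.log10(s / rmse)

def retrieval_precision(returned_ids, relevant_ids):
    """Number of relevant objects returned / total number of objects returned."""
    returned = set(returned_ids)
    return len(returned & set(relevant_ids)) / len(returned)
```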
3.3.2 Coefficient selection scheme
There are broadly two schemes for selecting the transform coefficients: zonal and threshold. In zonal selection, k coefficients at fixed locations are chosen, while in threshold selection, the k highest-valued coefficients are chosen. Figure 3 shows the reconstructed signals for two audio signals, A1 (columns 1, 2) and A2 (columns 3, 4), using different numbers of DCT coefficients (20%, 10%, 5%, 2.5%: rows 2-5), selected using the zonal (columns 1, 3) and threshold (columns 2, 4) schemes. It is easily seen that signal reconstruction is superior with threshold selection, which implies that most of the signal information is concentrated in the threshold-selected coefficients and suggests their use as indices. This is also confirmed by the better PSNRs and retrieval precisions provided by threshold-selected coefficients over zonal coefficients (as shown in Figures 5 and 6).
[Figure 3: Reconstruction of A1 and A2 using different numbers of DCT coefficients, selected using the zonal and threshold schemes. Row 1: original signals; rows 2-5: reconstructions from 20%, 10%, 5%, and 2.5% of the coefficients.]
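A sketch of the two selection schemes; it assumes that "fixed locations" in zonal selection means the first (lowest-frequency) k coefficients, which is the usual convention but is not spelled out in the paper:

```python
import numpy as np
from scipy.fft import dct, idct

def select_coeffs(block, k, scheme='threshold'):
    """Return (locations, values) of k DCT coefficients, chosen zonally or by magnitude."""
    coeffs = dct(np.asarray(block, dtype=float), norm='ortho')
    if scheme == 'zonal':
        locs = np.arange(k)                     # fixed locations: the first k coefficients (assumed)
    else:
        locs = np.argsort(np.abs(coeffs))[-k:]  # the k largest-magnitude coefficients
    return locs, coeffs[locs]

def reconstruct(locs, values, n):
    """Inverse DCT with the unretained coefficients treated as zero."""
    coeffs = np.zeros(n)
    coeffs[locs] = values
    return idct(coeffs, norm='ortho')
```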
3.3.3 Windows and blocking
The next issue is whether to apply the transform to the whole audio signal or to consider smaller blocks and apply short-term transforms. Naturally occurring audio data is nonstationary, i.e., all the frequencies in the signal may not be present over the whole signal duration. A Fourier or DCT transform of the whole signal therefore gives the frequency components present in the signal and their relative strengths, but gives no information about the times at which those frequencies occur. This lack of time resolution in the spectrum led to the development of the Short-Time Fourier Transform, or windowed Fourier transform, which achieves time localization by slicing off a well-localized region of the signal and then taking its (Fourier) transform. These slices are called windows or frames. The slicing is achieved by multiplying x[n] with a window function w[n]. To get the k-th frame, we shift the window k times by the shifting distance M and then multiply x[n] with the shifted window: x_k[n] = x[n] w[n - kM]. There are several window functions: rectangular, Hamming, Bartlett, Blackman, Chebyshev, Hanning, etc. [11]. Dividing the audio into blocks and applying short-term transforms has the advantages of (1) preserving finer local information in the transform coefficients, (2) greater energy compaction, and (3) allowing the transforms to be computed in parallel. In our experiments, the indices derived by using blocking with a rectangular window gave satisfactory retrieval results [17].
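A sketch of the framing step and per-frame (short-term) DCT; the window size, the hop M, and the choice of a Hamming window are illustrative (the paper's experiments used a rectangular window):

```python
import numpy as np
from scipy.fft import dct

def framed_dct(x, window_size=512, hop=256, window_fn=np.hamming):
    """Short-term DCT: form frames x_k[n] = x[n] w[n - kM] and transform each one.

    window_size, the hop M, and the Hamming window are illustrative choices;
    a rectangular window corresponds to window_fn=np.ones.
    """
    w = window_fn(window_size)
    frames = []
    for start in range(0, len(x) - window_size + 1, hop):
        frames.append(dct(x[start:start + window_size] * w, norm='ortho'))
    return np.array(frames)  # one row of coefficients per frame
```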
4 Segmentation of Audio Data for Indexing

When the audio data is divided into fixed-length blocks, there may be blocks with high signal variation and blocks with low signal variation. In such cases, retaining more coefficients for the high-variation blocks and fewer coefficients for the low-variation blocks results in better retrievals. This is one of the motivating factors for segmenting audio data for the purposes of indexing. In this section, a segment of audio data roughly refers to a portion of the audio data in which the signal characteristics do not vary markedly, while the boundaries between two consecutive segments exhibit rapid changes in the signal. Segments are usually variable-sized units. Segments of audio data could be used as indices either directly or as 'metadata' from which indices could be derived by further processing. The indices so derived are expected to provide fast and relevant data retrievals. Numerous schemes have been developed for the segmentation of speech; a few representatives are [19, 21, 20, 18]. Since most audio data generally has long periods of low signal variation and short periods of high variation, applying transforms to segments of audio data reduces the number of coefficients and thus the index size, compared to deriving indices from fixed-size blocks. Figure 4 shows the original signal and the segments of the signal in the top two windows. The remaining windows show the DCT coefficients of the first 16 segments of the signal. It is clear that only a few of the DCT coefficients are significant; these are selected as indices. Fixed-length blocks of the signal would have resulted in many more significant coefficients. Since the index size is smaller, this results in faster searches compared to the scheme using blocks. The precision is slightly lower, however, due to the coarseness of the segments.
[Figure 4: DCT coefficients of segments of an audio signal. Top two panels: the original and segmented signal; remaining panels: DCT coefficients of the segments.]
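The paper does not prescribe a particular segmentation algorithm (it points to [18-21]); purely as an illustration, the following sketch places a segment boundary wherever the short-term variance changes sharply between adjacent frames (the frame size and ratio threshold are arbitrary choices):

```python
import numpy as np

def segment_boundaries(x, frame=256, ratio=4.0):
    """Illustrative segmentation: start a new segment where the variance of adjacent
    frames changes by more than a factor of `ratio` (both parameters are arbitrary;
    this is not the method of the cited segmentation schemes)."""
    x = np.asarray(x, dtype=float)
    variances = [x[i:i + frame].var() + 1e-12
                 for i in range(0, len(x) - frame + 1, frame)]
    boundaries = [0]
    for k in range(1, len(variances)):
        change = max(variances[k], variances[k - 1]) / min(variances[k], variances[k - 1])
        if change > ratio:
            boundaries.append(k * frame)
    return boundaries  # sample positions where new segments begin
```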
5 Experimental Results

The different options in the transform-based indexing scheme for audio data were implemented on a Sun SPARCstation, using MATLAB and C functions. The experiments used over 200 Sun/NeXT format (.au) audio data files. The data files consisted of speech, music, and several other kinds of data, such as sounds produced by animals and birds, sounds of bells and whistles, and special-effect sounds. The performance of the indexing scheme under the different options is presented in the following subsections.
5.1 Performance of transforms

Figure 5 shows the performance of the four transforms (DCT, DFT, Haar, and Hadamard) that were evaluated for their suitability for indexing audio data. The first two plots show their performance with respect to signal reconstruction (PSNR) using threshold and zonal selection of coefficients, respectively. The third and fourth plots show, respectively, the dependence of retrieval precision on the number of coefficients and on the query size, with the coefficients selected using threshold selection. It is observed that the DCT consistently outperforms all the other transforms, and that threshold selection yields better PSNR than zonal selection.
[Figure 5: PSNR (first two plots, for threshold and zonal selection, versus number of coefficients) and retrieval precision (last two plots, versus number of selected coefficients and query block size) for the DCT, FFT, Hadamard, and Haar transforms.]
5.2 Performance of coefficient selection schemes

The first two plots in Figure 6 show, respectively, the retrieval precision as a function of the query size and of the number of coefficients, averaged over 30 queries. The following combination of parameters was found to be best suited for the purposes of indexing (using blocking) in the experiments: transform: DCT; window size: 128 to 1024; selection: threshold; number of coefficients: 10% of the block size.
5.3 Performance comparison of blocking and segmentation

For the data set used, the search speed of the scheme based on segmentation is on average 6-10 times faster than that of the scheme using fixed-length blocks. The retrieval precision of the former is, however, lower than that of the latter. These results are shown in the third and fourth plots in Fig. 6. (The time for segmenting the data is not included in the search times.)
[Figure 6: Performance of the coefficient selection schemes (first two plots: retrieval precision versus query size and versus number of selected coefficients, for zonal and threshold selection) and of blocked versus segmented data used as the index (last two plots: retrieval times and retrieval precision versus data size, with and without segmentation).]
6 Conclusions and future directions

Derivation of good indices for audio data is critical to fast and accurate retrievals for content-based queries in audio databases. The indices need to be structured with respect to the features of the audio data. This paper identified and evaluated several parameters in the use of transforms for indexing audio data in audio/multimedia databases. The performance of the indexing scheme under different options for the parameters was presented. The results indicate that the combination of the DCT, threshold selection, a block size between 128 and 1024, and about 10% of the coefficients per block serves as a good index. Segmentation would also result in faster searches (when a fast and accurate segmentation scheme is used). Some future directions include (1) the study of newer techniques such as wavelets and fractals for deriving indices, (2) better similarity measures for audio data, and (3) model-based index derivation.
References

[1] M. D'Allegrand. Handbook of Image Storage and Retrieval Systems. Van Nostrand Reinhold, New York, 1992.
[2] W. Grosky and R. Mehrotra, eds. Special issue on image database management. IEEE Computer, Vol. 22, No. 12, Dec. 1989.
[3] V. Gudivada and V. Raghavan. Special issue on content-based image retrieval systems. IEEE Computer, Vol. 28, No. 9, Sept. 1995.
[4] A. D. Narasimhalu, ed. Special issue on content-based retrieval. ACM Multimedia Systems, Vol. 3, No. 1, Feb. 1995.
[5] R. Jain et al. Similarity measures for image databases. SPIE, Vol. 2420, pp. 58-61.
[6] M. La Casia and E. Ardizzone. Jacob: Just A Content-based Query System for Video Databases. IEEE Conf., 1996, pp. 1216-1219.
[7] Y.-L. Chang et al. Integrated Image and Speech Analysis for Content-Based Video Indexing. Proc. 3rd IEEE Int'l Conf. on Multimedia Computing and Systems, 1996, pp. 306-313.
[8] A. Hampapur et al. Digital Video Segmentation. ACM Multimedia 94 Conf. Proc., pp. 357-364.
[9] A. Hauptmann and M. A. Smith. Text, Speech and Vision for Video Segmentation: The Informedia Project.
[10] S. W. Smoliar and H. J. Zhang. Content Based Video Indexing and Retrieval. IEEE Multimedia, Summer 1994, pp. 62-72.
[11] P. Schauble. Multimedia Information Retrieval. Kluwer Academic Publishers, 1997.
[12] C. Marven and G. Ewers. A Simple Approach to Digital Signal Processing. Wiley-Interscience, 1996.
[13] M. J. Hawley. Structure of Sound. Ph.D. Thesis, MIT, Sept. 1993.
[14] E. Wold et al. Content-based classification, search and retrieval of audio data. IEEE Multimedia Magazine, 1996.
[15] A. Ghias et al. Query by humming. Proc. ACM Multimedia Conf., 1995.
[16] S. R. Subramanya et al. Transform-Based Indexing of Audio Data for Multimedia Databases. IEEE Int'l Conference on Multimedia Systems, Ottawa, June 1997.
[17] S. R. Subramanya. Experiments in Indexing Audio Data. Tech. Report, GWU-IIST, January 1998.
[18] D. Kimber and L. Wilcox. Acoustic Segmentation for Audio Browsers. Proc. Interface Conference, Sydney, July 1996.
[19] H. C. Leung and V. W. Zue. A Procedure for Automatic Alignment of Phonetic Transcriptions with Continuous Speech. Proc. ICASSP-84, pp. 2.7.1-2, 1984.
[20] L. R. Rabiner and M. R. Sambur. An Algorithm for Determining the Endpoints of Isolated Utterances. Bell Systems Technical Journal, Vol. 54, pp. 297-315, Feb. 1975.
[21] J. P. Van Hemert. Automatic Segmentation of Speech. IEEE Trans. Signal Processing, Vol. 39, pp. 1008-1012, April 1991.