Optimised soundprint selection for identification in audio streams

G. Richly, R. Kozma, F. Kovacs and G. Hosszú

Abstract: An improved iterative method for selection of reference soundprints from a set of sound records to be identified is presented. The soundprints applied for reliable streaming audio identification should have maximum dissimilarity to each other. As the first trial of the iteration, a set of soundprints is randomly selected according to a grid placed on the records. To the accepted set, a further local refinement process is applied using the quadratic local approximations of the Euclidean distance between the soundprints. A large number of measurements have shown that the method significantly improves the reliability of real-time monitoring.

1 Introduction

Monitoring of sound records in audio databases or in audio signals has long been a challenging task, and many solutions have been proposed and realised. All of these, however, need a large number of arithmetic operations, especially if the number of records to be recognised is high [1]; they are therefore not suitable for real-time operation, although they can be used for off-line purposes [2]. Nowadays, as audio is becoming a common medium on computers and on the Internet, solutions for real-time monitoring are emerging because of their possible economic benefit. For example, monitoring known advertisements in Internet streaming audio or in the traditional broadcast of a radio or television broadcaster is a real-world application that is still mainly done by humans.

In order to realise the recognition even on a PC with limited hardware resources, or preferably on a low-cost microcontroller-based system, the computations must be drastically simplified. As a consequence, the reliability of the identification may decrease due to the incomplete restoration of the network signal in case of packet losses, and the increased sensitivity to any noise from the transmission channel. The reliability of a recognition method can be increased by selecting well distinguishable, very dissimilar sections of the records as references, one from each record to be recognised. These representative sections are called soundprints. Finding an adequate set of soundprints (one from every record) with high deviation from each other is a very time-consuming process, because ultimately the maximum point of a multi-dimensional surface, described by the distance function, must be searched for. However, it is a one-time process for each record set, and this computational expense is recovered by the simplification of the algorithms used in the real-time identification.

© IEE, 2001
IEE Proceedings online no. 20010538
DOI: 10.1049/ip-com:20010538
Paper first received 9th March and in revised form 4th June 2001
The authors are with the Department of Electron Devices, Budapest University of Technology and Economics, H-1111 Budapest, Goldmann Gy. tér 3,

IEE Proc.-Commun., Vol. 148, No. 5, October 2001

The selection and the recognition can be realised independently on different machines to keep the recognition process continuous. An exhaustive examination of all possible reference sets (i.e. soundprints starting at every sampling instant of every record) demands a huge amount of computation, which cannot be carried out economically. Using the grid method [3], we can detect the existence of peaks on the multi-dimensional surface, and by examining the neighbourhood of these points the local maxima can be found, from which the largest can be selected. Naturally, this 'refinement' step of the search process assumes an analytical description of the region between the grid points. The present work applies the grid method together with a short-range quadratic approximation of the spectrogram differences.

2 Method

For the monitoring of previously known sound materials (records) in an audio stream, a feasible solution is to use short sections of the records as references representing the given records. This frees the recogniser from handling a large amount of record data. The chosen soundprints uniquely represent the records in the actual set. The aim of the presented method is to find, for the given records to be recognised, the set of soundprints that are as dissimilar to each other as possible.

The soundprints used as references are characterised by their normalised spectrograms. These consist of $N_{\mathrm{frame}}$ non-overlapping frames, each frame containing $M_s$ samples, and their spectra are described by $N_{\mathrm{bin}}$ frequency bands. The full spectrogram of a soundprint is thus a matrix with $N_{\mathrm{frame}}$ columns and $N_{\mathrm{bin}}$ rows, where each frame's amplitude spectrum is normalised to its maximum value, resulting in one column of the normalised spectrogram matrix.

The dissimilarity of a soundprint set is described by the sum of the pairwise dissimilarities of the references. For this the total distance $TD(\underline{n})$ is introduced, where $\underline{n}$ contains the starting positions of the soundprints (in samples) for all records. With $R$ records, $TD(\underline{n})$ forms an $(R + 1)$-dimensional surface, on which the highest peak should be found. In practice, an acceptable maximum provided by a limited computation is satisfactory, because the selected soundprints ensure reliable identification if the limit is well chosen. The Euclidean distance was chosen as the measure of dissimilarity because it has an easily handled analytical form, in contrast to the Manhattan distance, for example. The total distance can be expressed as

$$TD(\underline{n}) = \sum_{i_r=1}^{R-1} \; \sum_{j_r=i_r+1}^{R} D_{\mathrm{Eucl}}(i_r, j_r) \qquad (1)$$

where the distance between any two soundprint spectrograms, from the records numbered $i_r$ and $j_r$, is obtained as the square of the Euclidean distance of the representing matrices:

$$D_{\mathrm{Eucl}}(i_r, j_r) = \sum_{i_b=1}^{N_{\mathrm{bin}}} \; \sum_{i_f=1}^{N_{\mathrm{frame}}} \bigl( S(i_r, i_b, i_f) - S(j_r, i_b, i_f) \bigr)^2 \qquad (2)$$

where $S(i_r, i_b, i_f)$ is the normalised spectrogram of the soundprint in the $i_r$th record, starting at the $\underline{n}(i_r)$ sample position of that record; furthermore, $i_b$ and $i_f$ are the row and column indices in the spectrogram matrix, respectively.

To discover the existence of peaks on the $TD$ surface, a grid is laid onto the records. The grid points assign the possible starting positions of the candidate soundprints (Fig. 1). Starting the iteration process, one grid point is randomly selected from each record ($\underline{n}_{\mathrm{rnd}}$) and the total distance is calculated using eqn. 1. If the value of eqn. 2 is below the given limit $D_{\mathrm{min}}$ for any pair, or eqn. 1 is smaller than the previous value (if any), then the iteration step is finished and a new $\underline{n}_{\mathrm{rnd}}$ is chosen.

Fig. 1 Assigning possible starting points of soundprints by a grid

Fig. 2 [recoverable labels: a: axes 'position in ith record' and 'position in jth record'; b: axis 'position in 2nd record']

[...]

where the coefficients $a$, $b$, $c$, $d$ and $e$ can be determined by solving the linear equation system resulting from the measurement of the pair-distance surface. This is realised using the five-point method around the randomly selected position $\{n_i, n_j\}$, as illustrated in Fig. 2a, using $\delta$ as the discretised deviation. Substituting eqn. 3 into eqn. 1, the coefficients belonging to the positions in $\underline{n}$ can be collected for all records, providing the following approximation:

$$TD(\underline{n}) = \sum_{i=1}^{R} \left( A_i \cdot n_i^2 + B_i \cdot n_i \right) + C \qquad (4)$$

where $A_i$, $B_i$ and $C$ are the contracted constants related to a given record.
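A minimal sketch of the five-point fit, assuming that eqn. 3 has the separable quadratic form $D_{\mathrm{Eucl}}(n_i, n_j) \approx a n_i^2 + b n_i + c n_j^2 + d n_j + e$. This form is an assumption: it has exactly five coefficients and leads to an expression without cross terms, as in eqn. 4. The function names and the toy distance surface are hypothetical and serve only for illustration.

```python
import numpy as np

def five_point_fit(dist, n_i, n_j, delta):
    """Fit D(x, y) ~ a*x^2 + b*x + c*y^2 + d*y + e through five measurements.

    dist     : callable returning the pair distance for start positions (x, y)
    n_i, n_j : centre position in the i-th and j-th record
    delta    : discretised deviation of the four outer points
    The quadratic form is an assumption, not taken from the paper.
    """
    points = [(n_i, n_j),
              (n_i - delta, n_j), (n_i + delta, n_j),
              (n_i, n_j - delta), (n_i, n_j + delta)]
    # One row per measurement: [x^2, x, y^2, y, 1]
    design = np.array([[x * x, x, y * y, y, 1.0] for x, y in points], dtype=float)
    measured = np.array([dist(x, y) for x, y in points], dtype=float)
    a, b, c, d, e = np.linalg.solve(design, measured)
    return a, b, c, d, e

def quadratic_peak(a_coef, b_coef):
    """Stationary point of A*n^2 + B*n, or None if it is not a maximum."""
    if a_coef >= 0:
        return None
    return -b_coef / (2.0 * a_coef)

def toy_distance(x, y):
    """Hypothetical pair-distance surface that is exactly quadratic."""
    return -2.0 * x * x + 3.0 * x - 1.0 * y * y + 4.0 * y + 5.0

coeffs = five_point_fit(toy_distance, n_i=10, n_j=20, delta=4)
```

Because the assumed form has no cross terms, each record's contribution to the total distance can be examined on its own; `quadratic_peak` returns the stationary point of one such term, which is meaningful only inside the neighbourhood where the surface was measured.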

If a position that increases the total distance was selected (see [...]) [...] where i = 1. The search for the maximum $TD$ value in this hypercube is carried out in the same way as before. The iteration is continued until a local maximum is found inside a hypercube or the increment between consecutive steps falls below a prescribed limit. Then a new $\underline{n}_{\mathrm{rnd}}$ is randomly selected, and the process is repeated until an acceptable maximum value of $TD$ is found within an acceptable computation time. The flowchart of the whole process is shown in Fig. 3. As one can see, after every randomly selected set, as well as in every refinement cycle, the obtained $TD$ value is compared with the current largest total distance $TD_{\mathrm{max}}$, and rejected if it is not larger. The process is repeated until the allowed runtime has been exceeded.
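A minimal sketch of the random grid selection with rejection described above, without the refinement step. It assumes that each record is already available as a precomputed spectrogram matrix and that a soundprint is a window of `n_frame` columns of that matrix; this is a simplification, since in the text the starting positions are given in samples. All names and the synthetic data are hypothetical.

```python
import numpy as np

def pair_distance(s_a, s_b):
    """Squared Euclidean distance between two spectrogram matrices (eqn. 2)."""
    return float(np.sum((s_a - s_b) ** 2))

def pair_distances(prints):
    """All pairwise distances D(i, j) with i < j."""
    return [pair_distance(prints[i], prints[j])
            for i in range(len(prints) - 1)
            for j in range(i + 1, len(prints))]

def total_distance(prints):
    """Total distance TD: sum of all pairwise distances (eqn. 1)."""
    return sum(pair_distances(prints))

def cut_prints(records, positions, n_frame):
    """Cut one candidate soundprint (n_frame columns) out of every record."""
    return [rec[:, p:p + n_frame] for rec, p in zip(records, positions)]

def random_grid_search(records, n_frame, grid, d_min, n_trials, seed=0):
    """Pick random grid points, reject weak sets, keep the largest TD."""
    rng = np.random.default_rng(seed)
    best_td, best_pos = float("-inf"), None
    for _ in range(n_trials):
        # One random grid point per record (start column is a multiple of grid)
        positions = [int(rng.integers(0, (rec.shape[1] - n_frame) // grid + 1)) * grid
                     for rec in records]
        dists = pair_distances(cut_prints(records, positions, n_frame))
        if min(dists) < d_min:
            continue  # some pair is too similar: reject the whole set
        td = sum(dists)
        if td > best_td:
            best_td, best_pos = td, positions
    return best_pos, best_td

# Synthetic stand-ins for four records: 32 bands x 200 frames each
data_rng = np.random.default_rng(1)
records = [data_rng.random((32, 200)) for _ in range(4)]
best_pos, best_td = random_grid_search(records, n_frame=50, grid=8,
                                       d_min=0.0, n_trials=200)
```

In a fuller version, each accepted set would be handed to the refinement step before the next random selection; here the loop stops after a fixed number of trials instead of an allowed runtime.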

Fig. 3 [flowchart; recoverable labels: random selection of set of soundprints; calculate $TD(\underline{n})$; $TD > TD_{\mathrm{max}}$? (yes/no); refinement; $TD > TD_{\mathrm{max}}$? (yes/no)]

3 Results

The proposed method was tested with a simplified recognition system [4], which monitors audio streams containing any kind of sound component, such as music, speech and artificial sounds. The system consists of well-known processing steps such as the FFT of 50 non-overlapping, 256-sample frames ($M_s$) and an 8-bit quantisation in order to obtain the spectrograms. Accordingly, $M_s = 256$, $N_{\mathrm{frame}} = 50$, and the reduced number of frequency bands is $N_{\mathrm{bin}} = 32$. Components above 8 kHz are ignored, since the system was intended to be applied to audio streams on the Internet, and the most common audio coding schemes already limit the signal bandwidth.

As a typical experimental value, the optimum set of the soundprints was selected from R = 67, 6 8 s long records with 5% white noise added, using a 300 MHz Pentium-II PC. The size of the grid applied by the process was $g = 8$, and the size of the hypercube was $h = 4$. The rejection limit was $D_{\mathrm{min}} = TD_{\mathrm{max}}/3R$. The allowed maximum runtime was 2 h, corresponding to about 8000 random set selections from the grid.

In order to verify the proposed method, a random mixture of records was continuously monitored for more than 480 h and the identifications were listed. The measured rates are shown in Table 1 for different runtimes, with and without local refinement. As one can see, the application of local refinement provides a significant improvement of the identification rates.

4 Conclusion

A method for selecting optimal references for real-time sound record identification was developed, enabling its implementation on PCs equipped with no special hardware. With the use of the proposed soundprint selection method on the given set of records to be monitored, the reliability of the identification is significantly increased. Regarding the computation time, the application of a rough grid, together with the proposed gradient-based refinement procedure, is more suitable than the conventional grid method applying the local refinement of the grid.

5 References

1 LOURENS, J.G.: 'Detection and logging advertisements using its sound', IEEE Trans. Broadcast., 1990, 36, (3), pp. 231-233
2 SUBRAMANYA, S.R., SIMHA, R., NARAHARI, B., and YOUSSEF, A.: 'Transform-based indexing of audio data for multimedia databases'. Proceedings of IEEE conference on Multimedia computing and systems, Ottawa, 1997, pp. 211-219
3 WILDE, D.J.: 'Optimum seeking methods' (Prentice Hall, 1964)
4 RICHLY, G., VARGA, L., HOSSZU, G., and KOVACS, F.: 'Real-time identification of sound stream segments'. Proceedings of IEEE conference MeleCon'2000, Cyprus, 2000, Vol. II, pp. 52C528

Table 1: Measured rates of erroneous and missed identifications

Allowed time    Number of         Rate of false identification        Rate of misses
of computing    evaluated sets    Without local    With local         Without local    With local
                                  refinement       refinement         refinement       refinement
0.5 h           ~2000             2.3%             1.2%               3.1%             3%
1 h             ~5000             1.4%             0.2%               2.5%             2.3%
2 h             ~12000            1.1%             0%                 1.8%             1.5%
