A PC Speaker Identification System for Forensic Use: IDEM

MAURO FALCONE, NICOLÒ DE SARIO

Abstract  The field of Speaker Identification in forensic is one of the oldest application of speech science. Listening test and spectrogram reading have been the most popular approach in this specific task for many years. We introduce a method based on the comparison of vowel formant, and we describe the associated software tools that let you easily perform a complete speaker identification test: starting from speech acquisition and ending with a report on statistical tests. The system has been designed to run on a small computer, and in a user-friendly environment. After a first release of IDEM in 1991[1] we have now reached a (almost) errors free and "enduser" robust version of the system. We base our corrections and improvements of the system on our own experience in this field, and on the suggestions of real users, i.e. of several police investigation bureau that extensively use the IDEM system.

Index Terms - forensics, speech analysis, video-audio speech tools, reliable speaker identification.

I. INTRODUCTION

One of the most important factors of speaker identification applied to forensic tasks is that a highly reliable decision must be taken. This suggests the adoption of a fully automated method. Unfortunately, due to adverse recording conditions, non-cooperative speakers, very similar voices, etc., experience has shown that a large amount of operator intervention must be retained to achieve adequate performance. It therefore seems very useful to have a user-friendly system dedicated to this purpose, i.e. an environment able to manage speech acquisition and playback, time and frequency signal representations, extraction of speech features, and finally the execution of statistical tests. The proposed system is an attempt to fulfil these needs, and it is (as far as we know) the only system designed ad hoc for this purpose that runs on a small computer. Of course, here we only have space to discuss part of the features of the system, and we shall not report examples of tests on real data.

II. OVERVIEW OF THE SYSTEM

The IDEM system is the result of many years of experience in the field of speech analysis and speaker identification for forensic purposes. In the 1980s a set of programs similar to the current IDEM was developed at FUB for a VAX mainframe [2]. The downsizing trend and the increasing cost of maintaining such software pushed us to design a new environment for this set of applications. In 1990 we started an internal project whose main goal was to reimplement the existing packages, plus new tools, in a low-cost and user-friendly hardware/software environment. We decided to use a PC under the WINDOWS operating system (version 3.0 had just been announced at that time). After three years we may now say that it was the best choice we could have made.

Dr. Mauro Falcone is with the speech processing group of the Fondazione Ugo Bordoni (FUB), Rome, Italy. Mr. Nicolò De Sario is a senior member of the speech group at FUB. They can be reached by fax at +39.6.5480.4405 or via email at [email protected]

Fig. 1. Modules of the IDEM package.

The philosophy behind the IDEM system is to consider the problem of speaker identification as a sequence of simple tasks. Some of these (e.g. speech acquisition) are common to any speaker identification method, while others are peculiar to the proposed method. The method is, in summary, based on the comparison of a set of parameters (in our case F0 and the first three formants of the four vowels /a/, /e/, /i/, /o/) estimated in stable portions of speech. Given a set of these parameters, a reference matrix is computed for each speaker. The comparison of the resulting data gives the response to the identity test. Space does not allow a theoretical treatment of the method here; it is described in detail in [3], [4], [5].
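To make the idea concrete, the following C sketch shows how such a per-speaker reference matrix could be assembled from measured vowel tokens; the data structure and function names are invented for illustration and are not the actual IDEM code.

/* Illustrative sketch (not the actual IDEM code): averaging measured
   F0 and formant values into a per-speaker "reference matrix" with
   one row per vowel. All names are hypothetical. */
#define NUM_VOWELS 4              /* /a/, /e/, /i/, /o/ */
#define NUM_PARAMS 4              /* F0, F1, F2, F3 in Hz */

typedef struct {
    int    vowel;                 /* 0=/a/, 1=/e/, 2=/i/, 3=/o/ */
    double p[NUM_PARAMS];         /* measured F0, F1, F2, F3 */
} VowelToken;

/* Average all tokens of each vowel into one row of the matrix. */
void build_reference_matrix(const VowelToken *tok, int n,
                            double m[NUM_VOWELS][NUM_PARAMS])
{
    int count[NUM_VOWELS], i, v, k;

    for (v = 0; v < NUM_VOWELS; v++) {
        count[v] = 0;
        for (k = 0; k < NUM_PARAMS; k++)
            m[v][k] = 0.0;
    }
    for (i = 0; i < n; i++) {
        for (k = 0; k < NUM_PARAMS; k++)
            m[tok[i].vowel][k] += tok[i].p[k];
        count[tok[i].vowel]++;
    }
    for (v = 0; v < NUM_VOWELS; v++)
        if (count[v] > 0)
            for (k = 0; k < NUM_PARAMS; k++)
                m[v][k] /= count[v];
}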

One of the mandatory requirements of the IDEM package was that it be very simple to use and robust against any kind of improper operation, in order to allow unskilled users (e.g. staff of police investigation bureaus) to use the package correctly. In the last two years we have collected the observations and suggestions of three laboratories that use this package extensively, together with our own recommendations, and we have updated the original release. The IDEM system is now based on seven modules. It works on any PC running WINDOWS. The software has been written using Microsoft tools (the C compiler and the Software Development Kit).

2.1 The acquisition module: WDSK

As speech material is usually not furnished in digital format, the first step is speech acquisition. In our very specific case the speech material is always originally recorded on tape, from telephone or ambient interception. This module has three main functions: to play an audio (speech) file, or part of it; to record a new file; and to calibrate the acquisition procedure. A detailed description may be found in [6]. The standard audio board for the IDEM system is currently the OROS AU21, a professional 16-bit AD/DA board with a DSP for real-time digital filtering. We chose this acquisition board because it is the hardware recommended by the European project on speech input/output standardisation (ESPRIT 2589), and it completely fulfils our requirements. The standard procedure for using this module is as follows. First run the VuMeter function on the complete speech material you want to record, to measure the signal dynamics. You may then calibrate your tape deck or the OROS board to ensure an appropriate recording level and avoid saturation effects. Then, using the Record function, you may record your signal in continuous mode or in "energy-triggered" mode, in case you want to discard long stretches of silence and save storage space. The Play function is a fast and simple tool to check the speech files on your data storage.

2.2 The speech quality evaluation module: QUALITY

Once the speech files are available in digital format on your PC, it may be necessary to evaluate the quality of the speech material in an objective way: trivially, the worse the speech quality, the harder the speaker identification decision. An estimate of the signal-to-noise ratio, as well as long-term spectra of both "signal" and "noise", give the operator an immediate view of the speech quality, so that he can decide whether the speech signal is good enough to be processed and analysed, or should be discarded. The same module may be used to compare different long-term spectra, in order to check the acquisition procedure, the transmission line condition and the background noise. The use of this module is very simple: you only need to select a signal file and fill in the boxes of a dialog window. In this window you can change the frame shift for the FFT computation and the signal/noise threshold (a signal-to-noise estimation algorithm is run before the window appears, in order to provide a default value for this field), and finally you may plot the long-term power spectrum of the signal alone, or the signal and noise spectra in the same window. You may compare a maximum of four of these graphs on the PC screen; the dimension of the charts is fixed.
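As an illustration of how such a default value might be derived, here is a minimal C sketch of a frame-energy signal-to-noise estimate; the algorithm actually used by QUALITY is not documented here, so the thresholding scheme below is an assumption.

/* Hedged sketch of a frame-energy S/N estimate: frames whose mean
   power exceeds `threshold` count as signal, the rest as noise.
   Not the module's real algorithm. */
#include <math.h>

double estimate_snr_db(const short *samples, long n, long frame_len,
                       double threshold)
{
    double sig_pow = 0.0, noise_pow = 0.0;
    long sig_frames = 0, noise_frames = 0;
    long i, j;

    for (i = 0; i + frame_len <= n; i += frame_len) {
        double p = 0.0;
        for (j = 0; j < frame_len; j++)
            p += (double)samples[i + j] * samples[i + j];
        p /= frame_len;
        if (p > threshold) { sig_pow += p; sig_frames++; }
        else               { noise_pow += p; noise_frames++; }
    }
    if (sig_frames == 0 || noise_frames == 0 || noise_pow == 0.0)
        return 0.0;   /* not enough evidence for an estimate */
    return 10.0 * log10((sig_pow / sig_frames) /
                        (noise_pow / noise_frames));
}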

2.3 The editing module: EDIT

The editing module has been designed to review long signal files and extract from them the speech material needed to accurately characterise a given speaker. You may open a binary audio file [7]; at that moment a description file, associated with the signal file, is created by the system. Several important pieces of information are written to this description file, and it is mandatory for the subsequent audio-video modules. We named it the INF (for "information") file. Of course EDIT can open both binary and INF files. This module had two important requirements: it must be simple and fast to use.

Fig. 2. Main window of the EDIT module.

To satisfy these needs, in designing this module we took into account that in telephone speech (the most common environment) there are only two speakers. Accordingly, the main function an operator performs with this module is to assign selected speech zones either to one speaker (say speaker A) or to the other (say speaker B). The audio waveform is represented in a full-screen window. Scaling of the time and amplitude axes is possible by clicking on the zoom buttons. A piece of signal is selected simply by dragging the two-line cursor with the mouse. Once the selected speech has been assigned to a given speaker, the audio waveform changes colour according to the speaker (red for A, blue for B). The number of segments assigned to each speaker, the overall duration of these segments and other useful quantities are monitored in real time in the main window. When you have selected speech segments, you may create a new file (both binary and INF) containing all the "red" or all the "blue" segments of speech. From the main menu you can also set some options, such as automatic alignment of the two-line cursor on zero crossings, automatic listening after a signal selection made with the mouse, linking several speech files into a single file, deleting a speech file and all its associated parameter files, etc.
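As a concrete illustration of this bookkeeping, the following C sketch shows a possible segment structure, the zero-crossing snap option and the per-speaker duration counter; the structures and names are our own assumptions, not the module's real internals.

/* Illustrative sketch of EDIT-style bookkeeping (names hypothetical):
   segments are tagged 'A' or 'B', and cursor positions can be snapped
   to the nearest zero crossing, as the module's option does. */
typedef struct {
    long start, end;   /* sample indices */
    char speaker;      /* 'A' or 'B' */
} Segment;

/* Move `pos` forward to the next zero crossing of the waveform. */
long snap_to_zero_crossing(const short *x, long n, long pos)
{
    long i;
    for (i = pos; i + 1 < n; i++)
        if ((x[i] <= 0 && x[i + 1] > 0) || (x[i] >= 0 && x[i + 1] < 0))
            return i + 1;
    return pos;   /* no crossing found: leave cursor unchanged */
}

/* Total duration (in samples) assigned to one speaker, as monitored
   in real time in the EDIT main window. */
long speaker_total(const Segment *seg, int nseg, char who)
{
    long total = 0;
    int i;
    for (i = 0; i < nseg; i++)
        if (seg[i].speaker == who)
            total += seg[i].end - seg[i].start;
    return total;
}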

2.4 The signal processing module: IDSP

This is a blind module, i.e. it has no audio or video output. It is also the only module that does not run under native WINDOWS. Once you have selected a signal file, you select the parameters to be computed from a main menu listing the possible ones.

You cannot compute the same elaboration twice: if a given parametrisation of a specific file has already been computed, the associated entry is highlighted and cannot be selected from the menu. The main parameters in this menu are: energy, pitch, formant tracking, sonogram (wide band, narrow band, LPC), vowel localisation and formant estimation. Each parameter is computed by a different program, so it is easy to add or change any of them if a user needs a different speech parametrisation. As this module is computationally heavy, you can select the parameters to be computed for as many files as you need, and then run the computational engine in the background (or on another system of your PC network). The computational engine consists of a set of ANSI C programs: given a signal file and a command file, it creates the requested parameter files (energy, sonogram, etc.). This approach of course wastes a lot of space on the mass storage. On the other hand, it is much more flexible, and we found it the only solution that makes signal processing results available in real time in the audio-video tools even on small computers. If you think of tasks such as formant tracking or sonogram representation, it is clear that real-time display is not feasible on a general-purpose processor without dedicated DSP hardware and software, unless all results are already available; in our case the tools only need to pick up the data and plot them on the screen.
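To illustrate the batch design, here is a minimal C sketch of a command-file-driven dispatcher of the kind described, where one external program per parameter is invoked; the command-file format and the program names (e.g. calc_pitch) are invented for illustration.

/* Rough sketch of a command-file-driven engine: one external program
   per requested parameter. File format and program names are
   hypothetical, not IDSP's real ones. */
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    char line[256];
    FILE *cmd;

    if (argc != 2) {
        fprintf(stderr, "usage: %s command-file\n", argv[0]);
        return 1;
    }
    cmd = fopen(argv[1], "r");
    if (cmd == NULL) { perror(argv[1]); return 1; }

    /* Each line: <signal-file> <parameter>, e.g. "voice01.sig pitch" */
    while (fgets(line, sizeof line, cmd) != NULL) {
        char sig[128], par[64], run[256];
        if (sscanf(line, "%127s %63s", sig, par) != 2)
            continue;
        /* One dedicated program per parameter, e.g. "calc_pitch". */
        sprintf(run, "calc_%s %s", par, sig);
        if (system(run) != 0)
            fprintf(stderr, "failed: %s\n", run);
    }
    fclose(cmd);
    return 0;
}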

2.5 The formant estimation module: ARES

This module is dedicated to the spectral analysis of a fixed-length signal window. At the top of the main window a 2.5-second waveform of the audio signal is represented. The cursor is a thick line, as wide as the selected zone (you may select any power-of-two window from 128 to 4096 points; the default is 512). At the bottom left you have a zoom of the selected window; at the bottom right, the power spectrum of the selected window, in which you can optionally plot the LPC- and CEPSTRUM-smoothed power spectra. Fine tuning is possible by clicking the two buttons on the zoomed signal window.

Fig. 3. Main window of the ARES module.

According to the number of formants you want to estimate (from one to four; the default is three), in the power spectrum window you have some vertical bars that you may move using the mouse. The position (in Hertz) of the line you are moving is monitored on the left side of the window. Just below the waveform, two scalar quantities are plotted (by default, pitch and energy). Once you have found the signal portion from which you want to estimate the formant values, you move the vertical lines onto the supposed formant frequencies. You may then fix (i.e. save in a file) the information, which includes: the pitch value, the formant values, the vowel (or phoneme), and the context of the word. The symbol of the labelled vowel (phoneme) will then appear aligned with the audio waveform on the screen. If you want to modify, cancel or check any data, just double-click on this symbol and a new window appears in which you may read and modify all these parameters.

Fig. 4. Modify parameters window of the ARES module.
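Purely as an illustration, the record that ARES "fixes" for each measurement might look like the following C sketch; the actual layout of the IDEM parameter files is not documented here, so every field and name below is an assumption.

/* Hypothetical layout of the record saved for each measured vowel:
   pitch, up to four formants, the phonetic symbol and the word
   context. A sketch only, not the real file format. */
#include <stdio.h>

typedef struct {
    long   sample_pos;     /* position of the analysis window */
    double pitch;          /* F0 in Hz */
    double formant[4];     /* F1..F4 in Hz, 0.0 if not estimated */
    int    num_formants;   /* 1..4, default 3 */
    char   symbol[8];      /* labelled vowel/phoneme, e.g. "a" */
    char   context[32];    /* word context */
} FormantLabel;

/* Append one fixed measurement to a binary label file. */
int append_label(const char *path, const FormantLabel *lab)
{
    FILE *fp = fopen(path, "ab");
    if (fp == NULL)
        return -1;
    fwrite(lab, sizeof *lab, 1, fp);
    fclose(fp);
    return 0;
}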

2.6 The labelling and spectrographic module: SONOG

This module has a purely descriptive function, i.e. its output is not necessary for the execution of the identification test, nor for any other module of the IDEM system. It has been added to the IDEM package because the sonographic representation with an aligned text transcription is still an interesting graphical way to characterise some peculiar phonetic events. In the main window the sonogram (you can choose among the wide-band, narrow-band and LPC spectral representations) is mapped in a grey or colour scale; a 256-colour board is recommended. At the bottom you have a reduced graph of the amplitude waveform, of the pitch and of the energy, as well as the list of the symbols labelled with the ARES module. Many online operations are possible on the sonographic map: pre-emphasis, saturation effects, dynamic range modification, overplotting of the formant track, etc. This module is also dedicated to segmentation and labelling. You may type in the orthographic content of the whole sentence, or of part of it, and then ask the system to translate the string, obtaining the phonetic transcription in SAMPA [7] format. You may then label the speech at different levels of accuracy. To speed up the tedious labelling work, several shortcuts have been introduced, as well as a simple but useful "divide and merge" procedure to manage phonetic labelling. Overlapping labels are not allowed, nor are unlabelled portions of speech: unlabelled speech is marked with the special character "*".

Fig. 5. Main window of the SONOG module.

III. RUNNING TESTS USING IDEM

Until now we have described the speech manipulation and analysis tools available in this software package, and how to manually extract some parameters that are useful to identify people even in adverse conditions and non-cooperative situations. The last operation is the decision problem, which can be solved by statistical analysis of these parameters. It should be pointed out that any variable you believe is distinctive of the speaker could be used; our experience in this field, and other recent work [8] on this topic, confirm that pitch and vowel formants are a good choice. Given the very specific target of this application (forensics), particular attention has been devoted to the development of the decision module, to the characterisation of the speaker, and to the estimation of the false rejection and false identification probabilities. The theoretical design used for the identification test may be found in [9].

3.1 The statistical module: SPREAD

SPREAD (SPeaker REcognition by Automatic Decision) is an interactive system that lets you load into a working area as many "data files" (in our case the data are pitch and formants) as you like. A speaker name is associated with each data file, so that you may select from the speaker list an ensemble of speakers to compare. Before running the test you may remove outlier data using three different filters: one based on predefined values, one on the standard deviation of all the data, and one on the standard deviation of each speaker's data (this last is a recursive filter). After the test execution a YES/NO matrix is displayed, and you may read the detailed result by clicking on a specific box. It is also possible to produce an ASCII file containing a detailed report including all the data used and the intermediate results of the statistical analysis.
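As an example of the third filter, the following C sketch implements a recursive standard-deviation outlier removal of the kind described; the threshold factor k and the in-place filtering are our own assumptions about the details.

/* Illustrative version of a recursive standard-deviation filter:
   repeatedly drop values farther than k standard deviations from the
   speaker's own mean until nothing more is removed. Details assumed. */
#include <math.h>

/* Filters x[] in place, returns the new count. */
int sd_filter(double *x, int n, double k)
{
    int changed = 1;
    while (changed && n > 1) {
        double mean = 0.0, sd = 0.0;
        int i, m = 0;
        for (i = 0; i < n; i++) mean += x[i];
        mean /= n;
        for (i = 0; i < n; i++) sd += (x[i] - mean) * (x[i] - mean);
        sd = sqrt(sd / (n - 1));
        changed = 0;
        for (i = 0; i < n; i++)
            if (sd == 0.0 || fabs(x[i] - mean) <= k * sd)
                x[m++] = x[i];   /* keep */
            else
                changed = 1;     /* outlier dropped: iterate again */
        n = m;
    }
    return n;
}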

IV. CONCLUSION

We have described the IDEM system, a software package running under WINDOWS on a PC with a dedicated audio board. Each module and its basic functions have been explained, following the line of a hypothetical execution of a speaker identification test for forensic purposes. The system has now been assessed on the basis of two years of extensive use in laboratories where both skilled and unskilled users utilise this package. After this positive experience, we are now planning a redesigned and improved version of the IDEM system on a standard multimedia platform, with networking and database facilities.

REFERENCES

[1] M. Falcone, A. Paoloni, N. De Sario, B. Saverione, "IDEM: un sistema per l'analisi e la rappresentazione del segnale vocale" (IDEM: a system for the analysis and representation of the speech signal), Atti XX Convegno Nazionale dell'AIA, Rome, April 1992, p. 417 (in Italian).
[2] N. De Sario, A. Paoloni, B. Saverione, "ARES: an environment for speech analysis and labelling", Proc. MELECON'89, Lisbon, April 1989, p. 229.
[3] A. Federico, G. Ibba, A. Paoloni, "A new automated method for reliable speaker identification and verification over the telephone channels", Proc. ICASSP'87, Dallas, April 1987, p. 1457.
[4] A. Federico, G. Ibba, A. Paoloni, B. Saverione, "Comparison between automatic methods and human listeners in speaker recognition tasks", Proc. EUROSPEECH'89, Paris, September 1989, p. 279.
[5] A. Federico, G. Ibba, A. Paoloni, "Comparing direct spectral matching techniques with formant extraction ones for speaker recognition", Proc. 12th ICA Congress, Toronto, 1989.
[6] M. Falcone, "WDSK21: a WINDOWS audio tool for the AU21 board based on the WABI software approach", ESPRIT P. 6819 (SAM-A), First Year Progress Report, Rel. FUB/005/V1, October 1993.
[7] ESPRIT PROJECT 2589 (SAM), "User Guide to ETR Tools", UCLG007, London, 1992.
[8] N. Fakotakis, A. Tsopanoglou, G. Kokkinakis, "A text-independent speaker recognition system based on vowel spotting", Speech Communication 12 (1993), p. 57.
[9] A. Federico, A. Paoloni, "Bayesian decision in the speaker recognition by acoustic parametrisation of voice samples over the telephone line", Proc. EUROSPEECH'93, Berlin, September 1993, p. 2307.