Department of Electronic Engineering, University College Galway

Digital Signal Processing Techniques for Telecommunication and Speech Processing Applications

Michael E. Keane, B. E. (Electronic)

Submitted to the National University of Ireland in fulfilment of the requirements for the degree of Master of Engineering Science.

Head of Department Prof. D. J. Wilcox July 1994

Project Supervisor Dr. Eliathamby Ambikairajah

ABSTRACT

This thesis is divided into three distinct sections. The first section describes the application of digital signal processing techniques from an applied research point of view. The second and third sections describe the application and development of digital signal processing techniques from a pure research perspective.

The work described in the first section involved the development and programming of a stand-alone digital signal processor hardware platform. This generic hardware platform, which interfaces to a standard personal computer, contains a single TMS320C25 digital signal processor, as well as an Analogue-to-Digital converter and a Digital-to-Analogue converter. This system was programmed to completely implement real-time DTMF telecommunication tone detection. The overall system was tested to the CCITT recommendations for DTMF detection.

The second section describes the development of a new speaker verification model. This model uses self-aligning linear predictors to represent the password utterance as spoken by the true speaker. In contrast to conventional LPC analysis, which uses short analysis frames of fixed length, this model uses temporal segments of variable length. Each of these segments has an associated linear predictor which is used to predict the speech in that segment. A technique called dynamic programming is used to associate segments of speech with specific linear predictors. During verification, the accumulated error between the predicted utterance and the test utterance is used as one test statistic. The segmentation pattern produced by the optimal segmentation is used as another test statistic. On a small database of 70 true speaker utterances and 20 impostor utterances, verification accuracies of 97% and 95%, respectively, were achieved using these two test statistics. Combining the test statistics yielded a 100% verification accuracy.

The third and last section describes the theory and implementation of a speech analysis technique called the Wavelet Transform. This technique has recently generated considerable interest among speech researchers. Its application to the task of speech pre-processing is particularly appropriate because it provides a constant relative bandwidth analysis similar to that performed by the human ear. As well as describing the theory behind the operation of the Wavelet Transform, this section describes how to implement the digital Wavelet Transform and develops a fast Wavelet Transform implementation. This fast Wavelet Transform is over twice as fast as the direct digital Wavelet Transform implementation. A large number of examples that illustrate the characteristic operation of the Wavelet Transform are given. Finally, the analysis provided by the digital Wavelet Transform and the analysis provided by the constant-Q factor cochlear model are compared and shown to be virtually equivalent.

TABLE OF CONTENTS

Section I

Chapter 1  An Introduction to Digital DTMF Detection
    1.1  Introduction
    1.2  DTMF signalling
    1.3  CCITT Recommendations
    1.4  The development of DTMF receivers
    1.5  Digital Signal Processing
    1.6  The TMS320C25 Digital Signal Processor
    1.7  Conclusion

Chapter 2  The Development of the TMS320C25 Hardware Platform
    2.1  Introduction
    2.2  Overview of the TMS320C25 development platform
    2.3  The TMS320C25 and its support circuitry
    2.4  Program memory
    2.5  Data memory
    2.6  The PC interface
    2.7  The Analogue to Digital Converter
    2.8  The Digital to Analogue Converter
    2.9  The Wait State Generator
    2.10 Conclusion

Chapter 3  The DTMF Detector Architecture
    3.1  Introduction
    3.2  DTMF Detector Architecture Overview
    3.3  The Band Pass Filters
    3.4  Band Pass Filter Implementation
    3.5  The resonators
    3.6  The integrators
    3.7  Time scheduling
    3.8  The Backend
    3.9  Conclusion

Chapter 4  DTMF Detector Characterisation
    4.1  Introduction
    4.2  Test description and results
         4.2.1  Receive power levels
         4.2.2  Signal reception timing
         4.2.3  Pause reception timing
         4.2.4  Twist acceptance levels
         4.2.5  Frequency tolerance acceptance levels
         4.2.6  Speech Immunity test
    4.3  Conclusions

Section II

Chapter 5  An Introduction to Digital Speech Processing
    5.1  Introduction
    5.2  The physiology of speech production
    5.3  Speech Analysis
    5.4  Linear Predictive analysis of speech signals
    5.5  Speech coding
    5.6  Speaker identification and verification

Chapter 6  A New Speaker Verification Model
    6.1  Introduction
    6.2  The self segmenting Linear Predictor model
    6.3  Training the model
         6.3.1  The Steepest descent algorithm
         6.3.2  Dynamic programming and backtracking
    6.4  Training algorithm summary
    6.5  The Verification Algorithm

Chapter 7  Training and Testing the Model
    7.1  The use of passwords in speaker verification
    7.2  The speaker database
    7.3  Training the model
    7.4  Setting a threshold
    7.5  Testing the model
    7.6  Some screen dumps
    7.7  Conclusion

Chapter 8  Improving the Model's Performance
    8.1  Introduction
    8.2  The nature of the problem
    8.3  Neural Networks
    8.4  The Multi-layer Perceptron
    8.5  Using an MLP to Improve the Verification Decision
    8.6  Results
    8.7  Conclusions

Section III

Chapter 9  An Introduction to Time-Frequency Analysis
    9.1  Introduction
    9.2  Time-Frequency Analysis
    9.3  Frequency Resolution of the STFT
    9.4  Time Resolution of the STFT
    9.5  Time-frequency Resolution of the STFT

Chapter 10  The Theory of the Wavelet Transform
    10.1  The Continuous Wavelet Transform
    10.2  The Discrete Wavelet Transform
    10.3  Choice of Wavelet Function
    10.4  Time-frequency resolution of the Gaussian Wavelet Transform
    10.5  Use of the Wavelet Transform in auditory modelling

Chapter 11  Implementation of the Discrete Gaussian Wavelet Transform
    11.1  Implementation overview
    11.2  Generating Gaussian wavelet basis functions
    11.3  A worked example
    11.4  Fast Digital Wavelet Transform
    11.5  Application to Speech Processing
    11.6  Conclusions

Chapter 12  Graphical Representation of the Wavelet Coefficients
    12.1  Introduction
    12.2  Overview
    12.3  Discussion of selected examples
    12.4  Conclusions

References

Papers Published

Papers Published

Ambikairajah, E., Keane, M. and Tattersall, W., 'Speaker Verification using Self-segmenting Linear Predictors', 4th Australian International Conference on Speech Science and Technology, Brisbane, Dec. 1-3, 1992.

Ambikairajah, E., Keane, M., Kelly, A., Kilmartin, L. and Tattersall, G., 'Predictive Models for Speaker Verification', European Journal of Speech Communication Special Edition, January 1994.

Ambikairajah, E., Keane, M., Kilmartin, L. and Tattersall, G., 'The Application of the Wavelet Transform for Speech Processing', Eurospeech '93, Berlin, 1993.

Section I

Section Overview

The aim of this part of the research was to develop and to implement in real time an algorithm to detect CCITT standard DTMF telecommunication tones on the Texas Instruments TMS320C25 digital signal processor. This involved both hardware and software development.


The hardware part involved the development of a stand-alone TMS320C25 platform. This generic platform, which is described fully in chapter 2, contains a single TMS320C25 processor, program memory, data memory, an analogue to digital converter and a digital to analogue converter. It also contains a PC interface which allows a PC to write to or read from the board's memories.

The software consists of digital signal processing routines that analyse the sampled signal from the analogue to digital converter and search for the presence or absence of DTMF tones. Basically, the detection scheme consists of two parts, known as the front end and the back end. The front end uses filters, tuned resonators and integrators to analyse the signal. Every 6 ms the execution of the front end is suspended and the outputs of the integrators are passed to the back end. The back end compares these outputs to each other and to pre-set thresholds in order to ascertain whether a valid DTMF tone is present or not. The details of both the front end and the back end are given in chapter 3.

The motive for developing the TMS320C25 hardware platform was the real-time evaluation of the detection scheme. The last chapter in this section, chapter 4, describes the methods used to carry out this evaluation. In general, testing of a telecommunication tone detector is an extensive procedure. Simply detecting a few valid tones is not enough. The CCITT recommendations specify parameter ranges in which the detector should operate and parameter ranges in which it should not operate, i.e. not detect tones. Realistic testing requires a full system characterisation. This involves iterative testing of the system to characterise its operational range. The results of this system characterisation are given in tables 4.1 to 4.10 of chapter 4. Comparing these results to the recommendations, given in section 1.3, confirms that the solution developed during this research does operate to the CCITT recommendations.

Chapter 1

An Introduction to Digital DTMF Detection

1.1 Introduction

More than 25 years ago the need for an improved method for transferring dialling information through the telephone network was recognised. The traditional method, dial pulse signalling, was not only slow, suffering severe distortion over long wire loops, but required a DC loop throughout the communications channel. A signalling scheme was developed utilising voice frequency tones and implemented as a reliable alternative to pulse dialling. This scheme is known as DTMF, Touch-tone or simply tone dialling.

The use of Dual-Tone Multi-Frequency (DTMF) signalling schemes within telecommunications has become widespread in recent years. It is replacing the older pulse dialling methods traditionally used in telephone systems world-wide and, in addition, is finding applications in other equipment types such as Personal Computer (PC) telephone peripherals and remote signalling schemes. Thus, in many countries, there are two standard dialling conventions in use: the new all-electronic DTMF method and the old electro-mechanically based pulse dial system.

Figure 1.1 represents a simplified pulse dialling telephone terminal. The actual circuit is more complex, but this diagram serves to illustrate the pulse dialling mechanism.

Figure 1.1 Schematic of a Pulse Dialling Telephone (handset, speech circuit, hook-switch, dial, bell, and the ring and tip line connections)

When the receiver of a pulse dialling telephone is lifted, the hook-switch closes and a DC loop current of a few milliamps flows from the central office or local exchange [1]. This indicates to the local exchange that the telephone is being used. The dial is arranged so that the switch within it opens and closes as it returns to the rest position. When the switch is open it causes the loop current to be interrupted, hence the alternative name of loop disconnect dialling. The dial switch is arranged so that one disconnect or pulse is created for the digit 1, two for the digit 2, up to ten pulses for the digit 0.


Dial pulses originally operated electromechanical switching systems, and still do in many countries. These systems are slow, having an upper limit of about ten operations per second. Pulse dialling systems therefore produce pulses of 100 millisecond (ms) duration. Nominal operation in the U.S. gives a break period of 61 ms and a make period of 39 ms. This is different for other countries, which use a 2:1 ratio (67 ms break, 33 ms make). An inter-digit pause is indicated by an absence of pulses for a period which varies from 700 ms for the U.S. down to 200 ms for other countries. The time required to dial the pulses needed for one digit can be up to 1.7 seconds (ten pulses for the digit 0 and a 700 ms inter-digit pause), which can make dialling of a long international number very time consuming. For example, the number 044-11-270-232111 would take over 15 seconds to dial. This is one reason why pulse dialling is considered out-dated. However, the main drawbacks of the pulse dialling system are associated with servicing and maintaining the electro-mechanical switching system.

In order to reduce costs, increase reliability, and improve service, the electro-mechanical switching systems used at central offices and local exchanges are being replaced with fully electronic systems. In most western countries, including Ireland, this upgrading process is virtually complete. With the new electronic equipment it is no longer necessary to have the slow dialling mechanism which accommodated the response time of the older switching equipment. A new dialling scheme thus becomes possible using purely electronic means. The DTMF system has been adopted as the universal standard through the Comité Consultatif International Télégraphique et Téléphonique (CCITT), a committee of the International Telecommunication Union (ITU), now a part of the United Nations.

1.2 DTMF signalling

The full name for DTMF is Dual-Tone Multi-Frequency, which describes its operating characteristics very well. The standard Multi-Frequency Push Button (MFPB) keypad layout is shown in figure 1.2. This keypad replaces the dial in DTMF equipped telephone terminals. Note that the keys A, B, C and D are not usually present, but are part of the CCITT recommendations. Thus the system is composed of the ten decimal digits and 6 reserve signals, making 16 signals in all. Pressing any key causes an electronic circuit to generate a tone which is a summation of the two individual frequencies related to the row and column of that key. The two frequencies composing each signal are taken from two mutually exclusive frequency groups of four frequencies each, a code known as a "2(1/4) code".

As the tone generation does not involve a disconnect of the telephone circuit, DTMF tones may be sent down the line during a call just by pressing a key on the keypad. Alternatively, DTMF tones may be sent from other sources connected to the line. When this method is used as a form of low speed data transmission, it is important that speech is not accidentally interpreted as a DTMF tone. In order to reduce the chance of this happening, the recommendations require that a tone must be present for at least 50 ms, with an inter-digit pause of similar length. Thus, with a minimum dial time of 100 ms per digit for all digits, the previous number would take 1.4 seconds to dial. This represents a saving of 13.7 seconds, which is 91% of the previous dial time. Additional advantages of DTMF dialling include the use of solid state electronics and compatibility with electronically controlled exchanges.
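As an illustration of this two-tone composition, the following C sketch generates the samples of a DTMF tone as the sum of its row and column sinusoids. The function name, the amplitude scaling and the 8 kHz sampling rate are illustrative assumptions, not details taken from this chapter.

    #include <math.h>

    #ifndef M_PI
    #define M_PI 3.14159265358979323846
    #endif

    #define FS 8000.0               /* assumed sampling rate in Hz */

    /* Fill buf with n samples of the DTMF tone for one key: the sum
       of its low-group (row) and high-group (column) frequencies,
       each at amplitude a.  Illustrative sketch only. */
    void dtmf_tone(double f_low, double f_high, double a,
                   double *buf, int n)
    {
        for (int i = 0; i < n; i++) {
            double t = i / FS;
            buf[i] = a * (sin(2.0 * M_PI * f_low  * t) +
                          sin(2.0 * M_PI * f_high * t));
        }
    }

    /* For example, the digit 5 (row 770 Hz, column 1336 Hz), 50 ms:
           double buf[400];
           dtmf_tone(770.0, 1336.0, 0.5, buf, 400);               */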

                          High group frequencies (Hz)
                         1209    1336    1477    1633

    Low group     697      1       2       3       A
    frequencies   770      4       5       6       B
    (Hz)          852      7       8       9       C
                  941      *       0       #       D

Figure 1.2 DTMF Keypad

1.3 CCITT Recommendations

The CCITT recommendations for DTMF are given in Recommendations Q.23 and Q.24 of the CCITT Blue Book [2] and are summarised below. The numerical values used vary slightly from country to country; the values given here are those used in Ireland. Slight variations in the dBm level and frequency tolerances apply for other countries; these factors depend on the line quality in that country.

• The low group frequencies of the 2(1/4) code are 697, 770, 852 and 941 Hz. The high group frequencies are 1209, 1336, 1477 and 1633 Hz. The allocation of these frequencies is as in figure 1.2. It is interesting to note that the tone frequencies have been chosen such that they are not harmonically related and that their intermodulation products result in minimum signal impairment.

• The transmitted frequency must be within 1.8% of the nominal frequency, and the total distortion resulting from harmonics and intermodulation must be at least 20 dB below the fundamental frequencies.

• The detector should respond to frequencies that are within 1.5% + 2 Hz of the nominal frequency. The frequency tolerance parameter is specified as: must accept deviations < 1.5% + 2 Hz; must reject deviations > 7.0%.

• The receive power levels per frequency are specified as: must accept -5 dBm to -30 dBm; must reject < -39 dBm.

• The power level difference between the two frequencies (the twist) is specified as: must accept < 6 dB.

• The signal reception timing specifications are: must accept tones > 40 ms; must reject tones < 20 ms; signal interruptions < 20 ms must be tolerated; pauses > 40 ms must be detected.
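The must-accept and must-reject frequency bands follow directly from these figures. A minimal sketch of the computation (the function name is illustrative):

    #include <stdio.h>

    /* CCITT frequency tolerance for a nominal DTMF frequency:
       accept anything within 1.5% + 2 Hz, reject beyond 7.0%. */
    void tolerance_bands(double f_nominal)
    {
        double accept = 0.015 * f_nominal + 2.0;  /* must-accept */
        double reject = 0.070 * f_nominal;        /* must-reject */
        printf("%6.0f Hz: accept +/-%.1f Hz, reject beyond +/-%.1f Hz\n",
               f_nominal, accept, reject);
    }

    /* tolerance_bands(697.0)  -> accept +/-12.5 Hz, reject +/-48.8 Hz
       tolerance_bands(1633.0) -> accept +/-26.5 Hz, reject +/-114.3 Hz */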

1.4 The development of DTMF receivers

Early tone receivers utilised banks of bandpass filters, making them somewhat cumbersome and expensive to implement. This generally restricted their application to central offices (telephone exchanges). The bandpass filters were typically LC filters, active filters and/or phase locked loops. These early decoders were the first generation of DTMF tone receivers [3].

The introduction of Metal Oxide Semiconductor (MOS) and Large Scale Integration (LSI) techniques brought about the second generation of tone receiver development. These receivers were used to digitally decode the two discrete tones that result from decomposition of the composite signal. Two analogue bandpass filters were used to perform the signal decomposition.

Totally self-contained receivers implemented in thick film hybrid technology marked the start of third generation devices. Typically, they also used analogue active filters to bandsplit the composite signal and MOS digital devices to decode the tones.

The development of silicon-implemented switched capacitor sampled filters marked the birth of the fourth and current generation of tone receivers. Initially, single chip bandpass filters were combined with currently available decoders, enabling a two chip receiver design. A further advance in integration has merged these two functions onto a single chip. This is typically the approach used in large dedicated DTMF and tone receivers in exchanges.

In parallel with such developments in integration came the widespread availability of powerful and cheap user-programmable Digital Signal Processors (DSPs). These processors combine the flexibility of a high-speed controller with the numerical capability of an array processor, offering an inexpensive alternative to custom VLSI multi-chip solutions. The performance of these devices has increased dramatically since their introduction in the early 80's. Today, their use in telecommunications has become widespread, particularly in situations, such as this one, which combine tone filtering with other processing such as state sequence programming and program control.

1.5 Digital Signal Processing

In the last decade, Digital Signal Processing has made tremendous progress in both the theoretical and practical aspects of the field [4]. While more DSP algorithms are being discovered, better tools are being developed to implement these algorithms. Some of the most important developments were made in the area of VLSI technology. This resulted in the availability of powerful single chip digital signal processors that are specifically designed to cater for the computationally intensive demands of DSP algorithms. These chips are capable of implementing the complex DSP algorithms which previously required the use of a minicomputer or an array processor. With this VLSI advancement, innovative engineers and scientists discovered a multitude of applications where digital signal processors can provide a better solution than their analogue counterparts, for reasons of reliability, reproducibility, compactness and efficiency.

DSPs are also inherently programmable. This makes them particularly attractive from the point of view of system upgrades, where the functionality of the system can be changed by rewriting the DSP code; multitasking, where many tasks can be performed at the same time; and flexibility, where the same generic DSP hardware can be used for many different applications.

A basic assumption when using DSP is that a sampled digital signal, rather than a continuous analogue one, is available. In general, the bridge between analogue and digital is crossed using analogue to digital converters (ADCs) and digital to analogue converters (DACs). Recent advances in VLSI have also contributed greatly to the widespread availability of these devices at very affordable prices.



As a result of these advantages and the affordability of the hardware involved, digital signal processing is finding applications in diverse fields such as telecommunications, imaging, digital audio, control, instrumentation, medical science and navigation, to name but a few.

1.6 The TMS320C25 Digital Signal Processor

There is a range of digital signal processors available on the market today. In general they can be broadly divided into two classes: fixed point processors, which are extensively used in telecommunications and speech processing, and floating point processors, which are more expensive and are used in applications requiring broader dynamic ranges and higher accuracy. For this application, a fixed point approach is sufficient.

There is a range of fixed point processors available on the market today. Examples include the Texas Instruments TMS320 series, the Analogue Devices ADSP-2100 series and the Nippon Electric Company NEC 7720 series. Stiff competition in a fast growing market has resulted in continual evolution of products and technology. Indeed, most digital signal processors are superseded or upgraded within two years of release. The result is that there is now little to choose between the main competitors [5].

For this application, all of the main fixed point processors were reviewed. In the end the TMS320C25 was chosen. However, this choice was not purely objective. In fact, the main deciding factor in favour of the TMS320C25 was that the development software was already available in the department, and that there had been some previous experience in the department with the TMS320 series. That said, the choice of the TMS320C25 has proven quite successful.

The TMS320C25 is the third generation of the TMS320 fixed point family. Its combination of the TMS320's Harvard-type architecture (separate internal program and data buses) and its special digital signal processing instructions provides the speed and flexibility to deliver a throughput of 10 MIPS (million instructions per second). In fact, this performance has since been increased, as the instruction cycle time has been reduced from 100 ns on the version used in this project to 80 ns in the latest versions. Like all modern digital signal processors, the TMS320C25 boasts a range of features particularly suited to computationally intensive DSP routines [6]. These include:

• 544 words of on-chip RAM
• a 32-bit Arithmetic Logic Unit (ALU)/accumulator
• a 16 x 16-bit on-chip multiplier
• single-cycle multiply/accumulate instructions
• an on-chip timer
• eight auxiliary registers
• sixteen input/output ports
• wait states for use with slow memories/peripherals


1.7 Conclusion

This chapter has introduced the elements involved in the design of a real-time DTMF detection scheme. The next chapter outlines the design of a TMS320C25 hardware platform which was built during this project. The design of this platform served two purposes. Firstly, the design process required a working insight into the TMS320C25's operation; the experience gained during the hardware stage thus proved invaluable later during the software development. Secondly, the hardware platform itself was used to evaluate the overall system performance.

Chapter 2

The Development of the TMS320C25 Hardware Platform

2.1 Introduction

A TMS320C25 software development platform was designed and built during this project. This platform exists as a stand-alone hardware board which interfaces to a personal computer (PC) via a commercial PC interface card. The development board consists of a single TMS320C25 chip along with program memory, data memory, a PC interface and the necessary support hardware.

Through the interface card, a user can load a TMS program into the board's program memory and instruct the TMS320C25 to execute it. A user can also write data to or read data from the board's data memory. This allows computer generated test data to be stored on the board and applied to the algorithm being tested. The stored results can then be uploaded and analysed.

A single Analogue to Digital Converter (ADC) and a single Digital to Analogue Converter (DAC) are interfaced to the on-board TMS320C25. This allows the board to be used for real-time filtering, real-time algorithm testing and speech sampling. In this project, the board was used primarily to develop and test the DTMF detection algorithms. However, it was also used to sample and store the database of speech utterances used in Sections II and III of this project.

2.2 Overview of the TMS320C25 development platform

A block diagram of the TMS320C25 development board is shown in figure 2.1. The main functional blocks are shown along with the address and data buses. Each functional block will be described in detail in a separate section of this chapter. This section gives a general overview of the board's design and operation.

At the centre of the system stands a single TMS320C25 digital signal processor and its associated circuitry. The address bus and the data bus connect the processor to 4 kilobytes (Kb) of program memory and to 64 Kb of data memory. The TMS320C25 implements two separate and distinct memory spaces, so both program and data memory have a base address of zero. The distinction between the memory spaces is made through the use of the program space (PS) and data space (DS) pins on the processor. These pins are used to enable the relevant memories during read/write operations. A third memory space, the I/O space, is used for input/output operations. The I/O space pin (IS) is used in the address decoding scheme to enable either the ADC or the DAC during I/O operations. Thus, the ADC and the DAC are not directly connected to the address bus.



A wait state generator is used to add wait states to TMS320C25 read/write operations. This is necessary when read/write operations cannot be completed in a single cycle. In order to achieve maximum speed and throughput, the TMS320C25 should operate with no wait states as much as possible. Therefore, this system uses high speed static RAM (SRAM), which requires no wait states, for program memory. This is important since the processor reads each instruction it executes from program memory. Adding a single wait state to program memory accesses would effectively halve execution speed.

Figure 2.1 Hardware block diagram (the TMS320C25 connects via the address and data buses to the program memory, data memory, PC interface, ADC and DAC)

The data memory access time is less important because the TMS320C25 possesses 544 words of internal RAM [6]. This internal RAM is used to store frequently accessed variables, which can then be accessed in a single cycle. The external data memory is mainly used to store raw data, results and speech samples. Generally, this type of data is accessed only once or twice during each sampling period, so it is possible to use slower SRAM for the external data memory. In this system, 64 Kb of SRAM requiring a single wait state per access is used as external data memory.

The buffered PC interface has two modes of operation: a high impedance mode and a clocked bus access mode. The high impedance mode isolates the board from the PC, allowing the TMS320C25 to operate normally. The clocked bus access mode is used to directly access the board's data buses from the PC. This mode uses the TMS320C25's Direct Memory Access (DMA) facility to place the processor's address and data bus pins in a high impedance state. This allows the PC to write to the buses without bus contention. Thus, a DMA involves the following steps.

• The PC requests a DMA by bringing the HOLD pin low.

• The TMS320C25 responds within three cycles by bringing HOLD Acknowledge (HOLDA) low. This means that the TMS320C25 has placed its bus pins in a high impedance state. The HOLDA pin is connected to the output enable pin of the interface, so when it goes low, the PC interface is enabled.

• A simple PC program latches the address through the PC interface onto the address bus. The data bus interface is bi-directional, allowing data to be written to or read from the board's data bus. Note that a latched address interface is used to ensure that the address stays stable throughout the read/write cycle.

• Once all the data has been written or read, the HOLD condition is released and the PC interface returns to high impedance mode.

This method is used to sequentially write program code that has been compiled and linked on the PC directly into the board's program memory. Once the write operation is completed, the PC can reset the TMS320C25 so that program execution begins at location zero. This method is also used to write test data to data memory and to read results and speech samples from data memory.
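The PC-side half of this sequence can be summarised in C. The port addresses, the bit assignments and the outp()/inp() access routines below are hypothetical placeholders for whatever the commercial interface card provides; only the HOLD/HOLDA handshake itself is taken from the description above.

    #define CTRL_PORT 0x300  /* hypothetical control register        */
    #define ADDR_PORT 0x302  /* hypothetical address latch register  */
    #define DATA_PORT 0x304  /* hypothetical bidirectional data port */
    #define HOLD_BIT  0x01   /* hypothetical: drives HOLD            */
    #define HOLDA_BIT 0x02   /* hypothetical: reads back HOLDA       */

    extern void     outp(unsigned port, unsigned value); /* card I/O */
    extern unsigned inp(unsigned port);

    /* Write a block of words into the board's memory over the PC
       interface, using the DMA handshake described above. */
    void dma_write_block(unsigned base, const unsigned short *buf,
                         int n)
    {
        outp(CTRL_PORT, HOLD_BIT);     /* request the buses (HOLD)   */
        while (!(inp(CTRL_PORT) & HOLDA_BIT))
            ;                          /* wait until HOLDA asserted:
                                          the buses are tri-stated   */
        for (int i = 0; i < n; i++) {
            outp(ADDR_PORT, base + i); /* latch a stable address     */
            outp(DATA_PORT, buf[i]);   /* drive one word on the bus  */
        }
        outp(CTRL_PORT, 0);            /* release HOLD; the interface
                                          returns to high impedance  */
    }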

2.3 The TMS320C25 and its support circuitry

Figure 2.2 is a circuit diagram of the TMS320C25 and its support circuitry. This circuit corresponds to the central block in figure 2.1. The address bus, the data bus and the other connections marked with the 0 symbol connect to points in the circuit diagrams of sections 2.3 to 2.8. The circuit in figure 2.2 is quite straightforward. Only the address decoding scheme, outlined below, and the wait state generator, described in section 2.9, require explanation.

The decoding scheme uses a 3-to-8 line decoder, the 74LS138 chip, to enable the appropriate device during I/O operations. Although the TMS320C25 supports 16 input/output ports, this scheme requires only 3 I/O ports. Thus, only 3 address lines are decoded; the A3 address line is used as an active low enable, as the I/O address space contains only addresses from 0 to 7. The following three enable lines are used in this scheme (a behavioural sketch of the decode is given after the list).

• The Y0 output corresponds to I/O address zero. This output is ORed with the TMS320C25 R/W output and connected to the ADC latch output enable pin. Thus, to read from the ADC latch, a read operation from I/O address 0 is executed. To specify this operation, the TMS320C25 instruction IN <dma>,PA0 is used to read a data word from port 0 into the data memory address specified in <dma>.

• The Y1 output corresponds to I/O address one. This output is used to clock the DAC output latch. Thus, to write to the DAC, a write operation to I/O address 1 is executed. The TMS320C25 instruction OUT <dma>,PA1 achieves this.


• The Y2 output corresponds to I/O address two. This output is used to initiate the ADC conversion process. It connects to the ADC control circuitry of section 2.7, and its operation is described in detail in that section.
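The decode logic above can be captured behaviourally in a few lines of C. The struct and function are illustrative only; the signal names follow the text.

    #include <stdbool.h>

    /* Behavioural sketch of the I/O decode: the 74LS138 decodes
       address lines A0-A2 during I/O-space accesses, with A3 as an
       active low enable (valid ports are 0 to 7). */
    struct io_enables { bool adc_read; bool dac_clock; bool adc_start; };

    struct io_enables decode_io(unsigned addr, bool io_strobe, bool read)
    {
        struct io_enables e = { false, false, false };
        if (!io_strobe || (addr & 0x8))  /* not an I/O access, or A3 high */
            return e;
        unsigned y = addr & 0x7;         /* 3-to-8 decoder select         */
        e.adc_read  = (y == 0) && read;  /* Y0 OR'd with R/W: ADC latch   */
        e.dac_clock = (y == 1) && !read; /* Y1: clock the DAC latch       */
        e.adc_start = (y == 2);          /* Y2: initiate ADC conversion   */
        return e;
    }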

Figure 2.2 The TMS320C25 and its support circuitry (circuit diagram; signals shown include the address and data buses, RESET from the PC, and HOLD ACKNOWLEDGE to the PC interface)

All of the integrators are zeroed on initialisation and at the end of every iteration of the backend.

3.7 Time scheduling

This implementation is designed to operate at 8 kHz. This requires that a new sample be read in every 0.125 ms. This is achieved by using a timing interrupt and the IDLE instruction, which causes the TMS320C25 to idle until it receives a timing interrupt. In order to ensure that this method works effectively, the processing must be complete in less than 0.125 ms so that the TMS320C25 will be idling before the timing interrupt.



In this implementation the front end will always complete within 0.125 ms, as required. However, the front end and the backend together will not execute in 0.125 ms. Thus, at the end of every time slice, when both the front end and the backend must be executed, computational overrun could occur. In order to avoid this, the front end and the back end are interleaved, as shown in figure 3.6. This figure shows the end of a time slice, where the arrows represent samples and the intervals between these samples correspond to 0.125 ms. Note that the dummy sample is not processed. The backend processing is done in the sampling interval after the dummy sample. This requires less than 0.125 ms, so it is complete before the next samples must be taken and processed.
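The scheduling scheme can be outlined as follows. Here wait_for_timer_interrupt() stands in for the IDLE instruction and timer-interrupt pair, the other routines are placeholders, and the 48-sample slice length is an assumption based on the 6 ms time slice at an 8 kHz sampling rate.

    #define SLICE_SAMPLES 48  /* assumed: 6 ms time slice at 8 kHz */

    extern void wait_for_timer_interrupt(void); /* IDLE until tick  */
    extern int  read_adc(void);
    extern void front_end(int sample);
    extern void back_end(void);

    /* Sketch of the interleaved front end / backend scheduling. */
    void detector_loop(void)
    {
        for (;;) {
            /* every sample of the slice but the last: front end only */
            for (int n = 0; n < SLICE_SAMPLES - 1; n++) {
                wait_for_timer_interrupt();
                front_end(read_adc());
            }
            wait_for_timer_interrupt();
            (void)read_adc();  /* dummy sample: read, not processed  */
            back_end();        /* runs in this interval, < 0.125 ms,
                                  so the processor idles again before
                                  the next timing interrupt          */
        }
    }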

Figure 3.6 Time scheduling of the front end and back end (at the end of a time slice, samples N=23 and N=24 are processed by the front end; the dummy sample which follows is not processed, and the backend executes in the next sampling interval)

3.8 The Backend

The backend of the DTMF receiver takes the outputs of the frontend integrators and tests for the presence of either a valid tone or a valid silence. Validation of a tone involves passing a number of tests, where each test is designed to examine a certain signal characteristic. Overall, the combination of these tests is designed to detect DTMF tones with a high recognition rate while, at the same time, rejecting speech signals that may look like DTMF signals. A flowchart outlining the operation of the backend is given in figure 3.7.

The first step in the backend is initialisation. This involves restoring the state of the backend as it was at the end of the previous backend iteration. In particular, the tone count, the pause count, the previous tone present, if any, and the tone/pause state of the detector must be loaded. This is necessary because validation of a tone or pause occurs over a number of time slices.

The next five steps involve tests which a valid tone must pass. These tests are performed sequentially. Failure at any stage means that the tone is not a valid tone and it is interpreted as a pause. The five tests are:

• The threshold test: the tone must have frequency components above a minimum threshold in both the upper and the lower frequency band. The integrators at the outputs of the bandpass filters are compared to a minimum threshold for this test.


Figure 3.7 DTMF Backend state diagram (flowchart: high and low band energy threshold tests; swing tests of less than 1 dB per band; twist test of less than 8 dB; comparison of the 8 tone integrators with the floor threshold; two-valid-tones check; tone and pause counters; declaration of the tone or pause state and output of the DTMF digit; the integrators are zeroed and the state and counts stored for the next iteration)



• The swing test: the tone must maintain stable frequency components from one time slice to the next. This is a good way to reject speech, as it is unlikely that speech would stay stable for such a period of time. The bandpass filter integrators are compared to their previous values in this test. A variation of less than 1 dB is required for valid tones.

• The twist test: the twist test compares the high and low bands for acceptable twist. The CCITT recommendations specify a twist accept value of 6 dB. In this case the twist accept is set to 8 dB, in order to always accept all twists of 6 dB and less. Implementation of this test involves comparison of the band pass filter integrators.

• The resonator tests: this test compares the resonators' integrator outputs with pre-computed thresholds in order to test for valid DTMF frequencies.

• The two and only two tones present test: as the name suggests, this test checks that two and only two frequency components are present. If more or fewer than two are present, the time slice is considered to be a pause.

If all these tests are passed, then a valid tone is present. This, however, does not mean that the corresponding digit is new information. The backend first compares the current digit to the digit recognised in the last time slice. If they match, the tone count is incremented. If not, it is a new digit and the tone count is reset to zero. The last stage of the validation process tests the counts. If the tone count reaches 4, then a valid tone has been present for long enough to be declared. If the pause count reaches 6, then a valid pause is declared. In either case, the tone or the pause is only declared once. The detector then stays in either state until a different valid state is encountered and declared. (The sketch below summarises this decision logic.)
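The decision logic just described can be summarised in C-like pseudocode. All of the names below (the integrator arrays, thresholds and helper routines) are illustrative placeholders, not names from the thesis; the counts of 4 and 6 are the tone and pause counts from the text.

    #include <math.h>

    #define MIN_THRESHOLD 1.0e3  /* illustrative floor value        */

    extern double band_energy[2], band_energy_prev[2]; /* band ints */
    extern double tone_energy[8];                      /* tone ints */
    extern double db_diff(double a, double b);   /* 10*log10(a/b)   */
    extern int    two_tones_above_floor(const double *e);
    extern int    lookup_digit(void);  /* digit from the two tones  */
    extern void   declare_tone(int digit);
    extern void   declare_pause(void);

    static int tone_count, pause_count, last_digit = -1, in_tone_state;

    /* One backend iteration: run the validation tests, then update
       the counts and declare a state change if warranted.          */
    void back_end_slice(void)
    {
        int digit = -1;
        int valid =
            band_energy[0] > MIN_THRESHOLD &&          /* threshold */
            band_energy[1] > MIN_THRESHOLD &&
            fabs(db_diff(band_energy[0], band_energy_prev[0])) < 1.0 &&
            fabs(db_diff(band_energy[1], band_energy_prev[1])) < 1.0 &&
            fabs(db_diff(band_energy[1], band_energy[0])) < 8.0 &&
            two_tones_above_floor(tone_energy); /* resonator + 2-tone */

        if (valid) {
            digit = lookup_digit();
            tone_count = (digit == last_digit) ? tone_count + 1 : 0;
            pause_count = 0;
        } else {
            pause_count++;
            tone_count = 0;
        }
        if (tone_count >= 4 && !in_tone_state) { /* declare a state */
            declare_tone(digit);                 /* only once       */
            in_tone_state = 1;
        }
        if (pause_count >= 6 && in_tone_state) {
            declare_pause();
            in_tone_state = 0;
        }
        last_digit = digit;
    }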

3.9 Conclusion

The software described in this section was implemented in modules which were individually compiled and linked together using the TMS320C25 assembler tools. The executable code was then loaded into the program memory of the development board and the reset signal applied. For evaluation, the programs were modified so that all input/output operations were redirected to data memory, allowing the program's operation to be evaluated. For example, to test the operation of a filter section, the program was modified to read a sampled impulse from one section of data memory into the filter input and to write the filter's output to another section of memory. After the program had executed, the impulse response was read into the PC and analysed using an FFT program to investigate the filter performance. Such analysis was useful in improving the implementation of the software, i.e. minimising filter quantisation noise and removing filter overflow.


A more advanced program was used to test the overall operation of the detector. This involved digitally generating ranges of valid DTMF tones and loading them into the board's data memory. The detector program was slightly modified to read samples from the board's data memory rather than the ADC, and to write the results, i.e. the detector state, to another data memory buffer. Thus, the detector's operation could be evaluated on a sample by sample basis. The actual tests used to evaluate its performance are outlined in the next chapter.


Chapter 4

DTMF Detector Characterisation

4.1 Introduction

This chapter outlines the method which was used to verify that the DTMF detector works to CCITT specifications. Essentially, the detector had to be tested for acceptance of all valid tones and rejection of all invalid tones. Invalid tones include tones whose specifications fall into the 'must reject' category, as well as speech signals and other non-DTMF signals.

A range of tests was designed and implemented. Each test was designed to characterise a certain detector parameter. In all, 11 separate tests were devised, where each individual test was carried out on each digit. This involves 11 x 16 individual tests. Manual completion of such a number of tests would take weeks, so a program was devised to implement the tests.

The test program digitally generates 40 repetitions of a digital DTMF tone pair and writes the samples down to the board's data memory. Tone parameters, such as frequency, dBm level, tone on time and tone off time, are programmable, so this method can be used to automatically characterise the DTMF detector. For testing, the TMS320C25 DTMF detection program is modified to read samples from data memory, rather than from the ADC. The detector state is monitored by writing it to a separate data memory buffer as each sample is processed. At the end of 40 tones the detector output buffer is read back into the PC and analysed. This process is repeated for each of the 16 DTMF digits. If all 40 tones of each digit were recognised, then the detector has passed for this parameter set. Increasing or decreasing a single parameter until the detector fails to recognise all 40 tones characterises the detector's operational range for that parameter. A complete characterisation requires that this process be repeated for all the specified operational parameters.
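The characterisation procedure lends itself to a simple search loop. In the sketch below, generate_tones() and count_detected() are placeholders for the PC-side tone generator and for running the detector over a buffer of samples; the 0.5 dB step is an assumption consistent with the resolution of the results in tables 4.1 to 4.3.

    extern void generate_tones(double f_low, double f_high,
                               double level_dbm, double on_ms,
                               double off_ms, int reps,
                               short *buf, int *n_out);
    extern int  count_detected(const short *buf, int n);

    /* Find the minimum level (in dBm) at which all 40 tones of a
       digit are still recognised, stepping down 0.5 dB at a time. */
    double min_level_always_detected(double f_low, double f_high)
    {
        static short buf[40 * 80 * 8]; /* 40 reps of 40 ms on plus
                                          40 ms off, 8 samples/ms   */
        int n;
        for (double dbm = -5.0; dbm > -50.0; dbm -= 0.5) {
            generate_tones(f_low, f_high, dbm, 40.0, 40.0, 40,
                           buf, &n);
            if (count_detected(buf, n) < 40)
                return dbm + 0.5;  /* last level that passed fully */
        }
        return -50.0;
    }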

4.2 Test description and results

4.2.1 Receive power levels

The CCITT recommendations for DTMF receive power levels per frequency specify that all tones with power levels between -5 dBm and -30 dBm must be accepted, and that all tones with power levels less than -39 dBm must be rejected. To establish these operational levels, 3 separate characterisations were implemented. Each of these tests was implemented with 40 tones of length 40 ms, with 40 ms pauses between them. The first characterisation established the maximum level always detected. The tone dBm level was increased until all 40 tones were no longer recognised. The maximum level always detected for each DTMF digit is shown in table 4.1. Clearly the maximum level requirement has been met for all tones.

    Freq. (Hz)    697     770     852     941
    1209         -2.5    -3      -3      -3
    1336         -3      -3      -3      -3
    1477         -3      -3      -3      -3
    1633         -3      -3      -3      -3

Table 4.1 Maximum levels (in dBm) always detected

The minimum level always detected is established in a similar manner. In this case, however, the tone dBm level is steadily decreased until the detector stops recognising all 40 tones. The results for all 16 DTMF digits are shown in table 4.2. Again the detector meets the specifications.

    Freq. (Hz)    697     770     852     941
    1209         -37.5   -37     -36.5   -36.5
    1336         -37     -37     -36.5   -36.5
    1477         -37     -37     -36.5   -36.5
    1633         -37     -36.5   -36.5   -36.5

Table 4.2 Minimum levels (in dBm) always detected

The third level test involves testing for the maximum tone dBm level that is always rejected. This is achieved by starting at -50 dBm and steadily increasing the dBm level until some of the 40 tones are detected. The results, shown in table 4.3, show that the detector always rejects tones at or below -39 dBm.

    Freq. (Hz)    697     770     852     941
    1209         -40     -40     -40.5   -41
    1336         -40     -40     -40.5   -41
    1477         -40     -40.5   -41     -41
    1633         -40     -40.5   -41     -41

Table 4.3 Maximum levels (in dBm) always rejected


4.2.2 Signal reception timing

The CCITT recommendations specify that all tones of length 40 ms or more must be detected and that all tones of length 24 ms or less must be rejected. To characterise the detector's signal reception timing, two tests were implemented. The first test established the minimum pulse length always detected. This was achieved by applying successively shorter sets of pulses to the detector until it stopped recognising them all. The dBm level of these tones was set to -30 dBm. The results in table 4.4 show that DTMF tones of 32 ms will always be accepted, even at the lowest dBm level. Tones with greater dBm levels will always exceed these minimum levels, so we conclude that all tones of length 32 ms or greater will always be accepted. Note that this is comfortably inside the specification of accepting 40 ms tones.

    Freq. (Hz)    697     770     852     941
    1209          32      32      32      32
    1336          32      32      32      32
    1477          32      32      32      32
    1633          32      32      32      32

Table 4.4 Minimum pulse length (in ms) always accepted

The other half of this test is to establish the maximum pulse length that is always rejected. This is established by applying sets of pulses of increasing length until one or more of the 40 pulses is accepted. The results in table 4.5 show that pulses of duration less than 20 ms will always be rejected. A dBm level of -5 was used for all tones in this test. Thus, if tones of length 23 ms with energies at the highest level are rejected, then we can conclude that all tones of length 23 ms will be rejected, and that this specification is met.

    Freq. (Hz)    697     770     852     941
    1209          24      23      23      23
    1336          24      23      23      23
    1477          24      23      23      23
    1633          24      23      23      23

Table 4.5 Maximum pulse length (in ms) always rejected


4.2.3 Pause reception timing

The CCITT recommendations specify that all pauses of length 40 ms or more must be detected and that all pauses of length 20 ms or less must be rejected. Pauses of less than 20 ms must therefore be interpreted as signal interruptions rather than pauses. Only interruptions longer than 20 ms should be recognised as pauses.

The first test involves applying sets of 40 tones of length 40 ms with decreasing pauses in between. These tones are applied at the highest level, i.e. -5 dBm. If the detector recognises 40 tones, then all 39 pauses were detected. If fewer than 39 tones are recognised, then some of the pauses were missed, and the previous successfully applied value is the minimum always accepted pause value. Table 4.6 shows the results of this test. Clearly, 40 ms pauses will always be recognised by this detector.

    Freq. (Hz)    697     770     852     941
    1209          36      36      38      38
    1336          36      36      38      38
    1477          36      38      38      38
    1633          36      38      38      38

Table 4.6 Minimum pause length (in ms) always accepted

The second half of this test investigates the maximum signal interruption that will never be detected. Again 40 tones of length 40 ms are applied to the detector. This time, however, the dBm level is set to -30 dBm. The length of the pauses between tones is initially 10 ms. With this pause length, a single long tone is detected, because the pauses are too short to be valid. The length of the pauses is increased until at least one pause is recognised. Then the previous length, at which no pauses were recognised, is taken as the maximum signal interruption never recognised. The results are shown in table 4.7.

    Freq. (Hz)    697     770     852     941
    1209          36      36      38      38
    1336          36      36      38      38
    1477          36      38      38      38
    1633          36      38      38      38

Table 4.7 Maximum signal interruption (in ms) never accepted


4.2.4 Twist acceptance levels

The CCITT recommendation for the power level difference between frequencies is to accept a maximum twist of 6 dB between the levels in a valid tone pair. This test was implemented at the -30 dBm level and the -5 dBm level. The results in table 4.8 show the maximum twist accepted for each digit. In each case the dB value represents the level difference between the high tone and the low tone, in that order.

    For every digit, the maximum positive twist (high tone above
    low tone) always accepted was +5 dB; the maximum negative twist
    always accepted was -4 dB for twelve of the sixteen digits and
    -5 dB for the remaining four. (The per-digit layout of this
    table could not be recovered.)

Table 4.8 Maximum twist H/L (in dB) always accepted

4.2.5 Frequency tolerance acceptance levels

The final test concerns the frequency tolerance accept range. The recommendation calls for a maximum frequency tolerance of 1.5% + 2 Hz. Thus, the frequency tolerance ranges from ±12.5 Hz for the 697 Hz tone to ±26.5 Hz for the 1633 Hz tone. This always-accept frequency tolerance was tested by applying 40 tones of 40 ms with 40 ms pauses between them at the -30 dBm level. The results are given in table 4.9. A second table, table 4.10, shows the frequency tolerance that is always rejected, even at the highest dBm level.


    Freq. (Hz)     697       770       852       941
    1209         +25/-24   +25/-24   +25/-24   +25/-24
    1336         +27/-26   +27/-26   +27/-26   +27/-26
    1477         +30/-29   +30/-29   +30/-29   +30/-29
    1633         +32/-35   +32/-34   +32/-34   +32/-34

Table 4.9 Maximum frequency tolerance (in Hz) always accepted

    Freq. (Hz)     697       770       852       941
    1209         +29/-28   +29/-28   +29/-28   +29/-28
    1336         +31/-30   +33/-33   +33/-33   +33/-34
    1477         +33/-34   +33/-35   +33/-35   +33/-35
    1633         +37/-42   +37/-43   +37/-43   +37/-43

Table 4.10 Minimum frequency tolerance (in Hz) never accepted

4.2.6 Speech Immunity test

The recommendations also specify that the DTMF detector be immune to stimulation by speech. The actual specification requires that the detector should register no more than 46 tones in 1000 hours of speech, where the speech has a mean level of -12 dBm. Taken literally, this test requires that the detector be exposed to speech signal stimulation for over 1000 hours. The testing process can be simplified, however, with the use of the MITEL condensed speech tape. This tape consists of recordings made on telephone trunks over a long period of time and condensed into a 30 minute period. The manufacturer's notes state that a receiver with an acceptable talk-off response should register fewer than 30 'hits' during the 30 minute period.

A tape player was connected to the ADC via a commercial amplifier, and the correct amplitude level was set by iteratively sampling a test tone from the tape while refining the amplitude control. Then the condensed speech was played for 30 minutes while the DTMF detector was executing on the TMS320C25. Only two hits were registered during the 30 minute period. Thus, the detector easily passed the talk-off test.

4.3 Conclusions

The DTMF detector program meets all the CCITT recommendations given in section 1.3. This fact is not unexpected, as the detector was designed specifically to meet these recommendations. The program was written so that the detector's operational parameters can be easily modified. This was useful during development, as the filters and thresholds had to be refined to meet the specification. The final version is a generic DTMF detector, in that the operational parameters can be changed without changing the overall program. For example, in order to increase the frequency tolerance, the bandwidth of the resonators could be slightly increased, or to increase the dynamic range, the floor thresholds could be modified.


Section II

Section Overview

The aim of this part of the research was to develop a new model for speaker verification based on the idea of self-segmenting linear predictors. The first chapter in this section, chapter 5, introduces some relevant topics in digital speech processing. The classical speech production model is described and the theory of linear prediction is introduced. The use of linear prediction in speech coding is described in order to give a feeling for the traditional use of linear predictive analysis. Finally, speaker identification and speaker verification are discussed.

Chapter 6 describes the proposed speaker verification model in detail, including the theory behind the operation and training of the model, and chapter 7 describes how the suitability of the proposed model to the task of speaker verification was determined. Finally, in chapter 8, a number of improvements that were made to the basic model's structure are described.

As a result of the research described in this section, a paper describing the speaker verification model was presented at the 4th Australian International Conference on Speech Science and Technology, which was held in Brisbane, Australia, in December 1992. A copy of this paper [13] is included in the appendix of this thesis. Another paper, entitled 'Predictive Models for Speaker Verification' [22], was published in the European Journal of Speech Communication in January 1994.

Chapter 5

An Introduction to Digital Speech Processing

5.1 Introduction

It has long been argued that it is the ability to communicate effectively that raises man above all other living things. Although many animals can communicate to some extent, only man can communicate ideas and thoughts effectively. Thus, only man, with his unique ability to communicate effectively, has been able to dominate life on this planet. Of course, the nature of this unique communication ability is the dual phenomena of speech production and speech understanding. It is not surprising, then, that modern man has been fascinated by the processes of speech communication.

The physiological operation of speech production and hearing was investigated [7][8][9][10] in the early part of this century. Much of this work involved acoustic tube or analogue electrical models of the speech production process. Up to that time the main focus was the analysis of acoustic tube shapes and of the corresponding resonance frequencies, with the short term aim of artificially generating stationary vowel sounds.

The advent of digital computers in the 1960's and their consequent rapid development, in conjunction with the major advances in the theory of digital signal processing, resulted in major progress in the area of speech processing. In the first instance, digital signal processing techniques were used to simulate analogue systems such as Stevens' electrical vocal tract model [9]. However, as the area of DSP developed, new techniques which had no analogue equivalent were developed. By the early 1980s, computers were available that could carry out complex speech signal processing in real time. Further hardware advances in the areas of processing speed and memory mean that today almost all speech processing is carried out digitally.

Interest in the linear prediction model within the context of speech processing stems from the fact that it allows the parameters of the acoustic tube model to be estimated directly from the acoustical speech waveform. Atal [11] was the first to attempt to calculate the parameters of the acoustic tube model from the speech signal. He demonstrated the important experimental result that, if the speech is properly pre-emphasised, and if the boundary conditions of the acoustic tube are properly chosen, then very reasonable vocal tract shapes can be directly estimated using the autocorrelation method of linear prediction. Although the motivation for the linear prediction model was speech synthesis, the effectiveness of the LPC parametric representation of speech has meant that it is in the area of speech coding that linear prediction has been most effective. Other applications of linear prediction are formant extraction [12][13], speaker identification and speech recognition [14].



In general, the importance of linear prediction lies in the accuracy with which the basic model applies to speech. This fact is utilised in this research, as the linear prediction model is applied to the task of speaker verification in a new way. The remainder of this chapter introduces the basics of speech production and the theory of linear prediction, as well as some of the common applications of LPC analysis.

5.2 The physiology of speech production

In order to best apply digital signal processing analysis techniques to human speech, it is instructive to understand the basics of the human speech production process. Human speech is composed of a series of sounds which are generated as a result of acoustic excitation of the vocal tract. The source of this acoustic excitation is the passage of air out of the lungs. In general, speech sounds are classified as either voiced sounds or unvoiced sounds according to the type of acoustic excitation involved.

For voiced sounds the source of the excitation is at the glottis, and it consists of broad-band quasi-periodic puffs of air produced by the vibrating vocal cords [15]. The frequency of this excitation is determined by the mass and the tension of the vocal cords. Usually this frequency is between 40 Hz and 60 Hz. For unvoiced sounds like s and f, the source is a point of constriction at some point in the vocal tract. Forcing air through this constriction at a high velocity creates turbulence which produces noise. Finally, for unvoiced sounds like p in pop and t in top, the source of excitation is a complete closure of the vocal tract. The sound results from a rapid release of the air pressure built up behind the closure.

The vocal tract is a non-uniform acoustical tube [15] that extends from the glottis to the lips. Modes of resonance are set up in the vocal tract as a result of the acoustical excitation described above. The frequencies of these resonances, known as formants, depend on the shape of the vocal tract. Varying the positions of the tongue, lips, jaws and velum allows the formants associated with the characteristic speech sounds to be generated. Thus, speech sounds, which are predominantly generated in the throat, propagate down the non-uniform acoustical tube known as the vocal tract and are radiated at the lips or from the nostrils.

The inherent nature of speech production is complex, as it links a number of different acoustical phenomena. However, the process has been successfully modelled by separating the source of excitation and the vocal tract. In fact, almost all speech production models use a two-state excitation source that is independent of a vocal tract system that varies with time in discrete steps. Figure 5.1 shows the widely accepted speech production model. This model produces a sampled speech signal s(nT), where T is the sampling period of the system. The content of the speech signal is determined by the source of excitation and by the digital filter coefficients. In order to generate voiced speech, the excitation switch is moved to the impulse train generator. The period of the impulses corresponds to the pitch period of the human vocal cords.



Alternatively, to generate unvoiced speech, a random excitation is used. This corresponds to noise generated due to the high velocity turbulence that is generally associated with unvoiced sounds.

Figure 5.1 Digital model of speech production: a two-state excitation source (an impulse train generator driven by the pitch period for voiced speech, or a white noise generator for unvoiced speech), scaled by an amplitude control, drives a time-varying digital filter whose coefficients are the vocal tract parameters; the filter output is the speech samples.

The excitation signal is amplified and applied to a digital filter which models the human vocal tract. The digital filter coefficients are pre-calculated to give a frequency response similar to that of the vocal tract. As mentioned before, this system models the slowly time-varying nature of the human vocal tract in discrete steps by updating the filter coefficients every 10 ms. The digital waveform at the output corresponds to the final speech output, sampled at the appropriate rate. Clearly, the operation of the above model requires a knowledge of the appropriate parameters (pitch period, switch position, amplitude and filter coefficients) as a function of time. The estimation of these parameters is the goal of most speech analysis systems. Indeed, as we will see later, a good estimate of the time-varying nature of these parameters can be used in speech recognition, speaker identification and speaker verification systems.
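The operation of this model can be sketched in a few lines of code. The following C fragment is an illustration only (it is not part of the thesis software): it assumes a filter order of 10, an 8 kHz sampling rate and the 10 ms update interval mentioned above, and synthesises one frame of speech from a given set of model parameters.

    #include <stdlib.h>

    #define P     10           /* assumed all-pole filter order        */
    #define FRAME 80           /* 10 ms at 8 kHz                       */

    /* Synthesise one frame of speech from the model of Figure 5.1.
       a[1..P] are the vocal tract filter coefficients, G the gain,
       pitch the pitch period in samples (used only when voiced != 0),
       mem[] holds the previous P output samples. */
    void synth_frame(const double a[P + 1], double G, int voiced,
                     int pitch, double mem[P], double out[FRAME])
    {
        static int phase = 0;  /* position within the pitch period */

        for (int n = 0; n < FRAME; n++) {
            /* Two-state excitation source: impulse train or noise. */
            double u;
            if (voiced) {
                u = (phase == 0) ? 1.0 : 0.0;
                phase = (phase + 1) % pitch;
            } else {
                u = 2.0 * rand() / RAND_MAX - 1.0;
            }

            /* All-pole vocal tract filter:
               s(n) = sum_k a_k s(n-k) + G u(n). */
            double s = G * u;
            for (int k = 1; k <= P; k++)
                s += a[k] * mem[k - 1];

            /* Shift the filter memory (previous output samples). */
            for (int k = P - 1; k > 0; k--)
                mem[k] = mem[k - 1];
            mem[0] = s;

            out[n] = s;
        }
    }

Calling synth_frame once per 10 ms frame with updated coefficients reproduces the discrete-step time variation of the model.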

5.3 Speech Analysis

A simple model of the vocal tract is achieved by representing it as a discrete time-varying linear filter. In general, such a filter should contain both poles and zeros; however, it has been shown that the transfer function of the vocal tract for non-nasal voiced sounds contains no zeros [11]. Thus, such voiced sounds can be adequately represented by an all-pole digital filter. For unvoiced sounds, some zeros are present in the transfer function; however, since these zeros lie within the unit circle, the overall function can be approximated by an all-pole model of high enough order [16]. Figure 5.2 illustrates the all-pole speech production model.


Figure 5.2 All-pole digital model for speech production: the impulse train generator (with pitch period input) or white noise generator excites, through the gain G, an all-pole digital filter whose output is the speech samples s(n).

In this model the speech samples s(n) are related to the excitation function u(n) by the simple difference equation

$$s(n) = \sum_{k=1}^{p} a_k\, s(n-k) + G\, u(n)$$

This means that a sample s(n) can be approximately predicted as a linearly weighted summation of the previous samples, where the weights in the summation are the coefficients of the all-pole filter, i.e. the parameters of the vocal tract frequency response. If we estimate the coefficients of the all-pole filter using the prediction coefficients $\hat{a}_k$, then the error between the predicted sample $\hat{s}(n)$ and the actual sample is given by

$$e(n) = s(n) - \hat{s}(n) = s(n) - \sum_{k=1}^{p} \hat{a}_k\, s(n-k)$$

If the frequency response of the estimated all-pole filter exactly matches the frequency response of the vocal tract, then the error e(n) is equal to Gu(n). For voiced speech this error is simply a train of impulses. For unvoiced speech, the error Gu(n) will be a random sequence; however, it is interesting to note that for unvoiced speech the gain term G is much less than the G for voiced speech. Thus, for both voiced speech and unvoiced speech, the error sequence e(n) is minimised to Gu(n) if the filter frequency response matches the vocal tract frequency response. This is the principle behind linear predictive analysis of speech signals.

5.4 Linear Predictive analysis of speech signals

As described previously, the principle of linear predictive analysis is based on the assumption of a slowly time-varying all-pole model of the vocal tract.



Generally, the vocal tract is assumed to stay constant for about 10 ms. Thus, a speech signal that is to be analysed using linear predictive analysis is first segmented into 10 ms segments. The goal of the analysis is then to estimate the vocal tract response within each segment by obtaining predictor coefficients that minimise the average squared prediction error across the segment. This is achieved by minimising the total squared error, E,

$$E = \sum_{n=1}^{N} \big[e(n)\big]^2 = \sum_{n=1}^{N} \left[\, s(n) - \sum_{k=1}^{p} a_k\, s(n-k) \,\right]^2$$

over the segment n = 1 to N. Many algorithms have been developed that attempt this minimisation. Examples include the maximum likelihood method, the minimum variance method and Prony's method, although any method that minimises the above equation is valid.

To finish this introduction to linear predictive analysis, it is necessary to say something about the order of the prediction, i.e. the number of poles in the all-pole filter model for speech production. For voiced sounds, the poles generally occur in conjugate pairs. For a good quality representation, at least 3 formant frequencies must be considered. Additionally, each formant may have a significant harmonic frequency. Thus, 6 resonant frequencies may be present, so a 12-pole model is required to give a good representation. This value of p is generally used with an 8 kHz sampling frequency. At higher sampling frequencies, more than 3 formant frequencies will be present, so more than 12 linear predictor coefficients (LPC) are used [17].
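As an illustration of one such minimisation, the following C sketch estimates the predictor coefficients for a single analysis segment using the autocorrelation method and the Levinson-Durbin recursion. This is a standard textbook formulation, given here only as an example; it is not necessarily the exact procedure used elsewhere in this work.

    #define P 12   /* prediction order for 8 kHz speech, as in the text */

    /* Estimate LPC coefficients a[1..P] from one segment s[0..N-1] by
       the autocorrelation method, using the Levinson-Durbin recursion.
       Returns the final prediction error energy. */
    double lpc_autocorr(const double *s, int N, double a[P + 1])
    {
        double R[P + 1], tmp[P + 1];

        /* Autocorrelation of the (windowed) segment. */
        for (int i = 0; i <= P; i++) {
            R[i] = 0.0;
            for (int n = i; n < N; n++)
                R[i] += s[n] * s[n - i];
        }

        /* Levinson-Durbin recursion. */
        double E = R[0];
        a[0] = 1.0;
        for (int i = 1; i <= P; i++) {
            double k = R[i];
            for (int j = 1; j < i; j++)
                k -= a[j] * R[i - j];
            k /= E;

            for (int j = 1; j < i; j++)
                tmp[j] = a[j] - k * a[i - j];
            for (int j = 1; j < i; j++)
                a[j] = tmp[j];
            a[i] = k;

            E *= (1.0 - k * k);   /* error energy shrinks each step */
        }
        return E;
    }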

5.5 Speech coding

One of the most important applications of linear predictive analysis has been the area of low bit rate encoding of speech for transmission and storage. The need for low bit rate transmission of speech is becoming more acute as transmission bandwidth becomes more expensive. In fact, today, LPC coding is being used in the new GSM digital telephone network to minimise the bandwidth requirements. Speech storage is also becoming more common, with computer voice response systems and digital voice messaging systems becoming more available to the public. This type of application uses speech encoding to reduce the memory requirements for storing the speech.

In general, LPC speech encoding involves an encoder, a decoder and some type of channel over which the encoded speech must be transmitted. Of course, in speech storage applications, the channel is replaced by some type of memory device. A set of LPC parameters is calculated for every frame and transmitted to the decoder, which uses the parameters to reconstruct the speech segment within the frame. The LPC parameters that are generally used are the p LPC coefficients, the pitch period, the voiced/unvoiced state and the gain G. Thus, the typical total number of bits per 10 ms frame with p = 12, and where 10 bit quantisation is used for the LPC coefficients, can be calculated as follows:



10x12 bits for the 12 LPC coefficients
6 bits for the pitch period
6 bits for the gain parameter
1 bit for the voiced/unvoiced switch
TOTAL: 133 bits per frame

Comparing this to the 640 bits required to transmit the samples using the standard 8 bit A-law (80 samples at 8 bits = 640 bits) illustrates the effectiveness of LPC encoding.
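Expressed as bit rates (simple arithmetic on the figures above):

$$R_{\mathrm{LPC}} = \frac{133\ \mathrm{bits}}{10\ \mathrm{ms}} = 13.3\ \mathrm{kbit/s}, \qquad R_{A\text{-law}} = \frac{640\ \mathrm{bits}}{10\ \mathrm{ms}} = 64\ \mathrm{kbit/s}$$

i.e. a compression factor of roughly 4.8.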

5.6 Speaker identification and verification

We have seen that linear predictive analysis of speech is rooted in the physiological nature of speech production. In fact, synthesis experiments have shown that the linear prediction parameters retain a considerable degree of naturalness from the original speaker [18]. Additionally, the fact that linear predictive parameters are easily and efficiently obtained from speech samples makes LP based analysis particularly suited to tasks such as speaker identification and speaker verification. The object of speaker identification is to determine whether or not a speech sample from an unknown talker can be associated with a particular speaker from the user population. The object of speaker verification is to determine whether the speaker is who he claims to be. Due to the binary nature of speaker verification, it is inherently more manageable than speaker identification. However, it should be noted that the difference between speaker identification and speaker verification mainly affects the decision logic of the system.


"-----­

t I

t

• I

I

>---... I---;.----l82:JL.­

1-----------1 •

_

Input utterance samples

Figure 6.2 The LP self-segmenting model.

LP

A database of repetitions of the test utterance spoken by the true speaker is used to train the model. Essentially, the training goal is to generate a set of coefficients which represent the test utterance as spoken by the true speaker. This goal is achieved by minimising the prediction error across all the utterances in the training set. The first step when training the model is to associate a state transition table with each repetition in the training set. Each table has M entries, where the mth entry is the number of samples associated with the mth state of that repetition. Initially, all the entries in the state transition tables are set equal. For example, the initial segmentation for five repetitions of the utterance "six" is shown in Table 6.1.

         Number of samples per segment                             Total
Rep. 1   437   437   437   437   437   437   437   437            3496
Rep. 2   433   433   433   433   433   433   433   434            3465
Rep. 3   380   380   380   380   380   380   381   381            3042
Rep. 4   392   392   392   392   392   392   392   392            3140
Rep. 5   438   438   438   438   438   438   438   438            3504

Table 6.1 Initial transition table for five repetitions of "six"


In the initial epoch of training, the M LPs are trained to represent the M equal segments of each utterance. This initial training is followed by several epochs of dynamic programming and coefficient update. During each of these epochs:

• DP generates the optimal segmentation of each training utterance. Note that the optimal segmentation is the segmentation which gives the lowest prediction error across the whole utterance. The effect of this optimal segmentation is to match each segment of the speech signal to the LP which predicts it best.

• The M LPs are then each trained across the segment of the speech signal with which they are associated, i.e. the segment which they predicted best during the previous DP segmentation phase.

In this way, each LP is trained on a section of the speech signal during which the vocal tract stays constant, i.e. the segmentation homes in on the demi-phoneme speech sub-units. The results in Table 6.2 illustrate this phenomenon. This table shows the segmentation table of the five utterances from Table 6.1 after six epochs of DP and coefficient update. Note the similar segmentation of the utterances. This is remarkable, as each utterance was independently segmented using the DP technique. Clearly the LPs have been effectively trained to represent the sections of the utterance "six" where the vocal tract stays constant. Note also the large number of samples in the second and second last segments of each repetition. Listening to the speech associated with these segments revealed them to be the unvoiced "sss" sound at the start and the end of the utterance "six", pronounced "sss-ick-sss". The shorter segments in between make up the "ick" voiced sound. The first and last segments are associated with silence either side of the utterance.

This table also illustrates how this model handles the variable speaking rate inherent in human speech. Each of the repetitions in the table has a different number of samples. In order to apply these repetitions to most conventional speech processing systems, time warping would be necessary to standardise the number of samples in each repetition. In this model, however, the variable sample length is absorbed by the DP technique with no computational overhead.

         Number of samples per segment                             Total
Rep. 1    21  1288   417   556   113    83   982    36            3496
Rep. 2     5  1322   149   298   726     6   943    16            3465
Rep. 3     5   953   682    55   164   410   736    37            3042
Rep. 4    20  1156   134   302   602    37   887     2            3140
Rep. 5     2  1181   242   746     1   239  1086     7            3504

Table 6.2 Final transition table for five repetitions of "six"


6.3 Training the model

In the introduction the overall operation of the model was outlined. The following sub-sections outline the theory and implementation of the coefficient update and the DP, which are iteratively used to train the LPs to represent the demi-phoneme speech sub-units.

6.3.1 The steepest descent algorithm

The Steepest Descent Algorithm (SDA) [23] is used to train each of the M LPs to represent the speech signal in the associated segment. Training the LPs involves adjusting the LPCs so that the squared prediction error, e(n)^2, is minimised. The prediction error, e(n), is given by

$$e_n = s_n - \hat{s}_n = s_n - \sum_{k=1}^{p} s_{n-k}\, a_{k,m} \qquad (6.2)$$

where $a_{k,m}$ are the p LPCs of the mth predictor. Each of the p LP coefficients of the mth predictor is updated at each sample n within the mth segment of each utterance in the training set according to the SDA equation

$$a_{k,m} \leftarrow a_{k,m} - \mu\, \nabla e(n)^2 \qquad (6.3)$$

where k = 1 to p (the number of LP coefficients), $\mu$ is the learning rate, m = 1 to M, and

$$\nabla e(n)^2 = -2\, e(n)\, s(n-k) \qquad (6.4)$$

This reduces the local squared prediction error of the mth predictor across the mth segment of each utterance. Applying this algorithm to all M segments of the training utterances reduces the global prediction error of the M predictors for an utterance by the true speaker.
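A minimal C sketch of this update is given below. It is an illustration of equations (6.2)-(6.4) only; the order P and the learning rate MU are hypothetical values chosen for the example.

    #define P  12          /* number of LP coefficients (assumed)  */
    #define MU 1e-9        /* learning rate; a hypothetical value  */

    /* One steepest-descent pass of the mth predictor a[1..P] over its
       segment s[n0..n1-1] of a training utterance (equations 6.2-6.4).
       n0 is assumed to be at least P so that s[n-k] is always valid. */
    void sda_update(double a[P + 1], const double *s, int n0, int n1)
    {
        for (int n = n0; n < n1; n++) {
            /* Prediction error e(n) = s(n) - sum_k a_k s(n-k).  (6.2) */
            double e = s[n];
            for (int k = 1; k <= P; k++)
                e -= a[k] * s[n - k];

            /* The gradient of e(n)^2 w.r.t. a_k is -2 e(n) s(n-k) (6.4),
               so the descent step (6.3) adds 2*MU*e(n)*s(n-k). */
            for (int k = 1; k <= P; k++)
                a[k] += 2.0 * MU * e * s[n - k];
        }
    }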

" 6.3.2 Dynamic programming and backtracking Dynamic programming and backtracking [24][25] is used to determine the optimal segmentation of each utterance at the end of each training epoch, where the optimal segmentation of an utterance is achieved by minimising the accumulated prediction residual, 0, N

2

0= min L:(sn

- sn,m)

(6.5)

n=1

. where N is the total number of samples in the utterance, sn is the nth sample in the • utterance and sn m is the sample predicted by the mth predictor. Note that m must 53


be equal to 1 at the start of the utterance and must follow the state diagram through all of the M states as n goes from 1 to N (see Figure 6.1). Under these conditions, the minimisation can be achieved using the dynamic programming recursion formula

$$g(n,m) = d(n,m) + \min\big[\, g(n-1,m),\; g(n-1,m-1) \,\big] \qquad (6.6)$$

where g(n,m) is the global accumulated squared prediction error at sample n and d(n,m) is the prediction residual for predictor m,

$$d(n,m) = \big( s_n - \hat{s}_{n,m} \big)^2 \qquad (6.7)$$

This prediction residual is generated for each of the M LPs according to equation (6.7). At the end of the recursive application of equation (6.6), the global squared prediction error, D, is given by

$$D = g(N,M) \qquad (6.8)$$

At each sample, the state which gives the minimum contribution to the global error is recorded. Backtracking along the optimal trajectory then gives the optimal segmentation. Figure 6.3 illustrates a plane visualising the DP computation, where the horizontal axis represents the sample number and the vertical axis represents the model state.

Figure 6.3 The DP plane: the horizontal axis is the sample number n (0 to N) and the vertical axis is the model state m (1 to M).
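The recursion (6.6), together with the backtracking, can be sketched in C as follows. This is an illustrative implementation of the left-to-right state model described above, not the thesis software; the arrays g, from and seg, and the residual matrix d (holding the values of equation (6.7)), are assumed to be allocated by the caller.

    #include <float.h>

    /* DP over N samples and M states (equation 6.6) followed by
       backtracking; seg[n] receives the optimal state (1..M) for
       sample n.  d[n][m] is the squared prediction residual (6.7)
       of predictor m at sample n.  Returns D = g(N,M) (6.8). */
    double dp_segment(int N, int M, double **d, double **g,
                      int **from, int *seg)
    {
        for (int m = 0; m < M; m++)      /* must start in state 1 */
            g[0][m] = (m == 0) ? d[0][0] : DBL_MAX;

        for (int n = 1; n < N; n++) {
            for (int m = 0; m < M; m++) {
                double stay = g[n - 1][m];
                double step = (m > 0) ? g[n - 1][m - 1] : DBL_MAX;
                if (stay <= step) {
                    g[n][m] = d[n][m] + stay;  from[n][m] = m;
                } else {
                    g[n][m] = d[n][m] + step;  from[n][m] = m - 1;
                }
            }
        }

        /* Backtrack along the optimal trajectory from g(N,M). */
        int m = M - 1;
        for (int n = N - 1; n >= 0; n--) {
            seg[n] = m + 1;
            if (n > 0)
                m = from[n][m];
        }
        return g[N - 1][M - 1];
    }

Setting g(0,m) infinite for m > 1 enforces the constraint that the utterance starts in state 1, and the two candidates g(n-1,m) and g(n-1,m-1) mirror the left-to-right transitions of equation (6.6).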

6.4 Training algorithm summary

Training the model requires a reasonable number of repetitions of the password utterance. These training utterances are applied to the model in a random order and the following algorithm is followed.

1. Initialise all the LP coefficients to zero. Divide each training utterance into M equal segments. Semi-train each of the M LPs to represent the corresponding segment of each utterance.
2. For each utterance in the training set, use DP to determine the optimal segmentation.
3. Set m = 1.
4. Update the mth LP, using the Steepest Descent Algorithm, on the samples in the mth segment of every utterance in the training set.
5. Increment m.

6. Repeat steps 4 and 5 while m ≤ M.

... > 5 so that the admissibility requirements, listed in section 10.3, are met. Thus the prototype wavelet is defined. In order to generate the range of wavelets to be used in the transform, use equations (11.6) and (11.7). To operate at a sampling frequency of 8 kHz we will have

$$2N(a)\Big|_{a=1} = \frac{2\sqrt{5}}{T} = 35777 \text{ samples} \qquad (11.8)$$

in the prototype wavelet. This is certainly a lot of samples. It is not, however, necessary to implement the prototype wavelet. We dilate or contract this wavelet in order to generate the required set of wavelets. In this case the prototype wavelet has a very low frequency, so we contract it to generate the other wavelets. That, of course, is the theory. In practice, we simply calculate the dilation values and then calculate the required wavelets at the discrete sampling points. From equation (11.6) we can calculate that the dilation value required to generate the 300 Hz wavelet is a = 0.008333. From equation (11.5) we calculate the number of samples in the 300 Hz wavelet to be twice N(a), where N(a) = 149. Similarly, for the 3000 Hz wavelet, the dilation value is a = 0.0008333, and the number of samples is twice N(a), where N(a) = 14.9 (rounded up to 15). Thus, we have:

a_300 = 0.008333,    N(a_300) = 149
a_3000 = 0.0008333,  N(a_3000) = 15

Finally we calculate the dilation values for the frequencies between 300 Hz and 3000 Hz. Any frequency scale can be used, and as many points as required may be used. For each frequency, the dilation value and the N(a) value are used to generate both the real and the complex wavelet according to equation (11.1). The real wavelet, the complex wavelet and the N(a) value are stored for later convolution.
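The wavelet generation step can be sketched as follows. Since equation (11.1) itself is not reproduced here, the sketch assumes a Morlet-type Gaussian prototype g(t) = exp(-t^2/2) exp(j w0 t), truncated at |t| <= sqrt(5) and normalised by 1/sqrt(a); this choice is consistent with the sample counts N(a) quoted above, but the exact constants should be taken from equation (11.1).

    #include <math.h>

    #define TS (1.0 / 8000.0)   /* sampling period at 8 kHz */

    /* Generate one dilated wavelet, sampled at TS.  Assumed prototype
       (stand-in for equation (11.1)): g(t) = exp(-t*t/2)*exp(j*w0*t),
       truncated at |t| <= sqrt(5).  re[] and im[] must hold 2*N+1
       samples.  Returns N, the one-sided sample count N(a). */
    int make_wavelet(double a, double w0, double *re, double *im)
    {
        int N = (int)(a * sqrt(5.0) / TS + 0.5);  /* N(a) = a*sqrt(5)/T */

        for (int n = -N; n <= N; n++) {
            double t = n * TS / a;                /* dilated time axis */
            double env = exp(-0.5 * t * t) / sqrt(a);  /* 1/sqrt(a) norm */
            re[n + N] = env * cos(w0 * t);
            im[n + N] = env * sin(w0 * t);
        }
        return N;
    }

With a = 0.008333 this gives N = 149, and with a = 0.0008333 it gives N = 15, matching the sample counts quoted above.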

11.4 Fast Digital Wavelet Transform

Although the DWT is an effective tool for time-frequency analysis, it requires extensive computation. To be useful in any real-time speech processing application, a fast DWT is required. This section describes the logical steps towards such a fast DWT.

It is important to first understand that most of the DWT computational load is involved in the convolution of the signal with the wavelet basis functions. Further, since low frequencies are analysed with longer wavelets than higher frequencies, it is clear that much of the computational load is involved in the convolution of the low frequency wavelets. However, these low frequency wavelets can be adequately represented at a lower sampling rate and convolved with a decimated version of the signal without losing information. This reduces the number of convolution 'multiply-adds' by the decimation rate, saving considerable computational load. A structure which implements this method is shown in figure 11.1. At each level in this structure, the signal is low pass filtered and decimated. Thus the lower half of the spectrum is passed to the next level, where it is represented at a lower sampling frequency. At each level the upper half of the spectrum is analysed


using wavelets. Only the wavelets corresponding to frequencies in the upper half of the spectrum are applied, as the lower frequency wavelets can be more efficiently applied at lower sampling rates. The wavelets for each level can be pre-calculated for the different sampling frequencies. However, it is possible to use the same wavelets throughout; for example, a 3000 Hz prototype wavelet sampled at 8 kHz can be dilated to give a 1500 Hz wavelet, and then decimated to operate at 4 kHz. This, however, is equivalent to applying the 3000 Hz wavelet unchanged at 4 kHz. This is due to the fact that halving the wavelet frequency involves dilating the wavelet to twice its length. If the stretched (dilated) wavelet is decimated in order to apply it at 4 kHz, it is then equivalent to the original wavelet. A scaling factor is required and it is calculated as follows.

Figure 11.1 Fast DWT schematic in time and frequency domains. At each level the signal s(n) is low pass filtered and decimated, halving the band (4 kHz, 2 kHz, 1 kHz, 500 Hz, ...); the time domain representation is shown alongside the frequency domain representation.


Firstly, consider applying the 3000 Hz wavelet at the lower sampling frequency, where it will have an effective frequency of 1500 Hz. The convolution now involves half the number of 'multiply-adds', so the resulting sum should be multiplied by 2. Secondly, the scaling factor $\sqrt{a}$ in the WT equation must be considered. Noting that the dilation parameter, a, is inversely proportional to frequency, we see that a wavelet applied at half the sampling frequency needs to be normalised by dividing by $\sqrt{2}$. The net effect of these two adjustments is a $\sqrt{2}$ multiplication for each level of sub-sampling. This factor can be included in the decimation process, as shown in figure 11.1.

The 'C' program which implements the fast WT described above was developed during this research. It generates 16 wavelets linearly spaced in the range 2 kHz to 4 kHz. These wavelets are applied at each of the 4 levels, where $f_s$ = 8 kHz at the top level, 4 kHz at the next, 2 kHz at the next and 1 kHz at the lowest level. This generates a total of 64 frequency outputs, i.e. the resulting frequency scale has 16 analysis frequencies in each of the ranges 2-4 kHz, 1-2 kHz, 500 Hz to 1 kHz and 250 Hz to 500 Hz. While this frequency scale is suited to speech analysis, other scales are possible. If, for example, a linear scale was required throughout the spectrum, all the wavelets should be generated at 8 kHz and each then sampled at the appropriate sampling frequency. The basic steps involved in implementing the fast WT are given below.

1. Read in the samples s(n) and generate sub-sampled versions of s(n) by Low Pass Filtering (LPF) and decimating.
2. Generate a range of wavelets linearly spaced between 2 kHz and 4 kHz.
3. Apply these wavelets at each level of the sub-sampled signal. Store the wavelet coefficients.
4. Increase p. This corresponds to shifting in time.
5. Repeat steps 3 and 4 until the end of the signal.
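One level of the low-pass-filter-and-decimate structure (step 1 above) can be sketched as follows. This is an illustration only, not the thesis 'C' program: the 3-tap filter is merely a placeholder for a proper half-band anti-aliasing filter, while the sqrt(2) factor is the per-level scaling derived above.

    #include <math.h>

    /* One level of the fast DWT pyramid: low pass filter s[0..n-1]
       to half the band, decimate by two, and apply the sqrt(2) level
       scaling.  Returns the length of the decimated output. */
    int decimate_level(const double *s, int n, double *out)
    {
        int m = 0;
        for (int i = 0; i + 2 < n; i += 2) {
            /* Placeholder low pass filter (a real implementation
               would use a proper half-band anti-aliasing filter). */
            double lp = 0.25 * s[i] + 0.5 * s[i + 1] + 0.25 * s[i + 2];
            out[m++] = lp * sqrt(2.0);   /* net x sqrt(2) per level */
        }
        return m;
    }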

Typically, the sample shift between analyses is 16 samples at the top level, 8 at the next lowest level, etc. This ensures that the analysis is performed at the same time instant at each level of the sub-sampled signal. This allows the outputs to be displayed on a scalogram.

If the sample shift is the same at each level in the sub-sampled tree, then the high frequency wavelets are applied more often than the lower frequency wavelets. Thus the wavelet coefficients corresponding to high frequencies are generated at a higher rate than those corresponding to the lower frequencies. This is particularly efficient because the high frequency components of speech evolve more quickly than the low frequency components.


This type of multi-sampling rate system is particularly suited to real-time feature extraction because it eliminates the redundancy involved in analysing slowly changing low frequencies with high time resolution. This method is discussed further in the next section.

11.5 Application to Speech Processing

To date the principal use of the WT has been to generate scalograms for use in speech analysis and formant extraction applications. Indeed it has been found to be particularly useful as an automatic formant extraction tool for speech synthesis. The WT can also be used as a pre-processor in speech recognition and speaker verification applications. In these types of applications the WT output can be considered as a frequency vector and applied to the classification stage. The properties of 'constant Q' analysis and improved time-frequency resolution make the WT particularly attractive when compared to traditional speech pre-processing techniques.

For real time applications the use of the fast WT gives increased efficiency. As mentioned previously, the fast WT produces a multi-rate output when the sample shift between analysis points is the same at all levels of the sub-sampled signal. Then the high frequency wavelets are applied more frequently than the lower frequency wavelets. This is particularly suitable for speech processing, where the low frequency formants change more slowly than the high frequency transients associated with stops and plosives.

The fast WT coefficient output can also be considered as a frequency vector. The multi-rate output simply means that the vector updates more frequently in the high frequency direction. With this interpretation the output of the fast WT can be applied to speech classifiers in the usual way. Another option is to design a multi-rate classifier to operate specifically with the fast WT multi-rate coefficient output: for example, a multi-rate neural network system comprised of 4 neural networks, with each network operating at a different rate. Each would be trained and operated as a separate unit. Only in the classification stage would their outputs be combined to give an overall classification decision.

11.6 Conclusions

The first three chapters of this section, chapters 9, 10 and 11, have described the theory and implementation of the Gaussian Wavelet Transform. A fast version of the GWT was developed in this chapter, and possible applications of the GWT and the fast GWT were discussed. In the next, and last, chapter in this section, a number of scalograms and phasograms that illustrate the characteristic behaviour of the GWT are presented.


Chapter 12 Graphical Representation of the Wavelet Coefficients

12.1 Introduction

The notion of a time-frequency representation of speech was prevalent long before the advent of digital signal processing techniques for speech analysis. Indeed, the spectrograph was first developed in the 1930s, using analogue modulators and bandpass filters to implement a time-frequency Fourier analysis [17]. This crude instrument allowed speech researchers to visualise and understand the nature of speech. It also inspired interest in speech synthesis techniques based on the formants, which were easily extracted from the spectrographic display. Today, the advances in computer technology and in DSP techniques allow spectrographic displays to be generated on any PC. In fact, such displays are widely used in formant extraction, pitch detection and general speech analysis. As described previously, the Wavelet Transform can be represented on a scalogram and a phasogram, whose displays are very similar to that of the conventional spectrogram. This chapter presents a number of GWT time-frequency displays and compares them with traditional spectrographic displays, in order to illustrate the characteristic operation of the GWT.

12.2 Overview

The complex-valued wavelet coefficient S*(p,a) can be expressed in terms of modulus and phase in the usual way,

$$S^*(p,a) = |S^*(p,a)|\, e^{j\phi(p,a)}, \qquad 0 \le \phi(p,a) < 2\pi \qquad (12.1)$$

Usually the modulus and phase of S(p,a) are represented separately on sections of the half-plane (p,a). On this half-plane the time axis points to the right and the dilation axis points downward. This means that high frequencies are displayed above low frequencies and allows comparison with traditional spectrographic displays (see figure 10.1). The modulus |S(p,a)| is represented on a half-plane called a scalogram. Each |S(p,a)| is quantised and colour coded using eight colour bands according to

$$\mathrm{Colour}(p,a) = \mathrm{round}\!\left[\, 8\, \frac{|S(p,a)|}{|S(p,a)|_{\max}} \,\right] \qquad (12.2)$$

where Colour(p,a) is associated with one of the eight colours. This colour is then plotted at the point (p,a). Note that the colour associated with the lowest energy


band should be the background colour. It is possible to use more or fewer than eight colours by adjusting the quantisation. In the b/w examples shown in section 12.3, the eight energy bands are represented on a grey scale. The phase at each point (p,a) is represented on a similar half-plane called a phasogram. The phase at each point is quantised into one of eight equally spaced bands in the range 0 to 2π and plotted accordingly. In the b/w examples shown in section 12.3, these bands are represented on a grey scale. As the magnitude of S*(p,a) approaches 0, the phase becomes distorted, and at |S*(p,a)| = 0 it is incalculable. Further, since small magnitudes could be generated by noise, it is clear that phase information at these points will be misleading. Therefore, the phase is only calculated at points where the modulus is greater than a certain threshold value.
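As an illustration, the eight-band coding of equations (12.1) and (12.2) and the phase thresholding described above can be sketched in C as follows; the threshold value THRESH is a hypothetical choice.

    #include <math.h>

    #define NBANDS 8
    #define THRESH 0.01            /* hypothetical modulus threshold */
    #define TWO_PI 6.283185307179586

    /* Map one complex coefficient S*(p,a) = re + j*im to a scalogram
       band (0..7) and a phasogram band (0..7, or -1 for the background
       colour).  smax is the maximum modulus over the display. */
    void code_point(double re, double im, double smax,
                    int *mod_band, int *phase_band)
    {
        double mod = sqrt(re * re + im * im);

        /* Modulus quantised into eight bands (cf. equation (12.2)). */
        int b = (int)(NBANDS * mod / smax);
        *mod_band = (b >= NBANDS) ? NBANDS - 1 : b;

        /* Phase in [0, 2*pi), quantised into eight equal bands, but
           only where the modulus exceeds the noise threshold. */
        if (mod > THRESH * smax) {
            double phi = atan2(im, re);
            if (phi < 0.0)
                phi += TWO_PI;
            b = (int)(NBANDS * phi / TWO_PI);
            *phase_band = (b >= NBANDS) ? NBANDS - 1 : b;
        } else {
            *phase_band = -1;      /* plotted in the background colour */
        }
    }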

12.3 Discussion of selected examples

This section contains a selection of screen dumps that demonstrate the properties of the DWT and STFT. Each screen consists of two windows which each contain scalogram, phasogram or spectrogram displays, depending on the comparison being made. The type of output is indicated in the text below each screen dump. Note that a spectrogram is the STFT equivalent of a scalogram. In each case the time waveform used is plotted below the output windows. The frequency scale used in all these examples is logarithmic and in the range 200 Hz to 3500 Hz. In all cases the wavelet function used is Gaussian and, unless stated otherwise, the STFT window is Gaussian and of length 300 samples. An increasing grey scale (shown on top in each case) is used to indicate signal energy at a point for scalogram/spectrogram outputs. The grey scale used to represent phase in the phasogram is as described in the previous section. Note that the background colour is the lowest energy and lowest phase band in each case.


[Screen dump: frequency axis 199 Hz to 3500 Hz, dilation axis a = 0.0008 to 0.0131.]

The top window is the WT scalogram output, the middle window is the WT phasogram output, and the signal is plotted in phase with the outputs at the bottom. The input signal is a 300 Hz sinusoid. The analysing wavelet is Gaussian. The features of interest are the line of constant modulus on the scalogram corresponding to 300 Hz and the phase staying in step with the phase of the signal below. Note that the phase has been plotted only for points whose modulus is above a noise threshold (see section 12.2).


[Screen dump: frequency axis 199 Hz to 3500 Hz, dilation axis a = 0.0008 to 0.0063.]

The top window is again the WT scalogram output, but the middle window is the STFT spectrogram. The signal, plotted at the bottom, is an impulse function. This example illustrates the difference in time resolution between the WT, which uses long windows at low frequencies and short windows at high frequencies, and the STFT, which uses the same window at all frequencies. The impulse function is a good example of the WT's effective localisation of high frequencies.


[Screen dump: frequency axis 199 Hz to 3500 Hz, dilation axis a = 0.0008 to 0.0131.]

The signal is again the impulse function. This time the top window is the WT scalogram and the middle window the WT phasogram. Note the convergence of the lines of constant phase toward the discontinuity. This property of WT phase can be used to accurately locate discontinuities in more complex time signals.


[Screen dump: frequency axis 411 Hz to 1740 Hz, dilation axis a = 0.0011 to 0.0096.]

The signal here is the /t/ at the end of the word "eight" from the previous example, i.e. we have zoomed in on the stop sound /t/ in the word "eight". The top window is again the WT scalogram and the middle window the STFT spectrogram. This example clearly illustrates the effective time localisation of the WT compared to that of the STFT (see the next example for more on this).


[Screen dump: frequency axis 199 Hz to 1740 Hz, dilation axis a = 0.0011 to 0.0096.]

The signal here is the word "five" sampled at 8 kHz. The top window is the WT scalogram and the middle window the WT phasogram. The phase has been plotted only for points whose modulus is above a noise threshold. Note the constant change in phase along the formant frequencies.



Fig. 4. LP self-segmenting model.


E. Ambikairajah et al. / Predictive models for speaker verification

residual across all the utterances in the training set. This optimisation can be achieved by an iterative procedure combining DP and LP coefficient update. Each LP set is updated, according to a steepest descent algorithm, only across the segment of the utterance with which it is associated. This reduces the local prediction error for this predictor across this segment and, hence, the global prediction error. For the first iteration of the training process the utterance is divided into M equal segments and they are updated using the above equation until the predictors are semi-trained. It is desirable to have the predictors semi-trained before the first segmentation of the utterances using DP. Insufficient training results in erratic segmentation, with some segments spanning more than one phoneme. Subsequent LP updates cannot then train to this "too large" segment. Conversely, overtraining does not allow segments to converge to speech sub-units, as they are overtrained on the initial segmentation. Experiments also showed that setting LP coefficients to 0 rather than random values between -1 and +1 yielded a better initial segmentation. This was because, in a random initialisation, certain LP coefficient sets predicted speech better than their neighbours. If this is not overcome in initial training the better LP set will dominate the segmentation and render training ineffective. After sufficient repetitions of DP and LP update the total prediction residual will converge to a minimum. At this point both the LP coefficients and the segmentations of the training utterances are stored for use in verification. Verification involves DP of the test utterance with the LP


coefficients of the claimed true speaker. Backtracking gives the optimal segmentation. The normalised mean squared prediction residual is calculated by dividing the accumulated squared prediction residual by the accumulated squared sample values over the entire utterance. The normalised mean squared prediction error is then compared to a threshold to make the verification decision. The segmentation table resulting from the backtracking is considered to reflect the temporal composition of the utterance. It was noted that segmentation patterns for true utterances were similar, reflecting the fact that the consecutive predictors in the M state verification model accurately predicted the utterance sub-units for which they were trained. However, in the case where the test utterance was that of an impostor, the segmentation pattern is erratic, usually with one segment dominating the utterance (Figures 5(a) and 5(b)). This property is also used in the verification process. A three-layer Multi-Layer Perceptron (MLP) classifier was trained with the true speaker and impostor segmentation patterns and was used as a further test in the speaker verification decision.
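As a concrete illustration of this decision rule (a sketch, not code from the paper), the normalised residual test can be written as:

    /* Verification statistic as described above: accumulated squared
       prediction residual normalised by accumulated squared sample
       values, compared with a preset threshold. */
    int verify(const double *s, const double *pred, int N,
               double threshold)
    {
        double resid = 0.0, energy = 0.0;
        for (int n = 0; n < N; n++) {
            double e = s[n] - pred[n];
            resid  += e * e;
            energy += s[n] * s[n];
        }
        return (resid / energy) < threshold; /* 1 = accept as true speaker */
    }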

4. The Neural Prediction Model

The Neural Prediction Model (NPM) consists of an array of MLP predictors. The model is constructed as a state transition network (Figure 3), where each state has a particular MLP associated with it. The NPM was first proposed by Iso


Fig. 5. Typical segmentations of (a) two true speaker utterances, (b) two impostor utterances.


and Watanabe (1990) for continuous speech recognition. Its application to the task of speaker verification is described here. The MLP predictors have one hidden layer and an output layer consisting of eight nodes. The hidden layer nodes have a sigmoidal transfer function, while the output nodes are linear. The frame feature vectors used are the 8 Mel frequency cepstral coefficients (C1, ..., C8). The MLP outputs a predicted frame feature vector based on the preceding frame feature vectors. The difference between the predicted feature vector and the actual feature vector is defined as the prediction residual. This residual can be regarded as an error function for the MLP training, based on error back-propagation techniques. Figure 6 shows an NPM (Ambikairajah and Kelly, 1992) with N predictors. In this case each predictor predicts a frame given the two previous frames. The training goal of this type of system is to find a set of MLP predictor weights which minimises the accumulated prediction residual for a training data set; for a speaker verification system the training data set is a collection of password utterances from a certain speaker. The training is achieved by an iterative procedure, combining Dynamic Programming (DP) and back-propagation techniques. The algorithm is as follows:
- initialise all MLP predictor weights to random values between zero and one,


- set initial training utterance segmentations by following the DP procedure,
- repeat the following steps for all training utterances until convergence,
- correct the MLP weights by back-propagation along the optimal trajectory,
- compute the accumulated prediction residual using DP and determine the optimal trajectory using its back-tracking.

Verification requires the application of the test utterance to the NPM associated with the claimed identity. DP followed by back-tracking determines the optimal segmentation. The accumulated prediction residual divided by the sum of the squares of each feature component in the utterance is used as a verification test statistic. Comparison with a preset threshold for the true speaker yields the verification decision.

5. The hidden control neural network model

The Hidden Control Neural Network (HCNN) speaker verification model is a single MLP predictor. Like the NPM, this model is constructed as a state transition network. The model was first proposed by Levin (1990) for connected digit recognition. The use of this model for speaker verification is described here (Ambikairajah and Kelly, 1992).


Fig. 6. Neural Prediction Model.


This model also uses the 8 Mel frequency cepstral coefficients