PARALLEL SIGNAL PROCESSING AT ABERDEEN A R ALLEN, D WANG and M A PLAYER Dept of Engineering, University of Aberdeen, Aberdeen, AB9 2UE, UK. Email: [email protected]

Abstract. A number of deconvolution techniques are being used in an ultrasonic signal processing application. A variety of parallel algorithms have been developed and their performance investigated on architectures using transputers and digital signal processors. A dual symbolic/numeric signal processing system is under development.

1 INTRODUCTION Parallel computing is inherently scalable and flexible, and can naturally express many signal and image processing applications. However, many researchers have realised that, in order to solve their particular problems, careful design of the system architecture must be accompanied by a rethinking of computational methods, to give a good match between algorithm and hardware. Additionally, in some cases it has been appropriate to take advantage of specialised hardware while remaining within a parallel processing environment. The University of Aberdeen has been involved in many of these developments, and in this paper we review some of the current work in the Department of Engineering.

2 PARALLEL DSP SYSTEM We have found that while some applications are suited to a multi-transputer system, others can also benefit from fast signal processing hardware. So, to support our parallel DSP research, we designed and built the first system to incorporate a floating-point DSP chip into a transputer array (Allen and Wang [1]). We used the DSP32C, a 32-bit floating-point digital signal processor capable of executing a 1024-point complex FFT within 4 milliseconds. In our dual-processor system of T800 and DSP, dual-port memory is used to exchange data between the two processors. When the T800 requires DSP service, it writes the data to be processed into the dual-port memory, and the DSP32C is interrupted. As soon as the DSP32C receives this signal, it executes FFT processing or other algorithms according to flags set up by the transputer. On completion, the DSP32C sets a flag to indicate that the DSP processing has completed. The work reported here used this dual system, together with another four T800s. Most of our software is written in occam. We are currently designing a fast data-capture front-end (100 MSa/s) for our transputer system to allow on-line experimental processing.
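The flag-based handshake between the two processors can be sketched as follows. This is a Python illustration of the protocol only (the actual software is written in occam); the events `dsp_request` and `dsp_done` stand in for the interrupt and the completion flag, and the shared dictionary stands in for the dual-port memory:

```python
import threading
import numpy as np

# Shared ("dual-port") memory between the T800 and the DSP32C.
dual_memory = {"data": None, "command": None, "result": None}
dsp_request = threading.Event()   # models the interrupt to the DSP32C
dsp_done = threading.Event()      # models the DSP's completion flag

def dsp32c():
    """Wait for an interrupt, dispatch on the command flag, signal completion."""
    dsp_request.wait()
    if dual_memory["command"] == "FFT":
        dual_memory["result"] = np.fft.fft(dual_memory["data"])
    dsp_done.set()

def t800():
    """Write data and a command flag, interrupt the DSP, wait for the result."""
    dual_memory["data"] = np.ones(8)
    dual_memory["command"] = "FFT"
    dsp_request.set()             # "interrupt" the DSP32C
    dsp_done.wait()               # wait for the completion flag
    return dual_memory["result"]

dsp = threading.Thread(target=dsp32c)
dsp.start()
result = t800()
dsp.join()
```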

3 EXAMPLE DSP APPLICATION One of the main application areas driving this research at Aberdeen has been the measurement of surface roughness (Ra < 100 µm) using ultrasound pulse-echo signals. If an ultrasonic pulse is reflected from a surface, the reflection will in principle contain all the surface information. However, the noise present on the captured echoes is significant, so the performance of a deconvolution algorithm depends on its ability to reconstruct images from noisy data. A simple but quite realistic relationship between the data obtained from the ultrasonic receiver and the amplitude distribution of surface heights is:

d = h ∗ f + n

(1)

d: output voltage from the receiver; h: blurring function (system impulse response, for a plane reflector); f: image (height amplitude distribution of the surface); ∗: (circular) convolution; n: noise. A reflection from a test surface is acquired by a digital storage oscilloscope with 150 MHz bandwidth and 8-bit vertical resolution. The sample set consists of N = 1024 samples. The image (i.e. the amplitude distribution of the surface) may be obtained by deconvolution of equation (1). In order to achieve the desired resolution with relatively low-frequency ultrasound (10 MHz), substantial signal processing is required. In particular, a nonlinear technique, the maximum entropy method (MEM), has yielded reliable measurements due to its good noise suppression and frequency extension properties [2]. Such algorithms are computationally intensive, hence the need to research parallel methods for digital signal processing, especially those associated with nonlinear techniques.
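The model of equation (1) can be illustrated numerically. This is a NumPy sketch with a hypothetical pulse shape and surface (not measured data); the circular convolution is computed via the FFT:

```python
import numpy as np

rng = np.random.default_rng(0)
N = 1024

# Hypothetical blurring function h: a short damped oscillatory pulse.
t = np.arange(N)
h = np.exp(-t / 20.0) * np.sin(2 * np.pi * t / 16.0)

# Hypothetical image f: a few isolated surface-height features.
f = np.zeros(N)
f[[100, 300, 310, 700]] = [1.0, 0.5, 0.8, 0.3]

# d = h (*) f + n, with (*) the circular convolution of equation (1).
d_clean = np.real(np.fft.ifft(np.fft.fft(h) * np.fft.fft(f)))
n = 0.01 * rng.standard_normal(N)   # additive noise
d = d_clean + n
```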

4 PARALLEL ALGORITHMS We have developed a variety of algorithms for the dual T8/DSP system and for multi-transputer systems. We give some examples in this section.

4.1 FFT and Convolution The Fourier transform is fundamental to many DSP algorithms, and we have developed parallel versions of the FFT. The performance of a parallel FFT depends strongly on the choice of data distribution and communication pattern. To avoid redundant communication, only those processors that hold the relevant data should engage in communication, and the number of communications between processors should be reduced to a minimum by using large block data transmissions. Typical timings (double-precision arithmetic) for a 1024-point complex FFT are 117 ms on one T800 and 45 ms on four. Fast convolution can also be implemented in parallel. To calculate y = x ∗ h, the results of two parallel FFTs, X and H, are multiplied together, Y = XH; the multiplication is partitioned across a number of processors. A parallel inverse FFT is then used to obtain y. A single T800 executes this in 245 ms, while four take 97 ms.
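The fast-convolution scheme can be sketched serially as follows. This is a NumPy illustration; on the transputer array the two forward FFTs run concurrently and the pointwise product is partitioned across processors:

```python
import numpy as np

def fast_convolve(x, h):
    """Circular convolution y = x (*) h computed via the FFT.

    Steps: two forward FFTs (run in parallel on the real system),
    a pointwise product (partitioned across processors), and one
    inverse FFT.
    """
    X = np.fft.fft(x)
    H = np.fft.fft(h)
    Y = X * H                       # pointwise product, Y = XH
    return np.real(np.fft.ifft(Y))  # parallel inverse FFT on the real system
```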

4.2 The Wiener-Hopf Filter The Wiener-Hopf filter is a direct method of simple deconvolution. The estimated image F̂ is calculated using the linear filter

F̂ = DH* / (|H|² + a)

(2)

With a suitable choice of the regularization parameter a (usually taken as equal or close to the noise-to-signal ratio), and for white noise and broadband signals, the Wiener-Hopf filter can be shown to give a good estimate of the image. The Fourier transforms D, H of the data and blurring function can be calculated concurrently. Typical performance is: 1080 ms (1 T800), 790 ms (2 T800), 550 ms (T800 + DSP).
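A minimal serial sketch of the filter of equation (2), assuming NumPy and the circular-convolution model of equation (1); on the real system the forward transforms D and H would be computed concurrently:

```python
import numpy as np

def wiener_hopf(d, h, a):
    """Wiener-Hopf estimate: F = D conj(H) / (|H|^2 + a), equation (2).

    d: measured data; h: blurring function; a: regularization
    parameter, usually close to the noise-to-signal ratio.
    """
    D = np.fft.fft(d)
    H = np.fft.fft(h)
    F = D * np.conj(H) / (np.abs(H) ** 2 + a)
    return np.real(np.fft.ifft(F))
```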

4.3 The Maximum Entropy Method If a deconvolved image consists of N discrete points, each being positive in value, then the entropy of the image is defined to be:

S = − Σᵢ₌₁ᴺ pᵢ ln pᵢ

(3)

pᵢ = fᵢ / Σⱼ fⱼ ; fᵢ: the value of the image at point i.

The basic technique involves constructing the image which has the highest entropy subject to the constraints of the data. Because maximizing S is a nonlinear optimization problem, the algorithm is iterative, the sequence starting with a ‘default’ or ‘background’ image. The χ² test is used as a means of establishing the closeness of fit of the computed image: a comparison between the data and the current ‘estimate’ of the data (blurring function convolved with the computed image) is calculated. The information from this test is used with the image’s entropy to assist in the construction of new estimates of the image. A complicated combination of steepest ascent and conjugate gradient techniques is used to determine the most appropriate alterations to the current image. Convergence on the maximum entropy image is reached when both χ² has reached N and the angle between the search vectors based on χ² and entropy is small. Only when these conditions are satisfied is the algorithm said to have converged on the true maximum entropy image.
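The two quantities driving the iteration, the entropy of equation (3) and the χ² misfit, can be sketched as follows. This is a NumPy illustration; `sigma` is an assumed noise standard deviation, not a parameter named in the text:

```python
import numpy as np

def entropy(f):
    """Image entropy S = -sum_i p_i ln p_i, p_i = f_i / sum(f) (equation 3)."""
    p = f / np.sum(f)
    p = p[p > 0]                  # treat 0 ln 0 as 0
    return -np.sum(p * np.log(p))

def chi_squared(d, h, f, sigma):
    """Chi-squared misfit between the data d and the current estimate,
    i.e. the blurring function circularly convolved with the image f.
    Convergence requires chi^2 to approach N, the number of samples."""
    estimate = np.real(np.fft.ifft(np.fft.fft(h) * np.fft.fft(f)))
    return np.sum(((d - estimate) / sigma) ** 2)
```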

Processors   MEM (s)   Speedup
T800         3.64      1.00
T800+DSP     1.46      2.50
4T800        1.34      2.72

Table 1: Comparison of the performance of parallel MEM.

Processors    MEM (s)   Speedup
T800          13.69     1.0
4T800         3.45      3.9
4T800 + DSP   2.55      5.4

Table 2: Performance with multiple samples.

We use a combination of algorithmic and data-parallel approaches in our parallel MEM development [3]. (a) A MEM process involves at least 8 convolutions per iteration: we therefore make extensive use of parallel FFTs and convolutions. (b) Besides having global knowledge (the whole dataset), each processor is given responsibility for a subset of data points for certain calculations. For example, the calculation of χ², and the derivatives of entropy and χ², can be partitioned across the available processors. (c) The calculation of the search directions can similarly be parallelised. The experimental results of using parallel MEM are listed in Table 1. Here, MEM (s) is the execution time of one iteration. These speedups are significant, because the algorithm typically requires 20–30 iterations to converge. Another approach to utilising parallelism relies on the fact that many practical applications require multiple, rather than single, images or signals to be processed. Thus samples taken from different transducer positions may be processed in parallel in order to obtain a complete image of the surface. In our experiment, four samples were processed on four transputers concurrently. This approach yielded nearly linear speedups (Table 2). In the case of 4T800 + DSP, the DSP serviced FFT requests from all four transputers.
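The multiple-sample approach can be sketched with one worker per sample. This is a Python thread-pool illustration standing in for the four transputers, and a Wiener-Hopf filter stands in for the per-sample deconvolution step (the real system runs MEM in occam):

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def deconvolve_sample(args):
    """Deconvolve one captured echo (Wiener-Hopf used as a stand-in)."""
    d, h, a = args
    D, H = np.fft.fft(d), np.fft.fft(h)
    return np.real(np.fft.ifft(D * np.conj(H) / (np.abs(H) ** 2 + a)))

rng = np.random.default_rng(1)
h = rng.standard_normal(64)                        # shared blurring function
samples = [rng.standard_normal(64) for _ in range(4)]  # four transducer positions

# One worker per sample, mirroring one transputer per sample.
with ThreadPoolExecutor(max_workers=4) as pool:
    images = list(pool.map(deconvolve_sample, [(d, h, 0.1) for d in samples]))
```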

4.4 Projection Onto Convex Sets The POCS restoration method (eg. as discussed by Youla and Webb [4]) has a significant advantage over some other algorithms in that it enables a large number of a priori known constraints to be incorporated in the algorithm through the mechanism of projection onto a convex set. The basic idea of POCS is that every a priori property of the unknown image f is formulated as a constraint that restricts f to lie in a closed convex set Cᵢ. The constraints depend on the properties of the signal and noise. The properties of the signal which may be used include pre-known limits on the Fourier spectrum, the amplitude power spectrum, or a non-negative amplitude. Each of the constraints may be imposed by projecting an arbitrary image onto the corresponding set, and this projection mechanism forms the restoration model in the present application. Since every a priori property of the unknown f is formulated as a constraint that restricts f to lie in a closed convex set Cᵢ, for m properties there are m sets Cᵢ (i = 1, 2, …, m) and f ∈ ⋂ᵢ₌₁ᵐ Cᵢ = C₀. Thus the problem is to find a point of C₀ given the sets Cᵢ and projection operators Pᵢ (i = 1, 2, …, m) projecting onto the various Cᵢ. The restoration algorithm has the form

fₚ₊₁ = Pₘ Pₘ₋₁ … P₁ fₚ

(4)

with f₀ arbitrary. Parallelism can be exploited in a number of ways. For linear projection operators, the data can be partitioned across several processors and the results collected. Multiple datasets can, of course, also be processed in parallel. It can also be beneficial to try different combinations of projections to see which is most appropriate for a given signal type: these solutions can be explored concurrently. Typically, on a single T800, one POCS iteration takes about 350 ms, with convergence in about 10 iterations. Using p processors, p combinations of projections can be calculated in the same time.
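An iteration of equation (4) can be sketched with two illustrative projections, a non-negative-amplitude constraint and a hypothetical Fourier band limit. This is a NumPy sketch; the projections actually used depend on the a priori knowledge available for the signal:

```python
import numpy as np

def project_bandlimit(f, keep):
    """P1: zero spectral components outside a known low-frequency band
    (keep `keep` bins at each end of the spectrum; hypothetical limit)."""
    F = np.fft.fft(f)
    F[keep:F.size - keep] = 0.0
    return np.real(np.fft.ifft(F))

def project_nonnegative(f):
    """P2: constrain the image to non-negative amplitudes."""
    return np.maximum(f, 0.0)

# f_{p+1} = P2 P1 f_p, starting from an arbitrary f0 (equation 4 with m = 2).
rng = np.random.default_rng(2)
f = rng.standard_normal(64)
for _ in range(10):
    f = project_nonnegative(project_bandlimit(f, keep=8))
```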

5 OTHER CURRENT AND FUTURE WORK Parallel hardware is now capable of providing enough power to incorporate artificial intelligence based systems into DSP instrumentation. We are now starting to investigate the coupled systems paradigm. In this, a signal processing task is regarded as being composed of two parts: symbolic and numeric. The former part can, for example, be implemented as a rule-based system. This can perform such tasks as: modelling the physical system using various simplifying assumptions; selecting the most appropriate DSP algorithm dependent on system knowledge and/or previous results from the numeric system; making decisions about parameters, number of iterations, and so on. The symbolic system sends control information to the numeric system (which is optimised for fast DSP) and receives results in return. The coupled system will incorporate much of the expertise of the human investigator, and should lead to improvements in the quality of signal processing along with possible savings in the computational resources required by the numeric processing. Such a system maps well onto the type of hardware we are developing: a multiprocessor system, with certain processor nodes having enhanced DSP capability. We aim to exploit parallelism at a number of levels (eg. data, algorithmic, and decision-making). We shall also be exploring the use of virtual channels in the programming of second-generation transputer technology (16 T9000s). This will also provide significant support for a range of other parallel processing research and applications in the Dept of Engineering. This work has been supported in part by the UK SERC, which we gratefully acknowledge.

References
[1] A R Allen and D Wang, An application of ultrasonic signal processing in a mixed system, in: Transputer Applications (IOS Press, Amsterdam, 1990) 219–222.
[2] P F Smith, M A Player and D A L Collie, The performance of the maximum entropy method: deconvolution and the frequency content of data, J Phys D: Appl Phys 22 (1989) 906–914.
[3] D Wang, A R Allen and M A Player, Parallel implementations of the maximum entropy method for signal processing (Dept of Engineering, University of Aberdeen, 1993).
[4] D C Youla and H Webb, Image restoration by the method of convex projections: Part 1 - theory, IEEE Trans Medical Imaging MI-1 (2) (1982) 81–94.
