Copyright © IFAC Algorithms and Architectures for Real-Time Control, Cancun, Mexico, 1998
HIGH-PERFORMANCE REAL-TIME IMPLEMENTATION OF A SPECTRAL ESTIMATOR
M. M. Madeira*, L. A. Aguilar Beltran+, J. Solano Gonzalez+, F. Garcia Nocetti+, M. O. Tokhi•, M. G. Ruano*
* UCEHISEC, Universidade do Algarve, Campus de Gambelas, 8000 Faro, Portugal
+ Depto. de Ingenieria de Sistemas Computacionales y Automatizacion, IIMAS, UNAM, Apdo. Postal 20-726, Del. A. Obregon, Mexico D.F., 01000, Mexico
• Depart. of Automatic Control & Systems Eng., University of Sheffield, Mappin Street, Sheffield, S1 3JD, United Kingdom
Abstract: Doppler blood flow spectral estimation is a common technique for non-invasive cardiovascular disease detection. Blood flow velocity and disturbance may be evaluated by measuring spectral mean frequency and bandwidth, respectively. For the diagnosis of minor stenosis, parametric spectral estimators may be employed. These models offer better spectral resolution than FFT-based ones, at the expense of a higher computational burden. High-performance techniques are therefore being investigated in search of an efficient real-time implementation of a blood flow spectral estimation system. This paper compares the implementation of the Modified Covariance (MC) spectral estimator on two different DSP architectures: the TMS320C40 and the ADSP2106x (SHARC). The implementations are described and their performance assessed. Considerations about portability of algorithms, compiler optimisation levels and system-dependent features are addressed. Copyright © 1998 IFAC
Keywords: Spectral estimation, performance evaluation, DSP architectures.
1. INTRODUCTION
Doppler blood flow spectral estimation is a common technique for non-invasive cardiovascular disease detection. The evaluation of blood flow velocity and disturbance, through accurate estimation of spectral mean frequency and bandwidth respectively, enables early diagnosis of cardiovascular diseases (Ruano et al., 1992). Some parametric spectral estimators have been evaluated as possible blood flow estimators (Ruano et al., 1993), revealing better spectral performance than the traditional estimators but at the expense of longer processing times.

The autoregressive Modified Covariance (MC) spectral estimator with a model order of 4 was selected for its accuracy, in spite of its algorithmic complexity (Ruano and Fish, 1993).

This led to the investigation of alternative techniques for the real-time implementation of parametric spectral estimators, in particular parallel processing architectures.

Previous work (Madeira et al., 1997) investigated the feasibility of a real-time implementation of the MC spectral estimator on several homogeneous and heterogeneous parallel processing architectures using up to three transputers (T8), up to three TMS320C40 (C40) processors and an i860. The results indicated that a real-time implementation was achievable with the C40s. It was concluded that the use of T8s as routers was degrading the overall performance of the architecture.

The aim of a low-cost PC-hosted system, together with the fact that the results obtained with the C40 implementation did not allow real-time estimation of the signal over the complete range of data segment lengths to be considered, led to further investigation.
Separate studies were conducted to evaluate the performance of, firstly, some of the fastest digital signal processors currently available and, secondly, custom parallel processing using a heterogeneous architecture comprising field programmable gate arrays and a C40 (Madeira et al., 1998). Two digital signal processors, the C40 and the ADSP21062 (SHARC), were selected based upon their availability and cost. Special attention was given to the add-on characteristics of the modules enclosing the processors, ensuring the compatibility of a final solution.

2. MODIFIED COVARIANCE SPECTRAL ESTIMATION

The general description of the autoregressive MC spectral estimation (Kay, 1988) involves solving the following linear system of equations,

$$
\begin{bmatrix}
c_{xx}[1,1] & c_{xx}[1,2] & \cdots & c_{xx}[1,p] \\
c_{xx}[2,1] & c_{xx}[2,2] & \cdots & c_{xx}[2,p] \\
\vdots & \vdots & & \vdots \\
c_{xx}[p,1] & c_{xx}[p,2] & \cdots & c_{xx}[p,p]
\end{bmatrix}
\begin{bmatrix} a[1] \\ a[2] \\ \vdots \\ a[p] \end{bmatrix}
= -
\begin{bmatrix} c_{xx}[1,0] \\ c_{xx}[2,0] \\ \vdots \\ c_{xx}[p,0] \end{bmatrix}
\tag{1}
$$

where each element $c_{xx}[i,j]$ of the covariance matrix and right-hand-side vector is obtained by

$$
c_{xx}[i,j] = \frac{1}{2(N-p)} \left( \sum_{n=p}^{N-1} x^{*}[n-i]\, x[n-j] + \sum_{n=0}^{N-1-p} x[n+i]\, x^{*}[n+j] \right)
\tag{2}
$$

The $a[k]$ ($k = 1, \ldots, 4$) estimates, solution of the above linear system, are obtained using the Cholesky algorithm. The signal white noise variance estimate is calculated as

$$
\sigma^{2} = c_{xx}[0,0] + \sum_{k=1}^{p} a[k]\, c_{xx}[0,k]
\tag{3}
$$

and the power spectral density, $P_{MC}(f_n)$, is obtained through the autoregressive spectrum expression (Kay, 1988),

$$
P_{MC}(f_n) = \frac{T\,\sigma^{2}}{\left| 1 + \sum_{k=1}^{p} a[k]\, e^{-j 2 \pi f_n k T} \right|^{2}}
\tag{4}
$$

where $T$ is the sampling interval.

In this particular application, signal time windowing is performed every 10 ms, producing data segments of varying lengths: 64, 128, 256 and 512 elements.

3. PROCESSING ARCHITECTURES

The block diagrams of the processing architectures considered are presented in Fig. 1.

Fig. 1. Block diagram of the architectures considered (Arch. 1 and Arch. 2, each attached to the PC bus).

The C40 architecture is a PC-hosted system using a 50 MHz floating-point C40 device with 1 MB of SRAM plus 4 MB of DRAM, rated at 50 MFLOPS, 25 MIPS and 275 MOPS. This architecture has six high-speed parallel unidirectional links for communication with other processors, with an asynchronous transfer rate of 20 Mbytes/s (Texas Instruments, 1991).

The SHARC architecture used is a SHARCPAC PCISA Motherboard with a floating-point 40 MHz ADSP21062 processor. It includes 4 MB of DRAM and 128 KB of internal SRAM, has a DMA bandwidth of 240 MB/s and is rated at 120 MFLOPS and 40 MIPS (Alex Parallel Computers, 1996). The 3L Parallel C development environment was used for programming the C40 system (3L, 1995). For the SHARC system the Analog Devices toolset environment was used.

The performance of the architectures is typically evaluated in terms of computational execution time and speedup. In this case study, execution time is particularly relevant since a real-time implementation is envisaged. For clinical diagnosis, several cardiac cycles should be spectrally estimated, each cardiac cycle being windowed approximately a hundred times, so that each signal data segment (with length varying from 64 to 512 points) must be spectrally estimated within 10 ms. Execution time was therefore adopted as the performance measure. In order to compare different architectures, the gradient performance metric was considered. The gradient measures the mean time per data element.
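The two quantities discussed here, the 10 ms real-time budget per data segment and the gradient (mean time per data element), can be illustrated with a short Python sketch. The exact formula used for the gradient is not given in the text, so the sketch assumes it is the execution time divided by the segment length, averaged over the segment lengths; the function names and the timings in the usage example are hypothetical and are not measurements from this study.

```python
from typing import List, Sequence

DEADLINE_MS = 10.0  # windowing period stated in the text: one segment every 10 ms


def gradient(times_ms: Sequence[float], lengths: Sequence[int]) -> float:
    """Mean execution time per data element (ms per element).

    Assumption: the gradient is taken as time / length for each segment
    length, averaged over all segment lengths.
    """
    if len(times_ms) != len(lengths) or len(lengths) == 0:
        raise ValueError("times_ms and lengths must be non-empty and of equal size")
    return sum(t / n for t, n in zip(times_ms, lengths)) / len(lengths)


def meets_deadline(times_ms: Sequence[float],
                   deadline_ms: float = DEADLINE_MS) -> List[bool]:
    """For each segment length, True if the execution time fits the deadline."""
    return [t <= deadline_ms for t in times_ms]


if __name__ == "__main__":
    lengths = [64, 128, 256, 512]
    # Hypothetical timings in ms, chosen only to exercise the functions.
    times = [1.0, 2.0, 4.0, 12.0]
    print("gradient (ms/element):", gradient(times, lengths))
    print("real-time per length: ", meets_deadline(times))
```

Under this assumed definition, a perfectly regular implementation would give the same time-per-element ratio at every segment length, so a ratio that drifts with length points to architecture-dependent effects.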
4. IMPLEMENTATIONS AND RESULTS

The execution times, in milliseconds, obtained with the MC algorithm implementations on the DSP architectures are presented in Table 1.

Table 1. Results of MC algorithm implementations (execution times in ms).

Data Segment Length    64      128     256     512
Arch. 1 (C40)          2       4       7       14
Arch. 2 (SHARC)        0.53    1.05    2.09    4.18

Modifications were performed to achieve a suitable match between the algorithm and the architectures. The algorithm was modified to include fewer function calls, and the twiddle vectors were calculated during the initialisation phase. These changes occurred mainly in the calculation of the power spectral density. Further changes in the memory mapping and compiler options were required in order to achieve better performance.

5. COMPARISON EVALUATION

A first evaluation of the MC algorithm was carried out using the Matlab function flops. Fig. 2 shows that the algorithm has a regular nature, as the workload is proportional to the length of the data segment. Because of this regularity, execution times are expected to increase proportionally. It can be seen (Fig. 3) that the two processors react differently to the increase in workload, the C40 being less sensitive to the increase in data length than the SHARC.

Fig. 2. MC algorithm regularity (horizontal axis: data segment length).

Fig. 3. Regularity comparison (Arch. 1, Arch. 2 and theoretical value).

If a general application of the algorithm is considered, different model orders may be necessary. Another regularity study was therefore conducted, analysing the impact of varying the model order (namely from 4 to 10). The algorithm was evaluated with the Matlab function flops to measure the variation of the workload with model order.

Fig. 4 presents the impact of model order variation on the overall algorithm. The overall algorithm has a very regular nature. However, as can be observed in Fig. 5 and Fig. 6, the two processors respond differently to model order variation.

Architecture 1 presents better results with model order 10. This indicates that there is a significantly different behaviour between the two architectures, and leads to the conclusion that the compiler optimisation on Arch. 1 is more effective when a significant amount of data must be handled.
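To make the dependence of the workload on data segment length N and model order p more concrete, the MC estimator can be sketched in Python with NumPy. This is an illustrative sketch that follows the general description in Kay (1988); it is not the C code that was run on the DSPs. The function names, the use of an FFT to evaluate the spectrum, and the normalised frequency axis (sampling interval taken as 1) are assumptions made for the example.

```python
import numpy as np


def mc_covariance(x, p):
    """Return the (p+1) x (p+1) matrix of c_xx[i, j], i, j = 0..p."""
    x = np.asarray(x, dtype=complex)
    N = len(x)
    if N <= 2 * p:
        raise ValueError("data segment too short for the requested model order")
    n_fwd = np.arange(p, N)      # n = p .. N-1 (forward prediction terms)
    n_bwd = np.arange(0, N - p)  # n = 0 .. N-1-p (backward prediction terms)
    c = np.empty((p + 1, p + 1), dtype=complex)
    for i in range(p + 1):
        for j in range(p + 1):
            fwd = np.sum(np.conj(x[n_fwd - i]) * x[n_fwd - j])
            bwd = np.sum(x[n_bwd + i] * np.conj(x[n_bwd + j]))
            c[i, j] = (fwd + bwd) / (2 * (N - p))
    return c


def mc_psd(x, p=4, nfft=512):
    """Modified covariance AR estimate.

    Returns (freqs, psd, a, sigma2) with freqs in cycles per sample.
    """
    c = mc_covariance(x, p)
    C = c[1:, 1:]   # p x p covariance matrix
    b = -c[1:, 0]   # right-hand-side vector
    # Cholesky factorisation C = L L^H, then two triangular systems.
    # np.linalg.solve is a general solver, used here only for brevity.
    L = np.linalg.cholesky(C)
    y = np.linalg.solve(L, b)
    a = np.linalg.solve(L.conj().T, y)
    # White noise variance; real up to rounding error.
    sigma2 = (c[0, 0] + np.dot(a, c[0, 1:])).real
    # A(f) = 1 + sum_k a[k] exp(-j 2 pi f k), evaluated on nfft points.
    A = np.fft.fft(np.concatenate(([1.0], a)), nfft)
    psd = sigma2 / np.abs(A) ** 2
    freqs = np.fft.fftfreq(nfft)
    return freqs, psd, a, sigma2


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n = np.arange(64)
    # Synthetic test signal: complex sinusoid at 0.2 cycles/sample plus noise.
    noise = 0.05 * (rng.standard_normal(64) + 1j * rng.standard_normal(64))
    x = np.exp(2j * np.pi * 0.2 * n) + noise
    freqs, psd, a, sigma2 = mc_psd(x, p=4)
    print("peak frequency:", freqs[np.argmax(psd)])
```

In this sketch the covariance step performs on the order of (p+1)^2 (N-p) multiply-accumulate operations, so for a fixed model order its cost grows in proportion to the segment length, while the Cholesky step depends only on p. An optimised DSP implementation would normally exploit the Hermitian symmetry of the matrix and precompute the complex exponentials, which is the role the twiddle vectors play in the text.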
Fig. 4. MC algorithm growth rate with model order (horizontal axis: model order, p = 4 to 10).

Fig. 5. Regularity of MC implementation on Arch. 1 with model order varying from 4 to 10 (horizontal axis: data segment length).

Fig. 6. Regularity of MC implementation on Arch. 2 with model order varying from 4 to 10 (horizontal axis: data segment length).

Architecture 1 (C40) performed best upon model order variation, presenting a smaller growth index at the higher data segment lengths. Fig. 7 allows the comparison between the initial and final execution times, in milliseconds, obtained in implementing the algorithm on the architectures.

Fig. 7. Execution time performance on C40 and SHARC architectures.

The performance enhancements achieved by adapting the algorithm to the processing architecture were due to the use of compiler optimisation and the reduction of function calls, leading to significant improvements in the overall performance with Arch. 1 (reducing execution times by 20 to 60%).

In the particular case of architecture 2 (SHARC), in addition to the use of compiler optimisation and the reduction of function calls, the use of the architecture file (to distribute the data across the internal memory blocks) enabled use of the dual internal bus and the intrinsic parallelism of the SHARC processor architecture, leading to execution times 93% lower than the original ones.

6. CONCLUSION

The real-time implementation of an autoregressive Doppler signal spectral estimator has been addressed. Two high-performance processors were used: a TMS320C40 and an ADSP21062. The MC algorithm has a regular nature and therefore regular execution times were expected.

The execution times obtained showed that the real-time implementation requirements were not met with architecture 1 for the higher data segment length. The other architecture, which includes the SHARC as processing element, met the real-time requirements for all the data segment lengths considered.

7. ACKNOWLEDGEMENTS

The authors gratefully acknowledge the Treaty of Windsor (B-115/96), JNICT (PBIC/2414/95), UNAM (PAPIIT-IN106796) and CONACYT (2146P-A9507).
REFERENCES
Alex Parallel Computers Inc. (1996). SHARC1000 User's Manual, Alex Parallel Computers Inc., Ithaca, NY 14850, USA.
3L Ltd. (1995). C4x Parallel C V2.0.2, 3L Limited, Scotland.
Kay, S.M. (1988). Modern Spectral Estimation: Theory & Application, Prentice Hall.
Madeira, M.M., Tokhi, M.O., Ruano, M. Graça (1997). Parallel Processing Architectures for Real-Time Doppler Signal Blood Flow Estimation: a Comparative Study, Preprints of AARTC'97, pp. 293-298.
Madeira, M.M., Bellis, S.J., Marnane, W.P., Ruano, M.G. (1998). Configurable processing for real-time spectral estimation, submitted to AARTC'98.
Ruano, M.G., Garcia Nocetti, D.F., Fish, P., Fleming, P.J. (1992). A Spectral Estimator using Parallel Processing for use in a Doppler Blood Flow Instrument, Parallel Computing: from Theory to Sound Practice, ed.: W. Joosen and E. Milgrom, IOS Press, pp. 397-400.
Ruano, M.G., Garcia Nocetti, D.F., Fish, P., Fleming, P.J. (1993). Alternative Parallel Implementations of an AR-Modified Covariance Spectral Estimator for Diagnostic Ultrasonic Blood Flow Studies, Parallel Computing, Vol. 19, pp. 463-476.
Ruano, M.G., Fish, P.J. (1993). Cost/benefit Criterion for Selection of Pulsed Doppler Ultrasound Spectral Mean Frequency and Bandwidth Estimators, IEEE Trans. Biomedical Engineering, Vol. 40, No. 12, pp. 1338-1341.
Texas Instruments (1991). TMS320C40 User's Guide, Texas Instruments, USA.
The MathWorks Inc. (1991). Matlab User's Guide, The MathWorks Inc., South Natick, MA 01760, USA.