High Throughput SAFT for an Experimental USCT System as MATLAB Implementation with Use of SIMD CPU Instructions
M. Zapf, G.F. Schwarzenberg, and N.V. Ruiter
Forschungszentrum Karlsruhe, Karlsruhe, Germany
ABSTRACT
At Forschungszentrum Karlsruhe, an ultrasound computer tomography (USCT) system is under development for early breast cancer detection. To detect morphological indicators at sub-millimeter resolution, the visualization is based on a SAFT (synthetic aperture focusing technique) algorithm. The current 3D demonstrator system consists of approx. 2000 transducers, arranged in layers on a cylinder of 18 cm diameter and 15 cm height. With 3.5 million acquired raw data sets and up to one billion voxels per image, a reconstruction may take up to months. In this work a performance-optimized SAFT algorithm is developed. The software environment used is MathWorks' MATLAB. Several approaches were analyzed: a plain M-code implementation (M-code is MATLAB's native language), an optimized M-code implementation, a C-code implementation, and a low-level assembler implementation. The fastest solution found uses SIMD-enhanced assembler code wrapped in the C interface of MATLAB. An additional 10 % speed-up is gained by reducing the function call overhead. The overall speed-up is more than one order of magnitude. The resulting computational efficiency is near the theoretical optimum. The reconstruction time is significantly reduced without losing MATLAB's comfortable development environment.
Keywords: SAFT, ellipsoidal backprojection, image reconstruction, performance, MATLAB
Further author information: send correspondence to M. Zapf.
1. INTRODUCTION
Early breast cancer diagnosis is still a major challenge. The standard screening methods often detect cancer at a stage when metastases have already developed. The presence of metastases decreases the probability of survival significantly. A more sensitive tool for breast cancer diagnosis could lead to diagnoses at an earlier stage, i.e. before metastases have formed. We are developing a new imaging method for breast cancer diagnosis, ultrasound computer tomography (USCT), which allows recording of reproducible 3D images with high spatial resolution and tissue contrast.
Our experimental 3D demonstrator setup consists of a water-filled cylinder (18 cm diameter, 15 cm height) and contains 384 sending and 1536 receiving transducers, grouped in three rings on the cylinder surface, each 5 cm high (see Fig. 1). The cylinder can be rotated by a motor to six different positions, emulating a complete coverage of the cylinder with transducers. Each emitter in turn sends an unfocused wave front (center frequency 3 MHz, bandwidth 1 MHz, opening angle 30°). All receivers start recording in parallel when the ultrasound wave packet is emitted. The incident wave and the reflections introduced by objects are measured as pressure over time. The resulting A-Scan is digitized at 10 MHz with a dynamic range of 12 bit. A complete measurement produces 3.5 million A-Scans with 20 GB of data.
Figure 1. Top view of the USCT 3D demonstrator; the transducers are grouped in three layers on the measurement cylinder.
Three different physical properties can be visualized to detect indicators of breast cancer. Speed of sound and absorption are encoded in the arrival time and the amplitude, respectively, of the directly transmitted wavefront. Image volumes of these two properties can be created by a 3D inverse Radon transform approach, e.g. based on the FDK algorithm (Feldkamp, Davis, and Kress algorithm) [1]. The reflectivity, represented in the data by the scattered waves, is the third modality; it depends on the morphology of the measured object. For qualitative reconstruction of the reflectivity, a robust SAFT (ellipsoidal backprojection) based method [2] is used, which projects each A-Scan independently into the volume and sums the projections.
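The measurement size quoted in the introduction can be cross-checked with a few lines of arithmetic. The sketch below is illustrative only. It assumes that the six rotation positions multiply the 1536 receivers (consistent with K = 9216 in Section 2.2), that "20 GB" means 20·10^9 bytes, and that each 12 bit sample is stored in a 2 byte word. None of these storage details are stated in the text.

```python
# Cross-check of the measurement size (illustrative; storage layout is assumed).
N_EMITTERS = 384
N_RECEIVERS = 1536
N_POSITIONS = 6            # motor positions of the cylinder
TOTAL_BYTES = 20e9         # assumption: "20 GB" read as 20 * 10^9 bytes
BYTES_PER_SAMPLE = 2       # assumption: 12 bit samples stored as 16 bit words
SAMPLE_RATE_HZ = 10e6      # digitization rate given in the text

n_ascans = N_EMITTERS * N_RECEIVERS * N_POSITIONS
bytes_per_ascan = TOTAL_BYTES / n_ascans
samples_per_ascan = bytes_per_ascan / BYTES_PER_SAMPLE
record_length_s = samples_per_ascan / SAMPLE_RATE_HZ

print(f"A-scans:            {n_ascans}")
print(f"bytes per A-scan:   {bytes_per_ascan:.0f}")
print(f"samples per A-scan: {samples_per_ascan:.0f}")
print(f"record length:      {record_length_s * 1e6:.0f} us")
```

Under these assumptions the count comes to about 3.5 million A-Scans, which matches the figure given in the text, with a few thousand samples per A-Scan.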
1.1 Description of Purpose
To visualize the morphology of cancer (or of cancer-indicating structures such as micro-calcifications), sub-millimeter image resolution is required. For an average breast volume, the 3D demonstrator was shown to have a theoretical resolution (full width at half maximum) of 0.2 to 1 mm [3]. The total number of voxels at this resolution for a volume of 18 cm x 18 cm x 15 cm approaches $10^9$. However, the acceptable reconstruction time for clinical use should not exceed one day. Thus an implementation of the reconstruction with a high voxel throughput per time (Q [voxel/s]) is needed.
The aim of this work is the analysis and design of a performance-optimized SAFT algorithm for the MathWorks MATLAB environment. To this end, several implementation alternatives are developed and compared, with optimizations performed at two levels:
• High level: M-code based optimizations are evaluated (M-code is MATLAB's interpreted high-level language). In addition, MATLAB's capabilities as a code interpretation environment are analyzed and optimized (e.g. memory footprint, function call overhead).
• Low level: Several binary solutions (C and assembler) are wrapped in the MEX interface [4] to MATLAB and evaluated. This allows us to process the A-Scans in blocks to reduce the function call overhead.
The final step is the comparison of the solutions for varying setups.
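The voxel count and the one-day budget stated above translate into a required throughput Q. The following sketch works through that arithmetic. It assumes cubic voxels and that every A-Scan updates every voxel once. The function names voxel_count and required_throughput are hypothetical, and the A-Scan count of 3.5 million is taken from the introduction.

```python
import math

def voxel_count(volume_m, edge_m):
    """Number of cubic voxels of edge length edge_m in a box volume_m = (x, y, z)."""
    return math.prod(round(side / edge_m) for side in volume_m)

def required_throughput(n_voxels, n_ascans, time_budget_s):
    """Voxel updates per second needed if every A-scan touches every voxel once."""
    return n_voxels * n_ascans / time_budget_s

VOLUME_M = (0.18, 0.18, 0.15)   # 18 cm x 18 cm x 15 cm
N_ASCANS = 3.5e6                # A-scans per measurement (from the introduction)
ONE_DAY_S = 24 * 3600           # time budget for clinical use

for edge_mm in (0.2, 0.5, 1.0):  # resolution range given in the text
    n = voxel_count(VOLUME_M, edge_mm * 1e-3)
    q = required_throughput(n, N_ASCANS, ONE_DAY_S)
    print(f"l = {edge_mm} mm: {n:.3e} voxels, Q >= {q:.2e} voxel/s")
```

At the finest resolution of 0.2 mm the volume holds roughly 6·10^8 voxels, which is the sense in which the voxel count approaches $10^9$.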
Figure 2. Diagram of the SAFT algorithm (as 2D ellipsoidal backprojection). The emitter position $\vec{E}_i$ and the receiver position $\vec{R}_k$ are the focal points of the projected ellipsoids. $\vec{P}_{max}$ ($\vec{P}_{min}$) is the point of the voxel, among its intersections with all potential ellipsoids, with the maximum (minimum) distance to $\vec{E}_i$ and $\vec{R}_k$. $d_{max}$ ($d_{min}$) is the distance between $\vec{E}_i$, $\vec{R}_k$ and $\vec{P}_{max}$ ($\vec{P}_{min}$). The sample width ($sw_n$) in an A-Scan for a voxel ($V_n$) is defined as $d_{max,n} - d_{min,n} = sw_n$. $\beta_n$ is the multistatic angle between $\vec{E}_i$ and $\vec{R}_k$. In the example shown, the pulse in the A-Scan is a reflection (ultrasonic echo) from a reflection point in $V$ and is correctly assigned to $sw$.
2. METHODS
2.1 Reflectivity Image Reconstruction: SAFT Algorithm
The SAFT reconstruction could be implemented as a discrete signal-to-image mapping. This approach would require a 3D interpolation, so it is usually implemented the other way round, as an isotropic voxel-to-signal-sample mapping. This reduces the complexity of the algorithm to a 1D interpolation of the signal. The voxel-to-signal SAFT mapping can be described as
$$f(\vec{x}) = \sum_{k,i} T\left(A_{k,i}\left(\frac{d_i + d_k}{c}\right)\right) \qquad (1)$$
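A minimal sketch of the voxel-to-signal mapping in Eq. 1 for a single reconstruction point. It assumes that T is the identity, that a nearest-sample lookup replaces the 1D interpolation, and that c = 1500 m/s for water (a typical value that the text does not state). The function name saft_point and the tuple layout of ascans are hypothetical. This is not the authors' implementation.

```python
import math

def saft_point(x, ascans, c=1500.0, fs=10e6):
    """Sum the A-scan samples that belong to reconstruction point x (Eq. 1).

    ascans: list of (emitter_pos, receiver_pos, samples) tuples.
    c:      assumed speed of sound in m/s.
    fs:     sampling rate in Hz (10 MHz in the text).
    T is taken as the identity; the sample is picked by rounding.
    """
    value = 0.0
    for emitter, receiver, samples in ascans:
        d_i = math.dist(x, emitter)     # reconstructed point to emitter
        d_k = math.dist(x, receiver)    # reconstructed point to receiver
        idx = int(round((d_i + d_k) / c * fs))  # time of flight as sample index
        if 0 <= idx < len(samples):
            value += samples[idx]
    return value

# Synthetic A-scan: a single echo from a scatterer at the origin.
emitter, receiver = (-0.05, 0.0), (0.05, 0.0)
echo_idx = int(round((0.05 + 0.05) / 1500.0 * 10e6))
samples = [0.0] * 1000
samples[echo_idx] = 1.0

print(saft_point((0.0, 0.0), [(emitter, receiver, samples)]))   # at the scatterer
print(saft_point((0.0, 0.03), [(emitter, receiver, samples)]))  # away from it
```

With a single A-Scan every point on the ellipse through the scatterer receives the same value. The scatterer position only emerges when many emitter and receiver pairs are summed.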
where $f$ denotes the reflection image, $\vec{x}$ the coordinates of the reconstructed point, and $T$ the preprocessing function of the A-Scan $A_{k,i}$, acquired at sending position $\vec{E}_i$ and receiving position $\vec{R}_k$. $c$ is the speed of sound in water, and $d_i$ and $d_k$ are the distances from the reconstructed point to the emitter and receiver, respectively. This approximation is valid for a constant speed of sound, small attenuation, weak point scatterers, and spherical emission. Available preprocessing steps are band-pass filtering, envelope calculation, and matched filtering [5].
For a voxel of finite volume, the reconstruction is associated with a minimum (maximum) bounding ellipsoid with a specific contact point on the voxel, $\vec{P}_{min}$ ($\vec{P}_{max}$). This results in a specific $d_{min}(\vec{x})$ ($d_{max}(\vec{x})$), see also Fig. 2. The difference of these distances, divided by an average $c$, yields the sample width $sw(\vec{x})_{k,i}$ (a time range), which depends on $\vec{x}$, $\vec{E}_i$, $\vec{R}_k$, $\vec{P}_{min}$, and $\vec{P}_{max}$:
$$sw(\vec{x}) = \frac{1}{c}\left(d_{max}(\vec{x}) - d_{min}(\vec{x})\right) \qquad (2)$$
$$d_{max}(\vec{x}) = \left\|\vec{P}_{max} - \vec{E}_i\right\| + \left\|\vec{R}_k - \vec{P}_{max}\right\| \qquad (3)$$
$$d_{min}(\vec{x}) = \left\|\vec{P}_{min} - \vec{E}_i\right\| + \left\|\vec{R}_k - \vec{P}_{min}\right\| \qquad (4)$$
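Eqs. 2 to 4 can be illustrated with a small sketch for an axis-aligned cubic voxel. It evaluates the summed distance only at the eight voxel corners. Because the summed distance is a convex function of position, its maximum over the voxel lies at a corner, so $d_{max}$ is exact. The corner-based $d_{min}$ is only an approximation, since the true minimum can lie on a face or inside the voxel. The function name sample_width and the value c = 1500 m/s are assumptions for illustration.

```python
import itertools
import math

def sample_width(center, edge, emitter, receiver, c=1500.0):
    """Approximate sample width sw (Eq. 2) of a cubic voxel, in seconds.

    The summed distance |P - E| + |R - P| (Eqs. 3 and 4) is evaluated at the
    voxel corners only: exact for d_max, an approximation for d_min.
    """
    half = edge / 2.0
    corners = [
        tuple(c0 + sign * half for c0, sign in zip(center, signs))
        for signs in itertools.product((-1.0, 1.0), repeat=len(center))
    ]
    dists = [math.dist(p, emitter) + math.dist(receiver, p) for p in corners]
    return (max(dists) - min(dists)) / c

# Example geometry (hypothetical positions on a cylinder of 9 cm radius).
sw = sample_width(center=(0.0, 0.0, 0.0), edge=1e-3,
                  emitter=(-0.09, 0.0, 0.0), receiver=(0.0, 0.09, 0.0))
print(f"sw = {sw * 1e6:.3f} us")
```

Since this has to be repeated for every voxel and every emitter and receiver pair, the cost of a per-voxel sample width is considerable.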
2.2 SAFT Algorithm Optimization
The described SAFT algorithm (see Eqs. 1 to 4) is computationally intensive. A significant part of the computation would be spent on determining the sample width $sw(\vec{x})$. Therefore, the following implementations use a fixed sample width $sw_{const}$ as an approximation (see Eq. 5), which depends only on the voxel edge length $l$:
$$sw_{const} = \frac{1}{c}\left(l \cdot \sqrt{3}\right) \qquad (5)$$
Thus, voxels are assumed to be spheres. The multistatic setup (defined by the angle $\beta > 0$, see Fig. 2) is ignored. This is acceptable for several setups [6]. Two major advantages arise from this:
• The behavior of the algorithm is independent of the spatial orientation of the voxels. All voxels are processed uniformly and no decision branches with case-specific code are required, so the overall runtime is exactly predictable.
• The interpolation can be done in a voxel-independent way, outside of the voxel loop (once for all voxels). Without this approximation, a unique interpolation would have to be considered for every voxel.
Hence, for every voxel only two computations of geometric distances, two data reads (one from the A-Scan, one from the source image), and one data write to the destination image have to be carried out. Thus the computation time $t$ depends linearly on the number of emitters $i \in [1 \ldots I]$, receivers $k \in [1 \ldots K]$, and voxels $N_v$ (see Eq. 6):
$$t = \frac{\sum_{k,i} N_v}{Q} = \frac{I \cdot K \cdot N_v}{Q} \qquad (6)$$
$Q$ (voxel throughput per time) denotes the computational performance. $N_v$ is determined by the volume of the region of interest (ROI) divided by the voxel volume at the required resolution. As an experimental setup we use a 2D slice image of 2048 x 2048 voxels in double precision (64 bit floating point). The voxel size is $l$ = 0.07 mm for an area of 15 cm x 15 cm. $I = 384$ and $K = 9216$ are taken from the 3D demonstrator setup.
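Eqs. 5 and 6 can be evaluated for the experimental setup with the short sketch below. The speed of sound c = 1500 m/s and the three throughput values Q are hypothetical inputs chosen for illustration. They are not measured results from this work.

```python
import math

def sw_const(edge_m, c=1500.0):
    """Fixed sample width in seconds (Eq. 5): voxel diagonal divided by c."""
    return edge_m * math.sqrt(3.0) / c

def reconstruction_time(n_voxels, n_emitters, n_receivers, q):
    """Eq. 6: each of the I*K A-scans updates every voxel once."""
    return n_emitters * n_receivers * n_voxels / q

N_V = 2048 * 2048        # voxels in the 2D slice
N_EMITTERS = 384         # I
N_RECEIVERS = 9216       # K
EDGE_M = 0.07e-3         # voxel edge length l
FS_HZ = 10e6             # A-scan sampling rate

width_s = sw_const(EDGE_M)
updates = N_EMITTERS * N_RECEIVERS * N_V
print(f"sw_const = {width_s * 1e9:.1f} ns = {width_s * FS_HZ:.2f} samples at 10 MHz")
print(f"voxel updates = {updates:.3e}")

for q in (1e6, 1e7, 1e8):  # hypothetical throughputs in voxel/s
    days = reconstruction_time(N_V, N_EMITTERS, N_RECEIVERS, q) / 86400
    print(f"Q = {q:.0e} voxel/s -> {days:.1f} days")
```

For this slice alone, a throughput of roughly 1.7·10^8 voxel/s would be needed to finish within one day.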
2.3 Runtime Behavior of SAFT Algorithm
All parameters in Eq. 6 except $Q$ are defined by the application and thus cannot be influenced. To reduce the calculation time, only $Q$ can be changed; therefore the following analysis of $Q$ was carried out. Since implementations behave in a hardware-specific way, two setups were selected to analyze this influence:
• P3: Intel Pentium 3 system with 733 MHz, 512 MByte SD-RAM, Windows 2000 with MATLAB 6.1
• PIV: Pentium IV system with 2400 MHz, 2 GByte DDR-2 RAM, Windows XP with MATLAB 2007a
Although both x86 processors are only one development generation apart, Intel made major changes in the architecture between them. The comparison is therefore a test of the general quality of the solutions, especially for the low-level approaches.
Table 1. Required accuracy and range of data types for the distance calculation, resulting from the geometric boundaries of the 3D demonstrator and the required sub-millimeter image resolution. Data type accuracies and ranges are taken from Goldberg [8].
Type                  Size    Range                                           Accuracy
Minimum requirements  -       ±(10^0 ... 10^-5) m                             > 5 decimal digits
Double float type     64 bit  ±(10^308 ... 10^-308) m                         > 16 decimal digits
Single float type     32 bit  ±(10^38 ... 10^-37) m                           > 7 decimal digits
Integer fixed point   32 bit  10^4 ... 10^-5 m (with unit fixed to 10^-5 m)   > 9 decimal digits
The pseudo code for the SAFT implementation is given as follows:
A_i=interp(A,voxel_size)