Parallel performances of three 3D reconstruction methods on MIMD computers : Feldkamp, Block ART and SIRT algorithms
C. Laurent, F. Peyrin†, C. Girerd† and J.-M. Chassery
TIMC-IMAG, IAB, Domaine de la Merci, 38076 La Tronche, France
† CREATIS, URA CNRS 1216, INSA, 69621 Villeurbanne, France
[email protected] [email protected]
Abstract
This paper deals with the parallel implementation of reconstruction methods in 3D tomography. 3D tomography involves large data volumes and long computation times. Parallel computing on MIMD computers is a promising way to address this problem. In this study, we present the different steps of the parallelization on an abstract parallel computer. Depending on the method, we use one of two main approaches to parallelize the algorithms: the local approach and the global approach. Experimental results on MIMD computers are presented. Two 3D images reconstructed from realistic data are shown.
I. Introduction
During the last decade, three-dimensional medical imaging has made considerable progress. With the evolution of technology, new devices for 3D image acquisition have appeared. In x-ray CT, while the DSR was the first prototype of a 3D CT scanner, new devices such as the "Morphometer" are currently being tested in a medical environment [17]. 3D x-ray microtomography systems using two-dimensional detectors are also being developed, for instance for the analysis of biological samples [15]. With the improvement of two-dimensional CCD-based x-ray detectors, such systems may now provide billions of data values. In PET and SPECT, devices allowing 3D acquisition are also being developed. However, in most applications, 3D image reconstruction is still very demanding in terms of computation, and is often the bottleneck of the system. At present, reconstruction is often limited to $256^3$ volumes, even if the acquisition provides all the information necessary to reconstruct larger volumes. For instance, the reconstruction of a $1024^3$ volume would typically take several days on a conventional computer. Parallel computers therefore seem appropriate for this task.
Different approaches have already been proposed in the literature and may be classified according to the architecture of the computer. In a number of works, dedicated architectures have been developed. In [8], algorithms have been implemented on VLSI architectures. In [10], Lattard proposed a parallel computer based on several Processor Elements dedicated to reconstruction algorithms. Each Processor Element performs basic operations on local data and communicates its results to the other Processor Elements. Reconstruction algorithms have also been implemented on vector computers. The vectorization tools guarantee acceptable reconstruction times [16, 7]. However, this kind of architecture is generally quite limited in terms of power and memory for large 3D reconstruction problems. SIMD computers have mainly been used when the discretization of the physical model is similar to the computer topology. For instance, propagation in SPECT has been simulated on a grid of processors [13, 14]. However, irregular geometries, such as the cone-beam geometry, do not map well onto such computer topologies. MIMD computers are based on a network of processors. Usually, each Processor Element is powerful enough to solve small 3D reconstruction problems on its own. Current MIMD computers offer large memory space and high computation power. To reduce computation time, optimized communication schemes and load-balancing techniques may be used [1, 19, 3].
In this paper, we present different implementations of 3D reconstruction algorithms (Feldkamp, Block ART and SIRT) on several MIMD computers. We have developed two approaches to perform the parallelization. We show real 3D images reconstructed from 2D acquisitions provided by the "Morphometer".
II. 3D reconstruction methods
3D image reconstruction from cone-beam projections is still a topic of active research, since no algorithm is completely satisfactory in terms of both accuracy and speed. The existence and uniqueness of the solution depend on the curve along which the x-ray source moves, called the "orbit". A sufficient condition is that any plane passing through the volume to be reconstructed intersects the orbit in at least one point [18]. However, this kind of orbit is rarely available on physical acquisition devices, and the reconstruction problem is then under-determined. For a circular orbit, the most popular algorithm is a generalization of the filtered backprojection method, first proposed by Feldkamp [6]. It yields an approximate image that is acceptable when the divergence of the beam is small. Other methods based on space-variant filtering have been proposed in the literature [4, 9]. Alternative approaches to analytical inversion formulas are iterative algebraic methods. In this case the problem is similar to the 2D reconstruction problem, except that its dimensionality is much larger, which makes it hard to implement on conventional computers.
In this work, we consider the parallel implementation of three reconstruction algorithms: Feldkamp's method and two iterative methods, Block ART and a SIRT algorithm [16]. In Feldkamp's algorithm, each projection is first weighted and filtered, then backprojected onto the 3D volume. In the Block ART and SIRT algorithms, the solution is iteratively updated by an additive correction. In Block ART, this correction is computed for each projection, and may be interpreted as a weighted backprojection of the difference between the projection of the current image and the acquired projection. In SIRT, the correction is computed for the whole set of projections and may also be seen as a weighted global backprojection of the difference between the projections of the current image and the acquired projections. In these algorithms, the computation of the projection and backprojection, which are dual operations, is the most computationally expensive task. The parallelization of the different methods is therefore based on the parallelization of the projection and backprojection operators.
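The difference between the two additive corrections can be illustrated on a toy problem. The sketch below is not the authors' implementation: it uses a hypothetical 2x2 image with three "projections" of two rays each, and an assumed textbook weighting (each ray value divided by the ray length, each pixel update divided by the number of rays crossing it). All function names are illustrative.

```python
# Toy 2x2 "image" with three projections of two rays each (pure Python).
# The weights (1 / ray length, 1 / hits per pixel) are an assumed textbook
# choice, not the weighting used in the paper.

def project(rays, image):
    """Forward projection: sum the pixels crossed by each ray."""
    return [sum(image[p] for p in ray) for ray in rays]


def backproject(rays, values, n_pixels):
    """Weighted backprojection of one value per ray onto the pixels."""
    hits = [0] * n_pixels
    acc = [0.0] * n_pixels
    for ray, value in zip(rays, values):
        for p in ray:
            hits[p] += 1
            acc[p] += value / len(ray)
    return [a / h if h else 0.0 for a, h in zip(acc, hits)]


def correct(image, rays, acquired, relax=1.0):
    """One additive correction computed from the given rays."""
    diff = [a - c for a, c in zip(acquired, project(rays, image))]
    update = backproject(rays, diff, len(image))
    return [x + relax * u for x, u in zip(image, update)]


def block_art_iteration(image, blocks, data):
    """Block ART: correct the image after each projection in turn."""
    for rays, acquired in zip(blocks, data):
        image = correct(image, rays, acquired)
    return image


def sirt_iteration(image, blocks, data):
    """SIRT: a single correction computed from all projections together."""
    all_rays = [ray for rays in blocks for ray in rays]
    all_data = [value for acquired in data for value in acquired]
    return correct(image, all_rays, all_data)


# Rows, columns and diagonals of the 2x2 image, pixels indexed 0..3.
blocks = [[(0, 1), (2, 3)], [(0, 2), (1, 3)], [(0, 3), (1, 2)]]
truth = [1.0, 2.0, 3.0, 4.0]
data = [project(rays, truth) for rays in blocks]

art = block_art_iteration([0.0] * 4, blocks, data)
sirt = [0.0] * 4
for _ in range(60):
    sirt = sirt_iteration(sirt, blocks, data)
print(art, sirt)
```

The structural point is that Block ART needs the projections one after another, while the SIRT correction can be accumulated over all projections at once, which is what drives the choice of parallelization strategy later in the paper.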
III. Parallelization
A. Definition of an abstract parallel computer
Our aim is to develop different approaches to implement the three reconstruction algorithms efficiently on parallel computers. The target parallel computers are shown in Table 1. Their Processor Elements, topology, number of Processor Elements (#PE) and communication network are specific to each parallel computer. A good way to implement our algorithms on these computers is therefore to define an abstract parallel computer composed of #PE Processor Elements connected by a network. This approach implies that the parallel algorithms are not tuned for maximum efficiency on each individual architecture, but are reasonably efficient across all of the parallel computers. Moreover, it makes it possible to compare the performance of the different target computers. We assume that implementations on this abstract computer carry over to the real parallel computers through the PVM library [2].
    Computer               Processor (#PE)   Topology
    SUN4                   Sparc2 (1)        -
    Workstations network   Sparc2&3 (5)      Ethernet
    Paragon (Intel)        i860 (32)         Grid
    Farm of processors     AXP (16)          Giga-Switch
    SP1 (IBM)              RS6000 (32)       Multi-level
    T3D (Cray)             AXP (128)         3D torus
Table 1 Parallel computers and reference computer: SUN4
B. Data distribution and communication schemes
As a first approach, we suppose that the processors of the abstract parallel computer are identical. The natural data distribution is then to allocate the same amount of data to each Processor Element (PE). There are two data sets: the 3D image (V) to be reconstructed, of size $N^3$, and the m 2D images (acquisitions), each of size $M^2$. We consider two cases, depending on the memory size of each PE:
- If the data size is larger than the PE memory size: the 3D image is cut into T 3D sub-images of size $N_T \cdot N^2$, with $T \cdot N_T = N$. We reconstruct one 3D sub-image at a time. The 2D images are shared among the PEs.
- If the data size is smaller than the PE memory size: the 3D image and the 2D images are shared among the PEs. Each PE holds a 3D sub-image of size $N_{PE} \cdot N^2$ and $m_{PE}$ 2D images of size $M^2$ (with $\#PE \cdot N_{PE} = N$ and $\#PE \cdot m_{PE} = m$).
We present the parallelization of the 3D reconstruction algorithms when all the data are distributed among the processors in a load-balanced way. To compute the 3D image, each 2D acquisition must contribute to each 3D sub-image. There are therefore two communication schemes: either the 3D sub-images stay in processor memory and the 2D images are communicated through the network, or the reverse. In a previous study, we showed that the first communication scheme is more efficient [5].
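As a minimal sketch of the second case (all data fit in memory and are shared equally), the following computes which slices of the 3D image and which 2D images a given PE would own. The function names `slab_bounds` and `distribute` are hypothetical, and the sketch assumes contiguous blocks and that #PE divides both N and m, as the relations above require.

```python
def slab_bounds(n_items, n_parts, part):
    """Half-open range [start, stop) of the items owned by `part`.

    Assumes an equal, contiguous split, so n_parts must divide n_items.
    """
    if n_items % n_parts != 0:
        raise ValueError("n_parts must divide n_items for an equal split")
    size = n_items // n_parts
    return part * size, (part + 1) * size


def distribute(N, m, n_pe, q):
    """Data owned by PE q: N/#PE image slices and m/#PE 2D images."""
    return {
        "slices": slab_bounds(N, n_pe, q),        # sub-image of N_PE * N^2 voxels
        "projections": slab_bounds(m, n_pe, q),   # m_PE images of M^2 pixels
    }


if __name__ == "__main__":
    for q in range(4):
        print(q, distribute(N=128, m=128, n_pe=4, q=q))
```

With this layout the 3D sub-image never moves; only the 2D images travel between PEs, which matches the first communication scheme.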
C. Parallelization of the basic operators
With the Feldkamp and SIRT algorithms, the operators may be computed for all 2D images simultaneously, whereas with the Block ART algorithm the projection and backprojection have to be computed on each 2D image sequentially. We therefore propose two different approaches to parallelize the basic operators:
- Local approach: the basic operators are computed locally on the data owned by each processor. The 2D images are communicated to the neighboring processor (projection algorithm, figure 1).
- Global approach: the basic operators are computed across the network. After computing an operator on its local data, each processor sends its result to a particular processor through a reduction-summation operation (figure 2).
We describe below the parallelization of the projection operator with the two approaches. The parallelization of the backprojection operator is similar.
    for all P_j with j = q*m/#PE ... (q+1)*m/#PE - 1
        Create a new projection P_j
        for n = 1 to #PE
            Update P_j: compute the projection of the local 3D sub-image
            Send P_j to processor (q + 1) mod #PE
            Receive P_j from processor (q - 1) mod #PE
        Store final projection P_j
Fig. 1 Projection on local approach for the processor q
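The ring circulation of the local approach can be simulated in a single process. The sketch below is only an illustration under stated assumptions: `partial_projection` is a hypothetical stand-in for the real cone-beam projector (here a weighted voxel sum), message passing is replaced by rotating a Python list, and #PE is assumed to divide m.

```python
def partial_projection(sub_image, j):
    """Hypothetical stand-in for the projector: contribution of one
    3D sub-image to projection j (a weighted voxel sum)."""
    return (j + 1) * sum(sub_image)


def ring_projection(sub_images, m):
    """Simulate the local approach on len(sub_images) PEs.

    Returns, for each PE q, a dict {j: final projection value} for the
    projections that PE q owns.
    """
    n_pe = len(sub_images)
    if m % n_pe != 0:
        raise ValueError("#PE must divide m")
    per_pe = m // n_pe
    # held[q] holds the projection buffers currently stored on PE q.
    held = [{j: 0.0 for j in range(q * per_pe, (q + 1) * per_pe)}
            for q in range(n_pe)]
    for _ in range(n_pe):
        # Every PE adds the contribution of its own sub-image.
        for q in range(n_pe):
            for j in held[q]:
                held[q][j] += partial_projection(sub_images[q], j)
        # Send to (q + 1) mod #PE, receive from (q - 1) mod #PE.
        held = [held[(q - 1) % n_pe] for q in range(n_pe)]
    return held


if __name__ == "__main__":
    print(ring_projection([[1.0, 2.0], [3.0], [4.0, 5.0, 6.0]], m=6))
```

After #PE steps each buffer has visited every PE exactly once and is back on its owner, so no global synchronization or reduction is needed.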
    for j = 1 to m
        if (P_j ∈ processor q) then
            Create a new projection P_j
            Compute partial projection P'_j
            P_j = reduce(SUM, P'_j, j/#PE)
            Store final projection P_j
        else (P_j ∉ processor q)
            Compute partial projection P'_j
            reduce(SUM, P'_j, j/#PE)
where reduce(OP, buf, dest) is a global combine operation on the variable buf. The operation applied is OP and the final result is stored on processor dest.
Fig. 2 Projection on global approach for the processor q
D. Parallelization of 3D reconstruction algorithms
With these parallel basic operators, we build the parallel versions of the Feldkamp, Block ART and SIRT algorithms.
The parallel version of the Feldkamp algorithm is based on a parallel backprojection operator (figure 3). Each 2D image is first weighted and filtered, then backprojected onto all 3D sub-images.
    for j = 1 to m/#PE
        Read P_j
        Weight and filter P_j
        for i = 1 to #PE
            Compute backprojection of P_j
            Send P_j to processor (q + 1) mod #PE
            Receive P_j from processor (q - 1) mod #PE
    Write the 3D sub-image
Fig. 3 Feldkamp algorithm for the processor q
The parallel version of the Block ART algorithm is developed with the global approach because only one projection is computed at a time. We therefore use the parallel projection operator presented in figure 2. We describe one iteration of the parallel algorithm (figure 4). We compute the projection $P_j^k$ of all 3D sub-images at the k-th iteration. We calculate the 2D difference image $D_j^k$ between the 2D images $P_j^k$ and $P_j^{k-1}$. The 2D image $D_j^k$ is then backprojected.
    for j = 1 to m
        if (P_j^k ∈ processor q) then
            Compute partial projection P'_j^k
            P_j^k = reduce(SUM, P'_j^k, j/#PE)
            Compute difference: D_j^k = P_j^k - P_j^(k-1)
            Send D_j^k to all processors
            Compute backprojection of D_j^k
        else (P_j^k ∉ processor q)
            Compute partial projection P'_j^k
            reduce(SUM, P'_j^k, j/#PE)
            Receive D_j^k from the processor that owns P_j^k
            Compute backprojection of D_j^k
Fig. 4 Block ART algorithm for the processor q
In contrast to the Block ART algorithm, the parallel version of the SIRT algorithm is developed with the local approach. We present one iteration of this algorithm (figure 5).
    for j = 1 to m/#PE
        for i = 1 to #PE
            Compute projection P_j^k
            Send P_j^k to processor (q + 1) mod #PE
            Receive P_j^k from processor (q - 1) mod #PE
        Compute difference: D_j^k = P_j^k - P_j^(k-1)
        for i = 1 to #PE
            Compute backprojection of D_j^k
            Send D_j^k to processor (q + 1) mod #PE
            Receive D_j^k from processor (q - 1) mod #PE
Fig. 5 SIRT algorithm for the processor q
First, all projections $P_j^k$ of the 3D sub-images are computed with the parallel projection operator of the local approach (figure 1). Then each processor calculates the 2D difference images $D_j^k = P_j^k - P_j^{k-1}$. These 2D images are backprojected using the parallel backprojection operator of the local approach.
E. Evaluation of the theoretical cost
To evaluate the computation times ($T_{comp}$) and the communication times ($T_{comm}$) of these algorithms on our abstract parallel computer, we use a classic communication model: sending one message costs $\beta + L\tau$, where $\beta$ is the time to initiate the communication, $L$ the size of the message to send and $\tau$ the transmission time per data unit. We define $T(\overrightarrow{P})$ as the cost of the projection of one voxel and $T(\overleftarrow{B})$ as the cost of the backprojection of one voxel. We suppose that the times to load the projections $P_j$ and to write the 3D sub-image are similar on each PE. We therefore evaluate only the computation and communication times of the Feldkamp algorithm and of one iteration of the Block ART and SIRT algorithms.
Feldkamp: The computation time depends on the backprojection of all 2D images and on the weighting and filtering of the local 2D images. Note that the communication time depends on all 2D images.
$T_{comp} = m \frac{N^3}{\#PE} T(\overleftarrow{B}) + \frac{m}{\#PE} T(\mathrm{filtering})$
$T_{comm} = \frac{m}{\#PE} (\#PE - 1)(\beta + M^2 \tau)$
Block ART: To evaluate the theoretical cost, we suppose that the same processor performs all reduction operations. It then calculates all 2D difference images. The $T_{comm}$ of the Block ART algorithm is roughly #PE times larger than that of the Feldkamp algorithm.
$T_{comp} = m \frac{N^3}{\#PE} \left(T(\overrightarrow{P}) + T(\overleftarrow{B})\right) + m \cdot \#PE \cdot M^2$
$T_{comm} = m \cdot \#PE \, (\beta + M^2 \tau)$
SIRT: We compute two operators with the local approach. If we suppose that the basic time of projection, $T(\overrightarrow{P})$, is equal to the basic time of backprojection, $T(\overleftarrow{B})$, one iteration of the SIRT algorithm costs twice as much as the Feldkamp algorithm.
$T_{comp} = m \frac{N^3}{\#PE} \left(T(\overrightarrow{P}) + T(\overleftarrow{B})\right) + \frac{m}{\#PE} M^2$
$T_{comm} = 2m \, (\beta + M^2 \tau)$
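The cost expressions of this section can be evaluated numerically to compare the three algorithms. The sketch below is illustrative only: the parameter names (`beta` for the start-up time, `tau` for the per-value transfer time, `t_p` and `t_b` for the per-voxel projection and backprojection times) and all numeric values are assumptions. The difference-image terms appear in the text as bare operation counts, so a hypothetical per-pixel time `t_pix` is introduced to convert them to seconds.

```python
def feldkamp_cost(N, M, m, n_pe, t_b, t_filter, beta, tau):
    """(computation, communication) time of the Feldkamp algorithm."""
    comp = m * N**3 / n_pe * t_b + m / n_pe * t_filter
    comm = m / n_pe * (n_pe - 1) * (beta + M**2 * tau)
    return comp, comm


def block_art_cost(N, M, m, n_pe, t_p, t_b, beta, tau, t_pix=0.0):
    """(computation, communication) time of one Block ART iteration."""
    comp = m * N**3 / n_pe * (t_p + t_b) + m * n_pe * M**2 * t_pix
    comm = m * n_pe * (beta + M**2 * tau)
    return comp, comm


def sirt_cost(N, M, m, n_pe, t_p, t_b, beta, tau, t_pix=0.0):
    """(computation, communication) time of one SIRT iteration."""
    comp = m * N**3 / n_pe * (t_p + t_b) + m / n_pe * M**2 * t_pix
    comm = 2 * m * (beta + M**2 * tau)
    return comp, comm


if __name__ == "__main__":
    # Hypothetical machine parameters, for illustration only.
    args = dict(N=128, M=128, m=128, n_pe=16, beta=1e-4, tau=1e-8)
    print("Feldkamp :", feldkamp_cost(t_b=1e-7, t_filter=1e-2, **args))
    print("Block ART:", block_art_cost(t_p=1e-7, t_b=1e-7, **args))
    print("SIRT     :", sirt_cost(t_p=1e-7, t_b=1e-7, **args))
```

Under this model the Block ART communication term grows with #PE while the Feldkamp and SIRT terms stay roughly constant, which is consistent with the remark that the local approach scales better.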
In order to improve the efficiency of these algorithms, the communication costs have been minimized by overlapping communications with computations for the local approach, and by using a communication scheme based on a binary tree for the global approach [12].
IV. Performance results
These algorithms have been implemented on the parallel computers presented in Table 1. Table 2 reports the performance obtained with these algorithms for different sizes of the 2D images ($m \cdot M^2$) and of the 3D image ($N^3$). The computation times of the Block ART and SIRT algorithms are averages per cycle.
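As a worked illustration of the relative speed-up (sequential time divided by parallel time), the sketch below uses the Feldkamp times reported in Table 2 for m = M = N = 64, with the SUN4 as the sequential reference. It assumes that the first pair of columns in Table 2 holds the Feldkamp times; the function name `relative_speedup` is hypothetical.

```python
# Feldkamp computation times in seconds for m = M = N = 64 (Table 2).
feldkamp_64 = {
    "SUN4": 1565,
    "Workstations": 446,
    "Paragon": 30,
    "Farm": 16,
    "SP1": 7,
    "T3D": 3,
}


def relative_speedup(times, reference="SUN4"):
    """Sequential time on the reference machine divided by each parallel time."""
    t_seq = times[reference]
    return {name: t_seq / t for name, t in times.items() if name != reference}


if __name__ == "__main__":
    for name, s in relative_speedup(feldkamp_64).items():
        print(f"{name:12s} {s:7.1f}")
```

Because the machines use different processors, this ratio mixes processor speed and parallelism; it is a relative speed-up against the SUN4, not a parallel efficiency.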
    Computer        Feldkamp        Block ART       SIRT
    m=M=N           64      128     64      128     64      128
    SUN4            1565    10606   1213    -       1133    -
    Workstations    446     2593    -       -       -       -
    Paragon         30      320     -       -       -       -
    Farm            16      210     52      621     35      415
    SP1             7       85      85      820     17      204
    T3D             3       22      24      108     12      50
Table 2 Computation times (sec)
We also obtain 3019 sec (329 sec) to reconstruct a 3D image on the T3D when m = M = N = 256 (m = M = 256, N = 512) [11]. Figures 6 and 7 present the relative speed-up (sequential time / parallel time) of the three algorithms for reconstructing a 3D image.
[Figure: speed-up on a logarithmic axis from 1 to 1000 for Workstations, Farm, Paragon, SP1 and T3D]
Fig. 6 Relative speed up of Feldkamp implementations
[Figure: two panels, Block ART and SIRT, showing speed-up for Farm, SP1 and T3D]
Fig. 7 Relative speed up of Block ART and SIRT implementations
These results show that the Feldkamp implementation has the best performance, but the reconstructed 3D image is an approximate solution. The algebraic methods yield higher-quality 3D reconstructions. The algorithms developed with the local approach (Feldkamp and SIRT) are more efficient than the Block ART algorithm, which is based on the global approach.
V. Application to physical data
The 3D reconstruction algorithms were first evaluated on simulated images. They were then applied to data provided by the Morphometer. We present realistic 3D reconstructed images of a human hand and a human knee. The 3D human hand is reconstructed by the Feldkamp algorithm from a set of 256 acquisitions of size $512^2$ (figure 8). We show a section of this 3D human hand (figure 9) and a view of the 3D human hand computed by a 3D volume rendering algorithm (figure 10).
Fig. 8 Acquisition of a human hand
Fig. 9 A section of the 3D hand
Fig. 10 3D Volume rendering view of 3D hand
The 3D human knee is reconstructed by the same algorithm from a set of 256 acquisitions (figure 11). We show a section of this 3D human knee (figure 12).
Fig. 11 Acquisition of a human knee
Fig. 12 A section of the 3D knee
VI. Conclusion
We have presented in this paper parallel implementations of reconstruction algorithms for 3D tomography. We introduced an abstract parallel computer to develop the parallelization of these algorithms. We described the data distribution and the parallelization of the basic operators. Our methodology is based on two approaches: the local approach and the global approach. The performance obtained shows the benefits of parallel implementations:
- the parallel computers make it possible to compute 3D images of realistic size (human hand and human knee);
- the computation times are reduced by a factor that depends on the power and number of processors.
VII. References
[1] M. S. Atkins, D. Murray, and R. Harrop. Use of transputers in a 3-D Positron Emission Tomograph. IEEE Transactions on Medical Imaging, 12(2):173-181, 1993.
[2] A. Beguelin, J. Dongarra, W. Jiang, R. Manchek, and V. Sunderam. PVM3 User's Guide and Reference Manual. Technical report, Oak Ridge National Laboratory, 1994.
[3] C. M. Chen, S.-Y. Lee, and Z. H. Cho. Parallelisation of EM Algorithm for 3-D PET Image Reconstruction. IEEE Transactions on Medical Imaging, 10(4):513-522, 1991.
[4] M. Defrise and R. Clack. A cone-beam reconstruction algorithm using shift-variant filtering and cone-beam backprojection. IEEE Transactions on Medical Imaging, 13:186-195, 1994.
[5] L. Desbat, C. Laurent, and S. Rouault. Parallel reconstruction in tomography: work in progress. In International Workshop Parallel Imaging and Application, pages 147-156, 1995.
[6] L. A. Feldkamp, L. C. Davis, and J. W. Kress. Practical Cone-Beam Algorithm. J. Opt. Soc. Am., 1(6):612-619, 1984.
[7] C. Guerrini and G. Spaletta. An image reconstruction algorithm in tomography: a version for the CRAY X-MP vector computer. Computers and Graphics, 13:367-372, 1989.
[8] W. F. Jones, L. G. Byars, and M. E. Casey. Design of a super fast three-dimensional projection system for positron emission tomography. IEEE Transactions on Nuclear Science, 37:800-804, 1990.
[9] H. Kudo and T. Saito. Derivation and implementation of a cone-beam reconstruction algorithm for nonplanar orbits. IEEE Transactions on Medical Imaging, 13(1):196-211, 1994.
[10] D. Lattard and G. Mazare. Parallel image reconstruction by using a dedicated asynchronous cellular array. In P. M. Dew, editor, Parallel Processing for Computer Vision and Display, pages 479-488, 1989.
[11] C. Laurent. Adéquation Algorithmes et Architectures Parallèles pour la Reconstruction 3D en Tomographie X. PhD thesis, Université Claude Bernard LYON 1, 1996.
[12] C. Laurent, C. Calvin, F. Peyrin, and J.-M. Chassery. Efficient Implementation of Parallel Image Reconstruction Algorithms for 3D X-RAY Tomography. In PARCO 95, pages 109-116, Gent (Belgique), September 1995.
[13] A. W. McCarty and M. I. Miller. Maximum Likelihood SPECT in Clinical Computation Times Using Mesh-Connected Parallel Computers. IEEE Transactions on Medical Imaging, 10(3):426-436, 1991.
[14] M. I. Miller and C. S. Butler. 3-D maximum a posteriori Estimation for Single Photon Emission Computed Tomography on Massively-Parallel Computers. IEEE Transactions on Medical Imaging, 12(3):560-565, 1993.
[15] M. Pateyron, F. Peyrin, A.-M. Laval-Jeantet, P. Spanne, P. Clotens, and G. Peix. 3D microtomography of cancellous bone samples using Synchrotron Radiation. In SPIE Medical Imaging, Newport Beach, February 1996.
[16] F. Peyrin. Méthodes de Reconstruction d'Images 3D à partir de Projections Coniques de Rayons X. PhD thesis, Université Claude Bernard LYON 1, 1990.
[17] D. Saint-Felix, Y. Trousset, C. Picard, and A. Rougee. In vivo evaluation of a new system for 3D computerized angiography. Phys. Med. Biol., 39:583-595, 1994.
[18] H. K. Tuy. An inversion formula for cone beam reconstruction. SIAM Journal on Applied Mathematics, 43(3):546-552, 1983.
[19] E. L. Zapata, I. Benavides, F. F. Rivera, J. D. Bruguera, and J. M. Carazo. Image Reconstruction on Transputer Network. In Second International Conference on Applications of Transputers, pages 164-171, 1991.