Efficient Implementation of Parallel Image Reconstruction Algorithms for 3D X-Ray Tomography

C. Laurent a, C. Calvin b, J.M. Chassery a, F. Peyrin c

a TIMC-IMAG, IAB, Domaine de la Merci, 38076 La Tronche cedex, France
b LMC-IMAG, INPG, 46 Av. F. Viallet, 38031 Grenoble cedex, France
c CREATIS, URA CNRS 1216, INSA, 69621 Villeurbanne cedex, France

This paper deals with efficient parallel implementations of reconstruction methods in 3D tomography. Depending on the method, we use one of two main approaches to parallelize the algorithms, and we propose different optimizations to improve the efficiency of the parallel algorithms. These improvements are based either on minimizing the communication time by using an optimized collective communication algorithm, or on overlapping the communication with the computation. Experimental results on different distributed-memory parallel machines are presented which highlight the improvements obtained.

Keywords: Tomography - Parallel 3D reconstruction methods - Overlap of communications.

1. Introduction

Tomography was developed in order to obtain 2D slices of human anatomy. Truly 3D tomography is a generalization of conventional 2D tomography that allows the reconstruction of volumes (3D images). In 3D X-Ray tomography, some prototypes using the rotation of one (or several) cone-beam X-Ray source(s) have been built [8,9]. In these cases, the computational problem is to reconstruct a 3D image from a set of 2D conic projections taken from different angles of view. Conventional 2D reconstruction methods are not suited to this task, and the problem has to be considered directly in 3D. These reconstruction algorithms involve a large amount of data (536 MBytes for a 512³ image) and long computation times. For instance, the reconstruction of a 128³ image from about 100 projections of size 128² requires about 2 hours and 30 minutes on a Sun 4 workstation. Thus, realistic image sizes for medical applications (236³, 512³) cannot be computed on classical computers. Moreover, their memory requirements cannot be met by these machines. Implementing these techniques on distributed-memory parallel machines therefore seems to be a good way to solve real problems in reasonable times [1-3,11].

We present in this paper efficient implementations of reconstruction algorithms in 3D X-Ray tomography on different parallel machines. Depending on the type of method, we either minimize the communication time or overlap it with computation. The experimental results on the different machines highlight the improved efficiency of these optimized implementations. The remainder of the paper is organized as follows: in section 2 we describe some methods to reconstruct 3D images from a set of 2D acquisitions. Section 3 is devoted to the parallelization of these methods and especially to the optimization of the communications. Before concluding, we present experimental results on different parallel machines and compare the different parallel implementations.
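The memory figure quoted in the introduction can be checked with a few lines of arithmetic. The sketch below assumes that each voxel is stored as a 4-byte value; this size and the function name are assumptions made for illustration and are not stated in the paper.

```python
def volume_bytes(n, bytes_per_voxel=4):
    """Bytes needed to store an n x n x n image.

    bytes_per_voxel=4 is an assumption (e.g. single-precision values).
    """
    return n ** 3 * bytes_per_voxel


if __name__ == "__main__":
    for n in (128, 256, 512):
        size = volume_bytes(n)
        # Integer division by 10**6 gives whole (decimal) megabytes.
        print(f"{n}^3 image: {size} bytes ({size // 10**6} MBytes)")
```

Under this assumption a 512³ image occupies 536,870,912 bytes, which agrees with the 536 MBytes quoted in the text.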

2. 3D reconstruction methods

3D reconstruction methods from cone-beam acquisitions may be classified into analytic, algebraic and statistical methods [5]. Although these methods rely on different mathematical bases, their implementation requires similar basic operations: the projection and the back-projection operators. The projection operator maps the 3D space onto a 2D one. The back-projection operator corresponds to the inverse operation. Three reconstruction methods have been implemented:

• Feldkamp algorithm [4]: this analytic method is an extension of the 2D filtered back-projection algorithm. It consists in computing a back-projection of a weighted, filtered acquisition.

• Block ART algorithm [7]: this algebraic method consists in computing, at each iteration, the difference between the 2D acquisition and the projection of the 3D image obtained at the previous iteration. This difference is then back-projected and added to the initial 3D image.

• SIRT algorithm [6]: this method is basically the same as the block ART one. In the ART method, the back-projection is done after each computation of a difference image. In the SIRT method, the back-projection is performed once all the difference images have been computed. This method needs more iterations than the ART method to obtain the same result.

The Feldkamp algorithm computes an approximate 3D image, whereas the ART and SIRT algorithms approach the exact 3D image through successive iterations.
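The difference in update order between ART and SIRT can be illustrated on a toy problem. The following sketch is not the authors' implementation: it reduces each "acquisition" to a single ray (one row of a small linear system), normalizes the back-projection by the squared ray norm, and averages the SIRT corrections. All of these choices, and every function name, are assumptions made for illustration.

```python
def project(ray, image):
    """Projection operator: weighted sum of the image values along one ray."""
    return sum(w * v for w, v in zip(ray, image))


def backproject(ray, value):
    """Back-projection operator: spread a scalar back along the ray."""
    norm = sum(w * w for w in ray)
    return [w * value / norm for w in ray]


def art_iteration(rays, data, image):
    """ART: back-project each difference as soon as it has been computed."""
    image = list(image)
    for ray, measured in zip(rays, data):
        diff = measured - project(ray, image)
        image = [v + c for v, c in zip(image, backproject(ray, diff))]
    return image


def sirt_iteration(rays, data, image):
    """SIRT: compute all differences first, then back-project them together."""
    total = [0.0] * len(image)
    for ray, measured in zip(rays, data):
        diff = measured - project(ray, image)  # always uses the old image
        total = [t + c for t, c in zip(total, backproject(ray, diff))]
    return [v + t / len(rays) for v, t in zip(image, total)]


def iterations_needed(step, rays, data, size, tol=1e-6, limit=1000):
    """Iterate from a zero image until every ray residual is below tol."""
    image = [0.0] * size
    for k in range(1, limit + 1):
        image = step(rays, data, image)
        residual = max(abs(m - project(r, image)) for r, m in zip(rays, data))
        if residual < tol:
            return k, image
    return limit, image


rays = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # toy geometry (hypothetical)
truth = [1.0, 2.0]                            # image to recover
data = [project(r, truth) for r in rays]      # simulated acquisitions

art_k, art_img = iterations_needed(art_iteration, rays, data, len(truth))
sirt_k, sirt_img = iterations_needed(sirt_iteration, rays, data, len(truth))
print("ART iterations:", art_k, "SIRT iterations:", sirt_k)
```

On this toy system both schemes recover the same image, and SIRT requires more iterations than ART, in line with the remark in the SIRT description.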

3. Parallelization and data distribution

The data are divided into two sets: the 2D acquisitions and the 3D image to reconstruct. During the back-projection, each pixel of the 2D acquisitions may contribute to the value of all the voxels of the 3D image. In the same way, during the projection, each voxel may contribute to the value of all the pixels of the 2D projection images. However, each voxel (respectively pixel) is independent of the other voxels (respectively pixels). In a first implementation, we chose to distribute the 3D image in a load-balanced way among the PE processors. The parallel versions of the three algorithms are based on the parallelization of the basic operators. Two approaches have been used:

• The local approach computes the basic operators locally on the data owned by each processor; the acquisitions are therefore exchanged between the processors. For example, to compute a projection of the 3D image, processor q projects its 3D sub-image onto an acquisition Pj and sends Pj to processor (q + 1) mod PE. The projection of the whole 3D image is complete when all processors have received Pj and computed their projection onto it.

• The global approach computes the basic operators through the network. In this approach, each projection is computed using a reduction scheme [10]. To compute the projection onto an acquisition Pj, each processor projects its 3D sub-image onto a partial image P'j. To complete the projection onto Pj, the processors send their partial images P'j to the same processor using a reduction-summation operation.
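The global approach can be sketched as a sequential simulation in which each simulated processor projects its own slab of the 3D image onto a partial image, and the partial images are then combined by a summation. The projection geometry used here (a weighted sum along the slice axis), the pairwise combination order and all function names are assumptions made for illustration; the paper does not specify them.

```python
def slab_bounds(n_slices, pe, q):
    """Load-balanced range of slices owned by processor q (sizes differ by at most 1)."""
    base, extra = divmod(n_slices, pe)
    start = q * base + min(q, extra)
    return start, start + base + (1 if q < extra else 0)


def partial_projection(volume, z0, z1, j):
    """Toy projection of slices z0..z1-1 for view j: a weighted sum along z."""
    image = [0.0] * len(volume[0])
    for z in range(z0, z1):
        weight = 1 + (j + z) % 3  # hypothetical stand-in for the ray geometry
        for p, voxel in enumerate(volume[z]):
            image[p] += weight * voxel
    return image


def reduce_sum(partials):
    """Reduction-summation: add partial images pairwise until one remains."""
    partials = [list(p) for p in partials]
    while len(partials) > 1:
        merged = []
        for i in range(0, len(partials) - 1, 2):
            merged.append([a + b for a, b in zip(partials[i], partials[i + 1])])
        if len(partials) % 2:
            merged.append(partials[-1])  # odd one out waits for the next round
        partials = merged
    return partials[0]


def global_projection(volume, pe, j):
    """Projection j computed by pe simulated processors and one reduction."""
    partials = []
    for q in range(pe):
        z0, z1 = slab_bounds(len(volume), pe, q)
        partials.append(partial_projection(volume, z0, z1, j))  # partial image P'j
    return reduce_sum(partials)


if __name__ == "__main__":
    volume = [[z * 4 + p for p in range(4)] for z in range(7)]  # 7 slices of 4 voxels
    print(global_projection(volume, 3, 0))
```

The result of `global_projection` matches the projection computed on the whole volume by a single call to `partial_projection`, whatever the number of simulated processors.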

In order to improve the efficiency of these parallel methods, we have to minimize the share of the communication time in the total execution time. In the local approach, we have overlapped the exchange of the acquisitions with the local computation on another set of acquisitions. As the projection and back-projection algorithms are similar, we present only the parallelization of the projection operator: figure 1 gives the version without overlap and figure 2 the version with overlap. In these algorithms, m represents the total number of projections to compute, and PE is the number of processors.

Algorithm of processor q

for all Pj ∈ {P_(q·m/PE), ..., P_((q+1)·m/PE − 1)} do
    Create a new projection Pj
    for n = 1 to n = PE do
        Update Pj: compute the projection of the local 3D image
        Send Pj to processor (q + 1) mod PE
        Receive Pj from processor (q − 1) mod PE
    enddo
    Store final projection Pj
enddo

Figure 1. Projection algorithm without overlapping
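The ring scheme of Figure 1 can be simulated sequentially to check that every projection visits all processors and returns complete to the processor that created it. This is a sketch under stated assumptions: the processors advance in lockstep, m and the number of slices are both multiples of PE, and the projection geometry is a toy weighted sum. The function names are hypothetical.

```python
def project_slab(volume, z0, z1, j, image):
    """Add the contribution of slices z0..z1-1 to projection j (toy geometry)."""
    for z in range(z0, z1):
        weight = 1 + (j + z) % 3  # hypothetical stand-in for the cone-beam geometry
        for p, voxel in enumerate(volume[z]):
            image[p] += weight * voxel


def ring_projection(volume, m, pe):
    """Simulate Figure 1 for all processors in lockstep; return the m projections."""
    n, width = len(volume), len(volume[0])
    assert m % pe == 0 and n % pe == 0  # simplifying assumption of this sketch
    per_proc, slab = m // pe, n // pe
    result = [None] * m
    for k in range(per_proc):
        # Processor q creates the k-th projection of its own block.
        held = [(q * per_proc + k, [0.0] * width) for q in range(pe)]
        for _ in range(pe):
            # Update: each processor adds its local sub-image to the Pj it holds.
            for q, (j, image) in enumerate(held):
                project_slab(volume, q * slab, (q + 1) * slab, j, image)
            # Send to (q + 1) mod PE, receive from (q - 1) mod PE.
            held = [held[(q - 1) % pe] for q in range(pe)]
        # After PE steps each projection is complete and back at its creator.
        for q, (j, image) in enumerate(held):
            assert j == q * per_proc + k
            result[j] = image
    return result


if __name__ == "__main__":
    volume = [[z + p for p in range(3)] for z in range(8)]  # 8 slices of 3 voxels
    for j, image in enumerate(ring_projection(volume, m=4, pe=4)):
        print(j, image)
```

Each projection is updated exactly PE times, once by every processor, so the stored result equals the projection of the whole 3D image.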

Algorithm of processor q

number of projections = 0
number of update = 0
while number of projections [...] do
    if nb_recv([...], [...], number of update) = false then
        Create a new projection Pj
        number of update_j = 0
    else
        [...]

Both the Feldkamp and SIRT methods have been parallelized using the local approach, while the parallel algorithm of the Block ART method uses the global approach.
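The benefit of overlapping the exchange of acquisitions with local computation, as done in the local approach, can be estimated with a simple timing model. The model below is an illustration and not a measurement from the paper: it assumes n blocks of acquisitions, a fixed computation time and a fixed transfer time per block, a single communication link, and transfers that proceed without using the processor. The function names are hypothetical.

```python
def time_without_overlap(n, compute, comm):
    """Each block is computed, then sent, before the next one starts."""
    return n * (compute + comm)


def time_with_overlap(n, compute, comm):
    """Block i is sent while block i + 1 is being computed."""
    cpu_free = 0.0   # time at which the processor finishes its current block
    link_free = 0.0  # time at which the link finishes its current transfer
    for _ in range(n):
        cpu_free += compute
        # A transfer starts once its block is computed and the link is idle.
        link_free = max(link_free, cpu_free) + comm
    return link_free


if __name__ == "__main__":
    for compute, comm in ((3, 1), (1, 3)):
        print(compute, comm,
              time_without_overlap(4, compute, comm),
              time_with_overlap(4, compute, comm))
```

In this model the overlapped time equals compute + (n - 1) * max(compute, comm) + comm, so the shorter of the two phases is hidden for all but one block. When the processor itself has to manage the transfers, the assumption of free communication no longer holds and the gain is smaller.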

4. Experiments

All the 3D reconstruction methods have been implemented on different distributed-memory parallel machines using PVM. We present here some results on three of them: a Cray T3D (DEC alpha processors connected via a 3D torus network), an IBM SP2 (Power 1 processors connected via a multi-stage network) and a farm of DEC alpha processors connected via a multi-stage network. Figures 4, 5 and 6 show the execution times of the three methods for reconstructing a 3D image (128³) from 128 2D acquisitions (128²) on, respectively, the Cray T3D, the SP2 and the processor farm. For each experiment, we detail the share of communication in the total execution time. The reported execution times for the ART and SIRT methods correspond to one iteration of reconstruction.

[Figure 4: bar chart of total execution time and communication time (sec) on the T3D for the ART, SIRT and Feldkamp methods at PE = 32, 64 and 128.]

Figure 4. Execution times of the three reconstruction methods on a Cray T3D.

[Figure 5: bar chart of total execution time and communication time (sec) on the SP2 for the ART, SIRT and Feldkamp methods at PE = 8, 16 and 32.]

Figure 5. Execution times of the three reconstruction methods on an IBM SP2.

[Figure 6: bar chart of total execution time and communication time (sec) on the processor farm for the ART, SIRT and Feldkamp methods at PE = 8 and 16.]

Figure 6. Execution times of the three reconstruction methods on a farm of processors.

We can notice that communication is very efficient on the T3D. On the contrary, on the SP2 and on the farm, communication accounts for a large part of the execution time and thus has to be minimized. Moreover, because of the communication, some of the parallel algorithms are not scalable on some machines (see for instance the ART method on the SP2 in figure 5, or the SIRT and Feldkamp methods on the processor farm in figure 6).

We present in figures 7 and 8 parallel versions of two methods which illustrate the optimized global and local approaches to parallelization, respectively.

[Figure 7: two bar charts of total time and communication time (sec), without and with optimization, at PE = 4, 8 and 16. Panels: IBM SP2, image size 64x64x64; DEC alpha farm, image size 32x32x32.]

Figure 7. Comparison of execution times of ART method without and with optimizations on an IBM SP2 and on a DEC alpha farm.

The two panels of figure 7 illustrate the better efficiency of the optimized version of the ART method. We compare two versions of the parallel algorithm: the first one (without optimization) has been implemented using the global combine routine of PVM, and the second one (with optimization) uses an optimized global combine algorithm. On the processor farm, the communication time is reduced by using the optimized communication algorithm. On the SP2, even though the communication time is not reduced significantly, processor idle times decrease.

[Figure 8: two bar charts of total time and communication time (sec), without and with optimization, at PE = 4, 8 and 16. Left panel: SP2, image size 128x128x128. Right panel: DEC alpha farm, image size 128x128x128.]

Figure 8. Comparison of execution times of Feldkamp method without and with optimizations.

For the Feldkamp method, we have implemented a version which overlaps communication with computation. On the SP2, the communication time is greatly reduced (see the left curves in figure 8). The overlap is less efficient on the processor farm. This can be explained by perturbations from other users of the farm's communication network. Moreover, on this machine, no hardware mechanism is dedicated to communication, so the processor itself has to manage the communications. Nevertheless, the implementation with overlapped communication minimizes the idle time of the processors.

5. Conclusion

We have presented in this paper efficient parallel implementations of reconstruction methods for 3D tomography. Using communication optimizations such as overlapping or improved collective communication algorithms, the presented methods lead to scalable parallel algorithms. This allows us to reconstruct images of realistic size. For example, a 256³ image has been reconstructed from 256 acquisitions of size 256² on the Cray T3D with 128 processors in 5 minutes. The same problem is solved in 22 hours on a SUN4 and in 3 hours on an IBM 3090.

REFERENCES

1. H. Charles, J. Li, and S. Miguet, 3D image processing on distributed memory parallel computers, SPIE, 1905 (1993).
2. C. Chen, S. Lee, and Z. Cho, A Parallel Implementation of a 3-D CT Image Reconstruction on Hypercube Multiprocessor, IEEE Transactions on Nuclear Science, 37 (1990), pp. 1333-1346.
3. C. Chen, S. Lee, and Z. Cho, Parallelisation of EM Algorithm for a 3-D PET Image Reconstruction, IEEE Transactions on Medical Imaging, 10 (1991), pp. 513-522.
4. L. Feldkamp, L. Davis, and J. Kress, Practical Cone-beam algorithm, Journal of Opt. Soc. Am., 1 (1984), pp. 612-619.
5. L. Garnero and F. Peyrin, Méthodes de Reconstruction 3D en Tomographie X, tech. rep., GDR TDSI CNRS, France, May 1993. Rapport de synthèse 93-01.
6. P. Gilbert, Iterative Methods for the Three Dimensional Reconstruction of an Object from Projections, Journal Theor. Biol., 36 (1972), pp. 105-117.
7. F. Peyrin, R. Goutte, and M. Amiel, 3D Reconstruction from Cone-beam Projections by Block Iterative Technic, in "SPIE Medical Imaging IV" proceedings, San Jose, CA, Feb. 1991.
8. E. Ritman, J. Kinsey, R. Robb, L. Harris, and B. Gilbert, Physics and technical considerations in the design of the DSR, Journal of Roentgenology, 134 (1980), pp. 369-374.
9. D. Saint-Felix et al., A New System for 3D computerized X-RAY angiography: first in vivo result, in Proceedings of the Annual Conference of the IEEE Engineering in Medicine and Biology Society, 1992, pp. 2051-2052.
10. R. van de Geijn, Massively Parallel LINPACK Benchmark on the Intel Touchstone and iPSC/860 Systems, Computer Science Technical Report TR-91-28, University of Texas, Aug. 1991.
11. E. Zapata, I. Benavides, F. Rivera, J. Brugera, and J. Crazo, Image Reconstruction on Hypercube Computers, in Proceedings of the Third Symposium on the Frontiers of Massively Parallel Computation, 1990, pp. 127-133.