A Parallel Software for the Reconstruction of Dynamic MRI Sequences

G. Landi, E. Loli Piccolomini, and F. Zama

Department of Mathematics, Piazza Porta S. Donato 5, Bologna, Italy
[email protected] http://www.dm.unibo.it/˜piccolom
Abstract. In this paper we present a parallel version of an existing Matlab software for dynamic Magnetic Resonance Imaging which implements a reconstruction technique based on B-spline Reduced-encoding Imaging by Generalized series Reconstruction. The parallel primitives used are provided by MatlabMPI. The parallel Matlab application is tested on a network of Linux workstations.
1 Introduction
In many types of Magnetic Resonance Imaging (MRI) experiments, such as contrast-enhanced imaging and MRI during the course of a medical intervention, it is necessary to acquire the dynamic image series quickly, with high time resolution. The increase in temporal resolution is obtained at the expense of spatial resolution. For example, in a typical functional MRI experiment, series of low resolution images are acquired over time, while only one or two high resolution images are acquired and used later to point out the brain area activated by the external stimuli. In general, the data acquired in an MRI experiment are frequency-encoded in the so-called k-space and the image is reconstructed by means of inverse Fourier transforms. The time sequence of the dynamic MR images is obtained by recovering each image independently. Since only a small part of the structure of the image changes during a dynamic experiment, it is possible to accelerate the data acquisition process by collecting truncated samplings of the data. In particular, in the k-space only the low frequency signal changes dynamically, while the high frequency signal remains essentially unchanged. Consequently, only a small (usually central and symmetric with respect to the origin) subset of the k-space data in the phase encoding direction is acquired with maximum time resolution. Unfortunately, when the data are reduced-encoded, the traditional Fourier methods have many limitations and the reconstructed images present ringing effects at the edges of the objects. For this reason, some methods have been developed for estimating the images of the sequence from a limited amount of sampled data [3], [8]. These methods, which use prior information from the first and last images of the sequence, called reference images, allow good quality reconstructions at a high computational cost. In the method proposed in [9], the reconstructed images are modelled with B-spline functions and are obtained by solving ill-conditioned linear systems with the Tikhonov regularization method. A Matlab software package with a graphical interface, called MRITool¹, has been developed; it implements the reconstruction technique based on B-spline Reduced-encoding Imaging by Generalized series Reconstruction (BRIGR) [9]. For clinical needs, the execution time for the reconstruction of a whole sequence of images is too high. For this reason, in this paper we present a parallel version of the reconstruction method to replace the sequential MRITool computational kernel. Several parallel Matlab toolboxes have been built, which differ from each other in the underlying communication method; an overview of the currently available software for parallel Matlab can be found in [2]. For the development of our Matlab application on parallel distributed memory systems we chose a standard message passing library, the Message Passing Interface (MPI) [4]. As a preliminary step we use MatlabMPI, a set of pure Matlab routines that implements a subset of the MPI functions on top of the Matlab I/O system. It is a very small "pure" Matlab implementation and does not require the installation of MPI. The simplicity and performance of MatlabMPI make it a reasonable choice for speeding up an existing Matlab code on distributed memory parallel architectures. In section 2 the mathematical problem is presented together with the sequential algorithm. In section 3 we describe the parallel algorithm. The numerical results are reported in section 4.

This work was completed with the support of the FIRB Project "Parallel Algorithms and Nonlinear Numerical Optimization" RBAU01JYPN.

J. Dongarra, D. Laforenza, S. Orlando (Eds.): Euro PVM/MPI 2003, LNCS 2840, pp. 511–519, 2003. © Springer-Verlag Berlin Heidelberg 2003
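The effect of the truncated sampling just described can be illustrated in a few lines. The following sketch (in Python with NumPy as a stand-in for the paper's Matlab environment; the phantom image and the sizes are our own illustrative choices) retains only a central, symmetric band of phase-encoding lines of the k-space and reconstructs by zero-filled inverse FFT; the sharp edges of the phantom then show exactly the ringing effects mentioned above:

```python
import numpy as np

def truncated_kspace(image, n_keep):
    """Simulate reduced-encoding acquisition: keep only the n_keep central
    (low-frequency) phase-encoding lines of k-space and zero-fill the rest."""
    k = np.fft.fftshift(np.fft.fft2(image))            # centred k-space
    m = image.shape[0]
    lo = m // 2 - n_keep // 2
    reduced = np.zeros_like(k)
    reduced[lo:lo + n_keep, :] = k[lo:lo + n_keep, :]  # central symmetric subset
    return np.fft.ifft2(np.fft.ifftshift(reduced)).real

# A piecewise-constant "phantom": its sharp edges produce visible Gibbs
# ringing when reconstructed from the truncated data.
img = np.zeros((64, 64))
img[16:48, 16:48] = 1.0
rec = truncated_kspace(img, 12)
```

Increasing `n_keep` toward the full number of phase-encoding lines makes the ringing disappear, at the price of a proportionally longer acquisition.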
2 Numerical Method and Sequential Algorithm
Numerically, the problem described in the introduction can be formulated as follows. At successive times $t_j$, $j = 0, \ldots, Q$, the $j$-th image of the sequence is represented by a function $I_j(x, y)$ which solves:

$$D_j(h, k) = \int_{-\infty}^{+\infty} I_j(x, y)\, e^{-2\pi i (hx + ky)}\, dx\, dy,$$

where $D_j(h, k)$, $j = 0, \ldots, Q$, are the data sets acquired independently in the Fourier space. We suppose that, at times $t_0$ and $t_Q$, two complete discrete high resolution reference sets are available:

$$D_0(\ell \Delta h, m \Delta k) \quad \text{and} \quad D_Q(\ell \Delta h, m \Delta k), \qquad \begin{array}{l} \ell = -N/2, \ldots, N/2 - 1 \\ m = -M/2, \ldots, M/2 - 1 \end{array}$$
¹ MRITool 2.0 can be downloaded at http://www.dm.unibo.it/~piccolom/WebTool/ToolFrame.htm
where $N$ is the number of frequency encodings measured at intervals $\Delta h$ and $M$ is the number of phase encodings measured at intervals $\Delta k$. At times $t_j$, $j = 1, \ldots, Q - 1$, we have reduced spatial resolution; the sets $D_j(\ell \Delta h, m \Delta k)$, $\ell = -N/2, \ldots, N/2 - 1$, $m = -N_m/2, \ldots, N_m/2 - 1$, $N_m \ll M$, are called dynamic sets. For $j = 0$ and $j = Q$ we can apply a 2D discrete inverse Fourier transform to the reference sets $D_0(\ell \Delta h, m \Delta k)$ and $D_Q(\ell \Delta h, m \Delta k)$ and compute the discrete image functions $I_0(\ell \Delta x, m \Delta y)$ and $I_Q(\ell \Delta x, m \Delta y)$. Since the dynamic data $D_j(\ell \Delta h, m \Delta k)$ are undersampled only along the $k$ direction, by applying a discrete inverse Fourier transform along the $h$ direction we obtain the discrete functions:

$$\hat{D}_j(\ell \Delta x, m \Delta k) = \int_{-\infty}^{+\infty} I_j(\ell \Delta x, y)\, e^{-2\pi i (m \Delta k) y}\, dy, \qquad j = 1, \ldots, Q - 1. \tag{1}$$
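In discrete terms, relation (1) amounts to one inverse FFT per phase-encoding column of the reduced data set. A minimal sketch (Python/NumPy as a stand-in for the Matlab code; the random array `D_j` and its sizes are illustrative):

```python
import numpy as np

# Illustrative reduced dynamic data set: N frequency encodings (rows) by
# N_m phase encodings (columns), the sizes used later in the paper.
N, Nm = 256, 19
rng = np.random.default_rng(0)
D_j = rng.standard_normal((N, Nm)) + 1j * rng.standard_normal((N, Nm))

# Eq. (1): a 1D inverse Fourier transform along the h (row) direction,
# applied independently to each of the N_m phase-encoding columns.
D_hat_j = np.fft.ifft(D_j, axis=0)
```

The `axis=0` call is equivalent to looping the 1D IFFT over the columns, which is exactly the inner loop of the sequential algorithm below.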
We represent the dynamic changes by means of the difference function:

$$\hat{D}_j^{(\ell)}(m \Delta k) \equiv \hat{D}_j^{(\ell)}(m \Delta k) - \hat{D}_0^{(\ell)}(m \Delta k),$$

where, on the right-hand side, $\hat{D}_j^{(\ell)}(m \Delta k) \equiv \hat{D}_j(\ell \Delta x, m \Delta k)$. Using the relation (1) we obtain:

$$\hat{D}_j^{(\ell)}(m \Delta k) = \int_{-\infty}^{\infty} \big( I_j^{(\ell)}(y) - I_0^{(\ell)}(y) \big)\, e^{-2\pi i (m \Delta k) y}\, dy, \qquad \begin{array}{l} j = 1, \ldots, Q - 1 \\ \ell = -N/2, \ldots, N/2 - 1 \end{array} \tag{2}$$

where $I_j^{(\ell)}(y) \equiv I_j(\ell \Delta x, y)$. In order to reconstruct the whole image sequence we have to solve $N \cdot (Q - 1)$ integral equations (2). In our method, proposed in [9], we represent the unknown function (in this case the difference between the image $I_j$ and the reference image $I_0$) using cubic B-spline functions:

$$I_j^{(\ell)}(y) - I_0^{(\ell)}(y) = G^{(\ell)}(y) \sum_{p=0}^{N_m - 1} \alpha_p^{(j,\ell)} B_p(y) \tag{3}$$

where $G^{(\ell)}(y) = |I_Q^{(\ell)}(y) - I_0^{(\ell)}(y)|$ accounts for the given a priori information and the set $\{B_0(y), \ldots, B_{N_m - 1}(y)\}$ is the basis of cubic B-spline functions (see [9] for the details). Hence, problem (2) leads to the solution of the following linear equation:

$$\hat{D}_j^{(\ell)}(m \Delta k) = \sum_{p=0}^{N_m - 1} \alpha_p^{(j,\ell)} \sum_{q=0}^{N_m - 1} \mathcal{F}(G^{(\ell)})\big((m - q)\Delta k\big)\, \mathcal{F}(B_p)(q \Delta k) \tag{4}$$

where $\mathcal{F}(f)$ represents the Fourier transform of $f$. Introducing the square matrices:

$$\big( H^{(\ell)} \big)_{s,t} = \mathcal{F}(G^{(\ell)})\big((s - t)\Delta k\big), \qquad s, t = 0, \ldots, N_m - 1, \tag{5}$$

$$(B)_{u,v} = \mathcal{F}(B_v)(u \Delta k), \qquad u, v = 0, \ldots, N_m - 1,$$
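Since the entries of $H^{(\ell)}$ in (5) depend only on the difference $s - t$, $H^{(\ell)}$ is a Toeplitz matrix and can be assembled directly from the sampled values of $\mathcal{F}(G^{(\ell)})$. A small sketch (Python/NumPy; the spectra `Fg` and `FB` are random placeholders for the actual FFTs of $G^{(\ell)}$ and of the B-spline basis):

```python
import numpy as np

# Illustrative size and stand-in spectra: in the real computation Fg holds
# the samples F(G^(l))(d*dk) for d = -(Nm-1), ..., Nm-1, and FB[u, v] =
# F(B_v)(u*dk) comes from FFTs of the cubic B-spline basis functions.
Nm = 19
rng = np.random.default_rng(1)
Fg = rng.standard_normal(2 * Nm - 1) + 1j * rng.standard_normal(2 * Nm - 1)
FB = rng.standard_normal((Nm, Nm)) + 1j * rng.standard_normal((Nm, Nm))

# H[s, t] = F(G^(l))((s - t)*dk) depends only on s - t, so H is Toeplitz;
# the shift d = s - t is mapped to the index d + Nm - 1 of Fg.
idx = np.arange(Nm)
H = Fg[(idx[:, None] - idx[None, :]) + Nm - 1]

# Coefficient matrix of eq. (6): A^(l) = H^(l) B.
A = H @ FB
```

The Toeplitz structure means only $2N_m - 1$ values of $\mathcal{F}(G^{(\ell)})$ have to be computed per image row, not $N_m^2$.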
INPUT: First reference set: D_0(ℓ∆h, m∆k) of N·M values; Q−1 intermediate dynamic sets D_j(ℓ∆h, m∆k), each of N·N_m values; Last reference set: D_Q(ℓ∆h, m∆k) of N·M values.
(1) Compute the N_m × N_m matrix B through N_m FFTs of vectors of length N_m
(2) for j = 1, …, Q−1
    (2.1) for m = −N_m/2, …, N_m/2 − 1
        (2.1.1) D̂_j(·, m∆k) = IFFT(D_j(·, m∆k))
(3) Compute I_0(ℓ∆x, m∆y) = IFFT2(D_0(ℓ∆h, m∆k))
(4) Compute I_Q(ℓ∆x, m∆y) = IFFT2(D_Q(ℓ∆h, m∆k))
(5) for ℓ = −N/2, …, N/2 − 1
    (5.1) Compute G^(ℓ)(y)
    (5.2) Compute the matrices H^(ℓ) and A^(ℓ) (eqs. (5) and (6))
    (5.3) Compute the Singular Value Decomposition of A^(ℓ)
    (5.4) for j = 1, …, Q−1
        (5.4.1) Solve the system A^(ℓ) α_j^(ℓ) = d_j^(ℓ)
        (5.4.2) Represent the function: I_j^(ℓ)(m∆y) − I_0^(ℓ)(m∆y) = G^(ℓ)(m∆y) Σ_{p=0}^{N_m−1} α_p^(j,ℓ) B_p(m∆y)
OUTPUT: High resolution sequence: I_j(ℓ∆x, m∆y) − I_0(ℓ∆x, m∆y), j = 1, …, Q−1

Fig. 1. BRIGR Algorithm
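The control flow of figure 1 can be condensed into a short skeleton. This is an illustrative Python/NumPy transcription, not the MRITool code itself: the per-row regularized solve of steps (5.1)–(5.4) is abstracted behind a caller-supplied function `solve_row`, and the data are random placeholders:

```python
import numpy as np

def brigr_sequential(D0, DQ, D_dyn, solve_row):
    """Skeleton of the BRIGR algorithm of Fig. 1.

    D0, DQ    : N x M reference k-space sets
    D_dyn     : list of Q-1 reduced N x N_m dynamic sets
    solve_row : callback doing steps (5.1)-(5.4) for one frequency encoding
                l, returning the Q-1 reconstructed difference rows
    """
    # Steps (2)-(4): column-wise IFFTs of the dynamic sets and 2D IFFTs of
    # the two reference sets.
    D_hat = [np.fft.ifft(Dj, axis=0) for Dj in D_dyn]
    I0 = np.fft.ifft2(D0)
    IQ = np.fft.ifft2(DQ)
    # Step (5): one independent regularized solve per image row l.
    N = D0.shape[0]
    return np.array([solve_row(l, D_hat, I0, IQ) for l in range(N)])

# Tiny smoke run with random data and a dummy per-row solver.
N, M, Nm, Q = 8, 6, 4, 4
rng = np.random.default_rng(2)
D0 = rng.standard_normal((N, M))
DQ = rng.standard_normal((N, M))
D_dyn = [rng.standard_normal((N, Nm)) for _ in range(Q - 1)]
seq = brigr_sequential(D0, DQ, D_dyn,
                       lambda l, Dh, I0, IQ: np.zeros((Q - 1, M)))
```

The point of the skeleton is the structure of step (5): the $N$ per-row solves share no data, which is what the parallel version of section 3 exploits.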
we can write (4) as a linear system of $N_m$ equations:

$$A^{(\ell)} \alpha_j^{(\ell)} = d_j^{(\ell)}$$

with right-hand side $d_j^{(\ell)} = \big( \hat{D}_j^{(\ell)}(0), \ldots, \hat{D}_j^{(\ell)}((N_m - 1)\Delta k) \big)^T$, coefficient matrix

$$A^{(\ell)} = H^{(\ell)} B \tag{6}$$

and unknowns $\alpha_j^{(\ell)} = \big( \alpha_0^{(j,\ell)}, \ldots, \alpha_{N_m - 1}^{(j,\ell)} \big)^T$. The algorithm for reconstructing the whole sequence is described in figure 1; its total computational cost is:

$$FLOPS_{seq} = FLOPS_{FT} + FLOPS_{Reg} \tag{7}$$
where $FLOPS_{FT}$ counts the flops of the fast Fourier transforms (FFT) and inverse fast Fourier transforms (IFFT) required in steps (1), (2), (3), (4) and (5.2) of figure 1. Using the Matlab FFT function we have:

$$FLOPS_{FT} \propto N_m^2 \log_2(N_m) + (Q - 1) N_m N \log_2(N) + 2 N M \log_2(NM) + N M \log_2(M) \tag{8}$$
The term $FLOPS_{Reg}$ accounts for the computational cost of the regularization algorithm required in steps (5.3) and (5.4.1) of figure 1. In our
program, $FLOPS_{Reg}$ refers to the Tikhonov regularization method, which requires the computation of the Singular Value Decomposition (SVD) of the matrix $A^{(\ell)}$ and the choice of the optimal regularization parameter $\lambda$ by means of the Generalized Cross Validation (GCV) method. Then

$$FLOPS_{Reg} = N\, FLOPS_{SVD} + N(Q - 1)\, FLOPS_{\lambda}$$

where [1]:

$$FLOPS_{SVD} \propto \left( 5 + \frac{11}{3} \right) N_m^3 \sim 9 N_m^3.$$

Using the primitives of the Regularization Tools Matlab package [5,6], we found that, in our application, $FLOPS_{\lambda} \propto 25\, FLOPS_{SVD}$, and then

$$FLOPS_{Reg} \propto 9 N \cdot N_m^3 \big( 1 + 25(Q - 1) \big) \tag{9}$$
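For reference, the two regularization ingredients just mentioned can be written compactly in terms of the SVD. The sketch below (Python/NumPy; our own formulation of the standard Tikhonov filter factors and of the GCV function for a square full-rank system, following the conventions of the Regularization Tools package but not its actual code):

```python
import numpy as np

def tikhonov_svd(A, d, lam):
    """Tikhonov solution of min ||A x - d||^2 + lam^2 ||x||^2 through the
    SVD of A: components are damped by the filter factors s^2/(s^2+lam^2)."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    beta = U.conj().T @ d
    return Vt.conj().T @ (s / (s**2 + lam**2) * beta)

def gcv(A, d, lam):
    """Generalized Cross Validation function G(lam) for a square full-rank
    A; its minimizer over lam is the parameter choice used by the method."""
    U, s, _ = np.linalg.svd(A, full_matrices=False)
    beta = U.conj().T @ d
    w = lam**2 / (s**2 + lam**2)        # 1 - filter factors
    return np.sum(np.abs(w * beta)**2) / np.sum(w)**2

# With a well-conditioned A and a tiny lam, the Tikhonov solution reduces
# to the plain solution of the linear system.
rng = np.random.default_rng(3)
A = rng.standard_normal((6, 6)) + 6.0 * np.eye(6)
x_true = rng.standard_normal(6)
x_rec = tikhonov_svd(A, A @ x_true, 1e-8)
```

In practice $G(\lambda)$ is minimized over a grid (or by a 1D minimizer) once per image row, which is why the per-row cost is dominated by the SVD and the repeated parameter search.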
In figure 2 we compare the total computational cost with its two components $FLOPS_{FT}$ and $FLOPS_{Reg}$. The greatest computational workload is given by the contribution of $FLOPS_{Reg}$: the parallel splitting of this part of the algorithm can therefore significantly improve the efficiency of the whole application.
3 Parallel Algorithm
The parallel version of the algorithm reported in figure 1 is obtained by observing that the computations in the loop at step (5) are completely independent of each other and can be distributed among different processors. The parallel algorithm is implemented on a master-slave basis, using message passing primitives for the data communications. The master and slave algorithms are reported in figures 4 and 5, respectively. Let $P$ be the number of homogeneous slave processors; the computational cost of the parallel algorithm can then be described as:

$$FLOPS_{par} = FLOPS_{master} + FLOPS_{slave} + FLOPS_{comm} \tag{10}$$
where $FLOPS_{comm}$ is the number of floating point operations that could be performed during the time spent in communication. Observing the structure of the algorithm and recalling (9) and (8) we have:

$$FLOPS_{master} \propto 2 N M \log_2(NM) + (Q - 1) N_m N \log_2(N)$$

$$FLOPS_{slave} \propto \frac{N}{P} \left( M \log_2(M) + 9 N_m^3 \big( 1 + 25(Q - 1) \big) \right)$$
In this last case, the FLOPS required for the computation of the matrix $B$ (step (1) in figure 5) are not taken into account, since they are completely overlapped by $FLOPS_{master}$.
Fig. 2. Values of $FLOPS_{seq}$ in the case: Q = 58, M ∼ 0.3N, N_m ∼ 0.3M

Fig. 3. Values of the asymptotic speedup in the case: Q = 58, N = 256, M = 70, N_m = 19
The parallel performance is measured by means of the speedup as a function of the number of processors used:

$$Su(P) = \frac{time_{sequential}}{time_{parallel}}.$$

Using relations (7) and (10), we define the asymptotic speedup by ignoring the communication time:

$$S^*(P) = \frac{FLOPS_{seq}}{FLOPS_{master} + FLOPS_{slave}}.$$
Analyzing the plot of $S^*(P)$ (figure 3), we notice that nearly optimal speedups are obtained for values of $P$ up to about 100.
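The asymptotic model is easy to evaluate numerically. The sketch below (ours; it simply transcribes eqs. (8), (9) and the master/slave counts, with $FLOPS_{comm}$ ignored) computes $S^*(P)$ for the experimental sizes used in section 4:

```python
import numpy as np

def asymptotic_speedup(N, M, Nm, Q, P):
    """S*(P) = FLOPS_seq / (FLOPS_master + FLOPS_slave), transcribing
    eqs. (8)-(9) and the master/slave counts; communication is ignored."""
    lg = np.log2
    flops_ft = (Nm**2 * lg(Nm) + (Q - 1) * Nm * N * lg(N)
                + 2 * N * M * lg(N * M) + N * M * lg(M))
    flops_reg = 9 * N * Nm**3 * (1 + 25 * (Q - 1))
    flops_master = 2 * N * M * lg(N * M) + (Q - 1) * Nm * N * lg(N)
    flops_slave = (N / P) * (M * lg(M) + 9 * Nm**3 * (1 + 25 * (Q - 1)))
    return (flops_ft + flops_reg) / (flops_master + flops_slave)

# Parameters of the experiment reported in section 4.
s4 = asymptotic_speedup(256, 70, 19, 58, 4)
s16 = asymptotic_speedup(256, 70, 19, 58, 16)
s100 = asymptotic_speedup(256, 70, 19, 58, 100)
```

For these sizes the model stays within a few percent of the ideal speedup $P$ well beyond the 16 slaves actually used, consistent with figure 3.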
4 Numerical Results
The experiments have been performed on a cluster of 17 Pentium III 600 MHz PCs with 256 MB RAM, connected through a 10 Mbit/s network. The PCs are equipped with Matlab 6.5 and MatlabMPI version 0.95 [7], which provides the message passing primitives. The parallel algorithm has been tested on real dynamic MRI data. The data sequence consists of two reference data sets of 256 × 70 samples and of 57 dynamic data sets of 256 × 19 samples. In the notation of the previous sections: N = 256, M = 70, Q = 58, N_m = 19. The total time for the parallel algorithm is given as the sum of the times necessary for the following algorithm chunks:
MASTER
INPUT: First reference set: D_0(ℓ∆h, m∆k) of N·M values; Q−1 intermediate dynamic sets D_j(ℓ∆h, m∆k), each of N·N_m values; Last reference set: D_Q(ℓ∆h, m∆k) of N·M values.
(1) for j = 1, …, Q−1
    (1.1) for m = −N_m/2, …, N_m/2 − 1
        (1.1.1) D̂_j(·, m∆k) = IFFT(D_j(·, m∆k))
(2) Compute I_0(ℓ∆x, m∆y) = IFFT2(D_0(ℓ∆h, m∆k))
(3) Compute I_Q(ℓ∆x, m∆y) = IFFT2(D_Q(ℓ∆h, m∆k))
(4) for ℓ = −N/2, …, N/2 − 1
    (4.1) Compute G^(ℓ)(m∆y) and D̂_j^(ℓ)(η∆k)
(5) N̄ = N/NumSlaves
(6) for Np = 1 : NumSlaves
    (6.1) Send to Slave Np:
          G^(ℓ)(m∆y), m = −M/2, …, M/2 − 1
          D̂_j^(ℓ)(η∆k), η = −N_m/2, …, N_m/2 − 1, j = 1, …, Q−1
          for ℓ = (Np − 1)·N̄ + 1 : Np·N̄
    (6.2) Receive from Slave Np:
          (I_j^(ℓ) − I_0^(ℓ))(m∆y), m = −M/2, …, M/2 − 1, j = 1, …, Q−1,
          ℓ = (Np − 1)·N̄ + 1 : Np·N̄
OUTPUT: High resolution sequence: I_j(ℓ∆x, m∆y) − I_0(ℓ∆x, m∆y), j = 1, …, Q−1

Fig. 4. Parallel Algorithm: Master Processor
1. the sequential operations in the master program ((1)-(4) in figure 4);
2. the broadcasting of the input data from the master to the slaves ((6.1) in figure 4);
3. the computation in the slave programs ((3) in figure 5);
4. the sending of the computed results from each slave to the master ((4) in figure 5).

We present here the results obtained for the available data on our cluster; we have tested the application on 4, 8 and 16 slaves, and the computational times in seconds are reported in table 1. The reconstructed images are not shown here, but they can be reproduced using MRITool 2.0 or found in [9]. The time for chunk 1 is constant, about 0.65 seconds. The time for the broadcast in chunk 2 (t.b.) is reported in the first row of table 1: the message length decreases but the overhead increases from 4 to 16 nodes, hence this time oscillates. The time for chunk 3 (t.s.) is dominant over the others. It depends on the processor power and is inversely proportional to the number of processors, as is evident from table 1. Indeed, the aim of this parallel
SLAVE Np
(1) Compute the N_m × N_m matrix B.
(2) Receive from Master:
    G^(ℓ)(m∆y), m = −M/2, …, M/2 − 1
    D̂_j^(ℓ)(η∆k), η = −N_m/2, …, N_m/2 − 1, j = 1, …, Q−1
    for ℓ = (Np − 1)·N̄ + 1 : Np·N̄
(3) for ℓ = (Np − 1)·N̄ + 1 : Np·N̄
    (3.3) Compute the matrix A^(ℓ) = H^(ℓ) · B
    (3.4) Compute the SVD of A^(ℓ)
    (3.5) for j = 1, …, Q−1
        (3.6.1) Solve A^(ℓ) α_j^(ℓ) = d_j^(ℓ)
        (3.6.2) Compute I_j^(ℓ)(m∆y) − I_0^(ℓ)(m∆y) = G^(ℓ)(m∆y) Σ_{p=0}^{N_m−1} α_p^(j,ℓ) B_p(m∆y)
(4) Send to Master:
    (I_j^(ℓ) − I_0^(ℓ))(m∆y), m = −M/2, …, M/2 − 1, j = 1, …, Q−1,
    ℓ = (Np − 1)·N̄ + 1 : Np·N̄

Fig. 5. Parallel Algorithm: Slave Processor
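The exchange pattern of figures 4 and 5 can be mimicked outside MatlabMPI. In the sketch below (Python; a thread pool stands in for MatlabMPI's MPI_Send/MPI_Recv, and the per-row regularized solve is replaced by a tagged placeholder), the master splits the row indices ℓ into contiguous blocks of size N̄ = N/NumSlaves, dispatches them, and reassembles the results in their original order:

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def slave_work(rows, Q):
    """Stand-in for steps (3)-(4) of Fig. 5: one (Q-1)-entry result per
    assigned frequency encoding l (here just a row tagged with l)."""
    return [(l, np.full(Q - 1, float(l))) for l in rows]

def master(N, Q, num_slaves):
    """Steps (5)-(6) of Fig. 4: split the N row indices into num_slaves
    contiguous blocks, dispatch them, and reassemble the results in order."""
    blocks = np.array_split(np.arange(N), num_slaves)
    with ThreadPoolExecutor(max_workers=num_slaves) as pool:
        parts = pool.map(slave_work, blocks, [Q] * num_slaves)
    collected = {l: row for part in parts for (l, row) in part}
    return np.vstack([collected[l] for l in range(N)])

result = master(N=16, Q=5, num_slaves=4)
```

Because the row blocks are disjoint and each per-row result carries its own index ℓ, the master can reassemble the sequence regardless of the order in which the slaves finish.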
application is to decrease the execution time by splitting the computational workload of chunk 3 among an increasing number of processors. If the Tikhonov regularization method is used together with the GCV method for the choice of the regularization parameter (from the Regularization Tools Matlab package [5,6]), then the reconstruction time of a single image row ((3.6.1) and (3.6.2) in figure 5) is about 0.4 seconds on a Pentium III 600 MHz. The time for the execution of the whole chunk 3 is reported in row 2 of table 1. The time for executing chunk 4 (t.f.) is the time for the communication of one message between two nodes, since we can presume that the slaves do not send to the master concurrently. Finally, the last row of the table shows the total time for the described application; figure 6 plots the obtained speedup, which is almost equal to the ideal one. If the application is executed on Pentium IV 1.5 GHz processors, it again scales very well up to 16 processors, even though the time for the execution of chunk 3 is reduced by about 90%. This agrees with the theoretical results predicted in figure 3.
P                                  4      8     16
broadcasting time (t.b.)          0.85   0.65   0.8
slave computational time (t.s.)   1429    715   357
sending time (t.f.)                0.9    0.6    0.3
total time                        1431    716   358

Table 1. Execution times in seconds on P processors

Fig. 6. Values of the measured speedup in the case: Q = 58, N = 256, M = 70, N_m = 19

5 Conclusions

In this paper we presented a parallel software for dynamic MRI reconstruction. The computational cost of the parallel application has been analyzed in terms of
the number of floating point operations, varying the number of processors. The parallel application has been tested on real MR data on a network of Linux workstations using MatlabMPI primitives. The results obtained completely agree with the predictions, giving optimal speedups with up to 20 processors. In future work we plan to test our application on larger and more powerful parallel computing environments, investigating different parallel Matlab implementations based on MPI.
References

1. A. Björck, Numerical Methods for Least Squares Problems, SIAM, 1996.
2. J. Fernández-Baldomero, Message Passing under Matlab, Proceedings of HPC 2001 (Seattle, WA) (Adrian Tentner, ed.), 2001, pp. 73–82.
3. E. Loli Piccolomini, F. Zama, G. Zanghirati, A.R. Formiconi, Regularization methods in dynamic MRI, Applied Mathematics and Computation 132, n. 2 (2002), 325–339.
4. MPI Forum, MPI: a message passing interface standard, International Journal of Supercomputer Applications 8 (1994), 3–4.
5. P.C. Hansen, Regularization Tools: A Matlab package for analysis and solution of discrete ill-posed problems, Numerical Algorithms 6 (1994), 1–35.
6. P.C. Hansen, Regularization Tools 3.1, http://www.imm.dtu.dk/~pch/Regutools/index.html, 2002.
7. J. Kepner, Parallel programming with MatlabMPI, http://www.astro.princeton.edu/~jvkepner/, 2001.
8. A.R. Formiconi, E. Loli Piccolomini, S. Martini, F. Zama, G. Zanghirati, Numerical methods and software for functional Magnetic Resonance Images reconstruction, Annali dell'Università di Ferrara, sez. VII Scienze Matematiche, suppl. vol. XLVI, Ferrara, 2000.
9. E. Loli Piccolomini, G. Landi, F. Zama, A B-spline parametric model for high resolution dynamic Magnetic Resonance Imaging, Tech. report, Department of Mathematics, University of Bologna, Piazza Porta S. Donato 5, Bologna, Italy, March 2003.