Summary
This work aims to complete the analysis of the Recursive Decoupling (RD) algorithm for tridiagonal linear systems. The method, intrinsically parallel and scalable, finds a natural implementation on CRAY architectures of the MPP generation, such as the T3E. Recently, a demand has arisen to use this solver in large application problems related to the engineering of microelectronic circuits and to the reconstruction and restoration of medical images. The need for a version portable to other distributed systems justified revising the RD algorithm so that it could be programmed in the Message Passing model. Thanks to the CINECA grant, it was possible to develop a first PVM version of the algorithm on the CRAY T3E, so as to make the comparison with the earlier shared-memory implementations on the CRAY T3D as consistent as possible. The performance of the RD method on the CRAY T3E is evaluated in terms of its scalability and the accuracy of the results obtained.
Abstract
This work completes the analysis of the Recursive Decoupling (RD) solver for tridiagonal linear systems. The method, intrinsically parallel and scalable, finds a natural implementation on CRAY architectures of the MPP generation, such as the T3E. Recently, a request has arisen to use this technique in the solution of applied problems related to the engineering of microwave circuits and to the reconstruction and restoration of medical images. The need for an RD version portable to various distributed systems justified remodeling the RD algorithm to make a Message Passing (MP) implementation feasible. Thanks to the CINECA grant, it has been possible to develop a PVM version of the RD solver for the CRAY T3E, which also allows a consistent comparison between the current MP implementation and the previous CRAY T3D shared-memory versions. The performance of the RD method on the CRAY T3E is then evaluated in terms of its scalability and the accuracy of the results.
1. The algorithm
Here we summarize the concepts on which the RD technique is based; more details can be found in [1, 2]. Let $A$ be a diagonally dominant tridiagonal matrix of dimension $n = k \times 2^q$ ($q$, $k$ integers). The related square system

$$A u = d \qquad (1)$$

is solved by partitioning $A$ into an easily invertible block diagonal matrix $J$ and a sum of $m-1$ rank-one matrices $x^{(j)} y^{(j)T}$, having set $m = n/2$. A first approximation to the solution vector $u$ is given by solving

$$J u = d \qquad (2)$$

At the same time the following $m-1$ systems are solved:

$$J g^{(j)} = x^{(j)}, \quad j = 1, \dots, m-1 \qquad (3)$$
The partitioning of $A$ is mirrored in the vectors $u$ and $g^{(j)}$, i.e. these vectors are partitioned into subvectors whose size is that of the diagonal blocks of $J$. The recursive use of the Sherman-Morrison formula then yields the solution of the original tridiagonal system (1). The $n$-dimensional vectors $g^{(j)}$ are recursively updated and used to update $u$, which provides the desired solution when the final step is reached. RD is a direct method, so the number of steps is finite; it equals $\log_2(m)$, owing to the sparsity of the matrices involved, the dimension $n$ being a power of 2, and the fan-in pattern of the updating. Furthermore, the sparsity and dimension of the problem make it possible to store all the vectors $u$, $g^{(j)}$ in $\log_2(n)$ auxiliary vectors $v^{(k)}$, each of dimension $n$. All the updating operations can be performed in parallel, maintaining the data partition chosen during the solution of (2) and (3), with minimal need for communication among tasks. In fact, systems (2) and (3) can also be solved in parallel in partitioned form.
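The mechanics of the decoupling and of the Sherman-Morrison updates can be illustrated with a minimal Python sketch. It is not the authors' C/PVM code and rests on several assumptions: the diagonal blocks of J are taken to be 2 x 2 (consistent with m = n/2), the rank-one vectors x(j), y(j) are one possible choice (the actual ones are given in [1, 2]), and the names `tridiag_matvec` and `rd_style_solve` are hypothetical. For clarity the m-1 corrections are applied one after another, not in the log2(m) fan-in steps of RD, using (B + x y^T)^{-1} r = B^{-1} r - B^{-1} x (y^T B^{-1} r) / (1 + y^T B^{-1} x).

```python
def tridiag_matvec(a, b, c, u):
    """Multiply a tridiagonal matrix by u.
    a = sub-diagonal (a[0] unused), b = diagonal, c = super-diagonal (c[-1] unused)."""
    n = len(b)
    out = []
    for i in range(n):
        s = b[i] * u[i]
        if i > 0:
            s += a[i] * u[i - 1]
        if i < n - 1:
            s += c[i] * u[i + 1]
        out.append(s)
    return out


def rd_style_solve(a, b, c, d):
    """Solve A u = d by block decoupling plus sequential Sherman-Morrison updates.
    Assumption: A = J + sum_j x_j y_j^T, with J made of 2x2 diagonal blocks."""
    n = len(b)
    assert n % 2 == 0, "this sketch needs an even dimension"
    m = n // 2

    # Coupling j links row p = 2j+1 and row p+1. Assumed choice:
    # x_j = c[p] e_p + a[p+1] e_{p+1},  y_j = e_p + e_{p+1},
    # so J keeps the in-block entries and has two modified diagonal entries.
    diag = list(b)
    for j in range(m - 1):
        p = 2 * j + 1
        diag[p] -= c[p]
        diag[p + 1] -= a[p + 1]

    def solve_J(r):
        """Solve J z = r, one independent 2x2 block at a time (Cramer's rule)."""
        z = [0.0] * n
        for i in range(m):
            r0, r1 = 2 * i, 2 * i + 1
            det = diag[r0] * diag[r1] - c[r0] * a[r1]
            z[r0] = (diag[r1] * r[r0] - c[r0] * r[r1]) / det
            z[r1] = (diag[r0] * r[r1] - a[r1] * r[r0]) / det
        return z

    # First approximation J u = d and the m-1 auxiliary systems J g_j = x_j.
    u = solve_J(d)
    g = []
    for j in range(m - 1):
        p = 2 * j + 1
        x = [0.0] * n
        x[p], x[p + 1] = c[p], a[p + 1]
        g.append(solve_J(x))

    # Apply the rank-one corrections one by one (sequential, not fan-in).
    for j in range(m - 1):
        p = 2 * j + 1
        gj = g[j]
        denom = 1.0 + gj[p] + gj[p + 1]          # 1 + y_j^T g_j
        su = (u[p] + u[p + 1]) / denom           # y_j^T u / denom
        u = [ui - su * gi for ui, gi in zip(u, gj)]
        for l in range(j + 1, m - 1):            # keep the later g_l consistent
            sl = (g[l][p] + g[l][p + 1]) / denom
            g[l] = [gli - sl * gi for gli, gi in zip(g[l], gj)]
    return u


if __name__ == "__main__":
    n = 8
    a = [0.0] + [1.0] * (n - 1)
    b = [4.0] * n
    c = [1.0] * (n - 1) + [0.0]
    u_true = [float(i + 1) for i in range(n)]
    u = rd_style_solve(a, b, c, tridiag_matvec(a, b, c, u_true))
    print(max(abs(p - q) for p, q in zip(u, u_true)))
```

This toy version stores every g(j) as a full vector and therefore costs far more than RD itself; it is meant only to show why solving with J and correcting with rank-one terms recovers the solution of (1).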
The procedure outlined gives rise to an intrinsically parallel and scalable algorithm. The current version uses the C programming language and the Message Passing programming model, to answer the request for a wider use of this solver. Results of numerical testing and the analysis of the performance obtained are given next, including a qualitative comparison with analogous testing of the CRAY T3D shared memory implementation.
2. PVM Recursive Decoupling
The C implementation of the RD method uses the routines of the PVM library devoted to communication among PEs (Processing Elements). In addition to portability, the aims of this implementation are those sought by every parallel program: efficiency, speed-up, scalability and workload balancing. The SPMD programming model, required on the T3E, consists of one main program running on each processor; the master-slave paradigm is put into effect by placing the following instruction at the end of the main program:

    if (PE_id == 0) then [master code] else [slave code]

Data assignment (splitting) is crucial to the message passing version of the RD algorithm. All the structures involved, that is, the auxiliary vectors $v^{(k)}$, are partitioned into subvectors $v_i^{(k)}$ of equal length (given by the ratio between the problem dimension n and the number npes of processors used), which are local to each PE. The vectors $v^{(k)}$ are then updated by each PE updating its own substructure. The need to communicate partial scalar results among groups of PEs justifies the choice of the master-slave programming model, with one PE dedicated to executing the master code, to avoid the overhead due to synchronization among such groups. The master code has the following tasks:

- it reads the input data A and d from an external file;
- it subdivides A and d and sends their parts to the slaves;
- it collects the local results from each group of processors;
- it computes and sends the global results to each group of slaves;
- it receives the final result and writes it to an external file.
The kernel of the algorithm, corresponding to the updating procedure, basically consists of three nested loops. Denoting by the index k each step (level) of this procedure, the following scheme holds for each slave task:

    for each level k
        if all data needed for the updating are local
            for each remaining level j
                update vi(k)
        else
            for each remaining level j
                compute its partial scalar result
                send it to the master
                receive the global scalar result
                use it to update vi(k)

where k = 1 .. log2(m), j = k+1 .. log2(m) and i = 1 .. n/npes.
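The exchange of partial scalars in the non-local branch can be mimicked in a single process. The sketch below is a plain Python simulation with no PVM calls; it assumes that the scalars being exchanged are the inner products of a Sherman-Morrison correction, and all function names are hypothetical. Each "slave" owns a contiguous subvector of length n/npes, computes its partial scalar, and updates its own part once the "master" has combined the partial results.

```python
def split_equal(v, npes):
    """Partition v into npes contiguous subvectors of equal length n/npes."""
    n = len(v)
    assert n % npes == 0, "n must be a multiple of npes"
    size = n // npes
    return [v[r * size:(r + 1) * size] for r in range(npes)]


def slave_partial(y_local, v_local):
    """Slave side: partial scalar result computed from local data only."""
    return sum(yi * vi for yi, vi in zip(y_local, v_local))


def master_combine(partials):
    """Master side: global scalar obtained from the partial results."""
    return sum(partials)


def slave_update(v_local, g_local, scale):
    """Slave side: update the local subvector with the global scalar."""
    return [vi - scale * gi for vi, gi in zip(v_local, g_local)]


def distributed_rank_one_update(v, g, y, npes):
    """Simulate v <- v - g * (y^T v) / (1 + y^T g) over npes local parts."""
    v_parts, g_parts, y_parts = (split_equal(t, npes) for t in (v, g, y))
    # Each slave computes its partial scalars and "sends" them to the master.
    pv = [slave_partial(yl, vl) for yl, vl in zip(y_parts, v_parts)]
    pg = [slave_partial(yl, gl) for yl, gl in zip(y_parts, g_parts)]
    # The master forms the global scalar and "sends" it back.
    scale = master_combine(pv) / (1.0 + master_combine(pg))
    # Each slave updates its own subvector; the parts are then concatenated.
    new_parts = [slave_update(vl, gl, scale) for vl, gl in zip(v_parts, g_parts)]
    return [x for part in new_parts for x in part]


if __name__ == "__main__":
    v = [float(i) for i in range(1, 9)]
    g = [0.5] * 8
    y = [0, 0, 0, 1, 1, 0, 0, 0]   # couples the entries owned by two PEs
    print(distributed_rank_one_update(v, g, y, npes=4))
```

Only one scalar per slave travels in each direction, which is consistent with the small communication volume the scheme aims for.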
Having one PE perform the master controls implies that up to 64 slaves can be used, which forgoes the possibility of exploiting all 128 PEs available on the T3E. We considered a model in which all PEs perform the master controls as well as the slave task. Such a model would have led to a less efficient implementation, mainly because, at each level k, different partial scalar results have to be shared among different pools of PEs; having one master PE minimizes the occurrence of idle processors. The uniform partitioning of structures, as described above, achieves a perfectly even load balance.
3. Performance analysis
As a source of test problems for our numerical experiments, we consider a tridiagonal linear system of large dimension $n = 2^q$, whose randomly generated entries are chosen so as to guarantee diagonal dominance. The exponent q ranges from 17 to 23 in the double precision version (d.p.), while q reaches 24 in the single precision implementation (s.p.). The accuracy results obtained in single precision agree with those obtained by the previous version on the CRAY T3D (see [1], [2] for a more detailed description). The improvement brought by double precision comes at the cost of being unable to solve systems of dimension greater than $2^{23}$. If the solution is only required to be accurate in the first few significant digits, then the RD routine can be used to solve general tridiagonal linear systems. Timing results, used for the evaluation of speed-up and efficiency, are measured in seconds and shown in the following three tables. Table 1 refers to the computational time required by the RD solver on a problem of dimension $n = 2^{17}$, both using PVM on the T3E (single and double precision) and using CRAFT on the T3D (single precision). Such a comparison must take into account the enhanced resources; the time gain observed reflects the increased speed of the T3E processor.
Tables 2 and 3 gather all the computational timings required by the PVM implementation of the RD method on the T3E. Empty fields mean that the memory required to run a $2^q \times 2^q$ problem on the chosen number of PEs was too high. The computational complexity of the PVM implementation is $O(n (\log_2 n)^2)$, which is confirmed by the timings observed. This is slightly less satisfactory than the theoretical complexity of $O(n \log_2 n)$ and calls for further improvement in the PVM restructuring of the algorithm. Communication timings are not explicitly shown, since their share of the overall computational time is always lower than 10%. As a consequence, the workload is almost perfectly balanced. Speed-ups and efficiency are good, reaching the optimal value in most cases (in a few cases we obtain superlinear speed-ups); this good behavior fades when 64 processors are used on a problem whose dimension is not large enough to give each processor enough work to offset the communication and synchronization overhead. This is shown in Figure 1, in which the Kuck function is plotted (against the number of processors, in logarithmic scale) for problems of dimension $2^{17}$, $2^{18}$ and $2^{19}$, respectively. The Kuck function gives compound information, being the square of the geometric mean of speed-up and efficiency (obtained with a fixed number of processors), i.e. their product. The scaling properties of the method are also confirmed by the timing results; the scaling factor, in its best instance, reaches the value of 0.88 (close to the theoretical optimal value 1).
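As an illustration of these measures, the following Python sketch computes speed-up S(p) = T(1)/T(p), efficiency E(p) = S(p)/p and the Kuck function K(p) = S(p) E(p) from the T3E single-precision timings of Table 1 (n = 2^17). The function name `kuck_table` is hypothetical, and the definitions of S and E are the usual ones, assumed here since the text does not spell them out.

```python
# T3E single-precision timings in seconds for n = 2^17 (Table 1).
PES = [1, 2, 4, 8, 16, 32, 64]
T3E_SP = [5.882, 2.8234, 1.4349, 0.7207, 0.3680, 0.2094, 0.1423]


def kuck_table(pes, times):
    """Return one (p, speed-up, efficiency, Kuck function) tuple per PE count.
    The first entry of `times` is taken as the single-PE reference time."""
    t1 = times[0]
    rows = []
    for p, tp in zip(pes, times):
        s = t1 / tp          # speed-up
        e = s / p            # efficiency
        rows.append((p, s, e, s * e))   # Kuck function = S * E = S^2 / p
    return rows


if __name__ == "__main__":
    for p, s, e, k in kuck_table(PES, T3E_SP):
        print(f"{p:3d}  S={s:7.3f}  E={e:5.3f}  K={k:7.3f}")
```

With these timings the speed-up on 2 PEs is slightly above 2, an instance of the superlinear behavior mentioned above, while the efficiency on 64 PEs drops to roughly 0.65.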
Table 1. MP-T3E vs SM-T3D Timing Comparison: problem dimension given by q = 17

| PEs | 1 | 2 | 4 | 8 | 16 | 32 | 64 |
|---|---|---|---|---|---|---|---|
| T3E s.p. | 5.882 | 2.8234 | 1.4349 | .7207 | .3680 | .2094 | .1423 |
| T3E d.p. | 6.288 | 3.1268 | 1.5659 | .7896 | .4051 | .2285 | .1576 |
| T3D | 22.159 | 11.053 | 5.494 | 2.758 | 1.393 | .7073 | .3553 |
Table 2. MP-T3E Timings (double precision): problem dimension given by q

| PEs | 1 | 2 | 4 | 8 | 16 | 32 | 64 |
|---|---|---|---|---|---|---|---|
| q = 18 | 14.3959 | 7.1378 | 3.5883 | 1.7820 | .9101 | .4907 | .3090 |
| q = 19 | | 16.2389 | 8.0822 | 4.0265 | 2.0372 | 1.0807 | .6491 |
| q = 20 | | | 18.2524 | 9.0763 | 4.5963 | 2.3884 | 1.5529 |
| q = 21 | | | | 20.4751 | 10.380 | 5.3492 | 2.9657 |
| q = 22 | | | | | 22.9562 | 11.7664 | 6.7028 |
| q = 23 | | | | | | 28.2628 | 15.6026 |
Table 3. MP-T3E Timings (single precision): problem dimension given by q

| PEs | 1 | 2 | 4 | 8 | 16 | 32 | 64 |
|---|---|---|---|---|---|---|---|
| q = 18 | 13.1494 | 6.5497 | 3.2791 | 1.6216 | .8272 | .4412 | .2761 |
| q = 19 | 29.7550 | 14.8085 | 7.3862 | 3.6964 | 1.8816 | .9769 | .5751 |
| q = 20 | | 33.6471 | 16.7606 | 8.3642 | 4.1986 | 2.1877 | 1.2179 |
| q = 21 | | | 37.9167 | 18.9307 | 9.4531 | 4.9017 | 2.6547 |
| q = 22 | | | | 42.6558 | 21.4814 | 11.1691 | 5.837 |
| q = 23 | | | | | 47.7032 | 24.0896 | 12.8759 |
| q = 24 | | | | | | 53.4316 | 28.5315 |
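The complexity estimate discussed in the performance analysis can be checked roughly against Table 3. The Python sketch below compares the observed growth of the single-PE time from q = 18 to q = 19 with the growth predicted by the two cost models n (log2 n)^2 and n log2 n. The function name `predicted_ratio` is hypothetical, and a single pair of timings is of course only weak evidence.

```python
def predicted_ratio(q, power):
    """Ratio T(2^(q+1)) / T(2^q) for a cost model n * (log2 n)**power."""
    return 2.0 * ((q + 1) / q) ** power


# Single-PE, single-precision timings in seconds (Table 3).
T_Q18 = 13.1494
T_Q19 = 29.7550
observed = T_Q19 / T_Q18

if __name__ == "__main__":
    print(f"observed ratio       : {observed:.3f}")
    print(f"n (log2 n)^2 predicts: {predicted_ratio(18, 2):.3f}")
    print(f"n log2 n predicts    : {predicted_ratio(18, 1):.3f}")
```

Here the observed ratio (about 2.26) lies closer to the n (log2 n)^2 prediction (about 2.23) than to the n log2 n one (about 2.11).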
4. Conclusion and future work
PVM ensures flexibility and portability of code, which is a very important requirement in all branches of applied sciences. This could allow the use of the RD routine, on its own or as part of a more general application solver, on a heterogeneous cluster of computers or any other MPP architecture. Because of the intrinsic parallelism of this problem the workload is perfectly balanced; accuracy and scaling features are maintained and confirmed by the PVM implementation of the RD algorithm. The communication and synchronization overhead, already quite small, might be further decreased; there is room for further improvement in the current message passing implementation, as suggested both by the Kuck function and by the observed complexity. In parallel with such improvement of the PVM version, an MPI implementation is also being developed.
Acknowledgments
We wish to thank Dr. Bassini, Dr. Voli and their colleagues at CINECA for their kind availability. Computational resources provided by the CINECA Supercomputing Center, under grant n.97/335-5, are gratefully acknowledged.
