A parallel implementation of the LTSn method for a radiative transfer problem

Roberto Pinto Souto1, Haroldo Fraga de Campos Velho2, Stephan Stephany2, Airam Jonatas Preto2, Cynthia Feijó Segatto3, Marco Túlio Vilhena3

1 Programa de Pós-graduação em Computação Aplicada (CAP/INPE), [email protected]
2 Laboratório Associado de Computação e Matemática Aplicada (LAC/INPE), INPE - Instituto Nacional de Pesquisas Espaciais, Caixa Postal 515, CEP 12245-970, São José dos Campos, SP - Brazil, [haroldo,stephan,airam]@lac.inpe.br
3 UFRGS - Universidade Federal do Rio Grande do Sul, PPGMAp, PROMEC, Rua Sarmento Leite, 425/314, CEP 90050-170, Porto Alegre, RS - Brazil, [cynthia,vilhena]@mat.ufrgs.br

Abstract— A radiative transfer solver that implements the LTSn method was optimized and parallelized using the MPI message passing communication library. Timing and profiling information was obtained for the sequential code in order to identify performance bottlenecks. Performance tests were executed on a distributed memory parallel machine, a multicomputer based on the IA-32 architecture. The radiative transfer equation was solved for a cloud test case to evaluate the parallel performance of the LTSn method. The LTSn code includes spatial discretization of the domain and Fourier decomposition of the radiances, leading to independent azimuthal modes. This yields an independent radiative transfer equation for each mode that can be solved by a different processor in a parallel implementation. Speed-up results show that the parallel implementation is suitable for the architecture used.
Keywords— RADIATIVE TRANSFER, DISCRETE ORDINATES, MPI, CLUSTER COMPUTING

I. INTRODUCTION

The radiative transfer equation (RTE) is a mathematical model for the study of absorption, transmission, and scattering of photons in a medium. There are many methods to solve the RTE; however, all of them are computationally demanding. This work presents a parallel implementation of an RTE solver that employs the LTSn method [1], which applies the Laplace transform to the analytical discrete ordinate equations. A previous work [15] compared the performance of three RTE solvers and analyzed the corresponding MPI versions. Besides the LTSn method, that work included PEESNA [3], based on the analytical discrete ordinate method, and Hydrolight [8], which uses an invariant imbedding method. In this work, an enhanced version of the LTSn method is presented. It was extended to solve multiregion problems, includes a new matrix inversion scheme based on the diagonalization of the LTSn matrix [13], and uses a general solution for the source term [7]. As in other RTE solvers, the LTSn method performs spatial discretization of the domain and Fourier decomposition

of the radiances, obtaining independent azimuthal modes. Therefore, an independent RTE can be written for each mode, and parallelization of the code can be accomplished by assigning an azimuthal mode to each processor. Timing and profiling of the sequential code were performed in order to check the sequential performance and to identify the time-consuming routines [16]. The profile of the execution time shows that a significant fraction of the total time is spent in the code that calculates the independent azimuthal modes. This allows the implementation of an efficient parallel version of the code.

Inverse problems in radiative transfer are usually solved by an implicit technique. For instance, parameter estimation can be performed from radiometric measurements. The problem is usually formulated as a constrained nonlinear optimization problem, in which the direct problem is iteratively solved for successive approximations of the unknown parameters. Iteration proceeds until an objective function, representing the least-squares fit of model results and experimental data, converges to a specified small value. The associated direct problem is the solution of the computationally demanding RTE. In a typical inversion, thousands of iterations may be required and, therefore, the choice of an algorithm that is suitable for parallelization is an important issue. This was the motivation to improve the performance of a chosen RTE solver, the LTSn. The related code was optimized and parallelized using calls to the message passing communication library MPI (Message Passing Interface) [9], [2], and performance tests were executed on a distributed memory parallel machine based on a cluster of standard IA-32 architecture processors. This parallel machine has a standard FastEthernet interconnection network, thus presenting high communication latencies. Therefore, parallel implementations require either coarse granularity or latency hiding programming techniques. Since the LTSn method performs independent computations for each azimuthal mode, a parallel implementation presents coarse granularity and thus the obtained speed-ups are close to linear.
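The implicit inversion loop sketched above can be pictured with a short, hypothetical Python example; the solver handle solve_rte_ltsn, the measurement vector and the choice of optimizer are illustrative assumptions, not part of the paper's code.

# Hypothetical sketch of the implicit inversion described above: the optimizer
# repeatedly calls the direct RTE solver, so the direct solver dominates the cost.
import numpy as np
from scipy.optimize import minimize

def objective(params, measured, solve_rte_ltsn):
    """Least-squares misfit between modeled radiances and radiometric measurements."""
    modeled = solve_rte_ltsn(params)          # direct problem: one full RTE solve
    return np.sum((modeled - measured) ** 2)  # objective function to be minimized

def estimate_parameters(initial_guess, measured, solve_rte_ltsn):
    # A generic derivative-free optimizer stands in for the constrained
    # nonlinear optimization procedure mentioned in the text.
    result = minimize(objective, initial_guess,
                      args=(measured, solve_rte_ltsn),
                      method="Nelder-Mead",
                      options={"xatol": 1e-6, "fatol": 1e-8})
    return result.x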

II. THE LTSN METHOD

The Radiative Transfer Equation (RTE) for radiances I is given by

\[ \mu \frac{\partial I(\tau,\mu,\phi)}{\partial\tau} + I(\tau,\mu,\phi) = \frac{\omega_0}{4\pi} \int_0^{2\pi}\!\int_{-1}^{1} P(\mu,\phi;\mu',\phi')\, I(\tau,\mu',\phi')\, d\mu'\, d\phi' + S(\tau,\mu,\phi) \qquad (1) \]

where τ is the optical depth, μ = cos θ and φ are the cosine of the incident polar angle and the incident azimuthal angle, respectively, ω₀ is the constant single scattering albedo, the scattering phase function P(μ,φ;μ′,φ′) gives the angular distribution of the scattered beam, and S(τ,μ,φ) is the source term.

A. The discrete ordinates method

Using the addition theorem of the spherical harmonics [4], the phase function can be expressed as

\[ P(\mu,\phi;\mu',\phi') = \sum_{m=0}^{L} (2-\delta_{0,m}) \sum_{l=m}^{L} \beta_l\, P_l^m(\mu)\, P_l^m(\mu')\, \cos[m(\phi'-\phi)] \qquad (2) \]

and the radiance and the source term are also expanded as a Fourier decomposition [4],

\[ I(\tau,\mu,\phi) = \sum_{m=0}^{L} I_m(\tau,\mu)\, \cos[m(\phi_0-\phi)] \qquad (3a) \]
\[ S(\tau,\mu,\phi) = \sum_{m=0}^{L} S_m(\tau,\mu)\, \cos[m(\phi_0-\phi)] \qquad (3b) \]

Finally, the properties of a heterogeneous medium are split into a multiregion domain composed of a set of homogeneous regions,

\[ \omega_0(\tau) = \omega_0^{(r)}, \quad \beta_l(\tau) = \beta_l^{(r)}, \qquad \tau_{r-1} \le \tau \le \tau_r, \quad r = 1,\ldots,R \qquad (4) \]

The substitution of Equations (2)-(3)-(4) in Equation (1) yields, for each azimuthal mode,

\[ \mu \frac{\partial I_m(\tau,\mu)}{\partial\tau} + I_m(\tau,\mu) = \frac{\omega_0}{2} \sum_{l=m}^{L} \beta_l\, P_l^m(\mu) \int_{-1}^{1} P_l^m(\mu')\, I_m(\tau,\mu')\, d\mu' + S_m(\tau,\mu) \qquad (5) \]

subjected to the following boundary conditions, for μ > 0,

\[ I_m(0,\mu) = f_m(\mu) \qquad (6a) \]
\[ I_m(\tau_R,-\mu) = g_m(\mu) \qquad (6b) \]

and to the interface conditions, for r = 1, …, R−1,

\[ I_m^{(r)}(\tau_r,\mu) = I_m^{(r+1)}(\tau_r,\mu) \qquad (7) \]

Radiance was decomposed into azimuthal modes, while the phase function was replaced by the associated Legendre function expansion with L-order anisotropy. An approximation of the integral equation (1) is obtained using a quadrature of order N, with nodes μ_i and weights w_i. The value of μ is then discretized in μ_i, with i = 1, …, N, which are the discrete ordinate directions. The radiative transfer equation is then expressed as the related discrete ordinate equations, also known as S_N equations, given by

\[ \mu_i \frac{d I_m(\tau,\mu_i)}{d\tau} + I_m(\tau,\mu_i) = \frac{\omega_0}{2} \sum_{l=m}^{L} \beta_l\, P_l^m(\mu_i) \sum_{j=1}^{N} w_j\, P_l^m(\mu_j)\, I_m(\tau,\mu_j) + S_m(\tau,\mu_i), \quad i = 1,\ldots,N \qquad (8) \]

The boundary conditions are

\[ I_m(0,\mu_i) = f_m(\mu_i), \quad \mu_i > 0 \qquad (9a) \]
\[ I_m(\tau_R,\mu_i) = g_m(\mu_i), \quad \mu_i < 0 \qquad (9b) \]
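As a concrete illustration of the discretization behind Eq. (8), the minimal Python/NumPy sketch below builds the Gauss-Legendre nodes and weights and the resulting discrete scattering kernel for one azimuthal mode; the expansion coefficients β_l and the mode index m used here are illustrative placeholders, not values taken from the paper.

# Sketch of the S_N discretization in Eq. (8): Gauss-Legendre nodes and weights
# replace the integral over mu, and the phase-function expansion becomes a finite
# sum over associated Legendre functions. `beta` and `m` are placeholders.
import numpy as np
from scipy.special import lpmv

N = 50                                   # quadrature order (number of discrete ordinates)
L = 299                                  # anisotropy order of the expansion
m = 0                                    # azimuthal mode under consideration
omega0 = 0.90                            # single scattering albedo (cloud C1 test case)
beta = np.ones(L + 1)                    # placeholder expansion coefficients beta_l

mu, w = np.polynomial.legendre.leggauss(N)   # nodes mu_i and weights w_i

# Discrete scattering kernel:
# K[i, j] = (omega0 / 2) * sum_l beta_l * P_l^m(mu_i) * P_l^m(mu_j) * w_j
ell = np.arange(m, L + 1)
P = np.stack([lpmv(m, l, mu) for l in ell])        # P[l, i] = P_l^m(mu_i)
K = 0.5 * omega0 * (P.T * beta[ell]) @ (P * w)     # quadrature applied on the j index

# The right-hand side of Eq. (8) for a radiance vector I_m(tau) is then K @ I_m + S_m;
# dividing each row by mu_i gives the coupled system of N ordinary differential equations.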

The scattering angle was then discretized into M + 1 azimuthal modes and N polar angles, while the domain was split into R homogeneous regions. The integro-differential equation (1) was rewritten as a set of ordinary differential equations.

B. The LTSn approach

The LTSn method [1, 12] applies the Laplace transform to the radiative transfer discrete ordinates equations given by (8) and (9). This yields a system of symbolic algebraic equations on s:

\[ s\,\bar{I}_m(s,\mu_i) - I_m(0,\mu_i) = -\frac{1}{\mu_i}\,\bar{I}_m(s,\mu_i) + \frac{\omega_0}{2\mu_i} \sum_{l=m}^{L} \beta_l\, P_l^m(\mu_i) \sum_{j=1}^{N} w_j\, P_l^m(\mu_j)\, \bar{I}_m(s,\mu_j) + \frac{1}{\mu_i}\,\bar{S}_m(s,\mu_i) \qquad (10) \]

where \(\bar{I}_m(s,\mu_i) = \mathcal{L}\{I_m(\tau,\mu_i)\}\). The matrix form of equation (10) becomes

\[ (sI - A_m)\,\bar{I}_m(s) = I_m(0) + \bar{S}_m(s) \qquad (11) \]

where the N-order matrix \(\bar{A}_m(s)\), called the LTSn matrix, is given by

\[ \bar{A}_m(s) = sI - A_m \qquad (12) \]

and I is the N-order identity matrix, while the entries of the matrix \(A_m\) are given by

\[ (A_m)_{ij} = \begin{cases} \dfrac{\omega_0\, w_j}{2\mu_i} \displaystyle\sum_{l=m}^{L} \beta_l\, P_l^m(\mu_i)\, P_l^m(\mu_j) - \dfrac{1}{\mu_i}, & \text{if } i = j, \\[2mm] \dfrac{\omega_0\, w_j}{2\mu_i} \displaystyle\sum_{l=m}^{L} \beta_l\, P_l^m(\mu_i)\, P_l^m(\mu_j), & \text{if } i \ne j, \end{cases} \qquad (13) \]

and the vectors \(I_m(0)\) and \(\bar{S}_m(s)\) collect the boundary radiances and the transformed source terms at the discrete ordinates μ_i, i = 1, …, N.

In order to solve the matrix equation (11), it must be multiplied by the inverse of the matrix (sI − A_m), as follows:

\[ \bar{I}_m(s) = (sI - A_m)^{-1}\, I_m(0) + (sI - A_m)^{-1}\, \bar{S}_m(s) \qquad (14a) \]
\[ \bar{B}_m(s) = (sI - A_m)^{-1} \qquad (14b) \]

Applying the Laplace inverse transform,

\[ I_m(\tau) = B_m(\tau)\, I_m(0) + B_m(\tau) * S_m(\tau) \qquad (15) \]

where

\[ B_m(\tau) = \mathcal{L}^{-1}\{(sI - A_m)^{-1}\} \qquad (16) \]

and

\[ B_m(\tau) * S_m(\tau) = \int_0^{\tau} B_m(\tau - \xi)\, S_m(\xi)\, d\xi \qquad (17) \]

where the convolution is denoted by *. This implementation of the LTSn method employs Gaussian elimination to solve each m-th azimuthal-mode system shown in Equation (15), for m = 0, …, M, R regions and N-th-order quadrature. Matrix inversion is usually expensive and precludes the use of large N. The diagonalization method (see [13]) takes advantage of the fact that the LTSn matrix, Equation (12), is non-degenerate, i.e. all eigenvalues are distinct, and therefore A_m can be diagonalized:

\[ A_m = X_m\, D_m\, X_m^{-1} \qquad (18) \]

where \(D_m\) is a diagonal matrix containing the eigenvalues d_k of \(A_m\), and \(X_m\) is the corresponding eigenvector matrix. Therefore

\[ (sI - A_m)^{-1} = X_m\,(sI - D_m)^{-1}\,X_m^{-1} \qquad\text{and}\qquad B_m(\tau) = X_m\, e^{D_m \tau}\, X_m^{-1} \qquad (19) \]

This has a lower computational cost in comparison to the traditional inversion technique. Since the emergent radiances at the boundaries and at the interfaces are unknown, the solution of Eq. (15) cannot be completely determined. However, the use of the boundary conditions (6) and of the values at the interfaces between the regions (7) allows one to determine the unknown values.
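A minimal numerical sketch of the diagonalization in Eqs. (18)-(19), written in Python/NumPy for a generic stand-in matrix (not the paper's actual LTSn operator), shows how B(τ) is obtained from the eigenpairs of A instead of a symbolic inversion of (sI − A):

# Sketch of Eqs. (18)-(19): once A = X D X^{-1}, the Laplace-inverted kernel
# B(tau) = X exp(D tau) X^{-1} follows from the eigenpairs alone.
import numpy as np

def ltsn_kernel(A, tau):
    """Return B(tau) = X exp(D tau) X^{-1} for a diagonalizable matrix A."""
    d, X = np.linalg.eig(A)                      # eigenvalues d_k and eigenvector matrix X
    return (X * np.exp(d * tau)) @ np.linalg.inv(X)

# Example with a symmetric stand-in matrix (real, distinct eigenvalues), mirroring
# the non-degeneracy property stated above for the LTSn matrix.
rng = np.random.default_rng(0)
A = rng.standard_normal((6, 6))
A = 0.5 * (A + A.T)
B = ltsn_kernel(A, tau=0.5)
I0 = rng.standard_normal(6)                      # stand-in for the radiance vector at tau = 0
I_tau = B @ I0                                   # propagated radiances, cf. Eq. (15) without the source term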

C. Improving the LTSn solver

The described implementation of the LTSn method employs Gaussian elimination to solve Equation (15). However, for large values of N and large thickness τ₀, the matrix associated with this system of equations may become ill-conditioned, causing an overflow due to the exponential behavior of the solution. Moreover, according to [10], the number of operations performed in the Gaussian elimination is of order N³. Table I shows the fraction of time spent in the Gaussian elimination routine.

TABLE I
PERCENTAGE OF TIME SPENT IN THE GAUSSIAN ELIMINATION ROUTINE FOR DIFFERENT NUMBERS OF REGIONS

R     Perc. (%)
1     3.34
2     11.91
5     52.67
10    86.14

In order to achieve a lower complexity, and to deal with the matrix ill-conditioning for higher N, the current implementation used the following decomposition scheme for the matrix B_m(τ) [7]:

\[ B_m(\tau) = X_m\, e^{D^{+}(\tau-\tau_0)}\, X_m^{-1} + X_m\, e^{D^{-}\tau}\, X_m^{-1} = B_m^{+}(\tau) + B_m^{-}(\tau) \qquad (20) \]

where \(D^{+}\) and \(D^{-}\) are diagonal matrices whose entries are

\[ d_k^{+} = \begin{cases} d_k, & \text{if } d_k > 0 \\ 0, & \text{if } d_k \le 0 \end{cases} \qquad (21) \]

\[ d_k^{-} = \begin{cases} d_k, & \text{if } d_k < 0 \\ 0, & \text{if } d_k \ge 0 \end{cases} \qquad (22) \]

Then, Eq. (15) can be rewritten as

\[ I_m(\tau) = B_m^{+}(\tau)\, I_m(\tau_0) + B_m^{-}(\tau)\, I_m(0) + H_m(\tau) \qquad (23) \]

where the vector \(H_m(\tau)\) is

\[ H_m(\tau) = \int_{\tau_0}^{\tau} B_m^{+}(\tau-\xi)\, S_m(\xi)\, d\xi + \int_{0}^{\tau} B_m^{-}(\tau-\xi)\, S_m(\xi)\, d\xi \qquad (24) \]

and 0 ≤ τ ≤ τ₀.
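The following hedged sketch illustrates the idea behind Eqs. (20)-(22): splitting the spectrum by sign and shifting the argument of the positive-eigenvalue exponentials keeps every evaluated exponent non-positive, which is what avoids the overflow mentioned above for large optical thickness. The eigenpairs are assumed to come from the diagonalization of Eq. (18); variable names are illustrative.

# Sketch of the splitting in Eqs. (20)-(22): positive eigenvalues are exponentiated
# at (tau - tau0) <= 0 and negative eigenvalues at tau >= 0, so no exp(d_k * tau0)
# with d_k > 0 is ever evaluated, regardless of the layer thickness.
import numpy as np

def split_kernels(d, X, tau, tau0):
    """Evaluate B+(tau) and B-(tau) from the eigenpairs (d, X) of the LTSn matrix."""
    Xinv = np.linalg.inv(X)
    d_plus = np.where(d > 0, d, 0.0)                        # D+ keeps only the positive eigenvalues
    d_minus = np.where(d < 0, d, 0.0)                       # D- keeps only the negative eigenvalues
    B_plus = (X * np.exp(d_plus * (tau - tau0))) @ Xinv     # exponents <= 0 since tau <= tau0
    B_minus = (X * np.exp(d_minus * tau)) @ Xinv            # exponents <= 0 since tau >= 0
    return B_plus, B_minus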

This scheme for the LTSn method was also employed in the evaluation of the transmissivity and the reflectivity for a heterogeneous plane-parallel slab [14].

III. PARALLEL IMPLEMENTATION

A preliminary study of the LTSn solver led to the chosen parallelization strategy. The method, as seen in Equation (15), discretizes the polar angles (N discrete values) and the azimuthal angles (M + 1 discrete values) for R homogeneous regions and employs the non-discrete optical thickness τ, which is the one-dimensional spatial variable. An immediate analysis would consider parallelization based on the distribution of these quantities among processors:

- polar angles
- azimuthal angles
- regions
- range of optical thickness inside a region
- a combination of the above in a multidimensional arrangement

The discrete ordinates associated to the polar angles are strongly coupled, as can be observed in the LTSn matrix, see Equation (11). This precludes a parallel implementation that distributes these values among processors. On the other hand, the azimuthal modes are independent, and a parallelization based on assigning different modes to different processors is straightforward and was employed in this work. Therefore, each m-th azimuthal-mode system of equations is solved on a different processor. The third option was not considered, as the continuity of boundary conditions between adjacent regions forces a sequential computation over the regions. The same restriction applies to the spatial decomposition of the domain inside each region, related to the optical thickness τ. Finally, a combination of the above decompositions would suffer from the same limitations.

Another significant point is that, for any strategy, as stated by Amdahl's law [11], the gain in processing time is limited by the fraction of the code that can be executed in parallel. The analysis of the execution time profiles of the LTSn code shows that the radiance calculation of the azimuthal modes accounts for most of the total time. This profile was obtained by means of the gprof Unix/Linux profiling tool. The adopted strategy was then confirmed. The related code was optimized and parallelized using calls to the message passing communication library MPI and executed on a distributed memory parallel machine. This machine is a cluster of 17 standard IA-32 architecture processors connected by a standard FastEthernet interconnection network and a 24-port switch. High communication latencies are characteristic of this hardware arrangement and, therefore, either coarse granularity or latency hiding programming techniques are a must.
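A minimal mpi4py sketch of the adopted strategy is given below: the independent azimuthal modes are dealt out round-robin to the MPI ranks and a single reduction gathers the partial results. The per-mode solver solve_mode and the plain summation of modal contributions are placeholders for the actual LTSn computation.

# Minimal sketch of the adopted strategy: each rank solves its share of the
# independent azimuthal modes; one reduction is the only communication needed.
import numpy as np
from mpi4py import MPI

def solve_mode(m, n_ordinates):
    """Placeholder for the LTSn solution of azimuthal mode m (returns a radiance vector)."""
    return np.zeros(n_ordinates)

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

M = 299                                  # highest azimuthal mode (300 modes, as in the test runs)
N = 50                                   # quadrature order

local = np.zeros(N)
for m in range(rank, M + 1, size):       # round-robin distribution of the independent modes
    local += solve_mode(m, N)            # azimuthal weighting omitted for brevity

total = np.zeros(N) if rank == 0 else None
comm.Reduce(local, total, op=MPI.SUM, root=0)   # the only communication step
if rank == 0:
    print("accumulated contribution of all azimuthal modes:", total)

Because each mode is solved entirely within one rank, the only message exchanged is the final reduction, which matches the coarse granularity discussed above.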

IV. PERFORMANCE RESULTS

The proposed parallel implementation of the LTSn method was tested with a cloud radiative transfer problem with anisotropy order L = 299. This case, named Cloud C1, was proposed by the Radiation Commission of the International Association of Meteorology and Atmospheric Physics, and was previously solved by the LTSn method. The parameters of this test case are given in Table II.

TABLE II
PARAMETERS OF THE CLOUD TEST CASE CLOUD C1

Parameter   Meaning                        Value
ω₀          single scattering albedo       0.90
τ₀          maximum layer thickness        1.0
—           boundary conditions            0.0
μ₀          incident polar angle cosine    0.5
L           anisotropy order               299
N           quadrature order               50

Considering the same test case, the radiance fluxes at the boundaries and the interfaces were evaluated and plotted in Figure 1 for 25 positive and 25 negative angular directions. Fluxes going from left to right have positive angular directions, while fluxes going the opposite way have negative angular directions. Therefore, the boundary conditions given by Eqs. (6) are plotted in Figures 1a and 1d, while the solutions of the RTE at the opposite boundaries are shown in Figures 1b and 1c. Note that, as would be expected, these plots are identical at the interfaces in both negative and positive directions. This confirms the correctness of the numerical results.

Fig. 1. Radiance fluxes at the boundaries and interfaces for positive (panels a and b) and negative (panels c and d) angular directions for test case Cloud C1.

Since the chosen parallelization strategy of the LTSn method was based on independent computations for each azimuthal mode, this parallel implementation presented coarse granularity. Test runs were performed for the multiregion implementation, considering different numbers of regions (R): 2, 5, 10 and 20. Performance values are shown in Table III, where seq. denotes the sequential execution time. These results refer to 300 azimuthal modes and 50th-order quadrature.

TABLE III
SPEED-UP AND EFFICIENCY FOR p PROCESSORS AND R REGIONS

R    p     Time (s)   Speed-up   Efficiency
2    seq.  22.22      –          –
2    2     11.06      2.01       1.0045
2    5     4.42       5.03       1.0054
2    10    2.24       9.92       0.9920
2    15    1.51       14.72      0.9810
5    seq.  104.18     –          –
5    2     52.25      1.99       0.997
5    5     20.97      4.97       0.994
5    10    10.54      9.88       0.988
5    15    7.08       14.71      0.981
10   seq.  725.37     –          –
10   2     364.89     1.99       0.994
10   5     146.09     4.97       0.993
10   10    73.09      9.92       0.992
10   15    48.83      14.86      0.991
20   seq.  5705.75    –          –
20   2     2863.18    1.99       0.996
20   5     1141.69    5.00       1.000
20   10    573.36     9.95       0.995
20   15    382.37     14.92      0.995
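As a quick sanity check of the definitions used in Table III (speed-up S_p = T_seq / T_p and efficiency E_p = S_p / p), the R = 20 rows can be reproduced directly from the measured times:

# Reproducing the R = 20 block of Table III from the measured execution times.
t_seq = 5705.75
times = {2: 2863.18, 5: 1141.69, 10: 573.36, 15: 382.37}

for p, t in times.items():
    speedup = t_seq / t
    efficiency = speedup / p
    print(f"p={p:2d}  speed-up={speedup:5.2f}  efficiency={efficiency:5.3f}")
# Prints 1.99/0.996, 5.00/1.000, 9.95/0.995 and 14.92/0.995, matching the table.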

In particular, processing times are plotted in Figure 2. It can be observed that these times decay proportionally to 1/p, where p is the number of processors, as the related speed-ups are very close to linear.

Fig. 2. Processing times versus number of processors, for 2, 5, 10 and 20 regions.

V. CONCLUSIONS

This implementation of the LTSn method employs Gaussian elimination to solve each m-th azimuthal-mode system shown in Equation (15), for m = 0, …, M, R regions and N-th-order quadrature. The numerical results were correct for the value of N employed. An important issue in parallel programming is to maximize the amount of computation done by each processor and to minimize the amount of communication, due in this case to MPI calls, in order to achieve good performance. This is particularly important in multicomputers, as the communication latency is relatively high. The distribution of the azimuthal domain among the processors allows good load balancing and requires a very small amount of communication. Therefore, high values of speed-up and efficiency are achieved in the parallel implementation for up to 15 processors. Speed-ups are very close to linear and efficiencies remain close to unity. This allows one to infer that the problem will scale well with a larger number of processors, as load balancing is straightforward in this parallelization and the amount of communication is minimal, limited to a reduction operation that communicates the partial results for each azimuthal mode.

This work has shown that a cost-effective architecture and standard software tools can be successfully employed to efficiently solve the RTE using the LTSn method. In order to further improve the performance, an iterative linear system solver from the PIM (Parallel Iterative Methods) package [5] could be employed in the current implementation instead of Gaussian elimination. A potential benefit of this approach would be to exploit a cluster of multiprocessor nodes, following the current strategy of assigning sets of independent azimuthal modes to the processors, but using the PIM routines inside each node to take advantage of the shared memory, using OpenMP constructs [6], for example. In addition, the calculation of the eigenvalues in Equation (18) could be accelerated by employing QR factorization, as this technique is well suited for parallel implementation.

ACKNOWLEDGMENTS

Authors S. Stephany and A. J. Preto thank FAPESP, The State of São Paulo Research Foundation, for the support given to this study through a Research Project grant (process 01/03100-9). Author R. P. Souto acknowledges financial support by CNPq, the Brazilian Council for Scientific and Technological Development.

REFERENCES

[1] L. B. Barichello and M. T. Vilhena. A general approach to one-group one-dimensional transport equation. Kerntechnik, 58(3):182-184, 1993.
[2] H. F. de Campos Velho, S. Stephany, A. J. Preto, N. L. Vijaykumar, and A. G. Nowosad. A neural network implementation for data assimilation using MPI. In C. A. Brebbia, P. Melli, and A. Zanassi, editors, Applications of High Performance Computing in Engineering VII, pages 211-220, WIT Press, Southampton, UK, 2002.
[3] E. S. Chalhoub and R. D. M. Garcia. The equivalence between two techniques of angular interpolation for the discrete-ordinates method. Journal of Quantitative Spectroscopy & Radiative Transfer (JQSRT), 64(5):517-535, 2000.
[4] S. Chandrasekhar. Radiative Transfer. Dover, New York, 1950.
[5] R. D. Cunha and T. R. Hopkins. PIM 2.2: The Parallel Iterative Methods Package for Systems of Linear Equations. Instituto de Matemática and Centro Nacional de Supercomputação, UFRGS, Brazil, 1994.
[6] L. Dagum and R. Menon. OpenMP: an industry-standard API for shared memory programming. Computational Science and Engineering, 5(1), 1998.
[7] G. A. Gonçalves, C. F. Segatto, and M. T. Vilhena. The LTSn particular solution in a slab for an arbitrary source and large order of quadrature. Journal of Quantitative Spectroscopy & Radiative Transfer (JQSRT), 66(3):271-276, 2000.
[8] C. D. Mobley. Light and Water: Radiative Transfer in Natural Waters. Academic Press, San Diego, USA, 1994.
[9] MPI Forum. MPI: a message-passing interface standard. International Journal of Supercomputer Applications, 8(3-4), 1994.
[10] G. O. Oliveira. Advances in the criticality calculation of the LTSn method and implementation of an enhanced version of the LTSn code (in Portuguese). PhD Thesis, Mechanical Engineering Doctorate Program, UFRGS, Porto Alegre, Brazil, 2002.
[11] P. Pacheco. Parallel Programming with MPI. Morgan Kaufmann Publishers, San Francisco, USA, 1996.
[12] C. F. Segatto and M. T. Vilhena. Extension of the LTSn formulation for discrete ordinates problem without azimuthal symmetry. Annals of Nuclear Energy, 21(11):701-710, 1994.
[13] C. F. Segatto, M. T. Vilhena, and M. G. Gomes. The one-dimensional LTSn solution in a slab with high degree of quadrature. Annals of Nuclear Energy, 26(10):925-934, 1999.
[14] C. F. Segatto, M. T. Vilhena, and L. L. S. Tavares. The determination of radiant parameters by the LTSn method. Journal of Quantitative Spectroscopy & Radiative Transfer (JQSRT), 70(2):227-236, 2001.
[15] R. P. Souto, H. F. de Campos Velho, S. Stephany, and E. S. Chalhoub. Performance analysis of discrete ordinate radiative transfer algorithms for inverse hydrologic optics in a parallel environment. Submitted to the 18th International Conference on Transport Theory, July 20-25, 2003, Rio de Janeiro, Brazil.
[16] S. Stephany, R. V. Correa, C. L. Mendes, and A. J. Preto. Identifying performance bottlenecks in a radiative transfer application. In M. Ingber, H. Power, and C. A. Brebbia, editors, Applications of High Performance Computing in Engineering VI, pages 51-60, WIT Press, Southampton, UK, 2000.