Computer Physics Communications 183 (2012) 1683–1695
A unified spatio-temporal parallelization framework for accelerated Monte Carlo radiobiological modeling of electron tracks and subsequent radiation chemistry

Georgios Kalantzis a,*, Dimitrios Emfietzoglou b, Panagiotis Hadjidoukas c

a Department of Radiation Oncology, Stanford University School of Medicine, Stanford, CA 94305, United States
b Medical Physics Laboratory, University of Ioannina Medical School, Ioannina, 45110, Greece
c Department of Computer Science, University of Ioannina, Ioannina, 45110, Greece
Article info

Article history: Received 24 December 2011; received in revised form 6 March 2012; accepted 14 March 2012; available online 19 March 2012.

Keywords: Grid computing; Amazon EC2; Particle propagation; Smoluchowski reaction–diffusion; Electron track structure; Radiation chemistry
Abstract

Monte Carlo (MC) nano-scale modeling of cellular damage is desirable, but it is often prohibitive for large-scale systems due to its intensive computational cost. In this study a parallelized computational framework is presented for accelerated MC simulations of both particle propagation and the subsequent radiation chemistry at the subcellular level. Given the inherent parallelism of the electron tracks, the physical stage was "embarrassingly parallelized" into a number of independent tasks. For the chemical stage, the diffusion–reaction of the radical species was simulated with a time-driven kinetic Monte Carlo (KMC) algorithm based on the Smoluchowski formalism, and the parallelization was realized by employing a spatio-temporal linked-list cell method based on a spatial subdivision with a uniform grid. The evaluation of our method was based on two metrics: speedup and efficiency. The results indicated a linear speedup ratio for the physical stage and a linear latency per electron track for a shared- versus a distributed-memory system, with a maximum of 3.6 · 10^-3 %. For the chemical stage, a series of simulations was performed to show how the execution time per step scales with the number of radical species, and a 5.7× speedup was achieved when the largest number of reactants was simulated and eight processors were employed. The simulations were deployed on the Amazon EC2 infrastructure. It is also elucidated how the overhead becomes significant as the number of reactant species decreases relative to the number of processors. The method reported here lays the methodological foundations for accelerated MC simulations and allows envisaging a future use for large-scale radiobiological modeling of multi-cellular systems involved in a clinical scenario.

© 2012 Elsevier B.V. All rights reserved.
1. Introduction

Targeted radionuclide therapy (TRT) and nanoparticle-enhanced X-ray therapy (NEXT) are two promising techniques for improved tumor control in radiation therapy [1–3]. Several studies have investigated the effect of nanoparticles (NPs) and have shown that the dose to tissue volumes can be significantly increased by the addition of NPs, due to their greater X-ray absorption and production of high-LET Auger electrons [4–7]. Similarly, experimental studies of radiopharmaceuticals revealed a non-homogeneous increased dose in the proximity of the radionuclide [8–10]. The effect of localized increased ionization combined with nuclear tagging [11,12] can result in increased tumor cell damage and better therapeutic efficiency of radiotherapy. There is general agreement that Monte Carlo (MC) simulation of radiation transport in an absorbing medium is the most accurate method for dose calculations [13,14]. Some of the strategies these
* Corresponding author. E-mail address: [email protected] (G. Kalantzis).
algorithms employ are the condensed-history technique and variance reduction methods [15,16]. In the case of TRT and NEXT, dose deposition takes place on the scale of nanometers and therefore single-interaction algorithms are preferred. Theoretical models for calculating the cross sections applicable to electron-tracking MC simulations have been proposed in the past [17–23], and commonly used MC codes have been modified accordingly and used for microdosimetric calculations [24–27]. One way to estimate the biological effectiveness of the inhomogeneous dose deposition in the vicinity of the NP or radionuclide is to relate the deposited dose to the cell damage [28]. Modeling studies have characterized treatments with β-emitting radionuclides by using radiobiological indices [29] and cellular S-values [30]. However, ionizing radiation causes damage to biomolecules mainly through radiolysis of water, with DNA being a primary target of the generated radicals [31]. Because of the complexity of the biochemical network involving reactions with radicals, kinetic Monte Carlo (KMC) and molecular dynamics (MD) methods have been proposed in the past for their stochastic modeling [32–35]. External irradiation with photon or electron beams from medical linacs creates a large number of secondary electrons and a vast
number of water radicals. The number of secondary (low-energy) electrons, and consequently the number of radicals, increases further in the case of radiation therapy enhanced with nanoparticles. Additionally, in TRT low-energy electrons are emitted from a large number of radionuclides spatially distributed in the tumor. As mentioned earlier, MC methods are considered the gold standard for nanodosimetric calculations and biochemical modeling of cell damage. However, this comes at the price of the intensive computational time required for MC simulations, which prohibits the applicability of these methods to clinical scenarios. One way to ameliorate the performance problem is to parallelize the application. Several studies have demonstrated the efficiency of parallel computing in photon [36–40] and electron transport [41], as well as in large-scale molecular modeling [42–44]. There are a variety of models for constructing parallel programs, but at a very high level there are two basic models: multiple-program multiple-data (MPMD) and single-program multiple-data (SPMD). However, parallel programming introduces complications not present in serial programming, such as keeping track of which processor specific data is located on, distributing computation, synchronizing data between processors, debugging, and deciphering communication between processors. Dividing and distributing a problem among multiple processors incurs a communication overhead which is absent in serial programs [45,46]. For certain problems, such as depth-first search of graphs, it is known that polylogarithmic time cannot be obtained even with any polynomial number of processors. As the number of processors grows, the work per processor decreases but the communication increases. Therefore the performance increases, peaks, and then decreases as the number of processors grows, due to the increase in communication. This is referred to as "slowdown" or "speed down" and can be avoided by carefully choosing the number of processors such that each processor has enough work relative to the amount of communication. Finding this balance is generally not intuitive and often requires an iterative process. In the current study we present, for the first time to the best of our knowledge, a unified parallelization method for both the physical and the chemical stage of a MC simulation code for electron tracks in water. Our goal was to demonstrate the accuracy and applicability of our method in radiobiological modeling and to examine its behavior under different conditions. The primary motivation is the realization that nanodosimetric calculations and molecular modeling could provide unique information for the prediction of the radiobiological properties of radiation tracks, which is important in radiation therapy for accurate dose planning and in radiation protection for cancer-induced risk assessment. Cell-level dosimetry based on MC simulation for macroscopic tumors presents a computational challenge. Parallelization methods have proven an irreplaceable tool for modeling and understanding large-scale systems in many research areas [47–49]. The present study lays the methodological foundation for large-scale radiobiological MC modeling of multi-cellular systems involved in clinical scenarios.

2. Methods

2.1. Physical and pre-chemical stage

The interaction of radiation with living matter includes two stages with different time scales.
The first one corresponds to the physical and pre-chemical stage, which starts at about 10^-15 s and ends at about 10^-12 s and includes all the interactions of the radiation particles with the atoms and/or molecules of the medium and the creation of the water radicals. For the event-by-event simulation of low-energy electron tracking, a semi-theoretical model which combines aspects of the Bethe
and binary-encounter theory has been employed [50]. A brief overview of the model is presented below; for a detailed description one may consult Emfietzoglou et al. (2000) [17]. The single most important quantity for MC simulation of the energy dissipation of charged particles in matter is the differential cross-section (DCS) for energy loss. The DCS for ionization collisions, which determines the energy spectrum of secondary electrons, is described here by the following expression, which combines two terms representing "soft" (or distant) and "hard" (or close) ionization collisions:
$$\frac{d\sigma}{dW} = \frac{d\sigma_{\mathrm{SOFT}}}{dW} + F(W)\,\frac{d\sigma_{\mathrm{HARD}}}{dW} \tag{1}$$
where W is the ejection energy of the secondary electron and F(W) is an empirical parametric function, derived from experimental data, that fine-tunes the relative contribution of the soft- and hard-collision terms. In essence, the function F(W) replaces the much more complicated generalized oscillator strength (GOS), which is a function of both energy and momentum transfer, by providing a smooth transition between optical-like collisions (i.e. of small momentum transfer) and binary-like collisions (i.e. of large momentum transfer). The soft term dσ_SOFT/dW is described by the leading term of Bethe's asymptotic expansion of the inelastic scattering cross section, which takes the form:
$$\frac{d\sigma_{\mathrm{SOFT}}}{dW} = \frac{4\pi a_0^2}{T}\,R^2 \sum_{k} \frac{\Gamma_k\,\sigma_k^{\mathrm{ph}}(W+I_k)}{(W+I_k)\,4\pi^2 \alpha a_0^2 R}\, \ln\!\left[\frac{4TR}{(W+I_k)^2}\right] \tag{2}$$
where α is the fine-structure constant, a_0 is the Bohr radius, R is the Rydberg constant, Γ_k is the branching ratio for the kth orbital, σ_k^ph is the photoionization cross section, and T is the energy of the primary (or impact) electron. The hard-collision term of Eq. (1) is calculated through a modified binary-encounter approximation formula, which incorporates in a simple manner the exchange terms that arise in electron–electron scattering:
$$\frac{d\sigma_{\mathrm{HARD}}}{dW} = \frac{4\pi a_0^2 R^2}{T} \sum_{k} N_k\, \frac{T}{T+I_k+U_k}\, \big[ A - B + C + D \big] \tag{3}$$

with

$$A = \frac{1}{(W+I_k)^2}, \qquad B = \frac{1}{(W+I_k)(T-W)}, \qquad C = \frac{1}{(T-W)^2}, \qquad D = \frac{4U_k}{3}\left[\frac{1}{(W+I_k)^3} + \frac{1}{(T-W)^3}\right]$$
where N_k is the electron occupation number of the kth orbital (equal to 2 here), I_k is the binding energy and U_k the average energy of an electron in the kth orbital (k = 1, 2, ..., 5 for water). The total ionization cross section, which determines the corresponding mean free path, can then be obtained by numerical integration of Eqs. (1)–(3). With respect to (discrete) excitations, which, although they do not result in secondary electron production, contribute significantly to the energy-loss process in the present energy range, five excitation states of the water molecule have been considered: the dissociative continua A^1B_1 and B^1A_1, the Rydberg states (A+B) and (C+D), and the diffuse band. The excitation cross section is calculated by the following analytic formula [51]:
$$\sigma_n = 4\pi a_0^2 R^2\, \frac{A}{X_n^2} \left(\frac{X_n}{T}\right)^{\Omega} \left[1 - \left(\frac{X_n}{T}\right)^{\gamma}\right]^{\nu} \tag{4}$$
Fig. 1. (a) Differential cross sections for ionization as a function of the secondary electron energy W for different primary electron energies. (b) Partial excitation cross sections as a function of electron energy.

Table 1
Products of excitation states.

Excitation state    Products
A^1B_1              H + OH
B^1A_1              H2 + O
Ryd(A+B)            H3O+ + OH + eaq−
Ryd(C+D)            H3O+ + OH + eaq−
Diffuse band        H3O+ + OH + eaq−

Fig. 2. Three-dimensional reconstruction of a 1 keV electron track in water medium.
where X_n is the transition energy to the nth excitation level and the values of the parameters A, Ω, γ and ν are obtained from experimental data. Finally, the total and differential elastic scattering cross sections are described in the standard way by the screened Rutherford formula. In Fig. 1 (panels (a), (b)) we have plotted Eqs. (1) and (4) as a function of the energy of the secondary and primary electron, respectively. As shown elsewhere [52,53], Eqs. (1) and (4) can provide a reasonably good analytic representation of the available experimental data for electron scattering in water vapor over a broad range of incident energies. A simulated 1 keV electron track in water is shown in Fig. 2. More realistic models for the energy loss of electrons in liquid water have been developed based on a suitable approximation of the so-called Bethe surface via the dielectric response function of many-body electron gas theory [54]. For use in MC simulations, the latter can be conveniently parameterized using optical data and suitable dispersion algorithms; for a recent review see [55]. Although work on the implementation of these more elaborate models into our present parallelized computational code is in progress, this development is in no way critical for the conclusions of the present work.
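To make the use of these formulas concrete, the short MATLAB fragment below (a minimal sketch, not the authors' published code) evaluates the hard-collision DCS of Eq. (3) for a single orbital and integrates it numerically, as described above for the total ionization cross section; the orbital parameters, the primary energy and the integration limits are illustrative assumptions.

```matlab
% Minimal sketch: hard-collision DCS of Eq. (3) for one orbital k and its
% numerical integration. All parameter values are illustrative assumptions.
a0 = 5.29e-2;                 % Bohr radius in nm
Ry = 13.606;                  % Rydberg energy in eV
T  = 1000;                    % primary electron energy in eV
Ik = 12.6; Uk = 13.5; Nk = 2; % assumed binding/average orbital energies (eV) and occupation

dsHard = @(W) (4*pi*a0^2*Ry^2/T) .* Nk .* (T./(T + Ik + Uk)) .* ...
    ( 1./(W + Ik).^2 ...                                  % A
    - 1./((W + Ik).*(T - W)) ...                          % B
    + 1./(T - W).^2 ...                                   % C
    + (4*Uk/3).*(1./(W + Ik).^3 + 1./(T - W).^3) );       % D

% Secondary-electron energies are integrated up to (T - Ik)/2 here, an
% assumption of this sketch reflecting the indistinguishability of the
% two outgoing electrons.
W = linspace(0, (T - Ik)/2, 2000);
sigmaIonK = trapz(W, dsHard(W));     % nm^2, contribution of orbital k
```

The excitation formula of Eq. (4) can be evaluated in the same way once the fitted parameters A, Ω, γ and ν of each state are available.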
For the ionization process considered in the present model, a hydronium ion and a hydroxyl radical are produced by a reaction between an ionized water molecule and an adjacent water molecule:

H2O+ + H2O → H3O+ + OH

The H2O+ ion migrates in a random direction after its formation, with a displacement of 1.25 nm away from the site of production, and is replaced by H3O+. The OH radical is positioned with an arbitrary orientation at a distance of 0.29 nm, which is equivalent to the diameter of a water molecule [33]. In the excitation process, H, OH, H2, O, H3O+ and eaq− are generated from the different excited states, as shown in Table 1. For the dissociation of an excited water molecule that leads to the formation of the H and OH radicals, the products are placed 0.87 nm apart. When H2 and O are produced, they are separated by 0.58 nm. In both cases, the species are located on a randomly oriented line centered at the site of the excited water molecule [56]. The O is assumed to combine with H2O to form H2O2 [57]. The hydrated or aqueous electron, eaq−, is placed in a random orientation at a distance of 0.65 nm, the most probable value of the thermalization distance [58], from the production site.

2.2. Chemical stage

After the interaction of the electrons with the water molecules and the formation of the radicals, the chemical stage follows, which takes place from 10^-12 to 10^-6 s. The simulations of the water radical reactions are carried out using the Smoluchowski formalism [59], assuming that all reactions are diffusion controlled. A reaction takes place when the separation distance is less than the reaction radius for the respective reaction. Table 2 reports the reaction radii for the ten possible reactions of our model [60]. If no reaction occurs, each species is allowed to jump in a random direction until the next time step, at a distance determined by its diffusion coefficient (shown in Table 3 [60]):
$$\lambda = \sqrt{6D\tau} \tag{5}$$

Table 2
Reaction radii for the set of water radiolysis reactions.

Reaction                           Reaction radius (nm)
H + OH → H2O                       0.43
eaq− + OH → OH−                    0.72
eaq− + H + H2O → H2 + OH−          0.45
eaq− + H3O+ → H + H2O              0.39
H + H → H2                         0.23
OH + OH → H2O2                     0.26
2eaq− + 2H2O → H2 + 2OH−           0.18
H3O+ + OH− → 2H2O                  1.58
eaq− + H2O2 → OH + OH−             0.4
OH + OH− → H2O + O−                0.36

Table 3
Diffusion constants for the individual species.

Species    D (10^-5 cm^2/s)
H          8.0
OH         2.5
eaq−       5.0
H3O+       9.5
OH−        5.3
H2O2       1.4

If a reaction between a pair of species occurs, the coordinates (x_r, y_r, z_r) of the reaction site are determined by [33]:

$$(x_r, y_r, z_r) = (x_1, y_1, z_1)\,\frac{\sqrt{D_2}}{\sqrt{D_1}+\sqrt{D_2}} + (x_2, y_2, z_2)\,\frac{\sqrt{D_1}}{\sqrt{D_1}+\sqrt{D_2}} \tag{6}$$
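As an illustration of how Eqs. (5) and (6) enter one simulation step, the following MATLAB fragment sketches a single serial, brute-force diffusion–reaction update for a toy set of species; the species data are randomly generated placeholders, a single reaction radius is used, and product bookkeeping and scavenging are omitted. In the actual method the pair search is restricted by the linked-list cell structure described in Section 2.3.

```matlab
% Minimal sketch of one Smoluchowski diffusion-reaction time step (Eqs. (5)-(6)).
% Positions, diffusion coefficients and the reaction radius are placeholders.
N      = 200;                         % toy number of radicals
pos    = 10 * rand(N, 3);             % positions in nm inside a 10 nm box
D      = 2.5e-3 * ones(N, 1);         % diffusion coefficient in nm^2/ps (e.g. OH, Table 3)
Rreact = 0.4;                         % single reaction radius (nm) for simplicity
tau    = 3;                           % time step in ps, as in the text

% Eq. (5): isotropic jump of length sqrt(6*D*tau) for every species
lambda = sqrt(6 * D * tau);                       % N x 1 jump lengths
u      = randn(N, 3);
u      = u ./ repmat(sqrt(sum(u.^2, 2)), 1, 3);   % unit random directions
pos    = pos + repmat(lambda, 1, 3) .* u;

% brute-force reaction test against the reaction radius
consumed = false(N, 1);
for i = 1:N-1
    for j = i+1:N
        if ~consumed(i) && ~consumed(j) && norm(pos(i,:) - pos(j,:)) < Rreact
            % Eq. (6): reaction site weighted by the sqrt of the diffusion coefficients
            w    = sqrt(D(j)) / (sqrt(D(i)) + sqrt(D(j)));
            site = w * pos(i,:) + (1 - w) * pos(j,:);
            consumed([i j]) = true;   % mark reactants; product(s) placed at "site"
        end
    end
end
```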
For the chemical reactions which result in the formation of a single product, the location of the product is assumed to be at the reaction site. When more than one product is formed, the location of each product away from the reaction site is estimated by using its diffusion coefficient. The time step, τ, was fixed at 3 ps in all the calculations. To account for the scavenging effect in the cellular condition, the characteristic absorption time of 6 · 10^-10 s, the time at which the radicals will be reduced by a factor of e^-1 [61], is used to compute the probability that the radicals are absorbed at the end of each time step.

2.3. Parallelization method

The physical stage can be, as it is called in the literature, "embarrassingly parallelized" with no coordination or communication between the tasks, by assuming that each electron history is independent. Based on that assumption, the total number of primary electrons can be distributed over a group of CPUs and simulated independently in parallel [41]. For the partition of the number of electron histories we followed what has been described as a natural partitioning [62]. The electrons are equally distributed among the CPU-workers, each of which performs the entire simulation for all the electrons assigned to it and reports its results to the master CPU. The master node distributes the information required for the initialization phase, which contains the number of histories each worker has to perform, the cutoff energy Tthres, and the initial position, direction and energy T of the primary electron. After this point each worker starts the computation of an electron history by creating three data structures for each electron history. The first one is a First-In-First-Out (FIFO) list which serves as a particle counter and contains an ID number for the primary and also for each secondary electron produced during the ionization interactions. The second one is a data container which serves for the
recording of the position of any ionization or excitation that occurred and the respective deposited energy. The new particle position is calculated by direct sampling according to the total (elastic plus inelastic) interaction cross section. The type of interaction that the electron undergoes is determined by the relative magnitude of the individual cross sections. The scattering direction is then calculated based on a rejection MC sampling technique for scattering events or direct MC sampling for ionization. If the type of recorded event is ionization, the worker CPU also stores the initial energy, position and scattering direction of the secondary electron. Finally, the third data structure is also a data container, which holds information about the location and the type of the radical species created by the primary or secondary electrons. Fig. 3 illustrates the flowchart of a complete electron history of our MC method. When the energy of the primary electron falls below the cutoff energy, the computation is completed for the current particle and it is removed from the first container. Another particle (secondary electron) is then fetched from the list and the simulation is performed for the new particle. This process is repeated until the first container becomes empty, and the simulation terminates when all the workers have finished their assigned number of primary electron histories. The results are then aggregated at the master node for further analysis. During the chemical stage the diffusion of each molecule could be simulated independently, similarly to the physical stage, since the principle of exclusive volume does not hold in our case. However, if second- (or higher-) order reactions occur the above method fails, because each molecule can no longer be considered independent of its neighboring reactants. Because of this restriction, a linked-list cell method was employed that follows a divide-and-conquer strategy to reduce the search space for molecular pairs within the cutoff reaction radius [63,64]. In that way, the reactant radical species are divided spatially among the worker-CPUs and each CPU searches efficiently for possible chemical reactions among its designated species. A spatial subdivision for the local reactions was implemented on a three-dimensional uniform grid. For simplicity we demonstrate the linked-list method in two dimensions, but the same methodology is applied in three dimensions in our problem. The simulation space is divided into a number of uniformly sized cells (Fig. 4(a)) with an edge equal to twice the largest reaction radius. Using this spatial subdivision, each molecule's movement is limited to a specific number of grid cells (a maximum of 8 cells in 3 dimensions). Additionally, each molecule is assigned to only one grid cell based on its center point, and in each time step it can only interact with reactants in the neighboring cells (9 in 2 dimensions or 27 in 3 dimensions, see Fig. 4(b)). At the beginning of the simulation, the master CPU creates a cell-list structure similar to the one in Fig. 4(c), where the radical species are stored in every cell that they touch. In that way the radical species are sorted into the grid cells based on their "Cell ID" number (see Fig. 4(c)). Then, the master node distributes the data array created during the physical stage, which contains the type and the location of the species, equally to the workers.
As in the physical stage, the reason we distributed the radical species equally to the workers was to achieve a balanced work load. In addition, the cell-list structure with the grid-cell information is co-distributed to the CPU-workers (each worker has the same copy). Based on the "Cell ID" number of each radical molecule, the CPU-worker loops over the 27 neighboring grid cells and checks for reactions with each of the reactants in these cells. If a reaction occurs, the type and position of the new species are recorded and the reactant species are marked for no further consideration.
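A minimal MATLAB sketch of the uniform-grid binning that underlies this cell-list search is given below; the box size, positions and maximum reaction radius are placeholder assumptions, and the parallel distribution of the cells to the workers is omitted.

```matlab
% Minimal sketch of the uniform-grid "Cell ID" assignment and the 27-cell
% neighbourhood lookup (serial, illustrative placeholders only).
L    = 50;                     % edge of the cubic simulation box, nm (assumed)
Rmax = 1.58;                   % largest reaction radius, nm (Table 2)
h    = 2 * Rmax;               % cell edge: twice the largest reaction radius
nc   = ceil(L / h);            % number of cells per dimension
N    = 500;
pos  = L * rand(N, 3);         % placeholder positions

ijk = min(max(floor(pos / h) + 1, 1), nc);   % integer cell coordinates, N x 3

cellList = cell(nc, nc, nc);                 % molecule indices stored per cell
for m = 1:N
    c = ijk(m, :);
    cellList{c(1), c(2), c(3)}(end+1) = m;
end

% candidate reaction partners of molecule m: the 27 neighbouring cells only
m    = 1;
cand = [];
for di = -1:1
    for dj = -1:1
        for dk = -1:1
            c = ijk(m, :) + [di dj dk];
            if all(c >= 1) && all(c <= nc)
                cand = [cand, cellList{c(1), c(2), c(3)}]; %#ok<AGROW>
            end
        end
    end
end
cand(cand == m) = [];    % only these candidates need the reaction-radius test
```

In the parallel version each worker receives a copy of this cell list together with its share of the species and performs the neighbourhood test only for its own molecules.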
Fig. 3. Flowchart of an electron track history.
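The control flow of Fig. 3 can be summarized by the following MATLAB-style sketch; the physics sampling is replaced here by trivial stand-ins so that only the FIFO bookkeeping and the three per-history data structures are illustrated, and none of the numerical values are those of the actual model.

```matlab
% Schematic of the per-worker bookkeeping of Fig. 3: a FIFO particle list, an
% event container and a radical container. The sampling steps are stand-ins.
Tthres   = 10;                                   % cutoff energy, eV (illustrative)
fifo     = struct('T', 1000, 'pos', [0 0 0]);    % one 1 keV primary electron
events   = zeros(0, 4);                          % [x y z depositedEnergy]
radicals = cell(0, 2);                           % {speciesType, position}

while ~isempty(fifo)
    p = fifo(1);  fifo(1) = [];                  % fetch the next particle (FIFO)
    while p.T > Tthres
        p.pos = p.pos + randn(1, 3);             % stand-in for free-flight sampling
        W     = rand * p.T / 2;                  % stand-in for the sampled energy transfer
        if W > Tthres                            % treat as an ionization event
            fifo(end+1) = struct('T', W, 'pos', p.pos);    %#ok<AGROW> push secondary
            radicals(end+1, :) = {'OH', p.pos};            %#ok<AGROW> record a radical
        end
        events(end+1, :) = [p.pos, W];           %#ok<AGROW> position and deposited energy
        p.T = p.T - W - 12.6;                    % 12.6 eV: an assumed binding-energy loss
    end
end
% events and radicals are returned to the master node when the worker finishes
```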
When all the workers have finished their task, the created data with the reactions and the updated positions of the radical species are returned to the master CPU, and one time step of the simulation is completed. We note that the communication between the master and worker CPUs is subject to a synchronization barrier for two reasons. Firstly, the spatial redistribution of the radicals due to their diffusion after each time step may change their "Cell ID" number; therefore, after each time step the master CPU creates a new cell-list based on the new positions of the reactants. Secondly, prior to the creation of the updated cell-list structure, the master CPU searches the data for any inconsistencies (duplicate reactions). Duplicate reactions may arise because there is no intercommunication among the worker-CPUs during a time step and because second-order reactions occur among the radicals. This procedure is repeated for each time step until all species have been eliminated through chemical reactions or until the maximum time is reached.

2.4. Implementation platform and performance measurements

In this study the presented algorithmic methods were implemented in MatLab (version 2011a, Mathworks, Inc., Natick, MA) using the Parallel Computing Toolbox (PCT), which provides MPI-based functionality. Amazon provides a basic measure of compute power, the EC2 compute unit (ECU), which is equivalent to a 1.0–1.2 GHz Opteron or 2007 Xeon processor. For the phys-
ical stage, eight High-CPU Extra Large (20 ECU) instances were launched, with 7 GB RAM each. For the chemical stage, a Cluster Compute Quadruple Extra Large instance (33.5 ECU) was used, with 23 GB memory (2× Intel Xeon X5570, quad-core "Nehalem" architecture). The code was compiled as a standalone application for Windows HPC Server 2008 and the simulations were launched in remote mode from a dual Xeon 3.0 GHz workstation with 4 GB of RAM. The performance of the proposed method was assessed with two metrics, the speedup ratio (S) and the parallel efficiency (E), defined as follows:
$$S = \frac{T_1}{T_N} \tag{7a}$$

$$E = \frac{T_1}{T_N \cdot N} \tag{7b}$$
where T_1 and T_N are the total computational times required when one or N CPU-workers are used, respectively. The maximum speedup determines the minimum execution time. If we want to achieve this speedup, however, we must add more processors and, as a result, the system efficiency could be sacrificed. For this reason, efficiency is introduced in our analysis; it could be included in the objectives defined by the user in order to achieve the desired performance and avoid a slowdown. Finally, due to the virtualization of the computational system, the measured times are the wall-clock times and not the CPU time of each instance.
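As a worked example of Eqs. (7a)–(7b), the following lines compute S and E from the first column group of Table 4 (10^3 electron histories, Section 3.1):

```matlab
% Speedup and parallel efficiency (Eqs. (7a)-(7b)) from the first column of Table 4.
N  = [1 2 4 6 8];
TN = [9.592 4.790 2.391 1.603 1.199] * 1e3;   % wall-clock times in seconds
S  = TN(1) ./ TN;                             % Eq. (7a): approx. 1.0, 2.0, 4.0, 6.0, 8.0
E  = S ./ N;                                  % Eq. (7b): approx. 1 for every N here
```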
Fig. 4. (a) 2-D uniform grid. (b) 27 grid-cells are examined each time step for each molecule. (c) Representation of the 2-D grid for its parallel processing.
Fig. 5. Normalized deposited energy in polar coordinates for the serial and parallelized code. The distance from the point of origin of the electrons is 3 nm (a), 6 nm (b) and 10 nm (c), respectively, with a sampling bin size dλ = 1.5 nm and dθ = 3°.
3. Results

3.1. Speedup of the physical stage

Three sample sets of 10^3, 2.5 · 10^3 and 5 · 10^3 electron histories were used to assess the performance and accuracy of the parallelized method. A good agreement was achieved between the serial and parallelized code for the physical phase. Fig. 5 compares the energy deposition, normalized to its maximum, for the serial (solid line) and the parallelized code (circles) using 5 · 10^3 electron histories with an initial energy of 1 keV (azimuth and elevation angles ϕ, θ = 45°). The deposited energy was recorded on concentric cylindrical shells with a sampling bin size dλ = 1.5 nm and dθ = 3°. The random number generator (RNG) of each electron track was initialized with a different seed, and the same sequence of seeds was used for both the parallel and the serial code.
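A minimal sketch of this seeding scheme with the Parallel Computing Toolbox is shown below; the per-history workload is a placeholder, but the point is that history h always consumes the random stream seeded with h, so the serial and parallel runs produce statistically identical results.

```matlab
% Sketch of per-history RNG seeding: the same seed sequence is used whether
% the loop runs serially (for) or in parallel (parfor), so results coincide.
Nhist = 5000;
edep  = zeros(Nhist, 1);
parfor h = 1:Nhist
    rng(h, 'twister');              % history-specific seed
    edep(h) = sum(rand(100, 1));    % placeholder for one electron-history simulation
end
```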
For a distributed-memory system, where each CPU has its own "private" memory, we should expect a linear speedup. That expectation was confirmed for the three problem sizes, and Table 4 reports our results. We notice that the speedup scales proportionally to the number of nodes, N, and the total time cost scales linearly with the problem size. For the performance evaluation of the method we did not consider the communication time required for the initialization of the simulations, since it is negligible compared to the required simulation time. Additionally, the time required for collecting the data to the working station for further processing was not considered in our performance measurements, since it depends on the speed of the network connection to the EC2 infrastructure rather than on the performance of the proposed method. Finally, the collected data were processed offline at the local working station. The performance of the parallelization method was also tested for the same set of simulations on a shared-memory system. For an
Table 4
Measured timing results as a function of the number of processors and problem size.

        10^3 e−                     2.5 · 10^3 e−               5 · 10^3 e−
N       Time (10^3 s)   Speedup     Time (10^3 s)   Speedup     Time (10^3 s)   Speedup
1       9.592           1.0         23.371          1.0         47.232          1.0
2       4.790           2.0         11.684          2.0         23.616          2.0
4       2.391           4.0         5.843           4.0         11.838          4.0
6       1.603           6.0         3.902           6.0         7.872           6.0
8       1.199           8.0         2.921           8.0         5.904           8.0
Fig. 6. (a) Percentage of time cost increase of a shared-memory system compared with a distributed-memory equivalent system. (b) Percent of time increase per electron history for a shared-memory system as a function of processors.
Fig. 7. Comparison of time cost of distributed- (solid line) versus shared-memory (dashed line) as a function of processors, for 103 (a), 2.5 · 103 (b) and 5 · 103 (c) electron histories respectively.
accurate comparison between the two systems we used the same sequence of random numbers as for the distributed-memory system, and the multithreading of the code was adjusted according to the number of processors employed each time (number of workers equal to number of processes). Fig. 6(a) illustrates the percentage increase of the execution time, defined as (t_shared − t_distributed)/t_distributed %. As Fig. 6(b) indicates, the percentage latency per electron history was approximately linear, with a maximum value of 3.6 · 10^-3 % when eight processors were employed. Fig. 7 (a), (b) and (c) show the execution time as a function of the number of processors for the shared- versus the distributed-memory system for the three problem sizes. The latency due to the shared-memory access starts becoming significant as the number of employed processors increases and the performance is limited by the memory bandwidth. Similarly, larger sets of simulations result in more cache misses. The largest increase (∼ 13%) was observed for 5 · 10^3 electron tracks when 8 processors were used, while the smallest one
(∼ 0.5%) was recorded for 10^3 trials with 2 out of 8 processors running. That increase of the execution time as the number of utilized processors or the sample size increases reflects a corresponding decrease of the efficiency of the parallel code (Fig. 8).

3.2. Speedup and efficiency of the chemical stage

One way to validate the agreement between the serial and the parallel code for the chemical phase is to compare the total number of water radicals in the system as a function of time. Simulations of a total number of 13.5 · 10^3 radical reactants, produced from the electron track model, were carried out both in serial and in parallel mode. For the chemical stage a shared-memory system was employed with eight virtual cores. For consistency with our previous analysis we shall refer to the virtual cores as processors. The obtained results were essentially overlapping for 1, 2, 4 and 8 processors (Fig. 9).
Fig. 8. Decreased parallel efficiency of distributed- (solid line) versus shared-memory (dashed line) as a function of processors, for 103 (a), 2.5 · 103 (b) and 5 · 103 (c) electron histories respectively.
Fig. 9. Total number of reactants as a function of time for 1, 2, 4 and 8 processors. The dash-dot curve represents an exponential decay with time constant 600 ps. Eventually, after 1000 ps, the number of reactants decays asymptotically with the same time constant as the exponential decay.
It is worth noting that at the beginning of the simulation the total number of radical reactants decreases rapidly, which is due to the proximity of the species and the fact that the reactions take place in a short time interval. After approximately 10^3 ps the number of reactants decays with a fixed time constant. In Fig. 9 the dash-dot curve is an exponential decay with a time constant equal to the degradation time due to the water scavenger molecules (600 ps). We notice that the total number of reactants decays asymptotically with the same time constant as the exponential decay. This is an indication that the water radicals have been displaced far from each other and that the dominant chemical reaction is the degradation by water scavenger molecules. Since the number of radicals decreases as a function of time, we further investigated how the performance of the parallelization method changes as a function of the simulated time (or, equivalently, the total number of reactants in the system). Fig. 10(a) depicts the decrease of the required execution time per time step as a function of the simulated time for 1, 2, 4 and 8 processors. A similar trend was observed as the total number of reactants in the system decreased through the chemical pathways with respect to time (Fig. 10(b)). It is clear that the execution time scales with the total number of radicals. On the contrary, the speedup ratio and efficiency were essentially constant during the whole simulation (Figs. 10(c) and 10(d)). However, a "turn-over" with respect to the number of processors started appearing after approximately 1000 ps when 8 processors were utilized. That is, the communication overhead starts to appear when more nodes are employed. For illustrative purposes, the same performance test was repeated for a small sample set consisting of 1000 water radicals. Both the time cost per step and the total number of radicals
as a function of the simulation time showed a similar behavior as for the larger sample size (Figs. 11(a) and 11(b)). Despite that, the speedup and efficiency decreased rapidly for 4, 6 and 8 processors (Figs. 11(c) and 11(d)). That is due predominantly to the larger communication time relative to the compute time, resulting in poor scaling with respect to the number of processors. Comparative performance tests were carried out for four sets of increasing sample size (10^3, 3 · 10^3, 1.3 · 10^4 and 2.6 · 10^4 reactants). Fig. 12 presents the speedup and efficiency as a function of the number of processors. A maximum speedup of 5.7× was achieved when the largest number of reactants was simulated and 8 processors were employed. Similarly, an increase of 280% in the efficiency was noticed for the largest set of reactants compared to the smallest one. That increase in the efficiency was mainly attributed to the increased ratio of computation time over communication time per processor. Table 5 reports total execution times in seconds for the four sample sizes considered in the current study and for different numbers of processors N. We notice the poor scaling of the execution time with the number of processors for the small sample size. Also, the execution time for one processor increased by a factor of 20.6 when the number of reactants increased by a factor of 26. In order to investigate further the performance of our method, we recorded the total computation time per worker for a complete simulation. From our results we could make two observations. Firstly, the computational time per worker did not decrease linearly with the number of processors. Secondly, duplicate reactions were more frequent as the number of processors increased, which implies a longer computational time for the creation of the linked-list at each simulation time step. Fig. 13 illustrates the ratio of the computation time per worker over the total execution time for the four sample sizes considered in the current study. The highest ratio of computation time per processor over total execution time was achieved for the largest sample size when two processors were employed. The decrease in the ratio as the number of processors increases is more pronounced, though, for the smaller sample size. Finally, Table 6 presents the total execution time and speedup factors for a complete run of the physical and chemical stage for 400 electron histories (∼ 24 · 10^3 produced water radicals).

4. Discussion

In the current study a parallelized framework for MC simulations of both particle transport and molecular modeling was presented. The method was applied, for the first time to the best of our knowledge, to two physically connected radiobiological problems with different time scales: low-energy electron track structures in water and the reaction–diffusion of the produced radical species.
Fig. 10. Required computational time per time step (a), speedup ratio (c) and parallel efficiency (d) as a function of the simulated time for 13 · 10^3 reactants. The computational time per time step also decreases as the total number of water radicals in the system decays due to the chemical reactions (b).
A good agreement was achieved between the serial and parallel mode for both the physical and the chemical phase. For the electron transport, a linear scaling of the required computational time was realized on a distributed-memory machine, owing to the inherent partitioning of the problem into independent tasks. This implies a constant efficiency equal to one, independent of the problem size. Additionally, the speedup ratio should be independent of the model that is used for the electron track. A more complicated model would require more computational operations per electron history, and consequently more time for their calculation, but the ratio of the execution times as a function of the number of CPUs should not be affected. Shared memory with a global address space provides a user-friendly programming perspective, and data sharing between tasks is both fast and uniform due to the proximity of the memory to the CPUs. That being said, a main disadvantage is the lack of scalability between memory and CPUs: adding more CPUs can geometrically increase the traffic on the shared memory–CPU path. Our set of simulations for the physical phase was performed on both a shared- and a distributed-memory virtual machine. The decrease of the speedup scaled almost linearly with the number of processors. A maximum increase of 3.6 · 10^-3 % of the execution time per electron history was observed when a shared-memory machine was used. We speculate that one of the main reasons for the increased execution time of the shared-memory system versus a similar distributed-memory one is the virtualization of the Amazon instances. Amazon uses the Xen (DomU) virtual machine for the on-demand instances. One of the difficulties of a virtualized environment is that it is not clear how the caches are shared and whether other virtual machines interact with them, especially
for multi-core CPUs with shared L2 caches. Thus, a user running on a VM may see significantly slower performance compared to a dedicated machine. Previous studies have demonstrated the negative effects of virtualization on multi-core platforms, such as decreased memory bandwidth and higher core utilization [65]. Similar results have been demonstrated for the performance of EC2 instances compared with a real cluster [66–69]. For the chemical stage an Amazon HPC instance was employed with shared-memory features. Our results demonstrated the increase in performance of our method with respect to the number of reactants simulated. This results from the increase in the ratio between the compute time and the communication time as a function of the system size, which is expected for an efficient parallel code. It is worth noting that another factor which affects the performance of our method is the time required for creating the linked-list cell structure. Upon the completion of every simulation time step, the data arrays with the reactant species are aggregated at the master CPU. Prior to the execution of the next time step and the creation of the new linked-list, the collected data at the master CPU are checked for any duplicate reactions. From our results (data not shown), duplicate reactions were more frequent when eight processors were employed. That reflects a longer computational time for the creation of the linked-list. Additionally, it was demonstrated how the compute time scales with the number of reactants. For systems with a small number of radicals, the communication time between processors is actually longer than the computing time, resulting in poor scaling with respect to the number of processors. Finally, given the increasing number of cores per microchip, the development of efficient parallel applications on these platforms is a challenge. A comprehensive study for photon transport has
Fig. 11. Required computational time per time step as a function of the simulation time (a) and number of reactants (b). Speedup ratio (c) and parallel efficiency (d) as a function of the simulated time for 103 reactants. A rapid decrease is noticed in speedup and efficiency as the number of nodes is increasing.
Fig. 12. Speedup ratio (a) and parallel efficiency (b) as a function of the number of processors for different sample sizes.
been done in the past [70] for three different parallel architectures: shared-, distributed- and shared-distributed-memory systems. Additionally, a similar study for low-energy electron tracks [41] has demonstrated the reduced efficiency as the number of utilized processors increases for a mixed shared-distributed memory system. Heterogeneous multi-core clusters have different interconnection path architectures for communication when parallel processes are executed. Some communications are realized through network
links such as local area networks (LAN), and other communications are established by internal processor buses, e.g. inter-core communication travels through cache memory, while an inter-chip communication is delivered by the main memory. A proper hierarchical parallelization framework that combines inter-node, intra-node (inter-core) and intra-core level parallelism could optimize our method for different parallel architectures [71]. Furthermore, previous studies have demonstrated the advantage of shared-memory
Table 5
Measured execution times for the chemical stage as a function of the number of processors and the sample size.

        Execution time (10^3 s)
N       10^3 molec.     3 · 10^3 molec.     1.3 · 10^4 molec.     2.6 · 10^4 molec.
1       5.39            12.31               70.3                  111.2
2       3.67            7.24                39.1                  58.9
4       2.54            4.56                22.5                  32.7
6       2.34            3.92                15.9                  24.7
8       2.75            3.71                14.2                  19.8
Fig. 13. Ratios of the computation time per worker over the total execution time for four sample sizes of radical molecules.

Table 6
Execution times of the chemical and physical stage for 400 electron histories of 1 keV.

N       Time (10^4 s)    Speedup
1       10.15            –
2       5.23             1.94
4       2.87             3.53
6       2.16             4.68
8       1.74             5.84
systems by using both micro-benchmarks and higher-level benchmark suites [72]. Similarly, Mamidala et al. [73] demonstrated the advantage of shared-memory systems for collective MPI-based operations and matrix multiplication. Nonetheless, for scientific computation workloads, approaches and software designs optimized for multi-core systems are desirable for achieving the highest performance [74,75]. However, further investigation of our method on those systems is outside the scope of the current study. The spatial subdivision was realized with a fixed-size grid cell equal to the largest reaction radius of the radical species. This method is rigorous in the sense that boundary conflicts are not ignored, but it is computationally more expensive since all the reactions are checked for any inconsistencies. Alternative parallelization methods with reduced computational cost (obtained by ignoring the boundary effect of the subvolumes) could be applied for KMC modeling of larger systems [76]. For non-bonded interaction computations in molecular dynamics simulations, previous studies [77] stated that a cell dimension of half the cutoff radius R_c usually gives the best performance. However, Kunaseth [78] has demonstrated that this is not always true. Additionally, from our results it was noticed that after some point the main reaction was the degradation by water scavenger molecules and the radicals diffuse freely in the medium. The time cost of a time-driven KMC comes,
among other things, from the small time step (3 ps in the current study). In the past, several hybrid stochastic methods have been proposed for accelerated MC simulations in both serial [79–81] and parallel mode [82]. Additionally, for low particle densities, a coarse spatial grid could be applied in combination with a modified mesoscopic stochastic algorithm suitable for reaction–diffusion. In that direction, a dynamic transition from a microscopic, time-driven MC algorithm to a mesoscopic, event-driven one could further accelerate our method without sacrificing accuracy. In microdosimetry we are concerned with the statistical variations in the absorption of energy in small sites (on the order of micrometers), whereas in nanodosimetry the site is smaller (nanometers), comparable to structures such as the DNA helix. The sensitive site of DNA has been modeled in the past as a cylindrical or spherical structure [32,83,84], although more detailed cellular models have been considered [85]. Also, for the direct effect of charged particles on small cellular targets, databases of frequency distributions have been provided [86] and interaction cross sections with the DNA and simple biomolecules have been calculated [87–89]. Additionally, for the chemical phase, Aydogan et al. (2008) [34] have evaluated site-specific percentage attack probabilities for the DNA bases by the various radical species. Given that available information, our method can easily be extended to incorporate both direct and indirect DNA radiation damage using more elaborate projectile–target interaction models such as, for example, those computed from a many-body electron gas theoretical perspective [54,55]. The MC code and its parallelization were implemented in MatLab. The main advantages of using MatLab are its high-level, intuitive syntax, user-friendliness, graphical capabilities and available toolboxes with extensive numerical methods. For those reasons it has great appeal to scientists and engineers for modeling and testing parallelization methods [90–92]. There have been several attempts at producing MatLab-based utilities for parallel programming; among the most notable are pMatLab [93] and MatlabMPI [94] from MIT Lincoln Laboratory, MultiMatLab [95] from Cornell University and bcMPI from the Ohio Supercomputing Center [96]. Unfortunately, high-level language systems usually suffer from high performance overheads. For large-scale MC simulations, lower-level programming languages such as C++ [97] are the most suitable. Alternatively, parallel codes have been implemented in CUDA and launched on graphics processing units (GPUs) [98–100]; however, modifications of the proposed method may be required due to differences in the hardware architecture. The speedup ratio and efficiency for the physical phase should be independent of the programming language used for the implementation of the code, because of the inherent parallelization of the electron histories as independent tasks. However, the absolute execution time should decrease significantly. Regarding the chemical phase, micro-benchmark tests have shown an increased latency and a smaller memory bandwidth for MatLab compared with a high-performance MPI library [101]. Therefore, we should expect a decreased communication overhead, faster memory access and increased efficiency when a different API is used for the parallelization. In this paper we demonstrated a comprehensive parallelization method for particle propagation and molecular modeling.
Our results provide general guidance for applications of this concept in many fields, including radiation chemistry, radiation oncology and radiobiology. The goal is to apply the proposed method to radiobiological MC modeling of macroscopic tumors. Work along these lines is underway and will be presented elsewhere.
Acknowledgements The authors would like to thank Dr. G. Pratx and Dr. M.T. Swulius for their comments and helpful suggestions. References [1] C.A. Boswell, M.W. Brechbiel, Development of radioimmunotherapeutic and diagnostic antibodies: an inside-out view, Nucl. Med. Biol. 34 (7) (2007) 757– 778. [2] N.P. Praetorius, T.K. Mandal, Engineered nanoparticles in cancer therapy, Recent Pat. Drug. Deliv. Formul. 1 (1) (2007) 37–51. [3] D.M. Herold, et al., Gold microspheres: a selective technique for producing biologically effective dose enhancement, Int. J. Radiat. Biol. 76 (10) (2000) 1357–1364. [4] S.H. Cho, Estimation of tumour dose enhancement due to gold nanoparticles during typical radiation treatments: a preliminary Monte Carlo study, Phys. Med. Biol. 50 (15) (2005) N163–N173. [5] B.L. Jones, S. Krishnan, S.H. Cho, Estimation of microscopic dose enhancement factor around gold nanoparticles by Monte Carlo calculations, Med. Phys. 37 (7) (2010) 3809–3816. [6] J.L. Robar, Generation and modelling of megavoltage photon beams for contrast-enhanced radiation therapy, Phys. Med. Biol. 51 (21) (2006) 5487– 5504. [7] S.J. McMahon, et al., Radiotherapy in the presence of contrast agents: a general figure of merit and its application to gold nanoparticles, Phys. Med. Biol. 53 (20) (2008) 5635–5651. [8] G.M. Makrigiorgos, et al., Inhomogeneous deposition of radiopharmaceuticals at the cellular level: experimental evidence and dosimetric implications, J. Nucl. Med. 31 (8) (1990) 1358–1363. [9] R. Kriehuber, et al., Cytotoxicity, genotoxicity and intracellular distribution of the Auger electron emitter (65)Zn in two human cell lines, Radiat. Environ. Biophys. 43 (1) (2004) 15–22. [10] P.V. Neti, R.W. Howell, Log normal distribution of cellular uptake of radioactivity: implications for biologic responses to radiopharmaceuticals, J. Nucl. Med. 47 (6) (2006) 1049–1058. [11] G.L. DeNardo, et al., Nanomolecular HLA-DR10 antibody mimics: A potent system for molecular targeted therapy and imaging, Cancer Biother. Radiopharm. 23 (6) (2008) 783–796. [12] B. Kang, M.A. Mackey, M.A. El-Sayed, Nuclear targeting of gold nanoparticles in cancer cells induces DNA damage, causing cytokinesis arrest and apoptosis, J. Am. Chem. Soc. 132 (5) (2010) 1517–1519. [13] P. Andreo, Monte Carlo techniques in medical radiation physics, Phys. Med. Biol. 36 (7) (1991) 861–920. [14] N. Papanikolaou, S. Stathakis, Dose-calculation algorithms in the context of inhomogeneity corrections for high energy photon beams, Med. Phys. 36 (10) (2009) 4765–4775. [15] K. Bhan, J. Spanier, Condensed history Monte Carlo methods for photon transport problems, J. Comput. Phys. 225 (2) (2007) 1673–1694. [16] M. Berger, Monte Carlo calculation of the penetration and diffusion of fast charged particles, Methods in Computational Physics 1 (1963) 135–215. [17] D. Emfietzoglou, et al., A Monte Carlo track structure code for electrons (approximately 10 eV–10 keV) and protons (approximately 0.3–10 MeV) in water: partitioning of energy and collision events, Phys. Med. Biol. 45 (11) (2000) 3171–3194. [18] D. Emfietzoglou, et al., Monte Carlo simulation of the energy loss of lowenergy electrons in liquid water, Phys. Med. Biol. 48 (15) (2003) 2355–2371. [19] S. Uehara, H. Nikjoo, D.T. Goodhead, Comparison and assessment of electron cross sections for Monte Carlo track structure codes, Radiat. Res. 152 (2) (1999) 202–213. [20] M. Dingfelder, M. Inokuti, The Bethe surface of liquid water, Radiat. Environ. Biophys. 38 (2) (1999) 93–96. [21] C. 
Bousis, et al., A Monte Carlo study of absorbed dose distributions in both the vapor and liquid phases of water by intermediate energy electrons based on different condensed-history transport schemes, Phys. Med. Biol. 53 (14) (2008) 3739–3761. [22] D. Owens, et al., A survey of general-purpose computation on graphics hardware, Computer Graphics Forum 26 (1) (2007) 80–113. [23] M. Dingfelder, et al., Electron inelastic-scattering cross sections in liquid water, Radiation Physics and Chemistry 53 (1) (1999) 1–18. [24] J.M. Fernandez-Varea, G. Gonzalez-Munoz, M.E. Galassi, K. Wiklund, B.K. Lind, A. Ahnesjo, N. Tilly, Limitations (and merits) of PENELOPE as a track-structure code, Int. J. Radiat. Biol. (2011). [25] S. Incerti, et al., Comparison of GEANT4 very low energy cross section models with experimental data in water, Med. Phys. 37 (9) (2010) 4692–4708. [26] L.S. Waters, M.G., J.W. Durkee, M.L. Fensin, J.S. Hendricks, M.R. James, R.C. Johns, D.B. Pelowitz, The MCNPX Monte Carlo radiation transport code Hardonic Shower simulation workshop, in: AIP Conf. Proc., 2006.
[27] M. Dingfelder, et al., Comparisons of calculations with PARTRAC and NOREC: transport of electrons in liquid water, Radiat. Res. 169 (5) (2008) 584–594. [28] R. Garcia-Molina, et al., A combined molecular dynamics and Monte Carlo simulation of the spatial distribution of energy deposition by proton beams in liquid water, Phys. Med. Biol. 56 (19) (2011) 6475–6493. [29] B.A. Hrycushko, et al., Radiobiological characterization of post-lumpectomy focal brachytherapy with lipid nanoparticle-carried radionuclides, Phys. Med. Biol. 56 (3) (2011) 703–719. [30] C. Bousis, et al., Monte Carlo single-cell dosimetry of Auger-electron emitting radionuclides, Phys. Med. Biol. 55 (9) (2010) 2555–2572. [31] B. Halliwell, O.I. Aruoma, DNA damage by oxygen-derived species. Its mechanism and measurement in mammalian systems, FEBS Lett. 281 (1–2) (1991) 9–19. [32] H. Nikjoo, et al., Computational modelling of low-energy electron-induced DNA damage by early physical and chemical events, Int. J. Radiat. Biol. 71 (5) (1997) 467–483. [33] H. Tomita, et al., Monte Carlo simulation of physicochemical processes of liquid water radiolysis. The effects of dissolved oxygen and OH scavenger, Radiat. Environ. Biophys. 36 (2) (1997) 105–116. [34] B. Aydogan, et al., Monte Carlo simulations of site-specific radical attack to DNA bases, Radiat. Res. 169 (2) (2008) 223–231. [35] R.M. Abolfath, A.C. van Duin, T. Brabec, Reactive molecular dynamics study on the first steps of DNA damage by free hydroxyl radicals, J. Phys. Chem. A 115 (40) (2011) 11045–11049. [36] X. Chen, et al., Qualitative simulation of photon transport in free space based on Monte Carlo method and its parallel implementation, Med. Phys. 31 (9) (2004) 2721–2725. [37] S. Liu, et al., Accelerated SPECT Monte Carlo simulation using multiple projection sampling and convolution-based forced detection, IEEE Trans. Nucl. Sci. 55 (1) (2008) 560–567. [38] Y.K. Dewaraja, et al., A parallel Monte Carlo code for planar and SPECT imaging: implementation, verification and applications in (131)I SPECT, Comput. Methods Programs Biomed. 67 (2) (2002) 115–124. [39] N. Tyagi, A. Bose, I.J. Chetty, Implementation of the DPM Monte Carlo code on a parallel architecture for treatment planning applications, Med. Phys. 31 (9) (2004) 2721–2725. [40] H. Wang, et al., Toward real-time Monte Carlo simulation using a commercial cloud computing infrastructure, Phys. Med. Biol. 56 (17) (2011) N175–N181. [41] G. Pratx, L. Xing, GPU computing in medical physics: A review, Med. Phys. 38 (5) (2011) 2685–2698. [42] K.Y. Sanbonmatsu, C.S. Tung, High performance computing in biology: multimillion atom simulations of nanoscale systems, J. Struct. Biol. 157 (3) (2007) 470–480. [43] S.J. Plimpton, Fast parallel algorithms for short-range molecular dynamics, J. Comput. Phys. 117 (1995) 1–19. [44] W.M. Brown, P. Wang, S.J. Plimpton, A.N. Tharrington, Implementating molecular dynamics on hybrid high performance computers – short range forces, Comput. Phys. Commun. 182 (2011) 898–911. [45] A. Grama, A. Gupta, G. Karypis, V. Kumar, Introduction to Parallel Computing, 2nd edition, Addison–Wesley Longman Publishing Co. Inc., 2009. [46] D.L. Eager, J. Zahorhan, E.D. Lozowska, Speedup versus efficiency in parallel systems, IEEE Transactions on Computers 38 (3) (1989) 408–423. [47] R. Ren, G. Orkoulas, Parallel Markov chain Monte Carlo simulations, J. Chem. Phys. 126 (21) (2007) 211102. [48] M. Richert, J.M. Nageswaran, N. Dutt, J.L. 
Krichmar, An efficient simulation environment for modeling large-scale cortical processing, Frontiers in Neuroinformatics 5 (2011) 1–15. [49] G.T. Balls, S.B. Baden, T. Kispersky, T.M. Bartol, T.J. Sejnowski, A large scale Monte Carlo simulator for cellular microphysiology, in: Proceedings of the 18th International Parallel and Distributed Processing Symposium, 2004. [50] M. Inokuti, Inelastic collisions of fast charged particles with atoms and molecules. The Bethe theory revisited, Reviews of Modern Physics 43 (3) (1971) 297–347. [51] A.E. Green, R.S. Stolarski, Analytic models of electron impact excitation cross sections, Journal of Atmospheric and Terrestrial Physics 34 (10) (1972) 1703– 1717. [52] P.P. Yepes, D. Mirkovic, P.J. Taddei, A GPU implementation of a track-repeating algorithm for proton radiotherapy dose calculations, Phys. Med. Biol. 55 (23) (2010) 7107–7120. [53] R. Kohno, et al., Clinical implementation of a GPU-based simplified Monte Carlo method for a treatment planning system of proton beam therapy, Phys. Med. Biol. 56 (22) (2011) N287–N294. [54] D. Emfietzoglou, et al., Electron inelastic mean free paths in biological matter based on dielectric theory and local-field corrections, Nuclear Instruments and Methods in Physics Research Section B: Beam Interactions with Materials and Atoms 267 (1) (2009) 45–52. [55] D. Emfietzoglou, et al., Inelastic scattering of low-energy electrons in liquid water computed from optical-data models of the Bethe surface, Int. J. Radiat. Biol. 88 (1–2) (2012) 22–28. [56] M.A.H.a.F.A. Smith, Calculation of initial and primary yields in the radiolysis of water, Radiat. Phys. Chem. 43 (3) (1994) 265–280.