Accelerating Lagrangian Particle Dispersion in the Atmosphere with OpenCL across Multiple Platforms

Paul Harvey
School of Computing Science
University of Glasgow
Glasgow, G12 8QQ
[email protected]

Saji Hameed
Center for Advanced Information Science and Technology
University of Aizu
Aizuwakamatsu, Japan
[email protected]

Wim Vanderbauwhede
School of Computing Science
University of Glasgow
Glasgow, G12 8QQ
Wim.Vanderbauwhede@glasgow.ac.uk

ABSTRACT

FLEXPART is a popular simulator that models the transport and diffusion of air pollutants, based on the Lagrangian approach. It is capable of regional and global simulation and supports both forward and backward runs. A complex model like this contains many calculations suitable for parallelisation. Recently, a GPU-accelerated version of the simulator (FLEXCPP) has been written in C++/CUDA. As CUDA is only supported on NVIDIA GPUs, such an implementation is tied to a single hardware vendor, and is not able to take advantage of other hardware acceleration platforms. This paper presents an OpenCL/C++ version of FLEXCPP, called FlexOcl. This simulator provides all the functionality of FLEXCPP, and has been extended to include modelling of the decay of radioactive particles. A performance comparison between the two simulators has been performed on GPU, and the performance of FlexOcl has also been evaluated on the Intel Xeon Phi, as well as a number of other hardware platforms. Our results show that the OpenCL code performs better than CUDA code on GPUs, and that equivalent performance is seen on the Xeon Phi for this type of application.

Categories and Subject Descriptors

I.6.3 [Simulation and Modeling]: Applications; I.6.4 [Simulation and Modeling]: Model Validation and Analysis; J.2 [Computer Applications]: Physical Sciences and Engineering

General Terms

Application

Keywords

OpenCL, FLEXPART, weather simulation, GPU, multicore

IWOCL '14, May 12-13 2014, Bristol, United Kingdom. Copyright 2014 ACM 978-1-4503-3007-7/14/05. http://dx.doi.org/10.1145/2664666.2664672

1. INTRODUCTION

Lagrangian particle dispersion models are a popular subset of trajectory-based atmospheric transport and dispersion models. They have been used to simulate atmospheric transport at a variety of scales for many applications, ranging from urban pollution dispersion [14], to inverse modelling of greenhouse gas sources [15], to the quantification of stratosphere-troposphere exchange [12]. An important application of these models is as a central component of emergency warning systems that predict the transport and dispersion of materials that could pose immediate societal risks, such as the 2010 eruption of Eyjafjallajokull in Iceland [4] and the meltdown at Japan's Fukushima Daiichi nuclear power plant following the 2011 Great East Japan earthquake [1].

Lagrangian particle dispersion models compute the trajectories of a large number of infinitesimally small air parcels, referred to as particles, in order to describe the transport and dispersion of tracers in the atmosphere. Unlike Eulerian models, Lagrangian models can accurately represent the emissions from point or line sources. These sources may be specified arbitrarily in order to explore a hypothesis, or come from actual observations in order to predict some future outcome. In these applications the amount of processing required is substantial, and it grows with the scale of the simulation, the number of particles needed and the resolution of the input data.

In general, the simulation aims to compute an ensemble of particle paths through a turbulent flow, based on knowledge of the atmospheric velocity field [13]. The velocity field needed for these calculations may come from meteorological observations or weather/climate forecasts, depending on the purpose of the simulation. These are interpolated to the current particle locations, and a number of differential equations representing air transport and diffusion are numerically solved to generate the next step of the simulation. The time step for the integration of these differential equations ranges from a few to tens of seconds [7]. Depending on the application, these calculations may be repeated to cover simulation or forecast periods ranging from hours to years. Typical idealised simulations see 10,000 to 100,000 particles being released, requiring up to 1 second of computation time per time step for the trajectory calculations on a typical modern single-core CPU.

However, for realistic scenarios involving multiple, time-dependent sources and millions of particles, the cost of this computation quickly becomes the bottleneck that controls the total execution time. In this work, we have attempted to speed up the particle dispersion calculations for a popular simulator of this class, known as the Flexible Particle Dispersion model (hereafter FLEXPART), using OpenCL on a variety of hardware acceleration platforms. Our work builds upon an earlier attempt to accelerate FLEXPART on NVIDIA GPUs using the proprietary CUDA programming framework [16]. OpenCL [10] is a non-proprietary alternative to CUDA which is natively portable to different hardware platforms, including GPUs and multicore systems, as well as FPGAs. This paper presents a practical application of the OpenCL framework to the FLEXPART model: specifically, a version of the simulator modified to use OpenCL instead of CUDA, called FlexOcl.

The remainder of this paper is structured as follows: Section 2 discusses the current simulators, Section 3 discusses the new simulator, Section 4 is a performance evaluation of the simulators on different hardware platforms, together with a discussion of future work, and Section 5 summarises the work.

2. LAGRANGIAN SIMULATION

2.1 FLEXPART

FLEXPART [12] is a popular particle dispersion model using the Lagrangian approach. It is capable of simulating the atmospheric transport, diffusion, dry & wet deposition, and radioactive decay of tracers released from point, line, area or volume sources. It can also be used in a domain-filling mode where the entire atmosphere is represented by particles of equal mass. FLEXPART can be used to simulate the dispersion of tracers from their source going forward in time, or to determine their potential source contributions for given receptors by simulating backwards in time [12]. The simulator is capable of supporting the release of multiple different tracers from many locations within the same simulation. It is freely available online (http://flexpart.eu/), and is in active development and use.

The simulator is written in Fortran and has a straightforward structure, as shown by the pseudocode in Listing 1. This is a greatly simplified description of FLEXPART; however, it is meant to illustrate the general structure. A developer supplies an initial configuration for the simulator itself. This includes values such as the integration period, the location and frequency of the input data, the time step of integration, the number of particles and their distribution at the source locations, and so on. Once this information has been read and parsed, the main simulation loop is entered. At this point the simulator checks whether any particles are due to be released at this time step, releasing them if required. For each set of released particles, the first batch of weather data is read in. This information is supplied in the self-describing netCDF data format (http://www.unidata.ucar.edu/software/netcdf/); as it is a standard format, it does not matter where the data comes from. It includes values such as wind velocity, air temperature and pressure at different locations.

    flexpart {
        read_initial_sim_configuration();
        while (more_hours_to_simulate) {
            if (time_to_release_particles) {
                release_particles();
            }
            for (all released particles) {
                load_weather_information();
                interpolate_particle_positions();
                disperse_or_track_particles();
                // apply other optional actions here
                if (recording_interval_complete) {
                    save_data_to_file();
                }
            }
        }
    }

Listing 1: General structure of FLEXPART


From here, the information needed to numerically solve for the next position of each particle, such as the background velocity field, is estimated at the current location and time of each particle using linear interpolation in space and time. A set of partial differential equations describing atmospheric transport and diffusion is then solved numerically to estimate the location of each of these particles at the next time step. While the atmospheric transport equation is rather simple, diffusion processes on a variety of time scales must be represented, and this leads to the use of several formulations to describe phenomena such as turbulence and atmospheric convection. In the next step, processes that change the properties of each particle, for example radioactive decay, chemical reactions and wet and dry deposition, are modelled. These steps are repeated in the next iteration. Once a recording interval has been reached, the simulator saves the particle positions and the metadata for the particles, such as radioactivity, chemical composition and so on, to a user-specified file.
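As a simplified illustration of this trajectory step (the notation below is ours, not FLEXPART's, and collapses the model's separate turbulent and mesoscale velocity components into a single term v_t), each particle position X is advanced with the interpolated wind plus a stochastic fluctuation:

    X(t + \Delta t) = X(t) + \left[ \bar{v}(X(t), t) + v_t(X(t), t) \right] \Delta t

where \bar{v} is the grid-scale wind interpolated to the particle position and time, v_t is the turbulent velocity fluctuation, and \Delta t is the integration time step.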

2.2 FLEXCPP

The original FLEXPART simulator was written in Fortran 77 [12]. Recently, the Centre for Global Environmental Research (CGER) of the National Institute for Environmental Studies (NIES) in Japan created a C++ implementation of the FLEXPART model, FLEXCPP [16] (http://db.cger.nies.go.jp/metex/flexcpp.html). In addition, FLEXCPP has been parallelised to leverage the power of modern multicore CPUs via OpenMP [2] annotations. In the simple case, OpenMP annotations are applied to loop constructs; the OpenMP framework analyses the loop and attempts to parallelise it across as many threads as are feasible.
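As an illustration of this style of annotation, a loop over independent particle updates can be parallelised with a single pragma; the loop body and names below are invented for this sketch and are not taken from the FLEXCPP source.

    // Hypothetical sketch of OpenMP loop-level parallelism: each particle
    // update is independent, so iterations are shared across the CPU cores.
    void advance_particles(float* x, float* y, float* z,
                           const float* u, const float* v, const float* w,
                           int n, float dt) {
        #pragma omp parallel for
        for (int i = 0; i < n; ++i) {
            x[i] += u[i] * dt;   // zonal displacement
            y[i] += v[i] * dt;   // meridional displacement
            z[i] += w[i] * dt;   // vertical displacement
        }
    }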


The implementation also includes sections which can run on NVIDIA GPU hardware to further improve simulation performance by taking advantage of the inherent fine-grained parallelism in the simulator. By identifying critical sections of the code which are a bottleneck, and rewriting them as CUDA kernels, an NVIDIA GPGPU can be used to decrease the execution time of the simulator. To take advantage of the NVIDIA hardware, FLEXCPP currently uses NVIDIA's proprietary CUDA programming framework [11]. This ties the FLEXCPP implementation to NVIDIA hardware: the main drawback of using CUDA is that the simulator can only be accelerated on NVIDIA GPUs, which means that any scientist without such hardware must either purchase it or forgo the performance gain.

It is also the case that FLEXCPP is not a complete replica of FLEXPART. At present, radioactive decay and the deposition of particles are missing. This means that even though FLEXCPP provides better performance than the original simulator, it does not offer as accurate a representation. Of particular note for this work is the lack of radioactive decay.

3. FLEXOCL

In order to take advantage of hardware accelerators other than NVIDIA GPUs, e.g. Intel's Xeon Phi [8], an OpenCL version of FLEXCPP has been created. Given the extensive use of the FLEXPART simulator in time-critical operational emergency management, it is important to decrease the execution time of the simulator using all of the available hardware platforms, which OpenCL enables. Previous work has shown that the same level of performance should be possible with OpenCL as with CUDA [6].

Just as with FLEXCPP, the same approach and structure as FLEXPART has been used in the simulator. That said, a number of different approaches have been taken compared to FLEXCPP. Normally, an OpenCL kernel is placed within a single file; hence, multiple kernels require multiple files. Each file is compiled independently, and each compiled kernel is then loaded onto the accelerator as required, which means multiple loads for multiple kernels. In contrast, our approach is to use a single kernel file which contains all the kernels used within the simulator. Each kernel is abstracted as a function, with the function parameters representing the input to the kernel. A single switch statement is used to dynamically dispatch the kernel function to be invoked at runtime, with each arm of the statement containing a single kernel function. This approach requires that an extra argument be added to the main kernel's inputs in order to select the appropriate kernel function to be dispatched. The advantage of this approach is that only a single kernel needs to be loaded for the entire run of the application, in contrast to the normal approach, which may require multiple different kernels to be periodically loaded and unloaded throughout an application's execution. It is important to note that although there is only a single kernel, different work-group and/or work sizes may be specified for each function of the switch statement within the kernel. In this way the execution of the relevant kernel function may be customised to the outstanding work.
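A minimal sketch of this dispatch pattern is shown below. The kernel name, selector values and function bodies are illustrative assumptions; the real FlexOcl kernel functions take many more buffers as arguments.

    /* Illustrative OpenCL C sketch of the single-kernel dispatch described above. */
    void interpolate_positions(__global float* pos, __global const float* wind,
                               const float dt) {
        int i = get_global_id(0);
        pos[i] += wind[i] * dt;              /* simplified per-particle update */
    }

    void scale_particle_mass(__global float* mass, const float factor) {
        int i = get_global_id(0);
        mass[i] *= factor;                   /* simplified per-particle scaling */
    }

    __kernel void flexocl_kernel(const int which,   /* selects the kernel function */
                                 __global float* pos,
                                 __global float* mass,
                                 __global const float* wind,
                                 const float dt, const float factor) {
        switch (which) {                     /* one arm per kernel function */
        case 0: interpolate_positions(pos, wind, dt); break;
        case 1: scale_particle_mass(mass, factor);    break;
        }
    }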

Figure 1: Continuous comparison of particle output from FLEXCPP and FlexOcl

A second difference concerns data marshalling. In CUDA, a struct containing pointers is automatically marshalled when being sent to the accelerator, and demarshalled into the same struct on return. By comparison, OpenCL requires that this be done manually (an issue that will be addressed in the next version of OpenCL, 2.0). FlexOcl therefore marshals host data to and from the accelerator as arrays. Experiments were performed to decide between two layouts: a single data array with a separate offset array, or multiple arrays of data. Marshalling as multiple arrays of data was more performant.

As well as modifying the implementation to use OpenCL, the simulator has been extended to include the simulation of the radioactive decay of particles. For the discussion in this work a single isotope was used; however, the system has been parameterised to accept input from the user to define this. The calculations themselves are essentially scalar operations which are applied to each particle at every time step. The operation decreases the mass of the particle until it reaches zero, at which point the particle is no longer present in the simulation. OpenCL has also been used to accelerate these calculations.
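A minimal sketch of such a decay step, assuming first-order exponential decay parameterised by a half-life, is shown below; the kernel name, the removal threshold and the use of a negative mass as a "removed" marker are our illustrative choices rather than the FlexOcl implementation.

    /* Hypothetical per-particle radioactive decay step in OpenCL C. */
    __kernel void radioactive_decay(__global float* mass,
                                    const float dt,          /* time step, same units */
                                    const float half_life) { /* as the half-life      */
        int i = get_global_id(0);
        if (mass[i] <= 0.0f) return;                      /* particle already removed  */
        mass[i] *= exp(-0.6931472f * dt / half_life);     /* first-order decay         */
        if (mass[i] < 1.0e-30f) mass[i] = -1.0f;          /* drop negligible particles */
    }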

4. EVALUATION

The OpenCL version of the simulator was evaluated in three parts: to show correctness, to compare performance with the existing FLEXCPP simulator on NVIDIA GPUs and host CPUs, and to explore the performance possible on different OpenCL-enabled platforms.

Figure 2: Discrete comparison of output from FLEXCPP and FlexOcl

In each of the following results, the same input data set was used, and each simulation was a dispersion going forward in time with the same number of particles (10,000). Each result represents the average of ten experimental runs with the specified parameters; ten runs were chosen as the results were stable at that point. The error bars represent the standard deviation. The simulation represents a hypothetical release of particles representing radionuclides from the Fukushima Daiichi nuclear power plant on 8 April 2011, and the trajectories of the air particles are computed over the next 16 days. The meteorological data for this simulation is taken from the National Centers for Environmental Prediction forecasts, which provide data on a structured grid with a uniform horizontal resolution of 1 degree in both latitude and longitude covering the globe. The vertical resolution is variable in height, but represents 23 standard atmospheric pressure levels ranging from 1000 to 20 hPa.

An important point is that in the following results, none of the OpenCL kernels have been optimised for the given hardware platform. This is intentional, so as to give a reasonable approximation of the capabilities of a climate scientist, or general programmer, without expert knowledge. Also, as the goal of using OpenCL is to be able to run the simulator on multiple hardware platforms, it does not make sense to spend large amounts of time optimising the kernels for a single device. In this way, these results can be viewed as a measure of performance portability.

4.1 Correctness

In order to show that FlexOcl produces correct output, it was compared to FLEXCPP. For the same simulation, the average displacement of particles from each simulator was compared.

Figure 3: Output from FLEXCPP and FlexOcl

Figure 1 compares the average displacement of the particle plumes from both simulators in the zonal (top) and meridional (bottom) directions. It can be seen that the particle plume released from the source is displaced more in the zonal than in the meridional direction. This is due to the presence of strong zonal atmospheric currents such as the jet streams. Recirculation of the plume in the zonal direction is also present. There is a noticeable similarity in the reproduction of these features in both implementations. However, due to the chaotic character of the atmospheric dispersion processes modelled in the simulators and the inclusion of radioactive decay in FlexOcl, there are minor differences in plume characteristics, as expected.

Figure 2 shows a complementary comparison, describing the displacement histograms of particles after 96 and 364 hours of integration respectively. The left-hand panels describe the zonal displacement histograms, and the right-hand panels describe the meridional displacement histograms. After 96 hours of integration, a bimodal character is noticeable, with most particles concentrated in the north-western and north-eastern Pacific, and one meridional peak around 50°N. A snapshot of the integration at this time (shown in Figure 3) shows the actual locations of the dispersed particles after 96 hours of simulation, and testifies to the fidelity of both simulators. In contrast, after 364 hours the particles have dispersed uniformly in the north Pacific region. Both implementations model these distributions very similarly, with minor differences attributable to the factors mentioned earlier. In summary, FlexOcl produces correct and consistent results.

Figure 4: Comparison between FlexOcl and FLEXCPP on Host with Single Thread

Figure 5: Absolute performance difference between FlexOcl and FLEXCPP on Nvidia K20c, C2075, and GTX580 GPUs

4.2 Performance Comparison

4.2.1 Unaccelerated Host

Before testing with the parallel hardware architectures, a baseline comparison was made. Here both simulators were run on a single core of the host (Intel Core i5-3550 CPU @ 3.30 GHz), without multithreading or accelerators. In this test, the kernel calculations and data movement were disabled in order to measure the cost of the host code only. Figure 4 shows that in both cases the execution time grows linearly with simulated time; thus, the performance results in the following sections are independent of the infrastructure code running on the host.

4.2.2 Nvidia GPGPU Acceleration

Figure 5 shows the absolute difference in execution time between FlexOcl and FLEXCPP running on three different Nvidia platforms: the Tesla K20c (13 compute units, 192 threads per unit, 0.71 GHz), the Tesla C2075 (14 compute units, 32 threads per unit, 1.15 GHz) and the GTX580 (16 compute units, 32 threads per unit, 1.19 GHz).

Figure 6: Percentage difference between FlexOcl and FLEXCPP on Nvidia K20c, C2075, and GTX580 GPUs

Figure 6 shows the same data as a percentage improvement of FlexOcl over FLEXCPP. The results show that FlexOcl is consistently faster on all platforms, by 40% on average, when compared to FLEXCPP. We believe that this is because only a single kernel has to be loaded onto the device during the initialisation of the simulator, and because fewer calls are made to this kernel, as larger workloads are sent in a single round, compared to the CUDA implementation, which sends many smaller workloads in multiple rounds. The larger absolute improvement for the K20c and GTX580 is consistent when considered as a percentage.

The results for the GTX580 GPU are surprising at first, as they are faster than those for the other cards, which were released later. On closer inspection, the difference between the C2075 and GTX580 GPUs is explained by the larger number of compute units and the higher clock frequency of the GTX580. One general observation is that older GPUs tend to be slightly more efficient with regard to single-precision floating-point (FP) operations; according to the specifications, this is typically 15-20%. The difference with the K20c is partly due to the better FP performance of the GTX580; however, in principle, if all available threads on the K20c were used, it should outperform the GTX580 by 2.5x, taking the FP performance into account. This means that the FP performance alone is not enough to explain the difference. Consequently, our application benefits less from having more parallel threads per compute unit than from having more compute units and higher clock speeds. Regardless, FlexOcl is the better choice on GPGPUs.

4.2.3 AMD Multicore

To further highlight the benefits of OpenCL, performance comparisons were made on a 64-core AMD Opteron 6366 HE (1.8 GHz), as shown in Figure 7. The results show consistent performance from both simulators on this platform as the number of simulated hours increases. The CUDA kernels used in FLEXCPP do not work on non-Nvidia devices such as the AMD platform. Instead, FLEXCPP uses conditional compilation to replace the processing previously done by the CUDA kernels with normal loop code.

Figure 7: Difference between FlexOcl and FLEXCPP on AMD Multicore System

These loops are annotated with OpenMP pragmas so that they can be parallelised across the multiple CPU cores. By comparison, FlexOcl required no modifications to run on this platform. However, to provide comparable performance, mapped buffers were used to remove the copying between host and device; this was possible because the same physical memory is used by both. This was a small change, and it is isolated from the simulator in a custom library controlled at compile time by a single compiler flag. The conclusion is that OpenCL achieves the same goal on this platform with no performance penalty compared to the existing simulator, and with only a minor change to the host code, not the kernels themselves.
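A minimal sketch of this zero-copy pattern is shown below; the wrapper function, flags and buffer contents are assumptions for illustration and differ from the actual FlexOcl wrapper library.

    #include <CL/cl.h>
    #include <cstddef>

    // Hypothetical zero-copy buffer on a CPU OpenCL device: the buffer lives in
    // host-accessible memory and is mapped rather than copied.
    float* create_and_map_buffer(cl_context ctx, cl_command_queue queue,
                                 std::size_t num_floats, cl_mem* buf_out) {
        cl_int err = CL_SUCCESS;
        cl_mem buf = clCreateBuffer(ctx, CL_MEM_READ_WRITE | CL_MEM_ALLOC_HOST_PTR,
                                    num_floats * sizeof(float), nullptr, &err);
        // On a CPU device the host and the device share the same physical memory,
        // so mapping avoids the explicit copies of clEnqueueWrite/ReadBuffer.
        float* host_ptr = static_cast<float*>(
            clEnqueueMapBuffer(queue, buf, CL_TRUE, CL_MAP_READ | CL_MAP_WRITE,
                               0, num_floats * sizeof(float), 0, nullptr, nullptr,
                               &err));
        *buf_out = buf;
        return host_ptr;  // unmap with clEnqueueUnmapMemObject() before kernel use
    }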

4.3 Performance Exploration

By using OpenCL, the same kernel code will work on multiple different devices, which is not possible natively with CUDA kernels. Figure 8 shows the results when running FlexOcl on the Intel Xeon Phi co-processor [8]; this accelerator (5110P) has 60 cores, each 4-way hyperthreaded, running at 1.2 GHz. The figure compares the performance of the Xeon Phi to the Nvidia GPGPUs, all running FlexOcl. Also included is the AMD R9 290X (44 compute units, 64 threads per unit, 1.03 GHz), again running FlexOcl. The results show that the Xeon Phi offers equivalent performance and scaling compared to the newer Nvidia graphics cards for this type of problem. The older GTX580 GPU still performs best, and the AMD GPU worst. It was expected that the Xeon Phi would perform better than this, given that it is a newer and much more expensive piece of hardware. However, considering the number of particles in these tests (10,000), it is likely that the many stream processors present in the GPGPUs are better suited to this type of problem than the fewer, more general-purpose x86 cores found in the Xeon Phi. It is not clear why the performance of the newer AMD GPU is so poor by comparison, considering that it has a large number of stream processors and possesses the characteristics hypothesised in Section 4.2.2 (many compute units at higher clock rates) to achieve good performance for this application. Furthermore, the AMD GPU should have over twice the single-precision GFLOPS of the GTX580 GPU, according to the specifications.

Figure 8: Performance of FlexOcl and FLEXCPP on GPGPU and Xeon Phi

As with the K20c, better utilisation of all available threads should provide better performance, and further experimentation will be required to find an optimal configuration. Given these findings, one potential optimisation involves matching the number of particles to the number of threads suitable for the given device. Currently, the global work size is determined by the number of particles being released during a simulation. Some platforms, such as the Xeon Phi, are sensitive to this kind of alignment, which can cause variance in the results. This was confirmed when the global work size was rounded up to a multiple of the Xeon Phi's compute units for the 384-hour simulation shown in Figure 8: the average execution time dropped by five seconds to 70 s, overtaking the results for the C2075 and almost matching the K20c. This is still below expectation, but it provides an avenue for further exploration.
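The rounding itself is straightforward, as the sketch below shows; the choice of granularity (the number of compute units here, or the preferred work-group size multiple reported by the runtime) and the masking of the padded work-items are assumptions for illustration.

    #include <cstddef>

    // Round the global work size (number of particles) up to a multiple of a
    // device-preferred granularity.
    inline std::size_t rounded_global_size(std::size_t num_particles,
                                           std::size_t granularity) {
        return ((num_particles + granularity - 1) / granularity) * granularity;
    }

    // Example: 10,000 particles with a granularity of 60 pads the global size to
    // 10,020 work-items; the padded work-items return without touching a particle.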

4.3.1 Multi-Platform Comparison

In order to highlight the benefits of acceleration, Figure 9 shows the performance of FlexOcl and FLEXCPP normalised to the performance of the original FLEXPART simulator for a simulation run of 96 hours. The FLEXPART results were recorded on an Intel i7 CPU @ 2.67 GHz, and the FLEXCPP baseline results were recorded on a single core of the AMD Opteron 6366 HE. The horizontal line represents the normalised value of the original FLEXPART simulator. Figure 10 shows the same results expressed as the number of wall-clock seconds needed to complete an hour of simulation on the different platforms. All of the results show an improvement over the original Fortran simulator. Using OpenCL on the more powerful platforms, we see average improvements of 2.7x, 5.6x, 9.7x, 9.9x, 10x and 15x for the AMD Opteron, AMD R9 290X, Xeon Phi, C2075, K20c and GTX580 platforms respectively. In each case FlexOcl delivers either equivalent or better performance when compared to FLEXCPP. It is also worth noting that the 10x improvement reported in the literature [16] was not seen in the FLEXCPP results until compiler optimisations were applied; the default build of FLEXCPP only produced a 6x improvement over the Fortran simulator.

Figure 9: Performance Comparison of 96 Simulated Hours Against FLEXPART

Figure 10: Multi-platform Simulator Throughput

4.4 Further Work

Currently, neither FLEXCPP nor FlexOcl is a complete implementation of FLEXPART. Most notably, the removal of particles from the simulation by wet & dry deposition (e.g. rain and gravity) is not implemented. By adding this feature to FlexOcl, the tool will provide a more accurate simulation. Initial work has shown that this section of the simulator is also highly parallelisable in the same manner as the current implementation.

One section of the simulator which has not been parallelised is convection. Convection has little effect at the start of a simulation, but can occupy up to 90% of the simulation time as the particles become more dispersed. The convection code proved difficult to parallelise, and instead a version was written for FPGAs [5]. The results are promising, and the eventual goal is to use an OpenCL kernel as the input to the FPGA tool flow, an approach supported by Altera [3].

5. CONCLUSIONS

This paper presents FlexOcl, a simulator for the FLEXPART Lagrangian atmospheric dispersion model, accelerated using OpenCL. The simulator has been shown to work correctly on a number of different hardware platforms from different vendors, and to provide either equivalent or better performance when compared to an equivalent CUDA implementation which is tied to Nvidia hardware. This has been demonstrated by an evaluation across multiple different hardware accelerators. The results show that by using unoptimised OpenCL, up to a 15x performance improvement can be seen compared to the original FLEXPART. Also, by comparison to the CUDA implementation, FlexOcl provides a more detailed simulation through the inclusion of radioactive decay. This is of particular relevance as such simulators form a central component of emergency warning systems.

6. ACKNOWLEDGMENTS

The authors wish to thank the Japan Society for the Promotion of Science (http://www.jsps.go.jp/english/index.html) for their support in funding this work under their summer fellowship program. The authors also wish to thank Dmitry N. Mikushin for allowing access to a number of the Nvidia GPU platforms used in this work.

7. REFERENCES

[1] D. Arnold et al. Lagrangian Modeling of the Atmosphere, chapter Lagrangian Models for Nuclear Studies: Examples and Applications. Volume 200 of [9], 2013.
[2] B. Chapman, G. Jost, and R. Van Der Pas. Using OpenMP: Portable Shared Memory Parallel Programming, volume 10. MIT Press, 2008.
[3] T. Czajkowski, U. Aydonat, D. Denisenko, J. Freeman, M. Kinsner, D. Neto, J. Wong, P. Yiannacouras, and D. Singh. From OpenCL to high-performance hardware on FPGAs. In Field Programmable Logic and Applications (FPL), 2012 22nd International Conference on, pages 531-534, Aug 2012.
[4] R. P. Denlinger et al. Lagrangian Modeling of the Atmosphere, chapter A Bayesian Method to Rank Different Model Forecasts of the Same Volcanic Ash Cloud. Volume 200 of [9], 2013.
[5] S. W. Nabi, S. N. Hameed, and W. Vanderbauwhede. A reconfigurable vector instruction processor for accelerating a convection parametrization model on FPGAs. Poster Session 1, June 2014.
[6] J. Fang, A. L. Varbanescu, and H. Sips. A comprehensive performance comparison of CUDA and OpenCL. In Proceedings of the 2011 International Conference on Parallel Processing, ICPP '11, pages 216-225, Washington, DC, USA, 2011. IEEE Computer Society.
[7] D. Folini, S. Ubl, and P. Kaufmann. Lagrangian particle dispersion modeling for the high alpine site Jungfraujoch. Journal of Geophysical Research: Atmospheres, 113(D18), 2008.
[8] J. Jeffers and J. Reinders. Intel Xeon Phi Coprocessor High Performance Programming. Newnes, 2013.
[9] J. Lin. Lagrangian Modeling of the Atmosphere, volume 200. John Wiley & Sons, 2013.
[10] A. Munshi et al. The OpenCL specification. Khronos OpenCL Working Group, 2009.
[11] J. Sanders and E. Kandrot. CUDA by Example: An Introduction to General-Purpose GPU Programming. Addison-Wesley Professional, 2010.


[12] A. Stohl, C. Forster, A. Frank, P. Seibert, and G. Wotawa. Technical note: The Lagrangian particle dispersion model FLEXPART version 6.2. Atmospheric Chemistry and Physics, 5(9):2461-2474, 2005.
[13] D. J. Thomson and J. D. Wilson. Lagrangian Modeling of the Atmosphere, chapter History of Lagrangian Stochastic Models for Turbulent Dispersion. Volume 200 of [9], 2013.
[14] G. Tinarelli et al. Assessment of pollution impact over Turin suburban area using integrated method. In C. Borrego and G. Schayes, editors, Air Pollution Modelling and Its Applications, volume XV. Kluwer Acad., New York, 2001.
[15] J. Zeng et al. Lagrangian Modeling of the Atmosphere, chapter Linking Carbon Dioxide Variability at Hateruma Station to East Asia Emissions by Bayesian Inversion. Volume 200 of [9], 2013.
[16] J. Zeng, T. Matsunaga, and H. Mukai. Using NVIDIA GPU for modelling the Lagrangian particle dispersion in the atmosphere. In International Environmental Modelling and Software Society (iEMSs), 2010.