In-Core Volume Rendering for Cartesian Grid Fluid Dynamics Simulations

Ted Wetherbee, Fond du Lac Tribal & Community College, 2101 14th Street, Cloquet, Minnesota 55720, 218-879-0840
Elizabeth Jones, Fond du Lac Tribal & Community College, 2101 14th Street, Cloquet, Minnesota 55720, 218-879-0800
Michael Knox, Cray Inc., 1901 East Hennepin Ave., Minneapolis, Minnesota 55413, 612-384-7852
Stou Sandalski, University of Minnesota, 431 Walter Library, Minneapolis, Minnesota 55455, 612-626-1765
Paul Woodward, University of Minnesota / LCSE, 449 Walter Library, Minneapolis, Minnesota 55455, 612-626-8049
ABSTRACT The volume rendering code Srend is designed for visualization of computational fluid dynamics simulations which use Cartesian grids, and it is meant to be compiled within the application for in-core rendering. Srend was embedded in three codes: the Piecewise Parabolic Method code (PPMstar), Cloud Model 1 (CM1), and the Weather Research and Forecasting model (WRF). Results show modest rendering overhead, fine quality imagery, and high potential for scaling. When embedded in a code, Srend produces immediate, quality imagery, and this capability can sharply reduce data output & storage requirements.
Categories and Subject Descriptors J.2 [Computer Applications]: Physical Sciences and Engineering – astronomy, Earth and atmospheric sciences, physics.
General Terms Performance
Keywords Visualization, computational fluid dynamics.
1. INTRODUCTION Scalable computational fluid dynamics (CFD) simulation codes can generate massive amounts of data, which creates IO, storage, and handling challenges for visualization post-processing. Further, post-processing is an extra step between simulation and results. Practical uses for scalable CFD codes in lower division survey courses require real-time results. The CFD applications tested with Srend have potential for interesting uses in the classroom and in outside projects.
They can be installed on local and remote machines and used through front ends or gateways tailored to specific uses. The post-processing step is one obstacle to simplifying the user front end and delivering real-time results in the form of descriptive imagery. The visualization code Srend was written to address both problems: dealing with massive data challenges to IO & storage capabilities and delivering real-time visualization.
2. MOTIVATION Computational fluid dynamics codes help explore some of the most interesting of life's natural phenomena: weather, climate, and the evolution of stars and galaxies. These "grand challenge" problems are fascinating, motivating, and highly approachable in descriptive fashion for a general audience. CFD and its ability to exercise HPC resources in grand challenge research were already interesting, and we noted that lower division students were especially engaged by concrete, local applications. Thus, we explored means to simulate local problems such as spring wildfires, dispersion of vapor from a local chemical spill, and pollen advection patterns from wild rice beds in the Fond du Lac Reservation which students could use directly. One does not have to look hard for commercial desktop products in these areas, yet we were initially attracted to WRF [3], a community code which could simulate the atmospheres of Titan and Mars, mesoscale weather, and hurricanes, but most relevant to us were highly refined models for regional and local scale applications, say a predicted spread of a wildfire based on real conditions. We could configure WRF for our uses and run it through a web server front end from local and remote machines, yet delivering our desired imagery required extra steps which were difficult to automate for real time delivery. In fact, there are myriad visualization tools available and designed specifically to produce a wide variety of useful imagery for various practical needs, but none quite fit our interests. For our educational uses, high quality imagery is the most important data, and volume rendering is absent from many
important CFD research applications as a built-in feature, say to generate PNG or PPM files for display on a web site or within a desktop application while a simulation is running. Also, the overhead of handling the massive data generated by large-scale simulations was approaching intractability for related Laboratory for Computational Science & Engineering (LCSE) work. Opportunities to exploit have arisen: modern multi-core CPUs with wide SIMD registers are excellent for ray-casting computation, if not as good as GPUs for this work, but the 3D data is already in CPU memory, requiring only movement to registers, and cluster communications are well established for the much lighter rendered 2D results. Srend was modified from its original form as used within a terrain flow code developed to study advection of smoke and pollen in the St. Louis River region, and its methods were redesigned to help solve the specific problem of visualizing data from the 1536³ star simulations planned for 2014-15. The "S" in Srend is for spherical data rendering, a longitude/latitude (lon/lat) view for full spherical rendering of stellar data from the center. We retained perspective views through re-sampling along with other features necessary for general use. We believed that we could address these research and education needs with one small suite of routines which application users could place directly within their existing simulation code to define their rendering pipeline. We incorporated Srend within three well-known applications: the Piecewise Parabolic Method code (PPMstar [1]), Cloud Model 1 (CM1 [2]), and the Weather Research and Forecasting model (WRF [3]).
3. RELATED WORK Recent and inspiring "in-situ" visualization efforts are described in [4] and [5], featuring work with CM1 and VisIt [6] for visualization. ParaView [7] and VisIt are VTK-based systems [8] and currently popular general visualization codes suitable for in-situ use within large-scale parallel simulations. Another visualization code is the Hierarchical Volume Renderer (HVR) [9], written by David Porter for large-scale CFD codes at the Laboratory for Computational Science and Engineering (LCSE). Earlier methods to pre-scale data to bricks of bytes were already used for compression, and HVR developed this further for interactive rendering using SGI InfiniteReality engines and, subsequently, commodity GPUs. HVR is used for interactive as well as production visualization [10]. Relevant to Srend is the recent analysis of time vs. space partitioning in [4]: space partitioning uses dedicated processes to render, while time partitioning halts the simulation while all processes render data. Srend uses time partitioning for all volume rendering--which every simulation process does only on the data it just created--while compositing and image file writing are done by dedicated processes (space partitioning).

The primary difference between Srend and these established volume renderers is that Srend has no load balancing or supervisory mechanism. Rendering, compositing, and image file writing are defined for the simulation duration by how the user incorporates Srend calls within the application. However, Srend rendering work is insignificant compared to the simulation work of updating the volume data according to the partial differential equations of fluid dynamics. Further, we offload compositing and image file writing--with their messaging and I/O system dependencies--to a small set of separate ranks where there is more than enough time for completion between rendering calls. A similar in-situ approach has been proposed and tested by Kageyama and Yamada [11], in which hundreds of separate images--each with its own view of the volume from eye positions on a surrounding sphere--are rendered during the simulation to form a database of movies which can then be displayed on user command for effective interactive exploration.

4. SREND DESIGN Srend uses ray casting for volume rendering [12], which accumulates emissions along each ray from the eye through the volume of interest. Ray casting is a straightforward and classic "embarrassingly parallel" problem in that all calculation for each ray/pixel is independent of other rays. However, it is less straightforward when the data, and hence each ray, is distributed among numerous processes identified by MPI rank. Srend exploits the associativity of compositing rendered pre-multiplied-alpha planes and composes the final image along a user-defined "rendering tree". The data for ray casting--always rectangular "bricks" represented by 3D arrays of variable values--are partitioned among the simulation worker ranks. Each rank renders its own brick, employing all of its threads, as if the brick were alone in space. A composer receives the 2D renderings along with the geometry of each brick, and these are composited along the rendering tree to the root, where the finisher writes the image to disk.

Figure 1: Srend Rendering Tree
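The compositing along the tree relies on the associativity of the pre-multiplied-alpha "over" operation noted above. A minimal sketch of that operation (not Srend's actual routine; array names are hypothetical):

    ! Minimal sketch (not Srend's actual routine): composite a nearer
    ! pre-multiplied-alpha plane over a farther one.  Because the operation
    ! is associative, planes can be combined pairwise along the tree.
    subroutine composite_over(near_rgba, far_rgba, out_rgba, npix)
      implicit none
      integer, intent(in) :: npix
      real, intent(in)  :: near_rgba(4, npix), far_rgba(4, npix)
      real, intent(out) :: out_rgba(4, npix)
      integer :: i
      do i = 1, npix
         ! color and alpha are both pre-multiplied, so one formula serves all 4 channels
         out_rgba(:, i) = near_rgba(:, i) + (1.0 - near_rgba(4, i)) * far_rgba(:, i)
      end do
    end subroutine composite_over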
In particular, Srend is a Fortran code, and we view this as a powerful user-oriented feature for use within Fortran codes. For scalable Fortran simulation codes, it is highly convenient to include visualization through ordinary subroutine calls using familiar Fortran arrays and syntax. This made it possible to incorporate Srend directly within PPMstar, CM1, and WRF as an integrated component.
Application developers can incorporate Srend within simulation code to create visualization capability within one executable using ordinary CPUs and without reliance on special visualization hardware, visualization libraries, or external programs.
One important usability and performance feature is that each rendering call contains all rendering parameters. The composers
require only a view index, the number of sources, and an MPI target rank to which composited results are sent. Each finisher requires only a view index and the number of sources. This eases the task of defining the rendering tree, as information flows only one way, from each simulation & rendering process up to the finishers which write images to disk. MPI messaging is non-blocking, and senders keep request state to test completion on the following call. Thus, rendering completion time (simulation halt time) depends on the rendering work itself, as there is no synchronization or management overhead beyond that imposed by MPI messaging itself.
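The non-blocking send pattern described above can be sketched as follows (buffer, tag, and routine names are hypothetical, not Srend's internals): each sender posts a non-blocking send of its rendered plane and completes the previous request before reusing the buffer on the next rendering call.

    ! Minimal sketch (hypothetical names) of the non-blocking send pattern:
    ! post MPI_Isend for this plane; on the next call, complete the previous
    ! request before the buffer is reused.  MPI_Wait returns immediately when
    ! the previous send has already completed, which is the expected case.
    subroutine send_plane(plane, npix, composer_rank, request, first_call)
      use mpi
      implicit none
      integer, intent(in)    :: npix, composer_rank
      real,    intent(in)    :: plane(npix)
      integer, intent(inout) :: request
      logical, intent(inout) :: first_call
      integer :: ierr
      if (.not. first_call) then
         call MPI_Wait(request, MPI_STATUS_IGNORE, ierr)
      end if
      call MPI_Isend(plane, npix, MPI_REAL, composer_rank, 99, MPI_COMM_WORLD, request, ierr)
      first_call = .false.
    end subroutine send_plane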
With Srend, the simulation is rendered and ready to view--from many viewpoints and of whichever variables the user selects--when the simulation ends. Generated imagery can be viewed at any point during a running simulation, and real-time display of imagery from a running simulation is simple: display the most recent image.
To point out upfront, there is redundant information passed upward along the rendering tree in this scheme. The output file name and other image finishing parameters are used only by the one finisher for each view, but every renderer sends the identical information. This is a light data payload and it greatly simplifies usage vs. providing rendering job parameters in more than one location.
For the uses we envisioned, suitable viewing parameters are already known for the imagery that would be useful. In fact, Srend does not eliminate the need for post-processing and data handling, even if it can significantly reduce that need in some cases. Rather, real-time imagery generation is extremely handy for development, testing, education, exploration, and also for guiding post-processing by identifying areas of interest. An unusual feature of Srend is that it requires data in arrays of bytes (Fortran character*1) scaled to a color table. The user has to scale real data to bytes and define a color table. Thus, the user has to know something about the data itself and how it should be scaled for appearance with color and opacity. Experience shows that this work does not have to be repeated often nor applied to all possible variables. This step of selecting and scaling variables for Srend can be left to application users familiar with the codes and the variables of interest. It is probably easier for these people to write the scaling code themselves (a triple loop, or perhaps one line of Fortran 90, as sketched below) than it would be to figure out a generic provided scaling routine. For end users, the list of interesting variables to view is usually short enough to be made available for selection, say through the familiar and required namelist.input for CM1 and WRF.
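A minimal sketch of such a scaling step, with hypothetical variable names and an assumed linear mapping of a real field onto the 0-255 range of a color table:

    ! Minimal sketch (hypothetical names): scale a real field to bytes for Srend.
    ! A linear map of [vmin, vmax] onto color-table indices 0..255 is assumed.
    subroutine scale_to_bytes(field, nx, ny, nz, vmin, vmax, bytes)
      implicit none
      integer, intent(in) :: nx, ny, nz
      real, intent(in) :: field(nx,ny,nz), vmin, vmax
      character(len=1), intent(out) :: bytes(nx,ny,nz)
      integer :: i, j, k, iv
      do k = 1, nz
         do j = 1, ny
            do i = 1, nx
               iv = nint(255.0 * (field(i,j,k) - vmin) / (vmax - vmin))
               bytes(i,j,k) = char(max(0, min(255, iv)))   ! clamp to the color table
            end do
         end do
      end do
    end subroutine scale_to_bytes

The same map can also be written as a single Fortran 90 array statement: bytes = char(max(0, min(255, nint(255.0*(field - vmin)/(vmax - vmin))))).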
5. APPLICATION RESULTS The applications tested with Srend use Cartesian grids. CM1 can use stretched grids in all directions, and both CM1 and WRF use terrain-following vertical coordinates as well. Data arrays need to be interpolated to uniform spatial arrays for Srend if spatial accuracy is to be rendered. For our test cases in CM1 and WRF, no grid stretching was used, and we did not shift vertical level values.

5.1 PPMstar – Sakurai's Object
The taxing work (on humans and systems) of storing, moving, and post-processing terabytes and even petabytes of raw data into imagery can be reduced significantly. For educational, demonstration, and exploratory simulations where visualization is the essential result, raw data output can be turned off.

The PPMstar code volume is partitioned in all dimensions. The hierarchy for purposes of parallelization is the problem domain, regions (teams of bricks), bricks (each corresponding to a process), briquettes (4³ "sugar cubes" of cells), and grid cells. Srend renders bricks at the MPI rank level, each team of renderers sends results to the team's composer for compositing, and all teams send results to a finisher for final compositing and writing of imagery to disk (a sketch of how a rank might locate its brick follows below).

PPMstar configured to simulate Sakurai's object [13,14,15] was tested in a weak-scaling fashion with and without Srend. The weak-scaling test used 8, 64, 512, and 4096 worker ranks with 2 threads each on Blue Waters XE nodes. Five images are generated, one for each of 5 variables. (One is the fuel fraction FV rendered as a hemisphere: figure 2.)
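A minimal sketch of such a brick lookup (a hypothetical decomposition, not PPMstar's actual scheme), which gives the geometry a rank would attach to its 2D rendering:

    ! Minimal sketch (hypothetical decomposition, not PPMstar's actual scheme):
    ! locate an MPI rank's brick within the global Cartesian grid so its 2D
    ! rendering can be tagged with the brick's geometry.
    subroutine brick_origin(rank, nbx, nby, nbz, ncell, ix0, iy0, iz0)
      implicit none
      integer, intent(in)  :: rank            ! worker rank, 0-based
      integer, intent(in)  :: nbx, nby, nbz   ! bricks per dimension
      integer, intent(in)  :: ncell           ! cells per brick edge
      integer, intent(out) :: ix0, iy0, iz0   ! global cell offset of this brick
      integer :: ibx, iby, ibz
      ibx = mod(rank, nbx)
      iby = mod(rank / nbx, nby)
      ibz = mod(rank / (nbx * nby), nbz)
      ix0 = ibx * ncell
      iy0 = iby * ncell
      iz0 = ibz * ncell
    end subroutine brick_origin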
Image (pixels):               400²    800²    1600²   3200²
Cells:                        64³     128³    256³    512³
Ranks:                        8       64      512     4096
Solve (seconds):              650     1265    2470    4980
Solve + Srend (seconds):      677     1360    2584    5149
Srend % increase:             4.2     7.5     4.6     3.4
Table 1: PPMstar + Srend, weak-scaling tests on Blue Waters

Image (pixels):                    400²    800²    1600²   3200²
Maximum samples per rank (10⁶):    3.3     5.1     6.3     7.1
Average samples per rank (10⁶):    1.7     1.7     1.7     1.7
Total samples (10⁹):               0.013   0.107   0.862   6.9
Extra Srend ranks:                 1       1       9       65
Table 2: PPMstar + Srend sampling intensity. Note that the average sampling density per process remains a constant 1.7 million samples, but the maximum sampling density rises.
Figure 2: FV hemispherical view, step 174

Tables 1 and 2 show that the rendering overhead increases, yet not as quickly as the simulation work. For each increase of problem resolution by a factor of two in each of the x, y, and z dimensions, the number of time steps required between dumps doubles because the simulation time increment must be halved. Thus, the simulation work scales as N⁴, where N is the number of cells in each dimension, while in this weak-scaling test the number of ranks grows only as N³, so the solve time per rank roughly doubles from column to column of Table 1. Other sources of overhead include messaging--which we are still measuring--but also the extra MPI ranks for compositing and writing imagery to disk. We are evaluating performance for different configurations. Thus far, 216 worker ranks per composer and 64 composers per finisher works well, but 512 composers per finisher may exceed memory limits. Each extra level inserted within the rendering tree reduces memory requirements at that level and above, so 512 composers could send to 16 higher composers, and those 16 higher composers would send data to the finisher.

A notable result from developing Srend with three related yet different codes is that the identical code is used in each, just utilized differently. Significant differences, such as SMP-only vs. MPI and MPI+OpenMP, are enabled through cpp definitions, as sketched below. In practice, these codes might be compiled frequently, and Srend fits nicely into the build mechanism of each so that its use (or not) is convenient by setting one definition. Of note is the fact that Srend does not utilize nor interfere with the established PPMstar, CM1, and WRF I/O methods. Data may be saved for post-processing in compressed byte, netCDF, and HDF5 forms as usual for visualization which requires full exploration of the data. However, avoiding some or all raw data output to reduce storage and post-processing overhead is an important feature which Srend enables.
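As an illustration only--the macro name below is hypothetical, not Srend's actual build definition--such a cpp guard compiled through the usual .F preprocessing looks like:

    ! Minimal sketch of a cpp build guard (SREND_MPI is a hypothetical macro,
    ! not Srend's actual definition); compile with or without -DSREND_MPI.
    program build_flavor
      implicit none
    #ifdef SREND_MPI
      print *, 'compiled with MPI messaging enabled'
    #else
      print *, 'compiled for SMP-only use; no MPI calls are compiled in'
    #endif
    end program build_flavor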
Recent PPMstar runs with 64 team leaders, 216 workers per team, 64 composers, and one finisher (13,953 ranks) have been successful in generating a great deal of data. One helpful feature of this configuration is that 872 Blue Waters nodes are each fully loaded with 16 ranks running two threads each, while the lone finisher at the end has a node's memory all to itself. This may be necessary, as the latest run used 19 separate Srend views. Ten of these views each generated one image: hemispheres and octants showing the fluid fraction (burnable H and He), the Y velocity component (up), vorticity, divergence, and energy. Two views from the inside of the star outward rendered 16 concentric shells of the fluid fraction and energy with alpha; these shells were saved in a structure which can be used to interactively explore the star by layers, quickly composited into 136 sensible sequences. The last view, of the fluid fraction from the outside of the star inward, has 32 shells for 528 sensible sequences. For storage, these are 74 images which sum to 670 MB uncompressed. This compares well to the standard 5.4 GB of data per output dump. PPM-format images (Portable Pix Map) compress well using tar+gz or to PNG using ImageMagick convert, but one of the more useful methods is generating movies in place using ffmpeg. At low resolution and in a suitable format (webm and mp4 work well), ffmpeg is quite fast.

A special note regarding full spherical lon/lat renderings (360 degrees by 180 degrees): these are done with three views: north polar, equatorial, and south polar. "Longitudes" converge at the poles (the polar problem) and would create huge and mostly useless rendering work for those processes whose brick intersects or lies near a pole. Instead, we view the poles separately above 45 degrees and below -45 degrees "latitude", then these polar views are resampled to form the top and bottom quarters of a full lon/lat rendered image. The full spherical view is an awkward form for most viewers, yet there are systems designed specifically for it, notably Science On a Sphere (SOS [16]) systems, which are common in science museums. Our output from this particular 768³ run is ready for SOS viewing as written to disk.
Figure 3: FV at step 505, shell 11

The figure above shows the fluid fraction FV of unburned H and He, with the fraction increasing from dark blue to yellow. It combines three separate renderings in one: the north polar (above +45 degrees), equatorial (-45 to +45 degrees), and south polar (below -45 degrees) regions. These thin shell views can be combined to compare variables.
Figure 4: FV compared to ENUC by shell (FV colors inverted, ENUC inverted and cast to gray scale)
These image tiles (figure 4) are 35-degree-square sections taken from shell imagery generated at dump 505. The shell numbers increase with distance from the center of the star simulation, so the lower right image of shell 8 samples fuel FV at a distance of 14.9 to 15.8 megameters (10⁶ m) from the star's center. Note that the fuel (FV) matches well with nuclear energy production (ENUC), and more so with increasing depth within the star. The dynamics of this are much clearer in movies, which show the fuel FV being drawn down into the star by convection.
5.2 CM1 r17 CM1 partitions the horizontal extent in the x and y directions, and a job was defined that would fit within the memory of one node and run with 1 MPI rank. This strong-scaling test was modeled after the test George Bryan (the CM1 author) used on NCAR's Yellowstone [17], but our test only used up to 64 ranks in the progression 1, 4, 16, 64. At least one additional rank for Srend must be used. The single Srend finisher can composite for all rendering ranks, but additional compositors between renderers and the finisher will be used for higher-level scaling, so we inserted one for these tests. From observation, the extra compositing layer does not affect rendering time. For larger-scale simulations, further compositing layers between renderers and a finisher would help reduce the load on a layer's compositors.
Figure 5: CM1 + Srend Logic

Unlike the PPM codes, CM1 has only worker ranks, and there are some collective MPI operations. Srend requires at least one separate rank for compositing and disk writing, so CM1 is run on its own communicator, and Srend uses MPI_COMM_WORLD, which has the needed extra ranks (a sketch of this arrangement follows at the end of this subsection). Srend_render is called in the top-level cm1 routine within the solve loop. It was convenient with CM1 (and also with WRF) to utilize the common namelist.input file where all applicable Srend parameters are defined. The variables of interest chosen for the tests were the density change (rho1-rho0) and the cloud fraction per cell (qc), and these require scaling to bytes for Srend render input.

Worker cores:                 1      4      16     64
Solve (seconds):              1019   233.5  71.7   24.98
Solve + Srend (seconds):      1139   276.7  82.6   27.4
Srend % overhead:             11.8   18.5   15.2   9.7
Seconds / image:              12     4.32   1.09   0.24
Gflops / process (average):   1.9    1.35   1.34   1.52
Table 3: CM1 + Srend, strong scaling test on LCSE nodes

The test ran the simulation for 20 seconds, generating an image every 2 seconds for 10 images in total. This ran on University of Minnesota LCSE 8-core Xeon nodes using at most 8 ranks per node and 1 thread per process. The sampling density in this case was defined indirectly by the user, by the sampling increment dt = .25, and by the position of the volume within the view. There are 277,919,948 samples within this view and 84 flops per sample in the core ray-casting loop, for 23 Gflops per image. Rendering work is not balanced between ranks, and the imbalance worsens with scale, yet it stays within the limit of the maximum volume sampling density.

Worker cores:                      1     4     16    64
Average samples per rank (10⁶):    278   69    17    4.3
Minimum samples per rank (10⁶):    278   43    8.7   2.0
Maximum samples per rank (10⁶):    278   100   32    9.2
Table 4: CM1 + Srend sampling intensity. As the application scales further upward, the maximum sampling per rank approaches a decrease by a factor of 1/4 for each doubling of the number of ranks in each horizontal direction.

Figure 6: CM1 rho1-rho0, 512x512x64 supercell at 7200 seconds, image number 361, 1600x800 pixels (colors inverted)
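The communicator arrangement described at the start of this subsection--CM1 solving on its own communicator while the dedicated Srend rank(s) exist only in MPI_COMM_WORLD--can be sketched as follows (the rank assignment and names are hypothetical, not the actual CM1/Srend code):

    ! Minimal sketch (hypothetical rank assignment): split MPI_COMM_WORLD so
    ! the last rank is reserved for Srend compositing/finishing while the
    ! solver runs on the remaining ranks.  Assumes at least two ranks.
    program split_example
      use mpi
      implicit none
      integer :: ierr, world_rank, world_size, color, cm1_comm
      call MPI_Init(ierr)
      call MPI_Comm_rank(MPI_COMM_WORLD, world_rank, ierr)
      call MPI_Comm_size(MPI_COMM_WORLD, world_size, ierr)
      ! color 0 = solver/renderer ranks, color 1 = dedicated Srend rank
      color = merge(1, 0, world_rank == world_size - 1)
      call MPI_Comm_split(MPI_COMM_WORLD, color, world_rank, cm1_comm, ierr)
      if (color == 0) then
         ! ... solver collectives use cm1_comm; rendering sends use MPI_COMM_WORLD ...
      else
         ! ... dedicated composer/finisher loop receives renderings over MPI_COMM_WORLD ...
      end if
      call MPI_Comm_free(cm1_comm, ierr)
      call MPI_Finalize(ierr)
    end program split_example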
5.3 WRF 3.6.1 WRF currently partitions the horizontal extent in the x and y directions. (Vertical partitioning is scheduled for the next 2015 release.) WRF was not tested for scaling.
Figure 7: WRF + Srend Logic

The WRF mechanism for reading namelists is incorporated so that all Srend parameters are available to each rank through the WRF grid% structure. Recent success incorporating Srend within WRF comes from experience designing and optimizing Srend for more straightforward codes, particularly PPMstar, which is one file versus the complex framework of WRF.
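Srend's actual namelist group and variables are not listed here; the names below are hypothetical, but they illustrate how rendering parameters can be picked up from a namelist.input-style file in Fortran:

    ! Minimal sketch (hypothetical group and parameter names) of reading
    ! rendering parameters from a namelist.input-style file.
    program read_render_namelist
      implicit none
      integer :: nviews, image_nx, image_ny, iunit, ios
      character(len=64) :: variable_name
      namelist /srend_params/ nviews, image_nx, image_ny, variable_name
      ! defaults in case the file or group is absent
      nviews = 1; image_nx = 800; image_ny = 400; variable_name = 'qc'
      open(newunit=iunit, file='namelist.input', status='old', action='read', iostat=ios)
      if (ios == 0) then
         read(iunit, nml=srend_params, iostat=ios)
         close(iunit)
      end if
      print *, 'views:', nviews, ' image:', image_nx, 'x', image_ny, ' variable: ', trim(variable_name)
    end program read_render_namelist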
6. SREND OPTIONS Some current and emerging CFD codes use Adaptive Mesh Refinement (AMR) on regular Cartesian grids. Srend always renders bricks based on a passed AMR value, which is 1 for the base cell width, a positive integer for refinement, and a negative integer for coarsening. This feature comes almost free from the Srend design, and it is used to render fluid fraction data generated by PPMstar, which is double resolution in all dimensions (AMR = 2), by using sub-cell information.

We studied the option to incorporate 2D imagery within the Srend tree. Maps, images, legends, and other kinds of 2D data would be useful additions for illustrating rendered 3D data. We decided to stick with 3D volume rendering and instead cast 2D imagery to cell-width slabs for ordinary 3D rendering. The AMR feature allows high-resolution 2D imagery to be used when scaled to fit an integer AMR refinement value.

Figure 8: Rendering a "surface" below a volume

We tested this method in WRF by rendering the 3D cloud fraction variable qc and rendering the 2D surface velocity v as a slab just below the volume. These renderings were done separately by each rank with two calls to srend_render, but the 3D and "2D" render calls sent results to separate composers, and their common finisher target composited the 2D rendering below the 3D rendering.

Figure 9: WRF + Srend, qv+qr over v

The image above (figure 9) was rendered on a workstation using four MPI ranks for WRF. The four tile artifacts on the surface were created for testing. Normally, Srend would be used without MPI for a workstation SMP application, and in this case srend_render also writes images to disk. For imagery with RGB values, these would have to be converted to bytes aligned to the color map used; this is awkward and not generally useful. An earlier option to restore is to allow data point (cell center) values as 1, 3, or 4 bytes corresponding to scaled byte, opaque RGB, and RGB with alpha. Thus, properly scaled imagery could be used directly and located in views. The RGB-with-alpha option would also allow embedded solids to be rendered.

One option abandoned was output tiling and format conversion as an integrated Srend finisher feature. These features created some problems and complexity on certain network file systems, and it turns out that ImageMagick convert does a wonderful job. In fact, image conversion and tiling are supported within Srend through system calls to the ImageMagick convert and mogrify utilities. These are especially useful for CM1 and WRF, as .ppm images are less easily viewable away from Linux desktop machines. Common file conversions are enabled by supplying a destination filename with the desired extension, say .png or .jpg. Likewise, tiling is enabled by setting tiles_right and tiles_down to values above 1, then providing suitable file destinations and an end format to suit viewer software requirements. Stereo imagery is created by rendering twice, once for each eye. Srend parameters set eye offsets which are used to determine common central clipping spheres and offset sampling for quality at the near common clipping sphere.
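The exact call Srend makes is not shown here, but the general pattern--shelling out to ImageMagick from Fortran--can be sketched as follows (the file names are placeholders):

    ! Minimal sketch (placeholder file names): convert a finished .ppm image
    ! to .png via a system call to ImageMagick's convert utility.
    program convert_image
      implicit none
      integer :: estat, cstat
      call execute_command_line('convert view01_000361.ppm view01_000361.png', &
                                wait=.true., exitstat=estat, cmdstat=cstat)
      if (cstat /= 0 .or. estat /= 0) print *, 'conversion failed:', cstat, estat
    end program convert_image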
7. FUTURE WORK 7.1 Educational Uses Convenience is clear for relatively small-scale runs. A single executable delivers finished imagery suitable for real-time display, and this has been demonstrated for educational use. The image below is from a student WRF-fire run defined and executed through a web form for the spring 2015 FDLTCC Disasters course (Geography 2010) in May. This was an online course in two sections serving 70 students, and the WRF-fire exercise followed a discussion of wildfires and modeling.
Figure 10: WRF-fire simulation exercise
Students filled in a web form (cs.fdltcc.edu/fire.html): wind vector, ignition parameters, comments, and a one-time key:password pair. Their submitted jobs were placed on a queue, the local server processed jobs sequentially, and web pages were created showing their results. The queue could be viewed by anyone, and it also showed the progress of a running job as imagery was created. This exercise ran entirely on a local machine, but this particular simulation code could be run effectively on a remote XSEDE machine with superior performance by uploading the run parameters, running the setup and wrf executables, then downloading the resulting imagery. We address volume rendering within simulation codes which are designed for researchers--not for lower division students--yet these codes can be set up with the help of researchers for similar interesting educational uses case by case. FDLTCC is a partner with San Jose State University (SJSU) on a recently funded project for the study of urban heat islands, wildfires, and atmospheric tracers (Sen Chiao, "Center for Applied Atmospheric Research and Education (CAARE)", NASA MUREP Institutional Research Opportunity (MIRO)). We plan to develop similar educational
activities with SJSU atmospheric scientists for our students using WRF and other community codes.
7.2 Performance The original plan for Srend was to render data at a much finer level, for the PPM codes down to the level of 4x4x4 = 64-cell "sugar cubes" which--with boundaries and instructions--fit entirely in close cache for fast staccatos of highly optimized vector arithmetic activity by each thread. We initially opted for code which works correctly on 3D data arrays at the much coarser rank level, as there are very few CFD codes (PPMstar is one) which operate at this finer level, and data by rank is still generated by PPMstar for post-processing exploration. In fact, fine cache-level rendering optimization by dicing bricks into sugar cubes could apply to all our targeted codes at the rank level without making deep and quite intricate modifications to simulation code.
7.3 Large-scale Simulations For runs with PPMstar on Blue Waters, we created compositing and image-writing processes on MPI ranks separate from the simulation & rendering ranks, and we used non-blocking MPI calls. This de-couples simulation & rendering from compositing & disk IO. The nature of the PPMstar code ensures that there is sufficient time between steps for messaging and disk IO, but MPI issues have arisen. Scaling issues appeared at 64*216 = 13,824 MPI worker ranks for a 768³ PPMstar simulation of Sakurai's object, and the target scale is 512*216 = 110,592 MPI worker ranks for a 1536³ simulation.

Figure 11: Original MPI Rank Ordering

The expedient technique of placing separate MPI compositing and disk IO ranks after the simulation/rendering ranks (figure 11) does not scale past a certain point. One solution is to distribute compositing MPI ranks "close" to groups of simulation & rendering workers so that MPI traffic does not converge on a small set of ranks "far" from the sending ranks. This has already been deployed successfully for team leaders and timekeepers serving their teams of workers, so we also add compositors to this set of team servers (figure 12). In this scheme, final compositing and writing of imagery to disk is not done within the simulation job. Each process among the compositing ranks writes its own results to disk.

Figure 12: Distributed MPI Rank Ordering

A pair of compositors is assigned to each team. It is important to pack such team servers in multiples of 4 for performance, yet the pair of compositors is used in alternating fashion, so the extra compositor effectively doubles the time allowed to complete disk writes and thus reduces the chance of the simulation waiting for IO completion.

A separate job can read these files, composite the arrays, then write completed images to disk. Our routine finish_dump designed for this particular run does the final compositing and image-writing work by spans of dump numbers and view numbers. It uses Srend code but replaces the MPI_recv calls with file read calls. Our most recent tests (8 workers in each of 216 teams) use 27 rendering calls which result in 82 separate images (figure 13). Each dump (from all 216 teams) is about 36 GB. Processing everything in one of these dumps takes about 7 minutes for one thread to complete, and the result is about 1.6 GB in raster RGB and RGBA imagery which, in one typical case, compresses to a 126 MB archive file using tar and gzip.

Figure 13: Imagery Produced per Dump Step (for illustration)

When this problem is scaled up to a 1536³ simulation using 216 teams of 512 workers (110,592 worker MPI ranks using 442,368 threads/CPU cores), the rendering work would be partitioned among many more workers, and the dump size and associated processing work for final imagery would be identical to that in our test runs. At this scale, the Srend dump size of 36 GB is modest compared to a full state dump at 500 GB and a scaled-byte dump of cell variables at 45 GB. Finishing off these Srend dumps to imagery can be done with independent finish_dump jobs, each processing its own span of dumps. We realize that much of the stress inflicted on a cluster by Srend is created by our imagery selection. Compositing either the nuclear energy production ENUC or the cell H+He fuel fraction FV in 16 shells at 4096x2048 RGBA pixels requires 8 GB of RAM for the 216 incoming data buffers and 2.2 GB of RAM for the target compositing buffer. A lighter selection of imagery, say a few images of state variables for purposes of monitoring a running simulation, might easily be composited from 216 teams and written to disk by one rank in any position.

7.4 Dissemination www.lcse.umn.edu/srend cs.fdltcc.edu/srend
These websites have the code for download, and there are links to small self-contained examples as well as specifics on how to incorporate Srend within recent versions of WRF and CM1. We envision that Srend could be directly useful at the desktop and small cluster level for problems which have significant 3D features. Srend is currently one file (srend.F) which is sufficiently flexible for our narrow target class of time-evolving Cartesian grid CFD codes. Simplicity (one file) has been especially convenient when incorporating Srend within codes which have their own complex frameworks.
8. ACKNOWLEDGEMENTS We thank NASA for funding our EMARE project (Elizabeth Jones, "Environmental Modeling And Research Experience", #NNX11AQ96G) for work through summer 2014 and CAARE for continuing work in 2015-20, and the Minnesota State Colleges and Universities (MnSCU) for sabbatical funding specific to these efforts in 2014-15. Prior Teragrid and XSEDE [18] allocations for education, as well as outreach and education staff, were especially helpful & highly supportive of our work developing techniques which could be used for remote educational use. This work used the Extreme Science and Engineering Discovery Environment (XSEDE), which is supported by National Science Foundation grant number ACI-1053575. Development of the PPM codes was supported by contracts from Los Alamos National Lab. and Sandia National Lab., local computing and visualization facilities at the LCSE were supported through NSF CNS grant 0708822, and computing on Blue Waters at NCSA [19] was supported through NSF OCI PRAC grants 0832618 and 1440025.
9. REFERENCES
[1] P. Colella and P. Woodward, The piecewise parabolic method for gas dynamical simulations, J. Comput. Phys. 54, 174 (1984).
[2] George H. Bryan and J. Michael Fritsch. A Benchmark Simulation for Moist Nonhydrostatic Numerical Models. Monthly Weather Review, 130(12):2917-2928 (2002).
[3] Michalakes, J., S. Chen, J. Dudhia, L. Hart, J. Klemp, J. Middlecoff, and W. Skamarock (2001): Development of a Next Generation Regional Weather Research and Forecast Model. Developments in Teracomputing: Proceedings of the Ninth ECMWF Workshop on the Use of High Performance Computing in Meteorology. Eds. Walter Zwieflhofer and Norbert Kreitz. World Scientific, Singapore, pp. 269-276.
[4] M. Dorier, G. Antoniu, F. Cappello, M. Snir, R. Sisneros, O. Yildiz, S. Ibrahim, T. Peterka, and L. Orf. Damaris: Addressing Performance Variability in Data Management for Post-Petascale Simulations. Preprint submitted to Journal of Parallel and Distributed Computing, February 8, 2015. [http://www.mcs.anl.gov/papers/P5290-0215.pdf]
[5] Orf, L., R. Wilhelmson, and L. Wicker, A Numerical Simulation of a Long-Track EF5 Tornado Embedded Within a Supercell. 94th Am. Meteorol. Soc. Annual Meeting, Atlanta, Ga., February 2-6, 2014.
[6] https://wci.llnl.gov/simulation/computer-codes/visit/
[7] http://www.paraview.org/
[8] http://www.vtk.org/
[9] Woodward, P. and D. Porter, cited 2002: LCSE Hierarchical Volume Renderer (HVR) User's Guide. [http://www.lcse.umn.edu/hvr/HVR-Users-Guide-4-102.pdf]
[10] Nystrom, N., D. Weisser, J. Lim, Y. Wang, S. T. Brown, R. Reddy, N. T. Stone, P. R. Woodward, D. H. Porter, T. Di Matteo, L. V. Kale, and G. Zheng, Enabling Computational Science on the Cray XT3, Proc. CUG (Cray User Group) Conference, Zurich, May 2006.
[11] Akira Kageyama and Tomoki Yamada. An approach to exascale visualization: Interactive viewing of in-situ visualization. Computer Physics Communications 185(1): 79-85 (2014).
[12] Marc Levoy, Efficient Ray Tracing of Volume Data, ACM Transactions on Graphics, 9(3):245-261, July 1990.
[13] Woodward, P. R., Herwig, F., and Lin, P.-H., Hydrodynamic Simulations of H Entrainment at the Top of He-Shell Flash Convection. Astrophysical Journal 798, 49 (2015). arXiv:1307.3821 (2013).
[14] Woodward, P. R., J. Jayaraj, P.-H. Lin, M. Knox, D. H. Porter, C. L. Fryer, G. Dimonte, C. C. Joggerst, G. M. Rockefeller, W. W. Dai, R. J. Kares, and V. A. Thomas, Simulating Turbulent Mixing from Richtmyer-Meshkov and Rayleigh-Taylor Instabilities in Converging Geometries using Moving Cartesian Grids, Proc. NECDC2012, Oct. 2012, Livermore, Ca., LA-UR-13-20949; also available at www.lcse.umn.edu/NECDC2012.
[15] Herwig, F., P. R. Woodward, P.-H. Lin, M. Knox, and C. L. Fryer, Global Non-Spherical Oscillations in 3-D 4π Simulations of the H-Ingestion Flash, Astrophysical Journal Letters 792, L3; preprint available at arXiv:1310.4584 (2014).
[16] http://sos.noaa.gov/
[17] http://www2.mmm.ucar.edu/people/bryan/cm1/pp.html
[18] John Towns, Timothy Cockerill, Maytal Dahan, Ian Foster, Kelly Gaither, Andrew Grimshaw, Victor Hazlewood, Scott Lathrop, Dave Lifka, Gregory D. Peterson, Ralph Roskies, J. Ray Scott, and Nancy Wilkins-Diehr, "XSEDE: Accelerating Scientific Discovery", Computing in Science & Engineering, vol. 16, no. 5, pp. 62-74, Sept.-Oct. 2014, doi:10.1109/MCSE.2014.80.
[19] Brett Bode, Michelle Butler, Thom Dunning, William Gropp, Torsten Hoefler, Wen-mei Hwu, and William Kramer (alphabetical). The Blue Waters Super-System for Super-Science. In Contemporary High Performance Computing, Jeffrey S. Vetter, editor, Chapman and Hall/CRC, 2013. Print ISBN: 978-1-4665-6834-1, eBook ISBN: 978-1-4665-6835-8.