Scaling Crowd Simulations in a GPU Accelerated Cluster

Hugo Pérez(1,2), Benjamín Hernández(3), Isaac Rudomín(2), and Eduard Ayguadé(1,2)

1 Universitat Politècnica de Catalunya, BarcelonaTECH, Barcelona, Spain
  [email protected]
2 Barcelona Supercomputing Center, Barcelona, Spain
3 Oak Ridge National Laboratory, Oak Ridge, TN, USA
Abstract. Programmers need to combine different programming models and fully optimize their codes to take advantage of the various levels of parallelism available in heterogeneous clusters. To reduce the complexity of this process, we propose a task-based approach for crowd simulation using OmpSs, CUDA and MPI, which allows taking full advantage of the computational resources available in heterogeneous clusters. We also present the performance analysis of the algorithm under different workloads executed on a GPU cluster.

Keywords: Crowds · Simulation · Visualization · Parallel programming models · Accelerators · Heterogeneous architecture · GPU cluster · HPC
1 Introduction
Heterogeneous clusters provide various levels of parallelism that need to be exploited successfully to fully harness their computational power. At node level, Pthreads and OpenMP directives, among other libraries, provide access to CPU-level parallelism, while OpenACC, CUDA or OpenCL do the same for accelerators. Once node-level parallelism is achieved, additional scalability is accomplished by allocating computation across several nodes using the Message Passing Interface (MPI). Programmers need to combine these programming models in their code and fully optimize it for each level of parallelism, which becomes a complex task. These two processes may also suggest that programming heterogeneous systems is an ad hoc activity. During the last years, the Barcelona Supercomputing Center (BSC-CNS) has led a task-based approach to simplify code development efforts on these architectures. As a result, programmers can use OmpSs and MPI to leverage the computational power of heterogeneous clusters. Our particular endeavors have been focused on the design, development and analysis of crowd simulations in these systems. Our first efforts [1] combined CUDA and MPI for in-situ crowd simulation and rendering on GPU
clusters, and more recently [2] we adopted OmpSs. We proposed a task-based algorithm that makes full use of all levels of parallelism available in a heterogeneous node. OmpSs allowed us to analyze the simulation under different resource allocations. In this article we present new techniques to scale the algorithm introduced in [2] to multiple nodes. We also introduce the analysis of the algorithm and provide scalability results for the simulation running under different resource allocations on the Barcelona Supercomputing Center's Minotauro Cluster. The rest of the work is organized as follows: the Background section introduces crowd behavior models and OmpSs as a task-based parallel programming model. The algorithm, implementation and results obtained from our case study are discussed in the following sections. Finally, we present our conclusions and future work, considering OmpSs requirements for visualization.
2 Background

2.1 Crowd Simulation
Microscopic and macroscopic modeling are the most important approaches to crowd simulation. While macroscopic modeling describes global interactions with the environment and the crowd itself, microscopic modeling exposes the interactions among the individuals within a group [3].

We are particularly interested in microscopic modeling of crowds using Agent-Based Models (ABM). An ABM simulates the actions and interactions of autonomous agents and promotes the progressive assembly of agents into groups, resulting in global-scale behaviors, i.e. it provides macroscopic features on a large scale. By applying ABM, the simulation can handle each agent individually and apply general behavior rules in a single-rule, multiple-agents fashion, which suits the Single Instruction Multiple Data (SIMD) model. There is a large body of work in crowd simulation; below we present a brief description of the principal behavior models.

Boids. This method [4] is based on three basic rules: separation, alignment and cohesion. These rules keep a group of boids (bird-like objects) together and give them a collision-free direction.

Social Forces. This method follows a physics-based approach [5]. Agent behavior is modeled by forces called social forces: attractive forces lead the agent to a destination, while repulsive forces keep the agent free of collisions. The method allows the introduction of various social interactions.

Predictive/Velocity-Based Models. In these methods, agents calculate the set of velocities that lead to a collision with an obstacle; to move along collision-free routes, agents choose velocities outside this domain [6]. This concept is expanded by Van den Berg [7,8], introducing the notions of Velocity Obstacles (VO), Reciprocal Velocity Obstacles (RVO) and Optimal Reciprocal Collision Avoidance (ORCA).
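To make the social-force idea above concrete, the following is a minimal sketch of a single-agent update in the spirit of [5]; the constants, the vec2 type and the function name are illustrative assumptions, not code from our simulator.

#include <math.h>

/* Minimal sketch of a social-force update for one agent (in the spirit of [5]).
 * The constants TAU, A, B, the vec2 type and the names are illustrative
 * assumptions, not taken from the simulator described in this paper. */
typedef struct { float x, y; } vec2;

vec2 social_force_step(vec2 pos, vec2 vel, vec2 goal_dir, float desired_speed,
                       const vec2 *others, int n_others, float dt)
{
    const float TAU = 0.5f;          /* relaxation time towards the desired velocity */
    const float A = 2.0f, B = 0.3f;  /* strength and range of the repulsive force */

    /* Attractive term: steer towards the desired velocity. */
    vec2 acc = { (goal_dir.x * desired_speed - vel.x) / TAU,
                 (goal_dir.y * desired_speed - vel.y) / TAU };

    /* Repulsive term: push away from every nearby agent. */
    for (int j = 0; j < n_others; ++j) {
        float dx = pos.x - others[j].x, dy = pos.y - others[j].y;
        float d  = sqrtf(dx * dx + dy * dy) + 1e-6f;
        float f  = A * expf(-d / B);          /* decays with distance */
        acc.x += f * dx / d;
        acc.y += f * dy / d;
    }

    /* Integrate velocity and position with a simple Euler step. */
    vel.x += acc.x * dt;  vel.y += acc.y * dt;
    pos.x += vel.x * dt;  pos.y += vel.y * dt;
    return pos;
}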
Synthetic Vision. This approach simulates the human visual system, using the distance to obstructions and the possible lines of sight. It applies simple heuristics that enable agents to adapt their walking speed and direction [9,10].

We provide a detailed description of crowd simulation and rendering techniques that take advantage of GPUs and GPU clusters in [1,11,12].

2.2 A Brief Introduction to OmpSs
Translating a sequential algorithm into a task-based one consists of encapsulating the main steps into functions, turning these functions into tasks and defining the data dependencies between them (Listing 1.1). The definition of data dependencies is necessary to guarantee data locality and automatic scheduling. OmpSs also allows incremental code parallelization, and thus code testing can be performed during the migration process. We refer the reader to the OmpSs specification for an in-depth description of each directive [13].

Listing 1.1. OmpSs syntax to define a task.

1  #pragma omp target device ({ smp | cuda | opencl })
2          {[ndrange(workDim, gworkSize, lworkSize)]}
3          [implements( function_name )]
4          {copy_deps | [copy_in( array_spec ,...)] [copy_out(...)] [copy_inout(...)]}
5  #pragma omp taskwait [on (...)] [noflush]
In Listing 1.1 (code line 1), the target directive indicates the device and programming model used to execute the task; the ndrange directive, in code line 2, is used to configure the number of dimensions (workDim), the global size of each dimension (gworkSize) and the number of work-items per work-group (lworkSize). The implements directive (code line 3) allows the definition of multiple task implementations of the same function, for example one for the CPU and one for the GPU; the runtime decides which is better during execution and can even execute both in parallel. The copy_deps directive (code line 4) ensures that the device has a valid and consistent copy of the data; therefore, the programmer does not need to do explicit memory allocations for the device or data transfers between the host and the devices. A synchronization point is defined by using taskwait (code line 5), where all data is copied from the device to the CPU memory. Similarly, taskwait on synchronizes the execution and transfers only specific variables. With the noflush clause, synchronization occurs but data is not transferred to CPU memory. Using OmpSs with multiple GPUs offers the following advantages:

– It is not necessary to select the device for each operation.
– There is no need to allocate special memory in the host (pinned memory) or the device.
– It is not necessary to transfer data between the host and the devices; OmpSs maintains memory coherence efficiently and reduces transfers by using data regions.
– Programmers do not need to declare streams; OmpSs implements them by default.
– All available devices are used by default; therefore, the application scales across all the devices.

OmpSs does all these operations automatically.
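As an illustration of the directives in Listing 1.1, the hedged sketch below declares one logical task with two implementations, an SMP version and a CUDA version selected through implements; the function names, array size and ndrange configuration are hypothetical and only exemplify the syntax, they are not part of our simulator.

/* Sketch: one logical task with a CPU implementation and an alternative GPU
 * implementation. Names (scale_cpu, scale_gpu, N) are illustrative only. */
#define N 1024

#pragma omp target device (smp)
#pragma omp task in([N] src) out([N] dst)
void scale_cpu(const float *src, float *dst, float factor);

#pragma omp target device (cuda) ndrange(1, N, 128) copy_deps implements(scale_cpu)
#pragma omp task in([N] src) out([N] dst)
__global__ void scale_gpu(const float *src, float *dst, float factor);

void run(const float *src, float *dst)
{
    scale_cpu(src, dst, 2.0f);   /* the runtime picks the CPU or GPU version */
    #pragma omp taskwait         /* synchronize and copy results back to host memory */
}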
3 Algorithm Overview
Following our previous development and analysis efforts using different programming models and parallelism approaches, we use the wandering crowd behavior algorithm introduced in [1,2], where the design and complexity details of the algorithm are discussed in depth. However, for clarity's sake, we describe the basic algorithm here. The sequential algorithm is composed of three primary tasks, as shown in the following steps:

Step 1: The navigation space is discretized using a grid (Fig. 1(a)).
Step 2: Set up the agents' information (position, direction, and speed) randomly according to the navigation space size.
Step 3: Update the position of each agent: calculate collision avoidance within a given radius. Evaluation of the agent's motion direction switches every 45°, covering a total of eight directions counterclockwise (Fig. 1(b)). The agent moves in the direction with more available cells, or fewer agents, within its radius.

Processing these tasks in parallel within a cluster requires tiling and stencil computations.
Fig. 1. (a) Navigation space discretization. (b) Search radius for collision avoidance.
Fig. 2. Dividing the domain into tiles
First, the navigation space (from now on called the World) and the information for all the agents is divided into zones. Each zone is assigned to a node and, in turn, divided into sub-zones (Fig. 2). Then, the tasks performed in each sub-zone can be executed in parallel by either a CPU or a GPU. Stencil computations are performed on an inter- and intra-node basis. A step-by-step description of the algorithm follows:

Step 1: The navigation space is discretized using a grid; the resulting grid is then divided into zones. Each zone will be computed by one node.
Step 2: Divide each zone into tiles (sub-zones). Each sub-zone will be computed by a CPU or a GPU, which stores its corresponding tile of the World.
Step 3: Set up the communication topology between zones and sub-zones.
Step 4: Exchange borders (stencils).
Step 4a: (Intra-node) Exchange the occupied cells in the borders between sub-zones.
Step 4b: (Inter-node) Exchange the occupied cells in the borders between zones.
Step 5: Update the position of each agent in parallel: calculate collision avoidance within a given radius. Evaluation of the agent's motion direction switches every 45°, covering a total of eight directions counterclockwise. The agent moves in the direction with more available cells, or fewer agents, within its radius.
Step 6: Exchange the agents' information.
Step 6a: (Intra-node) Exchange the information of agents that crossed a border and moved to another sub-zone.
Step 6b: (Inter-node) Exchange the information of agents that crossed a border and moved to another zone.
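As a minimal illustration of Step 5, the sketch below evaluates the eight 45° directions on the occupancy grid and moves the agent towards the least crowded one; the grid layout, the agent representation and the function name are assumptions for illustration, not the actual CUDA kernel used in our implementation.

#include <limits.h>

/* Sketch of the per-agent update (Step 5): scan the eight 45-degree directions
 * (counterclockwise) and move one cell towards the least occupied one.
 * The occupancy grid, its row-major layout and the agent fields are assumptions. */
static const int DX[8] = { 1, 1, 0, -1, -1, -1, 0, 1 };
static const int DY[8] = { 0, 1, 1,  1,  0, -1, -1, -1 };

void update_agent(int *grid, int width, int height, int *ax, int *ay, int radius)
{
    int best_dir = 0, best_occupied = INT_MAX;

    for (int d = 0; d < 8; ++d) {
        int occupied = 0;
        /* Count occupied cells along this direction up to the search radius. */
        for (int r = 1; r <= radius; ++r) {
            int cx = *ax + DX[d] * r, cy = *ay + DY[d] * r;
            if (cx < 0 || cy < 0 || cx >= width || cy >= height)
                occupied++;                         /* treat out-of-bounds as blocked */
            else
                occupied += grid[cy * width + cx];  /* 1 if the cell is occupied */
        }
        if (occupied < best_occupied) { best_occupied = occupied; best_dir = d; }
    }

    /* Move one cell in the least crowded direction if the target cell is free. */
    int nx = *ax + DX[best_dir], ny = *ay + DY[best_dir];
    if (nx >= 0 && ny >= 0 && nx < width && ny < height && grid[ny * width + nx] == 0) {
        grid[*ay * width + *ax] = 0;
        grid[ny * width + nx]   = 1;
        *ax = nx;
        *ay = ny;
    }
}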
4 Implementation
The advantages of OmpSs for code development on heterogeneous architectures were noted in our previous publication [2]. As shown in the algorithm of Sect. 3, a new level of stencil computation and agents' information exchange has been introduced to extend the computation to additional nodes. OmpSs' flexibility allows us to make MPI calls within tasks to implement inter-node stencils; thus, incremental code development is still possible. As an example, Listing 1.2 shows a code snippet of one function converted into a task. In this case, the target directive indicates that the GPU will execute the updatePosition task. OmpSs takes care of memory management through the copy_deps directive. There is an explicit definition of data dependencies for the variables agents, ids and world, which are inputs and outputs of the task. Task execution is done by calling updatePosition as a regular C function.

Listing 1.2. Converting the updatePosition function into a task to be executed as a CUDA kernel.
// Kernel declaration as a task
#pragma omp target device (cuda) ndrange(2, (int) sqrt(count_agents_total), (int) sqrt(count_agents_total), 16, 16) copy_deps
#pragma omp task inout( ([agents_total_buffer] agents) [0:count_agents_total], ([agents_total_buffer] ids) [0:count_agents_total], [world_cells_block] world )
extern "C" __global__ void updatePosition(float4 *agents, float4 *ids, int *subzone, ...);

// Kernel execution
bool runSimulation()
{
    ...
    for (int i = 0; i < subzones; i++) {
        ...
        updatePosition(agents[i], ids[i], subzone[i], ...);
        ...
    }

    #pragma omp taskwait
    ...

    //////////////////////////////////////////////////////////////////
    // Internode exchange of agents
    //////////////////////////////////////////////////////////////////
    MPI_Send(agents_left_out_pos, ...);
    MPI_Send(agents_left_out_ids, ...);
    MPI_Send(&count_agents_left_out, ...);

    MPI_Recv(agents_int_pos, ...);
    MPI_Recv(agents_int_ids, ...);
    MPI_Recv(&count_agents_right_in, ...);
    count_agents_int += count_agents_right_in;
    ...
}
As noted in Sect. 3, we apply a second level of tiling: the domain (agents and World) is first divided among the nodes, assigning one tile (called a zone) to each node, and inside each node the zone is divided into sub-zones. Each sub-zone is then processed as a task, and the tasks are distributed among the computing resources (CPUs and GPUs) for parallel processing. At run time, this information is exchanged within the nodes by OmpSs and between the nodes by MPI (the MPI calls at the end of Listing 1.2).
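The double tiling can be summarized as a mapping from an agent's world cell to a (zone, sub-zone) pair; the short sketch below assumes a square decomposition with hypothetical parameter names and is not taken from the actual code.

/* Sketch of the two-level decomposition: the World is split into
 * zones_per_side x zones_per_side zones (one per node/MPI rank) and each zone
 * into subs_per_side x subs_per_side sub-zones (one per task).
 * The square layout and all names are illustrative assumptions. */
typedef struct { int zone; int subzone; } tile_id;

tile_id locate(int cell_x, int cell_y,
               int world_cells, int zones_per_side, int subs_per_side)
{
    int zone_cells = world_cells / zones_per_side;   /* cells per zone side     */
    int sub_cells  = zone_cells  / subs_per_side;    /* cells per sub-zone side */

    int zx = cell_x / zone_cells, zy = cell_y / zone_cells;
    int sx = (cell_x % zone_cells) / sub_cells;
    int sy = (cell_y % zone_cells) / sub_cells;

    tile_id id;
    id.zone    = zy * zones_per_side + zx;   /* node (MPI rank) owning the agent */
    id.subzone = sy * subs_per_side  + sx;   /* task/tile inside that node       */
    return id;
}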
Fig. 3. Tasks processed in parallel in different nodes
Using run-time variables we can control the number of CPUs and/or GPUs that are used. By default, OmpSs uses all the available computing resources; if more hardware resources are added to a node, for example more CPUs or GPUs, the simulation will use them without any code changes. Figure 3 shows a schema of how tasks are processed in parallel inside each node; it also shows the triple level of parallelism: the first at the node level, the second at the CPU-core level and the third at the CUDA-thread level.
5 Results
We conducted our tests on the Barcelona Supercomputing Center's Minotauro Cluster. Each node has two 6-core Intel Xeon E5649 CPUs at 2.53 GHz, 24 GB of RAM and two NVIDIA Tesla M2090 GPUs, each with 512 CUDA cores and 6 GB of GDDR5 memory. We used CUDA 5.0, the Mercurium compiler version 1.99.9 [14] and the Nanos runtime system version 0.9a [15]. Our first test compared the execution time of the one-node and multi-node versions; for the multi-node version we used 4, 9 and 16 nodes. We simulated up to 126 million agents in a world of 27852 × 27852 cells. For the one-node version, the world was partitioned into 16 tiles, i.e. one zone with 16 sub-zones, using two GPUs and ten CPU cores. In the multi-node version the world was decomposed into 4, 9 and 16 zones, according to the number of nodes, and each zone was partitioned into 16 tiles (sub-zones) in all cases. In each node, both GPUs and ten CPU cores were used; the remaining two cores control the GPUs. Figure 4 shows the results, comparing both versions.
Fig. 4. Execution time one-node vs multi-node
Applying tiling within one node allowed us to simulate a large number of agents, limited only by the host memory size. The bottleneck is the data transfer between the host and the GPUs. Figure 5 shows the data transfers in one iteration for both GPUs, for a simulation of 64 million agents. Although we overlap these transfers with other tasks performed on the CPU, at some point the data calculated by the GPU is indispensable for the simulation to continue. For the multi-node version, the bottlenecks related to CPU-GPU data transfers are partially solved by distributing the data between nodes. Nevertheless, another type of bottleneck appears as communication overhead between nodes. We minimize this bottleneck by compacting data, i.e. when border information is exchanged, only occupied cells are transferred, and by overlapping computation with communication.
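A hedged sketch of the data-compaction idea is shown below: instead of sending an entire border row, only the indices of occupied cells are packed before the MPI exchange, so the message size grows with the local crowd density rather than with the border length. The buffer layout, message tags and names are assumptions for illustration, not the actual exchange code.

#include <mpi.h>

/* Sketch: pack only the occupied cells of a border row before sending it.
 * The border representation, message tags and variable names are illustrative. */
int send_compacted_border(const int *border, int width, int *packed,
                          int neighbor_rank, MPI_Comm comm)
{
    int count = 0;
    for (int x = 0; x < width; ++x)      /* keep only occupied cells */
        if (border[x] != 0)
            packed[count++] = x;         /* store the cell index     */

    /* Send the count first, then the compacted indices. */
    MPI_Send(&count, 1, MPI_INT, neighbor_rank, 0, comm);
    MPI_Send(packed, count, MPI_INT, neighbor_rank, 1, comm);
    return count;
}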
Fig. 5. Data transfer between the host and the GPUs. The pink color represents data transfers from the host to the devices, and the green represents transfers from the devices to the host.
Each node of the Minotauro Cluster has a PCI Express 2.0 (8 GB/s) interface, whereas more modern systems use a PCI Express 3.0 (16 GB/s) interface. Summit, the next cluster at Oak Ridge National Laboratory [16], will provide NVLink technology, which is expected to be five to twelve times faster than current PCI Express 3.0. Thus, in the future these bottlenecks will be alleviated by hardware improvements. Meanwhile, it is important to keep applying different techniques to take advantage of all the cluster's computing power. The speedup achieved for 4, 16, 64 and 126 million agents, using 4, 9 and 16 nodes, is shown in Fig. 6. For 4 and 16 million agents we observe a linear speedup; however, for 64 and 126 million agents there is no speedup at all when using four nodes. This is due to the intensive communication within and between nodes for such large simulations.
Fig. 6. Speedup
The second experiment tested the algorithm's workload. The system can simulate up to 764 million agents (Fig. 7), limited by the size of the world, and it scales up to several hundred million agents by using multiple nodes. This is important because we are now developing the core communication modules that will allow us to develop more complex simulations.
Fig. 7. Weak scaling
6 Conclusion and Future Work
We presented the development of a task-based scalable crowd simulation and analyzed how the simulation scales across different numbers of nodes. The system can scale up to several hundred million agents by using multiple nodes, exploiting all the computing resources available in the cluster. Compared with our previous work [1], which used the MPI and CUDA programming models, we improved our results: the current version allows us to use the computing resources (CPUs and GPUs) effectively through OmpSs. In addition, we identified some bottlenecks and addressed them by applying a double tiling technique that distributes the work between the nodes, minimizing the data transfers between nodes and exploiting the triple level of parallelism. As future work, we will investigate how to reduce the data transfers between the host and the GPUs and also between nodes. We will add task-based in-situ analysis and visualization capabilities to the system, i.e. we expect to take advantage of OmpSs' automatic scheduling, memory management, communication and synchronization for crowd visualization and analysis. A task-based approach may have advantages over current parallel rendering techniques. For example, visualization can easily be coupled to or decoupled from the simulation simply by adding or removing rendering tasks; the same applies to crowd analysis. It will help in decomposing the communication between the simulation, visualization and analysis into tasks, while their execution takes advantage of OmpSs' synchronization and automatic scheduling. Synchronization and automatic scheduling will also allow us to deploy the visualization system on immersive or large-scale tiled displays connected to GPU clusters. Finally, OmpSs can be seen as a bridge connecting data from the simulation to visualization and analysis while keeping the rendering algorithms intact. This interaction is expected to be similar to OmpSs-CUDA programs, where OmpSs is aware of the CUDA context, memory, and GPU availability.
Nevertheless, the full interaction between OmpSs and OpenGL is not supported yet. Additional challenges, such as OpenGL thread safety, OpenGL context creation and registration in OmpSs, OmpSs multi-GPU compatibility for graphics, or updating the OmpSs scheduling system to support graphics load balancing, have to be solved first.

Acknowledgements. This research was partially supported by: CONACyT doctoral fellowship 285730, CONACyT SNI 54067, the BSC-CNS Severo Ochoa program (SEV-2011-00067), the CUDA Center of Excellence at BSC, the Oak Ridge Leadership Computing Facility at Oak Ridge National Laboratory under DOE Contract No. DE-AC05-00OR22725, the Spanish Ministry of Economy and Competitiveness under contract TIN2012-34557, and the SGR programme (2014-SGR-1051) of the Catalan Government.
References

1. Hernández, B., Pérez, H., Isaac, R., Ruiz, S., DeGyves, O., Toledo, L.: Simulating and visualizing real-time crowds on GPU clusters. Computación y Sistemas 18, 651–664 (2014)
2. Pérez, H., Hernández, B., Rudomín, I., Ayguadé, E.: Task-based crowd simulation for heterogeneous architectures. In: Handbook of Research on Next-Generation High Performance Computing. IGI (2015, in progress)
3. Bonabeau, E.: Agent-based modeling: methods and techniques for simulating human systems. Proc. Nat. Acad. Sci. 99, 7280–7287 (2002)
4. Reynolds, C.W.: Flocks, herds and schools: a distributed behavioral model. In: Proceedings of the 14th Annual Conference on Computer Graphics and Interactive Techniques, SIGGRAPH 1987, pp. 25–34. ACM, New York (1987)
5. Helbing, D., Molnár, P.: Social force model for pedestrian dynamics. Phys. Rev. E 51, 4282–4286 (1995)
6. Paris, S., Pettré, J., Donikian, S.: Pedestrian reactive navigation for crowd simulation: a predictive approach. Eurographics 2007 Comput. Graph. Forum 26, 665–674 (2007)
7. Van Den Berg, J., Patil, S., Sewall, J., Manocha, D., Lin, M.: Interactive navigation of multiple agents in crowded environments. In: Proceedings of the 2008 Symposium on Interactive 3D Graphics and Games, I3D 2008, pp. 139–147. ACM, New York (2008)
8. van den Berg, J., Guy, S.J., Lin, M., Manocha, D.: Reciprocal n-body collision avoidance. In: Pradalier, C., Siegwart, R., Hirzinger, G. (eds.) Robotics Research. STAR, vol. 70, pp. 3–19. Springer, Heidelberg (2011)
9. Ondřej, J., Pettré, J., Olivier, A.H., Donikian, S.: A synthetic-vision based steering approach for crowd simulation. ACM Trans. Graph. 29, 123:1–123:9 (2010)
10. Moussaïd, M., Helbing, D., Theraulaz, G.: How simple rules determine pedestrian behavior and crowd disasters. Proc. Nat. Acad. Sci. 108, 6884–6888 (2011)
11. Rudomín, I., Hernández, B., de Gyves, O., Toledo, L., Rivalcoba, I., Ruiz, S.: GPU generation of large varied animated crowds. Computación y Sistemas 17, 365–380 (2013)
12. Ruiz, S., Hernández, B., Alvarado, A., Rudomín, I.: Reducing memory requirements for diverse animated crowds. In: Proceedings of Motion on Games, MIG 2013, pp. 55:77–55:86. ACM, New York (2013)
13. Duran, A., Ayguadé, E., Badia, R.M., Labarta, J., Martinell, L., Martorell, X., Planas, J.: OmpSs: a proposal for programming heterogeneous multi-core architectures. Parallel Process. Lett. 21, 173–193 (2011)
14. Balart, J., Duran, A., González, M., Martorell, X., Ayguadé, E., Labarta, J.: Nanos Mercurium: a research compiler for OpenMP. In: Proceedings of the European Workshop on OpenMP, vol. 8 (2004)
15. Barcelona Supercomputing Center: Nanos++ runtime (2015). http://pm.bsc.es/nanox
16. NVIDIA: Summit and Sierra supercomputers: an inside look at the U.S. Department of Energy's new pre-exascale systems. Technical report, NVIDIA (2014)