Departament d’Informàtica

PhD Thesis
Performance Improvements of Interactive Crowd Simulations

Author: D. Guillermo A. Vigueras González

Valencia, March 2011

Thesis advisors: Dr. D. Juan Manuel Orduña Huertas and Dr. D. Miguel Lozano Ibáñez

In Memoriam of my grandfather, Guillermo


Acknowledgements

Throughout the development of this thesis I have received the support of so many people that I cannot express my sincere gratitude in just a few words. First, I want to thank my thesis advisors, Dr. Juan Manuel Orduña and Dr. Miguel Lozano, for their trust in me to carry out the research that has resulted in this thesis. They have supported me to carry out this thesis, as well as to attend scientific conferences and to make research stays at international institutions. Their experience, support and advice have been crucial for the development of this thesis. Dr. Miguel Lozano has always had great patience (despite my impatience) and a sense of humor when solving all the problems that have arisen throughout the thesis. For his part, Dr. Juan Manuel Orduña has shown an admirable availability, attending to me whenever I needed it to answer questions related to both the research and my personal life. For these and many other reasons, I want to express my sincere gratitude. In the same way, I want to thank Dr. Yiorgos Chrysanthou, Dr. José Manuel García and Dr. Thierry Priol for the time they spent helping me and for allowing me to work within their respective research groups. I would also like to thank some members of the Computer Science Department at the University of Valencia who have selflessly helped me. Especially, I want to thank Vicente Cavero for his ideas and comments on any questions and issues I had, as well as for the assistance provided in the early versions of the distributed system. I also thank Carlos Pérez, whose help and experience were of great value for developing the first multi-threaded implementations and writing the first articles. I would also like to thank Rafa Tornero for his friendship, which has eased the working hours and the problems of living in Valencia. To them and to the other professors of the department who have helped me, thanks. I want to thank my parents for all they have done for me over my entire life and for the education they have given me with their example, based on hard work and effort. The merit of this thesis is largely also yours. To Patricia (my Patri), for her patience and love during these years, especially when we needed to cancel some plan because of a deadline. Her support in the moments when the end of the tunnel was not in sight has been fundamental to the success of this thesis.


Abstract

A crowd can be considered as a large group of individuals that share the same physical environment. These individuals can share a common goal and may act in a different way than when they are alone. The need for computational models that simulate this behavior arises in many research fields. The computer graphics community is interested in integrating realistic and vivid crowds of virtual characters within a 3D application, usually for entertainment purposes such as film production. In the Social Sciences, the interest is related to understanding the behavior behind different social phenomena and making predictions about future events. Safety and civil sciences are interested in simulating social models in order to elaborate evacuation plans, accurately design the elements of the emergency system of a building, or predict risks in crowded events that take place in stadiums or huge outdoor areas. Nevertheless, all these crowd systems require both rendering visually plausible images and managing the behavior of autonomous agents. The sum of these requirements results in a computational cost that sharply increases with the number of agents in the system. Traditionally, the challenges related to crowd simulations consisted of developing efficient graphics techniques and proposing realistic models for animating crowds. Now, the new challenge consists of providing the scalability required by this kind of system when the number of individuals in the scene grows.

This thesis proposes different improvements of crowd systems in order to significantly enhance their performance. First, a distributed system architecture is proposed for simulating large crowds. This system is based on a networked-server interconnection scheme. Results show that this scheme can provide good flexibility and scalability by distributing the elements of the system across different machines. In addition, different improvements are proposed for exploiting the computational capabilities of multi-core and many-core platforms. Results show that the use of these platforms increases the throughput of the simulation system. Performing crowd simulations in a distributed fashion requires a technique for efficiently managing the dynamic workload generated. For that reason, this thesis also studies different partitioning methods for distributing the crowd across the computational resources available in the system. A fitness function is defined in order to compare the performance of the different methods and to choose the one that best suits the crowd partitioning problem. Finally, this thesis proposes a visualization system for displaying images of the simulations. In order to provide the required scalability, the visualization system is designed in a distributed fashion, so that different cameras hosted by different machines can render different viewpoints of the same scene. Results show that the proposed design can provide the scalability and interactivity required by crowd simulations.

Key words: crowd simulation, scalability, load balancing, performance evaluation, multi-core, GPUs, real-time rendering.

This work has been jointly supported by the Spanish MEC, the European Commission FEDER funds, and the University of Valencia under grants Consolider-Ingenio 2010 CSD2006-00046, TIN2009-14475-C04-04, and V_SEGLES_PIE.
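The exact fitness function defined in Chapter 4 is not reproduced in this summary. As a hedged illustration only, a partition-quality measure of this kind typically combines how evenly the agents are spread among the servers with how many agents lie near region borders (border agents generate inter-server communication). The function, field names and weights below are assumptions for the sketch, not the thesis's actual definition:

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Illustrative fitness for a crowd partition: lower is better.
// counts[s]        : number of agents assigned to server s
// borderAgents     : agents whose area of interest crosses a region border
// wBalance/wBorder : illustrative weights for the two cost terms
double partitionFitness(const std::vector<std::size_t>& counts,
                        std::size_t borderAgents,
                        double wBalance = 1.0, double wBorder = 1.0) {
    const std::size_t servers = counts.size();
    std::size_t total = 0;
    for (std::size_t c : counts) total += c;
    const double mean = static_cast<double>(total) / static_cast<double>(servers);

    // Workload imbalance measured as the standard deviation of the
    // number of agents hosted by each server.
    double variance = 0.0;
    for (std::size_t c : counts) {
        const double d = static_cast<double>(c) - mean;
        variance += d * d;
    }
    const double imbalance = std::sqrt(variance / static_cast<double>(servers));

    return wBalance * imbalance + wBorder * static_cast<double>(borderAgents);
}
```

A partitioning method (R-tree, genetic algorithm, convex hull) would be preferred when it yields lower values of such a function at an acceptable repartitioning cost.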

CONTENTS

Resumen
Motivation
Objectives
CHAPTER 1. Introduction
1.1. Requirements of crowd simulations
1.2. Computer Architectures for Distributed Virtual Reality Applications
CHAPTER 2. State of the art
2.1. Crowd Rendering
2.1.1. Efficient Rendering Techniques
2.1.2. Rendering Acceleration through GPUs
2.2. Behavioural animation
2.2.1. Microscopic Models
2.2.2. Macroscopic Models
2.2.3. Hybrid Models
2.3. Computer architecture
2.3.1. Distributed Crowd Systems
2.3.2. Parallel Crowd Systems
2.4. Conclusions
CHAPTER 3. Distributed System for Crowd Simulation
3.1. A Distributed System for Crowd Simulation
3.1.1. The Action Server
3.1.2. The Client Process
3.1.3. Communications model
3.1.4. Evaluation methodology
3.1.5. Performance evaluation
3.2. Distributed server system
3.2.1. The Parallel Action Server
3.2.2. Modifications to the Action Server
3.2.3. Modifications to the Client Process
3.2.4. Communications model
3.2.5. Evaluation methodology
3.2.6. Performance evaluation
3.3. Improving the performance of the Action Server through GPUs
3.3.1. CUDA programming model
3.3.2. A GPU-based Action Server
3.3.3. Evaluation methodology
3.3.4. Performance evaluation
3.4. Accelerating the collision check procedure on multi-core and many-core processors
3.4.1. Accelerating the collision check procedure on multi-cores
3.4.2. Accelerating the collision check procedure on many-cores
3.4.3. Evaluation methodology
3.4.4. Performance evaluation
CHAPTER 4. The Partitioning Problem in distributed crowd simulations
4.1. Region-based partitioning methods for distributed crowd simulations
4.1.1. R-Tree method
4.1.2. Genetic algorithm method
4.1.3. Convex hull method
4.1.4. Evaluation methodology
4.1.5. Performance evaluation
4.2. Enhancing the workload distribution of the QHull partitioning method
4.2.1. Enhancing Workload Distribution
4.2.2. Evaluation methodology
4.2.3. Performance evaluation
CHAPTER 5. Distributed Rendering
5.1. Workload Characterization of a Distributed Visual Client
5.2. A Distributed Visualization System for Large-Scale Crowd Simulation
5.2.1. Modifications to the Action Server
5.2.2. Implementation of the Visual Client Process
5.3. Evaluation Methodology
5.4. Performance Evaluation
CHAPTER 6. Conclusions
6.1. Conclusions and contributions
6.2. Future research lines
6.3. Publications
BIBLIOGRAPHY

LIST OF TABLES

3.1. System throughput for a fixed agent cycle of 250 ms.
3.2. Response times obtained at maximum throughput.
3.3. Evaluation results for a fully centralized system configuration.
3.4. Average response times obtained for pathfinding agents.
3.5. Execution time (ms) for the Mutex-based collision check procedure when varying the cell size and the array size for the high density scenario.
3.6. Execution time (ms) for the Mutex-based collision check procedure when varying the cell size and the array size for the medium density scenario.
3.7. Execution time (ms) for the Mutex-based collision check procedure when varying the cell size and the array size for the low density scenario.
3.8. Execution times (ms) for CPU-based collision check when increasing the number of cores and the number of agents.
3.9. Time (ms) of PCI transfer and execution of kernels for the different GPU versions on Tesla C870 when varying the number of agents.
3.10. Time (ms) of PCI transfer and execution of kernels for the different GPU versions on Tesla C1060 when varying the number of agents.
4.1. Actual performance provided by the different methods.
5.1. CPU utilization (%) of the computer hosting the VCP.
5.2. Visualization requests not processed by the VCP.
5.3. CPU utilization (%) generated by the graphics tasks of the VCP.
5.4. Operations lost by VCPs in scenario 1.
5.5. Operations lost by VCPs in scenario 2.

LIST OF FIGURES

1. Elements of a crowd simulation system.
2. Example of a client-server architecture.
3. Example of a networked-server architecture.
4. Example of a peer-to-peer architecture.
5. Elements of a crowd simulation system.
1.1. Example of a client-server architecture.
1.2. Example of a networked-server architecture.
1.3. Example of a peer-to-peer architecture.
2.1. Computing the impostor images by discretizing the view direction, and views of the character from the set of discrete directions [93].
2.2. Left: snapshot of the Geopostor system rendering a virtual crowd in an urban environment [18]. Right: wireframe view of the same scene, where both the impostor (green rectangles) and the mesh representation of the characters can be seen.
2.3. a) Keyframe definition with vertex correspondences and b) multi-resolution hierarchies computed between consecutive keyframes and represented using an octree.
2.4. Reynolds' rules for the boids simulations [79].
2.5. Evacuation of a crowd using Helbing's model [38].
2.6. a) Steps followed to build the database from real examples and b) steps followed for agent trajectory computation using the examples database [45].
2.7. a) Computed roadmap to reach an exit point avoiding an obstacle and b) path solution search using the roadmap [73].
2.8. Continuum crowds general algorithm overview [95].
2.9. Left: evacuation of a complex and structured scenario using three levels: HiDAC, MACES and CAROSA. Right: three-level model overview [70].
2.10. Overview of the two-level agent framework proposed in [92].
2.11. Manager/worker paradigm used in [76].
2.12. Communication scheme among neighbour processors proposed in [103].
2.13. In PSCrowd, disjoint jobs are evaluated in arbitrary order on any number of SPEs [78].
3.1. An example of the proposed computer architecture for crowd simulation.
3.2. The proposed software architecture.
3.3. Internal structure of the AS.
3.4. The Semantic Database.
3.5. A collision example.
3.6. Evacuation test with 8000 agents.
3.7. Diagram showing the communication model when the system is started.
3.8. Diagram showing the communication model when validating an action request.
3.9. Server performance on collision tests.
3.10. Percentage of positive ACKs during the evacuation simulation.
3.11. General scheme of the proposed architecture.
3.12. Internal structure of an Action Server.
3.13. Diagram showing the communication model when the system is started.
3.14. Diagram showing the communication model when validating a local action request.
3.15. Diagram showing the communication model when validating a border action request.
3.16. Response times for the cluster platform.
3.17. Response times for the interconnected PCs.
3.18. System throughput for different configurations of the cluster platform.
3.19. System throughput for different configurations of the interconnected PCs.
3.20. CUDA hardware interface for the Nvidia GPU G80.
3.21. CUDA programming model.
3.22. Internal structure of an action server using a GPU.
3.23. Baseline algorithm for GPU collision checking.
3.24. Average percentage of CPU utilization for collision tests.
3.25. Aggregated computing time for collision tests.
3.26. Average response times provided to agents.
3.27. Read/reclaim race among threads that access a shared dynamic data structure.
3.28. Diagram of the CPU collision checking using Mutex.
3.29. Illustration of QSBR. Black boxes represent quiescent states.
3.30. Deletion of one element in an RCU linked list.
3.31. Percentage of execution time required by the kernels for the baseline version.
3.32. Grid mapping to global memory in the baseline version.
3.33. Grid mapping to global memory in the improved version.
3.34. Percentage of execution times required by the kernels in the baseline optimized version.
3.35. New algorithm for collision check on the GPU.
3.36. Execution time and speed-up obtained for 1000 agents when using 2 cores and increasing the number of threads.
3.37. RCU speed-up obtained in comparison with the Mutex version when increasing the number of cores.
3.38. Execution times on the Tesla C870 card.
3.39. Execution times on the Tesla C1060 card.
3.40. Speed-up obtained with respect to the Baseline version for the Tesla C870 platform.
3.41. Speed-up obtained with respect to the Baseline version for the Tesla C1060 platform.
3.42. Collisions rate on the Tesla C870 card.
3.43. Collisions rate on the Tesla C1060 card.
3.44. Speed-up obtained for the new GPU procedure executed on different cards, with respect to the CPU implementations.
4.1. Update of two R-Trees: a) initial state, b) after agents movement.
4.2. Snapshots of the partitions provided by R-Tree at different simulation stages: a) beginning, b) middle, c) end.
4.3. Chromosome used for the Genetic Algorithm.
4.4. The Genetic Algorithm main loop.
4.5. Offspring generation.
4.6. Snapshots of the partitions provided by GA at different simulation stages: a) beginning, b) middle, c) end.
4.7. Steps followed by the Quickhull algorithm.
4.8. Snapshots of the partitions provided by QHull at different simulation stages: a) beginning, b) middle, c) end.
4.9. Movement patterns: a) full, b) perimeter, c) up, d) down, e) urban.
4.10. Fitness function values provided by the partitioning methods for the full simulation.
4.11. Execution times required by the partitioning methods for the full simulation.
4.12. Fitness function values provided by the partitioning methods for the perimeter simulation.
4.13. Execution times required by the partitioning methods for the perimeter simulation.
4.14. Fitness function values provided by the partitioning methods for the up simulation.
4.15. Execution times required by the partitioning methods for the up simulation.
4.16. Fitness function values provided by the partitioning methods for the down simulation.
4.17. Execution times required by the partitioning methods for the down simulation.
4.18. Fitness function values provided by the partitioning methods for the urban simulation.
4.19. Execution times required by the partitioning methods for the urban simulation.
4.20. Standard deviations provided by the QHull method for each server.
4.21. H(P) values provided by the QHull and the LB methods when using 5 regions.
4.22. H(P) values provided by the QHull and the LB methods when using 9 regions.
4.23. Standard deviation of the number of avatars hosted by servers when using 5 regions.
4.24. Standard deviation of the number of avatars hosted by servers when using 9 regions.
4.25. Number of lock requests produced by the considered methods when using 5 regions.
4.26. Number of lock requests produced by the considered methods when using 9 regions.
4.27. Aggregated execution times required by the partitioning methods when using 5 regions.
4.28. Aggregated execution times required by the partitioning methods when using 9 regions.
5.1. General scheme of the distributed simulation system with Visual Client Processes.
5.2. Scheme of an Action Server with VCP connections.
5.3. Scheme of the design of the Visual Client Process.
5.4. Steps for loading characters animation data on GPU memory.
5.5. Description of the GPU implementation of the VCP.
5.6. First scenario considered for evaluation purposes.
5.7. Second scenario considered for evaluation purposes.
5.8. CPU utilization for three different VCPs in the first scenario.
5.9. Frame rates for three different VCPs in the first scenario.
5.10. CPU utilization for three different VCPs in the second scenario.
5.11. Frame rates for three different VCPs in the second scenario.
5.12. CPU utilization of simulation servers in scenario 1.
5.13. Response time provided by simulation servers in scenario 1.
5.14. CPU utilization of simulation servers in scenario 2.
5.15. Response time provided by simulation servers in scenario 2.

LIST OF ALGORITHMS

1. Algorithm to precompute the paths contained in the CA.
2. Pseudocode for an AE Thread in the CPU-based server.
3. Pseudocode for an AE Thread in the GPU-based server.
4. Algorithm for the Quick Hull method.
5. Pseudocode for the workload balancing algorithm.

RESUMEN

MOTIVATION

A crowd can be considered as a large group of individuals that share the same physical environment. These individuals can share a common goal and may act in a different way than when they are alone [8]. This collective behavior has been studied since the end of the nineteenth century. However, the computational models for simulating this behavior are fairly recent, with most of the work carried out only in the mid and late nineties. From a general point of view, the need for models that simulate virtual populations arises in many research fields. The computer graphics community is interested in integrating crowds of realistic virtual characters within a 3D application, usually for entertainment purposes such as film production. In the Social Sciences, the interest is related to understanding the behavior behind different social phenomena and making predictions about future events. Safety and civil sciences are interested in simulating social models in order to elaborate evacuation plans, to design the emergency system of a building, or to predict risks in catastrophes that may occur, as well as in large events organized in stadiums or big open-air areas.

The interest in crowd simulations and the possibility of predicting risks have attracted the attention of different organizations. Recent natural disasters such as hurricane Katrina (2005) or the earthquake in Haiti (2010) have shown the magnitude of their effects and their impact on the population of a country. In addition, there are dangers inherent to every public event with a large number of participants. Every year there are reports of catastrophes around the world in which lives are lost. For this reason, there is a growing interest from governments in preventing the problems that may arise during an evacuation, as well as in developing simulation systems that can be used to train law enforcement forces in tactics for peacefully controlling a crowd. In this way, crowd simulation is proposed as a method for mitigating or avoiding the dangers associated with massive events.

Beyond the different applications of crowd simulations, from a computational point of view a crowd simulation system requires both a plausible visualization of the images of the virtual world and the simulation of the behavior of the autonomous agents. The sum of these requirements results in a high computational cost that increases with the number of agents in the system. For many years, the challenge was to obtain realistic simulations of virtual crowds. Now, the new challenge consists of providing the scalability required by this kind of system when the number of agents in the scene grows.

Focusing on the structure of a crowd system, it can have several layers, as Figure 1 shows. Traditionally, previous approaches have focused on the layers related to computer graphics and behavioral animation, solving the problems related to rendering large groups of virtual characters as well as obtaining realistic behaviors. However, a new layer related to the computer architecture area has recently been considered. This new layer deals with the problems of designing a software architecture capable of making efficient use of a hardware platform. In this way, the proposals of this thesis are made within this layer, in order to efficiently manage the high computational cost of crowd simulations and to fully exploit the underlying hardware.

Figure 1: Elements of a crowd simulation system.

The continuous increase in the computational power of computers has motivated a growing interest in performing crowd simulations in real time. The term real time in virtual reality applications is different from the term used for real-time systems. In those systems, it is guaranteed that the outputs will be obtained at an exact instant in time. However, these restrictions are relaxed in a virtual reality system, since the term real time is replaced by the term interactive. In fact, a virtual reality application is considered to be real time if the user perceives a fluid interaction with the system. Interactive crowd simulations are used in entertainment applications in which the user interacts with the virtual world, the best-known example being computer games. They are also used in Virtual Reality and Augmented Reality systems for training and simulation purposes. One example of the application of these interactive systems are the systems used to train the law enforcement forces that have to control a crowd. However, the interactivity requirements of these applications raise new challenges. Usually, non-interactive simulation systems carry out the computations needed to animate the crowd off-line. Then, using the computed data, the scene is rendered taking as much time as needed in order to obtain high-quality graphics. However, in the case of interactive crowd systems, the computer has to manage the high computational load resulting from the simulation of realistic behaviors and the plausible visualization of the scene in just a fraction of a second. This time constraint must be satisfied in order to provide an adequate level of interactivity, which represents a challenge when developing a crowd simulation system.

There are several proposals in the literature to solve the problems associated with interactive crowd simulation. As shown in Figure 1, some approaches address only some of the problems, proposing behavioral animation models or techniques for efficiently rendering a crowd. Other approaches address both problems at the same time, allowing the simulation and visualization of crowds with a large number of individuals. However, these solutions cannot provide the scalability required by crowd systems. For this reason, the motivation of this thesis consists of improving the performance of crowd simulation systems in such a way that the required scalability can be provided while also providing a good level of interactivity.

The challenges presented by large-scale crowd simulations have been a new topic of interest for the Grupo de Redes y Entornos Virtuales (GREV, http://grev.uv.es), which belongs to the Computer Science Department of the University of Valencia and is part of the Advanced Communication and Computer Architecture (ACCA) group (http://www.acca-group.info). The GREV has faced problems related to the scalability of Distributed Virtual Environments (DVEs), proposing several methods for improving the performance and scalability of these Virtual Reality systems. This work on DVEs has allowed the GREV to study the new challenges presented by crowd systems. This interest in crowd simulations is the reason why this work has been carried out within the GREV and the ACCA group.

OBJECTIVES

Since one of the main challenges of crowd simulation systems is to provide good scalability while also providing a good level of interactivity, the goal of this thesis is:

'The design and development of a system for simulating and visualizing crowds of virtual characters in real time, capable of significantly improving the performance of current proposals in terms of latency and throughput.'

This main goal comprises the following subgoals:

• The design and development of a distributed system for performing crowd simulations. The scalability of a crowd application designed for a centralized system is limited by the computational power of that system. A centralized design cannot efficiently manage the computational resources of a distributed system. On the contrary, the proposed system design should provide good scalability in a flexible way, increasing the throughput of the system as more computational resources are added.

• The proposal and evaluation of different heuristic techniques for keeping the computational load of the distributed crowd simulation system balanced. A crowd is a dynamic system in which the simulation load must be balanced periodically. Otherwise, one or more computational nodes of the system could reach a saturation point, degrading the performance of the whole system. In this sense, the proposal should keep the system load balanced, increasing the system throughput and decreasing the latency.

• The design and development of a 3D visualization system that can be integrated into the distributed system and allows the simulation to be visualized interactively. In order to obtain a scalable visualization system, it should be developed following a distributed approach. In addition, this design should allow an easy integration with the distributed simulation system.

Organization of this document

This document describes how the goals mentioned above have been developed. It is structured in the following chapters:

• Chapter 1 describes the main challenges raised by crowd simulations. In addition, the main concepts of a crowd system and the communication architectures used to carry out distributed simulations are described.

• Chapter 2 presents the state of the art related to crowd simulation systems. This chapter focuses on the different approaches capable of efficiently rendering a crowd and managing the autonomous behaviors of large groups of individuals. It also describes other approaches that try to provide the scalability required by crowd systems. In addition, work related to the balancing of the computational load in multi-agent simulations is mentioned.

• Chapter 3 presents a new distributed system proposed for managing crowd simulations with autonomous behaviors. This system has been developed in several phases. The first one consists of the design of the system, based on distributing the agents of the crowd among several machines and on a centralized management of the virtual world. In a second phase, an improved design of the distributed system for crowd simulation is proposed. In a third phase, the use of multi-core and many-core architectures is proposed in order to increase the throughput of the distributed system.

• Chapter 4 describes different heuristic methods for balancing the computational load in crowd simulation systems. These methods are needed in order to efficiently manage distributed crowd architectures.

• Chapter 5 presents a new visualization system capable of rendering the crowd simulations carried out in the distributed system. The visualization system must be integrated into the distributed architecture in order to efficiently render the simulated scenes. In addition, this chapter evaluates the impact of the visualization system on the performance of the simulation system.

• Finally, Chapter 6 shows the main conclusions of this thesis, as well as the research lines resulting from it.

INTRODUCTION

The interest in studying and understanding the collective behavior of a group of individuals began in the last century. All these studies have more recently led to a growing interest in simulating this behavior by means of a computer. The first crowd simulations were carried out for different purposes, all of them having in common that they were not real-time simulations. As a consequence, these simulations were usually performed off-line and visualized later with different graphics qualities, depending on the purpose of the simulation. Researchers from different fields started to perform this kind of simulation some years ago. The computer graphics community was interested in integrating crowds of autonomous virtual actors within a 3D application, usually for film production [53]. In the Social Sciences area, the interest lay in understanding the behavior behind different social phenomena and in obtaining human models to make predictions about future events [100, 11, 56, 77, 85]. The interests of Civil and Safety Sciences were related to simulating social models in order to elaborate evacuation plans, to properly design the elements of the emergency system of a building, or to predict the risks in massive events that can take place in stadiums or large open areas [84, 26, 69, 13].

Animated virtual characters have also been used in virtual reality applications. The emergence of Virtual Reality has allowed the creation of multi-sensory, synthetic, interactive and immersive environments [89]. In a virtual reality system, the user interacts with the virtual environment by means of different input/output devices, which can range from a typical computer keyboard or a mouse to stereoscopic glasses or virtual reality gloves. Several virtual reality systems have been developed for different applications such as medicine, training, engineering and entertainment [6, 81, 48, 46]. Beyond these single-user examples, other virtual reality systems allow the interaction among several users sharing the same virtual scene. These multi-user virtual environments were named Distributed Virtual Environments (DVEs). In a DVE, each user interacts with the virtual world using a computer that renders the scene from the user's point of view. DVEs are gaining popularity and their size has been growing during the last years, providing an immersive and rich user experience thanks to 3D graphics and real-time interactions [99]. Distributed virtual worlds are used for games, entertainment, education, training, collaboration and social relations [97, 17, 92, 44, 78, 49]. For example, the U.S. armed forces use virtual environments to simulate different war scenarios in order to improve the training of troops in combat and rescue missions [97]. Another example is given by several educational institutions that have created their own spaces in virtual worlds in order to improve the distance learning experience [15]. More recently, the interest in populating these virtual environments with groups of autonomous characters has grown for different reasons. In some applications, such as training systems, the reason is that the virtual reality system must animate the behavior of groups of characters so that they interact with the user [97]. In other cases, the reason is that these groups of virtual humans greatly improve the immersive experience provided to the user [19].

However, the simulation of groups of individuals presents its own particular challenges and requires the interdisciplinary exchange of ideas among different research areas. For this reason, crowd simulation constitutes a research area in its own right. Researchers in the crowd simulation area must take into account the works and studies of several research areas in order to obtain realistic large-scale simulations with a good level of interactivity. The research areas related to crowd simulations are the following:

• Psychology: Some studies have been carried out to analyze the relationship between the realism of the simulation and the users' perception of the variability in the appearance and behavior of each individual within the crowd [54, 55]. In addition, some psychological models of human behavior have been used to simulate virtual crowds for different purposes [84, 77, 100].

• Social Sciences: Models proposed by Social Sciences researchers have been used to simulate by computer the interactions among individuals within a crowd [38]. In addition, other approaches propose models that can be applied to ease the coordination and collaboration among the agents within the crowd [32, 72].

• Physics: Physical models allow the definition of the crowd behavior at the microscopic level, controlling the individual navigation of each agent [79, 71]. On the contrary, other methods define the crowd behavior at the macroscopic level by means of models that control the crowd globally [95, 34].

• Computer graphics: The real-time rendering of a large number of 3D characters is a considerable challenge. This problem can quickly exhaust the resources of the system, even for the most modern and powerful systems with large memory resources, fast processors and powerful graphics cards. Brute-force approaches are feasible for a few characters, but they do not scale to hundreds, thousands or more characters. Several works have tried to face these limitations by efficiently using the capabilities of graphics accelerator cards [21], and by means of methods based on the assumption that our perception of the scene as a whole is limited [5, 93, 18].

• Computer architecture: Recent works on the real-time simulation of agents are based on proposing implementations for parallel and distributed systems [103, 78, 76, 22]. In this way, the different tasks of the simulation can be parallelized and/or distributed across several machines, increasing the number of agents supported during a simulation as well as improving the performance of the system.

• Computer vision: Other studies propose models for the behavioral animation of the crowd that are validated through computer vision techniques [74]. In addition, other works propose the use of computer vision techniques to animate groups of characters in real time [67, 45].

In order to have a clear idea of the problems raised by a crowd simulation system, the main requirements related to crowd simulations are described next.

Requirements of crowd simulations

A crowd system can have several layers, as shown in Figure 1. The visualization layer is responsible for addressing the problem of efficiently rendering large groups of 3D characters. The layer in charge of managing the behavior of the agents is responsible for displaying realistic behaviors both at the microscopic and at the macroscopic level. In addition, the computer architecture layer addresses the problems of designing a software architecture that takes full advantage of the underlying hardware platform. Related to these layers, a crowd system has some associated requirements that must be addressed. Many of these requirements are related to each other, frequently needing trade-off solutions during the design and development of a crowd simulation system. These requirements are the following:

• Diversity
• Autonomy
• Simulation consistency
• Interactivity
• Scalability

Diversity

In order to obtain a realistic crowd simulation, it is desirable to provide variety both in the graphics and in the behavior. However, providing variety has the drawback of increasing the computational load of the system. For this reason, replicating the same character multiple times is commonly used in crowd simulations. Replication consists of copying the mesh and the motion of a character and reusing them in a group of individuals. Some works have studied the visual impact of replicating 3D models and motions [54]. In this way, the threshold at which users can perceive that replication is present in a crowd can be determined. This threshold can be used to balance the level of visual diversity and the computational cost. Variety in the behavior of the characters is also needed to obtain realistic simulations. Some works have proposed models that can show a great diversity of behaviors, but at a high computational cost due to the management of the individual behavior of each character [79, 72]. On the other hand, other models allow the crowd to be managed as a whole by assigning a common goal to all the characters [95]. These approaches have the advantage of reducing the computational complexity, but they limit the variety of behaviors. Thus, there is a trade-off between the variety of behaviors and the resulting computational cost.
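As a simple illustration of the replication idea described above (not code from this thesis; the class and field names are hypothetical), the following sketch shows how many crowd members can share a single mesh and animation set while keeping only lightweight per-instance data, which is what keeps the memory and rendering cost low at the price of visual diversity:

```cpp
#include <cstdint>
#include <memory>
#include <vector>

// Heavyweight assets loaded once and shared by many characters.
struct CharacterTemplate {
    std::vector<float> meshVertices;   // shared geometry
    std::vector<float> animationKeys;  // shared animation clips
};

// Lightweight per-instance state: this is all that differs
// between replicated characters.
struct CharacterInstance {
    std::shared_ptr<const CharacterTemplate> tmpl; // shared, read-only
    float position[2];      // location in the virtual world
    float animationPhase;   // desynchronizes the shared animation
    std::uint32_t tintRGBA; // cheap appearance variation
};

// A crowd of n characters stores the template once plus n small instances.
std::vector<CharacterInstance> makeCrowd(
        std::shared_ptr<const CharacterTemplate> tmpl, std::size_t n) {
    std::vector<CharacterInstance> crowd(n);
    for (std::size_t i = 0; i < n; ++i) {
        crowd[i].tmpl = tmpl;
        crowd[i].position[0] = static_cast<float>(i % 100);
        crowd[i].position[1] = static_cast<float>(i / 100);
        crowd[i].animationPhase = 0.01f * static_cast<float>(i);
        crowd[i].tintRGBA = 0xFFFFFFFFu;
    }
    return crowd;
}
```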

Autonomy

Autonomous characters must react in a realistic way to the events of the environment that surrounds them. At the same time, each individual within the crowd must have its own plans and goals. Otherwise, the simulation may become monotonous, with uniform and periodic distributions of the characters or their features. Agents can have different levels of autonomy, which are reflected in their computational and behavioral complexity. Simpler agents can show a reactive behavior, where a set of rules guides their actions in the virtual world. More complex agents can be provided with representations of the virtual scene in order to navigate through it autonomously. Beyond navigation skills, deliberative agents present a higher level of autonomy, but also a higher complexity. Nevertheless, the level of autonomy of each agent usually depends on the purpose of the simulation.
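The autonomy levels mentioned above can be pictured with a minimal sketch (hypothetical interfaces, not the agent model used in this thesis): a common agent interface, a rule-based reactive agent, and a deliberative agent that follows a precomputed plan over a scene representation and falls back to reactive rules otherwise:

```cpp
#include <string>
#include <utility>
#include <vector>

struct Percept { float nearestObstacleDist; };
struct Action  { std::string name; };

// Common interface: every agent turns what it perceives into an action.
class Agent {
public:
    virtual ~Agent() = default;
    virtual Action decide(const Percept& p) = 0;
};

// Reactive agent: a fixed set of rules, very cheap to evaluate.
class ReactiveAgent : public Agent {
public:
    Action decide(const Percept& p) override {
        if (p.nearestObstacleDist < 1.0f) return {"turn"};
        return {"walk"};
    }
};

// Deliberative agent: keeps a representation of the scene (here, a
// waypoint plan) and only uses reactive rules as a safety fallback.
class DeliberativeAgent : public Agent {
public:
    explicit DeliberativeAgent(std::vector<Action> plan) : plan_(std::move(plan)) {}
    Action decide(const Percept& p) override {
        if (p.nearestObstacleDist < 1.0f) return {"turn"};  // safety rule
        if (!plan_.empty()) {                               // follow the plan
            Action next = plan_.front();
            plan_.erase(plan_.begin());
            return next;
        }
        return {"idle"};
    }
private:
    std::vector<Action> plan_;
};
```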

Simulation consistency

A key issue when performing a simulation is to maintain the consistency of the state of the virtual world. Otherwise, inconsistencies such as overlaps between agents could arise. Furthermore, an inconsistent state of the world could lead the agents to plan incorrect actions (for example, two agents could hold the wrong belief that both of them have the same object at the same time and keep planning according to that belief). Some works have shown that agent-based behavioral models can generate inconsistencies, especially in densely crowded scenarios [42, 34]. For that reason, a crowd simulation system must face this problem and properly manage consistency regardless of the conditions of the simulated scenario. Since queries and updates of the virtual world are performed concurrently by the agents of the crowd, consistency must be efficiently guaranteed. Otherwise, this part of the simulation can become the bottleneck of the system.
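A minimal sketch of the kind of concurrent check this implies is shown below. It is an assumption-laden illustration, not the Action Server of Chapter 3: the world is discretized into grid cells, a movement action is accepted only if the destination cell is free, and a single mutex serializes the checks so that two agents can never be granted the same cell. This coarse lock is precisely what can turn consistency management into the bottleneck, which is why the thesis also studies lock-free alternatives on multi-core and many-core processors:

```cpp
#include <cstdint>
#include <mutex>
#include <unordered_map>

class WorldGrid {
public:
    explicit WorldGrid(float cellSize) : cellSize_(cellSize) {}

    // Returns true and commits the move if the destination cell is free.
    bool tryMove(std::uint32_t agentId, float x, float y) {
        const std::uint64_t key = cellKey(x, y);
        std::lock_guard<std::mutex> guard(mutex_);
        auto it = occupancy_.find(key);
        if (it != occupancy_.end() && it->second != agentId)
            return false;                       // collision: action rejected
        auto prev = cells_.find(agentId);       // leave the previous cell
        if (prev != cells_.end()) occupancy_.erase(prev->second);
        occupancy_[key] = agentId;              // occupy the new cell
        cells_[agentId] = key;
        return true;                            // positive ACK to the agent
    }

private:
    std::uint64_t cellKey(float x, float y) const {
        const std::uint64_t cx =
            static_cast<std::uint32_t>(static_cast<std::int64_t>(x / cellSize_));
        const std::uint64_t cy =
            static_cast<std::uint32_t>(static_cast<std::int64_t>(y / cellSize_));
        return (cx << 32) | cy;                 // pack the 2D cell index
    }

    float cellSize_;
    std::mutex mutex_;
    std::unordered_map<std::uint64_t, std::uint32_t> occupancy_; // cell -> agent
    std::unordered_map<std::uint32_t, std::uint64_t> cells_;     // agent -> cell
};
```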

Interactivity

Crowd simulations require managing the behavior of a high number of characters, as well as visualizing the state of the simulation and processing the user actions in the case of an interactive system. All these tasks compete at the same time for the computational resources available in the system. Therefore, the simulation system must be designed efficiently. Otherwise, the performance of the system could suffer a significant degradation, which can end up limiting the number of simulated agents or the interactivity of the simulation [103]. The term 'real time' in Virtual Reality applications differs from the one used for real-time systems in that, in the latter, the outputs of the system are guaranteed to be obtained at an exact instant in time. However, a Virtual Reality system is considered to be real time if a user perceives a fluid interactivity in the system. Recent works in the DVE field have established that users perceive a fluid interactivity if the response time of the system is below a threshold of 250 ms [39]. In a crowd system, the response time is the time elapsed since an agent requests the execution of an action until the agent receives the result of that action. This response time must be below 250 ms in order to provide a good level of interactivity.
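The 250 ms criterion can be checked on the client side with a small sketch like the following (the blocking request call is a hypothetical stub, not an API of the system described in this thesis):

```cpp
#include <chrono>
#include <cstdio>

// Hypothetical blocking call: sends an action request to the server and
// waits for the corresponding result. Stubbed here for illustration.
bool requestActionAndWaitForResult() { return true; }

int main() {
    using Clock = std::chrono::steady_clock;
    constexpr auto kInteractivityThreshold = std::chrono::milliseconds(250);

    const auto start = Clock::now();
    const bool accepted = requestActionAndWaitForResult();
    const auto responseTime =
        std::chrono::duration_cast<std::chrono::milliseconds>(Clock::now() - start);

    // The simulation is considered interactive only if the response time
    // stays below the 250 ms threshold reported in the DVE literature.
    std::printf("action %s, response time = %lld ms (%s)\n",
                accepted ? "accepted" : "rejected",
                static_cast<long long>(responseTime.count()),
                responseTime < kInteractivityThreshold ? "interactive" : "too slow");
    return 0;
}
```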

Scalability

The scalability of a system measures its ability to adapt to changes in the computational requirements [51]. However, scalability and performance are different concepts. The performance of a crowd simulation system can be defined as the maximum number of agents that the system can support during a simulation without reaching a saturation point. On the other hand, a crowd system is scalable if an average performance is guaranteed when the number of simulated agents increases [86]. In order to provide the required scalability, this thesis proposes the use of distributed systems. From the computer architecture point of view, a distributed system is scalable if the performance of the system increases as more computational resources are added. However, the development of distributed systems is not a trivial task, since bottlenecks and load-balancing problems may arise, limiting the scalability of the system.

A crowd simulation system is in charge of both the visual representation of the virtual world and the management of the behavior of the autonomous agents. The sum of these requirements generates a computational cost that increases with the number of agents in the crowd. Different techniques have been proposed to efficiently manage the computational load of the rendering tasks [5, 93, 18]. In addition, other works propose efficient implementations of behavioral models in order to increase the number of simulated agents [92]. However, although these methods provide a performance improvement when simulating or visualizing a large number of agents, they do not take scalability problems into account. As a consequence, the number of agents cannot be significantly increased while providing good performance. The fact that previous works do not take the underlying system architecture into account greatly limits the scalability of these approaches. For that reason, this thesis proposes different approaches to solve the scalability problems of crowd simulations that are related to the underlying system architecture.
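As a hedged formalization (not taken verbatim from the thesis), and taking the 250 ms response-time bound of the previous section as the saturation criterion, the two notions above can be written as:

```latex
% Performance: the largest crowd size N the system sustains while the
% response time stays below the interactivity threshold (saturation bound).
P = \max \{\, N \;:\; T_{\mathrm{resp}}(N) \le 250\ \text{ms} \,\}

% Scalability: adding computational resources R should increase the
% supported crowd size, ideally close to linearly.
\frac{\partial P(R)}{\partial R} > 0
```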

Computer Architectures for Virtual Reality Applications

The first proposals in the crowd simulation area were designed and implemented for centralized systems, where both the visualization and the simulation parts were executed on the same machine. For this reason, more recent works have proposed the use of parallel and distributed systems for performing crowd simulations [103, 78, 76, 22]. However, the development of an application capable of taking advantage of distributed and parallel systems raises new challenges that must be solved. In parallel systems such as many-core and multi-core processors, special attention must be paid to the efficient use of the memory hierarchy, as well as to reducing the overhead introduced by the synchronization among execution threads. In addition, load-balancing problems must be solved when distributing the computational load among the different cores of the processor [78]. In distributed systems, the software that carries out the simulation must be efficiently designed to exploit the underlying hardware. In addition, the load-balancing problem must be addressed in order to share the computational load of the simulation among the nodes of the system. Otherwise, one or more nodes could reach a saturation point, increasing the latency and decreasing the overall performance of the system [61]. The latency of a distributed system is the time elapsed since a node in the system sends a message until the receiving node completely receives it [20]. There is a direct connection between the latency of a system and its level of interactivity [3]. If a distributed system can guarantee a low latency, then the response time among the nodes will be low and a good level of interactivity can be provided. Recent works in the DVE area have established a threshold value of 250 ms as the maximum response time allowed in order to guarantee an acceptable level of interactivity for the users [39].

When designing the software architecture for a distributed system, the underlying computer architecture must be taken into account. Different communication architectures have been proposed in order to efficiently support distributed virtual reality applications such as DVEs [86]. These architectures can be classified into three types: centralized-server architectures [96, 75], networked-server architectures [50, 61] and peer-to-peer (P2P) architectures [60, 59]. Figure 2 shows an example of a centralized-server architecture. In this example, the virtual world is represented in two dimensions and the avatars are shown as dots. In DVEs based on a centralized-server architecture, there is a single server and all the client computers connect to this server. The server is in charge of managing the whole virtual world. As a result, the server becomes a potential bottleneck as the number of avatars in the system increases. In fact, DVEs based on these architectures support a lower number of clients than other communication architectures.

Figure 2: Example of a client-server architecture


La figura 3 muestra un ejemplo de una arquitectura red-servidor. En este esquema hay varios servidores y cada equipo cliente está exclusivamente relacionado con uno de estos servidores. Este esquema es más distribuido que el esquema basado en servidor centralizado. Puesto que hay varios servidores, se mejora considerablemente la escalabilidad, flexibilidad y robustez con respecto al modelo servidor centralizado. Sin embargo, la arquitectura red-servidor requiere de una técnica para el balanceo de la carga que asigne los clientes a los servidores de una manera eficiente.

Figure 3: Ejemplo de una arquitectura red-servidor

La figura 4 muestra un ejemplo de una arquitectura peer-to-peer. En este esquema, cada equipo cliente es también un servidor. Este esquema proporciona el mayor nivel de distribución de la carga. Aunque los primeros DVEs se basaban en arquitecturas centralizadas, durante los últimos años las arquitecturas basadas en servidores en red han sido el principal estándar de facto para los sistemas DVE [50, 31]. Sin embargo, cada nuevo avatar en un sistema DVE representa un aumento no sólo en los requerimientos de cómputo de la aplicación, sino también en la cantidad de tráfico de red [61]. Debido a este aumento, las arquitecturas red-servidor no parecen escalar correctamente con el número de clientes, en particular en el caso de los juegos online masivamente multijugador (en inglés Massively Multiplayer Online Games o MMOGs [2]), debido al alto grado de interactividad mostrado por estas aplicaciones. Como resultado, las arquitecturas peer-to-peer se han propuesto para los juegos online multijugador [60, 59, 28]. Sin embargo, las arquitecturas P2P aún deben resolver de manera eficiente el problema del awareness. Este problema consiste en garantizar que cada avatar es consciente de todos los avatares que están en su entorno [87]. Garantizar el awareness para todos los avatares del sistema es una condición necesaria para garantizar la coherencia espacio-temporal (como se define en [104, 80]). El awareness es crucial en DVEs, ya que de lo contrario podrían surgir inconsistencias. Por ejemplo, un usuario de un juego podría disparar a algo aparentemente visible que en realidad no está ahí como resultado de una inconsistencia. También podría suceder que un avatar que no tenga una visión coherente del mundo virtual sea destruido por otro avatar que era invisible. En arquitecturas red-servidor, el problema del awareness se soluciona fácilmente por parte de los servidores del sistema, ya que éstos sincronizan periódicamente su estado y pueden conocer el estado de los avatares durante toda la simulación. Cada avatar informa de sus cambios (mediante el envío de un mensaje) al servidor al que se encuentra asignado, de modo que dicho servidor puede decidir fácilmente qué avatares deben ser los destinatarios de ese mensaje (utilizando un criterio de distancia). De esta manera, no hay necesidad de un método para determinar la vecindad de los avatares, ya que los servidores conocen dicha vecindad en cada instante. Teniendo en cuenta las características de las diferentes arquitecturas de comunicación, esta tesis propone el modelo red-servidor como la arquitectura subyacente para la simulación de crowds. Por un lado, este sistema distribuido permite mejorar la escalabilidad, la flexibilidad y la robustez en comparación con las arquitecturas centralizadas (cliente-servidor). Por otro lado, el pequeño número de servidores en las arquitecturas red-servidor hace que sea fácil solucionar el problema del awareness (y por lo tanto garantizar la coherencia espacio-temporal) para los agentes del mundo virtual. En este sentido, parece difícil mantener la coherencia de la información semántica en el sistema si se sigue un esquema peer-to-peer, donde cientos o incluso miles de ordenadores soportan cada uno un pequeño número de agentes y una copia de la base de datos semántica. De acuerdo con el esquema red-servidor, la arquitectura software ha sido diseñada con el fin de distribuir a los agentes del crowd a través de diferentes servidores (los servidores en red).

Figure 4: Ejemplo de una arquitectura peer-to-peer

CONCLUSIONES

En esta sección se describen las conclusiones generales de esta tesis, así como las publicaciones obtenidas a partir de los resultados descritos en este documento. Además, se describen las futuras líneas de investigación que han surgido como consecuencia del trabajo llevado a cabo en esta tesis.

Conclusiones y contribuciones

Esta tesis ha propuesto diversas mejoras del rendimiento de las simulaciones de crowds a gran escala. En primer lugar, se ha propuesto un sistema distribuido capaz de proporcionar la escalabilidad requerida por las simulaciones de muchedumbres. Con el fin de gestionar la carga computacional dinámica generada por las simulaciones de crowds distribuidas, esta tesis también propone un método eficaz para equilibrar dicha carga. Además, se propone un sistema de visualización para mostrar las imágenes de las simulaciones. Las principales aportaciones de esta tesis se describen a continuación:

• Acerca del sistema para simulaciones de crowds: Se ha propuesto un sistema distribuido para la simulación de crowds. Este sistema se basa en un esquema de interconexión red-servidor, que ha demostrado ser eficiente para la simulación de multitudes. Por un lado, el sistema puede distribuir los agentes del crowd en diferentes máquinas. Este modelo proporciona una buena flexibilidad para la simulación de agentes con diferentes necesidades de cómputo en función de su nivel de complejidad. Por otra parte, la distribución del Action Server en diferentes máquinas proporciona la escalabilidad necesaria con el número de agentes en el crowd y el número de recursos computacionales.

• En relación con las mejoras de rendimiento a través de arquitecturas multi-core y many-core: Se han propuesto diferentes mejoras del sistema distribuido para explotar las arquitecturas multi-core y many-core. Las mejoras se han llevado a cabo en el Action Server. El procedimiento de chequeo de colisiones de este proceso se ha migrado para ser ejecutado en la GPU, y los resultados muestran que esta migración mejora significativamente el rendimiento. Además, dicho procedimiento se ha optimizado para que pueda ejecutarse en plataformas multi-core y many-core. De este modo, estas mejoras benefician al rendimiento global del sistema.

• Acerca del problema del particionado en las simulaciones de crowds: El sistema distribuido propuesto para la simulación de multitudes realiza una partición del crowd basada en regiones. En base a este esquema, se han estudiado diferentes métodos para realizar la partición dinámica del crowd durante la simulación. Se ha definido una función de calidad con el fin de comparar el rendimiento de los diferentes métodos propuestos. Los resultados muestran que el mejor método de particionado es el que hace uso de formas irregulares para definir cada región.

• En relación con el sistema de visualización para las simulaciones de crowds: Se ha propuesto un sistema de visualización para el renderizado de las simulaciones. A fin de proporcionar la escalabilidad requerida por el sistema de visualización, se ha propuesto un diseño distribuido. De esta manera, diferentes cámaras, cada una soportada por un cliente visual distinto, pueden visualizar la misma escena. Además, se ha propuesto una implementación eficiente del cliente visual basada en un análisis de los requisitos computacionales de dicho cliente. Estos requisitos se analizan en base a la carga de trabajo generada por el tráfico de red y las tareas gráficas. Los resultados muestran que el sistema de visualización puede proporcionar la escalabilidad y la interactividad requeridas por las simulaciones de crowds.

Líneas de investigación futuras

El trabajo desarrollado en esta tesis plantea diferentes líneas de investigación que deben abordarse en el futuro. Estas líneas son las siguientes:

• Esta tesis propone la mejora del rendimiento del Action Server a través de las GPUs. Sin embargo, el uso de la GPU se analiza para un sistema con un único Action Server. Por esta razón, debe analizarse la mejora de prestaciones para un sistema en el que cada Action Server integre una o más GPUs.

• En esta tesis se han propuesto diferentes optimizaciones del procedimiento de chequeo de colisiones para plataformas multi-core y many-core. Las mejoras de rendimiento se analizan en términos de tiempos de ejecución y tasa de chequeo de colisiones. Sin embargo, los beneficios de estas mejoras para todo el sistema deben ser medidos en términos de productividad y latencia. Además, las optimizaciones del algoritmo de chequeo de colisiones se han evaluado para escenarios con bajas densidades, donde los agentes se distribuyen uniformemente. Estudios futuros deberían analizar cómo la distribución de los agentes en la escena afecta al rendimiento proporcionado por estas implementaciones del procedimiento de chequeo de colisiones.

• Aunque esta tesis ha estudiado el uso de las GPUs para mejorar el rendimiento del Action Server, otras partes del sistema también pueden ser implementadas haciendo uso de GPUs. En este sentido, las tareas relacionadas con el método de particionado pueden ser migradas de la CPU a la GPU con el fin de aliviar la carga de la CPU en el Action Server y aumentar el rendimiento de este proceso. Además, las GPUs se pueden utilizar dentro de cada proceso cliente. De esta manera, las tareas relacionadas con la animación del comportamiento de los agentes pueden ser aceleradas mediante el uso de GPUs.

• Los métodos de particionado se han evaluado en un sistema secuencial. La función de calidad definida en esta tesis para comparar los métodos de particionado representa un criterio eficaz para obtener el mejor método. Sin embargo, estos métodos deberían ser evaluados en un sistema distribuido con el fin de determinar las mejoras de rendimiento que pueden ofrecer en términos de productividad y latencia.

Publicaciones

Los trabajos de investigación descritos en esta tesis han originado la siguiente lista de publicaciones, clasificadas por tema de investigación. Algunos de estos trabajos ya han sido publicados y otros están aceptados para su publicación. Los siguientes cuatro artículos describen la arquitectura del sistema propuesto para la realización de simulaciones de crowds a gran escala. Además, en ellos se realiza un estudio de la escalabilidad del sistema.


• G. Vigueras, M. Lozano, C. Pérez and J. M. Orduña. A Scalable Architecture for Crowd Simulation: Implementing a Parallel Action Server. IEEE International Conference on Parallel Processing 2008 (ICPP' 2008), IEEE Computer Society Press, pp. 430-437. Portland, USA. September, 2008.

• G. Vigueras, M. Lozano, C. Pérez and J. M. Orduña. A Parallel Action Server for Crowd Simulation. XIX Jornadas de Paralelismo, pp. 163-168. Castellón, Spain. September, 2008.

• G. Vigueras, M. Lozano, C. Pérez and J. M. Orduña. A Specific System Design for Crowd Simulation. ACACES 2009 (HiPEAC). Poster Abstracts, pp. 329-332. Barcelona, Spain. July, 2009.

• M. Lozano, P. Morillo, J. M. Orduña, V. Cavero and G. Vigueras. A New System Architecture for Crowd Simulation. Journal of Network and Computer Applications, volume 32, issue 2, pp. 474-482. Elsevier Science. 2009.

Los siguientes cinco artículos proponen el uso de las GPUs y procesadores multi-core para mejorar el rendimiento del Action Server distribuido para la simulación de crowds.

• G. Vigueras, J. M. Orduña, M. Lozano. A GPU-Based Multi-Agent System for Real-Time Simulations. International Conference on Practical Applications of Agents and Multi-Agent Systems (PAAMS' 2010). Salamanca, Spain. April, 2010.

• G. Vigueras, J. M. Orduña and M. Lozano. A GPU-based Action Server for Distributed Crowd Simulations. ACACES 2010 (HiPEAC). Poster Abstracts. Barcelona, Spain. July, 2010.

• G. Vigueras, J. M. Orduña, M. Lozano, J. M. Cecilia and J. M. García. Improving the GPU-Based Collision Check Procedure for Distributed Crowd Simulations. International Conference on Parallel Architectures and Compilation Techniques 2010 (PACT' 2010). GPUs and Scientific Applications Workshop (GPUScA). Vienna, Austria. September, 2010.

• G. Vigueras, J. M. Orduña and M. Lozano. Accelerating Real-Time Crowd Simulations through GPUs. XXI Jornadas de Paralelismo. Valencia, Spain. September, 2010.

• G. Vigueras, J. M. Orduña and M. Lozano. Increasing the Parallelism of Distributed Crowd Simulations on Multi-core Processors. International Conference on Computational and Mathematical Methods in Science and Engineering 2011 (CMMSE' 2011). Alicante (Spain), July, 2011.

Los siguientes seis artículos describen y analizan el rendimiento proporcionado por los diferentes métodos, propuestos en esta tesis, para resolver el problema del particionado.

• G. Vigueras, M. Lozano, J. M. Orduña, F. Grimaldo. Improving the Performance of Partitioning Methods for Crowd Simulations. IEEE International Conference on Hybrid Intelligent Systems 2008 (HIS' 2008). IEEE Computer Society Press, pp. 102-107. Barcelona, Spain. September, 2008.

• G. Vigueras, M. Lozano, J. M. Orduña. Enhancing Workload Balancing in Distributed Crowd Simulations through the Partitioning Method. International Conference on Computational and Mathematical Methods in Science and Engineering 2009 (CMMSE 2009), pp. 1117-1128. Gijón (Spain), July 2009.

• G. Vigueras, M. Lozano, J. M. Orduña and F. Grimaldo. A Partitioning Method Based on Convex Hulls for Crowd Simulations. XX Jornadas de Paralelismo, pp. 117-122. La Coruña, Spain. September, 2009.

• G. Vigueras, M. Lozano and J. M. Orduña. Workload Balancing in Distributed Crowd Simulations: the Partitioning Method. The Journal of Supercomputing. Springer US. 2009. (In Press).

• G. Vigueras, M. Lozano, J. M. Orduña and F. Grimaldo. A Comparative Study of Partitioning Methods for Crowd Simulations. Applied Soft Computing, volume 10, issue 1, pp. 225-235. Elsevier Science Publishers. 2010.

• G. Vigueras, M. Lozano, J. M. Orduña. Load Balancing in Distributed Crowd Simulations, chapter in book Load Balancing: Theory, Architecture, and Performance. Nova Science Publishers. 2011. (In Press).

Los siguientes artículos describen y analizan el rendimiento del sistema de visualización distribuido, propuesto en esta tesis, para renderizar las simulaciones de crowds.

• G. Vigueras, M. Lozano, J. M. Orduña and Y. Chrysanthou. A Distributed Visual Client for Large-Scale Crowd Simulations. International Conference on Computational and Mathematical Methods in Science and Engineering 2010 (CMMSE' 2010). Almería (Spain), July, 2010.

• G. Vigueras, J. M. Orduña, M. Lozano and Víctor Fernández. A Scalable Visualization System for Crowd Simulations. XXII Jornadas de Paralelismo. La Laguna, Spain. September, 2011.

El siguiente artículo resume todas las contribuciones de esta tesis relacionadas con la mejora del rendimiento de las simulaciones de crowds. Fue presentado en el PhD Forum del IEEE International Parallel and Distributed Processing Symposium en 2010.

• G. Vigueras, J. M. Orduña, M. Lozano. Performance Improvements of Real-Time Crowd Simulations. IEEE International Parallel and Distributed Processing Symposium (IPDPS' 2010). PhD Forum. Atlanta, USA. April, 2010.

A continuación aparecen tres artículos enviados a tres revistas internacionales y que se encuentran en fase de revisión.

• G. Vigueras, J. M. Orduña, M. Lozano, J. M. Cecilia and J. M. García. Accelerating Large-scale Crowd Simulations on Multi-core and Many-core Architectures. Journal of Parallel and Distributed Computing (JPDC). Elsevier Science Publishers.

• G. Vigueras, M. Lozano, J. M. Orduña and Y. Chrysanthou. A Distributed Visualization System for Crowd Simulations. Integrated Computer-Aided Engineering (ICAE). IOS Press.

• G. Vigueras, J. M. Orduña, M. Lozano and Yvon Jégou. A Scalable Multiagent System Architecture for Interactive Applications. Journal of Science of Computer Programming (JSCP). Elsevier Science Publishers.

MOTIVATION

A crowd can be considered as a large group of individuals who share the same physical environment. These individuals can share a common goal and may act in a different way than when they are alone [8]. This collective behaviour has been studied since as early as the end of the nineteenth century. However, computer models for the simulation of this behaviour are quite recent, with most of the investigation having taken place in the mid and late nineties. From a general point of view, the need for models to simulate virtual populations appears in many research fields. The computer graphics community is interested in integrating realistic and vivid crowds of virtual characters within a 3D application, usually for entertainment purposes such as film production. In the Social Sciences field, the interest is related with understanding the behaviour of different social phenomena and carrying out predictions for future events. Safety and civil sciences interests are related with simulating social models in order to elaborate evacuation plans, to accurately design the elements of the emergency system of a building or to predict risks in crowded events that may take place in stadiums or huge outdoor areas. The possibility of predicting risks has increased the interest of different agencies in crowd simulations. Recent natural disasters such as Hurricane Katrina (2005) or the Haiti earthquake (2010) have shown the magnitude of their effects and impact on the population of a country. Also, there are inherent dangers associated with every large public gathering. Every year there are reports of overcrowding and crushing incidents from around the world. Therefore, there is a growing interest from governments in anticipating the problems that may occur during an emergency, as well as developing simulation
systems that can be used to map and test-out alternative tactics in the handling of crowd situations. Hence, the simulation of crowds is proposed as a method to mitigate or avoid dangers associated with different catastrophes that can occur. Beyond the different applications of crowd simulations, from a computational point of view a crowd simulation system requires both rendering visually plausible images of a virtual world, and managing the behaviour of autonomous agents. The sum of these requirements results in a computational cost that greatly increases with the number of agents in the system. For many years, the challenge was to obtain realistic simulations of virtual crowds. Now, the new challenge consists of providing the scalability required by these kinds of systems when the number of individuals in the scene grows. Focusing on the structure of a crowd system, we can see that it can have several layers as Figure 5 shows. Traditionally, previous approaches have focused on layers related with Computer graphics and Behavioural animation addressing the problem of efficiently rendering large groups of individuals as well as providing realistic motion of crowds. Nevertheless, a new layer has arisen related with the area of computer architecture. This new layer addresses problems regarding the design of a software architecture that can efficiently take advantage of a hardware platform. Hence, the proposals of this thesis are done within this layer in order to efficiently manage the high computational cost of crowd simulations by efficiently exploiting the underlying hardware.

Figure 5: Elements of a crowd simulation system.

The ongoing increase in the computational power available in recent years has promoted the interest in performing real-time crowd simulations. The term 'real-time' in
Virtual Reality applications differs from the one used in Real-time systems, where system outputs are guaranteed to be obtained at an exact point in time. However, these constraints are relaxed in a Virtual Reality system, where the term 'real-time' is replaced by the term 'interactive'. Hence, a Virtual Reality application is considered to be real-time if a user perceives smooth interactivity with the system. Interactive simulations of crowds are used in entertainment applications where the user interacts with the virtual world, the most common example being computer games. They are also used in Virtual Reality systems and Augmented Reality applications for training and simulations. Examples of these interactive systems are those used for training law enforcement agencies that may face crowds in the future. However, due to the need for interactivity in these simulations, new challenges arise. Usually, non-interactive simulations perform the computation to animate the crowd in an off-line step. Then, using this data, the simulated scenario is rendered taking as much time as needed in order to obtain high quality graphics. Nevertheless, in the case of an interactive crowd system, the computer system should manage the high computational cost involved in obtaining realistic behaviours and visualizing plausible images in a fraction of a second. This temporal constraint must be fulfilled in order to provide a good level of interactivity, and represents a challenge when developing a system for crowd simulation. There are several approaches in the literature proposing solutions for the problems associated with interactive crowd simulations. As shown in Figure 5, some approaches address only parts of the whole problem, proposing behavioural models or efficient graphics techniques for crowd rendering. Other approaches address both parts of the problem, allowing the simulation and visualization of crowds with a high number of individuals. Nevertheless, they fail to provide the required scalability of crowd systems. For this reason, the motivation of this thesis is the improvement of the performance of crowd simulation systems in order to provide the required scalability while providing a good level of interactivity. The challenges presented by large-scale crowd simulations were a topic of interest for the Network and Virtual Environments Group (GREV, http://grev.uv.es), which belongs to the Departamento de Informática at the Universidad de Valencia and is integrated in the Advanced Communication and Computer Architecture (ACCA) consortium (http://www.acca-group.info). The GREV has addressed problems related with the scalability of Distributed Virtual Environments (DVEs), proposing several methods to improve the performance and scalability of these Virtual Reality systems. This background in DVEs has led the GREV to study the new challenges presented by crowd simulation systems. This interest in crowd systems is the reason why this work has been carried out within the GREV and the ACCA consortium.

OBJECTIVES

Since one of the main challenges of crowd simulation systems is to provide the required scalability while obtaining a good level of interactivity, the objective of this thesis is: "The design and development of a system for simulating and visualizing crowds of virtual characters in real-time that can significantly improve the performance of current approaches in terms of throughput and latency." In order to accomplish this objective, different sub-objectives are proposed:

• The design and development of a distributed system for performing crowd simulations. The scalability of a crowd application designed for a centralized system is limited by the computational power available in the system. A centralized design cannot efficiently manage the computational resources available in distributed systems. In contrast, the proposed system design should provide the required scalability in a flexible way, increasing the throughput of the system when more computational resources are added.

• The proposal and evaluation of different heuristic-based techniques to keep the computational workload of the distributed crowd simulation system balanced. A crowd is a dynamic system where the load of the simulation must be periodically balanced. Otherwise, one or more of the computational nodes in the distributed system could reach saturation point, thus highly degrading the performance of the whole simulation. The proposal should keep the load of the system balanced, increasing the throughput of the system and decreasing the latency.

• The design and development of a 3D visualization system able to be integrated in the distributed system for rendering the simulation at interactive rates. In order to provide the visualization system with the required scalability, it should be developed in a distributed fashion. Also, this design should allow easy integration with the distributed simulation system.

ORGANIZATION OF THIS DOCUMENT

In order to accomplish the objectives mentioned above, this document is divided into the following chapters:

• Chapter 1 describes the main challenges related to the simulation of large crowds. In addition, the main concepts of a crowd system and of distributed architectures oriented to perform distributed simulations are described.

• Chapter 2 discusses the state-of-the-art of crowd simulation systems. This chapter will focus on the different approaches to efficiently solve the rendering of crowded scenes and the management of autonomous behaviours of large groups of individuals, as well as on approaches that try to provide the scalability required by crowd simulations. In addition, works related with the problem of workload balancing in multi-agent simulations will be discussed.

• Chapter 3 describes the new distributed system proposed for handling crowd simulations with autonomous behaviours. The first phase consists of designing the system based on the distribution of agents within the crowd across several machines and the centralization of the virtual world management. In a second phase, an improved design of the distributed system for crowd simulation is proposed. In a third phase, the use of many-core and multi-core architectures for increasing the throughput of the distributed system is proposed.

• Chapter 4 describes different heuristic methods for workload balancing in distributed crowd simulation systems. These methods are necessary in order to efficiently manage the distributed architectures for crowd simulations.

• Chapter 5 presents a new visualization system capable of rendering the crowd simulations performed by the distributed system. This visualization system should be integrated into the distributed architecture in order to efficiently render the simulated scenes. In addition, the impact of the visualization system on the performance of the simulation system is also evaluated.

• Finally, Chapter 6 shows the main conclusions extracted from this work and discusses the future research lines that can be derived from it.


CHAPTER 1

INTRODUCTION

An interest in studying and understanding the collective behaviour of a group of individuals began during the last century. All these studies have more recently led to a growing interest in simulating this behaviour using computers. The first crowd simulations were performed with different purposes, yet what they all had in common was that they were non-real-time simulations. As a consequence, the simulation was usually performed offline to be visualized later with different graphics quality depending on the purpose of the simulation. Researchers from a broad range of fields started to perform these kinds of crowd simulations some years ago. The computer graphics community was interested in integrating vivid crowds of virtual characters within a 3D application, usually for the production of movies [53]. Within the Social Sciences field, the interest was related with understanding the behaviour of different social phenomena and obtaining human models in order to carry out predictions for future events [100, 11, 56, 77, 85]. Safety and civil sciences interests were related with simulating social models in order to elaborate evacuation plans, to accurately design the elements of the emergency system of a building or to predict risks in crowded events that may take place in a stadium or a huge outdoor area [84, 26, 69, 13]. Animated virtual characters have also been used in virtual reality applications. The appearance of Virtual Reality has allowed the creation of synthetic, interactive, immersive and multi-sensory environments [89]. In a virtual reality system the user interacts with the virtual environment by means of different I/O devices that can vary from a typical keyboard or mouse to stereoscopic glasses or data gloves. Several virtual reality systems have been developed for different applications such as medicine, engineering, training and entertainment [6, 81, 48, 46]. Besides these examples centred on a single user, oth-
er virtual reality systems allow interaction among several users sharing the same virtual scene. These multi-user virtual environments became known as Distributed Virtual Environments (DVEs). In a DVE, each user interacts with the virtual world using a computer that renders the scene from the user's perspective. In recent years, DVEs have gained in popularity and grown in size rapidly, providing a rich and immersive user experience through 3D graphics and real-time interactions [99]. Distributed virtual worlds are used for gaming, entertainment, socialization, education, training and collaboration [97, 17, 92, 44, 78, 49]. For example, the US army uses virtual environments to simulate different battle zone scenarios which provide better training for personnel in combat, rescue and recovery missions [97]. Another example is that several education institutions have created their own space in virtual worlds, enhancing the experience of distance learning [15]. In recent years, the interest in populating these virtual worlds with groups of autonomous characters has grown for different reasons. In some applications, such as training systems, the reason is that the virtual reality system must animate the behaviour of groups of characters in order to interact with the user [97]. In other cases, the reason is that these groups of human-like characters provide a greatly improved immersive experience for users [19]. However, the simulation of groups of individuals presents its own particular challenges and requires the interdisciplinary exchange of ideas from many research fields. For these reasons, the crowd simulation problem constitutes a research area in itself. Researchers in the area of crowd simulation should take into account works and studies from many other research areas to obtain realistic large-scale simulations with a good level of interactivity. The research areas related with crowd simulations are:

• Psychology: Some studies have been completed to analyse the relation between simulation realism and user perception of variability of appearance and behaviour for each individual within the crowd [54, 55]. Also, some psychological models of human behaviour in different situations have been used for crowd simulations with different purposes [84, 77, 100].

• Social sciences: Models proposed by researchers from the Social Sciences have been implemented to simulate the interactions among individuals within a crowd by computer [38]. Also, other studies propose models that can be implemented to provide coordination and collaboration among agents within the crowd [32, 72].

• Physics: Physical models allow the definition of the crowd micro behaviour, controlling the individual navigation of each agent [79, 71]. In contrast, other approaches define the macro behaviour of the crowd by means of models that control the crowd as a whole [95, 34].

• Computer graphics: Real-time rendering of a large number of 3D characters is a considerable challenge. It can quickly exhaust the resources of even state-of-the-art systems with extensive memory, fast processors and powerful graphics cards. 'Brute-force' approaches that are feasible for a few characters do not scale up for hundreds, thousands or even more characters. Several studies have been trying to tackle such limitations by clever use of graphics accelerator capabilities [21], and by employing methods assuming that our perception of the scene as a whole is limited [5, 93, 18].

• Computer architecture: Recent studies concerning real-time agent-based simulation propose implementations for parallel and distributed systems [103, 78, 76, 22]. In this way, the different simulation tasks can be parallelized and/or distributed across several machines, providing an increase in the number of agents supported during a simulation and an improvement in the performance of the system.

• Computer vision: Some studies have been completed in order to propose a model for behavioural animation of the crowd and validate it through computer vision techniques [74]. Also, others propose the use of artificial vision techniques to animate groups of characters in real-time [67, 45].

In order to get an idea of the aspects that a crowd simulation system should deal with, the main problems related with the simulation of crowds are described below.

1.1. REQUIREMENTS OF CROWD SIMULATIONS

A crowd system can have several layers as shown in Figure 5. The visualization layer is responsible for addressing the problem of efficiently rendering large groups of individuals as well as displaying images of the 3D scene. On the other hand, the layer managing the agents' behaviour is responsible for exhibiting realistic behaviours at the micro and macro levels. In addition, the computer architecture layer addresses problems regarding the design of a software architecture that can efficiently take advantage of the underlying hardware platform. Related with these layers, a crowd system has some associated requirements that must be addressed. Many of these requirements are usually related, involving many trade-offs during the design and development of a crowd simulation system. These requirements are:

• Diversity
• Autonomy
• Simulation consistency
• Interactivity
• Scalability

Diversity

In order to obtain a realistic crowd simulation, it is desirable to provide variety in both the graphics and the behaviour. However, variety has the disadvantage of increasing the computational load in the system. For this reason, replication is commonly used in simulations with a high number of individuals. Replication consists of reusing the mesh and motion of a character for a group of individuals. Some works have studied the visual impact of the replication of 3D models and motions [54]. In this way, the threshold at which users can perceive that replication is present in a crowd can be determined. This threshold can be used to solve the trade-off between visual diversity and computational cost. Variety in characters' behaviour is also necessary to obtain plausible simulations. Some studies have proposed models that can show behavioural diversity at the cost of requiring a high computational load to manage the individual behaviour of each character [79, 72]. On the other hand, other models handle the crowd as a whole by assigning a common goal to all characters [95]. These approaches have the advantage of reducing the computational complexity at the cost of limiting the behavioural diversity. In this way, there is a trade-off that must be addressed between variety of behaviour and the resulting computational cost.


Autonomy

Autonomous characters are required to create believable crowds reacting to events. At the same time, each individual within the crowd should have different plans and goals. Otherwise, the simulation could result in an 'army-like' appearance with too uniform, or periodic, distributions of individuals or characteristics. Agents can have different levels of autonomy that will be reflected in their computational and behavioural complexity. They may show just a reactive behaviour where simple rules guide their reactions to the environmental state. More complex agents can be endowed with some representation of the virtual scene in order to navigate through it autonomously. In addition to navigation skills, agents performing deliberative actions would present the highest level of autonomy, but also the highest complexity. However, the level of autonomy of each agent usually depends on the purpose of the simulation, and on whether simple reactive agents or more complex agents are needed.
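
As a concrete illustration of the lowest (purely reactive) autonomy level mentioned above, the following minimal C++ sketch chooses an agent's next step with two hard-coded stimulus-response rules. The Percept structure, the avoidance threshold and the function names are assumptions made for the example, not part of any model proposed in this thesis.

```cpp
#include <cmath>

struct Vec2 { float x, y; };

// Perceived environment state for one agent (hypothetical structure).
struct Percept {
    Vec2  goalDir;         // unit vector towards the agent's current goal
    Vec2  nearestObstacle; // vector from the agent to the closest obstacle
    float obstacleDist;    // distance to that obstacle
};

// A purely reactive agent: no memory, no planning, just stimulus-response rules.
Vec2 reactiveStep(const Percept& p, float speed) {
    Vec2 dir = p.goalDir;                 // rule 1: move towards the goal
    if (p.obstacleDist < 1.5f) {          // rule 2: steer away from close obstacles
        dir.x -= p.nearestObstacle.x / p.obstacleDist;
        dir.y -= p.nearestObstacle.y / p.obstacleDist;
    }
    float len = std::sqrt(dir.x * dir.x + dir.y * dir.y);
    if (len < 1e-6f) return {0.0f, 0.0f}; // fully blocked: stay in place
    return {speed * dir.x / len, speed * dir.y / len};
}
```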

Simulation consistency

A key issue when performing a simulation is to keep the consistency of the state of the virtual world. Otherwise, abnormal situations could arise, such as overlaps among agents. Also, an incoherent state of the world could lead agents to plan incorrect actions (e.g. two agents may have the wrong belief that both have taken the same object at the same time and continue planning according to this belief). Some studies have shown that agent-based behavioural models can generate inconsistent situations, especially in high density scenarios [42, 34]. For this reason, a crowd simulation system should tackle this problem and properly manage the consistency regardless of the conditions of the simulated scenario. Since queries and updates of the virtual world are done concurrently by crowd agents, the consistency should be efficiently managed and guaranteed. If not, this part of the simulation can become the system bottleneck.
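
To make the concurrency problem concrete, the sketch below shows one generic way of serializing the conflicting "take the same object" updates described above: each pickable object carries an atomic owner field, so only one of two simultaneous take requests can succeed. The types and field names are illustrative and are not taken from the system described in this thesis.

```cpp
#include <atomic>
#include <cstdint>

// One shared object of the virtual world that agents may try to pick up.
struct WorldObject {
    std::atomic<int32_t> owner{-1};   // -1 means "not taken"
};

// Returns true only for the single agent that actually gets the object.
// The compare-and-swap makes the query ("is it free?") and the update
// ("now it is mine") one indivisible step, so the inconsistent belief
// "we both took it" cannot occur.
bool tryTake(WorldObject& obj, int32_t agentId) {
    int32_t expected = -1;
    return obj.owner.compare_exchange_strong(expected, agentId);
}
```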


Interactivity

Crowd simulations require the management of the behaviour of a large number of characters, as well as rendering the status of the simulation and processing the actions of the user in the case of an interactive system. All these tasks compete concurrently for the computational resources available in the system. For this reason, the simulation system should be efficiently designed; if not, the performance of the system can be greatly degraded. This performance degradation can limit the number of agents simulated, or the interactivity of the simulation [103]. The term 'real-time' in Virtual Reality applications differs from the one used for Real-time systems, where outputs of the system are guaranteed to be obtained at an exact point in time. Nevertheless, a Virtual Reality system is considered to be real-time if a user perceives smooth interactivity in the system. Recent studies of DVEs have established that users perceive smooth interactivity if the response time of the system is below a threshold value of 250 ms. [39]. In a crowd system, response time is the time that passes from the moment an agent requests the execution of an action until the agent receives the result of such action. This response time should be lower than 250 ms. in order to provide a good level of interactivity.
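
Under the definition of response time given above, checking interactivity amounts to timing each action round trip against the 250 ms threshold. The fragment below is a minimal sketch of such client-side instrumentation; requestAction is a placeholder for whatever blocking communication call the actual system uses.

```cpp
#include <chrono>
#include <functional>

// Measures the response time of one agent action and reports whether it stays
// within the 250 ms interactivity threshold used for DVEs. 'requestAction'
// stands for the call that sends the action request and waits for its result.
bool actionWithinInteractiveBound(const std::function<void()>& requestAction) {
    using clock = std::chrono::steady_clock;
    const auto start = clock::now();

    requestAction();   // send the request and block until the result arrives

    const auto elapsed = std::chrono::duration_cast<std::chrono::milliseconds>(
        clock::now() - start);
    return elapsed.count() <= 250;
}
```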

Scalability

The scalability of a system measures its capacity to adapt to changes in the computational requirements [51]. Nevertheless, scalability and throughput should be differentiated. Throughput in a crowd simulation system can be defined as the maximum number of agents that the system can support during a simulation without reaching saturation point. On the other hand, a crowd system is scalable if an average performance is ensured when the number of agents simulated is increased [86]. In order to provide the required scalability, this thesis proposes the use of distributed systems. From the computer architecture point of view, a distributed system is scalable if the throughput of the system is properly increased when more computational resources are added. However, the development of distributed systems is not a simple task, since bottlenecks and load balancing problems can arise, limiting the scalability of the system. A crowd simulation system is in charge of both rendering visually plausible images of the virtual world and managing the behaviour of autonomous agents. The sum of these requirements results in a computational cost that increases with the number of agents in the crowd. In order to efficiently manage the increase in the computational requirements involved in the rendering part, different techniques have been proposed [5, 93, 18]. Also, other works propose efficient implementations of behavioural models in order to increase the number of agents supported [92]. Although these approaches provide a throughput improvement when rendering or simulating a high number of agents, they do not tackle scalability issues. As a consequence, the number of agents cannot be significantly increased while at the same time providing good system performance. The fact that previous studies do not take into account the architecture of the underlying system severely limits the scalability of these approaches. For this reason, this thesis proposes to tackle the scalability problems of crowd simulations related with the underlying system architecture.

1.2. COMPUTER ARCHITECTURES FOR DISTRIBUTED VIRTUAL REALITY APPLICATIONS

Initial proposals in the area of crowd simulation were designed and implemented for centralized systems, where both the rendering and the simulation were executed on the same machine. Nevertheless, recent studies have proposed the use of parallel and distributed systems to perform crowd simulations [103, 78, 76, 22]. However, new challenges arise when developing an application that takes advantage of parallel and distributed systems. In parallel systems like many-core and multi-core processors, special care should be taken to efficiently use the memory hierarchy as well as to reduce the overhead introduced by the synchronization among execution threads. Also, load balancing problems should be solved when distributing the computational load among the different processor cores [78]. In distributed systems, the software performing the simulation should be efficiently designed to take advantage of the underlying hardware. Furthermore, the load balancing problem should be tackled by partitioning the computational load of the simulation among the nodes in the system. Otherwise, one or more nodes of the system could reach saturation point, increasing the latency and degrading the overall performance of the system
[61]. Latency in a distributed system is the time that passes from the moment a node in the system sends a message until the receiver node completely receives it [20]. There is a direct connection between the latency of a system and the level of interactivity [3]. If a distributed system can ensure low latencies then a low response time among the nodes will be obtained providing a good level of interactivity. Recent studies of DVEs have established a threshold value of 250 ms. as the maximum response time allowed in order to provide an acceptable level of interactivity for users [39]. When designing the software architecture for a distributed system, the underlying computer architecture should be taken into account. Different communication architectures have been proposed in order to efficiently support distributed virtual reality applications such as DVEs [86]. The different architectures can be classified into three types: centralized-server architectures [96, 75], networked-server architectures [50, 61] and peer-to-peer architectures [60, 59]. Figure 1.1 shows an example of a centralizedserver architecture. In this example, the virtual world is two-dimensional and avatars are represented as dots. In DVEs based on a centralized-server architecture, there is a single server and all the client computers are connected to this server. The server is in charge of managing the entire virtual world. As a result, it becomes a potential bottleneck as the number of avatars in the system increases. In fact, DVEs based on these architectures support the lowest number of clients.

Figure 1.1: Example of a client-server architecture

Figure 1.2 shows an example of a networked-server architecture. In this scheme there are several servers and each client computer is exclusively connected to one of these
servers. This scheme is more distributed than the client-server scheme. Since there are several servers, it considerably improves the scalability, flexibility and robustness compared to the client-server scheme. However, it requires a load balancing scheme that assigns clients to servers in an efficient way.

Figure 1.2: Example of a networked-server architecture

Figure 1.3 shows an example of a peer-to-peer architecture. In this scheme, each client computer is also a server. This scheme provides the highest level of load distribution. Although the first DVEs were based on centralized architectures, over the last few years architectures based on networked servers have become the major de-facto standard for DVE systems [50, 31]. However, each new avatar in a DVE system represents an increase not only in the computational requirements of the application, but also in the amount of network traffic [61]. Due to this increase, networked-server architectures seem not to properly scale with the number of clients, particularly in the case of MMOGs [2], due to the high degree of interactivity shown by these applications. As a result, peer-to-peer architectures have been proposed for massively multi-player online games [60, 59, 28]. Nevertheless, P2P architectures must still efficiently solve the awareness problem. This problem consists of ensuring that each avatar is aware of all the avatars in its neighbourhood [87]. Providing awareness to all the avatars is a necessary condition to provide time-space consistency (as defined in [104, 80]). Awareness is crucial for DVEs, since otherwise abnormal situations could arise. For example, a game user provided with a non-coherent view of the virtual world could be shooting at something apparently visible that is not actually there. Also, it is possible that an avatar not provided with a coherent view is killed by another, invisible avatar. In networked-server architectures, the awareness problem is easily solved by the existing servers, since they periodically synchronize their state and therefore they know the location of all avatars all the time. Each avatar reports its changes (by sending a message) to the server which it is assigned to, and the server can easily decide which avatars should receive that message (by using a criterion of distance). In this way, there is no need for a method to determine the neighbourhood of avatars, since servers know that neighbourhood at every instant. Taking into account the characteristics of the different communication architectures, this thesis proposes the networked-server scheme as the underlying architecture for the simulation of crowds. On the one hand, this distributed scheme improves scalability, flexibility and robustness when compared to centralized (client-server) architectures. On the other hand, the small number of servers in networked-server architectures makes it easy to provide awareness (and therefore time-space consistency) to the agents moving in the virtual world. In this sense, it seems difficult to maintain the coherence of the semantic information in the system if it follows a peer-to-peer scheme, where hundreds or even thousands of computers each support a small number of actors and a copy of the semantic database. According to the networked-server scheme, the software architecture has been designed in order to distribute the agents of the crowd across different server computers (the networked servers).

Figure 1.3: Example of a peer-to-peer architecture
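
As a hedged illustration of the distance criterion mentioned above, a server that knows the positions of all avatars can select the recipients of an update message as sketched below. The Avatar structure, the awareness radius and the function name are assumptions of the example rather than the actual routine used by the networked servers described in this thesis.

```cpp
#include <cstdint>
#include <vector>

struct Avatar {
    int32_t id;
    float   x, y;   // position in the 2D virtual world
};

// Distance-based awareness filtering on a networked server: an update sent by
// 'sender' is forwarded only to the avatars lying inside its awareness radius.
std::vector<int32_t> recipientsOf(const Avatar& sender,
                                  const std::vector<Avatar>& allAvatars,
                                  float awarenessRadius) {
    std::vector<int32_t> recipients;
    const float r2 = awarenessRadius * awarenessRadius;
    for (const Avatar& a : allAvatars) {
        if (a.id == sender.id) continue;
        const float dx = a.x - sender.x;
        const float dy = a.y - sender.y;
        if (dx * dx + dy * dy <= r2)      // squared distance avoids the sqrt
            recipients.push_back(a.id);
    }
    return recipients;
}
```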

CHAPTER 2

STATE OF THE ART

This chapter discusses the latest developments with regard to crowd simulation systems. We will focus on the different approaches to efficiently solve the rendering of crowded scenes and managing the autonomous behaviour of large groups of individuals. In addition, this chapter describes the proposed implementations for parallel and distributed systems.

2.1. CROWD RENDERING

Real-time rendering of a large number of 3D characters is a considerable challenge due to the large amount of computational resources required. Approaches that are feasible for a few characters do not scale up for large-scale crowds [1]. Several studies have tried to tackle such scalability limitations by exploiting graphics accelerator capabilities, and also by taking advantage of the fact that our perception of the scene as a whole is limited. People can only perceive a relatively small part of a large collection of characters in full detail. A simple calculation shows that treating every crowd member as equal is rather wasteful, and it makes sense to take full care of only the foremost agents. The rest of the agents can be replaced with less complex approximations. This section describes the main approaches to efficiently render crowds at interactive frame rates.


2.1.1. Efficient Rendering Techniques

Image-Based Rendering

Image-based rendering (IBR) approaches are based on the idea proposed by Maciel et al. [52] of using texture mapped quadrilaterals to represent objects. This planar representation of objects is denoted as planar impostors and allows an interactive frame rate for the simulation of large environments. In order to represent a virtual human, both Tecchia et al. [93] and Aubel et al. [5] use planar impostors. However, they differ in how the impostor image is generated. The two main approaches to the generation of impostor images are: dynamic generation and static generation (also referred to as pre-generated impostors). Aubel et al. use a dynamically generated impostor approach [5]. In this approach, the impostor image is updated at run-time by rendering the object's mesh model to an off-screen buffer and storing this data in the image. This image is displayed on a quadrilateral, which is dynamically orientated towards the viewpoint. This approach uses less memory, since no storage space is devoted to any impostor image that is not actively in use. Unlike dynamically generated impostors for static objects (where the generation of a new object impostor image depends on the camera motion), animated objects such as a virtual human's mesh also have to take self deformation into account. The solution to this problem is based on the sub-sampling of motion. By simply testing distance variations between some pre-selected joints in the virtual human's skeleton, the virtual human is re-rendered if the posture has significantly changed. However, dynamically generated impostors heavily rely on reusing the current impostor image over several frames in order to be efficient, since animating and rendering the human's mesh off-screen is too costly to perform regularly. Therefore, this approach does not fit into scenes containing large dynamic crowds, since this would require a coarse discretization of time, resulting in jerky motion. Tecchia et al. [93] use pre-generated impostors to render several thousand virtual humans at interactive frame rates. Pre-generated impostors involve the rendering of an image of an object for a collection of viewpoints (called reference viewpoints) around the object (see Figure 2.1). However, since virtual humans are animated objects, they present a trickier problem in comparison to static objects. As well as rendering the virtual human from multiple viewpoints, multiple key-frames of animation for each viewpoint need to be rendered, which greatly increases the amount of texture memory used. In order to reduce the
amount of texture memory consumed, authors reduce the number of reference viewpoints needed for each frame by using a symmetrical mesh representation animated with a symmetrical walk animation, so that already generated reference viewpoints can be mirrored to generate new viewpoints. At run-time, depending on the viewpoint with respect to the human, the most appropriate impostor image is selected and displayed on a quadrilateral, which is dynamically orientated towards the viewer. To allow for the dynamic lighting of the impostor representation, normal map images are pre-generated for each viewpoint by encoding the surface normals of the human’s mesh as a RGB colour value. By using a per-pixel dot product between the light vector and a normal map image, they compute the final value of a pixel through multi-pass rendering, requiring a minimum of five rendering passes.

Figure 2.1: Computing the impostor images by discretizing the view direction, and views of the character from the set of discrete directions [93]

The main advantage of this approach is that it is possible to deal with the geometric complexity of an object in a pre-processing step. However, with pre-generated impostors, since the object's representation is fixed, 'popping' artefacts are introduced as a result of being forced to approximate the representation for the current viewpoint with the reference viewpoint. To avoid these artefacts, the number of viewpoints around the object for the pre-generation of the impostor images can be increased. However, this can cause problems with the consumption of texture memory. Image warping is another technique to reduce the popping effect, but this method can also introduce its own artefacts. Since a pre-generated approach requires a large number of reference viewpoints for several frames of animation, it is unsuitable for scenes containing a variety of human models where each model performs a range of different motions. Dobbyn et al. [18] developed the Geopostor system, which provides a hybrid combination of pre-generated impostor and detailed geometric rendering techniques for virtual humans (see Figure 2.2). By
switching between the two representations, based on a ’pixel to texel’ ratio, their system allows visual quality and performance to be balanced. They improved on the existing impostor rendering techniques and developed a programmable, hardware based method for adjusting the lighting and colouring of the virtual humans’ skin and clothes.

Figure 2.2: Left: Snapshot of Geopostor system rendering a virtual crowd in an urban environment [18]. Right: wireframe view of the same scene where both the impostor (green rectangles) and mesh representation of characters can be seen
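
Selecting the "most appropriate impostor image" for the current viewpoint, as described for pre-generated impostors above, essentially quantizes the character-to-camera direction into the same discrete set of reference viewpoints used at pre-generation time. The sketch below illustrates one simple way of doing this for viewpoints distributed around the character in the ground plane; the number of directions and the image-table layout are assumptions of the example, not the layout used in [93].

```cpp
#include <cmath>

// Pre-generated impostors: one image is stored per (discrete view direction,
// animation keyframe) pair. This helper maps the current character-to-camera
// direction to the index of the image rendered from the closest reference
// viewpoint. The direction-major storage layout is an assumption.
int impostorImageIndex(float camX, float camY,   // camera position (ground plane)
                       float chrX, float chrY,   // character position
                       int numDirections,        // e.g. 32 reference viewpoints
                       int keyframe, int numKeyframes)
{
    const float kTwoPi = 6.28318530718f;
    // Angle of the character-to-camera direction, remapped to [0, 2*pi).
    float angle = std::atan2(camY - chrY, camX - chrX);
    if (angle < 0.0f) angle += kTwoPi;

    // Quantize the angle to the nearest pre-generated reference viewpoint.
    const float step = kTwoPi / static_cast<float>(numDirections);
    const int dir = static_cast<int>(angle / step + 0.5f) % numDirections;

    return dir * numKeyframes + keyframe;
}
```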

Point-Based Rendering

Point-based rendering is a method for the visualisation of virtual humans, which involves replacing a mesh with a cloud of points, approximately pixel-sized. Wand et al. [98] provide a general technique to represent animated scenes that is not dependent on a specific application. The proposed algorithm works on keyframe animations of triangle meshes. The keyframe animation consists of a sequence of triangle meshes of arbitrary topology and connectivity. For each pair of consecutive keyframes, correspondences between the vertices of the triangles are specified, i.e. every vertex of a keyframe must be assigned a matching vertex in the other keyframe (see Figure 2.3.a). During animation, the position (and all other vertex attributes, such as normal or colour) is interpolated linearly between the keyframe values. Triangles can be created or deleted by blending from one vertex position to three different positions and vice versa. The specification of vertex correspondences is part of the input to the algorithm, i.e. they are not established automatically but must be specified by the user during modelling. Wand et al. partition the scene's triangles using an octree data structure. Each node
stores an approximate representation of its part of the scene with a fixed resolution in respect to the geometric size of the node. The octree is built recursively in top-down order. Initially, a cube contains all triangles of the scene. Then, sample points are uniformly distributed on the surface area of the triangles and are stored in the current box. Triangles receiving more than 3 sample points are also stored in the current box. These triangles are not considered any longer for point sampling in child boxes because at that sampling density, the point sample approximation becomes more expensive than conventional rasterization of the triangles. The current cube is then subdivided recursively into 8 smaller cubes and the remaining triangles are distributed among the child nodes. The recursion is performed until a node contains only a constant amount of triangles which are then stored in the box without sampling. Figure 2.3.b shows the multi-resolution structure represented as an octree obtained for different keyframes. Using this multi-resolution data structure, they are able to render large crowds of animated characters. For smaller crowds, consisting of several thousands of objects, each object is represented by a separate point sample and its behaviour is individually simulated. Larger crowds are handled differently, with a hierarchical instantiation scheme, which involves constructing multi-resolution hierarchies (e.g., a crowd of objects) out of a set of multi-resolution sub-hierarchies (e.g., different animated models of single objects). While this scheme allows them to arbitrarily render crowded scenes with thousands of characters, less flexibility is provided for the motion of the objects, since the hierarchies are pre-computed and therefore they cannot be used in simulating a large crowd moving within its environment.

2.1.2. Rendering Acceleration through GPUs

Historically, hardware graphics acceleration started at the end of the pipeline, first performing rasterization of triangle scanlines. Successive generations of hardware have then worked back up the pipeline, to the point where some higher-level application-stage algorithms are executed on the hardware accelerator. The first consumer graphics chip to include hardware vertex processing was NVIDIA's GeForce256, shipped in 1999. NVIDIA used the term graphics processing unit (GPU) to differentiate the GeForce256 from the previously available rasterization-only chips. Over the next few years, the GPU evolved from configurable implementations of a complex fixed-function pipeline to highly programmable hardware where developers could implement their own algorithms. Programmable shaders of various kinds are the primary means by which the GPU is controlled. The vertex shader enables various operations (e.g. transformations and deformations) to be performed on each vertex. Similarly, the pixel shader processes individual pixels, allowing complex shading equations to be evaluated per pixel. The geometry shader allows the GPU to create and destroy geometric primitives (points, lines and triangles) on the fly. Computed values can be written to multiple high-precision buffers and reused as vertex or texture data.

Related to the rendering of crowds, programmable GPU shaders have been used for purposes other than geometry processing or shading. Concretely, they have been used for animating the 3D mesh of a character by means of deformation-based methods, such as skinning [1]. Skinning is based on having a skeleton that drives the deformation of the character mesh. The basic idea behind the skinning method consists of performing the deformation of each vertex by the weighted transformation of one or several attached bones or joints. Equation 2.1 defines the deformation of each vertex, where v(t) is the deformed vertex at time t, X_i^t is the global transform of bone i at time t, X_i^{-ref} is the inverse global transform of the bone in the reference position and v^{ref} is the vertex in the reference position.

v(t) = Σ_{i=1..n} X_i^t · X_i^{-ref} · v^{ref}     (2.1)
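As a reference for Equation 2.1, the following is a minimal CPU-side sketch of linear blend skinning in C++. The Mat4/Vec4 helper types and the explicit per-bone weights (which Equation 2.1 omits but the "weighted transformation" description implies) are assumptions of this illustration, not part of the cited implementations.

#include <array>
#include <cstddef>
#include <vector>

// Minimal 4x4 matrix and homogeneous point types (illustrative only).
struct Vec4 { float x, y, z, w; };
struct Mat4 {
    std::array<float, 16> m; // row-major
    Vec4 operator*(const Vec4& v) const {
        return { m[0]*v.x  + m[1]*v.y  + m[2]*v.z  + m[3]*v.w,
                 m[4]*v.x  + m[5]*v.y  + m[6]*v.z  + m[7]*v.w,
                 m[8]*v.x  + m[9]*v.y  + m[10]*v.z + m[11]*v.w,
                 m[12]*v.x + m[13]*v.y + m[14]*v.z + m[15]*v.w };
    }
};

// Deforms one vertex following Equation 2.1: every bone i attached to the
// vertex transforms the reference-pose vertex by X_i^t * X_i^-ref, and the
// contributions are accumulated, weighted by the skinning weight w_i.
Vec4 skinVertex(const Vec4& vRef,
                const std::vector<Mat4>& boneGlobal,   // X_i^t
                const std::vector<Mat4>& boneInvRef,   // X_i^-ref
                const std::vector<int>& bones,         // bones attached to the vertex
                const std::vector<float>& weights)     // w_i (summing to 1)
{
    Vec4 v{0.0f, 0.0f, 0.0f, 0.0f};
    for (std::size_t k = 0; k < bones.size(); ++k) {
        const int i = bones[k];
        Vec4 t = boneGlobal[i] * (boneInvRef[i] * vRef);
        v.x += weights[k] * t.x;  v.y += weights[k] * t.y;
        v.z += weights[k] * t.z;  v.w += weights[k] * t.w;
    }
    return v;
}

In the GPU implementations discussed next, the same per-vertex computation is moved to a vertex shader and the bone matrices are shared among all characters that play the same animation.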

GPU implementations of skinning have been shown to be efficient for rendering crowded scenes at interactive frame rates [4]. GPU skinning reuses animation data and the associated skin deformation calculations. In this sense, the data required for performing the skinning is stored in textures accessed from the pixel shader, since accessing textures in the vertex shader stage is expensive. This skinning data is reused for deforming the mesh of every character. In addition, performance is improved by reducing the number of draw calls required to render the crowd. If the number of draw calls is high, the overhead introduced by each draw call can result in a CPU-bound graphics application instead of a GPU-bound one. Nevertheless, the number of draw calls can be reduced by using Instancing. Instancing consists of grouping characters that share a 3D mesh into a batch, generating only one draw call per batch. However, Instancing has the limitation that characters have to share both the mesh and the pose at a given time. For this reason, this rendering technique does not seem appropriate for displaying real crowds with autonomous behaviours. Gosselin et al. presented an efficient technique for rendering large crowds based on Instancing, where each character in the crowd is provided with a different pose [29]. In order to animate several characters with different poses, a number of instances of character vertex data are packed into a single vertex buffer. The character skinning is implemented in a vertex shader. As vertex shading is generally the bottleneck in scenes containing a large number of deformable meshes, the number of vertex shader operations that need to be performed is minimized. Although this approach allows the rendering of groups of individuals with different poses at a given time, only the vertex data of four instances can fit into a single vertex buffer. For this reason, the number of draw calls greatly increases with the number of characters in the crowd, limiting the performance. A more recent approach can perform Instancing for rendering a crowd while providing each character with a different pose [21]. This approach uses a feature of the newer DirectX graphics API that allows Instancing to be performed while generating an identifier for each 3D mesh instance. In this way, the properties of each instance (i.e. translation, rotation, scale and animation frame) can be stored in a constant buffer to independently apply GPU skinning to each character mesh. The use of constant buffers allows the storage of the information of up to 819 instances in the same buffer. In this way, one draw call is issued for 819 instances, allowing the rendering of thousands of characters at interactive frame rates using consumer hardware.
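The figure of 819 instances can be checked with simple arithmetic, assuming the usual 64 KB size limit of Direct3D 10 constant buffers and a per-instance record made of a 4x4 world matrix plus an animation-frame value padded to a 16-byte register (80 bytes in total): 65536 / 80 = 819. The sketch below only illustrates this calculation; the exact per-instance layout used in [21] may differ.

#include <cstddef>
#include <cstdio>

// Hypothetical per-instance record: a 4x4 world matrix (translation, rotation,
// scale) plus the animation frame, padded to a full 16-byte shader register.
struct InstanceData {
    float world[4][4];   // 64 bytes
    float animFrame;     // which keyframe of the skinning animation to use
    float pad[3];        // padding up to a 16-byte boundary
};

int main() {
    const std::size_t kConstantBufferBytes = 64 * 1024;        // assumed D3D10 limit
    const std::size_t instancesPerBuffer =
        kConstantBufferBytes / sizeof(InstanceData);            // 65536 / 80 = 819
    std::printf("Instances per constant buffer: %zu\n", instancesPerBuffer);
    return 0;
}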

2.2. BEHAVIOURAL ANIMATION

Many models for behavioural animation of crowds have been developed over the years in a variety of disciplines including computer graphics, robotics, and evacuation dynamics. These models can be grouped into two main categories: macroscopic and microscopic. Macroscopic models focus on the system as a whole, providing global knowledge of the scenario to agents or modelling the flow of the group rather than individual behaviours. On the other hand, microscopic models study the behaviour and decisions of individual agents and their interaction with other agents in the crowd.

2.2.1. Microscopic Models

Rule-based Models

The most well-known set of local rules to simulate lifelike complex behaviour is Reynolds' boids model [79]. It is an elaboration of a particle system with the simulated entities (boids) represented as oriented particles with specific control rules. The aggregate motion of the simulated flock is created by a distributed behavioural model. Each simulated agent is implemented as an independent actor that navigates according to its local perception of the dynamic environment, the laws of simulated physics that rule its motion and a set of behaviours programmed by the animator. The aggregate motion of the simulated flock is the result of the interaction of the relatively simple behaviours of the individual simulated boids. The basic model to simulate generic flocking behaviour consists of three simple rules that describe how an individual computes its trajectory based on the positions and speeds of its nearby flockmates. Figure 2.4 shows how an individual (green triangle) updates its trajectory and speed based on three rules. These rules are the following ones:

• Separation: steer to avoid crowding local flockmates

• Alignment: steer toward the average heading of local flockmates

• Cohesion: steer toward the average position of local flockmates

Figure 2.4: Reynolds' rules for the boids simulations [79]

Each boid has access to the whole environment description, but flocking only requires reaction within a specific neighbourhood that is given by a distance (from the centre of each boid) and an angle (from each boid's direction of flight). This neighbourhood can be considered as a limited perceptual field. Each boid will not only avoid collision with other boids but also with obstacles in the environment. Rule-based models can provide realistic human movement for low and medium-density crowds in a flocking style. However, the use of a set of rules to animate the behaviour of agents cannot provide general models suitable for a wide range of scenarios, resulting in poor generalization. For this reason, a set of rules that is suitable for a given scenario may produce undesirable behaviour in a different scenario. In addition, these models avoid contact among individuals so, when densities are high, they apply wait rules to enforce an ordered crowd behaviour without the need for computing collision detection and response. Nevertheless, some studies have shown that rule-based models can generate overlapping among agents, due to the lack of collision detection, especially in high-density scenarios [34].
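A minimal sketch of the three steering rules is shown below, assuming a simple 2D vector type and that the neighbours list already contains only the flockmates inside the boid's perceptual field; the weights of the three terms are arbitrary illustrative values.

#include <vector>

struct Vec2 {
    float x, y;
    Vec2 operator+(const Vec2& o) const { return {x + o.x, y + o.y}; }
    Vec2 operator-(const Vec2& o) const { return {x - o.x, y - o.y}; }
    Vec2 operator*(float s) const { return {x * s, y * s}; }
};

struct Boid { Vec2 pos, vel; };

// One steering update for a single boid, combining Reynolds' three rules.
// 'neighbours' is assumed to contain only the flockmates inside the boid's
// limited perceptual field (the distance and angle test is done elsewhere).
Vec2 steer(const Boid& self, const std::vector<Boid>& neighbours)
{
    if (neighbours.empty()) return {0.0f, 0.0f};

    Vec2 separation{0.0f, 0.0f}, alignment{0.0f, 0.0f}, cohesion{0.0f, 0.0f};
    for (const Boid& n : neighbours) {
        separation = separation + (self.pos - n.pos);  // move away from crowding
        alignment  = alignment + n.vel;                // match the average heading
        cohesion   = cohesion + n.pos;                 // move toward the average position
    }
    const float inv = 1.0f / static_cast<float>(neighbours.size());
    alignment = alignment * inv - self.vel;
    cohesion  = cohesion * inv - self.pos;

    // Arbitrary illustrative weights; a real implementation tunes these.
    return separation * 1.5f + alignment * 1.0f + cohesion * 1.0f;
}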

Social Forces Models

Social forces models represent each agent as a 2D circle moving over a plane, and describe continuous coordinates, speeds, and interactions of each agent with other objects. Each force parameter is individual for each agent. Social forces models define the behaviour of a human crowd with a mixture of socio-psychological and physical factors. The most important empirically derived social forces model is Helbing's model [38]. In Helbing's model, the i-th agent of mass m_i likes to move with a certain desired speed v_i^0 in a certain direction e_i^0, and tends to adapt its instantaneous velocity v_i within a certain time interval τ_i. At the same time, each agent tries to keep a distance from other individuals j and from the walls w using interaction forces f_ij and f_iw, respectively. The change of momentum for the i-th agent at time t is given by Equation 2.2, while the change of position r_i is given by the velocity v_i. Equation 2.2 computes the momentum using the three terms that appear on the right-hand side of the equation. The first term updates the agent's momentum taking into account its desired speed (v_i^0) and its instantaneous velocity (v_i) within a certain time interval (τ_i). The second and third terms take into account the repulsion forces applied by neighbouring individuals and surrounding obstacles, respectively.

m_i dv_i/dt = m_i (v_i^0(t) e_i^0(t) − v_i(t)) / τ_i + Σ_{j(≠i)} f_ij + Σ_w f_iw     (2.2)

Helbing’s social forces model applies repulsion and tangential forces to simulate the interaction between people and obstacles, which allows for realistic pushing behaviour and variable flow rates. Although Helbing’s model was estimated from real data, the main disadvantage of this approach is that agents appear to shake or vibrate in response to the numerous forces in high-density crowds, which does not correspond to natural human behaviour. Figure 2.5 shows a sequence of images from Helbing’s simulation where a crowd tries to escape through an exit.

Figure 2.5: Evacuation of a crowd using Helbing’s model [38]


Data-driven Models

Data-driven approaches propose models where the agents of the crowd are animated using data sets. Lerner et al. introduced a novel approach for guiding agents within a crowd based on real examples [45]. This approach uses tracking data from real crowds to create a database of examples that is subsequently used to drive the simulated agents' behaviour. Using real-world examples, crowd animations are synthesized by stitching together pieces of behaviours. The approach is a two-part process. The first part is a pre-processing step where a database containing examples is created. The steps followed for creating the database are shown in Figure 2.6.a. The input to the first part is a video showing the movement of real people. Using the input video, tracking information is extracted. Finally, different examples of the behaviour of each individual, depending on the surrounding environment, are stored in a database. The second part of the approach is the on-line synthesis of the simulated crowd. Figure 2.6.b shows the steps followed in the second part. The first step defines a query for each agent in order to search the examples database taking into account the current situation of each agent. The second step searches the database for a trajectory using the queries defined in the previous step. In the third step, the returned trajectory is copied over, and the fourth step is in charge of animating the movement of each agent based on the computed trajectory. This data-driven approach can obtain realistic movements and behaviours since it is based on real data. However, the complex process of querying the examples database only allows small groups of agents to be animated at interactive frame rates.

Figure 2.6: a) steps followed to build the database from real examples and b) steps followed for agent trajectory computation using the examples database [45]

2.2.2. Macroscopic Models

Roadmap Models

A roadmap is a graph where nodes are collision-free locations for the characters' bounding box and edges are collision-free position evolutions of the bounding box between two nodes. A solution path is a set of contiguous edges connecting two given nodes. A cost value is associated to each edge in order to enable shortest-path searches. In this way, roadmap models provide agents with global knowledge of the scene for computing navigational paths between nodes. Nevertheless, a roadmap representation requires low-level models for steering agents between each pair of nodes. Pettre et al. proposed a method for crowd navigation based on the concept of Probabilistic Roadmap (PRM) [73]. This approach computes a roadmap over a 2D environment with obstacles and explores it redundantly to obtain multiple solution paths. Figure 2.7.a shows the computed roadmap for a given scene containing one obstacle. In Figure 2.7.a, circles represent nodes of the roadmap and segments between two nodes represent edges of the roadmap. The set of feasible paths is extracted from the computed roadmap using an implementation of Dijkstra's algorithm. In addition, an edge deletion technique is used to provide variety in the set of paths. Figure 2.7.b shows an example of a path computed to reach the exit point from the entry point. The resulting paths are plausible and widely cover the environment, given that short paths are found as well as less optimal ones. Nevertheless, although different paths are calculated, the pre-computation of both the roadmap and the paths limits the ability of the crowd to adapt to dynamic changes in the environment. In this way, static simulations with limited realism are obtained.

Figure 2.7: a) computed roadmap to reach an exit point avoiding an obstacle and b) path solution search using the roadmap [73]

Another approach proposed a new type of roadmap in order to react to changes in complex and dynamic environments [90]. In this approach, adaptive roadmaps are used to perform global path planning for each agent simultaneously. Dynamic obstacles and inter-agent interaction forces are taken into account to continuously update the roadmap by using a physically-based agent dynamics simulator. In addition, this approach introduces a new type of edge within the roadmap called 'link bands'. The use of link bands allows collisions among multiple agents to be resolved. Each link band specifies a collision-free zone in a well-defined neighbourhood of each edge of the roadmap. In this way, collisions are avoided by computing new trajectories for an agent that is following an edge of the roadmap. This approach can perform real-time navigation for a few thousand human agents in indoor and outdoor scenes. However, the dynamics formulation used to update roadmap edges can potentially result in an agent getting stuck in a local minimum across space-time. Therefore, this approach may not be able to provide either convergence or completeness on the existence of a collision-free path for each agent in all environments.

Cellular Automata Models

Cellular Automata (CA) [43] is an artificial intelligence approach for simulating physical systems defined with mathematical models in which space and time are discrete and physical quantities take a finite set of discrete values. A cellular automaton consists of a regular uniform lattice (2D array) with one or more discrete variables at each site (cell). The state of a cellular automaton is entirely specified by the values of the variables at each cell. A cellular automaton evolves in discrete time steps, with the value of the variable at one cell being affected by the values of the variables at the neighbouring cells. The variables at each cell are updated simultaneously based on the values of the variables in their neighbourhood at the previous time step and according to a set of local rules. These rules describe the (intelligent) decision-making behaviour of the automata, thus creating and emulating actual behaviour. Each automaton evaluates its opportunities on a case-by-case basis. Global emergent group behaviour is a result of the interactions of the local rules as each agent examines the available cells in its neighbourhood. CA models for crowds are fast and simple to implement [94, 12]. The virtual space is discretized, and individuals can only move to an adjacent free cell. This discrete approach offers realistic results for lower-density crowds, but unrealistic results when agents in high-density situations are forced into discrete cells. In order to improve realism, CA models are also used to obtain global paths in the grid by pre-computing paths towards goals and storing them within the grid [47]. During the simulation, global pre-computed paths are followed using local steering rules applied to the perception of agents.
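A minimal sketch of one CA update for a single agent is shown below, assuming a boolean occupancy grid and a simple local rule (move to the free neighbouring cell closest to the goal); actual CA crowd models [94, 12] use richer rule sets, but the discrete structure is the same.

#include <cstdlib>
#include <vector>

struct Cell { int x, y; };

// Grid of cells: true = occupied (by an obstacle or an agent), false = free.
using Grid = std::vector<std::vector<bool>>;

static int manhattan(Cell a, Cell b) { return std::abs(a.x - b.x) + std::abs(a.y - b.y); }

// One CA step for a single agent: examine the four neighbouring cells and
// move to the free one that most reduces the distance to the goal.
Cell caStep(const Grid& grid, Cell agent, Cell goal)
{
    const int dx[4] = {1, -1, 0, 0};
    const int dy[4] = {0, 0, 1, -1};
    Cell best = agent;
    int bestDist = manhattan(agent, goal);

    for (int k = 0; k < 4; ++k) {
        Cell c{agent.x + dx[k], agent.y + dy[k]};
        if (c.x < 0 || c.y < 0 ||
            c.y >= (int)grid.size() || c.x >= (int)grid[c.y].size())
            continue;                       // outside the lattice
        if (grid[c.y][c.x]) continue;       // occupied: agents only move to free cells
        int d = manhattan(c, goal);
        if (d < bestDist) { bestDist = d; best = c; }
    }
    return best;   // unchanged when no free neighbouring cell improves the distance
}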

Continuum Models

Continuum methods are based on continuum dynamics. These models usually discretize the environment into a regular grid. Then, a potential function is computed for each cell, which corresponds to the sum of a repulsive potential generated by the obstacles in the environment and an attractive potential generated by the goal. Therefore, gradient methods can be applied to find a path from any origin within the environment to a goal position. An example of a model based on continuum dynamics is Continuum crowds [95]. This approach presents a real-time motion synthesis model for large crowds without agent-based dynamics, where motion is considered as a per-particle energy minimization. In this way, a continuum perspective of the system is adopted. This formulation yields a set of dynamic potential and velocity fields over the domain that guide all individual motion simultaneously. In this sense, global path planning and local collision avoidance are unified into a single optimization framework. However, the Continuum crowds model does not use a discrete scheme. Instead, characters perform global planning to avoid both obstacles and other individuals. The dynamic potential field formulation also guarantees that paths are optimal for the current environment state, so agents never get stuck in local minima. Figure 2.8 shows a scheme of the steps followed by the Continuum crowds model. For every simulation cycle, different grids are used to compute a set of potential fields. Characters compute their trajectories by following the opposite direction of the gradient of these potential fields. This approach based on potential field computation allows large crowds to be efficiently animated, since the values of the potential fields are reused for all agents within the crowd. Nevertheless, agents belonging to the same group share the same potential fields and exhibit the same behaviour. Since the complexity grows with the number of groups simulated, the Continuum crowds model can provide interactive simulations with only a few groups of characters. In this way, the range of different behaviours within the crowd is significantly reduced with respect to agent-based approaches.

Figure 2.8: Continuum crowds general algorithm overview [95]

Another work follows a continuum approach for controlling the crowd [63]. In this work, the authors propose a novel variational constraint, called unilateral incompressibility, to model the large-scale behaviour of the crowd and accelerate inter-agent collision avoidance in dense scenarios. This approach allows the simulation of large, dense crowds at near-interactive rates. However, this model only looks at local information and cannot anticipate future collisions with distant agents. Thus, two groups of agents approaching each other will not react until they are adjacent to each other. In addition, a problem present in continuum-based approaches is the fact that the computational complexity of the model grows with the size of the grid used to discretize the environment. Thus, the authors have studied the performance of the approach when the size of the crowd increases, but they have not analyzed how the size of the grid affects the performance.

2.2.3. Hybrid Models

Hybrid approaches propose the combination of different models for animating crowds. In this sense, Shao et al. propose an implementation that integrates motor, perceptual, behavioural, and cognitive components within a model that manages agents as individuals [83]. The implementation defines models for the environment and for the agents' behaviours. The environment model includes hierarchical data structures that support the efficient interaction between numerous agents and their complex virtual world through fast (perceptual) query algorithms, and that support agent navigation on local and global scales. On the other hand, the autonomous agent model is focused on its (reactive) behavioural and (deliberative) cognitive abilities. In this sense, the authors adopt a bottom-up strategy that uses primitive reactive behaviours (rule-based) as building blocks that in turn support more complex motivational behaviours, all controlled by an action selection mechanism. The approach can obtain realistic behaviours through a complex combination of models at different levels. Nevertheless, the complexity of the approach limits the number of agents that can be simulated at interactive rates.

Pelechano et al. have developed a framework for high-density multi-agent simulation with a hybrid bottom-up approach [70]. At the low level, agents move within a room driven by a social forces model with psychological and geometric rules that provide a wide variety of emergent and high-density behaviours. Above the motion level, this framework uses a wayfinding algorithm that performs navigation in large, complex virtual scenarios, using communication and roles to allow for different types of behaviour and navigation abilities. Both the motion level and the wayfinding with communication and roles can be affected by psychological factors that are initially given as personality parameters for each agent. They can also be modified during the simulation to affect an agent's behaviour. Agents are animated without a centralized controller. Each agent has its own behaviour based on roles and personality variables that represent physiological and psychological factors observed in real people. Figure 2.9.a shows a snapshot of the simulation of an evacuation scenario using the model proposed by Pelechano et al. In this model, agent behaviours are computed at three levels. Figure 2.9.b shows the integration of the three levels. The functionalities implemented by each level are the following:

• CAROSA (high-level behaviour): character definitions, object and action semantics, and action selection and control

• MACES (middle-level behaviour): navigation, learning, communication between agents, and decision making for wayfinding

• HiDAC (low-level motion): perception and a set of reactive behaviours for collision avoidance, detection, and response in order to move within a room


Figure 2.9: Left: Evacuation of a complex and structured scenario using three levels: HiDAC, MACES and CAROSA. Right: three-level model overview [70]

Another approach uses a two-level model to provide a scalable agent framework for crowd simulation [92]. In this approach, the crowd is simulated on two levels. Figure 2.10 shows a scheme of the integration of the two levels. The high level integrates a situation-based distributed control mechanism. Such a control mechanism gives each agent in a crowd specific details about how to react at any given moment, based on its local environment. At the low level, a probability scheme is used. This scheme computes probabilities over state transitions and then samples them to move the simulation forward. This probability scheme allows simple behaviours to be combined, obtaining more complex aggregate behaviours. A benefit of this two-level approach is that the simplification of the agent model improves scalability. In addition, complex behaviours can be efficiently obtained to provide visually plausible simulations. However, although this approach takes scalability issues into account, it is not designed to be executed on a distributed system. As a consequence of its centralized design, only hundreds of agents can be simulated at interactive rates.

2.3. COMPUTER ARCHITECTURE

As described in previous sections, typical crowd simulation approaches focus on computer graphics and behavioural animation issues. However, when the number of agents in the crowd grows, so does the generated workload, requiring efficient use of the underlying hardware in order to maintain an acceptable performance of the simulation. This section describes some implementations that take advantage of different hardware platforms for simulating crowds.


Figure 2.10: Overview of the two-level agent framework proposed in [92]

2.3.1. Distributed Crowd Systems

Different studies have proposed the distribution of crowd simulations in order to improve scalability. Quinn et al. propose an MPI implementation of a social forces model [76]. Several nodes of a cluster are used to perform simulations in parallel. The organization of the distributed processes in the system is based on the manager/worker paradigm. Figure 2.11 shows a scheme of how the manager and worker processes are connected. In this figure, the virtual scene is divided into six regions, and each region is assigned to one worker process. In turn, the manager process is responsible for communicating with the other components of the software system. It reads the scenario layout and broadcasts it to the workers. It also receives the scenario information (including information about the agents) and sends information about each agent to the appropriate worker process. At each simulation step, the manager process gathers data about the locations of the agents from the worker processes and passes this information along to the appropriate rendering engine. Worker processes are organized into a 2D virtual grid. Each worker gets the scenario layout information from the manager process, as well as information about the agents in its portion of the scene. In each simulation step, every worker is responsible for updating the positions of the agents in its portion of the scene. When the new locations have been determined, the workers send the locations of the agents to the manager process. In addition, worker processes exchange the positions of agents with the processes managing surrounding regions in order to compute the social forces. This distributed implementation can perform interactive simulations of up to 10,000 agents when using 11 interconnected processors. Nevertheless, the synchronization scheme among worker processes has been shown to be inefficient, and performance has been evaluated only for a virtual world distributed across a one-dimensional grid. Furthermore, load balancing is not provided during the simulation. As a result, worker processes could reach saturation. On the other hand, the manager process can become the system bottleneck as a consequence of the high network traffic it supports.

Figure 2.11: Manager/worker paradigm used in [76]

Zhou et al. proposed the distribution of Reynolds' model across different interconnected processors [103]. As in the work of Quinn et al., a region-based scheme is used to assign agents to processors. For efficiency reasons, only communication among neighbouring regions is allowed. In this case, load balancing is performed during the simulation. However, the criterion of the workload balancing method is based only on the number of agents supported by each processor. Some studies have shown that other aspects of the system should be taken into account when defining the workload balancing criterion in Distributed Virtual Reality applications [61]. For this reason, the simple criterion used in this approach severely limits the scalability and performance of the simulation system. As a consequence, only small simulations of up to 512 boids can be performed, obtaining far from interactive response times. Figure 2.12 shows the region-based partitioning and the communication scheme among processors managing neighbouring regions proposed by Zhou et al.

Figure 2.12: Communication scheme among neighbour processors proposed in [103]
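A minimal MPI sketch of the manager/worker organization is shown below, assuming that rank 0 acts as the manager that broadcasts the scenario layout and gathers the agent positions every step; the actual implementation in [76] also exchanges agent positions between workers managing neighbouring regions, which is omitted here.

#include <mpi.h>
#include <vector>

int main(int argc, char** argv)
{
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const int agentsPerWorker = 100;              // illustrative value
    std::vector<float> scenario(16, 0.0f);        // placeholder scenario layout

    // The manager (rank 0) reads the scenario and broadcasts it to the workers.
    MPI_Bcast(scenario.data(), (int)scenario.size(), MPI_FLOAT, 0, MPI_COMM_WORLD);

    // Each worker updates the (x, y) positions of the agents in its region...
    std::vector<float> local(2 * agentsPerWorker, 0.0f);
    if (rank != 0) {
        for (int i = 0; i < (int)local.size(); ++i)
            local[i] += 0.1f;                     // stand-in for the social forces update
    }

    // ...and the manager gathers all positions to pass them to the renderer.
    std::vector<float> all;
    if (rank == 0) all.resize(local.size() * size);
    MPI_Gather(local.data(), (int)local.size(), MPI_FLOAT,
               all.data(), (int)local.size(), MPI_FLOAT, 0, MPI_COMM_WORLD);

    MPI_Finalize();
    return 0;
}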

2.3.2. Parallel Crowd Systems

Different implementations of crowd simulation systems propose the use of parallel systems. One of these, the PSCrowd system, proposes the parallelization of Reynolds' model for the CellBE processor [78]. PSCrowd distributes the simulation workload across the multiple Synergistic Processing Elements (SPEs) of the CellBE. The system performs space partitioning with two purposes. The first is to perform spatial hashing to search for neighbouring individuals during the simulation. The second purpose is to divide the update phase of the simulation into disjoint jobs (i.e. jobs that can be executed in parallel without the need for synchronization), which can be evaluated in arbitrary order on any number of SPEs. This partitioning suits the SPE memory size and provides automatic load balancing of the simulation update. PSCrowd stores the whole simulation state in main memory, moving it temporarily to an SPE for processing and then back to main memory. This approach suits the CellBE architecture well because main memory is large, SPE memory is small, and DMA is very fast. Figure 2.13 shows the partitioning of the scene and the mapping of regions to SPEs. Based on this scheme, PSCrowd allows tens of thousands of individuals to be simulated at interactive rates. Nevertheless, its design for a single CellBE processor limits its scalability when the number of individuals in the crowd grows.


Figure 2.13: In PSCrowd disjoint jobs are evaluated in arbitrary order on any number of SPEs [78]

Another work proposes the parallelization of Reynolds' model using the OpenMP API [101]. In addition, several data structures representing the environment are evaluated to find the best one for performing parallel simulations. Querying these data structures, OpenMP threads update the simulation state. In this way, a group of agents in the crowd is controlled by one OpenMP thread. This implementation has been evaluated simulating 1000 agents using a dual-core processor and up to four execution threads. Although interactive simulations are performed, the authors focus on the speed-ups obtained rather than on the maximum number of agents supported. This OpenMP parallel implementation can run up to 2.7 times faster than a single-threaded version.

Some proposals have implemented crowd simulation systems using GPUs [22, 9]. These agent-based implementations use spatial hashing techniques in order to efficiently perform the simulation on GPUs. Special algorithms are designed to take advantage of SIMD architectures. In this sense, simulation data is organized in global memory to obtain coalesced accesses and increase the memory bandwidth. In addition, the data locality present in agent-based simulations is exploited by storing agents' data in fast on-chip memories. In this way, each agent can accelerate the search for its surrounding objects. GPU implementations allow thousands of autonomous agents to be simulated efficiently. Nevertheless, their design for a single GPU limits the scalability of these approaches when the number of agents grows.
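A minimal OpenMP sketch of the parallel agent update described above is shown below, assuming that each step reads only the previous simulation state (so loop iterations are independent); the environment data structures evaluated in [101] are not reproduced here.

#include <omp.h>
#include <vector>

struct AgentState { float x, y, vx, vy; };

// Advances all agents one step in parallel. 'current' is read-only during the
// step and 'next' is written, so loop iterations do not share mutable state.
void parallelStep(const std::vector<AgentState>& current,
                  std::vector<AgentState>& next, float dt)
{
    #pragma omp parallel for
    for (int i = 0; i < (int)current.size(); ++i) {
        AgentState a = current[i];
        // Stand-in for the Reynolds rules evaluated against a's neighbours.
        a.x += a.vx * dt;
        a.y += a.vy * dt;
        next[i] = a;
    }
}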

2.4. CONCLUSIONS

The approaches described in this chapter for crowd rendering propose efficient techniques to display large groups of individuals using consumer hardware. Nevertheless, these approaches do not take into account scalability issues for large-scale scenarios, where designs for a single machine usually represent a bottleneck. On the other hand, models for the animation of crowds focus on providing visually plausible behaviours without proposing a scalable system design capable of efficiently managing the high workload generated by large simulations. As a consequence, these models cannot provide the scalability required by large-scale crowd simulations. For this reason, some approaches described in this chapter have proposed the use of distributed and parallel systems to implement different crowd models. Nevertheless, the implementations using distributed systems define inefficient synchronization schemes or workload balancing techniques that limit the scalability of the system. On the other hand, the approaches using parallel systems propose efficient implementations whose scalability is limited by the computational power of the multi-core processor or the GPU where they are executed. In conclusion, the lack of scalability of large-scale crowd simulation systems demonstrates that this aspect is still an open issue. For this reason, this thesis proposes different techniques to improve the scalability of crowd simulation systems.

CHAPTER 3

DISTRIBUTED SYSTEM FOR CROWD SIMULATION

In recent years, interest in performing crowd simulations using distributed and parallel systems has increased, since these schemes allow the simulation workload to be distributed among several computational resources. Nevertheless, some approaches using distributed systems propose implementations that can only simulate small groups of characters, limiting the scalability of the system with the number of agents [103]. Other distributed approaches allow a higher number of agents to be simulated, but the synchronization technique among the computing nodes introduces a high overhead, limiting the scalability of the system [76]. Approaches using parallel processors propose efficient implementations for the underlying architecture [78, 22]. Nevertheless, these crowd systems present scalability limitations with the number of agents, since they are restricted by the computational power of the single machine where they are executed. When the simulation workload is higher than the available computing power, these approaches are not designed to add more computational resources and increase the throughput of the system accordingly. In this chapter, a distributed system that greatly improves the performance of crowd simulations with respect to previous approaches is proposed. First, a distributed system architecture based on a centralized process that manages the state of the simulation is described. Next, a design for parallelizing the process in charge of managing the scene and distributing it across several machines is proposed. In this way, the throughput and scalability of the system are improved by adding more computational resources. Finally, it is described how the distributed system proposed for crowd simulation can benefit from the use of multi-core and many-core architectures.


3.1. A DISTRIBUTED SYSTEM FOR CROWD SIMULATION

The proposed distributed system for crowd simulation is composed of two parts: the computer architecture and the software architecture. The underlying computer architecture chosen is a networked-server scheme. On the one hand, this distributed scheme allows scalability, flexibility and robustness to be improved when compared to centralized (client-server) architectures. On the other hand, the small number of servers in networked-server architectures makes it easy to provide awareness (and therefore time-space consistency) to the agents moving in the virtual world. It must be noticed that crowd systems have computational requirements different from those of other distributed virtual reality applications such as MMOGs, because crowd systems are computer-driven applications instead of user-driven applications. Although MMOG applications based on a networked-server scheme have presented scalability problems when the number of clients increases, we selected the networked-server scheme because several agents can be executed in the same computer, reducing the number of computers needed. Figure 3.1 shows an example of the proposed computer architecture with three servers.

Figure 3.1: An example of the proposed computer architecture for crowd simulation.

On top of this networked-server architecture, a software architecture must be designed to manage a crowd of autonomous agents. In order to easily maintain the coherence of the virtual world, a centralized semantic information system is needed. In this sense, it seems difficult to maintain the coherence of the semantic information system if it follows a peer-to-peer scheme, where hundreds or even thousands of computers each support a small number of actors and a copy of the semantic database. Therefore, on top of the computer architecture shown in Figure 3.1, the software architecture shown in Figure 3.2 is proposed. This architecture has been designed to distribute the agents of the crowd among different server computers (the networked servers).

Figure 3.2: The proposed software architecture

The hierarchical software architecture is composed of two kinds of elements: the Action Server (AS) and the Client Process (CP). The AS is unique, but the system can have as many client processes as necessary, in order to properly scale with the number of agents to be simulated. In turn, each CP manages a group of autonomous agents. In order to take advantage of the underlying computer architecture, the most suitable distribution for this software architecture consists of allocating the AS on a single server and uniformly distributing the CPs among the rest of the networked servers. In this way, the scalability and flexibility of the networked-server scheme can be used to add more CPs as the number of agents increases. Since a client process can manage a variable number of autonomous agents, each CP is hosted on a single machine. However, several CPs can be hosted on the same machine.


3.1.1. The Action Server

The Action Server corresponds to the action engine [24], and it can be viewed as the world manager, since it controls and properly modifies all the information the crowd can perceive. The Action Server is fully dedicated to verifying and executing the actions requested by agents, since they are the main source of changes in the virtual environment. The AS should be placed on the computer with the highest computational power in order to improve the throughput of the system, and this computer should be used exclusively for this purpose. Since the AS is unique, consistency is easily provided. In this context, consistency involves the information that the agents should know in order to animate consistent behaviours.

Figure 3.3: Internal structure of the AS

Figure 3.3 shows the internal structure of the AS. The AS consists of three modules: the Interface module, the Semantic Database (SDB) and the Crowd Action Server Control (CASC). The SDB represents the global knowledge about the virtual world that the agents should be able to manage, and it contains the necessary functionalities to handle interactions between agents and objects. Complex spatial data structures (such as quad/oct-trees) have been discarded for implementing the SDB, since these structures can be too expensive to handle when the number of insertions and deletions grows (agents can be continually changing their location). The structure used to manage the SDB is a map for objects and agents, which allows a set of (attribute, value) pairs associated to each object/agent to be managed during the simulation. The names of agents and objects are used for indexing the map and accessing the corresponding attributes and values. The semantic information managed can be symbolic (e.g. object_i free true, object_i on object_k, ...) and numeric (e.g. object_i position, object_i bounding volume, ...). This design can be used for different types of agents. Figure 3.4 illustrates this database scheme.

Figure 3.4: The Semantic Database

The Interface module hides all the details of the message exchanges, providing the CASC module with the abstraction of asynchronous messages. In order to process messages as soon as they arrive, the Interface module contains one I/O thread dedicated to getting incoming messages from a TCP socket. In turn, each TCP socket is associated to one CP connected to the AS. Reply messages are sent to each CP by means of one output queue and one I/O thread that writes to a TCP socket. In this way, messages are sent as soon as the corresponding TCP socket is ready for writing. The main module of the AS is the Crowd AS Control module. This module contains a configurable number of threads for executing actions (action execution threads, or AE threads, in Figure 3.3). When an incoming message is received, an AE thread extracts it from the input queue (the arrow pointing to the Crowd AS Control module in Figure 3.3). If the input queue is empty, AE threads have to wait. Once an action is extracted from the input queue, it is executed from start to end by an AE thread.
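A minimal sketch of the SDB map described above is shown below, assuming that attribute values are either symbolic strings or numeric vectors; the concrete value types of the actual SDB may differ.

#include <string>
#include <unordered_map>
#include <variant>
#include <vector>

// A value in the SDB can be symbolic (e.g. "true", "object_k") or numeric
// (e.g. a position or a bounding volume), so a variant is used here.
using AttributeValue = std::variant<std::string, std::vector<float>>;

// One entry per object/agent, indexed by name; each entry stores its set of
// (attribute, value) pairs.
using SemanticDatabase =
    std::unordered_map<std::string, std::unordered_map<std::string, AttributeValue>>;

// Example usage with the kinds of attributes mentioned in the text.
void example(SemanticDatabase& sdb)
{
    sdb["object_i"]["free"] = std::string("true");
    sdb["object_i"]["on"] = std::string("object_k");
    sdb["agent_w"]["position"] = std::vector<float>{12.5f, 3.0f};
    sdb["agent_w"]["bounding volume"] = std::vector<float>{0.4f, 0.4f, 1.8f};
}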


For an action execution thread (AE thread), all messages sent to or received from CPs are exchanged asynchronously (the details are hidden by the Interface module). This means that AE threads may only have to wait when accessing shared data structures such as the semantic database. Thus, a single AE thread may be appropriate when an AS runs on a computer with a single core without hyperthreading support. However, experimental tests have shown that having more AE threads than cores (even many more) does not significantly affect execution times, because these extra AE threads simply stay blocked waiting for action requests to arrive. This multithreaded scheme allows each AS to take advantage of several cores. The number of AE threads used to perform the experiments is discussed in Section 3.1.5.

The CASC module is devoted to guaranteeing the coherence of the virtual world, as it is responsible for action checking and execution. To this end, the CASC stores in a hash table references to the objects and agents contained in the SDB. Using this table, the CASC can efficiently obtain the neighbouring entities of a given agent in order to validate an action requested by this agent. The hash function used by the CASC is shown in Equation 3.1. Using this function, the 2D positions of agents and objects in an unbounded virtual world can be mapped into a bounded hash table implemented as a 1D array. The array has T_SIZE elements and stores a 2D grid of size G_SIZE x G_SIZE (i.e. T_SIZE = G_SIZE * G_SIZE). Each element of the array represents a 2D square cell within the virtual world of size B_SIZE x B_SIZE, and contains a list of those elements with the same value of the hash function. The information of the virtual world has been stored in a hash table implemented as an array, since it offers better performance compared to other spatial data structures (e.g. hierarchical data structures) [101].

hash(x, y) = (floor(y / B_SIZE) * G_SIZE + floor(x / B_SIZE)) mod T_SIZE     (3.1)
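Equation 3.1 translates directly into code. The sketch below assumes illustrative values for G_SIZE, B_SIZE and T_SIZE, and adds a correction to keep the index non-negative, since the C++ modulo operator can return negative values for negative coordinates.

#include <cmath>

// Grid parameters (illustrative values; they depend on the simulated scene).
constexpr int   G_SIZE = 256;                 // grid side, in cells
constexpr int   T_SIZE = G_SIZE * G_SIZE;     // number of buckets in the table
constexpr float B_SIZE = 1.0f;                // cell side, in metres

// Maps a 2D position in an unbounded world to a bucket of the 1D hash table,
// following Equation 3.1.
int hashCell(float x, float y)
{
    int cellX = static_cast<int>(std::floor(x / B_SIZE));
    int cellY = static_cast<int>(std::floor(y / B_SIZE));
    int h = (cellY * G_SIZE + cellX) % T_SIZE;
    return h < 0 ? h + T_SIZE : h;            // keep the index non-negative
}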

Figure 3.5 shows an example of how the CASC validates an action request using the hash table. During a simulation, agent_w wants to change its location. This agent tries to perform a motion action and therefore a collision can occur. The agent requests the server to validate that movement by sending a message. Since the server knows the location of the agent, the CASC accesses its cell through the hash function and performs an object-object collision test with the elements within the neighbouring cells of agent_w. If no collision caused by that movement is detected, then the server updates the SDB and sends a positive acknowledgment message to the agent.

Figure 3.5: A collision example.

The CASC manages the action flow of the simulation. In order to allow maximum flexibility, it can currently process three types of actions:

Motion actions: Location changes where collisions can occur, although agents are (potentially) able to navigate without colliding. If an agent wants to move to a location currently occupied by another object/agent, the environment should simply not allow it.

Agent interactions: These correspond to a normal agent-agent communication scheme, which can be obtained from the system through the server. Messages can be managed like other agent attributes, so the SDB simply routes them into the corresponding slots. This scheme makes it possible to investigate social crowds in the future.

STRIPS actions: STRIPS is the action language used by planning agents. A STRIPS action scheme [25] can be represented through the Preconditions, Add and Delete lists associated to each agent action. Before executing an action (e.g. pick up an object), the CASC can verify its preconditions using the SDB maps (e.g. is object_k free?). When a STRIPS action is accepted, the Add and Delete lists contain the new information the SDB needs to update its state.


3.1.2. The Client Process

Each client process (CP) manages an independent group of autonomous agents (a subset of the crowd). This process has an interface for receiving and updating the information from the server, and a finite number of threads (one thread per agent). Using this interface, a client process initially connects to the server hosting the Action Server and downloads a complete copy of the SDB. From that instant, agents can think locally and in parallel with the server, so they can asynchronously send their actions to the server, which will process them (since each agent is an execution thread, it can separately access the socket connected to the server). Replies to the action requests processed by the server are submitted to all the CP interfaces, which update their SDB copies. This multi-threading approach is independent of the agent architecture (the AI formalism driving the agent behaviour). The proposed action scheme guarantees awareness for all agents [87], since all the environmental changes are checked in a central server and then broadcast to the agents. It must be noticed that time-space inconsistencies can appear between the SDB of the Action Server and the copies kept by the clients due to network latencies. Nevertheless, all these inconsistencies are fixed when each action request is validated or invalidated by the AS, which keeps the consistency.

Two different agent behaviours have been integrated within the client processes. Each kind of agent generates a different movement pattern, that is, a different system workload. The first movement pattern consists of a wandering behaviour where agents compute a random position; once this position is validated by the server, a new position is computed randomly. This pattern is the one generating the highest workload in the AS, since all the movements computed by agents must be validated by the server. The second movement pattern is a pathfinding behavioural model. Our pathfinding approach is based on cellular automata theory [43]. The basic procedure in cellular-automata-based motion is to compute a set of rules associated to each cell to determine the next cell until the goal is reached. In our system, a CA is included as a part of the SDB, and each cell has precomputed the k best paths of length l to reach any goal cell. A variation of the A* algorithm is used for calculating all the paths (k paths per cell, where each square cell has a 1 m side). In this way, navigation paths can be precomputed initially in an empty environment. The algorithm starts from each goal cell, and by inundation the k best paths that arrive at each cell are selected and stored. Parameters such as k and the cell size allow the memory required for managing large environments to be reduced, avoiding memory problems. Furthermore, the calculation of complete paths towards a goal is not useful, since agents can only evaluate the first l steps before deciding their next cell.

The algorithm used to pre-compute all the possible paths to reach the exit cells for a specific environment is shown in Algorithm 1. In this algorithm, the structure Lab is a FIFO-style list of cells that is filled with those cells whose paths towards a goal should be calculated. Initially, the list Lab is empty and the goal cell is inserted. Later, the while loop iterates until the paths for all the cells have been calculated (i.e. when there are no more cells in the Lab list). Inside the while loop, the first cell in the list Lab is extracted (cell_i). Then, the neighbouring cells of cell_i are obtained. Since the paths are calculated by inundation starting from the goal, for each neighbouring cell that is not an obstacle it is checked whether it was previously visited or not (condition d_next > d_through_next). If the neighbouring cell next_cell_i was not previously visited, then the distance from next_cell_i towards the goal passing through cell_i is stored. This distance is stored in order to later compute the best k paths that arrive at cell_i in the direction of the goal. The best k paths are stored in the structure MoP. As a result of Algorithm 1, MoP contains a map for each cell with the best k paths to reach any goal cell. These paths will be used by agents to navigate through the environment avoiding obstacles. However, during the simulation these paths can be either empty (not used by any agent) or busy. Therefore, each agent should evaluate all the paths to obtain the best next cell for any situation (state of the simulation). Since the purpose of the application is the evacuation of the virtual world, it is necessary to balance the distance of each path to any goal cell against the waiting time associated to it. This waiting time is updated during the simulation and is computed as the total sum of cycles that agents spend waiting in a path, together with the path congestion percentage (the percentage of occupied cells in a path). In order to perform this balancing, we have defined an evaluation function H (Equation 3.2) that measures the quality of each path p_i.

H(p_i) = α * Dist(p_i, goal) + (1 − α) * (Wait(p_i) + Cong(p_i))     (3.2)
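A small sketch of how an agent could rank its k candidate paths with Equation 3.2 is shown below, assuming that the distance, accumulated waiting cycles and congestion percentage of each path are already available, and that a lower value of H denotes a better path.

#include <cstddef>
#include <vector>

struct PathInfo {
    float distToGoal;   // Dist(p_i, goal)
    float waitCycles;   // Wait(p_i): cycles agents have spent waiting on the path
    float congestion;   // Cong(p_i): percentage of occupied cells in the path
};

// Equation 3.2: balances path length against waiting time and congestion.
float evaluatePath(const PathInfo& p, float alpha)
{
    return alpha * p.distToGoal + (1.0f - alpha) * (p.waitCycles + p.congestion);
}

// Returns the index of the best of the k candidate paths for the current
// simulation state.
std::size_t selectPath(const std::vector<PathInfo>& candidates, float alpha)
{
    std::size_t best = 0;
    for (std::size_t i = 1; i < candidates.size(); ++i)
        if (evaluatePath(candidates[i], alpha) < evaluatePath(candidates[best], alpha))
            best = i;
    return best;
}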


Algorithm 1 Algorithm to precompute the paths contained in the CA.

VAR
  Lab   /* Auxiliary list of cells managed as a FIFO structure */
  MoP   /* Map of paths (MoP) for each cell */
  goal  /* exit cell */
begin
  /* Initialization */
  Lab.push(goal)
  // Stores the next cell with its distance to a specific goal
  storeNext(goal, goal, NULL, 0.0)
  while Lab != NULL do  // Lab is not empty
    cell_i = Lab.pop()
    // Computes the distance from all the neighbouring cells to the goal
    for next_cell_i in neighbours(cell_i) do
      if next_cell_i is not an obstacle then
        d_next = distance(next_cell_i, goal)
        d_through_next = distance(cell_i, goal) + distance(next_cell_i, cell_i)
        if d_next > d_through_next then
          storeNext(goal, cell_i, next_cell_i, d_through_next)
          Lab.push(next_cell_i)
          // Creates the best path going backward from the cell to a goal
          path_i = buildPathfromCell(goal, cell_i, next_cell_i, longPath)
          // The computed path is stored in the map of paths
          MoP <- path_i
end

The internal structure of a parallel action server using a GPU is shown in Figure 3.22. In this implementation, the SDB module contains the object positions array, the collision response array and the GPU manager thread. The object positions array is the host-GPU input interface; it contains the agent positions needed for checking collisions. The collision response array represents the host-GPU output interface, and it contains the results of the collision tests performed by the GPU. When the number of pending requests exceeds N_OPERATIONS, the AE threads signal the GPU manager thread, which then transfers the object positions array to the GPU and launches the collision tests. When the collision tests finish, the GPU manager thread copies the results from GPU memory into the SDB module. Then, it signals the AE threads to start replying to the CPs or ASs.

Figure 3.22: Internal structure of an action server using a GPU

The object positions array has N_OBJECTS elements, that is, as many elements as agents managed by a server. An agent identifier is associated to each request in order to update the proper array position. Each element in the array has four floats for each object passed to the GPU; having a power-of-two number of floats per element permits an efficient use of GPU memory [64]. The first and second floats contain the x and y coordinates of the agent's position, respectively. The third float does not contain any data and is used as padding. The fourth float is a flag indicating to the GPU whether the element in the array has been updated by the AE threads or not. If an object position has been updated (i.e. the flag is equal to a positive number representing the agent identifier), then the collision test is performed. Otherwise (i.e. the flag is equal to -1.0), the collision test is skipped. This flag is initialized by the AE threads when the x and y coordinates are updated, and it is cleared when a collision test finishes.

The collision response array has, like the previous array, N_OBJECTS elements. Each element of the array has two floats. The first one indicates whether a collision occurred (in this case it is equal to 1.0) or not (in this case it is equal to 0.0). The second float contains the agent identification number and indicates which agent is associated to the collision result. The agent identification number is obtained by the GPU from the fourth float contained in each element of the object positions array. The collision response array is accessed by the AE threads to send the collision responses corresponding to each agent collision request. Although this array has as many elements as agents, the AE threads will only collect N_OPERATIONS instead of N_OBJECTS collision results. In order to efficiently read the collision response array, random access is needed. The object-action array (which provides this random access) contains the elements that should be accessed for reading each collision result.
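A sketch of the host-side element layouts described above is shown below; the struct and field names are illustrative, but the sizes follow the four-float and two-float layouts of the object positions and collision response arrays.

// Host-side mirror of one element of the object positions array. The layout
// matches a 16-byte float4, so the array can be copied to the GPU directly.
struct ObjectPositionEntry {
    float x;        // agent x coordinate
    float y;        // agent y coordinate
    float pad;      // unused, keeps the element at a power-of-two number of floats
    float flag;     // agent identifier if the position was updated, -1.0f otherwise
};

// Host-side mirror of one element of the collision response array.
struct CollisionResponseEntry {
    float collided; // 1.0f if a collision was detected, 0.0f otherwise
    float agentId;  // identifier of the agent this result belongs to
};

static_assert(sizeof(ObjectPositionEntry) == 4 * sizeof(float),
              "each element must stay at four floats for efficient GPU access");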

Collisions on the GPU are checked using a spatial-hashing-based method. In this way, a grid is used to perform the collision test. The dimensions of this grid, the grid cell size and the grid origin coordinates are input parameters that depend on the simulated scene. Figure 3.23 shows an example of the whole computation, including the data structures involved as both input and output of each step. The upper part of this figure shows a snapshot of a 2D grid, composed of sixteen cells containing six agents at given locations. In the lower part, this figure shows the data structures with the values corresponding to that snapshot for each step of the algorithm. The computation consists of five main steps, each one implemented by one CUDA kernel:

1. The SpatialHash kernel updates the collision grid, performing the spatial hashing by means of the agent positions contained in ObjectPositionsArray. As a result of this step, the array ObjectsHash contains the cell in which each agent is located. This is represented as a pair (cell identifier, agent identifier) (see Figure 3.23).

2. A radix sort [35] is performed to order the ObjectsHash array by cell identifier (lowest cell identifier first). This sorting is needed to reorder ObjectPositionsArray in the next step, allowing an efficient access to global memory during the collision check step.

3. In the reorderData kernel, the coordinates of the agents in ObjectPositionsArray are sorted based on the order established in the previous step. In this way, agent positions are sorted by cell identifier and returned in the array sortedPositions. For instance, Figure 3.23 shows that the position of agent 4 is returned in the first position of the array sortedPositions, since agent 4 occupies the first position in the ObjectsHash array.

4. The beginning of each cell in the ObjectsHash array is determined, storing this information in the cellStart array. For instance, if position i in the cellStart array contains the value j, it means that the first agent of grid cell i appears in position j in the ObjectsHash array. In this way, the cellStart array allows quick access to the agents in neighbouring cells.

CHAPTER 3. DISTRIBUTED SYSTEM FOR CROWD SIMULATION

Figure 3.23: Baseline algorithm for GPU collision checking

103

104

3.3. IMPROVING THE PERFORMANCE OF THE ACTION SERVER THROUGH GPUS

j in the ObjectsHash array. In this way, the cellStart array allows a quick access to the agents in neighbouring cells. 5 Finally, the collision test is performed. For each agent position in sortedPositions, the corresponding grid cell is read from the ObjectsHash array. Once the cell is obtained, the collision check with agents in the same cell and neighbouring cells is performed. In order to keep the collision check execution time under a threshold to ensure a good response time, the CPU server process can control the number of agents that are considered during the GPU collision test, by means an ’update flag’. The collision result for elements in sortedPositions array with ’update flag’ different to -1.0 is returned in the collisionResponse array as a collision flag (1 if collision exists, 0 otherwise). According to this, Figure 3.23 shows that collision of agent 0 with agent 3 is detected, but collision test for agent 3 is skipped as it is indicated by the ’update flag’ of the latter agent.
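The following is a sketch of the first kernel, as referenced in the list above. It is not the thesis implementation: the parameter names, the cell-indexing formula and the encoding of the (cell identifier, agent identifier) pair as a uint2 are assumptions that simply follow the description of a regular grid defined by its origin, cell size and dimensions.

    // Step 1: compute, for each object, the grid cell that contains it and emit the
    // (cell identifier, agent identifier) pair consumed by the radix sort of step 2.
    __global__ void SpatialHash(const float4* objectPositions,  // x, y, padding, update flag
                                uint2* objectsHash,             // output: (cell id, agent id)
                                float2 gridOrigin, float cellSize,
                                int gridWidth, int gridHeight, int nObjects)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= nObjects) return;

        float4 p = objectPositions[i];
        // Clamp to the grid so that out-of-range positions still map to a valid cell.
        int cx = min(max((int)((p.x - gridOrigin.x) / cellSize), 0), gridWidth - 1);
        int cy = min(max((int)((p.y - gridOrigin.y) / cellSize), 0), gridHeight - 1);

        uint2 h;
        h.x = (unsigned int)(cy * gridWidth + cx);  // cell identifier
        h.y = (unsigned int)i;                      // agent identifier
        objectsHash[i] = h;
    }

The update flag carried in p.w is not needed in this step; in this sketch it would be read later by the collision kernel to decide whether the element requires a collision test.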

This algorithm is the basic translation of a GPU-based method for collision test [65, 22]. For that reason, we have denoted it as Baseline algorithm. Section 3.4.2 describes some improvements made to this basic algorithm.

3.3.3. Evaluation methodology

In order to evaluate the performance of the GPU-based Action Server described in this section, several experiments have been performed on a real system with an Action Server implemented on a GPU. Experiments have been performed by running the system for a few minutes. However, since the performance remains stable after one minute, results are reported for one minute of execution. Every 2.5 seconds, statistics are computed on the Action Server and the Client Process, resulting in 24 different samples. The performance values of the system have been computed as the average of these 24 samples. For the server, each sample contains the average percentage of CPU utilization and the time required for performing the collision check. For the client, each sample contains the average percentage of CPU utilization, the total number of server replies and the number of ACKs (positive replies) received. The parameters of the system with the GPU-based Action Server that can affect the performance in terms of latency and throughput have been identified. The parameters of the


system whose impact has been analyzed are the following:

• Population size: This parameter refers to the number of agents simulated. The response time provided by the system for simulations with different numbers of agents has been studied. In this way, the increase in server throughput obtained for a GPU-based Action Server with respect to a CPU-based Action Server can be measured. As in previous sections, the throughput is considered as the maximum number of agents that the system can support while providing a response time below the threshold value of 250 ms.

• Underlying hardware: This parameter indicates the hardware used for performing the collision check. In this way, the performance of the system has been evaluated when collisions are checked on the CPU and on the GPU. As a result, the benefit obtained when using the GPU is measured.

• Collisions checked by the GPU: This parameter is represented in Figure 3.22 by the value N_OPERATIONS. It indicates the number of agent requests checked by the GPU each time the collision test is launched. This parameter can affect the performance, since the execution time of the collision test increases as more requests are checked at a time.

• GPU threads per block: This parameter indicates the number of threads launched per GPU block. This number of threads must be an integer multiple of the warp size in order to obtain good performance [64]. This parameter can affect the server response time and requires a tuning process.

Regarding the density of agents, since the impact of density on the Action Server performance was analyzed in Section 3.1, the worst case has been chosen to evaluate the performance of the Parallel Action Server. Therefore, all the experiments have been carried out using low density environments, where the percentage of ACKs is around 95 %. The movement pattern used to evaluate the performance of the Parallel Action Server is the wandering pattern, because all the agents' actions must be verified by the ASs. In addition, the behavioural update period is fixed to 250 ms. for all experiments. The reason for this agent cycle, mentioned in the previous section, is that other crowd systems have used similar values (between 170 and 500 ms.) for performing interactive simulations [95, 78]. Also, an agent cycle of 250 ms. provides realistic effects for users [39].


3.3.4. Performance evaluation

This section shows the performance evaluation of the GPU-based server described in the previous section. Different measurements have been performed on a real system using a GPU-based server. For comparison purposes, the same measurements have also been performed on the same real system but using a CPU-based server. Experiments have been completed using a computer platform with one server and six clients. The server was based on an Intel Core Duo at 2.0 GHz, with 4 GB of RAM, running the Linux 2.6.18.2-34 operating system, and it incorporated an NVIDIA Tesla C870 GPU. Each client computer was based on an AMD Opteron (2 x 1.56 GHz processors) with 3.84 GB of RAM, running the Linux 2.6.18-92 operating system. The interconnection network was a Gigabit Ethernet network. Using this platform, up to nine thousand agents have been simulated.
First, the parameter Collisions checked by the GPU should be tuned. The value N_OPERATIONS has been established experimentally by measuring the response time provided by the system. Concretely, for each simulation the parameter has been determined by initially assigning it the value N_OBJECTS (i.e. the number of agents executed in one simulation). Starting from this value, it has been decreased, obtaining the best response time of the system for 150 requests checked at a time by the GPU. For this reason, the results shown in this section have been obtained for this value.
Another parameter having an impact on the performance is the number of GPU threads per block. This value has been determined experimentally by selecting the number of GPU threads per block that provides the lowest response time to the agents. A value of 256 threads per block has provided the best result for populations ranging from 3,000 to 9,000 agents. The results reported in this section have been obtained for this value of GPU threads.
Figure 3.24 shows the evaluation results for simulations with different numbers of agents. Concretely, this figure shows the server performance, in terms of percentage of CPU utilization, for different population sizes. Each point in this figure has been computed as the average value of thirty different simulations. The X-axis shows the number of agents in the system for the different simulations. The Y-axis shows the percentage of CPU utilization for both the CPU-based and the GPU-based servers. Figure 3.24 shows that the percentage of CPU utilization increases linearly with the number of agents in the system for the CPU-based server, as could be expected. When collisions are checked by the GPU (see the GPU plot), the CPU utilization significantly decreases with respect to


the case of checking collisions on the CPU. Despite the CPU alleviation obtained, both the CPU and GPU plots show a similar slope, due to the fact that when the number of simulated agents is increased, more computing power is required by the interface threads.

Figure 3.24: Average percentage of CPU utilization for collision tests.

Figure 3.25 shows the aggregated computing time for the operations involved in collision tests during the whole simulation. The X-axis shows the number of agents in the system for the different simulations. The Y-axis shows the aggregated computing time (in ms.) devoted to computing the collision tests required by the simulations. This figure shows that the plot for the CPU-based server has a parabolic shape, while the plot for the GPU-based server has a flat slope. These results show that the use of the replicated hardware in the GPU has a significant effect on the time required by the server to compute the collision tests. The population size (number of agents) considered in these simulations generates a number of collision tests that does not exceed the computation bandwidth available in the GPU. As a result, the computing time required for different population sizes is very similar. An additional benefit is derived from the fact that the GPU is exclusively devoted to computing collision tests, releasing the CPU from that task. This is the reason for the lower values shown by the GPU plot, even for the smallest population sizes.
Although Figures 3.24 and 3.25 show a significant improvement in the GPU-based server performance, the effects of such improvements should be measured on the response time of the system. Thus, Figure 3.26 shows the average response time provided to the agents hosted by the client processes. In order to show the results for the worst case, values for


Figure 3.25: Aggregated computing time for collision tests.

the client with the highest average response time are shown (it must be noticed that there is a single server and six clients). Figure 3.26 shows that the average response times increase linearly with the number of agents for both plots up to 7000 agents. From that point on, a similar behaviour is shown when using the GPU. However, when no GPU is used, the plot shows a steeper slope. This behaviour is related to the behaviour shown in Figure 3.24. Since the CPU utilization is significantly higher in the CPU-based server for 8000 and 9000 agents (near a saturation point), the response times provided are also significantly higher with respect to the GPU-based server. Furthermore, the CPU alleviation obtained when using the GPU permits the interface threads to process more requests, reducing the response time provided. Additionally, Figure 3.26 shows that the CPU-based crowd simulation can support around 5000 agents while providing average response times below the threshold of 250 milliseconds. When the GPU is used, the number of agents supported grows up to 7500 agents, providing an improvement of 50 %. Taking into account that this improvement can be achieved per each server in the system, these results show that the GPU-based server can actually have a significant impact on the performance of large-scale crowd simulations. Additionally, it must be noticed that the performance achieved when using GPUs depends on the number of cores in the CPU, because the interface threads are responsible for replying to the agents' requests. Since dual-core processors have been used for evaluation purposes (in order to measure the worst-case performance), the improvements shown in this section can be increased when using platforms with a higher number of processor cores.


Figure 3.26: Average response times provided to agents.

3.4. ACCELERATING THE COLLISION CHECK PROCEDURE ON MULTI-CORE AND MANY-CORE PROCESSORS

This section presents different improvements carried out in both the CPU-based collision check procedure described in Section 3.1.1 and the GPU-based procedure described in Section 3.3.

3.4.1. Accelerating the collision check procedure on multi-cores

As multi-core processors become mainstream, multi-threaded applications will become more common, increasing the need for efficient programming models. Efficiently programming a multi-core processor is relatively easy when the program input is a static structure, without data dependencies, that can be partitioned among the execution threads. The OpenMP API, for example, permits the parallelization of a program by means of directives, partitioning the workload among different execution threads.
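As a minimal illustration of this directive-based model (not taken from the thesis code; the type and function names below are hypothetical), a single OpenMP pragma suffices to split a per-agent loop among threads when the input is a static array:

    #include <omp.h>
    #include <vector>

    struct Position { float x, y; };

    // Placeholder collision test over a static, read-only input.
    static int check_collision(const Position& p) { return (p.x < 0.0f || p.y < 0.0f) ? 1 : 0; }

    // The pragma partitions the loop iterations among the available threads.
    void check_all_collisions(const std::vector<Position>& agents, std::vector<int>& flags) {
        #pragma omp parallel for
        for (long i = 0; i < (long)agents.size(); ++i)
            flags[i] = check_collision(agents[i]);
    }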


Figure 3.27: Read/reclaim race among threads that access a shared dynamic data structure.

However, problems arise if dynamic data structures are not safely managed in a multithreaded program. Furthermore, to obtain a proper speed-up with the number of cores, an efficient coordination of the concurrent accesses of threads to shared data structures is needed. Traditional locking requires expensive atomic operations, such as compare-and-swap (CAS), even when locks are uncontended. Locking is also susceptible to priority inversion, convoying, deadlock, and blocking due to thread failure. Therefore, many researchers recommend avoiding locking-based synchronization. Some proposals use non-blocking (or lock-free) synchronization in multithreaded applications for avoiding the use of locking, obtaining good results [91, 57]. Other works have studied the impact of replacing a locking-based synchronization with Software Transactional Memory (STM) in a multi-player game server [105, 27]. Nevertheless, these studies reveal that, regardless of the granularity of the memory transactions used in STM, the performance obtained is even worse than that of a locking-based implementation.
A major challenge for lockless synchronization is handling the read/reclaim races that arise in dynamic data structures. Figure 3.27 illustrates the problem: thread T1 removes node N from a list while thread T2 is referencing it. N's memory must be reclaimed to allow reuse, otherwise memory exhaustion could block all threads. However, such reuse is unsafe while T2 continues referencing N. For languages like C, where memory must be explicitly reclaimed (e.g. via free()), programmers must combine a memory reclamation scheme with their lockless data structures to resolve these races.
This section describes two CPU implementations of the collision check problem using


the grid data structure based on hashing described in Section 3.1.1. The implementation described in Section 3.1.1 is based on Mutex, the thread synchronization method provided by the POSIX threads API. This locking method is used to protect the dynamic data structure shared among the threads during the collision checking. As could be expected, the use of Mutex severely limits the scalability with the number of cores, since it is a locking synchronization method. For that reason, another implementation based on a lock-free data structure is proposed. The Read-Copy Update (RCU) method [57] is used to protect this lock-free data structure from the race conditions that are present in a multithreaded program. RCU is a concurrently-readable synchronization method that makes it possible to significantly improve the scalability of the CPU collision checking with the number of cores and threads.

Mutex-based implementation of the Collision check procedure

The Mutex version uses a grid-based data structure to perform the collision checking; this data structure is denoted as the collision grid. Figure 3.28 illustrates the implementation details of the Mutex-based version. The top part of the figure shows the geometric space partitioned using a grid with 16 grid cells. Four agents, represented as numbered circles, are allocated within the grid at given positions. The bottom part of Figure 3.28 shows that the collision grid is implemented as a linear array. Each element of this array contains a mutex and a pointer to a dynamic data structure, implemented as a linked list, that contains the agents' positions. The mutex avoids the corruption of the dynamic data structure by protecting it when reader and writer threads access it concurrently during the collision check. A thread performs the collision check for an agent by first calculating the mapping of the agent's position into the collision grid by means of the hashing method. The mutex located at the position returned by the hashing method is locked, and the dynamic data structure is queried in the case of a read access, or updated in the case of a write access.
The described implementation, using an array of mutexes instead of one mutex protecting the whole collision grid, provides a certain level of concurrency among threads. However, the use of lock-free dynamic data structures, along with an appropriate reclamation scheme, can significantly improve the parallelism on a multi-core processor.


Figure 3.28: Diagram of the CPU collision checking using Mutex.
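A condensed sketch of this organization is shown below. It is not the thesis source code: the node and cell structures and the two access functions are assumptions that mirror the description (a linear array of cells, each holding a mutex and the head of a linked list of agent positions).

    #include <pthread.h>
    #include <stdlib.h>

    // Node of the per-cell linked list of agent positions.
    struct AgentNode {
        int agentId;
        float x, y;
        struct AgentNode* next;
    };

    // One element of the linear array implementing the collision grid.
    struct GridCell {
        pthread_mutex_t lock;      // protects the dynamic data structure below
        struct AgentNode* agents;  // linked list shared by reader and writer threads
    };

    // Write access: insert an agent position into the cell returned by the hashing method.
    void cell_insert(struct GridCell* cell, int agentId, float x, float y) {
        struct AgentNode* n = (struct AgentNode*)malloc(sizeof *n);
        n->agentId = agentId; n->x = x; n->y = y;
        pthread_mutex_lock(&cell->lock);
        n->next = cell->agents;
        cell->agents = n;
        pthread_mutex_unlock(&cell->lock);
    }

    // Read access: query the cell during the collision check (distance test omitted).
    int cell_has_other_agent(struct GridCell* cell, int agentId) {
        int found = 0;
        pthread_mutex_lock(&cell->lock);
        for (struct AgentNode* n = cell->agents; n != NULL; n = n->next)
            if (n->agentId != agentId) { found = 1; break; }
        pthread_mutex_unlock(&cell->lock);
        return found;
    }

Because there is one mutex per cell rather than a single global lock, threads working on different cells do not contend with each other, which is the limited concurrency mentioned above.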

RCU-based implementation of the Collision check procedure

RCU is a synchronization mechanism that was added to the Linux kernel during the development of version 2.5. In 2009, it was released for user-space access [16]. However, the benefits provided by RCU had not been checked in complex problems with large data structures. The idea behind RCU is to split updates into removal and reclamation phases. The removal phase removes references to data items within a data structure (possibly by replacing them with references to new versions of these data items), and can run concurrently with readers. The reason why it is safe to run the removal phase concurrently with readers is that the semantics of modern CPUs guarantee that readers will see either the old or the new version of the data structure, rather than a partially updated reference. The reclamation phase does the work of reclaiming (e.g., freeing) the data items removed from the data structure during the removal phase. Since reclaiming data items can disrupt any readers concurrently referencing those data items, the reclamation phase must not start until readers no longer hold references to those data items. Splitting the update into removal and reclamation phases permits the updater to perform


Figure 3.29: Illustration of QSBR. Black boxes represent quiescent states.

the removal phase immediately, and to defer the reclamation phase until all the readers active during the removal phase have completed, either by blocking until they finish or by registering a callback that is invoked after they finish. Only readers that are active during the removal phase need to be considered, because any reader starting after the removal phase will be unable to gain a reference to the removed data items, and therefore cannot be disrupted by the reclamation phase.
Different reclamation schemes can be used to implement RCU. We have implemented an RCU version using the Quiescent State Based Reclamation (QSBR) scheme, since it provides concurrent reads with the lowest overhead, but at the cost that the application has to be modified in order to explicitly manage reclamations [36, 16]. QSBR uses the concept of a grace period. A grace period is a time interval [a, b] such that, after time b, all nodes removed before time a may safely be reclaimed. QSBR uses quiescent states to detect grace periods. A quiescent state for a thread T is a state in which T holds no references to shared nodes. Hence, a grace period for QSBR is any interval of time during which all threads pass through at least one quiescent state. Figure 3.29 illustrates the relationship between quiescent states and grace periods in QSBR. Thread T1 goes through quiescent states at times t1 and t5, T2 at times t2 and t4, and T3 at time t3. Hence, a grace period is any time interval containing either [t1, t3] or [t3, t5]. Figure 3.30 shows the management of a linked list in a multi-threaded environment


by using RCU API calls [57]. The figure shows the changes undergone by a linked list containing three elements (A, B and C) while an updater thread deletes element B. This element is deleted using the RCU API call list_del_rcu(). This function removes a list element while allowing concurrent readers to continue seeing the removed element. Looking at Figure 3.30, it can be seen that after executing list_del_rcu(), element B has been removed from the list. Since readers do not synchronize directly with updaters, readers might be concurrently scanning this list. These concurrent readers might or might not see the newly removed element, depending on timing. Moreover, readers that were delayed (e.g., due to interrupts) just after fetching a pointer to the newly removed element might see the old version of the list for quite some time after the removal. Therefore, there are two versions of the list, one with element B and one without it. The fill colour of element B is still white during the grace period, indicating that readers might be referencing it; for that reason, the freeing of element B is postponed. Readers are not permitted to maintain references to element B after exiting from their RCU read-side critical sections. Therefore, when all the readers have exited their critical sections, no more readers can be referencing element B, as indicated by its grey fill colour and dashed frame in the Updater column in Figure 3.30. When no more readers hold references to element B, the synchronize_rcu() function completes. At this point, the list is back to a single version and element B may safely be freed.
The CPU implementation based on mutexes described in Figure 3.28 has been adapted in order to support lock-free data structures along with the RCU synchronization method. In this way, the linear array representing the collision grid is modified, removing the mutex from each element of the collision grid. As a consequence, lock-free linked lists containing the agents' positions are obtained. Read and update accesses to the lockless linked lists are performed by the threads through a user-space RCU API [16]. The QSBR version of RCU implemented by this API permits the definition of quiescent states in order to safely update the linked lists contained in the collision grid.
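The sketch below outlines how a reader and an updater would access one lock-free cell list under QSBR. It is illustrative only and assumes the user-space RCU library of [16] (liburcu) in its QSBR flavour; the header and symbol names (urcu-qsbr.h, cds_list_del_rcu, etc.) are taken from that library's documented API and may differ between versions.

    #include <urcu-qsbr.h>      /* rcu_read_lock, synchronize_rcu, rcu_quiescent_state */
    #include <urcu/rculist.h>   /* cds_list_* RCU linked-list primitives */
    #include <stdlib.h>

    struct AgentNode {
        int agentId;
        float x, y;
        struct cds_list_head list;   /* linkage into the per-cell list */
    };

    /* Reader: scan one cell list without taking any lock. */
    int cell_contains(struct cds_list_head* cell, int agentId) {
        struct AgentNode* n;
        int found = 0;
        rcu_read_lock();
        cds_list_for_each_entry_rcu(n, cell, list) {
            if (n->agentId == agentId) { found = 1; break; }
        }
        rcu_read_unlock();
        return found;
    }

    /* Updater: removal phase, grace period, then reclamation phase. */
    void cell_remove(struct AgentNode* n) {
        cds_list_del_rcu(&n->list);  /* readers may still see n for a while */
        synchronize_rcu();           /* wait until all pre-existing readers are done */
        free(n);                     /* now safe to reclaim */
    }

    /* Each worker thread must register itself (rcu_register_thread()) and announce
       a quiescent state once per agent cycle with rcu_quiescent_state(), i.e. at a
       point where it holds no references to shared nodes. */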

3.4.2. Accelerating the collision check procedure on many-cores

This section describes a new algorithm for performing the GPU-based collision check procedure. Also, an improved version of the Baseline algorithm is proposed for comparison purposes with the new algorithm. Since possible optimizations, such as exploiting the GPU memory hierarchy, were not explored in the GPU Baseline algorithm, different optimizations are proposed, obtaining an improved version of the Baseline algorithm.


Figure 3.30: Deletion of one element in an RCU linked-list.

Improved Baseline Algorithm

The Baseline algorithm is composed of five kernels. In order to improve the Baseline algorithm, the first step is to determine which kernels are the most time consuming. The percentage of the global execution time consumed by each kernel has been measured when checking the movements of one million agents. These measurements are shown in Figure 3.31. This figure shows that the most time consuming kernel is the one performing the collision check, consuming 63 % of the total time. The main reason for this time consumption is that this kernel does not take advantage of the GPU memory hierarchy in the Baseline version, accessing only global memory.
During the collision check kernel (the fifth kernel in the Baseline algorithm), each agent checks its neighbourhood. This data locality can be exploited by using the on-chip GPU memories. Concretely, the input arrays of the kernel performing the collision check can be bound to the texture memory.


Figure 3.31: Percentage of execution time required by the kernels for the baseline version.

Hence, neighbour cells are cached and can be fetched from the texture memory instead of the device memory, increasing the memory bandwidth. This first improvement of the Baseline algorithm is denoted as the texture memory optimization.
On the other hand, data locality can also be exploited by using the shared memory along with a tiling technique [102]. Tiles are defined within the collision grid in such a way that collisions can be independently checked by each GPU block, avoiding inter-block synchronization. Collision grid cells are ordered in global memory based on the tile organization. In this way, all the threads in a GPU block collaborate in loading the assigned tile from global memory into shared memory, obtaining a coalesced access and reducing the number of accesses to device memory.
In order to illustrate this improvement, Figure 3.32 shows the memory access pattern of the baseline algorithm, while Figure 3.33 shows the memory access pattern of the improved baseline algorithm. Both figures show a collision grid with sixteen cells. Figure 3.32 shows how a given tile consisting of 3x3 cells (from cell 5 to cell 15, except cells 8 and 12) is stored in global memory. It can be seen that the neighbouring cells are stored in non-adjacent memory segments (cells 8 and 12 are interleaved within the tile segments), preventing coalesced accesses to global memory.
Figure 3.33 shows the global memory layout for the improved version. A tile in the improved algorithm consists of a 3x3 submatrix of cells, as in the case of the baseline algorithm. This 3x3 submatrix is composed of a 2x2 submatrix of cells and its neighbour cells. For example, the tile highlighted in Figure 3.33 with a blue circle consists of cells 12, 13, 14 and 15 forming the 2x2 submatrix that, along with cells 3, 6, 7, 9 and 11, forms the 3x3 tile. For that reason, cell numbers (i.e. the big numbers in the middle of each cell) are


Figure 3.32: Grid mapping to global memory in the baseline version

assigned in the improved version so as to keep the cells contained in the 2x2 submatrix linearly ordered. In addition, the improved algorithm replicates those cells that are in the border of the 2x2 submatrix of a tile. Figure 3.33 shows this replication scheme. In this figure, the numbers in the middle of each cell denote the cell number in the collision grid, while the small numbers in the corners of each cell denote the replicas of that cell in each tile. For example, cell number 3 is replicated as cell 4 in the first tile, cell 12 in the second tile, cell 19 in the third tile, and cell 27 in the fourth tile. The advantage of this data replication is that all the cells belonging to a given tile are linearly ordered in the same global memory segment. Therefore, all the threads in a warp (half-warp) can linearly access the same global memory segment and load the data into shared memory, obtaining a coalesced access. The lower part of Figure 3.33 shows how the cells of the first tile (black numbers in the corner of each cell) are stored in the same global memory segment. The same occurs for the second tile (numbers in red), for the third tile (numbers in green) and for the fourth tile (numbers in blue). This improved organization, along with the use of shared memory, is denoted as the shared memory optimization. It should be noticed that GPU memory consumption is not a hard limitation, since crowd simulations are performed in a distributed fashion. In this way, the replication of border cells cannot exhaust the GPU memory, since the simulated world can be distributed across several GPUs.


Figure 3.33: Grid mapping to global memory in the improved version
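The cooperative load of one tile into shared memory can be sketched as follows. This is a simplified illustration of the technique, not the thesis kernel: the tile dimensions (16x16, matching the 256-thread blocks reported later), the one-slot-per-cell layout and the identifier names are assumptions; the handling of several agents per cell and of the replicated border cells is omitted.

    #define TILE_DIM   16                      // assumed tile of 16x16 cells
    #define TILE_CELLS (TILE_DIM * TILE_DIM)

    __global__ void collisionCheckTiled(const float4* tiledCells,   // cells stored tile by tile
                                        float2* collisionResponse)  // (collision flag, agent id)
    {
        // All the cells of the tile assigned to this block are contiguous in global
        // memory, so consecutive threads read consecutive addresses (coalesced access).
        __shared__ float4 tile[TILE_CELLS];

        const float4* tileBase = tiledCells + (size_t)blockIdx.x * TILE_CELLS;
        for (int i = threadIdx.x; i < TILE_CELLS; i += blockDim.x)
            tile[i] = tileBase[i];
        __syncthreads();

        // From here on, each thread checks the agent of its own cell against the
        // neighbouring cells reading only from shared memory (distance test omitted).
        (void)collisionResponse;
    }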

Figure 3.31 shows that the collision check is the most time consuming kernel in the baseline algorithm. However, this figure also shows that the kernel performing the radix sort requires 27 % of the total execution time, being the second most time consuming kernel. For that reason, the radix sort procedure used in the baseline algorithm has been replaced by the fastest published version of this sorting algorithm [82].
Finally, although the execution times for the rest of the kernels are less significant than the previous ones, some optimizations can be performed on them. The kernels corresponding to the third and fourth steps of the Baseline algorithm can be merged into a single one, as there are no global synchronization requirements between them. Therefore, the cost of that synchronization can be saved. Furthermore, the shared memory can be used by the fourth kernel, taking advantage of the data locality and improving the global memory bandwidth.
In order to show the improvements achieved by the optimized version of the Baseline algorithm, Figure 3.34 shows the impact of the optimizations in terms of percentages of the execution time (100 % being the total execution time of the Baseline algorithm, shown in the left bar). This bar shows that the effect of the optimizations represents a reduction of 70 % in the global execution time with respect to the Baseline version. The right bar in Figure 3.34 zooms in on the results obtained for the improved version. In this version, the most time consuming kernel is the radixSort, with 54 % of the global execution time of the optimized version. For this reason, a new algorithm that is not based on sorting is proposed to perform the collision check.


Figure 3.34: Percentage of execution times required by the kernels in the baseline optimized version.

New GPU-based algorithm for collision check

This subsection describes a new algorithm for performing the GPU collision check. This algorithm avoids the sorting step in the collision check procedure. In order to achieve this goal, a static grid is used. Nevertheless, if many agents fall within the same grid cell and they try to write into the same memory address, atomic operations are needed. In order to avoid the performance penalty caused by atomic operations, a different approach is proposed in which the size of each grid cell is fixed in order to guarantee the consistency of the simulation. Concretely, the consistency is guaranteed if

\sqrt{L^{2} + L^{2}} = D = 2R \qquad (3.3)

where L is the length of the side of a grid cell, D is the diagonal of a grid cell and R is the radius of the agents. When the distance between two agents is less than or equal to twice the agent radius (2R), a collision occurs. For that reason, the condition in Equation 3.3 establishes that all the agents falling in the same cell will collide, since the maximum distance within a cell is the diagonal of the cell (i.e. D = 2R). In this way, the condition in Equation 3.3 implicitly performs the collision detection for agents trying to move to the same cell. In that situation, the consistency can be guaranteed by allowing the movement of one agent and forbidding the rest of the movements. It must be noticed that the selection of the agent that performs the movement can be done in a non-deterministic fashion, since agent-based simulations evolve in this way.
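Solving Equation 3.3 for the cell side gives the concrete cell size (a direct algebraic consequence of the equation, stated here for clarity):

\sqrt{2}\,L = 2R \;\Longrightarrow\; L = \sqrt{2}\,R \approx 1.41\,R

so the side of each grid cell is fixed to roughly 1.41 times the agent radius.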


As a result of using the condition in Equation 3.3 to define the cell size, more neighbour cells have to be queried during the collision check. Since the side of a cell can be shorter than 2R, not only the closest neighbour cells must be queried but also those cells that are one cell further away. This set of cells is denoted as the extended neighbour cells. Nevertheless, in spite of the fact that more neighbour cells are accessed, the performance can be improved by loading these cells from global memory only once and storing them in shared memory.
Using the consistency condition (Equation 3.3), a new collision check algorithm has been defined, consisting of four steps, each one containing one GPU kernel call. In this new algorithm there is an array (denoted as CollisionResponseArray) containing a pair (collision flag, agent identifier) in each position. Another array, called ObjectPositionsArray, contains the agent positions, and the array collisionGrid has as many positions as cells used to perform the collision check. In addition, the collisionGrid array contains three elements in each position. The first element indicates the current step of the simulation. The second element in a given position i stores an agent identifier, indicating that the target cell for that agent is cell i. The third element in a given position i stores an agent identifier, indicating that the source cell for that agent is cell i. Agent positions are copied by the CPU onto device memory and then the collision test is launched. Once the test is finished, the result is returned back to the CPU by copying the CollisionResponseArray.
The actions performed in each step of the new algorithm are illustrated in Figure 3.35. This figure shows an example of the whole process, including the data structures involved as both input and output of each step. The upper part of this figure shows a snapshot of a 2D grid, composed of sixteen cells containing four agents at given locations. In the lower part, this figure shows the data structures with the values corresponding to that snapshot for each step of the algorithm described above. The actions performed in each step are the following ones (a sketch of the second kernel is shown after the list):

Figure 3.35: New algorithm for collision check on the GPU

1. In the first step, the collisionResponse array is initialized indicating that there are collisions for all agents (see Figure 3.35). This initialization is necessary because one agent can overwrite another agent when falling in the same cell. Overwritten agents can detect the collision by means of this initialization step.

2. In the second step, the hashing to determine the target and the source cell for each agent position stored in ObjectPositionsArray is performed. Each thread writes a step identifier and the agent identifier in both the source and target cells. All updated positions share a common step identifier. This identifier makes it possible to determine whether the information within a cell is correct or whether it contains obsolete data. This step identifier is used to avoid the use of the function cudaMemset() for clearing the content of collisionGrid before launching the second step. The execution time of this function significantly increases the global execution time, especially when the size of the array to be cleared grows. The hashing performed in this step by the calcHash kernel is shown in Figure 3.35. Since cell 1 is the previous one for Agent 0 and it wants to move to cell 3, Agent 0 writes its identifier in these cells in the corresponding slots. Agent 2, moving from cell 11 to cell 8, and Agent 3, moving from cell 12 to cell 11, write their identifiers in the corresponding slots of these cells. Also, Agent 1 writes its identifier in the proper slot of cell 8 (the source cell of Agent 1), but the value for the target cell (cell 3) is overwritten with the value stored by Agent 0 when the kernel calcHash finishes. All agents share the step identifier 0, since the movements of these agents are checked in the same collision test launch.

3. The third step of the new algorithm consists of agents detecting whether their desired movements are possible or not. If the desired movement of an agent was overwritten in the previous kernel or generates a collision, it means that the desired position is not possible. In this case, the collision grid is updated in the following way: agents whose desired movement was finally written clean their identifier from their source cell. However, if an agent detects that its movement is not possible, it checks whether its source cell is the target cell of another agent. In such a case, the overwritten agent notifies that agent that its desired movement is not possible. It must be noticed that restoring the previous position cannot lead to an inconsistent situation, since the initial scenario is collision free (i.e. restoring the position is possible), and in each cycle the agent positions are updated keeping the consistency. In Figure 3.35, Agent 0 cleans its identifier from its source position, cell 1. On the other hand, Agent 1 notifies Agent 2 that its desired movement to cell 8 is not possible. Also, Agent 2 notifies Agent 3 that the desired position of the latter agent generates a collision.

4. Finally, the collision check is performed in the fourth step. For each grid cell, if the agent identifier stored in that cell is written in the Desired Cell slot, then its extended neighbour cells are queried to detect a collision. If no collision is detected, then the collision flag in the collisionResponse array is set to 0, indicating that there is no collision. On the other hand, if the agent identifier is written in the Previous Cell slot, then the collision flag is not overwritten, since the desired position for that agent generates a collision. Figure 3.35 shows that the collision for Agent 1 is detected. The collisions for Agent 2 and Agent 3 are also detected, since they are notified about them.
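A sketch of the second kernel, as referenced in the list, is shown below. It is not the thesis implementation: the separate source/target position arrays, the int3 cell layout and the hashing formula are assumptions consistent with the description (each thread tags both its source and its target cell with the step identifier and its agent identifier; later non-atomic writes to the same cell may overwrite earlier ones, which is exactly what the third step detects).

    // collisionGrid: one int3 per cell -> (step identifier,
    //                                      agent whose target cell is this one,
    //                                      agent whose source cell is this one).
    __global__ void calcHashStep(const float2* currentPositions,  // source position per agent
                                 const float2* desiredPositions,  // target position per agent
                                 int3* collisionGrid,
                                 int stepId, float cellSide, int gridWidth, int nAgents)
    {
        int agent = blockIdx.x * blockDim.x + threadIdx.x;
        if (agent >= nAgents) return;

        // Map positions to cell indices (grid origin assumed at (0,0) for brevity).
        float2 src = currentPositions[agent];
        float2 dst = desiredPositions[agent];
        int srcCell = (int)(src.y / cellSide) * gridWidth + (int)(src.x / cellSide);
        int dstCell = (int)(dst.y / cellSide) * gridWidth + (int)(dst.x / cellSide);

        // Tagging the cells with the current step identifier makes stale contents
        // from previous steps harmless, avoiding an explicit cudaMemset().
        collisionGrid[dstCell].x = stepId;
        collisionGrid[dstCell].y = agent;   // this agent wants to move into dstCell
        collisionGrid[srcCell].x = stepId;
        collisionGrid[srcCell].z = agent;   // this agent currently occupies srcCell
    }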

The algorithm described above performs a global synchronization through the termination of the second kernel launch. In this way, in the third kernel the overwritten agents are restored to their previous positions and the consistency of the simulation is kept. A version of this algorithm has been implemented using atomic operations for comparison purposes. This new version consists of merging the second and third steps into a single kernel. In order to merge these two steps, atomic operations are needed (the global synchronization achieved


through the second kernel termination should be performed by using atomic operations). However, the advantage of saving one kernel launch at the cost of using atomic operations should be analyzed.

3.4.3. Evaluation methodology

In order to evaluate the performance of both the GPU and CPU implementations of the collision check procedure, different experiments have been performed on different hardware platforms. The performance tests are based on different configurations of the simulated scenario, varying the number of agents, in order to evaluate the scalability of each algorithm version. Random agent movements have been used for evaluation purposes. Concretely, one hundred random movements are computed per agent, using the agent identifiers as the seed for the random generation. In this way, reproducible results can be obtained (a sketch of this seeding scheme is shown at the end of this section). The correctness of the GPU implementations and the RCU-based implementation has been checked by comparing the result of the collision check performed by these methods with the result provided by the Mutex-based implementation. The performance measurements of the different collision check implementations are the execution time and the collision checking rate (number of operations checked per second). The execution times reported in the performance evaluation section are the aggregated times obtained for all the movements performed by all the agents considered in each simulation.
Some parameters of the system that can affect the performance in terms of latency and throughput have been identified. Concretely, the parameters analyzed are the following ones:

• Population size: This parameter refers to the number of agents simulated. The response time provided by the system for simulations with different numbers of agents has been studied. In this way, the maximum number of agents that the system can support while providing a response time below the threshold value of 250 ms. can be obtained.

• Number of processor cores: This parameter indicates the number of processor cores used in the collision check. In order to check the scalability of the proposed implementations with the computational resources, various platforms with different numbers of cores have been used.


• Number of threads: This parameter indicates the number of execution threads that concurrently perform the collision check. Various configurations with different numbers of threads have been used in both the CPU and GPU implementations.

Regarding the density of agents, since the impact of density on the Action Server performance was analyzed in Section 3.1, the worst case has been chosen to evaluate the performance of the collision check implementations. Therefore, all the experiments have been carried out using low density environments (around 10 %).
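The seeding scheme mentioned above can be sketched as follows. The generator and the movement-step logic are illustrative assumptions; the point is only that seeding with the agent identifier makes every run reproduce the same one hundred movements per agent.

    #include <random>
    #include <vector>

    struct Position { float x, y; };

    // Generate the sequence of random movements for one agent. Seeding the engine
    // with the agent identifier makes the sequence reproducible across runs.
    std::vector<Position> random_movements(int agentId, Position start,
                                           int numMoves = 100, float step = 1.0f) {
        std::mt19937 rng((unsigned int)agentId);                  // agent identifier as seed
        std::uniform_real_distribution<float> delta(-step, step);
        std::vector<Position> moves((size_t)numMoves);
        Position p = start;
        for (int i = 0; i < numMoves; ++i) {
            p.x += delta(rng);
            p.y += delta(rng);
            moves[(size_t)i] = p;
        }
        return moves;
    }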

3.4.4. Performance evaluation

This section shows the performance evaluation of both the CPU and GPU implementations of the collision check procedure. First, the CPU and GPU algorithms are analyzed separately. Later, both implementations are compared in order to determine the performance improvement obtained by the GPU with respect to the CPU.

CPU-based collision check procedure

This section shows the performance evaluation of the CPU implementations for collision check described in Section 3.4.1. The platform used for the CPU tests was a 16-core machine integrating 8 AMD Opteron processors (2 cores @ 1 GHz per processor). The RAM of the system was 32.5 GB and the operating system was Linux 2.6.18-92. The POSIX API has been used for obtaining different configurations with an increasing number of cores. This API makes it possible to set the affinity of execution threads, limiting the set of cores on which the threads are executed. Four configurations containing 2, 4, 8 and 16 cores were used in order to check the scalability of the CPU-based implementations with the number of cores.
First, the parameter tuning of the hashing method proposed in Equation 3.1 (see Section 3.1.1) is performed. The values that must be tuned are B_SIZE, G_SIZE and the number of threads performing the collision check. Different values for these parameters have been used for the Mutex-based implementation, selecting those values that provided the lowest execution time. As an example, the results of the parameter tuning for 1000 agents


using two cores are shown. This population size and number of cores have been chosen in order to compare the performance of the proposed Mutex-based implementation with a similar work that studies the performance of collision check methods for agent-based simulation [101]. This work uses OpenMP for parallelizing a collision check method that uses hashing along with a grid data structure.

                          B_SIZE
G_SIZE          1       1,5       2       2,5
  500         620       707     735       759
 1000         604       671     711       743
 1500         588       627     663       687
 2000         798       826     905       921
 2500        1009      1176    1211      1243

Table 3.5: Execution time (ms.) for the Mutex-based collision check procedure when varying the cell size and the array size for the high density scenario.

Tables 3.5, 3.6 and 3.7 show the execution times obtained when performing the collision check for scenarios with a density of agents of 97 %, 50 % and 10 %, respectively. In these tables, the values of B_SIZE vary from 1 to 2,5 and the values of G_SIZE vary from 500 to 2500. It can be seen that the lowest execution times are obtained when the value of B_SIZE is 1 and the value of G_SIZE is 1500 in all three tables. Also, it can be seen that the execution time is inversely related to the density of agents. These results agree with those presented in Section 3.1.5, where the throughput of the Action Server increased in high density scenarios, since fewer neighbour agents are checked during the collision test.
Figure 3.36 shows the impact on performance of the number of threads used for performing the collision check. This figure shows the execution time and speed-up of both the Mutex-based and RCU-based implementations. The execution times shown in this figure are obtained by assigning the values 1 and 1500 to the parameters of the hashing method B_SIZE and G_SIZE, respectively. The speed-ups achieved by each implementation have been computed as the number of times that a collision check method using a given number of threads is faster than a single-threaded version of the same method. Different conclusions can be obtained from Figure 3.36. First, the synchronization overhead introduced by RCU is lower than the Mutex overhead, since the single-thread execution time is lower for the RCU implementation.
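Written as a formula (only a restatement of the definition just given), the speed-up for a configuration using n threads is

S(n) = \frac{T_{1}}{T_{n}},

where T_1 is the execution time of the single-threaded version of the method and T_n is the execution time of the same method using n threads.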


                          B_SIZE
G_SIZE          1       1,5       2       2,5
  500         794       822     842       870
 1000         774       810     854       898
 1500         711       755     802       826
 2000        1215      1291    1362      1378
 2500        1454      1581    1616      1652

Table 3.6: Execution time (ms.) for the Mutex-based collision check procedure when varying the cell size and the array size for the medium density scenario.

                          B_SIZE
G_SIZE          1       1,5       2       2,5
  500         925       977    1001      1021
 1000         913       933     978      1029
 1500         834       909     957       976
 2000         949       969    1056      1064
 2500        1589      1767    1803      1835

Table 3.7: Execution time (ms.) for the Mutex-based collision check procedure when varying the cell size and the array size for the low density scenario.


Also, it can be seen that the lowest execution time is obtained when using 6 threads for both the Mutex and RCU implementations. This result agrees with the one presented in Section 3.1.5, where the Action Server provided the lowest response times when using 6 AE_Threads. However, although both the Mutex and RCU implementations provide their lowest execution times when using 6 threads, the RCU implementation obtains a speed-up of 4.3x instead of the 2.6x speed-up of the Mutex version. The negative impact on the execution time of using more than 6 threads for performing the collision check can also be seen. However, despite the negative impact of using more than 6 threads in both implementations, the one using Mutex shows a higher increase in the execution time compared with the RCU implementation. The reason for this behaviour is the higher synchronization overhead introduced by the Mutex version. The results obtained for the Mutex and RCU implementations are also compared with an OpenMP parallelization of a similar collision check method [101]. This OpenMP-based implementation obtained a speed-up of 2.7x for 1000 agents, similar to the 2.6x speed-up obtained by the Mutex implementation. As a consequence, the RCU implementation is the clear winner, with a speed-up of 4.3x.

Figure 3.36: Execution time and speed-up obtained for 1000 agents when using 2 cores and increasing the number of threads.

Table 3.8 shows the overall execution time in milliseconds for the CPU collision check implementations for an increasing number of cores and an increasing number of agents. The number of threads used for a given number of cores in each version has been chosen by selecting the values that provided the lowest execution times. Table 3.8 shows that


Table 3.8: Execution times (ms.) for the CPU-based collision check when increasing the number of cores and the number of agents.

                                                       # agents
Synch. method   # cores   # threads      1000     10000    100000    1000000
Mutex                 2           6       834      6001    145732    8896776
                      4          10       323      3099     80192    5434104
                      8          20       275      1977     45873    2749951
                     16          35       288      1860     40380    2144276
RCU                   2           6       443      4639     44819     474380
                      4          16       284      2678     30589     257824
                      8          30       246      1377     15180     145664
                     16          50       215       958     11092     114212

the execution times provided by the RCU implementation are always lower than those of the Mutex implementation. For a crowd population size of 1000 agents there are no significant differences in the execution time between RCU and Mutex. However, as the crowd population size grows, such differences increase, showing that the Mutex version does not scale well with the number of agents. In contrast, the RCU version provides a linear scalability. Table 3.8 also shows the best configurations in terms of the number of execution threads for both the RCU and Mutex implementations. This value was empirically fixed for each multi-core platform in order to obtain the minimum execution time. It can be seen that RCU supports a higher number of threads for a given number of cores than the Mutex implementation, as a consequence of the lower overhead introduced by the RCU synchronization method.
Figure 3.37 shows the speed-up obtained by the RCU-based implementation with respect to the Mutex-based implementation. This figure shows on the X-axis the number of agents considered for the simulations. The Y-axis shows the speed-up obtained. Each bar in Figure 3.37 reports the speed-up for different platforms with an increasing number of cores. It can be seen that the speed-up obtained for the RCU version does not significantly vary when increasing the number of cores for a given crowd population size. However, when the population size of the crowd increases, the speed-up of the RCU implementation with respect to the Mutex version also increases. The reason for the increase in the speed-up with the population size of the crowd is that the Mutex synchronization method introduces a higher overhead than the RCU synchronization method, since the latter permits more concurrent read accesses.


Figure 3.37: RCU Speed-up obtained in comparison with the Mutex version when increasing the number of cores.

GPU-based collision check

In order to check the scalability of the proposed GPU algorithms with the physical parallelism available on the GPU, different NVIDIA GPUs have been considered: a Tesla C870 (16 SMs) and a Tesla C1060 (30 SMs). A key parameter in the performance of GPU algorithms is the number of threads per block. As in the case of the Baseline version, this number has been determined experimentally, selecting the value that provides the lowest execution time when performing the collision check. The value providing the lowest execution times is 256 threads per GPU block for populations ranging from 10,000 to 1,000,000 agents, as for the Baseline version. This value is also directly related to the tile size for those versions using shared memory. Concretely, the size of a tile is 16x16 when using 256 threads.
Figure 3.38 and Figure 3.39 show the overall execution time for the different


collision check implementations on different graphics cards. These figures show on the X-axis the number of agents considered for the simulations. The Y-axis shows the aggregated execution time obtained for each collision check method, in log scale. Figure 3.38 shows the results for the Tesla C870 platform. The new version using atomic operations has not been tested on this platform, since it does not support this kind of operations. As could be expected, the greatest differences arise for the largest population size, that is, one million agents. In the first optimization, the texture memory is used to decrease the number of accesses to device memory, obtaining a 50 % reduction in the execution time with respect to the Baseline version. In the second optimization, the shared memory is used along with the new organization of the collision grid in global memory, in such a way that a coalesced access to device memory is guaranteed. This optimization obtains a 70 % reduction in the execution time with respect to the Baseline version. Nevertheless, the proposed technique achieves the best results, obtaining an 85 % reduction in the execution time.

Figure 3.38: Execution times on Tesla C870 card

Figure 3.39 shows the execution times obtained for the Tesla C1060 card. In this case, the effects of the texture memory optimization are hardly noticeable. The reason is that for this card the global memory access algorithm has been improved with respect to the C870 platform [64], allowing more coalesced accesses to be obtained. Therefore, the Baseline algorithm requires much shorter execution times than in the case of the C870 card. The optimization that uses shared memory permits a decrease in the execution time of 53 % with respect to the baseline version for a crowd size of one million agents. Nevertheless, the proposed algorithm achieves the best execution times, with a reduction of 65 % when using atomic operations and of around 75 % without using atomic operations. If Figure 3.38


and Figure 3.39 are compared, it can be seen that the execution times are inversely related to the number of SMs available on the cards.

Figure 3.39: Execution times on Tesla C1060 card

In order to clarify the performance improvements achieved by each method for all the population sizes considered, the speed-up obtained by each method has been calculated (the number of times that each considered version is faster than the Baseline version). Speed-ups are shown on the Y-axis of Figures 3.40 and 3.41 for the C870 and C1060 cards, respectively. In Figure 3.40, it can be seen that the speed-up increases with the size of the crowd, the first optimization obtaining a speed-up of 2x for one million agents (1M). For the second optimization, a speed-up of 1.8x is obtained for a population size of 1K agents, while a speed-up of 3.3x is obtained for a population of 1M agents. Nevertheless, the new approach provides the highest values, with a minimum speed-up of 3.1x for a crowd with 10K agents and a maximum speed-up of 7.2x for 1M agents.
Figure 3.41 shows the speed-up values obtained for the C1060 card. The values provided by the first optimization do not significantly increase with the crowd size since, as mentioned before, the Baseline version takes much shorter execution times on the C1060 card. The second optimization obtains slightly higher values than the first optimization up to 10K agents. From this population size on, the speed-up increases up to 2.2x for 1M agents. Nevertheless, the highest values are provided by the new version without atomic operations, obtaining 3.9x for 1M agents. Also, it can be seen that for the smallest population (1K agents) a slightly better speed-up is obtained for the new version using atomic operations. The reason is that for 1K agents fewer accesses to device memory are performed than for larger populations.


Figure 3.40: Speed-up obtained with respect to the Baseline version for the Tesla C870 platform

Therefore, there is no significant overhead difference between using atomic operations and using one extra kernel launch to perform the global synchronization for 1K agents. However, when the population size grows, so does the performance penalty introduced by atomic operations.

Figure 3.41: Speed-up obtained with respect to the Baseline version for the Tesla C1060 platform

In order to show that these execution times are directly related to the workload generated by each method, the throughput of the different versions, in terms of the number of collisions checked per second, has been measured. Figures 3.42 and 3.43 show the collision check rates obtained when increasing the number of agents for the Tesla C870 and C1060 cards, respectively. Figure 3.42 and Figure 3.43 show that the proposed method without atomic operations performs the highest number of collision checks per second for all the population sizes. These figures also show that the collision check rate achieved by the new method significantly increases with the number of available SMs on the GPU, confirming the scalability of this method.


Figure 3.42: Collisions rate on Tesla C870 card

Figure 3.43: Collisions rate on Tesla C1060 card

The relation between the execution time of the kernels and the time required to transfer data from CPU memory to GPU memory has also been analyzed for each version of the GPU collision check procedure. Tables 3.9 and 3.10 show this relation for the Tesla C870 and C1060 cards, respectively. The PCI column in each table shows, for a given population size, the time needed to transfer the input data from CPU to GPU memory through the PCI Express bus and to return the output data back to the CPU. The Kernels column in each table shows the aggregated execution time required by the kernels of each algorithm version for a given


                      |      1000      |     10000      |     100000     |    1000000
Version               |  PCI   Kernels |  PCI   Kernels |  PCI   Kernels |  PCI   Kernels
----------------------+----------------+----------------+----------------+----------------
Baseline              |    4      84   |   26     121   |  221     594   |  1507    7118
Opt. 1: Tex. mem.     |    4      72   |   26     112   |  221     453   |  1507    3516
Opt. 2: Shared mem.   |    4      41   |   26      79   |  221     242   |  1507    2170
New Appr.             |    4      19   |   26      36   |  221     129   |  1507     999

Table 3.9: Time (ms.) of PCI transfer and execution of kernels for the different GPU versions on Tesla C870 when varying the number of agents.

It can be seen in both tables that for the Baseline version the PCI time is much lower than the execution time of the kernels. However, as the kernel execution time is reduced in the optimized and new versions, the PCI times become higher than the kernel execution times, especially for the largest populations. As a conclusion, after the significant reduction obtained in the kernel execution times, future optimizations should focus on reducing the time required to transfer data between CPU memory and GPU memory.

                      |      1000      |     10000      |     100000     |    1000000
Version               |  PCI   Kernels |  PCI   Kernels |  PCI   Kernels |  PCI   Kernels
----------------------+----------------+----------------+----------------+----------------
Baseline              |    4      47   |   26      65   |  221     303   |  1507    2563
Opt. 1: Tex. mem.     |    4      44   |   26      62   |  221     287   |  1507    2426
Opt. 2: Shared mem.   |    4      39   |   26      59   |  221     152   |  1507    1163
New Appr. Atom. ops.  |    4      14   |   26      31   |  221     118   |  1507     884
New Appr.             |    4      15   |   26      23   |  221      76   |  1507     650

Table 3.10: Time (ms.) of PCI transfer and execution of kernels for the different GPU versions on Tesla C1060 when varying the number of agents.

Comparison of CPU and GPU implementations

Figure 3.44 shows the speed-up obtained by the proposed new GPU algorithm with respect to the CPU implementations; that is, the speed-up is computed as the number of times that the new GPU algorithm is faster than each CPU implementation. Figure 3.44 contains four plots: two of them show the speed-up for the Tesla C870 card with respect to both the Mutex and RCU implementations executed on a 16-core machine, and the other two show the speed-up for the Tesla C1060 platform with respect to the same CPU implementations. The execution times used for the GPU implementations were computed as the aggregated execution time of the kernels plus the PCI time needed to transfer the input and output data. The X-axis in Figure 3.44 shows the number of agents considered in the simulations, and the Y-axis shows the speed-up obtained.

As stated in Table 3.8, the RCU version requires shorter execution times than the Mutex version. For that reason, Figure 3.44 shows that, as the crowd size increases, the speed-up obtained by both GPU platforms with respect to the Mutex implementation is much higher than the speed-up obtained with respect to the RCU version. This is due to the fact that the RCU synchronization method introduces less overhead than synchronization based on mutexes. In addition, the good scalability offered by the RCU version explains the flatter slope of its speed-up plot compared to the plot for the Mutex version. However, even though RCU is a more efficient synchronization method, the GPU implementation remains the fastest option.

Figure 3.44: Speed-up obtained for the new GPU procedure executed on different cards, with respect to the CPU implementations


CHAPTER 4

THE PARTITIONING PROBLEM IN DISTRIBUTED CROWD SIMULATIONS

Section 3.2 in Chapter 3 describes a distributed system architecture for crowd simulation. This system architecture provides the required scalability with the number of agents by adding more computational resources. However, it also requires an efficient partitioning method that assigns the agents to the existing servers in such a way that the number of messages exchanged among the servers is reduced and the system workload is well balanced during the whole simulation. In this chapter, we analyze the current proposals for partitioning crowds in distributed simulations and we also present and compare three different partitioning methods.

Typically, there are two different approaches for partitioning a crowd simulation. One of them is based on the criterion of workload [88, 61], so that different groups of agents are executed on different computers. The other approach is region-based, in such a way that the virtual world is split into regions (usually 2D cells of a grid) and all the agents located in a given region are assigned to a given computer [76]. Both approaches should guarantee the consistency of the simulation; for example, two different agents cannot be located at the same point at the same time in the virtual world. However, taking into account the system architecture described in Section 3.2, the most appropriate scheme is the region-based approach, because each action server is in charge of managing a region of the virtual world. Therefore, we have followed this approach.

The region-based partitioning problem for crowd simulation has been previously addressed. An implementation for the IBM Cell Engine processor was shown to be efficient for simulating thousands of autonomous characters incorporating a social forces behavioural model [78]. This work incorporates spatial hashing techniques and it also distributes the load among the Cell Engine Synergistic Processor Elements (SPEs) [40]. The same social forces model has also been distributed on a PC cluster with MPI communications among the processors; however, this approach is capable of simulating only a low number of agents (512 agents) and the execution times obtained are far from interactive [103]. Another work describes the use of a multicomputer with 11 processors to simulate a crowd of 10,000 agents at interactive rates [76]. Although this work partitions the crowd following the criterion of regions, a static agent-processor assignment is used and no workload balancing is provided. Moreover, all these region-based partitioning methods add a significant overhead to the system, and this overhead severely limits the scalability of the simulation system. This chapter proposes different region-based partitioning methods for crowd simulations, and the results provided by each method are compared in order to determine the most efficient one.

4.1. REGION-BASED PARTITIONING METHODS FOR DISTRIBUTED CROWD SIMULATIONS

The region-based partitioning problem consists of finding a near-optimal partition of regions (containing all the agents in the system) that simultaneously fulfills two conditions: it minimizes the number of agents near the borders of the regions, and it also keeps the number of agents in each region balanced. On the one hand, the number of agents near the borders should be minimized because these agents generate synchronization messages among the distributed servers. On the other hand, the number of agents in each region should be balanced in order to prevent one or more servers in the system from reaching a saturation point.

The first element required for comparing different partitioning techniques is the definition of a homogeneous criterion for measuring the quality of the partitions provided by all the methods. In order to achieve this goal, the following fitness function, which should be minimized, has been defined:

H(P) = ω1 · α(P) + ω2 · β(P),    with ω1 + ω2 = 1        (4.1)


The first term in this equation measures the number of border agents in the resulting partition P, that is, those agents whose surroundings (Area of Interest, or AOI [86]) cross the region boundaries. Since the management of the border agents must be performed by two or more servers, the workload generated by these agents is higher than the workload generated by agents located far from the region borders. In order to check the action of a border agent, each server should send a locking request to the other servers managing the AOI of that agent. These locking requests make it possible to maintain the consistency of the virtual world in the border areas, but they involve several servers and therefore require a much higher computational cost. Hence, the number of border agents must be minimized. In this sense, α(P) is computed as the number of agents whose AOIs intersect two or more regions of the virtual world. On the other hand, β(P) is computed as the standard deviation of the number of agents contained in each region (with respect to the average), so β(P) measures how well balanced partition P is. Finally, ω1 and ω2 are weighting factors between 0 and 1 that can be tuned to change the behaviour of the partitioning method as needed.

We have implemented three different partitioning methods. All of them use the k-means algorithm to obtain the initial partition, since this clustering algorithm provides a simple and fast way of partitioning the crowd into a given number of regions. Once the simulation starts, the initial partition should be dynamically adapted to the current state of the crowd as the simulation evolves. During the simulation, each server knows the location of the agents within its region, as well as the number of agents and the center of mass of the regions assigned to neighbour servers. While one of the partitioning methods uses a genetic algorithm (GA) guided by H(P) to search for a near-optimal partition of rectangular regions, the other two methods use spatial clustering techniques to calculate a near-optimal partition. In the latter cases, the servers periodically assign each of their agents ag_k to the server controlling the region r_i that minimizes the following function:

falloc(ag_k, r_i) = dstMC(ag_k, r_i) + nAgs(r_i) · dstMC(ag_k, r_i)        (4.2)

where falloc is the allocation function, nAgs(r_i) provides the number of agents in region r_i, and dstMC(ag_k, r_i) corresponds to the Euclidean distance from ag_k to the center of mass of region r_i. Since falloc should be minimized, the first term in falloc considers a spatial criterion and the second term balances the server workload. Every time a partition is updated, the corresponding state (center of mass and number of agents) is sent to the neighbour servers.
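As an illustration of how this allocation criterion could be evaluated in code, the following is a minimal C++ sketch (the Agent and Region types, their field names and the use of a plain Euclidean distance to the stored center of mass are assumptions made for illustration, not the thesis implementation):

#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical, simplified data types used only for this sketch.
struct Agent  { double x, y; };
struct Region { double cmX, cmY; std::size_t nAgents; };   // center of mass and agent count

// dstMC(ag_k, r_i): Euclidean distance from the agent to the center of mass of the region.
static double dstMC(const Agent& ag, const Region& r) {
    return std::hypot(ag.x - r.cmX, ag.y - r.cmY);
}

// falloc(ag_k, r_i) = dstMC(ag_k, r_i) + nAgs(r_i) * dstMC(ag_k, r_i)   (Equation 4.2)
static double fAlloc(const Agent& ag, const Region& r) {
    const double d = dstMC(ag, r);
    return d + static_cast<double>(r.nAgents) * d;
}

// Returns the index of the region (server) that minimizes falloc for the given agent.
// Assumes a non-empty list of regions.
std::size_t bestRegion(const Agent& ag, const std::vector<Region>& regions) {
    std::size_t best = 0;
    double minCost = fAlloc(ag, regions[0]);
    for (std::size_t i = 1; i < regions.size(); ++i) {
        const double cost = fAlloc(ag, regions[i]);
        if (cost < minCost) { minCost = cost; best = i; }
    }
    return best;
}

In a periodic update, each server would call bestRegion for each of its agents, migrate those whose best region is controlled by a neighbour server, and afterwards broadcast its new center of mass and agent count.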


4.1.1. R-Tree method

The first partitioning method is based on the R-Tree structure, one of the most popular dynamic index structures for spatial searching [33]. The partitioning method implemented on top of the R-Tree is aimed at optimizing the area of the rectangles enclosing the crowd. The most interesting feature of this approach is that the R-Tree is an efficient structure for managing the partitioning problem, since it lets us handle the crowd motion as insertions and deletions in the tree, where the falloc criterion can be easily introduced.

An R-Tree is a height-balanced tree structure that splits the space with hierarchically nested, and possibly overlapping, Minimum Bounding Rectangles (MBRs). An MBR is the minimum rectangle that encloses a single agent or a group of agents. Each node of an R-Tree has a variable number of entries (up to some pre-defined maximum). Each entry within a non-leaf node stores two kinds of data: a way of identifying a child node, and the MBR of all the entries within this child node. Each entry within a leaf node also stores two kinds of information: the actual data element, and the MBR of that data element. Two parameters define the shape of an R-Tree: the maximum and the minimum number of entries in a node (denoted as M and m, respectively). On the other hand, choosing an adequate splitting method is important, since insertions and deletions generate node splits to keep the tree balanced. Therefore, the splitting method used for dividing a node when it has more than M entries can determine the performance of the R-Tree method. In our implementation, a value of 5 has been chosen for parameter M and a value of 2 for parameter m, since these values generate the lowest CPU utilization during R-Tree updating. Regarding the splitting method, the Quadratic split method [33] has been used, since it shows a lower CPU utilization for tree searches than the Linear split method, and there are no significant differences between both methods for low values of parameter M.

In order to illustrate the implemented R-Tree algorithm, Figure 4.1 shows an example of how the partitioning criterion falloc is used. A set of fifteen agents, represented as labeled circles, is shown in Figure 4.1. The square around each agent represents the MBR of the agent, which will be used for R-Tree insertions or deletions. The set of agents is partitioned into two regions, coloured in green and red, each one delimited by the MBR enclosing its agents. For each region, different sub-regions are depicted based on the agents' locations, and the resulting R-Trees for the two regions are shown on the right side.


The value of parameter M is set to 3 in both R-Trees, and there is only one non-leaf node, the root node. Each entry in the root node contains the MBR of its child node (each MBR is denoted as Rx, where x ranges from 1 to 6). These MBRs are used to guide the search during insertions and deletions in the tree. Figure 4.1-b shows how the R-Trees evolve when agents 8 and 12 move. When agent 8 moves, the falloc criterion is first applied to determine whether it must change regions or not. In this case, falloc(ag_8, region_1) < falloc(ag_8, region_2); that is, the criterion determines a region change. Thus, agent 8 is deleted from R-Tree 2 and inserted into R-Tree 1. The movement of agent 12 does not imply a region change, but its reinsertion in the R-Tree of Region 1 implies a branch change, since it is reinserted as a child of MBR R5.
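To make the structure described in this subsection concrete, the following is a minimal C++ sketch of the node layout (the type and field names are assumptions made for illustration and do not correspond to the thesis code):

#include <algorithm>
#include <cstddef>
#include <vector>

// Minimum Bounding Rectangle (MBR) of a single agent or of a group of entries.
struct MBR {
    double xMin, xMax, yMin, yMax;
    double area() const { return (xMax - xMin) * (yMax - yMin); }
    // Smallest MBR enclosing both this MBR and another one (used when inserting).
    MBR enlargedBy(const MBR& o) const {
        return { std::min(xMin, o.xMin), std::max(xMax, o.xMax),
                 std::min(yMin, o.yMin), std::max(yMax, o.yMax) };
    }
};

// Parameters defining the shape of the tree (the values used in this work).
constexpr std::size_t M = 5;   // maximum number of entries per node
constexpr std::size_t m = 2;   // minimum number of entries per node

struct RTreeNode;

// Each entry stores an MBR plus either a child node (non-leaf nodes) or the
// identifier of the data element, i.e. an agent (leaf nodes).
struct RTreeEntry {
    MBR box;
    RTreeNode* child = nullptr;   // non-leaf entry: child node whose MBR is 'box'
    std::size_t agentId = 0;      // leaf entry: the enclosed agent
};

struct RTreeNode {
    bool isLeaf = true;
    std::vector<RTreeEntry> entries;   // between m and M entries (except for the root)

    // A node that exceeds M entries must be divided, here with the Quadratic split.
    bool needsSplit() const { return entries.size() > M; }
};

As in a standard R-Tree, an insertion descends the tree choosing at each level the entry whose MBR requires the least enlargement, and a node exceeding M entries is divided with the chosen splitting method.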

Figure 4.1: Update of two R-Trees a) initial state b) after agents movement

In order to illustrate the partitions provided by the R-Tree method, Figure 4.2 shows three different instants of a crowd simulation with 8000 agents. In this figure, agents are represented as dots, and each MBR of the partition is shown in a different color. It can be seen that at the beginning of the simulation (Figure 4.2-a) some overlapping exists among the MBRs of the regions. As the simulation evolves (Figures 4.2-b and 4.2-c), the R-Tree provides partitions with an increasing overlapping area. As a consequence, more synchronization messages are generated among the servers.

Figure 4.2: Snapshots of the partitions provided by R-Tree at different simulation stages a) beginning b) middle c) end

4.1.2. Genetic algorithm method

Genetic Algorithms (GAs) are a search method based on the concept of evolution by natural selection [58, 37], used to generate useful solutions to optimization and search problems. Genetic algorithms belong to the larger class of evolutionary algorithms (EAs), which generate solutions to optimization problems using techniques inspired by natural evolution, such as inheritance, mutation, selection, and crossover. In a genetic algorithm, a population of strings (called chromosomes), which encode candidate solutions (called individuals or phenotypes) to an optimization problem, evolves towards better solutions. The quality of a solution is determined using a fitness function. The evolution usually starts from a population of randomly generated individuals and proceeds in generations. In each generation, the fitness of every individual in the population is evaluated, and multiple individuals are stochastically selected from the current population (based on their fitness) and modified (recombined and possibly randomly mutated) to form a new population. The new population is then used in the next iteration of the algorithm. Commonly, the algorithm terminates when either a maximum number of generations has been produced or a satisfactory fitness level has been reached for the population. If the algorithm has terminated due to reaching the maximum number of generations, a satisfactory solution may or may not have been found.


In the GA proposed for solving the partitioning problem, H(P) (see Equation 4.1) has been used as the fitness function. Each iteration of the algorithm consists of generating a new population from the existing one, where each chromosome of the population consists of an integer array that contains k Minimum Bounding Rectangles (MBRs). Each MBR is a quadruple [x_min, x_max, y_min, y_max] that defines a rectangular region of the virtual world enclosing a subset of the crowd. Thus, a chromosome defines a partition of the crowd into k regions. As an example, Figure 4.3 shows a chromosome for a given population with k = 4.

Figure 4.3: Chromosome used for the Genetic Algorithm

In order to generate a new population from the existing one, different operators can be applied to the chromosomes. The selection operator selects those population individuals that will be used for reproduction in each iteration of the algorithm; its purpose is to give more chances to the most suitable individuals (chromosomes) in the current population. The crossover operator allows the generation of an offspring from the previously selected ancestor chromosomes. The mutation operator consists of randomly altering each of the elements (genes) in the chromosome with a given mutation probability; its purpose is to produce population diversity. Finally, the replacement operator consists of replacing the current population with the offspring, or with a mix of the current chromosomes and the offspring.

Figure 4.4 shows the diagram representing the main loop of the implemented GA. As most heuristic methods, it starts from an initial population of R randomly generated chromosomes. These chromosomes are sorted in ascending order by the fitness function H(P) associated with each chromosome (the first chromosome is the one with the best H(P) value). This sorted list is denoted as the Best Solutions List (BSL). It represents the initial population for the genetic algorithm, and it will contain the best R solutions found by the GA up to the current iteration, that is, the current population pool. The value of R is a parameter that must be tuned. A value of R=10 has been used, since the execution time of the partitioning method is limited by the server response time threshold in the crowd simulation (250 ms.). For greater values of R, the GA partitioning method exceeded the allowed server threshold.
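A minimal C++ sketch of this chromosome encoding and of the fitness evaluation used to sort the BSL is given below. It is only an illustration: the data layout, the regionOf assignment of agents to regions, and the approximation of a border agent as one whose AOI is not fully contained in the MBR of its own region are assumptions, not the thesis implementation.

#include <cmath>
#include <cstddef>
#include <vector>

// A chromosome: k MBRs, each stored as the quadruple [x_min, x_max, y_min, y_max].
struct MBR { double xMin, xMax, yMin, yMax; };
using Chromosome = std::vector<MBR>;       // one MBR per region, k in total

struct Agent { double x, y; };

// An agent is counted as a border agent when its AOI (modelled here as a circle
// of radius 'aoi') is not fully contained in the MBR of its own region.
static bool isBorderAgent(const Agent& a, const MBR& r, double aoi) {
    return a.x - aoi < r.xMin || a.x + aoi > r.xMax ||
           a.y - aoi < r.yMin || a.y + aoi > r.yMax;
}

// Fitness function of Equation 4.1: H(P) = w1 * alpha(P) + w2 * beta(P), where
// alpha(P) is the number of border agents and beta(P) is the standard deviation
// of the number of agents per region.
double evaluateH(const Chromosome& P, const std::vector<Agent>& agents,
                 const std::vector<std::size_t>& regionOf,   // region index of each agent
                 double aoi, double w1, double w2) {
    std::vector<std::size_t> count(P.size(), 0);
    std::size_t alpha = 0;
    for (std::size_t i = 0; i < agents.size(); ++i) {
        const std::size_t r = regionOf[i];
        ++count[r];
        if (isBorderAgent(agents[i], P[r], aoi)) ++alpha;
    }
    const double mean = static_cast<double>(agents.size()) / static_cast<double>(P.size());
    double var = 0.0;
    for (std::size_t c : count)
        var += (static_cast<double>(c) - mean) * (static_cast<double>(c) - mean);
    const double beta = std::sqrt(var / static_cast<double>(P.size()));
    return w1 * static_cast<double>(alpha) + w2 * beta;
}

Sorting the R randomly generated chromosomes by this value in ascending order yields the initial BSL described above.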

Figure 4.4: The Genetic Algorithm main loop

Each GA iteration consists of generating a descendant generation of R chromosomes starting from an ancestor generation. The way in which the algorithm provides the next generation determines the behaviour of the GA. A sexual reproduction technique has been chosen [58], in such a way that each descendant is generated starting from two ancestors. In each iteration, a pseudo-random selection operator is used: the first ancestor for the i-th chromosome of the population is the i-th chromosome of the population in the previous iteration, while the second ancestor is randomly selected among the 50 % of the previous population with the best fitness values. From each pair of ancestors, an offspring is obtained by applying a crossover operator.

Concretely, a randomly skewed average of the corresponding coordinates in each of the ancestors is computed. This skewed average is computed for all the coordinates in an MBR and for all the MBRs in a chromosome. As an example, Figure 4.5 shows the MBRs corresponding to two ancestors and an example of the resulting offspring. In this figure, an ancestor MBR a is defined by the quadruple [xa_min, xa_max, ya_min, ya_max] and a second ancestor MBR b is defined by the quadruple [xb_min, xb_max, yb_min, yb_max]. From these two MBRs, the MBR drawn with dashed lines is computed.


Figure 4.5: Offspring generation
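A possible C++ rendering of this crossover step is sketched below. The uniform random weight used to skew the average is one plausible interpretation of the operator and is an assumption, as is the pickSecondAncestor helper that mirrors the selection rule described above.

#include <cstddef>
#include <random>
#include <vector>

// Chromosome encoding as in Figure 4.3: k MBRs, each a quadruple of coordinates.
struct MBR { double xMin, xMax, yMin, yMax; };
using Chromosome = std::vector<MBR>;

// Randomly skewed average of two coordinates: a weighted mean whose weight is
// drawn uniformly at random (the exact skewing used in the thesis is not shown here).
static double skewedAverage(double a, double b, std::mt19937& rng) {
    std::uniform_real_distribution<double> w(0.0, 1.0);
    const double t = w(rng);
    return t * a + (1.0 - t) * b;
}

// Crossover operator: the offspring MBRs are built coordinate by coordinate
// from the two ancestors, for every MBR in the chromosome.
Chromosome crossover(const Chromosome& A, const Chromosome& B, std::mt19937& rng) {
    Chromosome child(A.size());
    for (std::size_t i = 0; i < A.size(); ++i) {
        child[i].xMin = skewedAverage(A[i].xMin, B[i].xMin, rng);
        child[i].xMax = skewedAverage(A[i].xMax, B[i].xMax, rng);
        child[i].yMin = skewedAverage(A[i].yMin, B[i].yMin, rng);
        child[i].yMax = skewedAverage(A[i].yMax, B[i].yMax, rng);
    }
    return child;
}

// Selection, as described above: the first ancestor of the i-th offspring is the
// i-th chromosome of the previous population; the second one is picked at random
// among the best 50 % of that population (assumed sorted by ascending H(P),
// with a population size of at least two).
std::size_t pickSecondAncestor(std::size_t populationSize, std::mt19937& rng) {
    std::uniform_int_distribution<std::size_t> best(0, populationSize / 2 - 1);
    return best(rng);
}

Offspring whose MBRs end up with invalid shapes are simply discarded, as explained next.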

It must be noticed that this reproduction method can produce non-valid offspring, because the resulting MBRs can have different shapes. For example, the resulting MBR in Figure 4.5 is narrower than ancestor a. If the rest of the MBRs in the resulting chromosome do not include the area of MBR a that is not covered by the resulting offspring, then the agents located in that area will not be assigned to the semantic database. Moreover, the initial partition, provided by the previous execution of the GA, can be corrupted due to the movement of agents during the AS cycle. In this case, the initial population starts from a modified partition where some regions are expanded as necessary to cover all the agents at their current locations. When an invalid offspring is generated, it is simply discarded and another offspring is generated from different ancestors in the population. When all the ancestor population has been used for producing offspring and the number of valid offspring reaches R, the replacement operator is applied. Concretely, the new offspring and the previous chromosomes in the BSL are sorted and merged to obtain the new BSL (population pool) for that iteration. The mutation operator is not used, since exchanging two or more coordinates between different MBRs could lead to invalid chromosomes.

Finally, a finishing condition should be checked in order to detect when the GA should stop. On the one hand, the execution time has been established as one of the finishing conditions of the algorithm, since one of the main constraints in crowd simulations is the execution time of the search. The execution time must be shorter than a fraction of the AS cycle in order to provide an effective partition. The AS cycle corresponds to the maximum interactive response time (i.e., 250 ms.), and the time limit of the search, denoted as T, is set to half of the AS cycle period, that is, 125 ms. On the other hand, in order to ensure that the proposed method provides the best possible solution, the decrease of H(P) has been added as a convergence condition: if the H(P) value of the first chromosome in the BSL does not decrease in two successive iterations, then the algorithm finishes. Therefore, the first chromosome in the BSL is chosen as the result of the search either when the convergence condition is reached or when the algorithm has been executed for T milliseconds.

Figure 4.6 shows snapshots of the different partitions provided by the GA method during a simulation. It can be seen that at the beginning (Figure 4.6-a) some overlapping exists among the regions of the partition. However, as the simulation evolves (Figures 4.6-b and 4.6-c), the GA method provides partitions in which the region overlapping area is lower than in the case of the R-Tree method. The reason for this behaviour is the heuristic search procedure carried out by this method.

Figure 4.6: Snapshots of the partitions provided by GA at different simulation stages a) beginning b) middle c) end
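The control flow of this main loop, including both finishing conditions, can be summarized by a sketch like the following. It is an illustration only: the callables stand for the fitness, offspring-generation and replacement steps described above, and the convergence test is read here as "no decrease of the best H(P) in two successive iterations", following the text.

#include <chrono>
#include <cstddef>
#include <functional>
#include <vector>

// Chromosome encoding as in Figure 4.3 (k MBRs per chromosome).
struct MBR { double xMin, xMax, yMin, yMax; };
using Chromosome = std::vector<MBR>;
using Population = std::vector<Chromosome>;

// Main loop of the GA partitioning method (control flow only). 'bsl' is assumed
// to be non-empty and sorted by ascending H(P); 'mergeKeepBest' merges the BSL
// with the new offspring and keeps the best R chromosomes.
Chromosome runGA(Population bsl, std::size_t R,
                 const std::function<double(const Chromosome&)>& evaluateH,
                 const std::function<Population(const Population&)>& makeOffspring,
                 const std::function<Population(Population, Population, std::size_t)>& mergeKeepBest) {
    using Clock = std::chrono::steady_clock;
    // T = 125 ms, half of the 250 ms action-server cycle.
    const auto deadline = Clock::now() + std::chrono::milliseconds(125);

    double bestH = evaluateH(bsl.front());
    int stagnantIterations = 0;   // iterations without a decrease of the best H(P)

    while (Clock::now() < deadline && stagnantIterations < 2) {
        bsl = mergeKeepBest(bsl, makeOffspring(bsl), R);
        const double newBest = evaluateH(bsl.front());
        if (newBest < bestH) { bestH = newBest; stagnantIterations = 0; }
        else                 { ++stagnantIterations; }
    }
    return bsl.front();   // best partition found within the time budget
}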

4.1.3. Convex hull method

This approach is similar to the R-Tree technique described above, since it is also aimed at optimizing the area of the spatial structures enclosing the crowd. However, unlike the R-Tree technique, it is based on computing the convex hull of the points representing the agents in a given region. The partitioning technique in distributed crowd simulations can benefit from the use of convex hulls, since these spatial structures inherently reduce the area of the regions assigned to servers when compared to the rectangles used in the R-Tree technique, and one of the purposes of the partitioning technique is to minimize the overlapping area of different regions.

In order to compute the convex hull of a point set, the QuickHull algorithm [7] has been used because of its speed and efficiency. There are quite a few other efficient algorithms for calculating a convex hull, like the Graham Scan [30] or the Jarvis March [41]. However, QuickHull has been chosen because it is faster in most average cases, and its recursive nature allows a fast and yet clean implementation. QuickHull uses a divide-and-conquer approach similar to the QuickSort algorithm, hence its name.

As in the case of the R-Tree method, each server periodically updates the agents assigned to its region according to the falloc function. Once the agents have been inserted, the convex hull can be recomputed, so the center of mass and the number of agents can also be updated. Algorithm 4 shows the steps followed for updating each region. The outer for loop iterates over the number of agents in the crowd and the inner loop iterates over the number of regions. After the inner for loop, the i-th agent is assigned to the best region, that is, the region that minimizes the falloc function (see Equation 4.2). The assignment of agents to regions is performed by the Assign function in Algorithm 4. Since agents can migrate among the different regions considered (and therefore they should be re-assigned to different servers), a new convex hull is computed each time the partitioning method is executed, and no updating of previous convex hulls is used. The function Compute_convex_hull() implements the QuickHull algorithm [7] for calculating the convex hull of each region.

The steps followed for computing the convex hull are illustrated in Figure 4.7. Initially, the first hull is created with the most distant points (agents) in the vertical and horizontal directions. This process is shown in Figure 4.7-a); it requires accessing every point of the initial set, so it has a linear cost (O(n)). The initial hull may not include all the points in the region, as Figure 4.7-b) shows. Then, the algorithm recursively finds and connects the most distant point in the direction orthogonal to each side of the previous hull. Figure 4.7-c) represents these steps. The recursion ends when all the agents are inside the current hull, as shown in Figure 4.7-d).


Algorithm 4 Algorithm for the Quick Hull method.
VAR
  int N;                      /* Number of agents  */
  int k;                      /* Number of regions */
  int i, j, best;
  int Minimum = INT_MAX;
/* Convex Hull Method */
for i = 1 to N do
    Minimum = INT_MAX;
    for j = 1 to k do
        if falloc(agent[i], region[j]) < Minimum then
            Minimum = falloc(agent[i], region[j]);
            best = j;
    Assign(agent[i], region[best]);
for j = 1 to k do
    Compute_convex_hull(region[j]);
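For illustration, a compact C++ sketch of the Compute_convex_hull() step (the QuickHull recursion described above) could look as follows. It is a simplification: the Point type is an assumption, and for brevity the recursion scans the whole point set at every call instead of partitioning it, which a real implementation would avoid.

#include <cstddef>
#include <vector>

struct Point { double x, y; };   // position of an agent

// Twice the signed area of triangle (a, b, p); > 0 means p lies to the left of a->b.
static double cross(const Point& a, const Point& b, const Point& p) {
    return (b.x - a.x) * (p.y - a.y) - (b.y - a.y) * (p.x - a.x);
}

// Recursive step: find the point farthest from segment a->b among those on its
// left side, then recurse on the two new edges; points already inside are ignored.
static void quickHullRec(const std::vector<Point>& pts, const Point& a, const Point& b,
                         std::vector<Point>& hull) {
    std::size_t farthest = pts.size();
    double maxDist = 0.0;
    for (std::size_t i = 0; i < pts.size(); ++i) {
        const double d = cross(a, b, pts[i]);
        if (d > maxDist) { maxDist = d; farthest = i; }
    }
    if (farthest == pts.size()) {        // no point outside this edge: recursion ends
        hull.push_back(a);
        return;
    }
    quickHullRec(pts, a, pts[farthest], hull);
    quickHullRec(pts, pts[farthest], b, hull);
}

// QuickHull: start from the two most distant points in the horizontal direction
// (the initial hull) and recursively expand above and below the chord, as in Figure 4.7.
std::vector<Point> computeConvexHull(const std::vector<Point>& pts) {
    std::vector<Point> hull;
    if (pts.size() < 3) return pts;
    std::size_t left = 0, right = 0;
    for (std::size_t i = 1; i < pts.size(); ++i) {
        if (pts[i].x < pts[left].x)  left = i;
        if (pts[i].x > pts[right].x) right = i;
    }
    quickHullRec(pts, pts[left], pts[right], hull);   // points above the chord
    quickHullRec(pts, pts[right], pts[left], hull);   // points below the chord
    return hull;                                      // hull vertices in clockwise order
}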