UNIVERSIDADE DE LISBOA INSTITUTO SUPERIOR TÉCNICO
Agent-Based Modeling on High Performance Computing Architectures Nuno Maria Carvalho Pereira Fernandes Fachada
Supervisor: Doctor Agostinho Cláudio da Rosa Co-Supervisors: Doctor Vitor Manuel Vieira Lopes Doctor Rui Miguel da Costa Martins Thesis approved in public session to obtain the PhD Degree in Electrical and Computer Engineering Jury final classification: Pass with Distinction and Honour
Jury Chairperson: Chairman of the IST Scientific Board Members of the Committee: Doctor Leonel Augusto Pires Seabra de Sousa, Professor Catedrático do Instituto Superior Técnico da Universidade de Lisboa Doctor Agostinho Cláudio da Rosa, Professor Associado (com Agregação) do Instituto Superior Técnico da Universidade de Lisboa Doctor Nuno Cavaco Gomes Horta, Professor Auxiliar (com Agregação) do Instituto Superior Técnico da Universidade de Lisboa Doctor Miguel Augusto Mendes Oliveira e Silva, Professor Auxiliar da Universidade de Aveiro Doctor Nuno Manuel Mendes da Cruz David, Professor Auxiliar da Escola de Tecnologias e Arquitectura do ISCTE - Instituto Universitário de Lisboa Doctor António Manuel Raminhos Cordeiro Grilo, Professor Auxiliar do Instituto Superior Técnico da Universidade de Lisboa
Funding Institutions Fundação para a Ciência e a Tecnologia 2016
Acknowledgments

I would like to thank Prof. Agostinho Rosa for the opportunity to carry out this work at LaSEEB, as well as Doctors Vítor Lopes and Rui Martins for their contributions. I am also grateful to my colleagues at LaSEEB and at Instituto Superior Técnico for the help and suggestions they gave me along the way. I would also like to thank my family, who always supported me through the good and bad moments of my academic path, and who provided me with every possible and necessary condition to keep studying. A word of thanks to my godparents for the help they gave me during secondary school in the foundational subjects of electrical engineering, namely Mathematics and Physics, and for encouraging me to enroll in a higher education program related to those areas. This work was funded by Fundação para a Ciência e a Tecnologia (FCT) through grant SFRH/BD/48310/2008.
Abstract

In spatial agent-based models (SABMs) each entity of the system being modeled is uniquely represented as an independent agent. Large scale emergent behavior in SABMs is population sensitive. Thus, the number of agents should reflect the system being modeled, which can be in the order of billions. Models can be decomposed such that each component can be concurrently processed by a different thread. In this thesis, a conceptual model for investigating parallelization strategies for SABMs is presented. The model, PPHPC, captures important characteristics of SABMs. NetLogo, Java and OpenCL (CPU and GPU) implementations are proposed. To confirm that all implementations yield the same behavior, their outputs are compared using two methodologies. The first is based on common model comparison techniques found in the literature. The second is a novel approach which uses principal component analysis to convert simulation output into a set of linearly uncorrelated measures which can be analyzed in a model-independent fashion. In both cases, statistical tests are applied to determine if the implementations are properly aligned. Results show that most implementations are statistically equivalent, with lower-level parallel implementations offering substantial speedups. The PPHPC model was shown to be a valid template model for comparing SABM implementations.

Keywords: Agent-based modeling, parallelization strategies, ODD protocol, standard model, GPGPU, model alignment, docking, multithreading, OpenCL, statistical tests, principal component analysis
Resumo Em modelos territoriais baseados em agentes (MTBA) cada entidade do sistema a ser modelado é representada por um agente independente. O comportamento emergente de agentes em larga escala é sensível ao tamanho da população. Logo, a quantidade de agentes deve reflectir o sistema a ser modelado, que pode ser na ordem de milhares de milhões de indivíduos. Os modelos podem ser decompostos tal que cada componente possa ser processada de forma paralela por um fio de execução diferente. Nesta tese é apresentado um modelo conceptual para investigação de estratégias de paralelização de MTBAs. O modelo, PPHPC, captura características importantes de MTBAs. São propostas implementações em NetLogo, Java e OpenCL (CPU e GPU). De forma a confirmar que todas as implementações têm comportamentos semelhantes, as suas saídas são comparadas através de duas metodologias. A primeira é baseada em técnicas normalmente encontradas na literatura. A segunda é uma abordagem original que utiliza análise de componentes principais para converter as saídas da simulação num conjunto de medidas estatísticas lineares não correlacionadas, que podem ser analisadas de uma forma independente do modelo. Em ambos os casos, testes estatísticos são aplicados às medidas processadas para determinar se as diferentes implementações estão adequadamente alinhadas. Os resultados mostram que a maioria das implementações são estatisticamente equivalentes e que as implementações de mais baixo nível podem aumentar substancialmente a rapidez das simulações. O modelo PPHPC mostrou ser um padrão válido para comparação de implementações de MTBAs. Palavras-chave Modelação baseada em agentes, estratégias de paralelização, protocolo ODD, modelo padrão, GPGPU, alinhamento de modelos, múltiplos fios de execução, OpenCL, testes estatísticos, análise de componentes principais
Contents

Acknowledgments
Abstract
Resumo
Contents
List of Acronyms
List of Figures
List of Tables
List of Algorithms
1 Introduction
2 Background
  2.1 Computer architectures
    2.1.1 Central Processing Units
    2.1.2 Graphical Processing Units
    2.1.3 Heterogeneous systems
    2.1.4 Other architectures
  2.2 Parallel programming languages and APIs
    2.2.1 Extensions and libraries for classical programming languages
    2.2.2 Programming languages with built-in concurrency
    2.2.3 Specialized languages
  2.3 Reference research ABMs
  2.4 Parallel SABM implementations
    2.4.1 Multithreaded implementations
    2.4.2 GPGPU implementations
  2.5 Issues with ABM replication
  2.6 Analyzing and comparing ABMs
    2.6.1 Statistical analysis of model output
    2.6.2 The classical approach for comparing the output of simulation models
    2.6.3 Comparing model output using the classical approach: concrete cases
    2.6.4 Disadvantages of the classical comparison approach
3 Methodology
  3.1 The PPHPC model
    3.1.1 Overview, design concepts and details
    3.1.2 A canonical NetLogo implementation
    3.1.3 Analyzing the output
  3.2 Parallel implementations of the PPHPC model
    3.2.1 Java
    3.2.2 OpenCL
  3.3 Model-independent comparison of simulation models
    3.3.1 Multiple outputs
  3.4 Software packages
    3.4.1 cf4ocl: a C framework for OpenCL
    3.4.2 cl_ops: a library of common OpenCL operations
    3.4.3 SimOutUtils: utilities for analyzing time series simulation output
    3.4.4 micompr: multivariate independent comparison of observations
    3.4.5 micompm: a MATLAB port of micompr
    3.4.6 PerfAndPubTools: tools for software performance analysis and publishing of results
4 Results and discussion
  4.1 Analyzing the output of the PPHPC model
    4.1.1 Experimental setup
    4.1.2 Determining the steady-state truncation point
    4.1.3 Analyzing the distributions of focal measures
    4.1.4 Discussion
  4.2 Performance comparison
    4.2.1 Experimental setup
    4.2.2 The Java implementation
    4.2.3 The OpenCL CPU implementation
    4.2.4 The OpenCL GPU implementation
  4.3 Statistical comparison of implementations
    4.3.1 Assessing the alignment of the Java variants
    4.3.2 Assessing the alignment of the OpenCL implementations
  4.4 Detailed assessment of model-independent comparison
    4.4.1 Experimental setup
    4.4.2 Classic model comparison method
    4.4.3 Model-independent comparison method
    4.4.4 Comparison performance with different sample sizes
    4.4.5 Discussion
5 Conclusions and future work
Bibliography
A Suppl. results on the analysis of PPHPC FMs
B Suppl. results on the assessment of model-indep. comparison
List of Acronyms

ABM      Agent-Based Modeling
ALU      Arithmetic Logic Unit
ANOVA    ANalysis Of VAriance
AP       Agent-Parallel
API      Application Programming Interface
BATS     Bash Automated Testing System
C++ AMP  C++ Accelerated Massive Parallelism
CA       Cellular Automata
CI       Confidence Interval
CLR      Common Language Runtime
CPU      Central Processing Unit
CU       Compute Unit
CUDA     Compute Unified Device Architecture
EP       Environment-Parallel
FITA     Fixed-Increment Time Advance
FM       Focal Measure
FPGA     Field-Programmable Gate Array
FPS      Frames Per Second
FPU      Floating-Point Unit
FSM      Finite State Machine
GLSL     OpenGL Shading Language
GPGPU    General-Purpose Computing on Graphics Processing Units
GPU      Graphics Processing Unit
GUI      Graphical User Interface
HLSL     High-Level Shading Language
IID      Independent and Identically Distributed
JVM      Java Virtual Machine
LaSEEB   Laboratório de Sistemas Evolutivos e Engenharia Biomédica
LCG      Linear Congruential Generator
LP       Logical Processor
MANOVA   Multivariate ANalysis Of VAriance
ODD      Overview, Design concepts, Details
OpenACC  Open Accelerators
OpenCL   Open Computing Language
OpenGL   Open Graphics Library
OpenHMPP Open Hybrid Multicore Parallel Programming
OpenMP   Open Multi-Processing
OS       Operating System
PC       Principal Component
PCA      Principal Component Analysis
POSIX    Portable Operating System Interface
PPHPC    Predator-Prey for High Performance Computing
PRNG     Pseudo-Random Number Generator
RS       Replication Standard
SABM     Spatial Agent-Based Model
SIMD     Single Instruction Multiple Data
SU       Shader Unit
SW       Shapiro-Wilk
TBB      Threading Building Blocks
UML      Unified Modeling Language
XML      eXtensible Markup Language
List of Figures

3.1 NetLogo implementation
3.2 Typical model output for model size 400.
3.3 UML diagram for the multithreaded Java implementation of PPHPC.
3.4 Equal with row synchronization (ER): example of three workers processing nine rows of the simulation grid in parallel.
3.5 SimOutUtils architecture.
3.6 Types of plot provided by the output_plot function.
3.7 PerfAndPubTools architecture.
4.1 Moving average of outputs for model size 400 with w = 10.
4.2 Speedup for parallel variants with 12 workers against single-thread versions, with b = 500 for the OD variant.
4.3 Scalability of the different versions for increasing model sizes.
4.4 Simulation time versus number of workers for the OD variant.
4.5 OD performance with 12 workers.
4.6 Speedup for OpenCL CPU implementation
4.7 Scalability of the OpenCL CPU implementation
4.8 Speedup for OpenCL GPU implementation
4.9 Scalability of the OpenCL GPU implementation
List of Tables

2.1 CUDA, DirectCompute and OpenCL terminology.
3.1 Model state variables by entity.
3.2 Size-related and dynamics-related model parameters.
3.3 Initial model sizes.
3.4 Dynamics-related parameter sets.
3.5 Statistical summaries for each output.
3.6 Values of a generic simulation output.
3.7 Methods for assessing the normality of a data set.
3.8 Parallelization strategies (PS) and their handling of the possible synchronization points.
4.1 Histograms for the several size@set combinations of the arg max Pis FM.
4.2 Q-Q plots for the several size@set combinations of the arg max Piw FM.
4.3 Three statistical summaries for the several sizes of the arg min Pic FM for parameter set 2.
4.4 P-values for the SW test and skewness for the several size@set combinations of the max Eis FM.
4.5 Empirical classification of each FM according to how close it follows the normal distribution.
4.6 Times and speedups for the different versions using both parameter sets and tested model sizes.
4.7 Times and speedups for the OpenCL CPU implementation
4.8 Times and speedups for the OpenCL GPU implementation
4.9 P-values for the classic comparison of FMs from the NetLogo and Java parallel variants.
4.10 P-values for the model-independent comparison of outputs from the NetLogo and Java parallel variants.
4.11 P-values for the classic comparison of FMs from the NetLogo and OpenCL CPU implementations.
4.12 P-values for the model-independent comparison of outputs from the NetLogo and OpenCL CPU implementations.
4.13 P-values for the classic comparison of FMs from the NetLogo and OpenCL GPU implementations.
4.14 P-values for the model-independent comparison of outputs from the NetLogo and OpenCL GPU implementations.
4.15 P-values for the classic model comparison method for model size 400, parameter set 1, and n = 30 runs per configuration.
4.16 Model-independent comparison for model size 400, parameter set 1, and n = 30 runs per configuration.
4.17 Percentage of explained variance and t-test p-values before and after weighted Bonferroni correction of the first four PCs for model size 400, parameter set 1, and n = 30 runs per configuration.
4.18 Number of PCs and MANOVA test p-values for several percentages of explained variance for model size 400, parameter set 1, and n = 30 runs per configuration.
4.19 Percentage of p-values yielded by the specified tests of assumptions which fall within the non-significant and significant intervals.
A.1 Statistics and distributional analysis of the selected FMs for n = 30 replications of the PPHPC model with size 100 and parameter set 1.
A.2 Statistics and distributional analysis of the selected FMs for n = 30 replications of the PPHPC model with size 200 and parameter set 1.
A.3 Statistics and distributional analysis of the selected FMs for n = 30 replications of the PPHPC model with size 400 and parameter set 1.
A.4 Statistics and distributional analysis of the selected FMs for n = 30 replications of the PPHPC model with size 800 and parameter set 1.
A.5 Statistics and distributional analysis of the selected FMs for n = 30 replications of the PPHPC model with size 1600 and parameter set 1.
A.6 Statistics and distributional analysis of the selected FMs for n = 30 replications of the PPHPC model with size 100 and parameter set 2.
A.7 Statistics and distributional analysis of the selected FMs for n = 30 replications of the PPHPC model with size 200 and parameter set 2.
A.8 Statistics and distributional analysis of the selected FMs for n = 30 replications of the PPHPC model with size 400 and parameter set 2.
A.9 Statistics and distributional analysis of the selected FMs for n = 30 replications of the PPHPC model with size 800 and parameter set 2.
A.10 Statistics and distributional analysis of the selected FMs for n = 30 replications of the PPHPC model with size 1600 and parameter set 2.
B.1 P-values for the classic model comparison method for model size 100, parameter set 1, and n = 30 runs per configuration.
B.2 P-values for the classic model comparison method for model size 100, parameter set 2, and n = 30 runs per configuration.
B.3 P-values for the classic model comparison method for model size 200, parameter set 1, and n = 30 runs per configuration.
B.4 P-values for the classic model comparison method for model size 200, parameter set 2, and n = 30 runs per configuration.
B.5 P-values for the classic model comparison method for model size 400, parameter set 1, and n = 30 runs per configuration.
B.6 P-values for the classic model comparison method for model size 400, parameter set 2, and n = 30 runs per configuration.
B.7 P-values for the classic model comparison method for model size 800, parameter set 1, and n = 30 runs per configuration.
B.8 P-values for the classic model comparison method for model size 800, parameter set 2, and n = 30 runs per configuration.
B.9 Model-independent comparison for model size 100, parameter set 1, and n = 30 runs per configuration.
B.10 Model-independent comparison for model size 100, parameter set 2, and n = 30 runs per configuration.
B.11 Model-independent comparison for model size 200, parameter set 1, and n = 30 runs per configuration.
B.12 Model-independent comparison for model size 200, parameter set 2, and n = 30 runs per configuration.
B.13 Model-independent comparison for model size 400, parameter set 1, and n = 30 runs per configuration.
B.14 Model-independent comparison for model size 400, parameter set 2, and n = 30 runs per configuration.
B.15 Model-independent comparison for model size 800, parameter set 1, and n = 30 runs per configuration.
B.16 Model-independent comparison for model size 800, parameter set 2, and n = 30 runs per configuration.
B.17 Percentage of explained variance and t-test p-values before and after weighted Bonferroni correction of the first four PCs for model size 100, parameter set 1, and n = 30 runs per configuration.
B.18 Percentage of explained variance and t-test p-values before and after weighted Bonferroni correction of the first four PCs for model size 100, parameter set 2, and n = 30 runs per configuration.
B.19 Percentage of explained variance and t-test p-values before and after weighted Bonferroni correction of the first four PCs for model size 200, parameter set 1, and n = 30 runs per configuration.
B.20 Percentage of explained variance and t-test p-values before and after weighted Bonferroni correction of the first four PCs for model size 200, parameter set 2, and n = 30 runs per configuration.
B.21 Percentage of explained variance and t-test p-values before and after weighted Bonferroni correction of the first four PCs for model size 400, parameter set 1, and n = 30 runs per configuration.
B.22 Percentage of explained variance and t-test p-values before and after weighted Bonferroni correction of the first four PCs for model size 400, parameter set 2, and n = 30 runs per configuration.
B.23 Percentage of explained variance and t-test p-values before and after weighted Bonferroni correction of the first four PCs for model size 800, parameter set 1, and n = 30 runs per configuration.
B.24 Percentage of explained variance and t-test p-values before and after weighted Bonferroni correction of the first four PCs for model size 800, parameter set 2, and n = 30 runs per configuration.
B.25 Number of PCs and MANOVA test p-values for several percentages of explained variance for model size 100, parameter set 1, and n = 30 runs per configuration.
B.26 Number of PCs and MANOVA test p-values for several percentages of explained variance for model size 100, parameter set 2, and n = 30 runs per configuration.
B.27 Number of PCs and MANOVA test p-values for several percentages of explained variance for model size 200, parameter set 1, and n = 30 runs per configuration.
B.28 Number of PCs and MANOVA test p-values for several percentages of explained variance for model size 200, parameter set 2, and n = 30 runs per configuration.
B.29 Number of PCs and MANOVA test p-values for several percentages of explained variance for model size 400, parameter set 1, and n = 30 runs per configuration.
B.30 Number of PCs and MANOVA test p-values for several percentages of explained variance for model size 400, parameter set 2, and n = 30 runs per configuration.
B.31 Number of PCs and MANOVA test p-values for several percentages of explained variance for model size 800, parameter set 1, and n = 30 runs per configuration.
B.32 Number of PCs and MANOVA test p-values for several percentages of explained variance for model size 800, parameter set 2, and n = 30 runs per configuration.
B.33 P-values for the classic model comparison method for model size 400, parameter set 1, n = 10 runs per configuration.
B.34 Results for the model-independent comparison for model size 400, parameter set 1, and n = 10 runs per configuration.
B.35 P-values for the classic model comparison method for model size 400, parameter set 1, and n = 100 runs per configuration.
B.36 Results for the model-independent comparison for model size 400, parameter set 1, and n = 100 runs per configuration.
List of Algorithms

3.1 Main simulation algorithm.
3.2 Agent reproduction.
3.3 Realization of the main simulation algorithm in the SimWorker class.
Chapter 1
Introduction

Agent-based modeling (ABM) is a bottom-up modeling approach, where each entity of the system being modeled is uniquely represented as an independent decision-making agent. When prompted to act, each agent analyzes its current situation (e.g., what resources are available, what other agents are in the neighborhood), and acts appropriately, based on a set of rules. These rules express knowledge or theories about the respective low-level components. The global behavior of the system results from the simple, self-organized local relationships between the agents [67]. As such, ABM is a useful tool in simulating and exploring systems that can be modeled in terms of interactions between individual entities, e.g., biological cell cultures, ants foraging for food or military units on a battlefield [148]. In practice, ABM can be considered a variation of discrete-event simulation, since state changes occur at specific points in time [133].

Spatial agent-based models (SABMs) are a subset of ABMs in which a spatial topology defines how agents interact [215]. For example, an agent may be limited to interacting with agents located within a specific radius, or may only move to a nearby physical or geographical location [147]. SABMs have been extensively used to study a range of phenomena in the biological and social sciences [112, 215]. While cellular automata (CA) models [41] are also a very simple type of SABM [147], this work focuses on models with moving agents.

Large scale emergent behavior in ABMs is population sensitive. As such, it is preferable that the number of agents in a simulation is able to reflect the reality of the system being modeled [110, 118, 145]; otherwise, the expected or desired behavior may not be observable, and model validation becomes difficult [97, 110]. In domains such as social modeling, ecology, and biology, systems can contain millions or billions of individuals [56, 57, 118, 180]; consequently, simulating realistic models of these systems requires as many agents to be processed per time step [56]. Such large scale simulations generate a very high demand for computing power [96] and are impractical on typical ABM frameworks such as NetLogo [262] or Repast [167], which execute serially on the CPU [56, 57]. Additionally, stochastic models in general, and ABMs in particular, usually require various input parameters, which can have a range of different values. Large-scale computational experiments are required for exploring the parameter space of such models [96, 105, 231]. These requirements stretch,
and many times surpass, what typical off-the-shelf computing systems can offer, especially if models are implemented in a way that only makes use of one processing element (PE), such as a CPU core. Considering that commodity processors, such as GPUs and multi-core CPUs, are nowadays composed of several PEs, a natural solution to reach acceptable scalability in ABMs consists of decomposing models such that each component can be independently processed by a logical processor (LP) in a concurrent manner [44, 97, 215, 230–232, 252]. (In shared memory architectures, LPs are usually represented by threads, which communicate via synchronized access to shared variables; in distributed memory scenarios, LPs are commonly represented by processes, which communicate via message passing.) There are, however, two main issues when parallelizing ABMs.

The first major issue concerns communication between model components and the bottleneck it creates, which is a major limiting factor in scaling parallel SABMs [215, 232]. This is especially true in distributed memory scenarios, where different computational cores may be located in separate, often geographically distant, nodes [97]. Communication costs may suppress the potential gains of using multiple nodes and their associated resources [252]. Many strategies and methods have been developed to manage and reduce communication in distributed memory SABMs [97, 209]; nonetheless, this is still a topic of active research [215]. Whatever the scenario, model partitioning should guarantee that each model component is as independent as possible in order to minimize communication between LPs. Furthermore, communication strategies should be designed to avoid deadlocks and to preserve the causality of simulation events while efficiently exploiting parallelism [105, 215].

The second major issue when parallelizing ABMs is that it is very easy to inadvertently introduce changes which modify model dynamics. This is akin to model replication, which is not a straightforward process [61, 265]. ABMs are very sensitive to development decisions, and apparently minor implementation details can significantly skew results [157, 265]. The situation becomes more difficult with model parallelization, which by definition requires considerable changes in many of these aspects. There are even reports of failure in converting a serial model into a parallel one [175]. Unfortunately, the lack of transparency in model descriptions constrains how models are assessed, compared and replicated [162]. This is in part a consequence of the fact that knowledge on how to replicate and compare the results of a replication is not pervasive within the ABM community [265].

Conceptual models should be well specified and adequately described in order to achieve a successful model replication. As such, some authors have suggested that there should be a minimum standard for model communication, which should include at the very least: a) a structured natural language description [94]; and, b) the model's source code, given that it is the model's definitive implementation, not subject to the vagueness and uncertainty possibly associated with verbal descriptions [162, 265]. The ODD protocol (Overview, Design concepts, Details) is currently one of the most widely used templates for making model descriptions more understandable and complete, providing a comprehensive checklist that covers virtually all the key features that can define a model [94]. It allows modelers to communicate their models using a natural language description within a prescriptive and hierarchical structure, aiding in model design and fostering in-depth model comprehension [162]. It is the recommended approach for documenting models in the CoMSES Net Computational Model Library [196]. However, the ODD protocol does not deal with models from a results or simulation output perspective, which means that an additional section for statistical analysis of results is often required. Nonetheless, these recommendations are not often followed in practice. While many ABMs have been published and simulation output analysis is a widely discussed subject matter [119, 132, 133, 163, 204], comprehensive inquiries concerning the output of ABM simulations are hard to find in the scientific literature.

In this thesis, a conceptual model for investigating parallelization strategies for SABMs is proposed. The model, Predator-Prey for High-Performance Computing (PPHPC), captures important characteristics of SABMs, such as agent movement and local agent interactions [70]. It aims to serve as a standard in agent-based modeling research, and was designed with several goals in mind:

1. Provide a best practices example on complete model specification and thorough simulation output analysis [70].
2. Investigate statistical comparison strategies for model replication [72].
3. Compare different implementations from a performance point of view, using various frameworks, programming languages, hardware and/or parallelization strategies [73].
4. Test the influence of different pseudo-random number generators (PRNGs) on the statistical accuracy of simulation output.

Four PPHPC implementations are proposed:

1. A NetLogo implementation, the canonical version of the model to which other implementations can be compared in terms of computational performance and dynamic behavior.
2. A Java implementation, with several user-selectable multithreaded parallelization schemes. The main goal of this implementation is to study how different parallelization strategies impact simulation performance on a shared memory architecture.
3. An OpenCL CPU implementation, which provides a good indication of how a low-level version performs against an optimized parallel high-level approach, i.e., the Java implementation.
4. A preliminary OpenCL GPU implementation, which will serve to assess the feasibility of GPUs for simulating SABMs.
Care is taken so that the four implementations yield statistically equivalent behavior. To achieve this goal, the various implementations are compared using two methodologies: a) a model-specific approach, commonly found in the literature, consisting of the selection and comparison of statistical measures derived from simulation output; and, b) a novel technique which uses principal component analysis (PCA) [114] to convert simulation output into a set of linearly uncorrelated statistical measures which can be analyzed in a consistent, model-independent fashion. In both cases, statistical tests are applied to the processed measures in order to determine if the different implementations are properly aligned.

The main contributions of this thesis are:

• The specification of the PPHPC model for research in agent-based modeling and simulation.
• A thorough study on the parallelization of SABMs, spanning vertical and horizontal computational axes. The vertical axis involves different programming languages and distinct computer architectures, while the horizontal axis covers various parallelization techniques using the same language and architecture.
• A novel model-independent technique for comparing the output of simulation models.
• Six open source software packages, developed in the context of the previous three contributions.
Additionally, a number of papers directly associated with the work presented here have been published in academic journals, or are currently under review as part of the publishing process [69, 70, 72–74, 76]. Naturally, parts of this work are adapted from these papers.

The remainder of this thesis is organized as follows. In Chapter 2 we describe the technologies required for the development and implementation of the PPHPC model, both from a hardware and software perspective, and outline the possible alternatives. We also discuss a number of reference ABMs, previous attempts at parallelizing such models, and the commonly used techniques for statistical comparison and/or alignment of ABMs. The methodological and theoretical aspects developed during the course of this work are presented in Chapter 3. These include, but are not limited to: a) a formal description of the PPHPC model using the ODD protocol; b) details of the several model implementations; c) an explanation of the proposed model-independent comparison technique; and, d) an exposition of the implemented software packages. Results, presented in Chapter 4, show that lower-level parallel implementations can offer speedups up to 193× over the canonical NetLogo version. The statistical equivalence of all implementations, with the exception of the OpenCL GPU implementation, is confirmed with model-specific techniques and with the proposed model-independent approach. The latter is also shown to be in accordance with the former when implementations are intentionally misaligned. Finally, in Chapter 5, several conclusions are drawn on SABM parallelization techniques, model alignment methods and the usefulness of PPHPC as a valid standard model for ABM research. The chapter closes with an exploratory discussion concerning future work involving the topics addressed in this thesis.
Chapter 2
Background

In this chapter we start by describing the technologies used for the development and implementation of the PPHPC model, both from a hardware (Section 2.1) and software (Section 2.2) perspective, and outline possible alternatives. We then focus on the models themselves, discussing common ABMs used in research (Section 2.3) and evaluating previous attempts at parallelizing SABMs (Section 2.4). An overview of the difficulties in replicating ABMs is presented in Section 2.5, and previous efforts towards the statistical analysis and/or comparison of ABMs are reported in Section 2.6.
2.1 Computer architectures
The PPHPC model is implemented on two distinct computer architecture families, namely the CPU and the GPU, which are discussed in the following sections. We also examine the recent trend towards heterogeneous systems, which combine CPU and GPU components, and briefly cover other architectures.
2.1.1 Central Processing Units
A central processing unit (CPU) is the computer hardware component that executes the instructions of a computer program by performing arithmetic, logical, control and input/output (I/O) operations specified by the instructions. From a simplified perspective, CPUs generally process instructions using three steps: 1) fetch an instruction from memory; 2) decode the instruction into signals that control other parts of the CPU; and, 3) execute the instruction [102]. In most modern CPUs, multiple instructions can be fetched, decoded, and executed simultaneously through instruction pipelining. Instructions are split into several stages, which are processed separately in the instruction pipeline. This allows for long instruction pipelines with multiple identical execution units to process separate parts of different instructions in parallel. Nonetheless, there is the possibility that one instruction needs the result of the previous one before it can be executed, causing pipeline stalls which slow down instruction throughput. CPUs have several ways of dealing with this, such as branch
prediction, speculative execution and out-of-order execution, which require additional electronics, namely considerable amounts of cache memory [102].

Two or more individual CPU processors (designated “cores” in this context) can be merged together to form a multi-core CPU. Though they are individual processors, these cores share several components, such as the main memory and higher cache levels. Thus, simply adding more cores does not necessarily mean a corresponding linear increase in computational power. Additionally, technologies such as Hyper-Threading [149] and the “modules” introduced in AMD's Bulldozer architecture [34] effectively double some of the resources available in each core, allowing the OS to address two logical cores per physical core. While this is not as effective as having more individual cores, it allows the use of CPU resources that would otherwise be idle.

CPUs are general-purpose processors, designed to run a very wide variety of programs as efficiently as possible, and therefore they package very complex logic and large caches [52].
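As a brief illustration of the distinction between physical cores and the logical cores exposed by the OS, the following minimal C sketch (not part of the thesis implementations) queries the number of logical processors currently online; it assumes a POSIX-like system, since sysconf with _SC_NPROCESSORS_ONLN is a common extension available on Linux and most other Unix-like OSes rather than a strict POSIX requirement.

```c
#include <stdio.h>
#include <unistd.h>  /* sysconf(), Unix-like systems only */

int main(void) {
    /* Logical processors currently online: physical cores times hardware
     * threads per core on Hyper-Threading/"module" designs. */
    long lps = sysconf(_SC_NPROCESSORS_ONLN);
    if (lps < 1) {
        fprintf(stderr, "Unable to determine the number of logical processors\n");
        return 1;
    }
    printf("Logical processors available: %ld\n", lps);
    return 0;
}
```

A value of, say, 12 on a 6-core Hyper-Threading CPU is the kind of figure the parallel PPHPC variants use when deciding how many worker threads to launch.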
2.1.2 Graphical Processing Units
A GPU is a dedicated processor for accelerating the creation and manipulation of images in a frame buffer, with the primary goal of outputting them to a display. Early GPUs had a fixed graphics pipeline, encompassing various stages of vertex and pixel processing, such as geometry, rasterization and/or texture mapping. Graphical manipulation APIs, such as OpenGL [216] and Direct3D [245], exposed the graphics pipeline to the user, such that the release of new GPUs with additional pipeline stages was closely associated with the release of new API versions. However, fixed hardware also meant inflexible graphics processing, and eventually certain stages of the graphics pipeline became programmable [143]. Special-purpose shading languages (“a specialized programming language used for describing the appearance of a scene” [130]) were made available for this purpose, most notably GLSL (for OpenGL), HLSL (for Direct3D) and Cg (an API-independent language for Nvidia GPUs [78]).

General-purpose computing on GPUs (GPGPU) began at this time, as the scientific community noticed the raw power of GPUs for applying transformations to a large set of elements in parallel, and rapidly began experimenting with matrix and vector operations, which were the simplest to implement in shader programs (a shader program, or simply shader, instructs the GPU on how to draw a specific scene) [81, 130]. However, using shaders for general-purpose programming was cumbersome [66], which led to the emergence of the first GPGPU languages, namely BrookGPU [33], Close-to-Metal [3], Accelerator [234] and Sh [155]. These languages exposed previously unavailable functionality in shader programs, such as loops and dynamic flow control.

As GPU architectures increasingly centered around shader execution, the fixed-function stages of the graphics pipeline were completely replaced by multiple programmable shader pipelines in a unified shader architecture, where different shader types have similar capabilities. This paradigm shift effectively transformed the modern GPU into a stream processor, i.e., a processor with multiple processing elements which allow massive parallel processing without explicit management of allocation and communication among those elements, for example by executing a function on many records in a stream at the same time [117]. This allows all pipeline stages to be concurrently used for distinct graphical data moving through the pipe. The concept of a graphics pipeline was relegated to a software abstraction in graphics APIs. Specialized shaders became threads running different programs on flexible shader pipelines [143].

Currently, CUDA [168] (for Nvidia GPUs) and OpenCL [107] (an open standard), discussed in further detail in Section 2.2.3, are the two main GPGPU languages for developing parallel applications [58]. Though both frameworks provide a low-level representation of the GPU architecture, CUDA, being vendor-specific, is closer to Nvidia hardware, while OpenCL, which also supports CPUs and other processors, is slightly more generic.

Modern GPU architectures, and their associated terminology, can differ substantially from vendor to vendor and from generation to generation. Thus, the following discussion will use OpenCL terminology and focus on what is common between modern GPU iterations, such as the fact that they are essentially based on the unified shader architecture. A GPU is composed of an array of compute units (CUs) and a dynamic scheduling unit that provides CUs with shader work. In turn, each CU contains a number of shader units (SUs), which usually correspond to an ALU or a combination of ALU and FPU. However, the exact SU configuration can differ substantially between GPU architectures. GPUs can be considered a type of SIMD processor, where each CU corresponds to a SIMD core, and each SU to a SIMD lane.

GPUs usually have at least three different levels of memory, namely, from fastest to slowest, SU-local memory, CU-shared memory and global memory. Both CUDA and OpenCL allow (and, in terms of performance, often require) the programmer to directly manipulate the different levels of memory. Overall, the GPU memory bandwidth is an order of magnitude above that of the main system memory [189]. However, transferring data between host (CPU) and device (GPU) memory penalizes performance due to the system bus bandwidth and latency.

In contrast with CPUs, most of the GPU die area is dedicated to data processing, with small areas dedicated to scheduling, flow control and caching [52, 139, 189]. It is the large number of overall SUs that provides GPUs with computing capabilities that can be exploited by parallel applications [246]. Applications which handle large data sets with independent data elements, and that do not require frequent data transfers between GPU and host, i.e., which have high arithmetic intensity (number of operations performed per word of memory transferred), are the most appropriate for the GPU [139]. At the bare minimum, a typical GPGPU program transfers data to GPU memory, applies a sequence of operations to each data element in parallel using a kernel function, and then retrieves the result back to host memory.

An important issue to consider when embracing GPGPU is that GPU code is very sensitive to small changes. GPGPU APIs expose hardware implementation details, forcing the programmer to consider relative component clock rates, bus widths, vector widths, memory and buffer sizes in order to obtain good performance [122]. Since OpenCL is supported by all major GPU vendors (which offer GPUs with substantially different architectural aspects), additional care is required in order to obtain performance portability [58].
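To make the minimal GPGPU workflow described above concrete (transfer data to the device, run a kernel over each element in parallel, read the results back), the following OpenCL 1.x-style host-plus-kernel sketch in C is illustrative only; it is not taken from the PPHPC implementations, the buffer and kernel names and the work-group size are arbitrary, and error checking is omitted for brevity.

```c
#include <stdio.h>
#include <CL/cl.h>

#define N 1024

/* Device code: one work-item squares one array element. */
static const char *src =
    "__kernel void square(__global float *v) {"
    "    size_t i = get_global_id(0);"
    "    v[i] = v[i] * v[i];"
    "}";

int main(void) {
    float host_v[N];
    for (int i = 0; i < N; i++) host_v[i] = (float) i;

    /* Setup: platform, device, context and command queue (error checks omitted). */
    cl_platform_id platform;
    cl_device_id device;
    clGetPlatformIDs(1, &platform, NULL);
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, NULL);
    cl_context ctx = clCreateContext(NULL, 1, &device, NULL, NULL, NULL);
    cl_command_queue queue = clCreateCommandQueue(ctx, device, 0, NULL);

    /* 1) Transfer data to device memory. */
    cl_mem buf = clCreateBuffer(ctx, CL_MEM_READ_WRITE | CL_MEM_COPY_HOST_PTR,
                                sizeof(host_v), host_v, NULL);

    /* Build the kernel from source. */
    cl_program prog = clCreateProgramWithSource(ctx, 1, &src, NULL, NULL);
    clBuildProgram(prog, 1, &device, NULL, NULL, NULL);
    cl_kernel krnl = clCreateKernel(prog, "square", NULL);
    clSetKernelArg(krnl, 0, sizeof(cl_mem), &buf);

    /* 2) Launch N kernel instances (work-items), here 64 per work-group. */
    size_t gws = N, lws = 64;
    clEnqueueNDRangeKernel(queue, krnl, 1, NULL, &gws, &lws, 0, NULL, NULL);

    /* 3) Retrieve the results back to host memory (blocking read). */
    clEnqueueReadBuffer(queue, buf, CL_TRUE, 0, sizeof(host_v), host_v, 0, NULL, NULL);
    printf("v[10] squared = %f\n", host_v[10]);

    /* Cleanup. */
    clReleaseKernel(krnl); clReleaseProgram(prog); clReleaseMemObject(buf);
    clReleaseCommandQueue(queue); clReleaseContext(ctx);
    return 0;
}
```

In a real application every call returns an error code that should be checked, and the work-group size would be chosen according to the device and kernel characteristics; this boilerplate is precisely what the cf4ocl framework described in Section 3.4.1 is meant to reduce.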
2.1.3 Heterogeneous systems
In the last few years, Intel and AMD have been releasing systems which bring together CPU and GPU CUs in the same integrated circuit [273]. All CUs share data and access the same memory, making communication simpler and more efficient by avoiding data transfers between the CPU and GPU over the system bus. Although systems from both vendors support OpenCL work sharing between CPU and GPU CUs, AMD seems to be pushing for tighter component integration by leading the Heterogeneous System Architecture (HSA) foundation, blurring the line between different types of CU [195]. In spite of the fact that integrated GPUs are more limited in both number of CUs and memory speed than discrete GPUs, several authors have reported successful speedups of various computational workloads on heterogeneous chips [48, 51, 273].
2.1.4 Other architectures
A number of alternative computing architectures do not fall within the typical definition of CPU or GPU. The Cell Broadband Engine Architecture is one such example, merging a general-purpose CPU with specialized vector coprocessors, adequate for multimedia and SIMD computation [40]. A field-programmable gate array (FPGA) is a software-configurable integrated circuit, and as such, can be specifically set up for particular computational workloads. This allows FPGAs to be highly competitive with general-purpose processors in contexts of parallel and vectorized operations [46]. The Intel Xeon Phi is a family of coprocessor boards for high-performance computing based on the x86 architecture, with support for multiple threads per core (similar to Hyper-Threading) and SIMD/vector operations [113]. Boards typically contain more than 50 cores, and their compatibility with the x86 architecture allows them to leverage a number of existing parallelization software tools. While there are many and often dissimilar ways of programming the alternative architectures presented here, OpenCL, discussed in Section 2.2.3, is available for all of them [46, 113, 224].
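As a small illustration of this last point, the OpenCL host API itself can enumerate whatever devices are present, whether CPUs, GPUs or other accelerators. The C sketch below is illustrative only (it is not part of the thesis software, error checking is omitted, and the fixed array sizes are an arbitrary simplification); it lists each device's name, type and number of compute units.

```c
#include <stdio.h>
#include <CL/cl.h>

int main(void) {
    cl_platform_id platforms[8];
    cl_uint num_platforms = 0;
    clGetPlatformIDs(8, platforms, &num_platforms);
    if (num_platforms > 8) num_platforms = 8;  /* only the first 8 were fetched */

    for (cl_uint p = 0; p < num_platforms; p++) {
        cl_device_id devices[16];
        cl_uint num_devices = 0;
        /* CL_DEVICE_TYPE_ALL matches CPUs, GPUs and other accelerators alike. */
        clGetDeviceIDs(platforms[p], CL_DEVICE_TYPE_ALL, 16, devices, &num_devices);
        if (num_devices > 16) num_devices = 16;

        for (cl_uint d = 0; d < num_devices; d++) {
            char name[256];
            cl_device_type type;
            cl_uint cus;
            clGetDeviceInfo(devices[d], CL_DEVICE_NAME, sizeof(name), name, NULL);
            clGetDeviceInfo(devices[d], CL_DEVICE_TYPE, sizeof(type), &type, NULL);
            clGetDeviceInfo(devices[d], CL_DEVICE_MAX_COMPUTE_UNITS, sizeof(cus), &cus, NULL);
            printf("%s: %s device with %u compute units\n", name,
                   type == CL_DEVICE_TYPE_GPU ? "GPU" :
                   type == CL_DEVICE_TYPE_CPU ? "CPU" : "accelerator/other", cus);
        }
    }
    return 0;
}
```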
2.2 Parallel programming languages and APIs
Parallel programming on commodity multi-processors is made possible by a set of programming technologies, which can be roughly divided into three groups: 1) extensions and libraries for classical programming languages, such as C or C++; 2) programming languages with built-in concurrency; and, 3) specialized languages which can be invoked by general-purpose programming languages.
2.2.1 Extensions and libraries for classical programming languages
A number of extensions and libraries have been proposed for adding multi-threading capabilities to languages such as C and C++. Pthread is a POSIX standard which defines a set of low-level threading constructs for the C programming language [138]. Pthread implementations come in the form of a library which is usually bundled with most UNIX operating systems (OSes), although Windows implementations are also available. Its explicit and fine-grained management of threads and synchronization provides a solid foundation on which to build multi-threaded applications. However, the associated complexity of full thread life-cycle management can be a strong barrier of entry, and the tight coupling of its constructs with classical shared-memory multithreading limits its usage for broader parallelism, such as GPUs and other accelerators. The Threads module of the GLib C library aims to provide a portable means for writing multi-threaded software [87]. It uses the underlying OS native threading system, i.e., Pthread for UNIX OSes and Windows threads for the Windows OS, abstracting the differences from the programmer. OpenMP is a higher level and more portable threading API for C, C++ and Fortran, providing a simpler and more productive programming model [49]. It is based on compiler directives (also referred to as pragmas), which are treated as comments by unsupported compilers, allowing the same code to be used in serial and parallel contexts. Since version 4.0, OpenMP also supports GPUs and other accelerators [140]. Being primarily based on compiler directives, support for OpenMP must be provided at the compiler level. OpenACC is a more recent standard, similar to OpenMP, but focused solely on compiler directives for accelerators, simplifying parallel programming of heterogeneous CPU and GPU systems [259]. It is a more agile and less mature API than OpenMP, where performance portability is a primary concern [129]. It can also be used with C, C++ and Fortran, and is fully interoperable with CUDA [21]. There are several C++ specific multithreading libraries, most notably Boost.Thread (part of the Boost C++ libraries [208]) and TBB [181]. While both libraries offer a portable C++ threading mechanism, Boost.Thread is mostly focused on providing classes and functions for low-level thread life-cycle management and thread synchronization. On the other hand, TBB abstracts low-level threading aspects by mapping algorithms to task graphs, handling details such as the optimal number of threads and load balancing. C++ AMP extends the C++ programming language and its runtime library by providing a way to write programs that compile and execute on data-parallel hardware, such as GPUs [158]. The original implementation by Microsoft, based on DirectX 11 compute shader, targeted the Visual Studio 2012 compiler. However, being an open specification, third-party cross-platform implementations have since become available [214, 225].
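The directive-based style used by OpenMP (and, for accelerators, OpenACC) can be illustrated with the following minimal, hypothetical C sketch; the array and its contents are arbitrary, and the point is only that the pragma is treated as a comment by compilers without OpenMP support, so the same source runs either serially or in parallel.

```c
#include <stdio.h>

#define N 1000000

int main(void) {
    static double cell[N];

    /* With OpenMP enabled (e.g., gcc -fopenmp), loop iterations are divided
     * among threads by the runtime; otherwise the directive is ignored and
     * the loop simply runs serially, which is why one source serves both. */
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        cell[i] = 1.5 * i;   /* iterations are independent, hence safe to parallelize */

    printf("cell[%d] = %f\n", N - 1, cell[N - 1]);
    return 0;
}
```

Lower-level libraries such as Pthread or Boost.Thread would instead require explicitly creating threads, partitioning the index range and joining the workers, which is the extra life-cycle management referred to above.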
2.2.2 Programming languages with built-in concurrency
Several programming languages have multithreading capabilities within their standard library. These range from general-purpose to specifically parallel languages.

Java is a general-purpose, object-oriented computer programming language, and is usually compiled to bytecode that runs on a Java Virtual Machine (JVM) [91]. While low-level support for multithreading has been available in Java from the beginning, version 5.0 marked a huge step forward for the development of concurrent applications, offering new higher-level components and additional low-level mechanisms [88].

The Clojure programming language is a general-purpose dialect of Lisp which runs on the JVM, but implementations also exist for the CLR (the virtual machine component of the .NET framework from Microsoft) and JavaScript. It tries to solve existing problems in how Java and C# deal with concurrency (i.e., via native threads and locking) by using a functional programming approach with immutable data structures and offering a software transactional memory system for concurrent lock-free access to mutable state [103].

Erlang is a general-purpose functional programming language designed for building parallel, distributed and fault-tolerant systems. The language is centered around a type of lightweight processes, which communicate via asynchronous message passing instead of shared variables, removing the need for explicit locks. Erlang is widely used in commercial and industrial contexts by entities who need scalable and efficient fault-tolerant applications [7].

Go is an expressive and concise programming language with explicit support for concurrent programming. Although it is a general-purpose language, it was aimed primarily at systems programming. Go is inspired by C, but makes many changes to improve simplicity and safety, such as built-in concurrency primitives. These consist of lightweight processes (goroutines) which communicate via message passing through channels, in a similar fashion to Erlang, although shared memory communication is also allowed [226].

Chapel is a recent programming language designed for productive and scalable parallel computing. It is based on a high-level multithreaded programming model which supports abstractions for data parallelism, task parallelism, nested parallelism and data locality-wise placement of subcomputations. Its multiresolution philosophy allows users to write code which is initially highly abstract, but to which more detail can be iteratively added until it is as close to the hardware as required. Chapel runs on a variety of machines, from simple laptops to high-end supercomputers, for which it was designed [36, 37].

Unified Parallel C is a parallel programming language which extends ANSI C by allowing programmers to exploit data locality and parallelism. The language was conceived for high-performance computing on large-scale parallel machines, with either shared and/or distributed memory architectures. Unified Parallel C exposes a single shared and partitioned address space, where data, although physically associated with a single processor, is accessible to all processors. In order to achieve this functionality, the language provides
synchronization primitives, explicit communication primitives and memory management primitives [62]. An extension to take advantage of GPU clusters has also been proposed [39]. Cilk Plus allows developers to exploit thread and vector parallelism commonly available in modern hardware. The language extends C and C++ with constructs to express parallel loops and the fork-join idiom. The programmer is responsible for identifying elements that can be concurrently processed, while the decision on how to actually divide the work between processors is left to the runtime environment. As such, Cilk Plus provides automatic load balancing which adapts to distinct host environments [194]. Finally, it is worth noting that recent versions of C and C++, namely C11 and C++11, have added multithreading capabilities to their respective standard libraries. However, the availability of such features is still not widespread, as these depend on compiler support and on the C/C++ libraries shipped with different OSes and/or development environments [16].
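As a minimal illustration of the C11 facilities just mentioned, the following sketch uses the optional <threads.h> header; it only builds on toolchains whose standard library actually ships it, which is precisely the availability issue referred to above.

#include <stdio.h>
#include <threads.h>   /* optional C11 header; not shipped by all toolchains */

/* Thread entry point: C11 threads use int (*)(void *) start functions. */
static int worker(void *arg)
{
    printf("Hello from thread %d\n", *(int *) arg);
    return 0;
}

int main(void)
{
    thrd_t t;
    int id = 1;
    if (thrd_create(&t, worker, &id) == thrd_success)
        thrd_join(t, NULL);    /* wait for the thread to finish */
    return 0;
}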
2.2.3 Specialized languages
There is a class of specialized languages which can be compiled to run on compute devices in the form of computing kernels. A compute device can be a GPU, a CPU or any kind of special-purpose accelerator. Device programs, or kernels, are invoked from general-purpose programming languages running on a host. These specialized languages are defined, among other things, by a clear separation between the host, from which kernels are invoked, and the device, on which they are executed. The host APIs provide functions to manage allocation and deallocation of device memory, perform data transfers between host and device or between devices, launch kernels, and so on. Kernel functions, which operate on device memory, are invoked with a predefined number of instances, usually divided into groups such that each group is executed on one device CU, leveraging shared CU resources, such as faster memory and intra-group synchronization. In turn, each kernel instance in a group is executed on a PE, usually a SIMD lane. The relation between the total number of kernel instances, the number of kernel instance groups and the number of instances per group is given by Eq. 2.1.

No. of kernel instance groups = (Total no. of kernel instances) / (No. of kernel instances per group)    (2.1)
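In practice, the total number of kernel instances is not always an exact multiple of the group size, so host code typically rounds the group count up and pads or guards the extra instances. A small illustrative helper, not tied to any particular API:

#include <stddef.h>

/* Number of groups needed to cover `total` kernel instances with groups of
 * `per_group` instances each, rounding up when Eq. 2.1 is not exact. */
size_t num_groups(size_t total, size_t per_group)
{
    return (total + per_group - 1) / per_group;
}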
Three specialized languages will be discussed here: CUDA, DirectCompute and OpenCL. While these languages mostly embrace the same concepts, they refer to them with different terms. Table 2.1 summarizes these concepts and shows how they are addressed by the different languages. For the remainder of this work we will use the OpenCL terminology whenever possible.
Concept                                       CUDA            DirectCompute        OpenCL
Kernel instance                               Thread          Thread               Work-item
Group of kernel instances                     Block           Thread group         Work-group
Set of all kernel instance groups             Grid            Dispatch             N-D range
Memory local to group of kernel instances     Shared memory   Groupshared memory   Local memory
Memory local to a kernel instance             Local memory    Local memory         Private memory

Table 2.1 – CUDA, DirectCompute and OpenCL terminology.
CUDA

CUDA is a general-purpose parallel computing platform and programming model that uses the parallel computing capabilities of Nvidia GPUs [168]. However, in this discussion we focus on the CUDA language extensions (a subset of C/C++ or Fortran) and the CUDA host API. CUDA is exposed to users through C, C++ and Fortran extensions, although it is also available for other languages via third-party libraries (e.g., PyCUDA [122]). In the case of C, C++ and Fortran, host and device code are commonly placed in the same source files. Device code appears in the form of specifically annotated kernel functions, which are invoked with a special syntax for specifying the number of kernel instance groups and the number of kernel instances per group (number of blocks and number of threads per block in CUDA terminology). Since plain C/C++/Fortran code is mixed with CUDA-specific syntax, a special-purpose compiler is required to perform the following steps: 1) separate device code from host code; 2) compile device code into assembly or binary form; 3) modify host code by replacing CUDA-specific kernel invocation calls with CUDA C library function calls which load and launch each compiled kernel; and, 4) compile host code. The CUDA runtime hides part of the complexity of writing parallel programs for the GPU by implicitly managing functionality such as contexts, module management, command queues (streams in CUDA terminology) and/or events. This results in more concise code, simplifying and accelerating development. However, the runtime is built on top of a lower-level C API, the CUDA driver API, which is also accessible to host programs. The CUDA driver API provides additional levels of control by exposing the aforementioned functionality. CUDA was originally released in 2007 for all major OSes, and is the oldest GPGPU toolkit still in use. With almost a decade of development and experience, it presents a very solid foundation for developing GPGPU applications, with numerous official and third-party tools and libraries, several of which have since been integrated in the CUDA platform itself, e.g., cuBLAS (an implementation of BLAS, the Basic Linear Algebra Subprograms), cuFFT (a library for efficiently computing discrete Fourier transforms of complex or real-valued data sets), cuRAND (which provides simple and efficient generation of high-quality pseudorandom and quasirandom numbers) or Thrust [17]. New versions of CUDA closely follow new Nvidia GPU launches, which allows it to offer new GPGPU features before competing platforms. Unfortunately, CUDA is a proprietary technology with direct support for Nvidia GPUs only.
DirectCompute

DirectCompute (also known as Compute Shader) is a Microsoft API for GPGPU for the Windows OS, and is supported by the three main GPU vendors, Nvidia, AMD and Intel. DirectCompute exposes the general-purpose functionality of the GPU via compute shaders. Like CUDA, a compute shader (i.e., a kernel function) is invoked by dispatching a specified regular grid of threads. DirectCompute programs decompose parallel work into groups of threads in order to solve a computational problem [164]. In spite of being GPU vendor-independent, DirectCompute is a proprietary technology which is only available for Windows.

OpenCL

OpenCL is an open standard for general-purpose parallel programming across CPUs, GPUs, FPGAs and other processors, giving software developers portable and efficient access to the power of these heterogeneous processing platforms [107, 224]. Such a diversity of architectures presents an equally diverse set of programming models. As such, OpenCL is ambitious in the sense that it aims to provide a unified programming model that can work with all types of processor. Although developers always have to take architectural details into consideration, a standardized development framework can reduce the learning curve and accelerate adoption of new processors. Since OpenCL implementations are provided by a large and increasing number of CPU, GPU and FPGA vendors (see https://www.khronos.org/conformance/adopters/conformant-products#opencl), the strategy seems to be working. An OpenCL implementation includes at least two components. The first consists of a C host API (with C++ bindings), which provides functions for device memory management, kernel execution and so on. The second is a toolchain for compiling the OpenCL language for the target device. Depending on the device in question, a device driver may be necessary, as is the case with GPU devices. A system can have multiple OpenCL devices from different vendors via an installable client driver (ICD) loader, which exposes multiple separate vendor ICDs. An application written against the ICD loader will be able to access all exposed OpenCL platforms and associated devices, with the ICD loader acting as a demultiplexer. The separation between host and device code is much clearer in OpenCL than in CUDA. Host code is regular C or C++ code which uses the functionality provided by the OpenCL host API to achieve its goals. Device code (OpenCL C) containing program kernels is passed in the form of a character array (i.e., a string) and is compiled at runtime, before being sent to the target device for execution. This way, OpenCL applications are portable between implementations for different computing devices [214]. This approach also makes it possible to optimize kernel code using information which is only available at runtime, e.g., the capabilities of the underlying device, specific computation parameters or computation size. This can be done, for example, by using preprocessor macros. In the limit, highly optimized kernels can be assembled using real-time code generation, which is the strategy followed by
the PyOpenCL library [122]. Nonetheless, it is also possible to pass pre-compiled binary code to the OpenCL implementation, thus avoiding the compilation overhead. The OpenCL specification provides a very low-level programming interface, which is nonetheless portable and flexible. Contrary to CUDA, which offers a higher-level API on top of the CUDA driver API, OpenCL requires the programmer to explicitly describe every step of the process of setting up and executing parallel code on compute devices. Built-in functionality is scarce: there is no error handling or automatic resource management. End users must implement this functionality themselves. This means that even the simplest OpenCL host programs involve a lot of error-prone and repetitive boilerplate code. The problem is minimized when using bindings or wrappers for other programming languages, which abstract away much of the associated complexity. Examples include the official C++ bindings [121], and third-party bindings for languages such as Python [122], Haskell [83], Java [100] or R [242]. In the case of C++, and in spite of the official bindings, the number of wrappers and abstraction libraries is remarkable [53, 111, 115, 134, 178, 223, 227, 248, 271]. These libraries aim for a number of goals, such as rapid and/or simplified development of OpenCL programs, high-level abstractions for common computation and communication patterns, embedded OpenCL kernel code within C++ programs or handling of multiple OpenCL platforms and devices. In turn, there are a number of libraries for developing OpenCL programs in pure C host code. Simple OpenCL aims to reduce the host code needed to run OpenCL C kernels [109]. It offers a very high-level abstraction, with a minimum of two types and three functions required to execute OpenCL code on a single compute device. While simple, it can be inflexible in more complex workflows involving multiple command queues or devices, for example. The OpenCL utility library provides a set of functions and macros to make the host side of OpenCL programming less tedious [24]. The functions and macros perform common complex or repetitive tasks such as platform and device selection, information gathering, command queue management and memory management, although it does not go much beyond this. The library works at a lower level than Simple OpenCL, directly exposing OpenCL types. However, it is oriented towards using Pthreads [138], privileging a single device, context and command queue per thread. Additionally, the library has a non-negligible performance cost, as it performs a number of queries upfront. OCL-MLA is a mid-level set of abstractions to facilitate OpenCL development [19]. It covers more of OpenCL than both Simple OpenCL and the OpenCL utility library, and sits in between the two concerning the level of abstraction. However, like Simple OpenCL, it is also oriented towards basic workflows focusing on a single device, context and command queue. It features compile-time logical device configuration, management of several OpenCL object types, profiling capabilities, helper functions for manipulation of events, and a number of utilities for program manipulation. Furthermore, it offers Fortran bindings. oclKit is a small wrapper library focused on platform, context, command queue and program initialization, avoiding the boilerplate code associated with these tasks [250]. It
is a thin, low-level wrapper, allowing the programmer to keep full control over the OpenCL workflow. Nonetheless, it does not provide much functionality beyond these tasks and associates each device with a command queue, limiting its applicability. Finally, hiCL consists of a C/C++ and a Fortran 90 wrapper which prioritizes memory management between different computation devices, namely those sharing RAM, such as CPUs with integrated GPUs [200]. It provides a high-level abstraction for launching kernels while internally managing dependencies between kernels, devices and memory objects. However, it also associates a device with a command queue. All of these libraries present some type of limitation, and many have seen development stall in the last few years. For example, at the time of writing none of them supports the new approach for command queue creation introduced in OpenCL 2.0. Thus, we opted to implement cf4ocl, a library which shares similar goals with the ones considered in the previous paragraphs, but without the discussed limitations and with a broader scope. The cf4ocl library is described in Section 3.4.1.
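To make the boilerplate issue concrete, the sketch below shows a bare-bones OpenCL 1.x host program in C which scales a vector on the default device; the kernel, function and variable names are purely illustrative, and error checking is omitted for brevity.

#include <CL/cl.h>

/* Illustrative OpenCL C kernel, passed to the runtime as a string. */
static const char *src =
    "__kernel void scale(__global int *v, int f) {"
    "    size_t i = get_global_id(0);"
    "    v[i] *= f;"
    "}";

/* Bare-bones host workflow: platform -> device -> context -> queue ->
 * program -> kernel -> buffer -> enqueue -> read back (no error checking). */
void run_scale(cl_int *host_vec, size_t n, cl_int factor)
{
    cl_platform_id platform;
    cl_device_id device;
    cl_int err;

    clGetPlatformIDs(1, &platform, NULL);
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_DEFAULT, 1, &device, NULL);

    cl_context ctx = clCreateContext(NULL, 1, &device, NULL, NULL, &err);
    cl_command_queue queue = clCreateCommandQueue(ctx, device, 0, &err);

    /* Device code is compiled at runtime for the selected device. */
    cl_program prog = clCreateProgramWithSource(ctx, 1, &src, NULL, &err);
    clBuildProgram(prog, 1, &device, NULL, NULL, NULL);
    cl_kernel krnl = clCreateKernel(prog, "scale", &err);

    cl_mem buf = clCreateBuffer(ctx, CL_MEM_READ_WRITE | CL_MEM_COPY_HOST_PTR,
                                n * sizeof(cl_int), host_vec, &err);

    clSetKernelArg(krnl, 0, sizeof(cl_mem), &buf);
    clSetKernelArg(krnl, 1, sizeof(cl_int), &factor);
    clEnqueueNDRangeKernel(queue, krnl, 1, NULL, &n, NULL, 0, NULL, NULL);
    clEnqueueReadBuffer(queue, buf, CL_TRUE, 0, n * sizeof(cl_int), host_vec,
                        0, NULL, NULL);

    clReleaseMemObject(buf);
    clReleaseKernel(krnl);
    clReleaseProgram(prog);
    clReleaseCommandQueue(queue);
    clReleaseContext(ctx);
}

Even this stripped-down example needs around a dozen API calls before the kernel runs, which is the kind of repetition that wrappers such as cf4ocl aim to absorb.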
2.3 Reference research ABMs
Several ABMs have been used for the purpose of modeling tutorials and/or model analysis and replication. Probably the best-known standard ABM is the “StupidModel”, which consists of a series of 16 pseudo-models of increasing complexity, ranging from simple moving agents to a full predator-prey-like model [184]. It was developed as a teaching tool and template for real applications, as it includes a set of features commonly used in ABMs of real systems. It has been used to address a number of questions, including the comparison of ABM platforms [146,185], model parallelization [145,232], analysis of toolkit feasibility [220] and/or creating models as compositions of micro-behaviors [116]. The “StupidModel” series has been criticized for having some atypical elements and ambiguities [146], reasons which led Isaac to propose a reformulation to address these and other issues [112]. However, its multiple versions and user-interface/visualization goals limit the series’ appeal as a pure computational model. Other paradigmatic models which have been recurrently used, studied and replicated include Sugarscape [8,23,57,63,145], Heatbugs [89,202,263], Boids [89,187,188] and several interpretations of prototypical predator-prey models [85, 104, 171, 217, 235, 261]. Nonetheless, there is a lack of formalization and in-depth statistical analysis of simulation output in most of these implementations, often leading to model assessment and replication difficulties [61, 265]. Unfortunately, most models are not implemented with replication in mind, making them inadequate targets for faithful parallelization.
2.4 Parallel SABM implementations
Parry and Bithell [175] describe two techniques for partitioning SABM components across multiple computational cores: agent-parallel (AP) and environment-parallel (EP). While
these are not mutually exclusive, they are nonetheless a good starting point for reasoning about SABM partitioning. In the AP approach, the model is divided at the agent level, i.e., each LP is responsible for handling a set of agents. Load balancing is simpler as agents can be equally distributed among LPs so that each LP has a similar share of the computation [45, 175]. However, in a scenario with moving agents, this partitioning leads to extra communication between LPs, which is required in order to ensure that spatially localized agent interactions are dealt with consistently, as co-location on an LP does not guarantee co-location in space [175]. In EP partitioning, model decomposition occurs at the spatial environment level, i.e., each LP is assigned a location, together with the agents it contains [45]. As such, local agent interactions will mostly occur in the same LPs. Unfortunately, when agent density varies spatially over time, e.g., in flocking or grouping patterns, load balancing issues may occur [45, 89, 175]. A corner case of this issue is when simulating chemotaxis-like patterns, where agent movement is influenced by a chemical concentration gradient, which can result in millions of agents swarming to the same location [75]. Most attempts at parallelizing ABMs found in the literature are based on the distributed memory programming model [209, 215], including a few generic ABM frameworks for high-performance computing [42, 44, 96, 232]. This approach allows models to scale to thousands of cores, usually found in supercomputer-type setups [272, 274]. However, communication issues for larger models [215] and a more complex programming paradigm (when compared with multithreading on shared memory architectures) [137] can restrict this approach. Recently, the trend has been towards hybrid [1, 229], GPU [38, 145, 192, 254] and heterogeneous [247, 253] methods. While these approaches allow concrete gains in simulation performance on commodity hardware, they come with an increased cost in implementation time due to the substantially more complex programming models. Hybrid methods, combining distributed and shared memory programming models, require modelers to master both paradigms, as well as specific multi-level model decomposition. GPU architectures require the reformulation of ABMs in terms of stream SIMD computation and offer limited control flow constructs [145]. Heterogeneous methods, in which both CPU and GPU are utilized, entail complex synchronization and data transfers between the two processors, also requiring careful model decomposition so that components can be efficiently processed. The computing potential of GPUs and the apparent massively parallel nature of SABMs make GPGPU implementations of such models an interesting proposal [179]. Nonetheless, reality is not that simple. In order to express ABMs in terms of stream SIMD computation, each type of agent should be processed by the same code (single instruction) and multiple instances of each type of agent should exist in the simulation (multiple data) [99, 141], in AP fashion. Optimal performance requires that the processing of one agent is causally independent of the processing of any other agent in the same time step [56]. This can be difficult to achieve considering that, in ABM, the behavior of agents is frequently related to other agents and to a dynamic environment [141]. A common case in SABMs is that of
mobile agents in a discrete grid, which poses several challenges. If there is a unique location constraint (maximum of one agent per grid cell), the implementation must explicitly handle collisions which result from agents concurrently moving to the same location [145, 192]. However, the absence of this constraint creates a different problem. More specifically, agents need to determine the location of other agents in their neighborhood. The typical solution consists of having a sorted array of agents, with each grid cell holding a pointer to the first agent it contains. This solution requires a sorting step in each iteration, which can deteriorate performance [2, 65, 92, 176]. The agent life-cycle is another difficult aspect of ABM on GPUs, requiring adequate memory management patterns, such as allocation and garbage collection, which are intrinsically serial operations [145]. For more complex models, with different agent types and dynamic environments which influence agent behavior, it will be challenging to find an efficient SIMD solution at all [141]. Taking this into account, the use of SIMD architectures may lead to good results for simpler ABMs [141], such as CA and homogeneous agent models [42, 192]. A number of proposed GPGPU ABM frameworks rely on these assumptions or simplifications [42, 110, 145, 192]. In general, implementing SABMs in the GPU architecture requires taking into account model-level characteristics, such as agent-agent and agent-environment interactions, which have a considerable weight in overall simulation performance [230]. As such, in order to obtain good performance, it is preferable to implement models from the ground up [110], which allows for hand tuning and optimization to exploit model features and the underlying hardware [141, 230]. This fact, and the difficulties which often arise from hand-tuning GPU code [122], mean that there is a large overhead in terms of man-hours to implement any new model on GPGPU [141]. Among the possible parallelization techniques, multithreading is arguably the simplest to implement [137, 252], with the added bonus of portability. For example, modern threading APIs, such as OpenMP for C, C++ and Fortran or the Java 5.0 concurrency API [88], greatly simplify multithreaded ABM implementations, and are available for a number of different shared memory CPU architectures and operating systems. While the majority of ABM toolkits [20, 165, 185, 238] are targeted at single-threaded execution on the CPU [89, 189], there have been explicit attempts to parallelize some of them. Goldsby and Pancerella [89] describe adjustments made to the MASON agent-based simulation package [144] that allow the use of multiple threads without major changes to conventional agent-based programming. RepastJ [166] has adaptations to multi-core CPUs [67], while Repast Simphony [167] supports parallel execution at the scheduling mechanism level. However, in most cases, the modeler must implement correct access semantics to shared data (e.g., environment and agent-agent interaction). One of the main problems in retrofitting parallelism to existing ABM frameworks and developing new “pure” parallel ABM toolkits concerns the implementation of direct agent-to-agent memory access, which is model-dependent and requires synchronization semantics such as locks or semaphores. These constructs are provided by the threading APIs, but efficient and thread-safe coordination of concurrent accesses still requires careful coding in order to
obtain proper speedups with the number of cores, while avoiding common multithreading issues such as priority inversions or deadlocks [88]. Next, in Sections 2.4.1 and 2.4.2, we discuss a number of concrete multithreaded and GPU SABM implementations, since these parallelization paradigms are the focus of this work.
2.4.1 Multithreaded implementations
EcoKit, a simulation system for spatially-explicit ecological models [86], was one of the first parallel realizations of an ABM on a shared memory architecture, implementing a static EP solution. The authors tested the system with a mouse migration model with 10 000 agents in a discrete 2D grid with 20 000 cells on an eight-processor SGI PowerChallenge machine [211]. One of the processors was used for the simulation kernel, while the remaining ones were used for the simulation itself. Results showed speedup stabilization at about four processors, reaching a maximum of 2.8 for seven processors. A multithreaded SABM of immune system dynamics, parallelized using a static EP approach, is presented in reference [67]. Adequate speedups are obtained when chemotaxis is not simulated. However, in simulations involving this phenomenon, agents in the order of thousands were shown to group in very few locations, causing severe load imbalances and limiting the scalability of simulations. A cellular automata (CA) based ABM of opinion exchange, developed in C++ and parallelized with OpenMP, is presented by Gong et al. [90]. The authors analyze how the performance of the parallel model varies with the size of the 2D simulation grid and with the range of agent interactions. Parallelization of the model is achieved by partitioning the CA grid in a row-wise block-striped fashion, with each block assigned to one thread (and consequently, to one CPU core); because the agents are fixed, and there is one agent per grid cell, this type of partitioning is both AP (agent-parallel) and EP (environment-parallel). Several tests were performed on a 32-core CPU (AMD Opteron), with varying numbers of threads (from 1 to 32), grid sizes (up to 5000×5000) and interaction ranges (from 2 to 1000 cells). Increasing the number of threads resulted in improved speedup, but lower efficiency (speedup divided by the number of threads). Lower efficiency also occurred when increasing the size of the simulation grid (maintaining the interaction range) and when increasing the interaction range. Nonetheless, the achieved speedups are impressive. For example, for a grid of 4000 × 4000, and an interaction range of 2 cells, a 25× speedup is obtained with 32 threads.
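The row-wise block-striped idea can be sketched in OpenMP-annotated C as follows; the update rule and names are placeholders and are not taken from the cited work. With schedule(static), each thread receives one contiguous block of rows, and writing the next state to a separate buffer keeps concurrent reads and writes apart.

/* Placeholder local rule (illustrative only): adopt the opinion of the
 * northern neighbor, with toroidal wrapping. */
static int update_cell(const int *cur, int rows, int cols, int r, int c)
{
    int rn = (r - 1 + rows) % rows;
    return cur[rn * cols + c];
}

/* One CA step with row-wise block-striped partitioning: schedule(static)
 * assigns each thread a contiguous block of rows. */
void ca_step(const int *cur, int *next, int rows, int cols)
{
    #pragma omp parallel for schedule(static)
    for (int r = 0; r < rows; r++)
        for (int c = 0; c < cols; c++)
            next[r * cols + c] = update_cell(cur, rows, cols, r, c);
}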
Goldsby and Pancerella [89] proposed to retrofit multithreading into the MASON ABM toolkit using an EP approach, where each thread is assigned an equal part of the simulation environment. Each simulation step is divided into two parts separated by a barrier that synchronizes access to shared environment data. Agent movement is handled via inter-processor message queues. Tests were performed on a machine with 32 Intel Xeon processors using two classic ABMs: 1) Flockers, a Boids-like model [188] in which agents
display flocking behavior; and, 2) the HeatBugs model [263]. The latter was shown to scale well, particularly for larger numbers of agents, with effective speedup up to 28 threads. The Flockers model did not scale as well, with the authors suspecting that the tendency of agents to move in flocks caused load balancing issues. The accurate characterization of bacterial network dynamics led to the design of BNSim [255]. A multithreading approach allows BNSim to efficiently simulate large populations of bacteria. As in reference [89], simulation steps are divided into two parts with barrier synchronization, in a tick-tock pattern. During the “tick”, agents perform their actions in parallel (AP). The environment is updated in EP fashion during “tocks”. An efficient thread scheduler is used to balance the workload of both AP and EP stages of the simulation.
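The tick-tock pattern can be sketched with OpenMP as shown below; agent_act() and cell_update() are hypothetical placeholders, and the implicit barrier at the end of the first worksharing loop plays the role of the synchronization point between the agent (AP) and environment (EP) phases.

/* Hypothetical per-agent and per-cell updates. */
static void agent_act(int a)   { (void) a; /* agent behavior would go here */ }
static void cell_update(int c) { (void) c; /* environment update would go here */ }

/* One "tick-tock" simulation step: agents act in parallel (tick), then the
 * environment is updated in parallel (tock), with an implicit barrier in
 * between so no cell is updated while agents are still acting. */
void simulation_step(int n_agents, int n_cells)
{
    #pragma omp parallel
    {
        #pragma omp for            /* tick: agent phase (AP) */
        for (int a = 0; a < n_agents; a++)
            agent_act(a);
                                   /* implicit barrier here */
        #pragma omp for            /* tock: environment phase (EP) */
        for (int c = 0; c < n_cells; c++)
            cell_update(c);
    }
}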
2.4.2 GPGPU implementations
A precursor of modern GPGPU ABMs is described in [98], in which the authors implement a CA-like physics model capable of simulating various dynamic phenomena. Discrete grid data was stored as a texture in GPU memory, and was iteratively processed at pixel-level to obtain the next state. Simulations were integrated in visual interactive 3D graphics applications. A particle system implemented mostly using GPU fragment shaders for both simulation and rendering is presented in [124]. The implementation allowed for particle movement under different physical forces, as well as CPU-based particle creation and destruction. The particle system stored particle velocities and positions in GPU textures, updating these values in a one or two-pass double-buffering scheme. The implementation was shown to render and update a maximum of 1 million particles at about 10 frames per second using a position texture of size 1024 × 1024.
In reference [65], the authors present a GPU simulation and rendering of a flocking behavioral model with static and dynamic obstacle avoidance based on the work of Reynolds [188]. Using the Cg shading language, the implementation was able to simulate and render a 3D scene with a flock of 8000 boids (bird-oid objects [188]) at 20 FPS, including bird model animations. Agents are represented as texture data and all behaviors are implemented as pixel shaders using a double-buffering approach. The entire process is accomplished in multi-pass fashion, where the output of a shader program is sent as input to the next shader. A special-purpose spatial data structure for neighbor searching was developed, though it requires an agent sorting step (implemented on the CPU) when a boid departs from its group. A heuristic is used to skip some updates of this structure, avoiding additional performance degradation. The GPU implementation (with CPU sorting) was shown to scale better than a pure CPU version. The authors suggest a pure GPU bitonic sort as future work, since it was not feasible for GPUs at the time. An image-based approach is discussed in reference [199], where agent behaviors are defined by finite state machines (FSM) implemented as pixel shaders in GLSL. Images
represent simulation properties such as environment characteristics, agent state and/or FSM specification. Common image manipulation techniques are used to update simulation state. Several simple examples with efficient simulation rendering are presented. In a subsequent work [159], the authors describe a GPU-based crowd simulator, with the positions and headings of crowd individuals stored in GPU texture memory and updated by pixel shaders. A crowd of more than 100 000 individuals was simulated and rendered at almost 38 FPS. However, collision avoidance, which is a common behavior in crowd simulations, was not implemented. The group of D’Souza et al. has been one of the most prolific in developing GPGPU SABMs. One of their earlier works was implemented in GLSL using recent (at the time) fully programmable graphics hardware [57, 145], where they described a framework for large-scale ABMs on GPU. Like most ABM GPU implementations, textures are used to store agent and environment states, and are updated using pixel shaders and double-buffering. The maximum number of agents in a grid cell is one. Consequently, movement and collisions are solved using a multi-pass priority strategy. For reproduction, a novel GPU-based randomized memory allocation scheme, in which not all allocations immediately succeed, was implemented. It allowed bypassing CPU-side allocation, enabling simulations with increased size and complexity, as all computations are performed on the GPU. Agent death is implemented with a state flag, which avoids updating the agent when set. The implementation displayed impressive performance, but some authors argue that the presented models are basically CA extensions with homogeneous agents, and are thus limited in scope and complexity [42, 44, 246]. D’Souza et al. eventually adopted CUDA, and published new models implemented using GPGPU. A tuberculosis model displaying agent movement, chemotaxis, diffusion-decay of chemokines, killing of bacteria by macrophages and adaptive immune response, is presented in reference [56]. Some ideas from their previous work were reused, but some new concepts were also introduced, e.g., the use of atomics for collision resolution. Later, the same group implemented the Toy Infection Model [5], a simplified version of the Systemic Inflammatory Response Syndrome model [4], and compared it with the original NetLogo version, as well as with an in-house optimized serial version [2]. The authors took advantage of the increasing availability of CUDA resources and used the Thrust library of optimized GPU functions [17], which provided common operations such as sorting and reductions. Initialization was performed on the GPU using parallel algorithms, which significantly decreased simulation times. Finally, D’Souza contributed his expertise to an ecological ABM concerning species diversity in forests [118]. Massive speedups of up to 360× are reported, but the comparison was again performed against a serial CPU version. In reference [191], Richmond et al. describe ABGPU, a simple but efficient framework with real-time visualization for GPGPU ABMs. Implementation patterns which are common nowadays were used in this realization, namely environment partitioning for agent communication (which required agent sorting by location) and simulated scattering, so that agents could find their neighbors in the partitions matrix.
Later, the authors presented FLAME GPU [190, 192], a complete GPGPU framework for ABMs. The goal was to allow users to implement models without an explicit understanding of the GPU architecture by offering common functionality such as agent communication and agent life-cycle management. The framework uses Communicating Stream X-Machines [13] for agent formalism, following in the footsteps of FLAME [42], its “parent” framework for distributed computing on CPUs. FLAME GPU follows a template-driven agent architecture which provides a mapping from XML model specification and C scripting to optimized and efficient CUDA code. The framework also employed commonly used GPU programming patterns, such as parallel prefix sums [128] (a prefix sum, or scan, operation produces an array in which every element is obtained from the sum of all previous elements in the original array) for agent list compaction, determining agent positions (due to agent deaths and agent births) and partitioning agent lists according to the function to be applied, minimizing thread divergence. Agent communication is accomplished via message passing, either using CU local memory or spatial partitioning with a prespecified message range. Models assume a continuous environment and collision resolution is performed with global agent functions which insert non-linear simulation step(s) to resolve the forces at play. The authors claim exceptional speedups against its “parent” framework, although it is not clear how the latter is set up. The authors only perform face validation of simulation output, which is not enough to assert model correctness, especially given that certain optimizations may have some impact on the dynamics of the models. Overall, models which can be implemented in FLAME GPU have limits in scope and complexity due in part to the preferentially homogeneous nature of the agents [42].
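To illustrate the list compaction mentioned above, the following serial C sketch (illustrative only, not FLAME GPU code) shows how an exclusive prefix sum over "alive" flags determines the new position of each surviving agent; on the GPU the scan itself is computed in parallel.

#include <stddef.h>

/* Serial sketch of agent list compaction: pos holds the running exclusive
 * prefix sum of the alive[] flags, i.e., the compacted index of agent i. */
size_t compact_agents(const int *alive, const int *agents,
                      int *compacted, size_t n)
{
    size_t pos = 0;
    for (size_t i = 0; i < n; i++) {
        if (alive[i]) {
            compacted[pos] = agents[i];  /* pos == sum of alive[0..i-1] */
            pos++;
        }
    }
    return pos;    /* number of surviving agents */
}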
2.5 Issues with ABM replication
Computational models of complex systems in general, and ABMs in particular, are usually very sensitive to implementation details: the impact that seemingly unimportant aspects such as data structures, algorithms, discrete time representation, floating-point arithmetic or order of events can have on results is notable [157, 265]. Furthermore, most model implementations are considerably elaborate, making them prone to programming errors [266]. This can seriously affect model validation (i.e., determining if the model implementation adequately represents the system being modeled [265] for its intended purpose [193]) when data from the system being modeled cannot be obtained easily, cheaply or at all. Model verification (i.e., determining if the model implementation corresponds to a specific conceptual model [265]) can also be compromised, to the point that wrong conclusions may be drawn from simulation results. A possible answer to this problem is the independent replication of such models [266]. Replication consists of the reimplementation of an existing model and the replication of its results [236]. Replicating a model in a new context will sidestep the biases associated with the language or toolkit used to develop the original model, bringing to light dissimilarities between the conceptual and implemented models, as well as inconsistencies in the conceptual
model specification [61, 265]. Additionally, replication promotes model verification, model validation [265], and model credibility [236]. More specifically, model verification is promoted because if two or more distinct implementations of a conceptual model yield statistically equivalent results, it is more likely that the implemented models correctly describe the conceptual model [265]. Thus, it is reasonable to assume that a computational model is untrustworthy until it has been successfully replicated [50, 61]. Model parallelization is an illustrative example of the importance of replication. Parallelization is often required for simulating large models in practical time frames, as in the case of ABMs reflecting systems with large numbers of individual entities [73]. By definition, model parallelization implies a number of changes, or even the full reimplementation, of the original model. Extra care should be taken in order to make sure a parallelized model faithfully reproduces the behavior of the original serial model. For example, Parry and Bithell [175] provide an informative account in which they were unable to successfully replicate a serial model when converting it to a parallel one. Although replication is considered the scientific gold standard against which scientific claims are evaluated [177], most conceptual models have only been implemented by the original developer, and thus have never been replicated [8, 236, 265, 266]. Several reasons for this problem have been identified, namely:

a) Lack of incentive. Success in replicating a model does not offer anything new to report, and thus researchers prefer to invest in their own models [266]. A successful replication confirms the reproducibility of the results produced by the original model, constituting the first step for building and extending on both. It is a fundamental part of the scientific method, and should not be discarded by researchers and scientific journals in the area of modeling and simulation. The lack of incentive felt by researchers could be partially solved by explicit calls for replication studies from such journals [236]. Nonetheless, in the last few years there have been calls for computational science to develop a culture of reproducibility similar to that of its life sciences counterparts [177].

b) Below par model communication. The assessment, comparison and replication of simulation models is constrained by the lack of transparency in model descriptions, poor model documentation, and unsuitable model distribution [162, 177]. Adequate templates for model description should be used for published models (e.g., the ODD protocol), along with the provision of a model's source code [162, 265]. In turn, source code availability provides access to an unambiguous model implementation, allowing: a) line-by-line comparisons for highlighting potential problems with the original implementation or the replication; and b) exploration of previously unexplored regions of the parameter space, aiding in the comparison of two model implementations under a larger variety of scenarios [162, 265]. Additional information, such as experimental scenarios, results from simulation runs, statistical analysis of model outputs, and sensitivity analyses, can also be valuable for model communication.
c) Insufficient knowledge. Knowledge of how to replicate, and how to validate the results of a replication, is not widespread [265]. Simple face validation of stochastic model outputs is not enough to announce a successful replication. In order to do so, a more objective and formal approach, usually including statistical hypothesis tests, is required [10, 61].

d) Level of difficulty. Model replication has been shown to be an inherently difficult task [61, 265]. Nonetheless, it can be greatly facilitated by the minimization or resolution of issues a), b) and c), i.e., if researchers are motivated, if they have access to a properly communicated simulation model to replicate, and if the knowledge and tools to replicate and validate models are available and widespread.
2.6 Analyzing and comparing ABMs

2.6.1 Statistical analysis of model output
Many models are not adequately analyzed with respect to their output data, often due to improper design of simulation experiments. Consequently, authors of such models can be at risk of making incorrect inferences about the system being studied [132]. A number of papers and books have been published concerning the challenges, pitfalls and opportunities of using simulation models and adequately analyzing simulation output data. In one of the earliest articles on the subject, Sargent [204] demonstrates how to obtain point estimates and confidence intervals for steady state means of simulation output data using a number of different methodologies. Later, Law [131] presented a state-of-the-art survey on statistical analyses for simulation output data, addressing issues such as start-up bias and determination of estimator accuracy. This survey was updated several times over the years, e.g., [132], where Law discusses the duration of transient periods before steady state settles, as well as the number of replications required for achieving a specific level of estimator confidence. In reference [119], Kelton describes methods for designing simulation model runs and interpreting their output using statistical methods, also dealing with related problems such as model comparison, variance reduction and sensitivity estimation. A comprehensive exposition of these and other important topics of simulation research is presented in the several editions of “Simulation Modeling and Analysis” by Law and Kelton, and its latest edition [133] is used as a starting point for analyzing the output of the PPHPC model in Section 3.1.
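As a minimal, illustrative sketch of the replication/confidence-interval reasoning discussed above (not code from any of the cited works), the following C function computes a point estimate and confidence half-width for a steady-state mean, given one summary value per independent replication and a caller-supplied Student's t quantile:

#include <math.h>

/* Sketch: point estimate and confidence half-width for a steady-state mean,
 * given one summary value per independent replication. t_quantile is the
 * Student's t critical value for the desired confidence level and n - 1
 * degrees of freedom (supplied by the caller). */
void replication_ci(const double *x, int n, double t_quantile,
                    double *mean, double *half_width)
{
    double sum = 0.0, ssq = 0.0;
    for (int i = 0; i < n; i++)
        sum += x[i];
    *mean = sum / n;
    for (int i = 0; i < n; i++)
        ssq += (x[i] - *mean) * (x[i] - *mean);
    double s = sqrt(ssq / (n - 1));          /* sample standard deviation */
    *half_width = t_quantile * s / sqrt(n);  /* CI: mean +/- half_width */
}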
2.6.2 The classical approach for comparing the output of simulation models
The typical process of model output comparison is described by Wilensky and Rand [265], and can be roughly divided into three steps: 1) choice of replication standard (RS); 2) selection of output statistical summaries, i.e., focal measures (FMs), to compare; and, 3) comparison of FMs using statistical techniques.
Choice of replication standard

Axtell et al. [8] defined three RSs for the level of similarity between FMs (Carley [35] calls the RS the emphasis of demonstration): numerical identity, distributional equivalence and relational alignment. The first, numerical identity, implies exact numerical output, but it is difficult to demonstrate for stochastic models and not critical for showing that two such models have the same dynamic behavior. To achieve this goal, distributional equivalence is a more appropriate choice, as it aims to reveal the statistical similarity between two outputs. Finally, relational alignment between two outputs exists if they show qualitatively similar dependencies with input data, which is frequently the only way to compare a model with another which is inaccessible (e.g., when the implementation has not been made available by the original author), or with a non-controllable “real” system (such as a model of the human immune system [75]). For the remainder of this text we assume the distributional equivalence RS when discussing model replication.

Selection of focal measures

After selecting the RS criteria, a set of statistical summaries representative of each output must also be defined. It is these statistical summaries, and not the complete model outputs, that will be compared in order to assert the similarity between the original model and the replication. As models may produce large amounts of data, the statistical summaries should be chosen so as to be relevant to the actual modeling objective. Consequently, the selection of output statistical summaries, i.e., FMs, is an empirical exercise and is always dependent on the model under study [265].

Comparison of FMs using statistical techniques

There are three major statistical approaches used to compare FMs: 1) statistical hypothesis tests; 2) confidence intervals; and, 3) graphical methods [12]. In statistical hypothesis testing, a null hypothesis is tested against an alternative hypothesis. The test can have two results: 1) fail to reject the null hypothesis; or, 2) reject the null hypothesis in favor of the alternative hypothesis. In this case, the tests of interest are two-sample (or multi-sample) hypothesis tests which test for the null hypothesis that the observations in each sample are drawn from the same distribution, against the alternative that they are not. These observations are the statistical summaries obtained from the outputs of two (or more) model implementations, or from the outputs of a model and the system being modeled. Although statistical procedures for comparing model and system outputs using hypothesis tests have been proposed [205], confidence intervals are usually preferred for such comparisons, as they provide an indication of the magnitude by which the statistic of interest differs from model to system. Confidence intervals are also commonly used when evaluating different models that might represent competing system designs or alternative operating policies [12, 133]. Graphical methods, such as Q-Q plots, can also be employed for comparing output data, though their interpretation is more subjective
than the previous methods. In this work we will focus on hypothesis tests, which are more common for comparing a model implementation and its replication [8, 61, 160, 183, 265]. The choice of which statistical test(s) to use depends on a number of factors, namely: 1) whether a specific FM follows a normal distribution; 2) the number of observations (i.e., number of runs) in each sample (i.e., model implementation); 3) the dimensionality of the FMs (i.e., unidimensional or multidimensional); and, 4) the number of samples (i.e., model implementations) to be compared. For example, in [10, 11], the authors discuss the use of Hotelling's T² test (in practice, a two-group MANOVA [127, 228]) for assessing the validity of a multivariate response simulation model representing a real, observable system with multiple outputs. The study focuses on relating the cost of sampling against the probability of type I and II errors according to an acceptable similarity range between model and system output averages. This analysis was later extended to confidence intervals and joint confidence regions [12].
2.6.3 Comparing model output using the classical approach: concrete cases
In reference [8], Axtell et al. compared two initially different models, with one iteratively modified in order to be aligned with the other. The authors evaluated how distinct equivalence standards can be statistically assessed using non-parametric statistical tests (namely the Kolmogorov-Smirnov [153] and Mann-Whitney U [84] tests), and how minor variations in model design affect simulation outcomes. They concluded that comparing models developed by different researchers and with different tools (i.e., programming languages and/or modeling environments) can lead to exposing bugs, misinterpretations in model specification, and implicit assumptions in toolkit implementations. The concepts and methods of “computational model alignment” (or “docking”) were first discussed in this work. Edmonds and Hales [61] performed two independent replications of a previously published model involving co-operation between self-interested agents. Several shortcomings were found in the original model, leading the authors to conclude that unreplicated simulation models and their results cannot be trusted. This work is one of the main references in model replication, describing in detail the process of running two model implementations with different parameters, selecting comparison measures and performing adequate statistical tests. In reference [265], the authors presented an ABM replication case study, describing the difficulties that emerged from performing the replication and determining if the replication was successful. A standard t-test was used for comparing model outputs. The authors concluded that model replication influences model verification and validation and promotes shared comprehension concerning modeling decisions. Miodownik et al. [160] replicated an ABM of social interaction [22], originally
implemented in MATLAB, using the Ps-i environment for ABM simulations [54]. A statistical comparison of the mean levels of “civicness” at the end of the simulation (over 10 runs) was performed using the Mann-Whitney U test. Results showed that, while distributional equivalence was obtained in some cases, the two models were mostly only relationally aligned. The authors attribute this mainly to the fact that some aspects of the original model were not implementable with Ps-i. A method for replicating insufficiently described ABMs was discussed in reference [183], which consists of modeling ambiguous assumptions as binary parameters and systematically applying statistical tests to all combinations in order to assess their equivalence to the original model. The approach was used to replicate Epstein's demographic prisoner's dilemma model [64], with only partial success, suggesting the existence of some undefined assumptions concerning the original model. The authors also conducted a number of statistical tests regarding the influence of specific design choices, highlighting the importance that these be explicitly documented. Alberts et al. [2] implemented a CUDA [168] version of the Toy Infection Model [5], and compared it with the original version implemented in NetLogo, as well as with an optimized serial version. Statistical validation was performed visually using Q-Q plots.
2.6.4 Disadvantages of the classical comparison approach
This methodology has a number of disadvantages. First, it relies on FMs which are model-dependent and, probably, user-dependent. In order to apply it to a different model, the process of selecting FMs and the appropriate statistical tests must be redone. Furthermore, for different model parameters, the originally selected FMs may be of no use, as simulation output may change substantially (e.g., the warm-up period for the steady-state statistical summaries can be quite different). Also, it might not be clear which FMs best capture the behavior of a model.
Chapter 3
Methodology

This chapter describes the methodological and theoretical aspects developed during the course of this work. Section 3.1 is devoted to the PPHPC model, namely its formalization using the ODD protocol, an introduction of a canonical NetLogo implementation, and a discussion on how to analyze and compare its output with other implementations. The Java and OpenCL parallel implementations of PPHPC are characterized in Section 3.2. The PCA-based model-independent comparison technique is presented in Section 3.3. Section 3.4 closes this chapter with a description of the software packages developed to support the research presented in this thesis.
3.1 The PPHPC model
The PPHPC model is a prototypical, yet original, predator-prey model. It is the target of three parallelization efforts, described in Section 3.2, and used as an example for the model-independent comparison method proposed in Section 3.3. It is also being used to test the influence of different pseudo-random number generators (PRNGs) on the behavior of simulation models, as discussed in Chapter 5. However, given that there are so many ABMs available, why would one create a new conceptual model for these purposes? Two main reasons exist:

1. As stated in Section 2.3, there is a lack of formalization and in-depth statistical analysis in most reference models, issues which cause model assessment and replication difficulties. Given the scope of the research presented in this thesis, a thoroughly formalized and unambiguous model specification is essential.

2. Certain features, such as integer-only state variables, were defined so that PPHPC is simpler to implement in any programming language or computer architecture. PPHPC was designed to be cross-platform at its most basic level.

Furthermore, since PPHPC implements the paradigmatic predator-prey concept, and is based on typical ABM features, it is more of a formalization of these aspects than a new
model per se. Such familiarity makes it easily understandable, so that others can use it without difficulty for their own research. The following sections provide a complete account of the model, beginning with its formal description in Section 3.1.1, continuing with a canonical NetLogo implementation in Section 3.1.2, and finalizing with the specific approach for analyzing model output in Section 3.1.3.
3.1.1 Overview, design concepts and details
Here we describe the PPHPC model using the ODD protocol [94]. Time-dependent state variables are represented with uppercase letters, while constant state variables and parameters are denoted by lowercase letters. The U(a, b) expression equates to a random integer within the closed interval [a, b] taken from the uniform distribution.

Purpose

The purpose of PPHPC is to serve as a standard model for studying and evaluating SABM implementation strategies. It is a realization of a predator-prey dynamic system, and captures important characteristics of SABMs, such as agent movement and local agent interactions. The model can be implemented using substantially different approaches that ensure statistically equivalent qualitative results. Implementations may differ in aspects such as the selected system architecture, choice of programming language and/or agent-based modeling framework, parallelization strategy, random number generator, and so forth. By comparing distinct PPHPC implementations, valuable insights can be obtained on the computational and algorithmic design of SABMs in general.

Entities, state variables, scales

The PPHPC model is composed of three entity classes: agents, grid cells and environment. Each of these entity classes is defined by a set of state variables, as shown in Table 3.1. All state variables explicitly assume integer values to avoid issues with the handling of floating-point arithmetic on different programming languages and/or processor architectures. The t state variable defines the agent type, either s (sheep, i.e., prey) or w (wolf, i.e., predator). The only behavioral difference between the two types is in the feeding pattern: while prey consume passive cell-bound food, predators consume prey. Other than that, prey and predators may have different values for other state variables, as denoted by the superscripts s and w. Agents have an energy state variable, E, which increases by g^s or g^w when feeding, decreases by l^s or l^w when moving, and decreases by half when reproducing. When energy reaches zero, the agent is removed from the simulation. Agents with energy higher than r_T^s or r_T^w may reproduce with probability given by r_P^s or r_P^w. The grid position state variables, X and Y, indicate which cell the agent is located in. There is no conceptual limit on the number of agents that can exist during the course of a simulation run.
Entity        State variable                  Symbol          Range
Agents        Type                            t               s, w
              Energy                          E               1, 2, ...
              Horizontal position in grid     X               0, 1, ..., x_env − 1
              Vertical position in grid       Y               0, 1, ..., y_env − 1
              Energy gain from food           g^s, g^w        0, 1, ...
              Energy loss per turn            l^s, l^w        0, 1, ...
              Reproduction threshold          r_T^s, r_T^w    1, 2, ...
              Reproduction probability        r_P^s, r_P^w    0, 1, ..., 100
Grid cells    Horizontal position in grid     x               0, 1, ..., x_env − 1
              Vertical position in grid       y               0, 1, ..., y_env − 1
              Countdown                       C               0, 1, ..., c_r
Environment   Horizontal size                 x_env           1, 2, ...
              Vertical size                   y_env           1, 2, ...
              Restart                         c_r             1, 2, ...

Table 3.1 – Model state variables by entity. Where applicable, the s and w designations correspond to prey (sheep) and predator (wolf) agent types, respectively.
Instances of the grid cell entity class can be thought of as the place or neighborhood where agents act, namely where they try to feed and reproduce. Agents can only interact with other agents and resources located in the same grid cell. Grid cells have a fixed grid position, (x, y), and contain only one resource, cell-bound food (grass), which can be consumed by prey, and is represented by the countdown state variable C. The C state variable specifies the number of iterations left for the cell-bound food to become available. Food becomes available when C = 0, and when a prey consumes it, C is set to c_r. The set of all grid cells forms the environment entity, a toroidal square grid where the simulation takes place. The environment is defined by its size, (x_env, y_env), and by the restart parameter, c_r. Spatial extent is represented by the aforementioned square grid, of size (x_env, y_env), where x_env and y_env are positive integers. Temporal extent is represented by a positive integer m, which represents the number of discrete simulation steps or iterations within a fixed-increment time advance (FITA) context. Spatial and temporal scales are merely virtual, i.e., they do not represent any real measure.
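For illustration, the integer-only state variables of Table 3.1 could map onto plain C structures as sketched below; this layout is hypothetical and is not prescribed by the model or by any particular PPHPC implementation.

#include <stdint.h>

/* Illustrative mapping of the state variables in Table 3.1 to integer-only
 * C types; actual PPHPC implementations may organize state differently. */
typedef enum { SHEEP, WOLF } agent_type;

typedef struct {
    agent_type t;    /* type: prey (sheep) or predator (wolf) */
    uint32_t   E;    /* energy */
    uint32_t   X, Y; /* grid position */
} agent;

typedef struct {
    uint32_t C;      /* countdown until cell-bound food becomes available */
} cell;

typedef struct {
    uint32_t x_env, y_env;  /* toroidal grid dimensions */
    uint32_t c_r;           /* countdown restart value */
    cell *grid;             /* x_env * y_env cells */
} environment;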
Process overview and scheduling Algorithm 3.1 describes the simulation schedule and its associated processes. Execution starts with an initialization process, Init(), where a predetermined number of agents are randomly placed in the simulation environment. Cell-bound food is also initialized at this stage. After initialization, and to get the simulation state at iteration zero, outputs are gathered by the GetStats() process. The scheduler then enters the main simulation loop,
Algorithm 3.1 Main simulation algorithm. for loops can be processed in any order or in random order. In terms of expected dynamic behavior, the former means the order is not relevant, while the latter specifies loop iterations should be explicitly shuffled.
 1: Init()
 2: GetStats()
 3: i ← 1
 4: for i ≤ m do
 5:     for each agent do              ▷ Any order
 6:         Move()
 7:     end for
 8:     for each grid cell do          ▷ Any order
 9:         GrowFood()
10:     end for
11:     for each agent do              ▷ Random order
12:         Act()
13:     end for
14:     GetStats()
15:     i ← i + 1
16: end for

Algorithm 3.2 Agent reproduction, i.e., the TryReproduce() process.
 1: function TryReproduce()
 2:     if E > r_T then
 3:         if U(0, 99) < r_P then
 4:             E_child ← E/2          ▷ Integer division
 5:             E ← E − E_child
 6:             NewAgent(t, E_child, X, Y)
 7:         end if
 8:     end if
 9: end function
GrowFood()

In step 2, during the GrowFood() process, each grid cell checks if C = 0 (meaning there is food available). If C > 0 it is decremented by one unit. Eq. 3.4 summarizes this process:

C_i = max(C_{i−1} − 1, 0)        (3.4)
Act()

In step 3, agents Act() in explicitly random order, i.e., the agent list should be shuffled before the agents have a chance to act. The Act() process is composed of two sub-actions: TryEat() and TryReproduce(). The Act() process is atomic, i.e., once called, both TryEat() and TryReproduce() must be performed; this implies that prey agents may be killed by predators before or after they have a chance of calling Act(), but not during the call.

TryEat()

Agents can only interact with sources of food present in the grid cell they are located in. Predator agents can kill and consume prey agents, removing them from the simulation. Prey agents can consume cell-bound food, resetting the local grid cell C state variable to c_r. A predator can consume one prey per iteration, and a prey can only be consumed by one predator. Agents who act first claim the food resources available in the local grid cell. Feeding is automatic: if the resource is there and no other agent has yet claimed it, the agent will consume it. Moreover, only one prey can consume the local cell-bound food if available (i.e., if C = 0). When an agent successfully feeds, its energy E is incremented by g^s or g^w, depending on whether the agent is a prey or a predator, respectively.

TryReproduce()

If the agent's energy, E, is above its species reproduction threshold, r_T^s or r_T^w, then reproduction will occur with probability given by the species reproduction probability, r_P^s or r_P^w, as shown in Algorithm 3.2. When an agent successfully reproduces, its energy is divided (using integer division) with its offspring. The offspring is placed in the same grid cell as its parent, but can only take part in the simulation in the next iteration. More specifically, newly born agents cannot Act(), nor be acted upon. The latter implies that newly born prey cannot be consumed by predators in the
Type        Parameter                   Symbol
Size        Environment size            x_env, y_env
            Initial agent count         P_0^s, P_0^w
            Number of iterations        m
Dynamics    Energy gain from food       g^s, g^w
            Energy loss per turn        l^s, l^w
            Reproduction threshold      r_T^s, r_T^w
            Reproduction probability    r_P^s, r_P^w
            Cell food restart           c_r

Table 3.2 – Size-related and dynamics-related model parameters.
Size      x_env × y_env      P_0^s        P_0^w
100       100 × 100          400          200
200       200 × 200          1600         800
400       400 × 400          6400         3200
800       800 × 800          25 600       12 800
1600      1600 × 1600        102 400      51 200
3200      3200 × 3200        409 600      204 800
6400      6400 × 6400        1 638 400    819 200
12 800    12 800 × 12 800    6 553 600    3 276 800
...       ...                ...          ...

Table 3.3 – Initial model sizes.
current iteration. Agents immediately update their energy if they successfully feed and/or reproduce.
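The reproduction logic just described translates into a few lines of code. The sketch below is a minimal Java rendering of the TryReproduce() process of Algorithm 3.2; class and method names are illustrative and do not necessarily match those of the actual Java implementation discussed in Section 3.2.1.

import java.util.Random;

/** Minimal sketch of the TryReproduce() process (names are illustrative). */
class Agent {
    int energy; // E

    /** Returns the offspring, or null if no reproduction occurred. */
    Agent tryReproduce(int reproThreshold, int reproProb, Random rng) {
        if (energy > reproThreshold                 // E > r_T
                && rng.nextInt(100) < reproProb) {  // U(0, 99) < r_P
            Agent child = new Agent();
            child.energy = energy / 2;              // E_child = E / 2 (integer division)
            this.energy -= child.energy;            // parent keeps E - E_child
            return child;                           // offspring only acts in the next iteration
        }
        return null;
    }
}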
Parameterization

Model parameters can be qualitatively separated into size-related and dynamics-related parameters, as shown in Table 3.2. Although size-related parameters also influence model dynamics, this separation is useful for parameterizing simulations. Concerning size-related parameters, more specifically the grid size, we propose a base value of 100 × 100, associated with 400 prey and 200 predators. Different grid sizes should
have proportionally assigned agent population sizes, as shown in Table 3.3. In other words, there are no changes in the agent density nor in the ratio between prey and predators. For the dynamics-related parameters, we propose two sets of parameters, shown in Table 3.4, which generate two distinct dynamics. The second parameter set typically yields more than twice the number of agents as the first parameter set. Matching results with runs based on distinct parameters is necessary in order to have a high degree of confidence in the similarity of different implementations [61]. While many more combinations of parameters can be experimented with in this model, these two sets are the basis for testing and comparing PPHPC implementations. We will refer to a combination of model size and parameter set as “size@set”, e.g., 400@1 for model size 400, parameter set 1.
Parameter                          Symbol    Set 1    Set 2
Prey energy gain from food         g^s       4        30
Prey energy loss p/ turn           l^s       1        1
Prey reprod. threshold             r_T^s     2        2
Prey reprod. probability           r_P^s     4        10
Predator energy gain from food     g^w       20       10
Predator energy loss p/ turn       l^w       1        1
Predator reprod. threshold         r_T^w     2        2
Predator reprod. probability       r_P^w     5        5
Cell food restart                  c_r       10       15

Table 3.4 – Dynamics-related parameter sets.
While simulations of the PPHPC model are essentially non-terminating¹, the number of iterations, m, is set to 4000, as this allows analyzing steady-state behavior for all the parameter combinations discussed here.
3.1.2 A canonical NetLogo implementation
NetLogo is a well-documented programming language and modeling environment for ABMs, focused on both research and education. It is written in Scala and Java and runs on a JVM. It uses a hybrid interpreter and compiler that partially compiles ABM code to JVM bytecode [218]. It comes with powerful built-in procedures and is relatively easy to learn, making ABMs more accessible to researchers without programming experience [151]. Advantages of having a NetLogo version include real-time visualization of simulations, pseudocode-like model descriptions, simplicity in changing and testing different model aspects and parameters, and command-line access for batch runs and cycling through different parameter sets, even allowing for multithreaded simultaneous execution of multiple runs. A NetLogo reference implementation is also particularly important as a point of comparison with other ABM platforms [112]. However, NetLogo implementations also present some difficulties. For example, such a high-level interface for ABM programming naturally has a large impact on performance. Thus, as will be shown in Chapter 4, simulations are considerably slower than equivalent implementations in conventional programming languages. Nonetheless, NetLogo compared quite favorably in this regard against the ReLogo ABM package [172] in a recent study [146]. Additionally, some authors have encountered practical limitations in NetLogo [30, 55, 146, 151, 260], and there can always be some doubt whether the complexity of one's model would increase to a point in which NetLogo no longer suffices. For example, models with multiple spaces and non-square spatial units may be cumbersome or impossible to implement in NetLogo [146].

¹ A non-terminating simulation is one for which there is no natural event to specify the length of a run [133].
Figure 3.1 – NetLogo implementation of the PPHPC model.
The NetLogo implementation of PPHPC, Figure 3.1, is based on NetLogo’s own Wolf Sheep Predation model [261], considerably modified to follow the ODD discussed in Section 3.1. Most NetLogo models will have at least a setup procedure, to set up the initial state of the simulation, and a go procedure to make the model run continuously [264]. The Init() and GetStats() processes (lines 1 and 2 of Algorithm 3.1) are defined in the setup procedure, while the main simulation loop is implemented in the go procedure. The latter has an almost one-to-one relation with its pseudo-code counterpart in Algorithm 3.1. By default, NetLogo shuffles agents before issuing them orders, which fits naturally into the model ODD. The implementation is available at https://github.com/fakenmc/pphpc/tree/netlogo.
3.1.3 Analyzing the output
Selecting focal measures

In order to analyze the output of a simulation model from a statistical point of view, we should first select a set of focal measures (FMs) which summarize each output. Typically, FMs consist of long-term or steady-state means. However, being limited to analyzing average system behavior can lead to incorrect conclusions [133]. Consequently, other measures such as proportions or extreme values can be used to assess model behavior. In any case, the selection of FMs is an empirical exercise and is always dependent on the model under study. A few initial runs are usually required in order to perform this selection. For the PPHPC model, the typical output of a simulation run is shown in Figure 3.2 for size 400 and both parameter sets. In both cases, all outputs undergo a transient stage and tend to stabilize after a certain number of iterations, entering steady-state. For other sizes, the situation is similar apart from a vertical scaling factor. Outputs display pronounced extreme values in the transient stage, while circling around a long-term mean
and approximately constant standard deviation in the steady-state phase. This standard deviation is an important feature of the outputs, as it marks the overall variability of the predator-prey cycles. Having this under consideration, six statistics, described in Table 3.5, were selected for each output. Considering there are six outputs, a total of 36 FMs are selected for the PPHPC model.

Figure 3.2 – Typical model output for model size 400: (a) population, parameter set 1; (b) energy, parameter set 1; (c) population, parameter set 2; (d) energy, parameter set 2. Other model sizes have outputs which are similar, apart from a vertical scaling factor. P_i refers to total population, Ē_i to mean energy and C̄_i to the mean value of the countdown state variable, C. Superscript s relates to prey, w to predators, and c to cell-bound food. P_i^c and C̄_i are scaled for presentation purposes.
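Given one output time series and a truncation point l, the six statistics in Table 3.5 can be computed directly. The following Java sketch is a minimal illustration of these definitions; variable and class names are ours, not taken from the PPHPC code.

/** Computes the six focal-measure statistics of Table 3.5 for one output series. */
class FocalMeasures {
    final double max, min, ssMean, ssStd;
    final int argMax, argMin;

    /**
     * @param x output values x[0..m], iteration 0 included
     * @param l truncation point separating the transient and steady-state stages
     */
    FocalMeasures(double[] x, int l) {
        int m = x.length - 1;
        int iMax = 0, iMin = 0;
        for (int i = 1; i <= m; i++) {
            if (x[i] > x[iMax]) iMax = i;
            if (x[i] < x[iMin]) iMin = i;
        }
        double sum = 0.0;
        for (int i = l + 1; i <= m; i++) sum += x[i];
        double mean = sum / (m - l);                  // steady-state mean
        double sqDev = 0.0;
        for (int i = l + 1; i <= m; i++) sqDev += (x[i] - mean) * (x[i] - mean);
        max = x[iMax]; argMax = iMax;
        min = x[iMin]; argMin = iMin;
        ssMean = mean;
        ssStd = Math.sqrt(sqDev / (m - l - 1));       // steady-state sample standard deviation
    }
}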
Collecting and preparing data for statistical analysis

Let X_{j0}, X_{j1}, X_{j2}, ..., X_{jm} be an output from the j-th simulation run (rows under ‘Iterations’ in Table 3.6). The X_{ji}'s are random variables that will, in general, be neither independent nor identically distributed [133], and as such, are not adequate to be used directly in many formulas from classical statistics (which are discussed in the next section). On the
Statistic                                                   Description
max_{0≤i≤m} X_i                                             Maximum value.
arg max_{0≤i≤m} X_i                                         Iteration where maximum value occurs.
min_{0≤i≤m} X_i                                             Minimum value.
arg min_{0≤i≤m} X_i                                         Iteration where minimum value occurs.
X̄^ss = Σ_{i=l+1}^{m} X_i / (m − l)                          Steady-state mean.
S^ss = sqrt( Σ_{i=l+1}^{m} (X_i − X̄^ss)² / (m − l − 1) )    Steady-state sample standard deviation.

Table 3.5 – Statistical summaries for each output X, where X_i is the value of X at iteration i, m denotes the last iteration, and l corresponds to the iteration separating the transient and steady-state stages.

Rep.   Iterations                                     Focal measures
1      X_{1,0}  X_{1,1}  ...  X_{1,m−1}  X_{1,m}      max X_1   arg max X_1   min X_1   arg min X_1   X̄_1^ss   S_1^ss
2      X_{2,0}  X_{2,1}  ...  X_{2,m−1}  X_{2,m}      max X_2   arg max X_2   min X_2   arg min X_2   X̄_2^ss   S_2^ss
...    ...                                            ...
n      X_{n,0}  X_{n,1}  ...  X_{n,m−1}  X_{n,m}      max X_n   arg max X_n   min X_n   arg min X_n   X̄_n^ss   S_n^ss

Table 3.6 – Values of a generic simulation output (under ‘Iterations’) for n replications of m iterations each (plus iteration 0, i.e., the initial state), and the respective FMs (under ‘Focal measures’). Values along columns are IID.
other hand, let X_{1i}, X_{2i}, ..., X_{ni} be the observations of an output at iteration i for n runs (columns under ‘Iterations’ in Table 3.6), where each run begins with the same initial conditions but uses a different stream of random numbers as a source of stochasticity. The X_{ji}'s will now be independent and identically distributed (IID) random variables, to which classical statistical analysis can be applied. However, individual values of the output X at some iteration i are not representative of X as a whole. Thus, we use the selected FMs as representative summaries of an output, as shown in Table 3.6, under ‘Focal measures’. Taken column-wise, the observations of the FMs are IID (because they are obtained from IID replications), constituting a sample amenable to statistical analysis.

Regarding steady-state measures, X̄^ss and S^ss, care must be taken with initialization bias, which may cause substantial overestimation or underestimation of the long-term performance [203]. Such problems can be avoided by discarding data obtained during the initial transient period, before the system reaches steady-state conditions. The simplest way of achieving this is to use a fixed truncation point, l, for all runs with the same initial conditions, selected such that: a) it systematically occurs after the transient state; and, b) it is associated with a round and clear value, which is easier to communicate [203]. Law [133] suggests the use of Welch's procedure [256] in order to empirically determine l. Let X̄_0, X̄_1, X̄_2, ..., X̄_m be the averaged process taken column-wise from Table 3.6 (columns under ‘Iterations’), such that X̄_i = Σ_{j=1}^{n} X_{ji} / n for i = 0, 1, ..., m. The averaged process
has the same transient mean curve as the original process, but its variance is reduced by a factor of n. A low-pass filter can be used to remove short-term fluctuations, leaving the long-term trend of interest, allowing us to visually determine a value of l for which the averaged process seems to have converged. A moving average approach can be used for filtering:

X̄_i(w) = Σ_{s=−w}^{w} X̄_{i+s} / (2w + 1),          if i = w + 1, ..., m − w
X̄_i(w) = Σ_{s=−(i−1)}^{i−1} X̄_{i+s} / (2i − 1),     if i = 1, ..., w        (3.5)

where w, the window, is a positive integer such that w ≤ ⌊m/4⌋. This value should be large enough such that the plot of X̄_i(w) is moderately smooth, but not any larger. A more in-depth discussion of this procedure is available in references [133, 256].

Statistical analysis of focal measures

Let Y_1, Y_2, ..., Y_n be IID observations of some FM with finite population mean µ and finite population variance σ² (i.e., any column under ‘Focal measures’ in Table 3.6). Then, as described in references [132, 133], unbiased point estimators for µ and σ² are given by

Ȳ(n) = Σ_{j=1}^{n} Y_j / n        (3.6)

and

S²(n) = Σ_{j=1}^{n} [Y_j − Ȳ(n)]² / (n − 1)        (3.7)

respectively.
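Welch's procedure, as outlined above, reduces to averaging the output across replications and smoothing the averaged process with the moving average of Eq. 3.5. The sketch below is a minimal Java illustration under these definitions (indices 0..m, window w ≤ ⌊m/4⌋); it is not the implementation used in this work.

/** Minimal sketch of Welch's procedure for choosing the truncation point l. */
class WelchProcedure {

    /** Averaged process: xBar[i] = (1/n) * sum_j x[j][i], for i = 0..m. */
    static double[] averagedProcess(double[][] x) { // x[replication][iteration]
        int n = x.length, m = x[0].length - 1;
        double[] xBar = new double[m + 1];
        for (int i = 0; i <= m; i++) {
            double sum = 0.0;
            for (int j = 0; j < n; j++) sum += x[j][i];
            xBar[i] = sum / n;
        }
        return xBar;
    }

    /** Moving average of Eq. 3.5; result[i] holds xBar_i(w) for i = 1..(m - w), NaN elsewhere. */
    static double[] movingAverage(double[] xBar, int w) {
        int m = xBar.length - 1;
        double[] result = new double[m + 1];
        java.util.Arrays.fill(result, Double.NaN);
        for (int i = 1; i <= m - w; i++) {
            int half = (i <= w) ? i - 1 : w; // window shrinks near the start (second case of Eq. 3.5)
            double sum = 0.0;
            for (int s = -half; s <= half; s++) sum += xBar[i + s];
            result[i] = sum / (2 * half + 1);
        }
        return result;
    }
}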
Another common statistic usually determined for a given FM is the confidence interval (CI) for Ȳ(n), which can be defined in several different ways. The t-distribution CI is commonly used for this purpose [132, 133], although it has best coverage for normally distributed samples, which is often not the case for simulation models in general [133, 204] and agent-based models in particular [101]. If samples are drawn from populations with multimodal, discrete or strongly skewed distributions, the usefulness of t-distribution CIs is further reduced. While there is not much to do in the case of multimodal distributions, Law [133] proposes the use of the CI developed by Willink [267], which takes distribution skewness into account. Furthermore, CIs for discrete distributions are less studied and usually assume data follows a binomial distribution, which presents issues of its own [31]. As suggested by Radax et al. [183], we focus on providing a detailed assessment of the distributional properties of the different FMs, namely whether they are sufficiently “normal” such that normality-assuming (parametric) statistical techniques can be applied, not only for CI estimation, but also for model comparison purposes.
                 Graphical methods                Numerical methods
Descriptive      Histogram, Box plot, Dot plot    Skewness, Kurtosis
Theory-driven    Q-Q plot, P-P plot               Shapiro-Wilk, Anderson-Darling, Cramer-von Mises, Kolmogorov-Smirnov, Jarque-Bera and other tests

Table 3.7 – Methods for assessing the normality of a data set, adapted from [173]. Boldface methods are used in this study.
The normality of a data set can be assessed graphically or numerically [173]. The former approach is intuitive, lending itself to empirical interpretation by providing a way to visualize how random variables are distributed. The latter approach is a more objective and quantitative form of assessing normality, providing summary statistics and/or statistical tests of normality. In both approaches, specific methods can be either descriptive or theory-driven, as shown in Table 3.7. For this study we chose one method of each type, as shown in boldface in Table 3.7. This approach not only provides a broad overview of the distribution under study, but is also important because no single method can provide a complete picture of the distribution.

Under the graphical methods umbrella, a histogram shows the approximate distribution of a data set, and is built by dividing the range of values into a sequence of intervals (bins), and counting how many values fall in each interval. A Q-Q plot compares the distribution of a data set with a specific theoretical distribution (e.g., the normal distribution) by plotting their quantiles against each other (thus “Q-Q”). If the two distributions match, the points on the plot will approximately lie on the y = x line. While a histogram gives an approximate idea of the overall distribution, the Q-Q plot is better suited for assessing how well a theoretical distribution fits the data set.

Concerning numerical methods, skewness measures the degree of symmetry of a probability distribution about its mean, and is a commonly used metric in the analysis of simulation output data [133, 163, 204]. If skewness is positive, the distribution is skewed to the right, and if negative, the distribution is skewed to the left. Symmetric distributions have zero skewness; however, the converse is not necessarily true, e.g., skewness will also be zero if both tails of an asymmetric distribution account for half the total area underneath the probability density function. In the case of theory-driven numerical approaches, we select the Shapiro-Wilk (SW) test [213], as it has been shown to be more effective when compared to several other normality tests [186]. We focus on the p-value of this test (instead of the test's own W statistic), as it is an easily interpretable measure. The null hypothesis of this test is that the data set, or sample, was obtained from a normally distributed population. If the p-value is greater than a predetermined significance level α, usually 0.01 or 0.05, then the null hypothesis cannot be rejected. Conversely, a p-value less than α implies the rejection of the null hypothesis, i.e., that the sample was not obtained
from a normally distributed population.

Comparing the outputs of several implementations

The steps required for comparing the output of several model implementations are described in Section 2.6.2. In the first step, choice of replication standard, we aim for distributional equivalence, since we are essentially interested in showing that the parallel implementations of PPHPC have statistically indistinguishable behavior from the canonical NetLogo implementation. The second step, selection of FMs, is already complete: we simply reuse the FMs selected for analyzing model output. The third and final step consists of applying statistical comparison techniques to the several samples of individual FMs to check if they are drawn from the same distribution. The statistical comparison methods described in this work make use of several statistical tests to assess the degree of output alignment. Two-sample or multi-sample hypothesis tests are commonly used for assessing statistical dissimilarity in univariate samples, i.e., samples composed of scalar observations, such as FMs. If samples are drawn from normally distributed populations, the t (two samples) and ANOVA (s > 2 samples) tests are adequate [106]. Non-parametric tests are more appropriate if population normality cannot be assumed. The Mann-Whitney U test [84] and the Kolmogorov-Smirnov test [153] are typically employed for comparing two samples. The Kruskal-Wallis test [126] extends the former to the s > 2 samples case. As described in Section 2.6.4, this approach for comparing outputs is model-specific. The selected FMs are not only targeted at the PPHPC model, but also at specific parameterizations of the PPHPC model. An alternative, model-independent comparison technique is discussed in Section 3.3.
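As an illustration of this comparison step, the sketch below applies two-sample tests to one FM sampled from two implementations. It assumes the Apache Commons Math 3 library, which is not necessarily what was used in this work, and the sample arrays are hypothetical.

import org.apache.commons.math3.stat.inference.KolmogorovSmirnovTest;
import org.apache.commons.math3.stat.inference.MannWhitneyUTest;
import org.apache.commons.math3.stat.inference.TTest;

/** Compares one focal measure sampled from two PPHPC implementations (illustrative only). */
class FmComparison {
    public static void main(String[] args) {
        // Hypothetical samples: one FM observed over n replications of each implementation.
        double[] fmImplA = {2608, 2577, 2593, 2612, 2585, 2601, 2590, 2579, 2603, 2598};
        double[] fmImplB = {2599, 2581, 2607, 2588, 2594, 2610, 2583, 2596, 2602, 2589};

        // Parametric test (adequate if both samples come from normal populations).
        double pT = new TTest().tTest(fmImplA, fmImplB);

        // Non-parametric alternatives, used when normality cannot be assumed.
        double pMw = new MannWhitneyUTest().mannWhitneyUTest(fmImplA, fmImplB);
        double pKs = new KolmogorovSmirnovTest().kolmogorovSmirnovTest(fmImplA, fmImplB);

        // p-values above the chosen significance level (e.g., 0.01 or 0.05) provide no
        // evidence that the two implementations are misaligned for this FM.
        System.out.printf("t: %.3f, Mann-Whitney U: %.3f, Kolmogorov-Smirnov: %.3f%n", pT, pMw, pKs);
    }
}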
3.2 Parallel implementations of the PPHPC model
We propose three parallel implementations of the PPHPC model. The first is implemented in Java, with several user-selectable parallelization schemes. The other two are developed in OpenCL (using C as the host programming language), one targeting the CPU architecture, and the other aimed at GPU devices. While the goal is to study the impact of different parallelization approaches on simulation performance, care is taken so that these yield the same statistical behavior as the canonical NetLogo version, and among themselves. Simulation reproducibility was also taken into account during the development of the parallel implementations discussed here. This is an important and often overlooked aspect of ABM parallelization. As explained by Hill et al. [105], “To investigate and understand the results, we have to reproduce the same scenarios and find the same confidence intervals every time we run the same stochastic experiment. When debugging parallel stochastic applications, we need to reproduce the same control flow and the same result to correct
an anomalous behavior”.

Most ABMs, including PPHPC, use one or more stochastic processes. PRNGs are crucial to reproducibility. These generators use iterative deterministic algorithms for producing a sequence of pseudo-random numbers that approximate a truly random sequence [43]. A PRNG consists of a finite set of states, and a transition function that takes the PRNG from one state to the next. The initial state of the PRNG is called the seed [219]. As such, PRNGs are used in simulations to mimic stochastic processes in a reproducible fashion. Reproducibility is simple to accomplish in a single-threaded model: it suffices to use the same PRNG and seed, with deterministic scheduling of all stochastic processes. However, in a parallel simulation context, there are practical trade-offs between reproducibility, memory and speed [252]. Additionally, the use of PRNGs in parallel simulations comes with its own set of problems, such as hidden correlations or overlaps in different sub-streams of the same PRNG [105]. In order to make FITA simulation models, such as PPHPC, reproducible in parallel scenarios, two conditions must be verified:

1. Each thread must have its own PRNG sequence or subsequence.
2. The final state of shared memory at the end of each time step must be deterministic, i.e., it should not depend on thread scheduling.

In a scenario with N threads, the first condition is ultimately achievable if dividing the PRNG sequence by N is feasible, and if there is enough memory for N thread-local PRNG states. The feasibility of PRNG partitioning increases with its period, which in turn is directly related to the PRNG state size. Additionally, the memory required to hold the state of the PRNG subsequences increases with N. However, in order to minimize hidden correlations and overlaps due to PRNG sequence partitioning, higher-period, larger-state PRNGs are preferable. Thus, there is again a trade-off, this time between number of threads, apparent “randomness” of the multiple thread-local PRNG streams, and memory.

The second condition can be even more complicated to guarantee, since thread scheduling is not deterministic. Generically, the solution to this issue requires two steps. In the first step, threads read shared simulation state, and register their intents, but do not actually carry them out, i.e., they do not write to simulation shared state at this time. In the second step, non-conflicting intents are deterministically executed, while paradoxical or conflicting intents are deterministically solved and fulfilled. However, this can have a very detrimental impact on simulation performance, possibly defeating the purpose of parallelization. By dropping the reproducibility requirement, paradoxical thread actions disappear and the second step is no longer required. In this case, threads act on whatever reality they find on the shared simulation state, which is accessed in a synchronized, non-deterministic fashion. Hence the trade-offs between reproducibility, memory and speed.

An alternative to the first condition is that each unit of work must have its own PRNG subsequence. A unit of work is a minimal set of actions which must be performed within
the same thread, with each thread processing one or more units of work per simulation time step. While this can facilitate reproducibility in certain scenarios (e.g., as discussed in Section 3.2.1), the memory and PRNG sequence partitioning issues become more pronounced, since there will usually be many more units of work than threads. Additionally, the second condition, the hardest to solve, does not simply go away. As we shall discuss, depending on the model and on the parallelization approach, there can be ways to obtain reproducibility without sacrificing much memory or speed. Unfortunately, there are also situations in which guaranteeing reproducibility simply is not practical.
3.2.1 Java
Here we discuss a multithreaded Java implementation of PPHPC, featuring several user-selectable parallelization schemes. Java is a well-known language within the ABM community, powering popular toolkits such as Repast Simphony [167] and MASON [144]. Additionally, NetLogo also runs on the JVM, making a Java implementation of PPHPC even more appropriate for performance comparison purposes. The main objectives of this implementation are: a) to study how different parallelization strategies impact simulation performance on a shared memory architecture; and, b) to assess the improvements obtained against the canonical NetLogo implementation. The Java implementation of PPHPC is based on the concept of units of work, which are processed by one or more worker threads. The basic unit of work is a single grid cell, except in the agent initialization stage (which takes place during the Init() process), where the unit of work corresponds to the instantiation and deployment of a single agent. Work providers supply worker threads with tokens uniquely identifying the units of work to be processed. Different work providers offer specific parallelization strategies, as will be discussed in Section 3.2.1. The Java implementation is available at https://github.com/fakenmc/pphpc/tree/java/java.

Architecture

The Java implementation is built upon the Model-View-Controller design pattern [82], as shown in Figure 3.3. The model (generically represented by the IModel interface) contains the actual ABM logic, aggregating the simulation grid (interface ISpace), composed of grid cells (interface ICell). Cells are associated with zero or more agents (interface IAgent). The model can be manipulated and observed using one or more views, represented by the IView interface. Views observe the model directly but manipulate it (e.g., start, pause, stop) via the controller (interface IController). Work factories (classes implementing the IWorkFactory interface) are responsible for creating objects which handle how units of work are processed, namely the controller and the work provider (the latter represented by the IWorkProvider interface).
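A minimal sketch of the work factory / work provider contracts described above is shown below. Only the interface names come from the text; the method names and signatures are assumptions made for illustration and do not necessarily match the actual code.

/** Illustrative work provider contract: hands out tokens identifying units of work. */
interface IWorkProvider {
    /** Returns the next token for the given worker, or a negative value
     *  when the current parallel work cycle is finished. */
    int nextToken(int workerId);
}

/** Illustrative work factory contract: builds providers for a given strategy. */
interface IWorkFactory {
    /** Creates a work provider implementing a specific parallelization strategy. */
    IWorkProvider createWorkProvider(int numWorkers, int numTokens);
}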
Figure 3.3 – UML diagram for the multithreaded Java implementation of PPHPC.
The controller spawns a specified number of worker threads (instances of the SimWorker class), which will execute Algorithm 3.1. The quantity of work each worker processes is determined by the work provider, i.e., by the tokens it provides. To improve performance, workers execute the operations in Algorithm 3.1 in a different order, but in a way that the final qualitative simulation outcome does not change, as outlined in Algorithm 3.3². The Init() process, line 1 of Algorithm 3.1, is divided into four steps, which can be processed in parallel, as determined by the work provider: 1) instantiation and initialization of grid cells (line 2 of Algorithm 3.3); 2) connecting cells to their neighbors (line 4 of Algorithm 3.3); 3) instantiation of prey (line 6 of Algorithm 3.3); and, 4) instantiation of predators (line 7 of Algorithm 3.3). In steps 1 and 2 the unit of work represents a single grid cell. In steps 3 and 4 the unit of work represents the instantiation and deployment of a single agent. Next, workers execute the GetStats() process (line 2 of Algorithm 3.1 and line 9 of Algorithm 3.3). Data required for the observation of the model, specified in Section 3.1.1, is collected on a cell-by-cell basis. After processing their allocated cells, workers then update a global statistics object. Lines 5 to 10 of Algorithm 3.1, which include agent movement and growth of cell-bound food, are condensed into a single EP loop (line 13 of Algorithm 3.3). This is possible because the Move() and GrowFood() processes are independent; i.e., the consequences of either will only impact the Act() process, which occurs later. Most importantly, both Move() and GrowFood() are cell-wise independent and can be processed autonomously for each cell in an EP loop. When a cell is processed, agents located therein are prompted to move, and then the cell is asked to execute its GrowFood() process. Care is taken so that agents that already moved are not prompted to move again. Finally, lines 11 to 14 of Algorithm 3.1, containing the Act() and GetStats() processes, are also contracted into one EP loop, as shown in line 15 of Algorithm 3.3. While agent

² Controller synchronization points (i.e., calls to ControllerSync()) are discussed in Section 3.2.1.
Algorithm 3.3 Realization of the main simulation algorithm in the SimWorker class. Calls to ControllerSync() are controller synchronization points.
 1: ControllerSync(1)
 2: CreateCells()
 3: ControllerSync(2)
 4: SetCellNeighbors()
 5: ControllerSync(3)
 6: CreatePrey()
 7: CreatePredators()
 8: ControllerSync(4)
 9: GetStats()
10: ControllerSync(5)
11: i ← 1
12: for i ≤ m do
13:     MoveAndGrowFood()          ▷ EP loop over cells: Move() and GrowFood()
14:     ControllerSync(6)
15:     ActAndGetStats()           ▷ EP loop over cells: Act() and GetStats()
16:     ControllerSync(7)
17:     i ← i + 1
18: end for
19: ControllerSync(8)
(3.8)
where ⊕ is the bitwise XOR operator and SHA256() is the SHA-256 cryptographic hash
function. The Java implementation of PPHPC uses the Uncommons Maths library [59], taking advantage of the several PRNGs it provides. For the results presented in this work, the library's Mersenne Twister implementation was used, because it is the same PRNG used by NetLogo, and its very large period of 2^19937 − 1 makes substream overlapping
highly unlikely to occur [32, 221].
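For illustration, the sketch below derives one seed per worker from a global seed and the worker identifier, combining XOR with a SHA-256 hash in the spirit of Eq. 3.8; the exact derivation used by PPHPC is the one given by that equation, and may differ from this sketch.

import java.nio.ByteBuffer;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

/** Illustrative per-worker seed derivation (not necessarily the exact Eq. 3.8). */
class WorkerSeeds {
    /** Derives the seed for the PRNG subsequence of worker workerId. */
    static long workerSeed(long globalSeed, int workerId) {
        try {
            MessageDigest sha256 = MessageDigest.getInstance("SHA-256");
            byte[] input = ByteBuffer.allocate(Long.BYTES)
                    .putLong(globalSeed ^ workerId)   // XOR the global seed with the worker id
                    .array();
            byte[] hash = sha256.digest(input);       // hash to decorrelate nearby worker ids
            return ByteBuffer.wrap(hash).getLong();   // use the first 8 bytes of the digest as the seed
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);       // SHA-256 is always available on the JVM
        }
    }
}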
Considering that each worker thread has its own PRNG subsequence, the first condition for FITA simulation model reproducibility is met. However, to assure the second condition is satisfied within this context, two requirements can be inferred:

1. Each worker thread must process the exact same units of work between runs, i.e., it must: (a) instantiate the same quantity of initial agents and place them in the same cells; and, (b) process the same cells.
2. Agents within a cell must be processed in the same order.

The first requirement can be met if work providers always assign the same tokens to the same worker. The second requirement may be problematic when cell-level synchronization is required, which occurs during agent movement. As will be discussed in the next section, this issue can be solved within the current framework by placing agents in their destination cell using some deterministic order criteria. This ensures that, when entering the agent action stage, each cell contains an ordered list of agents, ready to act.

A broader way of forcing reproducibility of simulations is to associate the PRNG subsequences with units of work and not with worker threads. There are two main problems with this approach. The first, already briefly discussed, is that the problems associated with parallel PRNGs become worse when partitioning the PRNG stream into more and more substreams [105]. The second issue is related to the required memory. For the problem of size 400, and only considering the EP loops where units of work are represented by cells, using a large period PRNG (in order to minimize the first problem), such as the Mersenne Twister, will require approximately 380 MiB of memory just for the PRNG states. For larger model sizes, such as 1600, almost 6 GiB are required.

Parallelization strategies

A parallelization strategy is defined by the selected work factory, more specifically by the work providers it offers, and by the way it configures the controller. The work factory is first requested to instantiate and configure an appropriate controller object. When the simulation starts, the controller spawns the number of workers specified by the work factory, and each worker gets a reference to the controller and to the work factory. Workers use the former to synchronize themselves and the latter to get references to work providers; these, in turn, provide workers with tokens, i.e., integers that uniquely identify units of work. When a work provider returns a negative value, it means that the current parallel
PS    Controller sync. point                        WP    Cell
      1    2    3    4    5    6    7    8                Init.   Move
ST    –    –    –    –    –    –    –    –          –     –       –
EQ    S    B    S    B    S    B    B    S          –     S       S
EX    S    B    S    B    S    B    B    S          –     Sb      Sb
ER    S    B    S    B    S    B    B    S          B     Sb      –
OD    S    B    S    B    B    B    B    S          S     S       –

Table 3.8 – Parallelization strategies (PS) and their handling of the possible synchronization points at the level of the controller, work provider (WP) and grid cell. B means there is a barrier, i.e., that workers can only advance when all workers have reached the sync. point. S implies access serialization, but workers do not have to wait on other workers before continuing; Sb is similar, but implies an ordered agent insertion, which might take longer than simple access serialization.
work cycle is finished. Three work providers are used by workers: one for the cells, which provides work in an EP fashion for all cell-wise work cycles, and two for the initial agent creation (one for prey, the other for predators), which provide AP work. The latter are used only once during the Init() process, while the former, i.e., the cell work provider, is continuously reused. To accommodate different parallelization strategies, workers have several possible synchronization points, which occur at three different levels: 1) controller; 2) work provider; and 3) grid cells. There are eight controller-level synchronization points. All workers explicitly notify the controller when they reach them, as shown by the calls to ControllerSync() in Algorithm 3.3. Whether or not workers are held on that point by the remaining workers depends on how the controller was configured by the work factory. There are, however, some points where barriers are mandatory. For example, no worker can begin processing agent actions before all agents have moved and all food has grown; thus, synchronization point 6 (line 14 of Algorithm 3.3) is necessarily a barrier. Sequences of concurrent computation (AP or EP) and thread communication, terminating with a barrier, can be considered global supersteps under a Bulk Synchronous Parallel model [243]. Work provider and cell-level synchronization is performed implicitly when workers request work and when they process cells, respectively. There are two possible cell-level synchronization points: 1) when inserting initial agents during the Init() process; and, 2) when inserting agents moving from other cells. Again, whether or not workers are actually synchronized at work provider and cell-level synchronization points depends on the parallelization strategy. Five parallelization strategies are provided, as shown in Table 3.8, which also enumerates all possible synchronization points and how the different strategies handle them. The following paragraphs describe these parallelization strategies in detail.

Single-thread (ST)

A single-threaded work factory (SingleThreadWorkFactory class) is provided for comparison with the multithreaded work factories. This work factory
configures the controller such that no synchronization occurs when the (single) worker explicitly notifies the controller that it reached a given synchronization point. Likewise, no work provider or cell-level synchronization is required. The work providers made available by the single-threaded work factory (instances of the SingleThreadWorkProvider class) maintain a simple counter which issues tokens to the single worker. Simulation objects (cells and agents) are always iterated in the same order, which allows for simulation reproducibility.

Equal (EQ)

The general idea of the EQ parallelization strategy (handled by the EqualWorkFactory class) is that each worker always processes the same work. Work distribution is performed once at the beginning of the simulation by the associated work providers (instances of the EqualWorkProvider class), and then the workers are always given the same exact tokens, e.g., they always process the same cells in the EP sections. Cell-level synchronization is required because more than one worker may potentially access the same cell at the same time for agent movement or initial agent placement. The first worker to whom access is granted gets to place its agent topmost in the cell's internal agent list. This means that, because thread synchronization is not a deterministic process, agents will not be processed in the same order from run to run. Thus, simulations with this work factory are not reproducible. The maximum number of tokens to be processed by each worker, n, is given by

n = ⌈T / N⌉        (3.9)

where T is the number of tokens to be processed in a parallel work cycle, and N is the number of worker threads. If T is not equally divisible between the available workers, the last worker will process less work than the remaining workers, as shown in Eq. 3.10,

n · i ≤ t_i < min(n · (i + 1), T)        (3.10)

where i identifies the i-th worker, and t_i corresponds to the range of tokens which will be processed by the i-th worker.

Equal with repeatability (EX)

The EX parallelization strategy is a slight variation on EQ. The same classes are used; the only difference is that when workers access the same cell at the same time for agent placement purposes, the agent is placed in an ordered fashion, as shown in Table 3.8. This allows for reproducible simulations because agents will be processed in the same order.

Equal with row synchronization (ER)

In the ER parallelization strategy, as in the previous strategies, work is assigned to workers at the beginning of the simulation. The difference here is that each worker serially processes rows of the simulation grid, leaving a distance of at least three rows (including the row to be processed) to the next worker (see Figure 3.4). More generally, this distance is given by
Figure 3.4 – Equal with row synchronization (ER): example of three workers processing nine rows of the simulation grid in parallel.
d_min = 2r + 1        (3.11)

where r is the agent movement radius, which is 1 for the PPHPC model. This approach allows workers to run in parallel without any need for cell-level synchronization, because they synchronize at the end of each row at the work provider level. Thus, agents always move to neighboring cells in the same order, making simulations reproducible. The EqualRowSyncWorkFactory class is responsible for configuring the controller and issuing the appropriate work providers. An instance of the EqualRowSyncWorkProvider class is issued for EP work. However, this approach does not make sense for AP work; as such, the agent initialization phase, which is handled in AP fashion, is managed by instances of the EqualWorkProvider class. For simulations to be reproducible, initial agents are inserted in an ordered fashion, as shown in Table 3.8. This strategy implies that there is a practical maximum number of workers depending on the number of rows, y_env, and on the minimum distance between rows, d_min, as shown in Eq. 3.12:

N_max = y_env / d_min        (3.12)
If the specified number of workers, N, is larger than N_max, an exception is thrown and the simulation terminates. An initial estimate of the number of rows per worker, ∆y, is given by

∆y = ⌊y_env / N⌋        (3.13)
This estimate can be incremented if: 1) the number of rows is not equally divisible by the number of workers; and, 2) after incrementing it, there are enough rows for the last worker to process. This is shown in Eq. 3.14, where ∆y_f is the final number of rows per worker:

∆y_f = ∆y + 1,  if y_env mod N > 0 ∧ (N − 1) · (∆y + 1) ≤ y_env − d_min
∆y_f = ∆y,      otherwise.        (3.14)
From the workers' perspective, what matters are the tokens to process. All workers, except possibly the last one, will process n tokens, according to Eq. 3.15. The exact tokens that the i-th worker will process are given in Eq. 3.16. Note that any adjustment due to the number of rows not being exactly divisible by the number of workers is performed on the last worker.

n = x_env · ∆y_f        (3.15)

n · i ≤ t_i < t_f,  where t_f = n · (i + 1) if i < N − 1, and t_f = x_env × y_env if i = N − 1        (3.16)
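Eqs. 3.11 to 3.16 translate directly into a few integer computations. The following Java sketch is a minimal illustration of how an ER-style work provider might derive the row and token assignment for each worker; the names are ours, not those of the EqualRowSyncWorkProvider class.

/** Illustrative computation of the ER work assignment (Eqs. 3.11 to 3.16). */
class ErPartition {
    final int xEnv, yEnv, numWorkers;
    final int dMin, nMax, rowsPerWorker, tokensPerWorker;

    ErPartition(int xEnv, int yEnv, int radius, int numWorkers) {
        this.xEnv = xEnv;
        this.yEnv = yEnv;
        this.numWorkers = numWorkers;
        dMin = 2 * radius + 1;                      // Eq. 3.11: minimum row distance
        nMax = yEnv / dMin;                         // Eq. 3.12: maximum number of workers
        if (numWorkers > nMax)
            throw new IllegalArgumentException("Too many workers for this grid");
        int dy = yEnv / numWorkers;                 // Eq. 3.13: initial estimate of rows per worker
        // Eq. 3.14: increment if rows do not divide evenly and enough rows remain
        // for the last worker.
        if (yEnv % numWorkers > 0 && (numWorkers - 1) * (dy + 1) <= yEnv - dMin)
            dy++;
        rowsPerWorker = dy;                         // final rows per worker
        tokensPerWorker = xEnv * rowsPerWorker;     // Eq. 3.15: n
    }

    /** Range of tokens [first, lastExclusive) for worker i, per Eq. 3.16. */
    int[] tokenRange(int i) {
        int first = tokensPerWorker * i;
        int lastExclusive = (i < numWorkers - 1) ? tokensPerWorker * (i + 1) : xEnv * yEnv;
        return new int[] { first, lastExclusive };
    }
}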
On-demand (OD)

The OD parallelization strategy, managed by an instance of the OnDemandWorkFactory class, aims to improve load balancing by issuing smaller blocks of tokens to keep workers busy. Work providers, instances of the OnDemandWorkProvider class, maintain a counter of the tokens already issued to workers. Each time a worker requests more tokens, this counter is incremented by the block size, b. Access to the work provider is serialized, as shown in Table 3.8, because the work counter needs to be atomically incremented. The counter is implemented using an instance of the AtomicInteger class, which was added to the Java SE 5.0 API. Lower values of b will cause workers to fetch tokens from the work provider more often, which may cause some thread contention; however, work distribution is improved because workers are more likely to be processing work instead of waiting for slower workers at controller synchronization points. Conversely, with higher values of b, worker threads will request work less frequently, which leads to lower thread contention; on the downside, faster workers may have to wait longer for slower ones. Thus, b controls a trade-off between thread contention and load balancing. If b is selected such that workers only access the work provider once and the same amount of work is assigned to each worker, the OD strategy becomes similar to EQ, but with additional synchronization. Table 3.8 also shows that the OD strategy requires all workers to explicitly wait for one another at controller synchronization point 5. This is required because some workers may finish processing their GetStats() tokens early (line 9 of Algorithm 3.3), while others lag behind; at line 13 of Algorithm 3.3, faster workers could potentially obtain work tokens that are still being processed by slower workers at line 9. This parallelization strategy does not offer reproducible simulations because workers obtain tokens in a FIFO fashion which is dependent on the OS thread scheduling, and thus not deterministic. As such, it is not possible to anticipate which worker will process which
tokens, resulting in cells being associated with different worker-bound PRNG sub-sequences from iteration to iteration and from run to run.
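A minimal sketch of an on-demand token dispenser is shown below; it follows the description above (an atomically incremented counter and a block size b), but its name and exact structure are illustrative rather than those of the OnDemandWorkProvider class.

import java.util.concurrent.atomic.AtomicInteger;

/** Illustrative on-demand (OD) token dispenser with block size b. */
class OnDemandTokens {
    private final AtomicInteger counter = new AtomicInteger(0); // tokens already handed out
    private final int totalTokens; // T, tokens in the current parallel work cycle
    private final int blockSize;   // b

    OnDemandTokens(int totalTokens, int blockSize) {
        this.totalTokens = totalTokens;
        this.blockSize = blockSize;
    }

    /**
     * Reserves the next block of tokens for the calling worker.
     * Returns {first, lastExclusive}, or null when the work cycle is finished.
     */
    int[] nextBlock() {
        int first = counter.getAndAdd(blockSize); // atomic increment serializes provider access
        if (first >= totalTokens) return null;    // no tokens left in this cycle
        return new int[] { first, Math.min(first + blockSize, totalTokens) };
    }
}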
3.2.2 OpenCL
OpenCL is the only low-level, portable framework in which it is possible to implement multiple architectural versions of the model, also allowing the GPU version to run on boards from different vendors. While options such as OpenACC could be viable, the lack of solid compiler support and slower performance when compared to OpenCL [259] suggest against its adoption at this time. In spite of all the advantages OpenCL offers, its API verboseness and lack of integrated libraries were addressed with the implementation of cf4ocl and cl_ops, discussed in Sections 3.4.1 and 3.4.2, respectively. The former is a wrapper for OpenCL host code, with a higher-level and simpler interface, which includes integrated profiling. The latter provides several parallel computing algorithms for random number generation, sorting and parallel prefix sum. The OpenCL implementations described in this section are available at https://github.com/fakenmc/pphpc/tree/opencl/opencl.

A CPU implementation

The OpenCL CPU version uses the ER parallelization strategy described in Section 3.2.1, and as such the simulation is partitioned in an EP fashion. The reasoning behind this choice is that shared write access to data structures is difficult to synchronize with OpenCL. One could use atomic operations; however, these only work with integers (although improved support is being added in newer versions of OpenCL), limiting their applicability. Additionally, atomics make simulation reproducibility harder to accomplish. The OpenCL CPU implementation only uses three kernels, with the following responsibilities:

init   Initializes the simulation environment and gathers the initial statistics, i.e., performs the Init() and GetStats() processes in lines 1 and 2 of Algorithm 3.1.
step1  Performs agent movement (Move()) and grows cell food (GrowFood()), lines 5 to 10 of Algorithm 3.1.
step2  Performs agent actions (Act()) and gathers end-of-iteration statistics (GetStats()), lines 11 to 14 of Algorithm 3.1.

Two large arrays hold information about grid cells and agents, respectively, in an array of structures (AoS) fashion. Each element of the cell array contains a data structure with two elements, namely the grass countdown value, C, and the location of the first agent in the cell in the agents array. In turn, each element of the agents array contains agent information, namely, the agent energy, the agent type (predator or prey), and the index
of the next agent in the same cell. Basically, agents in the same cell are scattered along the agents array, with each agent holding a pointer to the next. The last agent in the cell points to a specific constant identifying the end of the list. Simulation parameters and additional runtime information are passed to the kernels in the form of preprocessor constants, which improves performance for larger models.

A GPU implementation

This preliminary version was implemented taking into account the massively parallel architecture of the GPU. For example, there is a direct correspondence between agents and GPU threads (or work-items in OpenCL terminology), such that all agent-wise steps are executed within the corresponding agents' work-items in an AP fashion. Cell-wise kernels are executed in an EP fashion. Additionally, care was taken in order to use the faster local memory and to perform coalesced global memory accesses where possible. Most of the data within the GPU is handled in a structure of arrays (SoA) fashion, which is better suited for SIMD processing. Several arrays are used on the GPU device, three of which are especially relevant:

Agents       Array of agents, where each element contains agent information packed into a 32- or 64-bit integer, depending on model size. Each packed integer contains the agent position, energy and type (predator or prey). The size of this array corresponds to the maximum number of agents in the simulation. This array needs to be cell-wise sorted, i.e., first come the agents in cell (0, 0), then agents in cell (0, 1), and so on. Function-like macros are used in kernel code to transparently and efficiently access bit-packed information.
AgentsIndex  Array of uint2 vectors, i.e., each element is a vector of two 32-bit unsigned integers which hold the indexes of the first and last agent in each cell. As such, this array contains x_env × y_env pairs of unsigned integers.
Grass        Array of 32-bit integers which hold the grass countdown value, C, for each cell. The size of this array corresponds to the number of cells in the simulation, i.e., x_env × y_env.

The model steps defined in Algorithm 3.1 for each iteration are implemented as follows:

1. The init_cell kernel initializes the grass countdown value, C, for each cell (part of the Init() process in line 1 of Algorithm 3.1).
2. The init_agent kernel initializes all agents in the simulation, scattering them by the grid cells (part of the Init() process in line 1 of Algorithm 3.1).
3. Entering the main simulation loop:
   (a) Run four kernels for gathering statistics: reduce_grass1, reduce_agent1, reduce_grass2, reduce_agent2. Parallel GPU reductions must be performed in two kernels [207], and because agents and grass statistics are obtained in substantially different ways, a pair of reduction kernels is required for each. This corresponds to the GetStats() process in lines 2 and 14 of Algorithm 3.1, depending on whether this is the first or nth iteration, respectively. In case this is the last iteration, break out of the simulation loop.
   (b) Perform grass growth with the grass kernel (line 9 of Algorithm 3.1).
   (c) Perform agent movement with the move_agent kernel (line 6 of Algorithm 3.1). The kernel updates the agents' position, decreases agent energy by 1, and if energy becomes 0, marks the agent as dead (by setting all bits in the packed agent integer to 1).
   (d) Sort the array of agents using a sorting algorithm provided by the cl_ops library.
   (e) Transfer previous iteration statistics back to host using pinned memory (faster transfer between host and device).
   (f) Execute the find_cell_idx kernel, which updates the AgentsIndex array with the new start and finish agent indexes for each grid cell.
   (g) Perform agent actions with the action_agent kernel (line 12 of Algorithm 3.1).
4. Transfer last iteration statistics back to host.

Several of these kernel invocations are overlapped in order to improve efficiency. To
further improve efficiency, the global work size of agent-wise kernels is automatically adjusted to the number of agents in each iteration. The global work size of cell-wise kernels is constant, i.e., approximately x_env × y_env. In both cases, the effective global work size can be slightly larger so that it is divisible by a power-of-two local work size, which is preferable for GPUs.
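The packed agent representation used by the GPU version can be illustrated with plain bit operations. The Java sketch below assumes a hypothetical 64-bit layout (16 bits each for x, y and energy, plus a type bit); the actual layout used by the OpenCL kernels and their function-like macros is not reproduced here.

/** Illustrative 64-bit agent packing (the real OpenCL layout may differ). */
class PackedAgent {
    static final long PREY = 0, PREDATOR = 1;

    /** A "dead" agent is marked by setting all bits to 1, as described above. */
    static final long DEAD = -1L;

    static long pack(long x, long y, long energy, long type) {
        return (x & 0xFFFF) | ((y & 0xFFFF) << 16) | ((energy & 0xFFFF) << 32) | (type << 48);
    }

    static int x(long agent)      { return (int) (agent & 0xFFFF); }
    static int y(long agent)      { return (int) ((agent >>> 16) & 0xFFFF); }
    static int energy(long agent) { return (int) ((agent >>> 32) & 0xFFFF); }
    static int type(long agent)   { return (int) ((agent >>> 48) & 0x1); }

    static boolean isDead(long agent) { return agent == DEAD; }
}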
3.3 Model-independent comparison of simulation models
A model-independent comparison or alignment method should work directly from simulation output, automatically selecting the features that best explain potential differences between the implementations being compared, thus avoiding the disadvantages of the classical approach. Additionally, such a method should not depend on the distributional properties of simulation output, and should be directly applicable by modelers. Our proposal consists of automatically extracting the most relevant information from simulation output using PCA [114]. PCA is a widely used technique [68] which extracts the largest variance components from data by applying singular value decomposition to a mean-centered data matrix. In other words, PCA is able to convert simulation output into a set of linearly uncorrelated measures which can be analyzed in a consistent, model-independent fashion. This technique is especially relevant for ABMs, as it considers not
only equilibrium but also dynamics over time [265]. The following procedure summarizes this process for a time-series type output, although the general idea is applicable to other types of multivariate output [76]:

1. Perform n replications for each model implementation, and collect the respective outputs.
2. For each model output X (e.g., predator population, P^w):
   (a) Group the X's from all replications row-wise in matrix X, i.e., the X_j output from the j-th replication will be the j-th row of matrix X. If we are comparing s implementations and performing n replications of m time steps for each, the size of this matrix will be sn × m.
   (b) Determine matrix Xc, which is the column mean-centered version of X.
   (c) Apply PCA to matrix Xc, considering that rows (replications) correspond to observations and columns (iterations or time steps) to variables. This yields: 1) matrix T, containing the representation of the original data in the principal components (PCs) space; and, 2) vector λ, containing the eigenvalues of the covariance matrix of Xc in descending order, each eigenvalue corresponding to the variance of the columns of T.

The columns of T correspond to PCs, and are ordered by decreasing variance, i.e., the first column corresponds to the first PC, and so on. Rows of T correspond to observations. The k-th column of T contains sn model-independent observations for the k-th PC, n for each implementation. Groups of n observations associated with model implementations can be compared using statistical methods. More specifically, hypothesis tests can be used to check if the output projections on the PC space, grouped by model implementation, are drawn from populations with the same distribution. There are two possible lines of action:

1. A multivariate approach, consisting of applying a MANOVA test [127, 228] on the sn output projections, organized in s groups (one per model implementation) of n q-dimensional observations each, along the first q PCs (dimensions) such that these explain a user-defined minimum amount of variance.
2. A univariate approach, consisting of applying a hypothesis test to individual PCs on the sn output projections, organized in s groups (one per model implementation) of n observations each. Possible tests include the t-test and the Mann-Whitney U test for comparing two groups (s = 2), or ANOVA [106] and the Kruskal-Wallis test, which are the respective parametric and non-parametric versions for comparing more than two groups (s > 2).

The use of the MANOVA multivariate test has the advantage of yielding a single p-value from the simultaneous comparison of output projections along multiple dimensions. An
equally succinct answer can be obtained with the univariate approach using the Bonferroni correction or a similar method for handling p-values from multiple comparisons [212]. However, neither approach prioritizes dimensions, even though the first PCs are more important for characterizing model differences, as they explain more variance. In the univariate case one can subjectively attribute higher relevance to the first PCs, or objectively prioritize dimensions according to the explained variance using the weighted Bonferroni procedure [197]. Parametric tests such as MANOVA and the t-test make several assumptions about the data being compared, namely that [11, 26, 127]:

• All samples are mutually independent.
• Each sample is drawn from a normally distributed population (multivariate normality for MANOVA).
• Samples are drawn from populations with equal variances (for MANOVA, the variance-covariance matrix should be the same for each population).
The first assumption is guaranteed if simulation runs are performed with uncorrelated PRNG seeds. The second and third assumptions can be checked using appropriate statistical tests. More specifically, group sample normality can be assessed using the Shapiro-Wilk test [213] (Royston test [198] in the multivariate approach), while equality of variances among groups can be verified with the Bartlett test [14] (Box's M test [29] for homogeneity of variance-covariance matrices in the multivariate approach). However, Box's M test is very sensitive and can lead to false negatives (type II errors). Fortunately, MANOVA is considerably robust to violations in the homogeneity of variance-covariance matrices when groups have equal sample sizes [228]. If these assumptions are not met in the univariate approach, non-parametric tests (e.g., the Mann-Whitney U test or the Kolmogorov-Smirnov test) can be used instead. Non-parametric alternatives, e.g. [6, 169], exist for the multivariate approach, but they are not as well established as the aforementioned methods.

The eigenvalues vector λ is also important for this process, for two reasons: 1) to determine a minimum number of PCs for the MANOVA test, such that these explain a prespecified percentage of variance; 2) alignment or otherwise of the s model implementations can be empirically assessed by analyzing how the explained variance is distributed along PCs. The percentage of variance explained by each PC can be obtained as shown in Eq. 3.17:

S_i²(%) = λ_i / Σ λ        (3.17)

where i identifies the i-th PC, λ_i is the eigenvalue associated with the i-th PC, and Σ λ is the sum of all eigenvalues. If the variance is well distributed along many PCs, it is an indication that the compared implementations are aligned, at least for the output being
analyzed. On the other hand, if most of the variance is explained by the first PCs, it is a clear sign that at least one model implementation is misaligned. The rationale is that, if all implementations show the same dynamical behavior, the projections of their outputs in the PC space will be close together and have similar statistics, i.e., means, medians and variances. As such, PCA will be unable to find components which explain large quantities of variance, and the variance will be well distributed along the PCs. If at least one model implementation is misaligned, the projections of its outputs in the PC space will be farther apart from the projections of the remaining implementations. As such, PCA will yield at least one component which explains a large part of the overall variance.

The alignment of two or more implementations can be assessed by analyzing the following information:

1. The p-values produced by the univariate and multivariate statistical tests, which should be above the typical 1% or 5% significance levels in case of implementation alignment; in the univariate case, it may be useful to adjust the p-values using the weighted Bonferroni procedure to account for multiple comparisons.

2. The total number of PCs required to explain a prespecified amount of variance, which should be lower in case of misalignment than in case of alignment; also, more variance should be explained by the first PCs in the former case than by the same PCs in the latter.

3. The scatter plot of the first two PC dimensions, which can offer visual, albeit subjective, feedback on model alignment; e.g., in case of misalignment, points associated with runs from different implementations should form distinct groups.
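To make the procedure concrete, the following MATLAB/Octave sketch outlines both lines of action for s = 2 implementations and a single output. It is an illustrative outline only, not the interface of the micompr or micompm packages described in Section 3.4; the placeholder matrix X, the number of replications n and the 90% variance threshold are assumptions, and ttest2 and manova1 require the Statistics Toolbox (or the Octave statistics package).

n = 30;                                        % replications per implementation
X = rand(2 * n, 4001);                         % placeholder (2n x m) output matrix:
                                               % rows 1..n from impl. A, rows n+1..2n from impl. B
Xc = X - repmat(mean(X, 1), size(X, 1), 1);    % column mean-centering
[U, S, ~] = svd(Xc, 'econ');                   % PCA via the SVD of Xc
T = U * S;                                     % scores: projections on the PCs
lambda = diag(S).^2 / (size(X, 1) - 1);        % eigenvalues (variance of each PC)
varexp = lambda / sum(lambda);                 % explained variance (Eq. 3.17)

% Multivariate approach: MANOVA on the first q PCs explaining >= 90% variance.
q = find(cumsum(varexp) >= 0.9, 1);
group = [ones(n, 1); 2 * ones(n, 1)];
[~, p_mnv] = manova1(T(:, 1:q), group);        % p_mnv(1): test of equal group means

% Univariate approach: t-test per PC, with p-values adjusted by the weighted
% Bonferroni procedure, using renormalized explained variances as weights.
p_uni = zeros(q, 1);
for i = 1:q
    [~, p_uni(i)] = ttest2(T(1:n, i), T(n+1:end, i));
end
w = varexp(1:q) / sum(varexp(1:q));
p_adj = min(p_uni ./ w, 1);

In case of misalignment, p_mnv(1) and the adjusted p-value of the first PC would typically fall below the chosen significance level.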
3.3.1 Multiple outputs
Determining the alignment of models with multiple outputs may be less straightforward. If model implementations are clearly aligned or misaligned, conclusions can be drawn by analyzing the information provided by the proposed method for each individual output. Otherwise, an objective approach may be required. We suggest two such approaches: a) a Bonferroni or similar multiple comparison p-value correction; or, b) the concatenation of all outputs, centered and scaled, followed by a model-independent examination of the concatenated output. The first method can be appropriate for the multivariate case, in which we have one p-value per output. It can be extended to the univariate case if we only consider the uncorrected p-value from the first PC. If we consider multiple PCs or their already corrected p-values (weighted Bonferroni), the analysis becomes more complex and it can become unclear how to interpret the results. Additionally, multiple comparison correction methods often assume independence of test statistics, which may not be possible to ensure when testing different outputs of the same simulation model, which are most likely correlated. In the second approach, we concatenate all outputs and examine the resulting concatenated output. This reduces a model with k outputs to a model with one output. In order to perform output concatenation, we center and scale the outputs such that their domains are in the same order of magnitude. This can be performed using range scaling on each
output X, for example, as shown in Eqs. 3.18 and 3.19:

\tilde{X}_i = \frac{X_i - \overline{X}}{\max X - \min X}, \qquad i = 0, 1, \ldots, m  \qquad (3.18)

\tilde{X} = \begin{bmatrix} \tilde{X}_0 & \tilde{X}_1 & \cdots & \tilde{X}_m \end{bmatrix}  \qquad (3.19)

where X_i is the value of output X at iteration i, \overline{X} is the mean of output X, and \tilde{X} is the range-scaled version of X. Other centering and scaling methods, such as auto-scaling or level scaling [18], can be used as an alternative to range scaling in Eq. 3.18. For a model with k outputs, X^1, X^2, \ldots, X^k, the resulting concatenated output is given by

\tilde{A} = \tilde{X}^1 \oplus \tilde{X}^2 \oplus \cdots \oplus \tilde{X}^k  \qquad (3.20)

where \oplus is the concatenation operator, and \tilde{A} is the concatenation of all model outputs. Model implementations can thus be compared with the proposed method using the "single" model output \tilde{A}.
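As a minimal illustration of Eqs. 3.18 to 3.20, the following MATLAB/Octave snippet range-scales and concatenates three outputs of one simulation run. The outputs used here are random placeholders and the variable names are ours; any real outputs of equal length could be substituted.

% Range scaling (Eq. 3.18): subtract the output mean, divide by the range.
range_scale = @(X) (X - mean(X)) / (max(X) - min(X));

Ps = rand(1, 4001);   % placeholder prey population output (m + 1 iterations)
Pw = rand(1, 4001);   % placeholder predator population output
C  = rand(1, 4001);   % placeholder cell-bound food output

% Eqs. 3.19-3.20: concatenate the scaled outputs into a single output A.
A = [range_scale(Ps), range_scale(Pw), range_scale(C)];

Other scaling methods mentioned above, such as auto-scaling, would simply replace the denominator of range_scale (e.g., with the standard deviation of the output).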
3.4 Software packages
A number of software packages and libraries were developed with the purpose of supporting the research presented in this work. Here we describe each of these packages, discussing the motivation for developing them and showing where they were used.
3.4.1 cf4ocl: a C framework for OpenCL
The C Framework for OpenCL, cf4ocl [71], is a software library for rapid development of OpenCL programs in pure C. This library reduces much of the verbosity of the OpenCL C API, offering straightforward memory management, integrated profiling of OpenCL events (e.g., kernel execution and data transfers), a simple but extensible device selection mechanism, and user-friendly error management. cf4ocl abstracts the differences in the various OpenCL versions, presenting a consistent API which automatically selects the correct OpenCL API functions depending on the OpenCL platform version. This is important, since it is possible to have multiple OpenCL platforms with different OpenCL versions within the same system. The library is compatible with C++, since much of its functionality is also useful in that context. cf4ocl allows the user to mix regular OpenCL API calls and objects (the latter can be unwrapped from the several cf4ocl objects), thus not limiting end users' flexibility in any way: they can use as much or as little of cf4ocl as appropriate. Furthermore, cf4ocl follows the general logic of the OpenCL API, making it simple to switch between raw OpenCL and cf4ocl. The end goal of cf4ocl is to allow the programmer to focus on device code, which is where the real complexity lies, rather than wasting time with repetitive, error-prone host code. The main features provided by cf4ocl can be summarized as follows:
• Simplified memory management:
  – Clear set of constructor and destructor functions for all classes.
  – Automatic memory management for intermediate objects, such as information tokens retrieved from the underlying OpenCL objects.

• Flexible device selection mechanism, with direct functions for most use cases and accessible API for more complex workflows.
• Straightforward event dependency system, with automatic memory management of all event objects.
• User-friendly error reporting using the approach taken by the GLib library [87].

• Abstracts differences in the OpenCL version of the underlying platforms, presenting a consistent API to the developer.
• Integrated profiling, with basic and advanced functionality.

• Versatile device query utility.

• Offline kernel compiler, linker and analyzer.

Architecture

The cf4ocl library offers an object-oriented interface to the OpenCL API using wrapper classes and methods, implemented using C structures and functions, respectively. Some of the provided methods directly wrap OpenCL functions, while others perform a number of OpenCL operations in one function call (usually in situations where using the OpenCL API is more tedious, such as setting kernel arguments). In any case, errors are handled in a developer-friendly way using the GLib error reporting system, and the majority of intermediate memory objects are automatically managed. This, in itself, greatly reduces the number of lines of code required, promoting faster and less error-prone development. Since cf4ocl follows the workflow logic of the OpenCL API, it is straightforward for a developer to move between the two systems. Additionally, because raw OpenCL objects are always accessible to developers, a mix of OpenCL host code and cf4ocl code is possible. Developers can completely avoid any direct OpenCL API calls by using cf4ocl to its full capabilities, or use only the cf4ocl functionality that suits them.

Wrapper constructors create the OpenCL object to be wrapped, but delegate memory allocation to the special ccl_<class>_new_wrap() functions. These accept the OpenCL object, and in turn call the ccl_wrapper_new() function, passing it not only the object, but also the size in bytes of the wrapper to be created. The ccl_wrapper_new() function allocates memory for the wrapper (initializing this memory to zero), and keeps the OpenCL object (wrapping it) in the new wrapper instance. For example, ccl_kernel_new() creates the cl_kernel object with the clCreateKernel() OpenCL function, but then relies on
the ccl_kernel_new_wrap() function (and thus, on ccl_wrapper_new()) for allocation and initialization of the new CCLKernel wrapper object memory.

The destruction of wrapper objects and the respective memory deallocation is performed in a similar fashion. Each wrapper class has its own ccl_<class>_destroy() method, which delegates actual object release to the "abstract" ccl_wrapper_unref() function. This function accepts the wrapper to be destroyed, its size in bytes, and two function pointers: the first, with prototype defined by ccl_wrapper_release_fields(), is a wrapper-specific function for releasing internal wrapper objects, of which the super class has no knowledge; the second is the OpenCL object destructor function, with prototype defined by ccl_wrapper_release_cl_object(). Continuing with the kernel example, the ccl_kernel_destroy() method delegates kernel wrapper destruction to ccl_wrapper_unref(), passing it the kernel wrapper object, its size (i.e., sizeof(CCLKernel)), the "private" (static in C) ccl_kernel_release_fields() function for destroying kernel internal objects, and the clReleaseKernel() OpenCL kernel destructor function. As such, all cf4ocl wrapper objects use a common memory allocation and deallocation strategy, implemented in the CCLWrapper super class.

If ccl_<class>_new_wrap() functions are passed an OpenCL object which is already wrapped, a new wrapper will not be created. Instead, the existing wrapper is returned, with its reference count increased by 1. Thus, there is always a one-to-one relationship between wrapped OpenCL objects and their respective wrappers. In practice, the ccl_<class>_destroy() functions decrease the reference count of the respective wrapper, only destroying it if the reference count reaches zero. The CCLWrapper base class maintains a static hash table which associates OpenCL objects (keys) with cf4ocl wrappers (values). Access to this table is thread-safe and performed via the ccl_wrapper_new() and ccl_wrapper_unref() functions.

The management of OpenCL object information and OpenCL events is also handled automatically by cf4ocl. This frees the developer from manually managing the associated memory. In the case of OpenCL events, this allows cf4ocl to automate and simplify profiling. The library offers a number of methods to generate, manage and export profiling information, including for overlapping events, which occur in the context of multiple command queues managed by their own host threads. cf4ocl also provides mechanisms for both simple and advanced device selection and context creation, as well as utilities for querying devices and performing offline kernel compilation.
Quality control

Library code is covered by unit tests implemented with the GLib test framework, while utilities and examples are probed by behavioral tests via the Bash Automated Testing System (BATS) [222]. Both unit and behavioral tests can be executed against all devices in a system.
Availability

Operating system: Linux, Windows, Mac OS X.

Programming language: C and OpenCL.

Dependencies and third-party packages: Runtime dependencies include GLib [87] and OpenCL [107]. Building cf4ocl from source also requires CMake [152], Git [239] and a C compiler with support for C99. Behavioral tests require BATS.

Software location: The development version is located at https://github.com/fakenmc/cf4ocl, while an archived version is deposited at https://zenodo.org/record/57036.

License: GNU Lesser General Public License [80] (library code) and GNU General Public License [79] (remaining code).

Motivation for development

The cf4ocl library is used by the OpenCL implementations of the PPHPC model, described in Section 3.2.2. The library greatly reduced the work required to initially set up these implementations and to perform various refactoring steps during their development. It did so by eliminating the verbosity of the OpenCL host API, by facilitating debugging through its error reporting capabilities, and by allowing simple yet detailed profiling of OpenCL events, which in turn helped to pinpoint the cause of various bottlenecks.
3.4.2 cl_ops: a library of common OpenCL operations
The cl_ops library provides several common OpenCL operations, namely pseudo-random number generation, sorting and parallel prefix sums. The latter are also referred to as scans.

Architecture

This library has an architecture similar to that of cf4ocl. It exposes an object-oriented API, using cf4ocl objects where applicable. It can be considered an extension to cf4ocl, providing templates for actual device operations, grouped into three modules: pseudo-random number generation, sorting, and parallel prefix sum. The following algorithms are currently implemented:

• PRNGs:
  1. LCG (48 bits) [123].
  2. XorShift (64 and 128 bits) [150].
  3. MWC64x (64 bits) [237].
  4. Park-Miller (32 bits) [174].
  5. TauLCG: combined Tausworthe generator with an LCG (128 bits) [108].

• Sorting algorithms:
  1. Simple bitonic sort [15].
  2. Advanced bitonic sort [9, 207].
  3. Radix sort [206].

• Parallel prefix sum [25].
Quality control

Unit tests were implemented for the PRNG module. Benchmarking tests bundled with cl_ops successfully verify results from the three modules.

Availability

Operating system: Linux, Windows, Mac OS X.

Programming language: C and OpenCL.

Dependencies and third-party packages: Runtime dependencies include cf4ocl (discussed in Section 3.4.1), GLib [87] and OpenCL [107]. Building cl_ops from source also requires CMake [152], Git [239] and a C compiler with support for C99.

Software location: The development version is located at https://github.com/fakenmc/cl_ops. There is currently no archived version available.

License: GNU Lesser General Public License [80] (library code) and GNU General Public License [79] (remaining code).

Motivation for development

The cl_ops library is used by the OpenCL implementations of the PPHPC model, described in Section 3.2.2. More specifically, these implementations delegate PRNG, sorting and scanning functionality to cl_ops.
3.4.3 SimOutUtils: utilities for analyzing time series simulation output
SimOutUtils is a suite of MATLAB [154] functions for studying and analyzing time series-like output from stochastic simulation models, as well as for producing associated publication quality figures and tables [74]. More specifically, the functions bundled with SimOutUtils allow the user to:

1. Study and visualize simulation output dynamics, namely the range of values per iteration and the existence or otherwise of transient and steady-state stages.
2. Perform distributional analysis of focal measures (FMs), i.e., of statistical summaries taken from model outputs (e.g., maximum, minimum, steady-state averages).

3. Determine the alignment of two or more model implementations by statistically comparing FMs. In other words, aid in the process of docking simulation models [8].

4. From the previous points, produce publication quality LaTeX tables and figures (the latter via the matlab2tikz script [210]).

These utilities were generalized to be usable with any stochastic simulation model with time series-like outputs. The utilities were carefully coded in order to be compatible with GNU Octave [60].

Implementation and architecture

The SimOutUtils suite is implemented in a procedural programming style, and is bundled with a number of functions organized in modules or function groups. As shown in Figure 3.5, the following function groups are provided with SimOutUtils:

1. Core functions.
2. Distributional analysis functions.
3. Model comparison functions.
4. Helper and third-party functions (not shown in Figure 3.5).

The next paragraphs describe each group of functions in additional detail.

Core functions

Core functions work directly with simulation output files or perform low-level manipulation of outputs. The stats_get function is the basic unit of this module, and is at the center of the SimOutUtils suite. From the perspective of the remaining functions, stats_get is responsible for extracting statistical summaries from the simulation outputs in one file (i.e., from the outputs of one simulation run). In practice, the actual work is performed by another function, generically designated as stats_get_*, for which stats_get serves as a facade. The exact function to use (and consequently, the concrete statistical summaries to extract) is specified in a namespaced global variable defined in the SimOutUtils startup script. This allows researchers to extract statistical summaries and use FMs adequate for different types of simulation output.

Two stats_get_* functions are provided, namely stats_get_pphpc and stats_get_iters. The former, set by default, was developed for the PPHPC model, and obtains six statistical summaries from each output: maximum, iteration where the maximum occurs, minimum, iteration where the minimum occurs, steady-state mean and steady-state standard deviation. It is adequate for time-series outputs with a transient stage and a steady-state stage. The latter, stats_get_iters, obtains statistical summaries corresponding to output
values at user-specified instants. It is very generic, and is appropriate for cases where it is hard to derive other meaningful statistics from simulation output.

Figure 3.5 – SimOutUtils architecture. Larger blocks with rounded corners and dashed outline constitute function groups, identified in italic font at the lower left corner of the respective block. Within these, functions are represented by smaller blocks with solid outline and sharp corners, with the function name shown in typewriter font. Arrows reflect the relationship between functions and between functions and function groups.

stats_get_* functions are also required to provide the name of the returned statistical summaries. This metadata is used by higher level functions for producing figures and tables.

The stats_gather function extracts FMs from multiple simulation output files, i.e., for a number of simulation runs, by calling stats_get for individual files. It returns an object containing an n × m matrix, with n observations (from n files) and m FMs (i.e., statistical
summaries from one or more outputs). The returned object also includes metadata, namely a data name tag, output names and statistical summary names (via stats_get and the underlying stats_get_* implementation).
The matrix returned by stats_gather can be fed into the stats_analyze function, which determines, for each sample of n observations of individual FMs, the following statistics: mean, variance, confidence intervals, p-value of the Shapiro-Wilk normality
test [213] and sample skewness. This function is called by all functions in the distributional analysis module, as discussed in the next section.

Plots of simulation output from one or more replications can be produced using output_plot. This function generates three types of plot: superimposed, extremes or moving average, as shown in Figure 3.6. Superimposed plots display the output from one or more simulation runs (Figures 3.6a and 3.6b, respectively). Extremes plots display the interval of values an output can take over a number of runs for all iterations (Figure 3.6c). Finally, it is also possible to visualize the moving average of an output over multiple replications (Figure 3.6d). This type of plot requires the user to specify the window size (a non-negative integer) with which to smooth the output. A value of zero is equivalent to no smoothing, i.e., the function will simply plot the averaged outputs. Moving average plots are useful for empirically selecting a steady-state truncation point.

Figure 3.6 – Types of plot provided by the output_plot function: (a) superimposed, one run; (b) superimposed, 30 runs; (c) extremes, 30 runs; (d) moving average, 30 runs, window size = 10. All figures show the sheep population output from the PPHPC model for size 100, parameter set 1 [70].

The provided stats_get_* functions, as well as output_plot, use the dlmread MATLAB/Octave function to open files containing simulation output. As such, these functions expect text files with numeric values delimited by a separator (automatically inferred by
dlmread). The files should contain data values in tabular format, with one column per output and one row per iteration.

Distributional analysis functions

Functions in the distributional analysis module generate tables and figures which summarize different aspects of the statistical distributions of FMs. The dist_plot_per_fm and dist_table_per_fm functions focus on one FM and provide a distributional analysis over several setups or configurations, i.e., over a number of model scales and/or parameter sets. On the other hand, stats_table_per_setup and dist_table_per_setup offer a distributional analysis of all FMs, fixing on one setup.

The dist_plot_per_fm function plots the distributional properties of one FM, namely its estimated probability density function (PDF), histogram and quantile-quantile (QQ) plot. The information provided by stats_analyze is shown graphically and textually in the PDF plot. The main goal of dist_plot_per_fm is to provide a general overview of how the distributional dynamics of an FM vary with different model configurations. The dist_table_per_fm function produces similar content but is oriented towards publication quality materials. It outputs a partial LaTeX table with a distributional analysis for a range of setups (e.g., model scales) and a specific use case (e.g., parameter set). These partial tables can be merged into larger tables, with custom features such as additional rows, headers and/or footers.

The stats_table_per_setup function produces a plain text or LaTeX table with the statistics returned by the stats_analyze function for all FMs for one model setup. In turn, dist_table_per_setup generates a LaTeX table with a distributional analysis of all FMs for one model setup. For each FM, the table shows the mean, variance, p-value of the Shapiro-Wilk test, sample skewness, histogram and QQ-plot.

Model comparison functions

Utilities in the model comparison group aid the modeler in comparing and aligning simulation models through informative tables and plots, also producing publication quality LaTeX tables containing p-values yielded by user-specified statistical comparison tests. The stats_compare_plot function plots the probability density function (PDF) and cumulative distribution function (CDF) of FMs taken from multiple model implementations. It is useful to visually compare the alignment of these implementations, providing a first indication of the docking process.

The stats_compare function is the basic procedure of the model comparison utilities, comparing FMs from two or more model implementations by applying user-specified statistical comparison tests. It is internally called by stats_compare_pw and stats_compare_table, as shown in Figure 3.5. The former applies two-sample statistical tests, in pair-wise fashion, to FMs from multiple model implementations, outputting a plain text table of pair-wise failed tests. It is useful when more than two implementations are being compared, detecting which ones may be misaligned. The latter, stats_compare_table, is a very versatile function which outputs a LaTeX table with p-values resulting from statistical
tests used to evaluate the alignment of model implementations.

Helper and third-party functions

There are two additional groups of functions, the first containing helper functions, and the second containing third-party functions. Helper functions are responsible for tasks such as determining confidence intervals, histogram edges, QQ-plot points, moving averages and whether MATLAB or Octave is being used. Functions for formatting real numbers and p-values, as well as for creating very simple histograms and QQ-plots in TikZ [233], are also included in this group.

A number of third-party functions, mostly providing plotting features, are also included. The figtitle function adds a title to a figure with several subplots [93]. The fill_between function [249] is used by output_plot for filling the area between output extremes. The homemade_ecdf function [28] is a simple Octave-compatible replacement for the MATLAB-specific ecdf, assisting stats_compare_plot in producing the empirical CDFs. In turn, the kde function [27] is used to estimate the PDFs plotted by stats_compare_plot and dist_plot_per_fm. The swtest function is the only third-party procedure not related to plotting, providing the p-values of the Shapiro-Wilk parametric hypothesis test of normality [201]. Some of these functions were modified, in accordance with the respective licenses, for better integration with the goals of SimOutUtils.

Quality control

All functions have been individually tested for correctness in both MATLAB and Octave, and most are covered by unit tests in order to ensure their correct behavior. The MOxUnit framework [170] is required for running the unit tests. Additionally, all the examples available in the user manual (bundled with the software) have been tested in both MATLAB and Octave. These examples range from simple usage patterns to the concrete use cases of this thesis.

Reuse potential

These utilities can be used for analyzing any stochastic simulation model with time series-like outputs. As described in 'Core functions', output-specific FMs can be defined by implementing a custom stats_get_* function and setting its handle in the simoututils_stats_get_ global variable. The core stats_gather and stats_analyze functions can be integrated into other higher-level functions to perform operations not available in SimOutUtils.

Availability

Operating system: Any system capable of running MATLAB R2013a or GNU Octave 3.8.1, or higher.

Programming language: MATLAB R2013a or GNU Octave 3.8.1, or higher.
Dependencies and third-party packages: MATLAB requires the Statistics Toolbox. This software uses additional MATLAB/Octave functions written by Chad A. Greene [93], Benjamin Vincent [249], Mathieu Boutin [28], Zdravko Botev [27] and Ahmed Ben Saïda [201].

Software location: The development version is located at https://github.com/fakenmc/simoututils, while an archived version is deposited at https://zenodo.org/record/50525.

License: MIT license [161].

Motivation for development

This software was used to simplify: 1) model output analysis; and, 2) model comparisons using the classic approach. The generation of tables and figures related to these topics was fully automated with SimOutUtils. These include all tables and figures in Section 4.1 and Appendix A, as well as tables associated with the classic model comparison methods in Section 4.3, Section 4.4 and Appendix B.
3.4.4 micompr: multivariate independent comparison of observations
The micompr package [76] for the R statistical computing environment [182] implements the methodology proposed in Section 3.3. It is built upon two functions, cmpoutput and micomp. The former compares two or more samples of multivariate observations collected from one output. The latter is used for comparing multiple outputs and/or comparing outputs in multiple contexts. grpoutputs is a helper function for loading data from two or more sets of files and preparing the data to be processed by the cmpoutput and/or micomp functions. assumptions is a generic function for assessing the assumptions of the parametric tests used in sample comparisons.

Architecture

micompr is structured according to the S3 object-oriented system. The cmpoutput, micomp and grpoutputs functions produce S3 objects with the same name. The package also provides the assumptions generic method, and two concrete implementations, assumptions.cmpoutput and assumptions.micomp, which return objects of class assumptions_cmpoutput and assumptions_micomp, respectively. All classes have implementations of the common S3 generic methods print, summary and plot. Additionally, implementations of the toLatex generic, for producing user-configurable LaTeX tables with information about the performed comparisons, are provided for cmpoutput and micomp objects.

grpoutputs

This function groups outputs from sets of files containing multiple observations into samples. It returns a list of output matrices, ready to be processed by micomp. Alternatively, individual output matrices can be handled by cmpoutput. Separate files
contain one multivariate observation of one or more outputs, one column per output, one row per dimension or variable. Each specified set of files is associated with a different sample. The function is also able to create an additional concatenated output, composed from the centered and scaled original outputs. The plot.grpoutputs function shows k plots, one per output. Output observations are plotted on top of each other, with different samples colored distinctively. summary.grpoutputs returns a list containing two elements: a) the n × m dimensions of each output matrix; and, b) the sizes of individual samples. The print.grpoutputs function simply outputs the summary in a more adequate presentation format.

cmpoutput

The cmpoutput function is at the core of micompr. It compares two or more samples of multivariate observations using the technique described in Section 3.3. It accepts an output matrix, X(n×m), with n observations and m variables or dimensions, a factor vector of length n, specifying the sample associated with each observation, and a vector of explained variances with which to determine the number of PCs to use in the MANOVA test (alternatively, the number of PCs can also be directly specified). The function returns the matrix T(n×r) of PCA scores and the p-values for the performed statistical tests, namely: a) a MANOVA test for each explained variance (or number of PCs); and, b) parametric (t-test or ANOVA) and non-parametric (Mann-Whitney or Kruskal-Wallis) univariate tests for each PC. Regarding the latter, the function also returns p-values adjusted with the weighted Bonferroni correction, using the percentages of explained variance by PC as weights.

The plot implementation for cmpoutput objects shows six sub-plots, namely a scatter plot with the PC1 vs. PC2 scores and five bar plots. The horizontal scale of the latter consists of the r PCs, and the vertical bars represent the explained variance (one plot) or univariate parametric and non-parametric p-values, before and after weighted Bonferroni correction (four plots). The summary.cmpoutput function returns a list with the following items: a) percentage of variance explained by each PC; b) p-values of the MANOVA test or tests; c) p-values of the parametric test, per PC, before and after weighted Bonferroni correction; d) p-values of the non-parametric test, per PC, before and after weighted Bonferroni correction; and, e) name of the parametric and non-parametric univariate tests employed (either t-test and Mann-Whitney U test for comparing two samples, or ANOVA and Kruskal-Wallis for more than two samples). print.cmpoutput shows the information provided by the summary implementation, but the p-values of the univariate tests are only shown for the first PC.

micomp

The micomp function performs one or more comparisons of multiple outputs, invoking cmpoutput for each comparison/output combination. It accepts a list of comparisons, where individual comparisons can have one of two configurations: a) a vector of folders and a vector of file sets containing data in the format required by grpoutputs, where each file set corresponds to a different sample; and, b) a grpoutputs object, passed
directly. The returned objects, of class micomp, are basically two-dimensional lists of cmpoutput instances, with rows associated with individual outputs, and columns with separate comparisons. The plot.micomp function shows the PC1 vs. PC2 score plots for each comparison/output combination. The summary implementation for micomp returns a list of comparisons, each one containing an a × k matrix of p-values or numbers of PCs, associated with a ≥ 6 measures and k outputs. Four rows represent the p-values of the parametric and non-parametric univariate tests for the first PC, before and after weighted Bonferroni correction. The remaining pairs of rows are associated with the MANOVA test for a given percentage of variance to explain. One row shows the p-values, and the other displays the number of PCs required to explain the specified percentage of variance for the given output. As with other micompr objects, the print.micomp function also shows the summary with a better presentation.

assumptions

assumptions is a generic function which performs a number of statistical tests concerning the assumptions of the parametric tests performed by the package functions. Implementations of this generic function exist for cmpoutput and micomp objects. The former, assumptions.cmpoutput, returns objects of class assumptions_cmpoutput containing the results of the assumptions tests for a single output comparison. The latter, assumptions.micomp, returns a two-dimensional list of assumptions_cmpoutput objects, with rows associated with individual outputs, and columns with separate comparisons. These objects are tagged with the assumptions_micomp class attribute. The following assumptions are checked: a) observations are normally distributed within each sample along individual PCs (Shapiro-Wilk test); b) observations follow a multivariate normal distribution within each sample for all PCs used in MANOVA (Royston test); c) samples have homogeneous variance along individual PCs (Bartlett test); and, d) samples have homogeneous covariance matrices for all PCs used in MANOVA (Box's M test). Assumptions a) and c) should be verified for the parametric test applied to each PC, while assumptions b) and d) should be verified for individual MANOVA tests performed for each variance to explain (or, alternatively, for each specified number of PCs).

The plot implementations, plot.assumptions_cmpoutput and plot.assumptions_micomp, display a number of bar plots for the p-values of the performed tests. These are more detailed for assumptions_cmpoutput objects, showing the p-values of the univariate test for all PCs. For assumptions_micomp objects, one bar plot is shown per output/comparison combination, but in the case of the univariate tests only the p-values of the first PC are shown. Implementations of summary return a list of tabular data containing the p-values of the assumption tests. summary.assumptions_cmpoutput returns a list with two matrices of p-values, one for the MANOVA tests, another for the univariate tests. summary.assumptions_micomp follows the approach taken by the summary.micomp function, returning a list of p-value matrices, one matrix per comparison. Rows of individual matrices correspond to the assumptions tests, and columns to outputs. The
print.assumptions_cmpoutput and print.assumptions_micomp functions again show the summary information in a printable format.

toLatex.cmpoutput and toLatex.micomp

These functions are implementations of the toLatex generic method, and convert cmpoutput and micomp objects to character vectors representing LaTeX tables. The generated tables are configurable via function arguments, with sensible defaults. Tables can present the following data for each output/comparison combination: a) number of principal components required to explain a user-specified percentage of variance; b) MANOVA p-value for a user-specified percentage of variance to explain or number of PCs; c) parametric test p-value for a given PC, before and/or after weighted Bonferroni correction; d) non-parametric test p-value for a given PC, before and/or after weighted Bonferroni correction; e) variance explained by a specific PC; and, f) a score plot with the output projection on the first two PCs.

Other functions

The micompr package is bundled with additional functions whose purpose is to help the main package methods do their job. However, some of these may be useful in other contexts. The concat_outputs function concatenates outputs collected from multiple observations. It accepts two arguments, namely a list of output matrices, and the centering and scaling method. Several centering and scaling methods, such as "range", "iqrange", "vast" or "pareto" [18], are recognized in the second argument. The function returns an n × p matrix of n observations with length p, which is the sum of individual output lengths. Lower-level centering and scaling of individual outputs is performed by the centerscale function, which accepts a numeric vector and returns a new vector, centered and scaled with the specified method.

The pvalf generic function formats p-values for LaTeX. A concrete implementation, pvalf.default, is used by default by the micompr toLatex implementations. This implementation underlines and double-underlines p-values lower than 0.05 and 0.01, respectively, although these limits are configurable, and underlining can be turned off by setting both limits to zero. It is also possible to specify a limit below which p-values are capped. For example, if this limit is set to 1 × 10^-5, a p-value equal to 1 × 10^-6 would be displayed as "< 1e-5". The pvalf.default function will format p-values lower than 5 × 10^-4 using scientific E notation, which is more compact and thus a better fit for tables. p-values between 5 × 10^-4 and 1 are formatted using regular decimal notation with three decimal places. This aspect is not configurable. However, another implementation of pvalf can be passed to the micompr toLatex implementations if different formatting is desired.

Simple TikZ 2D scatter plots, such as the ones produced by the micompr toLatex implementations, can be generated with the tikzscat function. The function accepts the data to plot, an n × 2 numeric matrix of n observations and 2 dimensions, and a factor vector specifying the levels or groups associated with each observation. Several plot characteristics, such as mark types, scale and axes color, are configurable via function arguments.
tikzscat returns a string containing the TikZ figure code for plotting the specified data.

Quality control

The available functions are covered by unit tests in order to ensure their correct behavior. Additionally, all the examples available in the documentation are automatically tested when building the package.

Availability

Operating system: Any system capable of running the R statistical computing environment, version 3.2.0 or newer.

Programming language: The R programming language.

Dependencies and third-party packages: micompr has a number of optional dependencies, which are not required for package installation nor for using most of its functionality. The biotools [47] and MVN [125] packages are required by the assumptions functions, providing the statistical tests for assessing MANOVA and t-test assumptions. If these functions are invoked without the specified packages being present, they will inform the user of that fact and terminate cleanly. The testthat [257], knitr [270] and roxygen2 [258] packages are required for package development. The deseasonalize R package [156] is required for building one of the vignettes.

Software location: The package is available at https://cran.r-project.org/package=micompr. The development version is hosted at https://github.com/fakenmc/micompr.

License: MIT license [161].

Motivation for development

This software was used to facilitate the analysis of model comparisons using the model-independent technique proposed in Section 3.3. The generation of the associated tables in Section 4.3, Section 4.4 and Appendix B, including the embedded score plots, was fully automated with micompr.
3.4.5 micompm: a MATLAB port of micompr
micompm is a MATLAB/Octave package for comparing multivariate samples. It is ported from micompr, discussed in Section 3.4.4.

Architecture

micompm follows an architecture similar to that of micompr, providing the following functions:
grpoutputs: Groups outputs from multiple observations of the systems to be compared.

cmpoutput: Compares one output from several observations of two or more samples.

micomp: Performs multiple multivariate independent comparisons of observations.

micomp_show: Generates tables and plots of multivariate independent comparisons of observations.

test_assumptions: Gets the assumptions for the parametric tests performed in a comparison.

There are, however, a few limitations when compared to micompr:

• It does not support outputs with different lengths.

• It does not directly provide p-values adjusted with the weighted Bonferroni procedure.

• It is unable to directly apply and compare multiple MANOVA tests to each output/comparison combination for multiple user-specified variances to explain.
Quality control

Using the same input data, micompm produces the same results as micompr. This was verified for various examples. Nonetheless, there are currently no formal unit tests.

Availability

Operating system: Any system capable of running MATLAB R2013a or GNU Octave 3.8.1, or higher.

Programming language: MATLAB R2013a or GNU Octave 3.8.1, or higher.

Dependencies and third-party packages: This software uses various third-party packages, namely MANCOVAN [95], MBoxTest [241], Roystest [240] and swtest [201]. These packages are included with micompm.

Software location: The development version is located at https://github.com/fakenmc/micompm. There is currently no archived version available.

License: MIT license [161].

Motivation for development

The main motivation for the development of this package was to confirm the results produced by micompr. Since this is a novel method, we felt that a double verification of results justified the additional work.
3.4.6 PerfAndPubTools: tools for software performance analysis and publishing of results
PerfAndPubTools consists of a set of MATLAB [154] functions, compatible with GNU Octave [60], for the post-processing and analysis of software performance benchmark data and for producing associated publication quality materials [69]. More specifically, the functions bundled with PerfAndPubTools allow the user to:

1. Batch process files containing benchmarking data of computer programs, one file per run.
2. Determine the mean and standard deviation of benchmarking experiments with several runs.
3. Organize the benchmark statistics by program implementation and program setup.
4. Output scalability and speedup data, optionally generating associated figures.
5. Create publication-ready benchmark comparison tables in LaTeX.

These tools can be used with any computational benchmark experiment.

Implementation and architecture

Performance analysis in PerfAndPubTools takes place at two levels: implementation and setup. The implementation level is meant to be associated with specific software implementations for performing a given task, for example a particular sorting algorithm or a simulation model realized in a certain programming language. Within the context of each implementation, the software can be executed under different setups. These can be different computational sizes (e.g., vector lengths in a sorting algorithm) or distinct execution parameters (e.g., number of threads used).

PerfAndPubTools is implemented in a layered architecture using a procedural programming approach, as shown in Figure 3.7. From the lowest to the highest level of functionality, the functions represented in this figure have the following roles:

get_gtime: Given a file containing the default output of the GNU time [120] command, this function extracts the user, system and elapsed times in seconds, as well as the percentage of CPU usage.

gather_times: Loads execution times from files in a given folder. This function uses get_gtime by default, but can be configured to use another function to load individual benchmark files with a different format.

perfstats: Determines mean times and respective standard deviations of a computational experiment, optionally plotting a scalability graph if different setups correspond to different computational work sizes.
Figure 3.7 – PerfAndPubTools architecture. Blocks in typewriter font represent functions. Dashed blocks represent directly replaceable functions.
speedup: Determines the average, maximum and minimum speedups against one or more reference implementations across a number of setups. Can optionally generate a bar plot displaying the various speedups.

times_table: Returns a matrix with useful contents for publication tables, namely times (in seconds), absolute standard deviations (seconds), relative standard deviations, and speedups against one or more reference implementations.

times_table_f: Returns a table with performance analysis results formatted in plain text or in LaTeX (the latter requires the siunitx [268], multirow [244] and booktabs [77] packages).

Although the perfstats and speedup functions optionally create plots, these are mainly intended to provide visual feedback on the performance analysis being undertaken. Those needing more control over the final figures can customize the generated plots via the returned figure handles or create custom plots using the data provided by perfstats and speedup. Either way, MATLAB/Octave plots can be used directly in publications, or converted to LaTeX using the excellent matlab2tikz script [210], as exemplified in the PerfAndPubTools user manual.
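As a library-independent illustration of what perfstats and speedup compute, the following MATLAB/Octave fragment derives mean times, standard deviations and speedups from hypothetical elapsed times; it does not use the PerfAndPubTools API itself.

% Hypothetical elapsed times (in seconds) for 10 runs of a reference
% implementation and of a faster implementation, for one setup.
t_ref  = 100 + 5 * randn(1, 10);
t_fast =  10 + 1 * randn(1, 10);

avg_ref  = mean(t_ref);   sd_ref  = std(t_ref);    % per-implementation statistics
avg_fast = mean(t_fast);  sd_fast = std(t_fast);

s_avg = avg_ref / avg_fast;          % average speedup against the reference
s_max = max(t_ref) / min(t_fast);    % maximum observed speedup
s_min = min(t_ref) / max(t_fast);    % minimum observed speedup

fprintf('%.1f (%.1f) s vs. %.1f (%.1f) s: speedup %.1fx [%.1fx, %.1fx]\n', ...
    avg_ref, sd_ref, avg_fast, sd_fast, s_avg, s_min, s_max);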
Quality control

The available functions are covered by unit tests in order to ensure their correct behavior. The MOxUnit framework [170] is required for running the unit tests. Additionally, all the examples available in the user manual (bundled with the software) have been tested in both MATLAB and Octave.
Reuse potential

These utilities can be used for analyzing any computational experiment. As described above, other benchmark data formats can be supported by implementing a custom function to replace get_gtime and setting its handle in the gather_times function. Results from the perfstats and speedup functions can be used to generate other types of figures. The same is true for times_table, the results of which can be integrated in table layouts other than the one provided by times_table_f.

Availability

Operating system: Any system capable of running MATLAB R2013a or GNU Octave 3.8.1, or higher.

Programming language: MATLAB R2013a or GNU Octave 3.8.1, or higher.

Dependencies and third-party packages: There are no additional dependencies for the package tools. However, this software is enhanced by the matlab2tikz script and by the siunitx, multirow and booktabs LaTeX packages. Additionally, unit tests depend on the MOxUnit unit test framework for MATLAB and GNU Octave.

Software location: The development version is located at https://github.com/fakenmc/perfandpubtools, while an archived version is deposited at https://zenodo.org/record/50190.

License: MIT license [161].

Motivation for development

The PerfAndPubTools package allowed for fast analysis of performance results, including the automatic generation of the respective tables and figures in Section 4.2.
Chapter 4
Results and discussion

In this chapter we describe the performed experiments, present the respective results, and discuss their implications. We begin with an analysis of the PPHPC model outputs, detailed in Section 4.1. In Section 4.2 we examine and compare the performance of the several PPHPC implementations. To assess the distributional equivalence of these implementations, their outputs are statistically compared, using both classic and model-independent comparison methods, in Section 4.3. The latter method undergoes an additional investigation in Section 4.4, where its discriminant capabilities are compared with those of the classical approach when implementations are misaligned.
4.1 Analyzing the output of the PPHPC model

4.1.1 Experimental setup
A total of 30 replications, r = 1, ..., 30, were performed with NetLogo 5.1.0 for each combination of model sizes (Table 3.3, sizes up to 1600) and parameter sets (Table 3.4). Each replication r was performed with a PRNG seed obtained by taking the MD5 checksum of r and converting the resulting hexadecimal string to a 32-bit integer (the maximum precision accepted by NetLogo), guaranteeing some independence between seeds, and consequently, between replications. The data produced by this computational experiment, as well as the scripts used to set up the experiment, are publicly available at https://zenodo.org/record/34053.
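A possible reconstruction of this seed derivation is sketched below for MATLAB with Java support (the MD5 digest is obtained via java.security); the exact way in which the 128-bit checksum is reduced to 32 bits is our assumption, and the scripts available at the above URL remain the authoritative reference.

% Derive a 32-bit PRNG seed from the replication number r (illustrative only).
r = 7;
md = java.security.MessageDigest.getInstance('MD5');
dig = md.digest(uint8(num2str(r)));                  % 16-byte MD5 digest of '7'
hexstr = sprintf('%02x', typecast(dig', 'uint8'));   % 32-character hex string
seed = uint32(hex2dec(hexstr(1:8)));                 % keep the first 32 bits as the seed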
4.1.2 Determining the steady-state truncation point
Using Welch's method, we smoothed the averaged outputs using a moving average filter with w = 10. After experimenting with other values, we found w = 10 to be a good compromise between rough and overly smooth plots. Figure 4.1 shows results for model size 400 and both parameter sets. Following the recommendations described in Section 3.1.3, we select the steady-state truncation point to be l = 1000 for parameter set 1, and l = 2000 for parameter set 2. These are round values which appear to occur after the transient stage.
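A simple sketch of this smoothing step is given below for MATLAB/Octave; the matrix X is a random placeholder standing in for the outputs of the 30 replications (one row per run, one column per iteration), and the window definition and edge handling are simplified with respect to Welch's procedure.

w = 10;                                       % smoothing parameter
X = rand(30, 4001);                           % placeholder: 30 runs x 4001 iterations
xbar = mean(X, 1);                            % output averaged over replications
kern = ones(1, 2 * w + 1) / (2 * w + 1);      % centered moving average window
xsmooth = conv(xbar, kern, 'same');           % smoothed averaged output
plot(0:numel(xsmooth) - 1, xsmooth);          % inspect to choose truncation point l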
Figure 4.1 – Moving average of outputs for model size 400 with w = 10: (a) population moving average, parameter set 1; (b) energy moving average, parameter set 1; (c) population moving average, parameter set 2; (d) energy moving average, parameter set 2. Other model sizes produce similar results, apart from a vertical scaling factor. The dashed vertical line corresponds to iteration l, after which the output is considered to be in steady-state.
Other model sizes produce similar results, apart from a vertical scaling factor, which means that these values of l are also applicable in those cases.
4.1.3 Analyzing the distributions of focal measures
The six statistical summaries for each FM, namely mean, sample variance, p-value of the SW test, skewness, histogram and Q-Q plot, are given in Appendix A for all model size and parameter set combinations. The number of bins in the histograms is set to the minimum of 10 (an appropriate value for a sample size of 30) and the number of unique values in the data set. Much of the information provided in Tables A.1 to A.10, namely the p-value of the SW test, the skewness, and the Q-Q plots, is geared towards continuous distributions. However, FMs taken from arg max and arg min operators only yield integer (discrete) values, which correspond to specific iterations. The same is true for the max and min of population outputs,
namely Pis, Piw, and Pic. This can be problematic for statistical summaries taken from integer-valued FMs with a small number of unique values. For example, the SW test will not be very informative in such cases, and cannot even be performed if all observations yield the same value (e.g., arg max of Pic for 800@1, Table A.4). Nonetheless, the distributional properties of an FM can change dramatically for different model size and parameter set combinations. For example, for parameter set 2, observations of the arg max of Pic span many different values for model size 200 (Table A.7), while for size 1600 (Table A.10) they are limited to only three different values. Summary statistics appropriate for continuous distributions could be used in the former case, but do not provide overly useful information in the latter. In order to maintain a consistent approach, our discussion will continue mainly from a continuous distribution perspective, more specifically by analyzing how closely a given FM follows the normal distribution, though we superficially examine its discrete nature when relevant.

Distribution of focal measures over the several size@set combinations

In the next paragraphs we describe the distributional behavior of each FM and, when useful, repeat in a compact fashion some of the information provided in Tables A.1 to A.10.

max Pis: The SW p-value is consistently above the 5% significance level, skewness is usually low and with an undefined trend, and the Q-Q plots mostly follow the y = x line. Although there are borderline cases, such as 800@1 and 1600@2, the summary statistics show that the maximum prey population FM generally follows an approximately normal distribution.

arg max Pis: This FM follows an approximately normal distribution for smaller sizes of parameter set 1, but as model size grows larger, the discrete nature of the data clearly stands out. This behavior is more pronounced for parameter set 2 (which yields simulations inherently larger than parameter set 1), such that, for 1600@2, all observations yield the same value (i.e., 70). Table 4.1 shows, using histograms, how the distribution qualitatively evolves over the several size@set combinations.
Table 4.1 – Histograms for the several size@set combinations of the arg max Pis FM.
min Pis : Two very different behaviors are observed for the two parameter sets. In the case of parameter set 1, this FM has a slightly negatively skewed distribution, with some p-values below the 0.05 significance threshold, but is otherwise not very far from normality
(this is quite visible in some histograms). However, for parameter set 2, the data is more concentrated on a single value, more so for larger sizes. Note that this single value is the initial number of prey, which means that, in most cases, the minimum number of prey never drops below its initial value.

arg min Pis: This FM follows a similar pattern to the previous one, but more pronounced in terms of discreteness, namely for parameter set 1. For parameter set 2, sizes 100 and 200, the distribution is bimodal, with the minimum prey population occurring at iteration zero (i.e., the initial state) or around iteration 200, while for larger sizes, the minimum always occurs at iteration zero.

Pis^ss: The prey population steady-state mean seems to generally follow a normal distribution, the only exception being 400@2, in which some departure from normality is observed, as denoted by a SW p-value below 0.05 and a few outliers in the Q-Q plot.

S^ss(Pis): For most size@set combinations this FM does not present large departures from normality. However, skewness is always positive.

max Piw: This FM presents distributions which are either considerably skewed or relatively normal. The former tend to occur for smaller model sizes, and the latter for larger sizes, although this trend is not entirely clear. The 800@2 sample is a notable case, as it closely follows a normal distribution, with a symmetric histogram, an approximately linear Q-Q plot, and a SW p-value of 0.987.

arg max Piw: Interestingly, for parameter set 1, this FM seems to follow a uniform distribution. This is more or less visible in the histograms, but also in the Q-Q plots, because when we plot uniform data points against a theoretical normal distribution in a Q-Q plot we get the "stretched-S" pattern which is visible in this case (Table 4.2). For parameter set 2, the distribution seems to be more normal, or even binomial, as the discreteness of the data starts to stand out for larger model sizes; the only exception is size 100, which presents a multimodal distribution.
Table 4.2 – Q-Q plots for the several size@set combinations of the arg max P^w_i FM (model sizes 100–1600, parameter sets 1 and 2).
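The “stretched-S” pattern mentioned for arg max P^w_i is a generic property of plotting uniformly distributed data against theoretical normal quantiles, and can be verified with a short illustrative script (Python/SciPy; not part of the original analysis, and the sample is synthetic):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(123)
uniform_sample = rng.uniform(0, 4000, size=30)   # hypothetical iteration numbers

# probplot returns the theoretical normal quantiles (osm) and the ordered sample
# values (osr); plotting osr against osm for uniform data bends at both tails,
# producing the "stretched-S" shape instead of the straight line of normal data.
(osm, osr), (slope, intercept, r) = stats.probplot(uniform_sample, dist="norm")
print(f"straight-line fit of the Q-Q points: r = {r:.3f}")
```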
min P^w_i: The minimum predator population seems to follow an approximately normal distribution, albeit with a slight positive skewness, except for 800@1, which has negative skewness.

arg min P^w_i: This FM displays an approximately normal distribution. However, for larger simulations (i.e., mainly for parameter set 2) the discrete nature of the data becomes more apparent.

X^ss(P^w_i): The steady-state mean of the predator population apparently follows a normal distribution. This is confirmed by all summary statistics, such as the SW p-value, which is above 0.05 for all size@set combinations.

S^ss(P^w_i): Departure from normality is not large in most cases (200@2 and 800@2 are exceptions, the former due to a single outlier), but the trend of positive skewness is again observed for this statistic.

max P^c_i: The maximum available cell-bound food seems to have a normal distribution, although 400@2 has a few outliers which affect the result of the SW test (whose p-value, nonetheless, remains above 0.05).

arg max P^c_i: The behavior of this FM is again quite different between parameter sets. For the first parameter set, the discrete nature of the underlying distribution stands out, with no more than three unique values for size 100, down to a single value for larger sizes, always centered around the value 12 (i.e., the maximum available cell-bound food tends to occur at iteration 12). For the second parameter set, the distribution is almost normal for sizes above 200, centered around iteration 218, although its discreteness shows for larger sizes, namely for size 1600, which only presents three distinct values. For size 100, most values fall at iteration 346, although two outliers push the mean up to 369.5.

min P^c_i: This FM displays an apparently normal distribution for all model sizes and parameter sets, with the exception of 800@1, which has a few outliers at both tails of the distribution, bringing the SW p-value down to barely above the 5% significance level.

arg min P^c_i: In this case, the trend is similar for both parameter sets, i.e., the distribution seems almost normal, but for larger sizes the underlying discreteness becomes apparent. This is quite clear for parameter set 2, as shown in Table 4.3, where the SW test p-value decreases as the discreteness becomes more visible in the histograms and Q-Q plots.
X^ss(P^c_i): For this FM there is no significant departure from normality. The only exception is 800@1, and only due to a single outlier.
Table 4.3 – Histograms, Q-Q plots and SW p-values for the several model sizes of the arg min P^c_i FM, parameter set 2 (SW p-values shown: 0.437, 0.071, 0.062, 0.011).
2 samples) non-parametric tests. With respect to the model-independent method, we will present the p-values for the MANOVA test on the first 10 PCs, since this is the sample size (n = 10 runs per implementation), and for the t-test or ANOVA on the first PC. The former will be used when comparing two implementations, and the latter when comparing s > 2 implementations. A more detailed assessment of the model-independent method is performed in Section 4.4, where we will verify that the PCs extracted from PPHPC output approximately follow a normal distribution, justifying the use of parametric tests in this section.
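A minimal sketch of the model-independent procedure is given below (Python with NumPy/SciPy; the function name and the synthetic data are assumptions for illustration, not the code used to produce the results in this chapter). Each run contributes one row to a matrix of concatenated outputs; the PCs are obtained from the SVD of the centered matrix, and a parametric test is applied to the scores on the first PC, grouped by implementation. A MANOVA on the first 10 PCs could be applied to the same score matrix (e.g., with a multivariate statistics package).

```python
import numpy as np
from scipy import stats

def first_pc_scores(runs_by_impl):
    """PCA via SVD of the centered matrix of concatenated outputs.
    runs_by_impl: list with one (n_runs x m) array per implementation,
    where each row is a run's concatenated (range-scaled) output."""
    X = np.vstack(runs_by_impl)
    Xc = X - X.mean(axis=0)                       # center before extracting PCs
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    scores = Xc @ Vt.T                            # PC scores, one column per PC
    groups = np.repeat(np.arange(len(runs_by_impl)),
                       [len(r) for r in runs_by_impl])
    return scores, groups

# Hypothetical data: 6 implementations, 10 runs each, shortened outputs.
rng = np.random.default_rng(0)
runs = [rng.normal(size=(10, 200)) for _ in range(6)]
scores, groups = first_pc_scores(runs)

# ANOVA on the first PC for s > 2 samples (a t-test would be used for s = 2).
pc1_by_impl = [scores[groups == g, 0] for g in range(6)]
print("ANOVA on the first PC, p =", stats.f_oneway(*pc1_by_impl).pvalue)
```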
4.3.1 Assessing the alignment of the Java variants
In this case there are six samples to be compared for each FM, produced by the NetLogo implementation and the five parallel variants of the Java implementation. Consequently, we use tests capable of comparing s > 2 samples.

For the classical comparison approach, the p-values obtained by comparing the FMs of the six PPHPC versions for all combinations of parameter sets and model sizes are provided in Table 4.9. In a total of 360 p-values (36 focal measures, 2 parameter sets, 5 model sizes), 29 are below the 0.05 significance level; of these, 12 are below 0.01. However, there are two low p-values that stand out for min E^s_i, associated with model sizes 200 and 800, parameter set 1, suggesting that at least one of the implementations (i.e., samples) produced significantly different minimum values for mean prey energy. Taking a closer look at the distributional behavior of the min E^s_i FM for parameter set 1 (Section 4.1.3), we note that this value always occurs at iteration zero, the initial state of the simulation (i.e., arg min E^s_i = 0). Additionally, as described in Section 3.2.1, the EQ and EX parallelization strategies share the same work provider for agent initialization, which means that, for a given PRNG seed, both strategies generate initial agents with exactly the same energy. As such, if the initial prey energy is somewhat different from usual for one of these strategies, it will be so for both. Consequently, the ANOVA F-test compares two samples with observations that differ somewhat (EQ and EX) from those of the remaining four (NL, ST, ER and OD), which is the case for sizes 200 and 800. If we remove EQ and EX from the test, the p-values rise to 0.108 and 0.001 for sizes 200 and 800, respectively, in line with the remaining results. Other than this, the p-values do not appear to follow any trend or pattern, e.g., smaller p-values do not seem to be associated with any particular FM, parameter set or model size.

Results for the model-independent comparison approach are shown in Table 4.10. Approximately 10% of the p-values are below the 5% significance level, but none is below 0.01. As in the classic approach, smaller p-values do not seem to follow any specific pattern, with the exception of the ANOVA p-values for the 200@2 size/set combination, possibly indicating a small misalignment in the first PC.
Out.    Stat.           Param. set 1                                     Param. set 2
                    100      200      400      800      1600        100      200      400      800      1600
P^s_i   max         0.597    0.083    0.351    0.570    0.709       0.028    0.164    0.733    0.636    0.411
        arg max     0.785    0.456    0.990    0.030    0.575       0.685    0.600    0.785    0.445    1.000
        min         0.805    0.020    0.078    0.002    0.180       0.192    0.663    0.471    1.000    1.000
        arg min     0.546    0.679    0.542    0.085    0.348       0.555    0.169    0.539    1.000    1.000
        X^ss        0.441    0.992    0.563    0.604    0.985       0.570    0.172    0.179    0.775    0.180
        S^ss        0.087    0.310    0.379    0.695    0.012       0.861    0.294    0.607    0.411    0.178
P^w_i   max         0.074    0.702    0.502    0.497    0.493       0.412    0.475    0.227    0.304    0.003
        arg max     0.944    0.494    0.658    0.579    0.469       0.443    0.667    0.243    0.793    0.698
        min         0.502    0.203    0.464    0.491    0.404       0.047    0.472    0.323    0.092    0.510
        arg min     0.381    0.407    0.382    0.410    0.630       0.846    0.444    0.084    0.362    0.707
        X^ss        0.406    0.998    0.783    0.711    0.343       0.720    0.212    0.951    0.645    0.741
        S^ss        0.150    0.549    0.598    0.722    0.002       0.894    0.303    0.768    0.351    0.200
P^c_i   max         0.786    0.559    0.176    0.013    0.064       0.247    0.630    0.130    0.486    0.014
        arg max     0.383    0.749    0.582    0.416    1.000       0.050    0.534    0.235    0.739    0.463
        min         0.643    0.164    0.443    0.223    0.709       0.152    0.215    0.732    0.445    0.280
        arg min     0.942    0.802    0.067    0.896    0.823       0.822    0.862    0.413    0.302    0.671
        X^ss        0.454    0.972    0.641    0.465    0.932       0.963    0.447    0.076    0.399    0.350
        S^ss        0.108    0.283    0.429    0.739    0.004       0.924    0.264    0.703    0.351    0.173
E^s_i   max         0.580    0.243    0.386    0.865    0.721       0.583    0.207    0.438    0.004    0.121
        arg max     0.970    0.169    0.443    0.180    0.369       0.940    0.520    0.819    0.348    1.000
        min         0.002    4e-7     0.252    1e-7     0.109       0.298    0.068    0.925    0.096    0.006
        arg min     1.000    1.000    1.000    1.000    1.000       0.705    0.485    0.871    0.566    0.627
        X^ss        0.956    0.266    0.942    0.976    0.961       0.430    0.036    0.279    0.183    0.306
        S^ss        0.395    0.710    0.267    0.283    0.309       0.765    0.371    0.495    0.341    0.223
E^w_i   max         0.064    0.789    0.696    0.632    0.645       0.177    0.045    0.421    0.944    0.283
        arg max     0.338    0.687    0.156    0.247    0.467       0.499    0.620    0.057    0.974    0.371
        min         0.495    0.009    0.001    0.207    0.020       0.055    0.985    0.069    0.121    0.497
        arg min     0.109    0.963    0.514    0.425    0.278       0.598    0.033    0.459    0.317    0.949
        X^ss        0.030    0.421    0.123    0.217    0.763       0.951    0.052    0.371    0.872    0.238
        S^ss        0.539    0.657    0.606    0.987    0.027       0.950    0.406    0.809    0.421    0.130
C_i     max         0.657    0.170    0.523    0.201    0.655       0.031    0.859    0.379    0.866    0.124
        arg max     0.674    0.792    0.044    0.996    0.772       0.257    0.581    0.234    0.735    0.276
        min         0.715    0.794    0.148    0.063    0.101       0.240    0.633    0.149    0.513    0.014
        arg min     0.709    0.834    1.000    1.000    1.000       0.078    0.570    0.547    0.901    0.904
        X^ss        0.452    0.972    0.648    0.463    0.934       0.969    0.420    0.067    0.407    0.324
        S^ss        0.109    0.281    0.429    0.738    0.004       0.924    0.266    0.709    0.354    0.175
Table 4.9 – P-values for the classic comparison of FMs from the NetLogo and Java parallel variants, n = 10 runs per implementation/variant. P-values were obtained with the ANOVA F-test for the max, min, X^ss and S^ss statistics, and with the Kruskal-Wallis test for the arg max and arg min statistics. Values lower than 0.05 are underlined, while values lower than 0.01 are double-underlined. The number of workers, p, is set to 12 for the parallel implementations. For the OD implementation, b = 500.
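For reference, the tests named in the caption of Table 4.9 could be applied to the samples of a single FM as in the following sketch (Python/SciPy, with made-up data; not the scripts used to generate the table):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Hypothetical samples of one FM, n = 10 runs for each of the six versions
# (NL plus the ST, EQ, EX, ER and OD Java variants).
max_samples = [rng.normal(loc=4000, scale=50, size=10) for _ in range(6)]
argmax_samples = [rng.integers(60, 80, size=10) for _ in range(6)]

# Continuous-valued statistics (max, min, X^ss, S^ss): one-way ANOVA F-test.
print("ANOVA p =", stats.f_oneway(*max_samples).pvalue)

# Iteration-valued statistics (arg max, arg min): Kruskal-Wallis H test.
print("Kruskal-Wallis p =", stats.kruskal(*argmax_samples).pvalue)
```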
                       Outputs
Set   Size   Test     Ps      Pw      Pc      E^s     E^w     C       Ã
1     100    MNV      0.611   0.507   0.626   0.422   0.426   0.626   0.553
             ANV      0.062   0.086   0.058   0.046   0.020   0.058   0.073
1     200    MNV      0.109   0.254   0.103   0.024   0.372   0.102   0.205
             ANV      0.250   0.386   0.302   0.524   0.840   0.303   0.454
1     400    MNV      0.771   0.858   0.801   0.453   0.378   0.802   0.826
             ANV      0.908   0.948   0.880   0.820   0.875   0.880   0.883
1     800    MNV      0.038   0.050   0.034   0.168   0.017   0.034   0.043
             ANV      0.386   0.031   0.089   0.691   0.515   0.089   0.028
1     1600   MNV      0.129   0.260   0.135   0.027   0.167   0.136   0.197
             ANV      0.467   0.328   0.398   0.186   0.181   0.398   0.279
2     100    MNV      0.457   0.428   0.493   0.636   0.818   0.497   0.656
             ANV      0.060   0.052   0.067   0.053   0.046   0.067   0.061
2     200    MNV      0.251   0.282   0.252   0.366   0.270   0.254   0.232
             ANV      0.041   0.023   0.024   0.045   0.028   0.024   0.029
2     400    MNV      0.348   0.564   0.512   0.615   0.416   0.512   0.509
             ANV      0.186   0.237   0.243   0.373   0.216   0.242   0.247
2     800    MNV      0.692   0.652   0.589   0.857   0.688   0.591   0.649
             ANV      0.565   0.394   0.451   0.146   0.265   0.445   0.442
2     1600   MNV      0.577   0.596   0.545   0.030   0.251   0.538   0.250
             ANV      0.399   0.474   0.453   0.146   0.501   0.453   0.439
Table 4.10 – P-values for the model-independent comparison of outputs from the NetLogo and Java parallel variants, n = 10 runs per implementation/variant. MNV and ANV refer to the p-values yielded by the MANOVA (10 PCs or dimensions) and ANOVA (first PC) tests, respectively. Output Ã refers to the concatenation of all outputs (range scaled). P-values lower than 0.05 are underlined, while p-values lower than 0.01 are double-underlined.
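The concatenated output Ã referred to in Table 4.10 can be illustrated as follows (Python sketch; the min–max form of range scaling used here is an assumption, and the exact scaling employed in this work may differ):

```python
import numpy as np

def concat_range_scaled(outputs):
    """Concatenate a run's outputs into a single vector, range-scaling each output
    first so that no output dominates merely because of its magnitude.
    Scaling here is (x - min) / (max - min); the actual scaling may differ."""
    scaled = []
    for x in outputs:
        x = np.asarray(x, dtype=float)
        span = x.max() - x.min()
        scaled.append((x - x.min()) / span if span > 0 else np.zeros_like(x))
    return np.concatenate(scaled)

# Hypothetical run with six outputs (P^s, P^w, P^c, E^s, E^w, C).
rng = np.random.default_rng(1)
run_outputs = [rng.normal(size=1001) for _ in range(6)]
print(concat_range_scaled(run_outputs).shape)   # (6006,)
```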
All values are below 0.05 but, interestingly, this is not reflected in the MANOVA p-values or in the classic approach p-values. As will be discussed in Section 4.4, misaligned models are prone to show much more significant p-values. Additionally, the ANOVA p-values are not adjusted for multiple comparisons; when adjusting with the weighted Bonferroni procedure, no p-values remain significant. Looking at the concatenated output, Ã, it is possible to observe that, the few times it shows significant p-values, at least one of the associated individual outputs also presents a significant p-value. From the results of both comparison methods, it is possible to conclude that all realizations generally appear to produce similar dynamic behavior. Thus, model parallelization does not seem to have introduced any observable bias for the tested parameter sets.
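The weighted Bonferroni adjustment mentioned above amounts to comparing each p-value against its share of the global significance level. The sketch below uses equal weights, which is an assumption (the weights actually assigned to each output may differ):

```python
import numpy as np

def weighted_bonferroni(pvals, weights, alpha=0.05):
    """Weighted Bonferroni: the j-th p-value is significant if p_j <= w_j * alpha,
    with the weights normalized to sum to one."""
    p = np.asarray(pvals, dtype=float)
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()
    return p <= w * alpha

# ANOVA p-values for the seven outputs at 200@2 (from Table 4.10), equal weights.
p_200_2 = [0.041, 0.023, 0.024, 0.045, 0.028, 0.024, 0.029]
print(weighted_bonferroni(p_200_2, np.ones(7)))   # all False: none remains significant
```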
4.3.2 Assessing the alignment of the OpenCL implementations
Given that the two OpenCL implementations differ considerably from each other, we compare each of them independently with the NetLogo version, contrary to what was done with the Java parallel variants. This means that two-sample statistical tests are used instead.
Out.    Stat.           Param. set 1                                     Param. set 2
                    100      200      400      800      1600        100      200      400      800      1600
P^s_i   max         0.269    0.005    0.834    0.979    0.543       0.822    0.167    0.685    0.781    0.802
        arg max     0.677    0.383    0.361    0.056    0.374       0.644    0.871    0.925    0.368    1.000
        min         0.971    0.015    0.801    0.265    0.570       0.967    0.397    1.000    1.000    1.000
        arg min     0.618    0.339    0.934    0.074    1.000       0.507    0.348    1.000    1.000    1.000
        X^ss        0.568    0.447    0.348    0.808    0.176       0.482    0.643    0.238    0.946    0.410
        S^ss        0.041    0.004    0.647    0.972    0.018       0.849    0.356    0.352    0.250    0.054
P^w_i   max         0.419    0.601    0.205    0.531    0.300       0.742    0.692    0.722    0.195    0.016
        arg max     0.427    1.000    0.345    0.791    0.521       0.495    0.731    0.646    0.024    0.074
        min         0.625    0.009    0.703    0.234    0.557       0.594    0.348    0.424    0.080    0.270
        arg min     0.162    0.272    0.306    0.517    0.340       0.760    1.000    0.785    0.076    0.459
        X^ss        0.428    0.658    0.350    0.821    0.082       0.983    0.475    0.370    0.436    0.122
        S^ss        0.082    0.008    0.748    0.994    0.019       0.828    0.301    0.448    0.178    0.052
P^c_i   max         0.628    0.067    0.602    0.157    0.964       0.496    0.746    0.308    0.940    0.041
        arg max     0.651    0.144    0.651    0.368    1.000       0.569    0.363    0.540    0.816    0.517
        min         0.301    0.011    0.828    0.961    0.847       0.875    0.262    0.841    0.391    0.737
        arg min     0.426    0.790    0.469    0.908    0.641       0.879    0.540    0.776    0.689    1.000
        X^ss        0.562    0.527    0.554    0.878    0.195       0.425    0.288    0.289    0.268    0.638
        S^ss        0.017    0.004    0.605    0.968    0.012       0.917    0.293    0.374    0.183    0.051
E^s_i   max         0.838    0.904    0.668    0.605    0.578       0.752    0.469    0.487    0.070    0.729
        arg max     0.130    0.040    0.702    0.046    0.876       0.592    0.387    0.302    0.368    1.000
        min         0.010    0.194    0.411    0.187    0.889       0.689    0.012    0.833    0.005    0.001
        arg min     1.000    1.000    1.000    1.000    1.000       0.939    0.359    0.637    0.800    0.482
        X^ss        0.064    0.023    0.442    0.764    0.466       0.385    0.022    0.543    0.147    0.139
        S^ss        0.273    0.026    0.898    0.395    0.068       0.466    0.260    0.257    0.335    0.041
E^w_i   max         0.165    0.816    0.590    0.237    0.590       0.743    0.625    0.326    0.876    0.266
        arg max     0.880    0.384    0.677    0.074    0.139       0.272    1.000    0.180    0.372    0.267
        min         0.476    0.135    0.020    0.822    0.394       0.662    0.696    0.263    0.660    0.088
        arg min     1.000    0.676    0.426    0.938    0.333       0.166    0.242    0.702    0.726    1.000
        X^ss        0.052    0.420    0.070    0.139    0.979       0.849    0.495    0.185    0.501    0.317
        S^ss        0.614    0.103    0.806    0.533    0.030       0.800    0.465    0.425    0.353    0.052
C_i     max         0.322    0.008    0.721    0.813    0.973       0.084    0.517    0.481    0.600    0.257
        arg max     0.198    0.343    0.404    0.879    1.000       0.053    0.319    0.667    0.714    0.769
        min         0.912    0.128    0.823    0.235    0.886       0.509    0.764    0.299    0.921    0.035
        arg min     1.000    0.368    1.000    1.000    1.000       0.622    0.363    1.000    0.939    0.335
        X^ss        0.563    0.543    0.566    0.897    0.198       0.420    0.286    0.313    0.293    0.637
        S^ss        0.017    0.004    0.604    0.962    0.012       0.916    0.294    0.372    0.187    0.051
Table 4.11 – P-values for the classic comparison of FMs from the NetLogo and OpenCL CPU implementations, n = 10 runs per implementation. P-values were obtained with the t-test for the max, min, X^ss and S^ss statistics, and with the Mann-Whitney U test for the arg max and arg min statistics. Values lower than 0.05 are underlined, while values lower than 0.01 are double-underlined.
For the classical comparison approach, the p-values obtained by comparing the FMs of the NetLogo and OpenCL CPU (CLC) implementations for all combinations of parameter sets and model sizes are provided in Table 4.11. In a total of 360 p-values, 32 are below the 0.05 significance level; of these, 11 are below 0.01. Overall, the number of significant p-values is similar to what was observed for the Java variants comparison, though different tests are being employed in each case. As in the previous comparison, p-values do not appear to follow any trend or pattern, perhaps with the exception of 200@1, which contains a few significant p-values, mainly for the S^ss statistical summary.
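The two-sample counterparts of the tests used for the Java variants can be sketched as follows (Python/SciPy, with hypothetical data; not the scripts used to produce these tables):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)

# Hypothetical samples of one FM: n = 10 runs of the NetLogo reference versus
# n = 10 runs of one OpenCL implementation.
netlogo, opencl = (rng.normal(loc=4000, scale=50, size=10) for _ in range(2))

# Continuous-valued statistics: two-sample t-test.
print("t-test p =", stats.ttest_ind(netlogo, opencl).pvalue)

# Iteration-valued statistics (arg max, arg min): Mann-Whitney U test.
nl_arg, cl_arg = (rng.integers(60, 80, size=10) for _ in range(2))
print("Mann-Whitney U p =", stats.mannwhitneyu(nl_arg, cl_arg).pvalue)
```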
                       Outputs
Set   Size   Test     Ps      Pw      Pc      E^s     E^w     C       Ã
1     100    MNV      0.543   0.285   0.587   0.316   0.251   0.581   0.356
             t-test   0.805   0.826   0.774   0.900   0.030   0.773   0.888
1     200    MNV      0.235   0.232   0.264   0.199   0.714   0.266   0.321
             t-test   0.285   0.108   0.262   0.579   0.594   0.263   0.277
1     400    MNV      0.182   0.019   0.093   0.469   0.013   0.093   0.050
             t-test   0.784   0.555   0.608   0.707   0.824   0.607   0.449
1     800    MNV      0.918   0.964   0.911   0.764   0.685   0.910   0.976
             t-test   0.738   0.950   0.937   0.724   0.724   0.933   0.812
1     1600   MNV      0.617   0.709   0.592   0.434   0.636   0.592   0.730
             t-test   0.149   0.642   0.182   0.015   0.390   0.182   0.305
2     100    MNV      0.040   0.027   0.026   0.087   0.133   0.026   0.034
             t-test   0.229   0.277   0.267   0.376   0.149   0.269   0.242
2     200    MNV      0.109   0.409   0.283   0.297   0.702   0.283   0.467
             t-test   0.273   0.285   0.282   0.265   0.305   0.282   0.274
2     400    MNV      0.946   0.915   0.979   0.994   0.946   0.979   0.990
             t-test   0.954   0.964   0.936   0.761   0.994   0.936   0.979
2     800    MNV      0.653   0.663   0.679   0.692   0.490   0.684   0.794
             t-test   0.323   0.347   0.332   0.393   0.255   0.333   0.316
2     1600   MNV      0.347   0.479   0.265   0.165   0.246   0.266   0.437
             t-test   0.721   0.975   0.937   0.660   0.836   0.942   0.954
Table 4.12 – P-values for the model-independent comparison of outputs from the NetLogo and OpenCL CPU implementations, n = 10 runs per implementation. MNV and t-test refer to the p-values yielded by the MANOVA (10 PCs or dimensions) and t (first PC) tests, respectively. Output Ã refers to the concatenation of all outputs (range scaled). P-values lower than 0.05 are underlined, while p-values lower than 0.01 are double-underlined.
Results for the model-independent comparison of the NetLogo and CLC implementations are shown in Table 4.12. Few p-values are significant, and those that are fall below the 0.05 threshold only, never below 0.01. Again, from the results of both comparison methods, and especially those from the model-independent approach, it appears that the two implementations are
aligned and generate statistically similar outputs.
Out.: P^s_i, P^w_i, P^c_i, E^s_i, E^w_i, C_i. Stat.: max, arg max, min, arg min, X^ss, S^ss. Columns: model sizes 100, 200, 400, 800 and 1600, for parameter sets 1 and 2.
Table 4.13 – P-values for the classic comparison of FMs from the NetLogo and OpenCL GPU (Nvidia) implementations, n = 10 runs per implementation. P-values were obtained with the t-test for the max, min, X^ss and S^ss statistics, and with the Mann-Whitney U test for the arg max and arg min statistics. Values lower than 0.05 are underlined, while values lower than 0.01 are double-underlined.
For the OpenCL GPU implementation we select results from the replications performed on the Nvidia GPU, i.e., from CLGN. In principle, outputs should not differ statistically from one GPU to the other. The classical comparison approach p-values for all combinations of parameter sets and model sizes are shown in Table 4.13. Statistically significant differences are found for the steady-state mean of most outputs, namely for larger model sizes. Results for the model-independent comparison of the NetLogo and CLGN implemen-
                       Outputs
Set   Size   Test     Ps      Pw      Pc      E^s     E^w     C       Ã
      100    MNV      0.263   0.365   0.290   0.007   0.119   0.289   0.175
             t-test   0.176   0.156   0.228   0.012   0.437   0.228   0.170
      200    MNV      0.685   0.469   0.788   0.660   0.150   0.789   0.802
             t-test   0.829   0.684   0.949   0.994   0.368   0.950   0.812
      400    MNV      0.115   3e-04   0.408   0.931   0.413   0.409   0.945
             t-test   0.975   0.975   0.666   0.521   0.893   0.665   0.844
      800    MNV      0.001   2e-04   0.065   0.553   0.036   0.066   0.311
             t-test   0.002   1e-04   0.069   0.300   0.512   0.069   0.088
1600
MNV t-test
4e-07