PARALLEL IMPLEMENTATION AND APPLICATION OF THE RANDOM FINITE ELEMENT METHOD

A thesis submitted to the University of Manchester for the degree of Doctor of Philosophy in the Faculty of Engineering and Physical Sciences

2011

By Jonathan David Nuttall School of Mechanical, Aerospace and Civil Engineering


Contents

Abstract
Declaration
Copyright
Acknowledgements
The Author

1 Introduction
  1.1 Motivation
  1.2 Setting the scene
  1.3 Thesis aims and objectives
    1.3.1 Aims
    1.3.2 Objectives
  1.4 Thesis Scope

2 Parallel RFEM Strategies
  2.1 Foster's Design Methodology
    2.1.1 Partitioning
      2.1.1.1 Domain Decomposition
      2.1.1.2 Functional Decomposition
    2.1.2 Communication / Interaction
    2.1.3 Agglomeration
    2.1.4 Mapping
  2.2 Architecture
    2.2.1 Processing Classification
    2.2.2 Memory Classification
    2.2.3 Systems Designs
      2.2.3.1 SIMD Machines
      2.2.3.2 Parallel Vector Processor (PVP) Machines
      2.2.3.3 Symmetric Multiprocessor (SMP) Machines
      2.2.3.4 Distributed Shared Memory (DSM)
      2.2.3.5 Massively Parallel Processor (MPP) Machines
      2.2.3.6 Cluster of Workstations (COW)
    2.2.4 Race Conditions
  2.3 Parallel Application Programming Interfaces
    2.3.1 High Performance Fortran (HPF)
    2.3.2 OpenMP
    2.3.3 Message Passing Interface (MPI)
  2.4 Performance Metrics
    2.4.1 Execution Time
    2.4.2 Communication
    2.4.3 Scalability
    2.4.4 Speedup
      2.4.4.1 Amdahl's Law
      2.4.4.2 Gustafson's Law
    2.4.5 Efficiency
      2.4.5.1 Cost of Resource
    2.4.6 Ideal Strategy Properties
  2.5 Stochastic Parallelisation
  2.6 Existing Codes
    2.6.1 Serial Deterministic FE Code
    2.6.2 Parallel Deterministic FE Code
      2.6.2.1 Element-By-Element (EBE) Technique
    2.6.3 Serial Random FE Code
    2.6.4 Architecture Availability
  2.7 Partitioning
    2.7.1 Realisation Parallelism
      2.7.1.1 Intrinsic Realisation Allocation
      2.7.1.2 Job Farming
    2.7.2 Solver Parallelism
  2.8 Parallel Realisations
  2.9 Parallel Solver
  2.10 Parallel Solver and Realisations
  2.11 Performance
    2.11.1 Serial Codes
    2.11.2 Parallel Codes
      2.11.2.1 Parallel Realisations
      2.11.2.2 Parallel Solver
      2.11.2.3 Comparison
      2.11.2.4 Hybrid Technique
  2.12 Strategy Conclusions

3 Random Field Generation
  3.1 Random Field Generation
    3.1.1 Maths in 1D
      3.1.1.1 Mean and Variance
      3.1.1.2 Variance Function
      3.1.1.3 Scale of Fluctuation
      3.1.1.4 Covariance Function
  3.2 1D Implementation
    3.2.1 Initialising the LAS process
    3.2.2 Mathematical Implementation
      3.2.2.1 Boundary Conditions
  3.3 2D Mathematics
    3.3.1 Variance Function
    3.3.2 Scale of Fluctuation
      3.3.2.1 Covariance Function
  3.4 2D Implementation
    3.4.1 Initial Mean Generation
    3.4.2 Subdivision
  3.5 3D Mathematics
    3.5.1 3D Variance
    3.5.2 Scale of Fluctuation
    3.5.3 Covariance Function
  3.6 3D Implementation
    3.6.1 Initial Mean Generation
    3.6.2 Subdivision
    3.6.3 Boundary Conditions
  3.7 Summary

4 Parallel Random Field Generation
  4.1 Introduction
  4.2 Parallelisation
    4.2.1 Partitioning (or Domain Decomposition)
      4.2.1.1 FE Domain Decomposition
      4.2.1.2 Slice Decomposition
  4.3 Parallelisation Issues
    4.3.1 Optimization
    4.3.2 Random Number Generation
  4.4 Conclusion

5 Random Number Generation (RNG)
  5.1 Introduction
  5.2 Sequential RNG
    5.2.1 Linear Congruential Generators (LCG)
    5.2.2 Multiplicative Linear Congruential Generators (MLCG)
    5.2.3 Lagged Fibonacci Generators (LFG)
    5.2.4 Shift Register Generators
    5.2.5 Combined Generators
    5.2.6 Other Generators
      5.2.6.1 System Implementation
      5.2.6.2 Randu
  5.3 Current RNG Implementation
  5.4 Testing the RNG
    5.4.1 Uniform Deviate transform to normal
    5.4.2 Theoretical Tests
    5.4.3 Empirical Tests
      5.4.3.1 TestU01 Test Suite
      5.4.3.2 Conclusion
  5.5 Parallel Random Number Generators
    5.5.1 Parallel Testing
      5.5.1.1 TestU01 Testing
  5.6 Conclusion

6 Parallel Random Field Implementation
  6.1 Introduction
  6.2 Current Implementation
  6.3 Code Improvements
    6.3.1 Domain Reduction
      6.3.1.1 Reduction Profile
      6.3.1.2 Schematic flowchart of the domain reduction technique
    6.3.2 Boundary Cells
      6.3.2.1 Schematic flowchart of the boundary cell methodology
    6.3.3 Anti-correlation movement
      6.3.3.1 Schematic flowchart of the implementation of the anti-correlation movement
    6.3.4 Conclusions
      6.3.4.1 Merging the techniques
  6.4 Parallel Implementation
    6.4.1 Initialisation
    6.4.2 Initialising parameters
      6.4.2.1 Generate random movements (reductions) for anti-correlation
      6.4.2.2 Compute number of levels of subdivision
      6.4.2.3 Compute a and c coefficients
      6.4.2.4 Calculate the domain reduction profile
    6.4.3 Serial Component of Generation
      6.4.3.1 Define Boundary Cells
      6.4.3.2 LAS Subdivision Process
      6.4.3.3 Domain Reduction
      6.4.3.4 Decomposition Criteria
    6.4.4 Parallel Component of Generation
      6.4.4.1 Domain Decomposition
      6.4.4.2 Resetting the RNG seed
      6.4.4.3 Define boundary cells
      6.4.4.4 Communicate Ghost Regions
      6.4.4.5 LAS Subdivision Process
      6.4.4.6 Domain Reduction
      6.4.4.7 Anti-correlation movement
      6.4.4.8 Load Balancing or Cell Redistribution
      6.4.4.9 Exiting the loop
    6.4.5 Post Processing
      6.4.5.1 Transformation
      6.4.5.2 Anisotropy of the heterogeneity
      6.4.5.3 Squashing
      6.4.5.4 Stretching
  6.5 Resulting Fields
  6.6 Validation
    6.6.1 Global Mean and Cell Value Distributions
    6.6.2 Correlation structure
  6.7 Computational Performance
    6.7.1 Analysis
    6.7.2 Results
      6.7.2.1 Performance Conclusions
  6.8 Conclusions

7 Application - Groundwater Modelling
  7.1 Introduction
  7.2 Groundwater Modelling
    7.2.1 Problem
    7.2.2 Previous RFEM Investigation
    7.2.3 New Implementation
  7.3 Results
  7.4 Conclusions

8 Application - Slope Stability
  8.1 Introduction
    8.1.1 Fully saturated slope under undrained conditions, φu = 0
      8.1.1.1 Taylor's stability coefficients
  8.2 Investigation
    8.2.1 Methodology
  8.3 Analysis
    8.3.1 Monte Carlo simulation
    8.3.2 Computing Methodology
    8.3.3 Results
  8.4 Computational Performance
  8.5 Conclusions

9 Conclusions
  9.1 Aims
  9.2 Objectives
  9.3 Main Conclusions
  9.4 Future Research

A Random Field Generation - Performance
B Slope Stability Analysis - Performance

Word count: 50575

List of Tables

2.1 Workload and Communications required in the domain decompositions illustrated in Figure 2.3.
2.2 Common Network Bandwidths.
2.3 Basic Memory Requirements (Minimal).
2.4 University of Manchester Supercomputers.
2.5 Top 10 UK Supercomputers in November 2010 (Dongarra et al., 2010).
2.6 Top 10 World Supercomputers in November 2010 (Dongarra et al., 2010).
2.7 Basic Realisation Allocation.
2.8 Table of analysed domain sizes.
2.9 A comparison of results between Stochastic implementations of P74 and P75 for 320 realisations.
2.10 A comparison of performance results between P75 Serial, P75 (Parallel Realisations) and P123 (Parallel Solver) executed on a single processor for 320 realisations.
2.11 A comparison of performance results using the distributed version of P75, executed over increasing numbers of processors.
2.12 A comparison of performance results using the parallel solver version, P123, executed over increasing numbers of processors.
2.13 A comparison of performance results using the hybrid strategy, executed over increasing numbers of processors, for 320 realisations.
2.14 Comparison of test results between the Hybrid and P75 (Parallel Realisations) Implementations.
4.1 Table illustrating the restricted processor decomposition for varying resolutions of domains.
5.1 p-values from Small Crush battery of tests.
5.2 p-values from Crush battery of tests.
5.3 p-values from Big Crush battery of tests.
5.4 p-values from Small Crush battery of tests.
5.5 p-values from Crush battery of tests.
5.6 p-values from Big Crush battery of tests.
6.1 Dimensions Table.
6.2 Tables of numerical steps involved in the computation of the reduction profile.
6.3 Computation of reduction profile for a 5 × 3 cells random field, over 3 levels of subdivision.
6.4 List of random field performance analyses undertaken.
6.5 Comparison of performance of original serial random field generation with that of the new implementation executed on a single processor. (Note that the comparison is the ratio New:Original.)

List of Figures

2.1 Foster's Design Methodology.
2.2 Illustration of different types of Domain Decomposition.
2.3 Examples of Domain Decomposition.
2.4 Illustration of Communication Reduction due to Agglomeration.
2.5 Simple example illustrating SISD Model in Flynn's Taxonomy of Computer Architectures.
2.6 Simple example illustrating SIMD Model in Flynn's Taxonomy of Computer Architectures.
2.7 Simple example illustrating MISD Model in Flynn's Taxonomy of Computer Architectures.
2.8 Simple example illustrating MIMD Model in Flynn's Taxonomy of Computer Architectures.
2.9 Basic Schematic of Shared Memory Architecture.
2.10 Basic Schematic of Distributed Memory Architecture.
2.11 Flow chart outlining two dependent processes A and B.
2.12 The series of outcomes of a simple two process Race Condition.
2.13 Illustration of OpenMP.
2.14 Illustration of Amdahl and Gustafson's Laws (Gustafson, 1988).
2.15 Graph illustrating the Theoretical Maximum Speedup as determined using Amdahl and Gustafson's Laws, for a hypothetical code using 10 processors.
2.16 Illustration of typical efficiency scaling for a parallel code.
2.17 Illustration of the theoretical lower memory limit.
2.18 Flow chart outlining a general deterministic serial code.
2.19 Flow chart outlining a general deterministic parallel code.
2.20 Example of processor boundary communication.
2.21 Flow chart outlining the Serial RFEM Algorithm.
2.22 Illustration of the components of the total run time for a Serial RFEM Code.
2.23 Graph of Fastest Computer Architecture within Top 500 1993 - 2010 (Dongarra et al., 2010).
2.24 Graphs illustrating Realisation Partitioning Methods.
2.25 Flowchart outlining the Parallel Realisations Stochastic Algorithm.
2.26 Flowchart outlining the Parallel Solver Stochastic Algorithm.
2.27 Flow chart outlining the Hybrid Parallel Solver/Realisations Stochastic Algorithm.
2.28 Illustration of the performance test domain and visualization of results.
2.29 Graphs comparing the performance of P74 and P75 serial codes.
2.30 Graphs illustrating the comparison of performance results between P75 Serial, P75 Parallel and P123 executed on a single processor.
2.31 Graphs comparing the performance results using the distributed version of P75 parallel, executed over increasing numbers of processors.
2.32 Graphs comparing the performance results using the parallel solver version, P123, executed over increasing numbers of processors.
2.33 Graphs comparing the performance results of P75 Parallel and P123, for Test 4.
2.34 Graphs comparing the performance results using the hybrid strategy, executed over increasing numbers of processors.
3.1 Sample functions of a local average process (Vanmarcke, 1983).
3.2 Variance Functions.
3.3 Local integrals of the function X(t).
3.4 An illustration depicting the separation of a domain into equally spaced local average cells.
3.5 Progression of 1D LAS Field Generation.
3.6 General cell arrangement and annotation in 1D.
3.7 Local averaging in 2D over a rectangle (Samy, 2003).
3.8 Illustration showing the areas under consideration in the covariance analysis.
3.9 Illustration showing the special case of similar rectangles.
3.10 2D LAS Process.
3.11 Generic local average generation at Stage i + 1.
3.12 2D traversing pattern of index functions p(l) and q(l).
3.13 Splitting of a 2D parent cell to form the next level of subdivision.
3.14 Weighting coefficients associated with their respective parent cell.
3.15 A graph illustrating the influence of the boundary conditions on the random field (Spencer, 2007).
3.16 Illustration of the arrangement of imaginary cells, I, forming a boundary.
3.17 Neighbourhood for weighting of imaginary edge cells (Spencer, 2007).
3.18 Neighbourhood for weighting of imaginary corner cells (Spencer, 2007).
3.19 Traversing pattern and values for index functions p(l) and q(l) for edge cells (Spencer, 2007).
3.20 Traversing pattern and values for index functions p(l) and q(l) for corner cells (Spencer, 2007).
4.1 Full RF generation within a parallel analysis.
4.2 Illustration of parent cell neighbourhood in 3D.
4.3 Illustration of 2D parent cell communication across processor domain boundaries.
4.4 FE domain decomposition.
4.5 Illustration of cell structure growth during subdivision.
4.6 Illustration of the communication to restore cell structure.
4.7 Domain decomposition - Slices - (2D profile view).
4.8 Illustration of the agglomeration of communication.
5.1 Simple example of XOR operation.
5.2 Plot of 3000 Triplets of Randu in a 3D space.
5.3 Plot of 3000 Triplets of Randu in a 3D space (Alternative Angle).
5.4 Plots of the initial 5000 values generated using varying initializing seeds.
5.5 Plots of the 5000 values generated using varying initializing seeds, when the first 50 values are disregarded.
5.6 Histograms of frequency of random numbers generated.
5.7 Plot of the frequency of random numbers generated after normal transformation.
5.8 Illustration of techniques that are utilized in PRNG models.
5.9 Illustration of merging multiple parallel RNG sequences to a single sequence for empirical testing.
5.10 Histograms of frequency of random numbers generated.
6.1 Schematic flowchart of current random field implementation.
6.2 2D example of current random field implementation.
6.3 Schematic of Domain Reduction Method.
6.4 Illustration of the proposed domain reduction method in comparison to the original implementation.
6.5 Illustration showing the principle of self generating boundaries.
6.6 Schematic flowchart of the boundary cell procedure.
6.7 A 2D illustration of the self generating boundaries in the LAS method.
6.8 Illustration of "tartan" patterning in the variance field of 50000 realisations of a 64 × 64 × 64 field, generated with a cell size of 1m and θ = 4m.
6.9 Illustration of the process of random movement.
6.10 An illustration of the relative positions of the domains and fields between realisations with different random movements.
6.11 An illustration of the proposed implementation of the random movement method.
6.12 Illustration of correlation within the point variance of a series of generated random fields (No random movement) ((a)-(d) have individual scales) ((e)-(h) have a fixed scale).
6.13 Illustration of correlation within the point variance of a series of generated random fields (Random movement: ≤ 15 cells) ((a)-(d) have individual scales) ((e)-(h) have a fixed scale).
6.14 Illustration of correlation within the point variance of a series of generated random fields (Random movement: ≤ 31 cells) ((a)-(d) have individual scales) ((e)-(h) have a fixed scale).
6.15 Illustration of correlation within the point variance of a series of generated random fields (Random movement: ≤ 63 cells) ((a)-(d) have individual scales) ((e)-(h) have a fixed scale).
6.16 Schematic of anti-correlation random movement method.
6.17 An illustration comparing the locations of cell reductions.
6.18 Schematic flowchart overviewing the implementation of the Parallel LAS Random Field Generator.
6.19 Schematic flowchart of Initialisation component.
6.20 Typical input data file.
6.21 Schematic of Serial Generation Component.
6.22 Schematic of Parallel Generation Component.
6.23 Example of "seed domain" decomposition.
6.24 Example of Domain Reduction across multiple processors.
6.25 Example of anti-correlation movement across multiple processors.
6.26 Typical example of an imbalanced decomposition.
6.27 Typical example of rebalancing a decomposed domain.
6.28 Schematic of Post Processing Component.
6.29 Illustration of Squashing.
6.30 Illustration of Squashing process by column averaging (ξ = 4).
6.31 Illustration of stretching.
6.32 Illustration of the interpolation method of stretching (Spencer, 2007).
6.33 Illustration of the Author's Method of Stretching.
6.34 Generated Random field.
6.35 Generated Random field - Anisotropic field.
6.36 Probability density function of random field cell values from 250 fields.
6.37 Probability density function of random field means over 1000 fields.
6.38 Exact and estimated covariance functions for a 3D random field; D = 5m, θ = 4m, averaged over 50 realisations (The Author).
6.39 Exact and estimated covariance functions for a 3D random field; D = 5m, θ = 4m, averaged over 50 realisations (Fenton and Vanmarcke, 1990).
6.40 Exact and estimated covariance functions for a 3D random field; D = 5m, θ = 4m, averaged over 1000 realisations (The Author).
6.41 Performance : Field : 512 × 512 × 32 cells.
6.42 Performance : Time against number of cells in random field.
6.43 Performance : Memory against number of cells in random field.
7.1 3D FE seepage model (Griffiths and Fenton, 1997).
7.2 Influence of θk on Statistics of Normalized Flow Rate (Lz/Ly = 1) (Griffiths and Fenton, 1997).
7.3 Influence of θQ̄ on Statistics of Normalized Flow Rate (Lz/Ly = 1) (Author's Results).
7.4 Influence of Lz/Ly on Statistics of Normalized Flow Rate (θln K) (Griffiths and Fenton, 1997).
7.5 Influence of Lz/Ly on Statistics of Normalized Flow Rate (θln K) (Author's Results).
8.1 Types of slope failure (Craig, 1997).
8.2 The φu = 0 analysis (Craig, 1997).
8.3 Taylor's stability coefficients for φu = 0 (Craig, 1997).
8.4 The basic slope geometry and finite element mesh.
8.5 Typical finite element and four random field cells (Hicks and Spencer, 2010).
8.6 Influence of foundation layer on strength reduction factor versus maximum settlement for a 100m long homogeneous slope.
8.7 Influence of foundation layer on factor of safety for a 100m long homogeneous slope.
8.8 Influence of slope length on computed factor of safety for a homogeneous slope.
8.9 Influence of slope length on increase in F relative to Taylor's (1937) solution.
8.10 2D failure mechanism for a homogeneous slope with D = 0.0m.
8.11 Performance distributions for a 2D heterogeneous slope with D = 0.0m.
8.12 Visualisation of multiple failures in a 3D heterogeneous slope with D = 0.0m.
8.13 Relationship between percentage volume and percentage of maximum node displacement for failure of a homogeneous slope.
8.14 2D failure mechanism for a homogeneous slope with D = 3.0m.
8.15 3D failure mechanism for a homogeneous slope for D = 0.0m.
8.16 3D failure mechanism for a homogeneous slope for D = 3.0m.
8.17 Influence of ξ on reliability versus global factor of safety for a 3D slope with 0 m and 3 m foundation layer.
8.18 Influence of ξ on R versus F for a 3D slope (ξ = 1).
8.19 Influence of ξ on R versus F for a 3D slope (ξ = 2).
8.20 Influence of ξ on R versus F for a 3D slope (ξ = 6).
8.21 Influence of ξ on R versus F for a 3D slope (ξ = 12).
8.22 Influence of ξ on R versus F for a 3D slope (ξ = 24).
8.23 Influence of ξ on R versus F for a 3D slope (ξ = 48).
8.24 Influence of ξ on R versus F for a 3D slope (ξ = 100).
8.25 Influence of ξ on R versus F for a 3D slope (ξ = 1000).
8.26 Example Mode 2 failure mechanisms for ξ = 6.
8.27 Influence of ξ on mean slide volume for a 3D slope.
8.28 Illustration of tested profiles in performance analysis.
8.29 Slope Stability Analysis Performance : Profile 1 : Length = 32m.
8.30 Slope Stability Analysis Performance (Hybrid Method).
A.1 Performance : Field : 512 × 512 × 512 cells.
A.2 Performance : Field : 512 × 512 × 256 cells.
A.3 Performance : Field : 512 × 512 × 128 cells.
A.4 Performance : Field : 512 × 512 × 64 cells.
A.5 Performance : Field : 512 × 512 × 32 cells.
A.6 Performance : Field : 256 × 256 × 256 cells.
A.7 Performance : Field : 256 × 256 × 128 cells.
A.8 Performance : Field : 256 × 256 × 64 cells.
A.9 Performance : Field : 256 × 256 × 32 cells.
A.10 Performance : Field : 128 × 128 × 128 cells.
A.11 Performance : Field : 128 × 128 × 64 cells.
A.12 Performance : Field : 128 × 128 × 32 cells.
B.1 Slope Stability Analysis Performance : Profile 1 : Length = 96m.
B.2 Slope Stability Analysis Performance : Profile 1 : Length = 64m.
B.3 Slope Stability Analysis Performance : Profile 1 : Length = 32m.
B.4 Slope Stability Analysis Performance : Profile 2 : Length = 96m.
B.5 Slope Stability Analysis Performance : Profile 2 : Length = 64m.
B.6 Slope Stability Analysis Performance : Profile 2 : Length = 32m.
B.7 Slope Stability Analysis Performance : Profile 3 : Length = 96m.
B.8 Slope Stability Analysis Performance : Profile 3 : Length = 64m.
B.9 Slope Stability Analysis Performance : Profile 3 : Length = 32m.

Abstract

Geotechnical analyses have traditionally followed a deterministic approach in which materials are modelled using representative property values. An alternative approach is to take into account the spatial variation, or heterogeneity, existing in all geomaterials. In this approach the material property is represented by a mean and standard deviation, and by a definition of the spatial correlation. This leads to a stochastic-type analysis resulting in reliability assessments.

Random finite element methods (RFEM) have been implemented incorporating spatial variability for a series of models. This variability is incorporated using random fields, which conform to the mean, standard deviation and spatial correlation of the modelled geomaterials. For each material the set of statistical parameters produces an infinite number of possible random fields; therefore a Monte Carlo approach is followed, by executing the FE analysis for hundreds of realisations of the random field. This stochastic approach is computationally demanding in both time and memory, limiting domain sizes and significantly increasing run-times. With the advances in commercial computational resources, the demand for more accurate and 3D models has increased, further straining the computational resources required by the method.

To reduce these effects the stages of the method have been parallelised: initially the FE analysis, then the Monte Carlo framework and finally the random field generation. This has led to increases in the executable domain sizes and reductions in the run-times of the method. The new parallel codes have been used to analyse large-scale 3D slope reliability problems, which previously could not be undertaken in a serial environment. These computations have demonstrated the effectiveness of the new implementation, as well as adding confidence in the conclusions of previous research carried out on smaller 3D domains.

Declaration

No portion of the work referred to in this thesis has been submitted in support of an application for another degree or qualification of this or any other university or other institute of learning.

Copyright

i. The author of this thesis (including any appendices and/or schedules to this thesis) owns certain copyright or related rights in it (the "Copyright") and s/he has given The University of Manchester certain rights to use such Copyright, including for administrative purposes.

ii. Copies of this thesis, either in full or in extracts and whether in hard or electronic copy, may be made only in accordance with the Copyright, Designs and Patents Act 1988 (as amended) and regulations issued under it or, where appropriate, in accordance with licensing agreements which the University has from time to time. This page must form part of any such copies made.

iii. The ownership of certain Copyright, patents, designs, trade marks and other intellectual property (the "Intellectual Property") and any reproductions of copyright works in the thesis, for example graphs and tables ("Reproductions"), which may be described in this thesis, may not be owned by the author and may be owned by third parties. Such Intellectual Property and Reproductions cannot and must not be made available for use without the prior written permission of the owner(s) of the relevant Intellectual Property and/or Reproductions.

iv. Further information on the conditions under which disclosure, publication and commercialisation of this thesis, the Copyright and any Intellectual Property and/or Reproductions described in it may take place is available in the University IP Policy (see http://www.campus.manchester.ac.uk/medialibrary/policies/intellectual-property.pdf), in any relevant Thesis restriction declarations deposited in the University Library, The University Library's regulations (see http://www.manchester.ac.uk/library/aboutus/regulations) and in The University's policy on presentation of Theses.

Acknowledgements

The Author would like to acknowledge and thank Prof.dr. M. A. Hicks for his assistance, guidance and advice in conducting the research presented in this thesis, in his capacity as supervisor. Acknowledgement and thanks must also be given to the Engineering and Physical Sciences Research Council (EPSRC) for providing the initial funding in starting this research. Later the work was also supported by the European Union FP7 Project "IRIS: Integrated European Industrial Risk Reduction System", Project Number 213968.

The Author would also like to thank his colleagues, both past and present, for their help and support throughout his studies, many of whom remain close friends. In particular the Author's predecessor, William Spencer, for his knowledge, advice and friendship. Acknowledgement and thanks are also due to his colleagues, Patrick Arnold, Ellie Kuitunen, Kristina Samy, Gaisoni Nasreldin, Jian Chen, Lee Margetts, Mike Pettipher, Ian Smith, William Craig, XiaoHui Chen, Paul Curry, Richard Gardiner, Michael Whitby and Simon Miller.

On a more personal note, the Author would like to thank his close friends for their support, friendship and the happiness they bring, in particular Heidi, Peter, Peter, Leanne, Gemma, Steve and David for their continued friendship. The Author would also like to thank his Grandmothers for their love and support during his studies, and the Author would like to pay tribute to his Grandma Ivy, who sadly passed away in this period; she is greatly missed, but still makes the Author smile. Finally the Author would like to express his thanks to his Mother and Father for their unwavering love, support and advice throughout all his studies.

The Author

The Author, Jonathan Nuttall, completed a BEng in Mechanical Engineering at UMIST in 2002 and received the Rowland S. Benson Prize, awarded to the student with the best performance in Applied Thermodynamics and Fluid Mechanics. It was during this course that the Author gained his first experiences of Finite Element Analysis and its practical use within engineering analysis and design.

The Author continued his studies, completing a masters in Applied Numerical Computing in the Department of Mathematics at Manchester University. During his masters degree his skills and knowledge of Finite Element Analysis, Fortran programming and parallel computation were furthered, completing courses in each. It is here that the Author gained the motivation to continue into research, expanding and combining his previous knowledge base to conduct the work reported in this thesis.

For my parents.


Chapter 1

Introduction

Finite Element Analysis (FEA) and its derivatives have been a widely researched subject for many years. Its popularity has increased with the continuing development of computer processing power: large scale engineering problems can be handled and processed faster than using traditional methods. The use of stochastic techniques has also become more widespread, as more realistic models incorporating material variability have been required and their analysis has become more computationally viable.

The development of computer technology has been rapid, although it is envisaged that the development of single core processor speeds will slow as technological limits are reached. This can already be seen with both major processor manufacturers, AMD and Intel, having produced multi-core processors. Today's high performance supercomputers are machines containing multiple "off the shelf" processors, often hundreds or thousands, with some PC clusters, known as Beowulf Clusters, forming a cheaper alternative. Both these types of machine architecture and the new multi-core processors are coded on the same parallel programming principles. So it would seem that the future of computation lies within the parallelisation of processing power, with today's advances in high performance computing becoming tomorrow's standard PC features.

Note that throughout this thesis the term "processor" refers to a single processing core or processor, as at the onset of this research multi-core processors were in their infancy and the term processor referred to a single physical processor with one processing unit on board. However, the term is equally applicable to the individual cores of multi-core processors.

FEA using a standard PC with a single processor is limited by both processing power and memory constraints, limiting the applications and domain sizes that can be analysed. In parallelising these problems, the domain is decomposed into smaller parts and distributed between multiple processors or machines in a cluster network. This means that problem sizes can be increased and analysis times cut, as each processor has less work and fewer storage requirements.

The codes used in the Author's research originate from various editions of the book "Programming the Finite Element Method" (Smith and Griffiths, 1997, 2004). Some of these codes were developed further and parallelised, as documented in "Parallel Finite Element Analysis" (Margetts, 2002). Further parallelisation and development occurred, and the resulting parallel finite element codes were published by Smith and Griffiths (2004). This research adopts these codes and adapts them to fit within a parallel stochastic framework.

Stochastic analyses are often carried out using the Finite Element Method. These stochastic methods may be used when properties within an analysis are not deterministic, as for soils in which the properties vary in space. For modelling the variable nature of the properties, statistically accurate property domains may be computed, although, for any given set of statistics, an infinite number of these so-called random fields exists. One such method is the Random Finite Element Method (RFEM), where the analysis is repeated as part of a Monte Carlo process on soil models represented by varying random fields, until a satisfactory converged solution is obtained.

Parallel Random Finite Element Methods will be the key focus of this thesis. The issues surrounding the parallelisation of the FE code, its benefits and performance will be analysed. Parallelisation strategies for RFEM codes will be discussed, including the parallelisation of the random field generation itself.
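To make the Monte Carlo structure of the RFEM described above concrete, the short Fortran sketch below shows the basic loop: generate a random field realisation, run the deterministic FE analysis on it, and accumulate statistics of the response until the chosen number of realisations has been completed. It is an illustration only, not the thesis code; the routines generate_random_field and run_fe_analysis are hypothetical placeholders standing in for the LAS generator and the FE solver discussed in later chapters.

program rfem_monte_carlo_sketch
  ! Minimal sketch of the RFEM Monte Carlo loop: generate a random field
  ! realisation, analyse it with the (placeholder) FE solver, and accumulate
  ! statistics of the response quantity over all realisations.
  implicit none
  integer, parameter :: nreal  = 320     ! number of realisations
  integer, parameter :: ncells = 1000    ! cells in the placeholder random field
  real :: field(ncells), response, sum_r, sum_r2, mean_r, var_r
  integer :: ireal

  sum_r  = 0.0
  sum_r2 = 0.0
  do ireal = 1, nreal
     call generate_random_field(field)      ! placeholder for LAS field generation
     call run_fe_analysis(field, response)  ! placeholder for the deterministic FE solve
     sum_r  = sum_r  + response
     sum_r2 = sum_r2 + response**2
  end do
  mean_r = sum_r / nreal
  var_r  = sum_r2 / nreal - mean_r**2
  print *, 'mean response =', mean_r, '  variance =', var_r

contains

  subroutine generate_random_field(f)
    real, intent(out) :: f(:)
    call random_number(f)                   ! stand-in for a correlated LAS field
  end subroutine generate_random_field

  subroutine run_fe_analysis(f, r)
    real, intent(in)  :: f(:)
    real, intent(out) :: r
    r = sum(f) / size(f)                    ! stand-in for an FE response quantity
  end subroutine run_fe_analysis

end program rfem_monte_carlo_sketch

In the serial case every iteration of this loop runs on one processor; the strategies examined in Chapter 2 differ in whether the realisations, the solver within each realisation, or both are distributed across processors.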

1.1 Motivation

The motivation behind the thesis is to develop faster and larger FE software applications to handle the increasing demand for larger and more accurate models, whilst maintaining viable analysis times. Working within the Author’s research group has shown that traditional serial computing is currently unable to handle the large scale models that engineers are increasingly demanding and, with the added complexity of stochastic analysis, the way forward is clearly the utilization of high performance computation.


In recent years the research group has focused on the development of a stochastic strategy for the analysis of geotechnical problems, such as slope stability (Samy, 2003; Spencer, 2007) and groundwater flow. The group has also paid particular attention to advances in computer technology, with the utilisation of parallel and high performance computing techniques being incorporated into much of the research (Margetts, 2002; Spencer, 2007). Although previous researchers contributed knowledge to the field and often combined many aspects of FEA, high performance computing and stochastic analysis, none provided the comprehensive review, analysis and implementation of a parallel RFEM framework, such as that proposed in this thesis.

1.2 Setting the scene

With the rapid advances in computer technologies, it is essential that this thesis is put into historical context for future readers. At the time this research started, commercial 64 bit processing technology was in its infancy, as was the development of multi-core processors. In general, most systems, both desktops and larger research machines, operated with 32 bit operating systems and compilers, limiting each system to 4 GB of memory. It is within this context that many of the issues discussed and developed within this thesis arose.

1.3 Thesis aims and objectives

1.3.1 Aims

The broad aim of this research is to advance the performance of Stochastic Finite Element Analysis, with the use of parallel computation, improved algorithms and coding. These performance enhancements are envisaged to be in the domain sizes analysed and the speed at which they are executed. It is also an aim to design, implement and test a parallel stochastic FE framework, for use within current and future FE projects.

1.3.2 Objectives

Several objectives were required to accomplish the aims of this research, and these objectives have evolved during the research as new challenges have been faced.


The following is a list of the main objectives that were required to fulfil the aims:

• Stochastic parallelisation strategy: As with the parallelisation of any code, it is essential to design a strategy best suited to the needs of the application. Several approaches can be taken in completing this task, so it is important to take a considered approach when initially designing the code.

• Implementation of a Parallel Stochastic Framework: The main objective of this research is to implement a parallel stochastic framework and to analyse its performance, with the hope that larger domains can be processed with shorter runtimes.

• Parallelisation of random field generation: The random field generator is a key area in memory cost; it is therefore an objective of this research to parallelise the code to minimize this memory constraint.

• Testing of random field generation: Having produced a parallel random field generator, the field produced will have to be tested to validate the code, as it is essential that the field produced meets the required statistical criteria and that there is no correlation between field elements and between different field realisations.

• Validation of parallel RFEM framework: As with the random field generator, it is important to validate the results of the parallel framework, by comparing the results obtained with those from a standard serial code.

• Performance analysis: The new framework will also be tested to maximize the use of the code. Its performance will be compared, both from the point of view of memory and time, with original serial codes.

• Advanced application: Having produced the framework and implemented a working code with enhanced performance, the framework will be applied to a practical problem involving slope stability.

1.4 Thesis Scope

The scope of this thesis is quite broad and includes the following areas:

1. Finite Elements;
2. Stochastic Monte Carlo FE methods (Random Finite Element Method (RFEM));
3. Parallel computing;
4. Random field generation;

and the issues surrounding these main themes. The Author provides supplementary background information where it is felt that the reader may benefit from this knowledge and that it contributes to greater understanding. This background literature review is provided throughout this thesis in the relevant chapters, rather than in a separate self-contained chapter.

The thesis chapters are summarized as follows:

Chapter 2 - Parallel RFEM Strategies: The parallelisation of any code requires careful thought and analysis. This chapter hypothesises about several strategies for implementing existing serial and parallel FE codes into a Parallel RFEM Framework; it analyses the likely performance of each method and concludes with a strategy selection to take forward into the rest of the research.

Chapter 3 - Random Field Generation: Random field generation in the RFEM process is important and this chapter details the theory and mathematics involved in LAS random field generation.

Chapter 4 - Parallel Random Field Generation: This chapter considers the reasoning and viability of the parallelisation of the LAS random field generation presented in Chapter 3, with consideration of the framework strategies presented in Chapter 2.

Chapter 5 - Random Number Generation (RNG): The intricacies of the Random Number Generator to be used, and its use within the parallel framework, are important. The RNG provides the random component in the generation of the random field and, as such, needs to be analysed for its random and statistical properties. The chapter implements a parallel RNG for use within the parallel generation of a random field.

Chapter 6 - Parallel Random Field Implementation: The work of the previous chapters is brought together in forming a parallel implementation of the LAS random field generator for use within the parallel RFEM framework. This chapter also implements general improvements to the algorithm and concludes with a performance analysis of the new implementation.

Chapter 7 - Application - Groundwater Modelling: This chapter documents the implementation of a Groundwater model into the parallel framework. The purpose of this implementation is to validate the model against known results and to provide some validation of the parallel random field and framework implemented.

Chapter 8 - Application - Slope Stability: A Slope Stability model was implemented into the parallel framework. This model benefits from the advances made within the thesis, which have allowed, for the first time, modelling of large scale slopes with foundations. The chapter presents the implementation, results and analysis, together with a review of the computational performance of the code.

Chapter 9 - Conclusions: This chapter is an overview of the entire research. It demonstrates that the aims and objectives set out prior to it commencing were reached, whilst providing a brief overview of the main conclusions from the work. The thesis concludes with the Author's views on potential future research and development.

Chapter 2

Parallel RFEM Strategies

In parallel computing, the area of stochastic analysis using Monte Carlo methods is often overlooked. This is because it is considered trivial to parallelise, due to the execution of different realisations on each processor. However, stochastic strategies can suffer from the same problems as serial codes, as will be highlighted later in the section, with memory and speed constraints being contributing factors. Therefore other strategies must be considered to allow the computation of large scale FE domains stochastically within a Monte Carlo framework. This chapter discusses these strategies, their advantages and disadvantages, implementation and performance, and concludes with a discussion of various comparisons and applications.

In parallelising any code a feasibility study must first be carried out, to see whether the parallelisation will be of benefit and to aid in the choice of a parallelisation strategy. It is also good practice to follow a strategy in the design of the model; to that end, Foster's Design Methodology (Foster, 1994) will be used. This chapter presents a study of the possible strategies.


2.1 Foster's Design Methodology

Foster (1994) devised a design methodology for parallel codes. Quinn (2003) describes this methodology using the following steps:

1. Partitioning
2. Communication
3. Agglomeration
4. Mapping

Figure 2.1 shows the steps of the methodology and the rest of the section provides an overview of each.

Figure 2.1: Foster's Design Methodology.

2.1.1 Partitioning

The first step is to identify and partition the problem into smaller tasks. The types of partitioning fall into two categories: Domain Decomposition and Functional Decomposition. A task contains a sequential program and the corresponding local memory.

2.1.1.1 Domain Decomposition

Domain Decomposition is the breaking down of the actual data; for instance, an array can be broken down into single elements or vectors, or the modelling of a physical domain can be broken down into its composite components or based on its geometry. See Figure 2.2 for examples of these types of decomposition.

2.1.1.2 Functional Decomposition

Functional Decomposition is the breaking down of a calculation or algorithm into its smaller basic functions, tasks and operations.

The partitioning of the problem should minimize computations and data storage, by eliminating or minimizing redundancy. It is also important to balance the workload by partitioning the tasks to be roughly the same size; this is known as load balancing. Partitioning is also key in defining the communication within a model.
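As a simple illustration of the index arithmetic behind domain decomposition (a sketch under assumed example sizes, not taken from the thesis codes), the fragment below assigns a contiguous block of cells from a 1D array to each of a number of hypothetical processes, keeping the block sizes within one cell of each other so that the workload remains balanced.

program block_decomposition_sketch
  ! Sketch of 1D block domain decomposition: each process p owns a contiguous
  ! range of cells, with block sizes differing by at most one cell so that the
  ! workload is balanced across processes.
  implicit none
  integer, parameter :: ncells = 70, nprocs = 8   ! assumed example sizes
  integer :: p, base, extra, lo, hi

  base  = ncells / nprocs        ! minimum cells per process
  extra = mod(ncells, nprocs)    ! remainder spread over the first 'extra' processes
  hi = 0
  do p = 0, nprocs - 1
     lo = hi + 1
     if (p < extra) then
        hi = lo + base           ! this process takes base + 1 cells
     else
        hi = lo + base - 1       ! this process takes base cells
     end if
     print '(a,i2,a,i4,a,i4)', ' process ', p, ' owns cells ', lo, ' to ', hi
  end do
end program block_decomposition_sketch

The same idea extends to 2D and 3D meshes, where the choice of which direction(s) to cut determines the shape of the subdomains and, as the next section shows, the amount of communication between them.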

Figure 2.2: Illustration of different types of Domain Decomposition. (a) Array Decomposition; (b) Geometric Decomposition; (c) Composite Decomposition.

2.1.2 Communication / Interaction

Communication involves the interactions between the different tasks within the data structure, the data required by each task from other tasks, and the order in which it is required and performed. Each task has a number of inputs and outputs which define its interaction with other tasks. The connections or “channels” between the tasks define the communication between them. The communication or interaction, whether between functional tasks or between domain decomposed tasks, is dependent on the decomposition and this highlights the necessity for good partitioning. Section 2.4.2 illustrates this point. It must also be stressed that the interaction and the path of a concurrently run algorithm is only as fast as the longest or critical path running through it; in other words, it is limited by the longest chain of sequential tasks that it must perform. Good partitioning is

(a) Example 1

(b) Example 2

(c) Example 3

Figure 2.3: Examples of Domain Decomposition. key when considering communication. Figure 2.3 shows 3 examples of partitioning a 2D mesh using domain decomposition. In each case the load balancing is the same, but the communication channels required differ significantly. Table 2.1 shows the work load and communication of each example. The number of comExample Elements Partitioned Workload 1 64 32 2 64 32 3 64 32

Communication Channels 112 14 8

Table 2.1: Workload and Communications required in the domain decompositions illustrated in Figure 2.3.

The number of communication channels required was taken to be the length of the edges between adjacent decomposed areas. Therefore, while balancing the workload it is advantageous to choose a strategy that minimizes the number of connecting edges and thus minimizes communication.

2.1.3 Agglomeration

Agglomeration, in the current context, is the merging of smaller tasks into larger tasks that are more efficient or easier to code, with a view to maintaining the scalability of a code and improving performance. For instance, within communication it is often better to send large amounts of data at the same time, rather than sending communications individually, as this limits synchronization time and the general communication overheads. Within a parallel algorithm developed using the Message Passing Interface (MPI) standard (see Section 2.3.3), the aim is often to produce one agglomerated task per processor. Agglomerating tasks can improve performance by eliminating communication between tasks (see Figure 2.4). It is often necessary, when considering this process, to evaluate whether data should be communicated between processors or whether the computation should be replicated to produce the data locally; this usually depends on whether the replicated computation is quicker than the communication required to distribute the data.

Figure 2.4: Illustration of Communication Reduction due to Agglomeration.

The steps of Communication and Agglomeration can be considered together as a task interaction analysis (Margetts, 2002), where task interactions are determined and the necessary data structures and communications are defined.
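A minimal sketch of agglomerated communication follows, assuming a hypothetical set of nvals boundary values: rather than issuing nvals separate sends, the values are packed into one buffer and transferred in a single message, so the latency and synchronization costs are paid only once.

    PROGRAM agglomerated_send
      USE mpi
      IMPLICIT NONE
      INTEGER, PARAMETER :: nvals = 1000          ! number of boundary values (assumed)
      REAL(8) :: buffer(nvals)
      INTEGER :: ier, numpe, npes, i
      INTEGER :: status(MPI_STATUS_SIZE)

      CALL MPI_INIT(ier)
      CALL MPI_COMM_RANK(MPI_COMM_WORLD, numpe, ier)
      CALL MPI_COMM_SIZE(MPI_COMM_WORLD, npes,  ier)

      IF (npes >= 2) THEN
        IF (numpe == 0) THEN
          ! Agglomeration: pack all boundary values into one buffer and
          ! send a single large message instead of nvals small ones.
          DO i = 1, nvals
            buffer(i) = REAL(i,8)
          END DO
          CALL MPI_SEND(buffer, nvals, MPI_DOUBLE_PRECISION, 1, 10, MPI_COMM_WORLD, ier)
        ELSE IF (numpe == 1) THEN
          CALL MPI_RECV(buffer, nvals, MPI_DOUBLE_PRECISION, 0, 10, MPI_COMM_WORLD, status, ier)
          WRITE(*,*) 'Received ', nvals, ' values in one message'
        END IF
      END IF

      CALL MPI_FINALIZE(ier)
    END PROGRAM agglomerated_send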

2.1.4 Mapping

Tasks are allocated to the available processors so as to optimize execution and reduce the run times of the codes. The mapping can be controlled in two ways, depending on the system. For centralized multiprocessor systems, that is those with shared memory, the mapping is controlled by the operating system; for a distributed memory system, however, the mapping is controlled by the user. The goals of mapping are twofold:

1. to maximize utilization of the available processors,
2. to minimize interprocessor communication.

These goals are often conflicting, as communication tends to increase with the number of processors utilized. In maximizing the use of processors, considerable attention should be paid to load balancing, so that each process has a similar amount of work and executes in a similar time. There are two main approaches to mapping tasks to processors, depending on the type of algorithm involved. For an algorithm where the execution time of each task is static, communication should be minimized through Agglomeration and by mapping a single task to each processor, as this produces a load balanced mapping in which each processor does a similar amount of work. If the computation time varies between tasks, the tasks should instead be mapped cyclically, with each processor taking the next task from a list on finishing its previous one. This is known as Dynamic Load Balancing and, over the course of several cycles, the execution times should balance out. A minimal sketch of a cyclic mapping of tasks to processors follows.
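The sketch below illustrates a simple cyclic (round-robin) mapping of tasks to processes. It is illustrative only, with ntasks assumed; a genuinely dynamic job-farming variant is sketched in Section 2.7.1.2.

    PROGRAM cyclic_mapping
      USE mpi
      IMPLICIT NONE
      INTEGER, PARAMETER :: ntasks = 20      ! total number of tasks (assumed)
      INTEGER :: ier, numpe, npes, itask

      CALL MPI_INIT(ier)
      CALL MPI_COMM_RANK(MPI_COMM_WORLD, numpe, ier)
      CALL MPI_COMM_SIZE(MPI_COMM_WORLD, npes,  ier)

      ! Round-robin mapping: process numpe takes tasks numpe+1, numpe+1+npes, ...
      DO itask = numpe+1, ntasks, npes
        WRITE(*,'(A,I3,A,I4)') 'Process ', numpe, ' executes task ', itask
        ! ... task computation would go here ...
      END DO

      CALL MPI_FINALIZE(ier)
    END PROGRAM cyclic_mapping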

2.2 Architecture

One of the initial steps in the parallelisation of an algorithm is to evaluate the computer architectures available, or the most appropriate architecture for the field of work, especially if the user is not limited by this resource. It is then essential to show the implications of these choices and to direct the strategies accordingly.

2.2.1 Processing Classification

The Instruction/Data Stream classification of "Flynn's Taxonomy" is used to define the architecture of computers (Flynn, 1972). The classifications are as follows:

1. SISD - Single Instruction, Single Data Stream. A single instruction is issued per clock cycle on a single data stream. This architecture is typical of standard serial single core processor computers. (See Figure 2.5.)

2. SIMD - Single Instruction, Multiple Data Stream. A parallel architecture in which a single instruction per clock cycle is issued across all processing units, each operating on its own data stream. (See Figure 2.6.)

3. MISD - Multiple Instruction, Single Data Stream. A computer that carries out multiple instructions, on a single data stream, per clock cycle. To the Author's knowledge, to date there are no practical examples of this type of architecture in existence. (See Figure 2.7.)

4. MIMD - Multiple Instruction, Multiple Data Stream. Another parallel architecture, in which a different instruction is issued on each of the processing units to different data streams. (See Figure 2.8.)

Figure 2.5: Simple example illustrating the SISD Model in Flynn's Taxonomy of Computer Architectures.

Figure 2.6: Simple example illustrating the SIMD Model in Flynn's Taxonomy of Computer Architectures.

Figure 2.7: Simple example illustrating the MISD Model in Flynn's Taxonomy of Computer Architectures.

Figure 2.8: Simple example illustrating the MIMD Model in Flynn's Taxonomy of Computer Architectures.

2.2.2 Memory Classification

Memory classification is concerned with the locality of the memory relative to the processors in a multiple processor framework. There are three categories:

1. Shared Memory. In this structure all the CPUs share a common memory store. (See Figure 2.9.) This arrangement means that memory access is slowed, as each CPU competes to access the memory over a finite bandwidth; shared memory can therefore often be the cause of bottlenecks within codes. This type of architecture does not require communication between processors, although problems maintaining the integrity of the data can occur when different processors are reading and writing data at the same address; this is known as a Race Condition. Within a shared memory environment memory locks are used, preventing processes from reading parts of the memory until previous operations have been completed by other processes. Due to the limitations of 32 bit memory addressing, a shared memory environment can only share up to 4GB of memory between processors. It does, however, have the advantages that the hardware is relatively cheap compared with the distributed memory arrangement, while the coding is significantly easier without the need for complex communication.

2. Distributed Memory. In this structure each CPU has a local memory that can only be accessed by that particular processing unit, and data are communicated between memory stores via a network, as shown in Figure 2.10. The processors in this architecture have faster access to data in memory than in a shared memory architecture. This is due to better bandwidths between processing units and memory; there is also little competition for bandwidth, whilst Race Conditions are more easily controlled by the use of synchronized message passing. Distributed memory has many benefits: it provides each processor with its own localized memory, with full bandwidth and no bus or switch contention, so interprocessor interference and Race Conditions are eliminated or more easily controlled. The number of processors in the system is no longer constrained by a common bus and is limited only by the network which connects them; hence larger supercomputers can be built. Also, no cache coherency problems exist, with each processor having full control over its own memory; as such, processors are not required to copy data to a local cache, leaving the original data in memory for referencing by others.

3. Virtual Shared Memory. This is a globally addressed architecture that is physically distributed between processors, which means that every processor can access data from the entire global memory. Hardware and software are used to give the physically distributed memory a single address space. This aids the portability of code and may be contrasted with a program written for shared memory, which will not generally execute on a distributed memory machine as it expects to see the full memory. Similarly, some distributed memory codes will run on shared memory machines; however, in both cases the programs execute inefficiently compared with execution on the machine for which they were designed. The Virtual Shared Memory architecture is an attempt to capture the benefits of both types of architecture (i.e. shared memory and distributed memory), allowing programs designed for either type to run.

Figure 2.9: Basic Schematic of Shared Memory Architecture.

Figure 2.10: Basic Schematic of Distributed Memory Architecture.

2.2.3 Systems Designs

Large computers generally fall under six types of machine model:

2.2.3.1 SIMD Machines

As discussed earlier, a SIMD machine is usually used for very specific applications. These machines were often referred to as vector processors and were typified by the Cray X-MP. They are now scarce in supercomputing, as MIMD machines have become more powerful and modern desktops now use multiprocessor MIMD designs, in which each processing core incorporates a SIMD architecture.

2.2.3.2 Parallel Vector Processor (PVP) Machines

These systems are built from custom designed vector processors connected to a shared memory by a high-bandwidth crossbar switch network. Examples of this architecture are the Cray C-90, Cray T-90 and NEC SX-4.

2.2.3.3 Symmetric Multiprocessor (SMP) Machines

This type of machine is a multiprocessor machine in which multiple identical processors are connected to a shared memory store. Modern desktops with multi-core processors follow this architectural model. Examples include the IBM R50, SGI Power Challenge and the DEC AlphaServer 8400.

2.2.3.4 Distributed Shared Memory (DSM) Machines

A DSM machine is a shared memory machine similar to an SMP architecture, except that the memory is physically distributed, while hardware and software provide a single address space, i.e. Virtual Shared Memory. An example of such a system is the Cray T3D.

2.2.3.5 Massively Parallel Processor (MPP) Machines

An MPP system uses many separate processors, each having its own memory, in contrast to an SMP system which uses shared memory. MPP systems are large, purpose-built systems usually comprising a number of processing nodes, with each node containing one or more processors connected by a high speed interconnect to a local memory. In many MPP machines, a number of shared memory processing nodes are combined to form a larger machine. A Cluster is a group of nodes linked together to form one larger system; this is a form of Distributed Computing. Furthermore, a Constellation is a type of Cluster in which each node contains more processors than there are nodes in the system. Although Constellations have a distributed element, each node usually consists of an SMP and will have shared memory.

2.2.3.6 Cluster of Workstations (COW)

These Clusters, similar to those defined above, are groups of linked nodes and are considered a low cost variation of MPP machines. They are often built from desktops and workstations connected by low-cost commodity networks. In recent years the distinction between MPP and COW architectures has become blurred, with clusters providing a cheaper and more cost effective architecture for supercomputing.

2.2.4 Race Conditions

Race Conditions occur in parallel programming when two or more processes or threads try to read and write to the same memory address. It is therefore essential that processes and threads are synchronized, and that the order of operations is carefully structured, so that the values stored are correct at the time they are required by any processor. For instance, if process A writes a value that process B reads, then it is essential that A writes before B reads. Figure 2.11 illustrates the flow of two processes A and B, where:

    Process A : X = X + 10,
    Process B : X = X².

Figure 2.11: Flow chart outlining two dependent processes A and B.

Figure 2.12 illustrates the series of outcomes that are possible without synchronization if a Race Condition occurs. This simple example illustrates a typical Race Condition: two processes using the same memory address can produce four different results, highlighting how essential it is to control and organize calculations and Input/Output (I/O) in parallel computation, especially within a shared memory environment. A minimal illustration in OpenMP is sketched below.
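The hedged OpenMP sketch below (illustrative code, not taken from the thesis) reproduces the same kind of hazard: an unprotected shared counter gives an unpredictable result, while an ATOMIC directive serializes the conflicting updates.

    PROGRAM race_demo
      IMPLICIT NONE
      INTEGER, PARAMETER :: n = 100000
      INTEGER :: i, unsafe_sum, safe_sum

      unsafe_sum = 0
      safe_sum   = 0

    !$OMP PARALLEL DO
      DO i = 1, n
        unsafe_sum = unsafe_sum + 1     ! unsynchronized read-modify-write: a race
      END DO
    !$OMP END PARALLEL DO

    !$OMP PARALLEL DO
      DO i = 1, n
    !$OMP ATOMIC
        safe_sum = safe_sum + 1         ! atomic update: the race is removed
      END DO
    !$OMP END PARALLEL DO

      WRITE(*,*) 'Unsafe sum (may be less than n):', unsafe_sum
      WRITE(*,*) 'Safe sum   (always equal to  n):', safe_sum
    END PROGRAM race_demo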

Figure 2.12: The series of outcomes of a simple two process Race Condition. (a) Process A reads and writes before Process B; (b) Process B reads before, and writes after, Process A writes.

Figure 2.12 (cont.): (c) Process A reads before, and writes after, Process B writes; (d) Process A reads after Process B has output.

2.3 Parallel Application Programming Interfaces (APIs)

There are several standards for parallel computation, including High Performance Fortran (HPF), OpenMP and the Message Passing Interface (MPI). These three standards are well documented and a brief overview of each is given in this section.

2.3.1 High Performance Fortran (HPF)

HPF is a now largely outdated extension of Fortran 90 which permits the programming of parallel computation. A set of compiler directives gives the ability to specify how data are distributed, and extended intrinsic functions and constructs allow calculations and manipulations to be carried out on the distributed data sets (Koelbel et al., 1994). Although HPF is intended to be easy to use, it has limited flexibility and is complex to implement; it is therefore unlikely to give good performance in most general applications. HPF is rarely implemented by Fortran compiler vendors, although its influence is considerable, with many of its features present in the Fortran 2008 standard.

2.3.2 OpenMP

OpenMP is a set of compiler directives and callable library routines that extend both C and Fortran (Chandra et al., 2001; Dagum and Menon, 1998; Quinn, 2003). It is suited to shared memory systems and is the easiest of the standards to code and implement; simple directives can be inserted to parallelise serial code, and OpenMP-aware compilers automatically distribute the workload between processors, opening and closing instruction threads as required. (See Figure 2.13.) For example, a directive pair such as !$OMP DO ... !$OMP END DO wrapped around a traditional DO ... END DO loop distributes the iterations of that loop between the processing units on which the code is executed. Although it is the simplest standard to code, it is not suited for use on distributed memory systems. A minimal worked example is sketched below.
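A minimal worked sketch of an OpenMP work-sharing loop, with an assumed array name and size:

    PROGRAM openmp_loop
      IMPLICIT NONE
      INTEGER, PARAMETER :: n = 1000000
      REAL(8) :: a(n)
      INTEGER :: i

    !$OMP PARALLEL DO
      DO i = 1, n
        a(i) = SQRT(REAL(i,8))      ! iterations shared automatically between threads
      END DO
    !$OMP END PARALLEL DO

      WRITE(*,*) 'a(n) = ', a(n)
    END PROGRAM openmp_loop

The same loop compiles and runs serially if the directives are ignored, which is one of the attractions of OpenMP.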

Figure 2.13: Illustration of OpenMP: (a) thread executed sequentially; (b) thread executed in parallel.

2.3.3 Message Passing Interface (MPI)

MPI is a standard which defines the requirements for a message passing library. The standard defines the various types of communication that a vendor should implement on their systems, including point-to-point and global (collective) communications; it also defines the use of topologies and process groups. It is implemented by many vendors, in both Fortran and C, and is very well documented (Forum, 1994; Snir et al., 1995; Quinn, 2003). Message passing is the only viable option for efficient and optimum performance on distributed memory systems, provided the program is written correctly. Message passing can also be used on shared memory systems; however, this is inefficient, as all processors already have access to all memory locations and, as such, OpenMP is usually the preferred interface there. The standard has several commands for sending and receiving data between processors, the main two being MPI_SEND and MPI_RECV. For every send command issued a corresponding receive command must also be present; these are blocking communication commands, which incorporate synchronization into the message passing, eliminating Race Conditions. However, poor synchronization can lead to deadlocks. Later versions of the MPI standard (Forum, 1997) have incorporated one-sided communications, although many vendors are yet to implement these later standards. One-sided communication allows a processor to retrieve data without the need to synchronize with the corresponding processor; however, this can reintroduce Race Conditions. A minimal sketch of a blocking exchange is given below.
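As a minimal sketch of blocking point-to-point communication (hypothetical example, not thesis code), two processes exchange a single value; ordering the calls differently on each process avoids the deadlock that would occur if both posted their receives first.

    PROGRAM mpi_exchange
      USE mpi
      IMPLICIT NONE
      INTEGER :: ier, numpe, npes
      INTEGER :: status(MPI_STATUS_SIZE)
      REAL(8) :: mine, theirs

      CALL MPI_INIT(ier)
      CALL MPI_COMM_RANK(MPI_COMM_WORLD, numpe, ier)
      CALL MPI_COMM_SIZE(MPI_COMM_WORLD, npes,  ier)

      IF (npes >= 2 .AND. numpe <= 1) THEN
        mine = REAL(numpe,8) + 1.0d0

        ! Ordering the blocking calls differently on each process avoids the
        ! deadlock that would occur if both processes posted MPI_RECV first.
        IF (numpe == 0) THEN
          CALL MPI_SEND(mine,   1, MPI_DOUBLE_PRECISION, 1, 99, MPI_COMM_WORLD, ier)
          CALL MPI_RECV(theirs, 1, MPI_DOUBLE_PRECISION, 1, 99, MPI_COMM_WORLD, status, ier)
        ELSE
          CALL MPI_RECV(theirs, 1, MPI_DOUBLE_PRECISION, 0, 99, MPI_COMM_WORLD, status, ier)
          CALL MPI_SEND(mine,   1, MPI_DOUBLE_PRECISION, 0, 99, MPI_COMM_WORLD, ier)
        END IF

        WRITE(*,*) 'Process ', numpe, ' received ', theirs
      END IF

      CALL MPI_FINALIZE(ier)
    END PROGRAM mpi_exchange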

2.4 Performance Metrics

It is essential to set out the performance measures by which the code will be tested in order to select a suitable strategy; this is known as performance modelling. The key to good design is to optimize the trade-offs between the various factors within a code, to produce code that meets the overall requirements. Traditionally, execution time is the most applicable measure of performance; however, in many applications trade-offs with memory requirements, portability, simplicity, implementation costs, maintenance, efficiency and other factors are considered.

2.4.1 Execution Time

The execution time of a parallel code is measured as the time from when the first processor starts execution until the last processor completes. During an execution the (wall clock) time taken on each processor can be characterized as follows:

• Computation, Tcomp, is the time taken up by a processor carrying out calculations on data;
• Communication, Tcomm, is the time taken to communicate data to and from other processors;
• Idle, Tidle, is the time a processor spends idling.

The total time taken by each processor can be expressed as:

    T = Tcomp + Tcomm + Tidle                                    (2.1)

while the time taken for the code to execute is the maximum time, Tmax, across all executing processors. From this basic equation it is clear that, in order to reduce run times, communication should be reduced, idle time eliminated and (where possible) redundant computation minimized.
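A hedged sketch of how the components of Equation 2.1 might be measured in practice with MPI's wall-clock timer; the loop, array and variable names are assumptions for illustration.

    PROGRAM timing_sketch
      USE mpi
      IMPLICIT NONE
      INTEGER, PARAMETER :: n = 100000
      REAL(8) :: a(n), local_sum, total
      REAL(8) :: t0, t1, t_comp, t_comm
      INTEGER :: ier, i

      CALL MPI_INIT(ier)

      t0 = MPI_WTIME()
      DO i = 1, n                       ! computation phase
        a(i) = REAL(i,8)**2
      END DO
      t1 = MPI_WTIME()
      t_comp = t1 - t0

      t0 = MPI_WTIME()
      ! communication phase: global sum of the local contributions
      local_sum = SUM(a)
      CALL MPI_ALLREDUCE(local_sum, total, 1, MPI_DOUBLE_PRECISION, MPI_SUM, MPI_COMM_WORLD, ier)
      t1 = MPI_WTIME()
      t_comm = t1 - t0

      WRITE(*,*) 'Tcomp = ', t_comp, ' s,  Tcomm = ', t_comm, ' s'
      CALL MPI_FINALIZE(ier)
    END PROGRAM timing_sketch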

2.4.2 Communication

A code produced using an MPI model has data distributed across multiple processors, with each processor having its own physical memory, or its own section of memory in the case of a shared memory machine. It is therefore advantageous to optimize communication. A good parallel code will do so by minimizing the data transfer required, through good decomposition. When deciding upon a parallel strategy a key factor is the communication within the chosen strategy. This is especially true when dealing with large problems, where large amounts of data are being communicated, contributing significantly to run times and possibly causing a bottleneck within a parallel code.

Within a distributed architecture the processors are connected together via a network. These networks are known as interconnects and are key to the performance of the machine. Two key metrics associated with these networks are:

• Bandwidth - a measure of the data rate or network throughput, usually measured in bits/s. Table 2.2 shows a list of common network bandwidths. When these values are compared with the bandwidth of the buses between processors and local memory, which are now available in Gbit/s, it can be seen why these networks create bottlenecks within codes.

• Latency - the startup latency, which can be considered to be the time taken to set up the communication channel between locations. This latency is system dependent and is independent of message size.

    Network Type           Rate
    Modem / Dialup         56 kbit/s
    T1                     1.544 Mbit/s
    Ethernet               10 Mbit/s
    Wireless 802.11b       11 Mbit/s
    Wireless-G 802.11g     54 Mbit/s
    Fast Ethernet          100 Mbit/s
    Wireless-N 802.11n     300 Mbit/s
    Gigabit Ethernet       1000 Mbit/s

Table 2.2: Common Network Bandwidths.

A reliable estimate of the communication time is:

    Time = α + βδ + ω                                            (2.2)

where α is the message setup time, or latency; β is the time taken to transfer one byte, i.e. 1/Bandwidth; δ is the number of bytes to be communicated; and ω is the time spent waiting for synchronization. Typical values for these parameters are:

    α ≈ 4.3 µs
    Bandwidth ≈ 900 Mb per second
    β ≈ 0.0011 µs

This model shows the necessity of reducing communication in any strategy, and that the agglomeration of communication into fewer, larger messages reduces the amount of setup time and synchronization required before communication. In parallel computation there are further time-based measures that can be used to assess the performance of a parallel code. These include Speedup and Efficiency, which are measures of the scalability of a code.
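As an illustration of Equation 2.2 with the typical values above (and neglecting the synchronization term ω), a 1 MB message costs roughly 4.3e-6 + 1.0e6 × 1.1e-9 ≈ 1.1 ms, so the latency term is negligible for large, agglomerated messages. A minimal sketch of the estimate:

    PROGRAM comm_cost
      IMPLICIT NONE
      REAL(8), PARAMETER :: alpha = 4.3d-6      ! latency, seconds
      REAL(8), PARAMETER :: beta  = 1.1d-9      ! seconds per byte (~1/bandwidth)
      REAL(8) :: nbytes

      nbytes = 1.0d6                            ! a 1 MB message
      WRITE(*,'(A,ES10.3,A)') 'Estimated time: ', alpha + beta*nbytes, ' s'
    END PROGRAM comm_cost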

2.4.3 Scalability

Scalability is a key collection of metrics measuring the performance of a parallel computation. It measures the effectiveness of a code in expanding its workload, i.e. domain size (refinement), and a code's efficiency when running on multiple processors. Increasing the number of processors over which a code is executed should be expected to decrease the runtime, although this is not always the case. This reduction in execution time is known as "Speedup".

2.4.4 Speedup

Speedup is a measure of the processing speed of a parallel code, measured using n processors, with respect to the time taken to run the same code using one processor. That is,

    Sn = T1 / Tn                                                 (2.3)

where Sn is the speedup, T1 is the time taken to execute the parallel program on a single processor, and Tn is the time taken to execute the parallel program on n processors. However, the time taken to execute the parallel code on a single processor, T1, is often greater than that of the serial code equivalent, due largely to the setup overheads and communication within a parallel code. Therefore a better measure of speedup is given by:

    Sn = T0 / Tn                                                 (2.4)

where T0 is the time taken to execute the serial program. The Speedup of any parallel code is limited by the setup and parallel overheads and by the serial aspects of the code. Many researchers have sought to quantify this limit, the most influential and well known being Amdahl.

Amdahl’s Law

Amdahl (1967) discussed the potential and possible speedups from using parallel computers, writing: “For over a decade prophets have voiced the contention that the organization of a single computer has reached its limits and that truly significant advances can be made only by interconnection of a multiplicity of computers in such a manner as to permit co-operative solution...The nature of this overhead (in parallelism) appears to be sequential so that it is unlikely to be amenable to parallel processing techniques. Overhead alone would then place an upper limit on throughput of five to seven times the sequential processing rate, even if the housekeeping were done in a separate processor...At any point in time it is difficult to foresee how the previous bottlenecks in a sequential computer will be effectively overcome.” Although this text was literal, many researchers have sought to derive an equation from his work. This became known as “Amdahl’s Law”, given by, Smax =

1 rs +

rp n

(2.5)

where, Smax = maximum speedup, rs = proportion of code that is serial, rp = proportion of code that is parallel, n = number of processors, rs + rp = 1. The law indicates that the maximum speedup achievable is related to the amount of code that is serial or parallel within any parallel program. For instance, if 100%

56

CHAPTER 2. PARALLEL RFEM STRATEGIES

of the code was parallelised, an unrealistic proposition, then the maximum achievable speedup would be Smax = n and, for a code that was 50% parallelised, this speedup would be Smax → 2 as n → ∞. The law shows that the limiting factor to the maximum speed up achievable is based upon the parallelisable percentage of code rather than the number of processors over which it is executed.

2.4.4.2

Gustafson’s Law

In recent times Amdahl’s Law has been questioned and tuned, to make it more applicable and to account for modern day advances. The most notable of these is Gustafson’s Law, which was presented in “Reevaluating Amdahl’s Law” (Gustafson, 1988). In this Technical Note, Gustafson notes that Amdahl’s Law is based primarily on time and that, in reality, as the number of processors increases the amount of work expands to fill the available processing power; thus greater speedups are achievable than predicted by Amdahl. This is the same as increasing the percentage of parallel analysis within the code, whilst maintaining a constant runtime, whereas Amdahl maintained a constant workload for the measure. The two approaches are summarized and compared in Figure 2.14. Equation 2.6 was derived by Gustafson and is referred to as “Scaled Speedup”: rs + nrp rs + rp = rs + nrp

Smax =

= n + (1 − n)rs

(2.6)

where, Smax = maximum scaled speedup, rs = proportion of code that is serial, rp = proportion of code that is parallel, n = number of processors, rs + rp = 1 The graph in Figure 2.15 illustrates the difference between the two speedup analyses. Amdahl’s Law shows an exponential decay towards convergence as the amount of serial code increases, whereas Gustafson’s Law shows a linear relationship. Although both have the same limiting values at the extremes of the code composition, Gustafson’s Law shows the more optimistic evaluation of maximum speedup inbetween these limits. Amdahl’s Law suggests that significant speedups

2.4. PERFORMANCE METRICS

57

Time = 1

Serial (S)

Parallel (P )

1

Serial execution

n Processors

Parallel execution

Time = S +

P n

(a) Amdahl’s Law Time = S + nP

Serial execution (hypothetical)

1

Serial (S)

n Processors

Parallel (P )

Parallel execution

Time = 1

(b) Gustafson’s Law

Figure 2.14: Illustration of Amdahl and Gustafson’s Laws (Gustafson, 1988).

Amdahl's Law suggests that significant speedups are only achievable when the code is predominantly parallel.
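The contrast plotted in Figure 2.15 can be reproduced numerically; the sketch below (illustrative only) evaluates Equations 2.5 and 2.6 for 10 processors across the full range of serial fractions.

    PROGRAM compare_speedup_laws
      IMPLICIT NONE
      INTEGER, PARAMETER :: n = 10
      REAL(8) :: rs
      INTEGER :: i

      WRITE(*,'(A)') '    rs    Amdahl  Gustafson'
      DO i = 0, 10
        rs = 0.1d0*REAL(i,8)
        WRITE(*,'(F6.2,2F10.3)') rs, 1.0d0/(rs + (1.0d0-rs)/REAL(n,8)), &
                                 REAL(n,8) + (1.0d0 - REAL(n,8))*rs
      END DO
    END PROGRAM compare_speedup_laws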

Figure 2.15: Graph illustrating the Theoretical Maximum Speedup as determined using Amdahl and Gustafson's Laws, for a hypothetical code using 10 processors.

Amdahl's and Gustafson's Laws both suggest that the upper bound of any speedup is the linear speedup, equal to the number of processors over which the computation is executed, n. However, it is possible for greater speedups to be achieved; this is known as superlinear speedup. Where it occurs, it is typically because the total amount of processor cache available grows as the number of processors increases, giving the processors faster access to their data.

2.4.5 Efficiency

Speedup does not take into account the number of processors being used. For instance, a serial program could execute in 12 seconds, whereas the parallel version may run in 4 seconds on 4 processors, giving S4 = 3 using Equation 2.4, while on 12 processors the program gives S12 = 6. The speedup results show a continuous improvement as the number of processors increases, suggesting that the program scales well in parallel. However, is the increased processing power available as the number of processors increases fully utilized? To answer this, the efficiency of the code execution may be measured using the relationship:

    En = Sn / n                                                  (2.7)

Using Equation 2.7 with the previous example, the 4 processor execution has an efficiency E4 = 0.75, whilst E12 = 0.5, showing that as the number of processors increases the efficiency decreases. Although this example may be an extreme case, the general expectation is that communication increases with increasing numbers of processors. (See Figure 2.16.) Efficiency is an important measure of scalability and is also important when coding; it is important to optimize codes, as parallel processing time and resources can be expensive.

Figure 2.16: Illustration of typical efficiency scaling for a parallel code.

2.4.5.1 Cost of Resource

Parallel resources are usually scarce and expensive to buy, especially for commercial use. Sun Microsystems, for example, sells time on their Sun Grid, a service providing computational resources over the internet, for 1 US Dollar per CPU hour; this means that a job executed over 100 processors for 1 hour will cost $100, while one executed over 500 processors for 5 minutes will cost $42 (as, pro rata, 41.66 CPU hours are used). These are examples of comparatively short production runs, and often hundreds or thousands of processors are required for several days to complete a task, which proves to be expensive. The utilization of this expensive commodity is very important, and efficiency is an important measure of a code's effectiveness.

2.4.6 Ideal Strategy Properties

From the basic qualitative analysis of the parallelisation, the following general observations on performance can be made:

• efficiency decreases with increasing numbers of processors and with increasing communication requirements;
• efficiency increases with increasing domain size and computation time;
• execution time decreases with increasing numbers of processors, but has a lower bound resulting from the cost of communication;
• execution time increases with increasing problem size, computation size and communication times.

Although for some applications it would be perfectly adequate to follow these observations and optimize for efficiency and computation time alone, doing so overlooks optimization for memory requirements. It should be assumed that total memory consumption increases as a problem is split over multiple processors of a distributed memory system, even though the memory required on each individual processor is reduced. The serial memory overheads provide the lower limit for the decreasing memory required per processor as a problem is increasingly decomposed, as shown in Figure 2.17. From these theories, analyses and laws, it is clear that the parallel approach should try to:

• minimize processor memory consumption,
• minimize communication,
• utilize resources effectively,
• minimize runtime.

Figure 2.17: Illustration of the theoretical lower memory limit.

These targets are not all mutually compatible; for example, a code that minimizes memory will not have the fastest possible run time. The goal of any parallelisation scheme is therefore to strike an optimal balance between these key targets.

2.5 Stochastic Parallelisation

The previous sections have laid out some of the methods, devices and performance measures by which a code may be parallelised. The following sections use this information and the tools presented to parallelise the current codes within a stochastic framework. For a code to be parallelisable the algorithm must contain a series of independent calculations. The random finite element method has, at its most basic level, two main sets of independent calculations: it is possible to split an FE mesh using domain decomposition, and also to split the realisations of the Monte Carlo process, as each realisation is a separate calculation. Initially the existing codes will be analysed to see which should be used in the parallelisation, why, and in what situation.

2.6 Existing Codes

A good starting point for choosing a strategy is to analyse the codes already in use. The Author had several different codes available, all relevant to the determination of a strategy; broadly, they are broken down in the sections that follow. The initial codes presented to the Author, at the beginning of the research, were those found in the third edition of "Programming the Finite Element Method" (Smith and Griffiths, 1997). The book contains a series of FE methods for problem solving within classic structural analysis, elasticity and plasticity, steady state and transient fluid flow, linear and non-linear solid dynamics, and construction processes in geomechanics. The FE codes include "mesh-free" and "element-by-element" techniques. The book also provides a library of routines, not only to run the FE codes within the book, but also for readers to construct their own codes with minimal coding. It is these codes which provided the backbone of the research carried out within the Geotechnics research group and, as such, it is an aim of this research to provide a stochastic framework for these codes to fit into.

2.6.1 Serial Deterministic FE Code

The codes within the book, prior to the latest (fourth) edition, were limited to serial codes. Figure 2.18 outlines the basic algorithm around which these codes were implemented, with the serial solver element varying between applications.

Figure 2.18: Flow chart outlining a general deterministic serial code.

From simple testing and the experiences of colleagues, many of the codes had significant limitations, particularly in relation to memory. This was inevitable, given the increasing demands of engineering analyses, both in size and accuracy, together with the limited memory available on a standard PC. Table 2.3 shows the basic memory requirements to store an FE solution for a simple cube domain, using 8-node cube elements and with increasing refinement. The resulting values of a single dependent variable are stored at the nodes in double precision, which means that each value stored requires 64 bits, or 8 bytes.

    Field Size          Elements     Nodes        Memory Requirements (Bytes)
    2 x 2 x 2           8            27           216
    4 x 4 x 4           64           125          1000
    8 x 8 x 8           512          729          5832
    16 x 16 x 16        4096         4913         39304
    32 x 32 x 32        32768        35937        287496
    64 x 64 x 64        262144       274625       2197000
    128 x 128 x 128     2097152      2146689      17173512
    256 x 256 x 256     16777216     16974593     135796744
    512 x 512 x 512     134217728    135005697    1080045576

Table 2.3: Basic Memory Requirements (Minimal).

As seen in Table 2.3, the memory requirements increase rapidly as the FE meshes become larger and finer; and these are just the basic values of the solution. In the actual codes these figures are typically larger by a factor of 3 or 4, with the storage of node coordinates, element connectivity numbering, load and displacement values and nodal freedom values all dramatically increasing this requirement; the requirement could be larger still if the global stiffness matrix were assembled. Clearly, in many respects the codes within the book are useful and viable but, as the demands of engineering applications and accuracy increase, outstripping the power of desktop machines, the codes become less useful, slow, and limited by memory and processing power.
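The entries in Table 2.3 follow directly from the mesh geometry: an n x n x n grid of 8-node cube elements has (n+1)^3 nodes, and one double-precision value per node occupies 8 bytes. A minimal sketch of the calculation (illustrative only):

    PROGRAM mesh_memory
      IMPLICIT NONE
      INTEGER, PARAMETER :: i8 = SELECTED_INT_KIND(18)
      INTEGER(i8) :: side, nels, nodes
      INTEGER :: k

      DO k = 1, 9
        side  = 2_i8**k                      ! field sizes 2, 4, ..., 512
        nels  = side**3
        nodes = (side + 1_i8)**3             ! (n+1)^3 nodes for an n^3 element grid
        WRITE(*,'(3I14)') nels, nodes, 8_i8*nodes
      END DO
    END PROGRAM mesh_memory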

2.6.2 Parallel Deterministic FE Code

The parallelisation of an FE code is well documented (Smith and Griffiths, 2004; Margetts, 2002; Pettipher and Smith, 1997) and the benefits are predominantly twofold: in parallelising the element-by-element FE method, performance is improved with respect to both time and memory constraints. It is these benefits that are sought in the parallelisation of the Random Finite Element Method. Domains decomposed over multiple processors of distributed memory systems allow larger models to be analysed, and others to be analysed more quickly. Figure 2.19 outlines the general deterministic parallel algorithm, illustrating the communication involved between multiple processors; here the parallel solver is the Element-by-Element (EBE) FE solver.

Figure 2.19: Flow chart outlining a general deterministic parallel code.

The parallelisation of the serial version follows Foster's Design Methodology. By utilizing the Element-by-Element FE method, the problem can be partitioned by decomposing the domain between processors, requiring only the communication of neighbouring node values across the boundaries of each processor's subdomain. (See Figure 2.20.) The elements and nodes are mapped to the processors so as to load balance the problem, providing an optimized efficiency and limiting the idle time of the processors. Agglomeration of many of the tasks takes place as they are combined for more efficient execution; for example, the communication between processors takes place in a single action, rather than using several smaller messages over multiple communication channels, thus reducing synchronisation times.

Figure 2.20: Example of processor boundary communication.

The following section briefly reviews the Element-by-Element technique to illustrate how it is used in the parallelisation of any code.

2.6.2.1 Element-by-Element (EBE) Technique

The EBE technique is a method for solving finite element problems in which the system stiffness matrix is never formed. Traditionally, for a calculation such as a static equilibrium problem,

    KM r = f                                                     (2.8)

where KM is the element stiffness matrix, r is the vector of equilibrium displacements and f is the external loading vector. The KM from each element would be used to assemble a system matrix and, in large scale problems, this array would be extremely large; the system would then be solved using some form of Gaussian Elimination. To reduce the memory limitations, alternative techniques have been developed. A popular method is based on the method of conjugate gradients (Jennings and McKeown, 1992).

Consider the linear algebraic system

    A x = b                                                      (2.9)

where A and b are an arbitrary known matrix and vector respectively and x is the required unknown vector; these are analogous to KM, f and r respectively. Initially,

    p^0 = r^0 = b − A x^0                                        (2.10)

where x^0 is an initial trial solution, r^0 is the corresponding error (residual) vector and p^0 is a storage (search direction) vector. The process is iterative, with step k of the algorithm given by:

    u^k = A p^k
    α^k = ( (r^k)^T r^k ) / ( (p^k)^T u^k )
    x^(k+1) = x^k + α^k p^k
    r^(k+1) = r^k − α^k u^k
    β^k = ( (r^(k+1))^T r^(k+1) ) / ( (r^k)^T r^k )
    p^(k+1) = r^(k+1) + β^k p^k                                  (2.11)

where α and β are scalars, while u, p and r are vectors, with the superscript k indicating the iteration step. The iterative steps in Equation 2.11 are repeated until the difference between x^(k+1) and x^k is negligible.

If A were a system stiffness matrix, traditionally the matrix would be assembled and then reduced using a direct solver, which is expensive with regard to memory. Using this algorithm, however, only the product A p is required. This can be performed element by element, by substituting Equation 2.12 into the algorithm in Equation 2.11. That is,

    u = Σ_i KP_i p_i                                             (2.12)

where KP_i is the element stiffness matrix of element i and p_i is the corresponding part of p. This eliminates the need to assemble the entire system stiffness matrix, thus saving on memory resources. It is this element-by-element approach that is utilized in parallel FE to solve large scale problems; otherwise memory constraints would render the calculation unfeasible. The approach is easily parallelisable, as the calculation for each element is carried out independently, which means that the domain can be decomposed and distributed across multiple processors and memory locations.
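A hedged, one-dimensional toy illustration of Equation 2.12 is sketched below: a chain of two-node "spring" elements, each with a 2 x 2 element stiffness matrix, is used to form the product u = Σ KP_i p_i by gathering and scattering element contributions, without ever assembling a global matrix. The element type and stiffness value are assumptions; the thesis codes apply the same idea to 3D finite elements.

    PROGRAM ebe_matvec
      IMPLICIT NONE
      INTEGER, PARAMETER :: nels = 10, nn = nels + 1
      REAL(8), PARAMETER :: stiff = 100.0d0          ! element stiffness (assumed)
      REAL(8) :: km(2,2), p(nn), u(nn), pe(2)
      INTEGER :: g(2), iel, i

      km = RESHAPE( (/ stiff, -stiff, -stiff, stiff /), (/2,2/) )

      DO i = 1, nn
        p(i) = REAL(i,8)                             ! an arbitrary vector p
      END DO

      u = 0.0d0
      DO iel = 1, nels
        g  = (/ iel, iel+1 /)                        ! global freedoms of this element
        pe = p(g)                                    ! gather
        u(g) = u(g) + MATMUL(km, pe)                 ! element contribution, scattered back
      END DO

      WRITE(*,'(A,11F8.1)') 'u = ', u
    END PROGRAM ebe_matvec

Because each element contribution is formed independently, the loop over elements can be split between processors, with only the shared boundary freedoms needing communication.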

2.6.3 Serial Random FE Code

The Serial Random FE Method is simply the serial FE code looping through the FE calculation for the desired number of realisations, as illustrated by Figure 2.21. In each realisation, the input field parameters are changed by means of a generated random field (see Chapter 3) and the required result variables are stored for analysis. This method suffers from the same memory and runtime problems associated with its serial deterministic counterpart; moreover, due to the many hundreds (or even thousands) of realisations, many problems and applications become unfeasible. Figure 2.22 illustrates how run times increase as the number of realisations increases. As expected, the relationship is linear and the code contains a setup overhead which is present no matter how many realisations are analysed. This relationship is expressed as:

    Time = Realisations × Single Realisation Time + Overhead     (2.13)

The overheads are usually associated with I/O processes, while the long runtimes, which can be limiting, depend on the number of realisations. The increased memory consumption, over that of the deterministic code, arises predominantly from the generation and storage of the random field.

Figure 2.21: Flow chart outlining the Serial RFEM Algorithm.

Figure 2.22: Illustration of the components of the total run time for a Serial RFEM Code: the setup overhead, the time for a single realisation, and the time for multiple realisations.

2.6.4 Architecture Availability

Before a strategy can be selected and pursued, it is sensible to first analyse the computer architecture on which the program will be executed. During the Author's research, several computers were used or were available to the Author locally, predominantly in the Manchester Computing Centre, but also in other university departments. Table 2.4 lists the Manchester computers. All these systems use MIMD and, in general, distributed memory architectures, and so the research was developed and optimized for these architectures.

Toward the end of the Author's research, many of the systems listed in Table 2.4 were phased out and replaced by "Horace", a Bull Itanium2 system providing 192 Itanium2 processor cores. The system was configured as 24 compute nodes, each with 4 Intel Itanium2 Montecito Dual Core 1.6GHz processors with 8Mb cache, providing 8 cores per node, with each node having access to 16GB RAM. Each core's peak compute performance was 6.4 GFlop/s, contributing to a total peak performance of around 1.2 TFlop/s, with 32GB RAM per node available on a limited number of nodes for memory consuming tasks. A single rail Quadrics QsNetII (elan4) interconnect connects all the nodes within the system, which provides up to 10TB of central filestore based on a Lustre shared disk filing system. Each compute node has access to 16GB of shared memory, over which an OpenMP framework (see Section 2.3.2) can operate; however, to fully utilize the system and compute in parallel across the whole machine, the Message Passing Interface (MPI) framework (see Section 2.3.3) can be used. The Author and his colleagues also had access to the National Grid Service (NGS), which provides supercomputer resources in a grid framework for UK academic research, with resources based at the Rutherford Appleton Laboratory, University of Manchester, University of Leeds and University of Oxford.

    Name     Computer
    Newton   SGI Altrix 3700, 512 Itanium 2 processors, 1 Terabyte Memory
    Green    SGI Origin 3800, 512 MIPS R12000 processors, 512 Gigabytes Memory
    Fermet   SGI Origin 2000, 128 MIPS R12000 processors, 128 Gigabytes Memory
    Wren     SGI Origin 300, 16 MIPS R14000 processors, 16 Gigabytes Memory
    Bezier   SGI Onyx 300
    Cosmos   Sun Microsystems 8 Processor
    Eric     Sun Microsystems 30 processor, 30 Gigabytes Memory

Table 2.4: University of Manchester Supercomputers.

Further afield, the supercomputing landscape is illustrated in Tables 2.5 and 2.6. These lists are compiled twice a year to benchmark and measure the performance of the most powerful supercomputers in the world (Dongarra et al., 2010). They further illustrate that the majority of the available computing power is distributed in nature, with most machines providing distributed memory. It is clear from the availability of current supercomputers that, in general, machines are distributed memory systems with an MIMD structure, often in the form of a Massively Parallel Processing (MPP) machine or a distributed computing framework, where processing units are spread over several interconnected machines that cooperate to solve the problem. Figure 2.23 shows the breakdown and share of the architectures within the Top 500 most powerful supercomputers historically from 1993 (Dongarra et al., 2010). The graph clearly shows that Clusters and MPP architectures dominate the 2010 list. The architectural advantage of distributed memory machines, for FEA, is that they can support large amounts of memory, due to a reduced constraint on bandwidth compared with a shared memory architecture. The MPI standard, with its ability to control the passing of data between processors, is the ideal combination for producing code for the FE suite of programs. It allows the programmer to control the gathering and scattering of data, distributing the large memory requirements over multiple processors, to produce the most efficient code for the FE models. Therefore, the Author's research focuses on MPI programming for distributed memory machines.

    UK Rank | World Rank | Location                             | Computer                                                                        | Processors | Rate (TFlops)
    1       | 25         | University of Edinburgh              | Cray XT6 12-core 2.1GHz - Cray Inc.                                             | 43660      | 274.7
    2       | 53         | Atomic Weapons Establishment         | Bullx B500 Cluster, Xeon X56xx 2.8GHz, QDR Infiniband - Bull SA                 | 12936      | 124.6
    3       | 56         | ECMWF                                | Power 575, p6 4.7GHz, Infiniband                                                | 8320       | 115.9
    4       | 57         | ECMWF                                | Power 575, p6 4.7GHz, Infiniband                                                | 8320       | 115.9
    5       | 77         | University of Edinburgh              | Cray XT4, 2.3GHz - Cray Inc.                                                    | 12288      | 95.08
    6       | 112        | University of Southampton            | iDataPlex, Xeon E55xx QC 2.26GHz, Infiniband, Windows HPC2008 R2 - IBM          | 8000       | 66.68
    7       | 128        | Computacenter (UK) LTD               | Cluster Platform 3000 BL460c G1, Xeon L5420 2.5GHz, GigE - Hewlett-Packard      | 11280      | 58.71
    8       | 149        | United Kingdom Meteorological Office | Power 575, p6 4.7GHz, Infiniband - IBM                                          | 3520       | 51.86
    9       | 150        | United Kingdom Meteorological Office | Power 575, p6 4.7GHz, Infiniband - IBM                                          | 3520       | 51.86
    10      | 171        | Computacenter (UK) LTD               | Cluster Platform 3000 BL460c, Xeon 54xx 3.0GHz, GigEthernet - Hewlett-Packard   | 7560       | 47.89

Table 2.5: Top 10 UK Supercomputers in November 2010 (Dongarra et al., 2010).

    Rank | Location                                                               | Computer                                                                                                           | Country       | Processors | Rate (TFlops)
    1    | National Supercomputing Center in Tianjin                              | Tianhe-1A - NUDT TH MPP, X5670 2.93GHz 6C, NVIDIA GPU, FT-1000 8C - NUDT                                           | China         | 186368     | 2566.00
    2    | DOE/SC/Oak Ridge National Laboratory                                   | Jaguar - Cray XT5-HE Opteron 6-core 2.6 GHz - Cray Inc.                                                            | United States | 224162     | 1759.00
    3    | National Supercomputing Centre in Shenzhen (NSCS)                      | Nebulae - Dawning TC3600 Blade, Intel X5650, NVidia Tesla C2050 GPU - Dawning                                      | China         | 120640     | 1271.00
    4    | GSIC Center, Tokyo Institute of Technology                             | TSUBAME 2.0 - HP ProLiant SL390s G7 Xeon 6C X5670, Nvidia GPU, Linux/Windows - NEC/HP                              | Japan         | 73278      | 1192.00
    5    | DOE/SC/LBNL/NERSC                                                      | Hopper - Cray XE6 12-core 2.1 GHz - Cray Inc.                                                                      | United States | 153408     | 1054.00
    6    | Commissariat a l'Energie Atomique (CEA)                                | Tera-100 - Bull bullx super-node S6010/S6030 - Bull SA                                                             | France        | 138368     | 1050.00
    7    | DOE/NNSA/LANL                                                          | Roadrunner - BladeCenter QS22/LS21 Cluster, PowerXCell 8i 3.2 GHz / Opteron DC 1.8 GHz, Voltaire Infiniband - IBM  | United States | 122400     | 1042.00
    8    | National Institute for Computational Sciences/University of Tennessee  | Kraken XT5 - Cray XT5-HE Opteron 6-core 2.6 GHz - Cray Inc.                                                        | United States | 98928      | 831.70
    9    | Forschungszentrum Juelich (FZJ)                                        | JUGENE - Blue Gene/P Solution - IBM                                                                                | Germany       | 294912     | 825.50
    10   | DOE/NNSA/LANL/SNL                                                      | Cielo - Cray XE6 8-core 2.4 GHz - Cray Inc.                                                                        | United States | 107152     | 816.60

Table 2.6: Top 10 World Supercomputers in November 2010 (Dongarra et al., 2010).

Figure 2.23: Graph of Fastest Computer Architecture within the Top 500, 1993 - 2010 (Dongarra et al., 2010): (a) Number of Systems; (b) Performance Share.

2.7 Partitioning

Following Foster's Design Methodology, it is necessary to partition the problem. It is clear that the code is parallelisable, as previously stated, and that there are two levels of basic parallelism:

1. Realisation Parallelism (Functional Decomposition)
2. Solver Parallelism (Domain Decomposition)

These two forms of parallelism are forms of partitioning, as discussed in Section 2.1.1.

2.7.1 Realisation Parallelism

Realisation parallelism partitions and maps the workload according to the realisations, with each processor, or group of processors, running a single realisation at any one time. The partitioning of the realisations is trivial to implement and provides excellent performance when measuring efficiency and speedup, as little or no communication is required. There are two ways in which jobs can be allocated to processors: each processor can intrinsically know which realisations to generate and execute, or the realisations can be communicated from a master node that maintains a queue of realisations to be conducted.

2.7.1.1 Intrinsic Realisation Allocation

Taking an intrinsic approach, where each processor has a predefined queue of realisations, and assuming that each realisation executes in roughly the same amount of time with each processor executing the same number of realisations, all processors should have a balanced workload and finish at approximately the same time. This assumption may be unrealistic for many models and algorithms, for instance those modelling failure or plasticity, as the realisation times may vary significantly. However, over the many hundreds of realisations being carried out on each processor, the total workload and execution time should average out to be similar. This allocation method distributes realisations by some predetermined rule; e.g. using 4 processors, 1000 realisations would be distributed by splitting the realisations consecutively and evenly across all processors, as shown in Table 2.7.

    Processor    Realisation Numbers
    1            1 - 250
    2            251 - 500
    3            501 - 750
    4            751 - 1000

Table 2.7: Basic Realisation Allocation.

This method is indiscriminate of realisation factors such as execution time and can lead to workload imbalances and extended run times. In the main the method is sufficient as, in many codes, the realisations have very similar execution times and, even when they vary, large realisation counts tend to balance out the load distribution. However, in many nonlinear FE applications there is a significant difference in execution time between realisations. In these cases, it is often more efficient to use a master node to distribute new realisations as the different processors finish previous ones; this is known as Job Farming.

2.7.1.2 Job Farming

Job Farming is a method of partitioning the realisations in which, instead of each processor intrinsically knowing its realisations, a processor is given a new realisation from a list as and when required. This method is especially useful when running codes that run to structural failure or to convergence, where realisations may have dramatically varying execution times; it is a form of dynamic load balancing. However, the implementation of this approach is somewhat more difficult. If the code is run on a shared memory system it is quite straightforward, as all processes have access to the array of jobs to be done; it simply requires a memory locking mechanism to help maintain data coherence. On a distributed platform the procedure becomes more problematic: the processors have to send messages to one another, which requires a master node dealing with the realisations, and this node would be idle a great deal of the time. The later MPI standard, MPI-2 (Forum, 1997), defines advanced one-sided communications, which should allow global access to parts of the distributed memory via a method of locks and non-blocking communications. There are also hybrid options, where one processor becomes the master after it has completed a set number of realisations, as was implemented by Spencer (2007) with good results.

As an example, consider the following scenario: a code executes 40 realisations over 4 processors and each realisation varies in run time between 2 and 50 minutes. Figure 2.24 shows how the intrinsic allocation and job farming methods could distribute the realisations between processors. The intrinsic workload distribution described in Section 2.7.1.1 has a predefined and static realisation schedule, whereas job farming distributes the realisations to each processor as required; as such, this distribution adapts to the execution times of the individual realisations. As the graphs show, the Job Farming method executes faster and distributes the work more evenly than the intrinsic allocation approach. This method was not implemented during this research due to coding complexities, but is provided here as background information. The research produced by Spencer (2007) showed encouraging results and, with MPI-2 one-sided communication implemented, it would be a more viable alternative that does not surrender a valuable node to become the master. A minimal master/worker sketch is given below.
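For background, a hedged sketch of a simple master/worker job farm using blocking MPI calls follows; it is illustrative only (this scheme was not implemented in the present research), it dedicates process 0 as the master, and the tag values and variable names are assumptions.

    PROGRAM job_farm
      USE mpi
      IMPLICIT NONE
      INTEGER, PARAMETER :: nreals = 40                 ! total realisations (assumed)
      INTEGER, PARAMETER :: tag_work = 1, tag_stop = 2
      INTEGER :: ier, numpe, npes, status(MPI_STATUS_SIZE)
      INTEGER :: next, ireal, iworker, idle, dummy

      CALL MPI_INIT(ier)
      CALL MPI_COMM_RANK(MPI_COMM_WORLD, numpe, ier)
      CALL MPI_COMM_SIZE(MPI_COMM_WORLD, npes,  ier)
      dummy = 0

      IF (numpe == 0) THEN                              ! master: hands out realisations
        next = 0
        idle = 0
        DO WHILE (idle < npes-1)
          ! wait for any worker to ask for work (message content unused here)
          CALL MPI_RECV(ireal, 1, MPI_INTEGER, MPI_ANY_SOURCE, MPI_ANY_TAG, &
                        MPI_COMM_WORLD, status, ier)
          iworker = status(MPI_SOURCE)
          IF (next < nreals) THEN
            next = next + 1
            CALL MPI_SEND(next,  1, MPI_INTEGER, iworker, tag_work, MPI_COMM_WORLD, ier)
          ELSE
            CALL MPI_SEND(dummy, 1, MPI_INTEGER, iworker, tag_stop, MPI_COMM_WORLD, ier)
            idle = idle + 1
          END IF
        END DO
      ELSE                                              ! workers: request, compute, repeat
        DO
          CALL MPI_SEND(dummy, 1, MPI_INTEGER, 0, tag_work, MPI_COMM_WORLD, ier)
          CALL MPI_RECV(ireal, 1, MPI_INTEGER, 0, MPI_ANY_TAG, MPI_COMM_WORLD, status, ier)
          IF (status(MPI_TAG) == tag_stop) EXIT
          ! ... run realisation number ireal here ...
        END DO
      END IF

      CALL MPI_FINALIZE(ier)
    END PROGRAM job_farm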

2.7.2 Solver Parallelism

In this form of parallelism the FE domain itself is partitioned and mapped, with the solver computing the solution to each of the partitioned sub-domains. The partitioning of the domain has been discussed previously in Section 2.6.2. Such parallel FE codes have been developed and implemented by several researchers, including the Author's colleagues and predecessors (Pettipher and Smith, 1997; Margetts, 2002; Smith et al., 2005), and have been published in Smith and Griffiths (2004). This type of parallelism is discussed here in the context of a general strategy, and its limitations are identified. The parallelisation is limited to the EBE technique, as other solvers are difficult to parallelise in this way; it exploits the fact that the element-by-element method does not require the generation of the system stiffness matrix (see Section 2.6.2.1) and can therefore be decomposed over multiple processors. It must also be noted that the coding of failure criteria for several physical models is difficult to implement; in particular, the Mohr-Coulomb criterion has been investigated and found to be difficult (Spencer, 2007). The main problem with coding the Mohr-Coulomb criterion, using the EBE method for parallel computation, was returning excess stress states to the hexagonal yield surface, particularly at the corners.

Figure 2.24: Graphs illustrating Realisation Partitioning Methods: (a) Intrinsic Realisations; (b) Job Farming.

The two types of parallelism, Realisation and Solver, can be used singly or together to form a strategy:

1. Parallel Solver - the current parallel FE codes.
2. Parallel Realisations - serial FE codes running different realisations, mapped onto multiple processors.
3. Parallel Solver - Parallel Realisations - a hybrid model combining both, where a parallel solver operates on a distributed domain and different realisations are mapped onto multiple groups of processors.

The rest of this chapter is dedicated to the analysis of these strategies, with the objective of determining the most feasible option.

80

2.8

CHAPTER 2. PARALLEL RFEM STRATEGIES

Parallel Realisations

This strategy, illustrated in Figure 2.25, is essentially the serial solver running a different realisation mapped on to each of the processors, in a cyclic loop to complete the required number of realisations. This is considered “embarrassingly parallel”; it is simple to implement and provides excellent efficiency and speedup statistics, as minimal communication is required and idle time is minimized. It would be expected that speedup would be close to linear, depending on the amount of communicated results required. This strategy can use any of the current serial solvers. However, it is still limited by the memory constraints previously encountered by the original serial codes.

2.9

Parallel Solver

This strategy to parallelise stochastic FE involves using all the available processing power for a single realisation and carrying out all the realisations using the same group of processors in a cyclic loop. (See Figure 2.26.) This strategy is particularly useful when memory requirements are large and if all the memory of a particular system is required to fulfill these requirements. As previously discussed, the number of available parallel solvers limits the application of this strategy. After a suitable solver is chosen it is placed within a loop, followed by cycling through all the realisations, in which each realisation uses a new set of random field parameters. It has the advantage of allowing the stochastic analysis of very large scale models; however, communication between multiple processors can slow the execution time and reduces processor utilization and efficiency. The implementation is as simple as that of the Serial version; the Parallel solver and random field generation is placed within a DO loop, cycling through all the required realisations. Calculations within the solver that remain the same, from realisation to realisation, are removed from the main loop and placed outside, so as not to repeat unnecessary calculations.

2.10

Parallel Solver and Realisations

This hybrid strategy uses a combination of both realisation and solver parallelism; for instance, using 16 processors to execute 4 realisations simultaneously, where

2.10. PARALLEL SOLVER AND REALISATIONS

81

Figure 2.25: Flowchart outlining the Parallel Realisations Stochastic Algorithm.

82

CHAPTER 2. PARALLEL RFEM STRATEGIES

Figure 2.26: Flowchart outlining the Parallel Solver Stochastic Algorithm.

2.10. PARALLEL SOLVER AND REALISATIONS

83

the domain of each realisation is distributed over 4 processors. This is done by grouping sets of processors to analyse separate realisations. This method allows the flexibility to control the ratio of either parallelism, whilst providing greater control over memory distribution and parallelism. Figure 2.27 shows a flow chart of the generic method, showing how the previous strategies illustrated in Figures 2.25 and 2.26 are combined.

84

CHAPTER 2. PARALLEL RFEM STRATEGIES

Figure 2.27: Flow chart outlining the Hybrid Parallel Solver/Realisations Stochastic Algorithm.

2.11. PERFORMANCE

2.11

85

Performance

The three main strategies discussed previously were implemented using the Seepage FE codes P74, P75 and P123 (Smith and Griffiths, 2004), and analysed for 2 factors: processing time and memory consumption. The modelling of seepage due to a point source potential of 100 units (at x = y = z = 0.0) was analysed using each strategy with 320 realisations and the metrics of memory and time measured. The domain represents a quarter of a larger symmetrical cube domain with a central point source of potential, φ = 100. The outer facing faces have a fixed boundary potential, φ = 0, while those joining the other quarters are planes of symmetry. The performance of these codes were measured on the “Horace” system at the University of Manchester, described earlier. Figure 2.28 illustrates the domain mesh for this problem and a visualization of the results obtained, while Table 2.8 shows the number of elements used to form the domains for each of the performance tests.

(a) Domain Mesh

(b) Contours of Potential

Figure 2.28: Illustration of the performance test domain and visualization of results.

Test 1 2 3 4

X 10 20 50 100

× × × × ×

Y 10 20 50 100

× Z = Elements Nodes × 10 = 1000 1331 × 20 = 8000 9261 × 50 = 125000 132651 × 100 = 1000000 1030301

Table 2.8: Table of analysed domain sizes.

86

2.11.1

CHAPTER 2. PARALLEL RFEM STRATEGIES

Serial Codes

Firstly the standard serial seepage code P74 was implemented, incorporating a LAS random field generator (see Chapter 3). The code analysed the 3D steady state seepage problems described above using the global conductivity matrix and was therefore memory intensive. This code is unsuitable for memory intensive applications and parallelisation. However, code P75, also developed for steady state seepage analysis, was also implemented. In this implementation, a diagonally preconditioned conjugate gradient solver is used together with an elementby-element method. In using this method the global conductivity matrix is never assembled, thus providing greater memory efficiency. Table 2.9 compares typical results for the two serial stochastic codes, P74 and P75, executed over 320 realisations, showing the benefits of the latter. Figure 2.29 illustrates these results,

Test 1 2 3 4

Time (s) 21.985 648.138 67696.504 >86400

P74 Memory (kB) 8704 82688 5472128 > 15626304

Time (s) 15.491 171.789 4264.009 52560.338

P75 Memory (kB) 6656 21760 163840 713024

Table 2.9: A comparison of results between Stochastic implementations of P74 and P75 for 320 realisations.

and shows the superior performance, both in terms of memory and speed, of the P75 version of the analysis. The code P74 did not complete test 4 in the time allocated, i.e. 24 hours = 86400 second. These results show that P74 suffers from memory constraints that limit domain sizes < 1030301 nodes on system. Although both codes can easily be adapted to the parallel realisations method, it is P75 that will clearly give the best performance. It is also P75, with its iterative element by element approach, that is used in the parallelisation of the solver. However it must be noted that the iterative method, implemented by P75, is subject to convergence criteria and that the timing of the code can vary with the accuracy of output required; the EBE method may be significantly slower than the direct solver for applications requiring a large number of iterations.

2.11. PERFORMANCE

87

100000

P74 P75 10000

Time (s)

1000

100

10

1 1000

10000

100000

1000000

Nodes

(a) Time

10000000

P74

P75

Memory (kB)

1000000

100000

10000

1000 1000

10000

100000

1000000

Nodes

(b) Memory

Figure 2.29: Graphs comparing the performance of P74 and P75 serial codes.

88

CHAPTER 2. PARALLEL RFEM STRATEGIES

2.11.2

Parallel Codes

Having identified a suitable code to test the proposed framework, this code, P75, was implemented using the strategies previously discussed. P75 was parallelised with the parallel realisation strategy, to form “P75 (Parallel Realisations)”. During the process of parallelising the P75 code, with the parallel solver strategy, (Smith and Griffiths, 2004) was published and contained the code P123, a parallelised version of P75 with respect to the solver. Therefore this version of the code, P123, was taken forward for testing These three approaches, P75(Serial), P75(Parallel Realisations) and P123 (Parallel Solver), were initially run using a single processor, to determine parallel overheads, to provide a benchmark for efficiency and speedup calculations and to compare with the original serial version of the code. Table 2.10 and Figure 2.30 show the results of these comparisons.

Test 1 2 3 4

P75 (Serial) Time (s) Memory (kB) 15.491 6656 171.789 21760 4264.009 163840 52560.338 713024

P75 (Parallel Realisations) Time (s) Memory (kB) 17.185 16832 165.414 33280 4062.889 201280 52393.947 750720

P123 (Parallel Solver) Time (s) Memory (kB) 8.399 17600 175.770 33728 4778.711 194560 63951.570 692544

Table 2.10: A comparison of performance results between P75 Serial, P75 (Parallel Realisations) and P123 (Parallel Solver) executed on a single processor for 320 realisations.

The results, shown in Table 2.10 and Figure 2.30, indicate that there are memory overheads associated with the parallel versions of the code and that the execution times are similar. It is expected that the parallel codes will perform poorly in this serial test as parallel overheads, of both memory and time will occur. Some of the unexpected results, for instance Test 1, where P123 performs better with respect to execution time, can be explained by the fact that this implementation is compiled using a parallel compiler for which the code is optimized. For the calculations of speedup and efficiency, it was decided that the serial execution time, T0 , would be taken as the fastest execution of any test code over a single processor, therefore: • Test 1 - T0 = 8.399 seconds (taken from P123 (Parallel Solver))

2.11. PERFORMANCE

89

100000

P75 (Serial)

P75 (Parallel Realisations) 10000

P123 (Parallel Solver)

Time (s)

1000

100

10

1 1000

10000

100000

1000000

Nodes

(a) Time

800000

P75 (Serial) 700000

P75 (Parallel Realisations) P123 (Parallel Solver)

Memory (kB)

600000

500000

400000

300000

200000

100000

0 1000

10000

100000

1000000

Nodes

(b) Memory

Figure 2.30: Graphs illustrating the comparison of performance results between P75 Serial, P75 Parallel and P123 executed on a single processor.

90

CHAPTER 2. PARALLEL RFEM STRATEGIES • Test 2 - T0 = 165.414 seconds (taken from P75 (Parallel Realisations)) • Test 3 - T0 = 4062.889 seconds (taken from P75 (Parallel Realisations)) • Test 4 - T0 = 52393.947 seconds (taken from P75 (Parallel Realisations))

2.11.2.1

Parallel Realisations

The implemented parallel P75 code, parallelised with respect to realisations, was subjected to the four tests defined in Table 2.8. Each test was executed using 1, 2, 4, 8, 16 and 32 processors, to test the scaling performance of the code in terms of speedup and efficiency. Table 2.11 and Figure 2.31 show the results of these analyses. Test 1 2 3 4 CPUs Time(s) Mem(kB) Time(s) Mem(kB) Time(s) Mem(kB) Time(s) Mem(kB) 1 17.185 16832 165.414 33280 4062.889 201280 52393.947 750720 2 10.418 18496 118.694 34816 2764.624 202688 39376.55 752128 4 4.488 20544 50.102 37184 1204.979 204800 16404.888 754368 8 2.830 25088 25.291 41408 633.863 209536 8319.137 758976 16 1.565 25216 16.261 41728 371.501 209728 4278.877 759040 32 0.431 25280 6.674 41920 165.138 209792 2104.761 759104

Table 2.11: A comparison of performance results using the distributed version of P75, executed over increasing numbers of processors. Figure 2.31(a) shows that the time of execution for each test case reduces to as the number of processors increases. However, the Author expects that this reduction will reach a lower limit as the number of processors increases, due to the serial elements within the code; in particular the input, output and serially generated random field, which either remain constant or increase with increasing numbers of processors due to the communication required. Figure 2.31(b) shows that the memory consumption of the code remains constant for each test. This is expected, as the code run on each processor is essentially the same as the serial version and as such its memory requirements remain constant. Figure 2.31(c) shows that the speedup for each test is almost linearly increasing, especially for the larger domains. This suggests that with increasing domain sizes the speedup is improving, as the percentage of processing time consumed by serially executing procedures or communication is reduced. Further evidence of this is provided by the measure of efficiency, shown in Figure 2.31(d). With most

2.11. PERFORMANCE

91

100000

10000

Times (s)

1000

100

10

Test 1 Test 2 Test 3 Test 4

1

0.1 0

5

10

15

20

25

30

No. of Processors

(a) Time

800000

Max. Memory (KB) per processor

700000

Test 1 Test 2 Test 3 Test 4

600000

500000

400000

300000

200000

100000

0 0

5

10

15

20

25

30

No. of Processors

(b) Memory

Figure 2.31: Graphs comparing the performance results using the distributed version of P75 parallel, executed over increasing numbers of processors.

92

CHAPTER 2. PARALLEL RFEM STRATEGIES

35

Test 1

Test 2

30

Test 3 Test 4

25

Speedup

Linear 20

15

10

5

0 0

5

10

15

20

25

30

25

30

No. of Processors

(c) Speedup

1.2

1

Efficiency

0.8

0.6

0.4

Test 1 Test 2 Test 3 Test 4

0.2

0 0

5

10

15

20

No. of Processors

(d) Efficiency

Figure 2.31: cont....Graphs comparing the performance results using the distributed version of P75 parallel, executed over increasing numbers of processors.

2.11. PERFORMANCE

93

tests the efficiency remains relatively constant and high, with larger domains executing more efficiently. The exception is Test 1, which is due to the T0 measure being taken from the serially faster P123 code; this result, T0 = 8.399, was unexpected and can be attributed to efficient coding and compiler optimization of the developed P123 code. As such this result can be ignored, as the very small domain and times involved in this test are unlikely to be executed on expensive computer resources when they can be run quickly on a single processor. However, it must be noted that the general trend is linear and that relative to its own serial execution time, T0 (See Table 2.10), the results would be similar to the others. 2.11.2.2

Parallel Solver

The parallel solver, P123, was similarly implemented for the seepage problems described above and tested over the prescribed range of processors. The results are shown in Table 2.12 and Figure 2.32. As with the parallelisation of the Test CPUs 1 2 4 8 16 32

1 2 3 4 Time(s) Mem(kB) Time(s) Mem(kB) Time(s) Mem(kB) Time(s) Mem(kB) 8.399 17600 175.770 33728 4778.711 194560 63951.570 692544 6.660 18624 83.635 32512 2244.609 161344 22892.065 417856 6.138 20608 61.942 34048 1090.860 146048 16811.010 278976 6.282 25088 54.517 37568 712.683 142464 5679.692 211712 5.773 25152 54.848 37504 615.597 137280 3235.249 180096 8.229 25664 54.577 37312 545.366 135424 1919.598 162752

Table 2.12: A comparison of performance results using the parallel solver version, P123, executed over increasing numbers of processors.

realisations, the time scales down well to a lower limit, as shown in Figure 2.32(a). This limit is due to the combination of the serial processes taking place on each processor, including the random field generation and the communication taking place between all processors. The memory usage, illustrated in Figure 2.32(b) shows significant advantages over the parallelisation of the realisations. As the domain is distributed amongst the processors the memory consumption falls in a similar manner to the time, to a lower limit, this limit being the data replicated on all processors, which has a significant contribution from the random field. The effect is particularly observed for larger domain sizes where the reductions are significant. This reduction in memory consumption per processor removes the size constraints placed upon

94

CHAPTER 2. PARALLEL RFEM STRATEGIES

100000

Test 1 Test 2 Test 3 Test 4

10000

Times (s)

1000

100

10

1 0

5

10

15

20

25

30

No. of Processors

(a) Time

800000

Test 1 Test 2 Test 3 Test 4

Max. Memory (kB) per processor

700000

600000

500000

400000

300000

200000

100000

0 0

5

10

15

20

25

30

No. of Processors

(b) Memory

Figure 2.32: Graphs comparing the performance results using the parallel solver version, P123, executed over increasing numbers of processors.

2.11. PERFORMANCE

95

35

Test 1

Test 2

30

Test 3 Test 4

25

Speedup

Linear 20

15

10

5

0 0

5

10

15

20

25

30

No. of Processors

(c) Speedup

1.4

Test 1 Test 2 Test 3 Test 4

1.2

Efficiency

1

0.8

0.6

0.4

0.2

0 0

5

10

15

20

25

30

No. of Processors

(d) Efficiency

Figure 2.32: cont....Graphs comparing the performance results using the parallel solver version, P123, executed over increasing numbers of processors.

96

CHAPTER 2. PARALLEL RFEM STRATEGIES

application domain sizes; however, the undistributed random field still places a limit on the domain sizes achievable. The speed up illustrated in Figure 2.32(c) shows that, for the large domain which less readily fits on fewer processors, the speed up is almost uniformly linear and high; however, for smaller domains that would more easily fit on fewer processors, the speed up results are poor and reach a limit after only several processors. These findings are also reflected in the Efficiency metrics shown in Figure 2.32(d), with the largest domain operating at high efficiency, and the smaller domains starting at high efficiency over fewer processors, but scaling poorly and having very low efficiency for larger numbers of processors. These inefficiencies may be costly when executed on expensive computer resources. This can be explained, as, for small domains executed over large number of processors, the time for communication, where the processor is idle, becomes a significant percentage of the execution time. This shows that, although increasing the number of processors reduces the execution times, it may not reduce them efficiently; hence it is recommended that the codes be executed over the minimum number of processors applicable to meet time and memory constraints, as this provides the most efficient execution.

2.11.2.3

Comparison

Comparing the results between P75 (Parallel Realisations) and P123 (Parallel Solver), for Test 4, gives some indication of the performance expected from the hybrid method. Figure 2.33 illustrates these comparisons. The execution time of each of the codes, scales in a similar manner, as illustrated in Figure 2.33(a). However, Figure 2.33(b) shows that the memory consumption per processor of the parallelised solver of P123 is superior to that shown by P75 parallel, which remains constant. This will also be apparent in the hybrid method, with a reduction in per processor memory consumption as more of the domain is decomposed and distributed across more processors. However this memory reduction will be limited to the lower limit previously discussed. The speedup and efficiency of both codes are good, with both providing linear speedups and high efficiency, as shown in Figures 2.33(c) and (d) respectively. The graphs show that P123 is marginally superior for this example; however, previous results have indicated that, for smaller domains, this efficiency declines with

2.11. PERFORMANCE

97

70000

P75 (Parallel Realisations) 60000

P123 (Parallel Solver)

Time (s)

50000

40000

30000

20000

10000

0 0

5

10

15

20

25

30

Processors

(a) Time

800000

Max. Memory (kB) per processor

700000

600000

500000

P75 (Parallel Realisations) P123 (Parallel Solver)

400000

300000

200000

100000

0 0

5

10

15

20

25

30

Processors

(b) Memory

Figure 2.33: Graphs comparing the performance results of P75 Parallel and P123, for Test 4.

98

CHAPTER 2. PARALLEL RFEM STRATEGIES

35

30

Speedup

25

20

15

10

P75 (Parallel Realisations) P123 (Parallel Solver)

5

Linear 0 0

5

10

15

20

25

30

Processors

(c) Speedup

1.4

1.2

Efficiency

1

0.8

0.6

0.4

P75 (Parallel Realisations) 0.2

P123 (Parallel Solver)

0 0

5

10

15

20

25

30

Processors

(d) Efficiency

Figure 2.33: cont....Graphs comparing the performance results of P75 Parallel and P123, for Test 4.

2.11. PERFORMANCE

99

increasing numbers of processors, mainly due to the greater amount of communication required. With the Hybrid code one would expect the efficiency of the code to reduce as the amount of processors allocated to the solver increases, whilst the memory consumption per processor reduces. With these two contrasting expectations, a balance must be struck between viable run times, memory constraints and efficiency. 2.11.2.4

Hybrid Technique

The hybrid method, described previously, is where the processing power is split between the parallelisation of the realisations and the parallelisation of the solver by splitting the analysed domain. To do this, the parallel solver of the P123 version of the code, implemented in the previous sections, was placed within the distributed realisation framework. Utilizing the MPI Comm split function allowed the controlling of the distribution of the processing power. Note, however, that the allocation of processing power to the parallel solver must be a factor of the total number of processors, with the corresponding factor being the number of groups of processors over which the realisations are distributed. This strategy was implemented and run over the four tests prescribed. Each test was executed over 32 processors, with the allocation between the solver and realisations being summarized as follows: 1 2 4 8 16 32

Solver Solver Solver Solver Solver Solver

× × × × × ×

32 16 8 4 2 1

Realisations, Realisations, Realisations, Realisations, Realisations, Realisations.

Table 2.13 and Figure 2.34 show the results of these analyses for the two larger tests, Test 3 and Test 4. Initially the results seemed surprising, as the new code performed significantly better, with respect to time, than the P75 (Parallel Realisations) code, as seen in Table 2.13. The two tests performed better when the realisations were distributed over all 32 processors, than when performed using P75 Parallel (Parallel Realisations), which are distributed identically with respect to solver and realisations. (See Table 2.14.) The Author feels that this discrepancy is due to the P123 code, on which the Hybrid code is based. This code has been coded, developed and optimized for some years and the code is more

100

CHAPTER 2. PARALLEL RFEM STRATEGIES

2500

2000

Time (s)

1500

Test 3 Test 4 1000

500

0 0

5

10

15

20

25

30

Solver Processors per Realisation

(a) Time

800000

Max. Memory (kB) per processor

700000

600000

Test 3

500000

Test 4 400000

300000

200000

100000

0 0

5

10

15

20

25

30

Solver Processors per Realisation

(b) Memory

Figure 2.34: Graphs comparing the performance results using the hybrid strategy, executed over increasing numbers of processors.

2.11. PERFORMANCE

101

45

40

35

Speedup

30

25

20

15

10

Test 3 5

Test 4

0 0

5

10

15

20

25

30

25

30

Solver Processors per Realisation

(c) Speedup

1.4

1.2

Efficiency

1

0.8

0.6

0.4

Test 3 0.2

Test 4

0 0

5

10

15

20

Solver Processors per Realisation

(d) Efficiency

Figure 2.34: cont....Graphs comparing the performance results using the hybrid strategy, executed over increasing numbers of processors.

102

Solver CPUs 1 2 4 8 16 32

CHAPTER 2. PARALLEL RFEM STRATEGIES Test × Realisation CPUs × 32 × 16 × 8 × 4 × 2 × 1

3 4 Time(s) Mem(K) Time(s) Mem(K) 107.186 204928 1298.174 702976 122.369 170816 1358.661 427392 159.420 153856 1443.312 289024 192.360 144640 1494.289 218048 345.177 139520 1638.395 182080 588.060 137600 1920.840 164864

Table 2.13: A comparison of performance results using the hybrid strategy, executed over increasing numbers of processors, for 320 realisations.

Test 3 4

Time (Seconds) Hybrid P75 Parallel 107.186 165.138 1298.174 2104.761

Table 2.14: Comparison of test results between the Hybrid and P75 (Parallel Realisations) Implementations.

optimized for compilation using parallel MPI compilers. P75 (Parallel Realisations) is a more recently developed code and with further work would, in theory, out-perform this code. However, for the purposes of this research the Author is happy to continue with the Hybrid code. Further analysis of the results supports the above hypothesis. Figure 2.34(a), shows that increasing the processing power allocated to the solver, and thus decomposing the domain further, leads to slower execution times as the communication required increases; this is further illustrated in Figures 2.34(c) and (d), which show falls in both the speed up and efficiency respectively as the domain decomposition is increased. This is not to say that for different applications, where domain sizes are different, that execution times will slow; rather that their efficiency and speedup will show poorer performance. Figure 2.34(b) also shows that increasing domain decomposition lowers the memory consumption on each processor, again supporting the previous hypothesis.

2.12. STRATEGY CONCLUSIONS

2.12

103

Strategy Conclusions

In general, the results show that the optimal strategy to take forward into a stochastic framework is the hybrid method. This provides the flexibility of distributing the domain over processors, so as alleviate memory limitations, while also providing a means to distribute the realisations in order to provide a more efficient and faster execution. It was clear from the time and memory analysis, that, the greater the domain decomposition and therefore less memory consumed per processor, the poorer the efficiency. The optimum ratio of the parallelisation is application and resource dependent, but as a good guideline the domain should be distributed so as to as fully fill the distributed memory resources, and then the remaining processors used to replicate this initial allocation where possible, providing an efficient execution of the code. Although the results showed an increase in execution time with increasing numbers of processors, it should be noted that in some applications the execution times will reduce, even though the efficiency and speed ups may decrease. Therefore, it still remains the case that the ideal ratio should be determined with consideration of viable run times, memory and efficiency (resource costs.) From the research and analysis so far presented, the Author realised that the use of a random field within the stochastic framework provided a further limitation on the memory. The full field would be replicated on all processors with the production of these random fields being memory intensive, further increasing the memory limitation placed upon stochastic FE analyses. The following chapters discuss the method used for serial random field generation, that is the theory and implementation, leading to the parallelisation of the method to fit in within the stochastic framework discussed.

104

CHAPTER 2. PARALLEL RFEM STRATEGIES

Chapter 3 Random Field Generation 3.1

Random Field Generation

The Author inherited a modular 3D Local Average Subdivision (LAS) random field generator, written in Fortran 90, which can be fully integrated into the current FE Fortran codes, in both serial and parallel. This module was developed by the Author’s predecessors. Originally, within the research group, a 2D standalone Fortran code was developed from the LAS Method, proposed by Fenton and Vanmarcke (1990). This code was later extended to 3D and modulated into a ‘blackbox’ style Fortran library by Spencer (2007). It was this code that the Author inherited. The code produced, serially, a randomly generated cube domain that met the required statistical parameters. This chapter borrows heavily from the knowledge of the Author’s predecessors; in particular Spencer (2007). It is included to show the mathematical background of the method, for the purpose of illustrating that it can be implemented in a parallel environment with no adverse effects.

3.1.1

Maths in 1D

To review the LAS procedure it is best to start with the 1D case and work up to the multi-dimensional cases. The following sets out the mathematical theory behind the procedure; in particular the theory of Local Averages, as presented by Vanmarcke (1983), followed by the practical implementation of the theory to produce a 1D random field (Fenton and Vanmarcke, 1990). The chapter then looks at two and three dimensional implementations. 105

106 3.1.1.1

CHAPTER 3. RANDOM FIELD GENERATION Mean and Variance

Considering a property X(t), the variability between two points, t and t0 , can be defined by the covariance and correlation functions, β(t, t0 ) and ρ(t, t0 ), respectively. That is, βX (t, t0 ) = Cov[X(t), X(t0 )] = E[X(t)X(t0 )] − µ(t)µ(t0 )

(3.1)

β(t, t0 ) σ(t)σ(t0 )

(3.2)

ρX (t, t0 ) =

where the X subscript relates to the property X, and where µ and σ are the mean and standard deviation of X respectively. The variability of X with respect to t is depicted in Figure 3.1(a). The mean and variance of X(t) over the distance T = t2 − t1 are given by: 1 µ = E[X(t)] = t2 − t1

Zt2 X(t) dt

(3.3)

[(X(t) − µ)2 ] dt

(3.4)

t1

1 σ 2 = E[(X(t) − µ)2 ] = t2 − t1

Zt2 t1

Equation 3.5 gives the relationship for the moving average, XT (t), by considering the average of the continuous stationary random process, X(t), over a finite domain of length T centred on a point t. t+ T2

1 XT (t) = T

Z

X(u) du

(3.5)

t− T2

In this equation, the integral on the right-hand side is known as the local integral process IT , whilst u is a point within the range of the process X, giving:   1 XT (t) = IT (t) T

(3.6)

Figure 3.1(b) illustrates the process XT (t), which is smoother than the original process X(t). It may be rearranged to give IT (t) (Figure 3.1(c)) which is given

3.1. RANDOM FIELD GENERATION

107

(a) spatially varying quantity X(t)

(b) spatially varying quantity XT (t)

(c) local integral process

Figure 3.1: Sample functions of a local average process (Vanmarcke, 1983).

108

CHAPTER 3. RANDOM FIELD GENERATION

by: IT (t) = T XT (t)

(3.7)

The mean of XT (t), µT , is the same as µ; however the variance, σT2 , is lower than that of the original σ 2 , i.e. σT2 < σ 2 . Considering the expectation of the functions over the window T , it can be shown that (Samy, 2003): µT = µ

σT2

1 = T2 σ2 = T2

ZT ZT 0

0

ZT

ZT

0

0 2

= γ(T )σ

(3.8)

β(u, u0 ) du du0

ρ(u, u0 ) du du0 (3.9)

In Equation 3.9, γ(T ) is the variance function (Vanmarcke, 1977). 3.1.1.2

Variance Function

γ(T ) in Equation 3.9 is a measure of the reduction in the point variance, σ 2 , resulting from averaging over the length T and is given by: 2 γ(T ) = T

ZT

2 ρ(τ ) dτ − 2 T

ZT τ ρ(τ ) dτ ,

(3.10)

0

0

where ρ(τ ) is the correlation function and τ is the lag distance between points at either end of the integrating window. For σT2 6 σ 2 , the range of the variance function is 0 6 γ 6 1. When considering the limits of T the following statements can be made: As T → 0 XT (t) → X(t) σT2 → σ 2 ∴ γ(T → 0) → 1 As T → ∞ XT (t) → µ

σT2 → 0

∴ γ(T → ∞) → 0

It can also be shown that γ(T ) is symmetrical about T = 0.

(3.11)

3.1. RANDOM FIELD GENERATION 3.1.1.3

109

Scale of Fluctuation

The scale of fluctuation can be considered as the distance beyond which the field correlation becomes negligible. Hence it is a function of the distance between peak values, both low or high. This attribute allows the scale of fluctuation to be estimated using the spatial variability of field data. The correlation function in most cases is unknown; hence it is necessary to define a scale of fluctuation, which is constant and independent of the window size. Equation 3.10 may be reduced to: Z∞ T γ(T ) ≈ 2

2 ρ(τ ) dτ = 2 σ

0

Z∞ β(τ ) dτ ,

(3.12)

0

as T becomes very large. On the left-hand side of Equation 3.12, T γ(T ) is known as the scale of fluctuation and is denoted by θ (Vanmarcke, 1977, 1983). In order to make the scale of fluctuation valid, as T → 0, γ(T ) = 0 is a necessary condition; an approximation to the variance function is found when this condition is applied to Equation 3.12. This approximate variance function can be represented by a conditioned function, such as θ T T 6θ , γ→1 T >θ , γ=

(3.13)

with the limits 0 6 γ 6 1 satisfied. Figure 3.2 illustrates the approximate variance function and compares it to the exact function. There is an implication that, to best represent the variance of X by XT , it is necessary for γ ≈ 1 and thus T < θ; otherwise the variance will be poorly represented for the case when the averaging window, T , is greater than θ.

3.1.1.4

Covariance Function

By considering the local integrals in Equations 3.5 and 3.7, the covariance between two windows over X(t) can be obtained. IT (t) differs from XT (t) by a factor of

110

CHAPTER 3. RANDOM FIELD GENERATION

Figure 3.2: Variance Functions.

T and so the variance is given by: Var[IT ] = T 2 σT2 = T 2 σ 2 γ(T ) = ∆(T )σ 2

(3.14)

where ∆(T ) = T 2 γ(T ). Applying the limits from Equation 3.13, the range of the variance function of the local integrals becomes 0 < ∆(T ) < T θ. Figure 3.3 illustrates a function X(t), with two integral windows, IT and IT0 . With reference to this figure: IT + IT0 = IT2 − IT 0

(3.15)

IT0 = IT3 − IT 0

(3.16)

IT2 = IT 0 + IT1

(3.17)

3.1. RANDOM FIELD GENERATION

111

Figure 3.3: Local integrals of the function X (t).

and therefore, 2IT 0 IT = IT20 − IT21 + IT22 − IT23 =

3 X

(−1)k IT2k

(3.18)

k=0

Assuming the process X (t) has a Gaussian distribution, with µ = 0 and σ = 1, Equation 3.18 gives a covariance of: Cov[IT , IT 0 ] =

3 σ2 X (−1)k ∆Tk 2 k=0

(3.19)

If the sample windows are equal in length, as is the case when the parameter space t is equally divided, T = T 0 . If the lag between the two windows is denoted by τ , it then follows that T1 = T3 = τ , T0 = τ − T and T2 = T + τ . From Equation 3.19 it follows that: Cov[IT , IT 0 ] = ∴ β(τ ) =

σ2 [(T − τ )2 γ(T − τ ) − 2(τ )2 γ(τ ) + (T + τ )γ(T + τ )] 2 σ2 [(T − τ )2 γ(T − τ ) − 2(τ )2 γ(τ ) + (T + τ )γ(T + τ )] (3.20) 2

112

CHAPTER 3. RANDOM FIELD GENERATION

Figure 3.4: An illustration depicting the separation of a domain into equally spaced local average cells. Applying the identity ∆(T ) = T 2 γ(T ), the covariance of similar windows separated by a distance τ is obtained as: σ2 [∆(T − τ ) − 2∆τ + ∆(T + τ )] β(τ ) = 2 = E[XT (t), XT 0 (t)]

(3.21)

Figure 3.4 depicts an equally divided domain, X(t), with cells, Zk , of width D, where k is the cell number. Each cell has a local integral equal to the local average. Therefore the covariance of two cells separated by m cells is: i E[Zki Zk+m ]=

σ2 [(m − 1)2 γ((m − 1)Di ) − 2m2 γ(mDi ) + (m + 1)2 γ((m + 1)Di )] (3.22) 2

3.2. 1D IMPLEMENTATION

3.2

113

1D Implementation

The implementation of the LAS random field generation method was developed using the ideas of Vanmarcke (1983) and Fenton and Vanmarcke (1990), and it has also been discussed in the theses of previous colleagues, including Spencer (2007). The method produces random fields that are internally consistent and maintains a constant mean over all levels of subdivision. This means that previous levels of subdivision can be reproduced from the final field. It can also be used to produce fields with local areas of refinement, so it can be used in conjunction with adaptive meshing. The 1D algorithm discussed here uses an Ornstein-Uhlenbeck process, an exponential correlation function, and a fractional Gaussian noise, as proposed by Mandelbrot and Ness (1968). An Ornstein-Uhlenbeck process is a mean reverting process, that fluctuates stochastically, but reverts to its mean over a set dimension or period. The LAS method is a recursive algorithm, which repeatedly subdivides an initial domain, in a top-down manner, until the random process X (t) is represented by a sequence of local averages, as illustrated in Figure 3.5. The initial

Figure 3.5: Progression of 1D LAS Field Generation. domain, D, is assigned a global average, taken as a random value from a Gaussian distribution with variance: 2 = σ 2 γ(D) (3.23) σD This initial stage is referred to as Stage 0 and the global average is maintained throughout the subdivision process. At each level of subdivision, a “parent” cell from the previous stage is subdivided in a top-down manner, to produce the new level. In Figure 3.5 Z denotes the local averages, whilst the subscripts and superscripts refer to cell and stage numbers respectively.

114

CHAPTER 3. RANDOM FIELD GENERATION

After the initial global average Z10 is generated, the algorithm proceeds in the following manner: 1. Subdivide the parent cell Zji−1 into two equal parts. i , so as to ensure 2. Assign a weighted random value to the local average, Z2j the required variance and correlation structure. i i = 2Zji−1 − Z2j 3. To preserve global averaging, calculate Zj−1

4. Repeat steps 1 → 3 until the desired level. The final field has a mean equal to the initial global average and a variance tending towards that of the target, σ 2 . To generate the required variance and correlation structure, the neighbouring cells of the “parent” cell must be considered.

3.2.1

Initialising the LAS process

The global mean at stage 0, Z10 , is given by: Z10

1 = D

ZD Z(t)dt

(3.24)

0

The target mean and variance follow a standard normal distribution and, from local averaging theory, the expectation of this mean is zero, i.e. E [Z10 ] = 0, whilst the variance is given by: h 2 i E Z10 = σ 2 γ(D) (3.25) as derived from Equation 3.9. The adopted 1D covariance function is the zero mean Ornstein-Uhlenbeck process:   2 |τ | 2 (3.26) β(τ ) = σ exp − θ Therefore, on substituting Equation 3.26 into Equation 3.12, γ (D) is given by     θ2 2 |D| −2 |D| γ(D) = + exp −1 , 2D2 θ θ

(3.27)

giving an initial global mean of: Z10 = U γ(D)

(3.28)

3.2. 1D IMPLEMENTATION

115

Figure 3.6: General cell arrangement and annotation in 1D.

where U is a white Gaussian noise term, with zero mean and variance 1. The white Gaussian noise term is simply a random number taken from a Gaussian distribution (see Chapter 5). It is this term that incorporates the random component into the generated fields.

3.2.2

Mathematical Implementation

Figure 3.6 considers an arbitrary cell at stage i and position j, with local average, Zji . The subdivision of this cell relies upon its neighbouring cells for the field to converge towards the required covariance structure. For a monotonic covariance structure, Fenton (1990) determined that a neighbourhood size of 3 was sufficient and this was later confirmed by Samy (2003), suggesting that no significant i+1 improvement is gained by increasing the size. Therefore, Z2j is dependent on i i , and a random variable. This relation is given by: Zj−1 , Zji , Zj+1 i+1 i+1 i+1 Z2j = M2j + ci+1 U2j

(3.29)

where M2i+1 is the estimated mean for the new subdivided cell, which is based j upon the parent cells within the neighbourhood. ci+1 is the standard deviation of the white noise term, which produces variation at the correct scale of fluctuation, being very small for θ > Di , and tends to a maximum at θ ≈ Di (Samy, 2003). Uji+1 is the white Gaussian noise term generated from a standard normal distribution. The estimated mean is given by: i+1 i i M2j = ai−1 Zj−1 + ai0 Zji + ai1 Zj+1

(3.30)

116

CHAPTER 3. RANDOM FIELD GENERATION

where the a values are weighting coefficients, with the stage number and applied direction with respect to the parent cell indicated by the superscripts and subscripts respectively. Therefore, with knowledge of the previous stage, i, the mean i+1 of the new stage, Z2j , will be strongly correlated with its immediate parent cell, i i i due to their relative proximand Zj+1 Zj , but more weakly correlated with Zj−1 ity. Both c and a are unknown and require deriving. Substituting Equation 3.30 into 3.29 gives:  i+1 i i Z2j = ai−1 Zj−1 + ai0 Zji + ai1 Zj+1 + ci+1 Uji+1

(3.31)

i Multiplying through by Zm , which is any cell value at stage i, and taking the expectations gives:

X  i+1 i  k=j+1    i i+1  i E Z2j Zm = aik−j E Zki Zm + ci+1 E Zm Um

(3.32)

k=j−1 i+1 i ] = 0, so that: Um The white Gaussian noise term has the expectation, E [Zm k=j+1

E



i+1 i Z2j Zm



=

X

  i aik−j E Zki Zm

(3.33)

k=j−1 i Now substituting the three stage i cells of interest, instead of Zm :

 h i i+1 i  E Z Z  j−1 2j   h i i+1 i E Z2j Zj  h i    E Z i+1 Z i j+1 2j

    

 h i i h i h i Zi iZi i Zi  E Z E Z E Z  j+1 j−1 j j−1 j−1 j−1   h i h i h i i Zi i Zi E Zj−1 E Zji Zji E Zj+1 = j j   i i h i h   h i  iZi i Zi    E Z Zi E Z E Z j+1 j+1 j−1 j+1 j j+1

    i     a−1   i a0     i     a1 (3.34)

The square matrix to the right of the equation is symmetric and Toeplitz, in that the elements along each diagonal are equal and can be evaluated using Equation 3.22. The left-hand side is the expectation between cells at different stages, and represents the currently undetermined cross-stage covariance. Since upward averaging is preserved between stages,   i+1  i  2j Zj−1  E Z  i+1 i  E Z2j Zj    i+1 i  E Z2j Zj+1

  i+1   i+1 i  i  E Z2j Z2j−3 + E Z2j Z   i+1 i   i+1 2j−2  1 i = E Z2j Z2j−1 + E Z2j Z2j   i+1 i    i+1 i   2 E Z2j Z2j+1 + E Z2j Z2j+2   

which can be solved using Equation 3.22.

    

,

(3.35)

3.2. 1D IMPLEMENTATION

117

The weighting coefficients, a, in Equation 3.34 are independent of the cell values and only require calculation once for each stage. However they are dependent on the cell sizes at stages i and i + 1 and the variance structure over the domain. To calculate the only remaining unknown, ci+1 , Equation 3.29 is squared and the expectations taken, giving: E

h

i+1 Z2j

2 i

=E

h

i+1 M2j

2 i

+ ci+1

2

(3.36)

ci+1 is the standard deviation of the white Gaussian noise term and can be taken outside the expectation operator, as it is considered constant for stage i. Squaring   the E Uji+1 term gives the variance of the Gaussian distribution, i.e. E

h

2 Uji+1

i

  = Var Uji+1 = 1

(3.37)

Taking expectations of the square of Equation 3.30 gives: E

h

 i+1 2 M2j

i

=

j+1 X

 i+1 i  aik−j E Z2j Zk

(3.38)

k=j−1

Substituting Equation 3.36 into 3.38, and evaluating the cross-stage terms as before gives:   i+1  i  E Z2j Zj−1  h i 2    i+1 i  i+1 2 ci+1 = E Z2j − ai−1 ai0 ai1 E Z2j Zj    i+1 i  E Z2j Zj+1

  

(3.39)

 

Again, as with a, the c coefficient only needs to be calculated once at each stage, as it is independent of the local average values. In practice, the coefficients a and c for all stages are calculated and stored prior to subdivision. i+1 Therefore the local average Z2j can be calculated using Equations 3.29, 3.34 i+1 and 3.39, whilst the average for cell Z2j−1 is determined from upwards averaging, that is: i+1 i+1 Z2j−1 = 2Zji − Z2j (3.40)

Equations 3.29 and 3.40 are used recursively until the desired stage, once the initial mean and coefficients a and c have been determined.

118 3.2.2.1

CHAPTER 3. RANDOM FIELD GENERATION Boundary Conditions

As discussed, the LAS process depends on a neighbourhood of parent cell values. If the neighbourhood is greater than 1, there are parent cells that lie adjacent to the subdividing parent and, in some cases, these may lie outside the subdividing domain. Therefore it necessary to define boundary conditions for these parents values. Fenton and Vanmarcke (1990) assumed that the area beyond the domain boundary was uncorrelated with the area inside the domain; they acknowledged that this may cause an error, but considered it to be insignificant.The generation of the a coefficients, as per Equation 3.34, is affected by neglecting the parent cells outside the boundary. However, values of a for boundary cells can be obtained by discounting the parent values outside the domain. With reference to the i+1 in Figure 3.6, a may be calculated as follows: generation of Z2j−1 1. At stage 1, two neighbours are missing and therefore only one parent cell, Zji , is used:   i+1 i    i i   i E Z2j Zj = E Zj Zj a0 (3.41) i 2. If one neighbouring parent cell, Zj−1 , to the left of the subdividing cell is missing and outside the domain:

(

   i  )( i )  i+1 i  ) ( a0 E Zji Zji E Zj+1 Zji E Z2j Zj   i  i i   i+1 i  = i ai1 E Zj Zj+1 E Zj+1 Zj+1 E Z2j Zj+1

(3.42)

i 3. If one neighbouring parent cell, Zj+1 , to the right of the subdividing cell is missing and outside the domain:

(

 i+1 i  ) (  i    )( i ) i i Zj−1 E Zj−1 Zj−1 E Zji Zj−1 E Z2j a−1  i+1 i   i    = E Z2j Zj E Zj−1 Zji E Zji Zji ai0

(3.43)

Therefore, Equation 3.34 may be written in the generalised form: i+1 Z2j = ci+1 Uji+1 +

j+q X k=j−p

where p = min (n, j − 1) and q = min (n, 2i − j).

aik−j Zki

(3.44)

3.3. 2D MATHEMATICS

3.3

119

2D Mathematics

Local averaging theory can be applied to any number of dimensions, as an extension of the 1D case (Vanmarcke, 1983). In the 2D case a square domain is subdivided into four equal areas, using three random numbers to produce the cell values for three quadrants, whilst the fourth is calculated to maintain the averaging. This section summarises 2D LAS.

Figure 3.7: Local averaging in 2D over a rectangle (Samy, 2003).

Local averaging of a 2D random process occurs over a rectangular area, A = T1 T2 , as illustrated in Figure 3.7. The local integrals are: t1 +

Z

T1 2

t2 +

Z

T2 2

IA (t1 , t2 ) ≡ IT1 T2 (t1 , t2 ) =

X (t1 , t2 ) dt1 dt2 t1 −

T1 2

t2 −

(3.45)

T2 2

Dividing by the area, A, gives the local average XA (t1 , t2 ): XA (t1 , t2 ) ≡ XT1 T2 (t1 , t2 ) = A1 IA (t1 , t2 ) t1 +

=

1 A

R t1 −

T1 2

T1 2

t2 +

R t2 −

(3.46)

T2 2

T2 2

X (t1 , t2 ) dt1 dt2

120

CHAPTER 3. RANDOM FIELD GENERATION

where A has sides of length T1 and T2 with respect to the t1 and t2 axes, respectively. The ratio of the variance of XA to the variance of X is given by: Var [XA ] ≡ σA2 ≡ σT21 T2 = σ 2 γ (T1 , T2 )

(3.47)

where γ (T1 , T2 ) is the variance function. This and the correlation function ρ (τ1 , τ2 ) are related via the relationship (Vanmarcke, 1983): 1 γ (T1 , T2 ) = T1 T2

Z+T1 Z+T2

|τ1 | 1− T1

  |τ2 | 1− ρ (τ1 , τ2 ) dτ1 dτ2 T2

(3.48)

−T1 −T2

where τ1 and τ2 are the lag distances in the two directions. The variance function, ∆ (T1 , T2 ), of the local integral process is given by: Var [XA ] ≡ Var [XT1 T2 ] = σ 2 ∆ (T1 , T2 )

(3.49)

This function and Equation 3.47 differ by a factor of A2 = (T1 T2 ) and, therefore, ∆ (T1 , T2 ) = (T1 T2 ) γ (T1 , T2 ) = A2 γ (T1 , T2 )

(3.50)

By considering Equation 3.48 for very large T1 and T2 , the variance function tends towards: α α (3.51) = γ (T1 , T2 ) = T1 T2 A where the constant α in the asymptotic expression is the ’characteristic area’, which is also equal to the integral of the correlation function: Z+∞ Z+∞ α= ρ (τ1 , τ2 ) dτ1 dτ2

(3.52)

−∞ −∞

3.3.1

Variance Function

Considering the axes t1 and t2 such that X (t1 , t2 ) is rendered quadrant symmetric, the correlation measure can then be given as: Z∞ Z∞ α=4

ρ (τ1 , τ2 ) dτ1 dτ2 0

0

(3.53)

3.3. 2D MATHEMATICS

121

leading to the variance function: 4 γ (T1 , T2 ) = (T1 T2 )2

ZT2 ZT1 (|T1 | − |τ1 |) (|T2 | − |τ2 |) ρ (τ1 , τ2 ) dτ1 dτ2 0

(3.54)

0

The correlation structure in an isotropic random field spreads radially, and thus can be expressed as: ρR (τ ) = ρ (τ, 0) = ρ (0, τ ) = ρ (τ1 , τ2 ) where τ is the radial lag, τ = radial function.

(3.55)

p τ12 + τ22 , with the superscript R representing the

There are two approaches to averaging X (t1 , t2 ) over the area A = T1 T2 . That is, it can be integrated over a distance T1 along the t1 axis, followed by integration along the t2 axis over T2 , or vice versa. Integrating X (t1 , t2 ) along the t1 axis, over T1 , produces the 1D function XT1 (t2 ), that is t1 +

XT1 (t2 ) = XT1 (t2 , t1 ) =

Z

1 T1

T1 2

X (t1 , t2 ) dt1 , t1 −

(3.56)

T1 2

the variance of which is: Var [XT1 ] = σT21 = σ 2 γ (T1 )

(3.57)

where γ (T1 ) is the 1D variance function defined in Section 3.1.1.2. Integrating over T2 along the t2 axis gives: t2 +

XA (t1 , t2 ) =

1 T2

Z

T2 2

XT1 (t2 ) dt2 t2 −

(3.58)

T2 2

The variance is reduced further with respect to the original point variance giving: σA2 = σ 2 γ (T1 ) γ (T2 |T1 )

(3.59)

γ (T2 |T1 ) is the variance function of XT1 (t2 ), which is the conditional variance function of X (t1 , t2 ) that has previously been averaged over T1 . The variance

122

CHAPTER 3. RANDOM FIELD GENERATION

function of X (t1 , t2 ) can be expressed as a product of the ’marginal’ and ’conditional’ variance functions, and is derived by combining Equations 3.50 and 3.59: γ (T1 , T2 ) = γ (T1 ) γ (T2 |T1 )

(3.60)

It therefore follows that, if the order of integration is reversed: γ (T1 , T2 ) = γ (T2 ) γ (T1 |T2 ) = γ (T1 ) γ (T2 |T1 )

(3.61)

Similarly, for the variance functions defined in terms of local integrals: ∆ (T1 , T2 ) = ∆ (T2 ) ∆ (T1 |T2 ) = ∆ (T1 ) ∆ (T2 |T1 )

(3.62)

If the correlation structure of X (t1 , t2 ) can be separated along each dimension, then the variance is independent along each axis, thus reducing the conditional variance function to the marginal variance function, γ (T1 |T2 ) = γ (T1 )

(3.63)

γ (T1 , T2 ) = γ (T1 ) , γ (T2 )

(3.64)

thereby giving,

3.3.2

Scale of fluctuation

The scale of fluctuation is also changed from the 1D case, due to the variance function altering to a conditional version. Considering the 1D random process (2) XT1 (t2 ), the conditional scale of fluctuation is denoted by θT1 , where the axis to which the scale of fluctuation refers and the length over which the function has been integrated is denoted by the superscript and subscript, respectively. This is characteristic of the conditional covariance function γ (T2 |T1 ): (2)

θ γ (T2 |T1 ) = T1 T2

(3.65)

If there is no averaging taking place along the t1 axis, i.e. T1 = 0, the conditional scale of fluctuation is equal to the directional scale of fluctuation along the t2 (2) (2) (2) axis, and thus θ0 ≡ θ(2) . In contrast, if T1 → ∞, θ1 converges towards θ∞ , an asymptotic limit. The variance functions can be replaced by this limit as T1 and

3.3. 2D MATHEMATICS

123

T2 become very large. If Equations 3.51 and 3.65 are substituted into Equation 3.60, this gives, (2) θ(1) θT1 α = (3.66) T1 T2 T1 T2 As T1 → ∞, the asymptotic limit is given by (2) θ∞ =

α = cα θ(2) θ(1)

(3.67)

and similarly, as T2 → ∞

α = cα θ(1) (2) θ where the dimensionless factor cα is defined as: (1) θ∞ =

cα =

(3.68)

α

(3.69)

θ(1) θ(2) (2)

Considering Equation 3.62 for T2 → ∞, the value of θT1 can be evaluated by considering (2) θT1 θ(2) γ (T1 ) = γ (T1 |T2 = ∞) (3.70) T2 T2 which leads to (2)

θT1 =

γ (T1 |T2 = ∞) (1) θ γ (T1 )

(3.71)

Using Equation 3.67, γ (T1 |T2 = ∞) can be obtained, so that (2)

θT1 =

cα θ(1) (1) θ T1 γ (T1 )

(3.72)

This relationship depends only on the correlation of the structure and the length over which it is averaged along the t1 axis.

3.3.2.1

Covariance Function

The 2D variance function can be used to calculate the covariance between two local averages of X (t1 , t2 ). (See Figure 3.8.) If the areas A = T1 T2 and A0 = T10 T20 are considered, then, in a similar manner to the 1D covariance, this gives the relationship between the local averages and local integrals as XA ≡ IAA and

124

CHAPTER 3. RANDOM FIELD GENERATION

Figure 3.8: Illustration showing the areas under consideration in the covariance analysis.

XA 0 ≡

I A0 , A0

whilst the covariances are related by, Cov [XA , XA0 ] =

1 Cov [IA , IA0 ] AA0

(3.73)

The covariance of the local integrals, IA and IA0 , can be expressed in terms of the 2D variance function, ∆ (T1 , T2 ) = σ 2 γ (T1 , T2 ), by direct extension of the covariance in 1D (Equation 3.19), that is: Cov [IA , IA0 ] =

3 3 σ2 X X (−1)k (−1)l ∆ (T1k T2l ) 4 k=0 k=0

(3.74)

This further simplifies if the two areas differ in size and location along one axis, in this case the t2 axis, although the same applies if the variation is only along the t1 axis: 3 σ 2 ∆ (T2 ) X Cov [IA , IA0 ] = (−1)l ∆ (T1k |T2 ) (3.75) 2 k=0 If T1 = T10 and T2 = T20 and the lag distances are τ1 and τ2 , along the relative axis (see Figure 3.9), then the covariance for this special case can be expressed

3.3. 2D MATHEMATICS

125

Figure 3.9: Illustration showing the special case of similar rectangles.

using a similar formulation as for 1D, that is: βA (τ1 , τ2 ) =

σ2 4T12 T22

[∆ (T1 + τ1 , T2 + τ2 ) + ∆ (T1 − τ1 , T2 + τ2 )

+∆ (T1 + τ1 , T2 + τ2 ) + ∆ (T1 − τ1 , T2 − τ2 ) +∆ (T1 + τ1 , T2 + τ2 ) + ∆ (T1 − τ1 , T2 − τ2 ) −2∆ (T1 − τ1 , τ2 ) + 4∆ (τ1 , τ2 )]

(3.76)

The covariance function implemented here is governed by a Markov process with an exponential covariance function, although it should be noted that this function can take various forms (Vanmarcke, 1983). The correlation function is given by:  s   2  2   2τ1 2τ2 ρ (τ1 , τ2 ) = exp − +   θ1 θ2

(3.77)

and the covariance function is:  s   2  2   2τ1 2τ2 β (τ1 , τ2 ) = σ 2 exp − +   θ1 θ2

(3.78)

126

CHAPTER 3. RANDOM FIELD GENERATION

Thus the variance function is: γ (T1 , T2 ) =

1 [γ (T1 ) γ (T2 |T1 ) + γ (T2 ) γ (T1 |T2 )] 2

(3.79)

where, "



γ (Ti ) = 1 + " γ (Ti |Tj ) = 1 + and

Ti θi



 32 # −2 3

Ti θji

(3.80)  32 # −2 3

" θji

= θi

(  2 )# Tj cα + (1 − cα ) exp − cα θj

(3.81)

(3.82)

which Vanmarcke (1983) gives as the integration of the conditional variance function: Z+∞ (1) θT1 = ρ (τ1 |T2 ) dτ1 (3.83) −∞

Fenton (1990) suggests that cα = function, in Equation 3.82.

π 2

should be used for the exponential covariance

3.4. 2D IMPLEMENTATION

3.4

127

2D Implementation

The implementation of 2D LAS is a direct extension of the 1D method summarised in Section 3.2. Figure 3.10 illustrates the progress of the 2D subdivision process, from Level 0 through to Level 2. It involves the recursive subdivision of 0 . an initial domain with a local average, Z1,1

Figure 3.10: 2D LAS Process.

3.4.1

Initial Mean Generation

For the process Z (tx , ty ), the average can be expressed as:

0 Z1,1

1 = Dx Dy

ZDyZDx Z (tx , ty ) dxdy 0

(3.84)

0

0 where Z1,1 is a random variable. Using local averaging theory, the expectations relating to the global average can be obtained; hence, the mean is given by

  E Z10 = E [Z] = 0

(3.85)

128

CHAPTER 3. RANDOM FIELD GENERATION

and the variance is given by E

h

Z10

2 i

= σ 2 γ Dx0 , Dy0



(3.86)

 where σ 2 is the target variance, defined as 1, and γ Dx0 , Dy0 is the variance function of the initial domain obtained from Equation 3.79.

3.4.2

Subdivision

As with the 1D process, the subdivision in 2D is illustrated for a generic cell i with the value, Zj,k , where j, k refer to the position of the cell and i is the stage number. (See Figure 3.11.) The following four equations give the relationships for the four cells generated as a cell subdivides to form Stage i + 1: Z1i+1

=

i+1 Z2j,2k

=

i+1 c11 U1jk

+

nxy X

i ail1 Zm(l),n(l)

(3.87a)

l=1

Z2i+1

=

i+1 Z2j,2k−1

=

i+1 c21 U1jk

+

i+1 c22 U2jk

nxy X

i ail2 Zm(l),n(l)

(3.87b)

l=1

Z3i+1

=

i+1 Z2j−1,2k

=

i+1 c31 U1jk

=

i+1 Z2j−1,2k−1

i 4Zj,k

+

i+1 c32 U2jk

i+1 c33 U3jk

+



i+1 Z2j,2k

i+1 Z2j,2k−1

+

nxy X

i ail3 Zm(l),n(l)

(3.87c)

l=1

Z4i+1

=





i+1 Z2j−1,2k

(3.87d)

where U is the white Gaussian noise term, with a zero mean and unit variance, and c is the coefficient defining the standard deviation of U . m (l) and n (l) are indexing functions traversing, in a fixed pattern, the neighbourhood (nxy ) of i Zj,k . {ailr } are the neighbourhood correlation weighting coefficients. Taking a neighbourhood size of 1, nx = ny = 1, nxy = (2nx + 1) × (2ny + 1) = 9, and i therefore there are 9 neighbouring cells including Zj,k . Figure 3.12 illustrates the traversing pattern of p (l) and q (l), while Figures 3.13 and 3.14 show the subscript numbering for the coefficients {ailr }. In Equation 3.87a, the unknown weighting coefficients, {ailr }, are calculated in a similar approach to that of the i 1D coefficients. Taking these equations and multiplying through by Zm(p),n(p) and

3.4. 2D IMPLEMENTATION

129

(a) Stage i

(b) Stage i + 1

Figure 3.11: Generic local average generation at Stage i + 1.

130

CHAPTER 3. RANDOM FIELD GENERATION

Figure 3.12: 2D traversing pattern of index functions p (l) and q (l).

Figure 3.13: Splitting of a 2D parent cell to form the next level of subdivision.

3.4. 2D IMPLEMENTATION

131

Figure 3.14: Weighting coefficients associated with their respective parent cell.

taking expectations, with m (p) and n (p) being traversing vectors, gives i h h i nP xy i+1 i i i , Zm(p),n(p) ail1 E Zm(l),n(l) E Z2j,2k = Zm(p),n(p) l=1

p = 1, 2, . . . , nxy (3.88a) i h h i nP xy i+1 i i i , Zm(p),n(p) ail2 E Zm(l),n(l) E Z2j,2k−1 Zm(p),n(p) = l=1

p = 1, 2, . . . , nxy (3.88b) h

i

i+1 i Zm(p),n(p) = E Z2j−1,2k

n xy P l=1

h

i i ail3 E Zm(l),n(l) , Zm(p),n(p)

i p = 1, 2, . . . , nxy (3.88c)

The matrices on the right-hand side of these equations, although symmetric, are in general no longer Toeplitz. The left-hand side involves the cross-stage covariance of local averages and these values are calculated considering upwards averaging, that is  1 i+1 i i+1 i+1 i+1 Zm,n = Z2m,2n + Z2m−1,2n + Z2m,2n−1 + Z2m−1,2n−1 (3.89) 4

132

CHAPTER 3. RANDOM FIELD GENERATION

i+1 By multiplying through by Z2j,2k and taking the expectations,

 i+1 i  1 E Z2j,2k Zm,n = 4 {E [

i+1 i+1 i+1 i+1 Z2k,2k Z2m,2n + Z2k,2k Z2m−1,2n  i+1 i+1 i+1 i+1 + Z2k,2k Z2m,2n−1 + Z2k,2k Z2m−1,2n−1

(3.90)

Evaluation of the cross-stage covariance can be achieved in terms of variances at the same stage. By taking Equation 3.76 and considering the case of a uniform grid of equal area cells, the covariance equation may be expressed as  i i  E Zj,k Zj+m,k+n = 14 σ 2



1 P

wp (m + p)2 ×

p=−1  1   P 2  i i wq (n + q) γ (m + p)Dx , (n + q)Dy q=−1

(3.91) where, when l = 0, w1 = −2, and w1 = 1 when l 6= 0, and where and Dyi are the cell dimensions at stage i, and m and n are integer multipliers of the cell dimension indicating the lag distances between cells. The covariance structure is quadrant symmetric if the grid axes are orthogonal. In this case, Vanmarcke (1983) defines γ (.) to be Dxi

 γ (T1 , T2 ) =

1 σT1 T2

2 ZT1 ZT2 (|T1 | − |τ1 |) (|T2 | − |τ2 |) β (τ1 , τ2 ) dτ1 dτ2

(3.92)

−T1 −T2

with Equation 3.78 giving the covariance function, β (τ1 , τ2 ). The right-hand sides of Equation 3.88 are also evaluated using Equation 3.91. The standard deviation of the white noise term, ci+1 , remains unknown. The matrix is taken to be a lower triangle, satisfying ci+1 . ci+1

T

=R

(3.93)

R is given by Rrs = E [Zri+1 Zsi+1 ] −

n xy P l=1

h i i ailr E Zm(l),n(l) Zsi+1

(3.94) r, s = 1, 2, 3

R is symmetrical and can be calculated using the covariance and cross-stage covariance, previously discussed.

3.4. 2D IMPLEMENTATION

133

This concludes the theory and implementation of the 2D LAS process. As with the 1D process, the computational overheads can be reduced as the a and c values can be calculated for each stage and stored, as only one set of coefficients at each stage is required. The boundary conditions for the 2D process, and later for the 3D process, are the subject of various theories including those of the Spencer (2007). Therefore the boundary conditions for these processes will be discussed at length in Section 3.6.3. The 2D implementation has been developed, coded and implemented in Fortran as both standalone executables and blackbox modules by the Author’s predecessors, including Samy (2003) and Spencer (2007), and have been validated against Local Averaging theory and for several model applications; in all cases, the results have been positive.

134

3.5

CHAPTER 3. RANDOM FIELD GENERATION

3D Mathematics

3D Local Averaging theory is an extension of the 2D case (Vanmarcke, 1983). Consider a homogeneous 3D random field, X (t1 , t2 , t3 ), with mean µ = 0 and variance σ 2 = 1, averaged over a cuboidal domain, V = T1 T2 T3 . If the random field is averaged over V , XV (t) has a mean of zero and a variance given by σV2 = σ 2 γ (T1 T2 T3 ) ,

(3.95)

where γ (T1 T2 T3 ) is the 3D variance function for X (t1 , t2 , t3 ). The variance function can be expressed in terms of the random field with local integrals, IV (t1 , t2 , t3 ) = V XV (t1 , t2 , t3 ) over the same domain, giving, ∆ (T1 T2 T3 ) = V 2 γ (T1 T2 T3 )

(3.96)

ρ (τ1 , τ2 , τ3 ) denotes the correlation function of X (t1 , t2 , t3 ) and a characteristic volume is defined by the correlation parameter, ψ, given by Z+∞ Z+∞ Z+∞ ψ= ρ (τ1 , τ2 , τ3 ) dτ1 dτ2 dτ3

(3.97)

−∞ −∞ −∞

ensuring that the variance function tends to the asymptotic form. If T1 T2 T3 becomes very large: ψ (3.98) γ (T1 T2 T3 ) = V

3.5.1

3D Variance

It follows from Equation 3.97 that, if the correlation structure is octant symmetric, then Z+∞ Z+∞ Z+∞ ψ=8 ρ (τ1 , τ2 , τ3 ) dτ1 dτ2 dτ3 (3.99) 0

0

0

For an isotropic random field, ρ (τ1 , τ2 , τ3 ) = ρR (τ ), where τ = the radial lag and R∞ ψ = (2π)2 τ 2 ρR (τ ) dτ 0 2 R

= 2π θ



τ1 + τ2 + τ3 is

(3.100)

3.5. 3D MATHEMATICS

135

The averaging of X (t1 , t2 , t3 ) over the 3D volume, V = T1 T2 T3 , can start along any axis. If it is integrated along the t1 axis over a distance of T1 , it results in XT1 (t2 , t3 ), a 2D function given by t1 +

XT1 (t2 , t3 ) =

Z

1 T1

T1 2

X (t1 , t2 , t3 ) dt1 t1 −

(3.101)

T1 2

with a corresponding variance function Var [XT1 ] = σT21 = σ 2 γ (T1 )

(3.102)

which is the same as Equation 3.57 in the first step of the 2D integration process. Further integration along the t2 axis yields XT1 T2 (t3 ), a 1D function with the conditional variance function of X (t1 , t2 , t3 ) having already been averaged over T1 , γ (T2 |T1 ) such that: Var [XT1 T2 ] = σT21 T2 = σ 2 γ (T1 ) γ (T2 |T1 )

(3.103)

Integrating XT1 T2 (t3 ) in the final direction yields, t3 +

XV (t1 , t2 , t3 ) =

1 T3

Z

T3 2

XT1 T2 (t3 ) dt3 , t3 −

(3.104)

T3 2

This further reduces the variance function, to give σV2 = σ 2 γ (T1 ) γ (T2 |T1 ) γ (T3 |T1 , T2 )

(3.105)

where γ (T3 |T1 , T2 ) is the variance, conditional on both T1 and T2 , leading to the variance function γ (T1 , T2 , T3 ) = γ (T1 ) γ (T2 |T1 ) γ (T3 |T1 , T2 )

(3.106)

By substituting the appropriate indices, alternative arrangements can be derived. Local integrals can also be used to express the conditional variance functions.

136

3.5.2

CHAPTER 3. RANDOM FIELD GENERATION

Scale of Fluctuation

The scale of fluctuation for the 3D process is altered by the conditional variance function. The scale of fluctuation for XT1 (t2 , t3 ) is identical to that of the 2D case, (3) XT1 (t2 ), given in Equation 3.72. The scale of fluctuation, θT1 T2 , depends on the lengths T1 and T2 , and is related to the 1D process XT1 T2 (t3 ). The relationship with the covariance function is: (3)

θ γ (T3 |T1 , T2 ) = T1 T2 T3

(3.107)

If T1 = T2 = 0, so that averaging in these two dimensions does not occur, then (3) θ00 = θ(3) . Equations 3.61 and 3.106 are combined to give: γ (T1 , T2 , T3 ) = γ (T1 , T2 ) γ (T3 |T1 , T2 )

(3.108)

Now, considering the case where all averaging lengths T1 , T2 , T3 → ∞ and considering the local integral equivalents shown in Equations 3.51 and 3.107, gives (3)

ψ (3) θT1 T2 ψ = T1 T2 T3 T1 T2 T3

(3.109)

so that, (3)

θT1 T2 =

ψ (3) = cβ θ(3) (3) ψ

(3.110)

where, ψ

(3)

cβ = and ψ (3)

θ(3) ψ (3)

Z+∞ = ρ (τ3 ) dτ3 .

(3.111)

(3.112)

−∞

Taking the case where the system is reduced to that of a 2D system, i.e. T1 takes an arbitrary value whilst T2 = 0, then, using Equation 3.71 for 2D: (3)

(3)

θT1 T2 = θT1 = θ(3)

γ (T1 |T3 = ∞) γ (T1 )

(3.113)

3.5. 3D MATHEMATICS

137

If T1 takes an arbitrary value and T2 = ∞, whilst T3 = ∞, then, using the variance function relationships given by Equations 3.106 and 3.108: ψ (1) γ (T1 |T2 = ∞, T3 = ∞) θ(2) γ (T1 |T2 → ∞)

(3)

θT1 T2 =

The variance function γ (T1 |T2 = ∞, T3 = ∞) depends upon the scale (1)

cβ θ(1) , and γ (T1 |T2 → ∞) on

α(12) θ(2)

(3.114) ψ ψ (1)

=

.

Now, taking the final case where T1 and T2 are both arbitrary values and T3 → ∞, the alternative 3D variance relationships can be expressed as: (3)

γ (1) (T1 ) γ (2) (T2 |T1 ) and therefore:

θT1 T2 θ(3) = γ (T1 |T2 = ∞) γ (T2 |T1 , T3 = ∞) T3 T3

|T2 =∞) γ(T2 |T1 ,T3 =∞) θT1 T2 = θ(3) γ(T1γ(T γ(T2 |T1 ) 1)

(3.115)

(3)

|T1 ,T3 =∞) = θ(3) γ(T2γ(T 2 |T1 )

(3.116)

(3)

θT1 T2 is obtained from the 2D covariance function express in Equation 3.81 and γ (T2 |T1 , T3 = ∞), which, using a similar method to Equation 3.65, can be found from integrating the conditional variance function.

3.5.3

Covariance Function

Over the two volumes V = T1 T2 T3 and V = T10 T20 T30 , the covariance of the local integrals of X (t1 , t2 , t3 ) can be obtained from knowledge of the 3D variance functions. The resulting expression has 43 = 64 terms, consistent with the 1D formulation, giving: Cov [IV , IV 0 ] = (V V 0 )2 Cov [XV , XV 0 ] 3 3 3 2 P P P (−1)j (−1)k (−1)l ∆ (T1j , T1k , T1l ) = σ8

(3.117)

j=0 k=0 l=0

where, Z+T1 Z+T2 Z+T3 ∆ (T1 , T2 , T3 ) = (T1 − |τ1 |) (T2 − |τ2 |) (T3 − |τ3 |) ρ (τ1 , τ2 , τ3 ) dτ3 dτ2 dτ1 −T1 −T2 −T3

(3.118)

138

CHAPTER 3. RANDOM FIELD GENERATION

The correlation function representing the extension of the 2D process to that of the 3D process is given by:  s   2  2  2   2τ1 2τ2 2τ3 ρ (τ1 , τ2 , τ3 ) = exp − + +   θ1 θ2 θ3

(3.119)

However, in this thesis this formulation has not been implemented. Instead, a separable correlation function presented by Fenton and Vanmarcke (1990) has been used, in which the two horizontal correlations are separated from the vertical correlation, in the axis τ1 , giving the covariance function:   s 2  2   2|τ | 2τ2 2τ3 1 + − β (τ1 , τ2 , τ3 ) = σ 2 exp −   θ1 θ2 θ3

(3.120)

Hence, γ (T1 , T2 , T3 ) = γ (T1 ) γ (T2 , T3 ) = γ (T1 ) γ (T2 ) γ (T3 |T2 )

(3.121)

This separation represents the layers found within a soil deposit, where it is assumed that the horizontal layers are deposited one at a time and thus the horizontal correlation is stronger, whilst the vertical layers build up over time, allowing the vertical correlation to be separable from the horizontal. Any error in averaging over T2 then T3 is reduced, with both cases averaged to give: γ (T1 , T2 , T3 ) =

1 [γ (T1 ) γ (T2 ) γ (T3 |T2 ) + γ (T1 ) γ (T3 ) γ (T2 |T3 )] 2

(3.122)

This can be obtained in the same way as for the 2D process by evaluating Equations 3.80 - 3.82.

3.6. 3D IMPLEMENTATION

3.6

139

3D Implementation

The 3D implementation of the LAS process is similar to that of the 2D process developed in Section 3.4.

3.6.1

Initial Mean Generation

Taking the average process, Z (tx , ty , tz ), the global average can be given as:

0 Z1,1,1

ZDxZDyZDz

1 = Dx Dy Dz

Z (tx , ty , tz ) dxdydz, 0

0

(3.123)

0

The expected mean is given by:  0  E Z1,1,1 = E [Z] = 0

(3.124)

and the expected variance is given by:  0   E Z1,1,1 = σ 2 γ Dx0 , Dy0 , Dz0 ,

(3.125)

where the variance function is given by Equation 3.122 and where the z dimension is taken to be the separable vertical correlation.

3.6.2

Subdivision

For the subdivision of any cell in three dimensions, into 8 new cells, the new cell has the following relationship with its parent cells: Zsi+1 =

s P r=1

i+1 ci+1 rs Usjkl +

nP xyz l=1

i ails Zm(l),n(l),p(l)

(3.126) s = 1, 2, . . . , 7

Z8i+1

=

i 8Zjkl



7 X

Zsi+1 ,

(3.127)

s=1

Zsi+1

i where denotes the octant of the subdivided cell, with centre Zjkl . A neighbourhood of 3 × 3 × 3 parent cells is taken in this implementation, i.e. nxyz = 27, covered by the traversing pattern m (l), n (l), p (l), similar to the 2D pattern illustrated in Figure 3.12. The unknown ails coefficients are obtained by multiplying

140

CHAPTER 3. RANDOM FIELD GENERATION

i and then taking expectations, to give an expanEquation 3.126 by Zm(s),n(s),p(s) sive equation involving the cross-stage covariance. This is resolved in a similar way to the 2D case, with consideration to upward averaging, that is 8

i Zm,n,p

1 X i+1 = Z , 8 t=1 m(t),n(t),p(t)

(3.128)

where m (t), n (t) and p (t) indicate the octant location within the subdivided i+1 cell. The requisite expansion is formed by multiplying by Z2j,2k,2l and taking expectations. The following expression (Vanmarcke, 1983), is derived when the cells have equal volumes and are separated in each dimension by an integer multiple of the domain size: h

i

i i E Zj,k,l Zj+m,k+n,l+p =

( σ2 8

1 P

wq (m + q)2

s=−1

wr (m + r)2

r=−1

q=−1

1 P

1 P

2



ws (m + s) γ (m +

q) Dxi , (n

+

r) Dyi , (p

+

s) Dzi





(3.129)

where w1 = −2 when l = 0, w1 = 1 when l 6= 0, and Dxi ,Dyi and Dzi are the dimensions of the cells at stage i, in their respective directions, and where m, n and p denote the lag distances between cells as an integer multiplier of the cell dimension. ci+1 is the remaining unknown and is taken to satisfy the lower triangle, T ci+1 . (ci+1 ) = R, where R is symmetrical and is given by: Rrs = E [Zri+1 Zsi+1 ] −

nP xyz l=1

h i i ailr E Zm(l),n(l),p(l) Zsi+1

(3.130) r, s = 1, 2, 3

As with the 1D and 2D cases, the coefficients a and c are independent of the field values and so are calculated only once and stored.

3.6. 3D IMPLEMENTATION

141

Figure 3.15: A graph illustrating the influence of the boundary conditions on the random field. (Spencer, 2007)

3.6.3

Boundary Conditions

The boundary conditions play a key role in the generation of the random field, as their influence can penetrate deep within the domain. The following equation gives an estimate of the boundary effect, based upon the logical assumption that there is significant correlation over a distance θ into the domain (Spencer, 2007). Ω gives the percentage of the whole field that is influenced by the boundary. n

n

× 100 Ω = D −(D−2θ) Dn Ω = 100

θ D θ D

6 0.5 > 0.5

(3.131)

where D is the domain size and n is the number of dimensions. This is illustrated in Figure 3.15, which clearly shows that an increase in the number of dimensions increases the proportion of the field over which the boundary has influence. It was assumed by Fenton and Vanmarcke (1990), in their 3D random field implementation, that the conditions outside the domain are uncorrelated with those occurring within the domain and are therefore of no concern. This assumption requires an implementation where the a coefficients are modified to account for

142

CHAPTER 3. RANDOM FIELD GENERATION

Figure 3.16: Illustration of the arrangement of imaginary cells, I, forming a boundary.

the various missing parent cells at these boundaries. This method was also implemented by Onisiphorou (2000), in both 1D and 2D.

Samy (2003) suggested that neglecting these external values was detrimental to the generated random field. As an alternative, a set of imaginary cells was generated to form a new boundary, as illustrated in Figure 3.16. Hence, the subdividing cells had a full neighbourhood, as all parent cells were present and there was no requirement to alter the calculation of the coefficients. Samy (2003) generated these 2D imaginary cells by row and column averaging, assigning to i the imaginary cells at the ends of each row, IRj , a value given by nxy

i IRj

1X i = Z n k=0 j,k

(3.132)

3.6. 3D IMPLEMENTATION

143

i , and to those at the ends of each column, ICk nxy

i ICk

1X i = Z . n j=0 j,k

(3.133)

i The four corner imaginary cell values, IX , were generated by taking the average of the cross diagonal cells: nxy 1X i i Z (3.134) IX = n j=0 j,j

This implementation was based on the assumption that the sum of these imaginary cells must be the average of the field to preserve the average of the values within the field (Samy, 2003). This assumption was incorrect, as the internal averaging of the domain is preserved implicitly and, as such, the imaginary cells have no effect (Spencer, 2007). Examining the implementation further shows that there is significant bias, with the edges of the domain more highly correlated than those centrally. This effect is due to the imaginary cells being identical on opposite sides of the domain (Spencer, 2007). Spencer (2007) further analysed the use of these imaginary cells to provide boundary conditions, before developing a more advanced implementation. This resulting in an implementation that was a compromise between complexity and accuracy, whilst considering the complexity of the computation. The implemented strategy was based upon the assertion that the generated random field should be equivalent to a small domain within an infinitely large continuous random field, where a single θ exists and the size of the chosen domain can be altered to give the desired θ/D (Spencer, 2007). The imaginary cells surrounding the domain are therefore correlated in exactly the same way as those inside, and so the neighbouring cells within the domain provide a good estimate of the imaginary boundary cells. The method uses the cells in the immediate neighbourhood of the imaginary cell, weighted according to the covariance structure. Figures 3.17 and 3.18 illustrate the neighbourhood and indicate the weighting of the imaginary cells on the edge and corner of the boundary respectively, in generating a 2D field. The edge cells on the left-hand side of the 2D field, with reference to Equation

144

CHAPTER 3. RANDOM FIELD GENERATION

Figure 3.17: Neighbourhood for weighting of imaginary edge cells (Spencer, 2007).

Figure 3.18: Neighbourhood for weighting of imaginary corner cells (Spencer, 2007).

3.6. 3D IMPLEMENTATION

145

Figure 3.19: Traversing pattern and values for index functions p(l) and q(l) for edge cells (Spencer, 2007).

3.91, is calculated from; i  4  h P i i i E Z0,j , Zp(l),q(l) × Zp(l),q(l)

i IRj =

l=1

4 P

(3.135) E

h

l=1

i i Z0,j , Zp(l),q(l)

i

where p (l) and q (l) are index functions traversing the neighbourhood cells, defined by the pattern and values laid out in Figure 3.19. The equations for cells on the other edges are obtained by switching axis and mirroring the correlation direction. The three corner values, for which the neighbourhood is not valid, is obtained from the equation: i  3  h P i i i E Zm,n , Zp(l),q(l) × Zp(l),q(l)

IGi =

l=1

3 P l=1

(3.136) E

h

i , Zi Zm,n p(l),q(l)

i

where p (l) and q (l) are index functions traversing the neighbourhood cells, defined by the pattern and values laid out in Figure 3.20. The position of the imaginary boundary cell, IG , is denoted by m and n and is given by: G = R1 m = 1 n = 0 G=X m=0 n=0 G = C1 m = 0 n = 1

146

CHAPTER 3. RANDOM FIELD GENERATION

Figure 3.20: Traversing pattern and values for index functions p(l) and q(l) for corner cells (Spencer, 2007). Once again, the remaining corners of the domain can easily be found by transforming Equation 3.136. The method was implemented and expanded to three dimensions, where a similar neighbourhood pattern exists, using 6 neighbours for faces, 5 for edges and 4 for corners. These boundary conditions, their implementation and analysis, are discussed at length in Spencer (2007).

3.7

Summary

This chapter has presented the mathematical theory behind Local Average Subdivision, and has applied it to the generation of random fields, in one, two and three dimensions, discussing the implementation in each, including the discussion of boundary elements. This chapter has been included as a background for the following chapters that develop and implement a parallel version of a 3D LAS random field generator. Concepts and equations laid out in this chapter will be used to illustrate the mode of parallelisation and its viability.

Chapter 4 Parallel Random Field Generation 4.1

Introduction

Having discussed, in Chapter 3, the principles, theory and implementation of LAS random field generation, this chapter focuses on the background, reasoning and issues associated with its parallelisation. Parallelisation has been shown to be both efficient and advantageous for the stochastic framework associated with the Random Finite Element Method, RFEM. It has also been shown that a number of strategies can be adopted in parallelising the RFEM. A limiting factor to these strategies is the random field generation. This chapter looks at random field generation and discusses its parallelisation. Within the Author’s research group, the 2D and subsequent 3D implementations were first developed as standalone codes (Samy, 2003) and then later into black box style modules (Spencer, 2007), all coded in Fortran. These implementations have been successfully used with a variety of finite element codes, to provide stochastic analyses of several engineering models using the Random Finite Element Method (RFEM) (Samy, 2003; Spencer, 2007). The work and code discussed and developed in Spencer (2007) was inherited by the Author. It was this code that was used to meet the random field requirements for both serial and parallel analyses within the group. This mode of generation restricted performance, limiting domain sizes and slowing codes. Similar to the problems associated with the FE codes, discussed 147

148

CHAPTER 4. PARALLEL RANDOM FIELD GENERATION

(a) Processor 1

(b) Processor 2

(c) Processor 3

(d) Processor 4

Figure 4.1: Full RF generation within a parallel analysis.

in Chapter 2, the random field generation requires large amounts of storage, both during the generation itself and when storing the final random field. The amount of storage is significantly larger than that of the FE domain. It was this restriction of field sizes that was of primary concern to the Author, as it placed a restricted upper bound on the size of problem that could be analysed. It was clear that these limitations should be reduced, and that parallelisation would be a suitable means to achieve this. The execution time of the random field generator was considered of less importance in the current research climate. The random fields for current domains, models and applications are generated relatively fast, accounting for a very small percentage of the overall execution time of the RFEM code. Prior to this research, in the Author’s research group, parallel RFEM implementations involve the generation of identical random fields on each processor, mapping only the required random field cells to the corresponding decomposed finite elements of the domain. This is illustrated in Figure 4.1, where each processor generates the same field whilst mapping different areas (in grey). Figure 4.1 highlights the inefficiency of this implementation, with each processor generating significantly more cells than it requires, at a cost to both memory and time. This inefficiency increases with the number of processors used, as the number of cells subdivided and stored for each processor remains constant while the number required decreases. It would be more efficient to have a parallelised module that distributes the cell generation, workload and memory requirements, with each processor subdividing cells to produce only those required for mapping to the required subsection of the FE domain. This would significantly reduce the memory required in both the

4.2. PARALLELISATION

149

Figure 4.2: Illustration of parent cell neighbourhood in 3D.

generation and storage of the field, while also decreasing run times.

4.2

Parallelisation

It was necessary to determine whether a parallel implementation was viable and, if so, to determine an efficient strategy for doing so. In Chapter 3 it was shown that the LAS method recursively subdivides a domain to produce a random field satisfying the necessary dimensional and statistical properties. It was also shown that the process of subdivision requires the data from a 3 × 3 × 3 neighbourhood of parent cells, in the case of 3D generation. At the core of this neighbourhood is the cell being subdivided, as illustrated in Figure 4.2. It was also shown that the subdivision of each cell is independent at each level, requiring the parent cells from the previous level. Therefore, in principle this makes the parallelisation of the process viable, as the entire field is not required in the subdivision of each cell and, as such, the domain can be broken down across processors and cell subdivisions calculated independently of each other. However, communication between processors will be required to communicate parent cells between neighbouring processors to form the required neighbourhoods, where parent cells lie on processor mapped domain boundaries, as illustrated in Figure 4.3. The parallelisation will follow the strategy of Foster’s Design Methodology

150

CHAPTER 4. PARALLEL RANDOM FIELD GENERATION

Communication

Processor n

Processor n + 1

Figure 4.3: Illustration of 2D parent cell communication across processor domain boundaries.

discussed in Chapter 2. As discussed, this methodology is split into 4 key areas: 1. Partitioning, 2. Communication, 3. Agglomeration, 4. Mapping. The following sections discuss these areas with respect to parallelising the random field generator. Having studied the algorithm, code, theory and implementation, it is clear to the Author that the partitioning is key to the efficient parallelisation of the algorithm, controlling both the communication and mapping. With the distribution of memory requirements a priority when developing this implementation, it is clear that it is the domain decomposition that will provide the greatest benefits. Amongst the stochastic strategies set out in Chapter 2, splitting multiple realisations between a subset of processors is considered; therefore it will be necessary to not only split the field generation over multiple processors, but to simultaneously generate multiple fields over multiple processors. However, for the purposes of producing the random field strategy, the generation of a single field will be considered, as it is envisaged that the generation of different random fields will simply be initialised by varying the seed number that initialises the algorithm.

4.2. PARALLELISATION

4.2.1

151

Partitioning (or Domain Decompostion)

In partitioning the algorithm into smaller tasks, the smallest and most fundamental task is the cell subdivision. Therefore the partitioning of the problem will involve the dividing of these tasks and thus the domain between processors. It is this domain decomposition that will dictate much of the efficiency and performance of the generation, controlling the amount of communication and the workload balancing between processors. The communication required to map the random field to the FE domain is also dependent on this decomposition. The main aim of reducing the memory limitations, to increase size capabilities, will be predominantly domain decomposition driven. In distributed parallel computing, the key to an efficient program is the domain decomposition with consideration of the following aspects: 1. Load Balancing 2. Communication 3. Mapping The following sections discuss these issues and the possible decompositions considered, weighing up the advantages and disadvantages of each solution. Ideally the solution should provide an evenly distributed work load, with minimal communication and with cells mapped so as to correspond to the FE mesh decomposition to remove the need for further communication. It should be noted that the random field decomposition in this context refers to the initial decomposition between processors, and not the final decomposition which is mapped to the decomposed FE domain. This decomposition is unusual in that the decomposed field will continue to change its cell structure after decomposition. 4.2.1.1

FE Domain Decompostion

The strategy of the FE decomposition was an ideal starting point, as ultimately the random field produced will be mapped in a similar manner to the distributed FE domain. The approach is simply to distribute the cells evenly amongst the processors, so that the first X elements are placed on the first processor and then the second X on the next processor and so on. This decomposition is based solely

152

CHAPTER 4. PARALLEL RANDOM FIELD GENERATION

Figure 4.4: FE domain decomposition. on the element numbering, which usually, in the cases studied by the Author, follows each axis in turn. A typical decomposition is shown in Figure 4.4. Figure 4.4 shows that the communication is minimal, as predominantly the contact cells between two processors have minimal surface area, when compared with other irregular decomposition profiles. However, if this type of decomposition is followed then the final field generated will no longer follow the intended structure, i.e. the FE decomposition. This is due to the decomposition of the random field taking place during the generation process, and not on the final field. During the subdivision process the cell structure of the random field changes at each level of subdivision, with the number of cells along each axis doubling at each level. Figure 4.5 illustrates this evolution in cell structure at each level of subdivision. Therefore, if the random field is decomposed during the subdivision process the decomposed cell structure will evolve to a final field decomposition different to that of the FE domain. This approach would add further communication difficulties as without a uniform structure the ghost regions to be communicated between the processors would be difficult to code as each processor will have a dynamically changing structure, and so the addressing and communication will also be dynamic. An alternative is to maintain the cell structure at each level of subdivision, by communicating the necessary cells to neighbouring processors, as shown in Figure 4.6. The figure shows that, even in this simple 2D example, over a single level of subdivision, significant additional communication is required to maintain the structure; however, one of the aims of the decomposition to minimize communication.

4.2. PARALLELISATION

153

(a) Domain

(b) Decomposed domain

(c) Post subdivision domain

Figure 4.5: Illustration of cell structure growth during subdivision.

The communication shown in Figure 4.6 would be difficult to code and slow to execute, as the domain is irregularly decomposed with the addressing of the cells to be communicated complex, on this dynamically changing structure, especially in 3D. Ideally it would be possible to decompose the random field early in the subdivision process, so as to generate the required field distributed and mapped correctly to the FE domain. However this is only possible in one case; the case

154

CHAPTER 4. PARALLEL RANDOM FIELD GENERATION

Communication Communication

Communication

Figure 4.6: Illustration of the communication to restore cell structure. where the initial domain is split into slices along a single axis. 4.2.1.2

Slice Decompostion

Figure 4.7 illustrates the proposed decomposition in 2D, which can also be considered as the 3D decompostion in profile view; in the figure each colour represents a slice through the full domain. This decomposition has many advantages: • Load balanced - The domain is evenly distributed over all the processors and therefore the workload is balanced. • Communication minimized - This arrangement clearly provides the smallest surface area betweeen adjacent subdomains, and as such minimizes communication. • Mapping - In this decomposition distribution the produced random field maps perfectly with the FE domain, with comparatively little or no communication required in mapping the cells to the corresponding processors. To distribute the domain in this way requires that the number of processors over which the program is be executed has to be a factor of the number of cells in the axis of the field over which the domain is distributed; i.e. each processor must contain an equal number of slices of the decomposed domain. As such the same applies to the FE domain to which the random field is to be mapped, producing the same distribution, and thus perfect mapping. This disadvantage is, in the Authors opinion, minor compared to the coding issues and complexity of other decompositions; however the Author recognises that there will be a restriction on

4.2. PARALLELISATION

155

(a) Random field level 1

(b) Random field level 2

(c) Random field level 3

Figure 4.7: Domain decomposition - Slices - (2D profile view).

156

CHAPTER 4. PARALLEL RANDOM FIELD GENERATION Dominant number of cells in any axis 10 50 100 200

Number of Processors 2,5,10 2,5,10,25,50 2,4,5,10,20,25,50,100 2,4,5,8,10,20,25,40,50,100,200

Table 4.1: Table illustrating the restricted processor decomposition for varying resolutions of domains.

Communication

Communication

(a) Single Cell Communication

(b) Agglomerated Communication

Figure 4.8: Illustration of the agglomeration of communication. the number of processors over which a particular FE domain can be executed, as shown in Table 4.1. This restriction is minor, as both the random field and the FE meshing can be manipulated to meet these requirements, matching the resources of the user against the number of cells in the problem, while altering the size of each cell without affecting the overall dimensions of the domain. Agglomeration: The smaller tasks that have been defined will be merged into larger tasks to produce efficient codes. In particular, this will be expected within the communications, where the communication of entire faces will be agglomerated to reduce synchronisation and latency, as shown in Figure 4.8. The figure

4.3. PARALLELISATION ISSUES

157

illustrates that agglomeration reduces the communication from multiple streams to a single larger stream. While the amount of data to be communicated are constant, the actual setting up of the communication streams between processors, their synchronisation and latency are reduced, due to the data being communicated via a single stream. This agglomeration is aided in this choice of decomposition as the data to communicated are always from the same planes of the decomposition at each level of subdivision, which allows for this merger of communication to be coded more easily and efficiently, without the need for complex array addressing of the cells to be communicated. This produces a more efficient and faster algorithm than under other decompositions with similar communication agglomeration.

4.3

Parallelisation Issues

Having selected a domain decomposition, satisfying the aspects of Foster’s Design Methodology, the Author’s attention turned to other factors underlining the parallelisation.

4.3.1

Optimization

Before parallelising any code or algorithm, it is best practice to optimize the original serial code. Inefficiencies within a serial code will be amplified after parallelisation. The Author recognised several areas of optimization, including boundary generation and the optimization of cell subdivision. These optimizations are discussed in Chapter 6, which outlines the implementation of a parallel 3D random field generator.

4.3.2

Random Number Generation

A further issue with regards the generation of a random field, is the random number generation within the code. As discussed in Chapter 3, the LAS method uses a sequence of Gaussian random numbers to introduce the randomness into the field variation. The use of random number generators within numerical modelling and Monte Carlo methods has been called into question (James, 1990; Hellekalek, 1998). This has been in part due to poor random number generators. It was therefore deemed necessary to determine whether the RNG in the current

158

CHAPTER 4. PARALLEL RANDOM FIELD GENERATION

implementations was fit for purpose and reliable. The following chapter discusses in depth the theory, implementation and testing of random number generators. It tests the current RNG implementation, and discusses and concludes potential strategies and issues for its use within a parallel environment.

4.4

Conclusion

This chapter has shown that the parallelisation of the LAS random field generator implemented by Spencer (2007) is viable. A suitable decomposition has been set out that can be used within the current parallel FE coding with corresponding mapping and ease of coding. The following chapter discusses the use of random numbers within the algorithm, and their generation in both serial and parallel. It is essential that the random numbers used within the codes are suitable, before a reliable method can be implemented.

Chapter 5 Random Number Generation (RNG) 5.1

Introduction

Aristotle defined randomness as a situation when a choice is to be made that has no logical component by which to determine or make the choice. A more modern definition is that a random process is a repeating process whose outcomes follow no describable deterministic pattern, but instead follow a probability distribution, such that the relative probability of the occurrence of each outcome can be approximated or calculated. The notion of randomness has been known for centuries such as in the Old Testament where the casting of lots is frequently referred to. In ancient Greece, Athenian democracy employed sortition, a random selection, as a primary method for appointing officials, a method still seen in modern democracies with the random selection of jurors. The ancient Greeks and similar civilizations used physical methods such as drawing lots, or the selection of coloured pebbles from concealed bags, to produce randomness. Today similar methods remain, such as coin tossing, roulette wheels and lottery machines; all examples of the modern physical generation of randomness. With the advent of the computer and numerical modelling, the need for robust computer random number generation has become essential for the validity of computer models incorporating random elements. Press et al. (1992a) introduce their Random Number chapter with: “ It may seem perverse to use a computer, that most precise and 159

160

CHAPTER 5. RANDOM NUMBER GENERATION (RNG) deterministic of all machines conceived by the human mind, to produce ‘random’ numbers. More than perverse, it may seem to be a conceptual impossibility. Any program, after all, will produce output that is entirely predictable, hence not truly ‘random.’ ”

This statement highlights that a computer can only execute algorithms to produce a random number sequence and thus cannot be truly random. These “random” numbers are often known as “Pseudo-random” numbers; however, for the purposes of this thesis the distinction will not be made and the term random number will be used. The generators are designed so as to produce a sequence of numbers that to some extent appear to be statistically random. Attempts have been made to incorporate physical aspects into computerised random number generation. Physically random sources such as radio static, Geiger measurements of radioactive decay, atmospheric conditions and thermal noise have all been incorporated into random number generators (Haahr, 2010). These add truly random elements to RNG’s and are known as True Random Number Generators (TRNG); however the measuring of physical phenomena is time consuming and expensive, and, for large-scale modelling requiring millions and billions of random numbers, is not viable. The sequences produced by these generators cannot be reproduced; thus, neither can the results of any analysis based upon them. As such, algorithmic random number generators were devised that try to be fast and reliable. Most Normal (Gaussian) distributed random numbers are based upon Uniform Deviates, which are random numbers which lie, uniformly distributed, within a set range, usually 0 to 1. These Uniform Deviates are then transformed to produce numbers in the required Gaussian distribution, with the appropriate mean and standard deviations. Ideally, a good sequential RNG should produce a stream of random numbers that (Coddington, 1997): • are uniformly distributed, • are uncorrelated, • are unrepetative, • satisfy any statistical test of randomness, • are reproduceable,

5.1. INTRODUCTION

161

• are portable, • are adjustable by use of an initial “seed”, • can be split into many independent sequences, • can be generated rapidly using restricted memory. However, it is said that every random number generator will fail a reliable statis´ tical tests of randomness, if enough numbers are generated (L’Ecuyer, 1994), and there will always be one or more tests of randomness that every RNG will fail (Hellekalek, 1998). For instance, every RNG algorithm has a finite period, after which the sequence generated will repeat. It is the application and the degree of statistical randomness required that is the ultimate measure of the reliability and viability of a random number generator for a particular application. Today, there are various algorithms used in random number generation, with various advantages and disadvantages. Bad RNG’s are still implemented and in use in many forms, although understanding and statistical testing has led to improved and more robust RNG’s with greater periods. However the history of computer modelling is littered with research based upon poor RNG’s, rendering the validity of this research at best questionable. The problems with correlations between random numbers often calls into question the validity of Monte Carlo analysis (Hellekalek, 1998). These problems are further amplified when several million random numbers are generated for several thousand realisations. It is therefore essential to understand and control the algorithms governing random number generation. It is necessary to control factors such as the random number seeds, so as to avoid repetition and correlation between realisations. As discussed in Chapter 3, Gaussian White Noise is introduced to the random field generation to provide the randomness within the field. Different sequences, initialised with different seed numbers, generate different random fields. To generate this sequence of Gaussian (normal) random numbers, a random number generator is used. The consequences of a poor RNG in the random field implementation are unknown. It is theorized that a poor RNG would produce fields with repeating structures, or that correlation between differently seeded RNGs could produce fields with similar patterns or structures, invalidating the Monte Carlo analysis. However the chances of this occurring in practice are slim, as RNGs with slightly

162

CHAPTER 5. RANDOM NUMBER GENERATION (RNG)

differing sequences produce fields with largely differing structures. Although the conditions under which the RNG would have significant adverse effects in a serial environment would be rare, in a parallel implementation multiple RNG’s operating on different sections of the field may cause correlations within a single field. Moreover the RNG provides the value that gives the random field its initializing mean property value and, as such, correlations or trends within this data would have severe adverse effects on the reliability of the analysis. In preparation for a parallel implementation of the random field generator, a strategy for implementing the RNG in parallel had to be considered. Firstly, it was essential that the original serial RNG produces non-overlapping, repeating or correlating random numbers, as a basis for the parallel version. Therefore this section will start with a brief overview of the popular types of sequential RNG available and the reasoning used in the selection of the current RNG implemented. Testing of the current serial RNG will be undertaken to discover whether it is fit for purpose or whether a new generator will have to be implemented. The section concludes with the implementation and validation of a parallel version of the selected algorithm.

5.2

Sequential RNG

There are many different types of algorithm for producing random numbers and listed below is a brief selection of some of the most popular.

5.2.1

Linear Congruential Generators (LCG)

LCGs produce pseudo-random sequences of integers, I0 , I1 , . . ., over a large range between 0 and m − 1, where m, known as the modulus, is an integer value usually taken to be equal to the largest storable integer on a system; on a 32 bit system this equates to 231 − 1 = 2147483647. Equation 5.1 shows the basic re-occurrence relation used in a LCG algorithm: Ij+1 = (aIj + c) mod m

(5.1)

where a and c are integer values known as the multiplier and increment respectively. The sequence produced will eventually begin to repeat after a period, which can be no greater than m. The terms m, a and c should be determined

5.2. SEQUENTIAL RNG

163

appropriately to maximize the full use of this range. In this case all values in the range 0 to m − 1 will be produced in the sequence, with the seed providing the starting position within the sequence. This type of generator works well when the parameters have been carefully chosen, with the modulus being a prime. It should only be used to generate numbers of 48 bits or above; 32-bit versions should be avoided, as they have poor reliability. Lehmer (1949, 1954) first proposed LCGs in the late 1940s and they became popular as system RNGs, as they require few operations per call and, as such, are very fast. However they have many disadvantages, in particular with 32 bit versions, which has rendered research using them unreliable. Sequences produced show correlation between successive calls and fail many of the tests discussed later in Section 5.4. Many of these generators are good for producing short random number sequences quickly in environments and applications where the quality of random numbers is not a priority, e.g. computer games. However they can be detrimental to the Monte Carlo process in scientific research and results produced are often unreliable. The large quantities of random numbers required to produce larger random fields, or in extensive Monte Carlo procedures, often means that the period of these algorithms are inadequate to produce high quality random numbers for the full generation of a required sequence.

5.2.2

Multiplicative Linear Congruential Generators (MLCG)

A special case of the LCG is when c = 0. This is called a Multiplicative Linear Congruential Generator (MLCG) and is given by: Ij+1 = aIj mod m

(5.2)

Park and Miller (1988) reviewed a number of generators and illustrated the historically poor use of RNGs. They also proposed a “Minimal Standard” based on the MLCG algorithm and an earlier proposal by Lewis et al. (1969), where the parameters were: a = 75 = 16807

m = 231 − 1 = 2147483647

164

CHAPTER 5. RANDOM NUMBER GENERATION (RNG)

This RNG was never claimed to be perfect; however, for many years, it passed statistical testing and scrutiny and was successfully used in many reliable applications. Instead the algorithm was proposed as a minimal standard by which other generators should be measured. The product of a and m would exceed the maximum integer that can be handled by a 32-bit system. Therefore this implementation makes use of Schrage’s algorithm (Schrage, 1979), which is a form of approximate factorization of m to allow calculation on 32 bit machines.

5.2.3

Lagged Fibonacci Generators (LFG)

These generators are so named because they are based on the Fibonacci Sequence, 1,1,2,3,5,8,13......, which is based on: Xn = Xn−1 + Xn−2

(5.3)

where X0 and X1 are supplied, in the case of the standard Fibonacci Sequence, as X0 = X1 = 1. To produce a sequence of random numbers Equation 5.3 is generalised to give: Xn = Xn−l Xn−k (mod m) (5.4) where l > k > 0 and denotes a binary operation; either addition, subtraction, multiplication or the bitwise arithmetic exclusive-or (XOR). m is usually taken as a power of 2, i.e. m = 2M . l and k are known as the lag, and it is clear that this generator must be initialised with a “Lag Table” or “Seed Table” containing the previous numbers in the random sequence from X−l+1 to X0 . These generators, although becoming increasingly more popular and considered more reliable than standard LCG’s, suffer from initialisation sensitivities, with care having to be taken to supply random and uncorrelated tables (Coddington, 1997).

5.2.4

Shift register Generators

Shift register generators are generally a special case of LFG using XOR binary operation. This type of generator became popular as a fast RNG, however suffering from poor randomness qualities, worse than the other choices of LFG operator. With the improved performance of new processors, the benefits of using the XOR version, are minimal and as such should not be used (Coddington, 1997).

5.2. SEQUENTIAL RNG

165

A shift register generator, such as Marsaglia’s Xorshift generator (Marsaglia, 2003), works by using a bitwise exclusive-or (XOR) operation on a number with a bit shifted version of itself, as in the following example shown in Figure 5.1. The figure shows the number 42, represented in binary, bit shifted by 4 bits and then the XOR operation combines the unshifted and shifted representations of the number; in this example the resulting binary is a representation of the integer 138.

bitwise representation 1

2

8

8

16 32 64 128

42

Shift → 4

XOR

= 138

Figure 5.1: Simple example of XOR operation.

In these generators this operation is carried out a number of times, with the shift varying in both direction and size. Marsaglia (2003) found that the generator is as good as any other 32-bit RNG, when using 3 operations with shifts of 13, -17 and 5, where positive numbers are considered a shift to the left and negative, the opposing shift to the right. This is known as a Marsaglia shift sequence and has a period of 232 − 1. These generators have the advantage of being very fast, as they only use fast logical bitwise operations, and the bit-mixing introduced provides the algorithm with different characteristics to those algorithms which use arithmetic operations such as LCGs. This makes shift generators ideal for combining with other generators of a different type; i.e. it is a powerful means to reducing the residual correlations and other weaknesses in other algorithm.

166

5.2.5

CHAPTER 5. RANDOM NUMBER GENERATION (RNG)

Combined Generators

This is simply the combination of two generators to form one, more reliable generator. Combining two LCGs or an LCG with a Lagged Fibonacci Generator, or other similar combinations, in many cases is considered to be more reliable.

5.2.6

Other Generators

The Author includes the following generators as a note of caution, emphasizing the pitfalls that have befallen the unwary researcher, both in the past and in recent times. 5.2.6.1

System Implementation

Most operating system and compiler vendors supply a library routine for random number generation. This command returns a random number sequence based upon its initialised seed. Each seed value returns a different random sequence or subsequence of a longer sequence. This sequence is the same each time it is initialized with the same seed value. These system random library routines are often unreliable and it is said that the research based upon them should also be considered as such (James, 1990). These routines often take the form of LCGs. Another problem associated with these routines is portability; one supplier may implement a reliable routine, while others a different inferior version, each generating a different number sequence. 5.2.6.2

Randu

An infamous LCG RNG was implemented by IBM (Press et al., 1992a) on their mainframes in the 1960s and 1970s, in particular on the IBM 360 computers. It was known as “Randu” and was a widely used RNG within the scientific communities. It was a LCG taking the form, a = 65539, c = 0 and m = 231 = 214743688 and the function can be coded using the following Fortran code: Function randu(iseed) implicit none integer :: iseed double precision :: randu integer, parameter :: IMAX= 2147483647

5.2. SEQUENTIAL RNG

167

real, parameter :: XMAX_INV=1./IMAX iseed = iseed * 65539 if (iseed < 0) iseed = iseed + IMAX + 1 randu = iseed * XMAX_INV end function randu. The results produced pass very elementary statistical tests and taking moments 1 . However the correlation of the of the sample gives good results, with hxk i ≈ (k+1) sequence is poor for RNG use. Plotting triplets of the sequence, where every three consecutive random numbers generated form the Cartesian coordinates x, y, z in a 3D space, produces Figure 5.2. The figure shows that Randu appears to produce

1

0.8

0.6

0.4

0.2

0 0 0.2

1 0.4

0.8 0.6

0.6 0.4

0.8

0.2 1

0

Figure 5.2: Plot of 3000 Triplets of Randu in a 3D space. a randomly distributed set of triplets and that there is little or no correlation; however, viewing from an alternative angle produces Figure 5.3, where 15 distinct 2D planes of triplets are formed. This was discovered by Marsaglia (1968), who

168

CHAPTER 5. RANDOM NUMBER GENERATION (RNG)

1

0.8

0.6

0.4

0.2

0 1 0.8

1 0.6

0.8 0.6

0.4 0.4

0.2

0.2 0

0

Figure 5.3: Plot of 3000 Triplets of Randu in a 3D space (Alternative Angle). found that these types of RNG fall in planes when triplets are taken. In this case the combination of 9x + 6y + z using any xyz triplet produced an integer between -5 and 9, and as such Randu has major correlation issues and is a poor RNG. Famously, when IBM was approached about the problem they replied: “We guarantee that each number is random individually, but we don’t guarantee that more than one of them is random.” (Press et al., 1992b) Randu was a widely used and copied routine and was used in a great deal of mathematical research, which was deemed invalid by the use of this RNG (Knuth, 1997). This failure, in what is known as the spectral test, is a problem associated with LCGs in general. It also highlights the problem with relying upon system based implementations, as their reliability is unknown and they differ from machine to machine, thus reducing portability.

5.3. CURRENT RNG IMPLEMENTATION

169

These are some of the reasons why the Author chose to investigate and implement a reliable and portable RNG, translating into reliable results from the Monte Carlo process and providing more accurate control over generation.

5.3

Current RNG Implementation

Prior to this research the implementation within the research group’s random field generation employs the combination of two random number generators; a Marsaglia shift sequence (Marsaglia, 2003) with a period of 232 − 1 and a ParkMiller sequence (Park and Miller, 1988), by Schrage’s method (Schrage, 1979), with a period of 232 − 1, combined using a logical bit operation. This RNg was incorporated by Spencer (2007) and is taken from Numerical Recipes in Fortran 90 (Press et al., 1996). It has a full period of around 3.1 × 1018 . FUNCTION ran(idum) implicit none integer,parameter :: k4b=selected_int_kind(9) integer (k4b), intent(inout)

::idum

double precision::ran integer(k4b),parameter::IA=16807,IM=2147483647,IQ=127773,IR=2836 double precision,save::am integer(k4b),save :: ix=-1, iy=-1,k if (idum 0.999, then the the behaviour of the RNG is unclear, as it is often possible that these results have been obtained by chance. In such cases the test should be repeated using an alternative random sequence produced by the same RNG, until the behaviour of the RNG becomes clear. Values outside these limits, neither close to 0 or 1, are indicative that the test was passed, and as such the RNG should be considered reliable. The initial step in testing within this environment was to port the Fortran RNG module into the test language of C. This was achieved by calling the Fortran function from a C code. To check for consistency, the output was compared between the C module to that of the original Fortran RNG. The comparison was made over several thousand numbers and with varying the seed which generated

5.4. TESTING THE RNG

181

the sequences, satisfying the Author that the results from both were consistent. Small Crush: The RNG code was initially tested using the less stringent and fastest of the tests, Short Crush. The results are summarized in Table 5.1, where the p-value statistics are collated to show the differing categories of failure. The first and last values indicate a RNG which has been shown to have significant bias in a particular test, while the central value, 0.001 < p < 0.999, is indicative of a RNG that passed the corresponding test. No. of Statistics p-value −300 6 1 × 10 0 6 1 × 10−15 0 6 0.001 0 15 0.001 < p < 0.999 0 > 0.999 −15 0 > 1 − 1 × 10 > 1 − 1 × 10−300 0 Table 5.1: p-values from Small Crush battery of tests.

Crush: Having passed the Small Crush tests without failing any tests, the RNG code was further tested using the more stringent Crush battery of tests, and the statistics are summarized in Table 5.2. The lower value of the range of results, i.e. No. of Statistics p-value −300 6 1 × 10 0 0 6 1 × 10−15 6 0.001 1 184 0.001 < p < 0.999 > 0.999 1 −15 > 1 − 1 × 10 0 > 1 − 1 × 10−300 0 Table 5.2: p-values from Crush battery of tests. p 6 0.001 , was not of significance and the RNG passed the full test associated with that statistic. However the higher p-value, which was 0.9993, did fail its associated test. This particular test was repeated 5 further times and all passed within the normal range. Therefore it can be assumed that the original result was coincidental and that the RNG can still be considered reliable.

182

CHAPTER 5. RANDOM NUMBER GENERATION (RNG)

Big Crush: Having passed the scrutiny of the Crush battery, the implemented RNG was tested using the Big Crush battery of tests, the most stringent of the collection available to the Author and widely used throughout the field as a benchmark for testing random number sequences. The results are shown in Table 5.3.

   p-value               No. of statistics
   ≤ 1 × 10^−300         0
   ≤ 1 × 10^−5           0
   ≤ 0.001               0
   0.001 < p < 0.999     252
   > 0.999               0
   > 1 − 1 × 10^−5       2
   > 1 − 1 × 10^−300     0

   Table 5.3: p-values from the Big Crush battery of tests.

Again the RNG proved robust, passing all but two tests within the battery. The two failed tests were based on the evolution of the linear complexity of a sequence of bits as it grows. This type of RNG, incorporating a shift-register-type generator, is susceptible to failing this type of test. However, this failure is likely to have little or no effect on the ultimate composition of the random field, in either its structure or initial mean. These tests were repeated several times using multiple different seeds, so as to satisfy the Author of the results and conclusions.

5.4.3.2 Conclusion

After conducting these tests on the serial RNG currently used, the Author concluded that, so far, the code was adequate for producing single random fields without correlation or repetition in the structure, passing the vast majority of the tests of the TestU01 suite. This RNG has so far been tested for the statistical randomness of a sequence produced from a single seed. It has yet to be tested for correlation between different streams, i.e. those between different fields and those produced on different processors in any parallel implementation. This is discussed in Section 5.5. Ultimately, this RNG was implemented by Spencer (2007) and was used in 2D and 3D slope stability analyses, where the results were comparable to the results of other authors and to theory. This implementation and the subsequent results


provide further evidence that the algorithm is sufficiently random, assuming that the works of those other authors employed different, reliable algorithms. The Author therefore felt that this RNG implementation was a sound basis on which to build a parallel implementation.

5.5 Parallel Random Number Generators

Within the generation of the random field in parallel, it is envisaged that sections of the field will be generated on different processors. This gives rise to the need for a parallel RNG that provides differing, uncorrelated streams of numbers on each processor. This section is also of interest when discussing the random streams between different realisations, whether in a parallel or serial computational framework, because correlations or statistical anomalies between the streams could result in fields of similar structure and composition being formed for different realisations. A good parallel RNG should produce a stream of random numbers that meets the requirements for a sequential RNG, as laid out in Section 5.1, and should in addition meet the following criteria (Coddington, 1997):

• It should work on any number of processors.

• The sequences generated on each processor should individually meet the criteria laid down for the sequential RNG (see Section 5.1), in that they should be uniformly distributed, uncorrelated and have adequately large periods.

• There should be no correlation between the different sequences generated on each processor.

• Ideally the same sequence of numbers should be generated when produced across any number of processors.

• The algorithm should be efficient, with no communication between processors.

However, as with the serial generators, these ideals are not all feasible and a compromise must be reached that best fits them whilst providing a suitable generator for the application. The use of RNGs running over several processors can be difficult. It is essential, for the reliability of results from methods such as Monte Carlo analysis, that there is no correlation or overlap between the random number sequences on the different threads or processors. There are three main approaches to resolving this problem (Coddington, 1997):


• Sequence Splitting: use an RNG algorithm that allows control over the starting points within the RNG chain, and thus choose starting points that avoid overlap and correlation (Figure 5.8b).

• Leapfrog: use a leapfrog method, in which each processor skips over the numbers within the sequence used by the other processors (Figure 5.8c).

• Independent Sequences: use a different RNG on each processor, using the same algorithm but a different initial generating seed on each processor (Figure 5.8d).

Figure 5.8 compares these options with the original sequence.

Figure 5.8: Illustration of techniques that are utilized in PRNG models: (a) original RNG sequence; (b) sequence splitting; (c) leapfrog; (d) independent sequences.

From a brief examination of both the algorithm and code presented, and with thought given to the FE applications, the main priorities for the parallel RNG are to be fast and free of correlation whilst maintaining large periods. The Author decided that the need to produce the same results for varying numbers of processors was of limited importance, as within the stochastic Monte Carlo analysis and RFEM framework the results should converge similarly regardless of whether this ideal is achieved. Therefore, for speed and ease of coding, using the same RNG with different seeds on each processor was chosen and initially tested. Other authors in the field (Srinivasan et al., 1998; Tan, 2002) also conclude that this is a suitable strategy for the parallelisation of RNGs. However, care must be taken to choose suitable seeds, so as not to cause overlap in the sequences; although the effects of coincidentally overlapping sequences may be minimal, depending on the position and extent of the repetition within the random field generation, such overlaps are not ideal.

This parallel implementation involves using the same serial RNG as used by Spencer (2007), but at the point at which different random number sequences are required on each processor, the RNG is initialised with a different seed number on each, as illustrated by Figure 5.8d. This simple implementation should provide the required sequences, although if the code is run using a different number of processors the sequences will differ. This means that, with respect to the random field generation, a different field will be produced when the number of processors is varied, even if the same initiating seed is used.
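A minimal sketch of this seeding strategy is given below, assuming the ran() function of Section 5.3 and an initialised MPI environment; the base seed, its offset by processor rank, and all names other than the standard MPI calls are illustrative only, not the Author's actual implementation.

  program seed_per_process
    use mpi
    implicit none
    integer, parameter :: k4b = selected_int_kind(9)
    integer :: rank, ierr, i
    integer(k4b) :: idum
    double precision :: r
    interface
       function ran(idum)
         integer, parameter :: k4b = selected_int_kind(9)
         integer(k4b), intent(inout) :: idum
         double precision :: ran
       end function ran
    end interface
    call MPI_Init(ierr)
    call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
    idum = -(1000_k4b + int(rank, k4b))   ! a different negative seed per processor re-initialises ran()
    do i = 1, 5
       r = ran(idum)                      ! each rank now draws from its own stream
    end do
    call MPI_Finalize(ierr)
  end program seed_per_process

Because the seed differs by rank, the streams (and hence the fields) depend on the number of processors used, which is the behaviour described above.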

5.5.1 Parallel Testing

To test the parallel implementation it was necessary to merge multiple random number sequences into a single stream, as depicted in Figure 5.9. To test the implementation extensively, the Author varied the length and number of the different sub-streams, so as to simulate the likely operation of the RNG when generating fields across multiple processors.
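The sketch below illustrates one way such a merging routine could look, assuming the ran() function of Section 5.3; the routine name, argument names and the simple block-by-block concatenation are illustrative only, and the exact splitting and recombination used in the Author's test harness may differ.

  subroutine merged_stream(nstreams, block_len, out)
    ! Build a single test sequence from several independently seeded copies of the
    ! serial RNG: one block of numbers is drawn from each seed in turn (sketch only).
    implicit none
    integer, intent(in) :: nstreams, block_len
    double precision, intent(out) :: out(nstreams*block_len)
    integer, parameter :: k4b = selected_int_kind(9)
    integer(k4b) :: idum
    integer :: i, j, n
    interface
       function ran(idum)
         integer, parameter :: k4b = selected_int_kind(9)
         integer(k4b), intent(inout) :: idum
         double precision :: ran
       end function ran
    end interface
    n = 0
    do i = 1, nstreams
       idum = -int(i, k4b)            ! consecutive negative seeds, one per simulated processor
       do j = 1, block_len
          n = n + 1
          out(n) = ran(idum)          ! append this stream's block to the merged sequence
       end do
    end do
  end subroutine merged_stream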

Figure 5.9: Illustration of merging multiple parallel RNG sequences into a single sequence for empirical testing: (a) Step 1: multiple parallel sequences (varying seeds); (b) Step 2: each sequence is split into equal, shorter periods; (c) Step 3: a testing sequence is formed by combining the components.


In this implementation the initial values generated by the RNG are of no concern, as the mean value of the initial field to be subdivided is provided by a single generator and, as shown before, this is sufficient. The initial test carried out on the generator was for uniformity, as with the previous serial generator, with Figure 5.10 showing histograms of the results for sequences of 1000, 10000, 100000 and 1000000 random numbers from the same set of seeds, in this case 10 seeds, to simulate generation over 10 processors. The results again show that the random number sequence produced by combining sub-sequences from differently seeded RNGs converged to a uniform distribution. This was repeated several times with different parameters, i.e. lengths and numbers of sub-sequences, until the Author was satisfied with the result. As with the serial RNG implementation, the parallel version was then subjected to the TestU01 battery of statistical tests for more stringent testing for correlation, overlaps and other phenomena, to assess the reliability of the proposed generator.

5.5.1.1 TestU01 Testing

The testing was done, as previously, using the batteries of tests incorporated in TestU01 (L'Ecuyer and Simard, 2007). Again it was necessary to produce a single random number stream, as in Figure 5.9. To do so, a single Fortran function was coded that returned a single stream of numbers based upon subsets of multiple sequences generated using varying seeds. The tests were carried out in the same manner as before, starting with the Small Crush battery.

Small Crush: Table 5.4 shows the results of the testing using this least stringent and fastest of the collection of tests.

   p-value               No. of statistics
   ≤ 1 × 10^−300         0
   ≤ 1 × 10^−15          0
   ≤ 0.001               0
   0.001 < p < 0.999     15
   > 0.999               0
   > 1 − 1 × 10^−15      0
   > 1 − 1 × 10^−300     0

   Table 5.4: p-values from the Small Crush battery of tests.

Figure 5.10: Histograms of the frequency of random numbers generated, for sequences of (a) 1000, (b) 10,000, (c) 100,000 and (d) 1,000,000 numbers (frequency (%) against random number interval).

As shown, the implementation passed all the tests comfortably, with all 15 statistics measured having a p-value in the range 0.001 < p < 0.999. This battery of tests was repeated, using different seeds and different lengths and numbers of sub-sequences, until the Author was once more satisfied with the results, before proceeding to the more stringent and slower-executing Crush battery of tests.

Crush: The parallel RNG, initialised with the same seeds as in the previous Small Crush tests, was tested using the Crush battery, and the results are given in Table 5.5.

   p-value               No. of statistics
   ≤ 1 × 10^−300         0
   ≤ 1 × 10^−15          0
   ≤ 0.001               0
   0.001 < p < 0.999     186
   > 0.999               0
   > 1 − 1 × 10^−15      0
   > 1 − 1 × 10^−300     0

   Table 5.5: p-values from the Crush battery of tests.

Once again, this implementation passed all tests, and subsequent repetitions for different permutations, satisfactorily. Having passed all these tests, the Author continued with the most stringent battery of tests, Big Crush.

Big Crush: Table 5.6 shows the statistical test results for the final battery of tests completed on the parallel RNG implementation.

   p-value               No. of statistics
   ≤ 1 × 10^−300         0
   ≤ 1 × 10^−5           0
   ≤ 0.001               1
   0.001 < p < 0.999     252
   > 0.999               1
   > 1 − 1 × 10^−5       0
   > 1 − 1 × 10^−300     0

   Table 5.6: p-values from the Big Crush battery of tests.

The generator fails two tests, both analysing the distance between the closest points in a sample of n uniformly distributed points in a unit torus in t dimensions,


as studied by L'Ecuyer et al. (2000). The tests failed with p = 0.0005 and p = 0.9992. Further repetition of these tests with different sequences revealed that the RNG, in its parallel form, intermittently failed them, but never consistently, nor with p-values suggesting that the failures were anything more than coincidental. These tests were repeated several times, with similar results. The parameters of the implementation were varied to fully exercise the parallel operation of the RNG over its likely range of use: a variety of combinations was used, varying the number of differently seeded RNGs to simulate the number of processors, whilst varying the size of the sub-sequences to simulate the differing dimensional sizes of the envisaged FE sub-domains.

5.6 Conclusion

The passing of the more stringent TestU01 tests by the parallel RNG implementation has given the Author confidence in the reliability of this parallel implementation. Not only do the results show that there will be no significant correlation or overlap between random number sequences on different processors, they also show that the streams produced during different realisations, with different initial seeds, exhibit no statistical correlation or overlap. This evidence supports the reliability of the results of the Author's predecessors, who used this technique without implementing the same level of checks. However, caution should be taken when choosing initial seeds, as this could have a significant effect on the results of these tests. The Author tested consecutive seeds, as it was believed that these would produce the most correlated results, and because consecutive seeding was also the planned seeding mechanism for the RNG, between realisations in the parallel and serial versions, and between processors in the parallel version. It would have been impossible for the Author to test all possible combinations of parameters and envisaged situations; this is beyond the scope of this project and possibly provides a thesis topic in itself.

This chapter and its corresponding research were intended to provide the Author with some understanding of the computational methods and issues involved in the generation of random numbers, in particular their use within a parallel environment. This chapter did not, however, focus on the consequences of a bad RNG for the generation of a random field, with the Author preferring to implement a robust


and reliable RNG in both serial and parallel. Although beyond the scope of this project, the Author theorises that the effects of a bad RNG would be minimal: repeated random numbers, overlaps and correlations would likely be used at different points within the random field generation procedure, which would still result in differing fields. This leads the Author to the final test for validation, which is the full implementation within a FE model and validation of the results by comparison with known results. Spencer (2007) did this with 2D and 3D slope stability codes, conducting a stochastic analysis using the RFEM, incorporating a LAS random field and the serial RNG. Those results were favourable and thus the Author feels no need to test the serial RNG in this way. The parallel RNG will be tested and validated in this manner in Chapter 7, where the Author implements the full random field generator within a FE implementation, stochastically analysing the flow under a sheet pile wall using the RFEM.


Chapter 6

Parallel Random Field Implementation

6.1 Introduction

The previous chapters discussed the principles behind LAS random field generation, parallelisation and the use of parallel random number generators. In this chapter the implementation of a parallel random field generator is discussed, bringing the key aspects and conclusions of the previous chapters together.

6.2 Current Implementation

The current implementation, developed by Spencer (2007), is in its basic form a recursive procedure, involving the addition of a boundary to the global domain followed by the subdivision of the cells as defined by LAS theory. The complex mathematics and implementation of LAS theory and field generation were discussed in Chapter 3. Figures 6.1 and 6.2 show this implementation in the form of a schematic flowchart and a simple 2D example, respectively. Note that in Figure 6.2 the boundary cells generated within the process, highlighted in green, are of equivalent size to the cells within the domain; however, due to limited space, and for clarity, the figure focuses on the subdividing area of interest. This original implementation produces cubic domains in 3D, which have a global domain resolution of 2^level cells in each Cartesian direction, where level is the number of levels of subdivision that have been generated.

The term “resolution” is frequently used within this thesis, but has multiple


definitions and means different things to different readers. Within this thesis the term resolution will refer to the number of cells within the field in each axis direction, with no reference to cell size. In computation this definition is most frequently applied to computer and LCD screens, where it refers to the number of pixels in each direction. The term resolution is used to differentiate between the size of a field, or domain, measured in cells, and the physical size of the field, which will more often be referred to as the size.

The boundary cells added to the domain can be generated using different methods; however, the version inherited by the Author generated the boundary cells using the method of Spencer (2007), documented in Section 3.6.3. The required field is then cut from this larger global domain. This field follows a standard normal distribution, with µ = 0 and σ = 1, and the heterogeneity of the generated field is isotropic. To transform this initial field to the required field statistics and heterogeneity profile, it is post-processed. The values of the cells are transformed using an appropriate function, so as to fit the field to the correct distribution (as in Section 6.4.5.1). The field is then squashed and/or stretched to meet the required anisotropy of the heterogeneity (see Section 6.4.5.2). Finally, the generated field is mapped onto the FE domain.
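Where the target property follows a normal distribution, the transformation to the required statistics is simply a linear rescaling of the standard normal cell values. A minimal sketch is given below; the function name is illustrative, and the lognormal and other distribution options of Section 6.4.5.1 are not shown.

  elemental function to_target(z, mu, sigma) result(v)
    ! Map a standard normal cell value z (mean 0, standard deviation 1)
    ! onto a normal target distribution with mean mu and standard deviation sigma.
    implicit none
    double precision, intent(in) :: z, mu, sigma
    double precision :: v
    v = mu + sigma*z
  end function to_target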


Figure 6.1: Schematic flowchart of the current random field implementation (initialise the required parameters; compute the a and c coefficients (Chapter 3); set the seed (negative); generate boundary cells (Section 3.6.3); LAS subdivision (Chapter 3), repeated until the desired field resolution/level is reached; crop the domain to the required field; post-processing (Section 6.4.5)).


Figure 6.2: 2D example of the current random field implementation:
• Initial global mean: a cell of the required dimensions is generated with a value equal to the global mean; this is the domain over which the recursive subdivisions occur.
• Initial boundary generation: a boundary (green) is generated to complete the cell neighbourhood ready for subdivision; this initial boundary is given a value equal to the global mean.
• Subdivision: using a 3 × 3 neighbourhood of parent cells, the central cell is subdivided to form the new 2 × 2 domain.
• Boundary cells generated: the boundary is once more generated using the implementation of Spencer (2007), discussed in Section 3.6.3; these new boundary cells are an extrapolation of the cells within the domain and reflect the correlation and statistics of the required field.
• Subdivision: the cells are once again subdivided, doubling the number of cells in each dimension; this process of generating a boundary and subdividing continues until a field of the desired resolution is produced.

6.3 Code Improvements

Before parallelisation it is first necessary to analyse the original serial implementation of the generator in order to optimize the code. It is good practice to parallelise an optimized code, because inefficiencies in the serial algorithm are often amplified in a parallel environment. In the Author's opinion, the current implementation was computationally inefficient in a number of ways. The following sections discuss these areas of the code and their subsequent optimization.

6.3.1 Domain Reduction

The serial implementation that the Author inherited produces large cubic domains of Cx × Cy × Cz cells, where Cx = Cy = Cz and where C is the number of cells in the direction indicated by the subscript. However, cubic domains are rarely required; more often Cx ≈ Cy ≫ Cz, and thus large amounts of unnecessary computation often occur. The current code requires that a global mean is generated for a cubic domain that has a dimension matching or greater than the largest dimension of the required final field. Furthermore, it is beneficial to take the required field from the central region of a significantly larger domain, to reduce the boundary effects (Spencer, 2007). Table 6.1 shows the resolution of the global domain generated to produce fields of varying refinement, and the corresponding level of subdivision required.

   Level   Required field resolution (cells)     Domain resolution generated (cells)
   1       max(Cx, Cy, Cz) = 2                   2 × 2 × 2
   2       3 ≤ max(Cx, Cy, Cz) ≤ 4               4 × 4 × 4
   3       5 ≤ max(Cx, Cy, Cz) ≤ 8               8 × 8 × 8
   4       9 ≤ max(Cx, Cy, Cz) ≤ 16              16 × 16 × 16
   5       17 ≤ max(Cx, Cy, Cz) ≤ 32             32 × 32 × 32
   6       33 ≤ max(Cx, Cy, Cz) ≤ 64             64 × 64 × 64
   7       65 ≤ max(Cx, Cy, Cz) ≤ 128            128 × 128 × 128
   8       129 ≤ max(Cx, Cy, Cz) ≤ 256           256 × 256 × 256
   ...     ...                                   ...

   Table 6.1: Required field resolutions and the corresponding generated domain resolutions at each level of subdivision.


The table shows that this can be a wasteful method, as it often leads to the generation of an excessive number of cells, especially when a domain is dominated (with respect to size) by one or two dimensions. For example, to generate a field with a resolution of 36 × 36 × 12 cells requires that a random field be produced with at least 6 levels of subdivision, thus generating a domain of 64 × 64 × 64 = 262144 cells; that is, for a field requiring only 15552 cells, which is a mere 6% of the random field domain generated. This wastage is both time and memory consuming, and can be limiting.

It has already been identified that each cell is subdivided using the parent cells in a 3 × 3 × 3 neighbourhood centred on the subdividing cell, as discussed in Section 3.6. With this in mind, after each level of subdivision the domain can be reduced to only those cells required to produce the final field. This is achieved by first ascertaining the reductions required after each stage of the subdivision, to produce a “reduction profile”.

6.3.1.1 Reduction Profile

To compute the reduction profile, the required number of levels of subdivision is first calculated. The level of subdivision is taken with reference to Table 6.1, where the maximum number of cells along any axis must be ≤ 2^level. The Author now illustrates the computation with a simple example, in which a final problem field resolution of 40 × 40 × 10 cells is required. First the required number of subdivision levels is computed; in this case, the maximum number of cells in any direction is 40, in both the x and y directions. This corresponds to 6 levels of subdivision, producing a field of 64 × 64 × 64 cells. Taking the problem field resolution, 40 × 40 × 10 cells, the minimum number of cells required at each level of subdivision to generate this final field is computed. This is done by recursively working backwards from the required field resolution through the levels of subdivision, halving the number of cells in each Cartesian direction at each level and rounding up to the nearest integer. This reverses the subdivision process, with regard to the resolution, in order to obtain the starting resolution at each level of subdivision needed to produce the required domain at the next level. In doing so, the number of cell reductions after each subdivision, in each direction, can be obtained. Table 6.2a shows the resolutions for the current example, giving the minimum resolutions required at each level of subdivision to produce a final field of 40 × 40 × 10 cells. Now, working forwards,


we can compute a reduction profile for each axis. This is achieved by subtracting the required number of cells from those actually generated at each level, after the reduction has taken place. Tables 6.2b and 6.2c show the generation of these reduction profiles for the x and y-directions and the z-direction respectively; the reduction profiles are shown in the final column of each table. These computed reduction profiles, one for each axis, are used to reduce the subdividing domain after each level of subdivision (by the number of cells prescribed by the profiles). This generates a field of the required resolution using the minimum number of cell subdivisions.
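The backward halving and forward differencing described above can be expressed compactly, as in the sketch below for a single axis; the routine and argument names are illustrative only, and the simple doubling evolution used here is that of the original method (i.e. before the boundary-cell modification of Section 6.3.2).

  subroutine reduction_profile(c_req, nlevels, profile)
    ! Compute the per-level cell reductions for one axis (illustrative sketch).
    implicit none
    integer, intent(in)  :: c_req             ! required field resolution on this axis
    integer, intent(in)  :: nlevels           ! levels of subdivision (max axis resolution <= 2**nlevels)
    integer, intent(out) :: profile(nlevels)  ! cells to remove after each level of subdivision
    integer :: required(0:nlevels), generated, level

    ! Work backwards: minimum resolution needed at each level (halve and round up).
    required(nlevels) = c_req
    do level = nlevels-1, 0, -1
       required(level) = (required(level+1) + 1) / 2
    end do

    ! Work forwards: each subdivision doubles the retained cells; the surplus over
    ! the minimum required resolution is the reduction applied at that level.
    do level = 1, nlevels
       generated      = 2 * required(level-1)
       profile(level) = generated - required(level)
    end do
  end subroutine reduction_profile

For the 40 × 40 × 10 example, this reproduces the profiles in the final columns of Tables 6.2b and 6.2c: (0, 0, 1, 1, 0, 0) for the x and y axes and (1, 1, 0, 1, 1, 0) for the z axis.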

6.3.1.2 Schematic flowchart of the domain reduction technique

Figure 6.3 shows a schematic flow diagram of the procedure that was coded and implemented into the LAS random field generation. This basic schematic is placed in the context of the overall LAS random field generator, to show the relative arrangement and ordering of the method. Further to this schematic, a simple illustration of the method is presented in Figure 6.4. The concept is presented in 2D, but is equally applicable in 3D; note that, for simplicity, the boundary elements and the neighbourhood of parent cells are neglected so as to illustrate the procedure more clearly, and therefore only those cells that subdivide to form the next level are presented. Not all of the omitted cells would be discarded immediately, as many are used as parent cells for those subdividing.

The 2D example in Figure 6.4 compares the original method with the new method, including domain reduction, for producing a small random field of 5 × 3 cells. In this example the reduction profile is calculated as shown in Table 6.3, using 3 levels of subdivision, as prescribed by Table 6.1 for a field with a maximum resolution in any direction of Cmax = 5. Following the original method, there are 1 + 4 + 16 = 21 cell subdivisions over 3 levels to produce the 15-cell field. Using the proposed method of reduction requires only 1 + 2 + 6 = 9 cell subdivisions. Furthermore, the original method requires the storage of at least 64 cells, compared with a smaller minimum of 24 cells for the new reduction method. In this simple example these are significant reductions in both memory and processing requirements; when the method is extended to larger domains requiring further levels of subdivision, and extended to 3D, the memory and computational performance can be greatly improved.


   (a) Required domain resolutions after each level of subdivision

   Level   x-axis   y-axis   z-axis
   6       40       40       10
   5       20       20       5
   4       10       10       3
   3       5        5        2
   2       3        3        1
   1       2        2        1
   0       1        1        1

   (b) Reduction profile in the x and y-directions

   Level   Generated   Required   Reduction profile
   0       1           1          -
   1       2           2          0
   2       4           3          1
   3       6           5          1
   4       10          10         0
   5       20          20         0
   6       40          40         0

   (c) Reduction profile in the z-direction

   Level   Generated   Required   Reduction profile
   0       1           1          -
   1       2           1          1
   2       2           1          1
   3       2           2          0
   4       4           3          1
   5       6           5          1
   6       10          10         0

   Table 6.2: Numerical steps involved in the computation of the reduction profile.


Figure 6.3: Schematic flowchart of the domain reduction method (compute the reduction profiles for the x, y and z axes (Section 6.3.1.1); after each LAS subdivision (Chapter 3), reduce the cells in any axis requiring reduction; repeat until the desired field resolution/level is reached; post-processing (Section 6.4.5)).


Figure 6.4: Illustration of the proposed domain reduction method in comparison with the original implementation:
• (a) Initial domain (global mean): a single cell with the value of the global mean.
• (b) Subdivision level 1: the initial cell is subdivided; in the domain reduction model the subdivided cells are then reduced to those requiring further subdivision (from Table 6.3, reduction profile: x = 0, y = 1).
• (c) Subdivision level 2: further subdivision, followed by a further reduction to the cells requiring subdivision (reduction profile: x = 1, y = 0).
• (d) Subdivision level 3 (final field): further subdivision, after which the required field is cut from the generated domain (reduction profile: x = 1, y = 1).


   (a) Minimum resolution required at each stage of subdivision

   Level   x-axis   y-axis
   3       5        3
   2       3        2
   1       2        1
   0       1        1

   (b) Reduction profile, x-direction

   Level   Generated   Required   Reduction profile
   0       1           1          -
   1       2           2          0
   2       4           3          1
   3       6           5          1

   (c) Reduction profile, y-direction

   Level   Generated   Required   Reduction profile
   0       1           1          -
   1       2           1          1
   2       2           2          0
   3       4           3          1

   Table 6.3: Computation of the reduction profile for a 5 × 3 cell random field, over 3 levels of subdivision.


It should also be noted that domain reduction can start at any point within the subdivision process, by adjusting the starting point and the number of levels over which the reduction takes place.

6.3.2 Boundary Cells

In the original version, the boundary cells discussed in Section 3.6.3 are calculated at each stage of the process. As documented by Spencer (2007), these calculations provide the cells at the edges of the subdividing domain with a realistic neighbourhood to complete their subdivision. However, this boundary generation is an approximation within the method, as it is based on an extrapolation of the domain onto which it is to be added. These boundary cells are a problem both mathematically and for implementation. It is essential to stipulate a boundary to provide sufficient neighbours for the outer cells of the subdividing domain, as discussed in Section 3.6.3. However, the selection of the boundary has implications for the generation of the random field, and there are several possible methods of implementation. Spencer (2007) shows that the boundary cells generated around the subdividing domain reduce the reliability of the cells at the edge of the domain; Figure 3.15 shows the zone of influence that the boundary cells have upon the domain. As such, Spencer (2007) recommends that the final cut-away field should be taken at some distance from the generated boundary cells placed around the larger domain, so as to minimize these effects.

After investigation, it is the Author's opinion that the generation of boundary cells is unnecessary at all levels of the subdivision; after 3 levels of subdivision the field becomes self-sufficient in defining its own boundary, as illustrated by the 2D example in Figure 6.5. Figure 6.5a shows an example where the outer cells of a 4 × 4 subdividing domain are redefined as boundary cells and are no longer subdivided. The subdividing domain is reduced in size by two cells in each direction. This reduction produces a smaller domain; however, in this example the subdivision generates another 4 × 4 domain. If this process were to continue at each level, the domain would continue to get smaller in size, halving in each dimension, but the resolution would always remain 4 × 4.


Figure 6.5: Illustration showing the principle of self-generating boundaries: (a) starting with a 4 × 4 domain, the outer cells are defined as boundary cells and the remaining 2 × 2 cells are subdivided, again producing a 4 × 4 domain; (b) starting with a 5 × 5 domain, the 3 × 3 interior is subdivided, producing a 6 × 6 domain.

However, Figure 6.5b shows an example of the same process, except that the redefinition of boundary cells starts when the domain is 5 × 5 cells. Again, at each stage of the subdivision the size of the domain is reduced, but crucially the resolution of the domain increases, to produce a field of 6 × 6 cells. The figure shows that, if the subdividing domain has Cx, Cy, Cz ≥ 5 cells, then the outer cells of the domain can be redefined as boundary cells and the subdivision can continue, resulting in a refinement of the resolution of the domain; in normal cell evolution this is usually the case after 3 levels of subdivision, when Cx, Cy, Cz = 8.

The Author therefore questions the need to generate both the boundary and edge cells to high resolutions, only to discard them later when mapping to the FE domain. It is advantageous, both computationally and mathematically, to discard these cells before needless and redundant subdivision occurs. This cell reduction has the effect of moving the final field away from the initially generated domain boundary. The reduction in domain size and the smaller increases in resolution at each of the subsequent stages of this method have an overall effect on the size of the final domain that is generated; therefore, the size of the starting domain and the number of levels of subdivision must be adjusted to reflect this. This is one of the


disadvantages of this method: the smaller increase in field resolution at each stage may mean that further levels of subdivision are required to produce the final field. However, in the Author's opinion these extra levels will be generated relatively quickly, especially following the improved techniques introduced in this chapter. Furthermore, the added levels will place the final field further from the original extrapolated boundary cells placed around the domain in the first 3 levels of subdivision, thus reducing their influence. The initial domain size is defined, in the original implementation, via the number of levels of subdivision and the cell size; e.g. 8 levels of subdivision and a cell size of 0.5m gives a full domain resolution of Cx = Cy = Cz = 2^level = 2^8 = 256 cells

(i.e. 256 cells × 256 cells × 256 cells)

and a full domain size of x = y = z = 256 × 0.5 = 128m

(i.e. 128m × 128m × 128m)

To adjust for the reduction due to redefining cells as boundary cells, the number of levels of subdivision has to be adjusted. The domain resolution after subdivision levels 1 to 3 remains unchanged, i.e. Cx, Cy, Cz = 2^level cells. Beyond level 3 the reduction in the subdivided resolution must be accounted for; therefore the new domain resolution is given by

   Cx = Cy = Cz = 2^level                       for level ≤ 3
   Cx = Cy = Cz = 2^level − 2^(level−1) + 4     for level > 3          (6.1)

Assuming a subdivision level > 3 and rearranging gives the level of subdivision required for the maximum cell resolution in any direction,

   ⌈level⌉ = ln(max(Cx, Cy, Cz) − 4) / ln 2 + 1                        (6.2)

From this subdivision level, the code computes the initial domain size required for generation. Since the actual process of subdivision remains unaltered, the final resolution of the field without the boundary reduction still applies and is given by Cx = Cy = Cz = 2^level cells. This is because, although the domain is eventually cropped, the statistics and correlation functions used within the subdivision process are still based upon this larger domain, which is simply not generated in full. The reductions that occur comply with the LAS theory presented in Chapter 3, and so the above statement is valid. This resolution is then multiplied by the cell size to give the size of the initial domain to be subdivided, i.e. x = y = z = 2^level × cell size (m).
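As a small illustration of Equations 6.1 and 6.2, the sketch below computes the domain resolution produced after a given number of levels and, by searching upwards, the number of levels needed to cover a required resolution; the module and function names are illustrative only.

  module subdivision_levels
    implicit none
  contains
    pure integer function domain_resolution(level)
      ! Domain resolution produced after 'level' subdivisions (Equation 6.1).
      integer, intent(in) :: level
      if (level <= 3) then
         domain_resolution = 2**level
      else
         domain_resolution = 2**level - 2**(level-1) + 4   ! equivalently 2**(level-1) + 4
      end if
    end function domain_resolution

    pure integer function required_level(cmax)
      ! Smallest number of levels whose resolution covers cmax (cf. Equation 6.2).
      integer, intent(in) :: cmax
      required_level = 1
      do while (domain_resolution(required_level) < cmax)
         required_level = required_level + 1
      end do
    end function required_level
  end module subdivision_levels

For example, required_level(36) returns 6 (a 36-cell resolution at level 6), whereas required_level(40) returns 7 (a 68-cell resolution), reflecting the extra level that the boundary redefinition can introduce.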

6.3.2.1 Schematic flowchart of the boundary cell methodology

Figure 6.6 is a basic schematic of the Author's implementation of this boundary generation method. The flow diagram shows the method with respect to the general subdivision process, to provide the reader with a clearer understanding of the method and its procedure. To further illustrate the method, Figure 6.7 shows a simple 2D example over 4 levels of subdivision; once again, the same method is applicable when extended to 3D. The generated boundary cells, highlighted in green, are of the same dimensions as the domain cells; however, to make better use of the limited space, only part of each boundary cell is shown. This method is easy to implement and provides a final field that is more reliable and less influenced by the extrapolated boundary cells of the global domain generated at the earlier levels of subdivision. The Author therefore implemented this method into the framework presented in this thesis.

6.3.3 Anti-correlation movement

Spencer (2007) discussed the findings of Fenton (1994), which revealed that the LAS method exhibits some correlation in the point variance over thousands of realisations. The method has a systematic bias in the variance field when used in two or more dimensions. The variance of the observed fields generally lies around that expected from LAS theory; however, it is the pattern of the variance field that is of concern, as shown in Figure 6.8, where the variance shown for each cell is the mean over 50000 generated fields; a “tartan” pattern is clearly visible within the data. There is anecdotal evidence suggesting that this may influence RFEM results for slope stability analyses where there is depth dependency in the undrained


Figure 6.6: Schematic flowchart of the boundary cell procedure (calculate the required level of subdivision (Section 6.3.2); if min(Cx, Cy, Cz) ≥ 5 cells, reallocate the edge cells as boundary cells, otherwise generate boundary cells (Section 3.6.3); LAS subdivision (Chapter 3); repeat until the desired field resolution/level is reached; post-processing (Section 6.4.5)).


Figure 6.7: A 2D illustration of the self-generating boundaries in the LAS method:
• (a) Initial field (global mean): the initial cell is generated with a value equal to the global mean; the 8 boundary cells (green) generated to complete the 3 × 3 neighbourhood for the 2D system are also set equal to the global mean.
• (b) Subdivision level 1: after the first level of subdivision a new boundary of 12 cells is generated to complete the 3 × 3 neighbourhood of each subdivided cell; the boundary cell values are an extrapolation of the subdivided cells, maintaining the integrity of the correlation structure (Spencer, 2007; see Section 3.6.3).
• (c) Subdivision level 2: following the same method as for level 1, a new 20-cell boundary is generated to complete the neighbourhood, in preparation for the next subdivision.
• (d) Subdivision level 3: the 8 × 8 subdivided cells become self-sufficient and no longer require a generated boundary to continue subdividing; the outer cells of the domain are surrendered to form the boundary cells (blue), reducing the area of subdivision to a 6 × 6 domain.
• (e) Subdivision level 4: the subdivision from level 3 to level 4 generates a 12 × 12 domain, illustrating that the process is self-sufficient, as the number of cells continues to grow at each stage whilst the outer cells are surrendered for the boundary; the process continues in this way until the desired field size is produced, and the original extent of the global domain can also be seen outlined.

Figure 6.8: Illustration of “tartan” patterning in the variance field of 50000 realisations of a 64 × 64 × 64 field, generated with a cell size of 1m and θ = 4m (colour scale from 0.50 to 1.5).

Figure 6.9: Illustration of the process of random movement, showing part of the generated domain (zoomed to the origin) and the location of the required field after a random movement of 5 × 9 cells.

shear strength, cu; a claim the Author has yet to investigate, it being beyond the scope of this thesis. Spencer (2007) proposed and implemented a random movement of the required field within the subdivided domain to remove the point variance correlation. In his implementation, the required 3D field is moved (with respect to the entire generated domain) by a random movement of Rx, Ry, Rz ≤ 16 cells for each realisation of the field. This occurs in post-processing and requires that the generated field has the added dimensions required; that is, the generated field should be Rx, Ry and Rz cells larger in each respective dimension than the actual required field.

The Author realised that this movement requires further unnecessary cell subdivision, as the domain cells through which the required random field moves are generated only to be discarded, as shown in Figure 6.9. The figure shows the generated domain, focused on the origin, and the position of the required field (in red), shifted relative to the whole domain by the prescribed random movement; the cells that may be discarded due to this random movement can thus be seen. The Author proposes to incorporate this movement into a cell reduction method similar to that proposed in Section 6.3.1. In this case the appropriate reductions will be taken from the origin, i.e. the planes defined by x = 0, y = 0 and z = 0. These reductions have the effect of moving the final field away from the


origin of the domain by the random distance generated. Spencer (2007) proposed that this random movement should be ≤ 16 cells in each direction, the reason being that the “tartan” patterning seems to follow the structure of the subdividing cells, i.e. the lines of the “tartan” are aligned at 1/16, 1/8, 1/4 and 1/2 of the full domain from which the field is taken. Spencer (2007) felt that a movement of up to 16 cells should remove this patterning to a degree that was satisfactory. The random movement is generated for each axis using the same serial RNG discussed in Chapter 5; the uniform output, in the range 0 to 1, is transformed to generate a number between 0 and 16.

By converting the value of the required random shift in each direction into a 4-bit binary number, a reduction profile is generated, in which the shift is represented by reductions corresponding to the bits of the binary representation; the movement range is consequently reduced to 0 to 15 cells, 15 being the maximum number representable by a 4-bit binary number. In the proposed method, the random movements along each axis are generated in the same way as in the original implementation, using the same RNG and producing uniform random numbers transformed to fit within the range 0 to 15. This reduction profile works because a reduction made at an earlier level of subdivision doubles in size at each subsequent level. Therefore, taking the 0s and 1s of the binary representation to represent cell reductions in each axis of the generated domain after the final 4 levels of subdivision gives a total reduction equal to the prescribed random movement, as shown in Figure 6.11.

This reduction in the size of the generated domain (with respect to the initial global domain) has the effect of moving the required field by the random movement, as shown in Figure 6.10, where it can be seen that the required field is taken from the generated domain. Each realisation of the final field is taken from the same location relative to the final domain generated; however, the random cell reductions of the generated domain relative to the initial global domain have the effect of moving the required field relative to the global domain, as shown in Figure 6.10. The figure shows the global domain, the generated domain and the field taken. In this new method the global domain (green) is not generated; instead a subsection of this domain is generated (blue), using the prescribed random cell reductions, from which the field (red) is taken. As the generated domain reduces in size, the location of the field relative to the global domain changes.

A simple example is given in Figure 6.11, where a random movement of 11 cells in a particular axis direction is required. First this random number is converted into binary to provide the following reduction profile:

   Level:    level − 3   level − 2   level − 1   level
   Binary:   8           4           2           1
   11 =      1           0           1           1

where level is the number of levels of subdivision. Therefore, to achieve the required random movement of 11 cells along this axis, the final 4 levels should be reduced by 1, 0, 1 and 1 cells respectively. As shown in Figure 6.11, these reductions do indeed provide the movement of 11 cells away from the boundary; the figure illustrates these reductions focused on the domain boundary.
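Reading the shift's bits from the most significant downwards gives the per-level reductions directly; a minimal sketch is shown below, with the routine and argument names being illustrative only.

  subroutine shift_to_profile(shift, nbits, profile)
    ! Decompose a random shift (0 .. 2**nbits - 1) into per-level cell reductions,
    ! ordered from the earliest of the final nbits subdivision levels to the last.
    implicit none
    integer, intent(in)  :: shift, nbits
    integer, intent(out) :: profile(nbits)
    integer :: i
    do i = 1, nbits
       ! Bit (nbits - i) of the shift is the reduction applied at the i-th of the
       ! final nbits levels; an early reduction is doubled by each later subdivision.
       profile(i) = ibits(shift, nbits - i, 1)
    end do
  end subroutine shift_to_profile

For shift = 11 and nbits = 4 this returns the profile (1, 0, 1, 1) of the example above; larger movements simply use more bits over more levels, as discussed below.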

Further to this, the Author investigated the degree of movement required to reduce the patterning due to bias in the variance field, so as to give a more uniform variance field. Figures 6.12 to 6.15 show the results of this investigation. Fields of 64 × 64 × 64 cells, with a cell size of 1m and θ = 4m, were generated with varying random movements using 9 levels of subdivision. The fields were generated using the final parallel implementation developed in this thesis, executed on a single processor. The 3D code was used, and the 2D illustrations represent a slice through the centre of the variance field. In each figure the field is moved randomly over a different number of cells (and distance), from no movement in Figure 6.12 through to a movement of ≤ 63 cells in Figure 6.15. The random movement in each case is generated using the method discussed in this section. The point variance was observed over 1000 to 50000 realisations. Each figure shows the variance field produced both with a scale relative to its own minimum and maximum values and with a fixed scale (0.5 to 1.6, across all figures (e)-(h)) for comparison between different fields. In all the scales, red indicates higher variance values, whereas blue is indicative of lower values. This enabled the Author to observe any patterns within the fields, no matter how small, and to compare the effect of the random movement.


Figure 6.10: An illustration of the relative positions of the global domain, the generated domain and the field for two realisations, (a) and (b), with different random movements.


Figure 6.11: An illustration of the proposed implementation of the random movement method:
• Level (level − 3): after the domain has been subdivided, the first reduction of a one-cell column takes place to accommodate the random movement.
• Level (level − 2): no reduction is required at this level; however, the reduction from the previous level has now doubled in resolution.
• Level (level − 1): the field is reduced by a further column of cells, and the earlier reduction has doubled again to 4 columns of cells.
• Level (level): a final column of cells is removed after the final level of subdivision; the previous reductions have doubled once more and, combined, all the reductions contribute a full reduction of 11 columns, in effect shifting the generated field by 11 cells relative to the original domain.

Figure 6.12: Illustration of correlation within the point variance of a series of generated random fields (no random movement): (a)-(d) 1000, 10000, 20000 and 50000 realisations with individual colour scales; (e)-(h) the same results with a fixed scale.

Figure 6.13: Illustration of correlation within the point variance of a series of generated random fields (random movement ≤ 15 cells): (a)-(d) 1000, 10000, 20000 and 50000 realisations with individual colour scales; (e)-(h) the same results with a fixed scale.

Figure 6.14: Illustration of correlation within the point variance of a series of generated random fields (random movement ≤ 31 cells): (a)-(d) 1000, 10000, 20000 and 50000 realisations with individual colour scales; (e)-(h) the same results with a fixed scale.

Figure 6.15: Illustration of correlation within the point variance of a series of generated random fields (random movement ≤ 63 cells): (a)-(d) 1000, 10000, 20000 and 50000 realisations with individual colour scales; (e)-(h) the same results with a fixed scale.


The reductions required to generate the larger random movements, i.e. 16 ≤ R ≤ 63, are accommodated by cell reductions over more levels of subdivision. These numbers cannot be represented by 4-bit binary numbers; instead up to 6 bits are required for the higher end of the range, corresponding to 6 levels of subdivision.

The results clearly show that the random movement produces variance fields with greater uniformity. They also indicate that the assumption of the Author's predecessor of a random movement ≤ 16 cells, although creating a more uniform field, would still leave the observed “tartan” patterning. It can be seen that the structure of the pattern changes, with each increase in the movement range causing the “tartan” quadrants to become larger. The results indicate that, for this example, a random movement of ≤ 63 cells provided a uniform field with no clear patterning structure. It is the Author's opinion that this is largely because the movement is approximately the same size as the required field, such that the field itself lies within one of the “tartan” quadrants. Clearly, parameters such as the cell and initial domain sizes, the number of levels of subdivision and the spatial variability (as characterised by the scale of fluctuation) contribute to this effect, and this topic should be further investigated as part of any future research. It is unclear what effect this pattern would have on any Monte Carlo results obtained; however, it may lead to spatial bias within any results taken from them.

The Author decided to use a random movement larger than the required field in the work presented in this thesis. Although this larger movement would have led to significant increases in both memory and time requirements in the original implementation, the proposed improvements mean that many of the extra cells required in the generation are cropped from the domain and never subdivided, thereby minimizing both the memory and time costs placed on the code by this increase.

6.3.3.1 Schematic flowchart of the implementation of the anti-correlation movement

Figure 6.16 shows a flowchart representing the basic method used to move the field, by the generated random cell movement in each direction, within the original domain. Once again, it is placed in the context of the basic LAS subdivision implementation, to give the reader a clearer idea of its functionality and ordering within the overall LAS random field generation procedure.

6.3.4 Conclusions

After studying the current serial random field code extensively, the Author noted some significant inefficiencies within the random field generation. Some cells were generated and subdivided only to be discarded, while the generated boundary cells led to unreliable zones within the domain and their use was inadvisable. The methods proposed by the Author reduce many of these inefficiencies and approximations; some cells remain that are generated and then immediately discarded, but they are relatively few and have been minimized in the coding. The Author highlighted three areas of inefficiency and presented the following solutions:

• Domain reduction: this reduces the domain resolution to only those subdividing cells required to generate the final field, thereby reducing the subdivision and generation of redundant cells.

• Boundary generation: this method redefines the boundary and subdividing domain, thereby eliminating the need to generate an extrapolated and error-inducing boundary.

• Random anti-correlation movement: this method takes the general concepts used by Spencer (2007) to reduce correlation and patterning within the variance field, and implements them in a more robust and efficient way, by eliminating redundant cell generation within the process.

These proposed changes were implemented into the parallel version of the code, along with other less noteworthy optimizations, such as tidying up the looping within the code to provide better memory caching, and compiler optimization by flagging the compiler to optimize certain aspects of the code with flags such as -O3. These were simple changes which are common and standard practice, and use standard tools that are available.
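As an example of the latter (illustrative only; the actual compiler, MPI wrapper and source file names used on Horace may differ), such a flag is simply passed at compile time:

  mpif90 -O3 -o rfield_gen las_generator.f90 parallel_rng.f90 main.f90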

6.3.4.1 Merging the techniques

The three techniques highlighted in this chapter have been implemented within a single code. These methods have some interdependencies which require attention and adjustment for them to work together in a single routine.


Figure 6.16: Schematic flowchart of the anti-correlation random movement method (generate the random movements Rx, Ry, Rz ≤ Cmax for the x, y and z axes and compute the corresponding reduction profiles RX, RY and RZ (Section 6.3.3); after each LAS subdivision (Chapter 3), reduce the cells in each axis that requires reduction/shifting by RX(k), RY(k) or RZ(k), incrementing the level counter k; repeat until the desired field resolution/level is reached; post-processing (Section 6.4.5)).


Both the domain reduction and the random anti-correlation movement techniques reduce the resolution of the generated domain (the domain in terms of number of cells) after each subdivision. The random anti-correlation movement is taken from the origin side of the domain, i.e. the planes defined by x = 0, y = 0 and z = 0, to move the field away from this origin. The Author therefore took the domain reduction from the opposite ends of the domain, i.e. the planes defined by x = max(x), y = max(y) and z = max(z). The reasoning behind this is to balance the reductions and to move the final domain to a more central location, away from the original domain boundaries, as shown in Figure 6.17.

Figure 6.17: An illustration comparing the locations of cell reductions: (a) reductions taken from the same side; (b) reductions taken from opposing sides.

The figure shows that taking the two types of reduction, associated with the anti-correlation movement and the domain reduction (highlighted in blue and yellow respectively), from the same locations on the domain results in a field that is taken from cells closer to, or against, the opposing boundary. Figure 6.17(b) shows that taking the cells from opposite sides of the domain results in a more centralized final domain, away from all boundaries. The generated domain is distanced further from the boundary of the global domain by the cell reductions associated with the redefining of the boundary and subdividing cells proposed in Section 6.3.2. However, this reduction in the number of subdividing cells has implications for the domain reduction method, in that the reduction profile of Section 6.3.1.1 is based on the evolution of cells without this reduction, i.e. it is calculated assuming a doubling of resolution at each subdivision level. This evolution of cells is altered when the boundary is redefined, to give:

226

CHAPTER 6. PARALLEL RANDOM FIELD IMPLEMENTATION

Clevel = 2Clevel-1 for level 6 3 (6.3)

Clevel = 2(Clevel-1 − 2) for level > 3

where C is the resolution in each direction, x, y and z, and where the subscript refers to the level of subdivision that it is taken from. When combining these techniques into a single code, the new cell evolution, defined by the boundary cell implementation and shown in Equation 6.3, is followed in producing the Reduction profile in Section 6.3.1.1. Note that, although the discussion and explanation of the methods are treated separately, in reality the reduction of cells for the three methods is agglomerated into a single action.

6.4. PARALLEL IMPLEMENTATION

6.4

227

Parallel Implementation

This thesis has discussed the LAS random field theory, random number generation, reasoned their parallelisation and concluded its viability. The initial part of this chapter has focused on general improvements to the methodology, coding and implementation of the LAS. In this section the Author brings all these components together and presents the implementation of the parallel LAS random field generator. As previously discussed in Chapter 2, this implementation has been designed for use with a parallel FE code, on a distributed memory platform (see Section 2.2.2), to produce a parallel RFEM framework. The coding language is Fortran 95; the language in which the all previous codes have been produced in this area in the Author’s research group. The MPI libraries will be used to communicate data between the storage on the distributed system. It was also primarily executed on the University of Manchester high performance computer “Horace”, documented in Section 2.6.4 and this is therefore reflected in its design and optimization, using suitable compilers and corresponding flags as an example. To aid in the discussion, a schematic flowchart will be presented that highlights the various processes involved in the implementation and showing the relative ordering of the processes. At its simplest level, the LAS random field generator implemented in this thesis can be represented by Figure 6.18. It compromises 4 main components: 1. Initialisation, 2. Serial Generation, 3. Parallel Generation, 4. Post-processing. The remainder of this section breaks down each of these components and dissects the tasks within them, presenting, where necessary, the Author’s reasoning behind particular methods or approaches.

6.4.1

Initialisation

The initialisation phase of the implementation sets up the variables required by the code; it also assigns any data from files to the relevant variables. This section

228

CHAPTER 6. PARALLEL RANDOM FIELD IMPLEMENTATION

Figure 6.18: Schematic flowchart overviewing the implementation of the Parallel LAS Random Field Generator.

6.4. PARALLEL IMPLEMENTATION

229

Figure 6.19: Schematic flowchart of Initialisation component.

also includes any calculations which are conducted once to form the constants used within the code. Figure 6.19 illustrates the schematic flow chart for Initialisation. This process is the combination of five other smaller tasks. The following subsections explain these tasks.

6.4.2

Initialising parameters

This is the initialisation of the parallel environment, the associated variables and the input of parameters required to generate the random field.The data input is usually achieved using a datafile; Figure 6.20 is a typical datafile for use with the

230

CHAPTER 6. PARALLEL RANDOM FIELD IMPLEMENTATION

generator (the parameters are identified for the reader’s benefit). The code uses these input data to allocate arrays and to calculate parameters such as the level of subdivision. The datafile in Figure 6.20 would generate a 3D field of 64m×64m×32m, with a resolution of 128 cells × 128 cells × 64 cells, using 0.5m cells. The scale of fluctuation is 4m and the field has a target mean, with no depth dependence, of 40 (units dependent on the property) and a target standard deviation throughout the depth of 20. The anisotropy of heterogeneity is 4, meaning that the scale of fluctuation is 4 times larger in the horizontal plane than the vertical direction. The field is based on a normal distribution. From the anisotropy variable, aniso, the amount by which the final domain requires both squashing and stretching is calculated and the required domain resolution is altered to reflect this. The squashing and stretching processes introduce anisotropy of the heterogeneity into the field and are discussed in further detail in Section 6.4.5.2. 6.4.2.1

Generate random movements (reductions) for anti-correlation

Using seed2 as the RNG seed, three uniformly distributed random numbers between 0 and 1 are generated using the serial RNG discussed in Chapter 5. The uniformly distributed numbers are transformed to give a range, 0 to Cmax , for each of the axes. These are the anti-correlation random movements, Rx , Ry and Rz . (See Section 6.3.3.) These numbers are then converted to their binary representations to form the subsequent random reduction profiles, RX, RY and RZ, for this method. These values of the random movements are subsequently used in the evaluation of the number of levels of subdivision required. 6.4.2.2

Compute number of levels of subdivision

The number of levels of subdivision are computed, using the parameters xcells, ycells, zcells, the random movements (Rx , Ry and Rz ), the anisotropy of heterogeneity, and with a knowledge of the evolution of cells based upon the cell reductions in redefining the boundary cells. The required field is of resolution xcells × ycells × zcells and therefore, with the addition of the random movements, Rx , Ry and Rz and before the

6.4. PARALLEL IMPLEMENTATION

128

! xcells

- Required field resolution in the x-direction (cells)(+ve integer)

128

! ycells

- Required field resolution in the y-direction (cells)(+ve integer)

64

! zcells

- Required field resolution in the z-direction (cells)(+ve integer)

0.5

! cellsize

- Cell dimension of the final field (m)(+ve double precision)

4

! theta

- Scale of fluctuation, θ, (m)(+ve double precision)

-260281

! seed1

- Seed to initialize the RNG for random field generation (-ve integer)

-190952

! seed2

- Seed to initialize the RNG for random movement using the anti-correlation method (-ve integer)

40

! meantop

- Mean parameter value at top of the field

40

! meanbottom

- Mean parameter value at bottom of the field

20

! stand devtop

- Standard deviation of the parameter value at top of field

20

! stand devbottom

- Standard deviation of the parameter value at bottom of field

4

! aniso

- Anisotropy of the heterogeneity

.false.

! lognormal

- Cell value distribution. (.true. = lognormal distribution, .false. = normal distribution)

Figure 6.20: Typical input data file.

231

232

CHAPTER 6. PARALLEL RANDOM FIELD IMPLEMENTATION

squashing and stretching of the field to model the anisotropy of heterogeneity, a final domain resolution of at least: 

   xcells + Rx ycells + Ry cells × cells × ((zcells + Rz ) × squash cells) stretch stretch

(6.4)

is required. Knowing the required minimum domain required and from the cell evolution defined in 6.3, the maximum number of levels of subdivision can be obtained. 6.4.2.3

Compute a and c coefficients

The a and c coefficients are constant throughout the subdivision process and require calculating just once in the code. The theory and calculations for a and c are given in Chapter 3. 6.4.2.4

Calculate the domain reduction profile

The domain reductions are calculated as described in Section 6.3.2 and take account of the change in cell evolution brought about by the introduction of the boundary method implemented by the Author, as discussed in Section 6.3.4.1; specifically, the cell evolution defined by Equation 6.3.

6.4. PARALLEL IMPLEMENTATION

6.4.3

233

Serial Component of Generation

In order to proceed, the parallel generator requires an initial domain to decompose and continue subdividing. Therefore, in the early stages a domain is generated using a serial approach within the parallel code, with each processor generating the same domain. This domain is the seed for the parallel generation and provides the structure and variance from which the field is generated in parallel. This approach is faster than generating the initial “seed domain” on a single processor and then communicating it across all processors; as the communication would slow this part of the generation. Figure 6.21 shows a flow chart of the steps associated with this initial generation and the following sections briefly describe the processes involved at each step, up to and including the criteria for decomposing the domain.

Figure 6.21: Schematic of Serial Generation Component.

234

CHAPTER 6. PARALLEL RANDOM FIELD IMPLEMENTATION

6.4.3.1

Define Boundary Cells

The method described in Section 6.3.2 for defining the boundary cells of the domain is implemented. As before, the domain boundaries are generated using the approach of Spencer (2007), as discussed in Section 3.6.3, and continues to be generated until the third level of subdivision, where upon the outer cells are redefined as the domain boundary and are no longer subdivided.

6.4.3.2

LAS Subdivision Process

The LAS subdivision, set out in Chapter 3, is executed across all processors, generating the same subdivided cells on all processors.

6.4.3.3

Domain Reduction

The domain reduction method, discussed in Section 6.3.1, is implemented. However, it is adjusted to limit reductions so that the resolution at level 3 and beyond is, Cx , Cy , Cz > 5; in this way the domain becomes self-bounding as expected, and the cell refinement and evolution continue with each subdivision.

6.4.3.4

Decomposition Criteria

As stated, an initial domain or “seed domain” is generated to seed the parallel portion of the implementation. This “seed domain” is decomposed on all the processors and the subdivision process continues separately on each processor. The resolution of the seed domain must be large enough across the decomposing axis, so as to place a plane of at least 1 cell width on each processor and of at least 2 cells width for the first and last processors. This is to enable the decomposition of the domain on all the processors. Note that the first and last processors require an extra cell width, as both of the outwardly facing planes are needed for use as the new boundary, thus leaving at least 1 cell width for subdivision. Therefore, for this criterion to be met: y > n + 2 cells

(6.5)

where y is the number of cells in the y-direction (which is the axis over which the domain is decomposed in this implementation) and n is the number of processors.

6.4. PARALLEL IMPLEMENTATION

235

If the serial field does not meet the criterion, the domain is subdivided further within the serial component, until this criterion is met and the domain can be decomposed correctly across the number of executing processors. It should however be noted that, in the case of use on 2 processors, the field is not decomposed until after level 3 of the subdivision, whereas the equation above stipulates that only 4 cells are required along the relevant axis, corresponding to level 2. This is because the Author wants to ensure that the domain is self sufficient with respect to boundaries before decomposition.

236

CHAPTER 6. PARALLEL RANDOM FIELD IMPLEMENTATION

6.4.4

Parallel Component or Generation

At this point in the implementation, the generation of the domain changes from that of a serial approach to the parallel approach. Figure 6.22 illustrates a schematic flowchart of the processes involved and below an overview of each is presented. 6.4.4.1

Domain Decomposition

The “seed domain” is decomposed across all the processors generating the field, following the profile laid out by the decomposition criteria previously stated. This decomposition allocates the planes (i.e. slices) of the domain equally to all processors, with the condition that when N = 1 or n, then y > 2 cells (where N is the ID number of the processor and n the number of processors), to meet the boundary requirements previously stated. If the resolution in the decomposing direction is greater than required, i.e. y > n + 2, then the remaining slices of domain are allocated to processors as to balance the work load, giving preference to processors N = 1 and N = n in an attempt to balance the loss of the outwardly facing cell planes to the boundary. Figure 6.23 illustrates this decomposition. In the example illustrated by the figure, the boundary cells are indicated in blue, showing that only processors 1 and n require a boundary in the planes defined by y = 0 and y = max y; these two processors require at least 2 slices (planes) of cells, as they lose one slice to the boundary, thus leaving one slice to subdivide. 6.4.4.2

Resetting the RNG seed

As discussed in Chapter 5, the parallelisation of the RNG was achieved by assigning different seed numbers to each of the processors, so that each processor proceeds with a different random number sequence. Hence, at this point the seed numbers are changed on each processor. This is achieved by subtracting a processor based integer from the seed on each processor. The Author chose to use the following equation: seedN = seed − 1000000 × N

(6.6)

where seedN is the seed number on the corresponding processor, N . This was

6.4. PARALLEL IMPLEMENTATION

Figure 6.22: Schematic of Parallel Generation Component.

237

238

CHAPTER 6. PARALLEL RANDOM FIELD IMPLEMENTATION

y (a) Seed domain

Processor 1

Processor 2

Processor 3

Processor 4

(b) Decomposed seed domain

Figure 6.23: Example of “seed domain” decomposition.

Processor 5

6.4. PARALLEL IMPLEMENTATION

239

chosen as the Author believed that the seed numbers would not be repeated on any processor in a single Monte Carlo analysis with the original seed number reducing by 1 at each realisation. 6.4.4.3

Define boundary cells

The boundary cells are defined in the same fashion as before, as discussed in Section 6.3.2 and set out by the flowchart in Figure 6.6. However, at this stage in the generation the boundary cells are self generating and are the outer cells of the domain. As previously noted, in the direction of decomposition, in this implementation the y-axis, the boundaries only require defining on Processors 1 and n, as shown in Figure 6.23, where in the 2D example the boundary cells are indicated in blue. 6.4.4.4

Communicate Ghost Regions

As discussed previously, a neighbourhood of 3 × 3 × 3 cells is required for the subdivision in 3D, so there is a need for a ghost region. This ghost region is the cells on neighbouring processors required to complete the cell neighbourhoods of the planes of cells adjacent to these partitions. This need for communication was illustrated in Figure 4.3. Therefore these regions are communicated between processors, wherein each processor communicates and receives the relevant ghost regions (i.e. a slice of cells) from its neighbours. 6.4.4.5

LAS Subdivision Process

The process of subdivision remains unchanged in the parallel approach, as it takes a single cell and its corresponding neighbourhood and subdivides it into 8 cubic cells. The background theory and implementation are described in Chapter 3. 6.4.4.6

Domain Reduction

The domain is reduced as described before in Section 6.3.1 and Figure 6.3. However the reductions in the x and z directions take place across all processors, while those in the decomposed y direction only occur on processor n as shown in Figure 6.24, where the reduced cells are indicated in blue. Note, that the reductions can

240

CHAPTER 6. PARALLEL RANDOM FIELD IMPLEMENTATION

Processor 1

Processor 2

Processor 3

Processor 4

Processor 5

Figure 6.24: Example of Domain Reduction across multiple processors. be of multiple cell widths and, therefore, in some cases the reductions may cross onto other processors; i.e. n − 1, then n − 2, etc. 6.4.4.7

Anti-correlation movement

After the subdividing process, the anti-correlation movement is applied to the generated domain in the form of a one cell reduction in the relevant axis, as prescribed by the corresponding reduction profiles, RX, RY and RZ, for this method generated during the initiation process. This method is discussed in Section 6.3.3 and shown in Figure 6.16. The number of subdivision levels after which the reductions take place is based up on the size of the binary representation of the required random movement, R. As for the domain reduction technique in the previous step, the reductions in the x and z directions take place across all processors, while those in the decomposed y direction occur only on processor 1, as illustrated in Figure 6.25 6.4.4.8

Load Balancing or Cell Redistribution

The combination of cell reductions, i.e. from the boundary cells, domain reduction technique and anti-correlation movement method, means that the domain decomposition is likely to be imbalanced, as illustrated in Figure 6.26. To re-balance the work load, or to redistribute the cells to produce the final field correctly decomposed, communication between the processors occurs, with

6.4. PARALLEL IMPLEMENTATION

Processor 1

Processor 2

Processor 3

241

Processor 4

Processor 5

Figure 6.25: Example of anti-correlation movement across multiple processors.

Processor 1

Processor 2

Processor 3

Processor 4

Processor 5

Figure 6.26: Typical example of an imbalanced decomposition.

242

CHAPTER 6. PARALLEL RANDOM FIELD IMPLEMENTATION

Processor 1

Processor 2

Processor 3

Processor 4

Processor 5

(a) Domain re-balancing

Processor 1

Processor 2

Processor 3

Processor 4

Processor 5

(b) Balanced domain

Figure 6.27: Typical example of rebalancing a decomposed domain.

slices of domain being communicated to balance the domain, using the same criteria as in the original decomposition. Figure 6.27 illustrates this communication for the previous example and the resulting balanced domain. Note that these last three tasks, the domain reduction, the anti-correlation movement and the load balancing and redistribution of cells, are agglomerated in the coding into a smaller task; this increases efficiency, accelerates the generation of the required field and reduces interprocessor communications. However, to provide clarity in the function of these methods and the overall implementation, they are discussed as individual tasks within this thesis.

6.4. PARALLEL IMPLEMENTATION

243

Figure 6.28: Schematic of Post Processing Component.

6.4.4.9

Exiting the loop

After the required number of subdivision levels, the correct domain has been generated, i.e a domain that can be post-processed to generated the final field. The parallel generation is exited and the final domain enters the post processing component of the implementation.

6.4.5

Post Processing

The field produced from the main subdivision algorithm is an isotropic Gaussian field, with mean, µ = 0, and standard deviation, σ = 1. This field requires transformation to account for the required statistical parameters and anisotropy of the desired field. This section discusses this post-processing and the implications for these transformations within a parallel structure. Figure 6.28 presents the basic schematic flow chart for the post processing in the parallel implementation. It should be noted that it has an identical structure to that which would be implemented in a serial code, as the post processing of

244

CHAPTER 6. PARALLEL RANDOM FIELD IMPLEMENTATION

each decomposed section of field on each processor is independent of the other sections, therefore requiring no communication. The different processes indicated in the flow chart are discussed in more detail below.

6.4.5.1

Transformation

The generated field is based on a standard normal (Gaussian) distribution and needs to be transformed to satisfy the mean, µ, and standard deviation, σ, of the required distribution. The initial Gaussian field is easily transformed to the desired normal field, with statistics, µx and σx using: Zx = µx + σx Z

(6.7)

where Z and Zx are the original cell value and new cell value respectively and the subscript, x, identifies the parameters and values associated with the new field. This equation can be adapted to incorporate parameter depth-dependency, which is a common occurrence in geo-technical application, that is,   d d Zx = µtop + (µtop − µbot ) + σtop + (σtop − σbot ) Z H H

(6.8)

where subscripts, top and bot, refer to the parameter values at the top and bottom of the domain, respectively, d is the depth of the centroid of Z and Zx , and H is the full depth of the field. Also, within geotechnics it is often necessary to use log normal or other similar statistical distributions for applications. These can be modelled by transforming the Gaussian field by an appropriate function. As these procedures are spatially independent, requiring only the working cell, this post-processing is unaffected by parallelism and requires no modification to the original algorithm. As such, each cell is subjected to the relevant equation using the required statistical parameters.

6.4.5.2

Anisotropy of the heterogeniety

Soil occurs naturally in anisotropic heterogeneous deposits and so there are likely to be different scales of fluctuation in the vertical and horizontal planes. The LAS method can be used to produce anisotropic fields, by adapting the algorithm to

6.4. PARALLEL IMPLEMENTATION

245

produce a field with an ellipsoidal correlation structure, although Fenton and Vanmarcke (1990) indicated that the overall statistics of such anisotropic processes are poorly preserved by such a method; in particular, when further transforming into a non-Gaussian distribution. However, a simple transformation of an isotropic field to an anisotropic field was suggested by Vanmarcke (1983) and later developed by Hicks and Samy (2002b). To account for anisotropy in the domain, the field is squashed and/or stretched to produce the required scales of fluctuation in the axes directions. Stretching the domain has the effect of increasing the horizontal scale of fluctuation θh , although this extrapolation reduces the accuracy of the field. Squashing the field reduces the vertical scale of fluctuation, θv ; although squashing can also be used to manipulate the anisotropy in the horizontal plane, by setting the original untransformed θ to be greater than that required. Squashing the field causes a reduction in the value of θv , while maintaining the value of θh . In soil applications θh >> θv due to the nature of soil deposition. The degree of anisotropy of the heterogeneity of the soil, ξ, is given by: ξ=

θh θv

(6.9)

where ξ, for simplicity in the current implementation, is maintained as an integer.

6.4.5.3

Squashing

In this parallel implementation, the domain is only decomposed horizontally, i.e. into vertical slices, and so the method of squashing remains unchanged (Spencer, 2007). The method interpolates the data, by averaging ξ cells in the vertical direction, to produce the required degree of anisotropy. By initially generating an isotropic field with θ = θh and then squashing, by taking the average of ξ vertical cells, a field with the desired anisotropic attributes is produced. Figure 6.29 shows the effect of squashing a 2D field, for ξ = 2 and ξ = 5. Figure 6.30 shows the method of interpolation used in squashing the domain. The example shows a column of elements that has been squashed by a factor of 4. Each cell of the squashed domain is produced by averaging ξ vertical cells, in this case 4 cells, from the un-squashed domain. This process occurs in the vertical direction, which is represented in full on each processor; hence this interpolation can occur independently on each processor, with no need for communication.

246

CHAPTER 6. PARALLEL RANDOM FIELD IMPLEMENTATION

(a) Standard

(b) Squashing = 2

(c) Squashing = 5

Figure 6.29: Illustration of Squashing.

Figure 6.30: Illustration of Squashing process by column averaging (ξ = 4).

6.4. PARALLEL IMPLEMENTATION

247

Therefore the process is unaffected by parallelisation and the serial algorithm can be implemented. 6.4.5.4

Stretching

It is not always possible to squash the produced domain, often when memory constraints are a problem. In these cases the field is stretched to produce the effect of anisotropy. Figure 6.31 illustrates the effect on θ that stretching creates. As this

(a) Standard

(b) Stretching = 2

(c) Stretching = 5

Figure 6.31: Illustration of stretching. stretching is in the horizontal plane and therefore crosses the boundaries between processors, it is difficult to interpolate between stretched cells in the manner suggested by Spencer (2007), due to the communication required between processors. Figure 6.32 illustrates Spencer’s (2007) method of interpolation, where the values of the cells produced by stretching are a spatial interpolation of the cell values they were generated from.

(a) Standard

(b) Stretching = 2

(c) Stretching = 5

Figure 6.32: Illustration of the interpolation method of stretching (Spencer, 2007). Instead, a more basic method of stretching is employed, that is spatially independent and maintains the statistics of the overall field. In this method, followed

248

CHAPTER 6. PARALLEL RANDOM FIELD IMPLEMENTATION

by Samy (2003), the field values are repeated by the required number of cells in order to stretch the field, as illustrated in Figure 6.33. This method, although

(a) Standard

(b) Stretching = 2

(c) Stretching = 5

Figure 6.33: Illustration of the Author’s Method of Stretching. advantageous with regards parallelism, reduces the realism of the field. Note that squashing and stretching have to be accounted for, not only in the scales of fluctuation, θh , θv , but also in the resolution and dimensions of the original Gaussian field generated. These parameters are automated within the coding implementation, for both ease of use and to maintain the efficiency of the code.

6.5. RESULTING FIELDS

6.5

249

Resulting Fields

The following figures show visualisations of example fields generated by using the above methods, both decomposed across multiple processors and merged to represent the entire field. Figure 6.34 illustrates a standard field, where spatial variability of the field is isotropic, while in Figure 6.35 the field is anisotropic, with the horizontal plane having a larger scale of fluctuation than the vertical direction.

250

CHAPTER 6. PARALLEL RANDOM FIELD IMPLEMENTATION

(a) Decomposed Field

(b) Merged Field

Figure 6.34: Generated Random field.

6.5. RESULTING FIELDS

251

(a) Decomposed Field

(b) Merged Field

Figure 6.35: Generated Random field - Anisotropic field.

252

6.6

CHAPTER 6. PARALLEL RANDOM FIELD IMPLEMENTATION

Validation

In the Author’s implementation, the generation of the subdivided cells, expectations and other parameters remain unchanged from those in the implementation by Spencer (2007). It could therefore be assumed that the validation work presented by Spencer is still valid. However, the Author has conducted further validation, over multiple processors for varying parameters, to ascertain the validity of the new parallel implementation.

6.6.1

Global Mean and Cell Value Distributions

An initial validation of the random field generation was to evaluate the distributions of generated field values; that is, the distribution of random field cell values and the distribution of the field means. Figure 6.36 shows the frequency distribution of the individual cell values, for 250 generated fields of size 64×64×64 cells, produced for a Gaussian distribution. It clearly shows that the distribution of the values is as expected and consistent with a mean, µ = 0, and standard deviation, σ = 1.

0.45 Field data 0.4

Standard normal distribution (µ=0, σ=1)

0.35

pdf

0.3 0.25 0.2 0.15 0.1 0.05 0 -5

-4

-3

-2

-1

0

1

2

3

4

5

Property Value

Figure 6.36: Probability density function of random field cell values from 250 fields.

6.6. VALIDATION

253

1 Field means Normal distribution (µ=0, σ=0.5) 0.8

pdf

0.6

0.4

0.2

0 -4

-3

-2

-1 0 1 Mean field property value

2

3

4

Figure 6.37: Probability density function of random field means over 1000 fields. Similarly, the distribution of the means of the generated fields was plotted, with the expectation that these values should follow a normal distribution. Figure 6.37 shows the results for the given example, taken over 1000 realisations. The figure shows that the overall means of each field follow a normal distribution. The field means within the figure have been fitted by observation with the normal distribution and shows a good fit with a mean, µ = 0 and standard deviation of σ ≈ 0.5. A reduced standard deviation is expected due to the averaging of property values over the problem domain.

6.6.2

Correlation structure

An important aspect of the generated random fields is the correlation structure. As a means of validation, the Author compared the covariance for specified directions across each field, against those presented by Fenton and Vanmarcke (1990). These covariance analyses corresponded to the lags in the x, y, z, xy, zx, yz and xyz directions; or, given in vector form, (1, 0, 0), (0, 1, 0), (0, 0, 1), (1, 1, 0), (1, 0, 1), (0, 1, 1) and (1, 1, 1) respectively. A 3D random field, of side length D = 5m and scale of fluctuation of θ = 4m in all directions, was analysed, over

254

CHAPTER 6. PARALLEL RANDOM FIELD IMPLEMENTATION

50 field realisations. Figure 6.38 shows the results as covariance against the Lag, and are compared with the indicated exact value, Equation 3.120, and to the solutions presented by Fenton and Vanmarcke (1990), Figure 6.39.

1.5

x y z xy yz xz xyz exact

Covariance

1

0.5

0

-0.5 0

2

4

Lag (m)

6

8

10

Figure 6.38: Exact and estimated covariance functions for a 3D random field; D=5m, θ =4m, averaged over 50 realisations. (The Author). When comparing the two figures, they show similar trends, and follow the expected exact value of the covariance; however, the limited number of cases, in both examples, has limited the conclusions that can be made. The results suggest that the new implementation is as good as the original implementation of Fenton and Vanmarcke (1990). Indeed, at larger lags the Author’s implementation appears more stable than that of the original; the exception is the largest of the lags, although this is likely to be due to the limited data in this instance. Note that the starting point of each covariance line for the Author’s analysis (Figure 6.38) is constant, as is expected if all permutations of direction are taken from every cell; in contrast, the starting point of the results of Fenton and Vanmarcke (1990) fluctuates, indicating that a different approach was used when measuring the covariance. The new implementation was designed to provide better performance than before; therefore there was no reason to continue limiting this analysis to 50

6.6. VALIDATION

255

Figure 6.39: Exact and estimated covariance functions for a 3D random field; D=5m, θ =4m, averaged over 50 realisations (Fenton and Vanmarcke, 1990). fields. The Author increased the number of realisations within the analysis to 1000 fields, a sample size similar to the number of realisations used in a typical RFEM analysis. Figure 6.40 shows the results for the increased sample size. The results show that, over the 1000 realisations, the covariance functions appear to converge towards the exact function. The wavy fluctuations of the covariance lines in each direction have also stabilized, adding to the Author’s confidence in the reliability of this implementation. These conclusions are similar to those of Spencer (2007); perhaps not surprising, considering Spencer’s (2007) code was the basis for this new implementation. However, neither the Author’s results, nor Spencer’s results, are conclusively of better quality than the original fields produced by Fenton and Vanmarcke (1990), as the method of measuring the covariance structure appears to have been different, and the measure being used by the Author included all the data, rather than a smaller subset that it is suspected was used by Fenton and Vanmarcke (1990). These validations were repeated for several seeds, parameters, numbers of processors and field sizes, with similar results. Hence, the Author is confident of this implementation. However, further validation of the implementation and

256

CHAPTER 6. PARALLEL RANDOM FIELD IMPLEMENTATION

1.5

x y z xy yz xz xyz exact

Covariance

1

0.5

0

-0.5 0

2

4

6

8

10

Lag (m)

Figure 6.40: Exact and estimated covariance functions for a 3D random field; D=5m, θ =4m, averaged over 1000 realisations (The Author). several aspects of it, including the random number generation, will occur when the field has been implemented into a working example, as covered in the following chapters.

6.7. COMPUTATIONAL PERFORMANCE

6.7

257

Computational Performance

The computational performance of the random field generation is an important consideration. The generation of the field takes a small fraction of time when compared with the FE applications that it is implemented with. However, as the efficiency of FE solvers and applications increase, the need for an efficient random field generator will become more apparent. In any case, one of the aims of this thesis was to remove the memory limits placed on stochastic Monte Carlo analyses arising from the generation of random fields. Therefore the Author conducted a review of the performance of the new random field implementation, to ascertain whether the implementation was both viable and robust for future use, while meeting the aims of the thesis with respect to memory.

6.7.1

Analysis

The performance of the code was measured for a range of field sizes and numbers of processors. The random fields generated all had the same statistical parameters, to ensure that the performance measured was related to only field size and resolution. 1000 field realisations were generated, so that the results could be averaged to give a more reliable and accurate measure of the performance. Note that the resources available to the Author were limited, with around 32 processors regularly available for the analyses. This meant that the analysis, conclusions and trends are only applicable within this range of 32 processors, from which the Author extrapolates some predictions for larger machines and varying field sizes. Table 6.4 lists the analyses undertaken, characterized by various numbers of processors and domain sizes.

6.7.2

Results

The complete set of performance results is presented in Appendix A (Figures A.1 to A.12); however; Figure 6.41 is presented here as a typical representation of the results. Each figure contains 4 graphs, each of which represents the data as a measure of parallel performance and scalability as discussed in Chapter 2. The four measures considered are as follows:

258

CHAPTER 6. PARALLEL RANDOM FIELD IMPLEMENTATION Processors Field Resolution (cells) 1, 2, 4, 8, 16, 32 512 × 512 × 512 512 × 512 × 256 512 × 512 × 128 512 × 512 × 64 512 × 512 × 32 256 × 256 × 256 256 × 256 × 128 256 × 256 × 64 256 × 256 × 32 128 × 128 × 128 128 × 128 × 64 128 × 128 × 32 Table 6.4: List of random field performance analyses undertaken.

(a) Time These graphs present the time taken for the 1000 fields to be generated for various numbers of processors. The data are compared with a line of best fit, which takes the form: f (x) =

a +b x

(6.10)

where a and b are constants, with the latter considered to be the time taken by the serial aspects of the code; that is, it provides a lower limit for the execution time no matter how many processors are used. a/x is considered to be the time taken for the parallel components of the code to execute. Equation 6.10 is chosen to best represent the likely function of time with respect to the number of processors. These parameters are presented in part (a) of each figure. A theoretical line is also presented, which indicates the theoretical expected time that should be taken on the corresponding number of processors, if 100% of the code was parallelised; it is based upon the execution time on a single processor, T1 , with the relationship; T1 (6.11) n This theoretical line is taken as the minimum time that the generation can take, and is analogous to the line of linear speedup. f (n) =

6.7. COMPUTATIONAL PERFORMANCE

259

(b) Memory The memory consumption of the code is presented and compared with similar relationships to those used when considering time. The memory consumption is the amount of memory required per processor. In this case, the parameters a and b for the best fit line, Equation 6.10, are analogous to the total amount of distributed memory and the memory overhead respectively. This overhead represents the theoretical lower limit of the consumption per processor. Once again, a theoretical line is presented in the same form as Equation 6.11, using memory instead of time: M1 (6.12) n where M1 represents the memory consumption when the implementation is executed on a single processor. f (n) =

(c) Speedup The speedup is given by Equation 2.3, and is the ratio of the time taken on the multiple processors to the time taken using a single processor. The results are compared against the maximum speedup theoretically possible. This linear line illustrates the idealistic speedup, which is equal to the number of processors on which a code is executed (see Section 2.4.4). (d) Efficiency The efficiency of the code also compares multiple processor performance with single processor performance (see Section 2.4.5). Hence, it is an alternative representation of the speedup.

260

CHAPTER 6. PARALLEL RANDOM FIELD IMPLEMENTATION

×103 12 +

Best Fit Theorectical Actual

10

+

Time (s)

8 6

a = 11.535×103 b = -0.34558×103

+

4 2

+ +

0

0

5

+ 10

+

15 20 No. of Processors

25

30

35

Max. Memory (kB) per Processor

(a) Time

×103 350 + 300

Best Fit Theorectical Actual

+

250 200

a = 278.88×103 b = 37.493×103

+ 150 +

100

+

+

50 0

0

5

10

15 20 No. of Processors

+ 25

30

(b) Memory

Figure 6.41: Performance : Field : 512 × 512 × 32 cells.

35

6.7. COMPUTATIONAL PERFORMANCE

261

40 35

Speedup

30 25 20 15 10 5 0

Speedup Linear 0

5

10

15 20 No. of Processors

25

30

35

25

30

35

(c) Speedup

1.6

Efficiency

1.5 1.4 1.3 1.2 1.1 1

0

5

10

15 20 No. of Processors (d) Efficiency

Figure 6.41: cont....Performance : Field : 512 × 512 × 32 cells.

262

CHAPTER 6. PARALLEL RANDOM FIELD IMPLEMENTATION

Further to Figures A.1 to A.12, the results have been combined into Figures 6.42 and 6.43, to illustrate the performance of the implementation with respect to the number of cells in the field generated, against time and memory respectively.

6.7.2.1

Performance Conclusions

The performance results are encouraging, showing that the code scales well over multiple processors. The execution times of the implementation appear to generally scale well, with the data and best fit lines matching closely the theoretical lines presented. In the main there appears to be no precise set-up time within the code, as indicated by the backfigured values of b, that are both negative and relatively small when compared with the the larger parallel component a. It would be conventional and normal practice to include the original implementation, i.e. the serial 3D implementation of Spencer (2007), as part of the performance analysis. The Author felt that considerable alterations to the original coding in this implementation rendered its use in a full parallel performance analysis irrelevant, as the algorithm of the two codes differ significantly. However, Table 6.5 compares the performance of the original serial 3D implementation of Spencer (2007) with the parallel implementation presented in this thesis executed on a single processor.

Field Size 512 × 512 × 512 512 × 512 × 256 512 × 512 × 128 512 × 512 × 64 512 × 512 × 32 256 × 256 × 256 256 × 256 × 128 256 × 256 × 64 256 × 256 × 32 128 × 128 × 128 128 × 128 × 64 128 × 128 × 32

Original New Comparison Time(s) Memory(kB) Time(s) Memory(kB) Time (%) Memory (%) 572310.5 17879616 148213.0 4520768 25.90 25.29 567068.9 17355328 74450.1 2280064 13.13 13.14 543516.1 17093184 52272.5 1158528 9.62 6.78 553872.3 16962112 22255.1 597696 4.02 3.52 568532.1 16896576 11344.9 317312 2.00 1.88 67289.2 2245184 19992.7 588992 29.71 26.23 61117.0 2179648 10101.1 306752 16.53 14.07 55943.8 2146880 5950.3 165568 10.64 7.71 60356.5 2130496 3299.9 95040 5.47 4.46 6612.9 286272 2608.0 92224 39.44 32.22 6737.4 278080 1043.6 56512 15.49 20.32 8174.4 273984 684.3 38592 8.37 14.09

Table 6.5: Comparison of performance of original serial random field generation with that of the new implementation executed on a single processor. (Note that the comparison is the ratio New:Original.)

6.7. COMPUTATIONAL PERFORMANCE

263

105

4

10 Time (s)

+ × × ∗ ∗  ◦  4 ◦ 4 +

+ + + × × ∗ + × ∗ × ∗  ∗  ◦ 4 ◦ 4  4 ◦ 4

103

2

10

101 5 10

106

+ + × × × ∗ ∗  ◦  ◦ 4 4

+ × ∗  ◦ 4

+ × ∗  ◦ 4

1 CPU + 2 CPUs × 4 CPUs ∗ 8 CPUs  16 CPUs ◦ 32 CPUs 4

107 No. of Cells

108

109

Figure 6.42: Performance : Time against number of cells in random field.

Memory (kB)

107

106

1 CPU + 2 CPUs × 4 CPUs ∗ 8 CPUs  16 CPUs ◦ 32 CPUs 4

105

104 5 10

+ × ∗

+ + × + × ∗ ∗  × ◦ + ∗ 4 ◦ ∗ 4 ∗ ◦ × ◦  ◦ 4 4  4  + ×  4 ∗ 106

+ × ∗  ◦ 4 4

107 No. of Cells

+ + × × ∗  ∗  ◦  ◦ 4 ◦ 4 4

108

109

Figure 6.43: Performance : Memory against number of cells in random field.

264

CHAPTER 6. PARALLEL RANDOM FIELD IMPLEMENTATION

The table shows that the changes have considerably improved the performance, with respect to both memory consumption and execution time, without parallelisation. These improvements in real terms are considerable, although small relative to the finite element application with which the code will be used. It can be seen that the times and memory consumption of the original implementation are dependent on the dominant cell dimension required, remaining approximately the same when sharing a common dominant size. However the new implementation, although dependent on field size, is not constrained by this dominant cell size. From the figures presented in Appendix A, it can be seen that generally with the larger fields the memory consumption appears to scale well, following closely the theoretical lines predicted; however, for smaller fields this is not the case, with the scaling at lower processor numbers scaling well before diverging at larger numbers of processors, converging to the limit indicated by b. This value is clearly an overhead required by all processors in the implementation. It is also shown that this overhead, in all cases, is approximately 35-40 Megabytes. The speedup scaling seems initially, at lower numbers of processors, to fit closely to the linear speedup line, often bettering the line into the super-linear region. This extra performance is likely to be due to caching effects. This is when larger amounts of the required data can reside on the caches of the processors. This means that they are faster to access than conventional memory and therefore, as the problem is decomposed over greater numbers of processors, larger amounts of the data are stored in cache memory. This hypothesis is validated further when looking at the sizes of the fields. With smaller fields the speedup remains in the super linear region over a greater number of processors, with large amounts, if not all, of the data residing on the processor caches. In contrast, for the larger fields this number of processors is lower, due to less of the domain fitting on the caches and any performance benefit being outweighed by the added communication of additional processors being used in the execution. This is further highlighted in the analysis of efficiency showing that, in general, the efficiency of the code initially rises above 1, before declining to below this threshold as the code slowly becomes more inefficient. Figures 6.42 and 6.43 show the results of the performance measures, plotted against the number of cells generated in each field. The graphs are plotted using logarithmic axes. In general it shows that, with increasing numbers of cells, the

6.7. COMPUTATIONAL PERFORMANCE

265

time trend is linear, for all number of processors. Interestingly, the time trends for different numbers of processors remains at the same gradient, indicating that the rate of time increase with increasing field size is constant, no matter how many processors are used in its generation. Figure 6.43 also illustrates the per processor memory consumption with increasing field size. It is clear that, the more processors used, the lower the memory consumption; however, the curvature of the lines suggests that there will always be an optimum number of processors. The results of this performance analysis have shown that, overall, the codes have excellent scaling properties, fulfilling the dual aims of both speedup and reduced per processor memory consumption. The results show that this implementation has considerably improved the efficiency of the original serial code, when executed on a single processor. However the results also highlight the optimum number of processors these codes should be executed on. They show that, in most cases analysed beyond 8 processors, the performance benefit in real terms is small, highlighted by the efficiency results which generally fall below 1 when executed on more than 8 processors. These conclusions, although based on a small set of trial fields, can be considered as likely trends for a multitude of different sized fields, although the values given may vary.

266

6.8

CHAPTER 6. PARALLEL RANDOM FIELD IMPLEMENTATION

Conclusions

This chapter has presented a parallel implementation of LAS random field generation. It has validated the generation and a performance analysis has been conducted. It has also highlighted inefficiencies with the original implementation inherited by the Author and has implemented improvements that have dramatically improved this efficiency. Furthermore, the parallel implementation has achieved its dual goals of reducing execution time and especially reducing per processor memory consumption. The next step is to implement the random field generation within a suitable application; specifically a finite element code with a geo-engineering application. Initially, this will be performed to provide a final validation of both the parallel stochastic random FEM and the random field generation implementations, comparing the Author’s FE results with reliable stochastic results. The implementation will be further used to perform FE analyses that could not otherwise be carried out, due to memory and time constraints.

Chapter 7 Application - Groundwater Modelling 7.1

Introduction

Chapter 5 discussed the testing of random number generators and highlighted that a validation test is to implement the generator in a suitable application, comparing the results with known and accepted results. The same can also be said of the random field generator and the parallel random finite element algorithm. This chapter aims to provide some validation for the individual components, as well as a verification for the complete implementation. Verification is the process by which a piece of software is tested to see if it functions correctly computationally, i.e. that the user receives the correct outputs and that there are no problems or program crashes. In contrast, validation is the testing of the code to see if the values and output obtained from the model are correct in terms of theory. In this chapter the implementation of a simple example aims to fulfill both requirements. This implementation is an illustrative example rather than a full engineering analysis.

7.2

Groundwater Modelling

Griffiths and Fenton (1997) conducted a stochastic investigation of 3D seepage through a spatially random soil. It was based on the FE codes P59 (Program 5.9) and P7.0 (Program 7.0), in the text of Smith and Griffiths (1988) and adapted to model spatially varying soil using LAS random field generation. An aim of the 267

268

CHAPTER 7. APPLICATION - GROUNDWATER MODELLING

authors was to compare the more realistic 3D solutions to those of the idealised 2D model. It is this problem that the Author chose to use as part of the validation. The paper presents reliable results and the model is easy to implement and quick to execute.

7.2.1

Problem

This simple boundary value problem considers steady state seepage beneath a single sheet pile wall penetrating a layer of soil, as illustrated in Figure 7.1. In this case, the spatially varying soil property is the permeability, K. The steady flow was determined using a 3D finite element model, while the soil permeability was modelled using the LAS method. The figure presents the dimensions of the model, for isometric and cross-sectional views, analogous to the respective 3D and 2D flow regimes. In the figure, the dimensions Lx and Ly are held constant, while Lz is increased to monitor the effects of three-dimensionality. ∆H is the head loss across the sheet pile wall. A uniform mesh of cubic eight-node brick elements, of side length 0.2, is used in the discretisation of the domain. The authors do not indicate a measuring unit within the paper, this may be an attempt to present generalized results regardless of size. In the x direction there are 32 elements giving Lx = 6.4, in the y direction there are 16 elements giving Ly = 3.2, and in the z direction 4, 8 and 16 elements are used, corresponding to Lz = 0.8, 1.6 and 3.2 respectively. The boundary conditions are set as to make the sides, base and sheet pile = 0 in the relevant directions, i. The top surface of the wall impermeable i.e. δH δi domain is fixed with corresponding constant heads on each side of the sheet pile wall, as to model the head loss across the wall, ∆H. The permeability was considered to follow a lognormal distribution and so the Gaussian values from the LAS random field were transformed accordingly. The scale of fluctuation, θ, was taken to be the same in all directions.

7.2.2

Previous RFEM Investigation

Griffiths and Fenton (1997) also adopted the Random Finite Element Method (RFEM) for analysing this problem. The input to their model was the varying dimension, Lz , and the statistical parameters µK , σK and θln K , where the subscript, ln K indicates that the property value is taken after the transformation of

7.2. GROUNDWATER MODELLING

269

Ly

Lx Lz (a) 3D view

∆H

Ly 2

Ly

Lx (b) 2D cross-section

Figure 7.1: 3D FE seepage model (Griffiths and Fenton, 1997).

270

CHAPTER 7. APPLICATION - GROUNDWATER MODELLING

the Gaussian field to that of a lognormal distribution and the subscript K refers to the property value before transformation. µK was taken to be 10−5 , a value for a typical sand, if measured in m/s. The analysis considered 1000 realisations of the problem for each permutation of the input statistics, measuring the total flow rate through the system, Q. The mean and standard deviation of Q are computed based upon 1000 realisations ¯ given by: and are presented in a normalized non-dimensional form, Q, ¯= Q

Q ∆HµK Lz

(7.1)

For simplicity the head loss across the sheet pile wall was maintained at unity (∆H = 1) for all cases, considering its linear influence on Q. The division by Lz enables the comparison between the different dimensions as well as with 2D analysis. A parametric study was conducted with the following variations of input statistics, while µK = 1, Lx = 6.4 and Ly = 3.2 were kept fixed: σK µK

=

0.125, 0.25, 0.5 1, 2, 4, 8

θln K = Lz =

1, 2, 4, 8, ∞ (analytical) 0.8, 1.6, 3.2

The coefficient of variation of the permeability is given by υK = σK /µK . Due to the constant µK = 1, this is controlled by the standard deviation of the field, σK .

7.2.3

New Implementation

The Author reproduced this model using the parallel code P123 (Program 12.3) (Smith and Griffiths, 2004) and the parametric study previously described was recreated. The code was executed using multiple processors, varying from 1 to 32 processors, for all cases in the parametric study to check for consistency; it was found that the results with varying processors were consistent with each other. Therefore, the results obtained using 4 processors are presented. The results were compared with those of Griffiths and Fenton (1997), to provide a validation of the parallel implementation presented in this thesis. The code

7.3. RESULTS

271

produces a 3D analysis of Laplace’s equation, using 8 node brick elements and a preconditioned conjugate gradient solver using the element by element method. This code was first adapted to fit within the stochastic framework, discussed in Chapter 2, using the new parallel random field implementation. (See Chapter 6.) Within this implementation the generated random field maps the varying soil permeability values to the elements of the discretised mesh; that is, it maps the values to each element rather than to the Gauss points of the elements. The validation compared the results and analysis from Griffiths and Fenton (1997) with those produced from the new implementation. The results of Griffiths and Fenton (1997) are considered reliable and their implementation is similar to that of the Author, and therefore a credible comparison can be made. The numerical results presented by Griffiths and Fenton (1997) are limited in scope; so the validation analyses also include the comparison of conclusions based on a full parametric study with general observations made in the paper. Due to the expected increased performance of the new implementation, the Author increased the number of realisations from 1000 to 5000 for the new version. This is because the results of Griffiths and Fenton (1997) seemed to be converging to trends that the Author felt might be clearer with greater numbers of realisations. Due to the parallelisation and improvements in computational technology since 1997, when the original paper was written, the Author’s ability to viably increase the number of realisations beyond that of the previous authors was simple.

7.3

Results

Griffiths and Fenton (1997) presented a small selection of results from the parametric study, and it is these results that are presented in this thesis for validating the new implementation. The Author did not limit his validation to those results presented visually and completed the full parametric study; however, these additional results could not be compared directly in numerical terms and were instead compared with the general conclusions presented by Griffiths and Fenton (1997). Figures 7.2 and 7.3 compare the limited selection of normalized mean, µQ̄, and standard deviation, σQ̄, results presented by Griffiths and Fenton (1997) with those produced by the Author's new implementation. The results illustrate the influence of both the coefficient of variation, υK, and the scale of fluctuation, θlnK, on the normalized flow rate, Q̄, when Lz/Ly = 1.

The results in the figures are typical of those produced generally in the full parametric study and, as such, the conclusions can be generalized across the study. The conclusions from both the Author's and Griffiths and Fenton's (1997) results are in agreement and are presented below.

It can be observed that, as the coefficient of variation is increased, the mean normalized flow rate, µQ̄, falls below the deterministic value of Q̄det = 0.47. The gradient of the fall is steepest for small values of the scale of fluctuation, θlnK, with µQ̄ tending towards the deterministic value, expected from a strongly correlated permeability field, as θlnK → ∞.

The estimated standard deviation of the normalized flow rate, σQ̄, indicates little variation in Q̄ at small θlnK, even at high coefficients of variation, υK. The Author agrees with the explanation of Griffiths and Fenton (1997), who argue that the total flow rate through the continuous domain is an averaging process; that is, high flow rates in some regions are offset by lower flow rates in others. However, for large scales of fluctuation, θlnK, the variation in the normalized flow rate, Q̄, is expected to be larger. This is because there are fewer regions of differing permeability, and therefore the averaging process takes place over fewer flow regions; as such, the variance would be expected to be larger, over what is effectively a smaller statistical sample. The variance in the normalized flow rate tends towards a maximum as θlnK → ∞, at which point the permeability field is uniform within each realisation. This is intuitive, as the flow rate is proportional to the uniform permeability, and so the flow rate variance follows that of the permeability, giving:

σQ̄ = (σK / µK) Q̄det                                    (7.2)

when θlnK = ∞.

Note that, for small scales of fluctuation, θlnK, the standard deviation, σQ̄, reaches a maximum at intermediate values of υK, due to the rapidly falling mean, µQ̄, for these lower scales. The normalized flow rate tends to zero as υK → ∞, implying that σQ̄ → 0, as Q̄ is a non-negative quantity. It can therefore be said that, for any θlnK < ∞, σQ̄ = 0 when υK = 0 or ∞, reaching a maximum between the two. The maximum point moves to the right, i.e. to larger υK, for larger θlnK.
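The θlnK → ∞ limit in Equation 7.2 can be checked with a simple sampling experiment: when each realisation has a spatially uniform permeability K, the normalized flow rate is simply Q̄ = (K/µK) Q̄det, so its standard deviation is (σK/µK) Q̄det. The sketch below is illustrative only; the value Q̄det = 0.47 is taken from the text, while the mean permeability and coefficient of variation are assumed inputs.

```python
import numpy as np

rng = np.random.default_rng(1)

Q_det = 0.47            # deterministic normalized flow rate quoted in the text
mu_K, v_K = 1.0, 0.5    # assumed mean permeability and coefficient of variation

# Lognormal parameters giving the required mean and coefficient of variation.
sigma_lnK = np.sqrt(np.log(1.0 + v_K**2))
mu_lnK = np.log(mu_K) - 0.5 * sigma_lnK**2

K = rng.lognormal(mu_lnK, sigma_lnK, size=100_000)  # one uniform K per realisation
Q_bar = (K / mu_K) * Q_det                          # flow scales with the uniform K

print("sampled sigma_Q :", Q_bar.std(ddof=1))
print("Eq. 7.2 sigma_Q :", v_K * Q_det)             # sigma_K / mu_K = v_K here
```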



Griffiths and Fenton (1997) continued the investigation by comparing the results of varying degrees of three-dimensionality with a two-dimensional analysis, i.e. by varying Lz. The authors suggested that the influence of three-dimensionality is to allow greater freedom for the flow to avoid regions of low permeability, resulting in shallower gradients in the expected flow rate, µQ̄, with increasing υK. Figures 7.4 and 7.5 show the mean, µQ̄, and standard deviation, σQ̄, of the normalized total flow rate, for varying coefficients of variation, υK, and varying degrees of three-dimensionality, represented by Lz/Ly. The cases presented are for θlnK = 1 and are compared with a 2D analysis. For the Author's results (Figure 7.5), the 2D analysis was not carried out; therefore the 2D results of Griffiths and Fenton (1997) are presented as a comparison.

Comparing the figures shows that the two models compare well, further adding confidence to the new implementation. They also agree with the previous findings, as well as showing a corresponding reduction in the variation of the mean flow rates as the third dimension is elongated. As before, this is due to the increased averaging of the total flow rate over the increased volume in three dimensions. The results indicate that the 2D results could be a good first approximation to the flow through the domain, as the differences between the 2D and 3D results are not large, although the 2D analysis does tend to underestimate the predicted flow rate.

The reduced variation in the Author's results compared with those of Griffiths and Fenton (1997) is likely to be due to a number of factors. The 3D covariance functions used in the random field generation differ slightly: the Author used Equation 3.120, while Griffiths and Fenton used a covariance function based on a Gauss-Markov spatial correlation function

ρ(τ) = exp(−2|τ|/θ)                                    (7.3)

where |τ| is the lag and θ the scale of fluctuation of the material property being modelled. The generated boundary cells, discussed in Section 3.6.3, also have an influence on the generated field; the Author's boundary cells were generated as prescribed by Spencer (2007). Furthermore, the location of the random fields with respect to the larger generated domain and the number of levels of subdivision are both unknowns in Griffiths and Fenton's (1997) analysis, and both affect the generated random field. The increased number of realisations would also have some effect on the variation of the generated random fields. Although this investigation has shown a slight difference in the variation between the two models, the mean flow rate results have been consistent; therefore the Author is satisfied that the implementation is correct and that the differences are due to the reasons stated.

The implementation functioned as designed, generating the anticipated results, within the required data files, using the corresponding input. Therefore the Author was satisfied that the code and implementation were computationally verified.

7.4 Conclusions

Although this was a very limited validation, it does reaffirm the statistical validations carried out in the previous chapters for both the parallel random number generator (Chapter 5) and the implementation of the parallel LAS random field generator (Chapter 6). Furthermore, this validation goes some way towards demonstrating the viability of the new implementation of a parallel framework for stochastic FE. By successfully executing the implementation and obtaining the required results, the implementation was computationally verified. If, as intended, the stochastic framework is implemented for all applicable FE codes in a similar manner, then these too should be easily verified. The following chapter, although containing new work and analyses, will also serve to validate the model further. The developed framework would require use with several FE codes, solvers and models before full confidence could be achieved; however, with each new study using this RFEM framework, more confidence can be gained.



Figure 7.2: Influence of θlnK on the statistics of the normalized flow rate (Lz/Ly = 1) (Griffiths and Fenton, 1997): (a) mean, µQ̄; (b) standard deviation, σQ̄. Both are plotted against υK = σK/µK (0.1 to 10) for θ = 1, 2, 4 and 8, and for θ = ∞ (analytical).



Figure 7.3: Influence of θlnK on the statistics of the normalized flow rate (Lz/Ly = 1) (Author's results): (a) mean, µQ̄; (b) standard deviation, σQ̄. Both are plotted against υK = σK/µK (0.1 to 10) for θ = 1, 2, 4 and 8, and for θ = ∞ (analytical).



Figure 7.4: Influence of Lz/Ly on the statistics of the normalized flow rate (θlnK = 1) (Griffiths and Fenton, 1997): (a) mean, µQ̄; (b) standard deviation, σQ̄. Both are plotted against υK = σK/µK for the 2D case and for Lz/Ly = 0.25, 0.5 and 1.



Figure 7.5: Influence of Lz/Ly on the statistics of the normalized flow rate (θlnK = 1) (Author's results): (a) mean, µQ̄; (b) standard deviation, σQ̄. Both are plotted against υK = σK/µK for Lz/Ly = 0.25, 0.5 and 1, with the 2D results of Griffiths and Fenton (1997) shown for comparison.

Chapter 8

Application - Slope Stability

Much of the work in the Author's research group has been focused on slope stability analysis. With advances in computer technology, the analyses carried out have developed from simple 2D analyses (Samy, 2003) to limited 3D analyses (Spencer, 2007); however, it was discovered that memory was a limiting factor in the use of these codes. The parallel RFEM framework implementation, presented in this thesis, has led to the analysis of larger slopes. This chapter presents this work and the results. The results serve as a further validation, by comparing with the previous work of Hicks and Spencer (2010), and then expanding the work to larger slopes with foundation layers. This implementation serves as an illustrative example rather than a full engineering analysis.

Using the new parallel RFEM implementation, this chapter continues by investigating the influence of heterogeneity of undrained shear strength on the reliability of, and risk posed by, a long slope cut in clay. The clay has been idealized as a linear elastic, perfectly plastic Von Mises material and its spatial variability has been modelled using random field theory, whereas slope performance has been computed using the 3D parallel finite element program. The results of Monte Carlo simulations show that three failure modes are possible and that these depend on the horizontal scale of fluctuation relative to slope geometry. In particular, discrete failures are observed for intermediate scales of fluctuation and, in this case, reliability is a function of slope length. The risk posed by potential slides has been quantified in terms of slide volumes, which have been estimated by considering the computed out-of-face displacements. The results indicate that, for low probabilities of failure, the volumes of potential slides can be small. This suggests that, for some problems, it may not be necessary to design to very small probabilities of failure, due to the reduced consequence of failure in this case.

8.1 Introduction

Slope stability is concerned with the stability of natural slopes, excavations, embankments and dams. Seepage and gravitational forces can cause instability in these slopes. Figure 8.1 shows the most important types of slope failure.

Figure 8.1: Types of slope failure (Craig, 1997): (a) rotational slip (circular); (b) rotational slip (non-circular); (c) translational slip; (d) compound slip.

Rotational slips take the form of both circular arcs and non-circular curves, as illustrated in Figures 8.1(a) and (b) respectively. Circular slips are associated with homogeneous and isotropic soils, whereas non-circular failures occur in heterogeneous soils. Circular slips are rare, but have been observed in pure clays (Craig, 1997). Translational and compound slips, as illustrated in Figures 8.1(c) and (d) respectively, are influenced by adjacent strata of higher strength. In translational failures, the adjacent strata tend to be at a relatively shallow depth below the slope surface, usually producing plane failures parallel to the slope. Compound failures occur where the adjacent strata are at a greater depth below the slope, producing a slip failure with both curved and plane sections (Craig, 1997).


8.1.1 Fully saturated slope under undrained conditions, φu = 0

The idealised case considered in this chapter is a fully saturated slope under undrained conditions, a situation most likely to occur immediately after construction. The internal angle of friction is often taken to be φu = 0 and the strength of the soil is taken to be the undrained shear strength, cu. Considering a profile through the slope, the potential failure surface may be idealised as a circular arc (Craig, 1997). Figure 8.2 illustrates a trial failure surface through the slope, with centre O, radius r and length La. The potential instability of the slope is due to the total weight, W (per unit length), of the soil above the failure surface. For the slope to fail, the shear strength of the soil along the failure surface must be overcome, and this is given by:

τm = τf / F = cu / F                                    (8.1)

where F is the factor of safety with respect to the shear strength, τf the shear strength along the failure surface and τm the shear stress at which mobilization occurs. Taking moments about O gives:

W d = cu La r / F                                    (8.2)

and rearranging:

F = cu La r / (W d)                                    (8.3)

The analysis continues by taking trial failure surfaces through the slope until the minimum factor of safety is determined.

8.1.1.1 Taylor's stability coefficients

Using the principles above and those of geometric similarity, Taylor (1937) produced stability coefficients, Ns, which are presented in Figure 8.3. These coefficients were derived for the analysis of a homogeneous slope in terms of total stress for the φu = 0 case. Also shown in the figure are the generic dimensions of the slope corresponding to each set of coefficients, where H is the height of the slope, D a depth factor and β the angle of the slope, with D + H being the height between the crest of the slope and the firm stratum.



Figure 8.2: The φu = 0 analysis (Craig, 1997): trial circular failure surface with centre O, radius r, arc length La, soil weight W acting at lever arm d, and undrained shear strength cu acting along the surface.

The relationship between the stability coefficient, Ns, and the minimum factor of safety, F, is given by:

Ns = cu / (F γ H)                                    (8.4)

where γ is the unit weight of the soil. These coefficients, and the corresponding factors of safety, will be used as a measure of the accuracy of the analyses carried out by the Author.
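Equation 8.4 links the chart coefficient to the factor of safety, so once Ns has been read from Figure 8.3 the minimum factor of safety follows directly. The short sketch below uses the soil properties adopted later in this chapter (cu = 40 kPa, γ = 20 kN/m³, H = 5 m) and the Ns range of 0.165–0.176 quoted in Section 8.3.3.

```python
def taylor_factor_of_safety(c_u, gamma, H, N_s):
    """Equation 8.4 rearranged: F = c_u / (N_s * gamma * H)."""
    return c_u / (N_s * gamma * H)

c_u, gamma, H = 40.0, 20.0, 5.0   # kPa, kN/m^3, m (values used later in this chapter)
for N_s in (0.165, 0.176):        # range quoted for the depth factors considered
    F = taylor_factor_of_safety(c_u, gamma, H, N_s)
    print(f"N_s = {N_s:.3f}  ->  F = {F:.2f}")
```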

Figure 8.3: Taylor's stability coefficients for φu = 0 (Craig, 1997). The stability coefficient Ns (0 to 0.3) is plotted against the slope angle β (0° to 90°) for depth factors ranging from D = 0.0H to D = 3.0H and D = ∞.


8.2 Investigation

This investigation uses random field theory to model the spatial variability of material properties and finite elements to compute geo-structural response. It is the latest in a series of investigations into the influence of heterogeneity of undrained shear strength (cu ) for a simple slope stability problem. Previously, Paice and Griffiths (1997) and Griffiths and Fenton (2000, 2004) considered the influence of isotropic heterogeneity for a 2D slope using a lognormal distribution of cu . In particular, they demonstrated the importance of accounting for the spatial aspect of variability through implementation of a correlation distance (or scale of fluctuation). Hicks and Samy (2002c,b,a, 2004) conducted similar analyses to demonstrate the importance of depth-dependency and anisotropy of the heterogeneity. They assumed a normal distribution of cu , arguing that this was sufficient for practical ranges (0.1-0.3) of the coefficient of variation (υ) of cu . (This is because the possibility of negative values with the normal distribution is negligible for υ < 0.33 (Hicks and Samy, 2002c)). The authors demonstrated a strategy for deriving reliability-based characteristic property values (Hicks and Samy, 2002c) (in line with the requirements of Eurocode 7) and showed that solutions converged at higher degrees of anisotropy of the heterogeneity (Hicks and Samy, 2002c,b,a, 2004). This implied that the horizontal scale of fluctuation need not always be accurately known, an encouraging result given the difficulty of measuring this quantity in practice. However, the findings were restricted to slope failure in plane strain and the implicit assumption of an infinite scale of fluctuation in the out-of-plane dimension (that is, along the line of the slope). Spencer and Hicks (2007) and Hicks and Spencer (2010) extended the research to three dimensions. As in Hicks and Samy (2002c,b,a, 2004), they considered the influence of heterogeneity of undrained shear strength for a slope founded on a firm base and identified three modes of failure depending on the horizontal scale of fluctuation relative to slope geometry. The implications of the research were twofold. Firstly, 2D analysis is only justified for long slopes if the failure mechanism is two-dimensional. For discrete failures, 3D analysis should be considered and slope reliability is then slope length dependent. Secondly, for very long slopes (as in flood defence systems), the performance of the whole slope can be reasonably assessed by carrying out a detailed analysis of a shorter representative section of the slope and then extrapolating the results to longer sections



by simple statistical analysis. In their work, Hicks and Spencer (2010) analyzed a slope section with a length to height ratio equal to ten, to successfully compute the reliability of shorter and much longer slopes. This investigation continues the work of Spencer and Hicks (2007) and Hicks and Spencer (2010), by considering the range of potential slide volumes associated with different levels of reliability. For this purpose, a simple automated strategy is devised to estimate the slide volumes in Monte Carlo simulations. The results provide an increased understanding of the failure mechanisms involved and of the influence of heterogeneity on slope reliability in general. They are also a first step towards quantifying the consequences of slope failure within a risk-based framework. This study includes results for different depths of foundation layer. The associated increase in computational requirements, relative to the previous investigations, has required the parallelisation of the finite element code, the Monte Carlo process and the random field generator, as presented in this thesis.

8.2.1 Methodology

A detailed description of the methodology is given by Hicks and Spencer (2010). In brief, the undrained shear strength (cu) has been represented by a normal distribution and by the point and spatial statistics of cu: these are the mean (µ) and standard deviation (σ), which combine to give the coefficient of variation (υ = σ/µ); and the vertical and horizontal scales of fluctuation (θv and θh), which are measures of the distance between adjacent zones of similar strength. The covariance function β is given by

β(τ1, τ2, τ3) = σ² exp( −2|τ1|/θ1 − √[ (2τ2/θ2)² + (2τ3/θ3)² ] )                                    (8.5)

where τ is the lag distance and subscripts 1–3 represent the three coordinate directions (with 1 being the vertical direction). However, as in Hicks and Spencer (2010), the random field is initially generated with the scale of fluctuation equal in all directions (often taken to be θh). There then follows a series of post-processing stages to arrive at the final field (Hicks and Spencer, 2010): these are to transform the field to a normal distribution (based on µ and σ) and to distort it to an anisotropic field based on θv and θh. The final field comprises cubic cells of cross-correlated local averages of cu, and these values are mapped onto the finite element mesh at the Gauss point level.
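As a reading aid, the covariance function of Equation 8.5 can be evaluated directly; the sketch below is illustrative only, with σ, the vertical scale of fluctuation and the degree of anisotropy chosen as assumed inputs, and shows how the covariance decays much more slowly in the horizontal directions than in the vertical when θ2 = θ3 = ξ θ1.

```python
import numpy as np

def covariance(tau1, tau2, tau3, sigma, theta1, theta2, theta3):
    """Equation 8.5: anisotropic Markov covariance of the undrained shear strength."""
    exponent = (-2.0 * abs(tau1) / theta1
                - np.sqrt((2.0 * tau2 / theta2) ** 2 + (2.0 * tau3 / theta3) ** 2))
    return sigma ** 2 * np.exp(exponent)

sigma = 8.0        # kPa (e.g. v = 0.2 with mu = 40 kPa, as used in this chapter)
theta_v = 1.0      # vertical scale of fluctuation (m)
xi = 12.0          # assumed degree of anisotropy, so theta_h = xi * theta_v
theta_h = xi * theta_v

for lag in (0.5, 1.0, 2.0, 4.0):
    vert = covariance(lag, 0.0, 0.0, sigma, theta_v, theta_h, theta_h)
    horiz = covariance(0.0, lag, 0.0, sigma, theta_v, theta_h, theta_h)
    print(f"lag = {lag:3.1f} m : vertical cov = {vert:6.2f}, horizontal cov = {horiz:6.2f}")
```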



Hence, this investigation uses a Monte Carlo approach, in which each realisation is a deterministic finite element analysis of the slope based on a different random field of cu , in an undrained total stress analysis.

8.3 Analysis

Figure 8.4 shows the problem geometry and finite element mesh details with respect to Cartesian axes x, y and z.

Figure 8.4: The basic slope geometry and finite element mesh: (a) isometric projection; (b) cross-section through the mesh.

A 45° slope, of height H = 5 m and length L = 100 m, is cut from a clay layer of depth H + D, where D is the depth of the foundation layer, which is the depth of clay below the slope toe. The front and back faces of the mesh are on rollers preventing movement in the x-direction, and are 10 m from the slope toe and crest to minimise each boundary's influence on the failure mechanism. The bottom of the mesh is fixed, whereas, at the two mesh ends, only vertical displacement is allowed. The decision to constrain movement in both the x- and y-directions for this boundary is based on Spencer (2007), who found that restraining only the y-direction led to a bias towards failure near the mesh ends. This was thought to be due to an increased influence of weaker zones near the mesh ends, arising from the implied symmetry of the random field about this boundary. Hicks and Spencer (2010) investigated the validity of the boundary conditions, by analysing slopes of different length for the case of D = 0.0 m, and produced results supported by probabilistic theory.

The finite element mesh comprises 20-node brick elements. All elements use 2 × 2 × 2 Gaussian integration, and each element is 0.5 m deep and 1.0 m × 1.0 m in plan, except for those elements that have been distorted to model the slope face. Hence, for D = 0.0 m the mesh comprises 14,000 elements and for D = 3.0 m there are 32,000 elements. In order to assign cu values to Gauss points, the random field cell size is 0.25 m (which is half the minimum element dimension). Hence, for assigning values to each finite element there are 32 cubic cells, with each Gauss point value being the average of a group of 4 cells, as shown in Figure 8.5.

Figure 8.5: Typical finite element and four random field cells (Hicks and Spencer, 2010).

The clay layer has been modelled as linear elastic, perfectly plastic. The elastic component uses a Young's modulus, E = 100000 kPa, and Poisson's ratio, ν = 0.3, whereas the plastic component uses the internal Von Mises failure criterion and a spatially varying undrained shear strength (cu). This is based on a normal distribution and the following statistics: depth-independent mean, µ = 40 kPa; coefficient of variation, υ = 0.2; vertical scale of fluctuation, θv = H/5 = 1.0 m; and horizontal scale of fluctuation, θh = ξ × θv, where ξ is the degree of anisotropy of the heterogeneity and is here investigated over the range 0 < ξ < ∞.

The slope has been analyzed, for each realization of the random field, by applying gravitational loading to generate the in-situ stresses and by investigating whether the slope remains stable or fails under its own self weight. Note that, due to the use of a simple, linear elastic, perfectly plastic soil model, there is no need to model the slope "construction" sequence in generating the in-situ stresses. The method of loading only becomes important when more realistic soil models are being considered. For a homogeneous (plane strain) slope, of height H and constant undrained shear strength cu, the Taylor (1937) stability number (NS) is given by Equation 8.4. For a heterogeneous slope, the equation may be rewritten as



NS = µ / (F γ H)                                    (8.6)

where F is now the global factor of safety based on the target mean property value µ (and not to be confused with the factor of safety of a potential slide mechanism, which is obviously 1.0 for a slope at the point of failure). From the equation, it can be seen that slope failure can be initiated by either decreasing the shear strength profile or increasing the gravitational loading. Hicks and Spencer (2010) followed previous researchers by re-analyzing the problem with progressively lower values of undrained shear strength until slope failure occurred (that is, by scaling down the random field values of cu ). In contrast, this investigation has triggered slope failure by incrementally increasing gravity (while keeping the same shear strength profile). Initially, gravity is increased in equal increments until the slope fails: the last increment is then halved in size and re-analyzed, and the process continues until slope failure is defined to an accuracy of 0.01 (in terms of F ). As in previous studies, slope failure is defined by an analysis failing to converge in 500 equilibrium iterations.
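The gravity-loading procedure described above amounts to a bracketing search on the gravity multiplier: equal increments are applied until non-convergence occurs, and the last increment is then repeatedly halved until the failure load is located to within the required accuracy. A minimal sketch of this logic is given below; `slope_is_stable` is a hypothetical stand-in for a full finite element analysis, replaced here by a known capacity purely for illustration.

```python
def find_failure_multiplier(slope_is_stable, increment=0.1, tolerance=0.01):
    """Bracket the gravity multiplier at failure, then halve the step to the tolerance."""
    g = 0.0
    while slope_is_stable(g + increment):      # equal increments until failure occurs
        g += increment
    step = increment
    while step > tolerance:                    # halve the last increment and re-analyse
        step *= 0.5
        if slope_is_stable(g + step):
            g += step
    return g + step                            # failing multiplier, bracketed to the tolerance

# Stand-in for the FE analysis: stable while the multiplier is below a known capacity.
capacity = 1.234
multiplier = find_failure_multiplier(lambda g: g < capacity)
print(f"gravity multiplier at failure ≈ {multiplier:.2f}")
```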

8.3.1 Monte Carlo simulation

For a given global factor of safety (F), the percentage reliability is given by

R = (1 − Nf / N) × 100                                    (8.7)

where N is the total number of realizations and Nf is the number of realizations in which slope failure occurs at or above that value of F. Hicks and Spencer (2010) reviewed various strategies for setting up this equation, all of which involved analyzing the slope for progressively lower values of cu (for example, by gradually scaling down the random field values using an increasing strength reduction factor). The value of F at failure (for any realization) was related to µ at failure through Equation 8.6. For the Monte Carlo simulations in this study a slightly different (albeit equivalent) strategy has been adopted. Specifically, the mean undrained shear strength has been fixed at 40 kPa for each realization and, for each value of ξ, the reliability distribution has been obtained as follows. Firstly, the point and spatial statistics of cu (µ, υ, θv, θh) are used to generate N random fields of cu. Secondly, the slope is analyzed for each random field in turn, by increasing the gravitational loading until slope failure occurs. The global factor of safety for any realization is then given by Equation 8.6, using µ = 40 kPa, the value of γ at which slope failure occurs and the appropriate value of NS (which depends on the slope angle (45° in this instance) and D).
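Once a realised factor of safety has been computed for every random field, Equation 8.7 gives the reliability at any chosen F by counting the realisations in which failure occurs at or above that value. A minimal post-processing sketch is given below; the realised factors of safety are made-up numbers used only to illustrate the counting.

```python
import numpy as np

def reliability(F, realised_F):
    """Equation 8.7: R = (1 - Nf / N) * 100, Nf being realisations failing at or above F."""
    realised_F = np.asarray(realised_F)
    N = realised_F.size
    Nf = np.count_nonzero(realised_F >= F)
    return (1.0 - Nf / N) * 100.0

# Hypothetical realised factors of safety from a Monte Carlo simulation.
rng = np.random.default_rng(2)
realised_F = rng.normal(loc=1.15, scale=0.08, size=250)

for F in (0.9, 1.0, 1.1, 1.2, 1.3):
    print(f"F = {F:.1f} : R = {reliability(F, realised_F):5.1f} %")
```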

8.3.2 Computing Methodology

The current investigation considers the same slope angle, height and length as in Hicks and Spencer (2010), and uses the same finite element size. However, the total number of elements is much larger for some analyses (32,000 compared to 8,000), due to the 3 m deep foundation layer and also to increasing the distance between the back mesh boundary and the slope crest to 10 m (compared to the 5 m used previously). Hicks and Spencer (2010) used parallel computing to analyze their slope but, as the mesh was designed to fit in the memory of a single 32 bit CPU, only the Monte Carlo simulation was parallelised; that is, the realizations were shared across the processors using a load-balancing technique and, as each realization was only analyzed on one processor, the communication required between processors was minimal. (Moreover, the finite element program used a direct solver and a Tresca failure criterion.) For this investigation, both the Monte Carlo simulation and the finite element program have been parallelised. For the latter, this has been achieved by using an element-by-element technique: that is, only the element stiffness matrices are required, there is no assembly of a global stiffness matrix, and the equations have been solved using an iterative preconditioned conjugate gradient (PCG) solver (Smith and Griffiths, 2004). The efficiency of this solver has been facilitated by the use of a Von Mises failure criterion with a suitable return algorithm.
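The key point of the element-by-element approach is that the PCG solver only ever needs the product of the global stiffness matrix with a vector, and that product can be accumulated element by element without assembling the global matrix. The sketch below illustrates a matrix-free matrix–vector product inside a Jacobi-preconditioned conjugate gradient loop; it is a simplified serial illustration under assumed data (a chain of two-node "elements"), not the thesis code or the Smith and Griffiths (2004) routines.

```python
import numpy as np

# --- toy element data: a 1D chain of two-node "elements" (purely illustrative) ---
n_elements = 50
n_nodes = n_elements + 1
k_e = np.array([[1.0, -1.0], [-1.0, 1.0]])            # element "stiffness" matrix
connectivity = [(e, e + 1) for e in range(n_elements)]
fixed = [0]                                            # node 0 is restrained

def matvec(p):
    """Element-by-element product K*p: no global stiffness matrix is ever assembled."""
    q = np.zeros_like(p)
    for nodes in connectivity:
        idx = list(nodes)
        q[idx] += k_e @ p[idx]
    q[fixed] = p[fixed]                                # keep fixed dofs decoupled
    return q

def pcg(b, tol=1e-10, max_iter=500):
    """Jacobi-preconditioned conjugate gradients using only matvec()."""
    diag = np.full(n_nodes, 2.0); diag[0] = 1.0; diag[-1] = 1.0   # diagonal of K
    x = np.zeros(n_nodes)
    r = b - matvec(x)
    z = r / diag
    p = z.copy()
    for _ in range(max_iter):
        Kp = matvec(p)
        alpha = (r @ z) / (p @ Kp)
        x += alpha * p
        r_new = r - alpha * Kp
        if np.linalg.norm(r_new) < tol:
            break
        z_new = r_new / diag
        beta = (r_new @ z_new) / (r @ z)
        p = z_new + beta * p
        r, z = r_new, z_new
    return x

b = np.zeros(n_nodes); b[-1] = 1.0                     # unit load at the free end
u = pcg(b)
print("tip displacement:", u[-1])                      # expect n_elements for unit springs
```

In the parallel setting each processor holds only the element matrices for its own subdomain, so the same loop runs locally and only the shared-node contributions need to be exchanged between processors.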

8.3.3 Results

Figure 8.6 shows the strength reduction factor (SRF) as a function of the maximum mesh settlement, for a homogeneous slope (with cu = 40 kPa and γ = 20 kN/m³) and for 0.0 < D < 3.0 m. The SRF is the factor by which the material property, cu, is scaled down. In the method adopted, the SRF is equivalent to the factor by which gravity is increased, and the SRF that causes failure of the slope is equal to the factor of safety. This is shown in Equation 8.4, where scaling up γ by a given factor is equivalent to scaling down cu by the same factor. The curves indicate that 500 equilibrium iterations have been enough to capture the factor of safety of each slope (corresponding to the strength reduction factor at failure).

Figure 8.6: Influence of foundation layer on strength reduction factor versus maximum settlement for a 100 m long homogeneous slope (curves for D = 0.0 m to 3.0 m).

The factors of safety from Figure 8.6 have been re-plotted in Figure 8.7 and compare favourably with both Taylor's (1937) solution and equivalent 2D plane strain computations. As expected, the 2D results lie slightly below Taylor's solution due to the use of an internal Von Mises failure criterion, whereas the 3D results are stronger than those for 2D due to the different end-boundary conditions, but the differences are small and the overall agreement is still very good. The factors of safety for Taylor's solution have been obtained using Equation 8.6 and values of NS ranging from 0.165 to 0.176 (corresponding to the range of D considered).

Figure 8.8 shows the results for further homogeneous analyses for D = 0.0 m and D = 3.0 m, for slope lengths in the range H < L < 20H. These show that, for L/H > 8, the 3D solution for F is similar to the plane strain solution. For L/H < 8, the 3D boundary conditions lead to an increase in F, while for L/H ≈ 2 they impose a failure surface that is approximately spherical.

Figure 8.7: Influence of foundation layer on factor of safety for a 100 m long homogeneous slope (3D and 2D computed results compared with Taylor's (1937) solution).

Also shown in Figure 8.8 are results for D = 0.0 m using a mesh in which the far boundary is located only 5 m from the slope crest (as used in Hicks and Spencer (2010)). The results are comparable to those for the larger mesh for the same value of D, and support the use of the smaller mesh in later Monte Carlo simulations (for D = 0.0 m) to reduce CPU run-times. Hence, the analyses of the D = 0.0 m case within this thesis use this smaller mesh. Figure 8.9 re-plots the same results as a function of Taylor's factor of safety for a 2D slope (FT), indicating an increase in F of about 20% for a spherical surface. Figure 8.8 shows (by comparing 2D and 3D solutions) that the end-boundary conditions, whether free or fixed in the z-direction, have little influence on the results.

In this study, the consequence of failure has been quantified in terms of the volumes of material associated with potential slides. Hence, this requires an automated procedure for computing the failure volume for each realisation. Figure 8.10 shows how this may be done for a 2D analysis (in this case, for D = 0.0 m). Firstly, contours of shear strain invariant are computed: this is illustrated for a homogeneous slope in Figure 8.10(a), in which the warmer contours represent larger strains.

Figure 8.8: Influence of slope length on computed factor of safety for a homogeneous slope (D = 0 m with the far boundary at 5 m and 10 m, D = 3 m with the far boundary at 10 m, and the corresponding plane strain solutions).

The shear strain invariant is plotted because it conveniently combines all the strain components into a single value, thereby providing a clear representation of the failure mechanism. Figure 8.10(a) illustrates the influence of the boundary conditions at the base of the clay layer; that is, failure occurs along the bottom of the clay layer due to the presence of the stronger underlying material preventing a deeper slide. Next, the critical failure surface is computed by using a ridge-finding technique: this involves selecting an imaginary point in space well above the toe of the slope, and then finding the location at which the shear strain invariant is highest along straight lines radiating out from the point (Figure 8.10(b)). The volume of the slide is then easily computed as that area above the critical surface.

Figure 8.11 shows typical results from a 2D Monte Carlo simulation for a heterogeneous slope, based on 500 realisations, in which the slide volume is expressed as a percentage of the mesh area (Figure 8.11(a)). Note the magnitude and relatively narrow distribution of slide depths (Figure 8.11(b)), due to the importance of slope height on slope stability for a depth-independent mean strength.

Unfortunately, the ridge-finding method was not found to be robust enough for 3D analyses involving soil heterogeneity.

Figure 8.9: Influence of slope length on the increase in F relative to Taylor's (1937) solution (D = 0 m with the far boundary at 5 m and 10 m, and D = 3 m with the far boundary at 10 m).

Such problems often involve the initiation of multiple and complex rupture zones before the full mechanism develops, and this can make the automatic detection of the critical failure surface difficult, especially when multiple and interacting slides are present, as in Figure 8.12. Hence, the present investigation uses a simpler approach based on computed displacements in the (out-of-face) x-direction. The 3D contours plotted in Figure 8.12(a) show the shear strain invariant, in which the contours are within the slope, and not on the slope's surfaces. Figure 8.12(b) illustrates the deformation of the same slope, which exhibits multiple failures. These displacements have been exaggerated, by multiplying by a factor of 100, to better illustrate the failure mechanism.

Figure 8.13 shows the percentage of total mesh volume versus the percentage of maximum x-displacement, for 2D (plane strain) and 3D homogeneous slopes, and for D = 0.0 m and D = 3.0 m. That is, for each analysis the maximum nodal x-displacement in the mesh (δmax) is recorded, and then, for a given percentage of that maximum displacement (Δ), the volume of the mesh experiencing that displacement or higher is computed.

Figure 8.10: 2D failure mechanism for a homogeneous slope with D = 0.0 m: (a) contours of shear strain invariant; (b) critical failure surface (black) and threshold displacement contour (red).

Figure 8.11: Performance distributions for a 2D heterogeneous slope with D = 0.0 m: (a) slide volume (%); (b) slide depth (m).

Figure 8.12: Visualisation of multiple failures in a 3D heterogeneous slope with D = 0.0 m: (a) contours of shear strain invariant; (b) deformed mesh.

This has been simply approximated as the accumulated volume of all elements with an average nodal x-displacement greater than the threshold value. (Note that the small ripples in the solution are due to the approximate way in which the volume has been computed.) Also shown (as horizontal lines) in Figure 8.13 are the percentage volumes obtained by using the above ridge-finding technique to compute the critical failure surface for a 2D slice through the slope. That is, for both values of D, two percentage volumes have been computed: the first is based on the shear strain invariant contours produced in the plane strain analysis (as in Figure 8.10), whereas the second is based on the shear strain invariant contours for a single-element slice taken through the 3D mesh at L/2. The predicted percentage slide volumes for the two 3D slopes are 62.4% for D = 0.0 m and 52.7% for D = 3.0 m, which compare favourably with the respective 2D solutions of 61.9% and 51.6%. This indicates that the 3D mesh does deform in a plane strain manner over its central region.

Hence, Figure 8.13 can be used to estimate the percentage of the maximum displacement that may be used as a threshold to approximate the slide volume: that is, by taking it to be the value which gives the same slide volume as obtained using ridge-finding for a 2D slice taken through the slope at L/2. For the plane strain analyses, Δ = 22.8% for D = 0.0 m and 26.4% for D = 3.0 m. Figures 8.10 and 8.14 show that the x-displacement contours corresponding to these displacements are close to the critical failure surfaces obtained using the more rigorous method. As in Figure 8.10(a), Figure 8.14(a) shows how the failure mechanism is impeded by the firm stratum at the base of the foundation, as modelled by the fixed boundary conditions. For the 3D analyses, Δ = 23.4% for D = 0.0 m and 25.9% for D = 3.0 m. These threshold displacements have been used to estimate the slide volumes for the 100 m slopes shown in Figures 8.15 and 8.16. These show close agreement between the highest shear strain invariant contours and the approximate failure surface predicted by the threshold displacement contours.

Figure 8.17 shows the influence of ξ on reliability (R) versus global factor of safety (F) for υ = 0.2 and θv = 1.0 m, for L = 100 m and for both D = 0.0 m and D = 3.0 m. Each curve is based on 250 realisations. However, note that, after analysing the failed volumes, the Author discarded those results where the failure volume was < 1% of the total volume. This was because these failures, although computed, were so small as not to be deemed slope failures. In the main, this removed results in which the volumes were calculated to be near 0%.
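The displacement-based volume estimate reduces to a simple accumulation over elements: find the maximum nodal x-displacement, then sum the volumes of all elements whose average nodal x-displacement exceeds the chosen fraction Δ of that maximum. A minimal post-processing sketch is given below; the array shapes and the stand-in displacement field are illustrative assumptions, while Δ = 23.4% is the value quoted above for the 3D case with D = 0.0 m.

```python
import numpy as np

def slide_volume(element_nodal_ux, element_volumes, delta):
    """Approximate slide volume: elements whose mean nodal x-displacement exceeds
    delta times the maximum nodal x-displacement in the mesh."""
    abs_ux = np.abs(element_nodal_ux)
    element_mean_ux = abs_ux.mean(axis=1)     # average over each element's nodes
    threshold = delta * abs_ux.max()          # delta * maximum nodal displacement
    failed = element_mean_ux > threshold
    failed_volume = element_volumes[failed].sum()
    return failed_volume, 100.0 * failed_volume / element_volumes.sum()

# Hypothetical post-processed data: 20-node elements, out-of-face displacements in metres.
rng = np.random.default_rng(3)
n_elements = 14_000
ux = rng.gamma(shape=2.0, scale=0.002, size=(n_elements, 20))  # stand-in displacement field
vol = np.full(n_elements, 0.5)                                 # 1.0 m x 1.0 m x 0.5 m elements

volume, percent = slide_volume(ux, vol, delta=0.234)
print(f"estimated slide volume: {volume:.0f} m^3 ({percent:.1f}% of mesh volume)")
```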

Figure 8.13: Relationship between percentage volume and percentage of maximum node displacement, Δ, for failure of a homogeneous slope: (a) 0 m foundation layer (D = 0 m); (b) 3 m foundation layer (D = 3 m). The horizontal lines mark the 2D plane strain and 3D central-section percentage area failures, giving Δ = 22.8% (2D) and 23.4% (3D) for D = 0 m, and Δ = 26.4% (2D) and 25.9% (3D) for D = 3 m.

Figure 8.14: 2D failure mechanism for a homogeneous slope with D = 3.0 m: (a) contours of shear strain invariant; (b) critical failure surface (black) and threshold displacement contour (red).

Figure 8.15: 3D failure mechanism for a homogeneous slope for D = 0.0 m: (a) contours of highest shear strain invariant; (b) slide geometry based on displacement; (c) shear strain invariant contours at L/2; (d) slide geometry at L/2 (black: critical failure surface; red: threshold displacement contour).

Figure 8.16: 3D failure mechanism for a homogeneous slope for D = 3.0 m: (a) contours of highest shear strain invariant; (b) slide geometry based on displacement; (c) shear strain invariant contours at L/2; (d) slide geometry at L/2 (black: critical failure surface; red: threshold displacement contour).

This meant that the number of realisations used in each analysis was generally less than 250.

Hicks and Spencer (2010) conducted similar analyses for υ = 0.3 and θv = 1.0 m using a Tresca failure criterion, and identified three failure modes which depended on the magnitude of θh relative to slope geometry:

Mode 1: for θh < H there is little opportunity for failure to develop through semi-continuous weaker zones. Hence, failure goes through weak and strong zones alike, there is considerable averaging of property values over the failure surface, and the slope fails along its entire length. This case is analogous to a conventional 2D analysis based on µ.

Mode 2: for H < θh < L/2 there is a tendency for failure to propagate through semi-continuous weaker zones, leading to discrete 3D failures and a decrease in reliability as the slope length increases. Hicks and Spencer (2010) showed how probabilistic theory could be used to predict the reliability of longer slopes based on the detailed 3D stochastic analysis of shorter slopes.

Mode 3: for θh > L/2 the failure mechanism reverts to one along the slope length. Although it is similar in appearance to Mode 1, it is a fundamentally different mechanism. In this case, failure propagates along weaker layers and there is a wide range of possible solutions that depend on the locations of these layers. The solution for this mode is analogous to a 2D stochastic analysis.

Figure 8.17 suggests similar findings to Hicks and Spencer (2010). Note the rapid increase in reliability for Mode 1 (ξ < 5) as the curve passes through F ≈ 1.0. For Mode 2 (5 < ξ < 50), there is a decrease in reliability (for a given F) as ξ increases, with the weakest result being for θh ≈ L/2 (i.e. for ξ ≈ 50). For Mode 3 (ξ > 50), the solution tends towards the plane strain solution as ξ → ∞. The value of θh/L at which the predominant failure mode changes from 2 to 3 is rather subjective, but θh/L ≈ 1/2 seems reasonable, as for larger values it becomes difficult for complete mechanisms to form without some interaction with the mesh ends.

Figures 8.18 to 8.25 show a more detailed evaluation of the results. They show the individual curves from Figures 8.17(a) and (b) and, for each curve, all computed slide volumes (as a percentage of the total mesh volume).

Figure 8.17: Influence of ξ on reliability versus global factor of safety for a 3D slope with 0 m and 3 m foundation layer: (a) 0 m foundation layer (D = 0 m); (b) 3 m foundation layer (D = 3 m). Curves are shown for ξ = 1, 2, 6, 12, 24, 48, 100 and 1000.

The significance of individual slides can be assessed by comparing with volumes of 62.4% and 52.7%, for D = 0.0 m and 3.0 m respectively, which are the slide volumes for a homogeneous slope (which gives a Mode 1 failure). In each plot, the slide volumes are denoted by points relative to the values of F at which slope failure occurred. As can be seen, for each value of ξ there is a wide range of possible volumes, although there is also a clear trend as ξ increases, and this is reinforced by Figure 8.27, which shows the influence of ξ on mean slide volume.

Figure 8.18 shows the distributions of reliability and slide volumes for θh = θv = H/5. Both distributions indicate the dominance of Mode 1 failure. That is, R increases rapidly from 0 to 100% at F ≈ 1.0, and most slide volumes are relatively large (Vf ≈ 20–50%), indicating failure along most of the slope length. As ξ increases, Mode 2 becomes dominant, as indicated by a reduction in mean slide volume and the possibility of localised slides (for example, as in Figure 8.20 for θh = 6θv = 1.2H). Figure 8.26 shows typical mechanisms for ξ = 6, which include small and large discrete failures, multiple discrete failures and interacting failures. The computed slide volumes are seen to be consistent with the out-of-face displacement contours.

Figures 8.18 to 8.22 also show that, as ξ increases, there is an increasing tendency for smaller slide volumes at larger factors of safety. This suggests that, for some problems, it may not be necessary to design to very small probabilities of failure, due to the volumes of material associated with potential slides then being inconsequential. This is also indicated by the linear volume trend lines plotted on the graphs. Note that, as ξ increases beyond L/4, the gradient of the trend reduces and tends to 0 as ξ → ∞.

Figure 8.27 shows that the mean slide volume reaches a minimum when θh ≈ L/4, which also corresponds to when the reliability curve approaches its weakest position (Figure 8.17). At first this seems counter-intuitive, since discrete failures become larger with increasing θh. However, the larger volumes recorded for lower values of θh are the integrated volume arising from multiple failures, whereas, for Mode 2 failures at larger values of θh, there is a decreased likelihood of multiple failures due to constraints imposed by the slope length. Figures 8.22 to 8.25, and 8.27, show that, as θh increases beyond L/4, the mean slide volume increases and the tendency for smaller slide volumes at higher F reduces. There is also a gradual transition from Mode 2 to Mode 3 failures as the solution moves towards plane strain at higher values of ξ. Figure 8.17 shows that the solution has converged to plane strain by the time ξ = 1000, whereas Figure 8.25 shows that most slide volumes are relatively large (20–50%).

Figure 8.18: Influence of ξ on R versus F for a 3D slope (ξ = 1): (a) 0 m foundation layer (D = 0 m); (b) 3 m foundation layer (D = 3 m). Each panel shows reliability, the computed failure volumes and the failure volume trend versus F.

Figure 8.19: Influence of ξ on R versus F for a 3D slope (ξ = 2): (a) D = 0 m; (b) D = 3 m.

Figure 8.20: Influence of ξ on R versus F for a 3D slope (ξ = 6): (a) D = 0 m; (b) D = 3 m.

Figure 8.21: Influence of ξ on R versus F for a 3D slope (ξ = 12): (a) D = 0 m; (b) D = 3 m.

Figure 8.22: Influence of ξ on R versus F for a 3D slope (ξ = 24): (a) D = 0 m; (b) D = 3 m.

Figure 8.23: Influence of ξ on R versus F for a 3D slope (ξ = 48): (a) D = 0 m; (b) D = 3 m.

Figure 8.24: Influence of ξ on R versus F for a 3D slope (ξ = 100): (a) D = 0 m; (b) D = 3 m.

Figure 8.25: Influence of ξ on R versus F for a 3D slope (ξ = 1000): (a) D = 0 m; (b) D = 3 m.

Figure 8.26: Example Mode 2 failure mechanisms for ξ = 6 (Vf = failure volume), with Vf = 6.60%, 7.92%, 10.52%, 13.95%, 18.66%, 27.86%, 23.25%, 27.8% and 33.71% in panels (a) to (i).

Figure 8.27: Influence of ξ on mean slide volume for a 3D slope (D = 0 m and D = 3 m).

Note that, although the spread of solutions is still quite large, the slope is found to be at, or near to, failure along the entire length in most instances, even if the slope failure initiates over a smaller area (as indicated by those realisations in which the computed slide volume is lower).

Figures 8.17 to 8.27 reinforce the findings of Hicks and Spencer (2010), with the computation of slide volumes providing further insight into the three failure modes:

Mode 1: for very low values of θh (relative to H), the range of slide volumes is relatively narrow, whereas the mean slide volume is relatively large and approaches the homogeneous solution based on the mean undrained shear strength. As θh gets larger, there is a decrease in "uniformity" of shear strength along the slope length. This leads to an increased possibility of less extensive failures, and therefore to a greater range of potential slide volumes and a decrease in the mean slide volume.

Mode 2: the range of slide volumes becomes large as θh becomes larger than H, due to the formation of discrete and multiple failures. However, there is a tendency for slide volumes to decrease at larger values of F, suggesting that low probabilities of failure may sometimes be associated with inconsequential slides. The mean slide volume reaches a minimum at θh ≈ L/4.

Mode 3: the range of slide volumes is relatively wide, reflecting the wider range of solutions, and the mean slide volume is also large, indicating failure along most of the slope length.

8.4 Computational Performance

As in Chapter 6, the performance of the implementation was measured in terms of time, memory consumption, speedup and efficiency, and general trends in processor scalability were examined. The Author tested the performance of the codes by varying the size of the domain for homogeneous conditions. Realisations for heterogeneous conditions would differ from one realisation to the next; therefore, by conducting the analysis for homogeneous conditions, every realisation was identical and should execute in a similar execution time, with the same memory consumption, for each set of parameters. The random field generation was still included and the algorithm executed for each realisation, but a homogeneous field was produced by setting the standard deviation equal to zero.

Figure 8.28 shows the various slope profiles that were used in the analysis. These profiles were used in combination with the slope lengths 32 m, 64 m and 96 m, whilst maintaining cell sizes of 1 m × 1 m × 0.5 m. The slope analysis was conducted for all permutations of length and profile, and the performance was measured in the same way as in Chapter 6, repeating each test on 1, 2, 4, 8, 16 and 32 processors for 25 realisations. The trend lines used to analyse the data are of the type defined by Equation 6.10. The results of this analysis are presented in Appendix B (Figures B.1 to B.10); Figure 8.29 is a typical example of the results obtained.

The results show that, in general, both the time and memory consumption follow the expected trends. With respect to the execution time, the trend does not appear to converge to any noticeable lower limit. However, the memory consumption converges to a lower limit of ≈ 28000 kB. This does not seem to depend upon the profile of the domain; however, the profiles considered may have been sufficiently small so as not to show any significant trend.

Both the speedup and efficiency analyses show excellent scalability over the range of processors. The speedup, in general, initially follows closely the line of linear speedup, before increasing into the region of superlinear speedup. This is likely to be due to the increased amount of processor cache available, as theorised in Chapter 6, making data retrieval faster and more readily available. Normally it would be expected that this speedup would begin to tail off and fall as the increase in communication, due to the increasing number of processors, slows the execution.
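For reference, the speedup and efficiency values plotted in the performance figures follow the standard definitions, S(p) = T(1)/T(p) and E(p) = S(p)/p. The sketch below applies these definitions to a set of hypothetical timings (not the Author's measured results); superlinear behaviour appears as an efficiency greater than 1.

```python
# Hypothetical measured execution times (seconds) for 1, 2, 4, 8, 16 and 32 processors.
processors = [1, 2, 4, 8, 16, 32]
times = [76800.0, 38000.0, 18600.0, 9100.0, 4300.0, 2050.0]

t1 = times[0]
for p, t in zip(processors, times):
    speedup = t1 / t            # S(p) = T(1) / T(p)
    efficiency = speedup / p    # E(p) = S(p) / p
    print(f"p = {p:2d}: speedup = {speedup:5.2f}, efficiency = {efficiency:4.2f}")
```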

Figure 8.28: Illustration of tested profiles in performance analysis: (a) Profile 1; (b) Profile 2; (c) Profile 3.

Figure 8.29: Slope stability analysis performance, Profile 1, length = 32 m: (a) time; (b) maximum memory (kB) per processor; (c) speedup; (d) efficiency — each plotted against the number of processors, with best-fit and theoretical trend lines.

However, in this case the communication per processor remains constant, neglecting the communication within the random field generation. Therefore it is difficult to predict whether this trend will continue, or whether there will be a fall in the speedup. More testing, with greater resources than are available to the Author, would be required to investigate this further. This speedup analysis is reflected in the general efficiency of the executions, in which an initial small fall, below 1, is observed, before a significant increase above it.

The figures also show that, with respect to time and memory, using more than 8 to 10 processors yields little further benefit in real terms, although the efficiency measures provide an argument to the contrary: increasing the processor count provided little real reduction in time but, relatively, the increase was more efficient. Therefore, it can be argued that the use of more processors will be cost efficient. The Author is hesitant to generalize this, as the resources available were limited and, as previously stated, it is unclear whether this efficiency trend would continue with more processors or with slopes of different dimensions.

The performance was also analysed using the hybrid method discussed in Chapter 2. Due to time and resource limitations, only a single performance analysis of this strategy could be carried out. A performance analysis of the model using slope profile 2, with a length of 96 m, was performed. The code was executed on 32 processors, varying the number of processors over which a single realisation was executed (see Section 2.10). The results are shown in Figure 8.30.

The results are inconclusive with respect to time, with the efficiency and speedup generally remaining level, despite an initial increase in time when the domain of each realisation is decomposed over 2 processors. The results with respect to memory, shown in Figure 8.30(b), show that, as the domain of each realisation is decomposed over more processors, the memory per processor is reduced, as expected. In the Author's opinion, the results were inconclusive due to the size of the slope analysed and the limited resources available. With increased numbers of processors, and with larger and more varied domains, a more detailed performance analysis could be undertaken and the hypothesis posed in Chapter 2 could be proved; that is, that the domain should be decomposed over the minimum number of processors required to accommodate it with respect to memory.

Figure 8.30: Slope stability analysis performance (hybrid method): (a) time; (b) maximum memory (kB) per processor; (c) speedup; (d) efficiency — each plotted against the number of solver processors per realisation.

However, these results show that the aim of providing a viable framework in which the stochastic analysis of a large-scale FE domain, using the RFEM, can be performed has been achieved, reducing execution times and increasing domain sizes and refinement.

8.5 Conclusions

Automatic computation of slide volumes for heterogeneous slopes in 3D is difficult, due to the complexity of the underlying failure mechanism. Instead, a simple but effective way of estimating the slide volume based on computed displacements has been presented. This is based on defining a threshold displacement, above which the soil volume is deemed to have slipped. The value for this displacement was initially estimated by calibrating against a plane strain homogeneous analysis, in which the slide volume may be accurately determined from shear strain invariant contours using a ridge finding technique. Indications were that, when this displacement was used in the 3D analysis of a 100 m long slope, the slide volume was over-estimated by around 20%, due to end-boundary effects. An improved estimate of the threshold displacement for 3D analysis was obtained by calibrating the displacements against the slide volume per metre run at L/2 for a homogeneous 3D slope. This gave slide volume estimates that were accurate enough to enable a detailed evaluation of the role of the scale of fluctuation on the performance of heterogeneous slopes. Previously, Hicks and Spencer (2010) analysed a similar slope, but with no foundation layer and with no detailed evaluation of failure volumes. This investigation has reinforced the previous finding that there are 3 distinct failure modes and that these are a function of the scale of fluctuation relative to slope geometry. However, it has also demonstrated the influence of a foundation layer on the results. For D = 0.0 m the trends are easier to distinguish, possibly due to the constraining influence of the mesh base promoting slope length failures for small and large scales of fluctuation (Modes 1 and 3). For D > 0.0 m the influence of the lower boundary on the failure mechanism is reduced and, for larger coefficients of variation, discrete (spherical and cylindrical) Mode 2 failures may occur even at relatively low correlation distances. The computational performance of the code has been analysed showing that the new implementation has made the RFEM analysis of large-scale models viable, decomposing the workload and memory requirements over multiple processors. The new implementation scales efficiently, reducing the execution times

326

CHAPTER 8. APPLICATION - SLOPE STABILITY

when executed on increasing numbers of processors. Therefore the aim of implementing a viable framework for the RFEM has been achieved.
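The threshold-displacement approach summarised above reduces to a very simple post-processing step. The fragment below is a minimal sketch of that idea, assuming element volumes, a representative displacement per element and the calibrated threshold are available as arrays; it is an illustration, not the thesis implementation.

    import numpy as np

    def estimate_slide_volume(element_volumes, element_displacements, threshold):
        """Minimal sketch of the threshold-displacement idea (not the thesis code).
        An element is deemed to have slipped when its representative displacement
        exceeds the calibrated threshold; the slide volume is the sum of the
        volumes of the slipped elements.

        element_volumes       : (n_elements,) element volumes (m^3)
        element_displacements : (n_elements,) resultant displacements (m)
        threshold             : calibrated threshold displacement (m)
        """
        slipped = element_displacements > threshold
        return float(np.sum(element_volumes[slipped]))

    # Illustrative placeholder data only.
    vols = np.full(1000, 2.0)                                  # 1000 elements of 2 m^3
    disp = np.random.default_rng(0).exponential(0.05, size=1000)
    print(estimate_slide_volume(vols, disp, threshold=0.1), "m^3 deemed slipped")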

Chapter 9

Conclusions

9.1 Aims

The broad aim of this research was to advance the performance of stochastic analyses using the RFEM, through the use of parallel computation, improved algorithms and improved coding. The implementation of a parallel RFEM framework, as discussed in Chapter 2, incorporating the parallelised random field generator implemented in Chapter 6, has led to increased computational performance, as illustrated by the large slope stability analyses carried out in Chapter 8. These large analyses were performed for the first time within the Author's research group; the implemented codes have removed the barriers that previously prevented their viable execution, with respect to both time and memory constraints.

9.2 Objectives

To fulfil the aims of this Thesis, the Author defined a set of envisaged objectives to be met. Here these objectives are revisited with concluding comments from the Author.

• Stochastic parallelisation strategy: Chapter 2 discussed in detail the computational background to parallelisation and the strategies associated with implementing a parallel RFEM. It concluded that a hybrid approach, combining realisation and solver parallelisation, was the best means to optimize the resources available. It was also recommended to decompose the domain over the minimum number of processors necessary to overcome memory constraints, in order to limit communication and maximize efficiency (a minimal communicator-splitting sketch illustrating this is given at the end of this list of objectives).

• Implementation of a Parallel Stochastic Framework: This was the main objective of the research; a parallel implementation of the RFEM. This thesis has taken the reader through the basics of parallel computation and strategies for parallelisation of the RFEM (Chapter 2), the theory of LAS random field generation (Chapter 3) and its subsequent parallel implementation (Chapter 6), and finally validated the framework and its various components for typical analyses (Chapters 7 and 8). A framework is now in place that can be used with any suitable element-by-element FE code, reducing memory constraints while accelerating execution times.

• Parallelisation of random field generation: The random field generation within the original serial RFEM, and within early parallel implementations, placed a significant strain on computational memory resources. It was an objective of this thesis to parallelise this component and reduce the memory constraints of the implementation by decomposing the generation and storage of the random field across multiple processors. Chapters 3 to 6 discussed this parallelisation, taking the reader from theory to implementation. Chapter 6 also highlighted many inefficiencies within the original code and implemented solutions to them, providing performance improvements beyond those obtained from parallelism alone.

• Testing of random field generation: The LAS random field generation was tested for both performance and accuracy. Chapter 6 included the validation of the random field generator, showing that it was as accurate as previous implementations and produced fields close to the required input statistics. The performance analysis showed that the new implementation was not only efficient and faster in parallel, scaling well, but also that the improvements to the underlying generation, i.e. boundary cell generation, domain reduction and the anti-correlation technique, have led to improved performance in terms of both memory and time, even when run serially on a single processor.


Beyond this testing, the generator was further validated when used within the new parallel RFEM framework, performing well both computationally and analytically in the groundwater flow and slope stability applications tested in Chapters 7 and 8 respectively.

• Validation of parallel RFEM framework: As with the random field generator, the framework itself required validation. The modelling applications of groundwater flow and slope stability presented in Chapters 7 and 8 respectively provided this validation. In both cases, the results were consistent with findings from previous studies.

• Performance analysis: The new framework was also analysed for performance, with respect to both memory and time.

• Advanced application: A large-scale slope stability analysis has been undertaken that was previously infeasible. This advanced application demonstrated the performance capabilities of the new parallel framework and some of the benefits of using it.

All the aims and objectives of this thesis were completed and a parallel RFEM framework has been designed, developed and implemented for use in geoengineering research. This was the main goal of the thesis and it has been achieved successfully.
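As signposted under the first objective above, one way of realising the recommended hybrid strategy is to split the global set of MPI processes into realisation groups, each containing only the minimum number of solver processes needed to hold one realisation in memory. The sketch below illustrates this using mpi4py rather than the Fortran/MPI of the thesis codes; the memory figures and the sizing rule are assumptions for illustration only.

    from mpi4py import MPI
    import math

    comm = MPI.COMM_WORLD
    rank, size = comm.Get_rank(), comm.Get_size()

    # Assumed figures, for illustration only: estimated memory needed per
    # realisation and memory available per process.  Following the
    # recommendation above, each realisation group uses the *minimum* number
    # of processes that satisfies the memory requirement.
    mem_per_realisation_kb = 4_500_000
    mem_per_process_kb = 2_000_000
    procs_per_realisation = max(1, math.ceil(mem_per_realisation_kb / mem_per_process_kb))

    # Processes sharing a colour form one realisation group (solver parallelism);
    # different groups run different Monte Carlo realisations concurrently
    # (realisation parallelism).
    colour = rank // procs_per_realisation
    solver_comm = comm.Split(color=colour, key=rank)

    print(f"global rank {rank}/{size}: realisation group {colour}, "
          f"solver rank {solver_comm.Get_rank()}/{solver_comm.Get_size()}")

Run with, for example, mpiexec -n 32 python split_demo.py (the script name is arbitrary); each group of procs_per_realisation consecutive ranks then shares one solver communicator while the groups work through realisations independently.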

9.3 Main Conclusions

During this research the Author was able to conclude many points with regard to the development, implementation and use of a parallel RFEM framework. The Author presents a short, but by no means exhaustive, list of these points with a brief summary of each.

• The parallelisation of the framework and its components has led in all cases to superior performance, with respect to both time and memory.

• The optimization of serial codes can reap significant performance improvements. By reducing redundancy within codes, with respect to both time and memory, the need for parallelisation can be reduced.


• The use of distributed computing has facilitated the stochastic analysis of larger FE models using the RFEM, making large-scale RFEM analysis more viable. The Author expects similar performance when the framework is used with other element-by-element FE codes.

• Throughout the research the Author had limited access to computational resources, especially high performance computing. Although computers are everyday items and often remain relatively idle, processing power remains an expensive and scarce commodity. Its usage must therefore be efficient, and the Author hopes that the work presented goes some way to using this computational resource efficiently and cost effectively.

During the period of the Author's research the computational landscape, particularly in desktop computing, has advanced significantly. At the start of the research, desktop systems comprised 32 bit single core processors with 1 GB of RAM; today a typical system has a 64 bit quad core processor with 4 GB of RAM. The advances in multiple core technology continue, with Intel's latest i9 processor boasting 6 cores capable of handling 2 threads per core, effectively presenting 12 logical cores. The 64 bit architecture also allows for far superior memory capacity relative to its 32 bit predecessor: desktop computers capable of handling 24 GB of RAM are common, specialized workstations use 128 GB of RAM with a single processor, and most modern processors are artificially limited to 256 TB by a 48-bit address space.

This added memory capability has led to a reduced need for distributed computation, with larger domains easily accommodated by the memory of a single system. This means that shared memory systems, particularly desktop machines, are a more viable option when considering large-scale FE analyses. However, these machines, although easier to program and use, are still based upon the parallel principles presented in this thesis, and the techniques developed are as useful on these smaller shared memory machines as on large-scale distributed systems. The need for distributed computing may have somewhat diminished, but inevitably, as the numerical models become more realistic, complex and larger, there will once more be a greater need for distributed computing to reduce run times.
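As a quick check of the address-space figure quoted above, a 48-bit virtual address space spans 2^48 bytes, i.e. 256 binary terabytes; the trivial calculation below confirms this.

    # 48-bit virtual address space: 2**48 bytes, expressed in binary terabytes (TiB).
    bytes_48bit = 2 ** 48
    print(bytes_48bit)                   # 281474976710656 bytes
    print(bytes_48bit / 2 ** 40, "TiB")  # 256.0, i.e. the ~256 TB limit quoted above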

9.4 Future Research

During the research the Author has found many areas that require further attention. These areas, and others that the Author feels are relevant, are listed below.

• The new generation of 64 bit multi-core shared memory machines, including desktops, has opened the possibilities of parallel computation to the desktop user. In the Author's opinion, further work is required on the framework to produce a hybrid method applicable to both shared and distributed memory computers, using, for instance, the OpenMP and MPI libraries. A hybrid technique would provide an optimized code for both architectures, while allowing these new techniques to be implemented on desktop machines for the first time.

• The Author also recommends that the parallel RFEM framework be used together with more realistic and larger models. These models should test the limits of the code and resources, exhausting the available computational power. Equally, more realistic soil models, including constitutive models incorporating unsaturated behaviour, should be included, as these complex models would benefit from parallelism.

• The accuracy of the generated random fields can also be improved by incorporating actual measured data into the field; for example, CPT data may be used to condition random fields, to reduce the uncertainty associated with in situ heterogeneity. A method known as kriging has already been used in 2D to incorporate CPT data (van den Eijnden, 2010; van den Eijnden and Hicks, 2011). Further development could see a 3D implementation, utilising the parallel random field generator developed in this thesis.

• Advances in computer technology have not been limited to the central processing units and memory, but have extended to other components, most notably graphics cards. The need for more complicated and realistic graphics, driven by the video gaming industry, has led to rapid developments in the Graphical Processing Units (GPUs) on board graphics cards. These advances, considered superior to those in standard processing units, have led to GPUs containing hundreds of processing cores capable of handling simple operations on large datasets; a capability that could readily be utilized within the applications discussed in this thesis. The Author therefore envisages that work in this area will provide significant performance improvements, from the standard desktop upwards to high performance systems.

• It is the Author's opinion that more analysis should be performed comparing the RFEM with other stochastic techniques, for example the stochastic finite element method, the First Order Reliability Method (FORM) and the Second Order Reliability Method (SORM), to identify for which applications and conditions each method is applicable.

• The applications within this thesis have been compared with analyses conducted using serial codes; the results have been found to compare favourably and no significant errors have been observed. However, it was noted that, for the iterative element-by-element solver, iteration counts for the same model differed when executed on different numbers of processors; in general, as the number of processors increased, so did the iteration count. In the analyses conducted in this thesis no significant errors in the results were observed; however, the Author is aware that the exit criterion for the iterative method is count-dependent, and this phenomenon could cause errors within the results. Smith and Margetts (2006) highlighted this problem and discuss possible causes and solutions. Further research into this area would provide users with confidence in the models solved using these techniques (a short numerical illustration of one possible cause follows this list).
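One commonly cited contributor to this convergence variability is that distributing the dot products of an iterative solver over different numbers of processes changes the order in which partial sums are accumulated, and floating-point addition is not associative. The sketch below is an illustration of this effect only, not the thesis solver: the same mock residual vector yields slightly different squared norms when reduced in different chunkings, which can shift the iteration at which a tolerance-based exit test is first satisfied.

    import numpy as np

    # Illustration only (not the thesis solver): the squared residual norm used in
    # a tolerance-based exit test depends slightly on how the summation is
    # partitioned, because floating-point addition is not associative.
    rng = np.random.default_rng(42)
    residual = rng.standard_normal(100_000).astype(np.float32)  # mock residual

    def norm_sq_over_p_chunks(vec, p):
        """Accumulate sum(x*x) as p sequential partial sums, mimicking a
        reduction over p processes, then combine the partial results."""
        partials = []
        for chunk in np.array_split(vec, p):
            s = np.float32(0.0)
            for x in chunk:
                s += x * x
            partials.append(s)
        total = np.float32(0.0)
        for s in partials:
            total += s
        return total

    for p in (1, 2, 4, 8, 16):
        print(f"{p:2d} 'processes': ||r||^2 = {norm_sq_over_p_chunks(residual, p):.6f}")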

References

Amdahl, G. M. (1967). Validity of the single processor approach to achieving large scale computing capabilities. In AFIPS Spring Joint Computer Conference, volume 30, pages 483–485.
Chandra, R., Menon, R., Dagum, L., Kohr, D., Maydan, D., and McDonald, J. (2001). Parallel Programming in OpenMP. Morgan Kaufmann.
Coddington, P. (1997). Random number generators for parallel computers. The NHSE Review, 2.
Craig, R. (1997). Soil Mechanics. Spon Press, sixth edition.
Dagum, L. and Menon, R. (1998). OpenMP: an industry standard API for shared-memory programming. Computational Science & Engineering, IEEE, 5(1):46–55.
Dongarra, J. J., Meuer, H. W., and Strohmaier, E. (2010). Top500 supercomputer sites. http://www.top500.org/ (updated every 6 months).
van den Eijnden, A. P. (2010). Conditional simulation for characterising the spatial variability of sand state. MSc thesis, Delft University of Technology, The Netherlands.
van den Eijnden, A. P. and Hicks, M. A. (2011). Conditional simulation for characterising the spatial variability of sand state (under review). In Proc. 2nd Int. Symp. Computational Geomechanics, Dubrovnik, Croatia.
Fenton, G. (1990). Simulation and analysis of random fields. PhD thesis, Princeton University.
Fenton, G. (1994). Error evaluation of three random field generators. ASCE Journal of Engineering Mechanics, 120(12):2478–2497.


Fenton, G. and Vanmarcke, E. (1990). Simulation of random fields via local average subdivision. Journal of Engineering Mechanics, 116(8):1733–1749.
Flynn, M. (1972). Some computer organizations and their effectiveness. IEEE Transactions on Computers, C-21(9):948–960.
Forum, M. P. I. (1994). MPI: A message-passing interface standard. The International Journal of Supercomputer Applications and High Performance Computing, 8(3/4):159–416.
Forum, M. P. I. (1997). MPI-2: Extensions to the message-passing interface. Technical report.
Foster, I. (1994). Designing and Building Parallel Programs: Concepts and Tools for Parallel Software Engineering. Addison-Wesley.
Griffiths, D. and Fenton, G. (1997). Three-dimensional seepage through spatially random soil. Journal of Geotechnical and Geoenvironmental Engineering, 123(2):153–160.
Griffiths, D. and Fenton, G. (2000). Influence of soil strength spatial variability on the stability of an undrained clay slope by finite elements. In Slope Stability 2000, Proceedings of Sessions of Geo-Denver 2000, pages 184–193. ASCE.
Griffiths, D. and Fenton, G. (2004). Probabilistic slope stability analysis by finite elements. Journal of Geotechnical and Geoenvironmental Engineering, 130(5):507–518.
Gustafson, J. L. (1988). Reevaluating Amdahl's law. Communications of the ACM, 31(5):532–533.
Haahr, M. (2010). Random.org: True random number service. Web resource, available at http://www.random.org. Accessed: November 2010.
Hellekalek, P. (1998). Don't trust parallel Monte Carlo! In PADS '98: Proceedings of the Twelfth Workshop on Parallel and Distributed Simulation, pages 82–89, Washington, DC, USA. IEEE Computer Society.
Hicks, M. A. and Samy, K. (2002a). Influence of anisotropic spatial variability on slope reliability. In Proceedings of 8th Int. Symp. Num. Models Geomech., pages 535–539.


Hicks, M. A. and Samy, K. (2002b). Influence of heterogeneity on undrained clay slope stability. The Quarterly Journal of Engineering Geology and Hydrogeology, 35(1):41–49.
Hicks, M. A. and Samy, K. (2002c). Reliability-based characteristic values: a stochastic approach to Eurocode 7. Ground Engineering, 35:30–34.
Hicks, M. A. and Samy, K. (2004). Stochastic evaluation of heterogeneous slope stability. Italian Geotechnical Journal, (38):54–66.
Hicks, M. A. and Spencer, W. A. (2010). Influence of heterogeneity on the reliability and failure of a long 3D slope. Computers and Geotechnics, 37(7-8):948–955.
James, F. (1990). A review of pseudorandom number generators. Computer Physics Communications, 60(3):329–344.
Jennings, A. and McKeown, J. (1992). Matrix Computation for Engineers and Scientists. John Wiley & Sons, Ltd.
Knuth, D. E. (1997). The Art of Computer Programming: Seminumerical Algorithms, volume 2. Addison-Wesley Professional, third edition.
Koelbel, C. H., Loveman, D. B., and Schreiber, R. S. (1994). The High Performance Fortran Handbook. MIT Press.
L'Écuyer, P. (1994). Uniform random number generation. Annals of Operations Research, 53:77–120.
L'Écuyer, P., Cordeau, J.-F., and Simard, R. (2000). Close-point spatial tests and their application to random number generators. Operations Research, 48(2):308–317.
L'Écuyer, P. and Simard, R. (2007). TestU01: a C library for empirical testing of random number generators. ACM Trans. Math. Softw., 33(4).
Lehmer, D. (1949). Mathematical methods in large-scale computing units. In Proc. 2nd Sympos. on Large-Scale Digital Calculating Machinery, pages 141–146, Cambridge, MA. Harvard University Press.
Lehmer, D. (1954). Random number generation on the BRL high-speed computing machines. Math. Rev., 15(559).


Lewis, P., Goodman, A., and Miller, J. (1969). A pseudo-random number generator for the System/360. IBM Syst. Journal, 8:136–146.
Mandelbrot, B. B. and Ness, J. W. V. (1968). Fractional Brownian motions, fractional noises and applications. SIAM Review, 10(4):422–437.
Margetts, L. (2002). Parallel Finite Element Analysis. PhD thesis, The University of Manchester, Manchester, UK.
Marsaglia, G. (1968). Random numbers fall mainly in the planes. Proceedings of the National Academy of Sciences, 61(1):25–28.
Marsaglia, G. (2003). Xorshift RNGs. Journal of Statistical Software, 8(14):1–6.
Marsaglia, G. and Tsay, L.-H. (1985). Matrices and the structure of random number sequences. Linear Algebra and its Applications, 67:147–156.
Onisiphorou, C. (2000). Stochastic analysis of saturated soils using finite elements. PhD thesis, University of Manchester, Manchester, UK.
Paice, G. and Griffiths, D. (1997). Reliability of an undrained clay slope formed from spatially random soil. In Proceedings of 9th International Conference on Computer Methods and Advances in Geomechanics, volume 1, pages 543–548, Wuhan, China.
Park, S. K. and Miller, K. W. (1988). Random number generators: good ones are hard to find. Commun. ACM, 31(10):1192–1201.
Pettipher, M. and Smith, I. (1997). The development of an MPP implementation of a suite of finite element codes. In Hertzberger, B. and Sloot, P., editors, High-Performance Computing and Networking, volume 1225 of Lecture Notes in Computer Science, pages 400–409. Springer Berlin / Heidelberg.
Press, W. H., Teukolsky, S. A., Vetterling, W. T., and Flannery, B. P. (1992a). Numerical Recipes in C: The Art of Scientific Computing. Cambridge University Press, second edition.
Press, W. H., Teukolsky, S. A., Vetterling, W. T., and Flannery, B. P. (1992b). Numerical Recipes in Fortran: The Art of Scientific Computing. Cambridge University Press, second edition.


Press, W. H., Teukolsky, S. A., Vetterling, W. T., and Flannery, B. P. (1996). Numerical Recipes in Fortran 90: The Art of Scientific Computing, volume 2. Cambridge University Press, second edition.
Quinn, M. J. (2003). Parallel Programming in C with MPI and OpenMP. McGraw-Hill Professional.
Samy, K. (2003). Stochastic Analysis with Finite Elements in Geotechnical Engineering. PhD thesis, The University of Manchester, Manchester, UK.
Schrage, L. (1979). A more portable Fortran random number generator. ACM Trans. Math. Softw., 5:132–138.
Smith, I. M. and Griffiths, D. V. (1988). Programming the Finite Element Method. John Wiley & Sons, Ltd, second edition.
Smith, I. M. and Griffiths, D. V. (1997). Programming the Finite Element Method. John Wiley & Sons, Ltd, third edition.
Smith, I. M. and Griffiths, D. V. (2004). Programming the Finite Element Method. John Wiley & Sons, Ltd, fourth edition.
Smith, I. M., Leng, J., and Margetts, L. (2005). Parallel three dimensional finite element analysis of excavation. In 13th ACME Conference, University of Sheffield. ACME.
Smith, I. M. and Margetts, L. (2006). The convergence variability of parallel iterative solvers. Engineering Computations: International Journal for Computer-Aided Engineering and Software, 23(2):154–165.
Snir, M., Otto, S., Huss-Lederman, S., Walker, D., and Dongarra, J. (1995). MPI: The Complete Reference. MIT Press.
Spencer, W. (2007). Parallel Stochastic Analysis with Finite Elements in Geotechnical Engineering. PhD thesis, The University of Manchester, Manchester, UK.
Spencer, W. and Hicks, M. (2007). A 3D finite element study of slope reliability. In Proceedings of 10th Int. Symp. Num. Models Geomech., Rhodes, pages 539–543.


Srinivasan, A., Ceperley, D., and Mascagni, M. (1998). Testing parallel random number generators.
Tan, C. J. K. (2002). The PLFG parallel pseudo-random number generator. Future Generation Computer Systems, 18(5):693–698.
Taylor, D. (1937). Stability of earth slopes. Journal of the Boston Society of Civil Engineers, (24):197–246.
Vanmarcke, E. H. (1977). Probabilistic modelling of soil profiles. Journal of the Geotechnical Engineering Division, 103(11):1227–1246.
Vanmarcke, E. H. (1983). Random Fields: Analysis and Synthesis. MIT Press, Cambridge, MA.

Appendix A

Random Field Generation Performance Results

[Performance figures for the parallel random field generator (plots not reproduced). Each figure comprises four panels plotted against No. of Processors: (a) Time (s), (b) Max. Memory (kB) per Processor, (c) Speedup and (d) Efficiency. Panels (a) and (b) show the actual measurements together with theoretical and best-fit curves; the fitted parameters (a, b) annotated on the plots are listed below.

Figure A.1 (Field : 512 × 512 × 512 cells): time fit a = 146.42×10³, b = -0.85445×10³; memory fit a = 4475.4×10³, b = 45.841×10³.
Figure A.2 (Field : 512 × 512 × 256 cells): time fit a = 73.499×10³, b = -0.5842×10³; memory fit a = 2241.5×10³, b = 38.841×10³.
Figure A.3 (Field : 512 × 512 × 128 cells): time fit a = 51.341×10³, b = -2.7033×10³; memory fit a = 1120.5×10³, b = 37.821×10³.
Figure A.4 (Field : 512 × 512 × 64 cells): time fit a = 22.136×10³, b = -0.71423×10³; memory fit a = 561.74×10³, b = 35.846×10³.
Figure A.5 (Field : 512 × 512 × 32 cells): time fit a = 11.535×10³, b = -0.34558×10³; memory fit a = 278.88×10³, b = 37.493×10³.
Figure A.6 (Field : 256 × 256 × 256 cells): time fit a = 20.059×10³, b = -0.53231×10³; memory fit a = 571.25×10³, b = 22.726×10³.
Figure A.7 (Field : 256 × 256 × 128 cells): time fit a = 9.9146×10³, b = -0.24527×10³; memory fit a = 266.87×10³, b = 38.433×10³.
Figure A.8 (Field : 256 × 256 × 64 cells): time fit a = 5.9696×10³, b = -0.27865×10³; memory fit a = 127.34×10³, b = 37.05×10³.
Figure A.9 (Field : 256 × 256 × 32 cells): time fit a = 3.3052×10³, b = -0.17287×10³; memory fit a = 60.384×10³, b = 34.02×10³.
Figure A.10 (Field : 128 × 128 × 128 cells): time fit a = 2.603×10³, b = -0.13289×10³; memory fit a = 51.81×10³, b = 39×10³.
Figure A.11 (Field : 128 × 128 × 64 cells): time fit a = 1.0448×10³, b = -0.044693×10³; memory fit a = 15.579×10³, b = 39.721×10³.
Figure A.12 (Field : 128 × 128 × 32 cells): time fit a = 0.67288×10³, b = -0.046225×10³; memory fit a = -0.6823×10³, b = 37.224×10³.]

Appendix B

Slope Stability Analysis Performance Results

[Performance figures for the parallel RFEM slope stability analyses (plots not reproduced). Each figure comprises four panels plotted against No. of Processors: (a) Time (s), (b) Max. Memory (kB) per Processor, (c) Speedup and (d) Efficiency. Panels (a) and (b) show the actual measurements together with theoretical and best-fit curves; the fitted parameters (a, b) annotated on the plots are listed below.

Figure B.1 (Profile 1 : Length = 96m): time fit a = 271.19×10³, b = 6.473×10³; memory fit a = 708.38×10³, b = 27.424×10³.
Figure B.2 (Profile 1 : Length = 64m): time fit a = 181.03×10³, b = 2.2935×10³; memory fit a = 470.01×10³, b = 27.292×10³.
Figure B.3 (Profile 1 : Length = 32m): time fit a = 76.239×10³, b = 0.61201×10³; memory fit a = 231.07×10³, b = 25.696×10³.
Figure B.4 (Profile 2 : Length = 96m): time fit a = 1040.3×10³, b = -40.732×10³; memory fit a = 1519.8×10³, b = 29.828×10³.
Figure B.5 (Profile 2 : Length = 64m): time fit a = 753.91×10³, b = -9.423×10³; memory fit a = 1011.3×10³, b = 28.352×10³.
Figure B.6 (Profile 2 : Length = 32m): time fit a = 218.94×10³, b = 7.2277×10³; memory fit a = 499.92×10³, b = 27.165×10³.
Figure B.7 (Profile 3 : Length = 96m): time fit a = 539.81×10³, b = -0.78328×10³; memory fit a = 1157.5×10³, b = 28.593×10³.
Figure B.8 (Profile 3 : Length = 64m): time fit a = 537.76×10³, b = -29.49×10³; memory fit a = 766.37×10³, b = 27.575×10³.
Figure B.9 (Profile 3 : Length = 32m): time fit a = 142.1×10³, b = -1.0014×10³; memory fit a = 378.41×10³, b = 26.575×10³.]