DISSERTATION

A Java-based Programming and Execution Environment for Many-core Parallel Computers

by
Muhammad Aleem

A thesis submitted in partial fulfilment of the requirements for the degree of
Doctor of Technical Sciences

supervised by
Assoz.-Prof. Priv.-Doz. Dr. Radu Prodan

Institute of Computer Science, University of Innsbruck, Austria
April 19, 2012
Declarations

I declare that this thesis is my own work and most of the material has been taken from my own published work. This work was done wholly or mainly while in candidature for a research degree at this university and has not been submitted in any form for another degree or diploma at any university or other institute of tertiary education. Information derived from the published and unpublished work of others has been acknowledged in the text, and a list of references is given.
Innsbruck, Austria, April 19, 2012
Muhammad Aleem
Acknowledgements

First of all, I am thankful to Almighty Allah for His blessed knowledge and power, which enabled me to achieve the milestone of my doctoral thesis.

I would like to express my deep and sincere gratitude to my respected supervisor, Associate Prof. Dr. Radu Prodan, for his supervision, advice, valuable guidance, and encouragement throughout my doctoral research work over the past four years. His supervision and comments helped me a great deal to stay focused on my research goals and motivated me to carry out further in-depth investigations. It has been a great pleasure to work under his supervision, and without his extremely valuable assistance this thesis would likely not have matured.

I am extremely grateful to Prof. Dr. Thomas Fahringer for providing me with the opportunity, substantial guidance, and encouragement to join his prestigious research group of very talented researchers in the field of parallel and distributed computing. The knowledge and skills I gained working here will greatly benefit my future career. I am thankful to Dr. Hans Moritsch for his contributions to the earlier stages of this research. Moreover, I am grateful to the members of my research group for providing me with valuable guidance and support during my research; I wish them the best of luck in their research. I am indebted to the reviewers of this thesis, whose constructive feedback and suggestions enabled me to shape this thesis. Furthermore, I am grateful to the Higher Education Commission (HEC) of Pakistan, the Austrian Exchange Service (OeAD), the Tiroler Zukunftsstiftung, and the University of Innsbruck for providing me with generous funding as well as technical and organizational support to pursue my studies.

My heartiest gratitude goes to my parents, my wife Rabia, my brothers and sisters, and other family members for their indispensable encouragement, love, and prayers throughout these long years of my PhD studies. Their tremendous moral support and confidence in me helped me to tackle all my problems. Furthermore, I would like to thank all my teachers for their encouragement and motivation. Moreover, I am grateful to all of my friends, particularly from the University of Innsbruck, Quaid-i-Azam University Islamabad, and my hometown Ghotki, Pakistan, for their emotional support, joyful company, exhilaration, cooperation, and constant help. Lastly, I want to thank and offer my regards to all of those who supported me in any respect during the completion of this thesis.
Innsbruck, Austria
Muhammad Aleem
April 19, 2012
Dedicated to my parents and my wife Rabia
Abstract

Today, software developers face the challenge of re-engineering their applications to exploit the full power of the emerging multi-/many-core architectures. This shift profoundly affects application developers, who can no longer rely transparently on Moore's law to speed up their applications. Rather, they have to parallelise their applications with user-controlled load balancing and locality control to exploit the underlying multi-/many-core architectures and their increasingly complex memory hierarchies.

In recent years, the use of the Java language to program performance-oriented applications has increased. The locality of tasks and data has a significant impact on application performance, and it becomes even more critical on multi-core architectures because of the additional levels of memory hierarchy and heterogeneity. In the last few years, Graphics Processing Units (GPUs) have emerged as powerful coprocessors to accelerate performance-oriented applications. The GPUs' improved programmability support for general-purpose applications and their highly parallel architecture motivate extending the capabilities of existing high-level programming languages such as Java with new abstractions and tools to ease programming and to efficiently exploit the available compute capabilities. However, a uniform high-level programming model for parallelising Java applications is still missing.

In this thesis, we propose a new Java-based programming model called JavaSymphony for shared, distributed, and hybrid memory multi-/many-core parallel computers and coprocessor accelerators, as an extension to the existing Java distributed programming environment. Using JavaSymphony, a parallel Java application can be uniformly programmed and executed on a variety of multi-/many-core architectures. Heterogeneous conventional and data-parallel multi-core devices can be programmed using a unique and high-level Java programming abstraction that shields the user from low-level architectural details such as method invocations, thread management, synchronisation, memory allocation, and data transfers. JavaSymphony's design is based on the concept of dynamic virtual architectures, which allow programmers to define a hierarchical structure of the underlying computing resources (e.g., accelerators, cores, processors, machines, and clusters) and to control load balancing and locality. JavaSymphony provides high-level programming constructs which abstract low-level details and simplify the tasks of controlling parallelism, locality, and load balancing. Moreover, JavaSymphony provides a multi-core aware scheduling mechanism capable of mapping parallel Java applications on large multi-core machines and heterogeneous clusters with improved performance. The JavaSymphony scheduler considers several multi-core specific performance parameters and application types, and uses these parameters to optimise the mappings of applications, objects, and tasks.

We evaluate the JavaSymphony framework using several real applications and benchmarks on modern multi-core parallel computers, heterogeneous clusters, and machines consisting of a combination of different multi-core CPU and GPU devices. The performance results demonstrate that JavaSymphony outperforms pure Java implementations as well as other alternative related solutions, and validate the research questions addressed.
Contents

Declarations
Acknowledgements
Abstract
List of Figures
List of Tables

1 Introduction
  1.1 Motivations
    1.1.1 Multi-/Many-core Computers
    1.1.2 Scheduling
    1.1.3 Coprocessor Accelerators
  1.2 Thesis Goals
    1.2.1 Programming Multi-/Many-core Computers
    1.2.2 Multi-core Scheduling
    1.2.3 Programming Coprocessor Accelerators
  1.3 Thesis Organization

2 Model
  2.1 Architectural Model
    2.1.1 Flynn's Taxonomy
    2.1.2 Parallel Computers
    2.1.3 Multi-core Era
    2.1.4 Coprocessor Accelerators
    2.1.5 General-Purpose Graphics Processing Units (GPGPUs)
  2.2 Parallel Programming Models
    2.2.1 Shared Memory Model
    2.2.2 Distributed Memory Model
    2.2.3 Hybrid Memory Model
    2.2.4 Data Parallel Model
  2.3 Programming Technologies
    2.3.1 Parallel and Distributed Computing using Java
    2.3.2 Programming Heterogeneous Parallel Computers
    2.3.3 Java–OpenCL Bindings
  2.4 Summary

3 JavaSymphony Background
  3.1 Introduction
  3.2 Dynamic Virtual Distributed Architectures
  3.3 JavaSymphony Distributed Objects
    3.3.1 Creating and Mapping JS Objects
    3.3.2 JS Object Types
    3.3.3 Method Invocation Types
  3.4 Synchronisation Mechanisms
    3.4.1 Asynchronous Method Synchronisation
    3.4.2 Barrier Synchronisation
  3.5 JavaSymphony Run-time System
    3.5.1 Administration Shell
    3.5.2 Network Agent System
    3.5.3 Object Agent System
  3.6 Summary

4 JavaSymphony Multi-/Many-core Computing
  4.1 Introduction
  4.2 Related Work
  4.3 JavaSymphony
    4.3.1 Dynamic Virtual Architectures
    4.3.2 JavaSymphony Objects
    4.3.3 Synchronisation Mechanisms
    4.3.4 JavaSymphony System Architecture
  4.4 Code Examples
    4.4.1 Matrix Transposition Application
    4.4.2 3D Ray Tracing Application
  4.5 Performance Evaluation
    4.5.1 Experimental Setup
    4.5.2 Shared Memory Experiments
    4.5.3 Distributed Memory Experiments
    4.5.4 Conclusions
  4.6 Summary

5 JavaSymphony Multi-core Scheduler
  5.1 Introduction
  5.2 Related Work
  5.3 JavaSymphony Scheduler
    5.3.1 Architecture
    5.3.2 Methodology
    5.3.3 Algorithm
  5.4 Performance Evaluation
    5.4.1 Experimental Setup
    5.4.2 Experimental Methodology
    5.4.3 Communication-Intensive Applications
    5.4.4 Computation-Intensive Applications
    5.4.5 Conclusions
  5.5 Summary

6 JavaSymphony Heterogeneous Computing
  6.1 Introduction
  6.2 Related Work
  6.3 Model
    6.3.1 OpenCL Framework
    6.3.2 Java–OpenCL Bindings
  6.4 JavaSymphony Extensions
    6.4.1 Dynamic Virtual Architectures
    6.4.2 System Architecture
    6.4.3 JavaSymphony API
  6.5 Code Example
  6.6 Performance Evaluation
    6.6.1 Experimental Setup
    6.6.2 Application Versions
    6.6.3 Java versus OpenCL
    6.6.4 Single Machine based Experiments
    6.6.5 Cluster based Experiments
    6.6.6 Conclusions
  6.7 Summary

7 Conclusions and Future Directions
  7.1 Contributions
    7.1.1 Programming Multi-/Many-cores
    7.1.2 Multi-core Scheduling
    7.1.3 Programming Coprocessor Accelerators
    7.1.4 Published Contributions
  7.2 Future Directions

Abbreviations
Bibliography
List of Figures

2.1 Shared memory parallel computers.
2.2 Distributed memory parallel computer.
2.3 Hybrid memory parallel computer.
2.4 Multi-core processor architectures.
2.5 A simplified architecture of a GPU device.
2.6 The Fermi architecture.
2.7 Fermi streaming multiprocessor.
2.8 Shared memory programming models.
2.9 Distributed memory programming model.
2.10 Hybrid memory model.
2.11 SIMD data parallel compute model.
2.12 The OpenCL memory model.

3.1 A three-level virtual architecture.
3.2 JavaSymphony system architecture.
3.3 Job processing mechanism.

4.1 Four-level multi-core aware VAs.
4.2 Multi-core aware JS system architecture.
4.3 Shared and distributed memory object agent system.
4.4 JS run-time locality application procedure.
4.5 CPU topology of SunFireX4600M2 machine.
4.6 DCT experimental results.
4.7 CG kernel speedup.
4.8 CG kernel L3 cache misses.
4.9 CG kernel local DRAM accesses.
4.10 Shared memory ray tracing application speedup.
4.11 Ray tracing application local DRAM accesses.
4.12 Ray tracing L3 cache misses.
4.13 Cholesky factorisation experimental results.
4.14 Matrix transposition speedup.
4.15 Matrix transposition performance analysis.
4.16 SpMV application speedup.
4.17 SpMV application performance analysis.
4.18 Speedup of the JSRT application on m01 cluster.
4.19 JSRT application efficiency on m01 cluster.
4.20 Data prefetch requests by the JSRT application on m01 cluster.
4.21 Data-cache misses for JSRT application on m01 cluster.
4.22 L3 cache misses and memory accesses by JSRT application on m01 cluster.
4.23 Speedup comparison of JSRT application on Karwendel and m01 clusters.
4.24 Speedup of the JSRT application on Karwendel cluster.
4.25 JSRT application overheads analysis on Karwendel cluster.
4.26 Speedup of JSRT application on heterogeneous cluster.
4.27 Load imbalance analysis on heterogeneous cluster.
4.28 JSRT application efficiency on heterogeneous cluster.
4.29 JSRT execution times and overheads profile on heterogeneous cluster.
4.30 JSRT application efficiency (II) on heterogeneous cluster.
4.31 IPC and DRAM accesses of the JSRT application on heterogeneous cluster.
4.32 L1 and L3 cache misses of the JSRT application on heterogeneous cluster.
4.33 DRAM bandwidth analysis on heterogeneous cluster.
4.34 Write and read bandwidths analysis on heterogeneous cluster.

5.1 JS system architecture (scheduler and resource manager).
5.2 Application classification.
5.3 SpMVM experimental results on m01 machine.
5.4 SpMVM experimental results on HC cluster.
5.5 NAS CG speedup on m01 machine.
5.6 VCMatrix application speedup on HC cluster.
5.7 SpMMM application speedup on m01 machine.
5.8 Speedup of the SpMMM application on HC cluster.
5.9 VCMatrix stepwise application optimisation on HC cluster.
5.10 3DRT application speedup on m01 machine.
5.11 3DRT application speedup on HC cluster.
5.12 MatrixFPO application speedup on m01 machine.
5.13 MatrixFPO application speedup on HC cluster.
5.14 NAS EP kernel speedup on m01 machine.

6.1 A four-level virtual architecture.
6.2 JS system architecture.
6.3 Jogamp-JOCL versus OpenCL experimental results.
6.4 Dense matrix multiplication experimental results.
6.5 Dense matrix multiplication performance results, matrix size: 6000 × 6000.
6.6 Performance of the dense matrix multiplication using the multi-GPU setup.
6.7 Dense matrix multiplication IO overheads for multi-GPU configuration.
6.8 JS and data-transfer overheads for the dense matrix multiplication.
6.9 Optimised dense matrix multiplication performance results.
6.10 Sparse matrix multiplication experimental results.
6.11 Encryption-decryption speed of the IDE application.
6.12 Data transfer rates attained by the IDE application.
6.13 Work-group size impact on the speedup of the IDE application.
6.14 Rocket equation kernel experimental results.
6.15 Rocket equation application experimental results.
6.16 DCT kernel experimental results.
6.17 DCT application experimental results.
6.18 Overhead analysis of the DCT application.
6.19 DCT kernel execution profile.
List of Tables

3.1 JavaSymphony job types.
4.1 The cluster architectures.
5.1 HC cluster experimental setup.
5.2 Performance factors lists - average speedup gains.
6.1 Experimental setup.
Listings

3.1 VA creation using the bottom-up approach.
3.2 VA creation using the top-down approach.
3.3 JavaSymphony distributed objects creation.
3.4 Synchronous method invocation.
3.5 Asynchronous method invocation.
3.6 One-sided method invocation.
3.7 Asynchronous synchronisation mechanisms.
3.8 Barrier synchronisation mechanisms.
4.1 Multi-core VA creation example.
4.2 Shared JS object creation and synchronisation mechanisms.
4.3 Matrix transposition example - shared memory.
4.4 The core JS code of the JSRT application - distributed memory.
6.1 FAT VA node creation.
6.2 JS data-buffer creation.
6.3 JS kernel object creation.
6.4 Data-parallel JS object creation.
6.5 Vector addition host program in JavaSymphony.
6.6 Vector addition kernel program in OpenCL.
Chapter 1
Introduction

In 1965, Gordon Moore presented a guiding principle of computer architecture known as Moore's law [83]. According to Moore's law, the number of transistors in a processor doubles roughly every 18 to 24 months. Over the years, Moore's law has proved to be true, and application developers gained improved speedups by merely acquiring faster computing machines based on higher-clocked processors. In the last few years, multi-core processors have emerged and displaced the old approach of increasing clock frequencies to gain performance. The technology shift was triggered by the problems of highly clocked single-core processors, such as exponential power consumption and processor overheating. The multi-core approach essentially focuses on adding more parallelism in the form of processing cores and other functional units.

The multi-core architectures raise new challenges, which are mostly related to programming. Today, application developers can no longer rely on the computing architecture alone to speed up their applications. Rather, they are required to design and develop existing and new applications based on parallel algorithms to exploit the compute capabilities of multi-core architectures. A parallel application employs several computing resources (e.g., machines, processors, or cores) simultaneously to compute a large problem. Generally, parallel applications use task- and data-level parallelism. Although parallel applications can make use of the compute capabilities of multi-core processors, they are still difficult to program and maintain.

The multi-core architectures have added more heterogeneity and deeper memory hierarchies to existing large parallel computers such as massively parallel supercomputers and compute clusters. The architectural changes introduced by multi-core processors pose a substantial challenge to parallel programming paradigms and execution environments.
Programming multi-core parallel computers is today an important topic in the scientific and mainstream communities. However, a uniform high-level programming model and a multi-core aware execution environment are required to ease the parallelisation process and to efficiently execute scientific and real applications on shared and distributed memory multi-core parallel computers.

The use of Java for scientific and high performance parallel applications has increased significantly in recent years. Java is a widely used modern programming language that provides features such as code portability, interoperability, synchronisation, multi-threading, network programming, and remote procedure call capabilities, which are highly useful to exploit medium- to coarse-grained parallelism on shared and distributed memory parallel computers. The intermediate byte-code representation is the Java language's premier feature, enabling applications to run on almost every platform. Over time, the performance of the Java execution environment has improved significantly due to optimisations and other related efforts such as advanced memory management and dynamic Just-in-Time compilation [62]. Several studies [17, 59, 74, 87] show that Java-based applications achieve competitive performance compared to other mainstream programming languages (such as C++, C, and Fortran) for performance-oriented applications. Today, there is a rising interest in the use of Java for programming shared memory computers, heterogeneous clusters, and parallel machines equipped with many-core accelerating devices such as Graphics Processing Units (GPUs).

The Java language provides object-oriented constructs to program both shared and distributed memory parallel computers. On shared memory machines, Java's multi-threading capabilities can be used to exploit the cores, where each core runs a thread. For distributed memory systems, Java provides the Remote Method Invocation (RMI) [45] mechanism, which is used to instantiate parallel tasks on remote machines. However, to exploit the computing capabilities and optimisation opportunities provided by multi-/many-core architectures, programmers still have to deal with low-level and error-prone details of programming languages and require a good understanding of the ever more complex target hardware architectures. A uniform high-level programming model and interface for parallelising Java applications on multi-/many-core architectures is still missing.

Today, there is a strong need for high-level parallel programming languages and libraries capable of programming multi-core shared and distributed memory, and heterogeneous many-core parallel computers, to exploit the compute capabilities of the highly parallel multi-/many-core architectures. To provide fine-grained locality control and load balance, the programming paradigms are required to be multi-core aware, providing abstractions at the level of processors and cores. Furthermore, a unified solution is needed to program parallel Java applications on a variety of multi-/many-core architectures such as shared, distributed, and heterogeneous parallel computers. As a result, we enhanced a pure distributed memory Java programming environment called JavaSymphony (JS) [38, 39, 54, 55, 56] with shared memory programming abstractions and multi-core aware execution capabilities, a multi-core aware scheduling mechanism, and programming support for heterogeneous parallel machines equipped with coprocessor-based accelerating devices such as GPUs. JS's design is based on the concept of dynamic virtual architectures, which allow programmers to define a hierarchical structure of the underlying computing resources (e.g., machines, clusters, and distributed Grids) and to control load balancing and locality. JS's high-level constructs abstract low-level infrastructure details and simplify the task of controlling parallelism and locality.
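As a point of reference for the shared memory case mentioned above, the following plain-Java sketch distributes a simple array summation over as many worker tasks as there are available cores. It uses only the standard java.util.concurrent API and is not JavaSymphony code; class and variable names are illustrative.

    import java.util.ArrayList;
    import java.util.List;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.Future;

    // Plain-Java illustration (not JavaSymphony): one worker task per available core.
    public class ParallelSum {
        public static void main(String[] args) throws Exception {
            final int cores = Runtime.getRuntime().availableProcessors();
            final double[] data = new double[1 << 22];
            java.util.Arrays.fill(data, 1.0);

            ExecutorService pool = Executors.newFixedThreadPool(cores);
            List<Future<Double>> parts = new ArrayList<>();
            int chunk = data.length / cores;
            for (int c = 0; c < cores; c++) {
                final int lo = c * chunk;
                final int hi = (c == cores - 1) ? data.length : lo + chunk;
                // Each task sums its own chunk; ideally the JVM/OS runs it on a separate core.
                parts.add(pool.submit(() -> {
                    double s = 0.0;
                    for (int i = lo; i < hi; i++) s += data[i];
                    return s;
                }));
            }
            double sum = 0.0;
            for (Future<Double> f : parts) sum += f.get();
            pool.shutdown();
            System.out.println("sum = " + sum);
        }
    }

Note that standard Java leaves the thread–core placement to the JVM and the operating system; explicit, user-controlled locality of this kind is exactly what the JavaSymphony extensions developed in this thesis add.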
1.1 Motivations
In this thesis, we investigate techniques and strategies to parallelise and execute Java applications on modern multi-/many-core architectures, including large shared memory machines, heterogeneous multi-core clusters, and parallel computers based on coprocessor accelerators (such as GPUs). Primarily, we focus on three broad research questions:

1. How to program and execute parallel Java applications on multi-core architectures with user-controlled locality, parallelism, and load balancing, to exploit the available compute capabilities and locality optimisations?

2. How to automatise task–resource mappings of Java applications on shared and distributed memory multi-core parallel computers to improve application performance and developer productivity?

3. How to exploit the huge compute capabilities of many-core coprocessor devices (such as GPUs) to accelerate parallel Java applications?

The following subsections describe the identified motivations and research questions we plan to address.
1.1.1 Multi-/Many-core Computers
In recent years, the use of high-level languages such as Java to program performance-oriented applications has increased. The Java language provides code portability and high-level programming constructs such as multi-threading, communication mechanisms, synchronisation constructs, and remote method invocations. These programming constructs are highly suited to develop a unified Java programming and execution environment for a variety of multi-core architectures such as shared memory machines and distributed memory heterogeneous clusters.

Multi-core processors add further complexity to existing parallel computers, and programming these machines has become even more challenging. Java's high-level object-oriented features simplify programming tasks and provide abstractions to tackle the complexities of multi-core architectures. The object-oriented programming constructs of the Java language help developers to focus on the core logic and free them from low-level details such as thread management and interoperability issues.

The locality of tasks and data has a significant impact on application performance, as demonstrated in [32, 69, 90, 97]. In a multi-core processor, the processing cores often share a common high-speed memory called cache, and several levels of cache memory may exist within a multi-core processor. On a multi-core machine, the decision of mapping parallel tasks to cores thus becomes vital in order to attain good performance. Relying only on operating systems and job schedulers (on heterogeneous clusters) often results in degraded performance.

Today, there are several Java parallel programming environments [31, 57, 85, 99, 101]. Most of these efforts, however, do not provide user-controlled locality of tasks and data to exploit the complex memory hierarchies of multi-core architectures. Most of these parallel programming paradigms provide locality control of tasks only at the level of individual parallel computers and Java Virtual Machines (JVMs). To attain greater performance benefits on multi-core architectures, applications need to be executed with enhanced locality control mechanisms that can map the executing tasks to individual processors and cores. Such enhanced locality control results in better exploitation of the complex memory hierarchies (to allow or avoid sharing, e.g., in the case of cache contention) on multi-/many-core architectures such as shared memory machines and homogeneous and heterogeneous clusters.
1.1.2 Scheduling
The scheduling process maps the executing tasks to the computing resources in order to achieve reduced execution times, better resource utilisation, or higher throughput. Scheduling is generally considered a hard problem [66], and multi-core architectures make it even more challenging. The decision to map a task to a specific core may result in improved or degraded performance [95]. Therefore, a scheduler is required to be multi-core aware. Computational tasks can be mapped manually by the application developers, or the mapping process can be automatised. Automatised mapping of applications by schedulers may reduce application development and deployment time; however, inappropriate scheduling may also result in degraded performance.

The performance of memory systems has grown much more slowly than that of CPUs [94]. Scheduling decisions that map tasks to appropriate computing cores are therefore important to avoid expensive memory access penalties. Multi-core processors add additional levels of memory hierarchy which are used as shared resources by the cores of a multi-core processor. To achieve commendable performance benefits, the task–core mapping should be done efficiently to fully exploit the underlying multi-core architecture. Therefore, a scheduler is required to consider several low-level architectural properties to efficiently schedule performance-oriented applications.

Scheduling an application by considering only locality constraints does not always pay off, because some applications exhibit performance degradation if their tasks are mapped so that they employ shared resources [100] (such as shared caches, due to contention). Therefore, a multi-core aware scheduler is needed that considers several multi-core related architectural and application properties, such as communication and computation needs.

Today, many research efforts target application scheduling on multi-core architectures. Some of them [31, 57, 99], however, consider at most one architectural characteristic, or they [102] are limited to a specific parallel computing architecture such as shared or distributed memory computers. To the best of our knowledge, no scheduling mechanism for Java applications considers both the low-level multi-core specific characteristics (e.g., network and memory latencies, bandwidth, processor speed, shared caches, machine load) and the application properties.
1.1.3 Coprocessor Accelerators
Today, the trend in high-performance computing is to add more parallelism to the existing computing architectures. As a result, coprocessor accelerating devices such as GPUs have emerged as highly parallel devices containing hundreds of computing cores. To exploit the compute capabilities of these devices, high-level programming languages such as Java and performance-oriented programming paradigms need to be extended and interfaced with them.

To accelerate existing Java applications, there is a strong need for a Java programming framework capable of off-loading computations to coprocessor-based accelerating devices such as GPUs. Such a high-level Java framework should free application developers from the low-level details of the Java language and of accelerator programming frameworks such as the Open Computing Language (OpenCL) and the Compute Unified Device Architecture (CUDA). In addition to high-level programming abstractions, tedious run-time tasks such as task–device mappings and CPU–device data communication should be handled by the high-level programming environment, freeing the developers from the low-level details of ever more complex hardware architectures.

Today, several research efforts target accelerating Java applications using GPUs and other data-parallel compute devices. Some of them, however, are either vendor-specific [11, 73] or limited to using a single device or machine configuration [51, 96] for executing a parallel Java application. Therefore, a cross-platform high-level programming and execution environment is required that provides capabilities to accelerate existing Java applications using multi-device and multi-machine configurations. A unified Java programming paradigm is needed that is capable of executing parallel applications (including data-parallel applications) with user-controlled locality of tasks on heterogeneous architectures consisting of conventional shared and distributed memory machines and parallel computers enhanced with coprocessor accelerators such as GPUs.
1.2 Thesis Goals
In light of the motivations outlined in the previous section, we propose the JavaSymphony paradigm as a unique multi-core aware high-level programming and execution environment. The JS framework can be used to develop shared, distributed, and hybrid memory applications on a variety of multi-/many-core parallel computers, providing user-controlled locality of tasks, objects, and applications. Moreover, a multi-core scheduling mechanism is proposed to automatise the mapping of tasks to computing cores on shared and distributed memory parallel computers. Furthermore, to exploit the huge compute capabilities of coprocessor accelerating devices (such as GPUs), we propose a Java parallel environment enabling existing Java applications to off-load computations to these devices. The major goals of the thesis are summarized as follows:
1.2.1 Programming Multi-/Many-core Computers
We propose a multi-core aware programming and execution environment as an extension to the JavaSymphony distributed memory programming paradigm. The proposed parallel programming environment can be used to develop shared and hybrid memory parallel applications in addition to the previously supported distributed memory applications. The programming paradigm will be enhanced with new shared memory abstractions to facilitate shared memory application development with user-controlled locality on large multi-core machines. With the shared memory programming support, developers will be able to exploit the compute capabilities of large multi-core machines employing direct method invocations on cores and processors, avoiding the costly RMI-based [45] communication.

One of our primary goals is to provide multi-core aware Dynamic Virtual Architectures (VAs), a hardware abstraction mechanism that can be used to map tasks and objects to the computing resources. The dynamic virtual architectures will facilitate low-level hierarchical modelling of the underlying multi-core computing infrastructures (such as large shared memory machines, heterogeneous clusters, etc.), enabling application developers to control the locality of objects at the level of processors and cores using high-level abstractions. On top of the virtual architecture, objects can be explicitly distributed, migrated, and invoked, enabling high-level user control of parallelism, locality, and load balance. To efficiently exploit multi-core architectures, we propose enhanced locality control mechanisms which will enable application developers to specify the locality of parallel Java applications, objects, and tasks at the level of individual cores and processors on large shared memory multi-core machines and heterogeneous clusters.

The JS programming and execution environment will provide multi-core aware execution of applications and will manage all the low-level multi-core specific details, such as the mapping of tasks, objects, and applications using the operating system's native support for thread–CPU bindings. Moreover, the run-time system will be responsible for the discovery of multi-core hardware details such as the number of processors, cores, and memory hierarchies.

The multi-core aware JS framework will provide abstractions and Application Programming Interface (API) support enabling old JS-based applications to run transparently on the new multi-core architectures with minimal changes. Furthermore, the code-portability attribute of Java applications will be preserved by providing a cross-platform multi-/many-core programming and execution environment. The evaluation of this work will be done by developing several real applications and benchmarks, executed on several real multi-core parallel computers such as large shared memory machines and distributed memory heterogeneous clusters.
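To make the idea of a hierarchical resource description more concrete, the following plain-Java sketch models a machine as a tree of processors and cores and enumerates the leaf resources a task could be bound to. The class VANode and its methods are hypothetical names introduced only for illustration; the actual JavaSymphony virtual architecture API is presented in Chapters 3 and 4.

    import java.util.ArrayList;
    import java.util.List;

    // Hypothetical sketch of a hierarchical resource tree in the spirit of dynamic
    // virtual architectures; this is NOT the JavaSymphony API (see Chapters 3 and 4).
    class VANode {
        private final String name;                      // e.g., "machine0", "processor1", "core3"
        private final List<VANode> children = new ArrayList<>();

        VANode(String name) { this.name = name; }

        VANode add(VANode child) { children.add(child); return this; }

        // Depth-first collection of the leaf resources (e.g., cores) a task could be mapped to.
        List<VANode> leaves() {
            if (children.isEmpty()) return List.of(this);
            List<VANode> out = new ArrayList<>();
            for (VANode c : children) out.addAll(c.leaves());
            return out;
        }

        @Override public String toString() { return name; }

        public static void main(String[] args) {
            VANode machine = new VANode("machine0");
            for (int p = 0; p < 2; p++) {
                VANode proc = new VANode("processor" + p);
                for (int c = 0; c < 4; c++) proc.add(new VANode("core" + (p * 4 + c)));
                machine.add(proc);
            }
            System.out.println(machine.leaves()); // eight core-level leaves
        }
    }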
1.2.2 Multi-core Scheduling
We propose a JS application scheduler responsible for the discovery and selection of resources, providing an automatic task–resource mapping mechanism for shared, distributed, and hybrid memory parallel Java applications. The scheduler will automatise the creation of the required virtual architectures and will liberate application developers from the manual creation of VAs, enhancing developer productivity. The proposed scheduler will be based on a multi-core aware scheduling mechanism capable of mapping parallel JS applications on large shared memory multi-core machines and heterogeneous distributed memory clusters. The scheduler will consider several multi-core related performance parameters (such as locality, bandwidth, etc.) and application properties (such as communication and computation needs), and will use these parameters to optimise the mapping of objects and tasks on the multi-core architectures. The scheduler will be based on a novel scheduling methodology employing off-line training experiments, which determine the sensitivities of the performance factors with respect to application types (e.g., communication- and computation-intensive) and multi-core architectures (e.g., shared memory machines, heterogeneous clusters). A two-phase scheduling algorithm will be developed to schedule both the coarse-grained JS objects and the fine-grained JS tasks on shared and distributed memory multi-core parallel computers.
The multi-core aware scheduling framework will consist of two components: the resource manager and the scheduler. The resource manager will interact with the physical computing resources to collect machine-related information, while the scheduler component will use the information collected by the resource manager, together with the training data, as guidelines to schedule Java applications. The evaluation of the multi-core scheduler will be carried out using several real applications and multi-core architectures such as large shared memory machines and distributed memory heterogeneous clusters.
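The two-component split can be pictured with a pair of minimal Java interfaces. These are assumptions made only to illustrate the division of responsibilities described above; the interface and method names are hypothetical and do not correspond to the actual JavaSymphony scheduler API presented in Chapter 5.

    import java.util.List;
    import java.util.Map;

    // Hypothetical illustration of the two-component design: a resource manager that
    // gathers machine information and a scheduler that uses it to place tasks.
    interface ResourceManager {
        List<String> availableMachines();
        // Machine-related information, e.g., core count, cache sizes, current load.
        Map<String, Object> describe(String machine);
    }

    interface Scheduler {
        // Maps a task onto a concrete resource (e.g., "machine0/processor1/core3"),
        // guided by the resource information and off-line training data.
        String mapTask(String taskId, ResourceManager resources);
    }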
1.2.3 Programming Coprocessor Accelerators
With the improved programmability support of coprocessor accelerators (such as GPUs) for general-purpose applications, it has become important to extend the capabilities of existing high-level programming languages and frameworks (such as Java) with new abstractions and tools to efficiently exploit the compute capabilities of these highly parallel devices. Therefore, we propose the JavaSymphony framework to program heterogeneous multi-core parallel computers enhanced with data-parallel accelerator devices such as GPUs.

The dynamic virtual architectures will be enhanced to support data-parallel compute devices such as GPUs. The VAs will enable application developers to represent the coprocessor accelerator devices as abstractions in the form of VA nodes, which will facilitate user-controlled locality of tasks at a high level. The proposed parallel programming framework for coprocessor accelerators will enable existing Java applications to exploit the compute capabilities of these devices with little change to their API usage. Moreover, a parallel application can be uniformly programmed and executed on heterogeneous platforms (multi-core parallel computers enhanced with coprocessors), large shared memory multi-core machines, and distributed memory clusters. The unified framework will free developers from the low-level and error-prone details of the programming languages and will provide abstractions to tackle the complexities of the target hardware architectures. The JS run-time system will facilitate the invocation of OpenCL kernel functions using single or multiple device and machine configurations. Moreover, the run-time system will automatise several low-level tasks (such as kernel invocations, memory allocations, and data transfers), shielding the developers from the low-level details.
The evaluation of the JS framework will be carried out using several real applications and benchmarks, experimented with multi-core CPU–GPU device configurations on a single machine and on a cluster of machines.
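To give a flavour of what is off-loaded to such devices, the sketch below holds a standard OpenCL vector-addition kernel in a Java string next to an equivalent CPU loop. The class and constant names are illustrative assumptions; how a kernel like this is actually registered and invoked through the proposed JavaSymphony extensions is shown in Chapter 6 (Listings 6.5 and 6.6).

    // Illustrative only: the OpenCL C kernel that would be off-loaded to a GPU, kept as
    // a Java string, plus the equivalent sequential CPU computation for comparison.
    public class VectorAddSketch {
        static final String KERNEL_SRC =
            "__kernel void vadd(__global const float* a, __global const float* b,\n" +
            "                   __global float* c, const int n) {\n" +
            "  int i = get_global_id(0);\n" +
            "  if (i < n) c[i] = a[i] + b[i];\n" +
            "}";

        // Sequential CPU reference of the same operation.
        static void vaddCpu(float[] a, float[] b, float[] c) {
            for (int i = 0; i < c.length; i++) c[i] = a[i] + b[i];
        }

        public static void main(String[] args) {
            float[] a = {1, 2, 3}, b = {4, 5, 6}, c = new float[3];
            vaddCpu(a, b, c);
            System.out.println(java.util.Arrays.toString(c)); // [5.0, 7.0, 9.0]
        }
    }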
1.3 Thesis Organization
The rest of the thesis is organized as follows. Chapter 2 presents a few preliminaries that build the foundation for understanding the remaining parts of the thesis. Primarily, the chapter focuses on the architectural, programming, and technological aspects related to developing and executing performance-oriented Java applications on multi-/many-core parallel computers.

Chapter 3 is dedicated to background work and presents the JavaSymphony programming paradigm for parallel and distributed computing. The chapter presents JavaSymphony's main features and capabilities, its architecture, programming model, and run-time system. Moreover, several code examples are presented to highlight the usage of its programming interface.

Chapter 4 presents the JavaSymphony extensions and evaluation results related to programming multi-/many-core parallel computers. Mainly, the chapter presents related work, the JavaSymphony architecture, run-time system, programming model, JavaSymphony API, and code examples. At the end, a detailed experimental evaluation of the work is presented.

Chapter 5 is dedicated to JavaSymphony's multi-core aware scheduling mechanism, which is capable of mapping shared, distributed, and hybrid memory parallel applications on a variety of multi-core architectures. The first part of the chapter presents related work, the scheduler architecture, the scheduling algorithm, and the employed scheduling methodology, while the last part presents the experimental methodology, the experimental setup, and a detailed evaluation of the multi-core aware scheduler.

Chapter 6 presents the JavaSymphony extensions related to programming data-parallel applications on heterogeneous parallel computers equipped with coprocessor accelerators such as GPUs. The chapter presents related work, model technologies, the JS architecture, run-time system, and JS API. Furthermore, a code example is presented that details a data-parallel JavaSymphony application. The last part of the chapter outlines the experimental setup and presents a detailed evaluation of the work.

Chapter 7 concludes the thesis by presenting our major contributions and potential future research directions.
Chapter 2
Model

In this chapter, we present a few preliminaries that build the foundation for understanding the rest of the thesis. First, we describe the architecture-related aspects of computing infrastructures, e.g., computing machines, devices, and networked configurations. In the second part, we present an overview of the programming models employed in this work. In the third part, we briefly describe some of the technologies used in the implementation phase of this work.
2.1 Architectural Model
In this section, we first present Flynn's taxonomy [41], which is used to classify computer architectures. Then, we present physical compute resources, including parallel machines, accelerating devices, and large inter-connected compute resources such as clusters, and classify them according to Flynn's taxonomy.
2.1.1 Flynn's Taxonomy
Flynn’s taxonomy is the most used classification scheme proposed by M. J. Flynn in 1966. It is based on instruction and data stream notions to classify the computers. An instruction stream refers to the sequence of compute instructions from a computing device, while the data stream represents the data items used in the computations. According to the Flynn’s taxonomy, a machine may be categorised in one of the four following categories: 1. Single Instruction Single Data (SISD): These are single processor computers. They have a single control unit that fetches one instruction from memory at a 11
12
Chapter 2. Model time and directs the processing unit to operate on the fetched instruction. This is the only category in Flynn’s taxonomy that represents a non-parallel or a serial computer. 2. Single Instruction Multiple Data (SIMD): These computers have single instruction (control unit) and multiple data streams. The control unit fetches an instruction and directs all the processing units to operate on their respective data items. 3. Multiple Instruction Single Data (MISD): These computers operate using multiple instructions on same data. This model is generally considered as a theoretical model. 4. Multiple Instruction Multiple Data (MIMD): In these machines, multiple independent processors operate on di↵erent data items simultaneously using multiple instructions. Most of the today’s parallel machines including multi-core computers fall in this category.
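The SIMD and MIMD categories, which are the ones relevant for the rest of this thesis, can be illustrated with standard Java constructs. The sketch below is only a programming-level analogy (actual SIMD execution happens in hardware); all names are illustrative.

    import java.util.Arrays;
    import java.util.concurrent.CompletableFuture;
    import java.util.stream.IntStream;

    // Programming-level analogy of two Flynn categories (illustrative only).
    public class FlynnSketch {
        public static void main(String[] args) {
            final double[] data = {1.0, 2.0, 3.0, 4.0};

            // SIMD style: the same operation (multiply by 2) applied to many data items.
            double[] scaled = IntStream.range(0, data.length)
                                       .mapToDouble(i -> data[i] * 2.0)
                                       .toArray();

            // MIMD style: independent instruction streams working concurrently on the data.
            CompletableFuture<Double> sum = CompletableFuture.supplyAsync(() -> {
                double s = 0.0; for (double d : data) s += d; return s;
            });
            CompletableFuture<Double> max = CompletableFuture.supplyAsync(() -> {
                double m = data[0]; for (double d : data) m = Math.max(m, d); return m;
            });

            System.out.println(Arrays.toString(scaled));
            System.out.println("sum=" + sum.join() + ", max=" + max.join());
        }
    }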
2.1.2 Parallel Computers
A parallel computer consists of several processing units or microprocessors capable of performing several computations concurrently. A parallel computer may be based on a single machine with numerous compute units (e.g., processing cores, processors, or coprocessors), or on a set of connected single- or multi-processor machines. The processing granularity (the capability of executing n concurrent tasks) of a parallel computer depends on the number of processing units, which may range from a small multi-processor computer to a large parallel machine with hundreds of thousands of processing units. Based on the memory access scheme, we classify parallel computers into three categories: shared, distributed, and hybrid memory parallel machines.

Shared Memory: A shared memory parallel computer has several processors or processing cores (e.g., a multi-core processor) that are connected via a system bus to a common physical memory. In a shared memory system, all processors share the same address space. Shared memory systems are easy to program but suffer from poor scalability (the ability of a system to improve performance as the amount of work and computing resources grow). A shared memory parallel computer can be built using several SIMD processors (e.g., vector processors [61]) or multi-processors (processors having multiple compute units). Generally, shared memory parallel computers are partitioned into two sub-classes: Symmetric Multi Processing (SMP) and Non Uniform Memory Access (NUMA).
Figure 2.1: Shared memory parallel computers: (a) Symmetric Multi Processing machine; (b) Non Uniform Memory Access machine.
In SMP machines (shown in Figure 2.1a), all processors or processing cores are connected to the common physical memory via the system bus. Due to the scalability issue of SMP machines (contention over the shared memory bus), NUMA shared memory computers emerged. A NUMA-based parallel computer contains several multi-processors, all sharing the same address space. In NUMA machines (shown in Figure 2.1b), each multi-processor has its own directly connected local memory module and can also access the memory modules of the other multi-processors. However, local memory accesses cost less than remote memory accesses (which involve the memory modules of other multi-processors).

Distributed Memory: A distributed memory parallel computer is a set of independent single- or multi-processor machines, where each machine has its own private memory. The machines are connected with each other using an interconnection network (such as a wired local area network or Ethernet), and communication among machines is carried out using message passing. Figure 2.2 shows a conventional distributed memory parallel computer.
Figure 2.2: Distributed memory parallel computer.
The fundamental advantage of distributed memory parallel computers is memory scalability: adding processors also increases the available memory. In a distributed memory parallel computer, each machine processes data from its local memory module (fast access), and any needed remote data is communicated using message passing (slow access). Therefore, these computers are relatively difficult to program compared to shared memory parallel computers, where all processors share a common address space.

A concrete example of a distributed memory parallel computer is the cluster. A cluster is a set of tightly coupled independent machines, based on single- or multi-core processors, that work together to solve a computational task. All machines or nodes in a cluster behave like a single resource and have a centralized resource manager and scheduler. Generally, clusters are classified into two categories: homogeneous and heterogeneous clusters. A homogeneous cluster contains machines or nodes with similar architecture (i.e., processor, memory, network, operating system, etc.), while a heterogeneous cluster is a set of machines with divergent architectures.

Hybrid Memory: A hybrid memory parallel computer is a set of independent computing machines or nodes which are connected via an interconnection network (e.g., Ethernet, InfiniBand [80]) and act as a single computing resource. In a hybrid memory parallel computer, each node contains its own local memory and is also capable of accessing non-local or remote data using message passing. Figure 2.3 shows a hybrid memory parallel computer comprised of three computing nodes, each having two processors and a local memory module.
Figure 2.3: Hybrid memory parallel computer.
Today, most large parallel computers are based on hybrid configurations [10]. In a hybrid parallel computer, intra-node communication is performed using shared memory, which is fast, while inter-node communication is accomplished using message passing, which is slower than shared memory communication.
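In Java, the inter-node message passing referred to above is typically expressed through mechanisms such as Remote Method Invocation (RMI), mentioned in Chapter 1. The interface below is a minimal, hypothetical example of such an inter-node call; the name NodeWorker and its method are assumptions for illustration only.

    import java.rmi.Remote;
    import java.rmi.RemoteException;

    // Hypothetical remote interface: a call to processChunk() on another node serialises
    // the argument and the result over the network, which is why inter-node (message
    // passing) communication is slower than intra-node shared memory access.
    public interface NodeWorker extends Remote {
        double[] processChunk(double[] chunk) throws RemoteException;
    }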
2.1.3 Multi-core Era
Until the recent past, increasing processor clock frequency was the premier approach to augmenting a computer's performance. The increased instruction execution rates enabled existing applications to attain higher performance. However, highly clocked processors suffered from power consumption (cubic increase) and heat dissipation problems, which resulted in a trend shift in processor manufacturing technology. The era of escalating single-core processor performance to attain high performance has ended. Today, multi-core processors have emerged as a viable source of processing power. A multi-core processor consists of several homogeneous or heterogeneous processing cores confined in a single chip. Already, there are many-core processors having hundreds of cores [1]. Multi-core processors are examples of MIMD, shared memory multiprocessor architectures. To achieve high performance, developing parallel applications is no longer an option but a necessity. A parallel application exploits the inherent parallelism provided by multi-core processors to attain commendable performance. Figure 2.4 shows two generic multi-core processor architectures. Figure 2.4a shows a multi-core processor with four processing cores, each containing private Level-1 (L1) and Level-2 (L2) cache memories accessible only to that core. Figure 2.4b shows a multi-core processor with two cores, where each core has a private L1 cache memory and a single L2 cache memory is shared by both cores.
Figure 2.4: Multi-core processor architectures: (a) a quad-core processor; (b) a dual-core processor.
Caches are small, high-speed memories that keep copies of data and instructions (currently processed by the processor or core) from main memory to hide memory latencies. Some multi-core processor architectures use only private caches (accessible only to specific cores), while others also employ shared caches that are shared by multiple cores. With private cache memories, the cores do not suffer from shared resource contention and have faster access to the data. With shared caches, data can be shared by multiple cores, and a core has more available cache space if the other cores do not have high memory requirements. Generally, a multi-core processor is categorised as homogeneous or heterogeneous. The processing cores of a homogeneous multi-core processor have the same instruction set, clock frequency, cache hierarchy, cache sizes, and other functional units. Homogeneous multi-core processors are easier to manufacture due to their identical hardware design. Intel's Core i7-9xx and AMD's Athlon II X4 are two examples of such processor architectures. A heterogeneous multi-core processor has divergent cores in terms of speed, cache hierarchy, cache size, or functional units. A core in a heterogeneous processor can have a different instruction set and be specialised to run specific kinds of tasks. The Cell processor [46] is an example of a heterogeneous multi-core processor. The architecture of a heterogeneous multi-core processor is more complicated than that of a homogeneous one. However, heterogeneous multi-core processors can attain better performance and power efficiency [63] than homogeneous multi-core processors.
2.1.4 Coprocessor Accelerators
A coprocessor works alongside a main processor or CPU to assist in performing computations. Using a coprocessor, applications can be accelerated to gain higher performance: compute-intensive tasks or subsets of computations are off-loaded from the main processor to the coprocessors. In the past, coprocessors were successfully used to perform specialised tasks and to accelerate compute-intensive applications. Typically, they were used for floating-point operations, encoding and decoding of data, and image processing. Two important coprocessors from the past are Intel's Numeric Data Processor [78] and Motorola's 68882 floating-point coprocessor [47]. In recent years, coprocessor-based accelerators have been extensively used for both specialised and general-purpose computing. Some of the most notable coprocessor accelerators are the Field-Programmable Gate Array (FPGA) [82] and the Cell processor [46]. A Cell processor is a heterogeneous multi-core with one regular CPU that is
called the Power Processor Element (PPE) and eight accelerating cores called Synergistic Processing Elements (SPEs). SSL accelerators [72] and Graphics Processing Units (GPUs) [76] are two more examples of coprocessor accelerators. GPUs are special-purpose coprocessors used for visual processing on computer displays. GPUs have a highly parallel architecture that is very efficient in processing data-parallel computational tasks such as image processing. Due to this highly parallel architecture, GPUs have better compute capabilities than multi-core processors for such workloads, and their usage for general-purpose computing has increased in recent years.
2.1.5 General-Purpose Graphics Processing Units (GPGPUs)
GPUs are examples of highly parallel many-core processors that contain hundreds of cores and have notable processing ability. They are extremely powerful processing units with a high degree of parallelism, precision, and performance. In terms of attainable performance, GPU devices are more cost and power efficient than multi-core CPUs [34]. Due to improved programmability support, GPUs have become a promising platform for general-purpose high-performance computing [76]. In recent years, GPGPUs became an essential component of parallel machines, capable of attaining hundreds of GFLOPS (billions of floating-point operations per second). Today, several areas in scientific computing and other fields, for example finance, bio-informatics, fluid dynamics, heat simulation, weather prediction, signal processing, and searching and sorting, make use of GPGPUs to accelerate applications. GPGPUs provide an excellent platform, with gigantic computing power and programming flexibility, to accelerate real-life and scientific applications. Over time, GPGPUs have evolved into extremely powerful compute devices capable of achieving performance many folds superior to that of multi-core CPUs. Today, GPGPU architectures employ massive parallelism, providing over a thousand cores; for example, NVIDIA's GTX490 contains 6 billion transistors and over 1000 processing cores [6]. GPU Architecture: GPU devices have evolved into powerful coprocessors that are used to accelerate general- and special-purpose applications. The performance attainable by GPU devices is many folds better than that of multi-core CPU devices. The enormous achievable performance of GPUs is because of their
Figure 2.5: A simplified architecture of a GPU device.
architecture and computing model, which is inherently data-parallel. The central aspect of a GPU architecture driving the great attainable performance is parallelism in the form of computing cores and threads. GPU devices employ thousands of parallel threads that compute in SIMD fashion (executing the same instruction on different data streams). The data parallelism and the several-fold better memory bandwidth of GPU devices enable them to attain many times the performance of multi-core CPUs. Also, GPUs are evolving at a much higher pace than CPU devices because they use additional transistors to add more parallelism (i.e., more compute units and hardware threads). Figure 2.5 shows a simplified architectural view of a GPU device. The GPU device (shown in Figure 2.5) employs several streaming multi-processors. Each streaming multi-processor contains several independent stream processors. A stream processor comprises integer, floating-point, and other functional units. A stream processor is connected to an input data stream and produces the resultant data (output stream) in a highly parallel and efficient manner. Each streaming multi-processor contains a fetch and decode unit, and all its stream processors execute the same instruction following the SIMD execution model. Each stream processor has a Program Counter (PC), as shown in Figure 2.5, that facilitates execution divergence or branching during code execution. A streaming multi-processor contains a shared storage (depicted as shared memory in Figure 2.5) that is utilised by all threads executing on the multi-processor to share data. A high-bandwidth interconnection network connects the stream processors to the GPU or device memory. The
GPU device communicates with the CPU using a Peripheral Component Interconnect Express (PCIe) [71] interconnect. The bandwidth of the PCIe interconnect is limited compared to the bandwidth between the GPU and the device memory. Therefore, to attain better performance, the communication between CPU and GPU should be kept minimal. Today, GPU devices have evolved into very sophisticated and powerful compute units with many technological innovations. However, the basic architectural design of modern GPUs remains similar to the simplified architecture shown in Figure 2.5. In the following, we present the architecture of a modern GPU device from NVIDIA (GTX480) that is based on the Fermi architecture. Fermi - A Modern GPU Architecture: Fermi [1] is a modern and powerful GPU architecture capable of achieving excellent performance and precision. NVIDIA has released several new-generation GPUs based on the Fermi architecture, for example the GTX440, GTX460, and GTX480. The Fermi architecture has several notable features, e.g., improved double-precision floating-point operations (up to 8 times faster than older NVIDIA GPUs), faster context switching between applications (up to 10 times faster than older GPUs), cache hierarchies (L1 and L2), concurrent execution of multiple GPU programs or kernel functions (up to 16), larger register files, Error-Correcting Code (ECC) supported memories (which detect and correct data errors caused by internal data corruption), 64-bit unified addressing, and full hardware support of the IEEE 754-2008 floating-point standard. Figure 2.6 shows a high-level view of the Fermi architecture. A Fermi-based GPU employs 16 streaming multi-processors that are organised around an L2 cache memory (shared among all streaming multi-processors). The Fermi architecture contains 512 cores, organised in 16 streaming multi-processors each containing 32 cores. The shared L2 cache has a capacity of 768KB and is fully coherent at the device level. The memory accesses from all cores go through the L2 cache, and in the event of a cache miss (the required data item cannot be located in the cache memory) the device or GPU memory is accessed. The L2 cache supports both the write-back (a modified data item is written back to memory whenever some other thread needs it) and the write-through (all modified data items are written to both cache and memory at the same time) cache write policies. The Fermi-based GPU device contains six memory sub-interfaces, depicted as DRAM in Figure 2.6. Each memory sub-interface is 64 bits wide, giving the device a 384-bit memory interface. The GPU device supports a maximum of 6GB of Graphics Double Data Rate version 5 (GDDR5) memory [1]. The host interface
Figure 2.6: The Fermi architecture.
shown in Figure 2.6 is used to communicate with the CPU device. The host interface connects the GPU device with the CPU using a PCIe link capable of achieving up to 12GB/second of bandwidth [79]. The Fermi architecture implements a two-level thread scheduler. The GigaThread component shown in Figure 2.6 represents a global thread scheduler which schedules thread groups to the individual streaming multi-processors [79]. Within a streaming multi-processor, the assigned threads are scheduled locally by the local thread scheduler called the warp scheduler. Figure 2.7 shows the architecture of the streaming multi-processor and its cores. A streaming multi-processor contains 32 cores, organised in two columns of 16 CUDA cores. A CUDA core has an integer and a floating-point compute unit. A Fermi-based GPU can compute 512 (32 cores × 16 multi-processors) single-precision floating-point operations per clock cycle. The double-precision floating-point rate is half that of the single-precision rate (i.e., 256 operations per clock cycle). A streaming multi-processor contains two warp schedulers and dispatch units (as shown in Figure 2.7). A warp is a group of 32 threads, scheduled by the warp scheduler and executed simultaneously. A streaming multi-processor can host up to 48 warps, or 1536 (48 × 32) threads. The large number of hosted threads helps to dilute the performance penalties that occur due to DRAM latencies. The dual dispatch units issue two instructions at a time, one to each scheduled warp of threads.
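To make the above arithmetic explicit, the small Java snippet below recomputes the per-clock operation counts and resident thread counts from the Fermi figures quoted in the text; the derived total of 24576 resident threads per device (1536 × 16) is not stated above and is included only as a worked consequence of those figures.

// Recomputing the Fermi capacity figures quoted in the text (illustrative only).
public class FermiCapacity {
    public static void main(String[] args) {
        int coresPerSM = 32;          // CUDA cores per streaming multi-processor
        int smCount = 16;             // streaming multi-processors per device
        int threadsPerWarp = 32;      // threads grouped into one warp
        int warpsPerSM = 48;          // maximum resident warps per multi-processor

        int spOpsPerClock = coresPerSM * smCount;        // 512 single-precision ops per clock
        int dpOpsPerClock = spOpsPerClock / 2;           // 256 double-precision ops per clock
        int threadsPerSM = warpsPerSM * threadsPerWarp;  // 1536 resident threads per multi-processor
        int threadsPerDevice = threadsPerSM * smCount;   // 24576 resident threads per device (derived)

        System.out.printf("SP ops/clock: %d, DP ops/clock: %d%n", spOpsPerClock, dpOpsPerClock);
        System.out.printf("Threads/SM: %d, Threads/device: %d%n", threadsPerSM, threadsPerDevice);
    }
}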
Figure 2.7: Fermi streaming multiprocessor.
The Fermi-based GPUs provide a shared memory or L1 cache of size 64KB per streaming multi-processor (as shown in Figure 2.7). The shared memory enables a group of executing threads to coordinate and share data, reducing off-chip communication. The shared memory or L1 cache supports one of two configurations: 16KB of L1 cache and 48KB of shared memory, or 48KB of L1 cache and 16KB of shared memory. Each streaming multi-processor contains a register file of 32768 × 32 bits; the larger register file (compared to older NVIDIA GPUs) enables more threads to execute simultaneously on a streaming multi-processor. A streaming multi-processor contains 16 Load-Store units and 4 special function units, as shown in Figure 2.7. The Load-Store units calculate addresses and perform load/store operations for 16 threads per clock cycle. The four special function units compute mathematical transcendental functions such as sine, cosine, and square root.
2.2 Parallel Programming Models
Parallel programming models facilitate the expression of intrinsic concurrency in applications, and the selection of the appropriate programming model is essential to attain
considerable performance benefits. In this section, we present four parallel programming models, their inherent characteristics, and their suitability for different types of applications.
2.2.1 Shared Memory Model
In a shared memory programming model, the concurrently executing applications, tasks, and threads share a common memory address space (regardless of the hardware implementation). The communication among the parallel threads and tasks is carried out by reading from and writing to memory locations. Programming shared memory parallel applications is relatively easy, i.e., the programmer is not required to deal with explicit data sends and receives. To control access to the shared memory locations, different synchronisation mechanisms [77] are used, for example locks, semaphores, and condition variables. Figure 2.8 shows two shared memory models: shared address space and distributed shared address space. The shared address space model (shown in Figure 2.8a) is based on a computing model where all the parallel applications share the same physical memory system. The shared address space model is a pure shared memory model that has uniform memory access latencies for all applications; however, the scalability of this model is low. The shared address space model primarily suffers from contention on the shared memory bus or interconnect. To alleviate this problem, the distributed shared address space model was introduced (see Figure 2.8b). In a distributed shared address space model, each computing system has its own physical memory system and all memory systems are connected using a memory
Figure 2.8: Shared memory programming models: (a) shared address space; (b) distributed shared address space.
interconnecting network. The main advantage of this computing model is better scalability compared to the shared address space model. One of its primary disadvantages is non-uniform memory access (i.e., remote memory accesses suffer from higher access latencies than local memory accesses). Therefore, this model is also known as NUMA [70].
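As a minimal illustration of the shared memory model in the Java setting used throughout this thesis, the sketch below lets several threads update a common counter; the synchronized block stands in for the lock-based synchronisation mechanisms mentioned above. The class, thread count, and iteration count are illustrative choices, not taken from the text.

// Minimal shared memory sketch: threads communicate through a common counter.
public class SharedCounter {
    private long value = 0;
    private final Object lock = new Object();

    public void increment() {
        synchronized (lock) {   // mutual exclusion on the shared memory location
            value++;
        }
    }

    public long get() {
        synchronized (lock) {
            return value;
        }
    }

    public static void main(String[] args) throws InterruptedException {
        SharedCounter counter = new SharedCounter();
        Thread[] workers = new Thread[4];
        for (int i = 0; i < workers.length; i++) {
            workers[i] = new Thread(() -> {
                for (int j = 0; j < 100_000; j++) {
                    counter.increment();   // all threads write to the same address space
                }
            });
            workers[i].start();
        }
        for (Thread t : workers) {
            t.join();   // wait for all workers, similar to a barrier
        }
        System.out.println("Final value: " + counter.get());   // 400000
    }
}

Without the synchronized block the threads would race on the shared location, which is exactly the kind of inconsistency the synchronisation mechanisms above are meant to prevent.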
2.2.2 Distributed Memory Model
In a distributed memory model, concurrently executing applications run on independent computing machines, each having a private memory module. The computing machines are connected with each other using a network interconnect. The applications share data using explicit (programmer-managed) message sends and receives over the network. Each machine has a private address space, and synchronisation among distributed tasks is performed using barriers (a task waits at a certain code position for other parallel tasks to complete their execution) and message passing. Figure 2.9 shows a distributed memory computing model with several independent computing nodes and no shared memory modules. The data must be partitioned and sent over the network interconnect using point-to-point communication (between two processes on different machines) or group communication mechanisms (among a set of processes on different machines). The primary advantage of the distributed memory computing model is scalability. However, in the case of communication-intensive applications (applications with a high demand for data transfer), the distributed memory computing model may introduce significant performance penalties in the form of communication overheads.
Figure 2.9: Distributed memory programming model.
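Since no particular message passing library is prescribed at this point, the following self-contained Java sketch illustrates the explicit send/receive style of the distributed memory model using plain TCP sockets; the two sides are simulated as two threads in one process, and the message content is arbitrary.

import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.net.ServerSocket;
import java.net.Socket;

// Point-to-point message passing between two "machines"; on a real cluster each
// side would run in its own process on its own node.
public class MessagePassingSketch {
    public static void main(String[] args) throws Exception {
        ServerSocket server = new ServerSocket(0);        // receiver side, any free port
        int port = server.getLocalPort();

        Thread receiver = new Thread(() -> {
            try (Socket peer = server.accept();
                 DataInputStream in = new DataInputStream(peer.getInputStream())) {
                int length = in.readInt();                // explicit receive: length first
                int[] data = new int[length];
                for (int i = 0; i < length; i++) {
                    data[i] = in.readInt();               // then the payload
                }
                System.out.println("Received " + length + " elements");
            } catch (Exception e) {
                e.printStackTrace();
            }
        });
        receiver.start();

        try (Socket peer = new Socket("localhost", port); // sender side
             DataOutputStream out = new DataOutputStream(peer.getOutputStream())) {
            int[] partition = {1, 2, 3, 4};               // data partition owned by the sender
            out.writeInt(partition.length);               // explicit send
            for (int v : partition) {
                out.writeInt(v);
            }
        }
        receiver.join();
        server.close();
    }
}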
2.2.3 Hybrid Memory Model
A hybrid memory model combines different computing models. Generally, the shared and distributed memory computing models are combined to constitute a hybrid model. The hybrid memory model reduces unnecessary intra-node communication and is appropriate for computing architectures based on multi-processors. In the hybrid memory model, parallel applications and tasks communicate using less costly shared memory communication inside a node, while inter-node communication is performed using message passing.
Figure 2.10: Hybrid memory model.
Figure 2.10 shows a hybrid memory model where the executing parallel applications employ both inter- and intra-node communication. The primary advantages of the hybrid model are better scalability (compared to the shared memory model) and reduced memory access latencies (compared to the distributed memory model).
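Combining the two previous sketches, a node in the hybrid model would receive its block of data by message passing and then process it with shared memory threads inside the node. The sketch below shows only the intra-node part (an ExecutorService standing in for the node's shared memory workers); the inter-node receive and the final send-back are assumed to happen as in the socket example above, and the block size and operation are arbitrary.

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Intra-node part of a hybrid-model node: the block (assumed to have arrived via
// message passing) is processed in parallel by threads sharing the node's memory.
public class HybridNode {
    public static void main(String[] args) throws InterruptedException {
        double[] block = new double[1_000_000];       // block received from another node
        int cores = Runtime.getRuntime().availableProcessors();
        ExecutorService pool = Executors.newFixedThreadPool(cores);

        int chunk = block.length / cores;
        for (int c = 0; c < cores; c++) {
            final int from = c * chunk;
            final int to = (c == cores - 1) ? block.length : from + chunk;
            pool.submit(() -> {
                for (int i = from; i < to; i++) {
                    block[i] = Math.sqrt(i);          // shared memory update, no messages needed
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.MINUTES);
        // The partial result would now be sent back to the master node by message passing.
    }
}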
2.2.4 Data Parallel Model
In a data parallel model, the input data is parallelised by distributing it to several compute units (e.g., processors, cores, etc.). The compute units execute the same operation or instruction on different parts of the input data. Using the data parallel model, significant performance can be attained; however, not every parallel task can be modelled as data parallel. In the data parallel model, the degree of parallelism can be increased by using larger input data and additional compute units (i.e., processors, cores, etc.). Single Instruction Multiple Data (SIMD, see Section 2.1.1) is one example of a data parallel compute model. Figure 2.11 shows a SIMD-based data parallel model.
Figure 2.11: SIMD data parallel compute model.
The data parallel model depicted in Figure 2.11 employs n compute units (e.g., processors, cores, etc.). Each compute unit performs the same operation or instruction (the i-th instruction) issued by the control unit on a distinct data item (data elements 1 to n). The data parallel compute model is particularly advantageous for applications that can be structured as data parallel problems, for example image processing and encryption/decryption. In a data parallel model, numerous data items are loaded at the same time (e.g., using a load instruction) and computed concurrently; therefore, the overall throughput (amount of work done per unit time) of a data parallel compute system is large. Generally, a data parallel compute model (e.g., SIMD) employs a single control unit (i.e., a single instruction stream), hence only a single control thread can be supported. Therefore, massively parallel compute architectures (e.g., GPUs) combine several SIMD units within a computing device to provide both data and task parallel compute capabilities.
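In the Java setting of this thesis, the data parallel pattern described above can be approximated with parallel streams, where one operation is applied element-wise over the input data and the runtime distributes the index space across the available cores. The array size and the squaring operation are arbitrary choices for illustration.

import java.util.stream.IntStream;

// Data parallel sketch: the same operation is applied to every element, with the
// work distributed over the available compute units by the stream runtime.
public class DataParallelSquare {
    public static void main(String[] args) {
        int n = 10_000_000;
        double[] input = IntStream.range(0, n).asDoubleStream().toArray();
        double[] output = new double[n];

        IntStream.range(0, n)
                 .parallel()                                        // split the index space across cores
                 .forEach(i -> output[i] = input[i] * input[i]);    // same instruction, different data

        System.out.println("output[42] = " + output[42]);
    }
}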
2.3 Programming Technologies
In this section, we present several programming technologies and environments that are employed to program a variety of parallel computers, such as shared and distributed memory computers and multi-/many-core devices.
2.3.1 Parallel and Distributed Computing using Java
In recent years, the usage of the Java language [43, 44] to program performance-oriented applications has increased. To program large scientific applications, the Java language
provides several important components, with code portability being one of its principal features. Besides code portability, the Java language provides many other features, such as object orientation, multi-threading, distributed programming support, security, type safety, and robustness, that are helpful for programming large scientific parallel applications. Java converts a program's source code into byte-code, a platform-independent intermediate representation of the source code. The byte-code is interpreted or converted to machine-specific code at run-time by the Java Virtual Machine (JVM) implementation for the platform, providing code portability. On the other hand, byte-code interpretation and dynamic linking at run-time cause some performance penalties. However, optimisations in the JVM, Just-In-Time compilation [62], and parallel garbage collectors [18] are some of the efforts that have helped improve the performance of Java applications. Recent studies [17, 27, 84, 92] show that the Java language can attain attractive performance compared to native languages (e.g., C, Fortran, etc.), and its usage for performance-oriented applications keeps increasing. A parallel program employs several concurrent tasks performing the allocated computations to solve a large problem. A shared memory parallel program consists of n concurrent tasks working together using a common address space. To program shared memory parallel applications, the Java language provides the required features such as multi-threading, synchronisation, and communication constructs. A parallel Java program consists of several threads (the smallest executing entity or application part that can be scheduled by an operating system) executing within a JVM. The threads executing inside a JVM coordinate using shared memory communication. To control access to shared data, the Java language provides several synchronisation constructs, such as the synchronized keyword, the wait-notify mechanism, locks, and atomic operations, that can be applied to synchronise the executing threads within a JVM. A distributed memory parallel program consists of concurrent tasks executing on a disjoint set of computing resources (e.g., autonomous machines, clusters, etc.). The distributed tasks communicate with each other using message passing over the network. To program distributed memory machines, Java provides the Remote Method Invocation (RMI) [22] framework: an object-oriented and powerful mechanism for remote procedure calls between different JVMs. Using the RMI mechanism, a Java application can invoke methods on remote objects in non-local JVMs. The required parameters and the resultant values produced by the remote methods are transferred over the communication network.
The RMI architecture consists of three independent layers: the stub and skeleton layer, the remote reference layer, and the transport layer [45]. The stub and skeleton layer provides an interface between a Java application and the RMI framework. Its primary responsibilities are marshalling (converting Java objects to a byte stream) and unmarshalling (converting a byte stream back to Java objects) of the data, and communication management with the remote reference layer. The remote reference layer is responsible for calling or invoking the remote methods. The responsibility of the transport layer is to manage the connections between JVMs and to transmit the required data over the network.
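As an illustration of the RMI mechanism described above, the following sketch declares a remote interface and registers an implementation in an RMI registry using the standard java.rmi API; the interface name, method, and registry name are illustrative only and not taken from the text.

import java.rmi.Remote;
import java.rmi.RemoteException;
import java.rmi.registry.LocateRegistry;
import java.rmi.registry.Registry;
import java.rmi.server.UnicastRemoteObject;

// A remote interface: its methods can be invoked from another JVM via RMI.
interface DotProductService extends Remote {
    double dotProduct(double[] a, double[] b) throws RemoteException;
}

// Server-side implementation exported through the RMI runtime.
public class DotProductServer implements DotProductService {
    public double dotProduct(double[] a, double[] b) throws RemoteException {
        double sum = 0;
        for (int i = 0; i < a.length; i++) {
            sum += a[i] * b[i];
        }
        return sum;
    }

    public static void main(String[] args) throws Exception {
        DotProductServer server = new DotProductServer();
        // Export the object (stub creation) and bind it under a well-known name.
        DotProductService stub =
                (DotProductService) UnicastRemoteObject.exportObject(server, 0);
        Registry registry = LocateRegistry.createRegistry(1099);   // default RMI registry port
        registry.rebind("dotProduct", stub);
        System.out.println("DotProductService ready");
    }
}

A client in another JVM would obtain the stub via LocateRegistry.getRegistry(host) and lookup("dotProduct"), after which calls to dotProduct are marshalled, sent over the network, and executed remotely, following the layered architecture described above.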
2.3.2 Programming Heterogeneous Parallel Computers
In recent years, GPUs became an essential component of parallel machines, capable of attaining hundreds of GFLOPS of performance. GPUs are programmed using the SIMD programming model, and NVIDIA's Compute Unified Device Architecture (CUDA) [2] and Brook [26] are commonly used frameworks to program GPU devices. To program heterogeneous parallel computers (mixtures of CPUs, GPUs, Cell, and other accelerating devices), there is a strong need for a unified parallel programming framework capable of off-loading computations to different types of processing devices such as CPUs, GPUs, and Cell. Currently, the Khronos Group's Open Computing Language (OpenCL) [75] is a prominent parallel programming framework for programming both task and data parallel applications on heterogeneous parallel computers. OpenCL Framework: OpenCL is an open industry standard, managed by the non-profit consortium Khronos Group [9], used to program heterogeneous parallel computers. The OpenCL framework uses vendor-neutral code that allows code portability across different platforms. The OpenCL language is a subset of the ISO C99 language, with some additional constructs to support, e.g., vector types, images, synchronisation, and memory hierarchies. An OpenCL application has two code components: a host part (the main application or host program) and a compute-intensive part called the kernel. The OpenCL host program executes on a CPU of a heterogeneous parallel machine, while the OpenCL kernel runs on an accelerating device (e.g., GPU, CPU, Cell, etc.). The OpenCL kernel uses the device memory of the accelerating device, shared by all executing OpenCL threads, for its computation. The communication between the host program and the OpenCL kernel is done via data transfers between the host (i.e., CPU) and the device (e.g., GPU, CPU, etc.). The input data required for the kernel computation is transferred
from the host machine's memory to the device's memory (e.g., GPU memory) via the PCIe bus. Similarly, the resultant data produced by the OpenCL threads is transferred back from the device's memory to the host's memory. The OpenCL run-time executes the kernel function in a data-parallel fashion (i.e., SIMD), where each instance of the kernel is called a work-item. Commonly, an OpenCL application uses a large number of work-items (e.g., thousands), where each work-item performs similar operations (i.e., executes the same kernel code) on a different set of data. The index space is the information about the number of work-items needed to compute a task, and it is supplied by the host program. OpenCL supports one-, two-, and three-dimensional index spaces. The allocation of data sets to the work-items is done using the index space information, by dividing the data equally among the work-items. The instantiated work-items are managed in groups (subsets of the total work-items), called work-groups. Inside a work-group, the work-items can share data using fast on-chip local memories and synchronise among themselves. Each work-item has a global (with reference to the index space) and a local (with reference to the work-group size) identification number or id. The OpenCL framework provides a hierarchical four-level memory model: global (accessible to all work-items), constant (read-only memory, accessible by all work-items), local (shared by work-items within a work-group), and private (exclusive memory of a work-item). Figure 2.12 shows the OpenCL four-level memory model. To declare a data item for a specific memory level, one of the four declaration qualifiers of the OpenCL language can be used: __global (for global memory), __constant (for read-only global memory), __local (for shared memory), and __private (for private memory).
Figure 2.12: The OpenCL memory model.
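To make the address space qualifiers concrete, the following kernel source, kept as a Java string as it would be handed to an OpenCL binding for run-time compilation, marks its buffers with __global and stages data through a __local scratch area; the kernel name, argument list, and staging step are illustrative only.

// OpenCL C kernel source illustrating the memory qualifiers; it is compiled at
// run-time by the OpenCL framework and executed by many work-items in parallel.
public class KernelSource {
    public static final String VECTOR_ADD =
        "__kernel void vadd(__global const float* a,               \n" +
        "                   __global const float* b,               \n" +
        "                   __global float* c,                     \n" +
        "                   __local float* scratch,                \n" +   // per work-group memory
        "                   const int n) {                         \n" +
        "    int gid = get_global_id(0);                           \n" +   // id in the index space
        "    int lid = get_local_id(0);                            \n" +   // id within the work-group
        "    scratch[lid] = (gid < n) ? a[gid] : 0.0f;             \n" +   // stage through local memory
        "    barrier(CLK_LOCAL_MEM_FENCE);                         \n" +   // work-group synchronisation
        "    if (gid < n) {                                        \n" +
        "        c[gid] = scratch[lid] + b[gid];                   \n" +
        "    }                                                     \n" +
        "}                                                         \n";
}

The host program would supply the size of the __local scratch buffer when setting the kernel arguments and choose the index space (global and work-group sizes) when enqueuing the kernel.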
The OpenCL framework supports both the data and the task parallel programming models. For task parallel applications, several kernels are invoked concurrently, exploiting single or multiple OpenCL-supported compute devices.
2.3.3 Java–OpenCL Bindings
The usage of high-level programming languages and frameworks such as Java to program performance-oriented scientific applications is increasing. To exploit specialised accelerating compute devices (e.g., GPUs, Cell, and other accelerators), existing high-level languages need to be extended or articulated to benefit from the highly parallel architectures. Furthermore, the usage of high-level programming languages provides several additional advantages, such as programming abstractions, which ease the program development process for heterogeneous parallel architectures. To utilise the compute capabilities of heterogeneous parallel computers, the OpenCL framework is a de-facto standard, widely used to program parallel machines consisting of multi-core CPUs, GPUs, Cell processors, or FPGAs. To articulate the Java language with the OpenCL framework, we use a Java–OpenCL language binding called JOCL [52]. JOCL is an easy-to-use, high-performance, and open-source Java binding to the OpenCL language, developed within the umbrella of the JogAmp project [53]. The JogAmp project provides cross-platform open-source Java bindings for several open standards such as OpenCL (an open standard, parallel Application Programming Interface (API)), OpenGL (a platform-independent, open standard for 2D and 3D graphics), OpenAL (a cross-platform audio API), and OpenMAX (a cross-platform multi-media API). JOCL provides both low-level and high-level Java–OpenCL bindings. The low-level bindings are based on the Java Native Interface (JNI) [67] and are generated automatically using GlueGen (a tool that generates JNI-related code at compile time [53]); they provide a one-to-one correspondence to the OpenCL API. The low-level JOCL bindings require less maintenance effort, have commendable stability, and provide better conformity to the OpenCL specifications. The high-level JOCL bindings are hand-written APIs that employ abstractions to hide the tedious low-level code details. The high-level JOCL bindings provide better data transfer speeds (between the JVM and the OpenCL framework) because they use direct buffers from Java's New Input-Output (NIO) API, an efficient IO API. The direct buffers are allocated outside the JVM's heap space (the dynamic memory allocation area).
Direct buffers are also more garbage-collector friendly: because they have a fixed memory address, they can be accessed by the OpenCL kernels directly and quickly.
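A small sketch of the direct buffer allocation mentioned above, using only the standard java.nio API: such a buffer lives outside the JVM heap, keeps a fixed address, and is therefore the kind of storage typically handed to the OpenCL bindings. The capacity and native byte order shown are the usual choices for device interoperability, not a JOCL requirement.

import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.FloatBuffer;

// Allocating a direct NIO buffer as used for JVM <-> OpenCL data transfers.
public class DirectBufferExample {
    public static void main(String[] args) {
        int elements = 1024;
        FloatBuffer data = ByteBuffer
                .allocateDirect(elements * Float.BYTES)   // allocated outside the JVM heap
                .order(ByteOrder.nativeOrder())           // match the platform's native byte order
                .asFloatBuffer();

        for (int i = 0; i < elements; i++) {
            data.put(i, (float) i);                       // fill the buffer with input data
        }
        System.out.println("direct buffer? " + data.isDirect());   // true
    }
}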
2.4 Summary
In this chapter, we presented a model for the JavaSymphony framework, a programming and execution environment for multi-/many-core parallel computers. In the first part, we presented different computing architectures, their characteristics, and configurations. The described computing architectures include multi-processor systems, parallel computers, multi-/many-core processors, coprocessor accelerators, and heterogeneous parallel computers. Then, we presented several programming models that are used to program these parallel architectures. In the third part, we concisely described some of the most relevant programming technologies and execution environments used in this work. The architectural, programming, and technological aspects defined and explained here build up the understanding needed to comprehend the work in the remaining parts of the thesis. The next chapter provides background information about the JavaSymphony programming paradigm; in brief, it describes JavaSymphony's features, its programming model, and its execution environment.
Chapter 3
JavaSymphony Background
This chapter presents JavaSymphony, a novel programming paradigm for parallel and distributed computing. First, we introduce JavaSymphony's main features and capabilities. Then, we present distributed dynamic virtual architectures, the principal idea behind the JavaSymphony framework. Next, JavaSymphony's distributed objects, synchronisation mechanisms, and run-time system are presented. In the end, we summarise the chapter.
3.1 Introduction
JavaSymphony (JS) [38, 39, 54, 55, 56] is a Java-based programming and execution environment, originally designed to develop applications for distributed memory computers, heterogeneous clusters, and computational Grids. JS's design is based on the concept of a distributed dynamic virtual architecture, which allows modelling of hierarchical resource topologies ranging from individual machines to more complex parallel computers and distributed infrastructures. On top of this virtual architecture, objects can be explicitly distributed, migrated, and invoked. In contrast to most Java-based programming frameworks, JS is a unique paradigm that gives the user explicit control over the mapping of objects and tasks to computing nodes. JS is a pure Java-based programming library that provides high-level constructs which abstract low-level programming details and simplify the tasks of controlling parallelism, locality, and load balancing. The high-level JS constructs liberate programmers from the low-level and error-prone details of the Java language (such as Java RMI details [45], multi-threading, synchronisation, and network communication) and provide an abstracted and easy-to-program application programming interface.
3.2 Dynamic Virtual Distributed Architectures
Most performance-oriented Java-based programming paradigms hide the underlying physical architecture or assume a single flat hierarchy of a set (array) of computational nodes or cores. This simplified view does not reflect heterogeneous architectures such as large multi-processor parallel machines or clusters. As a result, the programmer has to depend fully on the underlying operating system on shared memory machines, or on the local resource manager on clusters and Grids, to properly distribute the data, code, and computations, which results in significant performance losses. To mitigate this problem, JS introduces the concept of dynamic Virtual distributed Architectures (VA), which define the structure of a heterogeneous architecture that may vary from a small-scale multi-processor machine or cluster to a large-scale Grid. The VAs are used to control mapping, load balancing, code placement, and migration of objects in a distributed environment. A VA can be seen as a tree structure where each node has a certain level that represents a specific resource granularity:
• Level-1: VA nodes represent a single computing node, such as a multi-processor machine;
• Level-2: VA nodes correspond to an aggregation or set of level-1 computing resources, such as a compute cluster;
• Level-i: VA nodes refer to a collection of level-(i−1) VA nodes, depicting a complex hierarchy of computational resources. For example, a level-3 VA node corresponds to a distributed Grid architecture that consists of a set of level-2 VA nodes or clusters.
Figure 3.1: A three-level virtual architecture.
Figure 3.1 depicts a three-level VA representing a heterogeneous Grid architecture. The level-3 VA node represents the Grid architecture and has a possible connection to a higher-level VA (a level-4 node). The Grid architecture consists of a set of distributed memory cluster nodes on level 2. The level-2 nodes are aggregations of individual multi-processor shared memory machines (such as a symmetric multiprocessing computer), each representing a level-1 node of the virtual architecture. The VAs can be created and modified dynamically during program execution. A VA can be created using a bottom-up or a top-down approach. In the bottom-up approach, the lower-level VA nodes are created before the higher-level nodes; the lower-level VA nodes are then added to the higher-level nodes, and in this way the desired VA is formed. In the top-down approach, the higher-level VA nodes are created along with the furnished information about the lower-level VA nodes.
1   /* Create three level-1 VA nodes */
2   VA c1_m1 = new VA(1);
3   VA c1_m2 = new VA(1);
4   VA c1_m3 = new VA(1);
5   /* Add level-1 nodes to c1 (a level-2 node) */
6   VA c1 = new VA(2);
7   c1.addVA(c1_m1); c1.addVA(c1_m2); c1.addVA(c1_m3);
8   /* Create and add c2_m1 level-1 node to c2 VA node */
9   VA c2_m1 = new VA(1);
10  VA c2 = new VA(2);
11  c2.addVA(c2_m1);
12  /* Create four level-1 nodes */
13  VA c3_m1 = new VA(1);
14  VA c3_m2 = new VA(1);
15  VA c3_m3 = new VA(1);
16  VA c3_m4 = new VA(1);
17  /* Add the four level-1 nodes to the level-2 node c3 */
18  VA c3 = new VA(2);
19  c3.addVA(c3_m1); c3.addVA(c3_m2);
20  c3.addVA(c3_m3); c3.addVA(c3_m4);
21  /* Add level-2 nodes to g1 level-3 node */
22  VA g1 = new VA(3);
23  g1.addVA(c1); g1.addVA(c2); g1.addVA(c3);
Listing 3.1: VA creation using the bottom-up approach.
Listing 3.1 shows the JS code to create the three-level virtual architecture (shown in Figure 3.1) using the bottom-up approach. In lines 2–4, three level-1 VA nodes are created, representing three computing machines. In lines 6–7, a level-2 VA node representing a distributed memory cluster is first created, and then the level-1 VA nodes (created in lines 2–4) are added to it. In lines 9–10, first a level-1 and then a level-2 VA node is created. In line 11, the c2_m1 level-1 node is added to c2, a level-2 VA node. In lines 13–16, four level-1 VA nodes are created. In lines 18–20, a level-2 node c3 is created and the level-1 nodes (created in lines 13–16) are added to it. In lines 22–23, a level-3 VA node is created, representing a distributed Grid architecture, and the previously created level-2 VA nodes (c1, c2, and c3) are added to it. Using the bottom-up approach, the application developer has an object reference for each created VA node or computing resource, which can be used in the subsequent JS program to explicitly control the locality of objects and tasks.
1  /* Create a level-3 VA node, comprised of three sub-nodes (each having 3, 1, and 4 level-1 nodes) */
2  VA v3 = new VA(3, new int[]{3, 1, 4});
Listing 3.2: VA creation using the top-down approach.
Listing 3.2 shows the code that creates the VA depicted in Figure 3.1 using the top-down approach. In line 2, a level-3 VA node v3 is created that represents a distributed Grid architecture. The first parameter of the VA class (i.e., 3) represents the level number of the v3 node. The second parameter is an array with three elements, each representing a level-2 VA node; the value of each array element gives the number of level-1 child VA nodes that node contains (for example, the third array element, equal to 4, represents four level-1 VA nodes). In the top-down approach, less code is required, but the mapping of JS objects and tasks is delegated to the JS run-time system. The VA nodes are Java objects that can be passed to any method in a JS program. During the execution of a JS application, VAs can be modified or even released. To avoid inconsistent modifications, the JS API provides a lock/unlock mechanism to lock or unlock a certain part of the VA. If there are executing JS tasks or threads at the time of the lock operation, the locking of the VA is delayed until the executing threads finish. The locked VAs are not accessible to any working thread until they are unlocked (after possible modifications) by the same thread that locked them earlier.
3.3 JavaSymphony Distributed Objects
Writing a distributed memory JavaSymphony application requires encapsulating Java objects into so-called JS objects, which are then distributed and mapped onto the hierarchical VA nodes (levels 1 to n). Afterwards, the mapped JS objects can be used to remotely invoke the inherent compute tasks or methods. Additionally, JS objects support features such as object migration and synchronisation that are essential for a distributed memory application.
3.3.1 Creating and Mapping JS Objects
JS distributed objects can be instantiated on any VA node, provided that the required class files are available on the target node. The JS API provides various constructors of the JSObject class that are used to create an instance of a JS object with various parameters: the class name (of the encapsulated object), a single-/multi-threaded option, the target VA node (for mapping), and some optional constraints (e.g., a level-1 VA node with a specific amount of available memory). A JS object can specify a level-1 or a higher-level target VA node. In the case of a level-1 target node, the JS object is mapped to the specified VA node, which represents a computing machine. If a higher-level (greater than or equal to 2) VA node is specified, then the JS run-time system searches the virtual architecture for a level-1 node that is free and fulfils the specified constraints. If a JS object does not provide any information about the target VA node, then the default location specified in the JS run-time configuration file is used to map the object. Listing 3.3 shows JS object creation code. In line 3, a JS object objSingle is created that encapsulates a Worker-type object, passes an input argument vectorA to the constructor of the Worker class, and specifies machine1 (a level-1 node) as the target VA node for object mapping.
3.3.2 JS Object Types
A JS object can be a single- or a multi-threaded object. A single-threaded JS object is associated with one thread that executes all invoked methods of that object. A single-threaded object ensures that no inconsistencies of the object data occur. A multi-threaded JS object is associated with n parallel threads, all invoking methods of that object. The number of threads for a multi-threaded JS object can be altered dynamically using JS-Shell, a JS configuration program. On a multi-processor system, multi-threaded objects benefit from the concurrent execution of multiple threads running a common method or several distinct methods.
1  /* Create a single-threaded JS object */
2  boolean bSingleThreaded = true;
3  JSObject objSingle = new JSObject(bSingleThreaded, "Worker", new Object[]{vectorA}, machine1);
4
5  /* Create a multi-threaded JS object */
6  bSingleThreaded = false;
7  JSObject objMulti = new JSObject(bSingleThreaded, "Worker", new Object[]{Math.PI});
Listing 3.3: JavaSymphony distributed objects creation.
Listing 3.3, lines 2–7, shows the JS code for creating a single- and a multi-threaded object. In line 3, a single-threaded JS object objSingle is created, based on the Worker class. An input argument vectorA is passed to the object, and it is mapped on the machine1 VA node. In line 7, a multi-threaded JS object objMulti is created and mapped on the local machine, because when no VA information is specified, a JS object is mapped on the local machine.
3.3.3 Method Invocation Types
JS objects support three types of method invocations: synchronous, asynchronous, and one-sided invocations. Synchronous Invocation: A synchronous method invocation is performed by calling the sinvoke method of the JSObject class. In a synchronous method invocation, the calling program is blocked (its execution is suspended) until the called method completes its execution and the results are returned. For all three JS method invocation types, the arguments of the called method are passed as an object array that contains all the input arguments. The synchronous method invocation (using sinvoke) returns a result of the Java type Object, which may be cast to the original or required type.
Listing 3.4: Synchronous method invocation.
Listing 3.4 in line 2, shows the usage of the sinvoke call to synchronously invoke the method. In the code excerpt, a JS object’s method named dotProduct is invoked using two input arguments (vectorA, and vectorB ) and the returned result is stored (after type-casting) in an integer variable dotP. Asynchronous Invocation: An asynchronous method invocation is induced using the ainvoke method of the JSObject class. In an asynchronous method invocation, the calling method returns immediately without being blocked. However, the asynchronous method call returns a result handle object of type ResultHandle JS API class. The calling method (main JS application) continues its execution after asynchronously invoking the method, and later at some code point the calling program can check whether the previously called method has finished its execution or not. To check the execution status of the called method, isReady method of ResultHandle class can be employed. The isReady method returns a true value if the asynchronously invoked method has finished its execution, otherwise it returns false. If the calling method (main JS application)
3.3. JavaSymphony Distributed Objects
37
has no useful computations to do, then it can wait by blocking itself for the previously called method using getResult method. The result object returned by the getResult method is of type Object and required to be casted to the original or the desired type. 1 /⇤ Asynchronous method i n v o c a t i o n ⇤/ 2 R e s u l t H a n d l e rh = o b j S i n g l e . a i n v o k e ( ” d o t P r o d u c t ” , new O b j e c t [ ] { vecB , vecC } ) ; 3 . . . /⇤ do some o t h e r c o m p u t a t i o n s h e r e ⇤/ 4 i f ( rh . i s R e a d y ( ) ) 5 i n t dotPP = ( I n t e g e r ) rh . g e t R e s u l t ( ) ; 6 i n t dotPP = ( I n t e g e r ) rh . g e t R e s u l t ( ) ; /⇤ Or do a b l o c k i n g w a i t t o g e t r e s u l t s ⇤/
Listing 3.5: Asynchronous method invocation.
Listing 3.5 in lines 2
6 shows a code example of asynchronous method invocation and
the usage of the isReady and getResult mechanisms. In line 2, a method dotProduct is invoked asynchronously using two input arguments, and the returned result handle object is stored in rh object of type ResultHandle class. In lines 4
5, the actual
result is obtained in an integer variable dotPP using isReady mechanism related to the asynchronous method invocations. In line 6, the alternative mechanism related to the asynchronous method invocation is shown where the calling method or the main JS application blocks itself and waits for the result. One-sided Invocation: An one-sided method invocation is initiated using the oinvoke method of JSObject class. The oinvoke method neither blocks the calling method nor returns a result handler object. The one-sided method invocations can be employed in scenarios where the result is not needed by the calling JS application and the method can be executed asynchronously. The performance of the one-sided method invocation is better as compared to the synchronous and asynchronous invocations due to the no result transfer and reduced JS run-time overheads. 1 /⇤ One s i d e d method i n v o c a t i o n ⇤/ 2 o b j M u l t i . o i n v o k e ( ” s a v e T o F i l e ” , new O b j e c t [ ] { vectorA , v e c t o r B } ) ;
Listing 3.6: One-sided method invocation.
Listing 3.6 shows the JS code related to the one-sided method invocation. In line 2, a method named saveToFile is invoked on objMulti JS object using the one-sided invocation. The method saveToFile is invoked by passing two input arguments to the saveToFile method. The one-sided invocation does not return any result handler object.
38
3.4
Chapter 3. JavaSymphony Background
Synchronisation Mechanisms
JS provides synchronisation mechanisms for single as well as multiple JS objects. For a single-threaded object, single execution control thread insures sequential access to the object’s data and methods. For the multiple single or multi-threaded distributed objects, JS provides two synchronisation mechanisms: asynchronous and barrier synchronisations.
3.4.1 Asynchronous Method Synchronisation
Asynchronous method synchronisation applies to multiple executing threads (e.g., n single-threaded or 1 to n multi-threaded JS objects). An asynchronously invoked method returns a result handle object of type ResultHandle. The result handle objects can be used individually to examine or wait for the availability of the results, or they can be grouped using the ResultHandleSet JS class. The ResultHandleSet class provides several methods to wait for (blocking mode) or examine (without blocking) the execution status of a set of invoked methods. On a heterogeneous cluster, synchronising on individual invoked methods using ResultHandle objects is usually more beneficial than group synchronisation: using individual result handle objects enables the JS application to re-invoke a method with newly assigned computations while waiting for other invoked methods (on the slower machines). Listing 3.5, lines 4–6, shows the synchronisation mechanisms using a single result handle object. In lines 4–5, a non-blocking synchronisation mechanism is shown, where the main JS application checks for the completion of the asynchronously invoked method. Listing 3.5, line 6, shows a wait (blocking mode) synchronisation mechanism.
1  /* Create a result handle set object */
2  ResultHandleSet rhSet = new ResultHandleSet();
3  /* Invoke methods and add handlers to result handle set object */
4  for (int i = 0; i <