POSTER: The Relentless Computing Paradigm: A Data-oriented Model for Distributed-memory Computation
Lucas A. Wilson
John A. Lockman III
Texas Advanced Computing Center, The University of Texas at Austin, Austin, Texas, USA
[email protected]
[email protected]
ABSTRACT
The possibility of hardware failures occurring during the execution of application software continues to increase with the scale of modern systems. Existing parallel development approaches cannot effectively recover from these errors except by means of expensive checkpoint/restart files, and as a result many CPU hours of scientific simulation are lost to hardware failures. Relentless Computing is a data-oriented approach to software development that allows many classes of distributed and parallel algorithms, from no data sharing to intense data sharing, to be solved in both loosely- and tightly-coupled environments. Each process requires no knowledge of the current runtime status of the others to begin contributing, meaning that the execution pool can shrink and grow, as well as recover from hardware failure, automatically.

Categories and Subject Descriptors
D.1.0 [Programming Techniques]: General – Design, Reliability

Keywords
Fault tolerance, Runtime systems.

1. INTRODUCTION
Computer-based simulation and modeling is becoming critical for driving scientific breakthrough and discovery. As the sensitivity and scale of simulations increase, the computational requirements and time-to-solution also rise. Unfortunately, modern hardware -- although much improved over technologies of several years ago -- does not provide researchers with a stable execution platform for simulations requiring weeks or months of computation, and is extremely expensive to deploy in large-scale, tightly-coupled environments. As a result, computer-based simulation for scientific discovery has remained limited to researchers with access to high-performance systems at universities and national laboratories.

Volunteer computing models can harness the untapped computing potential of millions of part-time citizen scientists. We propose a system that couples this potential with the innately fault-tolerant nature of distributed hash tables (DHTs), allowing programs to execute for extremely long periods with built-in failure recovery in the event that any set of participants is unable or unwilling to continue contributing. Additionally, proper data partitioning would allow problems requiring more onerous data sharing among participants to be executed, increasing the potential for scientific discovery.

This work describes a new computational paradigm: Relentless Computing. With Relentless Computing, traditionally tightly-coupled, numerically-intensive parallel computations can be performed in a decentralized, distributed environment with high fault tolerance. So long as any single participant and the initial data are present in the system, computation will continue. We provide a basic description of Relentless Computing, how code is generated and managed, and results from a test case solving a partial differential equation (PDE) using finite differences.

We present motivations for the development of Relentless Computing, how it works, and initial scaling results.

2. THE RELENTLESS COMPUTING PARADIGM
The Relentless Computing paradigm is a programming model in which operations are described expressly in terms of load/compute/store operations. This creates several advantages over procedural, functional, or object-oriented programming models.
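As a concrete illustration, a load/compute/store expression of the kind the paradigm describes might look like the following minimal Python sketch. All names here (the codelet function, the key tuples, the plain dict standing in for global memory) are hypothetical illustrations, not a published API.

```python
# A minimal sketch of one load/compute/store expression: a single
# serial codelet that reads its inputs from a keyed store, performs
# one computation, and writes one result back. The key scheme and
# function names are illustrative assumptions.

def axpy_codelet(store, key_x, key_y, key_out, a):
    """Load x and y, compute a*x + y, store the result."""
    x = store[key_x]             # load
    y = store[key_y]             # load
    store[key_out] = a * x + y   # compute + store

store = {("x", 0): 2.0, ("y", 0): 3.0}   # initial data
axpy_codelet(store, ("x", 0), ("y", 0), ("z", 0), a=10.0)
# store[("z", 0)] is now 23.0
```

Because the codelet names only the keys it loads and the key it stores, its data dependencies are explicit, which is what enables the out-of-order and fault-recovery properties described below.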
2.1 Benefits of Relentless Computing
2.1.1 Greater potential parallelism
Since each operation is distilled into the smallest logical piece, more parallelism is exposed within the algorithm.

2.1.2 Out-of-order execution
Because each operation depends on the smallest possible set of inputs, operations can be performed out of order automatically, helping to hide network latency.

2.1.3 Automatic failure recovery
With no loops, function stacks, or specified operation execution order, Relentless Computing environments can recover automatically from the loss of one or many compute participants.
Copyright is held by the author/owner(s). SC’11 Companion, November 12–18, 2011, Seattle, Washington, USA. ACM 978-1-4503-1030-7/11/11.
2.1.4 Elastic
The Relentless Computing runtime environment is intended to be used on both dedicated and volunteer hardware, and programs can take advantage of new hardware automatically as it becomes available.

2.1.5 Intuitive program design
The use of small load/compute/store expressions allows programs written in the Relentless Computing paradigm to be based directly on the underlying mathematical formulas, simplifying the programming task for domain scientists and engineers, reducing barriers to entry, and potentially enlarging the community of computational scientists.
2.1.6 Effective on low cost, low power hardware The relentless computing paradigm has been designed from the beginning to be used on many computational platforms, from large-scale, data-center oriented server environments to personal desktops and laptops, as well as low power devices like netbooks, tablets, and even smart phones.
Figure 1. Components of Relentless Computing Environment
2.2 Components of Relentless Computing The Relentless Computing runtime system is broken down into three primary components which work together to allow solutions to be computed.
2.2.1 Problem Execution Service (PES)
The PES executes simple codelets that transform data found in memory. Each codelet is serial, and no communication occurs between codelets. PES participants begin with the designated result codelet and step back through the data-dependency chain until a solvable codelet instance is identified or information must be loaded from file.
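The backward-chaining resolution that the PES performs can be sketched as follows. The data structures (a dependency map, a compute map, and a dict standing in for global memory) are assumptions made for illustration; the real PES would resolve codelets against the distributed store rather than by local recursion.

```python
# Hedged sketch of the PES strategy: start from the result key and
# walk back through the data-dependency chain until a codelet whose
# inputs are all present can be executed.

def resolve(key, deps, compute, store):
    """Recursively solve for `key`.

    deps[key]    -> list of input keys the codelet for `key` needs
    compute[key] -> function of the input values, returning the value
    store        -> dict acting as global memory (initial data preloaded)
    """
    if key in store:          # already computed, or initial data
        return store[key]
    inputs = [resolve(d, deps, compute, store) for d in deps[key]]
    store[key] = compute[key](*inputs)
    return store[key]

# Example: the result c depends on a and b; only initial data are present.
store = {"a": 1.0, "b": 2.0}
deps = {"c": ["a", "b"]}
compute = {"c": lambda a, b: a + b}
resolve("c", deps, compute, store)   # store["c"] becomes 3.0
```

Because any participant can start this walk from the result codelet using only the shared store, no participant needs to know the runtime status of any other.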
2.2.2 Global Distributed Memory Service (GDMS)
The GDMS is a distributed key/value storage system in which all data elements are globally accessible to the entire collective. Data elements are non-reusable, eliminating side effects, and expire to prevent storage bloat.
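The two GDMS semantics just described, consume-on-read values and expiry, can be modeled in a few lines. This toy class is an illustrative assumption, not the distributed implementation, which would sit on a DHT.

```python
# Toy model of GDMS semantics: globally keyed values that are consumed
# on read (non-reusable) and expire after a time-to-live.
import time

class ToyGDMS:
    def __init__(self, ttl_seconds=60.0):
        self.ttl = ttl_seconds
        self._data = {}          # key -> (value, expiry_time)

    def put(self, key, value):
        self._data[key] = (value, time.monotonic() + self.ttl)

    def take(self, key):
        """Read-and-consume: the value is removed so it cannot be reused."""
        value, expiry = self._data.pop(key)   # KeyError if absent/consumed
        if time.monotonic() > expiry:
            raise KeyError(f"{key!r} expired")
        return value

gdms = ToyGDMS(ttl_seconds=60.0)
gdms.put(("u", 0, 0), 1.5)
v = gdms.take(("u", 0, 0))   # v == 1.5; a second take raises KeyError
```

Single-use values mean a codelet's output can be handed to exactly one consumer without coordination, while the TTL bounds how long unconsumed intermediates occupy the collective's memory.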
Figure 2. Initial scaled speedup results
2.2.3 File Storage Service (FSS)
The FSS commits data to file by scanning the GDMS for requested key/value pairs, retrieving them as they become available, and writing them to file in buffered blocks. Once files have been committed, the FSS replicates them across distributed resources so that users can retrieve them via torrent.
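The commit step might look like the following sketch. The file format, block size, and function name are assumptions for illustration (a dict stands in for the GDMS scan).

```python
# Sketch of the FSS commit step: gather requested key/value pairs from
# the (toy) global store and write them to file in buffered blocks.

def commit_to_file(store, keys, path, block_size=1024):
    buffer = []
    with open(path, "w") as f:
        for key in keys:
            if key in store:                      # retrieve as available
                buffer.append(f"{key}\t{store[key]}\n")
            if sum(len(b) for b in buffer) >= block_size:
                f.writelines(buffer)              # flush a full block
                buffer = []
        f.writelines(buffer)                      # flush the remainder

store = {("u", i): float(i) for i in range(4)}
commit_to_file(store, sorted(store), "fss_demo.txt")
```

Buffering into blocks keeps the file write coarse-grained even though the key/value pairs trickle in from the store one at a time.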
3. INITIAL RESULTS
Initial scaling tests on a fourth-order finite difference example problem were performed with 4 compute hosts, each running up to 4 compute participants. Initial results show good scaling, with communication time exceeding per-participant computation time at 16 cores due to the small size of the problem.

Figure 3. Execution times for example test
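For concreteness, the kind of codelet the fourth-order finite difference test case implies is sketched below: the standard fourth-order central-difference approximation of u''(x) at one grid point, written as a single load/compute/store step. The grid layout and key naming are assumptions.

```python
# One stencil evaluation as a codelet: loads five neighboring grid
# values at time level t, computes the fourth-order central difference
# for u_xx, and stores the result under its own key.

def d2_codelet(store, i, t, h):
    """Compute u_xx at grid point i, time level t (4th-order stencil)."""
    u = lambda j: store[("u", t, j)]                      # loads
    uxx = (-u(i - 2) + 16 * u(i - 1) - 30 * u(i)
           + 16 * u(i + 1) - u(i + 2)) / (12 * h * h)     # compute
    store[("uxx", t, i)] = uxx                            # store

h = 0.1
store = {("u", 0, j): (j * h) ** 2 for j in range(5)}     # u(x) = x^2
d2_codelet(store, 2, 0, h)
# For u = x^2 the exact second derivative is 2, and this stencil
# reproduces it up to floating-point rounding.
```

Each such stencil instance is an independent codelet, so a participant can evaluate any grid point whose five inputs are present in the store, regardless of what other participants are doing.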