StenSAL: A Single Assignment Language for Relentlessly Executing Explicit Stencil Algorithms
Lucas A. Wilson
Texas Advanced Computing Center, The University of Texas at Austin, Austin, Texas, U.S.A.
Jeffery von Ronne
Department of Computer Science, The University of Texas at San Antonio, San Antonio, Texas, U.S.A.
ABSTRACT
Many different scientific domains make use of stencil-based algorithms to solve mathematical equations for computational modeling and simulation. Existing imperative languages map well onto physical hardware, but can be difficult for domain scientists to map to mathematical stencil algorithms. StenSAL is a domain specific language which is tailored to the expression of explicit stencil algorithms through deterministic tasks chained together through single assignment data dependencies, and generates programs that map to the relentless execution model of computation. We provide a description of the StenSAL language and grammar, some of the sanity checks that can be performed on StenSAL programs before code generation, and how the compiler translates StenSAL into Python.
Keywords
Single assignment language, domain specific language, stencil algorithms
1.
INTRODUCTION
A wide range of scientific domains focused on computational simulation and modeling make use of stencil-based algorithms for the calculation of fundamental equations. Unfortunately, writing stencil applications can be extremely difficult, as existing mainstream programming models and languages are not particularly well suited to the expression of these algorithms, especially for modern many-core architectures where heavy concurrency is required to eke out any reasonable performance. Additionally, it can be extremely difficult for domain scientists to write stencil-based applications for distributed memory parallel systems, such as clusters, where heavy concurrency is further complicated by multiple, non-shared memory address spaces. StenSAL is a task-oriented single assignment language tailored to the expression of explicit stencil algorithms and targeted at the relentless execution model of computation [11–13]. The relentless execution model uses runtime instruction scheduling based on deterministic single assignment tasks to allow distributed, uncoordinated tasks to execute tightly-coupled parallel algorithms in an elastic fashion, with the ability to recover from node failure and loss of data which would be fatal to most parallel models. StenSAL uses this same task oriented single assignment approach to developing solvers for explicit stencil algorithms expressible as recurrence relations.
The remainder of this paper is organized as follows. Section 2 provides some background into the relentless execution model of computation. Section 3 describes the StenSAL programming language and provides numerous code examples illustrating its various characteristics, including a complete code example for solving the heat equation in one dimension using the forward in time, central in space (FTCS) method. Section 4 provides a description of how the compiler translates StenSAL code into Python code for later execution. Sections 5 and 6 describe some related projects in stencil-based languages and consider upcoming work on the StenSAL language, respectively.
2.
RELENTLESS EXECUTION
The Relentless Execution Model (REM) is a distributed dataflow model for task-uncoordinated execution on distributed memory parallel architectures where hardware volatility is assumed. REM uses uncoordinated scheduling processes running in parallel to translate dataflow program graphs into individual compute tasks which interact through a distributed, eventually-consistent single-assignment key-value store (see Figure 1). The advantages of this approach are that new scheduling processes can be added to a computation elastically, and processes which exist on failing hardware can fall out of the execution pool without fatally affecting other processes.
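As an illustration of why a single-assignment store tolerates duplicated work, the following minimal Python sketch models the store as an in-memory dictionary. The class `SingleAssignmentStore` and the task `produce_nine` are hypothetical names introduced here for illustration only; the actual REM store is distributed and eventually consistent, which this sketch does not attempt to model.

```python
class SingleAssignmentStore:
    """Toy in-memory stand-in for a single-assignment key-value store.

    Hypothetical illustration only; not the REM implementation.
    """

    def __init__(self):
        self._data = {}

    def put(self, label, value):
        # Writing the same value twice is harmless. Only a conflicting
        # second value would violate the single-assignment restriction.
        if label in self._data and self._data[label] != value:
            raise ValueError(f"label {label!r} is already bound to another value")
        self._data[label] = value

    def get(self, label):
        return self._data[label]

    def __contains__(self, label):
        return label in self._data


def produce_nine(store):
    # A deterministic task with no inputs: every execution yields the same value.
    store.put("nine", 3 * 3)


store = SingleAssignmentStore()
produce_nine(store)  # executed by one scheduling process
produce_nine(store)  # repeated by another process; the store is unchanged
```

Because the task is deterministic, the second execution writes the value that is already present, so no locking or coordination is needed to keep the store correct.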
Deterministic, Single-Assignment Tasks. REM works on programs described as explicit data dependency descriptions [13], which consist of a set of deterministic imperative operations chained together by single assignment dictionary labels. By restricting dictionary labels to single assignment and computations to be deterministic, REM can effectively schedule unexecuted tasks without explicit coordination between parallel processes. This is because the only external effect of executing a task is to perform a deterministic computation and update the store to include the computation's result(s). Since the execution of each task is idempotent, there is no need to prevent the same task from being executed by multiple agents.
Figure 1: The Relentless Execution Model
Uncoordinated Task Execution. REM uses autonomous search agents to locate tasks capable of being executed based only on the data available within the attached distributed dictionary at any given time. This is done by dynamically generating a limited depth subgraph of the unexecuted tasks in the data dependency description and pruning out those branches for which outputs have already been generated. It is possible for tasks to have no input dependencies, thus allowing computation to begin. REM does not need to generate or store the entire graph, making this solution feasible for use on large-scale applications for which the generation and storage of a complete graph [2, 8] would be impossible (as would be the case for programs requiring conditional loops) or impractical. Each agent repeatedly performs a random walk over the dependency description until all labels in the result set have been associated with values and execution ends. Since the labels generated by the execution of any task must be unique, the existence of a task's output labels provides a guarantee that the task in question has already been executed, and there is no need to re-execute it or its dependencies. This allows a label's existence to be used to prune the graph and reduce search time for subsequent explorations by other agents. This technique, which we call cooperative pruning, is what enables distributed dataflow tasks to perform parallel calculations without the need for explicit task-level coordination. Agents traverse the graph based solely on the provided dependency descriptions, and only look forward a single level within the dependency description. Each agent autonomously schedules computation to execute without consideration of the current schedules of other agents. Because only the existing set of labels determines which tasks to schedule next, no task-level coordination between agents is necessary. This allows agents to be elastically added to or removed from the execution pool without adversely affecting other agents.
3.
THE STENSAL LANGUAGE
StenSAL is a single-assignment language developed to easily and efficiently build stencil-based codes for evaluating the relentless execution model. StenSAL provides a simple means of expressing explicit stencils as collections of independent tasks tied together with data dependencies. StenSAL programs are collections of independent tasks which are chained together into a directed acyclic graph (DAG) through a series of single-assignment label/value pairs (labels). Tasks can range from very fine-grained to very coarse-grained, and currently make use of Python for their internal source code blocks. Within each task, variables can be overwritten as many times as desired. However, the labels used to connect tasks must adhere to the single assignment restriction. StenSAL uses a natural language-inspired grammar (see Figure 2) to allow programmers to express task dependency graphs in a very simple manner. Each StenSAL program consists of a program block, which contains a set of tasks that can be used to generate the desired labels.
  ⟨program⟩   |= program ⟨ident⟩ ⟨prgm⟩ end program
  ⟨prgm⟩      |= ⟨create⟩ ⟨taskdef⟩ ⟨with⟩
  ⟨create⟩    |= creates ⟨label set⟩
  ⟨taskdef⟩   |= using ⟨task set⟩
  ⟨with⟩      |= with ⟨constants⟩ | λ
  ⟨task set⟩  |= ⟨task set⟩ ⟨task⟩ | ⟨task⟩
  ⟨task⟩      |= task ⟨create⟩ ⟨for⟩ ⟨deps⟩ ⟨code⟩ end task
  ⟨for⟩       |= for ⟨param set⟩ | λ
  ⟨deps⟩      |= using ⟨label set⟩ | λ
  ⟨code⟩      |= code ⟨stmt set⟩ end code
  ⟨stmt set⟩  |= ⟨stmt set⟩ ⟨stmt⟩ | λ
  ⟨stmt⟩      |= ⟨ident⟩ = ⟨expr⟩
  ⟨param set⟩ |= ⟨param set⟩ ⟨param⟩ | ⟨param⟩
  ⟨param⟩     |= ⟨ident⟩ = ⟨range set⟩
  ⟨range set⟩ |= ⟨range⟩ , ⟨range set⟩ | ⟨range⟩
  ⟨range⟩     |= integer : integer
  ⟨label set⟩ |= ⟨label set⟩ ⟨label⟩ | ⟨label⟩
  ⟨label⟩     |= ⟨ident⟩ | ⟨array⟩ as ⟨ident⟩
  ⟨array⟩     |= ⟨ident⟩ ⟨single⟩ | ⟨ident⟩ ⟨tuple⟩ ⟨single⟩
  ⟨single⟩    |= ( ⟨expr⟩ )
  ⟨tuple⟩     |= ( ⟨expr set⟩ )
  ⟨expr set⟩  |= ⟨expr set⟩ , ⟨expr⟩ | ⟨expr⟩
  ⟨expr⟩      |= mathematical expression
  ⟨constants⟩ |= ⟨constants⟩ ⟨const⟩ | ⟨const⟩
  ⟨const⟩     |= ⟨ident⟩ = value
  ⟨ident⟩     |= string
Figure 2: StenSAL grammar
3.1
Defining a Program
In StenSAL, all programs begin with the program keyword and end with the end program keyphrase (see Figure 2), creating a block structure similar to that of Fortran. The program keyword is followed by an identifier which represents the program name.
  program helloWorld
    creates output
    using
      task hello
        creates output
        code
          output = 'Hello, World!'
        end code
      end task
  end program
Figure 3: A First StenSAL Program
There are two mandatory clauses which must be provided with any program definition. The first is the creates clause, which declares the output labels of a program. Because the relentless execution model uses a bottom-up traversal of the directed acyclic graph (DAG) produced by StenSAL programs to determine task scheduling [13], only those tasks that contribute to the calculation of these outputs will be executed. The labels specified in a program's creates clause must each be a label generated by one of the program's tasks. The second mandatory clause in any StenSAL program is the using clause. This clause starts the task definition section of the program, which is analogous to the main body of the program. Only tasks specified within the using clause can generate labels. Figure 3 shows a StenSAL version of the traditional "Hello, World!" first example program. It illustrates the specification of a StenSAL program which generates one label (output), which is created by one task.
3.2
Defining a Task
A task is the sole means of expressing computations in StenSAL. Each task creates one or more labels from zero or more labels by means of a generic code snippet. Each task must be uniquely named, and must contain a creates clause which specifies the labels to be generated, as well as a code clause which specifies how to generate the labels. Figure 3 shows a minimal task definition for the “Hello, World!” example program. It requires no input labels and generates a single label (output) containing the string “Hello, World!”
3.2.1
Defining Task Dependencies
If a programmer wishes to define anything more complicated than the “Hello, World!” example in Figure 3, then tasks must be able to be dependent on one another, iteratively transforming data into the desired form. StenSAL tasks can specify dependencies on other tasks through the using clause (see Figure 2). The using clause can contain one or more dependent labels, each of which can be used within the code block in order to perform the necessary computations. Figure 4 illustrates this by using multiple tasks to calculate the hypotenuse of a right triangle. The program contains three tasks – two tasks to generate the square of three and four, and a third task to compute the square root of the sum of the first two tasks – resulting in the generation of a three-node directed acyclic graph (Figure 5).
  program hypotenuse
    creates hyp
    using
      task sqrtofsum
        creates hyp
        using squareA squareB
        code
          hyp = sqrt(squareA + squareB)
        end code
      end task
      task squareofA
        creates squareA
        code
          squareA = 3 * 3
        end code
      end task
      task squareofB
        creates squareB
        code
          squareB = 4 * 4
        end code
      end task
  end program
Figure 4: StenSAL program to compute hypotenuse of a right triangle
Figure 5: Directed Acyclic Graph for hypotenuse example
3.2.2
Applying a Task to a Grid
While creating tasks which work on scalar data can be important, larger problems typically require the application of the same task repeatedly to multiple values. The application of a task to an abstract multidimensional grid can be achieved by providing the task with a for clause. The for clause takes parameters consisting of sparse ranges, and applies the task to all possible combinations formed by the Cartesian product of the ranges. Each parameter is a potentially sparse range made of monotonically increasing integer values (see Figure 2). Dense ranges can be formed by using the colon (:) operator, while sparse ranges can be created using the comma (,) operator. These operators can be combined to create sparse ranges with contiguous values. Figure 6 illustrates using the for clause on a task used to fill an array with values. The task fillArray uses the for clause to declare a parameter x, which has a dense range of integers from 1 to 10 inclusive. The program also returns all of the elements of the array A by using a for clause for the program. Parameter values are automatically propagated to the task when it executes. As Figure 6 illustrates, the name of the parameter can be used within the task code (in this case, to fill the array with the monotonically increasing integer values). Parameters are always integers. Referencing individual grid elements from within the task code is done by aliasing the computed label to a variable with the as keyword. In this way, the task code can be reused for every element in a grid, using the same code each time.
  program makeArray
    creates A(x) for x=1:10
    using
      task fillArray
        creates A(x) as elem
        for x=1:10
        code
          elem = x
        end code
      end task
  end program
Figure 6: StenSAL program for filling an array
Non-linear grids in StenSAL are represented with a tuple and a singleton index. The singleton index is used to establish a recurrence relation between tasks when executing over the grid, and enforces the single assignment restriction. The tuple index allows for the creation of a multidimensional grid whose indices can be reused. In relation to stencil algorithms, the tuple index would be used to represent spatial dimensions, while the singleton index would be used to establish the relation of tasks in time. Figure 7 is an example of a StenSAL task definition for solving the heat equation in 3 dimensions over time. The tuple index contains the three spatial dimensions (x, y, z), while the singleton index contains time, which establishes a recurrence relation over t for all values of 1 ≤ t ≤ 50. Once the task that expresses this recurrence relation is done, we can write tasks which express the base cases of the boundary value problem (Figure 8 gives an example set of initial conditions for t = 0).
  task updateTemps
    creates u(x,y,z)(t) as newTemp
    for x=1:100 y=1:100 z=1:100 t=1:50
    using u(x,y,z)(t-1) as center
          u(x-1,y,z)(t-1) as left
          u(x+1,y,z)(t-1) as right
          u(x,y-1,z)(t-1) as front
          u(x,y+1,z)(t-1) as back
          u(x,y,z-1)(t-1) as top
          u(x,y,z+1)(t-1) as bottom
    code
      newTemp = center + alpha * (left + right + front + back + top + bottom - 6 * center)
    end code
  end task
Figure 7: StenSAL task for computing temperature in a 3-D space
  task initialize
    creates u(x,y,z)(0) as initial
    for x=0:101 y=0:101 z=0:101
    code
      initial = 2.0 * x
    end code
  end task
Figure 8: StenSAL task for establishing initial temperatures in a 3-D space
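The expansion of a for clause into individual task instances can be sketched in Python as follows. This is an illustrative sketch only: the string encoding of ranges (for example "1:3,7") and the helper names `expand_range` and `expand_for_clause` are assumptions made here, not part of the StenSAL compiler.

```python
from itertools import product


def expand_range(spec):
    """Expand a range such as "1:3,7" into [1, 2, 3, 7].

    The string encoding is a hypothetical convenience for this sketch.
    """
    values = []
    for part in spec.split(","):
        if ":" in part:
            lo, hi = part.split(":")
            values.extend(range(int(lo), int(hi) + 1))  # dense, both ends inclusive
        else:
            values.append(int(part))  # single value in a sparse range
    return values


def expand_for_clause(params):
    """Yield one {name: value} binding per point of the Cartesian product."""
    names = list(params)
    ranges = [expand_range(params[name]) for name in names]
    for point in product(*ranges):
        yield dict(zip(names, point))


# Two boundary positions combined with two time steps: four task instances.
bindings = list(expand_for_clause({"x": "0,3", "t": "1:2"}))
```

Each binding corresponds to one application of the task body, with the parameter values available as ordinary integers inside the task code.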
3.3
Constants
Constants are defined in the with clause after the task declarations. Label aliases cannot match constant names. Figure 9 updates the array filling example in Figure 6 by having each index multiplied by a constant alpha. The constant is defined in the with clause, and can be used program-wide by any task declared within.
  program makeArray2
    creates A(x) for x=1:10
    using
      task fillArray
        creates A(x) as elem
        code
          elem = alpha * x
        end code
      end task
    with alpha = 12.25
  end program
Figure 9: Example of constant usage
3.4
Complete StenSAL Code Example
By putting all of the different elements of the StenSAL language together, it is easy to build a complete explicit stencil application using only a few lines of code. Suppose one wanted to simulate the dissipation of heat in a one dimensional system using the FTCS method. This can be done by solving the following boundary value problem:
\[
u_x^t =
\begin{cases}
2x & : t = 0 \\
0 & : t > 0,\ x = 0 \\
0 & : t > 0,\ x = \mathit{maxX} \\
u_x^{t-1} + \alpha\left(u_{x-1}^{t-1} + u_{x+1}^{t-1} - 2u_x^{t-1}\right) & : \text{otherwise}
\end{cases}
\]
Figure 10 provides a complete code example for solving this boundary value problem. The example includes the inductive case for the recurrence relation, the base case for t = 0, and the boundary conditions for the edges of the space. The resulting transformed source code, which executes under the relentless model of computation, is capable of running in parallel in both shared memory and distributed memory environments [11–13].
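For comparison, the same recurrence can be evaluated sequentially in plain Python. This is only a reference sketch for checking small cases, not the code that the StenSAL compiler generates; the function name `ftcs_reference` and the value alpha = 0.25 used below are assumptions made for illustration.

```python
def ftcs_reference(max_x, max_t, alpha):
    """Evaluate the 1-D FTCS recurrence one time level at a time."""
    # Base case, t = 0: u = 2x at every grid point, including both ends.
    u = [2.0 * x for x in range(max_x + 1)]
    for _ in range(1, max_t + 1):
        # Boundary conditions for t > 0: both ends are held at 0.
        nxt = [0.0] * (max_x + 1)
        for x in range(1, max_x):
            # Inductive case: three-point update from the previous time level.
            nxt[x] = u[x] + alpha * (u[x - 1] + u[x + 1] - 2 * u[x])
        u = nxt
    return u


# Tiny problem; alpha = 0.25 is an assumed value chosen for illustration.
after_one_step = ftcs_reference(max_x=4, max_t=1, alpha=0.25)
after_two_steps = ftcs_reference(max_x=4, max_t=2, alpha=0.25)
```

The three branches of the loop body correspond to the initialize, boundary, and updateTemp tasks of the StenSAL program, except that here the order of evaluation is fixed by the loops rather than discovered from data availability.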
3.5
Static Semantics
StenSAL programs must be able to generate explicit data dependency descriptions, which is the program form used for the relentless execution model. Explicit data dependency descriptions consist of [13]:
1. a finite set $T = \{\tau_1, \tau_2, \ldots, \tau_n\}$ of tasks to be performed;
2. a finite set $L$ of labels, which are used as keys in the shared dictionary;
3. a set $R \subseteq L$ of result labels whose association with values completes the REM program's execution;
4. a function $\mathit{producer} : L \to T$ that maps dictionary labels to the task that produces the value for that label;
5. a function $\mathit{requires} : T \to \mathcal{P}(L)$ that maps each task to the labels of the inputs the task requires before it can be executed; and
6. a function $\mathit{computes} : T \to (L \rightharpoonup V) \rightharpoonup (L \rightharpoonup V)$ that maps each task to the partial function it computes.
There are three requirements for a StenSAL program to be considered valid, all of which can be statically checked. First, a valid StenSAL program must contain a set of tasks $T$ from which all labels $L$ are generated:
\[ \forall \ell \in L:\ \mathit{producer}(\ell) \in T \]
  program heatXfer
    creates u(x)(maxT) for x=0:maxX
    using
      task updateTemp
        creates u(x)(t) as newVal
        for x=1:maxX-1 t=1:maxT
        using u(x-1)(t-1) as left
              u(x)(t-1) as center
              u(x+1)(t-1) as right
        code
          newVal = center + alpha * (left + right - 2 * center)
        end code
      end task
      task initialize
        creates u(x)(0) as value
        for x=0:maxX
        code
          value = 2.0 * x
        end code
      end task
      task boundary
        creates u(x)(t) as bound
        for x=0,maxX t=1:maxT
        code
          bound = 0.0
        end code
      end task
    with maxX=1000000 maxT=1000
  end program
Figure 10: StenSAL program for solving 1-D heat equation using explicit (FTCS) 3-point update
  def updateTemps(x, y, z, t):
      # Task Prologue
      center = GET(u, (x, y, z), (t-1))
      left = GET(u, (x-1, y, z), (t-1))
      right = GET(u, (x+1, y, z), (t-1))
      front = GET(u, (x, y-1, z), (t-1))
      back = GET(u, (x, y+1, z), (t-1))
      top = GET(u, (x, y, z-1), (t-1))
      bottom = GET(u, (x, y, z+1), (t-1))
      # Task Code
      newTemp = center + alpha * (left + right + front + back + top + bottom - 6 * center)
      # Task Epilogue
      PUT(u, (x, y, z), (t), newTemp)
Figure 11: Generated Python code for StenSAL task in Figure 7
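Second, a StenSAL program cannot contain any task whose input dependencies are not produced by other tasks:
\[ \forall \tau \in T,\ \forall \ell \in \mathrm{dom}(\mathit{requires}(\tau)):\ \mathit{producer}(\ell) \in T \]
Finally, a valid StenSAL program cannot contain any cyclic dependencies between task chains:
\[ \tau_i \prec \tau_j \implies \forall \ell \in \mathrm{dom}(\mathit{requires}(\tau_i)):\ \mathit{producer}(\ell) \neq \tau_j \]
where $\prec$ is a total ordering function, such that for any $\tau_1 \in T$, $\tau_2 \in T$, and $\ell \in L$, $\tau_1 = \mathit{producer}(\ell)$ and $\ell \in \mathit{requires}(\tau_2)$ only if $\tau_1 \prec \tau_2$.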
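The three requirements can be checked with a short graph traversal. The following Python sketch assumes the dependency description has already been extracted into plain dictionaries; the function `check_program` and this encoding are hypothetical and are not taken from the StenSAL compiler.

```python
def check_program(results, producer, requires):
    """Return a list of problems found; an empty list means the program is valid.

    `producer` maps label -> task name and `requires` maps task name -> list
    of input labels (a hypothetical encoding chosen for this sketch).
    """
    problems = []
    tasks = set(requires)

    # Requirements 1 and 2: every result label, produced label, and input
    # label must be generated by a known task.
    needed = set(results) | set(producer)
    for deps in requires.values():
        needed |= set(deps)
    for label in sorted(needed):
        if producer.get(label) not in tasks:
            problems.append(f"no task produces label {label!r}")

    # Requirement 3: depth-first search for cycles between tasks.
    WHITE, GREY, BLACK = 0, 1, 2
    colour = {task: WHITE for task in tasks}

    def visit(task):
        colour[task] = GREY
        for label in requires[task]:
            dep = producer.get(label)
            if dep not in tasks:
                continue  # already reported above
            if colour[dep] == GREY:
                problems.append(f"cyclic dependency through {dep!r}")
            elif colour[dep] == WHITE:
                visit(dep)
        colour[task] = BLACK

    for task in sorted(tasks):
        if colour[task] == WHITE:
            visit(task)
    return problems


# The hypotenuse example: three tasks, no missing labels, no cycles.
producer = {"hyp": "sqrtofsum", "squareA": "squareofA", "squareB": "squareofB"}
requires = {"sqrtofsum": ["squareA", "squareB"], "squareofA": [], "squareofB": []}
hypotenuse_problems = check_program(["hyp"], producer, requires)
```

For grid tasks the label set is large, so a practical checker would presumably reason about index ranges symbolically rather than enumerate every label as this sketch does.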
4.
CODE GENERATION
Once a program has been verified, each task is converted into a Python function which can be called from the runtime task scheduler. Each task function contains a prologue, a code section, and an epilogue. Figure 11 shows the generated Python code for a StenSAL task which performs temperature updates (see Figure 7). The prologue performs a dictionary get operation on each of the input dependencies and stores them into the aliases specified in the StenSAL task description. The code section is then inserted into the task function definition, and the epilogue performs a dictionary put from all of the aliases specified in the creates clause to the associated single assignment labels. Once the task functions have been defined, the compiler generates the main region of the program, which contains the dynamic runtime task scheduler. The task scheduler follows the algorithm in Figure 12 [13], and uses data availability to determine when to initiate the execution of a particular task.
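A single-agent version of the scheduling algorithm in Figure 12 can be sketched in a few lines of Python. The dictionaries `PRODUCER`, `REQUIRES`, and `COMPUTES` below are a hypothetical hand-written encoding of the hypotenuse example, and an ordinary Python dict stands in for the distributed dictionary; the generated scheduler may be organized differently.

```python
import math

# Hypothetical encoding of the hypotenuse program's dependency description.
PRODUCER = {"hyp": "sqrtofsum", "squareA": "squareofA", "squareB": "squareofB"}
REQUIRES = {"sqrtofsum": ["squareA", "squareB"], "squareofA": [], "squareofB": []}
COMPUTES = {
    "sqrtofsum": lambda inp: {"hyp": math.sqrt(inp["squareA"] + inp["squareB"])},
    "squareofA": lambda inp: {"squareA": 3 * 3},
    "squareofB": lambda inp: {"squareB": 4 * 4},
}


def solve(label, dictionary):
    """Make sure `label` is bound, executing its producer task if necessary."""
    if label in dictionary:
        return  # output already exists, so this branch is pruned
    task = PRODUCER[label]
    missing = [l for l in REQUIRES[task] if l not in dictionary]
    for m in missing:
        solve(m, dictionary)
    inputs = {l: dictionary[l] for l in REQUIRES[task]}
    dictionary.update(COMPUTES[task](inputs))


def run(results):
    """Resolve every result label, bottom-up, from an empty dictionary."""
    dictionary = {}
    for r in results:
        solve(r, dictionary)
    return dictionary


store = run(["hyp"])
```

Only the tasks reachable from the requested result labels are executed, and a task whose output label is already present is never executed again by this agent.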
5.
OTHER DSL PROJECTS FOR STENCIL COMPUTATION
There are many other projects which have sought to develop domain specific languages for the execution of stencil algorithms, albeit for more traditional sequential or parallel execution models. StenSAL is the only language which targets the relentless execution model. Unlike the traditional sequential and parallel models targeted by these other language projects, the relentless execution model provides inherent elastic distributed memory parallelism and failure resilience. The Scala [7]-based Liszt [5] language is tailored for solving PDEs using finite element methods, and executes either in shared memory using OpenMP or distributed memory using MPI. Pochoir [10] is a C++-based stencil language which makes use of Cilk [1] for asynchronous multithreading. Like Liszt, Pochoir also uses finite element methods. Terra [4] is a multi-stage language for HPC which is built on top of Lua [6], and which can be used to create other DSLs, such as the Orion language for 2-D image stencil processing. The SKETCH language [9] is another stencil-targeted DSL based on C. SKETCH uses a technique called stencil synthesis to reduce to a minimal number of tasks and reapplies those tasks to the original sketch. PATUS [3] is a code-generating DSL for parallel stencil computations which uses C-like syntax. PATUS uses separate strategy and stencil specifications to generate final application code with specific optimizations based on the specification targeted at SMP and SIMD architectures.
6.
FUTURE WORK
Work is currently underway to compile to C source code instead of Python, in order to allow for lower-level compilation prior to execution. Additionally, we are investigating strategies for automatically coalescing tasks into super tasks, which when compiled with C would enable use of SIMD vector units and greater instruction-level parallelism on modern architectures. Additional compile-time analysis to determine data usage patterns would enable efficient runtime garbage collection or compiled-in memory management code to be used to reduce total memory footprint for REM programs generated using StenSAL, and methods for doing this are under investigation. Finally, greater ability to perform compile-time analysis of StenSAL codes would be beneficial to program developers, especially if logic errors can be identified prior to execution. We plan to develop program analysis tools to help us identify regions of inefficient execution which might be candidates for coalescing or recoding.
7.
CONCLUSION
In this paper we present StenSAL, a domain specific language for expressing explicit stencil algorithms using deterministic single-assignment tasks. StenSAL allows for simple description of explicit stencil algorithms and other recurrence-relation based formulations using a minimal, natural language-inspired grammar. The StenSAL compiler generates Python code which is then run on the target hardware, but a version which generates C/C++ code is being developed to allow for compilation to machine code. Because StenSAL programs are restricted to deterministic tasks chained together with single-assignment dependencies, these programs can be analyzed to identify potential deadlock conditions and present them to the programmer before program execution. Future work will focus on developing more complete program analysis tools, providing compile-time analysis of data usage for runtime garbage collection, and coalescing of tasks to enable greater instruction-level parallelism and SIMD vectorization on modern architectures.
  for all r ∈ Result do
    SOLVE(r)
  end for
  function SOLVE(label)
    if label ∉ dom(Dictionary) then
      task ← PRODUCER(label)
      Missing ← {l | l ∈ REQUIRES(task) ∧ l ∉ dom(Dictionary)}
      for all m ∈ Missing do
        SOLVE(m)
      end for
      Inputs ← {(l ↦ v) | (l ↦ v) ∈ Dictionary ∧ l ∈ REQUIRES(task)}
      Dictionary ← Dictionary ∪ COMPUTES(task)(Inputs)
    end if
  end function
Figure 12: Algorithm for dynamic runtime scheduling of StenSAL/REM tasks [13]
8.
REFERENCES
[1] Robert D. Blumofe, Christopher F. Joerg, Bradley C. Kuszmaul, Charles E. Leiserson, Keith H. Randall, and Yuli Zhou. Cilk: an efficient multithreaded runtime system. SIGPLAN Not., 30(8):207–216, August 1995.
[2] George Bosilca, Aurelien Bouteiller, Anthony Danalis, Thomas Herault, Pierre Lemarinier, and Jack Dongarra. Dague: A generic distributed dag engine for high performance computing. Parallel Computing, 38(1):37–51, 2012.
[3] M. Christen, O. Schenk, and H. Burkhart. Patus: A code generation and autotuning framework for parallel iterative stencil computations on modern microarchitectures. In Parallel Distributed Processing Symposium (IPDPS), 2011 IEEE International, pages 676–687, May 2011.
[4] Zachary DeVito, James Hegarty, Alex Aiken, Pat Hanrahan, and Jan Vitek. Terra: A multi-stage language for high-performance computing. SIGPLAN Not., 48(6):105–116, June 2013.
[5] Zachary DeVito, Niels Joubert, Francisco Palacios, Stephen Oakley, Montserrat Medina, Mike Barrientos, Erich Elsen, Frank Ham, Alex Aiken, Karthik Duraisamy, Eric Darve, Juan Alonso, and Pat Hanrahan. Liszt: A domain specific language for building portable mesh-based pde solvers. In Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis, SC '11, pages 9:1–9:12, New York, NY, USA, 2011. ACM.
[6] Roberto Ierusalimschy, Luiz Henrique De Figueiredo, and Waldemar Celes Filho. Lua-an extensible extension language. Softw., Pract. Exper., 26(6):635–652, 1996.
[7] Martin Odersky et al. An overview of the scala programming language. Technical Report IC/2004/64, EPFL Lausanne, Switzerland, 2004.
[8] Daniel Orozco, Elkin Garcia, Robert Pavel, Rishi Khan, and Guang Gao. Tideflow: The time iterated dependency flow execution model. In Data-Flow Execution Models for Extreme Scale Computing (DFM), 2011 First Workshop on, pages 1–9. IEEE, 2011.
[9] Armando Solar-Lezama, Gilad Arnold, Liviu Tancau, Rastislav Bodik, Vijay Saraswat, and Sanjit Seshia. Sketching stencils. SIGPLAN Not., 42(6):167–178, June 2007.
[10] Yuan Tang, Rezaul Alam Chowdhury, Bradley C. Kuszmaul, Chi-Keung Luk, and Charles E. Leiserson. The pochoir stencil compiler. In Proceedings of the Twenty-third Annual ACM Symposium on Parallelism in Algorithms and Architectures, SPAA '11, pages 117–128, New York, NY, USA, 2011. ACM.
[11] Lucas A. Wilson and John A. Lockman III. Poster: The relentless computing paradigm: a data-oriented programming model for distributed-memory computation. In Proceedings of the 2011 companion on High Performance Computing Networking, Storage and Analysis Companion, pages 53–54. ACM, 2011.
[12] Lucas A. Wilson and John A. Lockman III. Relentless Computing: Enabling fault-tolerant, numerically intensive computation in distributed environments. In Proceedings of the 2011 International Parallel and Distributed Processing Techniques and Applications Conference (PDPTA'11), 2011.
[13] Lucas A. Wilson and Jeffery von Ronne. A distributed dataflow model for task-uncoordinated parallel program execution. In Proceedings of the Seventh International Workshop on Parallel Programming Models and Systems Software for High-End Computing (P2S2), 2014.