TALM: A Hybrid Execution Model with Distributed Speculation Support

Leandro A. J. Marzulo, Tiago A. O. Alves, Felipe M. G. França
Universidade Federal do Rio de Janeiro
Programa de Engenharia de Sistemas e Computação, COPPE
Rio de Janeiro, RJ, Brasil
{lmarzulo, tiagoaoa, felipe}@cos.ufrj.br

Vítor Santos Costa
Universidade do Porto
Departamento de Ciência de Computadores
Porto, Portugal
[email protected]
Abstract

Parallel programming has become mandatory to fully exploit the potential of modern CPUs. The data-flow model provides a natural way to exploit parallelism. However, traditional data-flow programming is not trivial: specifying dependencies and control with fine-grained tasks (such as instructions) can be complex and introduces unwanted overheads. To address this issue we have built a coarse-grained data-flow model with speculative execution support to be used on top of widespread architectures, implemented as a hybrid Von Neumann/data-flow execution system. We argue that speculative execution fits naturally with the data-flow model. Speculative execution frees the programmer to specify only the main dependencies, while still allowing correct data-flow execution of coarse-grained tasks. Moreover, our speculation mechanism does not demand centralised control, a key feature for upcoming many-core systems, where scalability has become an important concern. An initial study on an artificial bank server application suggests that there is a wide range of scenarios where speculation can be very effective.
1. Introduction

The trend toward multi-cores makes parallel programming a requirement for good performance on modern CPUs. Parallelism arises when we can divide a task into sub-tasks, and is limited by the dependencies between those sub-tasks. Unfortunately, the programmer often has to overestimate dependencies to guarantee correct execution, severely hampering
parallelism. Speculative execution addresses this problem by assuming tasks to be independent until proven otherwise, and is the key to techniques such as Transactional Memory (TM) [3, 8] and Thread-Level Speculation (TLS) [7]. The data-flow model, where tasks start running as soon as they receive all their input operands, provides a natural way to exploit parallelism. However, traditional data-flow programming is not trivial: specifying dependencies and control with fine-grained tasks (such as instructions) can be complex and introduces unwanted overheads. We argue that speculative models of execution fit naturally with the data-flow style of execution. Speculative execution frees the programmer to specify only the main dependencies, while still allowing correct data-flow execution of coarse-grained tasks. To evaluate this claim, we have created TALM (TALM is an Architecture and Language for Multi-threading), a data-flow model with coarse-grained instructions that is designed to allow speculative execution. We provide a thread-level speculation model for TALM based on optimistic transactions with ordered commits. Since static dependencies tend to be considerably more frequent than dynamic ones, we can be optimistic and allow memory accesses (including super-instructions that perform such accesses) to execute concurrently and speculatively when static dependencies are not known. Moreover, our speculation mechanism does not demand centralised control; this is a key feature for upcoming many-core systems, where scalability is a fundamental concern. TALM was implemented as a hybrid Von Neumann/data-flow execution system that runs on top of multi-core architectures; we call this implementation Trebuchet. To evaluate our system, we developed an artificial bank server application to simulate scenarios
varying computation load, transaction size, speculation depth, and contention. We executed this application with up to 24 threads on a 24-core machine. The results suggest that there is a wide range of situations where speculation can be very effective, achieving speedups close to the ideal case. The rest of this work is organised as follows: in Section 2 we present our data-flow base model, instruction set, architecture and speculation model; in Section 3 we present Trebuchet, an implementation of TALM for multi-cores; in Section 4 we discuss the experiments and results; related work is discussed in Section 5; in Section 6 we conclude and discuss future work.
2. TALM

The TALM model is designed to allow the definition of coarse-grained instructions, or super-instructions, as an extension of the instruction set. Therefore, the TALM instruction set architecture has to allow instructions with multiple inputs and outputs. As usual, TALM also provides the most common arithmetic and logic instructions, such as add, sub, and, or, mult and div, along with their immediate variants. Control flow is described by the steer and inctag instructions, presented next. As there is no program counter in data-flow machines, control branches must change the data flow according to the selected path. This is done by the steer instruction, which receives an operand O and a boolean selector S, causing O to be sent to one of two possible paths in the data-flow graph, according to the value of S. In this paper, steer instructions are represented as triangles in data-flow graphs. Loops are a major source of parallelism. During the execution of a loop, we want independent iterations to run asynchronously. If one is not careful, new operands may be made available to instructions that are still processing the previous iteration. To avoid this, we support dynamic data-flow execution, in which an instruction may have multiple instances, one per iteration. Operands are tagged with their associated instance number. When an operand crosses into the next iteration, an Increment Iteration Tag (inctag) instruction increments its tag to match the other operands of that iteration. Instructions execute when they have a complete set of input operands, all with the same tag. We also provide a TALM Reference Architecture. We assume a set of identical Processing Elements (PEs) interconnected in a network. PEs are responsible for instruction interpretation and execution according to the data-flow rules. Each PE has a communication buffer to receive messages with operands sent by other PEs in the network. When an application is executed on TALM, its instructions need to be distributed among the PEs.
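To make the firing rule concrete, the following is a minimal sketch of how a PE might decide that an instruction instance is ready; the types and names (operand_t, instance_t, ready_to_fire) are hypothetical illustrations, not Trebuchet's actual data structures.

    #include <stdbool.h>
    #include <stdint.h>

    #define MAX_INPUTS 4

    typedef struct {          /* an operand carries its iteration tag */
        uint32_t tag;         /* instance (iteration) number */
        int64_t  value;
        bool     present;
    } operand_t;

    typedef struct {          /* one pending instance of an instruction */
        int       n_inputs;
        operand_t in[MAX_INPUTS];
    } instance_t;

    /* Dynamic data-flow firing rule: an instance fires only when every
       input port holds an operand and all operands carry the same tag. */
    bool ready_to_fire(const instance_t *inst)
    {
        for (int i = 0; i < inst->n_inputs; i++)
            if (!inst->in[i].present || inst->in[i].tag != inst->in[0].tag)
                return false;
        return true;
    }

    /* inctag, conceptually: a loop-back operand has its tag incremented
       so that it matches the operands of the next iteration. */
    operand_t inctag(operand_t o) { o.tag++; return o; }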
An instruction list stores information related to each static instruction mapped to a PE. Moreover, each instruction has a list of operands related to its dynamic instances. The matching of operands with the same tag, i.e., belonging to the same instance of an instruction, is done within that list. The instruction and the operands resulting from the matching are sent to the ready queue to be interpreted and executed. TALM allows any instruction to be marked as speculative. Speculative instructions execute within transactions, each formed by one speculative instruction and its related Commit instruction. Transactions have access only to local copies of the resources they use. Once a transaction finishes running, if no conflicts are found, its local changes are persisted to the global state by the Commit instruction associated with the speculative instruction. If conflicts are found in a speculative instruction I, the local changes are discarded and I has to be re-executed. This allows speculative instructions to run concurrently and out-of-order, while the Commit instruction ensures that global state changes are made atomically and in program order. Each speculative instruction has a write set and a read set to keep track of all accesses (writes and reads, respectively) made within that instruction. More precisely, read and write sets are sets of tuples (a, v, s)_r, where r is the resource (for example, memory or a file), a is the resource's address, v is the value (read or written) and s is the datum's size. Given a speculative instruction S and the corresponding Commit instruction C, there is an edge between S and C so that S can send a Commit Message when it finishes its execution. This message contains all the information needed by C to validate the speculative execution, persist the local changes, or command a re-execution, namely: (i) the read and write sets of S; (ii) a structure indicating whether exceptions occurred during the execution of S; and (iii) the dispatch structure of S (containing all its input operands). When a Commit instruction runs, it first validates the received read set. If its values are identical to the ones in the global state, speculation succeeded: the local changes kept in the write set can be made visible and any exceptions suppressed during speculative execution can be thrown. If the read set differs from the global state, the speculative instruction's input operands are re-sent to its PE, causing its re-execution. Notice that upon the re-execution of an instruction I, new versions of its output operands are produced and sent to their consumers in the data-flow graph, causing the re-execution of all instructions that depend (directly or indirectly) on I. We say that those instructions are in I's chain of causation.
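As a minimal sketch of this validation step, the code below replays a received read set against the global state and either publishes the write set or requests re-execution; the tuple layout and the helper names (global_read, global_write, resend_operands) are assumptions for illustration, not Trebuchet's actual interface.

    #include <stddef.h>
    #include <string.h>

    typedef struct {          /* one (a, v, s) tuple of a read/write set */
        void  *addr;          /* a: resource address */
        char   val[8];        /* v: value read or written (up to s bytes) */
        size_t size;          /* s: datum size */
    } rw_entry_t;

    /* Assumed helpers: access the global (committed) state. */
    extern void global_read(void *addr, void *buf, size_t size);
    extern void global_write(void *addr, const void *buf, size_t size);
    /* Assumed helper: re-send S's input operands to its PE. */
    extern void resend_operands(int spec_instr_id);

    /* Commit: succeed iff every value read speculatively still matches
       the global state; otherwise trigger re-execution of S. */
    void commit(int s_id, const rw_entry_t *rs, size_t n_reads,
                const rw_entry_t *ws, size_t n_writes)
    {
        char cur[8];
        for (size_t i = 0; i < n_reads; i++) {
            global_read(rs[i].addr, cur, rs[i].size);
            if (memcmp(cur, rs[i].val, rs[i].size) != 0) {
                resend_operands(s_id);   /* conflict: re-execute S */
                return;
            }
        }
        for (size_t i = 0; i < n_writes; i++)      /* validation passed: */
            global_write(ws[i].addr, ws[i].val, ws[i].size); /* publish */
    }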
As speculative instructions re-execute and produce new versions of their output operands, there must be a way to match only the most recent inputs. To do so, each speculative instruction has a marker (e, id), where e is an execution counter, incremented on each re-execution, and id is the instruction identifier. This marker is attached to each output operand. Given an operand O, we use O.e and O.id to denote O's execution number and identifier, respectively. Non-speculative instructions do not have a marker of their own, so they forward received markers on the operands they produce. To avoid possible conflicts, non-speculative instructions can only receive operands from at most one speculative instruction. When an instruction S receives an operand A through an input port that already contains another operand A′, it has to check whether A belongs to a more recent execution than the one that produced A′. A replaces A′ if A.e > A′.e. S may receive operands at the same port from different sources, but only one of those sources will be enabled, depending on the path taken during execution. If the decision on which path to take is itself speculative and there is a re-execution, A.id might differ from A′.id, and comparing A.e and A′.e would be meaningless. This would happen if we had different speculative instructions on different paths and the decision on which path to take was also speculative, which could cause the received operands to carry different ids. Thus, our model forbids such constructions. Program semantics are preserved by go-ahead and wait edges. Go-ahead edges connect Commit instructions and guarantee the correct commit order. Wait edges connect Commit instructions related to dependent speculative instructions; they prevent a commit from executing while re-executions of its related speculative instruction might still be triggered by other speculative instructions. Given (i) a speculative instruction S, (ii) its Commit instruction Cs, (iii) a speculative instruction D such that S is in its chain of causation, and (iv) Cd, the Commit instruction related to D: if a re-execution of S is caused by D, Cs should not consider Commit Messages sent by previous executions of S. This is guaranteed by a wait edge from Cd to Cs. A Wait Message is sent by Cd to inform the marker of its successful execution. Cs only executes if the most recent input operands were used in S: it checks the input operands in its Commit Message to see whether their markers match the ones provided by received Wait Messages. Moreover, there is one Wait Message in Cs for each speculative input operand of S. In cases where speculative instructions execute according to a condition (branches), the commit order can only be defined at runtime; in those cases, this conditional relation between instructions must be replicated in the commit sub-graph.
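A minimal sketch of this operand-replacement rule follows; the operand layout is hypothetical, and, per the restriction above, both operands are expected to carry the same instruction id.

    #include <stdbool.h>
    #include <stdint.h>

    typedef struct {
        uint32_t e;    /* execution counter of the producer */
        uint32_t id;   /* identifier of the producing speculative instr. */
        int64_t  value;
    } spec_operand_t;

    /* Keep only the operand from the most recent execution. The model
       forbids constructions in which a.id != a_prime.id could occur
       here, so differing ids are never considered a valid replacement. */
    bool should_replace(spec_operand_t a, spec_operand_t a_prime)
    {
        if (a.id != a_prime.id)
            return false;          /* forbidden construction */
        return a.e > a_prime.e;    /* newer execution wins */
    }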
Figure 1. Use of Speculation. (Pane A: a data-flow graph with speculative instructions S1, S2 and S3, their Commit instructions C1, C2 and C3, and the go-ahead (g) and wait (w) edges between them. Pane B: a possible execution of the graph over time.)
Figure 1 (pane A) shows an example with dependencies between speculative instructions. S1, S2 and S3 are speculative instructions and C1, C2 and C3 are their related Commit instructions. Since S3 receives input operands from S1 and S2, we need to insert wait edges from C1 and C2 to C3. These edges are necessary for C3 to know that it can only commit once S3 has executed with the operands produced by the last executions of S1 and S2. The reason we need two wait edges from C1 to C3 is that the first input operand of S3 is also in the chain of causation of S1 and, since the add (+) instruction is not speculative, this operand also carries the marker of S1's execution. Pane B shows a possible execution of this graph.
Figure 2. The sliding speculation window mechanism (panes A and B; iteration tags IT and IT+2).

    /* First super-instruction of Figure 3 (fragment; its opening was
       lost in the source): a guarded transfer using LOAD and STORE. */
        ... > 10) {
            int val1 = LOAD(&(V[pos1]));
            STORE(&(V[pos1]), val1 + 10);
            STORE(&(V[pos2]), val2 - 10);
        }
    }

    /* Second super-instruction: prints the results. */
    void super2(oper_t **oper, oper_t *result) {
        printout();
    }
superinst ("init", 0, 4, False) C superinst ("op", 1, 1, True) superinst ("out", 2, 1, False) const nT, NUM_TASKS init ini, nT placeinpe(0, "DYNAMIC") {i=0..NUM_TASKS-1} op t${i}, ini.${i*2}, ini.${i*2+1} commit c0, t0.0 placeinpe(0, "STATIC") {i=1,NUM_TASKS-1} commit c${i}, t${i}.0, c${i-1}.g out o, c3.g
Figure 3. Example of speculation use: super-instruction bodies in C, the corresponding TALM assembly, and the resulting data-flow graph.
4. Experiments and Results

As an initial study, we want to identify the situations in which our speculative model can improve application performance. We therefore implemented an artificial application based on the simulation of a bank server. It allows us to evaluate scenarios that vary in computation load, transaction size, speculation depth, and contention. Each experiment was executed 20 times to smooth out discrepancies in execution time. To perform these experiments we used a host machine with four AMD Six-Core Opteron 8425 HE (2100 MHz) chips (24 cores) and 64 GB of DDR-2 667 MHz RAM (16x4 GB), running GNU/Linux (kernel 2.6.31.5-127, 64 bits). The artificial Bank application simulates the server of a bank, which receives as input orders for 8640 transfers between accounts. A transfer is only completed if the source account has enough funds; otherwise the operation is cancelled and a counter of cancelled transfers is incremented. The implementation comprises a reading stage and a processing stage, executed inside a loop. At every iteration the first stage simulates reading a randomly selected block of transfers, which is then processed by the second stage; the latter simulates the transfers with an artificial computational load (an LU reduction of a matrix whose dimension varies in a predetermined range). Note that the transfers read by the first stage must be processed in order by the second stage: if a transfer depletes the funds in a given account, subsequent transfers that try to withdraw from that account must be cancelled.
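The essence of one simulated transfer, written against the LOAD/STORE style used by the super-instructions of Figure 3, might look as follows; the names (account, cancelled, transfer) and the exact LOAD/STORE signatures are hypothetical, and the artificial LU-reduction load is omitted.

    /* Assumed signatures for the speculative memory accessors; in a
       speculative super-instruction these accesses are tracked in the
       instruction's read and write sets. */
    extern int  LOAD(int *addr);
    extern void STORE(int *addr, int value);
    extern int  account[];     /* hypothetical account balances */
    extern int  cancelled;     /* counter of cancelled transfers */

    /* One bank transfer: complete it only if the source has funds,
       otherwise cancel it and count the cancellation. */
    void transfer(int src, int dst, int amount)
    {
        int funds = LOAD(&account[src]);
        if (funds >= amount) {
            STORE(&account[src], funds - amount);
            STORE(&account[dst], LOAD(&account[dst]) + amount);
        } else {
            STORE(&cancelled, LOAD(&cancelled) + 1);
        }
    }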
Speculation was used to parallelise the second stage of the loop. Each block of transfers is divided into smaller blocks that are processed speculatively in parallel by different PEs. Conflicts occur if during the same iteration two PEs try to perform transfers on the same account. Speculation depth is controlled by a sliding window mechanism similar to the one presented in Figure 2.
Figure 6. Bank: Varying Number of Accounts.
Figure 4. Bank: Varying Matrix Size.
This application is highly parametrised, allowing us to expose the different costs and their impact on Trebuchet's performance. The parameters we experimented with are (i) the average matrix size, (ii) the sliding speculation window size, (iii) the number of accounts, and (iv) the transaction size. The default values for these parameters are, respectively, 300, 3, 180000 and 1.
Figure 5. Bank: Varying Window Size.

Figure 7. Bank: Varying Transaction Size.

The effect of varying each parameter individually is discussed next, with the results presented in Figures 4, 5, 6 and 7. Each figure shows the speedups achieved by our data-flow implementation over the sequential version. We also compare against an ideal scenario in which no speculation is needed because there are no conflicts, and we report the total number of rollbacks in each scenario (on top of each bar in the graph), since rollbacks have a strong relation to overall performance. For all experiments, as the number of threads increases, more parallelism is unleashed and more rollbacks occur. This is why the speedups for the Real scenarios become less expressive as we go up to 24 threads, while the same does not happen in the Ideal ones. Moreover, a super-linear speedup was observed for all the ideal scenarios and for real scenarios with fewer threads. This happens because each chip of our host machine has a 6 MB fully-shared L3 cache: in the parallel versions, each thread loads large cache blocks that also provide data to the other threads, increasing the number of cache hits.
The dimension of the LU-reduction matrix regulates the amount of computation done per access to the transactional memory: the smaller the amount of computation, the more significant the STM costs become in the total execution time. This behaviour can be observed in Figure 4. Figure 5 presents the results when we vary the size of the sliding speculation window. This parameter specifies the number of iterations that may run speculatively at the same time. Large windows provoke too many conflicts, while small ones allow less parallelism to be exploited, due to the constant need for synchronisation. Figure 6 shows the results when we vary the number of bank accounts. The smaller this number, the bigger the
chance that two parallel transfers use the same account. Notice that the ideal scenarios need at least 17280 accounts, since 8640 transfers between different accounts must be performed without conflicts. Therefore, ideal results are provided only for scenarios with more than 18000 accounts. Figure 7 presents the results when we vary the size of the blocks of transfers read at each iteration. The X axis shows the number of transfers per thread per block. More transfers being processed at a time means more addresses being stored in the transactions' hash tables, which increases the probability of collisions and thus the cost of STM accesses.
5. Related Work

The Transactional WaveCache [6] is a mechanism that allows speculative out-of-order execution of memory operations in WaveScalar. In this mechanism, the memory instructions in a loop iteration are treated as a transaction, nested inside the transactions of previous iterations. Program Demultiplexing [2] is an execution paradigm where methods or functions are demultiplexed to execute concurrently with the rest of the program, according to data-flow: as soon as a demultiplexed method's input parameters are ready, the method is fired speculatively. When the control flow reaches the original call site for that method, the write operations are committed if no conflicts are found; otherwise, the method is re-executed. The SDF [4] architecture uses the same principle as Trebuchet: threads are scheduled according to the data-flow rules and, inside each thread, instructions are issued following the Von Neumann model. Speculative support for SDF was presented in [5]; extra hardware was added and a centralised commit control unit was adopted to guarantee commit order, which can limit the scalability of their model. In our model, commit instructions receive all the information needed to control speculation through wait and go-ahead edges; TALM's speculation mechanism is therefore distributed and more scalable.
6. Conclusion and Future Work

In this work we presented speculative execution on a coarse-grained data-flow model to be used on current architectures, as a hybrid Von Neumann/data-flow execution model. The model enables the creation of super-instructions with different granularities and provides a speculation mechanism so that program semantics can be maintained without the need to describe hidden dependencies in applications. We demonstrate the model on the Trebuchet virtual machine. Initial experiments on an artificial bank server application suggest that there is a wide range of situations where
speculation can be very effective. We attribute the good scalability of our results to the simplicity and distributed nature of our model: since there are no centralised components in our speculative model, it is natural that its implementation scales. Also, as most of the validation is based on operand exchange, the implementation of speculative operations does not add much overhead to normal data-flow processing. We are currently investigating ways to help developers write applications using our model. This includes improvements to our assembly language, a compiler to generate the data-flow binary from C code, tools to find good candidates to become super-instructions, and automatic instruction placement based on the host machine architecture.
7. Acknowledgments

We thank CAPES and the Euro-Brazilian Windows consortium for the financial support given to the authors of this work.
References

[1] T. A. O. Alves, L. A. J. Marzulo, F. M. G. França, V. S. Costa, and R. A. Hexsel. Trebuchet: Explorando TLP com virtualização dataflow. In WSCAD-SSC'09, pages 60–67, São Paulo, Oct. 2009. SBC.
[2] S. Balakrishnan and G. S. Sohi. Program demultiplexing: Data-flow based speculative parallelization of methods in sequential programs. In ISCA '06, pages 302–313, Washington, DC, USA, 2006. IEEE Computer Society.
[3] L. Hammond, B. D. Carlstrom, V. Wong, M. Chen, C. Kozyrakis, and K. Olukotun. Transactional coherence and consistency: Simplifying parallel hardware and software. IEEE Micro, 24:92–103, 2004.
[4] K. Kavi, R. Giorgi, and J. Arul. Scheduled dataflow: Execution paradigm, architecture, and performance evaluation. IEEE Transactions on Computers, 50(8):834–846, 2001.
[5] W. Li, K. Kavi, A. Naz, and P. Sweany. Speculative thread execution in a multithreaded dataflow architecture. In Proceedings of the 19th ISCA Parallel and Distributed Computing Systems, Sept. 2006.
[6] L. A. Marzulo, F. M. França, and V. S. Costa. Transactional WaveCache: Towards speculative and out-of-order dataflow execution of memory operations. In SBAC-PAD, pages 183–190, 2008.
[7] M. K. Prabhu and K. Olukotun. Exposing speculative thread parallelism in SPEC2000. In PPoPP '05: Proceedings of the Tenth ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pages 142–152, New York, NY, USA, 2005. ACM.
[8] C. von Praun, L. Ceze, and C. Cascaval. Implicit parallelism with ordered transactions. In PPoPP, pages 79–89, 2007.