Royal Institute of Technology
Design of a Distributed Transactional Memory for Many-core systems
Vasileios Trigonakis vasileios.trigonakis(@)epfl.ch vtri(@)kth.se
9 September, 2011
A master thesis project conducted at
Supervisor: Prof. Rachid Guerraoui Advisor: Dr. Vincent Gramoli Examiner: Prof. Seif Haridi
KTH – School of Information and Communication Technology
Forum 105, 164 40 Kista, Sweden
TRITA-ICT-EX-2011:220
© Vasileios Trigonakis, September 9, 2011
Abstract

The emergence of Multi/Many-core systems signified an increasing need for parallel programming. Transactional Memory (TM) is a promising programming paradigm for creating concurrent applications. To date, the design of Distributed TM (DTM) tailored to non-coherent Many-core architectures is largely unexplored. This thesis addresses this topic by analysing, designing, and implementing a DTM system suitable for low-latency message-passing platforms. The resulting system, named SC-TM, the Single-Chip Cloud TM, is a fully decentralized and scalable DTM, implemented on Intel's SCC processor; a 48-core "concept vehicle" created by Intel Labs as a platform for Many-core software research. SC-TM is one of the first fully decentralized DTMs that guarantees starvation-freedom and the first to use an actual pluggable Contention Manager (CM) to ensure liveness. Finally, this thesis introduces three completely decentralized CMs: Offset-Greedy, a decentralized version of Greedy; Wholly, which relies on the number of completed transactions; and FairCM, which makes use of the effective transactional time. The evaluation showed that the latter outperformed the other two.
Keywords: Transactional Memory (TM); Contention Management (CM); Many-core Systems
Acknowledgements

I would like to give special thanks to...

Vincent Gramoli, Post-Doc Fellow at EPFL, for being an always helpful and interested advisor.
Rachid Guerraoui, Professor at EPFL, for giving me the opportunity to conduct this thesis.
Seif Haridi, Professor at KTH, for being my examiner and the professor who made me interested in Distributed Systems.
My family, for supporting me throughout the many years of my studies.

...as well as everyone else who listened to my questions and spent the time to help me along the way of completing this project.
Contents

1 Introduction
  1.1 Motivation
  1.2 Contributions
  1.3 Structure of the Document

2 Background & Related Work
  2.1 Background
    2.1.1 Hardware, Software, or Hybrid TM
    2.1.2 Conflict
    2.1.3 Irrecoverable Actions
    2.1.4 Interactions with non-transactional code
    2.1.5 Data Versioning
    2.1.6 Conflict Detection
    2.1.7 Conflict Detection Granularity
    2.1.8 Static or Dynamic
    2.1.9 Lock-based or Non-blocking
    2.1.10 Contention Management
    2.1.11 Transaction Nesting
    2.1.12 Liveness Guarantees
    2.1.13 Safety Guarantees
  2.2 Software Transactional Memory
    2.2.1 Software transactional memory for dynamic-sized data structures
    2.2.2 On the correctness of transactional memory
    2.2.3 McRT-STM: A High Performance Software Transactional Memory System for a Multi-Core Runtime
    2.2.4 Transactional memory
    2.2.5 Elastic Transactions
  2.3 Contention Management
    2.3.1 Contention Management in Dynamic Software Transactional Memory
    2.3.2 Advanced contention management for dynamic software transactional memory
    2.3.3 Toward a theory of transactional contention managers
    2.3.4 Transactional Contention Management as a Non-Clairvoyant Scheduling Problem
  2.4 Distributed Software Transactional Memory
    2.4.1 Distributed Multi-Versioning (DMV)
    2.4.2 Sinfonia
    2.4.3 Cluster-TM
    2.4.4 DiSTM
    2.4.5 DSM
    2.4.6 D2STM
    2.4.7 On the Design of Contention Managers and Cache-Coherence Protocols for Distributed Transactional Memory
    2.4.8 FTDMT
    2.4.9 D-TL2
  2.5 Cache Coherence Protocols for Distributed Transactional Memory
    2.5.1 Ballistic
    2.5.2 Relay
    2.5.3 COMBINE
  2.6 Cache Coherence Protocols for Shared Memory
    2.6.1 An evaluation of directory schemes for cache coherence
    2.6.2 Directory-Based Cache Coherence in Large-Scale Multiprocessors
    2.6.3 The directory-based cache coherence protocol for the DASH multiprocessor
    2.6.4 Software cache coherence for large scale multiprocessors

3 SC-TM, the Single-chip Cloud TM
  3.1 Introduction
  3.2 System Model
  3.3 System Design
    3.3.1 Application part
    3.3.2 DTM part
      TX Interface
      DS-Lock
      Contention Manager
      Object Locating
  3.4 Transactional Operations
    3.4.1 Transactional Read
    3.4.2 Transactional Write
    3.4.3 Transaction Start
    3.4.4 Transaction Commit
    3.4.5 Transaction Abort
  3.5 Contention Management
    3.5.1 Back-off and Retry
    3.5.2 Offset-Greedy
      Steps
      Accuracy
    3.5.3 Wholly
    3.5.4 FairCM
  3.6 Elastic Model
    3.6.1 Elastic Model Implementation
  3.7 Target Platform
    3.7.1 Single-Chip Cloud Computer (SCC)
    3.7.2 SCC Hardware Overview
    3.7.3 SCC Memory Hierarchy
      Private DRAM
      Shared DRAM
      Message Passing Buffer (MPB)
    3.7.4 SCC Programmability
      RCCE Library
      iRCCE Library
  3.8 Implementation
    3.8.1 Multitasking
      POSIX threads
      Libtask
    3.8.2 Dedicated Cores
  3.9 SCC-Related Problems
    3.9.1 Programming model
    3.9.2 Messaging
      Blocking
      Deterministic
      Unreliable

4 SC-TM Evaluation
  4.1 Introduction
    4.1.1 SCC Settings
  4.2 Multitasking vs. Dedicated DS-Lock Service
  4.3 Linked-list Benchmark
  4.4 Hashtable Benchmark
  4.5 Bank Benchmark

5 Conclusions
  5.1 Summary
  5.2 Future work
    5.2.1 Write-lock Batching
    5.2.2 Asynchronous Read Locking
    5.2.3 Eager Write-lock Acquisition
    5.2.4 Profiling & Refactoring
    5.2.5 Applications & Benchmarks

A Acronyms
List of Figures

2.1  State diagram of the life-cycle of a transaction.
2.2  Possible problematic case for a Software Transactional Memory (STM) system.
3.1  Abstract architecture of the SC-TM system.
3.2  Pseudo-code for the Read-lock acquire (dsl_read_lock) operation.
3.3  Pseudo-code for the Read-lock release (dsl_read_lock_release) operation.
3.4  Pseudo-code for the Write-lock acquire (dsl_write_lock) operation.
3.5  Pseudo-code for the Write-lock release (dsl_write_lock_release) operation.
3.6  Pseudo-code for the Transactional Read (txread) operation.
3.7  Pseudo-code for the Transactional Write (txwrite) operation.
3.8  Pseudo-code for the Transaction Start (txstart) operation.
3.9  Pseudo-code for the Transaction Commit (txcommit) operation.
3.10 Pseudo-code for the Transaction Abort (txabort) operation.
3.11 Offset-based timestamp calculation.
3.12 Offset-Greedy – Contradicting views of timestamps for two transactions.
3.13 Single-Chip Cloud Computer (SCC) processor layout. Adapted from [Hel10], page 8.
3.14 SCC memory spaces. Adapted from [Hel10], page 52.
3.15 Round-trip latency for a 32-byte message on the SCC.
3.16 RCCE Application Programming Interface (API) – Core utilities.
3.17 RCCE API – Memory management functions.
3.18 RCCE API – Communication.
3.19 RCCE API – Synchronization.
3.20 RCCE API – Power management.
3.21 The allocation of the Application and DS-Lock service parts on the SCC's 48 cores.
3.22 Activity diagram of the multitasking between the Application and the DS-Lock service on a single core.
3.23 Multitasking – An example where the scheduling of core m affects the execution of core n.
3.24 The allocation of the Application and the dedicated DS-Lock service on the 48 cores of the SCC.
3.25 Implementing data exchange between cores 0 and 1 using RCCE.
3.26 The RCCE send and receive operation interfaces.
3.27 Problematic output of the ping-pong-like test application running on 16 cores.
4.1  Available performance settings for Intel's SCC processor.
4.2  Throughput of read-only transactions for the multitask-based and the dedicated DS-Lock versions of the Single-chip Cloud TM (SC-TM).
4.3  Latency of read-only transactions for the multitask-based and the dedicated DS-Lock versions of SC-TM.
4.4  Throughput of the linked-list running only contains operations in sequential mode.
4.5  Throughput of the linked-list for normal and elastic-early transactions.
4.6  Commit rate of normal and elastic-early transactions on the linked-list micro-benchmark.
4.7  Throughput of the sequential and transactional (elastic-read) linked-list versions under different list sizes.
4.8  Ratio of transactional (elastic-read) throughput to sequential throughput under different list sizes.
4.9  Throughput of the sequential and transactional versions on the Hashtable benchmark under different load-factor values.
4.10 Ratio of transactional performance to sequential performance on the Hashtable benchmark under different load-factor values.
4.11 Throughput of the normal and elastic-read versions on the Hashtable benchmark under load factor 4.
4.12 Ratio of throughput for the normal and elastic-read versions on the Hashtable benchmark compared to the throughput on 2 cores.
4.13 Throughput of SC-TM running the Bank benchmark with different Contention Managers (CMs). (Configuration I)
4.14 Commit rate of SC-TM running the Bank benchmark with different CMs. (Configuration I)
4.15 Throughput of SC-TM running the Bank benchmark with different CMs. (Configuration II)
4.16 Commit rate of SC-TM running the Bank benchmark with different CMs. (Configuration II)
1 Introduction

Over the last years, there has been a shift in hardware architectural design towards Multi-core systems. Multi-cores consist of two or more processing units and are nowadays the de facto processors for almost every computer system. In order to take full advantage of these systems, parallel/concurrent programming is a necessity [LS08]. One of the most difficult and error-prone tasks in concurrent programming is synchronizing shared memory accesses. One has to be careful, or else problems such as data races and deadlocks may appear.

Transactional Memory (TM) emerged as a promising solution to the aforementioned problem [HM93]. TM allows the programmer to define a sequence of commands, called a transaction, which will be executed atomically with respect to the shared memory. TM seamlessly handles the synchronization between concurrent transactions. Programming using transactions is simpler and easier to debug than using lower-level abstractions, such as locks, because it resembles the sequential way of programming.

On a Distributed System (DS) the synchronization problem is aggravated. The asynchronous nature of such systems, combined with the limited debugging and monitoring facilities, makes programming on such platforms a cumbersome process. Distributed Transactional Memory (DTM) systems aim to provide the programming abstraction of transactions on DSs. Considering the benefits of Transactional Memories and the difficulties of distributed programming positions Distributed Transactional Memory as a very appealing programming approach for such platforms.

Although present hardware architectures typically incorporate efficient Cache-Coherent (CC) shared memory, this may not be the case for future general-purpose systems. Contemporary Many-core processors consist of up to 100 cores, but they are soon expected to scale up to 1000 cores. In such systems, providing full hardware cache-coherency may be unaffordable in terms of memory and time costs; consequently, it is probable that Many-cores will have limited, or no, support for hardware cache-coherency [BBD+ 09]. These systems will rely on Message Passing (MP) and coherency will be handled in software. Accordingly, Distributed Transactional Memory on top of Message Passing is a very promising programming model.

The main goal of this thesis was the design and implementation of a Distributed Transactional Memory system suitable for Many-core architectures. The resulting DTM system is called SC-TM, the Single-chip Cloud TM. At the time of writing, Intel's SCC is the only non-coherent Message Passing processor available, therefore it was selected as the target platform of this project. The Single-Chip Cloud Computer experimental processor is a 48-core "concept vehicle" created by Intel Labs as a platform for Many-core software research.
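To give a concrete flavour of this programming model, the sketch below expresses a shared-counter update as a transaction. The operation names follow those used later in this document (txstart, txread, txwrite, txcommit), but the exact prototypes, types, and retry behaviour shown here are assumptions made purely for illustration.

#include <stdint.h>

/* Hypothetical SC-TM-style transactional interface; the real operations
 * and their signatures are presented in Chapter 3. */
void     txstart(void);
uint32_t txread(uint32_t *addr);
void     txwrite(uint32_t *addr, uint32_t value);
void     txcommit(void);   /* assumed to re-execute the transaction if it aborts */

uint32_t shared_counter;   /* object living in shared memory */

/* Atomically increment the shared counter: no locks, no explicit retry loop. */
void increment_counter(void)
{
    txstart();
    uint32_t v = txread(&shared_counter);
    txwrite(&shared_counter, v + 1);
    txcommit();
}

Compared with a lock-based version, the programmer only marks the atomic block; detecting conflicts with other cores and retrying on abort is the responsibility of the TM runtime.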
1.1 Motivation

As one can notice from Chapter 2 of this report, there is extensive work on the design of DTM systems [HS05, MMA06, AMVK07, BAC08, RCR08, KAJ+ 08, ZR09b, DD09, ZR09a, CRCR09, AGM10, LDT+ 10, KLA+ 10]. Although all the solutions aim to provide the Transactional Memory abstraction, they differ in three major points: (i) they optimize for different workloads, (ii) they provide different guarantees, and/or (iii) they target different platforms. While points (i) and (ii) are specific to TM, point (iii) is generic to Distributed Systems. For example, a system designed for a large-scale cluster may be significantly different from one targeting a Many-core, mainly because of the different underlying system properties. Many-core systems emerged only recently, so, to my knowledge, this is the first work that tries to tailor a DTM algorithm specifically to Many-core processors.

Current Many-core processors consist of fewer than one hundred cores, but in the near future the number is expected to increase up to one thousand. Therefore, one of the most desired characteristics for the SC-TM system was scalability. In order to achieve scalability, a fully decentralized solution has to be applied. Many of the existing solutions introduce a centralization point in order to accomplish some other desired characteristic. SC-TM is fully decentralized and, as the Evaluation (Chapter 4) shows, scales particularly well.

As already mentioned, another differentiation point among the solutions is the safety and liveness guarantees they provide. The safety guarantee is almost universal in all the systems: opacity [GK08]. Deadlock-freedom is the most common liveness guarantee provided by the existing DTMs. The goal for SC-TM is to provide a stronger guarantee: starvation-freedom. A starvation-free system is protected from both deadlocks and livelocks. In the STM world (by STM I refer to a non-distributed TM solution), starvation-freedom can be "easily" achieved by using a CM [HLM03]. On the other hand, in a decentralized DTM, contention management is not trivial. The lack of a module with a global view of the system makes contention management difficult. Moreover, the design of most DTMs is not suitable for contention management. For this reason, none of the existing solutions employs an actual CM. SC-TM relies on contention management for providing starvation-freedom. Different contention management policies can easily be applied, since the CM is a separate module of the system.
1.2 Contributions
Firstly, this thesis presents the first DTM system specifically designed for non-coherent Many-core architectures. The pre-existing DTM solutions mainly target Distributed Systems such as clusters and Local Area Networks (LANs). Secondly, SC-TM is the first DTM system that uses an actual Contention Manager in order to provide the desired liveness guarantees. Three different CMs were developed to be used with SC-TM. To our knowledge, SC-TM is one of the first systems that is both fully decentralized and guarantees starvation-freedom. The practical evaluation revealed that strong liveness guarantees can be essential under certain workloads. Finally, the experience gained while implementing and tuning the algorithm on the SCC processor can be seen as an extensive study of the programmability of a truly Message Passing Many-core system.
1.3 Structure of the Document
The rest of the document is structured as follows:

• Chapter 2 describes the background of TM and work related to SC-TM.
• Chapter 3 presents the DTM algorithm, the SC-TM system design, the target platform, and some important implementation decisions.
• Chapter 4 evaluates the SC-TM system.
• Chapter 5 concludes this work and presents possible future work.
2 Background & Related Work

This chapter consists of two main parts: the Background and the Related Work. The former (Section 2.1) intends to help the reader become familiar with TM, while the latter (Sections 2.2 to 2.6) presents work that affected, and is closely related to, this thesis. Section 2.2 presents some of the most influential research done in the area of Software Transactional Memories. The theory of STMs is the foundation for building Distributed Transactional Memory systems. Section 2.3 discusses Contention Managers, a module that can be used to guarantee the progress of an STM system. Section 2.4 introduces several existing DTM systems developed over the last years. Section 2.5 describes three Cache-Coherence protocols that were designed to be used in the context of DTM. Finally, Section 2.6 presents some work on Shared Memory Cache-Coherence protocols, particularly solutions that are Directory-based. Some of these protocols employ an approach to coherence quite similar to the one used in SC-TM.
2.1 Background
Concurrent programming is essential for increasing the performance of a single application on modern processor architectures. A concurrent application consists of more than one thread of execution running simultaneously. The parallel threads of execution may share some memory objects. Accesses to these objects have to be synchronized in order to avoid problems such as data races. The most typical solution to this problem is using low-level mechanisms, such as locks. Lock programming is a cumbersome and error-prone process. Moreover, ensuring the correctness of a lock-based application is rather difficult. Apart from that, lock-based synchronization is prone to several problems:

• deadlocks: two or more threads are each waiting for the other to release a lock and thus neither proceeds.
• priority inversion: a higher-priority thread needs a lock that is held by a lower-priority thread and is thus blocked.
• preemption: a thread may be preempted while holding some locks, therefore "spending" valuable resources.
• lock convoying: a lock can be acquired by only one of the threads contending for it. Upon acquisition failure, the remaining threads perform an explicit context switch, leading to underutilization of scheduling quotas and thus to overall performance degradation.

Transactional Memory emerged as an alternative synchronization mechanism [HM93]. The basic idea behind TM is to enable the threads of an application to synchronize their shared memory accesses by executing lightweight, in-memory transactions. A transaction is a sequence of operations that should be executed atomically. The purpose of a transaction is thus similar to that of a critical section. However, unlike critical sections, transactions can abort, in which case all their operations are rolled back and are never visible to other transactions. Also, transactions only appear as if they executed sequentially; the TM is free to run them concurrently, as long as the illusion of atomicity is preserved. Using a TM is, in principle, very easy: the programmer simply converts those blocks of code that should be executed atomically into transactions [GK10]. Finally, transactions should operate in isolation relative to the other transactions; no other thread should observe their writes before commit.

Figure 2.1: State diagram of the life-cycle of a transaction.

Figure 2.1 depicts the state-chart diagram of the life-cycle of a transaction. A transaction may be aborted for one of the following reasons:

• A transactional operation cannot be completed due to a conflict.
• The commit operation is unsuccessful due to some validation failure (caused by a conflict).
• The transaction is forcibly aborted by another transaction (through a CM).

If a CM is used, the CONFLICT transition in the diagram means that there was a conflict and the CM decided to abort the current transaction. The NO_CONFLICT transition has two different meanings: either there was no conflict, or there was a conflict and the CM decided to abort the enemy transactions. The following subsections present some aspects of TM systems that are necessary in order to understand how such a system operates.
2.1.1 Hardware, Software, or Hybrid TM
TM was initially proposed as a solution [HM93] implemented on hardware, but soon expanded to software [ST97]. Moreover, hybrid solutions exist, which are based on software/hardware co-design [DMF+ 06, RHL05]. In the present work, I consider only Software TM and thus the following sections basically refer to STM systems.
2.1.2 Conflict

Two or more alive (i.e., started and not yet committed or aborted) transactions conflict on a memory object M if one of the following happens:

• One has written to M and the other tries to read it (Read After Write (RAW) conflict).
• One has written to M and the other tries to write it (Write After Write (WAW) conflict).
• One or more have read from M and another tries to write it (Write After Read (WAR) conflict).

Every conflict has to be resolved in order for the STM to preserve its semantics. The resolution is performed by aborting one or more of the involved transactions.
2.1.3 Irrecoverable Actions

TM has some difficulties when it comes to irrecoverable actions such as input, output, and non-catchable exceptions. For example, if a transaction prints something to the standard output, it is not acceptable for it to be aborted and restarted, because the output cannot be reverted.
2.1.4 Interactions with non-transactional code

There are two basic alternatives for how memory objects used by transactions may be accessed by non-transactional code.

Weak atomicity. Transactions are serializable against other transactions, but the system provides no guarantees about interactions with non-transactional code. In other words, memory objects used by transactions should not be accessed by non-transactional code.

Strong atomicity. Transactions are serializable against all memory accesses; non-transactional loads and stores can be considered as single-instruction transactions.

Most STM systems provide weak atomicity, because strong atomicity is very "expensive" to guarantee: all memory accesses need to be intercepted (for example, by instrumenting memory accesses in Java).
2.1.5 Data Versioning

Data versioning arranges how uncommitted and committed values of memory objects are managed. Three basic approaches are used.

Eager versioning. The updates of the memory objects are immediately written to the shared memory. In order to be able to revert to the old values in case of an abort, the transaction keeps an undo-log with the initial values of the objects it has written.

Lazy versioning. The updates are not persisted in the shared memory, but buffered in a write-buffer. Upon commit, the actual memory locations are updated.

Multi-versioning. The system keeps multiple versions of the same memory object in order to allow read operations to access the "correct" snapshot of the shared memory. For example, assume the following history of transactional events:

r_i(x) → ... → w_j(y) → ... → C_j → r_i(y)

where r_i(x)/w_i(x) means that transaction i reads/writes memory object x and C_i that it commits. It should be obvious that there is a RAW conflict on object y. Normally, transaction i would not be able to commit, because it would violate the real-time order. Multi-versioning solves this problem by keeping the necessary older versions of the memory objects. So, in our example, transaction i would read the version of y prior to j updating it.
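As a minimal illustration of the difference between the first two approaches above, the following sketch logs old values for eager versioning and buffers new values for lazy versioning. The structures, names, and fixed-size logs are hypothetical, and bounds checks and conflict handling are omitted.

#include <stddef.h>
#include <stdint.h>

#define TX_MAX_WRITES 64

typedef struct { uintptr_t *addr; uintptr_t val; } entry_t;

/* Eager versioning: write in place, remember the old value in an undo-log. */
typedef struct { entry_t undo[TX_MAX_WRITES]; size_t n; } eager_tx_t;

void eager_write(eager_tx_t *tx, uintptr_t *addr, uintptr_t val)
{
    tx->undo[tx->n].addr = addr;    /* save old value for rollback */
    tx->undo[tx->n].val  = *addr;
    tx->n++;
    *addr = val;                    /* update shared memory immediately */
}

void eager_abort(eager_tx_t *tx)    /* roll back in reverse order */
{
    while (tx->n > 0) {
        tx->n--;
        *tx->undo[tx->n].addr = tx->undo[tx->n].val;
    }
}

/* Lazy versioning: buffer the new value, publish it only at commit time. */
typedef struct { entry_t wbuf[TX_MAX_WRITES]; size_t n; } lazy_tx_t;

void lazy_write(lazy_tx_t *tx, uintptr_t *addr, uintptr_t val)
{
    tx->wbuf[tx->n].addr = addr;    /* shared memory is untouched until commit */
    tx->wbuf[tx->n].val  = val;
    tx->n++;
}

void lazy_commit(lazy_tx_t *tx)     /* assumes conflicts were already resolved */
{
    for (size_t i = 0; i < tx->n; i++)
        *tx->wbuf[i].addr = tx->wbuf[i].val;
    tx->n = 0;
}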
2.1.6 Conflict Detection

Conflict detection defines how (and when) conflicts are detected.

Pessimistic detection. Check for conflicts during transactional loads and stores. Also called encounter-time or eager conflict detection.

Optimistic detection. Detect the conflicts when the transaction tries to commit. Also called commit-time or lazy conflict detection.

Combination. An STM can apply different policies for reads and writes. A typical example is optimistic reads with pessimistic writes.
2.1.7 Conflict Detection Granularity

Similar to lock granularity, conflict detection granularity defines the minimum size of memory that can be transactionally acquired through a TM system. Even if a transaction loads or stores a smaller part of memory, it is considered as loading this minimum conflict detection unit. So, if the TM supports word granularity and two transactions simultaneously write to different bytes of the same word (assuming a word size of 4 or 8 bytes), then a conflict is detected. A conflict that is detected only due to the granularity of conflict detection and is not an actual conflict is called a false conflict (a sketch of the address-to-unit mapping is given after the list below).

Object. The conflict detection is done with memory-object granularity.

Word. The conflict detection is done with single-word granularity. In typical processor architectures a word is 4 or 8 bytes.

Cache line. The conflict detection is done with cache-line granularity.
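The following sketch shows one hypothetical way to map an address to a conflict-detection unit (CDU) for the three granularities above; two accesses whose addresses map to the same CDU are treated as conflicting even if they do not actually overlap, which is exactly the false-conflict case. The word and cache-line sizes are assumptions for illustration.

#include <stdint.h>

#define WORD_SIZE        8    /* assumed word size in bytes       */
#define CACHE_LINE_SIZE 64    /* assumed cache-line size in bytes */

/* Word granularity: all bytes inside the same word share one CDU. */
static inline uintptr_t cdu_word(const void *addr)
{
    return (uintptr_t)addr / WORD_SIZE;
}

/* Cache-line granularity: all bytes inside the same line share one CDU. */
static inline uintptr_t cdu_cache_line(const void *addr)
{
    return (uintptr_t)addr / CACHE_LINE_SIZE;
}

/* Object granularity: the object's base address identifies the CDU,
 * so any two accesses to the same object conflict. */
static inline uintptr_t cdu_object(const void *object_base)
{
    return (uintptr_t)object_base;
}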
2.1.8 Static or Dynamic
If all transactional memory accesses should be statically predefined, then the STM is called static. If the system handles the memory accesses dynamically, the STM is called dynamic. Almost all STM systems are dynamic.
2.1.9 Lock-based or Non-blocking

There are two major STM implementation approaches: lock-based and non-blocking schemes. A lock-based STM internally uses a blocking locking mechanism to implement the transactional semantics. On the other hand, a non-blocking STM relies on a non-blocking algorithm, such as versioning.
2.1.10 Contention Management

Conflicts are resolved by aborting one or more of the conflicting transactions in order to obtain a serializable execution. A CM is the module that, by implementing a contention management policy, decides which transaction should be aborted. For example, one such policy could be to abort all transactions but the oldest one.
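Since a contention management policy is essentially a decision procedure, it can be hidden behind a small pluggable interface. The sketch below is hypothetical C, not the code of any particular TM: the TM core asks the installed policy what to do about a conflict, and the example policy implements the "abort all but the oldest transaction" rule mentioned above.

#include <stdint.h>

/* What the contention manager decides upon a conflict. */
typedef enum { CM_ABORT_SELF, CM_ABORT_ENEMY, CM_WAIT } cm_decision_t;

/* Minimal per-transaction state visible to the CM (hypothetical). */
typedef struct {
    uint64_t start_timestamp;   /* smaller value = older transaction                 */
    uint32_t retries;           /* number of times this transaction was restarted    */
} tx_info_t;

/* A contention-management policy is just a decision function. */
typedef cm_decision_t (*cm_policy_t)(const tx_info_t *self, const tx_info_t *enemy);

/* Example policy: the oldest transaction wins. */
static cm_decision_t cm_oldest_wins(const tx_info_t *self, const tx_info_t *enemy)
{
    return (self->start_timestamp < enemy->start_timestamp)
               ? CM_ABORT_ENEMY
               : CM_ABORT_SELF;
}

/* The TM core calls the installed policy whenever it detects a conflict. */
static cm_policy_t current_cm = cm_oldest_wins;

static cm_decision_t cm_resolve(const tx_info_t *self, const tx_info_t *enemy)
{
    return current_cm(self, enemy);
}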
2.1.11 Transaction Nesting
A transaction that includes other transactions in its body is called a nested transaction. STM systems use nested transactions to achieve composability: creating a new transactional operation out of two or more existing transactional operations within a nested transaction. A typical and naive approach to composability is called flat nesting: the whole code between the outermost transactional start and end operations is considered as one "flat" transaction.
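As an illustration of composability through nesting, the sketch below builds a transfer operation out of two operations that are themselves transactional. Under flat nesting, the inner txstart/txcommit pairs are absorbed into the outermost pair, so the whole transfer runs as one transaction. The interface is the same hypothetical one sketched in Chapter 1; the comments state the assumed flat-nesting behaviour.

#include <stdint.h>

/* Hypothetical flat-nesting interface (see the sketch in Chapter 1). */
void     txstart(void);     /* increments the nesting depth; starts a transaction at depth 0    */
void     txcommit(void);    /* decrements the depth; really commits only at the outermost level */
uint32_t txread(uint32_t *addr);
void     txwrite(uint32_t *addr, uint32_t value);

void deposit(uint32_t *account, uint32_t amount)
{
    txstart();
    txwrite(account, txread(account) + amount);
    txcommit();
}

void withdraw(uint32_t *account, uint32_t amount)
{
    txstart();
    txwrite(account, txread(account) - amount);
    txcommit();
}

/* Composed operation: with flat nesting the two inner transactions become
 * part of one atomic outer transaction. */
void transfer(uint32_t *from, uint32_t *to, uint32_t amount)
{
    txstart();
    withdraw(from, amount);
    deposit(to, amount);
    txcommit();
}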
2.1.12 Liveness Guarantees

Every STM system should provide some liveness guarantee which makes certain that the system progresses.

Wait-freedom. Wait-freedom [Her88] guarantees that all threads contending for a memory object eventually (i.e., in a finite amount of time) make progress.

Starvation-freedom. Starvation-freedom or lock-freedom [Fra03] guarantees that at least one of the contending threads eventually progresses.

Obstruction-freedom. Obstruction-freedom [HLM03] guarantees that if a thread does not face contention, it will eventually make progress.

Wait-freedom implies starvation-freedom, which in turn implies obstruction-freedom; therefore wait-freedom is the strongest of the three guarantees.
2.1.13 Safety Guarantees

Several correctness criteria have been proposed for TM, most of them taken from other fields such as Databases. A short description of these criteria is given in Subsection 2.2.2. STM transactions have the following peculiarity compared to classic database transactions: even alive transactions are not allowed to access an inconsistent state of the shared memory. In databases, a live transaction that accesses an inconsistent state will simply be aborted, but in an STM system the irrecoverability of some events may cause a problem.

1  int x = (int) TX_LOAD(address1);
2  int y = (int) TX_LOAD(address2);
3  int z = 1 / (y - x);  /* if (x == y) { runtime exception } */

Figure 2.2: Possible problematic case for an STM system if a transaction accesses an inconsistent state.
Figure 2.2 illustrates such a problematic case. Assume that, because of the semantics of the application, x ≠ y, and therefore the application programmer does not perform an equality check in line 3. If the transaction accesses an inconsistent state, it may be that x = y, and thus line 3 will throw a division-by-zero runtime exception which cannot be handled and will make the application hang. This intuition explains the need for a stricter safety guarantee for STM: opacity [GK08]. Informally, opacity is a safety property that captures the intuitive requirements that:

1. all operations performed by every committed transaction appear as if they happened at some single, indivisible point during the transaction's lifetime,
2. no operation performed by any aborted transaction is ever visible to other transactions (including live ones),
3. and every transaction always observes a consistent state of the system.
2.2 Software Transactional Memory
This section presents relevant work on STM.
2.2.1 Software transactional memory for dynamic-sized data structures [HLMS03]
This paper was the first to introduce an STM system with support for dynamic-sized data structures. Prior to this, TM systems required the programmer to statically declare the memory locations that a transaction would use. The solution presented, called DSTM, guarantees obstruction-freedom and was the first system to use a contention manager to resolve conflicts and guarantee progress. Although DSTM provides linearisability of transactions, the authors recognized that this is not a strong enough guarantee for a TM and introduced the problem of the consistency of aborted transactions. DSTM works with multi-versioning of both read and write objects and solves the consistency problem by validating the transaction on every object acquisition. Moreover, DSTM provides an explicit release method for read objects, which aims to increase the concurrency in certain applications. Finally, a simple correctness criterion for contention managers is presented: a CM should guarantee that eventually every transaction is granted the right to abort another conflicting transaction. Two novel contention managers were suggested: Aggressive and Polite. Aggressive simply grants every transaction permission to preempt a conflicting one, while Polite uses a controlled back-off (similar to the TCP congestion algorithm) so as to give the live transaction the opportunity to commit.
2.2.2 On the correctness of transactional memory [GK08]
This paper introduces some formal guarantees that a Transactional Memory (should) provide. Specifically, the authors introduce opacity, a correctness criterion for TMs, which is the de facto safety property the majority of TM systems have adopted. Opacity is an extension of the classic database serializability property. In a TM system, there is the need to state whether an execution with more than one transaction executing in parallel "looks like" a sequential one. The major difference between a memory transaction (a transaction executed by a TM) and a database transaction is that in the former, unlike the latter, a live transaction (one neither committed nor aborted) should not access an inconsistent state, even if it will later be aborted. The most prominent consistency criteria are the following:

• Linearisability. In the TM terminology, linearisability means that, intuitively, every transaction should appear as if it took place at some single, unique point in time during its lifespan.
• Serializability. A history H of transactions (i.e., the sequence of operations performed by all transactions in a given execution) is serializable if all committed transactions in H issue the same operations and receive the same responses as in some sequential history S (intuitively, one with no concurrency between transactions) that consists only of the transactions committed in H.
• 1-Copy Serializability. 1-copy serializability [BG83] is similar to serializability, but allows for multiple versions of any shared object, while giving the user the illusion that, at any given time, only one copy of each shared object is accessible to transactions.
• Global Atomicity. Global atomicity [Wei89] is a general form of serializability that (a) is not restricted only to read-write objects, and (b) does not preclude several versions of the same shared object.
• Recoverability. Recoverability [Had88] puts restrictions on the state accessed by every transaction, including a live one. In its strongest form, recoverability requires, intuitively, that if a transaction T_i updates a shared object x, then no other transaction can perform an operation on x until T_i commits or aborts.
• Rigorous Scheduling. A correctness criterion precluding any two transactions from concurrently accessing an object if one of them updates that object. Restricted to read-write objects (registers), this resembles the notion of rigorous scheduling [BGRS91] in database systems.

After motivating why the aforementioned criteria are not suitable for TMs, the authors formally introduce opacity. Informally, opacity is a safety property that captures the intuitive requirements that:

1. all operations performed by every committed transaction appear as if they happened at some single, indivisible point during the transaction lifetime,
2. no operation performed by any aborted transaction is ever visible to other transactions (including live ones),
3. and every transaction always observes a consistent state of the system.

Then, they provide the following characterizations of a TM implementation I:

• Progressive. I is progressive if it forcefully aborts a transaction T_i only when there is a time t at which T_i conflicts with another, concurrent transaction T_k that is not committed or aborted by time t (i.e., T_k is live at t); we say that two transactions conflict if they access some common shared object.
• Single-version. I is single-version if it stores only the latest committed state of any given shared object in base shared objects (as opposed to multi-version TM implementations).
• Invisible reads. I uses invisible reads if no base shared object is modified when a transaction performs a read-only operation on a shared object.

Based on these definitions, they provide the following complexity result: every progressive, single-version TM implementation that ensures opacity and uses invisible reads has time complexity Ω(k), where k = |Obj| is the number of shared objects.
2.2.3 McRT-STM: A High Performance Software Transactional Memory System for a Multi-Core Runtime [SATH+ 06]
This paper presents McRT-STM, an STM system for the C and C++ programming languages. McRT-STM uses cache-line conflict detection for big objects and object-level granularity for smaller ones. It also supports nested transactions with partial aborts. The authors implemented and evaluated McRT-STM with several different design choices. They concluded that:

• The lock-based approach is faster than the non-blocking one, because the latter incurs a higher overhead and more aborts. They proposed that deadlock avoidance can be achieved with the use of timeouts. Though, since McRT-STM does not have any "clever" contention management scheme, both livelock and starvation are possible.
• On the specifics of locking, the evaluation showed that read-versioning/write-locking is better than read/write-locking with undo-logging. Their explanation is based on the cache problems that visible reads introduce.
• Undo-logging outperforms write-buffering, because of the overhead of searching the write-buffers in every read operation.
2.2.4 Transactional memory [Gra10]
This paper provides an extensive review of the TM systems up to late 2009. It covers Software, Hardware, and Hybrid TM systems. After introducing the theory behind TM and the specific TM systems, the author draws the following conclusions about the trends in TM design:

• Most recent Hardware Transactional Memory systems favour eager conflict detection over lazy.
• Eager data version management seems to attract more focus in Hardware Transactional Memory systems than lazy version management does.
• A majority of STM systems favour optimistic concurrency control over pessimistic.
• Early STM systems usually employ non-blocking synchronization, while more recent proposals usually employ blocking synchronization.
• Recent STMs usually have a more flexible approach regarding concurrency control, conflict detection, and conflict resolution than older ones.
2.2.5 Elastic Transactions [FGG09]
This paper introduces elastic transactions, a relaxation of the transactional model. When a conflict is detected, instead of aborting and retrying the whole transaction, an elastic transaction may be split, committing the work done prior to the conflict. This characteristic can boost the performance of a TM, especially for search structures, where a conflict may not affect the semantics of the atomic block. Elastic transactions can coexist and be composed with normal transactions.
They then propose ε-STM as an implementation supporting elastic transactions. ε-STM uses timestamps, two-phase locking with write-buffering, atomic primitives (compare-and-swap and fetch-and-increment), and atomic loads and stores, and is built on top of the TinySTM system [FFR08]. Finally, ε-STM is elastic-opaque.
2.3 Contention Management
This section presents some important work on Contention Managers.
2.3.1 Contention Management in Dynamic Software Transactional Memory [SS04]
This paper presents a plethora of contention managers and benchmarks them on top of the DSTM [HLMS03] system. According to the authors, a contention manager should provide two guarantees: non-blocking operations and always eventually aborting a transaction (in order to provide obstruction-freedom). In the following, the transaction that holds some resource needed by the current transaction is called the enemy. The contention management algorithms evaluated are:

• Aggressive. Always abort the enemy.
• Polite. Back off for an exponentially increasing amount of time in order to give the enemy the possibility to complete. If after n ≥ 1 back-offs the enemy still holds the required resource, abort it.
• Randomized. Throw a (biased) coin to decide whether the enemy should be aborted or the transaction should wait for a random interval (limited by an upper value).
• Karma. Karma tries to use an estimation of the amount of resources already used by the enemies in order to select which transaction to abort. Whenever a transaction commits, the thread's karma is set to 0. Then, each time a transaction opens some resource it collects karma (≡ priority). Upon conflict, if the enemy has lower priority it is aborted; otherwise the transaction waits for fixed intervals until either the enemy completes, or the sum of its karma and the number of retries is greater than the enemy's karma, in which case the transaction aborts the enemy. Whenever a transaction is aborted, it keeps the gathered karma so that it will have greater chances of completing next time. Finally, every transaction gains one point upon retry (a measure for avoiding livelock), so that short transactions will eventually gather enough karma to commit. A sketch of this decision rule is given at the end of this subsection.
• Eruption. Eruption is similar to Karma. The main difference is that whenever a transaction finds an enemy with higher priority (called momentum, similar to the karma points of Karma), it adds its momentum points to the enemy and waits (using the same waiting pattern as Karma). The motivation behind adding the points to the enemy is to "help" a transaction that blocks many other transactions to complete. Eruption also halves the momentum points of an aborted transaction, in order to avoid the mutual exclusion problem.
• KillBlocked. Every transaction that does not manage to open a resource is marked as blocked. On contention, the manager aborts the enemy when it is either blocked or a maximum waiting time has passed.
• Kindergarten. The transactions take turns accessing a block. The manager keeps a list (for each transaction) with the enemies in favour of which the transaction has aborted before. Upon conflict, the manager checks the list and, if the enemy is in the list, aborts it; otherwise it backs off for a fixed amount of time. If after a number of retries the enemy remains the same transaction, the manager aborts it.
• Timestamp. Timestamp is similar to Greedy [GHP05]. Each transaction gets a fixed timestamp and, upon conflict, if the enemy is younger (i.e., has a greater timestamp value), it is aborted; otherwise the transaction waits for a series of fixed intervals. After half of the maximum number of these intervals, it flags the enemy as potentially failed. If the enemy proceeds with some transactional operations, its manager will clear the flag. When the transaction completes waiting the maximum number of intervals, the enemy gets aborted if its flag is set. Otherwise, the manager doubles the waiting interval and backs off the transaction again.
• QueueOnBlock. Each transaction holds a queue, where any conflicting transactions subscribe. Upon completion, the enemy sets a finished flag in the queue, so that the waiting transactions get the resources. Of course, if more than one transaction waits for the same resource, the enemy allocates the resource to one of them and the others subscribe, for that resource, in the new holder's queue.

The outcome of the evaluation revealed that there is no universally good contention manager. The performance of every manager is closely related to each benchmark, and in many cases the performance of some contention managers is unacceptable.
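The following is a minimal sketch of the Karma decision rule described above, written against hypothetical per-transaction fields; it captures only the abort-or-wait choice, not the karma bookkeeping performed when resources are opened or transactions commit and abort.

#include <stdbool.h>
#include <stdint.h>

typedef struct {
    uint64_t karma;     /* priority: resources opened so far (kept across aborts)       */
    uint64_t retries;   /* times this transaction has retried the conflicting access    */
} karma_tx_t;

/* Returns true if 'self' may abort the enemy now, false if it should
 * back off for a fixed interval and try again (incrementing 'retries'). */
static bool karma_may_abort_enemy(const karma_tx_t *self, const karma_tx_t *enemy)
{
    if (enemy->karma < self->karma)
        return true;                                    /* lower-priority enemy: abort it */
    return self->karma + self->retries > enemy->karma;  /* waited long enough             */
}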
2.3.2 Advanced contention management for dynamic software transactional memory [SS05]
This work is a continuation of the evaluation done in [SS04]. The authors introduce two new contention managers and benchmark them against the ones that proved to have the best performance in their previous study (Polite, Karma, Eruption, Kindergarten, and Timestamp).

• PublishedTimestamp. Like Timestamp, but uses a heuristic to estimate whether a transaction is active. A transaction updates a "recency" timestamp each time it proceeds with a transactional operation. A thread is assumed active unless its timestamp value lags the system's global time by more than a threshold. PublishedTimestamp aborts an enemy E whose recency timestamp lags by more than its own (E's) inactivity threshold. The value of the threshold is reset to an initial value when a thread's transaction commits, while it is doubled (up to an upper bound) when a transaction is aborted and restarted.
• Polka. Polka is a combination of the Polite and Karma algorithms. Upon contention, Polka backs off for a number n of exponentially increasing intervals, where n equals the priority difference between the transaction and the enemy. Moreover, Polka unconditionally aborts any set of readers that holds some resource needed for a read-write access, in order to give priority to writes. Using this mechanism, though, makes Polka prone to livelocks.

The evaluation results once again suggested that there is no "universal" contention manager that performs best in every workload. However, they concluded that Polka performs well even in its worst case and thus it could be a good choice as a default contention manager.
2.3.3 Toward a theory of transactional contention managers [GHP05]
This paper introduces some foundational theory behind contention management for STMs. Contention managers are different from classical scheduling algorithms mainly because they are decentralized (the decision about which of two conflicting transactions to abort is mainly local) and dynamic (there is no prior knowledge about the duration and the size of a transaction, which does not permit off-line scheduling). The authors present the Greedy contention manager and prove that it provides the following non-trivial properties:

• every transaction commits within bounded time,
• and if n concurrent transactions share s objects, then the makespan of the execution (the time to commit all transactions) is within a factor of s(s+1)/2 of the time needed by an off-line list scheduler (a known NP-Complete problem, but any list schedule is within a factor of (s + 1) of the optimal [GG75]).

Greedy uses the following three components of a transaction's state:

1. Timestamp. Each transaction is assigned a global timestamp, which it retains in case of abort and retry. A lower timestamp suggests a higher priority.
2. Status. An attribute with a value of either active, committed, or aborted. This attribute is changed via an atomic compare-and-swap operation, either from active to committed or from active to aborted.
3. Waiting. An attribute that indicates whether the transaction waits for another transaction.

Greedy uses the aforementioned components by applying the following simple contention management rules (transaction A wants to access an object held by transaction B):

• If priority B < priority A, or if B is in waiting mode, then A aborts B.
• If priority B > priority A and A is not waiting, then A waits for B to commit, abort, or wait (in which case the first rule applies).

Finally, the authors prove that any on-line contention manager that guarantees that at least one running transaction will execute uninterrupted at any time until it commits (a property named pending commit) is, like Greedy, within a factor of s(s+1)/2 of optimal. The practical evaluation revealed that Greedy is mostly suitable in a low-contention environment.
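A minimal sketch of the two Greedy rules above, written against hypothetical transaction metadata (a timestamp and a waiting flag); the atomic status changes via compare-and-swap are omitted.

#include <stdbool.h>
#include <stdint.h>

typedef struct {
    uint64_t timestamp;   /* assigned once at first start; lower = higher priority */
    bool     waiting;     /* set while the transaction waits for another one       */
} greedy_tx_t;

typedef enum { GREEDY_ABORT_ENEMY, GREEDY_WAIT } greedy_action_t;

/* Transaction 'a' wants an object held by transaction 'b'. */
static greedy_action_t greedy_resolve(const greedy_tx_t *a, const greedy_tx_t *b)
{
    /* Rule 1: abort B if it has lower priority (younger) or is itself waiting. */
    if (b->timestamp > a->timestamp || b->waiting)
        return GREEDY_ABORT_ENEMY;
    /* Rule 2: otherwise A waits until B commits, aborts, or starts waiting. */
    return GREEDY_WAIT;
}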
2.3.4 Transactional Contention Management as a Non-Clairvoyant Scheduling Problem [AEST08]
This paper analyses the performance of contention managers in terms of their competitive ratio compared to an optimal contention manager that knows the resources that each transaction will use. The competitive ratio is the makespan for completing the transactions using the current contention manager divided by the time needed under the optimal manager. They proved that every contention manager having the following two properties:

1. it is work conserving, i.e., it always lets a maximal set of non-conflicting transactions run, and
2. it obeys the pending commit property [GHP05], i.e., at any time, some running transaction will execute uninterrupted until it commits,

is O(s)-competitive, where s is the number of resources. This is an important improvement over the O(s²) competitive ratio that was proved in [GHP05] for the Greedy contention manager (and generally every contention manager having the pending commit property). They then proved that this bound is asymptotically tight, thus any deterministic contention manager is Ω(s)-competitive. Moreover, they showed that if each job fails at most k times, then the Greedy algorithm is O(s)-competitive. Generalizing this, they proved that if each job may fail at most k ≥ 1 times, then any deterministic contention manager has competitive ratio Ω(ks) (under the assumption that the first request of a job for a resource is time dependent).
2.4 Distributed Software Transactional Memory
This section presents several Distributed Software Transactional Memory systems.
2.4.1 Distributed Multi-Versioning (DMV) [MMA06]
DMV stands for Distributed Multi-Versioning and is a distributed concurrency control algorithm. The data are replicated across the nodes of the cluster and the algorithm ensures 1-copy serializability, while using page-level conflict detection. The motivation behind DMV is to take advantage of the multiple versions that appear due to the data replication, instead of using explicit multi-versioning. An update transaction proceeds in the following steps:

1. The writes are deferred until the commit phase.
2. As a pre-commit action, the transaction node broadcasts the differences that it caused on the data set.
3. The receiving nodes do not apply these differences, but buffer them.
4. If a receiving node detects a conflict, the local transaction is aborted. This scheme is livelock prone. In order to avoid this problem, the system uses a system-wide token (a centralization point) that has to be acquired by any transaction that wants to commit.
5. The receiving nodes reply to the sender immediately. The differences are only applied when another transaction requires a newer version of the data.

The goal of DMV is to allow read-only transactions to proceed independently by operating on their own data snapshot. DMV uses a conflict-aware scheduler (it is assumed to know which memory pages will be accessed by every transaction, which is a strong assumption) in order to minimize the conflicts, and a master replica (a potential bottleneck) where all update transactions run.
2.4.2 Sinfonia [AMVK07]
Sinfonia aims to provide developers with the ability to program distributed applications without explicitly using message passing primitives, but rather by designing the needed data structures. It stores the data on memory nodes, providing a linear address space (instead of a virtual global address space; each memory location is accessed by the {node-id, memory-address} tuple) and minitransaction primitives. A minitransaction consists of read, write, and compare items. Every item includes a memory node to be accessed and an address range within that node. The advantage of minitransactions is that, in the best case, they can be started, executed, and committed in two message round-trips. Sinfonia's minitransactions ensure atomicity, consistency, and isolation. The system uses a two-phase commit (2PC) protocol, where an application node has the role of the transaction's coordinator and the memory nodes are the participants. The motivation behind minitransactions is to embed the whole transaction's execution in the first phase of the 2PC.
2.4.3 Cluster-TM [BAC08]
Cluster-TM is a DTM design targeting large-scale clusters. It uses a PGAS (Partitioned Global Address Space) memory model and provides serializability guarantees. For better performance, Cluster-TM uses multi-word data movement, a transactional "on" construct (moving a computation block to another node instead of moving the data, in order to exploit data locality), and software-controlled data caching. The authors also partitioned the TM design decision space into the following eight categories (Cluster-TM was tested with the alternatives explicitly mentioned):
1. Transactional view of the heap: "word-based" or "object-based" (the latter needs one more level of pointer indirection).
2. Read synchronization: read-validation at commit time, or read-locks.
3. Write synchronization: exclusive access, or keeping both the "before" and "after" states of the written location, so that other transactions are able to read the value while it is locked.
4. Recovery mechanism: write buffering, or undo-log.
5. Time of acquire for write: at write time (early acquire), or at commit time (late acquire).
6. Size of the conflict detection unit (CDU): cache lines, objects, or groups of words. Cluster-TM used 2^n words, where n ≥ 0 is an initialization parameter.
7. Progress guarantee: deadlock avoidance, obstruction freedom, or lock freedom.
8. Where the metadata are stored: in program data objects, in transaction descriptors, or inside data structures. Cluster-TM uses some globally stored metadata (one word per CDU) and transaction-local metadata (a transaction descriptor; the metadata concerning some data are stored on the home node of that data).
Several design alternatives were explored and the results suggested that both read-locking and write-buffering provide acceptable performance in a DTM.
2.4.4 DiSTM [KAJ+08]
DiSTM is a framework for prototyping and testing software cache-coherence protocols for DTM. Three protocols are predefined: one decentralized, called Transactional Coherence and Consistency (TCC), and two centralized ones, based on leases. All protocols use object-level conflict detection granularity. TCC allows a transaction to proceed locally and broadcasts the read and write sets as a pre-commit validation action. Although it is presented as decentralized, it uses a global ticket for entering the pre-commit phase, which is a centralization point. The other two protocols use the notion of a lease that has to be acquired before trying to commit. Two alternatives are possible: a system with one global lease, in which case no validation is needed, and a system with multiple leases and validation among the lease holders.
2.4.5 DSM [DD09]
This paper presents a two-phase commit (2PC) algorithm for preserving the transactional consistency model. Every data object permanently resides on its creator's node (the authoritative copy). The algorithm uses version numbers to verify that a transaction has accessed only the latest versions of the objects (first phase of the 2PC). The version number increases when the authoritative copy of the object is changed. Two performance optimization techniques are used: object caching and object prefetching. Object caching locally caches the remote objects accessed, while object prefetching is done along probabilistically calculated paths in the memory heap. The validity of the objects accessed through these techniques is checked by the 2PC protocol.
2.4.6 D2STM [CRCR09]
D2STM is a replicated STM system that remains consistent even in the presence of failures. Full dataset replication is used to achieve both performance and dependability. D2STM builds on top of JVSTM, a multi-version STM that guarantees local execution of read-only transactions, and inherits JVSTM's weak atomicity and opacity guarantees. Regarding consistency, D2STM provides 1-copy serializability. On top of the functionality and guarantees that JVSTM provides, D2STM uses atomic broadcast in order to keep the replicas consistent. Each transaction proceeds autonomously on a node and the atomic broadcast is used to agree on a common transaction serialization order (the commit order). Moreover, the atomic broadcast provides non-blocking guarantees in the presence of failures (of less than half of the replicas). As mentioned before, read-only transactions need no validation and can commit without any remote communication. On the other hand, each update transaction, after executing locally, needs to pass a local (first) and a global (afterwards) conflict validation in order to commit or abort. To reduce the overhead of the distributed validation, D2STM uses a scheme called Bloom Filter Certification (BFC), a novel non-voting certification scheme that exploits a space-efficient Bloom filter-based encoding. The BFC scheme encodes the read set of the transaction and provides a configurable trade-off between the compression ratio and the increase in the risk of a false transaction abort.
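To make the BFC trade-off concrete, the standard Bloom-filter false-positive probability (a textbook formula, not taken from [CRCR09]) indicates how stronger compression increases the chance of a spurious abort:

\[
  p_{\mathrm{false}} \approx \left(1 - e^{-kn/m}\right)^{k}
\]

where m is the number of bits in the filter, n the number of read-set entries encoded, and k the number of hash functions. Shrinking m (i.e., compressing more) raises p_false, the probability that an unrelated write appears to conflict with the encoded read set and causes a false transaction abort.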
2.4.7 On the Design of Contention Managers and Cache-Coherence Protocols for Distributed Transactional Memory [Zha09]
In his PhD dissertation, Zhang focuses on the Greedy Contention Manager and presents several cache-coherence alternatives to combine it with, with the goal of achieving better DTM characteristics. Initially, the Greedy algorithm is combined with a class of location-aware cache-coherence protocols (called LAC) and it is proven that these protocols improve Greedy's performance. The solution is based on a hierarchical tree overlay, similar to the one used in the BALLISTIC protocol [HS05].
Then, a DHT (Distributed Hash Table)-based cache-coherence solution is applied. DHTs have good scalability and load-balancing characteristics, but are mostly designed for immovable data objects. Zhang uses a simple extension to the DHT (a pointer from the normal host of an item to the node that actually holds it) in order to allow object mobility. Finally, he presents a cache-coherence protocol called DHCB, which is based on a quorum system (DHBgrid), in order to allow node joins, departures, and failures.
2.4.8 FTDMT [LDT+10]
FTDMT focuses on providing a DTM with fault-tolerance properties. The DTM runtime replicates each shared object in one backup copy, so in case of failures the object is not lost. Since a single backup copy is kept, an object remains "safe" only if the two nodes holding a copy do not fail "simultaneously" (i.e., the second node does not fail before the recovery process re-establishes a backup of the object). Under this assumption, FTDMT provides atomicity, isolation, and durability. FTDMT also uses the approximately coherent caching and symbolic prefetching techniques presented in DSM [DD09]. It provides object-level granularity and uses a Perfect Failure Detector (PFD) to detect node failures (a very strong assumption, not applicable in an asynchronous distributed system) and a leader election algorithm to select a leader that controls the recovery process in case of a failure (during which the operation of the DTM is halted until the replication is restored). Finally, FTDMT uses optimistic concurrency control with versioning and an adapted version of two-phase commit that facilitates the failure recovery process.
2.4.9 D-TL2 [SR11]
Distributed Transactional Locking II (D-TL2) is a distributed locking algorithm based on the Transactional Locking II (TL2) algorithm [DSS06] (TL2 itself is a non-distributed algorithm). D-TL2 provides opacity and strong progressiveness guarantees. Both TL2 and D-TL2 use versioning, but D-TL2 uses Lamport-like non-global clocks, while TL2 uses a global one. The proposed algorithm is an object-level lock-based algorithm with lazy acquisition and limits broadcasting to just the object identifiers. Transactions are immobile, objects are replicated and detached from any home node, and a single writable copy of each object exists in the network. When a transaction attempts to access an object, a cache-coherence protocol locates the current cached copy of the object in the network (the object-locating mechanism is neither implemented nor described) and moves it to the requesting node's cache. The ownership of an object changes at the successful commit of the object-modifying transaction; at that time, the new owner broadcasts a publish message with the owned object identifier.
2.5 Cache Coherence Protocols for Distributed Transactional Memory
2.5.1 Ballistic [HS05]
Ballistic is a cache-coherence protocol for tracking and moving up-to-date cached objects. It is location aware and works over a deterministic hierarchical overlay tree structure. Ballistic is used as the cache-coherence protocol of a DTM for distributed systems where the communication costs form a metric (a location/cost-aware network). Every node has a TM proxy which is responsible for communicating with the other proxies and for providing the interface to the applications that use the DTM. A typical transaction consists of the following steps:
1. The application starts a transaction.
2. The application opens an object (using the TM proxy). If the object is not local, the Ballistic protocol is used.
3. The application gets the copy of the object from the proxy.
4. The application works with the copy and possibly fetches and updates/reads more objects.
5. When the application wants to commit, the proxy handles the validation.
6. If the transaction can commit, the proxy persists the updates; otherwise the changes are discarded.
Conflicts are detected by a contention manager which applies specific policies in order to avoid deadlocks and live-locks. In this solution, when a remote proxy asks for an object, the object's local proxy checks whether the object is being used by any local transaction and, if it is, applies the policy that the contention manager implements. Generally, reliable communication should be used between the nodes so that no data are lost. The DTM works in an exclusive-write/shared-read mode, keeping only one copy of each object. Ballistic cannot operate properly over non-FIFO links (in the face of message reordering it may get stuck).
2.5.2 Relay [ZR09b]
This paper introduces a DTM based on a cache-coherence protocol called Relay. Relay is based on a distributed queuing protocol, the arrow protocol [Ray89], which works with path reversal over a network spanning tree. Although the arrow protocol guarantees good upper bounds on the locating and moving stretch, it does not take into account the possible contention on an object. Contention can cause several aborts, which in turn make the queue grow longer. Relay delays the pointer reversal until the object has already moved to the new node, reducing the number of aborts by a factor of O(N), where N transactions operate simultaneously on the object. Relay does not operate properly over non-FIFO links (it may route messages incorrectly in case of message reordering).
2.5.3 COMBINE [AGM10]
COMBINE is a directory-based consistency protocol for shared objects. It is designed for large-scale distributed systems with unreliable links. COMBINE operates on an overlay tree whose leaves are the nodes of the system. The overlay tree is similar to the one in the BALLISTIC protocol [HS05], but simpler, since it does not use BALLISTIC's shortcut links. The advantages of COMBINE are its ability to operate over non-FIFO links and to handle concurrent requests without degrading performance. COMBINE provides these characteristics by combining requests that overtake each other while passing through the same node (piggybacking a request's message onto another message). At the same time, COMBINE avoids race conditions while guaranteeing that the cost of a request is proportional to the cost of the shortest path between two nodes.
2.6 Cache Coherence Protocols for Shared Memory
This section presents some cache-coherence protocols designed for shared memory.
2.6.1 An evaluation of directory schemes for cache coherence [ASHH88]
This paper evaluates shared-memory cache-coherence protocols. There are two basic approaches to the problem: snoopy cache schemes and directory schemes. In the former, each cache monitors all shared-memory-related operations in order to determine whether coherence actions should be taken. In the latter, a separate metadata directory (about the state of the blocks of shared memory) is kept. While snoopy protocols use broadcast to disseminate the information, directory-based protocols hold enough information about which caches keep a block, so no broadcast is needed to locate the shared copies. The authors presented and evaluated the following directory-based solutions.
Tang's method [Tan76] allows each memory block to reside in several caches, as long as the copy is not dirty (dirty: the new value is written in the cache but not yet propagated to the shared memory). Only one cache can hold a dirty entry of a block. The protocol takes the following actions:
• On a read-miss, if there is a dirty copy of this block (checked through the directory), it is written back to the shared memory.
• On a write-miss, if there is a dirty copy, it is flushed to the shared memory. If not, all the copies in the caches are invalidated.
• On a write-hit, if the block is already dirty, there is no need for further action. If not, all the other cached copies of the block must be invalidated.
Censier and Feautrier [CF78] proposed a mechanism similar to Tang's. Tang's method duplicates every individual cache directory in the main one; therefore, in order to find where a block resides, all directories must be searched. Censier and Feautrier used a centralized directory with some additional metadata to alleviate this overhead.
Yen and Fu [YYF85] suggested a refinement of the Censier and Feautrier technique. The same central directory is used, but an extra flag designates whether a cache is the only holder of a block. With this flag set, a write to a clean block does not need to search the directory, since the block is not cached anywhere else.
Archibald and Baer [AB84] suggested a broadcast-based solution that does not need to keep any extra metadata. They also used the single-holder technique of Yen and Fu. This solution inherits the scaling problem of snoop-based protocols due to the use of broadcasting.
Finally, the scheme that holds full information about where each block resides is discussed. With this scheme, broadcast is not needed, but the directory's size grows proportionally to the number of processors. The authors suggest a modification of the full-map directory that maintains a fixed amount
of metadata. The directory keeps only one pointer to a cache and a broadcast bit per block. If there is one holder, the pointer points to it; otherwise the broadcast bit is set and broadcasting is used to keep the block coherent.
2.6.2 Directory-Based Cache Coherence in Large-Scale Multiprocessors [CFKA90]
This paper, similarly to [ASHH88], evaluates different directory-based cache-coherence protocols. It presents a categorization of directory protocols according to how the metadata are stored. The different classes are (a structural sketch is given below):
• Full-map directories. The directory keeps full information about where each memory block is cached. No broadcasting is ever needed.
• Limited directories. The directory keeps information about where each memory block is cached up to a fixed limit; for example, it could store a single pointer. If the limit is exceeded, a flag is set and broadcasting is used for invalidation.
• Chained directories. A chain (similar to a linked list) is created, pointing from the memory to the caches that hold a copy of a block. For example, if initially no cache holds a block and caches a and then b read it, the chain created looks like (memory) → b → a → null. Invalidation is achieved by traversing the chain.
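As an illustration of these three organizations, the following C sketch shows per-block directory entries one might keep; the field names and sizes are illustrative assumptions, not structures taken from [CFKA90]:

#include <stdint.h>

#define NUM_CACHES 64

/* Full-map directory entry: one presence bit per cache, plus a dirty bit.  */
typedef struct {
    uint64_t presence;        /* bit i set => cache i holds a copy          */
    int      dirty;           /* a single cache holds a modified copy       */
} fullmap_entry_t;

/* Limited directory entry: a single pointer plus a broadcast flag.         */
typedef struct {
    int holder;               /* id of the only tracked holder, -1 if none  */
    int broadcast;            /* set when more holders exist than tracked   */
} limited_entry_t;

/* Chained directory: memory points to the head of a chain of caches; each  */
/* entry points to the next holder, so invalidation traverses the chain.    */
typedef struct chain_node {
    int                id;    /* cache id holding a copy                    */
    struct chain_node *next;  /* next holder, NULL at the end of the chain  */
} chain_node_t;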
2.6.3 The directory-based cache coherence protocol for the DASH multiprocessor [LLG+90]
DASH is a scalable shared-memory multiprocessor system developed at Stanford. The system uses the DASH distributed directory-based cache-coherence protocol. DASH consists of several processing nodes organized into clusters, each of them holding a portion of the shared memory. The protocol uses point-to-point communication instead of broadcast and is hardware-based. Every processing node holds the portion of the directory metadata that corresponds to the shared memory it physically owns. For each memory block, the directory memory maintains a list of the nodes that possess a cached copy; in this way, point-to-point invalidation can be achieved. The authors recognized three main issues in designing a cache-coherence protocol: memory consistency, deadlock avoidance, and error handling. DASH guarantees release consistency [GLL+90]. Release consistency is an extension of weak consistency, where the memory operations of one node may appear out of order with respect to other processors; the ordering of memory operations is preserved only when completing synchronization or explicit ordering operations. The DASH cache-coherence protocol is an ownership protocol and is based on invalidation. A memory block may be in the (i) uncached-remote, (ii) shared-remote, or (iii) dirty-remote state. Coherence within a processing cluster is guaranteed by a snooping cache-coherence protocol and not by the directory protocol.
2.6.4 Software cache coherence for large scale multiprocessors [KS95]
This paper introduces a software-based cache-coherence protocol. The protocol assumes some hardware support (memory-mapped network interfaces that support a global physical address space), but was adjusted to work purely in software. The protocol allows more than one processor to write to a cached memory page concurrently and provides a variant of release consistency. The protocol uses a distributed, non-replicated full-map data structure (every cached copy of a page is logged in the directory map). A memory page can be in one of the following states: uncached, shared, dirty, or weak (weak: more than one processor has a cached copy of the page and at least one has both read and write access). Each processor holds the portion of the directory map that corresponds to the physical memory it locally possesses. Moreover, each processor keeps a weak list, i.e., a list of the pages that are marked as weak.
3 SC-TM, the Single-chip Cloud TM
3.1 Introduction
This chapter introduces SC-TM, the Single-chip Cloud TM: a Distributed Software Transactional Memory algorithm that allows the programmer to use a transactional interface to easily and efficiently exploit the inherent parallelism that a Many-core processor exposes, while providing strong liveness and safety guarantees. The SC-TM algorithm aims to be a fully decentralized, modular, and scalable DTM system, suitable for Many-core processors, that guarantees opacity and starvation-freedom (the latter provided by the contention manager). The algorithm consists of three distinct functional parts (object locating, distributed locking, and contention management), so different design choices and extensions can easily be adopted. For example, data replication could be achieved with an additional network step by configuring the distributed locking service to publish the writes. The algorithm relies on the message passing support that every future Many-core processor is expected to provide. SC-TM was implemented, tested, and evaluated on Intel's SCC. The SCC experimental processor is a 48-core 'concept vehicle' created by Intel Labs as a platform for Many-core software research; it does not provide any hardware cache coherency, but supports message passing. Section 3.2 describes the assumptions made about the underlying system. Sections 3.3 to 3.6 describe the design of the SC-TM algorithm, while Sections 3.7 and 3.8 present the target platform, Intel's Single-Chip Cloud Computer, and the specifics of porting SC-TM to the SCC. Finally, Section 3.9 describes several problems that emerged during the thesis solely because of the SCC processor.
3.2 System Model
The underlying platform that SC-TM assumes should comply with the following properties. Firstly, regarding the process failure model, SC-TM presumes that processes do not fail. The target system is fully connected (every node can communicate with every other node) and the links are reliable [GR06]; consequently, every message sent is eventually delivered exactly once by the target node. Finally, regarding the timing assumptions, I consider an asynchronous system; I do not make any timing assumptions about the processes or the communication channels. Only in the case of the Offset-Greedy CM (Section 3.5) does SC-TM assume a partially synchronous system model [GR06]. Under partial synchrony, one can define physical time bounds for the system that are respected most of the time.
Figure 3.1: Abstract Architecture of the SC-TM system
3.3 System Design
Figure 3.1 depicts the overall architecture of the SC-TM system. One of the major design goals was modularity: the system should be flexible and adjustable, so that different design choices and extensions can be easily engineered. In order to achieve this, the system is separated into two major parts and several components communicating through well-defined interfaces. The two parts are the Application and the DTM system.
3.3.1 Application part
The Application part is nothing more than the application code that the application programmer has developed, which makes use of the SC-TM system. This part can be as complex as the application programmer wants; from the DTM's point of view it consists of a single component.
3.3.2 DTM part
The DTM part is the one implementing the SC-TM algorithm. It consists of the following components:
• TX Interface
• DS-Lock
• Contention Manager
• Object Locating
Each of the aforementioned components is described in detail in one of the following subsections.
TX Interface
The Transactional (TX) Interface component is the interface that the SC-TM system exports to applications. Using these functions is the only way an application can interact with the DTM system. It includes functions to perform the following operations (a sketch of such an interface is given below):
• Initialize the SC-TM system.
• Finalize the SC-TM system.
• Start a transaction.
• End (try to commit) a transaction.
• Perform a transactional read.
• Perform a transactional write.
• Perform a transactional memory allocation.
• Perform a transactional memory freeing.
For more details on the transactional interface as implemented in the SC-TM system, see Section 3.4.
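A minimal C header sketching what such an interface could look like; apart from txstart, txread, txwrite, txcommit, and txabort (which appear in the pseudo-code later in this chapter), the names and signatures are illustrative assumptions, not the verbatim SC-TM API:

#ifndef SCTM_TX_IF_H
#define SCTM_TX_IF_H

#include <stddef.h>
#include <stdint.h>

/* System lifecycle */
void tm_init(int *argc, char ***argv);   /* initialize the SC-TM system              */
void tm_finalize(void);                  /* finalize the SC-TM system                */

/* Transaction lifecycle */
void txstart(void);                      /* start a transaction                      */
void txcommit(void);                     /* try to commit; abort/retry on conflict   */
void txabort(void);                      /* abort and restart explicitly             */

/* Transactional data accesses (word granularity, illustrative) */
uint32_t txread(volatile uint32_t *obj);
void     txwrite(volatile uint32_t *obj, uint32_t value);

/* Transactional memory management */
void *txmalloc(size_t size);
void  txfree(void *ptr);

#endif /* SCTM_TX_IF_H */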
DS-Lock
The Distributed (DS) Lock component is the heart of the SC-TM system. SC-TM is a lock-based TM system and DS-Lock is responsible for providing a multiple-readers/single-writer locking service. The service is collectively implemented by some, or all, nodes of the system; consequently, each node running a part of the DS-Lock service is responsible for keeping and handling the locking metadata for a partition of the shared memory. In this respect, the DS-Lock service is similar to some directory-based cache-coherence solutions [LLG+90, KS95]. Of course, the locking service is not a simple blocking or try-lock one, but is extended to include the transactional semantics: Read after Write, Write after Read, and Write after Write conflicts. Whenever one of these conflicts is detected, the component invokes the Contention Manager, which is responsible for resolving it. The operations DS-Lock implements are basically the following four:
1. Read-lock acquire
2. Read-lock release
3. Write-lock acquire
4. Write-lock release
Notice that these operations are not explicitly called by the application code. To utilize them, the application calls the transactional read and write wrapper functions, which perform the appropriate message passing in order to trigger the corresponding DS-Lock service operations.
Read-lock acquire
Tries to acquire the read lock corresponding to the input memory object for the input node id. It may return an unsuccessful result because of a Read After Write (RAW) conflict. Figure 3.2 illustrates the pseudo-code for this operation.

/* id:          the id of the node trying to acquire the lock
 * tx_metadata: transactional metadata of node id
 * obj:         the memory object to be locked
 */
dsl_read_lock(id, tx_metadata, obj) {
    enemy_tx = get_writer(obj);
    // if there is a writer and it is not the reading node
    if (enemy_tx != NULL && enemy_tx != id) {
        // Read After Write conflict - the CM handles it
        cm = contention_manager(tx_metadata, enemy_tx, RAW);
        // contention manager aborted the current tx
        if (cm == RAW) {
            return RAW;
        }
    }
    // no writer, or the contention manager aborted the enemy
    add_reader(obj, id);
    return NO_CONFLICT;
}

Figure 3.2: Pseudo-code for the Read-lock acquire (dsl_read_lock) operation.
Read-lock release
Removes the corresponding node from the reader set of the memory object. Figure 3.3 shows the pseudo-code for this operation.

/* id:  the id of the node releasing the lock
 * obj: the memory object to be unlocked
 */
dsl_read_lock_release(id, obj) {
    remove_reader(obj, id);
}

Figure 3.3: Pseudo-code for the Read-lock release (dsl_read_lock_release) operation.
Write-lock acquire
Tries to acquire the write lock corresponding to the input memory object for the input node id. It may return an unsuccessful result because of a Write After Write (WAW) or Write After Read (WAR) conflict. Figure 3.4 presents the pseudo-code for this operation.

Write-lock release
Simply resets the writer of the memory object. Figure 3.5 contains the pseudo-code for this operation.
/* id:          the id of the node trying to acquire the lock
 * tx_metadata: transactional metadata of node id
 * obj:         the memory object to be locked
 */
dsl_write_lock(id, tx_metadata, obj) {
    enemy_tx = get_writer(obj);
    // if there is a writer and it is not the writing node
    if (enemy_tx != NULL && enemy_tx != id) {
        cm1 = contention_manager(tx_metadata, enemy_tx, WAW);
        // contention manager aborted the current tx
        if (cm1 == WAW) {
            return WAW;
        }
    }
    // no writer, or the contention manager aborted the enemy
    // multiple readers may exist for obj
    enemy_tx_list = get_readers(obj);
    if (!is_empty(enemy_tx_list)) {
        cm2 = contention_manager(tx_metadata, enemy_tx_list, WAR);
        // contention manager aborted the current tx
        if (cm2 == WAR) {
            return WAR;
        }
    }
    // no readers, or the contention manager aborted the enemies
    set_writer(obj, id);
    return NO_CONFLICT;
}

Figure 3.4: Pseudo-code for the Write-lock acquire (dsl_write_lock) operation.

/* id:  the id of the node releasing the lock
 * obj: the memory object to be unlocked
 */
dsl_write_lock_release(id, obj) {
    set_writer(obj, NULL);
}

Figure 3.5: Pseudo-code for the Write-lock release (dsl_write_lock_release) operation.
Contention Manager
This component implements a Contention Manager, which is responsible for selecting the appropriate action (i.e., which transaction(s) should be aborted) when a conflict is detected. Generally, changing the CM of the system only requires changing the implementation of this component.
Object Locating
As the name suggests, this component is responsible for locating the memory objects in the Distributed System. Depending on the memory model of the system, locating may have one of the following two
meanings:
i. If a global shared memory is available: locating the node (of the DS-Lock service) that is responsible for handling the locking of this specific memory object.
ii. If no global shared memory is available: locating where in the system the memory object resides, in addition to the functionality of (i) (assuming that the metadata for every memory object are handled by the DS-Lock node that resides where the actual object is located).
3.4 Transactional Operations
This section presents the design decisions taken regarding the transactional read and write operations and the motivation behind them. Moreover, the pseudo-code for each transactional operation is included.
3.4.1 Transactional Read
Transactional Read is the operation used to read a memory object within the context of a transaction. Figure 3.6 contains the pseudo-code describing the steps taken for this operation to complete. Transactional reads work with early lock acquisition and, therefore, the system operates with visible reads. Early lock acquisition means that a transaction has to acquire the read lock before proceeding to the actual read. The visible reads are an outcome of the early acquisition: every transaction is able to "see" the reads of the others because of the read locks. The motivation behind this design decision is twofold. Firstly, a Many-core processor should provide a fast message passing mechanism. Taking this into account, the overhead of performing synchronous read validation is acceptable; on a cluster, on the other hand, the messaging latency is significantly higher, hence such a synchronous solution would be prohibitive. Additionally, visible reads are often cited as problematic for affecting the cache behaviour of the system. In SC-TM this is not the case, because of the use of message passing: the visibility of reads is not implemented by changing some local memory objects (i.e. locks), but by using the locking service to acquire the corresponding locks that reside on another node. Secondly, visible reads allow better contention management. Using multi-versioning "hides" the reads until the commit phase of a transaction and therefore hinders contention management; specifically, without visible reads, Write after Read conflicts cannot be detected and handled by the CM.
3.4.2 Transactional Write
Transactional Write is the operation used to write to a memory object within the context of a transaction. Figure 3.7 contains the pseudo-code describing the write operation. Transactional writes work with late lock acquisition and deferred writes. Every write operation is buffered in a log and, in the commit phase of the transaction, the following steps are taken:
1. The transaction tries to acquire all the corresponding write locks.
2. If successful, the updates are persisted in the shared memory; otherwise the transaction is aborted and restarted.
3. The transaction releases all locks.
I preferred lazy write acquisition for one main reason. If two transactions conflict, one has to be an update transaction; it should be writing on a memory object.
/* id:  the id of the node reading
 * obj: the memory object to be read
 */
txread(obj) {
    // if the memory object is either in the write or the read buffer
    if ((obj_buffered = get_buffered(obj)) != NULL) {
        return obj_buffered;
    }
    // the node that is responsible for obj's locking
    nId = get_responsible_node(obj);
    tx_metadata = get_metadata();
    // similar to an RPC-like call on node nId, but uses message passing
    resp = read_lock(nId, id, tx_metadata, obj);
    // if the read lock was acquired
    if (resp == NO_CONFLICT) {
        value = shmem_read(obj);
        add_read_buffer(obj, value);
        return value;
    }
    // else there was a conflict and the CM aborted the current tx
    else {
        txabort();
    }
}

Figure 3.6: Pseudo-code for the Transactional Read (txread) operation.

txwrite(obj, val) {
    // the write lock will be acquired upon commit
    update_write_buffer(obj, val);
}

Figure 3.7: Pseudo-code for the Transactional Write (txwrite) operation.
Therefore, if a transaction holds a write lock for a long time, it increases the possibility that a conflict (a RAW or WAW conflict) will appear. Lazy write acquisition helps reduce the time the write locks are held, since there is a dedicated phase in which the locks are acquired and then released. Moreover, it allows one to implement write-lock batching: requesting the locks for multiple memory objects in one message.
3.4.3 Transaction Start
Transaction Start is used to initialize a new transaction. Figure 3.8 presents the pseudo-code for this operation. Its purpose is to initialize the necessary metadata and to set the point to which the transaction will return in case of an abort and retry.
txstart() {
    tx_metadata = create_metadata();
}

Figure 3.8: Pseudo-code for the Transaction Start (txstart) operation.
3.4.4 Transaction Commit
Transaction Commit is used to (try to) commit an active transaction. Due to the late write-lock acquisition, this operation needs to try to acquire all the corresponding write locks, persist the changes, and finally release all locks and update the transactional metadata. Figure 3.9 presents the pseudo-code for this operation.
/* id:            the node id
 * write_buffer:  the non-locked writes
 * writes_locked: the locked writes
 * read_buffer:   contains the read-locked items
 */
txcommit() {
    tx_metadata = get_metadata();
    // while more non-locked updated memory objects exist
    while ((item = get_item(write_buffer)) != NULL) {
        // the node that is responsible for obj's locking
        nId = get_responsible_node(item.obj);
        // similar to an RPC-like call on node nId, but using message passing
        resp = write_lock(nId, id, tx_metadata, item.obj);
        // the CM aborted the current tx
        if (resp != NO_CONFLICT) {
            txabort();
        }
        // there was no conflict, or the CM aborted the enemy transactions;
        // add the item to the list of write-locked memory objects
        append(item, writes_locked);
    }
    // persist the write-set to the memory
    while ((item = get_item(writes_locked)) != NULL)
        shmem_write(item.obj, item.val);
    // release all read and write locks
    wlock_release_all(id, writes_locked);
    rlock_release_all(id, read_buffer);
    update_metadata(tx_metadata);
}

Figure 3.9: Pseudo-code for the Transaction Commit (txcommit) operation.
3.4.5 Transaction Abort
Transaction Abort aborts and restarts an active transaction. This operation is called if the DS-Lock service returns an unsuccessful response to a lock request or if the Contention Manager forcibly aborts the transaction. Figure 3.10 contains the pseudo-code for Transaction Abort.

/* id:            the node id
 * writes_locked: the locked writes
 * read_buffer:   contains the read-locked items
 */
txabort() {
    tx_metadata = get_metadata();
    // release all read and write locks
    wlock_release_all(id, writes_locked);
    rlock_release_all(id, read_buffer);
    update_metadata(tx_metadata);
    // restart the transaction
    txrestart(tx_metadata);
}

Figure 3.10: Pseudo-code for the Transaction Abort (txabort) operation.
3.5 Contention Management
Contention Management has been extensively studied for STM systems, but it has barely been used in DTM systems. The reason is the difficulty of implementing contention management in a Distributed System. Visible reads are necessary for implementing a fully functional CM; because of the high messaging latency, DTM systems targeting Distributed Systems such as clusters operate with invisible reads for performance reasons. The promise of fast message passing on Many-cores allows the SC-TM design to incorporate visible reads. Apart from that, there are some other inherent difficulties in implementing contention management in a Distributed System. First of all, in a DTM the CM should be completely decentralized and able to take decisions based on local-only information; it is impossible for a single node to have a global view of the transactions. This poses strict limitations on what information is available to a CM. Using a fully asynchronous design (non-blocking transactional reads and writes) also limits the capability for contention management, because the CM node may have outdated data, which in turn may lead to inconsistent views among the different CM nodes. In such a case, the system is not able to guarantee liveness. Secondly, the CM should be able to operate properly over asynchronous message passing and tolerate possible message reordering. This is essential because otherwise the CM may take mistaken decisions; for example, the CM may try to abort a transaction that has already committed, because the CM node has not yet been notified about the completion. The system has to handle such erroneous cases, or else the wrong transaction may be aborted, putting the liveness of the system at risk. Providing starvation-freedom was one of the primary design goals of the SC-TM system; as mentioned before, the CM module is responsible for guaranteeing the system's liveness. Four contention management policies were implemented in SC-TM: Back-off and Retry, Offset-Greedy, Wholly, and FairCM.
3.5.1 Back-off and Retry
The simplest possible approach is SC-TM without an actual CM. The policy applied is called back-off and retry: if a read/write lock request returns a conflict, the requesting transaction sleeps and then retries. This sleep-retry procedure happens up to N times, where N is a system parameter; if the conflict persists after the last try, the transaction aborts. Every time a back-off is performed the sleeping time is increased. Using SC-TM with this CM policy is live-lock prone. A sketch of the policy is given below.
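A minimal C sketch of the back-off and retry policy; request_lock(), txabort(), and the constants are illustrative assumptions, not SC-TM's actual implementation:

#include <unistd.h>                     /* usleep */

#define MAX_TRIES     8                 /* the system parameter N              */
#define BASE_SLEEP_US 50                /* initial back-off duration (us)      */

/* Assumed SC-TM operations, declared here only for completeness. */
extern int  request_lock(void *obj);    /* returns 0 on NO_CONFLICT            */
extern void txabort(void);              /* abort and restart the transaction   */

/* Try to acquire a lock on obj; back off and retry on conflict. */
int lock_with_backoff(void *obj)
{
    unsigned sleep_us = BASE_SLEEP_US;

    for (int attempt = 0; attempt < MAX_TRIES; attempt++) {
        if (request_lock(obj) == 0)
            return 0;                   /* lock acquired                       */
        usleep(sleep_us);               /* back off before retrying            */
        sleep_us *= 2;                  /* increase the sleeping time each try */
    }
    txabort();                          /* still conflicting after N tries     */
    return -1;
}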
3.5.2 Offset-Greedy
Greedy [GHP05] is a simple CM which guarantees that every transaction eventually commits, and therefore provides starvation-freedom. Greedy uses timestamps and, in case of conflict, the youngest conflicting transactions are aborted in favour of the oldest one. One problem with Greedy in a Distributed System is the lack of a global clock: different nodes of the system have no way of taking coherent timestamps. For this reason, I introduced Offset-Greedy, which uses timestamp estimation based on time offsets.
Figure 3.11: Offset-based timestamps calculation.
Steps
Figure 3.11 depicts how the offset-based timestamp calculation works. The following steps are taken (a sketch in C follows the list):
1. The transaction uses the node's local clock in order to calculate the time offset since the transaction started.
2. The transaction sends the request to the responsible DS-Lock node, piggybacking the offset calculated in step 1.
3. The DS-Lock node uses the offset from the request and its own local clock to estimate the timestamp of the transaction according to its own local clock.
4. The request is processed normally.
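The following C sketch illustrates the idea under the stated partial-synchrony assumption; now_us(), the message layout, and the messaging plumbing are hypothetical placeholders, not the actual SC-TM code:

#include <stdint.h>
#include <stddef.h>
#include <sys/time.h>

/* Hypothetical local clock in microseconds. */
static uint64_t now_us(void)
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return (uint64_t)tv.tv_sec * 1000000u + tv.tv_usec;
}

/* --- On the application node (steps 1 and 2) --- */
typedef struct {
    uint64_t offset_us;   /* time elapsed since the transaction started        */
    /* ... rest of the lock request (node id, object, lock type) ...           */
} lock_request_t;

void fill_request(lock_request_t *req, uint64_t tx_start_local)
{
    req->offset_us = now_us() - tx_start_local;  /* step 1                     */
    /* step 2: send req to the responsible DS-Lock node (not shown)            */
}

/* --- On the DS-Lock node (step 3) --- */
uint64_t estimate_timestamp(const lock_request_t *req)
{
    /* The transaction is estimated to have started offset_us ago, expressed
     * in the DS-Lock node's own clock. Message transmission time is ignored,
     * which is the source of the inaccuracy discussed in the text.            */
    return now_us() - req->offset_us;
}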
Accuracy
The described technique does not take the message transmission time into account when calculating the offset. Due to the load of the DS-Lock nodes, one message may need more time to be delivered than another. Therefore, two different nodes may have a contradicting view of the timestamps of two concurrent transactions.

                 TX1     TX2
DS-Lock Node 1   0.20    0.30
DS-Lock Node 2   0.27    0.23

Figure 3.12: Offset-Greedy – Contradicting views of the timestamps of transactions 1 and 2. Node 1 believes transaction 1 has higher priority than transaction 2, while Node 2 believes the opposite.
Figure 3.12 depicts such a problematic case. If two conflicts related to TX1 and TX2 occur, one involving Node 1 and the other Node 2, both transactions will be aborted due to the inconsistency. Consecutive such inaccuracies could potentially lead to a live-lock. In practice, this problem did not occur and Offset-Greedy worked as intended. Even if the aforementioned case emerges due to heavy messaging load, the system is expected to stabilize quickly and the inconsistencies to disappear.
3.5.3 Wholly
This is a naive CM that guarantees that the system progresses as a whole. The conflict resolution is based on the number of transactions that each application node has completed: upon a conflict, the node that has committed the most transactions is aborted. If two nodes have the same number of completed transactions, one of them is statically selected to be aborted (for example, the one with the lower node id); a sketch of this rule is given below. Wholly guarantees the following two properties:
• Upon a conflict, or several simultaneous conflicts, there is at least one transaction that will not be aborted.
• If the system runs for an infinite time, every node will commit an infinite number of transactions (under the assumption that every node starts an infinite number of finite-sized transactions).
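A minimal sketch of the Wholly decision rule in C, assuming each lock request carries the sender's committed-transaction count; the names and the metadata layout are illustrative, not the actual SC-TM structures:

#include <stdint.h>

typedef struct {
    int      node_id;     /* application node id                   */
    uint64_t committed;   /* transactions this node has committed  */
} tx_meta_t;

/* Returns the node id of the transaction to abort. The node with MORE
 * completed transactions is aborted, so that lagging nodes make progress;
 * ties are broken statically (e.g. the lower node id is aborted). */
int wholly_victim(const tx_meta_t *a, const tx_meta_t *b)
{
    if (a->committed > b->committed)
        return a->node_id;
    if (b->committed > a->committed)
        return b->node_id;
    return (a->node_id < b->node_id) ? a->node_id : b->node_id;
}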
3.5.4 FairCM
FairCM has the same properties as Wholly, but it is also fair regarding the effective transactional time of each node. Instead of using the number of committed transactions, FairCM uses the cumulative time spent on successful transaction tries. So, if a transaction proceeds as follows:

Start → Abort 1 → Restart 1 → Abort 2 → Restart 2 → Commit

only the duration from Restart 2 to Commit is added to the cumulative time. Upon a conflict, the transaction of the node with the smaller cumulative time has the higher priority. FairCM has the advantage that short transactions are promoted over longer ones, because a completed long transaction "costs" more than a short one. As Section 4.5 of the Evaluation chapter makes clear, this characteristic may prove very important for the performance of the system in certain cases; one such case is when some nodes tend to run long, conflict-prone transactions. If the CM does not provide fairness, these nodes degrade the overall throughput of the system. A sketch of the bookkeeping and the decision rule follows.
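A minimal sketch of FairCM's bookkeeping and conflict resolution in C, under the assumptions stated above; now_us() and the metadata layout are hypothetical, not the actual SC-TM implementation:

#include <stdint.h>

/* Hypothetical per-node FairCM metadata. */
typedef struct {
    int      node_id;
    uint64_t effective_us;    /* cumulative time of successful tries only */
    uint64_t last_restart_us; /* local time of the latest (re)start       */
} faircm_meta_t;

extern uint64_t now_us(void);  /* assumed local clock helper */

/* Called on every transaction start/restart: aborted tries are not counted. */
void faircm_on_restart(faircm_meta_t *m) { m->last_restart_us = now_us(); }

/* Called on commit: only the last (successful) try is accounted for. */
void faircm_on_commit(faircm_meta_t *m)
{
    m->effective_us += now_us() - m->last_restart_us;
}

/* Conflict resolution: the node with the SMALLER effective time has higher
 * priority, so the other one is the victim; ties broken statically. */
int faircm_victim(const faircm_meta_t *a, const faircm_meta_t *b)
{
    if (a->effective_us < b->effective_us) return b->node_id;
    if (b->effective_us < a->effective_us) return a->node_id;
    return (a->node_id < b->node_id) ? a->node_id : b->node_id;
}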
3.6 Elastic Model
Elastic transactions are a variant of the transactional model that is particularly appealing when implementing search structures. Upon conflict detection, an elastic transaction might drop what it did so far within a separate transaction that immediately commits, and initiate a new transaction which might itself be elastic. Elastic transactions are a complementary alternative to traditional transactions: they can be safely composed with normal ones, but can significantly improve performance if used instead [FGG09]. Elastic transactions are elastic opaque.
Elastic-opacity
In order to define elastic-opacity, I first need to define what a cut and a consistent cut are. If H is the history of transactional events, then:
Cut: a sequence C of sub-histories of H such that (i) each sub-history of the cut contains only consecutive operations of H, (ii) if one sub-history precedes another in C, then the operations of the first precede the operations of the second in H, and (iii) any operation of H is in exactly one sub-history of the cut.
Consistent Cut of a history H: a cut such that there are no writes separating two of its sub-histories that each access one of the objects written by these writes.
Intuitively, a system is elastic opaque if there exist some consistent cuts of its elastic transactions such that: (i) the transactions resulting from these cuts and the regular transactions always access a consistent state of the system (even if they are pending or aborted), (ii) they look like they were executed sequentially, and (iii) this sequential execution satisfies the real-time precedence of non-concurrent transactions and is legal.
3.6.1 Elastic Model Implementation
The paper which introduced elastic transactions [FGG09] describes them as transactions whose size may vary depending on the conflicts. Although this is one way of implementing the elastic model, it is not the only one. Using SC-TM, I implemented a linked-list data structure (see Section 4.3) that exports elastic-opaque transactional operations (node search, node add, node remove) using two different approaches.
Early-release
The early-release transactional operation was first implemented in the DSTM system [HLMS03]. Early-release allows the application programmer to explicitly release a lock at any point of a transaction. If the operation is not used properly, the TM may violate its (safety) guarantees; on the contrary, proper use of early-release may lead to significant performance benefits, since the duration for which some locks are held can be decreased. I used early-release to implement elastic-opaque operations on the linked-list example. Every operation needs to perform a search (traversal) of the data structure. Using early-release, we discard the locks on the nodes that are no longer relevant to the list search. For example, consider the following list:
head → node 1 → node 2 → node 3 → node 4 → tail
When a transaction searching this list reaches node 3 (if it ever does), node 1 is no longer relevant to the search: even if it is modified by a concurrent transaction, the search will not be semantically affected. Using early-release we get rid of these unnecessary dependencies and thus implement the elastic model; a sketch of such a search is given below.
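A sketch of a list search using early-release, written in the same pseudo-code style as the earlier figures; txread_and_lock and txrelease are illustrative names for the transactional read and the early-release operation, not the exact SC-TM calls:

/* Searches the sorted linked list for 'key' inside a transaction.
 * After stepping past a node, the lock on the node two positions behind
 * is released early: it is no longer relevant to the search. */
list_search(key) {
    before = NULL;
    prev   = txread_and_lock(head);        // read-lock the head
    curr   = txread_and_lock(prev->next);
    while (curr != tail && curr->key < key) {
        if (before != NULL) {
            txrelease(before);             // early-release the stale read lock
        }
        before = prev;
        prev   = curr;
        curr   = txread_and_lock(curr->next);
    }
    // only 'prev' and 'curr' remain read-locked at this point
    return (curr != tail && curr->key == key) ? curr : NULL;
}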
Read-validation
An alternative, but closely related, implementation of the elastic model on a linked list was done using read-validation. Read-validation takes early-release one step further: the read locks related to the list search are not even acquired; instead, proper validation of the read values is performed. This technique relies on the fact that if a concurrent transaction commits an update, the update will be visible to a read validation, because the committed transaction can only write new/different values to the altered fields. For the search operation, it is important to validate the previous node, node n, after stepping to node n + 1. In our example

head → node 1 → node 2 → node 3 → node 4 → tail
if the search is currently on node 3 and wants to proceed to node 4, the following steps have to be taken (see the sketch after this list):
1. Load node 4.
2. Validate node 3.
3. If node 3 did not change, proceed normally; otherwise the transaction has to be aborted.
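A sketch of this stepping rule in the same pseudo-code style; txread_novalidate and the comparison logic are illustrative helpers (the former reads without taking a read lock), not the exact SC-TM interface:

/* Advance the search from 'curr' (= node n) to its successor (= node n+1)
 * without holding read locks, validating node n afterwards. */
step(curr) {
    snapshot = txread_novalidate(curr);        // remember node n as first read
    next     = txread_novalidate(curr->next);  // 1. load node n+1
    check    = txread_novalidate(curr);        // 2. re-read (validate) node n
    if (check != snapshot) {                   // 3. node n changed in between
        txabort();                             //    abort and restart
    }
    return next;                               //    otherwise proceed normally
}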
3.7 Target Platform
3.7.1 Single-Chip Cloud Computer (SCC)
The Single-Chip Cloud Computer (SCC) experimental processor [HDH+10] is a 48-core 'concept vehicle' created by Intel Labs as a platform for many-core software research. From the Transactional Memory perspective, the SCC has two interesting characteristics. Firstly, it provides a shared memory that is visible from every core. Secondly, it does not support any hardware cache coherency, so memory consistency has to be guaranteed in software.
3.7.2 SCC Hardware Overview
The SCC processor consists of 48 general-purpose x86 cores on a single die. Figure 3.13 summarizes the processor's layout. The cores are distributed across 24 tiles (each tile contains two P54C processor cores) organized into a six-by-four 2D mesh [WMH11]. Each core has 16 kB of L1 data cache (L1$) and 16 kB of L1 instruction cache. The tile provides a separate unified L2 cache for each core, as well as two globally accessible test-and-set registers. These are atomic read-modify-write registers that can be in one of two states, set and unset. They are used to build more complex atomic operations between cores (e.g. locks), and are needed since the conventional x86 atomic instructions are not supported on the SCC. In addition, each tile has 16 kB of SRAM to support a shared address space visible to all cores, called the Message Passing Buffer (MPB). While created to support message passing, it is important to appreciate that this is a general-purpose shared address space. The cache controllers and the MPB connect to the router through a Mesh Interface (I/F) unit. A peculiarity of the P54C core is that cache lines are allocated on read, not on write; a write combine buffer holds write data until a cache line is full, or until data on another line is written.
Figure 3.13: SCC Processor layout. Adapted from [Hel10], page 8.
The SCC processor includes four DDR3 Memory Controllers (MCs) at the corners of the mesh, and an extension of the on-die network to a unit that translates router traffic to support the PCI interface.
3.7.3 SCC Memory Hierarchy
The SCC processor provides three distinct address spaces (Figure 3.14): private DRAM, shared DRAM, and the Message Passing Buffer (MPB) in on-chip SRAM.
Figure 3.14: SCC Memory spaces. Adapted from [Hel10], page 52.
Private DRAM
Each core has its own private memory, accessed by moving across the network to one of the four memory controllers and finally to the off-chip DRAM. This memory is normally cached in the L1 and L2 caches.
Shared DRAM
The same DRAM that is used as private memory can also be configured (in part) as shared memory, but as with all shared memory on the SCC processor, there is no built-in support for cache coherence among cores. Coherency, if it exists at all, is the responsibility of the software.
Message Passing Buffer (MPB)
The final address space is the MPB. With 16 kB per tile, the amount is sufficient for moving cache lines between cores, but not for building larger persistent data structures to support a typical program; this is why this buffer is referred to as the Message Passing Buffer. As described in [MRL+10], software
must explicitly maintain coherence between cores with regard to data placed in the MPB. This is done at the granularity of L1 cache lines, while bypassing the L2 cache altogether. These lines are marked as Message Passing Buffer Type (MPBT), an additional data type explicitly added to the SCC. Moreover, an instruction was added to the P54C core to mark all cache lines of MPBT type as invalid, in order to force an update from or to the MPB. This instruction is sufficient to implement a consistency protocol for the data in the MPB with respect to multiple cores in the processor [MRL+10].
3.7.4 SCC Programmability
The SCC can be used in two different modes: Baremetal and Linux. In Baremetal mode, nothing runs on the processor apart from the application that the user loads. In Linux mode, each core runs an independent instance of a custom-made version of the GNU/Linux operating system. The latter mode is by far the most popular, because it is significantly easier to program. Finally, Intel provides a functional emulator of the SCC, built on top of the OpenMP API [ope], in order to allow researchers to experiment with SCC programming even without access to the actual hardware. The SCC combined with the RCCE library (presented below) is designed to be used with a Single Program Multiple Data (SPMD) programming model: every core runs an instance of the same program.
RCCE Library
RCCE is the message passing programming model provided with the SCC. RCCE is a small library for message passing tuned to the needs of many-core chips such as the SCC [WMH11]. RCCE provides:
• a basic interface: a higher-level interface for the typical application;
• a gory interface: a low-level interface for expert programmers;
• a power management API to support SCC research on power-aware applications.
RCCE runs on the SCC chip, as well as on top of a functional emulator that runs on a Linux or Windows platform that supports OpenMP. Figures 3.16 to 3.20 present the RCCE API. RCCE aims to provide fast and light-weight message passing between cores; Figure 3.15 depicts the messaging latency on the SCC using RCCE. Message passing with RCCE is both blocking and deterministic.
Blocking: RCCE provides only synchronous messaging operations.
Deterministic: Every outgoing message request has to be matched with an incoming message request by the target of the former (and vice-versa).
Figure 3.15: Round-trip latency for a 32-byte message on the SCC. The measurements were taken using the ping-pong example application, which is included in the RCCE release.
int RCCE_init(int *, char ***)
int RCCE_finalize(void)
int RCCE_num_ues(void)
int RCCE_ue(void)
int RCCE_debug_set(int)
int RCCE_debug_unset(int)
int RCCE_error_string(int, char *, int *)
int RCCE_wtime(void)
int RCCE_comm_rank(RCCE_COMM, int *)
int RCCE_comm_size(RCCE_COMM, int *)
int RCCE_comm_split(int (*)(int, void *), void *, RCCE_COMM *)

Figure 3.16: RCCE API – Core utilities
volatile char *RCCE_shmalloc(size_t)
void RCCE_shfree(volatile char *)
void RCCE_shflush()
// only available on GORY mode
volatile char *RCCE_malloc(size_t)
void RCCE_free(volatile char *)
int RCCE_flag_alloc(RCCE_FLAG *)
int RCCE_flag_free(RCCE_FLAG *)

Figure 3.17: RCCE API – Memory Management functions
// only available on non-GORY mode
int RCCE_send(char *, size_t, int)
int RCCE_recv(char *, size_t, int)
int RCCE_recv_test(char *, size_t, int, int *)
int RCCE_reduce(char *, char *, int, int, int, int, RCCE_COMM)
int RCCE_allreduce(char *, char *, int, int, int, RCCE_COMM)
int RCCE_bcast(char *, int, int, RCCE_COMM)
int RCCE_comm_split(int (*color)(int, void *), void *aux, RCCE_COMM *comm)
// only available on GORY mode
int RCCE_put(volatile char *, volatile char *, int, int)
int RCCE_get(volatile char *, volatile char *, int, int)
int RCCE_flag_write(RCCE_FLAG *, RCCE_FLAG_STATUS, int)
int RCCE_flag_read(RCCE_FLAG, RCCE_FLAG_STATUS *, int)

Figure 3.18: RCCE API – Communication
void RCCE_barrier(RCCE_COMM *)
void RCCE_fence(void)
// only available on GORY mode
int RCCE_wait_until(RCCE_FLAG, RCCE_FLAG_STATUS)

Figure 3.19: RCCE API – Synchronization
// only available in GORY mode
int RCCE_power_domain(void)
int RCCE_power_domain_master(void)
int RCCE_power_domain_size(void)
int RCCE_istep_power(int, RCCE_REQUEST *)
int RCCE_wait_power(RCCE_REQUEST *)
int RCCE_step_frequency(int)

Figure 3.20: RCCE API – Power Management
iRCCE Library  iRCCE is an extension of the RCCE library that provides asynchronous message-passing functions [irc]. iRCCE uses message queues in order to allow multiple outstanding messaging requests.
Non-Deterministic Asynchronous Communication  In the SC-TM algorithm the communication pattern is not predefined, meaning that the send and receive requests cannot simply be coupled, since at any time a message may arrive from any other core. In order to bypass this limitation, the pending request queues of iRCCE were used to implement the communication as follows. Every node performs the following steps in a loop (sketched in code after the list):
• Keeps a pending receive request for every other node in a wait-list and checks it for incoming messages.
• On an asynchronous send, adds the send request (if it does not complete immediately) to a send-list.
• Checks the send-list for send requests that can be completed; these checks are necessary, otherwise the send requests would never complete.
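The sketch below illustrates these three steps. The iRCCE types and calls (iRCCE_irecv, iRCCE_irecv_test, iRCCE_isend, iRCCE_isend_test, iRCCE_SUCCESS) follow the iRCCE interface as I recall it from the manual and should be treated as assumptions; handle_message() and the fixed-size send-list are placeholders and do not come from the SC-TM code.

#include <stddef.h>
#include "RCCE.h"
#include "iRCCE.h"

#define MSG_SIZE  32      /* illustrative fixed message size         */
#define MAX_SENDS 64      /* illustrative bound on outstanding sends */
#define MAX_CORES 48      /* the SCC has 48 cores                    */

void handle_message(char *msg, int from);   /* application-specific placeholder */

static char               recv_buf[MAX_CORES][MSG_SIZE];
static iRCCE_RECV_REQUEST recv_req[MAX_CORES];
static iRCCE_SEND_REQUEST send_req[MAX_SENDS];
static int                send_used[MAX_SENDS];

/* step 2: asynchronous send; if it does not complete immediately, park it */
void async_send(char *msg, int to)
{
    int i;
    for (i = 0; i < MAX_SENDS && send_used[i]; i++)
        ;
    if (i == MAX_SENDS)
        return;                               /* send-list full (not handled here) */
    if (iRCCE_isend(msg, MSG_SIZE, to, &send_req[i]) != iRCCE_SUCCESS)
        send_used[i] = 1;                     /* still pending: remember it */
}

void communication_loop(void)
{
    int me = RCCE_ue(), cores = RCCE_num_ues(), i, done;

    /* step 1: keep one pending receive request per remote core */
    for (i = 0; i < cores; i++)
        if (i != me)
            iRCCE_irecv(recv_buf[i], MSG_SIZE, i, &recv_req[i]);

    for (;;) {
        /* check every pending receive for an incoming message */
        for (i = 0; i < cores; i++) {
            if (i == me)
                continue;
            iRCCE_irecv_test(&recv_req[i], &done);
            if (done) {
                handle_message(recv_buf[i], i);
                iRCCE_irecv(recv_buf[i], MSG_SIZE, i, &recv_req[i]);  /* re-arm */
            }
        }
        /* step 3: push the parked sends so they eventually complete */
        for (i = 0; i < MAX_SENDS; i++) {
            if (!send_used[i])
                continue;
            iRCCE_isend_test(&send_req[i], &done);
            if (done)
                send_used[i] = 0;
        }
    }
}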
3.8 Implementation
This section describes some implementation details and decisions that proved to be important for the SC-TM system. Typical STM systems make use of the shared memory, the hardware cache-coherency, and the atomic operations of the underlying hardware for keeping the necessary metadata of the TM system. For example, a lock-based STM stores some locks in the shared memory and performs atomic accesses on them. Of course, this corresponds to a centralized solution. On a Distributed System, and especially on a non-coherent Many-core system, this solution either cannot work or, even if it works, will not be scalable. Scalability necessitates decentralization, and decentralization suggests the use of message passing. Accordingly, the DTM should operate on top of message passing and can therefore be seen as a service provided by a number of, or all, the system nodes. However, the application code, which utilizes the DTM, should also run on the same platform, so an interesting design decision appears; how should the DTM service and the application execution be allocated on the Many-core system's cores? There are two possible answers to this question.
3.8.1 Multitasking
One can use multitasking to allow both the DTM system and the application code to execute on every core. There are three different approaches to multitasking.
System-level Scheduling  The DTM system and the application code run as two separate processes. In a Unix system this can be achieved using the fork operation.
Kernel-assisted Scheduling  The DTM system and the application code run as two threads using the POSIX Thread (pthread) library [pth] or some other kernel-supported multitasking library. Although pthreads reside in user space, the scheduling is assisted by the kernel.
User-level Scheduling  The DTM system and the application code run as different tasks and the scheduling is done entirely in user space. The Libtask library [lib] provides such functionality.

The initial design considered every core as both an Application and a DS-Lock Service node and used multitasking to allow both to run on the same core. Figure 3.21 illustrates the allocation of the application and the locking services on the 48 cores of the SCC. Two different multitasking (i.e., user-space scheduling) libraries were used to implement this design.
Figure 3.21: The allocation of Application and DS-Lock service parts on SCC’s 48 cores.
POSIX threads  In shared memory multiprocessor architectures, such as SMPs, threads can be used to implement parallelism. Historically, hardware vendors have implemented their own proprietary versions of threads, making portability a concern for software developers. For UNIX systems, a standardized C language threads programming interface has been specified by the IEEE POSIX 1003.1c standard. Implementations that adhere to this standard are referred to as POSIX threads, or Pthreads [pth]. Pthreads provide kernel-assisted scheduling. My initial implementation utilized pthreads for multitasking, but due to performance issues I soon switched to libtask. An explanation of the performance problems can be found in Section 3.9.
Libtask  Libtask is a simple coroutine library. Libtask gives the programmer the illusion of threads, but the operating system sees only a single kernel thread. For clarity, the coroutines are referred to as "tasks", not threads. Scheduling is cooperative; only one task runs at a time and it cannot be rescheduled without explicitly giving up the Central Processing Unit (CPU) [lib]. Libtask is more lightweight than pthreads and was the library used for SC-TM's multitasking design. Figure 3.22 depicts how the context switch between the two tasks happens. The Application task either explicitly yields the execution or sleeps waiting for a transactional operation to complete. The DS-Lock task iterates over the message queues; if there is a message to serve, it serves it, else it yields the execution. The multitasking design has a very important limitation; the scheduling of Core n can potentially affect the execution of Core m, where n ≠ m. One such case is represented in Figure 3.23. Core m is executing some non-transactional code, while Node n tries to execute a transactional request that involves the DS-Lock Service on Node m. The request will not be served until m completes the local computation. Obviously, this waiting time is added to the latency of the transactional operation.
Figure 3.22: Activity diagram of the multitasking between the Application and the DS-Lock Service on a single core.
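The following sketch expresses the cooperative structure of Figure 3.22 with libtask. The calls taskcreate, taskyield, and the taskmain entry point follow the library's documentation as I recall it, while pending_message(), serve_one_message(), and run_application() are hypothetical placeholders for the SC-TM code.

#include "task.h"

enum { STACK = 32 * 1024 };

int  pending_message(void);      /* placeholder: is there a request in the MPB?  */
void serve_one_message(void);    /* placeholder: DS-Lock service logic           */
void run_application(void);      /* placeholder: transactional application code  */

/* DS-Lock service task: serve a message if one is pending, else yield. */
static void dslock_task(void *arg)
{
    (void) arg;
    for (;;) {
        if (pending_message())
            serve_one_message();
        else
            taskyield();         /* give the CPU back to the application task */
    }
}

/* libtask replaces main() with taskmain(); the initial task becomes the
 * application task and yields inside run_application() whenever it waits
 * for a transactional operation to complete. */
void taskmain(int argc, char **argv)
{
    (void) argc; (void) argv;
    taskcreate(dslock_task, NULL, STACK);
    run_application();
}

Because scheduling is purely cooperative, the DS-Lock task only runs when the application task explicitly gives up the CPU, which is precisely the dependency discussed next.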
3.8.2 Dedicated Cores
In the coming years, Many-core systems are expected to grow in the number of cores they comprise. Therefore, it may be advisable to dedicate some cores to a specialized function. In our case, this is implemented by allocating a certain percentage of the cores to providing the DTM service, while the remaining cores run the application code. This solution seems more natural for a Many-core system, because it exploits the true parallelism of the system. Moreover, it simplifies the design, since there is a clear separation between the service and the application. In order to avoid the dependencies described for multitasking, I used the dedicated-cores approach to engineer a second version of SC-TM. Every core is either an Application Core or a DS-Lock Core. An Application Core runs only the application code and utilizes the DTM system. A DS-Lock Core runs only the DS-Lock Service code; the DS-Lock Cores therefore collectively implement the SC-TM algorithm. Figure 3.24 illustrates this design. The ratio of dedicated DS-Lock nodes to Application nodes is a system parameter, but practical evaluation showed that a one-to-one ratio performs best.
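A minimal sketch of this role split is shown below; the even/odd assignment of roles to core ids is an assumption made for illustration (the actual mapping of Figure 3.24 may differ), and dslock_core_loop() and application_main() are hypothetical entry points.

#include "RCCE.h"

void dslock_core_loop(void);     /* runs only the DS-Lock service code          */
void application_main(void);     /* runs only the application + TM library code */

int main(int argc, char **argv)
{
    RCCE_init(&argc, &argv);

    if (RCCE_ue() % 2 == 0)      /* e.g. even core ids: DS-Lock service cores   */
        dslock_core_loop();
    else                         /* odd core ids: application cores             */
        application_main();

    RCCE_finalize();
    return 0;
}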
Figure 3.23: Multitasking – An example where the scheduling of Core m affects the execution of Core n.
Figure 3.24: The allocation of the Application and the Dedicated DS-Lock service on the 48 cores of SCC. This allocation uses one DS-Lock Core for one Application Core.
3.9 SCC-Related Problems

3.9.1 Programming model
The SCC combined with the RCCE library were designed to be used with the SPMD programming model. In the SPMD model the same program is loaded on every core and data parallelism is used. In a data-parallel computation, parallelism is achieved by performing the same operation on different items of data concurrently [LS08]; the amount of parallelism grows with the size of the data. A distributed system commonly makes use of the notion of a service. A service is a functionality that some, or all, nodes of a distributed system collectively provide. For example, the present DTM system uses a distributed-locking service which is jointly implemented by all the participating cores. This service has to run concurrently with the main application code, which is impossible if the SPMD model is followed strictly. In order to provide such functionality, and still use RCCE, one has to use a user-level scheduling library. Initially, I tried using the pthread (POSIX Threads; a POSIX standard for threads) library, but it performed poorly. In some cases the messaging latency increased around 1000 times compared to that of a single-threaded application. Therefore, I finally used the libtask library for the user-level scheduling. With libtask the performance degradation was limited to the scale of 10; the latency is around 10 times higher than without libtask. The significant performance difference can be explained by the context switch overhead of the pthread library and the scheduling policy of GNU/Linux. In GNU/Linux, every thread is handled by the scheduler as if it were a process; each thread is scheduled autonomously and competes with the other processes and threads for the CPU. On the other hand, libtask operates exclusively in user space, so the operating system is not involved in the scheduling, which makes context switching cheaper.
3.9.2 Messaging
Inter-core communication via message passing was rather problematic and caused severe issues during the implementation phase.
Blocking  Communication using the RCCE library is meant to be synchronous, or in other words blocking. Blocking communication is inflexible and prone to problems such as deadlocks. In a service like the distributed locking of the DTM, asynchronous communication is necessary. The iRCCE library is an extension of the RCCE library which provides asynchronous operations, but it is not officially supported and tested by Intel. However, there was no alternative to using iRCCE. The code in Figure 3.25 shows the inherent difficulties caused by the combination of the SPMD model with blocking communication.
if (RCCE_ue() == 0) {
    RCCE_send(data, DATASIZE, 1);
    RCCE_recv(local, DATASIZE, 1);
} else { // RCCE_ue() == 1
    RCCE_recv(local, DATASIZE, 0);
    RCCE_send(data, DATASIZE, 0);
}

Figure 3.25: Implementing data exchange between cores 0 and 1 using RCCE.
In order to implement a simple data exchange, one has to differentiate the behaviour of each core according to its core id. In the example above, if both cores executed the same operation first, a deadlock would occur.
Deterministic  Communication using RCCE/iRCCE is deterministic. The nodes communicate based on a pattern that has to be defined statically while developing the application. Figure 3.25 is an example of a predefined communication pattern. This is unavoidable, since every send request has to be paired with the corresponding receive request (and vice versa) and both operations name a specific receiver/sender. This can be seen more easily in Figure 3.26 below.

int RCCE_send(char *data, size_t size, int to);
int RCCE_recv(char *data, size_t size, int from);

Figure 3.26: The RCCE send and receive operations interface.
The to/from arguments denote the receiver/sender of the message respectively. Receiving a message regardless of who the sender is, is not built-in functionality. So, a RCCE_recv(data, DATASIZE, 1) call from core 0 has to be matched with a RCCE_send(data, DATASIZE, 0) call from core 1 in order for core 0 to deliver the message and unblock. As in most distributed systems, in my DTM algorithm the communication is not predefined, meaning that the send and receive requests cannot be simply coupled, since at any time a message may arrive from any other core. To bypass the limitation, one has (i) to use asynchronous messaging and (ii) to repeatedly check for incoming messages from any possible core. As described earlier, this functionality was implemented using the iRCCE library. Although the problem was fixed, there was an increase in the round-trip messaging latency by a scale of 10, which can be explained by the additional computation involved. In a simple, deterministic send-receive, two nodes simply synchronize their operations using two flags. If one of the nodes is not ready to perform the operation, the other blocks, polling the first flag. On the other hand, in a non-deterministic send-receive, the receiver has to check a flag for every possible sender in order to figure out whether a node intends to send data to it. Moreover, if the send is non-blocking, the sender needs to periodically try to push its outgoing messages to the receivers, so that the operations will eventually complete.
Unreliable  The biggest challenge I faced during this project was the unreliability of the SCC processor in terms of messaging. Assuming reliable communication, i.e., no message loss (every message is eventually delivered by the receiving core), is an essential part of my DTM algorithmic design. The unreliability issue appeared in two different forms; Message loss and Message inconsistencies.
Message loss  For a long period, the behaviour of the SCC that I had access to was somewhat non-deterministic. There were days when a program would run flawlessly, while on others the exact same program would get stuck. The reason an application would stop executing was the unreliable communication; an improperly delivered message would cause the receiver to block. I devoted a lot of time to locating the cause of the problem. My research showed that the flags used for synchronizing the send and receive operations were getting desynchronized, resulting in the sending core completing its operation while the receiving one got stuck, waiting for the message to be delivered. Rationally, I initially believed that the source of the problem lay within my application code, but later, after exhaustive testing, I realized that the problem had to be either in one of the libraries (RCCE or iRCCE) or in the SCC itself. This belief was verified by the other two people working on the same SCC processor. Consequently, I developed a simple test application and filed a bug report (Intel provides a bug reporting platform for the SCC at http://marcbug.scc-dc.com/bugzilla3/). The application connects the nodes in a ring and inserts a single token. When a node receives the token, it increases its value and forwards it to the next node. The erratic behaviour of the SCC could easily be noticed, since this application ran perfectly on up to 8 cores, while it always got stuck on more than 24. Luckily, the Intel people paid attention to my report and rapidly provided me with temporary access to another SCC processor to run my experiments. On the new SCC processor the exact same test programs (I developed more than one) ran flawlessly, even on all 48 cores. I reported back my results and Intel immediately replaced our SCC processor. After the replacement, on the 23rd of May, 2011, the aforementioned problem disappeared.
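The following is a reconstruction of that ring test from the description above, not the exact program submitted with the bug report; it uses only RCCE_send and RCCE_recv as listed in Figure 3.18, and ROUNDS is an illustrative parameter.

#include <stdio.h>
#include "RCCE.h"

#define ROUNDS 1000   /* laps of the token around the ring (illustrative) */

int main(int argc, char **argv)
{
    RCCE_init(&argc, &argv);

    int me    = RCCE_ue();
    int cores = RCCE_num_ues();
    int left  = (me + cores - 1) % cores;
    int right = (me + 1) % cores;
    int token, r;

    if (me == 0) {                               /* core 0 inserts the token */
        token = 1;
        RCCE_send((char *) &token, sizeof(token), right);
    }
    for (r = 0; r < ROUNDS; r++) {
        /* a lost message makes this receive block forever: exactly the
           behaviour observed on the faulty chip */
        RCCE_recv((char *) &token, sizeof(token), left);
        token++;
        if (!(me == 0 && r == ROUNDS - 1))       /* the very last hop is not forwarded */
            RCCE_send((char *) &token, sizeof(token), right);
    }
    if (me == 0)                                 /* 1 + ROUNDS * cores if nothing was lost */
        printf("token after %d laps: %d\n", ROUNDS, token);

    RCCE_finalize();
    return 0;
}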
Message inconsistencies  After solving the Message loss problem I ran into another messaging issue; a core may deliver a message that is not destined to it. I first noticed this behaviour in the following use case; a node was trying to read-lock an address x by sending the relevant request to the responsible node, but instead of receiving a response for address x, it received a response for another address that a third node was trying to lock. This inconsistency is not persistent, but it appears quite often, especially when using more than 14 cores. The problem is probably caused by data races on either the L1 cache or the MPB of the receiving core. As with the Message loss problem, I developed a test application to be used for submitting a bug report. It is a ping-pong-like application where each node asynchronously sends a message to a randomly selected target. Each message contains an integer, unique locally to each node, that is used for verifying the messages. Upon receiving such a request, the receiver simply sends back a response containing the same integer, so the sender is able to confirm the correctness of the response.
The node that initialized this request-response exchange should always receive back the integer sent in the request. Figure 3.27 depicts the output of a problematic run.

          [Expected]                    [Received]
Core      Integer     From -> To        Integer     From -> To
[08]      num: 6      (05 -> 08)        num: 2      (03 -> 08)
[10]      num: 897    (00 -> 10)        num: 924    (00 -> 11)
[05]      num: 733    (02 -> 05)        num: 926    (00 -> 08)
[11]      num: 2275   (04 -> 11)        num: 2231   (08 -> 11)
[07]      num: 2848   (08 -> 07)        num: 3553   (08 -> 01)

Figure 3.27: Problematic output of the ping-pong-like test application running on 16 cores.
The problem is twofold. The core mistakenly delivers either a message destined to another core (as in line 2), or one targeting the correct core but coming from the wrong sender or containing the wrong integer. For example, the first line reveals that node 08 expected a message from node 05 with value 6, but instead received a response message from node 03 with value 2. This seems to be an old, cached message which was re-delivered. As with the Message loss problem, the cause proved to be the hardware. On the 7th of June, 2011, Intel provided us with our own access to an SCC processor hosted in Intel's data-center. I did not manage to reproduce the inconsistency, even under heavy load. None of the described problems appeared on the new processor.
4 SC-TM Evaluation

4.1 Introduction
This chapter aims to evaluate the SC-TM Distributed Transactional Memory system. Section 4.2 compares the performance and scalability of the two different SC-TM implementations; the multitasking and the dedicated DS-Lock service versions. Sections 4.3 and 4.4 present two micro-benchmarks (linked-list, hashtable) used to compare sequential and transactional versions of the data structures and to show the implementation of the elastic model and the scalability of the system. Finally, Section 4.5 presents an application that resembles a Bank and reveals the necessity of live-lock freedom.
4.1.1 SCC Settings
SCC has the following five performance settings:

Setting   Tile   Mesh   DRAM
   0       533    800    800
   1       800   1600   1066
   2       800   1600    800
   3       800    800   1066
   4       800    800    800

Figure 4.1: Available performance settings for Intel's SCC processor. All values are in MHz.
The Tile, Mesh, and DRAM columns give the tile, mesh, and memory frequency settings respectively. Using a setting other than 0 is discouraged by Intel and also proved problematic in many cases; therefore, the data presented in this chapter were collected under SCC's setting 0.
4.2 Multitasking vs. Dedicated DS-Lock Service
As described in Section 3.8, the SC-TM implementation using multitasking appeared to have some performance and scalability issues due to scheduling. In order to evaluate the difference between the multitask-based and the dedicated DS-Lock cores versions, a simple application was used. The application repeatedly runs read-only transactions which sequentially access an array of integers in the shared memory. This specific application was selected for two reasons: firstly, it creates a lot of messaging traffic and, secondly, it has no conflicts. The absence of conflicts is important in order to obtain a reasonably deterministic execution. A single configuration was used.
Configuration I. 100-int array, 100 reads per TX, no CM (back-off & retry)
Figure 4.2 presents the system's throughput (in Tx/s), while Figure 4.3 shows the average latency (in ms) for a transaction to complete. In both figures we can notice that the dedicated DS-Lock version performs and scales significantly better than the multitasking one, because of the scheduling dependencies explained in Section 3.8 and the context switching cost of the latter. One can also notice that both versions scale worse from 32 to 48 cores than they do for lower numbers of cores. As the number of cores increases, the memory starts becoming the bottleneck due to the high traffic. The limitation of the memory bandwidth will become clearer in Section 4.3.
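For concreteness, a sketch of the benchmark's transaction loop is given below. The TX_START / TX_LOAD_INT / TX_COMMIT names are hypothetical stand-ins chosen for illustration and are not the SC-TM interface as documented in this thesis; the use of RCCE_shmalloc for placing the array in the shared memory follows Figure 3.17.

#include "RCCE.h"

/* hypothetical SC-TM calls; the real interface may use different names */
void TX_START(void);
int  TX_LOAD_INT(volatile void *addr);   /* read-lock an address and read it */
void TX_COMMIT(void);

#define ARRAY_SIZE 100                   /* Configuration I: 100-int array   */
#define READS      100                   /* 100 reads per transaction        */

volatile int *array;

void read_only_benchmark(int transactions)
{
    int t, i, sum = 0;

    /* the array lives in the off-chip shared memory (cf. Figure 3.17) */
    array = (volatile int *) RCCE_shmalloc(ARRAY_SIZE * sizeof(int));

    for (t = 0; t < transactions; t++) {
        TX_START();
        for (i = 0; i < READS; i++)
            sum += TX_LOAD_INT(&array[i]);
        TX_COMMIT();                     /* read-only: no write locks to acquire */
    }
    (void) sum;
}

Every TX_LOAD_INT generates a read-lock request to the responsible DS-Lock core, which is why this workload stresses the messaging subsystem while never aborting.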
Figure 4.2: Throughput of read-only transactions for the multitask-based and the dedicated DS-Lock versions of SC-TM.
Figure 4.3: Latency of read-only transactions for the multitask-based and the dedicated DS-Lock versions of SC-TM.
4.3 Linked-list Benchmark
Linked-list is a micro-benchmark from the suite of [HLM06] and, as the name suggests, is a transactional linked-list implementation. It supports the following three operations;
1. contains: checks whether an item belongs to the list (read-only operation)
2. add: adds an item to the list (if it does not already exist) (update operation)
3. remove: removes an item from the list (if it exists) (update operation)
Four different implementations of the linked-list's operations were designed; three transactional and one sequential:
1. sequential: does not use transactions
2. normal: uses normal transactions
3. elastic-early: uses elastic transactions, implemented with early release
4. elastic-read: uses elastic transactions, implemented with read validation
All operations are called with a random element as input, so by "range value" I mean the range of the possible values the random generator can return. For example, with 2048 list nodes and range value 4096, half of the contains operations are expected to return true.
I use Linked-list with three different configurations.
Figure 4.4: Throughput of Linked-list running only contains operations in sequential mode. Obviously, the system does not scale and the reason is the memory bandwidth.
Configuration I. Sequential only, 2048 nodes, range value 4096, 0% updates
This configuration aims to show the memory bandwidth problem on the SCC. Figure 4.4 depicts the problem. It is clear that for more than 4 nodes, the memory bandwidth does not allow the performance to increase. The per-node average drops from 2500 ops/s for one core to less than 100 ops/s for 48 cores. The memory bottleneck problem will be visible in almost every figure.
Configuration II. Normal and elastic-early, FairCM, 2048 nodes, range value 4096, 20% updates
This configuration shows the advantage of using the elastic model (implemented with early release). Figure 4.5 illustrates the system's throughput with and without early-release elastic transactions. Figure 4.6 shows the transactions' commit rate under the same configuration. The elastic version performs and scales better than the normal one. This is expected, since the conflicts are eliminated by using the elastic model, as can be seen in the commit rate figure. For 48 cores, the commit rate with normal transactions is around 23%, while with elastic transactions it is 100%. Taking the commit ratio into account, one would expect the performance difference to be even bigger. This is not the case due to the messaging overhead early release introduces; for every release, an extra message has to be sent.
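The sketch below shows where these extra release messages come from in a list search that uses early release. It illustrates the general early-release technique rather than the exact benchmark code; the node layout and the TX_* names are hypothetical stand-ins for SC-TM's actual interface.

#include <stddef.h>

typedef struct node {
    int          value;
    struct node *next;
} node_t;

/* hypothetical SC-TM calls; the real interface may use different names */
void  TX_START(void);
void  TX_COMMIT(void);
int   TX_LOAD_INT(volatile void *addr);   /* read-lock an address and read it */
void *TX_LOAD_PTR(volatile void *addr);
void  TX_RELEASE(volatile void *addr);    /* early release of a read lock     */

int list_contains(node_t *head, int value)
{
    int found;

    TX_START();
    node_t *prev = head;
    node_t *curr = TX_LOAD_PTR(&head->next);

    while (curr != NULL && TX_LOAD_INT(&curr->value) < value) {
        node_t *next = TX_LOAD_PTR(&curr->next);
        TX_RELEASE(prev);        /* prev is no longer needed for the search;   */
        prev = curr;             /* every such release costs one extra message */
        curr = next;
    }
    found = (curr != NULL && TX_LOAD_INT(&curr->value) == value);
    TX_COMMIT();
    return found;
}

Releasing the nodes behind the search window is what removes the conflicts with concurrent updates earlier in the list, but each release adds one message to the round-trip budget of the transaction, which explains why the throughput gain is smaller than the commit-rate gain.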
Figure 4.5: Throughput of linked-list for normal and elastic-early transactions. Elastic-early performs and scales better than normal transactions.
Figure 4.6: Commit rate of normal and elastic-early transactions on the linked-list micro-benchmark. Elastic-early eliminates the conflicts and thus achieves 100% commit rate.
Configuration III. Sequential and elastic-read, FairCM, 20% updates
This configuration reveals the performance improvement that can be achieved over the sequential code using SC-TM. Figure 4.7 illustrates the throughput of the system for different list sizes and Figure 4.8 shows the ratio of the transactional implementation compared to the sequential one. SC-TM performs very well, since it provides better-than-sequential performance using only 4 cores (2 application cores). The best ratio is achieved for a list size of 256 elements and is close to 7 times faster. One can notice that the scalability drops when increasing the list size. This is solely due to the limited memory bandwidth, since changes in the list size do not affect the number of messages sent by the elastic-read operations.
Figure 4.7: Throughput of sequential and transactional (elastic-read) linked-list versions under different list sizes.
Figure 4.8: Ratio of transactional (elastic-read) throughput compared to the sequential under different list sizes.
4.4 Hashtable Benchmark
Hashtable is a micro-benchmark from the suite of [HLM06] and, as the name suggests, is a transactional hashtable implementation. It supports the following three operations;
1. contains: checks whether an item belongs to the hashtable (read-only operation)
2. add: adds an item to the hashtable (if it does not already exist) (update operation)
3. remove: removes an item from the hashtable (if it exists) (update operation)
Four different implementations of the hashtable's operations were designed; three transactional and one sequential:
1. sequential: does not use transactions
2. normal: uses normal transactions
3. elastic-early: uses elastic transactions, implemented with early release
4. elastic-read: uses elastic transactions, implemented with read validation
All operations are called with a random element as input, so by "range value" I mean the range of the possible values the random generator can return. For example, with 2048 items and range value 4096, half of the contains operations are expected to return true.
Figure 4.9: Throughput of Sequential and Transactional versions on the Hashtable benchmark under different load factor values.
Another important parameter is the load factor, which sets how many items each bucket will initially have. So, with 2048 items and load factor 4, there will be 512 buckets with 4 elements each. I tested Hashtable under two configurations.
Configuration I. All versions, FairCM, 2048 items, range value 4096, 20% updates
This configuration shows the performance improvement that can be achieved using SC-TM over the sequential code. Figure 4.9 presents the throughput of all four versions for different load factor values on 48 cores, while Figure 4.10 illustrates the ratio of the transactional implementations over the sequential one. The elastic-read version achieves up to a 15.5 times performance improvement for load factor 1 (where the hashtable is initially a plain array). As we increase the load factor, the transactional performance degrades. This is due to the increase in the number of messages in the system and the limited memory bandwidth.
Figure 4.10: Ratio of Transactional performance compared to the sequential on the Hashtable benchmark under different load factor values.
Configuration II. Normal and elastic-read, FairCM, 2048 items, range value 4096, load factor 4, 20% updates
This configuration aims to show the scalability of SC-TM. Figure 4.11 presents the throughput for the two versions under Configuration II and Figure 4.12 shows the ratio of the throughput compared to that on two cores. The elastic-read version performs and scales better, as expected. The messaging overhead of normal transactions is more expensive than the read validation performed by the elastic-read transactions.
Figure 4.11: Throughput of normal and elastic-read versions on the Hashtable benchmark under load factor 4.
Figure 4.12: Ratio of throughput for normal and elastic-read versions on the Hashtable benchmark compared to the throughput of 2 cores (1 application core).
4.5 Bank Benchmark
The Bank benchmark is a simple application which was initially used to evaluate the DSTM2 system [HLM06]. As the name suggests, it resembles a bank consisting of a number of accounts. Each core performs transactional operations that either transfer an amount of money from one account to another, or take a snapshot of all the accounts in order to calculate the overall balance of the bank. I tested the Bank benchmark under two different configurations.
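To make the workload concrete, the following sketch shows the two transactional operations. The TX_* names, the accounts array, and its initialization are illustrative stand-ins and not the exact SC-TM interface documented in this thesis; the "write locks acquired at commit" comment reflects the lazy write-lock acquisition described in Chapter 5.

#define NB_ACCOUNTS 1024

volatile int *accounts;   /* allocated in shared memory elsewhere (not shown) */

/* hypothetical SC-TM calls; the real interface may use different names */
void TX_START(void);
int  TX_LOAD_INT(volatile void *addr);              /* acquires a read lock        */
void TX_STORE_INT(volatile void *addr, int value);  /* buffered, locked at commit  */
void TX_COMMIT(void);

void transfer(int src, int dst, int amount)
{
    TX_START();
    TX_STORE_INT(&accounts[src], TX_LOAD_INT(&accounts[src]) - amount);
    TX_STORE_INT(&accounts[dst], TX_LOAD_INT(&accounts[dst]) + amount);
    TX_COMMIT();             /* write locks are acquired here (lazy acquisition) */
}

int snapshot(void)
{
    int i, total = 0;
    TX_START();
    for (i = 0; i < NB_ACCOUNTS; i++)
        total += TX_LOAD_INT(&accounts[i]);   /* read-locks all 1024 accounts */
    TX_COMMIT();
    return total;
}

Because a snapshot read-locks every account, any concurrent transfer almost certainly touches an account that is already read-locked or about to be write-locked, which is the conflict pattern analysed below.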
Configuration I. All CMs, 1024 Accounts, 20% Snapshot operations
The purpose of this configuration is to show the need for and effectiveness of Contention Management, and the scalability of the system. Figure 4.13 illustrates the system's throughput and Figure 4.14 the commit rate under Configuration I. The non-CM version performs poorly and soon leads to a live-lock. This is totally expected if one takes the workload into account. Every snapshot operation has to read-lock every account, making it practically impossible not to collide with a concurrent update transaction. This happens in both directions; either the snapshot operation "reaches" a write-locked account (RAW conflict), or a transfer operation conflicts with a read lock that a snapshot has earlier acquired (WAR conflict). Of the three CMs, Wholly performs the worst. This was also expected, due to the way Wholly prioritizes transactions.
Figure 4.13: Throughput of SC-TM running Bank Benchmark with different CMs. (Configuration I)
Every transaction is considered equal to the others and thus all transactions have the same chance to complete. As explained in Section 3.5, every node commits transactions at almost the same rate. Consequently, if, for example, a core has completed two fewer transactions than every other core, then it will win any conflict resolution for its next two transactions. If these transactions are snapshots, the system will be slowed down until they complete. Finally, FairCM performed slightly better than Offset-Greedy. This can be explained by the fact that FairCM prioritizes shorter transactions. Although for 32 and 48 cores it has a worse commit rate than Offset-Greedy, it still completes more transactions. This is because transfer transactions are extremely short, so if a conflict exists, it is detected immediately and no time is lost on "zombie" transactions.
Figure 4.14: Commit Rate of SC-TM running Bank Benchmark with different CMs. (Configuration I)
Configuration II. 1024 Accounts, 50% Snapshot-only Cores, 10% Snapshot operations
This configuration is used to show how SC-TM handles the problematic case of long, conflict-prone transactions. Figure 4.15 illustrates the system's throughput and Figure 4.16 the commit rate under Configuration II. The Wholly CM performs the worst, as expected; the reasoning is the same as for Configuration I. Although the performance with Wholly is low, the system is clearly progressing, and the same holds for its scalability. Under these settings, FairCM performs significantly better than Offset-Greedy. This is achieved due to the fairness of FairCM regarding the effective transactional time. Every core gets a fair share of transactional time; therefore the snapshot-only cores (those running only snapshot operations) are not allowed to degrade the system's performance.
Figure 4.15: Throughput of SC-TM running Bank Benchmark with different CMs. (Configuration II)
Figure 4.16: Commit Rate of SC-TM running Bank Benchmark with different CMs. (Configuration II)
5 Conclusions

5.1 Summary
This report presented SC-TM, the Single-chip Cloud TM system, specifically designed for Many-core architectures. SC-TM is fully decentralized and makes use of contention management to provide starvation-freedom. The promise of fast message-passing support in future Many-core systems allowed us to take different design decisions than the ones usually taken in DTM systems. SC-TM is a lock-based system and operates with eager read-lock acquisition, therefore having visible reads. This scheme allows the system to perform normal contention management, similar to that of STMs, something no DTM system had done before.

On the implementation side, two different versions of the system were developed; one using multitasking and one using dedicated DTM cores. The evaluation showed that the latter significantly outperforms the former, offers much better scalability, and has a cleaner design. The multitasking-based solution is slower mainly due to scheduling issues and the context switch overhead.

Moreover, four different contention management policies were implemented on SC-TM; back-off & retry, Offset-Greedy, Wholly, and FairCM. Back-off & retry is actually a non-CM scheme; upon a failure to acquire a lock, a transaction waits for some period and then retries. This repeats for N > 0 times and then the transaction is aborted. As expected, back-off & retry is live-lock prone, a fact that was confirmed by the evaluation in this report. Offset-Greedy is a decentralized version of the well-known Greedy CM [GHKP05]. On a DS, the nodes do not have access to a coherent global clock, so the transactions cannot take the necessary timestamps. To overcome this problem, Offset-Greedy uses timestamp offsets instead of absolute values. Although theoretically there could be a live-lock problem due to estimation inaccuracies, this effect never emerged during the practical evaluation of the system, and thus I can state that Offset-Greedy practically offers starvation-freedom. Wholly uses a naive contention management policy; among the conflicting transactions, the one issued by the node that has completed the fewest transactions has the highest priority. Wholly clearly guarantees live-lock freedom, but the evaluation showed that it performs the worst among the three CMs. This can be explained by the fact that Wholly considers all transactions equal, regardless of their size, therefore long, conflict-prone transactions may degrade the system's performance. Finally, FairCM is similar to Wholly, but also provides fairness regarding the effective transactional time of each node. Instead of using the number of committed transactions, FairCM uses the cumulative time spent on successful transaction tries. By doing so, FairCM prioritizes shorter transactions, since they cost less. From the experimental results, it is clear that FairCM outperforms the other two policies and in one case (Section 4.5, Configuration II) it delivered around 20% better performance than Offset-Greedy.

Generally, SC-TM guarantees opacity; serializability with the extra property that even live transactions are not allowed to view an inconsistent state of the memory.
However, its interface includes an early release operation, which gives the application developer the opportunity to explicitly release a lock during the lifetime of the transaction. Proper use of early release can give great performance benefits, but if misused, the opacity guarantee of SC-TM may be violated.

Elastic transactions are a variant of the transactional model that is particularly appealing when implementing search structures. Upon conflict detection, an elastic transaction might drop what it did so far within a separate transaction that immediately commits, and initiate a new transaction which might itself be elastic [FGG09]. I implemented the elastic model on a linked-list benchmark using two techniques; early-release and read-validation. The former utilizes the early release operation that SC-TM exports. The latter approach completely avoids locking the items that it reads while searching the list. The evaluation revealed that elastic transactions can significantly boost the performance, a well-expected result, since elastic transactions significantly reduce the number of conflicts.

Overall, the evaluation showed that the SC-TM system scales very well. The limitations I noticed were mainly due to the SCC processor. Firstly, in several cases the memory bandwidth was the performance bottleneck. Secondly, the SCC provides limited hardware support for message passing. Message passing is handled in software and is actually implemented over a non-coherent shared buffer. The message-passing latency was, under several circumstances, a limiting factor.
5.2 Future work
Due to lack of time, there are several extensions and design alternatives that I did not manage to implement. The major ones are described below.
5.2.1 Write-lock Batching
Since all the write locks are acquired at the commit phase of the transaction, several requests targeting the same DS-Lock node could potentially be batched. This technique can significantly increase the performance of transactions that perform many updates. Moreover, decreasing the number of write-lock requests also reduces the number of messages and thus the messaging load of the system.
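A rough sketch of the idea follows; the types and the helpers responsible_node() and send_lock_batch() are hypothetical. It simply groups the write set by responsible DS-Lock node so that one lock-request message per node suffices.

#define MAX_WRITE_SET 256
#define MAX_NODES     48

typedef struct {
    void *addr;
} write_entry_t;

int  responsible_node(void *addr);                       /* hypothetical DS-Lock mapping  */
void send_lock_batch(int node, void **addrs, int count); /* hypothetical: one message     */

void acquire_write_locks(write_entry_t *wset, int wset_size)
{
    static void *batch[MAX_NODES][MAX_WRITE_SET];
    int          batch_size[MAX_NODES] = { 0 };
    int          i, n;

    /* bucket the write set per responsible DS-Lock node */
    for (i = 0; i < wset_size; i++) {
        n = responsible_node(wset[i].addr);
        batch[n][batch_size[n]++] = wset[i].addr;
    }

    /* one lock-request message per node instead of one per address */
    for (n = 0; n < MAX_NODES; n++)
        if (batch_size[n] > 0)
            send_lock_batch(n, batch[n], batch_size[n]);
}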
5.2.2 Asynchronous Read Locking
Another technique that could lead to a significant performance boost would be to make the read operation non-blocking. This could be achieved using asynchronous read locking. Under this scheme, a read operation would take the following steps:
1. Send the read-locking request (non-blocking).
2. Perform the read.
3. Continue normally.
4. Upon receipt of the request at the DS-Lock node, if a conflict is detected and the issuing transaction has to be aborted, it is aborted; otherwise nothing happens.
One problem with this technique is that it "breaks" the opacity of the system, because a transaction is allowed to see an inconsistent view of the shared memory (in step 2).
5.2.3 Eager Write-lock Acquisition
A reasonable design alternative for the write-lock acquisition policy would be eager acquisition. Instead of buffering the writes and requesting all the locks at the end, the transaction could acquire each lock instantly, at the point the write operation is called. Eager acquisition can be combined with either lazy writes with write-logging, or eager writes with undo-logging.
5.2.4 Profiling & Refactoring
I did not have the opportunity to profile the system, nor to work on refactoring. I believe that the overall system's performance can be further improved.
5.2.5 Applications & Benchmarks
There are several applications and benchmarks available for STM. The ones that I ported have only transactional workloads. It would be interesting to port some applications that have both transactional and non-transactional parts.
A Acronyms

API     Application Programming Interface
CC      Cache-Coherent
CM      Contention Manager
CPU     Central Processing Unit
DS      Distributed System
DSTM    Distributed Software Transactional Memory
DTM     Distributed Transactional Memory
MC      Memory Controller
MPB     Message Passing Buffer
MP      Message Passing
RAW     Read After Write
SCC     Single-Chip Cloud Computer
SC-TM   Single-chip Cloud TM
SPMD    Single Program Multiple Data
STM     Software Transactional Memory
TM      Transactional Memory
WAR     Write After Read
WAW     Write After Write
Bibliography

[AB84]
James Archibald and Jean Loup Baer. An economical solution to the cache coherence problem. In Proceedings of the 11th annual international symposium on Computer architecture, ISCA ’84, pages 355–362, New York, NY, USA, 1984. ACM.
[AEST08]
Hagit Attiya, Leah Epstein, Hadas Shachnai, and Tami Tamir. Transactional Contention Management as a Non-Clairvoyant Scheduling Problem. Algorithmica, 57(1):44–61, May 2008.
[AGM10]
Hagit Attiya, Vincent Gramoli, and Alessia Milani. Combine: An Improved DirectoryBased Consistency Protocol. In Proceedings of the 22nd ACM symposium on Parallelism in algorithms and architectures - SPAA ’10, pages 72–73, New York, New York, USA, 2010. ACM Press.
[AMVK07] Marcos Aguilera, Arif Merchant, Alistair Veitch, and Christos Karamanolis. Sinfonia: a new paradigm for building scalable distributed systems. In SOSP '07: Proceedings of the twenty-first ACM SIGOPS symposium on Operating systems principles, 2007.
[ASHH88] Anant Agarwal, Richard Simoni, John Hennessy, and Mark Horowitz. An evaluation of directory schemes for cache coherence. In The 15th Annual International Symposium on Computer Architecture, Conference Proceedings, pages 280–289. IEEE Comput. Soc. Press, 1988.
[BAC08]
Robert Bocchino, Vikram Adve, and Bradford Chamberlain. Software transactional memory for large scale clusters. In Proceedings of the 13th ACM SIGPLAN Symposium on Principles and practice of parallel programming, number 3, pages 247–258. ACM, 2008.
[BBD+ 09] Andrew Baumann, Paul Barham, Pierre-Evariste Dagand, Tim Harris, Rebecca Isaacs, Simon Peter, Timothy Roscoe, Adrian Schupbach, and Akhilesh Singhania. The multikernel: a new OS architecture for scalable multicore systems. In Proceedings of the ACM SIGOPS 22nd symposium on Operating systems principles, pages 29–44. Citeseer, 2009. [BG83]
Philip A. Bernstein and Nathan Goodman. Multiversion concurrency control-theory and algorithms. ACM Trans. Database Syst., 8:465–483, December 1983.
[BGRS91]
Y. Breitbart, D. Georgakopoulos, M. Rusinkiewicz, and A. Silberschatz. On rigorous transaction scheduling. Software Engineering, IEEE Transactions on, 17(9):954–960, sep 1991.
[CF78]
L.M. Censier and P. Feautrier. A new solution to coherence problems in multicache systems. IEEE Transactions on Computers, 27:1112–1118, 1978.
[CFKA90] David Chaiken, Craig Fields, Kiyoshi Kurihara, and Anant Agarwal. Directory-Based Cache Coherence in Large-Scale Multiprocessors. IEEE Computer, (June), 1990. [CRCR09] Maria Couceiro, Paolo Romano, Nuno Carvalho, and Luís Rodrigues. D2STM: Dependable Distributed Software Transactional Memory. 2009 15th IEEE Pacific Rim International Symposium on Dependable Computing, pages 307–313, November 2009. [DD09]
Alokika Dash and Brian Demsky. Software transactional distributed shared memory. Proceedings of the 14th ACM SIGPLAN symposium on Principles and practice of parallel programming, 44(4):297, February 2009.
[DMF+ 06] Peter Damron, Sun Microsystems, Alexandra Fedorova, Yossi Lev, Victor Luchangco, Mark Moir, and Daniel Nussbaum. Hybrid transactional memory. In In Proc. 12th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOSXII, 2006. [DSS06]
Dave Dice, Ori Shalev, and Nir Shavit. Transactional Locking II. International Symposium on Distributed Computing - DISC, pages 194–208, 2006.
[FFR08]
Pascal Felber, Christof Fetzer, and Torvald Riegel. Dynamic performance tuning of wordbased software transactional memory. In Proceedings of the 13th ACM SIGPLAN Symposium on Principles and practice of parallel programming, PPoPP ’08, pages 237–246, New York, NY, USA, 2008. ACM.
[FGG09]
Pascal Felber, Vincent Gramoli, and Rachid Guerraoui. Elastic Transactions. International Symposium on Distributed Computing - DISC, pages 93–107, 2009.
[Fra03]
Keir Fraser. Practical lock freedom. PhD thesis, Cambridge University Computer Laboratory, 2003.
[GG75]
M. R. Garey and R. L. Grahams. Bounds for multiprocessor scheduling with resource constraints. SIAM Journal on Computing, 4:187–200, 1975.
[GHKP05] Rachid Guerraoui, Maurice Herlihy, Michal Kapalka, and Bastian Pochon. Robust Contention Management in Software Transactional Memory. Proceedings of the Workshop on Synchronization and Concurrency in Object-Oriented Languages, 2005. [GHP05]
Rachid Guerraoui, Maurice Herlihy, and Bastian Pochon. Toward a theory of transactional contention managers. Proceedings of the twenty-fourth annual ACM SIGACT-SIGOPS symposium on Principles of distributed computing - PODC ’05, page 258, 2005.
[GK08]
Rachid Guerraoui and Michal Kapalka. On the correctness of transactional memory. Proceedings of the 13th ACM SIGPLAN Symposium on Principles and practice of parallel programming - PPoPP ’08, page 175, 2008.
[GK10]
Rachid Guerraoui and Michal Kapalka. Principles of Transactional Memory. 2010.
[GLL+ 90]
Kourosh Gharachorloo, Daniel Lenoski, James Laudon, Phillip Gibbons, Anoop Gupta, and John Hennessy. Memory consistency and event ordering in scalable shared-memory multiprocessors. In Proceedings of the 17th annual international symposium on Computer Architecture, ISCA ’90, pages 15–26, New York, NY, USA, 1990. ACM.
[GR06]
Rachid Guerraoui and Luís Rodrigues. Introduction to reliable distributed programming. Springer, 2006.
[Gra10]
Håkan Grahn. Transactional memory. Journal of Parallel and Distributed Computing, 70(10):993–1008, October 2010.
[Had88]
Vassos Hadzilacos. A theory of reliability in database systems. J. ACM, 35:121–145, January 1988.
[HDH+ 10] J. Howard, S. Dighe, Y. Hoskote, S. Vangal, D. Finan, G. Ruhl, D. Jenkins, H. Wilson, N. Borkar, G. Schrom, F. Pailet, S. Jain, T. Jacob, S. Yada, S. Marella, P. Salihundam, V. Erraguntla, M. Konow, M. Riepen, G. Droege, J. Lindemann, M. Gries, T. Apel, K. Henriss, T. Lund-Larsen, S. Steibl, S. Borkar, V. De, R. Van Der Wijngaart, and T. Mattson. A 48-core IA-32 message-passing processor with DVFS in 45nm CMOS. In Solid-State Circuits Conference Digest of Technical Papers (ISSCC), 2010 IEEE International, pages 108–109, feb. 2010. [Hel10]
Jim Held. Single-chip Cloud Computer: An experimental many-core processor from Intel. http://communities.intel.com/servlet/JiveServlet/downloadBody/5646-102-18761/SCC_Sympossium_Feb212010_FINAL.pdf, February 2010. [Her88]
Maurice P. Herlihy. Impossibility and universality results for wait-free synchronization. In Proceedings of the seventh annual ACM Symposium on Principles of distributed computing, PODC ’88, pages 276–290, New York, NY, USA, 1988. ACM.
[HLM03]
M. Herlihy, V. Luchangco, and M. Moir. Obstruction-free synchronization: double-ended queues as an example. In Distributed Computing Systems, 2003. Proceedings. 23rd International Conference on, pages 522–529, may 2003.
[HLM06]
Maurice Herlihy, Victor Luchangco, and Mark Moir. A flexible framework for implementing software transactional memory. In Proceedings of the 21st annual ACM SIGPLAN conference on Object-oriented programming systems, languages, and applications, OOPSLA ’06, pages 253–262, New York, NY, USA, 2006. ACM.
[HLMS03] Maurice Herlihy, Victor Luchangco, Mark Moir, and William Scherer. Software transactional memory for dynamic-sized data structures. Proceedings of the twenty-second annual symposium on Principles of distributed computing - PODC ’03, pages 92–101, 2003. [HM93]
Maurice Herlihy and Eliot Moss. Transactional Architectural Memory : Data Structures Support for Lock-Free Memory. ACM SIGARCH Computer Architecture News, pages 289– 300, 1993.
[HS05]
Maurice Herlihy and Ye Sun. Distributed transactional memory for metric-space networks. Distributed Computing, 1(4):58–208, October 2005.
[irc] iRCCE: A Non-blocking Communication Extension to RCCE. http://communities.intel.com/servlet/JiveServlet/downloadBody/6003-102-39852/iRCCE_manual.pdf. [KAJ+ 08]
Christos Kotselidis, Mohammad Ansari, Kim Jarvis, Mikel Luján, Chris Kirkham, and Ian Watson. DiSTM: A Software Transactional Memory Framework for Clusters. 2008 37th International Conference on Parallel Processing, pages 51–58, September 2008.
[KLA+ 10] Christos Kotselidis, Mikel Lujan, Mohammad Ansari, Konstantinos Malakasis, Behram Kahn, Chris Kirkham, and Ian Watson. Clustering JVMs with software transactional memory support. 2010 IEEE International Symposium on Parallel & Distributed Processing (IPDPS), pages 1–12, 2010. [KS95]
L. Kontothanassis and M. Scott. Software cache coherence for large scale multiprocessors. In High-Performance Computer Architecture, 1995. Proceedings., First IEEE Symposium on, number 8910, pages 286–295. IEEE, 1995.
[LDT+ 10]
Jihoon Lee, Alokika Dash, Sean Tucker, Hyun Kook Khang, and Brian Demsky. FaultTolerant Distributed Transactional Memory. 2010.
[lib]
Libtask: a Coroutine Library for C and Unix. http://swtch.com/libtask/.
[LLG+ 90]
Daniel Lenoski, James Laudon, Kourosh Gharachorloo, Anoop Gupta, and John Hennessy. The directory-based cache coherence protocol for the DASH multiprocessor. In Proceedings of the 17th annual international symposium on Computer Architecture - ISCA ’90, pages 148–159, New York, New York, USA, 1990. ACM Press.
[LS08]
Calvin Lin and Larry Snyder. Principles of Parallel Programming. Addison-Wesley Publishing Company, USA, 1st edition, 2008.
[MMA06]
Kaloian Manassiev, Madalin Mihailescu, and Cristiana Amza. Exploiting distributed version concurrency in a transactional memory cluster. In Proceedings of the eleventh ACM SIGPLAN symposium on Principles and practice of parallel programming, pages 198–208, New York, New York, USA, 2006. ACM.
[MRL+ 10] Timothy G. Mattson, Michael Riepen, Thomas Lehnig, Paul Brett, Werner Haas, Patrick Kennedy, Jason Howard, Sriram Vangal, Nitin Borkar, Greg Ruhl, and Saurabh Dighe. The 48-core SCC processor: the programmer’s view. In Proceedings of the 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis, SC ’10, pages 1–11, Washington, DC, USA, 2010. IEEE Computer Society. [ope]
The OpenMP API specification for parallel programming. http://openmp.org/wp/.
[pth]
POSIX Threads Programming. https://computing.llnl.gov/tutorials/pthreads/.
[Ray89]
Kerry Raymond. A tree-based algorithm for distributed mutual exclusion. ACM Trans. Comput. Syst., 7:61–77, January 1989.
[RCR08]
Paolo Romano, Nuno Carvalho, and Luís Rodrigues. Towards distributed software transactional memory systems. Proceedings of the 2nd Workshop on Large-Scale Distributed Systems and Middleware - LADIS ’08, page 1, 2008.
[RHL05]
Ravi Rajwar, Maurice Herlihy, and Konrad Lai. Virtualizing transactional memory. In Proceedings of the 32nd annual international symposium on Computer Architecture, ISCA ’05, pages 494–505, Washington, DC, USA, 2005. IEEE Computer Society.
[SATH+ 06] Bratin Saha, Ali-Reza Adl-Tabatabai, Richard L. Hudson, Chi Cao Minh, and Benjamin Hertzberg. McRT-STM: A High Performance Software Transactional Memory System for a Multi-Core Runtime. In Proceedings of the eleventh ACM SIGPLAN symposium on Principles and practice of parallel programming - PPoPP ’06, page 187, New York, New York, USA, 2006. ACM Press. [SR11]
Mohamed Saad and Binoy Ravindran. Distributed Transactional Locking II [ Technical Report ]. Technical report, 2011.
[SS04]
William Scherer and Michael Scott. Contention Management in Dynamic Software Transactional Memory. Management, 2004.
[SS05]
William Scherer and Michael Scott. Advanced contention management for dynamic software transactional memory. In Proceedings of the twenty-fourth annual ACM SIGACTSIGOPS symposium on Principles of distributed computing - PODC ’05, page 240, New York, New York, USA, 2005. ACM Press.
[ST97]
Nir Shavit and Dan Touitou. Software transactional memory. Symposium on Principles of Distributed Computing - PODC, 10(2):99–116, February 1997.
[Tan76]
C. K. Tang. Cache system design in the tightly coupled multiprocessor system. In Proceedings of the June 7-10, 1976, national computer conference and exposition, AFIPS ’76, pages 749–753, New York, NY, USA, 1976. ACM.
[Wei89]
W. E. Weihl. Local atomicity properties: modular concurrency control for abstract data types. ACM Trans. Program. Lang. Syst., 11:249–282, April 1989.
[WMH11]
Rob Van Der Wijngaart, Timothy Mattson, and Werner Haas. Light-weight Communications on Intel's Single-Chip Cloud Computer Processor. ACM SIGOPS Operating Systems Review, pages 73–83, 2011.
[YYF85]
W.C. Yen, D.W.L. Yen, and King-Sun Fu. Data coherence problem in a multicache system. Computers, IEEE Transactions on, C-34(1):56–65, jan. 1985.
[Zha09]
Bo Zhang. On the Design of Contention Managers and Cache-Coherence Protocols for Distributed Transactional Memory. PhD thesis, 2009.
[ZR09a]
Bo Zhang and Binoy Ravindran. Location-Aware Cache-Coherence Protocols for Distributed Transactional Contention Management in Metric-Space Networks. 2009 28th IEEE International Symposium on Reliable Distributed Systems, pages 268–277, September 2009.
[ZR09b]
Bo Zhang and Binoy Ravindran. Relay : A Cache-Coherence Protocol for Distributed Transactional Memory. Proceeding OPODIS ’09 Proceedings of the 13th International Conference on Principles of Distributed Systems, 5923/2009:48–53, 2009.