Improving STM Performance with Transactional Structs
Ryan Yates and Michael L. Scott, University of Rochester
IFL, 8-31-2016

This work was funded in part by the National Science Foundation under grants CCR-0963759, CCF-1116055, CCF-1337224, and CCF-1422649, and by support from the IBM Canada Centres for Advanced Studies.
Outline
- Haskell STM
- TStruct
- Performance results
- Future work

Slides: http://goo.gl/65pEZo
Paper: http://goo.gl/pVQlh4
What is Transactional Memory?
Transactional memory is the joining of two ideas:
- The ability to express what should be atomic without saying how.
- An implementation that uses speculation to execute optimistically and try again if needed.
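As a minimal sketch of "what should be atomic without saying how", the block below uses GHC's standard `stm` library. The `Account` type and the `transfer` function are hypothetical names chosen for illustration; they do not come from the talk.

```haskell
import Control.Concurrent.STM

-- Hypothetical account type for illustration: a balance held in a TVar.
type Account = TVar Int

-- Move `amount` from one account to another as a single atomic step.
-- We only state *what* is atomic; the runtime decides how to achieve it.
transfer :: Account -> Account -> Int -> STM ()
transfer from to amount = do
  balance <- readTVar from
  -- `check` retries the transaction until enough funds are available.
  check (balance >= amount)
  writeTVar from (balance - amount)
  modifyTVar' to (+ amount)

main :: IO ()
main = do
  a <- newTVarIO 100
  b <- newTVarIO 0
  atomically (transfer a b 30)
  balances <- atomically ((,) <$> readTVar a <*> readTVar b)
  print balances
```

No locks appear in `transfer`; composing it with other STM actions inside one `atomically` keeps the whole composition atomic.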
Existing Haskell STM Implementation
In STM execution:
- Reads and writes to TVars are tracked in a transactional record (TRec).
- Execution continues under the assumption that there have been no conflicts.
- A conflict occurs when two threads access the same location and at least one access is a write.

At the end of the transaction:
- The RTS validates that the reads still match the values in the TVars.
- It then commits by performing the writes atomically.
- If validation fails, the transaction starts over.

Properties:
- Similar to OSTM [Fraser, 2004].
- No global bottlenecks.
- Read-only transactions do not acquire locks.
- OSTM is non-blocking, whereas GHC’s STM can livelock.
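The "validate, commit, or start over" cycle is invisible to the programmer. The sketch below, written against the standard `stm` library, has several threads increment one shared TVar; `runCounter` is a hypothetical helper name. Whenever two increments conflict, the runtime re-runs the losing transaction, so no update is lost.

```haskell
import Control.Concurrent (forkIO)
import Control.Concurrent.MVar (newEmptyMVar, putMVar, takeMVar)
import Control.Concurrent.STM
import Control.Monad (forM_, replicateM_)

-- Run `workers` threads that each increment a shared TVar `perWorker`
-- times, then return the final value.
runCounter :: Int -> Int -> IO Int
runCounter workers perWorker = do
  counter <- newTVarIO 0
  done <- newEmptyMVar
  forM_ [1 .. workers] $ \_ -> forkIO $ do
    replicateM_ perWorker $
      -- The read and the write are logged; if another thread commits in
      -- between, validation fails and the runtime re-runs this block.
      atomically $ do
        n <- readTVar counter
        writeTVar counter (n + 1)
    putMVar done ()
  -- Wait for every worker to signal completion.
  replicateM_ workers (takeMVar done)
  readTVarIO counter

main :: IO ()
main = runCounter 4 10000 >>= print
```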
Motivation
Haskell STM concurrent data structures suffer bloat from indirection.

[Diagram: pointer chain of heap objects: a TVar (version, watch-list, value), a Node (key, value, color, parent, left, right), a second TVar (version, watch-list, value), and a boxed Int# value.]
Our Work
TStruct:
- Removes indirection required by TVars.
- Mutable unboxed values paired with mutable pointer values.
- Increases locality of data structure nodes.

Maintains properties:
- Commit parallels TVar commit.
- No global bottleneck.
- Read-only transactions do not acquire locks.

- Avoids conflating lock and value.
- Flexible transactional variable granularity.
Data Structures
- Red-Black Tree
- Skip List
- Cuckoo Hash Table
- Hashed Array Mapped Trie (HAMT)
Red-Black Tree
- Rebalancing
- TVar
  - Extra indirection
  - No mutable unboxed values (Color field)
- TStruct
  - Node initialization
  - Nil node and indirection
  - Accesses at constant offsets
Extra Indirection
-- TVar
data Node k v
  = Node { _key    :: !k
         , _value  :: !v
         , _color  :: TVar Color
         , _parent :: TVar (Node k v)
         , _left   :: TVar (Node k v)
         , _right  :: TVar (Node k v)
         }
  | Nil
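To make the extra indirection concrete, here is a small lookup sketch over a TVar-based tree using the standard `stm` library. The node type is a simplification of the one on the slide (color and parent are omitted because lookup does not need them), and `lookupNode` and `leaf` are hypothetical helper names.

```haskell
import Control.Concurrent.STM

-- Simplified from the slide's Node type: color and parent are omitted
-- because lookup does not need them.
data Node k v
  = Node { _key   :: !k
         , _value :: !v
         , _left  :: TVar (Node k v)
         , _right :: TVar (Node k v)
         }
  | Nil

-- Each step down the tree reads a TVar and then follows its value
-- pointer to the child Node: two dereferences per level.
lookupNode :: Ord k => k -> Node k v -> STM (Maybe v)
lookupNode _ Nil = pure Nothing
lookupNode k (Node nk nv l r) =
  case compare k nk of
    EQ -> pure (Just nv)
    LT -> readTVar l >>= lookupNode k
    GT -> readTVar r >>= lookupNode k

-- Build a leaf node whose children are both Nil.
leaf :: k -> v -> STM (Node k v)
leaf k v = Node k v <$> newTVar Nil <*> newTVar Nil

main :: IO ()
main = do
  found <- atomically $ do
    l <- leaf 1 "one"
    r <- leaf 3 "three"
    root <- Node (2 :: Int) "two" <$> newTVar l <*> newTVar r
    (,) <$> lookupNode 3 root <*> lookupNode 4 root
  print found
```

Every `readTVar` in `lookupNode` also adds an entry to the transaction's read set, which is part of the cost that a flatter node layout aims to reduce.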
Avoiding Sum Type Indirection
-- TStruct with sum type
data Node k v
  = Node { _tstruct :: TStruct# RealWorld (Node k v) }
  | Nil

-- TStruct without sum type
data Node k v = Node { _tstruct :: TStruct# RealWorld Any }
Skip List
- No rebalancing needed
- Random number source
- TVar
  - Pure node with TArray of next pointers
  - Extra indirection
- TStruct
  - TStruct containing both values and next pointers
  - Node initialization
Node Initialization
When a new node is made in a transaction, no other thread can see it until the transaction has committed. We take advantage of this and access these nodes non-transactionally. In the skip list, this happens on insertion: the new node is created and its next pointers are written to match the previous node at each level.
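The TStruct primitives are not available in stock GHC, so the sketch below only illustrates the shape of the idea with ordinary TVars from the `stm` library. It uses a one-level list instead of a full skip list, and the names `Node`, `insertAfter`, and `keysFrom` are hypothetical. The new node's next pointer receives its value at creation, while it is still private to the transaction, and only the predecessor's pointer is a write to shared state.

```haskell
import Control.Concurrent.STM

-- A one-level "skip list" (a sorted linked list), used here only to keep
-- the sketch short. Names are hypothetical.
data Node = Node { nodeKey :: Int, nodeNext :: TVar (Maybe Node) }

-- Insert `k` after `prev`, assuming it belongs between `prev` and
-- prev's current successor.
insertAfter :: Node -> Int -> STM Node
insertAfter prev k = do
  successor <- readTVar (nodeNext prev)
  -- The new node is private to this transaction: its next pointer is
  -- set to match the previous node's pointer at creation time.
  new <- Node k <$> newTVar successor
  -- Publishing the node is the only write to shared state.
  writeTVar (nodeNext prev) (Just new)
  pure new

-- Collect the keys reachable from a node, in list order.
keysFrom :: Node -> STM [Int]
keysFrom n = do
  next <- readTVar (nodeNext n)
  rest <- maybe (pure []) keysFrom next
  pure (nodeKey n : rest)

main :: IO ()
main = do
  ks <- atomically $ do
    lastNode <- Node 30 <$> newTVar Nothing
    headNode <- Node 10 <$> newTVar (Just lastNode)
    _ <- insertAfter headNode 20
    keysFrom headNode
  print ks
```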
Cuckoo Hash Table
- Only insert needs to do significant work.
- TVar
  - TArray of immutable buckets
- TStruct
  - Immutable array of TStruct buckets
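[Diagram: cuckoo hash table memory layouts.
TVar version: a TArray (size, then version, watch-list, value for each element) whose elements point to immutable bucket Arrays (size, entry ...).
TStruct version: an immutable Array (size, entry ...) whose entries point to TStruct buckets (lock, lock-count, version, watch-list, count, key1 key2 ... keyN, value1 value2 ... valueN).]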
Hashed Array Mapped Trie (HAMT)
- No rebalancing
- TVar
  - Extra indirection
- TStruct
  - Node initialization
  - Immutable fields
  - Node tags
Immutable fields
- Fields that are immutable can safely be read non-transactionally. No bookkeeping needed!
- Two primitive read functions:
  - Transactional readTStruct#, implemented in Cmm and C.
  - Non-transactional readTStructNT#, implemented in the code generator.
- Things can go wrong in very unexpected ways!
Node tags
-- TVar
data Node a
  = Nodes (TVar (WordArray a))
  | Leaf Hash a
  | Leaves Hash (SizedArray a)

data WordArray a = WordArray Bitmap (Array (Node a))
data SizedArray a = SizedArray Size (Array a)

-- TStruct
data Node a
  = WordArray Size Bitmap (Array (Node a))
  | SizedArray Size Hash (Array a)

data Node a = Node { _tstruct :: TStruct# RealWorld Any }
HAMT Nodes
[Diagram: TVar-based HAMT. Nodes with tag=Nodes hold a tvar field that points to a TVar (version, watch-list, value); the TVar's value points to a WordArray (bitmap, array), whose array is a SizedArray (size, entry ...). A Node with tag=Leaves holds a hash and an array that points to a SizedArray (size, entry ...).]
[Diagram: TStruct-based HAMT. A TVar (version, watch-list, value) refers to the trie, which consists of two WordArray structs and one SizedArray struct. Every struct starts with the header lock, lock-count, version, watch-list. A WordArray continues with tag=0, size, bitmap, entry ...; a SizedArray continues with tag=1, size, hash, entry ....]
Code Example
Example from lookup in the TVar-based HAMT. Pattern matching ensures we do not handle a leaf as a node.

lookupTVar ... = do
  arr ...
    Just (Nodes ns)    -> ...
    Just (Leaves h la) -> ...
In the TStruct-based HAMT lookup we lose safety:
- No bounds check in readTStructWordNT#.
- Nodes can be confused with leaves.

readTagNT (WordArray arr#) = STM $ \s1# ->
  case readTStructWordNT# arr# 0# s1# of
    (# s2#, w# #) -> (# s2#, W# w# #)

lookupTStruct ... arr = do
  t <- readTagNT arr
  case t of
    0 -> do ... readIndicesNT arr ...
    1 -> do ... readHashNT arr ...
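To illustrate what is lost, the sketch below models a node header as a plain list of words and dispatches on the tag with explicit checks. This is not how the TStruct primitives work; the layout (word 0 is the tag, word 1 the size, word 2 the bitmap or hash) is an assumption read off the HAMT diagram, and `View` and `decodeHeader` are hypothetical names.

```haskell
-- Hypothetical flat layout mirroring the diagram: word 0 is the tag,
-- word 1 the size, word 2 the bitmap (tag 0) or the hash (tag 1).
data View
  = NodesView  { viewSize :: Word, viewBitmap :: Word }
  | LeavesView { viewSize :: Word, viewHash :: Word }
  deriving (Eq, Show)

-- Decode the header words, rejecting short arrays and unknown tags
-- instead of reading out of bounds or treating a leaf as a node.
decodeHeader :: [Word] -> Either String View
decodeHeader (tag : size : w : _)
  | tag == 0  = Right (NodesView size w)
  | tag == 1  = Right (LeavesView size w)
  | otherwise = Left ("unknown tag " ++ show tag)
decodeHeader _ = Left "header too short"

main :: IO ()
main = do
  print (decodeHeader [0, 2, 5])
  print (decodeHeader [1, 3, 42])
  print (decodeHeader [7, 0, 0])
  print (decodeHeader [0])
```

The unchecked version on the slide skips both tests: it reads word 0 without a bounds check and trusts the caller to interpret the remaining words according to the tag.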
Benchmarks
Machine:
- Two-socket, 36-core, 72-thread Intel Xeon E5-2699 v3

Tests:
- Data structure with concurrent inserts (5%), deletes (5%), and lookups (90%), measuring throughput at steady state.
- Structure initially has 50,000 entries in a key space of 100,000 keys.
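The operation mix can be sketched as a pure generator. This is not the benchmark harness used for the results; the linear congruential generator and the names `Op`, `nextState`, `draw`, `chooseOp`, and `workload` are assumptions made so that the example needs only `base`.

```haskell
import Data.Bits (shiftR)
import Data.Word (Word64)

data Op = Insert Int | Delete Int | Lookup Int
  deriving (Eq, Show)

-- Hypothetical 64-bit linear congruential generator (wraps on overflow);
-- a real benchmark would use a better random source.
nextState :: Word64 -> Word64
nextState s = s * 6364136223846793005 + 1442695040888963407

-- Use the high 31 bits, which are better distributed than the low bits.
draw :: Word64 -> Int
draw s = fromIntegral (s `shiftR` 33)

-- Map two random draws to an operation: 5% inserts, 5% deletes,
-- 90% lookups, with keys drawn from a key space of 100,000.
chooseOp :: Int -> Int -> Op
chooseOp r1 r2
  | p < 5     = Insert key
  | p < 10    = Delete key
  | otherwise = Lookup key
  where
    p   = r1 `mod` 100
    key = r2 `mod` 100000

-- Generate `n` operations from a seed.
workload :: Word64 -> Int -> [Op]
workload seed n = take n (go (nextState seed))
  where
    go s =
      let s' = nextState s
      in chooseOp (draw s) (draw s') : go (nextState s')

main :: IO ()
main = do
  let ops = workload 2016 100000
      inserts = length [() | Insert _ <- ops]
      deletes = length [() | Delete _ <- ops]
      lookups = length [() | Lookup _ <- ops]
  print (inserts, deletes, lookups)
```

Because inserts and deletes are equally likely and keys are uniform, a structure that starts at 50,000 of 100,000 keys stays near half full, which matches the steady-state setup described above.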
TVar
[Figure: throughput of the TVar-based RBTree, SkipList, Cuckoo, and HAMT. x-axis: threads (1 to 72); y-axis: operations per second (·10^7, 0 to 4). Machine: Intel Xeon E5-2699 v3, two-socket, 36-core.]
HAMT
[Figure: HAMT throughput for TVar, TStruct, and CTrie. x-axis: threads (1 to 72); y-axis: operations per second (·10^8, 0 to 2.5). Machine: Intel Xeon E5-2699 v3, two-socket, 36-core.]
Cuckoo Hash
[Figure: cuckoo hash table throughput for TVar and TStruct. x-axis: threads (1 to 72); y-axis: operations per second (·10^7, 0 to 6). Machine: Intel Xeon E5-2699 v3, two-socket, 36-core.]
Skip List
[Figure: skip list throughput for TVar and TStruct. x-axis: threads (1 to 72); y-axis: operations per second (·10^7, 0 to 3). Machine: Intel Xeon E5-2699 v3, two-socket, 36-core.]
Red-Black Tree
[Figure: red-black tree throughput for TVar and TStruct. x-axis: threads (1 to 72); y-axis: operations per second (·10^7, 0 to 3). Machine: Intel Xeon E5-2699 v3, two-socket, 36-core.]
Future Work
- Continue to improve performance and understand what factors contribute to good performance.
- Recover safety for TStruct features:
  - Node initialization
  - Node tagging
  - Accesses at constant offsets
- Other data structures
Thanks!
Slides: http://goo.gl/65pEZo
Paper: http://goo.gl/pVQlh4
Haskell STM Metadata Structure
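[Diagram: STM metadata. A Node (key, value, parent, left, right, color) refers to a TVar (value, watch); the TVar's watch field refers to a list of Watch Queue entries (thread, next, prev). A TRec (prev, index) holds a sequence of entries (tvar, old, new), each referring to a TVar.]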
Haskell Before TStruct
[Diagram: before TStruct. A Node (key, value, parent, left, right, color) points to a TVar (value, watch), which in turn points to the next Node (key, value, parent, left, right, color).]
Haskell with TStruct
[Diagram: with TStruct. Two Nodes, each a single heap object with a header (lock, watch) followed by the fields key, value, color, parent, left, right.]
Haskell STM commit
bool commit(TRec* trec) {
  if (validate(trec)) {          // lock writes, record read versions
    if (read_check(trec)) {      // re-check every read
      update(trec)               // publish the new values
      return true
    }
  }
  return false
}

bool validate(TRec* trec) {
  for (e in trec) {
    if (is_write(e)) {
      // Lock the TVar and confirm it still holds the expected value.
      if (!lock(e) || e->value != e->tvar->value) {
        release_locks(trec)
        return false
      }
    } else {
      // Remember the version seen for this read.
      e->version = e->tvar->version
    }
  }
  return true
}

bool read_check(TRec* trec) {
  for (e in trec) {
    if (is_read(e)) {
      // Fail if the value or version changed since validate.
      if (e->value != e->tvar->value
          || e->version != e->tvar->version) {
        release_locks(trec)
        return false
      }
    }
  }
  return true
}

void update(TRec* trec) {
  for (e in trec) {
    if (is_write(e)) {
      e->tvar->version++
      e->tvar->value = e->new_value
    }
  }
}
References
[Fraser, 2004] Fraser, K. (2004). Practical lock-freedom. PhD thesis, University of Cambridge Computer Laboratory.