Improving STM Performance with Transactional Structs

Ryan Yates and Michael L. Scott
University of Rochester

IFL, 8-31-2016

This work was funded in part by the National Science Foundation under grants CCR-0963759, CCF-1116055, CCF-1337224, and CCF-1422649, and by support from the IBM Canada Centres for Advanced Studies.

Outline

  Haskell STM
  TStruct
  Performance results
  Future work

Slides: http://goo.gl/65pEZo Paper: http://goo.gl/pVQlh4

What is Transactional Memory?

Transactional memory is the joining of two ideas:

  The ability to express what should be atomic without saying how.
  An implementation that uses speculation to optimistically execute and try again if needed.
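As a minimal sketch of "what, not how", the following uses the stock GHC STM interface exported by GHC.Conc in base. The names transfer and demoTransfer are hypothetical and do not come from the talk.

```haskell
import GHC.Conc (STM, TVar, atomically, newTVarIO, readTVar, readTVarIO, writeTVar)

-- Move `amount` from one balance to the other. The body only states what
-- must happen together; it says nothing about locks or retry.
transfer :: TVar Int -> TVar Int -> Int -> STM ()
transfer from to amount = do
  a <- readTVar from
  b <- readTVar to
  writeTVar from (a - amount)
  writeTVar to (b + amount)

-- Hypothetical demo: run one transfer and return both balances.
demoTransfer :: IO (Int, Int)
demoTransfer = do
  x <- newTVarIO 100
  y <- newTVarIO 0
  atomically (transfer x y 30)
  (,) <$> readTVarIO x <*> readTVarIO y
```

The runtime executes the block speculatively and re-runs it if a conflicting commit is detected, so the caller never sees a state where only one balance has changed.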

Existing Haskell STM Implementation

In STM execution, reads and writes to TVars are tracked in a transactional record (TRec). Execution continues under the assumption that there have been no conflicts. A conflict occurs when two threads access the same location and at least one of the accesses is a write.

At the end of the transaction, the RTS validates that the values read still match the values in the TVars, then commits by performing the writes atomically.

If validation fails, the transaction starts over.

Similar to OSTM [Fraser, 2004]:
  No global bottlenecks.
  Read-only transactions do not acquire locks.
  OSTM is non-blocking; GHC's STM can livelock.

Motivation

Haskell STM concurrent data structures suffer bloat from indirection.

[Figure: heap layout of a Node (key, value, color, parent, left, right) together with its separate TVar objects (version, watch-list, value) and a boxed Int# value.]

Our Work

TStruct
  Removes indirection required by TVars.
  Mutable unboxed values paired with mutable pointer values.
  Increases locality of data structure nodes.

Maintains properties:
  Commit parallels TVar commit.
  No global bottleneck.
  Read-only transactions do not acquire locks.

Avoids conflating lock and value.
Flexible transactional variable granularity.

Data Structures

  Red-Black Tree
  Skip List
  Cuckoo Hash Table
  Hashed Array Mapped Trie (HAMT)

Red-Black Tree

Rebalancing

TVar
  Extra indirection
  No mutable unboxed values (Color field)

TStruct
  Node initialization
  Nil node and indirection
  Accesses at constant offsets

Extra Indirection

-- TVar
data Node k v
    = Node { _key    :: !k
           , _value  :: !v
           , _color  :: TVar Color
           , _parent :: TVar (Node k v)
           , _left   :: TVar (Node k v)
           , _right  :: TVar (Node k v)
           }
    | Nil
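To make the cost of this layout concrete, here is a small sketch of a lookup over the TVar-based node, written against stock GHC STM (GHC.Conc in base). The functions lookupNode, leaf, and demoLookup are hypothetical illustrations, not code from the talk.

```haskell
import GHC.Conc (STM, TVar, atomically, newTVar, readTVar)

data Color = Red | Black

data Node k v
  = Node { _key :: !k, _value :: !v
         , _color :: TVar Color
         , _parent, _left, _right :: TVar (Node k v) }
  | Nil

-- Every step down the tree dereferences a TVar before reaching the next Node,
-- and each readTVar adds an entry to the transactional record.
lookupNode :: Ord k => k -> Node k v -> STM (Maybe v)
lookupNode _ Nil = return Nothing
lookupNode k n = case compare k (_key n) of
  EQ -> return (Just (_value n))
  LT -> readTVar (_left n) >>= lookupNode k
  GT -> readTVar (_right n) >>= lookupNode k

-- Hypothetical helper: a node whose parent and children are all Nil.
leaf :: k -> v -> STM (Node k v)
leaf k v = Node k v <$> newTVar Black <*> newTVar Nil <*> newTVar Nil <*> newTVar Nil

-- Hypothetical demo: build a three-node tree and look up two keys.
demoLookup :: IO (Maybe String, Maybe String)
demoLookup = atomically $ do
  l <- leaf 1 "one"
  r <- leaf 3 "three"
  root <- Node 2 "two" <$> newTVar Black <*> newTVar Nil <*> newTVar l <*> newTVar r
  (,) <$> lookupNode (3 :: Int) root <*> lookupNode 4 root
```

Each hop follows two pointers (Node to TVar, TVar to Node), which is the indirection the TStruct layout is meant to remove.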

Avoiding Sum Type Indirection

-- TStruct with sum type
data Node k v
    = Node { _tstruct :: TStruct# RealWorld (Node k v) }
    | Nil

-- TStruct without sum type
data Node k v = Node { _tstruct :: TStruct# RealWorld Any }

Skip List

No rebalancing needed
Random number source

TVar
  Pure node with TArray of next pointers
  Extra indirection

TStruct
  TStruct containing both values and next pointers
  Node initialization

Node Initialization

When a new node is created in a transaction, no other thread can see it until the transaction has committed. We take advantage of this and access such nodes non-transactionally. In the skip list, this happens on insertion: the new node is created, and its next pointer at each level is written to match the previous node at that level.

Cuckoo Hash Table

Only insert needs to do significant work.

TVar
  TArray of immutable buckets

TStruct
  Immutable array of TStruct buckets

[Figure: cuckoo hash table layouts. TVar version: a TArray (size, then version, watch-list, value per element) whose values refer to immutable bucket Arrays (size, entry ...). TStruct version: an Array (size, entry ...) of TStruct buckets (lock, lock-count, version, watch-list, count, key1 ... keyN, value1 ... valueN).]

Hashed Array Mapped Trie (HAMT)

No rebalancing

TVar
  Extra indirection

TStruct
  Node initialization
  Immutable fields
  Node tags

Immutable fields

Fields that are immutable can safely be read non-transactionally. No bookkeeping needed!

Two primitive read functions:
  Transactional: readTStruct#, implemented in Cmm and C.
  Non-transactional: readTStructNT#, implemented in the code generator.

Things can go wrong in very unexpected ways!

Node tags

-- TVar
data Node a
    = Nodes (TVar (WordArray a))
    | Leaf Hash a
    | Leaves Hash (SizedArray a)

data WordArray a = WordArray Bitmap (Array (Node a))
data SizedArray a = SizedArray Size (Array a)
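The slides do not show how the Bitmap in a WordArray is consulted. The sketch below assumes the usual HAMT scheme, in which a slice of the hash selects a logical slot and a population count over the bitmap gives the position in the compact child array. The width of 6 bits per level and the names slot and childIndex are assumptions for illustration only.

```haskell
import Data.Bits (bit, popCount, shiftR, testBit, (.&.))
import Data.Word (Word64)

-- Assumed parameter (not from the slides): 6 hash bits per trie level,
-- which gives 64 logical slots and a 64-bit bitmap.
bitsPerLevel :: Int
bitsPerLevel = 6

-- Which of the 64 logical slots the hash selects at the given level.
slot :: Word64 -> Int -> Int
slot hash level = fromIntegral ((hash `shiftR` (level * bitsPerLevel)) .&. 63)

-- Position of the slot's child in the compact array, if the slot is occupied:
-- the number of occupied slots below it.
childIndex :: Word64 -> Int -> Maybe Int
childIndex bitmap s
  | testBit bitmap s = Just (popCount (bitmap .&. (bit s - 1)))
  | otherwise        = Nothing
```

For example, with slots 0, 3, and 5 occupied (bitmap 41), slot 3 maps to array index 1 and slot 4 is absent.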

-- TStruct
data Node a
    = WordArray Size Bitmap (Array (Node a))
    | SizedArray Size Hash (Array a)

data Node a = Node { _tstruct :: TStruct# RealWorld Any }

HAMT Nodes

[Figure: TVar-based HAMT heap layout. Node objects (tag=Nodes, holding a tvar; tag=Leaves, holding a hash and an array) refer to TVar objects (version, watch-list, value), which refer to WordArray objects (bitmap, array); arrays are SizedArray objects (size, entry ...).]

[Figure: TStruct-based HAMT heap layout. A TVar (version, watch-list, value) refers to the trie; each WordArray and SizedArray is a single object with a header (lock, lock-count, version, watch-list) followed by tag=0, size, bitmap, entry ... for a WordArray, or tag=1, size, hash, entry ... for a SizedArray.]

Code Example

Example from lookup in the TVar-based HAMT. Pattern matching ensures we do not handle a leaf as a node.

lookupTVar ... = do
    arr ...
        Just (Nodes ns) -> ...
        Just (Leaves h la) -> ...

Code Example

In the TStruct-based HAMT lookup we lose safety:
  No bounds check in readTStructWordNT#.
  Nodes can be confused with leaves.

readTagNT (WordArray arr#) = STM $ \s1# ->
    case readTStructWordNT# arr# 0# s1# of
        (# s2#, w# #) -> (# s2#, W# w# #)

lookupTStruct ... arr = do
    t <- readTagNT arr
    case t of
        0 -> do ... readIndicesNT arr ...
        1 -> do ... readHashNT arr ...

Benchmarks

Machine
  Intel® Xeon™ E5-2699 v3, two-socket, 36-core, 72-thread

Tests
  Data structure with concurrent inserts (5%), deletes (5%), and lookups (90%), measuring throughput at steady state.
  Structure initially has 50,000 entries in a key space of 100,000 keys.
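The operation mix described above can be sketched as a simple mapping from a uniformly drawn number to an operation. This is an illustration of the 5/5/90 split only; the benchmark harness itself is not shown in the slides, and the names Op, pickOp, and mixCounts are hypothetical.

```haskell
data Op = Insert | Delete | Lookup deriving (Eq, Show)

-- Map a number drawn uniformly from [0, 100) to an operation:
-- 5% inserts, 5% deletes, 90% lookups.
pickOp :: Int -> Op
pickOp r
  | r < 5     = Insert
  | r < 10    = Delete
  | otherwise = Lookup

-- Count how often each operation occurs over one pass through [0, 100).
mixCounts :: (Int, Int, Int)
mixCounts = (count Insert, count Delete, count Lookup)
  where
    ops = map pickOp [0 .. 99]
    count op = length (filter (== op) ops)
```

With equal insert and delete rates and keys drawn uniformly from the 100,000-key space, a structure that starts at 50,000 entries stays near that size on average, which is consistent with measuring at steady state.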

TVar

[Figure: throughput of the TVar-based RBTree, SkipList, Cuckoo, and HAMT. Operations per second (×10^7) versus threads (1 to 72) on the Intel® Xeon™ E5-2699 v3, two-socket, 36-core machine.]

HAMT

[Figure: throughput of the TVar, TStruct, and CTrie implementations. Operations per second (×10^8) versus threads (1 to 72) on the Intel® Xeon™ E5-2699 v3, two-socket, 36-core machine.]

Cuckoo Hash

[Figure: throughput of the TVar and TStruct implementations. Operations per second (×10^7) versus threads (1 to 72) on the Intel® Xeon™ E5-2699 v3, two-socket, 36-core machine.]

Skip List

[Figure: throughput of the TVar and TStruct implementations. Operations per second (×10^7) versus threads (1 to 72) on the Intel® Xeon™ E5-2699 v3, two-socket, 36-core machine.]

Red-Black Tree

[Figure: throughput of the TVar and TStruct implementations. Operations per second (×10^7) versus threads (1 to 72) on the Intel® Xeon™ E5-2699 v3, two-socket, 36-core machine.]

Future Work

Continue to improve performance and understand what factors contribute to good performance.

Recover safety for TStruct features:
  Node initialization
  Node tagging
  Accesses at constant offsets

Other data structures

Thanks!

Slides: http://goo.gl/65pEZo Paper: http://goo.gl/pVQlh4

Haskell STM Metadata Structure

[Figure: a Node (key, value, parent, left, right, color) refers to a TVar (value, watch); the TVar's watch queue is a list of Watch Queue entries (thread, next, prev); a TRec (prev, index) holds a sequence of entries (tvar, old, new).]

Haskell Before TStruct

[Figure: two Node objects (key, value, parent, left, right, color) linked through a separate TVar (value, watch).]

Haskell with TStruct

[Figure: two Node objects, each with a lock and watch header followed directly by key, value, color, parent, left, right.]

Haskell STM commit

bool commit(TRec* trec) {
    if (validate(trec)) {
        if (read_check(trec)) {
            update(trec);
            return true;
        }
    }
    return false;
}

bool validate(TRec* trec) {
    for (e in trec) {
        if (is_write(e)) {
            // Lock each written TVar and check it still holds the value seen.
            if (!lock(e) || e->value != e->tvar->value) {
                release_locks(trec);
                return false;
            }
        } else {
            // Record the version observed for each read-only entry.
            e->version = e->tvar->version;
        }
    }
    return true;
}

bool read_check(TRec* trec) {
    for (e in trec) {
        if (is_read(e)) {
            // Fail if a read-only entry changed since validate.
            if (e->value != e->tvar->value || e->version != e->tvar->version) {
                release_locks(trec);
                return false;
            }
        }
    }
    return true;
}

void update(TRec* trec) {
    for (e in trec) {
        if (is_write(e)) {
            e->tvar->version++;
            e->tvar->value = e->new_value;
        }
    }
}
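The commit steps above can be mirrored in a small pure Haskell model. This is a deliberately simplified, single-threaded sketch: it has no locks, so validate and read_check collapse into one check that every entry still sees the value it read. All names here (Cell, Entry, Heap, and the functions) are hypothetical and are not the RTS data structures.

```haskell
import qualified Data.Map.Strict as M

type Var = String

-- A modelled TVar: a version counter and a value.
data Cell = Cell { version :: Int, value :: Int } deriving (Eq, Show)

type Heap = M.Map Var Cell

-- One modelled TRec entry: the variable, the value seen when it was first
-- accessed, and the new value if the entry is a write.
data Entry = Entry { var :: Var, seen :: Int, newValue :: Maybe Int }

-- Every entry must still see the value it read (no locking in this model).
validate :: Heap -> [Entry] -> Bool
validate heap = all ok
  where ok e = fmap value (M.lookup (var e) heap) == Just (seen e)

-- Apply the writes, bumping the version of each written variable.
update :: Heap -> [Entry] -> Heap
update = foldl write
  where
    write h (Entry v _ (Just new)) = M.adjust (\c -> Cell (version c + 1) new) v h
    write h _ = h

-- Commit succeeds only if validation passes; otherwise the caller restarts.
commit :: Heap -> [Entry] -> Maybe Heap
commit heap trec
  | validate heap trec = Just (update heap trec)
  | otherwise          = Nothing
```

A transaction that read x = 1 and y = 2 and writes x = 10 commits against a heap holding those values; the same record fails to commit if x has since changed, which corresponds to the "start over" path.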

References

[Fraser, 2004] Fraser, K. (2004). Practical lock-freedom. PhD thesis, University of Cambridge Computer Laboratory.
