TSAR : a scalable shared memory many-cores architecture with global cache coherence
Alain Greiner ([email protected])
MPSoC 2009
Outline

- The TSAR project
- The low-power challenge
- The scalability challenge
- The SoCLib virtual prototyping platform
- First experimental results
- Conclusion
The TSAR Project

TSAR is a Medea+ project that started in June 2008. One central goal of this project is to define and implement a coherent, scalable shared-memory, many-core architecture: up to 4096 cores. The project leader is BULL.

Industrial partners:
- ACE (Netherlands)
- Bull S.A. (France)
- Compaan (Netherlands)
- NXP Semiconductors (Netherlands)
- Philips Medical Systems (Netherlands)
- Thales Communications (France)

Academic partners:
- Université Paris 6 (LIP6)
- Technical University Delft (CEL)
- University Leiden (LIACS)
- FZI (Karlsruhe)
Main Technical Choices

- Processor: the architecture is independent of the selected processor core. It will use an existing core such as SPARC V8, MIPS32, PPC 405, ARM, etc.
- Interconnect: the TSAR architecture is clusterized, with a two-level interconnect. The global (inter-cluster) interconnect will use the DSPIN Network on Chip, which supports the VCI/OCP standard.
- Memory: the TSAR architecture will support a NUMA shared address space. The memory is logically shared but physically distributed, with cache coherence enforced by hardware, using a directory-based protocol. The physical address is 40 bits.
Clusterized Architecture

[Block diagram: each cluster contains four CPU + FPU cores with split L1 instruction and data caches, a memory cache, and Timer / DMA / ICU peripherals, all connected by a VCI/OCP local interconnect. A NIC in each cluster links the local interconnect to the DSPIN micro-network; I/O controllers and external RAM network access attach to the same network.]
Low-power architecture (1)

  Architecture style                   instr./cycle   power consumption   Oper/Joule
  Old-fashioned microprogrammed            0.2               1               0.2
  Single-instruction-issue pipeline        1                 2               0.5
  Superscalar out-of-order                 2                20               0.1

The TSAR processor will be an existing, simple 32-bit RISC processor with single instruction issue (MIPS32 in the first prototype):
- no superscalar architecture,
- no multi-threading,
- no speculative execution,
- no out-of-order execution.
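The table's reasoning is simply efficiency = throughput / power; a minimal sketch (using the slide's relative figures, which are illustrative rather than measured) reproduces the Oper/Joule column:

```cpp
#include <cassert>

// Illustrative sketch only: the numbers plugged in below are the slide's
// relative figures, not measurements. At a fixed clock frequency, operations
// per Joule is proportional to (instructions per cycle) / (power consumption).
struct Design {
    const char* name;
    double ipc;     // instructions per cycle
    double power;   // relative power consumption
};

double ops_per_joule(const Design& d) {
    return d.ipc / d.power;
}
```

With the slide's numbers, the microprogrammed core (0.2, 1) yields 0.2, the single-issue pipeline (1, 2) yields 0.5, and the superscalar out-of-order core (2, 20) yields 0.1, which is why TSAR picks the single-issue design point.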
Low-power architecture (2)

[Figure: 4 x 4 grid of clusters, each in its own clock domain (ck00 ... ck33), linked by the DSPIN micro-network.]

- TSAR will use the DSPIN Network on Chip technology developed by LIP6.
- Each cluster is implemented in a different clock domain.
- Inter-cluster communications use bi-synchronous (or fully asynchronous) FIFOs.
=> Clock frequency & voltage can be independently adjusted in each cluster.
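The decoupling role of these FIFOs can be sketched at the transaction level (a toy bounded queue with back-pressure, assumed for illustration; this is not the DSPIN RTL):

```cpp
#include <queue>

// Transaction-level toy model of a bi-synchronous FIFO: the producer cluster
// calls push() on its own clock edges, the consumer cluster calls pop() on
// its own, so the two sides never need a common clock. A full FIFO exerts
// back-pressure on the producer; an empty FIFO stalls the consumer.
struct BisyncFifo {
    std::queue<int> data;
    std::size_t capacity;
    explicit BisyncFifo(std::size_t cap) : capacity(cap) {}

    bool push(int flit) {                 // producer clock domain
        if (data.size() >= capacity) return false;   // back-pressure
        data.push(flit);
        return true;
    }
    bool pop(int& flit) {                 // consumer clock domain
        if (data.empty()) return false;  // nothing to consume yet
        flit = data.front();
        data.pop();
        return true;
    }
};
```

Because correctness depends only on the full/empty handshake, each cluster's frequency and voltage can be tuned without renegotiating anything with its neighbours.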
Low-power architecture (3)

TSAR is a NUMA (Non-Uniform Memory Access) architecture: the physical memory space is statically distributed amongst the clusters (1 GByte per cluster in the figure), with relative access costs of 1 for a local access, 5 for a remote access, and 20 for an external access.

The operating system can precisely control the mapping of:
- tasks on the processors,
- software objects on the memory caches.
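Static distribution means the home cluster of any byte can be computed from the physical address alone. The field layout below is a hypothetical sketch (the actual TSAR memory map is not given in the slides): with a 40-bit physical address and a 32 x 32 mesh, the top bits can carry the (x, y) coordinates of the home cluster, leaving a 30-bit local offset, i.e. 1 GByte per cluster as in the figure.

```cpp
#include <cstdint>

// Hypothetical address layout, assumed for illustration:
//   [39..35] = x coordinate, [34..30] = y coordinate, [29..0] = local offset.
struct ClusterCoord { unsigned x, y; };

constexpr unsigned kXBits = 5, kYBits = 5;             // 32 x 32 mesh
constexpr unsigned kLocalBits = 40 - kXBits - kYBits;  // 30 bits = 1 GByte/cluster

// Any initiator can route a request to its home cluster from the address alone.
ClusterCoord home_of(uint64_t paddr) {
    return { static_cast<unsigned>((paddr >> (kLocalBits + kYBits)) & 0x1F),
             static_cast<unsigned>((paddr >> kLocalBits) & 0x1F) };
}
```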
Cache coherence

- The coherence protocol must support up to 4096 processors.
- It must be guaranteed by the hardware, for compatibility with software applications running on multi-core PCs.
- Both data & instruction caches are concerned by the cache coherence protocol.
- As snooping is not possible with a Network on Chip supporting a large number of simultaneous transactions, TSAR will rely on the general directory-based approach: each memory cache must register all copies stored in the (4096 * 2) L1 caches...
The Write-Through / Write-Back issue

- In Write-Through policies, all write requests are immediately forwarded to the main memory: the L1 cache is always up to date.
- In Write-Back policies, the L1 cache is updated, but the main memory is only modified when the modified cache line is replaced.

The Write-Back policy is generally used in discrete multiprocessor architectures, as it saves transactions on the system bus, which is usually the architecture bottleneck... but it is - at least - one order of magnitude more complex than Write-Through, because any L1 cache can acquire exclusive ownership of any cache line.

The TSAR cache coherence protocol exploits the scalable bandwidth provided by the NoC technology, and chooses a Write-Through policy. This is the main price to pay for protocol scalability...
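A minimal sketch of the write-through write path described above (names are illustrative, not the SoCLib API): on a write, the L1 updates its copy on a hit and always forwards the request, so it never holds a dirty line and never needs exclusive ownership.

```cpp
#include <cstdint>
#include <unordered_map>
#include <utility>
#include <vector>

// Toy write-through L1 with one-word lines. The "bus" vector stands in for
// the stream of write transactions forwarded to the memory cache.
struct WriteThroughL1 {
    std::unordered_map<uint64_t, uint32_t> lines;    // addr -> cached word
    std::vector<std::pair<uint64_t, uint32_t>> bus;  // forwarded writes

    void write(uint64_t addr, uint32_t data) {
        auto it = lines.find(addr);
        if (it != lines.end()) it->second = data;    // update local copy on hit
        bus.emplace_back(addr, data);                // always forward (write-through)
    }
};
```

Contrast with write-back, where the forward would only happen on line replacement, and the directory would have to track which single L1 owns the dirty copy.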
Update / Invalidate messages

[Figure: a clusterized platform where CPU A, in one cluster, writes to an address X while CPUs B & C, in other clusters, hold a copy.]

When CPU A writes to address X, the memory cache that is the home directory for address X must send update or invalidate requests to CPUs B & C.

=> The key point is the implementation of the Distributed Global Directory...
Directory implementation (1)

Solution 1: vector of bits + multicast policy.

We use the characteristics of the memory cache: the cache being inclusive, each cache line has a home directory in one single memory cache.

=> The cache directory is extended to include a vector of N bits (one bit per processor) recording the copies.

This approach is not scalable...
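A sketch of Solution 1 (illustrative, not the SoCLib implementation) makes the scalability problem concrete: with 4096 processors and split I/D caches, every directory entry drags an 8192-bit presence vector, whatever the actual number of copies.

```cpp
#include <bitset>
#include <vector>

// One presence bit per L1 cache: 4096 processors x 2 (I + D), assuming a
// separate bit per cache. The 8192-bit vector is paid per directory entry,
// even when a line has zero or one copy - hence "not scalable".
constexpr int kNCaches = 4096 * 2;

struct BitVectorEntry {
    std::bitset<kNCaches> copies;

    void register_copy(int l1_id) { copies.set(l1_id); }

    // On a write, the home cache multicasts to exactly the registered copies.
    std::vector<int> multicast_targets() const {
        std::vector<int> t;
        for (int i = 0; i < kNCaches; ++i)
            if (copies.test(i)) t.push_back(i);
        return t;
    }
};
```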
Directory implementation (2)

Solution 2: chained lists of pointers in a local Copy Table + multicast policy.

The memory cache directory entry holds the Tag (34 bits) and the head of a linked list of copies; the Copy Table stores one record per copy: L1_id (13 bits) and Next (13 bits).

This solution is not scalable, as the Copy Table can overflow...
Directory implementation (3)

Solution 3: counter of copies + broadcast policy.

The memory cache directory entry holds only the Tag (34 bits) and a Count of copies (14 bits).

This solution is not scalable, as the coherence traffic will be too high...
Directory implementation (4)

Solution 4: distributed linked list in memory cache & L1 caches + multicast policy.

[Figure: the copies X, Y, Z of a line are chained through PREV / NEXT pointers stored in the memory cache (MC) and in the L1 caches holding the copies.]

This solution is scalable, but the dead-lock prevention is a nightmare...
DHCCP cache coherence protocol

TSAR implements the DHCCP protocol (Distributed Hybrid Cache Coherence Protocol), which is a mix of solution 2 & solution 3:
- multicast / update when the number of copies is smaller than a predefined threshold,
- broadcast / invalidate when the number of copies is larger than the threshold.

DHCCP has been analysed from the point of view of dead-lock prevention. Three types of traffic have been identified:
- direct read/write transactions (L1 caches => Mem caches),
- invalidate/update/cleanup transactions (Mem caches => L1 caches),
- external memory transactions (Mem caches => Ext. RAM).

=> Two fully separated virtual channels will be supported by the DSPIN network.
=> A broadcast service will be implemented in the DSPIN network.
=> Access to the external RAM is implemented by a separate network.
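The hybrid decision above can be sketched as follows (names and the threshold value are illustrative assumptions, not the actual SoCLib memory-cache code): the home entry keeps an exact copy set while it is small, and degrades to a simple counter plus broadcast/invalidate once the set would overflow.

```cpp
#include <set>
#include <vector>

// Toy DHCCP-style directory entry: Solution 2 (exact copy set, multicast
// updates) below the threshold, Solution 3 (counter, broadcast invalidate)
// above it.
struct DhccpEntry {
    static constexpr std::size_t kThreshold = 4;  // assumed value, not from the slides
    std::set<int> copies;         // exact copy set while small
    std::size_t count = 0;        // copy counter once in broadcast mode
    bool broadcast_mode = false;

    void register_copy(int l1_id) {
        if (!broadcast_mode) {
            copies.insert(l1_id);
            if (copies.size() > kThreshold) {     // overflow: switch policy
                broadcast_mode = true;
                count = copies.size();
                copies.clear();
            }
        } else {
            ++count;
        }
    }

    // On a write: fill `targets` and multicast updates, or (returning true)
    // broadcast an invalidate to all L1 caches and reset the entry.
    bool on_write(std::vector<int>& targets) {
        if (!broadcast_mode) {
            targets.assign(copies.begin(), copies.end());
            return false;
        }
        targets.clear();
        count = 0;
        broadcast_mode = false;
        copies.clear();
        return true;
    }
};
```

The broadcast fallback is what lets the directory entry stay fixed-size: it trades exactness (and thus coherence traffic) for bounded storage, which is the hybrid compromise the slide describes.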
The SoCLib Platform

- SoCLib is a cooperative project funded by the French authorities ("Agence Nationale de la Recherche"): 6 industrial companies & 10 academic laboratories joined in 2006 to build an "open modeling & simulation platform for Multi-Processor System on Chip".
- The core of the platform is a library of SystemC simulation models for reusable hardware components (IP cores), with a guaranteed path to silicon.
- Several software tools (including embedded operating systems and communication middleware) are integrated in the platform.
- All SoCLib simulation models are available as open-source software, under the LGPL license (www.soclib.fr).
SoCLib Partners

Academic: LIP6, INRIA, CEA-LIST, ENST, IRISA, LESTER, IETR, CEA-LETI, TIMA, CITI.
Industrial: ST Microelectronics, Thales, Thomson, Orange, Magillem Data Service, TurboConcept.
SoCLib technical choices

The main concern is interoperability:
- All SoCLib components use the same communication protocol: VCI/OCP.
- All SoCLib components will be packaged using the IP-XACT (ex-SPIRIT) description standard.
- Two simulation models are provided for each IP core:
  - CABA: Cycle-Accurate / Bit-Accurate,
  - TLM-DT: Transaction-Level Model with Distributed Time.
SoCLib Library (partial content)

- General-purpose processors: MIPS32, SPARC V8, MicroBlaze, Nios, ARM 7, ARM 9, PowerPC 450, PowerPC 750.
- Digital signal processors: ST200, Texas C6X.
- System utilities: interrupt controller, locks engine, DMA controller, MWMR controller.
- VCI/OCP interconnects: generic, DSPIN, Pibus, Token Ring.
TSAR V0 virtual prototype

- The first running virtual prototype implements a simplified architecture:
  - The DSPIN micro-network is replaced by a simplified interconnect (generic VCI micro-network, available in SoCLib).
  - The DHCCP coherence protocol is not fully implemented: broadcast/invalidate transactions are not supported, and the number of cores is limited to 16.
- The goal is to evaluate the hardware complexity of the "memory cache", and the "optimal" number of processors per cluster.
- The multi-threaded software application running on this virtual prototype is an insertion sort written in C (without embedded OS).
[Block diagram of the TSAR V0 prototype: four clusters, each with four MIPS32 processors, a memory cache, and a local direct-transactions network; a separate coherence-transactions network links the clusters, and an external-transactions network connects the memory caches to the external RAM.]
Processor efficiency

[Figure: CPI versus number of processors per cluster (1 to 8), for three sort benchmarks: Selection, Insertion, Bubble.]

- In this measurement, the CPI (average number of cycles per instruction) increases when the number of processors per cluster is larger than 4.
- The largest simulated configuration was a 16-processor architecture (4 clusters of 4 processors).
- The simulation speed (for cycle-accurate simulation) was 31,000 cycles/second on a 1.8 GHz Linux PC.
Conclusion

- The first year of the TSAR project resulted in the definition of a hybrid cache coherence protocol (DHCCP), implementing different policies for data & instruction caches. We expect this architecture to be scalable up to 4096 processors (32 * 32 clusters), thanks to an advanced network-on-chip technology supporting both packet-switched virtual channels and a broadcast service.
- The SoCLib virtual prototyping platform was very useful for quantitative analysis of various implementations of the TSAR architecture, especially to evaluate trade-offs between cost & performance.
- To promote the TSAR architecture, the SystemC virtual prototype will be available as open-source software on the SoCLib web site: www.soclib.fr
Thank You

To get the slides: [email protected]