A Novel Architecture and Simulation for Executing Decoupled Threads in Future 1-kilo-core Chips

Ho Nam, Antoni Portero, Alberto Scionti, Roberto Giorgi
Universita' degli Studi di Siena, Dipartimento di Ingegneria dell'Informazione
Via Roma, 56 - 53100 Siena, Italy

ABSTRACT
T-Star (T*) is an ISA extension that supports a promising execution model for exploiting Thread Level Parallelism (TLP) in next-generation chip designs. The model relies on DataFlow principles: a compiler partitions the program into non-blocking threads, each of which starts consuming its own data frame once all of its inputs become ready. We believe this model is particularly efficient for future systems composed of thousands of cores on a single chip, because it reduces synchronization delays among parallel threads. In this paper we describe initial work towards simulating 1-kilo-core DataFlow-enabled chips.

KEYWORDS: Dataflow; Multi-core; Multi-Thread

1 Introduction

In recent decades, microprocessor performance has been driven mainly by the exploitation of instruction-level parallelism (ILP). However, techniques such as out-of-order execution and large cache hierarchies increased the complexity and power consumption of these architectures. New approaches that exploit TLP to improve the performance and throughput of multithreaded programs have led to designs with many execution cores, i.e., chip multiprocessors (CMPs). Moreover, technology scaling driven by Moore's Law will make it feasible to double the number of cores on a chip every two years, leading to thousands of cores on a single chip (we refer to these architectures as kilo-core chips). Chips of this kind are expected to be in common use within roughly the next ten years [1]. Among the recent approaches to exploiting multithreaded execution on these next generations of CMPs, the models of data-driven, non-blocking thread execution described in [2], [3], [5], [8] are promising. In these approaches, a sequential application is partitioned by the compiler into DataFlow Threads (DF-Threads), each consisting of the thread code and a data frame. All memory accesses to data frames are decoupled from the execution of the thread code. Given the feasibility of having one thousand or more cores on a chip, distributing DF-Threads to the available hardware resources raises many challenges for the design of efficient microarchitectures. We present a novel architecture for the era of 1-kilo-core chips, together with a simulation framework that allows us to experiment with the DataFlow execution model on it.

2 Basic Paradigm for T* Execution

In the T* approach [2], [3], [5], [8], the compiler implicitly embeds both dynamic and static data dependencies among threads in the code of each thread. A data frame (i.e., the portion of memory associated with a thread) is allocated for every newly created thread, and each thread writes its results into the data frames of its consumers. A DF-Thread becomes ready to be scheduled for execution when all of its input data are available. We summarize this execution paradigm with the simple example shown in Figure 1. We assume that the sequential code is compiled into the x86-64 ISA, extended with the instructions that manage threads' data frames and code. In the example, the main routine dynamically allocates three data frames for three simple DF-Threads (add, mul, and div) through the tschedule instruction. Then, a and b are written to the frames of the add and mul DF-Threads through the twrite instruction. For every twrite instruction, the synchronization counter (i.e., a counter that keeps track of the number of inputs the thread still needs before it can be scheduled for execution) is decremented; once it reaches zero, the corresponding thread becomes ready for execution. Finally, the main, add, mul, and div DF-Threads finish their execution by destroying their data frames through the tdestroy instruction.

Figure 1: An Example of T* execution
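To make the flow of Figure 1 concrete, the following C-like sketch mimics the example using hypothetical compiler intrinsics. Only tschedule, twrite, and tdestroy are named in the text; the __tread frame-read primitive, the intrinsic signatures, the frame slot layout, and the reading that div consumes the results of add and mul are our assumptions for illustration, not the actual TERAFLUX toolchain API, and the sketch is not compilable as-is since the intrinsics stand for T* instructions emitted by the compiler.

```c
/* Sketch of the Figure 1 example under assumed T* intrinsics. */
typedef unsigned long tid_t;

/* Illustrative declarations; names and signatures are assumptions. */
tid_t __tschedule(void (*code)(void), int sync_count); /* allocate frame, set counter */
void  __twrite(tid_t t, int slot, long value);         /* write a consumer's frame slot */
long  __tread(int slot);                               /* assumed read from own frame   */
void  __tdestroy(void);                                /* free the current data frame   */

static tid_t t_div; /* global for brevity; in the real model the consumer's
                       identity would itself be passed through data frames */

void df_add(void) {
    long r = __tread(0) + __tread(1);   /* a + b */
    __twrite(t_div, 0, r);              /* div's counter: 2 -> 1 */
    __tdestroy();
}

void df_mul(void) {
    long r = __tread(0) * __tread(1);   /* a * b */
    __twrite(t_div, 1, r);              /* div's counter: 1 -> 0, div becomes ready */
    __tdestroy();
}

void df_div(void) {
    long r = __tread(0) / __tread(1);   /* (a + b) / (a * b), result fed to a consumer */
    (void)r;
    __tdestroy();
}

void df_main(long a, long b) {
    tid_t t_add = __tschedule(df_add, 2);  /* waits for 2 inputs: a, b   */
    tid_t t_mul = __tschedule(df_mul, 2);  /* waits for 2 inputs: a, b   */
    t_div       = __tschedule(df_div, 2);  /* waits for add, mul results */
    __twrite(t_add, 0, a); __twrite(t_add, 1, b);  /* add becomes ready  */
    __twrite(t_mul, 0, a); __twrite(t_mul, 1, b);  /* mul becomes ready  */
    __tdestroy();
}
```

Note how every twrite decrements the target thread's synchronization counter, so add and mul can run in parallel as soon as main fills their frames, while div starts only after both producers have written their results.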

3 TERAFLUX Architecture

The TERAFLUX high-level hardware architecture [4], [9] that we consider is depicted on the left side of Figure 2. The architecture is designed to execute a large number of DF-Threads. It is divided into many nodes, each equipped with many execution cores, and we also assume that each node has its own memory. Nodes are connected via a dedicated inter-node network-on-chip (NoC). To schedule DF-Threads onto the cores, a Distributed Thread Scheduling Unit (D-TSU) is implemented at the node level, together with a Local Thread Scheduling Unit (L-TSU) within each core. These hardware units are responsible for allocating the data frames requested by other nodes and for distributing the DF-Threads to be executed among the cores, balancing the workload. For this purpose, we base the architecture of the cores on the x86-64 ISA, extended with an Instruction Set Extension (ISE) called T* that supports the DataFlow execution model [2], [3], [5], [8]. The T* architectural support consists of instructions for generating/stopping DF-Threads, operating on input/output data frames, and allocating/freeing data frames. The right side of Figure 2 shows the software level that we target. A Linux OS with a scheduler patch runs on node 0, which acts as the master node; it is in charge of distributing high-level policies and initial information. The Distributed Thread Scheduler (the D-TSUs plus the L-TSUs) is responsible for distributing and allocating the DF-Threads and the data frames produced by the compilation of the application. All the other nodes are considered slaves. To manage the execution of T* instructions efficiently, we consider a distributed OS model with a small kernel, such as the L4 μ-kernel [6], running on the slave nodes. Both the Linux OS on the master node and the L4 OS on the slave nodes see the same physical address space for the whole system, since we assume that a global physical address space makes it easy to distribute DF-Threads and data frames throughout the system. In other words, a global pool of data frames is needed so that each DF-Thread on a node can access any data frame on other nodes through this global physical address space. At the same time, each data frame should be kept close to its DF-Thread, so that the number of remote accesses to other data frames via the NoC stays low.

Figure 2: The TERAFLUX High Level Architecture and The Software Application Level
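One simple way to realize such a global physical address space, shown purely as our illustration (the paper does not specify an address layout), is to reserve the upper bits of the physical address for a node identifier, so that a scheduler or memory controller can distinguish local from remote data frames:

```c
#include <stdbool.h>
#include <stdint.h>

/* Illustrative layout: upper bits select the owning node, lower bits
 * are the offset inside that node's DRAM. Ten node bits cover up to
 * 1024 nodes; this split is an assumption made for the sketch. */
#define NODE_BITS   10
#define OFFSET_BITS (64 - NODE_BITS)
#define OFFSET_MASK ((UINT64_C(1) << OFFSET_BITS) - 1)

static inline uint64_t make_gpa(uint64_t node, uint64_t offset) {
    return (node << OFFSET_BITS) | (offset & OFFSET_MASK);
}

static inline uint64_t gpa_node(uint64_t gpa)   { return gpa >> OFFSET_BITS; }
static inline uint64_t gpa_offset(uint64_t gpa) { return gpa & OFFSET_MASK; }

/* A D-TSU could use a test like this to decide whether a frame access
 * stays inside the node or must travel over the inter-node NoC. */
static inline bool is_remote(uint64_t gpa, uint64_t my_node) {
    return gpa_node(gpa) != my_node;
}
```

With such an encoding, keeping a data frame "close to its DF-Thread" amounts to allocating the frame so that its node bits match the node where the thread is scheduled.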

4 Proposed Simulation Model

The simulation model that we propose is depicted on the left side of Figure 3. It relies on the COTSon simulation infrastructure [7], a full-system simulator that can potentially support the simulation of systems at the kilo-core scale. The simulation model also uses AMD's SimNow as the emulation layer: SimNow provides functional simulation, while the timing model of the system is managed by the COTSon layer. Recently, we have experimented not only with AMD's SimNow but also with the QSim/Qemu emulator, with COTSon providing the timing model and feedback for the Qemu functional emulator. One of the reasons we want to replace SimNow is the need to model the global physical address space described above. To support this memory model, we modify Qemu, which emulates a node with a single physical memory device (i.e., a DRAM block), so that it can also see the physical memory devices of the other nodes, as shown on the right side of Figure 3. At the functional simulation level, we are experimenting with supporting this feature by providing shared memory through the mmap service of the host Linux OS. At the timing simulation level, we consider that all memory accesses issued by a node to other nodes must go through the NoC. In addition, we integrate the T* architectural support into Qemu in order to model the execution of T* instructions.

Figure 3: COTSon+QSim/Qemu
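As an illustration of the functional-level mechanism, a minimal sketch follows, assuming each emulated node runs as a separate Qemu host process; the backing-file path, node count, and per-node memory size are made up for the example. Every node process maps the same host file, so a store performed by one emulated node becomes visible as remote DRAM to all the others:

```c
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

#define NUM_NODES 4UL                    /* example value                    */
#define NODE_MEM  (64UL * 1024 * 1024)   /* 64 MiB of DRAM per emulated node */

/* Map one shared backing file as the global physical memory pool.
 * Each per-node emulator process performs the same mapping, so a write
 * by node i at offset i*NODE_MEM + off is seen by every other node. */
static uint8_t *map_global_memory(void) {
    int fd = open("/tmp/gpa_pool", O_RDWR | O_CREAT, 0600); /* example path */
    if (fd < 0) { perror("open"); return NULL; }
    if (ftruncate(fd, NUM_NODES * NODE_MEM) < 0) {
        perror("ftruncate"); close(fd); return NULL;
    }
    uint8_t *mem = mmap(NULL, NUM_NODES * NODE_MEM,
                        PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd);
    return (mem == MAP_FAILED) ? NULL : mem;
}

/* The emulated DRAM of node n starts at this host pointer. */
static uint8_t *node_dram(uint8_t *pool, unsigned n) {
    return pool + (uint64_t)n * NODE_MEM;
}
```

At the timing level, a COTSon timing model would then charge a NoC latency whenever the node bits of an accessed address differ from those of the issuing node.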

5 Conclusions

This paper presents a novel architecture and a simulation framework targeting future systems composed of one thousand or more cores (i.e., kilo-core chips). The proposed architecture comes with an extension of the current x86-64 ISA, and it represents a general model for designing next-generation chips able to exploit TLP by efficiently scheduling the execution of T* instructions. The described simulation model supports this architecture and can potentially scale up to a full-system simulation with one thousand cores (and more).

Acknowledgments

This work was partly funded by the European FP7 projects TERAFLUX (id. 249013, http://www.teraflux.eu) and ERA (Embedded Reconfigurable Architectures, id. 249059, http://era-project.eu), by HiPEAC (IST-217068), and by IT PRIN 2008 (200855LRP2).

References

[1] Shekhar Borkar, "Thousand Core Chips: A Technology Perspective," Proceedings of the 44th Annual Design Automation Conference (DAC '07), San Diego, CA, USA, June 2007.
[2] R. Giorgi, Z. Popovic, N. Puzovic, "DTA-C: A Decoupled multi-threaded Architecture for CMP Systems," Proceedings of IEEE SBAC-PAD, Gramado, Brazil, Oct. 2007, pp. 263-270.
[3] R. Giorgi, "TERAFLUX: exploiting dataflow parallelism in teradevices," Proceedings of the 9th ACM Conference on Computing Frontiers (CF '12), pp. 303-304.
[4] http://www.teraflux.eu/
[5] Antoni Portero, Alberto Scionti, Zhibin Yu, Paolo Faraboschi, Caroline Concatto, Luigi Carro, Arne Garbade, Sebastian Weis, Theo Ungerer, Roberto Giorgi, "Simulating the Future kilo-x86-64 core Processors and their Infrastructure," 45th Annual Simulation Symposium (ANSS 2012), Orlando, FL, Mar. 2012.
[6] http://www.l4ka.org/
[7] Eduardo Argollo, Paolo Faraboschi, Matteo Monchiero, Daniel Ortega, "COTSon: infrastructure for full system simulation," ACM SIGOPS Operating Systems Review, vol. 43, issue 1, Jan. 2009, pp. 52-61.
[8] Roberto Giorgi, Alberto Scionti, Antoni Portero, Paolo Faraboschi, "Architectural Simulation in the Kilo-core Era," Architectural Support for Programming Languages and Operating Systems (ASPLOS 2012), London, UK, Mar. 2012, pp. 1-3.
[9] Sebastian Weis, Arne Garbade, Julian Wolf, Bernhard Fechner, Avi Mendelson, Roberto Giorgi, Theo Ungerer, "A Fault Detection and Recovery Architecture for a Teradevice Dataflow System," DFM 2011: Data-Flow Execution Models for Extreme Scale Computing, Oct. 2011, pp. 38-44.
