Bulk-synchronous pseudo-streaming algorithms for many-core accelerators

Jan-Willem Buurlage∗, Tom Bannink†, Abe Wits‡

∗ Centrum voor Wiskunde en Informatica (CWI), Amsterdam, The Netherlands, [email protected]
† CWI / QuSoft, Amsterdam, The Netherlands, [email protected]
‡ Coduin, Utrecht, The Netherlands, [email protected]
Abstract

The bulk-synchronous parallel (BSP) model provides a framework for writing parallel programs with predictable performance. In this paper we extend the BSP model to support what we will call pseudo-streaming algorithms for accelerators. We also generalize the BSP cost function to these algorithms, so that it is possible to predict the running time of programs targeting many-core accelerators and to identify possible bottlenecks. Several examples of algorithms within this new framework are explored. We extend the BSPlib standard by proposing a small number of new BSP primitives to create and use streams in a portable way. We introduce a software library called Epiphany BSP that implements these ideas for the Parallella development board. Finally, we give experimental results for pseudo-streaming algorithms on the Parallella platform.

Keywords: Bulk-synchronous parallel · Software library · Matrix-matrix multiplication · Sparse matrix-vector multiplication · Many-core coprocessor · Streaming algorithm
1 Introduction
In the bulk-synchronous parallel (BSP) model, introduced by Valiant in 1990 [19], the computer is assumed to consist of p identical processors together with a communication network over which these processors can exchange data. A BSP algorithm is structured as a number of supersteps. Each superstep consists of a computation phase and a communication phase. At the end of each superstep a barrier synchronization is performed between the cooperating processors, so that the next superstep is initiated only after every processor has completely finished its communication. Each processor runs the same program, but on different data, adhering to the Single Program Multiple Data (SPMD) paradigm.

A BSP algorithm is imagined to run on an abstract BSP computer. This computer consists of p processors, assumed to be identical, which each have access to their own local memory. There is also a network layer which can be used by a processor to
communicate with a remote processor. The cost of a barrier synchronization at the start and end of communication is denoted by l, and the communication cost per data word is denoted by g. These are usually expressed in numbers of floating-point operations (FLOPs), and related to time through the computation rate r of the processor, which is measured in floating-point operations per second (FLOPS). The four parameters (p, g, l, r) define a BSP computer completely.

Each BSP algorithm has an associated cost, which can be expressed completely using the parameters of a BSP computer. We denote by $w_i^{(s)}$ the amount of work performed in the $i$th superstep by processor $s$. We assume that the communication cost results only from the sending and receiving of data. We denote by $r_i^{(s)}$ the number of data words received, and by $t_i^{(s)}$ the number of data words transmitted, by processor $s$ in superstep $i$. The $h$-relation of this superstep is defined as the maximum number of words transmitted or received by any processor, i.e. $h_i = \max_{0 \leq s < p} \max\left(r_i^{(s)}, t_i^{(s)}\right)$.
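To make the superstep structure and the associated cost concrete, the sketch below shows a single-superstep SPMD program written against the classic BSPlib C interface (bsp_begin, bsp_push_reg, bsp_put, bsp_sync). It is an illustration under these assumptions, not code taken from Epiphany BSP, whose host/kernel setup differs. Every processor sends one word to its right neighbour, so each processor transmits and receives exactly one word: the superstep realizes a 1-relation and, in the cost model above, costs w + g + l.

#include <bsp.h>
#include <stdio.h>

/* One superstep: every processor puts one word into its right neighbour's
   memory. Each processor transmits and receives exactly one word, so this
   superstep realizes a 1-relation with BSP cost w + g + l. */
void spmd(void) {
    bsp_begin(bsp_nprocs());
    int p = bsp_nprocs();
    int s = bsp_pid();

    int received = 0;
    bsp_push_reg(&received, sizeof(int)); /* expose local memory for remote puts */
    bsp_sync();                           /* registration takes effect after this barrier */

    int value = s * s;                    /* some local work w */
    bsp_put((s + 1) % p, &value, &received, 0, sizeof(int));
    bsp_sync();                           /* communication is complete after this barrier */

    printf("processor %d received %d\n", s, received);
    bsp_end();
}

int main(int argc, char **argv) {
    bsp_init(spmd, argc, argv);
    spmd();
    return 0;
}

The explicit bsp_sync calls are what make the cost of each superstep predictable: all bsp_put traffic issued in a superstep is delivered at the next barrier, and is therefore counted in the h-relation of that superstep.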