Review of parallel computing methods and tools

Review of parallel computing methods and tools
Radoslaw Cieszewski, Maciej Linczuk, Krzysztof Pozniak, Ryszard Romaniuk
Institute of Electronic Systems, Warsaw University of Technology, Nowowiejska 15/19, Warsaw, Poland
May 20, 2013
ABSTRACT
Parallel computing is emerging as an important area of research in computer architectures and software systems. Many algorithms can be greatly accelerated using parallel computing techniques. Specialized parallel computer architectures are used for accelerating specific tasks. Measurement systems in High-Energy Physics Experiments often use FPGAs for fine-grained computation. An FPGA combines many benefits of both software and ASIC implementations. Like software, the mapped circuit is flexible and can be reconfigured over the lifetime of the system. FPGAs therefore have the potential to achieve far greater performance than software by bypassing the fetch-decode-execute cycle of traditional processors and by exploiting a greater level of parallelism. Creating parallel programs implemented in FPGAs is not trivial. This paper presents existing methods and tools for fine-grained computation implemented in FPGAs using High Level Programming Languages.
Keywords: Parallel Computing, FPGA, ASIC, DSP, High-Energy Physics Experiment, KX1, Fine-Grained Parallelism, Coarse-Grained Parallelism

1. INTRODUCTION
Parallel computing is a form of computation in which many arithmetical, logical and input/output operations are processed simultaneously. A parallel computer is based on the principle that large problems can be divided into subtasks, which are then solved concurrently. There are many classes of parallel computers:
- Symmetric Multiprocessing (SMP),
- Multicore Computing,
- Grid Computing,
- Cluster Computing,
- Parallel Computing on Graphics Processing Units (GPUs),
- Reconfigurable Computing on FPGAs,
- Parallel Computing on ASICs.
In High-Energy Physics Experiments, FPGAs are widely used for accelerating diagnostic algorithms [1-4]. An FPGA has a set of hardware resources, including logic and routing, whose function is controlled by on-chip configuration SRAM. Programming the SRAM, either at the start of an application or during execution, allows the hardware functionality to be configured and reconfigured. This approach makes it possible to implement different algorithms and applications on the same hardware. The principal difference, compared to traditional microprocessors, is the ability to make substantial changes to logic and routing. This technique is termed "time-multiplexed hardware" or "run-time reconfigured hardware". The main goal of this article is to survey efficient tools for FPGA programming with a High Level Programming Language.
Outline
The remainder of this article is organized as follows. Section 2 gives a theoretical base for parallel computing. Tools and methods are compared in Section 3. Finally, Section 4 gives the conclusions.

2. BACKGROUND
Traditionally, computer software has been written for serial computation. To solve a problem, an algorithm is constructed and implemented as a serial stream of instructions. These instructions are executed on the central processing unit of one computer. Only one instruction may execute at a time; after that instruction is finished, the next one is executed. Parallel computing uses multiple resources simultaneously to solve a problem. This is accomplished by mapping the algorithm into independent parts, so that each part of the algorithm can be executed simultaneously with the others. Such parts of a parallel program are often called fibers, threads, processes or tasks. If the tasks communicate many times per second, the algorithm exhibits fine-grained parallelism; if the subtasks communicate rarely, it exhibits coarse-grained parallelism. FPGAs are very flexible and can be programmed to accelerate both fine-grained and coarse-grained applications. FPGAs are also scalable: sometimes an algorithm is so complex that it cannot fit into one FPGA, and multi-FPGA systems are then used.

2.1 Amdahl's law and Gustafson's law
Amdahl's law defines the maximum possible speed-up of an algorithm on a parallel computer:

    S(P) = 1 / (α + (1 − α)/P),    lim_{P→∞} S(P) = 1/α

Gustafson's law defines the speed-up with P processors:

    S(P) = P − α(P − 1) = α + P(1 − α)

where α is the fraction of running time a program spends on non-parallelizable parts and P is the number of processors. These two formulas show that acceleration depends on the algorithm. Sequential parts of code cannot be parallelized, so beyond some point the speed-up becomes independent of the number of processors. Gustafson's law assumes that the parallel portion of the workload grows linearly with the number of processors.

2.2 Parallelism level
FPGA-based computers can exploit parallelism:
- at the bit level,
- using pipeline techniques,
- at the instruction level,
- at the data level,
- at the task level.
The main advantage over conventional microprocessors is the ability to perform calculations at the lowest level of parallelism - the bit level. According to Flynn's taxonomy, FPGAs are classified as Multiple Instruction Multiple Data (MIMD) machines.


2.3 Memory and communication architectures
Parallel computing defines two main memory models:
- Shared Memory,
- Distributed Memory.
Shared memory refers to a block of memory that can be accessed by several different processing units of a parallel computer. This memory may be accessed simultaneously by multiple tasks, either to provide communication among them or to avoid redundant copies; it is an efficient means of passing data between tasks. In distributed memory systems, each processing unit has its own local memory and computation is done on local data. If remote data is required, the subtask must communicate with remote processing units. This means that remote processors are engaged in the operation, which adds overhead. Hybrid architectures combine these two approaches: each processing element has its own local memory as well as access to the memory of remote processors. The system architecture affects the choice of parallel programming models and tools presented in the next chapter.

3. METHODS AND TOOLS
Standard CPU, DSP, multicore and GPU array architectures are dedicated to coarse-grained computation, while FPGAs achieve better performance with fine-grained computation [5]. Hybrid architectures give the best performance for complex computations. Figure 1 shows the best architectures for different granularities of calculation.

Figure 1. Recent Trend of Programmable and Parallel Technologies

Because of the variety of computer architectures, many tools have been created for parallel programming (see Table 1). These tools provide several built-in methods to facilitate fast implementation of algorithms:
- High Level Programming Language (HLPL),
- Graphical Description (GD),
- Implicit Parallelism Extraction (IPE),
- Automatic Pipeline Stages Generation (APSG),
- Multi FPGAs Compilation (MFC).

Name | Open/Proprietary | Architecture | Methods Support | Additional information
OpenMP [6] | Open | CPU | HLPL | Shared Memory API. Widely used
POSIX threads [7] | Open | CPU | HLPL | Shared Memory API
MPI [8] | Open | CPU | HLPL | Widely used Distributed Memory API
OpenACC [9] | Open | CPU, GPU | HLPL | Similar to OpenMP
OpenHMPP [10] | Open | CPU, GPU | HLPL | Programming standard for heterogeneous computing
CUDA [11] | Open | GPU | HLPL | Widely used GPU library
Matlab Parallel Computing Toolbox [12] | Proprietary (MathWorks) | CPU, GPU | HLPL | Performs parallel computations on CPU, GPU and clusters
OpenCL [13] | Open (*Proprietary Altera) | CPU, GPU, (*FPGA) | HLPL | *Altera SDK for OpenCL [14]
Matlab HDL Coder [12] | Proprietary (MathWorks) | FPGA | GD, HLPL, APSG | Verilog and VHDL code generation from Matlab functions, System objects and Simulink blocks
Mitrion-C [15] | Proprietary (Mitrionics) | FPGA | GD, HLPL, APSG | Mitrion Virtual Processor and SDK called Mitrion-C
ImpulseC [16] | Proprietary (Impulse AT) | FPGA | HLPL, APSG | C to VHDL or Verilog RTL. C-compatible functions library supporting parallel programming
Handel-C [17] | Proprietary (Mentor Graphics) | FPGA | HLPL, APSG | C to Verilog or VHDL. Syntax similar to occam. Similar open project: FpgaC
DIME-C [18] | Proprietary (Nallatech) | FPGA | HLPL, APSG, IPE | ANSI C to VHDL compiler and a parallelization visualizer
Vivado [19] | Proprietary (Xilinx) | FPGA | GD, HLPL, APSG, IPE | ANSI C/C++/SystemC/TLM to Verilog or VHDL RTL
eXCite [20] | Proprietary (Y Explorations) | FPGA | GD, HLPL, APSG | ANSI C to Verilog, VHDL or SystemC RTL
CatapultC [21] | Proprietary (Calypto DS) | FPGA | HLPL, APSG | ANSI C/C++/SystemC/TLM
Cynthesizer 5 [22] | Proprietary (Forte DS) | FPGA | HLPL, APSG | SystemC/TLM
C-to-Silicon Compiler [23] | Proprietary (Cadence) | FPGA | HLPL, APSG | ANSI C/C++/SystemC/TLM
Synphony C Compiler [24] | Proprietary (Synopsys) | FPGA | HLPL | C/C++ to RTL
BlueSpec Compiler [25] | Proprietary (BlueSpec) | FPGA | HLPL, APSG | BSV language based on Haskell/TLM to Verilog RTL, SystemC
CyberWorkBench [26] | Proprietary (NEC) | FPGA | HLPL, APSG, IPE | ANSI C/SystemC to Verilog/VHDL/LSI
HercuLeS [27] | Proprietary (Ajax Compilers) | FPGA | HLPL, APSG | ANSI C and NAC to VHDL
SystemC [28] | Open | FPGA | HLPL | Accellera and OSCI (Open SystemC Initiative)
PandA framework and Bambu [29] | Open | FPGA | HLPL, APSG, IPE | C to HDL. Tool being developed at Politecnico di Milano (Italy)
LegUp | Open | FPGA, DSP | HLPL, APSG | C to Verilog RTL. Tool being developed at the University of Toronto (Canada)
xPilot [30] | Open | FPGA | HLPL, APSG | C or SystemC to RTL. Tool being developed at the University of California (USA)
Shang [31] | Open | FPGA | HLPL | C to Verilog RTL
C-to-Verilog [32] | Open | FPGA | HLPL, APSG, IPE | C to Verilog
ROCCC-2.0 | Open | FPGA | GD, HLPL, APSG, IPE | C to VHDL
JHDL [33] | Open | FPGA | HLPL | Java to EDIF netlist
MyHDL [34] | Open | FPGA | HLPL | Python to VHDL or Verilog
FpgaC [35] | Open | FPGA | HLPL | C to EDIF netlist. Tool being developed at the University of Toronto (Canada)
Trident [36] | Open | FPGA | HLPL, APSG, IPE | C/C++ to VHDL
RHDL [37] | Open | FPGA | HLPL | Ruby to VHDL or Verilog
Table 1: Parallel computing tools comparison

The most important methods for diagnostics systems in High-Energy Physics Experiments such as JET are: High Level Programming Language, Implicit Parallelism Extraction, Automatic Pipeline Stages Generation and Multi FPGAs Compilation. The source code of such a tool must be open.

4. CONCLUSIONS AND FUTURE WORK
A review of parallel computing methods and tools has been presented. Many tools have been created to implement parallel programs effectively (Table 1); some of them are dedicated to FPGA integrated circuits. Many algorithms can be greatly accelerated by placing their computationally intense portions on FPGAs. However, no tool meets all the requirements of High-Energy Physics Experiment diagnostics systems, e.g. KX1 [38]. Such systems are built with many FPGAs, and creating a program for each FPGA separately is difficult and time-consuming. Writing parallel computer programs is more difficult than writing sequential ones: the programmer must take care of synchronization, concurrency, data consistency and parallelization. Widely parametrized software with built-in methods - automatic synchronization, implicit parallelism extraction and pipeline stage generation for customized multi-FPGA systems - can improve the process of creating parallel programs. Creating such a tool will be the subject of further research.

REFERENCES [1] Janicki, T., Cieszewski, R., Kasprowicz, G. H., and Pozniak, K., “Fpga mezzanine card dsp module,” 80080K–1–80080K–7 (2011). [2] Cieszewski, R. and Linczuk, M. G., “Universal dsp module interface,” Proceedings of SPIE: Photonics Applications in Astronomy, Communications, Industry, and High-Energy Physics Experiments 7745, 7745– 1T1–7745–1T1–6 (2010).

[3] Zabolotny, W., Czarski, T., Maryna, C., Henryk, C., Ryszard, D., Dominik, W., Katarzyna, J., Leslaw, K., Kasprowicz, G. H., Kierzkowski, K., Kudla, I. M., Pozniak, K., Jacek, R., Zbigniew, S., and Marek, S., “Optimization of fpga processing of gem detector signal,” Proceedings of SPIE: Photonics Applications in Astronomy, Communications, Industry, and High-Energy Physics Experiments , 80080G–1–80080G–7 (2011). [4] Romaniuk, R., Pozniak, K., Czarski, T., Czuba, K. M., Giergusiewicz, W., Kasprowicz, G. H., and Koprek, W., “Optical network and fpga/dsp based control system for free electron laser,” Bulletin of the Polish Academy of Science, Technical Science 2, 123–138 (1996). [5] “Implementing fpga design with the opencl standard.” www.altera.com/literature/wp/wp-01173-opencl.pdf. [6] “The openmp homepage.” http://openmp.org/. [7] “The posix threads tutorial.” https://computing.llnl.gov/tutorials/pthreads/. [8] “The open mpi homepage.” http://www.open-mpi.org/. [9] “The openacc homepage.” http://www.openacc.org/. [10] “The openhmpp homepage.” http://www.caps-entreprise.com/openhmpp-directives/. [11] “The cuda homepage.” http://www.nvidia.com/object/cuda_home_new.html. [12] “The mathworks homepage.” http://www.mathworks.com. [13] “The opencl homepage.” https://developer.nvidia.com/opencl. [14] “The altera sdk for opencl homepage.” http://www.altera.com/products/software/opencl/openclindex.html. [15] “The mitrionics homepage.” http://mitrionics.com/. [16] “The impulsec homepage.” http://www.impulseaccelerated.com/products_universal.htm. [17] “The handel-c homepage.” http://www.mentor.com/products/fpga/handel-c/. [18] “The dime-c homepage.” http://www.nallatech.com/Development-Tools/dime-c.html. [19] “The xilinx vivado homepage.” http://www.xilinx.com/products/design-tools/vivado/. [20] “The excite homepage.” http://www.yxi.com/products.php. [21] “The catapultc homepage.” http://calypto.com/en/products/catapult/overview. 
[22] “The cynthesizer 5 homepage.” http://www.forteds.com/products/cynthesizer.asp. [23] “The c-to-silicon homepage.” http://www.forteds.com/products/cynthesizer.asp. [24] “The synphony c compiler homepage.” http://www.synopsys.com/Systems/BlockDesign/HLS/Pages/SynphonyCCompiler.aspx. [25] “The bluespec compiler homepage.” http://www.bluespec.com/products.html. [26] “The cyberworkbench homepage.” http://www.nec.com/en/global/prod/cwb/. [27] “The hercules homepage.” http://www.ajaxcompilers.com/technology/hercules-high-level-synthesis. [28] “The systemc homepage.” http://www.accellera.org/downloads/standards/systemc. [29] “The panda framework and bambu homepage.” http://panda.dei.polimi.it/. [30] “The xpilot homepage.” http://cadlab.cs.ucla.edu/soc/. [31] “The shang homepage.” https://github.com/OpenEDA/Shang. [32] “The c-to-verilog homepage.” http://www.c-to-verilog.com/. [33] “The jhdl homepage.” http://www.jhdl.org/. [34] “The myhdl homepage.” http://www.myhdl.org/. [35] “The fpgac homepage.” http://sourceforge.net/p/fpgac/. [36] “The trident homepage.” http://sourceforge.net/projects/trident/. [37] “The rhdl homepage.” http://rhdl.rubyforge.org/. [38] Rzadkiewicz, J., Chernyshova, M., Jakubowska, K., Karpiski, L., Scholz, M., Dominik, W., Czyrkowski, H., Dabrowski, R., Kierzkowski, K., Salapa, Z., Zastrow, K.-D., Tyrrell, S., Price, D., Sverker, G., Blanchard, P., Pozniak, K., Kasprowicz, G. H., and Zabolotny, W., “Report on the design of the final kx1 gem detectors,” tech. rep. (2011). [39] Compton, K. L., Architecture Generation of Customized Reconfigurable Hardware, PhD thesis, Northwestern University (2003).