DDM-CMP: Data Driven Multithreading on a Chip Multiprocessor ...
Recommend Documents
become available, rather than being issued in the order as defined by the static .... 3 illustrates the safe definition and release points for a looplive register r3.
TFlux: A Portable Platform for Data-Driven Multithreading on Commodity. Multicore Systems ..... problem sizes are separated into three categories: Small,. Medium, and .... 6.3 TFluxCell. For the TFluxCell implementation, a Sony Playstation 3.
similar to Tango [8] and g88 [4]. .... Table 3: AU possible causes of wasted issue slots, and latency-hiding or latency-reducing techniques that can reduce the ...
W. Wolf is with the Georgia Institute of Technology, Atlanta, GA 30332 USA. (e-mail: .... mented as custom silicon from the FPGA fabric. In other cases, the CPU .... result, the quality and capabilities of the software development environment for ...
The worst-case map portion was searched in the chart ... The worst-case map portion has been used in the ..... of Design Alternatives for a Multiprocessor.
effective tool for profiling the behavior of the MPSoC system is in great need. ... An effective tool to profile the ..... role in Linux kernel, which behave like a black box that makes a ... device drivers are more suitable for our monitor hardware.
May 2, 2009 - code starting from the original application sequen- tial C code. ..... this research are part of the Mosart (Mapping Opti- mization for Scalable ...
May 7, 2004 - for Networking Applications. Valentina Salapura, Christos J. Georgiou, Indira Nair. IBM Research Division. Thomas J. Watson Research Center.
a4-31G basis function set, using GAMESS, on Pentium-III@500MHz with 512MB Main Memory. (Generation of Initial Integrals) for a = 1,Mi for b = 1,Mj for c = 1, ...
The time-sliced arbiter divides the memory access time into equal time slots, one ... allelism (ILP) approach, which was the primary processor design objective ...
Jul 28, 2006 - The advance in embedded systems and multiprocessor trends pave the ... intensive bio-medical applications with huge potential health ...
control model developed with the Matlab and Verilog models . . . 123 ...... As shown in Figure 2.7, the synchronous sender sends data and asserts the Valid ...... In this design, the process of moving from floating-point to fixed-point is a manual.
Dedicated SW. Custom OS. Drivers. MPU core. Specific API. (c). Figure 1. MPSoC platform (a), software stack (b) and concurrent development environment (c).
May 1, 2007 - As application complexity grows, programmable multiprocessor ... software development possible and enable early software design and ...
accordance with the WISHBONE B.3 standard. The exported. VHDL can then be prototyped on an FPGA platform. Even though some similar tools exist, there ...
POWER-4 [14] processor and recently released their second gen- eration CMT processor, the POWER-5, in which each core sup- ports 2-way SMT [15]. AMD ...
step synthesis approach by achieving reductions of up to 38% in communication .... perform all the required NoC data transfers in the allotted time. In the next ...
Department of Electrical and Computer Engineering, University of Florida. P.O.Box 116200 ... Phone: (352)392-5225 ... a larger number of transistors and numerous architectural features for increased performance, the ...... Technical Report, HCS Resea
Jun 23, 2010 - Magnus Jahre and Lasse Natvig, Performance Effects of a Cache Miss ..... [95] P. S. Magnusson, M. Christensson, J. Eskilson, D. Forsgren, ...
[9],[17],[13],[3] as well as commercial implementations such as IBM's Power4 [20] and HP's. Mako [19]. CMPs reduce the impact of interconnect delay on clock ...
Several studies and example systems, such as the Sun Niagara, ..... symmetric
multiprocessor (SMP) architectures composed of multiple microprocessor.
This type of memory has the advantage of allowing an immediate ... the programming of the software that manages the system. (fig 1). But this solution has many drawbacks. Firstly ... The processor is implemented with full custom VLSI chip.
memory system added meta-data bits to our local storage arrays. ... ing much of the initial Verilog for this system, we realized that we were never ..... assertions to the casez statements to make sure none of the inputs were .... models: a tutorial.
Mar 6, 2007 - sor [17], developed by Sony, IBM, and Toshiba, shares many similarities ...... Scalable and High Performance Computing, pp. 284-291, 1992.
DDM-CMP: Data Driven Multithreading on a Chip Multiprocessor ...
The number of transistors needed to build a Pentium 4 is enough for. â 5.7 Pentium III processors. â ~10 âmodified Pentium IIIâ processors. Processor. Freq. Tech ...
DDM-CMP: Data Driven Multithreading on a Chip Multiprocessor Kyriakos Stavrou Paraskevas Evripidou Pedro Trancoso CASPER group Department of Computer Science University Of Cyprus The CASPER group: Computer Architecture System Performance Evaluation Research
Outline • Motivation • Data Driven Multithreading (DDM) model of execution • DDM-CMP architecture • Proof of concept • Experimental Setup • Experimental Results • Thermal Aware Scheduling • Work in progress • Conclusions
Motivation • Significant levels of thread level parallelism exist in programs – – – –
Significant speedup potentials Memory wall Synchronization latencies Communication latencies
Potential Solution: Data Driven Multithreading (DDM)
• Current trend is increasing the Potential Solution: complexity to improve performance Chip Multiprocessors (CMP) – Longer Validation and Design Phases – Increased power/energy consumption – Marginal performance improvement potential
Motivation • DDM-CMP targets all the three aspects mentioned in the previous slide – Decreases complexity – Benefits from the available thread level parallelism – Offers significant potential for thermal behavior improvement
• Our preliminary experimental results have shown that DDM-CMP achieves speedup ranging from 5.2 to 14.9 compared to an equal hardware budget high-end state-of-the-art single chip microprocessor for the SPLASH2 kernels studied
Outline • Motivation • Data Driven Multithreading (DDM) model of execution • DDM-CMP architecture • Proof of concept • Experimental Setup • Experimental Results • Thermal Aware Scheduling • Work in progress • Conclusions
Data Flow Model • Data Driven Multithreading (DDM) is the evolution of the Data-Flow model of execution • The Data Flow model of execution – Proposed in the 70s (Jack Dennis - MIT) – Execution of instructions is driven by data availability • An instruction is enabled only when all operands are available • An instruction can be fired (i.e. executed) only after it becomes enabled
• In the Data-Flow model the “basic element” the instruction whereas in DDM the “basic element” is the thread.
Data Driven Multithreading • Synchronization part of the program is separated from the computation part • Tolerates synchronization and communication latencies by allowing the computation processor produce useful work while a long latency event is in progress • Efficient data prefetching through Cache-Flow policies • Data Driven Multithreading can be implemented using commodity microprocessors [Kyriacou 05] – The only additional requirement is a small hardware structure, the TSU (Thread Synchronization Unit)
TSU: Hardware support for Data Driven Multithreading
TSU: Hardware support for Data Driven Multithreading
TSU: Hardware support for Data Driven Multithreading
TSU: Hardware support for Data Driven Multithreading
Model of execution ...
...
...
...
A producer – consumer relationship exists among threads
• A program in DDM is a collection of re-entrant code blocks – A code-block is equivalent to a function or a loop body in the highlevel context – Each code-block comprises of several threads – A Thread is a sequence of instructions equivalent to a basic block – A producer-consumer relationship exists among threads – Scheduling of code-blocks and threads is done dynamically according to data availability
Model of execution Thread 1 Thread 3 Thread 2
Thread 4
Thread 7
Thread 5
Thread 8
Thread 6
The DDM programs are represented as graphs
• A program in DDM is a collection of re-entrant code blocks – A code-block is equivalent to a function or a loop body in the highlevel context – Each code-block comprises of several threads – A Thread is a sequence of instructions equivalent to a basic block – A producer-consumer relationship exists among threads – Scheduling of code-blocks and threads is done dynamically according to data availability
Model of execution Thread 1 Thread 3 Thread 2
Thread 4
Thread 7
Thread 5
Thread 8
Thread 6
Nodes of the synchronization graph represent the threads
• A program in DDM is a collection of re-entrant code blocks – A code-block is equivalent to a function or a loop body in the highlevel context – Each code-block comprises of several threads – A Thread is a sequence of instructions equivalent to a basic block – A producer-consumer relationship exists among threads – Scheduling of code-blocks and threads is done dynamically according to data availability
Model of execution Thread 1 Thread 3 Thread 2
Thread 4
Thread 7
Thread 5
Thread 8
Thread 6
Arcs carry data from the producer to the consumer threads
• A program in DDM is a collection of re-entrant code blocks – A code-block is equivalent to a function or a loop body in the highlevel context – Each code-block comprises of several threads – A Thread is a sequence of instructions equivalent to a basic block – A producer-consumer relationship exists among threads – Scheduling of code-blocks and threads is done dynamically according to data availability
Model of execution Thread 1 Thread 3 Thread 2
Thread 4
Thread 7
Thread 5
Thread 8
Thread 6
Threads 1 and 2 are executing in parallel
• A program in DDM is a collection of re-entrant code blocks – A code-block is equivalent to a function or a loop body in the highlevel context – Each code-block comprises of several threads – A Thread is a sequence of instructions equivalent to a basic block – A producer-consumer relationship exists among threads – Scheduling of code-blocks and threads is done dynamically according to data availability
Model of execution Thread 1 Thread 3 Thread 2
Thread 4
Thread 7
Thread 5
Thread 8
Thread 6
Threads 1 and 2 have completed their execution
• A program in DDM is a collection of re-entrant code blocks – A code-block is equivalent to a function or a loop body in the highlevel context – Each code-block comprises of several threads – A Thread is a sequence of instructions equivalent to a basic block – A producer-consumer relationship exists among threads – Scheduling of code-blocks and threads is done dynamically according to data availability
Model of execution Thread 1 Thread 3 Thread 2
Thread 4
Thread 7
Thread 5
Thread 8
Thread 6
Results “propagation” – Thread 3 is ready to be executed
• A program in DDM is a collection of re-entrant code blocks – A code-block is equivalent to a function or a loop body in the highlevel context – Each code-block comprises of several threads – A Thread is a sequence of instructions equivalent to a basic block – A producer-consumer relationship exists among threads – Scheduling of code-blocks and threads is done dynamically according to data availability
Model of execution ...
...
...
...
A producer – consumer relationship exists among threads
• A program in DDM is a collection of re-entrant code blocks – A code-block is equivalent to a function or a loop body in the highlevel context – Each code-block comprises of several threads – A Thread is a sequence of instructions equivalent to a basic block – A producer-consumer relationship exists among threads – Scheduling of code-blocks and threads is done dynamically according to data availability – The CPU can execute the instructions within a thread using any available optimization
Data Driven Multithreading • Why Data Driven multithreading and not any other multithreading model of execution? – – – –
DDM tolerates communication latencies DDM tolerates synchronization latencies DDM offers effective data prefetching DDM can be implemented with commodity microprocessors Data Driven Multithreading overtakes the memory wall
Outline • Motivation • Data Driven Multithreading (DDM) model of execution • DDM-CMP architecture • Proof of concept • Experimental Setup • Experimental Results • Thermal Aware Scheduling • Work in progress • Conclusions
DDM-CMP Architecture
DDM-CMP Architecture
• The execution processors • Cache is embedded in the execution processors
DDM-CMP Architecture
• The thread synchronization units (TSUs) • One per processor here
DDM-CMP Architecture
• The interconnection network • TSUs communicate with each other through the interconnection network
DDM-CMP Architecture
• Off-chip memory • Communication via the system bus
Outline • Motivation • Data Driven Multithreading (DDM) model of execution • DDM-CMP architecture • Proof of concept • Experimental Setup • Experimental Results • Thermal Aware Scheduling • Work in progress • Conclusions
Proof of Concept More Parallelism Less Complex design More opportunities for thermal enhancement
Proof of Concept • The number of transistors needed to build a Pentium 4 is enough for – 5.7 Pentium III processors – ~10 “modified Pentium III” processors.
Proof of Concept • The configuration of PE was determined with simulations – According to our experimental results reducing cache size has minimal effect on performance (IPC) for the kernels studied
Outline • Motivation • Data Driven Multithreading (DDM) model of execution • DDM-CMP architecture • Proof of concept • Experimental Setup • Experimental Results • Thermal Aware Scheduling • Work in progress • Conclusions
Experimental Setup • Baseline – The baseline used for our experiments is a state-of-the art single chip microprocessor, a Hyper-threaded Pentium 4 clocked at 3.2GHz implemented with 90nm technology
• Benchmarks – To evaluate the proposed architecture we used kernels from SPLASH2 benchmark suite
Experimental Setup • Pentium 4 (Baseline) – The performance of Pentium 4 was quantified using native execution (hardware counters)
• 4 Cores CMP (Pentium III) – The performance of a single Pentium III processor was quantified using native execution (hardware counters) – Performance of the 4-cores CMP was extrapolated using results of our previous work for the D2NOW (Extrapolation is conservative)
• 8 Cores CMP (PE: Modified Pentium III) – The performance loss of modified compared to the original Pentium III was estimated via simulation (Simplescalar) – Performance of the 8-cores CMP was extrapolated using results from D2NOW
Outline • Motivation • Data Driven Multithreading (DDM) model of execution • DDM-CMP architecture • Proof of concept • Experimental Setup • Experimental Results • Thermal Aware Scheduling • Work in progress • Conclusions
Frequency Scaling • For our experiments: – Pentium 4 was clocked at 3.2GHz with implementation technology of 90nm – The Pentium III was clocked at 800MHz with implementation technology 180nm
• Today’s implementation technology allows us to build a Pentium III with significantly higher frequency than 800MHz • The “fastest” Pentium III ever was scaled at 1,4GHz (Tualatin)
Experimental Results
Without frequency scaling
Experimental Results
With moderate frequency scaling
Experimental Results
With aggressive frequency scaling
Experimental Results
Very encouraging results
Outline • Motivation • Data Driven Multithreading (DDM) model of execution • DDM-CMP architecture • Proof of concept • Experimental Setup • Experimental Results • Thermal Aware Scheduling • Work in progress • Conclusions
Thermal Optimization Opportunities • In a multiprogramming environment some cores of the CMP will be idle at any given time point • Scheduling algorithms will define where (on which idle core) a new process will run • Examples of Scheduling Algorithms • Always Coolest • Neighborhood aware
• To evaluate the potential – TSIC: Thermal Simulator for Chip Multiprocessors This work was accepted for publication in the 10th Panhellenic Conference on Informatics
DDM-CMP enhanced thermal Optimization Opportunities • DDM-CMP offers enhanced thermal optimization opportunities: – DDM-CMP is likely to be build out of a large number of small embedded microprocessors – Scheduling will be often in the DDM architecture – Thermal profiling information can be included during compilation
Outline • Motivation • Data Driven Multithreading (DDM) model of execution • DDM-CMP architecture • Proof of concept • Experimental Setup • Experimental Results • Thermal Aware Scheduling • Work in progress • Conclusions
Work in progress • A hardware prototype will be build using Xilinx Virtex II Pro chip – It includes • 2 embedded PowerPC processors • Programmable FPGA
– We will execute application threads on the two processors and implement the interconnection network and the TSUs on the FPGA • DDM-CMP simulator – Under implementation using the Liberty Simulation Environment (LSE) • DDM-CMP compiler – Aims at automatically converting conventional code to binaries appropriate for the DDM model of execution
Conclusions • DDM-CMP – Takes into advantage thread level parallelism – Decreased design and implementation complexity – Improves the thermal characteristics of the chip
• Achieves speedup ranging from 5.2 – 14.9 compared to a current high-end state-of-the-art single chip microprocessor • A hardware prototype is under development
Questions
Thank you! Questions? The CASPER group: www.cs.ucy.ac.cy/carch/casper/