IMPLEMENTATION AND PERFORMANCE EVALUATION OF SCHEDULED DATAFLOW (SDF) ARCHITECTURE

by

JOSEPH MARIA ARUL

A DISSERTATION

Submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy in The Department of Electrical and Computer Engineering of The School of Graduate Studies of The University of Alabama in Huntsville

HUNTSVILLE, ALABAMA 2001

Copyright by Joseph Maria Arul All Rights Reserved 2001


DISSERTATION APPROVAL FORM

Submitted by Joseph M. Arul in partial fulfillment of the requirements for the degree of Doctor of Philosophy in Computer Engineering. Accepted on behalf of the Faculty of the School of Graduate Studies by the dissertation committee: ___________________________________________Committee Chair (Date) ____________________________________________

____________________________________________

____________________________________________

____________________________________________

____________________________________________

____________________________________________Department Chair

____________________________________________College Dean

____________________________________________Graduate Dean


ABSTRACT

School of Graduate Studies
The University of Alabama in Huntsville

Degree: Doctor of Philosophy
College/Dept.: Engineering/Electrical and Computer Engineering
Name of Candidate: Joseph Maria Arul
Title: Implementation and Performance Evaluation of Scheduled Dataflow (SDF) Architecture

This dissertation presents the implementation (simulated) and evaluation of a nonblocking, decoupled memory/execution, multithreaded architecture known as the Scheduled Dataflow (SDF) architecture. Recent focus in the field of new processor architecture is mainly on Very Long Instruction Word (VLIW) (e.g., Itanium), superscalar and superspeculative designs. This trend allows for better performance at the expense of increased hardware complexity, and possibly higher power expenditures resulting from dynamic instruction scheduling. The SDF system deviates from this trend by exploring a simpler, yet powerful execution paradigm that is based on dataflow, multithreading and decoupling of memory accesses from execution. A program is partitioned into non-blocking execution threads. In addition, all memory accesses are decoupled from the thread’s execution. Data is pre-loaded into the thread’s context (registers), and all results are post-stored after the completion of the thread’s execution. The decoupling of memory accesses from thread execution requires a separate unit to perform the necessary pre-loads and post-stores, and to control the allocation of hardware thread contexts to enabled threads. Thus, SDF contains two units called Synchronization Processor (SP) and Execution Processor (EP).


Even though multithreading and decoupling are possible with control-flow architectures, the non-blocking and functional nature of the SDF system makes it easier to coordinate the memory accesses and execution of a thread, as well as to eliminate unnecessary dependencies among instructions. Evaluation is based on comparing the execution cycles of SDF with the execution cycles of the MIPS (DLX simulator) architecture. The SDF simulator can also be easily modified to contain more than a single SP and a single EP. The execution cycles on the SimpleScalar (a superscalar simulator) and VLIW (as facilitated by the Trimaran simulator and TMS320C6000) architectures are compared with an SDF system consisting of multiple SPs and EPs. Our performance comparisons show that the SDF system consistently outperforms the MIPS-like system. The SDF system also outperforms superscalar and VLIW when the number of functional units (viz., integer and floating-point units, or EPs and SPs) exceeds a certain number. The SDF system's performance improvements result from multithreading and decoupling. This dissertation relies on an instruction set simulator for the SDF system and hand-coded benchmarks.

Abstract Approval:

Committee Chair

___________________________________ (Date)

Department Chair

___________________________________

Graduate Dean

___________________________________


ACKNOWLEDGMENTS

It is a great joy to acknowledge my advisor Dr. Krishna M. Kavi for his tireless effort and numerous hours of guidance and support. I also would like to thank him for his encouragement, stimulating questions and insights during our weekly meetings. I would like to thank Rev. Phil O'Kennedy and Rev. Louis Giardino for their support, help and tolerance during these years of my research and studies here in Huntsville. Bill and Rose Liles have supported me in many different ways in this unknown and unfamiliar land. I would like to acknowledge Bill's contribution in spending many hours proofreading my dissertation. My joy and gratitude go to all the committee members, Dr. Mary Ellen Weisskopf, Dr. Chester Carroll, Dr. Earl Wells and Dr. Emil Jovanov, for their time and contributions in achieving my goal. I also would like to express my gratitude to my other colleagues who worked with me in the CRASH lab, Dr. Hyong-Shik Kim, Dr. Roberto Giorgi, Mehran Rezaei, Mohamed Aborizka and Shuaib Hanief, for their understanding and exchange of views and ideas. On a more personal note, I would like to acknowledge my family members, especially my mother and father, for believing in my abilities and inculcating in me the value of hard work and perseverance. My thanks to my brothers and sisters for their constant love and support during my years of study here in the USA.


TABLE OF CONTENTS

                                                                         Page
List of Figures ..... x
List of Tables ..... xii

Chapter

I.    INTRODUCTION ..... 1
      1.1  Control Flow and Dataflow Models of Architectures ..... 1
      1.2  Multithreading to Alleviate Long Memory Latency ..... 5
      1.3  Blocking Versus Non-Blocking Multithreading ..... 7
      1.4  Threaded Dataflow Architecture ..... 8
      1.5  Dissertation Scope and Contributions ..... 8
      1.6  Dissertation Outline ..... 10

II.   BACKGROUND AND RELATED RESEARCH ..... 11
      2.1  Dataflow Model and Architectures ..... 11
      2.2  Static (Single-Token-Per-Arc) Dataflow ..... 13
      2.3  Dynamic (Tagged-Token) Dataflow Architecture ..... 14
      2.4  Explicit Token Store (ETS) Architecture ..... 16
      2.5  Hybrid of von Neumann/Dataflow Model and Multithreading ..... 18
      2.6  Efficient Architecture for Running THreads (EARTH) ..... 19
      2.7  Simultaneous Multithreading Processor (SMT) ..... 20
      2.8  The Superthreaded Architectural Model ..... 22
      2.9  Rhamma Processor ..... 24
      2.10 Analytical Model of Scheduled Dataflow Architecture ..... 25

III.  SCHEDULED DATAFLOW (SDF) ARCHITECTURE ..... 32
      3.1  Instruction Format of Scheduled Dataflow Architecture ..... 32
      3.2  Code Partitioning ..... 33
      3.3  Dataflow Graph and the Related SDF Assembly Code ..... 35
      3.4  Thread Continuation ..... 38
      3.5  Execution Pipeline ..... 40
      3.6  Synchronization Pipeline ..... 41
      3.7  Scheduling Unit (SU) ..... 42
      3.8  I-Structure Memory for Scheduled Dataflow (SDF) Architecture ..... 44

IV.   SCHEDULED DATAFLOW VERSUS CONVENTIONAL RISC SYSTEMS ..... 48
      4.1  Comparison of SDF Versus DLX (MIPS) ..... 48
      4.2  Execution Performance of Scheduled Dataflow ..... 49
      4.3  Effect of Thread Level Parallelism on Execution Behavior ..... 51
      4.4  Effect of Thread Granularity on Execution Behavior ..... 54
      4.5  Utilization of Two Processing Units ..... 56

V.    SCHEDULED DATAFLOW VERSUS SUPERSCALAR AND VLIW ..... 58
      5.1  Evaluation of SDF Versus Superscalar Architecture ..... 58
      5.2  Execution Performance of SDF with Multiple SPs and EPs ..... 63
      5.3  Comparison of SDF with VLIW ..... 72

VI.   CACHE PERFORMANCE OF SDF ..... 77
      6.1  Cache Performance of SDF Versus Superscalar Architecture ..... 77
      6.2  Cache Performance of SDF Versus DLX ..... 81
      6.3  Effect of Separate Data and I-Structure Caches on SDF ..... 83

VII.  CONCLUDING REMARKS ..... 85
      7.1  Conclusions and Unique Contributions of this Dissertation ..... 85
      7.2  Future Work ..... 87

Appendix A: SDF Instruction Set Manual Reference 0.9.2 ..... 89
Appendix B: List of Op-Codes Used in SDF ..... 108
Appendix C: Fibonacci Program Code for SDF ..... 110

REFERENCES ..... 114

LIST OF FIGURES

Figure                                                                   Page
2.1   Processing Element of a Dataflow ..... 15
2.2   ETS Representation of a Dataflow Program Execution ..... 18
2.3   Rhamma Processor ..... 25
2.4   Queuing Networks ..... 26
2.5   Effect of Thread Parallelism ..... 27
2.6   Effect of Thread Length ..... 28
2.7   Fraction of Memory Access Instructions ..... 29
2.8   Effect of Cache Memories ..... 30
3.1   Instruction Formats ..... 32
3.2   The Code Portions of an SDF Thread ..... 34
3.3   Simple Dataflow Graph ..... 36
3.4   Execution Code of SDF Corresponding to the Dataflow Graph ..... 36
3.5   Pre-load Code of SDF Corresponding to the Dataflow Graph ..... 36
3.6   Post-store Code of SDF for the Dataflow Graph ..... 38
3.7   Thread Continuation Transitions Handled by the Scheduling Unit (SU) ..... 39
3.8   General Organization of Execution Pipeline (EP) ..... 40
3.9   Operand Register Pairs for Scheduled Dataflow Architecture ..... 41
3.10  General Organization of Synchronization Pipeline (SP) ..... 42
3.11  Code for I-structure Allocation in SDF ..... 45
3.12  Codes for IFETCH Element from I-structure in SDF ..... 46
3.13  Code for ISTORE Value in I-structure of SDF ..... 46
3.14  Code for IFREE in I-structure of SDF ..... 47
4.1   Effect of Thread Level Parallelism of SDF Execution (Matrix Multiply) ..... 52
4.2   Effect of Thread Level Parallelism of SDF Execution (Livermore Loop 5) ..... 53
4.3   Effect of Thread Granularity of SDF Execution (Matrix Multiply) ..... 54
4.4   Effect of Thread Granularity on SDF Performance with Varying Data Sizes ..... 56
4.5   Utilization of SP and EP on SDF ..... 57
5.1   Comparing SDF with a Superscalar Processor for FFT ..... 61
5.2   Scalability of SDF over Superscalar for the Program Matrix Multiply ..... 66
5.3   Scalability of SDF over Superscalar for the Program FFT ..... 68
5.4   Scalability of SDF over Superscalar for Fibonacci Program ..... 70
5.5   Scalability of SDF over Superscalar for Zoom Program ..... 72

LIST OF TABLES

Table                                                                    Page
4.1   Execution Performance of Scheduled Dataflow Compared to DLX ..... 50
4.2   Effect of Thread Level Parallelism ..... 53
4.3   Effect of Thread Granularity on SDF Performance ..... 55
5.1   Superscalar Parameters Set ..... 57
5.2   SDF Versus Superscalar for the Program Matrix Multiply ..... 60
5.3   SDF Versus Superscalar for FFT of Different Data Sizes ..... 61
5.4   Scheduled Dataflow Versus Superscalar for Fibonacci Program ..... 62
5.5   Scheduled Dataflow Versus Superscalar for Zoom Program ..... 63
5.6   Superscalar Parameters for Multiple Functional Units ..... 63
5.7   SDF Versus Superscalar for Matrix Multiply of Different Data Sizes ..... 65
5.8   SDF Versus Superscalar for FFT Program of Different Data Sizes ..... 67
5.9   SDF Versus Superscalar for Fibonacci Program of Different Data Sizes ..... 69
5.10  SDF Versus Superscalar for Zoom Program of Different Data Sizes ..... 71
5.11  Comparing SDF with VLIW (Matrix Multiply) ..... 73
5.12  Impact of Thread Level Parallelism Versus VLIW (Matrix Multiply) ..... 74
5.13  Comparison of SDF with VLIW (FFT) ..... 74
5.14  Comparison of SDF with VLIW (Zoom) ..... 75
6.1   Cache Performance Comparing SDF Versus Superscalar with Different Units (Matrix Multiplication) ..... 78
6.2   Cache Performance Comparing SDF Versus Superscalar with Different Units (Fibonacci Program) ..... 80
6.3   Cache Behavior of SDF Compared to DLX Architecture ..... 82
6.4   Effect of Separate I-Structure Cache ..... 84


Chapter I

INTRODUCTION

1.1 Control Flow and Dataflow Models of Architectures

In today's computer industry, the von Neumann or control-flow architecture, which dates back to 1946, is still the dominant design used in microprocessors. This architecture is controlled by a program counter and executes sequentially unless the sequence is changed by a jump, branch, interrupt or return instruction. The principal criterion of the von Neumann architecture is minimal hardware complexity. It consists of a Control Unit (CU), an Arithmetic Logic Unit (ALU), a storage unit, and an Input/Output (I/O) unit, all connected to a single bus. Accessing the same storage in the different pipeline stages (fetch, decode, execute and write back) has limited performance. Today, the main emphasis is on obtaining maximum performance rather than on limiting hardware complexity. Processor performance has been doubling every 18 months (Moore's law) while the performance of memory is increasing by only 7% a year. In microarchitectural research, increasing emphasis is placed on instruction-level parallelism [68], which improves performance by increasing the number of instructions executed per cycle. Techniques to increase the parallelism within a basic block and across basic blocks have been studied. The focus of architectural research is on wider instruction fetch (such as VLIW), higher instruction issue rates, larger instruction windows of a superscalar system

and increasing use of prediction and speculation [54]. This requires a disproportionate increase in chip area and complexity. Although superscalar processors exploit out-of-order parallelism, the order of instruction flow is still constrained by the sequential program order defined by the von Neumann programming model. Contemporary microprocessors exploit instruction-level parallelism by performing instruction execution internally in a highly parallel fashion. Sequential threads, however, limit the instruction-level parallelism that can be exploited in superscalar processors. Superscalar processors [10, 12, 44, 71] are hardware-centric: the hardware dynamically detects the opportunities for parallel execution and schedules the operations to exploit the available resources. Many researchers have proposed decentralized architectures wherein multiple threads run on multiple simpler processing units on a single chip, known as Chip Multiprocessors (CMPs). The advantage of CMP is that its simplicity allows for a faster clock in each processing unit as well as better utilization of silicon area. Another common modern architecture is the Very Long Instruction Word (VLIW) [42] processor, which deviates modestly from the von Neumann architecture but retains the sequential von Neumann programming model. A typical VLIW processor uses a long instruction word (an instruction tuple) that contains a fixed number of operations that are fetched, decoded, issued, and executed synchronously. The compiler schedules the VLIW instructions statically, which eliminates runtime resource scheduling and synchronization. If the compiler cannot find enough independent operations to fill the long instruction word, it must pad the word with no-operation instructions. This often results in low utilization of hardware resources. Among

the drawbacks of VLIW are implementation complexity, low utilization and very complex demands on compiler technology. James Smith, in his article "Instruction-Level Distributed Processing (ILDP)," states that the trend is towards building general-purpose microarchitectures composed of small, simple, interconnected processors running at very high clock frequencies. The shift is from instruction-level parallelism to instruction-level distributed processing, with more emphasis on inter-instruction communication with dynamic optimization and a tight interaction between hardware and low-level software [58]. This means ILDP still requires complex dynamic management of instructions during their execution. Dataflow architecture [13, 16, 31, 28, 46, 47] is a radical alternative to the existing von Neumann architecture; it uses dataflow graphs as its machine language. An application program can be translated into a dataflow graph. The dataflow graph is a directed graph consisting of named nodes, which represent instructions, and arcs, which represent data dependences among instructions. In dataflow architecture, execution is driven only by the availability of operands at the inputs to the instructions (or nodes). Dataflow eliminates the need for a program counter and common storage, which are bottlenecks in the von Neumann architecture for exploiting instruction-level parallelism. The two main characteristics of dataflow architecture are functionality and composability. Functionality means that the evaluation of a graph corresponds to the evaluation of a mathematical function. Composability is the ability to combine graphs to form a new graph. In the dataflow model, the execution of an instruction (node) proceeds when all of its operands are present. The firing of the instruction results in the consumption of its input data (operands) and the generation of output data (results). In the von Neumann architecture, the sequence of

execution is based on control dependence, whereas in dataflow architecture it is based on data dependence. Since it is constrained only by data dependencies among instructions, dataflow program execution is also self-scheduling. The dataflow architecture can exploit parallelism at both the task and instruction level. Modern microprocessors with a single thread of control do not expose enough fine-grained parallelism to fill their multiple functional units. The dataflow approach resolves any thread of control into separate instructions that are ready to execute as soon as their operands are available. Because of the fine-grained parallelism that can be utilized in dataflow architectures, the available parallelism is much larger than in modern microprocessors. Hence, dataflow is a sound, simple and yet powerful model of parallel computation. The parallelism in dataflow architecture is limited only by the availability of data. The main characteristics that restrict the usage of the dataflow model are the number of functional units needed, the communication bandwidth required, and the need to match operands associatively. The early work on dataflow computer architecture emerged in the 1970s with the use of dataflow program graphs to represent, as well as to exploit, the parallelism in programs [16, 17]. Dataflow as a computational model has influenced many areas, such as programming languages, processor design, multithreaded architectures, parallel compilation, signal processing and distributed computing. Both von Neumann and dataflow have their own advantages and disadvantages. Hybrid models of execution that combine the advantages of the two and minimize the drawbacks can greatly increase the parallelism found in applications.
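To make the firing rule concrete, the sketch below (an illustration added for this discussion, not code from the dissertation; the graph, operator set and helper names are assumed) evaluates a tiny dataflow graph purely by operand availability — there is no program counter, and a node fires the moment all of its operands have arrived.

    import operator

    class Node:
        def __init__(self, op, arity, dests):
            self.op = op              # function applied when the node fires
            self.arity = arity        # number of operands the node waits for
            self.operands = {}        # port -> value received so far
            self.dests = dests        # (destination node, port) pairs for the result

    def run(graph, tokens):
        ready = list(tokens)                          # tokens: (node name, port, value)
        while ready:
            name, port, value = ready.pop()
            node = graph[name]
            node.operands[port] = value
            if len(node.operands) == node.arity:      # firing rule: all operands present
                result = node.op(*[node.operands[p] for p in sorted(node.operands)])
                node.operands.clear()
                for dest, dport in node.dests:
                    if dest == "out":
                        return result
                    ready.append((dest, dport, result))

    # (a + b) * (a - b): the order of evaluation is driven only by data availability.
    graph = {
        "add": Node(operator.add, 2, [("mul", 0)]),
        "sub": Node(operator.sub, 2, [("mul", 1)]),
        "mul": Node(operator.mul, 2, [("out", 0)]),
    }
    print(run(graph, [("add", 0, 7), ("add", 1, 3), ("sub", 0, 7), ("sub", 1, 3)]))   # 40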

There has been tremendous growth in the density of dynamic random access memory (DRAM) chips, but access times and I/O bandwidth have not improved with the processor clock rate. Modern processors further add to the problem with the memory hierarchies that must be traversed between processors and main memory. Another approach to alleviate this memory gap, decoupled architecture, has been studied in [57]. The main feature of this decoupling is to separate operand accesses from execution. Both instruction streams are executed on independent processing units called the Accessing Processor (AP) and the Execution Processor (EP), which communicate with each other and with the memory system through queues. The amount of performance improvement depends on the application and on the dependencies between memory accesses and arithmetic instructions. Other mechanisms such as Intelligent RAM (IRAM) [35] and Processing In Memory (PIM) similarly try to alleviate the memory latency problem. IRAM architectures integrate a small logic unit with DRAM main memory on a single chip, providing high memory bandwidth and significant reductions in memory latency. In PIM, processor and memory are integrated with a scalar or superscalar microprocessor that is enhanced by RAM memory rather than cache memory. The main advantages of having processors and memory on-chip are high bandwidth and low latency. Recently, there has been much emphasis on CMP [36, 48] architectures with speculative multithreading. Conventional superscalar design is so centralized that it does not exploit ILP beyond a single thread of execution.

1.2 Multithreading to Alleviate Long Memory Latency

Recent advances in multiprocessor architecture combine shared memory and distributed memory models into distributed shared memory (DSM) systems, with physically distributed memory connected by a Local Area Network (LAN) or Wide Area Network (WAN). These architectures are supported by a software/hardware cache coherency mechanism and face additional memory latency delays. In case of a miss in the local memory, the remote memory must be accessed. Stalls resulting from the request to, and the reply from, the remote memory significantly degrade the performance of the system. In order to hide the memory latency, different approaches such as prefetching, coherent caches (hardware to reduce cache misses), relaxed memory consistency (buffering and pipelining of memory references) and multiple contexts (switching from one context to another) have been used. These approaches can require increased cache capacities, which consume large silicon areas on processor chips with diminishing performance gains. It should be noted that many different ways exist to reduce the memory latency, but no method eliminates it. Basically, designs must rely on latency-reducing, latency-tolerating, or latency-hiding mechanisms to improve the performance of DSM. Multithreading is a latency-hiding technique that reduces processor idle time to obtain better performance [22, 30, 45]. In a single-threaded machine, program execution is defined by the processor state, which consists of the memory state (program memory, data memory, stack), an activity specifier (program counter and stack pointer) and a register context (set of registers). The context of a thread consists of an activity specifier and registers. With multiple contexts in a multithreaded architecture, there may be several instructions from different threads ready to execute. A context switch between threads is significantly faster than switching between processes. A processor can be shared between multiple threads, leading to higher utilization. Multithreading can hide memory latency [49, 50] and long I/O latency, or can be used to interleave instructions on a cycle-by-cycle basis from multiple threads to minimize pipeline breaks due to data dependencies among instructions within a single thread. Multithreading can be used in DSM systems as well as in symmetric multiprocessors (SMPs) to increase applications' throughput and responsiveness. In order to take full advantage of multithreading [61, 62], multiple hardware contexts and fast context switching must be provided. Even though there are numerous approaches to tolerating memory latencies, such as preload/prefetch, decoupling and so on, multithreading must form the basis for all these other techniques.

1.3 Blocking Versus Non-Blocking Multithreading

A thread can be a blocking thread or a non-blocking thread. A non-blocking thread begins evaluation as soon as all of its input operands are available. It completes execution without blocking the processor pipeline (for instance, due to synchronization waits). Thread switching is controlled by the compiler, with the idea of creating new threads rather than blocking while waiting for data. Hence, it reduces the reliance on locks for synchronization. The disadvantage of the non-blocking model is that it leads to the creation of many small threads, and if the threads contain little CPU-intensive work, very little performance gain can be achieved. A blocking thread can block during execution because of memory accesses, cache misses or synchronization. The waiting time while it is blocked in the pipeline is similar to that of the von Neumann architecture. However, this can be improved by switching out the blocking thread and executing an enabled thread, using a fast context switch. When there are too many context switches between threads, the performance gains become smaller and the purpose of masking the memory latency efficiently is defeated.

1.4 Threaded Dataflow Architecture

An application program can be transformed into graphs and subgraphs of dataflow architecture. When a program is transformed, it can still exhibit fine-grained parallelism. Subgraphs can be identified and transformed into threads. Thus, the dataflow principle is modified to form multiple threads with multiple contexts. In a dataflow paradigm, the data passed between instructions are stored back into memory. Tokens are data packets consisting of data values that propagate along the arcs in the dataflow graph. In threaded dataflow architecture [45], data passed between instructions within the same thread is stored in registers instead of being written back to memory. The total amount of token storage can thus be reduced in the threaded dataflow architecture, which in turn reduces the hardware complexity. Examples of such architectures include Monsoon [46], Epsilon, EM-4 and EM-X [21, 51].

1.5 Dissertation Scope and Contributions

This dissertation discusses the implementation (simulated), execution performance and cache behavior of a new multithreaded architecture known as the Scheduled Dataflow Architecture (SDF). The SDF system uses a non-blocking multithreaded model based on the dataflow paradigm. In addition to taking advantage of multithreaded techniques, memory accesses are decoupled from thread execution. The SDF architecture tries to combine the advantages of decoupled and non-blocking multithreaded

architecture. This architecture differs from other multithreaded architectures in two ways: i) the programming paradigm is based on dataflow, which eliminates the need for complex runtime scheduling, thus reducing the hardware complexity significantly, and ii) all memory accesses are completely decoupled from the execution pipeline. The use of dataflow and non-blocking models of execution permits a clean separation of memory access from execution (which is very difficult to coordinate in other programming models). The application program is transformed into a dataflow graph, and the graphs and subgraphs are identified to form multiple threads based on the data needed by each thread. A Processing Element (PE) consists of a Synchronization Processor (SP) and an Execution Processor (EP). Data is pre-loaded into an enabled thread's register context prior to its scheduling on the execution pipeline. After execution completes, the results are post-stored from registers to memory. The instruction set implements the dataflow computational model, while the execution engine relies on control-flow-like scheduling of instructions. SDF does not perform out-of-order execution and thus eliminates the need for complex instruction issue and retiring hardware. An instruction-level simulator has been developed to determine the performance of the SDF architecture. The execution performance of this architecture has been compared with the MIPS system (DLX simulator)[1], a superscalar architecture as facilitated by the SimpleScalar Tool Set [10], and VLIW (Trimaran[2] and the Code Composer Studio simulator[3]). Trimaran is an integrated compilation and performance monitoring infrastructure for architectures relying on instruction-level parallelism. It is a joint work of Hewlett Packard (HP), the University of Illinois and New York University. Code Composer Studio is compiler and simulator software for the Texas Instruments DSP VLIW system.

[1] The DLX simulator [Hennessy 96] was used for this purpose.
[2] http://www.trimaran.org
[3] TMS320C6000 DSP Code Composer Studio by Texas Instruments, http://www.ti.com/sc/docs/tools/dsp/6ccsfreetool.html
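The decoupled, non-blocking thread life cycle described above can be pictured with the following hypothetical sketch; the phase names, register names and memory layout are illustrative assumptions and do not reproduce the SDF instruction set:

    # Hypothetical sketch of one SDF-style thread (illustrative only):
    # the SP pre-loads the thread's inputs into its register context, the EP then
    # runs the body entirely out of registers without blocking, and the SP
    # post-stores the results back to memory.

    memory = {"a": 7, "b": 3, "result": None}          # assumed global memory

    def sp_preload(frame):
        # Synchronization Processor phase: fill the register context from memory.
        return {"r0": memory[frame["a"]], "r1": memory[frame["b"]]}

    def ep_execute(regs):
        # Execution Processor phase: register-to-register work only, never blocks.
        regs["r2"] = regs["r0"] + regs["r1"]
        regs["r3"] = regs["r2"] * 2
        return regs

    def sp_poststore(frame, regs):
        # Synchronization Processor phase: write the results back after execution.
        memory[frame["out"]] = regs["r3"]

    frame = {"a": "a", "b": "b", "out": "result"}       # the thread's memory locations
    sp_poststore(frame, ep_execute(sp_preload(frame)))
    print(memory["result"])                             # 20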

1.6 Dissertation Outline

The next chapter explains some of the related architectures as well as how multithreading has been used to improve the dataflow paradigm. Chapter 3 describes SDF in detail, including the pipelines of this new architecture. Chapter 4 compares the execution performance of SDF with DLX (which implements MIPS-like instructions). It also describes the effect of thread-level parallelism, thread granularity, and the utilization of the synchronization processor and execution processor. Chapter 5 compares the execution performance of SDF with SimpleScalar (a superscalar simulator) and VLIW architectures. Chapter 6 presents the cache performance of SDF as compared to a superscalar as well as a DLX processor. Chapter 7 presents the conclusions and some of the future improvements that can be applied to the scheduled dataflow architecture. Appendix A gives a detailed description of the instruction set. Appendix B presents the op-codes and related values used in the simulator. Appendix C gives the assembly code for an example program (a hand-coded Fibonacci program).


Chapter II

BACKGROUND AND RELATED RESEARCH

2.1 Dataflow Model and Architectures

Dennis and Misunas developed the dataflow computing model in the early 1970s [17]. Prior to that, single assignment languages and notations for parallel computation had already been developed. The single assignment languages led to single assignment machines. Since dataflow and single assignment languages are synonymous, dataflow machines have all been designed in conjunction with single assignment languages. Single assignment programming languages include SISAL [9], Id [24] and the Value oriented Algorithmic Language (VAL) [1]. Dataflow bases instruction execution on the availability of data; synchronization is implicit for parallel activities, and execution is self-scheduling. The main advantage of the single-assignment rule is that the parallelism is not constrained by anti-dependences and output dependences, as is the case in conventional languages. The dataflow model and architecture have been studied for more than two decades and held the promise of an elegant execution paradigm with the ability to exploit the inherent parallelism in applications. However, actual implementations of the model have failed to deliver the promised performance. Nevertheless, several features of the dataflow computational model have found their place in modern processor architectures and compiler technology (e.g., static single assignment, register renaming,

dynamic scheduling, out-of-order instruction execution, I-Structure-like synchronization, and non-blocking threads). Most modern processors utilize complex hardware techniques to detect data and control hazards and dynamic parallelism, in order to bring the execution engine closer to an idealized dataflow engine. The SDF system shows that such complexities can be eliminated by executing dataflow instructions directly in hardware. Some of the limitations of the pure dataflow model that have prevented its practical implementation include the following:

- too fine-grained (instruction-level) multithreading,
- difficulty in exploiting memory hierarchies and registers, and
- asynchronous triggering of instructions.

Many researchers have addressed the first two limitations of dataflow architectures [30, 29, 59, 63, 64]. The SDF architecture specifically addresses the third limitation. Some researchers have proposed hybrid designs in which dataflow scheduling is applied only at the thread level (i.e., macro-dataflow), while each thread is comprised of conventional control-flow instructions [20, 26, 52]. In such systems, the instructions within a thread do not retain functional properties and hence introduce Write-After-Write (WAW) and Write-After-Read (WAR) dependencies. This in turn requires complex hardware to perform dynamic instruction scheduling. In our SDF, the instructions within a thread still retain the functional properties of the dataflow model, and thus eliminate the need for complex hardware. The results (or data) flow from instruction to instruction, where each instruction specifies a location for storing the data. The proposed decoupled Scheduled Dataflow (SDF) system deviates from "pure" dataflow in that it departs from the data-driven (token-driven) execution traditionally used to implement the model.

There are several computer architectures that support a pure dataflow model, including static, dynamic and explicit token store architectures. More recently, there has been growing interest in multithreaded architectures, which inherit some properties from dataflow. The following sections briefly describe the static, dynamic, and explicit token store models, and recent multithreaded architectures such as Simultaneous Multithreading (SMT) [65, 66, 39, 40] and Superthreading [52].

2.2 Static (Single-Token-Per-Arc) Dataflow

In the static dataflow model, after the program has been transformed into a dataflow graph, the graph is represented as a collection of templates. A template contains the operation code of the represented instruction, operand slots for holding operand values, and destination address fields. The destination address fields refer to the operand slots in subsequent activity templates that receive the results. As the term single-token-per-arc indicates, the model allows at most one token per arc. A node is enabled, or fired, when all of its inputs are available on the input arcs. In this model, consecutive iterations of a loop can only be pipelined; hence, the amount of parallelism that can be exploited is reduced. Another drawback of this method is that the acknowledgement of tokens doubles the traffic. Several static machines were constructed, including 1) the MIT static dataflow machine, built in the early 80s with eight processing elements (PEs) interconnected through a communication network [17], 2) LAU,[4] developed in Toulouse, France, which used static graphs as well as a language based on the single-assignment rule, 3) the DDM1 (Utah Data Driven Machine), designed at the University of Utah in 1978, and 4)

[4] "Langage à assignation unique" (LAU) is a French acronym for single assignment computation.

Texas Instruments DDP (Distributed Data Processor), designed to investigate the dataflow design in 1979.

2.3 Dynamic (Tagged-Token) Dataflow Architecture

Dataflow graphs present a natural paradigm for allowing subprograms and loop iterations to proceed in parallel. In order to invoke and execute each iteration as a separate instance of a re-entrant subgraph, the loop iterations must be replicated. A token consists of a tag and data. Each tag is composed of a context field t (uniquely identifying the context), an initiation number i (identifying the loop iteration), and an instruction address n. It also carries a port number, since the destination instruction may need more than one input. For example, a token can be written as <t.i.n, data>p, where the port p represents left or right. If it is a simple token without an identifying loop iteration, t is the tag itself. If tags are attached to the tokens, they allow multiple tokens to reside on an arc, and the architectures supporting such tokens are called dynamic (tagged-token) dataflow architectures. When identical tags are present on all input arcs, the node can be enabled or fired. There are two units, known as the matching unit and the execution unit, which are connected by an asynchronous pipeline, with queues added to balance load variations. Execution consists of receiving, processing and sending out tokens containing data and a destination tag. Data dependencies are translated into tag matching and tag transformation. The tag contains information about the destination context and how to transform the tag for the results. In order to support token matching, some form of associative memory, such as real memory with associative access, a simulated memory

based on hashing, or a direct matched memory, is required. Figure 2.1 shows the processing element (PE), in which the execution pipeline is connected by a token queue.

[Figure: the execution pipeline of a PE — instruction fetch unit, token matching unit, processing unit and token form unit — connected through a token queue to the instruction and frame memory.]

Figure 2.1 Processing Element of a Dataflow.

This method of execution unfolds much more parallelism than the static dataflow model. However, the large number of tokens waiting to be matched increases the required associative memory. An I-Structure in dataflow can be viewed as a place to store structures, arrays or indexed data using the single-assignment rule. Each element of an I-Structure can be read many times, but can be written only once. Each element is associated with status bits and a queue of deferred reads. The status can be PRESENT, ABSENT, or WAITING:

PRESENT – the element has been written; it can be read but not written again.
ABSENT – the element has not been written; a read operation must be deferred.
WAITING – at least one read operation has been deferred; a write operation must supply the data to all deferred reads.

Three operations can be performed on an I-Structure: ALLOCATE, I-FETCH and I-STORE.

ALLOCATE – reserves a specified number of elements for a new I-Structure.
I-FETCH – retrieves the content of an I-Structure element (if the element is not present, the read is deferred).
I-STORE – writes a value into the specified I-Structure element (if the element is not empty, an error condition is reported; if the status is WAITING, the data is supplied to all deferred reads).

The I-FETCH instruction mentioned above is implemented in a split-phase manner, which means that the read request is independent in time from the response; it does not cause the issuing PE to wait for a response. Arvind et al. first proposed the I-Structure [5, 6], but it was independently discovered by Watson and Gurd at the University of Manchester [69]. The main drawback of the dynamic dataflow architecture is that the amount of memory needed to store tokens waiting for a match tends to be very large. To overcome this, hashing is used, which is not as fast as associative memory. Several dynamic dataflow machines have been constructed, including 1) the MIT Tagged-Token Dataflow Architecture, 2) the Manchester Dataflow Machine, 3) the Distributed Data Driven Processor (DDDP) from OKI Electric Ind. (Japan), 4) the Stateless Data-Flow Architecture (SDFA) designed at the University of Manchester, and 5) a Data Driven VLSI Array (DDA) designed at Technion (Haifa, Israel).
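The PRESENT/ABSENT/WAITING states and the split-phase I-FETCH described above can be condensed into the following sketch (an illustrative assumption, not the dissertation's implementation):

    # Illustrative I-Structure: each element is written at most once; reads that
    # arrive before the write are deferred and satisfied when the write occurs.

    class IStructure:
        def __init__(self, size):
            # each slot holds (status, value, list of deferred-read callbacks)
            self.slots = [("ABSENT", None, []) for _ in range(size)]

        def ifetch(self, i, consumer):
            status, value, deferred = self.slots[i]
            if status == "PRESENT":
                consumer(value)                         # data already there
            else:                                       # ABSENT or WAITING: defer the read
                deferred.append(consumer)
                self.slots[i] = ("WAITING", None, deferred)

        def istore(self, i, value):
            status, _, deferred = self.slots[i]
            if status == "PRESENT":
                raise RuntimeError("I-Structure element written twice")
            for consumer in deferred:                   # supply data to all deferred reads
                consumer(value)
            self.slots[i] = ("PRESENT", value, [])

    arr = IStructure(4)
    arr.ifetch(2, lambda v: print("deferred read sees", v))   # read before write: deferred
    arr.istore(2, 42)                                          # write satisfies the deferred read
    arr.ifetch(2, lambda v: print("later read sees", v))       # read after write: immediate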

2.4 Explicit Token Store (ETS) Architecture

The Explicit Token Store (ETS) concept was proposed to overcome the drawback associated with token matching using associative memory. ETS uses direct matching of the operands (or tokens) belonging to an instruction. In a direct matching scheme, storage (called a frame) is dynamically allocated for all the tokens needed by the

instructions in a code block. A code block can be viewed as a sequence of instructions comprising a loop body or a function. The disposition of locations within a frame is determined at compile time; however, the actual allocation of frames is determined at run time. In a direct matching scheme, any computation is completely described by a pointer to an instruction (IP) and a pointer to a frame (FP). The pair of pointers, <FP, IP>, called a continuation, corresponds to the tag part of a token. A typical instruction pointed to by an IP specifies an opcode; an offset (r) in the frame where the match of input operands for that instruction will take place; one or more displacements (destinations) that define the destination instructions to receive the result token(s); and the input port (left/right) indicator that specifies the appropriate input arc for a destination instruction. Figure 2.2 illustrates the ETS model. When a token arrives at a node (e.g., ADD), the IP part of the tag points to the instruction that contains an offset r as well as displacement(s) for the destination instruction(s). The actual matching process is achieved by checking the disposition of the slot in the frame memory at FP+r. If the slot is empty, the data value from the token is written to the slot and its presence bit is set to indicate that the slot is full. If the slot is already full (indicating a match of input operands), the value is extracted, leaving the slot empty, and the corresponding instruction is executed. The result token(s) generated from the operation is communicated to the destination instruction(s) by updating the IP according to the displacement(s) encoded in the instruction (e.g., execution of the ADD operation produces two result tokens, <FP, IP+1> and <FP, IP+2>L). Instruction execution in ETS is asynchronous, since an instruction is enabled immediately upon the arrival of its input operands.
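The direct-matching step can be sketched as follows (a simplified illustration, not the Monsoon hardware; the instruction entries loosely mirror the ADD/NEG/SUB example of Figure 2.2, and the helper names are assumed):

    # Illustrative ETS-style direct matching: a token carries a continuation <FP, IP>,
    # and two-operand instructions match in frame memory at offset FP + r.

    instr_mem = {
        # IP: (opcode, frame offset r, destinations as (IP displacement, port))
        0: ("ADD", 2, [(1, "L"), (2, "L")]),
        1: ("NEG", None, [(6, "L")]),          # single-operand: no matching needed
        2: ("SUB", 3, [(1, "R")]),
    }
    frame_mem = {}                              # (FP + r) -> (port, value) of waiting operand

    def apply_op(opcode, a, b=None):
        return {"ADD": lambda: a + b, "SUB": lambda: a - b, "NEG": lambda: -a}[opcode]()

    def process(token_queue):
        while token_queue:
            fp, ip, port, value = token_queue.pop(0)
            opcode, r, dests = instr_mem[ip]
            if r is not None:                   # two-operand instruction: try to match
                slot = fp + r
                if slot not in frame_mem:       # presence bit clear: store operand and wait
                    frame_mem[slot] = (port, value)
                    continue
                _, stored = frame_mem.pop(slot)                  # match found: fire
                left, right = (value, stored) if port == "L" else (stored, value)
                result = apply_op(opcode, left, right)
            else:
                result = apply_op(opcode, value)
            for disp, dport in dests:           # form result tokens for the destinations
                if ip + disp in instr_mem:
                    token_queue.append((fp, ip + disp, dport, result))
                else:
                    print("result token:", result)

    # ADD receives 5 and 3, fires, and feeds NEG and the left port of SUB.
    process([(100, 0, "L", 5), (100, 0, "R", 3), (100, 2, "R", 2)])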

MIT and Motorola jointly built the Monsoon explicit token store machine in 1990 [47, 13]. The Monsoon PE uses an 8-stage pipeline. The first stage is the instruction fetch

[Figure: a code-block activation with instruction memory entries (opcode, frame offset r, destinations) for ADD, NEG and SUB, and frame memory slots with presence bits addressed relative to FP.]

Figure 2.2 ETS Representation of a Dataflow Program Execution.

stage. The second stage is the effective address generation, which consists of three pipeline stages. The execution stage consists of three pipeline stages and the final stage is the token-form stage. The ETS model is also applied in other machines such as the EM-4 and Epsilon-2 [21].

2.5 Hybrid of von Neumann/Dataflow Model and Multithreading

Even though dataflow architecture provided the natural elegance of eliminating output dependencies and anti-dependencies, it performed poorly on sequential code. In an eight-stage pipelined machine such as Monsoon, an instruction of the same thread can only be issued to the dataflow pipeline after the completion of its predecessor instruction. With fewer pipeline stages, this delay would be smaller. In addition, token matching in the waiting-matching store introduced bubbles or stalls into the execution stage(s) of the dataflow machines. In order to overcome these drawbacks, many researchers implemented hybrids of dataflow/control-flow models along with multithreaded execution. In such models, several instructions within a dataflow graph are grouped together as a thread to be executed sequentially under its own private program counter control, while the activation and synchronization of threads are data-driven. Such hybrid architectures [6, 19] deviate from the original model in that instructions fetch data from memory or registers instead of depositing operands (tokens) in the "operand receivers" of successor instructions. Two key features of such hybrid models are sequential scheduling and the use of registers for buffering results between instructions. Examples of such hybrid models are the Threaded Abstract Machine (TAM) [14], P-RISC and Star-T [4]. The architecture studied in this dissertation, SDF, is one such hybrid architecture, designed to overcome several of the limitations mentioned in Section 2.1. Its features include decoupling, non-blocking multithreading, a dataflow program paradigm, and scheduling of instructions in a control-flow-like manner.

2.6 Efficient Architecture for Running THreads (EARTH)

EARTH is a multithreaded multiprocessor architecture. In a multiprocessor system, synchronization and communication between processors often become a bottleneck. EARTH is designed to show that efficient use of threads can hide memory latency. If there is enough parallelism in an application, switching between threads can hide the latencies inherent in parallel processing. EARTH is a hybrid von Neumann/dataflow multithreaded model. It consists of one or more processing nodes

connected by a network. Each node consists of a Synchronization Unit (SU) and an Execution Unit (EU). The SU handles synchronization, scheduling and remote accesses; the EU handles the execution of procedures. The SU and EU are connected by a Ready Queue (RQ) and an Event Queue (EQ): the EU receives a fiber (a thread) from the RQ, and when the fiber requires data from remote memory, the EU places it in the EQ. Local memory is shared by both units. The EU can have one or more Processing Elements (PEs) to execute fibers simultaneously. All PEs share the same RQ and EQ. Only the EU executes the fibers, in sequential order. A fiber consists of a Frame ID (FID) and an Instruction Pointer (IP). In the EARTH model, the program is partitioned into fibers. Threads in the EARTH model are a sequentially executed, non-preemptive and atomically scheduled set of instructions. A sequential function is called from a fiber in the same way it is called from another sequential function.

2.7 Simultaneous Multithreading Processor (SMT)

Thread Level Parallelism (TLP) and Instruction Level Parallelism (ILP) represent two different ways of exploiting the parallelism that exists in a program. Superscalar processors exploit ILP by executing multiple instructions in a single cycle; but with insufficient ILP (due to cache misses, long-latency instructions, or sequentialized computation), the multiple-issue hardware of a superscalar is wasted. Conventional multiprocessors exploit TLP by executing different threads on different processors; but with insufficient TLP in an application program, processors will sit idle. Simultaneous Multithreading (SMT) [66] allows multiple independent threads to issue multiple instructions each cycle to a superscalar processor. SMT uses the thread level

parallelism and instruction level parallelism interchangeably by running multiple applications on the same processor. Thus, all the available threads compete for and share all of the superscalar's resources every cycle. Each application may have a different level of parallelism. The choice between exploiting ILP and TLP depends on the application and the data. When per-thread ILP is limited, TLP can provide more parallelism. SMT exploits whichever form of parallelism is available to achieve greater throughput and significant program speedups. The primary changes in SMT relative to a conventional superscalar processor are in the instruction fetch mechanism and the register file. Three more processor resources are replicated to support SMT. The fetch mechanism selects up to two threads (threads that incur no I-cache misses) and fetches up to four instructions each cycle. After fetch and decode, register renaming is performed. The renaming mechanism maps the architectural registers to the machine's physical registers. This requires a total of 356 registers; that is, 32 registers for each of the 8 threads plus 100 renaming registers. Since the register file is large, it incurs an overhead of one extra cycle for register access. The register renaming mechanism eliminates intra-thread register dependences. Instructions from all threads are placed in the instruction queue and are issued from the queue when their operands become available. SMT and the on-chip multiprocessor are similar in many ways. SMT studies have also examined how TLP stresses other hardware structures such as the memory system, branch prediction and the caches. SMT has the advantages of flexible usage of TLP and ILP, fast synchronization, and an L1 cache shared over all functional units. In a multiprocessor system, more execution resources can be added to improve performance, but performance can suffer from inflexible partitioning of a program. SMT

permits dynamic resource sharing on a per-cycle basis to match the available ILP and TLP and therefore uses the existing resources more effectively. One study has shown that the latency-hiding characteristics of SMT with 8 threads allow it to achieve a 2.68 average speedup over a single MP2 processor [39]. However, the hardware needed for selecting threads, renaming registers and issuing instructions makes the SMT architecture very complex.

2.8 The Superthreaded Architectural Model

The superthreaded architectural model [52] has multiple processing elements connected to each other by a unidirectional ring. Each processing element has a private instruction cache and a private memory buffer to cache speculative stores and to support run-time data dependence checking. The processing elements share the L1 data cache and a combined L2 instruction/data cache, as well as a register file and a lock register. At run time, the processing elements, each with its own program counter and instruction execution path, can execute instructions from multiple program locations simultaneously. The compiler determines the granularity of threads, which can be one or several iterations of a loop, and partitions the control flow graph statically. Threads have few data dependencies with other concurrent threads. The program starts execution from its entry thread and forks its successor threads onto other processing elements. A successor can fork its own successors and thus keep all the processing elements busy. The oldest thread is the head thread, and when it completes, one of its successors becomes the head thread. A successor thread can be forked with or without control speculation. If a thread forks another thread with control dependence, it must ensure that all of the control dependencies are satisfied for the newly generated successor threads. If the speculated

control dependences turn out to be false, it must issue a command to kill the successor thread and all of its subsequent threads. The thread pipelining stages consist of continuation, Target Store Address Generation (TSAG), computation and write-back. The continuation stage begins execution upon initiation by the previous thread. It computes recurrence variables, such as loop index variables, which initiate the next thread. Data for concurrent threads is placed in a Target Store (TS), and the TSAG stage performs the address computation for these target stores. These addresses are stored in the memory buffer of each thread and are forwarded to the succeeding concurrent threads. When a particular thread completes this computation, it sends a flag to its successor thread indicating that it can start the computation. The TSAG stage can be safe or unsafe. If there is no data dependence on the previous thread, it is called safe. In the unsafe case, the thread may be data dependent on the previous thread and hence must wait for the flag from the predecessor thread. The computation stage performs the main computation of a thread. For a load operation, if the address matches a target store entry in the thread's memory buffer, it reads the data; otherwise, it waits for the data from the previous threads. When a target store address is calculated, it must be forwarded to the corresponding target store of all concurrent successor threads. The computation stage can be stopped by a stop instruction or can be killed by the predecessor thread. In the write-back stage, a thread completes its execution by writing all of the data from store operations in its memory buffer to memory. Each thread has to wait for its predecessor thread's wb_done flag before performing its write-back stage. Therefore, all of the stores are performed thread by thread, and WAR (anti-dependence) and WAW (output dependence) hazards are thereby handled. The TS address need not be forwarded if the data has been

stored to memory. In the superthreaded architecture, the compiler partitions the program into threads for concurrent execution and then performs thread pipelining to facilitate run-time data dependence checking. Traditional compiler techniques such as function inlining, induction variable substitution, alias analysis, data dependence analysis and variable privatization are used to generate an appropriate level of granularity and produce efficient code for this architecture. The main difference between the superthreaded architecture and other multithreaded processors is that it supports thread-level control speculation with run-time data dependence checking. Additionally, communication between threads need not go through memory, as in a traditional multiprocessor. The superthreaded architecture has some similarities with SDF, particularly in the manner in which control dependencies are transformed into data dependencies using the target store. However, since its threads are blocking and control-flow based, the superthreaded architecture requires complex hardware to resolve data dependencies.

2.9 Rhamma Processor

A multithreaded architecture (called Rhamma) that implements decoupled memory access/execution has been designed in Germany [22]. Figure 2.3 shows the overall structure of the Rhamma processor. Rhamma uses two separate processors: a memory processor that performs all load and store instructions, and an execution processor that executes all other instructions. A single sequence of instructions (a thread) is generated for both processors: when a memory access instruction is decoded by the execution processor, a context switch sends the thread to the memory processor; and when the memory processor decodes a non-memory-access instruction, a context switch

causes the thread to be handed over to the execution processor. Threads are blocking, and additional context switches due to data dependencies and cache misses may be incurred during the execution of a thread.

[Figure: the memory processor and the execution processor, each a four-stage pipeline (IF, ID, OF, EX/WB), sharing the instruction cache, data cache, scoreboard and register contexts.]

Figure 2.3 Rhamma Processor.

2.10 Analytical Model of Scheduled Dataflow Architecture

A preliminary analysis of the SDF architecture using closed-form queuing networks (as shown in Figure 2.4) and Monte Carlo simulations has been reported in [34]. In order to analyze the architecture in a more realistic light, synthetic workloads were generated and applied to simulations representing the different architectures. The workload generation is based on previously published data [25], observations based on specific architectural characteristics, and observations based on hand-coded programs. This is mainly to emphasize the fundamental differences in the programming and execution paradigms of the architectures (viz., data-driven threads, non-blocking versus blocking threads, no stalls on memory access versus stalls due to cache misses, no branch stalls

versus stalls on misprediction, token driven versus instruction driven, etc.). Since it is not possible to use the same set of parameters for all architectures, "normalized" workloads were used: all architectures execute the same amount of useful work, but different architectures incur different amounts of overhead instructions, stalls, and context switches. Simulations based on our analytical models are reported for three architectures: a conventional processor, the SDF system and the Rhamma processor.

[Figure: (a) the Rhamma processor modeled as a memory processor and an execution processor; (b) Scheduled Dataflow modeled as pre-load, execute and post-store stations.]

Figure 2.4 Queuing Networks.
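The closed-form queuing and Monte Carlo analysis of [34] is not reproduced here; the toy sketch below (with illustrative parameters echoing the 5% miss rate and 50-cycle miss penalty used in these experiments) only conveys why decoupling helps: with enough ready threads, the memory work handled by the SP and the execution work handled by the EP overlap, so the busier of the two units bounds the execution time, whereas a blocking single-unit design pays every memory delay serially.

    import random

    def make_threads(n, preload=10, execute=20, poststore=10,
                     miss_rate=0.05, miss_penalty=50):
        # Draw per-thread costs; cache misses lengthen the memory phases.
        def access(base):
            return base + (miss_penalty if random.random() < miss_rate else 0)
        return [{"preload": access(preload), "execute": execute,
                 "poststore": access(poststore)} for _ in range(n)]

    def blocking_single_unit(threads):
        # One unit does everything in order: memory delays serialize with execution.
        return sum(t["preload"] + t["execute"] + t["poststore"] for t in threads)

    def decoupled_estimate(threads):
        # With many ready threads, SP (memory) and EP (execute) work overlap,
        # so the busier unit bounds the makespan, plus filling/draining the pipeline.
        sp_work = sum(t["preload"] + t["poststore"] for t in threads)
        ep_work = sum(t["execute"] for t in threads)
        fill_drain = threads[0]["preload"] + threads[-1]["poststore"]
        return max(sp_work, ep_work) + fill_drain

    random.seed(0)
    workload = make_threads(100)
    print(blocking_single_unit(workload), decoupled_estimate(workload))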

In order to measure the effect of thread level parallelism on the performance of the different architectures, sequences of threads for each architecture are generated. To introduce the latency between a pair of threads (the time difference between the

termination of a thread and the initiation of a successive thread), the simple performance model for multithreaded processors suggested by Agarwal [3] is used. Three values are considered for the latency: 1, 3 and 5 times the length of a thread (L = 1R, L = 3R, L = 5R in Figure 2.5 below).

Figure 2.5 Effect of Thread Parallelism.

Figure 2.5 shows the execution times for the same total workload but for a varying number of threads comprising the workload. The execution time of a conventional processor is not affected by the degree of thread parallelism, since this processor executes single-threaded programs only. However, as the degree of thread parallelism increases, both SDF and Rhamma show performance gains. As expected, with only one thread at a time (degree of thread parallelism = 1), multithreaded architectures perform poorly compared to single-threaded architectures. The figure also shows that the SDF system

executes the multithreaded workload faster than Rhamma for all values of thread parallelism. For the above experiment, the same cache miss rate (5%) and the same cache miss penalty (50 cycles) are used for all architectures. For the remaining experiments, L = 3R is used for both the SDF and Rhamma processors. The average thread length is set to 30, 20, and 50 functional instructions for the conventional architecture, SDF, and Rhamma, respectively. It can be noted that the SDF system will provide a higher degree of thread parallelism than Rhamma, since the non-blocking nature of SDF leads to finer-grained threads. These average thread lengths are based on our observations from analyzing some actual programs written using SDF instructions.
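As a rough illustration of this kind of model, the following Python sketch computes pipeline utilization for a blocking multithreaded processor in the spirit of the model suggested by Agarwal [3]; the run length, context-switch cost, and thread counts used below are illustrative assumptions, not the parameter values of the reported experiments.

def utilization(n_threads, run_length, latency, switch_cost):
    """Utilization of one pipeline under a simple multithreading model.

    A thread runs for run_length cycles and then waits latency cycles;
    switching to another ready thread costs switch_cost cycles.  With
    enough threads the latency is hidden and utilization saturates at
    R / (R + C); with too few threads the pipeline idles.
    """
    saturated = run_length / (run_length + switch_cost)
    unsaturated = n_threads * run_length / (run_length + latency + switch_cost)
    return min(saturated, unsaturated)

if __name__ == "__main__":
    R = 20                                  # assumed average thread run length
    for L in (1 * R, 3 * R, 5 * R):         # L = 1R, 3R, 5R as in Figure 2.5
        rows = [round(utilization(n, R, L, 4), 2) for n in (1, 2, 4, 8, 16)]
        print(f"L = {L:3d}: utilization for 1, 2, 4, 8, 16 threads -> {rows}")

With L = 3R, for instance, this sketch shows utilization climbing from roughly 0.24 with one thread toward its saturation value as more threads are added, which is the qualitative behavior seen in Figure 2.5.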

Figure 2.6 Effect of Thread Length.

Figure 2.6 shows the results of varying average thread lengths. The normalized thread length includes only functional instructions and does not include architecture-

specific overhead instructions. For the conventional and SDF architectures, increasing thread run-lengths shows performance gains up to a certain degree, since longer threads imply fewer context switches. With Rhamma, however, longer threads do not guarantee shorter execution times. The blocking nature of Rhamma threads causes proportionally more thread blockings (or context switches) per thread as the run length increases. Thus, increasing thread granularity without considering other optimizations for blocking multithreaded systems with decoupled access/execute processors may adversely impact the performance.

Figure 2.7 Fraction of Memory Access Instructions.

Since both SDF and Rhamma decouple memory accesses from pipeline execution, the impact of the number of memory access instructions per thread is explored. Figure 2.7 shows the results, where the x-axis indicates the fraction of

load/store instructions. For the conventional architecture, increasing the number of memory access instructions leads to increased cache misses, thus increasing the execution time. However, the decoupling permits the two multithreaded processors to tolerate cache miss penalties. Note that the SDF system outperforms Rhamma for all values of memory access instructions. This is primarily because of the pre-loading and post-storing performed by the SDF architecture.

Figure 2.8 Effect of Cache Memories: (a) impact of miss rates; (b) impact of miss penalties.

Figure 2.8 shows the effect of cache memories on the performance of the three architectures. For this experiment, a 50-cycle cache miss penalty is assumed for Figure 2.8a and a 5% cache miss rate for Figure 2.8b. As can be observed in Figure 2.8, both multithreaded processors are less sensitive to memory access delays than the conventional processor. When a cache miss occurs in Rhamma, a context switch ("switch

31 on use”) of the faulting thread occurs. In SDF (note that only pre-load and post-store threads access the memory) assuming non-blocking caches, a cache miss does not prevent the memory accesses for other threads. Note that this is not possible in Rhamma since memory accesses are not separated into pre-loads and post-stores. The delays incurred by pre-loads and post-stores in SDF do not lead to additional context switches since threads are enabled for execution only when the pre-loading is complete, and once enabled for execution, they complete without blocking. Decoupling memory accesses provides better tolerance of memory latencies when used with non-blocking multithreading models and when memory accesses are grouped for pre-loads and poststores. These analytical models and the results obtained from the experiments have led us to believe that non-blocking multithreaded architecture with decoupled memory access is viable. In the remaining chapters, results will be presented based on our instruction set simulator and the real benchmarks representing several scientific applications.


Chapter III

SCHEDULED DATAFLOW (SDF) ARCHITECTURE

3.1 Instruction Format of Scheduled Dataflow Architecture

A detailed description of the instruction set for our architecture can be found in Appendix A. Before describing the architecture of the Scheduled Dataflow (SDF), it is necessary to understand how its instructions differ from those of an ETS dataflow system. In ETS, each instruction specifies a memory location by providing an offset (R) with respect to the frame pointer (FP). The first data token destined for the instruction is stored in this memory location, waiting for its match. When a matching data token arrives for the instruction, the previously stored data is retrieved, and the instruction is immediately scheduled for execution. The result of the instruction is converted into one or two tokens by tagging the data with the address of its destination instruction (IP).

(a) ETS Instruction Format:  Opcode | Offset(R) | Dest-Inst-1 and Port | Dest-Inst-2 and Port

(b) SDF Instruction Format:  Opcode | Offset(R) | Dest-Data-1 and Port | Dest-Data-2 and Port

Figure 3.1 Instruction Formats.

The format of the ETS instruction is shown in Figure 3.1a. Instructions for SDF differ from ETS instructions only slightly, as shown in Figure 3.1b. In ETS, the destinations refer to the destination instructions (i.e., IP values); in SDF the destinations refer to the operand locations of the destination instructions (i.e., offset values into activation frames or register contexts). This change also permits the detection of RAW data dependencies among instructions in the execution pipeline and the use of result forwarding, so that results from an instruction can be sent directly to dependent instructions. Result forwarding is not applicable in ETS dataflow since instructions are token driven. The source operands for an SDF instruction are specified by a single offset (R) value and refer to a pair of registers where the data values are stored by predecessor instructions. Unlike ETS, SDF is not data driven – the input operands of an instruction are saved in the pair of registers associated with the instruction, and the instruction is scheduled for execution at a later time.
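To make the contrast concrete, the Python sketch below models the two formats of Figure 3.1 as simple records; the class names, field names, and the example values are illustrative assumptions based on the description above, not the simulator's encoding.

from dataclasses import dataclass

@dataclass
class ETSInstruction:
    # ETS destinations name instructions (IP values); result tokens are tagged with them.
    opcode: str
    offset_r: int        # frame offset where the first token waits for its match
    dest_inst1: int      # IP of the first destination instruction
    dest_port1: str      # 'left' or 'right' operand port
    dest_inst2: int
    dest_port2: str

@dataclass
class SDFInstruction:
    # SDF destinations name operand locations (offsets into a register context),
    # which is what permits RAW detection and result forwarding in the pipeline.
    opcode: str
    offset_r: int        # even register of the source register pair (e.g., RR2)
    dest_data1: int      # register holding the first destination operand
    dest_port1: str      # which half of the destination pair is written
    dest_data2: int
    dest_port2: str

# The ADD of Figure 3.4: read pair RR2 (R2, R3), write R11 and R13
# (assumed here to be the odd, i.e., right, halves of pairs RR10 and RR12).
add = SDFInstruction("ADD", 2, 11, "right", 13, "right")
print(add)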

3.2 Code Partitioning

In order for the SDF architecture to execute a program, there must be a compiler to automatically partition the code. For a conventional multiprocessor system, the program is explicitly partitioned into processes based on programming constructs such as loops, procedures, etc. In order to run a program on SDF, the program must be explicitly partitioned into threads of a large enough granularity to limit the number of synchronization and context switching operations that must be performed, based on those programming constructs. SISAL (Streams and Iteration in a Single Assignment Language) produces intermediate form (IF1 or IF2) code that can be used as input for the compiler to

generate threaded assembly code partitioning the program into threads. I-structure memory operations such as I-fetch and I-store can be done only in the SP. I-Allocate can be done in the EP.

Figure 3.2 The Code Portions of an SDF Thread (pre-load, execute, post-store).

The compiler must take care of partitioning the high-level source code in order to create the SDF threads. Each thread consists of three portions: pre-load code, execute code, and post-store code, as can be seen in Figure 3.2. The compiler must generate a number of code blocks, where each code block consists of several instructions from one or more threads, associated with a label, that can be executed to completion either on the SP or the EP without interruption. The synchronization count is the number of inputs needed by the thread before it can be executed on the EP. The thread can be scheduled on an EP when the synchronization count becomes 0. A frame is allocated for the above-mentioned code block or thread. All data needed for the thread, including the synchronization count, is stored in the frame memory. Pre-load code is the portion of a thread that gets all the data required by the thread and loads the data into the thread's registers. Post-store code is the portion of the thread that stores all the data back into memory or into the next thread's

context (registers). The SP is responsible for the pre-load and post-store portions of the code. When a thread is created (using the FALLOC instruction), a frame is allocated for the inputs to the thread. An instruction pointer (IP) indicating the first executable instruction of the thread and a synchronization count indicating the number of inputs needed before the thread becomes enabled for execution are stored in the allocated frame. Once a thread receives all the necessary inputs, the thread is allocated a register context. The pre-load code then moves the data from the thread's frame memory into its registers. The execute portion performs computations using only the registers, while the post-store code stores the thread's results in other threads' frames. Each thread is also allocated a register file upon activation. Each context has 32 register pairs. It is also designed with two write ports (in order to write to each register individually) and one double-word read port. Some registers are designated for special purposes. Register R0 is hardwired to 0, and register R1 is used to store constants, frame pointers, or instruction pointers; R1 behaves as a scratch register. Register R63 is used for storing the synchronization count of the thread. At present, the compiler for generating SDF code directly from source programs is not complete.
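A minimal software model of this enabling rule is sketched below in Python; the class name, frame layout, and method names are illustrative assumptions rather than the simulator's actual data structures.

class ThreadFrame:
    """Frame allocated by FALLOC: holds a thread's inputs and its
    synchronization count; the thread is enabled when the count reaches 0."""

    def __init__(self, ip, sync_count, size=32):
        self.ip = ip                    # first executable instruction of the thread
        self.sync_count = sync_count    # inputs still needed before enabling
        self.slots = [None] * size      # frame memory written by producers' post-stores

    def store(self, offset, value):
        """A post-store from a producer thread: write one input and
        decrement the synchronization count."""
        self.slots[offset] = value
        self.sync_count -= 1
        return self.sync_count == 0     # True -> thread can be given a register context

# A consumer thread expecting two inputs becomes enabled after the second store.
frame = ThreadFrame(ip=0x40, sync_count=2)
frame.store(2, 3.5)
print("enabled:", frame.store(3, 1.5))   # enabled: True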

3.3 Dataflow Graph and the Related SDF Assembly Code

This section explains how dataflow processing is possible and the concept of what is called "scheduled" dataflow. Consider the graph of Figure 3.3 and the corresponding SDF code in Figures 3.4, 3.5, and 3.6. Each node of the graph is translated into an SDF instruction. The two source operands destined for a dyadic SDF instruction (i.e., input data) are stored in a pair of registers assigned to that instruction. In

an instruction, a pair of registers (even-odd registers), say RR2, refers to registers R2 and R3 within a specified register context. The predecessor instructions store the data in either the left or right half of a register pair, as dictated by the data dependencies of the

Figure 3.3 Simple Dataflow Graph (computing (X+Y)*(A+B) and (X-Y)/(A+B) from the inputs A, B, X, and Y).

ADD   RR2,  R11, R13   ; Compute A+B, Result in R11 and R13
ADD   RR4,  R10        ; Compute X+Y, Result in R10
SUB   RR4,  R12        ; Compute X-Y, Result in R12
MULT  RR10, R14        ; Compute (X+Y)*(A+B), Result in R14
DIV   RR12, R15        ; Compute (X-Y)/(A+B), Result in R15

Figure 3.4 Execution Code of SDF Corresponding to the Dataflow Graph.

LOAD  RFP|2, R2   ; load A into R2
LOAD  RFP|3, R3   ; load B into R3
LOAD  RFP|4, R4   ; load X into R4
LOAD  RFP|5, R5   ; load Y into R5
LOAD  RFP|6, R6   ; load frame pointer for returning 1st result
LOAD  RFP|7, R7   ; load frame offset for returning 1st result
LOAD  RFP|8, R8   ; load frame pointer for returning 2nd result
LOAD  RFP|9, R9   ; load frame offset for returning 2nd result

Figure 3.5 Pre-load Code for SDF Corresponding to the Dataflow Graph.

program. Unlike previous dataflow architectures (e.g., Monsoon), an instruction is not scheduled for execution immediately when the operands are matched (available). Instead, operands are saved in the register pair associated with the instruction, and the enabled instructions are scheduled for execution based on some total ordering of the dataflow graph. Thus, asynchronous scheduling is eliminated, hence the name scheduled dataflow. The dataflow graph shown in Figure 3.3 receives 4 inputs (A, B, X, Y) from other threads (or from the input file). Each thread is associated with a frame, and the inputs to the thread are saved in the frame until the thread is enabled for execution (based on its synchronization count). When enabled, a register context is allocated to the thread and the input data for the thread is preloaded from the frame memory into its registers. Assuming that the inputs for the thread (A, B, X, Y) are stored in its frame (referenced via the frame pointer RFP) at offsets 2, 3, 4, and 5, the first four LOAD instructions of Figure 3.5 (executed by the SP) preload the thread's data into registers R2, R3, R4, and R5 of the register context allocated to the thread. After the preload, the thread is scheduled for execution on the EP. The EP then uses only its registers during the execution of the thread body (Figure 3.4). For example, ADD RR2, R11, R13 adds the contents of registers R2 and R3 and stores the result in R11 and R13. The instructions still retain the functional nature of dataflow. An instruction stores its results in the registers that are specifically associated with a destination instruction. There are no Write-After-Read (WAR) or Write-After-Write (WAW) dependencies in these registers. The results generated by the MULT and DIV instructions are in registers R14 and R15. They may be needed by other threads. The frame pointers and frame offsets for the destination threads are available in registers R6, R7, R8, and R9 (see the last 4 LOAD instructions of Figure 3.5). The results

can be stored in the post-store stage to the particular frame pointer and offset. As can be seen in Figure 3.6, the first result, from R14, is stored in a frame pointed to by R6

STORE R14, R6|R7   ; store first result
STORE R15, R8|R9   ; store second result

Figure 3.6 Post-store Code for SDF for the Dataflow Graph.

and the offset in R7 (i.e., R6|R7). The second result, from R15, is stored in a frame pointed to by R8 and the offset in R9 (i.e., R8|R9). When a thread is created, it is necessary to provide the thread with the frame pointers and offsets of its destination threads. This information is transferred into registers during the pre-load stage. The instructions of each thread are scheduled for execution sequentially. However, the instructions retain the dataflow functional nature, which eliminates WAR and WAW dependencies. The execution processor will have no bubbles or stalls due to cache misses since SDF uses decoupled memory accesses.
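Read as ordinary high-level code, the thread of Figures 3.4 through 3.6 behaves like the Python sketch below; the dictionaries standing in for frame memory and the register context, and the function name itself, are illustrative assumptions.

def run_thread(frame, dest1, dest2):
    """Pre-load / execute / post-store phases of the example thread.

    frame        : this thread's frame memory (offsets 2..5 hold A, B, X, Y)
    dest1, dest2 : (frame, offset) pairs standing in for the FP|offset values
                   that the pre-load placed in R6|R7 and R8|R9
    """
    # Pre-load (SP): copy frame memory into the register context.
    regs = {2: frame[2], 3: frame[3], 4: frame[4], 5: frame[5]}

    # Execute (EP): register-only computation, as in Figure 3.4.
    regs[11] = regs[13] = regs[2] + regs[3]   # A + B
    regs[10] = regs[4] + regs[5]              # X + Y
    regs[12] = regs[4] - regs[5]              # X - Y
    regs[14] = regs[10] * regs[11]            # (X + Y) * (A + B)
    regs[15] = regs[12] / regs[13]            # (X - Y) / (A + B)

    # Post-store (SP): results go into other threads' frames.
    dest1[0][dest1[1]] = regs[14]
    dest2[0][dest2[1]] = regs[15]

consumer = {}
run_thread({2: 1.0, 3: 2.0, 4: 5.0, 5: 3.0}, (consumer, 2), (consumer, 3))
print(consumer)   # offset 2 -> (5+3)*(1+2) = 24.0, offset 3 -> (5-3)/(1+2) = 0.666...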

3.4 Thread Continuation

In order to better understand the implementation of SDF architecture, one needs to focus on the dynamic scenario that can be generated at run time. The concept of a thread continuation explains how a thread is scheduled at run time. A continuation is simply a four-value tuple, <FP, IP, RS, SC>. FP is the frame pointer (where the thread's input values are stored). IP is the Instruction Pointer (which points to the thread's code). RS is a register set consisting of the registers allocated to the thread. SC is the synchronization count (the number of values needed to enable that thread). Each

thread has an associated continuation. At a given time, a thread continuation can be one of the following:

• Waiting Continuation (WTC)        <FP, IP, --, SC>
• Pre-Load Continuation (PLC)       <FP, IP, RS, -->
• Enabled Continuation (EXC)        <--, IP, RS, -->
• Post-Store Continuation (PSC)     <--, IP, RS, -->

where "--" means that the value is not defined in that instance of the continuation. Therefore, at a given time a certain thread can be in a state of Waiting Continuation (WTC), Pre-Load Continuation (PLC), Enabled Continuation (EXC) or Post-Store Continuation (PSC), depending on the values stored in its continuation. After being created in WTC, a thread moves from WTC to pre-load (or PLC) status at the SP, to execute (or EXC) status at the EP, and finishes in post-store (PSC) status again at the SP. The synchronization pipeline of the SP handles the PLC and PSC. The execution pipeline of the EP handles the EXC continuation state. A Scheduler Unit (SU) handles WTC threads and the movements among the units.

Figure 3.7 Thread Continuation Transitions Handled by the Scheduling Unit (SU): a thread moves from WTC to PLC (on the SP), to EXC (on the EP) once the pre-load phase is completed, to PSC (back on the SP) once the execution phase is completed, and then terminates.
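The life cycle in Figure 3.7 can be summarized by the small state machine below; the enum values mirror the continuation states listed above, while the class layout, the pool of four register sets, and the method names are illustrative assumptions.

from enum import Enum, auto

class State(Enum):
    WTC = auto()    # waiting:    <FP, IP, --, SC>
    PLC = auto()    # pre-load:   <FP, IP, RS, -->
    EXC = auto()    # execute:    <--, IP, RS, -->
    PSC = auto()    # post-store: <--, IP, RS, -->

_free_sets = [0, 1, 2, 3]                       # assumed pool of register contexts

class Continuation:
    def __init__(self, fp, ip, sc):
        self.fp, self.ip, self.rs, self.sc = fp, ip, None, sc
        self.state = State.WTC

    def input_arrived(self):
        """A producer's post-store supplied one input (SC is decremented)."""
        self.sc -= 1
        if self.sc == 0:
            self.rs = _free_sets.pop()          # SU assigns a register context
            self.state = State.PLC              # pre-load is scheduled on the SP

    def preload_done(self):                     # FORKEP: hand off to the EP
        self.state = State.EXC

    def execute_done(self):                     # FORKSP: back to the SP
        self.state = State.PSC

    def poststore_done(self):                   # thread terminates, context freed
        _free_sets.append(self.rs)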

3.5 Execution Pipeline

As can be seen from Figure 3.8, the execution pipeline consists of four pipeline stages: instruction fetch, decode, execute, and write-back. The instruction fetch unit behaves like a traditional fetch unit, relying on a program counter to fetch the next instruction.5 The information in the register context can be viewed as a part of the thread id, <IP, RS>, where Register Set (RS) refers to the register set assigned to the thread during its execution. The decode and register fetch unit obtains the pair of registers that contains the two source operands for the instruction. The execute unit executes the instruction and sends the results to the write-back unit along with the destination register numbers. The write-back unit writes (up to) two values to the register file. In SDF architecture, a pair of registers is viewed as the source operands for an instruction. Data is stored in either the left or right half of a register pair by a previous instruction.

Figure 3.8 General Organization of Execution Pipeline (EP): instruction fetch, decode, execute, and write-back units, with an instruction cache, a program counter, register sets, and the register contexts of pre-loaded threads.

Unlike ETS, in SDF architecture, an instruction is not scheduled for execution immediately when the operands are matched. Instead, operands are saved in the register pair associated with the instruction. The enabled instruction is scheduled for execution at a later time. Note that the presence bits associated with operand registers are used only to catch exceptions from improper scheduling of instructions (i.e., an attempt to execute an instruction before the availability of its operands). As can be seen, the execution pipeline

5 Since both SP and EP need to execute instructions, the instruction cache is assumed to be dual ported.

Figure 3.9 Operand Register Pairs for Scheduled Dataflow Architecture (each register pair holds a left operand and a right operand, each with a presence bit P).

behaves more like a conventional pipeline (e.g., MIPS) while retaining the primary dataflow properties – data flows from instruction to instruction. This eliminates the need for complex hardware for detecting WAR and WAW dependencies and register renaming, as well as unnecessary thread context switches on cache misses.

3.6 Synchronization Pipeline

As can be seen, the synchronization pipeline consists of six stages: instruction fetch, decode, effective address, memory access, execute, and write-back. As mentioned earlier, the synchronization pipeline handles pre-load and post-store instructions. The instruction fetch unit retrieves an instruction belonging to the current thread using the Program Counter (PC). The decode unit decodes the instruction and fetches register operands (using a register set). The effective address unit computes the effective address for

LOAD and STORE instructions. LOAD and STORE instructions only reference the frame memories6 of threads, using an FP and an offset into the frame, both of which are

Figure 3.10 General Organization of Synchronization Pipeline (SP): instruction fetch, decode, effective address, memory access, execute, and write-back units, with instruction and data caches, a program counter, register sets, and the register contexts of enabled and post-store threads.

contained in registers. The memory access unit completes LOAD and STORE instructions. Pursuant to a post-store, the synchronization count of the destination thread is decremented. The write-back unit completes LOAD (pre-load) and I-FETCH instructions by storing the values in the appropriate registers.

3.7 Scheduling Unit (SU)

In any multithreaded execution model, there will be many threads in the ready queue that can start execution at a given time. The scheduling of threads to an SP and EP in SDF is handled by the SU. In SDF architecture, a thread is created using a FALLOC instruction. The FALLOC instruction creates a frame (accessible by an FP) related to a certain thread (pointed to by an IP) with a given Synchronization Count (SC), indicating

6 Following the traditional dataflow paradigm, I-structure memory is used for arrays and other structures.

the number of inputs needed to enable the thread. The FALLOC thus creates a WTC (<FP, IP, --, SC>), which is then handled by the SU. When a thread completes its execution and "post-stores" its results (performed by the SP), the synchronization counts of awaiting (WTC) threads are decremented. The SU checks when a synchronization count becomes zero. It then allocates a Register Set (RS) to the thread, and the continuation is scheduled for execution on the SP. The thread now has a PLC (<FP, IP, RS, -->). The PLC thread is then ready to execute on the SP. The SP loads values from the frame memory into the assigned register set. At the end of this phase, the thread is moved to the pre-loaded list and handed off to the EP, using a FORKEP instruction. The thread's continuation is in state EXC (<--, IP, RS, -->) and the thread is ready to be executed on the EP. The IP now points to the first instruction beyond the pre-load (referring to the first executable instruction). In order to speed up frame allocation, fixed-size frames for threads are pre-allocated and a stack of indexes pointing to the available frames is maintained. The SU makes a frame available to the EP with an index popped from that stack. The EP uses it as the address of the frame (i.e., the FP) in response to a FALLOC instruction. The register sets are viewed as circular buffers for allocating (and de-allocating) to enabled threads. The SP pushes indexes of de-allocated frames onto the stack when executing the FFREE instruction subsequent to the post-stores of completed threads. These policies permit fast context switching and fast creation of threads. The SU is also responsible for scheduling pre-load (PLC) and post-store (PSC) threads on multiple SPs and preloaded threads on multiple EPs in superscalar implementations of SDF architecture. FORKSP is used to move a thread from the EP to the SP. FORKEP is used to move a thread from the SP to the EP. FALLOC and FFREE take 2 cycles in

SDF architecture. FORKEP and FORKSP take 4 cycles to complete. These numbers are based on the observations made in Sparcle [2] that a 4-cycle context switch can be implemented in hardware. Note that the scheduling is at thread level in SDF, rather than at instruction level as done in other multithreaded systems (e.g., Tera, SMT), and thus requires simpler hardware. After the EP completes the execution of a thread, the thread is moved to the post-store list and handed off to the SP for post-storing, using the FORKSP instruction. The thread's continuation is in state PSC (<--, IP, RS, -->) and the thread is ready to be executed on the SP. The IP now points to the first post-store instruction. After completing the post-store, the register set (RS) is freed and the thread execution is complete.
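The frame-allocation policy described above amounts to a stack of free indexes over pre-allocated frames; the sketch below illustrates the idea, with the pool size, frame size, and method names chosen only for illustration.

class FramePool:
    """Pre-allocated, fixed-size frames managed through a stack of free
    indexes, so that FALLOC and FFREE reduce to a pop and a push."""

    def __init__(self, n_frames=64, frame_size=32):
        self.memory = [[None] * frame_size for _ in range(n_frames)]
        self.free = list(range(n_frames))     # stack of available frame indexes

    def falloc(self):
        """FALLOC: pop a free index; the EP uses it as the new thread's FP."""
        return self.free.pop()

    def ffree(self, fp):
        """FFREE (issued by the SP after post-stores): clear and push back."""
        self.memory[fp] = [None] * len(self.memory[fp])
        self.free.append(fp)

pool = FramePool()
fp = pool.falloc()     # frame for a newly created thread
pool.ffree(fp)         # released once the thread has post-stored its results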

3.8 I-Structure Memory for Scheduled Dataflow (SDF) Architecture

In a single assignment language, the single-assignment rule means that each update of a complex data structure consumes the value and produces a new data structure. In dataflow models, Arvind and Thomas proposed the concept of the I-structure [7]. An I-structure may be viewed as a special memory that stores data obeying the single-assignment rule. Data types such as arrays and structures can be stored in an I-structure. Each element of an I-structure is written only once but can be read any number of times. Status bits are used to identify the state of each element. Similarly, in SDF, I-structure memory is used in conjunction with the dataflow concepts. The following opcodes are used for I-structure memory in SDF (a detailed description can be found in Appendix A):

IALLOC  -- Opcode to allocate a given number of elements for a new I-structure; the pointer is stored in the destination register.
IFREE   -- Opcode to free the I-structure memory pointed to by the given I-structure pointer (contained in a register).
IFETCH  -- Opcode to retrieve the specified element and store its value in the destination registers. (If the element is not yet written, the request is deferred.)
ISTORE  -- Opcode to store the value into the specified I-structure element. (If the element is already present, an error condition is returned.)

The basic idea of the status bits is the same as described previously in Chapter 2. The status of each element in the I-structure can be PRESENT, ABSENT, or WAITING:

PRESENT  -- the element can be read but not written.
ABSENT   -- a write operation is permitted, but reads are deferred.
WAITING  -- one or more read requests for the element have been deferred.

The following example code explains the usage of I-structure operations in SDF. In SDF, only the EP can execute the IALLOC operation. The SP executes the other operations: IFETCH, ISTORE, and IFREE. The code shown in Figure 3.11 explains the IALLOC operation. PUTR1 stores the value 100 in register R1. The integer value 100 is then moved to register R5 in order to allocate 100 I-structure elements. The IALLOC operation creates I-structure memory to hold 100 elements, and the pointer to the I-structure is stored in the destination register R8. The created I-structure can now be accessed by using the pointer in register R8.

/* EP allocates 100 I-structure elements and  */
/* stores the pointer in Register R8          */
/* R1 is a scratch Register                   */
PUTR1   100
MOVE    R1, R5
IALLOC  R5, R8

Figure 3.11 Code for I-structure Allocation in SDF.

Figure 3.12 shows example code for the IFETCH operation in SDF. In this code, the value 15 is moved to R1 by the PUTR1 operation. Assuming the I-structure pointer is in register R8, the index of the element to be fetched (i.e., 15) is moved to register R9. The IFETCH operation fetches the 15th element of the I-structure pointed to by the pointer in R8, and the value is stored in register R15. The index value can also be used

/* Assuming pointer to I-structure is in R8 */
PUTR1   15
MOVE    R1, R9
IFETCH  RR8, R15        or      IFETCH  R8|15, R15

Figure 3.12 Code for IFETCH Element from I-structure in SDF.

without moving it to a register, as shown. Either form is valid in SDF, depending on where the instruction is used in the program. Example code for ISTORE is shown in Figure 3.13. Assume that the pointer to the I-structure is in register R8. The instruction PUTR1 100 moves the value 100 to register R1, and the following instruction moves it to register R15. If the value is to be stored in the 20th element of the I-structure, the index value 20 is moved to register R9. The ISTORE instruction then stores the value 100 in the 20th element of the I-structure memory.

/* Assuming pointer to I-structure is in R8 */
PUTR1   100
MOVE    R1, R15
PUTR1   20
MOVE    R1, R9
ISTORE  R15, R8|R9      or      ISTORE  R15, R8|20

Figure 3.13 Code for ISTORE Value in I-structure of SDF.

The figure below shows the code for IFREE, which frees the I-structure memory so that it can be reused later when needed. Once again, assuming register R8 contains the pointer to the I-structure memory, the IFREE instruction below frees the I-structure.

/* Assuming pointer to I-structure is in R8 */
IFREE   R8

Figure 3.14 Code for IFREE in I-structure of SDF.
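The PRESENT/ABSENT/WAITING behavior behind IFETCH and ISTORE can be modeled in a few lines of Python; the callback-based interface below is an illustrative assumption and not the simulator's implementation.

class IStructureCell:
    """One I-structure element: written at most once, read any number of
    times, with reads of an empty element deferred until the write arrives."""

    def __init__(self):
        self.status = "ABSENT"       # ABSENT -> WAITING -> PRESENT
        self.value = None
        self.deferred = []           # readers waiting for the value

    def ifetch(self, deliver):
        if self.status == "PRESENT":
            deliver(self.value)
        else:                        # defer the read until the element is written
            self.deferred.append(deliver)
            self.status = "WAITING"

    def istore(self, value):
        if self.status == "PRESENT":
            raise RuntimeError("I-structure element written twice")
        self.value, self.status = value, "PRESENT"
        for deliver in self.deferred:    # satisfy all deferred reads
            deliver(value)
        self.deferred.clear()

cell = IStructureCell()
cell.ifetch(lambda v: print("deferred read got", v))
cell.istore(42)      # prints: deferred read got 42
cell.ifetch(print)   # prints: 42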


Chapter IV

SCHEDULED DATAFLOW VERSUS CONVENTIONAL RISC SYSTEMS

4.1 Comparison of SDF Versus DLX (MIPS)

In order to avoid performance disparities introduced by different compilers, several hand-coded benchmark programs are used for both the SDF system and the MIPS7 system. The same programming style is used on both platforms (i.e., the same degree of loop unrolling, use of registers whenever possible, and avoidance of storing temporaries in memory). The programs used for this comparison include a recursive Fibonacci, matrix multiplication, Livermore kernel 5, Fast Fourier Transforms (FFT), and zoom (a code segment for picture zooming) [60]. Different benchmarks are chosen in order to analyze the different characteristics that exist in programs. Fast Fourier Transforms are used in many multimedia applications such as image processing, speech, and audio, as well as in signal processing problems. Matrix multiplication is also used extensively in scientific applications and is a promising benchmark for exploiting the maximum amount of parallelism. The Fibonacci program is chosen because of its recursive nature. The zoom program is also used in many multimedia applications such as picture zooming. Livermore kernel 5 is chosen from the Livermore loop benchmarks. Using the SDF simulator (1 SP and 1 EP), the performance of SDF is compared with a single-threaded RISC architecture. This chapter mainly presents the benefits of separating memory

7 Used DLX simulator [Hennessy 96] for this purpose.

accesses from the execution pipeline, combined with the non-blocking multithreaded and dataflow engine, as compared to the single pipeline of MIPS that executes all instructions including memory accesses. In the following chapters, comparisons of SDF (with multiple SPs and EPs) with recent multithreaded and superscalar architectures will be shown in detail. This chapter will also present the effect of parallelism (i.e., number of enabled threads) and thread granularity (average run-lengths of the execution threads on the EP) on the performance of this architecture. For the results shown in this chapter, the simulator assumes a perfect cache. Chapter VI will present the cache performance of SDF versus DLX by using traces and the DineroIV9 cache simulator.

4.2 Execution Performance of Scheduled Dataflow

Table 4.1 presents the execution cycles of DLX as compared to SDF. For the above-mentioned examples, a degree of 5 unrolling is used for the matrix multiply, Livermore loop 5, and zoom programs. In SDF, 5 (concurrent) threads are used for all programs except Fibonacci. On both platforms, one cycle per arithmetic operation and memory access is assumed. It can be noted from the table that the SDF system outperforms the MIPS architecture when the program exhibits greater thread parallelism (e.g., matrix multiply, zoom, and Livermore loop 5). The Livermore loop exhibits less parallelism than matrix multiplication even though both programs have three nested loops; this is due to a loop-carried dependency in the Livermore loop. The zoom program exhibits moderate parallelism. This is due to a significant serial portion in the outer loop, which limits the speedup (Amdahl's Law). SDF does not show any speedup for the Fibonacci

9 http://www.cs.wisc.edu/~markhill/DineroIV

program because of its recursive nature. Even though the system spawns many threads for the recursive calls, multiple threads do not improve performance. This is in line with

Table 4.1 Execution Performance of Scheduled Dataflow Compared to DLX.

Matrix Multiply
  N            DLX Cycles      SDF Cycles      Speedup
  25*25            966090          306702        3.150
  50*50           7273390         2159780        3.368
  75*75          24464440         6976908        3.506
  100*100        57891740        16175586        3.579

Fibonacci
  N            DLX Cycles      SDF Cycles      Speedup
  5                   615             842        0.7304
  10                 7014           10035        0.699
  15                77956          111909        0.6966
  20               864717         1241716        0.6964
  25              9590030        13771467        0.6964
  30             1.06E+08        1.53E+08        0.6964

Livermore 5 (Loop = N)
  N            DLX Cycles      SDF Cycles      Speedup
  50                87359           56859        1.536
  100              354659          215579        1.645
  150              801959          476299        1.684
  200             1429259          839019        1.703
  250             2236559         1303739        1.715
  300             3223859         1870459        1.724
  350             4391159         2911789        1.508
  400             5738459         3309899        1.734
  450             7265759         4182619        1.737

Zoom
  N            DLX Cycles      SDF Cycles      Speedup
  5,5,4             10175            9661        1.0532
  10,10,4           40510           37421        1.0825
  15,15,4           97945           83331        1.1754
  20,20,4          161580          147391        1.0963
  25,25,4          271175          229601        1.1811
  30,30,4          391150          329961        1.1854
  35,35,4          532285          448471        1.1869
  40,40,4          645520          585131        1.1032

the general acceptance that multithreaded architectures are not very effective for sequential (or single-threaded) applications. The speedup achieved in matrix multiplication is so large because it can easily be parallelized with three nested loops and no dependencies. Besides, the innermost loop is unrolled five times to increase the thread granularity. The main factors behind the speedup are the non-blocking multithreading and the decoupling of memory accesses. Due to the data dependencies encountered (from Load to ALU ops), more cycles are wasted in DLX. In addition, functions in DLX have used

the stack for exchanging data, which may have caused some unnecessary memory accesses. It is satisfying to note that it is possible to design a non-blocking multithreaded architecture with completely decoupled memory accesses and achieve scalable performance. However, SDF incurs unavoidable overheads for creating threads (allocation of frames and allocation of register contexts) and for transferring threads between the SP and EP using FORKEP and FORKSP instructions. Currently, the simulator is written in such a way that data can only be exchanged between threads by storing it in the threads' frames (memory). These memory accesses could be avoided by storing the results of a thread directly into another thread's register context, by allocating all frames directly to register sets (by providing sufficient register sets in hardware). SDF might have achieved better performance with this method of storing data directly into the following thread's registers. It is also important to note that the SDF system eliminates the need for the complex hardware required for dynamic instruction scheduling. The hardware savings can be used to include additional register sets, which can lead to an increased degree of thread parallelism and thread granularity.
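The speedups in Table 4.1 are simply the ratio of DLX cycles to SDF cycles; for example, using the 100*100 matrix multiply row (a sketch of the arithmetic only, not additional measured data):

dlx_cycles = 57891740    # DLX, 100*100 matrix multiply (Table 4.1)
sdf_cycles = 16175586    # SDF, same program and data size
print(f"speedup = {dlx_cycles / sdf_cycles:.3f}")   # speedup = 3.579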

4.3 Effect of Thread Level Parallelism on Execution Behavior

This section explains the performance benefits of increasing the thread level parallelism (i.e., the number of ready threads). The matrix multiplication program is used in Figure 4.1 to show the impact of increasing the number of threads and the speedup gained. SDF architecture and SMT architecture both use the thread level parallelism in a program as well as instruction level parallelism when applicable. It is obvious that increasing the number of threads notably improves the performance.



Figure 4.1 Effect of Thread Level Parallelism of SDF Execution (Matrix Multiply).

The latter part of this chapter presents the performance improvement from increasing thread granularity. This can be accomplished by unrolling loops to varying degrees. Both of these characteristics can be used in SDF to improve the performance of an application program. The Trimaran VLIW system uses heavy unrolling of loops to find suitable instructions to fill the long instruction word. Depending on the number of available registers in a register context, unrolling of loops can improve performance and increase the level of parallelism in a program such as matrix multiplication. A fixed matrix multiplication of 50*50 with a varying number of concurrent threads is used in the experiments shown here (i.e., Figure 4.1). Each thread executes five (unrolled) loop iterations. The number of threads is varied from 1 to 20. For the figure shown (Figure 4.1), only the innermost loop of matrix multiplication is parallelized (unlike the previous data in Table 4.1, where all 3 nested loops were parallelized). The actual slope of the curve (Figure 4.1) depends on the application and the problem size.

Table 4.2 Effect of Thread Level Parallelism (Execution Cycles on SDF).

  Data Size N   1 Thread   2 Threads   5 Threads   10 Threads   20 Threads
  100               3941        3011        2453         2267         2175
  200               7841        5981        4865         4494         4309
  300              11741        8951        7277         6719         6443
  400              15641       11921        9689         8945         8577
  500              19541       14891       12101        11171        10711
  600              23441       17861       14513        13397        12845
  700              27341       20831       16925        15623        14979
  800              31241       23801       19337        17849        17203
  900              35141       26771       21749        20075        19247
  1000             39041       29741       24161        22301        21381

The figure shows that increasing the degree of parallelism does not decrease the number of cycles needed in a linear fashion. This is due to saturation of both the synchronization and the execution pipelines (reaching more than 80% utilization with 10 threads). In the case of one SP, the SP is responsible for the pre-load and post-store activities, and the EP has less work. With more than one SP, the EP can be kept busy. This can


Figure 4.2 Effect of Thread Level Parallelism of SDF Execution (Livermore Loop 5).

be seen later where comparisons with multiple SPs and EPs are shown. There, it is shown that increasing the number of SPs and EPs drastically improves the exploitable thread level parallelism. Figure 4.2 shows the effect of increasing the number of threads for the Livermore kernel 5 program. This graph also presents the results for different data sizes. By increasing the number of threads, linear speedup cannot be achieved due to overheads such as the creation and switching of threads. However, the performance gain can be clearly seen. Other benchmarks such as Fibonacci, FFT, and zoom show similar behavior even though the data is not presented here.
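The saturation noted above can be quantified directly from the N = 1000 row of Table 4.2; the small sketch below only reworks those cycle counts and introduces no new data.

cycles = {1: 39041, 2: 29741, 5: 24161, 10: 22301, 20: 21381}   # Table 4.2, N = 1000
base = cycles[1]
for n, c in cycles.items():
    print(f"{n:2d} threads: {base / c:.2f}x over a single thread")
# 2 threads give 1.31x, 10 threads 1.75x, and 20 threads only 1.83x,
# showing the diminishing returns once the two pipelines approach saturation.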

4.4 Effect of Thread Granularity on Execution Behavior

This section presents the effect of thread granularity on execution behavior for matrix multiply. The number of concurrent threads is fixed at 5, and the thread granularity is varied by changing the number of innermost loop iterations executed by each thread (i.e., the degree of unrolling). For this experiment, 1 EP and 1 SP are used, and only the innermost loop of matrix multiply is considered.


Figure 4.3 Effect of Thread Granularity of SDF Execution (Matrix Multiply).

The data shown (Figure 4.3) is for a 50*50 matrix. The thread granularity ranged from an average of 27 instructions (12 for SP and 15 for EP) with no loop unrolling, to 51 instructions (13 for SP and 39 for EP) when each thread executed ten unrolled loop iterations. The execution performance improves (i.e., execution time decreases) as the threads become coarser. However, the improvement becomes less significant beyond a certain granularity. Similar behavior is observed for other benchmarks. The number of registers per thread context (currently 32 pairs) is also a limiting factor on the granularity. If more registers were available, the degree of unrolling could be increased.

Table 4.3 Effect of Thread Granularity on SDF Performance (Execution Cycles on SDF).

  Data Size N   1 Unrolled   2 Unrolled   3 Unrolled   4 Unrolled   5 Unrolled   10 Unrolled
                      Loop        Loops        Loops        Loops        Loops         Loops
  100                 8301         4671         3461         2806         2453          1667
  200                16561         9301         6885         5571         4865          3293
  300                24821        13931        10301         8336         7277          4919
  400                33801        18561        13720        11101         9689          6545
  500                41341        23191        17142        13866        12101          8171
  600                49601        27821        20561        16631        14513          9797
  700                57861        32451        23980        19386        16925         11424
  800                66121        37081        27401        22161        19337         13049
  900                74381        41711        30821        24926        21749         14675
  1000               82641        46341        34241        27691        24161         16301

Figure 4.4 also presents the effect of thread granularity on matrix multiplication, with data sizes ranging from 100 to 1000. The results confirm that the performance of multithreaded systems can benefit both from the degree of thread parallelism and from coarser-grained threads. Because of the non-blocking nature and the decoupling of memory accesses, it may not always be possible to increase thread granularity in decoupled

scheduled dataflow (SDF). Innovative compiler optimizations utilizing static branch prediction to speculatively preload threads and increase thread run-lengths are required to improve the performance of SDF.

Figure 4.4 Effect of Thread Granularity on SDF Performance with Varying Data Sizes (degree of loop unrolling of the innermost loop of matrix multiply).

4.5 Utilization of Two Processing Units

Since SDF has two processing units, it is important to look at the workload imbalance between these separate processing units (EP and SP). Figure 4.5 shows the utilization of the SP and EP for the Fibonacci, matrix multiply, Livermore kernel 5, and zoom programs for different data sizes. In the Fibonacci program, due to its recursive calls, utilization of the execution processor is higher than that of the synchronization processor. After making the recursive calls, the SP is idle and waits for the execution processor to return the result. Since matrix multiplication provides such a high degree of parallelism at the thread level and instruction level, the EPs and SPs are more nearly balanced.


Figure 4.5 Utilization of SP and EP on SDF.

Even though Livermore kernel 5 has three nested loops, due to its loop-carried dependencies the SP is utilized more than the EP. This unbalanced use of the EP and SP is also evident in the zoom program, which has a significant amount of sequential code. When more units are added to a processing element, the workload seems to be distributed equally among all of them. Due to thread level parallelism and instruction level parallelism (loop unrolling), SDF can utilize both the SP and EP equally well for most applications. It should be noted that the utilization ratios presented in this section do not correspond directly to execution performance. Since the EP waits for the SP to pre-load a thread, and the SP waits for the EP to complete before post-storing results, a single-threaded application may exhibit balanced utilization but no parallelism. However, for a multithreaded application, the utilization of the SP and EP will correspond more closely to performance gains.


Chapter V

SCHEDULED DATAFLOW VERSUS SUPERSCALAR AND VLIW

5.1 Evaluation of SDF Versus Superscalar Architecture

In the previous chapter, a comparison of SDF with a MIPS-like DLX simulator was shown. Here we compare SDF with superscalar and VLIW processors. The first experiment compares the performance of an SDF processor containing one SP and one EP with a superscalar processor containing one integer ALU, one integer multiply/divide unit, one floating point ALU, and one floating point multiply/divide unit. The SDF processor and the superscalar processor have the same number of functional units.10 For the superscalar, both in-order and out-of-order instruction issue are shown. In both systems, instructions take 1 cycle, and a perfect cache is assumed (all memory accesses are

Table 5.1 Superscalar Parameters Set.

  Superscalar Parameter                 Value
  Number of Functional Units            1 Integer Adder, 1 Integer Multiply,
                                        1 FP Adder, 1 FP Multiply/Divide
  Instruction Issue Width               8
  Instruction Fetch and Decode Width    8
  Register Update Unit (RUU)            32
  Load/Store Queue (LSQ)                32
  Branch Prediction                     Bimodal with 2048 entries

10 Here hand-coding is done to generate complete programs for the SDF simulator. It may not be appropriate to compare FP+Int with SP+EP, but this is the best we can do at present.

set to one cycle). For the superscalar system, both the instruction fetch and decode width and the instruction issue width are set to eight. The Register Update Unit (RUU) and Load/Store Queue (LSQ) are set to 32. The last two columns of Tables 5.2, 5.3, 5.4, and 5.5 compare the SDF system with the in-order and out-of-order superscalar processors. For the matrix multiply program (Table 5.2), 10 threads are spawned to execute in parallel on the SDF system. The out-of-order superscalar system consistently outperforms the SDF system. It is also important to note that the Simplescalar tool set11 performs extensive optimizations and dynamic instruction scheduling. If the SDF system compiler were written with similar optimization techniques, SDF might perform better than the out-of-order superscalar system. The SDF system does not perform any dynamic instruction scheduling, eliminating complex hardware (e.g., scoreboards or reservation stations). Moreover, Simplescalar utilizes branch prediction (the data shown uses bimodal prediction with 2048 entries). At present, the SDF system uses no branch prediction. It may be possible to use static branch prediction to speculatively preload data and increase the run-lengths of the threads. The matrix multiplication program exhibits a large degree of instruction level parallelism, and good branch prediction is easy to achieve. The data shown in the table uses only 10 threads. The compiler could be written to spawn a number of threads that matches the available register sets. As seen in the table, the SDF system outperforms the in-order superscalar system. For the FFT program shown in Table 5.3, eight SDF threads are spawned. Unlike for matrix multiply, the out-of-order superscalar system does not always outperform the SDF system in the case of FFT. If the data size is small, the out-of-order superscalar system performs better than all other systems by exploiting instruction level

11 http://www.cs.wisc.edu/~mscalar/simplescalar.html

parallelism. The thread level parallelism is quite low for these data sizes. However, for data sizes of 256 or larger, the available thread level parallelism used by the SDF system (and the overlapped execution of SP and EP) exceeds the available instruction level parallelism. This confirms the result reported on SMT [39, 41], which indicates that high performance is achieved by using a combination of thread level and instruction level parallelism. This can be seen clearly for larger data sizes, where the SDF system performs better than the superscalar architecture. Figure 5.1 plots execution time in cycles for different data sizes. It can be noted from the graph that the SDF system eventually surpasses both the in-order and out-of-order superscalar systems.

Table 5.2 SDF Versus Superscalar for the Program Matrix Multiply.

  Data Size     SDF (Cycles)   SS-IO (Cycles)   SS-OO (Cycles)   SDF vs. SS-IO   SDF vs. SS-OO
                                                                       Speedup         Speedup
  50*50            1,720,885        1,968,235        1,174,250          1.1438          0.6824
  100*100         13,318,705       15,434,835        9,170,850          1.1589          0.6886
  150*150         44,453,530       51,811,474       30,747,479          1.1656          0.6917

The results for the recursive Fibonacci program are shown in Table 5.4. This program exhibits very little parallelism at either instruction level or thread level. If the data size is very small, conventional superscalar systems appear to incur overhead in creating recursive function calls, while the SDF system creates very few threads and incurs smaller overhead. As the data size increases, the SDF system recursively creates too many threads, with little thread level parallelism. This leads to poor performance by the SDF system as compared to the superscalar system. This is again in line with the

general observation that multithreaded architectures perform poorly for applications with little or no thread level parallelism (and for single-threaded applications).

Table 5.3 SDF Versus Superscalar for FFT of Different Data Sizes.

  Data Size     SDF (Cycles)   SS-IO (Cycles)   SS-OO (Cycles)   SDF vs. SS-IO   SDF vs. SS-OO
                                                                       Speedup         Speedup
  8                   13,791           21,423           14,294          1.5534          1.0365
  16                  33,635           37,917           25,608          1.1273          0.7613
  32                  79,895           83,024           53,595          1.0391          0.6708
  64                 185,765          212,301          132,479          1.1428          0.7132
  128                424,411          604,674          364,203          1.4247          0.8581
  256                955,721        1,906,115        1,095,955          1.9944          1.1467
  512              2,126,583        6,543,376        3,576,399          3.0769          1.6818

Figure 5.1 Comparing SDF with a Superscalar Processor for FFT.

The zoom program execution results can be found in Table 5.5. The program contains a substantial amount of sequential code in the middle loop. This code allows for the exploitation of instruction level parallelism but limits thread level parallelism.

Moreover, in the SDF system, newly created threads wait for the pre-load activities of the SP. As we will see later, SDF's performance improves when multiple SPs are used.

Table 5.4 Scheduled Dataflow Versus Superscalar for Fibonacci Program.

  Data Size     SDF (Cycles)   SS-IO (Cycles)   SS-OO (Cycles)   SDF vs. SS-IO   SDF vs. SS-OO
                                                                       Speedup         Speedup
  5                      799           11,914            7,853         14.9111          9.8285
  10                   8,092           15,676           10,697          1.9372          1.3219
  15                  87,835           57,422           42,226          0.6537          0.4807
  19                 597,382          326,032          245,107          0.5458          0.4103

The data presented so far confirms that any multithreaded architecture requires greater thread level parallelism to achieve good performance, whereas superscalar architectures require greater instruction level parallelism. Our data shows that the non-blocking multithreaded model is well suited for decoupling memory accesses from the execution unit. The functional nature of SDF instructions eliminates the need for dynamic scheduling of instructions within a thread. The compiler must be written to make the best possible use of these two processing units. The SDF system incurs unavoidable overhead for creating threads (allocation of frames and allocation of register contexts) and for transferring threads between the SP and EP (FORKSP and FORKEP instructions). Hardware savings from eliminating dynamic scheduling of instructions may be used either to increase the number of register sets (thus supporting greater thread-level parallelism) or to add additional SPs and EPs.

Table 5.5 Scheduled Dataflow Versus Superscalar for the Zoom Program.

  Data Size     SDF (Cycles)   SS-IO (Cycles)   SS-OO (Cycles)   SDF vs. SS-IO   SDF vs. SS-OO
                                                                       Speedup         Speedup
  50*50*4            442,940          353,969          250,508          0.7991          0.5656
  100*100*4        1,768,070        1,400,923          998,666          0.7923          0.5648
  150*150*4        3,975,450        3,231,841        2,326,862          0.8129          0.5853
  200*200*4        7,065,080        5,584,818        3,970,448          0.7905          0.5620

5.2 Execution Performance of SDF With Multiple SPs and EPs

This section presents the results of experiments conducted to investigate the performance of the SDF system with multiple SPs and EPs, compared with the performance of a superscalar architecture with multiple integer and floating-point units. For comparison purposes, in this dissertation we equate the number of functional units in a superscalar (#Integer ALUs + #Floating Point ALUs)12 with the number of SPs and EPs (#SPs + #EPs).

Table 5.6 Superscalar Parameters for Multiple Functional Units.

  Superscalar Parameter                 Value
  Number of Functional Units            Varied
  Instruction Issue Width               32
  Instruction Fetch and Decode Width    32
  Register Update Unit (RUU)            32
  Load/Store Queue (LSQ)                32
  Branch Prediction                     Bimodal with 2048 entries

12 Each ALU in the superscalar contains separate adder and multiply/divide units. In SDF, each ALU is treated as a single unit performing all arithmetic operations.

It can be seen from the data that the conventional superscalar system does not scale well with an increasing number of functional units; the scalability is limited by the instruction fetch/decode window size and the RUU size. Some researchers have noted that the power consumed by the instruction issue logic increases quadratically with increased window width [69]. The SDF system relies primarily on thread level parallelism and the decoupling of memory accesses from execution. The SDF system performance can scale better with a proper balance of workload between SPs and EPs. Several experiments are conducted and the results are shown in the following pages. In order to provide greater opportunities for dynamic instruction scheduling in the superscalar system, the instruction fetch and decode window widths and the RUU are set to 32. Table 5.7 shows the data for the matrix multiplication program. As more SPs and EPs are added (and correspondingly more integer and floating point functional units in the superscalar), the SDF system outperforms the superscalar architecture, even when compared to the complex out-of-order scheduling used by the superscalar system.

The SDF system performance overtakes that of the out-of-order superscalar architecture with 3 SPs and 3 EPs (corresponding to 3 INT and 3 FP ALUs in the superscalar system). It should also be noted that for the superscalar architecture, the performance improvement with increasing number of functional units scales poorly – the superscalar architecture exhibits no improved performance beyond 3 INT and 3 FP ALUs. For SDF, the performance is limited by the SPs. Adding more SPs in SDF consistently improves the performance. In fact, adding more units in the superscalar system increases the complexity of finding more instructions to schedule dynamically and requires larger instruction issue/decode widths as well as more renaming

  Data size        SS 2INT+1FP   SDF 2SP+1EP   SS 2INT+2FP   SDF 2SP+2EP   SS 3INT+2FP   SDF 3SP+2EP   SS 3INT+3FP   SDF 3SP+3EP
  50*50     IO       1,890,104     1,504,297     1,890,104       860,782     1,867,200       756,707     1,867,200       574,242
            OO         712,396                     712,396                     706,877                     706,877
  100*100   IO      14,824,104    11,843,442    14,824,104     6,660,012    14,633,700     5,941,602    14,633,700     4,402,772
            OO       5,532,202                   5,532,202                   5,511,587                   5,511,587
  150*150   IO      49,763,150    39,762,487    49,763,150    22,227,742    49,110,246    19,924,912    49,110,246    14,819,482
            OO      18,514,510                  18,514,510                  18,468,811                  18,468,409

  Data size        SS 4INT+3FP   SDF 4SP+3EP   SS 4INT+4FP   SDF 4SP+4EP   SS 5INT+4FP   SDF 5SP+4EP   SS 5INT+5FP   SDF 5SP+5EP
  50*50     IO       1,867,200       507,197     1,867,200       430,957     1,867,200       381,247     1,867,200       345,027
            OO         680,321                     680,321                     680,321                     680,321
  100*100   IO      14,633,700     3,970,682    14,633,700     3,330,992    14,633,700     2,982,702    14,633,700     2,665,472
            OO       5,306,381                   5,306,381                   5,306,380                   5,306,380
  150*150   IO      49,110,246    13,308,457    49,110,246    11,115,592    49,110,246     9,990,607    49,110,246     8,894,002
            OO      17,782,453                  17,782,453                  17,782,453                  17,782,453

Table 5.7 SDF Versus Superscalar for Matrix Multiply of Different Data Sizes.

registers. In the SDF system, adding more SPs and EPs does not complicate the system since there is no dynamic instruction scheduling involved.


Figure 5.2 Scalability of SDF over Superscalar for the Program Matrix Multiply (Data Size 150*150).

The above figure presents the scalability of SDF for matrix multiply. The x-axis shows the number of functional units (#SP+#EP for SDF and #INT ALU + #FP ALUs for superscalar). The execution times for the matrix multiplication are shown on the y-axis for the data size of 150*150. Similar experiments are conducted for other programs such as Fibonacci, FFT and zoom. Table 5.8 shows the results for FFT execution. For the FFT program, the SDF system outperforms out-of-order superscalar for data sizes greater than 256 for all machine configurations. Once again, the SDF system performance scales better with added SPs than that of a superscalar when more functional units are added.


  Data size        SS 2INT+1FP   SDF 2SP+1EP   SS 2INT+2FP   SDF 2SP+2EP   SS 3INT+2FP   SDF 3SP+2EP   SS 3INT+3FP   SDF 3SP+3EP
  256       IO       2,874,748       992,129     2,874,748       638,705     2,843,118       559,665     2,843,118       558,353
            OO       1,056,799                   1,055,514                   1,005,519                   1,005,519
  512       IO       8,672,232     1,516,660     8,672,232     1,418,356    14,633,700     1,240,948     8,571,665     1,238,580
            OO       2,977,573                   2,974,495                   2,808,108                   2,808,108

  Data size        SS 4INT+3FP   SDF 4SP+3EP   SS 4INT+4FP   SDF 4SP+4EP   SS 5INT+4FP   SDF 5SP+4EP   SS 5INT+5FP   SDF 5SP+5EP
  256       IO       2,871,928       525,873     2,871,928       525,457     2,871,928       508,047     2,871,928       507,633
            OO       1,013,417                   1,013,417                   1,012,031                   1,012,031
  512       IO       8,636,150     1,163,956     8,639,150     1,163,956     8,639,150                   8,639,150
            OO       2,828,025                   2,828,025                   2,823,356                   2,823,356

Table 5.8 SDF Versus Superscalar for FFT Program of Different Data Sizes.


Figure 5.3 Scalability of SDF over Superscalar for the Program FFT (Data Size 256).

The scalability for the FFT program (Figure 5.3) is very similar to that shown for matrix multiplication, at least for data sizes greater than 256. Results for the Fibonacci program with data sizes 10 and 15 are shown in Table 5.9. As the number of SPs is increased, the SDF system compares more favorably with an out-of-order superscalar having a similar number of integer units. The numbers in bold in the original table indicate the points where SDF outperforms the out-of-order superscalar. As before, SDF performance scales better as SPs and EPs are added than the superscalar's does as functional units are added. Adding more FP ALUs to the superscalar shows no improvement because Fibonacci does not use floating point arithmetic.


(Units: #INT ALU + #FP ALU for the superscalar, #SP + #EP for SDF; all values in cycles.)

Data size 10
  Units   Superscalar in-order   Superscalar out-of-order   SDF
  2+1     14,677                 7,064                      7,408
  2+2     14,677                 7,064                      4,162
  3+2     14,208                 6,471                      3,929
  3+3     14,208                 5,974                      2,874
  4+3     14,174                 5,964                      2,764
  4+4     14,174                 5,964                      2,263
  5+4     14,174                 5,960                      2,193
  5+5     14,174                 5,960                      1,888

Data size 15
  2+1     51,580                 27,967                     80,802
  2+2     51,580                 27,967                     44,033
  3+2     50,566                 24,439                     41,481
  3+3     50,566                 24,439                     29,463
  4+3     50,189                 24,534                     28,151
  4+4     50,189                 24,534                     22,206
  5+4     50,189                 24,530                     21,415
  5+5     50,189                 24,530                     17,841

Table 5.9 SDF Versus Superscalar for Fibonacci Program of Different Data Sizes.


[Figure 5.4: line chart of execution cycles (y-axis) versus number of execution units (x-axis: 2+1 through 5+5) for SS-IO, SS-OO and SDF.]

Figure 5.4 Scalability of SDF over Superscalar for Fibonacci Program (Data Size 15).

Figure 5.4 also shows that the SDF cycle count decreases as the number of SPs increases. The x-axis again shows the number of functional units (#SP + #EP for SDF, #INT ALU + #FP ALU for the superscalar). A similar experiment was conducted for the zoom program; the data is shown in Table 5.10. With 5 SPs and 4 EPs, the SDF system outperforms even the out-of-order superscalar with 5 INT and 4 FP ALUs (shown in bold in the original table). The scalability of SDF for the zoom program is presented in Figure 5.5, where the SDF curve drops rapidly below the out-of-order superscalar data. The data for each of the benchmarks supports the contention that, as more processing units are added, SDF with multiple SPs and EPs becomes a viable alternative to superscalar architectures that rely on complex dynamic instruction scheduling logic.


(Units: #INT ALU + #FP ALU for the superscalar, #SP + #EP for SDF; all values in cycles.)

Data size 100*100*4
  Units   Superscalar in-order   Superscalar out-of-order   SDF
  2+1     1,310,785              518,518                    1,217,847
  2+2     1,310,785              518,518                      881,837
  3+2     1,310,379              447,182                      626,967
  3+3     1,310,379              447,182                      586,417
  4+3     1,310,377              408,164                      441,847
  4+4     1,310,377              408,164                      440,417
  5+4     1,310,377              398,454                      353,037
  5+5     1,310,379              398,454                      353,057
  6+5     1,310,379              398,128                      295,357
  6+6     1,310,377              398,128                      295,497

Data size 200*200*4
  2+1     5,224,086              2,061,519                  4,868,037
  2+2     5,224,086              2,061,519                  3,522,217
  3+2     5,223,680              1,777,283                  2,522,577
  3+3     5,223,210              1,776,955                  2,340,497
  4+3     5,223,208              1,622,538                  1,760,257
  4+4     5,223,208              1,622,538                  1,756,957
  5+4     5,223,208              1,585,528                  1,406,537
  5+5     5,223,680              1,585,528                  1,406,717
  6+5     5,223,210              1,585,528                  1,177,057
  6+6     5,223,208              1,585,528                  1,175,137

Table 5.10 SDF Versus Superscalar for Zoom Program of Different Data Sizes.


[Figure 5.5: line chart of execution cycles (y-axis) versus number of execution units (x-axis: 2+1 through 6+6) for SS-IO, SS-OO and SDF.]

Figure 5.5 Scalability of SDF with Superscalar for Zoom Program (Data Size 200*200*4).

For the zoom program, the SDF system is evaluated with up to 6 SPs and 6 EPs, against a superscalar with up to 6 INT ALUs and 6 FP ALUs. In the SDF system, SPs and EPs are no more complex than the traditional functional units used in superscalar systems. As mentioned previously, complex instruction issue, register renaming and instruction retiring logic are not used in the SDF system. Scheduling among the available SPs and EPs is performed at the thread level (instead of at the instruction level, as is done in Tera and SMT).

5.3 Comparison of Scheduled Dataflow with VLIW

This section presents the performance of the SDF system as compared with VLIW architectures, as facilitated by the Texas Instruments TMS320C6000 VLIW processor simulator tool-set (which includes an optimizing compiler and profiling tool) and the Trimaran infrastructure. The Texas Instruments TMS320C6000 family of DSPs uses a Very Long Instruction Word (VLIW) architecture. The processor provides eight functional units and two register files, organized into two data paths; each data path consists of a multiplier, an adder, a load/store unit and one unit for managing control flow (branch and compare instructions). For these two systems (VLIW and SDF), instruction execution and memory access cycles are set to match. The SDF system uses 8 functional units (4 SPs and 4 EPs), matching the 8-wide VLIW architecture. The SDF system is also compared with the Trimaran simulator using its default configurations and optimizations (a total of 9 functional units, a maximum loop unrolling of 32, and several other complex optimizations).

Table 5.11 Comparing SDF with VLIW (Matrix Multiply). All values in cycles.

  Data Size   SDF          Trimaran     TMS 'C6000    SDF/Trimaran   SDF/TMS 'C6000
  50*50       430,957      331,910      1,033,698     1.29841523     0.416908033
  100*100     3,330,992    2,323,760    16,199,926    1.43344924     0.205617729
  150*150     11,115,592   4,959,204    86,942,144    2.24140648     0.127850447

Table 5.11 presents the data for matrix multiplication. TMS 'C6000 does not perform well because its optimized version relies on unrolling only 5 iterations (unlike Trimaran, which unrolls 32). The SDF system achieves better performance than TMS 'C6000 because it relies on thread level parallelism; the data in Table 5.11 uses 10 active threads. The Trimaran system outperforms SDF because of the aggressive optimization performed by its compiler. Performing similar optimizations and/or increasing the number of active threads can improve SDF's performance. Trimaran exploits greater ILP since it examines 32 loop iterations, which is noticeable for larger data sizes where Trimaran can sustain higher issue rates.

Table 5.12 Impact of Thread Level Parallelism Versus VLIW (Matrix Multiply). All cycle counts for SDF use 4 SPs and 4 EPs; the VLIW is the 8-wide optimized TMS 'C6000.

  Data Size   TMS 'C6000   SDF, 10 threads   SDF, 5 threads   SDF, 2 threads   SDF-10 vs VLIW   SDF-5 vs VLIW   SDF-2 vs VLIW
  50*50       1,033,698    430,957           1,525,560        2,956,616        0.4169           1.4758          2.8602
  100*100     16,199,926   3,330,992         11,096,312       22,297,111       0.2056           0.6850          1.3764
  150*150     86,942,144   11,115,592        36,242,464       73,773,061       0.1279           0.4169          0.8485

In order to show the impact of thread level parallelism, Table 5.12 presents the execution cycles for matrix multiply using 2, 5 and 10 active threads in the SDF system. The TMS 'C6000 data is repeated for reference. The table shows that SDF gains performance as more thread-level parallelism is exploited. The SDF system can achieve performance comparable to that of the Trimaran simulator by increasing the number of register contexts (and hence the number of active threads).

Table 5.13 Comparison of SDF with VLIW (FFT). All values in cycles; a dash marks a value that is not available.

  Data Size   SDF         Trimaran (optimized)   TMS 'C6000 (optimized)   SDF/Trimaran   SDF/TMS
  8           8,148       4,622                  26,717                   1.762873215    0.304974361
  16          19,323      12,391                 73,456                   1.559438302    0.263055435
  32          45,028      31,665                 213,933                  1.422011685    0.210477112
  64          103,491     81,375                 619,241                  1.271778802    0.167125562
  128         234,766     214,685                2,040,729                1.093537043    0.115040263
  256         525,457     595,211                6,943,638                0.882807945    0.075674596
  512         1,163,956   1,768,441              -                        0.658181981    -

Even with only 2 threads, the SDF system outperforms TMS 'C6000 for larger data sizes, because SDF's use of both thread level and instruction level parallelism pays off as the data size grows. Table 5.13 shows the results of comparing the SDF system with TMS 'C6000 and Trimaran for the FFT benchmark. As in the comparison with the superscalar, the SDF system outperforms the Trimaran VLIW system for large data sizes (greater than 256), again by exploiting thread level as well as instruction level parallelism. In the comparison with the superscalar, the SDF system also scaled better with more functional units (particularly with more SPs). For large data sizes, the SDF system can utilize functional units more effectively than either superscalar or VLIW systems, which rely only on the ILP available in a single-threaded programming model.

Table 5.14 Comparison of SDF with VLIW (Zoom). All values in cycles.

  Data Size    SDF         Trimaran (optimized)   TMS (optimized)   SDF/Trimaran   SDF/TMS
  50*50*4      115,452     157,770                144,201           0.7317741      0.800632451
  100*100*4    459,667     630,520                641,625           0.72902842     0.716410676
  150*150*4    1,032,567   1,418,270              1,480,525         0.72804685     0.697433005
  200*200*4    1,833,337   2,521,020              2,959,430         0.72722033     0.619489902
  250*250*4    2,862,857   3,938,770              4,729,593         0.72684036     0.605307264

Table 5.14 shows the comparison of SDF with the two VLIW systems for the zoom program. Once again, the SDF system uses 4 SPs and 4 EPs against the 8-wide VLIW. The SDF system consistently outperforms both systems; in the original table the SDF column is marked in bold to indicate this. From the above data and the scalability figures, the SDF system can become a viable architecture for future microprocessors. It can efficiently support fine-grained threads and decouple memory accesses from the execution pipeline. Since memory accesses are separated from execution, it can easily tolerate long-latency operations. The SDF system also reduces the hardware complexity of the processor by eliminating the need for complex logic (e.g., scoreboards or reservation stations for resolving data dependencies, register renaming, out-of-order instruction issue and branch prediction).


Chapter VI CACHE PERFORMANCE OF SCHEDULED DATAFLOW ARCHITECTURE

6.1 Cache Performance of SDF Versus Superscalar Architecture

In von Neumann architectures, cache memories are used to exploit spatial and temporal locality. Pure dataflow architectures, however, exhibit poor locality, since execution is driven solely by the availability of operands. Kavi and Hurson, in their work on cache memories for dataflow architectures [29], have presented various methods to improve cache performance under a dataflow style of programming. They note that this can be achieved by proper partitioning of a dataflow program into vertical layers of data-dependent instructions, as well as by proper distribution and allocation of the recurrent portions of the program; in a dataflow program, improving the locality of data references improves cache performance. In SDF, a thread is a sequence of instructions with pre-load, execute and post-store code. Once the first instruction of a non-blocking thread is executed, the remaining thread executes without interruption; the thread is thus the basic unit of synchronization and execution. Since all the data needed by a thread resides in its frame, the cache should be designed to accommodate the frames of active threads. Frames used by the threads must be grouped together, and the block size must be large enough to hold a frame in order to achieve maximum performance. Compilers must also be designed to reuse the frames that were most recently released by the FFREE instruction.


Matrix Multiplication, 1 SP + 1 EP versus superscalar 1 INT + 1 FP
  N    Frame Ref   Write Miss   Write Hit %   Write Miss %   I-Struct Ref   I-Struct Miss   SDF Total Cycles   SS Cache Hit   SS Cache Miss (rate)   SS Total Cycles
  50   385,681     182          99.9528       0.0471893      252,500        120             1,873,081          390,930        332 (0.08)             1,203,099
  100  2,941,361   325          99.989        0.0110493      2,010,000      6,956           4,459,405          3,053,196      683 (0.02)             9,212,931
  150  9,767,041   518          99.9947       0.0053035      6,772,500      26,085          48,230,629         10,240,224     1,262 (0.01)           30,809,569

Matrix Multiplication, 2 SP + 2 EP versus superscalar 2 INT + 2 FP
  50   385,681     181          99.9531       0.04693        252,500        120             937,319            385,813        325 (0.08)             967,511
  100  2,941,361   353          99.98         0.0120012      2,010,000      7,288           7,231,391          3,032,962      676 (0.02)             7,434,811
  150  9,767,041   537          99.9945       0.00549808     6,772,500      33,463          24,117,680         10,194,876     1,262 (0.01)           24,904,584

Matrix Multiplication, 3 SP + 3 EP versus superscalar 3 INT + 3 FP
  50   385,681     182          99.9528       0.0471893      252,500        120             626,059            385,813        325 (0.08)             961,536
  100  2,941,361   349          99.9881       0.0118653      2,010,000      7,354           4,824,439          3,032,962      676 (0.02)             7,413,636
  150  9,767,041   520          99.9947       0.0053240      6,772,500      32,873          16,085,940         10,194,876     1,262 (0.01)           24,858,202

Table 6.1 Cache Performance Comparing SDF versus Superscalar with different units (cache size 128K, block size 64 bytes), Matrix Multiplication.

If a circular queue is used to allocate frames, a new frame is allocated whenever one is needed, causing a higher cache miss rate; if a stack is used instead, recently released frames are reused and better cache performance results. The results shown in Chapters 4 and 5 are based on instruction set simulators with perfect caches. The SDF simulator was modified to produce the actual cache performance of the architecture, and this section presents the cache performance of SDF compared with the superscalar architecture. The SimpleScalar simulator for the superscalar architecture provides an instruction cache and a data cache, with a default cache miss penalty of 18 cycles. For a fair comparison, the miss penalty for the instruction, frame and I-structure caches in SDF is also set to 18 cycles, and the superscalar is configured with separate instruction and data caches. For this experiment the cache size is 128K and the block size is 64 bytes. Table 6.1 presents the results for the matrix multiplication program with varying data sizes. The experiments were repeated with different cache and block sizes, but those results are not presented here. The table presents the frame and I-structure cache misses for SDF and the data cache misses for the superscalar, as well as the total cycle counts for both architectures; instruction cache misses are not shown. The main purpose of this experiment is to examine data cache behavior and the resulting total cycle counts for both architectures. As in the previous experiments, the number of SPs and EPs is varied for SDF, and the number of integer and floating point units is varied for the superscalar. SDF uses separate frame and I-structure caches while the superscalar uses only a data cache, so a direct comparison of cache miss counts between the two is not entirely fair.


Fibonacci, 1 SP + 1 EP versus superscalar 1 INT + 1 FP
  N   Frame Ref   Write Miss   Write Hit %   Write Miss %   Instr Ref   Instr Miss   SDF Total Cycles   SS Cache Hit   SS Cache Miss (rate)   SS Total Cycles
  5   238         34           85.71%        14.29%         1,025       8            1,303              3,595          218 (5.72%)            31,227
  10  2,830       359          87.31%        12.69%         11,798      8            13,390             4,503          218 (4.62%)            35,334
  15  31,566      7,174        77.27%        22.73%         131,232     8            20,350             14,641         220 (1.48%)            75,535

Fibonacci, 2 SP + 2 EP versus superscalar 2 INT + 2 FP
  5   238         34           85.71%        14.29%         1,025       8            796                3,595          218 (5.72%)            28,908
  10  2,830       351          87.60%        12.40%         11,798      8            6,748              4,505          218 (4.62%)            32,130
  15  31,566      7,184        77.24%        22.76%         131,232     8            103,023            14,622         220 (1.48%)            63,328

Fibonacci, 3 SP + 3 EP versus superscalar 3 INT + 3 FP
  5   238         33           86.13%        13.87%         1,025       8            606                3,597          218 (5.71%)            28,108
  10  2,830       356          87.42%        12.58%         11,798      8            4,624              4,498          218 (4.62%)            31,250
  15  31,566      7,239        77.07%        22.93%         131,232     8            69,121             14,522         220 (1.49%)            61,757

Table 6.2 Cache Performance Comparing SDF versus Superscalar with different units (cache size 128K, block size 64 bytes), Fibonacci Program.

However, in spite of the larger I-structure cache miss counts in SDF when more functional units are added, the total cycle count favors SDF; this is shown in bold in the original table. The I-structure cache can be improved as discussed in [29], which can lead to further improvement of the overall performance of SDF. Table 6.2 presents the cache performance and total cycle counts of SDF and the superscalar architecture for Fibonacci. Since the Fibonacci program does not involve any I-structure elements, the instruction cache miss is shown instead. The table shows that SDF performance is better than that of the superscalar architecture (again marked in bold in the original table). Similar experiments for other applications such as the Livermore loop and zoom were conducted and show behavior similar to that in Tables 6.1 and 6.2.

6.2 Cache Performance of SDF Versus DLX

Even though the instruction set simulator reports actual cache statistics after running a program, this section compares the cache behavior of SDF with that of DLX by collecting address traces on both systems and feeding them to Dinero IV [18]. Table 6.3 presents the cache behavior for matrix multiplication, Livermore kernel 5 spawning 5 threads, and a single-threaded Fibonacci program, using a 256K byte cache with a block size of 64 bytes. Data for other cache organizations is not shown; the SDF cache behaves much like cache memories in conventional systems when the cache parameters (associativity, block size and cache size) are varied. The best cache behavior is achieved when the block size equals the frame size. In modern architectures, data prefetching is used to reduce memory latency.


Matrix Multiply (miss rates in per cent)
  N    DLX Refs     DLX Misses   DLX Rate (%)   SDF Refs    SDF Misses   SDF Rate (%)
  25   286,714      61           0.02           156,470     382          0.24
  50   1,627,928    237          0.01           1,094,360   614          0.06
  75   5,462,503    530          0.01           3,526,250   1,010        0.03
  100  12,910,828   940          0.01           8,164,640   1,558        0.02

Livermore Loop 5
  N    DLX Refs    DLX Misses   DLX Rate      SDF Refs    SDF Misses   SDF Rate
  50   22,429      8            0.000356681   24,177      22           0.00090996
  100  88,829      13           0.000146349   91,913      31           0.00033728
  150  199,229     17           8.53289E-05   203,249     40           0.0001968
  200  353,629     22           6.22121E-05   358,185     49           0.0001368
  250  1,104,058   27           2.44552E-05   556,721     58           0.00010418
  300  794,429     32           4.02805E-05   798,857     67           8.387E-05
  350  1,080,829   36           3.33078E-05   1,084,593   76           7.0072E-05
  400  1,411,229   41           2.90527E-05   1,413,929   88           6.2238E-05
  450  1,785,629   46           2.57612E-05   1,786,865   97           5.4285E-05

Fibonacci
  N   DLX Refs     DLX Misses   DLX Rate   SDF Refs     SDF Misses   SDF Rate
  5   260          5            0.01923    134          8            0.0597
  10  3,014        10           0.00332    1,702        13           0.0076
  15  33,546       14           0.00042    19,076       18           0.0009
  20  372,152      18           4.8E-05    211,758      23           0.0001
  25  4,127,350    23           5.6E-06    2,348,634    28           1E-05
  30  40,975,448   27           6.6E-07    26,046,952   33           1E-06

Table 6.3 Cache Behavior of SDF Compared to DLX Architecture (cache size 256K bytes, block size 64 bytes).

In SDF, since the data for a thread is pre-loaded into the thread's context, the frame can be pre-fetched, so prefetching is effective in this architecture as well. The I-structures and the frames are mapped to different areas of memory, which can cause additional cache misses for matrix multiplication and Livermore loop 5; with separate caches for frames and I-structures, performance can be much better. The Fibonacci program in SDF shows results very similar to DLX, since it does not use any I-structure memory. The effect of separate I-structure and frame memories is explored in the next section.

6.3 Effect of Separate Data and I-Structure Caches on SDF

Table 6.4 presents the cache behavior for a unified cache as well as the effect of having separate caches for data and I-structures; this material was presented in the paper "Cache Performance of Scheduled Dataflow" [8]. If a single cache is used for both the frames (data cache) and the I-structures (arrays), more conflict misses result. For the unified case a single 256K cache (64 byte blocks) is used; for the split case, a 128K I-structure cache and a 128K frame cache are used. For the Livermore loop program, the separate I-structure cache brings the miss counts very close to the DLX miss counts. In the matrix multiplication program the misses behave differently because the program accesses rows of one matrix and columns of another, and the resulting strides cause more conflict misses for larger data sizes. Note from the table that the data cache (used for thread frames) encounters no conflict misses. This can be attributed to the stack-based allocation of frames and their reuse. The frame misses also indicate the maximum number of frames needed by the program during its execution.

Table 6.4 Effect of Separate I-Structure Cache.

Matrix Multiply
  N    SDF Refs    Unified Misses   Unified Miss Rate   I-Struct Misses   I-Struct Miss Rate   Frame Misses
  25   156,470     382              0.002441363         81                0.00051767           11
  50   1,094,360   614              0.000561059         319               0.00029149           11
  75   3,526,250   1,010            0.000286423         3,132             0.0008882            11
  100  8,164,640   1,558            0.000190823         6,215             0.00076121           11

Livermore Loop 5
  N    SDF Refs    Unified Misses   Unified Miss Rate   I-Struct Misses   I-Struct Miss Rate   Frame Misses
  50   24,177      22               0.000909959         12                0.00049634           10
  100  91,913      31               0.000337275         21                0.00022848           10
  150  203,249     40               0.000196803         30                0.0001476            10
  200  358,185     49               0.000136801         39                0.00010888           10
  250  556,721     58               0.000104181         48                8.6219E-05           10
  300  798,857     67               8.38698E-05         57                7.1352E-05           10
  350  1,084,593   76               7.00724E-05         66                6.0852E-05           10
  400  1,413,929   88               6.22379E-05         78                5.5165E-05           10
  450  1,786,865   97               5.4285E-05          87                4.8689E-05           10

Matrix multiplication requires a maximum of 11 frames, whereas the Livermore loop requires 10 frames; allocating register sets to threads on creation could eliminate these frames altogether. The use of I-structure and frame memories should be planned carefully, and frames should be allocated and de-allocated properly, in order to improve cache performance in SDF. This is in line with [29], which proposes that proper distribution and allocation of the recurrent portions of a dataflow program can improve cache performance.
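The base-case block of the Fibonacci program in Appendix C illustrates the allocation pattern discussed above: every frame the thread allocated is released with FFREE as soon as the result has been post-stored, so that, under the stack-based allocation policy, the very next FALLOC operations are handed the same, still cache-resident frames. The fragment below reproduces that block with explanatory comments added.

fibl2:  STORE   R2, R18|2       ; post-store the return frame pointer
        STORE   R3, R18|3       ; and the return offset
        PUTR1   1
        STORE   R1, R18|4       ; fib(1) = 1
        FFREE   R15             ; release the (n-1) frame for reuse
        FFREE   R16             ; release the (n-2) frame for reuse
        FFREE   R19             ; release the middle-reduce frame
        FFREE                   ; release this thread's own frame
        STOP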


Chapter VII

CONCLUDING REMARKS

7.1 Conclusions and Unique Contributions of this Dissertation

This chapter summarizes the main contributions and conclusions of the dissertation with regard to implementing and evaluating SDF. The data comparing the SDF system with superscalar and VLIW systems of multiple units shows that it is possible to build a simple (no out-of-order scheduling), cost-effective machine, compared to the complex hardware needed in modern superscalar and VLIW processors. A main contribution of this dissertation is to demonstrate the extent to which overall execution time can be lowered in the SDF system as compared to RISC-like systems; this reduction is achieved with simple hardware, by exploiting multithreading, decoupling and the dataflow paradigm. The data shows that it is possible to build an architecture with non-blocking threads and the decoupling of execution from memory accesses. Since the execution unit uses only registers, it proceeds with no bubbles or stalls in the pipeline. Many modern compilers, such as the Stanford University Intermediate Format compiler (SUIF, http://suif.stanford.edu/suif/suif.html), provide aggressive optimization techniques. These techniques may also be used with SDF; the compiler can ensure good data locality within the threads of an application, and this trend in compiler technology complements the use of threads in SDF.

A second important conclusion of the research is that the hardware used in SDF can be much simpler. The SDF system uses no dynamic instruction scheduling, so the hardware required for it, such as scoreboards and reservation stations, is not needed. These hardware savings can be used to provide more register sets, in order to exploit the thread level or instruction level parallelism found in an application. The third contribution of the research is the demonstration of the scalable nature of SDF with multiple SPs and EPs, as compared to superscalar and VLIW systems. In a superscalar system, adding hardware units increases the complexity of dynamic instruction scheduling and requires larger instruction windows; in the SDF system, adding SPs and EPs does not complicate the design. Cache memories must be used carefully when multiple SPs and EPs are added, otherwise cache misses will increase due to conflicts among threads. There are only 4 pipeline stages in the EP and 6 in the SP. The shorter pipelines also reduce the penalty of branch mispredictions; alternatively, these pipelines can be subdivided to increase clock speed, as is done in modern Pentium processors. Lastly, there have also been projects implementing multiple processors on the same chip (CMP) as an effective use of extra chip area, to take advantage of parallelism both across and within instruction streams. Sharing multiple register sets among the on-chip processors can reduce the cost. In the SDF system, the register sets are allocated to threads, and the threads are scheduled on different units at different times; such scheduling would be complex in a conventional on-chip multiprocessor.

7.2 Future Work

Several results obtained and presented in this dissertation lead to other areas of research. Currently, the simulator is written such that a thread loads all of its input data into the register context during the pre-load phase before being scheduled on the EP. After execution completes, the post-store phase writes the results to memory even when those results are needed by successor threads, which then load the same data back from memory. The post-store could instead be modified to place the required results directly into the successor threads' register contexts, allowing those threads to enter the execute phase and skip the pre-load phase. This can significantly improve SDF performance: unnecessary memory-to-memory data movement is avoided, and with less data moving to and from threads, cache performance can also improve. As mentioned, the simulator does not implement any branch prediction, which is common in superscalar and VLIW systems; a careful implementation of a suitable branch prediction technique could contribute to overall performance. In Chapter 5, multiple SPs and EPs were added to a single processing element to compare SDF with superscalar and VLIW systems. Further experiments can be performed with multiple processing elements, each with multiple SPs and EPs; this would also require a more complex design of I-structure access and of separate or unified caches, and would permit comparison with on-chip multiprocessor systems. Since each thread carries its own state (i.e., its continuation), threads from multiple applications (or user processes) can be mixed freely on SDF; this would be very complex to achieve in conventional multiprocessors, CMPs and SMTs.

The instruction set shown in Appendix A contains a limited number of instructions. The experiments presented in this research can be expressed entirely with this instruction set, but instructions may be added to the simulator as required by future benchmarks. The benchmarks presented in this dissertation are limited in number and hand-coded. If a suitable compiler with aggressive optimization techniques is developed, additional results can be obtained for this architecture; benchmark programs such as SPEC-95 could then be translated easily, and the compiler could further improve SDF performance. As shown in Chapter 5, when more units are added to the SDF system and the data sizes grow, performance improves dramatically and eventually overtakes the other systems.

APPENDIX A

SDF Instruction Set Manual, Reference 0.9.2

1. Conceptual Vision of the Machine

The Scheduled Dataflow Architecture consists of the following four main building blocks:
• the Instruction and Frame Memory
• the I-Structure Memory
• the Execution Processor (EP)
• the Synchronization Processor (SP)

1.1 The Instruction and Frame Memory

The Instruction and Frame memory is local to the machine. It is used both for retrieving SDF instructions and for reading or writing frames (a frame is a local chunk of memory that holds all the data addressed by a certain code block). The memory is organized as a vector of words 32 bits wide, and contains 2^30 entries (for a total of 2^32 bytes).

1.2 The I-Structure Memory

The I-Structure memory has been introduced to provide a way of sharing common data between different processors and of storing arrays. Four operations are possible on the I-Structure memory: allocation, deallocation, retrieval of data and storing of data; these operations are specified fully in the instruction set below. The I-Structure memory guarantees synchronization among data accesses by different processors when multiple processors are implemented.

1.3 The Execution Processor (EP)

The Execution Processor includes:
• the Execution Pipeline
• the Program Counter (PC)
• the Running Context Pointer (RCP)
• 16 contexts (CTX00, CTX01, ...), also accessible by the SP
• an instruction cache

1.3.1 The Running Context Pointer (RCP)

The RCP always points to the running context. The running context includes a set of registers, as specified in the instruction set section. Active contexts are those that have been allocated by some execution thread and that are being filled with data from the Frame Memory, or that are completing the storing of data into the Frame Memory. The EP always has one running context, a number of active contexts, and a number of unallocated contexts.

1.3.2 The Execution Pipeline

The Execution Pipeline consists of four stages, ordered as follows:
• Instruction Fetch
• Operand Fetch (up to 2 operands may be fetched)
• Execute
• Write Back (up to 2 operands may be stored)

All operations involving operands (operand fetch and write back) act exclusively on the running context registers in a non-blocking fashion. The SP takes care of loading and storing data from/to operand (frame) cache/memory as specified in the following subsection. The execution of a code block can start only when the SP has located all the values that are needed by the frame associated with that code block.
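As an illustration of this discipline, the following fragment (a minimal sketch patterned on the execute blocks of the hand-coded benchmarks in Appendix C; the labels sum1 and sum2 are hypothetical) shows an execute-phase code block that touches only the registers of its running context and ends by handing the context back to the SP:

sum1:
        ADD     RR4, R14        ; register-to-register only; R4/R5 were pre-loaded by the SP
        PUTR1   sum2            ; address of the post-store block into the scratch register R1
        FORKSP  R1              ; schedule the post-store code on the SP
        STOP                    ; the EP itself never touches memory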

1.4 The Synchronization Processor (SP)

The SP takes care of loading and storing operands in the active contexts. The active contexts are all allocated contexts except the running context; they hold operands to be retrieved from, or stored to, the operand (frame) cache memory. The SP also contains:
• an Operand (Frame) cache
• an I-Structure cache

2.1 Registers

The machine supports multiple contexts. Each context has 32 register pairs, and each register of a pair can be addressed separately. Each register must have enough room to accommodate all four basic types: Boolean, Integer, Character and Real.
• Register R0 is hardwired to 0.
• Register R1 is used as a scratch register for immediate-operand instructions. It can be used freely as long as it does not interfere with the PUTR1 function (a short example follows this list).
• Register R63 is reserved for holding the running context's synchronization count.
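The following idiom, taken from the main code block of the Fibonacci program in Appendix C, shows the intended use of R1: an immediate (here the address of a code block) is placed in R1 with PUTR1 and is copied out before R1 is reused for the next immediate.

        PUTR1   fib0            ; immediate (address of code block fib0) into R1
        MOVE    R1, R16         ; copy it out of the scratch register
        PUTR1   3               ; R1 is now free for the next immediate
        MOVE    R1, R17         ; 3 = number of frame values the new thread waits for
        FALLOC  RR16, R8        ; allocate a frame for the new thread; pointer returned in R8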

The machine must guarantee at least the following data ranges for these types:

  TYPE        RANGE
  Boolean     TRUE, FALSE
  Character   0 ... 255
  Integer     -2147483648 ... 2147483647 (-2^31 .. 2^31-1)
  Real        sign, 23-bit mantissa, 8-bit exponent (32-bit single-precision IEEE floating point)

2.2 Notation

• RD indicates a destination register; RRD indicates a destination register-pair.
• RS indicates a source register; RRS indicates a source register-pair.
• '.l' means 'left portion of a register-pair'; '.r' means 'right portion of a register-pair'.
• 'I' indicates an I-Structure; F indicates a Frame; C indicates a code-block; D indicates an I/O device.
• <I, indx> indicates the I-Structure entry I[indx].
• <F, offset> indicates the Frame data at offset 'offset' in Frame F.
• '&' means 'address of' when placed before one of the previous objects.

2.3 Instruction Formats

Instructions are one 32-bit word wide; field boundaries fall at bit positions 7/8, 12/13 (or 13/14), 19/20 and 24/25. Three formats are used:

• R format (register-to-register operations): OpCode, an RRS or RS source field, one or two destination register fields (RD1, RD2), and a Reserved field.
• R0 format (register to indexed operand operations): OpCode, an RR or R field, an R field, an Offset field, and a Reserved field.
• RI format (immediate value into register loading): OpCode followed by a Value/address field that occupies the remainder of the word (through bit 31).

2.4 Arithmetic Operators Arithmetic operators are allowed to operate on each compatible basic type. It is up to the compiler to guarantee that an operator is applied to correct operands. On the other side it is up to the architecture to select the appropriate behavior of a certain operator, since the type of the operands is known. ADD

-Add two operands

Usage: ADD RRS, RD

or

ADD RRS, RD1|RD2

93 ADD…0 ADD…0

RRS

RD

0

RES

RRS

RD1

RD2

RES

Description: Performs addition and stores result in single or double destination. Operation: RD ← (RRS.l +RRS.r)(or RD1 ←(RRS.l+RRS.r) and RD2 ← (RRS.l+RRS.r)) SUB

-Subtract two operands

Usage: SUB RRS, RD

or

SUB RRS, RD1|RD2

SUB…0

RRS

RD

0

RES

SUB…0

RRS

RD1

RD2

RES

Description: Performs subtraction and stores result in single or double destination. Operation: RD ← (RRS.l - RRS.r)(or RD1 ←(RRS.l - RRS.r) and RD2 ← (RRS.l - RRS.r)) MUL

-Multiply two operands

Usage: MUL RRS, RD

or

MUL RRS, RD1|RD2

MUL…0

RRS

RD

0

RES

MUL…0

RRS

RD1

RD2

RES

Description: Performs multiplication and stores result in single or double destination. Operation: RD ← (RRS.l x RRS.r)(or RD1 ←(RRS.l x RRS.r) and RD2 ← (RRS.l x RRS.r))

94 DIV

-Divide two operands

Usage: DIV RRS, RD

or

DIV RRS, RD1|RD2

DIV…0

RRS

RD

0

RES

DIV…0

RRS

RD1

RD2

RES

Description: Performs division and stores result in single or double destination. Operation: RD ← (RRS.l / RRS.r)(or RD1 ←(RRS.l / RRS.r) and RD2 ← (RRS.l / RRS.r)) MOD

-Modulo of two operands

Usage: MOD RRS, RD

or

MOD RRS, RD1|RD2

MOD…0

RRS

RD

0

RES

MOD…0

RRS

RD1

RD2

RES

Description: Performs modulo and stores result in single or double destination. Operation: RD←mod(RRS.l, RRS.r)(or RD1←mod(RRS.l,RRS.r)andRD2←mod(RRS.l, RRS.r)) AND

-Logical AND of two operands

Usage: AND RRS, RD

or

AND RRS, RD1|RD2

AND…0

RRS

RD

0

RES

AND…0

RRS

RD1

RD2

RES

Description: Performs logical AND and stores result in single or double destination. Operation:

95 RD ← (RRS.l ∩RRS.r)(or RD1 ←(RRS.l ∩RRS.r) and RD2 ← (RRS.l∩RRS.r)) BAND

-Binary AND operands

Usage: BAND RRS, RD

or

BAND RRS, RD1|RD2

AND…0

RRS

RD

0

RES

AND…0

RRS

RD1

RD2

RES

Description: Performs Binary AND and stores result in single or double destination. Operation: RD ← (RRS.l ∩RRS.r)(or RD1 ←(RRS.l ∩RRS.r) and RD2 ← (RRS.l∩RRS.r)) OR

-Logical OR of two operands

Usage: OR RRS, RD or

OR RRS, RD1|RD2

OR…0

RRS

RD

0

RES

OR…0

RRS

RD1

RD2

RES

Description: Performs logical OR and store result in single or double destination. Operation: RD ← (RRS.l ∪RRS.r)(or RD1 ←(RRS.l ∪RRS.r) and RD2 ← (RRS.l∪RRS.r)) NOT

-Logical NOT

Usage: NOT RS, RD or

NOT RS, RD1|RD2

NOT…0

RS

RD

0

RES

NOT…0

RS

RD1

RD2

RES

Description: Performs logical NOT and store result in single or double destination.

96 Operation: RD ← not(RS) (or RD1 ←not(RS) and RD2 ←not(RS)) SHL

-Logical Shift left

Usage: SHL RRS, RD

or

SHL RRS, RD1|RD2 or SHL RRS.l|n, RD

SHL…0

RRS

RD

0

RES

SHL…0

RRS

RD1

RD2

RES

Description: Performs a logical shift left and stores the result in a single or double destination. Operation: RD ← (RRS.l << RRS.r) (or RD1 ← (RRS.l << RRS.r) and RD2 ← (RRS.l << RRS.r))

NEG

-Change sign to operand

Usage: NEG RS, RD or

NEG RS, RD1|RD2

NEG…0

RS

RD

0

RES

NEG…0

RS

RD1

RD2

RES

97 Description: Performs sign change and store result in single or double destination. Operation: RD ← -(RS) (or RD1 ←-(RS) and RD2 ←-(RS)) MAX

-Maximum between two operands

Usage: MAX RRS, RD

or

MAX RRS, RD1|RD2

MAX…0

RRS

RD

0

RES

MAX…0

RRS

RD1

RD2

RES

Description: Calculate maximum and store result in single or double destination. Operation: RD←max(RRS.r,RRS.l)(orRD1 ←max(RRS.r,RRS.l) and RD2 ←max(RRS.r,RRS.l)) MIN

-Minimum between two operands

Usage: MIN RRS, RD

or

MIN RRS, RD1|RD2

MIN…0

RRS

RD

0

RES

MIN…0

RRS

RD1

RD2

RES

Description: Calculate minimum and store result in single or double destination. Operation: RD←min(RRS.r,RRS.l)(orRD1 ←min(RRS.r,RRS.l) and RD2 ←min(RRS.r,RRS.l)) ABS

-Absolute value

Usage: ABS RS, RD or

ABS RS, RD1|RD2

98 ABS…0

RS

RD

0

RES

ABS…0

RS

RD1

RD2

RES

Description: Calculate Absolute value and store result in single or double destination. Operation: RD←|RS| (orRD1 ←|RS| and RD2 ←|RS|) FLR

-Floor value

Usage: FLR RS, RD or

FLR RS, RD1|RD2

FLR…0

RS

RD

0

RES

FLR…0

RS

RD1

RD2

RES

Description: Calculates the floor value and stores the result in a single or double destination. Operation: RD ← floor(RS) (or RD1 ← floor(RS) and RD2 ← floor(RS))

2.5 Compare Operators

LT

-Less than

Usage: LT RRS, RD or

LT RRS, RD1|RD2

LT…0

RRS

RD

0

RES

LT…0

RRS

RD1

RD2

RES

Description: Performs the comparison; stores 1 in the single or double destination if RRS.l is less than RRS.r, else 0. Operation: RD ← (RRS.l < RRS.r) (or RD1 ← (RRS.l < RRS.r) and RD2 ← (RRS.l < RRS.r))

LE

-Less than or Equal

Usage: LE RRS, RD or

LE RRS, RD1|RD2

LE…0

RRS

RD

0

RES

LE…0

RRS

RD1

RD2

RES

Description: Performs the comparison; stores 1 in the single or double destination if RRS.l is less than or equal to RRS.r, else 0. Operation: RD ← (RRS.l ≤ RRS.r) (or RD1 ← (RRS.l ≤ RRS.r) and RD2 ← (RRS.l ≤ RRS.r))

GE

-Greater than or Equal

Usage: GE RRS, RD or

GE RRS, RD1|RD2

GE…0

RRS

RD

0

RES

GE…0

RRS

RD1

RD2

RES

Description: Performs comparison if greater than or equal to the right hand side then stores 1 in result in single or double destination if the condition is satisfied else 0. Operation:

RD ← (RRS.l ≥ RRS.r) (or RD1 ← (RRS.l ≥ RRS.r) and RD2 ← (RRS.l ≥ RRS.r))

EQ

-Equal to

Usage: EQ RRS, RD or

EQ RRS, RD1|RD2

EQ…0

RRS

RD

0

RES

EQ…0

RRS

RD1

RD2

RES

Description: Performs the comparison; stores 1 in the single or double destination if RRS.l is equal to RRS.r, else 0. Operation: RD ← (RRS.l == RRS.r) (or RD1 ← (RRS.l == RRS.r) and RD2 ← (RRS.l == RRS.r))

NE

-Not Equal to

Usage: NE RRS, RD or

NE RRS, RD1|RD2

NE…0

RRS

RD

0

RES

NE…0

RRS

RD1

RD2

RES

Description: Performs the comparison; stores 1 in the single or double destination if RRS.l is not equal to RRS.r, else 0. Operation: RD ← (RRS.l ≠ RRS.r) (or RD1 ← (RRS.l ≠ RRS.r) and RD2 ← (RRS.l ≠ RRS.r))

2.6 Type Conversion Operators

Type conversion operators are needed to modify the type of the content of a register before applying a certain arithmetic operation, in order to perform the correct arithmetic function.

101 TBL

-Convert to Boolean Type

Usage: TBL RS, RD or

TBL RS, RD1|RD2

TBL…0

RS

RD

0

RES

TBL…0

RS

RD1

RD2

RES

Description: Performs conversion to boolean and store value in single or double destination. Operation: RD←bool(RS) (orRD1 ←bool(RS) and RD2 ←bool(RS )) TCH

-Convert to Character Type

Usage: TCH RS, RD or

TCH RS, RD1|RD2

TCH…0

RS

RD

0

RES

TCH…0

RS

RD1

RD2

RES

Description: Performs conversion to character and store value in single or double destination. Operation: RD←char(RS) (orRD1 ←char(RS) and RD2 ←char(RS )) TRL

-Convert to Real Type

Usage: TRL RS, RD or

TRL RS, RD1|RD2

TRL…0

RS

RD

0

RES

TRL…0

RS

RD1

RD2

RES

Description: Performs conversion to Real and store value in single or double destination. Operation: RD←real(RS) (orRD1 ←real(RS) and RD2 ←real(RS ))

102 TDB

-Convert to Double Type

Usage: TDB RS, RD or

TDB RS, RD1|RD2

TDB…0

RS

RD

0

RES

TDB…0

RS

RD1

RD2

RES

Description: Performs conversion to double type and store value in single or double destination. Operation: RD←double(RS) (orRD1 ←double(RS) and RD2 ←double(RS )) TIN

-Convert to Integer Type

Usage: TIN RS, RD or

TIN RS, RD1|RD2

TIN…0

RS

RD

0

RES

TIN…0

RS

RD1

RD2

RES

Description: Performs conversion to Integer type and store value in single or double destination. Operation: RD←integer(RS) (orRD1 ←integer(RS) and RD2 ←integer(RS )) 2.7 Data Movement MOVE

-Move data between registers

Usage: MOVE RS, RD

Moves data between RS to RD

MOVE RS, RD1|RD2

Moves data from RS to RD1 and RD2

MOVE…0

RS

RD

0

RES

MOVE…0

RS

RD1

RD2

RES

Description: Moves data to single or double destination.

103 Operation: RD←(RS) (orRD1 ←(RS) and RD2 ←(RS)) PUTR1

-Put immediate data into register R1

Usage: PUTR1 value

Put immediate value/address into R1

Value Description: Put immediate value/address into R1. Operation: R1←value LOAD

-Load data from Frame

Usage: LOAD RRS, RD

Load data from <RRS.l, RRS.r> into RD

LOAD RS|offset, RD

Load data from <RS, offset> into RD

LOAD…0

RRS

RD

0

RES

LOAD…0

RD

RS

Offset

RES

Description: Loads frame data into register(s). Operation: RD ← <F, offset>, where the frame F and the offset are given either by the register pair RRS (F in RRS.l, offset in RRS.r) or by the RS|offset form (F in RS). Note: the maximum value of the offset field is 31 (2^5-1). The instruction has no effect if the data is not present (i.e., it is non-blocking).

STORE

-Store data into Frame

Usage: STORE RS, RD1|RD2

Store data from RS into <RD1, RD2>

STORE RRS, RD1|RD2 RRS.l in , RRS.r in

104 STORE RS, RD|offset

Store data from RS into

STORE RRS, RD|offset RRS.l in , RRS.r in STORE…0

RS

RD1

RD2

RES

STORE…0

RRS

RD1

RD2

RES

STORE….0

RS

RD

Offset

RES

STORE…0

RRS

RD

Offset

RES

Description: Stores register data into the frame. Operation: <F, offset> ← RS, i.e., F[offset] ← RS, with the frame and offset given by the destination fields as in the usage forms above. Note: the maximum value of the offset field is 31 (2^5-1).
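To make the pre-load/post-store convention concrete, the following fragment (a minimal sketch in the style of the middle-reduce thread of Appendix C; the labels sum0 through sum02 are hypothetical) pre-loads its operands from the thread's frame on the SP, computes on the EP, and post-stores the result into the frame of a waiting consumer thread:

code sum0                       ; pre-load block, runs on the SP
        LOAD    RFP|2, R2       ; consumer's frame pointer
        LOAD    RFP|3, R3       ; offset within the consumer's frame
        LOAD    RFP|4, R4       ; first operand
        LOAD    RFP|5, R5       ; second operand
        PUTR1   sum01
        FORKEP  R1              ; all inputs present: move to the EP
        STOP
sum01:                          ; execute block, runs on the EP
        ADD     RR4, R14        ; R14 <- R4 + R5
        PUTR1   sum02
        FORKSP  R1              ; hand back to the SP for the post-store
        STOP
sum02:                          ; post-store block, runs on the SP
        STORE   R14, R2|R3      ; result into the consumer's frame
        FFREE                   ; release this thread's frame
        STOP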

2.8 I-Structure Management

IALLOC

-Allocate memory for an I-Structure

Usage: IALLOC RS, RD

Allocates an I-Structure of RS entries

IALLOC value, RD

Allocates an I-Structure of ‘value’ entries

IALLOC…0

RS

RD

0

RES

Description: An I-Structure of the specified size is allocated and its pointer is stored in RD. Operation: RD ← &I. The I-Structure flags are initialized to E (Empty).

IFREE

-FREE the memory belonging to an I-Structure

Usage: IFREE RS

Free the specified I-Structure.

IFREE addr

Free the specified I-Structure.

IFREE…0

RS

0

0

RES

105 Description: The I-Structure specified by RS is freed. Operation: IFETCH

-Fetch an I-Structure entry

Usage: IFETCH RRS, RD

Fetch

IFETCH RS|index, RD

Fetch

IFETCH…0

RRS

RD

0

RES

IFETCH

RD

RS

Index

RES

Description: Given the I-Structure I, loads the specified value into RD if its flag is F (data present); otherwise the request is queued and the flag is set to W (waiting for data to come). Operation: RD ← I[index].value if I[index].flag = F

ISTORE

-Store an I-Structure entry

Usage: ISTORE RS, RD1|RD2

Stores into

ISTORE RS, RD|index

Stores into

ISTORE..0

RS

RD1

RD2

RES

ISTORE

RS

RD

Index

RES

Description: Given the I-Structure I, stores the value in RS into the specified entry and sets its flag to F (data present). Operation: I[index].value ← RS and I[index].flag ← F (thereafter, all pending requests for that entry are satisfied).
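The following short fragment (a minimal sketch using only the forms defined above; the register numbers and the element index are arbitrary) allocates an I-structure, stores one element and fetches it back:

        IALLOC  100, R10        ; allocate a 100-entry I-structure; pointer in R10
        PUTR1   42
        MOVE    R1, R11         ; value to be stored
        ISTORE  R11, R10|5      ; I[5] <- 42; the entry's flag becomes F (present)
        IFETCH  R10|5, R12      ; R12 <- I[5]; would be deferred if the flag were still E
        IFREE   R10             ; release the I-structure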

2.9 Thread Support

FORKSP

-Schedule execution of code on Synchronization processor

Usage: FORKSP RS, RD

conditionally schedules the code at RD

FORKSP RD

unconditionally schedules the code at RD

FORKSP RS, addr

conditionally schedules the code at addr

FORKSP addr

unconditionally schedules the code at addr

FORKSP..0

RS

RD

RES

FORKSP..0

0

RD

RES

Description: Schedule the execution of a certain thread on synchronization processor(SP), when present, the condition is true if its value is not zero. FORKEP

-Schedule execution of code on Execution processor

Usage: FORKEP RS, RD

conditionally schedules the code at RD

FORKEP RD

unconditionally schedules the code at RD

FORKEP RS, addr

conditionally schedules the code at addr

FORKEP addr

unconditionally schedules the code at addr

FORKEP..0

RS

RD

RES

FORKEP..0

0

RD

RES

Description: Schedule the execution of a certain thread on Execution processor(EP), when present, the condition is true if its value is not zero. STOP Usage: STOP

-Terminate the current thread

107 STOP

0

0

0

RES

Description: Stops the current thread and schedules another one. This also frees the running context.
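The fragment below (adapted from the main code block of the Fibonacci program in Appendix C; the labels parent02, child0 and child1 are hypothetical) illustrates the life cycle of a thread using these instructions: the parent allocates a frame with FALLOC, stores the child's arguments into it, and the child's code blocks move between the SP and the EP with FORKEP/FORKSP until STOP releases the context.

        PUTR1   child0
        MOVE    R1, R16         ; code-block address of the child
        PUTR1   2
        MOVE    R1, R17         ; 2 = number of frame values the child waits for
        FALLOC  RR16, R8        ; allocate the child's frame; pointer in R8
        PUTR1   parent02
        FORKSP  R1              ; continue in the parent's post-store block
        STOP
parent02:
        STORE   R12, R8|2       ; return-frame pointer for the child
        STORE   R4,  R8|3       ; argument; the child's frame is now full, so it is enabled
        FFREE                   ; release the parent's own frame
        STOP

code child0                     ; pre-load block of the child, runs on the SP
        LOAD    RFP|2, R2
        LOAD    RFP|3, R4
        PUTR1   child1
        FORKEP  R1              ; schedule the child's execute block on the EP
        STOP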

2.10 I/O Support

INPUT

-Input data from a device

Usage: INPUT index, RD

Inputs data from device number ‘index’

INPUT

RD

0

Index

RES

Description: Inputs data from the given device into destination (device could be up to 8). Operation: RD←Device[index] OUTPUT

-Output data to a device

Usage: OUTPUT RD, index OUTPUT

Outputs data to a device number ‘index’ RS

0

Index

Description: Outputs data to a given device (device could be up to 8) Operation: D[index]←RS

RES

APPENDIX B

List of Op-Codes used in SDF

  OPCODE   Integer      OPCODE   Integer      OPCODE   Integer      OPCODE   Integer
  NOP      0            MAX      12           TRL      22           ISTORE   32
  ADD      1            MIN      13           TDB      23           FORKSP   33
  SUB      2            ABS      14           TIN      24           FORKEP   34
  MUL      3            FLR      15           MOVE     25           STOP     35
  DIV      4            LT       16           PUTR1    26           STOP     36
  MOD      5            LE       17           LOAD     27           FALLOC   37
  AND      6            EQ       18           STORE    28           FFREE    38
  OR       7            NE       19           STORE1   29           INPUT    39
  NOT      8            TBL      20           IALLOC   30           OUTPUT   40
  SHL      9            TCH      21           IFREE    31           SKIP     41
  SHR      10                                                       IFCHCL   42
  NEG      11                                                       POW      43
                                                                    BAND     44
                                                                    GE       45
                                                                    GT       46
                                                                    HALT     63

110 APPENDIX C Fibonacci Program Code for SDF ; ======================================== ; Hand generated SDF Code for FIBONACCI PROGRAM ; ======================================== ; Content followed a by semicolon is ignored by the assembler ; Written by Joe Arul. ; SDF assembler will produce object code for SDF simulator. ; Date Oct20th 1999 ; Version number is given for different versions of the simulator. version 0.9.2 ; ;=========================================== ; Main forks two threads called fib0 and submain0 ; Submain is forked to wait for all the completion of the ; results. Frame allocated for fib0 is stored in Register R8 ; Frame allocated for submain0 is stored in Register R12. ; Label names end with a semicolon or labeled as code ; Code is to indicate that it is the start of a code block. ;========================================== ;Following code block is preload code block for SP. ;========================================== code main INPUT 2, R4 ; gets the input for the Fibonacci PUTR1 main1 FORKEP R1 STOP ;========================================== ;Following code block is execute code block for EP. ;========================================== main1: PUTR1 fib0 MOVE R1, R16 PUTR1 3 MOVE R1, R17 FALLOC RR16, R8 ; frame allocate for fibonacci PUTR1 submain0 MOVE R1, R16 PUTR1 1 MOVE R1, R17 FALLOC RR16, R12 ; frame allocated for return PUTR1 main02 FORKSP R1 STOP ;========================================== ;Following code block is post-store code block for SP ;========================================== main02: STORE R12, R8|2 PUTR1 2 STORE R1, R8|3 STORE R4, R8|4

111 Fibonnaci program code cont… FFREE STOP ;========================================== ; ;========================================== ; Code submain is the last code block to receive all the ; results from the threads and displays the result. Output instruction ; stores the result in Output device two. ;========================================== code submain0 LOAD RFP|2, R2 PUTR1 submain01 FORKEP R1 STOP submain01: PUTR1 submain02 FORKSP R1 STOP submain02: OUTPUT R2, 3 FFREE STOP ;============================================ ; Code block where the comparison of N is done. ; If greater than one, then two more threads are spawned ; recursively. code fib0 ;============================================ LOAD RFP|2, R2 ; frame ptr LOAD RFP|3, R3 ; frame offset LOAD RFP|4, R4 ; N value PUTR1 fibloop FORKEP R1 STOP fibloop: PUTR1 1 MOVE R1, R9 MOVE R4, R8 LE RR8, R20 ; if n==1 NOT R20, R21 ; R20=TRUE R21=FALSE PUTR1 fib0 MOVE R1, R12 PUTR1 3 MOVE R1, R13 FALLOC RR12, R15 ; frame for n-1 FALLOC RR12, R16 ; frame for n-2 SUB RR8, R30 ; (n-1) PUTR1 2 MOVE R1, R9 SUB RR8, R31 ; (n-2) PUTR1 fineredc MOVE R1, R12 PUTR1 3 MOVE R1, R13 FALLOC RR12, R18 ; R18 frame for final reduce PUTR1 middleredc MOVE R1, R12

112 Fibonnaci program code cont… PUTR1 4 MOVE R1, R13 FALLOC RR12, R19 ; R19 for middle reduce PUTR1 fibl2 FORKSP R20, R1 ; if the condition satisfied PUTR1 fibl3 FORKSP R21, R1 ; if the condition is not satisfied STOP ; ***if the condition is satisfied then we return the result fibl2: STORE R2, R18|2 STORE R3, R18|3 PUTR1 1 STORE R1, R18|4 FFREE R15 ; free (n-1) frame FFREE R16 ; free (n-2) frame FFREE R19 ; free middle reduce frame FFREE STOP ;*** If the condition is not satisfied then we call two fib and returns ;*** the result later ;*** fib(n-1) call fibl3: STORE R19, R15|2 PUTR1 4 STORE R1, R15|3 STORE R30, R15|4 ;store n-1 here ;***fib(n-2) call STORE R19, R16|2 PUTR1 5 STORE R1, R16|3 STORE R31, R16|4 ; store n-2 here ;***for the middle reduce STORE R18, R19|2 PUTR1 4 STORE R1, R19|3 STORE R2, R18|2 STORE R3, R18|3 FFREE STOP ;=============================================== ; code block for middle reduce ;=============================================== code middleredc LOAD RFP|2, R2 ; frame ptr LOAD RFP|3, R3 ; frame offset LOAD RFP|4, R4 ; first return value LOAD RFP|5, R5 ; second return value PUTR1 middle1b FORKEP R1 STOP ;***for the EP middle1b: ADD RR4, R14 PUTR1 middle1c FORKSP R1 STOP

113 Fibonacci program code cont… middle1c:

STORE R14, R2|R3 FFREE STOP ;===================================================== ; fineredc called once either when the condition is ; satisfied at one or more than one fib call. ;===================================================== code fineredc LOAD RFP|2, R2 LOAD RFP|3, R3 LOAD RFP|4, R4 STORE R4, R2|R3 FFREE STOP ;======================================================

REFERENCES

[1]

Ackermann, W.B. and Dennis, J.B., “VAL – A value-oriented Algorithmic Language, preliminary Reference Manual,” Tech.Report TR 218, Laboratory for Computer Science, MIT, Cambridge, MA, 1979.

[2]

Agarwal, A., Kubiatowicz, J., Kranz, D., Lim, B.H. and Yeung, D., “Sparcle: An evolutionary processor design for multiprocessors,” IEEE Micro, June 1993, pp. 48-61.

[3]

Agarwal, A., Bianchini, R., Chaiken, D., Johnson, K.L., Kranz, D., Kubiatowicz, J., Lim, B.H., Mackenzie, K. and Yeung, D., “The MIT Alewife machine: Architecture and performance,” Proc. Of 22nd Int’l Symp. on Computer Architecture (ISCA-22), 1995, pp. 2-13.

[4]

Ang, B.S., Arvind. and Chiou, D., “StarT- the next generation: Integrating global caches and dataflow architecture,” Technical Report 354, Laboratory for Computer Science, MIT, 1995.

[5]

Arvind and Nikhil R. S., “Executing a program on the MIT Tagged-Token Dataflow Architecture,” PARLE(2), 1987, pp. 1-29.

[6]

Arvind and Nikhil R. S., “Executing Program on the MIT Tagged-Token Dataflow Architecture,” IEEE Transactions on Computers Vol.39 No.3, 1990, pp. 300-318.

[7]

Arvind and Thomas, R. E., “I-structure: An Efficient Data Type for functional languages,” Technical Report, MIT/LCS/TM-210, Laboratory for Computer Science, MIT, Cambridge, MA, 1981.

[8]

Arul, J.M., Kavi, K.M. and Hanief, S., "Cache Performance of Scheduled dataflow Architecture," Proc of the 4th International Conference on Algorithms for Parallel Processing (ICA3PP), Dec., 2000, pp. 110-123.

[9]

Bohm, A.D.W., Cann, D.C., Feo, J.T. and Oldehoeft, R.R., “SISAL Reference Manual: language version 2.0,” Technical Report CS91-118, Computer Science Dept., Colorado State University.

[10]

Burger, D. and Austin, T.M., “The SimpleScalar Tool Set Version 2.0,” Technical Report 1342, Department of Computer Science, University of Wisconsin, Madison, WI.

[11]

Butler, M., Yeh, T.H., Patt, Y., Alsup, M., Scales, H. and Shebanow, M., “Single instruction stream parallelism is greater than two,” Proc. of 18th Intl. Symposium on Computer Architecture (ISCA-18), May, 1991 pp. 276-286.

[12]

Chang, P.P., Lavery, D.M., Mahlke, S.A., Chen, W.Y. and Hwu, W.W., “The importance of prepass code scheduling for Superscalar Superpipelined processors,” IEEE Trans. Computers, Vol. 44, No.3, Mar. 1995, pp. 353-370.

[13]

Culler, D.E. and Papadopoulos, G.M., “The explicit token store,” Journal of Parallel and Distributed Computing, Vol.10, No.4, 1990, pp. 289-308.

[14]

Culler, D. E., Goldstein, S. C., Schauser, K. E. and Eicken, T. V., “TAM – A compiler controlled Threaded Abstract Machine,” Journal of Parallel and Distributed Computing 18, 1993, pp.347-370.

[15]

Cuppu, V., Jacob, B., Davis, B. and Mudge, T., “A performance comparison of contemporary DRAM architectures,” Proc. of the Intl. Symposium on Computer Architecture (ISCA-26), May 1999, pp. 222-233.

[16]

Dennis, J.B., “Dataflow supercomputers,” IEEE Computer, Nov. 1980, pp. 48-56.

[17]

Dennis, J. B. and Misunas, D. P., “A preliminary architecture for a basic dataflow processor,” Proceeding of 2nd International Conference on Computer Architecture (ISCA-2), Jan. 1975, pp. 126-132.

[18]

Edler, J. and Hill, M.D. “DineroIV Trace-Driven uniprocessor cache simulator,” http://www.cs.wisc.edu/~markhill/DineroIV

[19]

Gao, G. R., “An efficient hybrid dataflow architecture model,” Journal of Parallel and Distributed Computing, Vol.19, 1993, pp. 293-307.

[20]

Govindarajan, R., Nemawarkar, S.S. and LeNir, P., “Design and performance evaluation of a multithreaded architecture,” Proc. of the First High Performance Computer Architecture (HPCA-1), Jan. 1995, pp. 298-307.

[21]

Grafe, V. G. and Hoch, J. E., “The Epsilon-2 Multiprocessor System,” Journal of Parallel and Distributed Computing, Vol.10, 1990, pp. 309-318.

[22]

Grünewald, W. and Ungerer, T., “A multithreaded processor design for distributed shared memory system,” Proc. Intl. Conf. on Advances in Parallel and Distributed Computing, 1997, pp. 206-213.

[23]

Grünewald, W. and Ungerer, T., “Towards extremely fast context switching in a blockmultithreaded processor,” Proc. of the 22nd Euromicro Conf., Sept. 1996, pp. 592-599.

[24]

Heller S. and Traub, K., “Id compiler user’s manual,” Technical Report MIT/CSG Memo 248, Laboratory for Computer Science, MIT, Cambridge, MA, 1985.

[25]

Hennessy, J.L. and Patterson, D.A., Computer Architecture: A Quantitative Approach, Morgan Kaufmann Publisher, 1996.

[26]

Hum, H.H.-J., Maquelin, O., Theobald, K.B., Tian, X., Tang, X., Gao, G., Cupryk, P., Hendren, L.J., Jimenez, A., Krishnan, S., Marquez, A., Merali, S., Nemawarkar, S.S., Panangaden, P., Xue, X. and Zhu, Y., “A design study of the EARTH multiprocessor,” Proc. of the Conf. on Parallel Architectures and Compilation Techniques (PACT), June 1995, pp. 59-68.

[27]

Hurson, A.R., Kavi, K.M. and Lee, B., “Cache memories in dataflow architectures,” IEEE Parallel and Distributed Technology, 1996, pp 50-64.

[28]

Iannucci, R.A., “Toward a dataflow/von Neumann hybrid architecture,” Proc. of 15th Symposium on Computer Architecture (ISCA-15), 1990, pp. 131-140.

[29]

Kavi, K.M. and Hurson, A.R., “Performance of cache memories in dataflow architectures,” Euromicro Journal on Systems Architecture, Vol. 44, No. 9-10, June 1998, pp. 657-674.

[30]

Kavi, K.M., Lee, B. and Hurson, A.R., “Multithreaded systems: A survey,” Advances in Computers, Volume 48 (Edited by M. Zerkowitz), Academic Press, 1998, pp. 287-328.

[31] Kavi, K.M. and Shirazi, B., "Dataflow architecture: Are dataflow computers commercially viable?," IEEE Potentials, Oct. 1992, pp. 27-30.

[32] Kavi, K.M., Giorgi, R. and Arul, J., "Comparing execution performance of scheduled dataflow architecture with RISC processors," Proc. of the 13th ISCA Parallel and Distributed Computing Systems Conference (PDCS-00), Aug. 2000, pp. 41-47.

[33] Kavi, K.M., Arul, J. and Giorgi, R., "Execution and cache performance of the scheduled dataflow architecture," Journal of Universal Computer Science, Special Issue on Multithreaded Processors and Chip Multiprocessors, Vol. 6, No. 10, Oct. 2000, pp. 948-967.

[34] Kavi, K.M., Kim, H.S., Arul, J. and Hurson, A.R., "A decoupled scheduled dataflow multithreaded architecture," Proc. of the Intl. Symposium on Parallel Architectures, Algorithms and Networks (I-SPAN '99), June 1999, pp. 138-143.

[35] Kozyrakis, C.E., Perissakis, S., Patterson, D., Anderson, T., Asanovic, K., Cardwell, N., Fromm, R., Golbus, J., Gribstad, B., Keeton, K., Thomas, R., Treuhaft, N. and Yelick, K., "Scalable processors in the billion-transistor era: IRAM," IEEE Computer, Vol. 30, No. 9, Sept. 1997, pp. 75-78.

[36] Krishnan, V. and Torrellas, J., "A chip-multiprocessor architecture with speculative multithreading," IEEE Trans. on Computers, Vol. 48, No. 9, Sept. 1999, pp. 866-880.

[37] Lam, M. and Wilson, R.P., "Limits of control flow on parallelism," Proc. of the 19th Intl. Symposium on Computer Architecture (ISCA-19), May 1992, pp. 46-57.

[38] Lee, B. and Hurson, A.R., "Dataflow architectures and multithreading," IEEE Computer, Vol. 27, Aug. 1994, pp. 27-39.

[39] Lo, J.L., Eggers, S.J., Emer, J.S., Levy, H.M., Stamm, R.L. and Tullsen, D.M., "Converting thread-level parallelism into instruction-level parallelism via simultaneous multithreading," ACM Trans. on Computer Systems, Aug. 1997, pp. 332-354.

[40] Lo, J.L., Parekh, S.S., Eggers, S.J., Levy, H.M. and Tullsen, D.M., "Software-directed register deallocation for simultaneous multithreaded processors," IEEE Trans. on Parallel and Distributed Systems, Vol. 10, No. 9, Sept. 1999, pp. 922-933.

[41] Mitchell, N., Carter, L., Ferrante, J. and Tullsen, D., "ILP vs TLP on SMT," Proc. of Supercomputing, Nov. 1999.

[42] Moreno, J.H., Moudgill, M., Ebcioglu, K., Altman, E.R., Hall, B., Miranda, R., Chen, S.K. and Polyak, A., "Simulation/evaluation environment for a VLIW processor architecture," IBM Journal of Research and Development, Vol. 41, No. 3, May 1997, pp. 287-302.

[43] Najjar, W.A., Lee, E. and Gao, G., "Advances in the dataflow computation model," Parallel Computing, North-Holland, Vol. 25, No. 13-14, Dec. 1999, pp. 1907-1929.

[44] Onder, S. and Gupta, R., "Superscalar execution with direct data forwarding," Proc. of the Intl. Conf. on Parallel Architectures and Compiler Technologies (PACT-98), Oct. 1998, pp. 130-135.

[45] Papadopoulos, G.M. and Traub, K.R., "Multithreading: A revisionist view of dataflow architectures," Proc. of the 18th Intl. Symposium on Computer Architecture (ISCA-18), 1991, pp. 342-351.

[46] Papadopoulos, G.M. and Culler, D.E., "Monsoon: An explicit token-store architecture," Proc. of the 17th Intl. Symposium on Computer Architecture (ISCA-17), May 1990, pp. 82-91.

[47] Papadopoulos, G.M., "Implementation of a general purpose dataflow multiprocessor," Tech Report TR-432, Laboratory for Computer Science, MIT, Cambridge, MA, Aug. 1988.

[48] Park, K., Choi, S.H., Chung, Y., Hahn, W.J. and Yoon, S.H., "On-chip multiprocessor with simultaneous multithreading," ETRI Journal, Vol. 22, No. 4, Dec. 2000, pp. 13-24.

[49] Roh, L. and Najjar, W.A., "Analysis of communication and overhead reduction in multithreaded execution," Proc. of the Intl. Conf. on Parallel Architectures and Compilation Techniques, June 1995, pp. 122-130.

[50] Saavedra-Barrera, R.H., Culler, D.E. and von Eicken, T., "Analysis of multithreaded architectures for parallel computing," Proc. of the 2nd Annual ACM Symposium on Parallel Algorithms and Architectures, July 1990, pp. 169-178.

[51] Sakai, S., Yamaguchi, Y., Hiraki, K., Kodama, Y. and Yuba, T., "An architecture of a dataflow single chip processor," Proc. of the 16th Annual Symposium on Computer Architecture, May 1989, pp. 46-53.

[52] Sakai, S., et al., "Super-threading: Architectural and software mechanisms for optimizing parallel computations," Proc. of the 1993 Intl. Conf. on Supercomputing, July 1993, pp. 251-260.

[53] Saulsbury, A., Pong, F. and Nowatzyk, A., "Missing the memory wall: The case for processor/memory integration," Proc. of the 23rd Intl. Symposium on Computer Architecture (ISCA-23), May 1996.

[54] Sato, T., "Quantitative evaluation of pipelining and decoupling a dynamic instruction scheduling mechanism," Journal of Systems Architecture, Vol. 46, No. 13, Nov. 2000, pp. 1231-1252.

[55] Shankar, B. and Roh, L., "MIDC Language manual," Tech Report, CS Dept., Colorado State University, July 1996. http://www.cs.colostate.edu/~dataflow/papers/Manuals/manual.pdf

[56] Shankar, B., Roh, L., Bohm, W. and Najjar, W., "Control of parallelism in multithreaded code," Proc. of the Intl. Conf. on Parallel Architectures and Compiler Techniques (PACT-95), June 1995, pp. 131-139.

[57] Smith, J.E., "Decoupled access/execute computer architectures," Proc. of the 9th Annual Symposium on Computer Architecture, May 1982, pp. 112-119.

[58] Smith, J.E., "Instruction-level distributed processing," IEEE Computer, Vol. 34, No. 4, April 2001, pp. 59-65.

[59] Takesue, M., "A unified resource management and execution control mechanism for dataflow machines," Proc. of the 14th Intl. Symposium on Computer Architecture (ISCA-14), June 1987, pp. 90-97.

[60] Terada, H., Miyata, S. and Iwata, M., "DDMP's: Self-timed super-pipelined data-driven multimedia processor," Proc. of the IEEE, Feb. 1999, pp. 282-296.

[61] Thekkath, R. and Eggers, S.J., "The effectiveness of multiple hardware contexts," Proc. of the 6th Intl. Conf. on Architectural Support for Programming Languages and Operating Systems, Oct. 1994, pp. 328-337.

[62] Thekkath, R. and Eggers, S.J., "Impact of sharing-based thread placement on multithreaded architectures," Proc. of the 21st Annual Intl. Symposium on Computer Architecture, April 1994, pp. 176-184.

[63] Thoreson, S.A. and Long, A.N., "A feasibility study of a memory hierarchy in data flow environment," Proc. of the Intl. Conf. on Parallel Processing, June 1987, pp. 356-360.

[64] Tokoro, M., Jagannathan, J.R. and Sunahara, H., "On the working set concept for data-flow machines," Proc. of the 10th Annual Intl. Symposium on Computer Architecture (ISCA-10), July 1983, pp. 90-97.

[65] Tullsen, D.M., Eggers, S.J., Emer, J.S., Levy, H.M., Lo, J.L. and Stamm, R.L., "Exploiting choice: Instruction fetch and issue on an implementable simultaneous multithreading processor," Proc. of the 23rd Annual Intl. Symposium on Computer Architecture (ISCA-23), May 1996, pp. 191-202.

[66] Tullsen, D.M., Eggers, S.J., Levy, H.M. and Lo, J.L., "Simultaneous multithreading: Maximizing on-chip parallelism," Proc. of the 22nd Annual Intl. Symposium on Computer Architecture (ISCA-22), June 1995, pp. 392-403.

[67] Vajapeyam, S. and Valero, M., "Early 21st century processors," IEEE Computer, Vol. 34, No. 4, April 2001, pp. 47-51.

[68] Wall, D.W., "Limits on instruction-level parallelism," Proc. of the 4th Intl. Conf. on Architectural Support for Programming Languages and Operating Systems (ASPLOS-4), April 1991, pp. 176-188.

[69] Watson, I. and Gurd, J.R., "A prototype data flow computer with token labelling," Proc. of the National Computer Conference, AFIPS Proceedings 48, 1979, pp. 623-628.

[70] Wilcox, K. and Manne, S., "Alpha processors: A history of power issues and a look at the future," Cool Chips Tutorial, in conjunction with MICRO-32, Haifa, Israel, Dec. 1999.

[71] Yamamoto, W., Serrano, M.J., Talcott, A.R., Wood, R.C. and Nemirovsky, M., "Performance estimation of multistreamed, superscalar processors," Proc. of the 27th Hawaii Intl. Conf. on System Sciences, I: Architecture, Jan. 1994, pp. 195-204.
