National Cheng Kung University
Department of Electrical Engineering
Master's Thesis

Reconfigurable Processor Core Design for Network-on-a-Chip

Student: Shih-Lun Chen    Advisor: Jer-Min Jou

Department of Electrical Engineering
National Cheng Kung University
Tainan, Taiwan, R.O.C.
Thesis for Master of Science, July 2004

Reconfigurable Processor Core Design for Network-on-a-Chip

Shih-Lun Chen*    Jer-Min Jou**

Institute of Electrical Engineering, National Cheng Kung University

ABSTRACT

Designing a complex SoC system faces many challenges. The Network-on-a-Chip provides a development platform on which designers can implement a variety of applications. After simulating and comparing several processor architectures, a sixteen-function-unit processor with a reconfigurable design was chosen as the design of this thesis. Combined with network interfaces, networks, and routers, it completes a Network-on-a-Chip; in the same way, a NoC mesh can be formed by combining several NoCs with network interfaces, networks, and routers. Through hierarchical dataflow mapping, a very large application can be mapped hierarchically onto the NoC mesh. Compared with traditional circuits, this design is more powerful, flexible, scalable, reusable, and reconfigurable, and achieves very high performance for special applications.

*The author    **The advisor

Reconfigurable Processor Core Design for Network-on-a-Chip Shih-Lun Chen*

Jer-Min Jou**

Department of Electrical Engineering National Cheng Kung University Tainan, Taiwan, Republic of China

ABSTRACT

To design a complex System-on-a-Chip (SoC) poses many challenges. The Network-on-a-Chip (NoC) provides designers with a development platform on which they can implement various applications. After simulating and comparing several processor architectures, a sixteen-function-unit processor with a reconfigurable design was chosen for the design in this thesis. Combined with network interfaces, an interconnection network, and routers, it forms a complete NoC; in the same way, a NoC mesh can be built by combining several NoCs with network interfaces, an interconnection network, and routers. Through hierarchical dataflow mapping, a very large special-purpose application can be mapped onto a NoC mesh hierarchically. The result is more powerful, flexible, scalable, reusable, and reconfigurable than traditional circuits, with very high performance for special applications.

*The author ** The advisor

Acknowledgments

During my two years of graduate study, I first thank my advisor, Professor Jer-Min Jou (周哲民), for his earnest, patient, and tireless teaching and guidance in my studies and for his care and concern in daily life, which allowed me to complete my degree and this thesis. I also thank my family for their support and encouragement during these two years, and the doctoral students 陳仁德, 蕭宇宏, 鄭卜仁, and 孫建明 for their guidance and care in daily life and their assistance with the thesis and its implementation.

I also thank the ASIC laboratory seniors 黃昭欽, 林仁興, 林士生, 涂義昇, and 許毓霖 for their help and guidance with academic difficulties; I thank my classmate 顏業烜 for his careful and untiring instruction despite the demands of his own research; and I thank my classmates 許世勳, 邱庭鈺, and 吳源晉 for our shared discussions, research, growth, and mutual help.

Finally, I thank the juniors 楊皓義, 蘇弘毅, 郭瑞宏, 王子綸, and 李智偉 for their assistance and support in research over the past year.

Shih-Lun, July 12, 2004

Contents

Abstract (Chinese)
Abstract (English)
Contents
List of Figures
List of Tables

Chapter 1 INTRODUCTION
  1.1 Network on a Chip
  1.2 Reconfigurable Processor Orientation
  1.3 Simulation before Design
  1.4 Reconfigurable Processor Components
  1.5 Thesis Organization

Chapter 2 Network on a Chip
  2.1 Components of NoC
  2.2 Communication Structure of NoC
    2.2.1 Communication Components
    2.2.2 Router
    2.2.3 Network Interface
  2.3 Multiprocessor Structure of NoC
  2.4 Platform Principle of NoC
    2.4.1 The Platform-Based Design Flow
    2.4.2 Proposed NoC Platform-Based Design

Chapter 3 Reconfigurable Principle and Applications
  3.1 Fine-grained Reconfigurable Architectures
  3.2 Coarse-grained Reconfigurable Architectures
    3.2.1 Mesh-Based Architectures
    3.2.2 Architectures Based on Linear Arrays
    3.2.3 Crossbar-Based Architectures
    3.2.4 Comparison with Fine-Grained Architectures
  3.3 Static Reconfiguration
  3.4 Dynamic Reconfiguration
  3.5 The Reconfiguration Hierarchy
    3.5.1 Reconfiguration Vertical Axis
    3.5.2 Reconfiguration Horizontal Axis
    3.5.3 Reconfiguration Time Axis

Chapter 4 Polymorphous TRIPS Architecture
  4.1 Granularity of Parallel Processing Elements on a Chip
  4.2 Grid Processor Architectures
  4.3 Grid Processor Execution Model
  4.4 Polymorphous TRIPS Architecture
    4.4.1 Overview of TRIPS Architecture
    4.4.2 Polymorphous Resources
  4.5 Instruction, Thread and Data Level Parallelism
    4.5.1 Desktop Morph: Instruction Level Parallelism
    4.5.2 Thread Morph: Thread Level Parallelism
    4.5.3 Super Morph: Data Level Parallelism

Chapter 5 Simulation
  5.1 Non-reconfigurable Processor Design Simulation
    5.1.1 One Function Unit General CPU Architecture without Reconfigurable Design
    5.1.2 Two Function Units General CPU Architecture without Reconfigurable Design
    5.1.3 Four Function Units General CPU Architecture without Reconfigurable Design
    5.1.4 Nine Function Units General CPU Architecture without Reconfigurable Design
    5.1.5 Sixteen Function Units General CPU Architecture without Reconfigurable Design
  5.2 Simulation with Reconfigurable Design
    5.2.1 One Function Unit CPU Architecture with Reconfigurable Design
    5.2.2 Two Function Units CPU Architecture with Reconfigurable Design
    5.2.3 Four Function Units CPU Architecture with Reconfigurable Design
    5.2.4 Nine Function Units CPU Architecture with Reconfigurable Design
    5.2.5 Sixteen Function Units CPU Architecture with Reconfigurable Design
  5.3 Compare and Analyze Architectures
  5.4 Conclusion

Chapter 6 Reconfigurable Processor Design
  6.1 Instruction Set
    6.1.1 Very-Long Instruction Word
      6.1.1.1 Architecture Comparison of CISC, RISC, and VLIW
      6.1.1.2 VLIW Architectures and Implementations
    6.1.2 Types of Instruction Set
      6.1.2.1 Register Type Slot
      6.1.2.2 Immediate Type Slot
      6.1.2.3 Jump Type Slot
      6.1.2.4 Reconfigurable Type Slot
      6.1.2.5 Communication Type Slot
  6.2 Hardware Architecture
    6.2.1 The Basic Principle of a CPU
      6.2.1.1 Basic Components of a CPU
      6.2.1.2 The Program Counter Datapath
      6.2.1.3 The Arithmetic Operation Datapath
    6.2.2 Enhancing Performance with Pipelining
    6.2.3 Sixteen Function Units Reconfigurable Processor
  6.3 Hazards
    6.3.1 Structural Hazards
    6.3.2 Control Hazards
      6.3.2.1 Stall
      6.3.2.2 Prediction
    6.3.3 Conclusion
  6.4 Reconfigurable Component Designs
    6.4.1 Reconfigurable Controller and DMA Machine
    6.4.2 Function Unit
    6.4.3 Cache
    6.4.4 Example of Reconfiguration
  6.5 Exception
  6.6 Network Interface
  6.7 Hierarchical Dataflow Mapping
  6.8 Conclusion

Chapter 7 Conclusion

Reference

List of Tables

Chapter 1 INTRODUCTION
Chapter 2 Network on a Chip
Chapter 3 Reconfigurable Principle and Applications
  TABLE 3-1 Summary of the technical details of the different coarse-grained reconfigurable architectures

Chapter 4 Polymorphous TRIPS Architecture
Chapter 5 Simulation
  TABLE 5-1 Each result produced time in FIR filter in the one function unit architecture
  TABLE 5-2 Hardware cost in the one function unit architecture
  TABLE 5-3 Each result produced time in FIR filter in the two function units architecture
  TABLE 5-4 Hardware cost in the two function units architecture
  TABLE 5-5 Each result produced time in FIR filter in the four function units architecture
  TABLE 5-6 Hardware cost in the four function units architecture
  TABLE 5-7 Each result produced time in FIR filter in the nine function units architecture
  TABLE 5-8 Hardware cost in the nine function units architecture
  TABLE 5-9 Each result produced time in FIR filter in the sixteen function units architecture
  TABLE 5-10 Hardware cost in the sixteen function units architecture
  TABLE 5-11 Each result produced time in FIR filter in the one function unit architecture
  TABLE 5-12 Hardware cost in the one function unit architecture
  TABLE 5-13 Each result produced time in FIR filter in the two function units architecture
  TABLE 5-14 Hardware cost in the two function units architecture
  TABLE 5-15 Each result produced time in FIR filter in the four function units architecture
  TABLE 5-16 Hardware cost in the four function units architecture
  TABLE 5-17 Each result produced time in FIR filter in the nine function units architecture
  TABLE 5-18 Hardware cost in the nine function units architecture
  TABLE 5-19 Each result produced time in FIR filter in the sixteen function units architecture
  TABLE 5-20 Hardware cost in the sixteen function units architecture
  TABLE 5-21 Component count increase (percent) of each reconfigurable CPU architecture with DMA design over the architecture without DMA design
  TABLE 5-22 Performance growth (percent) from the reconfigurable CPU architectures without DMA design to those with DMA design in the FIR example
  TABLE 5-23 Total gate count of the 1, 2, 4, 9 and 16 FU reconfigurable CPU architectures with DMA design
  TABLE 5-24 Performance (results/clock), gate count, and result contribution per gate count

Chapter 6 Reconfigurable Processor Design
  TABLE 6-1 Comparison of CISC, RISC, and VLIW characteristics

Chapter 7 Conclusion
Reference

List of Figures

Chapter 1 INTRODUCTION
  FIGURE 1-1 Reconfigurable processor orientation
  FIGURE 1-2 Reconfigurable processor components

Chapter 2 Network on a Chip
  FIGURE 2-1 NoC architecture
  FIGURE 2-2 NoC structure template
  FIGURE 2-3 OSI layers of the NoC interconnect network
  FIGURE 2-4 Communication components of NoC
  FIGURE 2-5 Proposed NoC
  FIGURE 2-6 A generic router model
  FIGURE 2-7 Layers of the network interface
  FIGURE 2-8 Software stack (gray blocks)
  FIGURE 2-9 Concurrent development environment of the NoC
  FIGURE 2-10 System platform stack
  FIGURE 2-11 Platform stacks for NoC design
  FIGURE 2-12 Platform stacks for NoC and network platform designs

Chapter 3 Reconfigurable Principle and Applications
  FIGURE 3-1 Reconfigurable computing orientation
  FIGURE 3-2 The RAW architecture
  FIGURE 3-3 CHESS array hexagon floor plan
  FIGURE 3-4 Comparison between fine grain and coarse grain
  FIGURE 3-5 Vertical axis of reconfiguration in a reconfigurable processor core
  FIGURE 3-6 Three kinds of reconfigurable blocks analyzed when designing a control core
  FIGURE 3-7 Time of reconfiguration in control cores

Chapter 4 Polymorphous TRIPS Architecture
  FIGURE 4-1 Granularity of parallel processing elements on a chip
  FIGURE 4-2 Grid Processor organization
  FIGURE 4-3 A sample instruction stream for a Grid Processor
  FIGURE 4-4 Basic blocks shown as dataflow graphs
  FIGURE 4-5 Basic blocks mapped on a grid of dimension 4x4, with the 3 directly reachable neighbors and instruction destinations as ordered pairs (x, y)
  FIGURE 4-6 Overview of TRIPS architecture
  FIGURE 4-7 Desktop frame management
  FIGURE 4-8 Polymorphism for Super morph

Chapter 5 Simulation
  FIGURE 5-1 One function unit reconfigurable CPU architecture without DMA
  FIGURE 5-2 FIR filter data flow
  FIGURE 5-3 Mapping the FIR filter data flow to the one function unit reconfigurable CPU
  FIGURE 5-4 Two function units reconfigurable CPU architecture without DMA
  FIGURE 5-5 Mapping the FIR filter data flow to the two function units reconfigurable CPU
  FIGURE 5-6 Four function units reconfigurable CPU architecture without DMA
  FIGURE 5-7 Mapping the FIR filter data flow to the four function units reconfigurable CPU
  FIGURE 5-8 Nine function units reconfigurable CPU architecture without DMA
  FIGURE 5-9 Mapping the FIR filter data flow to the nine function units reconfigurable CPU
  FIGURE 5-10 Sixteen function units reconfigurable CPU architecture without DMA
  FIGURE 5-11 Mapping the FIR filter data flow to the sixteen function units reconfigurable CPU
  FIGURE 5-12 One function unit reconfigurable CPU architecture with DMA
  FIGURE 5-13 Mapping the FIR filter data flow to the one function unit reconfigurable CPU with DMA
  FIGURE 5-14 Two function units reconfigurable CPU architecture with DMA
  FIGURE 5-15 Mapping the FIR filter data flow to the two function units reconfigurable CPU with DMA
  FIGURE 5-16 Four function units reconfigurable CPU architecture with DMA
  FIGURE 5-17 Mapping the FIR filter data flow to the four function units reconfigurable CPU with DMA
  FIGURE 5-18 Nine function units reconfigurable CPU architecture with DMA
  FIGURE 5-19 Mapping the FIR filter data flow to the nine function units reconfigurable CPU with DMA
  FIGURE 5-20 Sixteen function units reconfigurable CPU architecture with DMA
  FIGURE 5-21 Mapping the FIR filter data flow to the sixteen function units reconfigurable CPU with DMA
  FIGURE 5-22 Hardware cost of the one, two, four, nine and sixteen function units reconfigurable CPUs without DMA
  FIGURE 5-23 Hardware cost of the one, two, four, nine and sixteen function units reconfigurable CPUs with DMA
  FIGURE 5-24 Component count increase of each reconfigurable CPU architecture with DMA design over the architecture without DMA design
  FIGURE 5-25 FIR example result production times on the 1, 2, 4, 9 and 16 FU reconfigurable CPUs without DMA design
  FIGURE 5-26 FIR example result production times on the 1, 2, 4, 9 and 16 FU reconfigurable CPUs with DMA design
  FIGURE 5-27 Performance comparison of the 1, 2, 4, 9 and 16 FU reconfigurable CPU architectures without and with DMA designs in the FIR example
  FIGURE 5-28 Result contribution per gate count for the 1, 2, 4, 9 and 16 FU reconfigurable CPUs with DMA design

Chapter 6 Reconfigurable Processor Design
  FIGURE 6-1 A C-language fragment implemented in CISC, RISC, and VLIW
  FIGURE 6-2 A general VLIW implementation
  FIGURE 6-3 VLIW instruction format in the reconfigurable processor design
  FIGURE 6-4 Format of the register type slot
  FIGURE 6-5 Format of the immediate type slot
  FIGURE 6-6 Format of the jump type slot
  FIGURE 6-7 Reconfigurable type slot
  FIGURE 6-8 A simple example of running reconfigurable slots
  FIGURE 6-9 Communication type slots
  FIGURE 6-10 A simple example of communication slots
  FIGURE 6-11 The program counter datapath
  FIGURE 6-12 Arithmetic operation datapath
  FIGURE 6-13 A basic five-stage pipeline CPU architecture
  FIGURE 6-14 Basic architecture of the sixteen function units reconfigurable processor
  FIGURE 6-15 Two-bit hysteresis-counter self predictor
  FIGURE 6-16 Handling process on a prediction error
  FIGURE 6-17 Forwarding in the five-stage pipeline
  FIGURE 6-18 Sixteen function units general processor with the predictor and forwarding circuit added
  FIGURE 6-19 Reconfigurable controller and DMA architectures
  FIGURE 6-20 Function unit architecture
  FIGURE 6-21 Four-word (16-byte) block cache for the reconfigurable processor design
  FIGURE 6-22 A reconfiguration example: reconfiguring from the FIR application to the MDCT application
  FIGURE 6-23 Timing flow from non-reconfiguration executions through FIR and MDCT reconfigurable executions to other non-reconfiguration executions
  FIGURE 6-24 Function unit one of the sixteen function units reconfigurable processor, reconfigured from the FIR application to the MDCT application
  FIGURE 6-25 Handling interrupt process
  FIGURE 6-26 Interrupt handle processing
  FIGURE 6-27 Data transfer flow through network interfaces and routers
  FIGURE 6-28 Hardware architecture hierarchy diagram
  FIGURE 6-29 A hierarchical dataflow mapping example
  FIGURE 6-30 Reconfigurable processor design combined with the reconfigurable components

Chapter 7 Conclusion
Reference

Chapter 1 Introduction

Today, the mask cost of an SoC is increasing rapidly. Transferring large volumes of data or computation at high speed has become a serious problem, and integrating heterogeneous IPs into an SoC has become a major topic. Reusing large transistor blocks is becoming more and more important in IC design, and the reconfigurable processor has become a research focus for many SoC researchers around the world.

This chapter first briefly introduces the Network-on-a-Chip (NoC) architecture. It then positions the reconfigurable processor and explains why simulation precedes design. It also describes the reconfigurable processor components and how they are combined into the NoC. Finally, the organization of this thesis is briefly introduced.


1.1 Network on a Chip

Today a single chip may contain up to ten billion transistors; with such massive on-chip resources, SoC designers can implement very complex hardware on one chip. These designers now face many unusual challenges: they must be concerned not only with component-level issues such as performance and power, but also with on-chip system-level issues of reusability, adaptability, and scalability (i.e., integration and interconnection). These trends indicate that a new architecture template, supporting software tools, and an effective design methodology are required to deal with those challenges. Three common SoC problems are listed below:

1. System-level synchronization: A future SoC will contain various components, so it will probably be implemented with local synchronization and global asynchronization. How can effective system-level synchronization be achieved at reasonable cost?

2. Reusability: Under time-to-market pressure, designers are forced to reuse IP to shorten the design process. Reusability ranges from the system level to the physical level and requires a well-defined interface (wrapper), adaptive communication protocols, and scalable on-chip interconnection structures. How can fast reusability be achieved with acceptable area and timing cost for the additional hardware/software design?

3. Diverse interconnect architectures: In the past the choice of interconnect architecture was limited to a few options, given the small number of blocks to be interconnected and the relative simplicity of managing the performance and delay tradeoffs. For SoCs, a richer set of interconnect schemes should be examined: for example, shared communication resources such as buses, crossbars, and meshes to minimize resource needs. Solving the latency-versus-throughput tradeoff now requires considering a large number of design parameters, such as pipeline stages, arbitration, synchronization, routing, and repeating schemes.

Considering the problems above, we develop a novel SoC structure: the "Network-on-a-Chip". It provides designers with a development platform on which they can implement various applications, such as MPEG and MP3. The details are introduced in Chapter 2.


1.2 Reconfigurable Processor Orientation

As shown in FIGURE 1-1, the reconfigurable processor bridges the gap between reconfigurable computing and the microprocessor parallel computer. It has some of the advantages of both but, unfortunately, sometimes inherits defects of both; the key point in choosing which one to design is the intended use. As the figure shows, the reconfigurable processor has more flexibility than reconfigurable computing and the ASIC, and more performance than the microprocessor parallel computer.

FIGURE 1-1 Reconfigurable processor orientation (flexibility versus performance: microprocessor parallel computer, reconfigurable processor, reconfigurable computing, ASIC).

Combining several reconfigurable processors into a NoC, and then several NoCs into a NoC mesh, completes the hierarchical hardware architecture. The result is a very powerful and flexible chip: large computations can be mapped onto the NoC mesh hierarchically, putting multiple applications on a single chip.

1.3 Simulation before Design

To find the best designs, architects must rapidly simulate many design alternatives and have confidence in the results. Since semiconductor integration capacity reached the point where entire systems could theoretically be integrated on a single die, hardware companies have heralded a new age of platform-based design. Platforms are now defined to include a wide assortment of elements from System-Level Design (SLD): the RTL hardware definition, bus architecture, power management strategy, device drivers, OS ports, and application software.

To be successful, however, a platform will need more than this; an essential element for enabling differentiation will prove to be an advanced systems modeling and verification environment. Developers require a variety of views of the entire platform from RTL, system models, software development models, and real hardware development boards.

Each view of the platform reflects the same system architecture, and designers can use test software in any of the higher-level views, providing a high degree of confidence in the design prior to tape out. This provides a valuable environment in which to investigate system bandwidth and performance requirements. System views must be extendible, allowing designers to exploit the advantages of a well-supported, pre-verified base platform of hardware and software designs, whilst differentiating their own application with their own design.

1.4 Reconfigurable Processor Components

The reconfigurable processor design consists of communication components, basic processor components, and reconfigurable components. The communication components consist only of the network interface, which transfers data between the reconfigurable processor and the routers. The basic processor components are the general components of a common processor: program counter, controller, instruction cache, decoder, ALU, data cache, and branch predictor. The reconfigurable components consist of four special components designed for reconfigurable operations: the reconfigurable controller, the DMA, the function unit, and the reconfigurable operation cache. The ALU is included in the function unit, so the function unit is also an arithmetic component.


FIGURE 1-2 Reconfigurable processor components (communication components: network interface; basic processor components: program counter, controller, instruction cache, decoder, ALU, data cache, branch predictor; reconfigurable components: reconfigurable controller, DMA, function unit).
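To make this grouping concrete, the C sketch below mirrors FIGURE 1-2 as plain data structures; all type and field names here are illustrative inventions, not identifiers from the actual RTL design.

```c
/* Illustrative grouping of the reconfigurable processor's components
 * (after FIGURE 1-2). All names are hypothetical, not the RTL design. */
typedef struct {
    void *network_interface;       /* transfers data to/from the routers */
} CommunicationComponents;

typedef struct {
    unsigned program_counter;
    void *controller;
    void *instruction_cache;
    void *decoder;
    void *alu;                     /* also embedded in each function unit */
    void *data_cache;
    void *branch_predictor;
} BasicProcessorComponents;

typedef struct {
    void *reconfigurable_controller;
    void *dma;
    void *function_units[16];      /* the sixteen-FU design of chapter 6 */
    void *reconfigurable_operation_cache;
} ReconfigurableComponents;

typedef struct {
    CommunicationComponents comm;
    BasicProcessorComponents basic;
    ReconfigurableComponents reconfig;
} ReconfigurableProcessor;
```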

1.5 Thesis Organization

Chapter 1 Introduction
Chapter 2 Network on a Chip
Chapter 3 Reconfigurable Principle and Applications
Chapter 4 Polymorphous TRIPS Architecture
Chapter 5 Simulation
Chapter 6 Reconfigurable Processor Design
Chapter 7 Conclusion


Chapter 2 Network on a Chip

Today a single chip may contain up to ten billion transistors; with such massive on-chip resources, SoC designers can implement very complex hardware on one chip. These designers now face many unusual challenges: they must be concerned not only with component-level issues such as performance and power, but also with on-chip system-level issues of reusability, adaptability, and scalability (i.e., integration and interconnection). These trends indicate that a new architecture template, supporting software tools, and an effective design methodology are required to deal with those challenges.

An efficient solution to these problems is to treat SoCs as micronetworks, or Network on a Chip (NoC) [4], where the interconnections are designed using an adaptation of the protocol stack. Networks have a much higher bandwidth thanks to multiple concurrent connections and a regular structure, so the design of the global wires can be fully optimized and their properties become more predictable. Regularity enables design modularity, which in turn provides a standard interface for easier component reuse and better interoperability. Overall performance and scalability increase because the networking resources are shared. This chapter briefly describes the NoC architecture. FIGURE 2-1 shows the NoC architecture, which is a combination of three parts: network, multiprocessor, and platform.

FIGURE 2-1 NoC architecture (an application mapped through the system platform stack to an architecture platform instance, through the architecture/silicon implementation platform stack to a silicon implementation platform instance, and through the manufacturing interface platform stack to the manufacturing interface).

2.1 Components of NoC

A NoC design is a dynamic, plug-and-play composition of heterogeneous IP blocks, which requires an on-chip interconnect network that is scalable (in the number of attached IP blocks), programmable, and reusable. Most research indicates that the best way to integrate various IP components is globally asynchronous, locally synchronous (GALS) operation: the NoC's communication network and interface are designed for this purpose. A NoC has five component types (shown in FIGURE 2-2):

(1) Communication components: these contain the network, which consists of the interconnection and the routers.
(2) Computation components: these can be programmable, reconfigurable, or application-specific logic, i.e., reconfigurable processors, FPGAs, or ASICs.
(3) Global/shared memory components: the memory components store data used by every component in the NoC.
(4) Application interface (AI): a software layer between designers and the system.
(5) I/O: input/output ports of the NoC.

FIGURE 2-2 NoC structure template (communication components: interconnection network and routers; computation components: DSP, ASIC, reconfigurable processor, FPGA; memory: shared and unshared memory; application interface; I/O).

2.2 Communication Structure of NoC

The communication of the NoC is based on the OSI communication structures. The OSI reference model (OSI RM) is a framework that allows people to classify and describe network protocols. Since its standardization, it has been used as a reference for wired and wireless computer-network design. The same layering concept and many of the protocol functions can also be used to realize a NoC.

Communication-based design uses the stack model as a tool to guide the decomposition of the design problem. The separation of computation and communication reveals the communication requirements of the system. The application layer provides the set of communication functions that implement those requirements. It does so by building upon the functionality defined at lower levels in the stack model. In most cases, it is not necessary to implement protocols at all of the OSI stack layers to provide this high-level functionality.

One of the benefits of the OSI stack model is that it scales to match the needs of the system components. If the system components do not require connections with state, data format conversion or other features, the corresponding stack layers can be omitted. However, as embedded systems scale in complexity their communication architectures will have to scale in functionality as well.

FIGURE 2-3 OSI layers of the NoC interconnect network (application, presentation, session, transport, network, data link, physical, realized by the on-chip interconnect network).

The layered approach of the OSI model is a useful method for structuring and organizing a protocol at an early stage of the design process. However, on the way to implementation, designers may consider whether the original layering structure should be maintained, or whether performance is optimized by combining adjacent layers. Below, we briefly describe the seven OSI layers: physical, data link, network, transport, session, presentation, and application.
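As a concrete picture of this layering, the C sketch below walks a message down a reduced stack in which unneeded layers are simply disabled; the enum, the mask, and the printed trace are illustrative assumptions, not part of any NoC implementation described here.

```c
#include <stdio.h>

/* The seven OSI layers, top to bottom. A NoC stack may omit layers
 * (e.g., session and presentation) that its components do not need. */
enum Layer { APPLICATION, PRESENTATION, SESSION, TRANSPORT,
             NETWORK, DATA_LINK, PHYSICAL, NUM_LAYERS };

static const char *layer_name[NUM_LAYERS] = {
    "application", "presentation", "session", "transport",
    "network", "data link", "physical"
};

/* Pass a message down the stack, skipping disabled layers. */
static void send_down(unsigned enabled_mask)
{
    for (int l = APPLICATION; l < NUM_LAYERS; l++)
        if (enabled_mask & (1u << l))
            printf("encapsulate at %s layer\n", layer_name[l]);
}

int main(void)
{
    /* A reduced NoC stack: no session or presentation layer. */
    unsigned mask = 0x7Fu & ~((1u << SESSION) | (1u << PRESENTATION));
    send_down(mask);
    return 0;
}
```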

2.2.1 Communication Components

The communication components are possibly the most important components of the NoC. They not only provide efficient and scalable on-chip communication but also give the computation components high adaptivity and reusability. As shown in FIGURE 2-4, the communication components of the NoC comprise the interconnect network, the network interfaces, and the routers. The topology determines the structure of the interconnect network. The network interface transforms the data sent and received between tiles and routers. The routers forward the packets through the interconnect network.

Comparing several network structures [38]: the shared-medium network has the simplest architecture, but its bandwidth is not scalable. The indirect network is suitable for systems with few processing nodes, and its performance is quite good; nevertheless, as the number of processing nodes increases, the topology and the wiring between switches become extremely complex. The direct network, especially a strictly orthogonal topology, has a more regular and simpler topology than the indirect network, which softens the impact of an increasing number of processing nodes.

FIGURE 2-4 Communication components of NoC (computation components such as controllers, FPGAs, DSPs, multipliers, memories, and microprocessors attached to routers with input units, output units, a switch fabric, and a shared buffer; interconnect topologies: 2-D mesh, 3-D hypercube, torus).


To build a scalable system, especially a NoC template, a more regular interconnection network structure is needed, one easily mapped to a 2-D layout. Thus the direct network, specifically the 2-D mesh, is chosen as the communication infrastructure of the NoC template.

The network topology of the NoC is a 2-D mesh. There are two hierarchical networks in the NoC template, called the local network and the global network, from the lower to the higher hierarchy respectively. The local network connects all adjacent nodes in the mesh and is used for data transmission in the local area. For network scalability, and to reduce the latency of transmissions between two distant nodes, we add an extra global network between non-adjacent nodes.

The hop count of the global network can be determined according to the size and latency requirements of the system. Each connection between two nodes, in both the local and the global network, is a set of bidirectional full-duplex channels. Take FIGURE 2-5 as an example: this is a 16-node 2-D mesh network, and the pair (x, y) is the address of a node in the network. The hop count of the global network is 4. FIGURE 2-5 shows the proposed NoC topology, a 2-D mesh with two hierarchical networks: the local network and the global network.
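The following C sketch illustrates the hop-count arithmetic on such a mesh under one plausible reading of the two-level network: a global link covers G local hops in a single hop, with G = 4 as in the example. The formula and the function names are assumptions for illustration, not taken from the thesis.

```c
#include <stdlib.h>

/* Manhattan distance between nodes (x1,y1) and (x2,y2) on the local
 * 2-D mesh: one hop per adjacent-node link. */
static int local_hops(int x1, int y1, int x2, int y2)
{
    return abs(x1 - x2) + abs(y1 - y2);
}

/* One plausible model of the two-level network: a global link spans
 * G mesh hops in a single hop, and the remainder travels locally. */
static int two_level_hops(int x1, int y1, int x2, int y2, int G)
{
    int d = local_hops(x1, y1, x2, y2);
    return d / G + d % G;
}
```

For nodes (0, 0) and (3, 3), local_hops gives 6 hops, while two_level_hops with G = 4 gives 3 hops (1 global plus 2 local), which is the kind of latency reduction the global network targets.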

FIGURE 2-5 Proposed NoC: a 16-node 2-D mesh with node addresses (x, y) from (0, 0) to (3, 3); each node couples a processing element (uProc, DSP, FPGA, MEM, or shared MEM) to a router through a network interface, and local and global networks connect the routers.


2.2.2 Router

The architecture of a generic node, including its router [38], is shown in FIGURE 2-6. The figure shows a generic router model with input and output buffering. Note that the buffer of each channel is divided into two virtual channels.

Basically, this router model is used for unicast, distributed routing. The router contains four input channels, which receive packets from other routers; four output channels, which transmit packets to other routers; and an injection channel and an ejection channel, which connect the router to the processing element bound to it via a network interface, carrying packets from the processing element into the router and from the router back to the processing element, respectively. The router is composed of the following major components: buffers, a switch, a routing and arbitration unit, link controllers, and a network interface.

FIGURE 2-6 A generic router model: four input channels and four output channels, each with a link controller (LC); an injection channel and an ejection channel to the network interface / processing element; a switch; and a routing and arbitration unit.

After the link controller of an input channel receives a packet, it stores the packet in the input buffer. Then the routing and arbitration unit routes the packet according to the routing information in the packet's header flit and arbitrates for the switch to transfer the packet to the output buffer. Finally, the link controller of the output channel transmits the packet to the next node.
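The fragment below sketches the routing decision inside that flow in C; the header-flit layout and the dimension-ordered (XY) routing rule are illustrative assumptions, since the thesis does not fix a routing algorithm at this point.

```c
/* Minimal sketch of the decision made by the routing and arbitration
 * unit. The header flit carries the destination address; XY
 * (dimension-ordered) routing is assumed here for illustration. */
typedef struct {
    int dst_x, dst_y;              /* routing information in the header flit */
} HeaderFlit;

enum OutPort { EAST, WEST, NORTH, SOUTH, EJECT };

static enum OutPort route(HeaderFlit h, int my_x, int my_y)
{
    if (h.dst_x > my_x) return EAST;   /* correct the x coordinate first */
    if (h.dst_x < my_x) return WEST;
    if (h.dst_y > my_y) return NORTH;  /* then the y coordinate */
    if (h.dst_y < my_y) return SOUTH;
    return EJECT;   /* arrived: ejection channel to the network interface */
}
```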

2.2.3 Network Interface

The network interface plays an important role in integrating computation resources into the communication network. It can be viewed as a bridge between the NoC computation and communication components. The network interface is also a configurable architecture, so we can modify it to meet different requirements. In a homogeneous NoC the interface can simply be synchronous; in a heterogeneous NoC it should be asynchronous. Many mature interface protocols suit the NoC platform, such as VCI and AMBA. These protocols not only provide efficient transfers but also play an important role in interfacing IP to the NoC platform.

As shown in FIGURE 2-7, the NI can be divided into three layers: transport layer, network layer, and physical layer. The transport layer is implemented with the CMAI. It can also be viewed as a user or software layer. The network layer contains drivers which are used by the CMAI to control the physical layer. The physical layer is implemented by the bus protocols, such as VCI or AMBA.
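A minimal C sketch of these three layers as nested call levels is given below; only the layer split (CMAI on top of drivers on top of VCI/AMBA bus transactions) comes from the text, while every function name, signature, and the memory-mapped address are invented for illustration.

```c
/* Hypothetical three-layer network interface after FIGURE 2-7.
 * Only the layering is from the text; all names are invented. */

/* Physical layer: one bus transaction (e.g., a VCI/AMBA write). */
static void phy_bus_write(unsigned addr, unsigned data)
{
    /* ... drive the bus protocol ... */
    (void)addr; (void)data;
}

/* Network layer: a driver that frames one flit and drives the bus.
 * The injection-port base address 0x8000 is an assumed mapping. */
static void drv_send_flit(unsigned dst_node, unsigned flit)
{
    phy_bus_write(0x8000u + dst_node, flit);
}

/* Transport layer (CMAI): the routine application software calls. */
static void cmai_send(unsigned dst_node, const unsigned *buf, int nflits)
{
    for (int i = 0; i < nflits; i++)
        drv_send_flit(dst_node, buf[i]);
}
```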

FIGURE 2-7 Layers of the network interface between the computation components and the router: transport layer (CMAI), network layer (drivers), physical layer (VCI/AMBA).

2.3 Multiprocessor Structure of NoC

The NoC can be viewed as a system of multiple processors that operate concurrently and interact through the on-chip interconnect network. The reconfigurable processor to be designed (introduced in detail in Chapter 6), also called the "tile reconfigurable processor", must not only execute common algorithmic operations but also communicate over the on-chip interconnect network. Viewing the NoC from the multiprocessor perspective, we must deal with the problems and challenges a software designer faces. This leads to the abstract layering concept of the software stack: software developers typically organize application software as a stack of layers running on each processor core, as shown in FIGURE 2-8.

FIGURE 2-8 Software stack (gray blocks), from top to bottom: dedicated software (application), specific application (programming) interface, custom operating system, drivers, tile processor unit core.

The lowest layer contains drivers and low-level routines to control and configure the platform. The middle layer can be a common commercial embedded operating system, configured according to the application. The upper layer is an application (programming) interface that provides predefined routines for accessing the NoC. Software designers can then isolate the coding of application software from the design of the NoC. Because of code size and performance concerns, it is not realistic to use a generic operating system as the middle layer; a lightweight custom operating system, supporting an application- and platform-specific AI, is necessary. Software and hardware adaptation layers isolate platform components, thus enabling the concurrent development of the components, as shown in FIGURE 2-9.


FIGURE 2-9 Concurrent development environment of the NoC (user application, SoC design, NoC application, computation AI and communication AI as the (programming) interface, software design with software/hardware communication abstraction, communication (abstract) interface, NoC hardware components (RTL and layout), hardware design, NoC interconnect network).

With this scheme, software programmers use AIs for software application development. The hardware design team uses communication (abstract) interfaces from communication controllers. The NoC design team can concentrate on implementing hardware and software abstraction layers for the selected IP communication interconnect. Designing such hardware-software abstraction layers is a major effort, and design tools are lacking. For example, the established EDA tools are not well adapted to the NoC design scenario; consequently, many challenging requirements are emerging:
• Higher abstraction level. It is too time-consuming to model at RTL and verify the interconnection of multiple processor cores.
• Higher-level programming. The NoC will include hundreds of thousands of lines of dedicated software (firmware). It is too hard for developers to program this software at the assembly level.
• Efficient hardware-software interfaces. Designers must optimize processor interfaces, register banks, shared memories, software drivers, and operating systems for each application.

As a result, we decided to use a platform-based design flow. We discuss the platform perspective of the NoC in the next section.


2.4 Platform Principle of NoC

Creating an economically feasible NoC design flow requires a structured, top-down methodology that deliberately limits the space of exploration, yet in doing so achieves superior results within the fixed time constraints of the design. We propose such a methodology based on defining platforms at all of the key articulation points in the NoC design flow. Each platform represents a layer in the design flow for which the underlying, subsequent design-flow steps are abstracted. By carefully defining the platform layers and developing new representations and associated transitions from one platform to the next, we believe an economically feasible "single-pass" NoC flow can be realized.

A platform is an abstract representation of a family of micro-architectures. The elements of this family are a sort of "hardware denominator" that can be shared across multiple applications; this is the flexibility (reconfigurability and programmability) of a platform. Programmability comes from both the hardware and software aspects of the platform: the software aspect covers microprocessors, DSPs, and other programmable components, while the hardware aspect covers reconfigurable logic units (FPGAs). To define it further, we consider a platform to be one layer of a design abstraction stack. The layer offers two perspectives: (1) the upper viewpoint is an abstraction of the subsequent design flow, so that we can continue the design process without the details of the lower level; (2) the lower viewpoint is a set of design rules by which we choose the components that construct a platform.

2.4.1 The Platform-Based Design Flow

The platform-based design flow contains several platform layers. Every platform layer represents an abstraction of a stage of the design flow. Between platforms there is a "platform stack", which maps the upper abstraction layer onto the lower abstraction layer. This is the hierarchical design concept of the platform-based design flow.


The platform stack can be viewed as a combination of the upper Application Interface (AI) and the lower implementation platform, joined through mapping tools. The AI provides an abstraction of the low-level micro-architecture. It contains a set of parameters, such as performance, power dissipation, cost, size, or some physical manufacturing parameters. These parameters let designers reuse the low-level micro-architecture easily. Moreover, through the AI, designers can propagate upper-level constraints down to the implementation platform for further refinement. Taking the system platform stack as an example (FIGURE 2-10), it contains two cone-shaped AI platforms, a hardware architecture platform, and mapping tools (e.g., software, RTOS, or device-driver synthesizers). The AI wraps the hardware components of the hardware architecture:
1. the programmable cores and the memory subsystem, via a Real-Time Operating System (RTOS),
2. the I/O subsystem, via the device drivers, and
3. the network connection, via the network communication subsystem.

FIGURE 2-10 System platform stack (application instances in the application space above, platform instances in the architectural space below, joined by the AI platform through specification and architecture-platform design-space exploration).

Through the system platform stack, a system designer can map an application onto the abstract representation and then optimize cost, efficiency, power dissipation, and flexibility. In the design flow, how many platform stacks to use and at what level of abstraction are the designer's two major considerations. The higher the abstraction level at which a platform is defined, the more instances it contains. The platform instances stand for the resources a designer can draw on in a platform. Choosing platform instances is what we call "design": the designer sets the reconfigurable parameters of the platform library, programs some IP cores, and then chooses a set of IP components to complete the design. To choose platform instances efficiently, we must consider two mechanisms: (1) a propagation mechanism used in the platform-stack design flow, and (2) an evaluation mechanism used in the upper platform to obtain the specifications and characteristics of the lower platform.
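To illustrate what "setting the reconfigurable parameters" of an instance might look like, the hypothetical C descriptor below collects a few such parameters drawn from this thesis (mesh size, global-network hop span, number of function units, DMA option); the struct itself is an invented example, not a defined platform format.

```c
/* Hypothetical platform-instance descriptor: choosing an instance
 * ("design") means fixing parameters like these. The fields echo the
 * design space explored in this thesis; the format is invented. */
typedef struct {
    int mesh_width, mesh_height;   /* NoC 2-D mesh size, e.g. 4 x 4 */
    int global_hop;                /* hop span of the global network */
    int n_function_units;          /* 1, 2, 4, 9 or 16 (chapter 5) */
    int with_dma;                  /* with/without DMA variants */
} PlatformInstance;

/* One instance matching the design proposed later in the thesis. */
static const PlatformInstance proposed = { 4, 4, 4, 16, 1 };
```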

These two mechanisms are the core of the platform-based design flow. Architecture platform-based design is neither a top-down nor a bottom-up design methodology; rather, it is a "meet-in-the-middle" approach. In a pure top-down design process, the application specification is the starting point. The sequence of design decisions drives the designer toward a solution that minimizes the cost of the architecture, and the design process selects the most attractive solution as defined by a cost function.

In a bottom-up approach, a given architecture (an instance of the architecture platform) is designed to support a set of different applications that are often vaguely defined, based largely on designer intuition and marketing inputs. IC companies traditionally followed this approach, trying to maximize the number of applications (hence the production volume) of their platform instances. The trend is toward defining platforms and platform instances in close collaboration with system companies, thus fully realizing the meet-in-the-middle approach.


FIGURE 2-11 Platform stacks for NoC design (application, system platform stack, architecture platform instance, architecture/silicon implementation platform stack, silicon implementation platform instance, manufacturing interface platform stack, manufacturing interface).

2.4.2 Proposed NoC Platform-Based Design

The NoC platform-based design consists of architecture, software, and implementation platform design, described below:
(1) Choose platform instances to construct the architecture platform, then map the application onto the architecture platform. During this stage, the range of the design space must be decided: the range of platform applications (flexibility, the capability to support different applications) and the number of platform instances (the resources a platform contains).
(2) Pay attention to software reusability: application interfaces that span different applications provide high software reusability and fast adaptability. During this stage there are four considerations: (1) how to reflect manufacturing characteristics accurately in the architecture platform; (2) the flexibility of the architecture platform; (3) the cooperation of the platform and the software development environment; (4) a set of software mapping tools to hide the details of the architecture platform.
(3) Finally, complete the implementation platform design. The concept of an implementation platform is strongly related to regular design and design reuse, which provide correct-by-assembly design, ease of verification, construction of reliable components from widely fluctuating parameters, and manufacture of high-yield, reliable silicon ICs. Regular design can avoid the manufacturing problems of crosstalk, digital noise, and EMC.

The NoC platform-based design flow is shown below; it reflects the three design platforms described above.

FIGURE 2-12 Platform stacks for NoC and network platform designs (application space mapped through the AI platform to the implementation space; topology design at the network layer, protocol design at the MAC layer, and physical-level routing).


Chapter 3 Reconfigurable Principle and Applications

Reconfigurable computing research and design are active development topics in many countries today. Reconfiguration offers designers many advantages: it plays an important role in reducing design cost, shortening design time, easing design difficulty, and improving the integration of IP components. As shown in FIGURE 3-1, reconfigurable computing [15] bridges the gap between ASICs and microprocessors.

Reconfigurable platforms [16] let users reset system architectures as process technologies and applications advance. We analyze the architectures and specialties of reconfiguration in this chapter. Reconfigurable platforms bring a new dimension to digital system development and have a strong impact on SoC design. If the design flow matched the state of the art as desired, their flexibility could support turn-around times of minutes instead of months for real-time in-system debugging, profiling, verification, tuning, field maintenance, and field upgrades.

FIGURE 3-1 Reconfigurable computing orientation (flexibility versus performance: microprocessor, reconfigurable computing, ASIC).

3.1 Fine-grained Reconfigurable Architectures

A fine-grained reconfigurable architecture consists of an array of CLBs (Configurable Logic Blocks) with a path width of 1 bit, embedded in a reconfigurable interconnect fabric. More complex operations can be constructed by reconfiguration, though not as efficiently as in other approaches. The representative device is the FPGA, illustrated by the sketch after the next paragraph.

Field-Programmable Gate Array (FPGA): FPGA vendors are advancing rapidly. The vendors on the market include Xilinx, Altera, Lattice, Actel, Atmel, Cypress, Lucent, QuickLogic, and Triscend. FPGA vendors currently have a relatively fast-growing, large base of HDL-savvy designers, and driven by that growing user base, innovations appear more and more rapidly. FPGA vendors are heading for the forefront of platform-based design. Meanwhile, a wide variety of hardwired IP cores are delivered on the same chip as the FPGA. Thanks to Moore's law, FPGA vendors offer more and more products that place microcontrollers such as ARM, MIPS, and PowerPC, other RISC architectures, memory, peripheral circuitry, and more on the same chip as the FPGA.
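As a concrete picture of 1-bit granularity, the C sketch below evaluates a 4-input lookup table, the core of a CLB; the 16-bit truth-table encoding shown is the generic LUT idea, not any particular vendor's cell.

```c
#include <stdint.h>

/* A 4-input LUT modeled in C: the 16-bit 'config' word is the truth
 * table, so reconfiguring the cell just means rewriting 'config'.
 * Wider operations need many such 1-bit cells plus routing, which is
 * why fine grain pays a large configuration-memory and time cost. */
static int lut4(uint16_t config, int a, int b, int c, int d)
{
    int index = (a & 1) | ((b & 1) << 1) | ((c & 1) << 2) | ((d & 1) << 3);
    return (config >> index) & 1;
}

/* Example: config = 0x6996 makes lut4 compute a ^ b ^ c ^ d. */
```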

3.2 Coarse-grained Reconfigurable Architectures

A coarse-grained reconfigurable circuit consists of an array of CFBs (Configurable Functional Blocks), also called rDPUs (reconfigurable Datapath Units). Coarse-grained architectures support operator-level CFBs, word-level datapaths, and powerful, area-efficient routing switches. The main advantage of a coarse-grained architecture is the reduction of configuration memory and configuration time.

TABLE 3-1 summarizes several coarse-grained reconfigurable circuits. Coarse-grained reconfigurable hardware can be separated into three kinds: mesh-based architectures, architectures based on linear arrays, and crossbar-based architectures; typical circuits of each kind are described in the following subsections. A leading approach to multi-granularity is to combine several coarse-grained reconfigurable units, for example four 4-bit ALUs into one 16-bit ALU, as sketched below.
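The 4-bit-to-16-bit combination just mentioned can be pictured as carry-chained slices; the C model below is an illustrative sketch of that composition, not code from any of the surveyed architectures.

```c
#include <stdint.h>

/* One 4-bit ALU slice: adds two nibbles plus a carry-in, returning
 * the 4-bit sum and the carry-out that links it to the next slice. */
static unsigned alu4_add(unsigned a, unsigned b, unsigned cin, unsigned *cout)
{
    unsigned s = (a & 0xF) + (b & 0xF) + (cin & 1);
    *cout = s >> 4;
    return s & 0xF;
}

/* Chaining four slices yields a 16-bit add: the multi-granular
 * composition that several TABLE 3-1 architectures support. */
static uint16_t alu16_add(uint16_t a, uint16_t b)
{
    unsigned carry = 0;
    uint16_t r = 0;
    for (int i = 0; i < 16; i += 4)
        r |= (uint16_t)(alu4_add(a >> i, b >> i, carry, &carry) << i);
    return r;
}
```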

3.2.1 Mesh-Based Architectures

Mesh-based architectures arrange their PEs in a rectangular two-dimensional array that supports rich communication resources for efficient parallelism and encourages nearest-neighbor (NN) links between adjacent PEs (4NN: links to the 4 sides east, west, north, and south; or 8NN: links to the 8 sides east, north-east, north, north-west, west, south-west, south, and south-east, as in the CHESS array). Some typical mesh-based architectures are listed below:
1. DP-FPGA (Datapath FPGA) [6]: an FPGA-like mixed fine- and coarse-grained architecture with 1- and 4-bit paths. Its fabric includes three component types: control logic, the datapath, and memory.
2. The KressArray [20][40]: primarily a mesh of rDPUs (reconfigurable Datapath Units) physically connected through wiring by abutment: no extra routing areas are needed. Its interconnect fabric distinguishes three physical levels: multiple unidirectional and/or bidirectional NN links, full-length or segmented column/row backbuses, and a single global bus reaching all rDPUs.
3. The KressArray family: the basic principles of the KressArray, supported by an application development tool and a platform architecture space exploration (PSE) environment.
4. Colt [33]: combines concepts from FPGAs and dataflow computing. It is a 16-bit pipenet and relies heavily on runtime reconfiguration using wormhole routing.
5. MATRIX (Multiple ALU architecture with Reconfigurable Interconnect eXperiment) [13]: a multi-granular array of 8-bit BFUs (Basic Functional Units) with procedurally programmable instruction memory and a controller that can generate local control signals from ALU output by a pattern matcher, a reduction network, or a half-NOR PLA.
6. The Garp Architecture [28]: resembles an FPGA and comes with a MIPS-II-like host. The basic unit of its primarily mesh-based architecture is a row of 32 PEs forming a reconfigurable ALU. The host has instruction-set extensions to configure and control the RA (reconfigurable array). Host and RA share the same memory hierarchy. Memory accesses can be initiated by the RA, but only through the central 16 columns. The blocks in the leftmost column are dedicated controllers for interfacing. For fast reconfiguration, the RA features a distributed cache of depth 4, which stores the least recently used configurations.
7. RAW (Reconfigurable Architecture Workstation) [15]: provides a RISC multiprocessor architecture composed of NN-connected 32-bit modified MIPS R2000 microprocessor tiles, each with an ALU, a 6-stage pipeline, a floating-point unit, a controller, a register file of 32 general-purpose and 16 floating-point registers, a program counter, local cached data memory, and 32 Kbytes of SRAM instruction memory (FIGURE 3-2). The prototype chip features 16 tiles arranged in a 4-by-4 array. RAW provides both a static network (determined at compile time) and a dynamic network (determined at run time: wormhole routing for data forwarding).


8. REMARC (Reconfigurable Multimedia Array Coprocessor) [39]: a reconfigurable accelerator tightly coupled to a MIPS-II RISC processor, consisting of an 8-by-8 array of 16-bit nanoprocessors with memory, attached to a global control unit. The communication resources consist of nanoprocessor NN connections plus additional 32-bit horizontal and vertical buses that also allow broadcast to processors in the same row or column. A global program counter value is also broadcast each cycle to all nanoprocessors, supporting SIMD operation.

FIGURE 3-2 The RAW architecture.

9. MorphoSys (Morphing System) [17, 25]: has a MIPS-like TinyRISC processor with an extended instruction set, a mesh-connected 8-by-8 RA, a frame buffer for intermediate data, and four quadrants of 4-by-4 16-bit RCs (reconfigurable cells), each featuring an ALU, a multiplier, a shifter, a register file, and a 32-bit context register storing the configuration word.
10. The CHESS Array [2]: the CHESS hexagonal array features a chessboard-like floorplan with interleaved rows of alternating ALU/switchbox sequences (FIGURE 3-3). Embedded RAM areas support high memory requirements, and a switchbox can be converted into a 16-word-by-4-bit RAM if needed. The interconnect fabric of CHESS has segmented 4-bit buses of different lengths: 16 buses in each row and column, with 4 buses for local connections spanning one switchbox, 4 buses of length 2, and 2 buses each of lengths 4, 8, and 16.
11. The DReAM Array (Dynamically Reconfigurable Architecture for Mobile Systems) [29]: each RPU consists of two dynamically reconfigurable 8-bit RAPs (Reconfigurable Arithmetic Processing units), two barrel shifters, a controller, two 16-by-8-bit dual-port RAMs, and a communication protocol controller. The RPU array fabric uses NN ports and global buses divisible by switching boxes.

FIGURE 3-3 CHESS array hexagon floor plan

12. CS2000 family [22]: Chameleon Systems offers the CS2000 family, a multi-protocol and multi-application reconfigurable platform. It consists of an RCP (reconfigurable communication processor) for telecommunications and data communications, a 32-bit RISC core as a host, a full memory controller, and a PCI controller connecting to an RA of 6, 9, or 12 reconfigurable tiles, where each tile has seven 32-bit rDPUs (each including an 8-word instruction memory), 4 local memory blocks of 128x32 bits, and 2 16x24-bit multipliers. The RA allows multiple independent data streams, and its reconfigurable fabric can be changed within a single clock cycle.
13. The MECA family [23]: intended as a family of DSPs (digital signal processors) optimized for VoIP, compressing voice into ATM or IP packets, etc. It aims at next-generation VoIP and VoATM. The MECA family claims a speed-up factor of 10 over conventional DSPs.
14. CALISTO (Configurable Algorithm-adaptive Instruction) [24]: intended by Silicon Spice as an innovative adaptive-instruction-set architecture for Internet Protocol (IP) and ATM packet-based networks, with the flexibility of ASAP (Any Service Any Port) to deliver voice and data simultaneously over a unified data network. CALISTO is a single-chip communications processor for carrier-class voice gateways, soft switches, and remote access concentrators/remote access servers (RAC/RAS), aimed at applications like echo cancellation, voice/fax/data modems, packetization, cellification, and delay equalization.
15. FIPSOC (Field-programmable System-on-Chip) [25]: has an 8051 controller, an RA, and an RAA (Reconfigurable Analog Array). The 8x12 (8x16 or 16x16) RA is an array of DMCs (Digital Macro Cells), each including a 4-bit LUT and 4 latches, a 4-bit up/down counter, a shift register, a 4-bit adder, a 16x4-bit RAM, etc. The RAA has CABs (Configurable Analog Blocks) usable as differential amplifiers, comparators, converters, etc.


Project            First publ., source  Architecture      Data-path width             Fabrics                         Mapping
PADDI              1990 [29]            Crossbar          16 bit                      Central crossbar                Routing
PADDI-2            1993 [1]             Crossbar          16 bit                      Multiple crossbar               Routing
DP-FPGA            1994 [6]             2-D array         1 & 4 bit, multi-granular   Inhomogenous routing channels   Switchbox routing
KressArray family  1995 [20][40]        2-D mesh          select pathwidth            Multiple NN & bus segments      (Co-)compilation
Colt               1996 [33]            2-D array         1 & 16 bit                  Inhomogenous (sophisticated)    Runtime reconfiguration
RaPiD              1996 [5]             1-D array         16 bit                      Segmented buses                 Channel routing
MATRIX             1996 [13]            2-D mesh          8 bit, multi-granular       8NN, length 4 & global lines    Multi-length routing
RAW                1997 [15]            2-D mesh          32 bit                      8NN switched connections        Switchbox routing
Garp               1997 [28]            2-D mesh          2 bit                       Global & semi-global lines      Heuristic routing
Pleiades           1997 [19]            Mesh/Crossbar     Multi-granular              Multiple segmented crossbar     Switchbox routing
PipeRench          1998 [36]            1-D array         128 bit                     (sophisticated)                 Scheduling
REMARC             1998 [39]            2-D mesh          16 bit                      NN & full-length buses
MorphoSys          1999 [21]            2-D mesh          16 bit                      NN, length 2 & 3 global lines   Manual P&R
CHESS              1999 [2]             Hexagon mesh      4 bit, multi-granular       8NN and buses                   JHDL compilation
DReAM              2000 [29]            2-D array         8 & 16 bit                  NN, segmented buses             Co-compilation
CS2000 family      2000 [21]            2-D array         16 & 32 bit                 Inhomogenous array              Co-compilation
MECA family        2000 [23]            2-D array         (not disclosed)
CALISTO            2000 [24]            2-D array         (not disclosed)
FIPSOC             2000 [25]            8x12, 8x16 array  (not disclosed)
Flexible array     2000 [26]            2-D array         (not disclosed)             (not disclosed)                 (not disclosed)
SystolicRing       2002 [19]            Ring              16 bit                      Circular ring                   Switchbox
SPP                2003 [42]            Hybrid            Fine-granular               Bus segment                     Heuristic routing

TABLE 3-1 Summary of the technical details of the different coarse-grained reconfigurable architectures; Note: NN stands for "Nearest Neighbor".

3.2.2. Based on Linear Array Architectures
Some RAs are based on linear array architectures, typically with NN connections, and aim at mapping pipelines onto the array. Two RAs have linear array structures: (1) RaPiD (Reconfigurable Pipelined Datapath) provides different computing resources like ALUs, RAMs, multipliers and registers, and uses mostly static reconfiguration. (2) PipeRench provides a global bus and relies on dynamic reconfiguration that allows the reconfiguration of a PE in each execution cycle. The two kinds of linear-array-based architectures are listed below:
1. RaPiD (Reconfigurable Pipelined Datapath) [5]: It aims at speeding up highly regular, computation-intensive tasks by deep pipelines on its 1-D RA. RaPiD-1 features 15 DPUs with an integer multiplier (32-bit output), 3 integer ALUs, 6 general-purpose data-path registers and 3 local 32-word memories, all 16 bits wide. ALUs can be chained. Each memory has a special data-path register with an incrementing feedback path.
2. PipeRench [36]: It is an accelerator for pipelined applications and provides several reconfigurable pipeline stages (stripes); it relies on fast partial dynamic pipeline reconfiguration and run-time scheduling of configuration streams and data streams. It has a 256 by 1024-bit configuration memory, a state memory (used to save the current register contents of a stripe), an ATT (Address Translation Table), four data controllers, a memory bus controller and a configuration controller.

3.2.3. Crossbar-Based Architectures
A full crossbar switch is a most powerful communication network and is easy to route, but the RAs of this category use only reduced crossbars. Three kinds of RAs have crossbar-based structures: (1) PADDI, for the fast prototyping of DSP data-paths, features eight PEs all connected by a multilayer crossbar. (2) PADDI-2 has 48 PEs, but it saves area by a restricted crossbar with a hierarchical interconnect structure of linear PE arrays forming clusters. (3) The Pleiades architecture has a programmable microprocessor and a heterogeneous RA of EXUs. The three kinds of crossbar-based architectures are listed below:
1. PADDI-1 Architecture (Programmable Arithmetic Device for DSP) [29]: For rapid prototyping of computation-intensive DSP data-paths, it consists of clusters of 8 arithmetic execution units (EXUs), 16 bits wide, with 8-word SRAMs (which may be concatenated for 32 bits), connected to a central crossbar switchbox.
2. PADDI-2 Architecture [1]: It features a data-driven execution mechanism. It has 48 EXUs. Each PE features a 16-bit ALU also including both multiply and select; multiplication instructions take 8 cycles on a single processor. Alternatively, eight EXUs can be pipelined, resulting in a single-cycle multiplication. PADDI-2 EXUs are packed in 12 clusters of four elements each, with a two-level interconnect: six 16-bit intra-cluster data buses (plus one-bit control buses) and inter-cluster 16-bit data buses, which can be broken up into shorter segments.
3. Pleiades Architecture [19]: It is a generalized low-power "PADDI-3" with a programmable microprocessor and a low-power heterogeneous RA of EXUs, which allows integrating both fine- and coarse-grained EXUs, and memories in place of EXUs. Communication between EXUs is dataflow driven. All configuration registers are part of the processor's memory map, and configuration is performed by processor memory writes.

3.2.4 Comparing with Fine-Grained Architectures
Coarse-grained reconfigurable architectures [30] are much more efficient than fine-grained architectures, because fine-grained architectures usually have a large delay overhead, a big routing area overhead with many more hidden configuration memories, and poor routability. Since coarse-grained reconfigurable architectures have regular structures, full-custom designs of them can be drastically more area-efficient than assembling, the FPGA way, from single-bit CLBs (Configurable Logic Blocks).

In contrast to logic-level mapping onto fine-grained devices, mapping applications onto coarse-grained devices belongs to the functional abstraction levels, using compilation techniques instead of logic synthesis.

Coarse-grained reconfigurable architectures are more powerful and orders of magnitude more energy-efficient than fine-grained ones, with drastically reduced reconfigurability overhead, and they directly support the configuration of high-level parallelism. Using the same technology, a coarse-grained array implementation of the same algorithm is substantially faster than an FPGA one.

Term                      Granularity (path width)                   Configurable blocks
Reconfigurable Logic      Fine grain (~1 bit)                        CLBs: configurable logic blocks
Reconfigurable Computing  Coarse grain (example: 16 or 32 bits)      rDPUs: reconfigurable data-path units (for instance: ALU-like)
                          Multi-granular (supports slice bundling)   rDPU slices (example: 4 bits)

FIGURE 3-4 Comparison between fine grain and coarse grain.


3.3 Static Reconfiguration
A statically reconfigurable system must be reconfigured in the initial stage. Static reconfiguration is accomplished by several kinds of reconfigurable platforms, which include:
1. Once-reconfigurable platforms:
A. The ARC Tangent A4 RISC core and the Tensilica Xtensa processor are both once-reconfigurable platforms and support application-specific architecture choices, for example, the size and organization of caches.
B. VLIW (Very Long Instruction Word) processors apply to reconfigurable architecture applications. The parameters of reconfiguration include the type and quantity of data paths, the internal link architectures and so on.
2. Repeatedly reconfigurable platforms:
A. The Chameleon reconfigurable communication processor: it consists of small-scale reconfigurable architectures and a general-purpose processor core.
B. Triscend E5 and A7: the CSoC (Configurable System on a Chip) families.
C. Altera integrates embedded processors (EXCALIBUR) and some peripherals.
D. The Xilinx Virtex-II architecture includes 192 special-purpose multipliers. Designers can use the Virtex-II device to implement key DSP elements for use in broadband systems.
E. ARM+FPGA system development tools.

3.4 Dynamic Reconfiguration
Dynamically reconfigurable systems reconfigure at run time. Dynamic reconfiguration includes several kinds of technologies. Run-time reconfigurability provides a powerful advantage of FPGAs over ASICs [37]: smaller, faster circuits, simplified hardware interfacing, fewer IOBs, smaller and cheaper packages, and simplified software interfaces. Exploding design costs and shrinking product life cycles of ASICs create a demand for RA usage for product longevity. Performance is only one part of the story; the time has come to fully exploit their flexibility to support turn-around times of minutes instead of months for real-time in-system debugging, profiling, verification, tuning, field maintenance and field upgrades.
1. Variable hardware components that include many basic arithmetic logic units are the core of the design. The designer can write low-level hardware language (micro-instruction like) and then use the NoC (Network on a Chip) to enter the mode data that changes the components, achieving reconfiguration. Because this method depends on the quantity of arithmetic logic units, the designers must consider both the reconfiguration capability and the size of the hardware.
2. The Chameleon reconfigurable communication architecture shows an arithmetic component array and uses a programmable interconnection architecture for communicating data. This architecture allows designers to create a design data-path and memory framework.
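The mode-data mechanism of item 1 can be illustrated with a minimal C sketch; all names here are hypothetical illustrations, not taken from a real design. An array of arithmetic units is modeled whose operations are selected by mode words, which arriving NoC packets may rewrite at run time:

    #include <stdint.h>

    /* Hypothetical sketch: one mode word per arithmetic unit; rewriting
       mode[] while the system runs models dynamic reconfiguration, while
       writing it only once at start-up models static reconfiguration. */
    typedef enum { OP_ADD, OP_SUB, OP_MUL, OP_SHL } op_t;

    static op_t mode[16];                    /* configuration state */

    void reconfigure(int unit, op_t new_op)  /* e.g., mode data from the NoC */
    {
        mode[unit] = new_op;
    }

    int32_t execute(int unit, int32_t a, int32_t b)
    {
        switch (mode[unit]) {
        case OP_ADD: return a + b;
        case OP_SUB: return a - b;
        case OP_MUL: return a * b;
        default:     return a << b;          /* OP_SHL */
        }
    }

The design point the sketch makes concrete is that the reconfiguration state is just memory: how much of it can change, and when, distinguishes the static and dynamic styles discussed above.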

3.5 The Reconfiguration Hierarchy
The reconfiguration hierarchy can be separated into three orthogonal axes. The reconfiguration vertical axis represents each reconfigurable abstraction layer, such as the physical layer, the application layer, etc. The reconfiguration horizontal axis represents all kinds of reconfigurable blocks in a chip, such as the communication block, the processor block, etc. The reconfiguration time axis represents the earlier or later relation among all kinds of reconfigurable functions. We call this three-orthogonal-axes design method reconfiguration hierarchy design.

3.5.1 Reconfiguration Vertical Axis
The reconfiguration vertical axis spans the high and low layers of a computing system, for example, gate units at the lowest layer, and micro-architecture, ISA, processor architecture and so on at the highest layers. A reconfigurable system includes several different design layers; taking the reconfigurable processor design for example:
System layer: the data transfer behavior between cores.
Architecture layer: the data transfer behavior from source cores to destination cores.
Physical layer: the data transfer behavior of transferring data bit by bit over the transfer medium.
Each layer has its own corresponding reconfigurable architecture, and the reconfigurable architectures of different layers must not contradict each other. FIGURE 3-5 shows that the routing table indicates the system-layer reconfiguration of the reconfigurable processor core. The FPGA bit-stream in the internet control core reconfigures the routing algorithm, which can be seen as an architecture-layer reconfiguration.

FIGURE 3-5 Vertical axis of reconfiguration in reconfigurable processor core.

3.5.2 Reconfiguration Horizontal Axis
The reconfiguration horizontal axis covers all kinds of reconfigurable blocks in the same layer. Every layer is composed of many different kinds of components, such as communication blocks, storage blocks, computing blocks, etc. Every layer has these three kinds of reconfigurable blocks along the horizontal axis. FIGURE 3-6 shows that when we design a control core, the three kinds of reconfigurable blocks are analyzed and made use of.

FIGURE 3-6 Three kinds of reconfigurable blocks in horizontal-axis reconfiguration, analyzed when designing a control core.

3.5.3 Reconfiguration Time Axis

FIGURE 3-7 Time of reconfiguration in control cores.

The reconfigurable methods can be separated into dynamic reconfiguration and static reconfiguration. A dynamic reconfiguration architecture can be reconfigured while the system is in the run-time stage. A static reconfiguration architecture must be reconfigured while the system is in the initial stage. Static reconfiguration is easier to achieve: the designers only need to reconfigure the architecture in the initial stage, but this kind of reconfiguration must change the whole system and is inflexible. Dynamic reconfiguration focuses on run-time reconfigurability: the system changes only a little to achieve the goal of reconfiguration. FIGURE 3-7 shows the reconfiguration times, using the communication control core for example; each C marks a time point at which reconfiguration can occur.


Chapter 4 Polymorphous TRIPS Architecture

The success of general-purpose microprocessors owes to their ability to run many diverse workloads well. Today, many application-specific processors, such as desktop, network, server, scientific, graphics and digital signal processors, have been constructed to match the particular parallelism characteristics of their application domains. Building processors that are general purpose not only for single-threaded programs but for many types of concurrency as well would provide substantive benefits in system flexibility as well as reduced design and mask costs.

This chapter describes the polymorphous TRIPS architecture [32] which can be configured for different granularities and types of parallelism. TRIPS contains mechanisms that enable the processing cores and the on-chip memory system to be configured and combined in different modes for instruction, data or thread-level parallelism.


4.1 Granularity of parallel processing elements on a chip
General-purpose microprocessors owe their success to their ability to run many diverse workloads well. Today, workloads are diversifying (media, streaming, network, desktop) and general-purpose microprocessors cannot support so many applications efficiently. Thus chip multiprocessors have emerged, in which the number and granularity of the processors are fixed at processor design time.

One strategy for combating processor fragility is to build a heterogeneous chip, which contains multiple processing cores, each designed to run a distinct class of workloads effectively. The Tarantula processor [34] is one example of integrated heterogeneity. The two major defects of this approach are that (1) little design reuse between the two types of processors increases hardware complexity, and (2) when the application mix contains a balance different from that ideally suited to the underlying heterogeneous hardware, poor resource utilization results.

An alternative to integrating multiple heterogeneous processors is to build one or more homogeneous processors on a die. This method mitigates the hardware complexity problem, and when an application maps well onto the homogeneous substrate, the utilization problem is solved, as the application is not limited to one of several heterogeneous processors.

FIGURE 4-1 Granularity of parallel processing elements on a chip

FIGURE 4-1 shows a range of points in the spectrum of PE (processing element) granularities that are possible for a 400 mm2 chip in 100nm technology. The finer-grained architectures on the left in FIGURE 4-1 can offer high performance on applications with fine-grained data parallelism, but they have difficulty achieving good performance on general-purpose and serial applications. The five points shown in the diagram represent a good cross-section of the overall space:
(a) Ultra-fine-grained FPGAs, such as Xilinx and Altera FPGA products.
(b) Hundreds of primitive processors connected to memory banks, such as a PIM (Processor-In-Memory) architecture, or reconfigurable ALU arrays, such as RaPiD, PipeRench, or PACT.
(c) Tens of simple in-order processors, such as in the RAW or Piranha architectures.
(d) Coarse-grained architectures consisting of 10-20 4-issue cores, such as the Power4, Cyclops, MultiScalar processors, other proposed speculatively-threaded CMPs, and the polymorphous Smart Memories architecture.
(e) Wide-issue processors with many ALUs each, such as Grid Processors [35].

The PIM topology (a) has high peak performance, but its performance suffers on control-bound codes with irregular memory accesses, such as compression or compilation. Toward the right of FIGURE 4-1, the coarse-grained architectures (d) traditionally lack the capability to use their internal hardware to deliver high performance on fine-grained, highly parallel applications. Polymorphism can bridge the finer-grained and coarser-grained architectures, which are otherwise competing approaches. The next section introduces the Grid Processor, a representative of the polymorphous TRIPS architectures.

4.2 Grid Processor Architectures
GPAs (Grid Processor Architectures) [35] are designed to scale with technology, allowing faster clock rates than conventional architectures while providing superior instruction-level parallelism on traditional workloads and high performance across a range of application classes.


FIGURE 4-2 Grid Processor organization

FIGURE 4-2 (a) shows a high-level diagram of the Grid Processor architecture. In this architecture, the ALUs are arranged in an m by n array, shown as a 4 by 4 grid in the diagram. In this implementation, instructions are delivered by instruction cache banks on the left side of the array. Instruction group inputs are fetched from the register file banks and injected from the top of the grid. The block sequencer and block termination control determine which instruction groups to map to the grid and when each group has been completed and can be committed. Operands are passed from producer to consumer instructions through a lightweight network, shown as a mesh augmented with diagonal channels. Memory accesses are routed to the primary cache banks located on the right side of the grid through a separate network.

FIGURE 4-2 (b) shows a grid node. Rather than a full-fledged processor with its own program counter, a node is a functional unit with the logic shown in the figure. Each node contains input ports for arriving operands, and instruction and operand buffers. The buffers hold instructions and input operands until all operands have arrived and the instruction can execute. Each node has a router that delivers values to the output ports and the grid network. The router can deliver both values produced by the node's ALU and those being routed through the node to a destination elsewhere in the grid.


The instruction and operand buffers each have multiple entries, enabling multiple instructions to be mapped to a single physical node. A frame consists of a single instruction slot in all of the grid nodes and can be pictured as a single virtual grid. Thus each additional slot provides another frame and a virtual grid of nodes. For scheduling and dynamic data forwarding purposes, the (x,y) coordinate of the grid node in the array along with the slot label is used as the name of the destination. In FIGURE 4-2 (b), for example, one frame consists of 16 instructions, using one instruction slot in each of the 16 grid nodes. An 8x8 grid with 8 frames would thus be capable of mapping a total of 512 instructions at a time. Groups larger than one frame are allowed to span multiple frames. Free frames can be used to map speculatively fetched groups.
Instruction fetch and map: Each group of instructions mapped to the grid consists of one predicated hyperblock. These hyperblocks have a single point of entry and may have multiple exits, but have no internal transfers of control. After a hyperblock is mapped, branch and target predictors in the block sequencer predict the succeeding hyperblock and begin fetching and mapping it onto the grid prior to the completion of the previous hyperblock.
Instruction execution: The register file at the top of the grid contains one three-ported bank per column. When a hyperblock is mapped onto the grid, the corresponding move instructions are fetched and delivered to queues at the appropriate register file banks. Each bank can issue two move instructions per cycle, injecting two operands into the grid. The move instructions contain the register number to be read and the locations of up to three target ALUs within the grid. When an operand arrives at a node, the control logic attempts to wake up, select and issue the instruction corresponding to the frame identifier of the arriving operand. If all of the operands are present, the instruction is issued to the ALU. Its result is sent to the output router with the frame identifier and the addresses of up to two target ALUs.
Operand routing: Because the physical locations of consumers are explicitly encoded within producer instructions, there is a trade-off associated with instruction fanout. If instructions encode a large number of target consumers, each instruction may be overly large. Too few targets results in extra instruction overhead to replicate values within the grid. If an instruction has more than three consuming instructions for a particular value, a data movement instruction called a split instruction can be inserted into the schedule to forward the result to multiple consumers.
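As an illustration of this destination encoding, a minimal C sketch follows; the field names and widths are assumptions for exposition, not the actual GPA instruction format.

    #include <stdint.h>

    /* Sketch of a grid instruction that names its consumers directly:
       each target is a relative (x, y) node offset plus a frame (slot)
       at that node, so results bypass a central register file. */
    typedef struct {
        int8_t  dx, dy;     /* relative position of the consuming node  */
        uint8_t frame;      /* which instruction slot at that node      */
        uint8_t operand;    /* which input port of the consumer to fill */
    } target_t;

    typedef struct {
        uint8_t  opcode;
        target_t target[3]; /* more than 3 consumers requires inserting
                               a split instruction into the schedule    */
    } grid_insn_t;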

4.3 Grid Processor Execution Model
The compiler for Grid processors partitions the program into a sequence of blocks [31]. FIGURE 4-3 shows a stream of instructions that has been partitioned by the compiler into three different blocks B1, B2, and B3. Explicit move instructions, separate from the computation instructions, are generated for the registers read by every block. The move instructions fetch block inputs from the register file and pass them as internal (temporary) values to the block.

// B1
0x0000  add  r1, r2, r3    // I1
0x0004  add  r2, r2, r1    // I2
0x0008  ld   r4, (r1)      // I3
0x000c  add  r5, r4, 1     // I4
0x0010  beqz r5, 0xdeac    // I5

// B2
0x0014  add  r10, r2, r3   // I6
0x0018  add  r11, r2, r3   // I7
0x001c  ld   r4, (r10)     // I8
0x0020  ld   r5, (r11)     // I9
0x0024  mul  r31, r4, r5   // I10
0x0028  bne  r31, 0xbee0   // I11

// B3
0x002c  xor  r8, r5, 1     // I12
0x0030  sll  r9, r4, r8    // I13
0x0034  add  r13, r9, 8    // I14
0x0038  add  r12, r9, r2   // I15
0x004c  sw   r13, r12      // I16

FIGURE 4-3 A sample instruction stream for Grid processor

FIGURE 4-4 shows the DFG (Data Flow Graph) of blocks B1, B2 and B3 in FIGURE 4-3 along with the move instructions. As shown in the figure, all instruction operands are renamed to temporaries, and move instructions are generated for every input register. For example, in block B1, two move instructions, move t2, r2 and move t3, r3, are generated by the compiler for input registers r2 and r3. Inside a block, all values use temporary names. The move instructions associate the register inputs of the block with temporaries. Data values that are passed to other blocks are written to the register file.

Blocks of instructions are fetched from the instruction memory and mapped onto the grid en masse at run time. There is no serialization of fetch, decode and rename for the instructions within blocks. Execution of a block is initiated by the move instructions, which read register data and send them to their consumers. Instruction wakeup units match incoming data with instructions and issue ready signals to functional units for execution. The computation results are tagged and forwarded by the router through the interconnection to their eventual destinations.

// B1
move t2, r2
move t3, r3
I1:  add  t1, t2, t3
I2:  add  t2, t2, t1
I3:  ld   t4, (t1)
I4:  add  t5, t4, 1
I5:  beqz t5, 0xdeac

// B2
move t2, r2
move t3, r3
I6:  add  t10, t2, t3
I7:  add  t11, t2, t3
I8:  ld   t4, (t10)
I9:  ld   t5, (t11)
I10: mul  t31, t4, t5
I11: bnez t31, 0xbee0

// B3
move t2, r2
move t4, r4
move t5, r5
I12: xor t8, t5, 1
I13: sll t9, t4, t8
I14: add t13, t9, 8
I15: add t12, t9, t2
I16: sw  t13, t12

FIGURE 4-4 Basic blocks shown as dataflow graphs, listed here in renamed form. Registers are marked with "r" and temporaries with "t".

Instruction Mapping: The dataflow graph of each block is mapped by the compiler at layout time. Every computation instruction is assigned to a node in the block's grid. All output operands are renamed with the positions of the consumer nodes. FIGURE 4-5 shows a dataflow graph laid out on the computation elements by instruction mapping on a grid of dimension 4x4.

FIGURE 4-5 Basic blocks mapped on a grid of dimension 4x4, with the 3 neighbors reachable directly and instruction destinations given as ordered pairs (x,y).
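A minimal C sketch of such a placement is given below, assuming a naive row-major scheduler; the real compiler instead optimizes placement for routing distance and load balance.

    #define GRID_DIM 4

    /* Naive illustration: instruction i of a block goes to node
       (i % GRID_DIM, (i / GRID_DIM) % GRID_DIM); blocks larger than
       one frame fold over into the next instruction slot (frame). */
    typedef struct { int x, y, frame; } place_t;

    place_t place_instruction(int i)
    {
        place_t p;
        p.x     = i % GRID_DIM;
        p.y     = (i / GRID_DIM) % GRID_DIM;
        p.frame = i / (GRID_DIM * GRID_DIM);
        return p;
    }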

Instruction Wakeup and Execution: After multiple instructions are mapped onto a single node and data are written to an operand buffer, the instruction and operand buffers are examined to wake up and issue ready instructions. The wakeup and execute operations are performed serially by the computation node whenever an operand arrives. Serializing wakeup and execute may increase the cycle time along the execute-execute path of dependent instructions. Conventional superscalar cores usually use bypass paths to forward data, which guarantee that an instruction depending on the data can execute in the next cycle. In a Grid processor, there is no dedicated path that is guaranteed to be free when an instruction completes execution to forward its data to its consumer. However, several mechanisms can alleviate this problem. Special wakeup tokens could be generated during the issue stage of a producer instruction; they reach the consumer nodes at the end of the stage and reserve a channel for the data to follow in the next cycle. Speculative instruction issue could be used to hide the select latency, with local rollback mechanisms in the event of incorrect issue.
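The per-node wakeup check can be sketched in C as follows; the buffer size and structure names are assumptions for illustration only.

    #include <stdbool.h>
    #include <stdint.h>

    #define SLOTS 8   /* instruction buffer entries (one per frame) at a node */

    typedef struct {
        bool    valid;       /* an instruction occupies this slot  */
        bool    has_op[2];   /* which source operands have arrived */
        int32_t op[2];
    } slot_t;

    static slot_t buf[SLOTS];

    /* Called when an operand arrives tagged with its frame identifier;
       returns true when the instruction in that slot is ready to issue. */
    bool deliver_operand(int frame, int which, int32_t value)
    {
        slot_t *s = &buf[frame];
        s->op[which]     = value;
        s->has_op[which] = true;
        return s->valid && s->has_op[0] && s->has_op[1];
    }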

Block Mapping: The instruction buffers can hold multiple instructions and data, which are associated through tags. There are three reasons to have multiple instructions mapped on a node. First, graphs larger than the physical grid can be folded over and mapped onto the grid with more than one instruction per node. Second, instructions from different blocks that are fetched speculatively can be mapped at a node. Finally, blocks from different threads can also be mapped to support multithreading.

4.4 Polymorphous TRIPS Architecture
The polymorphous TRIPS architecture [32] uses large coarse-grained processing cores to achieve high performance on single-threaded applications with high ILP. The TRIPS architecture is heavily partitioned to avoid large centralized structures and long wire runs, contrary to conventional large-core designs with centralized components that are difficult to scale. The partitioned computation and memory elements are connected by point-to-point communication channels, which are exposed to software schedulers for optimization.

The TRIPS system employs coarse-grained polymorphous features, at the level of memory banks and instruction storage, to minimize both software and hardware complexity and configuration overheads. The key challenge in defining the polymorphous features is balancing their granularity so that workloads involving different levels of ILP, TLP and DLP can maximize their use of the available resources, while at the same time avoiding escalating complexity and non-scalable structures.

4.4.1 Overview of TRIPS Architecture FIGURE 4-6 (a) shows a diagram of the TRIPS architecture that will be implemented in a prototype chip. The TRIPS prototype chip will consist of four polymorphous 16-wide cores, an array of 32KB memory tiles connected by a routed network, and a set of distributed memory controllers with channels to external memory. According to [32], the prototype chip will be built using a 100nm process and is targeted for completion in 2005.

FIGURE 4-6 Overview of TRIPS architecture

FIGURE 4-6 (b) shows an expanded view of a TRIPS core and the primary memory system. The TRIPS core is composed of an array of homogeneous execution nodes, each containing an integer ALU, a floating point unit, a set of reservation stations, and router connections at the input and output. Each reservation station has storage for an instruction and two source operands. When a reservation station contains a valid instruction and its operands, the node can select it for execution. After execution, the node can forward the result to any of the operand slots in local or remote reservation stations within the ALU array. In the ALU array, the nodes are directly connected to their nearest neighbors, but the routing network can deliver results to any node in the array.

The banked instruction cache on the left couples one bank per row, with an additional instruction cache bank for the instructions that fetch values from registers and inject them into the ALU array. The banked register file above the ALU array holds a portion of the architectural state. To the right of the execution nodes is a set of banked level-1 data caches, which can be accessed by any ALU through the local grid routing network. Below the ALU array is the block control logic that is responsible for sequencing block execution and selecting the next block. The backsides of the L1 caches are connected to secondary memory tiles through the chip-wide two-dimensional interconnection network. The switched network provides a robust and scalable connection to a large number of tiles, using less wiring than conventional dedicated channels between these components.

4.4.2 Polymorphous Resources
Frame Space: FIGURE 4-6 (c) shows that each execution node contains a set of reservation stations. Reservation stations with the same index across all of the nodes combine to form a physical frame; for example, combining the first slot of all nodes in the grid forms frame 0. The frame space, or collection of frames, is a polymorphous resource in TRIPS: it is managed differently by different modes to support efficient execution of alternate forms of parallelism.

Register File Bank: The hardware substrate provides many registers, although the programming model of each execution mode sees essentially the same number of architecturally visible registers. The extra copies can be used in different ways, such as for speculation or multithreading, depending on the mode of operation.

Block Sequencing Controls: The block sequencing controls determine three things: first, when a block has completed execution; second, when a block should be deallocated from the frame space; and finally, which block should be loaded next into the free frame space. To implement different modes of operation, a range of policies can govern these actions. The deallocation logic can be configured to allow a block to execute more than once, because in streaming applications the same inner loop is applied to multiple data elements. The next-block selector can be configured to limit speculation and to prioritize between multiple concurrently executing threads, which is useful for multithreaded parallel programs.

Memory Tiles: The TRIPS memory tiles can be configured as L2 cache banks, scratchpad memory, and synchronization buffers for producer/consumer communication. In addition, the memory tiles closest to each processor present a special high-bandwidth interface that further optimizes their use as stream register files.

4.5 Instruction, Thread and Data Level Parallelism
4.5.1 Desktop Morph: Instruction Level Parallelism
The TRIPS processor uses its polymorphous capabilities to run single-threaded codes efficiently by exploiting instruction-level parallelism. To achieve high ILP (Instruction-Level Parallelism), the D-morph configuration treats the instruction buffers in the processor core as a large, distributed instruction issue window, which uses the TRIPS ISA to enable out-of-order execution while avoiding the associative issue window lookups of conventional machines. To use the instruction buffers effectively as a large window, the D-morph must provide high-bandwidth instruction fetching, aggressive control and data speculation, and a high-bandwidth, low-latency memory system that preserves sequential memory semantics across a window of thousands of instructions.

Frame Space Management: As shown in FIGURE 4-6 (c) and FIGURE 4-2, the instruction buffers at each ALU node together form a distributed issue window. Orders-of-magnitude increases in window size are possible in the TRIPS architecture. This window is fundamentally a three-dimensional scheduling region, where the x and y dimensions correspond to the physical dimensions of the ALU array and the z dimension corresponds to multiple instruction slots at each ALU node.

As shown in FIGURE 4-7 (b), the three-dimensional region can be viewed as a series of frames, in which each frame consists of one instruction buffer entry per ALU node, resulting in a two-dimensional slice of the three-dimensional scheduling region.

FIGURE 4-7 (a) shows a four-instruction hyperblock (H0) mapped into A-frame 0 as shown in FIGURE 4-7 (b), where N0 and N2 are mapped to different buffer slots (frames) on the same physical ALU node. All communication within the block is determined by the compiler, which schedules operand routing directly from ALU to ALU. The consumers are encoded in the producer instructions as X, Y and Z relative offsets. Using the lightweight routed network in the ALU array, instructions can directly pass a produced value to any element within the same A-frame. The maximum number of frames that can be occupied by one program block (the maximum A-frame size) is architecturally limited by the number of instruction bits used to specify destinations, and physically limited by the total number of frames available in a given implementation.

FIGURE 4-7 Desktop Frame management

Multiblock Speculation: The A-frames are treated as a circular buffer in which the oldest A-frame is non-speculative and all other A-frames are speculative. When the A-frame holding the oldest hyperblock completes, the block is committed and removed. The next oldest hyperblock becomes non-speculative, and the released frames can be filled with a new speculative hyperblock. Each producer instruction prepends its A-frame ID to the Z coordinate. Values are transmitted through the register file to pass between hyperblocks. As shown in FIGURE 4-7 (b), the communication of R1 from H0 to H1 transmits a value between hyperblocks through the register file.
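The circular-buffer discipline can be sketched in C as follows; the frame count and function names are assumptions, and real hardware would overlap these steps rather than serialize them.

    #define NUM_AFRAMES 8

    static int oldest = 0;  /* index of the non-speculative A-frame   */
    static int count  = 0;  /* A-frames currently holding hyperblocks */

    /* Map the next (speculative) hyperblock if a frame is free;
       every allocated frame except the oldest one is speculative. */
    int allocate_aframe(void)
    {
        if (count == NUM_AFRAMES)
            return -1;                        /* window is full */
        int id = (oldest + count) % NUM_AFRAMES;
        count++;
        return id;
    }

    /* When the oldest hyperblock completes, commit and remove it;
       the next oldest hyperblock becomes non-speculative. */
    void commit_oldest(void)
    {
        oldest = (oldest + 1) % NUM_AFRAMES;
        count--;
    }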

High Bandwidth Instruction Fetching: In order to fill the large distributed window, high-bandwidth instruction fetch is needed. One method is to use a program counter that points to hyperblock headers. The control logic accesses a partitioned instruction cache by broadcasting the index when there is sufficient frame space to map a hyperblock. Each bank then fetches a row's worth of instructions with a single access and streams it to the bank's respective row. The hyperblocks are encoded as VLIW-like blocks, along with a prepended header that contains the number of frames consumed by the block.

4.5.2 Thread Morph: Thread Level Parallelism
The Thread morph is intended to provide higher processor utilization by mapping multiple threads of control onto a single TRIPS core. Similar to simultaneous multithreading, the execution resources (ALUs) and memory banks are shared; the Thread morph statically partitions the reservation stations (issue window) and eliminates some replicated SMT structures, such as the reorder buffer.

Thread-Morph Implementation: There are multiple strategies for partitioning a TRIPS core to support multiple threads, two of which are row processors and frame processors. Row processors space-share the ALU array, allocating one or more rows per thread. The advantage to this approach is that each thread has I-cache and D-cache bandwidth and capacity proportional to the number of rows assigned to it. The disadvantage is that the distance to the register file is non-uniform, penalizing the threads mapped to the bottom rows.

Frame Space Management: Instead of holding non-speculative and speculative hyperblocks for a single thread as in the Desktop morph, the physical frames are partitioned a priori and assigned to threads. For example, a TRIPS core can dedicate all 128 frames to a single thread in the Desktop morph, or 64 frames to each of two threads in the Thread morph (uneven frame sharing is also possible). Within each thread, the frames are further divided into some number of A-frames, and speculative execution is allowed within each thread. No additional register file space is required, since the same storage used to hold state for speculative blocks can instead store state from multiple non-speculative and speculative blocks. The only additional frame support needed is thread-ID bits in the register stitching logic and augmentations to the A-frame allocation logic.
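A hedged C sketch of this a-priori partition follows; the frame total matches the 128-frame example above, but the structure is illustrative rather than the TRIPS control interface.

    #define TOTAL_FRAMES 128

    typedef struct { int first_frame, num_frames; } thread_part_t;

    /* Statically split the frame space evenly among n threads; uneven
       splits are also possible.  With n = 1 this degenerates to the
       Desktop morph, with all frames held by a single thread. */
    void partition_frames(thread_part_t part[], int n)
    {
        int per = TOTAL_FRAMES / n;
        for (int i = 0; i < n; i++) {
            part[i].first_frame = i * per;
            part[i].num_frames  = per;  /* further divided into A-frames */
        }
    }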

Instruction Control: The Thread morph maintains n program counters (where n is the number of concurrent threads allowed) and n global history shift registers in the exit predictor to reduce thread-induced mispredictions. Using a prediction made by the shared exit predictor, the Thread morph fetches the next block for a given thread and maps it onto the array. In addition to the extra prediction registers, n copies of the commit buffers and block control state must be provided for n hardware threads.

Memory: The memory system operates much the same as in the Desktop morph, except that per-thread IDs on cache tags and LSQ CAMs are necessary to prevent illegal cross-thread interference when shared address spaces are implemented.

4.5.3 Super Morph: Data Level Parallelism
The applications of the Super morph are typically characterized by DLP (data-level parallelism), including predictable loop-based control flow with large iteration counts, large data sets, regular access patterns and poor locality, but also tolerance to memory latency and high computation intensity, with tens to hundreds of arithmetic operations performed per element loaded from memory. The Super morph was heavily influenced by the Imagine architecture and uses the Imagine execution model, in which a set of stream kernels is sequenced by a control thread. FIGURE 4-8 highlights the features of the Super morph.

Frame Space Management: The Super morph fuses multiple A-frames to make a super A-frame, since the control flow of the programs is highly predictable. Inner loops of a streaming application are unrolled to fill the reservation stations within the super A-frame. Code required to set up the execution of the inner loops and to connect multiple loops can run in one of three ways: (1) embedded into the program that uses the frames for Super morph execution, (2) executed on a different core within the TRIPS chip, similar in function to the Imagine host processor, or (3) run within its own set of frames on the same core running the DLP kernels. In this third mode, a subset of the frames is dedicated to a data-parallel thread, while a different subset is dedicated to a sequential control thread.

Instruction Fetch: In order to reduce the power and instruction fetch bandwidth overhead of repeatedly fetching the same code block across inner-loop iterations, the Super morph employs mapping reuse: a block is kept in the reservation stations and used multiple times. The Super morph implements mapping reuse with an instruction which indicates that the next block of instructions constitutes a loop and is to execute a finite number of times N, where N can be determined at runtime and is used to set an iteration counter. When all of the instructions from an iteration complete, the hardware decrements the iteration counter and triggers a revitalization signal which resets the reservation stations while keeping constant values resident in them, so that they may fire again when new operands arrive for the next iteration. When the iteration counter reaches zero, the super A-frame is cleared and the hardware maps the next block onto the ALUs for execution.
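The control flow of mapping reuse can be summarized in a short C sketch; the hook functions below are hypothetical stand-ins for hardware actions, not a real API.

    static void map_block(void)          { /* fetch and map the loop body once   */ }
    static void execute_iteration(void)  { /* body fires as operands stream in   */ }
    static void revitalize(void)         { /* reset stations, keep constants     */ }
    static void clear_super_aframe(void) { /* free the frames for the next block */ }

    /* n is the iteration count N, possibly computed at run time; the
       hardware decrements it and revitalizes instead of refetching. */
    void run_super_aframe(int n)
    {
        map_block();
        for (int iter = n; iter > 0; iter--) {
            execute_iteration();
            if (iter > 1)
                revitalize();
        }
        clear_super_aframe();
    }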

FIGURE 4-8 Polymorphism for Super morph

Memory System: Using a subset of on-chip memory tiles, the TRIPS Super morph implements the Imagine stream register file (SRF). The Super morph memory tile configuration includes turning off tag checks to allow direct data array access, and augmenting the cache line replacement state machine to include DMA-like capabilities. As shown in FIGURE 4-8 (b), memory tiles adjacent to the processor core are used for the SRF and augmented with dedicated wide channels (256 bits per row, assuming four 64-bit channels for the 4x4 array) into the ALU array for increased SRF bandwidth. By transferring an entire SRF line into the grid, the Super morph DLP loops can execute an SRF_read that acts as a load-multiple-words instruction. Rather than requiring a data switch between the SRF banks and the ALU array, the data can be easily moved to any ALU using the high-bandwidth in-grid routing network. Streams are striped across the multiple banks of the SRF and stored to the SRF banks over narrow channels to the memory tiles. Memory tiles that are not adjacent to the processing core can be configured as a conventional level-2 cache, still accessible to the unchanged level-1 cache hierarchy. The conventional cache hierarchy can be used to store irregularly accessed data structures, such as texture maps.


Chapter 5 Simulation

To find the best designs, architects must rapidly simulate many design alternatives and have confidence in the results. Since semiconductor integration capacity reached the point where entire systems could theoretically be integrated into a single die, hardware companies have heralded a new age of platform-based design for a number of years. Platforms are now defined to include a wide assortment of elements from System-Level Design (SLD): the RTL hardware definition, bus architecture, power management strategy, device drivers, OS ports, and application software.

To be successful, however, a platform will need more than this; an essential element for enabling differentiation will prove to be an advanced systems modeling and verification environment. Developers require a variety of views of the entire platform, from RTL and system models to software development models and real hardware development boards.

Each view of the platform reflects the same system architecture, and designers can use test software in any of the higher-level views, providing a high degree of confidence in the design prior to tape-out. This provides a valuable environment in which to investigate system bandwidth and performance requirements. System views must be extensible, allowing designers to exploit the advantages of a well-supported, pre-verified base platform of hardware and software designs, whilst differentiating their own application with their own design.
The remainder of this chapter introduces the basic principle of a CPU and then analyzes several design alternatives, including 1-function-unit, 2-function-unit, 4-function-unit, 9-function-unit and 16-function-unit general CPUs, simulated in the C language. In order to evaluate the design effectiveness of the different architectures, a general application example, an FIR filter, is used. To consider both hardware efficiency and cost, the operation performance and computing elements of the different architectures are analyzed and compared with each other. At the end of this chapter, the best of the CPU architectures is selected as the final choice for the reconfigurable CPU architecture that will be implemented in real hardware in the next chapter.

5.1 Non-reconfigurable Processor Design Simulation
This section introduces and analyzes the performance and hardware cost of several kinds of CPU architectures, including 1-function-unit, 2-function-unit, 4-function-unit, 9-function-unit and 16-function-unit general CPUs, simulated in the C language. In addition, the general application example FIR filter is used to evaluate the effectiveness of each kind of architecture. In the final section of this chapter, the results of the architectures without reconfigurable design will be compared with those of the next section's architectures with reconfigurable design.

5.1.1 One Function Unit General CPU Architecture without Reconfigurable Design
As shown in FIGURE 5-1, the architecture of the one function unit CPU without reconfigurable design is like an early general-purpose CPU without a superscalar design. It includes a five-stage pipeline, a branch predictor, data forwarding, and branch instructions handled in the second stage to improve the performance. Ideally, except for the first and final instructions, every instruction costs one clock. But when an R-type instruction closely follows a load instruction and a data dependence occurs, a one-clock stall is needed. Another case also incurs delay: when the branch prediction is wrong, the system must clear the first-stage and second-stage registers and substitute the branch address for the current PC value. This causes a two-clock delay that cannot be avoided.

FIGURE 5-1 One function unit general CPU architecture without reconfigurable design
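Those stall rules can be summarized in a small C estimator, a sketch of the kind of cycle accounting such a simulation might perform; the function and its inputs are illustrative assumptions.

    /* Ideal CPI is 1; add one clock per load-use stall and two clocks
       per branch misprediction, per the rules described above. */
    long estimate_clocks(long instructions,
                         long load_use_stalls,
                         long branch_mispredictions)
    {
        return instructions + load_use_stalls + 2 * branch_mispredictions;
    }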

FIGURE 5-2 shows an FIR filter data flow; this section and the next use the FIR example for mapping the data flow to the hardware and analysing the performance. As shown in FIGURE 5-2, c1, c2, c3 and c4 are the coefficients of the FIR filter, and d1, d2, d3 and d4 are its input data. Although using just the FIR filter as the example to analyse performance is not objective enough, these general CPUs are designed for running certain special cases and applications. Unlike a general-purpose processor, the reconfigurable CPU needs to handle large amounts of data and repeatedly run equivalent arithmetic operations. So, the FIR filter is a suitable example for mapping a data flow to the general CPUs and analysing the efficiencies to compare the performance.

FIGURE 5-2 FIR filter data flow
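For reference, the benchmark corresponds to the following straightforward C computation (a sketch; in the simulations the operations are hand-mapped to the function units rather than compiled):

    #define TAPS 4

    /* 4-tap FIR: y[i] = c1*d[i] + c2*d[i-1] + c3*d[i-2] + c4*d[i-3].
       Each output needs four loads, four multiplies and three adds,
       matching the data flow of FIGURE 5-2. */
    void fir(const int c[TAPS], const int *d, int *y, int n)
    {
        for (int i = TAPS - 1; i < n; i++) {
            int acc = 0;
            for (int k = 0; k < TAPS; k++)
                acc += c[k] * d[i - k];
            y[i] = acc;
        }
    }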

FIGURE 5-3 Mapping FIR filter data flow to one function unit general CPU

FIGURE 5-3 shows the mapping of the FIR filter data flow to the one function unit general CPU. We can easily see that producing each result needs twelve clocks. TABLE 5-1 shows the FIR filter simulation results on the one function unit general CPU architecture. Because of the five-stage pipelined design structure and the loading of the FIR filter coefficients, the first result needs an additional eight clocks of delay compared with the others. The average time to produce each result is twelve clocks, as the mapped data flow shows. TABLE 5-2 shows the hardware cost of the one function unit general CPU architecture; its performance and hardware cost will be compared with the other architectures in section 5-4.

Result       First  Second  Third  Fourth  Fifth  Sixth  Seventh  Eighth  Ninth  Tenth  Average
Time(clock)  20     32      44     56      68     80     92       104     116    128    12

TABLE 5-1 Production time of each result of the FIR filter in the one function unit architecture.

Hardware  Register  Multiplier  Adder  Subtract  Comparator  Shifter  Or  And  Cache
Amount    61        1           18     5         45          11       8   13   2

TABLE 5-2 Hardware cost in the one function unit architecture.

5.1.2 Two Function Units General CPU Architecture without Reconfigurable Design

FIGURE 5-4 Two function units general CPU architecture without reconfigurable design

As shown in FIGURE 5-4, the architecture of the two function units general CPU without reconfigurable design is like a superscalar design with two ALUs. It includes a five-stage pipeline, a branch predictor, data forwarding, and branch instructions handled in the second stage to improve the performance. Ideally, except for the first and final instructions, every instruction costs one clock. But when an R-type instruction closely follows a load instruction and a data dependence occurs, a one-clock stall is needed. Another case also incurs delay: when the branch prediction is wrong, the system must clear the first-stage and second-stage registers and substitute the branch address for the current PC value. This causes a two-clock delay that cannot be avoided.

FIGURE 5-5 shows the mapping of the FIR filter data flow to the two function units general CPU. The results of FU1 and FU2 can be forwarded to the inputs of FU1 and FU2, and the data cache outputs can also be forwarded to the inputs of FU1 and FU2. As shown in FIGURE 5-5, the rectangle drawn in dotted line marks the basic operations of the FIR filter example. We can easily see that producing each result needs six clocks.

FIGURE 5-5 Mapping FIR filter data flow to two function units general CPU

TABLE 5-3 shows the FIR filter simulation results on the two function units architecture. Because of the five-stage pipelined design structure and the loading of the FIR filter coefficients, the first result needs an additional eight clocks of delay compared with the others. The average time to produce each result is six clocks, as the mapped data flow shows. TABLE 5-4 shows the hardware cost of the two function units general CPU architecture; its performance and hardware cost will be compared with the other architectures in section 5-4.

Result       First  Second  Third  Fourth  Fifth  Sixth  Seventh  Eighth  Ninth  Tenth  Average
Time(clock)  14     19      25     31      37     43     49       55      61     67     6

TABLE 5-3 Production time of each result of the FIR filter in the two function units architecture.

Hardware  Register  Multiplier  Adder  Subtract  Comparator  Shifter  Or  And  Cache
Amount    95        2           20     6         52          23       22  26   2

TABLE 5-4 Hardware cost in the two function units architecture.

5.1.3 Four Function Units General CPU Architecture without Reconfigurable Design

FIGURE 5-6 Four function units general CPU architecture without reconfigurable design

As shown in FIGURE 5-6, the architecture of the four function units general CPU without reconfigurable design is like a superscalar design with four ALUs. It includes a five-stage pipeline, a branch predictor, data forwarding, and branch instructions handled in the second stage to improve the performance. The result of each FU can be transferred to every FU by forwarding, and the memory outputs can be forwarded to any FU. Ideally, except for the first and final instructions, every instruction costs one clock. But when an R-type instruction closely follows a load instruction and a data dependence occurs, a one-clock stall is needed. Another case also incurs delay: when the branch prediction is wrong, the system must clear the first-stage and second-stage registers and substitute the branch address for the current PC value. This causes a two-clock delay that cannot be avoided.

FIGURE 5-7 shows the mapping of the FIR filter data flow to the four function units general CPU. The rectangle drawn in dotted line marks the basic operations of the FIR filter example; we can easily see that producing each result needs four clocks. The x and y coordinate axes are the positions of the FUs in the four function units general CPU, and the z coordinate axis is the frame, showing the operation steps over time.

FIGURE 5-7 Mapping FIR filter data flow to four function units general CPU

TABLE 5-5 shows the FIR filter simulation results on the four function units architecture. Because of the five-stage pipelined design structure and the loading of the FIR filter coefficients, the first result needs an additional eight clocks of delay compared with the others. The average time to produce each result is four clocks, as the mapped data flow shows. TABLE 5-6 shows the hardware cost of the four function units general CPU architecture; its performance and hardware cost will be compared with the other architectures in section 5-4.

Result       First  Second  Third  Fourth  Fifth  Sixth  Seventh  Eighth  Ninth  Tenth  Average
Time(clock)  12     16      20     24      28     32     36       40      44     48     4

TABLE 5-5 Production time of each result of the FIR filter in the four function units architecture.

Hardware  Register  Multiplier  Adder  Subtract  Comparator  Shifter  Or  And  Cache
Amount    146       4           33     9         86          44       30  59   2

TABLE 5-6 Hardware cost in the four function units architecture.

5.1.4 Nine Function Units General CPU Architecture without Reconfigurable Design

FIGURE 5-8 Nine function units general CPU architecture without reconfigurable design

FIGURE 5-9 Mapping FIR filter data flow to nine function units general CPU

As shown in FIGURE 5-8, the architecture of the nine function units general CPU without reconfigurable design is like a superscalar design with nine ALUs. It includes a five-stage pipeline, a branch predictor, data forwarding, and branch instructions handled in the second stage to improve the performance. The result of each FU can be transferred to every FU by forwarding, and the memory outputs can be forwarded to any FU. The difference from the one, two and four function units architectures is that there are three decoders in the nine function units architecture. Ideally, except for the first and final instructions, every instruction costs one clock. But when an R-type instruction closely follows a load instruction and a data dependence occurs, a one-clock stall is needed. Another case also incurs delay: when the branch prediction is wrong, the system must clear the first-stage and second-stage registers and substitute the branch address for the current PC value. This causes a two-clock delay that cannot be avoided.

FIGURE 5-9 shows the mapping FIR filter data flow to nine function units general CPU.

Most of results produced by FUs are forwarded to other FUs, that is why the

performance can be improved and the data flow can be mapped to FUs so closely. To see the rectangle drawn in dotted line is the basic operations in FIR filter example, we can easily find that each result produced needs two clocks equally. The x and y coordinate axes are positions of FUs in four function units general CPU and the z coordinate axis is the frame that will show the operation steps follows the time.


TABLE 5-7 shows the FIR filter simulation results for the nine-function-unit architecture. Because of the five-stage pipeline and the FIR filter coefficients, the first result requires eight more clocks than the following ones. As the mapped data flow shows, the average time to produce each result is two clocks. TABLE 5-8 shows the hardware cost of the nine-function-unit general CPU architecture; its performance and hardware cost will be compared with the other architectures in Section 5.4.

Result       First  Second  Third  Fourth  Fifth  Sixth  Seventh  Eighth  Ninth  Tenth  Average
Time(clock)  10     12      14     16      18     20     22       24      26     28     2

TABLE 5-7 Production time of each FIR filter result in the nine function units architecture.

Hardware  Register  Multiplier  Adder  Subtract  Comparator  Shifter  Or  And  Cache
Amount    245       4           42     14        145         101      62  288  2

TABLE 5-8 Hardware cost in nine function units architecture

5.1.5 Sixteen Function Units General CPU Architecture without Reconfigurable Design

As shown in FIGURE 5-10, the sixteen-function-unit general CPU without reconfigurable design resembles a superscalar design with sixteen ALUs. The design includes the five-stage pipeline, the branch predictor, data forwarding, and the handling of branch instructions in the second stage to improve performance. The result of each FU can be forwarded to every FU, and the memory outputs can be forwarded to any FU. The difference from the one-, two-, four- and nine-function-unit architectures is that the sixteen-function-unit architecture has four decoders. Ideally, except for the first and final instructions, every instruction costs one clock. But when an R-type instruction immediately follows a load instruction and depends on its data, a one-clock stall is needed. A delay also occurs when the branch prediction is wrong: the system must flush the first- and second-stage registers and replace the current PC value with the branch address, causing an unavoidable two-clock delay.


FIGURE 5-10 Sixteen function units general CPU architecture without reconfigurable design


FIGURE 5-11 Mapping FIR filter data flow to sixteen function units general CPU

FIGURE 5-11 shows the mapping of the FIR filter data flow to the sixteen-function-unit general CPU. Most results produced by the FUs are forwarded to other FUs, which is why the performance can be improved and the data flow can be mapped onto the FUs so tightly. The rectangle drawn in dotted lines encloses the basic operations of the FIR filter example; from it we can easily see that each result takes two clocks to produce. The x and y coordinate axes give the positions of the FUs in the sixteen-function-unit general CPU, and the z coordinate axis is the frame, which shows the operation steps over time.

TABLE 5-9 shows the FIR filter simulation results for the sixteen-function-unit general CPU architecture. Because of the five-stage pipeline and the FIR filter coefficients, the first result requires eight more clocks than the following ones. As the mapped data flow shows, the average time to produce each result is two clocks, the same as in the nine-function-unit architecture. The performance is the same because using a load result always incurs a one-clock delay, so the FIR filter data flow cannot be mapped entirely onto the sixteen function units. The next section introduces a method that removes this delay and improves the performance. TABLE 5-10 shows the hardware cost of the sixteen-function-unit general CPU architecture; its performance and hardware cost will be compared with the other architectures in Section 5.4.

Result       First  Second  Third  Fourth  Fifth  Sixth  Seventh  Eighth  Ninth  Tenth  Average
Time(clock)  10     12      14     16      18     20     22       24      26     28     2

TABLE 5-9 Production time of each FIR filter result in the sixteen function units architecture.

Hardware  Register  Multiplier  Adder  Subtract  Comparator  Shifter  Or  And  Cache
Amount    459       4           73     21        233         178      97  789  2

TABLE 5-10 Hardware cost in sixteen function units architecture

5.2 Simulation with Reconfigurable Design

This section introduces and analyzes several CPU architectures with the reconfigurable design. What is special about the reconfigurable CPU is that several additional circuits and memories are designed into it. Each function unit contains two blocks of memory, M1 and M2, each with four frames, for reconfigurable computing. A DMA (Direct Memory Access) unit is designed to store reconfigurable operations directly and rapidly into a memory block (M1 or M2) in a function unit. The reconfigurable controller uses three states to control all the reconfigurable actions.
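The interplay of the DMA, the two memory blocks, and the controller amounts to double buffering: while the function unit executes out of one block, the DMA fills the other. The following C fragment is a minimal sketch under assumed names; the thesis does not name the controller's three states, so CTRL_IDLE, CTRL_LOADING and CTRL_RUNNING are placeholders.

    /* Minimal sketch of the M1/M2 double-buffered configuration memory. */
    #include <string.h>

    #define FRAMES_PER_BLOCK 4

    enum ctrl_state { CTRL_IDLE, CTRL_LOADING, CTRL_RUNNING }; /* assumed names */

    struct fu_config_mem {
        unsigned m1[FRAMES_PER_BLOCK];  /* configuration block M1          */
        unsigned m2[FRAMES_PER_BLOCK];  /* configuration block M2          */
        int running;                    /* 0: FU runs from M1, 1: from M2  */
    };

    /* The DMA writes the next application into whichever block is idle. */
    void dma_download(struct fu_config_mem *mem, const unsigned *app)
    {
        unsigned *idle = mem->running ? mem->m1 : mem->m2;
        memcpy(idle, app, FRAMES_PER_BLOCK * sizeof(unsigned));
    }

    /* When the download completes, the controller swaps blocks at once,
     * so execution and downloading overlap continuously. */
    void swap_blocks(struct fu_config_mem *mem)
    {
        mem->running = !mem->running;
    }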

In a general CPU design, the data must be loaded from the data cache into the register file. This guarantees that even if the data misses in the cache, the loaded data is still correct, because the load instruction waits until the correct data is fetched from data memory. But when an R-format instruction immediately follows a load instruction and depends on its data, the processor must stall for at least one clock. If the process is a loop, or a special application that is executed many times, a great deal of performance is sacrificed.

The reconfigurable CPU is located between the general-purpose CPU and the ASIC; the reconfigurable design is needed because the reconfigurable CPU must handle many loops and special applications. When the processor with reconfigurable design handles loops or special applications, the data must be laid out serially in data memory so that it is easily fetched into the data cache; cache misses are therefore not a worry, and if one does occur it only costs a few clocks of delay.

This section simulates in C the performance and hardware cost of several CPU architectures with reconfigurable design, namely the 1-, 2-, 4-, 9- and 16-function-unit reconfigurable CPUs. In addition to using the FIR filter as a general application example to evaluate the effectiveness of each architecture, the results for each architecture with reconfigurable design are compared with the previous section's results without reconfigurable design in the final section of this chapter.
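For reference, the FIR filter benchmark can be written as a short C loop. The mapping figures suggest four taps per result (four loads and four multiply-accumulate steps, then one store), so a 4-tap version is sketched below; the tap count and data types are assumptions, not the thesis's exact test program.

    /* Hedged 4-tap FIR sketch: per output, four loads and four
     * multiply-accumulate steps, then one store, matching the
     * operation mix visible in the mapping figures. */
    #define TAPS 4

    void fir(const int *x, int *y, int n, const int coef[TAPS])
    {
        for (int i = TAPS - 1; i < n; i++) {
            int acc = 0;
            for (int k = 0; k < TAPS; k++)  /* one load + one multiply per tap */
                acc += coef[k] * x[i - k];
            y[i] = acc;                     /* one store per result */
        }
    }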

5.2.1 One Function Unit CPU Architecture with Reconfigurable Design

As shown in FIGURE 5-12, the one-function-unit CPU with reconfigurable design resembles an early general-purpose CPU without superscalar features. The only difference is the addition of some circuits, wires and memories. The reconfigurable controller receives the reconfigurable running number from the DMA and sends control signals to the function unit. The instruction cache receives the control signals from the reconfigurable controller and sends data directly to one memory block of the function unit through the DMA. While the function unit is executing from one memory block, the other block can download another application at the same time.


FIGURE 5-12 One function unit CPU architecture with reconfigurable design

The design includes the five-stage pipeline, the branch predictor, data forwarding, and the handling of branch instructions in the second stage to improve performance. Ideally, except for the first and final instructions, every instruction costs one clock. But when the reconfigurable design is not used and an R-type instruction immediately follows a load instruction on which it depends, a one-clock stall is needed. A delay also occurs when the branch prediction is wrong: the system must flush the first- and second-stage registers and replace the current PC value with the branch address, causing an unavoidable two-clock delay.

FIGURE 5-13 shows the mapping of the FIR filter data flow to the one-function-unit CPU with reconfigurable design. TABLE 5-11 shows the FIR filter simulation results for this architecture. Because of the five-stage pipeline and the FIR filter coefficients, the first result requires nine more clocks than the following ones. As the mapped data flow shows, the average time to produce each result is eight clocks. This structure saves four clocks per FIR result compared with the one-function-unit CPU without reconfigurable design, but it costs more hardware. TABLE 5-12 shows the hardware cost of the one-function-unit CPU architecture with reconfigurable design; its performance and hardware cost will be compared with the other architectures in Section 5.4.


FIGURE 5-13 Mapping FIR filter data flow to one function unit CPU with reconfigurable design.

Result       First  Second  Third  Fourth  Fifth  Sixth  Seventh  Eighth  Ninth  Tenth  Average
Time(clock)  17     25      33     41      49     57     65       73      81     89     8

TABLE 5-11 Production time of each FIR filter result in the one function unit architecture.

Hardware  Register  Multiplier  Adder  Subtract  Comparator  Shifter  Or  And  Cache
Amount    64        1           19     5         47          11       8   14   3

TABLE 5-12 Hardware cost in one function unit architecture.

5.2.2 Two Function Units CPU Architecture with Reconfigurable Design

As shown in FIGURE 5-14, the two-function-unit CPU with reconfigurable design resembles a superscalar design with two ALUs. The only difference is the addition of some circuits, wires and memories. The reconfigurable controller receives the reconfigurable running number from the DMA and sends control signals to the function units. The instruction cache receives the control signals from the reconfigurable controller and sends data directly to one memory block of a function unit through the DMA. While a function unit is executing from one memory block, the other block can download another application at the same time.

The design includes the five-stage pipeline, the branch predictor, data forwarding, and the handling of branch instructions in the second stage to improve performance. Ideally, except for the first and final instructions, every instruction costs one clock. But when the reconfigurable design is not used and an R-type instruction immediately follows a load instruction on which it depends, a one-clock stall is needed. A delay also occurs when the branch prediction is wrong: the system must flush the first- and second-stage registers and replace the current PC value with the branch address, causing an unavoidable two-clock delay.


FIGURE 5-14 Two function units CPU architecture with reconfigurable design

FIGURE 5-15 shows the mapping of the FIR filter data flow to the two-function-unit CPU with reconfigurable design. When reconfiguration is used, the function units get their operation data directly from the internal memory blocks (M1 or M2), and the instructions only need to load the data that the reconfigurable computation requires. The results of FU1 and FU2 can be forwarded to the inputs of FU1 and FU2, and the data cache outputs can also be forwarded to the inputs of FU1 and FU2. As shown in FIGURE 5-15, the rectangle drawn in dotted lines encloses the basic operations of the FIR filter example; from it we can easily see that each result takes just four clocks to produce.


FIGURE 5-15 Mapping FIR filter data flow to two function units CPU with reconfigurable design

TABLE 5-13 shows the FIR filter simulation results for the two-function-unit reconfigurable CPU architecture. Because of the five-stage pipeline and the FIR filter coefficients, the first result requires seven more clocks than the following ones. As the mapped data flow shows, the average time to produce each result is four clocks. This structure saves two clocks per FIR result compared with the two-function-unit CPU without reconfigurable design, but it costs more hardware. TABLE 5-14 shows the hardware cost of the two-function-unit CPU architecture with reconfigurable design; its performance and hardware cost will be compared with the other architectures in Section 5.4.

Result       First  Second  Third  Fourth  Fifth  Sixth  Seventh  Eighth  Ninth  Tenth  Average
Time(clock)  11     15      19     23      27     31     35       39      43     47     4

TABLE 5-13 Production time of each FIR filter result in the two function units architecture.

Hardware  Register  Multiplier  Adder  Subtract  Comparator  Shifter  Or  And  Cache
Amount    117       2           22     6         56          2        22  28   3

TABLE 5-14 Hardware cost in two function units architecture.

5.2.3 Four Function Units CPU Architecture with Reconfigurable Design


FIGURE 5-16 Four function units CPU architecture with reconfigurable design

As shown in FIGURE 5-16, the four-function-unit CPU with reconfigurable design resembles a superscalar design with four ALUs. The only difference is the addition of some circuits, wires and memories. The reconfigurable controller receives the reconfigurable running number from the DMA and sends control signals to the function units. The instruction cache receives the control signals from the reconfigurable controller and sends data directly to one memory block of a function unit through the DMA. While a function unit is executing from one memory block, the other block can download another application at the same time.

The design includes the five-stage pipeline, the branch predictor, data forwarding, and the handling of branch instructions in the second stage to improve performance. Ideally, except for the first and final instructions, every instruction costs one clock. But when the reconfigurable design is not used and an R-type instruction immediately follows a load instruction on which it depends, a one-clock stall is needed. A delay also occurs when the branch prediction is wrong: the system must flush the first- and second-stage registers and replace the current PC value with the branch address, causing an unavoidable two-clock delay.


FIGURE 5-17 Mapping FIR filter data flow to four function units CPU with reconfigurable design

FIGURE 5-17 shows the mapping of the FIR filter data flow to the four-function-unit CPU with reconfigurable design. When reconfiguration is used, the function units get their operation data directly from the internal memory blocks (M1 or M2), and the instructions only need to load the data that the reconfigurable computation requires. The results of FU1, FU2, FU3 and FU4 can be forwarded to the inputs of each FU, and the data cache outputs can also be forwarded to each FU. As shown in FIGURE 5-17, the rectangle drawn in dotted lines encloses the basic operations of the FIR filter example; from it we can easily see that each result takes just two clocks to produce.

TABLE 5-15 shows the FIR filter simulation results for the four-function-unit CPU architecture with reconfigurable design. Because of the five-stage pipeline and the FIR filter coefficients, the first result requires eight more clocks than the following ones. As the mapped data flow shows, the average time to produce each result is two clocks. This structure saves two clocks per FIR result compared with the four-function-unit CPU without reconfigurable design, but it costs more hardware. TABLE 5-16 shows the hardware cost of the four-function-unit CPU architecture with reconfigurable design; its performance and hardware cost will be compared with the other architectures in Section 5.4.

Result       First  Second  Third  Fourth  Fifth  Sixth  Seventh  Eighth  Ninth  Tenth  Average
Time(clock)  10     12      14     16      18     20     22       24      26     28     2

TABLE 5-15 Production time of each FIR filter result in the four function units architecture.

Hardware  Register  Multiplier  Adder  Subtract  Comparator  Shifter  Or  And  Cache
Amount    192       4           37     9         94          44       30  63   3

TABLE 5-16 Hardware cost in four function units architecture.

5.2.4 Nine Function Units CPU Architecture with Reconfigurable Design

As shown in FIGURE 5-18, the nine-function-unit CPU with reconfigurable design resembles a superscalar design with nine ALUs. The only difference is the addition of some circuits, wires and memories. The reconfigurable controller receives the reconfigurable running number from the DMA and sends control signals to the function units. The instruction cache receives the control signals from the reconfigurable controller and sends data directly to one memory block of a function unit through the DMA. While a function unit is executing from one memory block, the other block can download another application at the same time.


FIGURE 5-18 Nine function units CPU architecture with Reconfigurable Design

The design includes the five-stage pipeline, the branch predictor, data forwarding, and the handling of branch instructions in the second stage to improve performance. Ideally, except for the first and final instructions, every instruction costs one clock. But when the reconfigurable design is not used and an R-type instruction immediately follows a load instruction on which it depends, a one-clock stall is needed. A delay also occurs when the branch prediction is wrong: the system must flush the first- and second-stage registers and replace the current PC value with the branch address, causing an unavoidable two-clock delay.

FIGURE 5-19 shows the mapping of the FIR filter data flow to the nine-function-unit CPU with reconfigurable design. When reconfiguration is used, the function units get their operation data directly from the internal memory blocks (M1 or M2), and the instructions only need to load the data that the reconfigurable computation requires. The result of each FU can be forwarded to the inputs of all FUs, and the data cache outputs can also be forwarded to all FUs. As shown in FIGURE 5-19, the rectangle drawn in dotted lines encloses the basic operations of the FIR filter example; from it we can easily see that each result takes just one clock to produce.


FIGURE 5-19 Mapping FIR filter data flow to nine function units CPU with reconfigurable design

TABLE 5-17 shows the FIR filter simulation results for the nine-function-unit CPU architecture with reconfigurable design. Because of the five-stage pipeline and the FIR filter coefficients, the first result requires nine more clocks than the following ones. As the mapped data flow shows, the average time to produce each result is one clock. This structure saves one clock per FIR result compared with the nine-function-unit CPU without reconfigurable design, but it costs more hardware. TABLE 5-18 shows the hardware cost of the nine-function-unit CPU architecture with reconfigurable design; its performance and hardware cost will be compared with the other architectures in Section 5.4.

Result       First  Second  Third  Fourth  Fifth  Sixth  Seventh  Eighth  Ninth  Tenth  Average
Time(clock)  10     11      12     13      14     15     16       17      18     19     1

TABLE 5-17 Production time of each FIR filter result in the nine function units architecture.

Hardware  Register  Multiplier  Adder  Subtract  Comparator  Shifter  Or  And  Cache
Amount    320       4           51     14        163         101      62  297  3

TABLE 5-18 Hardware cost in nine function units architecture.

5.2.5 Sixteen Function Units CPU Architecture with Reconfigurable Design


FIGURE 5-20 Sixteen function units CPU architecture with reconfigurable design.

As shown in FIGURE 5-20, the sixteen-function-unit CPU with reconfigurable design resembles a superscalar design with sixteen ALUs. The only difference is the addition of some circuits, wires and memories. The reconfigurable controller receives the reconfigurable running number from the DMA and sends control signals to the function units. The instruction cache receives the control signals from the reconfigurable controller and sends data directly to one memory block of a function unit through the DMA. While a function unit is executing from one memory block, the other block can download another application at the same time.

The design includes the five-stage pipeline, the branch predictor, data forwarding, and the handling of branch instructions in the second stage to improve performance. Ideally, except for the first and final instructions, every instruction costs one clock. But when the reconfigurable design is not used and an R-type instruction immediately follows a load instruction on which it depends, a one-clock stall is needed. A delay also occurs when the branch prediction is wrong: the system must flush the first- and second-stage registers and replace the current PC value with the branch address, causing an unavoidable two-clock delay.


FIGURE 5-21 Mapping FIR filter data flow to sixteen function units CPU with reconfigurable design


TABLE 5-19 shows the FIR filter simulation results for the sixteen-function-unit CPU architecture with reconfigurable design. Because of the five-stage pipeline and the FIR filter coefficients, the first result requires nine more clocks than the following ones. As the mapped data flow shows, the average time to produce each result is half a clock, compared with two clocks for the sixteen-function-unit CPU without reconfigurable design, but the hardware costs more. TABLE 5-20 shows the hardware cost of the sixteen-function-unit CPU architecture with reconfigurable design; its performance and hardware cost will be compared with the other architectures in Section 5.4.

Result       First  Second  Third  Fourth  Fifth  Sixth  Seventh  Eighth  Ninth  Tenth  Average
Time(clock)  10     10      11     11      12     12     13       13      14     14     0.5

TABLE 5-19 Production time of each FIR filter result in the sixteen function units architecture.

Hardware  Register  Multiplier  Adder  Subtract  Comparator  Shifter  Or  And  Cache
Amount    646       4           98     21        283         178      97  814  3

TABLE 5-20 Hardware cost in sixteen function units architecture.

5.3 Compare and Analyze Architectures

This section compares and analyzes the performance and hardware cost of all the architectures introduced previously. Besides showing bar charts of the performance and hardware cost of the CPU architectures with and without reconfigurable design, this section examines the correlation between performance and hardware cost. Finally, the result-contribution-per-gate-count values provide a solid reference for deciding which architecture to design. At the end of this section, the best architecture is selected for the CPU design and is implemented in the next chapter.

FIGURE 5-22 is a bar chart of the hardware cost of the one-, two-, four-, nine- and sixteen-function-unit CPUs without reconfigurable design. The x axis lists the kinds of components and the y axis gives the amount of each component. The figure shows that the numbers of registers and AND gates grow conspicuously with the number of function units; the numbers of adders, subtracters, comparators, shifters and OR gates grow mildly; the number of multipliers grows only weakly; and the cache amount stays constant.


FIGURE 5-22 Hardware cost in one, two, four, nine and sixteen function units CPU without reconfigurable design.

FIGURE 5-23 is a bar chart of the hardware cost of the one-, two-, four-, nine- and sixteen-function-unit CPUs with reconfigurable design. The x axis lists the kinds of components and the y axis gives the amount of each component. The figure shows that the numbers of registers and AND gates grow conspicuously with the number of function units; the numbers of adders, subtracters, comparators, shifters and OR gates grow mildly; and the numbers of multipliers and caches grow only weakly.


FIGURE 5-23 Hardware cost in one, two, four, nine and sixteen function units CPU with reconfigurable design.

Hardware  Register  Multiplier  Adder   Subtract  Comparator  Shifter  Or  And    Cache
1 FU      4.918%    0           5.456%  0         4.444%      0        0   7.69%  50%
2 FU      23.158%   0           10%     0         7.692%      0        0   7.69%  50%
4 FU      31.507%   0           12.12%  0         9.302%      0        0   6.78%  50%
9 FU      30.612%   0           21.43%  0         12.413%     0        0   3.13%  50%
16 FU     40.741%   0           34.25%  0         21.459%     0        0   3.17%  50%

TABLE 5-21 Percentage increase in each component from the CPU architectures without reconfigurable design to the corresponding architectures with reconfigurable design.

TABLE 5-21 shows the percentage increase in each component from the CPU architectures without reconfigurable design to the corresponding architectures with reconfigurable design. The numbers of multipliers, subtracters and shifters do not grow at all from the architectures without reconfigurable design to those with it. The AND gates increase by 3.13% to 7.69%, the comparators by 4.44% to 21.46%, the adders by 5.46% to 34.25%, and the registers by 4.92% to 40.74%.


FIGURE 5-24 Components increasing amounts of each kind of CPU architectures with reconfigurable design from the architectures without reconfigurable design

FIGURE 5-24 is a bar chart showing the increase in component amounts from each CPU architecture without reconfigurable design to the corresponding architecture with reconfigurable design. The registers increase the most in absolute terms because the pipeline stages need to store all values in registers. The increases in adders, comparators, AND gates and counters are almost in direct ratio to the number of FUs. In relative terms the caches grow the most, increasing by half.


FIGURE 5-25 The FIR example results produced time in 1, 2, 4, 9 and 16 FUs CPU without reconfigurable design.

FIGURE 5-25 shows the time at which each FIR example result is produced on the one-, two-, four-, nine- and sixteen-FU CPUs without reconfigurable design. The times for nine FUs are the same as for sixteen, so only four lines appear in the chart. The reason is that each FIR result needs four load instructions, and the R-format instructions follow the loads closely. Although forwarding can deliver a load result to the ALU inputs before it is read from the register file, saving one clock, a one-clock stall is still needed. Consequently, the FIR data flow could not be mapped entirely onto the sixteen-function-unit CPU.


FIGURE 5-26 The FIR example results produced time in 1, 2, 4, 9 and 16 FUs CPU with reconfigurable design.

FIGURE 5-26 shows the time at which each FIR example result is produced on the one-, two-, four-, nine- and sixteen-FU CPUs with reconfigurable design. Here the times for nine and sixteen FUs are not the same. With reconfiguration, the instructions only need to load the data that the reconfiguration requires and then run the reconfiguration. The reconfigurable design removes the delay caused by load instructions, saving one clock in every case where an R-format instruction closely follows a load, a situation that otherwise recurs serially again and again. Consequently, the FIR data flow could be mapped entirely onto the sixteen-function-unit reconfigurable CPU, improving the performance.

Architecture         1 FU  2 FU  4 FU  9 FU  16 FU
Grow-up percentage   50%   50%   100%  100%  300%

TABLE 5-22 Performance grow-up percentage from the CPU architectures without reconfigurable design to those with reconfigurable design in the FIR example.


TABLE 5-22 lists the performance growth from the architectures without reconfigurable design to those with it for the one-, two-, four-, nine- and sixteen-function-unit CPUs in the FIR example. The growth percentages increase with the number of function units. The one- and two-function-unit CPUs both grow by 50%, and the four- and nine-function-unit CPUs both grow by 100%. The largest growth is in the sixteen-function-unit architecture: the average time per result drops from two clocks to half a clock, so the architecture with reconfigurable design delivers four times the throughput of the architecture without it, a 300% increase. FIGURE 5-27 shows how many FIR results can be produced each clock; the sixteen-function-unit architecture clearly shows the greatest performance growth.


FIGURE 5-27 Performance comparison of the 1, 2, 4, 9 and 16 FUs CPU architectures without and with reconfigurable design in the FIR example. Note the unit of the y axis: it represents how many FIR results the hardware can produce each clock.

Comparing TABLE 5-21 and TABLE 5-22, the architectures with reconfigurable design deliver 1.5 to 4 times the performance of the architectures without it, while, apart from the counters, which occupy very little chip area, the hardware cost grows by no more than half. According to the simulation results, the architectures with reconfigurable design gain more in performance and less in hardware cost than the architectures without it. Therefore, the CPU with reconfigurable design is chosen for the hardware design in the next chapter.

Hardware    Reg   Mul    Add  Sub  Comparator  Shifter  Or  And  Cache  Total
Gate count  736   7544   322  322  345         345      4   4    14720
1 FU        64    1      19   5    47          11       8   14   3      126634
2 FU        117   2      22   6    56          2        22  28   3      174586
4 FU        192   4      37   9    94          44       30  63   3      278442
9 FU        320   4      51   14   163         101      62  297  3      423302
16 FU       646   4      98   21   283         178      97  814  3      750799

TABLE 5-23 Total gate count values of the 1, 2, 4, 9 and 16 FUs reconfigurable CPU architectures with reconfigurable design.

TABLE 5-23 shows the total gate count values of the one-, two-, four-, nine- and sixteen-FU CPU architectures with reconfigurable design. The second row gives the gate count of each component, and the final column gives the total gate count of each architecture, i.e., the sum over all components of the component amount multiplied by its per-unit gate count. Combining these totals with FIGURE 5-27 and computing the ratios yields TABLE 5-24.
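The totals in TABLE 5-23 and the ratios in TABLE 5-24 follow from simple arithmetic, reproduced in the C sketch below using the published 1-FU numbers; running it gives the 126634-gate total and the 9.87E-07 contribution value shown in the tables.

    /* Reproduces the arithmetic behind TABLE 5-23 and TABLE 5-24 for 1 FU. */
    #include <stdio.h>

    int main(void)
    {
        /* per-unit gate counts (second row of TABLE 5-23):
           Reg, Mul, Add, Sub, Comparator, Shifter, Or, And, Cache */
        const long unit[9]   = { 736, 7544, 322, 322, 345, 345, 4, 4, 14720 };
        /* component amounts of the 1-FU reconfigurable CPU (TABLE 5-12) */
        const long amount[9] = { 64, 1, 19, 5, 47, 11, 8, 14, 3 };
        const double results_per_clock = 0.125;  /* one result per 8 clocks */

        long total = 0;
        for (int i = 0; i < 9; i++)
            total += amount[i] * unit[i];

        printf("total gate count: %ld\n", total);            /* 126634   */
        printf("contribution per gate: %.2e\n",
               results_per_clock / total);                   /* 9.87e-07 */
        return 0;
    }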

                                     1 FU      2 FU      4 FU      9 FU      16 FU
Result/clock                         0.125     0.25      0.5       1         2
Gate count                           126634    174586    278442    423302    750799
Result contribution per gate count   9.87E-07  1.43E-06  1.79E-06  2.36E-06  2.66E-06

TABLE 5-24 Performance by result/clock, gate count and result contribution per gate count values.

TABLE 5-24 shows the performance in results per clock, the gate count, and the result contribution per gate count for the one-, two-, four-, nine- and sixteen-function-unit CPUs with reconfigurable design. From this table and FIGURE 5-28, the largest result contribution per gate count belongs to the sixteen-FU CPU architecture. This means that, of the five CPU architectures, the sixteen-FU one with reconfigurable design provides the most contribution per gate count.


FIGURE 5-28 Result contribution per gate count in 1, 2, 4, 9 and 16 FUs CPU with reconfigurable design.

5.4 Conclusion

After comparing the CPU architectures without and with reconfigurable design, the latter are chosen. The FIR result contribution per gate count is then calculated for each architecture with reconfigurable design, and the sixteen-FU CPU proves the best choice. Simulating and comparing so many different architectures was a long road, but the gains are sweet: we can now be sure that the sixteen-function-unit CPU with reconfigurable design is the best one. To check that the program counter datapath, including the branch and jump functions, is correct, a test example, bubble sort, was run and produced the correct results.
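The thesis does not list the test program itself; a minimal C bubble sort of the kind that exercises conditional branches, backward branches and jumps might look like the following sketch.

    /* Hedged sketch: a minimal bubble sort used only to exercise the
     * branch/jump datapath; the thesis's exact test code is not given. */
    void bubble_sort(int *a, int n)
    {
        for (int i = 0; i < n - 1; i++)          /* backward branch per pass */
            for (int j = 0; j < n - 1 - i; j++)
                if (a[j] > a[j + 1]) {           /* taken/not-taken branch   */
                    int t = a[j];                /* swap out-of-order pair   */
                    a[j] = a[j + 1];
                    a[j + 1] = t;
                }
    }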

In the next chapter, the sixteen-function-unit reconfigurable CPU architecture is presented; the hardware includes circuits for exchanging data with other blocks over the network-on-chip, and the reconfigurable processor receives and sends data through a network interface. The details of the design of the sixteen-function-unit CPU with reconfigurable design are introduced explicitly in the next chapter.

Chapter 6 Reconfigurable Processor Design

The comparison in Chapter 5 shows that the sixteen-function-unit reconfigurable processor is the best architecture to design. This chapter explicitly introduces how to design a sixteen-function-unit reconfigurable processor and why such a large processor core is used in the NoC. First, the instruction set is chosen from three candidate instruction sets; after comparison, VLIW proves the most suitable one. Then the non-reconfigurable and reconfigurable instruction formats, the latter including setting and running types, are introduced explicitly, together with a simple example showing how they are used.

Combining the program datapath and the arithmetic operation datapath forms the model of the reconfigurable processor. After the reconfigurable components, including the reconfigurable controller, DMA, function units, and caches, are combined, the reconfigurable processor is almost complete. An interrupt handler is then designed for interrupt handling. Finally, in order to communicate with the other tiles in the NoC, the network interface is added and the reconfigurable processor is complete.

6.1 Instruction Set

This section introduces the reconfigurable processor's instruction set. First, the CISC, RISC and VLIW instruction sets are compared to choose the most suitable one. Then the five kinds of slot formats are introduced. Finally, the instruction set introduced in this section is the basis for the hardware design in the next section.

6.1.1 Very-Long Instruction Word

VLIW (Very Long Instruction Word) architectures [27] are distinct from the traditional RISC (Reduced Instruction Set Computer) and CISC (Complex Instruction Set Computer) architectures of current mass-market microprocessors. VLIW implementations are simpler, targeting very high performance. Just as RISC architectures permit simpler, cheaper high-performance implementations than CISCs do, VLIW architectures are simpler and cheaper than RISCs because of further hardware simplifications. VLIW architectures are characterized by instructions that each specify several independent operations, compared with RISC instructions, which typically specify one operation, and CISC instructions, which typically specify several dependent operations. VLIW instructions are necessarily longer than RISC or CISC instructions, hence the name.

6.1.1.1 Architecture Comparison of CISC, RISC, and VLIW

From the larger perspective, RISC, CISC, and VLIW architectures have more similarities than differences. Obviously, these architectures all use the traditional state-machine model of computation. Each instruction effects an incremental change in the state (memory, registers) of the computer, and the hardware fetches and executes instructions sequentially until a branch instruction causes the flow of control to change. The differences between RISC, CISC, and VLIW are in the formats and semantics of the instructions. TABLE 6-1 shows a comparison of CISC, RISC, and VLIW characteristics.


Instruction Size
  CISC: Varies
  RISC: One size, usually 32 bits
  VLIW: One size

Instruction Format
  CISC: Field placement varies
  RISC: Regular, consistent placement of fields
  VLIW: Regular, consistent placement of fields

Instruction Semantics
  CISC: Varies from simple to complex, possibly many dependent operations per instruction
  RISC: Almost always one simple operation
  VLIW: Many simple, independent operations

Registers
  CISC: Few, sometimes special-purpose
  RISC: Many, general-purpose
  VLIW: Many, general-purpose

Memory References
  CISC: Bundled with operations in many different types of instructions
  RISC: Not bundled with operations, i.e., load/store architecture
  VLIW: Not bundled with operations, i.e., load/store architecture

Hardware Design Focus
  CISC: Exploit microcoded implementations
  RISC: Exploit implementations with one pipeline and no microcode
  VLIW: Exploit implementations with multiple pipelines, no microcode and no complex dispatch logic

Picture of Five Typical Instructions: (graphic; each box = 1 byte)

TABLE 6-1 Comparison of CISC, RISC, and VLIW characteristics.

The characteristics of the three kinds of instruction sets are listed and introduced below:

CISC instructions: These instructions vary in size, often specify a sequence of operations, and can require serial (slow) decoding algorithms. CISCs tend to have few registers, and the registers may be special-purpose, which restricts the ways in which they can be used. Memory references are typically combined with other operations (such as add memory to register). CISC instruction sets are designed to take advantage of microcode.

RISC instructions: These instructions specify simple operations, are fixed in size, and are easy (quick) to decode. RISC architectures have a relatively large number of general-purpose registers. Instructions can reference main memory only through simple load-register-from-memory and store-register-to-memory operations. RISC instruction sets do not need microcode and are designed to simplify pipelining.

VLIW instructions: These are like RISC instructions except that they are longer to allow them to specify multiple, independent simple operations. A VLIW instruction can be thought of as several RISC instructions joined together. VLIW architectures tend to be RISC-like in most attributes.
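To illustrate "several RISC instructions joined together", a VLIW word can be pictured as a bundle of independent slots. The bitfield layout below is purely hypothetical, for illustration; the thesis's actual slot formats are defined later in Section 6.1.

    /* Hypothetical VLIW encoding: one 32-bit RISC-like slot per operation,
     * several slots bundled into one long instruction word. Field names
     * and widths are assumptions, not the thesis's encoding. */
    struct slot {
        unsigned opcode : 6;   /* operation selector         */
        unsigned rs     : 5;   /* first source register      */
        unsigned rt     : 5;   /* second source register     */
        unsigned rd     : 5;   /* destination register       */
        unsigned funct  : 11;  /* function / immediate field */
    };

    struct vliw_word {
        struct slot op[4];     /* e.g., four independent operations per word */
    };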


C-language function:

    f(j)
    long j;
    {
        long i;
        j = j + i;
    }

CISC                RISC

add 4[r1]
