Carlo Galuzzi
Automatically Fused Instructions Algorithms for the Customization of the Instruction-Set of a Reconfigurable Architecture
DISSERTATION for the purpose of obtaining the degree of doctor at the Technische Universiteit Delft, by authority of the Rector Magnificus, Prof.dr.ir. J.T. Fokkema, chairman of the Board for Doctorates, to be defended in public on Friday 15 May 2009 at 12:30 by
Carlo GALUZZI
Master in Mathematics, Università degli Studi di Milano
born in Milano, Italy
This dissertation has been approved by the promotor: Prof.dr. K.G.W. Goossens
Copromotor: Dr. K.L.M. Bertels

Composition of the doctoral committee:
Rector Magnificus, chairman, Technische Universiteit Delft, NL
Prof.dr. K.G.W. Goossens, promotor, Technische Universiteit Delft, NL
Dr. K.L.M. Bertels, copromotor, Technische Universiteit Delft, NL
Prof.dr. W. Najjar, University of California Riverside, US
Prof.dr. J. Takala, Tampere University of Technology, FI
Prof.dr.ir. A.J. van der Veen, Technische Universiteit Delft, NL
Dr. J.M.P. Cardoso, University of Porto, PT
Dr. C. Silvano, Politecnico di Milano, IT
Prof.dr. C.I.M. Beenakker, reserve member, Technische Universiteit Delft, NL
CIP-DATA KONINKLIJKE BIBLIOTHEEK, DEN HAAG

Carlo Galuzzi
Automatically Fused Instructions - Algorithms for the Customization of the Instruction-Set of a Reconfigurable Architecture
Delft: TU Delft, Faculty of Electrical Engineering, Mathematics and Computer Science
Thesis Technische Universiteit Delft. - With references. - With summary in Dutch.
ISBN 978-90-72298-02-7
Subject headings: instruction-set, instruction generation, instruction selection, algorithms, reconfigurable architecture.

Copyright © 2009 Carlo Galuzzi
All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, electronic, mechanical, photocopying, recording, or otherwise, without permission of the author.
Printed in The Netherlands
To Stamatis, friend and mentor
Abstract
In this dissertation, we address the design of algorithms for the automatic identification and selection of complex application-specific instructions used to speed up the execution of applications on reconfigurable architectures. The computationally intensive portions of an application are analyzed and partitioned into segments of code to execute in software and segments of code to execute atomically in hardware as single instructions. These instructions extend the instruction-set of the reconfigurable architecture in use and are application-specific. The main goal of the work presented in this dissertation is the identification of application-specific instructions with multiple inputs and multiple outputs. The instructions are generated in two consecutive steps: first, the application is partitioned into non-overlapping single-output instructions and, then, these instructions are further combined into multiple-output instructions following different policies. We propose different algorithms for the partitioning of an application into both single-output and multiple-output instructions. A number of approaches have been proposed in both academia and industry for extending a given instruction-set with application-specific instructions to speed up the execution of applications. The proposed solutions usually have a high computational complexity. The algorithms proposed in this dissertation provide quality solutions and have linear computational complexity in all cases but one, in which case the proposed solution is optimal. Additionally, the new application-specific instructions are atomically executable in hardware by construction, whereas existing approaches increase the computational complexity by testing each generated instruction. The proposed algorithms are tested on the Molen reconfigurable architecture.
The experimental results on well-known benchmarks show that a considerable speed-up can be obtained in the execution of an application by using the application-specific instructions identified by the proposed algorithms.
Acknowledgements

When I met Prof. Stamatis Vassiliadis for the first time some years ago, it was difficult for me to imagine him as the advisor of a future PhD. At that time, as a mathematician, I was envisioning a PhD in mathematics, and Stamatis was working on completely different topics. But things never go as expected, so, almost by accident, after my master's I decided to change research topic and I started my PhD in Delft under his guidance and the guidance of Dr. Koen Bertels. I would like to thank Stamatis for this opportunity, as these PhD years have enriched both my knowledge and my life. Stamatis was a good friend and a great scientist, and his death has left a terrible void in my heart and in those of all the people who knew him. The work presented in this dissertation is the outcome of the work I have conducted at the Computer Engineering Laboratory in Delft over the last years. This result would never have been possible without the help of many people. First of all, I would like to thank my advisor, Dr. Koen Bertels, for his help during these PhD years. I would also like to thank Prof. Kees Goossens for serving as a promotor, and the examination committee for the useful feedback and comments on this dissertation. In particular, I would like to thank Prof. Jarmo Takala for his very detailed comments. Additionally, I would like to thank Dr. Georgi Gaydadjiev and Dr. Cristina Silvano for their help and the nice chats over the last years. I would like to thank Yana Yankova, Dimitris Theodoropoulos and Roel Meeuws for their help in implementing the algorithms presented in this dissertation. Without their help, my work would have remained only theoretical, so I thank them for making it real. I would especially like to thank Dimitris for dedicating a big part of his free time to implementing my ideas, which were not always as easy to implement as I thought. My time in Delft would never have been so happy without the right social environment.
I would like to thank Christos Strydis and Lotfi Mhamdi for our endless chats in the middle of the night about every possible topic, work included, and for their help and friendship during these years.
I am also grateful to Daniele Ludovici and Sebastian Isaza for their useful comments on this dissertation and their constant help in everyday computer-engineering life. I would like to thank my friends and colleagues Demid Borodin, Kamana Sigdel, Pepijn de Langen, Yannis Sourdis, Said Hamdioui and Zaid Al-Ars. My thanks also go to Bert Meijs and Lidwina Tromp for their technical and administrative assistance throughout the years. I would like to thank my Italian friends Tania, Davide, Chiara and Gabriele for their support and friendship. My special thanks go to my Cretan girlfriend Niki Frantzeskaki, who has managed to survive this dissertation with me. Despite my stressful times, she always loved, supported and helped me, something for which I am deeply grateful. Finally, I would like to truly thank my parents, Daniela and Massimo, and my brother Bruno, for their unconditional love, support and patience throughout my entire life.
Carlo
Delft, The Netherlands, May 2009
Table of contents
Abstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Acknowledgments
i
. . . . . . . . . . . . . . . . . . . . . . . . . . . .
iii
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
ix
List of Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
xi
List of Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
xv
List of Tables
List of Acronyms and Symbols . . . . . . . . . . . . . . . . . . . . . . xvii 1
Introduction . . . . . . . . . . . . . 1.1 Problem Overview . . . . . . 1.1.1 Motivational Example 1.2 Dissertation Contribution . . . 1.3 Dissertation Organization . . .
. . . . .
. . . . .
. . . . .
. . . . .
. . . . .
. . . . .
. . . . .
1 2 4 6 7
2
The Instruction-Set Extension Problem . . . . . . . . . . 2.1 Introduction . . . . . . . . . . . . . . . . . . . . . 2.1.1 General Purpose Computing . . . . . . . . 2.1.2 Application Specific Computing . . . . . . 2.1.3 Reconfigurable Computing . . . . . . . . . 2.2 Instruction-Set Extensions . . . . . . . . . . . . . 2.3 Different Types of Customizations . . . . . . . . . 2.3.1 Motivational Example . . . . . . . . . . . 2.3.2 Types of Customizations . . . . . . . . . . 2.4 The Customization Process . . . . . . . . . . . . . 2.4.1 Custom Templates vs Predefined Templates 2.4.2 The Cost Function . . . . . . . . . . . . . 2.4.3 Instruction Generation . . . . . . . . . . .
. . . . . . . . . . . . .
. . . . . . . . . . . . .
. . . . . . . . . . . . .
. . . . . . . . . . . . .
. . . . . . . . . . . . .
. . . . . . . . . . . . .
9 9 10 10 12 13 15 15 17 19 19 20 21
v
. . . . .
. . . . .
. . . . .
. . . . .
. . . . .
. . . . .
. . . . .
. . . . .
. . . . .
. . . . .
. . . . . . .
. . . . . . .
. . . . . . .
. . . . . . .
. . . . . . .
. . . . . . .
. . . . . . .
. . . . . . .
27 29 30 31 33 34 35
3
Single-Output Instructions . . . . . . . . . . . . . . . 3.1 Introduction . . . . . . . . . . . . . . . . . . . . 3.2 Maximal Single-Output Instructions . . . . . . . 3.2.1 The MAXMISO Partitioning Algorithm . . 3.3 Single-Output Instructions of Variable Size . . . 3.3.1 Level of a Node and Level of a MAXMISO 3.3.2 The SUBMAXMISOs . . . . . . . . . . . . 3.3.3 The SMM Partitioning Algorithm . . . . . 3.4 Summary . . . . . . . . . . . . . . . . . . . . .
. . . . . . . . .
. . . . . . . . .
. . . . . . . . .
. . . . . . . . .
. . . . . . . . .
. . . . . . . . .
. . . . . . . . .
37 37 41 44 46 46 48 50 54
4
Multiple-Output Instructions . . . . . . . . . . . . . . . . 4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . 4.2 Convex Clustering . . . . . . . . . . . . . . . . . . 4.3 The Parallel Clustering Algorithm . . . . . . . . . . 4.4 Convex Connected Subgraphs . . . . . . . . . . . . 4.4.1 The Nautilus Clustering Algorithm . . . . . 4.4.2 The Spiral Clustering Algorithm . . . . . . . 4.4.3 One Modification to Generate Bigger Clusters 4.4.4 The Selection Process . . . . . . . . . . . . 4.5 Summary . . . . . . . . . . . . . . . . . . . . . . .
. . . . . . . . . .
. . . . . . . . . .
. . . . . . . . . .
. . . . . . . . . .
. . . . . . . . . .
57 57 59 63 69 70 75 83 84 84
5
Performance Evaluation . . . . . . . . . . . . . . . . . 5.1 Introduction . . . . . . . . . . . . . . . . . . . . 5.2 Main Limitations in Reconfigurable Architectures 5.3 The Molen Reconfigurable Architecture . . . . . 5.3.1 The Delft WorkBench Project . . . . . . 5.4 The Toolchain . . . . . . . . . . . . . . . . . . . 5.5 Discussion of the Evaluation Process . . . . . . . 5.6 The Experimental Results . . . . . . . . . . . . . 5.6.1 The Parallel Clustering Algorithm . . . .
. . . . . . . . .
. . . . . . . . .
. . . . . . . . .
. . . . . . . . .
. 87 . 87 . 89 . 90 . 93 . 96 . 102 . 107 . 111
2.5
2.6
2.4.4 Instruction Selection . . . . . . . . . Custom Instruction Integration . . . . . . . . 2.5.1 Functional Units . . . . . . . . . . . 2.5.2 Coprocessors . . . . . . . . . . . . . 2.5.3 Attached or External Processing Units 2.5.4 Embedded Cores . . . . . . . . . . . Summary . . . . . . . . . . . . . . . . . . .
vi
. . . . . . .
. . . . . . . . .
. . . . . . . . .
5.7
6
5.6.2 The Nautilus and Spiral Clustering Algorithms . . . . 112 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . 116
Conclusions . . . . . . . . . . . . . . . 6.1 Outlook . . . . . . . . . . . . . . 6.2 Contributions . . . . . . . . . . . 6.3 Open Issues and Future Directions
. . . .
. . . .
. . . .
. . . .
. . . .
. . . .
. . . .
. . . .
. . . .
. . . .
. . . .
. . . .
. . . .
. . . .
. . . .
119 119 121 122
A Experimental Results . . . . . . . . . . . . . . . . . . . . . . . . . 125 Bibliography
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 147
List of Publications . . . . . . . . . . . . . . . . . . . . . . . . . . . . 161 Samenvatting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 165 Curriculum Vitae
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 167
vii
List of Tables

4.1 Clustering algorithms and most suitable domains of application.
4.2 Assignment of a boolean variable to each MAXMISO identified in the subject graph of Figure 4.2(a).
5.1 Average measurements of the execution time of the toolchain presented in Figure 5.3.
5.2 Information about the MAXMISO partitioning of a set of benchmarks.
5.3 Information about the FIX SMM partitioning of a set of benchmarks.
5.4 Overall application speed-up compared with pure software execution for the Parallel clustering algorithm. The table shows results for a selected set of benchmarks and different FPGA boards.
5.5 Area usage for a set of benchmarks for the Nautilus and the Spiral clustering algorithms.
5.6 Number of fused instructions generated by clustering different kinds of single-output instructions (MAXMISO and SMM).
5.7 Overall application speed-up compared with pure software execution for the Nautilus and the Spiral clustering algorithms for a set of benchmarks.
List of Figures

1.1 Motivational example: (a) a picture of the statue by Roy Lichtenstein in Barcelona and (b) a frame.
1.2 Motivational example - possible solutions.
2.1 Positioning of different computer architectures in terms of flexibility.
2.2 Motivational example - data-flow subgraph extracted from ADPCM decoder and different custom instructions.
3.1 The subject graph of a simple application and examples of subgraphs.
3.2 The subject graph of an application and its single-output subgraphs.
3.3 Distribution of the nodes of a subject graph G in levels and the collapsed graph Ĝ, image of G via the collapsing function f.
3.4 A MAXMISO extracted from a subject graph and two examples of SMM partitioning of the MAXMISO.
3.5 The iterative application of the SMM partitioning likened to the consecutive partitioning of a box in smaller boxes.
3.6 Schematic SMM partitioning of an application with the FIX and VARIABLE SMM partitioning algorithms.
4.1 A subject graph and examples of convex disconnected MIMO subgraphs.
4.2 Generation of custom instructions with the Parallel clustering algorithm. The convex disconnected MIMO subgraphs are obtained by combining MAXMISOs.
4.3 Example of construction of a convex MIMO subgraph with the Nautilus clustering algorithm.
4.4 An Archimedean spiral.
4.5 An example of the spiral search.
4.6 Example to show that PRED′(A) ≠ {PRED′(n) | n ∈ A}.
4.7 Example of construction of a convex MIMO subgraph with the Spiral clustering algorithm.
4.8 Clustering without reiteration of STEP 1 and STEP 2.
4.9 Clustering with reiteration of STEP 1 and STEP 2.
5.1 The Molen machine organization.
5.2 The overall workflow of the Delft Workbench.
5.3 Toolchain for the automatic identification of fused instructions. The toolchain is divided in three phases: the single-output clustering (PHASE 1), the hardware-software estimation (PHASE 2) and the multiple-output clustering (PHASE 3).
5.4 The execution time of an application: (a) in software, (b) using custom instructions without scheduling, and (c) using custom instructions with scheduling.
A.1 Overall application speed-up compared to the pure software execution. The figure shows results for three benchmarks and for six different FPGA sizes.
A.2 ADPCM Decoder, Part 1
A.3 ADPCM Decoder, Part 2
A.4 SAD
A.5 MDCT 32, Part 1
A.6 MDCT 32, Part 2
A.7 Gost Encrypt, Part 1
A.8 Gost Encrypt, Part 2
A.9 Gost Decrypt, Part 1
A.10 Gost Decrypt, Part 2
A.11 Cast Decrypt, Part 1
A.12 Cast Decrypt, Part 2
A.13 Cast Encrypt, Part 1
A.14 Cast Encrypt, Part 2
A.15 Twofish Decrypt, Part 1
A.16 Twofish Decrypt, Part 2
A.17 Twofish Encrypt, Part 1
A.18 Twofish Encrypt, Part 2
A.19 Hamming 1
A.20 Hamming 2
A.21 Vorbis Invsqlook
A.22 Vorbis Coslook
List of Algorithms

3.1 Partition of a graph G = (V, E) in MAXMISOs
3.2 Partition of a graph G = (V, E) in SMMs – FIX SMM
3.3 Partition of a graph G = (V, E) in SMMs – VARIABLE SMM
4.1 The Nautilus clustering algorithm
List of Acronyms and Symbols

ADP      Area-Delay Product
ADSP     Application Domain Specific Processor
ASIC     Application-Specific Integrated Circuit
ASIP     Application-Specific Instruction-set Processor
CCU      Custom Configured Unit
CPU      Central Processing Unit
DAG      Directed Acyclic Graph
DFG      Data-Flow Graph
CFG      Control-Flow Graph
DSP      Digital Signal Processor
FPGA     Field-Programmable Gate Array
FU       Functional Unit
GPP      General Purpose Processor
ILP      Integer Linear Programming
LP       Linear Programming
MAXMISO  Maximal Single-Output Graph
RH       Reconfigurable Hardware
RFU      Reconfigurable Functional Unit
SMM      SubMAXMISO
VHDL     VHSIC (Very High Speed Integrated Circuit) Hardware Description Language
XREG     Exchange Register

G(V, E)        A graph with set of nodes V and set of edges E
DEPTH(G)       Depth of a graph G
G^MISO         Set of the single-output subgraphs of a graph G
G^MIMO         Set of the multiple-output subgraphs of a graph G
G^MM           Set of the MAXMISOs of a graph G
G^SMM_(A−v)    Set of SMMs obtained by removing node v from the MAXMISO A
K(G)           Ratio WIDTH(G)/DEPTH(G) of a graph G
LEV(v)         Level of a node v
N*             Set of the natural numbers different from zero
℘(G)           Powerset of a graph G
Td             Turn distance in a spiral search
WIDTH(G)       Width of a graph G
1 Introduction
Over the last years, many different processing architectures have been proposed on the market, each optimized according to a given goal. Those architectures can be classified according to their degree of flexibility. We can identify three main categories: (i) the general purpose processors, based on the Von Neumann computing paradigm, (ii) the domain-specific processors and (iii) the application-specific processors. Architectures based on the Von Neumann computing paradigm, often referred to as General Purpose Processors (GPPs), present a high degree of flexibility, allowing the execution of any kind of application. In simple words, GPPs are non-specialized architectures useful for executing applications with a wide or unknown application domain. Unfortunately, this high flexibility results in a higher power usage compared with specialized architectures. Additionally, although different techniques have been introduced to increase the level of parallelism, parallelism remains relatively limited for highly parallelizable applications. If the application domain of the tasks to compute is known, an alternative solution is the use of specialized architectures. We identify two types of specialized architectures, depending on the degree of specialization: domain-specific processors and application-specific processors. The former are processors used to speed up computations common to a class of applications from the same domain, whereas the latter are processors used to speed up a single type of application. The more specialized the architecture, the more optimized the execution of the task. The ideal architecture would combine the flexibility of a GPP and the performance of a specialized architecture. Simply put, the ideal architecture would be flexible enough to adapt to the application running on it. A reconfigurable architecture, a combination of a GPP and a reconfigurable device, usually a Field-Programmable Gate Array (FPGA), combines flexibility and
performance: the GPP provides good performance for a wide range of applications, and the reconfigurable device can be used to implement specialized instructions, application-specific or domain-specific, to efficiently execute additional applications. The objective of our research is to investigate ways to generate application-specific instructions for reconfigurable architectures which, when implemented in hardware, lead to a more efficient execution of the considered application. The key research question is: how can we generate these instructions? Additionally, does this have to be done manually or can it be done automatically? The main goal of this dissertation is to answer these questions. In the next chapters, we will present, in detail, different methods of automatically extending the degree of specialization of a given reconfigurable architecture with application-specific instructions.
1.1 Problem Overview

The high flexibility and efficiency of reconfigurable architectures allow their use in a wide number of fields. The hardware flexibility allows such architectures to adapt to the continuous changes of standards and functional requirements of the applications. Additionally, reconfigurable architectures can be used for fast prototyping, reducing non-recurring engineering costs and time-to-market. As mentioned before, a reconfigurable architecture is the combination of a GPP and a reconfigurable device. The processor features a basic instruction-set serving a wide range of fields. When a generic application is executed on the GPP, the instructions that belong to the instruction-set are executed in hardware. The rest of the instructions are executed in software. Assuming that an instruction can be executed faster in hardware than in software, a good idea would be to identify a certain number of new specialized instructions, combinations of basic instructions, for a single application or for a domain of applications, to add to the basic instruction-set of the processor. In this way, the number of instructions executed in software can be reduced and the application(s) can be run faster and more efficiently on the architecture in terms of increased performance, reduced power consumption, or other metrics. Since the instruction-set of the processor is predefined at design time and cannot be modified, the reconfigurable device coupled to the GPP can, then, be used to implement these new instructions. Since the reconfigurable device can contain only a limited number of new instructions, the candidate instructions
usually undergo a selection process which identifies only a limited number of instructions. When the new instructions are used by a single application, we speak of application-specific instructions; when they are used by a class of applications in a certain domain, we speak of domain-specific instructions. In both cases, the new instructions can be seen as extensions of the basic instruction-set of the processor. For this reason, this problem is often referred to as the instruction-set extension problem. A specialized instruction can be seen as a sequence of basic instructions which is atomically executed in hardware. Since the equivalent execution of this instruction in software is sequential, to increase performance and to take full advantage of the parallelism provided by a hardware execution, the specialized instructions usually contain many basic instructions which can be executed in parallel. Assuming a performance gain proportional to the size of the specialized instruction in terms of the number of basic operations contained, the larger the instruction, the higher the performance gain. When the specialization of an architecture is domain-wide, two main issues arise. First, the identification of specialized instructions common to different applications involves isomorphism problems, which are well-known computationally complex problems. Second, the size of an instruction common to different applications is usually inversely proportional to the number of applications: the greater the number of applications that share the specialized instruction, the smaller the size of the instruction. This happens because different applications hardly share the same sequences of instructions if the number of instructions is high. As a consequence, application-specific instructions can provide higher performance at lower computational complexity compared with domain-specific instructions.
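To make the notion of atomic hardware execution concrete, a candidate instruction can be modeled as a subgraph of the application's data-flow graph; it is atomically executable only if the subgraph is convex, i.e. no value leaves the cluster and later feeds back into it. The sketch below is only a toy illustration of these two properties (the graph and names are made up; it is not one of the algorithms of this dissertation): it checks convexity of a candidate cluster and counts its inputs and outputs.

```python
# Toy data-flow graph: nodes are basic operations, an edge u -> v
# means operation v consumes the result of operation u.
edges = {
    "a": ["c"], "b": ["c"],   # c = a op b
    "c": ["d", "e"],          # the value of c is consumed twice
    "d": ["f"], "e": ["f"],   # f = d op e
    "f": [],                  # f is the final result
}

def is_convex(cluster, edges):
    """True if no path leaves `cluster` and re-enters it. Convexity is
    what allows a fused instruction to execute atomically: no intermediate
    value must travel back and forth between processor and custom unit."""
    outside = set(edges) - set(cluster)

    def outside_reach(n, seen):
        # collect nodes outside the cluster reachable from n via outside nodes
        for m in edges[n]:
            if m in outside and m not in seen:
                seen.add(m)
                outside_reach(m, seen)
        return seen

    for src in cluster:
        for n in outside_reach(src, set()):
            if any(m in cluster for m in edges[n]):
                return False  # a path escaped the cluster and came back
    return True

def io_counts(cluster, edges):
    """Input edges and output values of a candidate cluster
    (sinks with no consumers count as outputs)."""
    ins = sum(1 for u in edges for v in edges[u]
              if u not in cluster and v in cluster)
    outs = sum(1 for u in cluster
               if not edges[u] or any(v not in cluster for v in edges[u]))
    return ins, outs

print(is_convex({"c", "f"}, edges))            # -> False: c -> d -> f escapes and re-enters
print(is_convex({"c", "d", "e", "f"}, edges))  # -> True: a fusable candidate
print(io_counts({"c", "d", "e", "f"}, edges))  # -> (2, 1): two inputs (a, b), one output (f)
```

A non-convex cluster such as {c, f} cannot be executed atomically, because d and e would have to run on the processor between the two halves of the instruction.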
The drawback is that the new specialized instructions are used only by one application. However, the architecture is reconfigurable. This means that the architecture can be modified for each application, without limiting the number of applications that can be efficiently executed on it. In this dissertation, we describe efficient methods able to identify, in an automatic fashion, new specialized, application-specific instructions. Although the different methods are detailed in the next chapters, we include here an example which gives an idea of how we extend a basic instruction-set with new application-specific instructions.
Figure 1.1: Motivational example: (a) a picture of the statue by Roy Lichtenstein in Barcelona and (b) a frame.
1.1.1 Motivational Example

Let us consider a picture and a frame, as depicted in Figure 1.1. If the frame is big enough, the whole surface of the picture can be framed. On the contrary, if the frame is not big enough to contain the complete picture, what shall we do? Although the picture contains many details (the statue, the street, some buildings, the sky, etc.), the main subject is the statue. This means that we have to crop the picture around the statue in such a way that the part which contains the statue fits the frame. In Figure 1.2 two possible solutions are shown. Let us first divide the picture into small pieces in a puzzle-like fashion, as shown in Figures 1.2(a) and 1.2(d). The part of the picture we are interested in, the statue, is covered by a certain number of pieces. Since the frame has a fixed limited size, we have to select, at first, only the pieces containing parts of the statue. Additional pieces can be included if the frame has some space left. In Figures 1.2(b) and 1.2(e) only the pieces containing details of the statue are selected to be framed. The final result, the picture in the frame, is shown in Figures 1.2(c) and 1.2(f). Depending on the size of the pieces, the part of the picture selected to be framed can leave
Figure 1.2: Motivational example - possible solutions: the picture is partitioned into small pieces ((a) or (d)), part of the pieces are selected to be framed ((b) or (e)) and finally the selected pieces are set into the frame ((c) or (f)).
additional space in the frame or not. This additional space can then be used to frame other interesting parts of the picture, as shown in Figure 1.2(f). Let us translate this example into terms of instruction-set extension. Let us consider a reconfigurable architecture and an application to run on it. If the size of the reconfigurable device is big enough, all the instructions executed by the application can be implemented in hardware: part will belong to the basic instruction-set and part will be the new specialized instructions. Considering the previous example: if the frame is big enough, the whole picture can be framed. If the frame is not big enough, the main subject is selected at first
to be framed. If there is some space left in the frame, this can be used to add other interesting parts of the picture (see Figure 1.2(f)). In the same way, if the reconfigurable device is not big enough, we have to select the important parts of the application and, at first, only these are implemented in hardware. If some space is left on the reconfigurable device, additional parts can be selected within the application to be implemented in hardware.
1.2 Dissertation Contribution

As seen in the previous example, given a reconfigurable architecture and an application to speed up on that architecture, we face two main challenges:

• How can we correctly identify the parts of the application to implement in hardware?

• How many parts of the application can we actually implement in hardware?

In this dissertation, we present methods for the identification and selection of those parts of an application which can be implemented in hardware, as application-specific instructions, and used to speed up the execution of the application on a reconfigurable architecture. To identify these instructions, we use a two-step methodology: first, based on some properties to be defined, the application is partitioned into non-overlapping clusters of instructions of variable size. Second, these clusters are used as building blocks and further combined following different policies. The result is a collection of custom instructions which are candidates for hardware implementation. The total number of selected instructions is limited by the size of the reconfigurable hardware. This means that a selection method needs to be applied when the number of candidate instructions exceeds the available hardware. The work contained in this dissertation deals with the aforementioned challenges. More specifically, the main contributions are the following.

• The design of automatic methods to identify and select optimal, when possible, sets of application-specific instructions.

• The generation of different kinds of application-specific instructions, both single- and multiple-output.
• A reduced computational complexity compared with existing methodologies for the design of application-specific instructions, and no scalability problems.

• The inclusion of different hardware constraints (limitations on inputs, outputs and available area) during the instruction generation, to allow the implementation of the generated instructions on architectures with different hardware constraints.

• A theoretically guaranteed instruction identification process which identifies only instructions that can be atomically executed in hardware without scheduling problems.

An overview of the organization of this dissertation is presented in the next section.
1.3 Dissertation Organization The contributions of the dissertation are presented over various chapters. Before delving into the contributions, Chapter 2 first provides the necessary overview of the current state-of-the-art in the context of instruction-set extension. It provides an accurate description of the problems involved. Additionally, an overview of different reconfigurable architectures which incorporate custom instructions is presented. We target the identification of new instructions with, among other properties, multiple inputs and multiple outputs, known as MIMO instructions. The new instructions are identified via a two-step approach which first generates single-output instructions and then combines selected instructions into MIMO instructions. For this reason, Chapter 3 is devoted to the description of algorithms used to partition an application into single-output clusters of instructions. The well-known MAXMISO partitioning algorithm is used and a modified version is proposed for the generation of single-output clusters of instructions of variable size. Chapter 4 describes the next step in the instruction-set extension process. Once the application is partitioned into single-output clusters, each cluster can be considered as a building block and, following different policies, can be clustered or not with other clusters for the generation of new instructions. The chapter presents different clustering algorithms for the generation
of MIMO instructions. Additionally, it addresses the selection problem in case of hardware limitations and many candidates for implementation. Chapter 5 presents the Molen reconfigurable architecture, selected as proof-of-concept platform in this dissertation, and the dedicated toolchain developed for the generation and selection of new instructions. The algorithms presented in the previous chapters are included in a toolchain for the automatic generation of instruction-set extensions for reconfigurable architectures. After that, the chapter gives an overview of the experimental setup and results. The algorithms described in the dissertation are applied to a set of well-known benchmarks from different domains, i.e. cryptography and multimedia, and the results reveal the benefit of the instruction-set extension process described in the previous chapters. Finally, Chapter 6 provides concluding remarks on the work presented. This chapter summarizes the dissertation, outlines its contributions and proposes future research directions.
2 The Instruction-Set Extension Problem
The extension of a given instruction-set with specialized instructions has become a common technique used to speed up the execution of applications. By identifying computationally intensive portions of an application to be partitioned into segments of code to execute in software and segments of code to execute in hardware, the execution of an application can be considerably sped up. Each segment of code implemented in hardware can then be seen as a specialized application-specific instruction extending a given instruction-set. Although a number of approaches exist in the literature proposing different methods to customize an instruction-set, the description of the problem consists only of sporadic comparisons limited to isolated problems. This chapter presents a thorough analysis of the issues involved in the customization of an instruction-set by means of a set of specialized application-specific instructions. The chapter starts with a description of the main kinds of computer architectures used in recent years. After that, different kinds of customizations are analyzed in great detail. The chapter continues with an accurate description of the customization process, both instruction generation and instruction selection. Finally, an overview of the main architectures which integrate custom instructions is presented.
2.1 Introduction In the past years, electronic devices have been steadily penetrating the market, featuring not only a ubiquitous nature but also a plethora of functionalities. Over the years, these functionalities have been implemented using different kinds of computer architectures, which can be categorized according to their degree of flexibility into two main groups: the general purpose and the application specific computing group [19, 47].
2.1.1 General Purpose Computing General purpose architectures have been widely used and studied in the past decades. This type of architecture provides a high degree of flexibility in terms of application domains. Additionally, many tools have become available on the market and have allowed programmers to map many different applications onto this type of architecture virtually effortlessly [47]. The general purpose computing group is based on the Von Neumann computing paradigm. The general structure of a Von Neumann machine consists of a memory for storing program and data (Harvard architectures contain two parallel accessible memories for storing program and data separately), a control unit used to store the addresses of the instructions to execute and an arithmetic and logic unit used to execute the instructions. A program targeting a Von Neumann machine is coded as a set of instructions to be executed sequentially. The execution of an instruction is realized in five steps: (1) fetching the instruction from the program memory, (2) decoding the instruction to determine which operation has to be executed and which operands are required, (3) reading the operands from the memory, (4) executing the instruction and, finally, (5) writing the result of the operation back to the data memory. This execution model results in a high performance overhead for each individual operation, which turns into energy overhead. In this sense, the general purpose computing group is considered to be the most flexible hardware at the cost of a generally high energy consumption. Over the years, different techniques to increase the level of parallelism have been introduced at the instruction level: for instance, instruction pipelining, superscalar execution, out-of-order execution and register renaming. Parallelism has also been exploited at other levels: bit-level, data-level and loop-level parallelism.
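The five-step execution model can be made concrete with a toy interpreter; the instruction format and opcode names below are illustrative assumptions, not taken from the dissertation:

```python
# Toy Von Neumann machine: fetch, decode, read, execute, write back.
# Instructions are tuples (op, dst, src1, src2); memory is a dict.

def run(program, memory):
    pc = 0
    while pc < len(program):
        instr = program[pc]                  # (1) fetch from program memory
        op, dst, src1, src2 = instr          # (2) decode operation and operands
        a, b = memory[src1], memory[src2]    # (3) read operands from memory
        if op == "add":                      # (4) execute
            result = a + b
        elif op == "mul":
            result = a * b
        else:
            raise ValueError("unknown op: " + op)
        memory[dst] = result                 # (5) write result back to data memory
        pc += 1
    return memory

mem = {"x": 2, "y": 3, "z": 0, "w": 0}
run([("add", "z", "x", "y"), ("mul", "w", "z", "x")], mem)
print(mem["z"], mem["w"])  # 5 10
```

Every operation, however small, pays the full fetch/decode/read/write overhead of the loop body, which is exactly the per-instruction overhead the paragraph describes.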
Although the level of parallelism has been increased during the years, it is still relatively limited for highly parallelizable applications, which become poor candidates for implementation on these architectures.
2.1.2 Application Specific Computing In the context of application-specific computing, three main categories can be identified: Application-Specific Integrated Circuits (ASICs ), Application Domain Specific Processors (ADSPs ) and Application-Specific Instruction-set Processors (ASIPs ).
ASICs are circuits designed for a specific application such as the processor in a TV set top box. Being designed for a specific use, ASICs are able to satisfy specific constraints and to reduce energy consumption, using an appropriate architecture designed for the targeted application, compared with general purpose architectures which are designed for generic use. In an ASIC, the entire application has been hard-wired and the software component is usually represented by run-time configurable parameters. However, energy saving comes at the cost of low flexibility and programmability: for each new functionality or application, the hardware has to be redesigned and built. Today, designing and manufacturing an ASIC is a time-consuming and expensive process [69]. The increasing Non-recurring Engineering (NRE) costs associated with manufacturing, due to the high mask and testing costs, together with factors such as Deep Sub-micron Effects (DSM), increased feature sets and heterogeneous integration, contribute to increasing production costs. Additionally, this long process has to deal with the shrinking time-to-market, which sometimes makes the choice of an ASIC not suitable. ADSPs and ASIPs are processors having a partially customizable instruction-set which can be tuned towards the specific requirements of an application (ASIPs) or a domain of applications (ADSPs) by extending the basic instruction-set with dedicated instructions. Digital Signal Processors (DSPs) are an example of ADSP. A DSP is a processor specialized to accelerate the computation of repetitive, numerically intensive tasks in the digital-signal processing area such as telecommunications, multimedia, image processing, etc. A typical application-specific instruction implemented on a DSP processor is the Multiply ACcumulate (MAC) instruction, which can be performed on huge sets of data concurrently.
A MAC instruction performed on a common Von Neumann machine would have to access the memory to load/store the intermediate result. As a result, by using specialized hardware that directly performs the addition after the multiplication without having to access the memory, a considerable amount of time can be saved. If the processor has to be used for only one application, ASIPs can be used instead of ADSPs. From an optimization point of view, ASIPs can be better optimized than ADSPs, as modifications to the latter have to benefit all the applications in a domain, whereas in the former case only one application is considered [6]. The customizable instruction-set of ADSPs and ASIPs introduces more flexibility in the design, even though the number of different instruction-set customizations is usually relatively limited and, therefore, the execution of different applications may not be efficient [36].
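The MAC idea can be illustrated in a few lines: the running sum stays in one accumulator (a register on a DSP) instead of being stored to and reloaded from memory between the multiply and the add. A minimal, purely illustrative sketch:

```python
# Multiply-ACcumulate over two vectors: acc += a[i] * b[i] for each i.
# On a DSP the accumulator stays in a register; a plain Von Neumann
# machine would load/store the intermediate product on every iteration.

def mac(a, b):
    acc = 0
    for x, y in zip(a, b):
        acc += x * y   # one fused multiply-accumulate step per element
    return acc

print(mac([1, 2, 3], [4, 5, 6]))  # 32
```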
[Figure 2.1 plots the labels ASIC, ADSP, ASIP, GPP and GPP+RH along a flexibility axis.]
Figure 2.1: Positioning of different computer architectures in terms of flexibility. GPP+RH represents a reconfigurable architecture composed of a GPP and a Reconfigurable Hardware (RH).
The aforementioned architectures can then be positioned in terms of flexibility as depicted in Figure 2.1. The flexibility of a general purpose processor can be further extended by using reconfigurable hardware, as described in the next section.
2.1.3 Reconfigurable Computing An ideal computing system would combine the flexibility of a general purpose system with the high performance of an application-specific system. The last two decades have seen a new emerging class of architectures, the so-called reconfigurable architectures. Time-to-market and reduced development costs have become increasingly important and have paved the way for reconfigurable architectures. Reconfigurable devices, including the most widely used Field-Programmable Gate Arrays (FPGAs)1, consist of arrays of programmable logic cells interconnected using a set of routing resources which are also reconfigurable. In this way, custom digital circuits can be mapped onto the reconfigurable hardware: the logic functions of the circuit are computed within the logic blocks, and the reconfigurable routing is used to connect the logic blocks together to form the necessary circuit [28]. These architectures are usually formed by combining a General-Purpose Processor (GPP) and a reconfigurable device. Part of the operations is executed by the host processor while the rest is executed by the reconfigurable hardware. A reconfigurable architecture is an architecture able to adapt to the application: the structure of the architecture can change at start-up time or even at run-time to match the new application. Reconfigurable architectures present three main advantages compared with the architectures previously described: first, changing an existing architecture, 1 Xilinx (http://www.xilinx.com/) and Altera (http://www.altera.com/) are currently the main producers of FPGA devices on the market.
rather than defining a completely new one, allows its associated compiler to be reused: the compiler has to be partially modified, not redesigned from scratch. Second, reconfigurable architectures can serve a much wider range of applications, being an extension of a GPP (see Figure 2.1). Examples are data encryption, data compression and genetic algorithms. Third, reconfigurable architectures can be used for rapid prototyping. Rapid prototyping allows a device to be tested in real hardware before its final production. In this way, considerable amounts of development and debugging effort can be eliminated and the time-to-market can be reduced. More specifically, reconfigurable architectures become even more useful since they allow different versions of the final product to be implemented up to an error-free state [19]. Additionally, the design remains flexible until the product enters the market and even after, allowing a product that meets the minimum requirements to be shipped and features to be added after deployment. The higher cost/performance ratio of reconfigurable architectures has led researchers to look for methods and properties to maximize the performance. Each particular configuration can then be seen as an extension of the instruction-set of the host processor. The identification, definition and implementation of those operations that provide the largest performance improvement constitute a major challenge and represent the so-called instruction-set extension problem. Note. The target of this dissertation is the design of clustering algorithms to extend the basic instruction-set of a reconfigurable architecture, a combination of a general purpose processor and reconfigurable hardware. The algorithms can also be used to design an instruction-set, extension of an existing one, which can be implemented in an ASIC fashion, where the basic instruction-set and its extension are permanently implemented in hardware. Nevertheless, this is not our main target.
In this latter case, metrics like optimal power consumption and area-efficient design are among the main targets. In our analysis, the main target is, given a certain amount of hardware resources, to speed up the execution of an application by implementing a certain number of application-specific instructions identified with low computational complexity algorithms. The design of power-efficient instruction-set extensions will be addressed as future work.
2.2 Instruction-Set Extensions The customization of an instruction-set presents, among others, the following advantages [6]: first, the application code can be more densely encoded, resulting
in a code size reduction; second, the total number of instructions that have to be executed may be reduced, which results in a lower power consumption; and third, the execution of the application can be more efficient in terms of increased performance when using the customized instructions. Although the focus of this dissertation is on presenting a set of algorithms for the generation and selection of custom instructions for a given architecture, the issue concerning the efficient implementation of the selected instructions in hardware has to be addressed as well. Later in the chapter, we give an overview of different architectures that integrate a general purpose processor with custom logic for application acceleration. The identification process of new specialized instructions is usually subject to different constraints such as power consumption, area, code size, cycle count, operating frequency, etc. Additionally, in the general case, not all the instructions suitable for a hardware implementation can be selected for implementation, due to the limited hardware resources. The issues involved are diverse and range from the isomorphism problem and the covering problem, well-known computationally complex problems, to the study of the guide/cost functions involved in the generation and selection of custom instructions. Equally important is the selection problem, addressed by different techniques such as branch-and-bound and dynamic programming. The proposed solutions are either exact, whenever appropriate and possible, or, given that the problems involved are known to be computationally complex, heuristic, used in those cases where an exact solution is not computable in feasible time. In the next sections, we overview the current state-of-the-art in instruction-set customization, describing in detail all the issues involved.
The instruction-set customization problem represents a well-specified topic where results and concepts from many different fields, such as engineering and graph theory, are required. Especially the latter is the dominant approach and seems to provide the right analytical framework. Thinking about the data-flow or control-flow graphs of an application (the directed graphs that show, respectively, the data dependencies and the control dependencies among a number of functions), it is easy to imagine an application represented by a directed graph, where the nodes represent the operations and the edges represent the data dependencies, and the required new complex instructions are represented by subgraphs having particular properties. Thus, the problem turns into the identification of methods for the recognition of certain types of subgraphs.
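This graph view is easy to make concrete. In the sketch below (our own minimal encoding, not the dissertation's), a data-flow graph is an adjacency list, and a candidate custom instruction corresponds to the subgraph induced by a chosen set of operation nodes:

```python
# A data-flow graph as an adjacency list: an edge u -> v means that
# operation v consumes the result of operation u.
edges = {"a": ["c"], "b": ["c"], "c": ["d", "e"], "d": [], "e": []}

def subgraph_edges(nodes, edges):
    """Edges of the subgraph induced by `nodes`; a candidate custom
    instruction corresponds to such an induced subgraph."""
    nodes = set(nodes)
    return sorted((u, v) for u in nodes for v in edges[u] if v in nodes)

print(subgraph_edges({"a", "b", "c"}, edges))  # [('a', 'c'), ('b', 'c')]
```

Instruction generation then amounts to enumerating induced subgraphs with the desired properties (e.g. number of inputs/outputs), which is where the computationally complex problems mentioned above arise.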
The purpose of this chapter is to present a survey of current research in instruction-set extension, investigating the issues regarding the customization of an instruction-set under specific requirements. The main objective is to provide a detailed overview of all the aspects involved in the customization of an instruction-set. It does not seek to cover every technique and research project in instruction-set extension. Instead, it provides an overview of all relevant aspects of the problem and compensates for the lack of a general view of the problem in the existing literature, which consists only of sporadic comparisons limited to isolated issues. The remainder of this chapter is organized as follows. Section 2.3, after presenting a motivational example, overviews the different instruction-set customizations. Degree of customization, granularity of the instructions and degree of automation of the process are presented in detail. Section 2.4 elaborates on the customization process in depth and provides a detailed account of the problems involved in the customization. Instruction generation and selection, properties of the custom instructions and existing solutions are presented to better understand the problem. Section 2.5 provides a selected overview of the main architectural approaches that integrate custom logic for application acceleration. Finally, Section 2.6 concludes the chapter.
2.3 Different Types of Customizations Instruction-set customization can be pursued following different approaches in the type of customization, which can be complete or partial, and in the granularity of the instructions, which can be fine-grain or coarse-grain. We introduce a motivational example to informally outline the main idea of instruction-set extension.
2.3.1 Motivational Example In Figure 2.2a, we present a data-flow subgraph extracted from the ADPCM application as implemented in the MediaBench benchmark suite [72]. Nodes represent the primitive operations, namely the instructions belonging to the instruction-set, and the edges represent the data dependencies. A custom instruction is represented by a subgraph of the data-flow graph. The main idea is to identify different clusters of basic operations within the graph which can be implemented as single instructions to atomically execute in hardware. They become new specialized instructions extending the basic
[Figure 2.2(a): data-flow subgraph of the ADPCM application. The node labels in the figure include load/store operations (LD, ST), the operators +, −, &, >>, !=, > and SEL, various constants, the variables indata and state, the tables indexTable and stepsizeTable, and the clusters MM1–MM6.]
0, if there are α nodes on the longest path between v and the level 0 of the input nodes. The natural number associated with each node is called the level of the node. In our analysis, we consider finite graphs, i.e. graphs having a finite number of nodes and edges. This means that LEV(·) ∈ [0, +∞). In particular, the maximum level d ∈ N of its nodes is called the depth of the graph. In the example presented in Figure 3.3a, the graph has 4 levels and hence depth equal to 4.
Footnote 5: From Wikipedia, http://en.wikipedia.org/wiki/Sum_of_absolute_differences. Footnote 6: In the context of graphs (instructions), size refers to the number of nodes (basic operations) belonging to the (sub)graph (complex instruction).
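Assuming the subject graph is a DAG given as predecessor lists, the level of every node, and hence the depth of the graph, can be computed in one memoized recursive pass; the graph and node names below are illustrative:

```python
from functools import lru_cache

# preds[v] lists the nodes v depends on; input nodes have no predecessors.
preds = {"in1": [], "in2": [], "n1": ["in1", "in2"],
         "n2": ["n1", "in2"], "out": ["n2"]}

@lru_cache(maxsize=None)
def level(v):
    # LEV(v) = 0 for input nodes, otherwise 1 + the maximum level of
    # the predecessors, i.e. the longest path back to level 0.
    return 0 if not preds[v] else 1 + max(level(u) for u in preds[v])

depth = max(level(v) for v in preds)  # depth of the graph = maximum level
print(level("n1"), level("out"), depth)  # 1 3 3
```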
[Figure 3.3 shows the input nodes In1–In5 and the nodes n1–n4 of a subject graph G distributed over the levels L0–L3, together with the collapsed graph G^ containing the nodes MM1–MM3 and the outputs Out1 and Out2.]
Figure 3.3: (a) Distribution of the nodes of a subject graph G in levels and (b) the collapsed graph G^, image of G via the collapsing function f.
NB. Given a graph G, we call a circuit a subset of the edge set E of G that forms a path such that the first node of the path corresponds to the last. A tree is a connected DAG containing no circuits. Since a tree is a DAG, the subject graph under consideration can be a tree as well. In the context of trees, the depth is related to a node and not to the graph: it is the length of the path from the root of the tree to the node, i.e. what we have described as the level of a node. In this dissertation, in case the subject graph is a tree, depth will always refer to the maximum level of the nodes of the graph and not to the level of a node. The definition of the level of a node can be extended to MAXMISOs. Let f : G → G^ be a function such that:

MMi ⊂ G ↦ ai ∈ G^.    (3.6)

The function f collapses a MAXMISO MMi of the subject graph G into a node ai of G^. The function f is injective, as a consequence of the empty intersection of two MAXMISOs, and surjective by construction. Therefore, f is a bijection between G and G^ or, more specifically, between GMM and G^. The function f is called the collapsing function and the graph G^ = (V^, E^) is called the collapsed graph. We can now define the following: Definition 3.3.1. Let MMi ∈ G be a MAXMISO of G and let f : G → G^ be the
collapsing function. The level of MMi is defined as the level of the image of MMi in the collapsed graph through the collapsing function:

LEV(MMi) = LEV(f(MMi)).    (3.7)
Figure 3.3 shows an example of a subject graph G and the collapsed graph G^ . As shown in Figure 3.3, |V^ | = |GMM |.
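The collapsing function can be mimicked in a few lines by labelling every node with the identifier of its MAXMISO and keeping only the edges that cross two clusters; the graph and the partition below are illustrative assumptions, not taken from the dissertation:

```python
# Collapse a partitioned graph: every MAXMISO MM_i becomes one node a_i
# of the collapsed graph; only edges crossing two MAXMISOs survive.
# The MAXMISO partition (`cluster`) is assumed as given, not computed.
cluster = {"in": "MM1", "x": "MM1", "y": "MM2", "z": "MM2", "out": "MM3"}
edges = [("in", "x"), ("x", "y"), ("y", "z"), ("z", "out")]

collapsed = sorted({(cluster[u], cluster[v])
                    for (u, v) in edges if cluster[u] != cluster[v]})
print(collapsed)  # [('MM1', 'MM2'), ('MM2', 'MM3')]
```

Because the MAXMISOs are pairwise disjoint, each node maps to exactly one cluster identifier, which is why the collapsing function is well defined.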
3.3.2 The SUBMAXMISOs Let MMi be a MAXMISO of the subject graph G with at least 2 nodes. Each node vj ∈ MMi belongs to level LEV(vj). Figure 3.4(a) depicts an example of a MAXMISO extracted from the subject graph of an application, with the nodes divided into levels. Let us consider a node v ∈ MMi, with 0 ≤ LEV(v) ≤ d. If we apply the MAXMISO partitioning algorithm to MMi − {v}, each MAXMISO identified in the graph is called a SUBMAXMISO, denoted as SMM, of MMi \ {v} (or, shortly, of MMi). For example, the node v can be an exit node, or a node randomly chosen from the central levels of the graph, or a node with specific properties such as area or power consumption below or above a previously defined threshold. Figure 3.4 shows a MAXMISO extracted from a subject graph and two examples of SMM partitioning, removing an exit node (Figure 3.4(b)) and removing a node from the central level of the graph (Figure 3.4(c)). As the figure shows, the number of SMMs and their sizes tightly depend on the choice of the node v. We denote with GSMM^(A−v) the set of SMMs obtained by removing node v from the MAXMISO A. We denote with GSMM the set GSMM^(A1−v1) ∪ ... ∪ GSMM^(Al−vl). The definition of the level of a SMM is the obvious generalization of the definition of the level of a MAXMISO (see Definition 3.3.1). Remark 3.3.1. The SMM partitioning can provide a solution to the problem raised in Remark 3.2.3. Taking into consideration the SAD kernel, for example, depending on the node removed from the graph, the generated SMMs have more chances to fit into the available hardware. If the SMMs do not fit into the hardware, or the MIMO partitioning algorithms do not have enough MISOs to cluster, then some or all of the SMMs can be repartitioned into MISOs by reapplying the SMM partitioning. An iterative reapplication of the SMM partitioning has two consequences: • it reduces the size of the clusters at each iteration and
Figure 3.4: A MAXMISO extracted from a subject graph G (a) and two examples of SMM partitioning of the MAXMISO : removing n7 (b) and removing n5 (c).
• it reduces the total number of inputs of the clusters. In this way, the generated MISO clusters can also be used for instruction-set customization on architectures with severe hardware constraints. Figure 3.5 gives a visual idea of the iterative application of the SMM partitioning, likening a SMM to a box: we take a box (a SMM A), we remove one of the small boxes contained inside (we remove one node a from A) and we divide the leftover space in the box into smaller boxes (SMM partitioning of A \ {a}) (see Figure 3.5(a)). After that, if one of the boxes is still too big and we want to divide it into smaller boxes, we repeat the same procedure: we remove one of the small boxes and we divide the leftover space into smaller boxes. This procedure can be repeated a finite number of times, as shown in Figure 3.5(b)-(d). As the size of the boxes is reduced at each iteration, we will arrive at a point in which the boxes have the minimum possible size and cannot be further divided (each of the generated SMMs has only one node) (see Figure 3.5(e)). In the following section, we describe the algorithm for SMM partitioning in more detail.
3.3.3 The SMM Partitioning Algorithm Let us consider a subject graph G and the MAXMISO partitioning of G: GMM = {MM1, ..., MMl}. From the way SMMs are defined, we know that the SMM partitioning is affected by the node that is initially removed. As a consequence, we have two possible scenarios: • the same type of node is removed from each subgraph to partition in SMMs; the corresponding SMM partitioning is called FIX SMM partitioning, • a different type of node is removed from each subgraph to partition in SMMs; the corresponding SMM partitioning is called VARIABLE SMM partitioning. Algorithm 3.2 presents the FIX SMM Partitioning algorithm. The algorithm takes as input the set of MAXMISOs GMM. For each MAXMISO MMi, a node vi is removed and the function GENERATEMAXMISO is called to partition MMi − {vi} into MAXMISOs. At the end of the partitioning, SET1 contains all the generated SMMs, SET2 contains the removed nodes and SET3 contains the
Figure 3.5: The iterative application of the SMM partitioning likened to the consecutive partitioning of a box into smaller boxes (panels (a)–(e)).
generated SMMs plus the removed nodes. All the removed nodes are of the same type: an exit node, or a node from a specific level, or a node with a specific property previously defined, etc.
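The FIX SMM procedure just described can be sketched in a few lines. `partition_maxmiso` stands in for the GENERATEMAXMISO routine of the MAXMISO partitioning (defined earlier in the chapter) and is injected as a parameter so the sketch stays self-contained; the stub partitioner and the `pick` policy in the demo are our own illustrative assumptions:

```python
# Sketch of the FIX SMM idea: for every MAXMISO, remove one node chosen
# by the same policy `pick`, then re-run the MAXMISO partitioning
# (here the injected `partition_maxmiso`) on the remaining nodes.

def fix_smm(maxmisos, pick, partition_maxmiso):
    smms, removed = [], []
    for mm in maxmisos:                          # for all i = 1..l
        v = pick(mm)                             # choose v_i in MM_i
        rest = [n for n in mm if n != v]
        smms.extend(partition_maxmiso(rest))     # partition MM_i - {v_i}
        removed.append(v)
    return smms, removed                         # Set1, Set2 (Set3 = both)

# Demo with a stub partitioner that splits the leftover into singletons.
smms, removed = fix_smm([["a", "b", "c"], ["d", "e"]],
                        pick=lambda mm: mm[-1],
                        partition_maxmiso=lambda ns: [[n] for n in ns])
print(removed)  # ['c', 'e']
print(smms)     # [['a'], ['b'], ['d']]
```

The VARIABLE variant differs only in wrapping the node choice and repartitioning in a loop that retries until a per-MAXMISO property is satisfied.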
Algorithm 3.2 Partition of a graph G = (V, E) in SMMs – FIX SMM
Input: GMM = {MM1, ..., MMl}
Output: GSMM
  Set1, Set2, Set3 ⇐ ∅
  for all i = 1..l do
    Choose vi ∈ MMi
    GENERATEMAXMISO(MMi − {vi})
    Set1 ⇐ Set1 ∪ GSMM^(MMi−vi)
    Set2 ⇐ Set2 ∪ {vi}
    Set3 ⇐ Set1 ∪ Set2
  end for

Algorithm 3.3 Partition of a graph G = (V, E) in SMMs – VARIABLE SMM
Input: GMM = {MM1, ..., MMl}, Properties P1, ..., Pn
Output: GSMM
  Set1, Set2, Set3 ⇐ ∅
  for all i = 1..l do
    repeat
      Choose vi ∈ MMi
      GENERATEMAXMISO(MMi − {vi})
    until Property Pi is satisfied
    Set1 ⇐ Set1 ∪ GSMM^(MMi−vi)
    Set2 ⇐ Set2 ∪ {vi}
    Set3 ⇐ Set1 ∪ Set2
  end for

Algorithm 3.3 presents the VARIABLE SMM Partitioning algorithm. The algorithm takes as input the set of MAXMISOs {MM1, ..., MMl}. For each MAXMISO MMi, a node vi is selected to be removed and the function GENERATEMAXMISO is called to partition the MAXMISO bereft of the node. For each MAXMISO MMi, the partitioning is repeated until the generated SMMs satisfy a certain property. As before, when the MAXMISOs are all partitioned, SET1, SET2 and SET3 contain all the generated SMMs, the removed nodes, and the generated SMMs plus the removed nodes, respectively. Remark 3.3.2. From Remark 3.2.2, we know that the MAXMISO partitioning algorithm has complexity linear in the number of nodes of the graph under consideration. The SMM partitioning algorithm is the MAXMISO partitioning
of each MAXMISO bereft of a node. Consequently, the algorithm for the SMM partitioning has complexity linear in the number of nodes analyzed as well. The previous remark about the complexity of the SMM partitioning is clearly valid for both versions of the SMM partitioning: FIX and VARIABLE. As mentioned before, the SMM partitioning can be applied recursively to the generated clusters to further reduce their size. In particular, we have the following: Remark 3.3.3. The function GENERATEMAXMISO is called t times, with t ≤ |G| for the VARIABLE SMM partitioning and t = |GMM| for the FIX SMM partitioning. The proof is immediate by looking at Figure 3.6. Remark 3.3.4. Each SMM is contained in a MAXMISO. This means that the SMM partitioning is a refinement of the MAXMISO partitioning. Additionally, the number of MIMO clusters which it is possible to generate increases. More precisely, if there are l MAXMISOs and j SMMs with j = l + α and α > 0, the number of additional MIMO clusters which it is possible to generate is equal to:

2^j − 2^l = 2^(l+α) − 2^l = 2^l (2^α − 1).    (3.8)

The integer α is a positive number since the number of SMMs is always greater than or equal to two times the number of MAXMISOs. In Table 5.3, we present some information about the FIX SMM partitioning of a set of applications or kernels of applications considered in this dissertation. The table includes information about the partitioning of an application into SMMs for different selections of the node to remove (SMM1−SMM7) (see Section 5.5 for more details), and into MAXMISOs to show the main differences between the two partitionings. The table provides information about the total number of nodes, MAXMISOs and SMMs, the number of unique MAXMISOs and unique SMMs, and information about the total number of levels in the associated collapsed graph. Remark 3.3.5. In this dissertation, the SMM partitioning is defined as the MAXMISO partitioning of a MAXMISO bereft of a node. The partitioning, as shown throughout the section, does not depend on the single-output property of the graph that is partitioned. It depends only on the node that is removed from the graph. This means that the SMM partitioning can be applied to MIMO graphs as well: a node is selected to be removed from the graph and the
MAXMISO partitioning is applied to the MIMO graph bereft of the node. Clearly, the computational complexity is not affected and it continues to be linear with the number of processed nodes.
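The count of additional MIMO clusters from Remark 3.3.4, equation (3.8), can be checked numerically; l and α below are arbitrary example values:

```python
# Numeric check of equation (3.8): with l MAXMISOs and j = l + alpha SMMs,
# the extra subsets (candidate MIMO clusters) number
# 2**j - 2**l = 2**l * (2**alpha - 1).
l, alpha = 5, 3
j = l + alpha
print(2**j - 2**l, 2**l * (2**alpha - 1))  # 224 224
```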
3.4 Summary The current chapter presents two algorithms for the partitioning of an application into single-output instructions. The first algorithm partitions the application into single-output instructions of maximal size called MAXMISOs. Depending on the available hardware resources, the output of the MAXMISO partitioning can be a set of instructions not suitable for hardware implementation. This is the case, for example, when the number of input operands of a custom instruction is limited or when a generated cluster requires additional hardware resources. In this case, the second algorithm presented in this chapter can be used to address the problem. This algorithm, based on the MAXMISO partitioning, repartitions the maximal single-output instructions into a set of smaller single-output instructions called SUBMAXMISOs. When tight constraints are imposed on the custom instructions, the iterative application of the SMM partitioning reduces, at each iteration, the size of the clusters to combine, making the single-output instructions appropriate for the MIMO clustering and the available hardware resources. The following chapter addresses the second step in the customization process: the (re)partitioning of the application into MIMO instructions to atomically execute in hardware, based on clustering of single-output instructions.
Note. The content of this chapter is based on the following paper: C. Galuzzi, K. Bertels and S. Vassiliadis, A Linear Complexity Algorithm for the Generation of Multiple Input Single Output Instructions of Variable Size, SAMOS VII, July 2007.
Figure 3.6: Schematic SMM partitioning of an application with the FIX and VARIABLE SMM partitioning algorithms. Each square represents a MAXMISO (A). A node is removed from each MAXMISO (B) and the MAXMISO partitioning algorithm is applied to each MAXMISO bereft of the node (C).
4 Multiple-Output Instructions
The previous chapter presented two partitioning algorithms for the identification of single-output instructions of maximal or variable size. In this chapter, single-output instructions are further combined into multiple-output instructions to be atomically executed on the available hardware. The chapter includes different types of MIMO partitioning, depending on the shape of the subject graph of the application. Depending on the available hardware resources, the generated instructions can undergo a selection process which selects only a limited number of them. The chapter starts with the description of a property necessary to guarantee the atomic execution in hardware of the generated instructions. Then, in Sections 4.3 and 4.4, the different clustering algorithms are presented. Finally, Section 4.5 summarizes the chapter.
4.1 Introduction

MIMO instructions are obtained by combining a certain number l ≥ 1 of single-output instructions. To atomically execute these instructions in hardware, we have to guarantee, for each instruction, that:

• it satisfies all the hardware constraints to be implemented in hardware,
• it can be atomically executed in hardware.

In particular, since a MIMO instruction can be the combination of only one single-output instruction, we have to guarantee that the single-output instructions satisfy these requirements as well.

Concerning the first requirement, we note that the partitioning algorithms described later in the chapter take into consideration, amongst others, the maximum number of inputs and outputs of the instructions during the generation process. The MAXMISO and SMM partitioning algorithms do not: the only hardware information used during the single-output partitioning process is the size of the available hardware. Although this can be seen as inefficient when limitations on inputs and outputs are introduced, the motivation lies in the clustering policies used by the MIMO partitioning algorithms. As we will see, the total number of inputs and outputs of a custom instruction varies during its generation. This means that input and output limitations are not necessary during the generation of the single-output instructions, but only later on. A second issue concerns the available hardware resources. Since MIMO instructions are combinations of MISO instructions, a necessary but not sufficient condition for a MIMO instruction to fit into the available hardware is that each of its single-output components fits into the available hardware. When this is not the case, the single-output instructions are iteratively repartitioned into sets of smaller-size instructions using the SMM partitioning algorithm.

Concerning the second requirement, when a set of instructions is fused into a single instruction that is atomically executed in hardware, the instruction has to be functionally executable. For example, consider the subgraph G2 shown in Figure 3.1. If G2 is fused into a single instruction, assuming that all inputs are available at issue time and all results are generated at the end of the instruction execution, there exists no feasible scheduling for G2. This is because there exists a path between the nodes of G2 which includes nodes not belonging to G2 (n3, in this case). Convexity is the property which guarantees that this eventuality does not occur. A formal definition of the convexity of a subgraph is as follows.

Definition 4.1.1.
Let G be a graph and let G∗ ⊂ G be a subgraph of G. G∗ is called a convex subgraph of G if, for any two nodes in the subgraph, there exists no path between them that involves nodes not belonging to the subgraph itself.

Convexity is therefore the property which guarantees a feasible scheduling of a custom instruction. By Definition 3.1.2, single-output instructions are always convex. More specifically, this means that single nodes, MAXMISOs and SMMs are convex instructions and can be atomically executed in hardware. In this dissertation, the convexity of the generated MIMO instructions is guaranteed by construction, and the instructions are generated and selected automatically, as we will see in the following sections and in Chapter 5.
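Definition 4.1.1 admits a direct mechanical test: a cluster is non-convex exactly when some node outside it is both reachable from the cluster and reaches back into it. A minimal Python sketch, assuming the graph is given as successor lists (the encoding and the function name are assumptions of this illustration):

```python
def is_convex(succ, cluster):
    """Check Definition 4.1.1 on a DAG given as successor lists: a cluster
    is non-convex iff some outside node is both fed by the cluster and
    feeds back into it (it then lies on a path between two cluster nodes)."""
    cluster = set(cluster)

    def reachable_from(starts, edges):
        seen, stack = set(), list(starts)
        while stack:
            n = stack.pop()
            for m in edges.get(n, []):
                if m not in seen:
                    seen.add(m)
                    stack.append(m)
        return seen

    # predecessor map, for walking the graph backwards
    pred = {}
    for n, ss in succ.items():
        for s in ss:
            pred.setdefault(s, []).append(n)

    below = reachable_from(cluster, succ) - cluster   # outside nodes fed by the cluster
    above = reachable_from(cluster, pred) - cluster   # outside nodes feeding the cluster
    return not (below & above)
```

For the G2/n3 situation described above, where an external node sits on a path between two cluster nodes, the test returns False.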
We call instruction candidate a custom instruction which can be selected as an application-specific instruction-set extension. We have the following.

Definition 4.1.2. We call Automatically Fused Instruction a convex instruction candidate that has been identified and selected via an automatic process for hardware implementation as an application-specific instruction-set extension.

Since the combination of single-output instructions can be non-convex, we need to determine when it is convex. In the next section, we show how single-output instructions can be combined into convex connected or disconnected multiple-output instructions.
4.2 Convex Clustering

Let us consider a subject graph G with set of nodes V. We have the following lemmas.

Lemma 4.2.1. Let ni, nj ∈ V be two nodes of G at level LEV(ni) and LEV(nj), respectively. Let C = ni ∪ nj. If

    LEV(ni) = LEV(nj)                                                  (4.1)

then C is a convex disconnected MIMO subgraph.

Proof. By contradiction, if C is not convex, there exists at least one path between ni and nj which involves one or more nodes not belonging to C. This means that there exists at least one path from ni to nj (or from nj to ni). Therefore:

    LEV(nj) ≥ LEV(ni) + 1   (or LEV(ni) ≥ LEV(nj) + 1),                (4.2)

which contradicts the assumption LEV(ni) = LEV(nj). As a result, C is a convex disconnected MIMO subgraph.

Corollary 4.2.1. Any combination of nodes at the same level is a convex disconnected MIMO subgraph.

Proof. Assume that a combination of nodes at the same level is not a convex MIMO subgraph. Then there exists at least one path between two of these nodes that includes extra nodes not belonging to the subgraph. This contradicts the previous lemma.
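Corollary 4.2.1 translates directly into a construction: compute the level of every node and group nodes per level; every group is then a convex disconnected MIMO subgraph. The sketch below assumes, for illustration, that LEV(n) is the length of the longest path from a source to n, which makes every edge (ni, nj) satisfy LEV(nj) ≥ LEV(ni) + 1 as used in the proof; names are hypothetical.

```python
from collections import defaultdict

def levels(succ):
    """LEV(n): longest-path distance from a source (node without
    predecessors); sources sit at level 0.  succ lists every node as a key."""
    pred = defaultdict(list)
    for n, ss in succ.items():
        for s in ss:
            pred[s].append(n)
    memo = {}
    def lev(n):
        if n not in memo:
            ps = pred.get(n, [])
            memo[n] = 0 if not ps else 1 + max(lev(p) for p in ps)
        return memo[n]
    return {n: lev(n) for n in succ}

def same_level_groups(succ):
    """Group nodes by level: each group is a convex disconnected MIMO
    subgraph by Corollary 4.2.1."""
    groups = defaultdict(set)
    for n, a in levels(succ).items():
        groups[a].add(n)
    return dict(groups)
```

In a diamond graph a → b, a → c, b → d, c → d, nodes b and c share level 1 and therefore form a convex disconnected pair.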
Lemma 4.2.1 and Corollary 4.2.1 combine single nodes into convex subgraphs. How can we combine subgraphs into convex subgraphs? The next lemma gives an answer for MAXMISO subgraphs.

Lemma 4.2.2. Let MMi, MMj ⊂ G be two MAXMISOs of G at level LEV(MMi) and LEV(MMj), respectively. Let C = MMi ∪ MMj. If

    LEV(MMi) = LEV(MMj)                                                (4.3)

then C is a convex disconnected MIMO subgraph.

Proof. Let f : G → Ĝ be the collapsing function. Let us consider ai = f(MMi), aj = f(MMj) and c = f(C). By the definition of the level of a MAXMISO, Eq. (4.3) is equivalent to:

    LEV(f(MMi)) = LEV(f(MMj))  ⇒  LEV(ai) = LEV(aj).                   (4.4)

By Lemma 4.2.1, c is a convex disconnected MIMO subgraph. Since f is a one-to-one correspondence, the inverse function f⁻¹ : Ĝ → G gives that C = MMi ∪ MMj is a convex disconnected MIMO subgraph.

Figure 4.1 shows an application of the previous lemmas. A subject graph G is partitioned into MAXMISOs (the solid boxes in Figure 4.1(a)) and transformed into the collapsed graph Ĝ via the collapsing function f. By Lemma 4.2.1, the combination of nodes at the same level is a convex disconnected subgraph (Figure 4.1(b)). By Lemma 4.2.2, the combination of two MAXMISOs at the same level is a convex disconnected MIMO subgraph (Figure 4.1(c)). Lemma 4.2.2 can be extended to any combination of MAXMISOs at the same level as a consequence of Corollary 4.2.1.

Figure 4.1: A subject graph and examples of convex disconnected MIMO subgraphs. The figure shows a subject graph G partitioned into MAXMISOs (the solid boxes) and transformed into the collapsed graph Ĝ via the collapsing function f (a), and two examples of convex disconnected MIMO subgraphs as described in Lemma 4.2.1 (b) and in Lemma 4.2.2 (c).

If the nodes or the MAXMISOs do not belong to the same level, how can we combine them into a convex subgraph? If they belong to consecutive levels, we have the following result.

Lemma 4.2.3. Let ni, nj ∈ V be two nodes of G with LEV(ni) > LEV(nj). Let C = ni ∪ nj. If

    LEV(ni) − LEV(nj) = 1                                              (4.5)

then C is a convex connected or disconnected MIMO subgraph.

Proof. By contradiction, if C is not convex, there exists at least one path between ni and nj which involves one or more nodes not belonging to C. This means that there exists at least one node nk ∉ C on a path between ni and nj. Then:

    LEV(nj) < LEV(nk) < LEV(ni).                                       (4.6)

Since LEV(ni) − LEV(nj) = 1, LEV(nk) ∉ ℕ, which is a contradiction. As a result, C is a convex MIMO subgraph.

Lemma 4.2.3 can easily be extended to MAXMISOs and to combinations of nodes or MAXMISOs at the same level. Although not explicitly stated by the previous lemmas, the elements that are combined, nodes or MAXMISOs, are convex subgraphs by default. The previous results can then be generalized as described by Theorem 4.2.1 below.

Remark 4.2.1. The collapsing function, defined in Section 3.3.1 for MAXMISOs, can be generalized to subgraphs belonging to a cover of disjoint elements, not necessarily MAXMISOs. If {A1, ..., Ak} is a cover of disjoint subgraphs of a graph G, the (generalized) collapsing function is the function f : G → Ĝ that maps a subgraph of G to a node of Ĝ: f : Ai ⊂ G ↦ ni ∈ Ĝ. Since the function f is bijective: f(Ai ∩ Aj) = f(Ai) ∩ f(Aj) = ni ∩ nj and f⁻¹(ni ∩ nj) = f⁻¹(ni) ∩ f⁻¹(nj) = Ai ∩ Aj.

Theorem 4.2.1. Let G be a subject graph and let {Ai}, with i = 1, ..., k ∈ ℕ, be a minimal cover of convex and disjoint subgraphs. Let {Aa1, ..., Aal} ⊂ {A1, ..., Ak} with |LEV(Aai) − LEV(Aaj)| ∈ {0, 1} ∀ i ≠ j. Any combination of {Aa1, ..., Aal} is a convex MIMO subgraph. In particular, the MIMO subgraph is disconnected if LEV(Aai) = LEV(Aaj) ∀ i ≠ j.

Proof. If Ai is a node for every index i, the proof follows by Lemmas 4.2.1 and 4.2.3. If {Ai} is a cover of disjoint subgraphs, the proof becomes trivial by considering the (generalized) collapsing function f.

Remark 4.2.2. The previous theorem holds for minimal covers of nodes, MAXMISOs or SMMs, since these covers contain only convex and disjoint subgraphs. Although these covers are minimal covers, we note that the minimality of the cover is a hypothesis not used by the theorem.
This requirement is mainly introduced to allow a better clustering with the MIMO clustering algorithms. Remark 4.2.3. Let G be a graph with set of nodes V = {a1 , ..., an }. Let L EV (ai ) be the level of a node ai . It is possible to associate two values to each graph: the depth of the graph (DEPTH or DEPTH (G )) and the width of the graph (WIDTH or WIDTH (G )). The depth of a graph, described in Section
3.3.1, is equal to the total number of levels of the graph. Let nα be the total number of nodes at level α. The width of the graph is defined as follows:

    WIDTH = max_α nα.                                                  (4.7)

The width of the graph, like the depth of the graph, is a finite integer, as the graphs are always finite (|V| < +∞). By defining

    K(G) = WIDTH(G) / DEPTH(G),

we can divide the graphs into three classes:

• graphs with K(G) > 1,
• graphs with 0 < K(G) < 1,
• graphs with K(G) = 1.

The proposed MIMO clustering algorithms can be divided according to the class of graphs we deal with. Table 4.1 shows how the proposed algorithms are divided according to the most suitable class of graphs. Nevertheless, each algorithm can be applied to graphs belonging to all three classes.

Table 4.1: Clustering algorithms and most suitable domains of application.
Algorithm    K(G)              Type of solution
Parallel     K(G) ≫ 1          Optimal
Nautilus     0 < K(G) ≪ 1      Non-optimal
Spiral       K(G) ∼ 1          Non-optimal
In the next section, we present the Parallel clustering algorithm, an application of Theorem 4.2.1 to the case in which the elements to combine are at the same level. The result, as we will see, is an optimal set of convex disconnected MIMO subgraphs. Section 4.4 presents the Nautilus clustering algorithm and the Spiral clustering algorithm, two heuristics based on Theorem 4.2.1 for the generation of convex connected MIMO subgraphs.
4.3 The Parallel Clustering Algorithm

Note. The custom instructions generated and selected to be implemented in hardware as application-specific instruction-set extensions are meant to speed up the execution of an application, and their number depends on the available hardware resources. For this reason, we associate to each node in the
subject graph three numbers: the hardware latency, the software latency and the area. The first two represent the latency of the node (operation) when implemented in hardware and software, respectively. The third represents the area occupied by the node (operation) when implemented in hardware. The MIMO clustering algorithms make use of these numbers during the generation of the automatically fused instructions. How these numbers are collected is described in Section 5.4.

Let G be a subject graph with set of nodes V. Let d be the depth of the graph. Let HW and SW represent two disjoint sets of nodes such that V = HW ∪ SW. Each node n of G is identified by two indices i, j, where i is the level of the node and j is its position at level i. Let l_HWij and l_SWij be the latency of node nij when assigned to HW and SW, respectively. Let A and αij be the total available hardware resources (total area) and the hardware resources (area) that a node nij needs when implemented in hardware.

The problem of generating and selecting new instructions to atomically execute in hardware as application-specific instructions can be formulated as follows: given a certain number of nodes (instructions) distributed over the levels of the subject graph, we want to select subsets of nodes at the same levels to implement in hardware, based on the available hardware resources, so as to minimize the total execution time of the application. Different nodes are selected at the same level to increase the number of instructions executed in parallel in hardware. Formally, the problem can be described as follows.

Problem Statement. Find the optimal subset of nodes HW ⊂ V such that the following function is minimized:

    min   Σ_{nij ∈ SW} l_SWij  +  Σ_{i=0}^{d}  max_{nij ∈ HW} l_HWij,  (4.8)

under the following constraint:

    Σ_{nij ∈ HW} αij ≤ A.                                              (4.9)
The function in (4.8) represents the minimization of the total execution time of the application. The first term is the execution time of the nodes (operations) that are executed in software and have a sequential execution. The second term represents the latency of the nodes (operations) that are selected for hardware execution in parallel at each level. The constraint expressed by (4.9) represents the requirement that the new instructions should fit into the available hardware resources.
The problem can be solved as a 0–1 Linear Programming problem, searching for an optimal solution with an efficient solver.

0-1 Selection. Every node of G belongs either to the HW set or to the SW set. As a consequence, we associate a Boolean variable xij to each nij ∈ V such that:

    xij = 1   if nij ∈ HW,
    xij = 0   if nij ∈ SW,

where i ∈ {0, ..., d} represents the level LEV(nij), and j represents the position at level i. The search for the optimal subset HW is then equivalent to the search for optimal 0/1 values of all the xij.

Objective Function. The original function to minimize in (4.8) can be translated into the following objective function:

    Σ_{nij ∈ V} l_SWij · (1 − xij)  +  Σ_{i=0}^{d}  max_{nij ∈ V} l_HWij · xij.   (4.10)

If xij = 1 then 1 − xij = 0, and vice versa. This means that, in the previous equation, we can sum over nij ∈ V instead of nij ∈ SW or nij ∈ HW, given that V = HW ∪ SW and that HW and SW are two disjoint sets. The max function included in the objective function makes the problem non-linear and therefore hard to solve efficiently. As a consequence, we transform the objective function by adding, for each level, a new integer variable xmax_i, which takes the value of the largest hardware latency at that level. More specifically, the objective function becomes:
    Σ_{nij ∈ V} l_SWij · (1 − xij)  +  Σ_{i=0}^{d} xmax_i,             (4.11)

with the additional constraints:

    xmax_i ≥ l_HWij · xij,   ∀ i ∈ {0, ..., d}, ∀ nij at level i.      (4.12)

Linear System of Inequalities. The original constraint given by (4.9) can then be expressed as follows:

    Σ_{nij ∈ V} αij · xij ≤ A.                                         (4.13)
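For collapsed graphs as small as those in the examples, the 0–1 model (4.11)–(4.13) can even be solved by exhaustive enumeration rather than an ILP solver. The sketch below is an illustration of the model only; the data encoding (level, position) → (software latency, hardware latency, area) and the function name are hypothetical:

```python
from itertools import product

def solve_parallel(nodes, A):
    """Brute-force 0/1 search over the model (4.11)-(4.13) (sketch;
    an ILP solver is needed for realistic sizes).

    nodes: dict (level, position) -> (l_sw, l_hw, area).
    Returns (best_cost, set of (level, position) keys assigned to hardware)."""
    keys = sorted(nodes)
    best = (float('inf'), set())
    for bits in product((0, 1), repeat=len(keys)):
        x = dict(zip(keys, bits))
        if sum(nodes[k][2] * x[k] for k in keys) > A:     # area constraint (4.13)
            continue
        sw = sum(nodes[k][0] * (1 - x[k]) for k in keys)  # software terms of (4.11)
        lvls = {k[0] for k in keys}
        hw = sum(max(nodes[k][1] * x[k] for k in keys if k[0] == i)
                 for i in lvls)                           # xmax_i per level (4.12)
        cost = sw + hw
        if cost < best[0]:
            best = (cost, {k for k in keys if x[k]})
    return best
```

With two level-0 nodes of area 5 each and a budget of 10, both move to hardware and the cost is the level's maximum hardware latency; halving the budget forces one of them back to software.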
When using an efficient solver, it is possible to find the optimal solution of the instruction generation and selection problems for the available hardware resources. Note that, by using the collapsing function, it is possible to apply the Parallel clustering algorithm to generate and select instructions by combining subgraphs instead of nodes. By Theorem 4.2.1, it is therefore possible to cluster subgraphs belonging to a minimal cover of convex and disjoint subgraphs; in particular, this holds for MAXMISO and SMM covers. Example 4.3.1 shows an application of the Parallel clustering algorithm clustering MAXMISOs instead of single nodes.

Example 4.3.1. Let us consider the graph in Figure 4.2. The figure shows a subgraph G∗ of the data-flow graph of the ADPCM decoder as implemented in the MediaBench benchmark suite [72]. By using the Parallel clustering algorithm, we want:

• to identify a certain number of new instructions to implement in hardware,
• to generate the new instructions by combining MAXMISOs.

At first, G∗ is partitioned into MAXMISOs (the dashed boxes in Figure 4.2(a)). Via the collapsing function f, the graph G∗ is mapped onto the collapsed graph Ĝ∗ (Figure 4.2(b)). We associate to each MAXMISO a Boolean variable xij, as shown in Table 4.2. For each MAXMISO, we estimate the hardware and software latencies and the area occupied if implemented in hardware.

Table 4.2: Assignment of a Boolean variable to each MAXMISO identified in the subject graph of Figure 4.2(a). The first index i represents the level of the MAXMISO in the collapsed graph Ĝ∗. The second index j represents the position of the MAXMISO at level i, clockwise.
MAXMISO    Boolean Variable xi,j
MM0        x3,0
MM1        x0,0
MM2        x1,0
MM3        x3,1
MM4        x0,1
MM5        x2,0
MM6        x2,1
Figure 4.2: Generation of custom instructions with the Parallel clustering algorithm. The convex disconnected MIMO subgraphs are obtained by combining MAXMISOs. The figure shows a subgraph G∗ of the data-flow graph of the ADPCM decoder, partitioned into MAXMISOs (the dashed boxes). Via the collapsing function f, each MAXMISO in G∗ is mapped onto a node of the collapsed graph Ĝ∗. By applying the Parallel clustering algorithm, selected MAXMISOs are combined in parallel to be executed in hardware (the grey dashed boxes).
As shown in Figure 4.2, the collapsed graph Ĝ∗ has four levels. This means that there are four xmax variables, one for each level. By (4.11), the objective function to minimize is:

    lsw0,0 · (1 − x0,0) + lsw0,1 · (1 − x0,1) + lsw1,0 · (1 − x1,0) +
    lsw2,0 · (1 − x2,0) + lsw2,1 · (1 − x2,1) + lsw3,0 · (1 − x3,0) +
    lsw3,1 · (1 − x3,1) + xmax0 + xmax1 + xmax2 + xmax3.               (4.14)

By (4.12) and (4.13), the linear system of inequalities is:

    C1 : a0,0 · x0,0 + a0,1 · x0,1 + a1,0 · x1,0 + a2,0 · x2,0 +
         a2,1 · x2,1 + a3,0 · x3,0 + a3,1 · x3,1 ≤ A
    C2 : xmax0 ≥ lhw0,0 · x0,0
    C3 : xmax0 ≥ lhw0,1 · x0,1
    C4 : xmax1 ≥ lhw1,0 · x1,0
    C5 : xmax2 ≥ lhw2,0 · x2,0
    C6 : xmax2 ≥ lhw2,1 · x2,1
    C7 : xmax3 ≥ lhw3,0 · x3,0
    C8 : xmax3 ≥ lhw3,1 · x3,1.                                        (4.15)

Additionally, we have the following constraints:

    C9  : xmax_i   integer,
    C10 : xi,j     Boolean.                                            (4.16)
By considering the Virtex II Pro FPGA XC2VP4 as available hardware, the solution of the ILP problem is graphically represented by the grey dashed boxes in Figure 4.2(b). The optimal set of convex disconnected MIMO instructions is therefore: Opt = {Inst1 , Inst2 , Inst3 }, where Inst1 = MM4 , Inst2 = MM5 ∪ MM6 and Inst3 = MM3 . Summarizing, the main steps required by the Parallel clustering algorithm to generate and select the new convex disconnected MIMO instructions to implement in hardware are the following. Given the subject graph of an application: • STEP 1 - evaluate the hardware and software latencies of each node, • STEP 2 - evaluate the area occupied by each node when implemented in hardware, • STEP 3 - formulate the instruction-generation and instruction-selection problems as a single ILP problem by identifying the objective function (4.11) and the set of constraints (4.12) and (4.13),
• STEP 4 - find the optimal solution of the ILP problem by using an efficient solver.

If the algorithm is used to cluster subgraphs belonging to a minimal cover of convex and disjoint subgraphs, before STEP 1 we have to transform the subject graph into the collapsed graph via the collapsing function. Therefore, in STEPS 1−3, we consider the nodes of the collapsed graph and not the nodes of the subject graph.

Remark 4.3.1. According to Remark 4.2.3, the Parallel clustering algorithm is more suitable for graphs with K(G) ≫ 1. It is important to observe that:

• since the graph is wide, many elements can be combined at the same level, increasing the performance gain provided by the parallel execution in hardware of the different components of the convex MIMO instruction,

• when many elements are combined at the same level, the final cluster can have a number of inputs which exceeds the total number of inputs allowed by the architecture used to implement the new instructions.

This means, more specifically, that the Parallel clustering algorithm can be used, in general, for graphs with K(G) ≫ 1 if the considered architecture does not impose tight limitations on the total number of inputs and/or outputs of the instructions to implement. Should that not be the case, as we will see in the next chapter, we can either address the input/output limitations using methods like the one proposed in [93], or use one of the other algorithms presented in this dissertation and reserve the Parallel clustering algorithm for graphs with 0 < K(G) ≪ 1.
4.4 Convex Connected Subgraphs

In the previous section, the generated clusters are optimal combinations of subgraphs at the same level. Since the clusters are convex and disconnected, they can be executed in hardware and can exploit the parallelism provided by the parallel execution of the different subgraphs (instructions) in hardware. We can also roughly calculate the performance gain for these instructions. Let us assume the hardware latency of a node ni to be li. When k nodes are
combined at the same level, the execution time of the cluster in hardware is max_{i=1..k} li. In this case, the performance gain is:

    Σ_{i=1}^{k} li  −  max_{1≤i≤k} li.                                 (4.17)

If we successively combine nodes through the levels of the graph, the overall performance gain increases. Let us assume that α1, ..., αh are the levels of the nodes belonging to a cluster. The overall performance gain in this case is:

    Σ_{j=α1}^{αh} ( Σ_{ij} l_ij − max_{ij} l_ij ).                     (4.18)
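Equations (4.17) and (4.18) each amount to a one-line computation; a sketch with hypothetical latency lists:

```python
def gain_same_level(latencies):
    """Eq. (4.17): serial execution cost minus the parallel cost
    (the maximum latency) for nodes combined at one level."""
    return sum(latencies) - max(latencies)

def gain_multi_level(level_latencies):
    """Eq. (4.18): the same gain summed over the levels of a cluster.
    level_latencies: dict level -> list of hardware latencies of the
    cluster members at that level."""
    return sum(sum(ls) - max(ls) for ls in level_latencies.values())
```

For example, three same-level nodes with latencies 3, 5 and 2 yield a gain of 5 cycles; a two-level cluster with latencies {3, 5} and {2, 2} yields 3 + 2 = 5.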
MIMO instructions obtained by combining instructions belonging to different levels can therefore provide a higher performance gain than the instructions generated with the Parallel clustering algorithm¹. Hence, two algorithms which generate convex connected MIMO instructions by combining instructions belonging to different levels are presented in the next sections.
4.4.1 The Nautilus Clustering Algorithm

The algorithm presented in this section, the Nautilus clustering algorithm, is described using nodes as the elements combined to generate convex MIMO subgraphs. By using the collapsing function, as described in the previous section, it is possible to use the Nautilus clustering algorithm to generate and select instructions obtained by combining subgraphs instead of nodes. By Theorem 4.2.1, it is therefore possible to cluster subgraphs belonging to a minimal cover of convex and disjoint subgraphs; in particular, this holds for MAXMISO and SMM covers.

Let G(V, E) be a subject graph with set of nodes V = {a1, ..., an}. Let aι be a node of G with LEV(aι) = α ∈ [0, d], where d is the depth of the graph.

¹ In this example, we do not consider the communication costs between software and hardware. These are mainly represented by the I/O communication costs (the latency to load the inputs and store the outputs of the instructions). Although the costs are not considered in the example, the experimental evaluation of the algorithms presented in the next chapter considers the overhead introduced by the I/O communication costs during the generation and selection of the custom instructions.
We can define the following sets:

    PRED′(aι) = { m ∈ V | LEV(m) = α − 1 ∧ (m, aι) ∈ E }   if α ≥ 1
              = ∅                                            if α = 0

    SUCC′(aι) = { m ∈ V | LEV(m) = α + 1 ∧ (aι, m) ∈ E }   if α ≤ d − 1
              = ∅                                            if α = d

    SUCC(aι)  = { m ∈ V | ∃ [aι → m] ∧ LEV(m) > LEV(aι) }  if α ≤ d − 1
              = ∅                                            if α = d.      (4.19)

STEP 1. Let C = aι. The set

    C′ = C ∪ PRED′(aι)                                                      (4.20)
is a convex MIMO subgraph. This holds for α ≥ 1 as a consequence of Theorem 4.2.1, and for α = 0 since a node is trivially a convex subgraph.

STEP 2. Let us consider SUCC′(PRED′(aι)). Let NIn(n) and NIn_C(n) be the number of inputs of a node n and the number of inputs of n coming from a set C, respectively. For each node n and each set C, the following inequality can be required:

    β · NIn_C(n) ≥ NIn(n),                                                  (4.21)

with β ∈ ℕ∗. In particular, if β = 1, the inequality becomes an equality, since the number of inputs of a node n coming from a set C cannot be greater than the total number of inputs of the node. This can be reformulated by saying that the number of inputs of n coming from the set C has to be greater than or equal to 1/β of the total inputs of n. This condition is introduced so as to have control over the total number of inputs of the final cluster. Figure 4.3 shows two possible choices for β: β = 2 (Figure 4.3(c)) and β = 4 (Figure 4.3(c'))². Let us define the following set:

    C′′ = C′ ∪ { n ∈ SUCC′(PRED′(aι)) | (4.21) holds }   if such an n exists
        = C′                                              otherwise.         (4.22)

By Theorem 4.2.1, if there exists n such that (4.21) holds, C′′ = C′ ∪ {n} is a convex subgraph.

² Although we are considering nodes belonging to the subject graph G, in Figure 4.3 some of the nodes have more than two outputs. As mentioned at the beginning of the section, the algorithm can also be used to cluster MAXMISOs or SMMs, for example. For illustrative purposes only, we use a subgraph of the collapsed graph of the subject graph of an application.
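Steps 1 and 2 can be illustrated in code. The sketch below grows C′ from a seed via PRED′ and then filters SUCC′(PRED′(aι)) with condition (4.21); the graph encoding, the level map and the function name are assumptions of this illustration, and only a single filtering pass is shown:

```python
def nautilus_step(succ, lev, seed, beta):
    """Steps 1-2 of the Nautilus construction (sketch): C' is the seed
    plus its level-(alpha-1) predecessors, Eq. (4.20); nodes of
    SUCC'(PRED'(seed)) are then added when beta * NIn_C(n) >= NIn(n),
    condition (4.21)."""
    pred = {}
    for n, ss in succ.items():
        for s in ss:
            pred.setdefault(s, []).append(n)
    alpha = lev[seed]
    pred1 = {m for m in pred.get(seed, []) if lev[m] == alpha - 1}
    c = {seed} | pred1                                   # C' of Eq. (4.20)
    for p in pred1:                                      # SUCC'(PRED'(seed))
        for n in succ.get(p, []):
            if n in c or lev[n] != lev[p] + 1:
                continue
            n_in = len(pred.get(n, []))
            n_in_c = sum(1 for q in pred.get(n, []) if q in c)
            if beta * n_in_c >= n_in:                    # condition (4.21)
                c.add(n)
    return c
```

With a node that shares one of its two inputs with the cluster, β = 2 admits it while β = 1 (all inputs from the cluster) rejects it.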
=
C ′ ∪ {n ∈ C′
SUCC′ ( PRED′ (aι ))
| (4.21) holds}
if n exists otherwise.
(4.22)
By Theorem 4.2.1, if there exists n such that (4.21) holds, C ′′ = C ′ ∪ {n } is a convex subgraph. 2 Although we are considering nodes belonging to the subject graph G , in Figure 4.3, some of the nodes have more than two outputs. As mentioned at the beginning of the section, the algorithm can be used also to cluster MAXMISOs or SMMs for example. For illustrative purposes, we use a subgraph of the collapsed graph of the subject graph of an application only.
Figure 4.3: Example of construction of a convex MIMO subgraph with the Nautilus clustering algorithm. The figure shows a subgraph of the collapsed graph of a subject graph of an application. Starting from node C = {n4 } (a), we build C ′ = {n4 , n7 , n8 , n9 , n10 } (b). After that, using β = 2 in Eq. (4.21), we generate C ′′ = {n4 , n5 , n7 , n8 , n9 , n10 } (c) and C ′′′ = {n1 , n2 , n4 , n5 , n7 , n8 , n9 , n10 } (d). Using β = 4, we have C ′′ = {n4 , n5 , n6 , n7 , n8 , n9 , n10 } (c’) and C ′′′ = {n1 , n2 , n3 , n4 , n5 , n6 , n7 , n8 , n9 , n10 } (d’).
STEP 3. Let A be a cluster of nodes of G with h ≤ LEV(n) ≤ k for every node n ∈ A. Let us define the following set:

    SUCC∗(A) = { m ∈ V\V_A | ∃ (n, m) ∈ E ∧ h + 1 ≤ LEV(m) ≤ k + 1 }   if k ≤ d − 1
             = { m ∈ V\V_A | ∃ (n, m) ∈ E ∧ h + 1 ≤ LEV(m) ≤ k }       if k = d.    (4.23)

Then the set:

    C′′′ = C′′ ∪ { n ∈ SUCC∗(C′′) | NIn_C′′(n) = NIn(n) }                           (4.24)

is a convex subgraph. This follows from the requirement NIn_C′′(n) = NIn(n), which guarantees that a node n is included in the subgraph if and only if all its inputs come from C′′. This means that there cannot exist a path between a node of C′′ and n which includes a node not belonging to C′′ ∪ {n}. In a similar way:

    C′′′′ = C′′′ ∪ { n ∈ SUCC∗(C′′′) | NIn_C′′′(n) = NIn(n) }
    ...
    C^z = C^{z−1} ∪ { n ∈ SUCC∗(C^{z−1}) | NIn_C^{z−1}(n) = NIn(n) }                (4.25)
are all convex subgraphs. Since the number of levels and the number of nodes of the graph are finite, z < ∞, and C^z is the generated convex MIMO subgraph.

Algorithm 4.1 presents the Nautilus clustering algorithm. Initially, a node vi is chosen as a seed. The cluster related to that seed is built through the function GENERATECLUSTER. Each time a cluster is generated, its nodes are removed from the set of nodes still to analyze. The process continues until the set of nodes to analyze is empty (the function GENERATECLUSTER identifies the convex MIMO subgraphs as previously described).

Remark 4.4.1. The generated cluster strictly depends on the initial choice of the seed. The seed, as in the SMM partitioning algorithm, can be selected in many different ways: a randomly chosen node, a node from the central levels of the graph, a node on the critical path of the graph, etc. As we will see in the next chapter, the seed is chosen as the node with the smallest hardware latency, starting from the lower levels of the graph. In case many nodes at that level have the same latency, the seed is selected randomly. We note that the node is selected starting from the lower levels of the graph to allow the vertical expansion of the cluster during the generation process. As shown in Figure 4.3, once β is fixed, the algorithm tries to expand the cluster vertically in the graph. This is the main reason why, in Remark 4.2.3, we mentioned that the Nautilus clustering algorithm is more suitable for subject graphs G with 0 < K(G) ≪ 1.
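The STEP 3 expansion to the fixed point C^z can be sketched as follows. The code simplifies the explicit level window of (4.23) and simply absorbs, repeatedly, every successor all of whose inputs already come from the cluster, which adds no new inputs and preserves convexity; names and encoding are illustrative:

```python
def expand_cluster(succ, c):
    """Iterate the C'' -> C''' -> ... -> C^z absorption of (4.24)-(4.25)
    (sketch): repeatedly add any successor n of the cluster with
    NIn_C(n) = NIn(n), i.e. all of its predecessors already in the cluster."""
    pred = {}
    for n, ss in succ.items():
        for s in ss:
            pred.setdefault(s, []).append(n)
    c = set(c)
    changed = True
    while changed:
        changed = False
        frontier = {m for n in c for m in succ.get(n, []) if m not in c}
        for m in frontier:
            if all(p in c for p in pred.get(m, [])):   # NIn_C(m) == NIn(m)
                c.add(m)
                changed = True
    return c
```

A successor fed partly from outside the cluster is never absorbed, so the cluster's input count never grows during the expansion.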
Algorithm 4.1 The Nautilus clustering algorithm
Input: G = (V, E), β
Output: G_MIMO
  Set1 ⇐ V
  Set2 ⇐ ∅
  repeat
    Choose vi ∈ Set1
    GENERATECLUSTER(vi)
    Set1 ⇐ Set1 − NODEINCLUSTER
    Set2 ⇐ Set2 ∪ NODEINCLUSTER
  until Set1 = ∅

GENERATECLUSTER(vi)
  1: Generate C′
  2: ...
  3: Generate C^z

Remark 4.4.2. As shown by Algorithm 4.1, after a cluster is generated, its nodes are removed from the nodes still to analyze. This means that the Nautilus clustering algorithm has linear complexity in the number of analyzed nodes.

In Figure 4.3, we present an example of the generation of a convex MIMO subgraph with the Nautilus clustering algorithm. The figure shows a subgraph of the collapsed graph of the subject graph of an application. Starting from the node n4, we build C = {n4} (Figure 4.3(a)) and C′ = {n4, n7, n8, n9, n10} (Figure 4.3(b)). After that, using β = 2, we generate C′′ = {n4, n5, n7, n8, n9, n10} (Figure 4.3(c)) and C′′′ = {n1, n2, n4, n5, n7, n8, n9, n10} (Figure 4.3(d)). Using β = 4, we have C′′ = {n4, n5, n6, n7, n8, n9, n10} (Figure 4.3(c')) and C′′′ = {n1, n2, n3, n4, n5, n6, n7, n8, n9, n10} (Figure 4.3(d')).

A remark concerns the inequality in Eq. (4.21). As depicted in Figure 4.3, the total number of inputs of the final subgraph is the number of inputs of the cluster C′′. This is due to the fact that nodes are added to C′′ if and only if all their inputs come from C′′, nodes are added to C′′′ if and only if all their inputs come from C′′′, and so on. This means that the nodes subsequently included in C′′, firstly, do not provide any additional input to the final cluster and, secondly, reduce the number of
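The outer loop of Algorithm 4.1 reads naturally in code. In this sketch, GENERATECLUSTER is passed in as a parameter, and the seed is simply the remaining node of lowest level — a simplification of the latency-based choice of Remark 4.4.1; all names are hypothetical:

```python
def nautilus(succ, lev, beta, generate_cluster):
    """Driver loop of Algorithm 4.1 (sketch): grow one cluster per seed,
    remove its nodes from the set still to analyze, repeat until empty.
    generate_cluster(succ, lev, seed, beta) plays the role of
    GENERATECLUSTER and returns a set of nodes; the result is a list of
    disjoint clusters covering V."""
    remaining = set(succ)
    clusters = []
    while remaining:
        # seed: remaining node of lowest level (ties broken by name)
        seed = min(remaining, key=lambda n: (lev[n], str(n)))
        cluster = generate_cluster(succ, lev, seed, beta) & remaining
        cluster.add(seed)
        clusters.append(cluster)
        remaining -= cluster
    return clusters
```

Because clustered nodes are removed from the working set, each node is visited a bounded number of times, matching the linear complexity noted in Remark 4.4.2.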
4.4. CONVEX CONNECTED SUBGRAPHS
outputs of C′′. It is hence possible to limit the total number of inputs and outputs of the generated convex MIMO subgraphs by modifying β. Summarizing, the main steps required by the Nautilus clustering algorithm to generate and identify the new convex MIMO instructions are the following. Given the subject graph of an application: • A1 - evaluate the hardware and software latencies of each node, • A2 - evaluate the area occupied by each node when implemented in hardware, • A3 - generate the convex MIMO instructions as described in Algorithm 4.1, • A4 - select a subset of instructions that fits the available hardware and provides a performance gain. If the algorithm is used to cluster subgraphs belonging to a minimal cover of convex and disjoint subgraphs, before A1 we have to transform the subject graph into the collapsed graph by using the collapsing function. Therefore, in A1–A3, we consider the nodes of the collapsed graph and not the nodes of the subject graph. Since the area available to implement the new instructions is limited, during the cluster generation the area of the cluster is always bounded by the total available area: if the inclusion of a node exceeds the available area, the node is withdrawn and the cluster is extended with other nodes, if possible. The selection criteria used to choose a subset of instructions that fits the available hardware resources are described in Section 4.4.4.
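The outer loop of Algorithm 4.1 can be rendered as the following sketch. GENERATECLUSTER is passed in as a function parameter, since its internals (the construction of C′, ..., C^z) are the ones described above; the concrete signatures are assumptions made for this illustration.

```python
def nautilus(nodes, choose_seed, generate_cluster):
    """Outer loop of Algorithm 4.1: repeatedly pick a seed among the
    nodes still to analyze (Set1), grow a convex MIMO cluster around
    it, and move the cluster's nodes to the analyzed set (Set2)."""
    set1 = set(nodes)   # nodes still to analyze
    clusters = []       # Set2, kept partitioned into clusters
    while set1:         # repeat ... until Set1 = empty set
        seed = choose_seed(set1)
        cluster = generate_cluster(seed, set1)  # always contains seed
        set1 -= cluster
        clusters.append(cluster)
    return clusters
```

Since every node is clustered exactly once and then removed, each node is handled a constant number of times by the outer loop, in line with the linear complexity stated in Remark 4.4.2.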
4.4.2 The Spiral Clustering Algorithm In this section, we present the Spiral clustering algorithm, a clustering method based on the notion of the Archimedean spiral. The clustering process is described using nodes; these can be the nodes of the subject graph of an application as well as the nodes of the collapsed graph. We now briefly describe a spiral. The simplest spiral line is the one studied by Archimedes, named the Archimedean Spiral³. This spiral is generated when a point P moves with constant speed v on a line which, in turn, rotates around
³ The Works of Archimedes, edited by Sir Thomas Heath, Dover, New York, 2002.
Figure 4.4: An Archimedean Spiral. The point P is identified by the coordinates Px = r cos B and Py = r sin B, where r = vt and B = ωt. Td is the turn distance.
one of its points, O, with constant angular velocity ω. The point O is called the center of the spiral. The distance Td between two consecutive turns is called the turn distance. Figure 4.4 depicts an Archimedean Spiral. Let LEV1, ..., LEVd be the levels of the nodes of a subject graph G and let O be a node of G with LEV(O) = i. Definition 4.4.1. A spiral search is a type of search which, given a node O, looks for nodes to group with O following a spiral path S with center O. The clustering is done between the levels of the graph. Every time the spiral intersects a level, some or all of the nodes of that level are analyzed and, based on some property to be defined, some of them are included in the cluster. O is called the center (of the spiral search) and the number of levels between two consecutive turns is called the turn distance (of the spiral search). Figure 4.5 depicts a spiral search with turn distance equal to one level. In the figure, the levels are analyzed following the order of intersection between the spiral and the levels. The Spiral clustering algorithm proposed in this section generates each convex MIMO subgraph by means of a spiral search through the levels of the graph. We set Td equal to 1 level. In the following, we present the details of the Spiral clustering algorithm. Let A be a cluster of nodes of G(V, E) with h ≤ LEV(n) ≤ k for every node n ∈ A. In addition to the set SUCC∗(A) defined in (4.23), we define the
Figure 4.5: An example of the spiral search. If O ∈ LEVi is the center of the spiral search, the levels are analyzed following the order of intersection between the spiral and the levels: 1, 2, 3, ....
following set:

             { {PRED′(n) | n ∈ A ∧ LEV(n) = h}       if h ≥ 1
  PRED′(A) = {                                                       (4.26)
             { {PRED′(n) | n ∈ A ∧ LEV(n) = h + 1}   if h = 0
We note that PRED′(A) ≠ {PRED′(n) | n ∈ A}. By looking at the graph G and its subgraph A in Figure 4.6, we have that PRED′(A) = {n6} and {PRED′(n) | n ∈ A} = {n2, n4, n6}. Similarly to the Nautilus clustering algorithm, the final cluster is generated in multiple steps: starting from a node, step by step the node is combined with additional nodes to generate the final convex MIMO subgraph. The initial steps are the same as those of the Nautilus clustering. STEP 1. Given a node aι ∈ G, we build the convex subgraphs C and C′ (see Eq. (4.20)). STEP 2. Given β, we build the convex subgraph C′′ (see Eq. (4.22)). STEP 3. We build the convex subgraph C′′′ (see Eq. (4.24)). The final subgraph is generated by means of a spiral search through the levels. As shown in Figure 4.5, the levels are analyzed following the order of intersection with the spiral. The next intersections to analyze would then be 4 and 5, i.e. level i and level i − 1.
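The convexity property used throughout these proofs — no path between two cluster nodes may pass through a node outside the cluster — can be checked directly. The sketch below is an illustrative implementation over a successor map, under the assumption that the graph is a DAG; it is not part of the algorithms themselves.

```python
from collections import deque

def is_convex(succ, cluster):
    """A cluster of a DAG is convex iff no path leaves the cluster
    and re-enters it. BFS through the nodes outside the cluster that
    are reachable from it; reaching a cluster node again means some
    path left the cluster and came back, i.e. non-convexity."""
    cluster = set(cluster)
    # Start from edges that leave the cluster.
    frontier = deque(m for n in cluster for m in succ.get(n, ())
                     if m not in cluster)
    seen = set()
    while frontier:
        x = frontier.popleft()
        if x in cluster:      # an escaping path re-entered the cluster
            return False
        if x not in seen:
            seen.add(x)
            frontier.extend(succ.get(x, ()))
    return True
```

For the chain a → b → c, the subgraph {a, c} is not convex (the path passes through b), while {a, b} and {b, c} are.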
Figure 4.6: Example showing that PRED′(A) ≠ {PRED′(n) | n ∈ A}. By looking at the graph G and its subgraph A, we have that PRED′(A) = {n6} and {PRED′(n) | n ∈ A} = {n2, n4, n6}.
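Eq. (4.26) can be read operationally as follows. Here `pred_prime` stands for the single-node PRED′ operator defined earlier in the chapter, and the dictionary-based interface, as well as the data used in the test, are assumptions of this sketch rather than the graph of Figure 4.6.

```python
def pred_prime_of_cluster(cluster, level, pred_prime):
    """PRED'(A) as in Eq. (4.26): collect PRED'(n) only for the nodes
    of the cluster lying on its lowest level h (or on level h + 1
    when h = 0), not for every node of the cluster."""
    h = min(level[n] for n in cluster)
    base = h if h >= 1 else h + 1
    result = set()
    for n in cluster:
        if level[n] == base:
            result |= set(pred_prime(n))
    return result
```
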
Lemma 4.4.1. Intersections 4 and 5 do not provide any node p ∉ C′′′ such that C′′′ ∪ {p} is convex and NIn(p) = NInC′′′(p). Proof. Let us consider intersection 4. By contradiction, let p be a node at level i such that C′′′ ∪ {p} is a convex subgraph. Two cases are possible: there exists at least a path between p and one node at level i + 1, or between one node at level i − 1 and p. The first case is not possible, as the nodes at level i + 1 belong to C′′′ if and only if they have all inputs coming from C′′; if such a p existed, there would be at least one node in C′′′ at level i + 1 with one input not coming from C′′. In the second case, all nodes p such that C′′′ ∪ {p} is convex and NIn(p) = NInC′′′(p) have already been included during the generation of C′′. Therefore, no additional node can be further included in the cluster. Let us consider intersection 5. No node p at level i − 1 can satisfy NIn(p) = NInC′′′(p), as the inputs of p come from levels lower than or equal to i − 2. In the same vein, it is possible to show (see Figure 4.5) that all the intersections in the region bounded by the two half-lines s1 and s2 do not contain nodes suitable for inclusion in the convex cluster. By Lemma 4.4.1, following the spiral search, the next levels to analyze are levels i − 2, ..., i + 2.
STEP 4. From level i − 2 we can build:

  C^iv = C′′′ ∪ PRED′(C′′′).                                         (4.27)

From level i − 1 we can build:

  C^v = C^iv ∪ SUCC′(PRED′(C′′′)).                                   (4.28)

The graph C^v is a convex graph. By contradiction, let C^v be non-convex. This means that there exists at least one path between two nodes of C^v which includes a node a ∉ C^v. By construction, the only alternative is LEV(a) = i − 1. By (4.28), if a exists, a ∈ C^v, a contradiction. This means that C^v is a convex subgraph. From level i to i + 2, we build the following subgraphs:

  C^vi  = C^v  ∪ {m ∈ V \ V_C^v  | ∃(n, m) ∈ E ∧ LEV(m) = i     ∧ NIn_C^v(m)  = NIn(m)},
  C^vii = C^vi ∪ {m ∈ V \ V_C^vi | ∃(n, m) ∈ E ∧ LEV(m) = i + 1 ∧ NIn_C^vi(m) = NIn(m)},    (4.29)

           { C^vii ∪ {m ∈ V \ V_C^vii | ∃(n, m) ∈ E ∧ LEV(m) = i + 2 ∧ NIn_C^vii(m) = NIn(m)}   if i + 1 < d
  C^viii = {                                                                                     (4.30)
           { C^vii                                                                               if i + 1 = d
The previous subgraphs are all convex by construction. We prove the convexity of C^vi; the same type of proof can be applied to the other two subgraphs. If C^vi is not convex, there exists at least one path connecting two nodes of C^vi which includes a node a ∉ C^vi. By construction, we have two alternatives: • LEV(a) = i − 1, • LEV(a) = i. The first case is not possible: if a exists, by Eq. (4.28), a ∈ C^v. The second case is not possible as well: if a exists, it means that there is an edge between a and a node of C^vi at level i + 1, which means that C′′′ is not convex, against what has been shown before. After C^viii has been generated, following the spiral search, we have to analyze level i − 3. In general, we have the following result.
Remark 4.4.3. If the maximum level analyzed in STEP 4 is i + θ and the center of the spiral is on level i, the next level to analyze is the level symmetric with respect to the center of the spiral, decreased by one:

  LEVEL = i − [(i + θ) − i] − 1 = i − θ − 1.                         (4.31)
If this level does not exist, we consider the minimum level analyzed so far. STEP 5. We iteratively repeat STEP 4 by considering, as input, the cluster generated in the previous iteration; for example, C^ix = C^viii ∪ PRED′(C^viii), and so on. Since the number of levels and the number of nodes are finite, the clustering stops when there are no additional nodes to include in the cluster. The algorithmic description of the Spiral clustering algorithm is the same as the one in Algorithm 4.1, except that the function GENERATECLUSTER follows the clustering process described in this section. Remark 4.4.4. In this case, as we have seen for the SMMs and for the Nautilus clustering algorithm, the generated cluster strictly depends on the initial choice of the seed. As we will see in Chapter 5, the seed is chosen as the node with the smallest hardware latency, starting from the central levels of the graph. In case many nodes at that level have the same latency, we select the seed randomly. We note that the node is selected starting from the central levels of the graph so as to allow the expansion of the cluster above and below the seed during the generation process. As previously described, starting from the seed, at each step the cluster is expanded both vertically and horizontally. This is the reason why, in Remark 4.2.3, we mentioned that the Spiral clustering algorithm is more suitable for graphs with K(G) ∼ 1. Remark 4.4.5. After a convex cluster is generated, in both the Nautilus and the Spiral clustering algorithms, the nodes of the cluster are removed from the set of nodes still to analyze. This means that the Spiral clustering algorithm also has complexity linear in the number of nodes of the analyzed graph.
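Under the assumption that each sweep of the search widens the analyzed band by one level on each side — consistent with STEP 4 (levels i − 2, ..., i + 2) and with the restart rule of Eq. (4.31) — the order in which the levels are visited can be sketched as follows. The clipping to [0, d] and the `sweeps` bound are illustrative assumptions, not part of the algorithm's published description.

```python
def spiral_level_sweeps(i, d, sweeps):
    """Levels visited by the spiral search (turn distance = 1 level)
    around center level i in a graph with levels 0..d: the first
    sweep covers i-2..i+2; each later sweep starts at the level
    symmetric to the previous maximum, decreased by one (Eq. (4.31)),
    and runs one level higher than before. Non-existent levels are
    skipped."""
    return [[l for l in range(i - t, i + t + 1) if 0 <= l <= d]
            for t in range(2, 2 + sweeps)]
```
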
Figure 4.7: Example of the construction of a convex MIMO subgraph with the Spiral clustering algorithm. The figure shows a subgraph of the collapsed graph of the subject graph of an application. Starting from node n1, we build C = {n1} and C′ = {n1, n2, n3} (a). After that, using β = 2 in Eq. (4.21), we generate C′′ = {n1, n2, n3, n4} (b), C′′′ = {n1, n2, n3, n4, n5, n6} (c), C^iv = {n1, n2, n3, n4, n5, n6, n7, n8} (d), C^v = C^iv, C^vi = {n1, n2, n3, n4, n5, n6, n7, n8, n9} (e), C^vii = C^vi and, finally, C^viii = {n1, n2, n3, n4, n5, n6, n7, n8, n9, n10, n11}.
In Figure 4.7, an example of the construction of a convex MIMO subgraph with the Spiral clustering algorithm is presented. Also in this case, by modifying β in Eq. (4.21), it is possible to limit the total number of inputs and outputs of the generated convex MIMO subgraphs. Summarizing, the main steps required by the Spiral clustering algorithm to generate and identify the new convex MIMO instructions to implement in hardware are the following. Given the subject graph of an application: • A1 - evaluate the hardware and software latencies of each node, • A2 - evaluate the area occupied by each node when implemented in hardware, • A3 - generate the convex MIMO instructions as described in the previous steps, • A4 - select a subset of instructions that fits the available hardware and provides a performance gain. If the algorithm is used to cluster subgraphs belonging to a minimal cover of convex and disjoint subgraphs, before A1 we have to transform the subject graph into the collapsed graph by using the collapsing function. Therefore, in A1–A3, we consider the nodes of the collapsed graph and not the nodes of the subject graph. Since the area available to implement the new instructions is limited, the area of the clusters is always bounded by the total available area during the cluster generation: if the inclusion of a node exceeds the available area, the node is withdrawn and the cluster is extended with other nodes, if possible. Additionally, all the generated clusters are convex subgraphs, with the exception of C^iv. This means that the search for convex subgraphs can be interrupted after any of the subgraphs {C^k}, k ≠ iv, is generated. The selection criteria used to choose a subset of instructions that fits the available hardware resources are described in Section 4.4.4.
Figure 4.8: Clustering without reiteration of STEP 1 and STEP 2: example of an application considering only 3 levels. C = {4} (a), C′ = {4} ∪ {1, 2} (b), C′′ = {1, 2, 4} ∪ {3, 5} (c), C_FIN = {1, 2, 3, 4, 5} ∪ {6, 7} = {1, 2, 3, 4, 5, 6, 7} (d); (e) and (f) do not expand the cluster.
Figure 4.9: Clustering with reiteration of STEP 1 and STEP 2: example of an application considering only 3 levels. C = {4} (a), C′ = {4} ∪ {1, 2} (b), C′′ = {1, 2, 4} ∪ {3, 5} (c), C′′′ = {1, 2, 3, 4, 5} ∪ {8} (d), C^iv = {1, 2, 3, 4, 5, 8} ∪ {9} (e), C_FIN = {1, 2, 3, 4, 5, 8, 9} ∪ {6, 7, 10} = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10} (f).
4.4.3 One Modification to Generate Bigger Clusters Assuming that the size of a cluster is proportional to the performance gain the cluster can provide, we describe a modification of the initial steps common to both the Nautilus and the Spiral clustering algorithms. After the
generation of C′′ and before the generation of C′′′, we generate the following subgraphs:

  C∗ = C′′ ∪ PRED′(SUCC′(PRED′(aι))),                                (4.32)

and

  C∗∗ = C∗ ∪ SUCC′(PRED′(SUCC′(PRED′(aι)))).                         (4.33)
Both graphs, by Theorem 4.2.1, are convex. Figures 4.8 and 4.9 depict a simple example which shows the clustering with and without reiteration of the initial steps. As shown in the figures, the final cluster can contain additional nodes which, in turn, can provide an additional performance gain.
4.4.4 The Selection Process The two heuristics presented in the previous sections, namely the Nautilus clustering algorithm and the Spiral clustering algorithm, stop the generation of convex MIMOs when there exists no extra design space to explore (i.e. all the nodes in the graph have been clustered). As mentioned before, the hardware resources available for implementing the new instructions can be limited. If the total area of the generated clusters exceeds the available hardware resources, we use an established metric to evaluate our optimizations and to select the instructions to implement in hardware that fit the available area. The Area-Delay Product (ADP) and its variants (e.g. A²DP, AD²P) have been extensively used [41, 70] and provide a well-known empirical measure of performance. Delay is the primary concern in our working context; therefore, AD²P has been used as a metric. In effect, the clusters are ranked based on the following formula:

  AC / (LSW − LHW)²,                                                 (4.34)

where AC is the area of the generated cluster, and LSW and LHW are the latencies of the cluster in software and hardware, respectively. Additionally, clusters with LSW < LHW are automatically excluded during the selection process.
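The ranking and selection step can be sketched as below. Eq. (4.34) gives the ranking quantity; treating a smaller ratio as the better candidate and filling the available area greedily are assumptions of this illustration (the text does not fix the packing policy), and clusters with LSW ≤ LHW are excluded, which also avoids a division by zero.

```python
def select_instructions(clusters, available_area):
    """Rank the clusters by AC / (LSW - LHW)^2 (Eq. (4.34)) and keep,
    greedily, those that still fit in the available area.
    Each cluster is a tuple (area, lsw, lhw)."""
    ranked = sorted((c for c in clusters if c[1] > c[2]),  # require LSW > LHW
                    key=lambda c: c[0] / (c[1] - c[2]) ** 2)
    chosen, used = [], 0
    for area, lsw, lhw in ranked:
        if used + area <= available_area:
            chosen.append((area, lsw, lhw))
            used += area
    return chosen
```
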
4.5 Summary The current chapter presents three algorithms for the generation and selection of convex connected and disconnected MIMO instructions: the Parallel clustering, the Nautilus clustering and the Spiral clustering. The chapter starts
with describing the main concepts related to convex subgraphs. Afterwards, we describe the Parallel clustering algorithm, a clustering algorithm for the generation of convex disconnected MIMO instructions. Since it is not always possible to implement disconnected instructions in hardware, the chapter continues with the description of two heuristics of linear complexity, the Nautilus clustering algorithm and the Spiral clustering algorithm, for the generation of convex connected MIMO instructions. Finally, the chapter presents a possible selection process to select, based on the available hardware resources, a subset of the generated instructions to implement in hardware. The next chapter describes the experimental validation of the concepts described in Chapters 3 and 4. Given a reconfigurable architecture, we present the performance gain of a set of well-known benchmarks when the instruction-set is extended with the new convex connected or disconnected MIMO instructions.
Note. The Spiral clustering algorithm is based on the notion of the Archimedean spiral, which gives the algorithm its name. The name Nautilus clustering algorithm is also related to a spiral, the logarithmic spiral. The logarithmic spiral can be distinguished from the Archimedean spiral by the fact that the distances between the turnings of a logarithmic spiral increase in geometric progression, while in an Archimedean spiral these distances are constant (Td = const). The logarithmic spiral is a special kind of spiral curve which often appears in nature: a well-known example is the Nautilus shell (a living fossil), which gives the algorithm its name. Nautilus shells and Ammonite shells, shells with an Archimedean spiral shape, are depicted on the cover of this dissertation. The Parallel clustering algorithm takes its name from the type of clustering which identifies and selects operations that can be grouped and executed in parallel in hardware.
Note. The content of this chapter is based on the following papers: C. Galuzzi, E. Moscu Painante, Y. D. Yankova, K. Bertels and S. Vassiliadis, Automatic Selection of Application-Specific Instruction-Set Extensions, CODES+ISSS 2006, October 2006. C. Galuzzi, K. Bertels and S. Vassiliadis, A Linear Complexity Algorithm for the Automatic Generation of Convex Multiple Input Multiple Output Instructions, ARC 2007, March 2007. C. Galuzzi, K. Bertels and S. Vassiliadis, The Spiral Search: A Linear Complexity Algorithm for the Generation of Convex Multiple Input Multiple Output Instruction-Set Extensions, IC-FPT 2007, December 2007.
5 Performance Evaluation
In this chapter, we evaluate the performance of the algorithms presented in this dissertation. The chapter opens by giving a short overview of the major limitations in existing reconfigurable architectures. As the proposed algorithms are general and not designed for a specific architecture, a reconfigurable architecture based on the Molen machine organization and programming paradigm is described and used to carry out the experiments. This kind of architecture, as will be outlined in the text, makes it possible to overcome the major limitations in existing reconfigurable architectures. Besides, it allows us to carry out experiments with different architectural constraints. The chapter continues with an elaboration of the toolchain for the automatic generation and selection of application-specific instruction-set extensions. Afterwards, the experimental results for different benchmarks are presented and, after that, a summary of the conclusions is given.
5.1 Introduction The algorithms proposed in this dissertation generate and select application-specific instructions that can be used to speed up an application and to execute it more efficiently on a given architecture. The algorithms are general and not restricted to a specific architecture. As a result, the new instructions extending a given instruction-set can be implemented either on fixed hardware or on reconfigurable hardware. The type of hardware, fixed or reconfigurable, affects in many ways the clustering process used to identify the instructions: • it affects the type and number of new instructions; • it affects the metrics that the new instructions optimize;
• it affects the complexity of the clustering process. For example, the design of a specialized instruction-set extension for an ASIC will give priority to metrics like power consumption, area and frequency of execution of an instruction, to compensate for the production costs. A new instruction will be selected if it is frequently used by an application, and this turns into solving covering problems, well-known computationally complex problems. Generally speaking, there is no restriction on using our algorithms for this purpose. Nevertheless, our target is different. The main goal is to provide a set of clustering algorithms for the fast automatic generation of application-specific instruction-set extensions for reconfigurable architectures. Assuming minor limitations on the hardware available for the implementation of the new instructions, we can prioritize different metrics and, as a general result: • the execution time of the application can be reduced, • the computational complexity can be reduced, • the hardware can be reused to implement different application-specific instructions. While hardware reuse is an intrinsic property of reconfigurable hardware, reduced execution time and lower computational complexity depend on the method used to generate and select the new instructions. As we have seen in the previous chapters, the proposed algorithms have low computational complexity (see Section 4.4) and do not impose limitations either on the number or on the type of instructions that can be generated and selected for hardware implementation. The general idea is the following: given a reconfigurable architecture (a GPP coupled with reconfigurable hardware, like an FPGA), an application intended to be executed only on the GPP is transformed into an equivalent application, executing on both the GPP core and the reconfigurable hardware, which makes use of the application-specific instructions implemented on the reconfigurable hardware.
The program transformation includes: the identification of the piece of code representing the application-specific instruction implemented in hardware, the elimination of this code, and the insertion of equivalent code to call the instruction in hardware. In the next section, we present the main limitations of common reconfigurable architectures. Based on this analysis, we present the reconfigurable architecture selected in this dissertation to carry out the experiments.
5.2 Main Limitations in Reconfigurable Architectures A reconfigurable architecture is an architecture which combines the performance of hardware and the flexibility of software by coupling, for example, a GPP and reconfigurable hardware like an FPGA. A number of surveys on reconfigurable architectures exist in the literature, starting from the most recent publications [19, 110] and continuing with [14, 15, 28, 48, 49, 95]. Reconfigurable architectures are mainly classified in terms of the granularity of the reconfigurable hardware, the coupling method and the reconfiguration level. Based on the analysis presented in [34], three major drawbacks, common to the majority of the architectures, can be identified: • Limitations on the number of operands: in most reconfigurable architectures, the number of input and output operands of the instructions to implement on the reconfigurable hardware is limited. This has two drawbacks: first, it limits the size of the new custom instructions and, second, it limits the performance gain. • Limitations on parallelism: many architectures do not support (at different levels) the parallel execution of sequential operations when they have no data dependency. This limits the performance gain, as shown in Section 4.4. • Limitations on the opcode space: in many architectures, the number of custom instructions that can be mapped on the reconfigurable hardware is limited. This is due to the limitation of the opcode space, since the common approach is to generate a new instruction for each part of the application selected to be mapped on hardware, that is, for each custom instruction. This limits the performance gain: a limitation on the number of new instructions implies that a larger part of the application is selected for sequential software execution, limiting the overall performance gain¹. Consequently, the generation and selection of custom instructions on a generic reconfigurable architecture are limited in: 1.
the total number of inputs and outputs of each instruction;
¹ In this context, we are not considering a selection process of the new custom instructions based on their frequency of execution.
2. the number of basic operations contained in the custom instruction that can be executed in parallel in hardware and, consequently, the number of custom instructions that can be executed in parallel in hardware; 3. the total number of instructions that can be generated. As the proposed algorithms are general and not restricted to a specific architecture, the performance evaluation of the algorithms on a generic architecture would be affected by the architecture-dependent hardware constraints, which affect the quality of the generated and selected instructions. For this reason, in this dissertation, we decided to run the experiments on a reconfigurable architecture which relaxes the aforementioned limitations. Such an architecture is, for example, a reconfigurable architecture based on the Molen machine organization and on the Molen programming paradigm, called in short a Molen reconfigurable architecture. Note that, even though we consider more relaxed hardware constraints, the algorithms are tested taking into consideration different hardware constraints and different limitations on the available hardware resources. The target of the next section is to present an overview of the main characteristics of the Molen reconfigurable architecture.
5.3 The Molen Reconfigurable Architecture The Molen reconfigurable architecture is based on the co-processor architectural paradigm: a GPP, the core processor, controls the execution and the (re)configuration of a reconfigurable co-processor, tuning the latter for specific applications by implementing application-specific instructions. The reconfiguration and the execution of the code on the reconfigurable hardware are done in firmware via reconfigurable microcode (ρµ-code). The ρµ-code is an extension of the classical microcode which includes reconfiguration and execution. Although a detailed description of the Molen architecture is presented in [110, Chap. 5], a short overview of the main characteristics of this architecture is included here. An example of a reconfigurable architecture based on the Molen machine organization is depicted in Figure 5.1. We can identify two main components: • the Core Processor, usually a GPP; • the Reconfigurable Processor (RP), usually implemented on an FPGA. Besides, there are:
Figure 5.1: The Molen machine organization [110].
• the arbiter, which issues the instructions to either processor and partially decodes the instructions received from the instruction fetch unit; • the data fetch unit, responsible for fetching (storing) the data from (to) the main memory; • the memory MUX, responsible for distributing (collecting) the data to (from) either the reconfigurable or the core processor. The reconfigurable processor consists of two parts: the ρµ-code unit and the Custom Configured Unit (CCU), which is composed of reconfigurable hardware, like an FPGA, and memory. The CCUs are used to implement custom instructions identified, for example, with the clustering methods proposed in the previous chapters. The code executed on the reconfigurable unit is distinct from the code executed on the core processor. Data must be transferred across the code boundaries in order for the overall application code to be meaningful. Such data includes predefined parameters and results, or pointers to them. The parameter and result passing is performed through a mechanism utilizing a set of registers called eXchange REGisters (XREGs).
The Molen programming paradigm is a sequential consistency paradigm targeting the described organization, which allows parallel and concurrent hardware execution. The paradigm requires only a one-time architectural extension of a few instructions to provide a large user-reconfigurable operation space and to support a virtually unlimited number of custom instructions that can be performed on the reconfigurable hardware. There are six main instructions for controlling the reconfigurable hardware: • two set instructions, partial and complete, to configure the CCUs to perform the custom instructions; • one execute instruction, for the actual execution of the custom instructions on the CCU; • one set prefetch and one execute prefetch instruction, to prefetch the needed microcode responsible for the CCU reconfigurations and executions into a local on-chip storage facility (the ρµ-code unit) in order to possibly diminish microcode loading times; • one break instruction, used as a synchronization mechanism to complete the parallel execution of both the reconfigurable processor and the core processor. Additionally, there are two move instructions used to exchange values between the register file and the XREGs, since the reconfigurable processor is not allowed to directly access the general-purpose register file. The support of custom instructions by the reconfigurable processor can initially be divided into two separate phases: set and execute. In the set phase, the CCU is configured to perform the supported operations. After that, in the execute phase, the actual execution of the operations is performed. This decoupling makes it possible to schedule the set phase well ahead of the execute phase, thereby hiding the reconfiguration latency. As no actual execution is performed in the set phase, it can even be scheduled upwards across the code boundary, in the code preceding the RP target code.
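The effect of this decoupling can be illustrated with a small latency model; the function and the numbers below are purely illustrative assumptions, not part of the Molen specification.

```python
def visible_reconfig_latency(t_set, t_code_before_execute):
    """With the set instruction hoisted ahead of the execute
    instruction, the CCU reconfiguration (t_set cycles) overlaps
    the code scheduled in between; only the part of t_set that does
    not fit under that code remains visible as overhead."""
    return max(0, t_set - t_code_before_execute)
```

For instance, a 1000-cycle reconfiguration scheduled 800 cycles ahead of the execute leaves only 200 visible cycles, and disappears entirely when scheduled sufficiently early.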
Furthermore, no specific instructions are associated with specific operations to configure and execute on the CCU, as this greatly reduces the opcode space. Instead, pointers to ρµ-code which emulates both the configuration and the execution of applications are used².
² For this purpose, there are two types of ρµ-code: the one that controls the configuration of the CCU and the one that controls the actual execution of the implementation configured on the CCU.
The exchange registers are used for passing operation parameters to the reconfigurable hardware and for returning the computed values after the operation execution. In order to avoid dependencies between the processors, the needed parameters are moved from the register file to the XREGs and the results are stored back in the register file. The execution of one instruction does not generate problems, as all the XREGs are available. If two or more instructions are executed in parallel, the total number of parameters (inputs and outputs) is bounded by the number of available XREGs. If the parameters do not exceed the number of XREGs, parameters are passed by value, otherwise by reference. This is an important issue to consider, above all during the scheduling of the new instructions. Regarding the shortcomings presented in the previous section, by using a Molen reconfigurable architecture: • An arbitrary number of inputs and outputs of the custom instructions can be passed to/from the reconfigurable hardware; it is restricted only by the implemented hardware real estate, as any given technology can allow only a limited amount of hardware. • Parallelism is allowed as long as the sequential memory consistency model can be guaranteed. • The limitation on the number of custom instructions that can be generated is virtually removed. As a result, the experimental evaluation of the algorithms is carried out on a Molen reconfigurable architecture prototype composed of a PowerPC 405 processor and a Virtex II Pro FPGA. It is noteworthy that, although a Molen reconfigurable architecture removes the majority of the common hardware limitations of other reconfigurable architectures, we take into consideration different hardware constraints and hardware resource limitations in the generation and selection of the custom instructions.
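The value-versus-reference decision for the exchange registers described above can be stated compactly; the function below is an illustrative sketch, not an interface of the architecture.

```python
def xreg_passing_mode(total_params, num_xregs):
    """Molen parameter passing: if the combined inputs and outputs of
    the instructions executed in parallel fit in the exchange
    registers, parameters are passed by value, otherwise by
    reference."""
    return "value" if total_params <= num_xregs else "reference"
```
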
5.3.1 The Delft WorkBench Project
The Delft Workbench project3 aims at providing a semi-automatic tool platform for integrated hardware-software co-design, targeting heterogeneous computing systems containing reconfigurable components. It targets the Molen machine organization and addresses the entire design cycle rather than isolated
3 http://ce.et.tudelft.nl/DWB/
CHAPTER 5. PERFORMANCE EVALUATION
[Figure 5.2: block diagram of the Delft Workbench workflow. Human Directives and a COST MODEL drive the PROFILER, which annotates the application sources; C2C transforms the candidate parts; the RETARGETABLE COMPILER produces Binary Code containing SET f(.)/EXEC f(.) instructions for the MOLEN ARCHITECTURE; Manual Code, an IP LIBRARY and the VHDL GENERATOR provide the VHDL entities (e.g. CCU_F, CCU_H); Performance Statistics are fed back to the profiler.]
Figure 5.2: The overall workflow of the Delft Workbench.
parts. It involves the development of compilers for reconfigurable platforms, programming models, hardware/software co-design, CAD and design-space-exploration software, optimization algorithms and integration software. Figure 5.2 shows the overall workflow of the Delft Workbench project, described in the following. PROFILER. Initially, the application is analyzed by a profiler. Profiling consists of identifying those parts of an application that could be mapped on the reconfigurable hardware. The goal is to determine as early as possible in the design stage whether certain parts not only provide the necessary speed up but can also be implemented given the limited reconfigurable or other hardware resources. A 'part' of the application can be a whole function or procedure, but it can also be any cluster of instructions scattered throughout the application. There are evidently several parts which could be selected, and each of them should be considered. The goal of this stage is to assess to what extent hardwired functions would increase the overall performance and what the cost is to build them. One way to support this process is to offer a profiler that
can collect and analyze execution traces of the program. The profiler uses this information, in combination with human directives, to propose a number of candidate code segments. A cost model is used to assess how these code segments translate into hardware designs, taking into account configuration delays, area usage, power consumption, etc. Such a cost model makes it possible to filter out those functions that are unlikely to yield the anticipated improvement. The cost model is multidimensional, meaning that the human designer can specify which constraints (memory, real time, power, ...) must be taken into account when evaluating the different candidate functions. The profiler accepts ANSI C code as input and produces as output C code annotated with pragma directives indicating the parts of the code considered for execution on the reconfigurable hardware. C2C. The candidate parts of the application undergo a series of optimizations and transformations that aim at converting the sequential algorithm, targeting GPP execution, into a parallelized form suitable for hardware implementation. These optimizations and transformations include loop optimizations and custom instruction generation. The work presented in this dissertation can be used in the graph transformation phase: an application, or part of it, is analyzed and application-specific instructions are generated and selected for hardware implementation to speed up the execution of the application on the reconfigurable architecture. RETARGETABLE COMPILER. After the C2C step, the parts of the application selected for hardware execution are completely defined. The compiler generates the appropriate link code for their execution on the reconfigurable processor, while the rest of the application is compiled for execution on the core processor.
The link code mainly contains:
• the code for passing the parameters, via the XREGs, to the reconfigurable processor, and the code for getting the computed results from the XREGs;
• the instructions to configure and to execute the implemented configuration on the reconfigurable processor.
One main goal of the compiler is to generate high-quality code tailored to the specific features of the target architecture. In this case, specific optimizations (see [34]) have to be included in the compiler in order to address the distinct characteristics of the reconfigurable hardware, such as the reconfiguration overhead, parallel execution, reconfigurable hardware allocation, etc. VHDL GENERATION. The parts of the application selected for hardware implementation have to be described in a Hardware Description Language (HDL) such as
VHDL. The VHDL description can be obtained in three different ways: manually, with an IP library, or automatically:
• For critical code segments that require a very high quality of the implemented hardware, manual VHDL code should be written.
• Whenever an external IP library is available for the selected code segments, the hardware models can be directly instantiated from that library.
• The third possibility, automated code generation, is envisioned for fast prototyping and fast performance estimation during design space exploration. Even though the current state of the art in automated HDL generation is not yet comparable to manually crafted HDL, certain optimizations can be applied in order to obtain a high-quality hardware model at a fraction of the time needed for a manual implementation.
5.4 The Toolchain
A dedicated toolchain has been built to evaluate the quality of the proposed algorithms, and the algorithms have been applied to a set of well-known benchmarks. In the remainder of this section, we describe the toolchain in detail. As mentioned in the previous chapters, the custom instructions are generated via a two-step clustering: first, the application, or the parts of the application to speed up, is partitioned into single-output instructions. Afterwards, these instructions are further combined into multiple-output instructions that can be atomically executed in hardware, i.e. the single-output instructions are combined into convex connected or disconnected multiple-output instructions. From now on, we refer to these instructions as the (automatically) fused instructions. The toolchain for the experiments is presented in Figure 5.3. The grey boxes in Figure 5.3 denote tools that have been developed.
[Figure 5.3: toolchain diagram. PHASE 1: C-to-DFG conversion, MM generation, SMM generation and VHDL generation from the *.C sources; PHASE 2: SW cost estimation and synthesis-based HW cost estimation; PHASE 3: the Parallel, Nautilus and Spiral clustering algorithms, each followed by instruction selection, and collapsed DFG generation, producing the CUSTOM INSTRUCTIONS.]
Figure 5.3: Toolchain for the automatic identification of fused instructions. The toolchain is divided into three phases: the single-output clustering (PHASE 1), the hardware-software estimation (PHASE 2) and the multiple-output clustering (PHASE 3).
The toolchain can be divided into three phases:
• PHASE 1: the single-output clustering,
• PHASE 2: the hardware-software estimation,
• PHASE 3: the multiple-output clustering.
PHASE 1. The starting point is an application written in ANSI C. The application is profiled to identify those parts that could be implemented on the reconfigurable hardware. The main idea is to determine as early as possible in the design stage whether certain parts not only can provide the necessary speed up when implemented in hardware, but also whether they can be implemented given the limited hardware resources. The goal is thus to end up with certain parts that can be accelerated by identifying application-specific instructions to implement on the reconfigurable hardware. The remaining parts of the application are executed on the regular general-purpose processor. The profiling itself is driven by a particular objective, such as increased performance, reduced power consumption or a small footprint. In this dissertation, we consider performance improvement as the guideline for the profiling. Additional information on the profiling can be found in [82]. After the application is profiled, the selected parts of the application are transformed into their equivalent Data-Flow Graphs (DFGs) with the C-to-DFG converter implemented within the SUIF2 compiler framework4. The DFG of an application can be a cyclic or an acyclic graph. Our algorithms require an acyclic graph. As a consequence, if the DFG is cyclic, it is transformed into a Directed Acyclic Graph (DAG). Up to now, we consider full loop unrolling to increase the selection space for the algorithms. If full loop unrolling is not possible, it is possible to unroll the loops by a certain factor and consider the inner parts of the loops as the DAGs to analyze. The (DAG) DFGs are the subject graphs described in Chapter 3.
The nodes of the DAG associated with the (part of the) application to analyze, from now on addressed as the subject graph, represent primitive operations, and the edges represent the data dependencies. The nodes have at most two inputs and their single output can be input to multiple nodes. The subject graph can be partitioned into single-output instructions by considering:
1. single nodes,
4 http://suif.stanford.edu/suif/suif2
2. maximal single-output instructions, the MAXMISOs,
3. single-output instructions of variable size, the SMMs.
In this dissertation, the single-output instructions considered for multiple-output clustering are only the MAXMISOs and the SMMs. The use of single nodes, although possible, leads to the generation of custom instructions that are, on average, too small, and it is not considered. The subject graph is first partitioned into MAXMISOs and, after that, into SMMs. The output of PHASE 1 is the subject graph of the application under consideration, partitioned either into MAXMISOs or into SMMs. The set of SMMs, as explained in Section 3.3, depends on the node removed from the MAXMISOs. We implemented the FIX SMM clustering algorithm to generate the set of SMMs and we consider different options in the selection of the node to remove (input node, output node, different inner nodes, etc.). PHASE 2. After the subject graph has been partitioned into single-output clusters, for each cluster we extract the hardware and software costs, i.e. the hardware and software execution times, and the area occupied by the cluster (instruction) when implemented in hardware. This information is taken into consideration by the MIMO clustering algorithms during the generation and selection of the custom instructions. The software execution time is computed as the sum of the latencies of the basic operations contained in the single-output instruction. The hardware execution time is estimated through behavioral synthesis of the single-output instructions' VHDL models, converting the reported delay into PowerPC cycles. We consider the implementation of our algorithms on the Molen prototype built on a Xilinx Virtex-II Pro Platform FPGA. The software execution is assumed to be performed on a PowerPC 4055 and the VHDL synthesis is performed for the XC2VP100 chip6. The PowerPC operates at 300 MHz while the FPGA operates at 100 MHz.
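The MAXMISO partitioning performed in PHASE 1 can be sketched with a small fixpoint formulation; the graph representation (node and edge lists) and the function are our own illustration, not the thesis implementation. A node joins the MAXMISO of a sink exactly when all of its successors already belong to that MAXMISO; removing the MAXMISO exposes new sinks, so a node whose output feeds several MAXMISOs eventually becomes the output node of its own:

```python
from collections import defaultdict

def maxmiso_partition(nodes, edges):
    """Partition a DAG into MAXMISOs (maximal single-output clusters)."""
    succ = defaultdict(set)
    for u, v in edges:
        succ[u].add(v)
    remaining = set(nodes)
    partition = []
    while remaining:
        # sinks of the remaining graph are the MAXMISO output nodes
        sinks = [n for n in remaining if not (succ[n] & remaining)]
        for root in sinks:
            cluster = {root}
            grew = True
            while grew:  # fixpoint: absorb nodes feeding only this cluster
                grew = False
                for n in list(remaining - cluster):
                    # all original successors must already be in the cluster
                    # (the guard excludes other sinks, whose succ set is empty)
                    if succ[n] and succ[n] <= cluster:
                        cluster.add(n)
                        grew = True
            partition.append(cluster)
            remaining -= cluster
    return partition
```

On the DAG a→c, b→c, c→d, c→e, node c has two consumers, so d and e are singleton MAXMISOs and {a, b, c} forms the third one.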
The PowerPC processor in the Virtex-II Pro does not provide floating-point instructions. As a result, the floating-point operations have to be either emulated in some way or, in our case, converted into the proper integer arithmetic7. As described in the previous sections, the Molen architecture does not impose
5 We assume a simplified software execution model where caches, pipelines, etc. are not considered.
6 The generated VHDL is not optimized and hardware reuse is not addressed.
7 Please note that this is not a limitation of the generation and/or selection process, but rather a limitation of the hardware and software implementation of the operations; therefore, it does not affect the clusters generated by the algorithms.
any restrictions on the operations included in the custom instructions. Consequently, the selected operations are not limited to arithmetic and logic operations; they can also contain memory accesses and control logic. As the subject graph does not contain loops, a custom instruction does not contain loops. Nevertheless, if an entire loop is collapsed into a single node of the subject graph, a custom instruction containing that node can include a loop. As a result, additional modifications of the source code are not necessary before the selection process. In the current Molen prototype, the access to the XREGs (used for data communication between the processors) to load and store the input/output values of the custom instructions is significantly slower than the access to the general-purpose registers. We consider 2 cycles to read an input from an XREG or from the on-chip memory and 1 cycle to write an output to an XREG or to the on-chip memory. Additionally, we consider 0 cycles for the internal communications within the custom instructions. For the VHDL code generation, we use the C-to-VHDL generation toolset DWARV: the Delft Workbench Automated Reconfigurable VHDL generator. The toolset exploits the available operation parallelism and has no limitations on the application domains. We refer to [121] for additional details. After the VHDL code has been generated, it is synthesized using Xilinx ISE 9.4i, and the reported delay is converted into PowerPC cycles. After the software and hardware costs are estimated, the collapsed graph is generated as described in Section 3.3. The collapsed graph is the graph obtained by collapsing each single-output instruction within the subject graph into a single node. This operation is performed by the collapsing function. PHASE 3. After the application is partitioned into single-output instructions and hardware and software costs are estimated for each of them, the instructions are further combined into convex multiple-output instructions.
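The collapsing function admits a compact sketch, assuming a precomputed mapping from every node to the identifier of its single-output cluster (the mapping and function name are our own, not the toolchain code):

```python
def collapse(edges, cluster_of):
    """Collapse each single-output cluster into one node.

    The collapsed graph has an edge (cu, cv) whenever some original
    edge u -> v crosses from cluster cu into a different cluster cv.
    """
    return {(cluster_of[u], cluster_of[v])
            for u, v in edges
            if cluster_of[u] != cluster_of[v]}
```

Because the clusters are single-output, the collapsed graph of a DAG remains a DAG, which is what the PHASE 3 algorithms operate on.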
The single-output instructions can be combined by using one of three clustering algorithms: the Parallel clustering algorithm, the Nautilus clustering algorithm or the Spiral clustering algorithm. In the first case, an optimal number of convex, fused, disconnected instructions is generated and selected based on the available hardware resources, as described in Section 4.3. The Nautilus and Spiral clustering algorithms, described in Sections 4.4.1 and 4.4.2 respectively, initially generate a set of candidate instructions and then select a subset, as described in Section 4.4.4, based on the available hardware resources. The toolchain output is a set of fused instructions, optimally or heuristically generated and selected, to implement on the reconfigurable hardware. These
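The convexity required of the fused instructions, i.e. no path may leave the cluster and re-enter it, since such a cluster could not execute atomically, can be checked as in the following sketch (our own illustration of the convexity test, not one of the three clustering algorithms themselves):

```python
def _descendants(start, succ):
    """All nodes reachable from `start` by an iterative DFS."""
    seen, stack = set(), [start]
    while stack:
        n = stack.pop()
        for m in succ.get(n, ()):
            if m not in seen:
                seen.add(m)
                stack.append(m)
    return seen

def is_convex(cluster, succ):
    """True iff no path leaves `cluster` and comes back into it."""
    cluster = set(cluster)
    for u in cluster:
        for out in succ.get(u, ()):
            if out in cluster:
                continue
            # `out` is external: if any of its descendants lies in the
            # cluster, a value would have to leave and re-enter it
            if _descendants(out, succ) & cluster:
                return False
    return True
```

For example, with edges a→b, a→x, b→c, x→c, the cluster {a, b, c} is not convex (the path a→x→c re-enters it through the external node x), while {a, b} is.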
5.4. T HE T OOLCHAIN
101
Table 5.1: Average measurements of the execution time of the toolchain presented in Figure 5.3.

Tool                                            Execution Time
MM Generation                                   T1 ∼ 10 sec
SMM Generation                                  T2 ∼ 10 sec
VHDL generation                                 T3 < 1 min
RTL synthesis                                   T4 < 5 min
Parallel Clustering - ILP problem formulation   T5 ∼ 10 sec
Parallel Clustering - ILP problem solver a      T6 < 2 min
Nautilus Clustering and selection               T7 ∼ 10 sec
Spiral Clustering and selection                 T8 ∼ 10 sec

a For most of the considered kernels and FPGA sizes, the HW selection takes less than 1 minute. However, in a few cases, the best solution found after two minutes of ILP solver time is considered.
instructions are application-specific instructions extending the instruction-set of the architecture, and they provide a performance gain in the execution of the application compared to pure software execution. The speed up estimation is based on Amdahl's law (see, for example, [71, Appendix A]), using the profiling results and the computed speed up for the (parts of the) application considered.
Remark 5.4.1. The target of this dissertation is the design of algorithms for the generation and selection of multiple-output instructions. As a single-output instruction is also a multiple-output instruction with only one output, we can generate single-output instructions either by using the Nautilus or the Spiral clustering algorithm (limiting the generated clusters to one output) or by directly considering MAXMISOs and/or SMMs. A selection process, if necessary, can be the same as the one described in Section 4.4.4. As a result, in Figure 5.3, the output of PHASE 2 can directly undergo a selection process, without further instruction combination.
We note that the tools in the toolchain do not require any manual effort to be adjusted to the target application. All the steps in Figure 5.3 are automated except the integration of the tools, which is secured through shell scripts or batch files, depending on the operating system used. Nevertheless, as future work we intend to fully automate the toolchain by implementing the necessary VBA (Visual Basic for Applications) modules. Regarding the execution time of the presented toolchain, average measurements are included in Table 5.1. The execution time refers to the time needed to identify one set of custom instructions for the available hardware resources, i.e. to identify one solution. After the initial single-output clustering and hardware/software estimations have been completed, the total execution time depends on the number of possible solutions considered. More specifically, the execution time to provide n solutions is:

$$\text{Total Execution Time} = \sum_{i=1}^{4} T_i + \begin{cases} n\,(T_5 + T_6) & \text{Parallel clustering}\\ n\,T_7 & \text{Nautilus clustering}\\ n\,T_8 & \text{Spiral clustering} \end{cases} \qquad (5.1)$$

with T2 included in the sum if and only if the subject graph is partitioned into SMMs. Furthermore, we note that the identification of the custom instructions is done statically and not at run-time.
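The total-execution-time computation above can be sketched as follows; the timing labels T1-T8 come from Table 5.1, while the dictionary representation and the method names are our own:

```python
def total_execution_time(T, n, method):
    """Toolchain time to produce n candidate solutions.

    T maps 'T1'..'T8' to durations in seconds; 'T2' (SMM generation)
    is present only when the subject graph is partitioned in SMMs.
    """
    base = T["T1"] + T.get("T2", 0) + T["T3"] + T["T4"]
    per_solution = {
        "parallel": T["T5"] + T["T6"],
        "nautilus": T["T7"],
        "spiral": T["T8"],
    }[method]
    return base + n * per_solution
```

With the average timings of Table 5.1 (10 s, 10 s, 60 s, 300 s for T1-T4), producing two Parallel-clustering solutions costs 380 + 2 · 130 = 640 seconds.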
5.5 Discussion of the Evaluation Process
THE SMM PARTITIONING. After the application is partitioned into MAXMISOs, it can be further partitioned into SMMs as described in Section 3.3. The SMM partitioning depends on the node which is removed from the MAXMISOs. As mentioned before, in this dissertation we implement different versions of the FIX SMM partitioning algorithm. The node removed from the MAXMISOs can be:
• OPTION 1: an output node;
• OPTION 2: a semi-random node;
• OPTION 3: an input node;
• OPTION 4: the 1st node with 1 successor and 1 predecessor;
• OPTION 5: the 1st node with 1 successor and 2 predecessors;
• OPTION 6: the 2nd node with 1 successor and 1 predecessor;
• OPTION 7: the 2nd node with 1 successor and 2 predecessors.
Since the subject graph is a DAG, the nodes of the graph can be topologically ordered. The successors and predecessors of the nodes mentioned above are then identified by the topological order of the nodes. The node to remove is chosen starting the analysis from the output node of the considered MAXMISO. OPTION 2 randomly removes a node from the central levels of the MAXMISO. The same happens with OPTIONS 4-7 when the MAXMISO does not contain a node with the required property. It can be the case that a MAXMISO is composed of only one node. In this case, the SMM partitioning of this MAXMISO is considered to be the MAXMISO itself.
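OPTIONS 4-7 can be sketched by topologically ordering the MAXMISO's nodes and scanning for the first (or second) node with the required successor and predecessor counts. This is a sketch under our own graph representation, not the FIX SMM implementation; `graphlib.TopologicalSorter` is part of the Python standard library (3.9+):

```python
from graphlib import TopologicalSorter

def node_to_remove(nodes, edges, n_succ, n_pred, skip=0):
    """Return the (skip+1)-th node, in topological order, having exactly
    n_succ successors and n_pred predecessors, or None if absent."""
    succ = {n: set() for n in nodes}
    pred = {n: set() for n in nodes}
    for u, v in edges:
        succ[u].add(v)
        pred[v].add(u)
    # TopologicalSorter expects a node -> predecessors mapping
    order = TopologicalSorter(pred).static_order()
    hits = [n for n in order
            if len(succ[n]) == n_succ and len(pred[n]) == n_pred]
    return hits[skip] if len(hits) > skip else None
```

`skip=0` corresponds to the "1st node" options and `skip=1` to the "2nd node" options; when no node qualifies, the caller would fall back to OPTION 2 (a semi-random central node), as described above.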
THE SEED IN THE NAUTILUS AND SPIRAL CLUSTERING ALGORITHMS. The Nautilus and Spiral clustering algorithms generate fused instructions by starting the clustering from one node, called the seed, and then, step by step, additional operations are included in the cluster. The seed, as described in Section 4.4, is selected as follows:
• Nautilus clustering algorithm: the seed is chosen as the node with the smallest latency in hardware, starting from the lower levels of the graph. In case many nodes at that level have the same latency, we randomly select the seed.
• Spiral clustering algorithm: the seed is chosen as the node with the smallest latency in hardware, starting from the central levels of the graph. In case many nodes at that level have the same latency, we randomly select the seed.
Additionally, as shown by Eq. (4.21), the number of inputs and outputs of the final clusters depends on the choice of β. We consider β = 2, meaning that the nodes in the initial cluster are included if and only if at least half of their inputs originate from within the cluster (see Section 4.4.1).
ISOMORPHISM AND HARDWARE REUSE. The complexity of the clustering algorithms presented in this dissertation is always linear in the number of nodes analyzed, with the exception of the Parallel clustering algorithm. One possible improvement of the presented toolchain is the inclusion of a graph isomorphism step. This step can be used to test whether there are isomorphic instructions among the generated custom instructions, so as to partition the set of generated instructions into isomorphism classes8. By implementing only the set of class representatives9, a considerable amount of area can be saved. Since
1. the isomorphism problem is a well-known computationally complex problem10;
8 An isomorphism class is a collection of isomorphic instructions.
9 A set of class representatives is a set which contains exactly one element from each isomorphism class, i.e. one instruction for each isomorphism class.
10 There exists no known polynomial-time algorithm for graph isomorphism testing, although the problem has also not been shown to be NP-complete. In fact, the problem of identifying isomorphic graphs seems to fall in a crack between P and NP-complete, if such a crack exists, and, as a result, the problem is sometimes assigned to a special graph-isomorphism-complete complexity class.
Figure 5.4: The execution time of an application: (a) in software, (b) using custom instructions without scheduling, and (c) using custom instructions with scheduling.
2. the shape of the instructions is not regular, reducing the likelihood of having isomorphic instructions;
3. we want to keep a low overall complexity for the clustering methods presented in this dissertation,
we do not address isomorphism and hardware reuse. This analysis will be included in the toolchain in future work. Although with our methodology we increase the use of hardware resources, we have two main benefits:
1. a lower overall computational complexity,
2. zero reconfiguration time to use the generated instructions.
All the selected instructions are implemented in hardware, not only a set of representative instructions. This means that: (1) multiple instances of the same instruction can be executed in parallel and (2) the reconfiguration time to use the generated instructions is zero, as it is not necessary to reconfigure the FPGA to use them. As mentioned before, the identification and selection of fused instructions is done statically. As no reconfiguration is necessary, we speak only about configuration time.
SCHEDULING OF THE FUSED INSTRUCTIONS. The fused instructions can undergo a scheduling process before their implementation in hardware. In this dissertation, the final execution time of an application using fused instructions, see Figure 5.4, is considered to be the sum of the software execution time and the sum of the execution times of the fused instructions. This means that if we
schedule the new instructions, it is possible to identify which of them are independent and can be implemented in parallel, increasing the overall performance gain even more.
THE SET OF BENCHMARKS. In our experiments, we used a set of well-known benchmark applications. In the following, we give a short description of each benchmark:
• Gost, Cast128 and Twofish (Encrypt and Decrypt), cryptographic kernels from the MCRYPT library11;
• ADPCM, the Adaptive Differential Pulse-Code Modulation decoder, a well-known raw audio format decoder;
• SAD, the Sum of Absolute Differences, a simple video quality metric used for block-matching in motion estimation for video compression;
• MDCT butterfly, part of a Modified Discrete Cosine Transform employed in MP3, AC-3, Ogg Vorbis, and AAC for audio compression;
• the cosine lookup function (vorbis coslook i), from the Ogg Vorbis audio codec; it uses some mirroring to save lookup table space;
• the inverse lookup function 1/sqrt(x) (vorbis invsqlook i), from the Ogg Vorbis audio codec; it works for x ∈ [0.5, 1);
• Hamming 1 and 2, two functions that calculate the Hamming distance in bits between two 16-bit integers and two 32-bit integers respectively.
LIMITATIONS ON THE NUMBER OF INPUTS AND OUTPUTS. We consider 2 cycles to read each input and 1 cycle to write each output. This means that inputs and outputs are loaded and stored in a sequential fashion. If an instruction Instr_i has I_{Instr_i} inputs and the maximum number of inputs that can be read concurrently is N_I, the number of read operations required, each taking 2 cycles, is:

$$I_{Instr_i} \,\mathrm{div}\, N_I + \begin{cases} 1 & \text{if } I_{Instr_i} \bmod N_I \neq 0 \\ 0 & \text{if } I_{Instr_i} \bmod N_I = 0 \end{cases} \qquad (5.2)$$

For example, to read 7 inputs with a limit of 4 inputs that can be read concurrently, we need 14 cycles if the inputs are read sequentially and 4 cycles if the inputs are read 4 at a time (4 = 2 · (7 div 4 + 1)). As a result, a considerable number of cycles is actually saved when input and output limitations are considered. More specifically, we have the following result.
Lemma 5.5.1. Let A = {Instr_1, ..., Instr_m} be a set of fused instructions. Let Tot_{cycles} be the total number of cycles required to execute the m instructions in A. Let I_{Instr_i} and O_{Instr_i} be the total numbers of inputs and outputs of an instruction Instr_i. If N_I and N_O are the total input and the total output limitations respectively, the real total number of cycles required for the execution of the instructions is:

$$Tot_{cycles} - \sum_{i=1}^{m} \Big\{ 2\big[I_{Instr_i} - (I_{Instr_i} \,\mathrm{div}\, N_I) - \chi_{MOD}\big] + \big[O_{Instr_i} - (O_{Instr_i} \,\mathrm{div}\, N_O) - \psi_{MOD}\big] \Big\} \qquad (5.3)$$

where

$$\chi_{MOD} = \begin{cases} 1 & \text{if } I_{Instr_i} \bmod N_I \neq 0 \\ 0 & \text{if } I_{Instr_i} \bmod N_I = 0 \end{cases} \qquad (5.4)$$

and

$$\psi_{MOD} = \begin{cases} 1 & \text{if } O_{Instr_i} \bmod N_O \neq 0 \\ 0 & \text{if } O_{Instr_i} \bmod N_O = 0. \end{cases} \qquad (5.5)$$

Proof. By the algorithm for the division of two integer numbers12, we have that:

$$I_{Instr_i} = N_I\,(I_{Instr_i} \,\mathrm{div}\, N_I) + (I_{Instr_i} \bmod N_I). \qquad (5.6)$$

This means that the total number of cycles to read all the inputs of Instr_i sequentially is:

$$2\big[N_I\,(I_{Instr_i} \,\mathrm{div}\, N_I) + (I_{Instr_i} \bmod N_I)\big] = 2\,I_{Instr_i}. \qquad (5.7)$$

If N_I is the input limitation and I_{Instr_i} is the number of inputs of instruction Instr_i, the real number of cycles to read the inputs is:

$$2\big[(I_{Instr_i} \,\mathrm{div}\, N_I) + \chi_{MOD}\big]. \qquad (5.8)$$

By combining Eq. (5.7) and Eq. (5.8), we have that for each instruction Instr_i we actually save a number of cycles equal to:

$$2\big[I_{Instr_i} - (I_{Instr_i} \,\mathrm{div}\, N_I) - \chi_{MOD}\big]. \qquad (5.9)$$

11 http://mcrypt.sourceforge.net/
12 Given two integers a and b, we have a = bq + r, where q = a div b and r = a mod b.
Table 5.2: Information about the MAXMISO partitioning of a set of benchmarks. The first column shows the benchmarks. Columns 2 and 3 show the number of nodes and the number of MAXMISOs respectively. Finally, the last two columns show the number of unique MAXMISOs and the number of levels in the collapsed graph.

Benchmark        ♯ Nodes   ♯ MMs   Unique MM   MM - ♯ Levels
Adpcm               839      110       14            20
SAD                1310        1        1             1
Vorb coslook         20        1        1             1
Vorb invsqlook       42        1        1             1
MDCT 32             569      268       13             8
Hamming 1            69        1        1             1
Hamming 2           229        1        1             1
Gost enc            897       58        9            34
Gost dec            897       58        9            34
Cast128 enc         523       47       17            19
Cast128 dec         523       47       17            19
2fish dec          1115       86        9             3
2fish enc          1099       86        9             3
With a similar argument for the outputs13, by considering all the instructions Instr_1, ..., Instr_m in A, we obtain that the total number of cycles is the one in Eq. (5.3).
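The cycle counts of Lemma 5.5.1 can be checked numerically with a small sketch; the helper names are our own, and `instrs` is assumed to be a list of (inputs, outputs) pairs:

```python
def io_cycles(count, limit, cost):
    """Cycles to transfer `count` values, `limit` at a time, at `cost`
    cycles per transfer operation (Eq. (5.8) for inputs, with cost=2)."""
    groups = count // limit + (1 if count % limit else 0)
    return cost * groups

def saved_cycles(instrs, n_i, n_o):
    """Total cycles saved w.r.t. fully sequential I/O (the sum in Eq. (5.3))."""
    total = 0
    for i, o in instrs:
        # inputs: 2 cycles each sequentially vs. grouped reads;
        # outputs: 1 cycle each sequentially vs. grouped writes
        total += (2 * i - io_cycles(i, n_i, 2)) + (o - io_cycles(o, n_o, 1))
    return total
```

For the 7-input example in the text, `io_cycles(7, 4, 2)` gives the 4 concurrent-read cycles, so a single-output instruction saves 14 - 4 = 10 cycles on its inputs.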
13 One cycle is required to write an output; this means that the factor 2 is replaced by 1, N_I by N_O, and I_{Instr_i} by O_{Instr_i}.

5.6 The Experimental Results
In this section, we present the overall application speed up compared to the pure software execution for the benchmarks presented in Section 5.5. As mentioned before, the generation of fused instructions is done in two steps: first, the application is partitioned into single-output instructions; second, some of these instructions are further combined to generate the fused instructions. In Table 5.2, we present information about the MAXMISO partitioning of the considered set of benchmarks. The first column shows the benchmarks, Columns 2 and 3 show the number of nodes and the number of MAXMISOs respectively, and Columns 4 and 5 show the number of unique MAXMISOs and the number of levels in the collapsed graph. When the application is partitioned in single-output instructions of variable size, the SMMs, we have implemented different versions of the FIX SMM partitioning. In Table 5.3, we present information about the SMM partitioning of the considered benchmarks. To highlight the differences between the partitioning of an application in MAXMISOs and in SMMs, the table includes information on both MAXMISOs and SMMs: the total number, the number of unique elements, and the number of levels in the associated collapsed graph.

Table 5.3: Information about the FIX SMM partitioning of a set of benchmarks. We consider seven possible choices for the node removed from the MAXMISO to generate the SMMs. For each subtable, Column 1 shows the SMM partitioning option (see Section 5.5 for more details); Columns 2, 3 and 4 show the number of nodes, the number of MAXMISOs, the unique MAXMISOs and the number of levels in the MAXMISO collapsed graph; Columns 5, 6 and 7 show the total number of SMMs, the unique SMMs and the number of levels in the SMM collapsed graph respectively.
SMM option | ♯ Nodes | ♯ MM | Unique MM | MM - ♯ Levels | ♯ SMM | Unique SMM | SMM - ♯ Levels

ADPCM
SMM 1 | 839 | 110 | 14 | 20 | 268 | 18 | 24
SMM 2 | 839 | 110 | 14 | 20 | 204 | 13 | 39
SMM 3 | 839 | 110 | 14 | 20 | 204 | 14 | 21
SMM 4 | 839 | 110 | 14 | 20 | 204 | 14 | 21
SMM 5 | 839 | 110 | 14 | 20 | 234 | 15 | 22
SMM 6 | 839 | 110 | 14 | 20 | 236 | 17 | 53
SMM 7 | 839 | 110 | 14 | 20 | 204 | 14 | 21

Vorb coslook
SMM 1 | 20 | 1 | 1 | 1 | 7 | 6 | 5
SMM 2 | 20 | 1 | 1 | 1 | 2 | 2 | 2
SMM 3 | 20 | 1 | 1 | 1 | 2 | 2 | 2
SMM 4 | 20 | 1 | 1 | 1 | 2 | 2 | 2
SMM 5 | 20 | 1 | 1 | 1 | 3 | 3 | 3
SMM 6 | 20 | 1 | 1 | 1 | 7 | 6 | 5
SMM 7 | 20 | 1 | 1 | 1 | 2 | 2 | 2

Vorb invsqlook
SMM 1 | 42 | 1 | 1 | 1 | 3 | 3 | 3
SMM 2 | 42 | 1 | 1 | 1 | 2 | 2 | 2
SMM 3 | 42 | 1 | 1 | 1 | 2 | 2 | 2
SMM 4 | 42 | 1 | 1 | 1 | 2 | 2 | 2
SMM 5 | 42 | 1 | 1 | 1 | 6 | 6 | 4
SMM 6 | 42 | 1 | 1 | 1 | 4 | 4 | 4
SMM 7 | 42 | 1 | 1 | 1 | 2 | 2 | 2

SAD
SMM 1 | 1310 | 1 | 1 | 1 | 532 | 14 | 10
SMM 2 | 1310 | 1 | 1 | 1 | 520 | 13 | 16
SMM 3 | 1310 | 1 | 1 | 1 | 520 | 12 | 15
SMM 4 | 1310 | 1 | 1 | 1 | 520 | 12 | 15
SMM 5 | 1310 | 1 | 1 | 1 | 536 | 11 | 17
SMM 6 | 1310 | 1 | 1 | 1 | 520 | 12 | 15
SMM 7 | 1310 | 1 | 1 | 1 | 520 | 12 | 15

MDCT 32
SMM 1 | 569 | 268 | 13 | 8 | 342 | 15 | 2
SMM 2 | 569 | 268 | 13 | 8 | 356 | 12 | 2
SMM 3 | 569 | 268 | 13 | 8 | 290 | 12 | 2
SMM 4 | 569 | 268 | 13 | 8 | 302 | 12 | 2
SMM 5 | 569 | 268 | 13 | 8 | 316 | 12 | 2
SMM 6 | 569 | 268 | 13 | 8 | 316 | 16 | 5
SMM 7 | 569 | 268 | 13 | 8 | 317 | 12 | 2

Hamming 1
SMM 1 | 69 | 1 | 1 | 1 | 18 | 6 | 17
SMM 2 | 69 | 1 | 1 | 1 | 2 | 2 | 2
SMM 3 | 69 | 1 | 1 | 1 | 2 | 2 | 2
SMM 4 | 69 | 1 | 1 | 1 | 2 | 2 | 2
SMM 5 | 69 | 1 | 1 | 1 | 20 | 7 | 18
SMM 6 | 69 | 1 | 1 | 1 | 18 | 6 | 17
SMM 7 | 69 | 1 | 1 | 1 | 2 | 2 | 2

Hamming 2
SMM 1 | 229 | 1 | 1 | 1 | 2 | 2 | 2
SMM 2 | 229 | 1 | 1 | 1 | 2 | 2 | 2
SMM 3 | 229 | 1 | 1 | 1 | 2 | 2 | 2
SMM 4 | 229 | 1 | 1 | 1 | 2 | 2 | 2
SMM 5 | 229 | 1 | 1 | 1 | 36 | 6 | 35
SMM 6 | 229 | 1 | 1 | 1 | 35 | 6 | 34
SMM 7 | 229 | 1 | 1 | 1 | 2 | 2 | 2

Gost Enc
SMM 1 | 897 | 58 | 9 | 34 | 182 | 17 | 68
SMM 2 | 897 | 58 | 9 | 34 | 116 | 11 | 67
SMM 3 | 897 | 58 | 9 | 34 | 116 | 12 | 24
SMM 4 | 897 | 58 | 9 | 34 | 116 | 12 | 24
SMM 5 | 897 | 58 | 9 | 34 | 216 | 19 | 129
SMM 6 | 897 | 58 | 9 | 34 | 148 | 13 | 100
SMM 7 | 897 | 58 | 9 | 34 | 116 | 12 | 24

Gost Dec
SMM 1 | 897 | 58 | 9 | 34 | 182 | 17 | 68
SMM 2 | 897 | 58 | 9 | 34 | 116 | 11 | 67
SMM 3 | 897 | 58 | 9 | 34 | 116 | 12 | 23
SMM 4 | 897 | 58 | 9 | 34 | 116 | 12 | 22
SMM 5 | 897 | 58 | 9 | 34 | 216 | 19 | 129
SMM 6 | 897 | 58 | 9 | 34 | 148 | 13 | 100
SMM 7 | 897 | 58 | 9 | 34 | 116 | 12 | 22

Cast Dec
SMM 1 | 523 | 47 | 17 | 19 | 101 | 25 | 25
SMM 2 | 523 | 47 | 17 | 19 | 94 | 19 | 37
SMM 3 | 523 | 47 | 17 | 19 | 94 | 20 | 20
SMM 4 | 523 | 47 | 17 | 19 | 110 | 21 | 20
SMM 5 | 523 | 47 | 17 | 19 | 145 | 30 | 64
SMM 6 | 523 | 47 | 17 | 19 | 110 | 26 | 50
SMM 7 | 523 | 47 | 17 | 19 | 94 | 21 | 20

Cast Enc
SMM 1 | 523 | 47 | 17 | 19 | 98 | 23 | 23
SMM 2 | 523 | 47 | 17 | 19 | 94 | 19 | 37
SMM 3 | 523 | 47 | 17 | 19 | 94 | 20 | 21
SMM 4 | 523 | 47 | 17 | 19 | 110 | 20 | 21
SMM 5 | 523 | 47 | 17 | 19 | 143 | 30 | 65
SMM 6 | 523 | 47 | 17 | 19 | 110 | 24 | 52
SMM 7 | 523 | 47 | 17 | 19 | 94 | 20 | 21

Twofish Dec
SMM 1 | 1115 | 86 | 9 | 3 | 285 | 21 | 8
SMM 2 | 1115 | 86 | 9 | 3 | 172 | 11 | 5
SMM 3 | 1115 | 86 | 9 | 3 | 172 | 12 | 6
SMM 4 | 1115 | 86 | 9 | 3 | 188 | 13 | 7
SMM 5 | 1115 | 86 | 9 | 3 | 332 | 17 | 9
SMM 6 | 1115 | 86 | 9 | 3 | 188 | 13 | 7
SMM 7 | 1115 | 86 | 9 | 3 | 188 | 13 | 7

Twofish Enc
SMM 1 | 1099 | 86 | 9 | 3 | 266 | 21 | 8
SMM 2 | 1099 | 86 | 9 | 3 | 172 | 11 | 5
SMM 3 | 1099 | 86 | 9 | 3 | 172 | 12 | 5
SMM 4 | 1099 | 86 | 9 | 3 | 188 | 13 | 7
SMM 5 | 1099 | 86 | 9 | 3 | 332 | 17 | 9
SMM 6 | 1099 | 86 | 9 | 3 | 188 | 13 | 7
SMM 7 | 1099 | 86 | 9 | 3 | 188 | 13 | 5
An expected first remark is that, by partitioning the application into SMMs instead of MAXMISOs, the number of single-output instructions that can be combined increases. More specifically, the number of SMMs is always greater than or equal to 2 · |{MMi s.t. |MMi| > 1}| + |{MMi s.t. |MMi| = 1}|, where |MMi| represents the total number of nodes of MMi. This holds because every MAXMISO with more than one node is split into at least two SMMs, while MAXMISOs with a single node are not partitioned further. Additionally, the number of levels of the associated collapsed graph increases. This means that the MIMO clustering algorithms have more elements to combine through more levels, and the generation of fused instructions can be more efficient. This translates into an increased overall application speed up, as shown by the experimental results (see Appendix A). Table 5.3 shows that, by removing an input node from a MAXMISO, the number of SMMs generated is limited compared with the removal of an output node, as described in Section 3.3. This happens because the removal of an input node generates only two SMMs: the node itself and the MAXMISO bereft of the node.
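The lower bound above follows from a simple count over the MAXMISO partition. The sketch below illustrates it; the function name and the partition sizes are ours, chosen for illustration, not taken from the benchmarks in Table 5.3.

```python
def smm_lower_bound(maxmiso_sizes):
    """Lower bound on the number of SMMs obtained from a MAXMISO partition.

    Every MAXMISO with more than one node is split into at least two SMMs
    (removing a node from it yields the node plus the remainder), while
    single-node MAXMISOs are left untouched.
    """
    multi = sum(1 for s in maxmiso_sizes if s > 1)
    single = sum(1 for s in maxmiso_sizes if s == 1)
    return 2 * multi + single

# Hypothetical partition: four multi-node MAXMISOs and two trivial ones.
sizes = [12, 7, 3, 2, 1, 1]
print(smm_lower_bound(sizes))  # -> 10
```

The bound is tight for several benchmarks in Table 5.3: for Gost Enc, all 58 MAXMISOs are multi-node and several SMM options yield exactly 2 · 58 = 116 SMMs.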
Table 5.4: Overall application speed up compared with pure software execution for the Parallel clustering algorithm. The table shows results for a selected set of benchmarks and different FPGA boards. As single-output instructions to cluster, we consider the MAXMISOs (MM) and the SMMs generated by Options 1-5 (SMM 1, ..., SMM 5) (see Section 5.5 for more details).

ADPCM         XC2VPX20   XC2VP30   XC2VP40   XC2VP50   XC2VPX70   XC2VP100
MM            1.68x      1.67x     1.66x     1.6x      1.65x      1.65x
SMM 1         1.72x      1.95x     2.32x     2.75x     3.6x       3.75x
SMM 2         1.53x      1.56x     2.2x      2.57x     3.55x      3.57x
SMM 3         1.72x      1.7x      2.32x     2.76x     3.54x      3.53x
SMM 4         1.55x      1.58x     2.01x     2.47x     3.25x      3.26x
SMM 5         2.07x      2.58x     2.33x     3.02x     3.65x      3.76x

Cast Dec      XC2VPX20   XC2VP30   XC2VP40   XC2VP50   XC2VPX70   XC2VP100
MM            1.4x       1.5x      2.25x     2.77x     2.01x      2.42x
SMM 1         1.72x      2.02x     3.73x     3.02x     3.02x      3.01x
SMM 2         1.5x       1.73x     2.33x     2.57x     2.55x      2.55x
SMM 3         1.53x      1.83x     2.38x     2.57x     2.54x      2.55x
SMM 4         1.58x      1.79x     2.28x     2.61x     2.6x       2.62x
SMM 5         1.61x      2x        2.32x     2.59x     2.57x      2.6x

Twofish Dec   XC2VPX20   XC2VP30   XC2VP40   XC2VP50   XC2VPX70   XC2VP100
MM            1.2x       1.1x      1.3x      1.25x     1.5x       1.78x
SMM 1         1.05x      1.15x     1.38x     1.4x      1.83x      1.95x
SMM 2         1.13x      1.13x     1.12x     1.25x     2.15x      3.7x
SMM 3         1.16x      1.17x     1.41x     1.43x     1.55x      3.82x
SMM 4         1.18x      1.19x     1.43x     1.44x     2.17x      4.25x
SMM 5         1.18x      1.25x     1.23x     1.26x     2.21x      3.75x
5.6.1 The Parallel Clustering Algorithm
Results concerning the Parallel clustering algorithm applied to a selected set of benchmarks are presented in Table 5.4 and depicted in Figure A.1 in the Appendix of this dissertation. We present results for ADPCM, Cast Dec and Twofish Dec. As mentioned in Section 4.3, the algorithm generates and selects an optimal set of application-specific instructions based on the available hardware resources. Therefore, we present results for six FPGA boards: XC2VPX20, XC2VP30, XC2VP40, XC2VP50, XC2VPX70 and XC2VP100. A first and expected remark is that the impact on performance of the MM and SMM partitioning algorithms increases with the size of the available area on the FPGA. This is explained by the fact that more single-output clusters can be selected for parallel execution on the FPGA. As mentioned in the previous section, an SMM partitioning generates more clusters and more levels in the associated collapsed graph. As a result, the parallel clustering with SMMs increases the overall speed up, as shown in Figure A.1. The Parallel clustering algorithm generates, at most, one disconnected MIMO
instruction for each level in the collapsed graph. As a result, the number of generated instructions is always less than or equal to the depth of the collapsed graph (see Section 4.2.3 and Table 5.3). The number of inputs and outputs never exceeds (30, 15) and is, on average, around half of these values. Additionally, the speed up can be further increased by scheduling the new instructions and by taking Eq. (5.3) into consideration. The occupied area is not shown since it is almost equal to the available area on the considered FPGAs.
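The per-level behaviour described here can be sketched compactly: nodes of the collapsed graph are assigned ASAP levels and each level is grouped into one candidate (possibly disconnected) MIMO instruction. The graph encoding and function names below are illustrative assumptions, not the dissertation's actual implementation.

```python
from collections import defaultdict

def levels(dag):
    """ASAP level of each node in a DAG given as {node: set(predecessors)}."""
    level, todo = {}, dict(dag)
    while todo:
        # Nodes whose predecessors are all leveled can be leveled now.
        ready = [n for n, preds in todo.items() if all(p in level for p in preds)]
        for n in ready:
            level[n] = 1 + max((level[p] for p in dag[n]), default=0)
            del todo[n]
    return level

def parallel_clusters(dag):
    """Group the single-output clusters of a collapsed graph by level.

    Each group can be fused into one (possibly disconnected) MIMO
    instruction, so at most depth-of-graph instructions are generated.
    """
    by_level = defaultdict(list)
    for n, l in levels(dag).items():
        by_level[l].append(n)
    return [sorted(by_level[l]) for l in sorted(by_level)]

# Hypothetical collapsed graph with three levels.
dag = {'a': set(), 'b': set(), 'c': {'a'}, 'd': {'a', 'b'}, 'e': {'c', 'd'}}
print(parallel_clusters(dag))  # -> [['a', 'b'], ['c', 'd'], ['e']]
```

In this sketch, the three-level graph yields at most three fused instructions, matching the bound on the number of generated instructions stated above.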
5.6.2 The Nautilus and Spiral Clustering Algorithms
Results concerning the Nautilus and the Spiral clustering algorithms for a set of benchmarks are presented in Table 5.7 and depicted in the Appendix of this dissertation. In the following, we present an analysis of the results.
INPUT-OUTPUT LIMITATIONS. If the limitations on the total number of inputs and outputs are very tight, the algorithms provide limited speed up. This is a consequence of the two-step clustering process. We do not combine single nodes, whose numbers of inputs and outputs are limited to 2 and 1, respectively. Instead, we combine single-output instructions of variable size which have, in general, more than 2 inputs (and 1 output). This is done to increase the speed up provided by the single instructions, since the speed up is usually proportional to the size (the total number of operations contained in the instruction). Additionally, as discussed in Section 5.5, the speed up can be further increased by scheduling the new instructions and by taking Eq. (5.3) into consideration.
REITERATION OF THE INITIAL CLUSTERING. The reiteration of the initial clustering in the custom instruction generation process provides, in most cases, additional speed up. When the speed up does not increase, the additional nodes included in the cluster (if any) increase the total number of inputs, and the overhead to load these inputs is not compensated by the inclusion of additional nodes later in the clustering. This suggests that a good choice would be to keep β variable during the clustering process (see Eq. (4.21)). This will be addressed in future work.
AREA USAGE. Table 5.5 gives an overview of the total area usage for each benchmark. In case not all the generated clusters can be implemented in hardware due to hardware resource limitations, the clusters are selected as described in Section 4.4.4.
NUMBER OF SELECTED CLUSTERS. In Table 5.6, we show the minimum and maximum number of fused instructions for a selected set of benchmarks for
Table 5.5: Area usage for a set of benchmarks for the Nautilus and the Spiral clustering algorithms.

Benchmark         Nautilus Area Usage   Spiral Area Usage
ADPCM             58.40%                42.32%
SAD               58.82%                38.47%
MDCT 32           53.89%                55.02%
Gost Enc          77.81%                77.93%
Gost Dec          76.84%                74.93%
Cast Dec          38.81%                41.49%
Cast Enc          38.89%                41.35%
Twofish Dec       88.31%                98.86%
Twofish Enc       86.93%                99.78%
Hamming 1         8.94%                 23.27%
Hamming 2         8.94%                 23.27%
Vorb invsqlook    8.94%                 23.27%
Vorb coslook      8.94%                 23.27%
the Nautilus and the Spiral clustering algorithms and for different input and output limitations. Given the number of clusters selected, it is expected that a considerable number of cycles can be further saved by scheduling the new instructions and by taking Eq. (5.3) into consideration.
OVERALL APPLICATION SPEED UP. Table 5.7 gives an overview of the minimum and maximum speed up for each benchmark, depending on the clustering algorithm used. Unfortunately, the benefit of the presented clustering algorithms cannot be compared exhaustively with the existing approaches. This is mainly due to the lack of a common set of benchmark applications and kernels used to test the various methodologies in the literature. However, the overall speed up for the cryptographic benchmarks has the same order of magnitude as, and is comparable with, the speed up produced by the existing approaches. For the ADPCM decoder, a quite common benchmark, our overall speed up is around 3.7x with the Parallel clustering and around 2.3x with the Nautilus and Spiral clustering. Looking at the existing approaches, [11] (3.4x), [18] (2.7x), [26] (2.3x), [93] (3.5x) and [113] (3.8x), we obtain similar speed up. More generally, the main differences compared with the existing approaches are the following:
1. Linear overall computational complexity. The clustering algorithms presented in this dissertation, with the exception of the Parallel clustering, have linear complexity in the number of processed elements, as described in Chapter 3 and Chapter 4.
Table 5.6: Number of fused instructions generated by clustering different kinds of single-output instructions (MAXMISOs and SMMs): the table shows the minimum and maximum number of fused instructions for a selected set of benchmarks for the Nautilus and the Spiral clustering algorithms. Different input and output limitations (inputs-outputs) are taken into consideration.

ADPCM      Nautilus min   Nautilus max   Spiral min   Spiral max
4-4        20             71             21           80
6-4        32             87             36           95
8-4        36             88             35           91
8-5        32             88             34           90
10-5       32             88             32           86
10-8       31             83             31           80
15-10      21             117            23           90
20-8       33             115            35           90
20-15      30             95             30           78
inf-inf    30             108            31           78

Twofish Enc  Nautilus min   Nautilus max   Spiral min   Spiral max
4-4          14             75             14           71
6-4          15             128            15           76
8-4          14             87             20           76
8-5          24             87             22           70
10-5         15             75             17           65
10-8         14             74             19           58
15-10        15             74             27           58
20-8         21             74             23           58
20-15        23             74             34           57
inf-inf      22             65             34           56

MDCT 32    Nautilus min   Nautilus max   Spiral min   Spiral max
4-4        15             104            17           108
6-4        20             98             21           102
8-4        21             95             25           103
8-5        22             93             22           98
10-5       16             93             16           95
10-8       18             89             19           103
15-10      13             85             18           107
20-8       21             89             25           110
20-15      33             79             28           116
inf-inf    34             55             34           141

Gost Dec   Nautilus min   Nautilus max   Spiral min   Spiral max
4-4        15             12             17           37
6-4        12             13             12           36
8-4        17             14             17           37
8-5        25             13             24           35
10-5       22             46             22           44
10-8       24             45             28           43
15-10      33             64             33           62
20-8       35             65             35           57
20-15      34             64             34           68
inf-inf    34             57             34           62
Table 5.7: Overall application speed up compared with pure software execution for the Nautilus and the Spiral clustering algorithms for a set of benchmarks. Since there are different single-output clusters to further combine into MIMO clusters (MAXMISOs and SMMs), the table shows the minimum and the maximum speed up. Additional results are presented in Appendix A.

Benchmark          Nautilus min   Nautilus max   Spiral min   Spiral max
ADPCM              1.19x          2.3x           1.23x        2.2x
SAD                2.23x          2.56x          2.23x        2.56x
MDCT 32            1.4x           3.03x          1.51x        2.32x
Gost Enc           1.02x          2.25x          1.05x        2.37x
Gost Dec           1.02x          2.26x          1.05x        2.28x
Cast Dec           1.13x          2.39x          1.12x        2.4x
Cast Enc           1.12x          2.11x          1.09x        2.11x
Twofish Dec        1.13x          5.03x          1.12x        4.66x
Twofish Enc        1.12x          5.56x          1.12x        4.55x
Hamming 1          2.2x           3.74x          2.2x         3.74x
Hamming 2          4x             11.345x        4x           11.345x
Vorbis Invsqlook   1.2x           2.9x           1.2x         2.66x
Vorbis Coslook     1.58x          1.58x          1.9x         1.9x
2. Convexity of the instructions by construction. The fused instructions are generated to be atomically executed in hardware. The convexity of the instructions, a property described in Section 4.1, is theoretically guaranteed by construction, in contrast to the existing approaches, which verify the convexity of the instructions every time a new operation is included in the cluster during the generation process. Testing the convexity of a cluster involves a repeated analysis of the nodes in the cluster to verify that, for each pair of nodes, there exists no path connecting them which involves nodes not belonging to the cluster. Guaranteeing convexity by construction therefore saves a considerable amount of execution time. 3. Multi-step clustering. The generation of new instructions, in contrast to the majority of the existing approaches, is done in multiple steps to reduce the overall complexity of the generation process, as described in the previous chapters. 4. Different types of clustering. The algorithms presented in this dissertation generate single- or multiple-output instructions (connected or disconnected) to atomically execute in hardware as single application-specific instructions. Different clustering algorithms are presented for the generation of both single- and multiple-output instructions. All the generated instructions can directly be implemented in hardware or undergo a selection process if the hardware resources are limited (see Figure 5.3). 5. Automatic generation and selection of custom application-specific instructions based on the available hardware resources. The toolchain for the generation and selection of fused instructions, as described in Section 5.4, does not require any manual effort to be adjusted to the specific application and, except for the integration of the various tools, all the steps are automated.
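The pairwise-path test described in point 2 is equivalent to checking that no node outside the cluster is both reachable from the cluster and able to reach it. A minimal sketch of that check follows; the graph encoding and function names are our assumptions, not the actual tool implementation.

```python
def reachable(starts, succ):
    """All nodes reachable from `starts` via successor sets `succ` (DFS)."""
    seen, stack = set(), list(starts)
    while stack:
        n = stack.pop()
        for m in succ.get(n, ()):
            if m not in seen:
                seen.add(m)
                stack.append(m)
    return seen

def is_convex(cluster, succ, pred):
    """A cluster is convex iff no path leaves it and re-enters it.

    Equivalently: no outside node is simultaneously reachable from the
    cluster (via successors) and able to reach it (via predecessors).
    """
    cluster = set(cluster)
    down = reachable(cluster, succ) - cluster  # outside nodes fed by the cluster
    up = reachable(cluster, pred) - cluster    # outside nodes feeding the cluster
    return not (down & up)

# Diamond a -> {b, c} -> d: the cluster {a, d} is not convex, because the
# paths a -> b -> d and a -> c -> d pass through nodes outside the cluster.
succ = {'a': {'b', 'c'}, 'b': {'d'}, 'c': {'d'}, 'd': set()}
pred = {'a': set(), 'b': {'a'}, 'c': {'a'}, 'd': {'b', 'c'}}
print(is_convex({'a', 'd'}, succ, pred))  # -> False
print(is_convex({'a', 'b'}, succ, pred))  # -> True
```

Two reachability traversals replace the quadratic pairwise test, which is why repeating this check at every cluster extension, as the existing approaches do, is costly compared with guaranteeing convexity by construction.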
5.7 Summary
In this chapter, we evaluated the performance of the algorithms proposed in Chapter 4. The chapter started by describing the main limitations of common reconfigurable architectures. A number of drawbacks led us to consider, for the experimental evaluation of the algorithms, a reconfigurable architecture based on the Molen machine
organization and the Molen programming paradigm, which allows us to overcome the usual limitations and to fully test the algorithms. After a short overview of the main characteristics of the selected architecture, the chapter presented the automatic toolchain used to carry out the experiments. The experimental results were discussed in detail, starting the analysis with remarks common to all the algorithms and continuing with a description of the results related to the Parallel, the Nautilus and the Spiral clustering algorithms for a set of well-known benchmarks. The experimental results show that the presented algorithms are effectively able to speed up the execution of an application with custom application-specific instructions (fused instructions). Additionally, as shown in Chapters 3 and 4, the algorithms have a very fast runtime, automatically generate and select the new instructions, guarantee the convexity of the final clusters by construction and have low computational complexity. Finally, in the next chapter, we provide concluding remarks on the work presented in this dissertation. The chapter highlights the main contributions and proposes future research directions.
Note. The content of this chapter is based on the following papers:
C. Galuzzi and K. Bertels, A Framework for the Automatic Generation of Instruction-Set Extensions for Reconfigurable Architectures, ARC 2008, March 2008.
C. Galuzzi, E. Moscu Panainte, Y. D. Yankova, K. Bertels and S. Vassiliadis, Automatic Selection of Application-Specific Instruction-Set Extensions, CODES+ISSS 2006, October 2006.
C. Galuzzi, D. Theodoropoulos and K. Bertels, A Clustering Method for the Identification of Convex Disconnected Multiple Input Multiple Output Instructions, SAMOS VIII, July 2008.
C. Galuzzi, K. Bertels and S. Vassiliadis, A Linear Complexity Algorithm for the Automatic Generation of Convex Multiple Input Multiple Output Instructions, ARC 2007, March 2007.
C. Galuzzi, K. Bertels and S. Vassiliadis, The Spiral Search: A Linear Complexity Algorithm for the Generation of Convex Multiple Input Multiple Output Instruction-Set Extensions, IC-FPT 2007, December 2007.
C. Galuzzi, D. Theodoropoulos, R. Meeuws and K. Bertels, Algorithms for the Automatic Extension of an Instruction-Set, DATE 2008, April 2008.
6 Conclusions
In this dissertation, we have described efficient methods able to identify, in an automatic fashion, new specialized, application-specific instruction-set extensions for reconfigurable architectures. The instruction generation process is based on a two-step clustering which first partitions the application into single-output instructions and then combines these instructions, following different policies, into multiple-output instructions. As a result, this dissertation has presented clustering algorithms for the generation of both single- and multiple-output instructions. The chapter unfolds in three sections. Section 6.1 summarizes the work presented in this dissertation. In Section 6.2, we present the main contributions of this dissertation. Finally, Section 6.3 lists some future directions and open issues worthy of further research and investigation in the context of instruction-set customization.
6.1 Outlook
The dissertation started by providing an overview of the state-of-the-art in instruction-set customization in Chapter 2. This chapter presented the issues involved in the customization of a given instruction-set in the form of a set of specialized instructions for a given application (application-specific) or domain of applications (domain-specific). The problem was presented in detail, considering different types of customizations and different types of instructions. Instruction generation and selection were exhaustively analyzed. The chapter presented a detailed description of the problem and, at the same time, provided an exhaustive overview of the research of the last years in instruction-set customization.
In this dissertation, the generation of custom instructions to extend a given instruction-set is performed via a multi-step clustering. Initially, the application is partitioned into single-output instructions and, afterwards, some of these instructions are combined into multiple-output instructions and selected to be atomically executed on the available hardware. Chapter 3 presented two algorithms for the partitioning of an application into single-output instructions. The first algorithm, the MAXMISO partitioning algorithm, partitions the application into single-output instructions of maximal size called MAXMISOs. Based on the available hardware resources, the output of the MAXMISO partitioning can be a set of instructions not suitable for hardware implementation. This is the case, for example, when the number of input operands of a custom instruction is limited or when the generated cluster requires additional hardware resources. To face this problem, the SMM partitioning algorithm is proposed. The SMM partitioning algorithm is based on the MAXMISO partitioning algorithm, and it repartitions the maximal single-output instructions into a set of smaller single-output instructions. In Chapter 4, we presented a set of three algorithms for the generation and selection of convex connected or disconnected MIMO instructions: the Parallel clustering algorithm, the Nautilus clustering algorithm and the Spiral clustering algorithm. After a description of the main concepts related to convex subgraphs, the chapter described the Parallel clustering algorithm, a clustering algorithm for the generation of convex disconnected MIMO instructions. Since it is not always possible to implement disconnected instructions in hardware, the chapter continued with the description of two heuristics of linear complexity: the Nautilus clustering algorithm and the Spiral clustering algorithm for the generation of convex connected MIMO instructions. After that, the chapter presented a possible selection process to select a subset of the generated instructions to implement in hardware based on the available hardware resources.
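The MAXMISO partitioning summarized above admits a compact sketch: repeatedly pick an exit node and grow its cluster backwards, adding a predecessor only once all of its remaining successors are already in the cluster. The graph below and the helper names are ours, chosen for illustration; this is not the dissertation's actual implementation.

```python
def maxmiso_partition(pred, succ):
    """Partition a DAG into MAXMISOs (maximal single-output subgraphs).

    pred / succ: {node: set of predecessors / successors}.
    Growing backwards from an exit node and admitting a predecessor only
    when all of its still-unassigned successors are inside the cluster
    keeps every cluster single-output and makes it maximal.
    """
    remaining = set(pred)
    partition = []
    while remaining:
        # Any node with no remaining successors is an exit of the residual graph.
        exit_node = min(n for n in remaining if not (succ[n] & remaining))
        mm, frontier = {exit_node}, [exit_node]
        while frontier:
            n = frontier.pop()
            for p in pred[n] & remaining:
                if p not in mm and (succ[p] & remaining) <= mm:
                    mm.add(p)
                    frontier.append(p)
        partition.append(mm)
        remaining -= mm
    return partition

# Hypothetical data-flow graph: node 'a' feeds two output nodes 'b' and 'c'.
pred = {'a': set(), 'b': {'a'}, 'c': {'a'}}
succ = {'a': {'b', 'c'}, 'b': set(), 'c': set()}
print([sorted(mm) for mm in maxmiso_partition(pred, succ)])  # -> [['b'], ['a', 'c']]
```

Note how the shared node 'a' is excluded from the first cluster (it would give it a second output) and is absorbed only once its other consumer has been removed from the residual graph; each node is processed a bounded number of times, in line with the linear complexity claimed for the partitioning.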
Chapter 5 addressed the experimental results. In this chapter, we evaluated the performance of the algorithms proposed in this dissertation. The chapter started by describing the common limitations of reconfigurable architectures. A number of problems with the common architectures led us to consider, for the experimental evaluation of the quality of the presented algorithms, an architecture based on the Molen machine organization and the Molen programming paradigm, which allows us to overcome the usual limitations and to fully test the algorithms. After that, a general description of the main concepts related to the Molen architecture was presented, followed by the automatic toolchain used to carry out the experiments. We continued by presenting some general remarks on the experiments and a description of the results related to the proposed algorithms for a set of well-known benchmarks. The experimental results have
shown that the presented algorithms are effectively able to speed up the execution of an application with custom application-specific instructions on a reconfigurable architecture. Additionally, as shown in Chapters 3 and 4, the algorithms have a very fast runtime, they automatically generate and select the new instructions and they guarantee the convexity of the final clusters by construction.
6.2 Contributions
The main contributions of this dissertation can be summarized as follows. 1. Automatic generation and selection of custom application-specific instructions based on the available hardware resources. To evaluate the qualities of the clustering algorithms presented in this dissertation, we have built a dedicated toolchain, as described in Section 5.4. None of the tools in the developed toolchain requires manual effort to be adjusted to the target application. All the steps are automated except for the integration of the tools. 2. Convexity of the clusters by construction. The presented algorithms generate convex clusters that can be atomically executed in hardware. Contrary to the existing approaches for instruction-set customization, the convexity of the clusters, a property described in Section 4.1, is theoretically guaranteed by construction. Existing approaches verify the convexity of the clusters every time a new operation is included during the clustering process. Guaranteeing convexity by construction therefore saves a considerable amount of execution time. 3. Elimination of the restrictions on the type and number of new instructions. In contrast to the majority of the existing approaches, the new instructions are generated without limitations on the type of instructions that can be generated, on the number of input/output values of the custom instructions or on the number of instructions. 4. Multi-step clustering. The generation of the new instructions is based on a multi-step clustering to reduce the overall computational complexity. Single-output instructions of variable size, in terms of number of basic operations, are clustered into convex multiple-output instructions. In contrast to the majority of the existing approaches, clustering is done
in two steps so as to reduce the overall complexity of the generation process, as described in Chapters 3 and 4. 5. Different kinds of custom instructions. The presented clustering algorithms generate convex disconnected and connected instructions to exploit the parallelism provided by the hardware implementation of the new instructions. All the algorithms presented in this dissertation generate instructions, single- or multiple-output and connected or disconnected, which can be atomically executed in hardware. Although the focus of this dissertation is on the generation of multiple-output custom instructions, the single-output instructions generated in the first phase can also be selected, based on the available hardware resources, for hardware implementation. 6. Overall linear computational complexity. The clustering algorithms presented in this dissertation, with the exception of the Parallel clustering, have linear complexity in the number of processed elements, as described in Chapters 3 and 4. In particular, this results in no scalability problems: applications with thousands of instructions can be analyzed and sped up in a very limited amount of time.
6.3 Open Issues and Future Directions
A considerable amount of effort has been spent in the past years to provide efficient methods for the customization of an instruction-set by means of a set of specialized instructions. The problems involved, as described in the previous chapters, are computationally complex. Hardware/software partitioning, equivalent to instruction-set customization under certain assumptions, is proven to be NP-hard in the general case [5]. Optimal solutions have been proposed by many authors, and a plethora of efficient heuristics have been proposed to find near-optimal solutions when the computational complexity of the problem becomes unmanageable and exact solutions cannot be found in a timely manner. The complexity of the problem and the advancements in many fields of research promise progress for many of the high-complexity subproblems involved in custom instruction generation. An open issue is the degree of human effort required to identify and implement the instruction-set extensions. Although human ingenuity in the manual creation of custom capabilities produces high-quality results, the complexity of the problem as well as time-to-market requirements led researchers to look for
automatic or partially-automatic methods for identifying custom instructions. Quality results can be produced through a balance of human intervention and automatic methods in the generation of the instructions. However, future approaches will have to substantially minimize the amount of human effort due to the increasing complexity of the designs. As things stand, one of the major limitations in the generation of custom instructions is the limited number of input and output operands of the instructions. This limitation, which is architecture-specific, can be relaxed using one or more of the few methods proposed to overcome severe limitations on the number of operands, as mentioned in Section 2.4.3. Although progress has been made in this direction, restrictions on the number of operands still limit the performance of the custom instructions. Nowadays, it is not unusual to own devices which integrate a phone, an audio and video player and other features. The number of features integrated in a single device increases day by day at the cost of additional power consumption. Many low-power and power-aware architectures have been proposed. While the former minimize power consumption while satisfying performance constraints, the latter maximize performance parameters while satisfying power constraints. Custom instructions need to take this aspect into consideration as well. In this case, the custom instructions will be a trade-off between size (limiting the size of the instruction limits the power consumption as well) and frequency of execution (a limited number of executions reduces the power consumption). In the last few years, multi-core systems have become ubiquitous. Many architectures integrate two or more cores in the same hardware to increase execution performance by exploiting the available parallelism. Multi-core architectures can provide high performance, run at lower clock speeds than single-core architectures and can reduce power consumption.
Therefore, how can instruction-set customization take advantage of multi-core technology? Multi-core systems can be homogeneous or heterogeneous. The former implement identical copies of the same core: same frequencies, cache sizes, functions, etc. Examples are the Intel Core 2 Duo and the Advanced Micro Devices Athlon 64 X2. Heterogeneous systems integrate different cores which can have different functions, frequencies, memory models, etc. Examples are the CELL processor used in Sony's PlayStation 3 game console and the Tilera TILE64. Instruction-set customization can take advantage of a multi-core architecture by extending each core with a set of specialized instructions. When the degree of specialization of a custom instruction is too high, instruction reuse becomes hard. However, a custom instruction designed to speed up applications from different domains cannot be too specialized. The requirement to speed up different applications from different domains results in the generation of custom instructions of limited size, since it is uncommon that applications from different domains perform the same complex calculations. Therefore, the custom instructions are limited to a few operations per instruction, and the limited size results in a reduction in performance. The multi-core concept allows the specialization domains of the custom instructions to be extended: each core can be extended with a different set of specialized instructions for a given domain. In this way, a considerable number of applications from different domains, such as graphics, audio, cryptography, communications, mathematics and biology, can be efficiently executed on the architecture. The size of the custom instructions can be increased by limiting the domains of specialization of the single instructions. Although such architectures are more complex, efficiency, power consumption and other benefits can outweigh the increased complexity. Our future research will address, amongst others, instruction-set customization for multi-core reconfigurable architectures. Besides, we will exhaustively study the impact of the variation of the initial settings of the algorithms presented in this dissertation. Additionally, we will include power analysis and hardware reuse in the customization process.
A Experimental Results
In this Appendix, we present the most representative results for the Parallel, Nautilus and Spiral clustering algorithms. First, we present results concerning the Parallel clustering algorithm. Three benchmarks have been tested for six different FPGA sizes. As mentioned in Section 4.3, the algorithm generates and selects an optimal set of application-specific instructions based on the available hardware resources. We consider a PowerPC 405 and six different FPGA boards: XC2VPX20, XC2VP30, XC2VP40, XC2VP50, XC2VPX70 and XC2VP100. The PowerPC operates at 300 MHz while the FPGA operates at 100 MHz. We present results for ADPCM, Cast Dec and Twofish Dec on the six boards. After that, we present some results concerning the Nautilus and Spiral clustering algorithms. Each figure depicts the overall application speed up compared to the pure software execution. Results for the Nautilus clustering are presented on the left side of each figure; results for the Spiral clustering are on the right side. The two clusterings combine MAXMISOs or SMMs, with or without repetition of the initial clusterings, as described in Section 4.4.3, and with different limitations on the total number of inputs and outputs.
Parallel Clustering Algorithm
[Bar charts, one panel per FPGA (XC2VP100, XC2VPX70, XC2VP50, XC2VP40, XC2VP30, XC2VPX20), showing the speed up obtained with MM and with Options 1-5 of the SMM partitioning for the Adpcm, Cast Dec and Twofish Dec benchmarks; the numeric chart data is not recoverable from the text.]
Figure A.1: Overall application speed up compared to the pure software execution. The figure shows results for three benchmarks and for six different FPGA sizes.
[Figure A.2 — chart data omitted. Nautilus and Spiral clustering algorithms; panels (Nautilus left, Spiral right): 4-4, 6-4, 8-4, 8-5, 10-5; y-axis: speedup.]

Figure A.2: ADPCM Decoder, Part 1
[Figure A.3 — chart data omitted. Panels (Nautilus left, Spiral right): 10-8, 15-10, 20-8, 20-15, Inf-Inf; y-axis: speedup.]

Figure A.3: ADPCM Decoder, Part 2
[Figure A.4 — chart data omitted. Panels: Nautilus 4-4 (left), Spiral 4-4 (right); y-axis: speedup.]

Figure A.4: SAD
[Figure A.5 — chart data omitted. Panels (Nautilus left, Spiral right): 4-4, 6-4, 8-4, 8-5, 10-5; y-axis: speedup.]

Figure A.5: MDCT 32, Part 1
[Figure A.6 — chart data omitted. Panels (Nautilus left, Spiral right): 10-8, 15-10, 20-8, 20-15, Inf-Inf; y-axis: speedup.]

Figure A.6: MDCT 32, Part 2
[Figure A.7 — chart data omitted. Panels (Nautilus left, Spiral right): 4-4, 6-4, 8-4, 8-5, 10-5; y-axis: speedup.]

Figure A.7: Gost Encrypt, Part 1
[Figure A.8 — chart data omitted. Panels (Nautilus left, Spiral right): 10-8, 15-10, 20-8, 20-15, Inf-Inf; y-axis: speedup.]

Figure A.8: Gost Encrypt, Part 2
[Figure A.9 — chart data omitted. Panels (Nautilus left, Spiral right): 4-4, 6-4, 8-4, 8-5, 10-5; y-axis: speedup.]

Figure A.9: Gost Decrypt, Part 1
[Figure A.10 — chart data omitted. Panels (Nautilus left, Spiral right): 10-8, 15-10, 20-8, 20-15, Inf-Inf; y-axis: speedup.]

Figure A.10: Gost Decrypt, Part 2
[Figure A.11 — chart data omitted. Panels (Nautilus left, Spiral right): 4-4, 6-4, 8-4, 8-5, 10-5; y-axis: speedup.]

Figure A.11: Cast Decrypt, Part 1
[Figure A.12 — chart data omitted. Panels (Nautilus left, Spiral right): 10-8, 15-10, 20-8, 20-15, Inf-Inf; y-axis: speedup.]

Figure A.12: Cast Decrypt, Part 2
[Figure A.13 — chart data omitted. Panels (Nautilus left, Spiral right): 4-4, 6-4, 8-4, 8-5, 10-5; y-axis: speedup.]

Figure A.13: Cast Encrypt, Part 1
[Figure A.14 — chart data omitted. Panels (Nautilus left, Spiral right): 10-8, 15-10, 20-8, 20-15, Inf-Inf; y-axis: speedup.]

Figure A.14: Cast Encrypt, Part 2
[Figure A.15 — chart data omitted. Panels (Nautilus left, Spiral right): 4-4, 6-4, 8-4, 8-5, 10-5; y-axis: speedup.]

Figure A.15: Twofish Decrypt, Part 1
[Figure A.16 — chart data omitted. Panels (Nautilus left, Spiral right): 10-8, 15-10, 20-8, 20-15, Inf-Inf; y-axis: speedup.]

Figure A.16: Twofish Decrypt, Part 2
[Figure A.17 — chart data omitted. Panels (Nautilus left, Spiral right): 4-4, 6-4, 8-4, 8-5, 10-5; y-axis: speedup.]

Figure A.17: Twofish Encrypt, Part 1
[Figure A.18 — chart data omitted. Panels (Nautilus left, Spiral right): 10-8, 15-10, 20-8, 20-15, Inf-Inf; y-axis: speedup.]

Figure A.18: Twofish Encrypt, Part 2
[Figure A.19 — chart data omitted. Panels (Nautilus left, Spiral right): 15-10, 20-8; y-axis: speedup.]

Figure A.19: Hamming 1

[Figure A.20 — chart data omitted. Panels (Nautilus left, Spiral right): 20-15, Inf-Inf; y-axis: speedup.]

Figure A.20: Hamming 2
[Figure A.21 — chart data omitted. Panels: Nautilus 4-4 (left), Spiral 4-4 (right); y-axis: speedup.]

Figure A.21: Vorbis Invsqlook

[Figure A.22 — chart data omitted. Panels: Nautilus 4-4 (left), Spiral 4-4 (right); y-axis: speedup.]

Figure A.22: Vorbis Coslook
Bibliography [1] Alfred V. Aho, Mahadevan Ganapathi, and Steven W. K. Tjiang. Code generation using tree matching and dynamic programming. ACM Transactions on Programming Languages and Systems (TOPLAS), 11(4):491–516, 1989. [2] Alex Alet`a, Josep M. Codina, Antonio Gonz´alez, and David Kaeli. Removing communications in clustered microarchitectures through instruction replication. ACM Transactions on Architecture and Code Optimization (TACO), 1(2):127–151, 2004. [3] Cesare Alippi, William Fornaciari, Laura Pozzi, and Mariagiovanna Sami. A DAG -based design approach for reconfigurable VLIW processors. In DATE ’99: Proceedings of the conference on Design, automation and test in Europe, pages 778–779, March 1999. [4] Alauddin Yousif Alomary. A hardware/software codesign partitioner for asip design. In ICECS ’96: Proceedings of the Third IEEE International Conference on Electronics, Circuits, and Systems, pages 251–254, Oct 1996. ´ am Mann, Andr´as Orb´an, and [5] P´eter Arat´o, S´andor Juh´asz, Zolt´an Ad´ D´avid Papp. Hardware-software partitioning in embedded system design. In IEEE International Symposium on Intelligent Signal Processing, WISP 2003, pages 197–202, Budapest, Hungary, 4-6 Sept. 2003. [6] Marnix Arnold. Instruction Set Extension for Embedded Processors. PhD thesis, University of Delft, The Netherlands, 2001. [7] Marnix Arnold and Henk Corporaal. Designing domain-specific processors. In CODES ’01: Proceedings of the ninth international symposium on Hardware/software codesign, pages 61–66, 2001. [8] Kubilay Atasu. Hardware/Software Partitioning for Custom Instruction Processors. PhD thesis, Bo˘gazic¸i University, Turkey, Dec. 2007. ¨ [9] Kubilay Atasu, G¨unhan D¨undar, and Can Ozturan. An integer linear programming approach for identifying instruction-set extensions. In CODES+ISSS ’05: Proceedings of the 3rd IEEE/ACM/IFIP international conference on Hardware/software codesign and system synthesis, pages 172–177, 2005.
147
148
B IBLIOGRAPHY
¨ [10] Kubilay Atasu, Oskar Mencer, Wayne Luk, Can Ozturan, and G¨unhan D¨undar. Fast custom instruction identification by convex subgraph enumeration. In ASAP 2008. International Conference on ApplicationSpecific Systems, Architectures and Processors, pages 1–6, 2008. [11] Kubilay Atasu, Laura Pozzi, and Paolo Ienne. Automatic applicationspecific instruction-set extensions under microarchitectural constraints. In DAC ’03: Proceedings of the 40th conference on Design automation, pages 256–261, 2003. [12] Peter M. Athanas and Harvey F. Silverman. Processor reconfiguration through instruction-set metamorphosis. Computer, 26(3):11–18, 1993. [13] Massimo Baleani, Frank Gennari, Yunjian Jiang, Yatish Patel, Robert K. Brayton, and Alberto Sangiovanni-Vincentelli. HW/SW partitioning and code generation of embedded control applications on a reconfigurable architecture platform. In CODES ’02: Proceedings of the tenth international symposium on Hardware/software codesign, pages 151– 156, 2002. [14] Francisco Barat and Rudy Lauwereins. Reconfigurable instruction set processors: A survey. In RSP 2000: Proceedings of the 11th IEEE International Workshop on Rapid System Prototyping, page 168, 2000. [15] Francisco Barat, Rudy Lauwereins, and Geert Deconinck. Reconfigurable instruction set processors from a hardware/software perspective. IEEE Transactions on Software Engineering, 28(9):847–862, Sept. 2002. [16] Nguyen Ngoc B`ınh, Masaharu Imai, and Akichika Shiomi. A new hw/sw partitioning algorithm for synthesizing the highest performance pipelined ASIPs with multiple identical fus. In EURO-DAC ’96/EUROVHDL ’96: Proceedings of the conference on European design automation, pages 126–131, 1996. [17] Nguyen Ngoc B`ınh, Masaharu Imai, Akichika Shiomi, and Nobuyuki Hikichi. A hardware/software partitioning algorithm for designing pipelined ASIPs with least gate counts. In DAC ’96: Proceedings of the 33rd annual conference on Design automation, pages 527–532, 1996. 
[18] Partha Biswas, Sudarshan Banerjee, Nikil Dutt, Laura Pozzi, and Paolo Ienne. Isegen: Generation of high-quality instruction set extensions by
B IBLIOGRAPHY
149
iterative improvement. In DATE ’05: Proceedings of the conference on Design, Automation and Test in Europe, pages 1246–1251, 2005. [19] Christophe Bobda. Springer, 2007.
Introduction to Reconfigurable Computing.
[20] Robert King Brayton and F. Somenzi. Boolean relations and the incomplete specification of logic networks. In ICCAD ’89: Proceedings of the 1992 IEEE/ACM international conference on Computer-aided design, pages 316–319, Nov. 1989. [21] Philip Brisk, Adam Kaplan, Ryan Kastner, and Majid Sarrafzadeh. Instruction generation and regularity extraction for reconfigurable processors. In CASES ’02: Proceedings of the 2002 international conference on Compilers, architecture, and synthesis for embedded systems, pages 262–269, 2002. [22] D. Buell, W.J. Kleinfelder, and J.M. Arnold. Splash 2: FPGAs in a Custom Computing Machine. 1996. [23] Lin Chen. Graph isomorphism and identification matrices: Parallel algorithms. IEEE Transactions on Parallel and Distributed Systems, 7(3):308–319, 1996. [24] Hoon Choi, Jong-Sun Kim, Chi-Won Yoon, In-Cheol Park, Seung Ho Hwang, and Chong-Min Kyung. Synthesis of application specific instructions for embedded DSP software. IEEE Transactions on Computers, 48(6):603–614, June 1999. [25] N. Clark. Customizing the Computation Capabilities of Microprocessors. PhD thesis, University of Michigan, Ann Arbor, 2007. [26] Nathan Clark, Wilkin Tang, and Scott Mahlke. Automatically generating custom instruction set extensions. In WASP 2002: Proceedings of 1st Workshop on Application Specific Processors, pages 94–101, Istanbul, Turkey, 19 Nov. 2002. [27] Nathan Clark, Hongtao Zhong, and Scott Mahlke. Processor acceleration through automated instruction set customization. In MICRO 36: Proceedings of the 36th annual IEEE/ACM International Symposium on Microarchitecture, page 129, 2003.
150
B IBLIOGRAPHY
[28] Katherine Compton and Scott Hauck. Reconfigurable computing: a survey of systems and software. ACM Comput. Surv., 34(2):171–210, 2002. [29] Jason Cong, Yiping Fan, Guoling Han, and Zhiru Zhang. Applicationspecific instruction generation for configurable processor architectures. In FPGA ’04: Proceedings of the 2004 ACM/SIGDA 12th international symposium on Field programmable gate arrays, pages 183–189, 2004. [30] Olivier Coudert. On solving covering problems. In DAC ’96: Proceedings of the 33rd annual conference on Design automation, pages 197–202, 1996. [31] Olivier Coudert and Jean Chritophe Madre. New ideas for solving covering problems. In DAC ’95: Proceedings of the 32nd ACM/IEEE conference on Design automation, pages 641–646, 1995. [32] Giovanni De Micheli and Rajesh K. Gupta. Hardware/software codesign. Proceedings of IEEE, 85(3):349–365, March 1997. [33] Carl Ebeling, Darren C. Cronquist, and Paul Franklin. Rapid - reconfigurable pipelined datapath. In FPL ’96: Proceedings of the 6th International Workshop on Field-Programmable Logic, Smart Applications, New Paradigms and Compilers, pages 126–135, London, UK, 1996. [34] Elena Moscu Painante. The Molen Compiler for Reconfigurable Architectures. PhD thesis, Delft University of Technology, The Netherlands, 2007. [35] Paolo Faraboschi, Geoffrey Brown, Joseph A. Fisher, Giuseppe Desoli, and Fred Homewood. Lx : a technology platform for customizable VLIW embedded processing. ACM SIGARCH Computer Architecture News, Special Issue: Proceedings of the 27th annual international symposium on Computer architecture (ISCA ’00), 28(2):203–213, 2000. [36] William Fornaciari, Laura Pozzi, and Mariagiovanna Sami. Processori riconfigurabili: unalternativa flessibile per i sistemi dedicati. Alta Frequenza - Rivista di Elettronica, pages 22–28, 1999. [37] Scott Fortin. The graph isomorphism problem. Technical Report TR 9620, Department of Computing Science, University of Alberta, Canada, July 1996.
B IBLIOGRAPHY
151
[38] Carlo Galuzzi, Koen Bertels, and Stamatis Vassiliadis. A linear complexity algorithm for the generation of multiple input single output instructions of variable size. Embedded Computer Systems: Architectures, Modeling, and Simulation, 7th International Workshop, SAMOS 2007, vol. 4599, pages 283–293, Samos, Greece, July 16-19, 2007. [39] Carlo Galuzzi, Koen Bertels, and Stamatis Vassiliadis. A linear complexity algorithm for the automatic generation of convex multiple input multiple output instructions. Reconfigurable Computing: Architectures, Tools and Applications, Third International Workshop, ARC 2007, volume 4419, pages 130–141, Mangaratiba, Brazil, March 27-29, 2007. [40] Carlo Galuzzi, Elena Moscu Panainte, Yana Yankova, Koen Bertels, and Stamatis Vassiliadis. Automatic selection of application-specific instruction-set extensions. In CODES+ISSS ’06: Proceedings of the 4th international conference on Hardware/software codesign and system synthesis, pages 160–165, 2006. [41] Aman Gayasen, N. Vijaykrishnan, Mahmut Kandemir, and Arif Rahman. Switch box architectures for three-dimensional fpgas. In FCCM ’06: Proceedings of the 14th Annual IEEE Symposium on FieldProgrammable Custom Computing Machines, pages 335–336, 2006. [42] Werner Geurts. Synthesis of Accelerator Data Paths for HighThroughput Signal Processing Applications. PhD thesis, Katholieke Universiteit Leuven, 1995. [43] Werner Geurts. Accelerator Data-Path Synthesis for High-Throughput Signal Processing Applications. Kluwer Academic Publishers, Norwell, MA, USA, 1997. [44] Maya Gokhale, William Holmes, Andrew Kopser, Sara Lucas, Ronald Minnich, Douglas Sweely, and Daniel Lopresti. Building and using a highly parallel programmable logic array. Computer, 24(1):81–89, 1991. [45] Seth Copen Goldstein, Herman Schmit, Matthew Moe, Mihai Budiu, Srihari Cadambi, R. Reed Taylor, and Ronald Laufer. Piperench: a coprocessor for streaming multimedia acceleration. SIGARCH Comput. Archit. News, 27(2):28–39, 1999.
152
B IBLIOGRAPHY
[46] A. Grasselli and F. Luccio. A method for minimizing the number of internal states in incompletely specified sequential networks. IEEE Trans. Electron. Comp., EC-14:350–359, June 1965. [47] Yuanqing Guo. Mapping Applications to a Coarse-Grained Reconfigurable Architecture. PhD thesis, University of Twente, The Netherlands, 2006. [48] Reiner Hartenstein. Coarse grain reconfigurable architecture (embedded tutorial). In ASP-DAC ’01: Proceedings of the 2001 conference on Asia South Pacific design automation, pages 564–570, 2001. [49] Reiner Hartenstein. A decade of reconfigurable computing: a visionary retrospective. In DATE ’01: Proceedings of the conference on Design, automation and test in Europe, pages 642–649, 2001. [50] S. Hauck, T. W. Fry, M. M. Hosler, and J. P. Kao. The chimaera reconfigurable functional unit. In FCCM ’97: Proceedings of the 5th IEEE Symposium on FPGA-Based Custom Computing Machines, page 87, 1997. [51] Scott Hauck, Thomas W. Fry, Matthew M. Hosler, and Jeffrey P. Kao. The chimaera reconfigurable functional unit. IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 12(2):206–217, 2004. [52] J. R. Hauser and J. Wawrzynek. Garp: a MIPS processor with a reconfigurable coprocessor. In FCCM ’97: Proceedings of the 5th IEEE Symposium on FPGA-Based Custom Computing Machines, page 12, 1997. [53] Simon D. Haynes, Peter Y. K. Cheung, Wayne Luk, and John Stone. Sonic - a plug-in architecture for video processing. In FCCM ’99: Proceedings of the Seventh Annual IEEE Symposium on FieldProgrammable Custom Computing Machines, page 280, 1999. [54] Simon D. Haynes, John Stone, Peter Y. K. Cheung, and Wayne Luk. Video image processing with the sonic architecture. Computer, 33(4):50–57, 2000. [55] Bruce Kester Holmer. Automatic design of computer instruction sets. PhD thesis, 1993. [56] Ing-Jer Huang and Alvin M. Despain. Generating instruction sets and microarchitectures from applications. In ICCAD ’94: Proceedings of
B IBLIOGRAPHY
153
the 1994 IEEE/ACM international conference on Computer-aided design, pages 391–396, 1994. [57] Ing-Jer Huang and Alvin M. Despain. Synthesis of instruction sets for pipelined microprocessors. In DAC ’94: Proceedings of the 31st annual conference on Design automation, pages 5–11, 1994. [58] Z. Huang and S. Malik. Managing dynamic reconfiguration overhead in system-on-a-chip design using reconfigurable datapaths and optimized interconnection networks. In DATE’01: Proceedings of the conference on Design, automation and test in Europe, pages 735–740, 2001. [59] Zhining Huang, Sharad Malik, Nahri Moreano, and Guido Araujo. The design of dynamically reconfigurable datapath coprocessors. Trans. on Embedded Computing Sys., 3(2):361–384, 2004. [60] Huynh Phung Huynh, Joon Edward Sim, and Tulika Mitra. An efficient framework for dynamic reconfiguration of instruction-set customization. In CASES ’07: Proceedings of the 2007 international conference on Compilers, architecture, and synthesis for embedded systems, pages 135–144, 2007. [61] Paolo Ienne and Rainer Leupers. Customizable Embedded Processors: Design Technologies and Applications. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 2006. [62] Masaharu Imai, Jun Sato, Alauddin Alomary, and Nobuyuki Hikichi. An integer programming approach to instruction implementation method selection problem. In EURO-DAC ’92: Proceedings of the conference on European design automation, pages 106–111, 1992. [63] C. Iseli. Spyder: A Reconfigurable Processor Development System. PhD thesis, Ecole Polytechnique Federale de Lausanne, 1996. [64] Christian Iseli and Eduardo Sanchez. Spyder: a sure (superscalar and reconfigurable) processor. The Journal of Supercomputing, 9(3):231– 252, 1995. [65] Martin Janssen, Francky Catthoor, and Hugo de Man. A specification invariant technique for regularity improvement between flow-graph clusters. In EDTC ’96: Proceedings of the 1996 European conference on Design and Test, page 138, 1996.
154
B IBLIOGRAPHY
[66] Ramkumar Jayaseelan, Haibin Liu, and Tulika Mitra. Exploiting forwarding to improve data bandwidth of instruction-set extensions. In DAC ’06: Proceedings of the 43rd annual conference on Design automation, pages 43–48, 2006. [67] R. Kastner, A. Kaplan, S. Ogrenci Memik, and E. Bozorgzadeh. Instruction generation for hybrid reconfigurable systems. ACM Transactions on Design Automation of Electronic Systems (TODAES), 7(4):605–627, 2002. [68] Ryan Kastner, Seda Ogrenci-Memik, Elaheh Bozorgzadeh, and Majid Sarrafzadeh. Instruction generation for hybrid reconfigurable systems. In ICCAD ’01: Proceedings of the 2001 IEEE/ACM international conference on Computer-aided design, pages 127–130, 2001. [69] K. Keutzer, S. Malik, and A. R. Newton. From Asic to Asip: The next design discontinuity. In ICCD ’02: Proceedings of the 2002 IEEE International Conference on Computer Design: VLSI in Computers and Processors (ICCD’02), pages 84–90, 2002. [70] Ian Kuon and Jonathan Rose. Area and delay trade-offs in the circuit and architecture design of FPGAs . In FPGA ’08: Proceedings of the 16th international ACM/SIGDA symposium on Field programmable gate arrays, pages 149–158, 2008. [71] Georgi Krasimirov Kuzmanov. The Molen Polymorphic Media Processor. PhD thesis, Delft University of Technology, 2004. [72] Chunho Lee, Miodrag Potkonjak, and William H. Mangione-Smith. Mediabench: a tool for evaluating and synthesizing multimedia and communicatons systems. In MICRO 30: Proceedings of the 30th annual ACM/IEEE international symposium on Microarchitecture, pages 330–335, 1997. [73] Jong-eun Lee, Kiyoung Choi, and Nikil D. Dutt. An algorithm for mapping loops onto coarse-grained reconfigurable architectures. In LCTES ’03: Proceedings of the 2003 ACM SIGPLAN conference on Language, compiler, and tool for embedded systems, pages 183–188, 2003. [74] R. Leupers, K. Karuri, S. Kraemer, and M. Pandey. 
A design flow for configurable embedded processors based on optimized instruction set extension synthesis. In DATE ’06: Proceedings of the conference on
B IBLIOGRAPHY
155
Design, automation and test in Europe, pages 581–586, Munich, Germany, 2006. [75] Xiao Yu Li, Matthias F. Stallmann, and Franc Brglez. Effective bounding techniques for solving unate and binate covering problems. In DAC ’05: Proceedings of the 42nd annual conference on Design automation, pages 385–390, 2005. [76] S. Liao, K. Keutzer, S. Tjiang, and S. Devadas. A new viewpoint on code generation for directed acyclic graphs. ACM Transactions on Design Automation of Electronic Systems (TODAES), 3(1):51–75, 1998. [77] Stan Liao and Srinivas Devadas. Solving covering problems using lprbased lower bounds. In DAC ’97: Proceedings of the 34th annual conference on Design automation, pages 117–120, 1997. [78] Stan Liao, Srinivas Devadas, Kurt Keutzer, and Steve Tjiang. Instruction selection using binate covering for code size optimization. In ICCAD ’95: Proceedings of the 1995 IEEE/ACM international conference on Computer-aided design, pages 393–399, 1995. [79] C. Liem, T. May, and P. Paulin. Instruction-set matching and selection for DSP and ASIP code generation. In Proceedings of the European Design and Test Conference (ED & TC), pages 31–37, Paris, France, Feb 1994. [80] S. Lin and B.W. Kernighan. An effective heuristic algorithm for the traveling-salesman problem. Operations Research, 21(2):498–516, 1973. [81] Guangming Lu, Hartej Singh, Ming-Hau Lee, Nader Bagherzadeh, Fadi J. Kurdahi, and Eliseu M. Chaves Filho. The morphosys parallel reconfigurable system. In Euro-Par ’99: Proceedings of the 5th International Euro-Par Conference on Parallel Processing, pages 727–734, London, UK, 1999. [82] R. J. Meeuws. A quantitative model for hardware/software partitioning. Master’s thesis, Delft University of technology, May 2007. [83] Bingfeng Mei, Serge Vernalde1, Diederik Verkest, Hugo De Man, and Rudy Lauwereins. Adres: An architecture with tightly coupled VLIW processor and coarse-grained reconfigurable matrix. 
In FPL 03: Proceedings of the 2003 International Conference on Field-Programmable Logic and Applications, pages 61–70. 2003.
156
B IBLIOGRAPHY
[84] Takashi Miyamori and Kunle Olukotun. Remarc (abstract): reconfigurable multimedia array coprocessor. In FPGA ’98: Proceedings of the 1998 ACM/SIGDA sixth international symposium on Field programmable gate arrays, page 261, 1998. [85] Nahri Moreano, Guido Araujo, Zhining Huang, and Sharad Malik. Datapath merging and interconnection sharing for reconfigurable architectures. In ISSS ’02: Proceedings of the 15th international symposium on System Synthesis, pages 38–43, 2002. [86] R. Niemann and P. Marwedel. An algorithm for hardware/software partitioning using mixed integer linear programming. Design Automation for Embedded Systems, Special Issue: Partitioning Methods for Embedded Systems, 2(2):165–193, March 1997. [87] Ralf Niemann and Peter Marwedel. Hardware/software partitioning using integer programming. In EDTC ’96: Proceedings of the 1996 European conference on Design and Test, page 473, 1996. [88] Armita Peymandoust, Laura Pozzil, Paolo Ienne, and Giovanni De Micheli. Automatic instruction set extension and utilization for embedded processors. In ASAP 2003: Proceedings of the 14th International Conference on Application-Specific Systems, Architectures and Processors, pages 108–118, The Hague, The Netherlands, 24-26 June 2003. [89] Nagaraju Pothineni, Anshul Kumar, and Kolin Paul. Application specific datapath extension with distributed I/O functional units. In VLSID ’07: Proceedings of the 20th International Conference on VLSI Design held jointly with 6th International Conference, pages 551–558, 2007. [90] L. Pozzi, M. Vuleti´c, and P. Ienne. Automatic topology-based identification of instruction-set extensions for embedded processors. Technical Report CS 01/377, EPFL, DI-LAP, Lausanne, Dec. 2001. [91] Laura Pozzi. Methodologies for the Design of Application-Specific Reconfigurable VLIW Processors. PhD thesis, Politecnico di Milano, Milano, Italy, Jan. 2000. [92] Laura Pozzi, Kubilay Atasu, and Paolo Ienne. 
Exact and approximate algorithms for the extension of embedded processor instruction sets. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 25(7):1209–1229, JULY 2006.
B IBLIOGRAPHY
157
[93] Laura Pozzi and Paolo Ienne. Exploiting pipelining to relax register-file port constraints of instruction-set extensions. In CASES ’05: Proceedings of the 2005 international conference on Compilers, architectures and synthesis for embedded systems, pages 2–10, 2005.

[94] J. Rabaey. Reconfigurable processing: The solution to low-power programmable DSP. In ICASSP ’97: Proceedings of the 1997 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP ’97) - Volume 1, page 275, 1997.

[95] Bozidar Radunovic and Veljko M. Milutinovic. A survey of reconfigurable computing architectures. In FPL ’98: Proceedings of the 8th International Workshop on Field-Programmable Logic and Applications, From FPGAs to Computing Paradigm, pages 376–385, London, UK, 1998.

[96] Rahul Razdan, Karl S. Brace, and Michael D. Smith. PRISC software acceleration techniques. In ICCD ’94: Proceedings of the 1994 IEEE International Conference on Computer Design: VLSI in Computers & Processors, pages 145–149, 1994.

[97] Rahul Razdan and Michael D. Smith. A high-performance microarchitecture with hardware-programmable functional units. In MICRO 27: Proceedings of the 27th annual international symposium on Microarchitecture, pages 172–180, 1994.

[98] C. R. Rupp, M. Landguth, T. Garverick, E. Gomersall, H. Holt, J. M. Arnold, and M. Gokhale. The NAPA adaptive processing architecture. In FCCM ’98: Proceedings of the IEEE Symposium on FPGAs for Custom Computing Machines, page 28, 1998.

[99] K. Seto and M. Fujita. Custom instruction generation with high-level synthesis. In SASP 2008: Proceedings of the 2008 Symposium on Application Specific Processors, pages 14–19, Anaheim, California, June 2008.

[100] D. Sreenivasa Rao and Fadi J. Kurdahi. Hierarchical design space exploration for a class of digital systems. IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 1(3):282–295, 1993.

[101] D. Sreenivasa Rao and Fadi J. Kurdahi. Partitioning by regularity extraction. In DAC ’92: Proceedings of the 29th ACM/IEEE conference on Design automation, pages 235–238, 8-12 June 1992.
[102] D. Sreenivasa Rao and Fadi J. Kurdahi. On clustering for maximal regularity extraction. IEEE Transactions on Computer-Aided Design, 12(8):1198–1208, August 1993.

[103] F. Sun, S. Ravi, A. Raghunathan, and N. K. Jha. Custom-instruction synthesis for extensible processor platforms. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 23(2):216–228, Feb. 2004.

[104] Fei Sun, Srivaths Ravi, Anand Raghunathan, and Niraj K. Jha. Synthesis of custom processors based on extensible platforms. In ICCAD ’02: Proceedings of the 2002 IEEE/ACM international conference on Computer-aided design, pages 641–648, 2002.

[105] Fei Sun, Srivaths Ravi, Anand Raghunathan, and Niraj K. Jha. A scalable application-specific processor synthesis methodology. In ICCAD ’03: Proceedings of the 2003 IEEE/ACM international conference on Computer-aided design, page 283, 2003.

[106] T.J. Todman, G.A. Constantinides, S.J.E. Wilton, O. Mencer, W. Luk, and P.Y.K. Cheung. Reconfigurable computing: architectures and design methods. IEE Proceedings - Computers and Digital Techniques, 152(2):193–207, Mar 2005.

[107] Johan Van Praet, Gert Goossens, Dirk Lanneer, and Hugo De Man. Instruction set definition and instruction selection for ASIPs. In ISSS ’94: Proceedings of the 7th international symposium on High-level synthesis, pages 11–16, 1994.

[108] N. Vassiliadis, N. Kavvadias, G. Theodoridis, and S. Nikolaidis. A RISC architecture extended by an efficient tightly coupled reconfigurable unit. International Journal of Electronics, 93(6):421–438, June 2006.

[109] Nikolaos Vassiliadis, George Theodoridis, and Spiridon Nikolaidis. Enhancing a reconfigurable instruction set processor with partial predication and virtual opcode support. In ARC 2006: Proceedings of the Second International Workshop on Applied Reconfigurable Computing, vol. 3985, pages 217–229, Delft, The Netherlands, March 1–3, 2006.

[110] Stamatis Vassiliadis and Dimitrios Soudris, editors. Fine- and Coarse-Grain Reconfigurable Computing. Springer, 2007.
[111] Stamatis Vassiliadis, Stephan Wong, and Sorin Cotofana. The Molen ρμ-coded processor. In FPL ’01: Proceedings of the 11th International Conference on Field-Programmable Logic and Applications, pages 275–285, London, UK, 2001.

[112] Stamatis Vassiliadis, Stephan Wong, Georgi Gaydadjiev, Koen Bertels, Georgi Kuzmanov, and Elena Moscu Panainte. The Molen polymorphic processor. IEEE Transactions on Computers, 53(11):1363–1375, 2004.

[113] Ajay K. Verma, Philip Brisk, and Paolo Ienne. Rethinking custom ISE identification: a new processor-agnostic method. In CASES ’07: Proceedings of the 2007 international conference on Compilers, architecture, and synthesis for embedded systems, pages 125–134, 2007.

[114] Albert Wang, Earl Killian, Dror Maydan, and Chris Rowen. Hardware/software instruction set configurability for system-on-chip processors. In DAC ’01: Proceedings of the 38th conference on Design automation, pages 184–188, 2001.

[115] M. Wazlowski, L. Agarwal, T. Lee, A. Smith, E. Lam, P. Athanas, H. Silverman, and S. Ghosh. PRISM-II compiler and architecture. In IEEE Workshop on FPGAs for Custom Computing Machines, pages 9–16, 1993.

[116] M. J. Wirthlin and Brad L. Hutchings. DISC: The dynamic instruction set computer. In Proceedings of the International Society of Optical Engineering (SPIE), Field Programmable Gate Arrays (FPGAs) for Fast Board Development and Reconfigurable Computing, volume 2607, pages 92–103, Oct. 1995.

[117] R. Wittig and P. Chow. OneChip: An FPGA processor with reconfigurable logic. In Proceedings of the IEEE Symposium on FPGAs for Custom Computing Machines, pages 126–135, Mar. 1996.

[118] Ralph D. Wittig. OneChip: An FPGA processor with reconfigurable logic. Master’s thesis, Department of Electrical and Computer Engineering, University of Toronto, 1995.

[119] Christophe Wolinski and Krzysztof Kuchcinski. Identification of application specific instructions based on sub-graph isomorphism constraints. In ASAP 2007: Proceedings of the IEEE International Conference on Application-specific Systems, Architectures and Processors, pages 328–333, July 2007.
[120] Christophe Wolinski and Krzysztof Kuchcinski. Automatic selection of application-specific reconfigurable processor extensions. In DATE ’08: Proceedings of the conference on Design, automation and test in Europe, pages 1214–1219, March 2008.

[121] Y. Yankova, G. Kuzmanov, K. Bertels, G. Gaydadjiev, Yi Lu, and S. Vassiliadis. DWARV: DelftWorkbench automated reconfigurable VHDL generator. In FPL ’07: Proceedings of the 17th International Conference on Field Programmable Logic and Applications, pages 697–701, Amsterdam, The Netherlands, August 2007.

[122] Zhi Alex Ye, Andreas Moshovos, Scott Hauck, and Prithviraj Banerjee. CHIMAERA: A high-performance architecture with a tightly-coupled reconfigurable functional unit. In ACM SIGARCH Computer Architecture News, Special Issue: Proceedings of the 27th annual international symposium on Computer architecture (ISCA ’00), pages 225–235, June 2000.

[123] Pan Yu and Tulika Mitra. Scalable custom instructions identification for instruction-set extensible processors. In CASES ’04: Proceedings of the 2004 international conference on Compilers, architecture, and synthesis for embedded systems, pages 69–78, 2004.

[124] Pan Yu and Tulika Mitra. Disjoint pattern enumeration for custom instructions identification. In FPL 2007: Proceedings of the 17th IEEE International Conference on Field Programmable Logic and Applications, Amsterdam, The Netherlands, August 2007.
List of Publications

International Journals
1. C. Galuzzi, K. Bertels and S. Vassiliadis, A Linear Complexity Algorithm for the Automatic Generation of Convex Multiple Input Multiple Output Instructions, International Journal of Electronics, Vol. 95, Issue 7, July 2008, pp. 603-619.

2. C. Galuzzi, C. Gou, H. Calderon, G. Gaydadjiev and S. Vassiliadis, High-bandwidth Address Generation Unit, Journal of Signal Processing Systems, DOI 10.1007/s11265-008-0174-x.

International Conferences
1. K. Sigdel, Mark Thompson, Andy D. Pimentel, C. Galuzzi and K. Bertels, System-Level Runtime Mapping Exploration of Reconfigurable Architectures, 16th Reconfigurable Architectures Workshop (RAW 2009), Rome (Italy), May 2009.

2. C. Galuzzi, D. Theodoropoulos, Roel Meeuws and K. Bertels, Algorithms for the Automatic Extension of an Instruction-Set, Design, Automation and Test in Europe (DATE 2009), Nice (France), April 2009.

3. C. Galuzzi, D. Theodoropoulos, Roel Meeuws and K. Bertels, Automatic Instruction-Set Extensions with the Linear Complexity Spiral Search, International Conference on ReConFigurable Computing and FPGAs (ReConFig 2008), Cancun (Mexico), December 2008.

4. C. Galuzzi, D. Theodoropoulos and K. Bertels, A Clustering Method for the Identification of Convex Disconnected Multiple Input Multiple Output Instructions, International Symposium on Systems, Architectures, MOdeling and Simulation (IC-SAMOS VIII), Samos (Greece), July 2008.

5. C. Galuzzi and K. Bertels, A Framework for the Automatic Generation of Instruction-Set Extensions for Reconfigurable Architectures, International Workshop on Applied Reconfigurable Computing (ARC 2008), pp. 280-286, London (UK), March 2008.
6. C. Galuzzi and K. Bertels, The Instruction-Set Extension Problem: A Survey, International Workshop on Applied Reconfigurable Computing (ARC 2008), pp. 209-220, London (UK), March 2008. - BEST PAPER AWARD

7. C. Galuzzi, K. Bertels and S. Vassiliadis, The Spiral Search: A Linear Complexity Algorithm for the Generation of Convex Multiple Input Multiple Output Instruction-Set Extensions, International Conference on Field-Programmable Technology 2007 (ICFPT’07), pp. 337-340, Kokurakita, Kitakyushu (Japan), December 2007.

8. C. Galuzzi, K. Bertels and S. Vassiliadis, A Linear Complexity Algorithm for the Generation of Multiple Input Single Output Instructions of Variable Size, International Symposium on Systems, Architectures, MOdeling and Simulation (SAMOS VII Workshop), pp. 295-305, Samos (Greece), July 2007.

9. H. Calderon, C. Galuzzi, G. Gaydadjiev and S. Vassiliadis, High-bandwidth Address Generation Unit, International Symposium on Systems, Architectures, MOdeling and Simulation (SAMOS VII Workshop), pp. 263-274, Samos (Greece), July 2007.

10. C. Galuzzi, K. Bertels and S. Vassiliadis, A Linear Complexity Algorithm for the Automatic Generation of Convex Multiple Input Multiple Output Instructions, International Workshop on Applied Reconfigurable Computing (ARC 2007), pp. 130-141, Mangaratiba, Rio de Janeiro (Brazil), March 2007.

11. C. Galuzzi, E. Moscu Panainte, Y. D. Yankova, K. Bertels and S. Vassiliadis, Automatic Selection of Application-Specific Instruction Set Extensions, International Conference on Hardware/Software Codesign and System Synthesis (CODES+ISSS), pp. 160-165, Seoul (Korea), October 2006.

Local Conferences
1. C. Galuzzi, K. Bertels and S. Vassiliadis, Graph Covering for Generating Application Specific Instructions: An Overview of Some Existing Methods, Proceedings of the 16th Annual Workshop on Circuits, Systems and Signal Processing, ProRisc 2005, Veldhoven (The Netherlands), November 2005.
2. C. Galuzzi, K. Bertels and S. Vassiliadis, Graph Theory and Application Specific Processors, Proceedings of the 15th Annual Workshop on Circuits, Systems and Signal Processing, ProRisc 2004, Veldhoven (The Netherlands), November 2004.

Reports
1. K. Bertels, S. Vassiliadis, E. Moscu Panainte, Y. D. Yankova, C. Galuzzi, R. Chaves and G. K. Kuzmanov, Developing Applications for Polymorphic Processors: The Delft Workbench, Technical Report, January 2006.
Samenvatting

In this dissertation, we address the design of algorithms for the automatic identification and selection of complex application-specific instructions that can accelerate the execution of applications on reconfigurable architectures. The computationally intensive parts of an application are analyzed and partitioned into code segments executed in software and code segments executed in hardware as single instructions. These instructions extend the instruction-set of the reconfigurable architecture used and are application-specific. The main goal of the work presented in this dissertation is the identification of application-specific instructions with multiple inputs and multiple outputs. The instructions are generated in two consecutive steps: first, the application is partitioned into non-overlapping single-output instructions, and then these instructions are combined into multiple-output instructions following different approaches. We propose several algorithms for partitioning an application into both single-output and multiple-output instructions. A number of methods have been proposed, in both academia and industry, for extending a given instruction-set with application-specific instructions in order to speed up application execution. The proposed solutions usually have a high computational complexity. The algorithms presented in this dissertation provide quality solutions and have a linear computational complexity in all cases but one, in which the proposed solution is optimal. Moreover, the new application-specific instructions are executable as a single unit by design, in contrast to existing methods, which increase the computational complexity by testing each generated instruction. The proposed algorithms have been tested on the Molen reconfigurable architecture. The experimental results with well-known benchmarks show that a considerable increase in the execution speed of an application can be achieved by using the application-specific instructions identified by the proposed algorithms.
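The two-step flow described in the summary — partitioning a data-flow graph into non-overlapping single-output (MISO) instructions, then combining them into multiple-output (MIMO) instructions — can be sketched as follows. This is an illustrative simplification, not the thesis algorithms themselves: the fan-out-based MISO growth and the greedy size-bounded merging (including the `max_size` parameter) are assumptions standing in for the convexity and datapath constraints handled in the actual work.

```python
def partition_into_misos(dfg):
    """Partition a DAG, given as {node: [predecessor nodes]}, into
    non-overlapping maximal single-output (MISO) subgraphs."""
    fanout = {n: 0 for n in dfg}
    for preds in dfg.values():
        for p in preds:
            fanout[p] += 1
    # A MISO root is a node whose result is NOT consumed by exactly one
    # other node (an external output, or a value with multiple consumers).
    roots = [n for n in dfg if fanout[n] != 1]
    misos = []
    for r in roots:
        members, stack = {r}, list(dfg[r])
        while stack:
            n = stack.pop()
            # Single-consumer nodes are internal to this MISO; grow inward.
            if fanout[n] == 1 and n not in members:
                members.add(n)
                stack.extend(dfg[n])
        misos.append(members)
    return misos

def combine_into_mimos(misos, max_size=4):
    """Greedily merge MISOs into multiple-output (MIMO) clusters; the
    size bound is a crude stand-in for real hardware constraints."""
    mimos = []
    for m in misos:
        if mimos and len(mimos[-1]) + len(m) <= max_size:
            mimos[-1] |= m
        else:
            mimos.append(set(m))
    return mimos
```

For example, in a graph where `t1 = a + b` feeds both `t2 = t1 * c` and `t3 = t1 - d`, the first step yields the three disjoint MISOs `{a, b, t1}`, `{c, t2}` and `{d, t3}`, which the second step then clusters into larger multiple-output candidates.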
Curriculum Vitae
Carlo Galuzzi was born on the 1st of August 1977 in Milan, Italy. He received his M.Sc. in Mathematics (summa cum laude) from Università degli Studi di Milano, Italy, in 2003. In 2004, he joined the Computer Engineering (CE) Laboratory in Delft for his doctoral studies, where he worked towards his PhD degree under the guidance of Prof.dr. Stamatis Vassiliadis and Dr. Koen Bertels. His research interests include graph theory, hardware/software partitioning and instruction-set extension. Mr. Galuzzi is a reviewer for many international conferences and journals. He has served as publication chair, finance chair and local arrangements chair for several conferences, including MICRO, SAMOS and DTIS. He received the best paper award at ARC 2008.