MORPH: A System Architecture for Robust High Performance Using Customization (An NSF 100 TeraOps Point Design Study)
Andrew A. Chien
Department of Computer Science, 1304 West Springfield Avenue, University of Illinois, Urbana, Illinois 61801
[email protected]

Rajesh K. Gupta
Dept. of Information & Computer Science, 444 Computer Science Building, University of California, Irvine, CA 92697
[email protected]
PROJECT REPORT FOR NSF AWARD ASC-96-34947
Executive Summary

Achieving 100 TeraOps performance within a ten-year horizon will require massively parallel architectures that exploit both commodity software and hardware technology for cost efficiency. Increasing clock rates and system diameters measured in clock periods will make efficient management of communication and coordination increasingly critical. Configurable logic presents a unique opportunity to customize the bindings, mechanisms, and policies which comprise the interaction of processing, memory, I/O, and communication resources. This programming flexibility, or customizability, can provide the key to achieving robust high performance. The MultiprocessOr with Reconfigurable Parallel Hardware (MORPH) uses reconfigurable logic blocks integrated with the system core to control policies, interactions, and interconnections. This integrated configurability can improve the performance of the local memory hierarchy, increase the efficiency of interprocessor coordination, or better utilize the network bisection of the machine. MORPH provides a framework for exploring such integrated application-specific customizability. Rather than complicate the situation, MORPH's configurability supports component software and inter-operability frameworks, allowing direct support for application-specified patterns, objects, and structures.

In response to an NSF-sponsored Point Design Study of potential architectural approaches to achieving 100 TeraOps performance within a ten-year horizon, a study was conducted in 1996 to explore the suitability of MORPH-class architectures for this goal [40]. This report outlines the motivation, design, and initial evaluation of MORPH's approach to architectural adaptation. Using traditional scientific applications (partial differential equation solvers for computational fluid dynamics) and emerging scientific applications (graphics and visualization codes), we evaluate the utility of configurability in efficient memory hierarchy management. Our results show that incorporating flexibility can significantly increase memory efficiency, utilizing the fast cache memory more effectively to reduce effective memory latency and memory bandwidth requirements. Across applications, we find that the required optimizations differ, supporting the basic approach of reconfigurability in high performance architectures. Estimates of hardware complexity using hardware synthesis tools are provided to quantify the cost/performance trade-offs.

The results of this study point to the evolution of an entirely new class of computer architectures, namely Application Adaptive (AA) architectures. These architectures exploit the capability of the underlying hardware to reconfigure logic to achieve system-level cost/performance goals through extensive analysis and profiling of application data and runtime characteristics. A key distinction of AA architectures from traditional custom-computing machines is that architectural flexibility is used to customize architectural mechanisms and policies, instead of building additional functional resources, an approach
commonly adopted by custom computing machines. Thus relatively small amounts of reconfigurable circuit blocks can be leveraged to yield high performance on a per-application basis. AA architectures can be used in a wide variety of applications, from single-node embedded systems to multiprocessor systems. The current MORPH design for a 100 TeraOps machine therefore serves only as a proxy to describe the application of architectural adaptation to building an efficient memory system. While application-driven adaptation is key to achieving these goals, substantially difficult issues remain to be addressed: for instance, how the opportunities for architectural adaptation are to be identified and what advances in compiler technology are needed; how one can ensure a safe and protected execution framework when the underlying hardware may be changing; and so on. This report provides only a starting point for thinking about application-driven architectural adaptation. It is important to note that while extensive reprogrammability, as achieved by SRAM-based field-programmable devices such as FPGAs, is the driving force behind machine adaptation, it is by no means the only method of achieving architectural flexibility. Indeed, this report makes the case that continuing advances in microelectronic processing will make it possible to achieve hardware adaptation even in custom-designed data-path circuits.
The exploratory work on the MORPH architecture discussed in this report has laid the foundation for the recently initiated program on the definition of a Software-Controlled Application-Adaptive (SCAA) architecture. Under the support of DARPA/ITO, the goal of this project is to explore adaptive-architecture opportunities for efficient memory system management and to integrate them with compiler and operating system advances that support a robust, safe, multiuser computing environment.
This report is organized as follows. In Section 1 we present the barriers to achieving high performance computing machines and the opportunity presented by key technology trends to build AA architectures. Section 2 presents the overall hardware organization of MORPH and potential areas where architectural customization can be used. The software architecture is presented in Section 3. Section 4 describes how applications can make use of the architectural customizability provided by MORPH. Section 5 describes our evaluation of configurability for memory hierarchy management. Section 6 presents experimental results from the simulation model using a set of kernel benchmarks and actual application programs. We conclude with a discussion of the open issues and directions for future research in Section 7. The latest information on the project and results is available on the Web site: http://www-csag.cs.uiuc.edu/projects/morph.html
Contents

1 Introduction
   1.1 Role of Configurability in MORPH Point Design
   1.2 Key Technology Trends Underlying MORPH
2 System Architecture & Organization
   2.1 Exploiting Architectural Adaptation in MORPH
   2.2 Low Latency Communication
   2.3 Minimizing Communication
   2.4 Resource Load Balance
3 MORPH Software Architecture
   3.1 Automatic techniques
   3.2 Profiling and Programmer Annotation
4 Application-Driven Customizability
   4.1 Vector Memories: Stride Skewing for Performance
   4.2 Optimizing Cache Granularity for Performance
   4.3 Programmable Coherence to Reduce Communication
   4.4 Custom Prefetching
5 Evaluation of Configurability in MORPH
   5.1 Scope and Limitations of Our Study
   5.2 Description of the Benchmarks Used
      5.2.1 PetaFlop Kernels
      5.2.2 Sparse Matrix Libraries
      5.2.3 Application Programs
   5.3 Simulation Environment
6 Experimental Results
   6.1 Configurability in Memory Hierarchy Management
      6.1.1 PetaFlop Kernels
      6.1.2 Sparse Matrix Libraries
   6.2 Case Studies in Architectural Adaptation
      6.2.1 Architectural Adaptation for Latency Tolerance
      6.2.2 Architectural Adaptation for Bandwidth Reduction
   6.3 Hardware Costs of Configurability
7 Summary
   7.1 Important Issues Not Addressed
8 Educational and Human Resource Impact Statement
9 Further Information
A PetaFlop Kernels
B Hardware Modeling Using HardwareC
C HardwareC Code of Architectural Assists
   C.1 Prefetcher
   C.2 Translate and Gather
D Raw Simulation Data
   D.1 Kernels
   D.2 Sparse Matrix Operations
   D.3 SPLASH-2 Radiosity Code
   D.4 SPLASH-2 Raycasting Code (VOLREND)
List of Figures

1 Trends in Device Scaling
2 Trends in Device Integration Density
3 Distribution of Interconnect Lengths in Logic Devices
4 Normalized Delay of Interconnect Metal
5 Fraction of cycle time devoted to gate and interconnect delays
6 Inter-Connect Critical Length: Trends show a cross-over point beyond which the average interconnect length is longer than the wire critical length.
7 Circuit model using programmable interconnect between functional blocks
8 An Architecture for Adaptation
9 A Flexible 100 TeraOp Architecture
10 A wide range of logical machine organizations (Address Space and Cache coherence) can be configured.
11 MORPH software architecture uses runtime diagnosis and synthesized hardware assists.
12 Sparse Library Data Structures
13 MORPH Simulation Workbench, MORPHSIM
14 Average Percentage of Memory Stall Cycles for the PetaFlop Kernels for a Range of Cache Configurations
15 Average Memory Read Time for the PetaFlop Kernels for a Range of Cache Configurations
16 Average Memory Write Time for the PetaFlop Kernels for a Range of Cache Configurations
17 Organizations of Prefetcher Logic
18 Scatter and Gather Logic
19 Cache miss rates of different implementations of sparse matrix-matrix multiply
20 Data traffic volume of different schemes (total size of non-zeros: 1.35 MB)
21 Gather-Scatter 1
22 Gather-Scatter 2
23 Gather-Scatter 3
24 Jacobi 1
25 Jacobi 2
26 Jacobi 3
27 MP3D 1
28 MP3D 2
29 MP3D 3
30 SAXPY 1
31 SAXPY 2
32 SAXPY 3
33 Sparse 1
34 Sparse 2
35 Sparse 3
36 Task Imbalance in Raycasting Code
List of Tables

1 Cache Simulation Parameters
2 Hardware Costs
3 MORPH System Characteristics
4 Expected MORPH Performance on PetaFlop Kernels
1 Introduction

Increasing reliance on computational techniques for scientific inquiry, complex systems design, faster-than-real-time simulation, and higher fidelity human-computer interaction continues to drive the need for ever higher performance computing systems. For instance, realistic simulations of the heart and circulatory systems demand processing rates of up to 200 GFlops/beat [57]. Computational demands for real-time simulation of atmospheric turbulence, aircraft design, cosmology, and pharmaceutical drug design range into the petaflops. It is estimated that the simulation of DNA cloning experiments on a machine with 1 PetaFlops performance would take over two years [57]. Despite rapid progress in basic device technology [52], and even in uniprocessor computing technology [26], these applications demand systems scalable in every aspect: processing, memory, I/O, and particularly communication. In addition to advances in raw processing power, we must also achieve dramatic improvements in system usability through the use of commodity software and software development frameworks. The NSF Point Design Study was initiated in 1996 to explore possible architectural and technology solutions that would help achieve 100 TeraOps performance within a reasonable cost, power, and packaging budget in a ten-year period. It was anticipated that such a concerted effort in building high-performance machines would be critical to eventually reaching the PetaFlops performance goal several years in advance of what the natural evolution of commodity computing machines would make possible. Based on technology projections for the year-2007 design window of the NSF point design studies, our analysis (see Section 1.2) indicates that the cumulative effect of continuing increases in communication cost relative to computation (gate speeds) will fundamentally alter the cost/performance trade-offs in design technology selection for high performance circuit building blocks. In particular, the use of field-configurable logic blocks will become a practical choice in an ever broader range of systems. The diminishing cycle-time cost of incorporating configurable logic presents the system architect with the opportunity to adapt a machine's behavior to better match application requirements. This matching allows the machine to deliver per-application performance substantially closer to peak machine performance than is possible today with fixed hardware structures. While a broad variety of such architectures are possible, MORPH is a
design point which explores the potential of integrating configurability deep into the system core. The MORPH design consists of versatile processor cores that can be used in systems ranging from modest desktops to machines consisting of thousands of nodes.
1.1 Role of Configurability in MORPH Point Design

The MORPH design is motivated by the fact that current day scalable systems are still quite difficult to program and, in many cases, effectively preclude use of the most sophisticated (and most efficient) algorithms. Even when successfully used, most systems exhibit substantial performance fragility due to rigid architectural choices that do not work well across different applications. Therefore, architectural reconfigurability is a key ingredient in achieving robust high performance. However, the candidates for architectural adaptation must be judiciously chosen so as to yield maximum performance gains without adversely impacting system cost goals. Because technology trends continue to increase the importance of communication, the MORPH architecture focuses on exploiting configurability to manage locality, communication, and coordination. In particular, the MORPH design study has explored improved efficiency and scalability using novel mechanisms to manage the interaction of processing and memory, and memory system management (cache coherence, prefetching). Future innovations would explore the interaction of processing with I/O and communication resources, flexible hardware granularity (e.g. mechanisms and association with processors and memory), and other data management policies.

Memory system management is critical to combat latency deterioration and memory-system bandwidth congestion in multiprocessor parallel systems. Latency and bandwidth are key determining factors in performance for large machines because they act as a constant multiplier on the achievable performance [40]. The multiplier may even worsen as memory latency fails to improve as fast as processor clock speeds. To understand the worsening effects of long latency in MPP machines, consider a hypothetical machine with processing elements running at 2 GHz with eight-way super-scalar filled pipelines. Assuming a typical 1 microsecond round-trip latency for a cache miss, each miss corresponds to about 16K instruction issue slots, of which an average 30%, or 4800 instructions, are loads and stores. For single-threaded execution, a miss rate as low as 0.02% of loads/stores brings computing efficiency down to about 50%. This points to the need for very low miss rates to ensure that high-throughput processing nodes can be kept busy. A similar analysis of the bisection bandwidth indicates bandwidth requirements of about 9.6 TB/sec for a 100 TeraOp machine with 10 Flops per word of traffic. This points to about 100K wires at 1 GHz, which is clearly impractical.
Thus, active bandwidth management is required to reduce the need for communication and to increase the number of Flops executed before a communication is necessary. Unfortunately, a wealth of studies indicates that no fixed policies (or even wiring configurations) are optimal. MORPH seeks to exploit application structure and behavior to adapt the machine for efficient execution. Traditionally in high performance computing systems, this
information has been provided by optimizing compilers (vectorizing or parallelizing compilers). However, as the use of component software and interoperability frameworks proliferates, we expect such analysis to become increasingly difficult. Therefore, we hope to see progress in the exploration of a broad range of techniques to identify opportunities for customization based on aggressive compiler analysis, user annotations, profiling, and even on-the-fly monitoring and adaptation. Because the field of configurable computing is still in its nascence, the range of possible architectures has only begun to be explored. The primary innovations of MORPH are its focus on the integration of configurability into the system core and its exploitation of opportunities to optimize communication and coordination. In this study, we report results from initial investigations of the application of configurability to improving system performance. The initial results include:
- an analysis of hardware technology trends which clearly indicates the increasing viability of configurable logic, even in high performance system paths;
- the definition of an abstract architecture for a configurable machine, which represents a significant departure from existing approaches to configurable computing by placing a strong emphasis on customizable architectural mechanisms rather than on synthesizing the whole or part of the application [3];
- an empirical evaluation of scientific and graphics application behavior which characterizes the potential benefits of applying configurability in the specific area of intelligent memory hierarchy management; and
- an empirical evaluation of the hardware complexity required for these customizations, which demonstrates that the hardware requirements for the reprogrammable logic in MORPH are modest and substantially less than those needed in building custom computing machines.
These results indicate that while daunting challenges remain, configurable machines are a promising approach for building 100 Teraflop and even Petaflop computers. The variability across applications is significant, rewarding custom management of precious fast storage, and the hardware required to achieve this customization need not reside on critical paths within the fastest parts of the processor. Finally, the configurable hardware resources required to achieve significant performance enhancements are modest, indicating that bringing configurability onto the processor chip may be viable in the near future. While the results of this initial study are encouraging, significant challenges remain. Our results substantiate the viability of the approach, but a large space of architectural issues is still open. Open questions include:
- exploration of configurability in other parts of the system, such as the data-path, network, main memory elements, and input/output;
- partitioning and management of customization for machine usability and maximum benefit for a given application;
- how to utilize configurability for increased fault tolerance in systems with a large number of processing elements; and
- analysis of configuration and protection requirements, and approaches for meeting them.
We hope to address these questions in a follow-up study by building and experimenting with a prototype based on MORPH architectural elements.
1.2 Key Technology Trends Underlying MORPH

The MORPH design is based on commodity technology and components that would be available in the year 2007. The pace of semiconductor technology evolution shows no signs of slowing down. Indeed, the projections of the National Technology Roadmap for Semiconductors (NTRS) have been revised within three years to show an acceleration of technological maturity by 1 to 3 years for most deep submicron processes beyond 250 nm (see Figure 1). In evolutionary terms, there are no surprises per se in any of the component areas, or even in the process technology itself as specified by the NTRS 1997. However, the cumulative impact of continuing trends in circuit density and performance on system architectures, even for single-node machines, is nothing short of phenomenal. To understand this, let us first examine the effect of technology scaling in the coming decade. (Figures 1 and 2 show the projected trends in device feature size and integration densities for logic devices.) Assume that feature sizes are scaled by a factor S. Then the intrinsic gate delay (limited by velocity-saturation effects) scales as 1/S for short-channel devices [33]. This scaling is independent of the voltage scaling used. Since the delay of a logic gate is proportional to the load capacitance, scaling of device dimensions leads to decreasing gate delays. However, the interconnect introduces a range of parasitic effects that are significantly pronounced at the deep sub-micron technology of three generations hence. Consider the typical distribution of interconnect lengths on a chip shown in Figure 3, from [37]. The distribution has two peaks, one at around 10% and the other at around 50% of the die size. The die size is approximately the square root of the chip area, and the average length of the interconnect can be approximated as \( L = \sqrt{\text{chip area}}/3 \) [54]. The total interconnect can be roughly divided into two categories: local and global interconnect.
Figure 1: Trends in Device Scaling (feature size in nm versus year; NTRS 1993 and NTRS 1997 projections)
Figure 2: Trends in Device Integration Density (semiconductor logic device density in millions of gates versus year)
Figure 3: Distribution of Interconnect Lengths in Logic Devices (probability versus wire length as a fraction of die diagonal length)
The first peak in Figure 3 represents the local interconnect length, whereas the second peak is due to the global interconnect nets. With scaling, the local interconnect shrinks with the devices, whereas the global interconnect actually grows due to the increase in die density and size [50]. Thus the wire-length scaling can be divided into two parts:

\[ S_L = \begin{cases} 1/S & \text{local interconnect} \\ S & \text{global interconnect} \end{cases} \]
The resistance scaling is \( S_R = S_L S^2 \), whereas the RC delay scaling is given by

\[ S_{RC} = S_C S_R = S_L \cdot S_L S^2 = S_L^2 S^2 \]
Thus the RC delay is constant for local interconnect, even as the gate delay decreases, and the RC delay grows as \( S^4 \) for the global interconnect. Let us now consider, in absolute terms, the characteristics of circuits and devices that will be available in the year 2007. Figure 4 plots the normalized delay of the interconnect (per unit length), which is about 80 ns/cm. Figure 5 shows the increasing role of interconnect in absolute terms. The on-chip cycle time is limited to 1 nanosecond, primarily to combat the increasing parasitic (inductive) effects of the metal interconnect, while the unit gate delay (an inverter with a fanout of two) scales down to 20 picoseconds. Thus modern-day control logic consisting of 7-8 logic stages per cycle would form less than 20% of the total cycle time (Figure 5). This clearly challenges the fundamental design trade-off today that tries to simplify the amount of logic per stage in the interest of reducing the cycle time [31]. In addition, this points to a sharply reduced marginal cost of per-stage logic complexity on circuit-level performance.
Figure 4: Normalized Delay of Interconnect Metal (delay in ns/cm versus year)
Figure 5: Fraction of cycle time devoted to gate and interconnect delays (logic versus interconnect, 1995 and 2007)
Another aspect of chip-level system design that will be qualitatively challenged by these technology trends is the role of reprogrammable logic in chip designs (even for custom/semi-custom designs). This is best understood by examining the effect of technology scaling on the unit length of wire, L_crit, whose delay is equivalent to a unit gate delay (inverter with fanout of two). The critical length can be computed as

\[ L_{crit} = \sqrt{\frac{t_{pgate}}{0.38\,rc}} \]

where \( t_{pgate} \) is the unit gate delay, which scales as 1/S, and r and c are the unit-length resistance and capacitance, respectively. The scaling of the per-unit-length rc delay is given as (using local interconnect scaling)

\[ S_{rc} = \frac{S_R}{S_L} \cdot \frac{S_C}{S_L} = \frac{S_{RC}}{S_L^2} = S^2 \]

Thus the critical length scales as \( S^{-3/2} \); this corresponds to a reduction by a factor of 32 for a 10X reduction in feature size. An estimate of the absolute value of the interconnect lengths is provided by considering the metal pitch in sub-micron devices. The pitch for the finest interconnect is projected at 0.4-0.6 microns. On logic devices, the average interconnect length L is approximately 1000X to 10,000X the pitch. In view of the interconnect distributions in Figure 3, this means that local interconnect is 300X to 3000X the pitch. This scaling trend is shown in Figure 6. Within three generations of evolution in process technology we can expect a cross-over point beyond which the average interconnect length between logic blocks located within the same cycle time is longer than the critical length. Depending upon the average length of local interconnect, this cross-over could occur within Region II, indicated as the cross-over region in Figure 6. Once past this cross-over region, the delay due to the average interconnect length would exceed switching delays. In fact, for the year 2007, the average interconnect length is approximately three to five times the critical length, which is equivalent to three to five gate delays. For a wire of length L, the total interconnect delay with M "repeating buffers" is given as

\[ t = 0.38\,rc\,M\left(\frac{L}{M}\right)^2 + (M-1)\,t_{pgate} \]

where \( t_{pgate} \) is the propagation delay of the buffer. The optimal number of buffering stages can be found by setting \( \partial t / \partial M \) to zero. However, a common rule of thumb in circuit design is to provide electrical buffering for signal integrity at least once per critical length of wire. This implies that for purely electrical reasons it would be preferable to include anywhere from three to five interconnect buffers within a cycle time to mitigate the delay due to the average local interconnect. Such a buffer gate, when combined with a weak-feedback device, would form the core of a storage element that presents less than 50% switching delay overhead (from 20 ps to 30 ps).
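Carrying out that minimization explicitly (a standard step, shown here for completeness):

\[ \frac{\partial t}{\partial M} = -0.38\,rc\left(\frac{L}{M}\right)^2 + t_{pgate} = 0 \quad\Longrightarrow\quad M_{opt} = L\sqrt{\frac{0.38\,rc}{t_{pgate}}} = \frac{L}{L_{crit}} \]

that is, one buffer per critical length of wire, which is precisely the rule of thumb cited above.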
Figure 6: Inter-Connect Critical Length: Trends show a cross-over point beyond which the average interconnect length is longer than the wire critical length. (Length in microns versus feature size in microns; curves for static interconnect, average interconnect length, critical length, and dynamic interconnect; regions I, II (cross-over), and III.)

Figure 7 shows a schematic diagram of the interconnect between two blocks that is made programmable using reprogrammable devices. For wirelengths L larger than the critical length, such a circuit presents a shorter delay from A to B than a direct wire connection. Indeed, SPICE simulations show that as the channel length scales down from 1 micron to 350 nm, the wirelength L for which a reprogrammable circuit provides the shorter delay scales down from 7500 microns to 4900 microns. (Circuit simulations for 100 nm technology could not be considered reliable due to significant uncertainty in the process parameters.) Referring back to Figure 6, the effect of the critical-length cross-over means that in the cross-over Region II, device, circuit, and CAD strategies may be effective in mitigating the effect of interconnect dominance. However, once past this region, dynamic interconnect, that is, interconnect with logic, would be the preferred choice. The needed logic may be embedded in inter-layer regions once advances in stacked MOS devices permit it. While technologically inconceivable today, such incorporation of reprogrammable logic gates would be consistent with the earlier observation that the cycle-time cost of local decision making would be significantly less than the cost of sending information, leading to increased logic levels per cycle. Since the logical complexity per cycle is related to system microarchitecture, the most immediate use of interconnect buffering gates would be in reprogrammable interconnects that improve the versatility of hardware blocks (perhaps sold as core elements by semiconductor vendors [28]). In the face of these technological changes, the challenge for the system architect is to find ways that future architectures can effectively exploit the opportunity presented by dynamic interconnect. We note that beyond the immediate effects illustrated above, the extrapolation of trends in switching versus interconnect speeds fundamentally challenges our current notions of synchronous circuit design and CAD tools, in which storage elements define the clock boundaries.
Figure 7: Circuit model using programmable interconnect between functional blocks (ports A and B joined by two wire segments of length L/2 through a programmable buffer)

Instead, clock boundaries will be defined based on architectural verification, block placement, and complexity concerns in on-chip systems that may well see mixed synchronous and asynchronous circuit blocks working together. In the following section, we examine the trends in reprogrammable logic and CAD tools and their likely effect on systems and system design methodologies.
Trends in Programmable Logic and Computer-Aided Design (CAD). According to SIA projections, advances in programmable logic will make its use viable in a much broader range of system elements. The maturation of CAD technology will allow the flexible exploitation of configurable resources to customize for individual programs. Indeed, this has been the basis of most current work on customizable computing machines [9]. The MORPH design is also predicated on these continuing trends, though its use of reprogrammable hardware and CAD tools is substantially more limited and conservative. Programmable logic density and speed are increasing in similar fashion to SRAM, approximately four times every three years. The routability and gate utilization of reprogrammable devices will continue to increase at a rapid rate of 5-15% per year due to better algorithms, CAD tools, and additional routing resources. State-of-the-art programmable logic devices, implemented in a 0.35 micron process technology, provide 100,000 gates and achieve propagation delays of less than 5 ns. As an alternative organization, current FPGA devices also offer up to 40 Kbits of multifunctional memory in addition to over 60,000 usable gates for the implementation of specialized hardware functions such as arithmetic or DSP functions. Unlike standard SRAM parts, the embedded memory can be used as multiple-ported SRAM or FIFO, providing much greater flexibility in system organization. Current efforts in FPGA design and architecture show significant improvements in the efficiency of the on-chip memory blocks. This trend in memory efficiency and utilization is expected to continue and to close the gap between FPGA and SRAM densities; up to 10 Mbits of multi-functional embedded memory would be available in addition to the logic blocks in single-chip reprogrammable devices. In the ten-year period, programmable logic integration levels are expected to reach 2 million gates. Further, the continuing trend toward intellectual property (IP) encapsulation into hard or soft cores, rather than chips, as carriers is likely to be exploited by FPGA manufacturers as well, in providing on-chip reprogrammable core blocks along with associated control and testability logic.

Coupled with the advances in technology, the advances in CAD tools and algorithms are beginning to have an impact on how designs are done today. With the emphasis on system models in hardware description languages (HDLs) such as Verilog and VHDL, the process of hardware design is increasingly a language-level activity, supported by compilation and synthesis tools [27]. These tools are beginning to support a variety of design constraints on performance, size, power, and even pin-outs. Locking I/O maps, to ensure that the physical design remains unchanged while logical connections are modified per application, will soon be a common feature, allowing programmable logic to be embedded in key modules of a system and providing on-line programmability to change hardware functionality. Tools for distributed hardware control synthesis to allow dynamic binding of hardware resources [16], and for synthesis of protocols into low latency hardware [8, 14, 15], have been successfully demonstrated. With these CAD and synthesis capabilities, embedded programmable logic can be inserted into key parts of systems and used to alter behavior dramatically with modest performance overhead.
2 System Architecture & Organization

As mentioned earlier, the basic architecture of MORPH reflects the observations that in the 2007 technology window, (1) communication is already critical and getting increasingly so [64], and (2) flexible interconnects can be used to replace static wires at competitive performance [51, 13, 12, 22]. The key elements of the MORPH architecture include processing elements and memory elements embedded in a scalable interconnect. The scalable interconnect flexibly connects all parts of the system with fast packet routing, efficiently exploiting the wiring resources provided by the system packaging [13, 12, 51, 22]. Figure 8 shows our architecture. Architectural adaptation can be applied to the bindings, mechanisms, and policies governing the interaction of processing, memory, and communication resources while keeping the macro-level organization, and thus the programming model for developing applications, unchanged. The hardware structure allows adaptation of data transport, coordination, association (for granularity), and efficient computation. As an example of its flexibility, MORPH could be used to implement a cache-coherent machine, a non-cache-coherent machine, or even clusters of cache-coherent machines connected by put/get or message passing (Figure 10).
Figure 8: An Architecture for Adaptation (CPU, cache, memory, and network interface coupled through programmable logic to a flexible interconnect)
Varying the mix of processing and memory elements supports a wide range of machine configurations and balances. Examples of other possible changes include changes in cache block size, branch predictors, or prefetch policies. Our proposed configuration uses 32-BIPS processors and 8192 processing nodes. The physical memory configuration is driven primarily by cost and packaging (power budgeting) factors. A cost-balanced system (based on today's processor-to-memory price ratios and SIA predictions) would have approximately two memory chips for every processor, or approximately 4 gigabytes/processor. Each processing node consists of 8-16 KB of L1 cache and 128 MB of L2 cache that can be configured as private or shared among the on-chip CPUs. Using MCM and areal interconnect technology, our system could be integrated with 30 nodes per card; with 20 cards per rack, the core of the system would fit in eight racks, with additional room needed for power supplies, I/O, cooling fans, etc. The communication bandwidth is limited by the wiring of the packages to about 90 GB/sec for a pin-out of 1800 usable MCM pins. The bisection bandwidth would be in the range of 10 TB/sec.
Figure 9: A Flexible 100 TeraOp Architecture (high-density, high-performance memory elements and high-performance processing elements, each with a network interface and programmable logic, joined by a high-bandwidth system-level interconnect (optical or other) and scalable interconnects from modules to subsystems)

Figure 10: A wide range of logical machine organizations (Address Space and Cache coherence) can be configured (e.g., processing elements sharing a global memory over a scalable interconnect, clusters of processing elements with attached memories, or fully distributed processing-element/memory pairs).
MORPH's flexible architecture subsumes both the processor-in-memory (PIM) and scalable shared memory approaches. Based on the experience of several "PIM"-like systems [21, 20, 55, 7, 6], there is evidence that PIM organizations present significant programming challenges, particularly for irregular applications. We believe that the use of more traditional processor-memory structures will yield a machine with more accessible performance than an organization in which processors access primarily their local on-chip memory. In addition, by adding a small amount of programmable logic to the memory units, we can obtain much of the benefit of having computational elements within the memory. A shared memory approach represents the opposite extreme, with advantages in programmability but with questions about scalability.¹

¹Since communication is the critical issue, novel technologies for communication implementation must be explored. One possibility is to exploit optical arrays and free-space or waveguide-based interconnects. Since ultra-high-speed electrical connections in the tens of Gbps range are limited to a few centimeters, serial links driven by smart pixel arrays could be used to support system backplane connections. Through smart pixel arrays, any degree-K interconnection network can be embedded into the backplane. This includes linear arrays, 2D and 3D meshes, toroids, hypercubes, dilated crossbars, orthogonal crossbars, Knockout, CrossOut, and shuffle-based networks [59].
While the proposed architecture can subsume a range of traditional parallel machine organizations, as shown in Figure 10, the primary use of configurability is to enable customization of mechanisms for higher performance. We outline some of the major optimization types below.
2.1 Exploiting Architectural Adaptation in MORPH

The critical configurabilities, or adaptation flexibilities, that this architecture provides include:
- control over computing node granularity (processor-memory association)
- interleaving (address-to-physical-memory-element mapping)
- cache policies (consistency model, coherence protocol, object method protocols)
- cache organization (block size or objects)
- behavior monitoring and adaptation
Depending upon the application and runtime environment, customization can be done statically or at run time. We discuss below the potential architectural opportunities presented by MORPH.
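To make these knobs concrete, the sketch below shows what a per-node configuration descriptor might look like in C. The structure, field names, and value sets are our own illustration, not an interface defined by MORPH; in the machine, these choices would program the configurable logic between CPU, cache, memory, and network interface rather than fill a software structure.

    #include <stdint.h>

    /* Hypothetical per-node configuration descriptor covering the
     * adaptation knobs listed above (illustrative only). */
    struct node_config {
        uint8_t  mem_elems_per_pe;  /* node granularity: memory elements per processor */
        uint8_t  interleave_fn;     /* selects the address-to-memory-element hash */
        enum { COHERENT_SHARED, PUT_GET, MESSAGE_PASSING } model;
        enum { WRITE_INVALIDATE, WRITE_UPDATE } coherence;
        uint16_t block_bytes;       /* cache block (or object) granularity */
        uint8_t  monitor_on;        /* enable behavior monitoring and adaptation */
    };

Setting these fields at application launch, or rewriting them at run time, corresponds to the static and dynamic customization just described.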
2.2 Low Latency Communication

The MORPH architecture can optimize for low-latency communication by adapting the number of memory elements associated with each processing element (optimal PE granularity), configuring the physical I/O resources to match the application's needs (local memory hierarchy, global network), and adding special hardware structures [19, 60], such as fast barrier or broadcast support for machine subsets or the entire machine, to optimize performance. For example, experience over the last ten years demonstrates that intraprocessor communication mechanisms (data shared through the cache) are much more efficient than even the best interprocessor mechanisms. When the machine configuration granularity matches the application, extremely high performance can result. The programmable logic on both processor elements and memory elements allows us to dynamically associate memories with processor chips, changing the node granularity at application set-up. In another example, some applications benefit greatly from low-latency barriers or high speed broadcast or multicast. Such structures are easily implementable with this configurable hardware.
2.3 Minimizing Communication

The possibilities for minimizing communication include custom caching policies (e.g. adaptive invalidate-update, custom block sizes, and even more complex schemes [25]), object-based coherence (e.g. program semantics-based policies and data movement), and custom prefetching FSMs (derived from program analysis, or selected dynamically based on effectiveness). All of these choices can
be driven by detailed performance data captured via customizable hardware. For example, false sharing can be eliminated by adapting policies (subblocking) for particular cache blocks. In other examples, object consistency semantics (e.g. a write-once policy) can be used to reduce protocol overhead, object sizes can be used to eliminate multiple cache misses for a single object reference, virtual function data requirements can be used to fetch only the needed parts of an object (reducing data movement requirements), and custom encodings can be used for special datatypes, reducing the number of bits that must be transferred.
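For instance, the adaptive invalidate-update choice can be driven by a per-block counter in the style of competitive-update protocols. The sketch below is our own illustration; the threshold and the counter placement are assumptions, not mechanisms specified by this report.

    #include <stdint.h>

    #define UPDATE_LIMIT 4   /* remote writes tolerated without a local read (assumed) */

    struct block_state {
        uint8_t use_update;     /* 1: forward updates; 0: invalidate on remote write */
        uint8_t stale_updates;  /* remote writes since the last local read */
    };

    /* A remote processor wrote this block: keep updating, or fall
     * back to invalidation? */
    void on_remote_write(struct block_state *b)
    {
        if (b->use_update && ++b->stale_updates >= UPDATE_LIMIT)
            b->use_update = 0;  /* reader stopped consuming: stop shipping data */
    }

    /* A local read shows the updates are being consumed. */
    void on_local_read(struct block_state *b)
    {
        b->stale_updates = 0;
        b->use_update = 1;
    }

In configurable logic, such a counter costs a few bits per block and lets each application, or even each data structure, pick the policy that minimizes its traffic.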
2.4 Resource Load Balance

A critical issue for scalable systems, particularly with the increasing prominence of irregular and adaptive methods, is efficiently achieving resource load balance. Our abstract architecture supports custom, even array-specific, memory interleaving to avoid memory bank conflicts, dynamic sharing of memory modules amongst processing elements, and the programming of hardware structures to rapidly propagate load information and distribute tasks. For example, array references which cause bank conflicts can be optimized by changing the address mapping for the node, or even for the individual array. Another example is custom load balance structures which distribute tasks through hardware priority structures; such structures can help to achieve low latency load distribution, improving application scalability.
3 MORPH Software Architecture

High performance computing systems cannot dictate the software structure of next-generation high-performance computing applications: their very complexity will demand the best software structuring and complexity management techniques available. These applications not only require high computational rates, massive memory resources, and high performance I/O; they will also be substantially more complex than current generation applications, exploiting sophisticated adaptive algorithms that use complex data structures, combining diverse computational applications (metacomputing), and integrating computation, visualization, databases, and scientific exploration. The tools for building such applications will be the best mainstream software technology available: object-oriented programming, component libraries (e.g. POOMA, A++/P++, Scalapack), domain-specific libraries (e.g. KeLP, AMR++), or problem solving environments (perhaps as high level as MATLAB). Applications may consist of several independent programs, composed by procedure calls (shared memory), object-interoperability frameworks (CORBA, OLE, SOM), or even messaging (e.g.
MPI [53], TCP/IP [17]). These software structures have direct implications for which implementation techniques are feasible. Achieving good programmability demands tools and techniques which allow applications of this type to achieve high performance on our flexible architecture. Mapping an application onto a configurable architecture such as MORPH involves selecting an appropriate execution model for program sections (e.g. memory and object consistency models as well as hardware primitives), node memory capacity and the size of domains for cache coherence (to match working set structures), and custom operations for optimized communication and coordination (e.g. a memory-side atomic swap register or a histogrammer), as well as mapping those structures (along with the computation and data) onto the underlying machine. These are daunting tasks, which for common choices can be accomplished via libraries (e.g. globally shared memory, clustered cache-coherent machines, or distributed memory). However, these approaches are likely to present too inflexible a view to support both a wide range of applications and demanding irregular, adaptive applications well. We believe that achieving scalable high performance on a wide range of applications demands the development of technologies (automatic techniques and high level abstractions for programmer-assisted decisions) to exploit the flexibility of our proposed architecture. There are two basic types of techniques for identifying opportunities for customization: static analysis (compiler analysis and directives) and dynamic adaptation (profiles and dynamic statistics), used to rationally exploit the flexibility in optimizing the mapping and execution of the program. It is imperative that good performance be achievable with modest effort and that the highest levels of performance be available with reasonable tuning effort.
3.1 Automatic techniques

Techniques that exploit aggressive interprocedural analysis [48, 49, 47, 18, 30, 29, 11, 1], profile data, and run-time statistics to optimize program implementation choices are essential to the programmability of the machine and the accessibility of high performance. Aggressive compiler analysis has been essential to high performance computing based on vector, shared memory, and distributed machines. Extensions to interprocedural techniques will continue to yield significant benefits as analysis is broadened to include the entire program. However, the heterogeneous and variegated expression of applications (see above) will limit the range of the regularized semantics amenable to compiler analysis. Because of the limitations of static program analysis, and because of fundamental hardware technology trends which increase the performance sensitivity of parallel machines, profile data and runtime statistics will be increasingly important for achieving robust high performance. For example, profiling and runtime statistics may be essential for automatically tuning cache coherence and blocking. Configurable hardware can be configured as instruments for idiom recognition or traditional statistics collection.
Figure 11: MORPH software architecture uses runtime diagnosis and synthesized hardware assists. (High-level programming languages, domain-specific environments, libraries, and object composition frameworks (CORBA, OLE, SOM, etc.) feed configuration tools: a static analyzer, partitioner, compiler, software optimizer, mapper, and synthesizer producing object code, hardware assists in programmable hardware, and an execution model, with profiling and trace analysis of program execution driving runtime diagnosis and reconfiguration.)
These forms of fast monitoring can then be used to drive the selection of node granularity and mechanisms, to assess incremental miss rates, and to drive adaptation to larger node memories (memory stealing). The range of possibilities is endless. A change in memory grain size can be detected, for example, by maintaining a record of the cache misses. We propose to evaluate hardware assists that automatically detect changes in the memory grain size and context sets. This hardware would be synthesized to implement a cache tag recognizer using a modified version of the Aho-Corasick algorithm for string matching [2]. If the recognizer detected a desirable configuration change, the hardware modifications could be precomputed or even synthesized on the fly, and programmed into the appropriate hardware.
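As a software caricature of such monitoring, far simpler than the tag recognizer just described, a miss-rate phase-change detector might look like the following C sketch; the window size and the 50% threshold are arbitrary assumptions of ours.

    #include <math.h>
    #include <stdint.h>

    #define WINDOW 100000   /* memory references per observation window (assumed) */

    static uint64_t refs, misses;
    static double   prev_rate = -1.0;

    extern void request_reconfiguration(void);  /* hook: load or synthesize a new assist */

    /* Called on every memory reference; flags a working-set phase
     * change when the miss rate shifts by more than 50% between
     * consecutive windows. */
    void monitor_access(int was_miss)
    {
        refs++;
        misses += (was_miss != 0);
        if (refs == WINDOW) {
            double rate = (double)misses / (double)refs;
            if (prev_rate >= 0.0 && fabs(rate - prev_rate) > 0.5 * prev_rate)
                request_reconfiguration();
            prev_rate = rate;
            refs = misses = 0;
        }
    }

The hardware recognizer plays the same role, but evaluates the miss stream continuously and off the critical path.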
3.2 Profiling and Programmer Annotation

While a full complement of programmer annotation, profile-directed analysis, and even on-the-fly performance measurement and diagnosis techniques are essential, the critical issue is the abstractions used to present performance data and system characteristics to the programmer. In addition to traditional views (execution time distribution, cache miss rates, communication volume, load balance, etc.), tools for these flexible systems will add views of computational efficiency (special operations), internal node communication (memory hierarchy performance, memory bank organization), and external node communication (parallel decomposition), and will even suggest or execute program reorganizations which enhance performance. These techniques are pictured in Figure 11, which illustrates the interplay of the software
application structure, automatic and programmer-aided optimization, and the software and hardware synthesis used to build the implementation. Optimizing compilers will analyze whole programs, generating code structures, specifying hardware structures, and defining execution models. Hardware will be generated using high-level synthesis techniques, and together the hardware and software will be mapped to the underlying machine. Special hardware functionality would generally be mapped to all parts of the machine that require it (as part of their execution model). The extraction of special operations, guidance for selecting special policies, and so on would also be guided by compiler analysis as well as by programmer and tool-generated annotations.
4 Application-Driven Customizability

Given the preliminary stage of the MORPH point design study, a comprehensive listing of mechanisms for application-driven customizability is not possible. Critical issues in assessing mechanisms include hardware cost, cycle-time impact, configuration cost, effect on software, protection mechanism(s) needed, etc. We describe several illustrative examples below to show the leverage and importance of a flexible architecture. While these examples are neither exhaustive nor necessarily representative, the mechanisms do illustrate the extent of flexibility that the overall architectural framework of MORPH provides.
4.1 Vector Memories: Stride Skewing for Performance

Vector memories achieve high performance on regular sequences of accesses, but performance drops quickly if accesses fall into the same memory banks, causing bank conflicts, or if accesses cannot be mapped into the vector model with constant stride. The flexibility of the MORPH architecture allows this problem to be solved easily, by using the programmable logic on the processor elements to modify the mapping of addresses to memory elements. For example, shuffling the address lines or using more complex hash functions eliminates the conflicts, as sketched below. Preserving program correctness is a little trickier, but also manageable with the programmable logic. By choosing several good hash functions with complementary structures [5], we can ensure fewer bank conflicts, and by ensuring that addresses are mapped consistently (address ranges, context registers, additional instructions, etc.), correct program execution is ensured. Likewise, for sparse matrix operations, where scatter-gather operations would typically be employed, the programmable logic can be used to prefetch the irregular structure efficiently (data structure interpretation) or even to remap the addresses, presenting them to the processor (and perhaps packing them into the cache) as contiguous addresses.
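As an illustration, here is a minimal sketch in C of one such remapping. The XOR-folding hash and the 64-bank configuration are hypothetical choices, not mechanisms specified by this report; in MORPH, the function would be realized in the programmable logic on the processor element.

    #include <stdint.h>

    #define LOG2_BANKS 6                      /* 64 memory banks (assumed) */
    #define BANK_MASK  ((1u << LOG2_BANKS) - 1)

    /* Conventional mapping: bank = low-order word-address bits.
     * Power-of-two strides then hit the same bank repeatedly. */
    unsigned bank_direct(uint64_t addr)
    {
        return (unsigned)(addr >> 3) & BANK_MASK;   /* 8-byte words */
    }

    /* XOR-folded mapping: fold higher address bits into the bank
     * index so that constant strides spread across all banks. */
    unsigned bank_xor(uint64_t addr)
    {
        uint64_t w = addr >> 3;                     /* word address */
        return (unsigned)(w ^ (w >> LOG2_BANKS) ^ (w >> (2 * LOG2_BANKS))) & BANK_MASK;
    }

With a stride of 64 words, bank_direct sends every reference to bank 0, while bank_xor spreads the same stream over all 64 banks; keeping several such hash functions with complementary structures on hand, as suggested above, covers a wider range of access patterns.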
4.2 Optimizing Cache Granularity for Performance

Virtually all processors depend critically on their cache subsystems to achieve high performance. However, cache performance can be extremely sensitive to the relationship of the working set to the cache size (particularly in direct-mapped caches). Programmable logic can be used to diagnose this problem and, if appropriate, implement corrective measures which minimize the performance losses (changes in cache sizes, or expanded victim cache buffers). For example, consider an L1 cache of 8 KB for which the critical working set size is 9 KB. The cache misses are collected by a hardware recognizer that invokes a hardware synthesizer to build customized victim caches [36]. The recognizer analyzes cache misses on the fly, assessing how a victim cache of a particular size would affect the cache miss rate. Note that the recognizer can be extremely simple, no larger than the tag store for the size of victim cache being considered. The recognizer can be a simple acceptor automaton that accepts only the cache tags in the L1 cache. The recognizer holds the frequently used set of cache tags implicitly (as a decision-diagram representation over the tag bits). As the set of cache tags transitions to a new set on each miss, the state machine is updated to ensure that the new tag state is accepted by the recognizer. (Once the number of states in the recognizer exceeds the maximum allowable, the recognizer starts to recycle used states.) Since each update affects only a very small number of tags, it is relatively easy to update the recognizer. Once a state update leads to a known state in the recognizer, the recognizer estimates the working set size based on the structure of the implicitly represented state machine, in particular the size of the strongly-connected components, and evaluates the possibility of building a victim cache in steps of 1 KB. This technique works irrespective of the associativity of the L1 cache, since the recognizer summarizes the working set and not how this working set is distributed in the cache. Through simple modifications such as victim caches, the hardware assist can exploit the programmable hardware to eliminate sharp falloffs in performance when the working set is slightly larger than the cache size. Note also that this runtime monitoring activity does not affect the critical paths, and the cost of runtime reconfiguration is amortized over the long periods of time between these updates.
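To convey the flavor of this on-line assessment, here is a much simplified software model in C: instead of the decision-diagram recognizer described above, it merely remembers recently evicted tags and counts how many subsequent misses a candidate victim cache would have caught. The capacity, FIFO replacement, and 25% threshold are our illustrative assumptions.

    #include <stdint.h>

    #define VC_BLOCKS 16   /* candidate victim cache: 16 blocks of 64 B = 1 KB (assumed) */

    static uint64_t vc_tag[VC_BLOCKS];
    static int      vc_valid[VC_BLOCKS];
    static int      vc_head;                  /* FIFO replacement pointer */
    static uint64_t l1_misses, vc_would_hit;

    /* Called on each L1 miss with the tag that missed and the tag of
     * the block just evicted; counts the misses a victim cache of
     * VC_BLOCKS entries would have absorbed. */
    void on_l1_miss(uint64_t missed_tag, uint64_t evicted_tag)
    {
        l1_misses++;
        for (int i = 0; i < VC_BLOCKS; i++)
            if (vc_valid[i] && vc_tag[i] == missed_tag) { vc_would_hit++; break; }
        vc_tag[vc_head] = evicted_tag;        /* remember the victim */
        vc_valid[vc_head] = 1;
        vc_head = (vc_head + 1) % VC_BLOCKS;
    }

    /* Reconfigure if the victim cache would absorb, say, >25% of misses. */
    int victim_cache_worthwhile(void)
    {
        return l1_misses > 10000 && 4 * vc_would_hit > l1_misses;
    }

In the 8 KB / 9 KB example above, such an estimator would quickly report that a 1 KB victim cache captures most of the conflict misses, triggering its synthesis.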
4.3 Programmable Coherence to Reduce Communication

Researchers have long recognized that no single data management granularity and no single cache consistency policy [25, 65, 56, 32] could hope to serve all applications equally well. These observations are also borne out by our experimental results in the following sections. However, hard-wired machines must be designed to handle a single common case, both to simplify their implementation and as a compromise across a workload. In environments where communication is expensive (e.g. distributed virtual memory systems), coherence systems
are customized to minimize communication by exploiting data compression, computing differences, and using coherence policies based on observed (or declared) behavior [35, 38, 4]. Such systems can reduce communication requirements significantly and also improve latencies, but in conventional systems they incur significant computation overhead for the requisite bookkeeping. MORPH's configurable logic will allow custom protocols and similar optimizations to be implemented with low overhead, reaping the communication-reduction and lower-latency benefits without the computational overhead. Of course, there is a wealth of cache system optimizations proposed within parallel machines which could be applied in an application-specific manner to achieve the best performance [43, 23, 25].
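As a minimal sketch of the mechanism's flavor (the policy names and the region-table interface are assumptions for illustration, not MORPH's protocol engine), configurable logic could dispatch coherence actions through a per-address-range policy table loaded by the application or runtime:

    #include <stdint.h>

    typedef enum { POLICY_INVALIDATE, POLICY_UPDATE, POLICY_MIGRATORY } coh_policy_t;

    typedef struct {
        uint64_t     start, end;  /* physical address range [start, end) */
        coh_policy_t policy;      /* protocol variant chosen for this data */
    } coh_region_t;

    #define MAX_REGIONS 8
    static coh_region_t regions[MAX_REGIONS];
    static unsigned numRegions;

    /* Declared by the application or runtime for a shared data structure. */
    int set_region_policy(uint64_t start, uint64_t end, coh_policy_t p) {
        if (numRegions == MAX_REGIONS) return -1;
        regions[numRegions++] = (coh_region_t){ start, end, p };
        return 0;
    }

    /* Consulted on each coherence event instead of one hard-wired policy. */
    coh_policy_t policy_for(uint64_t paddr) {
        for (unsigned i = 0; i < numRegions; i++)
            if (paddr >= regions[i].start && paddr < regions[i].end)
                return regions[i].policy;
        return POLICY_INVALIDATE;  /* default: conventional invalidate */
    }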
4.4 Custom Prefetching

Pointer-based access to sparse matrix data structures in current-day memory hierarchies yields poor performance because the indirection introduces main memory and memory hierarchy latencies into the innermost computational loop. Techniques such as software prefetching (loop unrolling and hoisting of loads) do not adequately solve the problem, as prefetching at the processor leaves the multi-level memory hierarchy latency in the critical path. Hardware prefetching (based on n-th order differences) is also unlikely to be effective, because the address references generated by a sparse matrix traversal may contain no particular address structure. MORPH can provide an address-range-based prefetching scheme that can reside in customizable hardware at any level of the memory hierarchy, in all of them, or bypass them completely. This approach performs data-structure-aware prefetching based on the address ranges of data structures. Thus, if the structures for sparse matrix elements are all allocated from a single memory region, then row and column prefetching of fixed-size elements can be enabled based on detected memory accesses. This in effect can transform a cache-based machine into a decoupled architecture, as dictated by the application structure. In the following section, we describe the experimental setup and results from our experiments evaluating specific architectural mechanisms for the MORPH point design study.
5 Evaluation of Configurability in MORPH

The space of configurable architectures is vast. For example, configurable logic can be used to augment the functionality of all major elements of computing systems for customization in any of the following aspects:
- computing (data types, functional units, operations),
- interconnect (data movement, encoding, synchronization),
- memory (caching, coherence, prefetching, interleaving),
- input/output (caching, prefetching, interleaving, protocol adaptation, compression, format conversion), and
- networking (protocol adaptation, hardware interoperability).
Because this space is vast, and our focus is to explore the relevance and viability of configurable logic in high performance systems reaching 100 TeraOps or even PetaOps, we choose to focus on several important scientific applications and the benefits of configurability in the memory system to enhance their execution. Closer examination of this ground reveals ample opportunities for the use of configurability to enhance computational efficiency through better management of memory hierarchies. Management of memory hierarchies is a rich area for improving machine performance today: microprocessor-based computer systems can spend as much as 50-75% of their time waiting for operands. In the following sections, we explore the potential benefits of configurability in managing memory hierarchies through a study of a range of kernels, sparse linear algebra libraries, and entire application programs. We address the key issues of scaling and application domain, their applicability, and the limitations of this study.
5.1 Scope and Limitations of Our Study
Scaling: Technology scaling and extrapolation to future computer systems is challenging. We report our empirical results for current-day computer technologies and then use derivative metrics (such as miss rate) to extrapolate potential benefits for future computer systems, which will exhibit much higher clock rates (1-2 GHz) and memory sizes. We also assume that the increasing gap between processor and memory speeds will be spanned by multi-level memory hierarchies, the fastest levels of which will be close to the same sizes exhibited by high-clock-rate machines such as Digital's Alpha today.
Applications: Applications of the year 2007 are likely to be different in a number of ways that cannot be anticipated; the dramatic increase in computing capabilities and the complementary decreases in cost, volume, and weight will doubtless continue to broaden the domain of applications, incorporating many that we cannot predict today. However, it seems likely that petaflop machines in a 10-year time frame will, as a core part of their purpose, support a range of traditional applications of critical importance to the major mission-directed agencies. As a consequence, we chose a selection of kernels from existing applications, sparse matrix libraries, and entire application kernels. The sparse matrix libraries were chosen as representatives of the emerging generation of adaptive, dynamic codes that efficiently focus the computational effort on the parts of the system with high frequency behavior. The complete
application kernels were chosen to cover the two application domains we selected as drivers for our point design study: computational fluid dynamics and immersive virtual reality. To evaluate, we simulated current-day memory systems with state-of-the-art applications, executing standard full-size contemporary data sets. Using contemporary data set sizes with contemporary codes gives an accurate model of capacity and data movement in the memory hierarchy. Along with the assumptions on the capacities of the higher levels of the memory hierarchy, the only major shortcoming of these memory hierarchies as a model for 2007 hierarchies is that they are not deep enough and their main memory latencies are not large enough. By focusing on miss rates, we avoid explicit dependence on these ratios. Using these models, we simulate memory reference behavior and analyze the performance of these memory hierarchies on a range of applications. These studies focus on the variability of performance across applications, with the goal of exploiting configurability to provide robust high performance for all application programs. In addition, these results and the application programs are analyzed to explore approaches for driving the miss rates of high-level caches toward 0%. Highly parallel systems are an essential approach to achieving performance levels of 100 TeraOps or PetaOps in the 2007 time frame. One significant limitation of our studies is that we do not model the parallel aspects of memory hierarchies. This was not possible due to the short time scale of this study, and also due to application scaling: obtaining applications and data sets large enough to correspond to actual machine configurations in 2007 is not only difficult, but simulation of such systems on current-day systems is effectively infeasible.
5.2 Description of the Benchmarks Used
5.2.1 PetaFlop Kernels
The PetaFlop Kernels are a collection of loop nests collected by the Applications Working Group at the April 1996 Petaflops Architecture Workshop held at Oxnard, California. They represent a broad range of scientific computing, and in particular the application archetypes expected to continue to be important to the mission-directed agencies for highest-performance computing. These are listed in Appendix A.
5.2.2 Sparse Matrix Libraries

Sparse methods are of increasing popularity as the range of topologies and structures that computational scientists and engineers must simulate increases in diversity and complexity. These codes have significantly different computational characteristics compared to dense linear algebra codes: sparse codes often make heavy use of pointers, dynamic allocation, and conditional branches.
As our application examples, we use the sparse matrix library SPARSE developed by Kundert and Sangiovanni-Vincentelli (version 1.3, available from http://www.netlib.org/sparse/), and an additional sparse matrix multiply routine that we wrote. This library supports a wide variety of matrix operations, including LU factorization, matrix-vector multiplication, matrix-matrix multiplication, and a solver for linear systems. The library implements an efficient storage scheme for sparse matrices using row and column linked lists of matrix elements, as shown in Figure 12. Only nonzero elements are represented, and the elements in each row and column are connected by singly linked lists via the nextRow and nextCol fields. Space for elements (40 bytes per matrix element) is allocated dynamically in blocks of elements for efficiency. There are also several one-dimensional arrays storing the root pointers for the row and column lists.

    struct MatrixElement {
        Complex val;
        int row, col;
        /* other fields */
        struct MatrixElement *nextRow, *nextCol;
    };

Figure 12: Sparse Library Data Structures (the figure also shows the arrays of starting row and column pointers and the element declaration above)

In this study, we chose to explore the dynamics of the matrix-vector product and matrix-matrix product because of their computational importance. We focused on 2D matrices, so for these, all elements in every row and column are connected by a linked list.
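To see why this layout stresses the memory hierarchy, consider a sketch of a row traversal over the structure of Figure 12 (field names follow the figure; the Complex definition and the assumption that nextRow links successive elements within a row are ours):

    typedef struct { double re, im; } Complex;  /* assumed 16-byte value */

    struct MatrixElement {
        Complex val;
        int row, col;
        /* other fields */
        struct MatrixElement *nextRow, *nextCol;
    };

    /* y[row] += A[row][col] * x[col] over one row (real part only, for
       brevity). Every iteration is a dependent load into dynamically
       allocated storage, so consecutive elements rarely share a cache
       line and stride-based hardware prefetchers see no pattern. */
    void row_times_vector(const struct MatrixElement *rowRoot,
                          const double *x, double *y) {
        for (const struct MatrixElement *e = rowRoot; e != NULL; e = e->nextRow) {
            y[e->row] += e->val.re * x[e->col];
        }
    }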
5.2.3 Application Programs

MP3D: A particle simulation code from the SPLASH benchmarks. The application models a regular particle distribution in space using a straightforward interaction model.
Adaptive Mesh Refinement (AMR): As a proxy for computational fluid dynamics codes, and for a broad range of structured-mesh partial differential equation solvers, we selected a state-of-the-art adaptive mesh refinement code drawn from the National Science Foundation Cosmology Grand Challenge. This code uses regular grids, but refines them dynamically to model the evolution of systems with 100x-1000x greater computational efficiency than regular grid codes, producing results of equivalent numerical quality.
Ray Tracing (RT): As proxies for the immersive virtual reality codes, we selected three application kernels: a ray tracing code, a hierarchical radiosity code, and a volume visualization code. These applications cover the space from large data sets to interactive visualization, and complement the back-end OpenGL graphics pipelines that many observers expect to become standard hardware accelerators. Of course, illumination models for graphics are well known to correspond closely in structure to codes that solve a wide range of similar problems (e.g., neutron transport, alpha particles, etc.). Ray tracing is a well-known, computationally intensive technique for modeling the propagation of light (reflection and transmission) among objects in a space. Rays originate at the light sources, or from the viewer's perspective, and are propagated throughout the scene based on the properties of the objects in the space. Ray tracing is a challenging code for parallelism because, while there is substantial parallelism across rays, sequential algorithms perform clever optimizations to reduce the number of rays which must actually be propagated, and the tracing of rays may have little locality with respect to the scene. We use the well-known Splash-2 code, VOLREND, as our ray tracing application kernel. This application is challenging because of load balance (which depends heavily on scene structure) and data locality (rays can bounce between objects that are far apart spatially).
Hierarchical Radiosity (HR): Recently, hierarchical radiosity techniques have been proposed which may make this more expensive computation feasible for high-speed and even real-time graphics. Radiosity computations model the mutual illumination of "patches" in a scene by modeling the light that flows between them. Because the cost of such a model is proportional to the number of patches, patches are refined only as necessary to produce the desired precision of results. Hierarchical radiosity techniques use approaches similar to Barnes-Hut n-body interaction codes to efficiently share the results of long-distance interactions and thereby dramatically reduce the cost of computing the overall radiosity. Challenges in this code include dynamic load balance (many fine-grained tasks are created in an unpredictable pattern) and data locality (patches that are far apart in space can interact). We use the Splash-2 Radiosity code for this application kernel.
5.3 Simulation Environment

Figure 13 shows the synthesis and simulation framework used for hardware evaluation. This environment allows us to quickly evaluate the effect of architectural changes on application runtimes. While the bulk of the simulation environment is standard, we briefly discuss the innovative features that allow us to model hardware blocks efficiently. In particular, we describe how the simulator achieves significant speedups in architectural evaluation by minimizing the use of interpretation over a program-driven simulation.
[Figure 13 depicts the MORPHSIM toolflow: the application and its inputs are compiled and linked with the compiled simulator into an executable; a VHDL library feeds synthesis and mapping to produce an FPGA netlist; and an event manager connects execution to the collection of simulation statistics.]
Figure 13: MORPH Simulation Workbench, MORPHSIM

In general, almost any simulation framework that supports the notion of events and provides mechanisms to process those events can be used to build simulatable system prototypes. This is even more true of modeling reactive hardware systems. Indeed, most existing HDLs can be classified as source languages for a corresponding event-driven simulator [34]. An event-driven model is powerful enough to describe most hardware systems at any level of abstraction, from algorithms to gate-level circuits [61]. However, this generality also imposes a significant burden on simulation efficiency due to the extra work (or overhead) needed in event maintenance and processing. Event processing often requires interpretation of event generation, propagation, and disposition by the simulation model. "Interpretation" of an object in a simulation model refers to the evaluation of the object by a separate procedure that provides a semantic interpretation of the object in the context of the simulation model. This can be done either statically (at compile time) or at runtime. Runtime interpretation is done by a (separate) module that holds the state of execution and is invoked whenever interpretation is needed. This is in contrast to native execution, where the host machine (the machine running the simulation) is used to hold the state of execution. Since each call to the interpreter may require a context switch (to another simulated context, for instance), the tradeoff between native execution and interpretation is dictated by the cost of execution versus the cost of the context switches needed for interpretation. Mixed hardware/software simulations are slowed down by their interface with an event-driven hardware simulator. Recent efforts in this direction (e.g., [24]) have demonstrated the utility of simulation backplanes in integrating various simulators. Though the achieved simulation speeds are not reported, it
is unlikely that hybrid machine simulations can be used to run moderately large application benchmarks, which require simulation efficiencies upwards of 100,000 simulated cycles per second. A critical bottleneck in achieving higher simulation speeds stems from the integration of hardware description language (HDL) models. HDL simulations often require operation-level interpretation; that is, each operation (statement) requires a call to the interpreter. Because each signal assignment statement in a language such as VHDL can potentially generate an event, system simulations are significantly slowed by frequent calls to the interpreter (or the event manager). An alternative is to build models that work with a cycle-based simulation. A cycle-based simulation, though not necessarily efficient in terms of the work required, is often able to use native execution (as opposed to software interpretation) to achieve significant reductions in simulation time. Cycle-based simulation using conditional control flow in high-level programming languages has been used to demonstrate speedups in gate-level hardware simulations [63]. Since the main simulation loop in these simulations advances once per cycle, only synchronous systems can be simulated; clearly, circuit latency cannot be exploited to improve simulation efficiency as in event-driven simulations. Our approach to system simulation uses a combination of interpretation and native code execution such that the native code is encapsulated into non-event-producing blocks and interpretation is done at a coarse level. This is done by separating the two very different time scales of concern, system simulation and hardware verification, as shown in Figure 13. Cycle-based system-level simulation uses a modified program-driven simulator, MINT [62], that performs the system simulation as the application executes. This simulator consists of a "front end" and a "back end." The front end is the interface to the application program, whereas the back end models the underlying micro-architecture. The front end handles the application program execution. When the program execution leads to the generation of an event (either through a reference to modeled blocks in MINT or through directives embedded in the source program), the front end sends the event to the back end. When the operations associated with an event complete, the back end returns a status signal to either block, abort, or continue execution, depending on the event semantics. The underlying simulation library provides event processing and disposition. This allows us to conveniently customize system simulations for specific hardware blocks. Due to the high-level synchronous nature of the system-level simulations, the hardware blocks are modeled at the behavioral level, while detailed implementations are considered only in hardware validation. The compiled simulator (based on MINT) can interpret almost any program that runs natively, and can generate events specific to any hardware block. The original design of MINT was used to generate events representing references to (cache) memory blocks. However, with an appropriate definition of an event, the system simulator can now be used to
address references to any hardware block. In this way the MORPHSIM simulation environment allows us to associate events with, and reduce runtime interpretation to, only the blocks that are under design. The back end is customized to reflect the underlying system microarchitecture. For architectural blocks that can be identified in the compiled simulator, the applications do not need to be preprocessed or recompiled. Hardware validation in MORPHSIM is done using traditional event-driven simulation on hardware blocks. The input to these simulations is derived from a translation of the C models of the hardware into HardwareC blocks. (These blocks are also good candidates for presynthesis optimizations based on Timed Decision Tables [44].) Currently, however, this translation is done manually. The hardware validation does not require on-line application executions. Instead, the results from application executions are used to create a test bench for the event-driven hardware validation. This reduces the redundancy in detailed hardware simulations (for the purposes of hardware validation), while still modeling their effect on application-level simulation statistics.
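A rough sketch of the front-end/back-end contract just described (the type names, return codes, and stub are our illustration; MORPHSIM's actual interface is not given in this report):

    #include <stdio.h>

    typedef enum { EVT_CONTINUE, EVT_BLOCK, EVT_ABORT } evt_status_t;

    typedef struct {
        unsigned long cycle;    /* simulated cycle of the reference */
        unsigned long paddr;    /* address presented to the modeled block */
        int           isWrite;
    } sim_event_t;

    /* Back end: models the micro-architecture for one event.
       Stub: a real back end would update cache/prefetcher state here. */
    static evt_status_t backend_handle(const sim_event_t *ev) {
        return (ev->paddr & 0x3f) ? EVT_CONTINUE : EVT_BLOCK; /* placeholder */
    }

    /* Front end: the application runs natively; only references to modeled
       hardware blocks become events, avoiding per-instruction
       interpretation. */
    static void frontend_reference(unsigned long cycle,
                                   unsigned long paddr, int isWrite) {
        sim_event_t ev = { cycle, paddr, isWrite };
        switch (backend_handle(&ev)) {
        case EVT_CONTINUE: break; /* proceed immediately */
        case EVT_BLOCK:    break; /* stall the simulated thread until done */
        case EVT_ABORT:    break; /* abort execution per event semantics */
        }
    }

    int main(void) {
        frontend_reference(1000, 0x1040, 0); /* example: a modeled read */
        puts("event dispatched");
        return 0;
    }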
6 Experimental Results

6.1 Configurability in Memory Hierarchy Management

The following studies explore the advantages of configurability in a range of application kernels, libraries, and entire applications. While significant benefits can be achieved for some of the kernels, much larger benefits are in fact possible for the entire application programs, and the critical factor is often data reuse across multiple phases of an application (or loop nests). Configurability shows dramatic benefits in a number of cases by allowing the limited fast storage to be more efficiently utilized (more densely packed with useful values) and carefully managed (controlled replacement, cache partitioning).
6.1.1 Petaflop Kernels

To explore the opportunities for optimizing memory hierarchy performance, we first executed each of the applications against a standard cache simulator, using a variety of cache parameters for the level-one (L1) and level-two (L2) caches. Parameters include cache size, block size, and associativity. The parameter values used are presented in Table 1. The results for these parametric cache studies are presented in Figures 14 through 16. Each axis in the star chart represents a specific cache configuration within the parametric ranges mentioned above. We report results and discuss the implications for the advantages of intelligent memory hierarchy management. For brevity, we present only a select subset of the data here. The complete experimental results are available on-line at the MORPH website and are summarized in Appendix D.

                          L1 Cache                L2 Cache
    Line Size             32B or 64B              32B or 64B
    Associativity         1                       2
    Cache Size            32KB                    512KB
    Write Policy          Write back +            Write back +
                          write allocate          write allocate
    Replacement Policy    Random                  Random
    Transfer Rate         16B/5 cycles (L1-L2)    8B/15 cycles (L2-Mem)

Table 1: Cache Simulation Parameters
Average Stall Fraction measures the amount of time the processor spends waiting for values from the memory system. As illustrated in Figure 14, for the Petaflop Kernels the number of stall cycles is significant, ranging from 20% to 90% of the simulated cycles. Clearly, memory system performance is a critical issue for current-day memory hierarchies. Not only do many of the kernels spend much of their time in memory stall, the choice of cache parameters is critical for performance; there is no single optimum, with the kernels achieving their best performance under different cache configurations. Several aspects of our cache simulator model may exaggerate this effect: for example, no cycles are allotted for computation, and only one outstanding memory request is allowed. However, as the clock rates of processors (and their simultaneous issue capability) are increasing much faster than memory latency is improving, the simulation results are likely to be representative of the future. In large part, the simulations indicate that the Petaflop Kernel memory reference patterns are lacking in locality of reference and also interact badly with cache features such as multiword block sizes. In particular, Gather, the scatter-gather kernel, is forced to move entire cache lines where single double-precision words would suffice.
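For reference, the quantities behind Figures 14 through 16 follow the standard average-memory-access-time identity (our notation, assuming the two-level blocking hierarchy simulated here):

\[
\mathrm{AMAT} \;=\; t_{L1} \;+\; m_{L1}\,\bigl(t_{L2} \;+\; m_{L2}\,t_{\mathrm{mem}}\bigr),
\qquad
\text{Stall Fraction} \;=\; \frac{\text{memory stall cycles}}{\text{total simulated cycles}},
\]

where the $t$ terms are per-level access times and $m_{L1}$, $m_{L2}$ are the local miss rates of the two cache levels. Because only one memory request may be outstanding, every cycle of AMAT beyond the L1 hit time appears directly as stall time.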
Average Memory Read Time and Write Time capture the effective memory latencies seen for load and store operations, respectively. Again, these vary widely for the Petaflop Kernels, but the average load times are extremely high and vary over several orders of magnitude (note that these are presented on a logarithmic scale). This indicates a basic mismatch between the Petaflop Kernels and memory hierarchies, which require high degrees of reuse for good performance.

Figure 14: Average Percentage of Memory Stall Cycles for the Petaflop Kernels for a Range of Cache Configurations

Figure 15: Average Memory Read Time for the Petaflop Kernels for a Range of Cache Configurations

Figure 16: Average Memory Write Time for the Petaflop Kernels for a Range of Cache Configurations

The average write times are better, reflecting the usual situation that writes can be decoupled from the continued execution of the processor, since they need not return a value. However, significant numbers of stall cycles are incurred for writes, as the bandwidth of a traditional memory hierarchy is not well matched to the needs of the Petaflop Kernels. These Petaflop Kernel results present a wealth of data; of primary interest are several observations:
- Significant miss rates occur for all of these applications; with faster processors (in absolute terms and relative to memory), the performance impact of these misses will continue to increase.
- Optimal configurations with respect to miss rate vary significantly across the Petaflop Kernels.
- Traditional memory hierarchies often waste memory bandwidth, transferring more words than are required or failing to retain reusable data when sufficient fast memory capacity is actually available.
- The Petaflop Kernels are too small (no reuse) to exploit intelligence in memory hierarchy management.

In the following sections, we explore the use of configurability for intelligent management with slightly larger kernels and some entire application programs.
6.1.2 Sparse Matrix Libraries

With the sparse matrix libraries, we first characterize the cache behavior, then explore the opportunities for reuse in the matrix-matrix product routines. Memory hierarchy simulation results for the sparse matrix library routines show that the cache organization has only a modest impact on the performance of this algorithm, with the major impact being the cache line size. This can be explained by the sparse matrix representation: linked columns and rows constructed from dynamically allocated storage. This produces little spatial locality in the reference stream (beyond the basic structure size of 40 bytes), so line sizes below 40 bytes fail to capture this spatial locality, and those much above it waste most of the data that is transferred. Because we believe that dynamic data structures, such as those used in the sparse matrix library, are of increasing importance for high performance computing applications, we carefully analyzed the source code for the sparse matrix-matrix product to identify opportunities to improve memory hierarchy performance. This analysis uncovered a significant number of inefficiencies:

1. Misalignment of dynamically allocated structures and cache blocks
2. Lack of reuse in the algorithm
3. Movement of unused data
4. Ineffective prefetching due to pointer indirection
5. Poor write locality due to dynamic allocation
These inefficiencies can be addressed, in turn, by corresponding software and software-configurable hardware optimizations. These optimizations not only improve performance with respect to traditional metrics such as cache miss rate, they also increase performance with respect to petaflop-relevant metrics such as the total volume of data moved:

1. Padding structures to cache block sizes (misalignment)
2. Blocking the matrix-matrix algorithm (reuse)
3. Gather hardware triggered by memory references to a particular range of addresses (movement of unused data)
4. Prefetch hardware in the memory/cache controller (pointer indirection)
5. Scatter-allocate hardware triggered by memory references to a range of addresses (poor write locality)

In Section 6.2 we describe how these optimizations are implemented. While these techniques are presented in the context of optimization for sparse matrix libraries, each is an instance of a much more general set of memory hierarchy management techniques. Amongst others, some of the innovations include:
- address-range-triggered hardware actions,
- adding intelligence (and altering the address-to-physical-storage correspondence) to the memory hierarchy, and
- including allocation in the memory hierarchy.
Furthermore, within these sets, the critical customization which yields a performance benefit depends on details of the application's data structures or behavior. This, we believe, is a strong argument for the benefits of customizable machines: no single application is compelling enough to merit the hard-wired inclusion of any one of these particular features.
Cache Packing: Custom Cache Organization

The sparse matrix structure uses dynamically allocated storage chunks of 40 bytes for each element. However, because the elements are dynamically allocated, and therefore not contiguous along either the rows or columns of the matrix (and in general cannot be made so), significant amounts of fast cache storage are wasted. Each matrix element contains the following information:

- 16 bytes: real numbers for the storage of the (possibly) complex value
- 8 bytes: column and row number of the element
- 4 bytes: pointer to the next element in the row
- 4 bytes: pointer to the next element in the column
- 4 bytes: hooks for special initialization

The total amount of storage, with the alignment padding introduced by the compiler (to align doubles), is 40 bytes. A standard software-only technique would be to add 24 bytes of padding to each structure to make its size match the cache line size of 64 bytes; see the sketch below. While this can ensure that we get an entire matrix element for each cache miss (no element is divided over two cache lines), it also ensures that 24/64 = 37.5% of the fast memory's capacity is wasted. Configurable logic in the cache can instead be used to reorganize the cache to match the application (changing the cache block size to 40-byte blocks). This not only increases the effective capacity, and thereby the opportunities for reuse, it also reduces the memory bandwidth requirement, as only 40-byte chunks need be delivered from main memory. The two significant downsides of this optimization are the need for more cache tag memory (to match the increased number of cache blocks) and the requirement that the memory system and interconnect deal efficiently with non-power-of-two data transfers.
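For concreteness, the software-only fix looks like the sketch below (the field layout follows the list above; the explicit padding fields and 32-bit pointers are assumptions about the compilation target):

    /* 40-byte element padded out to one 64-byte cache line: no element
       straddles two lines, but 24 of every 64 bytes of cache capacity
       carry no data. */
    struct PaddedMatrixElement {
        double val_re, val_im;               /* 16 bytes: (possibly) complex value */
        int row, col;                        /*  8 bytes */
        struct PaddedMatrixElement *nextRow; /*  4 bytes, 32-bit pointers assumed */
        struct PaddedMatrixElement *nextCol; /*  4 bytes */
        void *initHook;                      /*  4 bytes: initialization hooks */
        char alignPad[4];                    /*  4 bytes: compiler padding to 40 */
        char linePad[24];                    /* 24 bytes: pad to the 64-byte line */
    };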
6.2 Case Studies in Architectural Adaptation

We present two case studies of architectural adaptation for application-specific locality optimizations. On modern architectures with deep memory hierarchies, data transfer bandwidth and access latency differentials across the levels of the memory hierarchy can span several orders
of magnitude, making locality optimizations critical for performance. Although compiler optimizations can be effective for regular applications such as dense matrix multiply, optimizations for irregular applications can greatly benefit from architectural support. However, numerous studies have shown that no fixed architectural policies or mechanisms (e.g., cache organization) work well for all applications, causing performance fragility across different applications. We present two case studies of architectural adaptation using application-specific knowledge to enhance latency tolerance and to efficiently utilize the network bisection of multiprocessors. Table 1 shows the simulation parameters used. We report our empirical results for current-day computer technologies and then use derivative metrics (such as miss rate) to extrapolate potential benefits for future computer systems, which will exhibit much higher clock rates and memory sizes. We also manually translated the C routines modeling the customizable logic blocks into HardwareC [41] to evaluate their hardware cost in terms of size and cycle delays. (Our recent work is focused on automatic translation of these routines into synthesizable blocks [45].)
6.2.1 Architectural Adaptation for Latency Tolerance

Our first case study uses architectural adaptation for prefetching. As the gap between processor and memory speed widens, prefetching is becoming increasingly important for tolerating memory access latency. However, oblivious prefetching can degrade a program's performance by saturating bandwidth. We show two example prefetching schemes that aggressively exploit application access pattern information. Figure 17 shows the prefetcher implementation using programmable logic integrated with the L1 cache. The prefetcher requires two pieces of application-specific information: (1) the address ranges and (2) the memory layout of the target data structures. The address range is needed to indicate the memory bounds where prefetching is likely to be useful. This is application dependent; we determined it by inspecting the application program, but it could easily be supplied by the compiler. The program sets up the required information and can enable or disable prefetching at any point. Once the prefetcher is enabled, however, it determines what and when to prefetch by checking the virtual addresses of cache lookups to see whether a matrix element is being accessed. The first prefetching example targets records spanning multiple cache lines; for our example, it prefetches all fields of a matrix element structure whenever some field of the element is accessed. The pseudocode of this prefetching scheme for the sparse matrix example is shown below, assuming a cache line size of 32 bytes, a matrix element padded to 64 bytes, and a single matrix storage block aligned at a 64-byte boundary.
[Figure 17 shows the prefetcher logic sitting between the L1 and L2 caches: the processor exchanges virtual addresses and data with the L1 cache, and the prefetcher logic observes physical addresses and generates additional addresses to the L2 cache.]
Figure 17: Organization of the Prefetcher Logic

Prefetching is triggered only by read misses. Because each matrix element spans two cache lines, the prefetcher generates an additional L2 cache lookup address from the given physical address (assuming a lockup-free L2 cache) that prefetches the other cache line not yet referenced.

    /* Prefetch only if vAddr refers to the matrix */
    GroupPrefetch(vAddr, pAddr, startBlock, endBlock) {
        if (startBlock <= vAddr && vAddr < endBlock) {
            /* the 64-byte element spans two 32-byte lines:
               prefetch the companion line of the pair */
            prefetch(pAddr ^ 32);
        }
    }

The second prefetching scheme gathers the elements of an entire row (or column) into an explicitly managed cache region by chasing the linked list on behalf of the processor:

    /* Gather a row into the managed cache region by chasing its list */
    RowGather(rowRoot, cacheRegion) {
        chaseAddr = virtual-to-physical(rowRoot);
        while (chaseAddr != NULL) {
            copy element at chaseAddr into the next slot of cacheRegion;
            chaseAddr = virtual-to-physical(chaseAddr->nextRow);
        }
    }
Because the data gathering changes the storage mapping of matrix elements, a translate logic in the cache is required to present "virtual" linked-list structures to the processor, so that the program code need not change. When the processor accesses the start of a row or column linked list, a prefetch for the entire row or column is initiated. Because the target location of the linked list in the cache is known, instead of returning the actual pointer to the first element, the translate logic returns an address in the reserved address space corresponding to the location of the first element in the explicitly managed cache region. In addition, when the processor accesses the next-pointer field, the request is also detected by the translate logic, and an address is synthesized dynamically to access the next element in this cache region. The translate logic in pseudocode is shown below.

    Translate(vAddr, pAddr, newPAddr) {
        /* check if accessing start of a row */
        if (startRowRoot