The NAPA Adaptive Processing Architecture

Charlé R. Rupp, Ph.D., Mark Landguth, Tim Garverick, Edson Gomersall, Harry Holt
National Semiconductor Corporation
[email protected]

Jeffrey M. Arnold
Consultant

Maya Gokhale, Ph.D.
Sarnoff Corporation

Abstract The National Adaptive Processing Architecture (NAPA) is a major effort to integrate the resources needed to develop teraops class computing systems based on the principles of adaptive computing. The primary goals for this effort include: (1) the development of an example NAPA component which achieves an order of magnitude cost/performance improvement compared to traditional FPGA based systems, (2) the creation of a rich but effective application development environment for NAPA systems based on the ideas of compile time functional partitioning and (3) a significant improvement in the base infrastructure for effective research in reconfigurable computing. This paper emphasizes the technical aspects of the architecture needed to achieve the first goal while illustrating key architectural concepts motivated by the second and third goals.

1. Introduction First generation configurable computing systems demonstrated that application specific computing elements built from reconfigurable logic can achieve significant performance improvement over conventional solutions[1,3,4,5]. The basis of this performance gain is the exploitation of parallelism at several levels: massive, very fine grain parallelism within the configurable processing elements; deep pipelines within and among processing elements; and SIMD or systolic control structures across arrays of processing elements. Successful applications on these early configurable platforms tended to be stream oriented, exhibiting static communication patterns and operating on small data objects such as bytes or individual bits. Most first generation configurable systems follow the “attached processor” model: the reconfigurable resources are viewed as a peripheral device attached to a conventional computer via a standard I/O interface. Communication between the host and the reconfigurable logic typically passes through an API library and kernel device driver. Unfortunately, the overhead costs imposed

by the attached processor model are significant. The communication latency introduced by the interface makes a single consistent programming model encompassing both the fixed and the adaptive computation resources infeasible. The dichotomy between host and attached processor in turn forces the application programmer to contend with the challenging issues of hardware/software co-design, often with inadequate debugging and instrumentation support. The National Adaptive Processing Architecture (NAPA) integrates adaptive logic and scalar processing with a parallel processing interconnect structure in a closely coupled manner. This second generation adaptive computing architecture exploits the techniques successfully demonstrated in previous efforts and provides the basis for new algorithm styles not previously available. At the same time, this new architecture style allows the use of powerful software compilation concepts to simplify the program development tasks needed to access this high performance capability. The compiler for this architecture style partitions the data processing tasks of an application into the three underlying computation dimensions of the NAPA architecture: traditional scalar processing, parallel processing interconnect and adaptive logic processing. The result is the execution of hundreds of computation steps per system clock cycle. The first NAPA processor implementation has a target system clock of 50 MHz.

2. Related Work A number of research efforts have proposed integrating conventional RISC processors with reconfigurable logic. The MIT Raw project proposes a class of array architectures in which the tiles consist of a RISC like execution pipeline coupled with a block of configurable logic. The array architecture is based upon a switched interconnect network between nearest neighbors in a mesh[10]. Wittig and Chow proposed a very tightly coupled approach to integrating reconfigurable logic into a

Copyright 1998, the Authors; Submitted for publication at the FPGAs for Custom Computing Machines Conference (FCCM98 a)

1

conventional processor datapath. In the OneChip architecture the reconfigurable logic appears as a set of Programmable Function Units (PFUs) that operate in parallel to the standard processor data path, or Basic Function Unit (BFU). Control of the PFUs is handled through the instruction stream of the standard processor[11]. The Berkeley BRASS project is designing a hybrid MIPS architecture, Garp, that includes a reconfigurable coprocessor. In Garp the standard processor and the reconfigurable coprocessor share a single memory hierarchy, with each having access to external memory through a common data cache. The coprocessor is under the direct control of the main processor through an extension to the instruction set, although once started an operation on the coprocessor can execute concurrently with instructions on the main processor[8].

3. NAPA System Concepts Much of the cost/performance improvement and reduced development complexity of the NAPA architecture is the result of an emphasis on algorithmic datapath and function generation. These tools create large, nearly optimal, placed and routed logic blocks for final assembly in the adaptive logic processor program. The rapid execution time of these tools allows effective incremental development. This also raises the level of the code target for the front end compiler’s code generator resulting in a substantial improvement in the quality of the output of high level programming techniques. We believe that an appropriate amount of functional specialization (hardwired function blocks such as large high speed random access memories) will always provide a significant improvement in the cost/performance characteristics of the overall system. This does not limit the universality of the adaptive logic capability so long as the specialized function blocks of the system can be bypassed by an application specific version of the function in the adaptive logic. Having a certain amount of functional specialization also establishes a common infrastructure that can drastically simplify the application development environment: without this infrastructure, each application has its own “interface architecture” based on the universal logic array and the development software quickly becomes immersed in trying to create a very complex logic design with all the associated physical design issues. The key to the architecture of a reconfigurable logic based adaptive computing system is then the selection of dedicated function blocks which achieve most of the development simplicity and cost/performance of a traditional fixed function based computing architecture while providing an interface to the logic based adaptive

processing resource which attains maximum performance and retains complete adaptive generality. We liken the adaptive system architecture definition problem to the well known issues surrounding floating point processing: (1) when an application has a high percentage (by run time) of specialized (floating point, adaptive) computations to perform then a specialized (floating point, adaptive) processor will improve both the absolute performance and the relative cost/performance of the system and (2) the more the specialized computations dominate the application the more closely the specialized processor should be coupled with the traditional general purpose processor. This principle is usually referred to as the “90/10 hardware/software partitioning rule:” in most applications, 10 percent of the program code requires 90 percent of the execution time using traditional processing techniques. When the small amount of code which uses most of the time is converted to an adaptive program (circuit) the net gain in performance can be one to two orders of magnitude.
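As a rough illustration of the arithmetic behind this rule (our own example, not taken from the paper), the overall gain follows Amdahl's law: if a fraction f of the run time is moved to adaptive logic that speeds that portion up by a factor s, then

    S_{\mathrm{overall}} = \frac{1}{(1 - f) + f/s}

With f = 0.9 and s = 100 this gives roughly 1/(0.1 + 0.009), or about 9.2, one order of magnitude; pushing f to 0.99 with the same s gives about 50, approaching two orders of magnitude. This is also the sense of point (2) above: the closer f is to 1, the more tightly the specialized processor should be coupled to the general purpose processor.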

3.1 NAPA Processor Architecture Principles The basic principles of the NAPA architecture are then:

1. The Adaptive Logic Processor (ALP) resource is always combined with a general purpose scalar processing capability called the Fixed Instruction Processor (FIP). This establishes a completely general architecture basis for application development using well known techniques. By allowing the ALP to access the same memory space as the FIP, the ALP also retains complete generality. Realizing the desired system performance requires "only" the optimal partitioning of the application into traditional and adaptive subsets. In our first implementation of a NAPA processor, the FIP is a 32-bit RISC core that requires less than 10 percent of the silicon resource.

2. A dedicated co-processor interface is always available between the ALP and FIP to achieve maximum performance, supplement the FIP instruction set without compromising existing compatibility and provide high speed reconfiguration of the ALP resource. We call this the Reconfigurable Pipeline Controller (RPC). The functions available to the FIP using the co-processor interface are called the Reconfigurable Pipeline Instruction Set (RPIS). An important requirement for the RPC is to allow the ALP to be integrated with a wide range of different FIP processors. This allows source code portability across a number of NAPA implementations.

3. To achieve teraops class performance, a NAPA processor must include a parallel processing interface as a fundamental aspect of the architecture. NAPA uses a new interconnect architecture called the ToggleBus and each NAPA processor includes a ToggleBus Transceiver (TBT) interface circuit. When several NAPA processors operate together, these bus transceivers form a Multistage Interconnect Network (MIN) [9,12] which allows bus bandwidth increase in proportion to the number of processors. Extracting the parallel processing potential represents the third subset of the application partitioning problem. When used in a SIMD style, the architecture allows an ensemble of NAPA processors to behave as a single co-processor to a higher level host processor. The RPC provides a mechanism that allows this higher level host to access the full capabilities of the ALP using RPIS, as well as general access to the memory space of each NAPA processor.

4. A closely coupled memory organization is necessary for both the ALP and the FIP to achieve optimal performance. Although the ALP has general access to the FIP Memory Space (FMS), we believe that system cost/performance is generally improved by having high speed memories dedicated for use by the ALP program. Our first implementation of a NAPA processor includes three on-chip memory resources and a general external memory interface to accommodate the usual range of DRAM, SRAM and ROM devices.

5. To achieve optimal cost performance, the ALP must have features allowing close integration to the FIP and should be optimized for the types of computations needed for the target applications. For the first NAPA processor, we have developed a new ALP architecture that uses a Pipeline Bus Interface (PBI) technique for concurrent data transfers between the ALP and the dedicated function units over a set of 32-bit busses called the Pipeline Bus Array (PBA). This achieves the very high bandwidth needed for effective adaptive computing without the high cost of thousands of package pins.

We believe that specialized hardware is justified in a system for only two reasons: (1) performance improvement compared to a sequential algorithm on a general purpose structure and (2) to satisfy the unique interface requirements of external sensor and actuator devices. The NAPA architecture provides for a very flexible Configurable Input/Output (CIO) capability with high speed integration into any position in the ALP. In the first NAPA processor implementation, as many as 128 package pins may be used for CIO.

The NAPA architecture allows the construction of a wide range of cost effective system add-ons which we hope will make adaptive computing available to a much larger research community which in turn will foster the development of a whole new generation of application algorithms and adaptive computing concepts.

[Figure 1. ToggleBus Based NAPA Processing Cluster: N NAPA data processors (0 to N-1) and a cluster control processor connected through the ToggleBus wiring network, with host, external memory blocks, peripheral/I/O and CIO connections shown.]

3.2 Basic System Structure

For use in very high speed systems, N = 2^k NAPA processing elements are tied together using a shuffle based [9] wiring pattern as illustrated in Figure 1. The resulting ensemble of TBT circuits in the processors forms a MIN with partial control (for an m-bit word, each TBT has an (m-1)-bit switch control vector). Associated with this processor ensemble is a programmable processor (typically a NAPA processor) that acts as the "cluster controller". This (N+1)-th processor generates the ensemble control vector ("C" in the diagram) which defines the data flow type and parameters for each ToggleBus cycle. When N is less than or equal to the word size of the data on the ToggleBus, each cycle requires two system clocks. When N is greater than the number of bits in a data word, each cycle requires three system clocks. Although not as flexible as a crossbar style interconnect structure, the ToggleBus nevertheless implements all of the major regular data flow patterns that normally occur in parallel processing algorithms with a cost that is proportional to N.

In addition to the external memory interface (the "A" and "D" lines in the figure), each NAPA processor provides the "P" and "Q" data ports to support the ToggleBus data flow operations. In a typical ToggleBus cycle data flows out of the P vector and is simultaneously input on the Q vector. Each NAPA processor provides a large number of Configurable Input Output (CIO) pins which allow sensor and actuator devices to be interfaced directly to the adaptive logic array.
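As a back-of-the-envelope illustration of the cycle-count rule (our own numbers, using the 50 MHz system clock and 32-bit ToggleBus word quoted elsewhere in this paper, and an assumed cluster size of N = 16): since 16 is less than the 32-bit word size, each ToggleBus cycle takes two system clocks, so the cluster completes 25 million bus cycles per second. With every processor driving its 32-bit P port on each cycle, the aggregate data movement is roughly

    16 \times 32\,\text{bits} \times 25\times10^{6}\,\text{cycles/s} \approx 1.6\,\text{GB/s}

which is the sense in which bus bandwidth grows in proportion to the number of processors.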

Copyright 1998, the Authors; Submitted for publication at the FPGAs for Custom Computing Machines Conference (FCCM98 a)

3

[Figure 2. NAPA Processor Structure: block diagram showing the FIP with its Core Bus, Local Memory Array (LMA), External Memory Controller (XMC), JTAG interface and peripheral devices; the Reconfigurable Pipeline Controller (RPC) with the Configuration Data bus; the ToggleBus Transceiver (TBT) with the P, Q and C ports; and the ALP with its Pipeline Memory Array (PM0, PM1), Scratchpad Memory Array (SM0-SM7), Configurable I/O pins and the PDA/PDR/PDW, PMR/PMW/PAR/PAW and PTR/PTW busses.]

When a single NAPA processor is used in an application, only the P-vector is needed to interface with a standard system bus. In this case, the Q-vector may also be used for configurable input/output. By using a NAPA processor in this mode, and loading a parallel processor control program into the adaptive logic, a NAPA processor becomes the preferred choice for a cluster controller since no external circuitry is required to create the needed functionality. In this way, a NAPA cluster becomes a very flexible co-processor to a system host or may operate as a fully standalone unit for a specific application.

3.3 Programming Models The tight coupling and the rich communication mechanisms between the computational resources in a single NAPA device can support several different models of computation. In the simplest model, known as the “instruction accelerator”, the ALP appears to the FIP program as a special purpose instruction execution unit. In this model the FIP and the ALP operate fully synchronously under the control of the FIP. When the FIP reaches a point in the application program where the enhanced “instruction” is to be executed, parameters are passed to the ALP together with an “execute” RPIS instruction. The FIP then suspends execution until the ALP completes the operation. A single result value is returned from the ALP to the FIP. The ALP

"instruction" may complete in one clock cycle, or may take several cycles; the FIP remains suspended until completion. A much more effective model is the "deep pipeline" in which the ALP performs a very long series of computations on a large stream of data. In this model the FIP typically initializes the RPC and/or external I/O to stream data through the ALP. Once the stream operation has begun, the FIP is suspended. The ALP can reactivate the FIP upon completion of the operation or an exceptional condition by asserting an interrupt. This model is typically used for sensor/actuator applications where an external device is either supplying or sinking a continuous stream of data to/from the ALP. The final programming model is a more traditional "fork/join" model, where the FIP initializes a computation on the ALP which is then effectively "forked" as a separate thread of control. Once the ALP has started the FIP is free to continue another portion of the computation or to switch tasks altogether. The FIP and ALP threads can re-join through standard synchronization mechanisms such as status flags and interrupts.

4. NAPA Processor Structure As illustrated in Figure 2 each NAPA processor consists of three main subsystems. The 32-bit RISC based FIP on the “left side” uses an on-chip Core Bus (CB) to interact with: (1) associated on-chip program/data memory

Copyright 1998, the Authors; Submitted for publication at the FPGAs for Custom Computing Machines Conference (FCCM98 a)

4

provided by the Local Memory Array (LMA), (2) external ROM, SRAM, and/or DRAM accessed through the External Memory Controller (XMC), and (3) a set of small on-chip peripheral devices including an interrupt controller, timer/watchdog unit and JTAG style debug interface.

The "right side" of the processor is dominated by the computationally oriented ALP which interacts with the rest of the processor using a Pipeline Bus Interface (PBI) which includes nine major busses, pipeline stage synchronization signals and associated control signals. A significant part of the interface signal bandwidth to the ALP is dedicated to two on-chip memory structures which provide data buffering for the high bandwidth computations. The Pipeline Memory Array (PMA) is a two bank SRAM structure allowing two concurrent operations. Each of the eight Scratchpad Memory Array (SMA) blocks may be mapped to arbitrary physical locations in the array allowing the dynamic creation of a wide variety of high speed memory structures. The CIO external pin array may also be mapped dynamically to any position in the array.

The third or "middle" portion of a NAPA processor consists of the ToggleBus Transceiver (TBT), which provides the parallel processing interconnect interface between the ALP and an ensemble wiring network, and the Reconfigurable Pipeline Controller (RPC). The RPC maintains the Configuration Data (CD) bus which is used to read and write the configuration of all on-chip and off-chip memories as well as the adaptive configuration of the ALP and CIO. During normal operation, the RPC executes the Reconfigurable Pipeline Instruction Set (RPIS) which provides the co-processor interface for the FIP and/or any programmable device driving the ToggleBus.

In the first NAPA processor implementation, the PMA memory is 16K-bytes, the LMA memory is 4K-bytes and each of the eight SMA memories is 256 bytes. All of the PBA busses as well as the external memory bus and the ToggleBus are 32-bits wide.

4.1 Adaptive Logic Processor The ALP consists logically of three circuit layers as illustrated in Figure 3. Each core cell in the bottom layer includes a 16-bit configuration register, a D-type flipflop, a three input, two output configurable logic circuit, and configurable switches for selecting the three inputs. Each core cell is capable of implementing as many as 18 gate equivalents (for example, a full adder with accumulator storage bit). The three inputs to the core cell are generally a combination of the outputs of the four nearest neighbors and a “local bus” line input. The two outputs may be used to drive the inputs of any of the four nearest neighbors and drive one of the four local bus

lines which logically traverse each side of the core cell.

[Figure 3. ALP Physical Structure: the core cell array organized into core blocks, with the pipeline control area (PCPs, VCLs) above the pipeline datapath area (PBA), and an example pipeline segment spanning the array vertically.]

Each local bus line spans four core cells. The network of local bus lines forms the lowest level of routing hierarchy and is part of the second logical layer of the ALP which also includes two levels of routing switches with associated data lines that span longer distances in the array. The top logical layer of the ALP implements the Pipeline Bus Interface Lines (PBILs) which traverse the array horizontally, the Vertical Control Lines (VCLs) which traverse the array vertically and the Global Bus Interface Lines (GBILs) which traverse the entire array in both directions. The manner in which the PBIL and GBIL resources are connected to the dedicated function blocks is the essence of an adaptive processing architecture using the ALP.

The middle layers in the ALP array provide the configurable routing hierarchy needed to connect the local bus signals to the top level global interconnect lines. These resources can also be used in conjunction with unused top level wiring to connect any core cell output to any other core cell input with only two buffers in the path. This deterministic routing capability simplifies pipeline segment construction and improves performance. In the first NAPA processor implementation, the ALP uses an array of 6144 core cells (64 columns by 96 rows) and has a total of 1608 global routing lines

Copyright 1998, the Authors; Submitted for publication at the FPGAs for Custom Computing Machines Conference (FCCM98 a)

5

available for external connection. Of these, 192 horizontal GBILs are used for the SMA interface, 256 PBILs are used for the PBA interface, 192 PBILs are used for the input and output connections to the CIO pins, and 128 PBILs are used to implement the PCPs as well as a number of miscellaneous dedicated connections. In the NAPA architecture, the PBA lines physically traverse the "data path" region of the ALP (approximately the lower two-thirds of the rows) while the PCP lines traverse the "control" region (the upper rows of the array).

In the general case, an ALP circuit, called a "pipeline segment", covers a rectangular region of the array extending vertically the entire height of the array but generally with a width which is smaller than the horizontal extent of the array. Each pipeline segment includes a data path which is connected to one or more PBA busses and a control section which connects to the PCP lines which are associated with the PBA busses used by the pipeline segment. In general, several segments may reside in the ALP at one time and any one segment may be reconfigured without affecting the operation of the remaining segments.

The process of changing only a portion of the logic array without affecting the operation of the remaining portion is called "partial reconfiguration". Partial reconfiguration allows the development software to optimize the utilization of the configurable logic resource. By architectural design, pipeline segments are inherently relocatable which means that: (1) a pipeline segment does not have to be finally placed until run-time, (2) pipeline segments always connect to resources external to the ALP using top layer signals (no "pad routing" is needed), (3) a pipeline segment may be dynamically moved in the horizontal direction of the array, (4) pipeline segment generation is significantly simplified (for example, a single pipeline datapath) and (5) powerful run time configuration management schemes are possible. Groups of core cell flip-flops form pipeline registers which may be directly read or modified by the FIP program using RPIS instructions. The middle routing layers include a quick connect feature which allows the FIP program to change the connection of a pipeline segment to a specific PBA bus in one RPIS transaction.

4.2 ALP Pipeline Segment Interface Summary

As shown in Figure 4 the pipeline segment concept is based on the classic model of a data processor: the PBA busses flow through the datapath section, the associated PCP lines interact with the control unit, the control unit commands operations in the datapath using a group of signals usually referred to as the Horizontal Control Word (HCW) and status signals are communicated back to the control using lines usually referred to as the Status Word (SW). In NAPA, the HCW and SW are implemented using the global VCL resource.

[Figure 4. Pipeline Segment Structure: the segment's control unit connects to the Reconfigurable Pipeline Controller through the Pipeline Control Ports and drives the datapath unit through the Horizontal Control Word and Status Word on VCLs; the datapath unit connects through PBILs and GBILs to the FIP memory space (PD), the ToggleBus (PT), the Pipeline Memory Array (PMR, PMW), the Scratchpad Memory Array and the Configurable Input/Output pins.]

The PBA data interface allows concurrent connection of the datapath portion of the ALP to the PMA, TBT, and FIP memory space. Each PCP signal set is associated with one or more PBA busses. In the first NAPA processor implementation there are four PCP ports which allows up to four PBA transfers to be executed (CIO and SMA transfers may also occur simultaneously with PBA transfers). Each PCP control signal set generally includes three sets of ready/request signals used by the RPC to synchronize ALP operations with the operations of the dedicated function units: (1) autonomous requests are data transfers initiated by the ALP, (2) cooperative access requests are initiated by the current Focus Of Control (FOC -- at any time either the FIP or the ToggleBus controller is the focus of control), and (3) function requests which allow the FOC to control processing in the ALP. Each PCP also includes a signal which may be used as a status indicator to the FOC or as a FIP interrupt source. The PCP port groupings are as follows:

1. Pipeline Data (PD) port: data transfers between the ALP and the FOC (cooperative access mode) or any device in the FIP memory space (autonomous request). The FOC controls task execution in the ALP by issuing function requests which include an eight-bit Pipeline Instruction Register (PIR) code and an associated data operand. The ALP pipeline segments are totally responsible for the implementation of these pipeline instructions. The ALP receives data on the PDR bus and outputs data on the PDW bus.

Copyright 1998, the Authors; Submitted for publication at the FPGAs for Custom Computing Machines Conference (FCCM98 a)

6

2. Pipeline ToggleBus (PT) port: Cooperative access is controlled by the cluster controller using the C-vector and each ToggleBus cycle outputs a word on the PTW bus while reading a word on the PTR bus. An autonomous request allows the ALP to use the TBT circuits as a local computation resource. 3. Pipeline Memory Read (PMR) port: input data from the PMA memory on the PMR bus using the PAR address output bus. 4. Pipeline Memory Write (PMW) port: may be configured as an input or output port using the PMW bus for input or output, and the PAW bus as the PMA address.

4.3 Fixed Instruction Processor The NAPA architecture has been carefully developed to accommodate a wide variety of FIP processors including controllers, DSP engines, simple RISC machines as well as highly sophisticated RISC processors having split caches, virtual memory and floating point processing capability. Selecting a FIP which is well balanced with the ALP resource for a range of applications is a challenging problem. For the first NAPA processor implementation, we have chosen to use a very small 32-bit RISC core called the CompactRISC. This machine supports a 50 MHz instruction rate which is more than adequate for ALP configuration and "outer loop" setup and control. This satisfies the need for an "always present" scalar processor. When more scalar processing power is needed (for example, a significant amount of floating point processing) a NAPA processor may be used as a co-processor to a higher speed machine through the ToggleBus.

4.4 ToggleBus The ToggleBus interconnect architecture allows the implementation of a number of styles of parallel processing. A ToggleBus transceiver has two m-bit bidirectional transceivers called the “P” bus and the “Q” bus. Each ToggleBus cycle consists of transferring data out through the “P” lines and receiving the data on the “Q” lines. The common ToggleBus control bus “C” is used to define the type of the bus transaction for all processors. The ToggleBus supports three basic types of data flow patterns and combinations of these patterns. The “broadcast” data flow pattern allows any one processor (including the cluster controller) to transmit data to all of the other processors in one bus transaction. This flow pattern is similar to the capability found on traditional Tristate based bus structures. The broadcast data flow is used to distribute configuration data in a NAPA cluster, initiate specific processing in the data processors by the

direct execution of instructions and for the global distribution of application data. In the MIMD computing style, these instructions essentially initiate the execution of independent tasks. In the SIMD computing style, the instructions typically initiate direct configurable logic pipeline action. The ToggleBus also supports general bit-level and word-level “data rotation” data flow patterns. These are used for both SIMD and MIMD style parallel processing approaches for inter-processor data flow. The rotation data flow patterns may be viewed as an extension of the “ring” style interconnect that has been successfully used in previous systems. The third major type of data flow on the ToggleBus is called “data reflection”. This is the most general and most powerful data flow pattern and allows the implementation of very complex data flow operations such as multiple dimension memory access. Reflection data flow is based on the “Hamming distance” principle. For example, data between two processors j and k may be exchanged in one bus cycle by setting the ToggleBus reflection distance in the C-vector to d = j ⊗ k where “⊗” is the bitwise exclusive-OR of the bits of the j and k integer values. The data flow patterns based on the Hamming distance are an integral aspect of complex algorithms such as the Fast Fourier Transform. In general, ToggleBus cycles are performed using an “associative subset” scheme. The associative processing operations allow the PN processor to select the subset of processors that will respond to subsequent bus cycles. When used as a local computing resource, the TBT can perform arbitrary rotations and reflections, and most of the operations of Feng’s Data Manipulation Function networks [6] on 32-bit words in a single clock cycle.
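A small C sketch of the reflection pattern as described above: only the relation d = j ⊗ k comes from the paper. The program below simply enumerates, for an assumed eight-processor cluster, the single reflection distance that pairs up all processors at each stage of an FFT-style exchange.

    #include <stdint.h>
    #include <stdio.h>

    int main(void)
    {
        const uint32_t n = 8;                     /* assumed cluster size N = 2^k */

        /* At FFT stage s, processor j exchanges with partner j ^ (1 << s); the same
           reflection distance d = 1 << s serves every pair in one ToggleBus cycle. */
        for (uint32_t s = 0; (1u << s) < n; s++) {
            uint32_t d = 1u << s;
            printf("stage %u: reflection distance d = %u\n", s, d);
            for (uint32_t j = 0; j < n; j++)
                printf("  processor %u <-> processor %u (j XOR k = %u)\n",
                       j, j ^ d, j ^ (j ^ d));
        }
        return 0;
    }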

5. NAPA Software Architecture The NAPA software environment covers three main areas: application program development, run time support, and simulation. The application program development tools include application capture, compilation, and physical design. The run time support system enables an application to control and debug the diverse resources of the NAPA architecture. Finally, the simulation environment supports architectural exploration and provides a rich environment for early application and system software development.

5.1 Programming Concepts The NAPA software development environment is layered to effectively abstract the various levels of hardware and software from an application programmer. This eases the implementation and maintenance of a given application. Figure 5 illustrates the layering in the

Copyright 1998, the Authors; Submitted for publication at the FPGAs for Custom Computing Machines Conference (FCCM98 a)

7

software environment. At the lowest level one layer exists for each of the two main compute elements: the FIP and the ALP. Entry to these layers is in the form of ANSI C code for the FIP and structural Verilog code for the ALP. The next layer of abstraction is the NAPA C language and compiler[7]. NAPA C presents a tightly integrated programming model encompassing data storage and computation on both the FIP and the ALP. Finally, an application on the host processor may be developed using the Application Device Interface Layer. An important design goal was to provide for tool interoperability and for iterative refinement of an

application.

[Figure 5. Software Development Layers]

Iterative refinement is the process by which the performance of an algorithm or application is incrementally improved by bringing more resources to bear on the problem. Many applications in NAPA are initially written in standard C and run only on the FIP. Performance analysis tools are used to identify bottlenecks, and these sections are migrated to other resources (such as the ALP) either through annotation with NAPA C constructs or manually. The NAPA tool set allows different parts of an application to be developed using different tools. As the performance of an application is tuned, portions of the application may be re-implemented in a more efficient form using different tools.
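As an illustration of this refinement path, consider a byte-oriented threshold loop. The first version is plain ANSI C and runs only on the FIP; the second shows the kind of annotation a programmer might add so that the loop body is mapped to an ALP pipeline segment. The actual NAPA C constructs are defined in [7]; the pragma spelling below is our own invention and is shown only to convey the idea.

    /* Step 1: portable ANSI C, FIP only. */
    void threshold(const unsigned char *in, unsigned char *out, int n, unsigned char t)
    {
        for (int i = 0; i < n; i++)
            out[i] = (unsigned char)((in[i] > t) ? 255 : 0);
    }

    /* Step 2: after profiling identifies this loop as the bottleneck, annotate it so
       the compiler generates an ALP pipeline segment and streams the data through it.
       The pragma name and clauses are hypothetical placeholders, not NAPA C syntax.  */
    void threshold_alp(const unsigned char *in, unsigned char *out, int n, unsigned char t)
    {
        #pragma napa alp stream(in, out)
        for (int i = 0; i < n; i++)
            out[i] = (unsigned char)((in[i] > t) ? 255 : 0);
    }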

5.2 Compilation The task of programming the NAPA system consists of specifying concurrent, cooperative computation on each of the available computational resources. Figure 6 shows a top level view of the development tool flow for the NAPA architecture. Entry into the flow is either through the NAPA C Compiler or directly into the lower level compilers for the FIP and the ALP.

The primary tool for application entry is the NAPA C compiler. NAPA C constructs allow the programmer to explicitly map data and computations to either the FIP or the ALP. The NAPA C compiler generates a conventional C program that contains portions of the computation assigned to the FIP as well as C code to control circuits generated for the ALP. Through the MARGE datapath synthesis tool, the NAPA C compiler also generates, for the computation mapped to the ALP, a Verilog netlist that utilizes highly optimized datapath component generators.

The programmer may also select from lower level alternative tools to program the FIP and the ALP. These tools include an ANSI C compiler (gcc based) and assembler for the FIP, and an assortment of commercial and custom CAD tools for the ALP ranging from behavioral synthesis to ALP-specific datapath configuration and layout.

Whether the source is the output of the NAPA C Compiler or is explicitly provided by the programmer, the input to the FIP Compiler is ANSI C. The FIP program contains calls to the NAPA run time system to handle a wide variety of system initialization and control functions. These calls are resolved by the FIP compiler into either calls to library routines or an active run time system (resident monitor).

The input to the ALP Compiler is a structural Verilog netlist expressed as references to primitive core cell functions and parameterized datapath module generators. The ALP Compiler expands the calls to the module generators and performs the physical layout of the design. The output of the ALP Compiler is a wordstream containing configuration data and symbolic information.

[Figure 6. Application Development Tool Flow]

Copyright 1998, the Authors; Submitted for publication at the FPGAs for Custom Computing Machines Conference (FCCM98 a)

8

The Linker is responsible for assembling the executable image from the FIP object module and ALP wordstream. In addition to generating a single object file, the Linker resolves symbolic references in the FIP object module to ALP pipeline register locations. The output is an executable image in the Common Object File Format (COFF).

5.3 Physical Design The role of the ALP Compiler is the generation of the ALP configuration word stream for one or more pipeline segments. The structure of the ALP Compiler is shown in Figure 7. The input to the ALP Compiler is a structural Verilog specification for datapath primitives, higher level ALP subroutine calls and synthesized control logic. The Datapath Generator produces a list of parameterized function calls in the D4* function generator language and defines a linear order and placement of these functions within the ALP. D4* is a C-like imperative language designed to express parameterizable function generators. The function generator is responsible for creating relocatable datapath blocks with the word size, speed and geometric options specified in the reference parameters. The D4* system is referenced by the datapath generator to check the validity of the parameters and place the primitives of the block according to the construction algorithm specified in the D4* based function generator file. The D4* system then performs a low level route resulting in completely placed and routed function blocks for use by the datapath generator. The D4* system includes a drawing interface and logic simulator for use in function generator development. Since the datapath modules produced by the function generator are already placed and routed, and the datapath generator specifies the ordering and placement of these modules within the ALP, the physical design task is greatly simplified. The automatic placer is primarily responsible for placing the logic primitives of the control section of each pipeline segment. The router is used to connect the individual cells of the control section and to connect the control to the datapath. An interactive, graphical tool, ALPEdit, is also available to view and edit the physical design. The output of the physical design process is a wordstream containing configuration data for the ALP and the CIO as well as symbolic information needed by the linker and run time debugger.
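To make the hand-off between the datapath generator and the D4* system concrete, the C sketch below shows the kind of information that passes between them: an ordered list of parameterized function generator calls with a word size and a placement position for each block. This is only a schematic illustration; the real interface is expressed in the D4* language and none of the names below are taken from it.

    #include <stdio.h>

    /* One parameterized datapath block in a pipeline segment (illustrative only). */
    struct dp_block {
        const char *generator;    /* which function generator to invoke        */
        int         width_bits;   /* datapath word size passed as a parameter  */
        int         column;       /* linear placement of the block in the ALP  */
    };

    int main(void)
    {
        /* A hypothetical 16-bit multiply-accumulate segment, listed in placement order. */
        struct dp_block segment[] = {
            { "multiplier",  16, 0 },
            { "accumulator", 32, 8 },
            { "register",    32, 12 },
        };

        for (unsigned i = 0; i < sizeof segment / sizeof segment[0]; i++)
            printf("%-12s width=%2d bits  column=%d\n",
                   segment[i].generator, segment[i].width_bits, segment[i].column);
        return 0;
    }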

5.4 Run Time System The run time system is an integral part of the NAPA programming environment, as it is responsible for the management of the diverse and complex resources

available to the NAPA programmer.

[Figure 7. ALP Compiler Structure]

In the simplest embedded systems, the run time code is linked directly into the application program and loaded at boot time. In more complex systems that run more than one application, a portion of the run time system remains resident on a NAPA processor. In a NAPA cluster the cluster elements run a simplified monitor that communicates with the run time system on the cluster controller. At system boot time the cluster controller can load the monitor code into the individual elements before the cluster element's local FIP is allowed to boot. The NAPA run time system performs a wide range of services for the application, including:

1. Management of the FIP and related peripherals. This includes initialization of various control registers, interrupt management and control of the timer and watchdog devices for real time applications;

2. ALP configuration management, including final placement of relocatable pipeline segments, loading configuration wordstreams, and dynamic reconfiguration control;

3. RPC and ALP control, including ALP reset, pipeline register access, starting and stopping the clocks, and data stream (DMA) setup and control;

4. External host and debugger communication.

An external software debugger can access a NAPA processor through one of two interfaces: a dedicated JTAG port or the ToggleBus interface. The FIP contains hardware support for breakpoints, instruction tracing and single stepping. The ALP support includes the ability to start, stop, and single step the clock to the entire ALP or to any subset of columns. The complete configuration space is available

Copyright 1998, the Authors; Submitted for publication at the FPGAs for Custom Computing Machines Conference (FCCM98 a)

9

to be read and written by either the FIP or an external host. Likewise, all on-board memory structures may also be read and written.
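A minimal sketch of how an application might use these services, with hypothetical function names standing in for the actual NAPA run time interface (only the services themselves are taken from the list above):

    #include <stdint.h>

    /* Hypothetical run time entry points (names are illustrative only). */
    extern int      napa_rt_load_segment(const void *wordstream);     /* final placement + configuration */
    extern void     napa_rt_start_clock(int segment);
    extern void     napa_rt_stop_clock(int segment);
    extern uint32_t napa_rt_read_pipeline_reg(int segment, int reg);

    extern const unsigned char filter_wordstream[];   /* produced by the ALP Compiler and Linker */

    void run_filter(void)
    {
        /* Relocatable segments are placed at run time, then clocked and observed. */
        int seg = napa_rt_load_segment(filter_wordstream);

        napa_rt_start_clock(seg);
        /* ... set up data streaming, wait for a completion interrupt ... */
        napa_rt_stop_clock(seg);

        /* Pipeline registers can be read back for results or debugging. */
        uint32_t r = napa_rt_read_pipeline_reg(seg, 0);
        (void)r;
    }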

5.5 Simulation A rich simulation environment was developed to serve a variety of roles, including: architectural exploration and evaluation; early benchmark and application development; programming tool and run time system development; and logic verification. The NAPA architecture simulator, NAPAsim, consists of a C language, cycle accurate model of the RISC core, peripherals and memories, coupled with an event driven logic simulator for modelling the user-defined contents of the reconfigurable logic. The front end of NAPAsim is a Tcl/Tk based GUI which provides source level symbolic debugging capabilities. NAPAsim supports application debugging and performance tuning through an extensive set of profiling and run time analysis tools[2].

Acknowledgments This work was supported in part by DARPA through Contract DAB63-94-C-0085 to National Semiconductor Corporation.

References

[1] R. Amerson et al., "Teramac - Configurable Custom Computing," in P. Athanas and K. Pocek, eds., Proceedings of IEEE Symposium on FPGAs for Custom Computing Machines, Napa, CA, April 1995, pp. 32-38.
[2] J. M. Arnold, "An Architecture Simulator for the NAPA1000," submitted to FCCM98.
[3] P. Bertin, D. Roncin and J. Vuillemin, "Programmable Active Memories: A Performance Assessment," in G. Borriello and C. Ebeling, eds., Research on Integrated Systems: Proceedings of the '93 Symposium, MIT Press, Cambridge, Mass., 1993, pp. 88-102.
[4] B. Box, "Field Programmable Gate Array Based Reconfigurable Preprocessor," in D. Buell and K. Pocek, eds., Proceedings of IEEE Symposium on FPGAs for Custom Computing Machines, Napa, CA, April 1994, pp. 40-49.
[5] D. A. Buell, J. M. Arnold and W. J. Kleinfelder, Splash 2: FPGAs in a Custom Computing Machine, IEEE CS Press, Los Alamitos, CA, 1996.
[6] T. Y. Feng, "Data Manipulation Functions in Parallel Processors and Their Implementations," IEEE Trans. Computers, March 1974, pp. 309-318.
[7] M. B. Gokhale and J. M. Stone, "NAPA C: Compiling for a Hybrid RISC/FPGA Architecture," submitted to FCCM98.
[8] J. R. Hauser and J. Wawrzynek, "Garp: A MIPS Processor with a Reconfigurable Coprocessor," in J. Arnold and K. Pocek, eds., Proceedings of IEEE Symposium on FPGAs for Custom Computing Machines, Napa, CA, April 1997, pp. 12-21.
[9] K. Hwang, Advanced Computer Architecture: Parallelism, Scalability, Programmability, McGraw-Hill Inc., New York, 1993, ch. 2.
[10] E. Waingold et al., "Baring it All to Software: Raw Machines," IEEE Computer, Sept. 1997, pp. 86-93.
[11] R. D. Wittig and P. Chow, "OneChip: An FPGA Processor with Reconfigurable Logic," in J. Arnold and K. Pocek, eds., Proceedings of IEEE Symposium on FPGAs for Custom Computing Machines, Napa, CA, April 1996, pp. 126-135.
[12] C. L. Wu and T. Y. Feng, "On a Class of Multistage Interconnection Networks," IEEE Trans. Computers, Aug. 1980, pp. 696-702.

Copyright 1998, the Authors; Submitted for publication at the FPGAs for Custom Computing Machines Conference (FCCM98 a)

10