SMASH: A Program for Scheduling Memory-Intensive Application-Specific Hardware*

Pravil Gupta and Alice C. Parker

Electrical Engineering – Systems
University of Southern California
Los Angeles, CA 90089-2562

Abstract

The research described in this paper addresses automatic synthesis of memory-intensive application-specific systems, with emphasis on hierarchical storage architecture design. SMASH is a program which combines storage hierarchy design with data path synthesis. It uses appropriate system parameters in order to coordinate between the synthesis of different subarchitectures of the system and schedules data transfers between them. We synthesized some design examples using SMASH and have included the results in this paper.

1 Introduction

Our goal is to automate the design of datapaths and storage architectures for memory-intensive systems. The storage architecture is closely connected to the system datapath, and isolating its synthesis from datapath synthesis may not result in an efficient solution. Our tools design application-specific systems, where the memory-access pattern is not only relatively fixed but also known beforehand. This mostly deterministic access characteristic makes it straightforward to automate the memory design process.

SMASH designs a hierarchical storage system concurrently with the datapath and also determines the input/output data-transfer schedule between various hierarchies and the datapath, as the datapath itself is scheduled. The need for such a combined synthesis is illustrated in Figure 1. In Figure 1 three schedules are shown, first unconstrained, the second with a constraint on memory ports and the third with an additional constraint on data-transfer bandwidth (one data transfer per cycle). The required buffer size decreases from 3 to 2 as the number of ports is constrained. In general, the increase in schedule length due to data transfers may not result in execution delay if the transfers overlap the execution of other parts of the CDFG.

[Figure 1: An example showing memory-related scheduling issues. Panels: (a) datapath schedule; (b) datapath schedule with 2 read ports; (c) datapath schedule with 2 read ports and data prefetching.]

*This work was supported by the Advanced Research Projects Agency and monitored by the Federal Bureau of Investigation under contract No. J-FBI-90-092. The views and conclusions contained in this document are those of the authors and should not be interpreted as necessarily representing the official policies, either expressed or implied, of the Advanced Research Projects Agency or the U.S. Government.

0-8186-5785-5/94 $03.00 1994 IEEE

2 Problem statement

An overview of the SMASH program is shown in Figure 3. The inputs are

• the behavioral VHDL description of a memory-intensive application-specific system, which may contain bounded inner loops and conditional branches. Loop structures, arrays and indexed references are assumed to be transformed and optimized;

• the module library consisting of (i) functional modules (e.g. adders), with each module characterized by its area, delay and bitwidth, and (ii) storage modules (e.g. registers, single-port/multiport register files, single-port/multiport RAMs) characterized by cost per word, number of ports, access time and storage capacity;

• area-performance constraints;

• the clock cycle, which is the duration of each control step in the datapath;

• external input/output timing constraints;

• memory bandwidth constraints (the number of words that can be transferred onto the chip in one control step).

Our synthesis system produces a two-chip system with a datapath consisting of operators and operation schedule, size and port configuration for on-chip foreground memory to store input/output and intermediate variables, data-transfer schedule between the datapath and on-chip memory, size and port configuration of off-chip background memory for bulk storage, and data-transfer schedule between the on-chip and off-chip memory.

3 Related research

The original MIMOLA system was the first system to make tradeoffs in the use of multiport memories [7]. Balakrishnan et al. [2] presented an approach to use multiport memories to implement single isolated registers. Chen [3] explored the design space for multiport memory synthesis. Ahmad and Chen [1] use 0-1 integer-linear programming to group intermediate variables in the datapath into a minimum number of multiport memories depending on their ports and their access pattern. Stok [11] optimizes register files during the synthesis process. Grant et al. [5] suggested an approach to group the memory requirements of various operators using single-port memory modules such that control and communications may be optimized.

Lippens et al. [6] described automatic memory allocation and address allocation for high speed applications without conditional branches. They synthesize memory after the datapath scheduling and allocation. They model multi-dimensional periodic signals as data streams and then manipulate and distribute these streams to form a distributed memory structure. They allow only 1 and 2 port memories. They do not consider storage hierarchies.

IMEC's CATHEDRAL-II [12] compiles multidimensional data structures into distributed dual-port register files and single-port SRAMs. They use a polyhedral-based model for high-level memory management for linear, piecewise linear and data dependent signal indexing [4]. Although other researchers have published results on transformations to reduce memory costs, we do not report them here since the objective of this specific software is to perform synthesis after the high-level memory management transformation steps have been applied.

4 Target architecture

The target system architecture consists of a datapath and a hierarchical storage architecture (Figure 2) as described below. Although the top level target architecture is fixed, there are a large number of tradeoffs possible. The memories can degenerate into a simple set of wires if they are not needed.

[Figure 2: Target architecture.]

The on-chip foreground memory consists of I/O buffers which interface to the off-chip memory and datapath memory. The user specifies the bandwidth between the buffers and the off-chip memory (BW_on-off), which is the number of inputs/outputs that can be transferred from/to the off-chip memory in one control step. The synthesis software determines (i) total buffer size, which is determined by the maximum number of inputs/outputs stored in the buffers in any given step, and (ii) the number of read/write ports accessible to the datapath, R_buf/W_buf, which is the maximum number of inputs/outputs the datapath accesses in parallel in any given step. Datapath memory stores the intermediate variables in the datapath. Other researchers [2, 3, 11] have studied tradeoffs among [...] RAMs, hence such tradeoffs are not described in this paper.

All the I/O data values from/to the external world are stored in the off-chip background memory, which is generally large and inexpensive, just as in a general purpose computer. The user determines the number of read/write ports on the off-chip memory (BW_on-off) and the synthesis software determines the off-chip memory size.

The required bandwidth between the on-chip and the off-chip memory imposes cost constraints on the overall design because of the pin requirements on the chips, and the expense of having multiport memory for bulk storage. Furthermore, the access time for the off-chip background memory may be greater because it is off chip, and is bigger in size. As a result, the software schedules the prefetch of the input variables from off-chip memory into the I/O buffer of the on-chip memory before they are required. In each control step the required data is loaded onto the buffer ports and, as the datapath reads data from the buffers, the controller moves new data to them from the off-chip memory for further processing. Similarly, the output variables are first stored in the on-chip memory and then are transferred to off-chip memory when they are no longer needed or if they can be refetched easily, overlapped with the datapath execution.

5 Design methodology

The major design steps of our methodology are shown in Figure 3. The SMASH software implements the following two steps (highlighted in the figure): First, datapath synthesis with operation scheduling is performed combined with scheduling of on-chip data transfers to/from I/O buffers. As a result of this scheduling, constraints are placed on the memory structure. The second step, the I/O transfer scheduling, includes determining the data transfers between the two levels of the memory hierarchy. We ensure that the first step of the stepwise construction of the system takes into account the second step by looking ahead so that the second step is not overly constrained. Global design parameters like BW_on-off and timing constraints are considered when constructing the partial design in each step.

[Figure 3: Design steps in SMASH.]

5.1 Design tradeoffs used in SMASH storage architecture

Three parameters can vary while making the cost-performance tradeoff: (i) number of ports, which determines BW_on-off, (ii) size of the storage architecture, and (iii) execution time. To deal with this 3-way tradeoff, we iterate on BW_on-off by repeatedly invoking the synthesis software. For each user-specified BW_on-off, the software trades off between the size and the execution time. The software can trade off between (i) more clock cycles (delay the execution) or larger buffer size (prefetch the data before it is required and store it); (ii) more ports (fetch more data with a wider bandwidth) or more clock cycles (fetch more data with extra cycles); and (iii) more ports (retrieve them repeatedly whenever needed with wider BW_on-off specified by the user) or larger buffer size (save them for future use, which will increase the I/O buffer size) for data values which are used again. This can be done efficiently because the number of ports is usually small.

5.2 Step 1: Combining datapath scheduling with I/O accesses

In step 1, the software determines the number of functional units of each type required for the design and the scheduling of all the operations in the CDFG to appropriate control steps. It also outputs the percentage utilization of each type of module in each step, which helps us determine the quality of the schedule.

We combine the scheduling of I/O reads/writes from/to the on-chip buffers with the scheduling of operators by inserting a read/write (R/W) node, which corresponds to every I/O access to the buffers, into the CDFG. The read node consists of two inputs: (i) array name, and (ii) index, and one output: value. The write node has three inputs: (i) array name, (ii) index, and (iii) value, and one output: array (modified). During the datapath scheduling step, whenever an operator involving I/O is scheduled, the corresponding R/W node is also scheduled to the same step, implying an I/O buffer access in that step. All memory-related issues are considered during this scheduling by checking if

1. the I/O buffers are able to provide the required number of R/W ports in that step, and

2. the input data, if required, could be prefetched into the buffers from the off-chip memory prior to that step and the output data, if any, could be transferred back to the off-chip memory before it was required elsewhere.

Note that the data transfer is overlapped with the datapath execution: [...] actual data transfer is scheduled, the software may not be able to make all the required [...] time.
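As an illustration of the quantities discussed above (buffer size, buffer read ports, and the on-chip/off-chip bandwidth), the following Python sketch derives them from a small read schedule. It is not the SMASH algorithm. The as-late-as-possible greedy prefetch, the convention that a value occupies the buffer from its transfer step until the step in which it is read, and all names (`schedule_prefetch`, `buffer_size`, `read_ports`, `read_step`, `bw_on_off`) are assumptions made for this example only.

```python
from collections import Counter


def schedule_prefetch(read_step, bw_on_off):
    """Assign each value an off-chip -> buffer transfer step.

    read_step maps a value name to the control step in which the datapath
    reads it. Each transfer must finish before that step, and at most
    bw_on_off transfers may share a step. Transfers are placed as late as
    possible (an assumed policy, chosen to keep values in the buffer briefly).
    """
    load = Counter()          # transfers already placed in each step
    transfer_step = {}
    # Handle the latest reads first so each one gets the latest free slot.
    for value, step in sorted(read_step.items(), key=lambda kv: -kv[1]):
        t = step - 1
        while load[t] >= bw_on_off:
            t -= 1            # slot full: prefetch one step earlier
        load[t] += 1
        transfer_step[value] = t
    return transfer_step


def buffer_size(read_step, transfer_step):
    """Maximum number of values held in the buffer at the end of any step."""
    first = min(transfer_step.values())
    last = max(read_step.values())
    return max(
        sum(1 for v in read_step if transfer_step[v] <= k < read_step[v])
        for k in range(first, last)
    )


def read_ports(read_step):
    """Maximum number of values the datapath reads in parallel in one step."""
    return max(Counter(read_step.values()).values())


# Hypothetical schedule: A, B and C are read in step 1, D in step 2.
reads = {"A": 1, "B": 1, "C": 1, "D": 2}
transfers = schedule_prefetch(reads, bw_on_off=1)
print(transfers)
print(buffer_size(reads, transfers))   # 3
print(read_ports(reads))               # 3
```

In this sketch a transfer step of 0 or less means the transfer takes place before the first datapath step, so it adds start-up cycles unless it can be overlapped with other work. Raising `bw_on_off` lets the same reads be served with fewer such early transfers, which is the ports-versus-cycles tradeoff described in Section 5.1.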