SMASH: A Program for Scheduling Memory-Intensive Application-Specific Hardware*
Pravil Gupta and Alice C. Parker
Electrical Engineering - Systems
University of Southern California
Los Angeles, CA 90089-2562
Abstract
The research described in this paper addresses automatic synthesis of memory-intensive application-specific systems, with emphasis on hierarchical storage architecture design. SMASH is a program which combines storage hierarchy design with data path synthesis. It uses appropriate system parameters in order to coordinate between the synthesis of different subarchitectures of the system and schedules data transfers between them. We synthesized some design examples using SMASH and have included the results in this paper.

1 Introduction

Our goal is to automate the design of datapaths and storage architectures for memory-intensive systems. The storage architecture is closely connected to the system datapath, and isolating its synthesis from datapath synthesis may not result in an efficient solution. Our tools design application-specific systems, where the memory-access pattern is not only relatively fixed but also known beforehand. This mostly deterministic access characteristic makes it straightforward to automate the memory design process.

SMASH designs a hierarchical storage system concurrently with the datapath and also determines the input/output data-transfer schedule between various hierarchies and the datapath, as the datapath itself is scheduled. The need for such a combined synthesis is illustrated in Figure 1. In Figure 1 three schedules are shown, the first unconstrained, the second with a constraint on memory ports and the third with an additional constraint on data-transfer bandwidth (one data transfer/cycle). The required buffer size decreases from 3 to 2 as the number of ports is constrained. In general, the increase in schedule length due to data transfers may not result in execution delay if the transfers overlap the execution of other parts of the CDFG.

Figure 1: An example showing memory-related scheduling issues. a. Datapath schedule. b. Datapath schedule with 2 read ports. c. Datapath schedule with 2 read ports and data prefetching.

*This work was supported by the Advanced Research Projects Agency and monitored by the Federal Bureau of Investigation under contract No. JFBI90092. The views and conclusions considered in this document are those of the authors and should not be interpreted as necessarily representing the official policies, either expressed or implied, of the Advanced Research Projects Agency or the U.S. Government.

0-8186-5785-5/94 $03.00 © 1994 IEEE

2 Problem statement

An overview of the SMASH program is shown in Figure 3. The inputs are:

● the behavioral VHDL description of a memory-intensive application-specific system, which may contain bounded inner loops and conditional branches. Loop structures, arrays and indexed references are assumed to be transformed and optimized.
● the module library consisting of (i) functional modules (e.g. adders) with each module characterized by its area, delay and bitwidth, and (ii) storage modules (e.g. registers, register-files, single-port/multiport RAMs) characterized by cost per word, number of ports, access time and storage capacity;
● area-performance constraints;
● the clock cycle, which is the duration of each control step in the datapath;
● external input/output timing constraints;
● memory bandwidth constraints (the number of words that can be transferred onto the chip in one control step).

Our synthesis system produces a two-chip system with a datapath consisting of operators and operation schedule, size and port configuration for on-chip foreground memory to store input/output and intermediate variables, data-transfer schedule between the datapath and on-chip memory, size and port configuration of off-chip background memory for bulk storage, and data-transfer schedule between the on-chip and off-chip memory.

3 Related research

The original MIMOLA system was the first system to make tradeoffs in the use of multiport memories [7]. Balakrishnan et al. [2] presented an approach to use multi-port memories to implement single isolated registers. Chen [3] explored the design space for multiport memory synthesis. Ahmad and Chen [1] use 0-1 integer-linear programming to group intermediate variables in the datapath into a minimum number of multiport memories depending on their ports and their access pattern. Stok [11] optimizes register files during the synthesis process. Grant et al. [5] suggested an approach to group the memory requirements of various operators using single-port memory modules such that control and communications may be optimized.

Lippens et al. [6] described automatic memory allocation and address allocation for high speed applications without conditional branches. They synthesize memory after the datapath scheduling and allocation. They model multi-dimensional periodic signals as data streams and then manipulate and distribute these streams to form a distributed memory structure. They allow only 1 and 2 port memories. They do not consider storage hierarchies. IMEC's CATHEDRAL-II [12] compiles multidimensional data structures into distributed dual-port register files and single-port SRAMs. They use a polyhedral-based model for high-level memory management for linear, piecewise linear and data dependent signal indexing [4]. Although other researchers have published results on transformations to reduce memory costs, we do not report them here since the objective of this specific software is to perform synthesis after the high-level memory management transformation steps have been applied.

4 Target architecture

The target system architecture consists of a datapath and a hierarchical storage architecture (Figure 2) as described below. Although the top level target architecture is fixed, there are a large number of tradeoffs possible. The memories can degenerate into a simple set of wires if they are not needed.

Figure 2: Target architecture.

The on-chip foreground memory consists of I/O buffers which interface to the off-chip memory and datapath memory. The user specifies the bandwidth between the buffers and the off-chip memory (BW_on-off), which is the number of inputs/outputs that can be transferred from/to the off-chip memory in one control step. The synthesis software determines (i) total buffer size, which is determined by the maximum number of inputs/outputs stored in the buffers in any given step, and (ii) the number of read/write ports accessible to the datapath R_buf/W_buf, which is the maximum number of inputs/outputs the datapath accesses in any given step. Datapath memory stores the intermediate variables in the datapath. Other researchers [2, 3, 11] have studied tradeoffs among parallel RAMs in datapath memory, hence such tradeoffs are not described in this paper.

All the I/O data values from/to the external world are stored in the off-chip background memory, which is generally large and inexpensive, just as in a general purpose computer. The user determines the number of read/write ports on the off-chip memory (BW_on-off) and the synthesis software determines the off-chip memory size.

The required bandwidth between the on-chip and the off-chip memory imposes cost constraints on the overall design because of the pin requirements on the chips, and the expense of having multiport memory for off-chip bulk storage. Furthermore, the access time for the off-chip background memory may be greater because it is off chip, and is bigger in size. As a result, the software schedules the prefetch of the input variables from off-chip memory into the I/O buffer of the on-chip memory before they are required. In each control step the required data is loaded onto the buffer ports and as the datapath reads data from the buffers, the controller moves new data to them from the off-chip memory for further processing. Similarly, the output variables are first stored in the on-chip memory and then are transferred to off-chip memory when they are no longer needed or if they can be refetched easily, overlapped with the datapath execution.

5 Design methodology

The major design steps of our methodology are shown in Figure 3. The SMASH software implements the following two steps (highlighted in the figure): First, datapath synthesis with operation scheduling is performed combined with scheduling of on-chip data transfers to/from I/O buffers. As a result of this scheduling, constraints are placed on the memory structure. The second step, the I/O transfer scheduling, includes determining the data transfers between the two levels of the memory hierarchy. We ensure that the first step of the stepwise construction of the system takes into account the second step by looking ahead so that the second step is not overly constrained. Global design parameters like BW_on-off and timing constraints are considered when constructing the partial design in each step.

Figure 3: Design steps in SMASH.

5.1 Design tradeoffs used in SMASH

Three parameters can vary while making the cost-performance tradeoff: (i) number of ports, which determines BW_on-off, (ii) size of the storage architecture, and (iii) execution time. To deal with this 3-way tradeoff, we iterate on BW_on-off by repeatedly invoking the synthesis software. For each user-specified BW_on-off, the software trades off between the size of the storage architecture and the execution time. The software can trade off between (i) more clock cycles (delay the execution) or larger buffer size (prefetch the data before it is required and store it); (ii) more ports (fetch more data with a wider bandwidth) or more clock cycles (fetch more data with extra cycles); and (iii) more ports (retrieve them repeatedly whenever needed with wider BW_on-off specified by the user) or larger buffer size (save them for future use which will increase the I/O buffer size) for data values which are used again. This can be done efficiently because the number of ports is usually small.

5.2 Step 1: Combining datapath scheduling with I/O accesses

In step 1, the software determines the number of functional units of each type required for the design and the scheduling of all the operations in the CDFG to appropriate control steps. It also outputs the percentage utilization of each type of module in each step, which helps us determine the quality of the schedule. We combine the scheduling of operators with the scheduling of I/O reads/writes from/to the on-chip buffers by inserting a read/write (R/W) node which corresponds to every I/O access to the buffers into the CDFG. The read node consists of two inputs: (i) array name, and (ii) index and one output: value. The write node has three inputs: (i) array name, (ii) index, and (iii) value and one output: array (modified). During the datapath scheduling step, whenever an operator involving I/O is scheduled, the corresponding R/W node is also scheduled to the same step, implying an I/O buffer access in that step. All memory-related issues are considered during this scheduling by checking if

1. the I/O buffers are able to provide the required number of R/W ports in that step, and
2. the input data, if required, could be prefetched into the buffers from the off-chip memory prior to that step and the output data, if any, could be transferred back to the off-chip memory before it was required elsewhere.

Note that the data transfer is overlapped with the datapath execution: [...]tual data transfer is scheduled, the software may not be able to make all the required [...] time.
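
The two checks applied during step 1 (enough R/W ports on the I/O buffers in a given step, and input data prefetchable from off-chip memory before the step that reads it) can be illustrated with a small sketch. This is not the SMASH algorithm: the function name check_step1, its arguments, and the greedy as-late-as-possible placement of transfers are all assumptions made for illustration. The sketch further assumes that each input is read exactly once, that a transfer occupies one control step and must complete in a step strictly before the read, that at most bw_on_off values can be transferred per step, and that a value occupies one buffer word from the step after its transfer through the step in which it is read.

```python
from collections import defaultdict

def check_step1(reads_per_step, read_ports, bw_on_off, first_transfer_step=0):
    """Hypothetical feasibility check for a datapath schedule.

    reads_per_step: dict mapping control step -> list of input names read then.
    read_ports:     number of buffer read ports available to the datapath.
    bw_on_off:      values transferable from off-chip memory per control step.
    Returns (feasible, transfer_step_by_name, buffer_size).
    """
    # Check 1: the buffers must offer enough read ports in every step.
    if any(len(names) > read_ports for names in reads_per_step.values()):
        return False, {}, 0

    # Check 2: place each prefetch as late as possible, strictly before its read,
    # without exceeding the per-step transfer bandwidth (assumed greedy policy).
    used = defaultdict(int)   # transfers already placed in each step
    transfer_step = {}        # input name -> step of its off-chip transfer
    for step in sorted(reads_per_step, reverse=True):
        for name in reads_per_step[step]:
            t = step - 1
            while t >= first_transfer_step and used[t] >= bw_on_off:
                t -= 1
            if t < first_transfer_step:
                return False, {}, 0   # no free transfer slot before the read
            used[t] += 1
            transfer_step[name] = t

    # Buffer size: maximum number of prefetched values resident in any step.
    read_step = {n: s for s, names in reads_per_step.items() for n in names}
    occupancy = defaultdict(int)
    for name, t in transfer_step.items():
        for s in range(t + 1, read_step[name] + 1):
            occupancy[s] += 1
    return True, transfer_step, max(occupancy.values(), default=0)

# Inputs a and b are read in step 2, c in step 3; one transfer per step.
feasible, transfers, buffer_size = check_step1(
    {2: ["a", "b"], 3: ["c"]}, read_ports=2, bw_on_off=1)
```

In this example b has to be fetched one step earlier than a because only one value can be transferred per step, so two buffer words are occupied in step 2. Reducing read_ports to 1 makes the first check fail, which corresponds to the case where the scheduler would have to delay one of the reads.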
1(s)