DISC: DYNAMIC INSTRUCTION STREAM COMPUTER

Dr. Mario Daniel Nemirovsky
Apple Computer Corporation

Drs. Forrest Brewer and Roger C. Wood
Electrical and Computer Engineering Department
University of California, Santa Barbara

ABSTRACT

This paper applies a form of instruction stream interleaving to the problem of high performance real-time systems. Such systems are characterized by high bandwidth, stochastically occurring interrupts as well as high throughput requirements. The DISC computer is based on dynamic interleaving, where the next instruction to be executed is dynamically selected from several possible simultaneously active streams. Each stream context is stored internally, making possible active task switching in a single instruction cycle. For several RTS applications the DISC concept promises higher computation throughput at lower cost than is possible on contemporary RISC processors. Implementation and register organization details are presented as well as simulation results.
1.0 INTRODUCTION

This paper describes an architecture concept and implementation which is specifically oriented for use in real time controller systems (RTS). Such systems are characterized by various priority hard and soft deadlines for completion of tasks and efficient interaction of the processor and several peripherals running at vastly differing data rates. Ideally, system deadlines can all be met by the computation engine in all cases. However, reasonable provisions must be made for graceful degradation of low priority tasks in exceptional circumstances. Another characteristic of such systems is the notion that the worst case delays of the system must fall within the critical timing constraints. It is of no use for the average performance to meet these requirements, as the system may incur permanent damage if these constraints are not met. Many present applications of micro-controllers are in relatively low-end applications where meeting these requirements is simple even for slow microprocessors. These uses have not pushed the technology significantly except in methods for lowering the costs. Recently, however, there has been a rapid increase in the complexity of control systems as mechanical and computer integrated manufacturing systems have become common. Other real time systems occur in the automotive industry and airplane control systems. In these newer applications, it is questionable whether conventional architectures provide a cost effective computation engine solution. We propose an architectural concept for construction of real time system controllers which promises significantly higher performance than conventional approaches for modest increases in cost.

Real time systems provide different constraints for the system architect than do conventional systems. In particular, externally derived deadlines from the controlled system produce widely varying computational loads on the controller, as it must respond to these external requests and interrupts in a specified amount of time. In this work, we are considering the deadline times to be from microseconds to milliseconds, common to conventional microprocessor system controllers. In these systems, I/O timing constraints become a primary issue. Often the data required is generated by a sensor on a time scale much slower than the operation of the processor. On the other hand, keeping the data current is much desired, which prevents caching or queueing earlier data values. In these cases, it is difficult to make use of the processor idle time while it is awaiting new data, due to the overhead required to change program context. Interrupt processing is also very important in RTS, to help alleviate overhead due to polling and to insure quick responses to exceptional or critical deadlines. For this reason, interrupt latency (a measure of the time to respond to an interrupt signal) is an important performance measure for real time control systems. If the control system is complex and dynamic rescheduling is required, then there must be provisions for rapid context switching as processes are started and stopped as required by the system. Finally, it has been shown [1] that if the processor throughput can be partitioned arbitrarily among the executing processes, scheduling which is in some senses optimal can be achieved. This throughput partitioning must be done with very low overhead so as to not compete with the processing tasks themselves.
Permission to copy without fee all or part of this material is granted provided that the copies are not made or distributed for direct commercial advantage, the ACM copyright notice and the title of the publication and its date appear, and notice is given that copying is by permission of the Association for Computing Machinery. To copy otherwise, or to republish, requires a fee and/or specific permission.

© 1991 ACM 0-89791-460-0/91/0011/0163 $1.50

2.0 PREVIOUS WORK

Previous work in architectures for real time controller systems involves a few specialized architectures and several specially modified current microprocessors. Although digital
signal processing (DSP) chips are often used in real time systems, they are usually used as an auxiliary processor, as their specialized architectures do not perform more general (non-numeric) processing efficiently. In addition, the large parallelism and register set size of DSPs make these devices very inefficient for use in interrupt driven or heavy context switching applications. Several common microprocessors have been modified for real time control applications with the addition of internal timer, DMA, and communication interface functions. These include the 68332, 68HC11, and 8748 microprocessors as examples. It is important to note that the general purpose architecture of the original microprocessors is retained in these controllers, with the extra functions simplifying the peripheral interfacing. For this reason, these micro-controllers have interrupt latency and context switching behaviors similar to the original microprocessor parent. The 68332 [2] does have an auxiliary processor called the timer processing unit (TPU) which is capable of performing relatively complex time process behaviors such as stepper motor control. The purpose of this unit is to reduce the frequency of interrupts and context switches required by the real time system. Another solution to the interrupt latency and context switching time problems is use of a stack architecture such as that of the RTX2000 machine. Since the instruction stream is primarily zero address (stack) operations, these machines do not have large internal register sets which need to be saved. For this reason, the interrupt latency and the context switching times are very fast. The stack instructions, however, do not lend themselves to manipulating complex I/O devices due to the lack of support for complex addressing modes to these peripherals. These processors also tend to have slightly lower performance than their register heavy counterparts.
Instruction level interleaving is not new; early processors using interleaving include the CDC 6600 I/O processor [4], the multiple instruction stream processor of Flynn [5, 6], the work of Kaminsky and Davidson [7], and the Denelcor HEP computer [8]. More recent work includes the UCSB CCMP system [10], the APRIL [13], and others [9, 11, 12]. This work (with the exception of the CDC) is primarily directed towards the performance gains and ease of parallel programming implementations possible with interleaving. Very little attention has been given to the advantages of interleaving in real time systems. Interleaving architectures rely on maintenance of several simultaneous contexts, one for each of the running processes in the processor. This leads inevitably to a large overhead of registers required to sustain these contexts. These registers are often organized into register windows or multiple windows with disadvantageous worst case replacement behavior. These memory problems have been studied by Sites [18], and by Wyes and Plessman [19] using background processes to update the register windows before registers are needed. Another alternative is proposed in the CRISP [22] architecture using a stack cache. We will propose a variable sized multi-window organization for this purpose.
3.0 DYNAMIC INSTRUCTION STREAM COMPUTER

3.1 DISC Concept

The dynamic instruction stream computer (DISC) architecture relies on maintaining several simultaneous instruction streams which are dynamically started and halted by the processor. Each of these streams is interleaved in the processor at the instruction level, providing the highest level of granularity for task scheduling and partitioning. The instruction level interleaving allows for efficient pipelining to obtain high instruction throughput not achievable in conventional architectures. For applications in real time systems, however, it is the dynamic nature of the effectively parallel streams which is particularly useful. In a conventional processor, the control unit selects the next instruction to be executed in sequential order unless this order is changed by a jump or other control instruction. In DISC, the sequential order is replaced by a hardware scheduler which selects from among the several possible streams a particular instruction for execution on the next cycle. It is thus possible to assign an interrupt to a given stream which begins processing effectively in parallel, and at a given level of partitioned throughput relative to the rest of the streams then active in the processor. Streams can also trigger other instruction streams, and multiple streams can synchronize with each other when necessary. As an example, consider a machine running 3 streams concurrently, where one of the streams is halted by wait states from a slow peripheral. The other streams are automatically allocated the instruction slots which would otherwise be used as polling or interrupt overhead. In situations where the number of active contexts is smaller than the number of supported streams, all overhead for context switching is removed. (Even when this is not the case, for many real time systems the frequency of context switches should be reduced.) It is well known that multi-stream interleaving on a pipelined processor is more efficient than single stream execution. DISC exploits this efficiency advantage by implementing a RISC-based processing engine designed to automatically interleave instruction execution from a small number of stored process contexts.

Scheduling of streams on an instruction basis allows simple partitioning of the processing power among the several active real time tasks. This schedule allows several versions of real time scheduling models, including preemptive and fixed schedules as well as General scheduling [1], with little or no overhead. The cost for this is the necessity of several stored contexts along with the ancillary registers needed to be duplicated for each stream. In particular, PC, SP, and IR registers must be maintained for each stream. To manage this large number of registers, DISC introduces the concept of a stack window register set. These registers are similar to the register windows proposed by Patterson in RISC-I [3], with the exception that the number of registers allocated in a procedure call is variable. Each stream is allocated its own stack window as well as a common set of global registers used for inter-stream parameter passing. The stack window described below is a very important issue in a hard deadline environment to minimize context switching, procedure call/return, and interrupt overhead.

3.2 Pipelining

Pipelining is a mechanism by which multiple instructions from a sequential instruction stream are simultaneously executed in an overlapped fashion. For this discussion we will consider a five stage pipeline, consisting of instruction fetch (IF), instruction decode (ID), read registers (RR), execute (EX), and write register (WR). The essential feature of a pipe is that ensuing instructions are scheduled before earlier ones have completed. This leads to hazards, lowering the performance of the pipeline. A hazard is a situation which precludes executing the next instruction of the stream.
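As a toy illustration of the five stage pipe just described, the sketch below (not the DISC hardware, and with made-up names) shows that round-robin interleaving of five independent streams leaves no stream with two instructions in flight at once, so intra-stream hazards cannot arise:

```python
# A toy model of the five stage pipe (IF, ID, RR, EX, WR). Round-robin
# interleaving of five independent streams guarantees that no stream ever
# has two instructions in flight, so intra-stream hazards cannot arise.
# This is an illustrative sketch, not the DISC hardware scheduler.

STAGES = ["IF", "ID", "RR", "EX", "WR"]

def pipe_occupancy(streams, cycle):
    """Which (stage, stream, instruction#) slots are in flight at `cycle`."""
    slots = []
    for depth, stage in enumerate(STAGES):
        fetched = cycle - depth              # cycle this slot entered IF
        if fetched >= 0:
            stream = streams[fetched % len(streams)]
            slots.append((stage, stream, fetched // len(streams)))
    return slots

streams = ["a", "f", "k", "n", "p"]
for cycle in range(8):
    slots = pipe_occupancy(streams, cycle)
    # every stage holds a different stream: no data or control hazards
    assert len({stream for _, stream, _ in slots}) == len(slots)
```

With fewer streams than pipe stages the same function would show one stream occupying several stages at once, which is exactly the situation that reintroduces hazards.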
Hazards are caused by violation of either data or control dependencies. A data hazard exists when an instruction, A, is modifying data which is used by the next instruction, B. In this case, the data for B has not been updated by the time it is read. To insure correct operation, instruction A should be completed before B executes its third stage; the pipeline should keep running A but delay all those instructions that follow A until the register write is completed. A control hazard takes place when the instruction sequence is modified as a result of an interrupt or an instruction such as a jump or branch. By the time an instruction modifies the program sequence, there will be several instructions in the pipe which belong to the incorrect sequence. Any such instructions need to be flushed from the pipe. It is important to reduce the performance overhead associated with hazards. Several techniques such as delayed branching and pipeline bypasses reduce the effect of hazards, but generally do not eliminate them. Interleaving, however, can be employed to eliminate hazards from pipeline execution, and has been used in a number of systems [4-8].

3.3 Interleaving

A pipeline is interleaved if, at every pipe cycle, an instruction from a different instruction stream enters the pipe and there are at least as many instruction streams as pipe stages. Therefore, interleaving is a way to share the processor resources between multiple processes. Figure 3.1 shows an interleaved pipeline, in which five independent instruction streams or tasks are shown in a five stage pipeline. The result of such an interleaved pipe is the equivalent of five parallel processors, where each processor is running at one instruction every five cycles. Thus in an ideal pipeline there is no performance gain from interleaving instructions. In fact, the overhead of supporting several parallel streams may slow down the achievable clock cycle, hence the performance may decrease. However, a single stream running on a pipeline will have both data and control hazards reducing the throughput of the pipe.

[Figure 3.1 - Interleaved Pipeline]

In an interleaved pipeline, all instructions present in the pipeline belong to separate processes at all times. Thus each instruction from each process completes before further instructions from that process are fetched. Under these conditions, and assuming the processes to be independent, there are no control or data hazards at all. A representative branch is shown in Figure 3.2. Hazards between separate processes are possible if the processes are not independent; for example, the processes may communicate. We can add special hardware for process communication which will reduce the overhead in these cases. As a result, the interleaved pipeline achieves higher throughput on several processes than an identical pipeline executing a single stream, due to the reduction in the number of hazards.

[Figure 3.2 - Interleaved Pipeline During a Jump]

The performance increase for interleaved pipelines is not without cost. There must be sufficient registers to retain the states of all executing processes. The resources required to hold the context must be duplicated as many times as there are virtual processors to be supported. This cost is highly dependent on the architecture of the processor. It is important to have a very small context per process, and to have minimum extra hardware to support the multiple process switching. The question remains: how does interleaving give a solution to the real-time controller requirements?

3.4 Dynamic Interleaving

As we described earlier, a real-time system requires that multiple tasks be able to run concurrently. Some of these tasks occur at deterministic times, others at random times. There are a large number of interrupts, and the I/O speed is generally much slower than the processor speed. Interleaving could be a very good solution if a sufficient number of active tasks could be guaranteed, but this is difficult because of the randomness. Thus, we introduce the concept of dynamic interleaving. A pipeline organization is said to be dynamically interleaved if it can run from a single instruction stream up to multiple instruction streams, the computation power of the processor can be allocated between the multiple virtual processors in any way, and the throughput can be dynamically reallocated when the instruction stream scheduled to run is not ready. This is achieved in DISC by dynamically selecting the next instruction to execute from the possible streams. In the case where only one stream is active, each pipeline slot executes sequential instructions from that stream.

The concept is described by Figure 3.3. The figure shows up to four instruction streams (IS1, IS2, IS3, and IS4). Assume that the total throughput of the processor is T and the following partition is assigned: T/2 to IS1, and T/6 each to IS2, IS3, and IS4. As the figure shows, when IS1 is the only one active, it will be dynamically assigned T even though the static assignment is T/2. Similarly, if IS3 is inactive, its processor time will be dynamically reassigned to IS2 and IS4. Dynamic interleaving greatly facilitates scheduling and multitasking since each task can be assigned its own "virtual processor" of adjustable computational power. Real Time Systems also require hard deadline management, which is often implemented via timer based interrupts. In conventional architectures, these interrupts require context switches. In DISC, an interrupt, instead of suspending a running process, can create its own instruction stream. This makes the system more deterministic since even when interrupts are invoked, other tasks can be running. When the interrupt routine is finished, the throughput will be dynamically reallocated to the remaining instruction streams. Context switching will not be required as long as the number of instruction streams required by the application is less than or equal to that supported by the processor. Otherwise, some context switching will be required, but the total number of switches will be smaller in a DISC than in a traditional architecture.

[Figure 3.3 - Dynamic Instruction Stream Diagram]

3.5 The Stack Window Approach

Due to the speed degradation of external access with respect to the processor, it is important to keep operands in the processor. Therefore, the processor should have enough registers to be able to allocate registers to most, or all, of the local and global variables for all streams. However, keeping local variables in internal registers causes context switching overhead on procedure calls and returns [14, 15]. To solve the inconsistency between a large register set and fast context switching, a multi-window approach is a very logical alternative [14-23]. In addition to reducing the local register saving/restoring to just a pointer change, if the windows overlap, then the overlapped registers can be used for argument passing. Since DISC is an architecture which contains multiple instruction streams, each instruction stream should have its own multiple window file.
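The dynamic throughput partitioning of the Figure 3.3 example can be sketched as a renormalization of static weights over the active streams. The pro-rata redistribution below is an assumption for illustration; the DISC hardware may redistribute an inactive stream's share differently.

```python
# Sketch of dynamic throughput partitioning (Figure 3.3 example): static
# weights T/2, T/6, T/6, T/6 are renormalized over whichever streams are
# currently active. Pro-rata redistribution is an illustrative assumption.
from fractions import Fraction

STATIC = {"IS1": Fraction(1, 2), "IS2": Fraction(1, 6),
          "IS3": Fraction(1, 6), "IS4": Fraction(1, 6)}

def dynamic_share(active):
    """Effective fraction of the total throughput T for each active stream."""
    total = sum(STATIC[s] for s in active)
    return {s: STATIC[s] / total for s in active}

# IS1 alone is dynamically assigned all of T, not just its static T/2.
assert dynamic_share({"IS1"})["IS1"] == 1
# With IS3 inactive, its T/6 is redistributed among the remaining streams.
shares = dynamic_share({"IS1", "IS2", "IS4"})
assert sum(shares.values()) == 1 and shares["IS1"] == Fraction(3, 5)
```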
The approach used on DISC is called a stack window. Figure 3.4 shows the window file in the stack window approach. The Bottom Of the Stack register (BOS) points to the last empty word of the stack window (SW). The Active Window Pointer (AWP) points to register zero (R0) of the window. If the window size is S, then the address of R0 is AWP, R1 is AWP-1, ..., Rn is AWP-n, ..., and R(S-1) is AWP-S+1. In the instruction set, stack increment and decrement options are added to some instructions such as Load, Store, Add, Subtract, etc. When an instruction increments AWP, the new AWP location becomes R0, R0 becomes R1, R1 becomes R2, and so on (Figure 3.5). Thus the SW is a window that moves up and down as demands require. Let us assume that instructions which increment the AWP do so at the end of the instruction. Then a procedure call will increment AWP, storing the return address there. On a return, the AWP is decremented by the instruction offset (no larger than the window size) to the return address location. It restores the program counter and decrements the AWP one more time, leaving it at the same place it was before the call took place.

[Figure 3.4 - Stack Window]
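The call/return discipline just described can be sketched as follows. The growth direction, window depth, and the exact moment the pointer moves are illustrative assumptions, not details from the DISC1 design.

```python
# Minimal sketch of the stack window (Figure 3.4): R0 sits at the AWP and
# Rn at AWP - n, so moving the AWP renames every local register. Growth
# direction, depth and call convention details are assumptions.

class StackWindow:
    def __init__(self, depth=64):
        self.mem = [0] * depth
        self.awp = 0                       # Active Window Pointer

    def read(self, n):                     # Rn lives at address AWP - n
        return self.mem[self.awp - n]

    def write(self, n, value):
        self.mem[self.awp - n] = value

    def call(self, return_addr):
        self.awp += 1                      # allocate one word...
        self.mem[self.awp] = return_addr   # ...holding the return address

    def ret(self, offset):
        """Drop `offset` locals, restore the PC, pop the return address."""
        self.awp -= offset                 # back to the return-address slot
        pc = self.mem[self.awp]
        self.awp -= 1                      # AWP as it was before the call
        return pc

sw = StackWindow()
sw.awp = 8                  # caller's frame already occupies the window
sw.call(0x100)              # push the return address, AWP moves up by one
sw.awp += 3                 # callee auto-increments to allocate 3 locals
sw.write(0, 42)             # the newest local is always R0
assert sw.ret(3) == 0x100 and sw.awp == 8
```

Note how saving and restoring a frame is only pointer arithmetic, which is what keeps procedure call/return and interrupt overhead low.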
3.6 Communication Issues

3.6.1 Input/Output

Real Time Systems require multiple I/O peripherals with different access times; therefore, DISC has to support an asynchronous data bus. DISC is a load/store type machine. To avoid stopping the other instruction streams when a load or store instruction is issued, a pseudo-DMA type load/store was implemented on DISC1. On a load instruction, the effective address of the external request is calculated. It is then loaded into the Asynchronous Bus Interface (ABI), along with the address of the destination register. The IS requesting the read cycle is sent into a wait state and the ABI initiates the read cycle. While the access is taking place, any other IS requesting a load or store is likewise sent into a wait state. Once the read is completed, the ABI stores the data into the destination register and re-activates all waiting ISs. This is done without affecting the running instruction streams. The store instruction works in a similar way.
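The pseudo-DMA load path can be sketched as a small event-driven model. The single-outstanding-access limit and the retry policy are assumptions made for the sketch, not details confirmed by the paper.

```python
# Event-driven sketch of the pseudo-DMA load (3.6.1): the requesting IS
# parks in a wait state while the Asynchronous Bus Interface (ABI)
# completes the access, and all waiting ISs are re-activated on
# completion. Single outstanding access is an assumption.

class ABI:
    def __init__(self, memory):
        self.memory = memory
        self.busy = False
        self.pending = None                # [stream, dest_reg, addr, left]
        self.waiting = set()               # ISs parked in a wait state

    def load(self, stream, dest_reg, addr, access_time):
        self.waiting.add(stream)           # the requester always waits
        if not self.busy:                  # a busy bus means retry later
            self.busy = True
            self.pending = [stream, dest_reg, addr, access_time]

    def tick(self, regs):
        """One processor cycle; non-waiting streams keep running meanwhile."""
        if not self.busy:
            return
        self.pending[3] -= 1
        if self.pending[3] == 0:           # read cycle completed
            stream, dest, addr, _ = self.pending
            regs[(stream, dest)] = self.memory[addr]
            self.busy = False
            self.waiting.clear()           # re-activate all waiting ISs

regs = {}
abi = ABI(memory={0x20: 1234})
abi.load("IS1", "R3", 0x20, access_time=2)
abi.tick(regs)
abi.tick(regs)
assert regs[("IS1", "R3")] == 1234 and not abi.waiting
```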
[Figure 3.5 - Stack Window Movements]

3.6.2 Interprocess Communication and Synchronization

Since DISC has multiple ISs, communication between ISs is required. This can be accomplished in different ways; on DISC1, three ways are supported. There are four global registers that are shared between all the ISs. In addition, there is an internal global memory shared between the ISs. Since the global registers and internal memory allow read-modify-write instructions, they can be used as semaphores. IPC can also be done via software interrupts, which are discussed in the next section. Process synchronization can be achieved either by semaphore polling or by interprocess interrupts. Interrupts are more efficient since they do not require repetitive instructions on the processing engine.

3.6.3 Interrupts

The interrupt structure on DISC is very special because of the importance of interrupts in real time systems and because they are also used to obtain inter-IS communication and synchronization. Every IS has one interrupt register (IR) and one mask register (MR). On DISC1 the interrupt registers contain 8 bits; bit 7 is the highest priority and bit 0 the lowest priority (the background or normal mode of running). Interrupts 7 to 1 are vectored interrupts. Interrupt 0 is the background; no vector is generated. One IS can interrupt another, e.g. IS0 interrupts IS2 by setting a bit in the IR of IS2. External interrupts can also set a request to any of the IRs. Finally, interrupts can be generated automatically, such as the stack overflow or other exceptional interrupts. Interrupt request bits can only be cleared by the IS to which the IR belongs. When no bit of the IS is set, the instruction stream will not be scheduled (not active).
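The priority rule for the 8-bit interrupt register can be captured in a few lines. The interaction with the mask register shown here is an assumption (the paper does not spell out the masking semantics), and the vector table is omitted.

```python
# Sketch of per-stream interrupt priority (3.6.3): an 8-bit interrupt
# register per IS, bit 7 highest priority, bit 0 the background mode.
# Masking via the MR is an assumed AND; the vector table is omitted.

def highest_pending(ir, mr=0xFF):
    """Highest-priority unmasked pending bit, or None if the IS is idle."""
    pending = ir & mr
    if pending == 0:
        return None            # no bit set: the stream is not scheduled
    return pending.bit_length() - 1

assert highest_pending(0b0000_0000) is None          # inactive stream
assert highest_pending(0b0000_0001) == 0             # background mode
assert highest_pending(0b1000_0010) == 7             # interrupt 7 wins
assert highest_pending(0b1000_0010, mr=0x7F) == 1    # bit 7 masked off
```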
Once an interrupt is requested, if it is the highest priority one pending, a vector interrupt will be generated. The next instruction that belongs to that IS will be started at the address given by the vector interrupt. Vector interrupts were chosen in the implementation to avoid the need for polling to obtain the interrupt source. Synchronization between ISs can also be obtained via interrupts: when interrupts are used to synchronize ISs, the first IS to reach the join point is deactivated until the other IS arrives. This is much better than having the IS poll a semaphore to check synchronization, since the computation throughput which would be spent polling is dynamically allocated to the active ISs.

3.7 Implementation of DISC

DISC1 is an experimental implementation of the DISC concept; more information about the implementation and models is available [24]. This implementation was designed to prove the feasibility of DISC and to obtain benchmarks. The design is targeted to the typical control requirements of automotive electronics. A 16-bit architecture was chosen for DISC1 since the goal was to compare its performance with respect to present RTS controllers; in fact, present technology would allow physical implementation as a 32-bit architecture. A Harvard architecture was chosen to allow simultaneous instruction and data fetch. Instructions are fetched through the program bus, which is 24 bits wide, while the data bus is 16-bit asynchronous. An asynchronous data bus is required since controllers have a very large variety of I/O peripherals with a large variety of access times.

DISC1 supports up to four instruction streams running concurrently and uses a four stage pipeline. The scheduler of DISC1 is responsible for selecting which instruction stream will be executed next, based on present priority. The computational power of the system can be allocated evenly between the ISs, or assigned in increments as low as 1/16 of the total. DISC1 contains 2 Kbytes of internal memory in addition to the stack window registers. The internal memory is shared between all ISs, with access done via register indirect, register plus offset, or 9-bit immediate addressing. DISC1 is a load/store computer with a reduced instruction set. All instructions are effectively single cycle, including the load and store instructions, with the proviso of asynchronous waits for external memory and I/O. This simplifies the design and reduces the overhead cost of the multiple instruction streams. A 16x16 integer hardware multiplier is included in DISC1. DISC1 has 16 registers per instruction stream: four global, four special, and eight local (stack window) registers. Figure 3.6 shows a block diagram of DISC1. An RTL model of DISC1 was written in Verilog and several programs were run on this model.

4.0 EVALUATION OF DISC PERFORMANCE

4.1 Stochastic Model

A stochastic model was developed to evaluate the DISC architecture. Poisson distributions, with the indicated means, were assumed for the number of consecutive instructions for which the IS is active (meanon) or inactive (meanoff), for the number of instructions between external access requests (mean_req), and for I/O request times (mean_io). Also controlled were the percentage of external requests that were directed to memory
[Figure 3.6 - Block Diagram of DISC1]

(alpha), the percentage of instructions that modify program flow, such as jumps, calls, returns, branches and interrupts (aljmp), and the number of wait cycles for an external memory access.

The model simulates the sequencer used in DISC1, so that any sequence that can run on DISC1 can be simulated. The model assumes that when a jump instruction takes place, all of the instructions in the pipe that belong to the same IS have to be flushed from the pipe. If only one IS is active, this simplifying assumption makes DISC performance worse than a single IS computer. For an external request, either I/O or memory, if the access time is larger than zero, all instructions in the pipe belonging to the same IS are flushed, and the IS requesting access is put into a wait state. This is done in order to allow other ISs to keep running, but penalizes DISC with respect to a standard architecture if only one IS is being run, since the pipe could simply be halted. If the bus was busy at the time access is requested, the instruction is flushed and a new external access is requested once the IS is out of the wait state. If the bus was not busy, the busy flag is set and it remains set until the access time is completed. Upon completion of the external access, all waiting flags are cleared.

Two performance measures for DISC are evaluated: the processor utilization on DISC, PD, and delta. Delta is used to compare a single IS system with a multiple IS system, and is defined as:

    delta = (PD - PS) / PS * 100%

where PS, the processor utilization on the standard processor, is principally calculated as the total number of executable instructions divided by the sum of the total number of executable instructions, the number of cycles that the data bus was busy, and the number of cycles dropped due to "jump" type instructions. This assumes that instructions are not being executed in a standard processor when it is waiting for data; to assume the contrary implies support of out-of-sequence code and/or a smart compiler. It also assumes that every time a "jump" type instruction is executed, the standard processor will require (pipe_length - 1) cycles to be flushed from the pipeline. This is conservative in that delayed branching can be used to help reduce the number of cycles needed to be flushed. However, delayed branching can only be applied to statically analyzable portions of the design and is less effective as pipeline depth increases.

It is common practice in RTS analysis to measure the interrupt latency time as a system evaluation. By dedicating a stream to a particular interrupt, we can achieve very high figures of merit since the instructions will start execution immediately. However, we must still ensure that the appropriate context is available and that the interrupt executes quickly enough once it is started. The latency time as conventionally described is ambiguous in this sense, since a short interrupt to retrieve a value will execute very quickly (the common micro-controller case) while a longer interrupt will be scheduled throughput by the hardware scheduler.
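A small worked example of the two measures, with made-up cycle counts (the numbers are illustrative, not from the simulations):

```python
# Worked example of the two measures defined above: PS is the
# single-stream utilization and delta the relative DISC improvement,
#     delta = (PD - PS) / PS * 100%.
# All cycle counts here are invented for illustration.

def utilization(executed, bus_busy_cycles, flushed_cycles):
    """Executable instructions over all cycles spent or lost."""
    return executed / (executed + bus_busy_cycles + flushed_cycles)

ps = utilization(executed=700, bus_busy_cycles=200, flushed_cycles=100)
pd = 0.90                      # e.g. DISC utilization with four streams
delta = (pd - ps) / ps * 100
assert abs(ps - 0.70) < 1e-9
assert round(delta) == 29      # roughly a 29% improvement over PS = 0.70
```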
4.2 Simulation Results

A large number of simulation runs were made to evaluate the DISC architecture. Parameters varied, in addition to those described above, included the scheduler sequence, the number of cycles for an external memory access (tmem), and the pipeline length. One set of runs evaluated the effect of "jump" instructions only, another of external I/O only. Finally, a set of four program loads was specified to simulate more realistic RTS behavior. Loads 1 and 2 represent typical RTS behavior, in that load 2 is alternately active and inactive, while load 1 is always active. Load 3 represents a DSP type program running only from internal memory, and load 4 an interrupt driven program which is only active while handling an interrupt. These loads were also combined into a single IS; e.g., load (1:4) represents a statistical combination of loads 1 and 4 into a single IS. Table 4.1 shows the parameters for each of these runs, and Tables 4.2 and 4.3 show the processor utilization and delta for different combinations.

[Table 4.1 - Parameter Set for Typical Program: meanon, meanoff, mean_req, alpha, tmem, mean_io, and aljmp for loads 1-4 and the combined loads (1:2), (1:3), and (1:4)]

In Table 4.2 we show that as the degree of partitioning increases, so does the utilization. Hence if we have a program that can be partitioned into multiple ISs, a much better processor utilization is obtained, especially if the processor utilization of the single IS is low. Even when the processor utilization of a single IS is very high, there are still some gains obtained by running multiple instruction streams, as shown for load 3 in Table 4.2.
Maximum
Number of Instruction 1
Streams
2
3
4
load 1
45
70
75
90
load 2
32
54
64
75
load 3
87
95
95
100
load 4
25
46
58
67
Table 4.2 a - Processor Utilization
Maximum
Number 1
Results
of Instruction
Streams
2
3
4
load 1
-7
42
53
85
load 2
4
40
62
89
9
9
14
load 3
A large number of simulation runs were made to evaluate the DISC architecture. Parameters varied, in addition to those described above, included the scheduler sequence, number of cycles for an external memory access (tmem), and pipeline length. One set of runs evaluated the effect of only “jump” instructions, another of external 1/0 only. Finally, a set of four program loads was specified to simulate more realistic RTS behavior. Loads 1 and 2 represent typical RTS behavior
PD
load 4
0 -7
30
52
72
Table 4.2 b - Delta Table 4.3 shows results for load 1 combined with each of the other loads, first into a single IS, then each run in an
169
independent IS, then with load 1 partitioned into two 1Ss, and finally with both loads partitioned into dual 1Ss. The range of improvement of DISC over a traditional single-instructionstream processor (delta), is dramatic as long as at least two 1Ss are enabled, especially when traditional processor performance is poor.
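The trend in these tables can be reproduced qualitatively with a small Monte Carlo sketch. This is not the simulator used for the results above; the stall probability p_mem, the stall length t_mem, and the simple round-robin scheduler are illustrative assumptions. Each stream issues one instruction per cycle unless it is stalled on an external memory access, and the scheduler dynamically selects the next ready stream every cycle, so one stream's stall cycles are filled by the others.

```python
import random

def utilization(num_streams, cycles=50_000, p_mem=0.15, t_mem=6, seed=1):
    """Fraction of cycles in which some stream issues an instruction.

    Model (an assumption, not the paper's simulator): every cycle a
    round-robin scheduler picks the next ready stream; with probability
    p_mem the issued instruction is an external memory access that
    stalls its stream for t_mem cycles.
    """
    rng = random.Random(seed)
    ready_at = [0] * num_streams   # cycle at which each stream is next ready
    issued = 0
    last = -1                      # round-robin pointer
    for cycle in range(cycles):
        for k in range(num_streams):
            s = (last + 1 + k) % num_streams
            if ready_at[s] <= cycle:       # stream s is ready: issue from it
                issued += 1
                if rng.random() < p_mem:   # memory access: stall this stream
                    ready_at[s] = cycle + t_mem
                last = s
                break                      # at most one instruction per cycle
    return issued / cycles

def delta(num_streams, **kw):
    """Percent throughput gain over a single stream (the paper's delta)."""
    u1 = utilization(1, **kw)
    return 100.0 * (utilization(num_streams, **kw) - u1) / u1
```

With these assumed parameters, a single stream idles during every memory stall while four streams find a ready stream on almost every cycle, mirroring the shape of the load rows above: delta is largest exactly where single-stream utilization is poor.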
 Load Type   Combined Loads   Separated Loads   Three ISs   Four ISs
    1-2           44               62               71          84
    1-3           60               88               84          99
    1-4           46               58               72          79

Table 4.3 a - Processor Utilization

 Load Type   Combined Loads   Separated Loads   Three ISs   Four ISs
    1-2           -7               33               52          80
    1-3           -5               31               40          53
    1-4           -8               22               46          67

Table 4.3 b - Delta

On the other hand, in applications where single stream processor utilization is very high, the advantages of DISC are not significant. In addition, if the application does not permit keeping multiple ISs active, then a DISC architecture could result in a performance degradation. What is remarkable is the large throughput increase made available by using such a small number of parallel streams.

5.0 CONCLUSIONS

DISC shows a performance improvement over standard architectures for real time applications. The ability to dynamically reallocate the throughput permits the system to take advantage of time that would otherwise be lost. It was shown that even a system with two instruction streams significantly outperforms a single instruction stream system. In particular, the ability to partition throughput among streams, the rapid interrupt handling, and the concurrent processing during I/O should provide substantial benefits to RTS. There are many applications where DISC will be outperformed; specifically, this will be true in applications where the number of wait cycles and pipe hazards is very small. Future work should be done to evaluate the optimum number of instruction streams for a given application. The stochastic model sheds considerable light on this question, but detailed analysis of algorithmic requirements, I/O patterns, etc. will be necessary. Two other parameters also need study: the depth and size of memory usage in the stack windows could be evaluated by stochastic means, and appropriate measures of interrupt latency need to be defined and modeled. Numerous operating system, compiler, and other software questions also need to be addressed. Finally, implementation technology dependent constraints on the performance and size of DISC architectures need to be evaluated.

ACKNOWLEDGEMENTS

This research was partially supported by Delco Systems Operations, a subsidiary of Delco Electronics Corporation, and the University of California M.I.C.R.O. grant #90-185.

REFERENCES

1. Coffman E.G. and Denning P.J., "Operating System Theory," Prentice-Hall, 1973
2. CPU32 Reference Manual (Rev. 0.8), Motorola, 1989
3. Patterson D. and Sequin C., "RISC I: A Reduced Instruction Set VLSI Computer," Proc. of the 8th Symposium on Computer Architecture, May 1981
4. Thornton J.E., "Parallel Operation in the Control Data 6600," Proceedings of the Spring Joint Computer Conference, 1964
5. Flynn M.J., Podvin A., and Shimizu K., "A Multiple Instruction Stream Processor with Shared Resources," in Parallel Processor Systems, C. Hobbs, ed., Washington D.C.: Spartan, 1970
6. Flynn M.J., "Some Computer Organizations and Their Effectiveness," IEEE Transactions on Computers, Vol. C-21, No. 9, Sept. 1972
7. Kaminsky W.J. and Davidson E.S., "Developing a Multiple-Instruction-Stream Single-Chip Processor," IEEE Computer Magazine, Dec. 1979
8. Kowalik J.S., ed., "Parallel MIMD Computation: The HEP Supercomputer and its Applications," MIT Press, 1985
9. Smith B.J., "A Pipelined, Shared Resource MIMD Computer," Proc. of the 1978 International Conference on Parallel Processing, 1978
10. Staley C.A., "Design and Analysis of the CCMP: A Highly Expandable Shared Memory Parallel Computer," Ph.D. Dissertation, UCSB, August 1986
11. Halstead R.H. and Fujita T., "MASA: A Multithreaded Processor Architecture for Parallel Symbolic Computing," Proc. of the 15th Symposium on Computer Architecture, June 1988
12. Nikhil R.S. and Arvind, "Can Dataflow Subsume von Neumann Computing?," Proc. of the 16th Symposium on Computer Architecture, June 1989
13. Agarwal A., Lim B., Kranz D., and Kubiatowicz J., "APRIL: A Processor Architecture for Multiprocessing," Proc. of the 17th Symposium on Computer Architecture, May 1990
14. Patterson D. and Sequin C., "A VLSI RISC," IEEE Computer Magazine, Sept. 1982
15. Lunde A., "Empirical Evaluation of Some Features of Instruction Set Processor Architectures," Communications of the ACM, March 1977
16. Alexander G. and Wortman D., "Static and Dynamic Characteristics of XPL Programs," IEEE Computer Magazine, Nov. 1975
17. Patterson D., "Reduced Instruction Set Computers," Communications of the ACM, Jan. 1985
18. Sites R.L., "How to Use 1000 Registers," Caltech Conference on VLSI, Jan. 1979
19. Wyes H.W. and Plessmann K.W., "OMEGA: A RISC Architecture for Real-Time Applications," IFAC 10th Triennial World Congress, Munich, FRG, 1987
20. Tannenbaum A.S., "Implications of Structured Programming for Machine Architecture," Communications of the ACM, March 1978
21. Halbert D. and Kessler P., "Windows of Overlapping Register Frames," CS292R Course Final Report, UC Berkeley, June 1980
22. Ditzel D.R. and McLellan H.R., "Register Allocation for Free: The C Machine Stack Cache," Proc. Symp. on Architectural Support for Programming Languages and Operating Systems, Palo Alto, CA, March 1982
23. Siewiorek D.P., Bell C.G., and Newell A., "Computer Structures: Principles and Examples," McGraw-Hill, 1971
24. Nemirovsky, Mario, "DISC, A Dynamic Instruction Stream Computer," Ph.D. Dissertation, University of California, Santa Barbara, September 1990