The Spring Scheduling Co-Processor: A Scheduling Accelerator
Wayne Burleson, Jason Ko
Department of Electrical and Computer Engineering

Doug Niehaus, Krithi Ramamritham, John A. Stankovic, Gary Wallace, Charles Weems
Department of Computer Science

University of Massachusetts, Amherst, MA 01003
[email protected]
Abstract
We present SSCoP, a novel VLSI scheduling accelerator for multi-processor real-time systems. The co-processor can be used for static scheduling as well as for on-line scheduling. Many different policies can be used, such as earliest deadline first, highest value first, resource-oriented policies (for example, earliest available time first), or their combinations. When any on-line scheduling algorithm is used, it is important to assess not only the speed of the scheduling itself, but also the overall performance impact of the interface of the co-processor to the host system. In this paper we describe the co-processor architecture, a CMOS implementation, an implementation of the host/co-processor interface, and a study of the overall performance improvement. We show that the current VLSI chip speeds up the main portion of the scheduling operation by over three orders of magnitude. We also present an overall system improvement analysis by accounting for the operating system overheads, and identify the next set of bottlenecks to improve. The scheduling co-processor includes several novel features. It is implemented as a parallel VLSI architecture for scheduling that is parameterized for different numbers of tasks, numbers of resources, internal wordlengths, and future IC technologies. The scheduling architecture was implemented using a true single phase clocking scheme (TSPC) in several novel ways, which allowed prototyping in 2 micron CMOS technology and straightforward scaling to a more aggressive 0.8 micron CMOS technology. The 328,000 transistor custom VLSI accelerator running with a 100 MHz clock, combined with careful hardware/software co-design, results in a considerable performance improvement, thus removing a major bottleneck in real-time systems.
1 Introduction

[Footnote 1: Early reports of this work appeared at the 1993 International Conference on Computer Design and the 1993 Real-Time Systems Symposium.]

Real-time computing with hard deadlines is found in a number of control, avionic, industrial, and communications systems. The complexities of next-generation real-time systems arise from a number of factors, including different types of time constraints, different types
of time granularities, the presence of tasks with complex characteristics such as resource requirements and precedence constraints, and the dynamic nature of the environment. These complex systems call for resource and task management strategies that are not only predictable, i.e., provide assurances that time constraints will be met, but are also adaptable and flexible. Adaptability calls for approaches that can deal with the dynamics and uncertainties of the environment effectively, while flexibility demands approaches that can accommodate changes to the requirements as well as modifications to the hardware and software structures. The Spring operating system kernel [16] contains support for real-time applications that possess the above characteristics. Because of the performance considerations involved, and the physically distributed nature of the applications, Spring supports distributed real-time systems made up of multiprocessor nodes. A novel aspect of the kernel is the dynamic, planning-based scheduling of tasks that arrive dynamically, whereby a task is guaranteed to meet its deadline by explicitly constructing a plan for task execution such that all guaranteed tasks meet their timing constraints. This guarantee takes the execution times, resource constraints, and precedence constraints of active tasks into account, and avoids the need to compute worst case blocking times a priori, which can be very pessimistic in complex real-time systems. The operating system (OS) kernel [16] also retains a significant amount of application semantics at run time, which is used to provide flexibility and graceful degradation. These planning and application-semantic features are integrated to provide direct support for achieving both application- and system-level predictability. The kernel also uses global reflective memory to achieve predictable distributed communication.
Each multiprocessor Spring node contains one (or more) application processors, one (or more) system processors, and an I/O subsystem. Application processors execute the relatively high-level application tasks that have previously been dynamically guaranteed. System processors offload the scheduling algorithm and other OS overhead from the application tasks, both for speed and so that external interrupts and OS overhead do not cause uncertainty in the execution of guaranteed tasks. System processors can invoke specifically designed co-processors to offer hardware support for dynamically scheduling (guaranteeing) tasks. Specifically, the work described in this paper is the design and implementation of the Spring scheduling co-processor (SSCoP), a VLSI scheduling co-processor which accelerates the execution of the Spring scheduling algorithms, and thus increases the range, in terms of deadline granularity, of applicability of the dynamic scheduling approach adopted by Spring. While the scheduling co-processor is capable of handling the full sophistication required by the Spring scheduling algorithm (necessary for many applications), it is also possible to use the co-processor in simpler situations (e.g., where earliest deadline scheduling is sufficient). In terms of VLSI, the scheduling co-processor includes several novel features. It is implemented as a parallel VLSI architecture for scheduling that is parameterized for different numbers of tasks, numbers of resources, internal wordlengths, and future IC technologies. The scheduling architecture was implemented using a true single phase clocking scheme (TSPC) in several novel ways, which allowed prototyping in 2 micron CMOS technology and straightforward scaling to a more aggressive 0.8 micron CMOS technology.
The 328,000 transistor custom VLSI accelerator running with a 100 MHz clock, combined with careful hardware/software co-design, results in a considerable performance improvement, thus removing a major bottleneck in real-time systems. Future superscalar RISC microprocessors intended for real-time systems could obtain even higher performance by including an SSCoP module as an additional execution unit. Prior work on real-time systems that has resorted to hardware-assisted solutions includes the VLSI clock synchronization unit that is part of MARS, a fault-tolerant real-time system implementation [3], the fault-tolerant message routing unit that is part of HARTS (Hexagonal Architecture for Real-Time Systems) [15], and network interface chips for the FDDI protocol [14]. More familiar examples of hardware co-processors include floating point, graphics, DMA, cache controllers, and many others. All involve tasks that could be performed in software, but which are critical enough to system performance to warrant custom hardware support. In many real-time systems, on-line guarantees offer such important advantages that scheduling co-processors are warranted. However, the applicability of dynamic scheduling is directly influenced by the overheads incurred by the scheduling algorithm. An algorithm which takes tens of milliseconds to determine if a new task is admissible can only deal with tasks whose relative deadlines are in the hundreds of milliseconds or longer. In general, the smaller the absolute timing overhead of the scheduling, the smaller the granularity of the deadlines that can be dealt with by the algorithm. If we can construct hardware that is orders of magnitude faster than its software version, we can handle time granules that are orders of magnitude smaller.
In fact, through a combination of direct measurement and simulation, we have estimated that the current SSCoP interfaced to the Spring kernel [16] speeds up the portion of the algorithm it implements by three orders of magnitude. We have also studied the overall performance benefit when SSCoP is part of the full system and identified the next set of system-level bottlenecks to improve. In Section 2 we present the Spring scheduling algorithm. Section 3 provides an overview of the SSCoP architecture and discusses its interface with the rest of the Spring system. Details of a VLSI architecture for scheduling are presented in Section 4. Section 5 discusses design details of a high-speed CMOS implementation of SSCoP. Section 6 summarizes the test results of the initial prototypes of SSCoP and discusses issues of scaling to larger, more practical versions of the chip. Section 7 explains the details of the interface between the host and the chip, which leads to Section 8, which presents an analysis of the overall system performance improvement due to SSCoP. Section 9 summarizes the main results and discusses directions for further research related to SSCoP.
2 The Scheduling Algorithm

Many practical instances of scheduling problems have been found to be NP-complete [17]. A majority of scheduling algorithms reported in the literature perform static scheduling and hence have limited applicability, since not all task characteristics are known a priori and, further, tasks arrive dynamically. For dynamic scheduling in single processor systems with independent preemptable tasks, the earliest deadline first algorithm and the least laxity first algorithm are optimal. For dynamic systems with more than one
processor, and/or tasks that have mutual exclusion constraints, an optimal scheduling algorithm does not exist [10]. These negative results point out the need for heuristic approaches to solving scheduling problems in such systems. In the rest of this section we provide an overview of our approach to scheduling. To support the predictable execution of tasks, we use guarantees, a notion fundamental to predictable scheduling. A task is guaranteed by constructing a plan for task execution whereby all guaranteed tasks meet their timing constraints. Specifically, if a set S of tasks has been previously guaranteed and a new task T arrives, T is guaranteed if and only if a feasible schedule can be found for the tasks in the set S ∪ {T}. A task is guaranteed subject to a set of assumptions, for example, about its worst case execution time and the nature of faults in the system. If these assumptions hold, once a task is guaranteed it will meet its timing requirements. The version of our guarantee algorithm that has been realized on the chip described in this paper deals with non-preemptable tasks that have deadlines, resource requirements, and precedence constraints. Current real-time scheduling algorithms schedule the CPU independently of other resources. For example, consider a typical real-time scheduling algorithm, earliest deadline first. Scheduling a task which has the earliest deadline does no good if it subsequently blocks because a resource it requires is unavailable. Our approach integrates CPU scheduling and resource allocation so that this blocking never occurs. By integrating CPU scheduling and resource allocation at run time, we are able to understand (at each point in time) the current resource contention and completely control it, so that task performance with respect to deadlines is predictable, rather than letting resource contention occur in a random pattern that could result in an unpredictable system.
The algorithm attempts to guarantee non-preemptable tasks given their arrival time TA, deadline TD or period TP, worst case computation time TC, predecessor set {TPred}, and resource requirements {TR}. A task uses a resource either in shared mode or in exclusive mode and holds a requested resource as long as it executes.² The guarantee algorithm computes the earliest start time, Test, at which task T can begin execution. Test accounts for resource contention among tasks. It is a key ingredient in our scheduling strategy. We use a heuristic scheduling algorithm which tries to determine a full feasible schedule for a set of tasks in the following way. It starts at the root of the search tree, which is an empty schedule, and tries to extend the schedule (with one more task) by moving to one of the vertices at the next level in the search tree, until a full feasible schedule is derived. To this end, we use a heuristic function, H (see details below), which synthesizes various characteristics of tasks affecting real-time scheduling decisions to actively direct the scheduling to a plausible path. The heuristic function H is applied to the tasks that remain to be scheduled at each level of search. An eligible task with the smallest value of the function H is selected to extend the current schedule. A task is eligible when it has not yet been scheduled but all tasks in its predecessor set TPred have been scheduled. A task without predecessors is always eligible.

[Footnote 2: A set of tasks with precedence constraints, where each task has its own resource requirements, is generated from a process-level description of a computation [8]. Partly as a byproduct of this, tasks typically are short and use resources for the shortest possible time.]
While extending the partial schedule at each level of search, the algorithm determines whether the resulting partial schedule is strongly-feasible or not. A partial feasible schedule is said to be strongly-feasible if all the schedules obtained by extending it with any one of the remaining tasks are also feasible. Thus, if a partial feasible schedule is found not to be strongly-feasible because, say, task T misses its deadline when the current schedule is extended by T, then it is appropriate to stop the search, since none of the future extensions involving task T will meet its deadline. In this case, a set of tasks cannot be scheduled given the current partial schedule. (In the terminology of branch-and-bound techniques, the search path represented by the current partial schedule is bound since it will not lead to a feasible complete schedule.) However, it is possible to backtrack to continue the search even after a non-strongly-feasible schedule is found. Backtracking is done by discarding the current partial schedule, returning to a previous partial schedule, and extending it using a different task. The task chosen is the one with the smallest H value among those which were not already chosen during a previous attempt to extend this partial schedule. Even though we allow backtracking, its overhead can be restricted either by limiting the maximum number of backtracks or by restricting the total number of evaluations of the H function. We use the former scheme in the implementation discussed here. The heuristic function H can be constructed from simple or integrated heuristics.
The possible heuristics include:

    Minimum deadline first (Min D):             H(T) = TD
    Minimum processing time first (Min C):      H(T) = TC
    Minimum earliest start time first (Min S):  H(T) = Test
    Minimum laxity first (Min L):               H(T) = TD - (Test + TC)
    Min D + Min C:                              H(T) = TD + W * TC
    Min D + Min S:                              H(T) = TD + W * Test

Here Test denotes the earliest time that a task that is yet to be scheduled can begin execution and is given by:

    Test = Max(T's arrival time, Max{EATiu | (Ri, u) ∈ {TR}})        (1)

where EATiu, for a given partial schedule, is the earliest time resource Ri is available in mode u, where u equals shared or exclusive if T needs resource Ri in shared or exclusive mode, respectively. The first four heuristics above are considered simple heuristics because they treat only one dimension at a time, e.g., only deadlines or only resource requirements, while the last two are integrated heuristics. W is a weight used to combine two simple heuristics. Min L and Min S need not be combined because the heuristic Min L contains the information in Min D and Min S. Extensive simulation studies of the algorithm for uniprocessors and multiprocessors show that the simple heuristics do not work well, and that the integrated heuristic (Min D + Min S) works very well and has the best performance among all the above possibilities as well as over many other heuristics we tested [11]. For example, combinations of three heuristics were shown not to improve performance over the (Min D + Min S) heuristic. Hence this is the heuristic implemented in the current Spring system. However, it is important to note that SSCoP supports all of these heuristic functions.
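To make Equation 1 and the H functions concrete, here is a small software sketch (the task and EAT record shapes are illustrative only; on the chip these computations are bit-serial and run in parallel across all tasks):

```python
SHARED, EXCLUSIVE = 0, 1

def earliest_start_time(task, eat):
    """Equation 1: Test = max(arrival time, EATs of all requested resources).

    task["resources"] maps resource index i -> mode u; eat[i][u] is the
    earliest time resource Ri is available in mode u under the current
    partial schedule.
    """
    return max([task["arrival"]] +
               [eat[i][u] for i, u in task["resources"].items()])

def h_value(task, t_est, policy="MinD+MinS", w=1.0):
    """The simple and integrated heuristics listed above (W is the weight)."""
    d, c = task["deadline"], task["wcet"]
    return {"MinD": d,
            "MinC": c,
            "MinS": t_est,
            "MinL": d - (t_est + c),
            "MinD+MinC": d + w * c,
            "MinD+MinS": d + w * t_est}[policy]
```

For example, a task with arrival time 2 that needs R0 exclusively (available at 4) and R1 shared (available at 3) gets Test = 4, and with TD = 20 its (Min D + Min S) value is 24.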
S = task set to be scheduled;
partial schedule = empty;
repeat:
    if S is empty, halt with "success";
    Test calculation:   given the partial schedule, for each task T in S compute Test;
    H value generation: for each task T in S, compute H;
    Task selection:     determine the task minT with the lowest H value, among eligible tasks;
    Update partial schedule or backtrack:
        if (partial schedule || minT) is feasible and strongly feasible
            { partial schedule = (partial schedule || minT); S = S - minT; go to repeat }
        else if backtracking is allowed and possible
            backtrack to a previous partial schedule;
        else halt with "failure"
Figure 1: Pseudo Code for the Scheduling Algorithm
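A software rendering of this pseudo-code may help make the control flow concrete. The sketch below is a deliberately simplified model (every resource is taken in exclusive mode, and backtracking state is kept on an explicit stack rather than recomputed as on the chip); the task-record fields and the H function passed in are illustrative, not the SSCoP data formats:

```python
def schedule(tasks, eat0, preds, h, max_backtracks=4):
    """Guarantee algorithm of Figure 1 (simplified software model).

    tasks: id -> {"arrival", "wcet", "deadline", "resources": [resource ids]}
    eat0:  resource id -> earliest available time (exclusive mode only)
    preds: id -> set of predecessor ids
    h:     (task record, Test) -> heuristic value (smaller is better)
    Returns the ordered list of guaranteed task ids, or None on failure.
    """
    def est(t, eat):                       # Equation 1
        return max([tasks[t]["arrival"]] +
                   [eat[r] for r in tasks[t]["resources"]])

    def feasible(t, eat):                  # T meets its deadline if run next
        return est(t, eat) + tasks[t]["wcet"] <= tasks[t]["deadline"]

    def extend(eat, t):                    # T holds its resources while it runs
        start, new = est(t, eat), dict(eat)
        for r in tasks[t]["resources"]:
            new[r] = start + tasks[t]["wcet"]
        return new

    sched, eats, tried, backtracks = [], [eat0], [set()], 0
    while True:
        remaining = set(tasks) - set(sched)
        if not remaining:
            return sched                   # halt with "success"
        eligible = [t for t in remaining
                    if preds.get(t, set()) <= set(sched) and t not in tried[-1]]
        if eligible:
            minT = min(eligible, key=lambda t: h(tasks[t], est(t, eats[-1])))
            tried[-1].add(minT)
            new_eat = extend(eats[-1], minT)
            strongly_feasible = feasible(minT, eats[-1]) and all(
                feasible(u, new_eat) for u in remaining - {minT})
            if strongly_feasible:          # extend the partial schedule
                sched.append(minT); eats.append(new_eat); tried.append(set())
                continue
            backtracks += 1                # discard minT; next iteration
            if backtracks > max_backtracks:  # retries the next-smallest H
                return None
            continue
        if not sched:
            return None                    # halt with "failure"
        sched.pop(); eats.pop(); tried.pop()   # level exhausted: back up one
```

With the (Min D + Min S) heuristic, a task that must wait for a busy resource is penalized by its later Test, so a later-deadline task on an idle resource can be scheduled ahead of it.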
3 Overview of SSCoP and its Interface

To motivate the design of SSCoP, we first present the pseudo-code for the Spring scheduling algorithm in Figure 1. From the pseudo-code, we derive the architecture of SSCoP shown in Figure 2 [19]. The EST generator, H-value generator, and Task Selector in the figure correspond to the Test calculation, H value generation, and task selection steps in the scheduling algorithm pseudo-code, respectively. The Task Selector is also responsible for the "update partial schedule or backtrack" operation. The feasible schedule is read by the host from the Queue in Figure 2 through the I/O unit. The execution of the scheduling algorithm, in particular the complexities of backtracking, is controlled by a master finite state machine (FSM). A critical aspect of SSCoP design, and that of any co-processor, is the design of an efficient interface between the host software and the co-processor. If this step is neglected, the processing required to prepare the input to SSCoP and use the output information can overshadow any performance improvement obtained from the custom hardware [9]. In many ways, SSCoP looks like a smart, albeit small, memory to the Spring system processor.

[Figure 2: A VLSI Architecture for Scheduling. The datapath comprises the EST Generator, H Value Generator, Task Selector, and Task Index Queue, under a control FSM, with an I/O unit connecting to the host.]

All registers in SSCoP appear in the address space of the system processor, thus simplifying the software interface. It behaves as a memory chip which is written with task information and from which a complete and feasible schedule can be read back, if the guarantee algorithm succeeds. Programming SSCoP is performed by the system processor writing its memory-mapped control and data registers. Specifically, prior to the scheduling operation all task parameters must be loaded into SSCoP's task table, including: deadline, worst case computation time, arrival time, resource requirements, and the predecessor set. In the implementation discussed here, the predecessor set is represented by a predecessor matrix in SSCoP's task table. Also, when a task is added to the partial schedule, the corresponding scheduled bit is set in the NSReg register, which is discussed in Section 5. Thus, in many ways SSCoP has the same relationship with the system processor as a floating point co-processor has with a CPU. The Earliest Available Times (EATs) of all resources are also calculated and written into SSCoP's EAT registers, while the heuristic function, appropriate weights, and the signal to begin are specified by writing to command registers. Upon completion of the scheduling operation, the output queue holds the indices of the scheduled tasks, sorted by their start times, which are held in the EST generator. The host checks for completion by polling the status register of SSCoP, and then reads the schedule from the output queue and EST registers. If no valid schedule is found, then this is so indicated. Since all of the registers in SSCoP are memory-mapped, SSCoP behaves as SRAM, except for the fact that it has its own clock. In fact, except for a clock pin, SSCoP has a pinout which looks exactly like a socketed PROM, which was already designed into the Motorola board we use for the Spring scheduling processor. The idea was to make the addition of SSCoP as simple as possible, forfeiting some address space but requiring only a small daughter board which plugs into the unused PROM socket. The only real design challenge in the I/O is the asynchronous interface required to make the synchronous SSCoP look like an SRAM. Future versions of SSCoP will use fast synchronous memory interfaces such as RAMBUS [13] to avoid the overhead of asynchronous transfers.

[Figure 3: The Test Computation. An example with k = 3, r = 2, and b = 5, showing the EATReg, RReg, and AReg.]

The SSCoP architecture is parameterized such that the same architecture applies for different values of k (the number of tasks), r (the number of resources), and b (the wordlength). Obviously, VLSI area, power, and performance will vary for different choices of the three parameters. This has allowed us to put off decisions concerning the actual values of the parameters until quite late in the design process. It also allows us to develop more advanced implementations of the architecture as improved IC technologies become available. Although the current Spring system clock runs at 20 MHz, SSCoP can run at a much higher clock rate. In addition, depending on the values of k, r, and b, the SSCoP clock rate can be varied independently of the system clock.
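The programming model just described (load the task table, write the EATs and command registers, start, poll the status register, read back the queue and ESTs) can be sketched as a host-side driver. The register offsets and field encodings below are purely hypothetical, since the actual SSCoP address map is not reproduced here:

```python
class SSCoPDriver:
    """Host-side view of SSCoP as a memory-mapped register file.
    `mem` is any word-indexable mapping (e.g. a mapping of the PROM
    socket's address range); all offsets below are illustrative only."""
    DREG, CREG, AREG, RREG, PREG = 0x000, 0x100, 0x200, 0x300, 0x400
    EATREG, OPREG, STATUS, QUEUE, ESTREG = 0x500, 0x600, 0x604, 0x700, 0x800
    DONE = 0x1

    def __init__(self, mem):
        self.mem = mem

    def load_task(self, i, deadline, wcet, arrival, resources, preds):
        """Fill row i of the task table before starting the scheduler."""
        self.mem[self.DREG + i] = deadline
        self.mem[self.CREG + i] = wcet
        self.mem[self.AREG + i] = arrival
        self.mem[self.RREG + i] = resources  # packed shared/exclusive bits
        self.mem[self.PREG + i] = preds      # row i of the predecessor matrix

    def run(self, n_tasks, eats, opcode, weight):
        """Write EATs and the command register, poll, and read the schedule."""
        for i, eat in enumerate(eats):
            self.mem[self.EATREG + i] = eat
        self.mem[self.OPREG] = opcode | (weight << 8)  # start the FSM
        while not self.mem[self.STATUS] & self.DONE:   # poll for completion
            pass
        return [(self.mem[self.QUEUE + j], self.mem[self.ESTREG + j])
                for j in range(n_tasks)]
```

Because SSCoP presents itself as plain memory, no special bus protocol is needed on the host side; the driver is ordinary loads and stores plus a polling loop.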
4 Components of the SSCoP Architecture

In this section, we discuss the details of the various components of the parameterized SSCoP architecture shown in Figure 2 and expanded in Figure 5.
4.1 Earliest Start Time Computation
The EST generator contains all the necessary parameters and circuitry for computing the Test's of the tasks, including: the EAT for each resource (EATReg), the resource requirements of each task (RReg), and the arrival time of each task (AReg). Figure 3 illustrates the EST generator for k = 3, r = 2, and b = 5. The EATReg contains two sets of EATs: the working EATs, which are updated whenever an eligible task is added to the current partial schedule, and the shadow set, which retains the original EATs to support backtracking. Figure 3 only illustrates the working EATs. The Test's of all tasks are computed in parallel by the EST generator according to Equation 1. The calculation is bit-serial, so one bit of every task's Test is computed during every clock cycle. The decision to use bit-serial arithmetic is explained later in Section 5. The resource requirements of each task are given by its TR vector. So, for example, task T1 uses resource R1 in exclusive mode and resource R2 in shared mode, while task T3 uses resource R1 in exclusive mode and does not use R2. Note the MSB-first bit-serial method for computing the maximum. As the bits of the input values are examined, they are wire-ORed over a precharge line to generate the corresponding bit of the maximum value. However, it is important to note that when an input value is revealed as less than the maximum by the comparison for a given bit position, the RReg and AReg cells are designed to eliminate it from all subsequent comparisons. This ensures that the maximum value is properly derived. Figure 4 shows the RReg and AReg cells.

[Figure 4: (a) Cell used to store the resource requirement vector TR and compute the maximum over input EATs. (b) Cell used to compute the maximum over the input arrival time.]
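The MSB-first maximum with elimination can be mimicked in software: scan all candidates one bit per "cycle", wire-OR the surviving bits to produce the result bit, and knock out any candidate whose bit is 0 in a cycle where the OR is 1. This is a behavioral sketch of the mechanism, not the circuit:

```python
def msb_first_max(values, width):
    """Bit-serial, MSB-first maximum with elimination, as in the RReg/AReg
    cells: one bit of the result is produced per clock cycle."""
    alive = set(range(len(values)))        # candidates still in the running
    result = 0
    for bit in range(width - 1, -1, -1):   # MSB first
        wired_or = any((values[i] >> bit) & 1 for i in alive)  # precharge line
        if wired_or:
            # losers (bit = 0 here) are eliminated from all later comparisons
            alive = {i for i in alive if (values[i] >> bit) & 1}
        result = (result << 1) | int(wired_or)
    return result
```

Because a candidate is dropped the first time it loses a bit comparison, a smaller value can never re-inject a 1 into a lower-order result bit, which is exactly the property the elimination logic guarantees.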
4.2 Heuristic Function Computation
The outputs of the EST generator proceed to the H-value generator, in which the H values of all tasks are computed in parallel. Again the arithmetic is bit-serial, only now it is LSB-first. All H functions shown in Section 2 are supported by the H-value generator. The particular H function used is specified by the host when it sets the SSCoP opcode and weight registers. The status of the scheduling algorithm can be determined by the host at any time by reading the SSCoP status and FSM state from the SSCoP status register.
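LSB-first order suits the additions in the integrated heuristics (e.g., H = TD + W · Test) because the carry ripples naturally from one cycle to the next. A behavioral sketch of one serial adder lane (illustrative only, not the actual H-value generator circuit):

```python
def lsb_first_add(a, b, width):
    """LSB-first bit-serial addition: one sum bit per clock cycle, with the
    carry held in a single flip-flop between cycles."""
    carry, total = 0, 0
    for bit in range(width):               # LSB first
        ai, bi = (a >> bit) & 1, (b >> bit) & 1
        total |= (ai ^ bi ^ carry) << bit  # sum bit emitted this cycle
        carry = (ai & bi) | (carry & (ai ^ bi))
    return total
```

In hardware this makes the logic depth per cycle a single full adder, independent of the wordlength b.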
4.3 Task Selection and Queuing
The values of the H function for the k tasks are passed on to the task selection block. This block finds the minimum H value among all eligible tasks and then checks the feasibility test results. If the current partial schedule is found to be strongly feasible, then (1) the task selector appends the index of the eligible task with the smallest H value onto the output queue, retaining its Test in the EST generator, and (2) the working EATs for the resources used by the scheduled task are updated. If the current partial schedule is not strongly feasible, backtracking is performed, if possible. If SSCoP successfully generates a schedule, the schedule can be read from the output queue (Q) and the EST generator by the host. SSCoP may fail to produce a successful schedule due to (1) the absence of a feasible schedule even after an exhaustive search, (2) the limit on backtracking being reached, or (3) the host's early termination of SSCoP execution. The main control complexity of SSCoP arises from backtracking. Before backtracking begins, the control FSM resets the datapath, flushes the output queue, and restores the initial EATs. SSCoP recomputes the partial schedules until the last strongly feasible partial schedule of the previous search attempt is reached. The partial schedule is then extended by the task with the next smallest H value, i.e., one with an H value larger than that of the task selected on the previous extension of this partial schedule. Although this strategy increases the time overhead, it reduces the space overhead substantially, since it avoids the need to maintain large amounts of state for arbitrary levels of backtracking.
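This space/time tradeoff can be illustrated in software: keep only a per-level count of branches already taken (as in the RetryNumReg described in Section 5.3), and recover any earlier partial schedule by replaying the deterministic selection from the root. The `select` function below is a stand-in for the full H-based choice:

```python
def select(pool, skip, key):
    """Deterministic choice: the (skip+1)-th smallest candidate by key."""
    return sorted(pool, key=key)[skip]

def rebuild(items, retry_num, key):
    """Backtracking by recomputation: no per-level snapshot of the schedule
    or EATs is stored. retry_num[d] records how many branches were already
    taken at depth d, and the partial schedule is replayed from the root."""
    schedule, pool = [], set(items)
    for skip in retry_num:                 # one entry per level of the tree
        t = select(pool, skip, key)
        schedule.append(t)
        pool.remove(t)
    return schedule
```

Replaying costs extra cycles on every backtrack, but the state kept per level is only a branch count (log k bits on the chip) rather than a full copy of the datapath.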
5 CMOS Implementation of the Scheduling Co-Processor

We now describe a high-speed CMOS implementation of SSCoP. We use a custom design methodology, an advanced clocking scheme, and simple CAD tools (Verilog, MAGIC, IRSIM, and CAzM). The chip is prototyped in a conventional 2-metal, 2 micron process but will be scaled to a 3-metal, 0.8 micron process. Both processes are made available through MOSIS. Figure 5 shows the detailed block diagram of SSCoP, which consists mainly of parallel-access shift register banks storing all parameters pertaining to the scheduling computation: deadlines (DReg), worst case computation times (CReg), arrival times (AReg), value densities (VDReg), EATs (EATReg), resource requirements (RReg), and the predecessor matrix (PReg). In addition to data storage, some register banks also perform actual computations; for example, the RReg computes the Test's (see Figure 3). Another example is PReg, which computes the eligibility of the tasks and stores the result in the EB-vector. PReg is a k × k predecessor matrix which defines the precedence constraint between every pair of tasks. Using this information, and the NSReg which indicates which tasks have already been scheduled, PReg computes the eligibility of each task in parallel. The Test's of the eligible tasks are then stored in the ESTReg. The ALU computes the H function and tests the strong feasibility of the current partial schedule. It also precomputes the new values for the EATReg, storing them in the UpdateReg. TaskSelect picks the eligible task q with the smallest H value. It uses the same cell (Figure 4) as the RReg. Figure 6
shows an example of how the task selector selects the index of the eligible task with the smallest H value. For testing purposes, all the registers in SSCoP are accessible to the host via the data bus. They can be easily tested and initialized by register read/write operations.
5.1 Clocking Scheme and Logic Style

The fine-grain pipelining of SSCoP and its basic structure as a smart memory mean that a large portion of the circuitry is latches and associated clock distribution. In addition, the need for high performance through dynamic logic and the desire to scale the chip to large die sizes and advanced technologies (less than 1 micron feature size) require a well-laid plan for clocking. We have chosen an advanced clocking scheme known as True Single Phase Clocking (TSPC), proposed in [18] [1]. Under TSPC, there are two complementary types of latches, both synchronized by the same single clock phase. One type of latch is transparent when the clock is low, and the other is transparent when the clock is high. Both data and control signals are synchronized by the same clock phase, and gating of clock signals is not allowed. Both static and dynamic logic gates are supported in TSPC. The primary advantage of TSPC is the simplicity of amplifying and routing a single clock phase, an advantage which increases as clock frequencies increase and die sizes grow. Throughout SSCoP, we chose a bit-serial scheme to exploit the parallelism across tasks and resources. Since large SSCoPs will have k and r both larger than b, bit-parallel schemes would be slowed by carry propagation, while the task-parallel operations found in the EST generator, task selector, and precedence checking are handled by large fan-in dynamic gates, using circuit techniques similar to the bit-line sense amplifiers found in memories. Details of this design decision and implementation can be found in [5].
5.2 Data Path
The data path of SSCoP is laid out using a novel slice scheme to allow easy expandability while maintaining a constant aspect ratio. The data bus runs through the data path to allow easy access to all registers. Common decoders are used to drive word lines to select the appropriate slice. Since the arithmetic is done in a bit-serial fashion, the logic depth between latches in the data path is very shallow and does not increase with the wordlength b. The only circuit design difficulties are in the global communication and computations that arise in the ESTReg and TaskSelect blocks. Both broadcasts and wired-OR functions across k tasks result in circuitry which is not local, and therefore the clock period will grow as k increases. Although a purely bit-systolic scheme [4] could be used to avoid this problem, we determined that the additional latching and latency were not worth the added cost and effort for realistic values of k (< 512). Conventional memories have similar scaling issues in their word-line fanout and bit-line fanin. A die photo of a small prototype SSCoP (k = 8, r = 4, b = 8) is shown in Figure 7.
[Block diagram: the SSCoP datapath — the iDecoder and tDecoder, the UpdateReg, EATReg, NSReg, PReg, AReg, DReg, ESTReg, CReg, VDReg, and RReg register blocks, the MaskLogic, the EB- and V-vectors, the TaskSelect block, and the ALU — connected by the address and data buses to the SSCoP control unit (FSM, OpReg, ModeReg, StatusReg, RetryNumReg, BitCounter, TailCounter, RetryNumCounter, RetryEntryCounter), with external Clk, R/W, OE, CE, address, and data signals.]

Figure 5: Functional block diagram of SSCoP
[Circuit diagram for k = 3: each task's H-value (e.g., 001011, 110010, 001000) is scanned MSB-first on the EATij and Ri bit-lines; the per-task NLti(j) latches progressively disqualify tasks until only the task index of the selected task remains.]

Figure 6: The task selection
5.3 Control FSM
To reduce the design complexity of the control FSM, the finite state diagram of the control unit has been partitioned so that it can be implemented by a master FSM and four counters: BitCounter, TailCounter, RetryEntryCounter, and RetryNumCounter. The master FSM interacts with the datapath and the counters to coordinate the chip operations. The BitCounter keeps track of the bit position in bit-serial operations. The TailCounter keeps track of the tail of the output queue Q. The other counters control backtracking: the RetryEntryCounter stores the level at which the next backtracking attempt will be made, and the RetryNumCounter interacts with the retry number registers, RetryNumReg. The RetryNumReg has k entries of log2 k bits each, one for each level of the search tree; each entry stores the number of branches taken at that level. The control unit is also implemented using the TSPC scheme.
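The effect of the partitioning — a master FSM with very few states, with the counters absorbing the per-bit and per-task repetition — can be sketched in C. The state names and transition structure here are illustrative simplifications of ours, not the fabricated controller.

```c
#include <stdint.h>

enum state { IDLE, COMPUTE_H, SELECT_TASK, UPDATE, DONE };

struct control_unit {
    enum state st;
    int bit_counter;   /* bit position within a bit-serial operation */
    int tail_counter;  /* tail of the output queue Q                 */
    int scheduled;     /* tasks placed into the schedule so far      */
};

/* One clock tick: the master FSM only changes state when a counter
 * wraps, so the FSM itself stays small regardless of b and k. */
void tick(struct control_unit *cu, int b, int n_tasks)
{
    switch (cu->st) {
    case IDLE:
        cu->st = COMPUTE_H;
        cu->bit_counter = 0;
        break;
    case COMPUTE_H:
        if (++cu->bit_counter == b) {   /* one bit per cycle */
            cu->bit_counter = 0;
            cu->st = SELECT_TASK;
        }
        break;
    case SELECT_TASK:
        cu->tail_counter++;             /* enqueue the selected task */
        cu->st = UPDATE;
        break;
    case UPDATE:
        cu->st = (++cu->scheduled == n_tasks) ? DONE : COMPUTE_H;
        break;
    case DONE:
        break;
    }
}
```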
5.4 Verification and Testing

The SSCoP functional block diagram of Figure 5 was verified using a Verilog [20] behavioral simulation. A version of the Spring SSCoP scheduler capable of using this behavioral simulation was implemented on a UNIX workstation, enabling us to verify that SSCoP correctly handles benchmark scheduling problems, including the intricacies of backtracking. Verilog was also used to simulate the switch-level behavior of the TSPC clocking scheme used by SSCoP latches and logic. The regular bit-slice nature of SSCoP permitted us to use MAGIC to lay out cells which interconnect by abutment. IRSIM was used to simulate the switch-level behavior of the layout, and CAzM was used to simulate the electrical performance of circuits. We thus verified the logical and electrical correctness of the SSCoP design as thoroughly as possible prior to fabricating our first physical prototypes.
Figure 7: Die photo of small prototype SSCoP (k = 8, r = 4, b = 8). The controller is implemented by a TSPC PLA in the upper left portion of the chip. The array structure in the upper right-hand corner holds the Earliest Start Time computation and registers. Parameterized array structures across the bottom of the chip form the datapath for the heuristic function computation. All bussing and interconnect in this portion are implemented by abutting simple cells. The various control registers associated with each task are embedded in the datapath. The memory-style layout of the regular structures allows the use of dynamic bit-lines for the large fan-in OR gates used in the EST and Task Selection blocks.
The parameterization of the architecture allowed us to simulate small versions of SSCoP (k = 4, r = 4, b = 8, w = 4) initially to verify functionality without manipulating the large netlists which result from a realistic-size SSCoP (e.g., k = 128, r = 32, b = 24, w = 4). Testing the fabricated versions of SSCoP was greatly simplified by the fact that all state information on the chip is readable and writable via the data bus. Unlike scan-path testing schemes, this allows on-the-fly checkpointing in addition to more conventional manufacturing chip tests.
6 Test Results of the Co-processor Prototypes

This section presents the test results of three prototype VLSI chips and then discusses the issues involved in scaling to larger, more practical versions of the design. While the final version of SSCoP that can be integrated into the Spring system requires k ≥ 16, r ≥ 16, and b ≥ 24, and hence a sub-micron process for both density and performance, three small prototypes were fabricated by Orbit Semiconductor using a 2 micron 2-metal P-well process through the MOSIS service. This strategy was used for several reasons. Fabrication in the 2 micron process is much cheaper than in the eventual 0.8 micron process, the CAD tools and cells for the 2 micron process are more mature, and our design experience is limited in sub-micron technologies. The parameterized architecture and circuit techniques allow us to use the 2 micron technology as an inexpensive, rapid-turnaround prototyping vehicle for more advanced, high-performance, and expensive technologies. Alternative prototyping methods such as Field Programmable Gate Arrays (FPGAs) were considered, but deemed inappropriate due to the memory-style full-custom circuit techniques used in SSCoP. Although a custom design, the development costs of the SSCoP prototypes were kept down by using systematic design methods, a mix of academic and commercial CAD tools, and the MOSIS foundry services. In addition, the use of a daughter board and a simple memory interface minimized the system hardware and software costs of a co-processor. We believe that the overall performance gains from SSCoP will more than justify these relatively low costs. The first test chip was a small data-path prototype (k = 4 tasks, r = 4 resources, b = 8 wordlength, w = 4 weight length) which was used to verify the function and speed of the circuits, in particular the dynamic logic structures found in the EST block and Task Selector.
Since the TSPC circuit technique is somewhat novel, we felt that a small prototype would allow us to inexpensively verify our circuits before committing to an expensive larger design with the additional complexity of the controller. The fabricated chips were partly functional, enough to gain some confidence in the circuit techniques despite some electrical design errors due to insufficient circuit simulation. The chips were tested at 50 MHz, which was the limit of our Tektronix LV500 tester. According to simulations, a 100 MHz clock rate should be possible. The second prototype was a small version of the parameterized controller. This included a novel TSPC PLA structure and several counters which scale with the various architecture parameters. This was a very small design, but it allowed us to check the complexity of backtracking and determine the effect of the controller on the overall performance of the chip.
   k    r    b   Transistors   0.8 micron       2 micron
                               H(mm)  W(mm)   H(mm)  W(mm)
   8   16   16       12,940     1.6    4.0     2.4    6.0
   8   32   32       23,532     3.4    3.8     5.1    7.8
  16   32   32       43,812     5.2    6.0     8.0    9.0
  32   32   32       84,372     5.2    7.2     8.0   10.0
  64   24   24      127,940     6.4    7.2     9.6   10.0
  64   32   24      165,444     6.4    8.0     9.6   12.0
  64   32   32      165,492     6.8    8.8    10.2   13.0
 128   32   24      327,684    11.2   10.0    17.0   15.0
 256   32   24      652,164    20.2   11.1     n/a    n/a

Table 1: Layout size for different k, r, and b, and different technologies

The third prototype was a slightly larger chip (k = 8, r = 4, b = 8) which combined both datapath and controller. Half of the fabricated chips were entirely functional at 20 MHz, except for a logical error in the updating of the Valid vector which disabled backtracking. Our earlier prototypes had pointed out some problems which were fixed in this version. A die photo of this prototype is shown in Figure 7. Based on these prototypes we can now estimate the costs, performance, and design effort of larger SSCoPs in sub-micron technology. Table 1 provides a summary of the size of SSCoP parameterized by the maximum number of tasks (k), the number of resources (r), and the internal wordlength (b). Performance analysis of the SSCoP architecture, and of the scheduling problem in general, is a difficult problem and is explained in detail in Section 8 and in [19, 9]. Although k, r, b, and the clock rate are the physical factors affecting the execution time of a scheduling operation on SSCoP, even more complex is the performance overhead due to system, setup, and software issues. However, we can consider the effect of the parameter values on the clock rate of the chip. Larger SSCoPs will handle larger task sets, and the cost of global operations and broadcasts will tend to slightly decrease the clock rate. Initially we are using circuit techniques to design higher-gain drivers and sense-amps, but for extremely large future versions of SSCoP, temporal partitioning and localization will be needed [4].
7 SSCoP-System Interface

A critical aspect of SSCoP design, and that of any co-processor, is its interface to the host. Since the goal is to minimize the total scheduling time, we must also consider how SSCoP design influences the preprocessing required to set up a scheduling operation, and the postprocessing required to transform the resulting schedule into a form the application processors can use. The essential data structures of Spring's scheduler include the current schedule, or system task table (STT), consisting of a "dispatch queue" for each application processor
sscop_scheduler(new_work, current_sched)
{
    num_of_tasks = num_tasks(new_work) + num_tasks(current_sched);
    sched_time   = estimate_sched_time(num_of_tasks);
    cutoff_line  = do_cutoff_and_eat(current_sched, sched_time);
    sched_set    = form_sched_set(new_work, current_sched, cutoff_line);
    init_sscop_input(sched_set, cutoff_line);
    result = build_sched(cutoff_line);
    if (sscop_succeeded(result))
        put_sched_into_dsp_queues(current_sched, sched_set);
    else {
        revert_to_previous_sched(current_sched);
        eval_sscop_error(sched_set, result);
    }
}
Figure 8: The Overall Scheduling Algorithm

(AP) in the system [7]. Dispatch queues are currently implemented as linked lists of STT entries, each describing a task. Each STT entry specifies relevant task attributes, including scheduled start and finish times, deadline, and resource use. STT entries are placed on their dispatch queue in increasing scheduled-start-time order. In Spring, dispatching of previously guaranteed tasks continues while new tasks are guaranteed. To ensure maximum flexibility when guaranteeing the new task, those tasks that will not begin execution until the guarantee is completed are considered for rescheduling; the following procedure identifies these tasks. The scheduling operation has three major components: preprocessing, SSCoP execution, and postprocessing; SSCoP execution is represented by the build_sched function in Figure 8. As part of the preprocessing, the scheduler begins by finding the total number of tasks in the current schedule and in the new work requested, and then uses it to calculate the time required for the complete scheduling operation, including all pre- and postprocessing. Next, the scheduler examines the current schedule's dispatch queues, "cutting" each at the point corresponding to the current time plus the scheduling time, known as the "cutoff line". The system must leave undisturbed those tasks on each dispatch queue slated to begin execution before the cutoff line, to ensure that their "guarantees" remain valid. This is necessary to avoid race conditions between scheduling and application task execution, which proceed in parallel. The do_cutoff_and_eat routine also calculates the initial EAT values for every resource. This calculation is slightly more complicated than might be expected, since the cutoff line can fall within the scheduled execution of a task, requiring the initial EAT values for that task's resources to be the task's scheduled finish time.
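The EAT initialization just described can be sketched as follows. The types are simplified stand-ins of ours, and we assume a resource untouched by any cutoff-straddling task becomes available exactly at the cutoff line.

```c
#include <stdint.h>

/* Simplified STT entry: scheduled interval plus a resource bit-vector. */
struct stt_entry {
    int start, finish;
    uint32_t resources;  /* bit j set if resource slot j is used */
};

/* Compute initial EAT per resource slot, given the tasks left on the
 * dispatch queues (those starting before the cutoff line). */
void init_eat(const struct stt_entry *fixed, int n_fixed,
              int cutoff, int r, int eat[])
{
    for (int j = 0; j < r; j++)
        eat[j] = cutoff;              /* assumed free by the cutoff line */

    for (int i = 0; i < n_fixed; i++) {
        if (fixed[i].finish <= cutoff)
            continue;                 /* finishes before the cutoff */
        /* Task straddles the cutoff: its resources only become
         * available at its scheduled finish time. */
        for (int j = 0; j < r; j++)
            if (((fixed[i].resources >> j) & 1) && fixed[i].finish > eat[j])
                eat[j] = fixed[i].finish;
    }
}
```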
Those STT entries scheduled to start after the cutoff line are rescheduled. These, plus the STT entry structures allocated to represent the new work, comprise the scheduling set processed by SSCoP. However, when the total number of tasks in the system exceeds k, we cannot permanently assign every task a SSCoP task slot. Instead, we allocate SSCoP task slots to scheduling set members during preprocessing. Each task's attributes are then loaded into the appropriate entries within the assigned slot. This assumes that no more than k tasks are to be scheduled at any given time; in the final section of the paper we discuss what needs to be done if this assumption does not hold. Initializing the temporal attributes (deadline, execution, and arrival time) is done by writing the proper values to memory-mapped SSCoP registers. However, the precedence relations and resource use attributes require nontrivial preprocessing because the system's run-time data structures represent relations between STT entries, while SSCoP's PReg represents precedence relations between SSCoP task slots in matrix form. For example, a precedence relation A → B (A precedes B) is represented by setting the bit in task B's PReg entry at the position corresponding to A's task slot. Resource use requires nontrivial preprocessing for two reasons. First, just as the total number of tasks in the system can exceed k, so too can the total number of resources in the system exceed r. The solution is the same: resources used by the tasks in the scheduling set are assigned SSCoP resource slots as they are encountered during task slot assignment. Task resource use is represented by setting bits in the task's RReg entry. That completes the preprocessing required to prepare SSCoP to calculate a schedule. The heuristic function and its weights are specified by setting SSCoP command registers.
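The PReg matrix encoding of A → B can be illustrated in C. The helper names and the k = 32 slot count are ours; the eligibility check sketches how a PReg row can be compared, in software, against the set of already-scheduled slots.

```c
#include <stdint.h>

#define K 32                       /* task slots in this illustrative SSCoP */

typedef uint32_t preg_row;         /* one k-bit PReg entry per task slot */

/* Record the precedence relation A -> B: set bit A in B's PReg row. */
void add_precedence(preg_row preg[K], int a, int b)
{
    preg[b] |= (preg_row)1 << a;
}

/* A task is eligible once every predecessor has been scheduled;
 * `done` has a bit set for each already-scheduled slot. */
int eligible(const preg_row preg[K], int t, preg_row done)
{
    return (preg[t] & ~done) == 0;
}
```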
These generally remain unchanged for long periods, although it is simple to select a different H-function for every scheduling operation. There is, however, a fairly subtle aspect of preprocessing that has not been addressed. The task deadline, execution, and arrival times are initialized by writing them into the proper SSCoP registers, but what values are written depends on whether SSCoP uses absolute or relative times. Absolute times are easy to understand and require no preprocessing, but SSCoP currently uses relative times for two reasons:

1. When time-dependent H-functions are used, the performance of some H-functions degrades because the time component gradually predominates, distorting the H value. For example, the H-function Min_D + Min_C, which performs reasonably well, adds the deadline to the computation time. If absolute times are used, this will converge to Min_D, which has poor performance, as the value of the absolute time becomes ever greater than the computation time. Hence, further research is required to evaluate the sensitivity of specific H-functions to increasing time values, and to investigate efficient methods for handling the problem.

2. Arithmetic operations in SSCoP are bit-serial. Assuming 32-bit absolute time values, using relative times of length b saves 3(32 − b)N cycles when scheduling N tasks, and also reduces SSCoP size, since all word-length registers are correspondingly smaller. (The word size for the final implementation has not been chosen, but will probably fall between 16 and 24. Our ability to so easily defer, or change, this decision is an important benefit of SSCoP's parameterized design.)
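The 3(32 − b)N figure can be checked directly: three terms of the per-task cost formula in Section 8 scale with the wordlength, so each task saves 3 × (32 − b) cycles when b-bit relative times replace 32-bit absolute ones.

```c
/* Cycles saved by using b-bit relative times instead of 32-bit
 * absolute times when scheduling n_tasks tasks. */
int cycles_saved(int b, int n_tasks)
{
    return 3 * (32 - b) * n_tasks;
}
```

For the scenario of Section 8 (b = 24, N = 64) this comes to 3 × 8 × 64 = 1,536 cycles.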
The scheduling operation is started by build_sched setting a bit in the SSCoP control register. It can monitor SSCoP status during execution by reading the status register. Build_sched checks for completion by polling the status register, and can interrupt SSCoP at any time by setting the abort bit in the control register. Build_sched also enforces the cutoff line by watching the system clock while waiting for SSCoP execution to finish. When complete, the output queue, Q, holds the indices of the scheduled tasks in the order they were added to the schedule. Their starting times are held in the corresponding ESTReg entries. The postprocessing is comparatively simple. Each task index is mapped to the corresponding STT entry using a table filled in as task slots were allocated during preprocessing, and the absolute scheduled start and finish times are calculated from the relative start time in the ESTReg. The STT entry is then added to the end of the appropriate AP's dispatch queue, beginning with the point where each was truncated by the cutoff line. The STT entries in the resulting dispatch queues are always in increasing start-time order, since they all use the same processor resource, and a task added to the SSCoP output queue after another using the same processor cannot have a T_est earlier than the scheduled finish time of the previous task.
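The start/poll/abort protocol followed by build_sched can be sketched against memory-mapped registers. The bit positions and function names below are illustrative assumptions of ours, not SSCoP's documented register map.

```c
#include <stdint.h>

#define CTRL_START (1u << 0)   /* assumed start bit in the control register */
#define CTRL_ABORT (1u << 1)   /* assumed abort bit in the control register */
#define STAT_DONE  (1u << 0)   /* assumed done bit in the status register   */

/* Kick off a scheduling operation. */
void sscop_start(volatile uint32_t *ctrl)
{
    *ctrl = CTRL_START;
}

/* One polling step; the caller loops, checking the system clock between
 * steps to enforce the cutoff line.
 * Returns +1 when done, -1 when aborted at the cutoff, 0 to keep polling. */
int sscop_poll(volatile uint32_t *ctrl, volatile uint32_t *status,
               long now, long cutoff)
{
    if (*status & STAT_DONE)
        return 1;
    if (now >= cutoff) {
        *ctrl = CTRL_ABORT;    /* enforce the cutoff line */
        return -1;
    }
    return 0;
}
```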
8 Overall System Performance Evaluation

We now evaluate our assumption that SSCoP will substantially improve the performance of the scheduling operation as a whole. Recall that scheduling has three major components: preprocessing, SSCoP execution, and postprocessing. Equation 2 predicts the number of cycles required for SSCoP execution. We have considered the phases of SSCoP execution, each of which is executed once every time a task is added to the generated partial schedule. The cost per task is:
    MAX(T_est, D − C) computation                    b + w + 2
    H-value computation for all tasks in parallel    b + 2w + 2
    Task selection                                   b + w + log2 k + 3
    Update                                           3
    ----------------------------------------------------------------------
    Total for one task                               3b + 4w + log2 k + 10
and the total cost for a set of N tasks, without backtracking, is:

    N (3b + 4w + log2 k + 10)        (2)
If we assume a reasonably demanding scenario (e.g., N = 64 tasks to be scheduled, k = 128 maximum tasks, b = 24 wordlength, w = 4 weight wordlength), then Table 2 gives the predicted SSCoP execution times. Since we are concerned with the complete scheduling operation, we next need to consider the software execution time necessary to create the required SSCoP inputs and to process the generated SSCoP outputs. Table 3 shows results obtained from measurements of the Spring software scheduler.
    Clock (MHz)   nsec/cycle   μsec/task   μsec/sched-op
        50          20.00        2.10         134.40
        60          16.67        1.75         112.00
        70          14.29        1.50          96.00
        80          12.50        1.31          84.00
        90          11.11        1.17          74.67
       100          10.00        1.05          67.20

Table 2: SSCoP Execution Time Prediction
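Equation 2 can be checked in code against Table 2's scenario (N = 64, k = 128, b = 24, w = 4): 105 cycles per task, hence 1.05 μsec per task and 67.2 μsec per scheduling operation at 100 MHz. The helper names are ours.

```c
/* Integer log2 (k is assumed to be a power of two). */
static int ilog2(int k)
{
    int l = 0;
    while (k > 1) { k >>= 1; l++; }
    return l;
}

/* Per-task cycle count from the cost table above. */
int cycles_per_task(int b, int w, int k)
{
    return 3 * b + 4 * w + ilog2(k) + 10;
}

/* Equation (2) converted to microseconds at a given clock rate. */
double sched_op_usec(int n, int b, int w, int k, double clock_mhz)
{
    return n * cycles_per_task(b, w, k) / clock_mhz;
}
```

With these parameters, cycles_per_task(24, 4, 128) = 72 + 16 + 7 + 10 = 105, matching the 1.05 μsec/task row of Table 2 at 100 MHz.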
                           Test 1    Test 2    Test 3    Test 4    Test 5
    new tasks / old tasks  64 / 0   16 / 48    8 / 56    1 / 63    0 / 64
    Software Total        147,451   129,097   128,757   127,849   121,992
    SSCoP Total            35,150    22,257    20,485    19,616    18,654
    Speedup Factor            4.2       5.8       6.3       6.5       6.5
    SSCoP Preprocessing     6,130     5,555     5,605     5,603     5,577
    SSCoP Postprocessing      805       807       788       802       794
    SSCoP Overhead            20%       29%       31%       33%       34%

Table 3: Software Scheduler Timing Measurements (all times are μsecs)
All the task sets used in these experiments contained 64 tasks. Resource use and task precedence were similar, but not identical, in all five tests. The major varying component among the tests was the ratio of new tasks to old tasks, i.e., new work scheduled for the first time versus old work being rescheduled. Software total shows the total time, in μsec, that a complete scheduling operation of 64 tasks took. This includes not only the time it took the software to actually build the new schedule, but also the pre- and postprocessing time required by the overall scheduling algorithm described in Section 7. SSCoP total shows the total time for a complete scheduling operation assuming SSCoP does the work of building the schedule. The numbers in Table 3 assume that SSCoP execution takes zero time; the figures in Table 2 show that while the co-processor times are really nonzero, they are small enough to make no difference in the speedup results. Speedup factor (software total / SSCoP total) shows how much faster the overall scheduling operation is when using SSCoP instead of software. If we measure just the build_sched procedure against the predicted SSCoP execution, which is the main goal of this work, the speed increase realized is as high as 1700-fold (three orders of magnitude). This speedup is obtained because of the high degree of parallelism inherent in this scheduling problem; such speedup could not be attained by high-performance RISC chips, where parallelism is limited. However, this impressive performance in the core of the scheduling algorithm does not portray the total picture, since we cannot use the build_sched function alone to do anything useful. The pre- and postprocessing portions of the scheduling operation are mandatory for a new schedule to be created and used by the Spring system.
The SSCoP preprocessing and SSCoP postprocessing figures comprise the additional time specific to interfacing with the SSCoP registers; these times are included in SSCoP total. SSCoP overhead percentage ((SSCoP preprocessing + SSCoP postprocessing) / SSCoP total) is a measure of how much time was spent in SSCoP-specific code, i.e., the overhead required to use SSCoP. Table 3 shows that the speedup factor and the SSCoP overhead percentage both increase as the ratio of new work to old work decreases. The speedup factor increases because the proportion of the extra preprocessing time required to deal with new work decreases in relation to the rest of the required processing. Since both the software scheduler and the SSCoP version need to spend the same amount of time preprocessing the new work, this is a win for the SSCoP version because its total time is smaller. Correspondingly, the SSCoP overhead percentage increases because with less preprocessing spent on new work, the SSCoP-specific processing becomes more significant. In Spring, tasks are considered for scheduling as and when they arrive. Thus, we expect that the ratio of new work to old work will be small; in particular, rarely will all the tasks in a scheduling set correspond to new arrivals. Hence, under typical scenarios, approximately a six-fold speedup can be expected. The current target hardware (68020) is much slower than many current processors, and moving to a new architecture will clearly improve pre- and postprocessing performance, but it will also improve the software build_sched performance. This means the predicted SSCoP execution times will become more and more significant when compared against the
software total time. Therefore the benefit from SSCoP will decrease when faster processors are considered. Assuming the 68020 runs at 1 MIPS, use of a 10 MIPS processor will decrease the speedup factor of SSCoP from 6.3 in Test 3 to 6.1. Similarly, a 100 MIPS target processor will reduce the same speedup factor to 4.7, 200 MIPS to 3.8, 500 MIPS to 2.4, and 1000 MIPS to 1.5. The build_sched function dominates the execution time of the software scheduler. SSCoP was designed to reduce exactly this overhead, and it does so dramatically. However, the software execution times of the requisite pre- and postprocessing code are now the limiting factors in the speedup benefits derived from SSCoP. Clearly, optimizing total SSCoP scheduling time will require careful attention to optimizing the pre- and postprocessing code. The current preprocessing implementation is fairly simplistic, however, and there are many ways in which we should be able to improve its performance, including performing more functions in hardware.
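The trend above follows from a first-order model (our reconstruction): all software time scales inversely with host speed, while the roughly 67 μsec of SSCoP execution stays fixed. Using Test 3's measurements on the approximately 1 MIPS 68020 (software total 128,757 μsec, SSCoP total 20,485 μsec, SSCoP execution 67.2 μsec from Table 2), the model reproduces the quoted speedups.

```c
/* First-order speedup model: sw_total and sscop_sw are the measured
 * software times (usec) on a 1 MIPS host; sscop_exec is the fixed
 * hardware execution time (usec); mips is the relative host speed. */
double speedup(double sw_total, double sscop_sw, double sscop_exec,
               double mips)
{
    return (sw_total / mips) / (sscop_sw / mips + sscop_exec);
}
```

For example, speedup(128757, 20485, 67.2, 10) evaluates to about 6.1, and at 1000 MIPS the same expression drops to about 1.5, matching the figures in the text.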
9 Summary

This paper has presented a parameterized architecture and initial performance results for the Spring scheduling co-processor (SSCoP). A Verilog-based functional simulation of SSCoP supported tests verifying SSCoP's ability to correctly solve scheduling problems. Lower-level simulations predicted the number of SSCoP cycles required for the various phases of the scheduling operation, permitting us to construct a simple formula for SSCoP execution time when backtracking is not required. We also presented performance results for the pre- and postprocessing steps, required to prepare scheduling problems for SSCoP processing and to process its output, as currently implemented. Other important properties of the SSCoP implementation are discussed in greater detail elsewhere [2], as are details on the Spring scheduling algorithm [12]. We have shown that selective use of special-purpose hardware can significantly speed up certain critical portions of real-time operating systems by exploiting their inherent parallelism. By reducing the time required for speculative scheduling, SSCoP allows early prediction of infeasible schedules and enables more sophisticated recovery algorithms to improve the reliability and overall performance of the system. While it can be argued that a state-of-the-art RISC processor could also offer significant scheduling speed-up, our SSCoP prototype provides a dedicated solution which is parameterized, efficient in terms of both area and power, and, because of its high degree of parallelism, scales better than a RISC chip. However, to see more substantial improvement in future real-time systems, a SSCoP module should probably be incorporated into a more general-purpose RISC processor. This would be analogous to the special-purpose hardware for branch prediction, multi-processing, and multi-media now found on general-purpose processors.
9.1 Future Research
As was mentioned earlier, in addition to computation time requirements, deadline constraints, resource requirements, and precedence constraints, guarantee can be subject to periodicity constraints, replication constraints arising from fault-tolerance requirements,
importance levels for tasks, and I/O requirements. Specific guarantee algorithms have already been designed depending on the task and resource characteristics [12]. Extending the current version of SSCoP to suit these algorithms will be part of further work. SSCoP can be considered to be made up of two components: a heuristic search component that employs branch-and-bound techniques, and a component that is specific to the scheduling algorithm adopted in Spring. Even though the two components are intertwined in the current implementation, it should be possible to separate them so that we have a search engine that can be made to work in different contexts. Means to accomplish this are worthy of investigation. Partitioning of large scheduling problems to fit onto smaller SSCoPs is also an area for further study. Although the scheduling problem is formally not partitionable (hence its NP-completeness), approximate methods may allow software controlling SSCoP to simulate larger "virtual" SSCoPs. Time-domain partitioning is also a challenge in hardware-software co-design.
Acknowledgements This research is supported in part by the National Science Foundation under grant IRI-9208920 and in part by ONR under grant N00014-92-J-1048. We also acknowledge Cadence Design Systems for supplying the Verilog software.
References

[1] M. Afghahi and C. Svensson, "A Unified Single-Phase Clocking Scheme for VLSI Systems", IEEE Journal of Solid-State Circuits, Vol. 25, No. 1, February 1990, pp. 225-233.

[2] W. Burleson, J. Ko, D. Niehaus, K. Ramamritham, J. Stankovic, G. Wallace, and C. Weems, The Spring Scheduling Co-Processor: A Scheduling Accelerator, in Proceedings of the International Conference on Computer Design, IEEE, October 1993.

[3] H. Kopetz and W. Ochsenreiter, Clock Synchronization in Distributed Real-Time Systems, IEEE Trans. on Computers, C-36(8):933-940, August 1987.

[4] S. Y. Kung, VLSI Array Processors, Prentice-Hall, Englewood Cliffs, N.J., 1988.

[5] J. Ko, The Spring Scheduling Co-processor, Masters Thesis, University of Massachusetts, 1994. (A condensed version is available as UMass Technical Report TR-93-CSE-4.)

[6] L. Lindh, "Utilization of Hardware Parallelism in Realizing Real Time Kernels," Doctoral Thesis, Dept. of Electronics, Royal Institute of Technology, Stockholm, Sweden, 1994.

[7] L. Molesky, K. Ramamritham, C. Shen, J. Stankovic, and G. Zlokapa, Implementing a Predictable Real-Time Multiprocessor Kernel - The Spring Kernel, in Proceedings of the Seventh IEEE Workshop on Real-Time Operating Systems and Software, pages 20-26, IEEE, May 1990.

[8] D. Niehaus, Program Representation and Translation for Predictable Real-Time Systems, in Proceedings of the IEEE Real-Time Systems Symposium, pages 53-63, IEEE, December 1991.

[9] D. Niehaus, K. Ramamritham, J. Stankovic, G. Wallace, C. Weems, W. Burleson, and J. Ko, The Design and Use of the Spring Scheduling Co-Processor Programming Interface, in Proceedings of the IEEE Real-Time Systems Symposium, IEEE, December 1993.

[10] A. K. Mok, "Fundamental Design Problems of Distributed Systems for the Hard Real-Time Environment", Ph.D. Dissertation, Department of Electrical Engineering and Computer Science, MIT, Cambridge, Mass., May 1983.

[11] K. Ramamritham, J. Stankovic, and P. Shiah, "Efficient Scheduling Algorithms for Real-Time Multiprocessor Systems," IEEE Transactions on Parallel and Distributed Systems, Vol. 1, No. 2, April 1990, pp. 184-194.

[12] K. Ramamritham and J. Stankovic, "Scheduling Strategies Adopted in Spring: An Overview," in Foundations of Real-Time Computing: Scheduling and Resource Management, A. van Tilborg and G. Koob, Eds., Kluwer Academic Publishers, pp. 277-305, 1992.

[13] RAMBUS - Architectural Overview, RAMBUS Inc., Mountain View, CA, 1993.

[14] K. C. Sevcik and M. J. Johnson, Cycle-Time Properties of the FDDI Token Ring Protocol, IEEE Trans. on Software Engineering, SE-13(3):376-385, March 1987.

[15] K. Shin, HARTS: A Distributed Real-Time Architecture, IEEE Computer, 24(5), May 1991.

[16] J. Stankovic and K. Ramamritham, "The Spring Kernel: A New Paradigm for Real-Time Systems," IEEE Software, May 1992, pp. 54-71.

[17] J. D. Ullman, "NP-Complete Scheduling Problems", Journal of Computer and System Science, October 1975.

[18] J. Yuan and C. Svensson, "High Speed CMOS Circuit Technique", IEEE Journal of Solid-State Circuits, Vol. 24, No. 1, February 1989, pp. 62-70.

[19] The Spring Scheduling Co-processor, Technical Report TR-93-CSE-4, Dept. of ECE, University of Massachusetts, Amherst, 1993.

[20] Cadence Design Systems Inc., Verilog-XL Reference Manual, 1991.