IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, VOL. 8, NO. 4, APRIL 1997
Synthesizing Variable Instruction Issue Interpreters for Implementing Functional Parallelism on SIMD Computers

Nael B. Abu-Ghazaleh, Student Member, IEEE, Philip A. Wilsey, Member, IEEE, Xianzhi Fan, and Debra A. Hensgen, Member, IEEE

• N. Abu-Ghazaleh and P.A. Wilsey are with the Department of ECECS, PO Box 210030, University of Cincinnati, Cincinnati, OH 45221-0030. E-mail: [email protected], [email protected].
• D.A. Hensgen is with the Department of Computer Science, Naval Postgraduate School, Monterey, CA. E-mail: [email protected].
Manuscript received June 24, 1994. For information on obtaining reprints of this article, please send e-mail to: [email protected], and reference IEEECS Log Number D95262.

Abstract—Functional parallelism can be supported on SIMD machines by interpretation. Under such a scheme, the programs and data of each task are loaded on the processing elements (PEs) and the Control Unit of the machine executes a central control algorithm that causes the concurrent interpretation of the tasks on the PEs. The central control algorithm is, in many respects, analogous to the control store program on microprogrammed machines. Accordingly, the organization of the control algorithm greatly influences the performance of the synthesized MIMD environment. Most central control algorithms are constructed to interpret the execution phase of all instructions during every cycle (iteration). However, it is possible to delay the interpretation of infrequent and costly instructions to improve the overall performance. Interpreters that attempt improved performance by delaying the issue of infrequent instructions are referred to as variable issue control algorithms. This paper examines the construction of optimized variable issue control algorithms. In particular, a mathematical model for the interpretation process is built and two objective functions (instruction throughput and PE utilization) are defined. The problem of deriving variable issue control algorithms for these objective functions has been shown elsewhere to be NP-complete. Therefore, this paper investigates three heuristic algorithms for constructing near optimal variable issue control algorithms. The performance of the algorithms is studied on four different instruction sets and the trends of the schedulers with respect to the instruction sets and the objective functions are analyzed.

Index Terms—MIMD on SIMD, interpretation, variable instruction issue, scheduling instruction execution, SIMD computers.
1 INTRODUCTION

Massively parallel Single Instruction-stream Multiple Data-streams (SIMD) computers were originally developed to perform efficient parallel computation on matrix data [1], [2]. An SIMD machine consists of a control unit (CU) that controls an array of simple processing elements (PEs). In the traditional SIMD programming model, the program is loaded on the CU with the data distributed to the PEs. The CU executes the program, performing scalar operations and dispatching vector operations for execution on the PEs. Thus, SIMD machines are considered to be efficient only on applications with large vector sizes and uniform control patterns and are still mostly used to speed the execution of programs with abundant data parallelism [3]. The commercial availability of inexpensive massively parallel SIMD machines has renewed interest in finding algorithms for solving a wider class of problems on SIMD machines [3], [4].

MIMD interpretation on SIMD is an alternative method of programming SIMD machines [5], [6], [7], [8], [9]. It operates on the Level Principle [10], which states that any object may be considered to be data when viewed
from a higher level of abstraction. Thus, the separate execution threads, which would be allocated as tasks to the processing units on an MIMD machine, are loaded as data on the PEs. The CU then executes a central control algorithm (CCA) that interprets the functionalities of the instruction set. Every pass through the CCA is termed an iteration or cycle. Fig. 1 shows a typical central control algorithm. PEs participate in the portions of the cycle necessary for the interpretation of their instruction while staying idle for the remaining parts of the cycle. The execution threads are each advanced exactly one instruction with every iteration. Thus, concurrent execution of multiple independent control threads on an SIMD machine (typically thought to be a strictly MIMD capability) is realized. This method transcends the limitations of traditional SIMD programming because it operates on a completely orthogonal level, utilizing the inherent data parallelism present in the instructions of execution threads when they are viewed as data.

Since SIMD machines can only support a single control thread, the portions of the CCA that require distinct control patterns (conditional regions) cause the serialization of the execution of their corresponding conditional clauses. Thus, only the PEs that are interested in a conditional clause are active while all other PEs remain idle. This problem, referred to as the interpretation overhead, is most critical in Step 3 of Fig. 1, where each PE is active for only one instruction in the interpreted instruction set.
loop forever
    1. fetch instruction
    2. decode operands
    3. if (opcode = ADD) then {interpret ADD instr.}
       if (opcode = AND) then {interpret AND instr.}
       ...
    4. store results
    5. update local program counter
end loop

Fig. 1. Example central control algorithm.
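To make the scheme concrete, the following minimal sketch simulates Fig. 1's loop in Python; the two-instruction ISA, the programs, and the register layout are invented for illustration and are not part of the original design.

# Toy MIMD-on-SIMD interpreter in the style of Fig. 1. Each "PE" holds its
# own program and program counter; the central loop visits every opcode and
# only the interested PEs participate. ISA and programs are hypothetical.
pe_programs = [
    [("ADD", 1), ("ADD", 2), ("AND", 3)],   # PE 0's instruction stream
    [("AND", 1), ("ADD", 5), ("ADD", 1)],   # PE 1's instruction stream
]
pc = [0, 0]     # per-PE program counters
acc = [7, 12]   # per-PE accumulators

for cycle in range(3):                                        # "loop forever"
    fetched = [prog[p] for prog, p in zip(pe_programs, pc)]   # 1. fetch
    for opcode in ("ADD", "AND"):            # 3. one conditional region each
        for pe, (op, operand) in enumerate(fetched):
            if op == opcode:                 # only interested PEs are active
                acc[pe] = acc[pe] + operand if op == "ADD" else acc[pe] & operand
    pc = [(p + 1) % len(prog) for prog, p in zip(pe_programs, pc)]  # 5. update pc
print(acc)

Note that every PE advances exactly one instruction per pass, matching the cycle structure described above.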
The interpretation overhead can be minimized in the following ways:

1) Minimizing the size of the conditional regions in the interpreted instruction set. This is done by using a uniformly encoded, RISC-like instruction set with a small number of simple instructions and restricted addressing modes [9].
2) Combining (reusing) simple operations to realize more complex operations. Each PE may take part in more than one of the execution regions, effectively reducing the PE idle time. Thus, a small number of simple functional units support the realization of a considerably more complex instruction set [7], [11].
3) Using compiler based techniques to slide code segments in order to isolate common expressions. Common expressions can be run in data parallel mode, minimizing the need for interpretation. This technique allows the machine to run in its native data parallel mode as long as possible [12].
4) Issuing a varying subset of instructions every cycle based on the probability and costs of the instructions. Thus, an infrequent high-cost instruction is interpreted less often than a high-probability low-cost instruction.

In this paper, we examine the construction of CCAs that support the execution of varying instruction subsets each cycle. Such a central control algorithm is called a variable issue control algorithm. While a variable issue control algorithm may temporarily impede the throughput for a few PEs, total system performance can be improved because a faster cycle time is achieved and exploited by the remaining PEs. Determining which subset of instructions to issue during each iteration of the control algorithm is known as the Variable Instruction Issue (VII) problem. These subsets can be bound to iterations of the control algorithm at three different times:

1) when the machine is booted, based on large samples of instruction distributions;
2) when programs to be executed are loaded, based on compile time statistics and profile data gathered from previous runs; or
3) at execution time, based on compile time statistics and profile data gathered from previous runs continuously augmented with dynamic branching information.

The first method assumes that all programs follow a similar instruction profile and thus penalizes instruction
streams that do not conform to the expected profile. The second method favors tasks that are compiled together and expected to execute together. This method assumes that the profile of instruction use is static throughout the lifetime of the loaded set of tasks. The last method (called dynamic instruction scheduling) is the most responsive to dynamic changes in the frequency of operation use. Unfortunately, it is also very expensive to implement as each iteration of the control loop must collect instruction profile data and reconfigure based on the collected data [13]. This paper builds a mathematical model of the control algorithm and uses it to examine variable instruction issue. The model is used to develop two objective functions quantifying the performance of the system. The problem of finding the optimal issue schedule for these objective functions has been shown to be NP-complete [14]. Accordingly, we develop and evaluate heuristics that provide near optimal schedules in acceptable times. The remainder of this paper is organized as follows. Section 2 presents related work. The mathematical model for the interpretation process and the formulation of two objective functions is presented in Section 3. In Section 4, the model is analyzed in conjunction with the objective functions and an optimal single cycle algorithm is developed. Section 5 presents the heuristic algorithms built for near optimal solutions for the VII problem. Empirical performance results of the algorithms are reported in Section 6. Finally, Section 7 contains some closing remarks.
2 RELATED WORK

A number of researchers have explored MIMD interpretation on SIMD or related techniques for programming SIMD machines. The earliest work in this area is the implementation of parallel combinator reduction for supporting execution of functional programs [3], [7], [15]. Additional work has been conducted in the interpretation of logic programming languages [16], [17]. More recently, studies have considered the simulation of logic circuits by interpretation on a SIMD processor [15]. Logic circuits consisting of AND and OR gates are loaded as data on each PE. The CU alternately broadcasts the instructions necessary to interpret AND and OR, with PEs masking in and out of execution as required. Since there are only two functionalities in the control loop, a high degree of parallelism is achieved.

Collins [5] proposes interpretation as a more efficient method for executing where statements present in language implementations for the Connection Machine [18]. These where statements are case driven conditional statements where different sets of PEs execute different portions of the code, depending on the value of the condition expression. In experiments with the CM-2, Collins reports that very few parallel branches are required (in two examples, only six) before interpretation becomes superior to direct SIMD execution. Collins studies many aspects of the interpretation process, including a restricted case of the variable instruction issue problem: the delayed execution of a single expensive instruction. This analysis is a special case of the general optimization problem we analyze.
Dietz and Cohen [6] developed an instruction set, complete with communication instructions, for interpretation. Their group built an interpreter for it on the MasPar MP-1 SIMD machine [19], [20] and showed good performance on selected applications. They considered polling the PEs before issuing an instruction to ensure that there is at least one PE interested in it. This polling is achieved via a global-or instruction. However, since the number of PEs is large, odds are that each instruction, no matter how rare, would occur in at least one of the instruction streams. Polling simply discovers that all the instructions are needed [5], [21]. Dietz also investigated a compiler optimization scheme called Common Subexpression Induction (CSI) [12]. The CSI optimization aligns execution threads, delaying some, in order to isolate common portions of code for data-parallel execution, maximizing the fraction of the time the machine operates in its native SIMD mode. More recently, Dietz and Cohen explored frequency biasing of the issue of instructions, issuing instructions with a frequency proportional to their probabilities [22].

Nilsson and Tanaka [8] provide a theoretical treatment of the interpretation process. In their analysis, a slightly different structure for interpretation is considered (Fig. 2). More precisely, the interpreter studied by Nilsson and Tanaka fetches new instructions after every distinct opcode execution. In contrast, most recent work moves steps 1, 2, 4, and 5 of Fig. 2 outside of the inner loop to achieve a more efficient interpreter (Fig. 1). However, keeping these functionalities coupled with each opcode makes it possible for a PE to advance more than one instruction for each iteration. Nilsson and Tanaka argue that it is not useful to issue any particular instruction more than once inside the control algorithm. Based on this assumption, they develop the criteria for the optimal permutation of instructions within the single cycle. They consider the costs of instructions to be equal and find the permutation that will allow the PEs to advance as many instructions as possible within the single pass. This permutation is used for all iterations (no variable scheduling).

loop forever
    foreach instruction A in the instruction set
        forall PEs waiting on instruction A
            1. fetch instruction.
            2. decode operands.
            3. interpret A.
            4. store results.
            5. update pc.
        end forall
    end foreach
end loop

Fig. 2. Central control algorithm used by Nilsson and Tanaka.
3 THE MODEL

In this section, a mathematical model for the interpretation process is presented [23]. The parameters of the problem and the notation used for these parameters are defined and explained. The state of the machine is defined and the state transition function is derived. In addition, two objective functions that measure the quality of instruction schedules are constructed.

The model allows the use of a pragmatic scheduling theory approach for solving the VII problem. The general scheduling problem has been shown to be NP-complete [24]. Moreover, most approaches used for solving scheduling theory based models are based on branch and bound algorithms [24], [25]. Branch and bound algorithms are enumeration based and, hence, are exponential in the size of the problem. Coffman [24] notes that queuing theory [26] provides a complementary approach to scheduling theory. However, queuing theory is more concerned with the analysis of algorithms in a stochastic environment where service and arrival times are random variables governed by probability distribution functions. The models studied successfully by queuing theory are usually simple because of the great demand on mathematical tractability [24]. Attempts at expressing this problem in a queuing theory context have not been successful.
3.1 Parameters and Notation

Let each iteration of the control loop be numbered 1, 2, …, l and let M be the total number of PEs in the machine. In addition, let the functionalities (instructions) in the interpreted architecture be numbered 1, 2, …, k. In the remainder of this section, the integral variables of the problem are defined and expanded.
3.1.1 The Issue Vector

The issue variable, $x_{i,t}$, is a binary variable taking the value 1 if instruction i is to be issued during control iteration t and 0 otherwise. The issue vector, $X_t = \langle x_{1,t}, x_{2,t}, \ldots, x_{k,t}\rangle$, is a k-tuple of binary variables completely defining the subset of instructions issued at iteration t. A collection of issue vectors, $X_1, X_2, \ldots, X_l$, is an issue schedule and represents a solution to the VII problem. The set of instructions with an issue variable value of 1 at iteration t is termed the issue set for iteration t, while the complementary set is termed the blocked set.
3.1.2 The Probability of Arrival of Instructions

Once an instruction is serviced by the interpreter, the instruction immediately following it in the instruction stream is said to have arrived. The probability of the arrival of instruction i at a PE that was active during iteration t - 1 is denoted $p_{i,t}$. If the processor was not active during iteration t - 1, it retains its current instruction. The probability $p_{i,t}$ is an important but difficult to quantify term. Its value is affected by the current instruction, the application, the programming language, and the compiler. A proper evaluation of this probability requires extensive profiling and analyzing of traces of the application. This evaluation process has been previously studied in the compiler theory field, particularly for the use of profiling compilers [27], [28]. For our analysis, a static probability distribution is used. Thus, the term $p_{i,t}$ reduces to $p_i$ for all t and $P = \langle p_1, p_2, \ldots, p_k\rangle$ represents the probability vector. As will be shown later, the model of the control algorithm does not preclude the use of a more robust probability model.
3.1.3 The Cost of Instruction Issue

We denote the cost for issuing (executing) instruction i as $c_i$. Therefore, the cost vector is denoted as $C = \langle c_1, c_2, \ldots, c_k\rangle$. The cost of the variable region of iteration t (step 3 in Fig. 1) then is $\sum_{i=1}^{k} x_{i,t} \cdot c_i$. We assume that the granularity of the instructions in the instruction set and the interpreter are identical (i.e., the execution regions for different instructions in the interpreter are distinct and not overlapping).
3.1.4 The Static (Minimum) Cost

We define $\mu$ as the fixed cost associated with the control loop. This cost accounts for time spent outside the functional blocks (steps 1, 2, 4, and 5 in Fig. 1 and the condition evaluations in step 3). If no instructions are issued, the loop executes in $\mu$ time units. Since $\mu$ does not contribute to the execution part of the interpretation, it is considered an overhead associated with interpretation. Since $\mu$ is a constant cost incurred in each iteration of the control loop, it moderates the sensitivity of the schedule to the costs and probabilities of instructions. If $\mu$ is small, the schedule becomes more sensitive to the cost and probability ratios of the instructions. Conversely, if $\mu$ is large, the schedule becomes less sensitive to the cost and probability ratios. At high overheads, variable instruction scheduling is not needed and all of the instructions are issued every iteration. In practice, a well designed and well encoded interpretation loop may have an overhead in the range of 15-30% of the maximum CCA length, depending on the architecture [22], [29].
3.2 The State of the Model and the State Transition Function

The expected number of processors with instruction i as their next instruction at iteration t is denoted by $n_{i,t}$. The instruction distribution vector, defined as $N_t = \langle n_{1,t}, n_{2,t}, \ldots, n_{k,t}\rangle$, is a complete specification of the state of the machine at iteration t. As shown in (1), the state of the machine is a function of the previous state, the probability distribution function (which may be a function of any number of past states), and the issue vector of the previous state:

$$n_{i,t} = n_{i,t-1} \cdot (1 - x_{i,t-1}) + p_i \cdot \sum_{j=1}^{k} n_{j,t-1} \cdot x_{j,t-1}. \qquad (1)$$

Examining (1) more carefully, we see that the expected number of PEs waiting on instruction i at iteration t, $n_{i,t}$, is the sum of two terms:

1) $n_{i,t-1} \cdot (1 - x_{i,t-1})$, counting the PEs that had this instruction in the previous iteration, if this instruction was not issued in that iteration. These PEs have not been serviced, and therefore are still waiting on instruction i.
2) All PEs whose instruction was serviced at iteration t - 1 are now free to move to their next instruction. Some of these PEs will have instruction i. The second term measures the expected number of PEs with instruction i arriving next in their instruction stream.

The second term is a function of the probability model. This particular state transition function corresponds to the static probability distribution model. For other probability models this term is different.¹ The state transition function completes the definition of the mathematical model of the interpretation process. It describes the mechanics of the model, and it specifies how the future states of the machine are obtained from the present state and the current issue set. Consequently, the state transition function represents an integral part of the scheduling algorithms developed later in this paper. These algorithms make instruction issue decisions based on the state of the machine, and then use (1) to find future states.

1. For example, a different transition function results if we assume a conditional probability model where the probability of occurrence of instruction i, $p_{i,j}$, is a function of the previous instruction j. In this case, the state transition function is

$$n_{i,t} = n_{i,t-1} \cdot (1 - x_{i,t-1}) + \sum_{j=1}^{k} p_{i,j} \cdot n_{j,t-1} \cdot x_{j,t-1}. \qquad (2)$$

3.3 The Objective Functions

Quantifying the performance of a computer is an elusive goal. Hennessy and Patterson argue that there is no clear way to measure computer performance objectively [30]. Nevertheless, in this section, we propose two objective functions, argue their validity, and discuss the complexity of each.

3.3.1 The Rate of Instruction Throughput

There are multiple criteria for measuring the quality of an instruction schedule. The sustained throughput of the system, in instructions per unit time, is one such measure. Informally, schedule A is better than schedule B if it services more instructions per second than schedule B does. The number of instructions serviced by iteration t, $\tau_t$, may be expressed as

$$\tau_t = N_t \cdot X_t^T = \sum_{i=1}^{k} n_{i,t} \cdot x_{i,t}. \qquad (3)$$

The cost of iteration t, $\gamma_t$, is the length of the iteration in time units; it is the sum of the costs of the overhead part and the functional blocks that were issued:

$$\gamma_t = \mu + C \cdot X_t^T = \mu + \sum_{i=1}^{k} c_i \cdot x_{i,t}. \qquad (4)$$

The instantaneous throughput of the CCA at iteration t, $\rho_t$, is the ratio of the number of instructions serviced to the time it took to service them, or

$$\rho_t = \frac{\tau_t}{\gamma_t} = \frac{\sum_{i=1}^{k} n_{i,t} \cdot x_{i,t}}{\mu + \sum_{i=1}^{k} c_i \cdot x_{i,t}}. \qquad (5)$$

The total system throughput, R, is defined as the ratio of the total number of instructions serviced by all iterations to the total time it takes to service them. Thus,

$$R = \lim_{l\to\infty} \frac{\sum_{j=1}^{l} \tau_j}{\sum_{j=1}^{l} \gamma_j} = \lim_{l\to\infty} \frac{\sum_{j=1}^{l} \sum_{i=1}^{k} n_{i,j} \cdot x_{i,j}}{\sum_{j=1}^{l} \left(\mu + \sum_{i=1}^{k} c_i \cdot x_{i,j}\right)}, \qquad (6)$$

where l is the number of iterations.
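To illustrate how (1) and (3)-(6) interact, the following sketch evolves the expected-value state and accumulates the total throughput for a given issue schedule. It is illustrative only; the probability vector, costs, overhead, and schedules below are hypothetical values, not measurements from the paper.

def step(n, x, p, mu, c):
    """Apply transition (1); return (next state, tau_t of (3), gamma_t of (4))."""
    served = sum(ni * xi for ni, xi in zip(n, x))         # tau_t, eq. (3)
    cost = mu + sum(ci * xi for ci, xi in zip(c, x))      # gamma_t, eq. (4)
    n_next = [ni * (1 - xi) + pi * served                 # eq. (1)
              for ni, xi, pi in zip(n, x, p)]
    return n_next, served, cost

def total_throughput(schedule, n0, p, mu, c):
    """Finite-horizon version of R, eq. (6): total served over total time."""
    n, tau_sum, gamma_sum = list(n0), 0.0, 0.0
    for x in schedule:
        n, tau, gamma = step(n, x, p, mu, c)
        tau_sum += tau
        gamma_sum += gamma
    return tau_sum / gamma_sum

# hypothetical three-instruction profile: P, C, overhead mu, and M PEs
p, c, mu, M = [0.6, 0.3, 0.1], [1, 2, 16], 5, 1024
n0 = [M * pi for pi in p]          # assumed initial instruction distribution
all_ones = [[1, 1, 1]] * 60
delayed = [[1, 1, 1 if t % 4 == 3 else 0] for t in range(60)]  # delay inst. 3
print(total_throughput(all_ones, n0, p, mu, c))
print(total_throughput(delayed, n0, p, mu, c))

Because the model tracks expected counts rather than individual PEs, one step costs O(k) regardless of the machine size M.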
The number of possible schedules for a string of l iterations of a CCA that has k functional blocks is $2^{k \times l}$ (since $k \times l$ instructions need to be scheduled). The solution space for the problem is exponential in k and l, and finding the optimal total system throughput is NP-complete [14]. The argument against using throughput as the performance measure is that it treats all instructions equally, regardless of the amount of work they do. An analogous performance measure is the MIPS rating of sequential computers. While MIPS provides a reasonable measure of the performance of computers, a machine that executes one million floating point operations per second might register the same MIPS rating as one that executes one million integer operations per second. An execution rate of r complex instructions per second is better than an execution rate of r simple ones (since fewer of the complex instructions are needed to execute the same task). This is not reflected in pure throughput. Consequently, this objective function penalizes complex instructions since their higher cost is considered but their higher benefit is not.
3.3.2 PE Utilization

Another measure that incorporates both cost and benefit is the average fraction of time a processor spends actively engaged in the computation. This measure is valid for assessing the success of a schedule because maximizing the active time results in the best cost/benefit ratio. Every PE whose instruction gets issued is active for $\mu + c_i$ time units. The PEs whose instructions do not get issued do not benefit from any portion of the control loop.² Thus, the aggregate time that all the PEs spend doing useful work at iteration t, $a_t$, can be expressed as

$$a_t = \sum_{i=1}^{k} n_{i,t} \cdot (\mu + c_i) \cdot x_{i,t}. \qquad (7)$$

The total amount of time available for each processor to execute is $\mu$ plus the sum of the cost of the instructions that get issued. Therefore, the amount of time available for computation across all PEs at iteration t, $\lambda_t$, is

$$\lambda_t = M \cdot \left(\mu + \sum_{i=1}^{k} c_i \cdot x_{i,t}\right). \qquad (8)$$

The utilization at iteration t, $u_t$, is the total amount of useful processor time in proportion to the total available time at that particular iteration. Thus,

$$u_t = \frac{a_t}{\lambda_t} = \frac{\sum_{i=1}^{k} n_{i,t} \cdot (\mu + c_i) \cdot x_{i,t}}{M \cdot \left(\mu + \sum_{i=1}^{k} c_i \cdot x_{i,t}\right)}. \qquad (9)$$

By taking the sum of the time over all of the iterations of the CCA in proportion to the total available time, the average utilization of the system is obtained. Hence, the utilization objective function, U, is

$$U = \lim_{l\to\infty} \frac{\sum_{j=1}^{l} a_j}{\sum_{j=1}^{l} \lambda_j} = \lim_{l\to\infty} \frac{\sum_{j=1}^{l} \sum_{i=1}^{k} n_{i,j} \cdot (\mu + c_i) \cdot x_{i,j}}{\sum_{j=1}^{l} M \cdot \left(\mu + \sum_{i=1}^{k} c_i \cdot x_{i,j}\right)}, \qquad (10)$$

where l is the number of iterations. The solution space of possible schedules is $2^{k \times l}$. Finding the optimal schedule with respect to utilization is also an NP-complete problem [14].

2. Technically, they are active during the nonexecution parts of the loop. However, that is lost work because it has to be repeated again in the next iteration.

4 FINDING NEAR OPTIMAL SOLUTIONS TO THE VII PROBLEM

The NP-complete nature of the VII problem implies that the global optimal solution, for nontrivial cases, cannot be found in practice. Therefore, heuristics that produce near optimal schedules must be developed. In this section, we analyze the mathematical model of the interpretation process in conjunction with the objective functions. Lower bounds on the values of the objective functions for given instruction sets are derived. In addition, a polynomial time algorithm for optimizing the single cycle is developed. These results are used in the development and analysis of the heuristic algorithms presented in Section 5.

4.1 A Lower Bound

Any candidate schedule should perform at least as well as the all-ones schedule, $X = \langle 1, 1, \ldots, 1\rangle$. For this schedule all of the instructions are issued every iteration. Since this schedule is trivially available, we refuse to accept a schedule with a worse rate. The instantaneous throughput rate of the all-ones schedule is obtained by setting $x_{i,t} = 1$ in (5) for all i. This leads to

$$\rho_{\text{all-ones}} = \frac{\sum_{i=1}^{k} n_{i,t}}{\mu + \sum_{i=1}^{k} c_i} = \frac{M}{\mu + \sum_{i=1}^{k} c_i}, \qquad (11)$$

which is constant. Similarly, the all-ones utilization, $u_{\text{all-ones}}$, is obtained by setting $x_{i,t}$ to 1 in (9):

$$u_{\text{all-ones}} = \frac{\sum_{i=1}^{k} n_{i,t} \cdot (\mu + c_i)}{M \cdot \left(\mu + \sum_{i=1}^{k} c_i\right)}. \qquad (12)$$

Substituting the value for $n_{i,t}$ by applying (1) yields

$$u_{\text{all-ones}} = \frac{\mu + \sum_{i=1}^{k} p_i \cdot c_i}{\mu + \sum_{i=1}^{k} c_i}, \qquad (13)$$

which is also a constant. Since $\rho_{\text{all-ones}}$ and $u_{\text{all-ones}}$ are both constant, $\rho_{\text{all-ones}} = R_{\text{all-ones}}$ and $u_{\text{all-ones}} = U_{\text{all-ones}}$.

The only feasible nonvariable (or static) schedule is the all-ones schedule. Any other static schedule implies that there is at least one capability that is never issued (the schedules do not vary from one iteration to the next). When this capability arrives at a particular instruction stream, it is not serviced and the stream does not proceed further. Eventually all of the instruction streams are blocked as the unissued capability is encountered. It follows that the all-ones throughput and utilization represent the throughput and utilization attainable by the static instruction schedule and therefore provide useful metrics against which the variable schedules can be measured. In fact, we use the ratio of the throughput (utilization) of a schedule to the all-ones throughput (utilization) as a normalized figure for comparing the performance of the schedules. This ratio denotes the quality of the schedule.
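As a quick numeric check of (11) and (13), the bounds for a hypothetical profile can be computed directly; the quality of a candidate schedule is then its throughput (utilization) divided by the corresponding all-ones constant. All values below are illustrative.

# All-ones lower bounds, eqs. (11) and (13); p, c, mu, M are hypothetical.
p = [0.6, 0.3, 0.1]
c = [1, 2, 16]
mu, M = 5, 1024

rho_all_ones = M / (mu + sum(c))                                          # eq. (11)
u_all_ones = (mu + sum(pi * ci for pi, ci in zip(p, c))) / (mu + sum(c))  # eq. (13)
print(rho_all_ones, u_all_ones)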
4.2 The Optimal Single-Cycle Algorithm

The criterion for including an instruction in an arbitrary issue set of an iteration is developed and used to build a polynomial time algorithm for optimizing one iteration of the CCA. A potential problem with using this algorithm for global scheduling is identified. Finally, we comment on the significance of the existence of this polynomial time algorithm.

First, we ask the question: Given an issue set that does not include instruction i, how do we decide whether the addition of i results in a higher rate? Obviously, the answer is that if the rate with the addition of i is higher than the rate without i, then the inclusion of i increases the rate. More formally,

$$\rho_t\big|_{x_{i,t}=1} \ge \rho_t\big|_{x_{i,t}=0} \iff x_{i,t} = 1. \qquad (14)$$

Expanding the terms in the above equation we obtain

$$\frac{n_{i,t} + \sum_{j=1, j\ne i}^{k} n_{j,t} \cdot x_{j,t}}{c_i + \mu + \sum_{j=1, j\ne i}^{k} c_j \cdot x_{j,t}} \ge \frac{\sum_{j=1, j\ne i}^{k} n_{j,t} \cdot x_{j,t}}{\mu + \sum_{j=1, j\ne i}^{k} c_j \cdot x_{j,t}} \iff x_{i,t} = 1. \qquad (15)$$

With some manipulation this yields

$$\frac{n_{i,t}}{c_i} \ge \rho_t\big|_{x_{i,t}=0} \iff x_{i,t} = 1. \qquad (16)$$

The term $\frac{n_{i,t}}{c_i}$ is called the marginal throughput value (or marginal value when it is clear that the throughput objective function is being discussed). The same procedure applied to the utilization function yields the following instantaneous optimality criterion:

$$\frac{n_{i,t} \cdot (\mu + c_i)}{M \cdot c_i} \ge u_t\big|_{x_{i,t}=0} \iff x_{i,t} = 1. \qquad (17)$$

The term $\frac{n_{i,t} \cdot (\mu + c_i)}{M \cdot c_i}$ is denoted the marginal utilization value.

Returning to the throughput function, we note that if the throughput of a particular issue set without the inclusion of instruction i is $\frac{\tau_t}{\gamma_t}$, then the throughput with instruction i is $\frac{\tau_t + n_{i,t}}{\gamma_t + c_i}$. It is straightforward to prove that if $\frac{n_{i,t}}{c_i} \ge \frac{\tau_t}{\gamma_t}$, then the following inequality holds:

$$\frac{n_{i,t}}{c_i} \ge \frac{n_{i,t} + \tau_t}{c_i + \gamma_t} \ge \frac{\tau_t}{\gamma_t}. \qquad (18)$$

Consequently, if the marginal rate of an instruction is larger than the rate of a particular issue set that does not include that instruction, then the inclusion of the instruction in the issue set results in a new rate that is higher than the old rate but lower than the marginal rate of that instruction. Thus, if an instruction is in the optimal issue set then all instructions with higher marginal rate are also in that optimal set.
Conversely, if an instruction is not in the optimal issue set then neither are any instructions with a lower marginal rate. This reasoning leads to an algorithm of complexity O(k log k) for maximizing the instantaneous rate for a single cycle (Fig. 3). The algorithm begins every iteration by computing the marginal rates of all instructions. Starting with an empty issue set, the algorithm repeatedly considers whether the marginal rate of the instruction with the highest marginal rate among the unissued instructions is higher than the current rate. If it is, then the instruction is added to the issue set; otherwise, the issue set we have is the optimal issue set. Once the optimal set is obtained, the new state of the machine is evaluated using (1).

for each CCA iteration being scheduled
    cost = mu; number = 0; rate = 0;
    instruction = best_remaining_instruction();
    while (instruction.marginalrate >= rate) {
        instruction.taken = TRUE;
        number = number + instruction.number;
        cost = cost + instruction.cost;
        rate = number / cost;
        instruction = best_remaining_instruction();
    }
    compute_next_state();
end for

Fig. 3. An algorithm for maximizing the instantaneous throughput.
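A runnable rendering of Fig. 3's greedy selection is sketched below; it is an illustration of criterion (16), not the authors' implementation, and the state values are hypothetical.

def optimal_single_cycle(n, c, mu):
    # Greedy selection of Fig. 3: visit instructions in order of decreasing
    # marginal throughput value n_i/c_i and apply criterion (16).
    order = sorted(range(len(n)), key=lambda i: n[i] / c[i], reverse=True)
    issued = [0] * len(n)
    number, cost = 0.0, float(mu)
    for i in order:
        if n[i] / c[i] >= number / cost:   # marginal rate beats current rate
            issued[i] = 1
            number += n[i]
            cost += c[i]
        else:
            break                          # all later marginal rates are lower
    return issued, number / cost

# hypothetical machine state (expected PE counts), costs, and overhead
n = [230.0, 170.0, 40.0, 240.0, 110.0, 95.0, 90.0, 49.0, 0.1]
c = [3, 1, 1, 1, 16, 1, 1, 16, 6]
x, rate = optimal_single_cycle(n, c, mu=5)
print(x, rate)

The early exit is justified by (18): once one instruction fails the test, every remaining instruction has a lower marginal rate and must fail as well.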
The shortcomings of the optimal instantaneous rate algorithm are evident when the case of a long instruction with a low probability of occurrence is considered. The marginal rate is extremely low for such an instruction and, hence, we call it a low marginal rate (or offensive) instruction. The criterion for the issue of an instruction is that its marginal rate is higher than the maximum attainable rate without it. Consequently, the marginal rate for this instruction has to rise at least above the lower bound (and significantly higher than that in practice) before it is issued. Since its probability of occurrence is small, a large number of iterations is required before the instruction has a marginal rate qualifying it for scheduling. Meanwhile, more PEs are continually being blocked by new instances of the offensive instruction appearing in their instruction streams.

Fig. 4 shows a plot of the throughput achieved by the optimal single-cycle throughput algorithm for a number of consecutive cycles. The instruction set used has a low marginal rate instruction. The effect of the low marginal rate instruction on the rate is evident. There are clearly identifiable phases (the first of which spans cycles 0 to 140) during which the rate decreases continuously. These phases correspond to the times between issues of the low marginal rate instruction. Once the instruction is issued, the processors it is blocking are freed and the overall rate rises. Examining the figure more carefully, we see that there are phases within the main phases, corresponding to another, less offensive, low marginal rate instruction.
Fig. 4. The effect of low marginal value capabilities.
The existence of a polynomial time algorithm for optimizing the single cycle is an important result that is utilized, in one form or another, by most other heuristic algorithms considered in the remainder of this paper. The reduction of the complexity of the optimization of the single cycle, from exponential (in the number of instructions) to polynomial, is possible because there is no temporal dependency among the instructions of the same cycle. Once the state of the machine is known, the decision on whether to issue instructions does not affect this state. The marginal rates are preserved and the decision to issue an instruction is independent of the decisions taken on other instructions. This is not the case for instructions of different cycles. The decision to issue an instruction in a given cycle affects all future states of the machine and, consequently, the scheduling decisions for the instructions in all future cycles.
5 THE ALGORITHMS

In this section, we present two heuristic algorithms that produce instruction issue schedules that optimize, or nearly optimize, the objective functions. The Improving Lower Bound (ILB) family of algorithms is a set of polynomial time algorithms based on the optimal single-cycle rate algorithm. ILB algorithms attempt to isolate undesirable trends in the schedule (such as low marginal rate instructions that are being blocked too long) and relax the solution to avoid these trends. The sliding window, semi-exhaustive algorithm optimizes multiple cycles together exhaustively, providing a better piece-wise optimization than the optimal single-cycle rate algorithm.
5.1 The Improving Lower Bound Algorithm

We have discussed one of the problems associated with optimizing the instantaneous rate: low marginal rate instructions. The Improving Lower Bound Algorithm (ILB) is an iterative algorithm that specifically addresses this problem. The first iteration finds the schedule obtained by optimizing the instantaneous rate at each cycle. Each successive iteration of this algorithm uses the overall rate obtained by the preceding iteration as a lower bound. Whenever a cycle has an instantaneous rate lower than this bound, the solution at that cycle is relaxed by adding other instructions to
the issue set. This process is repeated until no further improvement on the overall rate is obtained. The inner loop of this algorithm is identical to the optimal single-cycle algorithm. In addition, once the optimal single-cycle rate is obtained, it is compared to the lower bound; if it is lower, the solution is relaxed. The differences in the flavors of this algorithm are mainly in the relaxation step. This algorithm generates a better solution than the optimal single-cycle rate algorithm since its first iteration is the complete run of that algorithm, with subsequent iterations only accepting schedules with better overall rates.

Fig. 5 shows a run of an ILB algorithm for a sample instruction set. There is a gradual decline in the instantaneous rate up to the point of relaxation. This decline is due to the number of available PEs decreasing as unissued instructions block their continuation. After each relaxation point, there is, typically, a noticeable increase in the instantaneous rate. This increase occurs because the relaxation process clears the system from the effects of unissued instructions. At the point of relaxation, the instantaneous solution is worse than the lower bound. However, this loss is balanced by the gain in the throughput of future cycles. In later iterations of the algorithm, as the lower bound approaches the global optimal rate, the instantaneous loss becomes more expensive than the gain and the algorithm terminates. In several experiments (using relaxation criteria described below), we observed that the number of iterations the algorithm needed before reaching such a point was small (fewer than five).

Fig. 5. An Improving Lower Bound algorithm at work.

There is no one clear method for relaxing the schedule at the point when the optimal instantaneous rate falls below the current lower bound. In fact, by choosing the criteria for relaxing the schedule, we can tune the solution to address a specific concern or a problem. The following relaxation criteria are studied:

1) Issue All. Once the lower bound exceeds the optimal instantaneous rate, all of the instructions are issued (an all-ones cycle). Thus, all of the PEs are released. Once an all-ones cycle is issued, the system state returns to the initial state and no additional relaxation is required. The final schedule is very small (since it is periodic with a period size of p cycles, where p is the number of cycles before the first relaxation).

2) Issue the Next Best Instruction. This method attempts to relax the solution without excessively penalizing the current rate. Relaxation is achieved by adding the instruction with the best marginal rate in the blocked set to the issue set.

3) Issue the Longest Waiting Instruction. This algorithm relaxes the solution by adding the instruction that has not been issued for the longest number of iterations. Of the three criteria discussed so far, this one addresses the low marginal rate problem most directly. The longest waiting instruction is likely to be an offensive instruction because offensive instructions typically have a low probability and they are often delayed for a long time before they have a detrimental effect on the rate of the system.

4) An Intelligent (Informed) Relaxation Criterion. While the above three criteria attempt to relax the cycle in intuitively useful ways, none of them takes into account the particulars of the instructions being scheduled. This criterion attempts to evaluate candidate instructions for relaxation by comparing the gain obtained from delaying the instruction to the gain obtained from issuing it (see the sketch after this list). The gain values are approximate. They are obtained by considering the following restricted case: What is the gain obtained from delaying an instruction this iteration, given that it will be issued next iteration? The answer to that question is the free service (next iteration) of instances of the same instruction arriving at other PEs, or

$$\mathrm{gain}^{\mathrm{block}}_{j,t} = p_j \cdot \sum_{i=1}^{k} n_{i,t} \cdot x_{i,t}. \qquad (19)$$

By the same token, what is gained from issuing the instruction this cycle? The free service of instructions that arrive on PEs that have this instruction, provided these instructions get serviced in the next iteration. Thus,

$$\mathrm{gain}^{\mathrm{issue}}_{j,t} = n_{j,t} \cdot \sum_{i=1}^{k} p_i \cdot x_{i,t+1}. \qquad (20)$$

If the gain from issuing the instruction is higher than the gain from blocking it, the instruction is issued.
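A sketch of the informed test follows; since $x_{i,t+1}$ is not known when the decision is made, the caller must supply an estimate of the next issue vector (e.g., the current one), an assumption the criteria above leave open.

def informed_relaxation_issue(j, n, p, x, x_next):
    """Compare eqs. (19) and (20) for blocked instruction j: issue j this
    cycle iff the issue gain exceeds the blocking gain."""
    gain_block = p[j] * sum(ni * xi for ni, xi in zip(n, x))       # eq. (19)
    gain_issue = n[j] * sum(pi * xi for pi, xi in zip(p, x_next))  # eq. (20)
    return gain_issue > gain_block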
5.2 The Sliding Window, Piece-Wise Exhaustive Algorithm

Since it sometimes sacrifices the long term gain for an immediate gain, the optimal instantaneous (single-cycle) rate strategy does not provide a globally optimal schedule. The Sliding Window Algorithm optimizes multiple cycles together, providing better global schedules. If the window size being considered is l iterations, the optimal solution is obtained. However, this algorithm is exponential in both k and W, where W is the window size in cycles. There is no polynomial time algorithm to find the optimal multicycle schedule for window sizes greater than one, since the scheduling of iteration t + 1 is a function of iterations t, t - 1, …, 1. Thus, optimizing W iterations of k capabilities each involves exhaustively evaluating $2^{k \cdot W}$ schedules.

This algorithm increases the size of the unit with which we perform the piece-wise optimization. It follows that it suffers from some of the same problems suffered by the optimal single-cycle algorithm at small window sizes. For example, the problem of low marginal rate instructions still exists. The larger the window size, the better the solution, but the more complex the computation. The limitations of this approach become clearer as one considers that the final cycle in each window is scheduled exactly like the optimal single-cycle rate algorithm. Cycles before the last cycle in the window have a progressively deeper look into the future.
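The window search admits a direct, if tiny-scale, sketch: enumerate all $2^{k \cdot W}$ issue assignments for the next W cycles under the model of Section 3 and keep the best. All values are hypothetical, and only toy sizes terminate.

from itertools import product

def step(n, x, p, mu, c):
    # model of Section 3: eq. (1) transition; returns (next state, tau, gamma)
    served = sum(ni * xi for ni, xi in zip(n, x))
    cost = mu + sum(ci * xi for ci, xi in zip(c, x))
    return [ni * (1 - xi) + pi * served for ni, xi, pi in zip(n, x, p)], served, cost

def best_window(n, p, mu, c, W):
    """Exhaustively score all 2^(k*W) schedules of the next W cycles and
    return the highest-throughput one (exponential: toy sizes only)."""
    k, best, best_rate = len(n), None, -1.0
    for schedule in product(product((0, 1), repeat=k), repeat=W):
        state, tau_sum, gamma_sum = n, 0.0, 0.0
        for x in schedule:
            state, tau, gamma = step(state, x, p, mu, c)
            tau_sum += tau
            gamma_sum += gamma
        if tau_sum / gamma_sum > best_rate:
            best, best_rate = schedule, tau_sum / gamma_sum
    return best, best_rate

# toy example: k = 3 instructions, window of W = 2 -> 2^6 = 64 schedules
print(best_window([600.0, 300.0, 100.0], [0.6, 0.3, 0.1], 5, [1, 2, 16], 2))

A sliding scheduler would keep only the first issue vector of the winning window, advance the state by one cycle, and re-optimize.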
6 EXPERIMENTAL STUDY

A suite of schedulers, one corresponding to each of the algorithms presented earlier, was developed. The schedulers were based on the machine model developed in Section 3. Representative instruction set probability and cost profiles were selected and used by the schedulers to build variable instruction issue interpreters for the instruction set.
6.1 The Instruction Sets

When building a variable instruction issue schedule, it is imperative to understand the probability profile governing the occurrence of instructions in the execution streams. If the probability function is not accurate, the instruction issue scheduling may not result in any speedup of the computation, and may even impede it. The fact that the target of our schedule is a massively parallel machine suggests, according to the central limit theorem, that deviations in the overall probability model are small [31].
TABLE 1
INSTRUCTION SET PROFILING INFORMATION USED IN THE SIMULATION

        |                        Probability                               | Cost
Inst.   | Cmprss  Agrep  Partition  Route   M.C.   Espresso  Simplex  Avg.  |
Branch  | 23.1    12.00  32.50      24.17   18.98  15.82     19.21    20.8  |  3
Or      | 16.8    10.20  16.50       6.3    18.58  10.67     16.35    13.63 |  1
And     |  0.7     9.9    0.2        0.38    9.94   6.62      1.96     4.24 |  1
Add     | 24.7    27.70  17.40      38.75   11.47  22.41     25.22    23.95 |  1
Load    | 11.3    20.40  31.9       12.46    6.58  22.91     13.43    17.00 | 16
Shift   |  9.5    19.05   0.29       1.3    11.85   6.85     10.87     8.52 |  1
Sethi   |  9.00    0.05   0.30       1.71    5.25   5.76      1.81     3.4  |  1
Store   |  4.89    0.70   0.9       14.92    5.77   8.74      9.09     6.4  | 16
Mult.   |  0.01   0       0.01       0.01   11.58    .22      2.6      2.06 |  6

While the particular probability profile is not the focus here (the scheduling can be performed for any profile), we nevertheless seek to schedule an instruction set appropriate for the target paradigm, MIMD interpretation on SIMD. There have been several attempts at the design of instruction set architectures for concurrent interpretation [6], [8], [9], [11], [32]. While none of these attempts have provided a clearly superior approach, trends in the instruction set design became apparent. A suitable instruction set for this paradigm has a small number of capabilities, with uniform encoding, and a limited number of addressing modes. With this in mind, the instruction set profile for the experiment was selected. This profile, shown in Table 1, was obtained by analyzing several string processing and integer mathematics applications on a SUN Sparc 2 workstation using the spixtools profiling package [33]. Since the Sparc architecture supports a relatively large number of instructions (too many for efficient interpretation), some classes of instructions were grouped together as one instruction. For example, the branch statistics were all grouped together. The values listed in the cost column are the approximate execution times for these instructions on the MasPar MP-1 machine (relative to the execution time of an integer arithmetic operation) [19], [20]. The instruction set with the probability profile shown in Table 1 is referred to as instruction set 1.

The case of a CCA that supports system calls for paging and communication is also considered. This case adds offensive (low marginal rate) capabilities to the interpreted instruction set and thus allows the evaluation of the performance of the algorithms on this important special case. The costs for the communication and paging capabilities are chosen³ to be 20 and 50 time units, respectively. Page faults and communication requests are assumed to be exponentially distributed [10] with a mean of 0.01 for communication and 0.005 for paging. The probability of a page fault according to this model is

$$P(P.F.) = 1 - e^{-a}, \quad a = .005. \qquad (21)$$

The probabilities in Table 1 are adjusted to allow for these capabilities (to make the probabilities sum to 1). The instruction set with the modified profile for communication and paging is called instruction set 2.

3. These figures vary with the implementation and the packet and page sizes.

6.2 Implementation and Results

Parametrized variable instruction schedulers, based on the algorithms presented in Section 5, are built. The schedulers assume a model of the machine based on (1) and gather statistics about the quality of the schedules as they are constructed. A simulated configuration is described by the number of PEs, the instruction set profile, and the interpretation overhead (μ). The effect of the interpretation overhead on the scheduling performance is studied by scheduling the same instruction set profiles using two different overhead values. The overhead, while related to the characteristics and functionalities of the instruction set, is largely a function of the encoding and organization of the design. Thus, a study of the effect of the overhead emphasizes the importance of an efficiently organized design. The experiments show results for two distinct overhead values. The overhead values used are five and 30 time units, corresponding to an efficiently encoded instruction set and a less efficient one, respectively.

Fig. 6 shows the global throughput value for schedules obtained using the different algorithms for all four instruction set variations.⁴ The Optimal Single-Cycle Algorithm achieves a significant improvement over the static schedule throughput for all four instruction sets studied. This result alone serves as strong evidence that VII is useful in enhancing the performance of this paradigm. The Improving Lower Bound algorithms, as discussed earlier, can fare no worse than the optimal single-cycle throughput. Furthermore, in most cases they show considerable enhancement in the throughput value. It is also clear from the graph that the semi-exhaustive algorithm performs better than the optimal single-cycle algorithm.

Fig. 7 shows the results for the utilization algorithms. The instruction sets with the higher overhead value achieve lower throughput rates than their lower overhead counterparts. Surprisingly, the reverse is true for the utilization schedules. This occurs because the overhead value is considered part of the active time of scheduled processors. Moreover, this fact causes the utilization based algorithms to be liberal, rewarding the issue of more instructions by adding the overhead to their benefit value. This is evident from the high utilization achieved by the issue-all ILB algorithm—the most liberal of the relaxation criteria studied.

4. Execution time costs of the semi-exhaustive algorithm prevent reporting results for all four test cases.
Fig. 6. Comparison of the throughput performance of the algorithms.
Fig. 7. Comparison of the utilization performance of the algorithms.

Fig. 8. Actual run-time results of an improving lower bound algorithm.
Fig. 9. The effect of increasing the window size on the semi-exhaustive algorithm performance.
Fig. 8 and Fig. 9 show run-time cycle throughput rates achieved using the improving lower bound and the sliding window algorithms, respectively. The results are shown for instruction set 2 with an overhead of 30. The ILB algorithm uses the longest waiting instruction criterion for relaxation. The ILB plot (Fig. 8) shows the first two iterations. The algorithm successfully avoids the pattern caused by the offensive instructions by relaxing early. However, the solutions at the relaxation points are very low, suggesting that the relaxation criterion relaxes some instructions too early. The sliding window plot (Fig. 9) displays the same pattern of decreasing rates between issues of the offensive instruction. However, with its limited peek into the future, the algorithm schedules the offensive instructions significantly earlier than the single cycle algorithm (at cycle 80 instead of 140 for this example).

The average run times for the algorithms to schedule 100 cycles are shown in Table 2. The times were obtained using the Unix getrusage() function. In general, each algorithm was run several times for 10,000 cycles and the times averaged and normalized to 100 cycles to eliminate clock inaccuracies. The exception was the semi-exhaustive algorithms, whose high run time costs did not allow such averaging. The run time required by the semi-exhaustive algorithm is orders of magnitude higher than any of the other algorithms, even at the small window sizes used. The performance of these algorithms is not significantly better than the ILB algorithms, making it difficult to justify their run time cost. It has already been noted that the semi-exhaustive algorithm performs better with higher window sizes. Unfortunately, window sizes higher than three do not terminate in a practical time. For example, the projected run time for a window size of four optimization for instruction set 2 is 5.28 years.⁵ Clearly, this is not a practical solution.
TABLE 2
THE RUN TIMES OF THE ALGORITHMS

                    |      Run Time (Sec.)
Algorithm           | Inst. Set 1 | Inst. Set 2
Static              | 0           | 0
Instant. Optimal    | .011        | .012
Relax All           | .037        | .045
Next Best           | .031        | .037
Longest Waiting     | .026        | .030
Semi-Exhaustive (2) | 5.9         | 26
Semi-Exhaustive (3) | 2172        | 79360
5. This value is obtained by scaling up the time for the window size 3 by 2¹¹.
7 CONCLUDING REMARKS

In this paper, a model for the interpretation process central to the MIMD interpretation on SIMD paradigm was developed [5], [6], [7], [8], [9]. Two objective functions, Throughput and Utilization, that quantify the performance of a variable schedule were introduced and analyzed. Since the Variable Instruction Issue problem is NP-complete [14], heuristic algorithms that optimize either of the objective functions in acceptable times were developed. The performance of the algorithms on four different instruction sets was studied.

We have conclusively shown that Variable Instruction Issue is useful and feasible for optimizing the performance of MIMD interpretation on SIMD machines. The results obtained were consistent over a variety of instruction sets and measured using two objective functions thought to be good benchmarks for the schedule performance. The throughput function treats all instructions equally and, thus, penalizes complex instructions, since their higher cost is considered but their higher benefit is not. The utilization function considers both the cost and the benefit of the instructions. The two objective functions are both convincing measures of the performance of the system. However, they each optimize different resources that may not correlate well. For example, instruction sets with high overhead values have good utilization but bad throughput values and vice versa.

With the exception of the sliding window algorithm, the algorithms developed to solve this NP-complete problem are polynomial time algorithms. The basic algorithm optimizes the single cycle schedule nonexhaustively. This algorithm provides significant improvement in performance over the static schedule and is efficient. The existence of a polynomial time optimal algorithm for the single cycle is an important result that is utilized in subsequent greedy algorithms.

A major problem with the schedules obtained via the optimal single-cycle algorithm, namely the low marginal rate instruction problem, was recognized. A family of algorithms targeted at this problem was developed. This family of algorithms, called the Improving Lower Bound Algorithms (ILBs), is a polynomial time suite of algorithms that relaxes the optimal schedule whenever the rate falls below the current lower bound. By setting the lower bound to be the average rate of the previous iteration, solutions better than the previous average are ensured. The difference between these algorithms is in the technique by which the solution is relaxed. By varying the criteria of the relaxation, the algorithm can be tuned to address different problems in the schedule. The improving lower bound algorithms are the most promising of the algorithms studied in this paper. Unfortunately, the intelligent relaxation criterion did not perform significantly better than the other relaxation criteria.

The semi-exhaustive algorithm achieves better solutions than the optimal single-cycle rate algorithm even at small window sizes. It suffers from the same problem the single-cycle algorithm suffers from (the limited future outlook delaying the issue of offensive instructions) and would benefit from a relaxation-based algorithm like the improving lower bound algorithm. With larger window sizes, considerably better instruction schedules are produced by the semi-exhaustive algorithm. However, the ILB algorithms produce schedules with comparable performance at a fraction of the run time to compute them.
REFERENCES

[1] G.H. Barnes, R.M. Brown, M. Kato, D.J. Kuck, D.L. Slotnick, and R.A. Stokes, "The ILLIAC-IV Computer," IEEE Trans. Computers, vol. 17, pp. 746–757, 1968.
[2] W.D. Hillis, The Connection Machine. Cambridge, Mass.: The MIT Press, 1985.
[3] W.D. Hillis and G.L. Steele Jr., "Data Parallel Algorithms," Comm. ACM, vol. 29, pp. 1,170–1,183, Dec. 1986.
[4] R.M. Hord, Parallel Supercomputing in SIMD Architectures. Boca Raton, Fla.: CRC Press, 1990.
[5] R.J. Collins, "Multiple Instruction Multiple Data Emulation on the Connection Machine," Technical Report CSD-910004, Dept. of Computer Science, Univ. of California, Los Angeles, Feb. 1988.
[6] H.G. Dietz and W.E. Cohen, "A Massively Parallel MIMD Implemented by SIMD Hardware," Technical Report EE 92-4-P, School of Electrical Eng., Purdue Univ., West Lafayette, Ind., Jan. 1992.
[7] P. Hudak and E. Mohr, "Graphinators and the Duality of SIMD and MIMD," Proc. 1988 ACM Symp. Lisp and Functional Programming, 1988.
[8] M. Nilsson and H. Tanaka, "MIMD Execution by SIMD Computers," J. Information Processing, vol. 13, no. 1, pp. 58–61, 1988.
[9] P.A. Wilsey, D.A. Hensgen, N.B. Abu-Ghazaleh, C.E. Slusher, and D.Y. Hollinden, "The Concurrent Execution of Non-Communicating Programs on SIMD Processors," Proc. Fourth Symp. Frontiers of Massively Parallel Computation, pp. 29–36, Oct. 1992.
[10] R.A. Finkel, An Operating System VADE MECUM. Englewood Cliffs, N.J.: Prentice Hall, 1988.
[11] R.A. Bagley, P.A. Wilsey, and N.B. Abu-Ghazaleh, "Composing Functional Unit Blocks for Efficient Interpretation of MIMD Code Sequences on SIMD Processors," Parallel Processing: CONPAR 94–VAPP VI, B. Buchberger and J. Volkert, eds., vol. 854, Lecture Notes in Computer Science, pp. 616–627, Springer-Verlag, Sept. 1994.
[12] H.G. Dietz, "Common Subexpression Induction," Proc. 1992 Int'l Conf. Parallel Processing, vol. II, pp. 174–182, Aug. 1992.
[13] W. Shu and M.-Y. Wu, "Asynchronous Problems on SIMD Parallel Computers," IEEE Trans. Parallel and Distributed Systems, vol. 6, no. 7, pp. 704–713, July 1995.
[14] X. Fan, N.B. Abu-Ghazaleh, and P.A. Wilsey, "On the Complexity of Scheduling MIMD Operations for SIMD Interpretation," J. Parallel and Distributed Computing, vol. 29, no. 8, pp. 91–95, Aug. 1995.
[15] S.A. Kravitz, A.P. Mathur, and V. Rego, "Logic Simulations on Massively Parallel Architectures," Proc. 16th Ann. Int'l Symp. Computer Architecture, pp. 336–343, 1989.
[16] P. Kacsuk and A. Bale, "DAP Prolog: A Set-Oriented Approach to Prolog," The Computer J., vol. 30, no. 5, pp. 393–403, 1987.
[17] M. Nilsson and H. Tanaka, "Massively Parallel Implementation of Flat GHC on the Connection Machine," Proc. Int'l Conf. Fifth Generation Computer Systems, pp. 1,031–1,039, 1988.
[18] Connection Machine CM-200 Series Technical Summary. Cambridge, Mass.: Thinking Machines Corporation, 1991.
[19] MasPar MP-1 Architecture Specification. Sunnyvale, Calif.: MasPar Computer Corporation, Mar. 1991.
[20] J. Nickolls, "The Design of the MasPar MP-1," Proc. 35th IEEE CS Int'l Conf., COMPCON '90, pp. 25–28, 1990.
[21] D.Y. Hollinden, D.A. Hensgen, and P.A. Wilsey, "Experiences Implementing the MINTABS System on a MasPar MP-1," Proc. Third Symp. Experiences with Distributed and Multiprocessor Systems (SEDMS III), pp. 43–58, Mar. 1992.
[22] H.G. Dietz and W.E. Cohen, "A Control-Parallel Programming Model Implemented on SIMD Hardware," Languages and Compilers for Parallel Computing, U. Banerjee, D. Gelernter, A. Nicolau, and D. Padua, eds., vol. 757, Lecture Notes in Computer Science, pp. 311–325, Springer-Verlag, Aug. 1992.
[23] N.B. Abu-Ghazaleh, "Variable Instruction Scheduling for Efficient MIMD Interpretation on SIMD Machines," master's thesis, Univ. of Cincinnati, June 1994.
[24] E.G. Coffman, Computer and Job-Shop Scheduling Theory. New York: John Wiley and Sons, 1976.
[25] M.R. Garey and D.S. Johnson, Computers and Intractability. New York: W.H. Freeman and Company, 1979.
[26] L. Kleinrock, Queueing Systems, vols. I and II. New York: John Wiley and Sons, 1976.
[27] P.P. Chang, S.A. Mahlke, and W.W. Hwu, "Using Profile Information to Assist Classic Code Optimizations," Software—Practice and Experience, vol. 21, no. 12, pp. 1,301–1,321, 1991.
[28] J.A. Fisher, "Trace Scheduling: A Technique for Global Microcode Compaction," IEEE Trans. Computers, vol. 30, no. 7, pp. 478–490, July 1981.
[29] P.A. Wilsey, D.A. Hensgen, N.B. Abu-Ghazaleh, C.E. Slusher, and D.Y. Hollinden, "The Concurrent Execution of Non-Communicating Programs on SIMD Processors," Proc. Fourth Symp. Frontiers of Massively Parallel Computation, pp. 29–36, Oct. 1992.
[30] J.L. Hennessy and D.A. Patterson, Computer Architecture: A Quantitative Approach. San Mateo, Calif.: Morgan Kaufmann, 1990.
[31] L. Ott, An Introduction to Statistical Methods and Data Analysis. Boston: PWS-Kent, 1988.
[32] N.B. Abu-Ghazaleh, T. Dichiaro, P.A. Wilsey, D.A. Hensgen, and M.M. Cahay, "Parallel Execution of Monte Carlo Simulations on SIMD Processors," Technical Report 146-2-93-ECE, Dept. of ECE, Univ. of Cincinnati, Cincinnati, Oh., Feb. 1993.
[33] Introduction to SpixTools. Mountain View, Calif.: Sun Microsystems Laboratories Inc., 1991.
Nael B. Abu-Ghazaleh is a PhD candidate at the University of Cincinnati. He received his BSc degree in electrical engineering from the University of Jordan in 1990, and an MS in computer engineering from the University of Cincinnati in 1994. His research interests include parallel processing, communication networks, computer architecture, and instruction level parallelism.
Philip A. Wilsey received the PhD and MS degrees in computer science from the University of Southwestern Louisiana and a BS degree in mathematics from Illinois State University. He is an assistant professor at the University of Cincinnati. His current research interests are parallel and distributed processing, parallel discrete event driven simulation, computer aided design, formal methods and design verification, and computer architecture.

Xianzhi Fan received an MS degree in computer engineering from the University of Cincinnati in 1994 and a BS in computer engineering from Beijing University in 1986.

Debra A. Hensgen received the BS degree in mathematics from Eastern Kentucky University and the MS and PhD degrees in computer science from the University of Kentucky in 1989. She is an associate professor in the Computer Science Department at the Naval Postgraduate School in Monterey, California. Her research interests include tools for concurrent processing, operating systems and resource management systems, allocation of shared heterogeneous resources, and parallel and distributed computing. Dr. Hensgen is serving as the program chair for the Heterogeneous Computing Workshop, which is held in conjunction with the International Parallel Processing Symposium.