Optimal Two Level Partitioning and Loop Scheduling for Hiding Memory Latency for DSP Applications

Zhong Wang, Michael Kirkpatrick, Edwin Hsing-Mean Sha
Dept. of Comp. Sci. & Engr., University of Notre Dame, Notre Dame, IN 46556
[email protected], [email protected], [email protected]

ABSTRACT

The large latency of memory accesses in modern computers is a key obstacle to achieving high processor utilization. To hide this latency, this paper proposes a new memory management technique that can be applied to computer architectures with three levels of memory. The technique takes advantage of access pattern information available at compile time by prefetching certain data elements from the higher level memory. It also keeps certain data in place for a period of time to prevent unnecessary data swapping. Data locality is much improved over the usual execution pattern by partitioning the iteration space and restructuring execution within each partition. These combined approaches lead to improvements in average execution times of approximately 35% over the one-level partition algorithm and more than 80% over list scheduling and hardware prefetching.

1. INTRODUCTION

Memory latency is becoming an increasingly important performance bottleneck as the gap between processor and memory speeds continues to grow. While cache hierarchies are an important step toward addressing the latency problem, they are not a complete solution. Techniques that can cope with the large latency of memory accesses are essential for achieving high processor utilization. A number of different solutions have been proposed. Prefetching data into the cache nearest to the CPU before it is accessed can hide the long memory latency, making this an effective solution. Previous research on prefetching can be classified into two categories: hardware-based prefetching and software-directed prefetching. Hardware-based prefetching techniques [4, 2] perform very poorly when the program has complex access patterns. Traditional software-directed prefetching techniques [5] generally give little consideration to scheduling the program so as to make its execution more parallel and minimize the overall execution time. Special cases occur in DSP applications and image processing, where there exist a large number of uniform nested loops.

(This work is partially supported by NSF grants MIP95-01006 and NSF ACS 96-12028 and JPL 961097.)

Thus, detailed consideration of how to schedule the loop body efficiently is very important for these applications. Another area for optimization is the ALU scheduling of instructions. Here, several multi-dimensional loop pipelining techniques [3] have been proposed to minimize ALU time. These techniques, however, do not consider the data fetching latency in their algorithms. The one-level partition technique in [6, 1] combines the two aspects of instruction level parallelism and memory access latency reduction to decrease the overall schedule length for uniform dependency loops. However, these algorithms assume that only two levels are present in the memory hierarchy; no analysis was performed with three or more levels of memory, as is the case in most contemporary computer architectures. Our studies have shown that the one-level partition technique cannot take full advantage of the second level memory in such an architecture. In contrast, our two-level partitioning technique, by appropriately choosing the partition shape and size, can prefetch all the data into the second level memory before the first level memory needs to access them, which leads to a lower average fetching time. These prefetch operations proceed in parallel with the ALU computation. The lower average fetching time is the key to better performance under the first level memory size constraint, as demonstrated by the experimental results.

In this paper, we apply the idea of partitioning to the extra levels in the memory hierarchy. In both the first level memory and the second level memory, we adopt the partition technique and make extensive use of compile time information about the usage pattern of the data produced in the loop to generate an approximately "balanced schedule" between the first and second level schedules, in order to minimize the overall average execution time and make better use of the second level memory. The paper presents methods for deciding the best partition at both levels to achieve this goal. The new algorithm exceeds the performance of existing algorithms by optimizing both the ALU and memory schedules in the first level and taking into consideration a balance between the schedules of the two levels. When there is no memory constraint, the results produced by our algorithm come within one instruction time unit of the theoretical lower bound. Experiments show the improvement can reach about 35% over the one-level partitioning algorithm with memory constraints and more than 80% over the traditional hardware prefetching scheme or list scheduling.

2. BASIC IDEA

A multi-dimensional data flow graph (MDFG) G = (V, E, d, t) is a node-weighted and edge-weighted directed graph, where V is the set of computation nodes, E ⊆ V × V is the set of dependence edges, d is a function from E to Z^n representing the multi-dimensional delay vector between two nodes (n is the number of dimensions), and t is a function from V to the positive integers, representing the computation time of each node. The execution of all nodes in V one time is an iteration, i.e., the execution of one instance of the loop body. Iterations are identified by a vector i, equivalent to a multi-dimensional index, starting from (0, 0, ..., 0). An edge with delay (0, 0, ..., 0) represents an intra-iteration dependency, and an edge with nonzero delay d(e) represents an inter-iteration dependency. In this case, the execution of the current iteration will use data computed d(e) iterations before. The iteration space contains all iterations with a different index. Delay dependencies exist among iterations in the iteration space. An MDFG is said to be realizable if an execution sequence for each iteration can be found.

Regular execution of nested loops proceeds in either a row-wise or column-wise manner until the end of the row or column is reached. However, this mode of execution does not take full advantage of either the locality of reference present or the available parallelism, since dependencies have both horizontal and vertical components. The execution of such structures can be made more efficient by dividing the iteration space into regions called partitions that better exploit spatial locality. This observation can be applied to both the first and second level caches.

Provided that the total iteration space is divided into partitions, the execution sequence will be determined by each partition. Assume that the partition in which the loop is executing is the current partition. Then the next partition is the partition adjacent on the right side of the current partition along the X-axis. The second next partition is adjacent to and lies on the right side of the next partition. The definitions of third next partition, fourth next partition, etc., are similar. The other partitions are all partitions except those on the current row.

In the two-level partition algorithm, the iteration space is partitioned on two levels. A first level partition consists of a number of iterations, and a second level partition is made up of a number of first level partitions. For a delay dependency that goes into the m-th next partition, a keep-m memory operation is used to keep the data in a lower level memory for m partitions. Delay dependencies that go into other partitions result in the use of prefetch memory operations to fetch data in advance. Due to memory constraints, it is likely that in the first level partition there exist some keep-m, m ≥ 2, operations, while in the second level only keep-1 operations exist.

Our scheme is a software-based method in which special prefetching instructions are added to the code when it is compiled. When the processor encounters these instructions during program execution, it passes them to special hardware, the memory units, to process. The function of the memory unit is the same at both levels: it is in charge of getting the data ready before the lower level cache needs to reference them. The advantage of using keep is that it eliminates the time wasted on unnecessary data swapping, which results in a better performing schedule.
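To make the definitions concrete, the sketch below shows one minimal way an MDFG could be represented; the class and field names are this sketch's own, not notation from the paper.

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class Edge:
    src: str
    dst: str
    delay: tuple          # the multi-dimensional delay vector d(e)

@dataclass
class MDFG:
    nodes: dict = field(default_factory=dict)   # node -> computation time t(v)
    edges: list = field(default_factory=list)   # dependence edges

    def is_intra_iteration(self, e: Edge) -> bool:
        # A zero delay vector means the dependency stays within one iteration.
        return all(c == 0 for c in e.delay)

# A two-node loop body: B consumes A's value from one iteration back
# along the first dimension (delay (1, 0)).
g = MDFG(nodes={"A": 1, "B": 1},
         edges=[Edge("A", "B", (0, 0)), Edge("B", "A", (1, 0))])
for e in g.edges:
    print(e.src, "->", e.dst,
          "intra" if g.is_intra_iteration(e) else "inter")
```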

2.1 One-Level Partitioning

In one-level partition scheduling we have two schedule parts: an ALU part and a memory part, as seen in Figure 1. In the ALU part, the multi-dimensional rotation scheduling algorithm is used to get a static compact schedule for one iteration. This schedule is then simply duplicated for each node in the partition to obtain the ALU part schedule. The lower bound of the one-level partition schedule algorithm is determined by this part, which is len_per_iteration × #nodes, where len_per_iteration denotes the schedule length of one iteration from the multi-dimensional rotation scheduling algorithm, and #nodes is the number of nodes in one partition.

[Figure 1: The one-level partition schedule: an ALU part alongside a memory part consisting of a prefetch part and a keep part.]

Different from ALU part scheduling, the memory part scheduling arranges the memory unit operations in one partition as a whole. Since prefetch operations are independent of ALU operations, they can be arranged from the beginning of the memory part schedule. On the other hand, each kind of keep operation can only be arranged after the corresponding ALU computation has been performed. Due to this dependency, the memory part schedule is longer than the ALU part schedule by at least the time of one keep operation.

Two important pieces of information in the one-level partition algorithm are partition shape and partition size. We use two partition vectors, P_x and P_y, to identify a parallelogram as the partition shape. Assume without loss of generality that the angle between P_x and P_y is less than 180°, and P_x is clockwise of P_y. Then the partition shape and size can be denoted by the direction and the length of these two vectors. The directions of these two vectors must satisfy the following conditions: for any delay vector d_e, the cross product d_e × P_x is less than or equal to zero, while the cross product d_e × P_y is greater than or equal to zero. These conditions guarantee the absence of dependency cycles in the CDG and thus the existence of a realizable partition execution sequence. In the one-level partition algorithm, the vector (1, 0) is used as the basic vector of P_x, and the normalized leftmost delay vector is used as the basic vector of P_y.

In the one-level partition algorithm, the balanced schedule is an important concept in determining the partition size. It is a schedule in which the memory part schedule is longer than the ALU part schedule by at most the amount of time needed for a keep operation. Moreover, it can be proven that we can always find a P_x and P_y that achieve a balanced schedule in the absence of first level cache size constraints, as long as P_{y,y} ≥ d_y and P_{x,x} ≥ max(d_x − d_y · P_{y,x} / P_{y,y}) for all d = (d_x, d_y) ∈ D. These conditions guarantee that no delay dependency will span more than two partitions.

To decide the partition size under a first level cache size constraint, we first need to understand the relation between the memory requirement and the partition size. The memory requirement consists of three parts: the memory locations used to store data for keep operations, those used to store data for prefetch operations, and those used to store intermediate data used by iterations in the same partition. The size of each part is determined by the delay vectors in the MDFG and the partition size. A keep-m operation needs m + 1 memory locations; an individual prefetch operation needs two memory locations. Based on this information, the overall memory requirement can be calculated as long as the number of each kind of operation is known. The detailed calculation of the memory requirement for the intermediate data and the number of each kind of memory operation can be found in [6].

To satisfy the memory constraints, reducing the partition size along the P_x direction provides much better performance than reducing it along the P_y direction. The first kind of reduction reduces the number of prefetch operations as well as the number of iterations in the partition, while the second only reduces the number of iterations in the partition. Due to the much larger prefetch operation latency, the first reduction is preferred when decreasing the partition size to satisfy the memory constraints. This observation plays an important role in the one-level partition algorithm: we can start from the balanced schedule and reduce the partition size mainly along the P_x direction until the memory constraint is satisfied. The detailed one-level algorithm can be seen in [6].
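The direction conditions on the partition vectors are easy to check mechanically. The sketch below (function names are ours) tests the two cross-product conditions for a set of delay vectors:

```python
def cross(u, v):
    # 2-D cross product: u x v = u.x * v.y - u.y * v.x
    return u[0] * v[1] - u[1] * v[0]

def legal_partition_vectors(delays, Px, Py):
    """Check the one-level partition legality conditions: every delay
    vector d must satisfy d x Px <= 0 and d x Py >= 0, which rules out
    dependency cycles between partitions."""
    return all(cross(d, Px) <= 0 and cross(d, Py) >= 0 for d in delays)

# Example: Px along (1, 0) and Py the normalized leftmost delay vector,
# as the one-level algorithm chooses them.
delays = [(1, 0), (0, 1), (-1, 1)]
print(legal_partition_vectors(delays, Px=(1, 0), Py=(-1, 1)))  # True
```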

3. TWO-LEVEL PARTITION

In reality, contemporary computer systems have more than two levels of memory, so it is necessary to study schedules with more memory levels. We considered the situation with dynamic (FIFO) management in the second level memory and ran experiments to gather statistics on the average access time of the first level memory for two benchmarks under different iteration space sizes (this access time includes fetches from the third level to the first level and from the second level to the first level). The results show that having more memory levels does not yield substantial improvement when the iteration space is large. The dynamic algorithm cannot predict the data access sequence, and the data have always been swapped out to the third level by the time they are needed. This demonstrates that dynamic scheduling cannot exploit the information available at compile time to improve data locality and reduce access time. This phenomenon leads us to apply partitioning in the second level memory as well as in the first level memory. The objective of the second level partition is to prefetch some data from the third level to the second level and to keep some data in the second level memory for near future use, such that whenever the first level memory wants to access these data, it can always find them in the second level memory, eliminating the time needed to fetch data from the third level memory.

3.1 The different delay dependencies

When considering the second level partition, we can treat the first level partition as the basic integral unit. The second level partition is then a combination of a number of first level partitions. In the second level, the current partition, next partition, and other partitions have the same definitions as those in the first level.

In the case of two level partitions, there are four kinds of delay dependencies which need to be treated differently when scheduling the loop. They are denoted d1, d2, d3, and d4. d1 is a delay dependency that goes from the current partition to an other partition in both the first level memory and the second level memory. For this delay dependency, a prefetch operation is needed at both levels. d3 is an intra-partition delay dependency in the second level which goes to the next partition in the first level. It only needs keep operations in the first level partition schedule and has no effect on the second level partition schedule. Delay dependency d4 is an intra-partition delay dependency in the second level but goes to an other partition in the first level memory. Therefore, it is treated as a prefetch operation in the first level partition schedule and a keep operation in the second level to

retain the corresponding data for near future use. Moreover, some write-back operations are needed to put the data corresponding to this delay dependency back into the second level memory.

The delay dependency d2 goes to the next partition in both the first and second level memory. Due to the execution sequence (the loop is executed by the first level partition sequence along the x-axis until the right boundary of the second level partition is reached, then continues from the left boundary of the second level partition along the next row of first level partitions), this delay dependency gives rise to some complications in the schedule. In the first level partition its data cannot be kept in the first level memory as usual, since they will not be needed until execution reaches a partition in the next second level partition. Instead, from the standpoint of the first level partition schedule, the data will be prefetched from the second level before they are needed. Thus, we have the following definition of a boundary partition.

DEFINITION 1. A boundary partition is a first level partition with at least one delay dependency going to the next partition in the second level.

In a boundary partition, some or all of the keep operations of an ordinary partition transform into prefetch operations. The number of keep operations which become prefetch operations depends on the relation between the first level partition size and the length of the delay vector. For instance, if there are only keep-1 operations in the ordinary first level partition, the boundary partitions are found only in the rightmost column of first level partitions. If keep-2 operations also exist, then the boundary partitions also include the second rightmost column of first level partitions, but in a first level partition of this column, only keep-2 operations become prefetch operations. After deciding which keep operations will become prefetch operations, the boundary partition schedule lengths can be determined accordingly. In the second level partition, the data corresponding to d2 will also be prefetched from the third level memory to the second level for use in the next second level partition.
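The four cases reduce to a small decision table once the partition-relative position of each dependency target is known at both levels. The sketch below assumes those positions ("same", "next", or "other") have already been computed from the delay vector and the partition geometry; the function and its labels are this sketch's own.

```python
def classify_delay(first_rel, second_rel):
    # first_rel / second_rel: "same", "next", or "other", describing where
    # the dependency target lands relative to the current partition at the
    # first and second level (assumed precomputed, hypothetical helper).
    if second_rel == "same":
        if first_rel == "next":
            return "d3"   # keep-m at the first level only
        if first_rel == "other":
            return "d4"   # prefetch at first level, keep + write back at second
        return "intra"    # stays inside the current first level partition
    if second_rel == "next":
        return "d2"       # boundary partitions: keeps become prefetches
    return "d1"           # prefetch at both levels

# A dependency that leaves the current first level partition but stays
# inside the current second level partition is kept at level two.
print(classify_delay("other", "same"))   # d4
```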

3.2 The write back operation

The write back operation can be considered as being hidden in the prefetch operation.

LEMMA 1. In the first level partition, the number of write-back operations is always the same as the number of prefetch operations.

Since the numbers of these two operations are identical, it is possible to treat them as one integrated operation. The following theorem shows that the memory consumption is still two locations for such an integrated operation; its proof also illustrates how the two operations are integrated into one.

THEOREM 2. Every write back operation can be combined with a prefetch operation into an integrated operation, and the number of memory locations for this operation is 2.

From this theorem, we can simply double the prefetch time to account for the write back operation; the model and the algorithm remain applicable.

THEOREM 3. The number of prefetch operations and the number of write back operations used to write data back to the third level are identical. These two operations can be regarded as one integrated operation when considering the second level partition scheduling.
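In code, applying the theorem is a one-liner; the function name and the sample latency below are ours, the latency value matching the experimental setup of Section 5.

```python
def integrated_prefetch_time(T_fetch):
    # Theorems 2 and 3: each write back pairs with a prefetch into one
    # integrated operation occupying two memory locations, so doubling
    # the fetch time accounts for the write back in the schedule model.
    return 2 * T_fetch

# With the experiments' third-to-second level latency of 9 cycles:
print(integrated_prefetch_time(9))   # 18
```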

3.3 Two level partition scheduling

The ordinary first level partition schedule length is determined by the ALU part and the memory part, whose length is

#prefetch_first × T_{2→1} + #keep_first × T_keep.

The boundary partition schedule length is different; its value depends on how many keep operations change into prefetch operations. In the above, #prefetch_first is the number of prefetch operations per first level partition, T_{2→1} is the time cost to fetch data from the second level to the first level, #keep_first is the number of keep operations per first level partition, and T_keep is the time cost per keep operation. In the following, we always assume T_keep to be equal to one.

In the second level schedule there exist only two kinds of operations: prefetch, to fetch the data corresponding to d1 and d2 delays, and keep, to retain the data corresponding to d4 delays. The length of the first level schedule is the summation of all the first level partition schedules in a second level partition, and the length of the second level schedule is the summation of its prefetch part and its keep part. The overall second level schedule length is

L_second = #pre_second × T_{3→2} + #keep_second × T_keep.
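A minimal sketch of the two length formulas as reconstructed above; the function and parameter names are ours, and the latencies in the example match the experimental setup (T_{3→2} = 9, T_{2→1} = 3).

```python
def first_level_memory_part(n_prefetch, n_keep, T21, Tkeep=1):
    # Memory part of an ordinary first level partition schedule:
    # prefetches from level two plus keep operations (T_keep = 1 here).
    return n_prefetch * T21 + n_keep * Tkeep

def second_level_length(n_pre_second, n_keep_second, T32, Tkeep=1):
    # Second level schedule: a prefetch part (d1 and d2 data fetched
    # from level three) followed by a keep part (d4 data).
    return n_pre_second * T32 + n_keep_second * Tkeep

print(first_level_memory_part(n_prefetch=4, n_keep=6, T21=3))         # 18
print(second_level_length(n_pre_second=10, n_keep_second=20, T32=9))  # 110
```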

4. TWO LEVEL PARTITION SCHEDULING ALGORITHM

Knowing the constitution of the first level and second level schedules, we need to find the relation between these two schedules in order to achieve the optimal average schedule length. From the analysis above, the first level partition schedule can be determined independently. Therefore, the lower bound of the average schedule length is the average schedule length of the ordinary first level partition, which equals the one-level partition schedule length where the prefetch time is the fetching time from the second level to the first level. The question now becomes: what size should the second level partition be to make the overall schedule optimal?

4.1 The property of the operation amount

The first level partition schedule length depends on the ALU part schedule and the number of prefetch and keep operations in the first level, while the second level schedule length depends only on the prefetch and keep operations in the second level. The following properties show the relationship between the number of operations in the first and second level partitions.

The first level partitions in a second level partition can be classified into ordinary partitions and boundary partitions. Assume that the region consisting of only ordinary partitions has size x1 × y and the region consisting of only boundary partitions has size x2 × y. Let the number of prefetch operations in a first level partition be #pre_first and the number of keep operations be #keep_first. The number of prefetch and keep operations in the second level partition can then be calculated by the following properties.

[Figure 2: The different sub-regions (top region, keep region, boundary region) in the boundary partitions, for a delay vector d; the dotted lines delimit the first level partitions.]

In the region consisting of only boundary partitions, there exist three sub-regions for a given delay dependency, as shown in Figure 2.

DEFINITION 2. In the region consisting of only boundary partitions, the top region for a delay vector is the region from which this delay dependency starts and goes into an other second level partition. The boundary region for a delay vector is the region from which this delay dependency starts and goes into the next second level partition. The keep region for a delay vector is the region from which this delay dependency starts and goes into an other first level partition, but still stays within this second level partition.

The one-level partitioning scheduling algorithm guarantees that the P_y projection of a delay vector is no longer than the first level partition size along the P_y direction. That is to say, the top region is confined to the top row of first level partitions in a second level partition. The number of integer points in these regions is the number of corresponding operations for this delay vector; these numbers can be calculated using the same parallelogram idea given in [6].

PROPERTY 2. For a given delay dependency, the number of integer points in the top region and the boundary region is the number of prefetch operations in the boundary partitions. The number of integer points in the keep region is the associated number of keep operations in the boundary partitions.
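Property 2 reduces operation counting to counting integer lattice points. [6] derives closed forms via the parallelogram idea; a brute-force check is easy to sketch, with the region predicate and bounds below being purely illustrative (not the paper's regions).

```python
def integer_points(region_pred, x_range, y_range):
    # Brute-force count of integer points satisfying a region predicate;
    # closed forms exist ([6]), this is only a verification aid.
    return sum(1 for x in range(*x_range) for y in range(*y_range)
               if region_pred(x, y))

# Illustration: in a w x h row of iterations with delay d = (dx, dy),
# count points whose target leaves the row end along x but stays in the row.
w, h, dx, dy = 4, 3, 2, 1
print(integer_points(lambda x, y: x + dx >= w and y + dy < h,
                     (0, w), (0, h)))   # 4
```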

4.2 The balanced two level partition schedule

Since the lower bound of the two-level partition schedule is determined by the first level partition schedule, we should make the first level schedule longer than the second level schedule to achieve good performance.

DEFINITION 3. Given the first level partition schedule and a certain second level partition size, the balanced two-level partition schedule is a schedule in which the first level schedule is longer than the second level schedule and the schedule length difference between the two levels is minimal.

The following theorem proves the existence of such a balanced two level partition schedule.

THEOREM 4. If the schedule of the first level partition is determined, a second level partition size that makes the two level partition schedule a balanced two level partition schedule can always be found.

4.3 The calculation of the memory requirement

In the two level partition schedule, we need to calculate both the first level memory requirement and the second level memory requirement.

THEOREM 5. The first level memory requirement mem1_req is calculated by the following equation:

mem1_req = mem_inter + Σ_n #keep_n × (n + 1) + 2 × #pre_first

In the equation, mem_inter is the memory requirement for the intermediate data in the first level partition, #keep_n is the number of keep-n operations, and #pre_first is the number of prefetch operations. We can use the above theorem to derive the memory requirement for the first level memory, in which the calculation of mem_inter is the same as in the one-level partition algorithm.

THEOREM 6. The memory requirement for the second level mem2_req is calculated by the following equation:

mem2_req = 2 × #pre_second + #keep_second

Here #pre_second represents the number of prefetch operations in the second level, and #keep_second is the number of keep operations. Combining the above theorem with the two properties in Section 4.1, the memory requirement for both the first and second level can be calculated.

In choosing the second level partition size, we select the size that makes the two level partition schedule a balanced two level partition schedule. This is a tradeoff between performance improvement and memory consumption. After achieving a balanced two level partition schedule, increasing x can still bring some performance improvement; on the other hand, the memory requirement then grows much faster than the performance improves. The results can be found in Section 5. It is therefore more reasonable to adopt a second level partition size which generates a balanced two level partition schedule. We have demonstrated that we can always find such an x to generate a balanced two level partition schedule as long as y is large enough; therefore, different such x values are associated with different y. The following algorithm gives the method to calculate the minimum second level memory requirement and the corresponding second level partition size. In this algorithm, the symbol #pre1 denotes the number of prefetch operations in one first level partition, and #keep1_n is the number of keep-n operations in one first level partition. T_{3→2} and T_{2→1} are the fetching times from the third level memory to the second level memory and from the second level memory to the first level memory, respectively. lp_ord is the schedule length of an ordinary first level partition. In the algorithm, we first find a y which leads to the minimum second level memory requirement, then calculate the x value that makes the first level schedule length equal to the second level schedule length for this y; ⌊x⌋ is then chosen as the actual x. It can be shown that the difference between the first level schedule and the second level schedule is an increasing function of x. The second level partition size calculated by the algorithm therefore guarantees a balanced two level partition schedule.
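The two theorems translate directly into code. The sketch below is a straightforward transcription of Theorems 5 and 6; the function and parameter names are ours.

```python
def mem1_req(mem_inter, keep_ops, n_pre_first):
    """First level memory requirement (Theorem 5): intermediate data,
    plus n + 1 locations per keep-n operation, plus two locations per
    prefetch. keep_ops maps n -> number of keep-n operations."""
    return (mem_inter
            + sum(count * (n + 1) for n, count in keep_ops.items())
            + 2 * n_pre_first)

def mem2_req(n_pre_second, n_keep_second):
    """Second level memory requirement (Theorem 6): two locations per
    integrated prefetch/write-back operation plus one per keep."""
    return 2 * n_pre_second + n_keep_second

# Example: 20 intermediate locations, six keep-1 and two keep-2 operations,
# and four prefetches at the first level.
print(mem1_req(20, {1: 6, 2: 2}, 4))   # 20 + 12 + 6 + 8 = 46
print(mem2_req(10, 25))                # 45
```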

Algorithm 1: Calculate the minimum second level memory requirement and the corresponding second level partition size

Input: A given first level partition schedule and the corresponding set of delay dependencies
Output: The minimum second level memory requirement and the associated second level partition size

1. Use the delay dependency set to determine the width w of the region consisting of only boundary partitions.
2. Compute the number of keep operations #keep_keep and prefetch operations #pre_bound in the bottom row of this region.
3. Derive the function f(y) for the number of keep operations and f′(y) for the number of prefetch operations in the second level:

   f(y) = #keep_keep × (y − 1) + Σ_n n × #keep1_n
   f′(y) = #pre_bound × (y − 1) + #pre1 × w

4. Compute l(y), the summation of all boundary partition schedule lengths.
5. Set the first level schedule length equal to the second level schedule length and derive the expression for the second level memory requirement mem_req2:

   lp_ord × (x − w) × y + l(y) = #pre1 × T_{3→2} × (x − w) + #pre1 × (x − w) × (y − 1) + f′(y) × T_{3→2} + f(y)
   mem_req2 = 2 × #pre1 × (x − w) + #pre1 × (x − w) × (y − 1) + 2 × f′(y) + f(y)

6. For a given y, calculate x from the first equation.
7. Substitute x into the second equation and calculate the value of y which minimizes mem_req2.
8. Calculate the corresponding x and the second level memory size.

4.4 Algorithm

From the above discussion, the balanced two level partition schedule provides both the optimal average schedule length and the second level memory requirement. Knowing the partition size at both levels, we can generate the schedule for both levels using the preceding results. Note that there is a difference between the arrangement of memory operations in the first and second level: in the first level partition, we can only issue keep operations after the corresponding data are ready, so there may be some idle time in the memory part; in the second level partition, we can arrange the keep operations as soon as the corresponding data have been written back. Since there is no keep operation in the top row of the second level partition, there will be no such idle time in the second level memory.

Algorithm 2: The two-level partition schedule

Input: MDFG, the first level memory constraint
Output: The two level partition schedule

1. Retime the MDFG to get a compact schedule for one iteration, using the multi-dimensional rotation scheduling algorithm (see Section 2.1).
2. Based on the delay dependencies in the retimed MDFG, determine the first level partition shape (see Section 2.1).
3. Based on the first level memory constraint and the first level partition shape, calculate the first level partition size using the one-level partition algorithm (see Section 2.1).
4. Create the first level partition schedule.
5. Use this first level partition schedule and Algorithm 1 to compute the second level partition size (see Section 4.3).
6. Calculate the corresponding second level memory requirement (see Section 4.3).
7. Create the second level schedule.

5. EXPERIMENT

In the experiments we assume that fetching a datum from the third level to the second level takes 9 clock cycles, from the second level to the first level 3 clock cycles, and from the third level to the first level 12 clock cycles.

Benchmark | size1  | mreq1 | mreq2 | size2 | len   | list len | ratio  | hw pre len | ratio
IIR       | 4×8    | 358   | 1159  | 6×2   | 6.031 | 40       | 85%    | 37         | 83.7%
DPCM      | 12×10  | 840   | 3982  | 5×3   | 4.008 | 27       | 85.16% | 22         | 81.78%
WDF       | 4×5    | 129   | 486   | 6×2   | 4.05  | 18       | 77.5%  | 12         | 66.25%
Floyd     | 7×4    | 216   | 477   | 3×2   | 6.000 | 33       | 81.82% | 22         | 72.72%
2D        | 3×5    | 241   | 491   | 3×2   | 12    | 65       | 81.54% | 60         | 80%

Table 1: Experimental results without memory constraints, assuming T_prefetch = 12


          | One-level              | Two-level
Benchmark | size | mreq1 | len     | size1 | mreq1 | size2 | mreq2 | len
IIR       | 1×6  | 175   | 10      | 5×2   | 141   | 7×10  | 3825  | 6.129
DPCM      | 3×9  | 416   | 5       | 12×4  | 381   | 7×8   | 9230  | 4.027
WDF       | 1×5  | 74    | 5       | 4×2   | 75    | 8×5   | 936   | 4.156
Floyd     | 2×3  | 109   | 8.333   | 4×2   | 89    | 6×4   | 742   | 6
2D        | 1×4  | 112   | 14      | 3×1   | 125   | 4×13  | 2080  | 12.5

Table 2: Experimental results with about 1/2 of the original memory size

          | One-level              | Two-level
Benchmark | size | mreq1 | len     | size1 | mreq1 | size2 | mreq2 | len
IIR       | 1×3  | 85    | 15.67   | 2×2   | 74    | 13×6  | 1435  | 6.404
DPCM      | 2×6  | 216   | 7.083   | 4×4   | 205   | 15×8  | 6952  | 4.188
WDF       | 1×3  | 44    | 8       | 1×3   | 44    | 19×3  | 462   | 4.645
Floyd     | 1×2  | 43    | 12.5    | 1×2   | 37    | 18×4  | 592   | 6.417
2D        | 1×2  | 65    | 25      | 1×2   | 65    | 11×4  | 755   | 12.615

Table 3: Experimental results with about 1/4 of the original memory size

We compared five different schemes: the two-level partition algorithm, the one-level partition algorithm in [6], PSP scheduling in [1], list scheduling, and a hardware prefetching scheme. The first two algorithms were introduced in the previous sections. PSP scheduling does not consider first level memory size constraints; it obtains the same results as our algorithm when no first level memory size constraint is imposed, but cannot produce feasible results when one is, so its results are not shown in the tables. List scheduling uses the list algorithm to schedule the ALU part with no partitioning in the memory part. The hardware prefetching scheme uses the model presented in [4]: whenever a block is accessed, the next block is also loaded.

Table 1 presents the results without a memory constraint in the first level. In this table, the one-level and two-level partition scheduling use the same first level partition size, shown in the table, and they have the same average schedule length. The other two tables describe the results with memory constraints; relative memory constraints are used for the benchmarks because of the large difference among their memory requirements. The five benchmarks are an Infinite Impulse Response filter, a Differential Pulse-Code Modulation device, a Wave Digital filter, the Floyd-Steinberg algorithm, and a Two Dimensional filter, represented by "IIR", "DPCM", "WDF", "Floyd" and "2D", respectively, in all tables. The "size1" columns list the first level partition size in both the one-level and two-level partition algorithms. The "size2" columns list the second level partition size in the two-level partition algorithm. The "len" columns give the average schedule lengths, and the "ratio" columns in Table 1 denote the improvement the two-level partition algorithm obtains over the list scheduling and hardware prefetching schemes. The mreq1 and mreq2 columns give the first and second level memory requirements for each algorithm.

As we can see from Table 1, the two-level partition algorithm is much superior to list scheduling and hardware prefetching in performance.

[Figure 3: The relationship between the average schedule length and the second level memory size for (a) IIR and (b) DPCM.]

The one-level partition algorithm can be competitive in performance with the two-level partition algorithm in the case without memory constraints, because a large first level memory can hide the long memory access time efficiently. When memory constraints are added, the performance difference is obvious from the last two tables.

In Figures 3(a) and 3(b), the curves depict the relationship between the average schedule length and the second level memory requirement for two benchmarks, IIR and DPCM, with the same first level memory constraints as in Table 2. The x axis is the second level memory size and the y axis is the corresponding average schedule length. The curves show that a smaller second level memory size degrades the performance greatly, while increasing the second level memory size does not result in much performance improvement. The memory size and the corresponding second level partition size obtained from our algorithm are the best point at which to determine the two level schedule.

6. REFERENCES

[1] F. Chen and E. H.-M. Sha. Loop scheduling and partitions for hiding memory latencies. To appear in Proc. IEEE & ACM 12th Intl. Symposium on System Synthesis (ISSS), Nov. 1999.
[2] F. Dahlgren and M. Dubois. Sequential hardware prefetching in shared-memory multiprocessors. IEEE Transactions on Parallel and Distributed Systems, 6(7):733-746, July 1995.
[3] N. Passos and E. H.-M. Sha. Scheduling of uniform multidimensional systems under resource constraints. IEEE Transactions on VLSI Systems, 6(4):719-730, December 1998.
[4] T.-F. Chen. Data Prefetching for High-Performance Processors. PhD thesis, Dept. of Comp. Sci. and Engr., Univ. of Washington, 1993.
[5] T. Mowry. Tolerating latency in multiprocessors through compiler-inserted prefetching. ACM Transactions on Computer Systems, 16(1):55-92, February 1998.
[6] Z. Wang, V. Andronache, and E. H.-M. Sha. Optimal partitioning under memory constraints for minimizing average schedule length. In Proc. IASTED 11th Intl. Conference on Parallel and Distributed Computing and Systems, pages 758-763, Cambridge, MA, Nov. 1999.
