Tailoring Pipeline Bypassing and Functional Unit Mapping to Application in Clustered VLIW Architectures
Marcio Buss, Rodolfo Azevedo, Paulo Centoducatte and Guido Araujo
IC - UNICAMP, Cx. Postal 6176, Campinas, SP, Brazil
{marcio.buss, rjazevedo, ducatte, guido}@ic.unicamp.br
ABSTRACT
In this paper we describe a design exploration methodology for clustered VLIW architectures. The central idea of this work is a set of three techniques aimed at reducing the cost of expensive inter-cluster copy operations. Instruction scheduling is performed using a list-scheduling algorithm that stores operand chains into the same register file. Functional units are assigned to clusters based on the application inter-cluster communication pattern. Finally, a careful insertion of pipeline bypasses is used to increase the number of data-dependencies that can be satisfied by pipeline register operands. Experimental results, using the SPEC95 benchmark and the IMPACT compiler, reveal a substantial reduction in the number of copies between clusters.
Figure 1: Clustered VLIW architecture model with inter-cluster bypass.
1. INTRODUCTION
The problem of instruction partitioning/scheduling for clustered VLIW architectures has received considerable attention recently, due to the small area and improved register file latency achieved by these architectures [5]. Register file area/latency is proportional to O(n^2)/O(log m), where n is the total number of input/output ports and m is the number of read-ports. These features of clustered VLIW architectures are particularly relevant in the design of highly constrained embedded systems, where high performance, reduced die size and low power consumption are premium design goals.
In this paper we describe a design exploration methodology for clustered VLIW architectures. Instruction scheduling is performed using a list-scheduling algorithm that stores chains of operands into the same register file. Functional units are assigned to clusters based on the application inter-cluster communication pattern. Finally, pipeline bypasses are inserted to increase the number of data-dependencies which can be satisfied by pipeline register operands.
This paper is organized as follows. Section 2 reviews related work. Section 3 describes the architectural model adopted
throughout the paper. Section 4 discusses how operations are scheduled and assigned to functional units. The partitioning of functional units into clusters is discussed in Section 5. Finally, Section 6 shows how functional units are assigned to physical datapaths inside clusters. The SPEC CINT95 and CFP95 benchmarks and the IMPACT compiler [10] were used to evaluate the performance of this strategy (Section 7). Section 8 concludes the paper.
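To make the port-scaling argument above concrete, here is a small worked example using our own simplified area model (the model and numbers are illustrative assumptions, not figures from the paper): if register file area grows with the square of the total port count, splitting one file serving n FUs into k files serving n/k FUs each reduces that quadratic term by roughly a factor of k.

```latex
% Illustrative only: assume area ~ (port count)^2, with p ports per FU.
% Unified file:       A_unified   ~ (p n)^2
% k clustered files:  A_clustered ~ k (p n / k)^2 = (p n)^2 / k
\[
  \frac{A_{\mathrm{unified}}}{A_{\mathrm{clustered}}}
  \;\approx\; \frac{(pn)^2}{k\,(pn/k)^2} \;=\; k
  \qquad (\text{e.g. } k = 4 \text{ clusters} \Rightarrow \text{roughly } 4\times \text{ smaller total register file area}).
\]
```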
2. RELATED WORK
Clustered VLIW architectures have been extensively studied in the literature. The assignment of operation traces to clusters was originally studied in the context of the Bulldog [6] and Multiflow Trace [9] compilers. Separate partitioning and scheduling has been proposed by Capitanio et al. [4] using a limited-connectivity architecture. Ozer et al. [14, 16] integrate partitioning and scheduling in a single phase using the Unified Assign and Schedule (UAS) algorithm. A variation of UAS and modulo scheduling has been proposed by Sanchez et al. [18, 19] as a way to assign different loop iterations to separate clusters. Fisher et al. [7] proposed a Partial Component Clustering technique to divide Data-Flow Graph (DFG) components into clusters in order to avoid copy operations along DFG critical paths. Ozer and Conte [15] introduced an optimal cluster scheduling formulation for a VLIW machine based on integer linear programming. Their approach is suitable to help the search for a schedule lower
bound, and as a way to evaluate the effectiveness of heuristic-based schemes. Fernandes et al. [8] proposed a queue-based register file to pass operands between clusters. Architectural exploration and VLIW customization for a particular application has been studied by Jacome et al. [11] and Rau et al. [17]. Ahuja et al. [3] showed that the number of forwarding paths in a scalar processor could be reduced without a great performance loss. Unfortunately, not much work has been performed on simultaneously tailoring cluster partitioning and pipeline bypass structures to a specific application.
3. ARCHITECTURE MODEL
The architecture model used in this paper (Figure 1) is a pipelined clustered VLIW architecture, where each cluster is formed by a set of one or more homogeneous1 functional units (FUs), a multi-ported register file and an inter-cluster data transfer bus (called copy-bus). This model is similar to the one described by Capitanio et al. [5]. Contrary to the work in [5], the copy-bus is driven by the output of the functional unit and not by the output of the register file. By doing so, a copy operation can be scheduled to copy the result of an operation at the output of some FU directly to the register file of another cluster through the copy-bus.
Consider, for example, two dependent operations A and B (B depends on A's result) assigned to datapaths DP1 and DP2 in two distinct clusters (Figure 2). Figure 3 shows the pipeline timing diagram of these operations. Assume that the result of operation A is available in the EX/MEM pipeline register at the end of stage EX in DP1. A copy operation following A in DP1 can be used to move A's result to the copy-bus, just in time to be written, during the ID stage, into the register file of DP2 (solid arrow). This is not possible with the approach used in [5], which requires one extra NOP operation to transfer the data to DP2's register file.
The presence of the copy-bus affects the final register file design, but its impact is much smaller than the benefits gained by reducing the number of read-ports [5]. Using a heuristic from [5], we assume that the width of the copy-bus is equal to half the number of FUs per cluster, i.e., in the best case only half of the FUs in one cluster can simultaneously execute copies to other clusters. One cluster can receive copies from all other clusters, provided the constraint above is met.
Abnous and Bagherzadeh [1, 2] studied some of the design issues that arise in the pipeline structure and bypassing mechanism of pipelined clustered VLIW processors. In our work we use a few bypassing lines to forward operation results stored in pipeline registers to other datapaths. Pipeline bypasses can be added between datapaths within the same cluster or between datapaths in distinct clusters. The goal of inserting a bypass interconnection between two datapaths inside the same cluster (e.g., DP0 and DP1) is to reduce the number of NOP operations required to solve the data-hazards between dependent instructions in those datapaths. By inserting a bypass interconnection between two datapaths in distinct clusters (e.g., DP1 and DP2), we also reduce the number of copy operations required to use the copy-bus.
1 Homogeneous FUs have been used for the sake of simplicity. The technique applies to heterogeneous units as well.
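As a small illustrative sketch (our own, not the authors' code; the data layout is an assumption), the copy-bus width heuristic adopted in this section can be stated as a per-cycle legality check on a schedule: in any cycle, a cluster may issue at most half as many inter-cluster copies as it has FUs, while receiving copies is not restricted.

```python
def copies_fit_copy_bus(copies_per_cycle, fus_per_cluster):
    """Check the copy-bus width constraint assumed in Section 3.

    copies_per_cycle: {cycle: {cluster_id: copy_ops_issued_by_that_cluster}}
    A cluster may issue at most fus_per_cluster // 2 copies per cycle;
    how many copies a cluster *receives* is not limited here.
    """
    bus_width = fus_per_cluster // 2
    return all(issued <= bus_width
               for per_cluster in copies_per_cycle.values()
               for issued in per_cluster.values())


# Example: with 4 FUs per cluster the copy-bus width is 2, so a cluster
# issuing 3 copies in the same cycle violates the constraint.
print(copies_fit_copy_bus({0: {0: 2, 1: 1}, 1: {0: 3}}, fus_per_cluster=4))  # False
```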
Figure 2: Bypass interconnection between datapaths of FU2 and FU3.
Figure 3: Using inter-cluster forwarding to solve inter-cluster dependencies. (A bypass from the EX/MEM register in DP1 forwards the result of instruction A to instruction B in the EX stage of DP2; alternatively, instruction B in DP2 reads the result of instruction A in DP1 from the copy-bus.)
A copy operation must be issued by the compiler if: (a) no bypass exists between DP1 and DP2 and there is (at least one) data-hazard between the operations in these datapaths; (b) a bypass exists between DP1 and DP2, but at least one of the uses of the data in DP2 is so far from its definition in DP1 that the bypass cannot satisfy the dependency.
Consider again datapaths DP1 and DP2 and dependent operations A and B in Figure 3. Moreover, assume that B has been scheduled two slots after A. In this case, the result of A can be forwarded from the EX/MEM register in DP1 directly to the ID/EX register of DP2 (dotted line) through the bypass interconnect in Figure 2. We assume that one bypass interconnection between two datapaths has as many lines as required to exchange operands (in both directions) between the stages of the datapath pipelines.
The area needed by a bypass interconnection between two pipelines is proportional to the number of comparators required to detect the data-hazards [2] between them. This cost becomes very large if bypasses are allowed between all pipeline pairs, in which case it is proportional to 2dn^2, where n is the number of FUs and d is the depth of the pipeline. Instead of allowing full bypassing between all datapaths, we insert only a few carefully chosen bypasses between highly communicating datapaths, aiming at reducing the inter-cluster communication. These interconnections are selected based on the communication pattern of the application. Pipeline bypasses have a reasonably small impact on the processor cycle time [2], consisting basically of the delay of the routing lines between datapaths. Thus, for the sake of simplicity, we neglect the impact of bypasses on the cycle time2.
2 Provided that only a "few" are inserted.
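The conditions (a) and (b) above, together with the comparator-count estimate, can be summarized in a short sketch. This is our own illustration, not code from the paper; in particular, modelling the bypass reach as a single "maximum distance in cycles" parameter is an assumption.

```python
def needs_copy(same_cluster, has_bypass, schedule_distance, bypass_reach):
    """Decide whether the compiler must issue an explicit copy operation
    for one data-dependency between operations in two different datapaths.

    same_cluster     : the two datapaths share a register file
    has_bypass       : a bypass interconnection links the two datapaths
    schedule_distance: cycles between the definition and the use
    bypass_reach     : assumed maximum distance the bypass can cover
    """
    if same_cluster:
        return False                      # resolved through the shared register file
    if has_bypass and schedule_distance <= bypass_reach:
        return False                      # condition (b) does not apply: the bypass suffices
    return True                           # conditions (a)/(b): copy over the copy-bus


def full_bypass_comparators(n_fus, pipeline_depth):
    """Comparator count for bypassing between all datapath pairs (~2*d*n^2, cf. [2])."""
    return 2 * pipeline_depth * n_fus ** 2


print(full_bypass_comparators(16, 5))  # 2560 comparators for 16 FUs and a 5-stage pipeline
```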
Figure 4: IMPACT list scheduling using clustering as the second criterion. (The figure shows the example DDG, the resulting FU assignments, the intercommunication table and the reservation table discussed in Section 4.)
Our approach is based on three phases. Initially, a variation of list scheduling is used to schedule operations into compacted instructions (Section 4). The algorithm tries to cluster dependent operands into the same FU so as to avoid expensive inter-cluster copy operations. In the second phase (Section 5), we use a partitioning algorithm to assign functional units to clusters, such that the majority of the data dependencies are resolved by the cluster register file. Assigning FUs to clusters based on the application is a central feature of our approach which has not been extensively explored before. Finally, functional units are assigned to physical datapaths inside clusters (Section 6), and bypass interconnections are inserted between the most communicating datapaths.
4. SCHEDULING
Our scheduling approach is a simple extension of the IMPACT compiler list-scheduling algorithm. For a given operation op in the candidate list, IMPACT uses the distance from op to a root of the Data-Dependence Graph (DDG) as the first scheduling criterion, followed by the number of children operations that become candidates if op is scheduled. We use exactly the same algorithm, adding only a small modification to determine which FU will be assigned to execute op. For each candidate operation op removed from the priority list, its FU is determined based on the FUs of its parents in the DDG. If the intersection of the FUs assigned to parents(op) is non-empty, op is assigned the same FU as its parents, if that FU is free. Otherwise, op is assigned the first FU available at the current time step. If the intersection of the FUs in parents(op) is empty, op is assigned the first free FU, giving higher priority to the FUs assigned to its parents. The central idea here is to keep the result of an operation in the same register cluster as its operands. By doing so, we avoid increasing the number of inter-cluster copy operations as much as possible.
Consider, for example, the DDG of Figure 4. For the sake of simplicity, we assume in this example that all operations have single-cycle latencies. Moreover, consider that the scheduling priority is such that operations are scheduled in alphabetic order. Initially, operations A-D are assigned to FU0-FU3. After A-D are scheduled, E is the next operation in the working list which is ready to be scheduled. The intersection of the FUs assigned to the parents of E (FU0 and FU1) is empty, so E is scheduled to the first free FU that was assigned to its parents (i.e., FU0). Next, F is scheduled, and since it has no parents it is assigned the next free FU, i.e., FU1. Operation G is then scheduled to FU2, since its parents' FUs are different. The next candidate for scheduling is H, which is assigned to FU3. Operation I is scheduled to FU0, while J and K are assigned to FU2 and FU0 respectively. Finally, L is scheduled to FU0, the same functional unit as its parents I and K.
Notice from Figure 4 that whenever a data-dependency exists between two operations scheduled to different FUs, some action must be taken to assure that this dependency is satisfied3. At this point of our solution, FUs have not been assigned to clusters yet. When FUs are assigned a common register file inside the same cluster, the dependency can be satisfied through the register file, or by some intra-cluster bypass if one exists. On the other hand, if the FUs are located in different clusters, a copy operation will be required if there is no inter-cluster bypass between those two FUs. For example, consider operations J and K scheduled to functional units FU2 and FU0. If units FU0 and FU2 are assigned to the same cluster, no copy operation will be required to communicate the result of J to K. The same is not true if these operations are scheduled to FUs in different clusters and there is no bypass between their datapaths.
In order to evaluate the communication pattern between FUs, we measure the dependencies between each pair of FUs using the communication table shown in Figure 4. Each entry in this table corresponds to the number of data dependencies that need to be satisfied between a pair of FUs. For example, entry (1,0) in the table is 2, meaning that two operations scheduled to FU1 (B and F) communicate their results to two operations scheduled to FU0 (E and I).
3 We assume that intra-pipeline data-hazards are always satisfied.
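The FU-selection rule and the communication-table bookkeeping described above can be sketched as follows. This is our own compact rendering of the text, not the IMPACT implementation; the priority-list construction and the resource model are deliberately simplified.

```python
def pick_fu(parent_fus, free_fus):
    """Choose an FU for a candidate operation (Section 4 rule, simplified).

    parent_fus: set of FUs already assigned to the operation's DDG parents
    free_fus  : FUs still free in the current time step, in issue order
    Prefer a free FU already used by a parent; otherwise take the first free FU.
    """
    for fu in free_fus:
        if fu in parent_fus:
            return fu
    return free_fus[0] if free_fus else None


def communication_table(fu_of, ddg_edges, n_fus):
    """Count data dependencies between pairs of FUs, as in Figure 4.

    fu_of    : {operation: assigned FU}
    ddg_edges: iterable of (producer, consumer) operation pairs
    Entry (i, j) counts results produced on FU i and consumed on FU j,
    matching the reading of entry (1, 0) = 2 in the Figure 4 example.
    """
    table = [[0] * n_fus for _ in range(n_fus)]
    for prod, cons in ddg_edges:
        if fu_of[prod] != fu_of[cons]:
            table[fu_of[prod]][fu_of[cons]] += 1
    return table
```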
5. CLUSTER PARTITIONING
After the communication table is computed, our algorithm divides the FUs among clusters such that the most intercommunicating FUs are assigned to the same cluster. Initially, the communication table is reduced to a lower-triangular matrix, in order to accumulate the dependencies (i, j) and (j, i) into a unique value. As said before in Section 3, one bypass is a bi-directional connection between all stages of datapaths i and j. This is not a requirement of our approach, though, and it can be relaxed if required. The table on the top left corner of Figure 5 shows a reduced communication table. Based on this table, we build a cluster vector whose size is the number of FUs. Each entry in this vector stores the number of an FU. The indices of the vector correspond to physical datapaths, and are divided according to the number of clusters. In the case of Figure 5, four functional units FU0-FU3 must be assigned to two clusters (0 and 1), each cluster containing two physical datapaths.
Cluster 0 contains datapaths DP0 and DP1, and cluster 1 datapaths DP2 and DP3. A variation of the LPK algorithm [12] is then used to swap FUs between clusters, so as to minimize their communication. The communication cost between two clusters, for a given FU distribution, is the total number of data dependencies that cross the cluster border. Initially, the algorithm divides the functional units into two sets of clusters. It swaps all possible FU pairs, one from each set, storing the smallest cost it has seen so far. After all possible FU exchanges have been tried, the resulting smallest cost gives the best FU distribution between the two sets of clusters. The algorithm proceeds recursively into each cluster set, until all FUs are assigned a cluster.
Consider, for example, the reduced communication table and cluster vector in Figure 5. The communication between two FUs is represented by a double-headed solid arrow labeled with the cost from the communication table. The cost of the initial partitioning in Figure 5a (21) is the result of the sum of the communication costs between: FU0 and FU2 (cost 8); FU0 and FU3 (cost 3); FU1 and FU2 (cost 1); and FU1 and FU3 (cost 9). FU0 and FU2 (in gray) are then selected for swapping, resulting in the new configuration (2103) with cost 23 (Figure 5b). The algorithm proceeds exchanging pairs of FUs from the initial partitioning (Figure 5(c-e)) while computing their costs. After all pairs of FUs have been tested, the minimal communication cost (10) is achieved. The configuration that results in the smallest inter-cluster communication (3120) is obtained by swapping FU3 and FU0 (Figure 5c).
Figure 5: Selecting the most communicating functional units.
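A sketch of one bipartitioning step of the LPK-based heuristic described above (our own simplification; the actual algorithm works on the reduced communication table and the cluster vector, and recurses into each half until every FU has a cluster):

```python
from itertools import product

def cut_cost(comm, left, right):
    """Total communication crossing the border between two FU sets."""
    return sum(comm[i][j] + comm[j][i] for i in left for j in right)

def best_single_swap(comm, left, right):
    """Try every exchange of one FU from each set and keep the cheapest cut."""
    best = (cut_cost(comm, left, right), left, right)
    for a, b in product(left, right):
        new_left = [b if x == a else x for x in left]
        new_right = [a if x == b else x for x in right]
        cost = cut_cost(comm, new_left, new_right)
        if cost < best[0]:
            best = (cost, new_left, new_right)
    return best
```

In the Figure 5 example the initial {0,1} | {2,3} split costs 21 (8 + 3 + 1 + 9), and the text reports that exchanging FU3 and FU0 reaches the minimal cut of 10.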
6. DATAPATH MAPPING
After the scheduling and partitioning tasks described above are finished, operations are associated to FUs and FUs to clusters. To complete the architectural design, FUs must be assigned to their corresponding physical datapaths, and bypass lines inserted. We do that using the two-step procedure shown in Figure 6. First, each inner-loop communication table is used, in combination with the result of its cluster vector after partitioning, to compute a partial hardwired communication table. This table is a representation of the number of data dependencies between program operands in a particular loop, given the current architecture. Its goal is basically to map each FU in the communication table to its corresponding datapath (and cluster) in the cluster vector.
Figure 6: Mapping the communication tables to hardwired bypass interconnects.
For example, at the center of Figure 6, FU2 has been assigned to DP0 (index 0) of the cluster vector. Hence, line 2 in the communication table is mapped to column 0 of the partial hardwired table. Notice that one partial table emerges for each inner-loop super-block in the program. In the case of Figure 6, three loops were considered. Since the resulting architecture has to execute all of them, we need to take into account the contribution of each loop to the overall inter-cluster communication. This is done, in a second phase, by adding up the partial communication tables into a single hardwired communication table. Each entry in a given partial table is weighted by 10^NL, where NL is the nesting level of the loop corresponding to that table [13]. A better estimate is possible if the loop trip-count can be determined at compile time.
The resulting hardwired communication table is then used to determine the pairs of datapaths which will be interconnected with bypass lines. To do that, the entries in the hardwired table are sorted into a priority list, such that the most communicating pairs of datapaths have a higher priority. Bypass lines are inserted between datapath pairs, the highest-priority pair first. In Figure 6, for example, the highest entry in the hardwired table is 32,300, corresponding to the communication between DP0 and DP1. These datapaths were assigned to the same cluster 0, in order to reduce the cost of
inserting copy operations between them. Some of the data-dependencies between DP0 and DP1 will be resolved by the common register file in cluster 0, but many short-distance dependencies can be satisfied by adding a bypass line between DP0 and DP1.
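The loop-weighting and bypass-selection steps described above can be sketched as follows (our own illustration; the data structures for the per-loop partial tables are assumptions):

```python
def hardwired_table(partial_tables, nesting_levels, n_dps):
    """Sum the per-loop partial tables, weighting each one by 10**NL ([13]).

    partial_tables[k][i][j]: dependencies between datapaths i and j in loop k
    nesting_levels[k]      : nesting level NL of loop k
    """
    total = [[0] * n_dps for _ in range(n_dps)]
    for table, nl in zip(partial_tables, nesting_levels):
        weight = 10 ** nl
        for i in range(n_dps):
            for j in range(n_dps):
                total[i][j] += weight * table[i][j]
    return total


def choose_bypasses(total, max_bypasses):
    """Grant bypass lines to the most communicating datapath pairs first."""
    pairs = [((i, j), total[i][j] + total[j][i])
             for i in range(len(total)) for j in range(i)]
    pairs.sort(key=lambda entry: entry[1], reverse=True)
    return [pair for pair, _ in pairs[:max_bypasses]]
```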
7. EXPERIMENTAL RESULTS
The approach described above was implemented in the IMPACT infrastructure, and the resulting compiler was used to compile eighteen programs from the SPEC CINT95 (6) and CFP95 (5) benchmarks, IMPACT (5) and Miscellaneous (2), as shown in Table 1. The compiler was executed using superblock formation, maximum unrolling of 32 and no predication. In our experiments we estimate the number of copy operations and cycles produced by each program across a large number of architecture configurations. Each configuration corresponds to a different combination of the following parameters: (a) number of FUs (from 2 to 16); (b) number of register file clusters (from 1 to the number of FUs); (c) number of bypass interconnections (from 0 to the number of FUs). For the sake of simplicity we adopted homogeneous clusters, i.e., all clusters (CLs) have the same number of FUs.
The goal of the experimental work was to determine the impact of the techniques described in Sections 4, 5 and 6. The experiments were divided into three parts. First, we evaluated the impact of bypass insertion on the cycle count of the programs. In the second part, we studied how cluster partitioning and bypassing affect the number of copy instructions between clusters. In the last set of experiments we evaluated the impact of the scheduling and mapping algorithms.
CINT95: 099.go, 124.m88ksim, 129.compress, 130.li, 132.ijpeg, 147.vortex
CFP95: 101.tomcatv, 102.swim, 103.su2cor, 107.mgrid, 125.turb3d
IMPACT: fir, kalman, paraffins, dag, eight
MISC: mpeg2dec, mpeg2enc
Table 1: Benchmark Programs
Figure 7: Impact of adding bypasses for program go (16 FUs).
Figure 8: Impact of adding bypasses for program su2cor (16 FUs).
The maximum number of bypass interconnections is given by n(n-1)/2, where n is the number of FUs for that configuration. Figures 7 and 8 show the impact, on programs go and su2cor, of adding from 0 to 120 bypasses (full bypassing between all FUs). All architecture configurations considered in the following analysis have 16 FUs, and range from 1 to 8 clusters. For program go (Figure 7), we noticed that most of the speed-up was achieved with 8 bypasses (6.5% for one cluster and 5.8% for 8 clusters). Only a small difference was noticed when using 16 or more bypasses (7.1% for one cluster and 6.4% for 8 clusters, when 16 bypasses are used). For program su2cor (Figure 8) we faced a more complex tradeoff. In the first knee of the curve (left side of the figure), when 4 bypasses are used, the speed-up was 2.62% (one cluster) and 2.75% (8 clusters). This value increases very slowly: even 120 bypasses improve the speed-up by only 6% (1 cluster) and 7.2% (8 clusters). Thus, since the marginal speed-up decreases almost exponentially with the number of bypasses, we restrict the maximum number of bypasses to the number of FUs.
In the second part of our experimental work, we studied the impact of clustering and bypassing on the number of inter-cluster copy instructions. Consider, for example, the graphs in Figure 9a-b, where the number of inter-cluster copy operations for program su2cor is measured using two architectures with 8 and 16 FUs. In those graphs, the vertical (horizontal) axis represents the number of copy operations (clusters). The curves have as parameter the number of bypass lines inserted between FUs. The number of copy operations grows with the number of clusters, as expected, given the increase in inter-cluster communication. Nevertheless, notice that as bypass lines are added to the architecture, many copy operations are wiped out of the program.
Bypasses    0       1
su2cor      6169    2851
go          53270   25784
Table 2: Number of copies for 2 FUs
In Table 2 (2 FUs/2 CLs), a single bypass line removes more than 50% of all copy operations.
Figure 9: Number of copy operations for program su2cor on: (a) an 8-FU architecture; (b) a 16-FU architecture. (c) Percentage of copy operations removed from su2cor through bypassing, as a function of the ratio of clusters to FUs.
Figure 10: Number of copy operations for program go on: (a) an 8-FU architecture; (b) a 16-FU architecture. (c) Percentage of copy operations removed from go through bypassing, as a function of the ratio of clusters to FUs.
Figure 11: Cycle count for program su2cor in non-clustered and clustered configurations with no bypasses.
Figure 13: Cycle count for program go in non-clustered and clustered configurations with no bypasses.
Figure 12: Impact on the cycle count for program su2cor due to maximum bypassing.
Figure 14: Impact on the cycle count for program go due to maximum bypassing.
If the same program runs on an 8 FUs/8 CLs configuration (Figure 9a), 8 bypasses are required to reduce the number of copy operations by 38%. As more bypasses are added, the law of diminishing returns sets in and the gain saturates. In Figure 9b, for 16 FUs, we removed 22% of the copy operations using 8 bypasses and 29% of the copy operations using 16 bypasses. In general, we noticed, across all SPEC programs, that the insertion of a single bypass reduces the total number of copy operations by 10% to 70%. Notice, for example, that program go repeats the same behavior (Figure 10a-b) as su2cor.
The bypass effectiveness when all configurations run program su2cor is described in Figure 9c. The vertical axis in that graph shows the percentage of copy operations removed with respect to a bypassing-free configuration. The horizontal axis represents the ratio CLs/FUs. Notice that it ranges to a maximum of one, since the number of CLs is at most the number of FUs. As mentioned before, increasing the number of bypasses implies a reduction in the number of copy operations. For example, 44% of the copy operations are removed when 4 bypasses are inserted in a 4 FUs/2 CLs architecture. Interestingly enough, the percentage of copy operations removed by bypasses seems to saturate when the ratio CLs/FUs is around 0.5, i.e., when each cluster has two FUs. The same pattern was also detected in the majority of the combinations of programs and architecture configurations (e.g., program go in Figure 10c). We believe this might have to do with the way FU data-dependencies are uniformly partitioned across clusters by the binary recursive algorithm described in Section 5. Further experimental work will be required to clarify this finding.
In the third part of our experiments, we studied the impact of the scheduling and clustering algorithms. We plotted, for all programs, the cycle count as a function of the number of FUs and clusters. In order to filter out the effect of bypassing, we considered only 0-bypassing configurations. As shown in Figure 11, for program su2cor, the performance of the clustered architectures follows very closely the performance of the non-clustered architecture (1 Cluster, 0 Bypassing). In other words, our approach is capable of canceling the negative effects of clustering, namely the inter-cluster copy instruction overhead. The same can be seen in Figure 13, for program go.
The effect of the bypass lines on the cycle count for programs su2cor and go is shown in Figures 12 and 14.
The best combination of clusters and number of bypass lines was used for each architecture. Surprisingly, the large gains achieved in reducing the number of copy operations in Figures 9 and 10 did not translate into real performance gains. In general, benchmark program speed-ups due to bypassing ranged from 6% to 15% only. Although it is not clear why, it might be that the combination of the cluster partitioning and scheduling algorithms leaves only a small number of copy operations in the code for bypassing. Yet another explanation can be drawn from this finding: not enough ILP is available in the unrolled loop bodies4. In this case, the generated instructions could have enough empty slots to hide the latency of most copy operations. Further experimental work will be required to address this issue.
Notice that the cycle count does not take into consideration the benefits achieved by clustering, i.e., a smaller register file and reduced latency. If the register file determines the cycle time of the processor, the curves representing clustered architectures in Figures 11 and 13 will reveal a performance improvement proportional to the reduction in the register file latency. Otherwise, by using our technique, the same performance level as a non-clustered architecture is achieved at a smaller processor cost.
4 Remember that predication is not used.
8. CONCLUSIONS AND FUTURE WORK
This paper presents a scheduling and partitioning algorithm for clustered VLIW architectures aimed at reducing the communication cost between datapaths and clusters. This is achieved by assigning highly communicating datapaths to the same register file, while tailoring bypass interconnections to the application. Preliminary experimental results reveal a substantial reduction in the number of inter-cluster copy operations and a potential performance improvement. As the next steps in this project we are considering: (a) using the data-dependency distance between scheduled operations to improve the communication cost estimate; (b) inserting delay registers into bypass lines to resolve long-distance data-dependencies.
9. ACKNOWLEDGMENTS This work was partially supported by research grants from CNPq/NSF Collaborative Research Project 68.0059/99-7, CNPq research grant 300156/97-9, fellowship research awards from CAPES 01-P-5822/00 and FAPESP 97/10982-0, 99/09462-8. We also thank the reviewers for their comments.
10. REFERENCES
[1] A. Abnous and N. Bagherzadeh. Pipelining and bypassing in a VLIW processor. IEEE Trans. on Parallel and Distributed Systems, 5(6):658-663, June 1994.
[2] A. Abnous and N. Bagherzadeh. Architectural design and analysis of a VLIW processor. International Journal of Computers and Electrical Engineering, 21(2):119-142, 1995.
[3] P. S. Ahuja, D. W. Clark, and A. Rogers. The performance impact of incomplete bypassing in processor pipelines. In MICRO-28, 1995.
[4] A. Capitanio, N. Dutt, and A. Nicolau. Design considerations for limited connectivity VLIW architectures. Technical Report TR-92-59, University of California, Irvine, Irvine, CA 92717, 1992.
[5] A. Capitanio, N. Dutt, and A. Nicolau. Partitioned register files for VLIWs: A preliminary analysis of tradeoffs. In 25th International Symposium on Microarchitecture (MICRO), 1992.
[6] J. R. Ellis. Bulldog: A Compiler for VLIW Architectures. MIT Press, 1986.
[7] P. Faraboschi, G. Desoli, and J. A. Fisher. Clustered instruction-level parallel processors. Technical Report HPL-98-204, HP Labs, USA, 1998.
[8] M. M. Fernandes, J. Llosa, and N. Topham. Partitioned schedules for clustered VLIW architectures. In IEEE/ACM International Parallel Processing Symposium, 1998.
[9] J. A. Fisher. Trace scheduling: A technique for global microcode compaction. IEEE Trans. on Computers, C-30(7):478-490, July 1981.
[10] W. W. Hwu et al. IMPACT advanced compiler technology. http://www.crhc.uiuc.edu/IMPACT/index.html.
[11] M. F. Jacome, G. de Veciana, and V. Lapinskii. Exploring performance tradeoffs for clustered VLIW ASIPs. In International Conference on Computer-Aided Design, 2000.
[12] C. Lee, C. Park, and M. Kim. Efficient algorithm for graph partitioning problem using a problem transformation method. Computer Aided Design, 21(10):611, December 1989.
[13] S. S. Muchnick. Advanced Compiler Design and Implementation. Morgan Kaufmann, 1997.
[14] E. Ozer, S. Banerjia, and T. M. Conte. Unified assign and schedule: A new approach to scheduling for clustered register file microarchitectures. In 31st International Symposium on Microarchitecture (MICRO), 1998.
[15] E. Ozer and T. M. Conte. Optimal cluster scheduling for a VLIW machine. Technical report, Dept. of Elec. and Comp. Eng., North Carolina State University, 1998.
[16] E. Ozer and T. M. Conte. Unified cluster assignment and instruction scheduling for clustered VLIW microarchitectures. Technical report, Dept. of Elec. and Comp. Eng., North Carolina State University, 1998.
[17] B. R. Rau, V. Kathail, and S. Aditya. Machine-description driven compilers for EPIC and VLIW processors. Design Automation for Embedded Systems, 4(2/3):71-118, 1999.
[18] J. Sanchez and A. Gonzalez. The effectiveness of loop unrolling for modulo scheduling in clustered VLIW architectures. In Intl. Conference on Parallel Processing (ICPP), 2000.
[19] J. Sanchez and A. Gonzalez. Instruction scheduling for clustered VLIW architectures. In Intl. Symposium on System Synthesis (ISSS), 2000.