J Supercomput (2007) 40: 127–157 DOI 10.1007/s11227-006-0016-1

Design space exploration of an optimized compiler approach for a generic reconfigurable array architecture Grigoris Dimitroulakos · Michalis D. Galanis · Costas E. Goutis

Published online: 24 February 2007 © Springer Science+Business Media, LLC 2007

Abstract Several mesh-like coarse-grained reconfigurable architectures have been devised in the last few years, accompanied by their corresponding mapping flows. One of the major bottlenecks in mapping algorithms on these architectures is the limited memory access bandwidth. Only a few mapping methodologies have addressed the problem of the limited bandwidth, while none has explored how the performance improvements are affected by the architectural characteristics. In this paper we study the impact that the architectural parameters have on the performance speedups achieved when the PEs' local RAMs are used for storing variables with data reuse opportunities. The data reuse values are transferred through the internal interconnection network instead of being fetched from external memories, in order to reduce the data transfer burden on the bus network. A novel mapping algorithm that uses a list scheduling technique is also proposed. The experimental results quantify the trade-offs that exist between the performance improvements and the memory access latency, the interconnection network and the processing element's local RAM size. For this reason, our mapping methodology targets a flexible architecture template which permits such an exploration. More specifically, the experiments showed that the improvements increase with the memory access latency, while a richer interconnection topology can improve the operation parallelism by a factor of 1.4 on average. Finally, for the considered set of benchmarks, the operation parallelism was improved by 8.6% to 85.1% by the application of our methodology, even with a local RAM of only 8 words in each PE.

G. Dimitroulakos () · M.D. Galanis · C.E. Goutis
VLSI Design Laboratory, ECE Department, University of Patras, Patras, Greece
e-mail: [email protected]

M.D. Galanis
e-mail: [email protected]

C.E. Goutis
e-mail: [email protected]


Keywords Coarse-grained reconfigurable arrays · Compiler techniques · Data reuse · Data bandwidth bottleneck · Reconfigurable computing

1 Introduction

Coarse-grained reconfigurable architectures have been proposed for accelerating computation intensive parts of DSP applications in embedded systems. These architectures combine the high performance of ASICs with the flexibility of microprocessors. Coarse-grained reconfigurable architectures consist of a large number of Processing Elements (PEs) connected with a configurable interconnect network. This work focuses on architectures where the PEs are organized in a 2-dimensional (2D) array and connected with mesh-like reconfigurable networks [1–6]. In this work, these architectures are called Coarse-Grained Reconfigurable Arrays (CGRAs). This type of reconfigurable architecture is gaining increasing interest because it is simple to construct and it can be scaled up, since more PEs can be added to the mesh-like interconnect. Each PE typically contains a reconfigurable Functional Unit (FU) and a small local memory. CGRA architectures can efficiently execute word or subword operations, in contrast to the bit-level operations of FPGAs. Due to their large number of PEs, CGRA architectures can provide massive amounts of parallelism and high computational capability. Typically, the application domain of CGRAs is DSP and multimedia. These kinds of applications usually spend most of their execution time in loop structures. These computationally intensive parts have high amounts of operation and data parallelism. The large number of PEs available in coarse-grained architectures can be used to exploit this parallelism and thus accelerate the loops of an application. Furthermore, as shown in [7], a CGRA can achieve better performance speed-ups relative to a VLIW DSP processor. The aim of mapping algorithms for CGRAs is to exploit the inherent parallelism in the considered applications for increasing performance. An increase in the operation parallelism results in a respective increase in the rate at which data are fetched from memory (called the data memory bandwidth), which is the major bottleneck in exploiting the inherent parallelism [8]. Although a few automated mapping methodologies [9–11] have been proposed to tackle the problem of limited memory bandwidth in CGRAs, none of them has explored how the performance improvements are affected by the architectural characteristics. Such a study can provide detailed knowledge about the effect of design parameters on the quality of the resulting architecture, its performance, storage needs, and flexibility. Pursuing such a study requires a flexible architecture template, together with a resource aware scheduling technique. Apart from proposing a new mapping methodology that tackles the memory bandwidth bottleneck, our work studies how the performance improvements are affected by crucial architecture parameters such as the memory access latency, the structure of the interconnection network and the size of the PEs' local RAMs. For this reason the proposed mapping methodology targets a parametric architecture template whose characteristics are drawn from popular CGRAs. Our mapping approach exploits the high bandwidth foreground memory, which is constituted by the PEs' local RAMs and the interconnections among them, for the purpose of storing variables with data reuse opportunities.


We refer to this storage hereafter as the Distributed Foreground Memory. The data reuse values are transferred through the CGRA's interconnection network instead of being fetched from external memories. In this way more operations that require memory accesses can run in parallel. The experimental results quantify the trade-offs that exist between the performance improvements and the memory access latency, the interconnection network and the processing element's local RAM size. The rest of the paper is organized as follows: in Sect. 2, the existing work in the field of mapping applications to CGRA architectures is overviewed. Sect. 3 presents the architecture template that the methodology targets. Sect. 4 describes the proposed methodology, while the experimental results are presented in Sect. 5. Finally, conclusions are outlined in Sect. 6.

2 Related work

The first devices that were used for reconfigurable computing were the Field Programmable Gate Arrays (FPGAs). An FPGA consists of a matrix of programmable logic cells, executing bit-level operations, with a grid of interconnect lines running among them. FPGAs allow systems to be realized at a low granularity level, that is, with logic gates and flip-flops. This makes FPGAs very popular for the implementation of complex bit-level operations. The most recent survey of reconfigurable systems, including FPGAs, can be found in [12]. However, FPGAs are inefficient for coarse-grained datapath operations due to the high cost of reconfiguration in performance and power. To overcome the performance bottleneck for this type of applications, research was directed to reconfigurable architectures based on coarser granularity blocks. The coarser granularity greatly reduces the delay, power and configuration time relative to an FPGA device at the expense of flexibility [1, 12, 13]. A major part of the coarse-grained reconfigurable architecture domain is covered by architectures where the PEs are organized in a 2D array (CGRAs). Many CGRA architectures have been proposed in the past few years [1–6], accompanied by an equal number of mapping methodologies. Typically, the applications that belong to the application domain of CGRAs are characterized by a high data transfer rate between the processor and the memory. Only a few automated approaches [9–11] have been proposed to tackle the problem of the limited memory bandwidth in CGRAs for exploiting the hardware parallelism, while none is based on the exploitation of the interconnection network among PEs, as our methodology is. Apart from proposing a new mapping methodology which tries to alleviate the memory bandwidth bottleneck, our mapping flow is to a great degree parametric with respect to the architectural features. This capability is exploited and a series of experiments are presented illustrating the impact of the architecture characteristics on performance speedups. These experiments expose the trade-offs emerging from the exploitation of the data reuse opportunities for reducing the memory bandwidth bottleneck. We consider the latter our main contribution in this work, since such an exploration has not yet been published in this context (to the best of our knowledge) for any of the following works. Moreover, another class of processors that resemble CGRAs are the systolic arrays. Most systolic arrays appear as ASICs, although some programmable systolic arrays have been defined, such as the Intel iWarp [14] and NuMesh [15].


Programmable systolic arrays use a very different control model, based on the standard microprogrammed control of a general datapath. iWarp [14], in particular, closely resembled a microprocessor with hardware support for systolic communication. In contrast, CGRAs (such as the template considered in our work) are much more fine-grained, with small memories, configurable interconnect and a more efficient configurable control mechanism. Additionally, the systolic arrays that appeared as ASICs were mostly used for mapping systolic DSP algorithms [16], whereas CGRAs can also be configured to implement algorithms that are not systolic, for example a Viterbi decoder. Hence, the existing mapping approaches for systolic arrays target signal processing applications with regular structure and cannot handle control-oriented DSP algorithms. The approaches for CGRAs are not confined to regular applications but also cover control-intensive ones. So, CGRAs can be conceived as a superset of the systolic arrays. In the following two subsections we classify the existing works on CGRA mapping algorithms into those developed for optimizing the memory bandwidth requirements and those exploring parts of the design space for mapping algorithms to CGRAs.

2.1 Mapping approaches with memory bandwidth optimization for CGRAs

In [9] a CGRA architecture was presented. Each PE has two input and two output registers, while its operation is data-driven. The PEs' interconnection network has direct unidirectional connections among neighbouring PEs. The mapping methodology for this architecture is based on a simulated annealing algorithm in which the routing of data is considered during the operations' placement phase. The mapping approach is retargetable with respect to different architecture instances generated by the KressArray architecture template, in order to perform design space exploration [17] for identifying architecture instances with good performance-power trade-offs. For reducing the memory bandwidth requirements, the global register file is used, instead of the PEs' local register files as in our approach. The PACT-XPP [3] is a hierarchical array of coarse-grain processing array elements. A series of vertical and horizontal buses establish communication among the PEs, while shared memory banks on the left and the right side of each array row store the intermediate data values. The mapping of algorithms on this architecture is based on temporal partitioning while, to reduce the number of memory accesses, the compiler [10] only reads one element per iteration and generates shift registers to store the data reuse values when array references inside loops read subsequent element positions. Again, this method differs from ours in the sense that the distributed foreground memory is not exploited. By using the distributed foreground memory we avoid the generation of additional circuitry for exploiting the data reuse opportunities. In [11] a generic template for a wide range of CGRAs was presented. A three-level mapping algorithm is used to generate loop pipelines that fit into the CGRA. First, in the PE-level mapping stage, microoperation trees are mapped to single PEs without the need of reconfiguration. Then the PE-level mappings are grouped together at line level in such a way that the number of required memory accesses does not exceed the capacity of the memory interface belonging to the line.


In the plane-level phase, the line-level mappings are put into the 2D array. In this work [11], operations with data reuse opportunities are placed in the same line, and in the plane-level phase these 1-dimensional mappings are placed on the architecture's 2D space. On the contrary, in our work we increase the mapping freedom by considering the whole 2D space for each operation through a set of resource aware heuristics. In [6] an array architecture is proposed in which each PE contains instruction memory, data memories, an arithmetic logic unit, registers and configurable logic. The placement of operations on the PEs precedes the scheduling phase as a separate step, while the scheduling of operations is performed by a list scheduling algorithm. To reduce bus congestion this architecture has a large SRAM inside each PE in order to decrease the latency for fetching the operation's input operands. The work in [4] combines a RISC processor with a reconfigurable cell array consisting of 8 × 8 identical 16-bit PEs (subdivided into 4 quadrants). The interconnection network comprises three hierarchical levels where nearest-neighbor, intra-quadrant, and inter-quadrant connections are possible. In this case, for reducing the overhead of transferring data from memory to the PEs, a streaming cache is employed, while small register files exist inside each PE for storing the intermediate values. The mapping algorithm [18] proposed for this architecture mainly focuses on image processing applications and targets the exploitation of the streaming cache, while the transfers between the RISC and the RC array are manually defined.

2.2 Exploration with CGRA architectures

ADRES [2] is a parametric architecture template of a CGRA tightly coupled with a VLIW processor. The CGRA's operation set, resource allocation, timing, and even the internal organization of each processing element define a specific instance of the architecture. The mapping algorithm [2] is based on a modulo scheduling algorithm where the binding, routing and scheduling are considered simultaneously by a simulated annealing technique. The modulo scheduler is incorporated in the DRESC design flow [7]. In [19, 20] the DRESC flow was applied on a set of kernels for finding the most efficient ADRES instance for executing them. For this purpose, a series of experiments concerning different interconnection topologies, heterogeneous functional units, different memory interfaces and variations of the register file structure were conducted. However, the DRESC scheduling algorithm does not include a memory bandwidth optimization technique as our methodology does. In [21] an exploration of interconnect topologies for CGRA architectures was performed, using a retargetable list-based mapping algorithm that targets a CGRA architecture template that can generate CGRA architecture instances with different PE interconnection networks. The work in [22] explores the performance of the list-based algorithm presented in [21] by introducing three different cost functions. Neither [21] nor [22] considers memory management techniques for reducing the data bandwidth bottleneck, while it is also assumed that each PE is directly connected to memory.

3 Architecture template

In this section, the generic reconfigurable template used for the proposed mapping methodology is described.


Table 1 CGRA architecture design parameters

Component                                      Description

CGRA Interconnection Network
  structure                                    Pairwise connections (PE_x, PE_y)
  interconnection width                        2^x bits, x ∈ {3, 4, 5, ...}
  interconnection latency                      x ∈ {0, 1, 2, ...} cycles

PE microarchitecture
  operand bitwidth                             2^x bits, x ∈ {3, 4, 5, ...}
  Local RAM size                               2^x words, x ∈ {1, 2, 3, ...}
  context RAM size                             2^x configuration contexts, x ∈ {1, 2, 3, ...}
  fast reconfiguration overhead                0 cycles
  context RAM fill latency (1 configuration)   x ∈ {1, 2, 3, ...} cycles
  Operation Set                                {add, multiplication, ALU}
  Per Operation Latency                        x ∈ {1, 2, 3, ...} cycles
  Register File Input Ports                    x ∈ {1, 2, 3, ...} ports
  Register File Output Ports                   x ∈ {3, 4, 5, ...} ports

Data Memory Interface
  PE-Bus Connections                           Pairwise connections (PE_x, Bus_y)
  Scratch Pad's Access Latency                 x ∈ {1, 2, 3, ...} cycles
  Number of Scratch-Pad's ports                x ∈ {1, 2, 3, ...} ports
  Main Memory Module Access Latency            x ∈ {1, 2, 3, ...} cycles
  Bus Bitwidth                                 2^x bits, x ∈ {3, 4, 5, ...}
  Number of Buses                              x ∈ {1, 2, 3, ...} buses
  Bus Multiplexing                             Pairwise connections (Scratch Pad's port_x, Bus_y)
  Memories' Size                               x words
  Scratch-Pad's bank configuration             Bank_x → 2^x1 words, x2 ports
                                               (x is the number of banks, x1 ∈ Z, x2 ∈ {1, 2, 3})

Since it is based on characteristics found in the majority of CGRA architectures [1–6], it can be used as a realistic model for mapping applications to this type of architecture. Furthermore, it represents a wide variety of 2D CGRAs, since it is parametric with respect to the different PE network configurations in terms of: (a) the number and type of PEs, (b) the interconnections among them, and (c) the data memory interface architecture. The proposed architecture template is shown in Fig. 1a and consists of 4 basic parts: (a) the PEs organized in a 2D interconnect network, (b) the configuration memory, (c) the data memory interface and (d) the main data memory. Table 1 lists the design parameters of the considered CGRA architecture template. The PEs' interconnection network is defined as a set of pairwise PE-to-PE connections. In more detail, depending on the PE communication network description, different network configurations can be instantiated. For example, each PE can be connected to its nearest neighbours as in [9] (Fig. 2a), while there are cases, like [4], where there are also direct connections among all the PEs across a column and a row (Fig. 2b).
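To make the pairwise-connection parameterization concrete, the following C++ sketch shows one possible way to describe the two interconnection topologies of Fig. 2 as instances of a CGRA graph; the structure and helper names (CGRAGraph, connect, buildNearestNeighbour, buildRowColumn) are illustrative assumptions and not part of the actual tool.

#include <algorithm>
#include <set>
#include <utility>

// Hypothetical encoding of the PE interconnection network as a set of
// pairwise (PE_x, PE_y) connections (cf. Table 1).  PEs are numbered
// row-major inside an n x n array.
struct CGRAGraph {
    int rows, cols;
    std::set<std::pair<int, int>> links;                 // undirected PE-to-PE links

    int pe(int r, int c) const { return r * cols + c; }
    void connect(int a, int b) {
        if (a != b) links.insert({std::min(a, b), std::max(a, b)});
    }
};

// Nearest-neighbour topology (Fig. 2a), as in [9].
CGRAGraph buildNearestNeighbour(int n) {
    CGRAGraph g{n, n, {}};
    for (int r = 0; r < n; ++r)
        for (int c = 0; c < n; ++c) {
            if (c + 1 < n) g.connect(g.pe(r, c), g.pe(r, c + 1));
            if (r + 1 < n) g.connect(g.pe(r, c), g.pe(r + 1, c));
        }
    return g;
}

// Topology with direct connections among all PEs in the same row and
// column (Fig. 2b), as in a Morphosys quadrant [4].
CGRAGraph buildRowColumn(int n) {
    CGRAGraph g{n, n, {}};
    for (int r = 0; r < n; ++r)
        for (int c = 0; c < n; ++c)
            for (int k = 0; k < n; ++k) {
                g.connect(g.pe(r, c), g.pe(r, k));        // same row
                g.connect(g.pe(r, c), g.pe(k, c));        // same column
            }
    return g;
}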


Fig. 1 (a) CGRA architecture template. (b) Memory hierarchy. (c) Data Transfer Bottleneck


Fig. 2 (a) Direct connections among neighboring PEs. (b) Direct connection among PEs in the same row and column

Typically, a PE consists of one Functional Unit (FU), which can be configured to perform a specific word-level operation each time. Typical operations supported by the FU are: ALU operations, multiplication, and shifts. Furthermore, the PEs in the proposed CGRA template support predicated operation [23]. Predicate local RAMs exist inside the PEs for storing predicate values. Thus, loops containing conditional statements are supported by the CGRA template through the "if-conversion" process [24]. Moreover, for storing intermediate values between computations and values fetched from memory, a local RAM exists inside each PE. Figure 3 shows an example of a PE architecture. The FU of this PE has three inputs and two outputs. The multiplexers are used to select each input operand, which can come from three different sources: (a) the same PE's RAM, (b) the memory buses and (c) another PE. The output of each FU can be routed to other PEs or to its local RAM. Each PE's context RAM stores control values (configuration contexts) that determine how the FU, the local RAMs, the multiplexers and the demultiplexers are configured. The configuration memory (Fig. 1a) of the CGRA stores the whole configuration for setting up the CGRA to execute the application. The configuration contexts can also be loaded from the configuration memory, at the cost of extra delay, if the PEs' context RAMs are not big enough to store the entire configuration.
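As an illustration only, the configuration context stored in a PE's context RAM can be thought of as a record of control fields like the following sketch; the field names and widths are assumptions based on the PE of Fig. 3, not the actual encoding.

#include <cstdint>

// Hypothetical layout of one configuration context for the PE of Fig. 3.
// Each field drives one of the configurable resources described in the text.
enum class OperandSource : uint8_t { LocalRAM, MemoryBus, NeighbourPE };

struct PEContext {
    uint8_t       fuOpcode;        // word-level operation performed by the FU (ALU, mul, shift, ...)
    OperandSource inputSel[3];     // multiplexer setting for each of the three FU inputs
    uint8_t       neighbourId[3];  // which neighbouring PE feeds an input, when selected
    uint8_t       ramReadAddr[3];  // local RAM location of an input, when selected
    bool          writeToLocalRAM; // demultiplexer: route an FU output to the local RAM...
    uint8_t       ramWriteAddr;    //   ...and where to store it
    bool          forwardToPEs;    // ...or route it to other PEs
    bool          predicated;      // execute only if the predicate value is true
};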


Fig. 3 Example of PE architecture

The CGRA's data memory interface (Fig. 1b) consists of:

(a) The memory buses. As in the majority of the existing CGRA architectures [3–5], the PEs of our template architecture residing in a row share a common bus connection for transferring the memory data values (Fig. 1a). Furthermore, an Address Generation Unit (AGU) (Fig. 1a) is attached to each bus for the addressing. Thus, the mapped algorithm can access a variable residing in the scratch-pad or main memory by generating the proper address.

(b) The scratch pad memory [25]. It is the first level (L1) of the CGRA's memory hierarchy, and serves as a local memory for quickly loading data into the PEs of the CGRA. However, the data values are transferred into the CGRA through the buses, whose bandwidth is limited compared with the bandwidth requirements of applications executed on CGRAs. Thus, the bandwidth bottleneck in this type of architecture lies in the communication path that connects the CGRA with the main memory (Fig. 1c). In this paper we use the terms scratch pad memory and L1 memory interchangeably.

(c) The base memory level. It is called L0, and is formed by the local RAMs inside the PEs. The interconnection network together with the L0 acts as a high-bandwidth foreground memory, since during each cycle several data transfers can take place through different paths in the CGRA. Our methodology exploits this capability for reducing the memory accesses to L1, thus reducing the data transfer bottleneck (Fig. 1c). This is achieved by storing the data reuse values in the L0 and not fetching them again from the L1 memory. In this way, we improve performance by increasing the number of operations requiring memory accesses that can execute in each cycle of the schedule.

4 Mapping methodology for CGRAs

4.1 Motivation-preliminaries

The task of mapping applications to CGRAs is a combination of scheduling operations for execution, mapping these operations to particular PEs, and routing data to specific interconnects in the CGRA.


Fig. 4 (a) Place decision for operation S3. (b) Dependence graph. (c) Sample code segment

Our methodology targets the exploitation of the distributed foreground memory, which is formed by the local data RAMs inside the PEs. For reducing the number of memory accesses due to data reuse opportunities and for improving execution time, a properly developed scheduler is required. The scheduler has to consider simultaneously the data reuse exploitation and the execution time reduction. We illustrate the basic concepts of the proposed mapping methodology with a motivational example. In Fig. 4a, we give a block diagram of a 3 × 3 CGRA architecture where only direct interconnections between PEs exist. Figure 4c presents an algorithm's code segment and its respective dependence graph is given in Fig. 4b. As shown in Fig. 4c, operation O3 accesses the same array data value (A[ ]) as operation O1 one loop iteration later. Also, there is a data dependence from O2 to O3, since the outcome of operation O2 is consumed afterwards by operation O3. Operations O1 and O2 are assumed to be placed for execution in the PEs shown in Fig. 4a. In order to execute O3, a PE where the operation will be executed must be reserved and the appropriate data must be routed there. Apart from variable y, our methodology routes the reused variable A[ ] through the interconnection network to the specific PE required for executing O3, as shown in Fig. 4a. The selection of a PE for executing an operation, together with the way the input operands are fetched to that PE, will be referred to hereafter as a Place Decision (PD) for that specific operation. Each PD has a different impact on the operation's execution time and on the way this operation's execution influences the effectiveness of the PDs of future scheduled operations. The operation's execution time is determined by the operation's latency, the path delay necessary to fetch the operation's operands and the availability of resources. Therefore, a large path delay or a lack of free resources causes the operation's execution interval to be inflated. Additionally, the impact of the current operation's scheduling on future operations is due to the resource reservations (memory bus, PEs, local PE RAM, interconnections). Hence, PDs which wastefully consume the CGRA resources cause future scheduled operations to have less cost-effective PDs. So, a set of costs is assigned to each PD to incorporate the aforementioned factors that influence the scheduling of the operations. The aim of the proposed mapping algorithm is to find a cost-effective PD for every operation of the scheduled application.
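To summarize the notion in code, a Place Decision can be viewed as the record sketched below; the structure and names are illustrative assumptions, and the three costs are the ones formalized later in Sect. 4.2.2.

#include <vector>

// Sketch of a Place Decision (PD) for one operation: the PE chosen for
// execution, how each input operand reaches it, and the costs the
// scheduler uses to compare alternative PDs.
enum class FetchKind { FromL1OverBus, RoutedFromPE };

struct OperandFetch {
    FetchKind kind;
    int       sourcePE;     // meaningful only when the operand is routed
};

struct PlaceDecision {
    int targetPE;                        // PE reserved for executing the operation
    std::vector<OperandFetch> operands;  // one entry per input operand
    int delayCost;                       // earliest possible schedule time (Eq. 2)
    int interconnectCost;                // number of links reserved for routing (Eq. 4)
    int memoryCost;                      // number of operands fetched from L1 (Eq. 5)
};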


4.2 Methodology description

Figure 5a shows the structure of the developed memory-aware mapping methodology for CGRAs. The input is the application's description in the C language. Initially, the loop normalization transformation [24] is applied to normalize the candidate loops. Afterwards, the loop bodies are unrolled for achieving better CGRA utilization in the mapping phase. Depending on the loop body's dependences and the number of available CGRA architecture resources, several instances of the loop body can be executed simultaneously in the CGRA for different parts of the loop's iteration space. In this way parallelism and performance can be improved since the CGRA's resources are better utilized. Since unlimited unrolling can lead to resource congestion, a feedback path in our methodology flow corresponds to the exploration performed for finding the best value of the unroll factor in terms of the achieved operation parallelism. We have utilized the frontend of the SUIF2 compiler infrastructure [26] to create the SUIF2 Intermediate Representation (IR). We have used existing SUIF2 passes and developed new ones for performing analysis and transformations on the application's loops. More specifically, data-flow analysis is used to identify the data reuse opportunities in the IR, in order to make this information available to the mapping stage. Transformations on the IR are performed to create the Data Dependence Graph (DDG) used in the mapping process. Passes for dead code elimination, common sub-expression elimination and if-conversion transformations [24, 27] have also been utilized. The considered analysis and transformation flow is enclosed in the dashed line of Fig. 5a. As shown in Fig. 5a, b, the first input to the mapping algorithm is a DDG representing the loop body. Since we exploit the data reuse opportunities, the graph contains extra information concerning the data reuse among operations. More specifically, there are variables which are used multiple times after their production.

Fig. 5 (a) Mapping methodology flow for CGRAs. (b) Sample DDRG


The reuse opportunities for these variables can be effectively identified from the program's true dependences [24]. Moreover, there are variables which may be used multiple times without being assigned in the loop. Another kind of dependence, known as input dependence [24], is used for identifying reuse opportunities of this type. One operation has an input dependence with another operation when both use the same variable. Input dependences do not impose sequence constraints on the dependent operations, as all other (true, anti and output) data dependences do. For example, if the execution order of two input-dependent operations changes, the program's operation is still valid. Hence, an input dependence can be exploited irrespective of which of the operations executes first. However, if an input dependence is exploited for eliminating a load operation, then it must be honoured in the same way as a true dependence. Hence, apart from the true, anti and output dependences, the application's DDG has additional edges for representing the input dependences. We call this graph the Data Dependence Reuse Graph (DDRG). The DDRG is a directed graph G(V, E, ER), where: V is the set of DDRG nodes representing the operations of the loop body. Each DDRG node is annotated with the type of operation, its priority and the memory operations it requires. E is the set of data edges showing data dependencies among operations. Each dependence edge in E is annotated with the type of dependence as well as with the dependence distance. Finally, ER is the set of edges representing input dependences among the DDRG nodes. The ER edges are further annotated with the names of the variables that are common to the operations they connect, as well as with the input dependence distance. The input dependence distance is the number of iterations between two successive consumptions of a common input variable. We call the subset of operations in the DDRG that have E edges sinking into a specific node v the Data-Dependence-Predecessors (DDPs) for that node. Furthermore, we call the subset of operations in the DDRG which are connected with an operation v via ER edges the Input Dependence Predecessors (IDPs) for that node. The description of the CGRA architecture is the second input to the mapping phase. The CGRA architecture is modelled by an undirected graph, called the CGRA Graph, GA(V, EI), where V is the set of PEs of the CGRA and EI are the interconnections among them. The CGRA architecture description includes parameters like the size of the local RAMs inside the PEs, the context RAM size, the memory buses to which each PE is connected, the bus bandwidth, the L1 memory access times and the CGRA's resource reservation record. The latter records the reservations performed for each type of CGRA resource. The proposed memory-conscious mapping algorithm is described by the pseudocode shown in Fig. 6. It is based on the priority-based list scheduling algorithm described in [28], but it has been extended for the purpose of being applied to CGRAs. As mentioned in Sect. 4.1, the routing of data values imposes the need to identify the spatial location for executing each operation which minimizes the overall execution time. For this reason, a set of resource aware heuristics has been devised (Sect. 4.2.2), along with a novel approach for estimating them (Sect. 4.2.3), in order to steer the scheduling process. The algorithm is initialized by assigning to each DDRG node a value that represents its priority. The priority of an operation is calculated as its As Late As Possible (ALAP) value minus its As Soon As Possible (ASAP) value [28]; this difference is called mobility.
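A compact sketch of how the DDRG annotations described above could be represented follows; the C++ types and field names are illustrative assumptions rather than the actual implementation.

#include <string>
#include <vector>

// Sketch of the Data Dependence Reuse Graph G(V, E, E_R).
enum class DepKind { True, Anti, Output };   // annotations of the E edges

struct DataEdge {                            // element of E
    int     from, to;                        // producing / consuming DDRG nodes
    DepKind kind;
    int     distance;                        // dependence distance in iterations
};

struct ReuseEdge {                           // element of E_R (input dependence)
    int         from, to;                    // two uses of the same variable
    std::string variable;                    // common input variable
    int         distance;                    // iterations between the two uses
};

struct DDRGNode {                            // element of V
    int opType;                              // operation performed
    int priority;                            // mobility assigned at initialization
    int memoryOperands;                      // memory operations the node requires
};

struct DDRG {
    std::vector<DDRGNode>  nodes;
    std::vector<DataEdge>  dataEdges;        // a node's DDPs are the sources of its incoming E edges
    std::vector<ReuseEdge> reuseEdges;       // a node's IDPs are the nodes tied to it by E_R edges
};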


// SOP : Set with operations to be scheduled
// G   : Application's DDRG
// QOP : Queue with ready-to-schedule operations
SOP = V;
AssignPriorities(G);
p = Minimum_Value_Of_Mobility;                       // Highest priority
while (SOP ≠ ∅) {
    QOP = ROP(p);                                    // queue of ready operations
    do {
        Op = dequeue QOP;
        (DDPPEs, RTime) = Predecessors(Op);
        (IDPPEs) = Predecessors_R(Op);
        do {
            Choices = GetCosts(DDPPEs, IDPPEs, RTime);
            RTime++;
        } while (Resource_Congestion(Choices));
        Decision = DecideWhereToScheduleTimePlace(Choices);
        ReserveResources(Decision);
        Schedule(Op);
        SOP = SOP − Op;
    } while (QOP ≠ ∅);
    p = p + 1;
}

Fig. 6 Memory-aware mapping algorithm for CGRAs

Also, variable p, which indirectly points each time to the most exigent operations, is initialized with the minimum value of mobility. In this way, operations residing on the critical path are considered first in the scheduling phase. During the scheduling phase, in each iteration of the while loop, the QOP queue receives, via the ROP() function, the ready-to-execute operations which have a mobility value less than or equal to the value of variable p. The first do-while loop schedules each operation contained in the QOP queue, one at a time, until the queue becomes empty. Then, the new ready-to-execute operations are considered via the ROP() function, which updates the QOP queue. The Predecessors() function returns (if they exist) the PEs where the Op's DDPs were scheduled (DDPPEs) and the earliest clock cycle (RTime) at which the operation Op can be scheduled. RTime_Op equals the maximum of the times t_f at which each of Op's DDPs (DDP_Op) finished executing:

$$RTime_{Op} = \max_{i=1,\dots,|DDP_{Op}|}\left\{ t_f^{Op_i},\, 0 \right\}, \quad \text{where } Op_i \in DDP_{Op} \qquad (1)$$

Predecessors_R() returns (if they exist), for a given operation Op, the PEs where its IDPs were executed (IDPPEs). Subsequently, the function GetCosts() returns the possible PDs and the corresponding costs for the operation Op in the CGRA, in terms of the Choices variable. It takes as inputs the RTime for the operation Op along with the PEs where the DDPs (DDPPEs) and IDPs (IDPPEs) have been scheduled. The function Resource_Congestion() returns true if there are no available PDs due to resource conflicts. In that case RTime is incremented and the GetCosts() function is repeated until available PDs are found. The reasons for resource congestion are described in detail in Sect. 4.2.1. The DecideWhereToScheduleTimePlace() function analyzes the mapping costs from the Choices variable and chooses the most efficient PD.
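For concreteness, the priority assignment and Eq. (1) can be transcribed directly; this is a minimal sketch assuming the ALAP/ASAP values and the predecessors' finish times are already available.

#include <algorithm>
#include <vector>

// Mobility-based priority: the difference ALAP - ASAP; operations with the
// smallest mobility lie on the critical path and are considered first.
int mobility(int alap, int asap) { return alap - asap; }

// Eq. (1): the earliest cycle at which operation Op may be scheduled is the
// latest finish time t_f over its data-dependence predecessors (DDP_Op),
// or 0 when it has none.
int readyTime(const std::vector<int>& ddpFinishTimes) {
    int rtime = 0;
    for (int tf : ddpFinishTimes)
        rtime = std::max(rtime, tf);
    return rtime;
}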


Fig. 7 (a) A possible mapping for the sample graph of Fig. 5. (b) Derived CGRA configuration

The function ReserveResources() reserves the resources (bus, PEs, storage and interconnections) for executing the current operation on the selected PE. More specifically, the PEs are reserved for as long as the execution takes place. For each transfer, the amount and duration of the bus reservation are determined by the number of words transferred and the L1 memory latency, respectively. The local RAM in each of the PEs is reserved according to the lifetime of the variables [28]. Also, in order to decrease the duration of the storage reservation, the IDP operands are fetched from the latest scheduled IDP. Finally, Schedule() records the scheduling of operation Op. After all operations are scheduled, the configuration of the CGRA is extracted in the form shown in the example of Fig. 7.

4.2.1 Operation scheduling

Each operation's scheduling is performed through a series of phases, in each of which the availability of resources is taken into account. Initially, the input variables are fetched to the PE where the execution takes place. In the case where the input variables come from memory, the bus resources are reserved for transferring the data values. On the contrary, when the variables already exist in a PE of the CGRA, the appropriate number of interconnections and storage locations are reserved for routing the data values to the target PE.


Additionally, in the first phase the availability of the PE where the execution takes place is checked. If there are not adequate resources for completing the first phase, a resource congestion status is returned by the Resource_Congestion() function. In the second phase, the variables are stored at the PE where the execution takes place when they do not all arrive there simultaneously. Additionally, local memory is reserved at the source PEs for storing the data values until they are routed to the target PE. If the algorithm fails to find the storage resources, then the scheduling phase fails and a resource congestion status is returned. This type of resource congestion causes the conflicting variables to be spilled (by introducing the proper load/store operations) and the algorithm restarts the scheduling process through the call to the Resource_Congestion() function.

4.2.2 Costs description

The Choices variable includes the delay, interconnection, and memory overhead costs for each PD. The delay cost for placing operation Op on a specific PE_x refers to the operation's earliest possible schedule time there. As shown in Eq. (2), it is the sum of RTime plus the maximum of the times t_f required to fetch the operation's (Op) input operands to PE_x, where P is the set of the operation's input operands.

$$\mathit{Choices.dly}(PE_x, Op) = RTime_{Op} + \max_{i=1,\dots,|P|}\left\{ t_f^{P[i]},\, 0 \right\} \qquad (2)$$

When an operand comes from the L1 memory, t_f equals the L1 memory latency, while when it comes from a PE in the CGRA it is equal to the time t_r needed to route it to PE_x. Hence, by denoting with PE_{P[i]} the PE where the routed operand P[i] resides, with t_init_route the time at which the routing initiates and with t_init_fetch the clock cycle following RTime at which the bus is available to fetch the requested data, we have

$$t_f^{P[i]} = \begin{cases} \text{L1 latency}^{\,t_{init\_fetch}}, & P[i] \in P_m\\ t_r^{\,t_{init\_route}}\!\left(PE_{P[i]} \rightarrow PE_x\right), & P[i] \in P_r \end{cases} \qquad (3)$$

where P_m is the subset of P with the operands which are fetched from memory, P_r = P − P_m is the subset of P with the operands which are routed to the destination PE_x, and the superscripts denote the cycle at which the fetch or the routing initiates. As shown in Eq. (3), the time t_r depends on t_init_route, since the availability of the interconnections and storage locations needed for routing an operand depends on the clock cycle at which the routing initiates. When an input operand is routed, the interconnection overhead refers to the interconnections that must be reserved in order to route the operand to the destination PE. A higher interconnection overhead causes future scheduled operations to have a larger execution start time due to conflicts. As shown in Eq. (4), the interconnection cost for a PD is the sum of the CGRA interconnections which are used for routing the operation's input operands.

$$\mathit{Choices.intercon}(PE_x, Op) = \begin{cases} 0, & P_r = \emptyset\\ \sum_{i=1,\dots,|P_r|} \mathit{PathLength}\left(PE_{P_r[i]} \rightarrow PE_x\right), & P_r \neq \emptyset \end{cases} \qquad (4)$$
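A sketch of how Eqs. (2)-(4) could be evaluated for a single candidate PE is given below; routingDelay() and pathLength() stand in for the greedy shortest-path selection described in the next paragraph, the memory count corresponds to Eq. (5) given there, and the names and signatures are assumptions for illustration.

#include <algorithm>
#include <vector>

struct OperandInfo {
    bool fromMemory;   // true: operand belongs to P_m, false: it is routed (P_r)
    int  sourcePE;     // PE holding the operand when it is routed
};

// Assumed helpers provided by the greedy shortest-path routine (Sect. 4.2.2).
int routingDelay(int fromPE, int toPE, int initRouteCycle);
int pathLength(int fromPE, int toPE);

struct PDCosts { int delay, interconnect, memory; };

// Evaluate the three costs for placing operation Op on targetPE at cycle rtime.
PDCosts evaluatePD(const std::vector<OperandInfo>& P, int targetPE,
                   int rtime, int l1Latency, int initRouteCycle) {
    int maxFetch = 0, links = 0, memOps = 0;
    for (const OperandInfo& p : P) {
        int tf;
        if (p.fromMemory) {                           // Eq. (3), P[i] in P_m
            tf = l1Latency;                           // (bus-availability wait omitted here)
            ++memOps;                                 // contributes to Eq. (5)
        } else {                                      // Eq. (3), P[i] in P_r
            tf = routingDelay(p.sourcePE, targetPE, initRouteCycle);
            links += pathLength(p.sourcePE, targetPE);  // summed for Eq. (4)
        }
        maxFetch = std::max(maxFetch, tf);
    }
    return { rtime + maxFetch, links, memOps };       // Eq. (2), Eq. (4), Eq. (5)
}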


A greedy approach was adopted for calculating the time t_r (Eq. (3)) and the number of interconnections (Eq. (4)) required for routing an operand. For each operand, the shortest paths that connect the source and destination PE are identified. From this set of paths, the one with the minimum routing delay is selected. The delay and length of the selected path give the delay and interconnection costs through Eq. (2) and Eq. (4), respectively. Finally, the memory overhead cost refers to the number of memory accesses contributed by the current operation. It is equal to the number of input operands which are fetched from memory.

$$\mathit{Choices.memory}(PE_x, Op) = |P_m| \qquad (5)$$

The function DecideWhereToScheduleTimePlace() first identifies the subset of PDs with minimum delay cost and, from these, the ones with minimum memory cost. Finally, from the resulting PD subset it chooses the one with minimum interconnection cost.
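The selection rule of DecideWhereToScheduleTimePlace() amounts to a lexicographic comparison of the three costs; a minimal sketch, reusing the hypothetical PDCosts record of the previous sketch, is:

#include <cstddef>
#include <vector>

struct PDCosts { int delay, interconnect, memory; };   // as in the earlier sketch

// Lexicographic rule: minimum delay first, then minimum memory cost,
// then minimum interconnection cost.
bool betterPD(const PDCosts& a, const PDCosts& b) {
    if (a.delay  != b.delay)  return a.delay  < b.delay;
    if (a.memory != b.memory) return a.memory < b.memory;
    return a.interconnect < b.interconnect;
}

// Returns the index of the selected PD among the candidate Choices.
std::size_t pickBestPD(const std::vector<PDCosts>& choices) {
    std::size_t best = 0;
    for (std::size_t i = 1; i < choices.size(); ++i)
        if (betterPD(choices[i], choices[best])) best = i;
    return best;
}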

4.2.3 Cost assignment algorithm

Figure 8 presents the function GetCosts() that is used to find the PDs and the corresponding costs for scheduling each operation of the DDRG. This function identifies the PDs and the corresponding costs for executing the operation on each of the CGRA's PEs. This is achieved by the first loop (the i loop) of this function. In each iteration of the i loop, the three costs are evaluated for placing the operation on a different PE. The function GetDDPCosts() considers: (a) the DDP operands, which always exist in the CGRA and hence are routed to the destination PE, and (b) the memory fetches without reuse opportunities, which always come from the L1 memory. It takes as inputs the PEs (DDPPEs) where the DDP operands are stored, the RTime and the PE (TargetPE) on which the operation is assumed to be placed for execution, and determines the three aforementioned costs. These costs are calculated using Eqs. (2) to (5) and are returned in the Choices1 variable. Additionally, there are two alternative ways of fetching the operands with data reuse opportunities. These operands exist both in the L1 memory and in the CGRA; hence, they can either be fetched from the L1 memory or be routed from a PE that has stored them in its local RAM. We have found that it is not possible to determine which of these two ways is the most efficient in terms of performance, due to the operand's non-deterministic routing delay. Thus, the algorithm has to decide which of the two ways is the most efficient. Therefore, we introduce two costs and we associate with them two cost variables, namely FetchChoices2 and RouteChoices2, which have the same structure as the variable Choices1. The first one refers to the case of fetching the operands from the L1 memory, while the second one refers to the case of routing the operands from a PE which holds them. These two variables are initialized with the Choices1 variable via the InitializeReuseVar() function. As shown in Fig. 8, the algorithm handles each of the IDP operands separately in the j loop for determining the most efficient way of fetching it to the target PE (TargetPE). The function GetRouteCosts() determines the costs assuming each time that the corresponding IDP operand is routed to the target PE.


// Choices     : array, each element of which holds the costs (delay etc.) per PD
// RTime       : the time when an operation is ready to execute
// RVAvailTime : array with the earliest times at which the IDP operands are available
// TargetPE    : array with the CGRA's PEs for which the exploration takes place
// DDPPE       : array holding the PEs where the Op's DDPs were scheduled
// IDPPE       : array holding the PEs where the Op's IDPs were scheduled
Choices function GetCosts(int[] DDPPEs, int[] IDPPEs, int RTime) {
    for (i = 0; i < NumPEs; i++) {
        GetDDPCosts(DDPPEs, RTime, TargetPE[i], Choices1[i]);
        InitializeReuseVar(Choices1[i], FetchChoices2[i], RouteChoices2[i]);
        for (j = 0; j < NumROps; j++) {
            GetRouteCosts(DRGPEs[j], RTime, RVAvailTime[j], TargetPE[i], RouteChoices2[i][j]);
            GetFetchCosts(RTime, TargetPE[i], FetchChoices2[i][j]);
            SelectRouteOrFetch(Choices2[i][j], RouteChoices2[i][j], FetchChoices2[i][j]);
        }
        CombineCosts(Choices[i], Choices2[i]);
    }
    return Choices;
}

Fig. 8 Description of the GetCosts() function

On the contrary, the function GetFetchCosts() determines the costs assuming that the corresponding IDP operand is fetched from the L1 memory. Afterwards, the SelectRouteOrFetch() function identifies which of the two aforementioned ways of fetching the IDP operand is the most efficient. To prevent a significant increase in the schedule length when routing an operand is more costly in cycles than fetching it from the L1 memory, we have introduced a threshold parameter called memory_thresh. This parameter determines the laxity that the scheduler has with respect to the delay of routing the operand over the delay of an L1 memory access. If the difference between the delay of routing the operand and the delay of fetching it from the L1 memory is greater than memory_thresh, then a memory fetch is selected. The rejection of one of the above two options for transferring the IDP operand is done by recording the cost of the selected option in the Choices2 variable, which is returned by the SelectRouteOrFetch() function. Finally, after deciding separately how each IDP operand is fetched, the function CombineCosts() combines the results derived from each IDP to find the final costs. It takes the maximum of the delay costs derived for each IDP operand as the final delay cost, and accumulates the interconnection and memory costs to determine the corresponding final costs. The final result is the Choices variable, which is returned to the main scheduling function (Fig. 6).
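The memory_thresh rule applied to each IDP operand can be sketched as follows; the function and enumeration names are assumptions, and the comparison shown is one reading of the rule described above.

enum class FetchWay { RouteInsideCGRA, FetchFromL1 };

// If routing the reused value inside the CGRA is slower than an L1 access
// by more than memory_thresh cycles, give up the reuse and fetch from memory.
FetchWay selectRouteOrFetch(int routeDelay, int fetchDelay, int memoryThresh) {
    if (routeDelay - fetchDelay > memoryThresh)
        return FetchWay::FetchFromL1;
    return FetchWay::RouteInsideCGRA;
}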

5 Experimental results

5.1 Experimental setup

In this section, we present the experimental results from applying the proposed mapping methodology on a representative CGRA architecture. We have developed in C++ a prototype compiler framework that realizes our mapping algorithm.


Table 2 Application's characteristics

Program     # of Ops   IPC   CGRA util%   Description
fft            95      5.6      34.8      radix-4 FFT
idct_ver       61      5.5      34.7      vertical pass (8 × 8 IDCT)
idct_hor       79      5.6      35.3      horizontal pass (8 × 8 IDCT)
fdct_ver       60      5.0      31.3      vertical pass (8 × 8 FDCT)
fdct_hor       61      5.1      31.8      horizontal pass (8 × 8 FDCT)
wav_ver        64      5.3      33.3      vertical pass (2D-DWT)
wav_hor        54      3.9      24.1      horizontal pass (2D-DWT)
lat_synth      18      3.0      18.8      lattice synthesis filter
lat_anal       18      2.3      14.1      lattice analysis filter
iir            39      4.9      30.5      iir filter
volterra       27      3.9      24.1      iir filter
wdf7           44      4.9      30.6      iir filter
nc             56      4.3      26.9      noise canceling

The experimental setup considers a 2D CGRA of 16 PEs connected in a 4 × 4 array. In the experiments, two scenarios are considered concerning the PEs' Interconnection Topologies (PEIT). The first one (PEIT1) refers to the case where the PEs are directly connected to all other PEs in the same row and the same column, as in a quadrant of Morphosys [4] (Fig. 2b), through vertical and horizontal interconnections. The second one (PEIT2) refers to the case where each PE is connected only to its nearest neighbours (Fig. 2a), as in [9]. PEIT1 has more internal bandwidth available for transferring data due to its richer interconnection topology, while PEIT2 has less available bandwidth. These two alternatives were chosen so as to contrast the experimental results of a high-bandwidth case with those of a low-bandwidth case, illustrating the impact of the CGRA's internal bandwidth on the scheduling characteristics (performance etc.). Additionally, each PE has a local RAM of 64 words; thus the L0 size is 1024 words. There is one FU in each PE that can execute any operation in one clock cycle. The granularity of the FU is 16 bits, which is the word size. The direct connection delay among the PEs is zero cycles. Furthermore, two buses per row are dedicated to transferring data to the PEs from the scratch-pad (L1) memory. Each bus transfers one word per scratch-pad memory cycle. The experiments consider scenarios for different values of the unroll factor, the scratch-pad's latency and the PEs' local RAM size. Also, since our main focus is on the application's performance, we have set memory_thresh equal to 0 in all cases. Moreover, we assume that the CGRA's context RAMs can store 64 context words each. Finally, to measure the operation parallelism in the experiments we use the Instructions Per Cycle (IPC) and the CGRA Utilization interchangeably. We have used 13 characteristic DSP applications written in C code. The first nine programs are drawn from the Texas Instruments DSP benchmark suite [29], while the last four were developed by the authors [30]. Their characteristics are given in Table 2. The second column gives the number of operations in the application's loop body, the third the IPC, the fourth the percentage of CGRA PE utilization, while the fifth gives a brief description of each application.
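For reference, the baseline instance used in the experiments could be summarized as the following parameter record; the structure itself is only an illustrative assumption, while the values are those stated above.

// Baseline CGRA instance used in the experiments (values from the text;
// the ExperimentConfig structure is an illustrative assumption only).
struct ExperimentConfig {
    int rows, cols;            // 4 x 4 array, 16 PEs
    int localRamWords;         // 64 words per PE -> 1024-word L0
    int wordBits;              // 16-bit FU granularity
    int interconnectDelay;     // direct PE-to-PE connection delay, in cycles
    int busesPerRow;           // buses to the scratch-pad (L1)
    int contextWordsPerPE;     // context RAM capacity
    int memoryThresh;          // laxity of routing over an L1 access
};

const ExperimentConfig baseline = { 4, 4, 64, 16, 0, 2, 64, 0 };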

Fig. 9 Performance comparison with and without data reuse exploitation (PEIT1)



These values were derived by executing the initial loops of the designs on a CGRA having the characteristics previously described, with an internal interconnection network as in PEIT1, and with the L1 memory latency set to 1 cycle. In the following experiments we assume that each application's initial loop body is executed for 100 iterations. The exploration focuses on identifying how the performance is influenced by the architectural characteristics. More specifically, the methodology exploits the CGRA interconnection network and the PEs' local RAMs for achieving a faster schedule. Hence, the exploration firstly focuses on identifying the impact of these two design parameters on the performance improvements. Moreover, we have found that the improvements also depend on the memory access latency as well as on the unroll factor by which the application loops are unrolled. The relation between these parameters and performance is also outlined by the considered set of experiments.

5.2 Experimentation

5.2.1 Performance with respect to the scratch-pad's latency

Figures 9 and 10 show the performance comparison for mapping the designs on the CGRA with and without exploiting data reuse opportunities for reducing the memory accesses, for the two architecture alternatives (PEIT1 and PEIT2) respectively. For this experiment, we consider 4 scenarios concerning the scratch pad's latency and we unroll the loops in the designs by a factor of 10, since we have found (Figs. 13 and 14) that this gives a significant amount of operation parallelism. Above the bars, the percentages of performance improvement are shown. In the PEIT1 architecture (Fig. 9), for an L1 memory latency of 1 cycle, the improvements range from 11.9% to 43.7%. For larger values of the memory latency the improvements become larger. On average, the performance is improved by 38.4% if we consider all scenarios for the value of the L1 memory access latency. However, smaller performance improvements were achieved with the PEIT2 architecture alternative in comparison to the PEIT1 case. More specifically, for an L1 memory latency of 1 cycle in the PEIT2 architecture (Fig. 10), the performance improvements range from 3.2% to 33.3%. On average, performance is improved by 30.6% in the PEIT2 case if we consider all scenarios of the L1 access latency. Also, the execution time is smaller in the PEIT1 case than in the PEIT2 case. This is due to the smaller average routing delay achieved by the richer interconnection topology of PEIT1. Ultimately, in all cases the applications are executed faster when data reuse opportunities are exploited. Figure 11 shows the average IPC for all benchmarks with (REUSE) and without (NO REUSE) data reuse exploitation for the two architecture alternatives (PEIT1 and PEIT2), while in Fig. 12 the improvements of the average IPC are illustrated with respect to the L1 memory access latency for the two architecture alternatives. Although the average IPC is reduced significantly as the L1 memory access latency increases (Fig. 11), the IPC improvements for an L1 access latency of more than 1 cycle are much larger compared to those achieved for an L1 access latency of 1 cycle.

Fig. 10 Performance comparison with and without data reuse exploitation (PEIT2)



Fig. 11 Variation of the average IPC with respect to the L1 memory access latency for the two different interconnection alternatives (PEIT1 and PEIT2), with (REUSE) and without (NO REUSE) the data reuse exploitation

Fig. 12 Improvements of the average IPC from the data reuse exploitation, with respect to the L1 memory access latency, for the two different interconnection alternatives (PEIT1 and PEIT2)

In this way, as shown in Fig. 11, the reduction in the average value of IPC is compensated by the IPC improvements due to the data reuse exploitation. This is due to the following reasons: firstly, the bus bandwidth is relieved of unnecessary accesses, hence the possibility of a bus conflict becomes smaller. Additionally, less time is generally spent transferring memory data values, because routing the data values within the distributed foreground memory becomes faster (as the L1 access latency increases) than accessing data from the external memory. Hence, the reduction in memory accesses performed by applying our methodology has a greater effect on performance and IPC for larger values of the memory latency.

5.2.2 Memory accesses reduction, operation parallelism improvement and L0 utilization

Table 3 shows the reduction in memory accesses and the L0 utilization, while Table 4 illustrates the CGRA Utilization improvement achieved by the application of our methodology for the two architecture alternatives. For this experiment, the scratch-pad's latency equals 1 and the unroll factor equals 10. The reduction in memory accesses explains the increase in performance shown in Fig. 9 and Fig. 10. We free memory bandwidth by reusing common data values internally in the CGRA instead of fetching them from the scratch-pad memory. This increases the operation parallelism, as shown in Table 4, since more operations requiring memory accesses can run in parallel. So, operations shift up in time and the schedule length decreases.


Table 3 Reduction in memory accesses and L0 utilization

             # of accesses                      Reduction %        L0 Utilization %
Program      Reuse      Reuse      No Reuse     PEIT1    PEIT2     PEIT1    PEIT2
             (PEIT1)    (PEIT2)
fft           2310       2480       3200         27.8     22.5      15.5     18.0
idct_ver      1950       2200       3200         39.1     31.3      10.5     11.2
idct_hor      2260       2440       3200         29.4     23.8      11.0     14.5
fdct_ver      1720       2000       3000         42.7     33.3      15.4     17.2
fdct_hor      1820       2190       3000         39.3     27.0      14.4     15.8
wav_ver       3200       3200       6400         50.0     50.0      32.2     35.2
wav_hor       2940       3260       5600         47.5     41.8      14.9     14.8
lat_synth      650        980       1200         45.8     18.3       5.8      4.1
lat_anal       640        930       1200         46.7     22.5       7.5      7.5
iir           3020       3230       4200         28.1     23.1      14.8     14.9
volterra      1050       1370       2000         47.5     31.5       8.3      8.9
wdf7          1770       2250       3500         49.4     35.7      16.4     14.5
nc            2110       2790       4000         47.3     30.3      19.0     17.8

Table 4 Improvement in CGRA utilization

             CGRA Utilization %                                    Improvement %
Program      Reuse      Reuse      No Reuse    No Reuse            PEIT1    PEIT2
             (PEIT1)    (PEIT2)    (PEIT1)     (PEIT2)
fft           69.6       44.3        45.9        36.4               48.4     21.7
idct_ver      81.1       65.7        62.5        50.2               29.8     30.9
idct_hor      85.1       54.3        61.7        52.5               37.9      3.4
fdct_ver      81.5       53.6        55.1        45.2               47.9     18.6
fdct_hor      79.4       52.2        53.0        39.3               49.8     32.8
wav_ver       54.1       42.6        47.6        40.0               13.7      6.5
wav_hor       84.4       70.3        47.5        46.9               77.7     49.9
lat_synth     75.0       66.2        53.6        46.9               39.9     41.2
lat_anal      62.5       45.0        45.0        35.2               38.9     27.8
iir           59.5       48.7        45.1        42.0               31.9     16.0
volterra      70.3       51.1        49.6        43.3               41.7     18.0
wdf7          76.4       61.1        48.2        43.0               58.5     42.1
nc            81.4       51.5        53.0        46.7               53.6     10.3

Additionally, the reduction in memory accesses is greater in the PEIT1 case than in the PEIT2 case. The increased PE connectivity in the first case reduces the routing delay; hence, more reused values are allowed to be routed internally in the CGRA for the given value of the memory_thresh parameter. On the contrary, the poorer PE connectivity of the second case results in an increased routing delay, causing some of the reused values to be fetched from memory, due to the reasoning described in Sect. 4.2.3. This increases the bus bandwidth requirements, which causes memory access conflicts.


Consequently, the operation parallelism is improved less in the second case (PEIT2) than in the first (PEIT1), which is expressed by the smaller CGRA Utilization improvement shown in Table 4. The aforementioned conclusion is also justified by Fig. 12, which shows that the improvements of the average IPC are larger for the PEIT1 architecture alternative, with the richer interconnection topology, than for the PEIT2 case. On average, the improvements from the application of our methodology are larger in the PEIT1 case by a factor of 1.4, if we consider all scenarios of the L1 memory access latency.

5.2.3 Effect of the unroll factor on performance

Figures 13 and 14 show the performance with respect to the unroll factor, with and without exploiting data reuse opportunities, for the two network topologies (PEIT1 and PEIT2) respectively. The results refer to 5 different scenarios concerning the unroll factor: 1, 2, 5, 10 and 20. The scratch-pad's latency again equals 1. The leftmost vertical bar for each design corresponds to unroll factor 1, while the rightmost corresponds to unroll factor 20. Each vertical bar has two upper limits. The lower one is the number of cycles for the reuse exploitation case, while the upper one is the number of cycles without the data reuse exploitation. It is deduced that the exploitation of the data reuse opportunities improves performance in almost all cases. As illustrated in Figs. 13 and 14, in some applications the performance achieved by the application of the proposed methodology reaches a maximum for some value of the unroll factor (mostly 10). Beyond that value, the performance slightly decreases but still beats the performance achieved without exploiting the data reuse opportunities. We have already mentioned that by unrolling the application loops, higher CGRA utilization and consequently higher IPC can be achieved. However, for large values of the unroll factor, less cost-effective PDs exist for each operation due to the exhaustion of the unreserved resources. This happens especially in the case (Figs. 13 and 14) where the data reuse opportunities are exploited, due to the additional routing imposed by the data reuse values. The routing of data reuse values requires additional resources (storage locations and interconnections) to be reserved. The absence of available routing resources for transferring the data values is responsible for the schedule length increase.

5.2.4 IPC with respect to the PE local RAM

Figure 15 illustrates the impact of the PEs' local RAM size on the average value of IPC, with (REUSE) and without (NO REUSE) exploiting the data reuse opportunities, for the two different interconnection alternatives (PEIT1 and PEIT2). The average IPC refers to the average value of IPC over all benchmarks, while these measurements refer to 4 different scenarios with respect to the L1 memory access latency. Additionally, the unroll factor equals 10 in all cases. It is concluded that, for small values of the local RAM size, the average value of IPC when data reuse opportunities are exploited drops faster than when they are not exploited. This is explained as follows: when data reuse opportunities are exploited, more data values are stored in the distributed foreground memory, and this increases the possibility of having inadequate storage locations for scheduling the operations efficiently.

Fig. 13 Performance relative to the unroll factor with and without reuse exploitation (PEIT1)


Fig. 14 Performance relative to the unroll factor with and without reuse exploitation (PEIT2)
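As a source-level illustration of the unrolling that the measurements of Figs. 13 and 14 refer to, the short C fragment below shows a simple, hypothetical multiply-accumulate loop unrolled by a factor of 2 (the kernel and array names are not taken from the benchmark set). The unrolled body exposes independent operations that the mapping algorithm can place on different PEs, but every extra copy of the body also multiplies the number of live values that must be stored and routed, which is why very large unroll factors eventually stop paying off.

/* Illustrative multiply-accumulate kernel; array names are hypothetical. */
void mac_kernel(int n, const int *a, const int *b, const int *c, int *y)
{
    for (int i = 0; i < n; i++)
        y[i] = a[i] * b[i] + c[i];
}

/* The same kernel unrolled by a factor of 2 (n assumed even for brevity).
 * The two statements in the body are independent and can be mapped onto
 * different PEs, but twice as many array values are live per iteration,
 * so the demand for storage locations and routing resources also grows. */
void mac_kernel_unroll2(int n, const int *a, const int *b, const int *c, int *y)
{
    for (int i = 0; i < n; i += 2) {
        y[i]     = a[i]     * b[i]     + c[i];
        y[i + 1] = a[i + 1] * b[i + 1] + c[i + 1];
    }
}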


Fig. 15 Average IPC with respect to the local RAM size, with and without exploiting the data reuse opportunities, for the two architecture alternatives

When this happens, the execution of the operations is delayed until appropriate storage resources are found; hence, operation parallelism and consequently performance decrease. It must also be mentioned that data reuse exploitation improves the average value of IPC by a factor that ranges from 36.5% to 85.1% for the PEIT1 case and from 8.6% to 68% for the PEIT2 case, over the considered values of the L1 access latency, even for a PE local RAM size as small as 8 words.

5.2.5 Effect of the unroll factor on L0 utilization

Figures 16 and 17 illustrate the L0 utilization with respect to the unroll factor for the two network topologies (PEIT1 and PEIT2), respectively. As shown, the L0 utilization is proportional to the unroll factor in both cases. Bearing in mind the performance measurements of Figs. 13 and 14, it is evident that beyond a certain value of the unroll factor the demand for storage locations (especially with data reuse exploitation) increases significantly, while the performance improves negligibly or sometimes worsens. Hence, the unroll factor should be chosen appropriately to obtain an efficient solution with respect to the storage requirements. Moreover, by comparing Figs. 16 and 17 it is concluded that the L0 utilization is higher for the PEIT2 case, because the increased routing delay increases the storage requirements for routing the values.
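The roughly linear trend admits a simple, assumed accounting (our notation, not a formula from the methodology): if a single loop body keeps about k values alive in the PEs' local RAMs, an unroll factor u makes roughly u times k values live at once, so for an array of N_PE PEs with R words of local RAM each the average L0 utilization behaves approximately as

U_{L0}(u) \approx \frac{u \cdot k}{N_{PE} \cdot R}

This back-of-the-envelope relation is only meant to be consistent with the proportionality observed in Figs. 16 and 17; the exploited data reuse values increase k, which is why the curves with reuse exploitation sit higher.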

Fig. 16 L0 utilization (%) with respect to the unroll factor (a) with and (b) without data reuse exploitation (PEIT1)


Fig. 17 L0 utilization (%) with respect to the unroll factor (a) with and (b) without data reuse exploitation (PEIT2)


However, the difference from the PEIT1 case is quite small, meaning that the difference in the interconnection network between the two architecture alternatives has only a small effect on the storage utilization. Finally, for the considered applications, the storage requirements increase approximately by a factor of 4 when data reuse is exploited. This happens because the utilization of the PEs' local RAMs increases, since the data reuse values are stored internally in the CGRA's distributed foreground memory.
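To make the notion of a data reuse value concrete, the hypothetical C fragment below (a 3-tap windowed kernel; the arrays and the kernel itself are not taken from the benchmark set) shows why such values arise: neighbouring iterations reference the same input elements, so once an element is fetched it can stay in a PE's local RAM and be forwarded over the interconnect to the next iteration's consumers instead of being loaded again from the external memory. Keeping these values on chip is exactly what raises the local RAM utilization reported above.

/* Illustrative 3-tap filter; x, h and y are hypothetical arrays.
 * Iteration i reads x[i], x[i+1], x[i+2]; iteration i+1 reads
 * x[i+1], x[i+2], x[i+3]. Hence x[i+1] and x[i+2] are data reuse values:
 * they can be kept in a PE's local RAM and routed internally rather than
 * re-fetched from the external memory in the following iteration. */
void fir3(int n, const int *x, const int *h, int *y)
{
    for (int i = 0; i + 2 < n; i++)
        y[i] = h[0] * x[i] + h[1] * x[i + 1] + h[2] * x[i + 2];
}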

6 Conclusions

A memory-conscious mapping methodology for CGRA architectures was presented in this work. The proposed mapping algorithm exploits the distributed foreground memory to reduce bus congestion. It achieves this by coordinating the mapping of operations to PEs and by efficiently routing the data values through the CGRA's interconnection network. To do so, the factors that influence the scheduling were identified and incorporated into the proposed methodology through a set of heuristics. Moreover, the proposed retargetable mapping algorithm was used for evaluating our mapping approach and for exploring the design space with respect to the factors that influence the mapping efficiency.

More specifically, the experiments showed that exploiting the PEs' interconnection network for routing values with data reuse opportunities yields significant performance improvements. This conclusion was also validated by the performance measurements for different values of the unroll factor. Moreover, it was shown that an architecture with a richer interconnection network among the PEs generally achieves higher improvements than an architecture with poorer connectivity. Additionally, the experiments showed that for small values of the PEs' local RAM size the algorithm tends to perform the same with and without data reuse exploitation; however, even for a small register file, significant improvements can be achieved. Hence, the CGRA interconnection network and the PEs' local RAM size determine the benefit obtained from the application of the proposed mapping methodology. Analogous remarks were derived from the experimentation with CGRA architectures of different sizes (6 × 6, 8 × 8, etc.).



Grigoris Dimitroulakos is a PhD candidate at the Electrical and Computer Engineering Department of the University of Patras. He received his Physics degree in 1999 and his MSc in Electronics in 2001 from the Physics Department. He has been working towards his PhD degree at the Electrical and Computer Engineering Department of the University of Patras since 2001. His research interests are in the areas of compilers, reconfigurable computing, and the optimization of embedded systems-on-chip. He has authored or co-authored more than 40 scientific articles in conferences and journals in the area of his research.

Michalis D. Galanis was born on March 14, 1979 in Patras, Greece. He received a BSc in Physics from the Department of Physics of the University of Patras in 2000 and an MSc in Electronics from the Electronics Laboratory of the same department in 2002. Since October 2002 he has been working towards a PhD degree in the domain of reconfigurable computing at the VLSI Design Laboratory of the Department of Electrical and Computer Engineering. Since 2003 he has held a scholarship for his PhD studies from the Alexander S. Onassis Public Benefit Foundation, awarded for his excellent academic record in previous years. To date, he has authored or co-authored 40 research works presented (or to appear) in international conferences and journals.

Costas E. Goutis was a Research Assistant and Research Fellow in the Department of Electrical and Electronic Engineering, University of Strathclyde, U.K., from 1976 to 1979, and a Lecturer in the Department of Electrical and Electronic Engineering, University of Newcastle upon Tyne, U.K., from 1979 to 1985. Since 1985, he has been an Associate Professor and then Full Professor in the Department of Electrical and Computer Engineering, University of Patras, Patras, Greece. His recent research interests focus on VLSI circuit design, low-power VLSI design, systems design, analysis and design of systems for signal processing, telecommunications, memory management, and reconfigurable computing. He has been awarded a large number of research contracts from ESPRIT, RACE, and national programs. Prof. Goutis has authored or co-authored more than 200 scientific articles, book chapters, tutorials and reports in the areas of his research.