Data-Driven Regular Reconfigurable Arrays

Data-Driven Regular Reconfigurable Arrays: Design Space Exploration and Mapping*

Ricardo Ferreira¹, João M. P. Cardoso²,³, Andre Toledo¹, and Horácio C. Neto³,⁴

¹ Departamento de Informática, Universidade Federal de Viçosa, Viçosa 36570-000, Brazil, [email protected]
² Universidade do Algarve, Campus de Gambelas, 8000-117 Faro, Portugal, [email protected]
³ INESC-ID, 1000-029 Lisboa, Portugal
⁴ Instituto Superior Técnico, Lisboa, Portugal, [email protected]

Abstract. This work presents further enhancements to an environment for exploring coarse-grained reconfigurable data-driven array architectures suitable for implementing data-stream applications. The environment takes advantage of Java and XML technologies to enable architectural trade-off analysis. The flexibility of the approach to accommodate different topologies and interconnection patterns is shown by a first mapping scheme. Three benchmarks from the DSP domain, mapped on hexagonal and grid architectures, are used to validate our approach and to establish comparison results.

1. Introduction

Recently, array processor architectures have been proposed as extensions of microprocessor-based systems (see, e.g., [1], [2]). Their use to execute streaming applications leads to acceleration and/or energy savings, both important for today's and future embedded systems. Since many design decisions must be taken in order to implement an efficient architecture for a given set of applications, environments to efficiently experiment with different architectural features are fundamental.

Array architectures may rely on different computational models. Architectures behaving in a static dataflow fashion [3][4] are of special interest, as they naturally process data streams, and therefore provide a very promising solution for stream-based computations, which are becoming predominant in many application areas [5]. In addition, the control flow can be distributed and can easily handle data streams even in the presence of irregular latencies. In the data-driven model, synchronization is achieved by ready-acknowledge protocols, centralized control units are not needed, and operations are dynamically scheduled by the flow of data. Furthermore, array architectures are scalable due to their regular design and symmetric interconnection structure. Moreover, high parallelism, energy savings, circuit reliability, and a short design cycle can also be achieved by adopting reconfigurable, regular, data-driven array architectures [6]. However, many array architectures seem to be designed without strong evidence for the architectural decisions taken. Remarkably, the work presented in [7][8] has been one of the few exceptions that addressed the exploration of architectural features (in this case, a number of KressArray [4] properties).

Our previous work presented a first step toward an environment to test and simulate data-driven array architectures [9]. To validate the concept we presented results exploiting the size of input/output FIFOs for a simple example. As shown, the simulations are fast enough to allow the exploration of a significant number of design decisions. Our work aims to support a broad range of data-driven arrays and a significant set of architecture parameters, and to evaluate their trade-offs using representative benchmarks. The environment will help the designer to systematically investigate different data-driven array architectures (topologies and connection patterns) as well as internal PE parameters (existence of FIFOs at PE inputs/outputs and their size, number of inputs/outputs of each PE, pipeline stages in each PE, etc.), and to conduct experiments to evaluate a number of characteristics (e.g., protocol overhead, node activity, etc.). An environment able to exploit such a set of features is of great interest, since it can provide important aid in the design of new data-driven array architectures suitable to execute a set of kernels for specific application domains.

The main contributions of this paper are:
− the integration in the environment of a first mapping scheme;
− the attainment of mapping results on grid and hexagonal arrays for three DSP benchmarks.

This paper is structured as follows. The following section briefly introduces the environment. Section 3 explains the mapping scheme. Section 4 shows examples and experimental results. Finally, Section 5 concludes the paper and discusses ongoing and future work.

* Ricardo Ferreira acknowledges the financial support from Ct-Energia/CNPq, CAPES and FAPEMIG, Brazil.

2. The Environment

A global view of our design exploration environment is shown in Fig. 1. The starting point is the dataflow specification1, which is written in XML. XML is also used to specify the coarse-grained, data-driven array architecture, as well as the placement and routing. Each dataflow operator is directly implemented with a Functional Unit (FU). The environment uses Java to specify each FU's behavior and to perform the dataflow and array modeling, simulation, and mapping. For simulating either the array architecture or a specific design, we use the Hades simulation tool [10], which has been extended with a data-driven library. Note that we are interested in modeling and exploring data-driven array architectures in which, for a specific implementation of an algorithm, the PE operations and the interconnections between them are statically defined2. Our environment supports two simulation flows (Dflow and Aflow in Fig. 1):

1 A dataflow model can be automatically generated by a compiler from an input program in an imperative programming language [17][18]. 2 TRIPS [11] is an example of an architecture in which the interconnections are dynamically defined.

− In Dflow, the dataflow representation is translated to a Hades design and simulated. Dflow provides an estimation of the optimal performance (e.g., achievable when implementing an ASIC-based architecture) provided that full balancing is used (i.e., FIFOs of sufficient size). It permits a useful comparison with implementations on a reconfigurable data-driven array, since it represents the optimal achievable performance using a specific set of FUs (akin to the FUs existing in each PE of the array).
− In Aflow, the dataflow representation is mapped to a data-driven array architecture, specified by a template, and simulated with Hades.

For design analysis, a user may specify, in the dataflow and array architecture descriptions, which parameters should be reported by the simulation engine. Those parameters can be the interconnect delay, the handshake protocol overhead, the operator activity, etc. As some experimental results show, the simulation and the mapping are fast enough to conduct a significant number of experiments.
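The paper does not reproduce the concrete XML schema used for the architecture descriptions. As a purely illustrative sketch (all element and attribute names below are invented, not the tool's actual vocabulary), a description of a small grid array with per-edge connection counts and reporting options might look like:

```xml
<!-- Hypothetical architecture description; names are illustrative only. -->
<array topology="grid" rows="3" cols="3" hop="1">
  <pe fifoIn="2" fifoOut="2" pipelineStages="1">
    <!-- 2 inputs, 2 outputs, 0 bidirectional connections per border -->
    <edge in="2" out="2" inout="0"/>
  </pe>
  <report protocolOverhead="true" operatorActivity="true"/>
</array>
```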

Fig. 1. Environment for Exploration of Data-Driven Array Architectures (EDA). Note that the Place and Route phase still needs further work and the front-end compiler is planned as future work

A typical FU integrates an ALU, a multiplier or divider, input/output FIFOs, and the control unit that implements the handshake mechanism (see Fig. 2a). The FU is the main component of each PE in the array architecture. A PE consists of an FU embedded in an array cell, which has a regular neighbor pattern implemented by local interconnections (see Fig. 2b and Fig. 2c). A ready/acknowledge protocol controls the data transfers between FUs or PEs. An FU computes a new value when all required inputs are available and previous results have been consumed. When an FU finishes a computation, an acknowledge signal is sent back to all inputs and the next data tokens can be received.

Each FU, besides the traditional data-driven operators [3], may also implement the SE-PAR and PAR-SE operators introduced in [12][13]. These operators resemble mux and demux operators without external control. The canonical form of PAR-SE has two inputs (A and B) and one output (X): it repeatedly outputs to X the data on inputs A and B, in alternating fashion. The canonical form of SE-PAR has one input (A) and two outputs (X and Y): it repeatedly routes the data on its input alternately to X and Y. Note, however, that PAR-SE and SE-PAR can have more than two inputs and two outputs, respectively. They can be used to reduce array resources and to fully decentralize the needed control structure. These operations provide an efficient way of sharing resources whenever needed (e.g., interface to an input/output port, interface to memory ports, etc.). SE-PAR and PAR-SE operations with more than two outputs and two inputs, respectively, can be implemented by a tree of basic SE-PAR and PAR-SE operators.

Each FU may have input/output FIFOs, which can be efficient structures to directly handle unbalanced paths. Parameters such as protocol delay, FIFO size, and FU granularity are global in the context of an array, but can be local when a specific dataflow implementation is the goal. At this level, an FU's behavior and the communication protocol are completely independent of the array architecture. They are specified as Java classes, which provide an easy way to build an incremental FU library and then to model and simulate a pure dataflow design as well as a novel architecture. The properties of the target architecture, such as the array topology, the interconnection network, and the PEs' placement, are specified using XML-based languages, which provide an efficient way to explore different array architectures.
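The alternating split/merge behavior of the canonical SE-PAR and PAR-SE operators can be sketched in plain Java over finite token lists (the class and method names here are ours, not the environment's actual FU library API):

```java
import java.util.ArrayList;
import java.util.List;

// Software model of the canonical operators: SE-PAR alternately routes
// its single input stream to outputs X and Y; PAR-SE alternately merges
// inputs A and B into a single output stream.
public final class AlternatingOps {

    // SE-PAR: X receives tokens 0, 2, 4, ...; Y receives tokens 1, 3, 5, ...
    public static List<List<Integer>> sePar(List<Integer> in) {
        List<Integer> x = new ArrayList<>(), y = new ArrayList<>();
        for (int i = 0; i < in.size(); i++) {
            (i % 2 == 0 ? x : y).add(in.get(i));
        }
        return List.of(x, y);
    }

    // PAR-SE: interleaves streams A and B as A, B, A, B, ...
    public static List<Integer> parSe(List<Integer> a, List<Integer> b) {
        List<Integer> out = new ArrayList<>();
        for (int i = 0; i < Math.max(a.size(), b.size()); i++) {
            if (i < a.size()) out.add(a.get(i));
            if (i < b.size()) out.add(b.get(i));
        }
        return out;
    }
}
```

Applying parSe to the two outputs of sePar reconstructs the original stream, which is why the pair works as a controller-free demux/mux.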


Fig. 2. (a) Data-driven functional unit (FU) with two inputs and one output; (b) Hexagonal cell with the FU; (c) Hexagonal array (FUs may have FIFOs in their input/output ports)
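The FU firing rule described above (fire only when all required inputs hold a token and the previous result has been consumed) can be sketched as a small Java model; the class below is our own simplified abstraction, not the environment's Java FU class, and it models a single output slot with no FIFO:

```java
import java.util.ArrayDeque;
import java.util.Queue;
import java.util.function.BinaryOperator;

// Minimal model of the static-dataflow firing rule of a two-input FU.
public final class FiringFu {
    private final Queue<Integer> inA = new ArrayDeque<>();
    private final Queue<Integer> inB = new ArrayDeque<>();
    private Integer out = null;                 // single output slot (no FIFO)
    private final BinaryOperator<Integer> op;

    public FiringFu(BinaryOperator<Integer> op) { this.op = op; }

    public void putA(int v) { inA.add(v); }
    public void putB(int v) { inB.add(v); }

    // One evaluation step: fires only if both inputs hold a token AND the
    // previous result was consumed; firing "acknowledges" the inputs.
    public boolean step() {
        if (out == null && !inA.isEmpty() && !inB.isEmpty()) {
            out = op.apply(inA.remove(), inB.remove());
            return true;
        }
        return false;   // blocked: missing input token or unconsumed output
    }

    // Downstream consumer takes the result, freeing the output slot.
    public Integer consume() { Integer v = out; out = null; return v; }
}
```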

We can use the graphical user interface of Hades to perform interactive simulation in Dflow or Aflow. Fig. 3 shows a screenshot with a hexagonal array simulation and the respective waveforms.

Fig. 3. A hexagonal array interactive simulation using Hades

3. A Flexible Mapping Approach

An array processor can be defined as a set of PEs connected with different topologies. Our goal is to allow the exploration of data-driven architectures, which can have a mesh, a hexagonal, or any other interconnection network. Previous works [14][15] have addressed the mapping of data-driven algorithms onto regular array architectures. A mapping algorithm for a regular hexagonal array architecture, with a fixed interconnection degree, has been proposed in [14]. On the other hand, most array architectures are based on a grid topology [15]. Our approach differs from the previous ones in three significant ways: (a) it is currently able to compare hexagonal and grid topologies; (b) it presents an object-oriented mapping scheme to model different interconnection patterns; (c) it is flexible enough to accommodate other mapping algorithms.

Our object-oriented mapping scheme also takes advantage of Java and XML technologies to enable a portable and flexible implementation. The scheme provides an easy way of modeling grid, hexagonal, octal, as well as other topologies. The implementation is based on three main classes: Array, PE, and Edge. The Array class implements the place and route algorithm. The array and neighbor parameters (e.g., their number and positions) are specified using PE classes. Finally, the Edge class models the number and type of connections between neighbors. Examples of PE and Edge classes are represented in Fig. 4. Each PE defines the number of borders with its neighbors, with each border having the input/output connections defined by the Edge class. At the moment, the scheme does not accept different Edges. A PE can be connected to N-hop neighbors (see Fig. 5a) and can have in, out, and/or in-out connections (see Fig. 5b). In a 0-hop pattern, each PE is connected to its immediate neighbors.
In a 1-hop pattern, each PE is connected to the immediate neighbors and 1-hop, i.e., the PEs that can be reached by traversing through one neighbor PE. For instance, the nodes 1 and 3 are 1-hop neighbors in Fig. 5a.
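The 0-hop versus 1-hop neighborhoods on a grid topology can be sketched as a small breadth-first expansion; this helper is our own illustration (it ignores array borders and counts as a "1-hop neighbor" every PE reachable through one intermediate PE, diagonals included):

```java
import java.util.LinkedHashSet;
import java.util.Set;

// Sketch of N-hop neighborhoods on an unbounded grid: 0-hop yields the
// 4 immediate neighbors; 1-hop also adds PEs reachable through one
// intermediate PE (4 diagonal plus 4 distance-2 straight positions).
public final class GridNeighbors {
    private static final int[][] STEPS = {{0, 1}, {0, -1}, {1, 0}, {-1, 0}};

    public static Set<String> neighbors(int x, int y, int hops) {
        Set<String> frontier = new LinkedHashSet<>();
        Set<String> all = new LinkedHashSet<>();
        frontier.add(x + "," + y);
        for (int h = 0; h <= hops; h++) {
            Set<String> next = new LinkedHashSet<>();
            for (String p : frontier) {
                String[] c = p.split(",");
                int px = Integer.parseInt(c[0]), py = Integer.parseInt(c[1]);
                for (int[] s : STEPS) next.add((px + s[0]) + "," + (py + s[1]));
            }
            all.addAll(next);
            frontier = next;
        }
        all.remove(x + "," + y);   // a PE is not its own neighbor
        return all;
    }
}
```

For an interior PE this gives 4 neighbors at 0-hop and 12 at 1-hop, which illustrates why 1-hop patterns shorten routing paths.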

(a)

(b)

Fig. 4. Main classes of the mapping scheme: (a) PE classes, each one with different edges; (b) Edge classes (three parameters are used: the number of input, output, and input/output connections)

Two versions of a first mapping algorithm have been developed (PR1 and PR2). They are based on the greedy algorithm presented in [14]. Albeit simple, they have enabled us to explore different connection patterns and different topologies. The mapping algorithm is divided into place and route steps. The algorithm starts by placing the nodes of the dataflow graph (i.e., assigning a PE to each node of the graph) based on layers, and then optimizes the placement based on center-of-mass forces. The PR1 version adds NOP nodes to the input dataflow graph before starting the optimization phase. After placement, the route step tries to connect the nodes using incremental routing (i.e., each path of the original DFG is constructed by routing from one PE to one of its neighbors).
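The layer-based placement step can be sketched as follows. This is a strongly simplified illustration of the idea, not the PR1/PR2 implementation: it assumes an acyclic DFG, assigns each node to a grid row given by its topological layer, and scores the result by Manhattan path length instead of performing the center-of-mass optimization and the incremental PE-to-PE routing:

```java
import java.util.List;

// Sketch of layer-based placement: nodes are layered by longest path
// from the DFG inputs, then packed left-to-right into grid rows.
public final class LayerPlacer {

    // edges: directed DFG edges {src, dst}; returns [row, col] per node.
    public static int[][] place(int n, List<int[]> edges) {
        int[] layer = new int[n];
        boolean changed = true;              // longest-path layering (acyclic DFG assumed)
        while (changed) {
            changed = false;
            for (int[] e : edges) {
                if (layer[e[1]] < layer[e[0]] + 1) {
                    layer[e[1]] = layer[e[0]] + 1;
                    changed = true;
                }
            }
        }
        int[][] pos = new int[n][2];
        int[] nextCol = new int[n + 1];      // next free column in each row
        for (int v = 0; v < n; v++) {
            pos[v] = new int[]{layer[v], nextCol[layer[v]]++};
        }
        return pos;
    }

    // Average Manhattan distance between placed endpoints of each edge.
    public static double averagePathLength(int[][] pos, List<int[]> edges) {
        int total = 0;
        for (int[] e : edges) {
            total += Math.abs(pos[e[0]][0] - pos[e[1]][0])
                   + Math.abs(pos[e[0]][1] - pos[e[1]][1]);
        }
        return (double) total / edges.size();
    }
}
```

A metric like averagePathLength is what a routing-aware optimization phase would try to minimize, mirroring the "P" columns reported in Table 2.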

The infrastructure has been developed to easily integrate different mapping algorithms. Future work will address a more advanced mapping algorithm. We plan to add critical-path sensitivity and to include path balancing, which can be very important in array architectures with small or no FIFOs. A scheme to deal with heterogeneous array elements (e.g., only some PEs supporting multiplication) should also be researched. Notice also that the arrays currently being explored do not permit using a PE to implement more than one operation of the DFG. Arrays including this feature require node-compression schemes such as the one used in [14] for hexagonal arrays.

Fig. 5. Different topologies supported by the mapping phase: (a) 0-hop and 1-hop grid topologies; (b) uni-directional and bi-directional neighbor connections. The panels show: grid 0-hop with 1 in-out connection; grid 1-hop with 1 in-out connection; hexagonal 0-hop with 1 in and 1 out connection; hexagonal 0-hop with 2 in-out connections

4. Experimental Results

We have used the current prototype environment to perform some experiments. In the examples presented, we used 32-bit FUs and a 4-phase asynchronous handshake mechanism. All executions and simulations have been done on a Pentium 4 (1.8 GHz, 1 GB of RAM, running Linux).

Examples

As benchmarks we use three DSP algorithms: FIR, CPLX, and FDCT. FIR is a finite-impulse response filter. CPLX is a FIR filter using complex arithmetic. FDCT is a fast discrete cosine transform implementation. The last two benchmarks are based on the C code available in [16]. For the experiments, we manually translated the input algorithms to a dataflow representation. The translation has been done bearing in mind optimization techniques that can be included in a compiler from a software programming language (e.g., C) to data-driven representations (see, e.g., [17][18] for details about compilation issues).

For the data-driven implementation of the FDCT example (see part of the source code in Fig. 6a) we used the SLP (self loop pipelining) technique with SE-PAR and PAR-SE operators [12][13] (see the block diagram in Fig. 6b). An SE-PAR tree splits the matrix input stream into 8 parallel elements. Then, the inner-loop operations are performed concurrently, and finally a PAR-SE tree merges the results into a unique matrix output stream. Notice that the SE-PAR and PAR-SE operators are also used here to share the computational structures of the two loops of the FDCT.
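The SE-PAR tree used in the FDCT implementation can be modeled in Java as a balanced tree of canonical two-output SE-PAR stages; this sketch (our own illustration) shows that cascading the stages splits one stream into 8 substreams, with each group of 8 consecutive input elements landing one per leaf:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of an SE-PAR tree: repeatedly applying the canonical two-output
// SE-PAR stage splits one stream into k = 2^depth parallel streams.
public final class SeParTree {

    // One canonical SE-PAR stage: even-indexed tokens to X, odd to Y.
    private static List<List<Integer>> stage(List<Integer> in) {
        List<Integer> x = new ArrayList<>(), y = new ArrayList<>();
        for (int i = 0; i < in.size(); i++) {
            (i % 2 == 0 ? x : y).add(in.get(i));
        }
        return List.of(x, y);
    }

    // Split one stream into `leaves` substreams (leaves must be a power of 2).
    public static List<List<Integer>> split(List<Integer> in, int leaves) {
        List<List<Integer>> streams = List.of(in);
        while (streams.size() < leaves) {
            List<List<Integer>> next = new ArrayList<>();
            for (List<Integer> s : streams) next.addAll(stage(s));
            streams = next;
        }
        return streams;
    }
}
```

Each leaf receives the indices of one residue class modulo 8 (in bit-reversed leaf order), so the 8 datapaths of the FDCT inner loop each see one element per 8-element group; a mirrored PAR-SE tree re-interleaves them.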

[Fig. 6a: excerpt of the FDCT source code, with the two nested loop nests (vertical and horizontal traversals) over the input matrix. Fig. 6b: block diagram of the data-driven implementation: an SE-PAR tree splits the input stream into 8 parallel streams feeding the shared loop body, counters (Cnt 64, Cnt 8) sequence the traversals, a 256-element buffer (rd/wr) holds intermediate results between the two passes, and PAR-SE trees merge the results toward the output memory.]
Fig. 6. FDCT: (a) source code based on the C code available in [16]; (b) possible FDCT implementation using the SLP technique and sharing the loop body resources between vertical and horizontal traversal

Results

Table 1 shows the number of resources needed (number of FUs) and the results obtained by simulating implementations of the FIR and CPLX filters with different numbers of taps, two parts of the FDCT (FDCTa is related to the vertical traversal, FDCTb to the horizontal traversal), and the complete FDCT. In these experiments, FIFOs (of size between 1 and 3) at the inputs and outputs of each FU are used to achieve the maximum throughput. The CPU time to perform each simulation has been between 1 and 4 seconds for 1,024 input data items.

Table 1. Properties related to the implementations of the three DSP benchmarks

| Ex     | #FU | #copy | #ALU | #MULT | #SE-PAR | #PAR-SE | #I/O | Avg. activity (ALU+MULT) | Avg. activity (all) | Max ILP (ALU+MULT) | Max ILP (all) |
|--------|-----|-------|------|-------|---------|---------|------|--------------------------|---------------------|--------------------|---------------|
| FIR-2  | 7   | 1     | 2    | 2     | 0       | 0       | 2    | 1.00                     | 1.00                | 4                  | 7             |
| FIR-4  | 13  | 3     | 4    | 4     | 0       | 0       | 2    | 1.00                     | 1.00                | 8                  | 13            |
| FIR-8  | 25  | 7     | 8    | 8     | 0       | 0       | 2    | 1.00                     | 1.00                | 16                 | 25            |
| FIR-16 | 49  | 15    | 16   | 16    | 0       | 0       | 2    | 1.00                     | 1.00                | 32                 | 49            |
| CPLX4  | 22  | 5     | 8    | 2     | 4       | 1       | 2    | 0.70                     | 0.86                | 8                  | 18            |
| CPLX8  | 46  | 13    | 18   | 4     | 8       | 1       | 2    | 0.68                     | 0.82                | 16                 | 38            |
| FDCTa  | 92  | 26    | 36   | 14    | 7       | 7       | 2    | 0.12                     | 0.18                | 10                 | 23            |
| FDCTb  | 102 | 26    | 46   | 14    | 7       | 7       | 2    | 0.12                     | 0.16                | 12                 | 25            |
| FDCT   | 136 | 26    | 52   | 14    | 21      | 21      | 2    | 0.20                     | 0.26                | 22                 | 49            |

The average activity columns show the percentage of time in which an FU performs an operation. The maximum activity (i.e., 1.00) is reached when the FU activity equals the input stream rate. We present average activities taking into account only the ALU+MULT operations, and all the operations. The maximum ILP (instruction-level parallelism) shows the maximum number of FUs executing in a given time step. Once again, we present ILP results for ALU+MULT and for all operations. As can be seen, for FIR and CPLX the maximum ILP is approximately equal to the number of FUs, which shows that all the FUs are doing useful work almost all the time. With FDCT, the maximum ILP is high (22 for ALU+MULT operations and 49 for all operations, considering the complete example), but many FUs are used only during small fractions of the total execution time. We believe this can be improved by using SE-PAR and PAR-SE operators with more outputs and inputs, respectively. For instance, in the hexagonal array we may have implementations of these operators with 6 inputs or 6 outputs, which would significantly reduce the SE-PAR and PAR-SE trees. Fig. 7 shows a snapshot of the execution of the FDCT, showing the operation activity.
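Metrics of this kind can be derived from a simulation trace; the sketch below is our own simplified illustration (the trace format is an assumption, with trace.get(t) holding the ids of the FUs firing at step t, and activity normalized by total steps rather than by the input stream rate):

```java
import java.util.List;

// Sketch of activity/ILP metrics over a firing trace:
// averageActivity(fu) = fraction of steps in which FU `fu` fires;
// maxIlp = largest number of FUs firing in the same step.
public final class TraceMetrics {

    public static double averageActivity(List<List<Integer>> trace, int fu) {
        long fired = trace.stream().filter(step -> step.contains(fu)).count();
        return (double) fired / trace.size();
    }

    public static int maxIlp(List<List<Integer>> trace) {
        return trace.stream().mapToInt(List::size).max().orElse(0);
    }
}
```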


Fig. 7. Activity of the FUs during initial execution of the FDCT implementation shown in Fig. 6b. After 102 input samples the implementation starts outputting each result with maximum throughput

The mapping algorithms presented in Section 3 have been implemented in Java. Table 2 shows the mapping results for the three benchmarks on three different array topologies (grid, grid 1-hop, and hexagonal), each one with two different connection structures ("0,0,2" indicates 2 bidirectional connections, and "2,2,0" indicates 2 input and 2 output connections, in each Edge of a PE). Each example has been mapped in 200 to 400 ms of CPU time. Column "N/E" specifies the number of nodes and edges of each dataflow graph. Columns "P" show the average path lengths (measured as the number of PEs that a connection needs to traverse from the source to the sink PE) after mapping the examples onto the three topologies. Columns "M" give the maximum path length for each example when mapped onto the corresponding array. Cells marked "-" in Table 2 represent cases the current version of our mapping algorithm was unable to place and route; those cases happened with the FDCT examples on the grid topology. The results obtained with the implemented mapping scheme show that the simpler grid topology is the worst in terms of maximum and average path lengths. The hexagonal and grid 1-hop topologies perform distinctly according to the benchmark: the hexagonal seems to perform better for the FIR filters, and is outperformed by the grid 1-hop for the other benchmarks (CPLX and FDCT). The results confirm that grid 1-hop outperforms the grid topology, as already shown in [15]. Note, however, that the hexagonal topology was not evaluated in [15].

Table 2. Results of mapping the benchmarks on different array topologies and interconnection patterns between PEs. Each cell gives the average path length P, with the maximum path length M in parentheses

| Ex     | N/E     | P&R | Grid 0,0,2 | Grid 2,2,0 | Grid 1-hop 2,2,0 | Grid 1-hop 0,0,2 | Hexagonal 2,2,0 | Hexagonal 0,0,2 |
|--------|---------|-----|------------|------------|------------------|------------------|-----------------|-----------------|
| FIR2   | 7/7     | PR1 | 1.28 (2)   | 1.28 (2)   | 1.28 (2)         | 1.28 (2)         | 1.28 (2)        | 1.28 (2)        |
|        |         | PR2 | 1.28 (2)   | 1.28 (2)   | 1.28 (2)         | 1.28 (2)         | 1.14 (2)        | 1.14 (2)        |
| FIR4   | 13/15   | PR1 | 1.86 (4)   | 1.60 (3)   | 1.40 (2)         | 1.40 (2)         | 1.33 (2)        | 1.33 (2)        |
|        |         | PR2 | 1.60 (3)   | 1.60 (3)   | 1.46 (2)         | 1.46 (2)         | 1.26 (2)        | 1.26 (2)        |
| FIR8   | 25/31   | PR1 | 1.96 (6)   | 2.03 (5)   | 1.58 (3)         | 1.58 (3)         | 1.54 (4)        | 1.54 (4)        |
|        |         | PR2 | 1.83 (5)   | 1.83 (5)   | 1.54 (3)         | 1.54 (3)         | 1.51 (4)        | 1.51 (4)        |
| FIR16  | 49/63   | PR1 | 2.25 (9)   | 2.26 (11)  | 1.71 (5)         | 1.69 (5)         | 1.55 (7)        | 1.55 (7)        |
|        |         | PR2 | 2.19 (9)   | 2.15 (11)  | 1.71 (5)         | 1.73 (5)         | 1.71 (8)        | 1.71 (8)        |
| CPLX4  | 22/28   | PR1 | 1.71 (6)   | 1.75 (6)   | 1.46 (3)         | 1.46 (3)         | 1.39 (5)        | 1.39 (5)        |
|        |         | PR2 | 1.71 (6)   | 1.71 (6)   | 1.46 (3)         | 1.46 (3)         | 1.50 (4)        | 1.50 (4)        |
| CPLX8  | 46/60   | PR1 | 2.46 (10)  | 2.31 (11)  | 1.73 (5)         | 1.73 (5)         | 1.75 (7)        | 1.76 (7)        |
|        |         | PR2 | 2.13 (10)  | 2.21 (11)  | 1.61 (6)         | 1.61 (6)         | 1.80 (8)        | 1.80 (8)        |
| FDCTa  | 92/124  | PR1 | 2.49 (14)  | 2.41 (10)  | 1.83 (6)         | 2.09 (7)         | 2.08 (8)        | 2.32 (9)        |
|        |         | PR2 | 2.32 (14)  | 2.42 (12)  | 1.83 (5)         | 1.96 (10)        | 2.07 (10)       | 2.10 (9)        |
| FDCTb  | 102/134 | PR1 | -          | -          | 1.75 (6)         | 1.94 (7)         | 2.01 (10)       | 2.24 (9)        |
|        |         | PR2 | -          | -          | 1.76 (5)         | 1.85 (8)         | 1.97 (10)       | 2.02 (10)       |
| FDCT   | 136/186 | PR1 | -          | -          | 3.19 (15)        | 3.31 (21)        | 4.61 (22)       | 4.31 (28)       |
|        |         | PR2 | -          | -          | 2.91 (13)        | 3.01 (16)        | 3.71 (20)       | 4.04 (21)       |

5. Conclusions

This paper presented further enhancements to an environment to simulate and explore data-driven array architectures. Although many features of those architectures are worth exploring, developing an environment capable of exploiting all the important features is a tremendous task. In our case we first selected a subset of properties to be modeled: FIFO sizes, grid or hexagonal topologies, etc. Notice, however, that the environment has been developed bearing in mind incremental enhancements, each one contributing to a more powerful exploration. A first version of a mapping approach, developed to easily explore different array configurations, is presented, and results achieved for hexagonal and grid topologies are shown. This first version demonstrates the flexibility of the scheme.

Ongoing work intends to add more advanced mapping schemes to enable a comparison between different array topologies independent of the mapping algorithm used to conduct the experiments. Ways to deal with heterogeneous array elements distributed through an array are also under focus. Further work is also needed to allow the definition of the configuration format for each PE of the architecture being evaluated, as well as automatic VHDL generation to prototype a given array or data-driven solution on an FPGA. We also have long-term plans to include a front-end compiler to continue studies of data-driven array features with complex benchmarks. We hope that further developments will contribute to an environment able to evaluate new data-driven array architectures prior to fabrication.

References

1. R. Hartenstein, "A Decade of Reconfigurable Computing: a Visionary Retrospective," in Int'l Conf. on Design, Automation and Test in Europe (DATE'01), Munich, Germany, March 12-15, 2001, pp. 642-649.
2. L. Bossuet, G. Gogniat, and J. L. Philippe, "Fast design space exploration method for reconfigurable architectures," in Int'l Conference on Engineering of Reconfigurable Systems and Algorithms (ERSA'03), Las Vegas, Nevada, June 23-26, 2003.
3. A. H. Veen, "Dataflow machine architecture," ACM Computing Surveys, Vol. 18, Issue 4, 1986, pp. 365-396.
4. R. Hartenstein, R. Kress, and H. Reinig, "A Dynamically Reconfigurable Wavefront Array Architecture," in Proc. Int'l Conference on Application Specific Array Processors (ASAP'94), Aug. 22-24, 1994, pp. 404-414.
5. W. Thies, M. Karczmarek, and S. Amarasinghe, "StreamIt: A Language for Streaming Applications," in Proc. of the Int'l Conf. on Compiler Construction (CC'02), 2002.
6. N. Imlig et al., "Programmable Dataflow Computing on PCA," IEICE Trans. Fundamentals, vol. E83-A, no. 12, December 2000, pp. 2409-2416.
7. R. Hartenstein, M. Herz, T. Hoffmann, and U. Nageldinger, "Generation of Design Suggestions for Coarse-Grain Reconfigurable Architectures," in 10th Int'l Workshop on Field Programmable Logic and Applications (FPL'00), Villach, Austria, Aug. 27-30, 2000.
8. R. Hartenstein, M. Herz, T. Hoffmann, and U. Nageldinger, "KressArray Xplorer: A New CAD Environment to Optimize Reconfigurable Datapath Array Architectures," in 5th Asia and South Pacific Design Automation Conference (ASP-DAC'00), Yokohama, Japan, pp. 163-168.
9. R. Ferreira, J. M. P. Cardoso, and H. C. Neto, "An Environment for Exploring Data-Driven Architectures," in 14th Int'l Conference on Field Programmable Logic and Applications (FPL'04), LNCS 3203, Springer-Verlag, 2004, pp. 1022-1026.
10. N. Hendrich, "A Java-based Framework for Simulation and Teaching," in 3rd European Workshop on Microelectronics Education (EWME'00), Aix-en-Provence, France, May 18-19, 2000, Kluwer Academic Publishers, pp. 285-288.
11. D. Burger et al., "Scaling to the End of Silicon with EDGE architectures," IEEE Computer, July 2004, pp. 44-55.
12. J. M. P. Cardoso, "Self Loop Pipelining and Reconfigurable Dataflow Arrays," in Int'l Workshop on Systems, Architectures, MOdeling, and Simulation (SAMOS IV), Samos, Greece, July 19-21, 2004, LNCS 3133, Springer-Verlag, pp. 234-243.
13. J. M. P. Cardoso, "Dynamic Loop Pipelining in Data-Driven Architectures," in ACM Int'l Conference on Computing Frontiers (CF'05), Ischia, Italy, May 4-6, 2005.
14. I. Koren et al., "A Data-Driven VLSI Array for Arbitrary Algorithms," IEEE Computer, Vol. 21, No. 10, 1989, pp. 30-43.
15. N. Bansal et al., "Network Topology Exploration of Mesh-Based Coarse-Grain Reconfigurable Architectures," in Design, Automation and Test in Europe Conference (DATE'04), Paris, France, Feb. 16-20, 2004, pp. 474-479.
16. Texas Instruments, Inc., TMS320C6000 Highest Performance DSP Platform, 1995-2003, http://www.ti.com/sc/docs/products/dsp/c6000/benchmarks/62x.htm#search
17. M. Budiu and S. C. Goldstein, "Compiling application-specific hardware," in Proc. 12th Int'l Conference on Field Programmable Logic and Applications (FPL'02), LNCS 2438, Springer-Verlag, 2002, pp. 853-863.
18. J. M. P. Cardoso and M. Weinhardt, "XPP-VC: A C Compiler with Temporal Partitioning for the PACT-XPP Architecture," in 12th Int'l Conference on Field Programmable Logic and Applications (FPL'02), LNCS 2438, Springer-Verlag, 2002, pp. 864-874.
