Modelling Programmable Logic Devices and Reconfigurable, Microprocessor-related Architectures

Christian Siemers, Volker Winterstein
University of Applied Sciences Nordhausen, Weinberghof 4, D-99734 Nordhausen, Germany
(siemers|winterstein)@fh-nordhausen.de

Abstract

This paper introduces two basic models for describing the space efficiency and the throughput of configurable devices. The first model focuses on available Programmable Logic Devices (PLD) and shows the relationship of silicon area and computing time to the block size. It is further subdivided into one model for Complex PLDs (CPLD) and one for Field-Programmable Gate Arrays (FPGA), because the two incorporate different implementations of programmable logic. The second model was developed to describe the behaviour of block-based, reconfigurable architectures, such as the recently introduced Universal Configurable Block (UCB) system, with respect to block size. All models show a specific behaviour regarding the required silicon area and the data throughput; consequently, they are useful for determining optimum block sizes in different logic architectures.

1 Motivation of the Work

Recently, a research project was started to find an optimal PLD architecture as a function of certain parameters. While technology also has an impact on the actual architecture, research currently focuses on technology-independent parameters. Related work [1] shows the behaviour of microprocessors with phase pipelining and superscalar architectures, e.g. optimum numbers of pipeline stages and general limiting effects. The goal of this work is to show the general relation between silicon area and computational throughput for different cell architectures. We will also discuss the differences between some representative architectures, especially their advantages and disadvantages regarding silicon area consumption and computational performance. Finally, we describe some qualitative derivations of optimum values for such cell architectures.

2 Modelling Programmable Logic Devices

Programmable Logic Devices (PLD) have been available since 1977, high-density PLDs since 1985. Over the past decades, two basic architectures have developed: the coarse-grained CPLD (Complex PLD) and the fine-grained FPGA (Field-Programmable Gate Array) [2], [3]. Their differences are well known, so why discuss them now? The reason is to find a theoretical explanation for the best architectural properties of commonly used or new (reconfigurable) cell structures. The approach of this paper is therefore to compare the different designs and to work out whether values chosen in practice, e.g. for block size, are optimal or whether there is additional room for optimisation.

2.1 Field-Programmable Gate Arrays

If we want to realise logical functions with FPGAs, we typically optimise in two dimensions: silicon area (cost) and computational performance (speed). Two important questions follow. First, given a device with a well-defined computing capacity (for logical operations at bit level), how much silicon area is needed to provide it? Second, how fast will an average application run? To find optimum values, we need a description of the silicon area as a function of the number N of inputs per basic logic block, which is the most important architectural parameter. The functional unit inside the FPGA needs a certain number k_logic,0 of basic logic cells, each with a silicon area A_logic,0. Formula (1) then expresses the silicon area in terms of the number of cells, the size of the basic cell, and the number of inputs. For the logic part, this dependence is assumed to follow

A_logic(N) = A_logic,0 · 2^N · (k_logic,0 / N) · (1 + k_internal · N)²   (1)

0-7695-1926-1/03/$17.00 (C) 2003 IEEE

This is derived from the assumption that at the theoretical minimum (N = 1, one block with one input), a basic area A_logic,0 per cell and a number k_logic,0 of logic cells are needed. If we then increase the number of cell inputs, the silicon area scales with 2^N, because the number of stored values and of pass transistors grows exponentially inside the Look-Up Table (LUT) structures mostly used in FPGAs. The number of such logic elements is assumed to scale with 1/N to obtain the same computational capacity: with a growing number of inputs, the functionality of each block increases, so the number of blocks needed for the same device functionality decreases. There is also a small additional correction factor due to the fact that logic elements usually incorporate feedback lines; this results in a smooth area increase and is modelled with k_internal ≪ 1. Formula (1) is scaled to this initial value. In addition to the silicon area of the logic part, we have to consider the area used for routing between the logic cells. The routing capacity scales inversely with the number of inputs N, as shown in equation (2), because larger blocks with more inputs can compute more complex functions and need fewer routing resources to other (neighbouring) blocks. The necessary routing area can now be described by the number of routings k_routing,0 within the cell and the number N of cell inputs as

A_routing(N) = A_routing,0 · (k_routing,0 / N)   (2)

Here we assume that the area per routing element A_routing,0 is constant across all variations of N, while the number of routing elements decreases with a factor of 1/N. The routing consists of local connections and of configurable general-purpose connections with several stages between two logic cells. The sum of formulas (1) and (2) now represents the entire silicon area of a LUT-based Field-Programmable Gate Array:

A_LUT(N) = A_logic,LUT,0 · 2^N · (k_logic,0 / N) · (1 + k_internal · N)² + A_routing,LUT,0 · (k_routing,0 / N)   (3)
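As a numerical sketch, equation (3) can be evaluated directly. All constant values below (unit logic factors, a routing factor of 4, k_internal = 0.05) are illustrative assumptions chosen only to reproduce the qualitative shape of the model, not values taken from the paper:

```python
def a_lut(n: int, k_internal: float = 0.05) -> float:
    """Equation (3): silicon area of a LUT-based FPGA in relative
    units, as the sum of the logic area (1) and the routing area (2).
    The 2**n term reflects that an n-input LUT stores 2**n bits."""
    logic = 2 ** n / n * (1 + k_internal * n) ** 2   # equation (1), factors set to 1
    routing = 4.0 / n                                # equation (2), assumed factor ratio 1:4
    return logic + routing

# Area-minimal block input size under these assumed constants:
areas = {n: a_lut(n) for n in range(1, 11)}
n_min = min(areas, key=areas.get)
```

With these assumptions the minimum falls at n_min = 2, matching the maximum of the 1/Space curve in Fig. 1.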

For cost-sensitive applications it is useful to know the input size that minimises silicon area, so equation (3) is displayed in Fig. 1 as the graph 1/A_LUT(N) (normalised to relative units), where the maximum of this curve corresponds to the minimum of silicon area consumption. For modelling the throughput D (i.e. the computational speed of the logic block), the signal delay time inside the PLD has to be modelled. We make the following additional assumptions. The delay time T_Cell,0 of a logic cell is proportional to the square root of the cell area, i.e. proportional to the linear dimension of the cell. The delay time T_routing,0 of a routing part, in contrast, is nearly constant as long as local routing (nearest-neighbour connections) dominates; this is assumed for the purpose of this paper. The throughput D can then be described by N and the sum of both delay times, caused by the signal delay inside the basic logic cell and the signal delay on its routing lines. The factor N also models the growing parallelism for logic computation inside the logic cell: the more inputs are available, the more parallel computations are possible inside the cell.

D_LUT = N / (T_Cell,0 · 2^(N/2) + T_routing,0)   (4)

Both equations (3) and (4) show the influence of N on silicon area use and on the throughput of a LUT structure, depending also on the relation of cell to routing area. Fig. 1 visualises both functions, as well as their unweighted product, for an (assumed) cell/routing relation of 1:4, which agrees well with the relations inside real devices; the relative units represent the ratio of the factors A·k for the cell and routing areas. The same calculation was performed for a relation of 1:1, but the results do not differ significantly. As shown in Fig. 1, minimum silicon area use (maximum of the curve 1/Space) is found for logic blocks with typically 2 inputs, while maximum throughput is reached in a range of four to six inputs. Small logic blocks thus make good use of the available silicon area, but larger logic blocks support faster computation (throughput). Consequently, an optimum average value for the throughput/space relationship is found for look-up-table structures with 3 inputs and 1 output, which means good computational speed at reasonable cost.
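The curves of Fig. 1 can be reproduced with a short sketch of equations (3) and (4). The delay constants T_cell = 1 and T_routing = 4 and the 1:4 area factors below are illustrative assumptions, not values from the paper:

```python
def area_lut(n: int, k_internal: float = 0.05) -> float:
    # Equation (3), repeated here so the sketch is self-contained.
    return 2 ** n / n * (1 + k_internal * n) ** 2 + 4.0 / n

def d_lut(n: int, t_cell: float = 1.0, t_routing: float = 4.0) -> float:
    # Equation (4): the cell delay scales with the linear cell
    # dimension sqrt(2**n) = 2**(n/2); the factor n models the
    # growing parallelism inside the cell.
    return n / (t_cell * 2 ** (n / 2) + t_routing)

ns = range(1, 19)
n_space = min(ns, key=area_lut)                            # area minimum
n_speed = max(ns, key=d_lut)                               # throughput maximum
n_ratio = max(ns, key=lambda n: d_lut(n) / area_lut(n))    # best ratio
```

Under these assumptions the sketch gives n_space = 2, n_speed = 5, and n_ratio = 3: the area minimum at two inputs, the throughput maximum inside the four-to-six range, and the best throughput/space compromise at three inputs, as in Fig. 1.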


Figure 1. Modelling a LUT-based FPGA for space efficiency and throughput (curves 1/Space, Throughput, and Throughput/Space in relative units over block input sizes 0–18)

2.2 Complex Programmable Logic Devices

The most obvious characteristic of the logic cells inside the 'fine-grained' FPGAs is the look-up-table structure. Every logic function of maximum input size N is realisable inside it, but the exponential growth of the cell size with N also significantly limits the input vector size to comparatively low numbers. Conversely, we should also consider the 'coarse-grained' CPLDs. They mostly incorporate a sum-of-products structure with programmable ANDs and fixed ORs (terms). This is the second architecture to be modelled in this paper, for which we can find expressions for silicon area use and computational throughput equivalent to those for the FPGA structures. The silicon area as the sum of logic and routing areas inside a CPLD is given in equation (4). To a first approximation, the number of programmable AND connections per cell increases linearly with N, because each input is needed in both inverted and non-inverted form. The number of cells is modelled to decrease with 1/N, because cells with more inputs can compute in a more complex manner (increasing parallelism), so no net effect occurs. The cell size additionally grows slightly with (1 + k_internal · N)² to model the increasing feedback inside the cell, which remains as a second approximation. These feedback lines are an important detail of the schematics and are often used in CPLDs that have more than one output per logic block. Finally, the routing size per cell stays constant. Because the number of cell-external routings decreases (modelled with 1/N), the complete model for the silicon area can be written as:

A_PAL(N) = A_logic,PAL,0 · k_logic,0 · (1 + k_internal · N)² + A_routing,PAL,0 · k_routing,PAL,0 / N   (4)

Equation (4) describes the area needed for the PAL structure, but normally only part of the logic is actually used. If the input size is increased, the probability of unused additional hardware also increases. In accordance with [1], the model must be extended by an additional term (1 + k_eff · N), where the factor k_eff describes the probability of unused parts within the logic cell:


A_PAL,eff(N) = A_logic,PAL,0 · k_logic,0 · (1 + k_internal · N)² · (1 + k_eff · N) + A_routing,PAL,0 · k_routing,PAL,0 / N   (5)

For k_eff, relatively large values can be estimated (example in Fig. 2: k_eff = 0.05). This matches the practical observation of device manufacturers that approximately 80% of all functional signal descriptions have only 3–5 terms. The second interesting value for CPLDs is their throughput D. As described before, the routing time per cell use stays constant, and for the PAL (programmable array logic) structure the same applies per used logic cell. With increasing N, the number of cell/routing uses decreases with 1/N, caused by the increasing parallelism. The throughput therefore results in the model

D_PAL = N / (T_cell,PAL,0 + T_routing,PAL,0)   (6)

Equation (6) shows a linear dependence of D on N, but an inverse proportionality to the sum of the delay times of the cell and routing architecture. For high throughput it is therefore essential to realise short delay times within the cells, combined with a large number of cell inputs. Figure 2 visualises this model for the PAL structure. A minimum of silicon area per logic block is used for blocks with 5–10 inputs, while the throughput grows linearly with increasing block input size. Accordingly, we get a wide maximum range for the throughput/space ratio at block sizes of around 20–60 inputs, assuming that all hardware is used for the logical functions, i.e. k_eff = 0. In the more realistic case that the hardware is effectively used with a probability of 95 percent (k_eff = 0.05 for 5% unused hardware), the maximum of the throughput/space curve moves rapidly to lower block input sizes, to a typical range of 12–32 inputs, with a slight decrease for higher numbers of inputs. A number of 18–32 inputs is used by most manufacturers of CPLDs, so k_eff = 0.05 is a reasonable approximation. The calculated curves also show that the optimum throughput/space ratio for CPLDs depends strongly on the effective use of the logic functions, which requires well-optimised fitter software.
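A corresponding sketch for the PAL model, equations (5) and (6), uses the same kind of illustrative constants (unit logic factor, routing factor 4, k_internal = 0.05, T_cell = 1, T_routing = 4 are all assumptions, not values from the paper):

```python
def area_pal_eff(n: int, k_internal: float = 0.05,
                 k_eff: float = 0.05) -> float:
    # Equation (5): the AND-array cell grows only polynomially with n
    # (no 2**n term); (1 + k_eff*n) models unused hardware.
    return (1 + k_internal * n) ** 2 * (1 + k_eff * n) + 4.0 / n

def d_pal(n: int, t_cell: float = 1.0, t_routing: float = 4.0) -> float:
    # Equation (6): throughput grows linearly with n over the
    # constant per-use cell and routing delays.
    return n / (t_cell + t_routing)

ns = range(1, 65)
n_eff = max(ns, key=lambda n: d_pal(n) / area_pal_eff(n))               # k_eff = 0.05
n_ideal = max(ns, key=lambda n: d_pal(n) / area_pal_eff(n, k_eff=0.0))  # k_eff = 0
```

With these assumptions, n_eff lands at 12 and n_ideal near 23, reproducing the qualitative statement above: full utilisation (k_eff = 0) favours blocks in the broad 20–60 range, while 5% unused hardware moves the optimum down towards the 12–32 range of commercial devices.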

Figure 2. Modelling a PAL-based CPLD for space efficiency and throughput (curves 1/Space, Throughput, Throughput/Space, and Throughput/Space (eff.) in relative units over block input sizes 0–70)


2.3 Discussion of the PLD-related Models

Comparing the results shown in Fig. 1 and Fig. 2, we can theoretically justify that FPGAs and CPLDs each have their own specific optimum for the best use of available silicon area at good speed, depending on the block input size. FPGAs show a sharp maximum for 3–4 inputs, in accordance with available devices (for example Actel ProASIC [4]: N = 3, Xilinx Virtex-II [5]: N = 4); any additional growth of N beyond this must be motivated by other reasons. CPLDs, in contrast, have a broad, unspecific maximum of usable resources, also in accordance with available devices (Lattice ispLSI2000 [6]: N = 18, Lattice ispLSI5000 [7]: N = 64). This confirms the rule that devices with large blocks work at an optimum only in specific cases where the application demands these sizes. A further requirement is that the fitter software must support very good utilisation of the available hardware resources. The averaged effective throughput/space ratio shows a maximum at block sizes of N = 16–32, where most commercial devices have their specific values.

3 Modelling Block-Based Reconfigurable Architectures

The last chapter of this paper discusses the modelling of coarse-grained architectures with cell structures including arithmetical and logical operations and programmable routing structures. These architectures are similar to microprocessor architectures, providing equivalent operations and supporting reconfigurable computing. One proposal was published in [8] and is called the Universal Configurable Block/Machine (UCB/UCM) concept. Figure 3 shows the principal structure of a UCB. The UCB consists of units for arithmetical or logical operations (AU) and compare operations (CoU), which are combined using a configurable network of multiplexers (MUX_C, MUX_D). The arithmetical units (AU) perform one or a few (then configurable) computational operations inside the UCB, for example adding two integer values; the result at the output of such a sub-unit is available for use inside the whole network. Compare units (CoU) generate conditional bits inside the network for the conditional execution of computations or moves; in most cases these units are configurable with respect to the comparison operation that defines the predicate flag. Multiplexers are used for routing within the data path: while multiplexers of type MUX_C route data sources such as registers or outputs of arithmetical units to the inputs of following AUs, CoUs, or RAM memory, the MUX_D multiplexers route data outputs of arithmetical units to register inputs. Demultiplexers route the results of compare units as control bits to the corresponding AUs. This general structure of functional blocks works as a very complex universal PLD with very high data performance, which depends on the data bus width (the number of inputs per logical block inside the UCB) and the average signal runtime as a function of the number of subunits. The important point for this paper is that the operations are assumed not to be configurable during runtime but fixed (note that some might be configurable to provide a few similar operations such as addition/subtraction). In this case, the input size N is no longer of interest, but rather the number of functional units (blocks) inside one UCB, denoted B. Two size-dependent effects of B can be observed. Increasing B at first decreases the computing time T_Block,0 through extended internal parallelism: more internal logical functions can be performed at the same time within the UCB. The second effect works against this: if the UCB block size is increased, the growing number of elements inside the UCB causes longer signal delay times due to the longer average distances between the units. This is modelled by the factor (1 + x·B) in equation (7), in accordance with [1], where x describes the increasing runtime. A typical value for x is approximately 3% per block [1], although an exactly estimated value is not yet known. The resulting formula for the throughput D is presented in equation (7). As long as no data dependences must be checked during runtime, this model seems to be sufficient.
Inside the UCB model, all data dependences are checked either at compile time or during the first execution phases, when the configuration is generated by instruction-flow translation. On the other hand, we also have to consider that data dependences affect the effective delay time not of a single configuration, but of the resulting sequence for one algorithm. Increasing the UCB block size might be feasible by incorporating (and allowing) data forwarding inside one configuration, and this truly results in increased delays. This effect, either virtual (through more configurations) or real (through allowed data forwarding inside one configuration), is modelled by (1 + y·B), where y represents a factor for the increase of the delay time as a result of possible data hazards. Typical values for y are also in the range of 3% per block. The complete model for the throughput of a UCB can therefore be written as


D = B / ((1 + x · B) · (1 + y · B) · T_Block,0)   (7)

Figure 3. Structure of a Universal Configurable Block

The dependence of the throughput D on the number B of units inside the UCB is shown in Figure 4. The throughput generally shows a broad, unspecific maximum and varies strongly with changes of the factors x and y. If we assume values of 3% for both factors, we get a very broad maximum around 32 units per UCB, but the throughput varies only slightly in a range of 16 to 64 units per UCB. With increasing values of x and y, the maximum of the throughput moves to lower numbers of UCB units, which reduces the computational speed of the device. It is therefore essential to optimise the UCB architecture for short delay times and to avoid the possibility of data hazards. Furthermore, it should be discussed how the routing between the AUs inside the UCB can be optimised to reduce signal delay times caused by the storage of computed data in registers. On the other hand, it is not useful to create UCBs with very large numbers of blocks, because newly inserted blocks must have a useful logical or arithmetical function to minimise the overall power consumption per algorithm of the device. The blocks should therefore incorporate a maximum of functionality, but the price to be paid is an increasing probability of data dependences, which limits the complexity of the UCB structure.
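Equation (7) can be checked with a short numerical sketch. Setting dD/dB = 0 additionally gives the closed form B* = 1/sqrt(x·y), which is our own observation under this model, not a result stated in the paper; T_Block,0 = 1 is an illustrative assumption:

```python
from math import sqrt

def d_ucb(b: int, x: float = 0.03, y: float = 0.03,
          t_block: float = 1.0) -> float:
    # Equation (7): internal parallelism (numerator b) against longer
    # wires (1 + x*b) and data-hazard penalties (1 + y*b).
    return b / ((1 + x * b) * (1 + y * b) * t_block)

b_opt = max(range(1, 129), key=d_ucb)   # numeric optimum, here 33
b_star = 1 / sqrt(0.03 * 0.03)          # analytic optimum, ~33.3
```

The maximum is indeed broad: with x = y = 0.03, the throughput at B = 16 and at B = 64 stays within about 15% of the optimum, matching the 16–64 plateau visible in Fig. 4.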

Last but not least, UCBs have to be optimised for maximum throughput, depending on their delay times and inner data structures.

4 Summary and Outlook

This paper has presented three different models for configurable and reconfigurable devices. The intention was to show how the overall behaviour of these architectures depends on a few typical parameters. These results are essential for deriving implications for the development of new device architectures. All discussed architectural structures have their own advantages and disadvantages with respect to optimum block input size or number of blocks. It has been shown that the optimum values of silicon space efficiency and throughput in FPGAs and CPLDs differ from each other, so in principle optimisations for either value are possible (but mostly mutually exclusive). Most commercially available devices show structures optimised for a compromise between both important runtime parameters. The universal configurable block/machine will be a very interesting architecture for future devices, because it has a high throughput over a wide range of unit numbers inside the UCB.


Next steps in the further development of the models are, first, scaling the models for comparisons between known architectures and, second, finding new, powerful architectures in order to find overall optimum values for application-class-specific approaches.

Figure 4. Modelling the throughput of the UCB as an example of reconfigurable, microprocessor-related devices [8] (curves for x = y = 0.03, 0.05, and 0.10 in relative units over the number of units inside the UCB, 0–70)

References

[1] M. J. Flynn, P. Hung, K. W. Rudd: "Deep-Submicron Microprocessor Design Issues". IEEE Micro, 19(4): 11–22, 1999.
[2] P. Chow, S. O. Seo, J. Rose, K. Chung, G. Páez-Monzón, I. Rahardja: "The Design of an SRAM-Based Field-Programmable Gate Array – Part I: Architecture". IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 7(2): 191–197, 1999.
[3] P. Chow, S. O. Seo, J. Rose, K. Chung, G. Páez-Monzón, I. Rahardja: "The Design of an SRAM-Based Field-Programmable Gate Array – Part II: Circuit Design and Layout". IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 7(3): 321–330, 1999.
[4] Actel Corporation: ProASIC 500k.
[5] Xilinx Corporation: Virtex-II.
[6] Lattice Semiconductor: ispLSI2000.
[7] Lattice Semiconductor: ispLSI5000.
[8] C. Siemers: "The Universal Configurable Block/Machine – An Approach for a Configurable SoC Architecture". Proceedings of the International Conference on Engineering of Reconfigurable Systems and Algorithms (ERSA'02), Las Vegas, Nevada, June 24–27, pp. 83–89, 2002.
