IEICE TRANS. ELECTRON., VOL.E92–C, NO.4 APRIL 2009


PAPER

Implementation of a Partially Reconfigurable Multi-Context FPGA Based on Asynchronous Architecture
Hasitha Muthumala WAIDYASOORIYA†a), Student Member, Masanori HARIYAMA†, Member, and Michitaka KAMEYAMA†, Fellow

SUMMARY This paper presents a novel architecture to increase the hardware utilization in multi-context field-programmable gate arrays (MC-FPGAs). Conventional MC-FPGAs use dedicated tracks to transfer context-ID bits. As a result, the hardware utilization ratio decreases, since it is very difficult to map different contexts area-efficiently. The dedicated tracks also increase the context switching power and the area and static power of the context-ID tracks. The proposed MC-FPGA uses the same wires to transfer both data and context-ID bits from cell to cell. As a result, programs can be mapped area-efficiently by partitioning them into different contexts. An asynchronous multi-context logic block architecture that increases the processing speed of the multiple contexts is also proposed. The proposed MC-FPGA is fabricated using 6-metal 1-poly CMOS design rules. The data and context-ID transfer delays are measured to be 2.03 ns and 2.26 ns respectively. We achieved a 30% processing time reduction for the SAD-based correspondence search algorithm.
key words: DPGA, multi-context, asynchronous FPGA

Manuscript received June 11, 2008. Manuscript revised November 23, 2008. † The authors are with the Graduate School of Information Sciences, Tohoku University, Sendai-shi, 980-8579 Japan. a) E-mail: [email protected]. DOI: 10.1587/transele.E92.C.539

1. Introduction

Dynamically reconfigurable devices [1], [2] can execute very large applications even when their descriptive size is larger than the available hardware resources. Since most of the computations do not run simultaneously, large applications can be broken into several small sub-programs. These sub-programs can share the same hardware resources by scheduling them into different time slots. The multi-context FPGA (MC-FPGA) [3] is the simplest dynamically reconfigurable device. Figure 1 shows a typical MC-FPGA architecture. In MC-FPGAs, each program or application is assigned to a separate context. The configuration data that belong to each context are stored in different configuration planes. To execute a particular program, the appropriate context-ID is passed to the logic and interconnect elements, and the configuration data in the relevant configuration plane are selected according to the context-ID. Figure 2 shows a multi-context LUT (MCLUT), which is the main component of a multi-context logic block (MCLB). It has separate sets of configuration bits for each context. The multiplexers at the top in Fig. 2 select the appropriate set of configuration bits according to the context-ID. Then the multiplexer at the bottom selects the relevant configuration bit according to the input data, similar to a single-context FPGA. The right side of Fig. 1 shows a multi-context switch (MC-switch). It has a multiplexer to select the relevant configuration bit from multiple configuration planes according to the context-ID. Figure 3 shows the graphical representation of the context planes.
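As a rough illustration of this selection mechanism, the following minimal Python sketch models an MCLUT as a two-stage lookup: the context-ID picks a configuration plane, and the LUT inputs pick a bit within that plane. This is a behavioral sketch only, not the authors' circuit; names such as mclut_output and config_planes are hypothetical.

```python
# Behavioral sketch of the MCLUT in Fig. 2 (illustrative only).

def mclut_output(config_planes, context_id, inputs):
    """config_planes: one list of 2**k configuration bits per context.
    context_id:    index of the active context (selects a plane).
    inputs:        tuple of k input bits (selects a bit within the plane)."""
    plane = config_planes[context_id]       # top multiplexers in Fig. 2
    index = 0
    for bit in inputs:                      # bottom multiplexer in Fig. 2
        index = (index << 1) | bit
    return plane[index]

# Example: a 2-input LUT with 4 contexts programmed as AND, OR, XOR, NAND.
planes = [
    [0, 0, 0, 1],   # context 0: AND
    [0, 1, 1, 1],   # context 1: OR
    [0, 1, 1, 0],   # context 2: XOR
    [1, 1, 1, 0],   # context 3: NAND
]
assert mclut_output(planes, 2, (1, 1)) == 0   # XOR(1, 1) = 0
assert mclut_output(planes, 0, (1, 1)) == 1   # AND(1, 1) = 1
```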

Fig. 1 Structure of a typical MC-FPGA.
Fig. 2 Multi-context LUT (MCLUT).
Fig. 3 Graphical representation of the context planes.

2. Problems in Typical MC-FPGAs

2.1 Problems Related to Power Consumption

Figure 4 shows the context-ID distribution in a typical MC-FPGA. All the logic blocks and MC-switches share the same context-ID signal. As a result, when the context-ID changes, the whole chip changes to a different context. In order to make this change faster, the context-ID tracks are designed quite similarly to a typical clock tree. Moreover, in a 4-context FPGA, two dedicated context-ID tracks are needed to distribute the 2 context-ID bits.


Fig. 4 Context-ID distribution in a typical MC-FPGA.

Similarly, when the number of context-ID bits doubles, an extra context-ID track is needed. These tracks occupy a large area and consume a large amount of power in both the static and dynamic states. According to [4], [5] and [6], more than 20% of the power consumption in FPGAs is due to the clock distribution and 60% is due to the interconnection network. In addition, dynamic power is consumed by the context switching and static power is consumed by the context-ID distribution. If we assume that the structure of a context-ID track is quite similar to that of a clock tree, we can save a considerable share of power by removing the context-ID tracks.

Fig. 5 Ex.1: Wasted configuration memory.
Fig. 6 Ex.2: Overlapped configuration memory.

2.2 Problems Related to Hardware Utilization

We discuss the low utilization of the logic and interconnect elements and of the configuration memory using the following examples.

Example 1: Let us consider two mutually independent programs assigned to two contexts as shown in Fig. 5(a). We assume that the configuration memory needed by both DFGs is very small compared to the memory available in a single configuration plane. If we map the two contexts onto two configuration planes as shown in Fig. 5(b), a large amount of configuration memory is wasted, together with the corresponding logic and interconnection area. Not only the area but also the execution time is large. Since the two DFGs are mapped to two different contexts, the second one executes after the execution of the first one. As a result, it takes more time than the schedule given in Fig. 5(a). This problem can be solved by assigning both DFGs to a single context or by mapping the two contexts onto a single configuration plane. However, it is difficult to assign both DFGs to a single context when their execution times are different. Mapping two contexts onto a single configuration plane is impossible, since the context-ID of the whole chip changes simultaneously, so a single configuration plane can hold the configuration of only a single context.

Example 2: Redundant configuration data in different contexts decrease the configuration memory utilization. Figure 6(a) shows a large DFG partitioned into two contexts. However, part of the DFG is the same in both context 1 and context 2. As a result, the same configuration data are included in two configuration planes after mapping, as shown in Fig. 6(b). Therefore, part of the configuration area is wasted.

Fig. 7 Ex.3: Logic and interconnection utilization.

Example 3: In the CDFG (control data flow graph) shown in Fig. 7(a), DFG2 or DFG3 is selected depending on the condition generated by DFG1. Therefore, we have to prepare the circuits of both DFG2 and DFG3 as shown in Fig. 7(b), even though only one is needed at a time. The unused circuit consumes static power and occupies chip area. This problem could be solved by mapping DFG2 and DFG3 onto different contexts. However, this is very difficult since the execution time of each context is data dependent and cannot be determined offline.

2.3 Proposed Context Partitioning

In typical MC-FPGAs, contexts are executed sequentially so that the same resources are reused in different time slots. However, as explained in this section, smaller contexts increase the logic and interconnect area, and redundant contexts increase the configuration memory.


Fig. 8 Solution for Ex.1.
Fig. 9 Solution for Ex.2.
Fig. 10 Solution for Ex.3.

Therefore, our idea is to run contexts both sequentially and in parallel: smaller but mutually independent contexts run together in parallel, while dependent ones run sequentially. However, to do this effectively, the MC-FPGA must be partially reconfigurable, so that a new context can be assigned to a part of the circuit without interrupting the contexts that are already running. A part of an application is called a sub-context. Unlike a context, more than one sub-context can be mapped to a single configuration plane. A sub-context is selected by a local context-ID. Unlike a context-ID, a local context-ID is distributed only in the part of the chip that is required by the particular sub-context. Figures 8, 9 and 10 show the solutions to the problems discussed earlier in this section. In Fig. 8, two DFGs are assigned to two sub-contexts and mapped onto a single configuration plane. Since we are using local context-IDs, it is possible to map two sub-contexts with different execution timings onto a single configuration plane. The remaining configuration planes can be used to store other sub-contexts. The execution timings of different sub-contexts can be overlapped. As a result, the schedule in Fig. 8(a) is preserved, unlike the case in Fig. 5. In Fig. 9, the identical parts of the DFGs are assigned to a single sub-context, and the other two parts are assigned to different sub-contexts. As a result, the same configuration data are not stored in different configuration planes. Sub-contexts 1 and 2 are executed until time t1, and sub-contexts 1 and 3 are executed next. Note that the logic and interconnect area corresponding to sub-context 3 must not overlap with the area corresponding to sub-context 1, because in that case different configuration planes would be accessed simultaneously.


In Fig. 10, three DFGs are assigned to three different sub-contexts. Sub-contexts 2 and 3 are mapped onto the same area of the two configuration planes, so that both sub-contexts correspond to the same logic and interconnect area. The local context-IDs of sub-contexts 2 and 3 are internal signals, generated based on the results of sub-context 1. Since the context-ID is simply transferred from cell to cell, it is possible to transfer both internal context-IDs (context-IDs that are generated inside a logic block) and external context-IDs (context-IDs that are fed to the cell array).

3. Previous Works

Concepts similar to flexible context partitioning have been proposed in several papers. PACT XPP [7] uses a run-time partial reconfiguration mechanism, so that only the required resources have to be changed in each reconfiguration. However, the reconfiguration mechanism is very complex and needs dedicated configuration buses. Therefore the overhead for the reconfiguration is very large, which increases the area as well as the power consumption, as explained in Sect. 2. PCA [8] provides a dynamically reconfigurable asynchronous architecture. It consists of two hardware planes: one implements the computations and the other implements the data path and the reconfiguration mechanism. However, this mechanism is based on re-writing SRAMs. As a result, the reconfiguration time is as large as 60 μs for each reconfiguration, so it cannot be used in applications that require fast context switching. WASMII [9] consists of an execution part and a control part; the execution part is an MC-FPGA, and the control part provides the reconfiguration mechanism. The evaluation used both PCA and an FPGA as the execution part, and according to the results, the performance of a normal FPGA is about 15 times higher than that of PCA. Asynchronous architectures, however, increase the performance by nearly 30% compared to synchronous ones. In [10], stripe-based and PE-based asynchronous versions of the PipeRench architecture are discussed. In [11], a very fast reconfiguration mechanism is provided. However, neither of them can execute different contexts in parallel, and the problems discussed in Sect. 2 cannot be solved using these architectures.


4. Proposed MC-FPGA Architecture

We divide the logic and interconnect area into independent clusters to improve the flexibility of the context partitioning, and we feed different context-IDs to different areas to allow parallel execution of different sub-contexts. To achieve this, we use a localized context-ID signal. As shown in Fig. 11, the local context-ID signal is transferred from cell to cell and distributed only in the part of the chip that is required by a particular sub-context. It is possible to have more than one local context-ID on the same chip in different areas simultaneously. The local context-ID signal is guided to the neighboring cells by the context-ID decoders (CIDs). In the proposed architecture, all the applications are partitioned into sub-contexts and accessed by local context-IDs.

4.1 Decoder-Based Interconnection Network

We use the interconnect architecture proposed in [12]. It is based on the fact that most of the configuration data in MC-FPGAs do not change when the contexts are switched [13]. The basic element of the interconnection network is called the switch element (SE). Figure 12(a) shows the structure of an SE. It has two inputs: one is fixed (D) and the other is variable (U). The fixed input D is used to implement context-ID-independent configurations. The variable input U is used to implement configuration patterns that depend on a single context-ID bit. Figure 12(b) shows the structure of a switch block (SB). It has two SEs, horizontally and vertically. This structure allows us to increase the routing options and to create the context-ID decoders (CIDs), which we use extensively in our MC-FPGA. Figure 13 shows how to implement configurations that are independent of the context-IDs. Since the configuration bits do not change with the context-ID, the fixed input of an SE is used. The variable input of an SE is used to implement configurations that depend on a single context-ID bit. In Fig. 14(a), the configuration depends only on C1. Therefore the context-ID bit C1 is sent to the variable input of an SE as shown in Fig. 14(b).

However, when a configuration depends on more than a single context-ID bit, it cannot be implemented using a single SE. In this case, the context-ID bits are decoded into a control signal, which is then sent to the variable input of an SE. Figure 15 shows an example in which the configuration depends on both context-ID bits, C0 and C1. According to Fig. 15(a), LB1 and LB2 are connected only when both context-ID bits are 1. Therefore, a set of SEs is connected to generate the control signal [U = C0 · C1] from the context-IDs C0 and C1, and this control signal is sent to the variable input of an SE. All the other SEs in the path between LB1 and LB2 are programmed as always connected. As a result, the connection pattern shown in Fig. 15(a) can be implemented. In this architecture, each sub-context has a different critical path delay. As a result, the maximum frequency varies for each context. Since each configuration plane corresponds to the same logic and interconnect area, it is impossible to use different clock frequencies for different sub-contexts simultaneously unless there are several clock distribution networks. It is very difficult to use several clock distributions, since the power consumption due to the clock tree is quite high according to [4], [5] and [6].
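To make the role of the fixed and variable SE inputs concrete, the following small Python sketch (a simplified model under our own assumptions, not the circuit of [12]) shows how a connection that must follow U = C0 · C1 is obtained by first decoding the two context-ID bits and then driving the variable input of the SE on the LB1-LB2 path.

```python
# Simplified model of a switch element (SE): it connects its track either
# according to the fixed configuration bit D (context-independent) or
# according to the variable control input U (context-dependent).

def se_connected(use_fixed, d, u):
    return bool(d) if use_fixed else bool(u)

def lb1_lb2_connected(c0, c1):
    # Fig. 15: other SEs decode the context-ID bits into U = C0 AND C1,
    # and this decoded signal drives the variable input of one SE on the
    # path; the remaining SEs on the path are programmed as always on.
    u = c0 & c1
    return se_connected(use_fixed=False, d=0, u=u)

assert lb1_lb2_connected(1, 1) is True     # connected only when C0 = C1 = 1
assert lb1_lb2_connected(1, 0) is False
```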

Fig. 11 Local context-ID distribution.
Fig. 12 Switch block structure.
Fig. 13 Context-ID independent configuration.
Fig. 14 Configurations depending on a single context-ID.


Fig. 15 Configurations depending on several context-IDs.

When several sub-contexts are executed at the same time, the worst-case delay is used to determine the clock frequency for all the sub-contexts. As a result, the faster sub-contexts also run at a slower speed. However, using an asynchronous architecture, different sub-contexts can run at different speeds, increasing the overall speed.

4.2 Data and Context-ID Encoding Scheme

Asynchronous encoding schemes can be classified into bundled-data encoding and delay-insensitive encoding. Figure 16 shows the overall architecture of bundled-data encoding. It splits the data and the request into separate wires. The data are sent as in a synchronous architecture, where n bits of data need n wires. However, an explicit delay is inserted into the request signal to ensure that the request is received only after the valid data are available. Therefore, it requires correct delay values to operate reliably. This method needs only (n + 2) wires to send n bits of data and can be implemented using a relatively small area. Asynchronous FPGAs based on bundled-data encoding are proposed in [14] and [8]. However, in FPGAs it is not easy to always meet the delay constraint, since the data path is programmable. Therefore we choose delay-insensitive encoding, since it does not require any delay insertion. In delay-insensitive encoding, two wires are used for each data bit as the data and the request signals, unlike bundled-data encoding where the request signal is shared among all the data bits. There are many delay-insensitive encodings, such as 4-phase dual-rail encoding and level-encoded 2-phase dual-rail (LEDR) encoding [15]. The former is used in [16] and the latter is used in [17].

Fig. 16 Bundled-data encoding.
Fig. 17 LEDR encoding.
Fig. 18 Data transfer based on LEDR encoding.

The latter gives a higher throughput than the former. Therefore, we use LEDR encoding for the proposed MC-FPGA. Figure 17 shows the LEDR-encoded signals (V, R) and their decoded values, the data and the phase. Figure 18 shows the data transfer based on LEDR encoding. The data bit equals V and the phase equals the XOR of V and R (V ⊕ R). Since the same tracks are used to transfer both the data and the context-IDs, we need another signal to separate the context-ID bits from the data bits. For this purpose, a new signal denoted by C is used. In this paper, the term "CVR encoding" is used for this encoding scheme, since it requires the three signals C, V and R to transfer a data bit or a context-ID bit. The encoded signals and their decoded values are shown in Fig. 19. The phase equals C ⊕ V ⊕ R and the data bit equals V. A change in the phase indicates that a new bit has been received. If C is "0," the current bit is a data bit; otherwise it is a context-ID bit. An example of a bit-serial data and context-ID transfer is shown in Fig. 20. After the first three data bits [0 0 1] are received, C changes from "0" to "1," indicating that the next bit is a context-ID bit. Then the context-ID bits are received one after another. In this example, the received context-ID equals 3 (011). After all three context-ID bits are received, C changes from "1" to "0," indicating that the next bit is a data bit.
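The following Python sketch illustrates the CVR decoding rule described above (phase = C ⊕ V ⊕ R, data bit = V, and C distinguishes data bits from context-ID bits). It is a behavioral illustration only; signal timing, handshaking and the receiver hardware are not modelled, and the function name is ours.

```python
# Decode a serial stream of (C, V, R) triples into data bits and
# context-ID bits, following the CVR rules of Fig. 19 and Fig. 20.

def decode_cvr_stream(triples):
    data_bits, context_bits = [], []
    prev_phase = None
    for c, v, r in triples:
        phase = c ^ v ^ r              # phase = C xor V xor R
        if phase == prev_phase:
            continue                   # no phase change -> no new bit yet
        prev_phase = phase
        (context_bits if c else data_bits).append(v)   # data bit = V
    return data_bits, context_bits

# Example modelled on Fig. 20: three data bits [0, 0, 1], then a 3-bit
# context-ID (0, 1, 1) = 3, with each new bit toggling the phase.
stream = [
    (0, 0, 0),  # data 0, phase 0
    (0, 0, 1),  # data 0, phase 1
    (0, 1, 1),  # data 1, phase 0
    (1, 0, 0),  # context-ID bit 0, phase 1
    (1, 1, 0),  # context-ID bit 1, phase 0
    (1, 1, 1),  # context-ID bit 1, phase 1
]
data, ctx = decode_cvr_stream(stream)
assert data == [0, 0, 1] and ctx == [0, 1, 1]   # context-ID = 3 (011)
```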


Fig. 19 CVR encoding.
Fig. 20 Data transfer based on CVR encoding.
Fig. 21 Block diagram of a logic block.
Fig. 22 Architecture of the multi-context LUT.
Fig. 23 Structure of the decoder-11.
Fig. 24 The mechanism for controlling the context switching.

4.3 Architecture of the Logic Block

The block diagram of a logic block (LB) in the proposed MC-FPGA is shown in Fig. 21. The proposed LB has three main components:
1. Receiver
2. Transmitter
3. Multi-context look-up table (LUT)
The receiver circuit receives the C, V, R values from the previous LBs. When both input C values are "0," the receiver sends the V, R values to the LUT.


to “1,” the context-ID bits are stored in the shift register in the receiver. After a context-ID bit is saved in the shift register, the receiver sends the acknowledge (ACK) signals to the previous LBs, indicating it is ready to receive another bit. After all the context-ID bits are received, the transmitter starts sending the appropriate V, R values of the context-ID bits to the next LBs. The multiplexers select the LUT output when C is “0” and the transmitter output when C is “1.” The control signal of the multiplexer is delayed for one cycle, so that it can select the data and the context-ID accurately as shown in Fig. 20. The architecture of a 2-input-8-context LUT is shown in Fig. 22. It has 4 memory bits per each context to function as a programmable 2-input logic gate. When the appropriate configuration plane is selected according to the contextID bits, the relevant memory bits are passed to the decoders. The structure of the decoder-11 in Fig. 22 is shown in Fig. 23. When the phases of both inputs ([Va, Ra] and [Vb, Rb] ) are “0,” both V and R are equal to the memory bit M11, so that the output phase equals to the input phase after the calculation. Similarly, when the phase is “1,” V equals to M11 and R equals to M11. Note that, when the phases are different, both V and R become high-impedance, so that the values in the latches are preserved. Also note that, in every case, both inputs have the same C value. If C is different, the inputs to the LUT will be blocked by the receiver, so that the


This multi-context LUT is a modification of the single-context LUT proposed in [17]. We use a sequencer to control the timing of the context-IDs that are fed to the cell array to switch the contexts. The block diagram of the sequencer is shown in Fig. 24. It contains a few counters and a control circuit. This mechanism is very similar to the one used in typical MC-FPGAs. Therefore, the area overhead due to the sequencer is negligible.

5. Evaluation

5.1 Circuit Evaluation

The performance of the proposed MC-FPGA is compared with that of the typical synchronous MC-FPGA shown in Fig. 1 under a 90 nm CMOS technology using HSPICE circuit simulation. For the comparison, the following assumptions are made. The area of the logic blocks is 30% of the total area of a typical MC-FPGA; the rest of the area is occupied by the switch blocks. Using this assumption, we calculated the ratio between the number of logic blocks and switch blocks, and the same ratio is applied to the proposed MC-FPGA. Therefore, a single cell contains a logic block and switch blocks, and an equal number of switch blocks is included in the cells of both the proposed and the typical MC-FPGAs. The frequency of the typical MC-FPGA and the data rate of the proposed MC-FPGA are set to 100 MHz. The area, power and peak current due to the clock tree and the context-ID tracks are not considered in this evaluation. The proposed switch block is smaller than the typical 8-context switch, so the interconnect area is smaller than that of the typical architecture, as shown in Table 1. The static power is also small, since the number of transistors in the proposed SB is smaller than that of a typical MC-switch. The total power is calculated for a single cell that contains a logic block and switch blocks. The area of the proposed LB is about 2.5 times larger than that of the typical multi-context LB, and its static power is also larger. Since LEDR data transfer is used, two wires are needed to transfer a single bit; as a result, the area is approximately 2 times larger than that of a synchronous LB. We also use Muller C-elements to detect the changes in the data, and shift registers are included in the LBs to store the context-ID bits.

Even though the area of a single cell is large, a higher throughput can be obtained using a smaller number of cells. As a result, the processing time per unit area is smaller in the proposed MC-FPGA than in the typical one. Similarly, the static energy is also reduced due to the processing time reduction. This is explained in detail in Sect. 5.2 using a mapping example. The peak current and the context switching power are calculated by assuming that only 50% of the circuit is switched. Since only half of the circuit is reconfigured in the proposed MC-FPGA, the peak current and the context switching power are low. In the typical MC-FPGA, the whole chip has to be reconfigured when the contexts are switched, so the context switching power is large. Figure 25 shows the relationship between the context switching power and the percentage of the circuit switched. According to the results, if less than 77% of the circuit is switched, the proposed architecture provides better performance in terms of peak current and context switching power. A test chip is fabricated using 90 nm 6-metal 1-poly CMOS design rules. Figure 26 shows the chip micrograph. The designed chip has a 10 × 10 cell array.
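As a rough cross-check (our own back-of-the-envelope estimate, not part of the paper's evaluation), the break-even point in Fig. 25 can be approximated from Table 1 by assuming that the context switching power of the proposed architecture scales linearly with the fraction of the circuit that is switched, while the typical MC-FPGA always reconfigures the whole chip.

```python
# Rough estimate of the break-even switching fraction under a linear
# scaling assumption (ours); the paper's value comes from HSPICE
# simulation and is reported as 77% in Fig. 25.

p_proposed_at_50 = 0.68    # Table 1: normalized switching power per cell at 50% switched
typical_power = 1.0        # typical MC-FPGA always switches 100% of the chip

slope = p_proposed_at_50 / 0.5          # power per unit switching fraction
break_even = typical_power / slope      # fraction where the two are equal
print(f"estimated break-even fraction: {break_even:.0%}")   # ~74%, close to the reported 77%
```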

Fig. 25 Context switching power vs. percentage of the circuit switched.
Fig. 26 Chip micrograph.

Table 1 Proposed MC-FPGA vs. typical MC-FPGA (number of contexts = 8). All values are normalized to the typical architecture.

                                                Interconnects         Logic Blocks          Total
                                                Typical  Proposed     Typical  Proposed     Typical  Proposed
Normalized area per cell                        1        0.78         1        2.55         1        1.28
Normalized peak current per cell                1        0.49         1        0.81         1        0.57
Normalized context switching power per cell     1        0.43         1        0.72         1        0.68
Normalized static power per cell                1        0.84         1        4.68         1        1.93


Fig. 27 Measured data transfer delay for 100 cells.
Fig. 28 Optical flow extraction.

Table 2 Performance of the proposed MC-FPGA.
Number of contexts: 8
Number of cells: 100 (10 × 10)
Cell area: 53 × 62 μm²
Minimum context-ID transfer delay (for 3 context-ID bits): 2.26 ns
Minimum data transfer delay: 2.03 ns

Figure 27 shows the measured delay for 100 cells connected in series. Table 2 shows the other measured performance figures of the chip. The context switching delay is only 2.26 ns, compared with other architectures such as [7] where the delay is over 60 ns.

Fig. 29 DFG of the SAD calculation.

5.2 Mapping Example

We compare the performance of the proposed MC-FPGA with that of the typical MC-FPGA using the correspondence search algorithm, which is widely used in optical flow extraction. In optical flow extraction, corresponding pixels between two images taken at time t and t + δt are searched for, as shown in Fig. 28. To find a corresponding pixel, a reference window for a particular pixel in the image at time t and a search area in the image at time t + δt are considered. Different candidate windows are selected from the search area, and the SAD (sum of absolute differences) with the reference window is calculated. The DFG of the SAD calculation for a reference and a candidate window with n pixels each is shown in Fig. 29. The more similar the reference window is to the candidate window, the smaller the SAD becomes. Therefore, the candidate window with the minimum SAD is selected as the window corresponding to the reference window. To reduce the computational amount of the SAD calculation, an effective algorithm based on partial SAD calculations is proposed in [18]. The CDFG of the algorithm is shown in Fig. 30. In this algorithm, the SAD calculation is divided into m partial SAD calculations. At the beginning, the partial SAD of n/m pixels of the reference and candidate windows is calculated and added to the current SAD. If this value is larger than the current minimum SAD, the rest of the calculations are terminated. Otherwise, the partial SAD of another n/m pixels is calculated. If the calculation is not terminated in the middle of the process, it continues until the SAD of all the pixels has been calculated.

Fig. 30 CDFG for the proposed searching algorithm.

If the SAD for all n pixels is smaller than the current minimum SAD, the current minimum SAD is replaced by the SAD of the current window. Using this method, most of the unnecessary calculations are eliminated. After evaluating 900 images that are used for vehicle recognition, the computational amount is reduced to 54%. The specifications of the images are shown in Table 3.
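A compact software sketch of the early-termination search in Fig. 30 is given below. It is our own Python rendering under simplifying assumptions (windows given as flat pixel lists, m evenly sized partial SADs, no parallel sub-contexts); the function name is hypothetical.

```python
# Partial-SAD correspondence search with early termination (cf. Fig. 30).

def partial_sad_search(reference, candidates, m):
    """reference: list of n pixel values; candidates: list of candidate
    windows (each a list of n pixels); m: number of partial SADs."""
    n = len(reference)
    step = n // m
    best_sad, best_index = float("inf"), None
    for idx, candidate in enumerate(candidates):
        sad = 0
        for start in range(0, n, step):
            # accumulate the partial SAD of the next n/m pixels
            sad += sum(abs(reference[i] - candidate[i])
                       for i in range(start, min(start + step, n)))
            if sad > best_sad:
                break                  # early termination: try the next window
        else:
            # all partial SADs completed without exceeding the minimum
            best_sad, best_index = sad, idx
    return best_index, best_sad

# Tiny usage example with 4-pixel windows and m = 2 partial SADs.
ref = [10, 20, 30, 40]
cands = [[12, 19, 33, 41], [90, 90, 90, 90]]
assert partial_sad_search(ref, cands, m=2)[0] == 0
```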


Table 3 Specifications of the input images.
Image resolution: 1600 × 1200
Window size: 20 × 20
Search area: 40 × 40
Number of partial SADs: 10
Gap between 2 candidate pixels: 15

Fig. 31 Structure of a single sub-context.
Fig. 32 Time chart of the correspondence search in the proposed MC-FPGA.

The reduction in computational amount due to the partial SAD calculations is effectively exploited by the proposed architecture. The partial SAD calculations that are separated by conditional branches are assigned to separate sub-contexts. Figure 31 shows the structure of a sub-context. The computations of a sub-context are the absolute difference calculations and the additions. The functions of the cells and the interconnections among the cells are the same for all the sub-contexts. However, the interconnections between the memory and the cells can change for different candidate windows, depending on the memory allocation; in those cases, a new sub-context is assigned. Depending on the result of the conditional branch, a new sub-context is called or the current sub-context continues. Since the number of parallel computations in each sub-context is only n/m (compared with n in Fig. 29), the SADs of m windows are calculated in parallel. Figure 32 shows the allocation of the contexts and their processing times. Different sub-contexts are allocated to different areas of the MC-FPGA. Each sub-context is capable of calculating the SAD of the n pixels of a candidate window and the reference window. However, the calculation is done step by step. In each step, the SAD of n/m pixels is calculated. After each partial SAD calculation, the SAD is compared with the current minimum SAD. If it exceeds the current minimum, context switching occurs and the SAD of another candidate window is calculated. Otherwise, the current sub-context continues. As a result, context switching occurs at different times depending on the pixel data, similar to Example 1 discussed in Sect. 2. The conditional branches are effectively implemented using an internally generated local context-ID, similar to Example 3 in Sect. 2. In this particular example, about 50% of the contexts are switched.


This algorithm cannot be implemented in the typical MC-FPGA using the same number of cells as the proposed one. Since the typical MC-FPGA is not partially reconfigurable, sub-contexts cannot be switched at different times. Moreover, the context switching is pre-scheduled in the high-level synthesis, and altering that schedule depending on the pixel data is impossible. Therefore, the DFG shown in Fig. 29, which calculates the SAD of all the pixels in the reference and candidate windows, is better suited to the typical MC-FPGA. Table 4 shows the performance comparison. The processing time is calculated under the assumption that both the typical and the proposed MC-FPGAs have the same area. The proposed one has fewer cells (logic blocks and switch blocks) than the typical one, since the cell area of the proposed one is 28% larger, as shown in Table 1. However, the total processing time is reduced to 70% due to the reduction of unnecessary calculations in the proposed algorithm. In other words, if the processing time is the same for both MC-FPGAs, the proposed one needs a 30% smaller area than the typical MC-FPGA to implement the algorithm. In this example, the static energy is almost the same as that of the typical MC-FPGA, since the processing time is smaller. Moreover, we have not considered the energy and area due to the clock and context-ID distributions. According to [4], [5] and [6], more than 20% of the power is due to the clock tree. Considering those facts, the proposed MC-FPGA has a small energy consumption and a high area efficiency. One major concern in the proposed architecture is the area overhead of the registers, receiver and transmitter that are used to control the reconfiguration mechanism. These units occupy 16% of the cell area. However, there is potential to decrease this area by optimizing the receiver and transmitter circuits. The total cell area given in Table 1 includes this overhead. In the evaluation, we have not considered the overhead of the clock and context-ID distributions, which account for a considerable portion of the total area in typical MC-FPGAs. Therefore, the overhead of the reconfiguration mechanism in the proposed MC-FPGA is not very large. Moreover, the proposed MC-FPGA needs a smaller number of cells to map applications.
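The 0.78 and 0.70 entries in Table 4 can be roughly reproduced from the numbers quoted earlier, under the simplifying assumption (ours, not the authors' calculation) that processing time scales as the computational amount divided by the number of cells working in parallel.

```python
# Back-of-the-envelope check of Table 4 (a simplifying model of ours).

cell_area_ratio   = 1.28   # Table 1: the proposed cell is 28% larger
computation_ratio = 0.54   # Sect. 5.2: computational amount reduced to 54%

cells_ratio = 1 / cell_area_ratio              # same chip area -> fewer cells
time_ratio  = computation_ratio / cells_ratio  # work divided by parallel cells

print(f"number of logic elements: {cells_ratio:.2f}")  # ~0.78 (Table 4: 0.78)
print(f"processing time:          {time_ratio:.2f}")   # ~0.69 (Table 4: 0.70)
```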


Table 4 Performance evaluation for the correspondence search.

                     Area   Number of logic elements   Processing time   Static energy
Typical MC-FPGA      1      1                          1                 1
Proposed MC-FPGA     1      0.78                       0.70              1.04

The reason for this is the higher flexibility of context partitioning in the proposed MC-FPGA. According to the mapping example, the number of logic blocks is reduced by 22% and the processing time by 30%, as shown in Table 4.

6. Conclusion

We have proposed a novel MC-FPGA architecture that uses the configuration memory, logic and interconnects effectively by dividing applications into sub-contexts. The sub-contexts are controlled by a local context-ID signal that is transferred from cell to cell using the same tracks as the data. As a result, the area overhead and power consumption due to dedicated context-ID tracks are reduced. Since only a part of the contexts is switched in a reconfiguration, the context switching power is small. The peak current when the contexts are switched is also reduced due to the partial reconfiguration and the asynchronous architecture. The proposed local context-ID signal can be generated internally by the LBs, so that conditional branches can be implemented in an area- and power-efficient manner. The proposed architecture is suitable for applications that have both data-intensive processes and control-intensive processes (processes that have conditional branches). Such applications are not suitable for typical MC-FPGAs due to the difficulty of implementing conditional branches. In some applications, the amount of calculation can be reduced by partitioning them into several sub-processes; those applications are also suitable for the proposed architecture. An example of such an application is given in Sect. 5.2. In this example, the SAD computation is partitioned into several partial SAD calculations with different execution timings, as shown in Figs. 30 and 32. The execution of the partial SAD calculations is controlled by the conditional branches, and we used the partial reconfigurability to implement this algorithm effectively. The processing time is reduced by 30% compared with the typical MC-FPGA, as shown in Table 4. However, the proposed architecture is not suitable for applications in which the amount of calculation cannot be reduced by partitioning them into several sub-processes. Applications with a huge amount of control-intensive processing are not suitable for either the proposed or the typical MC-FPGA; the reason is the frequent context switching, which increases the processing time. Frequent context-ID transmission can reduce the performance of MC-FPGAs. In the proposed architecture, we use two methods to minimize this performance reduction. The first method is to pipeline the context-ID and the data transmission, so that new data are immediately accessible by a logic block just after its context is switched.

The second method is partial reconfiguration. Due to the partial reconfiguration, the context-ID is transferred only to a part of the circuit, so the other parts of the circuit are not affected. Moreover, the context switching power is also small if less than 77% of the circuit is switched, as shown in Fig. 25. However, some cycles are still wasted due to the re-filling of the data pipeline after the contexts are switched. As a result, there is a trade-off between the cycles wasted due to context switching and the cycles saved by reducing unnecessary calculations. The best number of contexts can be obtained by considering this trade-off. To gain the maximum performance from the proposed architecture, the CAD tool is very important. In the proposed architecture, conflicts can occur if two sub-contexts belonging to two different context planes use the same logic and wire area simultaneously. These conflicts can be solved in high-level synthesis by careful analysis of sub-context timing and dependencies.

Acknowledgment

This work is supported by the VLSI Design and Education Center, the University of Tokyo, in collaboration with STARC, Fujitsu Limited, Matsushita Electric Industrial Company Limited, NEC Electronics Corporation, Renesas Technology Corporation, Toshiba Corporation, Cadence Design Systems, Inc. and Synopsys, Inc.

References

[1] A. DeHon, "Dynamically programmable gate arrays: A step toward increased computational density," Proc. Fourth Canadian Workshop on Field-Programmable Devices, pp.47–54, 1996.
[2] A. DeHon, "DPGA utilization and application," FPGA'96 ACM/SIGDA Fourth International Symposium on FPGAs, 1996.
[3] S. Trimberger, "A time-multiplexed FPGA," 5th IEEE Symposium on FPGA-Based Custom Computing Machines, pp.22–28, 1997.
[4] E. Kusse and J. Rabaey, "Low-energy embedded FPGA structures," ISLPED, pp.155–160, 1998.
[5] L. Shang, A.S. Kaviani, and K. Bathala, "Dynamic power consumption in Virtex-II FPGA family," FPGA'02, pp.157–164, 2002.
[6] F. Li, Y. Lin, L. He, D. Chen, and J. Cong, "Power modeling and characteristics of field programmable gate arrays," IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst., vol.24, no.11, pp.1712–1724, 2005.
[7] V. Baumgarte, G. Ehlers, F. May, A. Nuckel, M. Vorbach, and M. Weinhardt, "PACT XPP: A self-reconfigurable data processing architecture," J. Supercomputing, vol.26, no.2, pp.167–184, 2003.
[8] R. Konishi, H. Ito, H. Nakada, A. Nagoya, K. Oguri, N. Imlig, T. Shiozawa, M. Inamori, and K. Nagami, "PCA-1: A fully asynchronous, self-reconfigurable LSI," Asynchronous Circuits and Systems (ASYNC), pp.54–61, 2001.
[9] Y. Adachi, K. Ishikawa, S. Tsutsumi, and H. Ammo, "An implementation of the Rijndael on Async-WASMII," IEEE International Conference on Field-Programmable Technology (FPT), pp.44–51, 2003.
[10] H. Kagotani and H. Schmit, "Asynchronous PipeRench: Architecture and performance estimations," Proc. Field-Programmable Custom Computing Machines (FCCM03), pp.121–129, 2003.
[11] M. Yamashina and M. Motomura, "Reconfigurable computing: Its concept and a practical embodiment using newly developed dynamically reconfigurable logic (DRL) LSI," Proc. ASP-DAC, pp.329–332, 2000.

[12] W. Chong, M. Hariyama, and M. Kameyama, "Novel switch-block architecture using reconfigurable context memory for multi-context FPGAs," Proc. International Workshop on Applied Reconfigurable Computing (ARC 2005), pp.99–102, 2005.
[13] I. Kennedy, "Exploiting redundancy to speedup reconfiguration of an FPGA," Proc. 13th International Conference on Field Programmable Logic and Application, pp.262–271, 2003.
[14] C. Traver, R.B. Reese, and M.A. Thornton, "Cell designs for self-timed FPGAs," Proc. 14th Annual IEEE International ASIC/SOC Conference, pp.175–179, 2001.
[15] M.E. Dean, T.E. Williams, and D.L. Dill, "Efficient self-timing with level-encoded 2-phase dual-rail (LEDR)," Proc. 1991 University of California/Santa Cruz Conference on Advanced Research in VLSI, pp.55–70, 1991.
[16] J. Teifel and R. Manohar, "An asynchronous dataflow FPGA architecture," IEEE Trans. Comput., vol.53, no.11, pp.1376–1392, 2004.
[17] M. Hariyama, S. Ishihara, C.C. Wei, and M. Kameyama, "A field-programmable VLSI based on an asynchronous bit-serial architecture," IEEE Asian Solid-State Circuits Conference (A-SSCC), pp.380–383, 2007.
[18] T. Enomoto, Y. Sasajima, A. Hirobe, and T. Ohsawa, "Fast motion estimation algorithm and low power CMOS motion estimation array LSI for MPEG-2 encoding," Proc. International Symposium on Circuits and Systems, vol.4, pp.IV-203–206, 1999.

Hasitha Muthumala Waidyasooriya received the B.E. degree in Information Engineering and the M.S. degree in Information Sciences from Tohoku University, Japan, in 2006 and 2008, respectively. He is currently a Ph.D. student in the Graduate School of Information Sciences, Tohoku University. His research interests include field-programmable gate array architectures and high-level design methodology for VLSIs.

Masanori Hariyama received the B.E. degree in electronic engineering, the M.S. degree in Information Sciences, and the Ph.D. in Information Sciences from Tohoku University, Sendai, Japan, in 1992, 1994, and 1997, respectively. He is currently an associate professor in the Graduate School of Information Sciences, Tohoku University. His research interests include VLSI computing for real-world applications such as robots, high-level design methodology for VLSIs, and reconfigurable computing.

Michitaka Kameyama received the B.E., M.E. and D.E. degrees in Electronic Engineering from Tohoku University, Sendai, Japan, in 1973, 1975, and 1978, respectively. He is currently a Professor in the Graduate School of Information Sciences, Tohoku University. His general research interests are intelligent integrated systems for real-world applications and robotics, advanced VLSI architecture, and new-concept VLSI including multiple-valued VLSI computing. Dr. Kameyama received the Outstanding Paper Awards at the 1984, 1985, 1987 and 1989 IEEE International Symposiums on Multiple-Valued Logic, the Technically Excellent Award from the Society of Instrument and Control Engineers of Japan in 1986, the Outstanding Transactions Paper Award from the IEICE in 1989, the Technically Excellent Award from the Robotics Society of Japan in 1990, and the Special Award at the 9th LSI Design of the Year in 2002. Dr. Kameyama is an IEEE Fellow.
