Tile-Based Fault Tolerant Approach Using Partial Reconfiguration
Atsuhiro Kanamaru, Hiroyuki Kawai, Yoshiki Yamaguchi, and Moritoshi Yasunaga
Graduate School of Systems and Information Engineering, University of Tsukuba
1-1-1 Ten-ou-dai, Tsukuba, Ibaraki, 305-8573, Japan
Abstract. This paper deals with a dependable computing system built on a reconfigurable device. For this purpose, we propose a fault-tolerant approach that also covers microprocessors. TFT, short for tile-based fault tolerant approach, introduces an intermediate layer that connects the logical circuit layout to the physical circuit layout for partial and dynamic reconfiguration. This reconfiguration is used effectively for the online replacement of failed circuits. An advantage of TFT is that it does not conflict with other fault-tolerant approaches, so it can be freely combined with them in the construction of dependable systems.
1 Introduction
System failures inevitably occur over long operating periods, and failure rates rise under extreme environmental conditions such as outer space, deserts, and the deep ocean. FPGAs have attracted a great deal of attention because the circuit configuration on an FPGA can be changed after the system has been built. The use of FPGAs is spreading to dependable computing [1], where single event effects in particular have been discussed [2,3,4]. However, to realize higher system dependability, fault tolerance still needs to be worked out. Hardware faults caused by hot electrons [5,6], negative bias temperature instability [7], and gate oxide breakdown [8] result in permanent failures, and long-term systems that run without manual maintenance require fault-tolerant techniques to cope with them. Here, we propose a tile-based fault tolerant (TFT) approach using partial reconfiguration. TFT features high compatibility with arbitrary circuits and a reduced recovery time. The rest of this paper is organized as follows. Section 2 describes how the TFT approach repairs circuit faults using partial reconfiguration. In Section 3, we implement simple TFT modules on both earlier and the latest Xilinx FPGAs. Section 4 presents a TFT implementation of a RISC processor and discusses the remaining issues. Finally, Section 5 concludes this paper.
J. Becker et al. (Eds.): ARC 2009, LNCS 5453, pp. 293–299, 2009. © Springer-Verlag Berlin Heidelberg 2009
2 Tile-Based Fault Tolerant (TFT) Approach
2.1 Triple Modular Redundancy (TMR) in Our System
TMR is one of the well-known techniques for fault masking [9]. Its concept is majority voting on the outputs of triplicated circuits. Many systems rely on TMR to achieve higher dependability against environmental disturbances. For Xilinx FPGAs, the TMR technique is presented in [10], and some examples are given in [11]. We also adopt TMR as a fault-masking technique and use its voting circuit to locate a failed part. Our voting circuit produces two outputs: a valid output returned by the majority-voting function and, when an error occurs, an error flag that identifies the damaged circuit by comparison. For instance, when the voter input is "001", "010", "011", "100", "101", or "110", that is, when the three outputs do not all agree, a comparison within the majority-voting function identifies which circuit has the error. In our implementation, the identification results are merged in control circuits and used to swap circuits by partial reconfiguration.

2.2 Overview of TFT Approach
TFT is a set of implementation rules for designing and constructing dependable computing systems. Fig. 1 gives an overview of TFT.
[Fig. 1. TFT overview and its flowchart: tile maps at time steps t=1 to t=10, from system start to "system halt, if an error occurs". Legend: working tile (A–D), error-source tile, failure tile (F), reconfigured tile, spare tile.]
[Fig. 2. Xilinx PRM and its overview [12]: static modules and partial reconfiguration modules connected through bus macros across the FPGA width and height; the minimum partial reconfiguration region is 4 slices wide, and region widths are multiples of 4 slices.]
The t values in Fig. 1 are time steps in a sequence of repeated tile failures and recoveries. The shaded tiles, labeled A to D, indicate the module type; modules A, B, C, and D consist of 3, 4, 2, and 4 tiles, respectively. No circuit is implemented on any spare tile in the initial state. When a working circuit on a tile has an error, the tile is replaced with a spare tile after the appropriate reconfiguration. For example, in Fig. 1 the device works correctly in the initial state (t=1). The system checks each tile using TMR and finds an error (A at t=2). The system then replaces the faulty tile with a spare tile (t=3). This process repeats from t=4 to t=10. The system halts when no spare tile remains for repair.
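The detect-and-replace cycle of Fig. 1 can be sketched as a small software model. This is a hypothetical Python illustration, not the authors' hardware: `tmr_vote` stands in for the voting circuit with its error flag, and `TileArray.replace` stands in for the partial reconfiguration that moves a module onto a spare tile. The one-character-per-tile layout, the number of spare tiles, and the mapping of voted replicas to tiles are all assumptions made for illustration.

```python
from typing import Optional, Tuple


def tmr_vote(a: int, b: int, c: int) -> Tuple[int, Optional[int]]:
    """Bitwise 2-of-3 majority vote with single-fault identification.

    Returns (voted, faulty), where `faulty` is the index (0, 1 or 2) of the
    replica that disagrees with the other two, or None if all three agree or
    if no single replica can be blamed.
    """
    voted = (a & b) | (b & c) | (a & c)  # each output bit follows the majority
    if a == b == c:
        return voted, None
    if a == b:
        return voted, 2
    if a == c:
        return voted, 1
    if b == c:
        return voted, 0
    return voted, None  # all three differ: error detected but not localized


class TileArray:
    """Hypothetical tile map: one character per tile (module label, spare or failed)."""

    SPARE = "."
    FAILED = "F"

    def __init__(self, layout: str) -> None:
        self.tiles = list(layout)

    def replace(self, index: int) -> int:
        """Retire tile `index` and move its module to the first spare tile.

        Returns the index of the newly configured tile. Raises RuntimeError
        when no spare tile remains, which models the "system halt" case.
        """
        module = self.tiles[index]
        if module in (self.SPARE, self.FAILED):
            raise ValueError(f"tile {index} holds no working module")
        if self.SPARE not in self.tiles:
            raise RuntimeError("no spare tile left: system halt")
        spare = self.tiles.index(self.SPARE)
        self.tiles[index] = self.FAILED
        self.tiles[spare] = module  # stands in for a partial reconfiguration
        return spare

    def __str__(self) -> str:
        return "".join(self.tiles)


if __name__ == "__main__":
    # Assumed layout: modules A, B, C, D use 3, 4, 2 and 4 tiles (as in
    # Fig. 1), followed by three spare tiles; the real placement differs.
    array = TileArray("AAABBBBCCDDDD...")
    # For illustration only: the three voted replicas sit on tiles 0, 1, 2.
    replica_tiles = [0, 1, 2]
    voted, faulty = tmr_vote(0b1010, 0b1010, 0b1000)
    if faulty is not None:
        new_tile = array.replace(replica_tiles[faulty])
        print(f"voted={voted:#06b}, tile {replica_tiles[faulty]} -> tile {new_tile}")
    print(array)  # AAFBBBBCCDDDDA..
```

Each call to `replace` consumes one spare tile, so the model halts exactly when the spare pool is empty, which mirrors the last step of the flowchart.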
3 TFT Implementation on FPGAs
3.1 TFT Implementation on Earlier XILINX FPGAs
Modules communicate with other modules through bus macros (BMs) [12][13]. BMs provide fixed routing channels that keep the inter-module network correct even when partial reconfiguration modules (PRMs) are swapped. Standard BMs are built from look-up tables and can connect only adjacent modules. TFT, however, requires remote routing channels for communication between distant modules. We therefore designed remote-channel BMs built from tri-state buffers using XILINX ISE 6.3.03i¹. Circuit relocation and remote-channel BMs are also discussed in [14]. As a preliminary experiment on TFT implementation, we evaluated a simplified model with respect to remote-channel communication. This model has six modules: one static module for I/O, three PRMs for computation, and two PRMs serving as spares. The inter-module network topology is a complete graph. Because each BM channel is unidirectional, an inter-module network with $n$ nodes requires $2 \times {}_{n}C_{2}$ channels in total. The placement of the modules is illustrated in Fig. 3. Each PRM is 4 CLBs wide and spans the full height of the device. The static module occupies the rest of the device, although only 5 CLBs of its width are used. Each PRM corresponds to one tile. In our validation, PRM3 and PRM4 contain no circuits other than BM interfaces in the initial phase and serve as spare tiles; a circuit can be moved along PRM0 → PRM3 → PRM4, or PRM2 → PRM4 followed by PRM1 → PRM3, and so on.
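The channel count $2 \times {}_{n}C_{2}$ grows quadratically with the number of fully connected modules. The short Python sketch below evaluates it with the standard-library `math.comb` (Python 3.8+); the function name `bm_channels` is our own. For the six-module model it gives 30 unidirectional channels.

```python
from math import comb


def bm_channels(n: int) -> int:
    """Number of unidirectional BM channels for a complete graph of n modules.

    Each unordered pair of modules needs one channel per direction, so the
    total is 2 * C(n, 2), which equals n * (n - 1).
    """
    if n < 0:
        raise ValueError("module count must be non-negative")
    return 2 * comb(n, 2)


if __name__ == "__main__":
    # Preliminary model: one static module plus five PRMs, fully connected.
    print(bm_channels(6))  # 30
    # Quadratic growth: fully connecting 30 modules instead of 5.
    print(bm_channels(30) / bm_channels(5))  # 43.5
```

The second line illustrates why a complete-graph BM network limits how finely a circuit can be divided into tiles: a sixfold increase in modules needs more than forty times as many channels.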
Fig. 3. TFT on 2V1000
Fig. 4. 5VLX50 overview
Fig. 5. 5VLX50 net diagram
¹ This handling is not supported by ISE 8.0 or later versions. We have not tested any version of ISE 7.x.
Table 1. Comparison of circuit size and reconfiguration time (XC2V1000 @ 333.3 kHz)

| | Whole reconfiguration | Partial reconfiguration (max.) | Partial reconfiguration (min.) |
|---|---|---|---|
| File size (KB) | 499 | 34 | 17 |
| Used LUTs / overall | 135 / 5,120 | 21 / 640 | 24 / 640 |
| Time (s) | 21.94 | 1.45 | 0.76 |
| Time ratio | 1 | 0.07 | 0.03 |
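The ratio row of Table 1 can be reproduced from the measured times. The Python sketch below (the names `TABLE1` and `summarize` are ours) computes each configuration's time and file-size ratios relative to whole reconfiguration, plus the effective download rate in KB/s.

```python
# Table 1 measurements on the XC2V1000: (file size in KB, time in seconds).
TABLE1 = {
    "whole": (499, 21.94),
    "partial (max.)": (34, 1.45),
    "partial (min.)": (17, 0.76),
}


def summarize(table: dict[str, tuple[int, float]]) -> dict[str, dict[str, float]]:
    """Ratios relative to whole reconfiguration, plus effective download rate."""
    whole_kb, whole_s = table["whole"]
    summary = {}
    for name, (kb, seconds) in table.items():
        summary[name] = {
            "time_ratio": round(seconds / whole_s, 2),
            "size_ratio": round(kb / whole_kb, 2),
            "kb_per_s": round(kb / seconds, 1),
        }
    return summary


if __name__ == "__main__":
    for name, row in summarize(TABLE1).items():
        print(f"{name:15s} {row}")
```

In these three measurements, the effective rate stays around 22 to 23 KB/s. Reconfiguration time therefore scales roughly with bitstream size, which matches the time ratios in the table.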
The whole and partial reconfiguration times are shown in Table 1. The FPGA bitstreams were downloaded through a XILINX Platform Cable, and the times were measured with an Agilent 54641D mixed-signal oscilloscope. The absolute reconfiguration speed is not high, but it improves when the maximum download frequency of 33.3 MHz is used. Partial reconfiguration takes only 3–7% of the whole reconfiguration time; consequently, our system achieves higher flexibility and dependability than common approaches.

3.2 TFT Implementation on the Latest XILINX FPGA
Following Section 3.1, we tested the latest FPGA, the XC5VLX50. Its architecture supports rectangular partial reconfiguration regions, which suit TFT, but it does not support remote-channel BMs built from tri-state buffers. We therefore redesigned the remote-channel BMs using LUTs. On the latest Xilinx architecture, the larger number of remote-channel wires increases both the interchangeability of PRMs and the flexibility of circuit design. We kept the tile size in Section 3.1 and in this section as nearly equal as possible. Because the larger device then holds more tiles, the number of BMs increases by a factor of $\frac{2 \times {}_{30}C_{2}}{2 \times {}_{5}C_{2}}$, and therefore a whole circuit cannot be divided into a large number of tiles on the earlier architecture. Nonetheless, we can implement 30 tiles on the latest FPGA by using the new LUT-based BMs. Under our experimental conditions, the latest FPGA offers approximately 1.8 times higher flexibility than the earlier FPGA. The whole and partial reconfiguration times are shown in Table 2. The FPGA bitstreams were downloaded through a XILINX Platform Cable running at 6 MHz.
Table 2. Comparison of circuit size and reconfiguration time (5VLX50 @ 6 MHz)
| | Whole reconfiguration | Partial reconfiguration (max.) | Partial reconfiguration (min.) |
|---|---|---|---|
| File size (KB) | 1,533 | 36 | 24 |
| Used LUTs / overall | 1,792 / 7,200 | 62 / 240 | 60 / 240 |
| Time (s) | 4.41 | 0.28 | 0.18 |
| Time ratio | 1 | 0.06 | 0.04 |

4 RISC Processor Implementation on an FPGA
We have already obtained positive results with TFT on regularly structured circuits such as a cellular automaton computation [15]. In this section, we show that TFT can also be applied to circuits with a complex structure.
[Fig. 6. SH core and its segmentation: Tile0 is a static tile for I/O pins; Tile1–Tile4 are PRMs, and a spare tile holds no circuits at system start. Tiles are connected by bus macros in a complete-graph topology. Modules: Multiplier (mult.v, 400 slices), Decoder (decode.v, 2,100 slices), Datapath (datapath.v, 300 slices), Memory Access Controller (mem.v, 100 slices).]
Fig. 7. SH implementation (no TFT)
Fig. 8. SH implementation with TFT applied
Fig. 9. Critical path in Fig. 8

Table 3. Latency comparison of the whole (Fig. 7) and TFT (Fig. 8) implementations

| | Whole implementation | TFT implementation (total) | TFT implementation, BM delay removed except tri-state buffer (estimate) |
|---|---|---|---|
| Latency (ns) | 29.471 | 33.437 | (30.465) |
| Speedup | 1.00 | 0.88 | (0.97) |

4.1 SuperH RISC Processor
SH, short for SuperH RISC engine, is a microprocessor architecture developed by Hitachi, Ltd. [16] and used in a large number of embedded systems. Aquarius is a freely available SH-compatible CPU core [17]. Its main feature is its small circuit size: it requires only about 2,900 slices in our implementation.
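The module slice counts labeled in Fig. 6 add up to the roughly 2,900 slices quoted above. The small Python tally below makes the breakdown explicit and shows that the decoder accounts for most of the core. The dictionary keys are our own labels.

```python
# Per-module slice counts of the Aquarius SH core, as labeled in Fig. 6.
MODULE_SLICES = {
    "Multiplier (mult.v)": 400,
    "Decoder (decode.v)": 2100,
    "Datapath (datapath.v)": 300,
    "Memory Access Controller (mem.v)": 100,
}


def slice_shares(modules: dict[str, int]) -> dict[str, float]:
    """Fraction of the total slice count taken by each module."""
    total = sum(modules.values())
    return {name: count / total for name, count in modules.items()}


if __name__ == "__main__":
    print(sum(MODULE_SLICES.values()))  # 2900
    for name, share in slice_shares(MODULE_SLICES).items():
        print(f"{name:34s} {share:6.1%}")
```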
4.2 Implementation Results
Table 3 compares the latencies of the implementations in Fig. 7 and Fig. 8. The critical path in Fig. 8 is a BM channel between tiles, as shown in Fig. 9. Its latency is 3.575 ns, of which 0.603 ns is the delay of the tri-state buffer introduced by TFT. Hence, by optimizing the circuit implementation according to the connections among tiles, we can shorten the critical path to 30.465 ns (= 33.437 - (3.575 - 0.603)), an estimated speedup of 0.97 relative to the whole implementation. These results show that the performance degradation caused by TFT is not critical.
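The latency estimate and the speedup row of Table 3 follow from simple arithmetic, sketched below in Python with the measured values. Like the estimate in the text, the code assumes that optimized placement removes all of the critical BM channel's delay except the tri-state buffer itself. The constant and function names are ours.

```python
WHOLE_NS = 29.471      # whole implementation, Fig. 7 (Table 3)
TFT_NS = 33.437        # TFT implementation, Fig. 8 (Table 3)
BM_CHANNEL_NS = 3.575  # critical BM channel between tiles (Fig. 9)
TRISTATE_NS = 0.603    # tri-state buffer delay induced by TFT


def estimated_tft_latency(tft_ns: float, bm_ns: float, tbuf_ns: float) -> float:
    """Subtract the BM channel delay but keep its tri-state buffer delay."""
    return tft_ns - (bm_ns - tbuf_ns)


def speedup(baseline_ns: float, latency_ns: float) -> float:
    """Speedup relative to the baseline; values below 1 mean slower."""
    return baseline_ns / latency_ns


if __name__ == "__main__":
    est = estimated_tft_latency(TFT_NS, BM_CHANNEL_NS, TRISTATE_NS)
    print(f"estimated latency: {est:.3f} ns")                          # 30.465 ns
    print(f"speedup (TFT, total):     {speedup(WHOLE_NS, TFT_NS):.2f}")  # 0.88
    print(f"speedup (TFT, estimated): {speedup(WHOLE_NS, est):.2f}")     # 0.97
```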
5 Conclusion and Future Work
Aiming at the development of a dependable system with an FPGA, we proposed TFT in Section 2. In Section 3, we used remote-channel BMs built from tri-state buffers on an earlier device and LUT-based BMs on a Virtex-5 device, and Section 4 presented the results of applying TFT to the implementation of a SuperH RISC processor core. Through our experiments, we reconfirmed that PowerPC cores, BRAMs, and other hardware cores on an FPGA deliver good performance for many systems but hinder dependability, because they make the circuit structure much more complicated than in simpler, older architectures. In our current system environment, we can read out the contents of distributed RAM and registers from partial bitstreams and implant the data directly into the bitstream of another tile. However, this approach is somewhat poor in terms of bitstream generation time and dependability, and a non-redundant microprocessor that generates the partial bitstreams will be a serious bottleneck. To solve these problems, our next step is to consider re-partitioning circuits within a tile on the basis of the global routing architecture, so as to simplify the circuit structure. Other architectures [18,19] may also be discussed.
Acknowledgments. This work was partially supported by a Grant-in-Aid for Young Scientists (B) 20700044.
References
1. Cheatham, J.A., Emmert, J.M., Baumgart, S.: A survey of fault tolerant methodologies for FPGAs. ACM Trans. Des. Autom. Electron. Syst. 11(2), 501–533 (2006)
2. Swift, G.M.: Virtex-II Static SEU Characterization. XILINX Single Event Effects first Consortium Report (2004)
3. iRoC Technologies: Radiation results of the SER test of Actel, Xilinx and Altera FPGA instances (2004), http://www.actel.com/documents/OverviewRadResultsIROC.pdf
4. Quinn, H., Graham, P.: Terrestrial-Based Radiation Upsets: A Cautionary Tale. In: FCCM 2005, pp. 193–202 (2005)
5. Johnston, A.H., Swift, G.M., Shaw, D.C.: Impact of CMOS scaling on single-event hard errors in space systems. In: Proc. of IEEE Symp. on Low Power Electronics, pp. 88–89 (October 1995)
6. White, M., Chen, Y.: Scaled CMOS technology reliability users guide (March 2008)
7. Lesea, A., Percey, A.: Negative-bias temperature instability (NBTI) effects in 90nm PMOS (November 2005), http://japan.xilinx.com/support/documentation/white_papers/wp224.pdf
8. Azizi, N., Yiannacouras, P.: Gate oxide breakdown (December 2003), http://citeseer.comp.nus.edu.sg/681500.html
9. Siewiorek, D.P., Swarz, R.S.: Theory and Practice of Reliable System Design. Digital Press, Bedford (1982)
10. Carmichael, C., Fuller, E., Fabula, J., Lima, F.D.: Proton testing of SEU mitigation methods for the Virtex FPGA. In: Proc. of Int'l Conf. on Military and Aerospace Programmable Logic Devices (September 2001)
11. Lima, F., Carmichael, C., Fabula, J., Padovani, R., da Luz Reis, R.A.: A fault injection analysis of Virtex FPGA TMR design methodology. In: Proc. of Radiation and its Effects on Components and Systems, vol. 1, pp. 1–8 (2001)
12. XILINX Inc.: XAPP290: Two Flows for Partial Reconfiguration: Module Based or Difference Based (September 2004)
13. XILINX Inc.: XAPP290: Difference-Based Partial Reconfiguration (December 2007)
14. Kalte, H., Porrmann, M., Rückert, U.: System-on-programmable-chip approach enabling online fine-grained 1D-placement. In: Proc. of Int'l Symp. on Parallel and Distributed Processing, pp. 141–148 (April 2004)
15. Kawai, H., Yamaguchi, Y., Yasunaga, M.: Realization of the sound space environment for the radiation-tolerant space craft. In: ReConFig 2006, pp. 198–205 (September 2006)
16. SuperH RISC engine SH7040 series SH7045, SH7044, SH7043, SH7042, SH7041, SH7040 Hardware Manual, 2nd edn. HITACHI, User Manual ADJ-602-128A (1997)
17. Aitch, T.: A Pipelined RISC CPU Aquarius (SuperH-2 ISA Compatible CPU Core) (July 2003), http://www.opencores.org/projects.cgi/web/aquarius/
18. Konishi, R., Ito, H., Nakada, H., Nagoya, A., Imlig, N., Shiozawa, T., Inamori, M., Nagami, K., Oguri, K.: PCA-1: A Fully Asynchronous, Self-Reconfigurable LSI. In: ASYNC 2001, Washington, DC, USA, p. 54 (2001)
19. Sugawara, T., Ide, K., Sato, T.: Dynamically Reconfigurable Processor Implemented with IPFlex's DAPDNA Technology. IEICE Transactions on Information and Systems 87(8), 1997–2003 (2004)