Han YH, Liu C, Lu H et al. RevivePath: Resilient Network-on-Chip design through data path salvaging of router. JOURNAL OF COMPUTER SCIENCE AND TECHNOLOGY 28(6): 1045–1053 Nov. 2013. DOI 10.1007/s11390-013-1396-3

RevivePath: Resilient Network-on-Chip Design Through Data Path Salvaging of Router

Yin-He Han (韩银和), Senior Member, CCF, IEEE, Member, ACM, Cheng Liu (刘 成), Hang Lu (路 航), Student Member, CCF, IEEE, Wen-Bo Li (李文博), Lei Zhang (张 磊), Member, CCF, IEEE, and Xiao-Wei Li (李晓维), Senior Member, CCF, IEEE

State Key Laboratory of Computer Architecture, Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190, China
University of Chinese Academy of Sciences, Beijing 100049, China

E-mail: {yinhes, liucheng, luhang, liwenbo, zlei, lxw}@ict.ac.cn

Received October 16, 2012; revised August 15, 2013.

Abstract    Network-on-Chip (NoC), with its excellent scalability and high bandwidth, has been considered the most promising communication architecture for complex integrated systems. However, NoC reliability is increasingly challenged by shrinking semiconductor feature sizes and growing integration density, and a single node failure in an NoC might destroy the network connectivity and corrupt the entire system. Introducing redundancy is an efficient way to construct a resilient communication path, but prior redundancy-based work either achieves limited reliability with coarse-grained protection or incurs even larger hardware overhead with fine-grained protection. In this paper, we observe that the data path of an NoC, such as its links, buffers and crossbars, can be divided into multiple identical parallel slices, which can be exploited as inherent redundancy to enhance reliability. As long as one fault-free slice remains available, the proposed salvaging scheme, named RevivePath, keeps the overall data path functional. Furthermore, RevivePath uses direct redundancy to protect the control path, such as the switch arbiter and routing computation, providing a full fault-tolerant scheme for the whole router. Experimental results show that it achieves quite high reliability with graceful performance degradation even under high fault rates.

Keywords    Network-on-Chip, fault-tolerant, on-chip router

Regular Paper

The work was supported in part by the National Basic Research 973 Program of China under Grant No. 2011CB302503, and the National Natural Science Foundation of China under Grant Nos. 61076037, 60906018, and 60921002.

©2013 Springer Science + Business Media, LLC & Science Press, China

1 Introduction

Network-on-Chip (NoC) is envisioned to be a scalable communication substrate for building on-chip multi-core systems[1-4]. Meanwhile, with continuously shrinking feature sizes, devices and interconnects turn out to be especially vulnerable to both permanent and transient faults[5-6]. Some studies show that upwards of 10% of transistors will be defective due to wear-out and variation when systems scale up to hundreds of billions of transistors[3]. These faults cause tremendous reliability problems and consequently severe manufacturing yield loss. Fault-tolerant techniques are now indispensable for large-scale IC designs. Relaxing the requirement of 100% correctness can greatly improve manufacturing yield, which in turn reduces the manufacturing cost[6].

Since more and more cores can be integrated on a single chip, a faulty core can be replaced by a spare one or simply disabled[7]. As a result, even while suffering some performance degradation, the system can remain functional. However, for an NoC, which is a distributed router-based architecture managing communication between embedded cores, a single node failure might destroy the connectivity of the whole system. Although some fault-tolerant techniques have been proposed at a high level, such as fault-tolerant routing[8-14], the large number of routers sacrificed by deadlock-avoidance routing rules remains a serious problem for routing-level solutions.

In this paper, we mainly investigate fault tolerance at the router architecture level. Prior work on fault-tolerant architecture design mainly concentrates on redundancy schemes, i.e., introducing spare components to replace faulty ones[15-17]. However, redundancy solutions usually incur expensive hardware overhead of at least 100%. The work in [18] divides a large component into smaller sub-components and applies fine-grained redundancy.



However, the chip area overhead further increases due to the extra configuration and control logic, which also shrinks the reliability benefit.

In this paper, instead of direct redundancy, we exploit inherent redundancy and use the idea of graceful degradation to salvage the router. For a typical on-chip router, which will be shown in the next section, the chip area of the components varies significantly. Data path components, i.e., links, input buffers and crossbars, consume most of the chip area (more than 90%), while control path components such as routing computation and arbiters are quite small. The data path components within a router can be viewed as many small parallel and independent path slices. For example, a 64-bit link can be regarded as four separate 16-bit link slices. Whenever a permanent fault is detected in the 64-bit link, the fault may destroy only a single link slice rather than the entire 64-bit bundle. Consequently, we can salvage the other three slices rather than abandon the link as a whole.

Based on the above observation, instead of introducing extra redundancy for data path components, we propose a fine-grained data path salvaging strategy, RevivePath, which elaborately leverages the inherent redundancy within routers. Whenever faults destroy part of the data path and diminish its bandwidth, time-division multiplexing (TDM) is introduced to salvage the entire data path using the surviving slices.
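To make the slicing idea concrete, the following minimal Python sketch (ours, not the authors' hardware) splits a 64-bit word into 16-bit fragments and schedules them onto whichever slices survive, at the cost of extra cycles. The function name and the simple scheduling policy are illustrative assumptions.

```python
def send_over_surviving_slices(word, slice_bits, alive):
    # Split a wide word into equal slices; fragments that would have
    # traveled on faulty slices are rescheduled onto surviving slices
    # in later cycles (time-division multiplexing).
    n = len(alive)
    mask = (1 << slice_bits) - 1
    fragments = [(word >> (i * slice_bits)) & mask for i in range(n)]
    survivors = [i for i, ok in enumerate(alive) if ok]
    assert survivors, "at least one fault-free slice is required"
    cycles = []
    for start in range(0, n, len(survivors)):
        batch = fragments[start:start + len(survivors)]
        cycles.append(dict(zip(survivors, batch)))  # slice index -> fragment
    return cycles

# 64-bit flit over four 16-bit link slices, slice 0 permanently faulty:
# the transfer now takes two cycles instead of one.
print(send_over_surviving_slices(0xDEADBEEFCAFEF00D, 16, [False, True, True, True]))
```

With one of four slices broken, the transfer takes two cycles instead of one, which is exactly the bandwidth-for-availability trade that the salvaging scheme makes.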

The rest of the paper is organized as follows. Section 2 analyzes the area occupation of NoC components and related work. Section 3 presents the proposed resilient NoC architecture. Section 4 shows the experimental results. Finally, Section 5 concludes the paper.

2 Background and Related Work

2.1 Insight into the Router

An NoC can be viewed as a set of structured routers to which IP cores are attached via network interfaces[1]. On-chip routers are responsible for routing and flow control of traffic streams across the chip. In this paper, we adopt a conventional 2-stage pipelined wormhole router as our baseline model, depicted in Fig.1. It consists of pipeline registers, routing computation, buffers, the virtual channel allocator, the switch allocator and the crossbar. Among these components, the pipeline registers, buffers and crossbar, which scale with the data width, are classified as the data path, while the routing and arbitration logic belongs to the control path. Fig.2 shows the area of these components for different bandwidths. Clearly, the links, buffers, pipeline registers and crossbar take up the majority of the router area under diverse data widths. Thus, direct replication of data path components results in quite a large area overhead.

Fig.1. General router architecture.

Fig.2. NoC component areas under different data widths.

2.2 Related Work

With device sizes continually scaling down, failures including both transient and permanent errors become increasingly likely within an NoC[3,19]. Both industry designers and academics have paid enormous attention to transient faults[16,20-21]. In contrast, NoC micro-architecture design for permanent fault tolerance has received much less attention. Even though permanent faults are not as frequent as transient faults, they are becoming a pressing concern, especially in very large scale ICs[3].

Constantinides et al. proposed a component-level diagnosis and spare-part reconfiguration strategy[18]. With an automatic cluster decomposition algorithm, the method balances area cost and error resilience by replicating and reconfiguring design components at a proper granularity. However, this technique is still not a minimum-overhead design choice.

Koibuchi et al. presented a lightweight fault-tolerant mechanism based on default backup paths (DBP)[22]. The method adds DBPs instead of complete hardware duplication and consumes much less area. However, it applies fault tolerance at such a coarse granularity that under higher fault rates the network soon degrades to a unidirectional ring and the network bandwidth drops dramatically.

Fick et al. combined ECC (error correcting code), port-swapping, and a crossbar bypass to mitigate wearout-induced hard faults[23]. They exploited the inherent redundancy of routers as well as a bypass method, a kind of resource sparing similar to DBP, to maintain correct operation at low area overhead. However, it targets the whole port: when only buffer or link faults appear, the proposed hardware is powerless.

Some prior work proposed to serialize transmission channels for defect tolerance[24] or to protect against link failures[15-16]; in this paper, we take the entire NoC data path into consideration. Our proposed router microarchitecture could also tackle the bandwidth mismatch problem between adjacent NoC pipeline stages. In addition, previous research such as fault-tolerant routing and dynamic network reconfiguration[1,23,26-27] is orthogonal to the proposed fault-tolerant mechanism, and thus can be combined with the proposed design to further enhance network reliability.

Fault detection is essential to the proposed fault-tolerant design. In order to know precisely where a fault exists, built-in self-test (BIST)[10] or other on-chip fault detection techniques[28-31] can be employed to check the status of predefined logic blocks, according to which faults are repaired. Furthermore, detection error codes (DEC) can be used to find faults on the control path. Many error detection codes are available, such as Berger codes, parity, Reed-Solomon codes and the commonly used CRCs.


3 Design of RevivePath

3.1 General Architecture

In the proposed RevivePath scheme, the data path and the control path are protected with different techniques. For the control path, it is obvious that the routing computation, virtual channel allocator and switch allocator, which are critical to normal NoC operation, consume relatively little chip area; therefore, RevivePath protects them with direct redundancy. For the data path, RevivePath exploits the inherent redundancy and employs the salvaging technique. Fig.3 shows the general architecture of RevivePath. In this example, the links, buffers and crossbar are partitioned into four slices. Behind the links, buffers and crossbar network, path salvaging networks are inserted to salvage the remaining slices when faults are encountered. The detailed function and flow will be introduced in the next section.
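As a behavioral illustration of this direct redundancy, the sketch below models Mux/Demux-style selection among redundant copies of a small control block. It is our simplification, not the authors' circuit; the function name and the first-healthy-copy policy are assumptions.

```python
def select_healthy_copy(outputs, fault_flags):
    # Mux/Demux-style direct redundancy for small control-path blocks
    # (routing computation, VC and switch allocators): forward the output
    # of the first copy not flagged as faulty by the fault detection
    # logic; if every copy has failed, the control path is lost.
    for out, faulty in zip(outputs, fault_flags):
        if not faulty:
            return out
    raise RuntimeError("all redundant control-path copies have failed")

# Primary arbiter grant vs. its spare, with the primary flagged faulty:
grant = select_healthy_copy([0b0010, 0b0100], [True, False])  # -> 0b0100
```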

3.2 Data Path Salvaging Scheme

Since direct redundancy for the control path is a mature idea, this subsection mainly introduces the technique used for the data path.

Fig.3. General architecture of RevivePath.

3.2.1 Slice Salvaging Scheme

In this subsection, we introduce the details of the slice salvaging strategy for the data path. The data path is regular and can be regarded as many independent, identical parallel slices: a 64-bit link can be seen as four 16-bit link slices, a 64-bit FIFO can be viewed as four 16-bit FIFO slices, and a 64-bit Mux that constructs the crossbar can also be divided into four 16-bit Mux slices. Whenever some slices fail, the remaining slices can be reused to continue transmitting data in TDM mode. Data path salvaging organizes the data path components as a group of workable subcomponents that back up each other. Details of the scheme are explained in the following subsections.

Fig.4 shows the detailed structure of the path salvaging network. The data slices in this figure represent data path components both upstream and downstream of links, the crossbar or pipeline registers.

Fault Detection and Location. When some upstream data slices fail, built-in self-test[10,25] or detection error codes (DEC) can be employed off-line or on-line to detect the failures. The fault information of the slices is then recorded in the registers of the fault status scheduler.

Salvaging Flow. The upstream fault status scheduler is initialized from the upstream fault status registers after error detection, and reconfigures the path salvaging network between the storage registers (in the middle of Fig.4) and the upstream data slices cycle by cycle, so as to reconstruct an integrated datum in the storage registers from the data fragments. Similarly, the downstream fault status scheduler keeps the storage registers and the downstream data slices matched. The ready flag indicates whether the data in the storage registers are available. When it is invalid, all the logic driven by the storage registers has to be stalled; when it is valid, the downstream fault status scheduler gets to work. However, if a storage register fails, the whole structure corrupts. In addition, when the data slices in the figure are actually buffers, there may be no explicit storage registers as depicted. To handle this corner case, we remove one data slot from each buffer to serve as a storage register. With a simple bypass design, these special storage registers introduce no additional pipeline latency.

3.2.2 Implementation Hardware

Fig.4 exhibits the slice salvaging hardware structure. It mainly consists of fault detection circuits (BIST or error detection circuits), fault status schedulers, and networks constructed from Muxes, which together transfer available upstream data slices to the next links. The process can be divided into two parts: upstream data to the storage registers, and the storage registers to downstream. Available upstream data slices are selected according to the upstream fault status. The data path in the example is split into four slices. Since the situation with three available upstream slices is simply treated as that with two, to reduce logic design complexity, there are only three possible data width cases: a single fault-free data slice, two fault-free data slices, and four fault-free data slices. Mux 1 and Mux 2 in the figure are used to extract a single valid data slice and two valid data slices, respectively, from the upstream data path. When all four data slices are functional, i.e., there are no faults in the upstream data path, the upstream data can be directly stored in the storage registers. The data in the storage registers, combined with the slices chosen in the first step, reconstruct the 64-bit data through Mux 3, corresponding to the situations with different numbers of faulty slices. The downstream blocks, including Mux 4∼Mux 6 and the downstream fault status scheduler, are responsible for the transmission from the storage registers to the downstream data path. When slices of the downstream data path fail, the data connected with them are simply set to 0.

Fig.4. Detailed structure of the path salvaging network.
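The upstream half of this flow can be summarized by the following scheduling model, an illustrative Python sketch rather than the authors' RTL. In the real design the slice groups are fixed by the Mux 1/Mux 2 wiring; this sketch simply takes the first surviving slices.

```python
from math import ceil

def salvage_schedule(fault_free, n_slices=4):
    # fault_free: per-slice health bits, e.g. [0, 1, 1, 1] means slice a
    # (index 0) is faulty. Three survivors are deliberately degraded to
    # two, so only 1-, 2- or 4-slice transfer modes are ever used.
    healthy = [i for i, ok in enumerate(fault_free) if ok]
    if not healthy:
        raise RuntimeError("no fault-free slice: data path unsalvageable")
    mode = 4 if len(healthy) == 4 else (2 if len(healthy) >= 2 else 1)
    lanes = healthy[:mode]          # slice group actually driven
    cycles = ceil(n_slices / mode)  # cycles to fill the storage registers
    # schedule[t] maps a carrying lane to the index of the 16-bit
    # fragment it moves into the storage registers in cycle t.
    return [{lane: t * mode + k for k, lane in enumerate(lanes)}
            for t in range(cycles)]

# Fault vector [0111]: one faulty slice is treated as the two-slice case,
# so assembling one 64-bit flit takes two cycles.
for t, moves in enumerate(salvage_schedule([0, 1, 1, 1])):
    print(f"cycle {t}: {moves}")
```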


Apart from the difference mentioned, the blocks upstream and downstream of the data path operate in almost the same way, so we do not repeat the explanation. In addition, we can see from Fig.4 that the slice reuse logic itself has inherent fault tolerance. Mux 1 and Mux 3 constitute a fault-tolerant data path when a single input data path slice is left available; Mux 2 and Mux 3 construct a fault-tolerant data path when two input data path slices are available. Furthermore, when there are no faults at all, Mux 3 sustains the fully functional data path. As each fault-tolerant data path is closely coupled with specific available data path slices, the probability that a specific fault-tolerant data path fails exactly when the corresponding slice fault status occurs is quite low. Consequently, the slice reuse structure is quite resilient, especially compared with conventional designs such as redundancy.

Example. If slice a is faulty, the BIST or DEC will detect the fault and send the fault vector [0111] to fault status scheduler 1. Since there is only one faulty slice, the case is simply treated as the two-fault-free-slice situation, and the path /32(cd) is selected. The transmission is performed over the path /32(cd) and Mux 3, and the 64 bits are then stored in the storage registers. The ready flag is not set valid until all 64 bits are reconstructed.

3.2.3 Intermittent Transmission and Pipeline Transmission

When faulty slices are present, the upstream data path has to wait until an assembled datum is in the storage registers and the ready flag is set valid. At the same time, the downstream data path must wait for fully freed storage registers when the ready flag is invalid. We denote this transmission style as "intermittent transmission". A temporal-spatial distribution example of the intermittent style is shown in Fig.5(a). There are three faulty slices upstream, so it takes 4 cycles to assemble a complete datum in the storage registers; the datum is then transported downstream in two cycles, since there are two faulty slices downstream. The intermittent transmission style takes 18 cycles to complete the packet transmission, including the head flit, body flits and tail flit. This style is not efficient, since the downstream has to wait for the upstream data.

To address this inefficiency, we propose a new transmission style: pipeline transmission. Compared with intermittent transmission, as long as there are available data slices in the storage registers, they can be accessed and sent to the next pipeline stage, which reduces the stalling time. In the same fault situation, shown in Fig.5(b), the pipeline transmission style costs only 12 cycles. Apparently, the pipeline transmission style outperforms the intermittent transmission style.

Fig.5. Temporal-spatial distribution examples of pipeline and intermittent styles under various fault cases. (a) Intermittent style. (b) Pipeline style.
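The cycle counts in Fig.5 can be reproduced with a simple discrete-cycle model. This is a sketch under simplifying assumptions (uniform slice widths, no arbitration or pipeline-fill overhead), not the authors' simulator; the 3-flit packet length is inferred from the 18- and 12-cycle totals quoted above.

```python
from math import ceil

def intermittent_cycles(n_flits, slices, up, down):
    # Each flit is fully assembled in the storage registers
    # (ceil(slices/up) cycles), then fully forwarded downstream
    # (ceil(slices/down) cycles) before the next flit starts.
    return n_flits * (ceil(slices / up) + ceil(slices / down))

def pipeline_cycles(n_flits, slices, up, down):
    # Pipeline style: slices are forwarded as soon as they reach the
    # storage registers; simulate one cycle at a time.
    total = n_flits * slices
    produced = consumed = cycles = 0
    while consumed < total:
        cycles += 1
        produced = min(total, produced + up)       # upstream fills registers
        consumed = min(produced, consumed + down)  # downstream drains them
    return cycles

# Fig.5 example: a 3-flit packet, 4 slices per flit, one surviving
# upstream slice and two surviving downstream slices.
print(intermittent_cycles(3, 4, 1, 2))  # -> 18
print(pipeline_cycles(3, 4, 1, 2))      # -> 12
```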

However, it should be noted that when an integrated datum is necessary for the downstream data path, such as for routing computation, the intermittent transmission style remains functional while the pipeline transmission style is powerless.

Slice Fault Status Association. When the slice reuse technique is applied, the fault statuses of neighboring data path slices influence each other. Assume the numbers of fault-free slices from the input links of router A to the buffers of router B (links, buffers, crossbar, links) are s1, s2, s3 and s4, respectively. When s2 < s1, the links provide more data slices than the buffers can accommodate each cycle, so the pipeline registers between the links and buffers would be overwritten. To avoid this, the links would have to wait for the release signal of the pipeline registers, which makes the flow control more complex and inefficient. When s4 < s3, the problem is similar. To guarantee a correct and efficient pipeline, the available slice numbers of the data path components are therefore modified to v1, v2, v3 and v4, where v1 ≤ v2 and v3 ≤ v4. Since the slice reuse structure between the buffers and the crossbar employs the intermittent transmission style, v2 is independent of v3. Therefore, the data path slice status correlation is limited to the buffers of neighboring routers, which does not affect the scalability of the proposed method. In addition, when the number of available data path slices cannot be fully used in TDM mode, it is degraded to simplify the hardware design. For example, when a data path is split into four slices and three of them are fault-free, the available number is degraded to 2. Accordingly, the maximum available buffer capacity is modified to adapt the flow control, usually at flit granularity.
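The degradation rule and the v1 ≤ v2, v3 ≤ v4 constraints can be written compactly as bookkeeping over slice counts. The sketch below is illustrative, with invented function names:

```python
def effective_slices(fault_free):
    # Only 1-, 2- or 4-slice TDM modes exist; three survivors are
    # degraded to two to keep the scheduling logic simple.
    for mode in (4, 2, 1):
        if fault_free >= mode:
            return mode
    return 0  # no surviving slice: this data path segment is lost

def constrain_path(s):
    # s: fault-free slice counts along (input links, buffers, crossbar,
    # output links) between neighboring routers, i.e., (s1, s2, s3, s4).
    v = [effective_slices(x) for x in s]
    v[0] = min(v[0], v[1])  # enforce v1 <= v2: links must not outrun buffers
    v[2] = min(v[2], v[3])  # enforce v3 <= v4: crossbar must not outrun links
    return v

# Three fault-free buffer slices degrade to two, dragging the input
# links down with them; the crossbar is already the slower stage here.
print(constrain_path([4, 3, 2, 4]))  # -> [2, 2, 2, 4]
```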

4 Experimental Results

4.1 Area Overhead

To evaluate the area overhead, we synthesize the proposed router design at 200 MHz. The flit width is 64 bits, and the buffer capacity of each input port is 8 flits. Different fault-tolerant mechanisms, including redundancy, detection error codes, and the path salvaging network, are employed in this router. To get a comprehensive view of the slicing technique, four path salvaging network configurations are synthesized under the same conditions: Our-4 (data path salvaging router with four slices), Our-2 (data path salvaging router with two slices), Our-M (data path salvaging router with mixed slice numbers, i.e., the buffers and links adopt four slices while the crossbar uses two), and Our-D (data path salvaging router without crossbar protection, i.e., the buffers and links are still divided into four slices while the crossbar goes unprotected). Experimental results show that the area overheads of the four routers are 65.4%, 26.5%, 52.46% and 45.9%, respectively. Without the link slicing logic, the area overheads become 45.9%, 20.0%, 33.1% and 26.5%, respectively.

Fig.6 presents the area overhead of TMR (3-modular redundancy), MRR (most reliable router)[18], LOR (least overhead router)[18], Vicis[10] and the proposed designs. Note that Vicis also accounts for fault detection, which costs an additional 10% area; to make a fair comparison, we have removed the area overhead of this part. Additionally, since TMR, MRR, LOR and Vicis concentrate only on logic fault tolerance, our proposed numbers here also exclude the area targeting link fault tolerance.

Fig.6. Router area comparison.

In this figure, it can be seen that TMR and MRR, which employ direct replication, incur more than 100% area overhead. LOR, Vicis and the proposed designs develop the router's inherent redundancy with various methods, and their area overheads are mostly around 20% to 50%, making them more area efficient.

4.2 Reliability

4.2.1 Silicon Protection Factor Comparison

Reliability also depends on the protection circuitry itself. To evaluate reliability, Constantinides et al.[18] defined a metric called SPF (Silicon Protection Factor): the number of faults a router can tolerate before becoming non-functional, normalized by the area overhead. In this paper, we obtain the SPF with the following procedure. A random fault injection scheme is employed, weighted by component area. When a fault falls on a router component and can still be tolerated, the counter recording the total number of injected faults is increased by one. When one of the components totally corrupts and thereby induces router failure, the fault injection completes. Over 100 000 repeated fault injections, the average counter value is taken as the average number of faults a router can tolerate.

From these experiments, the numbers of faults that Our-4, Our-2, Our-M and Our-D can tolerate are 16.80, 6.18, 17.51 and 11.84, respectively. Normalized by the area overhead, the SPFs of Our-4, Our-2, Our-M and Our-D are 11.52, 5.15, 13.17 and 9.36, respectively. The SPF comparison results are displayed in Fig.7. Although the SPF of MRR is also very high, its additional area overhead is more than 200%; in a complex virtual channel router, that area cost might be too large to accept. Vicis has a moderate area overhead and SPF. However, with similar area overhead, Our-4, Our-M and Our-D all have much higher SPF; the SPF of Our-M even doubles that of Vicis. Although Our-2 has a relatively low SPF, it has the lowest area overhead. The reason that the SPF of Our-4 is lower than that of Our-M lies in the fact that the crossbar in a planar NoC is small, so the four-slice reuse method is slightly over-protective there, which pulls down the average SPF. When the fault-tolerant method is used in a higher-dimension network with a much larger crossbar, Our-4 will outperform the other configurations. In addition, Our-M has a higher SPF than Our-2, which indicates that buffers and links with four slices are more efficient than those with two. This is also in accordance with the experimental results shown in Fig.7.
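The SPF procedure described above can be mimicked by a small Monte Carlo sketch. The component areas and per-component fault budgets below are invented placeholders, not the paper's synthesis data:

```python
import random

def estimate_spf(components, area_overhead, trials=100_000, seed=0):
    # components: name -> (relative area, faults the component can absorb
    # before it totally corrupts). Faults land on a component with
    # probability proportional to its area; the mean number of tolerated
    # faults, normalized by the area overhead, is the SPF.
    rng = random.Random(seed)
    names = list(components)
    areas = [components[n][0] for n in names]
    tolerated = 0
    for _ in range(trials):
        hits = dict.fromkeys(names, 0)
        while True:
            n = rng.choices(names, weights=areas)[0]
            hits[n] += 1
            if hits[n] > components[n][1]:
                break  # this component just became unsalvageable
            tolerated += 1
    return (tolerated / trials) / area_overhead

# Hypothetical four-component router with a 50% area overhead.
print(estimate_spf({"buffer": (45, 3), "crossbar": (30, 3),
                    "link": (15, 3), "control": (10, 1)},
                   area_overhead=0.5))
```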


Fig.7. Router SPF comparison.

4.2.2 Available Components Encountering Faults

To get further insight into the proposed method, we also evaluate network reliability. No specific fault-tolerant routing algorithm is used in this paper, so we cannot make a further comparison with Vicis. As a torus is symmetric and faults have equal influence on all routers, which makes the analysis scalable, we evaluate the reliability of an 8 × 8 torus network, considering link faults as well. Any router component failure or corresponding output link failure is counted as a node failure. When additional fault tolerance logic fails, the protected component is assumed to be corrupted too. Fig.8 shows the number of available nodes when the network suffers 0 ∼ 1 300 faults (equivalent to a per-gate fault rate scaling from 0 to 1/2 000). When the fault number is no more than 300, 95% of the nodes remain fully functional. Even when the fault number reaches 1 000, 70% of the nodes still survive.

Fig.8. Available nodes in 8 × 8 torus.

Most fault-tolerant routing algorithms allow a subset of nodes to remain functional, so we further analyze the component status in the NoC. Fig.9 exhibits the NoC component fault status under the same fault injection. Remarkably, when the fault number reaches 500, 95% of the components still remain functional; even when the fault number doubles, 90% of the components survive. Therefore, with typical fault-tolerant routing, even more nodes would remain available. On the other hand, we find that when the fault number is around 1 300, 96% of the data path is still functional, but only 86% of the control logic blocks are left, which becomes the bottleneck for further improving NoC reliability. In this paper, conventional redundancy is used to protect the control logic blocks; nevertheless, the Mux and Demux used for redundancy are themselves exposed to faults. When the Mux and Demux fault probability increases faster than that of the redundant logic, there is a reliability limit for the control logic blocks, no matter how we improve the redundancy granularity and the number of spares.

Fig.9. Available NoC components in 8 × 8 torus.

4.3 Performance Analysis Encountering Faults

To evaluate the influence of the proposed method on NoC performance, we implement a cycle-accurate simulator in SystemC with a 4 × 4 2D mesh topology constructed from wormhole routers. XY routing is employed and each input port has an 8-flit buffer. Each node generates packets according to a Poisson process, with destinations selected randomly. 100 K packets of 8 flits each are injected per node, after a 20 K-cycle warm-up, before performance information is collected. We inject 6 to 48 faults on links and buffers. The simulation results shown in Fig.10 illustrate that the proposed design exhibits graceful performance degradation as the number of faults increases. Moreover, in the proposed design, single-slice and double-slice faults are treated as equivalent, link faults leave the associated crossbar only partially working, and buffer faults disable the corresponding links and crossbar. This means the performance under 6 buffer faults equals that under 6 ∼ 36 faults comprising 6 ∼ 12 buffer faults, 0 ∼ 12 link faults and 0 ∼ 12 crossbar faults. Similarly, the performance under 6 link faults is equivalent to that under 6 ∼ 24 faults comprising 6 ∼ 12 link faults and 0 ∼ 12 crossbar faults. The other fault situations likewise imply a much larger set of fault cases. Thus the NoC performance keeps stable over a large range of faults.
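For reference, the traffic model in this setup can be sketched as a Bernoulli-per-cycle approximation of per-node Poisson injection with uniformly random destinations. The function name and seed handling are our own:

```python
import random

def generate_injections(n_nodes, cycles, rate, seed=0):
    # Each cycle, every node independently injects a packet with
    # probability `rate` (packets/cycle/node), approximating a Poisson
    # process; destinations are uniform over all other nodes.
    rng = random.Random(seed)
    events = []  # (cycle, source, destination)
    for cycle in range(cycles):
        for src in range(n_nodes):
            if rng.random() < rate:
                dst = rng.randrange(n_nodes - 1)
                dst += dst >= src  # skip src to avoid self-addressed packets
                events.append((cycle, src, dst))
    return events

# 4 x 4 mesh at light load: ~0.01 packets per node per cycle.
trace = generate_injections(16, 20_000, 0.01)
```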


Fig.10. Average network latency under various fault scenarios.

5 Conclusions

In this paper, we exploited the inherent redundancy of NoC by splitting large data path components into slices, each of which can maintain the function of the whole component in the presence of partial failures by using TDM. For the tiny logic with comparatively low fault probability, conventional redundancy or ECC is employed. The evaluation shows that the proposed design provides several configurations with high reliability and low overhead: the area overhead varies from 26.5% to 65.4% and the SPF scales from 5.15 to 13.17. When the network suffers a medium fault rate, 90% of the nodes in an 8 × 8 torus remain fully functional. Even when the network is exposed to a higher fault rate, about 55% of the nodes survive and more than 86% of the NoC components work well, which promises a much larger number of available nodes under conventional fault-tolerant routing. The simulation also indicates that NoC performance degrades gracefully as the fault rate rises dramatically.

References

[1] Benini L, De Micheli G. Networks on chips: A new SoC paradigm. Computer, 2002, 35(1): 70-78.
[2] De Micheli G, Benini L. Networks on Chips: Technology and Tools. Morgan Kaufmann Pub, 2006.
[3] Borkar S. Microarchitecture and design challenges for gigascale integration. In Proc. the 37th International Symposium on Microarchitecture, Dec. 2004, p.3.
[4] Dally W, Towles B. Route packets, not wires: On-chip interconnection networks. In Proc. Design Automation Conference, June 2001, pp.684-689.
[5] Borkar S. Designing reliable systems from unreliable components: The challenges of transistor variability and degradation. IEEE Micro, 2005, 25(6): 10-16.
[6] Constantinescu C. Trends and challenges in VLSI circuit reliability. IEEE Micro, 2003, 23(4): 14-19.
[7] Zhang L, Han Y, Xu Q et al. On topology reconfiguration for defect-tolerant NoC-based homogeneous manycore systems. IEEE Trans. Very Large Scale Integration Systems, 2009, 17(9): 1173-1186.
[8] Boppana R V, Chalasani S. Fault-tolerant routing with nonadaptive wormhole algorithms in mesh networks. In Proc. Supercomputing, Nov. 1994, pp.693-702.
[9] Zhang Z, Greiner A, Taktak S. A reconfigurable routing algorithm for a fault-tolerant 2D-mesh network-on-chip. In Proc. Design Automation Conference, June 2008, pp.441-446.
[10] Fick D, DeOrio A, Chen G et al. A highly resilient routing algorithm for fault-tolerant NoCs. In Proc. Conf. Design, Automation and Test in Europe, April 2009, pp.21-26.
[11] Flich J, Rodrigo S, Duato J. An efficient implementation of distributed routing algorithms for NoCs. In Proc. Int. Symp. Networks-on-Chip, April 2008, pp.87-96.
[12] Wang J, Gu H, Yang Y et al. An energy- and buffer-aware fully adaptive routing algorithm for Network-on-Chip. Microelectronics Journal, 2013, 44(2): 137-144.
[13] Xiang D, Zhang Y, Pan Y. Practical deadlock-free fault-tolerant routing in meshes based on the planar network fault model. IEEE Trans. Computers, 2009, 58(5): 620-633.
[14] Xiang D, Luo W. An efficient adaptive deadlock-free routing algorithm for torus networks. IEEE Trans. Parallel and Distributed Systems, 2012, 23(5): 800-808.
[15] Siewiorek D, Swarz R. Reliable Computer Systems: Design and Evaluation (3rd edition). A K Peters/CRC Press, 1998.
[16] Smolens J, Gold B, Kim J et al. Fingerprinting: Bounding soft-error-detection latency and bandwidth. In Proc. the 11th Int. Conf. Architectural Support for Programming Languages and Operating Systems, Oct. 2004, pp.224-234.
[17] Weaver C, Austin T. A fault tolerant approach to microprocessor design. In Proc. International Conference on Dependable Systems and Networks, June 2001, pp.411-420.
[18] Constantinides K, Plaza S, Blome J et al. BulletProof: A defect-tolerant CMP switch architecture. In Proc. the 12th International Symposium on High-Performance Computer Architecture, Feb. 2006, pp.5-16.
[19] Hegde R, Shanbhag N R. Toward achieving energy efficiency in presence of deep submicron noise. IEEE Trans. Very Large Scale Integration Systems, 2000, 8(4): 379-391.
[20] Kim J, Park D, Nicopoulos C et al. Design and analysis of an NoC architecture from performance, reliability and energy perspective. In Proc. Int. Symp. Architecture for Networking and Communications Systems, Oct. 2005, pp.173-182.
[21] Murali S, Atienza D, Benini L et al. A multi-path routing strategy with guaranteed in-order packet delivery and fault-tolerance for networks on chip. In Proc. Design Automation Conference, June 2006, pp.845-848.
[22] Koibuchi M, Matsutani H, Amano H et al. A lightweight fault-tolerant mechanism for network-on-chip. In Proc. ACM/IEEE International Symposium on Networks-on-Chip, April 2008, pp.13-22.
[23] Fick D, DeOrio A, Hu J et al. Vicis: A reliable network for unreliable silicon. In Proc. the 46th Design Automation Conference, July 2009, pp.812-817.
[24] Palesi M, Kumar S, Catania V. Leveraging partially faulty links usage for enhancing yield and performance in networks-on-chip. IEEE Trans. Computer-Aided Design of Integrated Circuits and Systems, 2010, 29(3): 426-440.
[25] Alaghi A, Karimi N, Sedghi M et al. Online NoC switch fault detection and diagnosis using a high level fault model. In Proc. International Symposium on Defect and Fault-Tolerance in VLSI Systems, Sept. 2007, pp.21-29.
[26] Gomez M E, Duato J, Flich J et al. An efficient fault-tolerant routing methodology for meshes and tori. Computer Architecture Letters, 2004, 3(1): 3.
[27] Ho C T, Stockmeyer L. A new approach to fault-tolerant wormhole routing for mesh-connected parallel computers. IEEE Trans. Computers, 2004, 53(4): 427-438.
[28] Han Y, Xu Y, Li H et al. Test resource partitioning based on efficient response compaction for test time and tester channels reduction. In Proc. Asian Test Symposium, Nov. 2003, pp.440-445.
[29] Han Y, Xu Y, Chandra A et al. Test resource partitioning based on efficient response compaction for test time and tester channels reduction. Journal of Computer Science and Technology, 2005, 20(2): 201-210.
[30] Han Y, Hu Y, Li X et al. Embedded test decompressor to reduce the required channels and vector memory of tester for complex processor circuit. IEEE Trans. Very Large Scale Integration Systems, 2007, 15(5): 531-540.
[31] Han Y, Hu Y, Li H et al. Theoretic analysis and enhanced X-tolerance of test response compact based on convolutional code. In Proc. the 2005 Asia and South Pacific Design Automation Conference, Jan. 2005, pp.53-58.

Hang Lu received the B.Eng. and M.Eng. degrees from Beijing University of Aeronautics and Astronautics, China, in 2008 and 2011 respectively. He is currently a Ph.D. candidate in computer architecture at ICT, CAS. His research interests include high performance computer architecture, Network-onChip, scale-out processors, etc.

Yin-He Han received the B.Eng. degree from Nanjing University of Aeronautics and Astronautics, China, in 2001, and the M.Eng. and Ph.D. degrees in computer science from the Institute of Computing Technology (ICT), Chinese Academy of Sciences (CAS), Beijing, in 2003 and 2006, respectively. He is currently an associate professor at ICT, CAS. His research interests include VLSI architecture design and test, especially fault-tolerant and low power architecture. Dr. Han was a recipient of the Best Paper Award at the Asian Test Symposium (ATS) 2003. He is a member of IEEE/ACM/CCF/IEICE. He is the program chair of ATS 2014, finance chair of HPCA 2013, program co-chair of WRTLT 2009, and has served and serves on the technical program committees of several IEEE and ACM conferences, including HPCA 2013, ASP-DAC 2013, Cool Chip 2013, ATS 2008∼2010, GVLSI 2009∼2010, etc.

Lei Zhang received his Ph.D. degree in computer architecture from ICT, CAS, Beijing, in 2008. He is currently an associate professor at the State Key Laboratory of Computer Architecture, ICT, CAS. His research interests include Networkon-Chip, elastic processor and reconfigurable computing.


Cheng Liu received the B.Eng. and M.Eng. degrees from Harbin Institute of Technology, China. He was a Ph.D. candidate at ICT, CAS, from 2009 to 2011, and is now a Ph.D. candidate at The University of Hong Kong, China. His research interests include reconfigurable computing, hardware acceleration, and Network-on-Chip.

Wen-Bo Li is a junior student majoring in computing science at Simon Fraser University, Canada. He is currently visiting the Institute of Computing Technology, Chinese Academy of Sciences, where he joins the research on Network-on-Chip and power management for mobile systems.

Xiao-Wei Li received his B.Eng. and M.Eng. degrees in computer science from Hefei University of Technology, China, in 1985 and 1988 respectively, and his Ph.D. degree in computer science from the ICT, CAS in 1991. Currently, he is a professor and the deputy director of the State Key Laboratory of Computer Architecture, ICT, CAS. His research interests include VLSI testing and design verification, dependable computing, and wireless sensor networks. He is a senior member of IEEE. He is an associate Editor-in-Chief of the Journal of Computer Science and Technology, and a member of editorial board of Journal of Electronic Testing (JETTA) and Journal of Low Power Electronics (JOLPE). In addition, he has served and serves on the technical program committees of several IEEE and ACM conferences, including VTS, DATE, ASP-DAC, PRDC, etc. He was the Program Co-Chair of IEEE Asian Test Symposium (ATS) in 2003, and the General Co-Chair of ATS 2007.
