SHiFA: System-Level Hierarchy in Run-Time Fault-Aware Management of Many-Core Systems
Mohammad Fattah1, Maurizio Palesi2, Pasi Liljeberg1, Juha Plosila1, Hannu Tenhunen1
1 University of Turku, Turku, Finland. 2 Kore University of Enna, Enna, Italy.
{mofana, pakrli, juplos, hanten}@utu.fi
ABSTRACT
A system-level approach to fault-aware resource management of many-core systems is proposed. The proposed approach, called SHiFA, is able to tolerate run-time faults at the system level without any hardware overhead. In contrast to existing system-level methods, network resources are also considered to be potentially faulty. Accordingly, applications are mapped onto healthy nodes of the system at run-time such that their interaction does not require the use of faulty elements. Using a simple routing algorithm, results show 100% utilizability of PEs and a 99.41% mapping success rate when up to 8 links are broken. SHiFA is designed on top of distributed operating systems, so that it remains scalable for future many-core systems. A significant improvement in scalability properties is observed compared to state-of-the-art distributed approaches.
Categories and Subject Descriptors C.4 [Performance of Systems]: Fault tolerance
General Terms Performance, Reliability
Keywords application mapping, system-level design, hierarchical management
1. INTRODUCTION
According to the International Technology Roadmap for Semiconductors [1], by 2020 Multi-Processor Systems-on-Chip (MPSoCs) will integrate hundreds of processing elements (PEs) connected by a Network-on-Chip [2] (NoC) based communication infrastructure. This massive integration, enabled by aggressive transistor scaling, exacerbates the reliability issues in multi-processor design. Accordingly, future many-core systems require sophisticated fault-aware approaches to tolerate their erroneous characteristics.

Integrated circuit failures are usually classified by the failure duration, namely, permanent and temporary [3]. Permanent faults are those that persist until the faulty components are repaired, while temporary faults can be divided into transient and intermittent: the former appears once and disappears, while the latter occurs, vanishes, reappears, and so on. In this paper we focus on permanent faults, and unless otherwise stated, the term "fault" means "permanent fault".

Several methods have proposed architectural support to provide run-time fault tolerance to NoC-based systems. They mainly contribute to routing algorithms (e.g. [4–6]), topologies [7] or redundant routers [8]. Besides the hardware overhead of such methods,
architecture-level approaches are limited to a single or a few faults [4, 8], or lack 100% reliability (connectivity) for all fault scenarios [5, 6]. As such, fault patterns cannot always be kept transparent to the run-time resource management of the operating system (OS), i.e. the OS cannot completely rely on the network architecture.

On the other hand, run-time resource management, i.e. application mapping, is well recognized as an important challenge for future many-core systems [9–11]. This is primarily because of the extremely dynamic workload of such systems, where an unpredictable sequence of different applications with various requirements enters and leaves the system at run-time. The growing complexity has encouraged several researchers to use distributed resource management methods [11–14].

There are several fault-aware resource management methods proposed in the literature [15–19]. However, they assume failures to happen only in PEs while the associated routers keep working. As such, they make the unrealistic assumption that the healthy PEs always maintain full connectivity. Thus, such system-level solutions are equivalent to fault-free mapping/migration methods where a faulty PE is treated as a busy/overloaded one. Yet, as mentioned above, the imperfectness of the underlying network cannot always be hidden from the OS perspective, even by utilizing state-of-the-art architectural support. Therefore, it is necessary for run-time resource management approaches to adapt to run-time failures.

Accordingly, we propose our novel, yet simple, System-level and Hierarchical Fault-Aware (SHiFA) approach for run-time resource management of many-core systems. Unlike other system-level methods, we consider failures of both PEs and network resources in SHiFA. We construct the system-level design of SHiFA based on distributed OS proposals [20, 21]; i.e. a light-weight kernel runs in the background of each PE, hosting tasks of different applications. Current decentralized approaches [11–14] completely distribute the resource management job over different nodes. As a result, they lack a pervasive view of the system and might lead to disharmonic and random behaviors [22]. In contrast, we structure SHiFA kernels in three levels of hierarchy: (i) the system mobile master (MM), (ii) the application managers (AM), and (iii) the basic kernels. MM works as a conductor which keeps a holistic view of the system and orchestrates the different AMs, while each AM performs the detailed jobs regarding the application it is responsible for. This mitigates the random behavior of the system while keeping the advantages of distributed management approaches.

In this paper we make the following contributions:
• To the best of our knowledge, this is the first work that takes the imperfect network into account for run-time fault-aware management of NoC-based many-core systems (Section 4). Our system-level design makes SHiFA independent of the underlying network architecture while keeping system faults transparent to the applications. Accordingly, it imposes no hardware overhead and can be adopted in already manufactured platforms such as the Intel SCC [23].
• The hierarchical structure of our approach increases the utilization of healthy nodes within the imperfect network, as motivated in Fig. 1.
[Figure 1: SHiFA case example: due to the crossed faulty node and XY routing, none of the 5 nodes in the shaded area can be accessed by MM. However, the selected AM can utilize 4 of them (asterisk ones) within the same architectural limitations.]
Our experiments show that with SHiFA over 99.99% of PEs can be utilized when up to 8 network links are broken (Section 5.1).
• We equip AMs with a powerful fault-aware mapping algorithm. The algorithm finds a feasible allocation of the application tasks, such that the communication between tasks does not violate the network limitations. The time complexity of the proposed algorithm is independent of the network size, which is a significant scalability property over existing distributed methods. Results demonstrate a 99.41% success rate when up to 8 links are broken in the network (Section 5.2.1).
• As mentioned, MM keeps a holistic view of the system and prevents random behavior of the distributed managers. We integrate smart hill climbing [9] with tabu search [24] to decrease the related overheads compared to the random choices of state-of-the-art methods. Results show a 5-fold overhead reduction on average (Section 5.3).
• Last but not least, we discuss the use of an always-healthy control network, which is, implicitly or explicitly, assumed by most of the existing methods (including SHiFA) in Section 6. We also suggest alternative solutions regarding SHiFA characteristics.
2. RELATED WORK
The definition of techniques and methodologies for dealing with on-chip fault-tolerant communication started well before the NoC paradigm emerged. With the advent of NoC-based architectures, however, the increase in the number of degrees of freedom opened new opportunities for improving the fault tolerance of the system [25]. In the following we study related work handling permanent faults occurring at run-time in NoC-based many-core systems.

Architecture-level methods have been improved along several different directions, including network topologies [7], network redundancy [8] and routing algorithms [4–6]. However, architectural methods suffer from several shortcomings. First, they usually impose noticeable hardware overhead, e.g. [8]. Second, they might be limited to specific fault models. For instance, the work in [5] only supports broken links, while routers are assumed healthy, relying on the Vicis [26] router architecture which imposes a 42% area overhead. Most importantly, such methods are limited in the number of faults they cover; i.e. they fail to keep the whole system functional beyond a limited number of faults.

Complementary system-level methods have been proposed to ensure resiliency while maintaining the required levels of system performance. Although silicon failures affect both PEs and network components, the following system-level methods assume failures only in PEs. Hence, their target problem reduces to a mapping/migration problem where each faulty PE is equivalent to an unavailable/overloaded PE.

Offline task mapping techniques have been proposed in [16] and [18]. The proposed techniques are based on extensive design-time analysis of different single-PE fault scenarios to determine optimal mappings. These mappings are stored in a table and looked up at run-time to migrate tasks as and when faults occur. The work in [16] also presents some run-time heuristics for optimal migration of the tasks running on a failed PE.
[Figure 2: (a) Gaussian Elimination application with 9 tasks and 11 edges. (b) Its feasible mapping solution: all the tasks are mapped onto accessible healthy nodes and none of their communications go through the crossed faulty node.]
Chou and Marculescu, in their FARM [15] approach, propose a run-time mapping algorithm which considers failures in PEs. Failed PEs are not taken into account during application mapping, while during an application's execution, spare cores are used to host the migrated tasks of broken PEs. An adaptive fault-aware scheduling approach is presented by Bolchini et al. [19]. They use software-based triple modular redundancy (TMR) techniques to detect and tolerate faults occurring in the PEs, and to dispatch application threads to the healthy PEs.
3. PRELIMINARIES
We design SHiFA according to the message-passing paradigm both in the application model and the system architecture. Accordingly, each application is represented by a directed task graph $Ap = TG(T, E)$. A vertex $t_i \in T$ represents an application task, and an edge $e_{i,j} \in E$ stands for its communication with a destination task $t_j$. The task graph of the Gaussian Elimination application [27], extracted using TGG [28], is shown in Fig. 2(a). In our message-passing many-core architecture, cores have access to their own local memory and communicate and synchronize through exchanging messages. An architecture graph $AG(N, L)$ describes the set of network nodes ($N$) connected together by the set of links ($L$). Our SHiFA design is not limited to any specific network architecture or fault model; i.e. it can handle faults happening in links, routers, and PEs. Failures can be detected at run-time using an online testing method. Consequently, we denote the set of healthy elements by $AG_H(N_H, L_H) \subseteq AG(N, L)$.

Reachability: Iff a data packet can be delivered from a node $n_{src}$ to another node $n_{dst}$, regarding the current $AG_H$, we call the destination node reachable by the source node (denoted as $n_{src} \xrightarrow{R} n_{dst}$).

Accessibility: Iff a control packet can also be delivered back from a reachable node, we call it accessible from the source node (denoted as $n_{src} \xrightarrow{A} n_{dst}$). Assuming a fault-free control network (Section 6), every reachable node is also accessible.

Territory: The set of healthy nodes that are accessible from a given node is called its territory (denoted as $T^{node} \subseteq N_H$).

Mapping Problem: Given the current application ($Ap$) and run-time architecture ($AG_H$), a mapping algorithm must allocate tasks to healthy nodes such that all communicating pairs are accessible. Moreover, the allocated nodes should be accessible from the point at which the mapping algorithm is executed (the AM node in SHiFA). In other words, a feasible mapping is one whose task allocation and communication does not involve any faulty element:

$$map : T \to N : map(t_i) = n_{w,h};\ \forall t_i \in T,\ \exists n_{w,h} \in N_H;\quad \forall e_{i,j} \in E,\ n_{t_i} \xrightarrow{A} n_{t_j};\quad \forall t_i \in T,\ AM \xrightarrow{A} n_{t_i} \qquad (1)$$
where $n_{t_i}$/$n_{t_j}$ is the node onto which task $t_i$/$t_j$ is mapped. A feasible mapping of the Gaussian Elimination application onto the example faulty network (with XY routing) is shown in Fig. 2(b). In general, the feasibility of a mapping can be assessed at run-time based on the topology and routing algorithm.

Compliance with current systems: As an example, we note that our system model is very similar to that of the Intel SCC [23], in which cores (i) are connected through a NoC, (ii) simultaneously boot Linux kernels, and (iii) run message-passing applications.
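To make the Section 3 definitions concrete, here is a minimal sketch (ours, not the authors' code) of reachability, accessibility, and the Eq. (1) check on a 2D mesh with deterministic XY routing. The names `xy_route`, `accessible` and `is_feasible` are our own; broken bidirectional links should be listed in both directions.

```python
# A minimal sketch (assumptions: 2D mesh, deterministic XY routing;
# faults given as a set of faulty nodes and a set of directed faulty links).

def xy_route(src, dst):
    """Links traversed, hop by hop, by the XY route from src to dst."""
    (x, y), (dx, dy) = src, dst
    hops = []
    while x != dx:                          # route along X first
        nx = x + (1 if dx > x else -1)
        hops.append(((x, y), (nx, y)))
        x = nx
    while y != dy:                          # then along Y
        ny = y + (1 if dy > y else -1)
        hops.append(((x, y), (x, ny)))
        y = ny
    return hops

def reachable(src, dst, faulty_nodes, faulty_links):
    """src -R-> dst: the XY route crosses no faulty node or link."""
    if src in faulty_nodes or dst in faulty_nodes:
        return False
    return all(b not in faulty_nodes and (a, b) not in faulty_links
               for a, b in xy_route(src, dst))

def accessible(src, dst, faulty_nodes, faulty_links):
    """src -A-> dst: a packet gets there AND a reply gets back; note that
    under XY routing the two directions may use different paths."""
    return (reachable(src, dst, faulty_nodes, faulty_links) and
            reachable(dst, src, faulty_nodes, faulty_links))

def is_feasible(mapping, edges, am, faulty_nodes, faulty_links):
    """Eq. (1): every communicating task pair, and every allocated node
    as seen from the AM, must be accessible."""
    return (all(accessible(mapping[ti], mapping[tj], faulty_nodes, faulty_links)
                for ti, tj in edges) and
            all(accessible(am, n, faulty_nodes, faulty_links)
                for n in mapping.values()))
```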
4. SHIFA SCHEME
As mentioned, we define three different roles for kernels in the proposed hierarchy: (i) the system mobile master (MM), (ii) the application manager (AM), and (iii) the basic kernel. Depending on the run-time circumstances, a kernel can switch to any role according to the SHiFA procedures. Note that SHiFA kernels are activated at run-time upon arrival and termination of applications, and upon run-time failures. Otherwise, they do not interfere with the performance of applications.

Briefly stated, upon arrival of an application, MM hands the application over to an accessible node (the AM). The AM explores its neighboring nodes and tries to find a feasible mapping of the application, according to (1). The AM then reports back to MM about its success or failure in mapping. In case of success, the application is executed on the allocated nodes. Otherwise, MM assigns another AM until a feasible mapping is found. When a task execution terminates on a PE, its basic kernel informs its associated AM; in turn, the AM informs MM when all the tasks of its application have terminated, i.e. the application execution is finished.

As a result of the proposed hierarchy, we can also utilize nodes which are not necessarily in MM's territory. That is, a node can be utilized (called utilizable) in SHiFA iff we can find a node to act as AM such that $MM \xrightarrow{A} AM$ and $AM \xrightarrow{A} n_{dst}$.
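Under the same assumptions, the utilizability condition composes directly; a one-function sketch (ours, reusing the hypothetical `accessible` helper from the Section 3 example):

```python
def utilizable(n, mm, candidate_ams, faulty_nodes, faulty_links):
    """A node n is utilizable iff some AM is accessible from MM and
    n is accessible from that AM (Section 4)."""
    return any(accessible(mm, am, faulty_nodes, faulty_links) and
               accessible(am, n, faulty_nodes, faulty_links)
               for am in candidate_ams)
```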
4.1 System Mobile Master
The mobile master works as a conductor and keeps the holistic view of the system. Accordingly, we define two major jobs for MM. First, it signals kernels to promote them to AMs. Second, it moves to accessible nodes to retain its territory size.
4.1.1 AM Selection
Regarding the run-time configuration of the system and the current fault pattern, an appropriate node must be selected to hand the mapping job over to. The candidate node should provide (i) a high likelihood of a feasible mapping (ii) in its close neighborhood. This is to (i) find a feasible mapping quickly while (ii) keeping it contiguous. Contiguous allocation reduces the number of network elements involved in the application execution. As a result, it reduces the expected number of affected applications upon a run-time failure. Moreover, contiguous allocation decreases the traffic congestion and energy dissipation of the network [9, 29] as well as the execution time of applications [30, 31].

Existing distributed resource management approaches [11–13] use random-based node selection. That is, they send the mapping request to one or several random nodes, and each node explores its neighborhood to find the required number of available nodes. Such random exploration continues until one or several offers are found [11, 12]. Nevertheless, random node selection can lead to high time and communication overheads [9], as each kernel spends time generating several exploration packets to collect the required information.

We exploit smart hill climbing (SHiC) [9] integrated with tabu search [24] for AM selection, as sketched below. Consequently, an AM is selected using the SHiC approach, while unsuccessful AMs are added to a tabu list to prevent reselection. We empty the tabu list whenever the run-time configuration of the system changes, i.e. upon an application termination, fault occurrence, etc. The utilized AM selection works with the approximate shape of the running applications, e.g. a rectangle in a mesh topology [9]. Hence, we do not need to keep the availability status of all nodes in MM: AMs report only the shape of the obtained mapping back to MM, e.g. the rectangle corners $(n_{x1,y1}, n_{x2,y2})$ in a mesh topology. It is noteworthy that the quality of the mapping is the AM's responsibility, not MM's. As a result, MM stops looking for AMs upon a mapping success.
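A minimal sketch of this MM-side loop, assuming a `shic_candidates` ranking function in the spirit of SHiC [9] and an `ask_am_to_map` request/reply primitive (both names are ours; the actual SHiC heuristic is not reproduced here):

```python
def select_am_and_map(app, shic_candidates, ask_am_to_map, tabu):
    """MM-side loop: try SHiC-ranked candidate AMs, skip tabu nodes,
    stop at the first mapping success. `tabu` is cleared elsewhere
    whenever the run-time configuration of the system changes."""
    for am in shic_candidates(app):       # best-first candidate AMs
        if am in tabu:
            continue                      # never reselect a failed AM
        shape = ask_am_to_map(am, app)    # AM runs Algorithm 1, reports shape
        if shape is not None:             # e.g. rectangle corners in a mesh
            return am, shape
        tabu.add(am)                      # remember the failure
    return None                           # no feasible mapping found
```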
4.1.2 Mobility
MM reacts proactively against failing or getting isolated. Accordingly, upon any run-time failure, or its prediction (e.g. by using software-based prediction methods [32]), MM migrates to an accessible node with the largest territory. This provides MM with the widest options for AM selection, which improves the likelihood of mapping feasibility. As mentioned, MM does not need to keep the availability status of all the nodes and only knows the approximate area of the running applications. This eases the housekeeping and facilitates an agile migration for MM. For instance, in a mesh, it keeps only the addresses of 3 nodes for each running application: the corresponding AM and the two corners of its approximate rectangle. A more sophisticated exploration of migration scenarios and the definition of a vice-MM role (for unpredicted failures) are left as future work.
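The migration target choice can be sketched as follows (our illustration, again reusing the hypothetical `accessible` helper; a real implementation would compute territories incrementally rather than by brute force):

```python
def territory(node, healthy_nodes, faulty_nodes, faulty_links):
    """T^node: the healthy nodes accessible from `node` (Section 3)."""
    return {n for n in healthy_nodes
            if accessible(node, n, faulty_nodes, faulty_links)}

def migration_target(mm, healthy_nodes, faulty_nodes, faulty_links):
    """Upon a (predicted) failure, MM moves to the accessible node
    whose own territory is largest, widening its AM-selection options."""
    candidates = territory(mm, healthy_nodes, faulty_nodes, faulty_links)
    return max(candidates,
               key=lambda n: len(territory(n, healthy_nodes,
                                           faulty_nodes, faulty_links)))
```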
4.2 Application Manager
An application manager maps, migrates, and remaps the application it is responsible for. Task mapping is required once the application is requested for execution. Task migration and application remapping are required in the case of prediction or occurrence of run-time faults during the application execution. Note that this paper focuses on the system-level process of task migration, while its implementation details are out of the scope of this paper.
4.2.1 Fault-Aware Mapping
The AM is assigned to find a feasible mapping for the application it is responsible for. In order to find a feasible and contiguous mapping of an application, we equip AMs with a fault-aware adaptation of CASqA [31], as described in Algorithm 1.

Algorithm 1 Fault-aware mapping algorithm executed by AMs.
Inputs: TG(T, E): task graph of the requested application, Ap. AG_H(N_H, L_H): the set of healthy elements of the system.
Output: map : T → N_H.
Variables: UNM, MET and MAP: sets of unmet, met and mapped tasks. r: the current radius of the square around the AM.
Body:
 1: foreach t_f ∈ T do
 2:   n_{t_f} ← n_{AM};
 3:   MAP ← t_f;
 4:   MET ← set of tasks connected to t_f in TG;
 5:   UNM ← T − (MAP + MET);
 6:   for r = 1 → R_max do
 7:     for MD = 1 → 4 × r do
 8:       foreach t_c ∈ MET do
 9:         Ñ ← available nodes ∈ T^AM in layer r, MD hops away from t_c's parent;
10:         if ∃ n_c ∈ Ñ that keeps the mapping feasible then
11:           n_{t_c} ← n_c;
12:           move t_c from MET to MAP;
13:           if there are tasks in UNM connected to t_c then
14:             move the tasks connected to t_c from UNM to MET;
15:           MD ← 1;
16:         if MAP = T then
17:           return success;
18: end foreach t_f ∈ T;
19: return failure;

Initially, the AM maps a task of the application ($t_f$) onto its own local node and marks the connected tasks as MET (lines 2–5); we define the parent of a task as the task through which it was marked as MET. Afterwards, for each MET task ($t_c$), we collect the set of available accessible nodes ($\tilde{N}$ – line 9) which are in the first square around the AM ($r = 1$) and have the smallest Manhattan Distance ($MD = 1$) from the parent of $t_c$; we denote each layer of squares around the AM as a square whose radius ($r$) is the layer count. Among the collected nodes, we select the one, if any, which keeps the mapping feasible and allocate it to $t_c$ (lines 10–15). $MD$ increases once all MET tasks have been examined for feasible mappings, while the current radius, $r$, increases once all possible $MD$ values have been tried. In order to limit the possible dispersion, we bound the current radius such that twice the application size of nodes will be explored:

$$R_{max} = \left\lfloor \left\lceil \sqrt{2 \times |T|}\, \right\rceil / 2 \right\rfloor \qquad (2)$$

The above scenario continues until all the tasks are mapped (line 18). Note that the proposed algorithm is re-executed, picking different tasks as $t_f$ (line 1), to increase the success rate. Depending on the fault pattern around the AM, however, there might be no accessible node for a given task which keeps the mapping feasible. In that case, all the loops finish and the algorithm returns a failure (line 19). Note that the proposed algorithm considers a variety of different allocations for a given task and maps it only once no other task could be mapped with better conditions, i.e. with smaller $r$ and $MD$ values. This leads to a higher chance of finding a feasible mapping (Section 5.2.1) along with a better mapping quality (Section 5.2.2) compared to related work [14, 29, 33–35].

As the AM does not know the availability status of the PEs, it needs to ask them before considering them in the allocation. As such, when an accessible node is explored for the first time (within line 9), the AM sends a reservation request to it. The kernel of the destination node replies back to the AM about its status and books a place if it is available. The destination node remains reserved until the end of the mapping procedure. Finally, the AM releases those nodes that are not used in the mapping, e.g. all of them in the case of a mapping failure.

Complexity Analysis: There are four main loops in the algorithm. The first one (line 1) is repeated according to the application size ($|T|$). The next one (line 6) repeats up to $\sqrt{|T|}$ times according to (2), and the same holds for the next loop (line 7). As the MET set is a subset of the tasks, the last loop can repeat up to $|T|$ times. All in all, this leads to a complexity of $O(|T|^3)$, which is independent of the network size, making it suitable for future many-core systems. Note that if there is no fault in the AM's region, the first loop runs once, reducing the complexity to $O(|T|^2)$.
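As a concrete reading of lines 6–9 and Eq. (2), the following sketch (ours; `territory` stands in for the set of available accessible nodes confirmed through the reservation handshake) enumerates the candidate nodes for a given (r, MD) pair:

```python
from math import ceil, floor, sqrt

def r_max(num_tasks):
    """Eq. (2): bound on the exploration radius around the AM."""
    return floor(ceil(sqrt(2 * num_tasks)) / 2)

def ring(center, r):
    """Nodes on the square layer of radius r around `center`
    (Chebyshev distance exactly r), i.e. the r-th square of Algorithm 1."""
    cx, cy = center
    return [(cx + dx, cy + dy)
            for dx in range(-r, r + 1)
            for dy in range(-r, r + 1)
            if max(abs(dx), abs(dy)) == r]

def candidates(am, parent_node, r, md, territory):
    """Line 9 of Algorithm 1: available accessible nodes in layer r
    around the AM, exactly `md` Manhattan hops from the parent's node.
    Off-mesh coordinates drop out via the territory membership test."""
    px, py = parent_node
    return [n for n in ring(am, r)
            if n in territory
            and abs(n[0] - px) + abs(n[1] - py) == md]
```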
4.2.2 Task Migration and Re-Mapping
Upon a run-time fault occurrence or prediction that affects the application execution, the AM tries to move the affected tasks so that they can safely resume their execution. If task migration cannot resolve the issue, a task re-mapping is initiated. In case of a re-mapping failure, MM is involved, and either restarts the application or migrates it to a healthy region. Recall that, thanks to the proposed hierarchy, the fault information does not need to be consistent across all kernels. A new fault first alerts the involved AMs. In case an AM fails in handling the local failure, or after handling it, MM is informed about the fault and the actions taken.
5. RESULTS AND ANALYSIS
In this section we evaluate our proposed approach and compare it to the state-of-the-art. Accordingly, different sets of experiments are performed on our in-house cycle-accurate SystemC many-core platform, which utilizes a pruned version of Noxim [36] as its communication architecture. Each node consists of a router and a PE connected to its local port. A light-weight kernel resides in the background of each PE and switches between the three defined roles according to the SHiFA strategies. Each PE also hosts a task and emulates the task behavior according to the task graph.
5.1 Accessibility of Nodes
In our first analysis, we measure the portion of nodes which can potentially be accessed and utilized under SHiFA. A 12x12 mesh with XY routing is studied, while the number of broken (bidirectional) links is gradually increased from 1 to 8. For the sake of scalability analysis, we also consider the case where up to 10% of the links (27 of them) are broken. Moreover, we try to quantify the contribution of the mobility feature of the master, as well as the use of an enhanced architecture. Accordingly, we ran the same set of simulations when the system master resides in a fixed node (FM), and when the mesh exploits Odd-Even routing (OE). In the OE case, when two outputs are granted, the healthy one is selected; when both granted outputs are healthy, a local priority-based selection is applied, as sketched below.
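A minimal sketch of this fault-aware selection policy (our illustration; `oe_admissible` stands for the outputs permitted by the Odd-Even turn model, whose rules are not reproduced here, and the fallback when no healthy output exists is our assumption):

```python
def select_output(current, dst, oe_admissible, faulty_links, priority):
    """Fault-aware selection among Odd-Even admissible outputs:
    prefer a healthy link; break ties by a local priority rule."""
    outputs = oe_admissible(current, dst)        # ports allowed by OE
    healthy = [o for o in outputs
               if (current, o) not in faulty_links]
    if len(healthy) == 1:
        return healthy[0]                        # the healthy one wins
    pool = healthy if healthy else outputs       # fallback: our assumption
    return min(pool, key=priority)               # local priority-based pick
```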
[Figure 3: Percentage of the accessible PEs with fixed master (FM) and mobile master (MM), using XY or Odd-Even (OE) routing algorithms. The darker part of each bar shows the percentage of PEs that can be accessed directly from the fixed/mobile master.]
The extracted results are shown in Fig. 3, where each point of the graph is averaged over 100 different random fault patterns. As can be seen, 99.999% of the nodes can be utilized when up to 8 links are broken. This decreases slightly (to 99.72% in the MM-XY case) when the broken links are increased to 10% of the total amount of links. Our system-level method is comparable to a highly resilient routing algorithm [5] which provides 99.99% reliability when 10% of the links are broken. It can be seen that the mobility feature and OE routing mainly contribute to the territory of the system master; i.e. they provide the system master with a larger set of nodes to be selected as candidate AMs.
5.2 Success Rate and Quality of Mapping
As the second study, we measure the success rate of our proposed fault-aware mapping (Section 4.2.1) in finding a feasible mapping for an application. Moreover, we compare the mapping quality with state-of-the-art algorithms in our run-time simulation environment.
5.2.1 Mapping Success
A 12x12 mesh network is used, while the number of faulty links is varied between 1 and 8. As before, both XY and OE routing algorithms, as well as the case where 10% of the links are broken, are considered. 100 different fault patterns are generated and examined for each case. For each fault pattern, we consider the mapping of 9 different real applications, such as the MPEG4 decoder [37], Video Object Plane Decoder (VOPD) [38], and Multi-Window Display (MWD) [38]. Within each fault pattern, the nodes in MM's territory are promoted to AM one by one. Each AM is assigned to map each application using the proposed algorithm. For the sake of comparison, we also utilize the distributed D&C mapping algorithm [14], which is derived from NMAP [35]. As the original D&C is not fault-aware, we changed the algorithm such that a swap is applied if the number of feasible communications is increased. Accordingly, each bar in Fig. 4 shows the percentage of successful mappings over all examined cases (# of fault patterns × $|T^{MM}|$ × # of apps).

In the single-fault case, using our proposed algorithm, an arbitrary AM selection can lead to 99.93% successful mapping using the OE routing algorithm. This reduces to 99.41% when there are up to 8 faulty links. Results show that our algorithm scales well when up to 10% of the links are broken: an arbitrary AM selection for a given application leads to over 96% mapping success. Note that in a dynamic environment, it is the failure rate which impacts the performance, not the success rate. For instance, using the D&C algorithm, the probability of an arbitrary AM failure increases 30-fold when there are only 2 broken links. Moreover, Fig. 4 shows the ratio of the energy per bit of communication obtained by the SHiFA and D&C mappings; the proposed algorithm enhances the network energy dissipation by at least 60%.
5.2.2 Mapping Quality
We compared the quality of the obtained mappings with the state-of-the-art. As mentioned before, however, other system-level methods consider a fault-free network. In order to enable a fair comparison, the same assumption is made (only in this set of experiments). A 12x12 mesh network is instantiated with two different fault ratios: 5% and 10%. A random sequence of applications enters the system. This sequence is kept fixed in all experiments for the sake of fair comparison. Applications are mapped onto the system using different mapping algorithms, including NN [34] and CoNA [33]. Each experiment is run for 1 million simulation cycles. Note that 100% of the mappings are feasible due to the assumed healthy network. A normalized summary of the extracted results is shown in Table 1, where each cell is averaged over 10 different fault patterns. As can be seen, the SHiFA mapping algorithm outperforms the other works regarding the network metrics.

Table 1: Mapping Quality of Different Algorithms

% of fault | Metric | SHiFA | CoNA [33] | NN [34]
5%         | Lavg   | 1.00  | 1.07      | 1.13
5%         | E      | 1.00  | 1.20      | 1.30
10%        | Lavg   | 1.00  | 1.06      | 1.11
10%        | E      | 1.03  | 1.22      | 1.31

[Figure 4: Success rates for arbitrary AM selection over the SHiFA and D&C algorithms. The line demonstrates the ratio between the energy dissipation of the obtained solutions.]

5.3 Run-Time Overhead and Scalability

As mentioned before, MM perceives a holistic view of the system thanks to the proposed hierarchy, whereas other distributed methods [11–13] lack such a pervasive view of the system and explore the system nodes randomly, which leads to high overhead in finding suitable resources. In order to evaluate the gained performance and scalability, we extracted the average number of trials required to find a successful AM, using both the SHiFA method and the random method of the related work (Fig. 5). As can be seen, SHiFA improves the related overhead around 5-fold compared to previous works, while it scales over different fault populations and system sizes. More precisely, SHiFA finds the suitable AM at most in the second try, while random search might lead to up to 13 trials. Moreover, Fig. 5 shows the normalized execution time required for the mapping algorithms. The comparison is done against the D&C algorithm as a distributed approach (which is shown to have improvements over other related work).

[Figure 5: Scalability results when (top) increasing the fault rate and (bottom) increasing the system size. Bars represent the number of AM trials.]

6. DISCUSSION

Existing fault-tolerant work, especially work with no fault-pattern limitations, relies on a perfect oracle network that cannot have faults, in order to maintain connectivity when faults appear in the actual network. That is, such work requires a guaranteed delivery of control packets in order to keep the system resilient against run-time failures. For instance, the routing algorithms of [6] and [39] explicitly depend on an always-healthy control network. Although the presented work in [5] tends to work in a distributed manner, the authors assume that "the routers know when they need to invoke the algorithm". This breaks the assumed distribution and requires an always-healthy control network to synchronize the routers. As another example, Chang et al. [8] reconfigure the spare routers at a single point of management in both of their SARA and DAPA algorithms. This requires an always-healthy network which collects the fault information over the chip and distributes the reconfiguration commands back to the routers. The same assumption is
made by the existing system-level approaches, as they initially assume a fault-free network.

Similarly, SHiFA needs a guaranteed negotiation between kernels. That is, the destination node must be not only reachable but also accessible in most of the SHiFA procedures, such as AM selection, MM migration, task allocation/migration, etc. However, building a completely fault-immune control network will not be light-weight, due to the imposed redundancies. Moreover, such an assumption cannot be realistic regarding the mean time to failure (MTTF) of nanometer-scale silicon. For instance, the MTTF of a 12x12 mesh network will be around 15 months, according to [6, 40]. An alternative solution is to route the control packets through the same, potentially faulty, network. Then, in order to guarantee the delivery of control packets, it would require either a completely reliable and distributed routing algorithm or violating the turn-model limitations when necessary. The former has not been proposed yet, while the latter might lead to deadlock.

Our suggestion is to accept the deadlock probability as a price to pay, and to violate the routing restrictions if and only if it is inevitable. The main motivation is our observation that, thanks to the SHiFA characteristics, the deadlock probability is extremely negligible, especially when the traffic of control packets is isolated, e.g. by exploiting a virtual channel (VC). Moreover, it has been shown that even in a data network, deadlock is a very rare and detectable event [41], so that unrestricting the routing by allowing any turn is recommended. The rareness of deadlock is emphasized in SHiFA, as the control packets are (i) rarely generated and (ii) mainly transferred over local distances. The SHiFA control packets are generated rarely: once for the negotiations regarding migration, which happens upon a run-time failure (MTTF: several months), and once during the application mapping. In the case of the mapping of an application with $|T|$ tasks after $c$ candidate AM selections, the total number of control packets generated during the lifetime of the application will be:

$$\{2 \times c\} + \{2 \times c \times |T|\} + \{|T| + 1\} \qquad (3)$$
The first term stands for the AM selection, the second represents the reservation of nodes made by each candidate AM, and the last term relates to the application termination. Note that the two last terms correspond to local packet transmissions, cf. Eq. (2).
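As a worked instance of Eq. (3) (our arithmetic, using the values that appear later in this section: c = 1.5 candidate AM selections per application and |T| = 20 tasks):

```python
def control_packets(c, num_tasks):
    """Eq. (3): control packets generated over an application's lifetime."""
    am_selection = 2 * c                  # request + reply per candidate AM
    reservations = 2 * c * num_tasks      # reserve/ack per explored node
    termination = num_tasks + 1           # task-done reports + AM-to-MM report
    return am_selection + reservations + termination

print(control_packets(1.5, 20))           # -> 84.0, as used in the MTTD estimate
```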
We verified our suggestion through run-time simulations. Accordingly, a 12x12 mesh network with an Odd-Even routing function is considered, where the control network follows the same architecture and fault pattern as the data network.

Table 2: Control Packets that Experience both Congestion and Turn-Model Violation

# of broken links        | 1   | 2   | 3   | 4   | 5   | 6
# of AM trials per app.  | 1.5 | 1.6 | 1.6 | 1.7 | 1.8 | 1.8
R_{C&V} (×10^-4)         | 0.2 | 0.5 | 0.6 | 1.6 | 1.8 | 2.0
Consequently, the number of broken links is increased from 1 to 6, and 10 different fault patterns are explored for each case. Each simulation is carried out under a dynamic workload for 100,000 clock cycles. As expected, none of the simulations leads to a deadlock. However, in order to quantify the deadlock probability, we measured the portion of the control packets ($R_{C\&V}$) which experienced both a turn-model violation and congestion along their path to the destination. Table 2 presents the extracted results. Note that turn-model violation and congestion are two necessary, but not sufficient, conditions for a deadlock.

Let us assume the system with 1 broken link and applications with an average size of 20 tasks. According to Table 2, there will be 1.5 AM selections per application and 2 control packets out of 100,000 which experience both congestion and violation. Hence, according to (3), each application execution will generate 84 control packets. Thus, every 595 application executions there will be a control packet experiencing both; in turn, we pessimistically assume that one in every thousand such cases (every 595,000 applications) will lead to a deadlock (our experiments show no deadlock cases). Assuming a 1-hour execution time for each application, where up to 7 applications can fit into a 12x12 mesh, there will be a deadlock case every 85,000 hours (every 9 years); i.e. the mean time to deadlock (MTTD) will be much longer than the MTTF of such a system (15 months). The MTTD decreases to 9 months when there are up to 6 broken links in the network. According to the provided calculations, the deadlock probability under SHiFA is so extremely negligible that it motivates the use of violating routing algorithms instead of the (unrealistic and expensive) always-healthy, fault-free control networks.
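The MTTD estimate above can be reproduced as follows (our sketch; the one-in-a-thousand deadlock rate per congestion-and-violation event is the paper's pessimistic assumption):

```python
def mttd_hours(r_cv, packets_per_app, apps_per_hour,
               deadlocks_per_cv_event=1e-3):
    """Mean time to deadlock: packets between C&V events, converted to
    applications, scaled by the assumed deadlock rate per C&V event."""
    apps_between_cv = (1.0 / r_cv) / packets_per_app    # apps per C&V packet
    apps_between_deadlocks = apps_between_cv / deadlocks_per_cv_event
    return apps_between_deadlocks / apps_per_hour

# 1 broken link: R_C&V = 0.2e-4, 84 packets/app, 7 concurrent 1-hour apps
print(mttd_hours(0.2e-4, 84, 7))   # ~85,000 hours, i.e. about 9 years
```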
7. CONCLUSION AND FUTURE WORK
In this paper, we proposed SHiFA, a system-level and hierarchical approach to fault-aware management of many-core systems. The proposed fault-aware mapping showed over 99% successful mapping when several links are broken in the network. Moreover, the hierarchical design of SHiFA improved the fault-locality properties while keeping 99.99% of the PEs utilizable when there are several broken links. Furthermore, the holistic view offered by the proposed hierarchy decreased the overheads related to the random behavior of other distributed approaches. We also discussed the assumption of an always-healthy control network, and suggested routing the control packets through the same, potentially faulty, network; we showed that this can be an alternative to expensive and unrealistic fault-free control networks. We intend to enrich the migration scenarios and develop the related algorithms for both MM and AMs. Moreover, we plan to realize SHiFA on the Intel SCC and measure the related pros and cons.
8. REFERENCES
[1] "Yield enhancement," International Technology Roadmap for Semiconductors, 2011.
[2] W. Dally and B. Towles, "Route packets, not wires: on-chip interconnection networks," in DAC, 2001.
[3] I. Koren and C. M. Krishna, Fault-Tolerant Systems. Morgan Kaufmann, 2010.
[4] Z. Zhang, A. Greiner, and S. Taktak, "A reconfigurable routing algorithm for a fault-tolerant 2D-mesh network-on-chip," in DAC, 2008.
[5] D. Fick et al., "A highly resilient routing algorithm for fault-tolerant NoCs," in DATE, 2009.
[6] S. Pasricha and Y. Zou, "NS-FTR: a fault tolerant routing scheme for networks on chip with permanent and runtime intermittent faults," in ASP-DAC, 2011.
[7] M. Hosseinabady, M. R. Kakoee, J. Mathew, and D. K. Pradhan, "Reliable network-on-chip based on generalized de Bruijn graph," in HLDVT Workshop, 2007.
[8] Y.-C. Chang et al., "On the design and analysis of fault tolerant NoC architecture using spare routers," in ASP-DAC, 2011.
[9] M. Fattah et al., "Smart hill climbing for agile dynamic mapping in many-core systems," in DAC, 2013.
[10] A. K. Singh et al., "Mapping on multi/many-core systems: survey of current and emerging trends," in DAC, 2013.
[11] I. Anagnostopoulos et al., "Distributed run-time resource management for malleable applications on many-core platforms," in DAC, 2013.
[12] S. Kobbe et al., "DistRM: distributed resource management for on-chip many-core systems," in CODES+ISSS, 2011, pp. 119–128.
[13] M. Hosseinabady and J. Nunez-Yanez, "Run-time stochastic task mapping on a large scale network-on-chip with dynamically reconfigurable tiles," IET Computers & Digital Techniques, 2012.
[14] I. Anagnostopoulos et al., "A divide and conquer based distributed run-time mapping methodology for many-core platforms," in DATE, 2012.
[15] C.-L. Chou and R. Marculescu, "FARM: fault-aware resource management in NoC-based multiprocessor platforms," in DATE, 2011.
[16] O. Derin, D. Kabakci, and L. Fiorin, "Online task remapping strategies for fault-tolerant network-on-chip multiprocessors," in NOCS, 2011.
[17] J. Huang, K. Huang, A. Raabe, C. Buckl, and A. Knoll, "Towards fault-tolerant embedded systems with imperfect fault detection," in DAC, 2012.
[18] A. Das and A. Kumar, "Fault-aware task re-mapping for throughput constrained multimedia applications on NoC-based MPSoCs," in RSP, 2012.
[19] C. Bolchini, A. Miele, and D. Sciuto, "An adaptive approach for online fault management in many-core architectures," in DATE, 2012.
[20] A. Baumann et al., "The multikernel: a new OS architecture for scalable multicore systems," in SOSP, 2009.
[21] D. Wentzlaff and A. Agarwal, "Factored operating systems (fos): the case for a scalable operating system for multicores," ACM SIGOPS Operating Systems Review, 2009.
[22] M. Fattah et al., "Exploration of MPSoC monitoring and management systems," in ReCoSoC, 2011, pp. 1–3.
[23] J. Howard et al., "A 48-core IA-32 message-passing processor with DVFS in 45nm CMOS," in ISSCC, 2010.
[24] F. Glover and M. Laguna, Tabu Search. Norwell, MA, USA: Kluwer Academic Publishers, 1997.
[25] C. Grecu, L. Anghel, P. P. Pande, A. Ivanov, and R. Saleh, "Essential fault-tolerance metrics for NoC infrastructures," in IOLTS, 2007.
[26] D. Fick, A. DeOrio, J. Hu, V. Bertacco, D. Blaauw, and D. Sylvester, "Vicis: a reliable network for unreliable silicon," in DAC, 2009.
[27] A. Amoura, E. Bampis, and J.-C. Konig, "Scheduling algorithms for parallel Gaussian elimination with communication costs," IEEE Transactions on Parallel and Distributed Systems, 1998.
[28] "TGG: Task Graph Generator," http://sourceforge.net/projects/taskgraphgen/, 2010.
[29] C.-L. Chou, U. Ogras, and R. Marculescu, "Energy- and performance-aware incremental mapping for networks on chip with multiple voltage levels," IEEE Transactions on CAD, 2008.
[30] V. Leung et al., "Processor allocation on Cplant: achieving general processor locality using one-dimensional allocation strategies," in Cluster Computing, 2002.
[31] M. Fattah et al., "Adjustable contiguity of run-time task allocation in networked many-core systems," in ASP-DAC, 2014.
[32] S. Shamshiri, A.-A. Ghofrani, and K.-T. Cheng, "End-to-end error correction and online diagnosis for on-chip networks," in ITC, 2011.
[33] M. Fattah et al., "CoNA: dynamic application mapping for congestion reduction in many-core systems," in ICCD, 2012.
[34] E. Carvalho, N. Calazans, and F. Moraes, "Heuristics for dynamic task mapping in NoC-based heterogeneous MPSoCs," in RSP Workshop, 2007.
[35] S. Murali and G. De Micheli, "Bandwidth-constrained mapping of cores onto NoC architectures," in DATE, 2004.
[36] F. Fazzino, M. Palesi, and D. Patti, "Noxim: network-on-chip simulator," http://sourceforge.net/projects/noxim, 2008.
[37] D. Bertozzi et al., "NoC synthesis flow for customized domain specific multiprocessor systems-on-chip," IEEE Transactions on Parallel and Distributed Systems, 2005.
[38] E. B. van der Tol and E. G. Jaspers, "Mapping of MPEG-4 decoding on a flexible architecture platform," in Media Processors, 2002.
[39] R. Parikh and V. Bertacco, "uDIREC: unified diagnosis and reconfiguration for frugal bypass of NoC faults," in MICRO, 2013.
[40] S. Murali et al., "Analysis of error recovery schemes for networks on chips," IEEE Design & Test of Computers, 2005.
[41] R. Al-Dujaily et al., "Run-time deadlock detection in networks-on-chip using coupled transitive closure networks," in DATE, 2011.