2011 14th IEEE International Symposium on Object/Component/Service-Oriented Real-Time Distributed Computing Workshops
Self-Reconfiguration for Fault-Tolerant FlexRay Networks
Kay Klobedanz1, Andreas Koenig1, Wolfgang Mueller1, Achim Rettberg2
1 University of Paderborn/C-LAB, Faculty of Electrical Engineering, Computer Science and Mathematics, 33102 Paderborn, Germany
Email: [email protected], [email protected], [email protected]
2 Carl von Ossietzky University Oldenburg, Faculty II, Department Computer Science, 26129 Oldenburg, Germany
Email:
[email protected]
978-0-7695-4377-2/11 $26.00 © 2011 IEEE DOI 10.1109/ISORCW.2011.38
Abstract: In this paper we present an approach for the self-reconfiguration of FlexRay networks to increase their fault tolerance. We propose a self-organized distributed coordinator concept that performs the self-reconfiguration in case of a node failure, using redundant slots in the FlexRay schedule and combining messages in existing frames and slots to avoid a complete bus restart. The self-reconfiguration is realized by means of predetermined information about the resulting changes in the communication dependencies and (re-)assignments. This information is provided by a heuristic that determines initial configurations and, based on them, calculates valid reconfigurations for the remaining nodes of the FlexRay network. The distributed coordinator concept, which is implemented by lightweight tasks that do not consume significant resources, uses this information to perform the reconfiguration of the FlexRay network at run time and thereby increases the fault tolerance of the system. An evaluation by means of realistic safety-critical automotive real-time systems revealed that this reconfiguration approach determines valid reconfigurations for up to 80% of possible individual node failures and thereby offers applicable information for the self-reconfiguration approach. Furthermore, these results can be improved in an iterative design process to optimize the reconfigurations. The evaluation of our self-organized distributed coordinator concept and the comparison to a centrally organized solution with a dedicated coordinator demonstrate its benefits regarding the additional hardware and communication overhead and the resulting reconfiguration time, which has a great impact on the fault tolerance of the FlexRay network.
I. INTRODUCTION
The increasing number of ECUs (Electronic Control Units) in modern vehicles implies the necessity of reliable communication between these components. Another topic such systems have to cope with is flexibility: the communication infrastructure must be flexible enough to include new ECUs or mobile devices in the network. A self-configurable communication infrastructure could help to meet these requirements. This can be achieved by new design methods from the self-x domain. The increasing adoption of methods from computational intelligence, biologically inspired computing, and optimization in engineering disciplines such as mechatronics or production enables the development of a new class of autonomous systems. These systems often possess abilities that are best described by the phrase self-x: self-optimization, self-coordination, self-healing, self-repair, and many more. These properties have in common that they shift decision making from humans to the technical systems. This shift has two possible benefits. First, decisions made during the design of the system can be delayed until operation, so that the system is able to react more flexibly to different application requirements. Second, the human effort in managing such systems can be reduced, which makes the corresponding systems more cost efficient.
To raise the quality of the vehicle, it is important to build fault-tolerant systems into the network. In a distributed system, fault tolerance can be included in three ways: replication, redundancy, and diversity. Replication provides multiple identical instances of a system; tasks and requests are directed to all of them in parallel, and the correct result is chosen based on a quorum. Redundancy is characterized by multiple identical instances and a switch to one of the remaining instances in case of a failure. Diversity provides different implementations of a system that are used like replicated systems. A self-configurable system is able to provide redundancy, diversity, and replication of tasks and therefore helps to make the system more stable. In the context of self-configuration of automotive systems, redundancy of data, applications, and tasks can be used to increase fault tolerance in case of ECU failures. Crucial data, applications, or tasks are distributed as backup components on the ECUs of the vehicle system so that they can be used or executed by other ECUs if their original ECUs fail. The way of distribution, the number of replicas, and the decision which components are replicated depend on an adequate algorithm, which has to consider the costs of replication and migration as well as the load of the other ECUs.
FlexRay is the emerging standard for safety-critical automotive systems. It offers deterministic behavior, high bandwidth capacity, and redundant communication channels to increase the fault tolerance of such networks. Nevertheless, the failure of a single ECU may result in a malfunction of the whole system. To further increase the fault tolerance of a FlexRay network, node failures should be compensated by redundancy and task replication, which implies changes in the communication dependencies and a reconfiguration of the system at run time. Current automotive systems usually consist of ECUs that host functions and are also hardwired to their target sensors/actuators. A failure of a hardwired node cannot be compensated with redundancy because the connections to the peripherals get lost. Hence, we consider a modified network topology to improve fault tolerance by means of redundancy, in which we distinguish between two types of ECUs: nodes we call peripheral interfaces, which are wired to sensors and actuators and just read values from and write values to the bus, and dedicated nodes hosting the functional tasks. Since peripheral interfaces do not execute any complex task, they only require low hardware capacities, which allows cost-efficient redundancy of these components. However, here we focus on the distributed functional ECUs, which receive and provide their data via FlexRay and can therefore be considered for redundancy and reconfiguration. Because the FlexRay specification does not allow changes of the schedule at run time, we propose a self-reconfiguring approach that uses redundant slots in the schedule and combined messages in existing frames and slots to avoid the necessity of a complete bus restart in case of a node failure. To this end, we introduce a self-organized distributed coordinator concept that performs the self-reconfiguration of FlexRay by means of predetermined information about the resulting changes in the communication dependencies and (re-)assignments. To obtain this information, we present a heuristic that determines initial configurations and, based on them, calculates valid reconfigurations for the remaining nodes of the FlexRay network in case of a node failure.
The remainder of this paper is structured as follows. After the related work and the FlexRay fundamentals, a detailed description of our self-reconfiguration approach is given in Section IV. An evaluation is provided in Section V before the final section concludes with a summary.
II. RELATED WORK
In this section we give a short overview of existing approaches in the field of self-configurability, especially middleware that supports it, and of the communication infrastructure for vehicles. Afterwards, we present existing work regarding the configuration and reconfiguration of FlexRay networks. Nowadays, the best-known middleware approach is the one suggested by the Autosar consortium [1]. Within Autosar, a runtime environment (RTE) is developed to support a common infrastructure for automotive systems. The self-configurability approach developed in [2] enriches the Autosar RTE, especially by dynamic reconfiguration management through a load balancing strategy. In [3] a formal specification for developing distributed, embedded, real-time control systems is described. The middleware supports dependable, adaptive dynamic resource management based on replicated services. An additional approach for fault tolerance and dynamic reconfiguration is discussed by the authors of [4]. Again, replicated services are used in this model. In [5] a middleware architecture for telematics software based on OSGi and the AMI-C specification is presented. An application manager is introduced for telematics applications. The architecture enables in-vehicle terminals to provide various telematics services to increase driver safety. The authors of [6] describe trends for automotive systems. They give an overview of requirements for middleware systems in this area, especially of what industry demands from such middleware services: hiding the distribution and the heterogeneity of such platforms, providing high-level services (e.g., mode and redundancy management), and ensuring QoS.
The analysis of the FlexRay protocol and concepts for its optimization are discussed in work such as [7]. Publications like [8] and [9] describe concepts to determine proper configurations and parameterizations for the static segment of FlexRay. Nevertheless, these approaches assume that the executed tasks are statically linked to the nodes of the FlexRay network, which significantly decreases the flexibility and the fault tolerance of such systems. Moreover, the authors do not consider reconfigurations and redundancy to increase fault tolerance. In [10] the authors propose a replication of tasks to compensate for node failures and determine the reconfiguration capabilities of FlexRay based on these assumptions. However, this publication assumes quite strictly predefined task-to-ECU and slot assignments, whereas we maximize flexibility in the configuration and the corresponding reconfigurations. In [11] the authors present a heuristic for the configuration of FlexRay networks that represents configurations (task-to-ECU, message-to-frame, and frame-to-slot assignments) as individuals, which are coded as strings. The approach evaluates these individuals with several weighted parameters and determines an optimized configuration by means of the selection, crossover, and mutation functions of a genetic algorithm. Our reconfiguration approach from [12] is inspired by this heuristic, but we extended it by an optimized analytical methodology that, in addition to an initial configuration, calculates valid reconfigurations for the remaining nodes of the FlexRay network in case of a node failure. The results of this approach are the input for our self-reconfiguration concept.
III. FLEXRAY FUNDAMENTALS
FlexRay was introduced to implement deterministic and fault-tolerant communication systems for safety-critical distributed real-time systems. The protocol makes use of recurring communication cycles, as shown in Figure 1. Each cycle is composed of a static and an optional dynamic segment. The time-triggered static segment is based on the TDMA (Time Division Multiple Access) protocol: each transmission slot of the segment is assigned to a sender node by means of a globally known, synchronously incremented slot counter. The static segment consists of a fixed, initially defined number of equally sized static slots (2 to 1023). The event-triggered dynamic segment of the FlexRay communication cycle is optional. Because we focus on the static segment, the reader is referred to [13] for further details. The size of slots and frames, the cycle length, and several other parameters are defined by an initial setup of the FlexRay schedule, which cannot be changed during run time. Furthermore, the assignment of slots to nodes is fixed and cannot be changed without a bus restart.
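To make the slot-counter mechanism concrete, the following minimal Python sketch models one static segment in which a fixed slot-to-node table decides which node may transmit. The names `StaticSchedule`, `sender_of`, and `run_static_segment` as well as the example assignment are illustrative assumptions and not part of the FlexRay specification or of our implementation.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class StaticSchedule:
    """Hypothetical model of a static segment; not a FlexRay API."""
    num_slots: int                # fixed by the initial setup
    slot_to_node: Dict[int, str]  # slot ID (starting at 1) -> sending node

    def sender_of(self, slot_counter: int) -> Optional[str]:
        """Return the node that owns the slot, or None if the slot is unassigned."""
        if not 1 <= slot_counter <= self.num_slots:
            raise ValueError("slot counter outside the static segment")
        return self.slot_to_node.get(slot_counter)


def run_static_segment(schedule: StaticSchedule) -> List[Tuple[int, Optional[str]]]:
    """Increment the slot counter through one cycle and record who may send."""
    return [(slot, schedule.sender_of(slot))
            for slot in range(1, schedule.num_slots + 1)]


# Example assignment (illustrative values only); slot 3 is left unassigned.
schedule = StaticSchedule(num_slots=4, slot_to_node={1: "ECU1", 2: "ECU2", 4: "ECU1"})
cycle = run_static_segment(schedule)
```

In this sketch the table is immutable after construction, which mirrors the restriction that slot assignments cannot be changed without a bus restart; an unassigned slot such as slot 3 is the kind of spare capacity a reconfiguration could rely on.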
Fig. 1. Components of the FlexRay communication cycle (static segment with slot 1, slot 2, ..., slot n; optional dynamic segment; symbol window; NIT).
A FlexRay frame basically consists of three segments: header, payload, and trailer (see Figure 2). The header segment contains protocol-relevant information such as the payload length. The payload segment contains the data bytes, which are numbered consecutively, increasing by one with each subsequent byte. The trailer segment contains CRC data. As shown in Figure 2, the payload data of a frame can be composed of several PDUs (Protocol Data Units). The data of a specific PDU can be accessed by means of a unique ID. This frame packing allows different tasks to send their data combined in one frame. For a more detailed description of the FlexRay protocol and its components, the reader is referred to [13].
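As an illustration of frame packing, the sketch below combines several PDUs in one payload and retrieves them by ID. The byte layout (one ID byte and one length byte in front of each PDU), the function names, and the default capacity of 254 bytes are assumptions made for this example only; they do not describe the layout used by FlexRay controllers or by our implementation.

```python
from typing import Dict


def pack_pdus(pdus: Dict[int, bytes], capacity: int = 254) -> bytes:
    """Pack PDUs into one payload as [id, length, data...] records (assumed layout)."""
    payload = bytearray()
    for pdu_id, data in sorted(pdus.items()):
        if not 0 <= pdu_id <= 255 or len(data) > 255:
            raise ValueError("PDU ID and length must each fit into one byte")
        payload += bytes([pdu_id, len(data)]) + data
    if len(payload) > capacity:
        raise ValueError("PDUs do not fit into one frame")
    return bytes(payload)


def unpack_pdus(payload: bytes) -> Dict[int, bytes]:
    """Recover the PDUs of a payload, keyed by their ID."""
    pdus: Dict[int, bytes] = {}
    pos = 0
    while pos < len(payload):
        pdu_id, length = payload[pos], payload[pos + 1]
        pdus[pdu_id] = payload[pos + 2:pos + 2 + length]
        pos += 2 + length
    return pdus


# Two tasks on the same ECU share one frame (illustrative data).
frame_payload = pack_pdus({0: b"\x10\x20", 1: b"\x01"})
received = unpack_pdus(frame_payload)
```

A receiver interested only in PDU 1 would read `received[1]` and ignore the rest, which is what allows several tasks to share a single slot.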
Fig. 2. FlexRay frame format (header; payload with PDU 0 to PDU n carrying Data 0 to Data n; trailer).
IV. APPROACH
In this section we present our self-reconfiguration approach for fault-tolerant FlexRay networks. After a basic overview, we outline the self-reconfiguring distributed coordinator concept and describe our reconfiguration approach for the static segment as well as the determination of the necessary parameters and information.
A. Overview
Figure 3 gives an overview of the components of our proposed self-reconfiguring FlexRay network architecture. The defined application of the embedded system is realized by means of several communicating tasks executed on the distributed functional ECUs of our proposed FlexRay network topology (ref. Section I). In addition to these application tasks, we add a task instance of our distributed coordinator to every node of the system, as proposed in [14]. These distributed coordinator tasks perform the self-reconfiguration of the FlexRay network in case of an ECU failure to increase the fault tolerance of the system. Furthermore, this self-organized approach is superior to a centrally organized solution with a dedicated coordinator node (ref. Section V-A).
Fig. 3. Overview of the self-reconfiguring FlexRay network architecture (ECU 1 to ECU n, each hosting a coordinator instance Coor 1 to Coor n and its application tasks).
B. Distributed Coordinator Concept
As mentioned before, we add distributed coordinator tasks to every node of the system. This distributed coordinator concept is implemented by lightweight tasks, which do not consume significant resources or run time. As shown in Figure 4, each instance of the distributed coordinator maintains task lists with the running and redundant tasks of each node for every possible configuration, and a communication matrix that provides the necessary information about the communication dependencies. The latter contains the assigned transmission and reception slots of the FlexRay schedule (ref. Table I). These data structures are updated when a node failure requires modifications of the task and slot assignments, which results in a self-reconfiguration. Because the memory demand of these data structures is negligible compared to the application data, maintaining them on every node does not cause a noteworthy hardware overhead.
Fig. 4. Coordinator task maintaining task lists and COM matrix.
Slot ECU ECU ECU ECU
1 2 3 4
s1 tx rx rx -
s2 tx -
static segment s3 s4 rx rx tx rx tx
... ... ... ... ...
TABLE I. EXAMPLE OF A COMMUNICATION MATRIX FOR THE STATIC SEGMENT OF FLEXRAY (TOPOLOGY WITH 4 ECUS).
The distributed coordinator tasks must be configured to receive and monitor the messages of all relevant slots in order to detect node failures and to coordinate the appropriate reconfigurations of the communication and the necessary task and slot assignments. Therefore, the task lists and the communication matrix are generated from the task-to-ECU and slot-to-ECU assignments and the communication between the nodes, as determined with our proposed reconfiguration approach (ref. Section IV-C). Figure 5 shows the flowchart of the self-reconfiguration activities performed by the distributed coordinator concept when a fault is detected. If a malfunction of an ECU is detected, each coordinator task checks via its task lists and communication matrix whether the necessary reconfiguration implies a reassignment of one or more additional tasks to its host. If no additional task-to-ECU assignments are performed for its hosting node, the coordinator instance just has to reconfigure the receiving slots for the input data of the executed tasks. By means of the task lists, each coordinator task knows for every possible reconfiguration which node(s) will resume executing the corresponding task(s) in case of a node failure. The communication matrix contains the necessary information about the resulting communication reconfigurations and the redundant slot assignments for the reconfigured task-to-ECU assignments. If the node failure implies the reassignment of one or more tasks to the hosting node, the coordinator additionally initiates the activation of the corresponding task(s) and the assignment of their transmission slots for the reconfiguration of the system.
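The decision logic described above can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not our implementation: the classes `Plan` and `Coordinator`, their fields, and the example slot numbers are hypothetical, and failure detection as well as the exchange of status messages are left out.

```python
from dataclasses import dataclass
from typing import Dict, List, Set


@dataclass
class Plan:
    """Precomputed reconfiguration for the failure of one ECU (hypothetical layout)."""
    takeover: Dict[str, List[str]]  # backup ECU -> redundant tasks it activates
    tx_slots: Dict[str, Set[int]]   # backup ECU -> redundant transmission slots
    rx_slots: Dict[str, Set[int]]   # ECU -> receiving slots after the reconfiguration


@dataclass
class Coordinator:
    host: str
    running: Set[str]       # task list: tasks currently executed on the host
    redundant: Set[str]     # task list: inactive backup instances stored on the host
    tx: Set[int]            # COM matrix entries of the host: transmission slots
    rx: Set[int]            # COM matrix entries of the host: receiving slots
    plans: Dict[str, Plan]  # failed ECU -> precomputed plan

    def on_ecu_failure(self, failed: str) -> List[str]:
        plan = self.plans[failed]
        # Every coordinator reconfigures the receiving slots of its host.
        self.rx = set(plan.rx_slots.get(self.host, self.rx))
        actions = ["reconfigure receiving slots"]
        new_tasks = plan.takeover.get(self.host, [])
        if new_tasks:
            # Only a backup node activates tasks and takes over transmission slots.
            for task in new_tasks:
                self.redundant.remove(task)  # KeyError if no backup instance is stored
                self.running.add(task)
            self.tx |= plan.tx_slots.get(self.host, set())
            actions += ["activate task(s)", "assign transmission slots"]
        return actions


# Illustrative plan: if ECU2 fails, ECU1 resumes T3 and sends in redundant slot 5.
plans = {"ECU2": Plan(takeover={"ECU1": ["T3"]},
                      tx_slots={"ECU1": {5}},
                      rx_slots={"ECU1": {2}, "ECU3": {1, 5}})}
coor1 = Coordinator("ECU1", {"T1"}, {"T3"}, {1}, {2, 3}, plans)
coor3 = Coordinator("ECU3", {"T4"}, set(), {4}, {1, 3}, plans)
actions1 = coor1.on_ecu_failure("ECU2")
actions3 = coor3.on_ecu_failure("ECU2")
```

Because every coordinator holds the same precomputed plans, each node can derive its own actions locally; in this sketch ECU3 only re-points its receiving slots, while ECU1 additionally activates the backup instance of T3.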
Fig. 5. Self-reconfiguration activities of a distributed coordinator (flowchart: ECU failure detected; check task lists and COM matrix; (re-)assignment of task(s)? YES: reconfigure receiving slots, activate task(s), and assign transmission slots; NO: reconfigure receiving slots).
The communication matrix and the task lists provide the distributed coordinator tasks with the necessary information about task-to-ECU and slot-to-ECU assignments for each reconfiguration. Nevertheless, to guarantee proper communication, the correct reconfiguration of the receiving slots for every application task must be ensured. Therefore, every coordinator task propagates the reassignments of transmission slots it has performed. By means of this information, the self-reconfiguration of the slot-to-task assignments is finalized. Instead of using an additional slot for this communication, the distributed coordinator concept makes use of frame packing and attaches the messages to application frames in existing slots (ref. Section III). Because the messages between the coordinator tasks do not contain application data, a small data type based on an appropriate coding is sufficient for these PDUs. With these messages, the distributed coordinator tasks can exchange control and status messages as described above. This reduces the resulting communication overhead and leaves more slots to the application tasks. Furthermore, it enables in-cycle responses and in most cases reduces the number of required cycles. Hence, we minimize the time needed for a self-reconfiguration after an ECU failure.
C. Reconfiguration Approach
As input for the self-reconfiguration concept, our approach from [12] determines suitable initial configurations and reconfigurations based on a Directed Acyclic Graph (DAG) derived from the task dependencies, the given network topology, and the parametrization of the FlexRay network. To this end, several sub-steps have to be performed:
• Task-to-ECU assignment,
• Consideration of task dependencies,
• Message-to-frame/frame-to-slot/slot-to-ECU assignment,
• Scheduling and validation for each ECU.
Based on a determined initial configuration, it calculates possible reconfigurations for the remaining ECUs of the FlexRay network to compensate for node failures. Figure 6 gives an overview of the whole approach. It calculates an initial configuration by means of a predefined FlexRay communication cycle, a given network topology, and a DAG representing the dependencies between the distributed tasks, with appropriate assignments resulting in a valid schedule. Based on the determined initial setup, we simulate the failure of each ECU separately to retrieve valid reconfigurations for the remaining system. For each failure, our approach calculates and validates reconfigurations, i.e., the reallocation/reassignment of the task set to the remaining ECUs and of the messages on the FlexRay bus, including additional slot-to-ECU assignments. It considers the resulting effort and overhead and provides this information as feedback to the system developer and as input for our self-reconfiguring concept.
Fig. 6. Overview of (re-)configuration approach (labels: generating an initial configuration; ECU failure; alternatives: backup node(s), mirrored task set, reallocation of task set; reconfiguration of task set; evaluating the effort and overhead of reconfiguration; configuration of communication system without failure; reconfigured system after ECU failure; effort and overhead of reconfiguration).
The underlying model of our approach comprises distributed systems composed of ECUs connected via a communication bus. The task-to-ECU assignment can be changed dynamically during run time. The communication is realized via local inter-process communication (IPC) or bus communication, e.g., FlexRay. The scheduling in our model is performed by means of earliest deadline first (EDF) [15]. The utilization-based schedulability test of EDF for each ECU also validates the determined (re-)configurations. Each task τi of a task set Γ is defined by its execution time Ci and its period Ti as:
$\Gamma = \{\tau_i(T_i, C_i),\ i = 1, \ldots, n\}$.
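The task model and the utilization-based EDF test can be written down directly. The following sketch is an illustration only: the class `Task`, the function names, and the sample values are assumptions, and the test covers just the utilization bound U = Σ Ci/Ti ≤ 1 for independent periodic tasks on one ECU, not the precedence and slot constraints handled by the rest of the approach.

```python
from dataclasses import dataclass
from fractions import Fraction
from typing import Iterable


@dataclass(frozen=True)
class Task:
    name: str
    period_us: int     # T_i
    exec_time_us: int  # C_i


def utilization(tasks: Iterable[Task]) -> Fraction:
    """U = sum of C_i / T_i, computed exactly to avoid rounding at the bound."""
    return sum((Fraction(t.exec_time_us, t.period_us) for t in tasks), Fraction(0))


def edf_schedulable(tasks: Iterable[Task]) -> bool:
    """Utilization-based EDF test for the task set of one ECU."""
    return utilization(tasks) <= 1


# Sample task set on a single ECU (illustrative values).
tasks = [Task("t1", 1000, 150), Task("t2", 1000, 175),
         Task("t3", 1000, 300), Task("t4", 1000, 250)]
overloaded = tasks + [Task("t5", 1000, 200)]
```

Here `utilization(tasks)` is 7/8, so the set passes the test, whereas adding the fifth task raises the utilization above 1 and the configuration would be rejected.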
provide input data to the system and have no predecessor. Γout defines the tasks that provide output data and have no successors. Γmid comprises all other nodes of the DAG, which have predecessors and successors. We assume that, in addition to Γmid, also Γin and Γout can be considered for the reconfiguration, because they are executed on reconfigurable ECUs and not on peripheral interfaces (see Section I). The primary parameters of the modeled FlexRay schedule are a bandwidth of 10 Mbit/s, a cycle length equal to the maximum period of the considered tasks (Tcycle = Tmax), a variable slot length Tslot, and a corresponding number of slots (#slots = Tcycle/Tslot). To calculate a valid initial configuration, we developed a heuristic based on a genetic algorithm [11]. Because the initial configuration just has to be valid and not optimal, we omit the optimizing crossover and mutation functions in our resulting pseudo-genetic algorithm. Using our pseudo-genetic algorithm, we focus on the calculation of individuals as candidates for the initial configuration. The assignment of tasks to ECUs is followed by the assignment of messages to frames in static slots of the communication cycle (slot-to-ECU assignment) and the scheduling of the tasks on each ECU, which also validates the individual. Finally, each individual is evaluated by means of a fitness function. Based on the input containing the model properties, our approach calculates an initial generation of individuals as described above. Each calculated individual is compared to the existing individuals of the current generation regarding the task-to-ECU assignment to guarantee a heterogeneous solution space. For further calculated generations, the selected individuals are analyzed and compared analogously to [11]. After calculating one or more generations, the best individual of the generation, sorted by fitness, is selected as the initial configuration for further application.
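The schedule parameters named above (Tcycle = Tmax and #slots = Tcycle/Tslot) and the period restriction Ti · 2^xi = Tmax can be checked with a short helper. The sketch below is an assumption-laden illustration: the function names are hypothetical, and the requirement that the slot length divides the cycle length evenly is our simplification for the example.

```python
from typing import List, Tuple


def period_exponent(period_us: int, t_max_us: int) -> int:
    """Return x_i such that period * 2**x_i == t_max; raise ValueError otherwise."""
    ratio, remainder = divmod(t_max_us, period_us)
    if remainder != 0 or (ratio & (ratio - 1)) != 0:
        raise ValueError(f"period {period_us} is not t_max / 2**x")
    return ratio.bit_length() - 1


def schedule_parameters(periods_us: List[int], t_slot_us: int) -> Tuple[int, int, List[int]]:
    """Derive cycle length, number of static slots, and the exponents x_i."""
    t_cycle = max(periods_us)  # T_cycle = T_max
    exponents = [period_exponent(p, t_cycle) for p in periods_us]
    if t_cycle % t_slot_us != 0:
        # Simplifying assumption of this sketch: slots tile the cycle exactly.
        raise ValueError("slot length must divide the cycle length")
    return t_cycle, t_cycle // t_slot_us, exponents


# Periods of 4 ms, 2 ms, and 1 ms with an assumed slot length of 100 microseconds.
t_cycle, num_slots, exponents = schedule_parameters([4000, 2000, 1000], t_slot_us=100)
```

With these values the cycle is 4000 μs long and contains 40 slots; a period of 3000 μs would be rejected because it is not Tmax divided by a power of two.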
To speed up the search process, we also implemented a method that uses the first valid individual. The first step in calculating an individual is the task-to-ECU assignment, which is restricted by the model properties and communication dependencies; e.g., EDF scheduling implies the maximum utilization ($U = \sum_{i=1}^{n} C_i/T_i \le 1$) as the basic restriction for n tasks on each ECU. Although our approach supports arbitrary randomized or optimized strategies to determine the task-to-ECU assignment, we propose an optimized assignment based on the task dependencies. This strategy uses information about task dependencies to maximize the local IPC and to minimize the number of messages sent over the FlexRay bus. This reduces the number of assigned slots, which increases the flexibility for the reconfigurations. Figure 8(a) shows an example dependency graph of tasks to assign. First of all, our approach checks Γin = {τ1, τ2, τ3}. If any of the tasks in this set is not yet assigned to an available ECU, the assignment is performed randomly based on the utilization of each ECU. Starting from this initial assignment, the algorithm uses depth-first search to assign succeeding tasks to the same ECU. If this is not possible because of the utilization, the algorithm chooses another ECU randomly. Figure 8(b) shows the resulting assignment for 2 ECUs. The tasks in Γin are randomly assigned (marked in lighter and darker blue). Along
The mentioned communication dependencies between tasks can be represented by means of a DAG, as shown in Figure 7. Such a graph G = (Γ, M) is composed of vertices (task set Γ) and edges (set of messages M = {m1, ..., mm}). The dependencies in the example of Figure 7 imply that the execution of τ1 has to be finished before τ3 is executed, i.e., τ3 needs input data from τ1 to execute. The tasks and the whole DAG have a period of 1000 μs, which means that the execution of the whole DAG must be finished within this time. Our model considers both different messages sent from one task [11] (e.g., τ1 sending m1 and m2), which have to be assigned to the frames/slots separately, and the concept of each task sending only one message for all receivers (m1 = m2) [16].
Fig. 7. Example of a DAG (τ1: 150 μs, τ2: 175 μs, τ3: 300 μs, τ4: 250 μs; messages m1, m2, m3; period 1000 μs).
For complex systems, a DAG can be composed of several
subgraphs with different periods. Like in [11], we presume for all periods Ti and the maximum period Tmax = max(Ti): $T_i \cdot 2^{x_i} = T_{max},\ x_i \in \mathbb{N}_0,\ i = 1, \ldots, n$. The resulting hyperperiod Tmax avoids scheduling over multiple communication cycles. For example, the periods of all subgraphs are 4 ms, 2 ms, 1 ms, ... for Tmax = 4 ms. Special messages are transitions between two subgraphs, which we label as split-messages. Our approach schedules these split-messages as follows: If Tsend < Trecv, τsend writes messages with Trecv because τrecv reads with a lower frequency. If Tsend > Trecv, τsend sends with Tsend. In this case it is important that τsend sends its message early enough to minimize obsolete data. We define three different types of task sets in a dependency graph, as also proposed in [11]. Γin is the set of tasks, which
Fig. 8. Example dependency graph for optimized task-to-ECU assignment: (a) tasks to assign; (b) assignment to 2 ECUs (tasks τ1 to τ7, messages m1 to m6).
for the sending task (start of the assigned slot) is calculated as $d^* = (\text{slot-ID} - 1) \cdot T_{slot}$. The deadlines of the predecessors are calculated bottom-up by means of the DAG. The updated release time of a receiving task (end of the assigned slot) is calculated as $r^* = \text{slot-ID} \cdot T_{slot}$. The release times of the successors are calculated top-down. Based on these values, rm and dm are also updated. After all assignments are determined, the individual is validated implicitly by scheduling each ECU: every ECU is scheduled with EDF over the hyperperiod (communication cycle length). Hence, tasks with Ti < Tmax have to be instantiated and scheduled multiple times with corresponding release times and deadlines, whose absolute values are calculated from the relative values. The individual is valid if all deadlines are met and the schedulability test is positive. Finally, the individuals of one generation are evaluated by means of a fitness function based on several weighted parameters. These parameters are the number of empty slots, which can be increased by IPC or frame packing, the number of unused ECUs, and the last assigned slot. Additionally, status information on the validity of the task-to-ECU and slot assignments and of the final schedule is included.
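The slot bookkeeping of this section (the first and last usable slot for a message, and the updated deadline d* and release time r* after a slot has been chosen) can be sketched as follows. The function names are hypothetical, and the rounding is an assumption: we round the release time up and the deadline down to slot boundaries so that a transmission starts no earlier than rm and ends no later than dm.

```python
from typing import Tuple


def slot_window(r_m: int, d_m: int, t_slot: int) -> Tuple[int, int]:
    """First and last usable slot ID (IDs start at 1) for a message.

    Assumption: slot k occupies the interval [(k - 1) * t_slot, k * t_slot).
    """
    first = -(-r_m // t_slot) + 1  # ceil(r_m / t_slot) + 1
    last = d_m // t_slot           # floor(d_m / t_slot)
    return first, last


def sender_deadline(slot_id: int, t_slot: int) -> int:
    """d* = (slot-ID - 1) * T_slot: the sender must finish before the slot starts."""
    return (slot_id - 1) * t_slot


def receiver_release(slot_id: int, t_slot: int) -> int:
    """r* = slot-ID * T_slot: the receiver may start once the slot has ended."""
    return slot_id * t_slot


# Illustrative values in microseconds: message ready at 150, needed by 700.
first, last = slot_window(r_m=150, d_m=700, t_slot=100)
chosen = first
d_star = sender_deadline(chosen, 100)
r_star = receiver_release(chosen, 100)
```

In this example slots 3 to 7 are usable; choosing slot 3 gives the sender a deadline of 200 μs and releases the receiver at 300 μs, which illustrates the delay between the completion of the sending task and the start of the transmission.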
the communication path, the assignments proceed as described, resulting in the assignment shown in Figure 8(b). Every message mi that is not handled via IPC has to be assigned to a static slot in the communication cycle, but not every message needs a dedicated slot, because multiple messages from one task/ECU can be packed into one frame (see Section III). This significantly increases the flexibility of the initial configuration and the reconfigurations. The determined release times and deadlines of the tasks must be aligned if they communicate via FlexRay. Based on the initial values, they are recalculated to determine the first and last assignable slot for a transmission. By means of the equations for EDF with precedence constraints [15], we calculate the release time rm of a message as $r_m = \max(r_{send} + C_{send})$ and the deadline as $d_m = \min(d_{recv} - C_{recv})$. After initializing rm and dm, we determine the available slots for the message. Figure 9 illustrates the possible delay between the completion of a task τ1 and the beginning of a transmission in an available slot. Starting the slot IDs with 1, the first available slot with length Tslot is calculated as $slot_{first} = r_m / T_{slot} + 1$, and the last usable slot is calculated as $slot_{last} = d_m / T_{slot}$. The difference between slot_last and slot_first is the number of possible slots. Our approach randomly chooses a slot that is free or that already transmits a frame from the same ECU with free capacity. If the period of a sending task is shorter than the communication cycle, multiple slots have to be assigned. To avoid multiple tests per cycle, the tasks are sorted in ascending order of period length. Thus, only the available slots for the first message instance have to be analyzed. As shown in Figure 9, the assignment of a slot changes the release time of the reading task (τ2). To keep the available slots valid, the values of the release times and deadlines must be updated after every assignment. The updated deadline
Fig. 9. Example for a valid slot assignment (labels: dependencies τ1, m1, τ2; schedule with writing and reading; TDMA cycle with slot 1, slot 2, slot 3 [m1], ..., slot n; time axis t).
D. Reconfiguration in Case of a Node Failure
Our approach uses the results and information of the initial configuration to calculate appropriate reconfigurations for the compensation of node failures. Although several strategies for the compensation of node failures are supported (see Figure 6), we focus here on the reallocation of the tasks to the remaining ECUs. Our approach determines valid reconfigurations and evaluates their resulting effort and overhead, i.e., how many additional redundant slots have to be included in the communication matrix and how many redundant task instances have to be stored on each ECU. Figure 10 illustrates an example for two ECUs with their assigned task sets Γ1 and Γ2. After a failure of ECU2, several sub-steps have to be performed to obtain an adequate reconfiguration. The effort and overhead of a reconfiguration in terms of additional slots should be minimized. Therefore, it is desirable that, after a reallocation of Γ2 to ECU1, the messages m3 and m4 are transmitted in already used slots.
Fig. 10. Reconfiguration for reallocated tasks (Γ2 on ECU1).
The general sub-steps for a reconfiguration are:
1) Reallocation of the task set to the remaining ECUs considering their particular utilization.
next iterations. Therefore, we determine the hosting ECU for every predecessor and successor. The ECU executing most of these tasks is examined as a candidate for the assignment. In Figure 12, τx
2) Reconfiguration based on the initial configuration. Here, the effort and overhead should be minimized.
Instead of using the presented pseudo-genetic algorithm for the calculation, we propose an optimized analytical methodology for the reallocation of the task set. First, we have to ensure that the n remaining ECUs (ECU_i^ok) provide sufficient resources to execute the reallocated task set of the failing ECU (ECU_fail), i.e., that we can store the redundant task instances and activate them in case of a node failure. The available resources must be sufficient to handle the utilization U of the faulty ECU:
$U(ECU_{fail}) \le \sum_{i=1}^{n} \left(1 - U(ECU_i^{ok})\right)$.
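The capacity condition above can be evaluated before any detailed reallocation is attempted. The following sketch is illustrative: the function names and the utilization values are assumptions. Note that the aggregate check is only a precondition, since whether each individual task fits on a concrete ECU is decided by the subsequent assignment steps.

```python
from fractions import Fraction
from typing import Iterable


def spare_capacity(u_remaining: Iterable[Fraction]) -> Fraction:
    """Sum of the unused utilization (1 - U) over all remaining ECUs."""
    return sum((1 - u for u in u_remaining), Fraction(0))


def reallocation_possible(u_failed: Fraction, u_remaining: Iterable[Fraction]) -> bool:
    """Check U(ECU_fail) <= sum over the remaining ECUs of (1 - U(ECU_i^ok))."""
    return u_failed <= spare_capacity(u_remaining)


# Two remaining ECUs with utilizations 3/4 and 1/2 (illustrative values).
remaining = [Fraction(3, 4), Fraction(1, 2)]
fits = reallocation_possible(Fraction(1, 2), remaining)
too_much = reallocation_possible(Fraction(9, 10), remaining)
```

In this example the remaining ECUs offer a spare utilization of 3/4, so a failed ECU with a utilization of 1/2 can in principle be compensated, while one with 9/10 cannot.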
Fig. 11(a). Failure of ECU1.
Fig. 12. Example for the determination of an adequate ECU.
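The ECU selection heuristic of this section (prefer the ECU that hosts most predecessors and successors, otherwise fall back to an ECU with capacity that hosts a task with a period not longer than that of the reallocated task) can be sketched as below. This is a simplified illustration, not the authors' code: the function name, the data layout, and the capacity limit of 1.0 per ECU are assumptions.

```python
from collections import Counter


def pick_ecu(task_util, task_period, neighbour_hosts, ecu_util, ecu_periods):
    """Choose an ECU for a task that has to be reallocated.

    neighbour_hosts: hosting ECU of each predecessor/successor (remaining ECUs only)
    ecu_util:        current utilization of each remaining ECU
    ecu_periods:     periods of the tasks currently hosted by each remaining ECU
    """
    def fits(ecu):
        # Assumption: an ECU can be loaded up to a utilization of 1.0.
        return ecu_util[ecu] + task_util <= 1.0

    # Preferred candidate: the ECU hosting most connected tasks (maximizes IPC).
    counts = Counter(neighbour_hosts)
    if counts:
        best, _ = counts.most_common(1)[0]
        if fits(best):
            return best

    # Fallback: an ECU with capacity that hosts a task with period <= task_period,
    # so that at least one existing slot may be reusable via frame packing.
    for ecu in ecu_util:
        if fits(ecu) and any(p <= task_period for p in ecu_periods.get(ecu, [])):
            return ecu
    return None


# Situation similar to Figure 12: two connected tasks on ECU1, one on ECU2.
hosts = ["ECU1", "ECU2", "ECU1"]
periods = {"ECU1": [10, 10], "ECU2": [5]}
preferred = pick_ecu(0.2, 10, hosts, {"ECU1": 0.5, "ECU2": 0.4}, periods)
fallback = pick_ecu(0.2, 10, hosts, {"ECU1": 0.9, "ECU2": 0.4}, periods)
```

Ties between equally connected ECUs are resolved arbitrarily here; the text does not specify a tie-breaking rule.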
is connected to tasks executed on ECU1 and ECU2. Because ECU1 hosts two of these tasks, it is analyzed for a reallocation of τx. If there is no adequate candidate or the utilization of the determined ECU is too high, the task is assigned to an ECU that offers sufficient capacity and hosts at least one task τ* with a period T(τ*) ≤ T(τx). This guarantees that, for bus communication, there is at least one more task whose slot may be usable by τx to send its message via frame packing. The determination of valid slots is nearly identical to the initial configuration, including the updating of release times and deadlines. Here, however, we distinguish between optimal assignments, where existing frames and slots are reused (frame packing), and valid solutions with as few additional frames and slots as possible. Figure 13 illustrates this distinction. If m4 with T(m4) = T_cycle/2 is transmitted by ECU1, only slots with ID = 2 + i · #slots/2 for i ∈ {0, 1} are suitable for an optimal assignment, in this case slots 2 and 7 (second instance of m4). If m4 is sent by ECU2, r(m4) implies slot 4 as the earliest transmission possibility. In this case slots 3 and 8 are omitted due to release time constraints and only slots 4 and 9 are applicable. Hence, one additional slot (9) transmitting m4 is necessary. For the reconfiguration the messages are processed
Fig. 11(b). Failure of ECU3. Examples for ECU failures.
If a reallocation is possible, our approach initializes the assignment of the remaining ECUs and slots. Here, we have to consider that slots assigned to the faulty ECU are no longer available. Figure 11(a) shows a failure of ECU1 executing τ1. The slots assigned to ECU1 are lost, and m1 and m2 need a reassignment. Furthermore, we have to check the received messages of the reallocated tasks. If, as shown in Figure 11(b), a failure of ECU3 occurs, it depends on the reallocation of τ3 whether the reconfiguration must assign a slot to m2 or IPC can be used. The assignments of all tasks and messages not affected by the node failure are retained unchanged. This results in an initial “empty” reconfiguration, reduced by the faulty ECU, with consistent assignments, release times, and deadlines. Based on this, our approach calculates an adequate reconfiguration with appropriate ECU and slot assignments.
The assignment of the reassigned task set Γ_fail is performed by means of the DAG along the communication paths, analogous to the initial configuration. Our approach starts with an iteration over Γ_current = Γ_in. If τ_current ∈ Γ_fail, this task has to be reallocated. The algorithm proceeds with the updated task set Γ*_current, composed of the successors of the tasks in Γ_current, until every task in Γ_fail is processed. We optimize the reconfiguration by determining an adequate “best possible” ECU assignment for every task in Γ_fail. Figure 12 shows an example DAG for a failure of ECU x and the resulting reassignment of τx. Because the assignment is performed top down, it is guaranteed that the predecessors are already processed. The algorithm maximizes the IPC to all connected tasks in the DAG, assuming that a proper combination of period, ECU, and slot assignment for the messages m3 and m4 will probably not be determined in the
Fig. 13. Example for slot assignment: T(m4) = T_cycle/2 (communication cycle with slots 1 to 10).
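The slot arithmetic of the Figure 13 example (a message sent twice per cycle in a cycle of 10 slots) can be reproduced with the sketch below. The function names are hypothetical, and the slot ownership sets for ECU1 and ECU2 are assumptions chosen to match the textual description of the example; they are not read from the figure itself.

```python
def instance_slots(first_slot, slots_per_cycle, instances_per_cycle):
    """Slot IDs of all instances of a message within one communication cycle."""
    if slots_per_cycle % instances_per_cycle != 0:
        raise ValueError("instances must divide the number of slots evenly")
    stride = slots_per_cycle // instances_per_cycle
    return [first_slot + i * stride for i in range(instances_per_cycle)]


def additional_slots(needed, owned):
    """Slots that are needed but not yet assigned to the sending ECU."""
    return [slot for slot in needed if slot not in owned]


# m4 is sent twice per cycle (T(m4) = T_cycle / 2), with 10 slots per cycle.
ecu1_needed = instance_slots(2, 10, 2)                     # optimal case on ECU1
ecu1_extra = additional_slots(ecu1_needed, {1, 2, 6, 7})   # assumed ECU1 slots

ecu2_needed = instance_slots(4, 10, 2)                     # earliest slot is 4 on ECU2
ecu2_extra = additional_slots(ecu2_needed, {3, 4, 8})      # assumed ECU2 slots
```

Under these assumptions the ECU1 assignment reuses existing slots only, whereas the ECU2 assignment needs one extra slot, which matches the distinction between optimal and valid solutions.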
sorted by period lengths and release times. The valid slots, determined by means of the release time and deadline of mi, are analyzed sequentially regarding the following properties: Are the current and possible subsequent slots for multiple instances of mi assigned to the same ECU? Do all considered frames offer enough capacity for the message instances? Candidates are sorted and finally assigned based on the collective number
by the utilization of inter-process communication and frame packing to reduce the resulting communication overhead and increase the flexibility. Here we consider a maximum number of #redundantSlots = 2 per ECU (ref. Section V-B). Figure 14 illustrates the determination of a communication matrix by means of an initial configuration and appropriate reconfigurations calculated with our reconfiguration approach. It shows an example of a communication matrix resulting from an initial configuration and one reconfiguration to compensate for the failure of ECU2. In the initial configuration one task is assigned to each ECU. Thus, one transmission slot is assigned to each ECU as well. To compensate for the failure of ECU2, the task τ2 is reassigned to ECU1. This implies the reservation of an additional transmission slot for ECU1 in the communication matrix. Finally, the complete communication matrix is determined considering all necessary redundant transmission slots for every reconfiguration.
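The rule that each node receives the maximum number of additional slots required by any single reconfiguration can be sketched as follows. The data layout and names are assumptions, and apart from the ECU2 failure described above (one extra slot for ECU1), the example values are invented for illustration.

```python
def redundant_slots(additional_per_reconf):
    """Per-ECU maximum of additional slots over all reconfigurations.

    additional_per_reconf: {reconfiguration id: {ecu: additional slots needed}}
    """
    result = {}
    for per_ecu in additional_per_reconf.values():
        for ecu, extra in per_ecu.items():
            result[ecu] = max(result.get(ecu, 0), extra)
    return result


def slots_in_com_matrix(initial_slots, redundant):
    """Total static slots per ECU: initial assignment plus redundant slots."""
    return {ecu: n + redundant.get(ecu, 0) for ecu, n in initial_slots.items()}


# Four ECUs with one slot each, as in the initial configuration of the example.
initial = {"ECU1": 1, "ECU2": 1, "ECU3": 1, "ECU4": 1}
additional = {
    "fail_ECU2": {"ECU1": 1},             # tau2 moves to ECU1
    "fail_ECU3": {"ECU1": 2, "ECU4": 1},  # hypothetical second reconfiguration
}
redundant = redundant_slots(additional)
matrix = slots_in_com_matrix(initial, redundant)
```

Taking the maximum rather than the sum reflects that only one reconfiguration is active at a time, so redundant slots can be shared between different failure cases.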
of slots needed for all transmitted instances of mi in the period of one cycle. The validation of the reconfiguration is also performed by a concluding EDF scheduling of every ECU. If the schedulability test is positive, our approach has determined a valid reconfiguration with no or a minimal number of additional slots, as intended.
E. Determination of Task Lists and Communication Matrix
As mentioned in Section IV-B, the distributed coordinator concept needs information about the resulting changes in the task-to-ECU assignment and the communication in case of a node failure to perform the self-configuration of the FlexRay network. The information on changes in the task-to-ECU assignments for all possible ECU failures is contained in individual task lists for each resulting reconfiguration process. Listing 1 shows part of an example task list, which is represented in an XML scheme. These task lists are generated from the results of the reconfiguration approach described in Section IV-C, which determines the corresponding task-to-ECU assignments for all valid reconfigurations based on the initial configuration and assignments. In case of a self-reconfiguration, every distributed coordinator instance performs the reassignment and activation of the corresponding tasks for its hosting node by means of the appropriate task list. Furthermore, the analysis of the determined task lists allows us to specify the redundant task instances that have to be stored at every node to be available for activation during the self-reconfiguration processes. Storing only these necessary task subsets on each ECU significantly reduces the required hardware overhead of the approach.
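How the redundant task instances per node follow from the task lists can be illustrated with a short sketch. The XML representation is replaced here by plain dictionaries, and all names and example data are assumptions made for illustration.

```python
def redundant_instances(initial, task_lists):
    """Tasks each ECU must store in addition to its initially assigned tasks.

    initial:    {ecu: set of tasks in the initial configuration}
    task_lists: {failed ecu: {ecu: set of tasks after the reconfiguration}}
    """
    store = {ecu: set() for ecu in initial}
    for failed, assignment in task_lists.items():
        for ecu, tasks in assignment.items():
            if ecu == failed:
                continue  # the failed node does not execute anything
            extra = set(tasks) - initial.get(ecu, set())
            store.setdefault(ecu, set()).update(extra)
    return store


# Hypothetical example: one task per ECU, and ECU1 takes over t2 if ECU2 fails.
initial = {"ECU1": {"t1"}, "ECU2": {"t2"}, "ECU3": {"t3"}, "ECU4": {"t4"}}
task_lists = {
    "ECU2": {"ECU1": {"t1", "t2"}, "ECU3": {"t3"}, "ECU4": {"t4"}},
}
to_store = redundant_instances(initial, task_lists)
```

Only ECU1 has to hold a redundant instance in this example, which mirrors the point that storing just the necessary task subsets keeps the memory overhead small.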
Task1, Task2 ... Task3, Task4 ...
Listing 1. Part of an exemplary task list (XML scheme).
Because the FlexRay specification does not support reassignments of slots to senders during runtime (ref. Section III), our approach has to consider possible communication reconfigurations to compensate for node failures. Therefore, the communication matrix contains additional redundant transmission slots assigned to the ECUs, which they can use after a self-reconfiguration of the FlexRay network. The number of redundant transmission slots #redundantSlots assigned to each node n is the maximum number of additional slots #additionalSlots_i needed for a possible reconfiguration reconf_i of n:
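Fig. 14. Example for communication matrix determination. Panels: initial configuration (task-to-ECU and slot-to-ECU assignment), reconfiguration (task-to-ECU and slot-to-ECU assignment), and the resulting COM matrix.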
V. EVALUATION
The presented approach was evaluated with several test scenarios that are typical for self-organizing systems, e.g., a node failure. The following detailed evaluation demonstrates the effectiveness of our approach and furthermore shows how our FlexRay network extension supports self-reconfiguration.
A. Properties of the Distributed Coordinator Concept
Section IV-B described the realization of self-reconfiguration for fault-tolerant FlexRay networks by means of our distributed coordinator concept. Table II gives an overview of the properties of our approach as
$$\#redundantSlots = \max(\#additionalSlots_i) \quad \text{for } i \in reconf_i.$$
As described in Section IV-C, our approach minimizes the number of redundant transmission slots for each node
realistic safety-critical automotive real-time systems [16]. Figure 15 shows the DAGs for an EPS (Electric Power Steering), an ACC (Adaptive Cruise Control), and a TC (Traction Control) system. For a more detailed description of these systems the reader is referred to [16]. Table III contains the corresponding
shown in [14]. Furthermore, it compares the self-organized approach with a previously proposed centrally organized solution with a dedicated coordinator node [17]. Our self-organized, non-central approach avoids a single point of failure: in the centralized solution, a defect of the dedicated coordinator node could eliminate the fault tolerance of the whole system. Therefore, we distribute the coordination to realize a decentralized self-reconfiguration. Another important property is the negligible resulting hardware overhead. The distributed coordinator concept is implemented with lightweight tasks and small data structures (task lists and COM matrix), which require only insignificant additional runtime and memory resources. Furthermore, the communication overhead is minimized by combining the messages for coordinator task communication with existing application frames to avoid additional slot assignments and frames. The time needed for a self-reconfiguration of the FlexRay network is also minimized by the frame packing used for the communication between the distributed coordinator tasks. Node failures are detected instantly, and resulting communication changes are propagated directly by the corresponding coordinator tasks. Therefore, the system can be reconfigured in the same cycle (in cycle) or, at the latest, in the next cycle.
Dedicated Node:
Single point of failure: YES: A defect of the coordinator could eliminate the fault tolerance of the system.
Hardware overhead: HIGH: An additional hardware component (ECU) is needed for the dedicated execution of the coordinator task.
Communication overhead: FAIR: Besides the necessary redundant slots (incl. backup node), one static slot has to be reserved for the coordinator node.
Reconfiguration time: LOW: The coordinator node has to inform all involved nodes about changes for the reconfigurations. Hence it takes up to two cycles for a reaction.
Fig. 15. Examples for safety-critical automotive real-time systems [16]: (a) EPS, (b) ACC, (c) TC.
Node annotations: τ1 150μs, τ2 175μs, τ3 300μs, τ4 250μs, τ5 150μs, τ6 100μs, τ7 300μs, τ8 150μs, τ9 175μs, τ10 300μs, τ11 250μs, τ12 200μs, τ13 150μs, τ14 200μs, τ15 200μs, τ16 200μs, τ17 200μs, τ18 200μs, τ19 150μs, τ20 300μs, τ21 175μs, τ22 400μs, τ23 150μs, τ24 200μs.
Graph annotations (in order of appearance): 3000μs, 1500μs, 3000μs.
messages between the given application tasks of the example. The evaluation is performed based on the model properties described in Section IV-C, and every task sends one message to one or multiple receivers. Based on this input data our reconfiguration
TABLE III. MESSAGE INFORMATION FOR THE DAGS IN FIGURE 15.
Distributed Coordinator:
Single point of failure: NO: Distribution of the coordinator task to existing nodes avoids the single point of failure.
Hardware overhead: VERY LOW: Lightweight tasks consume insignificant resources and runtime. The memory demand of task lists and communication matrix is negligible.
Communication overhead: LOW: Combination of messages for coordinator task communication avoids additional slot assignments and frames.
Msg | (Send, Recv) | Byte
m1 | (τ1, τ3), (τ1, τ4) | 12
m2 | (τ2, τ4) | 12
m3 | (τ3, τ5) | 20
m4 | (τ4, τ6) | 12
m5 | (τ7, τ10) | 12
m6 | (τ8, τ10) | 12
m7 | (τ9, τ11) | 10
m8 | (τ10, τ11), (τ10, τ12) | 12
m9 | (τ11, τ13) | 10
m10 | (τ12, τ14) | 10
m11 | (τ15, τ20) | 12
m12 | (τ16, τ20) | 12
m13 | (τ17, τ20) | 12
m14 | (τ18, τ20) | 12
m15 | (τ19, τ22) | 10
m16 | (τ20, τ22) | 22
m17 | (τ21, τ22) | 20
m18 | (τ22, τ23), (τ22, τ24) | 6
approach determines an initial configuration as described in Section IV-C. Because we focus on the reconfiguration results here, we refer the reader to [12] for a detailed evaluation of the initial task-to-ECU and slot assignments by means of several parameters, such as the number of empty slots, which can be increased by IPC or frame packing (ref. Section IV-C). In summary, that evaluation shows that the proposed optimized initial configuration methodology distinctly increases the flexibility for the reconfigurations to be determined by our approach. We simulated all possible ECU failures using the above-described use case from [16] and calculated reconfigurations based on arbitrary initial configurations to evaluate our reconfiguration approach. Multiple passes of these simulations were executed for changing network sizes with different numbers of ECUs. Figure 16 illustrates the rate of determined optimal and valid reconfigurations (ref. Section IV-C) for different numbers of ECUs. As expected, the rate increases with a growing network size because of the growing flexibility and capacities for the task-to-ECU assignments. Depending on the number of ECUs, 40% to more than 60% of individual ECU failures can be reconfigured optimally without any additional slots. The rate of valid reconfigurations with a minimal addition of 2 slots is up to nearly 80%.
Reconfiguration time (Distributed Coordinator): VERY LOW: Failures and changes are detected and communicated directly by the distributed coordinator tasks. This allows reactions in the same cycle (in cycle) or, at the latest, after one cycle.
TABLE II. PROPERTIES OF DISTRIBUTED COORDINATOR CONCEPT COMPARED TO DEDICATED NODE.
B. Evaluation of our Reconfiguration Approach
As described in Section IV-E, the necessary information for a self-reconfiguration is provided by task lists and a communication matrix. These data structures are generated from the results of the reconfiguration approach proposed in [12] and described in Section IV-C. To get meaningful results we evaluated our approach by means of
Fig. 16. Rate of reconfigurations depending on number of ECUs (series: optimal reconf., valid reconf.; x-axis: 4 to 7 ECUs; y-axis: rate of reconfigurations).
[11] S. Ding, H. Tomiyama, and H. Takada, “An effective GA-based scheduling algorithm for FlexRay systems,” IEICE Trans. Inf. Syst., 2008.
[12] K. Klobedanz, A. Koenig, and W. Mueller, “A reconfiguration approach for fault-tolerant FlexRay networks,” in Design, Automation and Test in Europe (DATE) 2011, 2011.
[13] FlexRay Consortium, “FlexRay communications system protocol specification version 2.1 rev. A,” Dec. 2005, www.flexray.com.
[14] K. Klobedanz, G. Defo, W. Mueller, and T. Kerstan, “Distributed coordination of task migration for fault-tolerant FlexRay networks,” in Symposium on Industrial Embedded Systems (SIES) 2010, 2010.
[15] G. C. Buttazzo, Hard Real-Time Computing Systems: Predictable Scheduling Algorithms and Applications, 1997.
[16] N. Kandasamy, J. P. Hayes, and B. T. Murray, “Dependable communication synthesis for distributed embedded systems,” in Computer Safety, Reliability and Security Conf., 2003.
[17] K. Klobedanz, G. B. Defo, H. Zabel, W. Mueller, and Y. Zhi, “Task migration for fault-tolerant FlexRay networks,” in 7th IFIP Conference on Distributed and Parallel Embedded Systems (DIPES) 2010, 2010.
VI. CONCLUSION
We presented an approach for the self-reconfiguration of FlexRay networks. Our aim of increasing the fault tolerance of such networks has been achieved. Based on our distributed coordinator concept, we demonstrated an example with a node failure and showed that a bus restart is avoidable. We achieved this by using redundant slots in the FlexRay schedule and by combining messages. Starting from the initial configuration of the network, our proposed heuristic is able to calculate a new valid reconfiguration of the network without the failed node. Additionally, the overhead of our approach is low. With our approach we can determine a valid reconfiguration for up to 80% of possible individual node failures. Future work will concentrate on extending our approach to cover more failures and on the analysis of the network.
ACKNOWLEDGEMENTS
This work was partly funded by the DFG Collaborative Research Centre 614 and by the German Ministry of Education and Research (BMBF) through the ITEA2 project VERDE (01S09012H).
REFERENCES
[1] www.autosar.org.
[2] I. Jahnich, I. Podolski, and A. Rettberg, “Towards a middleware approach for a self-configurable automotive embedded system,” in SEUS ’08: Proceedings of the 6th IFIP WG 10.2 International Workshop on Software Technologies for Embedded and Ubiquitous Systems, Oct. 2008.
[3] B. Ravindran, L. R. Welch, and C. Kelling, “Building distributed scalable dependable real-time systems,” in Proceedings of the IEEE Conference on Engineering of Computer-Based Systems, Mar. 1997.
[4] K. Chaaban, M. Shawky, and P. Crubille, “A distributed framework for real-time in-vehicle applications,” in Proceedings of the 8th International IEEE Conference on Intelligent Transportation Systems, Vienna, Sep. 2005.
[5] M. Kim, Y. Choi, Y. Moon, S. Kim, and O. Kwon, “Design and implementation of status based application manager for telematics,” in The 8th International Conference on Advanced Communication Technology (CACT), Feb. 2006.
[6] N. Navet, Y. Song, F. Simonot-Lion, and C. Wilwert, “Trends in automotive communication systems,” in Proceedings of the IEEE, 2005.
[7] T. Pop, P. Pop, P. Eles, Z. Peng, and A. Andrei, “Timing analysis of the FlexRay communication protocol,” Real-Time Syst., vol. 39, no. 1-3, pp. 205–235, 2008.
[8] L. H. M. Grenier and N. Navet, “Configuring the communication on FlexRay: the case of the static segment,” in ERTS’08, 2008.
[9] M. Lukasiewycz, M. Glaß, J. Teich, and P. Milbredt, “FlexRay schedule optimization of the static segment,” in CODES+ISSS ’09: Proceedings of the 7th IEEE/ACM International Conference on Hardware/Software Codesign and System Synthesis, 2009.
[10] R. Brendle, T. Streichert, D. Koch, C., and J. Teich, “Dynamic reconfiguration of FlexRay schedules for response time reduction in asynchronous fault-tolerant networks,” in ARCS, 2008.