Preprint of a Paper to be published in: IEEE Transactions on Industrial Informatics, 2007
Fault-tolerant, time-triggered communication using CAN
Michael Short and Michael J. Pont
Embedded Systems Laboratory, University of Leicester, University Road, Leicester LE1 7RH, UK
Abstract The Controller Area Network (CAN) protocol was originally introduced for automotive applications but is now also widely used in process control and many other industrial areas. In this paper we present a low-cost redundancy-management scheme for replicated CAN channels which helps to ensure that clocks (and, hence, tasks) on the distributed nodes remain synchronized in the event of failures in the underlying communication channels, without the need for expensive or proprietary interface electronics. We argue that, when using this framework with duplicated channels, the probability of inconsistent message delivery drops to acceptable levels for a wide range of systems. Through an analysis of the protocol and a case study, we conclude that the creation of reliable, low-cost, distributed embedded systems using CAN is a practical possibility.
Index Terms - Distributed computing, network reliability, embedded control.
1. Introduction Although the CAN protocol was originally introduced for automotive applications [1], it is now widely used in process control and many other industrial areas [2][3][4][5].
In comparison with earlier protocols (and standards such as “RS-485”), CAN is easy to use and provides more hardware support for error detection / recovery. As a consequence of its popularity and widespread use, most modern microcontroller families now have one or more members with on-chip hardware support for this protocol (e.g. [6][7][8][9]): this means, in turn, that CAN networks can now be implemented at very low cost.
No protocol is perfect and, from the perspective of the developer of low-cost, high-reliability systems, it may be argued that CAN has five main limitations: [i] Lack of support for time-triggered communications; [ii] Incomplete support for reliable group communications; [iii] Lack of support for redundant bus arrangements; [iv] Lack of mechanisms to handle “babbling idiot” errors; [v] Limited bandwidth.
Because of such limitations, it is tempting to discard CAN and focus on more recent alternatives, such as FlexRay [10] or TTP/C [11]. Indeed, where costs are not an issue, or the bandwidth of a CAN network is insufficient, this may be an appropriate solution. However, as we will seek to demonstrate in this paper, experience gained with CAN over recent years means that it is now possible to create extremely reliable networks using this protocol, if care is taken at the design and implementation stages.
More specifically, we will describe the design and implementation of a redundancy-management scheme for time-triggered, replicated communication channels. This scheme helps to ensure that clock synchronization and synchronized task execution are robust to failures in the underlying communication channels. We will argue that, when using this framework with duplicated channels, the probability of an inconsistent delivery drops to acceptable levels for a wide range of systems. Overall, we will argue that CAN remains an attractive option for low-cost (and resource-constrained) embedded systems.
This paper is organised as follows: in Section 2, we describe previous work in this area and state the motivation. Section 3 presents a fault hypothesis, and introduces the proposed techniques for redundant channel management. In Section 4 we investigate the suitability of the proposed techniques and comment on transient error containment. In Section 5 we present a simple case study. Finally, in Section 6, conclusions and areas of further work are presented.
2. Previous work in this area In the Introduction, we highlighted five limitations of the CAN protocol. A range of previous studies have sought to understand (and ameliorate the impact of) such limitations in CAN and related protocols. We review this work in the present section.
2.1 Time-triggered communications Although CAN was primarily intended to support event-triggered communications between unsynchronized nodes, time-triggered communication - which has a number of benefits (see for example [12]) – may be enforced, if due care is taken at the system design stage.
A number of hardware and software-based protocol extensions and modifications have been proposed to enable time-triggered communications on CAN. These tend to rely on the use of a global clock which, in turn, supports a Time Division Multiple Access (TDMA) message schedule. For example, Turski describes a distributed clock synchronization methodology with a potential resolution of +/-1 bit time¹, using a combination of hardware and software [13]. A CAN controller with support for detecting the Start Of Frame (SOF) sequence is required to accurately time-stamp reference messages from a master timer source. These reference messages are then used with a software-based distributed clock synchronization algorithm.
Several similar software-based synchronization algorithms have also been described. When using such techniques, clock synchronization at a level of 100 µs [14] is typical; however, Donnelly & Cosgrove describe a distributed clock with an accuracy of 8 µs [15]. Several time-triggered protocols of varying complexity have been developed for such synchronization frameworks, from simple static bus access schedules [15], to extensions of the normal arbitration mechanism to ensure graceful degradation of service under transiently-faulted network conditions [16].
The Time-Triggered CAN (TT-CAN) protocol uses a synchronization method similar to that proposed by Turski to provide time-triggered operation of CAN at the hardware level [17]. The protocol provides a maximum accuracy of +/- 1 bit-time [14], and supports a static TDMA schedule, which can provide ‘empty’ slots that allow normal message arbitration for dynamic messages.
A full implementation of TT-CAN requires dedicated hardware and, at the present time, such hardware has not been widely adopted. By contrast, the family of ‘shared-clock’ algorithms described by Pont [18] and Ayavoo et al. [19] provides time-triggered communications on CAN without the need for additional hardware, or software clock synchronization algorithms. These systems provide tightly synchronized distributed clocks with jitter levels as low as +/-1 bit time, allowing static TDMA schedules to be implemented for bus utilisation levels in excess of 90%.
¹ 1 µs at the maximum CAN bit rate.
Pimentel & Fonseca describe a time-triggered system which, although it does not utilize a global clock, controls a cycle of communication via a synchronization message sent by a primary message producer with an accurate clock [20]. The remaining nodes, upon receipt of this message, start local timers (each with different values), which upon expiry allow local tasks to be executed and messages to be transmitted in different time-slots on the network. However, this type of ‘domino’ architecture lacks scalability, as the authors note that “a FlexCAN network for a safety-critical system always has to be characterized by a small number of nodes”. This architecture must also operate in a single cycle with a single control period [21], and it suffers from potentially large jitter levels (2.5 ms at a basic cycle of 10ms [22]).
Since in some applications jitter levels greater than 10% render subsequent interpretation of the sampled data useless [44], the 25% jitter levels in FlexCAN render it unacceptable for many systems.
Almeida et al. describe, in detail, the FTT-CAN protocol [47]. This protocol attempts to combine the desired properties of both event- and time-triggered systems into a single, repeated ‘elementary cycle’. Each elementary cycle is triggered by the transmission of a specific message from a master node. In response to this message, nodes required to transmit in the synchronous window send their traffic; message collisions are handled by the built-in arbitration mechanism of CAN, in a similar manner to a variant of the shared-clock protocol described by Ayavoo et al. [19]. In the asynchronous window, event-triggered messages can be sent as required. This protocol also allows on-line modification of the time-triggered schedule to meet varying traffic conditions in the network.
Although the protocol implementation may be accelerated by using a dedicated co-processor, a software version has been implemented [47]. As with many such methods, a software implementation increases the processing overheads on the host node; these overheads may be reduced at the expense of responsiveness to changes in the communication requirements.
Rufino proposes an enhanced software layer for CAN (CANELY) to provide fault-tolerant communications [46]. Services provided by the layer include group communications and node failure detection.
Although clock synchronization (with precision at the tens of microseconds level) is employed for these services, the system does not directly support time-triggered communication. However, it has been proposed that this feature may be built on top of the enhanced layer as a system-level service [52].
2.2 Reliable group communications If we are to develop reliable embedded systems using CAN, then we need to ensure that we can achieve reliable group communications. This means, for example, that when one node transmits a message, all nodes must receive the same message. One deficiency with CAN is that this condition may not always be satisfied, most notably during the detection of End Of Frame (EOF) sequences.
This problem can arise as follows. CAN receivers achieve consensus that the accepted message is valid by processing an error-free sequence of bits up to the 6th bit of the EOF sequence. At this point, the receiving CAN controllers accept the message. The sender, however, validates the transmission at the very last bit of the EOF. This presents a potential problem. If a subset of receivers detects an error in the 6th bit of the EOF sequence, they will subsequently reject the message and begin transmission of an error flag in the 7th bit of the EOF.
The remaining receiver nodes will already have accepted the message; thus an inconsistent delivery has arisen. Under normal circumstances, the sender will queue the message for re-transmission; therefore the possibility of inconsistent message duplicates (IMDs) or inconsistent message omissions (IMOs) arises. Previous studies have shown that the probability of this situation occurring in normal CAN is highly dependent on the bit rate, the nature of the bus traffic and the number of nodes connected to the bus [23].
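The scenario above can be illustrated with a small model. The sketch below is not a CAN implementation: it assumes that a subset of receivers sees a local error in the 6th EOF bit, that any retransmission is itself error-free, and that an IMO corresponds to the sender failing before it can retransmit. The function names `eof_outcome` and `classify` are hypothetical and chosen only for this illustration.

```python
def eof_outcome(receivers, error_at_bit6, sender_retransmits):
    """Count how many copies of one frame each receiver ends up accepting.

    receivers          -- names of all receiving nodes
    error_at_bit6      -- receivers that see an error in the 6th EOF bit
    sender_retransmits -- False models a sender that fails before retrying
    """
    rejected = set(error_at_bit6)
    # Nodes with an error-free EOF up to bit 6 accept the first copy.
    counts = {r: (0 if r in rejected else 1) for r in receivers}
    # The error flag sent in the 7th EOF bit makes the sender see a failure,
    # so it retransmits; the retry is assumed to reach every receiver.
    if rejected and sender_retransmits:
        for r in receivers:
            counts[r] += 1
    return counts


def classify(counts):
    """Label the outcome as consistent, IMD (duplicates) or IMO (omissions)."""
    values = set(counts.values())
    if len(values) == 1:
        return "consistent"
    return "IMO" if 0 in values else "IMD"


nodes = ["A", "B", "C"]
print(classify(eof_outcome(nodes, [], True)))      # no errors at all
print(classify(eof_outcome(nodes, ["B"], True)))   # A and C receive a duplicate
print(classify(eof_outcome(nodes, ["B"], False)))  # B never receives the frame
```

In this model an error seen by every receiver is harmless, because all nodes reject the first copy together; only a partial error produces an inconsistency.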
To address the problem of inconsistent deliveries in CAN, several methods in both hardware ([24][25]) and software ([23][26][27]) have been proposed. The proposed software solutions have been adopted, in some cases, by the protocols described in the previous section; however, it should be noted that software solutions generally incur bandwidth and processing overheads. Interestingly, some modern CAN controllers now feature diagnostic facilities that will report error codes indicating possible inconsistency detection such as “Other error detected upon message reception in the EOF sequence” [28].
2.3 Redundant bus arrangements CAN was introduced as a “single bus” protocol. Any distributed system based on a single bus is vulnerable to a range of failures which may result from cable damage, connector damage or electrical interference. Many current microcontroller families provide dual on-chip CAN controllers to support more than one communication channel (e.g. [6][7][8][9]); however, most higher-level protocols built on CAN do not directly support these additional channels. One reason for this is that, as noted by Kopetz, in networks with a degree of non-deterministic traffic, the system cannot be said to be replica-determinate because “it is impossible to guarantee that the message order on two replicated channels will always be the same” [12].
As such, only protocols enforcing highly deterministic time-triggered communications on CAN are suitable for such systems: because CAN in its native format is not replica-determinate, no efficient solution for redundancy management can be found for it.
Several studies have recognized this and suggested techniques for multi-channel operations in time-triggered CAN networks. For example, Ryan et al. propose a software-based synchronization layer to address the channel redundancy problem in TTCAN networks [29]. They report a synchronization accuracy of +/-24 µs between channels in the prototype system running at 125 kbit/sec, which they note could be improved by using a faster baud rate or a lower network time unit. Extrapolating to the maximum bit rate, this gives an accuracy of +/-3 µs.
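The extrapolation above can be checked with a short calculation. The sketch below assumes that the reported accuracy stays constant when expressed in bit times, so that it scales directly with the bit time; this scaling assumption and the helper name `bit_time_us` are ours and are used only for illustration.

```python
def bit_time_us(bit_rate_bps):
    """Duration of one bit, in microseconds, at the given bit rate."""
    return 1e6 / bit_rate_bps


reported_accuracy_us = 24.0   # +/- value reported at 125 kbit/s
prototype_rate_bps = 125_000
max_rate_bps = 1_000_000      # maximum CAN bit rate

# Express the reported accuracy in bit times of the prototype network.
accuracy_bits = reported_accuracy_us / bit_time_us(prototype_rate_bps)

# Assume the same number of bit times holds at the maximum bit rate.
extrapolated_us = accuracy_bits * bit_time_us(max_rate_bps)

print(accuracy_bits, extrapolated_us)
```

At 125 kbit/s one bit lasts 8 µs, so +/-24 µs corresponds to +/-3 bit times, which is +/-3 µs at 1 Mbit/s.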
Pimentel & Fonseca also describe a methodology for the integration of redundant CAN channels in the FlexCAN architecture [20]. In this system, messages in each of the control-cycle timing slots are simply duplicated on one or more redundant buses; subsequently, when the remaining nodes’ timers expire, they check for the presence of these messages.
By contrast, redundancy in the physical media of a single channel is still possible. Rufino et al. describe such a solution: they propose a ‘media selection module’ that interfaces a standard CAN controller to several redundant media interfaces and provides a form of bit-wise voting across them [48]. The full solution requires dedicated interface electronics, implemented in a programmable logic device. In addition, such redundancy only provides replication in the cabling media and bus transceivers, with the consequence that errors caused by a faulty CAN controller will be replicated on all available media. As such, this approach provides no additional robustness to IMOs. This approach to media replication has been adopted in the FTT-CAN protocol [45], and its use has also been proposed for the CANELY protocol [52].
2.4 “Babbling idiot” errors Babbling idiot failures, considered to be the most critical failures in a shared-channel system [12], can be broadly classified into two different sub-types: hardware and software [21]. Several papers have proposed bus guardians of varying complexity for CAN when it is used in its native format (e.g. [30][31]). The former approach has been suggested for use with the CANELY protocol [52]. Tindell & Hansson suggested altering the CAN controller at the silicon level to provide suitable protection [32]. These methods all inhibit the transmission of successive frames during some inter-frame period, thus guarding against a single node monopolizing the bus.
Although no specific bus guardian design is defined in the TT-CAN protocol, a generic design for time-triggered systems [43] was suggested for this purpose by Ryan et al [29]. Buja et al. present a very similar guardian for use in the FlexCAN system [21]. Such systems, however, cannot ensure total fail-silence, as a node may still transmit (in error) during its ‘enabled’ window. In addition, the guardians proposed in both [21] and [43] violate a basic principle of the CAN protocol because - when the bus drivers are disabled (i.e. when the node is not in its allowed transmit slot) - the CAN controller is prevented from asserting valid ACK bits or transmitting error frames. Tyagi [33] describes a guardian for use with the shared-clock algorithms which ensures fail-silence; however this design reduces the available system bandwidth by 50%, which is clearly unacceptable in most systems.
Ferreira et al. propose two approaches to enforcing fail-silence for use with the FTT-CAN protocol [45]. The first approach is primarily for use with the master node, which is a critical point of failure in the FTT-CAN system. It relies on internal node replication with a temporized voting mechanism, connected to a single CAN controller. The second approach is intended for use in the slave nodes, and provides an interface between the guarded node and the bus transceiver. The guardian simultaneously monitors the node transmissions and bus traffic on a bit-by-bit basis. Any detected protocol or bit stream errors are not transferred to the bus. The requirements for both guardian designs dictate that a programmable logic device, such as an FPGA, is best suited for their implementation. Such an implementation, although effective, may be costly.
2.5 Limited network bandwidth The limitation of the maximum CAN bit rate to 1 Mbit/s is often seen as a major concern ([12][34][35]). Although protocol modifications to overcome this limitation have been proposed (e.g. FastCAN [34]), no commercial products are in widespread use.
2.6 Other protocols We note that two protocols, FlexRay and TTP, have received much attention recently: these protocols have been designed such that all the deficiencies outlined in this section are ameliorated [35]. These protocols both feature (i) the ability to communicate over multiple redundant communication channels; (ii) integrated bus guardians; and (iii) support for hard time-triggered communications. Since no protocol can provide a theoretically ideal atomic broadcast, the probability of an inconsistent delivery is designed to be low enough to meet stringent safety regulations in both these technologies.
Due to a lack of user experience with these protocols, and their comparatively high cost, it may be desirable for system developers to continue to use CAN where this is practical. Indeed, in the automotive sector (for example), it is envisioned that, although FlexRay or TTP may become the basis for a “communications backbone” in future vehicles, CAN will remain in use for many years to come [36].
2.7 Current motivation As the review presented in previous sub-sections makes clear, many of the existing CAN-based protocols rely on media redundancy, as opposed to full channel redundancy. This requires the use of (potentially costly) dedicated interface electronics. The problem of using full channel redundancy with native CAN was highlighted in Section 2.3. In existing systems where full channel redundancy has been directly employed, such as FlexCAN [20] and TTCAN [17], this again either requires the use of dedicated hardware or results in a limited design scope in the resulting system architecture, with significant levels of clock jitter.
As we will demonstrate in this paper, scalable, low-jitter systems with full channel redundancy can be implemented using standard CAN hardware. The techniques we will propose are intended for use in resource-constrained, low-cost systems in which (i) low clock jitter and predictable behaviour are required; (ii) additional software and hardware must be kept to a minimum.
As we shall show in Section 4, the techniques we propose support high levels of network utilisation, allowing designers to get high levels of performance from the CAN protocol. This makes the protocol suitable for a wide range of applications.
3. Fault-tolerant time-triggered communication using CAN As we outlined in the introduction, our aim in this paper is to present techniques that will help to maximise the reliability of CAN-based embedded systems. Before giving a description of these techniques, we first present a basic fault hypothesis and state a number of assumptions that have been made for the type of systems under consideration.
3.1 System fault hypothesis Figure 1 shows the type of system under consideration (a broadcast bus). The failure rates of the physical components in such a distributed system are highly dependent on many factors, such as the application environment and cabling media. We will assume that the failure rates for the physical hardware, cabling media and the bit error rate (BER) of the communication system lie within the bounds of what have been called “benign” and “aggressive” environments [50][51][52]. Failure rates for individual controllers, transceivers and cabling may be calculated for specific cases using reliability handbooks (e.g. [53]). Typical failure rates for the elements of interest in this study are shown in Table 1 [50][51][52].
Figure 1 – Broadcast-bus architecture under consideration
Table 1: Typical component failure rates
Element: Link; Bus Section; BER; CAN Transceiver; CAN Controller
Failures / Hour (Benign, Aggressive): 1.0 x 10^-8; 1.0 x 10^-6; 1.0 x 10^-6; 3.0 x 10^-11; 2.6 x 10^-7; 1.0 x 10^-6; 1.0 x 10^-6
For the proposed protocol, we can state the following fault hypothesis:
The communication system must be able to tolerate a single physical fault in any one of its constituent components, i.e. a bus segment, bus link, transceiver or communication controller.
Tolerance of multiple faults is dependent on the application; a never-give-up strategy is employed.
A fault can occur at any time (exponential distribution) and may be transient or permanent in nature.
Transient faults (such as continuous blocks of electromagnetic interference) do not persist beyond the physical (temporal) controllability limit of the system.
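To make the exponential fault model in this hypothesis concrete, the sketch below computes the probability that a component with a constant failure rate has failed within a given operating time, and the probability that every replica of a replicated component has failed. The rate of 1.0 x 10^-6 failures per hour and the 1000-hour mission time are illustrative values only, and the independence of faults across channels is an assumption of this sketch, not a result from the paper. The function names are hypothetical.

```python
import math


def failure_probability(rate_per_hour, hours):
    """P(at least one failure within `hours`) under an exponential model."""
    # expm1 keeps precision when rate_per_hour * hours is very small.
    return -math.expm1(-rate_per_hour * hours)


def all_channels_fail(rate_per_hour, hours, channels):
    """P(every one of `channels` replicas has failed), assuming independence."""
    return failure_probability(rate_per_hour, hours) ** channels


# Illustrative values only (assumed, not taken from a specific table cell).
single = failure_probability(1.0e-6, 1000.0)
dual = all_channels_fail(1.0e-6, 1000.0, 2)
print(single, dual)
```

Under these assumptions, duplicating a channel reduces the probability of losing all communication from roughly 10^-3 to roughly 10^-6 over the same period.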
3.2 System assumptions In this section we state some basic system assumptions before describing the basic operation of the channel management system.
The system design is based around the broadcast bus architecture, as shown in Figure 1.
Each node in the distributed system possesses a TDMA bus access schedule for the network, and each message is allocated a time-slot which the respective node can use for transmission. Such a scheme is shown in Figure 2².
Each node may transmit an arbitrary number of messages during the TDMA cycle. It is assumed that a methodology such as [49] has been used to create a suitable message schedule for a given system application.
² Techniques for determining the minimum length of each slot Si will be discussed in Section 3.7.
Each node’s clock is synchronized to a global time-base with some guaranteed minimum level of accuracy, such that no message collisions can take place on the bus.
For time-triggered systems such as those considered in this paper, a minimal system requirement is a periodic transmission of a time reference message over the network, from a (time master) node in possession of an accurate timer. This reference message, when received by the remaining (slave) nodes, invokes a high-priority interrupt which is used for time-synchronisation.
Each time-slot Si is large enough to allow the worst-case transmission time Mi of a message i, incorporating the accuracy of clock synchronization, plus an arbitrary inter-message idle period P.
Task executions on each distributed node are synchronized to this global time-base, and scheduled such that the message handling tasks cannot be blocked or interrupted (i.e. they have the highest priority, or co-operative/hybrid scheduling is employed).
Each node has access to a timer that is independent of the global time base, yet has the same accuracy.
The CAN nodes are prevented from entering the ‘error-passive’ state³, and employ the full CAN 2.0B protocol [1].
Re-transmission of CAN messages is disabled⁴.
In order to minimize costs, the use of simple interface electronics (based on non-proprietary solutions) is required.
The replicated communication channels can be electrically isolated from each other, up to the controller level, and the cabling media spatially routed via different physical paths.
Strategies have been employed to provide the appropriate levels of node redundancy at the hardware level, such as those suggested by Iserman et al. [37] or Pont [19].⁵
³ The standard CAN controller issues a signal when a certain error count has been reached. This may be set to a level just before the node becomes ‘error passive’, and when issued, the controller is manually put into the ‘bus-off’ state by the application. Periodic attempts may then be performed to reset the controller and enter the ‘error-active’ state.
⁴ In this protocol, as with any time-triggered system, automatic retransmission of messages may cause further messages to miss their deadlines (in a domino-like effect). A “fail-silent” approach to message errors is therefore more appropriate. In addition, since many sampled-data designs are robust to the loss of a single sample [38], the single-shot transmission approach may be particularly appropriate in such systems.
Figure 2 – TDMA structure with inter-slot spacing
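The slot structure assumed above can be sketched as a simple data structure. In the example below each slot is taken to be the worst-case message transmission time Mi plus the idle period P, with Mi treated as already including any allowance for clock accuracy. The function name `build_schedule` and the numerical values are hypothetical and serve only as an illustration.

```python
def build_schedule(worst_case_tx_us, idle_period_us):
    """Return (slot start offsets, cycle length) for back-to-back slots.

    worst_case_tx_us -- worst-case transmission time M_i of each message (us)
    idle_period_us   -- inter-message idle period P appended to every slot (us)
    """
    starts = []
    offset = 0
    for tx_time in worst_case_tx_us:
        starts.append(offset)               # slot i begins where slot i-1 ended
        offset += tx_time + idle_period_us  # S_i = M_i + P
    return starts, offset


# Three messages with assumed worst-case times, and an assumed P of 10 us.
starts, cycle_us = build_schedule([130, 70, 130], 10)
print(starts, cycle_us)
```

Each node can hold the same table of offsets and transmit only when its local copy of the global time reaches the start of one of its own slots.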
3.3 Message transmission procedure If the assumptions outlined in the previous section are to hold, the message broadcasts should be transparent to both producers and consumers, and the replicated channels should appear as a single entity. We describe how we seek to meet these goals in this section.
We will refer to each communication channel C (in a j-channel system) as follows: C1, C2, C3, … Cj. In order to manage each channel effectively when transmitting a particular message Mi, we will send an exact replica of it over each network channel, but each message will be delayed by a short time period D from the previous message.
When a transmitting node enters the (uninterruptible) message transmit function, the message objects in each channel are first loaded with the required information (data fields etc.). We then initiate transmission of the message on channel C1 by setting the channel’s Transmit Request (TXRQ) bit.
To strictly enforce fail silence and prevent undue jitter, we require single-shot transmission of each message in each channel⁶. The properties of a TDMA protocol can be exploited with a CAN controller to ensure that such single-shot messaging takes place. A CAN controller will automatically queue a message for re-transmission after an error (or loss of arbitration) only if the TXRQ bit of the corresponding CAN controller object remains set.
⁵ For the remainder of this paper, we will assume that each system node employs fail-operational behaviour, and permanent node failures are not considered further.
We also note that a standard CAN controller will reset the transmission object’s New Data flag (NEWDAT) only if it has detected an idle bus, and commenced the transmission procedure. This enables a simple mechanism for single-shot transmissions to take place in a TDMA scheme, as the bus should always be in the idle state when commencing a transmission. If, as the result of an error, the bus is not in the idle state then waiting for a NEWDAT reset may cause an unnecessary delay. To prevent this being a potential failure point, a short timeout T is introduced⁷.
Thus, as soon as we have initiated transmission, we start an on-chip timer, and monitor the status of NEWDAT. Should this bit be reset before T has elapsed, the transmission has been initiated; otherwise it has failed, and an appropriate error flag can be set. In either case, we immediately reset TXRQ, to ensure that the message is not re-transmitted. We then wait until the time delay D has elapsed: D can be set to any value which satisfies the condition D > T.⁸ We then initiate transmission of the message on channel C2 by setting the TXRQ bit, and use a similar procedure to monitor the NEWDAT bit until a time period D + T has elapsed, and set the error flag to the appropriate status. This procedure is repeated until the message transmission has been attempted on all j channels. The procedure can then terminate. A flow chart for this procedure is shown in Figure 3.
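The procedure described above can be sketched as a small simulation. The code below is not driver code for any real CAN controller: `MockChannel` is a hypothetical stand-in that clears NEWDAT only when the bus is idle and transmission has started, time is counted in whole bit times, and the default values T = 2 and D = 5 bit times are illustrative choices that satisfy D > T.

```python
class MockChannel:
    """Hypothetical model of one CAN controller's transmit object."""

    def __init__(self, bus_idle=True, start_latency=1):
        self.bus_idle = bus_idle            # False models a disturbed bus
        self.start_latency = start_latency  # bit times before sending starts
        self.txrq = False
        self.newdat = False
        self.payload = None
        self._requested_at = None

    def load(self, payload):
        self.payload = payload
        self.newdat = True

    def request(self, now):
        self.txrq = True
        self._requested_at = now

    def poll(self, now):
        """Return NEWDAT; it is cleared once transmission has commenced."""
        waited = now - self._requested_at
        if self.txrq and self.bus_idle and waited >= self.start_latency:
            self.newdat = False
        return self.newdat


def send_replicated(channels, payload, T=2, D=5):
    """Attempt one single-shot send per channel; return per-channel error flags."""
    assert D > T, "the inter-channel delay must exceed the timeout"
    for ch in channels:
        ch.load(payload)                    # load every message object first
    errors = []
    for index, ch in enumerate(channels):
        start = index * D                   # channel k starts (k-1)*D after C1
        ch.request(start)
        started = False
        for now in range(start, start + T + 1):
            if not ch.poll(now):            # NEWDAT cleared: sending has begun
                started = True
                break
        ch.txrq = False                     # never leave a retransmission pending
        errors.append(not started)
    return errors


channels = [MockChannel(), MockChannel(bus_idle=False), MockChannel()]
print(send_replicated(channels, b"\x01"))
```

In this model the second channel never sees an idle bus, so its error flag is raised after the timeout while the other two channels proceed normally.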
With such an approach, the redundant channel(s) all carry identical traffic, shifted slightly in time. The replica-determinism of the channels holds, and all transient errors (except babbling idiot errors) can be detected by checking for the absence of messages in each channel (by receiver nodes), or checking the transmit error status of all channels (for transmitters) after any given time-slot. The nodes can achieve consensus on the status of the last transmission within the accuracy of the clocks. Under normal, fault-free conditions, the receivers can also check the integrity of the received data by a majority vote or other suitable means.
⁶ A number of modern standalone or integrated CAN controllers now support “single shot” transmission of messages at the hardware level; for example the Philips SJA1000 [28], Microchip MCP2515 [39] and the XC167 microcontroller on-chip CAN module [40]. However, many existing systems operate using hardware without such support. To avoid restricting the application of the protocol discussed in this paper, we do not assume the presence of hardware support for single-shot transmission.
⁷ We have found setting T to a value of 2 bit times to be sufficient.
⁸ We have found a value of D equal to 5 bit times to be a sufficient level of delay.
Figure 3: Message transmission procedure
3.4 Message reception procedure On the slave nodes, each CAN controller is configured such that the arrival of the required message Mi on any of the available channels (C1, C2 … Cj) will invoke a high-priority interrupt. However, the interrupts are prioritised such that C1 > C2 > … > Cj. We assume that the Interrupt Service Routine (ISR) corresponding to the message arrival performs some action (such as scheduling a task for execution, clock synchronization etc), and that the worst-case execution time of the ISR W is known.
We then handle message reception as follows: upon activation of a message interrupt via the channel Ck, the receiver will first disable all interrupts, and timestamp the activation of the interrupt. We then start a local timer and, for all channels i = (k+1) to j, at fixed intervals of time equal to (i*D), we manually ‘sample’ the interrupt request bit of CAN controller Ci to check for reception of a valid message. Upon receipt of a valid message on Ci, the resulting interrupt request bit for Ci is reset. Missing messages can be flagged with an appropriate error; channels 1 … k-1, which did not raise the interrupt, can likewise be flagged as being in error (unless k = 1).
When this process is complete, we adjust the timestamp value by subtracting a value (k-1) * D. This ensures that, regardless of the channel Ck that actually invoked the interrupt, the timestamp is adjusted such that its value represents the value that the first channel C1 would have read. In this way we ensure that fault-tolerant time-stamping takes place.
The final processing that needs to take place as part of this redundancy management scheme is to ensure that the interrupt overheads terminate at the same point in time, regardless of the channel that invoked the interrupt. So, after the activation of an interrupt on channel Ck and the subsequent execution of ISR overheads (such as a synchronization algorithm), we wait for the timer to count to a value equal to W + (j-(k-1)) * D. This is a form of ‘sandwich delay’ [54] and ensures that control is passed back to the scheduler at the same instant in time, regardless of the invoking channel.
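The timestamp correction and the sandwich delay can be checked with a short calculation. The sketch below assumes that the copy of a message on channel Ck completes exactly (k-1) * D after the copy on C1, which follows from the transmission procedure when all copies are identical; the function names and the numerical values are hypothetical.

```python
def adjusted_timestamp(raw_timestamp, k, D):
    """Timestamp that channel C1 would have produced, given arrival on Ck."""
    return raw_timestamp - (k - 1) * D


def sandwich_delay(W, j, k, D):
    """Timer value at which the ISR invoked by channel Ck hands back control."""
    return W + (j - (k - 1)) * D


def isr_end_time(message_end, j, k, W, D):
    """Absolute time at which control returns to the scheduler."""
    interrupt_time = message_end + (k - 1) * D  # assumed arrival time on Ck
    return interrupt_time + sandwich_delay(W, j, k, D)


j, D, W = 3, 5, 20   # three channels; D and W in bit times (assumed values)
message_end = 100    # completion time of the copy on C1 (assumed)

for k in range(1, j + 1):
    raw = message_end + (k - 1) * D
    print(k, adjusted_timestamp(raw, k, D), isr_end_time(message_end, j, k, W, D))
```

Whichever channel invokes the interrupt, the adjusted timestamp and the ISR end time are the same, because the (k-1) * D terms cancel and the end time reduces to the C1 completion time plus W + j * D.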
A summary of the message-reception procedure is shown in Figure 4.
Figure 4: Message receive procedure
3.5 Clock synchronization The implementation of the message transmission and reception processes, outlined in the previous section, is suitable for use with many software-based clock synchronization mechanisms. However, as will be noted in Section 4, greater levels of clock synchronization will lead to significantly better performance, and better task synchronization in the distributed system.
Several factors may affect the accuracy of CAN-based clock synchronization methods, not least the bit-stuffing mechanism employed in CAN. For example, previous analysis of the shared-clock protocol has revealed that the jitter, and hence clock accuracy, between the clocks in a standard shared-clock network is largely dependent on this mechanism [42]. This bit-stuffing induced variation in transmission times can also indirectly affect clock accuracy in other methodologies (for example when time-stamping reference messages). A methodology known as ‘Software Bit Stuffing’ [42] has been developed to significantly reduce these variations, and may help to increase clock accuracy in such situations.
In addition, during system power-up or after a block of continuous interference, there will be a time when individual nodes will not have synchronized clocks. A node should not transmit any messages (unless it is the time master) during this time. Since the choice of synchronization algorithm influences the time taken to re-synchronize the clock, this choice should be made with care; the synchronization time should be several orders of magnitude smaller than the controllability time of the physical system.

3.6 Example

Figure 5 shows the transmission and reception of a time-stamped message M, over a triple bus system (j=3), in three different fault scenarios. From this Figure, it can be seen that, regardless of the fault status of the underlying channels, the time from the start of transmission to the end of the receiver ISR varies very little, and that the timestamp is dynamically adjusted to read approximately the same value. The impact of the technique on the accuracy of this value, although small, is dependent on the implementation platform (some experimental results are provided in Section 5). Thus, any subsequent task release or synchronization associated with the arrival of this message is not subject to errors or significant jitter, and the triple-channel system appears as a single entity to both transmitters and receivers.
Figure 5: Management technique for message handling

3.7 Determining the required slot size

Having described the transmission and reception procedures, we will describe techniques which allow the determination of the minimum required values for each slot time S in the TDMA cycle.
From Figure 1, it can be seen that each slot Si consists of the message transmission time Mi and the inter-message spacing period P. We suggest that the idle period P should have a minimum value of 2ε, where ε denotes the clock accuracy (the maximum error in the global clock), to compensate for synchronization errors in the global clock and to prevent message collisions. From the description of the CAN protocol, we can infer that the maximum transmission time (Cm) for a message with DLC data bytes, including the worst-case level of bit stuffing, is given by Equation 1, adapted from [41]:
$$C_m = \left( 8 \cdot DLC + g + 13 + \left\lfloor \frac{8 \cdot DLC + g - 1}{4} \right\rfloor \right) b \qquad (1)$$

where b is the bit-time and g is a constant representing the control bits subjected to bit stuffing, which takes the value 34 for a standard CAN frame and 54 for an extended CAN frame.
Please note that this measure does not include any allowance for superposition of error frames or overload frames: we must include an extra 20 bits in this measure to cover these possibilities (see footnote 9). In addition, in each transmission we have (j-1) copies of the message, each delayed by a time D. Taking these factors into consideration, the minimum slot time Si for a message transmission Mi with a data length of DLCi in a system with j replicated channels is given by Equation 2:

$$S_i^{MIN} = (j - 1) \cdot D + 2\varepsilon + \left( 8 \cdot DLC_i + g + 13 + 20 + \left\lfloor \frac{8 \cdot DLC_i + g - 1}{4} \right\rfloor \right) b \qquad (2)$$

where ε is the clock accuracy, so that 2ε is the minimum idle period P.
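As a worked illustration of Equations 1 and 2, the following minimal sketch computes the worst-case frame length and the minimum slot time. The function and parameter names are hypothetical; `eps` stands for the clock accuracy whose double forms the idle period P, and all times are assumed to be in microseconds.

```python
def worst_case_bits(dlc, extended=True):
    """Worst-case frame length in bits, including bit stuffing (Equation 1 without b)."""
    g = 54 if extended else 34  # control bits subject to stuffing
    return 8 * dlc + g + 13 + (8 * dlc + g - 1) // 4


def min_slot_time(dlc, j, D, eps, bit_time, extended=True):
    """Minimum slot time (Equation 2): channel stagger + idle period + frame + 20 error bits."""
    return (j - 1) * D + 2 * eps + (worst_case_bits(dlc, extended) + 20) * bit_time


# 1 Mbit/s (bit_time = 1 us), 8 data bytes, extended identifiers, D = 5 us.
for j in (1, 2, 3):
    print(j, min_slot_time(dlc=8, j=j, D=5, eps=2, bit_time=1))
```

With these inputs the frame occupies 160 bit-times, and the slot times for one, two and three channels come out at 184, 189 and 194 µs, in line with the values later reported in Table 4.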
4. Assessing the proposed techniques

In this section, we analyze the impact of permanent and transient errors on the proposed technique, and demonstrate that the reduction in the achievable bus utilization is minimal.

4.1 Permanent (hardware) errors

In this discussion we will explore, using the failure model presented in Section 3.1, the impact on overall system reliability when redundant channels are added to the system (in the manner described in the previous section).
Considering the simple architecture of Figure 1, we can analytically determine the overall system failure rate for the communication equipment and physical media (CAN controller, bus transceivers, bus links, bus section) for a three node system (note: the failure rate for each node is not considered in this analysis). The findings are summarized in Table 2.
Table 2: Overall failure rate in multiple channels

Number of Channels    Failures / Hour
1                     1.0 x 10^-5
2                     1.0 x 10^-11
3                     1.0 x 10^-17

Footnote 9: This value is derived from the CAN specification. Both error and overload frames may be superimposed by several nodes, with a worst-case duration of 12 bits. In addition, these frames have delimiters of 8 bits (after the active flag). This gives a total of 20 bits in the worst case.
From this table, it can be seen that increasing the number of channels has a very significant impact on the reliability of the communications equipment in the system. The increases are such that even the dual-channel system may be used in systems with high reliability requirements.

4.2 Transient errors

In this discussion, we will assume that multiple message channels are being used in a given system. In such a system, we send duplicate copies of each message over isolated communication channels. Clearly, unless there is large physical separation between the isolated channels, continuous blocks of electrical interference will affect all channels uniformly. However, as stated in Section 3.1, it is assumed that any blocks of interference will be of limited duration. Since re-transmission is disabled, old messages lost to interference will no longer be re-transmitted and further (domino) disruptions are thereby avoided. In this way (without any further processing) the effects of certain types of transient errors can be minimised. In addition, as electrical and physical isolation is assumed, certain types of transient errors (such as intermittent connector faults associated with vibration) will be isolated and their effects will not propagate between channels. As such, the effects of inconsistent deliveries of the form described in Section 1.1 will be reduced when using this protocol. It is possible to provide a quantitative estimate of the protocol’s resilience to IMOs, using the probability model proposed by Rufino et al. [23] (see footnote 10). Since the re-transmission of messages is inhibited, the probability of IMDs is zero, and given that each message is replicated over j different channels, the probability of an IMO for any particular message (of length DATA) is given by Equation 3:
$$P_{IFO} = \left( (1 - BER)^{DATA - 2} \cdot BER \right)^{j} \qquad (3)$$

where BER is the bit error rate.
Footnote 10: We note that, in practice, this model has been found to give slightly pessimistic results [51]. However, until further research is performed on this subject, we use this model for comparative purposes.
Since an IMO may lead to a potentially dangerous system state, it is desirable to calculate the probability of such occurrences per hour. Considering each message in the time-triggered system as a periodic stream, a frequency (in terms of messages/second) fi can be determined for each message i. This can be obtained from knowledge of the TDMA schedule and its period in seconds, TPeriod. The failure rate for a given system implementation with n streams may then be predicted using Equation 4:
$$\lambda_{IMO} = 3600 \sum_{i=1}^{n} f_i \, P_{IFO_i} \qquad (4)$$
Taking (for example) a system with TPeriod equal to 0.01 seconds and a TDMA cycle of 9 messages, each of length 110 bits (utilization 80% at 125,000 bits/s), the IMO failure rate given by Equation 4 may be calculated for varying BERs, as shown in Table 3:
Table 3: IMO failure rate in multiple channels (Failures / Hour)

Number of Channels    BER = 10^-7     BER = 10^-9     BER = 10^-11
1                     3.2 x 10^-1     3.2 x 10^-3     3.2 x 10^-5
2                     3.2 x 10^-8     3.2 x 10^-12    3.6 x 10^-16
3                     3.2 x 10^-15    3.2 x 10^-21    3.6 x 10^-27
From this table, it can be seen that increasing the number of channels from the single channel case dramatically reduces the failure rate of undetected IMOs. Prospective designers can thus estimate the likely safety impact of using the protocol with a particular message schedule in a particular environment.
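The calculation behind Table 3 can be sketched as follows. This is an illustrative reading of Equations 3 and 4, not the authors' code: the function names are hypothetical, and each of the 9 messages in the 0.01 s cycle is assumed to form its own stream at 100 messages/second.

```python
def p_ifo(ber, data_bits, j):
    """Per-message IMO probability over j channels (Equation 3)."""
    return ((1.0 - ber) ** (data_bits - 2) * ber) ** j


def imo_failures_per_hour(streams, ber, j):
    """Hourly IMO failure rate (Equation 4); streams is a list of (f_i, length in bits)."""
    return 3600.0 * sum(f * p_ifo(ber, bits, j) for f, bits in streams)


# 9 streams of 110-bit messages, each sent once per 0.01 s cycle.
streams = [(100.0, 110)] * 9
for j in (1, 2, 3):
    print(j, imo_failures_per_hour(streams, ber=1e-7, j=j))
```

For BER = 10^-7 this gives rates close to the first column of Table 3 (about 3.2 x 10^-1, 3.2 x 10^-8 and 3.2 x 10^-15 failures per hour for one, two and three channels).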
The impact of IMOs in a time-triggered system is in many cases not as critical as in an event-triggered system. If messages are only sent in response to external events, the occurrence of an IMO can potentially result in a persistent situation in which the distributed system’s knowledge of its external environment (and hence its internal state) is inconsistent, which is potentially dangerous. This may not be the case for a time-triggered system.
Each message stream may be classified as containing either absolute (e.g. temperature) or incremental (e.g. change in temperature) data, and each message stream can also be classified in terms of its safety criticality. We also note that a system inconsistency after an IMO may only exist for a maximum of TPeriod in an absolute stream; as mentioned, a well-designed system can often tolerate the loss of a single sample without problems. However, in an incremental stream, the same potential problem exists whereby an inconsistency may persist for an indefinite, possibly dangerous time.
The number of IMO failures per hour may be calculated for individual message streams. If cost constraints dictate that (for example) a minimum number of channels must be used, further action can be taken to increase safety for critical messages, by duplicating the same data temporally as well as spatially. Techniques for designing a message schedule where critical streams are temporally duplicated are discussed in [49]. Thus the IMO failure rate for a particular message stream i duplicated r times in a j channel system may be calculated using Equation 5:
$$\lambda_{IMO_i} = 3600 \, f_i \left( P_{IFO_i} \right)^{r} \qquad (5)$$
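As a rough worked example of Equation 5 (the stream parameters here are assumptions chosen for illustration, not values from the case study), consider a 110-bit stream sent at $f_i = 100$ messages/second over a dual-channel system ($j = 2$) with $BER = 10^{-7}$. Neglecting the factor $(1 - BER)^{108} \approx 1$, Equation 3 gives $P_{IFO_i} \approx (10^{-7})^{2} = 10^{-14}$, so that:

```latex
\begin{align*}
r = 1: \quad \lambda_{IMO_i} &\approx 3600 \times 100 \times 10^{-14} = 3.6 \times 10^{-9} \text{ failures/hour} \\
r = 2: \quad \lambda_{IMO_i} &\approx 3600 \times 100 \times \left(10^{-14}\right)^{2} = 3.6 \times 10^{-23} \text{ failures/hour}
\end{align*}
```

In this example, a single temporal duplication reduces the predicted failure rate of the stream by a factor of $10^{14}$.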
Thus, even in a dual-channel system, critical message streams may be designed to very high reliability requirements, whilst also exhibiting tolerance to permanent hardware faults in the replicated communication system.

4.3 Impact on latency and channel utilization

In this section we consider the impact that the protocol has on the overall message latency and channel utilization.
The latency (i.e. response/transmission time) of a message broadcast is bounded and kept approximately constant in time-triggered systems. The worst-case transmission time of a CAN message was given in Equation 1. As previously mentioned, in each transmission we have (j-1) copies of the message, each delayed by a time D: thus the overall increase in latency when adding additional busses is a period equal to (j-1)*D.
For example, if D is set to a value of 5 bit-times (a value which we have found to be effective), this corresponds to an increase of approximately 3% in maximum latency (per channel) when using 8 data bytes and extended identifiers.
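The 3% figure can be checked directly from Equation 1. For 8 data bytes and extended identifiers ($DLC = 8$, $g = 54$), and with $D = 5b$:

```latex
\begin{align*}
C_m &= \left( 64 + 54 + 13 + \left\lfloor \tfrac{64 + 54 - 1}{4} \right\rfloor \right) b = (131 + 29)\, b = 160\, b \\
\frac{D}{C_m} &= \frac{5\, b}{160\, b} \approx 3.1\%
\end{align*}
```

Each additional channel therefore adds roughly 3% to the worst-case transmission time of such a message.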
Channel utilisation is a measure of how much of the total bus capacity is actually used, and ranges from 0% (no capacity used) to 100% (full capacity used). In order to enable a meaningful comparison of the effects of using this broadcast technique, we will consider the effects of adding extra channels to a system using the single-bus case as a benchmark (see footnote 11).
For the time-triggered bus, with n slots in the TDMA period, utilisation U can be defined as:
$$U = \frac{\sum_{i=1}^{n} M_i}{T_{Period}} \times 100 \qquad (6)$$

where Mi is the actual transmission time of message i and TPeriod is defined as:

$$T_{Period} = \sum_{i=1}^{n} S_i + T_{Idle} \qquad (7)$$

where TIdle is an inter-cycle ‘idle-time’ (i.e. a time period when the bus is idle between subsequent TDMA cycles), and Si is the slot time for each message i in the TDMA period (with a minimum duration defined by Equation 2).
Thus the channel utilisation depends on the nature of the message schedule, the accuracy of the clocks, the number of channels j and the idle period.
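Equations 6 and 7 translate into a few lines of code. The sketch below uses hypothetical names and assumes that transmission times, slot times and the idle time share one unit (microseconds in the example).

```python
def utilisation(message_times, slot_times, t_idle=0.0):
    """Channel utilisation in percent (Equation 6), with TPeriod from Equation 7."""
    if len(message_times) != len(slot_times):
        raise ValueError("one slot time is required per message")
    t_period = sum(slot_times) + t_idle
    return 100.0 * sum(message_times) / t_period


# One 160 us message per 184 us slot, no inter-cycle idle time.
print(round(utilisation([160.0], [184.0]), 2))
```

With a 160 µs message in a 184 µs slot and no idle time, the utilisation is just under 87%; widening every slot by D = 5 µs for a second channel lowers it to roughly 84.7%.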
Footnote 11: We chose this as a benchmark since, under error-free circumstances, it is possible to achieve channel utilisation levels in excess of 90% in purely time-triggered systems. This is in contrast to other (arbitrating) approaches, which can require utilisation bounds as low as 69% to guarantee deadlines [14].
By way of example, we shall consider the impact of using redundant channels, at various levels of clock accuracy, on a 1 Mbit/s system with no idle period, transmitting periodic messages with 8 data bytes and using extended identifiers. A table of utilisation U and slot size S for such a system is shown in Table 4. From this, we can see that the maximum possible bus utilisation for the TDMA strategy we have chosen, at maximum clock accuracy and bit rate, is 87% (see footnote 12). As we add additional busses into the system, the maximum utilisation of each individual bus remains at this level, but (considering the channels as a single entity) the maximum utilisation starts to decrease, and the minimum achievable slot size increases by a value D for each extra channel. In all, the impact of redundant channels on the achievable bus utilisation and minimum latency times is minimal.
Table 4: Network channel utilization (1000 Kbits/sec)

Clock accuracy ε (μs)   Measure   1 channel   2 channels   3 channels
2                       U (%)     87          84.7         82.5
2                       S (μs)    184         189          194
10                      U (%)     80          78.1         76.2
10                      S (μs)    200         205          210
100                     U (%)     42.1        41.6         41.1
100                     S (μs)    380         385          390
4.4 Suitability of the proposed protocol

Despite the fact that the impact of redundant channels is minimal, it can be seen from Table 4 that the bus utilisation in the system decreases dramatically as the level of clock accuracy decreases. This is because the required slot size S is highly dependent on the level of accuracy, and a larger idle period P is required at lower levels of accuracy. However, as the bit rate decreases, the impact of clock accuracy also decreases. If we repeat the previous exercise for a 125 Kbits/s system (Table 5), it can be seen that the overall levels of utilisation increase, even at a clock accuracy of 100 μs.
Footnote 12: If, however, we do not allow 20 bit times for error containment, the utilisation increases to 97.6%.
Table 5: Network channel utilization (125 Kbits/sec)

Clock accuracy ε (μs)   Measure   1 channel   2 channels   3 channels
2                       U (%)     88.7        88.4         88
2                       S (μs)    1446        1451         1456
10                      U (%)     87.7        87.4         87.1
10                      S (μs)    1462        1467         1472
100                     U (%)     78.1        77.8         77.6
100                     S (μs)    1642        1647         1652
In fact, if we can constrain the maximum error in the clocks to a value no greater than 10·b (where b is the bus bit-time), the achievable bus utilisation (even in systems with 6 channels) can be maintained at around 80%: this is higher than that achievable through the use of some standard (arbitrating) approaches.
Overall, as the results in this section demonstrate, the techniques presented in this paper support the timely delivery of all messages at high bus utilisation levels, and a graceful degradation in the presence of both transient and permanent errors in the communication channels. Given the nature of the results in this section, we suggest that a dual-channel system will give an optimal trade-off between reliability, bus utilisation and cost for many systems.
5. Case study

As can be seen from the description and analysis of the proposed protocol, the success of the strategy relies on the ability to maintain clock accuracy under normal operating conditions, and also in the presence of channel faults. In this section we present a simple case study that illustrates the effectiveness of the proposed protocol using a simple three-node test system employing a dual-channel architecture. All nodes in this test system were implemented using 16-bit Infineon C167CS microcontrollers, which incorporate dual CAN controllers. For this case study, we implemented a variant of a shared-clock scheduler [18] (see footnote 13). In this type of distributed embedded system, one accurate clock is used to drive the scheduler of a Master node, which sends periodic Tick messages across the CAN bus. The Slave nodes have schedulers that are driven by the arrival of these Tick messages; essentially only a single valid ‘Tick’ is required to synchronize the slave clocks. In this way, the activity on all the nodes in the system can be synchronized, and messages can be transmitted at specific time slots, employing a pre-defined TDMA schedule. Upon start-up (or following a continuous block of electrical interference), synchronisation of the distributed clocks takes approximately 300 μs in this system.

Footnote 13: Any practical clock synchronization algorithm (such as those outlined in Section 2) would produce similar results.
The bit rate employed in this study was 1 Mbit/s. With reference to Figure 2, the TDMA cycle in this simple test case used 4 slots: the Master node first transmits an (empty) time-reference (‘Tick’) message. Following this, each node is allotted a slot to transmit a single 8-byte message, containing (randomly generated) data. In each case, the length of the TDMA cycle (TPeriod) was equal to 5 ms; each slot width was equal to 1 ms, giving an additional idle period of 1 ms. To execute the application software, each node in the system employed a hybrid scheduler [18]: the single pre-empting task was used to handle the communication between nodes.
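The test schedule can be pictured with a small lookup sketch. The constants follow the figures given above (5 ms cycle, four 1 ms slots, 1 ms idle), but the slot labels and the order in which the three nodes transmit after the Tick are assumptions made for illustration.

```python
T_PERIOD_US = 5000  # TDMA cycle length
SLOT_US = 1000      # width of each slot
# Assumed ordering: Tick first, then one data slot per node.
SLOT_OWNERS = ["Master (Tick)", "Master", "Slave 1", "Slave 2"]


def slot_at(t_us):
    """Return the owner of the slot active at global time t_us, or None in the idle period."""
    offset = t_us % T_PERIOD_US
    index = offset // SLOT_US
    if index < len(SLOT_OWNERS):
        return SLOT_OWNERS[index]
    return None


for t in (0, 1500, 7200, 4500):
    print(t, slot_at(t))
```

A node consults its synchronized clock in this way to decide whether it may transmit; the last millisecond of each cycle belongs to no node.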
In order to measure the levels of clock synchronization, periodic tasks were created for both the Master and Slave nodes, with synchronous execution, once every 5 ms. At the start of the Master task, a port pin was set high (for a short period of time). In the Slaves, another pin (initially high) was set low at the start of the task, again for a short period. The signals from the Master pin and a Slave pin were then AND-ed (using a 74LS08N), to give a pulse stream. The widths of the resulting pulses were thus representative of the synchronization between the clocks, and were measured using a National Instruments data acquisition card ‘NI PCI6035E’, used in conjunction with the LabVIEW 7.1 software package.
Clock jitter levels were determined by taking the difference of the maximum and minimum delays in the sample set and by calculating the variance of the sample set as an indication of the average. In each experiment, 10,000 samples were taken, for four different conditions covering intermittent and permanent channel failures:
Normal system operation (CAN1 and CAN2 OK).
Partial system operation (CAN1 faulted, CAN2 OK).
Partial system operation (CAN2 faulted, CAN1 OK).
Random faults on either CAN1 or CAN2 during the measurement period.
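The jitter statistics described above can be computed as in the following sketch. The function name is hypothetical, the sample values are made up for illustration (they are not measurements from this study), and the use of the population standard deviation as the spread measure is an assumption.

```python
import statistics


def jitter_summary(delays_us):
    """Summarise a set of measured delays (microseconds)."""
    max_d = max(delays_us)
    min_d = min(delays_us)
    return {
        "max": max_d,
        "min": min_d,
        "max_min": max_d - min_d,                # peak-to-peak jitter
        "mean": statistics.mean(delays_us),
        "std": statistics.pstdev(delays_us),     # population standard deviation
    }


# Illustrative values only.
summary = jitter_summary([0.5, 1.0, 1.5, 2.0])
print(summary)
```

Half of the peak-to-peak value gives the symmetric (+/-) synchronization bound for the sample set.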
In order to inject the failures into each underlying channel, we employed a fault injector controlled by a separate PC. This is shown schematically in Figure 6. The random faults were injected with an average inter-arrival of 1000 ms. All injected faults were cleared after 250 ms, allowing the relay contact plenty of time to operate.
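A fault schedule of this kind could be generated as sketched below. Only the 1000 ms average inter-arrival time and the 250 ms fault duration come from the description above; the exponential inter-arrival distribution, the uniform choice of channel and the function name are assumptions, and the sketch does not model the relay hardware.

```python
import random


def fault_schedule(duration_ms, mean_interarrival_ms=1000.0, fault_ms=250.0, seed=0):
    """Return a list of (start_ms, channel, duration_ms) fault events."""
    rng = random.Random(seed)  # seeded for repeatable experiments
    t = 0.0
    events = []
    while True:
        # Assumed exponential inter-arrival times with the stated mean.
        t += rng.expovariate(1.0 / mean_interarrival_ms)
        if t >= duration_ms:
            break
        events.append((t, rng.choice(["CAN1", "CAN2"]), fault_ms))
    return events


events = fault_schedule(duration_ms=60000.0)
print(len(events), events[:3])
```

Because inter-arrival times are random, consecutive faults may overlap in this sketch, so both channels can be faulted for short intervals.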
Figure 6: Fault injection procedure

The clock synchronization results obtained are shown in Table 6 (units of μs). From this table, it can be seen that a worst-case clock synchronization of +/- 1.125 μs could be guaranteed, with an average accuracy of less than 0.6 μs, regardless of the fault status of the channels. Thus, with a clock accuracy of 2.25 μs, the constraint that the clock error should not exceed 10·b is more than satisfied: the protocol can therefore be applied even at the highest bit rate.
In addition, it was noted that no data errors or missing samples were recorded during this period, indicating that all messages sent over healthy channels were delivered and processed correctly. These results indicate that, even in the presence of faults, no node in the network has lost its clock accuracy, and the TDMA schedule was maintained.

Table 6: Jitter measurements for fault scenarios (μs)

Measurement    Max     Min     Max-Min    Ave (Std)
Normal         2.42    0.30    2.12       0.55
CAN1 Only      2.75    0.70    2.05       0.55
CAN2 Only      2.70    0.58    2.12       0.59
Random         2.55    0.30    2.25       0.57
6. Conclusions and further work

At the start of this paper, we argued that CAN has five main limitations: [i] Lack of support for time-triggered communications; [ii] Incomplete support for reliable group communications; [iii] Lack of support for redundant bus arrangements; [iv] Lack of mechanisms to handle “babbling idiot” errors; [v] Limited bandwidth.
During the course of this paper, we have discussed all of these limitations and proposed solutions to the first three problems which, together, can be used to increase the reliability of CAN-based designs. Overall, while no single protocol can satisfy the requirements of all systems, we believe that the techniques we have described in this paper may be adapted to complement, and potentially improve, the features of many of the numerous CAN-based protocols which are already in existence.
As can be seen from the analysis presented in Section 4 and Section 5, the proposed techniques support highly deterministic message transfers and are robust to failures in the communication channels. We also note that, under fault-free circumstances, the redundancy management technique has a negligible impact on the system bandwidth, and provides clock synchronization levels that are robust to faults in any of the underlying channels. Finally, we note that the levels of clock synchronization over multiple channels we achieved in this study exceed those currently demonstrated by TT-CAN [37]. In addition, we note that there is no practical reason why one (or more) of the slots in the static communication schedule cannot be designated for use as ‘arbitrated’ windows. Further work will explore this possibility.
As noted in Section 3.2, we have not considered the failure of network nodes in this paper. Neither have we considered the problem of babbling idiot protection. Addressing these problems requires an appropriate form of ‘bus guardian’. As noted in Section 2.4, various guardians have been proposed. However, many of these guardians are problematic. Some violate a basic CAN principle: when the bus drivers are disabled, the CAN controller is prevented from asserting valid ACK bits or transmitting error frames [21][43]. In addition, many existing guardians are an imperfect match for time-triggered systems [31]; they reduce the effective bandwidth to an unacceptable level [33]; or they may be costly, due to their implementation complexity [45].
Further work in this area will include detailed proposals for a higher-level, fault-tolerant architecture designed to address these issues. This architecture will include software / hardware redundancy management schemes, provisions against babbling idiot failures and node re-integration capabilities. Such schemes have been shown in simulation studies to considerably improve dependability [50].
Acknowledgements

The project described in this paper was supported by the Leverhulme Trust (Grant F/00212/D). The authors would like to thank Jianzhong Fang (University of Leicester) and the anonymous reviewers for their comments during the preparation of the paper.
References

[1] R. Bosch, CAN Specification 2.0, Robert Bosch GmbH, Postfach 300240, D-700 Stuttgart 30, September 1991.
[2] M. Farsi and M. Barbosa, CANopen Implementation: applications to industrial networks, Research Studies Press Ltd, England, 2000.
[3] L.B. Fredriksson, “Controller Area Networks and the protocol CAN for machine control systems,” Mechatronics, Vol. 4, No. 2, pp. 159-192, 1994.
[4] K. Etschberger, Controller Area Network: Basics, Protocols, Chips and Applications, IXXAT Automation GmbH, 2001.
[5] K. Pazul, Controller Area Network (CAN) Basics, Microchip Technology Inc., AN713, Preliminary DS00713A, 1999.
[6] Philips, P8x592 8-bit microcontroller with on-chip CAN datasheet, Philips Semiconductor, 1996.
[7] Siemens, C515C 8-bit CMOS microcontroller, user’s manual, Siemens, 1997.
[8] Infineon, C167CR Derivatives 16-Bit Single-Chip Microcontroller, Infineon Technologies, 2000.
[9] Philips, LPC2119/2129/2194/2292/2294 microcontroller user manual, Philips Semiconductor, 2004.
[10] FlexRay, FlexRay Communication System Protocol Specification Version 2.0, FlexRay Consortium, June 2004.
[11] TTA-Group, Time-Triggered Protocol TTP/C High-Level Specification Document Protocol Version 1.1, Version 1.4.3, TTTECH, Vienna, Austria, Nov 2003.
[12] H. Kopetz, “A Comparison of CAN and TTP,” Annual Reviews in Control, Vol. 24, pp. 177-188, 2000.
[13] K. Turski, “A global time system for CAN networks,” in Proceedings of the 1st International CAN Conference, CiA, pp. 31-36, 1994.
[14] I. Broster, Flexibility in Dependable Real-Time Communication, PhD Dissertation, University of York, UK, August 2003.
[15] B. Donnelly and J. Cosgrove, “Achieving Microsecond Accuracy With 32 bit Microcontrollers using the Controller Area Network (CAN),” in Irish Signals and Systems Conference, pp. 508-513, Belfast, Ireland, June/July 2004.
[16] I. Broster and A. Burns, “Timely use of the CAN Protocol in Critical Hard Real-Time Systems with Faults,” in Proc. 13th Euromicro Conference on Real-Time Systems (ECRTS 2001), 13-15 June 2001, Delft, The Netherlands, pp. 95-102, 2001.
[17] T. Fuhrer, B. Muller, W. Dieterle, F. Hartwich, R. Hugel and H. Weiler, “Time triggered communication on CAN,” in Proceedings of the 7th International CAN Conference, Amsterdam, Netherlands, 24th-26th October 2000.
[18] M.J. Pont, Patterns For Time Triggered Embedded Systems, Addison Wesley, 2001.
[19] D. Ayavoo, M.J. Pont, M. Short, and S. Parker, “Two novel shared-clock scheduling algorithms for use with CAN-based distributed systems,” Microprocessors and Microsystems, [doi:10.1016/j.micpro.2006.11.002], 2007.
[20] J.R. Pimentel and J.A. Fonseca, “FlexCAN: A Flexible Architecture for Highly Dependable Embedded Applications,” Paper presented at the 3rd Int. Workshop on Real-Time Networks, Catania, Italy, July 2004.
[21] G. Buja, A. Zucollo and J. Pimentel, “Overcoming Babbling-Idiot Failures in the FlexCAN Architecture: A Simple Bus-Guardian,” IEEE Int. Workshop on Emerging Technologies in Factory Automation (ETFA05), Catania, Italy, Sept. 2005.
[22] M. Bertoluzzo, G. Buja and J. Pimentel, “Design of a Safety-Critical Drive-By-Wire System Using FlexCAN,” Presented at the SAE World Congress 2006, Detroit, MI, USA, SAE Paper No. 2006-01-1026, April 2006.
[23] J. Rufino, P. Veríssimo, G. Arroz, C. Almeida and L. Rodrigues, “Fault-Tolerant Broadcasts in CAN,” in Proc. of the 28th Fault-Tolerant Computing Symposium (FTCS), pp. 150-159, 1998.
[24] J. Proenza and J. Miro-Julia, “MajorCAN: A Modification to the Controller Area Network Protocol to Achieve Atomic Broadcast,” in Proc. of the IEEE Int’l Workshop on Group Communication and Computations (IWGCC), pp. C72-C79, 2000.
[25] J. Kaiser and M.A. Livani, “Achieving Fault-Tolerant Ordered Broadcasts in CAN,” in Proc. of the 3rd European Dependable Computing Conference, 1999.
[26] L.M. Pinho and F. Vasques, “Reliable Real-Time Communication in CAN Networks,” IEEE Transactions On Computers, Vol. 52, No. 12, December 2003.
[27] L. Rodrigues, M. Guimarães and J. Rufino, “Fault-Tolerant Clock Synchronization in CAN,” in Proceedings of the 19th IEEE Real-Time Systems Symposium, Madrid, Spain, December 1998.
[28] Philips, SJA1000 Stand-Alone CAN Controller: Product Specification, 4th January 2000.
[29] C. Ryan, D. Heffernan and G. Leen, “Clock synchronization on multiple TTCAN network channels,” Microprocessors and Microsystems, Vol. 28, pp. 135-146, 2004.
[30] I. Broster and A. Burns, “An analyzable bus-guardian for event triggered communication,” in Proc. of the 24th IEEE Real-Time Systems Symposium, pp. 410-419, Dec 2003.
[31] J. Ferreira, E. Martins, P. Pedreiras, J. Fonseca and L. Almeida, “Components to Enforce Fail-Silent Behaviour in Dynamic Master-Slave Systems,” in 5th IFAC Int. Symposium on Intelligent Components and Instruments for Control Applications, Aveiro, Portugal, July 2003.
[32] K. Tindell and H. Hansson, “Babbling Idiots, The Dual-Priority Protocol, and Smart CAN Controllers,” in Proceedings of the 1st International CAN Conference, CiA, pp. 722-728, 1994.
[33] A. Tyagi, Design and Implementation of a Bus Guardian in Controller Area Networks, MSc Project Report, University of Leicester, 2004.
[34] G. Cena and A. Valenzano, “FastCAN: A High-Performance Enhanced CAN-Like Network,” IEEE Transactions on Industrial Electronics, Vol. 47, No. 4, pp. 951-963, 2000.
[35] S. Shaheen, D. Heffernan and G. Leen, “A comparison of emerging time triggered protocols for automotive X-by-wire control networks,” IMechE Journal of Automobile Engineering, 217 (D1), pp. 13-22, 2003.
[36] L. Fredriksson, “CAN for Critical Embedded Automotive Networks,” IEEE Micro, Vol. 22, Issue 4, pp. 28-35, July-Aug 2002.
[37] R. Isermann, R. Schwarz and S. Stoltz, “Fault-tolerant drive-by-wire Systems,” IEEE Control Systems Magazine, Vol. 22, No. 5, pp. 64-81, 2002.
[38] J. Nilsson, B. Bernhardsson and B. Wittenmark, “Some topics in real-time control,” in Proceedings of the 17th American Control Conference, pp. 2386-2390, Philadelphia, Pennsylvania, June 1998.
[39] Microchip, MCP2515 Stand-Alone CAN Controller with SPI Interface: Datasheet, 3rd January 2005.
[40] Infineon, XC167 User’s Manual Volume 2 of 2: Peripheral Units Version 2.0, April 2004.
[41] T. Nolte, H.A. Hansson and C. Norström, “Minimizing CAN response-time jitter by message manipulation,” in The 8th IEEE Real-Time and Embedded Technology and Applications Symposium (RTAS 2002), San Jose, California, 2002.
[42] M. Nahas, M.J. Short, and M.J. Pont, “Exploring the impact of software bit stuffing on the behaviour of a distributed embedded control system implemented using CAN,” in Proceedings of the 10th International CAN Conference, held in Rome, pp. 10-1 to 10-7, 8-10 March 2005.
[43] C. Temple, “Avoiding the Babbling-Idiot Failure in a Time-Triggered Communication System,” Paper presented at the 28th Annual International Symposium on Fault-Tolerant Computing, Munich, Germany, June 1998.
[44] F. Cottet and L. David, “A solution to the time jitter removal in deadline based scheduling of real-time applications,” in Proc. 5th IEEE Real-Time Technology and Applications Symposium - WIP, Vancouver, Canada, pp. 33-38, 1999.
[45] J. Ferreira, L. Almeida, J.A. Fonseca, P. Pedreiras, E. Martins, G. Rodríguez-Navas, J. Rigo and J. Proenza, “Combining Operational Flexibility and Dependability in FTT-CAN,” IEEE Trans. Industrial Informatics, Vol. 2, No. 2, pp. 95-102, 2006.
[46] J. Rufino, Computational System for Real-Time Distributed Control, PhD Thesis, Technical University of Lisbon, Lisbon, Portugal, July 2002.
[47] L. Almeida, P. Pedreiras and J.A.G. Fonseca, “The FTT-CAN Protocol: Why and How,” IEEE Trans. On Industrial Electronics, Vol. 49, No. 6, pp. 1189-1201, 2002.
[48] J. Rufino, P. Veríssimo and G. Arroz, “A Columbus' Egg Idea for CAN Media Redundancy,” in Proc. Twenty-Ninth Annual International Symposium on Fault-Tolerant Computing, pp. 286-293, 1999.
[49] N. Kandasamy, J.P. Hayes and B.T. Murray, “Dependable communication synthesis for distributed embedded systems,” Reliability Engineering and System Safety, Vol. 89, pp. 81-92, 2002.
[50] E.A. Latronico, Reliability Validation of Group Membership Services for X-by-Wire Protocols, PhD Dissertation, Carnegie-Mellon University, USA, May 2005.
[51] J. Ferreira, A. Oliveira, P. Fonseca and J.A. Fonseca, “An Experiment to Assess Bit Error Rate in CAN,” in Proc. 3rd Int. Workshop on Real-Time Networks, June 2004.
[52] J. Rufino, P. Veríssimo and G. Arroz, “Node Failure Detection and Membership in CANELy,” in Proc. 2003 International Conference on Dependable Systems and Networks (DSN'03), pp. 331-340, 2003.
[53] MIL-HDBK-217F, Military Handbook of Reliability Prediction of Electronic Equipment, December 1991.
[54] M.J. Pont, S. Kurian, and R. Bautista-Quintero, “Meeting real-time constraints using ‘Sandwich Delays’,” Paper presented at the 11th European Conference on Pattern Languages of Programs (EuroPLoP 2006), Germany, July 2006.