A Test Tool for FlexRay-based Embedded Systems

Martin Horauer, Oliver Praprotnik, Martin Zauner, Roland Höller
University of Applied Sciences Technikum Wien
Höchstädtplatz 5, A-1200 Vienna, Austria
Email: [email protected]

Paul Milbredt
Audi Electronics Venture GmbH
Sachsstraße 18, D-85080 Gaimersheim, Germany
Email: [email protected]
Abstract—In this paper we present an architecture for a test and diagnosis toolset for FlexRay-based automotive distributed networks. Next to data monitoring and recording, the toolset provides facilities for fault injection and replay. The presented implementation is tailored for embedded test and fault diagnosis and will enable an assessment of the reliability and dependability of future automotive solutions.
I. INTRODUCTION

Nowadays embedded electronic hardware and software are the most important innovators in the automotive domain. The use of electronic control units (ECUs) interconnected by various fieldbus systems enables the replacement of purely mechanical and hydraulic solutions. With distributed electronic systems, existing applications can be enhanced and new ones developed ("x-by-wire"). Fig. 1 illustrates part of a typical automotive system architecture built from various electronic control units interconnected by several different fieldbus systems. The latter are usually dedicated to different domains, e.g., power-train or chassis control, and linked with each other using special gateway nodes. With more than 70 ECUs running various distributed applications it is obvious that the network architecture is a crucial point. To that end an industrial consortium has established the communication protocol FlexRay to address the expected requirements.
Fig. 1. A typical automotive network
FlexRay relies on both the time- and event-triggered paradigms and provides reliability and fault-tolerance mechanisms along with the bandwidth to serve the needs of future automotive solutions. In particular, it supports two redundant channels; line, star and hybrid bus topologies; and bus speeds of up to 10 Mbit/s per channel. Communication occurs either in static time slots that are statically assigned to every node during system design or in an optional event-triggered part that uses mini-slotting for bus arbitration, see [1]. An optional bus guardian ensures fail-silent behavior of nodes that exhibit babbling-idiot failures. All these features are required since FlexRay will be introduced for safety-critical applications where a failure can lead to severe consequences.

For all stages of the typical development cycle according to the V-model, a multitude of tools is required. One important issue herein is support for systematic debugging and testing at the system level in order to evaluate whether the system is correctly implemented and configured and will react as expected in its future field environment. Especially with more and more manufacturers relying on independent suppliers, the need for testing becomes paramount. To that end, this paper presents some concepts of a work-in-progress test environment for the system test of FlexRay-based communication subsystems. The following sections highlight the involved testing issues and present related work before we describe the architecture of the presented solution and detail its benefits. Finally, the conclusion provides an outlook on our next development steps.

II. TESTING ISSUES

Apart from pure node tests, a systematic approach with adequate tool support is required for a test at the system level. Herein, the following aspects are of relevance:

Fault Hypothesis. Prior to test execution one should have a proper fault hypothesis in mind, i.e., an assumption that the system-under-test will exhibit a certain failure semantic. A component exhibits a failure semantic when the probability of failure modes not covered by that failure semantic is sufficiently low. For example, a component exhibits interleaving failures when the probability of timing or arbitrary failures is insignificant, cf. [2].

Controllability. In order to execute a test it is required (i) to bring the system-under-test into the required test state/mode, (ii) to generate stimuli and (iii) to feed them to the inputs of the system-under-test. In the context of automotive electronics, the first item (a.k.a. offline test) is by its nature easier to achieve during application development in the lab, before system operation, or during maintenance. It is, however, much more problematic for an online test. The second item has to deal with the large system state space and hence faces either an excessive test duration or a very limited test coverage. The third item is often hard to achieve due to accessibility
restrictions of the distributed embedded system. Hence, some kind of remote test that provides the test stimuli and collects the response information via the fieldbus system is desirable here. Note that remote testing is applicable in both offline and online modes.

Observability defines where and how the outputs and potential internal states of the system-under-test can be observed without influencing the system itself (a.k.a. the probe effect). Furthermore, it addresses how the output information of the test can be collected for a succinct evaluation and interpretation.

When a composable architecture is used to build a distributed embedded system, the task of testing may be eased when so-called "temporal firewalls" partition the system into smaller error or fault containment regions, respectively, cf. [8]. Temporal firewalls are a kind of dual-ported memory interface with the following characteristics:
• only data (state information) is allowed to pass through, whereas the flow of control information is derived from the synchronized time base
• access from either side is only allowed at fixed, predetermined instants of time, e.g., using some kind of non-blocking synchronized access

Examples of temporal firewalls in FlexRay are the controller-host interfaces between the communication controllers and the processing units at every node. Hence, in most cases the test problem for time-triggered distributed systems is reduced to testing every error containment region in isolation. In particular, this requires dedicated node tests on the one hand and tests for the communication subsystem on the other hand. For the latter (the focus of this paper) the following mechanisms are usually employed:
• monitoring the bus traffic
• replay of bus traffic
• fault injection

III. RELATED WORK

Tools to monitor bus traffic are in widespread use for testing, debugging and optimization of various communication services and protocols.
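Conceptually, such a monitor timestamps every frame it observes and applies trigger/filter predicates before recording. The following is a minimal sketch of this idea; the class, the dictionary-based frame representation and the `slot` field are illustrative assumptions, not part of any actual tool or of the FlexRay specification:

```python
import time

class BusMonitor:
    """Sketch of a promiscuous-mode frame monitor: every observed frame
    (even a corrupt one) is timestamped on reception, and only frames
    matching the filter predicate are recorded. Names are illustrative."""

    def __init__(self, frame_filter=lambda frame: True):
        self.frame_filter = frame_filter
        self.trace = []  # recorded (timestamp, frame) pairs

    def on_frame(self, frame, timestamp=None):
        # Timestamp on reception; a hardware monitor would use a dedicated
        # timer unit instead of the host clock used here.
        ts = time.monotonic() if timestamp is None else timestamp
        if self.frame_filter(frame):
            self.trace.append((ts, frame))
```

For example, `BusMonitor(frame_filter=lambda f: f["slot"] == 3)` would record only the traffic of one static slot, which corresponds to the filter mechanisms mentioned below.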
Examples of COTS network analysis and monitoring tools tailored to fieldbus systems like CAN, LIN, MOST, FlexRay or TTP/C are CANalyzer (http://www.vector-informatik.com), TTView (http://www.tttech.com) or the BusDoctor (http://www.decomsys.com). Implementation issues of these and similar tools along with some use-case scenarios can be found, e.g., in [3], [4], [5]. The majority of these monitoring and network analysis tools operate at or above the medium access layer, employing COTS network controllers and device drivers in some kind of promiscuous mode where all (even corrupt) frames are forwarded to the processing CPU. When looking more closely at these solutions, the following main approaches for implementing monitoring services can be distinguished:
(1) Software monitoring instruments the software with specific services. [9] distinguishes between kernel-probes, inline-probes, probe-tasks and probe-nodes. The drawbacks of either approach are the probe effect – the system has to be modified – and the additional resource requirements involved. The probe effect can be minimized when the monitoring services are built in, becoming a basic standard service of the system. However, this implies that enough resources are available, which is sometimes a costly requirement.

(2) Hardware monitoring can be used to improve the observability of the system. This approach uses (dedicated) hardware to collect the required information [7], and thus has the advantage of minimizing the interference with the target system. Typically, the monitoring hardware is connected to the parallel bus of the node's CPU and monitors the memory transfers, e.g., using in-system emulators or logic analyzers; see [6], [10] for further examples using dedicated hardware. The main drawback of this approach is, usually, the low abstraction level (electrical signals) of the data being monitored.

(3) Hybrid monitoring takes advantage of both hardware and software monitoring. In hybrid monitoring, triggering is accomplished in software and recording in hardware [7]. Perturbation of the monitored system is greatly reduced, while high-level data can be monitored, see e.g. [11].

Replay is used to re-enact previously recorded scenarios for test and debugging purposes, cf. [9], [12]. In particular, it comprises (1) pattern (stimulus) generation/modification and (2) the way the generated stimuli are applied to the system-under-test. Stimuli may be obtained either via monitoring and optional modification or using some stimulus generation tool. Whereas the monitoring and modification approach is easier to accomplish and more representative, stimulus generation usually provides a higher degree of freedom.
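The monitoring-and-modification route to stimulus generation can be sketched as follows; the `Frame` fields and function names are assumptions for illustration only, not the actual FlexRay frame format or any tool's interface:

```python
from dataclasses import dataclass, replace as dc_replace

@dataclass(frozen=True)
class Frame:
    """Illustrative frame record as a monitor might store it."""
    timestamp_us: int
    slot_id: int
    payload: bytes

def modify(trace, slot_id, new_payload):
    """Derive a stimulus set from a recorded trace by patching the payload
    of every frame sent in one slot, leaving timestamps untouched so the
    temporal structure of the recording is preserved."""
    return [dc_replace(f, payload=new_payload) if f.slot_id == slot_id else f
            for f in trace]
```

A recorded trace thus stays representative of real bus traffic, while single fields can be altered to provoke specific reactions in the system-under-test.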
The application of stimuli to a system-under-test is accomplished in either (i) causal or (ii) temporal order. Correct causal order is useful when simply the sequence of events is of interest, whereas the stronger requirement of temporal order is mandatory when real-time behavior is involved. To accomplish either form of replay, a clock synchronization with a precision π lower than the shortest difference between any two events (e_i, e_j) is a necessary precondition: π < |e_i − e_j|. Even with this service at hand, exact temporal replay is hard to achieve since all events must be issued at the same relative instants as they were recorded. For time-triggered communication this is usually only possible when the tester replays an entire cluster in synchronous mode, dictating the otherwise democratic clock synchronization. This, however, requires a modification of the network topology/connections.

Fault injection is a very popular mechanism to artificially insert faults and/or errors into the system in order to create conditions under which the deployed fault-tolerance mechanisms are activated, see [13] for a survey. Fault injection allows detailed studies of the complex interaction between faults and
fault handling mechanisms. For example, fault injection can be used to test and estimate the coverage of error detection mechanisms, i.e., the success rate of the mechanisms. The various fault injection techniques and tools that have been introduced over the years can generally be divided into three categories [14]:
• Simulation-Based Fault Injection
• Software-Implemented Fault Injection
• Hardware-Implemented Fault Injection

In this context, the contributions of this work-in-progress paper are the presentation of an architecture for the implementation of a tester along with some use-cases for our prototype implementation.

IV. TESTER ARCHITECTURE

The test approach taken in our project to implement debug and test tools relies on the star bus topology of FlexRay. All debug and test mechanisms are built into a central star coupler. The most relevant advantages of this approach over existing ones are: (1) More complex fault scenarios can be tested, e.g., Byzantine faults; with COTS nodes alone it is often hard to generate and evaluate complex fault scenarios. (2) Replay of bus traffic does not necessarily require a physical re-arrangement (disconnection) of the nodes. (3) Start-up and re-integration scenarios can be tested and evaluated more easily by manipulating the corresponding symbols. Fig. 2 depicts the central architecture of our prototype tester implementation for FlexRay:
The basic elements of an active star coupler are implemented using COTS physical layer chips (PHY) and a custom active star logic implemented within an FPGA. Next to the physical layer devices, dedicated NRZ encoders/decoders provide the suitable bit encoding for the FlexRay physical layer protocol. In between the codecs and the active star logic, a physical layer fault injection and manipulation unit allows shifting and/or delaying certain frames and/or channels. Furthermore, several bits within a frame can be flipped. The FlexRay receiver and transmitter units provide basic encapsulation and de-encapsulation mechanisms for FlexRay packets. A monitoring unit and a replay unit (the latter with various fault-injection capabilities) allow frame-based access to the bus traffic. Either mechanism uses a central timer with timestamp capabilities. All the aforementioned modules are controlled via a dedicated control unit that also hosts trigger and filter mechanisms. The bus interface handles the data flow to/from a remote host via a high-speed USB 2.0 controller. The host part in turn consists of a suitable library and a graphical user interface for presenting the data to the user and for controlling the embedded tester.
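The two elementary manipulations of the physical-layer fault injection unit, flipping bits within a frame and shifting a frame in time, can be illustrated with a small sketch. The function names and the MSB-first bit numbering are arbitrary choices for this illustration, not the tester's actual interface:

```python
def flip_bits(payload: bytes, bit_positions):
    """Flip selected bits in a frame payload, emulating the bit-flip
    capability of the fault injection unit (MSB-first bit numbering
    is an arbitrary convention chosen for this sketch)."""
    data = bytearray(payload)
    for pos in bit_positions:
        data[pos // 8] ^= 0x80 >> (pos % 8)
    return bytes(data)

def delay_frame(timestamp_us: int, delay_us: int) -> int:
    """Shift a frame's transmission instant, as the physical-layer
    manipulation unit does when delaying frames or channels."""
    return timestamp_us + delay_us
```

In the actual tester these operations happen in FPGA logic between the codecs and the active star logic; the sketch only captures their effect on frame content and timing.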
Fig. 2. FlexRay Communication Subsystem Tester – Architectural Overview
V. APPLICATIONS AND USE-CASES

The following scenarios illustrate some examples where the above-described debug and test tool would be required:

Start-up: One key benefit of a composable time-triggered architecture is that several applications may be developed independently of each other and integrated into a cluster later on. A problem arises, on the one hand, when the cluster does not start, e.g., due to interoperability problems. Here the above test tool can give valuable insights into the actual bus communication, hence verify the design and make it easier to identify problems. On the other hand, if the prototype cluster does start up, robustness tests are beneficial to inspect whether the given configuration will start under various different conditions. For example, with our tester one could manipulate frames at the bit level or shift certain frames in time to emulate transient electrical problems that may occur, e.g., due to bad connector contacts.

System operation: Sometimes a system is successfully put into operation but then suddenly fails. For example, an error detection mechanism on one node triggers, and as a consequence the communication controller switches the node to silent, i.e., it prevents the node from sending further messages – which are, however, expected by the other nodes. Here, monitoring and replay can be useful to uncover the fault that triggered this error.

Robustness: If no failure is encountered during prototype operation, robustness tests shall support the projection of this result to mass production. Here, a tool is required to systematically subject the system to all anticipated fault scenarios and thus test the fault-tolerance services.

The above list is by no means complete; however, it illustrates the significance of adequate tool support for the various testing stages.
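Subjecting the system to "all anticipated fault scenarios" amounts to enumerating the cross product of the fault dimensions a test campaign covers. A minimal sketch, where the concrete fault types, channels and slot numbers are hypothetical examples rather than the tester's actual configuration:

```python
from itertools import product

# Hypothetical fault dimensions for a systematic robustness campaign;
# the concrete values are illustrative, not taken from the tester.
FAULT_TYPES = ("bit_flip", "frame_delay", "frame_drop")
CHANNELS = ("A", "B")
SLOTS = range(1, 4)

def fault_scenarios():
    """Enumerate every combination of fault type, channel and slot so a
    test run can subject the cluster to each anticipated scenario."""
    return [{"fault": f, "channel": c, "slot": s}
            for f, c, s in product(FAULT_TYPES, CHANNELS, SLOTS)]
```

Even this toy parameter space yields 18 scenarios, which hints at why systematic tool support, rather than manual setup, is needed for realistic fault dimensions.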
VI. CONCLUSION

With the adoption of FlexRay for series production, test solutions for the complete life cycle of a car will gain
significance. In particular, tests for the communication subsystem during the various stages of the development process will be required to aid the implementation and verification process, since ever more applications with enhanced functionalities will be introduced. To that end, this paper introduced an architecture for a tester that will simplify this job. Next to an implementation, some use-cases were presented. The presented approach can be re-used for many different purposes, e.g., for debugging and testing during system verification, for interoperability or robustness tests, as well as for maintenance tests. The next steps in our project involve the development of a demonstrator system that will serve as a test-bed for future experimental evaluations along with our tester. Furthermore, we will try to enhance the combination of monitoring, replay, and fault injection with a fault dictionary that will enable a concise fault diagnosis and location.

ACKNOWLEDGMENT

The work presented in this paper is performed in the context of the project DECS, which is funded by the FHplus programme of the Austrian Federal Ministries BMVIT and BMBWK and managed by the Austrian Research Agency FFG under contract 811414.

REFERENCES

[1] FlexRay Consortium, FlexRay Communications System, Protocol Specification Version 2.1 Revision A, 2005. http://www.flexray.com
[2] F. Cristian, Understanding fault-tolerant distributed systems, Communications of the ACM, Vol. 34, No. 2, pp. 56-78, Feb. 1991.
[3] P. Peti, R. Obermaisser, W. Elmenreich, and T. Losert, An architecture supporting monitoring and configuration in real-time smart transducer networks, Proc. of IEEE Sensors, pp. 1479-1484, 2002.
[4] M.S. Reorda and M. Violante, On-line analysis and perturbation of CAN networks, 19th IEEE Int. Symposium on Defect and Fault Tolerance in VLSI Systems, pp. 424-432, Cannes, France, Oct. 2004.
[5] E. Armengaud, F. Rothensteiner, A. Steininger, R. Pallierer, M. Horauer, and M. Zauner, A Structured Approach for the Systematic Test of Embedded Automotive Communication Systems, Int. Test Conference (ITC 2005), paper 2.1, Austin, Texas, USA, Nov. 2005.
[6] J.P. Tsai, K.-Y. Fang, H.-Y. Chen, and Y.-D. Bi, A Noninterference Monitoring and Replay Mechanism for Real-Time Software Testing and Debugging, IEEE Trans. on Software Eng., Vol. 16, pp. 897-916, 1990.
[7] J.P. Tsai, Y.-D. Bi, S. Yang, and R. Smith, Distributed Real-Time Systems: Monitoring, Visualization, Debugging, and Analysis, Wiley-Interscience, 1996. ISBN 0-471-16007-5.
[8] H. Kopetz, On the Fault Hypothesis for a Safety-Critical Real-Time System, Workshop on Future Generation Software Architectures in the Automotive Domain, pp. 14-23, San Diego, USA, Jan. 2004.
[9] H. Thane, Monitoring, Testing and Debugging of Distributed Real-Time Systems, Doctoral Thesis, Mechatronics Laboratory, Department of Machine Design, Royal Institute of Technology (KTH), Stockholm, Sweden, 2000. ISSN 1400-1179.
[10] I. Smaili, Real-Time Monitoring for the Time-Triggered Architecture, Doctoral Thesis, Vienna University of Technology, 2004.
[11] M. Thoss, Automated High-Accuracy Hybrid Measurement for Distributed Embedded Systems, 3rd International Workshop on Intelligent Solutions in Embedded Systems, pp. 39-48, May 2005.
[12] H. Thane, D. Sundmark, J. Huselius, and A. Pettersson, Replay debugging of real-time systems using time machines, Proceedings of the International Parallel and Distributed Processing Symposium, April 2003.
[13] M.C. Hsueh, T.K. Tsai, and R.K. Iyer, Fault Injection Techniques and Tools, IEEE Computer, Vol. 30, No. 4, pp. 75-82, 1997.
[14] J. Arlat, Y. Crouzet, J. Karlsson, P. Folkesson, E. Fuchs, and G.H. Leber, Comparison of Physical and Software-Implemented Fault Injection Techniques, IEEE Transactions on Computers, Vol. 52, No. 9, pp. 1115-1133, Sept. 2003.