PAK vol. 58, nr 7/2012
Michał MOSDORF, Konrad GROCHOWSKI, Janusz SOSNOWSKI, Piotr GAWKOWSKI INSTYTUT INFORMATYKI, POLITECHNIKA WARSZAWSKA, ul. Nowowiejska 15/19, 00-665 Warszawa
Gas-flow computer with SBST

Mgr inż. Michał MOSDORF
PhD student at the Institute of Computer Science, Faculty of Electronics and Information Technology, Warsaw University of Technology. Graduated in Computer Science from the Faculty of Electronics and Information Technology (2009). Conducts research on software reliability in embedded systems.
e-mail: [email protected]

Prof. dr hab. inż. Janusz SOSNOWSKI
Graduated from the Faculty of Electronics and Information Technology at Warsaw University of Technology. Received the title of professor in 2006. Currently employed as a professor at the Institute of Computer Science of Warsaw University of Technology. He is an author and coauthor of 200 publications. His scientific interests concern computer systems reliability, architecture and interface design.
e-mail: [email protected]

Mgr inż. Konrad GROCHOWSKI
PhD student at the Institute of Computer Science, Faculty of Electronics and Information Technology, Warsaw University of Technology. Graduated in Computer Science from the Faculty of Electronics and Information Technology (2010). His scientific interests concern embedded and real-time systems, systems reliability and software engineering.
e-mail: [email protected]

Dr inż. Piotr GAWKOWSKI
Graduated from the Faculty of Electronics and Information Technology at Warsaw University of Technology. He obtained the PhD degree at the same faculty in 2005. Currently he is an assistant professor at the Institute of Computer Science of Warsaw University of Technology. His research interests include systems reliability, fault tolerance and evaluation of system dependability.
e-mail: [email protected]
Abstract
The paper deals with the problem of improving dependability in industrial embedded systems. The problem is considered in the context of a newly developed gas flow computer. The device is built around an ARM microcontroller which performs complex measurements and gas flow calculations, with embedded software-based self-test (SBST) mechanisms assuring fault detection and fault handling. These mechanisms do not interfere with normal operation either in time or in space. The effectiveness of the approach has been verified in practice in specialised experiments.
Keywords: embedded systems, microcontrollers, fault simulation, reliability.
Gas flow computer with built-in SBST
Summary
Recently, growing interest in intelligent measurement devices has been observed. They use very powerful microprocessors or microcontrollers and complex software. These devices usually operate in an industrial environment or in open terrain, where they are exposed to various disturbances (electromagnetic, thermal, unstable power supply, etc.). Hence it is important to ensure high dependability of their operation. This problem became apparent in manufactured natural gas flow computers [9]. The manufacturer's data indicate 8% service problems (Section 2). The authors undertook to solve this problem by developing software-based self-test (SBST) mechanisms integrated with the operational software of the measurement devices. These mechanisms allow continuous monitoring (Fig. 1) of the correct operation of the device (Section 3). In particular, a mechanism for checking and self-repairing the flow computer code was built in, together with exception handling, self-testing of critical procedures with specially prepared data sets, etc. (Section 3). They allow detection of both transient faults (with limited tolerance of them) and permanent faults. The effectiveness of this solution was verified using fault simulation techniques ([1, 3]) and by generating operational logs and counters used in the new flow computer prototype. Compared with other gas flow computers, a significant improvement in dependability was achieved (Section 4). The presented mechanisms can also be applied in other microcontroller-based devices.
Keywords: embedded systems, microcontrollers, fault simulation, reliability.
1. Introduction
Many intelligent measurement devices have recently come into use in industry. They are based on microprocessors and microcontrollers running quite sophisticated software. When operating in an industrial environment (in the field) they are subject to various disturbances (e.g. electromagnetic, power supply) and damage. Hence, an important issue is to assure a high level of dependability at reasonable cost, which usually means implementing it in software [5-7]. We faced this problem in gas flow computers [9]. To improve the dependability of these devices we developed a set of software-based self-test (SBST) mechanisms, integrated with the control software, which monitor the operation of the devices continuously. They target transient and permanent faults and trigger appropriate fault handling procedures. Classical SBST techniques [2, 4, 7] do not meet the real-time and memory restrictions of the considered system. Moreover, the complex architecture of the microcontroller requires more sophisticated fault models that are not covered by classical SBST. Our approach is based on an original application-driven SBST enhanced with various on-line error detection mechanisms. The effectiveness of this approach (time and code overheads, fault coverage) was verified using the fault injection technique [1, 3] and specialised operational profile logging. Compared with the gas counters used by Polish gas providers, it shows a significant dependability improvement. Section 2 outlines the system reliability problems. The concept of the implemented fault detection and fault handling mechanisms is described in Section 3. The experimental results confirming the achieved dependability improvement are given in Section 4. Section 5 presents the conclusions.
2. System reliability problems
We have practical experience with the reliability of gas flow computers produced since 2004 and used in the field [9]. The service problem rate for these devices was about 8%. Most problems were related to the processor board (40-60% depending on the model); transmission circuitry contributed 5-10%, power supply 3-6%, external and internal interconnections 2-3%, and the remaining measurement and supplementary boards 20-50%. The physical nature of the faults varied; about 10% of the problems were related to program bugs. More than 20% of the faults caused excessive system restarts. Cold soldering contributed to over 14% of the problems, flash memory faults appeared in about 5% of cases, and microcontroller chip faults in about 3%. For the computer board, 5-6% of cases resulted in complete board replacement (errors not diagnosed). These statistics relate to devices based on the Freescale MCF 5407 microcontroller. About 95% of its processing power is used for controlling measurements, calculating gas flow, handling transmissions, the data repository and the graphical display. Error detection and handling were limited to basic functions. Hence, most reliability problems were identified only in service. The customers reported only general problems. In fact, most problems were detected with a long delay (through human observation of device behavior), which affected measurement accuracy and caused related financial losses. To improve system dependability, we decided to develop a new gas flow computer with a more powerful microcontroller (based on the ARM9EJ-S core), which provides the extra processing power needed for error detection and fault handling procedures. These processes were enhanced with rich operational logs reporting system restarts, anomalies, program execution flow, etc. These data are useful during system development and servicing. During normal operation of the system only the most critical events are registered and signaled. To facilitate dependability optimization we included a fault injection mechanism, which makes it possible to test system robustness and to trace fault effect propagation, detection, etc.
3. Improving system dependability
When improving system dependability we concentrated on handling transient faults (which dominate in industrial environments) and on detecting potential permanent faults. In developing appropriate procedures we analysed the operational profile of the gas flow computer and its resource usage, to identify the most critical threats and the time and space reserves that could be used to enhance dependability. The control program typically performs critical complex calculations for about 20 ms in each 0.5 s iteration cycle. These calculations are subject to hard real-time constraints, so they are executed as the highest-priority task. Within each iteration cycle other, lower-priority activities (e.g. internal and external communications, data transfer to the local repository, alarm identification) can be performed asynchronously under soft real-time constraints. On average all these processes use no more than 60% of the microcontroller's processing power. Hence, a significant amount of idle time is available for dependability mechanisms. All program modules and much of the data are stored in flash memory and copied to RAM. The program is executed by fetching instructions from RAM (and from cache). Moreover, RAM stores many constants used in calculations, as well as calculated data which are periodically transferred to the repository in flash or to the external environment. RAM disturbed by transient faults may therefore produce faulty results over a long period. Hence, we included a low-priority procedure that checks the CRC of the program code and constant data (about 1 MB) in each iteration. In addition, a relatively large temporary data block for the repository (about 6 MB) is protected with a simpler XOR checksum. Transient faults may also disturb CPU registers and the internal control circuitry. They can be detected to some extent by built-in exception detectors (divide by zero, invalid opcode, FPU overflow, etc.).
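The code and data protection described above can be illustrated with a small host-side model. This is only a sketch under stated assumptions, not the device firmware: the paper does not name the CRC polynomial or the checksum word width, so CRC-32 (via Python's `zlib.crc32`) and 32-bit little-endian words are assumed here, and the names `ProtectedImage`, `check_and_repair` and `xor_checksum` are hypothetical. The repair step models reloading the disturbed program from its flash original.

```python
import zlib
from functools import reduce

WORD = 4  # assumed checksum word size in bytes (32-bit words)


def xor_checksum(block: bytes) -> int:
    """XOR of all 32-bit little-endian words; the block is zero-padded to a word boundary."""
    padded = block + bytes(-len(block) % WORD)
    words = (int.from_bytes(padded[i:i + WORD], "little")
             for i in range(0, len(padded), WORD))
    return reduce(lambda a, b: a ^ b, words, 0)


class ProtectedImage:
    """RAM copy of code and constants, guarded by a CRC of the flash original."""

    def __init__(self, flash_image: bytes):
        self._flash = bytes(flash_image)     # golden copy (stands in for flash)
        self.ram = bytearray(flash_image)    # working copy (stands in for RAM)
        self._crc = zlib.crc32(self._flash)  # reference CRC-32 (assumed polynomial)
        self.repairs = 0                     # counter of corrected code errors

    def check_and_repair(self) -> bool:
        """Return True if the RAM copy is intact; otherwise reload it and return False."""
        if zlib.crc32(self.ram) == self._crc:
            return True
        self.ram[:] = self._flash            # reload the disturbed image
        self.repairs += 1
        return False
```

In such a model `check_and_repair()` would be called once per iteration from a low-priority task. The XOR checksum is cheaper to compute but weaker: it detects any single bit flip, yet two flips in the same bit position of different words cancel out, which is a reasonable trade-off for a large temporary data block.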
To assure better fault coverage, we included software assertions which check the acceptable ranges of results (measurements in particular) and the maximal allowable state changes of many internal variables between subsequent iterations or measurements. We also check the correctness of program flow at the level of functions and modules (the classical fine-grained approach [7] is too expensive and not needed). In particular we assure that
the authorisation procedure (e.g. used for changing configuration or constant parameters) can be executed only in a specified control flow context. To verify the most critical computational procedure of the gas flow computer, each execution is preceded by a test which uses predefined data sets. The results obtained from this test are compared with the expected ones. Moreover, infinite loops are detected by a watchdog mechanism controlled by a special task that monitors the execution of all tasks in the system. If any task fails to send the required heartbeat signal within the given time window, the physical watchdog is not reset and the whole system is restarted. The system comprises many peripheral circuits used for transmissions (e.g. UART, interrupt controller, DMA controller), which may operate in different modes. To limit the impact of transient faults on these modes, we periodically reprogram the required modes, so only a temporary deviation is possible. The transmission protocols used provide error detection and error handling mechanisms which cover not only faults in the transmission media but also those related to the transmission control programs. When handling transient faults we perform appropriate recovery (e.g. skipping the initiated calculation and repeating it after reloading the disturbed program, or retransmitting faulty packets). All these situations are logged. Moreover, we check whether they repeat more than a specified number of times within a specified time window. This allows us to distinguish transient faults from permanent ones. The mechanisms listed above target transient faults; however, they can also detect many permanent faults. In practice, many permanent faults will still not be detected by these mechanisms, so we improve their detection with periodically initiated self-test programs for the CPU and RAM.
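Two of the mechanisms described above, the range and rate-of-change assertions and the rule that separates transient from permanent faults, can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function and class names, the thresholds and the use of seconds as the time unit are all assumptions, since the paper gives no concrete limits.

```python
from collections import deque


def plausible(value, previous, low, high, max_step):
    """Software assertion: value lies in [low, high] and does not jump
    by more than max_step relative to the previous sample (if any)."""
    if not (low <= value <= high):
        return False
    if previous is not None and abs(value - previous) > max_step:
        return False
    return True


class FaultClassifier:
    """Treats errors as transient until they repeat more than max_errors
    times within a sliding window of window_s seconds."""

    def __init__(self, max_errors, window_s):
        self.max_errors = max_errors
        self.window_s = window_s
        self._events = deque()  # timestamps of logged errors

    def report(self, now_s):
        """Log an error detected at time now_s and classify the situation."""
        self._events.append(now_s)
        # Forget errors that have fallen out of the observation window.
        while self._events and now_s - self._events[0] > self.window_s:
            self._events.popleft()
        if len(self._events) > self.max_errors:
            return "permanent"
        return "transient"
```

A failed `plausible()` check would typically trigger recovery (repeating the calculation) and a call to `report()`; a "permanent" verdict is the natural point at which to start the on-demand CPU and RAM self-tests.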
The CPU is a complex circuit with sophisticated pipelining and complex control circuitry handling various instructions, so developing efficient tests that cover all instructions, including instruction and data dependences in pipelines, is a very complex task requiring deep knowledge at the microarchitectural level [4]. On the other hand, the implemented program modules use only about 60% of all instructions, and moreover in fixed data and control flow paths. The developed control program modules use 45-124 instructions out of the 150 + 22 floating point instructions available. This constitutes on average about 50% usage of the CPU instruction set. Hence, we developed an application-level test based on selecting representative input variables so as to sensitize all paths. We use about 100 sets and perform the self-test for several sets in each operational iteration. Within about 100 iterations all sets are verified. This test is enhanced with a test of CPU registers, ALU, etc. These tests run as the lowest-priority threads, so they do not disturb normal operation. CPU tests are distributed over many iteration cycles (roving tests, Fig. 1); however, some of them, e.g. register tests, are executed in each iteration. External and internal interface circuitry, including the associated transmission programs, is tested on-line by the protocol error detection mechanisms. These programs have fixed control flow paths that are easily tested by transmissions (application-driven testing); however, they involve 124 different CPU instructions (more than the data processing modules). Testing RAM is more complex because it requires blocking RAM usage for a longer time. Moreover, RAM tests destroy the existing memory contents. Avoiding this problem with transparent testing [4] still requires long blocking, so RAM tests are initiated on demand, when repeated errors are detected (a probable permanent fault), or during longer system idle periods.
Nevertheless, the fault detection mechanisms for transient faults cover also many RAM permanent faults.
[Figure: consecutive operational iterations (Iteration i, Iteration i+1, 0.5 s each) with interleaved operational and autotesting threads, accompanied by on-line hardware fault detection and software assertions]
Fig. 1. Testing scheme
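The roving test scheme of Fig. 1 can be modelled as a round-robin scheduler that runs a small slice of the application-level test sets in every operational iteration. The sketch below is illustrative only: the class name `RovingSelfTest`, the `sets_per_iteration` budget and the exact-equality comparison are assumptions, and the tested procedure is passed in as an ordinary function rather than being the real flow calculation.

```python
from typing import Callable, List, Sequence, Tuple

TestSet = Tuple[tuple, object]  # (input arguments, expected result)


class RovingSelfTest:
    """Runs a few predefined test sets per iteration, cycling through all of them."""

    def __init__(self, procedure: Callable, test_sets: Sequence[TestSet],
                 sets_per_iteration: int = 1):
        if not test_sets:
            raise ValueError("at least one test set is required")
        self.procedure = procedure
        self.test_sets = list(test_sets)
        self.sets_per_iteration = sets_per_iteration  # assumed per-iteration budget
        self.next_index = 0                           # position of the roving pointer
        self.failures: List[int] = []                 # indices of failed test sets

    def run_slice(self) -> bool:
        """Execute the next slice of test sets; return True if all of them passed."""
        ok = True
        for _ in range(self.sets_per_iteration):
            index = self.next_index
            inputs, expected = self.test_sets[index]
            if self.procedure(*inputs) != expected:
                self.failures.append(index)
                ok = False
            self.next_index = (index + 1) % len(self.test_sets)
        return ok
```

With 100 test sets and one set per iteration, a complete pass takes 100 iterations, i.e. 50 s at 0.5 s per iteration. In this model `run_slice()` would be called from the lowest-priority thread, so it consumes only idle time.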
The developed autodiagnostic programs were optimized for code size (e.g. RAM tests up to 400 bytes, other tests up to 1 kB). Moreover, to assure high robustness of the tests, they are coded with a small number of CPU instructions (9-30). In the developed system we assure low latency of permanent fault detection, so wrong results are not transferred. The missing data are recovered at a higher level using redundant measurements performed by several devices monitoring the gas pipe network.
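The idea behind transparent RAM testing mentioned earlier in this section can be shown with a deliberately simplified invert-verify-restore test. This is not the algorithm used in the device or in [4]; it is an assumed, minimal variant that detects stuck-at faults in individual cells while leaving the contents of fault-free memory unchanged. The fault model class `StuckAtZeroMemory` is hypothetical and exists only to exercise the test on a host machine.

```python
class StuckAtZeroMemory:
    """Toy fault model: one bit of one cell always reads back as 0."""

    def __init__(self, data: bytes, addr: int, bit: int):
        self._cells = bytearray(data)
        self._addr = addr
        self._mask = ~(1 << bit) & 0xFF
        self._cells[addr] &= self._mask

    def __getitem__(self, index: int) -> int:
        return self._cells[index]

    def __setitem__(self, index: int, value: int) -> None:
        # The faulty cell silently drops the stuck bit on every write.
        self._cells[index] = value & self._mask if index == self._addr else value


def transparent_cell_test(mem, start: int, length: int) -> list:
    """Invert, verify and restore each cell in mem[start:start+length].

    Returns the addresses of cells that failed to hold the written value."""
    faulty = []
    for addr in range(start, start + length):
        original = mem[addr]
        inverted = original ^ 0xFF
        mem[addr] = inverted
        if mem[addr] != inverted:
            faulty.append(addr)
        mem[addr] = original          # restore the original contents
        if mem[addr] != original and addr not in faulty:
            faulty.append(addr)
    return faulty
```

Even this simple test must have exclusive access to the tested region while it runs, which is why such tests are better suited to idle periods or on-demand execution than to every iteration.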
4. Experimental results
The presented methodology of improving dependability and its evaluation was verified in a prototype of the gas flow computer. The gained experience is useful in designing embedded systems for industrial applications. The verification process involved the following experiments: analysis of the program code (instruction and resource usage), analysis of operational profiles, and fault injection experiments. To carry out these experiments we developed some supplementary tools. Some of them communicate with the tested system via standard interfaces, using features embedded in the high-level protocols. Unfortunately, the microcontroller used does not provide a specialised fault injection infrastructure (compare [8]), so the fault injection functionality was developed as a purely software solution. It is embedded only during the development process. The whole fault injection infrastructure is controlled through the standard GazMODEM protocol with additional service packets [9], which provide the capability of reading and setting the contents of resources. The actual fault injection controller, responsible for conducting the fault injection campaign, runs on a PC workstation connected to the target device. It reads the target memory content, disturbs it with the given fault, writes the corrupted data back to the flow computer, and observes both the fault detection alerts on the target device and its operation; in particular, the measured gas flow is checked over the given observation period. The fault injection instant is random; at the low level of device processing, however, it is correlated with the execution of the procedure that serves the service packet, and hence with the execution of particular code parts within the given task execution schedule. This can influence the observed fault sensitivity. This small drawback could be overcome in devices with dedicated debugging interfaces [8].
Nevertheless, the implemented solution revealed several problems in practice (e.g. implementation problems in the communication subsystem and in critical calculation procedures). During the fault injection campaign 1000 test runs were executed, each disturbed with a single fault located in the code memory. The injected fault was a single bit flip. It is worth noting that the device was neither restarted nor cleaned up as long as it was responding after a test (a single test takes 3 seconds between subsequent checks of the gas volume increase; the fault injection took place between these checks). So, unrecovered errors from one test could also affect the subsequent test. Clean-up was performed only if the device hung during the test; in such cases a manual restart was made (and the initialization procedure was then executed). The faults were uniformly distributed over the flow computer code memory area. In only 1 test the flow computer failed to send back the requested data to the host PC within the given time limit (however, the device was operating, i.e. updating its LCD display etc.). In 19 cases the device hung (a manual reset was made in each such case). As the experiment was conducted with the watchdog disabled, and due to the device specificity, this could not be considered unsafe behaviour. Enabling the watchdog should solve these issues. In one test the device was functional but did not respond to the host PC within the given time (the gas volume could not be read from the device). Finally, within these 1000 tests we observed only 11 erroneous tests, i.e. tests in which the counted gas volume was not equal to the expected one. It is worth noting that almost all such tests occurred in a sequence preceded by tests that ended with a manual device reset due to a hang-up. This was observed for 6 tests (out of the total of 11; the calculated gas volume was equal to 0). Judging from the reported number of corrected code errors, during these erroneous tests the code checking mechanism was not executing, probably because of the device initialization process (the code checking task has low priority). Another 3 tests with an incorrectly calculated volume were also observed in a sequence, but not preceded by a hung test. In all these cases (6+3) the code check mechanism seems not to have been activated; however, all the errors were successfully recovered without any user action during the subsequent tests. The two remaining tests with incorrect calculations also occurred one after the other; however, the code checking mechanism reported code recovery in these tests. Unfortunately, the recovery took place too late (after the critical data corruption). These situations constituted about 1% of all injected faults (in the previous models about 90%).
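The fault model used in the campaign, a single bit flip at a uniformly chosen position in the code memory, is easy to reproduce on a host machine. The sketch below is a simplified stand-in for the real setup, under several assumptions: it does not use the GazMODEM protocol or any target device, it starts every run from a fresh copy of the image (the real campaign did not clean up between runs), and it uses CRC-32 from `zlib` as the detector. The names `flip_bit` and `run_campaign` are hypothetical. The outcome counts at the end are those reported in the text, with the unanswered request counted as a single test.

```python
import random
import zlib


def flip_bit(image: bytearray, bit_index: int) -> None:
    """Invert one bit of the image; bits are numbered LSB-first within each byte."""
    image[bit_index // 8] ^= 1 << (bit_index % 8)


def run_campaign(golden: bytes, runs: int, seed: int = 0) -> int:
    """Inject one uniformly placed bit flip per run and count CRC detections."""
    rng = random.Random(seed)
    reference = zlib.crc32(golden)
    detected = 0
    for _ in range(runs):
        image = bytearray(golden)  # fresh copy per run (simplifying assumption)
        flip_bit(image, rng.randrange(len(image) * 8))
        if zlib.crc32(image) != reference:
            detected += 1
    return detected


# Outcome counts reported in the text for the 1000-run campaign.
RUNS = 1000
outcomes = {
    "hang-up (manual reset)": 19,
    "no response in time": 1,
    "wrong gas volume": 11,
}
rates = {name: 100.0 * count / RUNS for name, count in outcomes.items()}
```

Because a CRC detects every single-bit error, `run_campaign` always reports all injected faults as detected; the failures observed on the real device therefore stem from when the low-priority check runs relative to the use of the corrupted code, not from a weakness of the checksum. The 11 wrong-volume tests correspond to 1.1% of the runs, which matches the "about 1%" quoted above.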
5. Conclusions
The presented methodology of improving dependability and its evaluation was verified in a prototype of the gas flow computer. The gained experience is useful in designing embedded systems for industrial applications. To keep the performance impact of the fault detection and fault handling mechanisms low, we rely on software procedures integrated with the application software. To assure high fault coverage, we use original application-driven self-testing. This approach was supplemented with built-in profiling mechanisms as well as statistical analysis of the application code. The SBST testing is integrated with the available on-line test mechanisms (hardware and software) and with specially designed software assertions. To reduce potential performance losses, we also included fault avoidance mechanisms, targeted especially at the transient faults that dominate in industrial environments. The effectiveness of the presented methodology was verified in fault injection experiments. They confirmed significant dependability improvements compared with the previously produced devices. The embedded logging mechanism allows dependability data to be collected from the field. These data will be analysed in the future.
This paper presents results of the project under the grant of Programme Innovative Economy POIG.01.04.00-20-017/09.
6. References
[1] Arlat J., et al.: Comparison of physical and software-implemented fault injection techniques, IEEE Trans. on Computers, vol. 52, no. 9, pp. 1115-1133, 2003.
[2] Psarakis M., et al.: Microprocessor software-based self-testing, IEEE Design & Test of Computers, vol. 27, no. 3, pp. 64-75, 2010.
[3] Gawkowski P., Sosnowski J.: Experiences with software implemented fault injection, Proc. of the International Conference on Architecture of Computing Systems, VDE Verlag GMBH, pp. 73-80, 2007.
[4] Sosnowski J.: Software based self-testing of microprocessors, Journal of Systems Architecture, vol. 52, pp. 257-271, 2006.
[5] Rebaudengo M., Reorda M., Violante M.: A new software based technique for low cost fault tolerant application, Proc. of IEEE Annual Reliability and Maintainability Symposium, pp. 23-28, 2003.
[6] Skarin D., Karlsson J.: Software implemented detection and recovery of soft errors in a brake-by-wire system, Proc. of 7th European Dependable Computing Conference, IEEE Comp. Soc., pp. 145-154, 2008.
[7] Vemu R., Abraham J. A.: Budget dependent control flow error detection, 14th IEEE IOLTS symposium, pp. 73-78, 2008.
[8] Fidalgo A. V., Alves G. R., Ferreira J. M.: Real Time Fault Injection Using Enhanced OCD - A Performance Analysis, Proceedings of the 21st IEEE International Symposium on Defect and Fault-Tolerance in VLSI Systems (DFT'06), 2006.
[9] Plum website, http://www.plum.pl
Received: 17.04.2012. Accepted for publication: 01.06.2012.
Peer-reviewed paper