Large Drilling Machine Control Code Parallelisation and WCET Speedup

Mike Gerdes∗, Julian Wolf∗, Irakli Guliashvili∗, Theo Ungerer∗, Michael Houston†, Guillem Bernat†, Stefan Schnitzler‡ and Hans Regler‡
∗ University of Augsburg, Germany, Email: {gerdes, wolf, ungerer}@informatik.uni-augsburg.de
† Rapita Systems Ltd., York, UK, Email: {mhouston, bernat}@rapitasystems.com
‡ BAUER Maschinen GmbH, Schrobenhausen, Germany
Abstract—Hard real-time applications in safety-critical domains – namely avionics, automotive, and machinery – require high performance and timing analysability. We present research results of the parallelisation and WCET analysis of an industrial hard real-time application, the control code of a large drilling machine from BAUER Maschinen GmbH. Compared to the sequential version, we reached a quad-core speedup of 2.62 for the maximum observed execution time (MOET) and 1.93 for the WCET. For the WCET analysis we used the measurement-based WCET analysis tool RapiTime.
I. INTRODUCTION

In the safety-critical embedded computing domain the demands on hard real-time systems are continuously rising. Complex applications require ever more performance, which can no longer be satisfied by single-threaded execution. In the machinery domain, for example, a large drilling machine control application must react to a large amount of sensor data and control many actuators. In the near future, sequential applications will not allow adding more peripheral devices, such as CAN (Controller Area Network) devices or PWM (Pulse Width Modulation) and digital I/O ports, while still maintaining fast responses for controlling interaction.

Parallel execution on a multi-core processor is increasingly being considered as an effective solution to cope with the performance requirements of current and future hard real-time embedded systems. However, due to accesses to shared hardware resources, the worst-case execution time (WCET) is hard to determine. WCET analysis techniques therefore have to consider the influences of parallelisation and guarantee that all hard real-time requirements of a system are still met.

In this paper we present the results of a pilot study with BAUER Maschinen GmbH concerning a control application for large drilling machines (see Figure 1). The main idea is to parallelise the given sequential code to overcome potential future shortcomings of the sequential version. Our approach is to distribute the sequential tasks over different cores and run them in parallel. We thereby show that a parallelised version of the sequential code scales and allows controlling more peripheral devices. We perform the pilot study comparing sequential, dual-core, and quad-core parallel versions on the hard real-time capable MERASA multi-core processor [1]. The main contribution of this paper is to show performance results for the different versions, achieving a WCET speedup for the parallelised application.
Fig. 1. Two large drilling machines of BAUER Maschinen GmbH.
The timing behaviour is analysed with RapiTime, a commercial measurement-based WCET tool.

Another study of a parallelised industrial real-time application has been presented in [2]. The authors demonstrated the complexity of WCET analysis for a parallelised avionics application using a static WCET analysis tool. In [3] the authors present work in progress on the automatic mapping of single-threaded tasks onto multi-core ECUs (electronic control units) with AUTOSAR. They highlight the need for tools for analysing and visualising legacy sequential applications to successfully map them to multi-cores. The authors in [4] investigate the possibility of performing a WCET analysis on (small) parallel systems by model-checking a system of timed automata with the UPPAAL tool box.

The main aspects of the parallelisation are presented in Section II, as well as the required adaptations of the MERASA multi-core FPGA prototype to run the control code application. In Section III we present evaluation results of the improvement in maximum observed execution time (MOET) of the parallelised version compared with the sequential version, as well as the WCET speedup of the parallelised pilot study application estimated with the RapiTime tool [5].

II. PARALLELISATION OF THE CONTROL CODE APPLICATION

The original control code provided by BAUER Maschinen runs on an Infineon TriCore and builds on OS functionalities by Sensor-Technik Wiedemann (STW). The overall structure of the original code was not changed for the adaptation to run it on the MERASA processor.
In the initialisation phase some hardware configuration is done and task arrays are initialised. These task arrays contain tasks from different categories, e.g. PWM and I/O tasks. Each task array is registered in a software scheduler with a different priority. Single tasks, e.g. CAN-bus tasks, are also registered in the software scheduler. When the initialisation phase is finished, the main phase begins with a loop in which all the application code is executed. The scheduler is also started, and it interrupts the main loop on a fixed time scale to start the registered task arrays and tasks. In each scheduling cycle, one task of each registered task array and the registered CAN tasks are executed to read data from inputs or to write data to output ports; this data comprises PWM signals, digital I/O, and CAN messages.

The code was compiled for the MERASA processor, except for the library functions and drivers that are the property of STW. These were added manually to the MERASA RTOS [6] in consultation with BAUER Maschinen. Some changes and adaptations for the MERASA processor were also needed, e.g. a CAN-bus interface was implemented on our FPGA prototype. To ease the analysability of our multi-core processor, we have not considered interrupts so far. We did not have access to the software scheduler's code, which is nested in STW's library, so we had to build our own solution for scheduling the tasks. We therefore substituted the original, interrupt-based software scheduler with an interface that explicitly calls each task (and one task from each task array) in every iteration. The main goal is that the timing is as close as possible to the original drilling machine control application. The task calling also preserves the same priorities and order of tasks as if the software scheduler interrupted the main loop. The used library functions were changed either to include our MERASA software drivers or to simulate realistic behaviour, e.g. in accessing shared resources. The baseline for the timing of those hand-coded library functions for peripheral devices is the timing of the CAN task for a CAN-bus keyboard – an original keyboard of a large drilling machine – that we connected to our FPGA prototype. We measured the timing of the CAN-bus keyboard task, and based on those timings we adjusted the timing and behaviour of the other, simulated tasks (such as the PWM and I/O tasks, and an additional CAN task) that are not connected to any external device in our FPGA prototype.

For the parallelisation, the single-threaded pilot study code is decomposed into two alternative parallel configurations: one with two and the other with four hard real-time (HRT) threads. In the two-threaded version, one HRT thread executes the PWM tasks, the I/O tasks, and the main loop, in parallel to the execution of the two CAN tasks (CAN-bus keyboard and simulated) in the other HRT thread. In the four-threaded configuration, one HRT thread executes the main loop and the PWM tasks in parallel with the execution of the I/O tasks and the two CAN tasks, each in its own HRT thread. Note that each HRT thread is tied to one core and is executed in isolation on that core, but potentially runs in concert with additional non hard real-time (NHRT) threads using simultaneous multithreading (SMT) [7] techniques. A sketch of the explicit task-calling interface described above is given below.
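To illustrate, the explicit task-calling interface could look roughly as follows. This is a minimal sketch under the structure described above; the type and function names are hypothetical, since STW's scheduler code and the original task signatures are proprietary:

```c
#include <stddef.h>

/* A registered task is a function executed once per scheduling cycle. */
typedef void (*task_fn)(void);

/* A task array groups tasks of one category (e.g. PWM or I/O tasks);
 * one member is executed per scheduling cycle, in round-robin order. */
typedef struct {
    task_fn *tasks;   /* the registered tasks of this category */
    size_t   count;   /* number of tasks in the array          */
    size_t   next;    /* task to execute in the current cycle  */
} task_array;

static void run_one_task_from(task_array *ta)
{
    ta->tasks[ta->next]();
    ta->next = (ta->next + 1) % ta->count;
}

/* One explicit scheduling cycle, called from the main loop. Tasks are
 * called in the same priority order the original software scheduler
 * would have enforced when interrupting the main loop. */
static void scheduling_cycle(task_array *pwm, task_array *io,
                             task_fn can_keyboard, task_fn can_simulated)
{
    run_one_task_from(pwm);   /* one PWM task per cycle         */
    run_one_task_from(io);    /* one I/O task per cycle         */
    can_keyboard();           /* CAN-bus keyboard task          */
    can_simulated();          /* additional, simulated CAN task */
}
```

In the parallel configurations, each HRT thread would then execute only its subset of these calls per cycle (e.g. only the two CAN tasks), preserving the per-cycle task frequencies of the sequential version.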
Fig. 2. Execution of the decomposed application with unsynchronised releases (top) and synchronised releases (bottom) with four HRT threads.
III. RESULTS FROM THE PILOT STUDY

In this section we present the results of the parallelised pilot study code running the two configurations on a quad-core MERASA processor. The performance of the parallelised applications is compared to the sequential version of the pilot study application. The results are obtained from a MERASA quad-core processor on an FPGA with an activated data scratchpad (DSP), a core-local memory containing the stack of each HRT thread. Additionally, a dynamic instruction scratchpad (D-ISP) [8] is used in one experiment. The D-ISP caches an entire function when it is called for the first time, eliminating instruction fetches to main memory in the same way as a conventional cache. The advantage is that the effect of the D-ISP can be more easily analysed for the purpose of worst-case timing analysis; a conventional cache is often disabled in safety-critical systems because it makes the application's worst-case behaviour hard to predict, complicating the certification of the system. Use of the D-ISP improves performance by reducing the number of memory accesses over the shared memory bus (and reducing memory access latency in general).

A series of three experiments was performed, comparing first the behaviour without any synchronisation between cores, second the behaviour with synchronisation, and third the behaviour with the D-ISP enabled. In the first experiment the tasks run without synchronisation: each core runs its HRT thread in a loop without any additional idle time, so that the core is fully loaded. In the other experiments the HRT threads are coordinated using a synchronisation barrier.
TABLE I
MAXIMUM OBSERVED EXECUTION TIMES FOR UNSYNCHRONISED RELEASE OF THE APPLICATION COMPONENTS ON 1, 2, AND 4 CORES.

Task / Cores            |      1 |      2 |       4
Main loop / PWM tasks   | 79,447 | 54,906 |  15,175
I/O tasks               |      – |      – | 101,427
can1 (keyboard)         |      – | 56,350 |  60,618
can2 (simulated)        |      – |      – |  37,798
Total                   | 79,447 | 56,350 | 101,427
Speedup                 |      – |   1.41 |    0.78

TABLE II
MAXIMUM OBSERVED EXECUTION TIMES FOR SYNCHRONISED RELEASE OF THE APPLICATION COMPONENTS WITH D-ISP ON 1, 2, AND 4 CORES.

Task / Cores            |      1 |      2 |       4
Main loop / PWM tasks   | 40,950 | 13,489 |   7,582
I/O tasks               |      – |      – |  13,349
can1 (keyboard)         |      – | 31,167 |  15,617
can2 (simulated)        |      – |      – |  13,704
Total                   | 40,950 | 31,167 |  15,617
Speedup                 |      – |   1.31 |    2.62

(On one core all tasks execute in a single thread; on two cores the I/O tasks share a thread with the main loop / PWM tasks, and the two CAN tasks share the second thread.)
All tasks must complete a cycle before the next simultaneous release of all tasks is made. This more closely mimics the behaviour of the original sequential code, as the relative frequency of task execution remains the same (a code sketch of this release scheme is given below). Figure 2 shows a scheduling-style diagram for four HRT threads demonstrating the difference in behaviour with and without synchronisation. The blank spaces represent slack that can be used for "extra" work.

The measurements in the following sections are based on recording the times between the first instruction of the "loop body" – the code executed in one scheduling cycle of the algorithm – and the last. This means that the times are usable by standard scheduling techniques which take these times as input, or can be considered maximum response times in the case of a cyclic executive. For the tests, 10,000 iterations of the loop body are executed, providing a high level of confidence that we have seen the maximum times for the paths tested. As with all measurements, the results can only be said to be representative of the measurements taken. If there are other inputs or tests which result in higher execution times, they should be included in the analysis.

A. Speedup: Sequential vs. Parallel

In the following, we measure the raw execution performance as the MOET, which is lower than the actual, but unknown, WCET. The first set of tests (Table I) measures the MOET with no synchronisation between tasks. The two-core decomposition demonstrates a speedup of 1.41. Thus, with two cores active, the processor is able to use additional resources which are not exploited with only one core. This does not extend to the four-core decomposition, where performance drops to 0.78 due to additional pressure on the memory bus (see below).

The second set of tests (see also Figure 2) includes a barrier synchronisation which ensures that each cycle of the decomposed application performs the same amount of work (executes the same number of instructions) as the original configuration. This configuration resulted in a more modest speedup of 1.19 for the two-core decomposition, but also produced an improvement of 1.23 for the four-core decomposition. We observed that the overall processor utilisation is reduced, but that, when the three shorter tasks complete, the fourth core is able to make full use of the available resources. However, while the three shorter tasks are executing, the fourth task still experiences contention at the interconnect.
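For illustration, the barrier-based simultaneous release and the per-cycle timing measurement could be sketched as follows. This is a minimal sketch using POSIX-style primitives as stand-ins; the actual MERASA RTOS thread, barrier, and cycle-counter interfaces are not reproduced here, so read_cycle_counter() and run_loop_body() are hypothetical:

```c
#include <pthread.h>
#include <stdint.h>

#define NUM_HRT_THREADS 4
#define ITERATIONS      10000      /* loop-body iterations, as in the tests */

static pthread_barrier_t release_barrier;  /* initialised once at start-up with
                                            * pthread_barrier_init(&release_barrier,
                                            *                      NULL, NUM_HRT_THREADS) */
static uint64_t moet[NUM_HRT_THREADS];     /* maximum observed execution time per thread */

uint64_t read_cycle_counter(void);         /* hypothetical: core cycle counter */
void run_loop_body(int thread_id);         /* hypothetical: tasks of this HRT thread */

static void *hrt_thread(void *arg)
{
    int id = (int)(intptr_t)arg;

    for (int i = 0; i < ITERATIONS; i++) {
        /* Every cycle starts with a simultaneous release of all HRT
         * threads, mimicking the task frequencies of the sequential code. */
        pthread_barrier_wait(&release_barrier);

        uint64_t start   = read_cycle_counter();
        run_loop_body(id);                    /* one scheduling cycle */
        uint64_t elapsed = read_cycle_counter() - start;

        if (elapsed > moet[id])               /* record the MOET */
            moet[id] = elapsed;
    }
    return NULL;
}
```

Each of the four threads would be created with pthread_create and pinned to its own core; core pinning is platform-specific and omitted here.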
The final set of tests (Table II) was performed with the D-ISP activated and the synchronised release as described before. The decomposition is now able to better exploit the additional cores: improvements of 1.31 for the two-core and 2.62 for the four-core decomposition were observed. This is because the utilisation is now much more consistent throughout the tests, with less interference from the other cores.

The real-time bus in the MERASA multi-core applies a TDMA policy in which each bus cycle is assigned to a different core in turn. When the number of cores sharing the real-time bus increases, the latency of each core's memory accesses increases too. This slows down the code executing on all cores and therefore increases the MOET of each core. In order to make effective use of multi-core processors, the inter-core interference must be minimised. One effective way of doing this is to use a D-ISP and a DSP. The execution times are then consistently improved as the number of cores is increased, demonstrating the reduced pressure on the memory bus. Without the D-ISP, the overhead of increasing the number of cores becomes larger than the performance gain from parallel execution. This is particularly visible in the unsynchronised version of the tests, where the cores are fully loaded: the sum of the execution times over the four cores has increased, which represents the overhead incurred by interference and by the additional memory accesses of reading four times as much code from main memory per instruction cycle.

The theoretical maximum for a multi-core with two and four cores would be a linear speedup of 2 and 4, respectively. Our tasks are not perfectly balanced, but the obtained speedups of 1.31 and 2.62 are very reasonable. Further decomposition into smaller tasks might allow improved load balancing with an appropriate schedule, and therefore even better speedups: increasing the CPU utilisation should not have as great an effect on the execution time with the D-ISP enabled.

B. RapiTime WCET Analysis

The WCET estimates in Table III were computed from the third configuration, with activated D-ISP, using the measurement-based WCET tool RapiTime. A modest WCET speedup of 1.08 is seen for the two-core decomposition; however, the four-core decomposition shows a WCET speedup of 1.93 over the single-core baseline. The WCET speedup is lower than the speedup for the MOET, but it still shows a considerable gain for the four-core decomposition. As can be seen in Figure 3, the largest improvements in WCET come from enabling the D-ISP.
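The TDMA bus-arbitration effect described above can be made concrete with a back-of-the-envelope worst-case bound (our own illustration, not a formula from the RapiTime analysis): under TDMA with $N$ cores and a slot length $t_{\mathrm{slot}}$, a core that has just missed its slot must wait for the slots of all $N-1$ other cores before it can issue its access:

$$ t_{\mathrm{access,wc}} = (N-1)\,t_{\mathrm{slot}} + t_{\mathrm{mem}} $$

where $t_{\mathrm{mem}}$ is the raw memory access time. Going from two to four cores thus triples the worst-case waiting term, which is why memory-bound code slows down on every core as cores are added, unless the number of bus accesses is reduced, e.g. with the D-ISP and DSP.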
TABLE III
WORST-CASE EXECUTION TIMES FOR SYNCHRONISED RELEASE OF THE APPLICATION COMPONENTS WITH D-ISP ENABLED ON 1, 2, AND 4 CORES.

Task / Cores            |      1 |      2 |      4
Main loop / PWM tasks   | 96,043 | 20,294 |  8,406
I/O tasks               |      – |      – | 30,877
can1 (keyboard)         |      – | 88,945 | 49,829
can2 (simulated)        |      – |      – | 38,818
Total                   | 96,043 | 88,945 | 49,829
WCET Speedup            |      – |   1.08 |   1.93
The full potential of the four-core decomposition can be exploited with activated D-ISP, which reduces the impact of timing variability due to memory bus contention. Without the D-ISP, any basic block incurs large time penalties because of delays while loading instructions from main memory, even if no explicit memory accesses are made. With the D-ISP, only explicit memory accesses cause a block to suffer interference. Without the D-ISP, the variation in execution times caused by blocking on memory accesses causes the WCET to increase: even when the MOET decreases for four cores, the WCET estimate increases.

The side effect of using the D-ISP is that an extra delay is incurred when a function is first loaded into the scratchpad, or when it has been evicted by another function load. This loading time is captured in the RapiTime measurements and accounted to the calling function. RapiTime will therefore assume this extra delay is possible every time the call is made from that location. This is a well-recognised challenge of analysing caches, and rare timing events in general. The advantage of the D-ISP is that these effects can be localised to a call site.

Fig. 3. WCET estimates for the three evaluated configurations (wcet_unsync, wcet_sync, wcet_sync_d-isp), compared to the MOET as recorded with D-ISP activated (moet_sync_d-isp); speedup plotted over the number of cores (1, 2, 4).

IV. CONCLUSION AND FUTURE WORK

The decomposition of the BAUER Maschinen case study onto four parallel cores has resulted in a reduction of the overall MOET and a speedup of 2.62 over the single-core baseline; the maximum theoretical improvement would be 4. The WCET has been improved by a speedup of 1.93 for the four-core decomposition with D-ISP, and by a speedup of 2.41 over the original single-core WCET estimate without D-ISP (see Figure 3). The use of the D-ISP as a predictable cache has allowed this decomposition with minimal contention for the shared memory bus. The testing has shown that the D-ISP produces stable execution times while providing all the benefits of an instruction cache. In summary, with a HRT-capable multi-core processor such as the MERASA processor, we can support a parallelised industrial real-time application with significant performance improvements and gains in predictability, hence enabling effective worst-case execution time analysis on a multi-core processor.

Future work needs to evaluate the approach in more detail by using the original library functions instead of the simulated ones. Furthermore, different distributions of tasks to cores should be investigated to yield better load balancing and thus higher speedups. However, much of the work for analysing parallel HRT applications still needs to be done manually. Thus, in future work we focus on introducing parallel design patterns
to ease the automatic analysis of parallelised HRT applications. That is, we need to explore how a WCET speedup can be achieved while at the same time facilitating the WCET analysis of parallel applications, e.g. with automatically inserted annotations. Furthermore, research in the development of tools for profiling and visualising data of parallel HRT applications is strongly needed.

ACKNOWLEDGMENT

The authors would like to thank BAUER Maschinen GmbH for providing an industrial control code application, a CAN-bus keyboard, and valuable information for this research. This research was partly sponsored by the EC FP7 project MERASA under Grant Agreement No. 216415.

REFERENCES

[1] T. Ungerer et al., "MERASA: Multicore Execution of Hard Real-Time Applications Supporting Analyzability," IEEE Micro, vol. 30, 2010.
[2] C. Rochange, A. Bonenfant, P. Sainrat, M. Gerdes, J. Wolf, T. Ungerer, Z. Petrov, and F. Mikulu, "WCET Analysis of a Parallel 3D Multigrid Solver Executed on the MERASA Multi-Core," in Proceedings of the 10th Int'l Workshop on Worst-Case Execution Time Analysis, Brussels, Belgium, vol. 268, July 2010, pp. 92–102.
[3] J. Schneider, M. Bohn, and R. Rößger, "Migration of Automotive Real-Time Software to Multicore Systems: First Steps towards an Automated Solution," in Proceedings of the Work-in-Progress Session of the 22nd Euromicro Conference on Real-Time Systems, July 6–9, 2010, pp. 37–40.
[4] A. Gustavsson, A. Ermedahl, B. Lisper, and P. Pettersson, "Towards WCET Analysis of Multicore Architectures using UPPAAL," in Proceedings of the 10th Int'l Workshop on Worst-Case Execution Time Analysis, Brussels, Belgium, July 2010, pp. 103–113.
[5] "RapiTime White Paper," Rapita Systems Ltd., June 2008, http://www.rapitasystems.com/system/files/RapiTime-WhitePaper.pdf.
[6] J. Wolf, M. Gerdes, F. Kluge, S. Uhrig, J. Mische, S. Metzlaff, C. Rochange, H. Cassé, P. Sainrat, and T. Ungerer, "RTOS Support for Parallel Execution of Hard Real-Time Applications on the MERASA Multi-core Processor," in IEEE International Symposium on Object-Oriented Real-Time Distributed Computing, Los Alamitos, CA, USA, 2010, pp. 193–201.
[7] J. Mische, I. Guliashvili, S. Uhrig, and T. Ungerer, "How to Enhance a Superscalar Processor to Provide Hard Real-Time Capable In-Order SMT," in 23rd International Conference on Architecture of Computing Systems (ARCS 2010), Proceedings, vol. 5974, Hannover, Germany, February 2010, pp. 2–14.
[8] S. Metzlaff, I. Guliashvili, S. Uhrig, and T. Ungerer, "A Dynamic Instruction Scratchpad Memory for Embedded Processors Managed by Hardware," in Proceedings of the 24th International Conference on Architecture of Computing Systems (ARCS 2011), vol. 6566, Lake Como, Italy, February 2011, pp. 122–134.