Towards a Load Balancer Architecture for Multi-Core Mobile Communication Systems

D. Tudor, G. Macariu, C. Jebelean and V. Creţu
Politehnica University of Timisoara, Timisoara, Romania
{dacian, georgiana, calin, vcretu}@cs.upt.ro

Abstract—Due to the exponential growth of mobile communication systems in the past decade, more effort has been invested in solutions for increasing performance that can satisfy the growing demand of resource-hungry applications. Since the computational power provided by single processing units grows more slowly than application needs, it is generally accepted that a suitable solution can be implemented using multi-core architectures, which aim to provide a better balance between performance, power consumption, flexibility and scalability. In this paper we discuss the problem of scheduling and load balancing alternatives for virtualized run-time environments. Based on the virtualization concept, we summarize the most common approaches in scheduling techniques for embedded mobile communication systems. Considering the shortcomings of single-core architectures for embedded mobile communication systems, we present a future architecture proposed by the eMuCo European research project. Last but not least, we focus on a critical component of the future embedded system that aims to distribute and balance the load in order to ensure an optimal distribution of threads on the available cores and, where possible, to reduce power consumption. A variant of the load balancer architecture and its integration into the complete system are described.

I. INTRODUCTION

The continuous growth of mobile communication systems has led to the development of more complex mobile systems. Complex features commonly seen on high-end devices have started to appear in low-end products (e.g. video streaming, video telephony, rich multimedia applications). To satisfy the market's demand for more complex features, more effort has been put into increasing processing power, exploiting the large-scale availability of common processors. For example, it is easier to integrate a regular processor into a given system than to adopt a domain-specific processor such as a graphics accelerator or an advanced DSP. Since the computational power provided by individual processing units grows very slowly compared to the needs, it is generally accepted that a suitable solution can be implemented using multi-core architectures. Multi-core architectures are based on multiple processing cores manufactured on the same integrated circuit. Modern multi-core processors for embedded systems (such as those provided by ARM) offer support for controlling individual cores through dynamic clock scaling and advanced low-power states. The resulting frequency reduction and the possibility to turn cores off are expected to greatly improve power efficiency and reduce heat dissipation.

Furthermore, these architectures are expected to provide the best balance between performance, power consumption, flexibility and scalability.

A natural consequence of multi-core solutions is that they pave the way to separated execution domains. This aspect is especially important in the context of the dynamic addition and execution of third-party software applications in their own execution environments, by making use of virtualization techniques. Virtualization is well known in the desktop and server domains and has been used extensively in the last couple of years. Legacy operating system virtualization has frequently been implemented on micro-kernel-based systems such as the L4 micro-kernel. One of the major goals of the eMuCo project [18] is to develop and demonstrate concepts allowing the co-operation of open and closed application environments through the use of virtualization techniques. The benefit of multi-core architectures is three-fold: first, multi-cores provide the required computing performance; second, they provide power efficiency by allowing lower clock rates; and third, they provide a second dimension in resource allocation. The eMuCo consortium believes that bringing these technologies together in a mobile environment will be a major step towards a new generation of mobile systems.

II. SCHEDULING SOLUTIONS

The scheduling algorithms considered for embedded systems can be divided into two major classes: offline algorithms and online algorithms. Offline (static) scheduling algorithms generate schedules prior to system execution. Such algorithms are appropriate in systems where the parameters of the tasks are known a priori and change infrequently. An offline schedule can be represented as a static table with an explicit start time and execution place for each task. Although the resulting schedule is predictable and guarantees system performance, it has the drawback of being inflexible, since any change in task parameters requires reconstructing the whole scheduling table. Online (dynamic) algorithms generate the schedule at runtime and do not assume any prior knowledge of task parameters. The main advantages of online scheduling are its flexibility and its ability to adapt to environment changes, at the cost of higher run-time overhead. Another possibility is to build a quasi-static schedule consisting of multiple offline schedules, each to be used in a different situation. At system runtime, an online scheduler selects the pre-computed offline schedule applicable to the current situation. This solution was used in [10] to achieve fault-tolerance.
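As a minimal illustration of the quasi-static approach, the sketch below stores several pre-computed offline schedules and selects one at runtime; the structures and names are our own illustrative assumptions, not taken from any of the cited systems.

```c
#include <stddef.h>
#include <stdint.h>

/* One entry of a pre-computed offline schedule: a fixed start time
 * and execution place (core) for a task. */
typedef struct {
    uint32_t task_id;
    uint32_t start_time_us;  /* explicit start time */
    uint8_t  core;           /* execution place */
} sched_entry_t;

/* A complete offline schedule for one operating situation. */
typedef struct {
    const sched_entry_t *entries;
    size_t               count;
} offline_schedule_t;

/* At runtime, the online component only selects among the
 * pre-computed tables; it never builds a schedule from scratch. */
const offline_schedule_t *select_schedule(const offline_schedule_t *tables,
                                          size_t n_tables,
                                          size_t current_situation)
{
    return (current_situation < n_tables) ? &tables[current_situation] : NULL;
}
```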

In complex real-time systems, applications may have different timing requirements, and the differences between real-time and non-real-time tasks must be reflected in the chosen scheduling algorithm. At the same time, the scheduler must guarantee the timely availability of sufficient computational resources to all real-time tasks and a certain performance level to non-real-time tasks. In such situations several levels of scheduling may be required.

A. Hierarchical Scheduling

Applications for embedded systems can have different performance requirements. Some may have strict timing constraints, while for others achieving greater performance may be the most important requirement. A schedule for applications with hard real-time constraints demands extensive verification and analysis to determine whether the application can meet all its deadlines. By contrast, scheduling non-real-time applications does not need such proofs but, in turn, may need to provide Quality of Service guarantees. Execution-time servers make it possible for applications with different timing requirements to co-exist. The basic idea behind this technique is that each application is assigned to a server which uses a fraction of the total processor bandwidth, and the task running in the server is limited to this reserved bandwidth. This solution is used extensively for uni-processor embedded systems and has also been extended to multi-processor systems. In [2] and [3] two bandwidth-sharing server algorithms for multi-processor systems are introduced. Both algorithms assign a task to each server and use a global Earliest Deadline First (EDF) algorithm for scheduling the servers; a sketch of this idea is given below. A more complex solution is presented in [5]. Here servers are assigned more than a single task and schedule their constituent tasks using any scheduling algorithm. At another level, server tasks are scheduled on the available processors using a global EDF algorithm. For hard real-time tasks a server is associated with each processor in the system, a single migratory server is created for all soft real-time tasks, and a number of migratory servers equal to the number of processors handles best-effort jobs. This solution addresses only systems where the number of hard real-time tasks is small, due to the need to statically assign them to servers. The FRESCOR (Framework for Real-time Embedded Systems based on COntRacts) project [8] extends the server technique by introducing the notion of service contracts. Each application has a contract specifying its timing and Quality of Service requirements, which are negotiated with the scheduling framework. If negotiation is successful, the system reserves enough capacity for the application to meet its requirements by creating a server which keeps track of the resource usage for the associated contract. Contract negotiation can be performed either offline at design time or online while the system is running, when requirements change or new applications are deployed on the system. This solution enables the composition of application components, each consisting of several threads, requiring hierarchical scheduling inside the servers.
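The following sketch illustrates the bandwidth-server idea combined with global EDF over servers, in the spirit of [2] and [3]; the data structure and the selection function are illustrative assumptions, not the published algorithms.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* A bandwidth-reservation server: it holds a budget (its reserved
 * fraction of processor bandwidth) and an absolute deadline used by
 * the global EDF scheduler that arbitrates between servers. */
typedef struct {
    uint32_t server_id;
    uint32_t budget_us;    /* reserved budget per period */
    uint32_t remaining_us; /* budget left in the current period */
    uint64_t deadline_us;  /* absolute deadline for global EDF */
    bool     active;       /* has pending work */
} server_t;

/* Global EDF over servers: pick the active server with the earliest
 * deadline that still has budget; its constituent tasks then run
 * under the server's own local scheduling policy. */
server_t *pick_server_edf(server_t *servers, int n)
{
    server_t *best = NULL;
    for (int i = 0; i < n; i++) {
        if (!servers[i].active || servers[i].remaining_us == 0)
            continue;
        if (!best || servers[i].deadline_us < best->deadline_us)
            best = &servers[i];
    }
    return best; /* NULL means idle, or all budgets exhausted */
}
```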

A slightly different approach for ensuring that real-time applications running in a heterogeneous multi-processor system meet their timing constraints is introduced by the ARTiS scheduler [13]. In this case the processors are classified as real-time and non-real-time processors. Hard real-time tasks have the highest priority in the system, run only on real-time processors and cannot be migrated between processors, while soft real-time tasks may run on both real-time and non-real-time processors, with the restriction that they may become non-preemptible only on non-real-time processors, and may migrate from one processor to another. Non-real-time tasks can run on both types of processor as long as they do not endanger the real-time properties. The scheduler uses task migration for load balancing and as a way of ensuring that all hard real-time tasks meet their deadlines.

B. Power-Aware Scheduling Solutions

As many embedded systems are powered by batteries, a major issue is extending the autonomy of the system as much as possible. Existing power-aware scheduling algorithms exploit the processor's capability of changing voltage and frequency at runtime, a technique called Dynamic Voltage Scaling (DVS). A power-aware scheduling algorithm based on DVS selects at each instant the task to run and the processor voltage to apply while running the task. Power-aware scheduling can be done either offline (static) or online (dynamic). In the case of offline algorithms the processor voltage during the execution of a task is statically assigned before system execution, as opposed to online algorithms, where the processor voltage is determined just before scheduling a new task. Chen et al. [6] proposed a static scheduling algorithm using DVS for multi-core embedded systems applicable to loop applications. They start with an initial schedule obtained by rotation scheduling and use DVS to relax it in order to reduce energy consumption without endangering the timing requirements of the application. In contrast to the approach in [6], Shao et al. [15] introduce an algorithm in which the initial schedule assumes the minimum voltage level for all tasks and then iteratively increases the voltage for some tasks, reducing the execution time of the loop application until the timing constraints are satisfied. A mixed offline-online scheduling solution for sporadic hard real-time tasks upon multi-processor systems is proposed in [12]. Here an offline procedure determines a schedule in which the smallest frequency is chosen for each processor and all task deadlines are met. As the offline procedure uses the worst-case execution time (WCET) of each task, and the probability of a task's actual execution time being equal to its WCET is low, a further online procedure is applied to reduce the energy consumption. Essentially, at system runtime, when a job is allocated to a processor based on a global EDF policy, the algorithm reduces the processor speed in such a way that the job still meets its deadline. A similar technique is presented in [1], with the difference that the processor speed is reduced only if the next task is not ready for execution when the current task should finish. The solutions presented above minimize energy consumption by executing tasks at reduced speed when the difference between the WCET and the actual execution time of tasks allows it without risking missed deadlines.
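To make the online speed-reduction step concrete, the sketch below picks the lowest of a few hypothetical frequency levels at which a job's scaled WCET still fits before its deadline, in the spirit of [12]; the frequency table and function names are illustrative assumptions, not a real DVS driver interface.

```c
#include <stdint.h>

#define N_LEVELS 4
/* Hypothetical available frequencies in kHz, highest first. */
static const uint32_t freq_khz[N_LEVELS] = {1000000, 750000, 500000, 250000};

/* WCET scales roughly inversely with frequency:
 * wcet_at_f = wcet_at_fmax * f_max / f. */
static uint64_t scaled_wcet_us(uint64_t wcet_at_fmax_us, uint32_t f)
{
    return (wcet_at_fmax_us * freq_khz[0]) / f;
}

/* Pick the lowest frequency at which the job's scaled WCET still
 * completes before its absolute deadline. */
uint32_t pick_frequency(uint64_t now_us, uint64_t deadline_us,
                        uint64_t wcet_at_fmax_us)
{
    for (int i = N_LEVELS - 1; i >= 0; i--) { /* lowest frequency first */
        if (now_us + scaled_wcet_us(wcet_at_fmax_us, freq_khz[i]) <= deadline_us)
            return freq_khz[i];
    }
    return freq_khz[0]; /* no slack: run at maximum speed */
}
```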

Another possible approach is to shut down processors when there are no active tasks and to raise the supply voltage to the maximum when tasks are ready to run. Such a solution is presented, for example, in [16] for uni-processor systems and in [4] for multi-processor systems. In [4] DVS is applied to all processor cores at once. In this case, cores are switched on only if task deadlines cannot be satisfied using only the cores that are already powered. Moreover, the speed of all cores is increased only if the algorithm cannot guarantee the WCET of all running tasks. As tasks finish, the working frequency of the cores is reduced or some cores are shut down. A drawback of the algorithm is that the power savings depend strongly on the running time of each task and on the task inter-arrival times. In [7] the voltage level used for each task instance is determined based on the actual execution times of past instances, which leads to more efficient energy consumption, but only after several instances of each task have executed. The frequency and voltage scaling can also be based on probabilistic distributions of task execution times. Such an approach is considered in [17], where tasks are partitioned among the processors with the aim of balancing energy consumption between them. All power-aware scheduling solutions presented above are defined for hard real-time task systems, but modern embedded systems have to deal with a combination of hard, soft and non-real-time tasks. To ensure temporal isolation between these different types of tasks, a technique based on servers can be used. Scordino and Lipari [14] present a server-based technique using DVS for uni-processor systems, where slowing down the processor causes each server with active jobs to execute for a longer time. The server-based approach to DVS has the advantage that neither the release times of tasks nor their execution times need to be known a priori.

C. Multi-Mode Systems

Some applications for embedded systems may have multiple modes of operation, each mode with an associated set of tasks. In these systems a problem that arises is the transition from one mode to another. For example, the tasks of the old mode could all be interrupted, or it may be necessary to finish the tasks currently running and start the tasks of the new mode only after that. If the tasks associated with the new mode start only after all tasks of the old mode have stopped, the transition protocol is said to be synchronous; otherwise it is asynchronous. A synchronous transition protocol for multi-processor real-time systems is introduced in [11].
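The sketch below shows the skeleton of a synchronous transition: old-mode jobs are drained before any new-mode task is released. The helper functions are hypothetical stubs, not the protocol of [11].

```c
#include <stdbool.h>
#include <stddef.h>

/* A mode is simply a set of tasks. */
typedef struct {
    const int *task_ids;
    size_t     n_tasks;
} op_mode_t;

/* Hypothetical placeholders for the task-control interface. */
static void stop_releasing(int task_id)   { (void)task_id; /* stub */ }
static bool has_pending_jobs(int task_id) { (void)task_id; return false; }
static void release_task(int task_id)     { (void)task_id; /* stub */ }

void synchronous_mode_change(const op_mode_t *old_mode,
                             const op_mode_t *new_mode)
{
    /* 1. Stop releasing new jobs of the old mode. */
    for (size_t i = 0; i < old_mode->n_tasks; i++)
        stop_releasing(old_mode->task_ids[i]);

    /* 2. Wait until all old-mode jobs have completed (busy wait kept
     *    simple for illustration; a real system would block). */
    for (size_t i = 0; i < old_mode->n_tasks; i++)
        while (has_pending_jobs(old_mode->task_ids[i]))
            ;

    /* 3. Only now release the tasks of the new mode. */
    for (size_t i = 0; i < new_mode->n_tasks; i++)
        release_task(new_mode->task_ids[i]);
}
```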

III. A FUTURE MOBILE COMMUNICATION SYSTEM

This section is largely based on the eMuCo technical report [18] and summarizes the most important aspects of the project from the architectural and scheduling points of view.

A. eMuCo Architecture

eMuCo [18] is a forerunner project that aims to provide an answer to the natural evolution of embedded communication systems towards multi-core architectures. eMuCo introduces a multi-core hardware platform, which is efficiently exploited by the combination of an L4 micro-kernel with virtualization concepts, a load balancer, and legacy applications such as GSM protocol stack systems or multimedia applications running in heterogeneous run-time environments.

In the eMuCo system architecture, the multi-core hardware platform lies on the lowest level, above which a micro-kernel is running (e.g. the L4 Fiasco micro-kernel [19]). The micro-kernel runs in system mode, provides the upper layers with protected access to the hardware and offers a minimal set of system services, such as address management, memory management, thread handling and inter-process communication, to the applications running above. User-level services are built on top of these services, run in user mode and compose the Basic Resource Layer. The upper layers consist of different choices of operating systems or virtual machines running on top of the micro-kernel, and of application layers. System flexibility and scalability are targeted by using a load balancer service running on top of the micro-kernel. The load balancer can be seen as an application running in the Basic Resource Layer in user mode, but for reasons of expressiveness we decided to allocate a separate level to this particular application. Conceptually, the load balancer continuously monitors the computational power needed by the upper layers (applications) and dynamically balances the allocation of threads on the available cores in order to supply the necessary processing power and to optimize power consumption (e.g. shut down one or more cores when the computational load is low and put all cores to work when the need arises). The eMuCo system architecture, including the load balancer component, is presented in Figure 1.

Figure 1. eMuCo Architecture with Load Balancer

B. Load Balancing Challenges

In a multi-core environment, one component must decide where threads are allocated initially and when a thread needs to be relocated in order to satisfy an optimality criterion. We call this component the Load Balancer (LB). The role of the load balancer is to monitor thread execution and to distribute the threads on the available cores. Only the load balancer has a global view of the threads in the system and can thus place threads on cores according to their needs. In the eMuCo system, the kernel does not migrate operating system threads dynamically, but rather only supplies the functionality to do so to its applications. Architecturally, the load balancer intercepts thread creations and thus gains control of the threads in applications. This functionality is transparent to the application layer.
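A minimal sketch of such interception is shown below, assuming hypothetical lb_* and kernel_* entry points (the real L4 interface is not reproduced here): every creation request passes through the load balancer, which records the thread and chooses its initial core.

```c
#include <stdint.h>

typedef struct {
    uint32_t tid;      /* thread identifier */
    uint8_t  priority; /* 0..255, as enforced by L4 */
    uint8_t  core;     /* core the thread is placed on */
} lb_thread_t;

#define MAX_THREADS 64
static lb_thread_t registry[MAX_THREADS]; /* load balancer's global view */
static int n_threads;

/* Hypothetical kernel hooks, stubbed for illustration. */
static uint32_t kernel_create_thread(uint8_t prio)
{
    static uint32_t next_tid = 1;
    (void)prio;
    return next_tid++;
}
static void kernel_set_affinity(uint32_t tid, uint8_t core)
{
    (void)tid; (void)core; /* actual migration happens at L4 level */
}

/* All thread creations pass through the load balancer, which records
 * the thread and chooses an initial core; applications see only a
 * normal thread-creation call. */
uint32_t lb_create_thread(uint8_t prio, uint8_t initial_core)
{
    uint32_t tid = kernel_create_thread(prio);
    kernel_set_affinity(tid, initial_core);
    if (n_threads < MAX_THREADS)
        registry[n_threads++] = (lb_thread_t){ tid, prio, initial_core };
    return tid;
}
```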

The load balancer is then able to periodically, or on any specific event, monitor the threads and potentially migrate them between the available cores. Additionally, the load balancer has the possibility to switch CPU cores off to reduce the power consumption of the system, and to switch them back on if more CPU resources are required.

IV. LOAD BALANCER ARCHITECTURE

The main goal of the load balancer component is to ensure that, at any given moment, work is evenly distributed among the available CPUs. This implies that running threads should be migrated between CPUs such that no CPU becomes overloaded while other CPUs are kept under little stress, unless the latter are switched off entirely. This is expected to improve the overall power consumption of the system. Based on these considerations, we identified two main requirements for the load balancer, antagonistic in nature:
• The load balancer should ensure that enough computation power is provided for each running thread.
• The load balancer should ensure an optimal distribution of threads on the available CPUs to improve power consumption where possible.
During the evaluation of the possible design strategies for the load balancer, two scenarios have been considered in the eMuCo project, which are summarized next. The first approach refers to dynamic, on-board load balancing. The second approach relies on offline system analysis and is part of our work in the context of the eMuCo project.
The asynchronous scenario. The workload on each CPU is permanently monitored and thread migration between CPUs may occur at any given time. This scenario would normally ensure the best distribution of work among CPUs, but the overhead required by the monitoring part could become a serious drawback, especially in a real-time environment.
The synchronous scenario. During its lifetime, the system can be seen as a state machine in which certain events (external or not) determine transitions between states. States and transitions are application-specific and are dictated by the main component of the system, which is normally implemented as an L4 task running next to the load balancer. This scenario relies on the observation that the thread distribution among CPUs need not change while the system remains in the same state. However, once the system switches its state, a new thread distribution, suitable for the new state, may be needed. In this scenario there is no need to permanently monitor CPU loads, and thread migration only happens when the system changes its state (thus being synchronous with that event). The exact distribution of threads for each state may be determined by means of profiling, automated learning, or other techniques. This scenario may not be as efficient as the previous one with respect to the distribution of workload among CPUs, but it can be more efficient when it comes to power consumption because it provides more flexibility. For example, the system may reach a state where the needed computation power is so low that the natural course of action is to switch off a few CPUs while the others perform normally, instead of keeping all the CPUs active and balancing a small workload among all of them. We believe this scenario is the desirable one in a real-time environment, since the overhead required to migrate threads between CPUs is smaller and only occurs during state changes.
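A minimal sketch of the synchronous scenario follows: on each state transition the load balancer simply applies the pre-computed thread-to-CPU mapping of the new state. The structures and the migrate_thread hook are illustrative assumptions.

```c
#include <stdint.h>

#define MAX_STATES  8
#define MAX_THREADS 32

typedef struct {
    uint32_t tid;
    uint8_t  cpu;  /* CPU the thread must run on in this state */
} placement_t;

typedef struct {
    placement_t entries[MAX_THREADS];
    int         count;
} thread_table_t;

/* One statically allocated table per system state (see Figure 2),
 * filled in at startup from the profiling results. */
static thread_table_t tables[MAX_STATES];

static void migrate_thread(uint32_t tid, uint8_t cpu)
{
    (void)tid; (void)cpu; /* stub: actual migration is done at L4 level */
}

/* Called when the main component signals a state transition; no
 * continuous CPU-load monitoring is needed. */
void on_state_change(int new_state)
{
    const thread_table_t *t = &tables[new_state];
    for (int i = 0; i < t->count; i++)
        migrate_thread(t->entries[i].tid, t->entries[i].cpu);
    /* CPUs unused in this state could be switched off here. */
}
```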

In the case of the synchronous (or static) approach, in each state the load balancer should manage all the running threads in the L4 system where it is installed, tracking at least the following information for each of them:
• Thread identifier: the primary key used to uniquely identify a thread;
• Priority: an integer between 0 and 255 (0 being the lowest, 255 the highest); these values are enforced by the underlying L4 system;
• CPU identifier: an integer that uniquely identifies the CPU where the thread is currently running.
In addition, for real-time threads the information managed by the load balancer can be extended with the following:
• Release time: the moment when thread execution is requested;
• Deadline: the maximum time by which the task must be completed.
Conceptually, this is achieved by using different thread tables, each associated with one state of the load balancer. Thread tables are statically allocated at system startup and computed by a profiling application. In each state, the corresponding thread table contains a static mapping between running threads and available CPUs. The situation is depicted in Figure 2.

Figure 2. Static Load Balancing Tables

The load balancer architecture is presented in Figure 3. The load balancer runs as an L4 service, together with other L4 services already available: the pager, the memory allocator, etc. In this architecture, the L4 kernel may run an entire operating system on top of its virtualization layer. In the context of the eMuCo project, the operating system of choice is L4Linux, a port of the Linux kernel to the L4 micro-kernel. It runs in user mode on top of the micro-kernel together with other micro-kernel applications and is binary compatible with the normal Linux/x86 kernel [9]. The main component of the system may run as an L4 application or as an L4Linux application, or there may be subcomponents running as L4 applications and other subcomponents running as L4Linux applications. In this environment, the load balancer may need to communicate with some or all of these components to properly perform its job. For example, communication with the L4 kernel is vital because thread migration, although commanded by the load balancer, is performed at the L4 level.

Figure 3. Load Balancer Architecture

A. Thread tables generation

The resource requirements of every application can be described in the form of a contract between the application and the rest of the system. The application contract concept is then used to determine the thread tables for each state of the system. The thread configuration tables can be obtained through benchmarking and profiling of the applications. This is a must for hard real-time applications, such as the modem subsystem of an embedded mobile device, for which the load balancer must decide as quickly as possible on thread allocation and priority assignment. Through profiling one can determine the thread table for each operation mode (state) of an application. The process is described in Figure 4. The L4 micro-kernel offers a tracing facility which can provide application-specific event traces. These traces can contain information on thread scheduling such as release times, execution times, thread-core affinity, etc. Essentially, by executing each application use case in L4 and collecting scheduling events, an application contract containing the application requirements can be determined through offline analysis. The contract can then be used to determine, by offline negotiation, the thread configuration tables for each use case (operation mode).

Figure 4. Thread Configuration Table building process
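As an illustration of this offline step, the sketch below reduces trace-derived per-thread requirements to a per-core placement using first-fit utilization packing; the contract structure and the packing policy are our own illustrative assumptions, not necessarily the negotiation algorithm used in eMuCo.

```c
#include <stddef.h>
#include <stdint.h>

/* One entry of an "application contract" distilled from L4 trace
 * events: how much CPU time a thread needs per observed period.
 * period_us is assumed non-zero. */
typedef struct {
    uint32_t tid;
    uint64_t exec_time_us;   /* measured execution time per period */
    uint64_t period_us;      /* observed release period */
} contract_entry_t;

/* Utilization of one thread, in percent of a core. */
static uint32_t util_pct(const contract_entry_t *c)
{
    return (uint32_t)((c->exec_time_us * 100) / c->period_us);
}

/* First-fit assignment of threads to cores so that no core exceeds
 * 100% utilization; the result forms one thread table (one row per
 * operation mode). Returns -1 if the mode is infeasible. */
int assign_cores(const contract_entry_t *contract, size_t n,
                 uint8_t *cpu_of_thread, size_t n_cores)
{
    uint32_t load[16] = {0};
    if (n_cores > 16)
        return -1;
    for (size_t i = 0; i < n; i++) {
        size_t c = 0;
        while (c < n_cores && load[c] + util_pct(&contract[i]) > 100)
            c++;
        if (c == n_cores)
            return -1; /* does not fit on the available cores */
        load[c] += util_pct(&contract[i]);
        cpu_of_thread[i] = (uint8_t)c;
    }
    return 0;
}
```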

Once the contract negotiation is finished, a reservation of computational resources (i.e. CPU time) is guaranteed to the application tasks and mapped to a priority captured in the load balancer table.

V. CONCLUSIONS

In this paper we have summarized some of the problems and challenges of constructing multi-core embedded communication systems from both the architectural and the scheduling point of view. We started by presenting the scheduling solution landscape and its applicability to multi-core embedded systems. Next, the architecture of the eMuCo system has been presented, which aims to bridge the gap between the increasing application requirements and the execution capabilities. In the context of the eMuCo architecture, we have identified a critical component, called the load balancer, which has to coordinate the execution of all running tasks within the system. Last but not least, we have presented the load balancer architecture together with the system workflow required to achieve a flexible and efficient system.

ACKNOWLEDGMENT

eMuCo (www.emuco.eu) is a European project supported by the European Union under the Seventh Framework Programme (FP7) for research and technological development.

REFERENCES

[1] J. Ahmed and C. Chakrabarti, "A Dynamic Task Scheduling Algorithm for Battery Powered DVS Systems," Proceedings of the 2004 International Symposium on Circuits and Systems (ISCAS '04), vol. 2, pp. 813-816, May 2004.
[2] S. Baruah, J. Goossens, and G. Lipari, "Implementing Constant-Bandwidth Servers upon Multiprocessor Platforms," Proceedings of the Eighth IEEE Real-Time and Embedded Technology and Applications Symposium (RTAS'02), pp. 154-163, 2002.
[3] S. Baruah and G. Lipari, "A Multiprocessor Implementation of the Total Bandwidth Server," Proceedings of the 18th International Parallel and Distributed Processing Symposium (IPDPS'04), vol. 1, p. 40a, 2004.
[4] D. Bautista, J. Sahuquillo, H. Hassan, S. Petit, and J. Duato, "A Simple Power-Aware Scheduling for Multicore Systems when Running Real-Time Applications," IEEE International Symposium on Parallel and Distributed Processing (IPDPS'08), pp. 1-7, April 2008.
[5] B.B. Brandenburg and J.H. Anderson, "Integrating Hard/Soft Real-Time Tasks and Best-Effort Jobs on Multiprocessors," Proceedings of the 19th Euromicro Conference on Real-Time Systems (ECRTS'07), pp. 61-70, 2007.
[6] Y. Chen, Z. Shao, Q. Zhuge, C. Xue, B. Xiao, and E.H. Sha, "Minimizing Energy via Loop Scheduling and DVS for Multi-Core Embedded Systems," Proceedings of the 11th International Conference on Parallel and Distributed Systems - Workshops (ICPADS'05), vol. 2, pp. 2-6, July 2005.
[7] A. Dudani, F. Mueller, and Y. Zhu, "Energy-Conserving Feedback EDF Scheduling for Embedded Systems with Real-Time Constraints," Proceedings of the Joint Conference on Languages, Compilers and Tools for Embedded Systems: Software and Compilers for Embedded Systems (LCTES/SCOPES '02), pp. 213-222, June 2002.
[8] M.G. Harbour, "FRESCOR - Architecture and Contract Model for Processors and Networks," Technical Report DAC1, Cantabria University, 2006.
[9] H. Härtig, M. Hohmuth, J. Liedtke, S. Schönberg, and J. Wolter, "The Performance of µ-Kernel-based Systems," Proceedings of the 16th Symposium on Operating System Principles (SOSP), pp. 66-77, France, 1997.
[10] V. Izosimov, P. Pop, P. Eles, and Z. Peng, "Scheduling of Fault-Tolerant Embedded Systems with Soft and Hard Timing Constraints," Proceedings of the Conference on Design, Automation and Test in Europe (DATE '08), pp. 915-920, March 2008.
[11] V. Nelis and J. Goossens, "Mode Change Protocol for Multi-Mode Real-Time Systems upon Identical Multiprocessors," arXiv:0809.5238, Cornell University, 2008.
[12] V. Nelis, J. Goossens, R. Devillers, D. Milojevic, and N. Navet, "Power-Aware Real-Time Scheduling upon Identical Multiprocessor Platforms," Proceedings of the 2008 IEEE International Conference on Sensor Networks, Ubiquitous, and Trustworthy Computing (SUTC'08), pp. 209-216, June 2008.
[13] E. Piel, P. Marquet, J. Soula, and J.-L. Dekeyser, "Real-Time Systems for Multiprocessor Architectures," 20th International Parallel and Distributed Processing Symposium (IPDPS'06), 8 pp., April 2006.
[14] C. Scordino and G. Lipari, "A Resource Reservation Algorithm for Power-Aware Scheduling of Periodic and Aperiodic Real-Time Tasks," IEEE Transactions on Computers, vol. 55, no. 12, pp. 1509-1522, December 2006.
[15] Z. Shao, M. Wang, Y. Chen, C. Xue, M. Qiu, L.T. Yang, and E.H. Sha, "Real-Time Dynamic Voltage Loop Scheduling for Multi-Core Embedded Systems," IEEE Transactions on Circuits and Systems II: Express Briefs, vol. 54, no. 5, pp. 445-449, May 2007.
[16] Y. Shin, K. Choi, and T. Sakurai, "Power-Conscious Scheduling for Real-Time Embedded Systems Design," VLSI Design, vol. 12, no. 2, pp. 139-150, 2001.
[17] C. Xian, Y. Lu, and Z. Li, "Energy-Aware Scheduling for Real-Time Multiprocessor Systems with Uncertain Task Execution Time," Proceedings of the 44th Annual Conference on Design Automation (DAC'07), pp. 664-669, June 2007.
[18] eMuCo Technical Report, www.emuco.eu
[19] L4 Fiasco micro-kernel, http://os.inf.tu-dresden.de/fiasco/
