ParaRMS Algorithm: A Parallel Implementation of Rate Monotonic Scheduling Algorithm Using OpenMP

Rajnish Dashora
School of Computing Sciences and Engineering, VIT University, Vellore, India
[email protected]

Harsh P. Bajaj
School of Computing Sciences and Engineering, VIT University, Vellore, India
[email protected]

Akshat Dube
School of Electrical Engineering, VIT University, Vellore, India
[email protected]

Narayanamoorthy M
School of Computing Sciences and Engineering, VIT University, Vellore, India
[email protected]

Abstract— With the evolution of multi-core systems, computing performance has improved, and processors with more than one processing core have found application in fields where high performance and complex computation are required, such as supercomputing, remote monitoring systems, data mining, and data communication and processing systems. Multi-core processors can also be utilized to improve the performance of embedded and real-time systems. As the scheduling capability of a Real Time Operating System (RTOS) determines its efficiency and performance when dealing with real-time critical tasks, the use of more than one processing unit speeds up task processing. This paper presents an algorithm called ParaRMS, a parallel implementation of the Rate Monotonic Scheduling (RMS) algorithm on a multi-core architecture. It improves the scalability and responsiveness of rate monotonic scheduling, helps schedule dynamic tasks effectively, and offers good CPU utilization for a given task set. In support of this work, analysis of ParaRMS has been carried out on Intel VTune Amplifier XE 2013 with positive results.

Keywords—Multicore Processors; Embedded Systems; Real Time Operating Systems (RTOS); Scheduling; Rate Monotonic Scheduling (RMS); Real time critical tasks; Shared Memory; Symmetric Multiprocessing (SMP); Embedded Multicore

I. INTRODUCTION

Embedded systems are everywhere, from daily household consumer electronics such as air conditioners, washing machines, and mobile phones to the control and monitoring units in nuclear power plants. Scheduling tasks on these systems can be done using many algorithms, such as Rate Monotonic Scheduling and Earliest Deadline First scheduling. With the evolving technology of multi-core processors, these algorithms can be improved to produce systems with higher performance capabilities. Multi-core processors, which have more than one processing unit, offer significant enhancements to the performance of a system and are less complex to design than a single-core processor with a very large number of transistors. However, utilizing multi-core processors in embedded systems is a challenge, as these systems are designed to perform a set of specific tasks and thus require a higher level of system integration. The Rate Monotonic Scheduling algorithm can be implemented on multi-core embedded systems to schedule tasks using the OpenMP programming model [15]. The solution proposed in this paper significantly enhances the performance of the real-time embedded system, takes care of synchronization among the processes, and helps time-critical tasks meet their deadlines.

II. BACKGROUND

Embedded systems with sophisticated functionalities need low power consumption and low manufacturing cost. Power consumption is a critical factor for an embedded system and must be minimized while maintaining maximum system reliability and performance. Multi-core processors provide a performance increase without demanding changes to the operating system: one core runs the main applications while the others handle tasks that require immediate attention, such as interrupt handling. When a shared variable is used in a program, it is first updated in the L1 cache, and the same cache may be accessed by multiple processors at a given instant. Hence, to avoid synchronization problems and to maintain consistency of shared memory, a cache control mechanism is required. However, many embedded systems do not have coherent cache systems or Memory Management Units, and they have non-uniform memory access times in contrast with general-purpose systems [13].

III. RELATED WORKS

Algorithms for scheduling tasks on embedded systems with multi-core processors, such as hybrid scheduling methods, follow a two-level scheduling schema in which one level is used for assigning the scheduling policy and the second for rate monotonic scheduling [12]. In these algorithms, independent tasks are partitioned using partitioning algorithms and then allocated on the multi-core architecture [17]. Static and dynamic partitioning algorithms exist that are scalable, but partitioning and assigning each individual task to a different core is an overhead, i.e., extra work done in the scheduling process. Hence these existing algorithms are complex, less scalable, and hard to implement. Finding a minimal schedule for a given set of tasks on a multi-core system is an NP-hard problem [11]. The algorithm proposed in this paper helps find a suitable schedule for a set of processes in a multi-core processing environment.

IV. RATE MONOTONIC SCHEDULING

Rate monotonic scheduling is an optimal static-priority scheduling algorithm. A static-priority algorithm is one that allocates priorities before the execution of processes [2]. The algorithm is optimal in the sense that a set of tasks that RMS [4] cannot schedule to meet its deadlines cannot be scheduled by any other static-priority algorithm either. The process with the shortest period has the highest priority. Formulas check the schedulability of a task set under rate monotonic scheduling. The purpose of a real-time scheduling algorithm is to ensure that critical timing constraints, such as deadlines and response times, are met [2]. For a process, the response time is defined as the time at which it finishes its execution [3].

The underlying theory of RMS is known as Rate Monotonic Analysis (RMA). It has the following assumptions [1]:
A1: All processes are allocated the single CPU based on their periods.
A2: Context switching time is not considered.
A3: There are no data dependencies among the processes.
A4: The execution time of each process is constant.
A5: Deadlines are at the end of periods.
A6: The ready process with the highest priority is selected for execution.

The total utilization of the CPU by n rate-monotonically scheduled tasks is bounded by [1]:

    U = Σ (τi / Ti) ≤ n(2^(1/n) − 1),  summed over i = 1..n

where Ti is the period and τi is the execution (computation) time of a process Pi from the set of n tasks.

Algorithm of serial RMS [1]:

    /* processes[] is an array of process activation records, stored in
       order of priority, with processes[0] being the highest-priority
       process */
    Activation_record processes[NPROCESSES];

    void RMA(int current) /* current = currently executing process */
    {
        int i;
        /* turn off current process (may be turned back on) */
        processes[current].state = READY_STATE;
        /* find the first (highest-priority) ready process to execute;
           the loop checks one candidate process at each step */
        for (i = 0; i < NPROCESSES; i++)
            if (processes[i].state == READY_STATE) {
                /* make this the running process */
                processes[i].state = EXECUTING_STATE;
                processes[i].executiontime++;
                break;
            }
        if (i != current)
            processes[current].state = WAIT_STATE;
        if (processes[i].runningtime == processes[i].executiontime)
            processes[i].state = COMPLETED_STATE;
    }

Rate monotonic scheduling has the advantages of low context-switch overhead and low scheduling overhead [3]. However, one disadvantage that cannot be overlooked is that RMS does not scale effectively when processes arrive at run time; it becomes difficult to schedule such processes. Hence, to overcome these disadvantages and to develop a strongly scalable rate monotonic scheduler, parallelism and multi-core processors can be used. A parallel solution for the RMS scheduling algorithm can be implemented using OpenMP (Open Multi-Processing). OpenMP reduces the complexity of implementing parallelism [8]. In real-time systems, process execution is time critical, and the OpenMP programming model develops a parallel program in three steps [9]:

Analyse: Analyse the scenario, the logic behind the algorithm, the serial algorithm, and the inputs and outputs expected from the program.

Design: This phase describes how much work is to be done and how it will be distributed in a multiprocessor environment.

Parallelize: The parallel algorithm is implemented in this phase.

The performance of a parallel system depends on the percentage of the program that runs in parallel, the work done in the parallel regime, the work balance among the processors, and the communication among the cores. The performance of a parallel program can be measured on the basis of speed-up, efficiency, scalability, time taken, and Amdahl's law [18].

V. PROPOSED SOLUTION

This paper proposes a solution with which the RMS algorithm can be applied in parallel. The algorithm is executed on multiple cores so that many tasks can be scheduled at a given instant of time. A core runs one or more threads (a thread pool), and these threads, whenever idle, can pick up a task and execute it. The solution is implemented in OpenMP, where thread 0 is the master thread. The algorithm is designed such that if any process is in the ready state, no core will remain idle, and all processes in the ready state will be taken up for execution by some core. Given a set of periodic tasks, the algorithm assigns a deadline to each of them based on its period. It then checks the number of threads available and assigns tasks to the idle threads, balancing the work load. It also ensures that one task is allocated to only one thread at a given time. Each task is checked after every unit of time for its remaining execution time and its deadline. Based on this, the task is either continued or, if a higher-priority task is in the ready state, the executing task is pre-empted: it goes into the waiting state if its execution is completed, or back into the ready state if execution remains. The implementation also balances the number of tasks per thread by using the dynamic schedule clause of OpenMP [10].

The ParaRMS algorithm can be described as follows:

Step 1: Check the schedulability of the tasks, i.e., their utilization should be less than one for each of them to be scheduled before its respective deadline.
    if (schedulable)
        process further steps
    else
        exit the program with a message, without executing any of the tasks

Step 2: Sort the tasks according to their periods.

Step 3: Calculate the hyper-period of the tasks so that the schedule can be sampled over this period.

Step 4: Initially all tasks are in the ready state, as all processes are assumed to have arrived.

Step 5: The following three steps run on each processing unit:
a. Select the task having the highest priority among the tasks that are in the ready state and not being processed by any core, change its state to active, and process it for one unit of time.
b. As the process is selected for execution, its state is changed to active and its remaining time is reduced by one unit. The reduction of the remaining time is a critical section of the algorithm, as it must only be performed by one particular thread or core at a time.
c. If the task previously being executed has reached its completion time, place it into the waiting or completed state (meaning the system is waiting for the process to arrive again).

Step 6: The time count is increased, and newly arrived processes are placed in the ready state.

Step 7: Go to Step 5 and repeat until the time reaches the hyper-period.

Fig. 2: Fork-Join model (check utilization of system → prioritize by period → ready the arrived processes → rms())

Assumptions:
A1: At the beginning of the execution, all tasks are in the ready state.
A2: There is no race condition among threads to execute a task, i.e., two threads never try to execute the same task.
A3: Tasks at the beginning of their period have all the resources necessary for execution, so they are put into the ready state.
A4: There is no cache conflict when accessing shared variables.
A5: The periods of tasks are not decimal (fractional) numbers.
The rest of the assumptions are the same as for the serial RMS algorithm.

Fig. 1: States of a task

Advantages of this approach:
i. The allocation of processes is similar to the task decomposition model, which gives the advantage of load balancing and a higher probability of scheduling a larger task set.
ii. Tasks can be scheduled dynamically, as some threads (cores) may be idle at a given instant, so dynamically arriving tasks can be scheduled on them.
iii. Even if a task executing on a particular thread fails, the other tasks are not affected, as that thread is independent of the other threads; tasks waiting to be executed can be executed on other threads.

For a task set to be schedulable on this embedded system, the average utilization per core must be less than 1.

Calculation of hyper-period and utilization: The hyper-period is the time window that covers all possible task phasings and is the least common multiple of all the periods in the task set. That is,

    Hyper-period = LCM of the periods of the task set

Utilization is the basic measure of efficiency. In the case of single-core processors, utilization depends on two things: the total time slots available on the processor and the fraction of time slots utilized. Hence efficiency can be stated directly as the ratio of total utilized time to the total time available to the CPU [3].

The hotspot analysis of the code for the given input shows that the program function takes very little time, as can be seen in figure 1. This implies that the code is optimized and can be used on an embedded system, as it will use less power [7]. The number of CPU cycles is also small, and the load is balanced across the threads based on the number of tasks given as input (shown in fig. 5). Here the master thread (thread 0) carries more load because its share of the hardware's processing capacity is higher. When the number of processes to be scheduled increases, the algorithm uses its scheduling capabilities and assigns tasks to multiple cores simultaneously. Hence load balancing and faster computation are achieved.

VI. EXPERIMENTAL ANALYSIS

Experimental analysis of this algorithm was done using the VTune Amplifier XE [5][6] software provided by Intel. Task sets were randomly chosen, given as input to the algorithm, and tested on an Intel Core i7-4770S processor and an Intel Core i5-4570S processor. The analysis on Intel VTune Amplifier of the program for the input in Table I is shown in figures 3 and 4, and a comparison for this task set is tabulated in Table II.

TABLE I. INPUT WITH TASK SET I OF SIZE 3

    Task ID | Execution Time | Period | Priority
    1       | 2              | 6      | 2
    2       | 3              | 12     | 3
    3       | 1              | 4      | 1

    Utilization = (CPU time used by processes) / (total available CPU time)

For multi-core processors, one more factor affects utilization: the number of cores. As the number of cores in the system increases, the average utilization per core tends to decrease, i.e.,

    Average utilization per core ∝ 1 / (number of cores)

assuming all the cores are identical and have the same execution capabilities. Hence,

    Average utilization per core = (CPU time used by processes) / (total available CPU time × number of cores)

which, in terms of the task parameters, is

    μ = (1/α) Σ (τi / Ti),  summed over i = 1..n

where α is the total number of processing cores in the embedded system.

Fig. 3: CPU utilization by the code on 4 threads of the processor

The concurrency analysis shows that the threads are properly synchronized: there is no race condition in the code. The wait time for processes is also low for the given set of tasks; even if more tasks are given, the wait time increases only marginally.

TABLE II. COMPARATIVE ANALYSIS FOR TASK SET I

    Name of Processor   | Number of threads | CPI Rate | Time taken by methods | Overhead Time | Spin Time | Inactive Time
    Intel Core i7-4770S | 8                 | 1.05     | 0.001s                | 0.0s          | 0.0s      | 0.0s
    Intel Core i5-4570S | 4                 | 2.16     | 0.001s                | 0.0s          | 0.0s      | 0.0s
    Intel Core i7-4770S | 1 (serial)        | 1.667    | 0.005s                | 0.0s          | 0.0s      | 0.0s

The analysis was also carried out for the following task set in order to check scalability. This helps measure performance more accurately, and by analyzing various kinds of task sets these results can be generalized.

TABLE III. INPUT WITH TASK SET II OF SIZE 8

    Task i | Execution time Ci | Period Ti | Ratio Ci/Ti | Utilization (1..N) | Upper Bound
    1      | 10                | 500       | 0.020       | 0.020              | 1.000
    2      | 15                | 300       | 0.050       | 0.070              | 0.828
    3      | 12                | 100       | 0.120       | 0.190              | 0.779
    4      | 5                 | 150       | 0.033       | 0.223              | 0.756
    5      | 20                | 200       | 0.100       | 0.323              | 0.743
    6      | 50                | 200       | 0.250       | 0.573              | 0.735
    7      | 25                | 250       | 0.100       | 0.673              | 0.729
    8      | 5                 | 40        | 0.125       | 0.798              | 0.724

Fig. 5: CPU utilization by these tasks using 4 cores (i5 processor)

TABLE IV. COMPARATIVE ANALYSIS OF TASK SET II

    Name of Processor   | Number of threads | CPI Rate | Time taken by methods | Overhead Time | Spin Time | Inactive Time
    Intel Core i7-4770S | 8                 | 1.97     | 0.002s                | 0.0s          | 0.0s      | 0.0s
    Intel Core i5-4570S | 4                 | 2.36     | 0.003s                | 0.0s          | 0.0s      | 0.0s
    Intel Core i7-4770S | 1 (serial)        | 1.667    | 0.008s                | 0.0s          | 0.0s      | 0.0s

Analysis done on the Intel Core i7 processor for the given input, with more tasks and longer execution times and periods, is summarized in Table IV.

Fig. 6: Speed-up vs. task-set size for Core i5 and i7 (4 threads and 8 threads; Task Set I and Task Set II)

Fig. 4: CPU utilization using 8 cores (i7 processor)

The analysis was also carried out on an Intel Core i5 processor to check its cross-platform performance. This helps analyze the performance in various environments. The CPU utilization histogram for the i5 processor is shown in figure 8 for the second task set, and Table IV shows the comparative analysis of the second task set.

VII. RESULT AND CONCLUSION

The ParaRMS algorithm enhances the performance of the embedded system to a significant level of optimization using OpenMP. Tasks are executed with their loads balanced across the threads, and they can also be scheduled dynamically, as free threads can execute tasks that appear at run time. ParaRMS synchronizes the processes among cores very efficiently and has very little overhead time on a symmetric multi-core embedded processor, as the analysis shows; the experimental results are in favor of this claim. The solution is strongly scalable: as the number of processor cores increases, the algorithm takes less time to process the same task set and can schedule more processes. Hence, increasing the number of cores in the system increases its capacity to schedule more tasks. The ParaRMS algorithm can be used in real-time environments for scheduling tasks on embedded systems where high reliability and performance are the metrics of the system [14]. Moreover, performance improvements of the order of 50% to 80% are considerable steps in the direction of further optimization.

VIII. FUTURE WORKS

Future work includes developing an embedded multicore system with intelligent cores that schedule the tasks among themselves without the intervention of the operating system or the developer. This will help in designing hardware for autonomous and real-time embedded systems with better speed and reliability and less complexity. A parallel implementation of the Earliest Deadline First (EDF) algorithm will also be taken up on the device, as EDF, being a dynamic scheduling algorithm, offers a higher level of reliability towards the embedded-system metric of meeting deadlines at any cost. This will be a platform-independent and integrated solution for the hardware [16].

REFERENCES
[1] W. Wolf, Computers as Components: Principles of Embedded Computing System Design. Morgan Kaufmann Publishers Inc., 2001.
[2] D. Stewart and M. Barr, "Rate Monotonic Scheduling," Embedded Systems Programming, March 2002, pp. 79-80.
[3] C. Liu and J. Layland, "Scheduling Algorithms for Multiprogramming in a Hard-Real-Time Environment," 1973.
[4] Literature survey by the ECE department of the University of Texas at Austin. www.ece.utexas.edu/~bevans/courses/ee382c/projects/fall99/forman/litsurvey.pdf
[5] D. Kalser, Intel, "Tools to Measure the Performance Scalability of Applications." software.intel.com/sites/products/Whitepaper/MeasureApplicationPerformanceScalability_013012.pdf
[6] LLNL, Intel VTune Amplifier. computing.llnl.gov/?set=code&page=intel_vtune
[7] W. K. Shih, J. Liu, and C. Liu, "Modified Rate-Monotonic Algorithm for Scheduling Periodic Jobs with Deferred Deadlines," IEEE Transactions on Software Engineering, 1993, 19, 1171-1179.
[8] T. Mattson and L. Meadows, "A 'Hands-on' Introduction to OpenMP," openmp.org
[9] R. Chandra, R. Menon, L. Dagum, D. Kohr, D. Maydan, and J. McDonald, Parallel Programming in OpenMP. Academic Press, A Harcourt Science and Technology Company, USA, 2001.
[10] M. D. Jones, "Shared Memory Programming with OpenMP," Center for Computational Research.
[11] K. Lakshmanan, S. Kato, and R. Rajkumar, "Scheduling Parallel Real-Time Tasks on Multi-core Processors," Department of Electrical and Computer Engineering, Carnegie Mellon University, Pittsburgh, USA.
[12] P. Tan, "Task Scheduling of Real-time Systems on Multi-Core Architectures," School of Computing, Nanchang Hangkong University, Nanchang, China.
[13] J. H. Anderson, J. M. Calandrino, and U. C. Devi, "Real-Time Scheduling on Multicore Platforms," Department of Computer Science, The University of North Carolina at Chapel Hill, October 2005.
[14] S. K. Baruah and J. Goossens, "Rate Monotonic Scheduling on Uniform Multiprocessors," IEEE Computer Society.
[15] P. S. Pacheco, An Introduction to Parallel Programming. Morgan Kaufmann, Elsevier, 2011.
[16] B. Zhou, J. Qiao, and S. Lin, "Research on Parallel Real-time Scheduling Algorithm of Hybrid Parameter Tasks on Multi-core Platform," received June 17, 2010; revised March 4, 2011.
[17] K. Lakshmanan, R. Rajkumar, and J. P. Lehoczky, "Partitioned Fixed-Priority Preemptive Scheduling for Multi-Core Processors," Carnegie Mellon University, Pittsburgh, PA 15213.
[18] A. Grama, A. Gupta, G. Karypis, and V. Kumar, Introduction to Parallel Computing, Second Edition. Pearson, Addison Wesley, 2003.
