An Experimental Evaluation of Real-Time DVFS Scheduling Algorithms

Sonal Saha

Thesis submitted to the Faculty of the Virginia Polytechnic Institute and State University in partial fulfilment of the requirements for the degree of

Master of Science in Computer Engineering

Binoy Ravindran, Chair
Paul E. Plassmann
Robert P. Broadwater

September 9, 2011 Blacksburg, Virginia

Keywords: Dynamic Voltage and Frequency Scaling, Real-Time Linux

Copyright 2011, Sonal Saha

An Experimental Evaluation of Real-Time DVFS Scheduling Algorithms

Sonal Saha

(ABSTRACT)

Dynamic voltage and frequency scaling (DVFS) is an extensively studied energy management technique, which aims to reduce the energy consumption of computing platforms by dynamically scaling the CPU frequency. Real-Time DVFS (RT-DVFS) is a branch of DVFS, which reduces CPU energy consumption through DVFS while at the same time ensuring that task time constraints are satisfied by constructing appropriate real-time task schedules. The literature presents numerous RT-DVFS scheduling algorithms, which employ different techniques to utilize the CPU idle time to scale the frequency. Many of these algorithms have been experimentally studied through simulations, but have not been implemented on real hardware platforms. Though simulation-based experimental studies can provide a first-order understanding, implementation-based studies can reveal actual timeliness and energy consumption behaviours. This is particularly important when it is difficult to devise accurate simulation models of hardware, which is increasingly the case with modern systems. In this thesis, we study the timeliness and energy consumption behaviours of fourteen state-of-the-art RT-DVFS schedulers by implementing and evaluating them on two hardware platforms. The schedulers include CC-EDF, LA-EDF, REUA, DRA, and AGR1, among others, and the hardware platforms include an ASUS laptop with the Intel I5 processor and a motherboard with the AMD Zacate processor. We implemented these schedulers in the ChronOS real-time Linux kernel and measured their actual timeliness and energy behaviours under a range of workloads including CPU-intensive, memory-intensive, mutual exclusion lock-intensive, and processor-underloaded and overloaded workloads. Our studies reveal that measuring the CPU power consumption as the cube of CPU frequency can lead to incorrect conclusions. In particular, it ignores the idle state CPU power consumption, which is orders of magnitude smaller than the active power consumption. Consequently, power savings obtained by exclusively optimizing active power consumption (i.e., RT-DVFS) may be offset by completing tasks sooner by running them at the highest frequency and transitioning to the idle state earlier (i.e., no DVFS). Thus, the active power consumption savings of the RT-DVFS techniques that we report are orders of magnitude smaller than their simulation-based savings reported in the literature.

Dedication

I dedicate this thesis to my mother.


Acknowledgments

First and foremost, I would like to thank my advisor, Dr. Binoy Ravindran, for his help, guidance and encouragement. It has been an honour to work under him. I would also like to thank Dr. Paul Plassmann and Dr. Robert Broadwater for serving on my committee and providing me with their valuable suggestions and feedback. I also thank my Real-Time Systems lab mates, Matthew Dellinger, Piyush Garyali and Aaron Lindsay, for their help and feedback. Finally, I would like to thank my family and friends, without whose love, support and encouragement this thesis wouldn't have been possible.


Contents

1 Introduction
  1.1 Limitations of Past Real-Time DVFS Research
  1.2 Research Contributions
  1.3 Thesis Organization

2 Related Work
  2.1 GP-DVFS
  2.2 RT-DVFS
  2.3 Statistical DVFS
  2.4 Summary

3 Algorithms
  3.1 Schedulers for Independent Underloaded task-sets
    3.1.1 Base-EDF
    3.1.2 Static-EDF
    3.1.3 CC-EDF
    3.1.4 LA-EDF
    3.1.5 Snowdon-min
    3.1.6 DRA
    3.1.7 DRA-OTE
    3.1.8 AGR1
    3.1.9 AGR2
  3.2 Schedulers for Dependent Underloaded task-sets
    3.2.1 EUA
    3.2.2 HS
    3.2.3 DS
    3.2.4 USFI EDF
  3.3 Schedulers for Overloaded task-sets
    3.3.1 REUA

4 Extending ChronOS with DVFS support
  4.1 ChronOS
    4.1.1 Introduction
    4.1.2 ChronOS real-time Scheduler
    4.1.3 Scheduling Events
    4.1.4 System Calls provided by ChronOS
  4.2 Adding DVFS support to ChronOS
    4.2.1 The CPUfreq Subsystem
    4.2.2 Changing the frequency using the CPUfreq module
    4.2.3 Working of the CPUfreq module
    4.2.4 Integration of the CPUfreq subsystem with ChronOS
    4.2.5 Implementation of the RT-DVFS Schedulers
    4.2.6 Modification of the real-time data structure for RT-DVFS support
      4.2.6.1 Additional fields for RT-DVFS support

5 Experimental Methodology
  5.1 Platform Specifications
  5.2 Test Application
    5.2.1 Modification to the Test Application for RT-DVFS
  5.3 Real-time Measurements
  5.4 Power Measurements
    5.4.1 Performance and Idle states of the CPU
    5.4.2 System Power Measurements
    5.4.3 Normalized and Actual CPU power measurement
    5.4.4 Calculation of Normalized CPU Energy consumption
    5.4.5 Calculation of Actual CPU Power

6 Experiments
  6.1 Energy consumption results on the Intel I5 laptop for schedulers designed for independent task-sets
    6.1.1 Varying the CPU utilization at constant ACET
    6.1.2 Varying ACET at a constant CPU utilization
    6.1.3 DSR results on the Intel I5 laptop
  6.2 Energy and DSR results for the schedulers designed for dependent task-sets
  6.3 Energy consumption results on the AMD Zacate Board
  6.4 System Power results on the AMD Zacate Board
  6.5 Energy and DSR results for the schedulers designed for dependent task-sets

7 Conclusions and Future Work
  7.1 Conclusions
  7.2 Future Work
    7.2.1 DVFS in CMPs
    7.2.2 Considering Real-world Applications
    7.2.3 DVFS with DPM
    7.2.4 DVFS with Procrastination Scheduling
    7.2.5 Reducing Dynamic Power Consumption of other devices

Bibliography

List of Figures

1.1 Idle Time Example
1.2 Dynamic Slack Example
3.1 Base-EDF
3.2 Static-EDF
3.3 CC-EDF
3.4 LA-EDF
3.5 Snowdon-min
3.6 DRA
3.7 Step TUF
3.8 HS
3.9 DS
4.1 CPUfreq Subsystem
4.2 Working of CPUfreq module
4.3 RT-DVFS scheduling in ChronOS
6.1 Normalized CPU Energy vs. CPU utilization (Underloads, 0.1 WCET, 5T, NL)
6.2 Actual CPU Power vs. CPU utilization (Underload, 0.1 WCET, 5T, NL)
6.3 Normalized CPU Energy vs. CPU utilization (Overloads, 0.1 WCET, 5T, NL)
6.4 Actual CPU Power vs. CPU utilization (Overloads, 0.1 WCET, 5T, NL)
6.5 Normalized CPU Energy vs. CPU utilization (Underloads, 0.2 WCET, 5T, NL)
6.6 Actual CPU Power vs. CPU utilization (Underload, 0.2 WCET, 5T, NL)
6.7 Normalized CPU Energy vs. CPU utilization (Overloads, 0.2 WCET, 5T, NL)
6.8 Actual CPU Power vs. CPU utilization (Overloads, 0.2 WCET, 5T, NL)
6.9 Normalized CPU Energy vs. CPU utilization (Underloads, 0.3 WCET, 5T, NL)
6.10 Actual CPU Power vs. CPU utilization (Underload, 0.3 WCET, 5T, NL)
6.11 Normalized CPU Energy vs. CPU utilization (Overloads, 0.3 WCET, 5T, NL)
6.12 Actual CPU Power vs. CPU utilization (Overloads, 0.3 WCET, 5T, NL)
6.13 Normalized CPU Energy vs. CPU utilization (Underloads, 0.4 WCET, 5T, NL)
6.14 Actual CPU Power vs. CPU utilization (Underload, 0.4 WCET, 5T, NL)
6.15 Normalized CPU Energy vs. CPU utilization (Overloads, 0.4 WCET, 5T, NL)
6.16 Actual CPU Power vs. CPU utilization (Overloads, 0.4 WCET, 5T, NL)
6.17 Normalized CPU Energy vs. CPU utilization (Underload, 0.5 WCET, 5T, NL)
6.18 Actual CPU Power vs. CPU utilization (Underload, 0.5 WCET, 5T, NL)
6.19 Normalized CPU Energy vs. CPU utilization (Overloads, 0.5 WCET, 5T, NL)
6.20 Actual CPU Power vs. CPU utilization (Overloads, 0.5 WCET, 5T, NL)
6.21 Normalized CPU Energy vs. CPU utilization (Underloads, 0.6 WCET, 5T, NL)
6.22 Actual CPU Power vs. CPU utilization (Underload, 0.6 WCET, 5T, NL)
6.23 Normalized CPU Energy vs. CPU utilization (Overloads, 0.6 WCET, 5T, NL)
6.24 Actual CPU Power vs. CPU utilization (Overloads, 0.6 WCET, 5T, NL)
6.25 Normalized CPU Energy vs. CPU utilization (Underload, 0.7 WCET, 5T, NL)
6.26 Actual CPU Power vs. CPU utilization (Underload, 0.7 WCET, 5T, NL)
6.27 Normalized CPU Energy vs. CPU utilization (Overloads, 0.7 WCET, 5T, NL)
6.28 Actual CPU Power vs. CPU utilization (Overloads, 0.7 WCET, 5T, NL)
6.29 Normalized CPU Energy vs. CPU utilization (Underload, 0.8 WCET, 5T, NL)
6.30 Actual CPU Power vs. CPU utilization (Underload, 0.8 WCET, 5T, NL)
6.31 Normalized CPU Energy vs. CPU utilization (Overloads, 0.8 WCET, 5T, NL)
6.32 Actual CPU Power vs. CPU utilization (Overloads, 0.8 WCET, 5T, NL)
6.33 Normalized CPU Energy vs. CPU utilization (Underload, 0.9 WCET, 5T, NL)
6.34 Actual CPU Power vs. CPU utilization (Underload, 0.9 WCET, 5T, NL)
6.35 Normalized CPU Energy vs. CPU utilization (Overloads, 0.9 WCET, 5T, NL)
6.36 Actual CPU Power vs. CPU utilization (Overloads, 0.9 WCET, 5T, NL)
6.37 Normalized CPU Energy vs. CPU utilization (Underload, 1.0 WCET, 5T, NL)
6.38 Actual CPU Power vs. CPU utilization (Underload, 1.0 WCET, 5T, NL)
6.39 Normalized CPU Energy vs. CPU utilization (Overloads, 1.0 WCET, 5T, NL)
6.40 Actual CPU Power vs. CPU utilization (Overloads, 1.0 WCET, 5T, NL)
6.41 Normalized CPU Energy vs. ACET (Underload, 50% CPU Utilisation, 5T, NL)
6.42 Actual CPU Power vs. ACET (Underload, 50% CPU Utilisation, 5T, NL)
6.43 Normalized CPU Energy vs. ACET (Underload, 60% CPU Utilisation, 5T, NL)
6.44 Actual CPU Power vs. ACET (Underload, 60% CPU Utilisation, 5T, NL)
6.45 Normalized CPU Energy vs. ACET (Underload, 70% CPU Utilisation, 5T, NL)
6.46 Actual CPU Power vs. ACET (Underload, 70% CPU Utilisation, 5T, NL)
6.47 Normalized CPU Energy vs. ACET (Underload, 80% CPU Utilisation, 5T, NL)
6.48 Actual CPU Power vs. ACET (Underload, 80% CPU Utilisation, 5T, NL)
6.49 Normalized CPU Energy vs. ACET (Underload, 90% CPU Utilisation, 5T, NL)
6.50 Actual CPU Power vs. ACET (Underload, 90% CPU Utilisation, 5T, NL)
6.51 Normalized CPU Energy vs. ACET (Underload, 100% CPU Utilisation, 5T, NL)
6.52 Actual CPU Power vs. ACET (Underload, 100% CPU Utilisation, 5T, NL)
6.53 DSR vs. CPU Utilisation (0.1 WCET, 5T, NL)
6.54 DSR vs. CPU Utilisation (0.2 WCET, 5T, NL)
6.55 DSR vs. CPU Utilisation (0.3 WCET, 5T, NL)
6.56 DSR vs. CPU Utilisation (0.4 WCET, 5T, NL)
6.57 DSR vs. CPU Utilisation (0.5 WCET, 5T, NL)
6.58 DSR vs. CPU Utilisation (0.6 WCET, 5T, NL)
6.59 DSR vs. CPU Utilisation (0.7 WCET, 5T, NL)
6.60 DSR vs. CPU Utilisation (0.8 WCET, 5T, NL)
6.61 DSR vs. CPU Utilisation (0.9 WCET, 5T, NL)
6.62 DSR vs. CPU Utilisation (1.0 WCET, 5T, NL)
6.63 Normalized CPU Energy vs. CPU utilization (5% CS, 10T, 1L)
6.64 Normalized CPU Energy vs. CPU utilization (10% CS, 10T, 1L)
6.65 Normalized CPU Energy vs. CPU utilization (15% CS, 10T, 1L)
6.66 Normalized CPU Energy vs. CPU utilization (20% CS, 10T, 1L)
6.67 Normalized CPU Energy vs. CPU utilization (30% CS, 10T, 1L)
6.68 Normalized CPU Energy vs. CPU utilization (40% CS, 10T, 1L)
6.69 Normalized CPU Energy vs. CPU utilization (50% CS, 10T, 1L)
6.70 Normalized CPU Energy vs. CPU utilization (60% CS, 10T, 1L)
6.71 Normalized CPU Energy vs. CPU utilization (70% CS, 10T, 1L)
6.72 Actual CPU Power vs. CPU utilization (5% CS, 10T, 1L)
6.73 Actual CPU Power vs. CPU utilization (10% CS, 10T, 1L)
6.74 Actual CPU Power vs. CPU utilization (15% CS, 10T, 1L)
6.75 Actual CPU Power vs. CPU utilization (20% CS, 10T, 1L)
6.76 Actual CPU Power vs. CPU utilization (30% CS, 10T, 1L)
6.77 Actual CPU Power vs. CPU utilization (40% CS, 10T, 1L)
6.78 Actual CPU Power vs. CPU utilization (50% CS, 10T, 1L)
6.79 Actual CPU Power vs. CPU utilization (60% CS, 10T, 1L)
6.80 Actual CPU Power vs. CPU utilization (70% CS, 10T, 1L)
6.81 DSR vs. CPU utilization (5% CS, 10T, 1L)
6.82 DSR vs. CPU utilization (10% CS, 10T, 1L)
6.83 DSR vs. CPU utilization (20% CS, 10T, 1L)
6.84 DSR vs. CPU utilization (30% CS, 10T, 1L)
6.85 DSR vs. CPU utilization (40% CS, 10T, 1L)
6.86 DSR vs. CPU utilization (50% CS, 10T, 1L)
6.87 DSR vs. CPU utilization (60% CS, 10T, 1L)
6.88 DSR vs. CPU utilization (70% CS, 10T, 1L)
6.89 Normalized CPU Energy vs. CPU utilization (Underload, 0.1 WCET, 5T, NL)
6.90 Normalized CPU Energy vs. CPU utilization (Underload, 0.2 WCET, 5T, NL)
6.91 Normalized CPU Energy vs. CPU utilization (Underload, 0.3 WCET, 5T, NL)
6.92 Normalized CPU Energy vs. CPU utilization (Underload, 0.4 WCET, 5T, NL)
6.93 Normalized CPU Energy vs. CPU utilization (Underload, 0.5 WCET, 5T, NL)
6.94 Normalized CPU Energy vs. CPU utilization (Underload, 0.6 WCET, 5T, NL)
6.95 Normalized CPU Energy vs. CPU utilization (Underload, 0.7 WCET, 5T, NL)
6.96 Normalized CPU Energy vs. CPU utilization (Underload, 0.8 WCET, 5T, NL)
6.97 Normalized CPU Energy vs. CPU utilization (Underload, 0.9 WCET, 5T, NL)
6.98 Normalized CPU Energy vs. CPU utilization (Underload, 1.0 WCET, 5T, NL)
6.99 Normalized CPU Energy vs. ACET (Underload, 50% CPU Utilization, 5T, NL)
6.100 Normalized CPU Energy vs. ACET (Underload, 60% CPU Utilization, 5T, NL)
6.101 Normalized CPU Energy vs. ACET (Underload, 70% CPU Utilization, 5T, NL)
6.102 Normalized CPU Energy vs. ACET (Underload, 80% CPU Utilization, 5T, NL)
6.103 Normalized CPU Energy vs. ACET (Underload, 90% CPU Utilization, 5T, NL)
6.104 Normalized CPU Energy vs. ACET (Underload, 100% CPU Utilization, 5T, NL)
6.105 CPU Intensive Workload: System Power vs. CPU Utilization (0.3 WCET, 5T, NL)
6.106 CPU Intensive Workload: System Power vs. CPU Utilization (0.6 WCET, 5T, NL)
6.107 CPU Intensive Workload: System Power vs. CPU Utilization (0.9 WCET, 5T, NL)
6.108 Memory Intensive Workload: System Power vs. CPU Utilization (0.3 WCET, 5T, NL)
6.109 Memory Intensive Workload: System Power vs. CPU Utilization (0.6 WCET, 5T, NL)
6.110 Memory Intensive Workload: System Power vs. CPU Utilization (0.9 WCET, 5T, NL)
6.111 Normalized CPU Energy vs. CPU utilization (20% CS, 10T, 1L)
6.112 Normalized CPU Energy vs. CPU utilization (30% CS, 10T, 1L)
6.113 Normalized CPU Energy vs. CPU utilization (40% CS, 10T, 1L)
6.114 Normalized CPU Energy vs. CPU utilization (50% CS, 10T, 1L)
6.115 Normalized CPU Energy vs. CPU utilization (60% CS, 10T, 1L)
6.116 Normalized CPU Energy vs. CPU utilization (70% CS, 10T, 1L)
6.117 Normalized CPU Energy vs. CPU utilization (80% CS, 10T, 1L)
6.118 Normalized CPU Energy vs. CPU utilization (90% CS, 10T, 1L)
6.119 Normalized CPU Energy vs. CPU utilization (100% CS, 10T, 1L)
6.120 DSR vs. CPU utilization (20% CS, 10T, 1L)
6.121 DSR vs. CPU utilization (30% CS, 10T, 1L)
6.122 DSR vs. CPU utilization (40% CS, 10T, 1L)
6.123 DSR vs. CPU utilization (50% CS, 10T, 1L)
6.124 DSR vs. CPU utilization (60% CS, 10T, 1L)
6.125 DSR vs. CPU utilization (70% CS, 10T, 1L)
6.126 DSR vs. CPU utilization (80% CS, 10T, 1L)
6.127 DSR vs. CPU utilization (90% CS, 10T, 1L)
6.128 DSR vs. CPU utilization (100% CS, 10T, 1L)

List of Algorithms

1 Static-EDF
2 CC-EDF
3 Defer Function
4 LA-EDF
5 Snowdon-min
6 DRA
7 DRA-OTE
8 AGR1
9 AGR1 (contd)
10 Defer EUA Function
11 EUA
12 HS
13 DS
14 USFI EDF
15 REUA

List of Tables

2.1 Actual Implementation of DVFS algorithms
3.1 Sample 3 task task-set
3.2 Sample 2 task task-set
3.3 Slack Estimation Techniques used in the Algorithms
5.1 P-States in Intel I5 processor
7.1 Normalized Energy savings of RT-DVFS algorithms
7.2 Actual Energy savings of RT-DVFS algorithms

Chapter 1

Introduction

Energy is an important resource for many computer systems. A significant amount of research has been devoted to computer system power management at various levels of abstraction. For example, clock gating [49] and low-power flip-flops [32] are hardware-level techniques that reduce active power during normal operation and reduce leakage power during sleep mode. Dynamic Voltage and Frequency Scaling (DVFS) [26, 51] and Dynamic Power Management (DPM) [47, 39] are example techniques that optimize power consumption at the operating-system level. While DVFS involves adjusting the CPU voltage and frequency dynamically to reduce power consumption, DPM involves transitioning devices, including the CPU, into low-power/sleep states. Compiler-level power management techniques [27] include optimizing the code to reduce its execution time and memory accesses to save power. Application-level power management has also been studied. Examples include doing DVFS from the user space [36], and sending information about the beginning and end of tasks from the user space to the OS, so that the OS can do DVFS while maintaining soft real-time guarantees [54].

In this thesis, we focus on DVFS, which is based on the following idea. In most modern CMOS-based processors, the maximum frequency of operation depends on the supply voltage and is given by:

$$f = k \times \frac{(V_{dd} - V_t)^2}{V_{dd}} \qquad (1.1)$$

Here, $V_{dd}$ is the supply voltage, $f$ is the clock frequency, $V_t$ is the threshold voltage [59], and $k$ is a constant. Equation 1.1 can be rewritten [19] as:

$$f = a \times V_{dd} \qquad (1.2)$$

where $a$ is a constant. Thus, frequency has a linear relation with the supply voltage.

When the CPU operates at a frequency $f$, its active power consumption, denoted $P_{active}$, is given by:

$$P_{active} = C_{ef} \times V_{dd}^2 \times f \qquad (1.3)$$

Here $C_{ef}$ is the effective switching capacitance, $V_{dd}$ is the supply voltage, and $f$ is the clock frequency. Substituting Equation 1.2 into Equation 1.3, we get:

$$P_{active} = \frac{C_{ef}}{a^2} \times f^3 \qquad (1.4)$$

which, in turn, is equivalent to:

$$P_{active} = S_3 \times f^3 \qquad (1.5)$$

where $S_3$ is a constant. Thus, the dynamic power consumption of a CMOS processor is directly proportional to the cube of the CPU frequency [24].

Real-Time DVFS (RT-DVFS) is a branch of DVFS, which involves reducing the CPU energy consumption by scaling the CPU frequency while, at the same time, ensuring that task time constraints are satisfied. Broadly, RT-DVFS techniques have two objectives: (i) reduce energy consumption through DVFS; and (ii) optimize task timeliness behavior through real-time resource management (i.e., real-time scheduling and synchronization). These two objectives may conflict, because reducing the frequency may increase task execution times, which is antagonistic to timeliness optimization. Most RT-DVFS scheduling algorithms consider satisfaction of time constraints as a "hard constraint." They often utilize task slack times (i.e., the time between task deadlines and worst-case task execution times) to reduce energy consumption, to the extent possible [44, 15, 54, 33].

Figure 1.1: Idle Time Example

Figure 1.1 shows a task T1 with a deadline of 4 ms and a worst-case execution time (WCET) of 2 ms at the maximum frequency fmax (on a frequency-scalable processor). If T1 runs at fmax, it completes after 2 ms and the CPU is idle for the remaining 2 ms, as shown in Case 1. The 2 ms slack time can be exploited to reduce energy consumption by running T1 at the reduced frequency fmax/2, which completes the task after 4 ms. The task deadline is still met, and energy is saved: under the cubic model of Equation 1.5, the energy consumed drops to $((0.5^3) \times 4)/((1^3) \times 2) = 25\%$ of the original, i.e., a 75% saving. The utilization of task slack time to minimize idle time is the fundamental idea used by the vast majority of RT-DVFS algorithms [44, 15, 54, 33].

There are two types of slack time exploited by RT-DVFS algorithms:

(i) Static Slack: This is the idle time due to the low CPU demand of the application. When the total CPU demand of the application¹ is less than 100%, there exist time intervals during which the CPU idles. This is the static slack, which RT-DVFS algorithms can utilize to scale the CPU frequency.

(ii) Dynamic Slack: This is the slack time that becomes available when a task completes earlier than its predicted worst-case execution time (WCET), as shown in Figure 1.2. This slack can only be obtained when the task completes (i.e., at run time), in contrast to the static slack, which is known off-line, as task periods and WCETs are presumed to be known off-line for many real-time applications. Once the slack becomes available, it can be distributed to the other eligible tasks by scaling their frequency, increasing their time budgets, and thus saving energy.

¹For a periodic real-time application, this is the ratio of the worst-case task execution time to the task period, aggregated over all tasks.

Figure 1.2: Dynamic Slack Example

Most RT-DVFS algorithms differ in the techniques they use to estimate and utilize the static and dynamic slacks.
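To make the arithmetic of the Figure 1.1 example concrete, the following minimal C sketch (illustrative only; not code from the thesis or from ChronOS) evaluates the energy of one task period under the cubic model of Equation 1.5. The frequency-independent active-power term p_indep and the idle power p_idle are assumed constants, added to preview the effect measured later in the thesis: once they are non-negligible, running at fmax and idling can consume less than scaling down.

/* Illustrative sketch: energy of the Figure 1.1 task over one 4 ms period
 * under P_active(f) = p_indep + S3*f^3, plus idle power for the slack.
 * Frequencies are normalized to fmax = 1.0; all constants are assumptions. */
#include <stdio.h>

#define S3       1.00  /* cubic-law constant of Equation 1.5 (illustrative) */
#define DEADLINE 4.0   /* ms */
#define WCET     2.0   /* ms at f = fmax */

/* Energy for one period at normalized frequency f: active power while the
 * task runs, idle power for the remaining slack before the deadline. */
static double energy(double f, double p_indep, double p_idle)
{
    double busy = WCET / f;          /* execution time stretches as f drops */
    double idle = DEADLINE - busy;   /* slack left before the deadline      */
    return (p_indep + S3 * f * f * f) * busy + p_idle * idle;
}

int main(void)
{
    /* Pure cubic model: E(fmax/2)/E(fmax) = 0.5/2.0 = 25%. */
    printf("cubic only: E(1.0)=%.2f  E(0.5)=%.2f\n",
           energy(1.0, 0.0, 0.0), energy(0.5, 0.0, 0.0));

    /* With an assumed frequency-independent component and a cheap idle
     * state, finishing early and idling consumes less overall (4.1 < 4.5). */
    printf("with idle : E(1.0)=%.2f  E(0.5)=%.2f\n",
           energy(1.0, 1.0, 0.05), energy(0.5, 1.0, 0.05));
    return 0;
}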

1.1 Limitations of Past Real-Time DVFS Research

A number of RT-DVFS algorithms [13, 31, 46, 17, 61, 21, 28, 60, 34, 35] have been developed in the past two decades. These algorithms have been extensively analyzed and their fundamental properties have been established, e.g., schedulable utilization bounds below which they meet all deadlines, and conditions under which they satisfy energy budgets. These algorithms have also been experimentally studied using simulations [15, 53, 52, 57, 30], where the primary focus is on understanding their normalized real-time and power consumption behaviours (e.g., normalized to no DVFS; normalized to the highest frequency) and relative real-time/power trends. Only a very small fraction of these algorithms have been implemented and evaluated on real hardware platforms [44, 33, 54].

Simulation-based experimental studies have several advantages. First, they provide an effective way to evaluate the performance of an algorithm, especially to understand relative performance trends. Second, they are highly scalable and repeatable: a vast number of experimental settings can be used to evaluate algorithm behaviours, and deterministically repeated, all programmatically. Additionally, they hide platform-specific issues behind an abstract (often, discrete-event) simulation model, which significantly reduces development time.

However, simulation-based studies, especially those for RT-DVFS, have drawbacks. Unlike the simulators used in the computer architecture community [8, 12, 11, 10], there do not exist OS/platform simulators that have been rigorously evaluated and accepted by the (RT-DVFS) community as a whole. This has resulted in researchers developing "home grown" simulators with many built-in assumptions (e.g., idle states of CPUs, number of idle states, transition overheads) that are not easy to verify against any particular hardware. Since the power savings of RT-DVFS are highly platform- (and application-) dependent, this can potentially lead to incorrect conclusions. To illustrate this, we note that many RT-DVFS works that have relied on simulation-based experimental studies have made the following assumptions:

(i) A CPU has a continuous range of frequencies [15, 53, 54]. However, actual CPUs do not have a continuous range, but discrete frequency steps. Some processors have a rich frequency set (e.g., 10 steps), such as the Intel I5, and some have a smaller set (e.g., the AMD Zacate has 3 steps; the Via C7 has two). Modelling a CPU with a continuous range of frequencies can give highly optimistic results, which may not hold on actual hardware.

(ii) The CPU's idle state power consumption is negligible [15, 53, 54]. These works only consider the CPU's dynamic (or active) power consumption, which is assumed to be directly proportional to the cube of the frequency (as illustrated by Equation 1.5). In contrast, the CPUs of most modern hardware have idle states (C states) and performance states (P states), and the power consumed by the CPU is the summation of the power consumed in both these states [1]. Abstracting away this detail can lead to significantly optimistic (and sometimes erroneous) power savings, as this thesis shows (see Chapter 6).

(iii) Non-CPU power consumption is insignificant. In most of the RT-DVFS algorithms [15, 44, 54], only the CPU's power consumption is considered and the system-level power consumption is ignored. The interaction between the CPU and other components (e.g., memory, bus, disk) is often ignored. This again can lead to erroneous power savings, as non-CPU devices' power consumption is DVFS-independent. Consequently, the overall power savings depend on the power profile of such devices and on application workload characteristics; e.g., DVFS may prolong the CPU active state, which may potentially increase memory power consumption. In [14], Aydin et al. have shown that below a particular speed, DVFS has a negative impact on the system-level energy consumption, and in [48], Snowdon et al. show that the optimal voltage and frequency setting depends on the system characteristics as well as on the application.

In the past, a small subset of RT-DVFS algorithms have been implemented and evaluated on real hardware platforms. Grunwald et al. [26] did an actual implementation of the DVFS algorithms developed by Weiser [51] in the Linux kernel, running on the Itsy pocket computer. Their main aim was to understand whether the energy-saving claims made by these algorithms on simulated platforms hold on real platforms as well. They measured the system power consumption and concluded that these algorithms did not give significant energy savings on real hardware platforms. However, these DVFS algorithms were not real-time, and Grunwald et al. focused on system-level power measurement for schedulers that aim at reducing CPU power consumption.

Pillai and Shin [44] implemented their RT-DVFS schedulers on a laptop and measured the power consumption. Although their algorithms aimed at reducing CPU power and not system power, they presented the entire system power consumption of the laptop. In this thesis, we present an accurate method to measure the exact CPU power consumption. For this, we have used the CPUpower tool [5], which measures the CPU power consumption as the sum of the power consumed in the active and the idle states.

Snowdon et al. [33] designed and implemented an RT-DVFS scheduler in the OKL4 microkernel on the Gumstix platform. Even though they claim to have measured the system power, they have not reported any results on energy measurements in [33]. This makes it difficult to evaluate the performance of their scheduler.

Yuan et al. [54] designed and implemented a statistical DVFS scheduler for multimedia applications called Grace-OS. However, they have not done real power measurements. Like the simulation-based evaluations of DVFS algorithms, they have assumed that CPU power is proportional to the cube of the frequency and have done the CPU energy calculations likewise. We show that this is an inaccurate way of measuring CPU power. The actual CPU power consumption is very different and depends on the energy consumed in the active as well as the idle state.
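To state the piece that assumption (ii) discards explicitly, the total CPU energy over a measurement interval decomposes into P-state and C-state contributions (a summary formulation, with notation introduced here rather than quoted from the thesis):

$$E_{CPU} \;=\; \sum_{i} P_{active}(f_i)\, t_{active}(f_i) \;+\; \sum_{j} P_{idle,j}\, t_{idle,j}$$

where $t_{active}(f_i)$ is the time the CPU spends executing at frequency $f_i$, and $t_{idle,j}$ is the residency in idle state $C_j$. A simulation that keeps only the first term, with $P_{active} \propto f^3$, discards exactly the trade-off that this thesis measures: lowering $f_i$ shrinks the first term but also shrinks the idle residencies available to the second.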

1.2 Research Contributions

In this thesis, we study the timeliness and power consumption behaviours of fourteen RT-DVFS schedulers through implementation and actual measurements. The schedulers include Static Earliest Deadline First (Static-EDF), Cycle-Conserving Earliest Deadline First (CC-EDF), Look-Ahead Earliest Deadline First (LA-EDF), Snowdon-minimum (Snowdon-min), the Resource-constrained Energy-Efficient Utility Accrual Algorithm (REUA), the Dynamic Reclaiming Algorithm (DRA), and the Aggressive Speed Reduction Algorithm (AGR), among others. Static-EDF utilizes static slack, whereas CC-EDF, LA-EDF, DRA and AGR utilize dynamic slack as well. LA-EDF, AGR and DRA are more aggressive than the other algorithms, as they try to reduce the frequency as much as possible while still satisfying task time constraints.

We implemented the schedulers in a Linux-based real-time kernel called ChronOS [4], and measured their real-time/power behaviours on two representative hardware platforms: an ASUS laptop with the Intel I5 processor and a motherboard with the AMD Zacate processor. We used a synthetic application (similar to the studies in [22]) that allowed us to generate a broad range of workload conditions, including CPU-intensive, memory-intensive, mutual exclusion lock-intensive, and processor-underloaded and overloaded workloads. We measured the actual CPU power by accounting for the power consumption in the active and idle states, and also the system power, using a multimeter. We also measured the normalized CPU energy consumption², where CPU power is considered to be proportional to the cube of the frequency, so as to compare with the simulated results of the algorithms.

We draw the following conclusions from our experimental study:

(1) Our studies reveal that measuring the CPU power consumption as the cube of the CPU frequency can lead to incorrect conclusions. In particular, it ignores the idle state CPU power consumption, which is orders of magnitude smaller than the active power consumption. Consequently, power savings obtained by exclusively optimizing active power consumption (i.e., RT-DVFS) may be offset by completing tasks sooner by running them at the highest frequency and transitioning to the idle state earlier (i.e., no DVFS). Thus, the active power consumption savings of the RT-DVFS techniques that we report are orders of magnitude smaller than their simulation-based savings reported in the literature.

(2) We observed that algorithms such as Static-EDF, CC-EDF, and Snowdon-min, which have been reported to outperform Base-EDF (on power savings), do not actually do so. These algorithms outperform Base-EDF on normalized CPU energy. However, they consume only slightly less CPU power and system power than Base-EDF, or in some cases, even more.

(3) Aggressive energy-saving algorithms such as LA-EDF, DRA, DRA-OTE, AGR1, and AGR2 do consume less actual CPU power and system power than Base-EDF. But their energy savings are not as high as reported in past simulation-based studies.

(4) We also observe that, under overloads, aggressive algorithms such as AGR1, AGR2 and LA-EDF save the most power.

(5) The lock-based algorithms consume almost the same CPU power, with REUA performing slightly better.

(6) The energy savings of an RT-DVFS algorithm are highly dependent on the number of frequency steps available on the processor. The energy savings obtained on the Intel I5 platform with 10 frequency steps were much higher than those obtained on the AMD Zacate platform with 3 frequency steps.

To the best of our knowledge, this is the first implementation and actual real-time/power measurement-based experimental study of the aforementioned RT-DVFS schedulers, and this constitutes the thesis contribution.

²Normalized CPU energy is the same as normalized CPU power, as the time factor cancels in both the numerator and the denominator.
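As an illustration of how such a normalized metric can be computed, the sketch below gives one plausible formalization (ours, not the thesis instrumentation): per-frequency busy-time residencies are weighted by f³ and divided by the cost of spending the same total time at fmax, so a value of 1.0 means "no better than never scaling". The residency numbers are made up.

/* Illustrative sketch: normalized CPU energy under P = S3*f^3, relative
 * to spending the same total busy time at fmax. The residency data here
 * is invented; the thesis gathers it from the scheduler, not this code. */
#include <stdio.h>

struct residency { double freq_ghz; double time_s; };

static double normalized_energy(const struct residency *r, int n, double fmax)
{
    double e = 0.0, t = 0.0;
    for (int i = 0; i < n; i++) {
        e += r[i].freq_ghz * r[i].freq_ghz * r[i].freq_ghz * r[i].time_s;
        t += r[i].time_s;
    }
    return e / (fmax * fmax * fmax * t);   /* 1.0 == always running at fmax */
}

int main(void)
{
    struct residency run[] = { {2.4, 3.0}, {1.6, 5.0}, {1.2, 2.0} }; /* hypothetical */
    printf("normalized energy = %.3f\n", normalized_energy(run, 3, 2.4));
    return 0;
}

Because the same total time appears in the numerator and the denominator, the value can be read equally as normalized energy or normalized power, which is exactly the equivalence the footnote above notes.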

1.3 Thesis Organization

The rest of the thesis is organized as follows. In Chapter 2, we review past DVFS algorithms that have been implemented on real hardware platforms, as well as those evaluated on simulated platforms, and we compare and contrast our work with the algorithms that overlap with the thesis problem space. In Chapter 3, we describe the RT-DVFS scheduling algorithms which we have implemented and evaluated. In Chapter 4, we describe ChronOS real-time Linux and our DVFS extensions to it. Chapter 5 discusses the experimental methodology and Chapter 6 reports the results. The thesis concludes in Chapter 7.

Chapter 2

Related Work

A significant amount of research has been done in the field of DVFS. A large number of algorithms have been designed, which can be broadly classified into three categories. This classification was made by Yuan et al. in [55] as follows:

(i) Real-Time DVFS (RT-DVFS): RT-DVFS algorithms are designed for real-time systems and aim at saving energy while maintaining hard real-time constraints. They scale the CPU frequency based on the worst-case execution times of the real-time application. Most RT-DVFS algorithms differ in their techniques to utilize the static slack, available due to the low CPU utilization of the application, or the dynamic slack, available when the actual execution time is much less than the worst-case execution time.

(ii) General-Purpose DVFS (GP-DVFS): GP-DVFS algorithms are designed for general-purpose systems, aim at saving energy without degrading performance, and are mostly suited for best-effort applications. This class of algorithms scales the CPU frequency based on workload prediction, which in turn depends on the average CPU utilization. GP-DVFS algorithms are based on two techniques: prediction and speed setting. Prediction involves predicting the future workload, while speed setting involves deciding the speed at which to run. These are interval-based schedulers, where the prediction and speed-setting decisions are made for every interval so as to minimise the idle time in that interval.

(iii) Statistical DVFS: Neither GP-DVFS nor RT-DVFS algorithms are well suited for multimedia applications, as their run-time demand varies. The interval-based approach of GP-DVFS algorithms may not satisfy the timing constraints, whereas the worst-case approach of RT-DVFS algorithms might not be as energy efficient. Thus, for multimedia applications, statistical DVFS algorithms have been designed to deal with run-time demand variations. The core idea of statistical DVFS is to change the CPU speed based on the demand distributions of the applications. These algorithms involve either online or offline profiling of the applications to obtain the probability distribution of their cycle demand, and make scheduling and frequency-scaling decisions based on that.


In the rest of the chapter, we review the DVFS algorithms in the above three categories that have been implemented on real hardware platforms, as well as those evaluated on simulated platforms. Our problem space overlaps with the algorithms that have actually been implemented, so we compare and contrast our work only with them; we also review the simulation-only algorithms for completeness.

2.1 GP-DVFS

The concept of DVFS was introduced by Weiser in [51]. He devised three interval-based schedulers, PAST, FUTURE and OPT, which differ in their prediction technique. In PAST, the workload of the current interval is assumed to be the same as that of the previous interval; in OPT, the exact workload is known for the entire duration; and in FUTURE, the workload is known for a small interval in the future. As can be understood, the OPT and FUTURE algorithms are impractical, as the future workload can never be predicted exactly. Nevertheless, this ground-breaking paper led the way for the multitude of DVFS research which took place in the coming decades. They evaluated these algorithms by analysing traces collected by running applications on UNIX-based workstations.

Govil et al. [23] developed a few more dynamic clock scaling algorithms with different prediction and speed-setting techniques. They developed six new algorithms, each of which employs a different technique to predict the future workload. For example, in FLAT, prediction is based on the global average of the computational load; in LONG SHORT, it is based on the mean of the global and local averages of the workload; in AGED AVERAGES, a weighted average is considered; in CYCLE, the cyclic behaviour of the CPU utilization is considered; in PATTERN, prediction is done on the basis of pattern matching against a previous occurrence; whereas in PEAK, prediction is based on the expected value of the utilization, considering narrow peaks in its pattern. The speed-setting policy involves reducing the frequency when the utilization is low and the CPU is idle, and increasing it when the utilization is high and the CPU is active. They used the same traces as Weiser to evaluate their algorithms and found that PEAK performed the best.

Both Weiser and Govil considered only the CPU power consumption and its linear relationship with the CPU frequency. On the contrary, Martin in [40] studied the relation between system-level performance and CPU frequency scaling. He took into account the non-linear relation between the total system power and the CPU frequency, the non-ideal properties of batteries, and the memory bandwidth to evaluate the performance of the clock scaling algorithms. He modified Weiser's PAST algorithm, taking into account the above-mentioned non-ideal characteristics, and evaluated it using the same traces as Weiser did. He concluded that reducing energy consumption is a system-level problem, and that all the above-mentioned factors should be considered when devising a clock scaling algorithm.
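A minimal sketch of the interval-based scheme these works share, using PAST's prediction rule (the coming interval is assumed to look like the last one). The thresholds, step sizes and speed bounds below are assumptions for illustration, not Weiser's exact constants.

/* Illustrative PAST-style interval governor: predict the next interval's
 * utilization from the last one, then nudge the speed up or down.
 * Speeds are normalized to fmax = 1.0; all constants are assumptions. */
#include <stdio.h>

static double past_next_speed(double speed, double last_util)
{
    double predicted = last_util;          /* PAST: future == immediate past */

    if (predicted > 0.7)                   /* interval was busy: speed up    */
        speed += 0.2;
    else if (predicted < 0.5)              /* mostly idle: slow down, more   */
        speed -= (0.7 - predicted) * 0.2;  /* aggressively when more idle    */

    if (speed > 1.0) speed = 1.0;          /* clamp to the available range   */
    if (speed < 0.2) speed = 0.2;
    return speed;
}

int main(void)
{
    double speed = 1.0;
    double utils[] = { 0.9, 0.4, 0.3, 0.8 };   /* hypothetical interval loads */

    for (int i = 0; i < 4; i++) {
        speed = past_next_speed(speed, utils[i]);
        printf("interval %d: util=%.1f -> speed=%.2f\n", i, utils[i], speed);
    }
    return 0;
}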


Instead of using post-simulation traces to analyse the performance of the clock scaling algorithms, Pering et al. [43] used simulated systems to do so. They implemented the clock scaling algorithms devised by Govil [23] and Weiser [51] on a simulated platform and evaluated them by running an MPEG decoder with QoS requirements. These algorithms, when simulated, resulted in 46% energy savings.

Grunwald et al. [26] did an actual implementation of interval-based GP-DVFS algorithms in the Linux kernel, running on the Itsy pocket computer with the StrongARM SA-1100 processor. They implemented the algorithms proposed by Weiser [51], such as PAST and variations of AVGn, and evaluated them by running realistic workloads for handheld systems. In these algorithms, CPU scaling decisions are made based on the average utilization, with the goal of minimising the idle time. Their main aim was to find out whether the energy-saving claims made by these algorithms on simulated platforms hold on real platforms as well. They measured the system power consumption and concluded that these algorithms did not give significant energy savings when implemented on real hardware platforms. They also concluded that, in order to make efficient energy management decisions, the operating system has to be aware of the properties of the application. They conjecture that the poor performance of these algorithms is due to the voltage scaling limitations of the platform used, as well as to the poor efficacy of the algorithms themselves. However, this poor performance could also be due to the minimisation of idle time: if the processor consumes very little power in the idle state compared to the active state, then running at a lower frequency for a longer time can consume more energy than running at a higher frequency for a shorter time. This is exactly what we have observed in our study.

In PACE [37], the authors modified the algorithms developed by Weiser and Govil to further reduce their energy consumption without affecting their performance. They showed that, by modifying the way the scaling algorithm schedules tasks, the notion of a deadline can be introduced, and this deadline information can be used by these algorithms to further reduce energy consumption. In PACE, a speed schedule is created where the speed is increased as the deadline approaches. The authors claim that they obtained around 20% more energy savings than the algorithms which they improved. They evaluated their modified algorithms through simulations using real workloads.

In [41], Miyoshi et al. introduced the concept of the critical power slope. They used it as a metric to show that the relation between power and frequency depends on the architecture of the processor, particularly on the relative energy efficiency of the idle state and the active state. They showed that on Pentium-based systems it is more power-efficient to run at the highest frequency, while on PowerPC-based systems it is more power-efficient to run at the lowest frequency. The results of this paper are highly relevant to our work, as they explain why, on a Pentium-class platform, we did not obtain the energy savings that the simulations of the RT-DVFS algorithms claimed.
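The critical power slope argument can be written as a break-even test (the notation here is ours, not Miyoshi et al.'s). For $C$ cycles of work observed over a window of length $C/f$, running at the reduced frequency $f$ beats running at $f_{max}$ and then idling only if

$$P(f)\cdot\frac{C}{f} \;\le\; P(f_{max})\cdot\frac{C}{f_{max}} \;+\; P_{idle}\cdot\left(\frac{C}{f}-\frac{C}{f_{max}}\right)$$

When $P_{idle}$ is small and $P(f)$ retains a large frequency-independent component, as on the Pentium-based systems they measured, the inequality fails and race-to-idle wins; on the PowerPC-based systems, the opposite held.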

2.2 RT-DVFS

A multitude of RT-DVFS algorithms [13, 31, 46, 17, 61, 21, 28, 60, 34, 35] have been developed in the past two decades. We review a subset of them in this section.

Pillai and Shin [44] devised five RT-DVFS algorithms and found that the EDF-based schedulers outperform the RMA-based ones. Among the EDF-based ones, Static-EDF utilizes the static slack, whereas CC-EDF and LA-EDF also reclaim the dynamic slack to scale the CPU frequency. These algorithms were among the earliest to be implemented on a real hardware platform. As suggested by their class, they are designed to reduce energy consumption while ensuring a 100% deadline satisfaction ratio. Through this work, the authors demonstrated that RT-DVFS algorithms can indeed save energy, not only on modelled, near-ideal simulated platforms, but also on real hardware. The platform used by them was a Hewlett-Packard N3350 notebook computer with an AMD K6-2+ processor, which uses PowerNow! technology to scale the voltage and the frequency on the fly. In order to implement these schedulers, they modified the Linux 2.2.16 kernel, particularly its real-time scheduler and task management services.

Based on their work, they drew some interesting conclusions. They conclude that the energy savings of an RT-DVFS algorithm are highly dependent on the voltage and frequency settings available on the hardware, and not much on the number of tasks in a task-set or on the energy efficiency of the idle states of the CPU. Our study yields a different result: we have observed that the performance of an RT-DVFS algorithm is highly dependent on the energy efficiency of the idle states of the processor. The scope of their work is also limited to independent, underloaded, CPU-intensive workloads. In contrast, we have implemented and evaluated schedulers designed for independent underloaded workloads, and for dependent overloaded and underloaded workloads. Additionally, we have implemented these schedulers on two different hardware platforms, and tested them with both CPU-intensive and memory-intensive workloads.

The RT-DVFS algorithms designed in this paper aim at reducing CPU power, not system power. Accordingly, CPU power consumption is the proper metric to evaluate their performance. However, in [44] the authors measured the entire system power. This might lead to an incorrect evaluation of these algorithms, as the system power also includes the power consumed by the memory, I/O devices, and other subsystems. We were able to measure the actual power consumption of the CPU by using the CPUpower tool [5], and thus correctly evaluate and compare the performance of these algorithms.

Aydin et al. in [15] devised DRA and AGR, which are based on priority-based slack stealing [31]. DRA reclaims unused slack based on the actual workload and reduces the CPU frequency; in addition to that, AGR uses average workload information to predict the early completion of future workloads, and thus obtains extra slack to further reduce the CPU energy. However, they evaluated the performance of these algorithms only through simulations. In their simulations, they modelled an ideal CPU having a continuous range of frequencies and considered CPU power to be proportional to the cube of the average frequency. Due to these limitations, the actual energy savings obtained by implementing them on a real hardware platform do not match the savings obtained on simulated platforms.

In [52], Wu et al. developed a utility accrual RT-DVFS algorithm suited for underloaded as well as overloaded, non-dependent, CPU-intensive applications subjected to TUF time constraints. This algorithm aims at accruing maximum utility per unit of energy consumed and at reducing the total system power. During underloads it accrues the maximum utility and scales the frequency in a similar way to LA-EDF [44], whereas in overloads it accrues the maximum utility possible by running the tasks at the maximum speed. In [53], they extended their work to include applications that share resources and are subjected to mutual exclusion constraints. REUA defaults to EUA when subjected to non-dependent workloads. They used the priority inheritance protocol to resolve deadlocks and bound the blocking time. This work is also based on simulations.

In [57], Zhang et al. developed RT-DVFS algorithms for task-sets with non-preemptible blocking sections. They used the Stack Resource Policy (SRP) [16] to bound the blocking time. In the static-speed algorithm High Speed (HS), a constant speed is selected based on the static slack available and on the feasibility of the task-set scheduled with EDF under SRP. The Dual Speed algorithm (DS) extends the high-speed algorithm by operating at the high speed as well as at a low speed, whenever possible. In DSDR, the dynamic slack is reclaimed and redistributed to further reduce the energy. They also simulated these algorithms and showed that the dynamic algorithms can save 80% more energy than the static ones.

In [30], Rajkumar et al. devised RT-DVFS algorithms to support synchronization of tasks for access to shared resources. These algorithms are based on the computation of slowdown factors for the tasks that need to synchronise. Any resource access protocol, such as SRP, PCP or PIP, can be used to bound the blocking time. The algorithm involves frequency inheritance, where a low-priority job can inherit the frequency of the highest-priority job that it has blocked. They performed simulations and reported 25% energy gains over other slowdown techniques.

Snowdon et al. [33] designed and implemented a DVFS algorithm aiming at reducing not only the CPU power consumption, but also the power consumption of other subsystems, particularly the memory. Their work is based on the premise that, for efficient energy management, the operating system should have knowledge of the properties of the platform as well as of the application. Accordingly, they developed a time-energy model, which uses this knowledge to calculate the time and the energy required by a piece of software at a particular CPU, memory and bus frequency. They integrated this time-energy model with the real-time dynamic-slack-based scheduler RBED, designed by Brandt et al. [18]. This scheduler uses the time-energy model to select a feasible set of frequencies that minimizes the total energy consumption while maintaining hard real-time constraints. They implemented this scheduler inside the OKL4 microkernel on the Gumstix platform with the XScale PXA255 processor. They measured the system power consumption, which is appropriate in their case, as they aim to reduce the entire system power.

One important conclusion from their work is that implementing their scheduler in a microkernel is not the best approach. This is because the performance of a microkernel depends on the small overhead of context switches, and a DVFS scheduler introduces large overheads due to the complexity of the algorithm. Even though theirs is one of the earliest efforts to do a real implementation of a DVFS scheduler that takes the memory power consumption into account along with the CPU power consumption, they have not reported any results on energy measurements or real-time measurements. This makes it difficult to evaluate the performance of their scheduler. Another limitation of their work is that they implemented their algorithms on the Gumstix platform, which does not support voltage scaling. The energy savings of a DVFS algorithm are highly dependent on the scaling of the voltage; frequency scaling just by itself cannot reduce the energy.

We now review some of the very recent works in the RT-DVFS space. In [17], Bini et al. have considered the discrete speed levels of the CPU, as well as the switching overhead, to devise their RT-DVFS algorithm. In [60] and [61], Zhuo et al. have modelled a system consisting of a DVFS processor and other devices, and considered the reduction of both the CPU power and the device power. In [21], the authors have taken into account the non-linear relationship between frequency and system power, leakage power, and intra-task overheads. In [58], Zhang et al. have developed a procrastination-based DVFS algorithm in which they use stochastic workloads. In [35] and [34], the authors have created RT-DVFS algorithms for energy-harvesting systems. In [56], similarly to Snowdon's work [33], Yun et al. have based their work on saving the energy of various subsystems by scaling the frequency of the CPU, the system bus and the memory. In [28], Hung et al. have also considered systems with DVFS and non-DVFS elements. In [50], Wang et al. have devised a preemptive DVFS technique, where the frequency is rescaled on every preemption. They have shown that it can save 24% more energy than inter-task DVFS algorithms.
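As a concrete instance of the static-slack idea that several of the works above build on, the following C sketch selects a frequency in the spirit of Pillai and Shin's Static-EDF: the lowest available discrete frequency f such that the EDF utilization test still passes at that speed (U <= f/fmax). The task-set and frequency table are hypothetical, and a real implementation must also account for switching overheads.

/* Sketch of static frequency selection in the spirit of Pillai and Shin's
 * Static-EDF: choose the lowest discrete frequency f such that the EDF
 * utilization test still passes at that speed, i.e. U <= f/fmax.
 * The task-set and the frequency table below are hypothetical. */
#include <stdio.h>
#include <stddef.h>

struct task { double wcet_ms; double period_ms; };

static double static_edf_freq(const struct task *ts, size_t n,
                              const double *freqs_ghz, size_t nf)
{
    double u = 0.0, fmax = freqs_ghz[nf - 1];   /* freqs sorted ascending */

    for (size_t i = 0; i < n; i++)
        u += ts[i].wcet_ms / ts[i].period_ms;   /* EDF utilization        */

    for (size_t i = 0; i < nf; i++)
        if (u <= freqs_ghz[i] / fmax)           /* lowest speed that fits */
            return freqs_ghz[i];
    return fmax;                                /* U > 1: run flat out    */
}

int main(void)
{
    struct task ts[] = { {1.0, 8.0}, {2.0, 10.0}, {1.0, 14.0} }; /* U ~ 0.40 */
    double freqs[] = { 0.8, 1.2, 1.6, 2.4 };                     /* GHz      */

    printf("selected: %.1f GHz\n", static_edf_freq(ts, 3, freqs, 4));
    return 0;
}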

2.3

Statistical DVFS

Yuan et al. [54] designed and implemented a statistical DVFS scheduler for multimedia applications called Grace-OS. This algorithm copes with the dynamic cycle demands of a multimedia application, providing soft real-time guarantees while saving energy. They implemented this scheduler in the Linux kernel 2.4.18 and evaluated it on an HP Pavilion N5470 laptop with an AMD Athlon processor. The authors report energy savings of 7% to 72% as compared to deterministic scheduling.


Even though they implemented their algorithm on a real hardware platform, they did not perform real power measurements. Like the other simulation-based evaluations of DVFS algorithms, they assumed that CPU power is proportional to the cube of the frequency and computed the CPU energy accordingly. On the contrary, our work shows that this is an inaccurate way of measuring CPU power: the actual CPU power consumption is very different and depends on the energy consumed in the active as well as the idle state.

In [45], Pouwelse et al. implemented a power-aware video decoder on a LART device [29] with the StrongARM SA-1100 processor. They modified the decoder such that it can change the frequency from user space depending on the decoding time of the frame, and showed greater energy savings compared to fixed-frequency and interval-based schedulers. Choi et al. [20] designed and implemented a DVFS algorithm for MPEG decoders, which aims at maintaining the quality of service while reducing energy. They compute the workload required to decode a frame and scale the voltage and frequency accordingly. They also implemented their algorithm on a StrongARM-1110-based evaluation board.

In [55], Yuan et al. implemented another statistical DVFS algorithm for multimedia applications. This algorithm, however, was devised considering the characteristics of a non-ideal CPU, such as discrete speed levels and the energy consumed by the CPU in the idle state. It aims to minimize the system energy consumption rather than that of the CPU only. The algorithm, known as PDVS, was implemented as a part of Grace-OS and evaluated on an HP Pavilion N5470 laptop with an Athlon CPU running the Linux kernel. The authors claim that PDVS reduces the total energy consumption of the laptop by 15%-38% as compared to non-DVFS algorithms, and by around 10% compared to other DVFS algorithms which assume an ideal CPU. In [25], the authors developed a statistical DVFS algorithm which uses stochastic data to make CPU scaling decisions. It saves energy while meeting all deadlines, and is thus well suited for hard real-time systems. However, its performance was evaluated only through simulations.

2.4

Summary

In Table 2.1 we list the DVFS algorithms that have been implemented on real hardware platforms, as these are the ones most relevant to the scope of our work.


Table 2.1: Actual implementations of DVFS algorithms

Name                            | Class            | Task Model    | Workload                 | Platform                                                                   | System/CPU Power
Static-EDF, CC-EDF, LA-EDF [44] | RT-DVFS          | No Dependency | CPU Intensive            | Hewlett-Packard N3350 notebook computer with an AMD K6-2+ processor        | CPU Power
RBED-DVFS [33]                  | RT-DVFS          | No Dependency | CPU and Memory Intensive | OKL4 microkernel on the Gumstix platform with the XScale PXA255 processor  | System Power
PAST, AVGn [26]                 | GP-DVFS          | No Dependency | CPU Intensive            | Itsy pocket computer with the StrongARM SA-1100 processor                  | CPU Power
GRACE-OS [54]                   | Statistical-DVFS | No Dependency | Multimedia Applications  | HP Pavilion N5470 laptop with an AMD Athlon processor                      | CPU Power
Power-aware video decoder [45]  | Statistical-DVFS | No Dependency | Multimedia Applications  | LART device [29] with the StrongARM SA-1100 processor                      | CPU Power
PDVS [55]                       | Statistical-DVFS | No Dependency | Multimedia Applications  | HP Pavilion N5470 laptop with an AMD Athlon processor                      | CPU Power

Chapter 3

Algorithms

In this thesis, we evaluate and compare the performance of 14 RT-DVFS schedulers on two representative hardware platforms. In this chapter we describe these schedulers in detail. They can be classified into the following categories: (i) schedulers for independent underloaded task-sets, (ii) schedulers for dependent underloaded task-sets, and (iii) schedulers for overloaded task-sets.

All RT-DVFS schedulers have to make two decisions: (i) which task to run, and (ii) which frequency to run it at. Generally, all RT-DVFS schedulers are EDF-based, i.e., they schedule the task with the earliest deadline. They differ from each other in the way they estimate the slack [31] used to scale the frequency. Two kinds of slack are available: static slack, which exists due to a characteristic of the task-set itself, such as less than 100% CPU utilization, and dynamic slack, which arises due to variations in the execution time. The corresponding slack estimation techniques can be classified further. This classification was made by Kim et al. in [31] as follows:

(i) Static Slack Estimation:
(a) Maximum Constant Speed: The utilization of the task-set is used to decide a single static speed for all the tasks in the task-set, such that the task-set remains feasible under EDF.

(ii) Dynamic Slack Estimation: Dynamic slack arises due to variations in the execution times of the tasks in the task-set at run time.
(a) Stretching to NTA: The maximum constant speed is decided on the basis of the WCETs of the tasks in the task-set. When the actual execution time of a task is much smaller than its WCET, dynamic slack arises. One way to use this slack is to stretch the execution over the time until the arrival of the next task, and scale the frequency accordingly. The arrival time of the next task is called the NTA.
(b) Priority-based slack stealing: In this technique, the dynamic slack obtained due to the


early execution of a higher-priority task is transferred to the lower-priority task that is to be executed next.
(c) Utilization Updating: The worst-case utilization of the processor is updated taking into consideration the actual execution times of the tasks that have completed. This updated utilization value is then used for scaling the CPU frequency.

Table 3.3 summarizes the techniques used in the fourteen RT-DVFS algorithms described below.

Example task-set

We will describe the algorithms with the help of an example task-set. Consider a periodic task-set with three tasks T1, T2 and T3, whose characteristics are given in Table 3.1.

Table 3.1: Sample 3-task task-set

Task | WCET | Actual Time | Period
T1   | 2    | 1.6         | 5
T2   | 1    | 0.8         | 5
T3   | 3    | 2.4         | 15

3.1 Schedulers for Independent Underloaded task-sets

3.1.1 Base-EDF

This is the basic EDF scheduler from [44]; it does not involve any frequency scaling and operates at the maximum frequency. The task with the earliest deadline is scheduled, and all tasks run at the maximum frequency, which in the case of the I5 is 2400 MHz. We have included this scheduler in our experiments for comparison. Figure 3.1 shows the schedule of the example task-set under this algorithm. (We will refer to the task with the earliest deadline as Tbest in all the remaining algorithms, unless specified otherwise.)


Figure 3.1: Base-EDF

Algorithm 1: Static-EDF
1: WCET: C1, C2, ..., CN
2: Period: P1, P2, ..., PN
3: Frequencies: f1, f2, ..., fm
5: Utilization = C1/P1 + C2/P2 + ... + CN/PN
6: k = fm * Utilization
7: Select f ∈ {f1, f2, ..., fm} such that f ≥ k

3.1.2 Static-EDF

This is the Static-EDF scheduler from [44]. It uses the static slack estimation technique to scale the CPU frequency: the frequency is scaled based on the static utilization value of the task-set, as can be seen from Algorithm 1. All the tasks in the task-set run at the same frequency, chosen such that the utilization of the processor at the scaled frequency becomes 1. The algorithm thus ensures that no deadlines are missed, as the utilization is still less than or equal to 1, and it aims to minimize the idle time, as can be seen from the schedule in Figure 3.2. From Algorithm 1, the new frequency is decided by scaling the maximum frequency by the utilization value of the task-set in line 6. As a non-ideal processor has only discrete frequencies, the smallest frequency greater than or equal to k is selected to run the task.
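To make the discrete-frequency selection concrete, the following is a minimal user-space C sketch of the Static-EDF frequency decision; the frequency table values are illustrative, not the actual I5 or Zacate tables.

/* Minimal sketch of Static-EDF frequency selection; the frequency table
 * is illustrative, not an actual platform's table. */
#include <stdio.h>

static const unsigned int freq_table[] =        /* kHz, ascending */
    { 800000, 1200000, 1600000, 2000000, 2400000 };
static const int num_freqs = 5;

/* Pick the lowest discrete frequency f with f >= k = f_max * U. */
static unsigned int static_edf_freq(double utilization)
{
    double k = freq_table[num_freqs - 1] * utilization;
    for (int i = 0; i < num_freqs; i++)
        if (freq_table[i] >= k)
            return freq_table[i];
    return freq_table[num_freqs - 1];
}

int main(void)
{
    /* Example task-set of Table 3.1: U = 2/5 + 1/5 + 3/15 = 0.8 */
    printf("selected: %u kHz\n", static_edf_freq(0.8));
    return 0;
}

For the example task-set, k = 2,400,000 * 0.8 = 1,920,000 kHz, so the 2,000,000 kHz level would be selected from this hypothetical table.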


Figure 3.2: Static-EDF

3.1.3 CC-EDF

Cycle-Conserving EDF (CC-EDF) utilizes the dynamic slack to scale the CPU frequency. When a task is released, a conservative approach is taken: it is assumed that the task will take its WCET to execute, and the frequency is set accordingly. On completion, if the actual execution time is smaller, the unused cycles can be transferred to the remaining tasks; as the remaining tasks now get more cycles than they require, the frequency can be scaled down. The working of the algorithm is illustrated in Figure 3.3. Algorithm 2 gives the pseudo-code of CC-EDF. When the task Ti begins, we take the conservative approach, assume it will take its WCET to execute, set Ui to Ci/Pi in line 9, and compute the total utilization using this Ui in line 12. This total utilization value is then used to decide the frequency in line 22. When the task completes its current invocation, we compute task Ti's load during that invocation using its actual execution time: if the actual execution time is ACi, then Uload = ACi/Pi. As Ti does not take more than ACi to execute, ACi can now be considered the WCET of Ti, and the utilization is recomputed using Uload in line 19. ACi ≤ Ci implies Uload ≤ Ui, so the total U decreases and the frequency is scaled down.
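The utilization bookkeeping can be sketched as follows, a simplified C rendering of Algorithm 2 with illustrative field and variable names rather than the ChronOS ones.

/* Simplified C rendering of CC-EDF's utilization updates (Algorithm 2);
 * field and variable names are illustrative, not the ChronOS ones. */
struct cc_task {
    double wcet, acet, period;  /* worst/actual execution time, period */
    double u_i;                 /* utilization currently charged */
};

static double total_u;          /* running total utilization */

/* On release: conservatively charge the full WCET. Returns k = f_max * U;
 * the scheduler then picks the lowest discrete frequency >= k. */
double cc_edf_on_release(struct cc_task *t, double f_max)
{
    total_u += t->wcet / t->period - t->u_i;
    t->u_i = t->wcet / t->period;
    return f_max * total_u;
}

/* On completion: recharge with the actual execution time, freeing cycles. */
double cc_edf_on_complete(struct cc_task *t, double f_max)
{
    double u_load = t->acet / t->period;
    total_u += u_load - t->u_i;
    t->u_i = u_load;
    return f_max * total_u;
}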


Algorithm 2: CC-EDF
1: WCET: C1, C2, ..., CN
2: Period: P1, P2, ..., PN
3: Frequencies: f1, f2, ..., fm
4: U = C1/P1 + C2/P2 + ... + CN/PN
6: if Task Ti begins then
7:     if (Uprev == 0) then
8:         Uprev = U
9:     Ui = Ci/Pi
10:    if (Uiprev == 0) then
11:        Uiprev = Ui
12:    U = Uprev − Uiprev + Ui
13:    Uprev = U
15: if Task Ti ends then
16:    if (Uprev == 0) then
17:        Uprev = U
18:    Uload = ACi/Pi
19:    U = Uprev − Ui + Uload
20:    Uiprev = Uload
21:    Uprev = U
22: k = fm * U
23: Select f ∈ {f1, f2, ..., fm} such that f ≥ k

3.1.4 LA-EDF

Look-Ahead EDF (LA-EDF) is the most energy-efficient RT-DVFS algorithm from [44]. Like the previous scheduler, it is based on dynamic slack reclaiming and the utilization updating technique. It is an aggressive algorithm which tries to run at the lowest frequency possible: it tries to do the minimum work before the earliest deadline by pushing as much work as it can beyond that deadline, while making sure that future deadlines are still met, even if that requires running at a higher frequency later. Algorithm 4 describes the pseudo-code for LA-EDF. Whenever the scheduler is invoked (i.e., on task release or completion), the defer function (described in Algorithm 3) is called. In this function, the time interval until the earliest deadline is considered, and the minimum amount of work that each task has to do in this interval (denoted in time as x in our example), so as to prevent any future deadline misses, is calculated. In the for loop [10-15], tasks are considered in order of decreasing deadlines, and the minimum amount of work required of each task in this interval is calculated


Figure 3.3: CC-EDF

Algorithm 3: Defer Function
1: Procedure: Defer()
2: Input: σT // List of released tasks
3: WCET: C1, C2, ..., CN
4: Period: P1, P2, ..., PN
5: Frequencies: f1, f2, ..., fm
7: U = C1/P1 + C2/P2 + ... + CN/PN
8: s = 0
10: for each task Ti in σT in decreasing deadline order do
11:     U = U − Ci/Pi
12:     x = max(0, Clefti − (1 − U)(Di − Dn))
13:     U = U + (Clefti − x)/(Di − Dn)
14:     s = s + x
16: f = s/(Dn − current_time)

and summed to obtain the total work. Thus s becomes the total amount of minimum work, and Dn − current_time becomes the interval in which that work has to be completed. Using these values, the frequency is calculated in line 16. The schedule for the example task-set is shown in Figure 3.4. It can be seen that the tasks run at a higher frequency in the later interval to prevent deadline misses.
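A plain C sketch of the defer computation might look as follows; the task array and field names are illustrative, and the division-by-zero guard for the earliest-deadline task (whose Di equals Dn) is made explicit.

/* Sketch of LA-EDF's Defer() (Algorithm 3). tasks[] must be sorted by
 * DECREASING deadline, so tasks[n-1] holds the earliest deadline Dn.
 * Field and variable names are illustrative. */
struct la_task { double wcet, period, cleft, deadline; };

double la_edf_defer(const struct la_task *tasks, int n,
                    double now, double u_total)
{
    double dn = tasks[n - 1].deadline;   /* earliest deadline */
    double u = u_total, s = 0.0;

    for (int i = 0; i < n; i++) {        /* decreasing deadline order */
        const struct la_task *t = &tasks[i];
        double span = t->deadline - dn;
        double x;

        u -= t->wcet / t->period;        /* remove Ti's worst-case share */
        if (span > 0.0) {
            x = t->cleft - (1.0 - u) * span;   /* minimum work before Dn */
            if (x < 0.0)
                x = 0.0;
            u += (t->cleft - x) / span;  /* deferred work raises U later */
        } else {
            x = t->cleft;  /* earliest task: all its work is due before Dn */
        }
        s += x;
    }
    return s / (dn - now);   /* normalized speed; mapped to a discrete f */
}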


Figure 3.4: LA-EDF

Algorithm 4: LA-EDF
1: WCET: C1, C2, ..., CN
2: Period: P1, P2, ..., PN
3: Frequencies: f1, f2, ..., fm
5: U = C1/P1 + C2/P2 + ... + CN/PN
6: if Task Ti begins then
7:     Uprev = U
8:     Clefti = Ci
9:     k = Defer()
10: if Task Ti ends then
11:     Clefti = 0
12:     k = Defer()
13: During execution of Ti:
14:     Decrement Clefti
15: k = fm * U
16: Select f ∈ {f1, f2, ..., fm} such that f ≥ k

3.1.5 Snowdon-min

This is the RT-DVFS scheduler implemented by Snowdon. We call it Snowdon-min because it is not the full version of Snowdon's RT-DVFS scheduler: we do not consider the scaling of the memory or bus frequency, only that of the CPU frequency. In this algorithm, the dynamic slack produced by a completed task is added to the budget of the next runnable task, and its frequency is scaled accordingly. The algorithm is described in Algorithm 5.


Algorithm 5: Snowdon-min
1: WCET: C1, C2, ..., CN
2: Period: P1, P2, ..., PN
3: Budget: B1, B2, ..., BN
4: Frequencies: f1, f2, ..., fm
5: Tbest: the next runnable task selected by EDF
6: U = C1/P1 + C2/P2 + ... + CN/PN
7: if Task Ti begins then
8:     Bi = Ci/Soptimal
11: if Task Ti ends then
12:     UCi = Bi − ACi
13:     leftbest = Bbest − ARbest
14:     Bbest = Bbest − ARbest + UCi
15:     k = leftbest/Bbest
16: Select f ∈ {f1, f2, ..., fm} such that f ≥ k

Whenever a task Ti begins, it is allotted a budget (in time units) equal to Ci/Soptimal, as in line 8. This is the time Ti will take to complete when run at Soptimal, the constant speed decided by the Static-EDF algorithm. When a task completes its current invocation, the unused time is calculated as in line 12, where ACi is the actual execution time of task Ti. In line 13, the remaining WCET of Tbest at speed Soptimal is calculated, where Bbest is the time budget allotted to Tbest and ARbest is the time for which Tbest has already executed. In line 14, the unused time from Ti is added to the remaining budget of Tbest. In line 15, the frequency is calculated by dividing the remaining WCET of Tbest by its available budget. The schedule for the example task-set is shown in Figure 3.5.
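The budget hand-off can be sketched in C as follows; the field names are illustrative, and rebasing the "time already run" counter after the budget is recomputed is an implementation assumption.

/* Sketch of the Snowdon-min budget hand-off (Algorithm 5); field names are
 * illustrative, and resetting 'ran' after the budget is rebased is an
 * implementation assumption. */
struct sn_task {
    double wcet;    /* WCET at maximum frequency */
    double budget;  /* time allotted (initially wcet / S_optimal) */
    double ran;     /* time this task has already executed */
};

/* On release: allot the budget for running at the constant speed S_optimal. */
void snowdon_on_release(struct sn_task *t, double s_optimal)
{
    t->budget = t->wcet / s_optimal;
    t->ran = 0.0;
}

/* On completion of 'done': pass its unused budget to 'best' and return the
 * relative speed k for 'best' (the scheduler maps k to a discrete f). */
double snowdon_on_complete(struct sn_task *done, struct sn_task *best,
                           double actual_time)
{
    double unused = done->budget - actual_time;   /* dynamic slack */
    double left   = best->budget - best->ran;     /* best's remaining budget */

    best->budget = left + unused;                 /* stretch best's budget */
    best->ran    = 0.0;                           /* budget counts from now */
    return left / best->budget;                   /* k = left / (left+unused) */
}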

3.1.6 DRA

The Dynamic Reclaiming Algorithm (DRA) is based on dynamic slack reclaiming and was devised by Aydin et al. in [15]. In this algorithm, a data structure called the α queue is maintained. Whenever a task arrives, it pushes its worst-case execution time at Soptimal into the α queue. Each item of the α queue is thus characterized by the deadline Di of task Ti and the remaining worst-case execution time of Ti at speed Soptimal, denoted remi. The α queue is ordered according to EDF* priority. (EDF* is similar to EDF, except that if two tasks have the same deadline, the task which arrived earlier has the higher priority.) With the progress of time, the elapsed time since the last scheduling event is subtracted from the remi field of the head of the α queue. If remi is smaller than the elapsed time, then after updating the remi of the head, the head is deleted and the update continues with the new head, until the entire elapsed time is used up.


Figure 3.5: Snowdon-min

Now, whenever a task is selected for execution, the α queue is checked, and the remi fields of all α queue items having a deadline less than or equal to that of Tbest are added to the remi of Tbest. In this way, the unused slack of all the tasks which completed earlier than the scheduled task is transferred to the scheduled task, and the frequency is scaled accordingly. We discuss the DRA algorithm in detail in the following sections.

Auxiliary Functions

In this section, we define the auxiliary functions used by DRA.

Initialize_α_member(Di, remi): Initialize the fields of an α member with the deadline Di and the remaining WCET under Soptimal, remi, of the arrived task.

Insert_Into_α_Queue(αj, α_queue): Insert αj into the α queue at its deadline position.

Update_α_Queue(time_diff, α_queue): Let time_diff be the time elapsed since the last scheduling event. This time_diff is subtracted from the remi field of the head of the α queue. If remi is smaller than time_diff, then after updating the remi of the head, the head is deleted and the update continues with the new head, until the entire elapsed time is used up.


Algorithm 6: DRA
1: WCET: C1, C2, ..., CN
2: Period: P1, P2, ..., PN
3: Frequencies: f1, f2, ..., fm
4: U = C1/P1 + C2/P2 + ... + CN/PN
6: if Task Ti begins then
7:     Initialize_α_member(Di, remi)
8:     time_diff = Compute_Time_Elapsed()
9:     Update_α_Queue(time_diff, α_queue)
10:    Insert_Into_α_Queue(αj, α_queue)
12: if Task Ti ends then
13:     time_diff = Compute_Time_Elapsed()
14:     Update_α_Queue(time_diff, α_queue)
15:     E = Calculate_Dynamic_Slack_Available(Tbest, α_queue)
16:     Rem_WCETbest = Calculate_Remaining_WCET(Tbest)
17:     k = Rem_WCETbest/E
18: Select f ∈ {f1, f2, ..., fm} such that f ≥ k

Compute_Time_Elapsed(): Return the time elapsed since the last scheduling event.

Calculate_Dynamic_Slack_Available(Tbest, α_queue): Calculate the dynamic slack available to the selected task Tbest. For any task Ti, the dynamic slack is given by Ei = Σ(dj ≤ di) remj, i.e., it is obtained by summing the remj fields of all the α members whose deadline field is less than or equal to the deadline of Ti.

Calculate_Remaining_WCET(Ti): Calculate the remaining WCET remi of Ti under Soptimal.

Calculate_NTA(): Return the arrival time of the next task.

Algorithm 6 gives the pseudo-code for DRA, using the auxiliary functions described above. When task Ti arrives, an α member's deadline and remi fields are initialized with the Di and remi of Ti in line 7. In line 8, the time elapsed since the last scheduling event is calculated and stored in the variable time_diff. Using this time difference, the α queue is updated as explained above. The initialized α member is inserted in line 10. When task Ti ends, the elapsed time is calculated in line 13 and the α queue is updated in line 14. In line 15, the dynamic slack available to Tbest is calculated as explained above. In line 16, the remaining WCET of Tbest is determined. In line 17, the remaining WCET of Tbest is divided by


the slack time available to Tbest, and the frequency is scaled accordingly. The schedule for the example task-set is shown in Figure 3.6.

Figure 3.6: DRA
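The α queue bookkeeping can be sketched in C as follows, a simplified linked-list rendering of Update_α_Queue and Calculate_Dynamic_Slack_Available with illustrative names, not the ChronOS structures.

/* Sketch of the DRA alpha-queue bookkeeping: a singly linked list ordered
 * by EDF* priority. Names are illustrative, not the ChronOS structures. */
#include <stdlib.h>

struct alpha_member {
    double deadline;
    double rem;                 /* remaining WCET at S_optimal */
    struct alpha_member *next;
};

/* Consume 'elapsed' time units from the head of the queue, popping
 * members whose rem field is exhausted (Update_alpha_Queue). */
void alpha_queue_update(struct alpha_member **head, double elapsed)
{
    while (*head && elapsed > 0.0) {
        struct alpha_member *m = *head;
        if (m->rem > elapsed) {
            m->rem -= elapsed;
            break;
        }
        elapsed -= m->rem;
        *head = m->next;
        free(m);
    }
}

/* Sum the rem fields of all members with deadline <= d
 * (Calculate_Dynamic_Slack_Available); the queue order makes an early
 * exit on the first later deadline safe. */
double alpha_queue_slack(const struct alpha_member *head, double d)
{
    double slack = 0.0;
    for (; head && head->deadline <= d; head = head->next)
        slack += head->rem;
    return slack;
}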

3.1.7 DRA-OTE

DRA-OTE (Dynamic Reclaiming Algorithm - One Task Technique) is an extension of the DRA algorithm. When there is only one task in the ready queue and certain conditions are met, it tries to reduce the frequency further. To explain this algorithm, we first recall the concept of NTA, or Next Task Arrival: the next arrival time of any task instance in the system is known as the NTA [31]. If there is only one task in the ready queue, and its remi at Soptimal is less than the time available until the NTA, then the frequency of this task can be reduced further so as to utilize the entire time until the NTA. Algorithm 7 presents the pseudo-code for DRA-OTE. Lines 1-17 are the same as in DRA. In line 20, the next arrival time is calculated and stored in the variable NTA. In line 21, it is checked whether Tbest is the only task in the ready queue. If so, the time until the NTA is calculated from the current time and stored in the variable Z in line 22, and in line 23 the frequency k is further scaled by Rem_WCETbest/Z to obtain the new frequency value.


Algorithm 7: DRA-OTE
1: WCET: C1, C2, ..., CN
2: Period: P1, P2, ..., PN
3: Frequencies: f1, f2, ..., fm
4: U = C1/P1 + C2/P2 + ... + CN/PN
6: if Task Ti begins then
7:     Initialize_α_member(Di, remi)
8:     time_diff = Compute_Time_Elapsed()
9:     Update_α_Queue(time_diff, α_queue)
10:    Insert_Into_α_Queue(αj, α_queue)
12: if Task Ti ends then
13:     time_diff = Compute_Time_Elapsed()
14:     Update_α_Queue(time_diff, α_queue)
15:     E = Calculate_Dynamic_Slack_Available(Tbest, α_queue)
16:     Rem_WCETbest = Calculate_Remaining_WCET(Tbest)
17:     k = Rem_WCETbest/E
20: NTA = Calculate_NTA()
21: if Task Ti is the only task in the run queue then
22:     Z = NTA − current_time
23:     k = k * Rem_WCETbest/Z
25: Select f ∈ {f1, f2, ..., fm} such that f ≥ k

3.1.8 AGR1

AGR1 is also an extension of DRA; however, in addition to the dynamic speed reclaiming technique, this algorithm uses average workload information to predict the early completion of future workloads and thus obtain extra slack to further reduce the CPU frequency. It therefore reduces the frequency more aggressively than DRA and DRA-OTE, consequently saving more energy. The algorithm is based on the observation that whenever there is more than one task in the ready queue, and all the tasks have to complete before the NTA, CPU time can be transferred among these tasks. Let us consider three tasks T1, T2 and T3 in the ready queue, all of which have to complete before the NTA. If T1 is the task with the earliest deadline, it can obtain CPU time from T2 and T3 so that its speed can be reduced, while still ensuring that all these tasks complete before the NTA. This additional scaling is applied to the frequency obtained as the result of DRA. Before describing the algorithm, let us first define some terms used in it.


Algorithm 8: AGR1
1: Tbest: the next runnable task, selected with the earliest deadline
2: σT: list of the tasks in the α queue that have completed and have unused computation time, and of the tasks in the ready queue having a deadline greater than the selected task's deadline
3: σL: list of the tasks in the α queue and the ready tasks which have to complete before the NTA but can provide CPU time to Tbest
4: WCET: C1, C2, ..., CN
5: Period: P1, P2, ..., PN
6: Frequencies: f1, f2, ..., fm
7: U = C1/P1 + C2/P2 + ... + CN/PN
9: if Task Ti begins then
10:     Initialize_α_member(Di, remi)
11:     time_diff = Compute_Time_Elapsed()
12:     Update_α_Queue(time_diff, α_queue)
13:     Insert_Into_α_Queue(αj, α_queue)
15: if Task Ti ends then
16:     Compute_Time_Elapsed()
17:     Update_α_Queue(time_diff, α_queue)
18: E = Calculate_Dynamic_Slack_Available(Tbest, α_queue)
19: Rem_WCETbest = Calculate_Remaining_WCET(Tbest)
20: k = Rem_WCETbest/E
21: NTA = Calculate_NTA()
22: if Task Ti is the only task in the run queue then
23:     Z = NTA − current_time
24:     k = k * Rem_WCETbest/Z
25:     goto end

Soptavg: the optimal speed considering the average workload of the task-set.

Q: the total amount of CPU time required to be transferred to the next runnable task so that it can operate at max(Smin, Soptavg).

B: the amount of time actually transferred.

Algorithm 8 and Algorithm 9 describe the algorithm.

The first part of the algorithm, Algorithm 8, is the same as DRA-OTE. (The algorithm is shown in two parts for readability.) We now consider the second part, Algorithm 9.


Algorithm 9: AGR1 (contd.)
1: if k ≤ max(Smin, Soptavg) then
2:     goto end
3: Q = the CPU time that must be transferred to Tbest so that it can run at max(Smin, Soptavg)
4: if Rem_WCETbest + Q exceeds the time until the NTA then
5:     adjust Q so that Tbest still completes before the NTA
8: for each task Tj in σT do          // lines 8-15: build σL, the tasks able to contribute to Q
9:     if Tj cannot contribute then
10:        count_ready_tasks − −
11:        Z = Zprev
12:        break
16: for each task Tj in σL in decreasing deadline order do
17:     if Tj is ready then
18:         Tj_speed_prev = Tj_speed
19:         Requested_time = Q − Qactual
20:         Tj_speed = Tj_speed_prev * [Rem_WCETj/(Rem_WCETj − Requested_time)]
21:         B = [(Tj_speed/Tj_speed_prev) − 1] * Rem_WCETj
22:     if Tj is completed but is still in the α queue then
23:         B = min(Requested_time, Rem_WCETj)
24:     Qactual = Qactual + B
25: k = k * Rem_WCETbest/(Rem_WCETbest + B)
end: Select f ∈ {f1, f2, ..., fm} such that f ≥ k

If the frequency selected under DRA-OTE is already less than Smin or Soptavg, there is no point in reducing the frequency further, so we simply go to end, as in lines 1 and 2. Otherwise, we calculate the amount of CPU time that must be transferred to the ready task Tbest so that it can operate at the lowest frequency possible, max(Smin, Soptavg), in line 3. In line 4, we check whether the time until the NTA is enough to accommodate the increase of Rem_WCETbest by Q; if not, Q is adjusted accordingly in line 5. Once Q, the amount of CPU time to be transferred to Tbest from the other tasks, is decided, the for loop from lines 8 to 15 determines the ready tasks, as well as the completed tasks in the α queue with unused computation time, which can contribute to Q. This list of tasks is σL. In the for loop from lines 16 to 24, the individual transfers of B units of time from these tasks take place.


As they give away a part of their allotted CPU time, these tasks are left with less time for execution, so their speed is increased accordingly in line 20. B is calculated in line 21, and the frequency is scaled accordingly in line 25.

3.1.9 AGR2

AGR2 is very similar to AGR1, except on two points: (i) if the frequency value available after the DRA computation in AGR1 is less than 0.9*Soptavg, no further frequency reduction is done; and (ii) 0.9*Soptavg is used instead of Soptavg when computing Q.

3.2 Schedulers for Dependent Underloaded task-sets

3.2.1 EUA

EUA is a utility accrual RT-DVFS algorithm devised by Haisang et al. in [52]. It aims to accrue maximum utility per unit of energy consumed while reducing the total system power. During underloads, it accrues the maximum utility and scales the frequency in a similar way to LA-EDF [44], whereas during overloads it accrues the maximum utility possible by running the tasks at the maximum speed.

Figure 3.7: Step TUF

The version of EUA implemented in ChronOS makes the following assumptions: (i) The task-set under consideration is subjected to a step TUF, shown in Figure 3.7. With a step TUF, a task has a constant maximum utility until its deadline; after the deadline, the utility becomes zero. (ii) The original version of EUA aims at reducing the total system power. In this implementation, we consider CPU power reduction only, assuming CPU power to be directly proportional to the cube of the frequency, as was assumed by Haisang [53].


(iii) Considering the above two points, the UER of a phase Ji is given by Utility/(fmin^3 * Rem_WCETi), where fmin is the minimum frequency supported by the hardware.

Auxiliary Functions

The following functions are used in the algorithm.

sortByUER(σ): Sort the list σ in order of decreasing UER, i.e., the phase with the highest UER is at the head of the list.

calculateUER(Jk, t): Calculate the UER as Utility/Rem_WCETk, where Utility is the utility of the phase Jk and Rem_WCETk is its remaining WCET under Soptimal. We calculate the UER simply as the ratio of utility to remaining WCET and do not include the minimum power (which would be proportional to the cube of fmin), because the minimum power is the same constant for all phases; as we only consider CPU power consumption in this implementation, the utility and the remaining WCET are the only parameters which differ between phases.

feasible(σdl): For a list to be feasible, the predicted completion times of all the phases in the list at the maximum frequency must be less than their respective deadlines.

Insert(Jk, σdl, Jk.X): Insert the phase Jk into the ordered list σdl at the position indicated by the index Jk.X.

EUA is described in Algorithm 11, using the auxiliary functions above. In the for loop from lines 7 to 10, each phase in the ready queue is checked for feasibility, i.e., whether it has already missed its deadline; if it has, it is aborted, as in line 9. Otherwise, the UER of each feasible phase in the ready queue is calculated as explained above. The phases are then sorted in decreasing UER order and placed in a list σtmp in line 11. In the for loop from lines 12 to 18, the head of σtmp (the phase with the highest UER) is selected and inserted into the list σdl at its deadline position in line 15. In line 16, the feasibility of the list is checked as explained above. If it is feasible, then σdl is copied into σ, as in


Algorithm 10: Defer_EUA Function
1: Procedure: Defer()
2: Input: σT // List of released tasks
3: WCET: C1, C2, ..., CN
4: Period: P1, P2, ..., PN
5: Frequencies: f1, f2, ..., fm
7: U = C1/P1 + C2/P2 + ... + CN/PN
8: s = 0
9: for each task Ti in σT in decreasing deadline order do
10:     U = U − Ci/Pi
11:     if U ≥ 1 then
12:         break
13:     x = max(0, Clefti − (1 − U)(Di − Dn))
14:     U = U + (Clefti − x)/(Di − Dn)
15:     s = s + x
18: if U ≥ 1 then
19:     f = fmax
20: else
21:     f = s/(Dn − current_time)

line 17; if not, we break from the loop. The head of σ is selected as Tbest, the next runnable task. Thereafter, the algorithm is the same as LA-EDF, with the only difference being in the defer function: a modified version called Defer_EUA is used. The only difference between Defer and Defer_EUA is that when U ≥ 1 (i.e., the utilization reaches 100%), the maximum frequency is selected. Thus, during underloads the frequency selection is the same as in LA-EDF, whereas during overloads the maximum frequency is selected, as the main aim is then to accrue the maximum utility.

3.2.2 HS

This is an RT-DVFS algorithm devised by Zhang et al. in [57] for task-sets with non-preemptible blocking sections. A static speed, known as the high speed, is calculated so as to ensure that no deadlines are missed when the tasks are scheduled with EDF, even in the presence of non-preemptible blocking sections. The non-preemptible blocking sections are modelled as a special case of SRP [16], where there is just one resource shared by all the tasks. When the tasks are scheduled with EDF under SRP, feasibility is given by:


Algorithm 11: EUA
1: WCET: C1, C2, ..., CN
2: Period: P1, P2, ..., PN
3: Frequencies: f1, f2, ..., fm
5: U = C1/P1 + C2/P2 + ... + CN/PN
7: for ∀Jk ∈ Jr do
8:     if feasible(Jk) = false then
9:         abort(Jk)
10:    Jk.UER := calculateUER(Jk, t)
11: σtmp := sortByUER(Jr)
12: for ∀Jk ∈ σtmp from head to tail do
13:     if Jk.UER > 0 then
14:         σdl = σ
15:         Insert(Jk, σdl, Jk.X)
16:         if feasible(σdl) then
17:             σ = σdl
18:         else break
19: Tbest = headOf(σ)
20: if Task Ti begins then
21:     Uprev = U
22:     k = Defer_EUA()
23: if Task Ti ends then
24:     Clefti = 0
25:     k = Defer_EUA()
26: During execution of Ti:
27:     Decrement Clefti
28: Select f ∈ {f1, f2, ..., fm} such that f ≥ k

∀k, 1 ≤ k ≤ n:  Σ(i=1..k) Ci/Di + Bk/Dk ≤ 1,

where Bk is the maximum time for which task Tk can be blocked. A static speed H which ensures that all deadlines are met can therefore be selected such that:

∀k, 1 ≤ k ≤ n:  Σ(i=1..k) Ci/Di + Bk/Dk ≤ H    (1)

The algorithm computes this static speed from the properties of the task-set. Algorithm 12 gives its pseudo-code, which is explained as follows. Equation (1) is evaluated incrementally within the for loop from lines 9 to 12: line 10 accumulates the utilization of the task-set, line 11 adds the blocking term, and line 12 keeps the maximum as H. Gj is the maximum length of time for which task Ti can be blocked under SRP by lower-priority jobs, i.e., by jobs whose periods and deadlines are greater than those of Ti, since the task-set is scheduled by EDF. The example task-set used for HS and DS is given in Table 3.2.


Figure 3.8: HS

Algorithm 12: HS
1: Input: σT // List of released tasks
2: WCET: C1, C2, ..., CN
3: Period: P1, P2, ..., PN
4: Frequencies: f1, f2, ..., fm
6: U = C1/P1 + C2/P2 + ... + CN/PN
7: H = 0
8: utilization = 0
9: for each task Ti in σT in increasing order of the periods do
10:     utilization += Ci/Pi
11:     speed = utilization + [max(Gj | Pi ≤ Pj)]/Pi
12:     H = max(speed, H)
13: return H
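A minimal user-space C sketch of this computation is shown below; it assumes deadlines equal to periods and arrays sorted by increasing period, with illustrative names (in ChronOS, this value is computed in user space and passed to the kernel as a system call argument).

/* Minimal sketch of the HS static-speed computation (Algorithm 12).
 * Assumes deadlines equal periods and arrays sorted by increasing period;
 * g[i] is the longest non-preemptible section of task i (names are
 * illustrative). */
double hs_high_speed(const double *wcet, const double *period,
                     const double *g, int n)
{
    double h = 0.0, util = 0.0;

    for (int i = 0; i < n; i++) {
        /* G term: longest blocking section among tasks with P_j >= P_i. */
        double blocking = 0.0;
        for (int j = i; j < n; j++)
            if (g[j] > blocking)
                blocking = g[j];

        util += wcet[i] / period[i];
        double speed = util + blocking / period[i];
        if (speed > h)
            h = speed;
    }
    return h;  /* relative speed; the kernel maps it to a discrete frequency */
}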

Figure 3.8 shows the schedule and the speed of operation under the HS algorithm using the example task-set.

Table 3.2: Sample 2-task task-set

Task | WCET | Period
T1   | 2    | 5
T2   | 5    | 10


Figure 3.9: DS

3.2.3 DS

In the HS algorithm, the processor operates at just one constant speed. Zhang et al. [57] devised another algorithm which improves on HS, saving more energy by operating at two speeds: the static high speed calculated by the HS algorithm (H), and the Soptimal speed calculated by the Static-EDF algorithm (L). The processor operates at the high speed only when a task is blocked by a lower-priority task, and only until the completion of the blocking task; at all other times the processor operates at the low speed, thus saving considerable energy while ensuring that no deadlines are missed.

The DS algorithm is presented in Algorithm 13. First, in line 5, the variable high_speed_interval is initialized to 0 and declared static, so that its value persists between scheduler invocations. (In the HS and DS algorithms, the high speed is calculated in user space and transferred to the kernel as an argument to the system call.) In line 13, the processor speed is set to L. In lines 16-17, it is checked whether the current time is less than high_speed_interval; if so, the interval is not yet over and the processor speed is set to H. In lines 18-20, if the resource that Tbest is going to request is not free, then Tbest is blocked by a lower-priority task; that lower-priority task, Towner, becomes the next runnable task, the processor speed is set to H in this case as well, and high_speed_interval is set to the maximum of its previous value and the deadline of Towner. However, if the blocking task finishes before its deadline, the high-speed interval also


Algorithm 13: DS
1: Input: σT // List of released tasks
2: WCET: C1, C2, ..., CN
3: Period: P1, P2, ..., PN
4: Frequencies: f1, f2, ..., fm
5: static high_speed_interval = 0
6: H = 0
7: utilization = 0
8: for each task Ti in σT in increasing order of the periods do
9:     utilization += Ci/Pi
10:    speed = utilization + [max(Gj | Pi < Pj)]/Pi
11:    H = max(speed, H)
13: Set processor speed to L
14: if the blocking task finishes before its deadline then
15:     high_speed_interval = 0
16: if current_time < high_speed_interval then
17:     k = H
18: if the resource which Tbest is going to request is not free then
19:     Tbest = Towner; k = H
20:     high_speed_interval = max(high_speed_interval, deadline of Towner)
21: if the end of high_speed_interval is reached then
22:     high_speed_interval = 0
23: if Task Ti begins then
24:     Uprev = U; Clefti = Ci
25:     k = Defer()
26: if Task Ti ends then
27:     Clefti = 0
28:     k = Defer()
29: During execution of Ti:
30:     Decrement Clefti
32: Select f ∈ {f1, f2, ..., fm} such that f ≥ k

ends with the completion of the blocking task, as shown in lines 14-15. Finally, when the end of the high-speed interval is reached, it is reset to 0, as in lines 21-22. If none of the above conditions hold, the processor operates at the low speed. The schedule is shown in Figure 3.9.
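The two-speed decision can be sketched in C as follows; the state variable and parameter names are illustrative, and the logic condenses Algorithm 13's checks into one helper.

/* Sketch of the DS speed decision (Algorithm 13), condensed into one
 * helper; state variable and parameter names are illustrative. */
static double high_speed_interval;   /* absolute time; 0 = inactive */

double ds_pick_speed(double now, double H, double L,
                     int best_is_blocked, double blocker_deadline,
                     int blocker_finished)
{
    /* The interval ends when the blocking task completes or the end of
     * the interval is reached. */
    if (blocker_finished || now >= high_speed_interval)
        high_speed_interval = 0.0;

    if (best_is_blocked) {
        /* Resource not free: the owner runs at H until its deadline. */
        if (blocker_deadline > high_speed_interval)
            high_speed_interval = blocker_deadline;
        return H;
    }
    return (now < high_speed_interval) ? H : L;
}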


Algorithm 14: USFI_EDF
1: Input: σT // List of released tasks
2: WCET: C1, C2, ..., CN
3: Period: P1, P2, ..., PN
4: Frequencies: f1, f2, ..., fm
5: U = C1/P1 + C2/P2 + ... + CN/PN
6: H = 0
7: utilization = 0
8: while q ≤ n do
9:     for each task Ti in σT in increasing order of the periods do
10:        solve Σ(r=1..q) (1/ηr)(Cr/Dr) + (1/ηi)[Bi/Di + Σ(p=q..i) Cp/Dp] = 1 for ηi
12:    ηm = max over i = q..n of ηi
13:    for each task Ti, q ≤ i ≤ m, in increasing order of the periods do
14:        ηi = ηm
15:    q = m + 1
17: if the resource which Tbest is going to request is not free then
18:     Tbest = Towner; ηbest = ηowner
19: k = ηbest
20: Select f ∈ {f1, f2, ..., fm} such that f ≥ k

3.2.4 USFI EDF

This is an RT-DVFS algorithm designed by Rajkumar et al. in [30], which allows synchronization of tasks for access to shared resources while maintaining hard real-time constraints. Static slowdown factors are calculated for the synchronizing tasks, considering the feasibility of the task-set when scheduled with EDF. The algorithm also involves frequency inheritance, where a low-priority job can inherit the frequency of the highest-priority job that it has blocked, in order to prevent deadline misses. Rajkumar et al. give the programmer the freedom to use any resource access protocol; since we use EDF for scheduling, we choose SRP [16], as it is suited to dynamic-priority scheduling. In SRP, whenever a job attempts to preempt another job, it is checked whether the resource it will require in the future is available at the time of preemption; if not, the job is blocked and not allowed to execute. USFI_EDF is presented in Algorithm 14. In the for loop from lines 9 to 15, the static slowdown factors are calculated using the equation Σ(r=1..q) (1/ηr)(Cr/Dr) + (1/ηi)[Bi/Di + Σ(p=q..i) Cp/Dp] = 1. In lines 17 and 18, frequency inheritance takes place: the blocking task inherits the slowdown factor of the blocked task. Under SRP, a job can be blocked for at most one critical section, so the maximum blocking


time is the length of the longest critical section [16]. Here Bi denotes the maximum blocking time under SRP.

3.3 Schedulers for Overloaded task-sets

3.3.1 REUA

This is an extension of the EUA algorithm devised for dependent task models. In addition to accruing maximum utility while saving energy, this algorithm includes mechanisms to bound the blocking time, as well as for deadlock detection and resolution.

Auxiliary Functions

The following functions are used in the algorithm.

sortByUER(σ): Sort the list σ in order of decreasing UER, i.e., the phase with the highest UER is at the head of the list.

calculateUER(Jk, t): Calculate the UER of a phase as follows. The utilities and the remaining WCETs of the phase being considered, as well as of all the phases in its dependency chain, are summed and their ratio is taken: (Σ(k=1..n) Utilityk)/(Σ(k=1..n) Rem_WCETk), where n is the total number of phases in the dependency chain, including the phase being considered.

buildDep(Jk): This function builds the dependency chain for the phase Jk. The chain is created by considering the owner of the resource requested by Jk, say Jowner, then the owner of the resource requested by Jowner, and so on. If there is a loop in the chain, i.e., if the owner of a resource requested by a phase in the chain itself requests a resource whose owner is already present in the chain, a deadlock would result. To resolve this, the phase in the loop with the lowest UER is aborted.

insertPhaseWithDep(Jk, σ): This function inserts the phase into the list σ, along with all its dependencies, at the critical deadline position. The critical deadline of a phase is the earliest deadline among all the phases in its dependency chain.


Algorithm 15: REUA
1: WCET: C1, C2, ..., CN
2: Period: P1, P2, ..., PN
3: Frequencies: f1, f2, ..., fm
4: U = C1/P1 + C2/P2 + ... + CN/PN
6: for ∀Jk ∈ Jr do
7:     if feasible(Jk) = false then
8:         abort(Jk)
9:     else
10:        Jk.Dep := buildDep(Jk)
11: for ∀Jk ∈ Jr do
12:     Jk.UER := calculateUER(Jk, t)
13: σtmp := sortByUER(Jr)
14: for ∀Jk ∈ σtmp from head to tail do
15:     if Jk.UER > 0 then
16:         σ := insertPhaseWithDep(Jk, σ)
17:     else
18:         break
19: Tbest := headOf(σ)
20: if Task Ti begins then
21:     Uprev = U
22:     k = Defer_EUA()
23: if Task Ti ends then
24:     Clefti = 0
25:     k = Defer_EUA()
26: During execution of Ti:
27:     Decrement Clefti
28: Select f ∈ {f1, f2, ..., fm} such that f ≥ k

Algorithm 15 describes REUA, using the functions defined in Section 3.3.1. Up to line 8, the algorithm is the same as EUA. In line 10, the dependency chain of each phase is built using the buildDep() function, as explained above. In line 12, the UER of each phase is calculated. The phases are then sorted in decreasing UER order and placed in a list σtmp. From lines 14 to 17, the phases are inserted into σ, along with their dependencies, in critical deadline order. The head of σ is selected as Tbest, the next runnable task. Thereafter, the algorithm is the same as EUA.


Table 3.3: Slack Estimation Techniques used in the Algorithms

Name           | Max Const Speed | Stretching to NTA | Priority-Based Slack Stealing | Utilization Updating
Static-EDF     | ✓               |                   |                               |
CC-EDF, LA-EDF |                 |                   |                               | ✓
DRA, DRA-OTE   |                 | ✓                 | ✓                             |
AGR1, AGR2     |                 | ✓                 | ✓                             |
EUA, REUA      |                 |                   |                               | ✓
HS, DS         | ✓               |                   |                               |

Chapter 4

Extending ChronOS with DVFS support

4.1 ChronOS

4.1.1 Introduction

In order to implement and evaluate our RT-DVFS schedulers we have used ChronOS, a best-effort real-time Linux kernel developed by Dellinger et al. in [22]. ChronOS is based on version 2.6.33.7 of the Linux kernel and has been enhanced with Ingo Molnar's PREEMPT_RT real-time patch [42]. ChronOS provides a number of APIs and a scheduler plugin infrastructure that can be used to implement various RT-DVFS algorithms. The base Linux kernel provides only soft real-time capabilities; to obtain hard real-time behaviour, the kernel has to be made preemptible. To achieve this, the PREEMPT_RT patch is used in ChronOS. With this patch, interrupt latencies improve and most parts of the kernel become preemptible, providing hard real-time properties to the Linux kernel.

4.1.2 ChronOS real-time Scheduler

The ChronOS scheduler is built on top of the Linux scheduler. The O(1) Linux scheduler [38] has a bitmap implementation of the priority levels; consequently, the scheduling policies implemented in the kernel (SCHED_NORMAL, SCHED_FIFO, SCHED_RR) take constant time to select a task. In the Linux kernel, every priority level has a Linux run-queue associated with it. In ChronOS, in addition, every priority level also has a ChronOS real-time run-queue (CRT-RQ), which contains the ChronOS real-time tasks. ChronOS real-time tasks are real-time segments defined in the user applications, using the system


calls provided by ChronOS. Real-time segments are those portions of a user application thread which have to be executed under hard real-time constraints. Thus, when a ChronOS real-time task enters the system, it is added to the CRT-RQ. When invoked, the ChronOS scheduler selects a task from the CRT-RQ based on the scheduling algorithm chosen and returns it to the Linux scheduler, which executes it.

4.1.3 Scheduling Events

ChronOS is an event-based real-time kernel, in which the real-time scheduler is invoked when certain events, known as scheduling events, occur. These include:

• When a task enters the system: when a task begins its real-time segment, the scheduler is invoked.
• When a task leaves the system: when a task completes its real-time segment, the scheduler is invoked.
• When a resource is requested: in the case of dependent task-sets, the scheduler is also invoked when a task requests a resource.
• When a resource is released: similarly, whenever a task releases a resource, the scheduler is invoked.

4.1.4 System Calls provided by ChronOS

As already mentioned, ChronOS provides system calls with which real-time segments can be defined within the threads of a user application. These system calls are described as follows:

begin_rt_seg(): This system call is used by the application to start a real-time segment. Through it, the application also informs the kernel about the real-time properties of the segment, such as its period, deadline, and so on. In order to implement RT-DVFS, we have modified this system call to also inform the kernel about the utilization of the task-set, the high frequency, Soptavg, and the locks that the tasks will request in the future.

end_rt_seg(): With this system call, the application indicates the end of the scheduling segment to the kernel.


chronos_mutex_lock(): This system call is used to request a resource, such as a mutex.

chronos_mutex_unlock(): This system call is used to release the resource.

set_scheduler(): This system call is used for selecting a particular scheduling algorithm. The ChronOS scheduler selects a task from the CRT-RQ based on the scheduling algorithm set by this call.

The RT-DVFS schedulers in ChronOS are implemented as Linux kernel modules, and can thus be loaded into or unloaded from the running kernel using modprobe. The sequence of operation is as follows. The real-time application calls the set_scheduler() system call to select an RT-DVFS scheduler, say LA-EDF. If the LA-EDF kernel module is available, ChronOS loads it and LA-EDF becomes the ChronOS local scheduler. Consequently, sched_la_edf() is called by the ChronOS scheduler at every scheduling event; it operates on the CRT-RQ, finds the task with the earliest deadline, sets the CPU frequency to the value decided by the frequency-evaluating part of the algorithm, and returns the selected task to the Linux SCHED_FIFO scheduler for execution.
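The following C sketch illustrates how an application thread might use these calls; the prototypes shown are assumptions for illustration only, as the real ones are declared in the ChronOS user-space headers.

/* Sketch of how an application thread defines a periodic real-time
 * segment with the ChronOS API. The prototypes below are ASSUMED for
 * illustration; consult the ChronOS headers for the real ones. */
#include <time.h>

extern int set_scheduler(const char *name);                 /* assumed */
extern int begin_rt_seg(struct timespec *deadline,          /* assumed */
                        struct timespec *period,
                        unsigned long wcet_us);
extern int end_rt_seg(void);                                /* assumed */

static void do_work(void) { /* burn CPU or touch memory for the ACET */ }

void run_periodic_task(void)
{
    struct timespec period   = { .tv_sec = 0, .tv_nsec = 5000000 };
    struct timespec deadline = period;      /* implicit deadline = period */

    set_scheduler("la_edf");                /* load the LA-EDF plugin */
    for (int i = 0; i < 1000; i++) {
        begin_rt_seg(&deadline, &period, 2000);  /* enter RT segment */
        do_work();
        end_rt_seg();                       /* end segment; reschedule */
    }
}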

4.2 Adding DVFS support to ChronOS

RT-DVFS schedulers need to change the voltage and frequency of the processor at every scheduler invocation. A framework known as the CPUfreq subsystem [9] has been part of the Linux kernel since version 2.6.0; with it, processor frequencies can be scaled dynamically. To implement RT-DVFS schedulers in ChronOS, we have used this CPUfreq subsystem, described in Section 4.2.1.

4.2.1 The CPUfreq Subsystem

Figure 4.1 [6] shows a high-level view of the CPUfreq subsystem. It contains the following components:

(i) CPUfreq module: This module abstracts the low-level frequency-controlling driver interface from the high-level frequency-controlling policies. It provides APIs which enable the high-level code to change the CPU frequency.


Figure 4.1: CPUfreq Subsystem (ii) CPU-specific drivers: For DVFS to work, CPU itself should support dynamic voltage and frequency scaling. The modern CPUs are enhanced with technologies which support DVFS. For example, Intel processors come with Enhanced Intel SpeedStep technology [6], AMD processors come with the Powernow! technology [3]. There are certain CPU specific drivers which enable the change of the frequencies. For Intel processors both the drivers’, acpi-cpufreq and speedstep-centrino can be used, whereas for AMD processors powernow-k8 driver is used. (iii) In-kernel governors : These governors are built as kernel modules and they change the frequency of the processor depending on they the way they have been implemented, when selected from the user-space. The 5 governors implemented in Linux kernel 2.6.33.7 include: (a) Performance: This governor runs the CPU at the maximum frequency. (b) Powersave: This one runs the CPU at the minimum frequency. (c) Ondemand : This one decides the CPU frequency based on the current CPU usage. (d) Conservative: This one too decides the frequency based on the CPU usage, but unlike the Ondemand governor it changes the frequency in steps rather than changing it drastically.

4.2.2 Changing the frequency using the CPUfreq module

For implementing DVFS in ChronOS, we have used the CPUfreq module to scale the frequency. This module provides a function called cpufreq_driver_target(), defined in the file linux/drivers/cpufreq/cpufreq.c. Its prototype is as follows:

cpufreq_driver_target(struct cpufreq_policy *policy, unsigned int target_freq, unsigned int relation)

We briefly describe each argument this function takes.

policy: This argument provides the limits within which the CPU frequency can be set. The frequency to be set must lie between policy->min and policy->max.

target_freq: The frequency to which the CPU frequency is requested to be set.

relation: This argument specifies the relation between the requested target_freq and the actual frequency set. It can take two values:
CPUFREQ_RELATION_L: the actual frequency selected is higher than or equal to the target frequency.
CPUFREQ_RELATION_H: the actual frequency selected is lower than or equal to the target frequency.

For example, to set the processor frequency to 1.33 GHz, we can call this function as follows: cpufreq_driver_target(policy, 1330000, CPUFREQ_RELATION_H).
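The following C sketch shows how a scheduler-side helper might wrap this call; the clamping logic and the helper name are ours, while cpufreq_cpu_get()/cpufreq_cpu_put() and cpufreq_driver_target() are the 2.6.33-era kernel APIs.

/* Minimal sketch of a scheduler hook scaling the CPU via the CPUfreq
 * module (kernel 2.6.33-era API); the helper name and clamping are ours. */
#include <linux/cpufreq.h>

void chronos_set_cpu_freq(unsigned int cpu, unsigned int target_khz)
{
    struct cpufreq_policy *policy = cpufreq_cpu_get(cpu);

    if (!policy)
        return;
    /* Clamp to the policy limits before requesting the change. */
    if (target_khz < policy->min)
        target_khz = policy->min;
    if (target_khz > policy->max)
        target_khz = policy->max;

    /* Ask for the lowest supported frequency >= target_khz. */
    cpufreq_driver_target(policy, target_khz, CPUFREQ_RELATION_L);
    cpufreq_cpu_put(policy);
}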

4.2.3 Working of the CPUfreq module

Figure 4.2 shows a flowchart explaining the working of the CPUfreq module. When a frequency-change request comes to the CPUfreq module, it first inquires with all the registered drivers about the frequency range they can handle. If the new CPU frequency is out of that range, the CPUfreq subsystem adjusts it so that it falls within the range. It then notifies the registered drivers that the frequency is going to change, so that they can decide which parameters to alter in order to adjust to the new frequency. After that, the CPU frequency is changed by writing into the appropriate hardware registers of the CPU. Finally, the registered drivers are notified that the CPU frequency has changed, so that they can update the parameters they decided on earlier.


Figure 4.2: Working of CPUfreq module

4.2.4 Integration of the CPUfreq subsystem with ChronOS

Figure 4.3 shows the integration of the CPUfreq subsystem with ChronOS. An RT-DVFS algorithm needs to select a task to execute and the CPU frequency to execute it at. The ChronOS local scheduler rearranges the CRT-RQ based on the RT-DVFS scheduling algorithm chosen, and returns the head of the CRT-RQ to the Linux SCHED_FIFO scheduler for execution. At the same time, it determines the frequency of execution and uses the cpufreq_driver_target() function provided by the CPUfreq module to change the CPU frequency. The CPUfreq module in turn calls the CPU-specific driver, which is acpi-cpufreq in the case of the Intel I5 and powernow-k8 in the case of the AMD board. This driver then writes into the appropriate hardware registers inside the CPU to change the voltage and the frequency.

4.2.5 Implementation of the RT-DVFS Schedulers

In almost all the RT-DVFS schedulers described earlier, important computations depend on whether a task has just arrived or has just completed. Thus, determining whether a real-time task has arrived or completed was an important part of implementing the RT-DVFS schedulers in ChronOS. To do so, we make use of the global current pointer, defined as a per-CPU variable in arch/x86/include/asm/current.h. The current variable is updated in the context_switch() function, called inside the Linux scheduler function. As the ChronOS scheduler


Figure 4.3: RT-DVFS scheduling in ChronOS is called before the context switch takes place, the current pointer always points to the task which has just arrived or completed. This is because a ChronOS scheduler is called only at the scheduling events, which in the case of independent task-sets involve only 2 events, one being, when the task enters the system, and other being, when the task leaves the system. We add a new field to the real-time data structure inside ChronOS, called flag begin end check. There are two system calls provided by ChronOS with which a real-time application can indicate the beginning or the end of a real-time segment, these being begin rt seg and end rt seg. So in begin rt seg, we specify flag begin end check as 100 and in end rt seg, we specify it as 200 to differentiate between the two. So in the ChronOS scheduler we just check the value of this field of the current pointer to determine, if a task has arrived or completed so as to do the computations accordingly.

4.2.6 Modification of the real-time data structure for RT-DVFS support

In order to add DVFS support to ChronOS, we have added some additional fields to the real-time data structure inside ChronOS. Listing 4.1 shows all the fields of this data structure; in Section 4.2.6.1 we discuss only the additional fields that have been added to support RT-DVFS.

/* Structure attached to struct task_struct */
struct rt_info {
	/* Real-time information */
	struct timespec deadline;       /* monotonic time */
	struct timespec temp_deadline;  /* monotonic time */
	struct timespec period;         /* relative time */
	struct timespec left;           /* relative time */
	unsigned long exec_time;        /* WCET, us */
	int max_util;
	long local_ivd;
	long global_ivd;
	unsigned int seg_start_us;

	/* The following fields have been added for RT-DVFS support */
	int util_task_set;
	long load;
	int current_frequency;
	int flag_begin_end_check;
	unsigned long budget;
	struct mutex_data *mutex0;
	struct mutex_data *mutex1;
	int Soptavg;
	int high_freq;
	struct alpha_member a_member;

	/* Lists FIXME: convert to array of list heads */
	/* 0 is local, 1 is global ... should fix to pound defines */
	struct list_head task_list[2];
	struct list_entry list[SCHED_LISTS];

	/* DAG used by x-GUA class of algorithms */
	struct rt_graph graph;

	/* Lock information */
	struct mutex_head *requested_resource;
	struct rt_info *dep;

	/* Abort information */
	struct abort_info abort_info;

	/* Task state information */
	unsigned char flags;
	int cpu;
};

Listing 4.1: The ChronOS main real-time data structure

4.2.6.1 Additional fields for RT-DVFS support

util_task_set: The total utilization of the task-set to which the task belongs, calculated as Σ(i=1..n) Ci/Pi, where n is the total number of tasks in the task-set.

load: The ratio of the actual execution time to the period of the task, given by ACi/Pi.

current_frequency: The frequency at which the task was operating before its preemption.

budget: The time allocated to the task for execution. This field is required for the implementation of dynamic slack reclaiming algorithms such as Snowdon-min, DRA, and AGR.

mutex0, mutex1: These fields let the kernel know about the future resource requirements of the task. Required for the implementation of HS, DS and USFI_EDF.

high_freq: The static high speed calculated for the entire task-set in the presence of non-preemptible blocking sections. Required for the implementation of HS and DS.

Soptavg: This field specifies the average workload of the task-set to which the task belongs. Required for the implementation of AGR1 and AGR2.

flag_begin_end_check: Required to check whether the scheduler invocation took place because a task arrived or because it completed.

a_member: This structure is used for the α queue implementation, required by the DRA and AGR algorithms.

Chapter 5

Experimental Methodology

5.1 Platform Specifications

We have implemented the fourteen RT-DVFS algorithms on two hardware platforms. The first platform is an ASUS laptop with the Intel I5 processor. This processor has a rich set of 10 frequencies, as shown in Table 5.1. As this laptop has an Intel processor, it uses the Enhanced Intel SpeedStep technology [6] for scaling the processor voltage and frequency on-the-fly. The driver used is acpi-cpufreq. The other board we have used is an AMD Zacate mini-ITX motherboard, the GA-E350N-USB3 [2]. This processor can operate at 3 frequencies: 800 MHz, 1.28 GHz, and 1.6 GHz. As this board has an AMD processor, it uses the PowerNow! technology for DVFS. The driver it uses is powernow-k8.

5.2 Test Application

In order to test and evaluate our RT-DVFS algorithms, we have used a real-time test application called sched_test_app, developed by Dellinger et al. [22]. This user-space application takes a task-set file as input. The task-set file provides the WCET, period, and deadline values for each task. Users can specify the scheduling algorithm and the workload to be used, as well as parameters such as the percentage of CPU usage, the run time, and the length of the critical section in the case of dependent task-sets. Depending upon the number of tasks in the task-set, the application creates a thread for every task. Every periodic instance of a task uses the ChronOS API begin_rt_seg to enter its real-time segment. Depending upon the workload chosen, the real-time task either burns the CPU for its WCET (in the case of a CPU-intensive workload) or performs heavy memory accesses for its WCET (in the case of a memory-intensive workload). By calling end_rt_seg, it ends its real-time segment and sleeps until the start of its next period. Thus, this application provides a set of periodic real-time tasks to the ChronOS local scheduler, which then schedules these tasks depending upon the scheduling algorithm selected and the timing constraints the tasks are subjected to.
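To make this flow concrete, the following is a minimal sketch of one periodic task thread in the spirit of sched_test_app. The begin_rt_seg/end_rt_seg prototypes and the burn_cpu_us helper are illustrative assumptions, not the actual ChronOS user-space API.

    /* A minimal sketch of one periodic task thread, assuming
     * hypothetical prototypes for the ChronOS segment API and a
     * busy-loop helper. */
    #include <time.h>

    extern void begin_rt_seg(unsigned long wcet_us, struct timespec *deadline);
    extern void end_rt_seg(void);
    extern void burn_cpu_us(unsigned long us);  /* spin for `us` microseconds */

    static void periodic_task(unsigned long wcet_us, unsigned long period_us,
                              int jobs)
    {
        struct timespec next;

        clock_gettime(CLOCK_MONOTONIC, &next);
        for (int j = 0; j < jobs; j++) {
            /* The deadline of each job is the start of its next period. */
            next.tv_nsec += (long)period_us * 1000L;
            next.tv_sec  += next.tv_nsec / 1000000000L;
            next.tv_nsec %= 1000000000L;

            begin_rt_seg(wcet_us, &next);  /* enter the real-time segment */
            burn_cpu_us(wcet_us);          /* CPU-intensive workload: burn WCET */
            end_rt_seg();                  /* leave the real-time segment */

            /* Sleep until the start of the next period. */
            clock_nanosleep(CLOCK_MONOTONIC, TIMER_ABSTIME, &next, NULL);
        }
    }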

5.2.1 Modification to the Test Application for RT-DVFS

In order to test and evaluate the RT-DVFS algorithms, one more parameter is required: the actual execution time (ACET) of the real-time task, which is less than its WCET. We have modified sched_test_app so that we can vary this parameter. This means that even though the kernel is provided with the WCET of the task, the task actually executes, i.e., burns the CPU, only for its actual execution time. The actual execution time is expressed as a ratio of the WCET. In our experiments, we used values of ACET/WCET ranging from 0.1 to 1, in steps of 0.1.
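With this modification, the busy-loop duration is scaled by the ACET/WCET ratio while the kernel still sees the full WCET. A hypothetical fragment, reusing the names from the sketch in Section 5.2:

    /* Illustrative only: the job burns the CPU for ACET = ratio * WCET,
     * while begin_rt_seg still reports the full WCET to the kernel.
     * Example: wcet_us = 10000 (10 ms), acet_ratio = 0.3 -> 3000 us burned. */
    static unsigned long acet_us(unsigned long wcet_us, double acet_ratio)
    {
        return (unsigned long)(acet_ratio * (double)wcet_us);
    }

    /* In the periodic loop, burn_cpu_us(wcet_us) becomes:  */
    /*     burn_cpu_us(acet_us(wcet_us, acet_ratio));       */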

5.3 Real-time Measurements

The real-time measurements in our experiments include measuring the Deadline Satisfaction Ratio (DSR) and the Accrued Utility Ratio (AUR), given by Equations 5.1 and 5.2, respectively:

$$\mathrm{DSR} = \frac{\text{number of tasks that met their deadlines}}{\text{total number of tasks in the system}} \tag{5.1}$$

$$\mathrm{AUR} = \frac{\text{total accrued utility of the tasks that met their deadlines}}{\text{total possible accrued utility}} \tag{5.2}$$

DSR is the ratio of the number of tasks that met their deadlines to the total number of tasks in the system. Similarly, AUR is the ratio of the total accrued utility of the tasks that met their deadlines to the total possible utility that can be accrued in the system. Note that we measured AUR only for the utility accrual algorithms, such as EUA and REUA.
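As a small sketch of how these two metrics could be computed from per-run counters (the structure and field names below are hypothetical, not the actual measurement code):

    /* Hedged sketch: computing DSR and AUR (Equations 5.1 and 5.2)
     * from per-run counters. */
    struct run_stats {
        unsigned long tasks_total;   /* total task releases in the run */
        unsigned long tasks_met;     /* releases that met their deadline */
        double utility_accrued;      /* utility accrued by deadline-meeting tasks */
        double utility_possible;     /* maximum utility the run could accrue */
    };

    static double dsr(const struct run_stats *s)
    {
        return (double)s->tasks_met / (double)s->tasks_total;  /* Eq. 5.1 */
    }

    static double aur(const struct run_stats *s)
    {
        return s->utility_accrued / s->utility_possible;       /* Eq. 5.2 */
    }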

5.4 Power Measurements

In this section, we discuss the techniques we have used for the power measurements. Before that, we discuss the idle and performance states of the CPU in Section 5.4.1.

5.4.1 Performance and Idle states of the CPU

The ACPI specification [1] defines the following idle and performance states of the CPU:

P-state: This is the performance state of the CPU. The processor is active when it is in this state. The number of P-states supported by a CPU is CPU-specific. The processor operates at a different frequency/voltage pair in each P-state, and thus consumes a different amount of power in each state. The lower the P-state, the higher the frequency at which the CPU operates, and hence the more power it consumes. The P-states, the frequency of operation, and the power consumed in each P-state for the Intel I5 processor are given in Table 5.1.

C0 state: This is the operating state of the CPU. The CPU is executing in one of the P-states when in this state. The power consumed in this state depends on the P-state in which the CPU is operating.

C1 state: This is the first idle state. In this state, only the CPU's main internal clocks are halted through software. In the Intel I5, the CPU consumes 1000 mW when in this state.

C2 state: In this state, only the CPU's main internal clocks are halted, via hardware, and the CPU takes longer to wake up from this state. For the Intel I5, the power consumed in this state is 500 mW.

C3 state: In this state, most parts of the processor, such as the caches, are stopped. As a result, the processor is no longer cache coherent in this state. It takes still longer to wake up from this state. The power consumed in this state by the Intel I5 is 300 mW.
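For reference, the current P-state frequency can be observed from user space through the standard Linux cpufreq sysfs interface; the following minimal sketch reads it for cpu0 (error handling is kept to a minimum).

    /* Minimal sketch: read the current operating frequency (kHz) of
     * cpu0 through the standard Linux cpufreq sysfs interface. */
    #include <stdio.h>

    static long current_freq_khz(void)
    {
        FILE *f = fopen("/sys/devices/system/cpu/cpu0/cpufreq/scaling_cur_freq", "r");
        long khz = -1;

        if (f) {
            if (fscanf(f, "%ld", &khz) != 1)
                khz = -1;
            fclose(f);
        }
        return khz;
    }

    int main(void)
    {
        printf("cpu0 is running at %ld kHz\n", current_freq_khz());
        return 0;
    }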

5.4.2 System Power Measurements

We measured the total system power of the AMD Zacate board using the Fluke 289 true-RMS multimeter [7]. This is a high-resolution multimeter, with a resolution of 0.5 mA. The multimeter has an averaging mechanism, which computes the average of the current measurements over an interval of time while it is measuring. We made a slit in the cord of the power supply and attached the multimeter in series to measure the current. We obtained the RMS value of the average current and multiplied it by the RMS voltage, which is 120 V, to get the average power consumed in that time interval.
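As a worked example with a made-up current reading (the 0.25 A figure is purely illustrative): if the averaged RMS current over the interval is 0.25 A, then

$$P_{\mathrm{avg}} = V_{\mathrm{rms}} \times I_{\mathrm{rms}} = 120\,\mathrm{V} \times 0.25\,\mathrm{A} = 30\,\mathrm{W}.$$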


Table 5.1: P-States in the Intel I5 processor

    P-State    Frequency (MHz)    Power (mW)
    P0         2400               25000
    P1         2399               25000
    P2         2266               23316
    P3         2133               21689
    P4         1999               20116
    P5         1866               18531
    P6         1733               17021
    P7         1599               15517
    P8         1466               14068
    P9         1333               12640
    P10        1199               11250

5.4.3 Normalized and Actual CPU power measurement

We have used the CPUpower tool [5] to obtain the normalized and the actual CPU power measurements. When a user application is passed to it as input, this tool reports the average frequency in the active state, as well as the percentage of time spent in the respective performance states (P-states) and idle states (C-states) for the duration the application runs. The output of the CPUpower tool is as follows:

    |Nehalem                || Mperf              || Idle_Stats
    CPU | C3   | C6   | PC3  | PC6  || C0   | Cx   | Freq ||
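Given such residency percentages and the per-state power figures (Table 5.1 and the C-state numbers in Section 5.4.1), the average CPU power can be estimated as a residency-weighted sum of per-state power. The sketch below uses made-up residency fractions purely for illustration.

    /* Sketch: estimate average CPU power as a residency-weighted sum
     * of per-state power. Per-state power values are the Intel I5
     * figures quoted earlier; the residencies are made-up examples. */
    #include <stdio.h>

    int main(void)
    {
        double t_p0  = 0.20;  /* fraction of time active at P0 (25000 mW)  */
        double t_p10 = 0.30;  /* fraction of time active at P10 (11250 mW) */
        double t_c3  = 0.50;  /* fraction of time idle in C3 (300 mW)      */

        double p_avg_mw = t_p0 * 25000.0 + t_p10 * 11250.0 + t_c3 * 300.0;

        printf("estimated average CPU power: %.0f mW\n", p_avg_mw); /* 8525 */
        return 0;
    }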
