Avoiding Blocking System Calls in a User-Level Thread Scheduler for Shared Memory Multiprocessors

By: Andrew Borg Supervisor: Dr. K. Vella Observer: Mr. K. Debattista

Department of Computer Science and AI University of Malta June 2001

Submitted in partial fulfillment of the requirements for the degree of B.Sc. I.T. (Hons.)

Abstract

SMP machines are frequently used to perform heavily parallel computations. The multithreading paradigm has proved suitable for exploiting SMP architectures. In general, application developers use a thread library to write such a program. This library schedules threads itself or relies on the operating system kernel to do so. However, both of these approaches pose a number of problems. This dissertation describes the extension of two existing user-level thread schedulers, one for uniprocessors and one for SMPs, to enable the execution of blocking system calls without blocking the scheduler kernel. In order to do this we make use of an operating system extension called scheduler activations. The usefulness of avoiding blocking system calls is apparent in server applications that carry out several I/O operations and which need to handle multiple clients simultaneously. A web server that is capable of dispensing static pages and which implements the HTTP 1.0 protocol is also implemented to demonstrate the effectiveness of this approach.

Acknowledgements

This B.Sc. dissertation was carried out during the 2000-2001 academic year under the Department of Computer Science and AI at the University of Malta. I would like to acknowledge the assistance of my supervisor Dr. Kevin Vella, whose guidance and knowledge of the subject matter enabled me to successfully complete this project. His availability, encouragement and patience transcend that expected of any supervisor. I thank Mr. Kurt Debattista and Mr. Joe Cordina for their time and help, which I knew I could count on at any time of the day. Their responsibilities in no way obliged them to provide me with any form of assistance, and yet their support proved crucial throughout the whole project. I thank Joe in particular for helping out at the end of the project to obtain the final results. His knowledge of networking and the Linux operating system was vital in obtaining the results for the web server tests. Kurt I thank for his concrete assistance in proofreading all of this document, in helping me to understand the thread schedulers and also in suggesting solutions to several implementation problems. Various friends have been ready to oblige when I needed company, recreation or that little extra support, and for this I am grateful. I thank Wallace, who was always ready to provide that extra help during the harder patches of these past nine months, and for the ideas he passed on when discussing our projects. Last but not least, I would like to express my profound gratitude to my mother and father, who were unfailing in their unconditional love, support and encouragement, not just throughout this project but throughout my whole life.


Contents

1 Introduction
  1.1 Overview
    1.1.1 Blocking System Calls
    1.1.2 Kernel support - Scheduler activations
  1.2 Aims and objectives
  1.3 Structure of this document

2 Background
  2.1 General interests
    2.1.1 History of concurrency
    2.1.2 Uniprocessor and multiprocessor systems
  2.2 Classic thread implementations
    2.2.1 From processes to kernel threads
    2.2.2 From kernel threads to user threads
    2.2.3 The problem of user threads - Poor system integration
  2.3 Blocking system calls

3 Survey
  3.1 Two-level hybrid library solutions
    3.1.1 More kernel threads - A simple many-to-many model
    3.1.2 SIGWAITING in Solaris
  3.2 Non-cooperative systems
    3.2.1 Wrappers
    3.2.2 Using a call-back mechanism
  3.3 Cooperative systems
    3.3.1 Counting active kernel threads
    3.3.2 Process control
  3.4 Scheduler activations
    3.4.1 The idea behind scheduler activations
    3.4.2 Activations and blocking system calls
    3.4.3 Comparing cooperative techniques
  3.5 Scheduler activations with polled unblocking

4 Integrating scheduler activations with a user-level thread scheduler for uniprocessors
  4.1 Uniprocessor smash - a user-level scheduler for uniprocessors
    4.1.1 Internal scheduler functions for uniprocessor smash
    4.1.2 The API for uniprocessor smash
  4.2 Scheduler activations with synchronous unblocking
    4.2.1 The scheduler activations API for synchronous unblocking
    4.2.2 Implementation considerations
    4.2.3 Saving blocked activation information
    4.2.4 Saving unblocked activation information
    4.2.5 Coding act_new()
    4.2.6 Coding act_block()
    4.2.7 Coding act_unblock()
    4.2.8 Modifying sched_dequeue()
    4.2.9 A typical race condition
  4.3 Scheduler activations with polled unblocking
    4.3.1 The scheduler activations API for polled unblocking
    4.3.2 Implementation considerations
    4.3.3 Maintaining activation information
    4.3.4 Coding act_new()
    4.3.5 Coding act_unblock()
    4.3.6 Modifying sched_dequeue()
    4.3.7 Modifying sched_yield()
  4.4 Comparing the two implementations
    4.4.1 A difference in paradigm
    4.4.2 A difference in API

5 Integrating scheduler activations with a user-level thread scheduler for SMPs
  5.1 SMP smash - a user-level scheduler for SMPs
    5.1.1 Introducing spin locks
    5.1.2 An Overview of operation
    5.1.3 Internal scheduler functions
    5.1.4 The API for uniprocessor smash
  5.2 Integrating SMP smash with scheduler activations
    5.2.1 Replacing semaphores
    5.2.2 The lost-wake-up problem
    5.2.3 Solving the lost-wake-up problem with double system calls
    5.2.4 Maintaining activation information
  5.3 Implementation of upcalls and scheduler functions
    5.3.1 Coding activations_wait()
    5.3.2 Modifying scheduler_in()
    5.3.3 Coding act_new()
    5.3.4 Modifying cthread_yield()
    5.3.5 Coding act_unblock()
  5.4 Bugs in the kernel patch
  5.5 Conclusion

6 ActServ - a multithreaded web server
  6.1 Benchmark Application
  6.2 Uniprocessor ActServ
    6.2.1 Implementing ActServ for a uniprocessor machine
    6.2.2 Comparing ActServ to Serv
    6.2.3 Comparing ActServ to Apache
  6.3 SMP Serv
    6.3.1 Implementing Serv for an SMP machine
    6.3.2 Comparing Serv to Apache
  6.4 Conclusion
    6.4.1 User threads vs kernel threads and processes
    6.4.2 The advantages of using scheduler activations

7 Conclusion
  7.1 Results and achievements
    7.1.1 Uniprocessor schedulers with activations
    7.1.2 SMP scheduler with activations
    7.1.3 ActServ
  7.2 Possible extensions
    7.2.1 Extensions for the uniprocessor schedulers with activations
    7.2.2 Extensions for the SMP scheduler with activations
    7.2.3 Extensions for the ActServ
  7.3 Final remarks

List of Figures

  3.1 User and kernel thread allocation
  3.2 A blocking system call with activations
  3.3 Synchronous and Asynchronous Unblocking
  4.1 Uniprocessor circular run queue architecture
  4.2 Maintaining activation information (1)
  4.3 Maintaining activation information (2)
  4.4 Maintaining activation information (3)
  5.1 Shared run queue architecture
  6.1 Comparing ActServ and Serv - Very low network load
  6.2 Comparing ActServ and Serv - Low network load
  6.3 Comparing ActServ and Serv - Average network load
  6.4 Comparing ActServ and Apache on a uniprocessor
  6.5 Comparing Serv and Apache on a 4-processor SMP

List of Algorithms

  1  Uniprocessor version 1 - Algorithm for act_new()
  2  Uniprocessor version 1 - Algorithm for act_block()
  3  Uniprocessor version 1 - Algorithm for act_unblock()
  4  Uniprocessor version 1 - Algorithm for sched_dequeue()
  5  Uniprocessor version 2 - Algorithm for act_new()
  6  Uniprocessor version 2 - Algorithm for act_unblock()
  7  Uniprocessor version 2 - Algorithm for sched_dequeue()
  8  Uniprocessor version 2 - Algorithm for sched_yield()
  9  Lost-wake-up problem - (sleep() operation)
  10 Lost-wake-up problem - (wake() operation)
  11 System call solution to lost-wake-up problem - (sleep() operation)
  12 System call solution to lost-wake-up problem - (wake() operation)
  13 SMP Algorithm for activations_wait()
  14 SMP Algorithm for scheduler_in()
  15 SMP Algorithm for act_new()
  16 SMP Algorithm for cthread_yield()
  17 SMP Algorithm for act_unblock()
  18 SMP Algorithm for act_unblock() from cthread_yield()
  19 SMP Algorithm for act_unblock() by ACT_UNBLK_IDLE
  20 SMP Algorithm for act_unblock() by ACT_UNBLK_KERNEL

Chapter 1

Introduction

Multithreading is the mechanism of choice in several approaches to parallel programming. Threading support for an application can be provided either by the kernel in the form of kernel threads or by a thread library that operates in user space. Kernel threads rely heavily on kernel resources and are therefore not suitable for fine-grain parallel applications. User threads are more efficient than kernel threads because they do not rely on kernel resources for scheduling, communication and synchronisation. However, since user threads are not recognised by the operating system kernel as independent threads of execution, they lack kernel support.

1.1 Overview

In this dissertation we shall be combining two techniques that together allow parallel applications to make full use of the underlying hardware whilst incurring minimal overheads. The first of these techniques involves using a user-level multithreading paradigm to run applications. This makes for a more natural specification and design of parallel programs while at the same time maintaining efficiency. The second involves modifying the kernel to provide support to the user level. User-level multithreaded applications that perform blocking system calls are particularly vulnerable when no kernel support is provided.


1.1.1 Blocking System Calls

When a thread becomes blocked in an I/O operation or through some other blocking system call, the underlying kernel thread also blocks. Therefore, all other user threads within the application are unable to use the kernel thread to make use of the system's processors. This is a direct result of the opaque nature of user threads from the kernel's point of view. Applications that frequently perform blocking system calls, such as web servers, cannot be efficiently implemented using user-level threads. Some mechanism is required to prevent system calls from blocking the application's kernel threads.

1.1.2 Kernel support - Scheduler activations

Acquiring kernel support is one way to allow other user threads to execute while a thread waiting for a service from the kernel remains blocked. Anderson's scheduler activations [1] provide this support by creating a new kernel thread and notifying the user level when a user thread blocks. In this way, other user threads can be run instead of having the application block in the kernel.

1.2 Aims and objectives

In this dissertation we aim to show that by using the kernel support provided through scheduler activations, the proposition of using user threads for real life applications that require several I/O operations and other blocking system calls becomes a valid one. This is achieved by modifying two existing thread schedulers, one for uniprocessors and one for SMPs, to take advantage of kernel support through scheduler activations. Two different kernel patches are used. The first is integrated with the uniprocessor scheduler while the second is integrated into both the uniprocessor and SMP schedulers. A web server that implements the HTTP 1.0 protocol and can dispense static web pages is also presented in order to test and compare the performance of the thread schedulers using activations with that of the original thread schedulers and the widely-used Apache web server [26].


1.3 Structure of this document

Chapter 2 introduces user-level threads and discusses why they pose a problem for the operating system in its effort to provide maximum performance. In Chapter 3 a number of solutions to this problem are evaluated. We shall then select one solution and use it to implement two schedulers, one for a uniprocessor architecture (Chapter 4) and one for an SMP architecture (Chapter 5). Finally, Chapter 6 deals with the design and implementation of a web server and benchmark web client used to test the schedulers. Performance comparisons between this web server and Apache will also be made. All implementation work for the schedulers was done on a four-processor machine with Intel Xeon processors and 256MB of RAM. The operating systems used were Linux 2.2.12 and Linux 2.2.17, both patched to provide scheduler activations support.


Chapter 2

Background

2.1 General interests

2.1.1 History of concurrency

In earlier versions of UNIX, support for concurrent programming was cumbersome. The situation was worse on PCs in the 80s with the standard operating system of this platform, DOS, not providing any support for concurrency whatsoever. This made the design and execution of parallel programs difficult. The solution lay in a new structuring by which these programs could be defined as a series of independent sequential threads of execution that communicate and cooperate at a number of discrete points. Multithreading is the vehicle for concurrency in many approaches to parallel programming. A thread is essentially an encapsulation of the flow of control in a program.

2.1.2 Uniprocessor and multiprocessor systems

The structural independence of each thread in an application lends itself naturally to the idea of using more than one processing element to execute the program. On a uniprocessor, thread programming has just two advantages. The first is a more natural specification and design of applications. The second is the achievement of apparent concurrency through techniques such as timeslicing. However, by using threads on a multiprocessor, in particular Symmetric Multi-Processors (SMPs), we can have true parallelism, with different threads within the same application being physically executed at the same time on different CPUs. This means that we can have speedups on N-processor machines of a factor of up to N. (It is sometimes possible to achieve speedups greater than the number of processors because of caching effects; such speedups are termed super-linear.)

2.2 Classic thread implementations

2.2.1 From processes to kernel threads

The notion of a thread, as a sequential flow of control, dates back at least to 1965, the Berkeley Timesharing System being the first instance. At the time they were not called threads, but were defined as processes by Dijkstra [11]. Processes interacted through shared variables, semaphores, and similar means. The early 1970s saw the birth of the UNIX operating system. The UNIX notion of a process became a sequential thread of control plus a virtual address space. Thus, processes in the UNIX sense are quite heavyweight. Since they cannot share memory (each process has its own address space), they interact through pipes, signals, etc. Shared memory (also a rather ponderous mechanism) was added much later. Processes were designed for multiprogramming in a uniprocessor environment, making them suited for coarse-grain parallelism but not for general purpose parallel programming. This led to the introduction of kernel threads, which were nothing more than old-style processes that shared the address space of a single UNIX process. They were also called 'lightweight processes', by way of contrast with 'heavyweight' UNIX processes. This distinction dates back to the very late 70s and early 80s. Lightweight processes (LWPs) are often referred to as kernel threads because it is the kernel that is responsible for all creation, destruction, synchronisation and scheduling activities. The advantage of kernel threads is that since the kernel is aware of these threads, all scheduling policies that hold true for processes can be applied to LWPs, including true timeslicing.


2.2.2 From kernel threads to user threads

The major disadvantage of kernel threads is that they are limited to the functionality provided by the system's kernel. For example, the application developer is bound by the scheduling constructs provided by the kernel's threads library. Some applications require non-standard properties for their threads that are not offered by the kernel. Kernel threads, though lightweight when compared to processes, are still heavily dependent on kernel resources. User threads provide an alternative to kernel threads and instead operate on top of kernel threads. In an application that utilises user threads, the threads of the application are managed by the application itself. In this way, functionality and scheduling policy can be chosen according to the application. These user threads are much more efficient than kernel threads in carrying out operations such as context switching, since no kernel intervention is necessary to manipulate threads. The performance of kernel threads, although an order of magnitude better than that of traditional processes, has typically been an order of magnitude worse than the best-case performance of user-level threads [1].

2.2.3 The problem of user threads - Poor system integration

As a consequence of the lack of kernel support in existing operating systems, it is often difficult to implement user-level threads that have the same level of integration with system services as is available with kernel threads. The kernel, in turn, is not aware of the multiple user threads created on top of kernel threads. This situation leads to two related characteristics of kernel threads that cause difficulty, namely that:

• kernel threads are scheduled obliviously with respect to user-level thread state.

• kernel threads block and resume without notification to the user level.

The first problem is apparent when scheduling user threads that have different priority levels. The kernel is unable to distinguish between these priority levels and therefore a user thread with high priority might be preempted in order to execute a user thread with a lower priority. A worst-case scenario would be the kernel preempting a kernel thread that is running a user thread holding a lock. It might then give processor time to another kernel thread that has a user thread spinning and waiting for the lock to be released.

The second problem concerns our main interest and is how to deal with the blocking and resumption of kernel threads in a user-level threads library. Many of the solutions described in Chapter 3 do have repercussions for (and in some cases offer solutions to) the scheduling and preemption problems described above. When this is so, we shall discuss them briefly. However, we shall concentrate mostly on how to deal with the blocking and unblocking of the kernel threads used to run the user-level threads.

2.3 Blocking system calls

When an application requires a service from the kernel, it does this by means of a system call. System calls provide the interface between user-level applications and the kernel. They are often used when the application requires the services of the underlying system hardware, such as a network interface card or an I/O device. The operating system services the request and returns a result depending on the definition of the system call. It is, however, sometimes impossible for the kernel to immediately service a request because the underlying hardware is not ready to provide the required service. For example, if a read() system call is made from an input device such as a disk, the device may still be busy reading the data from its medium. In this case, the kernel puts the kernel thread to sleep by removing it from its run queue, and services other threads. It is only when the request is finally fulfilled that the thread is unblocked and placed back onto the kernel run queue.

In any application that uses a system where the number of kernel threads is the same as the number of processors (as is true for the user-level thread libraries we shall be using), a situation may arise where the number of running kernel threads is less than the number of available processors. This under-utilisation of resources means that there could very well be user threads that are waiting to be run and processors that are idling whilst waiting for threads to unblock. This is a condition that arises because of the kernel's inability to recognise user threads and the lack of communication between the operating system kernel and the user-level scheduler.
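As a concrete illustration of the problem, consider the sketch below. The thread-creation call shown is hypothetical and is used only for this example; it is not the API of any library described in this dissertation. Every user thread here is multiplexed onto a single kernel thread, so one blocking read() stalls them all.

    /* Illustrative sketch only: thread_spawn() is a hypothetical user-level
     * thread API introduced for this example. */
    #include <unistd.h>

    extern void thread_spawn(void (*fn)(void *), void *arg);   /* hypothetical */
    extern int fd;   /* an already-open descriptor, e.g. a socket */

    static void reader_thread(void *arg)
    {
        char buf[512];
        (void)arg;
        /* If no data is ready, the kernel puts the underlying kernel thread to
         * sleep.  The user-level scheduler runs on that same kernel thread, so
         * compute_thread() below cannot be scheduled, even though there is
         * work to do and a processor sitting idle. */
        read(fd, buf, sizeof buf);
    }

    static void compute_thread(void *arg)
    {
        (void)arg;
        /* Ready to run, but starved for as long as reader_thread() is blocked. */
    }

    void start_threads(void)
    {
        thread_spawn(reader_thread, 0);
        thread_spawn(compute_thread, 0);
    }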

In this chapter we described how threads provide the vehicle for concurrency in parallel applications. We introduced the two traditional types of threads - kernel threads and user threads. We discussed what the advantages and disadvantages of these are and how communication between them could improve performance, particularly in applications that use blocking system calls. In Chapter 3 we shall be describing a number of solutions that exist to deal with this problem. We will look specifically at how each solution attempts to utilise CPU resources while waiting for blocked threads to become unblocked.


Chapter 3

Survey

The solutions presented in this chapter require an increase in programming complexity or effort, either at the operating system level, at the user thread library level or at the application level. Moreover, each solution has advantages and disadvantages that need to be considered before deciding which technique to implement in our user-level scheduler.

3.1 Two-level hybrid library solutions

A two-level library is one that achieves parallelism both from kernel threads and from user threads. All user threads need to operate on kernel threads, but in two-level libraries the number of kernel threads is greater than the number of processors. Figure 3.1 shows this arrangement. The user threads are multiplexed across the kernel threads, which are in turn multiplexed across the CPUs. The kernel threads thus become 'virtual processors' that appear to the user-level thread scheduler to be physical processors.

3.1.1 More kernel threads - A simple many-to-many model

Using a two-level hybrid library is in itself a solution to avoiding blocking system calls blocking the whole application. When a user thread blocks, the kernel thread which acts as its virtual processor also blocks. When this occurs, the operating system simply switches to another user thread running on another kernel thread and thus maintains full utilisation of the processors. Therefore, since there are more kernel threads than processors, when kernel threads block they are replaced by the surplus of kernel threads. The main advantage of this system is that it is quite simple to implement. The user thread library simply creates a number of extra kernel threads and dispatches the user threads to these kernel threads. No changes are required either in the operating system or in the application. Using such a system also automatically makes the thread library suitable for SMP machines. Solaris [23] on UNIX platforms and MARCEL [12] in PM2 [13] are two examples of such libraries.

Figure 3.1: User and kernel thread allocation


The main problem with a two-level library system is in deciding how many kernel threads to have. The size of the kernel thread pool has a critical impact on the performance of the many-to-many model. If the number of kernel threads in the pool is nearly equal to the number of user threads, the implementation will act much like the one-to-one model. Therefore, a few blocking system calls can block all of the kernel threads, thereby blocking the whole application. On the other hand, if too many are created, processor time is wasted as the kernel carries out context switches between the kernel threads.

3.1.2 SIGWAITING in Solaris

One solution, implemented in the Solaris operating system through the SIGWAITING signal, prevents all of the kernel threads in a two-level hybrid system from blocking and thereby blocking the whole application. When the kernel realises that all of an application's kernel threads are blocked, it drops a SIGWAITING signal on the process. Upon receipt of the signal, the user-level scheduler decides whether or not to create a new kernel thread, on the basis of the number of runnable user threads.

The SIGWAITING mechanism makes no guarantees about optimal use of kernel threads on a multiprocessor. Specifically, an application may have runnable user-level threads awaiting processor time but fewer unblocked kernel threads than processors on which to run them. In such circumstances, the application does not receive a SIGWAITING until all kernel threads are blocked. Thus, even if there are processors available and work to be done, the SIGWAITING mechanism does not guarantee that there is a sufficient number of kernel threads to run the user threads on the available processors.
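The sketch below illustrates the shape of this mechanism; it is not Solaris's actual implementation (which lives inside its thread library), and the helper functions are assumptions made for the example. SIGWAITING itself is a Solaris-specific signal and does not exist on Linux.

    /* Illustrative sketch of a SIGWAITING-style reaction, under stated
     * assumptions: runnable_user_threads() and add_kernel_thread() are
     * hypothetical library internals.  A real implementation would defer the
     * thread creation rather than perform it inside a signal handler. */
    #include <signal.h>

    extern int  runnable_user_threads(void);   /* hypothetical */
    extern void add_kernel_thread(void);       /* hypothetical */

    static void sigwaiting_handler(int sig)
    {
        (void)sig;
        /* All of the application's kernel threads are blocked; add another
         * one only if there are user threads queued and waiting to run. */
        if (runnable_user_threads() > 0)
            add_kernel_thread();
    }

    void install_sigwaiting_handler(void)
    {
        struct sigaction sa;
        sa.sa_handler = sigwaiting_handler;
        sigemptyset(&sa.sa_mask);
        sa.sa_flags = 0;
        sigaction(SIGWAITING, &sa, 0);   /* SIGWAITING: Solaris-specific signal */
    }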

Two-level thread libraries provide a simple solution to avoid blocking system calls blocking a whole application. However, they have disadvantages in terms of performance because of the lack of cooperation and information sharing between the user and kernel levels. They provide a solution that is not scalable for applications that constantly make blocking system calls (such as web servers).

3.2 Non-cooperative systems

It is possible to have systems that keep the number of kernel threads equal to the number of processors, receive no extra support from the kernel on scheduling decisions, and still prevent blocking system calls from blocking an application. We now present two such solutions.

3.2.1 Wrappers

A solution proposed by Vella [7] is to 'wrap' each blocking system call with code. This code will spawn a new kernel thread before blocking and destroy or save it for reuse after unblocking. Therefore, whilst calling one of these system calls would normally block the process, the wrapping routine ensures that only the calling thread is blocked, leaving the process available to execute other threads. An example of such a package is the DCE threads library [24], which is an implementation of the POSIX 1003.4a Standard. It is common for such a library to have some system calls that are wrapped and others that are not. For example, in DCE, read(), write(), open(), socket(), send(), and recv() are wrapped whilst wait(), sigpause(), msgsnd(), msgrcv(), and semop() are not wrapped. Another interesting example of wrappers is that implemented by Barnes [18] for KRoC [10], an occam compiler. The system only works on uniprocessor machines. It nevertheless demonstrates the effectiveness of wrappers in ensuring that an application does not block because of blocking system calls. As a demonstration, Barnes also builds a web server, occserv, and compares it to the Apache web server. While occserv uses occam's fine-grain thread scheduler, Apache uses system processes to achieve concurrency. Results show an improvement in service times, requests per second, concurrency levels and throughput. The behaviour of occserv is also smoother in comparison to the rather erratic behaviour of Apache.

The obvious disadvantage of such a solution is the need to wrap every blocking system call. In some operating systems this runs into the hundreds. If more blocking system calls are introduced in later operating system versions, the threads package will need to be upgraded accordingly. Another disadvantage is that a new kernel thread is always created with a blocking system call. This may not be necessary as the call may be serviceable immediately by the kernel and thus not need to block. An example may be a read() call that is serviced from a fast cache. An advantage of this system is that no changes are required in the kernel, making distribution of the library easier.
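A minimal sketch of the wrapping idea follows. The helpers for parking and resuming user threads are assumptions made for this example, not the API of DCE threads, KRoC or smash; the point is only that the blocking call is handed to a dedicated kernel thread so that the scheduler's own kernel thread stays runnable.

    /* Sketch of a wrapped read() under stated assumptions: thread_self(),
     * thread_block_current() and thread_resume() are hypothetical internals
     * of a user-level thread library. */
    #include <pthread.h>
    #include <unistd.h>

    extern void *thread_self(void);                 /* hypothetical: current user thread */
    extern void  thread_block_current(void);        /* hypothetical: deschedule until resumed */
    extern void  thread_resume(void *user_thread);  /* hypothetical: make it runnable again */

    struct read_req {
        int fd; void *buf; size_t count; ssize_t result;
        void *waiter;                               /* user thread to wake afterwards */
    };

    static void *blocking_worker(void *arg)
    {
        struct read_req *req = arg;
        req->result = read(req->fd, req->buf, req->count);  /* only this kernel thread blocks */
        thread_resume(req->waiter);
        return 0;
    }

    ssize_t wrapped_read(int fd, void *buf, size_t count)
    {
        struct read_req req = { fd, buf, count, -1, thread_self() };
        pthread_t worker;

        pthread_create(&worker, 0, blocking_worker, &req);
        pthread_detach(worker);
        /* Other user threads run here; a real wrapper must guard against the
         * worker calling thread_resume() before this thread has actually
         * blocked (the lost-wake-up problem discussed in Section 5.2.2). */
        thread_block_current();
        return req.result;
    }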

3.2.2 Using a call-back mechanism

Another solution is to never have system calls that block in the kernel. When the application makes a blocking system call, it immediately regains control; when the system call completes, a 'call-back' is made to inform the application. This system is implemented in Microsoft Windows 95 and onwards as Asynchronous Procedure Calls (or APCs) [28]. Windows in fact follows an event-driven paradigm. Interestingly, this is what actually happens in every kernel: if a kernel thread were to actually block, the processor would block and other runnable kernel threads would not execute. What actually happens is that the kernel removes the thread from its run queue, giving the user the impression that the thread has blocked. This system is efficient but places a lot of responsibility on the application developer. It requires that the developer consider another layer of parallelism above the notion of threads. Within the same thread, the developer must consider that the kernel could be servicing a blocking request while the thread continues to execute a sequence of code. When this blocking request does complete, the developer must then decide how the application is to handle it within the APC, and finally return to executing the previous code. Windows, being completely event-driven, lends itself naturally to such a system. Linux, on the other hand, does offer non-blocking versions of system calls but, not being event-driven, makes development far more complex when using them.
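The Windows APC interface itself is not shown here; as an illustration of the same call-back idea on a POSIX system, the sketch below uses POSIX asynchronous I/O, where aio_read() returns immediately and a completion routine is invoked later. The file name is an assumption made for the example.

    /* Call-back style I/O using POSIX AIO (an illustration of the idea, not
     * the Windows APC mechanism). */
    #include <aio.h>
    #include <fcntl.h>
    #include <signal.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    static char buf[4096];
    static struct aiocb req;

    static void on_complete(union sigval sv)
    {
        struct aiocb *r = sv.sival_ptr;
        printf("read completed: %zd bytes\n", aio_return(r));
    }

    int main(void)
    {
        int fd = open("data.txt", O_RDONLY);
        if (fd < 0)
            return 1;

        memset(&req, 0, sizeof req);
        req.aio_fildes = fd;
        req.aio_buf    = buf;
        req.aio_nbytes = sizeof buf;
        req.aio_sigevent.sigev_notify          = SIGEV_THREAD;
        req.aio_sigevent.sigev_notify_function = on_complete;
        req.aio_sigevent.sigev_value.sival_ptr = &req;

        aio_read(&req);    /* returns immediately: the caller keeps running */
        /* ... do other useful work here; on_complete() fires when the data is in ... */
        sleep(1);          /* crude wait so the callback can run in this demo */
        close(fd);
        return 0;
    }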

The two systems described above would maintain as many kernel threads as there are processors. In this way no processor is ever left idle when there is a thread to run and time is not wasted on context switching in the kernel.

3.3 Cooperative systems

In cooperative systems, the kernel can pass information to the user level through an asynchronous or synchronous interface. The user-level application can then use this information to optimise its operations.

3.3.1 Counting active kernel threads

If a thread library in a system is to maintain an optimum relationship between the number of kernel threads and processors on that system, it must have information on the number of active kernel threads. This approach was proposed by Inohara and Masuda [14]. They contend that hybrid libraries share three characteristics that cause unnecessary vertical and horizontal context switching on multiprogrammed systems, namely that:

• the user-level programs (or the thread library) determine how many kernel threads to use.

• the kernel does not inform the user level of which or how many kernel threads are actually being assigned to processors.

• interaction between the kernel and user-level schedulers is synchronous.

They present a mechanism that aims to minimise thread switching overhead, be it horizontal or vertical switching. Horizontal context switching is context switching between threads at the same level, that is, switching between user threads in user space or kernel threads in kernel space. Vertical context switches are switches from user level to kernel level (system calls), or vice versa (upcalls, which we shall describe in Section 3.4.1). First, it is the kernel scheduler that decides how many kernel threads to use in each address space. This minimises the number of horizontal switches in the kernel scheduler. Second, the kernel scheduler lets user-level schedulers know which kernel threads are actually being assigned processors. Third, and most important, all interaction between the kernel scheduler and user-level schedulers is done asynchronously. The information on the scheduling status of the kernel threads is passed through an asynchronous interface using a shared memory area between user space and the kernel. As this interaction is asynchronous, and synchronisation between the kernel and user-level schedulers necessarily involves horizontal and vertical switching, unnecessary switching is removed. In Section 3.4.3 we shall compare this method with scheduler activations, which use a synchronous system of cooperation between the user and kernel levels.

Of particular relevance is how this system controls the number of blocked threads in the system. The kernel writes to the shared memory area a count of the number of active kernel threads for the application. In this way, the thread library is able to create and destroy kernel threads as required. It is the responsibility of the threads package programmer and application developer to ensure that the information in this shared memory area is polled at appropriate times. This will optimise the number of running kernel threads in relation to the number of user threads and processors. Inohara produced results which were an improvement on any system, including scheduler activations as originally proposed by Anderson [1]. This is because the third characteristic described above still holds true for scheduler activations. We shall discuss this after introducing scheduler activations.
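A sketch of this polled, shared-memory interface is given below. The layout of the shared area and the helper functions are assumptions made for illustration, not Inohara and Masuda's actual interface.

    /* Sketch under stated assumptions: the shared area is presumed to be
     * mapped by the kernel at startup, and the helpers are hypothetical. */
    struct shared_area {
        volatile int active_kernel_threads;    /* written by the kernel */
    };

    extern struct shared_area *shm;            /* hypothetical: mapped shared page */
    extern int  desired_kernel_threads(void);  /* hypothetical: from runnable user threads */
    extern void add_kernel_thread(void);       /* hypothetical */
    extern void retire_kernel_thread(void);    /* hypothetical */

    /* Polled from convenient points inside the user-level scheduler. */
    void poll_kernel_state(void)
    {
        int active = shm->active_kernel_threads;
        int wanted = desired_kernel_threads();

        if (active < wanted)
            add_kernel_thread();      /* blocked threads have left processors idle */
        else if (active > wanted)
            retire_kernel_thread();   /* avoid extra horizontal switching in the kernel */
    }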

3.3.2 Process control

Tucker [4, 5] proposes another approach that requires information exchange between the kernel and user levels. His system requires process control from the application and processor partitioning and interface support from the operating system. The process control technique is based on the principle that to maximise performance, a parallel application must dynamically control its number of runnable processes (or kernel threads) to match the number of processors available to it. Tucker contends that in a multiprogramming environment, this adjustment of processes must be dynamic because other applications are continuously entering and leaving the system, hence constantly changing the number of available processors an application has.

His system uses an asynchronous interface, similar to that used by Inohara. However, the information passed to the user level depends on the system-wide environment the application is operating in and not just the state of the application itself. It is therefore best if all applications in the multiprogrammed environment use the process control mechanism. As an example, consider a ten-processor machine with two applications running concurrently. The kernel will partition the processors amongst the applications, for example giving three processors to the first application and seven to the second. The process control mechanism in the applications is then responsible for creating a number of processes or kernel threads equal to the number of allocated processors. When this figure changes, a shared memory area unique to each application is updated and the application reacts accordingly, creating or destroying processes or kernel threads as required. The process control model therefore introduces 'space partitioning' as opposed to the 'time partitioning' used in Inohara's and Anderson's models. Tucker also compares his system to Anderson's scheduler activations [1] system which, coincidentally, was being developed at the same time. We shall again defer comparing these two systems until after we have introduced the operating mechanism of scheduler activations in some detail.

3.4 Scheduler activations

Scheduler activations were originally proposed by Anderson et al [1] at the University of Washington. Its authors implemented this mechanism on top of the FastThreads library on the Topaz system. This system is unfortunately no longer running and the source was never released. In this section we shall be describing in detail the operation of Anderson's scheduler activations, in particular how the system behaves when a blocking system call is made by the application. We shall demonstrate with a graphical example how the system works in practice. We will then discuss how this system compares with Inohara's implementation using an asynchronous shared memory area and Tucker's process control mechanism. Finally we shall present a new model for scheduler activations as implemented by Danjean [2] as a patch for the Linux operating system kernel.

3.4.1 The idea behind scheduler activations

Scheduler activations enable the kernel to notify an application whenever it makes a scheduling decision affecting one of its threads. Anderson coined the term 'Scheduler Activation' because each event in the kernel causes the user-level thread system to reconsider its scheduling decision of which threads to run on which processors. This mechanism is implemented by introducing a new kind of call, the upcall. While a traditional system call from an application to the kernel can be termed a downcall, an upcall is a call from the kernel to the application. In order to make these upcalls, the kernel makes use of a scheduler activation. A scheduler activation is an execution context in exactly the same way that a normal kernel thread is. In fact, implementations of activations use the operating system's native kernel threads and simply add the functionality of upcalls. When an application uses a classical kernel thread, it creates that thread itself and designates a function for the operating system to execute. The opposite happens with scheduler activations. The operating system decides when an activation is needed. It then creates it and begins executing a specific user function.

3.4.2 Activations and blocking system calls

Our main interest is in how scheduler activations deal with blocking system calls. What follows is a simple example of an implementation of activations that has the bare minimum to deal with blocking system calls. The implementation we use in our threads packages is based on the same idea but uses a different set of upcalls and downcalls. (A note on notation: when we refer to an upcall by name we shall not use function notation, e.g. 'when an unblock upcall is made'; when we refer to the function called by the upcall we shall use function notation, e.g. 'an upcall is made to unblock()'.) First of all, we shall define three required upcalls:


1. upcall new
This upcall is used to notify the application that a new activation has been created. The application can then use this activation to run the code it requires. We shall name the function executed when this upcall is made new().

2. upcall block
When an activation blocks, the kernel uses another activation to make this upcall in order to notify the application that one of its activations has blocked. We shall name the function executed when this upcall is made block().

3. upcall unblock
This upcall is made to report to the application that one of its activations has become unblocked. The application can then resume the user thread that was running on that activation. We shall name the function executed when this upcall is made unblock().

Figure 3.2 illustrates what happens when, on a dual-processor machine, an application using activations makes a blocking I/O request.

Time T1: The kernel allocates two processors to the application. Two new activations are launched with a new upcall to new(). The user-level thread scheduler selects two threads from the pool and begins to run them.

Time T2: Activation (A) makes a blocking system call (such as an I/O request). A new activation is created using the new upcall. A block upcall is also made to block() to notify the application that one of its activations has blocked. This allows it to take appropriate action such as removing the thread from its run queue. The user-level thread scheduler then chooses another thread from the thread pool and uses the new activation (C) to run it.

Time T3: The activation (A) finally unblocks (for example on completion of the I/O request), and the application receives notification from the kernel by means of an unblock upcall to unblock(). This upcall could be made either through activation (B) or (C). At this point, the activation performing the upcall can choose to continue its thread or to immediately resume the thread that was blocked. In any case, the extra activation is discarded. In this way, the number of active activations is always equal to the number of available processors. This removes any unnecessary horizontal context switching in the kernel.
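To make the walkthrough concrete, the sketch below shows the shape of the three upcall entry points as a user-level thread library might provide them. The signatures and the run-queue helpers are assumptions made for illustration only; the concrete API used in this dissertation is described in Chapter 4.

    /* Sketch of upcall handlers for the model above (assumed signatures). */
    typedef struct activation  activation_t;
    typedef struct user_thread user_thread_t;

    extern user_thread_t *runqueue_pick(void);               /* hypothetical */
    extern void runqueue_remove(user_thread_t *t);           /* hypothetical */
    extern void runqueue_add(user_thread_t *t);              /* hypothetical */
    extern void run_on(activation_t *a, user_thread_t *t);   /* hypothetical */
    extern user_thread_t *thread_of(activation_t *a);        /* hypothetical */

    /* new(): a fresh activation (virtual processor) is available. */
    void new(activation_t *self)
    {
        run_on(self, runqueue_pick());
    }

    /* block(): activation 'blocked' has gone to sleep inside the kernel. */
    void block(activation_t *self, activation_t *blocked)
    {
        (void)self;
        runqueue_remove(thread_of(blocked));   /* its user thread must not be picked again */
    }

    /* unblock(): activation 'woken' can continue its user thread. */
    void unblock(activation_t *self, activation_t *woken)
    {
        (void)self;
        runqueue_add(thread_of(woken));        /* or switch to it immediately */
    }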

One of the most important things to note about this system is that the number of active activations for an application is always equal to the number of physical processors. This is true regardless of how many processors are actually available. So on an N-processor machine an application can have an arbitrary number of inactive blocked threads, but the number of active threads is always N. The number of active threads that are actually being given processing time is not always N but is decided upon by the kernel scheduler. In an environment where the level of multiprogramming is high, this value could be much less than N. In a dedicated environment, this value is always N.

Whilst it is true that scheduler activations can be used to make user-level scheduling decisions based on the kernel scheduler, we shall be concentrating solely on using activations to prevent blocking system calls from reducing an application's performance. Other upcalls (namely upcall preempt and upcall restart) would be required to notify the user-level scheduler of the preemption and resumption of kernel threads. In a heavily multiprogrammed environment, these upcalls would create significant overhead as the operating system would be constantly preempting kernel threads to execute other applications. On the other hand, in an environment with low multiprogramming, and especially in applications that use coarse-grain threads with no priorities, preemption by the kernel in critical sections (such as a thread holding a spin lock) is rare.

3.4.3 Comparing cooperative techniques

We shall now compare the scheduler activations mechanism with Inohara's (see Section 3.3.1) and Tucker's (see Section 3.3.2) techniques. As discussed, scheduler activations always maintain an equal number of activations and processors, despite the fact that some of these activations may be inactive in a multiprogrammed environment. Using Tucker's mechanism, the number of kernel threads is dependent on the number of processors available to the application and is decided by the kernel. This is a better solution for a multiprogramming environment.

Figure 3.2: A blocking system call with activations

An essential difference between these techniques and scheduler activations is that scheduler activations use a synchronous method of information exchange between the kernel and user levels. Upcalls are the synchronous method used in scheduler activations. Inohara and Tucker let the application poll a shared area of memory between the kernel and user space to retrieve kernel information. Tucker contends that the two factors that influence the choice of a synchronous or asynchronous interface are response time and communication overhead. He argues that where response time is critical, application-based polling is not competitive and therefore kernel-based signalling would be the mechanism of choice. Such an application would be one that carries out several blocking system calls and hence leaves the processor idle for relatively long periods. He continues to say that a system that uses polling of information in user memory space, thus reducing overhead (as in Inohara's solution), is best when some slack can be tolerated.

In passing, it is interesting to note another point that Tucker discusses in his thesis. It is not always straightforward what constitutes a 'relevant' event that requires an exchange of information between the kernel and user levels. This holds true whether using an asynchronous or a synchronous mechanism. For example, a blocking system call which is long-lived would require notification by the kernel to the application, but one which is short-lived might not.

Inohara argues that a synchronous interface causes vertical switching which in some cases can be redundant. It is in fact more important in a synchronous system to decide what information is relevant for the application to have. Since the overhead of communication between the kernel and user levels can be high, the frequency with which an upcall is required should be kept to a minimum. Also, since user-level schedulers react immediately to an update of the information sent from the kernel scheduler, there is an increase in horizontal switching at the user level. Moreover, with scheduler activations, it is the kernel that decides when to create and destroy kernel threads and not the application.

We now give an example illustrating the different behaviour we get from scheduler activations and asynchronous techniques. We shall use Inohara's count of active threads and show how the systems behave on receiving a blocking system call. For simplicity's sake we shall assume that we have two processors that are both totally dedicated to our application. With scheduler activations, the kernel first creates two activations for the application to run its threads on. For the asynchronous system, an area of user memory is filled with the value '2'. The application reads this value and creates two kernel threads to run its application. It then sets the value in memory to '0'. Consider what happens in the two systems when one of the user threads of the application issues a blocking system call. With activations, a new activation is created immediately and a new user thread assigned to it. With Inohara's solution, the count is now set to '1', indicating a free idle processor. The application then polls this value at some later time and creates a new kernel thread. The best performing system depends very much on the time taken for the system call to complete and unblock the activation or kernel thread. Creating a new activation requires a certain amount of overhead that may use more processor time than had the system remained idle. On the other hand, if the blocking system call lasts for a long time, then activations will give the best result as the response from the application can be immediate. Polling can be equally effective if it is done frequently, giving a near-immediate response time. However, frequent polling can degrade performance.

A similar argument holds when the kernel thread unblocks. This time, using scheduler activations we can guarantee that the number of activations is always equal to the number of processors. If the application does not poll often enough, it can end up with several more kernel threads than processors. This increases the amount of horizontal context switching in the kernel and therefore degrades performance. Also of importance is how the application deals with an unblock upcall in the activations model. Since this upcall can occur at any time, the preempted activation must stop running its user thread to deal with the upcall. This could lead to race hazards, especially if the preempted user thread is running in a critical section. It also increases horizontal switching that could be avoided with a polling technique. This situation leads us to our last section in this survey. Danjean [2] develops a system using both scheduler activations and polling of a shared area of memory between the kernel and user space. He develops this as a patch for the Linux operating system. By combining the two, one can ensure that there are always an equal number of activations as processors for each application and at the same time there is no unnecessary interruption of the application.

3.5 Scheduler activations with polled unblocking

This version of scheduler activations was developed by V. Danjean at the École Normale Supérieure in Lyon, France. To date there have been no publications made on this latest activations patch. The publications that we have cited so far refer to papers that describe the implementation of an activations patch following Anderson's original model. We shall now demonstrate the operation of Danjean's model and follow up with a discussion of how it compares with the traditional activations model. Here we only aim to give the reader an understanding of this model. We shall not be describing any implementation issues, nor will we use the API provided by the patch. Additional details will be given when we describe how to integrate scheduler activations into our thread libraries in Chapters 4 and 5. We first define the two upcalls necessary for this model:

1. upcall new
This upcall is used to notify the application that a new activation has been created. This can happen when:

• the kernel creates the initial activations at the beginning of the program.

• an activation blocks.

Note that by using a technique to discriminate between these two cases, we avoid having to make two upcalls (upcall new and upcall block) when an activation blocks. The application then uses this activation to run the code it requires.

2. upcall unblock
When an activation unblocks, the kernel uses the same activation to make this upcall in order to notify the application that the activation unblocked. This optimisation improves performance as activations running on other processors do not have to be preempted in order to notify the user level that an activation has unblocked. On exiting the unblock() function, the activation resumes execution where it had left off (that is, exactly after the blocking system call).

We also need two other important components. The first is the shared memory area between kernel and user space and the second is the system call that restarts unblocked activations. Danjean uses an integer value (nb_unblocked) that stores the number of unblocked activations. The application or thread library polls this value and when necessary makes a system call to restart the unblocked activations. We shall call this system call act_restart().

We finally note one further optimisation that deals with idle activations on SMP machines. It is not uncommon for an application to have fewer user threads to run than the number of processors. Since with scheduler activations there are always an equal number of activations and processors, some activations, and hence processors, may be idle. This creates a problem with polled unblocking. With a synchronous activations model, an unblock upcall would be made immediately and the unblocked activation restarted. If we use a polling mechanism, we may end up with unblocked activations and idle processors for the period of time until the polling is made. Danjean solves this by allowing for automatic unblocking if, and only if, an activation is idle on a processor. Therefore, an unblocked activation can be restarted in two ways:

• automatically if there are idle processors.

• by using the act_restart() system call. This is the only way to restart an unblocked activation if no other activations are idle.
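Sticking to the description above, the polling side can be pictured as in the sketch below. The names nb_unblocked and act_restart() follow the text; how the shared counter is mapped and the exact calling convention of the system call are assumptions here, with the real API covered in Chapters 4 and 5.

    /* Sketch of polled unblocking (assumed declarations). */
    extern volatile int nb_unblocked;   /* shared between kernel and user space */
    extern void act_restart(void);      /* downcall: resume unblocked activations */

    /* Called at convenient points inside the user-level scheduler, for example
     * when a user thread terminates or before the run queue is examined. */
    static void poll_unblocked_activations(void)
    {
        if (nb_unblocked > 0)
            act_restart();   /* the kernel resumes the activations where they blocked */
    }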


Figure 3.3 illustrates what happens when, on a dual-processor machine, an application using activations makes a blocking I/O request. We show what happens in the two possible cases for restarting an unblocked activation.

Time T1: The kernel allocates two processors to the application. Two new activations are launched with a new upcall to the function new(). The user-level thread scheduler selects two threads from the pool and begins to run them.

Time T2: Activation (A) makes a blocking system call (such as an I/O request). A new activation is created using the new upcall. Note that this time there is no block upcall made to notify the user level that one of its activations has blocked. The new upcall directly implies that an activation has blocked. The thread scheduler takes appropriate action such as removing the thread from its run queue. It then chooses another thread from the thread pool and uses the new activation (C) to run it.

Time T3: Activation (B) also makes a blocking system call. Again a new activation (D) is launched and used to run one of the user threads in the thread pool.

Time T4: Activation (A) unblocks in the kernel. The value of nb_unblocked is incremented to 1. This time however, the two activations (C) and (D) are not interrupted and continue to execute on the two processors.

Time T5: The thread running on activation (C) completes. The thread scheduler polls the value of nb_unblocked. Since it is non-zero, it makes an act_restart() system call and the kernel resumes running the thread on activation (A). (Note that polling by the scheduler can be made at any time when within the scheduler kernel and not necessarily when a user thread terminates.)

Time T6: The thread running on activation (D) completes. The user-level scheduler polls the value of nb_unblocked. Since it is zero this time, it attempts to run another user thread from the thread pool. However, there are no more user threads to run, so the activation is put to sleep and the processor is idle.

Time T7: Activation (B) unblocks in the kernel. The kernel realises that there is an idle processor. Therefore, the value of nb_unblocked is not changed and activation (B) is scheduled automatically.

In this chapter we have identified a number of solutions that exist to prevent blocking system calls from stalling a whole application. For each we described the advantages and disadvantages of that solution. We saw that a system that provides kernel support is best for an operating system such as Linux, which does not follow the event-driven model. We distinguished between asynchronous and synchronous interfaces between the kernel and user space to provide this support and concluded that a hybrid model is best, especially in environments with low multiprogramming. Scheduler activations as proposed by Danjean constitute such a model. In the next two chapters we shall be integrating scheduler activations into a uniprocessor and an SMP thread scheduler. Finally, we shall design and build a web server to test the schedulers and compare them to a process-driven web server and a user-level web server without kernel support through scheduler activations.


Figure 3.3: Synchronous and Asynchronous Unblocking


Chapter 4

Integrating scheduler activations with a user-level thread scheduler for uniprocessors

In this chapter we shall be describing the integration of two different models for scheduler activations with a uniprocessor user-level thread library. These two models come as kernel patches for the Linux operating system and were both developed by Danjean at Lyon. We shall start by describing briefly the thread scheduler used. For each of the two patches, we shall describe the API provided by the patch and then discuss how the integration of each patch with the thread scheduler was implemented. We conclude with a brief comparison of the two models.

4.1 Uniprocessor smash - a user-level scheduler for uniprocessors

The uniprocessor scheduler used is smash. This scheduler was developed at the University of Malta by Debattista [6]. The uniprocessor smash is based on the MESH [9] scheduler. smash is a non-preemptive user-level thread scheduler that borrows many of the ideas used in MESH to increase performance and speedup.


It strips off the external communication interface. Besides providing an API for thread management, smash also provides an API for CSP constructs. Though we shall not be discussing CSP, these constructs do operate correctly with activations as they use the underlying internal scheduler functions described below. The version we shall be using for this implementation is the simple circular run queue without a dummy head. This configuration is shown in Figure 4.1. There is one current thread executing and this is pointed to by the variable sched.current. For a detailed description of how the thread scheduler is implemented refer to Debattista [6]. We shall only discuss in detail those functions which are relevant to integrating scheduler activations with our scheduler. In summary, the threads package uses the following internal functions and provides the subsequent API.

4.1.1 Internal scheduler functions for uniprocessor smash

Thread Insertion

Thread insertion involves placing a user thread onto the run queue. Since the environment the original scheduler operates in is a uniprocessor one with no automatic preemption, no concurrency issues need to be considered with regard to protecting the run queue data structure. Thread insertion is carried out by the scheduler_in() routine.

Thread Removal

When a thread terminates, it is removed from the run queue. If there are no threads left on the run queue then the scheduler terminates. Thread removal is carried out by the sched_dequeue() routine.

Thread Yield

Thread yielding involves saving the context of the current thread and switching to the context of the next thread on the run queue. Thread yielding is carried out by the sched_yield() routine.


Figure 4.1: Uniprocessor circular run queue architecture (from [6])


The only variable that the reader needs to know about is sched.current. This is a pointer to the current user-level thread on the run queue.
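For orientation, the run queue can be pictured as the minimal circular structure sketched below. The field names are illustrative placeholders and are not the actual smash definitions.

/* Illustrative sketch of the circular run queue; field names are assumed. */
typedef struct cthread {
    struct cthread *next;      /* next thread on the circular run queue */
    /* ... saved context, stack pointer, etc. ... */
} cthread_t;

static struct {
    cthread_t *current;        /* sched.current: the thread now executing */
} sched;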


4.1.2 The API for uniprocessor smash

For completeness, the following functions provide the API by which the application developer can make use of smash:

• cthread_init() - creates and initialises a new thread.
• cthread_run() - places a thread on the run queue for execution (calls scheduler_in()).
• cthread_yield() - yields execution to another thread (calls sched_yield()).
• cthread_join() - a synchronisation function used to allow a thread to wait for another thread to terminate before continuing.
• cthread_stop() - terminates a thread (calls sched_dequeue()).
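As a minimal illustration of how an application drives this API, the fragment below creates and runs two threads and waits for them to terminate. The exact signatures of the cthread functions and the thread-descriptor type are assumptions made for the example and are not taken from the smash sources.

#include <stdio.h>

/* Hypothetical declarations, assumed for illustration only. */
typedef struct cthread cthread_t;
extern cthread_t *cthread_init(void (*fn)(void *), void *arg);
extern void cthread_run(cthread_t *t);
extern void cthread_yield(void);
extern void cthread_join(cthread_t *t);

static void worker(void *arg)
{
    printf("hello from thread %d\n", *(int *)arg);
    cthread_yield();                 /* cooperatively give up the processor */
}

int main(void)
{
    int a = 1, b = 2;
    cthread_t *t1 = cthread_init(worker, &a);   /* create and initialise */
    cthread_t *t2 = cthread_init(worker, &b);
    cthread_run(t1);                 /* place both threads on the run queue */
    cthread_run(t2);
    cthread_join(t1);                /* wait for each thread to terminate */
    cthread_join(t2);
    return 0;
}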

4.2 Scheduler activations with synchronous unblocking

The scheduler activations patch used in this section is built around the original model of scheduler activations as proposed by Anderson and described in Section 3.4.1. This version of the scheduler activations patch is not as stable as the second and suffers from a number of bugs. It is however interesting in that it allows us to study the nature of race conditions that arise when using this model. Note that for a uniprocessor, there is only one active activation. All other activations are either blocked or unblocked and waiting to be restarted. The active activation is either running a user thread, executing code in an upcall, or is idle. A complete specification of the API and an overview of the implementation of the patch is given in [3]. The system calls and upcalls required are as follows:


4.2.1 The scheduler activations API for synchronous unblocking

System Calls

The patch API provides three system calls:

act_init() This system call must be used first if an application is to use activations. It is passed a structure as a parameter that is used to inform the kernel of:

• the number of activations that can be run in parallel. This is usually set to be the number of processors on the machine, though it can be less.
• the maximum number of activations that can be created.
• the addresses of the functions where the upcalls will be made.
• two areas that are used to store the activation context for an upcall. The first is the context of the preempted activation. The second is the context of the activation that the upcall is giving information about.

act_cntl() This system call is used by the application to request information about activations or to modify kernel variables (such as the number of processors that an application wants to use). What is important in this implementation is when act_cntl() is called using the flag ACT_CNTL_WAIT_UPCALL. When called using this flag, the activation will block in the kernel until an upcall is about to be made. This functionality is required in order to put the running activation to sleep. This is done when there are no more threads on the run queue waiting to execute and there are blocked activations which have to be waited for.

act_resume() When an upcall is made, it acquires a lock that is released by calling this system call. Therefore, after servicing an upcall, act_resume() must be called. act_resume() takes an activation context as a parameter. If this parameter is not NULL, the activation continues with the state saved in the parameter. If it is NULL then the activation continues after this system call.
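To give a feel for how an application wires these calls together, the fragment below sketches the initialisation step. The structure layout and field names are hypothetical - the real layout is defined by the patch and documented in [3] - but the information passed (concurrency level, maximum number of activations, upcall entry points and the two context buffers) is the set listed above; the handler names are those written in the following sections.

/* Hypothetical initialisation sketch; the field names are invented for
   illustration and do not come from the patch headers. */
struct act_init_params {
    int       nr_parallel;          /* activations that may run in parallel */
    int       max_activations;      /* upper bound on activations (MAXACT)  */
    void    (*upcall_restart)(void);
    void    (*upcall_new)(int aid);
    void    (*upcall_block)(int blocked_aid, int stopped_aid);
    void    (*upcall_unblock)(int unblocked_aid, int stopped_aid);
    act_buf_t stopped_buf;          /* context of the preempted activation  */
    act_buf_t cur_buf;              /* context the upcall reports about     */
};

static struct act_init_params act_params;

static void activations_start(void)
{
    act_params.nr_parallel     = 1;          /* uniprocessor scheduler */
    act_params.max_activations = MAXACT;
    act_params.upcall_restart  = act_restart;
    act_params.upcall_new      = act_new;
    act_params.upcall_block    = act_block;
    act_params.upcall_unblock  = act_unblock;
    act_init(&act_params);                   /* kernel answers with an act_restart upcall */
}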

Upcalls

Following are the upcalls made by the kernel in order to notify the user level of any activations that have been created, blocked or unblocked:

act_restart This upcall has several uses for synchronisation on SMPs. For the purpose of the uniprocessor scheduler, it is used just once in conjunction with act_init(). When act_init() is called, an upcall is made to act_restart() with an indication of the current state. This allows execution to continue just after the system call.

act_new When a new activation is launched, an act_new upcall is made to act_new(). The function receives as a parameter an integer value called an activation ID that is used to identify the activation.

act_block This upcall is made to act_block() when an activation blocks. The activation IDs of the blocked activation and the current preempted activation are passed as parameters. The context of the preempted activation is saved in the first context buffer that was passed as a parameter to act_init(). It is then used in act_resume() to continue running the preempted activation.

act_unblock This upcall is made to act_unblock() when an activation unblocks. Again the activation IDs of the unblocked activation and the current preempted activation are passed as parameters to this function. The contexts of the preempted activation and the unblocked activation are saved in the context buffers that were passed as parameters to act_init(). These can then be used to resume either the unblocked activation or the preempted activation.


4.2.2 Implementation considerations

Integrating scheduler activations with the smash scheduler basically involves:

• modifying the thread library to deal with special conditions that arise because of the use of activations.
• defining structures to store information on activations.
• writing code for the functions of each of the upcalls mentioned above.

Therefore, we must write four new functions which we shall call the same as the upcalls - act_restart(), act_new(), act_block() and act_unblock(). For our purposes, the act_restart() function will be ignored as it is redundant for a uniprocessor implementation. As a result of the synchronous nature of this implementation, we have to be very careful in recognising and solving potential race conditions. Where relevant we shall point out these race conditions and show how we solve them. At the end of this section we shall describe a typical race condition that had to be taken into consideration so that the reader can appreciate the level of complexity that a preemptive system entails.

4.2.3 Saving blocked activation information

When an activation blocks, it is necessary to save which user-level thread is running on it so that when the activation unblocks, that thread can be enqueued onto the run queue. This activation patch assigns an ID to every activation. The maximum number of activations is limited and defined on initialisation. Therefore we choose to create an array of size MAXACT - the maximum number of possible activations. We shall call this array blocked_activations.
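The bookkeeping amounts to no more than the array sketched below. The slot layout matches the blocked_activations[...].thread accesses used in the code later in this chapter; the user-thread descriptor type is written cthread_t here as an assumption.

/* One slot per possible activation, indexed by the activation ID assigned
   by the patch (the code later in this chapter uses ID-1 as the index). */
typedef struct {
    cthread_t *thread;   /* user thread running when the activation blocked */
} blocked_act_t;

static blocked_act_t blocked_activations[MAXACT];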

4.2.4 Saving unblocked activation information

When an activation unblocks, it is not always safe to immediately enqueue the unblocked user thread onto the run queue, since an act_unblock upcall can occur at any time. Therefore, if the run queue is being modified and this upcall is made by the kernel, a race condition will arise if act_unblock() also attempts to modify the run queue. This occurs because there are two threads of execution modifying the same data structure. For this reason, when an activation unblocks, the unblocked thread is placed into a temporary linked list, act_sched. These unblocked threads are placed onto the scheduler run queue only when we can guarantee that no other operation on the run queue data structure is in progress. This is done using an atomic SWAP operation (see [15]).
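As a sketch of how such an atomic hand-off can be expressed, the fragment below detaches the whole list of unblocked threads in one indivisible step. The GCC __atomic_exchange_n built-in is used here as a stand-in for the SWAP primitive of [15], and the thread type and list head are simplified placeholders.

typedef struct cthread cthread_t;        /* simplified user-thread descriptor */

static cthread_t *act_sched_head;        /* unblocked threads not yet on the run queue */

/* Atomically take the whole list of unblocked threads.  Because the swap is
   a single indivisible operation, an act_unblock upcall arriving at any
   moment sees either the old list or the emptied one, never a half-updated
   structure. */
static cthread_t *take_unblocked(void)
{
    return __atomic_exchange_n(&act_sched_head, NULL, __ATOMIC_SEQ_CST);
}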

4.2.5 Coding act_new()

When an activation blocks, an upcall is made to this function. The algorithm for this routine is given in Algorithm 1. At first it is necessary to wait for the act_block upcall to be done, as the scheduler kernel would not yet know which activation has blocked. Therefore, act_resume() is called immediately (line 2). When the upcall to act_block() returns, the run queue is checked for any other threads that can be run (line 3). If there are any such threads, the current thread is jumped to (line 11). If there are no threads on the run queue, the list of unblocked threads is checked to see if there are any threads that have unblocked. If there are none, the activation is put to sleep (line 5). If there are unblocked threads, act_sched.current (the head of the linked list of unblocked threads) is swapped with sched.current (line 7). This will set the value of act_sched.current to NULL and at the same time place the unblocked threads onto the run queue. The next thread is then jumped to (line 8). Since act_resume() is called immediately, it is no longer possible to ensure that no upcalls will occur whilst executing the code in this function. In fact an upcall is guaranteed to occur after act_resume(), and this is the act_block upcall that notifies which activation has blocked. However, it is possible that before this upcall is made, one or more act_unblock upcalls are made. This creates a potential race condition since the blocked thread would not yet have been dequeued from the run queue. Therefore, a flag called dealt_with_dequeue is set in line 1. As we shall see, this flag is reset once the dequeue operation is complete in act_block(). This flag will be needed in act_unblock() in order to take appropriate action when a thread unblocks.

Algorithm 1 Uniprocessor version 1 - Algorithm for act_new()
1:  dealt_with_dequeue = 0
2:  act_resume(NULL, 0)
3:  if sched.current == NULL then
4:    if act_sched.current == NULL then
5:      act_cntl(ACT_CNTL_WAIT_UPCALL)  // sleep
6:    else
7:      act_sched.current = SWAP(sched.current, &act_sched.current)
8:      sched_jmp()
9:    end if
10: end if
11: sched_jmp()

4.2.6 Coding act_block()

After an activation blocks and a new activation is created, an upcall is made to act_block() in order to notify the user-level scheduler as to which activation has blocked. The algorithm for this function is given in Algorithm 2. We guarantee that the blocked activation is always pointed to by sched.current. In order to do this we must ensure that before this upcall is made, no changes are made to the sched.current pointer. As we shall see later, this will require some more work in act_unblock(). The first thing required is to save the blocked thread in the array of blocked threads (line 1). The blocked thread is then simply removed from the run queue (line 2) and dealt_with_dequeue is set to 1 in order to indicate that the dequeue operation has taken place (line 3). Note that the dequeue operation is done 'manually' and not by calling sched_dequeue(). This is because of extra functionality that we shall be giving the sched_dequeue() function and which is not required here. Finally, the activation that was preempted in order to make the upcall is resumed (line 4).

Algorithm 2 Uniprocessor version 1 - Algorithm for act_block()
1: blocked_activations[ID of blocked activation].thread = sched.current
2: dequeue(sched.current)
3: dealt_with_dequeue = 1
4: act_resume(stopped activation)

4.2.7 Coding act_unblock()

The act_unblock upcall is made whenever an activation unblocks in the kernel and the activation lock is not held by some other upcall. Due to the synchronous nature of this model of activations, we must be careful when modifying data structures in order to protect against race hazards. We next describe the algorithm for act_unblock() as shown in Algorithm 3. When act_unblock() is called in an upcall, the application is in one of the following states:

• the run queue is empty and hence the preempted activation was idle or about to go to sleep.
• the run queue contains just one thread and the flag dealt_with_dequeue is 0. This means that the last activation was waiting for a dequeue in act_block() but an act_unblock upcall was made before it. After the act_block upcall, the activation would have been set to sleep.
• there is more than one thread on the run queue, or there is just one thread and dealt_with_dequeue is 1.

If case 1 or 2 above is true (line 1), the unblocked thread is immediately enqueued (line 2) and the unblocked activation is resumed (line 3). Note that when the enqueue operation is done, the enqueued thread is placed before the current thread in the scheduler, just in case the dequeue operation of a blocked activation has not yet been carried out. If case 3 above is true, the unblocked thread is placed onto the temporary linked list act_sched (line 5). Finally, the context of this thread is saved (line 6) so that it can be resumed later on. The preempted activation can then be resumed (line 7).

Algorithm 3 Uniprocessor version 1 - Algorithm for act_unblock()
1: if run queue is empty OR (run queue contains one item && dealt_with_dequeue == 0) then
2:   scheduler_in(blocked_activations[ID of unblocked activation].thread)
3:   act_resume(unblocked activation)
4: else
5:   act_scheduler_in(blocked_activations[ID of unblocked activation].thread)
6:   save context of unblocked activation
7:   act_resume(stopped activation)
8: end if

It is worth noting that the context of an activation is different from that of the user threads. Therefore, each thread carries an additional field called act_buf. This stores the context of an activation and is used to resume that activation. JmpBuf_Set() can then be used to save the context of an unblocked thread. See [6] for a description of the operation of JmpBuf_Set() and JmpBuf_Jmp(), and the Linux man pages for memcpy(). Lines 6 and 7 above translate to the following C code:

/* Line 6: save the context of the unblocked activation in the thread's act_buf. */
memcpy(blocked_activations[stopped_aid-1].thread->act_buf,
       act_params->stopped_buf, sizeof(act_buf_t));

/* Line 7: save the context of the unblocked user thread; on the first return
   resume the preempted activation, and when the thread is later scheduled
   (second return) resume its activation context. */
if (!JmpBuf_Set(&(blocked_activations[stopped_aid-1].thread->jmp))) {
    act_resume(act_params->cur_buf, 0);
} else {
    act_resume(sched.current->act_buf, 0);
}


4.2.8 Modifying sched_dequeue()

The sched_dequeue() routine in the original uniprocessor thread scheduler simply removes threads from the run queue. We need to modify sched_dequeue() to deal with two important operations:

• re-inserting unblocked threads saved in act_sched onto the run queue.
• putting the current activation to sleep if there are no more threads to run and there exist blocked threads.

Algorithm 4 demonstrates how we deal with these two cases. In line 1 the current thread which is to be dequeued is checked to see if it is the only thread on the run queue. If it is not, it is removed (line 2). If there are any unblocked threads in act_sched.current (line 3) they are placed onto the run queue (lines 4-5). If the thread to be dequeued is the only thread, then it is removed (line 8). Finally, act_sched.current is checked to see if there are any unblocked threads (line 9). If there are none, the activation is put to sleep (line 10). If there are unblocked threads, they are placed onto the run queue and run (lines 12-13). Note that lines 9 through 14 are identical to lines 4 through 9 of the algorithm for act_new() (see Algorithm 1). This code fragment ensures that no race conditions are possible when placing unblocked activations onto the run queue or putting the running activation to sleep. We also point out that it is feasible to check for unblocked threads that could be placed onto the run queue at every entry into our scheduler. Since synchronisation constructs use sched_dequeue() to remove non-runnable threads, we select this function in which to carry out this operation. Another place where this could be done is before a context switch, such as when the application calls cthread_yield(). However, the use of SWAP is expensive since it locks the bus, and therefore we avoid using it there. What we guarantee is that, except in one very rare condition described below, whenever a thread terminates, all unblocked threads are placed onto the run queue. It is interesting to note that using SWAP rather than spin locks to protect structures gives a solution that is termed 'wait-free' [16]. However, since there exists the possibility of preemption because of the act_unblock upcall, using spin locks would mean that a thread holding a lock might be preempted. This could lead to additional race hazards that would have to be solved.

Algorithm 4 Uniprocessor version 1 - Algorithm for sched_dequeue()
1:  if sched.current is not the only thread on the run queue then
2:    remove sched.current from run queue
3:    if act_sched.current != NULL then
4:      temp = SWAP(act_sched.current, NULL)
5:      place unblocked threads before sched.current
6:    end if
7:  else
8:    sched.current = NULL
9:    if act_sched.current == NULL then
10:     act_cntl(ACT_CNTL_WAIT_UPCALL)  // sleep
11:   else
12:     act_sched.current = SWAP(sched.current, &act_sched.current)
13:     sched_jmp()
14:   end if
15: end if



4.2.9 A typical race condition

In order to guard against race hazards, we need to be able to guarantee that when carrying out an operation that modifies some data structure, no other operation on that data structure could be in the middle of execution. This is done by means of:

• the atomic SWAP operation.
• the section in an upcall before act_resume() is called and the activations lock released.
• guaranteeing that there are no blocking system calls in our scheduler.

As an example, consider lines 7 through 14 of Algorithm 4 and the algorithm for the act_unblock() function (Algorithm 3). During the execution of sched_dequeue(), an act_unblock upcall can occur at any time. We consider what happens if this upcall occurs between any of these lines:

Lines 7-8 The thread to be dequeued has not been removed from the run queue yet. Therefore, the condition in line 1 of act_unblock() fails. The unblocked thread is placed on the act_sched list (lines 5-6). This need not be done atomically since the activations lock guarantees that no other upcall, in particular act_unblock, will be made until act_resume() is called. When act_resume() is called, the sched_dequeue() function is returned to and line 8 is executed.

Lines 8-9 The dequeue operation was safely made and the run queue is empty. If an act_unblock upcall is now made, there is no time to place any unblocked threads onto the run queue. This is the only occasion where we cannot guarantee that at the end of every thread, unblocked threads are placed onto the run queue. The act_unblock() function is entered into and the condition in line 1 is satisfied. The unblocked thread is placed directly onto the run queue and begins to be executed. Once again, this need not be done atomically since we can guarantee that no other preemptive upcall can be made that will modify the run queue data structure. Note that this time, we do not return to sched_dequeue().

Lines 9-10 If there are no unblocked threads, then the activation is put to sleep. Any act_unblock upcall that occurs before going to sleep is handled as in lines 8-9 above. If the sleep operation is not carried out, there is no need to be concerned about line 10 since this function will not be returned to.

Lines 11-12 Here, a similar argument holds as for lines 8-9 above.

Lines 12-13 The SWAP operation has been carried out and the run queue now contains the unblocked threads. The list of unblocked threads is also automatically set to NULL. If an act_unblock upcall occurs here, the condition in line 1 of act_unblock() fails. Therefore, the unblocked thread is placed onto the temporary run queue and sched_dequeue() is returned to and continues execution. In line 13 the scheduler jumps to the first of these unblocked threads.

4.3 Scheduler activations with polled unblocking

The scheduler activations patch used for the second implementation is built around the hybrid model of scheduler activations as implemented by Danjean and described in Section 3.5. This version of the scheduler activations patch is far more stable than the first. It operates on the Linux 2.2.17 kernel, though a version for the Linux 2.4 kernel is expected to be released in the near future. Unfortunately, since this implementation is very recent, there are no publications or manuals that can be referred to. In fact, it was necessary to study the source of the patch in order to extract the API, which we describe below.

4.3.1 The scheduler activations API for polled unblocking

System Calls

There is only one system call in the API of this patch: act_cntl(). Note that since no act_unblock upcall can be made automatically except when the current activation is idle, no locks are needed for activations. Hence the act_resume() system call described in the last section is no longer required. act_cntl() accepts two parameters. The first is a flag indicating what the system call is to do. The second is a list of parameters that the system call may require. The possible values for the first parameter are listed below:

• ACT_CNTL_INIT - initialisation of the activations. For the second parameter, act_cntl() is passed a structure containing information such as the number of processors required, stack areas for active activations, pointers to the functions where upcalls are to be made, etc.
• ACT_CNTL_GET_PROC - returns the number of the processor the current activation is running on. If the scheduler is running on an N-processor machine and asks that all processors are used, this value will be between 0 and N − 1. No additional parameters are required.
• ACT_CNTL_GET_PROC_WANTED - returns the number of processors that the application asked to use. This was given as a parameter when calling act_cntl() with the ACT_CNTL_INIT flag. Again, no additional parameters are required.
• ACT_CNTL_GET_MAX_PROC - returns the number of physical processors on the machine. No additional parameters are required.
• ACT_CNTL_READY_TO_WAIT - the activation will be set to sleep until the next upcall is done. This call is used in conjunction with ACT_CNTL_DO_WAIT in order to solve the lost-wake-up problem described in Section 5.2.3.
• ACT_CNTL_DO_WAIT - puts the current activation to sleep. As a second parameter it takes any value that needs to be passed to the act_unblock() function when the activation is woken up. If the sleeping activation is woken up automatically by an unblocked activation, the call never returns. It returns only if woken up by an ACT_CNTL_WAKE_UP system call.
• ACT_CNTL_WAKE_UP - wakes up a sleeping activation. We do not need to make this call in the uniprocessor version but we shall be needing it for our SMP scheduler. In the original patch the parameter passed indicated how many sleeping activations were to be woken up. However, we modify the patch so that the parameter passed indicates which activation is to be woken up.
• ACT_CNTL_RESTART_UNBLOCKED - restarts an unblocked activation if one exists. If it does, an unblocked activation is restarted and we never return. As a parameter we give any value we wish to pass to the restarted activation. Before making this system call the user-level thread scheduler should first check the shared memory area to see if there are in fact any unblocked activations. This is necessary so that no redundant system calls are made.
• ACT_CNTL_UPCALLS - enables/disables generation of upcalls. We shall not be needing this call.

Upcalls

We require just two upcalls to integrate this version of scheduler activations into smash. The first upcall is made when a new activation is launched and the second is made when an activation unblocks.

act_new When a new activation is launched, this upcall is made to the act_new() function. The function receives as a parameter an integer value that indicates which processor it is running on. For a uniprocessor machine, this is always 0. Note that in this API we no longer have an integer value (which we previously called the activation ID) in order to identify an activation. We will therefore need some other method of identifying activations.

act_unblock When an activation unblocks, this upcall is made to the act_unblock() function. The function receives as a parameter an integer value indicating the new processor number which the activation has resumed on. Once again, for a uniprocessor this is always 0. It also receives a set of parameters, some of which are given by the kernel and some of which are given when calling act_cntl() with the ACT_CNTL_DO_WAIT or ACT_CNTL_RESTART_UNBLOCKED flags. Of particular importance is the flag ACT_UNBLK_KERNEL. If an act_unblock upcall is made to act_unblock() and this flag is set, it means that another activation blocked and the current activation unblocked; rather than creating a new activation, this activation was restarted. As we shall see, we must check for this flag in the act_unblock() function and take appropriate action. Another important parameter passed to this function is the return result of the blocking call. So, for example, if an accept() system call is made, the return value parameter will give the value returned by accept(): -1 if an error occurred, or a socket identifier on success. Other parameters passed are used to restart the unblocked thread. This involves a small amount of assembly programming. For our purposes we shall assume that as soon as the scheduler exits from this upcall, the unblocked thread is resumed where it had left off (right after the blocking system call).

Shared memory area

The kernel updates an integer value in user space that indicates how many unblocked activations exist. This value is referenced by the integer value nb_act_unblock. In order to see if there are unblocked threads, the scheduler simply polls this value. If there are unblocked activations, nb_act_unblock will be non-zero. In this case act_cntl() may be called with the ACT_CNTL_RESTART_UNBLOCKED flag.

4.3.2 Implementation considerations

Since we are implementing our non-preemptive smash scheduler on a uniprocessor and no preemption is possible from unblocked activations, we no longer need to worry about race conditions. Therefore, we are able to do away with the atomic SWAP operation and the temporary linked list of unblocked threads. Any threads that unblock can be safely placed onto the run queue by standard instructions. Integration with our scheduler is also much simpler and the flow of execution is much easier to comprehend. However, complications arise from some properties that are lacking in the API.

4.3.3 Maintaining activation information

The new API, unlike the previous version, does not provide us with an integer to identify an activation. Another solution therefore has to be found in order to save a reference to the user threads that were running on blocked activations. This is necessary so that when an activation unblocks, the unblocked user thread can be placed back onto the run queue. Since an activation is implemented as a normal kernel thread, identifying it could easily be done by using getpid(). However, this involves an expensive system call, which defeats the purpose of running threads at user level. Cordina [8] proposes a solution that he uses for his SMP implementation of MESH. He suggests comparing the stack pointer currently in use with all the running kernel thread stacks. This solution is not feasible in our case because if there are a large number of blocked activations, the scheduler would have to check every activation until it finds the correct one. The solution we use is based on the one used by Glibc [27] and also in the original smash SMP thread scheduler which we shall be using in Chapter 5. This system works by making a system call called modify_ldt() to set up a segment in memory for each activation. We store information in the segment relative to our kernel thread. In our case we store a pointer to an activation structure. This structure stores:

• an integer value by which the scheduler can identify the activation. In this way an activation can always identify itself by checking this value.
• a pointer to the user thread that was running on that activation before it blocked.

These structures need to have memory allocated beforehand and reused as necessary. Otherwise it would be necessary to create a structure every time an activation blocks and free it when it unblocks. Since the malloc() and free() operations are expensive, we need to avoid doing this dynamically. Therefore, before running the user-level threads the scheduler will malloc() a number of these structures and keep a list of those which are free and those which are used. A further problem arises in deciding how to store these structures. An obvious solution is to use an array. However, this time the API of the patch does not allow us to set a limit on the maximum number of activations that can be blocked in the kernel. This means that these structures need to be created dynamically; therefore, a single fixed-size array is not possible. Instead of creating one structure at a time, a whole block of structures can be created in one go as an array. These arrays are then stored in a linked list. A linked list is also kept to record which structures are free and which are used within these arrays or blocks. When the number of free structures becomes zero, a new block is created with just one malloc() operation. Freeing these structures is not necessary and could even degrade performance: if the number of blocked threads oscillates frequently around a multiple of the array size, the system would be constantly freeing and allocating memory. We argue that if an application had a certain number of blocked user threads at one point, it will probably have a similar number further on in its execution. Note that before running an application, it would be best to set the block size to a value that is approximately equal to the maximum number of activations that are expected to be blocked concurrently. This ensures that extra memory is not used up for nothing and processor time is not wasted allocating memory for small arrays. Also, in order to access a particular structure, it is necessary to traverse the linked list of arrays. Using a linked list of arrays, the number of operations needed to access a structure depends not on the number of structures but on the number of arrays. So, for example, if the array size is set to ten structures per array and there are three arrays in total, only 3 operations would be required to access the last structure: two operations to move from the first array to the third and one to access the last structure in the final array. With a plain linked list of individual structures there would have been a need for 30 operations, as all the structures would have to be traversed.

Figure 4.2 shows these structures during execution of an application. We choose an array or block size of 5. The head of this linked list of blocks is pointed to by head_blocked_arrays and the tail is pointed to by tail_blocked_arrays. Since there are two blocks already created, there must have been at least one occasion where at least six activations were blocked at the same time. The linked list of integers has one node pointed to by free_stop. The value contained in this node and those to the left contain references to activation structures that are not being used. The current activation is pointed to by current_kt. Figure 4.3 shows what happens when the current user thread blocks. A pointer to this thread is placed in the array with ID 4. free_stop is then moved to the left by one node. The next free structure is now in position 2, so current_kt is made to point to this structure. Consider Figure 4.4.

Assume now that the value of nb_act_unblock is polled and found to be non-zero. act_cntl() is called with the ACT_CNTL_RESTART_UNBLOCKED flag and the activation with ID 9 becomes unblocked. The free_stop pointer is therefore moved one node to the right and filled in with the value 9. The pointer current_kt is also made to reference this structure. Note that the value 9 now appears twice in the linked list of free positions, but this is not a problem.

Figure 4.2: Maintaining activation information (1)

It is possible to have repeated values on the right of the free_stop pointer but never on the left. If there were repeated values on the left, it would mean that the same structure is being used to save information on two activations.
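A rough sketch of this bookkeeping is given below. The structure and field names follow those used in the text (current_kt, current_ct, aid, free_stop, head_blocked_arrays), but the exact layout is our own guess rather than the smash definition.

#define KT_BLOCK_SIZE 10                  /* structures allocated per malloc() block */

typedef struct kt_struct {
    int        aid;                       /* identifier of this activation structure */
    cthread_t *current_ct;                /* user thread running on the activation   */
} kt_struct_t;

typedef struct kt_block {
    kt_struct_t      slots[KT_BLOCK_SIZE];
    struct kt_block *next;                /* linked list of blocks (arrays)          */
} kt_block_t;

typedef struct free_node {
    int               pos;                /* ID of a structure in some block         */
    struct free_node *prev, *next;
} free_node_t;

static kt_block_t  *head_blocked_arrays, *tail_blocked_arrays;
static free_node_t *free_stop;            /* boundary between used and free positions */
static kt_struct_t *current_kt;           /* structure of the currently running activation */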

4.3.4 Coding act_new()

Algorithm 5 shows the code executed when an act_new upcall is made. The reader will appreciate that it is much easier to understand than the algorithm for act_new() given in Section 4.2.5. Although more effort is required in order to maintain the structures described in the previous sub-section, no preemption by other activations is possible as no other upcalls can occur.

Figure 4.3: Maintaining activation information (2)

In line 1 the pointer to the user thread of the current activation structure is set to point to the current thread on the run queue. This is the user thread that has blocked. Since there is a new activation, an activation structure must be set up for it. This is done by looking in the list of free positions for the first such structure (line 5). If none exists then a new block of activation structures has to be created (line 3). The linked list of arrays is traversed until a free activation structure can be referenced (line 7). The ldt structure for this activation is initialised and the current activation pointer current_kt is set to reference this structure (line 8). Finally, the blocked thread is removed from the run queue (line 9) and the scheduler jumps to the next user thread (line 10). As we shall see, if there are no more user threads, the sched_dequeue() function will be responsible for putting the activation to sleep; in this case line 10 will not be executed. When one of the blocked activations resumes, its user thread will be scheduled automatically.


Figure 4.4: Maintaining activation information (3)

Algorithm 5 Uniprocessor version 2 - Algorithm for act_new()
1:  current_kt->current_ct = sched.current
2:  if free_stop->prev == NULL then
3:    create new blocked array structure
4:  else
5:    free_stop = free_stop->prev
6:  end if
7:  let temp_kt be a pointer to the structure with ID free_stop->pos
8:  current_kt = kt_init(free_stop->pos, temp_kt)
9:  sched_dequeue()
10: sched_jmp()

4.3.5 Coding act_unblock()

Algorithm 6 shows the code executed when an act_unblock upcall is made. It is important to remember that in this implementation, the unblocked user thread is automatically resumed on exiting the act_unblock() function. Line 1 deals with a particular condition that occurs if an activation blocks and then immediately unblocks. In this case the ACT_UNBLK_KERNEL flag would be set. This is checked for by comparing the pointer to the activation structure stored in current_kt with that stored in the ldt structure. The latter is retrieved using the macro THREAD_SELF. If this condition is satisfied then act_unblock() is exited and the thread is resumed.


Lines 2 through 9 deal with an act_unblock upcall that is caused when another activation blocks and thus the ACT_UNBLK_KERNEL flag is set. In this case two operations need to be carried out. The first is dequeuing the blocked user-level thread. The second is resuming the unblocked one. At this point, current_kt points to the activation structure that has just become blocked. Also, sched.current points to the user-level thread that has blocked. Therefore, in line 3 the pointer to the thread of the activation structure is set to reference the blocked thread. current_kt is then set to the current activation which is running (line 4). Next it is necessary to update the list of free positions of activation structures. This is done by marking the unblocked structure as free and the blocked structure as used. Since a block and an unblock operation are occurring concurrently, the free_stop pointer is not moved. Instead, the value of the current node is changed from the ID of the blocked activation to the ID of the unblocked activation (line 5). The reference to the unblocked user thread is now extracted (line 6) and this thread is enqueued before the current thread (which is the blocked thread) (line 7). The blocked thread is then removed from the run queue (line 8). Once again this is not done by calling sched_dequeue() because, if another thread unblocked, it would not return. The current thread on the run queue is set to be that running on the activation that unblocked (line 9). When act_unblock() exits, the unblocked thread will continue executing after the blocking system call.

Lines 11 through 16 deal with an act_unblock upcall that is caused either because an activation unblocked when the processor was idle or because the act_cntl() system call was made with the ACT_CNTL_RESTART_UNBLOCKED flag. In line 11 the pointer current_kt is set to point to the activation structure of the current activation. It is now necessary to update the free positions list and this is done in lines 12 and 13. Finally, the unblocked thread is placed back onto the run queue and sched.current is made to point to it. Once again, when the function exits, the unblocked thread will continue executing after the blocking system call. Note that in line 19 the value of the system call is returned so that it can be used by the application. return_value is passed as a parameter to act_unblock() by the kernel.

Algorithm 6 Uniprocessor version 2 - Algorithm for act_unblock()
1:  if !(current_kt == THREAD_SELF) then
2:    if the ACT_UNBLK_KERNEL flag is set then
3:      current_kt->current_ct = sched.current
4:      current_kt = THREAD_SELF
5:      free_stop->pos = ID of current activation (current_kt->aid)
6:      temp = current_kt->current_ct
7:      sched_enqueue(temp)
8:      dequeue(sched.current)
9:      sched.current = temp
10:   else
11:     current_kt = THREAD_SELF
12:     free_stop = free_stop->next
13:     free_stop->pos = ID of current activation
14:     temp = current_kt->current_ct
15:     scheduler_in(temp)
16:     sched.current = temp
17:   end if
18: end if
19: return return_value

4.3.6 Modifying sched_dequeue()

Once again, we need to modify sched_dequeue() to deal with two important operations:

• polling the value of nb_act_unblock and calling act_cntl() with the ACT_CNTL_RESTART_UNBLOCKED flag if it is non-zero.
• putting the current activation to sleep if there are no more threads to run and there exist blocked threads.

This algorithm (see Algorithm 7) is much simpler than that used in the first implementation. In line 1 the thread to be dequeued is checked to see if it is the only thread on the run queue. If it is not, it is removed (line 2). The value of nb_act_unblock is polled and, if it is not zero (line 3), one of the unblocked activations is restarted (line 4). Note that it would have been possible to make the system call without checking the value of nb_act_unblock. The kernel would realise that there are no unblocked activations and simply return.


57

activations and simply return. However this would incur an expensive system call for nothing. By using the polling mechanism we only incur the cost of a conditional statement. If the condition in line 1 fails then it means that there is only one thread on the run queue: the thread to dequeue. In this case it is simply removed (line 7) and the activation is put to sleep (lines 8-9). Note that these two calls must be used together. The use of two calls is of no value in our uniprocessor version. Their relevance will be apparent when we discuss the SMP implementation in Chapter 5. It is also not necessary to check if there are any unblocked threads by calling act resume() with the ACT CNTL RESTART UNBLOCK flag. This is because the kernel will automatically restart them rather than go to sleep if such an activation exists. Unlike in the previous implementation , sched jmp() is never called in this algorithm. We decide that is the responsibility of the calling function to do so once sched dequeue() returns. Algorithm 7 Uniprocessor version 2 - Algorithm for sched dequeue() 1: if sched.current is not the only thread on the run queue then 2:

remove sched.current from run queue

3:

if nb act unblock!=0 then act cntl(ACT CNTL RESTART UNBLOCKED)

4: 5:

end if

6: else 7:

sched.current==NULL

8:

act cntl(ACT CNTL READY TO WAIT)

9:

act cntl(ACT CNTL DO WAIT)

10: end if
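In C, the poll in lines 3-4 is no more than the fragment below. The second argument given to act_cntl() is shown as NULL because nothing needs to be passed to the restarted activation here; the exact parameter conventions are those of the patch and are assumed for this sketch.

/* Only pay for a system call when an activation has actually unblocked. */
if (nb_act_unblock != 0)
    act_cntl(ACT_CNTL_RESTART_UNBLOCKED, NULL);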

4.3.7 Modifying sched_yield()

Once again it is possible to check for unblocked threads at any point in our scheduler. However, one must consider that this would require an additional conditional statement in order to poll the value of nb_act_unblock. In the previous implementation we avoided checking for unblocked threads in this function because it might incur the cost of an expensive SWAP operation besides a conditional statement. This time we show how we could modify sched_yield() in order to check for unblocked threads. This modification could be made anywhere in the scheduler except between lines of code where the run queue data structure is being modified. Note that in the common case the conditional statement that checks the value of nb_act_unblock will fail. Therefore, for applications that carry out several context switches, it may be a good idea to remove this functionality. This could be done by a pre-compiler that analyses the code to see how often context switches are indeed carried out. Algorithm 8 shows how small the required modification is. All that is required is to add lines 2 and 3. In line 2 the value of nb_act_unblock is polled. If it is not zero, the unblocked activation is resumed (line 3). If it is zero then there are no unblocked activations. In this case the next thread on the run queue is scheduled (lines 5-6).

Algorithm 8 Uniprocessor version 2 - Algorithm for sched_yield()
1: if JmpBuf_Set(&sched.current->jmp) then
2:   if nb_act_unblock != 0 then
3:     act_cntl(ACT_CNTL_RESTART_UNBLOCKED)
4:   end if
5:   sched.current = sched.current->next
6:   sched_jmp()
7: end if

4.4 Comparing the two implementations

We finalise this chapter by comparing the two scheduler activations kernel patches. We shall do this in the context of the challenges each one poses when integrating it into our user-level thread scheduler. These challenges arise for two reasons:

• the paradigmatic difference between having a synchronous and an asynchronous interface for notifying the user level of unblocked activations.
• the API provided by the kernel patch.

4.4.1 A difference in paradigm

The synchronous interface of the first model creates several race conditions that have to be accounted for. Every line of code that executes after the act_resume() system call that releases the activations lock must be properly studied. This is necessary because an activation may unblock at any time and attempt to modify a shared data structure. As a consequence of this it was necessary to implement extra data structures such as the list of unblocked threads. We also had to make use of atomic operations, which are expensive since they lock the bus. In the second model, unblocked activations are not restarted automatically but at our request. Therefore, it is no longer necessary to maintain a list of unblocked threads that are placed onto the run queue at a later stage. Instead it is possible to place them directly on the run queue when they unblock and be confident that the run queue data structure was not being modified at that time. The logical flow of the program is also much easier to design as a result of the absence of preemption.

4.4.2 A difference in API

A fundamental difference in the APIs of the two implementations is the way activations are identified. In the first version, each activation is given an integer identifier automatically. The number of activations is also limited and set at initialisation. For this reason, we could make do with a simple array to store information for blocked activations. The second implementation does not provide any form of identification. This meant that we had to find our own way of identifying kernel threads, which we did by using the modify_ldt() system call. We also could not set a limit on the maximum number of activations that can be blocked simultaneously. This can prove to be both an advantage and a disadvantage. The advantage is that we can make use of system resources to the limit available at run-time. The disadvantage is that if we do reach this limit, other programs executing in a multiprogramming environment could be severely starved of system resources. In the final implementation we use a command line parameter that limits the maximum number of activations that can be blocked at one time. We achieve this by simply using a counter that is incremented when a thread blocks and decremented when it unblocks. When the counter reaches the chosen limit and an activation blocks, the newly created activation is put to sleep rather than running another user thread.1 An optimal solution would have been to let the kernel decide what this limit should be and have the thread library read it from some shared memory area. The maximum number of blocked activations could then be set dynamically.

Having implemented these two models for our uniprocessor scheduler, we now select the one which we shall implement for our SMP scheduler. It is true that the API for the first implementation offers an easy way of identifying activations, whilst the second does not. However, we have found a solution for this deficiency in the ldt data structure. As discussed in Chapter 3, the polled unblocking mechanism is more efficient, and in this chapter we have seen that it is also easier to implement. We therefore choose the second implementation of scheduler activations over the first to integrate into our SMP thread scheduler.

1 Another solution could have been to use the getrlimit() system call to find the maximum number of kernel threads that the system allows. A function of this value could then be used. See [17] pgs 180-184 for details on this system call.


Chapter 5

Integrating scheduler activations with a user-level thread scheduler for SMPs

In this chapter we shall be describing the integration of a user-level thread scheduler for SMPs with scheduler activations to avoid blocking system calls. The support for scheduler activations comes from the second kernel patch described in Chapter 4. We shall begin by briefly describing the thread scheduler and its operation without scheduler activations. For a detailed discussion of all implementation details see [6]. We shall then discuss in further detail parts of the activations patch API that we did not make use of in our uniprocessor implementation but which are necessary for our SMP implementation.

5.1 SMP smash - a user-level scheduler for SMPs

The version of smash that we shall be using is the traditional shared run queue one. The number of kernel threads is always equal to the number of processors and each kernel thread is bound to a single processor so that vertical context switching in the kernel is kept to a minimum. However, in the original version, blocking system calls severely degrade performance as discussed in Chapter 2.

5.1.1 Introducing spin locks

As we are now working in a multiprocessor environment, more than one process may be attempting to modify some shared data structure such as the list of unblocked threads. If this occurs, that shared data structure may be corrupted. One can consider such a structure to be a resource that can only be used by one task at a time. Access control for these resources is provided by spin locks [19]. When a task requires access to a resource protected by a spin lock, it attempts to acquire the resource by means of a spinlock() operation. If the resource has already been acquired, the task busy-waits by continuously testing the contents of a variable. Once the resource is released, the task exits the busy-waiting loop, sets the lock to notify other tasks that it now has the resource, and continues to execute. A lock is released by means of a call to a spinrelease() function. Several implementations of spinlock() exist and are suited to solving a variety of problems (see [6] for a discussion of these). For our purposes it is only necessary to know that these constructs exist and must be used when modifying shared data structures. In the shared run queue version that we shall be using, a single lock is used to protect all the scheduler structures. This is shown in Figure 5.1. Communication structures have their own lock.
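As an illustration of the busy-waiting behaviour described above, the following is a minimal test-and-set spin lock built on the GCC atomic built-ins. It is a generic sketch rather than the implementation used in smash; several variants are discussed in [6].

typedef volatile int spinlock_t;

static void spin_lock(spinlock_t *lock)
{
    /* Busy-wait until the lock atomically changes from 0 (free) to 1 (held). */
    while (__atomic_exchange_n(lock, 1, __ATOMIC_ACQUIRE))
        ;                                   /* spin: another task holds the resource */
}

static void spin_unlock(spinlock_t *lock)
{
    __atomic_store_n(lock, 0, __ATOMIC_RELEASE);   /* release the resource */
}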

5.1.2 An Overview of operation

The mechanism used for starting and stopping threads in the original scheduler will be the same as that which we will use in our scheduler with activations. Initially, all processors except the starting one are made to sleep on a semaphore. Threads are enqueued onto the run queue and executed as follows:

• if some processor is idle, the thread to be enqueued is placed on the first idle processor and the processor is woken up. This thread is referred to as the current thread of that processor.
• if all processors are being used, then the thread is placed onto the run queue.


Figure 5.1: Shared run queue architecture (from [6])

When a thread terminates or relinquishes control to other threads, the run queue is first checked to see if other threads do in fact exist. If they do, the original thread is enqueued back onto the run queue and the new thread is dequeued and placed on the processor. If there are no threads on the run queue, execution of the original thread continues. Note that in the shared run queue model there are never threads on the run queue while any of the processors are idle. This means that load-balancing is optimal. However, the processor cache is easily corrupted as threads migrate frequently across different processors. Maintaining threads on the same processor when rescheduled follows what is known as the 'principle of locality'. Various solutions exist to maintain locality while at the same time achieving near-optimal load balancing through thread migration. A discussion of these is given in [6].


5.1.3 Internal scheduler functions

An important element in the implementation of the following functions is that we have a mechanism to identify the kernel thread we are running on. This is achieved by using the ldt structure discussed in Chapter 4, which was used there to store information about activations. On initialisation the kernel threads are given a unique identifier between 0 and N − 1, where N is the number of processors.

Thread Insertion

Thread insertion involves placing the thread onto the shared run queue or an idle processor. In order to determine which operation must be carried out, a count of the number of idle processors must be kept. We shall refer to the variable that stores this number as sched.sleepnum. When no processors are idle, the counter is set to 0 and the thread is simply inserted onto the run queue. When it is greater than 0, the thread is placed onto the first idle processor. Note that this routine is protected by a scheduler spin lock which we shall henceforth refer to as qlock.

Thread Removal

When a thread is removed (either for synchronisation reasons or because it has terminated), another thread must be obtained from the run queue. If there are no threads left then the processor is set to idle. What is effectively done when thread removal is required is for the scheduler kernel to call sched_dequeue(). Unlike in the uniprocessor version where nothing is returned, this time sched_dequeue() returns either the dequeued thread or NULL if the run queue is empty. The calling function will set the processor to sleep if NULL is returned. Otherwise, the dequeued thread is executed. As above, the whole routine is protected by the spin lock qlock, which is released before the processor is set to sleep.


Thread Yield

Thread yielding involves calling sched_dequeue() to check if there are other threads on the run queue. If NULL is returned, the thread continues to execute. If sched_dequeue() returns a pointer to a thread, then the current context of the original thread is saved and it is placed onto the run queue. The dispatched thread is then scheduled. Once again, the whole operation is protected by a spin lock (qlock) until the new current thread is scheduled.

5.1.4 The API for SMP smash

Following is the API provided by SMP smash. Note that this is exactly the same API as that of uniprocessor smash; we reproduce it here for completeness:

• cthread init() - creates and initialises a new thread.

• cthread run() - places a thread on the run queue for execution (calls scheduler in()).

• cthread yield() - yields execution to another thread (calls sched yield()).

• cthread join() - a synchronisation function used to allow a thread to wait for another thread to terminate before continuing.

• cthread stop() - terminates a thread (calls sched dequeue()).

5.2 Integrating SMP smash with scheduler activations

We shall now discuss how we integrate scheduler activations with our user-level thread scheduler for SMPs. We will discuss some of the functionality provided in the activations patch API that we did not need for our uniprocessor implementation but which is now required. We then describe the algorithms that need to be implemented for the upcalls, as well as the changes that must be made to the original thread scheduler.

5.2.1 Replacing semaphores

Semaphores are the constructs used in the original thread scheduler to put kernel threads to sleep and wake them up. In our case the problem is that the wake() and sleep() semaphore operations, carried out by the semop() system call, are blocking system calls. This would obviously interfere with our implementation, since we always need to guarantee that no blocking system calls ever occur within the scheduler kernel. The activations API provides three replacements for the semaphore operations, all invocations of the act cntl() system call with one of the following parameters:

• ACT CNTL WAKE UP - wakes up a sleeping activation.

• ACT CNTL READY TO WAIT - the activation will soon be put to sleep with ACT CNTL DO WAIT, unless an upcall is made between the two.

• ACT CNTL DO WAIT - puts the activation to sleep.

The first of these is used to wake up a sleeping activation. The second parameter of the system call is an integer between 0 and N − 1 where N is the number of processors. When this system call is made, the activation that was idle resumes execution right after the point where the system call act cntl() was made with the ACT CNTL DO WAIT flag. The second and third parameters are used to put an activation to sleep. The reason that two system calls are needed is the non-atomic nature of system calls: unlike the operations on semaphores, naive sleep() and wake() operations could lead to the lost-wake-up problem [20], where an activation may be erroneously set to sleep indefinitely.
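To make the replacement concrete, the sketch below contrasts the semaphore-based sleep and wake used by the original scheduler with the act_cntl() based versions; the wrapper names are assumptions for illustration, and the act_cntl() prototype is assumed from the patch API described above.

#include <sys/types.h>
#include <sys/ipc.h>
#include <sys/sem.h>

extern int act_cntl(int cmd, ...);    /* system call added by the activations patch; prototype assumed */

void kt_sleep_sem(int semid)          /* original scheduler: a blocking system call */
{
    struct sembuf op = { .sem_num = 0, .sem_op = -1, .sem_flg = 0 };
    semop(semid, &op, 1);             /* blocks the kernel thread inside the kernel */
}

void kt_wake_sem(int semid)
{
    struct sembuf op = { .sem_num = 0, .sem_op = +1, .sem_flg = 0 };
    semop(semid, &op, 1);
}

void kt_sleep_act(void)               /* activations version: two-step sleep (Section 5.2.3) */
{
    act_cntl(ACT_CNTL_READY_TO_WAIT);
    act_cntl(ACT_CNTL_DO_WAIT);       /* ignored if a wake-up arrives between the two calls  */
}

void kt_wake_act(int proc)
{
    act_cntl(ACT_CNTL_WAKE_UP, proc); /* restart the activation sleeping on processor proc   */
}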

5.2.2 The lost-wake-up problem

The lost-wake-up problem occurs when a thread issues a wake() call on a thread that has not yet slept. Consider the algorithm fragment shown in Algorithm 9. In this algorithm a thread has just finished executing. If there are no threads on the run queue (lines 1-2) then the processor is set to sleep (line 3), otherwise the dequeued thread is scheduled (line 5). When some other kernel thread wakes this idle processor, it must beforehand place a thread on it. This thread is scheduled in line 7.

Algorithm 10 shows an algorithm fragment of a thread insert operation. In line 1 it is first necessary to check if any processors are idle. This is done by checking a counter of idle processors. If there is an idle processor, the thread is placed on that processor (line 3) and the processor is woken up in order to run the new thread (line 4). It is possible to know which processor is idle since its current thread would be NULL. If no processors are idle the thread is placed onto the run queue (line 6).

The algorithm seems to work well at first sight but a race condition exists. This arises when the processor running Algorithm 10 executes lines 3 and 4 at the exact time when Algorithm 9 is at a point between lines 2 and 3. In this case the wake() call is ignored as the processor is not really idle. The processor executing Algorithm 10 continues running. The processor running Algorithm 9 then executes line 3 and goes to sleep, even though it has a user thread to run.

Algorithm 9 Lost-wake-up problem - (sleep() operation)
1: temp = sched dequeue()
2: if temp == NULL then
3:   sleep()
4: else
5:   scheduler schedule(temp)
6: end if
7: scheduler schedule(new thread)

Algorithm 10 Lost-wake-up problem - (wake() operation)
1: if a processor is idle then
2:   i = idle processor
3:   place thread on processor i
4:   wake(i)
5: else
6:   sched enqueue(thread)
7: end if

Since it is possible to know which processors are idle and which are not, we could have the scheduler repeatedly issue the wake() call until the sleep operation actually occurs. The activation or kernel thread would therefore still be set to sleep but would wake up immediately. Upon this happening, a flag could be set that would be tested for by the thread issuing the wake() call. Once this flag is set, this latter thread would be able to resume its operation. This is one valid solution. However, since the wake() operation is an expensive system call, we would prefer not to busy wait by continuously calling it. It may also be the case that extra wake() system calls are made between the sleep() call being executed and the flag being set. Such a solution is therefore not optimal. What is desirable is to have a similar mechanism but instead to busy-wait on a shared variable.

5.2.3 Solving the lost-wake-up problem with double system calls

The solution provided by the patch API is to use two system calls when the scheduler needs to put an activation to sleep. The first indicates that the activation is about to go to sleep. The second actually puts it to sleep. If a wake() call is made between these two system calls, then the second system call is ignored. The mode of operation of these three system calls is described in the next two algorithms.

Algorithm 11 is similar to Algorithm 9 except that before actually going to sleep, the system call act cntl(ACT CNTL READY TO WAIT) is made. This notifies the kernel that the current activation is soon to make a system call to go to sleep. A flag is then set, which will be busy-waited upon by the waking side. Algorithm 12 is the complementary wake() algorithm. Exactly the same routine as that described in Algorithm 10 is followed. However, before calling wake() it is necessary to ensure that the flag has been set. If not, the scheduler busy waits until it is set. This ensures that before the processor is set to sleep, the ACT CNTL READY TO WAIT system call will have been made.

Now the same race condition as with the lost-wake-up problem can occur, and the wake() call in line 5 of Algorithm 12 is made before the sleep() call in line 5 of Algorithm 11. This time however, since the wake() call comes between the two system calls ACT CNTL READY TO WAIT and ACT CNTL DO WAIT, the activation is not set to sleep by the kernel but continues running. Line 10 in Algorithm 11 is therefore executed and the thread is run on the processor that was about to idle.

Busy waiting wastes valuable processor time and is only recommended when the time to block and unblock is more expensive. Since blocking and unblocking with activations can only be achieved with system calls, busy waiting produces better performance. Besides this, the condition on which the scheduler busy waits occurs rarely. It only occurs when the run queue is empty and a thread is being placed on an activation that has only just been set to sleep. As we shall see, we also make this condition even rarer by placing threads not just on the first processor with no current thread, but on the first processor that has no current thread and has its flag set.

Algorithm 11 System call solution to lost-wake-up problem - (sleep() operation)
1: temp = sched dequeue()
2: if temp == NULL then
3:   act cntl(ACT CNTL READY TO WAIT)
4:   set flag=1
5:   act cntl(ACT CNTL DO WAIT)
6:   set flag=0
7: else
8:   scheduler schedule(temp)
9: end if
10: scheduler schedule(new thread)


Algorithm 12 System call solution to lost-wake-up problem - (wake() operation)
1: if a processor is idle then
2:   i = idle processor
3:   place thread on processor i
4:   while (flag==0) do nothing //busy wait
5:   act cntl(ACT CNTL WAKE UP,i)
6: else
7:   sched enqueue(thread)
8: end if

5.2.4 Maintaining activation information

As with the uniprocessor scheduler, we need a way of identifying activations. We shall use exactly the same structure described in Section 4.3.3. The original thread scheduler without activations uses a value in the kernel thread structure that stores the processor number the kernel thread is running on. This value is used to enable the scheduler kernel to access the correct data structures. Therefore an activation structure now holds:

• an integer value by which an activation is identified.

• the processor number the activation is running on.

• a pointer to the user thread that was running on that activation before it blocked.

Since more than one processor may be modifying the structures of blocked activations, it is necessary to protect them with a lock. We introduce the lock alock, which must be acquired before modifying the list of blocked activations or the linked list of free structures. This lock must then be released as soon as the necessary modifications are made.

Recall that in our uniprocessor version we used a pointer current kt to identify the current kernel thread. We now extend this to an array, current kt[N], for an N-processor system. We also need, for each processor, a flag for the purpose described above. We therefore introduce the array sched.ktready[N]. This array already existed in the original version and was used to wait for all kernel threads to go to sleep before shutting down the scheduler. We extend its use throughout the scheduler to store which processors are sleeping (or about to sleep) and which are not. We also make use of a counter, sched.sleepnum, that holds the number of idle processors. Finally, we note that in an SMP implementation each processor has its own current thread. Therefore, once again, we need an array of size N to store the current thread for each processor. We shall refer to this array as sched.current[N].
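The following C sketch summarises the bookkeeping just described; the exact field and type names are assumptions for illustration, while the arrays and the locks alock and qlock follow the text.

#define NPROC 4                         /* number of processors (PROCNUM in the scheduler) */

typedef struct activation_info {
    int        kt_id;                   /* integer identifying this activation              */
    int        id;                      /* processor number the activation is running on    */
    cthread_t *current_ct;              /* user thread running on it before it blocked      */
} activation_info;

activation_info *current_kt[NPROC];     /* current activation structure per processor       */

struct sched_state {
    cthread_t   *current[NPROC];        /* current user thread per processor                */
    volatile int ktready[NPROC];        /* set while a processor is (about to be) idle      */
    volatile int sleepnum;              /* number of idle processors                        */
} sched;

spinlock_t alock;                       /* protects the blocked-activation structures       */
spinlock_t qlock;                       /* protects the run queue and sched fields          */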

5.3 Implementation of upcalls and scheduler functions

In this section we shall describe the algorithms necessary to integrate scheduler activations with the user-level thread scheduler for SMPs. We need to:

• write an algorithm for putting activations to sleep using the two sleep system calls. We shall call this function activations wait().

• modify the part of scheduler in() that puts activations to sleep.

• change any code that uses semaphores to put kernel threads to sleep, and instead use activations wait().

• modify the cthread yield() function so that when a thread yields execution, an unblocked thread can be resumed.

• write code for the upcall functions act new() and act unblock().

5.3.1 Coding activations wait()

The activations wait() function is the one that will replace the sleep() operation on the semaphores used in the original SMP scheduler. The basic operation of the two system calls required to put activations to sleep was discussed in Section 5.2.3. We shall now look at activations wait() in more detail.

Algorithm 13 shows the algorithm for this function. The variable index represents the id of the current processor and is passed as a parameter to activations wait(). Lines 1 through 3 are used to solve the lost-wake-up problem as described in Section 5.2.3. In line 3 the activation is put to sleep only if no other activations have unblocked. If there are unblocked activations, an upcall will be made to act unblock() and the function never returns. If the activation is put to sleep and some time later another activation unblocks, then once again an act unblock upcall is made to act unblock() and the function never returns. On the other hand, if the activation is woken up using the act cntl() function call with the ACT CNTL WAKE UP parameter, line 4 is executed. This line sets the value of sched.ktready for the current processor to 0 to indicate that it is no longer idle. The thread that was placed on the processor to be executed is then scheduled (line 5).

Algorithm 13 SMP Algorithm for activations wait()
1: act cntl(ACT CNTL READY TO WAIT)
2: sched.ktready[index]=1
3: act cntl(ACT CNTL DO WAIT)
4: sched.ktready[index]=0
5: scheduler schedule(index)

Modifying the sleep operations of the original scheduler simply involved replacing the semop() operations that put processors to sleep with calls to activations wait().
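A minimal C version of Algorithm 13 is sketched below; the act_cntl() prototype is assumed from the patch API, and scheduler_schedule() is the scheduler's dispatch routine described earlier.

extern int act_cntl(int cmd, ...);          /* prototype assumed from the activations patch  */

void activations_wait(int index)
{
    act_cntl(ACT_CNTL_READY_TO_WAIT);       /* announce that this activation will sleep      */
    sched.ktready[index] = 1;               /* wakers may now issue ACT_CNTL_WAKE_UP for us  */
    act_cntl(ACT_CNTL_DO_WAIT);             /* sleep, unless a wake-up or upcall intervened  */
    sched.ktready[index] = 0;               /* woken up: this processor is no longer idle    */
    scheduler_schedule(index);              /* run the thread placed on this processor       */
}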

5.3.2 Modifying scheduler in()

The scheduler in() routine remains very similar to that of the original scheduler. The idea is to first check if any processor is idle. If one such processor is found, a user thread is placed as the current thread for that processor. The processor is then woken up in order to run that thread. If no processors are idle, the thread is enqueued onto the run queue. In order to solve the lost-wake-up problem we also make use of the sched.ktready variable for that processor. In Algorithm 12 we showed how we would busy wait on this flag after finding the first idle processor. We optimise this algorithm in the actual implementation. Instead of waiting for this flag to be set for the first idle processor found, all the processors are checked in a circular manner until the first processor that is both idle and has this flag already set is found. This processor is then chosen to run the current thread.

Algorithm 14 shows the algorithm for the scheduler in() function. The thread to be enqueued (c) is passed as a parameter to the function. In line 1 the scheduler checks if any processors are idle by checking the sched.sleepnum variable. If no processors are idle then the thread is placed onto the run queue (line 2). If an idle processor does exist then the processors are cycled through until an idle one is found (lines 5-7). The current thread of that processor is then set to point to this thread (line 8). In line 9, the value of sched.sleepnum is decremented in order to indicate that there is one less idle processor. The act cntl() system call is then made with two parameters: (i) ACT CNTL WAKE UP to indicate that an activation is to be restarted by the kernel and (ii) the number of the processor which is to be woken up. Note that every call to the scheduler in() function must be sandwiched between spinlock() and spinrelease() operations in order to protect the scheduler data structures.
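A C sketch of this routine, corresponding to Algorithm 14 below, follows; the names are assumptions for illustration except for those taken from the text, and callers are assumed to hold qlock around the call.

void scheduler_in(cthread_t *c)
{
    if (sched.sleepnum == 0) {                  /* nobody idle: queue the thread                  */
        sched_enqueue(c);
        return;
    }
    for (int i = 0; ; i = (i + 1) % NPROC) {    /* cycle until an idle, ready processor is found  */
        if (sched.current[i] == NULL && sched.ktready[i] == 1) {
            sched.current[i] = c;               /* hand the thread to that processor              */
            sched.sleepnum--;                   /* one less idle processor                        */
            act_cntl(ACT_CNTL_WAKE_UP, i);      /* restart the activation sleeping there          */
            return;
        }
    }
}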

Algorithm 14 SMP Algorithm for scheduler in()
1: if sched.sleepnum == 0 then
2:   sched enqueue(c)
3: else
4:   i=0
5:   for (;;) do
6:     i=(i+1) % PROCNUM
7:     if !sched.current[i] && sched.ktready[i]==1 then
8:       sched.current[i]=c
9:       sched.sleepnum--
10:      act cntl(ACT CNTL WAKE UP,i)
11:      return
12:    end if
13:  end for
14: end if

5.3.3 Coding act new()

Whenever a new activation is created, an act new upcall is issued to the act new() function. This upcall is made under two possible conditions:

• at initialisation, when one activation is being created for each processor.

• whenever an activation blocks.

In either case, an integer value (proc) is passed as a parameter by the kernel to act new() in order to identify on which processor the new activation was created. Once the activation is created, the scheduler can do one of two things:

• put the activation to sleep if there are no user threads on the run queue.

• place a user thread from the run queue as the current thread of the new activation. This thread is then scheduled on the new activation.

Algorithm 15 shows the algorithm for the act new() function. In line 1 a flag is checked to see which of the two possible conditions listed above has led to the creation of the activation. In the case that the kernel is still initialising the first activations, an activation structure is reserved for that activation (line 3) and the activation is put to sleep (line 5). Note that it is necessary to protect the linked list of arrays of activation structures with the lock alock. If a new activation was created because another activation blocked, then it is first necessary to save a pointer to the user thread that was running on the blocked activation. This is done in line 7. Note that the pointer to the current activation, current kt[proc], still points to the activation structure of the blocked activation and not to that of the newly created one. The current kt[proc] pointer is updated in line 9 when a new activation structure is reserved for the new activation. Next, the run queue is checked for any user threads that are waiting to be scheduled. If such a thread exists it is removed from the run queue and placed as the current thread for the new activation (line 12). This thread is then scheduled (line 14). The sched dequeue() operation is placed within a spin lock in order for the activation to have exclusive access to the run queue data structures (lines 11 and 13). If there are no user threads to run, the value of sched.sleepnum is incremented to indicate that there is another idle processor (line 16). The qlock is then released before the activation is put to sleep using the activations wait() function (lines 17-18).

Algorithm 15 SMP Algorithm for act new()
1: if initialisation of activations was not yet done then
2:   spin lock(alock)
3:   save activation information for proc
4:   spin release(alock)
5:   activations wait(proc)
6: else
7:   current kt[proc]->current ct=sched.current[proc]
8:   spin lock(alock)
9:   save activation information for proc
10:  spin release(alock)
11:  spin lock(qlock)
12:  if (sched.current[proc] = sched dequeue()) != NULL then
13:    spin release(qlock)
14:    scheduler schedule(proc)
15:  else
16:    sched.sleepnum++
17:    spin release(qlock)
18:    activations wait(proc)
19:  end if
20: end if
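A C sketch of the act new() upcall following Algorithm 15 is given below; save_activation_info() stands in for the reservation and update of the activation structures (and of current kt[proc]) described above, and the remaining helper names are assumptions for illustration.

void act_new(int proc)
{
    if (!activations_initialised) {                       /* first activations being created           */
        spin_lock(&alock);
        save_activation_info(proc);                       /* reserve a structure, set current_kt[proc] */
        spin_release(&alock);
        activations_wait(proc);                           /* nothing to run yet: go idle               */
        return;
    }

    /* An activation on processor proc has just blocked inside the kernel. */
    current_kt[proc]->current_ct = sched.current[proc];   /* remember its user thread                  */
    spin_lock(&alock);
    save_activation_info(proc);                           /* structure for the new activation          */
    spin_release(&alock);

    spin_lock(&qlock);
    sched.current[proc] = sched_dequeue();                /* look for runnable work                    */
    if (sched.current[proc] != NULL) {
        spin_release(&qlock);
        scheduler_schedule(proc);                         /* run the dequeued user thread              */
    } else {
        sched.sleepnum++;                                 /* one more idle processor                   */
        spin_release(&qlock);
        activations_wait(proc);
    }
}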

5.3.4 Modifying cthread yield()

The cthread yield() function needs one simple modification: the scheduler checks the value of act nb unblocked and calls act cntl() with the ACT CNTL RESTART UNBLOCKED parameter if this value is non-zero. Algorithm 16 shows the code for this function. All that is added are lines 2 and 3. There is no need to enclose the polling of act nb unblocked within a spin lock since, if a race condition arises in which the system call is made erroneously, the system call is simply ignored and the function continues normally.

Algorithm 16 SMP Algorithm for cthread yield()
1: save context of current thread
2: if act nb unblocked != 0 then
3:   act cntl(ACT CNTL RESTART UNBLOCKED)
4: end if
5: continue with a normal yield operation
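In C, the added check might look as follows; act nb unblocked and the ACT CNTL RESTART UNBLOCKED command come from the patch API described above, while the two helpers are assumptions standing in for the existing yield code.

void cthread_yield(void)
{
    save_current_context();                        /* existing yield code: save the caller's context */
    if (act_nb_unblocked != 0)                     /* have any activations unblocked meanwhile?      */
        act_cntl(ACT_CNTL_RESTART_UNBLOCKED);      /* resume one of them via an act_unblock upcall   */
    normal_yield();                                /* otherwise carry on with the normal yield path  */
}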

5.3.5 Coding act unblock()

Whenever an activation unblocks, an act unblock upcall is made to the act unblock() function. An activation can unblock for three reasons:

• the act cntl() system call is made with the ACT CNTL RESTART UNBLOCKED parameter from the cthread yield() function.

• an activation is sleeping on an idle processor and another activation has unblocked.

• another activation blocked and the unblocked activation was restarted instead of the kernel creating a new activation. In this case the flag ACT UNBLK KERNEL is passed as a parameter to act unblock().

We shall consider each of these three cases in turn and give a separate algorithm for each. It is possible to establish which of the three cases has occurred by checking a parameter passed to act unblock() by the kernel.

Algorithm 17 shows how these three sub-functions are coalesced into the act unblock() function. Recall that when this function is called by the kernel, no parameter is passed that allows identification of the unblocked activation. It is only by means of the ldt structure that this activation can be identified. This structure is retrieved using the macro THREAD SELF. In line 1 a temporary variable called curr kt is made to point to the activation structure of the unblocked activation. This will be needed in the sub-functions in order to resume the unblocked thread. The id field of the activation structure references the processor number the activation was running on when it blocked. This is updated to reference the number of the new processor on which the kernel chose to resume the unblocked activation. Lines 3 through 9 simply function as a case statement to call the relevant sub-function depending on the condition under which the activation unblocked. In line 10, the return value of the system call is returned and the unblocked user thread is resumed.

Algorithm 17 SMP Algorithm for act unblock()
1: curr kt=THREAD SELF
2: curr kt->id=new proc
3: if came from yield then
4:   call sub-Algorithm 18
5: else if unblocked on an idle processor then
6:   call sub-Algorithm 19
7: else if automatically unblocked by the kernel then
8:   call sub-Algorithm 20
9: end if
10: return return value
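A C sketch of this dispatch is given below. THREAD SELF and the flags ACT UNBLK IDLE and ACT UNBLK KERNEL come from the patch API used in this chapter, but the exact upcall signature is not reproduced here: the parameters new_proc, reason and syscall_retval, and the three helper functions, are assumptions standing in for the information the kernel passes and for Algorithms 18-20.

int act_unblock(int new_proc, int reason, int syscall_retval)
{
    activation_info *curr_kt = THREAD_SELF;    /* identify the unblocked activation via the ldt       */
    curr_kt->id = new_proc;                    /* it may be resumed on a different processor          */

    switch (reason) {
    case ACT_UNBLK_IDLE:                       /* resumed on an idle processor (Algorithm 19)         */
        unblock_on_idle(curr_kt, new_proc);
        break;
    case ACT_UNBLK_KERNEL:                     /* resumed instead of a new activation (Algorithm 20)  */
        unblock_by_kernel(curr_kt, new_proc);
        break;
    default:                                   /* restarted by polling in cthread_yield() (Algorithm 18) */
        unblock_from_yield(curr_kt, new_proc);
        break;
    }
    return syscall_retval;                     /* the unblocked user thread resumes after its system call */
}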

Algorithm 18 shows the code required to deal with the unblocking of an activation by polling in the cthread yield() function. In line 2, the thread that is yielding its execution is placed back onto the run queue, or onto an idle processor if one exists, by using the scheduler in() function. The user thread of the unblocked activation is then set as the current thread for the processor upon which the activation unblocked (line 3). Note that these two operations have to be embedded in a spin lock. In lines 6 and 7, the linked list of arrays of blocked activations is updated to indicate that a new position is free - that of the activation that was running the user thread that yielded its execution. Again the structures are protected by using a spin lock (lines 5 and 8). Finally the pointer to the current activation is set to that of the unblocked activation (line 9).

Algorithm 18 SMP Algorithm for act unblock() from cthread yield()
1: spinlock(sched.qlock)
2: scheduler in(sched.current[new proc])
3: sched.current[new proc]=curr kt->current ct
4: spinrelease(sched.qlock)
5: spinlock(alock)
6: free stop=free stop->next
7: free stop->pos=current kt[new proc]->kt id
8: spinrelease(alock)
9: current kt[new proc]=curr kt

Algorithm 19 shows the code required to deal with the automatic unblocking of an activation by the kernel in order to run that activation on an idle processor. Since this unblocking occurs in a synchronous manner, there is a potential cause for race conditions. This occurs in particular when an activation unblocks automatically just before it was about to be woken up in the scheduler in() function using the act cntl() system call. Algorithm 19 gives a solution to this problem and safely restarts the unblocked thread without overwriting another thread placed on a processor by scheduler in(). It is important to remember that since all calls to scheduler in() are protected by spin locks, it is not possible to enter scheduler in() if the qlock lock is held by another thread. However, it is possible to enter act unblock() while another thread is in the scheduler in() function.

In line 1, the sched.ktready variable of the processor that is to resume the unblocked thread is reset in order to indicate that it is no longer idle. This is done immediately since it is possible that another processor would be spinning in scheduler in(), waiting for this flag to be set (see Sections 5.2.3 and 5.3.2). The algorithm then takes hold of the lock qlock (line 2). This is necessary to ensure that no other processors will be able to call scheduler in() until the lock is released. If some other processor is already executing within scheduler in() then the algorithm busy waits for this function to exit and the lock to be released. Line 3 is crucial in that it distinguishes whether a thread was placed as the current thread for the awoken processor in scheduler in() or not. If no thread was placed, the scheduler sets the unblocked user thread as the current thread for that processor (line 4). The value of sched.sleepnum is then decremented in order to indicate that there is one less idle processor (line 5). If a thread was placed, it is necessary to put this extra thread back onto the run queue. A flag is first set (line 7) and a temporary variable is used to save a reference to this thread (line 8). The current thread for the processor is then set to reference the unblocked thread (line 9). The lock qlock can now be released, allowing other processors to access the scheduler in() function and run queue data structures (line 11). In either case, the list of free positions is updated to indicate that there is a new free activation structure - that of the activation that was idle (lines 12-15). The current activation structure is then set to that of the activation which has unblocked. Finally, the flag that might have been set in line 7 is checked to see if there is a thread that has to be placed back onto the run queue or onto another idle processor. If this is so then scheduler in() is called with a pointer to this thread (lines 17-20).

Algorithm 20 shows the code required to deal with the unblocking of an activation by the kernel instead of it creating a new activation to replace one which has blocked. First, the user thread that was running on the blocked activation is saved in the activation structure of that blocked activation (line 1). The unblocked user thread is then set to be the current thread for the processor running the unblocked activation (line 2). Finally the current activation is set to be that of the unblocked activation.


Algorithm 19 SMP Algorithm for act unblock() by ACT UNBLK IDLE
1: sched.ktready[new proc]=0
2: spinlock(sched.qlock)
3: if sched.current[new proc]==NULL then
4:   sched.current[new proc]=curr kt->current ct
5:   sched.sleepnum--
6: else
7:   flag=1
8:   old ct=sched.current[new proc]
9:   sched.current[new proc]=curr kt->current ct
10: end if
11: spinrelease(sched.qlock)
12: spinlock(alock)
13: free stop=free stop->next
14: free stop->pos=current kt[new proc]->kt id
15: spinrelease(alock)
16: current kt[new proc]=curr kt
17: if flag==1 then
18:   spinlock(sched.qlock)
19:   scheduler in(old ct)
20:   spinrelease(sched.qlock)
21: end if

Algorithm 20 SMP Algorithm for act unblock() by ACT UNBLK KERNEL
1: current kt[new proc]->current ct=sched.current[new proc]
2: sched.current[new proc]=curr kt->current ct
3: current kt[new proc]=curr kt


5.4 Bugs in the kernel patch

The release of the kernel patch used in the implementation of the SMP scheduler with scheduler activations suffers from a serious race condition that crops up when several activations are blocking and unblocking simultaneously. This bug was discovered during the testing of the web server which we shall describe in the next chapter. What happens is that when an activation blocks on a processor, the newly created activation is passed a wrong value in act new() for the parameter identifying the processor on which the activation blocked. This causes the data structures to become corrupted, and either the application crashes or one of the processors remains permanently idle and can no longer be used. This condition does not occur on the uniprocessor implementation, and no known bugs exist for the patch when using one processor. Up until this bug occurs, tests show that the SMP scheduler performs correctly, and no known bugs exist in the scheduler and upcall code described in this chapter. When the patch was delivered, the developers at Lyon warned us that there were known bugs that caused segmentation faults. During the course of this dissertation, the activations patches were still undergoing development and still needed an amount of testing. A new version for the Linux 2.4 kernel is to be released in June 2001 and should solve all these problems. The API will undergo only minor modifications, meaning that our scheduler would also only require minor modifications in order to be fully functional.

5.5 Conclusion

In this chapter we have described how to integrate scheduler activations into a user-level thread scheduler for SMPs. We discussed the main problems that need to be solved, in particular the due consideration that must be given to race conditions. We then described the modifications and new functions that are required in order to have user-level threads using the smash scheduler make use of scheduler activations, so that blocking system calls are prevented from blocking the underlying kernel threads and ultimately the whole application. In the next chapter we shall describe a simple web server that makes use of the two schedulers that use this version of the kernel patch. We shall use this web server to test and demonstrate the performance advantages of using a user-level thread scheduler with scheduler activations.


Chapter 6

ActServ - a multithreaded web server

In this chapter we shall describe a simple web server and benchmark program that were developed in order to demonstrate and test the uniprocessor and SMP schedulers. We shall call this web server ActServ when it makes use of scheduler activations and Serv when it does not. Unfortunately, due to the bug in the SMP implementation of the kernel patch, it was not possible to derive results for the SMP scheduler with activations. We shall produce a number of graphs that serve to:

• compare the performance of the uniprocessor thread scheduler without activations to that with scheduler activations.

• compare the performance of the Apache web server with ActServ on a uniprocessor machine.

• compare the performance of Serv to that of Apache on a 4-processor SMP in order to predict the expected performance of ActServ.

6.1 Benchmark Application

In order to extract performance figures for the web servers, a simple benchmark program was used. The program takes three command line parameters, namely:

• the number of concurrent clients to make requests.

• the number of requests per client.

• the URL of the server.

As an example, consider the program run as follows:

benchmark 10 100 http://cow.cs.um.edu.mt/testfile

The benchmark program would fork 10 clients and each client would make 100 serial requests. The client used is a simple program called http get [22]. It sends a request following the HTTP protocol, just as a web browser would, and outputs the received data.
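The core of such a driver can be sketched in a few lines of C; the http_get() call below is an assumed wrapper around the http_get client [22], and the sketch omits error checking for brevity.

#include <stdio.h>
#include <stdlib.h>
#include <sys/wait.h>
#include <unistd.h>

extern void http_get(const char *url);   /* assumed wrapper around the http_get client [22] */

int main(int argc, char **argv)
{
    if (argc != 4) {
        fprintf(stderr, "usage: benchmark <clients> <requests> <url>\n");
        return 1;
    }
    int clients  = atoi(argv[1]);         /* number of concurrent clients         */
    int requests = atoi(argv[2]);         /* serial requests made by each client  */
    char *url    = argv[3];

    for (int c = 0; c < clients; c++) {
        if (fork() == 0) {                /* child process: one client            */
            for (int r = 0; r < requests; r++)
                http_get(url);            /* issue the requests one after another */
            _exit(0);
        }
    }
    for (int c = 0; c < clients; c++)
        wait(NULL);                       /* wait for all clients to finish       */
    return 0;
}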

6.2 Uniprocessor ActServ

ActServ is a simple web server capable of dispensing static web pages. It implements the HTTP 1.0 protocol as defined in RFC 1945 [21]. The methods implemented are the GET and HEAD methods.

6.2.1 Implementing ActServ for a uniprocessor machine

When a request is made, a user thread, rather than a process or kernel thread, is used to service the request. Blocks of these threads are created at program startup, and dynamically when needed, in order to avoid allocating memory for a new thread with each request. These threads are then reused once the request is complete. Each thread is passed a structure (cstruct) as a parameter that contains:

• a unique number that identifies the thread structure

• the socket number to be used to service the client

• a pointer to the thread itself

To save these structures, a data structure similar to that used to save the information on activations, described in Section 4.3.3, is used. This consists of a linked list of arrays of these structures and a linked list of integers to identify which positions are free. Once an accept() call unblocks, indicating a client request, the socket returned by accept() is saved in one of the cstruct structures. The position of the structure and a pointer to the thread to be used are also saved in the cstruct structure. The list of free structures is then updated to indicate that this structure is in use and cannot be used by another thread. Once the request is complete, the unique number field is used to update the free positions list in order to indicate that the structure is free to be used again. The pointer to the thread is then used to call cthread reuse() so that the thread can be reused. Note that no protection by spin locks is necessary because the environment is a single processor one and no activation can be preempted to indicate that another activation unblocked.
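The request structure and the accept loop described above might look as follows in C; cthread_t, the cstruct allocation helpers and serve_request() are assumptions for illustration, while cthread_run() is part of the smash API given in Chapter 5.

#include <sys/socket.h>

typedef struct cstruct {
    int        id;        /* unique number identifying this structure    */
    int        sock;      /* socket returned by accept() for this client */
    cthread_t *thread;    /* user thread servicing the request           */
} cstruct;

void server_loop(int listen_fd)
{
    for (;;) {
        int sock = accept(listen_fd, NULL, NULL);  /* may block: with activations, another
                                                      kernel thread keeps the scheduler running  */
        cstruct *cs = cstruct_alloc();             /* take a free position from the linked list  */
        cs->sock    = sock;
        cs->thread  = cthread_for_request(serve_request, cs); /* new or reused user thread       */
        cthread_run(cs->thread);                   /* enqueue the thread onto the run queue      */
    }
}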

6.2.2 Comparing ActServ to Serv

In order to run the tests described in this section, the benchmark program was run on a four processor machine, while a single processor machine was used to run the servers. The two computers were linked together by means of a 100Mbit ethernet link.

Serv can only service one client at a time when run on a uniprocessor, since it services its clients sequentially. This means that if a client with a very slow connection requests a file before a number of fast clients, these latter clients will have to wait for the response to the slow client to be completed. With ActServ however, the read() and write() operations on the slow client's socket would block. Another activation would therefore be created and used to service the fast clients. This implies that by using scheduler activations, concurrency at a lower level, provided by the kernel, is achieved in addition to the concurrency provided by the thread library at the user level.

The first tests involved comparing ActServ and Serv. In order to simulate different loads on the network, and slow and fast clients, a delay was introduced between each read() operation by the client. The file requested was a 1MB file and each read() operation read 4KB from the socket. Figure 6.1 shows the performance of ActServ and Serv when this delay is zero. On a very fast network with a low load, the performance of Serv is marginally better than that of ActServ. This is because the read() and write() operations on the sockets rarely block. When they do block, the overhead of creating a new activation is larger than it would have been had the server simply waited for the socket to unblock.

Figure 6.1: Comparing ActServ and Serv - Very low network load

With a delay between reads on the client side that is only slightly larger (10-15 µs), the performance of ActServ begins to deteriorate, as shown in Figure 6.2. This is because at this point the read() and write() operations do begin to block more frequently. However, the time to create the new activations is still greater than it would have been had the kernel simply waited for the system call to unblock. When the load on the network reaches a certain threshold, the performance of ActServ is always better than that of Serv. This occurs when the time that a system call remains blocked is greater than the overhead generated in creating a new activation.

Figure 6.2: Comparing ActServ and Serv - Low network load

This threshold is passed in our simulation of network load when the delay between reads in the client is just 2750 µs. This is in fact the approximate time taken to clone a new kernel thread and make an upcall. Beyond this load, the performance of ActServ will always be better than that of Serv, since ActServ allows the server to serve clients concurrently. When the load on the server is high enough, full concurrency is achieved when compared to Serv. This is exactly what is required - an improvement in performance on a network with an average to high load.

6.2.3 Comparing ActServ to Apache

Apache is a web server that comes as standard with Linux distributions. It is therefore an ideal baseline for comparing web servers. In this test, the benchmark program was set so that each client made 1000 sequential requests for a 1KB file. Figure 6.4 shows that at concurrency levels above 10, the performance of ActServ is on average 30% better than that of Apache. At low concurrency, this improvement is derived mainly from using a user-level scheduler rather than the heavyweight processes used by Apache.

Figure 6.3: Comparing ActServ and Serv - Average network load

However, as the concurrency increases, and hence also the load on the network, it is the scheduler activations that maintain the constant performance. Note that Apache was configured in order to achieve maximum performance. This was done by allowing it to create server processes beforehand and never destroying these processes.

6.3 SMP Serv

In order to run the tests described in this section, the benchmark program was run on a single processor machine, while a four processor machine was used to run the servers. Again, the two computers were linked together by means of a 100Mbit ethernet link.

6.3.1 Implementing Serv for an SMP machine

The web server needs to be modified slightly because of the risk of concurrent access corrupting the thread structures. All that is required is for these structures to be protected by a spin lock.

Figure 6.4: Comparing ActServ and Apache on a uniprocessor

Before a processor attempts to use one of the cstruct structures to pass as a parameter to a user thread servicing a client, it must first acquire the lock. Once the request is serviced, the spin lock is again acquired before marking the structure on the linked list of free positions as reusable.
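A small C sketch of this protection follows; the lock name and helper functions are assumptions for illustration.

static spinlock_t cstruct_lock;

cstruct *get_free_cstruct(void)
{
    spin_lock(&cstruct_lock);
    cstruct *cs = take_free_position();   /* pop a free slot from the linked list of free positions */
    spin_release(&cstruct_lock);
    return cs;
}

void release_cstruct(cstruct *cs)
{
    spin_lock(&cstruct_lock);
    mark_position_free(cs->id);           /* put the slot back on the free list once serviced       */
    spin_release(&cstruct_lock);
}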

6.3.2 Comparing Serv to Apache

In order to run the tests described in this section, the server was run on a four processor machine, while a single processor machine was used to run the benchmark program. The two computers were linked together by means of a 100Mbit ethernet link. In this test, the benchmark program was again set so that each client made 1000 sequential requests for a 1KB file. Figure 6.5 shows that at concurrency levels between 10 and 75, the performance of Serv is on average 50% better than that of Apache. However, when the load on the network increases to concurrency levels above 75, the number of serviced requests per second continues to improve nearly linearly for Apache while that of Serv begins to level out.

This is because while Apache creates a process to service each client, Serv can only service four clients at a time. This is where the advantages of using scheduler activations in ActServ would have been expected to show up. As one activation blocks on the loaded network, another activation would be created to service another client. Therefore, the performance of ActServ would be expected to be slightly worse than that of Serv at low concurrency, because of the overhead introduced by activations, but better at higher concurrencies.

Figure 6.5: Comparing Serv and Apache on a 4-processor SMP

6.4 Conclusion

6.4.1 User threads vs kernel threads and processes

The comparison figures obtained for ActServ, Serv and Apache clearly underline the advantage of using user-level threads as opposed to processes. Apache uses processes in order to serve multiple clients concurrently. These are heavyweight structures that are expensive to create. In fact, if Apache is not pre-configured to maintain its processes after serving clients, as well as to create a number of server processes beforehand, its performance is much worse than that described above. User-level threads are relatively cheap to create. Synthetic tests show that had we chosen to create a new thread for each request rather than maintain a pool of reusable user threads, performance would have been only marginally worse. Interestingly, the Apache organisation intends to market a new web server that uses kernel threads rather than processes in order to achieve server concurrency. One would expect the performance of such applications to be better than that of the existing versions of Apache but still worse than that obtained using user-level threads.

6.4.2 The advantages of using scheduler activations

The tests for uniprocessor ActServ clearly show that it is not feasible to write a user-level multithreaded web server without any additional concurrency provided by the kernel. This is because such a web server would only be able to service a number of concurrent clients equal to the underlying number of kernel threads used. By using scheduler activations, ActServ automatically achieves this added concurrency at a near-optimal level. Additional kernel threads are created and destroyed as needed by the kernel in order to allow full resource usage with little overhead.


Chapter 7

Conclusion

We conclude this document by reviewing the original aims and the resulting achievements attained in the various chapters of this project. We finally discuss possible extensions that could further improve this dissertation.

7.1 Results and achievements

7.1.1 Uniprocessor schedulers with activations

The two scheduler activations patches were successfully integrated with the circular run queue smash scheduler. The first implementation provided the impetus for the design of some interesting algorithms and wait-free solutions that were used to deal with the automatic preemption caused by the unblocking of activations. It also served to highlight the shortcomings of using a totally synchronous interface such as Anderson's scheduler activations [1]. The second implementation, as proposed by Danjean [2], provided a simpler model for implementing scheduler activations. However, in order to achieve high efficiency from the user-level scheduler kernel, a number of data structures to save information on blocked activations, and techniques for identifying activations, had to be found and implemented. These allow an application to achieve all the advantages of using user threads as opposed to kernel threads by rarely using expensive system calls for memory allocation and kernel thread identification.

7.1.2 SMP scheduler with activations

The SMP implementation of smash with scheduler activations is optimised and free of known bugs. Even though the kernel patch does have the bug discussed in Section 5.4, it operates correctly in applications that do not perform several blocking system calls. With the new kernel patch to be distributed in June 2001, these bugs should be ironed out and tests with ActServ could then be carried out. The original SMP smash scheduler was successfully modified and expanded to deal with scheduler activations. Alternatives had to be found for parts of the original smash implementation that interfered with the use of activations, particularly the use of semaphores. The data structures and techniques necessary to save information on blocked activations and to identify activations were simply borrowed from those used in the second uniprocessor scheduler. However, several race conditions had to be identified and solved due to the shared memory environment the scheduler operates in. This involved using a combination of spin locks, system calls provided by the patch, and busy-waiting techniques.

7.1.3 ActServ

The web server ActServ highlights the effectiveness of using scheduler activations to provide kernel support to a user-level thread library. Results show that using scheduler activations allowed us to achieve added concurrency by running threads while waiting for others to unblock. By comparing this web server with Apache we saw that using user threads rather than processes allows us to achieve improvements of up to 30% on a uniprocessor machine and up to 50% on an SMP with four processors.


7.2 Possible extensions

Future investigation could be aimed at the use of scheduler activations for uniprocessor and SMP schedulers that implement different architectures. Other solutions that prevent blocking system calls from blocking an application, such as Tucker's [4] and Inohara and Masuda's [14] techniques, could also be integrated with the user-level schedulers in order to see how they compare to scheduler activations. The web server could also undergo a number of improvements in order to make it more feature-rich and suitable for commercial web hosting.

7.2.1 Extensions for the uniprocessor schedulers with activations

Debattista [6] describes three architectures for uniprocessor thread schedulers. Each architecture performs best on specific types of applications. Therefore integrating scheduler activations with these architectures would serve to give kernel support to the schedulers optimised for the various types of applications. An interesting experiment would have been to compare the two schedulers developed in Chapter 4 in order to measure the actual improvement of polled unblocking as opposed to synchronous unblocking.

7.2.2 Extensions for the SMP scheduler with activations

Various architectures exist for user-level schedulers on SMPs, the traditional shared run queue architecture used in this dissertation being the simplest. Some improvements on the shared run queue architecture described by Debattista include per-processor run queues, batch thread migration, and schedulers that use mutual exclusion or lock-free and wait-free algorithms. Integrating scheduler activations with these other smash schedulers would require more complex algorithms for the upcalls and routines described in Chapter 5. However, it would be interesting to see how the performance improvements achieved by using these schedulers over the traditional shared run queue architecture differ when scheduler activations are used in applications that perform several blocking system calls.

7.2.3 Extensions for ActServ

ActServ implements the HTTP 1.0 protocol and cannot handle dynamic web pages. Two main extensions are possible in this area. The first is to upgrade the web server to implement the more recent HTTP 1.1 protocol. The second is to implement some form of server-side scripting such as PHP [25] or CGI scripts. The expected performance of ActServ when compared to Apache should then be even better than that for dispensing static web pages alone, due to the heavier computation required on the server side.

7.3 Final remarks

The principal aims set out for this dissertation were met. The thread libraries developed give the application developer a simple but effective approach to improving the performance of parallel applications. By combining kernel support with the performance advantages of user-level multithreading, applications can achieve maximum resource utilisation from the underlying hardware. Using a user-level thread scheduler alone is not suitable for implementing applications that make several blocking system calls. However, by using scheduler activations, the proposition of choosing user-level threads over processes or kernel threads for such applications is a valid one.


Bibliography

[1] T. Anderson, B. Bershad, E. Lazowska and H. Levy. Scheduler activations: effective kernel support for the user-level management of parallelism. In Proceedings of the 13th ACM Symposium on Operating Systems Principles. October 1991.

[2] V. Danjean, R. Namyst and R. D. Russell. Integrating kernel activations in a multithreaded runtime system on top of Linux. École Normale Supérieure de Lyon. March 2000.

[3] V. Danjean. Extending the Linux kernel with activations for better support of multithreaded programs and integration in PM2. École Normale Supérieure de Lyon, internship made at the University of New Hampshire. September 1999.

[4] A. Tucker and A. Gupta. Process control and scheduling for multiprogrammed shared-memory multiprocessors. In Proceedings of the 12th ACM Symposium on Operating Systems Principles. December 1989.

[5] A. Tucker. Efficient scheduling on multiprogrammed shared-memory multiprocessors. Thesis for the Degree of Doctor of Philosophy, Stanford University. December 1993.

[6] K. Debattista. High performance thread scheduling on shared memory multiprocessors. Thesis for the Degree of Master of Science, University of Malta. January 2001.


[7] K. Vella. Seamless parallel computing on heterogenous networks of multiprocessor workstations. Thesis for the Degree of Doctor of Philosophy, University of Kent at Canterbury. December 1998.

[8] J. Cordina. Fast multi-threading on shared memory multiprocessors. Thesis for the Degree of Bachelor of Science, University of Malta. June 2000.

[9] M. Boosten, R. W. Dobinson and P. D. V. van der Stok. Fine grain parallel processing on commodity platforms. IOS Press. 1999.

[10] D. Wood and P. Welch. The Kent retargetable occam compiler. In Proceedings of WoTUG 19, volume 47 of Concurrent Systems Engineering, IOS Press. March 1996.

[11] E. W. Dijkstra. Cooperating sequential processes. In F. Genuys (ed.), Programming Languages, Academic Press. 1967.

[12] R. Namyst and J. Méhaut. MARCEL: une bibliothèque de processus légers. Laboratoire d'Informatique Fondamentale de Lille, Lille. 1995.

[13] R. Namyst. PM2: un environnement pour une conception portable et une exécution efficace des applications parallèles irrégulières. Thèse de doctorat, Univ. de Lille. January 1997.

[14] S. Inohara and T. Masuda. A framework for minimising thread management overhead based on asynchronous cooperation between user and kernel schedulers. Technical Report, Department of Information Science, Faculty of Science, University of Tokyo. January 1994.

[15] C. Schimmel. UNIX systems for modern architectures. Addison Wesley. 1994.

[16] J. Valois. Lock-free linked lists using compare-and-swap. In Proceedings of the 14th Annual ACM Symposium on Principles of Distributed Computing. August 1995.


[17] W. R. Stevens. Advanced programming in the UNIX environment. Addison Wesley. 1993.

[18] F. R. M. Barnes. Blocking system calls in KRoC/Linux. In Communicating Process Architectures 2000, volume 58 of Concurrent Systems Engineering Series. Computing Laboratory, University of Kent, IOS Press. September 2000.

[19] U. Vahalia. UNIX internals. Prentice Hall. 1996.

[20] A. M. Lister and R. D. Eager. Fundamentals of operating systems. MacMillan Computer Science Series. 1993.

[21] Internet RFC archives. http://www.faqs.org/rfcs/rfc1945.html

[22] A. Globus and J. Poskanzer. [email protected], [email protected]. ACME Laboratories.

[23] SUN Microsystems. www.sun.com/solaris

[24] DCE, Germany. www.dce.de/threads.html

[25] PHP home page. http://www.php3.org/

[26] The Apache software foundation. http://www.apache.org/

[27] Glibc. http://www.gnu.org/software/libc/libc.html

[28] MSDN Online Library. http://msdn.microsoft.com/library/
