CONCURRENCY AND COMPUTATION: PRACTICE AND EXPERIENCE Concurrency Computat.: Pract. Exper. (2011) Published online in Wiley Online Library (wileyonlinelibrary.com). DOI: 10.1002/cpe.1903

EDITORIAL

Trends and challenges in operating systems—from parallel computing to cloud computing

1. INTRODUCTION

Moore's law has worked well for all of us over the last three decades by increasing single-thread performance through more sophisticated transistor usage. In addition, ever-increasing clock frequencies allowed applications to get linearly faster just by moving to the next release of a processor architecture. Because of physical limitations, this development has come to an end. Because a linear increase in clock speed leads to a quadratic (or even worse) increase in power usage, modern chips can no longer accelerate software execution by just running faster. Instead, chip designers and engineers started developing processors that use the (still increasing) number of transistors to integrate multiple cores, in fact creating a multicomputer on a chip.

This new development puts another layer on the already existing multiple levels of parallelism in a modern processor. On the lowest level, the execution unit itself can realize instruction-level parallelism [1], a widely implemented approach in modern processor designs. Each execution unit additionally supports the concept of logical processors, which allows for simultaneous multithreading (SMT) inside the processor pipeline [2, 3]. This approach allows hiding memory access latencies for some of the threads by utilizing the execution unit for other tasks during a blocking period; it allows for a virtual sharing of execution units within a processor. SMT maintains the architectural state of the execution unit separately per logical processor, in order to allow context switching in hardware. A set of execution units can be put together to form a chip multi-processing design, today known as multicore and many-core processors [4]. The different ways of exploiting parallelism inside one processor chip are then summarized as chip multithreading capabilities [5].
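The latency-hiding idea behind SMT can be illustrated by analogy with software threads. In the following Python sketch (illustrative only; `time.sleep` stands in for a long-latency stall, and the durations are invented), overlapping blocked activities finishes much sooner than running them back to back:

```python
import threading
import time

# Sketch of SMT-style latency hiding: while one activity blocks (here
# simulated with sleep, standing in for a memory stall), another can
# use the execution resources.
def blocking_work():
    time.sleep(0.05)   # stand-in for a long-latency stall

def run_sequential(n):
    start = time.perf_counter()
    for _ in range(n):
        blocking_work()
    return time.perf_counter() - start

def run_overlapped(n):
    start = time.perf_counter()
    threads = [threading.Thread(target=blocking_work) for _ in range(n)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return time.perf_counter() - start

if __name__ == "__main__":
    print(f"sequential: {run_sequential(4):.2f}s")
    print(f"overlapped: {run_overlapped(4):.2f}s")  # substantially shorter
```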
The new generations of processor hardware have major consequences for operating system (OS) and application design, because they will eventually turn every computer into a parallel system. The problem of exploiting parallel computing power, traditionally only an issue for the high-performance computing (HPC) community, is now crucial for the mainstream of OSs and programming languages. Although widely debated, these changes at different software levels do not primarily affect the desktop but rather server computers. However, nearly all modern applications and user front ends rely on some kind of connectivity to server-based functionality. Front-end applications are mainly driven by sequential performance, which has reached an acceptable level for the large majority of client applications on desktop computers. Lower-end devices, such as smart phones or embedded systems, increasingly demand the offloading of compute-intensive tasks to server systems. We therefore argue that the true challenges for future OSs and middleware will be on the server side. In this scenario of multiple-level parallelism, the server software stack has to face a single question: How can a server system exploit the new parallel processing power most effectively in order to provide an optimal quality of service to the client?

In the following article, we focus on three major trends (Figure 1) that have visibly started to impact the development of next-generation server OSs. First, the OS must support the concurrent execution of parallelized computational load with a reasonable management overhead, in order to utilize the available resources efficiently. If the computational workload is not capable of running in parallel, a second option is the use of the physical resources for multiple server applications/OS instances by

Copyright © 2011 John Wiley & Sons, Ltd.

Figure 1. Future trends in server environments.

means of virtualization. Both demands may lead to conflicting functional and performance optimizations in the OS, because protection and resource management overhead must be kept at a reasonable scale at the same time. Beyond these two dimensions of resource utilization, the compute power (either for parallel or partitioned workloads) can also be provided as a service for remote users (i.e., cloud computing). This brings another set of orthogonal demands to the attention of the server runtime environment architect. In the subsequent sections, we discuss these three trends in order to point out relevant issues for future OS design.

2. DYNAMIC PARALLELISM IN THE OPERATING SYSTEM

In order to take advantage of multiple execution engines, a computer needs software that can do multiple things simultaneously. One obvious solution works well for throughput-oriented servers: in order to increase throughput, those servers execute many instances of a job simultaneously. By feeding additional requests to those servers, additional CPU cores can easily be saturated.

An alternative is the execution of server applications that exhibit internal parallelism. Many well-established results for concurrency control in the OS are currently being reviewed in the light of future multicore and many-core processors [6]. Reasonably well-understood parallel computing algorithm designs are being translated into patterns and guidelines for multicore-aware systems and suddenly become relevant for server application operation [7]. The long-debated question of the "right" parallel programming model gains new importance, not only for highly specialized parallel systems but for mainstream servers as well, which demands a better mapping from user-mode threading libraries to the hardware-aware scheduling performed in the OS kernel.
The new programming models typically address three basic concerns: the implicit or explicit creation of parallelism, communication among parallel entities, and coordination of parallel execution in a control-parallel or a data-parallel style [8]. Communication follows either the message-passing or the shared-memory approach. With a chosen parallelization approach, either the OS is used to execute multiple tasks (threads/processes), or the data-parallel execution environment is provided with work items (mapper/reducer tasks).
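The distinction between the control-parallel and data-parallel styles can be sketched in a few lines. The following Python example (illustrative only; the article prescribes no particular API) expresses the same idea both ways: explicitly created tasks versus work items handed to a runtime:

```python
from concurrent.futures import ThreadPoolExecutor

# Control-parallel style: explicitly created tasks, joined by the caller.
def control_parallel(jobs):
    with ThreadPoolExecutor(max_workers=4) as pool:
        futures = [pool.submit(job) for job in jobs]
        return [f.result() for f in futures]

# Data-parallel style: the runtime is handed work items (a mapper
# applied over a data set) and organizes the execution itself.
def data_parallel(mapper, data):
    with ThreadPoolExecutor(max_workers=4) as pool:
        return list(pool.map(mapper, data))

if __name__ == "__main__":
    print(control_parallel([lambda: 1 + 1, lambda: 2 * 3]))  # [2, 6]
    print(data_parallel(lambda x: x * x, range(5)))          # [0, 1, 4, 9, 16]
```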



It must be noted that these topics have been investigated in the context of special-purpose parallel computers over the last 20 years. However, they are now becoming mainstream. One example of the closer collaboration between user-mode libraries and the OS is found in recent Microsoft products, such as the Parallel Pattern Library, the Asynchronous Agents Library for C++, or the Task Parallel Library [9]. Other examples of multithreading support in the form of a library are Intel's Threading Building Blocks [10], Cilk [11], and OpenMP [12]. Many of these products rely upon some concurrency runtime library, which provides both low-level synchronization primitives and high-level concurrent data structures for multiple programming language front ends. The scheduling of parallel tasks is typically coordinated by a work-stealing algorithm [13]. The implementation of such runtime libraries, for example in the Microsoft case, relies not only on the classical concurrent thread execution mechanism, but also on extended information from the kernel about the underlying hardware architecture, for example regarding the memory hierarchy.

Another example of better collaboration between parallel applications and the OS kernel is Apple's Grand Central Dispatch (GCD). This parallel programming model for C and Objective-C developers [14] relies upon an asynchronous design approach to solving the concurrency problem. It takes the thread management code one would normally write in an application and moves that code down to the OS layer. All the programmer has to do is define the tasks she wants to execute and add them to an appropriate dispatch queue. GCD takes care of creating the needed threads and of scheduling tasks to run on those threads. Because thread management is now part of the OS kernel operation, GCD provides a holistic approach to task management and execution, providing better efficiency than traditional threads.
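The work-stealing scheduling mentioned above [13] can be reduced to a deliberately simplified sketch: each worker owns a double-ended queue, pops tasks from its own bottom, and an idle worker steals from the top of a victim's queue. This Python version is illustrative only; production runtimes use lock-free deques and locality heuristics omitted here:

```python
import threading
from collections import deque

# Minimal work-stealing sketch: one deque per worker; pop from own
# bottom, steal from a victim's top. A single lock replaces the
# lock-free deques of real runtimes.
class WorkStealingPool:
    def __init__(self, n_workers):
        self.queues = [deque() for _ in range(n_workers)]
        self.lock = threading.Lock()
        self.results = []

    def submit(self, worker_id, task):
        with self.lock:
            self.queues[worker_id].append(task)

    def _next_task(self, worker_id):
        with self.lock:
            if self.queues[worker_id]:
                return self.queues[worker_id].pop()   # own bottom
            for q in self.queues:                     # look for a victim
                if q:
                    return q.popleft()                # steal from its top
        return None

    def _worker(self, worker_id):
        while True:
            task = self._next_task(worker_id)
            if task is None:
                return
            result = task()
            with self.lock:
                self.results.append(result)

    def run(self):
        threads = [threading.Thread(target=self._worker, args=(i,))
                   for i in range(len(self.queues))]
        for t in threads:
            t.start()
        for t in threads:
            t.join()
        return sorted(self.results)

if __name__ == "__main__":
    pool = WorkStealingPool(2)
    for i in range(6):
        pool.submit(0, lambda i=i: i * 10)  # all work lands on worker 0
    # Worker 1 only makes progress by stealing from worker 0's deque.
    print(pool.run())  # [0, 10, 20, 30, 40, 50]
```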
Operation queues, realized as Objective-C objects, present another interesting technology. With operation queues, the programmer defines the tasks she wants to execute and then adds them to an operation queue, which handles the scheduling and execution of those tasks. Like GCD, operation queues handle all of the thread management, ensuring that tasks are executed as quickly and as efficiently as possible on the system.

For data-parallel programming, one of the most influential standards is OpenCL [15]. The specification aims at the programming of a heterogeneous collection of CPUs, graphics processing units (GPUs), and other discrete computing devices organized as a single platform. For existing implementations in the graphics card domain, the majority of the API is implemented by the graphics/GPU card driver living in the OS, while the application still only utilizes a pure user-level library [16]. For this reason, work items executed using OpenCL must be self-contained, so that they can be transferred to the system level and the graphics processor for asynchronous execution.

2.1. Discussion

The support for multiple parallel activities in hardware is realized on different levels in today's computer systems. One common problem in parallel programming today is the difficulty of correctly representing the underlying multi-level system of heterogeneous hardware in the programming language/environment. Even though modern OSs have started to be aware of this effect [17], they still cannot consider things such as data dependencies when scheduling parallel activities. Other research in the OS community focuses on scalability, addressing topics such as lock-free data structures and software transactional memory [18]. Experimental systems such as Intel's 48-core Single-chip Cloud Computer (SCC) no longer support a single system view with cache-coherent shared memory. Instead, these systems need to be programmed as a multicomputer-on-a-chip.
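Whether work items are shipped to a GPU via OpenCL or to a non-cache-coherent many-core like the SCC, they must be self-contained functions of their index. The following host-side Python sketch mimics that model; the name `enqueue_kernel` is invented for illustration and is not OpenCL API:

```python
# OpenCL-style data parallelism: a "kernel" is a self-contained
# function of a global index, applied over an N-element index space.
def enqueue_kernel(kernel, global_size, *buffers):
    # A real runtime would hand work items to a device driver for
    # asynchronous execution; here we run them in-order on the host.
    for global_id in range(global_size):
        kernel(global_id, *buffers)

def vector_add(gid, a, b, out):
    # Self-contained: touches only its own index, no shared mutable state.
    out[gid] = a[gid] + b[gid]

a = [1, 2, 3, 4]
b = [10, 20, 30, 40]
out = [0] * 4
enqueue_kernel(vector_add, 4, a, b, out)
print(out)  # [11, 22, 33, 44]
```

Because each work item depends only on its own index, the loop in `enqueue_kernel` could be distributed over any number of devices without coordination, which is exactly why the self-containment requirement exists.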
Research questions, such as whether message passing or shared memory is the better programming abstraction for those systems, are being asked (again). Most libraries and programming language extensions lack monitoring features, which would be needed in order to dynamically optimize the partitioning of data and execution based on the OS's knowledge of the underlying hardware. In the best case, the OS will report all relevant hardware settings to the library, and the implementation will explicitly coordinate the resource usage.

Another problem with the general shared-memory approach is the contention of shared resources. With the increase of parallelism in software layers, the processor faces a high number of threads with


different memory access patterns and cache utilization profiles [19]. Proposals by Intel suggest the introduction of a QoS-aware memory hierarchy [20], where the OS prioritizes some threads with respect to their cache and memory bandwidth usage. Other strategies still need to be investigated.

In general, this shows the need to improve the software-to-hardware mapping. Above the level of superscalar instruction processing, all parallelization coordination must consider the given hardware layers, without tying itself too much to one particular system architecture. Describing the characteristics of the underlying hardware architecture in an abstract way is a prerequisite for developing efficient scheduling and data partitioning schemes for parallel applications.

Thread pools are another well-known example. They free the programmer from the burden of thread management when executing multiple pieces of code in parallel. However, library-based thread pools are not well integrated with the OS's scheduler. Also, they solely operate on the coarse-grained level of threads, whereas new mechanisms such as dispatch or operation queues work on the more fine-grained level of functions or blocks.

Traditional resource management in OSs assumes that processors, memory, and I/O channels are homogeneous resources. Even within an OS kernel, memory modules and CPUs are indistinguishable and treated equally. This is changing, as previously esoteric non-uniform memory access (NUMA) system designs with heterogeneous compute units (CPU/GPU) are becoming mainstream building blocks for current and future application servers. Windows Server 2008 R2 (Microsoft Corporation, Redmond, WA, USA), which shares its code base with Windows 7, is one example of how OSs can be extended to facilitate the new hardware mechanisms. It supports the concept of processor groups in order to deal with huge numbers (>64) of logical processors in a system.
The traditional APIs for managing processor affinity remain valid; they just operate within a single processor group. Also, Server 2008 R2 by default manages virtual memory in such a manner that memory allocations are directed to physical modules that are close (low-latency access) to the corresponding CPU. In addition, an elaborate API can be used to explore the topology of the underlying NUMA machine and distribute threads/memory allocations accordingly. Core parking is another technique, where a modified scheduler tries to aggregate many threads on a single CPU core (ideally leaving entire CPU sockets idle) in order to help with energy management.

Data-parallel programming has been the foundation for a set of well-known server infrastructure environments, such as Hadoop. The major issue for such environments is the increasing relevance of I/O performance. Data locality and support for a feasible consistency model are expected from the middleware. In addition, the OS must support low-latency/high-bandwidth interconnects in the best possible way, in order to deal with the increasing data amounts to be handled even in standard server applications. The tight integration of a parallel programming library with the OS will eventually allow accommodating heterogeneous parallel programming models "under the hood". The programmer of the future should be enabled to write library-based parallel code, which will then be directed either to the CPU or the GPU.

Besides extending traditional OS abstractions for advanced server hardware, there is a more disruptive approach with radical experimental OSs. ETH's Barrelfish [21] tries to put specialized satellite kernels on each of the CPU cores. This approach demands new ways of memory management, I/O handling, and application code translation. Traditional foundations for this work come from the domain of distributed shared memory systems [22, 23] and single system image OSs such as OpenMOSIX [24].
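As a small illustration of the topology-aware placement discussed in this section, the following Python sketch discovers the logical CPUs available to a process and partitions them into groups of at most 64, echoing the processor-group concept. `os.sched_getaffinity` is Linux-specific, so a portable fallback is included; the grouping helper is invented for illustration:

```python
import os

# Discover the logical CPUs the OS allows this process to use.
def logical_cpus():
    try:
        return sorted(os.sched_getaffinity(0))  # Linux-only
    except AttributeError:
        return list(range(os.cpu_count() or 1))  # portable fallback

# Partition CPUs into groups of at most 64 logical processors,
# mirroring the Windows Server 2008 R2 processor-group limit.
def processor_groups(cpus, group_size=64):
    return [cpus[i:i + group_size] for i in range(0, len(cpus), group_size)]

if __name__ == "__main__":
    cpus = logical_cpus()
    groups = processor_groups(cpus)
    print(f"{len(cpus)} logical CPUs in {len(groups)} group(s)")
```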
3. DYNAMIC PARTITIONING IN THE OPERATING SYSTEM

Operating systems abstract away implementation details of the underlying hardware. All OS resources made available to higher software layers and applications are virtual entities. For instance, the notion of a process or thread is an abstract, virtual representation of a computer's CPU. The notion of a large virtual memory is an abstract representation of a computer's storage system. Thus, dynamic partitioning in its modern form of "virtualization" is nothing new. It happens on all layers of a computer system.



The term "virtualization" was initially coined in the context of the IBM mainframe systems introduced in the 1960s. It was introduced to multiplex OS instances and applications on costly hardware, as a means to partition scarce and expensive computing resources. IBM introduced virtualization first with the development of the System/360 Model 67 mainframe. Initial approaches focused on partitioning the resources available on a physical machine by providing so-called logical partitions (LPARs). This is performed mostly on the hardware layer. Two LPARs may access memory from a common memory chip, provided that the ranges of addresses directly accessible to each do not overlap. One partition may indirectly control memory of a second partition, but only by commanding a process of the second partition to directly operate on that memory. CPUs may be dedicated to a single LPAR or shared between separate LPARs. On IBM mainframes, LPARs are managed by the Processor Resource/System Manager facility. Modern IBM mainframes operate exclusively in LPAR mode, even when there is only one LPAR on a machine. LPARs are today the lowest layer of virtualization. They require hardware support that is present in the mainframe and in IBM's Power architecture. Mainframe virtualization with its LPAR support must be seen as the strongest realization of resource partitioning; however, similar technologies are also becoming part of more and more layers in the x86 hardware/software stack.

3.1. The virtual machine monitor concept

The basic unit of virtualization abstraction is the virtual machine (VM), the representation of a running OS instance. VMs are managed through a virtual machine monitor (VMM), or hypervisor [25]. VMMs export a VM abstraction that (mostly) resembles the underlying hardware. Each VM abstraction is a guest that encapsulates the entire state of the OS running within it.
The guest OS interacts with virtual hardware abstractions presented by the VMM as if they were real hardware. The VMM runs at the most privileged level of the processor modes, whereas the guest OS typically runs in the least privileged user mode.

Common uses of virtualization today include server consolidation and containment, where previously under-utilized physical server systems are transformed into VMs that can run safely and move transparently across shared hardware. Test and development systems based on VMs allow for rapid provisioning of test and development servers, and libraries of pre-configured test machines can easily be stored. Virtualization works well on the enterprise desktop to secure unmanaged PCs without compromising end-user autonomy; tamper-proof system operation can be achieved by layering a security policy in software around VMs on the desktop. Finally, entire systems can be encapsulated into single files that can be replicated and restored onto any target server in order to support business continuity.

3.2. Virtualization approaches

There are four major approaches to virtualization used today:

Hardware emulation. The emulation of all hardware aspects, including processor instructions, chipset, and all processor-relevant devices, is the traditional approach to virtualization.

Full virtualization. Otherwise known as native virtualization, this approach uses a VMM that mediates between the guest OSs and the native hardware. The VMM (or hypervisor) interfaces between the guest OSs and the bare hardware. Certain protected instructions must be trapped and handled within the hypervisor, because the underlying hardware is not owned by an OS but is instead shared among the guests through the hypervisor.

Paravirtualization. This is another popular technique that has some similarities to full virtualization. The method uses a hypervisor for shared access to the underlying hardware but integrates virtualization-aware code into the OS itself.
This approach obviates the need for any recompilation or trapping because the OSs themselves cooperate in the virtualization process.

OS-level virtualization. OS-level virtualization virtualizes servers on top of the OS itself. This method supports a single OS and simply isolates the independent servers from one another. OS-level virtualization requires changes to the OS kernel, but the advantage is native performance.


[Figure 2 contrasts a hosted VMM running on top of a host OS with a bare-metal VMM running directly on the hardware, each supporting multiple guest OSs with their applications.]

Figure 2. Two approaches to full virtualization.

Hardware emulation is a very complex virtualization approach that comes with comparatively low performance. Because every instruction must be simulated on the underlying hardware, a 100-fold slowdown is not uncommon. For high-fidelity emulations that include cycle accuracy, simulated CPU pipelines, and caching behaviors, the actual speed difference can be on the order of 1000 times. However, the virtualization of all hardware aspects has its advantages, for example for completely different architectures (32-bit Windows on PowerPC) or the co-development of firmware and hardware. Examples of hardware emulation include Bochs and QEMU on Linux.

Full virtualization (Figure 2) is faster than hardware emulation, but slower than bare hardware because of the hypervisor mediation. The biggest advantage of full virtualization is that an OS can run unmodified. The only constraint is that the OS must support the underlying hardware. Some older instruction sets, such as x86, create problems for the full method of virtualization. On x86, certain privileged instructions that need to be handled by the VMM are silently ignored when executed in user mode (instead of generating a trap). Therefore, hypervisors must dynamically scan and trap privileged-mode code to handle this problem. Examples of full virtualization include VMware ESX Server, Hyper-V on Intel x64, and z/VM on the IBM mainframe.

The Linux Kernel Virtual Machine (KVM) is a full virtualization solution that turns a Linux kernel into a hypervisor using a kernel module. This module allows other guest OSs to run in the user space of the host Linux kernel. The KVM module introduces a new execution mode into the kernel: where vanilla kernels support kernel mode and user mode, KVM adds a guest mode. The guest mode is used to execute all non-I/O guest code, whereas normal user mode supports I/O for guests.
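The trap-and-handle mechanism underlying full virtualization can be reduced to a toy model: the guest runs deprivileged, and any privileged instruction raises a trap that the hypervisor catches and emulates against the guest's virtual state. In this Python sketch the two-instruction "ISA" is invented purely for illustration:

```python
# Toy trap-and-emulate model: unprivileged instructions run directly;
# privileged ones trap to the hypervisor, which emulates them against
# the guest's virtual (not physical) state.
class Trap(Exception):
    pass

class GuestCPU:
    def __init__(self):
        self.regs = {"acc": 0}
        self.virtual_interrupts_enabled = True

    def execute(self, instr, arg=None):
        if instr == "add":              # unprivileged: executes directly
            self.regs["acc"] += arg
        elif instr in ("cli", "sti"):   # privileged: trap to the hypervisor
            raise Trap(instr)
        else:
            raise ValueError(instr)

class Hypervisor:
    def run(self, guest, program):
        for instr, arg in program:
            try:
                guest.execute(instr, arg)
            except Trap as t:
                # Emulate the privileged instruction on virtual state only.
                guest.virtual_interrupts_enabled = (str(t) == "sti")

guest = GuestCPU()
Hypervisor().run(guest, [("add", 2), ("cli", None), ("add", 3)])
print(guest.regs["acc"], guest.virtual_interrupts_enabled)  # 5 False
```

The x86 problem described above is precisely that some privileged instructions did not raise the equivalent of `Trap` in user mode, forcing hypervisors to scan and rewrite guest code instead.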
Paravirtualization requires the guest OSs to be modified for the hypervisor, which is a disadvantage, but it also offers performance near that of a non-virtualized system. As with full virtualization, multiple different OSs can be supported concurrently. Examples of systems using paravirtualization are Xen [26], Microsoft Hyper-V, and Parallels Desktop.

Operating system-level virtualization effectively groups processors and memory regions and makes them available so that multiple user-space environments can co-exist without noticing each other. Well-known examples are Solaris Zones and OpenVZ. It is the most lightweight mode of virtualization; however, it does not establish an additional security boundary and will fail when the OS kernel gets compromised.

3.3. Discussion

Integrating virtualization into an OS kernel is a trend, as can be seen with the latest versions of Linux, Windows, or BSD. Hypervisors themselves are often based on stripped-down versions of standard OSs (i.e., VMware ESX: Linux; Hyper-V: Windows). OS vendors and communities also work on paravirtualization support by standardizing virtual device drivers for interconnects between hypervisor, guest VMs, and server hardware. The increasing importance of virtualization for server operation has recently led to changes to the Intel IA32/x64 instruction set. An additional protection mode (ring -1) has been introduced in order to differentiate between the hypervisor, the OS kernel, and user mode. Virtualized input/output instructions (AMD IOMMU, Intel VT-d) are an additional example where the support of virtualization functionality impacts instruction set design.



Figure 3. Three layers of abstraction in cloud computing.

Another relevant virtualization management question is how to map virtual resources (memory regions, processors, devices) onto a parallel computer's physical resources. Existing research shows interesting developments here, where overbooking and dynamic load management become part of the hypervisor implementation [27]. However, the addition of a virtualization layer to the hardware/software stack has some caveats. In addition to the performance overhead, virtualization layers may impact system reliability and increase the system attack surface [28]. Potential attackers may affect the system consistency of many virtualized server systems simultaneously [29]. Hardware support for partitioning is the most secure way to circumvent these potential attacks.

Server systems are becoming a commodity. Computation moves from the desktop to the datacenter and eventually to the cloud. In contrast to traditional system management approaches, where resource requirements were known and well described when an application got deployed in the (local) datacenter, cloud systems are much more dynamic and have to operate on limited knowledge of application resource utilization. Dynamic provisioning through virtualization is a key technology to solve this challenge.

4. DYNAMIC PROVISIONING IN THE OPERATING SYSTEM

The operation of server systems requires a number of configuration decisions at the OS and hardware level. These configurations are based on estimates of the number of users and of the required bandwidth, storage, and compute capacity. However, users today expect compute services with unlimited scalability, high availability, and minimal cost. Although these requirements affect the whole software stack, future OSs will have to support new monitoring and self-adaptive resource management schemes.
The dynamic provisioning of compute resources (utility computing) refers to the packaging of IT resources (such as CPU, memory, network bandwidth, and storage) into a metered service, similar to traditional utilities such as the telephone network. With low initial costs, utility computing relies on a pay-per-use billing model and allows quick reaction to changes in the demand for IT services. Utility computing has a rather long history in the world of ultra-reliable but expensive mainframe computers. IBM typically offers computing capacity rather than physical processor and storage resources. Virtualization techniques have been in place on the mainframe for multiple decades. They provide the underpinnings for effectively sharing compute resources across multiple users and organizations, thus establishing the notion of compute resources as a utility.

Client/server computing and the PC marked a departure from the traditional world of mainframe computing. However, with the establishment of huge, under-utilized datacenters and the advent of virtualization support in Intel's and AMD's CPUs, all the prerequisites for managing computation as a pay-per-use service offered by "the cloud" have been in place. Amazon's Elastic Compute Cloud


(EC2) was the initial offering of a cloud computing platform, made available in beta status in 2006. The cloud is the latest take on the idea of getting processing resources, as a utility, from somewhere else [30, 31]. Questions on programming models, seamless integration of internal and external services, security, as well as monitoring and (self-adaptive) management techniques [32] need to be answered for OSs and middleware acting as the foundation of a cloud datacenter. Accordingly, cloud computing builds upon a number of ideas and principles established in the context of utility computing, grid computing, and autonomic computing.

One of the well-known predecessors of cloud computing was (and still is) grid computing [33], an approach for pooling compute resources from multiple administrative domains in order to carry out massively parallel computations, such as scientific computing and simulation. Grids are established and used inside one organization or are spread across organizational boundaries. Although many approaches for meta-scheduling and billing-aware scheduling [34] were developed and extended in the context of grid research, it always formed a work area restricted to traditional high-performance computing topics. In contrast to cloud computing, grid computing also never needed to consider the interactive end user to a larger extent. It rather follows the traditional job submission model from cluster computing, where jobs are submitted through portals and results are retrieved hours later in the form of log files and collected program output. For large-scale simulations and other compute-bound task farming problems, it is traditionally acceptable to put major restrictions and rules onto the developer. Grids and other HPC infrastructures are provided by experts for experts.
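The pay-per-use billing model described earlier can be reduced to a small metering sketch. The resource names and unit prices below are invented for illustration; the point is that the customer is billed for metered usage, not for provisioned capacity:

```python
# Pay-per-use metering sketch in the spirit of utility computing.
# Unit prices are purely illustrative.
PRICE_PER_UNIT = {"cpu_hours": 0.10, "gb_stored": 0.05, "gb_transferred": 0.01}

def bill(usage):
    """usage: dict mapping resource name -> metered quantity."""
    return round(sum(PRICE_PER_UNIT[res] * qty for res, qty in usage.items()), 2)

# A month of metered usage is billed as a sum of resource charges.
print(bill({"cpu_hours": 100, "gb_stored": 20, "gb_transferred": 50}))  # 11.5
```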
With modern cloud computing services in the data center, the provisioning scheme changes dramatically, which has resulted in the new research topic of autonomic cloud computing. In order to be manageable, cloud computing infrastructures, called "the fabric", have to follow autonomic computing principles. In such a system, the human operator does not control the system directly. Instead, he defines general policies and rules that serve as input for the self-management process. This includes self-configuration, self-healing, self-optimization, and self-protection [35]. Self-management must be seen as a necessary approach to cope with the ever-increasing complexity of computing systems and infrastructures. Moving workloads transparently from one compute node to another (with little or no blackout time) can be seen as a self-healing mechanism in the cloud. The ability to instantiate certain preconfigured machine images (as with Amazon EC2) or certain roles (as with Windows Azure) can be seen as a step towards self-configuration. Replication in space (redundancy) or time (re-execution) for computation and data is the typical approach for self-optimization and self-protection in the cloud.

In contrast to previous approaches, cloud computing no longer assumes that developers and users are aware of the provisioning and management infrastructure for cloud services. It has established the notion of a service as the basic unit of abstraction. Services can live on different levels of abstraction (see Figure 3): comparable to standard applications (Software-as-a-Service), comparable to frameworks and programming platforms (Platform-as-a-Service), or comparable to a (virtualized) IT infrastructure (Infrastructure-as-a-Service). From the standpoint of a programmer, cloud computing on each of these levels of abstraction involves a number of technical design choices, which have implications for developers [32, 36].
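The policy-driven control described above can be sketched as a minimal scaling loop: the operator supplies a policy rather than direct commands, and a self-optimization step compares monitored load against it. Thresholds and metric names in this Python sketch are invented for illustration:

```python
# Autonomic-computing sketch: the operator defines a policy; a control
# loop decides scaling actions (self-optimization) from monitored load.
# Policy values are purely illustrative.
POLICY = {"target_utilization": 0.6, "min_instances": 1, "max_instances": 10}

def desired_instances(current, utilization, policy=POLICY):
    if utilization > policy["target_utilization"] and current < policy["max_instances"]:
        return current + 1          # scale out under high load
    if utilization < policy["target_utilization"] / 2 and current > policy["min_instances"]:
        return current - 1          # scale in under low load
    return current                  # steady state

print(desired_instances(3, 0.9))   # 4
print(desired_instances(3, 0.1))   # 2
print(desired_instances(3, 0.5))   # 3
```

Note that the operator never commands "add an instance"; the decision is derived entirely from the policy, which is the essence of the self-management idea.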
Cloud computing is appealing for certain kinds of applications, namely those imposing a variable load and needing massive scaling (such as Web 2.0 apps during peak hours), those with a short or unpredictable lifetime, and those doing parallel computing using huge amounts of resources. Cloud computing can also be seen as a system consolidation approach for handling mergers and acquisitions of companies, allowing "exotic" applications to be moved out of the datacenter and into the cloud.

4.1. Discussion

All cloud computing platforms build on virtualization (for resource management) and Web Service interfaces (for client access) as basic building blocks. VMs instantiated by the customers may or

Concurrency Computat.: Pract. Exper. (2011) DOI: 10.1002/cpe

EDITORIAL

may not run on multiple physical cores. The important issues is that none of the programming environments for cloud computing do explicitly support parallel programming for multicore systems available there. Microsoft’s parallel pattern library or Intel’s threading building blocks have outlined how multicore systems can be efficiently used on the desktop and server. Those ideas need to be transported into the cloud server OSs. Given the tremendous amount of experience and know how in system optimization and monitoring present in today’s datacenters, another important aspect for successful operation of cloud applications will be the question of how to monitor and manage all layers of the cloud stack. Selfadaptive management mechanisms present in the cloud fabric are only in a very limited way exposed to cloud applications and clients. A consistent monitoring model covering all layers of the cloud stack is badly needed. Recent developments in the OS area [37] provide the low-level fundamentals for having configurable monitoring support with small overhead in an operation environment. Selfconfiguration implies also the dynamic reconfiguration and update of software in a 24/7 operation scenario. As least as important, however, is the question of how to secure data in the cloud. Current levels of access control available in the cloud platforms are very limited. Identity management and federation as well as management of trust among clients and services in the cloud will be most important [38]. 5. SUMMARY Future OSs need to efficiently exploit the compute power available from multicore and many-core processor architectures that are extended with various types of co-processors. Simultaneously, the OS’s notion of memory hierarchy needs to be revisited, as non-uniform memory access machines are becoming commonplace in the server market. Within this paper, we have discussed three main roads for dealing with these challenges. 
Dynamic parallelism addresses the question of how to express parallel execution at the level of the programming language and the OS programming interface. We see revamped user-mode schedulers (such as Mac OS X's Grand Central Dispatch [Apple Inc., Cupertino, CA, USA] or the Windows Server 2008 R2 User-Mode Scheduler) as approaches for resource management not only on the traditional CPU but also on heterogeneous computing platforms utilizing co-processors such as GPUs or crypto chips. Reasonably well-understood parallel computing themes are being translated into patterns and guidelines for multicore-aware system design and suddenly become relevant for the mainstream programmer. The long-debated question of the "right" parallel programming model gains new importance with the advent of many-core computer systems. Explicit multithreading and control-parallel programming do not scale well and will not be able to fully utilize future many-core CPUs. New programming models and pattern libraries therefore embrace the data-parallel style of programming. Programming on data-parallel hardware (such as the GPU), and support for heterogeneous architectures in general, will nevertheless remain characteristic of special-purpose systems and tasks (such as graphics workstations, rendering, and gaming) for some time.

Dynamic partitioning addresses the question of how to share compute resources among many OS instances in a predictable and secure fashion. With a focus on energy efficiency, future OSs will have to facilitate virtualization techniques (either as host or guest) in order to allow for server consolidation and dynamic capacity management. Virtualization is an old trend that regains interest with the increasing capacity and power of commodity computing platforms.
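Server consolidation, as mentioned above, is at its core a packing problem: assign VM demands to as few physical hosts as possible. The following first-fit-decreasing sketch is illustrative only, with loads given in abstract integer CPU shares; real capacity managers also weigh memory, I/O, affinity, and migration costs:

```python
# Illustrative server-consolidation sketch: pack VM loads (abstract integer
# CPU shares) onto as few hosts as possible using first-fit decreasing.

def consolidate(vm_loads, host_capacity=10):
    """Return one list per host, holding the VM loads packed onto it."""
    hosts = []  # each entry: [remaining_capacity, [packed loads]]
    for load in sorted(vm_loads, reverse=True):  # place largest VMs first
        for host in hosts:
            if host[0] >= load:                  # first host with room left
                host[0] -= load
                host[1].append(load)
                break
        else:
            hosts.append([host_capacity - load, [load]])  # power up a host
    return [loads for _, loads in hosts]
```

For example, `consolidate([5, 7, 3, 4, 1])` packs five VMs onto two hosts instead of five, which is where the energy-efficiency argument for virtualization-based consolidation comes from.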
With recent developments in server hardware, including high-density blade servers, the advent of industry-standard architectures (IA32/x64), and an enormous growth in sheer CPU capacity through multicore/many-core architectures, virtualization provides many options for performance, portability, efficiency, and flexibility. Virtualization may well be the only feasible option for capitalizing on the features of modern multicore CPUs such as Intel's Nehalem-EX without re-architecting large parts of existing applications. In addition, the VMM layer's management abilities make it an ideal place to implement advanced security services, exploiting views into the system that are not possible at the guest OS layer. This forms the foundation for offering remote resource access to completely untrusted users in a cloud computing environment. However, as the virtual layer emerges as a viable means to host additional security services, the layer itself becomes a target for malicious attacks. As additional components such as Intel's VT-x and VT-i architectures are developed to address these problems in the VMM, still newer targets for exploitation will emerge. Future research on the next generation of OSs must consider this new attack vector in order to keep the promise of resilient and scalable resource provisioning by server OSs.

Dynamic provisioning addresses the question of how to manage compute instances that are beyond a single organization's administrative control. Service providers have been offering outsourcing solutions for quite some time; with cloud computing approaches (namely Software-as-a-Service), however, users seem willing to rely on service providers whose internal administrative and operational procedures are mostly unknown and hidden from the user. Computation moves away from the desktop. Future OSs will have to provide protocols, interfaces, and resource managers that support easy consumption of services from the cloud, as well as the provisioning of cloud services. Cloud computing builds on the idea of obtaining processing resources, as a utility, from somewhere else. This again has a number of implications for how high-performance and large-scale computing systems are developed. Questions about programming models, seamless integration of internal and external services, security, and monitoring and (self-adaptive) management techniques need to be answered for the cloud. However, future cloud computing environments will also need to support parallel programming models and patterns in order to efficiently utilize the underlying hardware and deliver comparable performance.
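Dynamic provisioning in the utility model described above ultimately reduces to a sizing decision: how many instances to run for the observed demand. The sketch below uses a deliberately simple target-utilization policy; the parameter names are illustrative, and real autoscalers add smoothing, cool-down periods, and billing granularity:

```python
import math

# Illustrative dynamic-provisioning sketch: derive an instance count from
# observed demand so that each instance stays near a target utilization.

def instances_needed(requests_per_s, capacity_per_instance,
                     target_utilization=0.7,
                     min_instances=1, max_instances=20):
    """Instance count for the current demand, clamped to operator bounds."""
    raw = requests_per_s / (capacity_per_instance * target_utilization)
    return max(min_instances, min(max_instances, math.ceil(raw)))
```

At 1000 requests/s against instances that each handle 100 requests/s, this policy provisions 15 instances (each at about 67% utilization) rather than the bare minimum of 10, leaving headroom for load spikes.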

REFERENCES
1. Emer J, Hill MD, Patt YN, Yi JJ, Chiou D, Sendag R. Single-threaded vs. multithreaded: where should we focus? IEEE Micro 2007; 27:14–24. DOI: 10.1109/MM.2007.109.
2. Eggers S, Emer J, Levy H, Lo J, Stamm R, Tullsen D. Simultaneous multithreading: a platform for next-generation processors. IEEE Micro 1997; 17:12–19. DOI: 10.1109/40.621209.
3. Tullsen DM, Eggers SJ, Levy HM. Simultaneous multithreading: maximizing on-chip parallelism. In ISCA '98: 25 Years of the International Symposia on Computer Architecture (selected papers). ACM: New York, NY, USA, 1998; 533–544. DOI: 10.1145/285930.286011.
4. McDougall R. Extreme software scaling. Queue 2005; 3:36–46. DOI: 10.1145/1095408.1095419.
5. Spracklen L, Abraham S. Chip multithreading: opportunities and challenges. 11th International Symposium on High-Performance Computer Architecture (HPCA-11), San Francisco, 2005; 248–252. DOI: 10.1109/HPCA.2005.10.
6. Sutter H, Larus J. Software and the concurrency revolution. Queue 2005; 3:54–62. DOI: 10.1145/1095408.1095421.
7. Hill M, Marty M. Amdahl's law in the multicore era. IEEE Computer 2008; 41:33–38.
8. Blake G, Dreslinski RG, Mudge T, Flautner K. Evolution of thread-level parallelism in desktop applications. In ISCA '10: Proceedings of the 37th Annual International Symposium on Computer Architecture. ACM: New York, NY, USA, 2010; 302–313. DOI: 10.1145/1815961.1816000.
9. Leijen D, Schulte W, Burckhardt S. The design of a task parallel library. In OOPSLA '09: Proceedings of the 24th ACM SIGPLAN Conference on Object Oriented Programming Systems Languages and Applications. ACM: New York, NY, USA, 2009; 227–242. DOI: 10.1145/1640089.1640106.
10. Reinders J. Intel Threading Building Blocks. O'Reilly & Associates, Inc.: Sebastopol, CA, USA, 2007.
11. Leiserson C, Mirman I. How to Survive the Multicore Software Revolution. Technical Report, 2008.
12. Chandra R, Dagum L, Maydan D, Kohr D, McDonald J, Ramesh M. Parallel Programming in OpenMP (1st edn). Morgan Kaufmann: San Francisco, 2000.
13. Blumofe R, Leiserson C. Scheduling multithreaded computations by work stealing. In Proceedings of the 35th Annual Symposium on Foundations of Computer Science (FOCS), Santa Fe, NM, USA, 1994; 356–368.
14. Apple Inc. Concurrency Programming Guide, April 2010.
15. Munshi A. The OpenCL Specification - Version 1.1, June 2010.
16. Feinbube F, Tröger P, Polze A. Joint forces: from multithreaded programming to GPU computing. IEEE Software 2010; 28:51–57. DOI: 10.1109/MS.2010.134.
17. Vianney D. Hyper-Threading speeds Linux. IBM DeveloperWorks, January 2003.
18. Asanovic K, Bodik R, Catanzaro B, Gebis J, Husbands P, Keutzer K, Patterson D, Plishker W, Shalf J, Williams S, Yelick K. The Landscape of Parallel Computing Research: A View from Berkeley. Technical Report UCB/EECS-2006-183, Electrical Engineering and Computer Sciences, December 2006.
19. Burger D, Goodman JR, Kägi A. Memory bandwidth limitations of future microprocessors. In ISCA '96: Proceedings of the 23rd Annual International Symposium on Computer Architecture. ACM: New York, NY, USA, 1996; 78–89. DOI: 10.1145/232973.232983.
20. Iyer R, Zhao L, Guo F, Illikkal R, Makineni S, Newell D, Solihin Y, Hsu L, Reinhardt S. QoS policies and architecture for cache/memory in CMP platforms. SIGMETRICS Performance Evaluation Review 2007; 35:25–36. DOI: 10.1145/1254882.1254886.
21. Baumann A, Barham P, Dagand P-E, Harris T, Isaacs R, Peter S, Roscoe T, Schüpbach A, Singhania A. The multikernel: a new OS architecture for scalable multicore systems. In SOSP '09: Proceedings of the ACM SIGOPS 22nd Symposium on Operating Systems Principles. ACM: New York, NY, USA, 2009; 29–44. DOI: 10.1145/1629575.1629579.
22. Nitzberg B, Lo V. Distributed shared memory: a survey of issues and algorithms. Computer 1991; 24:52–60. DOI: 10.1109/2.84877.
23. Eskicioglu M. A comprehensive bibliography of distributed shared memory. SIGOPS Operating Systems Review 1996; 30:71–96. DOI: 10.1145/218646.218651.
24. Lottiaux R, Gallard P, Vallée G, Morin C, Boissinot B. OpenMOSIX, OpenSSI and Kerrighed: a comparative study. IEEE International Symposium on Cluster Computing and the Grid 2005; 2:1016–1023.
25. Rosenblum M, Garfinkel T. Virtual machine monitors: current technology and future trends. Computer 2005; 38:39–47. DOI: 10.1109/MC.2005.176.
26. Barham P, Dragovic B, Fraser K, Hand S, Harris T, Ho A, Neugebauer R, Pratt I, Warfield A. Xen and the art of virtualization. In SOSP '03: Proceedings of the Nineteenth ACM Symposium on Operating Systems Principles. ACM: New York, NY, USA, 2003; 164–177. DOI: 10.1145/945445.945462.
27. Fedorova A, Kumar V, Kazempour V, Ray S, Alagheband P. Cypress: a scheduling infrastructure for a many-core hypervisor. 1st Workshop on Managed Many-Core Systems, Boston, MA, USA, Citeseer, 2008.
28. Le M, Tamir Y. ReHype: enabling VM survival across hypervisor failures. In 7th ACM SIGPLAN/SIGOPS International Conference on Virtual Execution Environments. ACM: New York, 2011; 63–74.
29. Garfinkel T, Rosenblum M. When virtual is harder than real: security challenges in virtual machine based computing environments. In HOTOS '05: Proceedings of the 10th Conference on Hot Topics in Operating Systems. USENIX Association: Berkeley, CA, USA, 2005; 20–20.
30. Vaquero LM, Rodero-Merino L, Caceres J, Lindner M. A break in the clouds: towards a cloud definition. SIGCOMM Computer Communication Review 2009; 39:50–55. DOI: 10.1145/1496091.1496100.
31. Polze A. Towards Predictable Cloud Computing. CloudFutures Workshop, 2010.
32. Armbrust M, Fox A, Griffith R, Joseph A, Katz R, Konwinski A, Lee G, Patterson D, Rabkin A, Zaharia M. Above the Clouds: A Berkeley View of Cloud Computing. Technical Report UCB/EECS-2009-28, February 2009.
33. Foster I, Zhao Y, Raicu I, Lu S. Cloud computing and grid computing 360-degree compared. Grid Computing Environments Workshop, Austin, TX, 2008; 1–10. DOI: 10.1109/GCE.2008.4738445.
34. Nitzberg B, Schopf JM. Current activities in the scheduling and resource management area of the Global Grid Forum. In Job Scheduling Strategies for Parallel Processing, Lecture Notes in Computer Science, Vol. 2537. Springer: London, 2002; 229–235.
35. Huebscher MC, McCann JA. A survey of autonomic computing—degrees, models, and applications. ACM Computing Surveys 2008; 40:1–28. DOI: 10.1145/1380584.1380585.
36. Polze A. A Comparative Analysis of Cloud Computing Environments. Microsoft Faculty Connection, 2010.
37. Passing J, Schmidt A, Löwis M, Polze A. NTrace: function boundary tracing for Windows on IA-32. 16th Working Conference on Reverse Engineering, Lille, France, 2009; 43–52. DOI: 10.1109/WCRE.2009.12.
38. Ko RKL, Lee BS, Pearson S. Towards achieving accountability, auditability and trust in cloud computing. Advances in Computing and Communications 2011; 193:432–444.

Andreas Polze and Peter Tröger
Hasso Plattner Institute for Software Engineering,
Prof.-Dr.-Helmert-Str. 2-3, 14482 Potsdam, Germany
E-mail: [email protected]

Copyright © 2011 John Wiley & Sons, Ltd.

Concurrency Computat.: Pract. Exper. (2011) DOI: 10.1002/cpe
