Draft version. To appear in: Proceedings of the 22nd Euromicro International Conference on Parallel, Distributed and Network-Based Processing (PDP 2014).
Supporting Elasticity in OpenMP Applications

Guilherme Galante and Luis C. E. Bona
Department of Informatics, Federal University of Paraná
Curitiba, PR, Brazil
{ggalante, bona}@inf.ufpr.br

Abstract—Elasticity can be seen as the ability of a system to increase or decrease its allocated computing resources dynamically and on demand. To explore this feature, several works have addressed the development of frameworks and platforms focusing on the construction of elastic parallel and distributed applications for IaaS clouds. However, none of these works addressed the exploitation of elasticity in multithreaded applications. In this paper, we propose a mechanism that provides elasticity support for OpenMP applications, making possible the dynamic provisioning of cloud resources taking into account the program structure and runtime requirements. In our proposal, the OpenMP directives were modified to support the dynamic adjustment of resources, and a set of routines was added to the user-level library to enable the configuration of the elastic execution. Dynamic memory allocation support was also included in the elastic OpenMP library. We also present the architecture and implementation of the proposed mechanism. The experiments validate our approach and show some possibilities for using elastic OpenMP.

Keywords—Cloud computing, OpenMP, elasticity, parallel applications.
I. INTRODUCTION
In the last decade, scientific applications have been executed on high-performance infrastructures, including cluster and grid computing [1]. Recently, cloud computing emerged as an alternative execution environment, offering end users a variety of resources from the hardware to the application level and allowing immediate, on-demand access to required resources without the need to purchase additional physical infrastructure. In addition, a new feature emerged with cloud computing: elasticity. Elasticity is the ability of a system to increase or decrease the allocated computing resources dynamically [2]. It implies that the virtual environment used to execute an application may change over time, without any long-term indication of future resource demands. This feature is suitable for dynamic applications, whose resource requirements cannot be determined exactly in advance, either due to changes in runtime requirements or in application structure [3].

Several elasticity mechanisms are offered by IaaS public cloud providers, such as Amazon EC2 (1), Azure (2) and Rackspace (3), and others have been proposed in research works [4]. These mechanisms were originally developed for dynamically scaling server-based applications, such as web, e-mail and database servers, in order to handle unpredictable workloads and to enable organizations to avoid the pitfalls of non-elastic provisioning (over- and under-provisioning) [5].

Most of these elasticity mechanisms are based on controlling the number of virtual machines (VMs) that host the application's server components and on the use of load balancers to divide the workload among the VM instances. The control is carried out by an elasticity controller that employs data from a monitoring system to decide when instances must be added or removed. The decisions are based on a set of rules specifying the conditions that trigger actions over the underlying cloud. Each condition is composed of a series of metrics that are compared against thresholds; these metrics include the number of connections, the number of requests, and resource usage such as CPU, memory and I/O. An example is presented in Figure 1, where we observe the allocation of VMs as a function of the number of connected clients. The elasticity controller uses the number of clients to dynamically allocate or deallocate VMs, enabling the application to handle load variations.

(1) http://aws.amazon.com/ (2) http://www.windowsazure.com (3) http://www.rackspace.com/
Fig. 1. Use of elasticity in a web application: number of clients and allocated VMs over time (hours). Adapted from Roy et al. [6].
Although these solutions are successfully employed in server-based applications, other classes of applications, such as scientific ones, cannot benefit from these mechanisms. Scientific applications have almost always been designed to use a fixed number of resources and cannot explore elasticity without appropriate support [7]. The simple addition of instances and the use of load balancers have no effect on these applications, since they are not able to detect and use the new resources. Moreover, most scientific applications are executed in batch mode, with workloads defined by input files containing the data to be processed [8]. Besides, scientific jobs tend to be resource-greedy, intensively using all provided resources. Figure 2 illustrates this behavior in the execution of a scientific experiment (multithreaded 2D heat transfer): all processing capacity is constantly used, independently of the number of threads/CPUs employed. The absence of external requests and the constant use of resources make traditional elasticity mechanisms based on monitoring data ineffective.
Fig. 2. Scientific application CPU usage (%) over time (minutes) with different numbers of threads/CPUs.
Considering this scenario, there is a need for mechanisms that take into account the particularities of scientific applications and their programming models, and that offer appropriate support for using elastic resources. Previously published works developed solutions focusing on distributed-memory models, such as MapReduce [9], [10], MPI (Message Passing Interface) [7] and master-slave applications [11]; however, there are no works addressing the use of elasticity in shared-memory (multithreaded) parallel applications.

Among the various standards for developing multithreaded applications for shared-memory machines, OpenMP stands out for its extensive use in scientific computing and for the support offered by a variety of compilers and architectures [12]. It consists of a collection of compiler directives, library routines and environment variables that can be easily inserted into a serial program to create a portable program that runs in parallel on shared-memory architectures [13].

In this paper, we propose a mechanism that provides elasticity support for OpenMP applications, making possible the dynamic provisioning of cloud resources taking into account the program structure and runtime requirements. While the mechanisms developed for distributed-memory applications rely on the replication of virtual machines (horizontal elasticity), our mechanism enables the dynamic allocation of virtual processors (VCPUs) and memory to a running VM (vertical elasticity). Vertical elasticity is fundamental for multithreaded applications, since they can take advantage of all available cores of a physical machine (mapped to VM processors). The proposal is to insert elasticity control into the OpenMP directives to enable the automatic adjustment of the number of VCPUs according to the number of threads in execution. In addition, a set of routines is added to the user-level library, allowing the user to configure the elasticity provided by the directives. Using these features it is possible, for example, to expand the VCPU set to handle a larger number of threads, or to release VCPUs when serial pieces of code are being executed. We also added dynamic memory allocation support to the OpenMP library.

To enable these elasticity features in OpenMP, we propose an architecture that comprises an extended OpenMP API and a middleware. The API extends the original OpenMP standard with modifications that enable the dynamic provisioning of resources. When an elasticity action is required, the API interacts with the middleware, which accesses the cloud infrastructure and performs the allocation and deallocation of resources. To validate our proposal, we modified the GCC OpenMP library (libgomp (4)) and implemented the middleware to support
dynamic resource allocation in a private cloud managed by an extended version of OpenNebula [14]. The platform was extended to support vertical elasticity, which is not available in the original OpenNebula. Three tests were conducted, in which elastic OpenMP was successfully used to improve an application's efficiency, to implement load balancing in a hybrid MPI-OpenMP application, and to add dynamic memory allocation to parallel applications.

The remainder of the paper is organized as follows. Section II presents related works. Section III introduces elastic OpenMP and presents its architecture. Section IV shows implementation details. In Section V, experiments and results are presented. Finally, Section VI concludes the paper.

(4) http://gcc.gnu.org/onlinedocs/libgomp/
II. RELATED WORKS
Several works have addressed the development of frameworks and platforms focusing on the construction of elastic parallel and distributed applications for IaaS clouds [4].

Raveendran et al. [7] present a framework for the development of elastic MPI applications. The authors propose adapting MPI applications by terminating the current execution and restarting a new one with a different number of virtual instances. Vectors and data structures are redistributed and the execution continues from the last iteration. Applications that do not have an iterative loop cannot be adapted by the framework, since it uses the iteration index as the restarting point.

The works of Chohan et al. [9] and Iordache et al. [10] address the elastic execution of MapReduce applications. In the former, the authors investigate the dynamic addition of Amazon Spot Instances to reduce application execution time. The latter presents a system implementing an elastic MapReduce API that allows the dynamic allocation of resources from different clouds.

Rajan et al. [11] present Work Queue, a framework for the development of elastic master-slave applications. Applications developed with Work Queue allow adding slave replicas at runtime. The slaves are implemented as executable files that can be instantiated by the user on different machines. When executed, the slaves communicate with the master, which coordinates task execution and data exchange.

Although several application models are covered by these works, none of them addresses the exploitation of elasticity in multithreaded applications, which is the main contribution of our work. Another important contribution, not explored in the works cited above, is the use of vertical elasticity.

The use of vertical elasticity for memory provisioning is presented by Moltó et al. [15]. The authors present a mechanism for adapting the memory size of a VM to the memory consumption pattern of the application. This work differs from ours in that it uses monitoring data to decide when to scale memory up or down; in our approach, additional memory is requested through programming routines.

Another related work is presented by McFarland [16]. The author presents a system that enables the dynamic resizing of OpenMP applications on multiprocessor architectures, so that the number of threads adapts to the processors/cores available in the machine.
In our work, we propose a mechanism that enables programmers to create applications capable of adapting the number of processors to the number of started threads, taking advantage of the elasticity provided by clouds.
III. ELASTIC OPENMP
OpenMP [17], a set of compiler directives (or pragmas), runtime library routines and environment variables, is the de facto standard for parallel programming in C/C++ and Fortran on shared-memory systems. The major OpenMP directives enable the program to create a team of threads to execute a specified region of code in parallel and to share out the work in a loop or in a set of code segments. User-level runtime routines are used to set and retrieve information about the runtime environment, such as retrieving the maximum number of threads, setting the number of running threads, or controlling thread scheduling. Environment variables may also be used to adjust the runtime behavior of OpenMP applications, particularly by setting defaults for the current run [18].

OpenMP programming is based on the fork-join model [13]. An OpenMP program always begins execution as a single thread (the master thread); when a parallel directive is encountered, execution forks and the parallel region is executed by a team of threads. When the parallel region finishes, the threads in the team join at an implicit barrier and the master thread continues execution. Inside parallel regions, work-sharing directives can be employed to distribute the work among the available threads. Some work-sharing constructs specify a code block that is executed by only one thread (omp single, omp master), while others distribute the work to all threads of the team (omp for) or to part of it (omp sections). The number of threads, and consequently the resources effectively used, can thus vary significantly during application execution, making OpenMP applications well suited to elasticity.

As presented in Sections I and II, the mechanisms provided by public providers focus on server-based applications, while the mechanisms presented in academic works are specific to distributed-memory applications, such as MapReduce and MPI. To the best of our knowledge, there are no solutions addressing the use of elasticity in shared-memory applications such as OpenMP.

We propose a solution that provides elasticity for OpenMP in which the directives are extended to support the automatic adjustment of the number of VCPUs according to the number of threads in execution. These elasticity-aware directives control elasticity automatically, hiding the complexity of writing and executing elasticity strategies from the user. In addition, we add routines to the user-level library that give more precise control over the elastic execution. The solution also includes support for elastic memory allocation, taking advantage of the ballooning technique available in most modern hypervisors. Memory elasticity enables users to avoid thrashing (which impacts application performance) without the need for over-provisioning.
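For reference, a minimal conventional (non-elastic) OpenMP program illustrating the fork-join model described above; the thread count and printed output are illustrative only:

    #include <stdio.h>
    #include <omp.h>

    int main(void) {
        printf("serial region: master thread only\n");

        // fork: a team of 4 threads executes the parallel region
        #pragma omp parallel num_threads(4)
        {
            printf("thread %d of %d working\n",
                   omp_get_thread_num(), omp_get_num_threads());
        }   // join: implicit barrier, master continues alone

        printf("serial region again\n");
        return 0;
    }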
With the features of elastic OpenMP established, we now describe the architecture of our mechanism. The architecture is composed of four layers, as illustrated in Figure 3. At layer 1 is the cloud infrastructure, which provides the resources for application execution. To enable the interaction between the application and the cloud platform, the architecture employs an elasticity middleware at layer 2. Above it, at layer 3, is the OpenMP API with elasticity support. Finally, at the top (layer 4) are the applications.

Fig. 3. Elastic OpenMP architecture: (4) Applications, (3) Elastic OpenMP API, (2) Elasticity Middleware, (1) Cloud Infrastructure. Resource requests flow from the API to the middleware, which sends commands to the cloud interface.
A. Cloud Infrastructure

The cloud infrastructure is responsible for providing all resources needed for application execution: processing units, memory, storage and network. To enable elasticity for OpenMP, the cloud must support vertical elasticity; this is a fundamental requirement in our proposal, since the dynamic allocation of VCPUs and memory depends on this feature. Unfortunately, the major public providers (e.g., Amazon, Rackspace and Azure) do not fully support vertical elasticity, although the feature is starting to be offered by providers such as Profitbricks (5). Likewise, the open-source cloud platforms (e.g., Eucalyptus, OpenStack and OpenNebula) do not provide vertical elasticity either. A possible solution is to extend these platforms to support the dynamic allocation of resources.

B. Elasticity Middleware

The elasticity middleware was designed to support the interaction between the OpenMP API and the cloud. The middleware exposes an interface for requesting resources, which implements a communication protocol between the middleware and the OpenMP API. The protocol is based on plain-text messages containing the resource type (VCPU or memory) and the amount to be (de)allocated (units for VCPUs, megabytes for memory). The requests received from the OpenMP API are converted into commands to the underlying cloud. The middleware also provides information about the cloud infrastructure, including resource availability and operation errors.

Figure 4 illustrates an example of API-middleware-cloud interaction. In the source code, a parallel directive creates the threads and requests the allocation of VCPUs by sending a message to the middleware. The middleware handles the message and interacts with the cloud platform, requesting the resources through an interface provided by the cloud provider.

(5) http://www.profitbricks.com
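As an illustration of this protocol, a sketch of how the API side might send a request to the middleware over a TCP socket. The message format ("VCPU +2"), port number and host are our own assumptions, since the paper does not fix the wire format:

    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>
    #include <netinet/in.h>
    #include <arpa/inet.h>
    #include <sys/socket.h>

    // Hypothetical client: sends a plain-text (de)allocation request such as
    // "VCPU +2" or "MEM -512" to the middleware and reads a one-line reply.
    static int send_request(const char *msg, char *reply, size_t len) {
        int fd = socket(AF_INET, SOCK_STREAM, 0);
        if (fd < 0) return -1;

        struct sockaddr_in addr = {0};
        addr.sin_family = AF_INET;
        addr.sin_port = htons(7070);               // assumed middleware port
        inet_pton(AF_INET, "127.0.0.1", &addr.sin_addr);

        if (connect(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0 ||
            write(fd, msg, strlen(msg)) < 0) { close(fd); return -1; }

        ssize_t n = read(fd, reply, len - 1);      // e.g. "OK" or an error report
        if (n < 0) { close(fd); return -1; }
        reply[n] = '\0';
        close(fd);
        return 0;
    }

    int main(void) {
        char reply[128];
        if (send_request("VCPU +2", reply, sizeof(reply)) == 0)
            printf("middleware replied: %s\n", reply);
        return 0;
    }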
    int main (int argc, char *argv[])
    {
        omp_set_elastic_parallel(1);
        omp_set_num_threads(5);
        #pragma omp parallel
        {
            //do something in parallel;
        }
    }

Fig. 4. API-Middleware-Cloud interaction: the application code above issues a resource request through the API interface to the middleware, which sends commands through the cloud interface to the cloud platform hosting the VM. At the fork, VCPUs are allocated; at the join, they are released.

Note that each cloud platform can use a distinct communication protocol, and the middleware must be adapted to handle the different possibilities. Some examples of such protocols are OpenNebula XML-RPC (6), OCCI (7) and Amazon EC2 (8).

The code snippets below accompany Figure 5:

a) parallel:

    //serial region
    omp_set_num_threads(N)
    #pragma omp parallel
    {
        //parallel region
    }
    //serial region

b) single:

    omp_set_num_threads(N)
    #pragma omp parallel
    {
        //parallel region
        #pragma omp single
        {
            //single region
        }
        //parallel region
    }

c) sections:

    omp_set_num_threads(N);
    #pragma omp parallel
    {
        //parallel region
        #pragma omp sections
        {
            #pragma omp section
            {
                //section A
            }
            #pragma omp section
            {
                //section B
            }
        }
        //parallel region
    }

Fig. 5. Elastic OpenMP: dynamic allocation of resources. VCPUs are allocated at the fork and released at the join (a); reduced to one during a single region (b); and adjusted to the number of sections (c).

(6) http://opennebula.org (7) http://occi-wg.org/ (8) http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/
C. Elastic OpenMP API

To support elasticity, the OpenMP API must provide mechanisms for managing dynamic provisioning and must be able to communicate with the middleware to send resource requests. In our proposal, we extend OpenMP to support these features automatically, changing the original API minimally and allowing elasticity to be explored transparently, hiding implementation details from the user. The elasticity extensions are based on adaptations of three OpenMP directives and on the addition of a set of user-level routines to the OpenMP API.
The most important OpenMP directive is parallel. When this construct is found, a team of threads is created to execute the associated parallel region, which is the code dynamically contained within the parallel construct. Parts of the program that are not enclosed by a parallel construct are executed serially [13]. In elastic OpenMP, the parallel directive was modified to adapt the number of VCPUs according to the number of threads specified for the parallel region (via routine or clause). The API sends a request to the middleware whenever the number of VCPUs needs to change.
Figure 5-a presents an example of elastic OpenMP behavior for the parallel directive. When the parallel region is created and the thread team is started, the API automatically allocates additional VCPUs to handle the threads. When the parallel region finishes and the threads join, the number of VCPUs is set back to the original configuration.
Inside a parallel region, other constructs, called work-sharing directives, can be used to distribute the computation among the threads of the team. There are three main work-sharing directives: for, sections and single. The for construct is used to split up loop iterations and assign them to threads for concurrent execution. This directive was not modified to support elasticity, since it automatically inherits the thread team defined in the parallel region and the VCPUs already allocated.
The single directive identifies a section of code that must be run by a single thread, typically the first to encounter it. The other threads wait at a barrier until the thread executing the single block has completed. The elastic single directive keeps only one VCPU active while the single region is being executed; when all threads have joined at the barrier, the number of VCPUs is reestablished, as presented in Figure 5-b.
The sections construct provides a means by which different threads can run different blocks of the program in parallel. In non-elastic OpenMP, each thread executes one code block at a time, and each code block is executed exactly once: if there are fewer threads than code blocks, some threads execute multiple code blocks; if there are fewer code blocks than threads, the remaining threads are idle.

In the elastic sections directive, the number of VCPUs is adjusted to the number of sections in the source code, as illustrated in Figure 5-c. The objective is to avoid idle VCPUs when there are unused threads. According to the OpenMP specification, the maximum number of threads that can be used in sections is defined by the parallel region; thus, if there are more sections than threads, new VCPUs are not added. The other OpenMP directives and clauses are not affected by the proposed extensions and continue to work as defined in the original specification.
Apart from the modifications to the OpenMP directives, our extension also adds routines to the user-level library that allow programmers to control the elastic execution and query the environment. The additional routines are listed and described in Table I.

The first three routines listed in Table I are used to turn the elasticity features provided by the OpenMP directives on and off, enabling the use of elasticity only when needed. There are also three routines to verify the status of the elastic directives. The omp_set_map routine enables the programmer to set the number of VCPUs to be allocated as a function of the active threads; the argument m specifies the ratio between VCPUs and threads. For example, m = 0.5 means that one VCPU is allocated for every 2 threads. Non-integer VCPU amounts are rounded up.

The omp_get_max_cpu and omp_get_free_cpu routines provide information about the CPUs of the physical machine that hosts the VM where the application is running. The former returns the maximum number of CPUs allocable to the VM, and the latter returns the number of CPUs that can be immediately allocated. These functions are useful for determining the number of threads and VCPUs that can be used in a parallel section.

We also propose a primitive for memory elasticity. The omp_alloc routine is similar to the malloc function provided by libc (9), but it also adds the requested memory space to the VM. The omp_free routine frees the memory space pointed to by its argument and removes the corresponding amount of memory from the VM.

(9) http://www.gnu.org/software/libc/

D. Applications

At the top of the architecture are the applications developed with elastic OpenMP. Using the elasticity features provided by the API is similar to using conventional OpenMP and demands few modifications to the source code. The differences are the use of the additional routines to enable and configure elasticity, and the mandatory use of a routine (omp_set_num_threads) or clause (num_threads) to set the number of threads, and consequently the number of VCPUs, that will be used. A short sketch is given below.
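A minimal sketch of this procedure using the routines proposed above; the chosen ratio and the fallback thread count are illustrative:

    #include <stdio.h>
    #include <omp.h>
    // Prototypes for the proposed extensions (Table I), assumed to be
    // exported by the elastic libgomp:
    void omp_set_elastic_parallel(int flag);
    void omp_set_map(float m);
    int  omp_get_free_cpu(void);

    int main(void) {
        omp_set_elastic_parallel(1);       // enable elasticity for parallel regions
        omp_set_map(1.0f);                 // one VCPU per active thread

        int free_cpus = omp_get_free_cpu();  // CPUs immediately allocable
        int nthreads = free_cpus > 0 ? free_cpus : 1;
        omp_set_num_threads(nthreads);     // also sets the VCPUs to be requested

        #pragma omp parallel               // fork: VCPUs are allocated here
        {
            // do something in parallel
        }                                  // join: VCPUs are released

        omp_set_elastic_parallel(0);       // disable elasticity afterwards
        return 0;
    }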
IV. IMPLEMENTATION
To validate our proposal, we implemented a prototype of the architecture presented in Section III. In the cloud layer, we use a modified version of OpenNebula 3.4 to provide the resources; we adapted the platform to support the dynamic allocation of VCPUs and memory (without rebooting) in Xen (10) virtual machines. OpenNebula was chosen for the flexibility of its VM creation, which allows resources to be configured according to application needs, unlike Eucalyptus and OpenStack, in which a pre-configured instance class must be chosen. The Xen hypervisor was used due to its ability to allocate VCPUs and memory on the fly.

(10) http://www.xen.org

The middleware was implemented in Python and C. It uses TCP sockets for message exchanges with the OpenMP
API, and the commands are sent to the cloud using OpenNebula XML-RPC. The OpenMP API used is an extended version of the API provided by GCC 4.7, adapted to enable dynamic resource allocation. We modified the directives to request resources from the middleware and added the user-level routines described in Section III-C to the OpenMP library (libgomp). To date, the elastic OpenMP API has been implemented for C and C++.
V. TESTS AND RESULTS
In this section we present three experiments using elastic OpenMP. In the first, elasticity is used to improve application efficiency by using the most appropriate amount of resources. In the second, we use elastic OpenMP to implement load balancing in a hybrid MPI-OpenMP application. The third experiment evaluates the use of dynamic memory allocation. The hardware used to perform the tests is a 24-core workstation equipped with three AMD Opteron 6136 processors at 2.40 GHz and 98 GB of RAM. We used 4 cores and 10 GB of RAM for Xen's dom0 and the remaining 20 CPUs and 88 GB of memory for the other VMs. The operating system in all tests is Ubuntu Server 12.04 with kernel 3.2.0-29.

A. Efficiency Improvement

In this experiment we use elasticity to improve application efficiency. Efficiency is a measure of how much of the available processing power is being used: efficiency close to unity means that the hardware is used effectively, while low efficiency means that resources are wasted [19]. Increasing the number of processes or threads in a parallel application generally increases its speedup but can decrease its efficiency, due to factors such as contention, communication and software structure.

In this test, we used a heat transfer problem that consists in solving a partial differential equation in each iteration to determine the variation of the temperature within a 2D square domain. The application has some serial parts that affect its scalability, so there is a limit to the number of VCPUs that can be used efficiently. The application was tested with domains of 8192×8192 and 16384×16384 cells.

Finding the most efficient amount of resources is not an easy task, considering that applications can present different results depending on the input data and parameters used. Thus, in this test, we employ elasticity to enable the OpenMP application to automatically find the most suitable amount of resources to maintain its efficiency at a certain level. We modified the application to calculate the efficiency of each iteration: if the value is higher than a reference value (0.75 in this test), more resources are automatically allocated for the next iteration; if the efficiency remains under the reference value for two consecutive iterations, the last allocated resources are released. All resource allocations and deallocations are performed by OpenMP. At the beginning of the execution, the VM was configured with 1 VCPU and 20 GB of RAM. The results are presented in Figures 6 and 7.
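A sketch of the per-iteration control described above. The efficiency computation (relative to a single-thread baseline measured in the first iteration) and the helper names are our own illustration; the paper does not give the exact formula used:

    #include <omp.h>
    // Prototype for the proposed extension (Table I):
    int omp_get_max_cpu(void);

    #define MAX_ITER  50
    #define THRESHOLD 0.75

    void heat_iteration(void) {
        #pragma omp parallel
        { /* solve one time step of the 2D heat equation (stub) */ }
    }

    int main(void) {
        double t1 = 0.0;         // single-thread reference time (first iteration)
        int nthreads = 1, below = 0;

        for (int it = 0; it < MAX_ITER; it++) {
            omp_set_num_threads(nthreads);   // elastic parallel adjusts VCPUs too
            double start = omp_get_wtime();
            heat_iteration();
            double t = omp_get_wtime() - start;

            if (it == 0) t1 = t;               // baseline measured with 1 thread
            double eff = t1 / (nthreads * t);  // efficiency = speedup / threads

            if (eff > THRESHOLD && nthreads < omp_get_max_cpu()) {
                nthreads++;                    // still efficient: grow
                below = 0;
            } else if (eff < THRESHOLD && ++below >= 2 && nthreads > 1) {
                nthreads--;                    // inefficient twice in a row: shrink
                below = 0;
            }
        }
        return 0;
    }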
TABLE I. ADDITIONAL USER-LEVEL ROUTINES.

void omp_set_elastic_parallel(int flag): Enables or disables the elasticity features in parallel regions. If the flag argument is true, the dynamic adjustment of resources is enabled; false disables it.
void omp_set_elastic_section(int flag): Enables or disables the elasticity features of sections directives. If the flag argument is true, the dynamic adjustment of resources is enabled; false disables it.
void omp_set_elastic_single(int flag): Enables or disables the elasticity features of single directives. If the flag argument is true, the dynamic adjustment of resources is enabled; false disables it.
int omp_get_elastic_parallel( ): Returns true if the elasticity feature in parallel regions is enabled, false otherwise.
int omp_get_elastic_section( ): Returns true if the elasticity feature in the sections construct is enabled, false otherwise.
int omp_get_elastic_single( ): Returns true if the elasticity feature in the single construct is enabled, false otherwise.
void omp_set_map(float m): Sets the ratio between VCPUs and active threads.
int omp_get_max_cpu( ): Gets the number of CPUs of the physical machine.
int omp_get_free_cpu( ): Gets the number of free CPUs available for allocation.
void* omp_alloc(long int size): Allocates size bytes and returns a pointer to the allocated memory; also adds size bytes of memory to the current VM. Returns NULL in case of error.
void omp_free(void *ptr): Frees the memory space pointed to by ptr and deallocates the corresponding memory from the current VM.
Fig. 6. Elastic OpenMP: number of VCPUs and threads used per iteration (8192 and 16384 domains).

Fig. 7. Elastic OpenMP: efficiency per iteration (8192 and 16384 domains).

Fig. 8. Monitoring System: number of VCPUs and threads used per iteration.

Fig. 9. Monitoring System: efficiency per iteration (8192 and 16384 domains).
Figure 6 shows the number of VCPUs allocated along the execution, which increases while the efficiency is higher than 0.75. Figure 7 presents the efficiency achieved in each iteration. As we can observe, after the adjustment period (iterations 1 to 20), the efficiency is sustained close to 0.75, avoiding the excessive allocation of resources. Note that the results differ for the two input sizes, scaling better for the 16384×16384 domain, showing that the solution can account for the particularities of each execution.
We compared the results obtained with elastic OpenMP against a monitoring-based mechanism. Given the lack of existing mechanisms with vertical elasticity, we developed a rule-based solution that resembles the mechanisms provided by public cloud providers: an elasticity controller monitors the VM and, if all VCPUs show usage higher than 90%, a new VCPU is added. The results obtained with this approach are presented in Figures 8 and 9.

The results in Figure 9 show that the monitoring-based solution is unable to guarantee the efficiency of the application. This occurs because the mechanism cannot obtain the application-level information needed to decide whether resources must be allocated. This result shows that providing elasticity for parallel applications requires more than information about VM resource usage; it is also necessary to take into account the structure and behavior of each program.

B. Load Balancing

The objective of this experiment is to evaluate the use of elastic OpenMP to provide load balancing for hybrid MPI-OpenMP applications. A synthetic application was used to assess the potential of the technique. It consists of a simple external loop containing four vector operations. Two MPI processes execute the external loop, and the operation loops are parallelized with OpenMP threads. At the end of each external iteration there is a message interchange to synchronize the MPI processes. This synthetic job allows assigning a specific workload to each MPI process in each loop iteration.
We performed two tests using this application. In the first, we ran the application implemented with original OpenMP, employing a fixed number of threads. In the second, to enable load balancing, we adapted the application to set the number of threads according to the workload; the VCPUs are then automatically (de)allocated by elastic OpenMP to handle the threads. In both tests, ten iterations of the application's external loop were performed. As the execution environment, we employed two VMs initially configured with two VCPUs and 20 GB of RAM.

Figure 10 presents the results of the application implemented with original OpenMP, where the load imbalance is evident: there are differences of up to 60% between the execution times of the two processes. As a consequence, the iteration execution time is determined by the slower process. The total time in this test is 527 seconds.
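A sketch of the adaptation used in the second test, under our own assumptions about the synthetic job: workload_for() and process() are hypothetical stand-ins for the workload assignment and the vector operations, and ITEMS_PER_THREAD is an illustrative tuning constant:

    #include <math.h>
    #include <mpi.h>
    #include <omp.h>
    // Prototype for the proposed extension (Table I):
    int omp_get_max_cpu(void);

    #define ITEMS_PER_THREAD 1000000L   // illustrative granularity

    long workload_for(int rank, int it)  // stand-in for the workload assignment
    { return (long)(it + rank + 1) * ITEMS_PER_THREAD; }
    void process(long i) { (void)i; /* one vector operation (stub) */ }

    void elastic_loop(int rank) {
        for (int it = 0; it < 10; it++) {
            long n = workload_for(rank, it);
            int threads = (int)ceil((double)n / ITEMS_PER_THREAD);
            if (threads > omp_get_max_cpu()) threads = omp_get_max_cpu();
            if (threads < 1) threads = 1;
            omp_set_num_threads(threads);  // elastic parallel (de)allocates VCPUs

            #pragma omp parallel for
            for (long i = 0; i < n; i++)
                process(i);

            MPI_Barrier(MPI_COMM_WORLD);   // synchronize the two MPI processes
        }
    }

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        elastic_loop(rank);
        MPI_Finalize();
        return 0;
    }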
Fig. 10. Load imbalance in the hybrid MPI-OpenMP application: execution time (sec) per iteration for Proc 0 and Proc 1.
Figure 11 shows the results of the elastic application. Note that the load balancing worked very well, making the execution times similar for both processes, with a maximum imbalance of 8%. The dynamic allocation of resources and the load balancing reduced the total execution time to 270 seconds.
Fig. 11. Load balancing using elastic OpenMP: execution time (sec) per iteration for Proc 0 and Proc 1.

An example of applications that could take advantage of elasticity for load balancing are adaptive mesh refinement codes, which suffer from load balance problems when parallelized with MPI.

C. Dynamic Memory Allocation

The ability to dynamically modify the memory of a VM at runtime without any service disruption is an important capability for applications with dynamic memory requirements. This feature makes it possible to keep enough memory allocated to hold the application data in resident memory, and consequently to preserve the application's performance; conversely, unused memory can be released when the application no longer requires it.

In this section, we present an experiment using the memory routines provided by elastic OpenMP and compare its results with those of the monitoring-based mechanism proposed by Moltó et al. [15]. We used the same synthetic job presented in Section V-B with two different scenarios: (1) an increasing workload and (2) an alternating workload. In the first, the workload is increased every two iterations; in the second, the workload alternates between high and low memory demands. In both tests, we employed two VMs with two VCPUs and 2 GB of RAM.

In the first test, we evaluated the use of the dynamic memory allocation routines provided by elastic OpenMP. The application was modified by substituting its malloc and free calls with the omp_alloc and omp_free routines. Figure 12 shows the results of this approach. In both scenarios, the elasticity routines worked efficiently in allocating and deallocating resources, providing enough memory to execute the job. All application data was kept in resident memory during the whole execution; the difference between the allocated VM memory and the memory used by the application is exactly the 2 GB initially allocated.

In the second test, we employed the monitoring-based system; no modifications to the application were necessary in this case. The mechanism calculates the memory allocation using the expression

VM memory = used memory × (1 + MOP),

where MOP is the Memory Overprovisioning Percentage, which indicates the amount of memory to be (de)allocated. The mechanism was configured to collect memory usage information every second, with MOP = 50%. The results obtained with this mechanism are presented in Figure 13. In the first scenario, the mechanism succeeded in providing the memory necessary for the application; however, most of the time it was not possible to keep all data in resident memory, affecting the application's performance. In the second scenario, the mechanism failed to allocate the memory for the workload peak, and the application aborted at the beginning of the second iteration.
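A sketch of the substitution performed in the first test, using the omp_alloc and omp_free routines proposed in Section III-C; the buffer size is illustrative:

    #include <stdio.h>
    // Prototypes for the proposed extensions (Table I), assumed to be
    // exported by the elastic libgomp:
    void* omp_alloc(long int size);
    void  omp_free(void *ptr);

    int main(void) {
        long size = 4L * 1024 * 1024 * 1024;   // 4 GB working set (illustrative)

        // Like malloc, but also adds 'size' bytes of memory to the VM.
        double *data = (double *) omp_alloc(size);
        if (data == NULL) {
            fprintf(stderr, "allocation (or VM memory resize) failed\n");
            return 1;
        }

        // ... fill and process 'data' in parallel ...

        omp_free(data);                        // releases the memory from the VM
        return 0;
    }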
VI. CONCLUSION
This paper describes a mechanism for providing elasticity to OpenMP applications. In our approach, the original OpenMP API is extended with modifications that enable the dynamic provisioning of resources. The elasticity extensions are based on adaptations of OpenMP directives and on the addition of a set of user-level routines to the OpenMP API.

The results show that elastic OpenMP was successfully used to provide elasticity in different scenarios: to improve an application's efficiency, to implement load balancing in a hybrid MPI-OpenMP application, and to add dynamic memory allocation to parallel applications. In addition, the use of OpenMP for constructing elastic applications presented superior results when compared with monitoring-based systems. This advantage stems from the fact that elastic OpenMP enables programmers to build elasticity controllers specific to each application, which are able to take into account the program structure and runtime requirements.
As future work, we plan to add support for nested parallelism and to include elasticity in the OpenMP task directive (omp task).

Fig. 12. OpenMP: dynamic memory allocation. Resident and VM memory (MB) over time (sec) for scenarios (a) and (b).

Fig. 13. Monitoring System: dynamic memory allocation. Resident memory, VM memory and load pattern (MB) over time (sec) for scenarios (a) and (b); in scenario (b) the application aborted.

ACKNOWLEDGMENT

This work is supported by CAPES and INCT-MACC (CNPq grant nr. 573710/2008-2).

REFERENCES

[1] M. Villamizar, H. Castro, and D. Mendez, "E-clouds: A SaaS marketplace for scientific computing," in Proceedings of the 2012 IEEE/ACM Fifth International Conference on Utility and Cloud Computing, ser. UCC '12. IEEE, 2012, pp. 13–20.
[2] L. Badger, R. Patt-Corner, and J. Voas, "Draft cloud computing synopsis and recommendations," NIST Special Publication, vol. 146, p. 84, 2011. [Online]. Available: http://csrc.nist.gov/publications/drafts/800-146/Draft-NIST-SP800-146.pdf
[3] S. Jha, D. S. Katz, A. Luckow, A. Merzky, and K. Stamou, "Understanding scientific applications for cloud environments," in Cloud Computing: Principles and Paradigms, R. Buyya, J. Broberg, and A. M. Goscinski, Eds. John Wiley & Sons, Inc., March 2011, ch. 13, pp. 345–371.
[4] G. Galante and L. C. E. de Bona, "A survey on cloud computing elasticity," in Proceedings of the International Workshop on Clouds and eScience Applications Management, ser. CloudAM 2012. IEEE, 2012, pp. 263–270.
[5] T. C. Chieu, A. Mohindra, A. A. Karve, and A. Segal, "Dynamic scaling of web applications in a virtualized cloud computing environment," in Proceedings of the 2009 IEEE International Conference on e-Business Engineering, ser. ICEBE 2009. IEEE, 2009, pp. 281–286.
[6] N. Roy, A. Dubey, and A. Gokhale, "Efficient autoscaling in the cloud using predictive models for workload forecasting," in Proceedings of the 4th Intl. Conference on Cloud Computing, ser. CLOUD 2011. IEEE, 2011, pp. 500–507.
[7] A. Raveendran, T. Bicer, and G. Agrawal, "A framework for elastic execution of existing MPI programs," in Proceedings of the International Symposium on Parallel and Distributed Processing Workshops and PhD Forum, ser. IPDPSW 2011. IEEE, 2011, pp. 940–947.
[8] L. Wang, J. Zhan, W. Shi, and Y. Liang, "In cloud, can scientific communities benefit from the economies of scale?" IEEE Trans. Parallel Distrib. Syst., vol. 23, no. 2, pp. 296–303, Feb. 2012.
[9] N. Chohan, C. Castillo, M. Spreitzer, M. Steinder, A. Tantawi, and C. Krintz, "See Spot Run: Using spot instances for MapReduce workflows," in Proceedings of the 2nd USENIX Conference on Hot Topics in Cloud Computing, ser. HotCloud 2010. USENIX, 2010.
[10] A. Iordache, C. Morin, N. Parlavantzas, and P. Riteau, "Resilin: Elastic MapReduce over multiple clouds," INRIA, Rapport de recherche RR-8081, Oct. 2012. [Online]. Available: http://hal.inria.fr/hal-00737030
[11] D. Rajan, A. Canino, J. A. Izaguirre, and D. Thain, "Converting a high performance application to an elastic cloud application," in Proceedings of the 3rd International Conference on Cloud Computing Technology and Science, ser. CLOUDCOM '11. IEEE, 2011, pp. 383–390.
[12] R. Rabenseifner, G. Hager, and G. Jost, "Hybrid MPI/OpenMP parallel programming on clusters of multi-core SMP nodes," in Proceedings of the 2009 17th Euromicro International Conference on Parallel, Distributed and Network-based Processing, ser. PDP '09. IEEE, 2009, pp. 427–436.
[13] B. Chapman, G. Jost, and R. v. d. Pas, Using OpenMP: Portable Shared Memory Parallel Programming. The MIT Press, 2007.
[14] R. Moreno-Vozmediano, R. S. Montero, and I. M. Llorente, "IaaS cloud architecture: From virtualized datacenters to federated cloud infrastructures," Computer, vol. 45, no. 12, pp. 65–72, 2012.
[15] G. Moltó, M. Caballer, E. Romero, and C. de Alfonso, "Elastic memory management of virtualized infrastructures for applications with dynamic memory requirements," in International Conference on Computational Science, ser. ICCS 2013, 2013.
[16] D. J. McFarland, "Exploiting malleable parallelism on multicore systems," Master's thesis, Virginia Polytechnic Institute and State University, Blacksburg, Virginia, USA, 2011.
[17] R. Chandra, L. Dagum, D. Kohr, D. Maydan, J. McDonald, and R. Menon, Parallel Programming in OpenMP. San Francisco, CA, USA: Morgan Kaufmann Publishers Inc., 2001.
[18] C. Liao, O. Hernandez, B. Chapman, W. Chen, and W. Zheng, "OpenUH: An optimizing, portable OpenMP compiler," Concurr. Comput.: Pract. Exper., vol. 19, no. 18, pp. 2317–2332, Dec. 2007.
[19] A. H. Karp and H. P. Flatt, "Measuring parallel processor performance," Commun. ACM, vol. 33, no. 5, pp. 539–543, May 1990.