Can Cloud Virtual Environment Achieve Better Performance?

International Conference on Applied Internet and Information Technologies, 2013 - Regular Papers

Sasko Ristov, Goran Velkoski and Marjan Gusev
Ss. Cyril and Methodius University, Faculty of Computer Science and Engineering, Skopje, Macedonia
[email protected], [email protected], [email protected]

Abstract – The cloud virtual environment adds a virtualization layer, which is expected to decrease performance. In this paper we address the performance of dense matrix-vector multiplication (DMVM) executed on a virtual machine with the Windows operating system and C# in the cloud, compared to a traditional bare-metal machine with the same runtime environment. The expected behaviour based on theoretical analysis favors the classical environment, while the experimental research favors the cloud environment.

I. INTRODUCTION

Cloud computing is emerging as a leading platform for compute- and data-intensive applications. Its "pay-per-use" and "everything-as-a-service" models lower CAPEX and OPEX, allow easy access to highly scalable and elastic resources from everywhere, reduce maintenance expenses, etc. However, web service performance is almost 30% lower when hosted in the cloud than on a traditional on-premise platform with the same resources [1]. Another cloud drawback is performance unpredictability. Performance isolation is necessary in a multi-tenant cloud environment [2]. A virtual machine instance behaves differently on the same hardware infrastructure at different times, depending on the other active virtual machine instances [3]. Virtual machine instance granularity also significantly affects workload performance for small network workloads [4]. Still, the cloud can reduce cost with a small performance penalty by using thin hypervisors or OS-level containers [5]. Underutilizing resources by adding more nodes can considerably improve performance through greater parallelism [6], as can a virtual machine load balancer [7].

Despite the additional virtualization layer, some algorithms can achieve better performance when executed in a virtual environment. Cache intensive algorithms run faster in a virtual environment when the problem size fits in the CPU cache memory, but the performance degrades rapidly when the problem size exceeds the dedicated cache per core [8]. Gusev and Ristov determine that there are regions where the best performance in the cloud environment is achieved by allocating the resources among many concurrent virtual machine instances rather than in traditional multiprocessors using an API for parallelization (OpenMP) [9].

Most high performance algorithms are analyzed while executed on Linux based operating systems using OpenMP or

MPI for parallelization. In this paper we analyze the performance on the Windows operating system with C# and threading for parallelization. This platform can even provide superlinear speedup (speedup greater than the number of CPUs) for matrix multiplication when hosted in the Windows Azure cloud [10]. We have chosen DMVM as an algorithm that is widely used in both scientific and commercial applications. We are not interested in finding an algorithm that speeds up the DMVM execution in the cloud, nor in speeding up the execution using other, faster algorithms, but in determining the platform impact on the performance of a cache intensive algorithm executed in the Eucalyptus open source cloud and on a bare-metal machine using the same hardware resources and the same runtime environment.

The rest of the paper is organized as follows. Section II presents the testing methodology and Section III the theoretical analysis. Section IV shows the results of the experiments realized to determine the performance achieved by the sequential and parallel implementations of DMVM on a particular platform. The platform impact on DMVM performance is analyzed in Section V. Section VI concludes the work and presents the plans for future work.

II. TESTING METHODOLOGY

The testing methodology is based on two different environments and three test cases for each environment.

A. Testing Algorithm

The DMVM algorithm is used for testing. Matrix and vector elements are stored as double precision numbers with a size of 8 bytes each. One thread multiplies the whole matrix and vector in the sequential test cases. In the parallel test cases, each thread multiplies a block of N/P rows of the N×N matrix with the whole vector, where P denotes the total number of parallel threads and used CPU cores.

B. Testing Environment

This section presents the two analyzed platforms (cloud and bare-metal) as testing environments using the same hardware infrastructure and runtime environments.

1) Hardware Infrastructure: The hardware infrastructure is the same for each test case. It consists of a workstation with a shared memory multiprocessor Intel(R) Core(TM) 2 Quad
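The row-block decomposition used in the parallel test cases can be sketched as follows. The paper's implementation uses C# with Windows threads, so this Python version is only an illustrative sketch of the partitioning (the function names are ours, not the authors'); note also that CPython threads would not actually accelerate this loop.

```python
from threading import Thread

def dmvm_rows(A, x, y, r0, r1):
    """Multiply rows r0..r1-1 of matrix A with vector x, storing into y."""
    n = len(x)
    for i in range(r0, r1):
        s = 0.0
        for j in range(n):
            s += A[i][j] * x[j]
        y[i] = s

def dmvm_parallel(A, x, p):
    """Each of p threads multiplies a block of N/p rows with the whole vector."""
    n = len(A)
    y = [0.0] * n
    bounds = [n * k // p for k in range(p + 1)]  # row-block boundaries
    threads = [Thread(target=dmvm_rows, args=(A, x, y, bounds[k], bounds[k + 1]))
               for k in range(p)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return y

# 2x2 example with 2 threads: one row block per thread
print(dmvm_parallel([[1.0, 2.0], [3.0, 4.0]], [1.0, 1.0], 2))  # [3.0, 7.0]
```

With p = 1 the same function reproduces the sequential test case, where one thread multiplies the whole matrix and vector.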


D. Test Cases

Three groups of test cases are realized during the experiments on both platforms:
- sequential execution on only one core
- parallel execution with two threads on two cores
- parallel execution with four threads on four cores

III. THEORETICAL ANALYSIS

We realize a series of experiments in each test case by varying the matrix and vector size N over a range that spans all cache regions, in order to analyze the performance behavior under different load and variable cache storage requirements, i.e. for each cache region L1 to L4 as described in [8]. However, the cache regions are different for the DMVM algorithm compared to dense matrix-matrix multiplication. To analyze the behavior we start with the definition of the per-core cache requirements in (1), where P denotes the number of threads and cores: each core stores a block of N²/P matrix elements, the whole input vector of N elements and a block of N/P result elements, 8 bytes each.

CacheReq(N) = 8 · (N²/P + N + N/P)    (1)
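Under this cache-requirement model (our reconstruction; the exact formula and region boundaries are assumptions derived from the cache sizes in Section II-B1), a matrix size N can be classified into its cache region as follows:

```python
def cache_req_bytes(n, p):
    """Per-core cache requirement for DMVM with p threads: a block of
    n*n/p matrix elements, the whole input vector and a block of the
    result vector, stored as 8-byte doubles."""
    return 8 * (n * n / p + n + n / p)

def region(n, p, l1=32 * 1024, l2=256 * 1024, l3_pair=6 * 2**20):
    """Classify matrix size n into cache region L1..L4.  For parallel
    execution both 6 MB L3 caches are assumed available, so the L3
    limit is the same for 2 and 4 threads."""
    l3 = l3_pair if p == 1 else 2 * l3_pair
    req = cache_req_bytes(n, p)
    if req <= l1:
        return "L1"
    if req <= l2:
        return "L2"
    if req <= l3:
        return "L3"
    return "L4"

print(region(60, 1))    # fits in the L1 cache
print(region(2000, 1))  # exceeds the last level cache
```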

Fig. 1. Eucalyptus Testing Environment

CPU Q9400 @ 2.66 GHz and 8 GB RAM. The multiprocessor consists of 4 CPU cores, each with a dedicated 32 KB L1 cache and 256 KB L2 cache, and a total of 12 MB of L3 cache shared per two cores (each pair of cores shares 6 MB of L3 cache).

2) Runtime and Operating System: Windows Server 2008 R2 is installed on both machines, either as the virtual machine instance in the cloud or as the bare-metal host operating system. C# with threads is used for the parallel implementation.

Equation (1) can be rewritten as in (2), since the vector storage of 8·N bytes is much smaller than the matrix block of 8·N²/P bytes.

CacheReq(N) ≈ 8 · N²/P    (2)

The execution time T is measured for each test case and the speed V is calculated as defined in (3). V is expressed in gigaflops, i.e. billions of floating point operations per second, where the DMVM performs 2·N² floating point operations.

V(N) = 2 · N² / T(N)    (3)
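The speed computation in (3) is then a one-liner; the operation count 2·N² (one multiplication and one addition per matrix element) is our reading of the standard DMVM flop count, not a figure stated explicitly in the paper:

```python
def speed_gflops(n, t_seconds):
    """Speed V = 2*n^2 floating point operations divided by the
    execution time, expressed in gigaflops."""
    return 2.0 * n * n / t_seconds / 1e9

# a 1000x1000 DMVM finishing in 1 ms corresponds to 2 GFlops
print(speed_gflops(1000, 1e-3))  # 2.0
```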

C. The Platforms


Two different platforms are defined, i.e. the bare-metal and the cloud platform.

1) Bare-metal Platform: The bare-metal platform consists of the Windows operating system installed on the real hardware machine described in Section II-B1.

2) Cloud Platform: The cloud platform uses the Eucalyptus open source cloud solution [11] deployed on three physical machines, as depicted in Fig. 1. Nurmi et al. [12] outline the basic principles and important operational aspects of the Eucalyptus cloud solution. The Cloud Controller (CLC) is the entry point to the cloud and the interface to the management platform. Walrus allows the customers to store persistent data. The Cluster Controller (CC) schedules virtual machine instance execution on a specific Node Controller (NC), and the NC hosts and executes the virtual machine instances. All three nodes run Linux Ubuntu 12.04. The Kernel-based Virtual Machine (KVM) hypervisor is used for instantiating virtual machines. The cloud environment uses the same hardware as described in Section II-B1 for all nodes. A virtual machine instance is instantiated with all available hardware resources and is installed with the same Windows Server 2008 R2 operating system as the bare-metal platform.

We repeat each test case up to 1,000,000 times and calculate the average to achieve reliable test results, especially for smaller matrices, so that transient effects on the performance become negligible and a good comparison is possible [13]. The normalized speed is calculated for parallel execution as defined in (4) to compare the DMVM algorithm behavior, i.e. the achieved speed per core and thread.

V_P(N) = V(N) / P    (4)

The speedup is calculated as defined in (5). Index 1 denotes the sequential execution and index P the parallel execution with P threads on P cores.

S_P(N) = T_1(N) / T_P(N)    (5)

We define two more parameters, i.e. the relative platform speed R_V and the relative platform speedup R_S, in (6) and (7), to compare the platform impact on the achieved speed and speedup.
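The derived parameters in (4)–(7) reduce to simple ratios. A minimal sketch, using hypothetical timing values (the numbers below are illustrative, not measurements from the paper):

```python
def normalized_speed(v, p):
    """Normalized speed (4): achieved speed per core and thread."""
    return v / p

def speedup(t_seq, t_par):
    """Speedup (5): sequential time divided by parallel time."""
    return t_seq / t_par

def relative(cloud_value, bare_value):
    """Relative platform speed/speedup (6), (7): cloud over bare-metal.
    A value greater than 1 means the cloud platform does better."""
    return cloud_value / bare_value

# hypothetical execution times in seconds
s_cloud = speedup(8.0, 2.5)  # 3.2 with 4 threads in the cloud
s_bare = speedup(8.0, 2.8)   # about 2.86 with 4 threads on bare-metal
print(normalized_speed(s_cloud, 4))   # speedup per core
print(relative(s_cloud, s_bare) > 1)  # True: cloud scales better here
```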



R_V(N) = V_C(N) / V_B(N)    (6)

R_S(N) = S_C(N) / S_B(N)    (7)

Indexes C and B denote the cloud and the bare-metal platform, correspondingly. Different algorithm behavior is expected when executing the DMVM algorithm in the L1, L2, L3 and L4 cache regions. We define L1 as the region where the cache memory requirements fit in the L1 cache, expecting the highest performance. L2 and L3 are defined as the regions where the cache requirements fit in the L2 and L3 cache, correspondingly. Finally, region L4 presents the main memory region, where the cache requirements exceed the last level cache size and cache misses are generated. Table I presents the ranges of N for each cache region, as derived from (2) and the cache sizes in Section II-B1.

TABLE I
CACHE REGION RANGES FOR MATRIX AND VECTOR SIZE N

Region | 1 Thread | 2 Threads | 4 Threads
L1     | N ≤ 64   | N ≤ 90    | N ≤ 128
L2     | N ≤ 181  | N ≤ 256   | N ≤ 362
L3     | N ≤ 886  | N ≤ 1254  | N ≤ 1254
L4     | N > 886  | N > 1254  | N > 1254

The L1, L2 and L3 regions cover different ranges of N for different numbers of threads, since each CPU core possesses a dedicated L1 and L2 cache, so the total available cache memory for parallel execution is greater than for sequential execution. The upper limit of the L3 region is constant for each parallel execution, since both L3 caches are used.

Fig. 2. Speed achieved on bare-metal platform

IV. EXPERIMENTAL RESULTS

This section presents the results of the experiments to determine the impact of parallelization upon the speed, normalized speed and speedup on a particular platform.

A. DMVM Performance on Bare-metal Platform

In this section we analyze the performance of the DMVM algorithm executed on the bare-metal platform using different numbers of threads and CPU cores.

1) Speed in Bare-metal Platform: Fig. 2 depicts the results of the experiments on the bare-metal platform: one curve identifies the speed of sequential execution, and the other two identify the speed of parallel execution with 2 and 4 threads, correspondingly. We observe that sequential execution provides better performance for smaller N, since the operating system spends more time creating the threads and scheduling the tasks than executing the operations. For larger N, better performance is achieved using 2 threads and then using 4 threads.

Fig. 3. Normalized Speed achieved on bare-metal platform

Two regions are observed for each test case: the speed increases proportionally in regions L1 and L2, as defined in Table I, and saturates in regions L3 and L4. Another unusual result is the occurrence of speed variations for parallel execution, which are more emphasized for 4 threads.

2) Normalized Speed in Bare-metal Platform: A better analysis can be realized using the normalized speed depicted in Fig. 3: one curve identifies the normalized speed of sequential execution, while the other two identify the normalized speed of parallel execution on the bare-metal platform with 2 and 4 threads, correspondingly. Sequential execution provides the greatest speed per core for each N in

Page 94

International Conference on Applied Internet and Information Technologies, 2013 - Regular Papers

Fig. 4. Speedup achieved on bare-metal platform

Fig. 5. Speed achieved on cloud platform

front of parallel execution with 2 and 4 cores. This is emphasized for smaller N.

3) Speedup in Bare-metal Platform: Fig. 4 depicts the results for speedup on the bare-metal platform with 2 and 4 threads. We clearly observe the "slowdown", i.e. a speedup smaller than 1, for the regions described in Section IV-A1. The speedup increases enormously in regions L1 and L2; the increase is smaller in the L3 region and it saturates in the L4 region. The speedup fluctuates, i.e. it climbs and descends similarly to the speed.

B. DMVM Performance on Cloud Platform

In this section we analyze the performance of the DMVM algorithm executed on the cloud platform using different numbers of threads and CPU cores.

1) Speed in Cloud Platform: Fig. 5 depicts the results of the experiments on the cloud platform: one curve identifies the speed of sequential execution, while the other two identify the speed of parallel execution with 2 and 4 threads, correspondingly. Similar to the bare-metal platform, we observe that sequential execution provides better performance for smaller N, while for larger N better performance is achieved with 2 and then with 4 threads. Again two regions are observed for each test case: the speed increases proportionally in regions L1 and L2, as defined in Table I, and saturates in regions L3 and L4. We also observe speed climbs and descents for parallel execution, more emphasized for 4 threads.

2) Normalized Speed in Cloud Platform: Fig. 6 depicts the normalized speed: one curve identifies the normalized speed of sequential execution, while the other two

Fig. 6. Normalized Speed achieved on cloud platform

identify the normalized speed of parallel execution on the cloud platform with 2 and 4 threads, correspondingly. Sequential execution again provides the greatest speed per core for each N, ahead of parallel execution with 2 and 4 cores, which behave similarly on the cloud platform.

3) Speedup in Cloud Platform: Fig. 7 depicts the results for speedup on the cloud platform with 2 and 4 threads. We again clearly observe the "slowdown", i.e. a speedup smaller than 1, for the regions described in Section IV-B1. The



Fig. 8. Achieved Relative Platform Speed

Fig. 7. Speedup achieved on cloud platform

speedup increase and saturation are similar to those on the bare-metal platform.

V. PERFORMANCE COMPARISON OF CLOUD AND BARE-METAL PLATFORMS

Although the speed and speedup curves obtained for both platforms in Section IV have similar trends, there are still differences worth analyzing. In this section we compare the platforms to determine which provides better performance (speed) and scaling (speedup) with a particular number of threads and CPU cores. Therefore we calculate the relative platform speed R_V and the relative platform speedup R_S.

A. Speed Comparison

Fig. 8 depicts the relative platform speed R_V, i.e. it compares the speeds of both platforms for a particular number of threads. We observe a phenomenon: R_V > 1 for almost every N and every number of threads. That is, despite the virtualization layer, the cloud platform provides better performance (speed) than bare-metal, both for sequential and parallel execution. We must note that there is a small region where R_V < 1.

B. Speedup Comparison

Fig. 9 depicts the relative platform speedup R_S to compare the speedups of both platforms for a particular number of threads. We observe that the relative platform speedup for 2 threads is mostly below 1, i.e. with 2 threads the DMVM algorithm achieves a smaller speedup in the cloud compared to

Fig. 9. Achieved Relative Platform Speedup for 2 and 4 threads

the bare-metal platform. Opposite to the 2-thread case, the relative platform speedup for 4 threads is greater than 1 in each cache region L1 to L4; that is, with 4 threads the DMVM algorithm achieves a greater speedup in the cloud compared to the bare-metal platform. We also observe fluctuations (climbs and descents) in the relative platform speedup.


VI. CONCLUSION AND FUTURE WORK

This paper analyzes the performance behavior of the sequential and parallel implementations of the DMVM algorithm executed using the same Windows operating system and C# runtime environment, but on different platforms, i.e. bare-metal and the Eucalyptus cloud. The paper contribution is manifold, since we obtained many unexpected results. The main conclusion is that we experimentally showed that the DMVM algorithm provides better performance (speed) while executed in the cloud compared to bare-metal with the same resources and runtime environment. The cloud platform also provides a greater speedup using 4 cores, but a smaller speedup using 2 cores. The DMVM algorithm achieves a higher rate per processor when it runs with one thread instead of parallel execution for small N; sequential execution also provides better speed per core.

Using these results, we will continue our research on the cloud platform to analyze the hypothesis that dividing the problem and executing it on several virtual machine instances with 1 core each will improve the performance even more, instead of using threads for parallelization in a virtual machine instance allocated all CPU cores of the physical machine. We also plan to analyze the speed variations, with their various climbs and descents, observed for parallel execution of the DMVM algorithm.

REFERENCES

[1] S. Ristov, G. Velkoski, M. Gusev, and K. Kjiroski, "Compute and memory intensive web service performance in the cloud," in ICT Innovations 2012, Springer Berlin Heidelberg, 2013, vol. AISC 257, pp. 215–224.
[2] W. Wang, X. Huang, X. Qin, W. Zhang, J. Wei, and H. Zhong, "Application-level CPU consumption estimation: Towards performance isolation of multi-tenancy web applications," in Cloud Computing (CLOUD), 2012 IEEE 5th International Conference on, June 2012, pp. 439–446.
[3] Y. Koh, R. Knauerhase, P. Brett, M. Bowman, Z. Wen, and C. Pu, "An analysis of performance interference effects in virtual environments," in Performance Analysis of Systems and Software, ISPASS 2007, IEEE International Symposium on, 2007, pp. 200–209.
[4] P. Wang, W. Huang, and C. Varela, "Impact of virtual machine granularity on cloud computing workloads performance," in Grid Computing (GRID), 2010 11th IEEE/ACM International Conference on, Oct. 2010, pp. 393–400.
[5] A. Gupta, L. V. Kalé, D. S. Milojicic, P. Faraboschi, R. Kaufmann, V. March, F. Gioachin, C. H. Suen, and B.-S. Lee, "Exploring the performance and mapping of HPC applications to platforms in the cloud," in Proc. of the 21st Int. Symp. on High-Performance Parallel and Distributed Computing, ser. HPDC '12, ACM, 2012, pp. 121–122.
[6] R. Iakymchuk, J. Napper, and P. Bientinesi, "Improving high-performance computations on clouds through resource underutilization," in Proc. of the 2011 ACM Symp. on Applied Computing, ser. SAC '11, ACM, 2011, pp. 119–126.
[7] M. Sharma and P. Sharma, "Performance evaluation of adaptive virtual machine load balancing algorithm," International Journal of Advanced Computer Science and Applications (IJACSA), vol. 3, no. 2, pp. 86–88, Feb. 2012.
[8] M. Gusev and S. Ristov, "Matrix multiplication performance analysis in virtualized shared memory multiprocessor," in MIPRO, 2012 Proceedings of the 35th International Convention, IEEE, 2012, pp. 264–269.
[9] M. Gusev and S. Ristov, "The optimal resource allocation among virtual machines in cloud computing," in Proc. of the 3rd International Conference on Cloud Computing, GRIDs, and Virtualization (CLOUD COMPUTING 2012), 2012, pp. 36–42.
[10] M. Gusev and S. Ristov, "Superlinear speedup in Windows Azure cloud," in 2012 IEEE 1st International Conference on Cloud Networking (CLOUDNET), Paris, France, Nov. 2012, pp. 173–175.
[11] Eucalyptus, "Eucalyptus open source cloud," Jan. 2013. [Online]. Available: http://www.eucalyptus.com
[12] D. Nurmi, R. Wolski, C. Grzegorczyk, G. Obertelli, S. Soman, L. Youseff, and D. Zagorodnov, "The Eucalyptus open-source cloud-computing system," in Proc. of the 2009 9th IEEE/ACM Int. Symp. on Cluster Computing and the Grid, ser. CCGRID '09, IEEE Computer Society, 2009, pp. 124–131.
[13] Z. Krpic, G. Martinovic, and I. Crnkovic, "Green HPC: MPI vs. OpenMP on a shared memory system," in MIPRO, 2012 Proceedings of the 35th International Convention, IEEE, May 2012, pp. 246–250.

