Procedia Computer Science 70 (2015) 769 – 777
Design and Development of Grid Enabled, GPU Accelerated Java Application (Protein Sequence Study) for Grid Performance Analysis

Subrata Sinha (a,*), Smriti Priya Medhi (a), G.C. Hazarika (b)

a Centre for Bioinformatics Studies, Dibrugarh University, Dibrugarh-786004, India
b Department of Mathematics, Dibrugarh University, Dibrugarh-786004, India
Abstract

Recent advances in structural biology have generated huge volumes of data, and analyzing such data is vital to uncovering the hidden truths of life. Such analysis is compute intensive, however, and requires enormous computational power, which has led to extensive use of high performance computing models (multi-core computing, GPU computing, CPU-GPU hybrid computing, clusters, grids). The grid is one of the most widely used HPC models in computational biology; to execute compute intensive tasks on a grid, applications must themselves be grid enabled. Grid performance depends on the appropriate selection of a load balancing strategy (at both server and node level) and on maximum utilization of each node's computational resources (the CPU's multiple cores and the GPU's execution units) by the grid enabled applications. In this paper an attempt has been made to establish an experimental grid, develop a grid enabled application, and operate the grid at its highest possible performance through that application.
© 2015 The Authors. Published by Elsevier B.V. This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/). Peer-review under responsibility of the Organizing Committee of the International Conference on Eco-friendly Computing and Communication Systems (ICECCS 2015).
Keywords: Grid Computing; Load Balancing; Node Optimization; CPU-GPU Hybrid Computing; Multi Core CPU; JPPF Grid
* Corresponding author. Tel.: +91-848-646-8313. E-mail address: [email protected]
1877-0509. doi:10.1016/j.procs.2015.10.116
1. Introduction

The rapid growth of scientific applications has led to the development of a new generation of distributed systems such as grid computing1. Grids are extensively used in computational biology and bioinformatics2, mainly in molecular dynamics, comparative genomic analysis and annotation, mass spectrometry data analysis, and protein stability studies3-6. A grid architecture consists of the grid fabric, the grid middleware, and grid enabled applications. Grid enabled applications are software applications that can utilize the grid infrastructure through the grid middleware. Many grid middleware packages, e.g. UNICORE (Java), Globus (C and Java), Legion (C++), Gridbus (C, Java, C# and Perl) and JPPF (Java), are used in various grid projects7-13. Among these, the JPPF middleware was chosen for the proposed study. Its performance was tested for every server load balancing algorithm (Static, Adaptive, Deterministic, Reinforcement Learning and Node Threads) against various task granularities (0.1 MB to 51.6 MB), and the overall performance of the grid was tested with a grid enabled protein sequence analysis program. Apart from selecting the best load balancing algorithm at the server, performance can be further enhanced at the nodes by enabling grid enabled applications to use both the CPU and the GPU to process a task. The rapid growth in the programmability and capability of GPUs (Graphics Processing Units) has accelerated their use in general purpose computing13 and has proved a promising solution for compute intensive tasks in various domains of computational biology14-24. Hybrid (CPU+GPU) computing is a promising future for computational biology25-27; in comparative genomics28 a grid with the hybrid computing model has been used, and some grids utilize GPU processing to improve performance29.
Massive computational power with reduced processing time30, lower energy consumption and lower economic cost are the points where this hybrid architecture dominates the traditional CPU-only one31. Making GPUs as programmable as CPUs requires specific frameworks such as CUDA32 and OpenCL33; both are low-level libraries that demand time consuming programming, but GPU programming is simplified by APARAPI34. In this study we constructed a JPPF grid to analyze and evaluate the performance gain provided by grid server and node optimization (attained by selecting an appropriate load balancing strategy) and by sharing the computational load between the CPU and GPU of each node. To perform the experiment we established an experimental grid with the Java Parallel Programming Framework (JPPF) as middleware in the computer laboratory of the Centre for Bioinformatics Studies.

2. Materials and Method

2.1. Material

2.1.1 Hardware Components:
A) Grid server - i) GPU: Intel HD 4000, Ivy Bridge GT2, 16 execution units @ 350-1350 (Turbo) MHz, memory size 1556 MB; ii) CPU (4 cores): Intel(R) Core(TM) i7-337, 3.4 GHz, 8 GB RAM.
Grid nodes - i) GPU: Intel HD Ivy Bridge GT1 with 6 execution units @ 350-1100 (Boost) MHz clock frequency; ii) CPU (2 cores): Intel(R) Core 2 Duo, 3.4 GHz, 2 GB RAM.
B) Interconnection network: Gigabit Ethernet 1000 Mbps, D-Link 24-port 10/100/1000 switch.
2.1.2 Software Components: OS: Windows 7; grid middleware: JPPF 5.0; GPU library: Aparapi; standard library: JDK 1.7; build tool: Apache Ant 1.9.4; biological library: BioJava 3.0.5.
2.1.3 Data: 51.6 MB of protein sequence data.

2.2 Method

Driver configuration: The parameters for each load balancing algorithm were set as follows:
- Static algorithm: strategy.manual_profile.size = 1
- Adaptive algorithm: minSamplesToAnalyse = 100, minSamplesToCheckConvergence = 50, maxDeviation = 0.2, maxGuessToStable = 50, sizeRatioDeviation = 1.5, decreaseRatio = 0.2
- Deterministic algorithm: initialSize = 5, performanceCacheSize = 1000, proportionalityFactor = 1
- Reinforcement Learning algorithm: performanceCacheSize = 3000, performanceVariationThreshold = 0.001
- Node Threads algorithm: multiplicator = 1

Node configuration: A node's configuration file has a property, processing.threads, that specifies the size of its execution thread pool; it was varied around n, the number of CPU cores in the node.

Development of the grid enabled application: A gridified bioinformatics application was written to evaluate the performance of the established grid; it analyses 51.6 MB of protein sequence data, computing the instability index, net charge, hydrophobicity and aliphatic index of the retrieved protein sequences.

Method for executing tasks on the GPU: For an APARAPI program to execute, the OpenCL native library (OpenCL.dll) and the APARAPI native library (aparapi_x86.dll) must be on the PATH environment variable.

CPU-GPU load balancing: The load is shared in proportion to the capabilities of the CPU and GPU; in our experiment 80% of the load is given to the CPU and 20% to the GPU.
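The per-sequence metrics named above are standard sequence-derived quantities. As an illustration of the kind of work each grid task performs, here is a minimal sketch of one of them, the aliphatic index, using Ikai's formula AI = X(Ala) + 2.9·X(Val) + 3.9·(X(Ile) + X(Leu)), where X(..) is the mole percent of each residue. The class and method names are illustrative, not the application's actual code, which works with BioJava's sequence types.

```java
// Sketch of one per-sequence metric computed by the gridified application.
// Aliphatic index (Ikai, 1980): AI = X(Ala) + 2.9*X(Val) + 3.9*(X(Ile) + X(Leu)),
// with X(..) given as mole percent of the residue in the sequence.
public class AliphaticIndex {
    public static double compute(String seq) {
        int ala = 0, val = 0, ile = 0, leu = 0;
        for (char c : seq.toUpperCase().toCharArray()) {
            switch (c) {
                case 'A': ala++; break;
                case 'V': val++; break;
                case 'I': ile++; break;
                case 'L': leu++; break;
            }
        }
        double n = seq.length();
        // Multiply by 100 to convert residue fractions to mole percent.
        return 100.0 * (ala + 2.9 * val + 3.9 * (ile + leu)) / n;
    }

    public static void main(String[] args) {
        // "AVIL": each residue is 25 mole percent -> 25 + 72.5 + 195 = 292.5
        System.out.println(compute("AVIL"));
    }
}
```

The other metrics (instability index, net charge, hydrophobicity) follow the same per-sequence pattern, which is what makes the workload easy to partition into independent grid tasks.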
Fig 1. Simplified architecture of the GPU enabled, grid enabled application: class hierarchy and underlying hardware utilization
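The CPU-GPU division shown in Fig 1 can be illustrated with a small, self-contained sketch: 80% of the work items go to a CPU thread pool and the remaining 20% to the GPU path. The GPU branch here is only a stand-in comment; in the actual application it is an Aparapi Kernel whose Java bytecode Aparapi translates to OpenCL. All names below are illustrative, not the paper's code.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Illustrative sketch of the 80/20 CPU-GPU load split described in Section 2.2.
public class HybridSplit {
    // Partition a batch of work items: cpuShare of them go to the CPU, the rest to the GPU.
    public static int[] split(int totalItems, double cpuShare) {
        int cpuItems = (int) Math.round(totalItems * cpuShare);
        return new int[] { cpuItems, totalItems - cpuItems };
    }

    public static void main(String[] args) throws Exception {
        int[] parts = split(1000, 0.8); // 80% CPU, 20% GPU, as in the experiment
        ExecutorService cpuPool = Executors.newFixedThreadPool(2); // 2-core nodes
        Future<?> cpuWork = cpuPool.submit(() -> {
            // process parts[0] sequences on the CPU cores
        });
        // GPU branch (stand-in): the real code would execute an Aparapi Kernel,
        // e.g. kernel.execute(Range.create(parts[1])).
        cpuWork.get();
        cpuPool.shutdown();
        System.out.println("CPU items: " + parts[0] + ", GPU items: " + parts[1]);
    }
}
```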
The grid enabled application divides the task between the cores of the CPU and the execution units of the GPU by adopting the above class hierarchy. The client application uses standard JPPF classes together with APARAPI.

3. Results

3.1 Results of server optimization and performance analysis
Table 1. Server load balancing performance for the various algorithms (execution time in milliseconds)

Sl. No  Granularity of Task (MB)  Static  Adaptive  Deterministic  Reinforcement Learning  Node Threads
 1       0.1                      1532    1893      2193           2043                    2043
 2       0.2                      1975    2100      2500           2300                    2300
 3       0.4                      2476    2248      2748           2498                    2498
 4       0.8                      2688    3432      3732           3582                    3582
 5       1.6                      3704    4397      4997           4697                    4697
 6       3.2                      3988    4762      5062           4912                    4912
 7       6.4                      4489    5927      6727           6327                    6327
 8       12.8                     4997    6093      6393           6243                    6243
 9       25.6                     5499    7358      8258           7808                    7808
10       51.6                     6009    8123      9123           8623                    8623
Fig 2. Comparison of load balancing algorithms on the proposed grid for various levels of task granularity
From Figure 2 it can be seen that static load sharing is much more efficient than the other algorithms; all the other algorithms require communication between the nodes and the server. We can also see that, along with the increase in task granularity, the total
execution time of the grid enabled application also increased; this is because of more network communication between the server and the nodes. We therefore chose the static load balancing algorithm for the server in our grid, because all the nodes used in the grid have the same configuration.

3.2 Results of node performance by node threads

Table 2. Node load balancing performance for different numbers of processing threads
Performance of grid nodes (execution time in milliseconds); each cell lists the times for T = 1 / 2 / 4 processing threads

Sl. No  Granularity of Task (MB)  Node 1 (T=1/2/4)    Node 2 (T=1/2/4)    Node 3 (T=1/2/4)    Node 4 (T=1/2/4)
 1       0.1                       760 /  381 /  395   765 /  384 /  399   766 /  384 /  399   764 /  383 /  398
 2       0.2                       981 /  492 /  506   987 /  495 /  510   988 /  495 /  510   985 /  494 /  509
 3       0.4                      1232 /  617 /  631  1237 /  620 /  635  1238 /  620 /  635  1236 /  619 /  634
 4       0.8                      1337 /  670 /  684  1344 /  673 /  688  1345 /  674 /  689  1342 /  672 /  687
 5       1.6                      1844 /  923 /  937  1852 /  927 /  942  1853 /  928 /  943  1850 /  926 /  941
 6       3.2                      1988 /  995 / 1009  1993 /  998 / 1013  1994 /  998 / 1013  1992 /  997 / 1012
 7       6.4                      2240 / 1121 / 1135  2243 / 1123 / 1138  2244 / 1123 / 1138  2243 / 1123 / 1138
 8      12.8                      2493 / 1248 / 1262  2498 / 1250 / 1265  2499 / 1251 / 1266  2496 / 1249 / 1264
 9      25.6                      2745 / 1374 / 1388  2748 / 1375 / 1390  2749 / 1376 / 1391  2748 / 1375 / 1390
10      51.6                      2998 / 1500 / 1514  3004 / 1503 / 1518  3005 / 1504 / 1519  3002 / 1502 / 1517
Fig 3(a)-3(d). Node load balancing performance for different numbers of processing threads for nodes 1, 2, 3 and 4
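The processing.threads values compared in Table 2 correspond to the node's CPU core count, a quantity the JDK can report directly; a minimal sketch (the property itself is still set in the node's configuration file, not in code):

```java
// Query the CPU core count that the node-level thread tuning is based on.
public class ThreadSizing {
    public static int coreCount() {
        return Runtime.getRuntime().availableProcessors();
    }

    public static void main(String[] args) {
        // On the experiment's 2-core nodes this would report 2.
        System.out.println("suggested processing.threads = " + coreCount());
    }
}
```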
We set the processing.threads property to 1, 2 and 4 in each node's configuration file for each task granularity. For each granularity and each thread value we executed the grid enabled application in 3 runs, and the mean execution time over the runs was taken. It is evident from Table 2 and Fig 3(a)-3(d) that if a node has 2 cores, then setting the number of processing threads to 2 gives the best performance; in our experimental grid all nodes had 2-core CPUs.

3.3 Node performance analysis of the CPU+GPU hybrid computing model
Table 3. Processing time readings and performance improvement when the load is shared / not shared between CPU and GPU (times in milliseconds)

Sl. No  Granularity of Task (MB)  CPU-GPU hybrid        CPU only            Difference  Improvement
                                  (80% CPU, 20% GPU)    (100% load on CPU)  (ms)        (%)
 1       0.1                       281                   381                100         26.25
 2       0.2                       292                   492                200         40.65
 3       0.4                       242                   617                375         60.78
 4       0.8                       265                   670                405         60.45
 5       1.6                       363                   923                560         60.67
 6       3.2                       395                   995                600         60.30
 7       6.4                       440                  1121                681         60.75
 8      12.8                       498                  1248                750         60.10
 9      25.6                       544                  1374                830         60.41
10      51.6                       590                  1500                910         60.67
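The improvement percentage tabulated above is the relative reduction in execution time, (CPU-only − hybrid) / CPU-only × 100. A small sketch checked against the first rows of the table (class and method names are illustrative):

```java
// Relative improvement of the hybrid CPU-GPU model over CPU-only execution,
// as tabulated in Table 3: (cpuOnly - hybrid) / cpuOnly * 100.
public class HybridImprovement {
    public static double improvementPercent(double cpuOnlyMs, double hybridMs) {
        return (cpuOnlyMs - hybridMs) / cpuOnlyMs * 100.0;
    }

    public static void main(String[] args) {
        // Row 1 of Table 3: 381 ms CPU-only vs 281 ms hybrid -> ~26.25 %
        System.out.printf("%.2f%n", improvementPercent(381, 281));
        // Row 3 of Table 3: 617 ms CPU-only vs 242 ms hybrid -> ~60.78 %
        System.out.printf("%.2f%n", improvementPercent(617, 242));
    }
}
```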
From the above readings and calculations it is evident that when the CPU and GPU process the data together, each node shows a clear improvement in performance. The data are represented graphically in Figures 4(a) and 4(b).

Figure 4(a). Execution time and 4(b) improvement of the CPU-GPU hybrid model

A performance pattern can be noticed: beyond a certain task granularity the improvement stays within a range of ~60%, because the gain is bounded by the processing capability of the GPU. With a higher number of execution units in the GPU, the performance would scale to a much higher level.

3.4 Results of grid performance analysis

The processing times of the grid enabled application T(G) and of the sequential application T(S) were recorded in milliseconds for various task granularities. The results show that the performance improvement [T(S)/T(G)], after reaching a certain level, becomes stable. The results generated are shown in Table 4.
Table 4. Comparative processing time performance between T(S) and T(G)

Sl. No  Granularity of Task (MB)  T(S) in ms  T(G) in ms  % PE = T(S)/T(G) * 100
 1       0.1                        3064        1532       200
 2       0.2                        6912.5      1975       350
 3       0.4                       11884.8      2476       480
 4       0.8                        9139.2      2688       340
 5       1.6                       12593.6      3704       340
 6       3.2                       13559.2      3988       340
 7       6.4                       15262.6      4489       340
 8      12.8                       16989.8      4997       340
 9      25.6                       18696.6      5499       340
10      51.6                       20430.6      6009       340
Figure 5. Execution Time of Grid Enabled and Standalone Application
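The % PE column of Table 4 is simply T(S)/T(G) × 100; a quick sketch checked against the first and last rows of the table (names are illustrative):

```java
// Performance efficiency of the grid enabled application relative to the
// sequential version, as defined for Table 4: PE = T(S)/T(G) * 100.
public class GridSpeedup {
    public static double performancePercent(double sequentialMs, double gridMs) {
        return sequentialMs / gridMs * 100.0;
    }

    public static void main(String[] args) {
        // Row 1 of Table 4: 3064 ms sequential vs 1532 ms on the grid -> 200 %
        System.out.println(performancePercent(3064, 1532));
        // Row 10: 20430.6 ms vs 6009 ms -> ~340 %, the stable ~3.4x speedup
        System.out.println(performancePercent(20430.6, 6009));
    }
}
```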
The grid enabled application achieved performance up to 3.4 times that of the sequential version of the same application.

4. Conclusion

Grid performance depends on configuring the grid with an appropriate load balancing strategy. A grid enabled application that can efficiently utilize the underlying hardware resources of the nodes (CPU and GPU cores) makes compute intensive tasks considerably faster. The study shows how to obtain maximum performance from a grid enabled application through appropriate server and node optimization and utilization of the hybrid computing model. The experiment was conducted on a small experimental grid of 4 nodes, and the grid enabled application performed 3.4 times better than a non grid enabled application performing the same analysis. In addition, each node, when tested with the CPU-GPU hybrid model, showed an additional improvement of ~60%. This approach is a promising model for almost all compute intensive tasks in bioinformatics, such as molecular modeling, docking and molecular dynamics studies for large numbers of protein structures.
Acknowledgements

We acknowledge Prof. Subrata Chakraborty, Director i/c, Centre for Bioinformatics Studies, for providing the infrastructure to construct the grid and perform the experiments, and Dr. K. Narain, Dy. Director of RMRC-ICMR, Dibrugarh, and Dr. Ashwani Sharma, Indian Science and Technology Foundation, for their constant support and help.

References

1. Foster I, Yong Z, Raicu I, Shiyong L. Cloud computing and grid computing 360-degree compared. In Proc. Grid Computing Environments Workshop (GCE '08), 12-16 Nov. 2008; p. 1-10.
2. Payne JL, Sinnott-Armstrong NA, Moore JH. Exploiting graphics processing units for computational biology and bioinformatics. Interdiscip Sci 2010; 2(3):213-20.
3. Hunter A, Schibeci D, Hiew HL, Bellgard M. GRID-enabled bioinformatics applications for comparative genomic analysis at the CBBC. In Proc. IEEE International Conference on Cluster Computing, 20-23 Sept. 2004.
4. Thrasher A, Musgrave Z, Kachmarck B, Thain D, Emrich S. Scaling up genome annotation using MAKER and work queue. Int J Bioinform Res Appl 2014; 10(4-5):447-60.
5. Carapito C, Burel A, Guterl P, Walter A, Varrier F, Bertile F, Van DA. MSDA, a proteomics software suite for in-depth Mass Spectrometry Data Analysis using grid computing. Proteomics 2014; 14(9):1014-9.
6. Wu CC, Lai LF, Gromiha MM, Huang LT. High throughput computing to improve efficiency of predicting protein stability change upon mutation. Int J Data Min Bioinform 2014; 10(2):206-24.
7. Menday R, Wieder P. GRIP: the evolution of UNICORE towards a service-oriented grid. In Proc. Cracow Grid Workshop 2003, Cracow, Poland; p. 142-150.
8. Venugopal S, Buyya R, Winton L. A grid service broker for scheduling distributed data-oriented applications on global grids. In Proc. 2nd Workshop on Middleware for Grid Computing, Toronto, Ontario, Canada, 2004; p. 75-80.
9. Hey T, Trefethen AE. The UK e-Science Core Programme and the Grid. Future Generation Computer Systems 2002; 18(8):1017-1031.
10. Johnston W, Gannon D, Nitzberg B. Grids as production computing environments: the engineering aspects of NASA's Information Power Grid. In Proc. Eighth IEEE International Symposium on High Performance Distributed Computing, Redondo Beach, CA, USA, 1999; p. 197-204.
11. Natrajan A, Crowley M, Wilkins-Diehr N, Humphrey M, Fox AD, Grimshaw A, Brooks CL. Studying protein folding on the grid: experiences using CHARMM on NPACI resources under Legion. In Proc. 10th International Symposium on High Performance Distributed Computing, August 7-9, 2001; p. 14-21.
12. Buyya R, Date S, Mizuno-Matsumoto Y, Venugopal S, Abramson D. Composition of distributed brain activity analysis and its on-demand deployment on global grids. In Proc. 10th International Conference on High Performance Computing 2003, Hyderabad.
13. Huang Q, Huang Z, Werstein P, Purvis M. GPU as a general purpose computing resource. In Proc. Ninth International Conference on Parallel and Distributed Computing, Applications and Technologies, 2008; p. 151-158.
14. Guerrero GD, Imbernon B, Perez-Sanchez H, Sanz F, Garcia JM, Cecilia JM. A performance/cost evaluation for a GPU-based drug discovery application on volunteer computing. Biomed Res Int 2014; 2014:474219.
15. Goga N, Marrink S, Cioromela R, Moldoveanu F. GPU-SD and DPD parallelization for Gromacs tools for molecular dynamics simulations. In Proc. IEEE 12th International Conference on Bioinformatics & Bioengineering (BIBE) 2012; p. 251-254.
16. Demchuk O, Karpov P, Blume Y. Docking small ligands to molecule of the plant FtsZ protein: application of the CUDA technology for faster computations. Cytology and Genetics 2012; 46(3):172-179.
17. Shimoda T, Suzuki S, Ohue M, Ishida T, Akiyama Y. Protein-protein docking on hardware accelerators: comparison of GPU and MIC architectures. BMC Syst Biol 2015; 9(Suppl 1):S6.
18. Wu J, Hong B, Takeda T, Guo JT. High performance transcription factor-DNA docking with GPU computing. Proteome Sci 2012; 10(Suppl 1):S17.
19. Manavski SA, Valle G. CUDA compatible GPU cards as efficient hardware accelerators for Smith-Waterman sequence alignment. BMC Bioinformatics 2008; 9:S10.
20. Zhao K, Chu X. G-BLASTN: accelerating nucleotide alignment by graphics processors. Bioinformatics 2014; 30(10):1384-91.
21. Pang B, Zhao N, Becchi M, Korkin D, Shyu CR. Accelerating large-scale protein structure alignments with graphics processing units. BMC Res Notes 2012; 5:116.
22. Yang G, Jiang W, Yang Q, Yu W. PBOOST: a GPU-based tool for parallel permutation tests in genome-wide association studies. Bioinformatics 2015; 31(9):1460-2.
23. Putz B, Kam-Thong T, Karbalai N, Altmann A, Muller-Myhsok B. Cost-effective GPU-grid for genome-wide epistasis calculations. Methods Inf Med 2013; 52(1):91-5.
24. Papadopoulos A, Kirmitzoglou I, Promponas VJ, Theocharides T. GPU technology as a platform for accelerating local complexity analysis of protein sequences. In Proc. IEEE Eng Med Biol Soc 2013; 2013:2684-7.
25. Wang Y, Du H, Xia M, Ren L, Xu M, Xie T. A hybrid CPU-GPU accelerated framework for fast mapping of high-resolution human brain connectome. PLoS ONE 2013; 8(5):e62789.
26. Nikolay P, Karpov P, Yaroslav B. Hybrid CPU-GPU calculations - a promising future for computational biology. In Proc. Third International Conference on High Performance Computing (HPC-UA) 2013, Kyiv, Ukraine; p. 330-335.
27. Liao T, Zhang Y, Kekenes-Huskey PM, Cheng Y, Michailova A, McCulloch AD, Holst M, McCammon JA. Multi-core CPU or GPU-accelerated multiscale modeling for biomolecular complexes. Mol Based Math Biol 2013; p. 1-27.
28. Singh A, Chen C, Liu W, Mitchell W, Schmidt B. A hybrid computational grid architecture for comparative genomics. IEEE Trans Inf Technol Biomed 2008; 12(2):218-25.
29. Hernandez-Esparza R, Chica-Mejia SM, Zapata-Escobar AD, Guevara-Garcia A, Martinez-Melchor A, Hernandez-Perez JM, Vargas R, Garza J. Grid-based algorithm to search critical points, in the electron density, accelerated by graphics processing units. J Comput Chem 2014; 35(31):2272-8.
30. Agulleiro JI, Vazquez F, Garzon EM, Fernandez JJ. Hybrid computing: CPU+GPU co-processing and its application to tomographic reconstruction. Ultramicroscopy 2012; 115:109-14.
31. Tyng-Yeu L, Hung-Fu L, Jun-Yao C. In Proc. IEEE 26th International Parallel and Distributed Processing Symposium Workshops & PhD Forum (IPDPSW), 21-25 May 2012, Shanghai; p. 2369-2377.
32. Ghorpade J, Parande J, Kulkarni M, Bawaskar A. GPGPU processing in CUDA architecture. Advanced Computing: An International Journal 2012; 3(1):105-120.
33. Lee H, Munshi A. The OpenCL Specification, Version 2.0, Document Revision 26. Khronos Group. Available from https://www.khronos.org/registry/cl/specs/opencl-2.0.pdf. Updated 17 October 2014, retrieved 16 April 2015.
34. Frost G. Aparapi - an open source API for expressing data parallel workloads in Java. Available from http://developer.amd.com/tools-and-sdks/opencl-zone/aparapi/