
Procedia Computer Science 70 (2015) 769 – 777

Design and Development of Grid Enabled, GPU Accelerated Java Application (Protein Sequence Study) for Grid Performance Analysis

Subrata Sinha a,*, Smriti Priya Medhi a, G.C. Hazarika b

a Centre for Bioinformatics Studies, Dibrugarh University, Dibrugarh-786004, India
b Department of Mathematics, Dibrugarh University, Dibrugarh-786004, India

Abstract

Recent advances in structural biology have generated huge volumes of data, and analysing these data is vital to uncovering the hidden truths of life. Such analysis is compute intensive and requires enormous computational power, leading to extensive use of high performance computing (HPC) models such as multi-core computing, GPU computing, CPU-GPU hybrid computing, clusters and grids. The grid is one of the most widely used HPC models in computational biology, and to execute a compute intensive task on a grid, the application itself must be grid enabled. Grid performance depends on the appropriate selection of a load balancing strategy (at the server and node levels) and on maximum utilization of the computational resources of the nodes (the multiple CPU cores and the GPU execution units) by the grid enabled applications. In this paper an attempt has been made to establish an experimental grid, develop a grid enabled application, and operate the grid at its highest possible performance through that application.

© 2015 The Authors. Published by Elsevier B.V. This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/). Peer-review under responsibility of the Organizing Committee of the International Conference on Eco-friendly Computing and Communication Systems (ICECCS 2015).

Keywords: Grid Computing; Load Balancing; Node Optimization; CPU-GPU Hybrid Computing; Multi Core CPU; JPPF Grid

* Corresponding author. Tel.: +91-848-646-8313 E-mail address: [email protected]

1877-0509 © 2015 The Authors. Published by Elsevier B.V.
doi:10.1016/j.procs.2015.10.116


1. Introduction

The rapid growth of scientific applications has led to the development of a new generation of distributed systems such as grid computing1. Grids are extensively used in computational biology and bioinformatics2, mainly in molecular dynamics, comparative genomic analysis and annotation, mass spectrometry data analysis and protein stability studies3-6. Grid architecture consists of the grid fabric, the grid middleware and the grid enabled application. Grid enabled applications are software applications that can utilize the grid infrastructure through the grid middleware. Many grid middleware packages, viz. UNICORE (Java), Globus (C and Java), Legion (C++), Gridbus (C, Java, C# and Perl) and JPPF (Java), have been used in various grid projects7-13. For the proposed study the JPPF grid middleware was used; its performance was tested for every load balancing algorithm (static, adaptive, deterministic, reinforcement learning and node threads) against various task granularities (0.1 MB to 51.6 MB), and the performance of the grid was evaluated with a grid enabled protein sequence analysis program.

Apart from selecting the best load balancing algorithm at the server, performance at the nodes can be enhanced by enabling grid enabled applications to utilize both the CPU and the GPU for processing a task. The rapid growth in the programmability and capability of GPUs (Graphics Processing Units) has accelerated their use in general purpose computing13 and has proved to be a promising solution for compute intensive tasks in various domains of computational biology14-24. Hybrid (CPU+GPU) computing is a promising direction for computational biology25-27, and in the field of comparative genomics28 grids with the hybrid computing model are already in use; some grids utilize GPU processing to improve performance29. Massive computational power with reduced processing time30, lower energy consumption and lower economic cost is where this hybrid architecture dominates the traditional CPU-only approach31. To make GPUs as programmable as CPUs, specific frameworks such as CUDA32 and OpenCL33 are required; both are low-level libraries that demand time consuming programming, but GPU programming is simplified by APARAPI34.

In this study we constructed a JPPF grid and evaluated the performance gain provided by server and node optimization (attained by selection of an appropriate load balancing strategy) and by sharing the computational load between the CPU and GPU of each node. For the experiment we established an experimental grid with the Java Parallel Programming Framework (JPPF) as middleware in the computer laboratory of the Centre for Bioinformatics Studies.

2. Materials and Method

2.1. Material

2.1.1 Hardware components
A) Grid server: i) GPU: Intel HD 4000, Ivy Bridge GT2, 16 execution units at 350 MHz - 1350 MHz (Turbo), memory size 1556 MB; ii) CPU (4 cores): Intel(R) Core(TM) i7-337, 3.4 GHz, 8MB RAM.
Grid nodes: i) GPU: Intel HD, Ivy Bridge GT1, 6 execution units at 350 - 1100 MHz (Boost) clock frequency; ii) CPU (2 cores): Intel(R) Core 2 Duo, 3.4 GHz, 2 GB RAM.
B) Interconnection network: Gigabit Ethernet 1000 Mbps, D-Link 24-port 10/100/1000 switch.

2.1.2 Software components
OS: Windows 7. Grid middleware: JPPF 5.0. GPU library: Aparapi. Standard libraries: JDK 1.7, Apache Ant 1.9.4. Biological library: BioJava 3.0.5.
2.1.3 Data
51.6 MB of protein sequence data.

2.2 Method

Driver configuration: The parameters for each of the load balancing algorithms were set as follows. Static algorithm: strategy.manual_profile.size = 1. Adaptive algorithm: minSamplesToAnalyse = 100, minSamplesToCheckConvergence = 50, maxDeviation = 0.2, maxGuessToStable = 50, sizeRatioDeviation = 1.5, decreaseRatio = 0.2. Deterministic algorithm: initialSize = 5, performanceCacheSize = 1000, proportionalityFactor = 1. Reinforcement learning algorithm: performanceCacheSize = 3000, performanceVariationThreshold = 0.001. Node threads algorithm: multiplicator = 1.
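For orientation, these algorithms appear to correspond to JPPF's manual, autotuned, proportional, rl and nodethreads load balancers, and the algorithm is selected in the driver's configuration file. A minimal, hypothetical excerpt is sketched below; the exact property names follow our reading of the JPPF documentation and should be verified against the JPPF 5.0 configuration guide:

    # Hypothetical excerpt from the driver's JPPF configuration (illustrative only).
    # Static ("manual") algorithm with the profile used in this study:
    jppf.load.balancing.algorithm = manual
    jppf.load.balancing.profile = manual_profile
    jppf.load.balancing.profile.manual_profile.size = 1

    # Switching to the deterministic ("proportional") algorithm instead:
    # jppf.load.balancing.algorithm = proportional
    # jppf.load.balancing.profile = proportional_profile
    # jppf.load.balancing.profile.proportional_profile.initialSize = 5
    # jppf.load.balancing.profile.proportional_profile.performanceCacheSize = 1000
    # jppf.load.balancing.profile.proportional_profile.proportionalityFactor = 1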


Node configuration: A node's configuration file has a property, processing.threads, which specifies the size of the node's execution thread pool; in this study it was set relative to n, the number of CPU cores in the node (see Section 3.2).

Development of the grid enabled application: A gridified bioinformatics application was written which analyses 51.6 MB of protein sequence data and computes the instability index, net charge, hydrophobicity and aliphatic index of the retrieved protein sequences. The application was used to evaluate the performance of the established grid.

Method for executing tasks on the GPU: For an APARAPI program to execute, the OpenCL native library (OpenCL.dll) and the APARAPI native library (aparapi_x86.dll) must be on the PATH environment variable.

CPU-GPU load balancing: The load is shared in proportion to the capabilities of the CPU and the GPU; in our experiment 80% of the load was given to the CPU and 20% to the GPU.
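The application code itself is not reproduced in the paper; the following is a minimal, hypothetical sketch of how such an 80/20 split could be expressed with Aparapi on sequence data that has already been encoded into primitive arrays. The com.amd.aparapi package name matches the Aparapi releases of that period, while the hydropathy scale, the array encoding and the method itself are illustrative assumptions:

    import com.amd.aparapi.Kernel;
    import com.amd.aparapi.Range;

    public class HybridHydropathy {

        /**
         * Average hydropathy per sequence. residues[] holds all sequences back to back,
         * encoded as indices 0..19 into scale[]; offset[i]/length[i] locate sequence i.
         * gpuShare is the fraction of sequences sent to the GPU (0.2 in this study).
         */
        public static float[] averageHydropathy(final int[] residues, final int[] offset,
                                                final int[] length, final float[] scale,
                                                final double gpuShare) {
            final int n = offset.length;
            final float[] result = new float[n];
            final int gpuCount = (int) Math.round(n * gpuShare);

            // GPU share: one Aparapi work-item per sequence.
            Kernel kernel = new Kernel() {
                @Override public void run() {
                    int i = getGlobalId();
                    float sum = 0f;
                    for (int j = 0; j < length[i]; j++) {
                        sum += scale[residues[offset[i] + j]];
                    }
                    result[i] = sum / length[i];
                }
            };
            if (gpuCount > 0) {
                kernel.execute(Range.create(gpuCount));
            }
            kernel.dispose();

            // CPU share: the remaining sequences; in the real application this loop would
            // be spread over the node's execution thread pool (the pool whose size is set
            // by the JPPF processing.threads property).
            for (int i = gpuCount; i < n; i++) {
                float sum = 0f;
                for (int j = 0; j < length[i]; j++) {
                    sum += scale[residues[offset[i] + j]];
                }
                result[i] = sum / length[i];
            }
            return result;
        }
    }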

Fig 1 - Simplified Architecture of GPU Enabled Grid Enabled Application Class Hierarchy and underlying hardware utilization
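To make the client side of the hierarchy in Fig 1 concrete, a minimal job submission sketch is given below. Class and method names follow our reading of the JPPF 5.x API; ProteinTask, the chunking scheme and the loader are illustrative assumptions, and the BioJava property calculations are only indicated in comments:

    import java.util.ArrayList;
    import java.util.List;
    import org.jppf.client.JPPFClient;
    import org.jppf.client.JPPFJob;
    import org.jppf.node.protocol.AbstractTask;
    import org.jppf.node.protocol.Task;

    public class ProteinGridClient {

        // One grid task per chunk of protein sequences; run() executes on a grid node.
        public static class ProteinTask extends AbstractTask<String> {
            private final List<String> sequences;
            public ProteinTask(List<String> sequences) { this.sequences = sequences; }

            @Override public void run() {
                // For each sequence, compute instability index, net charge, hydrophobicity
                // and aliphatic index (e.g. via BioJava's aa-properties utilities), sharing
                // the work between the CPU cores and the GPU as sketched above.
                setResult("processed " + sequences.size() + " sequences");
            }
        }

        public static void main(String[] args) throws Exception {
            JPPFClient client = new JPPFClient();
            try {
                JPPFJob job = new JPPFJob();
                job.setName("protein sequence study");
                for (List<String> chunk : loadSequenceChunks()) {   // hypothetical loader
                    job.add(new ProteinTask(chunk));
                }
                List<Task<?>> results = client.submitJob(job);      // blocking submission
                System.out.println(results.size() + " tasks completed");
            } finally {
                client.close();
            }
        }

        // Placeholder: read the 51.6 MB protein data set and split it into task-sized chunks.
        private static List<List<String>> loadSequenceChunks() {
            return new ArrayList<List<String>>();
        }
    }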


The grid enabled application divides the task between the cores of the CPU and the execution units of the GPU by adopting the above class hierarchy. The client application uses the standard JPPF classes together with APARAPI.

3. Results

3.1 Results of server optimization and performance analysis

Table 1. Server load balancing performance for various algorithms (time of execution in milliseconds)

Sl. | Granularity of task (MB) | Static | Adaptive | Deterministic | Reinforcement learning | Node threads
1   | 0.1  | 1532 | 1893 | 2193 | 2043 | 2043
2   | 0.2  | 1975 | 2100 | 2500 | 2300 | 2300
3   | 0.4  | 2476 | 2248 | 2748 | 2498 | 2498
4   | 0.8  | 2688 | 3432 | 3732 | 3582 | 3582
5   | 1.6  | 3704 | 4397 | 4997 | 4697 | 4697
6   | 3.2  | 3988 | 4762 | 5062 | 4912 | 4912
7   | 6.4  | 4489 | 5927 | 6727 | 6327 | 6327
8   | 12.8 | 4997 | 6093 | 6393 | 6243 | 6243
9   | 25.6 | 5499 | 7358 | 8258 | 7808 | 7808
10  | 51.6 | 6009 | 8123 | 9123 | 8623 | 8623

Fig 2. Comparison of load balancing algorithms on the proposed grid for various levels of task granularity

From Fig 2 it can be seen that static load sharing is considerably more efficient than the other algorithms, because all the other algorithms require additional communication between the nodes and the server. As the granularity of the task increases, the total execution time of the grid enabled application also increases, owing to the greater network communication between the server and the nodes. We therefore chose the static load balancing algorithm for the server in our grid, since all the nodes in the grid have the same configuration.

3.2 Results of node performance by node threads

Table 2. Node load balancing performance for different numbers of processing threads (execution time in milliseconds; T = number of processing threads)

Sl. | Granularity of task (MB) | Node 1: T=1 / T=2 / T=4 | Node 2: T=1 / T=2 / T=4 | Node 3: T=1 / T=2 / T=4 | Node 4: T=1 / T=2 / T=4
1   | 0.1  | 760 / 381 / 395    | 765 / 384 / 399    | 766 / 384 / 399    | 764 / 383 / 398
2   | 0.2  | 981 / 492 / 506    | 987 / 495 / 510    | 988 / 495 / 510    | 985 / 494 / 509
3   | 0.4  | 1232 / 617 / 631   | 1237 / 620 / 635   | 1238 / 620 / 635   | 1236 / 619 / 634
4   | 0.8  | 1337 / 670 / 684   | 1344 / 673 / 688   | 1345 / 674 / 689   | 1342 / 672 / 687
5   | 1.6  | 1844 / 923 / 937   | 1852 / 927 / 942   | 1853 / 928 / 943   | 1850 / 926 / 941
6   | 3.2  | 1988 / 995 / 1009  | 1993 / 998 / 1013  | 1994 / 998 / 1013  | 1992 / 997 / 1012
7   | 6.4  | 2240 / 1121 / 1135 | 2243 / 1123 / 1138 | 2244 / 1123 / 1138 | 2243 / 1123 / 1138
8   | 12.8 | 2493 / 1248 / 1262 | 2498 / 1250 / 1265 | 2499 / 1251 / 1266 | 2496 / 1249 / 1264
9   | 25.6 | 2745 / 1374 / 1388 | 2748 / 1375 / 1390 | 2749 / 1376 / 1391 | 2748 / 1375 / 1390
10  | 51.6 | 2998 / 1500 / 1514 | 3004 / 1503 / 1518 | 3005 / 1504 / 1519 | 3002 / 1502 / 1517


Fig 3(a)-3(d). Node load balancing performance for different numbers of processing threads, for nodes 1, 2, 3 and 4


We set the processing.threads property to 1, 2 and 4 in the configuration file of each node, for each task granularity. For each granularity and each thread value the grid enabled application was executed three times, and the mean execution time over the runs was recorded. It is evident from Table 2 and Fig 3(a)-3(d) that when a node has 2 CPU cores, setting the number of processing threads to 2 gives the best performance; in our experimental grid all nodes had 2-core CPUs.
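The reported values are therefore simple arithmetic means; a trivial harness of the following form (purely illustrative, not taken from the paper) is sufficient for this kind of measurement:

    // Illustrative only: run a unit of work several times and report the mean wall-clock
    // time in milliseconds, as was done for each processing.threads setting and granularity.
    public class MeanRuntime {
        public static long meanMillis(Runnable work, int runs) {
            long totalNanos = 0;
            for (int r = 0; r < runs; r++) {
                long start = System.nanoTime();
                work.run();                 // e.g. submit the JPPF job and wait for results
                totalNanos += System.nanoTime() - start;
            }
            return (totalNanos / runs) / 1000000L;
        }
    }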

3.3 Node performance analysis of the CPU+GPU hybrid computing model

Table 3. Processing time readings and performance improvement when the load is / is not shared between CPU and GPU (times in milliseconds)

Sl. | Granularity of task (MB) | CPU-GPU (80% load on CPU, 20% on GPU) | CPU only (100% load on CPU of each node) | Difference in execution time (ms) | Improvement (%)
1   | 0.1  | 281 | 381  | 100 | 26.25
2   | 0.2  | 292 | 492  | 200 | 40.65
3   | 0.4  | 242 | 617  | 375 | 60.78
4   | 0.8  | 265 | 670  | 405 | 60.45
5   | 1.6  | 363 | 923  | 560 | 60.67
6   | 3.2  | 395 | 995  | 600 | 60.30
7   | 6.4  | 440 | 1121 | 681 | 60.75
8   | 12.8 | 498 | 1248 | 750 | 60.10
9   | 25.6 | 544 | 1374 | 830 | 60.41
10  | 51.6 | 590 | 1500 | 910 | 60.67
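The Improvement column in Table 3 corresponds to the relative reduction in execution time obtained by the hybrid model:

Improvement (%) = (T_CPU-only - T_hybrid) / T_CPU-only * 100

For example, for Sl. 3: (617 - 242) / 617 * 100 = 60.78%.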

From the above readings and calculations it is evident that when the CPU and GPU process the data together, each node shows a clear improvement in performance. The data are represented graphically in Figure 4.

Figure 4(a). Execution time and 4(b) improvement of the CPU-GPU hybrid model

A pattern can be noticed in the results: beyond a certain task granularity the improvement stays within a range of about 60%, because the gain is limited by the processing capability of the GPU. With a higher number of GPU execution units the performance would scale to a much higher level.

3.4 Results of grid performance analysis

The processing time of the grid enabled application, T(G), and of the sequential application, T(S), was recorded in milliseconds for various task granularities. The results show that the performance improvement, T(S)/T(G), becomes stable after reaching a certain level. The results are shown in Table 4.


Table 4. Comparative processing time between T(S) and T(G)

Sl. | Granularity of task (MB) | T(S) in ms | T(G) in ms | % PE = T(S)/T(G) * 100
1   | 0.1  | 3064    | 1532 | 200
2   | 0.2  | 6912.5  | 1975 | 350
3   | 0.4  | 11884.8 | 2476 | 480
4   | 0.8  | 9139.2  | 2688 | 340
5   | 1.6  | 12593.6 | 3704 | 340
6   | 3.2  | 13559.2 | 3988 | 340
7   | 6.4  | 15262.6 | 4489 | 340
8   | 12.8 | 16989.8 | 4997 | 340
9   | 25.6 | 18696.6 | 5499 | 340
10  | 51.6 | 20430.6 | 6009 | 340

Figure 5. Execution Time of Grid Enabled and Standalone Application
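For the largest granularity in Table 4, T(S)/T(G) = 20430.6 / 6009 ≈ 3.4, i.e. the % PE value of 340 in the last row; this is the source of the 3.4-times figure quoted below.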

The grid enabled application achieved up to 3.4 times the performance of the sequential version of the same application.

4. Conclusion

The performance of a grid depends on its configuration, in particular on the appropriate selection of a load balancing strategy. A grid enabled application that can efficiently utilize the underlying hardware resources of the nodes (CPU cores and GPU) makes compute intensive tasks considerably faster. This study shows how to obtain maximum performance from a grid enabled application through appropriate server and node optimization and the use of the hybrid computing model. The experiment was conducted on a small experimental grid of 4 nodes, and the grid enabled application performed 3.4 times better than a non grid enabled application carrying out the same analysis. In addition, each node, when tested with the CPU-GPU hybrid model, showed a further improvement of about 60% per node. This approach is a promising model for almost all compute intensive tasks in bioinformatics, such as molecular modeling, docking and molecular dynamics studies of large numbers of protein structures.


Acknowledgements

We acknowledge Prof. Subrata Chakraborty, Director i/c, Centre for Bioinformatics Studies, for providing the infrastructure to construct the grid and perform the experiments, and we thank Dr. K. Narain, Dy. Director, RMRC-ICMR, Dibrugarh, and Dr. Ashwani Sharma, Indian Science and Technology Foundation, for their constant support and help.

References

1. Foster I, Yong Z, Raicu I, Shiyong L. Cloud and Grid Computing 360-degree compared. In Proc. Grid Computing Environments Workshop, 2008. GCE '08, 12-16 Nov. 2008, p. 1-10.
2. Payne JL, Sinnott Armstrong NA, Moore JH. Exploiting graphics processing units for computational biology and bioinformatics. Interdiscip Sci 2010; 2(3):213-20.
3. Hunter A, Australia WA, Schibeci D, Hiew HL, Bellgard M. GRID-enabled bioinformatics applications for comparative genomic analysis at the CBBC. In Proc. IEEE International Conference on Cluster Computing, 20-23 Sept. 2004.
4. Thrasher A, Musgrave Z, Kachmarck B, Thain D, Emrich S. Scaling up genome annotation using MAKER and work queue. Int J Bioinform Res Appl 2014; 10(4-5):447-60.
5. Carapito C, Burel A, Guterl P, Walter A, Varrier F, Bertile F, Van DA. MSDA, a proteomics software suite for in-depth Mass Spectrometry Data Analysis using grid computing. Proteomics 2014; 14(9):1014-9.
6. Wu CC, Lai LF, Gromiha MM, Huang LT. High throughput computing to improve efficiency of predicting protein stability change upon mutation. Int J Data Min Bioinform 2014; 10(2):206-24.
7. Menday R, Wieder P. GRIP: The Evolution of UNICORE towards a Service-Oriented Grid. In Proc. Cracow Grid Workshop 2003, Cracow, Poland, p. 142-150.
8. Venugopal S, Buyya R, Winton L. A Grid Service Broker for Scheduling Distributed Data-Oriented Applications on Global Grids. In Proc. 2nd Workshop on Middleware for Grid Computing, Toronto, Ontario, Canada, 2004; p. 75-80.
9. Hey T, Trefethen AE. The UK e-Science Core Programme and the Grid. Future Generation Computer Systems 2002; 18(8):1017-1031.
10. Johnston W, Gannon D, Nitzberg B. Grids as production computing environments: The engineering aspects of NASA's Information Power Grid. In Proc. Eighth IEEE International Symposium on High Performance Distributed Computing, Redondo Beach, CA, USA, 1999; p. 197-204.
11. Natrajan A, Crowley M, Wilkins Diehr N, Humphrey M, Fox AD, Grimshaw A, Brooks CL. Studying Protein Folding on the Grid: Experiences using CHARMM on NPACI Resources under Legion. In Proc. 10th International Symposium on High Performance Distributed Computing, August 7-9, 2001; p. 14-21.
12. Buyya R, Date S, Mizuno Matsumoto Y, Venugopal S, Abramson D. Composition of Distributed Brain Activity Analysis and its On-Demand Deployment on Global Grids. In Proc. 10th International Conference on High Performance Computing 2003, Hyderabad.
13. Huang Q, Huang Z, Werstein P, Purvis M. GPU as a General Purpose Computing Resource. In Proc. Ninth International Conference on Parallel and Distributed Computing, Applications and Technologies, 2008; p. 151-158.
14. Guerrero GD, Imbernon B, Perez Sanchez H, Sanz F, Garcia JM, Cecilia JM. A performance/cost evaluation for a GPU-based drug discovery application on volunteer computing. Biomed Res Int 2014; 2014:474219.
15. Goga N, Marrink S, Cioromela R, Moldoveanu F. GPU-SD and DPD parallelization for Gromacs tools for molecular dynamics simulations. In Proc. IEEE 12th International Conference on Bioinformatics & Bioengineering (BIBE) 2012; p. 251-254.
16. Demchuk O, Karpov P, Blume Y. Docking Small Ligands to Molecule of the Plant FtsZ Protein: Application of the CUDA Technology for Faster Computations. Cytology and Genetics 2012; 46(3):172-179.
17. Shimoda T, Suzuki S, Ohue M, Ishida T, Akiyama Y. Protein-protein docking on hardware accelerators: comparison of GPU and MIC architectures. BMC Syst Biol 2015; 9(Suppl 1):S6.
18. Wu J, Hong B, Takeda T, Guo JT. High performance transcription factor-DNA docking with GPU computing. Proteome Sci 2012; 10(Suppl 1):S17.
19. Manavski Svetlin A, Giorgio V. CUDA compatible GPU cards as efficient hardware accelerators for Smith-Waterman sequence alignment. BMC Bioinformatics 2008; 9:S10.
20. Zhao K, Chu X. G-BLASTN: accelerating nucleotide alignment by graphics processors. Bioinformatics 2014; 30(10):1384-91.
21. Pang B, Zhao N, Becchi M, Korkin D, Shyu CR. Accelerating large-scale protein structure alignments with graphics processing units. BMC Res Notes 2012; 22(5):116.
22. Yang G, Jiang W, Yang Q, Yu W. PBOOST: a GPU-based tool for parallel permutation tests in genome-wide association studies. Bioinformatics 2015; 31(9):1460-2.
23. Putz B, Kam Thong T, Karbalai N, Altmann A, Muller Myhsok B. Cost-effective GPU-grid for genome-wide epistasis calculations. Methods Inf Med 2013; 52(1):91-5.
24. Papadopoulos A, Kirmitzoglou I, Promponas VJ, Theocharides T. GPU technology as a platform for accelerating local complexity analysis of protein sequences. In Proc. IEEE Eng Med Biol Soc 2013; 2013:2684-7.
25. Wang Y, Du H, Xia M, Ren L, Xu M, Xie T. A Hybrid CPU-GPU Accelerated Framework for Fast Mapping of High-Resolution Human Brain Connectome. PLoS ONE 2008; 8(5):e62789.
26. Nikolay P, Karpov P, Yaroslav B. Hybrid CPU-GPU calculations - a promising future for computational biology. In Proc. Third International Conference on High Performance Computing HPC-UA 2013, Kyiv, Ukraine; p. 330-335.
27. Liao T, Zhang Y, Kekenes Huskey PM, Cheng Y, Michailova A, McCulloch AD, Holst M, McCammon JA. Multi-core CPU or GPU-accelerated Multiscale Modeling for Biomolecular Complexes. Mol Based Math Biol 2013; p. 1-27.
28. Singh A, Chen C, Liu W, Mitchell W, Schmidt B. A hybrid computational grid architecture for comparative genomics. IEEE Trans Inf Technol Biomed 2008; 12(2):218-25.
29. Hernandez Esparza R, Chica Mejia SM, Zapata Escobar AD, Guevara Garcia A, Martinez Melchor A, Hernandez Perez JM, Vargas R, Garza J. Grid-based algorithm to search critical points, in the electron density, accelerated by graphics processing units. J Comput Chem 2014; 35(31):2272-8.
30. Agulleiro JI, Vazquez F, Garzon EM, Fernandez JJ. Hybrid computing: CPU+GPU co-processing and its application to tomographic reconstruction. Ultramicroscopy 2012; 115:109-14.
31. Tyng Yeu L, Hung Fu L, Jun Yao C. In Proc. IEEE 26th International Parallel and Distributed Processing Symposium Workshops & PhD Forum (IPDPSW), 21-25 May 2012, Shanghai; p. 2369-2377.
32. Ghorpade J, Parande J, Kulkarni M, Bawaskar A. GPGPU Processing in CUDA Architecture. Advanced Computing: An International Journal 2012; 3(1):105-120.
33. Lee H, Munshi A. The OpenCL Specification, Version 2.0, Document Revision 26. Khronos Group. Available from https://www.khronos.org/registry/cl/specs/opencl-2.0.pdf. Updated 17 October 2014; retrieved 16 April 2015.
34. Frost G. Aparapi - an open source API for expressing data parallel workloads in Java. Available from http://developer.amd.com/tools-and-sdks/opencl-zone/aparapi/
