Parallel Computations in the Volunteer based Comcute System

Paweł Czarnul, Jarosław Kuchta and Mariusz Matuszek
Faculty of Electronics, Telecommunications and Informatics, Gdansk University of Technology, Poland
{pczarnul,qhta,mrm}@eti.pg.gda.pl

Abstract. The paper presents Comcute, a novel multi-level implementation of the volunteer-based computing paradigm. Comcute was designed to let users donate the computing power of their PCs in a simplified manner, requiring only that they point their web browser at a specific web address and click a mouse. On the server side, several servers are appointed to be in charge of the execution of each task. Thanks to this, the system can survive failures of individual computers and allows redundancy of a desired order to be defined. On the client side, computations are executed within web browsers using technologies such as Java, JavaScript or Adobe Flash, without the need to install additional software. The paper presents results of scalability experiments carried out on the Comcute system.

Keywords: volunteer computing, parallel computations, scalability, reliability

1 Introduction

Many areas of modern science rely heavily on the availability of supercomputing power. In fact, the computing demand from Materials Science, Biology, Astronomy and Medicine alone [1] outpaces the supply, and keeping up with this demand is an ongoing effort. Part of this effort is the design of systems which combine the power of many distributed personal computers and make it available to science. Several such systems exist, e.g. BOINC (http://boinc.berkeley.edu/) [2], but their common property is the need for a dedicated computing module to be installed and configured by the user who wishes to make their computer available. This task is somewhat technical in nature and often intimidates potential volunteers.

Volunteer computing plays an important role in supplying the computational power demanded by science. Harnessing the power of personal computers connected to the Internet requires dedicated systems which distribute computations and collect results. Usually such systems require the user of a personal computer to install and run dedicated client software, which often presents a difficulty to less technically savvy users. Recognizing this difficulty led to the design and implementation of the Comcute system, which allows volunteers to make their computing resources available by just pointing their web browser at a web address and clicking a mouse, without installing additional software. Development of the Comcute project (http://comcute.eti.pg.gda.pl) [3] took place in the years 2010-2012 and was supported under Grant OR00010811 by the Polish Ministry of Science and Higher Education.

2 Related Work

There is a variety of paradigms and tools for parallel computations proposed and implemented at various levels:

– shared memory systems: GPGPU with NVIDIA CUDA and OpenCL [4,5]; OpenMP, Pthreads and Java Threads for multithreaded programming on SMP systems [6],
– distributed memory systems:
  • dedicated HPC systems: MPI [7], PVM [8],
  • collections of HPC systems: MPICH-G2 [9], PACX-MPI [10],
  • distributed systems, including HPC in various Virtual Organizations: grid systems implemented on top of grid middlewares such as Globus Toolkit, Unicore and Gridbus, with scheduling and management of resources [11,12,13],
  • frameworks such as Hadoop (http://hadoop.apache.org/),
  • workflow systems such as the one in BeesyCluster [14],
  • volunteer-based systems such as BOINC [2], in which distributed volunteers donate the computing power of their own computers to shared projects.

Paper [15] demonstrates how the WeevilScout prototype framework can engage thousands of Internet browsers with JavaScript support for computations in the master-slave fashion for a bio-informatics task. [16] presents a framework using the master-slave model for computations built on top of Google App Engine that allows free-of-charge execution using the TaskQueue scheme. The master and slaves are implemented behind a Web interface and then use the TaskQueue for execution. Compared to [16], Comcute was designed not to require access to Google or any other infrastructure and to rely on the computing power of Internet users' computers instead. Compared to volunteer computing systems such as BOINC [2], Comcute was created to offer several new unique features:

– the ability to run the client within a web browser supporting many technologies such as Java, JavaScript or Adobe Flash, not just JavaScript as addressed by WeevilScout [15],
– advanced management of computations on the server side supporting:
  • redundancy of a desired order, i.e. requesting redundant computations of data chunks by volunteers,
  • the ability to partition input data and integrate results on the fly as data chunks arrive,
  • distributed management of computations on the server side that is able to survive failures of individual servers (addressed by neither WeevilScout [15] nor BOINC [2]).


3 Proposed Solution

The Comcute system uses the same volunteer-based paradigm of calculation as BOINC: a computational task (code and partitioned data) is distributed to a great number of volunteers (Internet users). On the other hand, Comcute differs from other volunteer-based computing systems in its level of calculation flexibility, its level of reliability and its ease of use.

First, volunteers may be recruited from among the users of common public services, such as e-government or e-administration services, video-sharing services (e.g. YouTube) or social networking services (e.g. Facebook). Calculation tasks (code and data) are loaded onto their computers in a simple one-click fashion. Calculations are performed on their computers in a web browser context, which should be safe for the client. The code is matched to the capabilities supported by the volunteer's browser (e.g. JavaScript, Java, Flash). A single user may process many data packs for various tasks in a single session. In this way, the Comcute system can process various computing tasks ordered by customers at the same time.

3.1 Architecture

A quad-layered architecture of the Comcute system ensures the efficiency and reliability of calculations. Usually a multi-layered architecture is presented as vertically divided regions, but the quad-layered architecture of the Comcute system is presented in the form of concentric regions (Figure 1), with the Z-layer representing the user interface in the center, the internal W-layer with system core nodes, the external S-layer with distributing servers, and a surrounding "layer" containing grouped sets of Internet users' computers. There is no central S or W node in the system. The Z-layer gives access to a number of W-nodes organized in a load-balanced grid. The first W-node contacted by a customer forms a set of other W-nodes (the W'-set) in the number needed to complete a task, as requested by the customer.
The task (code and data) is distributed to all the W'-nodes, where task data is divided into a set of data packs. The W'-nodes divide task data using the same algorithm and the same parameters. The task code, along with data packs, is sent as independent packets from W'-nodes through S-servers to the computers of Internet users (Is).

Fig. 1: Multi-layered Comcute architecture

The S-servers are placed in the public domain of the World Wide Web. They may be set up as public administration (government) servers, video (movie) servers, social network servers and so on. Besides their normal activity, they offer participation in the Comcute project to their users. By joining the Comcute project, users agree to download and execute computing tasks. Each data packet is processed for a short time, but a great number of processors gives the effect of a huge computational scale. The S-servers also separate the Internet users from the W-nodes, which are located in a protected network. The locations of W and S servers depend on the system administrator, so that W-S links may offer large bandwidths.

The code is executed on the computers of Internet users in the context of a web browser. The calculation results are sent back through the S-servers to the W'-nodes, where they are assembled into an aggregated result. This aggregation is carried out in cooperation between the W'-nodes.

The functionality of the W-layer grid is designed to withstand attacks on the system and the computations. Each task is processed by a set of nodes which exchange and compare the results. An optional verification of the results can be provided in this way. If the S-servers are not responding for a long time or communication between the W-nodes is broken, the W-node grid may reconfigure itself. In this way the system can perform its task as long as a single W-node is in an operational state.

3.2 Distributed Volunteer Task Execution

The W-layer is composed of independent but collaborating W-nodes driven by the same algorithm. At the beginning of the computational cycle, the first W-node responding to the client request takes the task orders from the clients and authorizes them. Based on the task parameters, it estimates the number of W-nodes needed for the calculations and invites them to form the W'-set. It then distributes the accepted tasks to the other nodes of the W'-set. Then each node of the W'-set divides task data into packs according to the task parameters, using a partition algorithm specified by the client.
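Because every W'-node applies the same partition algorithm with the same parameters, all nodes derive an identical list of data packs without coordinating with each other. A minimal sketch of such a deterministic partitioning (our illustration; the real partition algorithm is supplied by the customer):

```javascript
// Deterministic partitioning: a pure function of (totalBytes, packSize),
// so every W'-node running it independently obtains the same pack list.
function partition(totalBytes, packSize) {
  const packs = [];
  for (let offset = 0, id = 0; offset < totalBytes; offset += packSize, id++) {
    packs.push({ id, offset, length: Math.min(packSize, totalBytes - offset) });
  }
  return packs;
}
```

For example, partitioning 10500 bytes into 5000-byte packs yields three packs, the last one holding the 500-byte remainder.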
If the customer requested a high level of reliability, the W'-set of nodes is further divided into smaller groups G (e.g. three nodes in each group) for tight collaboration. If not, the nodes work in loose collaboration only. Subsequently, each W'-node offers the code of the calculation tasks and packs of data as independent packets, on demand, to S-servers. During calculations, W'-nodes offer data packs in a random order from the whole data set. Collaboration among the W'-nodes means that they share partial results of calculations, so each node has a full set of results, obtained not only from its cooperating S-servers but from the other W'-nodes as well. Tight collaboration means that the W'-nodes within a G group offer the data packs in the same order, thus forming a separate subset of redundantly calculated data. When gathering the results in the tight collaboration mode, the W'-nodes within each group exchange and compare results among each other to avoid calculation errors (incidental or intentional).

The users of public Internet services located at the S-servers may volunteer for the Comcute project and agree to load task code and packs of data onto their computers. The S-servers may buffer task packets taken from the W'-node and distribute the task data along with ordinary service data. The task code is contained within the original service web pages. As the service web page is loaded onto the Internet user's computer, the user's web browser executes the task code loader (among the other web code), which reports the browser capabilities to the S-server. The server chooses which form of the task code to load (e.g. JavaScript, Java, Silverlight). When the loader receives the task and data, it launches the code execution, and after it stops the loader sends the result back to the S-server.

The W'-node gathers the results of each data pack's calculations reported by S-servers from the Internet computers. It then exchanges the results with the other W'-nodes. Thus each W'-node independently merges the partial results into a final result in accordance with the algorithm specified by the customer. If a high level of reliability was requested, each node first compares partial results obtained from other nodes of the same G group in accordance with arbitration logic specified by the customer. If partial results differ, the nodes may repeat the calculation cycle. If there is a sufficient number of consistent partial results, the W'-nodes aggregate them into a final result.

If some nodes within a G group do not report partial results to the other members of the group, the operating W'-nodes try to invite and join new nodes from the whole pool of W-nodes (beyond the W'-set). If this is not possible, they continue to operate at a lower level of reliability. Once a G group completes the calculations of its subset of data, its nodes return to the pool, ready to join other groups to help them complete their calculations. When there are no requests for help from other groups, they try to form a new group, taking control over another subset of the remaining data. This way the calculations will complete as long as one W'-node is able to operate. Each operating W'-node completes the whole set of partial results and forms the final result. In the end it stores the result in its own repository and makes it available to the other nodes and subsequently to the customer.

3.3 Performance Factors

In the process described above, the number of W'-nodes (N_{W'}) is the first factor of concurrency.
It depends on the total number of W-nodes (N_W) and the mean W-node load factor L_W (0 ≤ L_W ≤ 1). This number is divided by the cardinality of the G group (|G|), which is 1 if there is no need to form the groups, and 2 or more if a higher level of reliability was requested. Concurrency may be degraded by a factor 0 ≤ κ ≤ 1 dependent, among other things, on the computation overlap between W'-nodes or G groups. The mean number of S-servers cooperating with each W'-node is the second main factor (N_S). The third factor is the mean number of people using each public S-service at the same time (N_I). Here we consider the level of readiness (R_L), i.e. how willing the users are to participate in the Comcute system and share the calculation power of their computers. This level may be raised by a set of marketing means (e.g. free movies). Finally, the probability of a calculation completing on each Internet user's computer (P_C) depends on the mean time of a single data pack calculation (T_P) and the mean time a user remains connected to the service (T_S). The practical concurrency factor (CF), defined as how much Comcute can speed up computations taking into account redundancy and the willingness of Internet clients, can be estimated as:

\[ CF = \frac{N_{W'}}{|G|} (1 - \kappa) \, N_S \, N_I \, R_L \, P_C \tag{1} \]

\[ N_{W'} = N_W (1 - L_W) \tag{2} \]

\[ |G| = \begin{cases} 1 & \text{if there is no need for higher reliability} \\ \geq 2 & \text{if a higher level of reliability was requested} \end{cases} \]

\[ P_C = \begin{cases} T_S / T_P & \text{if } T_S < T_P \\ 1 & \text{if } T_S \geq T_P \end{cases} \tag{3} \]
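The estimate can be combined into a short calculation. The following is a direct transcription of Eqs. (1)-(3) with variable names mirroring the symbols in the text (our code, not part of Comcute):

```javascript
// Practical concurrency factor CF per Eqs. (1)-(3).
// nW: total W-nodes, lW: mean W-node load, kappa: overlap degradation,
// g: |G| group cardinality, nS: mean S-servers per W'-node,
// nI: mean users per S-service, rL: readiness level,
// tS: mean connection time, tP: mean per-pack calculation time.
function concurrencyFactor({ nW, lW, kappa, g, nS, nI, rL, tS, tP }) {
  const nWPrime = nW * (1 - lW);           // Eq. (2)
  const pC = tS < tP ? tS / tP : 1;        // Eq. (3)
  return (nWPrime / g) * (1 - kappa) * nS * nI * rL * pC;  // Eq. (1)
}
```

For instance, with 4 unloaded W-nodes, |G| = 2, 4 S-servers per W'-node, 64 users per service, full readiness, no overlap degradation and T_S ≥ T_P, the estimate is (4/2) · 4 · 64 = 512.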

3.4 A Versatile Client Template

In order to test the Comcute system, a versatile Internet client template, nicknamed iRobot, was implemented. The internal structure of iRobot is illustrated in Figure 2: a remote control module, a web service access module, a web user profile, an iTask emulator and executor, an iTask controller, and an OS services access layer. Many instances of iRobots can be deployed simultaneously on hosting computers and controlled remotely. The remote control capability allows the operator to:

– switch each iRobot from a standby state to an active state,
– switch back from an active state to a standby mode,
– command every iRobot to complete its running task and exit.

Fig. 2: iRobot web user emulator structure

In its active state, each iRobot loops through a series of transitions: query an S node for a task → execute the returned task → sleep. Each query of an S node is directed at the generic DNS address, which in turn gets resolved to a specific S node by a round-robin load-balancing algorithm located in a DNS server. If tasks are available for execution, the S node queried will respond with a task implemented in a technology supported by the iRobot instance, as determined by the availability of an iTask executor module. For this mechanism to work, every task query from an iRobot contains a JSON-encoded list of technologies supported by the iRobot. This guarantees that only tasks which can be executed by an iRobot will be sent to it.

In addition to the remote control capability, parameters for the tasks executed by iRobots (iTasks) can also be supplied remotely from a central control location. Once supplied, these parameter sets are matched against the names of the tasks being run, thus allowing for very flexible adaptation of the testing environment to different test patterns. Once an iTask is started, its execution is supervised by the iTask controller.
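The JSON-encoded capability list and the server-side matching it enables might look as follows. This is a sketch under our own assumptions: the paper only states that the list is JSON-encoded, so the field name and matching logic here are illustrative, not Comcute's actual wire format:

```javascript
// Build an iRobot task query advertising the technologies it can execute.
// (The "technologies" field name is our assumption.)
function buildTaskQuery(supported) {
  return JSON.stringify({ technologies: supported });
}

// S-node side: pick the first available task form the client can run,
// or null if no implementation matches the client's capabilities.
function pickTaskForm(queryJson, availableForms) {
  const { technologies } = JSON.parse(queryJson);
  return availableForms.find(form => technologies.includes(form)) || null;
}
```

This matching step is what guarantees that a client never receives a task it cannot execute.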
The controller is governed by a set of timing parameters, which determine the maximum execution time of a task, T_e, and a delay time, T_d, after execution is finished, before the iRobot will query an S node for another task. This allows an iRobot to mimic the behavior of an average web user browsing the Web [17]. Both times are calculated using a general normal (Gaussian) distribution:

\[ T_{\{e,d\}} = |f(x)| \quad \text{where} \quad f(x) = \frac{1}{\sigma}\,\varphi\!\left(\frac{x - \mu}{\sigma}\right) \tag{4} \]

Both the µ (mean) and σ (standard deviation) parameters can be controlled by the test operator.
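One common way to realize this parameterization is to draw a normally distributed value with mean µ and standard deviation σ and take its absolute value so the time is non-negative. The sketch below uses the Box-Muller transform; both the interpretation and the choice of sampling method are our assumptions, as the paper does not specify how the distribution is sampled:

```javascript
// Draw a Te or Td value: |N(mu, sigma^2)| via the Box-Muller transform.
function sampleTime(mu, sigma) {
  const u1 = Math.random() || Number.MIN_VALUE; // avoid log(0)
  const u2 = Math.random();
  const z = Math.sqrt(-2 * Math.log(u1)) * Math.cos(2 * Math.PI * u2);
  return Math.abs(mu + sigma * z); // absolute value keeps the time non-negative
}
```

With σ = 0 the sample collapses to µ, which is convenient for deterministic test runs.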

4 Experiments

We designed the experiments to test the scalability of Comcute and to obtain timelines of particular Internet clients, in order to observe the characteristics of processing and of the interaction with S servers.

4.1 Testbed Application and Configurations

For the following experiments, we used the client template described in Section 3.4 with the following parameters:

– the probability of a volunteer returning correct results equal to 1; this allows comparison of the execution times of various configurations in the scalability tests,
– the processing time of a data chunk by a volunteer equal to 10 seconds; this corresponds to the values discussed in [17]. In summary, [17] states that clients often leave pages after 10-20 seconds and presents the probabilities of them doing so. From this perspective, 10 seconds seems adequate for our tests. It is also clear that the first 10 seconds are critical for the client's decision whether to stay on the page or leave,
– the size of the data chunk sent from Comcute to the client equal to 5000 bytes; this corresponds to input data such as a text fragment to search, the definition of a subspace to be searched by the client, the coefficients of a set of equations to solve, etc.,
– the size of the results equal to 1000 bytes. In many of the aforementioned applications, results are smaller than the input data packets; for the applications above, the results would be, e.g., the location in a text fragment or search subspace where matches were found, the solutions to a set of equations, etc.

We assumed 10000 data chunks, which gives sufficient granularity to balance the load among the numbers of Internet clients tested (up to 256). In fact, the system used |G| = 2, which means that Comcute created a copy of each data packet, for a total number of packets equal to 20000. The testbed code used on the client side contacts the S server access URL and is automatically redirected to a particular S server by the DNS system.
The DNS system was modified so that the client can contact any of the available S servers using the round-robin scheme. The client downloads the Java client code as a jar file. It is executed on the client side and fetches data packets from an S server as long as data is available. Upon termination of processing, the client contacts the S server access URL again and repeats the procedure.

4.2 Testbed Environment

We used the following environment for the tests:

– two W servers, each with 48 GB RAM, 2 x Intel Xeon E5640 2.66 GHz CPUs (4 cores, 8 threads each), CentOS 6.2,
– four S servers, each with 24 GB RAM, 2 x Intel Xeon E5640 2.66 GHz CPUs (4 cores, 8 threads each), CentOS 6.2,
– Internet clients running on a cluster of 8 nodes, each with 4 GB RAM, 2 x Intel Xeon 2.80 GHz CPUs (2 cores, 4 threads each).

We used an Ethernet network connection between the components. For these data packet sizes, the startup time played a crucial role in the communication time, which is in any case much shorter than the processing time here.
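The fetch-process-return cycle of the testbed client described in Section 4.1 can be sketched as follows. Endpoint and function names are illustrative (the real client is a Java jar); the transport is injected so the control flow can be exercised anywhere:

```javascript
// Volunteer client loop: fetch a data pack, process it, return the result,
// until the S server reports no more data. On exhaustion the real client
// re-contacts the S server access URL and repeats the whole procedure.
async function volunteerLoop(server, processChunk) {
  for (;;) {
    const chunk = await server.fetchDataPack();  // null when no data is left
    if (chunk === null) break;
    const result = await processChunk(chunk);    // ~10 s of computation in the tests
    await server.sendResult(chunk.id, result);
  }
}
```

In the experiments this loop runs inside each of up to 256 simulated clients in parallel, each pulling chunks from its assigned S server.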

4.3 Simulation Results

Firstly, we aimed at an assessment of the system's scalability, i.e. the ability of the system to decrease the execution time of a task of a given size with an increasing number of volunteers. Figure 3 presents the execution times we obtained for the aforementioned system parameters. Figure 4 shows the obtained speed-up compared to theoretically ideal values. The latter are computed assuming all the data packets are processed sequentially on a single machine without communication. Consequently, these ideal values should be regarded as a theoretical upper bound that cannot be attained in a distributed system. The system scales well for the tested numbers of volunteers, up to 256. It should be noted that the ideal theoretical speed-up refers to the total number of data packets used, i.e. 20000 in this case. On the other hand, CF = 128 for 256 clients because |G| = 2. The following conclusions could be drawn from this experiment:

1. The system scales well when volunteers are not limited by resources.
2. The practical limit on the number of volunteers tested per cluster node is around 32. With more than 32 volunteers per cluster node, we started to observe a shortage of system resources for running the volunteers, mainly memory limitations.
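The CF = 128 figure quoted above follows directly from the redundancy term of Eq. (1): with |G| = 2 every data chunk is computed twice, so n clients can contribute at most n / |G| units of effective concurrency. A trivial check of this bound (our illustration):

```javascript
// Upper bound on the concurrency factor imposed by redundancy alone:
// with group cardinality g, each chunk is computed g times.
function redundancyBoundedCF(numClients, g) {
  return numClients / g;
}
```

Hence 256 clients with |G| = 2 yield an effective concurrency factor of at most 128, which is why the achieved speed-up in Figure 4 sits below the ideal curve even before communication costs are considered.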

Fig. 3: Execution times of the testbed task (x axis: number of clients, 16-256; y axis: task execution time [s], 0-14000; series: theoretical best for 10000 data chunks, results for 10000 data chunks)

Fig. 4: Speed-up of the testbed task vs the number of Internet clients (x axis: number of clients, 4-256; y axis: speed-up, 0-300; series: achieved speed-up, theoretical speed-up)

We also obtained individual timelines of particular Internet clients in order to observe:

1. the delay in taking up the task compared to other clients,
2. potential idle times during the task execution; these would indicate a temporary lack of data chunks on S servers, whereas a constant supply of data chunks would indicate correct prefetching of data chunks by S servers from W servers.

Figure 5 presents a timeline for 64 Internet clients. The clients compute the task in parallel from about time step 100 seconds up to around 3500 seconds. It can be seen that the delays in starting to process the data are usually within 10-30 seconds. Before the processing of data packets starts, i.e. before time step 80 seconds, clients query S servers for computational codes. Around time steps 2300 and 3400 seconds, some S servers ran out of data packets (fetched from W servers), which caused delays in processing on the client side. Prefetching data from W servers is good only to a certain degree, because if an S server fails, the W server will need to wait before sending the lost packets to other S servers.

Fig. 5: Timeline for 64 Internet clients (x axis: time [s], 0-3600; y axis: number of active workers, 0-64)

5 Conclusions and Future Work

The main contribution of the Comcute system shown in this paper is its capability to balance the efficiency of concurrent calculations against the reliability of volunteer computing. This is important, as every system open to the Web is exposed to attacks. Comcute is resistant both to attacks on the system itself (e.g. DDoS attacks) and to falsification of results. Another contribution, and the novelty of the system for Internet users, is its very simple usage (no need for installation) and its ability to use many technologies within web browsers, so this idea may be adopted by almost any public web service. The experiments have shown that Comcute scales well. The slight difference between the real and ideal concurrency factors results from Amdahl's law. In the future, we want to extend the work to a higher number of client computers. Additionally, experiments with tasks with various data packet priorities, as well as integration with workflow management in BeesyCluster [18], will be performed.

Acknowledgments

The work was performed within the grant "Modeling efficiency, reliability and power consumption of multilevel parallel HPC systems using CPUs and GPUs" sponsored by and covered by funds from the National Science Center in Poland, based on decision no. DEC-2012/07/B/ST6/01516. We would like to thank W. Korłub for his help with the environment configuration.

References

1. Czarnul, P., Grzeda, K.: Parallel Simulations of Electrophysiological Phenomena in Myocardium on Large 32 and 64-bit Linux Clusters. In: 11th European PVM/MPI Users Group Meeting, Budapest, Hungary, September 19-22, 2004, Proceedings. Volume 3241 of LNCS (2004)
2. Anderson, D.P.: BOINC: A system for public-resource computing and storage. In: Proceedings of the 5th IEEE/ACM International Workshop on Grid Computing, Pittsburgh, USA (2004)
3. Balicki, J., Krawczyk, H., Nawarecki, E., eds.: Grid and Volunteer Computing. Gdansk University of Technology, Faculty of Electronics, Telecommunications and Informatics Press, Gdansk (2012). ISBN: 978-83-60779-17-0
4. Kirk, D.B., Hwu, W.W.: Programming Massively Parallel Processors: A Hands-on Approach, Second Edition. Morgan Kaufmann (2012). ISBN-13: 978-0124159921
5. Sanders, J., Kandrot, E.: CUDA by Example: An Introduction to General-Purpose GPU Programming. Addison-Wesley Professional (2010). ISBN-13: 978-0131387683
6. Buyya, R., ed.: High Performance Cluster Computing: Programming and Applications. Prentice Hall (1999)
7. Wilkinson, B., Allen, M.: Parallel Programming: Techniques and Applications Using Networked Workstations and Parallel Computers. Prentice Hall (1999)
8. Geist, A., Beguelin, A., Dongarra, J., Jiang, W., Manchek, R., Sunderam, V.: PVM: Parallel Virtual Machine. A Users' Guide and Tutorial for Networked Parallel Computing. MIT Press, Cambridge (1994)
9. Karonis, N.T., Toonen, B., Foster, I.: MPICH-G2: A grid-enabled implementation of the Message Passing Interface. Journal of Parallel and Distributed Computing 63 (2003) 551-563. Special Issue on Computational Grids
10. Keller, R., Müller, M.: The Grid-Computing library PACX-MPI: Extending MPI for Computational Grids. www.hlrs.de/organization/amt/projects/pacx-mpi/
11. Garg, S.K., Buyya, R., Siegel, H.J.: Time and cost trade-off management for scheduling parallel applications on utility grids. Future Generation Computer Systems 26 (2010) 1344-1355
12. Chin, S.H., Suh, T., Yu, H.C.: Adaptive service scheduling for workflow applications in service-oriented grid. J. Supercomput. 52 (2010) 253-283
13. Yu, J., Buyya, R., Ramamohanarao, K.: Workflow Scheduling Algorithms for Grid Computing. In: Metaheuristics for Scheduling in Distributed Computing Environments. Springer, Berlin (2008). ISBN: 978-3-540-69260-7
14. Czarnul, P.: Integration of compute-intensive tasks into scientific workflows in BeesyCluster. In Alexandrov, V., van Albada, G., Sloot, P., Dongarra, J., eds.: Computational Science - ICCS 2006. Volume 3993 of LNCS. Springer (2006) 944-947
15. Cushing, R., Putra, G., Koulouzis, S., Belloum, A., Bubak, M., de Laat, C.: Distributed computing on an ensemble of browsers. IEEE Internet Computing 17 (2013) 54-61
16. Malawski, M., Kuzniar, M., Wojcik, P., Bubak, M.: How to use Google App Engine for free computing. IEEE Internet Computing 17 (2013) 50-59
17. Nielsen, J.: How long do users stay on web pages? Nielsen Norman Group (2011). http://www.nngroup.com/articles/how-long-do-users-stay-on-web-pages/
18. Czarnul, P.: Modeling, run-time optimization and execution of distributed workflow applications in the JEE-based BeesyCluster environment. The Journal of Supercomputing 63 (2013) 46-71
