Towards making BOINC and EGEE interoperable

P. Kacsuk, Z. Farkas
MTA SZTAKI
Hungary-1518 Budapest, P.O. Box 63
{kacsuk,zfarkas}@sztaki.hu

G. Fedak
INRIA, ENS-Lyon
46 ave d'Italie, Lyon, F-69364
[email protected]

Abstract

Today two grid concepts dominate the field: service grids and desktop grids. Service grids offer a managed infrastructure to grid users and therefore require considerable administration effort to keep the service running. Desktop grids, on the other hand, aim to utilize the free CPU cycles of cheap desktop PCs; they are easy to set up, but their availability to users is limited compared to service grids. The aim of the EDGeS project is to create an integrated infrastructure that combines the advantages of the two grid concepts. A building block of this infrastructure is bridging between the different grid types. In this paper we focus on bridging from desktop grids towards service grids, i.e. making desktop grids able to utilize free service grid resources.

1. Introduction

Over the past years, e-infrastructures have played an increasing role in enabling large-scale innovative scientific research. In order to establish such e-infrastructures, various Grid systems have been created and run as a service for the scientific community. Originally, the aim of Grid systems was to harvest computational or storage resources from anyone (donors) so that anyone (users) could claim resources dynamically, according to their actual needs, in order to execute large parallel or data-intensive applications. This twofold aim has, however, not been fully achieved, and today we can observe two different trends in the development of Grid systems: Service Grids and Desktop Grids.

Researchers and developers in Service Grids (SG) create a Grid service to share computational resources and to allow users from several institutions to access these resources. A resource can become part of the Grid by installing a predefined software set, or middleware. Because of the complexity of SG systems, this task is achieved by professional system administrators who take care of the hardware/middleware/software environment and ensure high availability of the Grid. Examples of such infrastructures are EGEE, NorduGrid, and the US TeraGrid. The middleware is, however, so complex that it often requires extensive expert effort to maintain. It is therefore natural that individuals rarely offer their resources in this manner. Even though the original aim of enabling anyone to join the Grid with one's own resources has not been fulfilled, the largest Grid in the world (EGEE) contains around sixty thousand PCs. A key concept of SGs is the Virtual Organization (VO), which allows the management of trust relationships and access rights. Anyone who obtains a valid certificate from a Certificate Authority (CA) can access, if authorized, those Grid resources that trust that CA.

Desktop Grids (DG), also referred to as "volunteer computing systems" or "Public-Resource Computing", rely upon the general public to donate "spare cycles" of their computing resources. Despite their relative architectural simplicity compared with SGs, DGs have demonstrated their ability to integrate dispersed, heterogeneous computing resources with ease, successfully scavenging cycles from tens of thousands of idle desktop computers. The most well-known example is the SETI@home project [1], in which approximately four million PCs have been involved. This paradigm represents a complementary trend with respect to the original aims of Grid computing: in Desktop Grid systems, anyone can bring resources into the Grid. As resources are provided by end-users, the software has been designed with a strong focus on simplicity and self-manageability, requiring as little user maintenance and expertise as possible. On the downside, only a very limited user community (i.e., set of target applications) has actually set up DG projects, which still requires a significant effort from the DG administrators.
Also, DGs do not work as services: projects run independently, recruiting resources for themselves, which contrasts with most Service Grids, where reciprocal agreements for resource utilization exist among partners. Because of this limitation, the Grid research community considers Desktop Grids only as particular and limited solutions. Until now, these two kinds of Grid systems have been completely separated. However, with the objective to support new scientific communities who need extremely large numbers of resources, the solution could be to interconnect these two kinds of Grid systems into an integrated Service Grid Desktop Grid (SG-DG) infrastructure. To build this unified SG-DG infrastructure, new mechanisms have to be designed that allow the interoperability of DGs and SGs. The Lattice [9] project was a first step in this direction, enabling Grid applications to run on a Desktop Grid based on BOINC. The European EDGeS project [14] goes one step further: it aims at establishing an integrated SG-DG infrastructure where the DG → SG bridge enables desktop grids to submit work units as jobs into the service grids, and the SG → DG bridge ensures job movement in the opposite direction. In this paper, we describe research on how such a DG → SG bridge can be created for BOINC and EGEE.

2. Related Research in Bridging Desktop Grids and Service Grids

There exist two main approaches to bridging SGs and DGs (see Fig. 1). In this section we present the principles of these two approaches and discuss them from architectural, functional and security perspectives.

Figure 1. Bridging Service Grid and Desktop Grid: the superworker approach versus the Gliding-in approach

2.1. The superworker approach

The superworker, proposed by the Lattice [9] project and the SZTAKI Desktop Grid [4], is the first solution. It enables the use of several Grid or cluster resources to schedule DG tasks. The superworker is a bridge implemented as a daemon between the DG server and the SG resources. From the DG server's point of view, the Grid or cluster appears as one single resource with large computing capabilities. The superworker continuously fetches tasks or work units from the DG server, wraps them, and submits the tasks to the local Grid or cluster resource manager. When computations are finished on the SG computing nodes, the superworker sends the results back to the DG server. Thus, the superworker is itself a scheduler, which needs to continuously scan the queues of the computing resources and watch for available resources to launch jobs. Since the superworker is a centralized agent, this solution has several drawbacks:

1. the superworker can become a bottleneck when the number of computing resources increases,
2. the round trip of a work unit is increased, because it has to be marshalled/unmarshalled by the superworker,
3. it introduces a single point of failure in the system, which lowers fault tolerance.

On the other hand, this centralized solution provides better security properties concerning the integration with the Grid. First, the superworker does not require modification of the infrastructure; it can be run under any user identity as long as that user has the right to submit jobs to the Grid. Second, as work units are wrapped by the superworker, they run under the user's identity, which conforms with regular security usage, in contrast with the approach described in the following section.

2.2. The Gliding-in approach

The Gliding-in approach, which harnesses cluster resources spread across different Condor pools using the Global Computing system XtremWeb, was first introduced in [8]. The main principle consists in wrapping the XtremWeb worker as a regular Condor task and submitting this task to the Condor pool. Once the worker is executed on a Condor resource, it pulls jobs from the DG server, executes the XtremWeb task and returns the result to the XtremWeb server. As a consequence, the Condor resources communicate directly with the XtremWeb server. Similar mechanisms are now commonly employed in Grid Computing [13]. For example, DIRAC [15] uses a combination of push/pull mechanisms to execute jobs on several Grid clusters. The generic approach on the Grid is called a pilot job: instead of submitting jobs directly to the Grid gatekeeper, the system submits so-called pilot jobs. When executed, the pilot job fetches jobs from an external job scheduler. The gliding-in or pilot-job approach has several advantages. While simple, this mechanism efficiently balances the load between heterogeneous computing sites. It benefits from the fault tolerance provided by the DG server: if Grid nodes fail, jobs are rescheduled to the next available resources. Finally, as the performance study of the Falkon [10] system shows, it gives better performance, because series of jobs do not have to go through the gatekeeper queues, which are generally characterized by long waiting times, and communication is direct between the worker running on the computing element (CE) and the DG server, without an intermediate agent such as the superworker. From the security point of view, this approach breaks the Grid security rule about pilot jobs, which does not allow the actual job's owner to differ from the pilot job's owner. This is a well-known issue of pilot jobs, and new solutions such as gLExec [12] have been proposed to circumvent this security limitation.
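The pull-based pilot loop described above can be sketched as follows. This is an illustrative sketch only: the `fetch_task` and `upload_result` callables stand in for the real DG server protocol (XtremWeb or BOINC), and the task format is invented for the example. Unlike the superworker's central push loop, each pilot asks the DG server for work itself.

```python
import subprocess
import time

def pilot_worker(fetch_task, upload_result, max_idle_polls=3, poll_interval=1.0):
    """Illustrative pilot-job loop. Once the pilot has been scheduled on a
    Grid worker node, it repeatedly pulls tasks straight from the DG server,
    so only the pilot itself ever passes through the gatekeeper queue.
    The loop ends after a few empty polls so the pilot releases the Grid
    slot instead of holding it forever."""
    idle = 0
    while idle < max_idle_polls:
        task = fetch_task()                 # pull model: the worker asks for work
        if task is None:
            idle += 1
            time.sleep(poll_interval)       # no work available: back off and retry
            continue
        idle = 0
        # execute the task's command on the Grid worker node
        proc = subprocess.run(task["command"], capture_output=True, text=True)
        upload_result(task["id"], proc.stdout)  # results go straight to the DG server
```

In a real deployment the pilot binary itself is what gets submitted through the gatekeeper; everything after that first submission bypasses the gatekeeper queue.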

Figure 2. BOINC Jobwrapper client and processes

3. Concept of bridging from BOINC to EGEE

The BOINC → EGEE bridge developed at SZTAKI is based on the superworker approach described in Section 2.1. The main idea behind the solution is to create a modified BOINC client that

• represents itself as a very powerful desktop PC consisting of many CPU cores,
• starts a wrapper application (jobwrapper process) instead of the workunit's executable, so that the wrapper application can handle the workunit as needed; for example, in the BOINC → EGEE case it creates and handles an EGEE job based on the workunit's description.

The client starts one jobwrapper process on behalf of every workunit executable (e.g. if the modified BOINC client represents itself as a PC with 10 processors, it will start 10 jobwrapper processes). The outline of the modified client can be seen in Figure 2. The modifications cover configurability and the starting of the wrapper application: a configuration file specifies the number of workunit (or CPU) slots to report towards the BOINC server and the location of the jobwrapper executable to start instead of the workunit executables. Based on these principles, we have developed two versions of the BOINC → EGEE bridge:

1. the first version of the bridge (EGEE jobwrapper),
2. the improved version of the bridge (3G bridge and jobwrapper).

Next, we describe the two versions of the BOINC → EGEE bridge in detail.

4. First version of the BOINC → EGEE bridge

The first version of the bridge uses a modified BOINC client and a jobwrapper application that creates and handles an EGEE job based on the BOINC workunit's description file. The architecture of the bridge using the EGEE jobwrapper is shown in Figure 3.
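The slot handling of the modified client can be sketched as follows. The `start_jobwrappers` helper and its arguments are hypothetical; the actual SZTAKI implementation modifies the BOINC core client itself rather than using a separate script.

```python
import subprocess

def start_jobwrappers(num_slots, jobwrapper_path, wu_description_files):
    """Hypothetical stand-in for the modified BOINC core client's behaviour:
    spawn one jobwrapper process per reported CPU slot. Each jobwrapper
    receives the description file of the workunit assigned to its slot and
    manages the corresponding EGEE job on the workunit's behalf."""
    procs = []
    for desc in wu_description_files[:num_slots]:
        procs.append(subprocess.Popen([jobwrapper_path, desc]))
    return procs
```

With 10 slots configured, ten jobwrapper processes run concurrently, each standing in for one "CPU core" of the pretended powerful PC.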

Figure 3. EGEE Jobwrapper

The tasks of the jobwrapper are as follows:

1. reading the description file of the workunit, produced by the modified BOINC core client,
2. creating an input package out of the WU's input files,
3. creating a JDL file and a wrapper script using the information in the description file,
4. using the Globus C and EGEE WMProxy C++ APIs to submit the JDL file to EGEE,
5. using the EGEE L&B C++ API to watch the execution of the submitted job (by periodically polling the job's status),
6. downloading the job's output package using the Globus C API once the job has finished,
7. placing the produced output files where BOINC expects them.

The EGEE jobwrapper has been successfully tested for a week, processing workunits originating from SZTAKI Desktop Grid in 20 workunit slots (i.e. running 20 workunits on EGEE in parallel) on the VOCE EGEE VO. According to the test, the bridge used only a small amount of resources: handling the whole lifecycle of one workunit required at most 1 second of CPU time, and the memory footprint of the EGEE jobwrapper was very low, only a few megabytes above the shared libraries (EGEE, Globus). The performance of the bridge has also been compared to that of a traditional BOINC client running on an Intel Pentium Dual-Core E2200 CPU (at 2.2 GHz) using two workunit slots (i.e. processing two workunits in parallel). For the comparison, we checked the average credit granted to the two clients: the one running on the dual-core Intel desktop machine and the one using 20 workunit slots and running them on EGEE. Ideally, we expected the EGEE jobwrapper client to receive 10 times the average credit of the normal desktop PC, as it had 10 times the available CPUs. Initially, the average credit granted to the two clients was identical, even though the bridge had processed far more workunits than the desktop machine. We explain this by the redundant computing of BOINC: although the bridge processes many workunits, the same workunits are also sent to normal machines, where they are placed in queues, so a canonical result is only created later. After a few days, the measured performance of the bridge started to grow, and after a week's operation the average granted credit rates became the following: about 100 credits for the desktop machine and about 800 credits for the bridge, i.e. in the longer term the bridge has 8 times the performance of the desktop machine, which is lower than the expected 10 times speed-up. The lower performance can be explained by the overhead of EGEE and by the lower performance of the worker nodes in the selected VOs.

During testing we faced a very important limitation of this implementation: many EGEE jobwrapper processes can overload the EGEE WMS (gLite WMS version 3.0). When using 50 workunit slots, the WMS stopped responding after some time. After a discussion with the VO administrators, we learned that the WMS has to be restarted periodically, and the restart period had been too long. Thus, this architecture is not fully reliable in scenarios where huge numbers of workunits are submitted to EGEE. As a consequence, we have implemented the second version of the bridge.

5. Improved version of the BOINC → EGEE bridge

Based on the lessons learnt from the first prototype, we have improved the BOINC → EGEE bridge towards a version that collects workunits and submits them to the WMS as one request, using a special EGEE feature (collections), instead of handling them one by one. The solution is based on the idea of setting up a "service" that collects workunits in a queue; the queue is periodically checked for new workunits ready for submission. For every new workunit a JDL file is created, but the created JDL files are submitted together as a collection. The architecture of the bridge can be seen in Figure 4.
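As an illustration, a collection that wraps two per-workunit job descriptions could look roughly like the following gLite JDL sketch. The file names and attribute values are invented for the example; the exact attributes the bridge generates may differ.

```
[
  Type = "Collection";
  Nodes = {
    [
      Executable = "wrapper.sh";
      StdOutput = "std.out";
      StdError = "std.err";
      InputSandbox = {"wrapper.sh", "wu_1_inputs.tar.gz"};
      OutputSandbox = {"std.out", "std.err", "wu_1_outputs.tar.gz"};
    ],
    [
      Executable = "wrapper.sh";
      StdOutput = "std.out";
      StdError = "std.err";
      InputSandbox = {"wrapper.sh", "wu_2_inputs.tar.gz"};
      OutputSandbox = {"std.out", "std.err", "wu_2_outputs.tar.gz"};
    ]
  };
]
```

Submitting one such collection costs a single WMS request, however many node jobs it contains.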

Figure 4. 3G Bridge Architecture

The improved version of the bridge has the following components: the BOINC jobwrapper client, the 3G jobwrapper, and the Generic Grid-Grid (3G) Bridge. The 3G Bridge consists of a Job Handler interface, a Job Database, a Queue Manager, a GridHandler interface and a target grid plugin, which is an EGEE Plugin in the case of the BOINC → EGEE bridge. In fact, the 3G Bridge can be used to connect different grids in a very convenient way: for a new source grid, we only have to create producers that talk to the 3G Bridge through the Job Handler (MySQL) interface in order to add jobs of that grid to the bridge; for a new destination grid, a small set of GridHandler functions has to be implemented (as the target grid plugin) in order to submit jobs from the bridge to that grid. In the BOINC → EGEE bridge, the task of the BOINC jobwrapper client is to fetch work units from the BOINC server, create the work unit descriptions and start the 3G jobwrappers. The 3G jobwrappers use the Job Handler interface to place work units into the Job Database and to query their status. The Queue Manager uses the information in the Job Database to handle jobs through the GridHandler interface and the EGEE Plugin. The GridHandler interface is a generic interface above the different grid plugins, in our case the EGEE Plugin. The EGEE Plugin is responsible for managing the life cycle of jobs in EGEE and their certificates.
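The plugin mechanism can be sketched as follows. The method names (`submit_jobs`, `update_status`) and the `QueueManager` loop are our own illustrative guesses, not the actual 3G Bridge interface, which is implemented in C++ against a MySQL job database.

```python
from abc import ABC, abstractmethod

class GridHandler(ABC):
    """Illustrative GridHandler interface: a destination grid is plugged
    into the bridge by implementing this small set of functions."""

    @abstractmethod
    def submit_jobs(self, jobs):
        """Submit a batch of jobs to the destination grid."""

    @abstractmethod
    def update_status(self, jobs):
        """Refresh the status of previously submitted jobs."""

class EGEEPlugin(GridHandler):
    """Stand-in for the EGEE Plugin: here it only records submissions;
    the real plugin talks to the WMS and manages proxy certificates."""

    def __init__(self):
        self.submitted = []

    def submit_jobs(self, jobs):
        # a real implementation would build one JDL collection per batch
        self.submitted.extend(jobs)

    def update_status(self, jobs):
        for job in jobs:
            job["status"] = "RUNNING"   # the real plugin polls the L&B service

class QueueManager:
    """Drains the job queue (standing in for the Job Database) through
    whichever GridHandler plugin the bridge is configured with."""

    def __init__(self, plugin):
        self.plugin = plugin
        self.queue = []

    def add_job(self, job):
        self.queue.append(job)

    def run_once(self):
        if self.queue:
            batch, self.queue = self.queue, []
            self.plugin.submit_jobs(batch)
```

The design point this illustrates is that only the plugin knows anything about EGEE; the Queue Manager and Job Database stay grid-agnostic, which is what makes the bridge easy to extend with other destination grids.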

For one BOINC server we can have N projects, and for every project we can have M BOINC jobwrapper clients. One particular BOINC jobwrapper client sends jobs to one particular EGEE VO with one particular certificate, and to realize this association the bridge provides one particular EGEE Plugin instance. If we want to send the WUs of a single project to k different VOs, we need k BOINC jobwrapper clients and k EGEE Plugins. If we add a new BOINC project and want to send its WUs to the same k EGEE VOs with the same certificates, we need another k BOINC jobwrapper clients for the new project, but we can reuse the same EGEE Plugins as before. If the certificates are different, we need k different EGEE Plugins as well; if j certificates out of the original k can also be used for the new project, we need k − j new EGEE Plugins. If another BOINC server is connected to the k EGEE VOs, the necessary numbers of BOINC jobwrapper clients and EGEE Plugins can be calculated as before; the new set of BOINC jobwrapper clients and EGEE Plugins will be disjoint from the set serving the first BOINC server.

Example 1: BOINC server 1 wants to send one project's WUs into two EGEE VOs, VO1 and VO2. Then we need two BOINC jobwrapper clients and two EGEE Plugins.

Example 2: BOINC server 1 wants to send three projects' WUs into two EGEE VOs, VO1 and VO2, with the same two certificates (which differ per VO). Then we need six BOINC jobwrapper clients and two EGEE Plugins.

An overview figure illustrating the above can be seen in Figure 5.
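The bookkeeping above can be captured in two small helpers (a sketch; the function names are ours, not part of the bridge):

```python
def jobwrapper_clients(num_projects, k_vos):
    """One BOINC jobwrapper client is needed per (project, VO) pair."""
    return num_projects * k_vos

def new_egee_plugins(k_vos, j_reusable):
    """EGEE Plugins a new project adds when j of the k existing
    certificates can be reused: each of the k - j remaining
    (VO, certificate) associations needs its own plugin instance."""
    return k_vos - j_reusable
```

Example 1 corresponds to `jobwrapper_clients(1, 2)` clients and `new_egee_plugins(2, 0)` plugins; in Example 2 the extra projects reuse all certificates, so the plugin count stays at two while the client count triples.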

Figure 5. BOINC → EGEE bridge example setup

There are different aspects of security we should mention in relation to the 3G Bridge: user authentication in EGEE, protection against malicious usage, and traceability of jobs. Concerning user authentication in EGEE, this bridge version uses MyProxy information (hostname, port, username and password) to obtain the proxies used for submission, on a per EGEE Plugin instance basis. These data have to be specified in the configuration file before the bridge is started. So, as opposed to the EGEE jobwrapper solution, the bridge uses MyProxy information provided by the BOINC project owners to create the proxies used for submission. As a consequence, users neither have to log in periodically to create new proxies, nor have to create services that renew credentials for submission. The traceability of jobs is ensured by detailed logging: the 3G jobwrapper component produces messages that show the relation between workunits and the unique identifiers of entities in the Job Database, while the Queue Manager and the EGEE Plugin produce messages that show the relation between those unique identifiers and EGEE jobs. As a consequence, the bridge provides all the information needed to find the workunit belonging to a maliciously behaving EGEE job. Concerning protection against malicious usage, the 3G Bridge should address two problems: the bridge should run only executables originating from BOINC servers, and if a user provides proxy information to run the workunits of a given BOINC server, we must be sure that no other application uses this proxy information (i.e. there should be no way to use someone else's proxy to run some other BOINC workunits on EGEE). These problems do not occur if the bridge is properly configured:

• the Job Database should be accessible only by the bridge administrator; this can be ensured by proper MySQL configuration (e.g. using a strong password and providing access only from the dedicated machine that runs the bridge). An open Job Database is a serious security problem;
• the BOINC jobwrapper clients of the source BOINC servers should be run and configured by the bridge administrator. This way we can be sure that only applications downloaded from BOINC servers are placed into the Job Database, and that the EGEE Plugin instance (and proxy) corresponding to a given BOINC server is the one used for submission.

For example, in the previous bridge configuration sample, if some new BOINC user wants to connect a new BOINC server and run its workunits on the VOCE EGEE VO, it is not feasible to use the already defined VOCE EGEE Plugin instance; instead, a new EGEE Plugin instance should be created in the configuration file with MyProxy information provided by the new BOINC user. Moreover, a new BOINC jobwrapper client must be started with a configuration file that makes its 3G jobwrapper processes use the newly created EGEE Plugin instance.

The 3G Bridge architecture has been tested in the same scenario as the first (EGEE jobwrapper) solution. Both the performance and the computational requirements are almost the same as in the previous version: the 20-workunit-slot test scenario reaches about 80% of the ideal performance. However, reliability has been significantly improved for cases where many (e.g. 100) workunit slots are used: in such cases the collection feature of EGEE reduces the load on the WMS, because submitting e.g. 10 new jobs requires only 1 submission request instead of 10.
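The reduction in WMS traffic is simply the batching factor; a toy illustration of our own (not part of the bridge):

```python
import math

def wms_requests(num_jobs, batch_size):
    """Number of WMS submission requests needed when `batch_size` jobs are
    wrapped into one JDL collection. batch_size=1 reproduces the first,
    one-job-per-request bridge version."""
    return math.ceil(num_jobs / batch_size)
```

So submitting 10 jobs as one collection costs one request instead of ten, and the saving grows linearly with the number of workunit slots in use.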

6. Conclusions

Currently, the 3G Bridge version of the BOINC → EGEE bridge connects SZTAKI Desktop Grid to the GILDA training VO of EGEE. From October 2008, the developed BOINC → EGEE bridge will be used to connect two public BOINC systems (SZTAKI Desktop Grid, Extremadura Desktop Grid) and one local BOINC system (the University of Westminster DG system) to the EDGeS VO of EGEE. To summarize, the main innovations and achievements of the EDGeS project in integrating BOINC with EGEE are:

• We have modified the BOINC client application so that it can represent itself as a very powerful machine towards the BOINC server. It can start a wrapper application, specified in a configuration file, that can handle the workunit in basically any way, based on the workunit description file produced by the modified BOINC client.

• Using the modified BOINC client, we have developed two versions of the BOINC → EGEE bridge. The improved version introduces new components (3G jobwrapper, Job Handler interface, Queue Manager, GridHandler interface, EGEE Plugin) in the bridge that allow collecting jobs and submitting them as an EGEE collection. On the one hand, this way the EGEE WMS does not get overloaded; on the other hand, the architecture is generic and thus allows easy extension with other grid sources or destinations.

Acknowledgement

The work presented here was funded by the FP7 EDGeS project. The EDGeS (Enabling Desktop Grids for e-Science) project receives Community funding from the European Commission within the Research Infrastructures initiative of FP7 (grant agreement number 211727).

References

[1] A. Acharya and M. Raje. MAPbox: using parameterized behavior classes to confine applications. Technical Report TRCS99-25, University of California, Santa Barbara, 1999.
[2] A. Alexandrov, P. Kmiec, and K. Schauser. Consh: a confined execution environment for internet computations. In Proceedings of the USENIX Annual Technical Conference, 1999.
[3] D. Anderson. BOINC: A System for Public-Resource Computing and Storage. In Proceedings of the 5th IEEE/ACM International GRID Workshop, Pittsburgh, USA, 2004.
[4] Z. Balaton, G. Gombas, P. Kacsuk, A. Kornafeld, J. Kovacs, A. C. Marosi, G. Vida, N. Podhorszki, and T. Kiss. SZTAKI Desktop Grid: a modular and scalable way of building large computing grids. In Proceedings of the 21st International Parallel and Distributed Processing Symposium, 26-30 March 2007, Long Beach, California, USA, 2007.
[5] J. Bull, L. Smith, L. Pottage, and R. Freeman. Benchmarking Java against C and Fortran for scientific applications. In ISCOPE Conference, LNCS volume 1343, Springer, pages 97-105, 2001.
[6] I. Goldberg, D. Wagner, R. Thomas, and E. Brewer. A secure environment for untrusted helper applications: confining the wily hacker. In Proceedings of the 6th USENIX Security Symposium, 1996.
[7] L. Gong, M. Mueller, H. Prafullchandra, and R. Schemers. Going beyond the sandbox: an overview of the new security architecture in the Java Development Kit 1.2. In USENIX Symposium on Internet Technologies and Systems, 1997.
[8] O. Lodygensky, G. Fedak, F. Cappello, V. Neri, M. Livny, and D. Thain. XtremWeb & Condor: Sharing Resources Between Internet Connected Condor Pools. In Proceedings of CCGRID 2003, Third International Workshop on Global and Peer-to-Peer Computing (GP2PC'03), pages 382-389, Tokyo, Japan, 2003. IEEE/ACM.
[9] D. S. Myers, A. L. Bazinet, and M. P. Cummings. Expanding the reach of Grid computing: combining Globus- and BOINC-based systems. In Grids for Bioinformatics and Computational Biology, Wiley Book Series on Parallel and Distributed Computing, 2008.
[10] I. Raicu, Y. Zhao, C. Dumitrescu, I. Foster, and M. Wilde. Falkon: a fast and light-weight task execution framework. In IEEE/ACM SuperComputing, 2007.
[11] L. F. G. Sarmenta. Sabotage-Tolerance Mechanisms for Volunteer Computing Systems. Future Generation Computer Systems, 18(4):561-572, 2002.
[12] I. Sfiligoi, O. Koeroo, G. Venekamp, D. Yocum, D. Groep, and D. Petravick. Addressing the Pilot security problem with gLExec. Technical Report FERMILAB-PUB-07-483-CD, Fermi National Laboratory, 2007.
[13] D. Thain and M. Livny. Building reliable clients and services. In The Grid: Blueprint for a New Computing Infrastructure, Ian Foster and Carl Kesselman, editors, pages 285-318. Morgan Kaufmann, 2004.
[14] Z. Balaton, Z. Farkas, P. Kacsuk, I. Kelley, I. Taylor, and T. Kiss. EDGeS: The Common Boundary Between Service and Desktop Grids. In Proceedings of the CoreGrid Integration Workshop, Crete, Greece, 2008.
[15] A. Tsaregorodtsev, V. Garonne, and I. Stokes-Rees. DIRAC: A Scalable Lightweight Architecture for High Throughput Computing. In Fifth IEEE/ACM International Workshop on Grid Computing (GRID'04), 2004.
