Workflow-Oriented Collaborative Grid Portals1

Gergely Sipos1, Gareth J. Lewis2, Péter Kacsuk1 and Vassil N. Alexandrov2

1 MTA SZTAKI Computer and Automation Research Institute, H-1518 Budapest, P.O. Box 63, Hungary
{sipos, kacsuk}@sztaki.hu

2 Advanced Computing and Emergent Technologies Centre, School of Systems Engineering, University of Reading, Whiteknights, P.O. Box 225, Reading, RG6 6AY, UK
{v.n.alexandrov, g.j.lewis}@reading.ac.uk
Abstract. The paper presents how workflow-oriented, single-user Grid portals can be extended to meet the requirements of users with collaborative needs. Through collaborative Grid portals different research and engineering teams would be able to share knowledge and resources. At the same time the workflow concept ensures that the shared knowledge and computational capacity are aggregated to achieve the high-level goals of the group. The paper discusses the issues that collaborative support raises for Grid portal environments during the different phases of workflow-oriented development work. While in the design period the most important task of the portal is to provide consistent and fault-tolerant data management, during workflow execution it must act within the security framework its back-end Grids are built on.
1. Introduction

The workflow concept is a widely accepted approach to composing large-scale applications [14], [15], [16]. Most of today's well-known production Grids support the execution of workflows composed of sequential or parallel jobs [2], [3]. At the same time, none of these infrastructures contains services that enable the creation and processing of workflows: the required functionality is provided by the front-end Grid portals. Globus 2 [1], the middleware layer the referred systems are built on, contains services for job execution and for the different aspects of data management. Because workflow management builds directly onto these basic computation-related services, and because workflow support can be the top layer of a Grid, its integration into portals is a natural choice. One of the original goals of Grid computing is to enable cooperative work between research and engineering teams through the sharing of resources [5]. In
1 The work described in this paper is supported by the Hungarian Grid project (IHM 4671/1/2003), by the Hungarian OTKA project (No. T042459), and by the University of Reading.
collaborative systems the different teams' software and hardware infrastructures and the team members' knowledge can be aggregated to create an environment that is suitable for solving previously unsolvable problems [6]. Unfortunately, today's Grid portals focus solely on High Performance and High Throughput Computing and do not support collaborative work. The P-GRADE Portal [4] is a Grid portal solution with such typical characteristics. On the one hand it gives high-level workflow management support; on the other hand it is useless for users with collaborative problem-solving requirements. Using its integrated workflow editor component, or its graphical Web interface, users can comfortably organise sequential or parallel MPI jobs into workflows, and can execute, monitor and visualize them, but they cannot achieve the aggregation of knowledge, resources and results. The main goal of our research was to determine how to convert the P-GRADE Portal, a Globus-based computational Grid portal, into a centre for collaborative work. Although we discuss the problems collaborative computational portals face in reference to the P-GRADE Portal, the presented results can easily be generalised to other portal solutions such as Triana [8] or GENIUS [9]. The remaining part of the paper is organised as follows. Chapter 2 discusses the advantages that workflow-oriented collaborative computational Grid portals bring to clients, while chapters 3 and 4 describe the difficulties that collaborative support raises in these environments. In chapter 3 the collaborative workflow design is introduced, while chapter 4 examines the differences between collaborative and single-user workflow execution processes. Finally, chapter 5 gives conclusions.
2. Knowledge and resource aggregation with collaborative portals

Today's computational Grids sometimes seem quite small from the end-user's point of view. Even if a Grid has a large number of sites, an account or certificate is usually valid only for a small subset of them: the sites that participate in a specific Virtual Organisation (VO). Some VOs can be accessed by physicists, some others by geologists, while another subset is allocated to biologists. None of these scientists can break out of the "sandboxes" Grid administrators force them into. The same is true for the applications. Different VOs (or even Grids) are built for different jobs [3], and there is little chance for a program to run anywhere other than in its own special environment. A consequence of the usually special requirements of Grid jobs is that the valuable results they produce benefit only a few favoured researchers. Because Grid portals are designed to serve one specific VO (or sometimes several VOs inside the same Grid), there is no way to define and execute cross-grid workflows with them. The different jobs of such a workflow should run in different Grids (or at least in different VOs), and the portal framework must be able to handle the special design and execution needs such applications require. With the collaborative P-GRADE Portal we would like to give a solution to this problem. The main aim of this new version of the P-GRADE Portal is to support the integration and sharing of knowledge and resources through collaborative workflows.
A collaborative workflow is a job-workflow built by multiple persons, the members of a collaborative team. Every team member contributes his/her own execution logic, jobs and resource access rights to the workflow graph, enabling the team to exploit aggregated knowledge and hardware capacity. This main goal of collaborative portals can be seen in Fig. 1. If we first take a look at a single-user P-GRADE Portal workflow, it can be stated that one individual must possess all the knowledge and site access rights that the workflow definition and execution require [4]. The person who defines the graph must specify the jobs for the different nodes and must have a certificate that is valid to execute these jobs on the chosen Grid sites. (Even if a high-level broker service automatically maps the jobs onto the best sites, it is a consequence of GSI that the end-user must possess the appropriate certificate(s) accepted by those sites.) This is true for all the previously referred, Globus 2 based Grids and their portals.
Fig. 1. Sharing and integrating knowledge and resources through collaborative portals
Using the team-work support of a collaborative portal, the different participants can aggregate their knowledge and the abilities granted to them by their certificates to achieve higher-level goals than they could individually. In the design phase the team members can collaboratively, and in real time, define the structure of the workflow graph and the content of the different nodes. Every member can contribute to the workflow with his/her jobs. A job can use the results produced by other members' jobs and can, in turn, generate results for further jobs. Through the data flow between the jobs inside a collaborative workflow the system realises the flow of knowledge. The different users do not have to care where and how the input files for their jobs are produced; they only have to know the format and the meaning of the incoming and outgoing data flows. Globus 2 and Globus 3 based Grids can be accessed through the P-GRADE Portal [4]. In Globus 2 environments GRAM sites provide the job execution facility, and a consequence of the GSI philosophy is that different users can access different GRAM sites. These sites can be inside the same VO, inside different VOs within the same Grid, or in totally different Grids. The P-GRADE Portal does not cooperate with
Grid resource brokers; it sends the different jobs to GRAM sites statically defined by the developers of the workflows. Since several participants are involved in the construction of a collaborative workflow, the sites they can individually access are collectively available for the collaborative jobs. Any job of a collaborative workflow can use any site that at least one team member has access to (see Fig. 2). Through this approach, workflow-based inter-grid resource aggregation can be realised.
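The site-aggregation rule above — any job of a collaborative workflow may target any site that at least one member can access — amounts to a set union over the members' access rights. A minimal sketch; all names and host strings are illustrative, not the portal's actual API:

```python
# Sketch of collaborative site aggregation: the set of GRAM sites
# available to a collaborative workflow is the union of the sites
# each team member's certificate grants access to.

def available_sites(team_access):
    """team_access maps a member name to the set of GRAM sites
    his/her certificate is accepted by."""
    sites = set()
    for member_sites in team_access.values():
        sites |= member_sites
    return sites

# Hypothetical team: two members with access to different VOs' sites.
team = {
    "alice": {"gram.vo-physics.org"},
    "bob": {"gram.vo-bio.org", "gram.see-grid.org"},
}
print(sorted(available_sites(team)))
```

No single member could reach all three sites alone, yet the collaborative workflow as a whole can target any of them.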
Fig. 2. Collaborative inter-grid resource aggregation
3. Collaborative workflow development

The P-GRADE Portal provides a dynamically downloadable editor application that users can use to construct workflows and upload them onto the portal server [4]. A P-GRADE workflow consists of three types of components: jobs, ports and links. Jobs are the sequential programs and parallel MPI applications that must be executed on the different Grid sites. Ports are the output and input files these jobs generate or require, while links define data channels between them. In the collaborative version of the P-GRADE Portal multiple editors must be able to work on the same workflow definition simultaneously. To enable this parallel work, every editor must manage its own local copy of the currently opened collaborative workflow and synchronize it with the corresponding global view stored on the portal server. The purpose of the synchronization is twofold: with a local-to-global update the locally performed changes can be validated on the consistent global view, while the global-to-local update is necessary to inform the different team members about each other's work in a real-time fashion. The different synchronization tasks are illustrated in Fig. 3. The server component of the P-GRADE Portal is a Web application that cannot send asynchronous calls to its client-side workflow editors [4]. Consequently, both the
global-to-local and the local-to-global synchronization processes have to be initiated by the clients. During a local-to-global synchronization process the portal server has to generate a new global view from the present global and the received local workflows. During a global-to-local synchronization the workflow editor has to generate a new local view from the present local and the received global workflows.
Fig. 3. The different synchronization tasks in the collaborative workflow editing period
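Since the portal server cannot push updates, both synchronization directions must be driven by the editor. A minimal sketch of this client-initiated protocol, with locking omitted for brevity; the classes and method names are hypothetical, not the portal's actual interfaces:

```python
# Sketch of client-initiated synchronization: each editor holds a local
# copy of the workflow and periodically pushes its pending changes
# (local-to-global) and pulls the resulting global view (global-to-local).

class PortalServer:
    def __init__(self, workflow):
        self.global_view = dict(workflow)

    def local_to_global(self, local_changes):
        # Merge the editor's changes into the global view, then
        # return the new global view so the client can refresh itself.
        self.global_view.update(local_changes)
        return dict(self.global_view)

class WorkflowEditor:
    def __init__(self, server):
        self.server = server
        self.local_view = dict(server.global_view)
        self.pending = {}          # components edited locally since last sync

    def edit(self, component, value):
        self.local_view[component] = value
        self.pending[component] = value

    def synchronize(self):
        # One round trip: push pending changes, pull the merged global view.
        self.local_view = self.server.local_to_global(self.pending)
        self.pending = {}
```

With two editors attached to the same server, an edit made in one becomes visible in the other after both have synchronized — the real-time experience depends only on how frequently the clients call `synchronize()`.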
The most obvious way to update a workflow with another one is simply to overwrite it with the new one. It can be clearly seen that this protocol would lead to lost updates and would make the final result of the editing phase unpredictable: the user whose editor updates the global workflow last overwrites the other members' already validated work. Database management systems successfully solve similar problems with data locking mechanisms [10]. These servers lock smaller or bigger parts of their managed data to provide a consistent view. In our case the global workflow is the managed data. Because locking full workflows would prevent the collaborative work itself (in that case only one person could work on a workflow at a time), collaborative portals have to enable the locking of workflow parts. A workflow part can be locked by at most one user at a time, but different parts of the same workflow can be locked by different users simultaneously. Only locked workflow parts can be edited, and only by their owners. On the other hand, every collaborative group member must be able to read both locked and unlocked parts of the collaborative workflow. Our approach divides the workflow design period into three phases: contention, editing and releasing. In the contention phase the user must lock the workflow part he/she is interested in. In the editing phase the locked part can be modified, while in the releasing phase the modified components get unlocked and become part of the consistent global view. In our proposed concept locking requests are generated manually by the users through the workflow editor. When a collaborative team member clicks a special menu item of a workflow node, his/her editor generates a locking request. The meaning of this request for the system is the following: take the part of the workflow graph that starts with the selected job, and lock as large a contiguous part of it as possible.
In other words, the system has to lock the selected job, every job that directly or indirectly depends on it via file dependencies, all ports that are connected to these jobs, and finally all direct links between these jobs. If a locked component is found during this graph traversal, the system skips the rest of the branch and continues the work on the remaining branches. Fig. 4 illustrates the protocol through a simple example.
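Under the simplifying assumption that only jobs are modelled and edges follow the file dependencies, the traversal can be sketched as a depth-first search that skips already-locked branches; the function and job names are illustrative:

```python
# Sketch of the lock-propagation rule: starting from the selected job,
# lock every job reachable through file dependencies, but skip any
# branch whose entry job is already locked by another user.

def lock_branch(deps, locks, start_job, user):
    """deps maps a job to the jobs that consume its outputs;
    locks maps a job to the user currently holding its lock."""
    if start_job in locks:
        return set()               # already locked: request is refused
    acquired = set()
    stack = [start_job]
    while stack:
        job = stack.pop()
        if job in locks:
            continue               # locked branch: skip it, carry on elsewhere
        locks[job] = user
        acquired.add(job)
        stack.extend(deps.get(job, []))
    return acquired

# The Fig. 4 scenario: Job 3 feeds Job 4, Job 4 feeds Job 5,
# and Job 5 is already locked by another team member.
deps = {"Job3": ["Job4"], "Job4": ["Job5"]}
locks = {"Job5": "other-user"}
print(sorted(lock_branch(deps, locks, "Job3", "me")))   # Job 3 and Job 4 become locked
```

A full implementation would also lock the ports attached to the acquired jobs and the links between them, as described above; the sketch keeps only the job-level skeleton of the rule.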
Fig. 4. Workflow component locking scenario
The extended sequence diagram presents a locking scenario. On the left side one member of a collaborative team can be seen. He participates in the collaborative design through his workflow editor. The editor manages the local view of the collaborative workflow, and it communicates with the portal server that stores the global view. At the beginning both the editor and the portal server store the same view: "Job 5" together with its input and output ports (the small squares connected to it) is locked, while the rest of the workflow is unlocked. This means that some other person from the group is currently working on "Job 5", but the rest of the graph is free to lock. Suppose that our user clicks the "Lock" menu of "Job 3". The editor automatically generates an appropriate request message and sends it to the server. The server determines that the status of "Job 3" is unlocked, so it computes the largest part of the graph that depends on "Job 3" and contains only unlocked components. In our case this subgraph consists of "Job 3", "Job 4", their input and output ports, and the link that connects them. (Since "Job 5" was already locked it does not become part of this group.) The server locks this branch in the global view and sends the new view back to the editor. The editor updates its GUI according to the received graph
allowing the user to begin the development work. The user can now modify the properties of his/her locked components: he/she can add new jobs to the locked subgraph, connect new ports to these jobs, define new links between these ports, and delete any of the locked components. The development phase normally ends when the user finishes the work and manually unlocks the components, just as he/she locked them earlier. Besides this, the system has to be prepared for unstable clients as well. The workflow editor can crash, or the client host or the network can break down, hence the portal server must be able to identify and automatically unlock unnecessarily locked workflow components. Distributed systems apply leasing/lifetime-based solutions for this purpose [11], [12]. If the portal server locks workflow components only for limited periods and enables the on-demand extension of these locking intervals, then unnecessarily locked components can be released as soon as possible. The workflow editor can hide the regular task of lease/lifetime extension from the user. The workflow editor has to perform regular global-to-local synchronization (see Fig. 3) during the whole design period. In the contention phase, out-of-date local information can make the user believe that a given part of the workflow is still locked by someone else or, conversely, that it is still free to lock. In the editing phase, a rarely updated local view deprives the users of the experience of real-time collaboration. During a global-to-local synchronization process the editor has to merge the received global workflow with the present local one. The new local view must be generated from the following items:
• every component from the old local view that is locked for the local user
• every unlocked component from the received global view
• every component from the global view that is locked for other users
Since only locked components can be modified, the local-to-global synchronization process makes no sense in the contention phase. During a local-to-global synchronization process the portal server has to merge the present global workflow with the received local one. The following items have to be aggregated to generate the new global view:
• every component from the local view that is locked for the user whose editor performed the local-to-global update
• every component from the global view that is locked for other users
• every unlocked component from the global view
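Both merge directions follow mechanically from these two lists. A sketch with a deliberately simplified component model — a view maps component names to (payload, owner) pairs, where owner is the lock holder or None; all names are illustrative:

```python
# Sketch of the two view-merge rules used during synchronization.

def merge_global_to_local(local, global_view, me):
    """New local view: my own locked components survive from the old
    local view; everything else is taken from the received global view."""
    new_local = dict(global_view)          # unlocked + locked-by-others
    for comp, (payload, owner) in local.items():
        if owner == me:
            new_local[comp] = (payload, owner)   # keep my in-progress edits
    return new_local

def merge_local_to_global(global_view, local, editor_user):
    """New global view: the updating editor's locked components are taken
    from the received local view; everything else stays as it was."""
    new_global = dict(global_view)
    for comp, (payload, owner) in local.items():
        if owner == editor_user:
            new_global[comp] = (payload, owner)
    return new_global
```

Note the symmetry: in both directions a user's locked components always win over the other side's copy, while components locked by other users are never touched — which is exactly what prevents the lost updates described above.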
4. Collaborative workflow execution

In the execution phase there are two differences between collaborative and single-user Globus workflows:
1. In the collaborative case multiple clients must be able to observe and control the execution of a workflow.
2. To execute a collaborative workflow, usually more than one GSI proxy is required. For a normal workflow a single proxy is enough.
The synchronization process discussed earlier provides an adequate solution to the first issue. The system can provide a consistent view during the execution phase just as it is achieved in the editing period. Our proposed solution for the second problem is not so simple. The present version of the P-GRADE Portal generates a Globus proxy for a workflow job from the long-term certificate that belongs to the person who defined the GRAM site for that job [4]. To enhance security, the P-GRADE Portal delegates the certificate and proxy management tasks to MyProxy servers [13]. The Portal itself is only a Web interface to access MyProxy sites. The users can transfer certificates and proxies between their client machines, the portal server and the MyProxy hosts through their Web browsers. Collaborative portals must be able to download proxies for the different collaborative jobs automatically, since a user who submits a collaborative workflow cannot wait until all the involved team members download the necessary proxies by hand. Collaborative portals can provide an automatic proxy download facility if they store the relations between users, MyProxy sites and GRAM hosts. Using this database the portal can determine which MyProxy server must be used to get a valid proxy for a GRAM site defined by a collaborative group member. The infrastructure required to perform the automated proxy download can be seen in Fig. 5. The great benefit of this approach is that once the collaborative group members define their own MyProxy-GRAM relations, the portal can always obtain the required proxies without any manual help.
Fig. 5. Infrastructure for automated execution of collaborative workflows
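The relation database behind the automatic proxy download can be pictured as a simple lookup table; the schema, user names and host names below are purely illustrative:

```python
# Sketch of the user-GRAM-MyProxy relation table: each record states
# that a proxy valid for a given GRAM site, on behalf of a given user,
# can be retrieved from a given MyProxy server.

relations = [
    # (user,   GRAM site,             MyProxy server)
    ("alice", "gram.vo-physics.org", "myproxy.physics-grid.org"),
    ("bob",   "gram.see-grid.org",   "myproxy.see-grid.org"),
]

def myproxy_for(user, gram_site):
    """Return the MyProxy server storing a proxy that the portal can
    use to submit a job to gram_site on behalf of user."""
    for u, g, m in relations:
        if u == user and g == gram_site:
            return m
    raise LookupError(f"no proxy source known for {user}@{gram_site}")

print(myproxy_for("bob", "gram.see-grid.org"))
```

At submission time the portal resolves each job's GRAM site to a MyProxy server this way and downloads the proxy itself, so no team member has to intervene manually.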
Another consequence of using multiple proxies within a single workflow is that file transfers between different security domains necessarily occur. In this case none of the proxies is sufficient to perform a direct file transfer between the job executor sites. The portal server has to be used as temporary storage for an indirect file transfer. The portal server application can use the proxy of the first job to copy the output files into a permanent or temporary directory on the portal server, and it can use the proxy of the second job to copy these files onto the second GRAM site. After the workflow finishes successfully, the proxies that belong to the jobs that produced the final results can be used to copy the files onto the portal server, to a place where every collaborative team member can access them.
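The two-hop staging described above can be sketched as follows; `copy()` stands in for a GridFTP transfer authenticated with the given proxy, and the file system model, paths and names are all illustrative:

```python
# Sketch of indirect file transfer between two security domains:
# the first job's proxy moves the file to the portal server, and the
# second job's proxy moves it on to the destination GRAM site.

def copy(src, dst, proxy, fs, log):
    """Stand-in for a GridFTP transfer authenticated with `proxy`."""
    fs[dst] = fs[src]
    log.append((src, dst, proxy))

def indirect_transfer(src, dst, proxy_src, proxy_dst, fs, log):
    staged = "portal:/tmp/stage.dat"        # temporary copy on the portal server
    copy(src, staged, proxy_src, fs, log)   # hop 1: authorised by the first job's proxy
    copy(staged, dst, proxy_dst, fs, log)   # hop 2: authorised by the second job's proxy
    del fs[staged]                          # remove the temporary file

fs = {"gramA:/out.dat": b"results"}
log = []
indirect_transfer("gramA:/out.dat", "gramB:/in.dat", "proxyA", "proxyB", fs, log)
```

The log records show that each hop uses a different proxy — neither credential alone could have carried the file from the first domain to the second.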
5. Summary and conclusions

The paper discussed how today's Globus-based, workflow-oriented computational Grid portals can be turned into collaborative centres. The solutions have been examined through the example of the P-GRADE Portal, a concrete computational Grid portal, but most of the results can be applied to other widely used frameworks as well. In the workflow design phase the biggest difference between single-user and collaborative portals is the requirement to protect against lost updates. Our approach is a data locking mechanism optimised for workflow-oriented environments. In the execution phase security is the biggest issue for collaborative groups. The paper introduced an automatic proxy download facility to submit the workflow jobs into the different VOs and Grids the collaborative team members have access to. The solution is based on a special database that specifies which proxies must be obtained, and how, to access the specified executor sites. Because of the extra features that collaborative portals must provide in contrast with single-user portals, these complex systems lose their pure interface characteristics and become functionality providers. Since Grid portals should be slim layers that provide a Web interface for standardised Grid services, the introduced collaborative functionality should ultimately be realised by stateful Grid services and not by a Grid portal. Jini [11] or WSRF [21] would be possible candidates for developing these services. The P-GRADE Portal is already used in several Grid systems such as the LCG-2 based SEE-GRID [18], the GridLab testbed [8], the Hungarian Supercomputing Grid [19] and the UK OGSA testbed [20]. We have already started the implementation of its collaborative version. The prototype will be available by June 2005, and the final system will be unique on the market, providing an optimal balance between HPC, HTC and collaborative support.
6. References

[1] I. Foster and C. Kesselman: "The Globus project: A status report", in Proceedings of the Heterogeneous Computing Workshop, pages 4-18, IEEE Computer Society Press, 1998.
[2] The Grid2003 Production Grid: "Principles and Practice", to appear in 13th IEEE International Symposium on High-Performance Distributed Computing (HPDC-13), Honolulu, 2004.
[3] LHC Grid: http://lcg.web.cern.ch/LCG/
[4] Cs. Németh, G. Dózsa, R. Lovas and P. Kacsuk: "The P-GRADE Grid Portal", in Computational Science and Its Applications – ICCSA 2004: International Conference, Assisi, Italy, LNCS 3044, pp. 10-19.
[5] I. Foster and C. Kesselman: "The Grid: Blueprint for a New Computing Infrastructure", Morgan Kaufmann, 1999.
[6] Gareth J. Lewis, S. Mehmood Hasan, Vassil N. Alexandrov: "Building Collaborative Environments for Advanced Computing", in Proc. of the 17th International Conference on Parallel and Distributed Systems (ISCA), pp. 497-502, San Francisco, 2004.
[7] I. Foster, C. Kesselman, G. Tsudik, and S. Tuecke: "A security architecture for computational grids", in ACM Conference on Computers and Security, pages 83-91, ACM Press, 1998.
[8] G. Allen, K. Davis, K. N. Dolkas, N. D. Doulamis, T. Goodale, T. Kielmann, A. Merzky, J. Nabrzyski, J. Pukacki, T. Radke, M. Russell, E. Seidel, J. Shalf, and I. Taylor: "Enabling Applications on the Grid: A GridLab Overview", International Journal of High Performance Computing Applications, Aug. 2003.
[9] R. Barbera, A. Falzone, A. Rodolico: "The GENIUS Grid Portal", Computing in High Energy and Nuclear Physics, 24-28 March 2003, La Jolla, California.
[10] V. Gottemukkala and T. Lehman: "Locking and latching in a memory-resident database system", in Proceedings of the Eighteenth International Conference on Very Large Databases, Vancouver, pp. 533-544, August 1992.
[11] J. Waldo: "The Jini architecture for network-centric computing", Communications of the ACM, 42(7), pp. 76-82, 1999.
[12] I. Foster, C. Kesselman, J. Nick, and S. Tuecke: "The physiology of the Grid: An Open Grid Services Architecture for distributed systems integration", Technical report, Open Grid Services Architecture WG, Global Grid Forum, 2002.
[13] J. Novotny, S. Tuecke, and V. Welch: "An online credential repository for the grid: MyProxy", Symposium on High Performance Distributed Computing, San Francisco, Aug. 2001.
[14] I. Taylor, M. Shields, I. Wang and R. Philp: "Grid Enabling Applications Using Triana", in Workshop on Grid Applications and Programming Tools, Seattle, 2003.
[15] Matthew Addis, et al.: "Experiences with eScience workflow specification and enactment in bioinformatics", Proceedings of UK e-Science All Hands Meeting 2003.
[16] Ewa Deelman, et al.: "Mapping Abstract Complex Workflows onto Grid Environments", Journal of Grid Computing, Vol. 1, No. 1, 2003, pp. 25-39.
[17] J. Frey, T. Tannenbaum, I. Foster, M. Livny, S. Tuecke: "Condor-G: A Computation Management Agent for Multi-Institutional Grids", in 10th International Symposium on High Performance Distributed Computing, IEEE Press, 2001.
[18] SEE-GRID Infrastructure: http://www.see-grid.org/
[19] J. Patvarszki, G. Dozsa, P. Kacsuk: "The Hungarian Supercomputing Grid in the actual practice", Proc. of the XXVII. MIPRO Conference, Hypermedia and Grid Systems, Opatija, Croatia, 2004, pp. 203-207.
[20] T. Delaitre, A. Goyeneche, T. Kiss, G. Z. Terstyanszky, N. Weingarten, P. Maselino, A. Gourgoulis, S. C. Winter: "Traffic Simulation in P-Grade as a Grid Service", Proc. of the DAPSYS 2004 Conference, September 19-22, 2004, Budapest, Hungary.
[21] I. Foster, J. Frey, S. Graham, S. Tuecke, K. Czajkowski, D. Ferguson, F. Leymann, M. Nally, I. Sedukhin, D. Snelling, T. Storey, W. Vambenepe, and S. Weerawarana: "Modeling Stateful Resources with Web Services", 2004, www.globus.org/wsrf