Resource Monitoring and Management in Metacomputing Environments¹

Tomasz Wrzosek, Dawid Kurzyniec, Dominik Drzewiecki, and Vaidy Sunderam

Dept. of Math and Computer Science, Emory University, Atlanta, GA 30322, USA
{yrd,dawidk,drzewo,vss}@mathcs.emory.edu
Abstract. Sharing of computational resources across multiple administrative domains (sometimes called grid computing) is rapidly gaining in popularity and adoption. Resource sharing middleware must deal with ownership issues, heterogeneity, and multiple types of resources that include compute cycles, data, and services. Design principles and software approaches for monitoring and management tools in such environments are the focus of this paper. A basic set of requirements for resource administration tools is first proposed. A specific tool, the GUI for the H2O metacomputing substrate, is then described. Initial experiences are reported, and ongoing as well as future enhancements to the tool are discussed.
1 Introduction
A growing number of high-performance computational environments consist of, and use, resources that are not only distributed over the network but also owned and administered by different entities. This form of resource sharing is very attractive for a number of reasons [2]. However, the very aspects of heterogeneity, distribution, and multiple ownership make such systems much more complex to monitor, manage, and administer than local clusters or massively parallel processor machines. There is therefore a need for sophisticated, reliable, and user-friendly management tools fulfilling several essential requirements:

– Authorized individuals should be allowed to check the status, availability, and load of shared resources.
– Owners should be able to (remotely) control access to their resources.
– Resource usage information should be dynamically updated at appropriate intervals and presented in a clear and comprehensible form.
– As with all types of monitoring tools, perturbation to the underlying system should be minimized to the extent possible.
¹ Research supported in part by U.S. DoE grant DE-FG02-02ER25537 and NSF grant ACI-0220183.
– Especially in shared computing environments where multiple administrators control different resources, abrupt unavailability of resources is more likely. Monitoring tools should be responsive to such events and provide informational and correctional options.
– In addition, the usual graphical user interface guidelines should be followed to make tools convenient and easy to use, and to allow users to focus on their own domains and applications.

A graphical monitoring and management tool that follows these guidelines has been developed for use with the H2O metacomputing environment [10]. Some background information about this underlying distributed computing substrate, the salient features of the GUI tool, and its projected use in shared-resource settings are described in Section 2. Section 3 discusses related tools for other distributed computing environments, while Section 4 describes the detailed design and implementation of the GUI. The paper concludes with a discussion of our early experiences with the tool and plans for further development.
2 The H2O Framework and its Graphical User Interface
The H2O metacomputing system is an evolving framework for loosely coupled resource sharing in environments consisting of multiple administrative domains. Its architectural model is provider-centric and based on the premise that resource providers act locally and independently of each other and of clients, and that by minimizing or even eliminating coordination middleware and global state at the low level, self-organizing distributed computing systems can be enabled. In H2O, a software backplane within each resource supports component-based services that are completely under owner control; yet, authorized clients may configure and securely use each resource through components that deliver compute, data, and application services. A detailed description of the core H2O framework is outside the scope of this paper [10,5]; however, selected aspects are explained below as appropriate.

In H2O, various types of entities (providers, clients, developers, third-party resellers) interact with distributed computing resources and software units (pluglets and kernels), as shown in Figure 1. "Providers" are resource owners, e.g. users that control and/or have login id's on computer systems. They instantiate the H2O kernel on machines they wish to share and specify their sharing policies. Clients avail themselves of resources to suit their own needs via "pluglets", which are componentized modules providing remote services. These services may range from end-user applications (e.g. a computational fluid dynamics pluglet) to generic programming environments (e.g. a collection of pluglets implementing an MPI environment).
Service deployment may be performed by end-clients, by providers, as well as by third-party resellers that may offer to clients a value-added service over a raw resource served by the provider. It is evident that without appropriate tools, management of such a multi-actor, dynamic environment can be unwieldy. Therefore, a graphical tool termed the H2O GUI has been developed to assist users in managing H2O kernels, pluglets, policies, and other associated aspects through a convenient and friendly interface. Providers may use the GUI to start, stop, and dynamically attach to and detach from specific kernels, as well as to control sharing policies and user access privileges. Pluglet deployers (i.e. third-party resellers or end-clients) may use the GUI to load pluglets into specific kernels, search for previously loaded pluglets, and aggregate pluglet management tasks. The GUI enables rapid determination of resource usage types and levels, pluglet configurations on various resources, and execution control of distributed tasks.
Fig. 1. Various example usage scenarios of H2O (registration and discovery of kernels via UDDI, JNDI, LDAP, DNS, GIS, or informal means such as e-mail or phone; deployment of pluglets by providers, clients, resellers, and developers)
3 Related Work
In today's large, loosely coupled, heterogeneous, and somewhat fragile environments, there is tremendous potential and need for the evolution of effective resource monitoring tools. Our GUI project is one effort that attempts to address this issue; other similar research efforts are mentioned below. A few tools oriented towards distributed resource administration have been available for some time; an example is XPVM [6], a graphical console and monitor for PVM. XPVM enables the assembly of a parallel virtual machine, and provides interfaces to run tasks on PVM nodes. In addition, XPVM can also gather task output, provide consoles for distributed debugging, and display task interaction diagrams. Adopting a different premise, MATtool [1] was written to aid system administrators; it permits uploading and execution of a task (process) simultaneously on many machines, and uses monitoring services running on hosts to inspect current machine status. Sun Grid Engine, Enterprise Edition is a software infrastructure that enables "Campus Grids" [8]. It orchestrates the delivery of computational power based upon distributed resource policies set by a grid administrator. The SGEEE package includes a graphical user interface tool (qmon) that is used to define Sun's grid resource access policies (e.g. memory, CPU time, and I/O activity). Not a tool in the traditional sense but rather a library providing a specific API, the Globus GRAM package [4] is used by the Globus framework [3] to process requests for, and to allocate, resources, and provides an API to submit, cancel, and manage active jobs. Another category of systems created to help users with distributed computing and resource management is Grid Portals [13]. Such interfaces enable job submission, job monitoring, component discovery, data sharing, and persistent object storage. Examples of such portals include the Astrophysics Simulation Collaboratory Grid Portal [7], the NPACI HotPage framework [12], and JiPANG [11]. These portals present resources in a coherent form through simple, secure, and transparent interfaces, abstracting away the details of the resources. The H2O GUI builds on the experience of the tools mentioned above. It supports controlled resource sharing by permitting the definition and application of sharing policies, aids in the deployment and monitoring of new services as well as in their discovery and usage, and, finally, presents collected resources and services in the form of kernels and pluglets, abstracting away their origin and nature. However, the target user constituency of the H2O GUI is different from the tools above, which were created either for resource administration or for resource usage.
The H2O GUI serves both of these target groups and may be used by resource providers (administrators) as well as service deployers and/or third-party resellers. The main distinguishing feature of the H2O GUI concerns the resource aggregates that it monitors. In other metacomputing environments, there is usually a notion of the distributed system (with a certain amount of global state) that is being monitored. In contrast, resource aggregation in H2O does not involve distributed state and is only an abstraction at the client side – H2O kernels maintain no information about other kernels. Hence, the H2O GUI provides the convenience of managing a scalable set of resources from a single portal even though the resources themselves are independent and disjoint.
4 The H2O GUI – How it Works
From an operational viewpoint, the first action in the use of H2O generally involves the instantiation of kernels by providers. Active kernels are represented by kernel "references"; as indicated in Figure 1, clients look up and discover kernel references via situation-specific mechanisms that can range from UDDI repositories to LDAP servers to informal means such as phone calls or email. Clients log in to kernels, instantiate (or upload) pluglets, and deploy their applications. The H2O GUI can help with all aspects of these phases.

4.1 Profiles
To facilitate operational convenience, stable data concerning resources from a given provider (or a set of providers) may be maintained on persistent storage in the form of a "profile" file. The profile contains key data concerning a provider's kernels and is retained in XML format to enhance interoperability. It is a collection of entries, each holding both static and, after instantiation, dynamic information about an H2O kernel. The latter takes the form of a kernel reference, which is of paramount importance as it is the only way to contact the kernel.
Fig. 2. Example profile entry

An example of a kernel entry is shown in Figure 2. The "RemoteRef" field contains the aforementioned kernel reference; it encapsulates information about the kernel identity, the kernel endpoint, and the protocol that a client must use when connecting to the kernel via this endpoint. The "startup" element is specified and used only by resource owners, as it allows for starting the kernel directly from the GUI. The (open) set of supported kernel instantiation methods is configured globally via the GUI configuration dialog and may include any method invokable from the command-line interface.
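To make the structure concrete, a profile entry of the kind described above might look roughly like the following sketch. Only the "RemoteRef" and "startup" elements (and the kernel name my_red_kernel, from the original listing) come from the text; all other element names, attributes, and values are illustrative assumptions, not the actual H2O profile schema.

```xml
<!-- Hypothetical sketch of a single profile entry; element and
     attribute names other than RemoteRef and startup are assumed. -->
<kernelEntry name="my_red_kernel">
  <!-- Static data: how the owner starts this kernel (owners only) -->
  <startup command="ssh red.example.edu h2o-kernel"/>
  <!-- Dynamic data: the kernel reference obtained after instantiation;
       it encapsulates kernel identity, endpoint, and protocol -->
  <RemoteRef>
    <identity>...</identity>
    <endpoint>...</endpoint>
    <protocol>...</protocol>
  </RemoteRef>
</kernelEntry>
```

A service deployer's entry would typically contain only the RemoteRef portion, since deployers cannot start kernels they do not own.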
4.2 The GUI functionality
The GUI has two main operating windows. The first, displayed after GUI startup, is a small panel designed to provide the user with general information about all the kernels in the profile at a glance. It also provides controls allowing shutdown and/or restart of all kernels (i.e. the entire profile). A user may also access information about kernels in a more detailed manner. The second GUI window, shown in Figure 3, is more comprehensive and provides the user with a number of additional options, including detailed on-line monitoring and management of the distributed resources (kernels); here the focus is on separate access to individual kernels.

Fig. 3. Main GUI window

The left panel in the main GUI window displays a list of the kernel entries contained in the currently loaded profile. A user may edit this list: changing the aforementioned static kernel data, adding new entries, or removing unused ones. The first two operations are conducted within a dialog divided into two sections corresponding to the structure of a profile entry. As in the case of a profile entry, the usage of this dialog varies between different classes of users. Resource providers may use it to specify how their kernels are to be started. On the other hand, service deployers (entities loading pluglets into H2O kernels) may use it to manually enter remote references to kernels of interest that are controlled by somebody else. From the perspective of resource providers, an H2O kernel may be viewed as a mechanism for controlled resource sharing. Thus for them, the GUI serves as a tool for collective management of resources that are shared via H2O kernels and utilized by H2O pluglets. To support such management tasks, the GUI provides ways of controlling kernels as well as separate pluglets, thus enabling providers to independently handle raw resources and the services running on them. This separate access to resources (kernels)
is realized through the kernel list, which allows providers to start or shut down a selected kernel as appropriate. Pluglets loaded into a kernel are shown in the second (right) panel and may be suspended, reactivated, or terminated via the GUI. These options might be used to manage already running jobs or to enqueue new ones on a kernel, change their priorities, grant or refuse resources, etc. The GUI is also designed for use as an accounting tool for H2O kernels. Since H2O requires a security policy to be defined and users' accounts to be created, the GUI's goal is to provide convenient interfaces that facilitate these tasks. A kernel provider may define entities that are allowed to use the provider's kernel and/or may define a set of permissions that will be used as the kernel security policy. The interface utilizes the fact that H2O security policies are based on standard Java policies [5,9]. The policy may be uploaded into, and used in, multiple kernels in the current profile, or may be exported to a file. These options simplify housekeeping and security management. The GUI also provides a form of remote console onto which kernel messages are printed. Examples of useful display messages include: a kernel reference that may later be used to advertise the kernel to potential users, pluglet loading messages, or pluglet and kernel exception stack traces. All events that take place within the GUI as well as within monitored kernels and pluglets (e.g. kernel/pluglet state changes, pluglet loading events, and policy changes) are logged, enabling a GUI user to keep track of happenings in the system and to inspect the causes of possible errors and exceptions. Apart from resource providers, the GUI may also be used by service deployers to upload new services into already running kernels and control them. The mandatory information needed to load a pluglet includes its class path and a main class name.
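Because H2O policies build on the standard Java policy-file syntax [9], a provider-defined kernel policy might resemble the following sketch. The principal class, principal name, and specific permissions are illustrative assumptions chosen for the example, not values prescribed by H2O.

```
// Hypothetical kernel security policy in standard Java policy-file
// syntax; the principal and permissions below are assumed examples.
grant principal javax.security.auth.x500.X500Principal "CN=alice" {
    // Allow this client's pluglets to open outgoing connections
    permission java.net.SocketPermission "*:1024-65535", "connect";
    // Allow scratch-space access on the provider host
    permission java.io.FilePermission "/tmp/h2o/-", "read,write";
};
```

A policy of this form could then, as described above, be uploaded into several kernels of the current profile at once, or exported to a file for later reuse.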
Some additional data may also be specified, allowing the deployer to personalize the service instance. This information may later be used for discovery or advertisement purposes. The necessary data may either be typed by hand or loaded from an XML-based descriptor file that would typically be provided by the supplier of the pluglet binaries.
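A descriptor file of the kind just described might look like the following sketch. The text mandates only a class path and a main class name; the element names, the URL, the class name, and the optional metadata shown here are illustrative assumptions.

```xml
<!-- Hypothetical pluglet deployment descriptor; only the class path
     and main class are mandated by the text, and all names here
     are assumed for illustration. -->
<plugletDescriptor>
  <classPath>http://repo.example.org/pluglets/cfd.jar</classPath>
  <mainClass>org.example.cfd.CfdPluglet</mainClass>
  <!-- Optional data personalizing the service instance; may later be
       used for discovery or advertisement -->
  <name>CFD solver</name>
  <description>Computational fluid dynamics service</description>
</plugletDescriptor>
```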
5 Discussion and Future Work
This paper has presented the preliminary facilities provided by the H2O GUI, a graphical interface for the management and monitoring of shared resources in the H2O metacomputing framework. Based on the belief that ease of management of shared resources in grids and metacomputing environments is critical to their success, the H2O GUI attempts to combine simplicity and industry standards (XML, JAAS, Swing) with utility and convenience for the different entities that interact with H2O, namely resource providers, clients, pluglet developers, and third-party resellers. In this initial version of the GUI, the focus is on resource provider and service deployer facilities. The prototype implementation enables user-friendly operation for both types of entities. In follow-on versions of the GUI, we intend to offer new features that assist in policy control, as well as lookup and discovery interfaces for pluglets. We are also exploring more complex monitoring features, e.g. network connection displays, bandwidth and memory usage indicators, and service usage statistics. We believe that these features and facilities will lead to substantial increases in the adoption of computational resource sharing across multiple administrative domains as a mainstream paradigm.
References

1. S. M. Black. MATtool: Monitoring and administration tool. Available at http://www.ee.ryerson.ca:8080/~sblack/mat/.
2. I. Foster, C. Kesselman, and S. Tuecke. The anatomy of the grid: Enabling scalable virtual organizations. The International Journal of Supercomputer Applications, 15(3), 2001.
3. Globus. Available at http://www.globus.org.
4. GRAM: Globus Resource Allocation Manager. Available at http://www-unix.globus.org/api/c-globus-2.2/globus_gram_documentation/html/index.html.
5. D. Kurzyniec, T. Wrzosek, D. Drzewiecki, and V. Sunderam. Towards self-organizing distributed computing frameworks: The H2O approach. Preprint, 2003.
6. Maui High Performance Supercomputing Center. XPVM. Available at http://www.uni-karlsruhe.de/Uni/RZ/Hardware/SP2/Workshop.mhtml.
7. M. Russell et al. The astrophysics simulation collaboratory: A science portal enabling community software development. In Proceedings of the 10th IEEE International Symposium on High Performance Distributed Computing, pages 207–215, San Francisco, CA, 7–9 Aug. 2001.
8. SGEEE: Sun Grid Engine, Enterprise Edition. Papers available at http://wwws.sun.com/software/gridware/sge.html.
9. Sun Microsystems. Default policy implementation and policy file syntax. http://java.sun.com/j2se/1.4.1/docs/guide/security/PolicyFiles.html.
10. V. Sunderam and D. Kurzyniec. Lightweight self-organizing frameworks for metacomputing. In The 11th International Symposium on High Performance Distributed Computing, Edinburgh, Scotland, July 2002.
11. T. Suzumura, S. Matsuoka, and H. Nakada. A Jini-based computing portal system. In Super Computing 2001, Denver, CO, USA, November 10–16, 2001.
12. M. Thomas, S. Mock, and J. Boisseau. NPACI HotPage: A framework for scientific computing portals. In 3rd International Computing Portals Workshop, San Francisco, CA, December 7, 1999.
13. G. von Laszewski and I. Foster. Grid infrastructure to support science portals for large scale instruments. In Proceedings of the Workshop Distributed Computing on the Web (DCW), pages 1–16, University of Rostock, Germany, 21–23 June 1999.