J Supercomput (2010) 52: 199–223 DOI 10.1007/s11227-008-0256-3
Network Bandwidth-aware job scheduling with dynamic information model for Grid resource brokers Chao-Tung Yang · Fang-Yie Leu · Sung-Yi Chen
Published online: 2 December 2008 © Springer Science+Business Media, LLC 2008
Abstract A resource broker with a user-friendly interface for job submission, developed on a platform constructed using the Globus Toolkit, is proposed. The broker employs a domain-based network information model, and a dynamic version of it, to measure network status, and it monitors and collects resource statuses and network-related information as the basis of its brokerage. A network bandwidth-aware job scheduling algorithm for brokering suitable Grid resources to communication-intensive jobs, which improves on and preserves the advantages of our previously developed network information model, is also proposed. Using timely information, the resource broker effectively matches Grid resources and user requests, thus improving job execution efficiency.

Keywords Network Bandwidth-aware · Job scheduling · Dynamic information model · Grid computing · Resource brokers
1 Introduction

Grid computing can be defined as coordinated resource-sharing and problem-solving in dynamic, multi-institutional collaborations [5, 7–11]. Grid systems generally involve many heterogeneous resources, such as hardware/software, computer architectures, and languages, which are often available on different platforms. The platforms, in turn, are frequently in different locales, belong to different administrative domains, and are linked by networks employing open standards at node interfaces [13, 29]. As
more Grids are deployed worldwide, the number of multi-institutional collaborations grows rapidly. Grid system resource management primarily concerns facilitating effective Grid execution of computationally intensive and expensive tasks such as simulations and optimizations [1, 4, 15, 17, 19, 20, 22, 24–28, 30, 31]. Grid applications often employ shared resources to improve performance. Hence, resource brokerage target functions should encompass mechanisms and parameters such as scheduling strategies, machine configurations and links, Grid workloads, degrees of data replication, etc.

Grid applications are very often developed over networks that provide only best-effort packet delivery services. Most recent high-performance computing applications with quality-of-service requirements focus on the capabilities of the networks interconnecting the target Grid infrastructure. Therefore, application runtime environments must interact with network resource managers, via their Grid resource brokers, to obtain relevant information on networks and to reserve network resources.

Workload/job scheduling is fundamentally important to Grid computing. The discipline has been around for quite some time, but as Grid and scale-out computing environments became more popular, it grew more and more important. Schedulers are ultimately responsible for allocating jobs to be executed across resources in Grid environments. While this sounds easy in theory, it is actually quite complicated. In essence, scheduling is an optimization challenge, since the number of resources available in a Grid is finite and user requests may exceed capacities. Optimization can take many forms: sometimes resource utilization across all Grid resources matters most, and sometimes job throughput does. Generally, organizations seek to balance these often competing objectives.

This study mainly concerns improving a previously developed resource broker to provide optimized global scheduling for all submitted jobs and requested resources. Thus, a resource broker with a network bandwidth-aware characteristic is proposed. We first examine how target function parameters affect application performance, since understanding each parameter's influence is crucial to enabling applications to achieve better performance, and also helps in developing effective scheduling heuristics and designing high-quality Grids. We then define the resource broker's environment by choosing overall application response time as a functional objective, and by defining Grid site job, scheduler, and performance models. Finally, we report results of experiments conducted on the TIGER Grid platform [29] to verify the approach proposed in this paper. Applications are represented as workflows that can be decomposed into single Grid jobs; the Grid resources these jobs require (e.g., number of processors) are also described.

The main contributions of this paper are as follows.
• Ganglia and NWS tools are used to monitor resource statuses and network-related information.
• A dynamic network information model for monitoring Grid resource information services is improved.
• A network bandwidth-aware job scheduling algorithm for the Grid resource broker is proposed to facilitate communication-intensive job execution.
• A user-friendly Grid portal is constructed for general users to submit jobs and monitor resource status details.

The rest of this paper is organized as follows. Section 2 provides a background review of related work. Section 3 describes the resource broker monitoring and information service architecture and interface. Section 4 presents the static and dynamic domain-based network information models. The network bandwidth-aware job scheduling algorithm for job execution is proposed in Sect. 5. Experimental results are presented and discussed in Sect. 6. Section 7 concludes the paper and addresses our future research.

2 Background

2.1 Middleware for Grid computing

The purpose of Grid computing is to maximize IT resource usage and to provide a positive return on investment by increasing Grid job throughput via seamless, pervasive access to distributed, heterogeneous resources. The Globus Toolkit® [5, 8–11] 4.0.X (GT4 in this study), an open-source strategic resource, provides the necessary middleware services to help build Grid infrastructures. The Globus Toolkit offers only the fundamental software and technologies needed to set up Grids. Often, to maximize the potential of a basic Grid setup, the core Globus tools, such as the Monitoring and Discovery System version 4 (MDS4), the Grid Resource Allocation Manager (GRAM), and the Grid Security Infrastructure (GSI), must be supplemented with other relevant functions and/or software.

The Java CoG (Commodity Grid) kit combines Java technology with Grid computing to develop advanced Grid services and increase basic Globus resource accessibility [12, 16]. It allows easier and more rapid application development by encouraging reuse of collaborative codes to avoid duplication of effort among problem-solving environments, science portals, Grid middleware, and collaborative pilots. Many Java-based applications deploy the Java CoG kit to connect themselves to Grid systems. Key components include: GridProxyInit, a JDialog for submitting pass phrases to Grids to extend certificate expiration dates; GridConfigureDialog, which uses the UITool in the CoG Kit to enable users to configure Grid server process numbers and host names; and GridJob, which creates GramJob instances. These Java classes represent a simple GRAM job and allow submitting jobs to gatekeepers, canceling jobs, sending signal commands, and registering and unregistering callbacks. A related component is GetRSL: RSL (the Resource Specification Language) provides a common interchange language for describing resources.

MPICH-G2, a Grid-enabled implementation of the MPI v1.1 standard [14], acquires necessary services from the Globus Toolkit® (e.g., job startup and security) to enable coupling of multiple machines, potentially with different architectures, to run MPI applications. MPICH-G2 automatically converts data in messages sent between machines with different architectures and supports multi-protocol communication: it selects TCP for inter-machine messaging and, where available, vendor-supplied MPI for intra-machine messaging. Existing parallel programs written for MPI can be executed over the Globus infrastructure after recompilation.
The Ganglia project grew out of the University of California, Berkeley's Millennium initiative [32]. Ganglia is a scalable, distributed, open-source system for monitoring node statuses (processor collections) in cluster-based wide-area systems. It uses a hierarchical communication structure among its components to accommodate information from large, arbitrary collections of multiple clusters, such as Grids. The Ganglia monitor collects hardware and system information, such as processor type and loading, memory usage, disk usage, operating system information, and other static/dynamic scheduler-specific details. Ganglia thus supplies the node-level resource information our broker needs; for job-related monitoring, we make use of Globus Toolkit client-side utilities to perform basic job monitoring services. Below, we demonstrate how to use the features of the Ganglia monitoring system and the monitoring utilities in the Globus Toolkit to enhance MDS and achieve the goals mentioned above.

The Network Weather Service (NWS) is a distributed system that employs a set of distributed performance sensors to detect network statuses by periodically monitoring Grid resources [16, 17, 30] and gathering system information of concern, including end-to-end TCP/IP performance (bandwidth and latency), available CPU percentage, and available nonpaged memory. The NWS then forecasts conditions for given time periods with numerical methods/models. The sensor interface also allows new sensors to be added to the system. The NWS uses a technique called the clique [30] to prevent all nodes from making point-to-point measurements: each node in a clique conducts inter-node measurements between itself and all other nodes in the same clique, and nodes can also join other cliques and make similar measurements. To realize the clique concept, the NWS extends token-passing methods, conducting measurements in proper sequence to reduce resource collisions.

2.2 Related work

In recent years, many research papers have focused on Grid resource broker (GRB) development, aiming to improve Grid system resource utilization and task efficiency. The GRB concept was first raised in 1999; it has been presented at the Open Grid Forum and recognized by the Grid Computing Environment (GCE) research group since 2001. GRB portals provide integrated approaches to managing Grid resources via user-friendly web GUIs and back-end GSI-enabled schedulers.

Aloisio and Cafaro [1, 2] introduced the Grid resource broker portal, an advanced Web gateway for computational Grids in use at the University of Lecce. The portal allows trusted users seamless access to computational resources and Grid services, and provides a friendly computing environment that takes advantage of the underlying Globus Toolkit middleware to enhance its basic services and capabilities. The authors showed that users need neither learn how to use Globus nor rewrite their legacy applications before they can smoothly use the system.

Many projects have designed resource management systems for a variety of system architectures to provide abundant management services. Krauter et al. [15] proposed a resource management system as the central component of a distributed network computing system. Requirements for RMSs (a Grid taxonomy) were described, and an abstract functional model for defining resource management
architectures was developed. The taxonomy was based on Grid system, machine organization, resource model characterization, and scheduling characterization types. Representative Grid systems were surveyed and classified into the various categories.

Chen et al. [4] proposed an active Grid resource management system to support a computational Grid network. The system is a scalable, two-level resource management architecture in which resources are divided into multiple autonomous domains. An active resource tree (ART) in each domain classifies resources as leaf nodes and active routers as nonleaf nodes. Resource information is disseminated to each active router in the ART, and nodes in the ART work cooperatively to discover and schedule resources. Communication between domains is done via the ART root. Resource trading between consumers and providers is carried out according to an advanced barter marketing model.

Toyama et al. [24] proposed a resource management system that can overcome high-cost data-transfer problems and help users employ desktop Grid computing for data-intensive applications. The system provides efficient scheduling by considering data file locations, and consists of three technical parts: resource management, multiple data replication, and worker selection. The system uses these techniques for intelligent reuse of previously copied data.

Generally, achieving the goals of acting as middleware and providing services requires a software layer that interacts with Grid environments. It is also necessary to offer resource management services that hide the underlying Grid resource complexity from users. Rodero et al. [22] designed an OGSI-compliant Grid resource broker compatible with both GT2 and GT3. This broker focuses on resource discovery and management and on dynamic policy management for job scheduling and resource selection, and is designed in an extensible and modular way using standard protocols and schemas to make it compatible with new middleware versions. The authors gave experimental results to demonstrate the resource broker's behavior.

Grid-enabled workflow management tools are crucial for successfully building and deploying bioinformatics workflows [25]. We briefly review three such workflow systems: the Grid Resource Broker Workflow Engine (GRBWE) [3], Pegasus [6], and Pegasys [23]. Cafaro et al. [3] introduced the GRBWE, which deals with workflows described by arbitrary graphs and is able to handle graph cycles and condition vertices. GRBWE provides an important feature called recursive composition, which allows workflow vertices to be defined as subworkflow or parameter-sweep vertices instead of batch tasks. Pegasus [6] is a workflow management system designed to map abstract workflows onto Grid resources, using the Globus Replica Location Service (RLS) and Monitoring and Discovery Service (MDS) to discover available resources and data. Compared with our GRB, Pegasus lacks an interface with certain bioinformatics tools and does not support a workflow monitoring tool. Pegasys [23] allows users to build directed acyclic graphs. The differences between these systems and ours are listed below.
First, unlike ours, which supports an editor for graphically decomposing submitted workflows, Pegasus provides only DAGs and does not include an editor for decomposing workflows. Second, Pegasys does not provide a Grid framework and must be installed on cluster machines, whereas our GRB scheduler supports dynamic assignment of resources belonging to computational Grid environments.
3 Grid Resource Broker Information Service

3.1 Conceptual overview

In a previous study, we reported the implementation of a workflow-based computational Grid resource broker, called WCGRB [33], which, as shown in Fig. 1, consists of seven subsystems: Portal, Job Monitor, Workflow Maker, Global Job Queue, Resource Broker, Monitoring Service, and Information Service. With these, WCGRB can discover existing Grid resources, evaluate their performance, and then assign suitable resources to submitted jobs. The purpose is to execute jobs such that all requirements and deadlines are met. Users can easily make use of this system through a common Grid portal [7, 19, 20, 26, 29]. To avoid confusion in this paper, we call this resource broker the "Previous Resource Broker," or P-Resource Broker for short.

Fig. 1 The WCGRB system architecture

Network transmission was not considered a key parameter in the original WCGRB. During execution, however, sufficient bandwidth must be allocated to communication links to prevent them from reducing system throughput. We also had to improve existing WCGRB subsystem features, including the Monitoring Service and Information Service, to enable them to cooperate effectively with the newly developed subsystems. The purpose of the extension is to more accurately match available network and Grid resources with user requests. To achieve this, WCGRB collects up-to-date network information and current resource statuses using the dynamic network information model.
Submitted jobs can then be executed by the most appropriate network and computational resources.

Fig. 2 Information Service and Monitoring Service architectures

Figure 2 depicts the Information Service and Monitoring Service subsystem architectures in relation to the P-Resource Broker. The Information Service collects resource information on processors, memory, and disks from all machines (nodes) in a Grid system, analyzes it, and then provides it to users or other WCGRB subsystems. The Monitoring Service accesses information maintained by the Information Service and presents it in graphical form. The Information Service modifications include a network bandwidth-awareness feature for improved resource brokerage performance; this required changes in the job scheduler deployment algorithm. The broker also employs a domain-based network information model that dynamically configures a layered Grid architecture. The Monitoring Service also uses the NWS and Ganglia tools to monitor resource statuses.

The P-Resource Broker compares user requests with the resource information collected by the Information Service, and selects appropriate resources for submitted jobs. The Scheduler, a component of the P-Resource Broker, assigns jobs to selected machines in the Grid for execution. The P-Resource Broker also collects execution results and stores them in the Information database, which holds resource and network statuses analyzed and processed by the Information Service. Users can retrieve these results using the Grid Portal.

Grids may span several administrative domains via the Internet, which means their machines and other resources may be difficult to monitor, control, and manage. The new version of the WCGRB is aimed at providing a multi-platform Grid service for real-time monitoring of resources such as CPU speed and utilization, memory usage, disk usage, and network bandwidth.
3.2 Information Service

The Information Service was originally a monolithic program with all functions developed in the same module. The new Information Service consists of five components: Gatherer, Message Center, Getter and Setter, Predictor, and Agent. Their functions and relations are as follows.

• Gatherer: uses Ganglia to collect resource information, such as CPU speed, number of CPUs, CPU loading, memory size, available memory, and disk space usage. It also uses the NWS tool to collect current network bandwidth. Gatherer is invoked every time a job is submitted, and periodically between job submissions. In our experience, such information collection consumes, on average, about 0.81% (for gigabit links) to 2.43% (for 100 Mbps links) of network bandwidth, so its influence is slight.
• Message Center: stores native information collected from the Grid, including that described for Gatherer. After analysis and processing, the collected data is sent to the Information database for storage.
• Getter and Setter: respond to information-collection and data-accessing requests submitted by users, so their database operations may be frequent. To unify information-access processes and reduce redundant program development, Getter and Setter were designed as a front end to the Message Center, controlling and handling all Message Center accesses. Getter and Setter are also responsible for storing information gathered by Gatherer in the Message Center database for future use.
• Predictor: newly developed for the new version of the WCGRB, it has two functions: it periodically retrieves native information from the Message Center, and it accepts requests issued by the Agent to predict and return required results. Its modular design accommodates different types of native information, so it is adaptable to various prediction models. This component increases system flexibility, since it can be applied to many applications.
• Agent: the contact window of the Information Service. The P-Resource Broker Scheduler and the Monitoring Service Controller both need real-time information and machine-information estimates. For example, assume the P-Resource Broker asks for a list of machines with low CPU loading. The Resource Broker first sends a request to the Agent. Upon receiving the request, the Agent invokes Getter and Setter to retrieve the requested information from the Message Center, and then returns the information to the P-Resource Broker (a sketch of this request flow appears at the end of this section). The P-Resource Broker also delivers information on machines' execution to the Agent, including the number of CPUs currently in use and their execution times, disk space occupied, memory usage, task requirements, etc. Predictor analyzes this information and sends recommended machine lists to the Agent when necessary.

3.3 Monitoring Service

The original Monitoring Service was also monolithic; the new version consists of Displayer, Controller, and Drawer. Their main functions and relations are:
• Displayer: provides a query mechanism for users to obtain historical data on Grid nodes. We constructed a web interface for it and integrated it with the Grid Portal so that users can conveniently access and query WCGRB.
• Controller: periodically accesses native information on nodes via the Information Service, and controls Monitoring Service tasks, including Grid node configuration and parameter setting.
• Drawer: receives parameters and data from the Controller, and draws figures for presentation by Displayer. Drawer's functions are flexible, since it must draw figures appropriate to the types of information received.

Fig. 3 The layered structure of the Resource Broker and lower-level components

Figure 3 shows the Resource Broker's layered structure and lower-level components, including the Monitoring Service and the Information Service, which incorporate four important integrated components: the Tomcat server [21], JRobin [32], Ganglia, and NWS. We also built a user-friendly GUI to enable even inexperienced users with no computational Grid knowledge to manipulate the system easily.
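To make the division of labor in Sect. 3.2 concrete, the following is a minimal sketch of the Agent request flow (a broker asking for machines with low CPU loading). All class and field names are illustrative assumptions, not the actual WCGRB code.

```python
# Illustrative sketch of the Agent -> Getter -> Message Center flow (Sect. 3.2).
from dataclasses import dataclass

@dataclass
class NodeRecord:
    host: str
    cpu_load: float    # 1-minute utilization in [0, 1], as stored by Gatherer
    free_mem_mb: int

class MessageCenter:
    """Stores native information collected by Gatherer."""
    def __init__(self):
        self._records: dict[str, NodeRecord] = {}

    def put(self, rec: NodeRecord) -> None:
        self._records[rec.host] = rec

    def all(self) -> list[NodeRecord]:
        return list(self._records.values())

class Getter:
    """Front end of the Message Center: unifies all read accesses."""
    def __init__(self, mc: MessageCenter):
        self._mc = mc

    def low_load_nodes(self, threshold: float) -> list[NodeRecord]:
        return [r for r in self._mc.all() if r.cpu_load < threshold]

class Agent:
    """Contact window of the Information Service."""
    def __init__(self, getter: Getter):
        self._getter = getter

    def handle_low_load_request(self, threshold: float = 0.2) -> list[str]:
        # The broker asks the Agent; the Agent delegates to Getter.
        return [r.host for r in self._getter.low_load_nodes(threshold)]

mc = MessageCenter()
mc.put(NodeRecord("alpha-1", 0.05, 1024))
mc.put(NodeRecord("alpha-2", 0.90, 512))
print(Agent(Getter(mc)).handle_low_load_request())  # ['alpha-1']
```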
4 Dynamic domain-based network information model

Job turnaround times may be affected by network performance in Grid computing environments. For this reason, resource brokers require network information to select resources. We constructed the network measurement model described below: nodes are single machines or personal computers, and lines linking nodes represent point-to-point network measurements.

The simplest way to measure Grid environment network performance is to measure link bandwidths between node_i and node_j in sequence using the NWS tool [30], where node_i and node_j are any two Grid nodes, i ≠ j. This yields a graph of complete point-to-point bandwidth measurements. However, such a process causes a too-many-connections problem and produces huge network traffic, occupying a significant portion of the available bandwidth. The number of
bandwidth measurements during a time period without a domain-based network information model is

NMN(N) = N × (N − 1),    (1)
where N is the number of nodes. If there are, for example, 100 nodes in a Grid, the number of paths/links measured during a predefined period is NMN(100) = 9,900, an enormous number of connections per measurement cycle. In large-scale Grid environments, this is excessive bandwidth overhead. Therefore, we propose the "domain" concept, in which domains consist of several nodes.

Fig. 4 The domain-based network measurement model

Figure 4 shows a network consisting of three domains with three to five nodes each. Generally, a domain is a cluster in a Grid system. Domains are linked by borders, which are nodes belonging to at least two domains. All borders in a network form a central domain. We thus need only pair-wise measurements within domains; network information on links connecting domains is collected by the central domain. Our domain-based model may look like an NWS clique, but the two are not on the same level: a clique operates using token-passing techniques and is a bottom-level component built into NWS, whereas our model is top-level and has certain advantages, such as reducing the number of measurements conducted and consuming less bandwidth. Our main idea is to predict bandwidths for unmeasured links in our model.

We first describe what a domain is. Grid nodes belonging to organizations, e.g., universities or companies, often form autonomous network management units (ANMUs) [18], i.e., domains, since the nodes are autonomously managed by specific organizational units, e.g., computer centers. In this study, we assume that Grid nodes in domains are tightly coupled and managed by single units or administrators. If they belong to geographically dispersed departments or are managed by different administrators/units, we call the domains "loose domains" (L-domains for short). We then recursively divide the L-domains into subdomains until each domain/subdomain is managed by one administrator or unit and the nodes in it are tightly coupled. We call such tightly coupled domains/subdomains "tight domains" (T-domains for short), in which nodes are often organized into specific network infrastructures, such as Fast Ethernet, Gigabit Ethernet, or InfiniBand. T-domains may be
connected by LANs or WANs. (Below, when no ambiguity results, T-domains are called simply domains.) Our design ensures that local fluctuations will not affect the behavior of the entire Grid system. Two questions about the model must then be addressed.

• How do we select representative nodes in each domain as borders without losing generality?
• How do we accurately evaluate unmeasured point-to-point network information in the domain-based model?

In our proposed model, jobs may be accomplished by coordination among several domains. In such cases, we consider the link bandwidths between two arbitrarily chosen borders to decide where jobs should be submitted, so programs can execute in parallel efficiently and delays in job transmissions and responses are shortest. The node with the worst bandwidth will limit total execution time; therefore, the representative border must be the node with the worst bandwidth. How can we know which node in a domain has the worst bandwidth? One way is to measure the paths between any two arbitrary nodes in the underlying domain and then select the end node with the worst link as the domain border, but this may not work in real Grid environments because the organizations that own the domains may use different policies in controlling their nodes. Another method is to ask domain administrators to select borders according to their network topologies or architectures (e.g., to select nodes topologically farthest from the routers that connect to the WLANs). However, neither method is by itself smart and easy to scale. Therefore, we propose a hybrid of the two methods.

1. When the Grid is first built up, start by choosing n domains (perhaps 2–4, if less than the total number of domains), then use one of the methods above to select their borders and save them in a border list.
2. The remaining domains select their own borders one by one, either according to their network topology or by testing all nodes in a domain against each border in the border list to select one as the border, which is then added to the list. Figure 5 (border selection diagram) illustrates an example. Assume that the top and right-bottom domains have already selected
210
C.-T. Yang et al.
their own borders in Step 1. We then perform Step 2 on the left domain, measuring the bandwidths of the paths between its five nodes and the top and right-bottom domain borders, i.e., R_1, R_2, …, R_5 and T_1, T_2, …, T_5. After that, we select the node with the lowest combined bandwidth, i.e., min_{1≤i,j≤5}(R_i + T_j).
3. Repeat Step 2 until every domain has found its most suitable border.

This not only effectively reduces the construction complexity of domain-based architectures and avoids measuring links between arbitrary node pairs across the whole network (rather than within domains), but also makes Grid environments scalable: when a new domain joins an existing Grid, Step 3 is invoked.

Fig. 6 Static domain-based network information model

Figure 6 illustrates a static domain-based network information model consisting of four sites, each of which has four nodes. Here, static means site borders are specific fixed nodes, e.g., A1, B1, C1, and D1, instead of dynamically selected ones. First, each site measures all internal links between arbitrary pairs of its nodes. After that, each head node (e.g., node A1 of site A) periodically measures the path bandwidths between itself and all other head nodes (e.g., nodes B1, C1, and D1).

Consider a Grid containing N nodes, each of which periodically measures its path bandwidths once per T seconds. According to (1), the numbers of paths measured in two metropolitan-scale Grid environments, UniGrid [13] (96 nodes) and TigerGrid [29] (46 nodes), are NMN(96) = 9,120 and NMN(46) = 2,070, respectively. When a domain-based network information model is in use, the number of paths measured in each period T is

NMS(N, [m_i]) = NMN(N) + Σ_i NMN(m_i),    (2)

where N is the total number of head nodes and m_i is the number of nodes in site_i. According to (2), the numbers of paths measured drop dramatically, to NMS(31, [4, 8, 8, 5, 8, 5, 7, 2, 1, 1, 3, 1, 4, 1, 3, 1, 1, 4, 8, 1, 1, 1, 1, 2, 2, 1, 4, 1, 2, 4, 1]) = 1,316 and NMS(12, [4, 4, 4, 4, 8, 2, 3, 4, 4, 4, 4, 1]) = 292, respectively.
We further define the reduction ratio R as

R = (NMN(N) − NMS(N, [m_i])) / NMN(N) × 100%.    (3)
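To make the counting concrete, here is a minimal sketch of formulas (1)–(3) applied to the two testbeds, using the site-size lists quoted above:

```python
def nmn(n: int) -> int:
    """Pair-wise measurements among n nodes, Eq. (1)."""
    return n * (n - 1)

def nms(heads: int, site_sizes: list[int]) -> int:
    """Domain-based measurements, Eq. (2): head-to-head plus intra-site."""
    return nmn(heads) + sum(nmn(m) for m in site_sizes)

def reduction(total_nodes: int, heads: int, site_sizes: list[int]) -> float:
    """Reduction ratio R, Eq. (3), in percent."""
    return (nmn(total_nodes) - nms(heads, site_sizes)) / nmn(total_nodes) * 100

unigrid = [4,8,8,5,8,5,7,2,1,1,3,1,4,1,3,1,1,4,8,1,1,1,1,2,2,1,4,1,2,4,1]
tiger = [4,4,4,4,8,2,3,4,4,4,4,1]
print(nmn(96), nms(31, unigrid), round(reduction(96, 31, unigrid), 2))  # 9120 1316 85.57
print(nmn(46), nms(12, tiger), round(reduction(46, 12, tiger), 2))     # 2070 292 85.89
```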
The Rs for the example above are 85.57% and 85.89%, respectively, showing that the domain-based network information model can reduce measurement effort and bandwidth consumption by roughly 86%, which is significant, particularly in large-scale Grids. Even though this model eliminates huge amounts of measurement effort and bandwidth use, it lacks network information between pairs of nodes belonging to different sites (unless both are borders). For example, the link (the target) between nodes A2 and B1 shown in Fig. 7 is not measured. Addressing the second question above, to obtain network bandwidths for unmeasured paths in our model, we use a few measured values gathered by the NWS to estimate the missing ones. The notation used is as follows.

• B_inavg: average intradomain bandwidth (Mb/s)
• B_outavg: average interdomain bandwidth (Mb/s)
• P_flu: bandwidth fluctuation amplitude (%)
• N_flu: number of bandwidth-fluctuation instances
• P_vaflu: valid fluctuation rate (%)
• L_ij[k]: the kth most recent intradomain bandwidth measurement from node_i to node_j

B_inavg and B_outavg are obtained by averaging bandwidth histories. P_flu is used to detect bandwidth consumption, and N_flu traces network fluctuations over given time periods, ignoring pulses and bandwidth noise; the algorithm, which resembles a sliding window, considers the bandwidths of the last N_flu measurements. P_vaflu gives the percentage of fluctuations within the N_flu window that are treated as actual bandwidth usage. The default values for all parameters are P_flu = 30%, N_flu = 10, time period = 5 seconds (achieved by setting the NWS detection period), and P_vaflu = 80%.
Fig. 7 Estimating target bandwidths
The target bandwidth can then be calculated as

B_tar = (B_rem / B_inavg) × B_outavg × α,    (4)

where α is a conversion factor from intradomain to interdomain bandwidth and B_rem is the remaining intradomain bandwidth, calculated as

B_rem = (Σ_{k=2}^{N_flu−1} L_ij[k]) / (N_flu − 2).    (5)
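A minimal sketch of the estimation in (4)–(5) follows. The measurement history and the α value are made-up assumptions for illustration; in the broker these come from NWS histories.

```python
def remaining_bandwidth(l_ij: list[float], n_flu: int = 10) -> float:
    """Eq. (5): average over the last n_flu samples, dropping the first and
    last in the window to ignore pulses and noise. Assumes len(l_ij) >= n_flu."""
    window = l_ij[-n_flu:]          # the last N_flu measurements
    trimmed = window[1:-1]          # indices k = 2 .. N_flu - 1
    return sum(trimmed) / (n_flu - 2)

def target_bandwidth(l_ij, b_in_avg, b_out_avg, alpha, n_flu=10):
    """Eq. (4): scale remaining intradomain bandwidth to the interdomain link."""
    return remaining_bandwidth(l_ij, n_flu) / b_in_avg * b_out_avg * alpha

history = [88, 85, 90, 87, 12, 86, 89, 84, 88, 87]   # Mb/s, assumed samples
print(round(target_bandwidth(history, b_in_avg=85.0, b_out_avg=40.0, alpha=0.9), 1))
```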
Fig. 8 The enhanced redesign of our previous work

We further enhanced the static model with a switching mechanism, yielding the dynamic domain-based network information model. Figure 8 shows an example. The principal improvement is switching the site head node to the next free node: for example, when node A1 is busy, the next free node, A2, becomes the head node of site A and measures the bandwidth between itself and nodes B3, C2, and D4, if those are the respective free nodes in sites B, C, and D. The purpose is to avoid having a busy node continue to act as a border, which would decrease system performance. There are three obvious advantages to this model (a sketch of the switching rule follows the list).

• First, the number of bandwidth measurements is the same as in the static model, so the measurement time complexity is not worsened.
• Second, bandwidth measurements between pairs of arbitrary nodes belonging to different sites are easily obtained.
• Finally, network bandwidth measurements yield real values instead of estimates, enabling the Resource Broker to effectively schedule jobs allocated to multiple sites.
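The switching rule itself is simple; a sketch under the assumption that each site keeps an ordered list of (node, busy) pairs:

```python
def current_head(site_nodes):
    """Return the first free node as the site's head (border); site_nodes is
    an ordered list of (name, busy) pairs, e.g. [("A1", True), ("A2", False)].
    Falls back to the first node if the whole site is busy."""
    for name, busy in site_nodes:
        if not busy:
            return name
    return site_nodes[0][0]

print(current_head([("A1", True), ("A2", False), ("A3", False)]))  # A2
```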
5 A Network Bandwidth-aware Job Scheduler

5.1 Performance evaluation mechanism

Our Grid environment consists of several sites (i.e., clusters). Jobs can be directly submitted to site nodes through the Resource Broker. We use ATP_i (average total computing power), formally defined later, to represent the total computing power of site i machines to be allocated to jobs. ATP_i has three main parts: CPU, memory, and intranetworking. Users must input the number of CPUs required to execute their jobs. The Information Service checks CPU availability and hardware information, including CPU utilization and speed, available memory, and network bandwidth, at each site. The Resource Broker then calculates the ATP for each site and chooses enough processors for the job based on the calculation results.

We use a statistical approach to analyze High Performance Linpack (HPL) execution results, as follows. We first fix the memory size and change the number of CPUs to conduct the HPL performance test. We then fix the number of CPUs, change the HPL problem size, and redo the performance test to determine the effect of changing memory size. We can then assign a performance value to each type of CPU and memory size in our environment based on the test results.

Let α_PE be the processor performance weight parameter for site s. It is derived from the correlation between CPU and HPL values, Cov(CPU, HPL), and that between memory and HPL values, Cov(memory, HPL), with 0 ≤ α_PE ≤ 1:

α_PE = Cov(CPU, HPL) / (Cov(CPU, HPL) + Cov(memory, HPL)). [31]    (6)

In other words, Cov(CPU, HPL) is evaluated with each machine's memory size fixed, to avoid the effect of differing memory sizes, and Cov(memory, HPL) is evaluated for a site without changing its processors and associated facilities, e.g., cache memory and bus speeds. After this evaluation, we conduct an HPL performance test on one of the sites under consideration and switch its network from gigabit to 10/100 Mb to determine the effect of network speed on system performance. There are two α_NE weights, one for gigabit, which is
α_NE(giga) = Cov(gigabit, HPL) / (Cov(gigabit, HPL) + Cov(10/100, HPL)), [31]    (7)
and the other for 10/100 Mb, which is

α_NE(10/100) = Cov(10/100, HPL) / (Cov(gigabit, HPL) + Cov(10/100, HPL)). [31]    (8)
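The following is a sketch of how the weights in (6)–(8) can be computed from benchmark series with NumPy. The HPL sample values are invented solely for illustration; in practice they come from the HPL tests described above.

```python
import numpy as np

def weight(cov_a: float, cov_b: float) -> float:
    """alpha = Cov(a, HPL) / (Cov(a, HPL) + Cov(b, HPL)), as in Eqs. (6)-(8)."""
    return cov_a / (cov_a + cov_b)

# Assumed benchmark series: HPL Gflops while varying one factor at a time.
cpus     = np.array([1, 2, 4, 8])
hpl_cpu  = np.array([1.5, 2.9, 5.6, 10.4])   # memory size fixed
mem_gb   = np.array([1, 2, 4, 8])
hpl_mem  = np.array([5.0, 5.4, 5.9, 6.1])    # processors fixed

cov_cpu = np.cov(cpus, hpl_cpu)[0, 1]        # Cov(CPU, HPL)
cov_mem = np.cov(mem_gb, hpl_mem)[0, 1]      # Cov(memory, HPL)
alpha_pe = weight(cov_cpu, cov_mem)          # Eq. (6)
print(round(alpha_pe, 3))
```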
5.2 The algorithm

Parameters used in our resource broker are listed and explained below.

• n: number of sites (domains) in a Grid environment
• S_i: site i, i = 1 ∼ n
• m_i: number of nodes in site i
• P(S_i): number of available processors in site i, P(S_i) ≤ m_i; the total number of processors available for a job execution is Y = Σ_{i=1}^{n} P(S_i)
• X: number of processors deployed to execute a given job
• Pval_ij: processor performance of node j in site i, i = 1 ∼ n, j = 1 ∼ m_i; Pval_ij is calculated using a hybrid approach that mixes node j's CPU speed and its benchmarking result, where the benchmark is run with node j's memory size fixed to reduce the effect of differing memory space
• Mval_ij: memory performance of node j in site i, i = 1 ∼ n, j = 1 ∼ m_i
• Pu_ij: processor utilization rate of node j in site i over the past minute, i = 1 ∼ n, j = 1 ∼ m_i
• Mu_ij: memory utilization of node j in site i over the past minute, i = 1 ∼ n, j = 1 ∼ m_i
• α_PEi: performance weight of processors in site i, 0 ≤ α_PEi ≤ 1
• 1 − α_PEi: memory performance weight for site i
• α_NEi: intranetworking effect ratio for site i, 0 ≤ α_NEi ≤ 1
• β: internetworking effect ratio for the Grid
• E_ik: graph constructed to represent the relation between sites i and k; edge weights are NWS-forecasted link bandwidths between the sites
• ATP(S_i): average total computing power of site i,

ATP(S_i) = [ (Σ_{j=1}^{m_i} Pval_ij · (1 − Pu_ij) · α_PEi) / P(S_i) + (Σ_{j=1}^{m_i} Mval_ij · (1 − Mu_ij) · (1 − α_PEi)) / P(S_i) ] · α_NEi,    (9)

where Pval_ij · (1 − Pu_ij) and Mval_ij · (1 − Mu_ij) are the currently available processor and memory performance, respectively, of node j in site i. The square bracket in this equation represents the inner effect of the machine. Let

ATP_i = β · Σ_{k=1, k≠i}^{n} E_ik + (1 − β) · ATP(S_i),    (10)

which is the computing power of site i.
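A minimal sketch of Eqs. (9)–(10) under the parameter definitions above; all input values are illustrative assumptions.

```python
def atp_site(pvals, pus, mvals, mus, p_avail, a_pe, a_ne):
    """Eq. (9): average total computing power of one site."""
    cpu_part = sum(p * (1 - u) for p, u in zip(pvals, pus)) * a_pe / p_avail
    mem_part = sum(m * (1 - u) for m, u in zip(mvals, mus)) * (1 - a_pe) / p_avail
    return (cpu_part + mem_part) * a_ne

def atp(site_idx, e, atp_s, beta):
    """Eq. (10): site computing power including the internetworking effect;
    e[i][k] is the forecasted bandwidth between sites i and k."""
    inter = sum(e[site_idx][k] for k in range(len(e)) if k != site_idx)
    return beta * inter + (1 - beta) * atp_s

# Illustrative two-node site and a 3-site bandwidth graph.
s0 = atp_site(pvals=[10.0, 8.0], pus=[0.2, 0.5], mvals=[4.0, 4.0],
              mus=[0.3, 0.6], p_avail=2, a_pe=0.7, a_ne=0.9)
e = [[0, 40, 17], [40, 0, 45], [17, 45, 0]]
print(round(atp(0, e, s0, beta=0.3), 2))
```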
We summarize the Network Bandwidth-aware Job Scheduling algorithm in Fig. 9. Choosing the smallest Q obtains a local minimum and reduces the internetworking effect; after adding β, the result is globally optimal. Let a Q-set, Q_p, be a Q-combination of sites. Its computing power is then

ATP_{Q_p} = β · Σ_{i∈Q_p, k∉Q_p} E_ik + (1 − β) · Σ_{j∈Q_p} ATP(S_j),    p = 1, 2, …, C^n_Q.    (11)

Fig. 9 The WCGRB job-scheduling algorithm
n , and Q nodes The Q-set with max1≤p≤CQn ATPQp , e.g., Qx , is chosen, 1 ≤ p ≤ CQ from i∈Qx mi nodes are selected, one per site, as the head nodes of the Q-set. Qx now has Q links connecting itself to the remaining sites, instead of only one link, thus avoiding having a single link become a bottleneck for the Q-set. However, in a
hierarchical/layered Grid architecture, let ATP^L_j be the computing power of domain j at layer L, j ∈ Q_p, and let Q_p be a Q-set at layer L + 1; then

ATP^{L+1}_{Q_p} = β · Σ_{i∈Q_p, k∉Q_p} E_ik + (1 − β) · Σ_{j∈Q_p} ATP^L_j,    p = 1, 2, …, C^n_Q.    (12)
Fig. 10 An example of a Grid testbed

Assume the Grid shown in Fig. 10 consists of five domains, with the Resource Broker installed in Domain A. "A(8)" means there are eight working nodes (processors)
in site A. The number "40" represents the current communication bandwidth (in Mbps) between sites A and B. The Resource Broker first queries the Information Service to acquire the current statuses of all working nodes. There are three cases; a sketch of this case analysis follows the list.

• Case 1: If a job needs, say, 8 processors, the Resource Broker checks all probable sites to identify those containing at least 8 processors. If more than one site qualifies, an ATP is calculated for each, and the Resource Broker allocates the top 8 processors in the best site (site A or site C in this example, based on ATP values) to the job.
• Case 2: If the job requires, say, 12 processors, no single site has enough processors to run it. The Resource Broker counts the processors in each combination of two sites and sorts the ATPs of the combinations with at least 12 processors ((A, B), (A, C), (A, E), (B, C), and (C, E) in this example), then selects the best pair of sites and allocates the 12 top-ranked processors to the job.
• Case 3: If the job needs, say, 16 processors, only the combination of sites A and C qualifies, so the Resource Broker immediately allocates the 16 processors top-ranked for speed in sites A and C to the job.
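The case analysis above can be condensed into one search: find the smallest Q whose best site combination supplies enough processors, then rank combinations by Eq. (11). The sketch below assumes per-site processor counts and precomputed ATP(S_i) values; it is a simplification of the algorithm in Fig. 9, not a transcription of it, and the numbers (apart from the 8 CPUs of site A and the 40 Mbps A–B link) are invented.

```python
from itertools import combinations

def best_q_set(sites, procs, atp_s, e, beta, needed):
    """Pick the Q-set maximizing Eq. (11) for the smallest feasible Q.
    sites: site indices; procs[i], atp_s[i]: processors and ATP(S_i) of site i;
    e[i][k]: forecasted bandwidth between sites i and k."""
    for q in range(1, len(sites) + 1):
        feasible = [c for c in combinations(sites, q)
                    if sum(procs[i] for i in c) >= needed]
        if not feasible:
            continue  # no Q-site combination has enough processors; grow Q

        def atp_qp(c):
            outside = [k for k in sites if k not in c]
            inter = sum(e[i][k] for i in c for k in outside)
            return beta * inter + (1 - beta) * sum(atp_s[i] for i in c)

        return max(feasible, key=atp_qp)
    return None  # the whole Grid cannot satisfy the request

# Five-site example loosely following Fig. 10: A..E with 8, 4, 8, 2, 6 CPUs.
procs = [8, 4, 8, 2, 6]
atp_s = [9.0, 4.5, 8.2, 2.0, 5.1]
e = [[0, 40, 17, 12, 30], [40, 0, 45, 8, 11], [17, 45, 0, 22, 9],
     [12, 8, 22, 0, 14], [30, 11, 9, 14, 0]]
print(best_q_set(range(5), procs, atp_s, e, beta=0.3, needed=12))
```

With needed=12 the search skips Q=1 (no single site has 12 processors) and evaluates the two-site combinations, mirroring Case 2 above.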
6 Experimental environment and results

A metropolitan-scale computing platform called the TIGER (Taichung Integrating Grid Environment and Resource) Grid was used in this experiment. This Grid interconnects 12 computing sites (clusters) containing 46 nodes with 89 processors, located at seven educational institutes: Tunghai University (THU), National Changhua University of Education (NCUE), National Taichung University (NTCU), Hsiuping Institute of Technology (HIT), National Dali Senior High School (DALI), Lizen High School (LZSH), and Long Fong Elementary School (LFPS). Specifications, HPL performance values, and site rankings for the TIGER Grid are listed in Table 1. A comparison of these sites, including site ranking, number of processors, sum of processor speeds, and sum of memory capacities, is shown in Fig. 11 (overview of Grid resources in TIGER). TIGER network bandwidth information is listed in Table 2, and THU and TIGER site topologies are shown in Fig. 12.

The first experiment compares the static network information model (SNIM) and the dynamic network information model (DNIM). Two sites, eta and beta, were used. We transferred a 5 GB file from eta-2 to beta-1 and observed the bandwidth between them every 60 seconds. Figure 13 shows that the connection bandwidth from eta-2 to beta-1 obtained using SNIM is a smooth curve, which cannot reflect the actual situation, whereas DNIM captures the variations in the link. Figure 14 illustrates SNIM's unstable error-rate fluctuation (worse than DNIM by up to 428.37%), which provides the resource broker with unreliable information and can cause it to make wrong decisions.

In the second experiment, a sequence of 100 jobs was randomly generated as template jobs, each with an np, to simulate the submission of 100 running jobs, where "np" is the number of processors required by each job. The jobs were dispatched by the Network Bandwidth-aware Job Scheduler, and relevant information, including time waiting in queues, turnaround time, and resource utilization, was logged.
Table 1 Specifications, HPL performance, and site ranking for the TIGER testbed

Site   | Nodes/CPUs | Total speed (MHz) | Total memory (MB) | HPL (Gflops) | Site ranking
alpha  | 4/8        | 16,000            | 4,096             | 12.5683      | 10
beta   | 4/8        | 22,400            | 4,096             | 20.1322      | 11
gamma  | 4/4        | 11,200            | 4,096             | 5.8089       | 5
delta  | 4/4        | 12,000            | 4,096             | 10.6146      | 7
eta    | 2/4        | 12,800            | 2,048             | 11.2116      | 8
mu     | 2/4        | 8,000             | 4,096             | 11.8500      | 9
ncue   | 4/16       | 32,000            | 16,384            | 28.1887      | 12
ntcu   | 4/5        | 3,250             | 1,024             | 1.0285       | 2
hit    | 4/4        | 11,200            | 2,048             | 7.0615       | 6
dali   | 4/4        | 7,200             | 512               | 2.8229       | 3
lz     | 4/4        | 2,700             | 768               | 0.8562       | 1
lf     | 1/1        | 3,000             | 1,024             | 3.0389       | 4
Table 2 Site network information: pair-wise bandwidths (Mb/s) measured among the twelve TIGER sites. The full 12 × 12 matrix is symmetric; measured values range from a few Mb/s on the slowest inter-site links to nearly 800 Mb/s inside gigabit-connected sites, and the single-node lf site has no intra-site measurement (N/A).
Figure 15 shows the distribution of template jobs; the X-axis represents the job type and the Y-axis the job dispatch time. Jobs were formatted as task_mpi N, where mpi means an MPI parallel program is deployed and "task" represents the main job task, which included the following (a small illustrative sketch follows the list):
(1) bw_mpi: uses the MPI library to perform TCP packet transmissions between two nodes across links, consuming more of the link bandwidth as its packet-size argument grows; programs consuming 64 kb, 128 kb, 256 kb, 512 kb, and 1 Mb of bandwidth were deployed,
(2) Hello3_mpi: a hello-world MPI program [28],
(3) mmd_mpi: performs square matrix multiplication, gradually using more memory as its matrix-size argument grows; matrix sizes of 128×128, 256×256, 512×512, and 1024×1024 integers were used,
(4) nqueen_mpi: solves the N-Queen problem, which requires placing n queens on an n-by-n chessboard such that no two queens attack each other, i.e., no two queens share the same row, file, or diagonal; n = 2, 4, and 8 were used,
(5) pi_mpi: computes the value of Pi (π) to a given precision using numerical integration; precisions of 64, 128, 256, and 1,024 digits were required,
(6) prime_mpi: finds all prime numbers between 0 and the number given as its argument; 64, 128, 256, 512, and 1,024 were used.
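For illustration, here is a minimal mpi4py sketch in the spirit of pi_mpi. The original template jobs were separate MPI programs; this reimplementation, including the subinterval count, is an assumption.

```python
# pi_mpi-style sketch: midpoint-rule integration of 4/(1 + x^2) on [0, 1].
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

n = 1_000_000                       # subintervals (controls precision)
h = 1.0 / n
# Each rank sums every size-th strip, starting at its own rank.
local = sum(4.0 / (1.0 + ((i + 0.5) * h) ** 2) for i in range(rank, n, size)) * h
pi = comm.reduce(local, op=MPI.SUM, root=0)
if rank == 0:
    print(pi)                       # run with: mpiexec -n 4 python pi_mpi.py
```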
Fig. 16 Comparison of three policies on average job turnaround and waiting time
Fig. 17 Comparison of three policies on resource utilization
To show that the Network Bandwidth-aware Job Scheduler performed better than two other job-scheduling algorithms, network-only and speed-only, the same generated job sequences were submitted to each for execution and comparison (see the sketch at the end of this section).

• Network-only: considers only network information. If a single site has enough processors to satisfy a job, the site with the highest β · α_NEi is chosen. When Q sites are needed, the combination with the largest Σ_{i=1}^{Q} β · α_NEi value is selected, 2 ≤ Q ≤ n.
• Speed-only: considers only CPU clock information. If a single site is sufficient to handle the job, the site with the largest summed CPU clock in its intranet is chosen. When Q sites are required, those with the top-Q CPU clock summation values are allocated, 2 ≤ Q ≤ n.

Experimental results are shown in Figs. 16 and 17. Job turnaround time is defined as the average of queue time plus execution time. As Fig. 16 shows, the Network Bandwidth-aware Job Scheduler outperformed the other two. Figure 17 shows resource utilization statistics: the Network Bandwidth-aware Job Scheduler clearly increased the utilization of powerful sites and decreased total turnaround times.
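Both baselines reduce to a greedy top-Q selection; a sketch under the same assumptions as the Q-set example in Sect. 5:

```python
def top_q(sites, procs, score, needed):
    """Greedy baseline: take the highest-scoring sites until 'needed' CPUs
    are covered. score[i] is beta * alpha_NE_i for network-only, or the
    site's summed CPU clock for speed-only."""
    chosen, total = [], 0
    for i in sorted(sites, key=lambda i: score[i], reverse=True):
        chosen.append(i)
        total += procs[i]
        if total >= needed:
            return chosen
    return None  # not enough processors in the whole Grid
```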
7 Conclusions and future research

This paper proposes a new version of WCGRB that uses Network Bandwidth-aware Job Scheduling to help users make better use of available Grid resources. WCGRB monitors resource usage and statuses for Grids using the Ganglia toolkit, further improving the information services provided by the Globus toolkit. Our Grid resource brokerage system discovers and evaluates Grid resources, so submitted jobs can be executed by appropriate Grid resources that meet their requirements. We described the design and implementation of this Resource Broker in detail. The experimental results show that the new WCGRB significantly enhances the capabilities of our Grid resource broker by considering an additional feature, network bandwidth, provided by the NWS tool, in job scheduling. This integration may facilitate the design and implementation of new mapping/scheduling mechanisms that take both network and computational resources into account. In the future, we would like to derive mathematical performance and reliability models so that users of our resource broker can predict its probable performance and reliability.
References
1. Aloisio G, Cafaro M (2002) Web-based access to the grid using the grid resource broker portal. Concurr Comput Pract Exp 14:1145–1160
2. Aloisio G, Cafaro M, Carteni G, Epicoco I, Fiore S, Lezzi D, Mirto M, Mocavero S (2007) The grid resource broker portal. Concurr Comput Pract Exp 19(12):1663–1670
3. Cafaro M, Epicoco I, Mirto M, Lezzi D, Aloisio G (2007) The grid resource broker workflow engine. In: Proceedings of the international conference on grid and cooperative computing, 2007
4. Chen XL, Yang C, Lu SL, Chen GH (2004) An active resource management system for computational grid. In: Proceedings of grid and cooperative computing. Lecture notes in computer science, vol 3251. Springer, Berlin, pp 225–232
5. Czajkowski K, Fitzgerald S, Foster I, Kesselman C (2001) Grid information services for distributed resource sharing. In: Proceedings of the tenth IEEE international symposium on high-performance distributed computing. IEEE Press, New York
6. Deelman E, Singh G, Su M, Blythe J, Gil Y, Kesselman C, Mehta G, Vahi K, Berriman GB, Good J, Laity A, Jacob JC, Katz DS (2005) Pegasus: a framework for mapping complex scientific workflows onto distributed systems. J Sci Program 13(3):219–237
7. Ferreira L, Berstis V, Armstrong J, Kendzierski M, Neukoetter A, Takagi M, Bing-Wo R, Amir A, Murakawa R, Hernandez O, Magowan J, Bieberstein N (2003) Introduction to grid computing with Globus. http://www.ibm.com/redbooks
8. Foster I (2002) The grid: a new infrastructure for 21st century science. Phys Today 55(2):42–47
9. Foster I, Karonis N (1998) A grid-enabled MPI: message passing in heterogeneous distributed computing systems. In: Proceedings of the supercomputing conference, 1998
10. Foster I, Kesselman C (1997) Globus: a metacomputing infrastructure toolkit. Int J Supercomput Appl 11(2):115–128
11. Foster I, Kesselman C (2003) The grid 2: blueprint for a new computing infrastructure, 2nd edn. Morgan Kaufmann, San Mateo
12. von Laszewski G, Hategan M (2005) Workflow concepts of the Java CoG kit. J Grid Comput 3(3–4):239–258
13. Ho LY, Liu PF, Wang CM, Wu JJ (2008) The development of a drug discovery virtual screening application on Taiwan UniGrid. In: Proceedings of advances in grid and pervasive computing. Lecture notes in computer science, vol 5036. Springer, Berlin, pp 38–47
14. Karonis NT, Toonen B, Foster I (2003) MPICH-G2: a grid-enabled implementation of the message passing interface. J Parallel Distrib Comput 63(5):551–563
15. Krauter K, Buyya R, Maheswaran M (2002) A taxonomy and survey of grid resource management systems for distributed computing. Softw Pract Exp 32:135–164
16. von Laszewski G, Foster I, Gawor J, Lane P (2001) A Java commodity grid kit. Concurr Comput Pract Exp 13:645–662
17. Le H, Coddington P, Wendelborn AL (2004) A data-aware resource broker for data grids. In: Proceedings of the IFIP international conference on network and parallel computing. Lecture notes in computer science, vol 3222. Springer, Berlin, pp 442–453
18. Leu FY, Li MC, Lin JC, Yang CT (2008) Detection workload in a dynamic grid-based intrusion detection environment. J Parallel Distrib Comput 68:427–442
19. Nabrzyski J, Schopf JM, Weglarz J (2005) Grid resource management. Kluwer Academic, Dordrecht
20. Park SM, Kim JH (2003) Chameleon: a resource scheduler in a data grid environment. In: Proceedings of the IEEE/ACM international symposium on cluster computing and the grid, 2003, pp 258–265
21. Penchikala S (2008) Terracotta 2.6 supports cluster visualization tools and Tomcat 6 integration. http://www.infoq.com/news/2008/04/terracotta-2.6-release
22. Rodero I, Corbalán J, Badia RM, Labarta J (2005) eNANOS grid resource broker. In: Lecture notes in computer science, vol 3470. Springer, Berlin, pp 111–121
23. Shah SP, He DYM, Sawkins JN, Druce JC, Quon G, Lett D, Zheng GXY, Xu T, Ouellette BFF (2004) Pegasys: software for executing and integrating analyses of biological sequences. BMC Bioinform 5:40
24. Toyama T, Yamada Y, Konishi K (2006) A resource management system for data-intensive applications in desktop grid environments. In: Parallel and distributed computing and systems. Acta Press, Calgary
25. Venugopal S, Buyya R, Winton L (2006) A grid service broker for scheduling e-science applications on global data grids. Concurr Comput Pract Exp 18:685–699
26. Yang CT, Lai CL, Shih PC, Li KC (2004) A resource broker for computing nodes selection in grid environments. In: Proceedings of grid and cooperative computing: international conference. Lecture notes in computer science, vol 3251. Springer, Berlin, pp 931–934
27. Yang CT, Lai CL, Li KC, Hsu CH, Chu WC (2005) On utilization of the grid computing technology for video conversion and 3D rendering. In: Proceedings of parallel and distributed processing and applications: third international symposium. Lecture notes in computer science, vol 3758. Springer, Berlin, pp 442–453
28. Yang CT, Shih PC, Li KC (2005) A high-performance computational resource broker for grid computing environments. In: Proceedings of the international conference on AINA, vol 2, 2005, pp 333–336
29. Yang CT, Li KC, Chiang WC, Shih PC (2005) Design and implementation of TIGER grid: an integrated metropolitan-scale grid environment. In: Proceedings of the 6th IEEE international conference on PDCAT, Dec 2005, pp 518–520
30. Yang CT, Shih PC, Chen SY, Shih WC (2005) An efficient network information modeling using NWS for grid computing environments. In: Proceedings of the international conference on grid and cooperative computing. Lecture notes in computer science, vol 3795. Springer, Berlin, pp 287–299
31. Yang CT, Lin CF, Chen SY (2006) A workflow-based computational resource broker with information monitoring in grids. In: Proceedings of the IEEE international conference on grid and cooperative computing, 2006, pp 199–206
32. Yang CT, Chen TT, Chen SY (2007) Implementation of monitoring and information service using Ganglia and NWS for grid resource brokers. In: Proceedings of the IEEE Asia-Pacific services computing conference, 2007, pp 356–363
33. Yang CT, Lai KC, Shih PC (2008) Design and implementation of a workflow-based resource broker with information system on computational grids. J Supercomput. doi:10.1007/s11227-008-0201-5
Chao-Tung Yang is a professor of computer science and information engineering at Tunghai University in Taiwan. He received a B.S. degree in computer science and information engineering from Tunghai University, Taichung, Taiwan, in 1990, an M.S. degree in computer and information science from National Chiao Tung University, Hsinchu, Taiwan, in 1992, and the Ph.D. degree in computer and information science from National Chiao Tung University in July 1996. He won the 1996 Acer Dragon Award for an outstanding Ph.D. dissertation. Beginning in 1996, he worked as an associate researcher for ground operations in the ROCSAT Ground System Section (RGS) of the National Space Program Office (NSPO) in Hsinchu Science-based Industrial Park. In August 2001, he joined the faculty of the Department of Computer Science and Information Engineering at Tunghai University, where he received the excellent research award in 2007. His research has been sponsored by the Taiwan agencies National Science Council (NSC), National Center for High-Performance Computing (NCHC), and the Ministry of Education. His present research interests are grid and cluster computing, parallel and high-performance computing, and Internet-based applications. He is a member of both the IEEE Computer Society and the ACM.

Fang-Yie Leu received his B.S., master's, and Ph.D. degrees from National Taiwan University of Science and Technology, Taiwan, in 1983, 1986, and 1991, respectively, and another master's degree from the Knowledge Systems Institute, USA, in 1990. His research interests include wireless communication, network security, grid applications, and Chinese natural language processing. He is currently an associate professor at Tunghai University, Taiwan, and director of the university's database and network security laboratory. He is also a member of the IEEE Computer Society.
Sung-Yi Chen received a B.S. degree from the Department of Computer Science and Information Engineering at Tunghai University in 2005, where he has been studying for the M.S. degree since September 2005. His research interests include grid and cluster computing, parallel and high-performance computing, and grid and pervasive computing.