Step by step, how to build the grid


1) Build your own grid
   a. Step 1: Read about grid computing
   b. Step 2: Designing the grid
   c. Step 3: The possible grid resources
   d. Step 4: Grid topology
   e. Step 5: Make the grid middleware choice
2) Globus Toolkit Services
   a. Globus Resource Allocation Manager (GRAM)
   b. Grid Security Infrastructure (GSI)
   c. Monitoring and Discovery Service (MDS)
   d. Grid Resource Information Service (GRIS)
   e. Grid Index Information Service (GIIS)
   f. Grid File Transfer Protocol (GridFTP)
   g. Communication services (Nexus)
   h. System monitoring (HBM)
   i. Remote Data Access (GASS)
   j. The Replica Management system
3) Portable Batch System (PBS)
   a. Software and installation architecture
   b. PBS daemons and job life cycle
   c. Scheduling and queue management
   d. Most commonly used PBS commands
4) Why Globus and PBS?

1) Build your own grid

Building a grid is not a simple task.

a. Step 1: Read about grid computing

Before building your own grid, you must know the basics of grid computing. If you do not, read the "Grid Computing" chapter first to get a clear idea of it.

b. Step 2: Designing the grid

Just like civil engineers building a bridge, software engineers building a grid must specify an overall design before they start work. This design, called the grid architecture, identifies the fundamental components that serve the grid's purpose and function. The designers have to answer questions such as: Do we really need to build a grid? What is its purpose? What type of grid do we want? What components do we need? What physical and logical structure should be selected? What types of applications do we want to run on it? Will it be for public or restricted use? Will it have a commercial interest?

In our case we are going to build a grid that:
- will be used for research activities
- valorizes otherwise wasted resources
- is restricted to the use of the research staff
- has no commercial interest
- can be interconnected in the future with other grids

c. Step 3: The possible grid resources

A grid depends on underlying hardware: without computers and networks, you cannot have a grid. That is why you need to inventory all the candidate resources and evaluate their state.

- Possible resources:

  Resource  Number  Processor                              RAM     Hard disc
  Lab1      09      Intel® Celeron® CPU E1400 @ 2.00 GHz   504 MB  60 GB
  Lab2      08      Pentium(R) 4 CPU @ 3.00 GHz            256 MB  80 GB

  Figure 1: Grid possible resources


- Disc space utilization:

  Figure 2: Average disc space utilization rate

- Operating systems:
  - Microsoft Windows XP Professional version 2002, Service Pack 3
  - Linux Debian 5.5

- Network connections:

  Resource  Number  Network card  RAM
  Lab1      09      n/a           504 MB
  Lab2      08      n/a           256 MB

- Special resource possibilities: hardware, software, licenses...

d. Step 4: Grid topology

Topology refers to the layout of connected clusters and nodes. It can be physical or logical. Physical topology means the physical design of the grid, including the devices, their location, and the network cabling. Logical topology refers to how the different nodes and clusters constituting the grid cooperate to execute jobs. Many grid topologies are possible.


Figure 3: Different possible grid topologies1

In our case, we are going to create a simple testbed composed of two nodes (two labs) and a grid client. We also designed a new robust, stable, scalable, load-balanced grid topology for TunGrid, presented in detail in the third chapter.

Figure 4: The created grid: Cluster1 (head node, Poste11 to Poste14), Cluster2 (execution host, Poste21 to Poste24), and the grid client

e. Step 5: Make the grid middleware choice

The grid middleware is the crucial component of the grid: it is what makes grid computing possible. The middleware organizes and coordinates separate grid resources into a coherent whole. Conceptually, middleware sits "in the middle", between operating system software (like Windows or Linux) and application software (like a weather forecasting program).

1 Source: http://upload.wikimedia.org/wikipedia/commons/9/96/NetworkTopologies.png


Making the grid architecture choice is a strategic decision for any grid project. It starts with specifying the design solution objectives: the aim is to know what the grid must do. The design objectives provide a basic framework for building the grid infrastructure, and documenting them early captures the areas that can affect the overall design. Within your design, you need to make sure that the grid can provide a certain amount of security, availability, and performance. Documenting these objectives and requirements makes your design much easier to follow, and lets you justify decisions during the course of the design by coming back to the objectives and checking that they were met.

Once the design objectives have been defined, you can separate them into individual subsystems. This allows each design objective to be worked on in parallel while preserving the cohesiveness of the overall architecture. Once you have documented the core subsystems of the design, you can focus on the different requirements that your grid design comprises. When you start building the initial pieces of your design, make sure that your solution objectives line up with the customer's requirements. For a grid design this is especially important, as there are not only the standard infrastructure components to consider, but specialized middleware and application integration issues as well. Making sure that your solution objectives satisfy the stated requirements will allow you to design a working grid.

In the next sections we present in detail the Globus Toolkit services and the Portable Batch System (PBS).

2) Globus Toolkit Services

The Globus Toolkit is an open-source software toolkit used to build grid systems and applications. Its tools cover security, resource location, resource management, communications, and more. It is developed by the Globus Alliance, a team primarily involving Ian Foster's group at Argonne National Laboratory and Carl Kesselman's group at the University of Southern California in Los Angeles, along with many others around the world. The Globus Toolkit is a central part of science and engineering projects that total nearly a half-billion dollars internationally, and a growing number of projects and companies are using it to unlock the potential of grids for their cause and to build significant commercial grid products. Many of the protocols and functions defined by the Globus Toolkit are similar to those used in networking and storage today, but have been optimized for grid-specific deployments. It includes:


Figure 5: System overview of the Globus Toolkit: the user's client initializes a GSI proxy and submits jobs over RSL/HTTP 1.1 to the GRAM gatekeeper, whose jobmanager allocates processes; MDS (the distributed GIIS and the per-resource GRIS) is queried over LDAP for resource finding; GridFTP handles data transfer and control over gsiftp/http/file (source: [reference missing])

a. Globus Resource Allocation Manager (GRAM)

GRAM is a set of service components that provides a single standard interface for remote job submission and control. It is designed to provide one common protocol and API for requesting and using remote system resources, offering a uniform, flexible interface to local job scheduling systems. GRAM hides from users and application developers the heterogeneity and complexity of local management mechanisms such as schedulers, queuing systems, reservation systems, and control interfaces.
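As an illustration, a GRAM job request is typically written in the Resource Specification Language (RSL). The sketch below only generates a minimal RSL file; the `globusrun` submission line is left commented because it requires a running gatekeeper and a valid proxy, and the hostname in it is a hypothetical placeholder.

```shell
# Write a minimal RSL job description (hypothetical job: run /bin/hostname once).
cat > job.rsl <<'EOF'
& (executable = /bin/hostname)
  (count = 1)
  (stdout = hostname.out)
EOF

# With Globus installed one would submit it to a gatekeeper
# (hostname is an assumed placeholder; not executed here):
#   globusrun -r headnode.example.org -f job.rsl
cat job.rsl
```

The `&` operator simply conjoins the attribute/value relations that make up the request.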

b. Grid Security Infrastructure (GSI)

GSI provides mutual authentication of users and remote resources using PKI-based2 identities and the Secure Sockets Layer (SSL3) communication protocol, determines their access rights, and secures and protects communications over an open network. GRAM provides a simple authorization mechanism based on GSI identities and a mechanism to map GSI identities to local user accounts.
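In practice, GSI's identity-to-account mapping is driven by a grid-mapfile that pairs certificate distinguished names with local user names. The entry below is purely hypothetical (an assumed DN and account name), shown only to illustrate the file format; it is written to a local file rather than the real system path.

```shell
# Build a sample grid-mapfile entry (DN and local account are assumed values).
GRIDMAP="./grid-mapfile"   # real deployments use /etc/grid-security/grid-mapfile
echo '"/O=Grid/OU=TunGrid/CN=Jane Researcher" jresearch' > "$GRIDMAP"

# A short-lived proxy credential would normally be created before submitting
# jobs (not executed here):
#   grid-proxy-init -valid 12:00
cat "$GRIDMAP"
```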

2 "The Public Key Infrastructure (PKI) is a set of hardware, software, people, policies, and procedures needed to create, manage, store, distribute, and revoke digital certificates." Wikipedia, 5 June 2009, http://en.wikipedia.org/wiki/Public_key_infrastructure

3 "Secure Sockets Layer (SSL), predecessor of Transport Layer Security (TLS), and TLS are cryptographic protocols that provide security and data integrity for communications over networks such as the Internet. TLS and SSL encrypt the segments of network connections at the transport layer end-to-end." Wikipedia, 18 June 2009, http://en.wikipedia.org/wiki/Transport_Layer_Security

c. Monitoring and Discovery Service (MDS)


It is a suite of web services that includes a set of components to discover and to monitor several resources in a Virtual Organization (VO). It collects information about available resources such as processing capacity, bandwidth capacity, type of storage… and their status. The Globus Toolkit provides core architecture and an implementation for publishing, locating, and subscribing to information.

d. Grid Resource Information Service (GRIS)

A GRIS is associated with each resource. It answers queries from grid member machines about that particular resource's current configuration, capabilities, and status. The Grid Information Service works as "white pages", providing resource information such as "how much memory does this machine have?", and as "yellow pages" for resource options such as "which queues allow large jobs?". GRIS accesses an "information provider" deployed on the resource for the requested information. The local information maintained by GRIS is updated on request and cached for a period of time known as the time-to-live (TTL). If no request for the information is received by GRIS, the information times out and is deleted; if a later request arrives, GRIS calls the relevant information provider(s) to retrieve the latest information.
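As a sketch of how a GRIS is queried, MDS information is exposed over LDAP, so a query is just an LDAP filter sent to the GRIS port. The hostname below is an assumed placeholder, and the actual `grid-info-search` invocation is commented out since it needs a running GRIS.

```shell
# Compose an LDAP filter asking a node's GRIS about itself (hostname assumed).
HOST="node1.example.org"
FILTER="(&(objectclass=MdsHost)(Mds-Host-hn=${HOST}))"

# With Globus installed, the query would be sent to the GRIS on its default
# port 2135 (not executed here):
#   grid-info-search -h "$HOST" -p 2135 -b "Mds-Vo-name=local, o=Grid" "$FILTER"
echo "$FILTER"
```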

e. Grid Index Information Service (GIIS)

The GIIS coordinates arbitrary GRIS services. A GRIS is able to register its information with a GIIS, but a GRIS itself does not receive registration requests. The GIIS is a directory service that collects ("pulls") information from GRISs. It works as a "caching" service providing indexing and searching functions.

f. Grid File Transfer Protocol (GridFTP)

GridFTP is an extension of the standard File Transfer Protocol (FTP) for use with grid computing. It provides more reliable, secure, robust, and high-performance data transfer, optimized for grid computing applications. It solves the problem of incompatibility between storage and access systems by providing a uniform library of access functions for all data sources, combining the total available data into a single partition. GridFTP applies basic grid security on both the control (command) and data channels. GridFTP was explained in the first chapter as one of the most important standards.
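As a usage sketch, a GridFTP transfer is typically driven with `globus-url-copy` between `gsiftp://` URLs. The endpoints below are assumed placeholders; the transfer command itself is commented out since it needs live GridFTP servers and a valid proxy.

```shell
# Build source and destination URLs (hosts and paths are hypothetical).
SRC="gsiftp://cluster1.example.org/home/data/input.dat"
DST="gsiftp://cluster2.example.org/scratch/input.dat"

# With Globus installed one would run (not executed here):
#   globus-url-copy -vb "$SRC" "$DST"
echo "$SRC -> $DST"
```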

g. Communication services (Nexus)

Nexus is a portable library that provides multi-threaded communication facilities within heterogeneous parallel and distributed computing environments. It was created by Ian Foster and Carl Kesselman at Argonne National Laboratory and the University of Southern California. Its targeted use is in the development of advanced languages and libraries on such platforms; it is not primarily intended as an application-level tool. It provides:
- multiple threads of execution
- dynamic processor acquisition
- dynamic address space creation
- a global memory model
- asynchronous events
- support for heterogeneity at multiple levels


Nexus has been redesigned and streamlined as a component of Globus, implemented on top of core Globus services such as the Globus thread library. But a user interested only in Nexus can use it alone, as a module independent of other Globus services. Nexus is the communication component of Globus, and it supports multiple communication protocols as well as resource characterization mechanisms that allow automatic selection of optimal protocols. It also provides a global memory model.

h. System monitoring (HBM)

The Heart Beat Monitor (HBM), also called system monitoring and fault monitoring, is designed to provide simple fault monitoring for remote machines. Reporting faults is a difficult task, as failures can occur for many different reasons. The HBM monitors processes on remote machines but is also used to detect network failure; in that case the monitor registers the lack of a response rather than a report of an actual failure. So, in effect, if for example a network breakdown stopped a remote machine from sending its signal back, the machine in question would be assumed to have gone down. The HBM contains three main components:
- Heart Beat Monitor Client Library (HBM-CL): its main function is to provide a way to register processes for monitoring. The request generated by the HBM-CL's globus_hbm_client_register() call is passed on to the HBM-LM.
- Heart Beat Monitor Local Monitor (HBM-LM): the HBM-LM runs on any machine that monitors processes. Its job is to accept monitoring requests from the HBM-CL and to monitor the jobs based on a simple timer mechanism. It then reports all pertinent information back to the HBM-DC; for example, if it received an HBM-CL request to stop monitoring a specific job, it would report this fact to the HBM-DC the next time it transmitted.
- Heart Beat Monitor Data Collector (HBM-DC): the HBM-DC is a centrally located server responsible for collecting information from the individual HBM-LMs around a network and for providing, on request, information on the status of those jobs. It is ultimately responsible for monitoring the frequency of replies from the various remote HBM-LMs; if one stopped reporting, the machine would be assumed inaccessible.

i. Remote Data Access (GASS)

The remote data access service is responsible for providing uniform access to diverse storage management systems and cache management for jobs running on the grid. A running job may require additional resources in order to achieve its task, for example the transfer of data files from a remote machine to the place where the job is being executed; these are listed in the job request sent to the execution machine. On receiving that request, the gatekeeper examines it and, if it determines that additional resources are required, sets up a GASS server on that machine to deal with any requests the program will make. The GASS server itself allows users to put files in a local cache accessible to the job running at the time. Any file can be transferred as long as it resides on an FTP or HTTP server on an accessible network.

Allowing a user to access files requires the user to be authorized to access the remote resource. Therefore, in order to comply with requests for information, the GASS server has to make use of the GSI to make sure the machines involved can be authenticated. Once the running program finishes, the GASS server deletes the cache and shuts down. It is important to note that the use of GASS and its cache is, at present, the only way to access files for use in execution; local files cannot be accessed in any other way.
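For completeness, a GASS server can also be started by hand, and files it serves are referenced through URLs. The sketch below only builds such a URL; the server command is commented out since it needs Globus installed, and the host, port, and path are all assumed values.

```shell
# A GASS-served file is addressed by a URL (host, port, and path assumed).
GASS_URL="https://headnode.example.org:10000/home/user/input.dat"

# With Globus installed one could start a GASS server allowing reads and
# writes on an assumed port (not executed here):
#   globus-gass-server -r -w -p 10000
echo "$GASS_URL"
```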

j. The Replica Management system

The replica management service manages the copying and placement of files in a high-performance, distributed computing environment to optimize the performance of data-intensive applications. Data-intensive, high-performance computing applications require the efficient management and transfer of terabytes of information in wide-area, distributed computing environments. Such applications include experimental analyses and simulations in scientific disciplines such as high-energy physics, climate modeling, earthquake engineering, and astronomy. In these applications, massive datasets must be shared by a community of hundreds or thousands of researchers distributed worldwide. These researchers need to transfer large subsets of the datasets to local sites or other remote resources for processing, and may create local copies, or replicas, to overcome long wide-area data transfer latencies. The data management environment must provide security services such as authentication of users and control over who is allowed to access the data. In addition, once multiple copies of files are distributed at multiple locations, researchers need to be able to locate the copies and determine whether to access an existing copy or create a new one to meet the performance needs of their applications. The replica management architecture must include:
- Separation of replication and file metadata information: a strict separation between metadata information, which describes the contents of files, and replication information, which is used to map logical file and collection names to physical locations.
- Replication semantics: the service enforces no replica semantics. Files registered with the replica management service are asserted by the user to be replicas of one another, but the service makes no guarantees about file consistency.
- Rollback: if a failure occurs during a complex, multi-part operation, the system rolls the state of the replica management service back to its previously consistent state before the operation began.

3) Portable Batch System (PBS)

The Portable Batch System (PBS) [35][36] is a batch job and computer system resource management package. It accepts jobs, preserves and protects each job until it is run, runs it, and delivers the output back to the submitter.


PBS was developed to provide a growth platform for batch cluster computing and distributed jobs. It can be installed and configured to support jobs run on a single system, or on many systems grouped together in various ways.

a. Software and installation architecture

Figure 6 shows the installation architecture of the Portable Batch System: a client-only host submits jobs to the server-and-execution host, where pbs_server holds the queues, pbs_sched applies the site policy, and a MOM on each execution host runs the jobs.

Figure 6: PBS Installation Architecture (source [35])

In this architecture, the server and execution hosts run three different daemons:
- Job Server: the Job Server, pbs_server, is the central focus of PBS. Its main function is to provide the basic batch services such as receiving/creating a batch job, modifying the job, protecting the job against system crashes, and running the job. The Job Server manages one or more queues; each batch queue contains zero or more batch jobs and has a set of queue attributes. Two main types of queues are defined: execution queues, whose jobs are candidates for execution, and routing queues, whose jobs are candidates for routing to a new destination.
- Job Executor: the job executor and resource monitor, pbs_mom (informally called a MOM), manages job execution and monitors system activity and resource usage. There must be at least one MOM running on every node that can execute jobs.
- Job Scheduler: the Job Scheduler, pbs_sched, is another daemon; it contains the site's policy controlling which job is run, and where and when it is run. The Scheduler can communicate with the various MOMs to learn about the state of system resources, and with the Server to learn about the availability of jobs to execute. The interface to the Server uses the same API as the commands; in fact, the Scheduler simply appears to the Server as a batch manager [35].
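To make this concrete, pbs_server learns which execution hosts exist from a nodes file in its private directory. The sketch below only generates such a file locally; the hostnames and CPU counts are assumed placeholders for a real PBS_HOME/server_priv/nodes.

```shell
# Generate a sample PBS nodes file (hostnames and np counts are hypothetical).
cat > nodes <<'EOF'
poste11 np=2
poste12 np=2
poste21 np=1
EOF
# In a real installation this file lives at PBS_HOME/server_priv/nodes and is
# read by pbs_server at start-up; each line names a node and its CPU slots.
cat nodes
```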

b. PBS daemons and job life cycle

Figure 6 also shows the interaction of the PBS daemons over a job's life cycle. Every time pbs_server receives a job:
1. An event tells the Server to start a scheduling cycle.
2. The Server sends a scheduling command to the Scheduler.
3. The Scheduler requests resource information from the MOMs.
4. The MOMs return the requested information.
5. The Scheduler requests job information from the Server.
6. The Server sends job status information to the Scheduler, which makes the policy decision to run the job.
7. The Scheduler sends a run request to the Server.
8. The Server sends the job to a MOM to run.

c. Scheduling and queue management

pbs_sched is a very powerful scheduler [35]. Jobs can be routed to processing hosts over a network; resources may be reserved for a job before its execution begins, and limits are established on the job's resource consumption. Each job submitted to the system can:
- be batch (set up so it can run to completion without manual intervention) or interactive (prompting the user for data or input)
- define a list of required resources, so that pbs_sched can select or even reserve the suitable resources
- define a priority (by default, normal priority is assigned to the job)
- define the time of execution
- send a mail to the user when execution starts, ends, or aborts
- define dependencies (after, afterok, afternotok, before, ...)
- be synchronized with other jobs
- be checkpointed (if the host OS provides for it)

In PBS, as in SGE and the other grid schedulers (with small differences between them), all this information defines the "profile requirements" of the job. Likewise, a "resource profile" is maintained for each resource. PBS (and the others) maintains many resource properties that can be platform-dependent or common to all systems, such as:
- cput: maximum CPU time used by all processes in the job
- pcput: maximum CPU time used by any single process in the job
- mem: maximum amount of physical memory used by the job
- pmem: maximum amount of physical memory used by any process of the job
- vmem: maximum amount of virtual memory used by the job
- pvmem: maximum amount of virtual memory used by any process of the job
- walltime: wall-clock running time
- file: the largest size of any single file that may be created by the job
- host: name of the host on which the job should run
- nodes: number and/or type of nodes to be reserved for exclusive use by the job

Also, for each resource, it is possible to specify min/max limits and default values in queue and server attributes to filter different classes of jobs. If a running job exceeds the amount of a resource it requested, it is aborted by the MOM.

Moreover, PBS provides a separate process to schedule which jobs should be placed into execution. This is a flexible mechanism by which you may implement a very wide variety of policies; in fact, it is possible to implement a replacement Scheduler, using the provided APIs, that enforces the desired policies. The configuration required for a Scheduler depends on the Scheduler itself (see the file PBS_HOME/sched_priv/sched_config). A good amount of code has been written to make it easier to change and extend this Scheduler. The delivered FIFO Scheduler provides the ability to sort the jobs in several different ways on user and group priority, in addition to FIFO order.
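The resource limits above are usually requested in a job script via #PBS directives. The script below is a hypothetical example (job name and limits are assumed values); it is only written to a file here, since submitting it with qsub requires a live PBS server.

```shell
# Generate a sample PBS job script (name and resource limits are assumed).
cat > myjob.pbs <<'EOF'
#!/bin/sh
#PBS -N sample_job
#PBS -l nodes=2
#PBS -l cput=00:10:00
#PBS -l mem=200mb
#PBS -m abe
echo "running on $(hostname)"
EOF
# With PBS running one would submit it (not executed here):
#   qsub myjob.pbs
cat myjob.pbs
```

The `-m abe` directive requests mail on abort, begin, and end, matching the mail options described above.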
As distributed, the FIFO Scheduler is configured with the following options:
- All jobs in a queue are considered for execution before the next queue is examined.
- The queues are sorted by queue priority.
- The jobs within each queue are sorted by requested CPU time (cput); the shortest job is placed first.
- Jobs that have been queued for more than a day are considered starving, and heroic measures are taken to attempt to run them.
- Any queue whose name starts with "ded" is treated as a dedicated-time queue. Jobs in that queue are only considered for execution if the system is in dedicated time as specified in the dedicated_time configuration file; if the system is in dedicated time, jobs not in a "ded" queue are not considered. (See the file PBS_HOME/sched_priv/dedicated_time.)
- Prime time is from 4:00 AM to 5:30 PM. Any holiday is considered non-prime. (See the file PBS_HOME/sched_priv/holidays.)
- Sample dedicated_time and resource group files are also included.
- These system resources are checked to make sure they are not exceeded: mem (memory requested) and ncpus (number of CPUs requested).

d. Most commonly used PBS commands

PBS supplies both command-line commands and a graphical interface. These are used to submit, monitor, modify, and delete jobs. The commands can be installed on any system type supported by PBS and do not require the local presence of any other PBS component. There are three classifications of commands:4

- User commands:
  - qsub: submit a batch job to PBS
  - qstat: list information about queues and jobs
  - qdel: remove a job from the queue; this includes all running, waiting, and held jobs (see also man qdel)
  - qselect: obtain a list of jobs that meet certain criteria
  - qrerun: terminate an executing job and return it to a queue
  - qorder: exchange the order of two PBS batch jobs in a queue
  - qmove: move a job to a different queue or server
  - qhold: place a hold on a job to keep it from being scheduled for running
  - qalter: alter (modify) the parameters of a job after it has been submitted
  - qmsg: append a message to the output of an executing job
  - qrls: remove a hold from a job

- Administrator commands:
  - qmgr: provides an administrator interface to the batch system
  - pbsnodes: pbsnodes -a lists all PBS nodes, their attributes, and their job status
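As an administrator-side sketch, queues are created and configured through qmgr. The commands below are written to a file rather than executed, since applying them needs a running pbs_server; the queue name and limit are assumed values.

```shell
# Generate a qmgr script creating an execution queue (name/limits hypothetical).
cat > make_queue.qmgr <<'EOF'
create queue workq
set queue workq queue_type = Execution
set queue workq resources_max.cput = 01:00:00
set queue workq enabled = True
set queue workq started = True
EOF
# With pbs_server running one would apply it (not executed here):
#   qmgr < make_queue.qmgr
cat make_queue.qmgr
```

The resources_max.cput line shows how the per-queue min/max limits discussed above are expressed.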

4) Why Globus and PBS?

PBS and the Globus Toolkit are both open-source, free-to-use software. This encourages broader, more rapid adoption and leads to greater technical innovation, as the open-source community provides continual enhancements to the products.

4 http://www.weizmann.ac.il/physics/linux_farm/pbs/Commands/ ; http://www.arsc.edu/support/howtos/usingpbs.html ; http://amdahl.physics.purdue.edu/usingcluster/node31.html ; http://ag.cqu.edu.au/FCWViewer/view.do?page=5665 ; http://cf.ccmr.cornell.edu/cgi-bin/w3mman2html.cgi?qorder%281B%29 ; http://cf.ccmr.cornell.edu/cgi-bin/w3mman2html.cgi?qmgr%288B%29

They offer a complete grid software infrastructure, including software for security, information infrastructure, resource management, data management, communication, fault detection, and portability, allowing secure online resource sharing and collaboration across corporate, institutional, and geographic boundaries without sacrificing local autonomy. PBS also includes techniques to create, modify, delete, and manage job queues. But the main characteristic is the possibility to create routing queues: each routing queue has a list of destinations to which jobs may be routed. This fits our new scheduling and load-balancing technique used in TunGrid. Routing queues and all related techniques will be discussed in the next document.
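As a small preview of that mechanism, a routing queue and its destinations are declared through qmgr. The script below is only generated, not applied, since it needs a running pbs_server; the queue and destination names are assumed placeholders inspired by the two-cluster testbed.

```shell
# Generate a qmgr script creating a routing queue (names are hypothetical).
cat > create_route.qmgr <<'EOF'
create queue routeq
set queue routeq queue_type = Route
set queue routeq route_destinations = workq@cluster1
set queue routeq route_destinations += workq@cluster2
set queue routeq enabled = True
set queue routeq started = True
EOF
# With pbs_server running one would apply it (not executed here):
#   qmgr < create_route.qmgr
cat create_route.qmgr
```

Jobs submitted to routeq would then be routed to whichever destination queue accepts them, which is the behaviour the TunGrid load-balancing scheme builds on.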