Condor (by the University of Wisconsin-Madison)
A High Throughput Distributed Computing System
Presented by Mark Silberstein, 2006
CDP 236370, Technion
Definitions
● Cluster [pool] – a group of interconnected computers (resources)
● Batch (job) – a self-contained piece of software for unattended execution
● Batch/queuing system – a system for automatic scheduling and invocation of jobs competing for a resource [multiple resources]
● High Performance System – optimized for low latency of every job
● High Throughput System – optimized to increase utilization of resources
  – Ex: a printer queue
Batch system – take 1: multiple identical resources
● Job Queue
● Invokes jobs and brings results back
● Job “babysitting”
  – Invoke only once
  – Job failures
Batch system – take 2: distributed heterogeneous resources
● Job Queue (as before): invokes jobs and brings results back, job “babysitting”
● Remote control
● Report resource characteristics (metadata)
● Job requirements
  – “I want a CPU with at least ...”
Batch system – take 3: distributed heterogeneous resources + multiple users
● Job Queue (as before): invokes jobs and brings results back, job “babysitting”, remote control, job requirements, resource attributes (metadata)
● Security
  – Access control
● Resource sharing policies – QoS
Batch system – take 4: distributed heterogeneous resources + multiple users + non-dedicated resources (cycle stealing)
● Job Queue + access control (as before): invokes jobs and brings results back, job “babysitting”, remote control, job requirements, resource attributes (metadata)
  – Periodic update
● Security, resource sharing policies – QoS
● On-demand job eviction
● Fault tolerance
● Respecting resource policies
Condor at a glance
● Basic idea – “classified advertisement” matching
  – Resources publish their capabilities
  – Jobs publish their requirements
  – The Matchmaker finds the best match
[Figure: submission hosts and execution hosts (CPUs), connected through the Matchmaker]
Condor architecture – submission host: schedd and shadow
● Schedd – job queue
  – Holds a DB of all jobs submitted for execution (fault-tolerant)
  – Requests resources from the matchmaker
  – Claiming logic
  – Ensures only-once semantics
● Shadow (per running job)
  – Remote invocation
  – Input/output staging
  – Job “babysitting” – failure identification
  – Sometimes works as an I/O proxy
[Figure: a submission host running the schedd and one shadow per running job, talking to the matchmaker and the execution hosts]
Condor architecture – execution host: startd and starter
● Startd – resource manager
  – Monitoring
    ● Keeps track of resource usage
    ● Periodically sends resource attributes to the matchmaker
  – Enforces local policies
  – Execution gateway
    ● Security
    ● Spawns the starter
    ● Communicates with the schedd
● Starter (per running job)
  – Communicates with the shadow (I/O)
  – Environment creation and cleanup
  – Controls job execution
[Figure: an execution host running the startd (resource monitoring, execution gateway) and one starter per job, reporting to the matchmaker and serving the shadow]
Matchmaker
● Collector
  – Central registry of the pool metadata
  – All pool entities send reports to the collector
● Negotiator – the Condor brain
  – Periodically pulls info from the collector
  – Attempts to match requests with resources
  – Notifies happy pairs
  – Maintains fair share of resources between users
[Figure: pool daemons publish/subscribe their classads to the Collector; the Negotiator pulls them and notifies the matched parties]
Condor description language: ClassAd
● Used to describe entities – resources, jobs, daemons, etc.
● Schema-less!!!
● Mapping of attribute names to expressions
● Both descriptive and functional
● Expressions can contain attributes from other classads
● Protocol for expression evaluation
● Simple examples:
  – Ex1: Simple: [ CPU=200; RAM=30 ]
  – Ex2: Reference to local attributes: [ MyCPU=200; RAM=20; Power=(RAM+MyCPU) ]
  – Ex3: Reference to another classad: [ Type=job; Exec=test.exe; Requirements=other.RAM>200 ] matched against [ Type=resource; RAM=300; ]
Matching constraints
● The matching process is symmetric:
  – Matched only if both the resource's and the job's Requirements expressions are true
  [ Type=job; Exec=test.exe; Requirements=other.RAM>200 ]
  [ Type=resource; RAM=300; Requirements=(Exec==test.exe) ]
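Below is a minimal sketch of this symmetric matching rule in Python (not Condor's implementation): classads are modeled as plain dicts, and the Requirements expressions are stand-in Python predicates taking (my_ad, other_ad). The function name `symmetric_match` and the dict layout are assumptions made only for illustration.

```python
# A minimal sketch of symmetric ClassAd matching (illustrative, not Condor code).
def symmetric_match(ad_a, ad_b):
    """Match only if both Requirements expressions evaluate to True."""
    req_a = ad_a.get("Requirements", lambda me, other: True)
    req_b = ad_b.get("Requirements", lambda me, other: True)
    return req_a(ad_a, ad_b) and req_b(ad_b, ad_a)

# Job ad: needs a machine with more than 200 (MB of) RAM.
job = {
    "Type": "job",
    "Exec": "test.exe",
    "Requirements": lambda me, other: other.get("RAM", 0) > 200,
}

# Resource ad: only accepts jobs whose executable is test.exe.
resource = {
    "Type": "resource",
    "RAM": 300,
    "Requirements": lambda me, other: other.get("Exec") == "test.exe",
}

print(symmetric_match(job, resource))  # True: both constraints hold
```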
Example of a resource classad
MyType = "Machine"
TargetType = "Job"
Name = "[email protected]"
Machine = "ds-i1.cs.technion.ac.il"
Rank = 0.000000
CpuBusy = ((LoadAvg - CondorLoadAvg) >= 0.500000)
CondorVersion = "$CondorVersion: 6.4.7 Jan 26 2003 $"
CondorPlatform = "$CondorPlatform: INTEL-LINUX-GLIBC22 $"
VirtualMemory = 1014294
Disk = 34126016
CondorLoadAvg = 0.000000
LoadAvg = 1.000000
KeyboardIdle = 26038
Arch = "INTEL"
OpSys = "LINUX"
UidDomain = "cs.technion.ac.il"
FileSystemDomain = "cs.technion.ac.il"
Subnet = "132.68.37"
HasIOProxy = TRUE
...
CpuBusyTime = 2109520
CpuIsBusy = TRUE
State = "Owner"
EnteredCurrentState = 1084352386
Activity = "Idle"
EnteredCurrentActivity = 1084352386
Start = (Scheduler =?= "[email protected]") || ((KeyboardIdle > 15 * 60) && ((LoadAvg - CondorLoadAvg) <= ...) || (State != "Unclaimed" && State != "Owner"))
Requirements = START
Matchmaker in detail
● The collector stores
  – All resources' classads
  – All schedds' classads
    ● These represent only the number of jobs and their owners, not the jobs themselves
  – Information is always outdated
  – Stale data is removed periodically
    ● Soft registration
Idle state (periodic update and soft state)
[Diagram: the schedd and startd periodically publish their classads to the collector, which removes stale data (garbage collection)]
● Schedd classad: number of idle jobs in the queue, IP:port
● Startd classad: resource state, resource characteristics
Important: this diagram is valid throughout the whole life of the schedd and startd.
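A minimal sketch of the soft-state idea described above: every daemon re-publishes its classad periodically, and ads that are not refreshed within a lifetime are garbage-collected. The class name, method names, and lifetime value are assumptions made only for this illustration.

```python
# Illustrative collector with soft-state registration (not Condor code).
import time

class Collector:
    def __init__(self, lifetime=300.0):
        self.lifetime = lifetime          # how long an ad stays valid without refresh
        self.ads = {}                     # name -> (classad dict, last update time)

    def publish(self, name, classad):
        """Called periodically by every schedd/startd to refresh its ad."""
        self.ads[name] = (classad, time.time())

    def garbage_collect(self):
        """Drop ads whose owners stopped reporting (assumed crashed or gone)."""
        now = time.time()
        for name in [n for n, (_, t) in self.ads.items() if now - t > self.lifetime]:
            del self.ads[name]

    def query(self):
        """Return the current (possibly slightly outdated) snapshot of the pool."""
        self.garbage_collect()
        return {n: ad for n, (ad, _) in self.ads.items()}

# Usage: a startd refreshes its ad every update period.
c = Collector(lifetime=300.0)
c.publish("startd@ds-i1", {"State": "Unclaimed", "RAM": 300})
print(c.query())
```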
Negotiator
● Periodic negotiation cycle
  1) Pulls all classads (once per cycle)
  2) Contacts each schedd according to priority and gets a job classad
  3) For each job, traverses all resources' classads and attempts to match each one
  4) If a match is found:
     1) Chooses the best match according to global and local policies
     2) Notifies the matched parties
     3) Removes the matched classad
  5) If no match is found – tries the next job from the same schedd, or the next schedd
Claiming and running
[Sequence diagram: the negotiator sends the match (startd address and claim ID) to the schedd, which then talks to the startd:]
● Activate claim: are you available? → Yes
● Run job
● Periodic “alive” keep-alive messages
● Job finished – has more?
● Release claim: no, thanks
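A minimal sketch of the exchange above, with hypothetical class and method names; in Condor the protocol runs over the network between the schedd, shadow, starter, and startd.

```python
# Hypothetical sketch of the claiming protocol shown above (not Condor code).
class Startd:
    def activate_claim(self, claim_id):
        # "Activate claim: are you available?" -> "Yes" if the claim is still valid.
        return True

    def run_job(self, job):
        print(f"running {job}")

    def release_claim(self):
        print("claim released")           # "no more jobs, thanks"

def run_on_claim(startd, claim_id, jobs):
    if not startd.activate_claim(claim_id):
        return                            # claim no longer valid: back to matchmaking
    for job in jobs:                      # "job finished, has more?" loop;
        startd.run_job(job)               # periodic "alive" messages flow meanwhile
    startd.release_claim()

run_on_claim(Startd(), claim_id="claim-1", jobs=["job1", "job2"])
```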
Negotiation state sequence diagram
[Sequence diagram between the schedd, negotiator, collector, and startd:]
● The negotiator fetches all classads from the collector
● Single negotiation cycle, repeated while there are idle jobs:
  – Get the next schedd with idle jobs
  – Choose the next job to match; get its job classad from the schedd
  – Perform matchmaking and assign a claim ID
  – Send the claim ID and the address of the matched startd to the schedd; tell the startd “I am claimed”
● The schedd activates the claim with the received claim ID (validation of a correct match)
● The schedd sends the job classad; the startd starts the new job; the job completes
Startd resource monitoring
● Periodic sampling of system resources
  – CPU utilization, hard disk, memory, ...
  – User-defined attributes
  – If a job is running – its total running time, the total load it imposes, ...
● Published in the classad, so jobs can match against these attributes
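A minimal sketch of such periodic sampling in Python (Linux-specific call, illustrative attribute names); the sampled values would be merged into the startd's classad and sent to the collector.

```python
# Illustrative resource sampling for a startd-like daemon (not Condor code).
import os, time

def sample_resource_attributes():
    load1, _, _ = os.getloadavg()                       # 1-minute load average (Unix)
    return {
        "LoadAvg": load1,
        "SampleTime": int(time.time()),
        # user-defined attributes and per-job load/runtime would be added here
    }

def monitoring_loop(publish, period=60):
    """Publish a fresh sample (e.g. to the collector) every `period` seconds."""
    while True:
        publish(sample_resource_attributes())
        time.sleep(period)

print(sample_resource_attributes())                     # one sample, for illustration
```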
Startd policies (support for cycle stealing)
● The resource owner can configure
  – When the resource is considered available
    ● Ex: only after the keyboard has been idle for 15 minutes
  – What to do when the owner is back
    ● Ex: suspend the job to RAM
  – How to evict a job
    ● Ex: the job should be killed at most 5 seconds after “I want my resource back”
● The pool manager has no control over these policies
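A minimal sketch of such owner policies as Python predicates over the monitored attributes; in Condor itself these are ClassAd expressions in the startd configuration, so the names and thresholds below are only illustrative.

```python
# Illustrative owner policies for cycle stealing (not Condor configuration syntax).
import time

def is_available(ad):
    """Resource is considered available only after 15 minutes of keyboard idle time."""
    return ad.get("KeyboardIdle", 0) > 15 * 60

def must_kill(eviction_requested_at, grace_seconds=5):
    """The owner wants the resource back; the evicted job gets at most grace_seconds."""
    return time.time() - eviction_requested_at >= grace_seconds

print(is_available({"KeyboardIdle": 20 * 60}))   # True: keyboard idle long enough
```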
Global resource sharing policies
● How should resources be shared between users?
● What happens without policies:
  – 1000 computers; user A starts 1000 jobs, each 5 hours long; user B will have to wait ;(((
● Solution – fair share
  – A user with a higher priority can preempt another user's job
    ● Priorities change dynamically according to resource usage: the more resources used, the worse the priority
    ● Prio_user(t) = k * Prio_user(t - dt) + (1 - k) * (number of used resources), where k = 0.5^(dt / priority half-life)
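The decay formula above, transcribed directly into Python (variable names are mine): it shows how a user's priority converges toward the number of resources currently held, at a rate set by the priority half-life.

```python
# Fair-share priority update, as given on the slide above.
def updated_priority(prev_priority, used_resources, dt, half_life):
    k = 0.5 ** (dt / half_life)           # how much of the old priority survives one step
    return k * prev_priority + (1 - k) * used_resources

# Example: a user holding 100 machines, updated every 60 s, with a 1-hour half-life.
prio = 0.0
for _ in range(60):                       # one hour of updates
    prio = updated_priority(prio, used_resources=100, dt=60, half_life=3600)
print(round(prio, 1))                     # ~50.0: halfway to 100 after one half-life
```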
Putting policies together: negotiation cycle revisited (Condor 6.6 series)
● Periodic negotiation cycle
  1) Pull all classads (once per cycle) and optimize them for matching
  2) Order all schedd requests by user priority // higher priority – served first
  3) For each user, while (the user's quota is not exceeded AND the user has more job requests) do
     NEW JOB:
     1) Contact the schedd and get the next job classad
     2) Traverse all resources' classads and attempt to match them one by one
        1) If no match is found – notify the schedd; goto NEW JOB
        2) If a match is found – AssignWeights(), add it to the matched list
     3) ChooseBestMatch() and Notify()
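A minimal sketch of this cycle, reusing the symmetric_match helper from the matching sketch earlier and the assign_weights / choose_best_match helpers sketched after the next slide; the collector/schedd objects and their method names are assumptions, not Condor's API.

```python
# Illustrative negotiation cycle (helpers come from the other sketches in these notes).
def negotiation_cycle(collector, schedds_by_user_priority, quota):
    resource_ads = collector.query()                   # 1) pull all classads once per cycle
    for user, schedd in schedds_by_user_priority:      # 2) higher-priority users served first
        granted = 0
        while granted < quota[user]:                   # 3) stop at the user's quota
            job_ad = schedd.next_idle_job(user)
            if job_ad is None:
                break                                  # user has no more requests
            matches = [r for r in resource_ads if symmetric_match(job_ad, r)]
            if not matches:
                schedd.notify_no_match(job_ad)         # move on to the user's next job
                continue
            best = choose_best_match(assign_weights(job_ad, matches))
            schedd.notify_match(job_ad, best)          # hand out the claim
            resource_ads.remove(best)                  # a matched ad leaves this cycle
            granted += 1
```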
Putting policies together: negotiation cycle revisited (cont.)
● Function AssignWeights()
  1) Assign a preemption weight:
     ● 2 – if the resource is idle
     ● 1 – if the resource is busy but prefers the new job over the current one (resource Rank evaluation)
     ● 0 – if the resource is busy, the current user has a higher priority, and the global policy permits preemption
  2) Evaluate the job's preferences (job Rank evaluation)
● Function ChooseBestMatch(): lexicographic order
  – Sort according to job rank; pick the best one
  – Among all with equal best rank – sort according to preemption weight
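A minimal sketch of these two functions; the representation of a matched resource ad (State, RankPrefersNewJob, RAM) and of the job's Rank are assumptions made just for this illustration.

```python
# Illustrative AssignWeights / ChooseBestMatch (not Condor code).
def assign_weights(job_ad, candidate_ads):
    """Attach (job_rank, preemption_weight) to every matched resource ad."""
    weighted = []
    for ad in candidate_ads:
        if ad.get("State") == "Unclaimed":
            preemption_weight = 2                 # idle resource
        elif ad.get("RankPrefersNewJob", False):
            preemption_weight = 1                 # busy, but resource Rank prefers this job
        else:
            preemption_weight = 0                 # busy, priority preemption only
        job_rank = job_ad.get("Rank", lambda resource: 0)(ad)
        weighted.append((job_rank, preemption_weight, ad))
    return weighted

def choose_best_match(weighted):
    """Lexicographic order: best job rank first, preemption weight breaks ties."""
    _, _, ad = max(weighted, key=lambda t: (t[0], t[1]))
    return ad

# Tiny demo: a job that prefers more RAM, two candidate machines.
job = {"Rank": lambda r: r.get("RAM", 0)}
machines = [{"Name": "a", "RAM": 300, "State": "Unclaimed"},
            {"Name": "b", "RAM": 500, "State": "Claimed"}]
print(choose_best_match(assign_weights(job, machines))["Name"])  # "b": higher job rank wins
```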
Condor and MPI parallel jobs
● Problem: MPI requires synchronous invocation and execution of multiple instances of a program.
● Why this is a problem:
  – The negotiator matches only one job at a time
  – The schedd knows how to invoke only one job at a time
  – Different failure semantics: a single instance failure IS a whole MPI job failure
  – The startd might preempt a single job, but this would kill the whole MPI run
MPI Universe
● Each startd capable of running MPI jobs publishes the attribute “DedicatedScheduler=”
● Each MPI sub-job has a requirement to run on a host with DedicatedScheduler defined
● The negotiator matches all such hosts and passes them to the schedd
● The dedicated schedd is responsible for synchronous invocation and for the failure semantics
● The dedicated schedd can preempt any job on that host
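A minimal sketch of the host-gathering step implied above: collect the execution hosts that advertise DedicatedScheduler (the attribute name comes from the slide; the function and everything else is illustrative), so that all MPI instances can be started together or not at all.

```python
# Illustrative selection of dedicated hosts for a synchronous MPI start (not Condor code).
def hosts_for_mpi(resource_ads, dedicated_schedd_name, nodes_needed):
    dedicated = [ad for ad in resource_ads
                 if ad.get("DedicatedScheduler") == dedicated_schedd_name]
    if len(dedicated) < nodes_needed:
        return None                       # cannot start: MPI needs all instances at once
    return dedicated[:nodes_needed]

ads = [{"Name": "n1", "DedicatedScheduler": "dedicated@submit"},
       {"Name": "n2", "DedicatedScheduler": "dedicated@submit"},
       {"Name": "n3"}]
print(hosts_for_mpi(ads, "dedicated@submit", 2))  # two hosts -> synchronous start possible
```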
Condor in the Technion
● Condor is deployed in DSL, SSDL and CBL (total ~200 CPUs)
● Gozal: R&D projects for Condor enhancements, among them:
  – High availability
  – Distributed management and configuration
  – Resource sandbox
  – On the web: http://dsl.cs.technion.ac.il/projects/gozal/
● Superlink-online: a genetic linkage analysis portal
References
● www.condorproject.org
  – Condor administration manual
  – Research papers
  – Slides from the previous year's lecture