CITI Technical Report 01-7

Scalable Linux Scheduling

Stephen P. Molloy, Peter Honeyman
{smolloy,honey}@citi.umich.edu

ABSTRACT

For most of its existence, Linux has been used primarily as a personal desktop operating system. Yet, in recent times, its use as a cost-efficient alternative to commercial operating systems for network servers, distributed workstations and other large-scale systems has been increasing. Despite its remarkable rise in popularity, Linux exhibits many undesirable performance traits.

Concerned about the scalability of multithreaded network servers powered by Linux, we investigate improvements to the Linux scheduler. We focus on pre-calculating base priorities and sorting the run queue for efficient task selection. We propose an improved scheduler design and compare our implementation in terms of scalability and performance to the existing Linux scheduler. Our analysis shows that improvements can be made to the existing scheduler without introducing overhead, thus improving the scalability and robustness of the Linux operating system.

1 May 2001

Center for Information Technology Integration
University of Michigan
519 West William Street
Ann Arbor, MI 48103-4943


1. Introduction

Linux, a strong and steadily increasing presence on the Internet today [4], commonly provides cost-effective and load-tolerant solutions for network services. Its familiarity to UNIX users, source code availability, and ability to run on many different architectures explain Linux's rapid increase in popularity. Many software companies, including AOL, Netscape and IBM [6, 8], offer Linux products. Several organizations use Linux on routers, print and file servers, firewalls and, of course, web application servers [10].

Because Linux grew from a desktop operating system, many issues prevent it from being a dominant force in the enterprise server market. The design and implementation of Linux has traditionally focused on simplicity and versatility rather than small performance gains and scalability. The implementation of the Linux thread model and scheduler illustrate this approach.

At the same time as Linux's popularity has increased, the use of Java for web applications has grown immensely. Java technology is a key component in building scalable application servers. However, communications-intensive Java applications often create large numbers of threads and Linux does not handle such stress gracefully.

Concerned about the scalability of multithreaded network servers powered by Linux, we investigate improvements to the Linux scheduler. Experiments by IBM indicate that as much as 30% of the total CPU time in the system is spent in the scheduler when the number of running threads is high [2]. Our analysis shows the current scheduler uses an expensive and redundant algorithm for task selection. Our goal is to improve the scalability of the Linux scheduler to adapt it to enterprise-scale server workloads. Our analysis shows that our new scheduler implementation achieves these goals.

The rest of this paper is organized as follows. Section 2 gives some background information on our project and provides some insight as to why we chose to tackle the scheduler rather than the Linux threading model. Section 3 describes Linux's current approach to scheduling. Section 4 explains the problems with it. Section 5 outlines our approach to solving the problem and Section 6 describes its performance relative to the current Linux scheduler. All kernel modifications, experiments, and descriptions are against a 2.3.99-pre4 Linux kernel and, when used, Java version 1.1.7 of IBM's JDK.

2. Background

The Linux thread model is a one-to-one model, meaning that every user-level thread is mapped onto its own kernel thread. While this model makes programming in the kernel less complicated, it sometimes forces programs to generate more kernel threads than is necessary. Forcing the kernel's default scheduler to accommodate too many threads can adversely affect a server's performance.

Many established operating systems support many-to-one or many-to-many thread models in which each kernel thread has many user level threads mapped to it. In these models, a secondary scheduler chooses which of the mapped user level threads to run. The multi-tier scheduling approach of these models assures that the scheduler at each level will be faced with a more manageable number of threads.

The Linux scheduler, like its thread model, is also an exercise in simplicity. The heart of the scheduler is concisely coded in just a few lines that evaluate runnable threads and then pick the best. The price for this simplicity, though, is a linear time algorithm that repeats many of the same calculations that were performed on its last invocation.

In this project, we address the shortcomings of the scheduler rather than those of the threading model. The reasons for this decision are simple. We saw the redundant calculation and O(n) loop in the current scheduler and knew we could improve it. We also know that the Linux kernel community has been very protective of its threading model in the past and we wanted to avoid upsetting any contributors. Finally, the original timeline for the project called for a working design and implementation within the time frame of one semester. Developing a new threading model for Linux would almost certainly require more time than we had available.

Other groups have spent considerable time designing alternative schedulers for Linux [1, 5, 9]. Linux discussion groups provide evidence that the scheduler has been and continues to be an interesting topic for the developer community. However, most alternative scheduler designs focus on reducing latency for real-time processes rather than improving the overall scalability of the default scheduler.

While it is our goal to improve scalability and performance of the scheduler when faced with a large number of runnable threads, it is not our intent to change the criteria it uses for thread selection. We feel that the current criteria are carefully chosen and sufficient to make good decisions with a minimum amount of calculation. Our primary concern is simply that these criteria are not being used in an optimal algorithm by the scheduler.

In the remainder of this paper, because Linux uses a one-to-one threading model, we do not distinguish between a user thread and a kernel thread. Also, to match the terminology used in the kernel source code, we refer to any thread in the system as a task.

3. Current Scheduler

To understand why the current Linux scheduler scales poorly with the number of runnable threads in the system, it is necessary to be familiar with its data structures, algorithms, and conventions. This section outlines the existing scheduler to clarify our observations and design decisions.

3.1 Task Structure

The basic execution context in Linux is referred to as a task. The task structure is responsible for maintaining a task's address space information, whether that address space is shared with other tasks, and other state information about the task and its registers. It also tracks task statistics for memory management and resource control, privileges, file descriptors, signal handlers and other task specific information. The various fields of the task structure used in the scheduler are illustrated in Table 1.

Type                Field
volatile long       state
unsigned long       policy
long                counter
long                priority
struct mm_struct    *mm
struct list_head    run_list
int                 has_cpu
int                 processor

Table 1: This table shows the fields of the task structure that are most relevant to Linux scheduling.

The task's state field can be set to one of six values, each representing a different state in which a task might find itself (such as blocking or sleeping). TASK_RUNNING is the value of state when a task is runnable.

The policy field is set either to SCHED_FIFO, SCHED_RR (round robin) or SCHED_OTHER to determine the scheduling policy for the task. The first two options are for real-time tasks, while the third is for all other tasks. Real-time tasks are always run before regular tasks if they are runnable. The policy field is also used to track yielded tasks. When a non-real-time task gives up its processor via the sys_sched_yield() system call, a bit (SCHED_YIELD) in the task's policy field is set so this information can be passed on to the scheduler.

The field has_cpu is set to 1 while a task is executing on a processor and 0 otherwise. Upon setting has_cpu, the field processor is set to the processor ID on which the task will execute. The task structure also contains pointers that identify the memory map in which it runs and its place on the run queue.

The two most important factors in determining which task executes next are represented by the priority and counter fields. Priority is an integer between 1 and 40. Higher numbers represent higher priority. Twenty is the default value for all tasks. (Real-time tasks also use a priority value, but it ranges from 0 to 99 and is stored in a separate field called rt_priority.) Counter is a value that indicates the time remaining in the task's current quantum. Counter, measured in 10 ms ticks, can range from zero to twice the task's priority. Linux uses this field to enforce a fairness policy.

It is worth noting that all tasks, whether they are lightweight threads or full-fledged processes, are treated the same by a Linux system. All processes and threads are visible in various system status commands such as ps and top. Consequently, the default scheduler, which is responsible for accommodating all tasks in the system, can be placed under considerable stress when running multithreaded applications.

3.2 Run Queue

The run queue in Linux is a circular, doubly linked list containing all tasks in the TASK_RUNNING state. The scheduler traverses this list when it looks for a task to run. The list is not maintained in sorted order. When the scheduler finds two equivalent tasks, the one closer to the front of the list is chosen. Newly created or awakened tasks are placed at the beginning of the run queue. The list is doubly linked and circular, so tasks can also be added to the end of the run queue.
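As a rough illustration of why a circular, doubly linked list makes both ends cheap to reach, the insertion operations described above might be sketched as follows. This is a simplified user-space sketch, not kernel code: the type list_node and the functions list_init, list_link, runqueue_add_front and runqueue_add_back are hypothetical names chosen for this example.

```c
#include <stddef.h>

/* Hypothetical, simplified stand-in for a kernel list node. */
struct list_node {
    struct list_node *next, *prev;
};

/* An empty circular list is a head node that points at itself. */
static void list_init(struct list_node *head)
{
    head->next = head;
    head->prev = head;
}

/* Link 'node' between two neighbours that are currently adjacent. */
static void list_link(struct list_node *node,
                      struct list_node *before, struct list_node *after)
{
    node->prev = before;
    node->next = after;
    before->next = node;
    after->prev = node;
}

/* Newly created or awakened tasks go to the front, right after the head. */
static void runqueue_add_front(struct list_node *head, struct list_node *node)
{
    list_link(node, head, head->next);
}

/* Because the list is circular, the tail is simply head->prev. */
static void runqueue_add_back(struct list_node *head, struct list_node *node)
{
    list_link(node, head->prev, head);
}
```

Both insertions take constant time; only the search for the best task has to walk the list.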

3.3 Schedule()

The Linux kernel function schedule(), as in other operating systems, is called from over 500 places within the kernel, underscoring its significance to overall system performance. The schedule() function is called by a task when it yields the processor, blocks for I/O, expires its quantum, or is preempted by another (higher priority) task. Schedule() uses the execution context of the task that called it (referred to as the previous task in the scheduler). Schedule() is charged with finding the best task to take the previous task's place on the processor. In doing so, it makes use of a heuristic computed by the function goodness().

3.3.1 Goodness Calculation

The scheduler uses the goodness() function to determine the utility of running a given task. A high goodness value means it would be a sound decision to run the given task next. For tasks that are marked SCHED_FIFO or SCHED_RR, goodness() returns 1000 plus the value stored in the task's rt_priority field. For other tasks, however, goodness() returns a much lower number and shows more discretion in its evaluation.

For SCHED_OTHER tasks, four factors are taken into consideration. The first factor is a task's counter value. If a task has a counter value of zero, then goodness() returns a utility of zero. This lets the scheduler know a runnable task was found but its time slice is used up. If a task's counter value is not zero, then its goodness value is set to the sum of its counter value and the second factor, its priority value.

The third and fourth factors are bonuses for processor affinity and sharing an address space with the previous task. A small, one point advantage is given to tasks that share memory maps, because of the reduced overhead for the context switch. A somewhat larger (15 point) bonus is given to tasks whose last run was on the current processor, to try to take advantage of memory lines that may still reside in the processor's cache. These bonuses are added to the previously calculated goodness value to determine the task's final goodness value.
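The calculation described above can be condensed into a few lines. The following is a minimal sketch of that description, not the kernel's actual goodness() source: the struct layout, the enum sched_policy values, the field last_cpu and the function name goodness_sketch are assumptions made for this example, while the constants 1000, 15 and 1 come from the text.

```c
/* Hypothetical policy values; only the real-time/other split matters here. */
enum sched_policy { POLICY_OTHER, POLICY_FIFO, POLICY_RR };

/* Hypothetical, reduced task structure for illustration only. */
struct task {
    enum sched_policy policy;
    long counter;       /* ticks left in the current quantum */
    long priority;      /* 1..40, default 20 */
    long rt_priority;   /* 0..99, real-time tasks only */
    int  last_cpu;      /* processor the task last ran on */
    const void *mm;     /* identity of the task's memory map */
};

static long goodness_sketch(const struct task *t, int this_cpu,
                            const void *prev_mm)
{
    long weight;

    /* Real-time tasks always outrank all other tasks. */
    if (t->policy != POLICY_OTHER)
        return 1000 + t->rt_priority;

    /* Quantum used up: runnable, but of no utility right now. */
    if (t->counter == 0)
        return 0;

    /* Remaining quantum plus priority. */
    weight = t->counter + t->priority;

    /* Bonuses that depend on which processor and task are asking. */
    if (t->last_cpu == this_cpu)
        weight += 15;   /* cache lines may still be warm */
    if (t->mm == prev_mm)
        weight += 1;    /* cheaper context switch */

    return weight;
}
```

For a default-priority task with 10 ticks left, the sketch yields 30 with no bonuses and 46 when both bonuses apply, so the bonuses can reorder tasks only when their counter and priority sums are close.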

3.3.2 Scheduling Algorithm

The scheduler begins by executing all outstanding bottom-halves (delayed functions that were too substantial to run during an interrupt). After some additional administrative work, the scheduler enters the heart of its code: an examination of all runnable tasks. The previous task is the first task looked at by the scheduler. If the SCHED_YIELD bit is set for the previous task, then the scheduler clears the bit and uses zero as the task's goodness value. Otherwise, it calls goodness() to determine this value.

Next, the scheduler walks through the run queue, evaluating the goodness of each task not currently running on another processor. After all runnable tasks have been examined, the task with the greatest goodness value is chosen to run on the processor. If no task has a goodness greater than zero,¹ then the scheduler jumps to a piece of code responsible for recalculating the counter values of all tasks in the system (runnable or otherwise) and returns to search the run queue again.

While the goodness() function by itself is very simple, executes quickly and considers the most appropriate factors in making intelligent scheduling decisions, it is expensive to recalculate goodness() for every task on every invocation of the scheduler.
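The cost of this approach is easiest to see in code. The sketch below is a simplified user-space model of the selection loop, not the kernel's schedule(): it uses an array instead of the linked run queue, leaves out the bonuses and the special handling of the previous task, and refills only the tasks in the array rather than every task in the system. The names static_goodness, recalculate_counters and pick_next are hypothetical. The refill rule (half the unused counter plus the priority) is an assumption not stated in the text; it was chosen because it keeps counter below twice the priority, which matches the range given in Section 3.1.

```c
#include <stddef.h>

/* Hypothetical, reduced task structure for illustration only. */
struct task {
    long counter;    /* ticks left in the current quantum */
    long priority;   /* static priority, assumed to be at least 1 */
    int  has_cpu;    /* 1 while running on some processor */
};

/* Stand-in for goodness() without the affinity and memory-map bonuses. */
static long static_goodness(const struct task *t)
{
    return t->counter == 0 ? 0 : t->counter + t->priority;
}

/* Assumed reset rule: carry over half the unused quantum, add priority. */
static void recalculate_counters(struct task *tasks, size_t n)
{
    for (size_t i = 0; i < n; i++)
        tasks[i].counter = (tasks[i].counter >> 1) + tasks[i].priority;
}

/* One O(n) pass per call: every eligible task is evaluated every time. */
static struct task *pick_next(struct task *runq, size_t n)
{
    for (;;) {
        struct task *best = NULL;
        long best_weight = 0;
        int candidates = 0;

        for (size_t i = 0; i < n; i++) {
            if (runq[i].has_cpu)
                continue;               /* busy on another processor */
            candidates++;
            long w = static_goodness(&runq[i]);
            if (w > best_weight) {      /* strict: earlier task wins ties */
                best_weight = w;
                best = &runq[i];
            }
        }
        if (best != NULL || candidates == 0)
            return best;                /* NULL: caller runs the idle task */

        /* Tasks exist but all quanta are spent: refill and scan again. */
        recalculate_counters(runq, n);
    }
}
```

Each call touches every runnable task, and a refill forces a second full pass, which is the behavior that becomes expensive with hundreds or thousands of threads.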

4. Problem

Efficient handling of multiple threads is crucial for enterprise servers to make best use of system resources, communicate with many parties at the same time, and reduce the average time that service requests spend waiting for an available server. Multiplexing I/O system calls (such as select) can help in some situations, but they are not always available. The popular Java programming language is a prime example.

Threads are an essential element in the Java language: because the Java language lacks an interface for non-blocking and multiplexing I/O, threads are especially important in constructing communications intensive applications. Typically, one or more Java threads are constructed for each communications stream used by a Java program. Therefore, a natively threaded Java Virtual Machine (such as IBM's JVM [7]) can put a strain on the Linux scheduler, which, as we have seen, examines the goodness function for every thread in the run queue. This can be an exhausting process.

¹ The run queue must contain at least one task for this condition to count. An empty run queue will schedule the idle task rather than trigger the recalculation.

Experiments at IBM show the impact of the Linux scheduler on the performance of a multithreaded network application written in Java [2]. VolanoMark is a benchmark written to measure the performance of VolanoChat, a Java implementation of a chat room server. Because its results have been widely published in magazines such as JavaWorld [3], VolanoMark is an important benchmark for comparing the performance of different implementations of the Java Virtual Machine.

The VolanoMark benchmark establishes a socket connection to a chat server for each simulated chat room user. Because Java does not provide non-blocking read and write, VolanoMark uses a pair of threads on each end of each socket connection (4 threads per connection) to simulate non-blocking I/O. For a 5 to 25-room simulation, the kernel must potentially deal with 400 to 2,000 threads in the run queue. The key measure of performance reported by VolanoMark is message throughput, i.e., the number of messages per second (over all connections) the server is able to handle during a benchmark run. The measurements for IBM's report were taken while running VolanoMark over a loopback interface, eliminating any network overhead involved; the heap size for the test was large enough for the overhead of Java garbage collection to be less than 5% of the total elapsed time throughout the experiments.

The results of the VolanoMark experiments show that 25-room throughput decreased by 24% from 5-room throughput due to the additional threads in the system. A profile of the kernel taken during the VolanoMark runs showed that between 37 (5-room) and 55 (25-room) percent of total time spent in the kernel during the test is spent in the scheduler.

5. ELSC Scheduler

To reduce the amount of time spent in the scheduler, we developed a new scheduling solution, called the ELSC scheduler. Our goals in implementing this scheduler are as follows:

1) Keep changes local to the scheduler. Do not change current interfaces to the scheduler.
2) Keep the concept and implementation simple.
3) Behave like the current scheduler as much as possible.²
4) Maintain existing performance for light loads. Scale gracefully under heavy loads.

The ELSC scheduler is a table-based scheduler that keeps the run queue in a sorted order, making scheduling decisions easier and faster. We chose a table-based design because it relieves us of the overhead of sorting lists. It also avoids complexity when inserting or removing tasks, unlike, say, a heap.

Figure 1: Illustration of run queue structures for both schedulers. The squares represent list heads and the circles represent tasks. The labels on the tasks indicate the static goodness of that particular task. [Panels: (a) Run Queue for Current Linux Scheduler; (b) Run Queue for ELSC Scheduler.]

The foundation of the ELSC scheduler is its ability to keep tasks in an order that makes choosing one fast. The key to this sorted order is in how a task's goodness() value is calculated in the current scheduler. The goodness() calculation consists of a static and a dynamic part. The static part consists of a task's priority and counter values. While a task is on the run queue but not running on a processor, its counter value does not change. Likewise, its priority almost never changes, though when it does, the ELSC scheduler adapts accordingly. We refer to the combination of these two values as a task's static goodness. A task's dynamic goodness consists of memory map and processor affinity. Despite the fact that they don't change while a task is on the run queue, they depend on which task and processor are calling schedule(). The ELSC scheduler uses static goodness to sort tasks on the run queue.

² By behave, we mean that if the current scheduler always selects a real-time task over a SCHED_OTHER task, even if it has a zero counter, then the ELSC scheduler should do the same. Aside from a few optimizations, the ELSC scheduler does adhere to the same quirky rules as the current Linux scheduler.

5.1 Implementation

The ELSC scheduler uses a new structure for the run queue. Previously, the run queue was a simple doubly linked list of nodes that each point to a task as shown in Figure 1a. To make scheduling decisions fast, we need to keep the run queue sorted, while at the same time keeping insertion and deletion times small. The ELSC scheduler does this with an array of 30 doubly linked lists. Each list in the array is used to hold tasks in a certain static goodness range, as demonstrated in Figure 1b. Lists at one end of the table hold tasks with the highest static goodness values while those at the other end hold tasks with the lowest. A top pointer is used to indicate the highest priority list that contains a runnable task.

To change the structure of the run queue from a single list to a table of lists, we need to change four run queue manipulation functions as well: add_to_runqueue(), del_from_runqueue(), move_first_runqueue() and move_last_runqueue(). The first two functions put tasks on and remove them from the run queue when appropriate. The next two functions give a task an advantage or disadvantage in the selection process when another task has the same goodness() value. Only schedule() manipulates the run queue directly.

The function add_to_runqueue() is modified slightly to deal with the new table structure. Like the current scheduler, it adds tasks to the front of a list. The particular list depends on the task. If the task is real-time, it uses one of the ten highest lists, determined by dividing the rt_priority field by 10. If the task is a SCHED_OTHER task, then the list is determined by adding counter to priority and dividing by four. Once the list is chosen, the task is added to the front of that list and the top pointer is updated if necessary.

When all tasks in the run queue exhaust their time quantum, their counters are all zero. At this time, the current scheduler resets all counters in the system. The ELSC scheduler does the same. However, to avoid re-indexing every task in the run queue when their counter is reset, we modified add_to_runqueue() as follows. If the task being inserted has a non-zero counter value, the task is inserted as described above. Otherwise, add_to_runqueue() uses a predicted counter value for the task, based on its knowledge of how the scheduler resets them. Using the predicted counter value and its current priority, the task is indexed into the run queue and added to the end of its list. This way, all zero counter tasks reside at the end of the list, behind all tasks with a non-zero counter value. The zero counter tasks are out of the way of the scheduler, but are in position once all other tasks in the run queue exhaust their quanta. A next_top pointer is used to keep track of the highest priority list containing a runnable task after counters are reset and is set at this time.

In the current scheduler, the del_from_runqueue() function removes a task from the list it is on by simply pointing the two nodes on either side of it in the list at each other. Then it sets its own run queue node's next pointer to NULL, indicating that it is no longer on the run queue. The ELSC scheduler follows exactly the same process. Afterwards, it updates both the top and next_top pointers if the removal of the task caused either one of them to change. In the ELSC scheduler, it is possible for a task to be considered on the run queue but not actually be in one of the lists in the table.³ Because the current scheduler uses a node's next pointer to indicate presence on the run queue, the ELSC scheduler also sets the prev pointer to NULL to indicate that the task is not actually on any list. The next pointer is thus left alone if the task is considered "on the run queue" without being in one of its lists.

The functions move_first_runqueue() and move_last_runqueue() were meant to bias decisions in the case of a goodness() tie. Consequently, we need only to move tasks within their current lists in the table. A task is moved within its current list to the beginning or end of its section of the list. Recall that lists can contain tasks with both zero and non-zero counter values. These functions behave appropriately when faced with mixed-counter lists.

In addition to the modification of these four functions, code was added to initialize the run queue table structure when booting. We also wrote two test routines that determine whether a list contains tasks with zero or non-zero counter values.
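The indexing rule used by add_to_runqueue() can be illustrated with a short sketch. This is not the ELSC source: the struct, the constants and the functions predicted_counter and elsc_list_index are hypothetical names. Three details are assumptions that the text does not state: that the ten real-time lists are numbered 20 to 29, that the counter prediction uses a reset rule of half the old counter plus the priority, and that SCHED_OTHER indices are clamped so they never reach the real-time lists.

```c
/* 30 lists in total; the ten highest are reserved for real-time tasks. */
enum { NUM_LISTS = 30, RT_LISTS = 10, OTHER_LISTS = NUM_LISTS - RT_LISTS };

/* Hypothetical, reduced task structure for illustration only. */
struct task {
    int  realtime;      /* non-zero for SCHED_FIFO and SCHED_RR tasks */
    long counter;       /* ticks left in the current quantum */
    long priority;      /* 1..40 */
    long rt_priority;   /* 0..99 */
};

/* Assumed reset rule, used to predict the counter after the next reset. */
static long predicted_counter(const struct task *t)
{
    return (t->counter >> 1) + t->priority;
}

/* Choose the list a task belongs to from its static goodness. */
static int elsc_list_index(const struct task *t)
{
    if (t->realtime)
        return OTHER_LISTS + (int)(t->rt_priority / 10);   /* 20..29 */

    /* A spent task is indexed where it will belong after the reset. */
    long counter = t->counter ? t->counter : predicted_counter(t);
    long index = (counter + t->priority) / 4;

    /* Assumption: keep SCHED_OTHER tasks out of the real-time lists. */
    if (index >= OTHER_LISTS)
        index = OTHER_LISTS - 1;
    return (int)index;
}
```

Under these assumptions, tasks with static goodness 22 and 23 both land in list 5, which agrees with the two tasks that share a list in Figure 1b, and a task with static goodness 40 lands in list 10.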

5.2 ELSC Scheduling Algorithm

Like the current scheduler, the actual ELSC implementation of schedule() begins by executing all outstanding bottom-halves and then performing some additional administrative work. It then deviates from the current scheduler as follows.

If the previous task was still running when it called schedule(), i.e., it exhausted its quantum, was preempted, or yielded the processor, then the ELSC scheduler inserts the task into the run queue. This step is important because tasks are removed from their run queue lists when they are executing and need to be put back on the run queue. Even if the task has yielded, it will be treated properly in the search loop. So we insert the task in the table now lest we lose track of it. Also, by re-inserting the previous task here, we do not need to treat it as a special case when evaluating the goodness of tasks. Next, just as in the current scheduler, ELSC moves exhausted SCHED_RR tasks to the ends of their lists.

The next step determines whether we need to recalculate counters. If the top pointer is zero, then there are no runnable tasks in the table with a non-zero counter value; either they all have zero counter values or there are no tasks in the run queue. If the next_top pointer is non-zero, then there are runnable tasks in the table with zero counter values, so the scheduler recalculates the counter values for every task in the system. If, however, the next_top pointer is zero, then the table is completely empty and there are no tasks to run, so we schedule the idle task and skip the rest of the decision process.

If the top pointer is non-zero, the list pointed to by it is guaranteed to have at least one non-zero counter task in it, so we start our search at the top list. The ELSC search loop attempts to emulate the goodness() calculation used by the current scheduler. Starting with the first task in the list, ELSC checks to see if the task is still running on another CPU. If so, we shouldn't schedule it. If all tasks in the list are eliminated by this check, then we consider the next populated list and try again.⁴ Next, we check to see if the task has a zero counter value. If we find such a task, then the rest of the list is either empty or unusable, so we break out of the search loop. If, however, the task we are considering has a non-zero counter value, then we evaluate its goodness. The process of selecting a task from the highest list is described below.

If the task has just yielded its processor, we will run it only if we cannot find another task on the list. This policy is slightly different than the current scheduler, which considers a yielded task to have a goodness value of zero. From this point, the task's utility is evaluated just like goodness(). Bonuses are given for having the same memory map or running on the same processor. In the uni-processor case, if a task with a matching memory map is found, we end the search loop and run the task right away because we won't find another task with a greater bonus.

When we finish examining a task, we mark it to be scheduled next if it has the highest utility seen so far. Then we select the next task in the list and repeat the process. In the worst case, every task in the run queue is placed in the same priority list (and ELSC performance can be no better than the current scheduler). So we limit the number of tasks examined in each list to a number, currently set to be half the number of processors in the system plus five, which is intended to be large enough to find tasks with adequate bonuses on SMP systems, yet still limit the search to a reasonable number of tasks. Not considering the rest of the list shouldn't be a problem, as all tasks in the list have about the same static goodness.

³ The reason for this is because we actually remove tasks from the run queue while they are running, but the rest of the Linux system would like to think that they are still on the run queue. This gives us a way to tell precisely if a task is on a list.

⁴ This can only happen on SMP systems.

Figure 2: The number of times (on a log scale) that each scheduler enters the recalculate loop during a typical run of the VolanoMark benchmark. [Panel: "Recalculate Frequency"; y-axis: Entries; x-axis: UP, 1P, 2P, 4P; series: elsc, reg.]
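The bounded search within a single list might look roughly like the sketch below. It is a simplified model rather than the ELSC implementation: it scans an array in place of a linked list, omits the yield handling and the memory-map bonus, and counts only evaluated tasks against the limit (the text does not say whether skipped tasks count). The names elsc_search_limit and elsc_scan_list, and the field last_cpu, are hypothetical.

```c
#include <stddef.h>

/* Hypothetical, reduced task structure for illustration only. */
struct task {
    long counter;    /* ticks left in the current quantum */
    long priority;   /* static priority */
    int  has_cpu;    /* 1 while running on some processor */
    int  last_cpu;   /* processor the task last ran on */
};

/* "Half the number of processors in the system plus five". */
static int elsc_search_limit(int num_cpus)
{
    return num_cpus / 2 + 5;
}

/*
 * Scan one list from front to back. Tasks with non-zero counters are
 * kept ahead of spent tasks, so the first zero counter ends the scan.
 */
static struct task *elsc_scan_list(struct task *list, size_t len,
                                   int this_cpu, int num_cpus)
{
    struct task *best = NULL;
    long best_weight = 0;
    int examined = 0;
    int limit = elsc_search_limit(num_cpus);

    for (size_t i = 0; i < len && examined < limit; i++) {
        struct task *t = &list[i];

        if (t->has_cpu)
            continue;           /* still running on another CPU */
        if (t->counter == 0)
            break;              /* rest of the list is unusable for now */

        long w = t->counter + t->priority;
        if (t->last_cpu == this_cpu)
            w += 15;            /* processor affinity bonus */
        if (w > best_weight) {
            best_weight = w;
            best = t;
        }
        examined++;
    }
    return best;                /* NULL: caller tries the next populated list */
}
```

With this limit a single call evaluates at most 5 tasks on one processor and 7 on four, regardless of how many tasks are runnable.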

For real-time tasks, the search is actually much simpler. Again, we examine only the first few tasks and don't look at those currently running on other processors. But instead of worrying about yielded processes and bonuses, we simply run the task with the highest rt_priority value.

After deciding which task to run next, the ELSC scheduler manually removes the task from its list (i.e., doesn't use del_from_runqueue()) and sets its run queue node's prev pointer to NULL. This indicates that the task is "on the run queue", even though it is not currently in a list. Finally, if the previous task had yielded the processor, then the ELSC scheduler clears the SCHED_YIELD bit to give the task a better chance in future calls to schedule().

We mentioned before that one of our design goals was to make the ELSC scheduler behave as much like the current scheduler as possible. At this point, we describe how the ELSC scheduler behaves differently. First, the ELSC scheduler tries to limit its search to one list in its table. Therefore, it may choose a task in its highest priority list that doesn't receive any bonuses for processor affinity or memory map. In this case, it is possible that a task residing in the second highest priority list, which would receive these bonuses and have had a higher goodness() value than the chosen task, is not run. We decided this behavioral difference is acceptable because the difference between the goodness() values of the two tasks is small enough to ignore.

The other difference in behavior is one that avoids an undesirable characteristic of the current scheduler. Currently, if a task enters the scheduler because it is yielding the processor and no other tasks can be scheduled, then the scheduler enters a loop to recalculate the counter value for all tasks in the system. In this situation, the ELSC scheduler runs the previous task again if it does not have a zero counter value. Figure 2 illustrates how many times each scheduler recalculates during a typical VolanoMark run on uni-processor and one, two and four processor SMP machines.

Scheduler       Time to Complete Compilation (min:sec)
Current - UP    6:41.41
ELSC - UP       6:38.68
Current - 2P    3:40.38
ELSC - 2P       3:40.36

Table 2: Average time taken to complete a full compile of the Linux kernel.

6. Experiments

The ELSC scheduler meets the first three of our four design goals. The design changes are kept local, the solution is simple, and it behaves very much like the current scheduler. The final goal of this project is to make the ELSC scheduler perform as well as the current scheduler in lighter desktop situations while scaling gracefully under heavy loads. We used two tests to determine whether we reached this goal. The first is a simple test that measures the time it takes to compile the Linux kernel. This test is meant to compare scheduler performance for light loads. The second test is the VolanoMark benchmark, described earlier. While VolanoMark may not be representative of a typical workload, it does simulate the behavior of a commercially available application. We use it in this analysis as a stress test for the two schedulers.

We compiled the Linux kernel three times on each of the schedulers, configured to run as uni-processor⁵ and two-processor kernels. We ran the test on an IBM Netfinity 5500 with dual Pentium II processors. The kernel version was 2.3.99-pre4 with our ELSC modifications. To run the test, we set up a shell script that would first build a kernel and then run "make clean". This step was intended to reduce the variance in measurement due to file system performance by pulling as much information as possible into the L1 and L2 caches. Then we used the bash "time" command to run the "make -j4 bzImage" command. Table 2 shows the average results given by the time command.

Our confidence in these measurements is very high, as the test was run multiple times and results never deviated from the mean by more than 4 hundredths of a second. For all practical purposes, the hundredths of a second reported in Table 2 are insignificant. In the two-processor case the ELSC scheduler barely edges the current scheduler by an insignificant couple hundredths of a second. In the uni-processor case the ELSC scheduler has a distinct advantage. We believe this is due to the shortcut in the ELSC search loop for the uni-processor scheduler, which ends the search as soon as a memory map match is found.

The VolanoMark benchmark test is more complicated. We ran VolanoMark in loopback mode, which simulates both the clients and servers for the Java chat rooms on the same machine. In loopback mode, communication between clients and servers does not travel across a network. In the exchange of messages between clients and servers, each must have time on the CPU to send and receive its messages in order to let the other do the same. This type of message exchanging application forces many entries into the scheduler. As suggested by the VolanoMark run rules, we ran the benchmark 11 times for each system configuration and discarded the first run due to its variant startup costs.

We ran VolanoMark with both schedulers configured as uni-processor, one, two and four processor SMP kernels. For each of these configurations, VolanoMark was configured to simulate 5, 10, 15 and 20 rooms, each with 20 simulated users exchanging 100 messages. Each simulated user creates two threads on each end of its connection (client and server, both on the same machine in loopback mode), so each room creates a total of 80 threads. It is easy to see that even at 5 rooms the VolanoMark benchmark puts considerable stress on a system. While running VolanoMark, we also collected statistics about what the scheduler was doing and exposed them through the proc file system. The overhead of collecting these statistics exists in both schedulers and in both cases is negligible. The machine used for the VolanoMark runs was an IBM Netfinity 7000 with 4 Pentium II Xeon processors.

⁵ In these experiments, uni-processor kernels are compiled without SMP enabled, eliminating its overhead. One-processor kernels are compiled with SMP enabled but use only one processor.

The metric reported by VolanoMark is message throughput, which we can use as a measure of both performance and scalability. Using the bare results from the VolanoMark runs, we can compare how each of the two schedulers behaves in different configurations. Figure 3 illustrates the performance gains given by the ELSC scheduler.

Figure 3: Throughput in messages per second for VolanoMark runs on 6 different scheduler configurations. The Y scale is adjusted on the second graph to fit all data points. [Panels: "UP and 1P Message Throughput" and "4 Processor Message Throughput"; x-axis: Number of Rooms (5, 10, 15, 20); series: elsc-up, reg-up, elsc-1p, reg-1p; elsc, reg.]

Figure 4 gives a different interpretation of the same data. To obtain some measure of how well each scheduler scales when faced with a large number of tasks, we can use the 5-room trials as a base measurement and see how performance is altered when the number of threads is increased in the 20-room trials. The number charted in Figure 4 is simply the message throughput achieved in the 20-room trials divided by the throughput achieved in the 5-room trials. As the figure indicates, the ELSC scheduler clearly scales to more threads better than the current scheduler.

Figure 4: Shows how each scheduler scales from 5 rooms to 20 rooms on various processor configurations. The height of the bar represents the scaling factor (20-room-throughput / 5-room-throughput). [Panel: "Scaling with Rooms"; x-axis: UP, 1P, 2P, 4P; series: elsc, reg.]

But these numbers do not paint the whole picture. We want to understand why the ELSC scheduler scales so much better than the current scheduler and verify that these results are not a fluke. So we collected additional statistics on the schedulers while we ran the VolanoMark tests.

The first statistic that jumps out is the number of cycles spent per entry into the scheduler. For the ELSC scheduler, this number is significantly lower than for the current scheduler, showing that the ELSC scheduler really does spend less time in the scheduler. The explanation is that ELSC, with its table-based approach to scheduling, examines far fewer tasks on each entry into the scheduler, as demonstrated by Figure 5.

The ELSC scheduler is not without fault. Although most of the statistics we collected indicate that the ELSC scheduler is faster and better, two of them show the opposite. One of the adverse effects of a table-based scheme is an increase in the number of calls to schedule() when running on a machine with more than one processor. As demonstrated by Figure 6, this increase correlates strongly with how many times a task is selected without having the processor affinity bonus. These measurements suggest the ELSC scheduler is not choosing the absolute best task on multiprocessor machines. We suspect that this is related to the fact that the ELSC scheduler finds the most suitable task in the highest populated class of static priorities. Thus, some tasks that might have higher goodness() values when the processor affinity bonus is added, but reside in lower static classes, may not be considered.
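The suspected effect can be illustrated with a toy model. The sketch below is not the ELSC implementation: the `Task` fields, the bucketing by `class_width`, the function names, and the bonus value of 15 are all assumptions chosen for illustration. It only shows how restricting the search to the highest populated static class can skip a task whose goodness, with the affinity bonus added, would be higher.

```python
from dataclasses import dataclass
from typing import List

# Assumed size of the processor affinity bonus; the text does not state it.
AFFINITY_BONUS = 15


@dataclass
class Task:
    name: str
    static_goodness: int  # priority-derived value, before any bonus
    last_cpu: int         # processor the task last ran on


def goodness(task: Task, cpu: int) -> int:
    """Static goodness plus the affinity bonus when the task last ran on `cpu`."""
    bonus = AFFINITY_BONUS if task.last_cpu == cpu else 0
    return task.static_goodness + bonus


def pick_exhaustive(tasks: List[Task], cpu: int) -> Task:
    """Examine every ready task and return the one with the highest goodness."""
    return max(tasks, key=lambda t: goodness(t, cpu))


def pick_from_top_class(tasks: List[Task], cpu: int, class_width: int = 4) -> Task:
    """Examine only the highest populated static class (hypothetical bucketing)."""
    top_class = max(t.static_goodness // class_width for t in tasks)
    candidates = [t for t in tasks if t.static_goodness // class_width == top_class]
    return max(candidates, key=lambda t: goodness(t, cpu))


ready = [
    Task("A", static_goodness=20, last_cpu=1),  # top class, no affinity for CPU 0
    Task("B", static_goodness=18, last_cpu=0),  # lower class, affinity for CPU 0
]
print(pick_exhaustive(ready, cpu=0).name)      # B: 18 + 15 beats 20
print(pick_from_top_class(ready, cpu=0).name)  # A: B's class is never examined
```

In this model the class-restricted search examines fewer tasks, which matches the lower per-call cost reported above, but it moves task B's work to a processor where it has no affinity only if B is later picked elsewhere; on CPU 0 it simply passes B over.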

[Charts for Figures 5 and 6. Panel titles: Cycles per Schedule(), Tasks Examined, Calls to Schedule(), Tasks Scheduled on New Processor. Each panel compares the elsc and reg schedulers on the UP, 1P, 2P, and 4P configurations.]

Figure 5: The first chart shows the number of cycles that are spent each time the system enters the scheduler. The second chart shows how many tasks are examined by the scheduler each time it is called.

Although the VolanoMark benchmark creates many threads with the same memory map, we do not believe this fact significantly influences the behavior of either scheduler. The only possible difference between the two schedulers would be similar to the passing over of lower classes of static priorities that happens when running on multiple CPUs. Of course, if a task is inserted into a lower priority list, then adding the one point bonus for sharing a memory map with the previous task cannot raise its goodness value enough to be greater than any of the tasks in the highest class.
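The last claim can be checked with a short inequality. As a sketch, assume (the text does not state this explicitly) that static goodness values are integers and that every task in a lower class has a strictly smaller static goodness than every task in the highest populated class. Write $s_{\text{low}}$ and $s_{\text{high}}$ for the static goodness of a lower-class and a highest-class task, $g$ for goodness after the memory-map bonus, and $b_{\text{high}} \in \{0, 1\}$ for the bonus the highest-class task receives, ignoring processor affinity:

```latex
\begin{align*}
s_{\text{low}} &\le s_{\text{high}} - 1 \\
g_{\text{low}} = s_{\text{low}} + 1 &\le s_{\text{high}} \le s_{\text{high}} + b_{\text{high}} = g_{\text{high}}
\end{align*}
```

Under these assumptions the lower-class task can at best tie a highest-class task; it can never exceed one.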

Figure 6: The first chart shows how many times (in thousands) the system enters the schedule() function call in an average 10-room VolanoMark simulation. The second chart shows how many times the scheduler chooses a task to run on a different processor than the one it ran on before.

7. Evaluation

We set out to improve the Linux scheduler's scalability, preferring modifications that do not change desktop performance and that maintain existing scheduler abstractions, yet scale well when presented with a large number of tasks. We have shown that it's possible to improve the Linux scheduler without introducing a lot of overhead. Though the ELSC scheduler does not always select the best task available on machines with more than one processor, we have demonstrated that the ELSC scheduler satisfies our goals for both a small and large number of ready tasks and offers a viable alternative to the current Linux scheduler.

The ELSC scheduler is an open source contribution and is freely available for use and modification. The current version of the ELSC patch can be downloaded from www.citi.umich.edu/projects/linuxscalability/patches/

An increasing number of organizations continue to evaluate, test, and use the Linux operating system. Although Linux does many things well, we have shown that the current scheduler has shortcomings in its design and implementation. When confronted with a large number of tasks, overall system performance declines rapidly. This behavior is unacceptable for large-scale enterprise environments.

8. Future Work

In the future, we would like to see how the ELSC scheduler performs in other multithreaded environments. One such example is a web server running Apache. Would we see the same performance gains we saw while running VolanoMark, or does something other than the scheduler cause primary bottlenecks in these systems? Would the ELSC scheduler be more effective in increasing throughput or decreasing the latency of an Apache web server?

The focus of the ELSC design is to reduce the time spent looking for a task to schedule. We would also like to find ways to allow the scheduler to make greater use of multiple CPUs and examine the effects of modifying the goodness metric. Is Linux considering everything it ought in its scheduling decisions? Do we care about processor affinity after many other tasks have run on the given processor? Can we construct a scheduler that spends less time waiting for spin locks and more time scheduling tasks?

We are also interested in exploring alternative scheduler designs. The table-based design of the ELSC scheduler is one approach; many other possibilities exist, such as sorting tasks by static goodness within heaps for each processor and address space. One could choose the absolute best task available simply by examining the top of each heap. Or perhaps a multi-priority-queue solution would be more beneficial to help the scheduler scale to multiple processors well.

9. Acknowledgments

We thank Ray Bryant and Bill Hartner of IBM, Chuck Lever of Network Appliance and Brian Noble of the University of Michigan for their guidance and assistance; Dr. Charles Antonelli and Professor Gary Tyson for providing hardware used in development at the University of Michigan; the Linux Technology Center at IBM for allowing the use of equipment at IBM for development and testing; and Chris King, Scott Lathrop and Paul Moore of the University of Michigan for their contributions to the initial development of the ELSC scheduler.

We thank the anonymous reviewers for their helpful comments, and our shepherd, Ted Faber, whose insight and suggestions were particularly valuable.

This work was partially supported by Dell, Intel, and the Sun-Netscape Alliance.

We also thank all the past, present, and future developers of Linux for their skilled and selfless contributions.

10. References

[1] Atlas, A. "Design and implementation of statistical rate monotonic scheduling in KURT Linux." Proceedings 20th IEEE Real-Time Systems Symposium. Phoenix, AZ, December 1999.

[2] Bryant, Ray and Hartner, Bill. "Java, Threads, and Scheduling in Linux." IBM Linux Technology Center, IBM Software Group. http://www-4.ibm.com/software/developer/library/java2

[3] Carr, John. "AS/400 Leads the league in Java performance." JavaWorld, 11 August 2000.

[4] Daggett, Dawn and Gillen, Al and Kusnetzky, Dan. "Linux Overtakes NetWare for the Market's Number 2 Position." International Data Corporation, Press Release, 24 July 2000. http://www.idc.com/software/press/PR/SW072400PR.stm

[5] Gooch, Richard. "Linux Scheduler Benchmark Results." 30 September 1998. http://www.atnf.csiro.au/~rgooch/benchmarks/linuxscheduler.html ftp://ftp.atnf.csiro.au/pub/people/rgooch/linux/kernel-patches/v2.2/rtqueuepatch-current.gz

[6] Lohr, Steve. "IBM goes countercultural with Linux." The New York Times On The Web, 20 March 2000. http://ww10.nytimes.com/library/tech/00/03/biztech/articles/20soft.html

[7] Neffenger, John. "The Volano Report." Volano LLC, 24 March 2000. http://www.volano.com/report000324.html

[8] Shankland, Stephen. "AOL releases open source software." Cnet News, 9 July 1999. http://www.canada.cnet.com/news/0-1005200-344644.html

[9] Wang, Y.C. "Implementing a general real-time scheduling framework in the RED-Linux real-time kernel." Proceedings 20th IEEE Real-Time Systems Symposium. Los Alamitos, CA, 1999. 246-55.

[10] Woodard, B. "Building an enterprise printing system." Proceedings of the Twelfth Systems Administration Conference (LISA XII). USENIX Association, Berkeley, CA. 1998, 219-28.