May 1, 2001 - Center for Information Technology Integration. University of ... immensely. Java technology is ...... library/tech/00/03/biztech/articles/. 20soft.html.
CITI TechnicalReport01-7
Scalable LinuxScheduling StephenPMolloy, . Peter Honeyman {smolloy,honey}@citi.umich.edu
ABSTRACT Formostofitsexistence,Linuxhasbeenusedprim operatingsystem. Yet,inrecenttimes,itsuseas commercialoperatingsystemsfornetworkservers,d otherlarge-scalesystemshasbeenincreasing. Des popularity,Linux exhibits many undesirableperform
arilyaaspersonaldesktop acost-efficientalternativeto istributedworkstationsand piteitsremarkablerisein ancetraits.
Concernedaboutthescalabilityofmultithreadedne Linux,weinvestigateimprovementstotheLinuxsch calculatingbaseprioritiesandsortingtherunque Weproposeanimprovedschedulerdesignandcompare termsofscalabilityandperformancetotheexistin analysisshowsthatimprovementscanbm e adetothe introducingoverhead,thusimprovingthescalabilit operating system.
tworkserverspoweredby eduler. Wefocusonpreueforefficienttaskselection. ourimplementationin gLinuxscheduler.Our existingschedulerwithout yandrobustnessothe f Linux
1May 2001
Center for InformationTechnology Integration University oMichigan f 519WestWilliam Street AnnArbor,MI 48103-4943
Scalable LinuxScheduling Stephen Molloy,Peter Honeyman {smolloy,honey}@citi.umich.edu
Introduction 1.
Background 2.
Linux,astrongandsteadilyincreasingpresenceon theInternettoday[4],commonlyprovidescosteffectiveandload-tolerantsolutionsfornetwork services. ItsfamiliaritytoUNIXusers,sourceco availability,andabilitytorunonmanydifferent architectures explain Linux’s rapid increase in popularity.Manysoftwarecompanies,including AOL,NetscapeandIBM[6,8],offerLinuxproducts. SeveralorganizationsuseLinuxonrouters,printa fileservers,firewallsand,ofcourse,webapplica servers [10].
BecauseLinuxgrewfromda esktopoperatingsystem, manyissuespreventitfrombeingdominant a force theenterpriseservermarket. Thedesignand implementationofLinuxhastraditionallyfocusedo simplicity and versatility rather than small performance gains and scalability. The implementationoftheLinuxthreadmodeland scheduler illustratethis approach.
AtthesametimeaL s inux’spopularityhasincrease theuseofJavaforwebapplicationshasgrown immensely. Javatechnologyis akeycomponentin buildingscalableapplicationservers. However, communications-intensive Java applications often createlargenumbersofthreadsandLinuxdoesnot handlesuch stress gracefully. Concernedaboutthescalabilityofmultithreaded network serverspoweredbyLinux,weinvestigate improvementstotheLinuxscheduler. Experiments byIBMindicatethatasmucha3s0%ofthetotalCP timeinthesystemisspentintheschedulerwhent numberorfunningthreadsishigh[2]. Ouranalysi showsthecurrentschedulerusesanexpensiveand redundantalgorithmfortaskselection. Ourgoali improvethescalabilityothe f Linuxschedulertoa ittoenterprise-scaleserverworkloads. Ouranaly shows that our new scheduler implementation achieves thesegoals. Therestofthispaperisorganizedafsollows. Se 2givessomebackgroundinformationonourproject andprovidessomeinsightastowhywechoseto tackletheschedulerratherthantheLinuxthreadin model. Section3describesLinux’scurrentapproac toscheduling. Section4explainstheproblemswit it.Section5outlinesourapproachtosolvingthe problemandSection6describesitsperformance relativetothecurrentLinuxscheduler. Allkerne modifications, experiments, and descriptions are againsta2.3.99-pre4Linuxkerneland,whenused, Javaversion 1.1.7oIBM’s f JDK.
de
nd tion d,
U he s tso dapt sis
ction
g h h
l
TheLinuxthreadmodelisaone-to-onemodel, meaningthateveryuser-levelthreadismappedonto itsownkernelthread.Whilethismodelmakes programminginthekernellesscomplicated,it sometimesforcesprogramstogeneratemorekernel threadsthanins ecessary. Forcingthekernel’sde schedulertoaccommodatetoomanythreadscan adversely affect saerver’s performance.
in n
fault
Manyestablishedoperatingsystemssupportmany-tooneormany-to-manythreadmodelsinwhicheach kernelthreadhasmanyuserlevelthreads mappedto it. Inthesemodels,asecondaryschedulerchooses whichofthemappeduserlevelthreadstorun. The multi-tierschedulingapproachothese f modelsassu res thatthescheduleraet achlevelwillbefacedwith a moremanageablenumber of threads. TheLinuxscheduler,likeitsthreadmodel,isalso exerciseinsimplicity. Theheartotfhescheduler conciselycodedinjustafewlinesthatevaluate runnablethreadsandthenpicksthebest. Thepric forthissimplicitythough,isalineartimealgori thatrepeatsmanyofthesamecalculationsthatwer performedoints lastinvocation. Inthisproject,weaddresstheshortcomingsofthe schedulerratherthanthoseofthethreadingmodel. Thereasonsforthisdecisionaresimple. Wesawt redundantcalculationandO(n)loopinthecurrent schedulerandknewwecouldimproveit. Wealso knowthattheLinuxkernelcommunityhasbeenvery protectiveofitsthreadingmodelinthepastandw wantedtoavoidupsettinganycontributors. Finall theoriginaltimelinefortheprojectcalledfora workingdesignandimplementationwithinthetime frameoone f semester. Developing new a threading
an is e thm e
he
e y,
volatile long
state
unsigned long
policy
long
counter
long
priority
struct mm_struct
*mm
struct list_head
run_list
int
has_cpu
int
processor
Table 1: Thistableshowsthefieldsofthetask structure thatare mostrelevantto Linux schedulin
andotherstateinformationaboutthetaskandits registers.Italsotrackstaskstatisticsformemo managementandresourcecontrol,privileges,file descriptors,signalhandlersandothertaskspecifi information. Thevariousfieldsotfhetaskstruct useditnhescheduler areillustratediT n able1.
ry c ure
Thetask’s statefieldcanbseettooneosix f values, eachrepresentingadifferentstatethatinwhicha task mightfinditself(suchasblockingorsleeping.) TASK_RUNNINGisthevalueof statewhenataskis runnable. g.
modelforLinuxwouldalmostcertainlyrequiremore timethan wheadavailable. Othergroupshavespentconsiderabletimedesigning alternativeschedulersforLinux[1,5,9].Linux discussiongroupsprovideevidencethattheschedul hasbeenandcontinues tobeaninterestingtopicfor thedevelopercommunity. However,mostalternative scheduler designs focusonreducing latency for timeprocessesratherthanimprovingtheoverall scalability othe f defaultscheduler. Whileitisourgoaltoimprovescalabilityand performanceotfheschedulerwhenfacedwithalarg numberofrunnablethreads,itisnotourintentto changethecriteriaitusesforthreadselection. feelthatthecurrentcriteriaarecarefullychosen sufficienttomakegooddecisionswithaminimum amountofcalculation. Ourprimaryconcernisimp thatthesecriteriaarenotbeingusedinanoptima algorithm by thescheduler.
er
real-
e We and ly l
Intheremainderotfhispaper,becauseLinuxuses one-to-onethreadingmodel,wedonotdistinguish betweenauserthreadandakernelthread. Also,t matchtheterminologyusedinthekernelsourcecod werefer toany threaditnhesystem as taask.
a o e,
Current 3. Scheduler TounderstandwhythecurrentLinuxschedulerscale poorlywiththenumberofrunnablethreadsinthe system,itisnecessarytobefamiliarwithitsdat structures,algorithms,andconventions. Thissect outlines the existing scheduler to clarify our observations anddesign decisions.
s a ion
3.1 Task Structure ThebasicexecutioncontextinLinuxirseferredto atask. Thetaskstructureisresponsiblefor maintaining a task’s address space information, whetherthataddressspaceisharedwithothertas
as
ks,
The policyfieldisseteitherto SCHED_FIFO, SCHED_RR (round robin) or SCHED_OTHER to determinetheschedulingpolicyforthetask. The first twooptionsareforreal-timetasks,whilethethir dis forallothertasks. Realtimetasksarealwaysru n beforeregulartasksithey f arerunnable. The policy fieldiaslsousedtotrackyieldedtasks. Whena nonreal-time task gives up its processor via the sys_sched_yield() system call, a bit (SCHED_YIELD)inthetask’s policyfieldissetso this information can bpeassedotnothescheduler. Thefield has_cpuissetto1whiletask a iesxecuting onaprocessorand0otherwise.Uponsetting has_cputhe , field processoris settotheprocessor IDonwhichthetaskwillexecute. Thetaskstruct alsocontainspointersthatidentifythememorymap which iruns t andits placeotnherun queue.
ure in
Thetwomostimportantfactorsindeterminingwhich taskexecutesnextarerepresentedbythe priority and counterfields. Priorityisanintegerbetween 1and40. Highernumbersrepresenthigherpriority . Twentyisthedefaultvalueforalltasks. (Real-t ime tasksalsouseapriorityvalue,butitrangesfrom 0to 99 and is stored in a separate field called rt_priority.) Counter is value a thatindicatesthe time remaining in the task’s current quantum. Counter,measuredin10msticks,canrangefrom zerototwicethetask’s priorityLinux . usesthis fieldtoenforce fairness a policy. Itisworthnotingthatalltasks,whethertheyare lightweightthreadsorfull-fledgedprocesses,are treatedthesamebLinux ya system. Allprocesses threadsarevisibleinvarioussystemstatuscomman suchaspsandtop.Consequently,thedefault scheduler,whichirsesponsibleforaccommodatinga tasksinthesystem,canbeplacedunderconsiderab stress when running multithreadedapplications.
and ds
3.2 RunQueue TherunqueueinLinuxicircular, as doubly linked list containingalltasksinthe TASK_RUNNINGstate. The
ll le
schedulertraversesthislistwhenilooks t forat run. Thelistisnotmaintainedinsortedorder. theschedulerfindstwoequivalenttasks,theone closertothefrontofthelistischosen. Newlyc orawakenedtasksareplacedatthebeginningofth runqueue. Thelistisdoublylinkedandcircular, tasks can alsobaeddedtotheendothe f run queue
askto When reated e so .
3.3 Schedule() TheLinuxkernelfunction schedule()a, sinother operatingsystems,iscalledfromover500places withinthekernel,underscoringitssignificanceto overall systemperformance. The schedule() functioniscalledbyataskwhenityieldsthe processor,blocksforI/O,expiresitsquantum,or is preempted by another (higher priority) task. Schedule()usestheexecutioncontextofthetask thatcalledit(referredtoastheprevioustaskin the scheduler). Schedule()ischargedwithfindingthe besttasktotaketheprevioustask’splaceonthe processor. Indoingso,im t akesuseofaheuristi c computedbtyhefunction goodness(). 3.3.1Goodness Calculation Theschedulerusesthe goodness()function to determinetheutilityorfunningagiventask. Ah igh goodnessvaluemeansiwould t bsound ae decisiont o runthegiventasknext. Fortasksthataremarked SCHED_FIFOor SCHED_RR, goodness()returns 1000plusthevaluestoredinthetasks rt_priority field. Forothertasks,however, goodness()returns m a uchlowernumberandshowsmorediscretioninit s evaluation. For SCHED_OTHERtasks,fourfactorsaretakeninto consideration. Thefirstfactorisatask’s counter value. Ifataskhasa countervalueofzero,then goodness()returnsautilityofzero. Thisletsthe schedulerknowraunnabletaskwasfoundbutitsti me sliceiusseduIf tap. ask’s counter valueinsotzero, thenitsgoodnessvalueissettothesumofits counter and priorityvalues. Thethirdandfourthfactorsarebonusesforproces sor affinityandsharinganaddressspacewiththeprev ious task. Asmall,onepointadvantageisgiventotas ks thatsharememorymaps,becauseofthereduced overheadforthecontextswitch. Asomewhatlarger (15point)bonusigs iventotaskswhoselastrunw as onthecurrentprocessor,totrytotakeadvantage of memorylinesthatmaystillresideintheprocessor ’s cache.Thesebonusesareaddedtothepreviously calculatedgoodnessvalueto determinethetask’sfinal goodness value.
3.3.2SchedulingAlgorithm Theschedulerbeginsbyexecutingalloutstanding bottom-halves (delayed functions that were too substantialtorunduringaninterrupt.) Aftersom e additionaladministrativework,the schedulerenters theheartoiftscode: anexaminationofallrunna ble tasks. Theprevioustaskitshefirsttasklooked atby thescheduler. Ifthe SCHED_YIELDbitissetforthe previoustask,thentheschedulerclearsthebitan d useszeroasthetask’sgoodnessvalue. Otherwise, it calls goodness()todeterminethis value. Next,theschedulerwalksthroughtherunqueue, evaluatingthegoodnessofeachtasknotcurrently runningonanotherprocessor. Afterallrunnablet asks havebeenexamined,thetaskwiththegreatest goodnessvalueics hosentorunontheprocessor. If 1 notaskhasagoodnessgreaterthanzero t,henthe schedulerjumpstoapieceofcoderesponsiblefor recalculatingthe countervaluesofalltasksinthe system(runnableorotherwise)andreturnstosearc h therun queueagain. Whilethe goodness()functionbyitselfisvery simple,executesquicklyandconsidersthemost appropriatefactorsinmakingintelligentschedulin g decisions,iitsexpensivetorecalculate goodness() for every task oenvery invocation othe f scheduler .
Problem 4. Efficienthandlingofmultiplethreadsiscrucialf enterpriseserverstomakebestuseofsystem resources,communicatewithmanypartiesathe t sam time,andreducetheaveragetimethatservicerequ spendwaitingforanavailableserver. Multiplexin I/Osystemcalls(suchas selectc) anhelpinsome situations,buttheyarenotalwaysavailable.The popularJavaprogramminglanguageisaprime example. ThreadsareanessentialelementintheJavalangua becausetheJavalanguagelacksaninterfaceforno blockingandmultiplexingI/O,threadsareespecial importantinconstructingcommunicationsintensive applications. Typically,oneom r oreJavathreads constructedforeachcommunicationsstreamusedby Javaprogram.Therefore,anativelythreadedJava VirtualMachine(suchasIBM’sJVM[7])canputa strainontheLinuxscheduler,which,aswhe avese examinesthegoodnessfunctionforevery threadin run queue. This can baeenxhausting process.
1 The runqueue mustcontainaleast t one taskfor t count. Anempty runqueue will schedule the idle t trigger the recalculation.
hisconditionto askrather than
or e ests g
ge: nly are a
en, the
list head
22
40
33
AprofileotfhekerneltakenduringtheVolanoMark runsshowedthatbetween37(5-room)and55(25room)percentoftotaltimespentinthekerneldur thetestis spentin thescheduler.
23
(a)Run Queuefor CurrentLinux Scheduler list head
ELSC 5. Scheduler
40
Toreducetheamountotfimespentinthescheduler wedevelopedanewschedulingsolution,calledthe ELSCscheduler.Ourgoalsinimplementingthis scheduler areafollows: s
list head list head
ing
33
1) Keepchangeslocaltothescheduler. Donot changecurrentinterfaces tothescheduler.
list head
22
2) Keeptheconceptandimplementationsimple. 3) Behavelikethecurrentschedulerasmuchas
23
(b)Run Queuefor ELSC Scheduler
4)
Figure1:
Illustrationofrunqueuestructuresforboth schedulers.Thesquaresrepresentlistheadsandt hecircles representtasks. Thelabelsonthetasksindicate thestaticgoodness of thatparticular task.
ExperimentsatIBMshowtheimpactoftheLinux schedulerontheperformanceofamultithreaded networkapplicationwritteninJava[2]. VolanoMar isabenchmarkwrittentomeasuretheperformanceo VolanoChat,aJavaimplementationofachatroom server. Because its results have beenwidely publishedinmagazinessuchasJavaWorld[3], VolanoMark is an important benchmark for comparing the performance of different implementations of theJavaVirtualMachine. TheVolanoMarkbenchmarkestablishesasocket connectiontoachatserverforeachsimulatedchat roomuser.BecauseJavadoesnotprovidenonblockingreadandwrite,VolanoMarkusesapair threadsoneachendofeachsocketconnection(4 threadsperconnection)tosimulatenon-blockingI/ Fora5to25-roomsimulation,thekernelmust potentiallydealwith400to2,000threadsinther queue. Thekeymeasureopf erformancereported VolanoMarkismessagethroughput,i.e.,thenumber ofmessagespersecond(overallconnections)the serveriasbletohandleduringabenchmarkrun. T measurementsforIBM’sreportweretakenwhile runningVolanoMarkoveraloopbackinterface, eliminatinganynetworkoverheadinvolved;theheap sizeforthetestwaslargeenoughfortheoverhead Javagarbagecollectiontobleessthan5%ofthet elapsedtimethroughouttheexperiments. TheresultsotfheVolanoMarkexperimentsshowthat 25-roomthroughputdecreasedby24%from5-room throughputduetotheadditionalthreadsinthesys
possible.2 Maintain existing performance for light loads. Scalegracefully under heavy loads.
TheELSCschedulerisatable-basedschedulerthat keepstherunqueueinasortedorder,making schedulingdecisionseasierandfaster. Wechosea tablebaseddesignbecauseitrelievesusofthe overheadofsortinglists. Italsoavoidscomplexi when inserting oremoving r tasks,unlike,say, hae k f
of O. un by
he
of otal
tem.
ty ap.
ThefoundationotfheELSCscheduleriistsability to keeptasksinanorderthatmakeschoosingonefast . Thekeytothissortedorderisinhowatask’s goodness() valueiscalculatedinthecurrent scheduler. The goodness()calculation consistsoaf staticanddynamic a part. Thestaticpartconsist osaf task’s priorityand countervalues. Whileatask isontherunqueuebutnotrunningonprocessor, a its countervaluedoesnotchange.Likewise,its priorityalmostneverchanges,thoughwhenit does,theELSCscheduleradaptsaccordingly.We refertothecombinationothese f twovaluesatas ask’s staticgoodness Atask’s . dynamicgoodness consists ofmemorymapandprocessoraffinity. Despitethe factthattheydon’tchangewhileataskisonthe run queue,theydependonwhichtaskandprocessorare calling schedule() The . ELSCschedulerusesstatic goodness tosorttasks on therun queue. 5.1 Implementation TheELSCschedulerusesanewstructurefortherun queue. Previously,therunqueuewas simple a doub linkedlistofnodesthateachpointtoataskas 2 By behave,we meanthatif the currentscheduler a real-time taskover S a CHED_OTHERtask,eveniitf counter,thenthe ELSCscheduler shoulddtohe same few optimizations,the ELSCscheduler doesadhere t quirky rulesasthe currentLinux scheduler.
ly hown
lwaysselectsa has zaero Aside . from a tohe same
inFigure1To a.makeschedulingdecisionsfast,w e needtokeeptherunqueuesorted,whileathe t sam e timekeepinginsertionand deletiontimessmall. The ELSCschedulerdoesthiswithanarrayo3f 0doubly linkedlists. Eachlistinthearrayiussedtoho ldtasks inacertainstaticgoodnessrange,asdemonstrated in Figure1b. Listsaot neendofthetableholdtask s withthe higheststaticgoodnessvalueswhiletheother endholdtaskswiththelowest. A toppointer isused toindicatethehighestprioritylistthatcontains a runnabletask. Tochangethestructureothe f runqueuefrom sain gle listtoatableoflists,weneedtochangefourru n queue manipulation functions as well: add_to_runqueue(), del_from_runqueue(), move_first_runqueue() and move_last_runqueue().Thefirstofthetwo functionsputstasksonandremovesthemfromthe runqueuewhenappropriate. Thenexttwotasksgiv e ataskanadvantage/disadvantageintheselection processwhenanothertaskhasthesame goodness() value. Only schedule()manipulatestherunqueue directly. The function add_to_runqueue() is modified slightlytodealwiththenewtablestructure. Lik ethe currentscheduler,iat ddstaskstothefrontoaf list. Theparticularlistdependsonthetask. Iftheta skis real-time,itusesoneofthetenhighestlists, determinedbydividingthert_priorityfieldby10. If thetaskisa SCHED_OTHERtask,thenthelistis determinedbyadding counterto priorityand dividingbyfour. Oncethelistischosen,thetas kis addedtothefrontotfhatlistandthe toppointeris updatedinecessary. f Whenalltasksintherunqueueexhausttheirtime quantum,their countersareallzero. Atthistime, thecurrentschedulerresetsall countersinthe system. TheELSCschedulerdoesthesame. However,toavoidre-indexingeverytaskintherun queuewhentheir counterisreset,wemodified add_to_runqueue()asfollows. Ifthetaskbeing insertedhasanon-zero countervalue,thetaskis inserted as described above. Otherwise, add_to_runqueue() usesapredicted counter valueforthetask,basedonitsknowledgeohf owt he schedulerresetsthem. Usingthepredicted counter valueanditscurrent priorityt,hetaskisindexed intotherunqueueandaddedtotheendofitslist . Thisway,allzero countertasksresideathe t endof thelist,behindalltaskswithanon-zero counter value. Thezero countertasksareoutoftheway of thescheduler,butareinpositiononceallothert asks intherunqueueexhausttheirquanta. A next_top
pointerisusedtokeeptrackothe f highestpriori tylist containingarunnabletaskafter countersarereset andiset s atthis time. Inthecurrentscheduler,the del_from_runqueue() functionremoves task a fromthelistitisonbys imply pointingthetwonodesoneithersideoitin fthe listat eachother. Thenitsetsitsownrunqueuenode’s nextpointertoNULL,indicatingthatitisnolonger ontherunqueue.TheELSCschedulerfollows exactlythesameprocess. Afterwards,itupdatesb oth thetopand next_toppointeritfheremovalotfhe taskcausedeitheroneofthemtochange.Inthe ELSCscheduler,itispossibleforatasktobe consideredontherunqueuebutnotactuallybien one 3 ofthelistsinthetable. Because anode’s next pointerindicatespresenceontherunqueuebythe currentscheduler,theELSCscheduleralsosetsthe prev pointertoNULLtoindicatethatthetask ins ot actuallyonanylist,thusleavingthe nextpointer aloneifthetaskisconsidered“ontherunqueue” withoutbeing otnherun queue. The functions move_first_runqueue()and move_last_runqueue() were meant to bias decisions in the case of a goodness() tie.
Consequently,weneedonlytomovetaskswithin theircurrentlistsinthetable. Ataskim s oved within itscurrentlisttothebeginningoend r oits f sec tionof thelist. Recallthatlistscancontaintaskswith both zeroandnon-zero countervalues. Thesefunctions behaveappropriatelywhenfacedwithmixed-counter lists. Inadditiontothemodificationothese f fourfunct codewasaddedtoinitializetherunqueuetable structurewhenbooting.Wealsowrotetwotest routinesthatdeterminewhetheralistcontainstas with zeroonon-zero r counter values.
ions,
ks
5.2 ELSCSchedulingAlgorithm Like the current scheduler, the actual ELSC implementationof schedule() beginsbyexecuting alloutstandingbottom-halvesandthenperforming someadditionaladministrativework. Itthendevia tes from thecurrentscheduler as follows. Iftheprevioustaskwasstillrunningwhenitcall schedule(),i.e.,itexhausteditsquantum,was preempted,oryieldedtheprocessor,thentheELSC schedulerinsertsthetaskintotherunqueue. Thi isimportantbecausetasksareremovedfromtheirr
3
The reasonfor thisisbecause waectually remove t runqueue while they are running,butthe restof t wouldlike ttohinkthatthey are still onthe run us w a ay ttoell precisely itafaskion slist. a
ed
step un
asksfromthe he Linux system queue. Thisgives
put d,it
RecalculateFrequency 1000000
ofit. not the uler,
100000
Entries
queuelistswhentheyareexecutingandneedtobe backontherunqueue. Evenithe f taskhasyielde willbetreatedproperlyinthesearchloop. Sowe insertthetaskinthetablenowlestwelosetrack Also,byre-insertingtheprevioustaskhere,wedo needtotreatiat saspecialcasewhenevaluating goodnessotasks. f Next,justasthecurrentsched ELSCmovesexhaustedSCHED_RRtaskstotheends of their lists.
10000
elsc
1000
reg
100 10
Thenextstepdetermineswhetherweneedto recalculate counterIfs. thetoppointerizsero,then therearenorunnabletasksinthetablewith non a -zero countervalue;eithertheyallhavezero counter valuesotrherearenotasksintherunqueue. If the next_top pointerisnon-zero,thentherearerunnable tasksinthetablewithzero countervalues,sothe schedulerrecalculatesthe countervaluesforevery taskinthesystem. If,however,the next_top pointer iszero,thenthetableiscompletelyemptyandthe re arenotaskstorun,sowescheduletheidletaska nd skiptherestof thedecision process. Ifthetop_pointerins on-zero,thelistpointedto byit isguaranteedtohavealteastonenon-zero counter taskinit,sowestartoursearchatthetoplist. The ELSC search loop attempts to emulate the goodness() calculation used by the current scheduler. Startingwiththefirsttaskinthelis t,ELSC checkstoseeif thetaskisstillrunningonanother CPU. Ifso,weshouldn’tscheduleiIft.alltask isn thelistareeliminatedbythischeck,thenwecons ider 4 thenextpopulatedlistandtryagain. Next, wecheck toseeithe f taskhas zero a counter value. Ifwefind suchatask,thentherestofthelistiseitherem ptyor unusable,sowebreakoutofthesearchloop.If, however,thetaskweareconsideringhasanon-zero countervalue,thenweevaluateitsgoodness. The processofselectingataskfromthehighestlisti s describedbelow. Ifthetaskhasjustyieldeditsprocessor,wewill runit onlyifwecannotfindanothertaskonthelist. T his policyisslightlydifferentthanthecurrentsched uler, whichconsidersayieldedtasktohaveagoodness valueofzero. Fromthispoint,thetask’sutility is evaluatedjustlike goodness() Bonuses . aregiven forhavingthesamememorymaporrunningonthe same processor. uni-processor the In case aif , searchloopandrunthetaskrightawaybecausewe won’tfindanother task with greater a bonus. Whenwefinishexaminingatask,wemarkittobe schedulednextifihas t thehighestutilityseens Then weselectthe nexttask the inlistandrep 4
Thiscanonly happenonSMPsystems.
ofar. eatthe
UP
1P
2P
4P
Figure2:
Thenumberoftimes (onalogscale)thateach schedulerenterstherecalculateloopduringatypi calrunofthe VolanoMarkbenchmark.
process. Intheworstcase,everytaskintherun isplacedinthesameprioritylist(andELSC performancecanbenobetterthanthecurrent scheduler). Sowleimitthenumberoftasksexamin ineachlisttoanumber,currentlysettobehalf numberofprocessorsinthesystemplusfive,which intendedtobelargeenoughtofindtaskswith adequatebonusesonSMPsystems,yetstilllimitth searchtoareasonablenumberoftasks.Not consideringtherestofthelistshouldn’tbepro a asalltasksinthelisthaveaboutthesamestatic goodness.
queue
ed the is e blem,
Forreal-timetasks,thesearchisactuallymuch simpler. Again,we xamineonlythefirstfewtask anddon’tlookatthosecurrentlyrunningonother processors.Butinsteadofworryingaboutyielded processesandbonuses,wesimplyrunthetaskwith thehighestrt_priority value.
s
Afterdecidingwhichtasktorunnext,theELSC schedulermanuallyremovesthetaskfromitslist( i.e., doesn’tuse del_from_runqueue())andsetsrun queuenode’s prev pointertoNULL.This indicatesthatthetaskis“ontherunqueue”,even thoughitisnotcurrentlyinalist.Finally,if the previoustaskhadyieldedtheprocessor,thenthe ELSCschedulerclearsthe SCHED_YIELDbittogive the task a better chance in future calls to schedule(). Wementionedbeforethatoneoour f designgoalswa tomaketheELSCschedulerbehaveam s uchlikethe currentscheduleraspossible.Atthispoint,we describehowtheELSCschedulerbehavesdifferently First,theELSCschedulertriestolimititssearch onelistinitstable. Therefore,itmaychoosea itshighestprioritylistthatdoesn’treceiveany bonusesforprocessoraffinityomemory r map. Int case,itis possiblethat taask residing itnhese
s
. to taskin his cond
Scheduler
TimetoCompleteCompilation
Current UP -
6:41.41
ELSC UP -
6:38.68
Current 2P -
3:40.38
ELSC 2P -
3:40.36
Table 2: Averagetimetakentocompleteafullcompileotfh
e
Linux kernel.
highestprioritylist,whichwouldreceivethese bonusesandhavehadahigher goodness()value thanthechosentask,isnotrun. Wedecidedthis behavioral difference is acceptable because the differencebetweenthe goodness() valuesothe f two tasks is smallenough toignore. Theotherdifferenceinbehaviorios nethatavoids an undesirablecharacteristicofthecurrentscheduler . Currently,iaftaskenterstheschedulerbecausei its yieldingtheprocessorandnoothertaskscanbe scheduled,thentheschedulerentersaloopto recalculatethe countervalueforalltasksinthe system. Inthissituation,theELSCschedulerruns the previoustaskagainiitdoes f nothave zero a counter value.Figure2illustrateshowmanytimeseach schedulerrecalculatesduringatypicalVolanoMark runonuni-processorandone,twoandfourprocesso r SMPmachines.
Experiments 6. TheELSCschedulermeetsthefirstthreeoof urfou r designgoals. Thedesignchangesarekeptlocal,t he solutionissimple,anditbehavesverymuchliket he currentscheduler. Thefinalgoalofthisproject isto maketheELSCschedulerperformaswellasthe currentschedulerinlighter desktopsituationswhile scalinggracefullyunderheavyloads. Weusedtwo teststodeterminewhetherwereachedthisgoal. T he firstisasimpletestthatmeasuresthetimeita t kesto compiletheLinuxkernel.Thistestismeantto compareschedulerperformanceforlightloads. The secondtestistheVolanoMarkbenchmark,described earlier. WhileVolanoMarkmaynotberepresentativ e oftaypicalworkload,itdoessimulatethebehavio of r acommerciallyavailableapplication. Weuseitin this analysis as satress testfor thetwoschedule rs.
WecompiledtheLinuxkernelthreetimesoneachof 5 theschedulers,configuredtorunausni-processor and two-processorkernels. WeranthetestonanIBM Netfinity5500withdualPentiumIIprocessors. Th e kernelversionwas2.3.99-pre4withourELSC modifications. Torunthetest,wesetupshell a script thatwouldfirstbuildakernelandthenrun“make clean”. Thisstepwasintendedtoreducethevaria nce inmeasurementduetofilesystemperformanceby pullingasmuchinformationaps ossible intothe L1 andL2caches. Thenweusethebash“time” commandtorunthe“make -j4bzImage”command. Table2showstheaverageresultsgivenbythetime command. Our confidenceitnhesemeasurements is very high a thetestwas run multipletimes andresults never deviatedfrom themean bm y orethan 4hundredths of saecond. For allpracticalpurposes,thehundredt saecondreportediT n able2areinsignificant. In two-processor casetheELSC scheduler barely edges thecurrentscheduler by ainnsignificantcouple hundredths of saecond. In theuni-processor case ELSC scheduler has daistinctadvantage. Webeliev this is duetotheshortcutin theELSC search loop theuni-processor scheduler,which ends thesearch soon am as emory mapmatch ifound. s TheVolanoMarkbenchmarktestismorecomplicated. WeranVolanoMarkinloopbackmode,which simulatesboththeclientsandserversfortheJava roomsonthesamemachine.Inloopback communicationbetweenclientsandserversdoesnot travelacrossanetwork. Intheexchangeom f essag betweenclientsandservers,eachmusthavetimeon theCPUtosendandreceiveit’smessagesinorder lettheotherdothesame.Thistypeofmessage exchangingapplicationforcesmanyentriesintothe scheduler.AssuggestedbytheVolanoMarkrun rules,weranthebenchmark11timesforeachsyste configurationanddiscardedthefirstrunduetoit variantstartupcosts.
s
hs of the
the e for as
chat mode,
WeranVolanoMark withbothschedulersconfigured asuni-processor,one,twoandfourprocessorSMP kernels. For each of these configurations, VolanoMarkwasconfiguredtosimulate510, , 15and 20rooms,eachwith20simulatedusersexchanging 100messages.Eachsimulatedusercreatestwo threads,soeachroomcreatesatotalof80threads It. iseasytoseethatevenat5roomstheVolanoMark benchmarkputsconsiderablestressonasystem. While running VolanoMark, we also collected 5 In these experiments,uni-processor kernelsare co mpiledwithout SMPenabled,eliminating itsoverhead. One-process or kernelsare compiledwithSMPenabledbut use only one processor.
es to
m s
UPand1PMessageThroughput
Scalingw ithRoom s
1.2
4600
1 4200 3800
elsc-up
0.8
reg-up
0.6
elsc
0.4
reg
elsc-1p reg-1p
3400
0.2 3000
0 5
6200 5700 5200 4700 4200 3700 3200 2700 2200 1700
10 15 NumberoR f ooms
20
UP
elsc reg
Figure3:
10 15 NumberoR f ooms
Throughput in messages VolanoMarkrunson6differentschedulerconfigurat scale iadjusted s onthe secondgraphtfoitall da
20
per second for ions. TheYtapoints.
statisticsaboutwhattheschedulerwasdoingand exposedthemthroughtheprocfilesystem.The overheadofcollectingthesestatisticsexistsinb schedulersandinbothcasesisnegligible.The machineusedfortheVolanoMarkrunswasanIBM Netfinity 7000with 4Pentium IIxeon processors. ThemetricreportedbyVolanoMarkismessage throughput,whichwecanuseasameasureofboth performanceandscalability. Usingthebareresult fromtheVolanoMarkruns,wecancomparehoweach of the two schedulers behaves in different configurations. Figure3illustratestheperforman gains given btyheELSC scheduler. Figure4givesadifferentinterpretationofthesa data.Toobtainsomemeasureofhowwelleach schedulerscaleswhenfacedwithalargenumberof tasks,wecanusethe5-roomtrialsasabase measurementandseehowperformanceisaltered whenthenumberofthreadsisincreasedinthe20roomtrials. ThenumberchartedinFigure4issim themessagethroughputachievedinthe20-roomtria dividedbythethroughputachievedinthe5-room
2P
4P
Figure4: Showshoweachschedulerscalesfrom5roomsto20 roomsonvariousprocessorconfigurations. Thehei representsthescalingfactor(20-room-throughput/ throughput).
4ProcessorMessageThroughput
5
1P
oth
s
ce me
ply ls
ghtofthebar 5-room-
trials. Asthefigureindicates,theELSCschedule clearlyscalestomorethreadsbetterthanthecurr scheduler.
r ent
Butthesenumbersdonotpaintthewholepicture. wanttounderstandwhytheELSCschedulerscalesso muchbetterthanthecurrentschedulerandverifyt theseresultsarenotafluke.Sowecollected additionalstatisticsontheschedulerswhilewrea VolanoMark tests.
We
Thefirststatisticthatjumpsoutisthenumberof cyclesspentperentryintothescheduler.Forthe ELSCscheduler,thisnumberissignificantlylower than the current scheduler, proving the ELSC schedulerreally doesspendlesstimeinthescheduler. TheexplanationisthatbecauseELSCwithitstable basedapproachtoschedulingexaminesfarfewer tasks on each entry into the scheduler, as demonstratedbF y igure5.
hat nthe
-
TheELSCschedulerisnotwithoutfault. Although mostofthestatisticswecollectedindicatethatt he ELSCschedulerisfasterandbetter,twoothem f sh ow theopposite. Oneoftheadverseaffectsoaftabl ebasedschemeias nincreaseinthenumberocfalls to schedule()whenrunningonamachine withmore thanoneprocessor. AsdemonstratedbyFigure6, thereisastrongcorrelationwithhowmanytimesa taskisselectedwithouthavingtheprocessoraffin ity bonus.ThesemeasurementssuggesttheELSC schedulerisnotchoosingtheabsolutebesttaskin multiprocessormachines.Wesuspectthatthisis relatedtothefactthattheELSCschedulerfindst he mostsuitabletaskinthehighestpopulatedclasso f staticpriorities.Thus,sometasksthatmighthav e higher goodness() values when the processor affinitybonusisadded,butresideinlowerstatic classes,may notbeconsidered.
CyclesperSchedule()
25000
CallstoSchedule()
4200
20000
3700 15000 10000
elsc
3200
reg
2700
5000
elscsched regsched
2200
0
1700 UP
1P
2P
4P
UP
TasksExam ined
40 35 30 25 20 15 10 5 0
1P
2P
4P
TasksScheduledonNewProcessor
1000 800 elsc
600
reg
400
elsccpu regcpu
200 UP
1P
2P
0
4P
UP
Figure5:
Thefirstchartshowsthenumberocf yclesthatare spenteachtimethesystementersthescheduler. T hesecondchart showshow many tasksare examined btyhe scheduler each time iist called.
AlthoughtheVolanoMarkbenchmarkcreatesmany threadswiththesamememorymap,wedonotbelieve thisfactsignificantlyinfluencesthebehaviorof scheduler. Theonlypossibledifferencebetweenth twoschedulers wouldbseimilar tothepassing over lowerclassesofstaticprioritiesthathappenswhe runningonmultipleCPU’s. Ofcourse,ifataskis insertedintoalowerprioritylist,thenaddingth pointbonusforsharingamemorymapwiththe previoustaskcannotraiseit’sgoodnessvalueenou tobgereater than any othe f tasks in thehighest
either e of n eone gh class.
Evaluation 7.
1P
WesetouttoimprovetheLinux scheduler’s scalability,preferring modifications thatdonot changedesktopperformanceandmaintain existing scheduler abstractions,yetscalewellwhen present with large a number of tasks. Wehaveshown thati
r
ed t’s
4P
Figure6:
The firstchartshowshow many times(inthousands) the systementersthe schedule() functioncall ina naverage 10-room VolanoMarksimulation. The secondchartshowshow many times the scheduler chooses taasktrounodifferent na processor thanit ranbefore.
possibletoimprovetheLinuxschedulerwithout introducingalotofoverhead.ThoughtheELSC schedulerdoesnotalwaysselectthebesttask availableonmachineswithmorethanoneprocessor, wehavedemonstratedthattheELSCscheduler satisfiesourgoalsforbothasmallandlargenumb ofreadytasksandoffersaviablealternativetot currentLinux scheduler.
er he
TheELSCschedulerisanopensourcecontribution andisfreelyavailableforuseandmodification. The currentversionothe f ELSCpatchcanbde ownloaded from www.citi.umich.edu/projects/linuxscalability/patches/
An increasing number of organizations continueto evaluate,test,andusetheLinux operating system. Although Linux does many things well,wehave shown thecurrentscheduler has shortcomings in its design andimplementation. When confrontedwith a largenumber of tasks,overallsystem performance declines rapidly. This behavior is unacceptablefo large-scaleenterpriseenvironments.
2P
Future 8. Work Inthefuture,wewouldliketoseehowtheELSC scheduler performs in other multithreaded environments.Onesuchexampleisawebserver runningApache. Wouldwseeethesameperformance gainswesawwhilerunningVolanoMark,ordoes somethingotherthantheschedulercauseprimary bottlenecksinthesesystems?WouldtheELSC schedulerbemoreeffectiveinincreasingthroughpu or decreasing thelatency oan fApachewebserver? ThefocusotfheELSCdesignistoreducethetime spentlookingforatasktoschedule. Wewouldals liketofindwaystoallowtheschedulertomake greateruseom f ultipleCPUsandexaminetheeffect of modifying the goodness metric. Is Linux consideringeverythingitoughtinitsscheduling decisions? Dowecareaboutprocessoraffinityaft manyothertaskshaverunonthegivenprocessor? Canweconstructaschedulerthatspendslesstime waitingforspinlocksandmoretimescheduling tasks? We are also interestedinexploringalternative schedulerdesigns.Thetable-baseddesignofthe ELSC scheduler is one approach; many other possibilitiesexist,suchassortingtasksbystati goodnesswithinheapsforeachprocessorandaddres space.Onecouldchoosetheabsolutebesttask availablesimplybyexaminingthetopofeachheap. Orperhapsamulti-priority-queuesolutionwouldbe morebeneficialtohelptheschedulerscaletomult processors well.
Acknowledgments 9.
t
o s
er
WethankRayBryantandBillHartnerofIBM,Chuck LeveroN f etworkApplianceandBrianNobleotfhe Universityof Michiganfortheirguidance and assistance;Dr.CharlesAntonelliandProfessorGar Tysonforprovidinghardware usedindevelopmentat theUniversityofMichigan;theLinuxTechnology CenteraItBMforallowingtheuseofequipmentat IBMfordevelopmentandtesting;andChrisKing, ScottLathropandPaulMooreoftheUniversityof Michigan for their contributions to the initial developmentof theELSC scheduler. Wethanktheanonymousreviewersfortheirhelpful comments,andourshepherd,TedFaber,whose insightandsuggestions wereparticularly valuable. ThisworkwaspartiallysupportedbyDell,Intel,a theSun-NetscapeAlliance. Wealsothankallthepast,present,andfuture developersofLinuxfortheirskilledandselfless contributions.
c s
iple
y
nd
10. References [1] Atlas,A.“Designandimplementationof statisticalratemonotonicschedulinginKURT Linux.” Proceedings20 thIEEEReal-Time SystemsSymposium Phoenix, . AZ,December 1999. [2] Bryant,RayandHartner,Bill. “Java,Threads, and Scheduling in Linux.” IBM Linux TechnologyCenter,IBMSoftwareGroup. http://www-4.ibm.com/software/developer/ library/java2
[3] Carr,John. “AS/400LeadstheleagueinJava performance.” JavaWorld,11August2000. [4] Daggett,DawnandGillen,Aal ndKusnetzky, Dan.“LinuxOvertakesNetWareforthe Market’sNumber2Position.” International Datacorporation –PressRelease ,24July 2000. http://www.idc.com/software/press/PR/ SW072400PR.stm
[5] Gooch,Richard. “LinuxSchedulerBenchmark Results.”30September1998. http://www. atnf.csiro.au/~rgooch/benchmarks/linuxscheduler.html ftp://ftp.atnf.csiro.au/pub/people/rgooc h /linux/kernel-patches/v2.2/rtqueuepatch- current.gz
[6] Lohr,Steve. “IBMgoescounterculteralwith Linux.” TheNewYorkTimesOnTheWeb 20 , March 2000. http://ww10.nytimes.com/ library/tech/00/03/biztech/articles/ 20soft.html
[7] Neffenger, John. “The Volano Report.” Volano LLC , 24 March 2000. http://www.volano.com/report000324.html
[8] Shankland,Stephen.“AOLreleasesopensourcesoftware.” CnetNews ,9July1999. http://www.canada.cnet.com/news/0-1005200-344644.html
[9] Wang,Y.C. “Implementinggeneral a real-time schedulingframeworkintheRED-Linuxrealtimekernel.” Proceedings20 thIEEERealTimeSystemsSymposium. LosAlamitos,CA, 1999. 246-55 [10] Woodard,B“Building . anenterpriseprinting system.” ProceedingsotfheTwelfthSystems Administration Conference (LISA XII) USENIXAssociation,Berkeley,CA.1998, 219-28.
.