Data Placement In Bubba

George Copeland, William Alexander, Ellen Boughter, Tom Keller
MCC, 3500 West Balcones Center Drive, Austin, Texas 78759

Abstract

This paper examines the problem of data placement in Bubba, a highly-parallel system for data-intensive applications being developed at MCC. "Highly-parallel" implies that load balancing is a critical performance issue. "Data-intensive" means data is so large that operations should be executed where the data resides. As a result, data placement becomes a critical performance issue. In general, determining the optimal placement of data across processing nodes for performance is a difficult problem. We describe our heuristic approach to solving the data placement problem in Bubba. We then present experimental results using a specific workload to provide insight into the problem. Several researchers have argued the benefits of declustering (i.e., spreading each base relation over many nodes). We show that as declustering is increased, load balancing continues to improve. However, for transactions involving complex joins, further declustering reduces throughput because of communications, startup and termination overhead. We argue that data placement, especially declustering, in a highly-parallel system must be considered early in the design, so that mechanisms can be included for supporting variable declustering, for minimizing the most significant overheads associated with large-scale declustering, and for gathering the required statistics.

1 Introduction

Throughput in parallel systems is determined both by the amount of total work to be done and by the total amount of processing capability wasted by poor load balancing. Minimizing the former while also minimizing the latter leads to the following design dilemma. Minimizing total work usually leads to a sequential processing design, because parallel processing adds the overheads of communications, startup and termination. Contrarily, minimizing aggregate processor idle time is usually achieved by designs which spread each operation over all processing elements, usually guaranteeing excessive additional work. Neither of these extremes leads to maximum throughput in a parallel processing design. Instead, some compromise must be reached, which balances the gain in processing capacity due to parallel execution against the overhead caused by parallel execution.

Bubba is a highly-parallel architecture for data-intensive applications. By "data-intensive" we mean any system in which the large size of the base data causes significant storage and performance problems. In such systems, it is usually cheaper to move intermediate results over the interconnect between physical nodes than to move base data, because base data is typically much larger and because it is not always needed in its entirety. Thus, a Bubba design principle is to perform work involving base data at those physical nodes upon which the base data resides. This means that data placement is a critical factor in achieving load balancing and throughput. This paper examines the problem of data placement in Bubba, which includes three related decisions: the number of nodes over which to partition base data, the particular nodes in which to place base data, and whether to place the data on disk or cache it permanently in memory. The difficulty of this problem dictates using a heuristic approach.

Effective data placement is crucial to performance for any database system having multiple disks, since it is an important lever for load balancing. This problem is usually addressed today by examining file reference frequencies on a periodic basis and manually moving files to achieve a more uniform load. An important exception to this conventional mode of operation is found in the Teradata system. Teradata is a highly-parallel database machine which employs full declustering, in which each relation is spread equally among all of the disk drives in the system via a hash function. [Ter85] describes this scheme but does not describe its performance implications. GAMMA [DeW86], a parallel database machine, used full declustering and measured single-transaction-at-a-time response time, but did not measure throughput. [Liv87] studied the relative performance of a data declustering strategy versus no declustering on a conventional multiple-disk file system. They compared the strategy of leaving a file intact on one disk versus full declustering. From their investigation of both strategies across a spectrum of multi-transaction workloads, they concluded that, except under extremely high utilization conditions, full declustering was consistently a better approach to take.

[Tan87] measured the throughput of the DebitCredit [Ano85] workload using full declustering for all relations except for the history file and log, which used a DegDecl (the number of nodes containing a relation) of 4. They showed that throughput increased linearly with the total number of nodes for up to 32 nodes. This workload generated fairly uniform accesses to the entire data space. Our study extends their findings. For some operations, the amount of work increases nonlinearly with DegDecl. This is the commonly encountered bane of parallel processing (e.g., [Hwa84], [Vrs85], [Cve87]). In these cases, less than full declustering outperforms both no declustering and full declustering. With full declustering, the placement problem is trivial because each base relation is spread across every available disk. [Ter85], [DeW86], [Liv87] and [Tan87] did not study the impact of different placements of the subfiles.

Section 2 describes the Bubba design environment. Section 3 describes how we deal with locality in Bubba. Section 4 describes the data placement problem in Bubba in more detail. Section 5 describes our data-placement heuristics. Section 6 describes our performance model, an experiment and the insight it provides. Section 7 provides a summary and our conclusions.
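As a concrete point of reference for the full-declustering schemes surveyed above, the sketch below shows the essential mechanism: hash a partitioning-key value to pick a node, so that each relation is spread equally over all nodes and per-relation placement decisions disappear. This is an illustrative sketch under our own naming, not Teradata's or GAMMA's actual implementation.

```python
# Illustrative hash-based full declustering (our sketch, not Teradata's
# actual scheme): every relation is spread over all nodes by hashing a
# partitioning-key value to a node number.
import hashlib

def node_for_record(key: str, num_nodes: int) -> int:
    """Map a record's partitioning-key value to one of num_nodes nodes."""
    digest = hashlib.md5(key.encode()).hexdigest()
    return int(digest, 16) % num_nodes

if __name__ == "__main__":
    NUM_NODES = 32  # hypothetical system size
    for key in ("cust-0001", "cust-0002", "item-9942"):
        print(key, "-> node", node_for_record(key, NUM_NODES))
```

With such a scheme there is nothing left to decide per relation, which is exactly why the placement problem becomes interesting only when less than full declustering is allowed.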

2 The Design Environment

This section describes the Bubba design environment, including our performance goals, our hardware organization, our techniques for efficiently supporting declustering, the constraints which our data recovery technique puts on data placement, and our benchmark workload. Bubba is being designed to run the database and knowledgebase workloads that we envision for the mid-1990s. Its performance objective is to deliver knowledgebase management functionality with cost and performance improvements of between one and two orders of magnitude relative to conventional general-purpose computers of the mid-1990s.

2.1 Hardware Organization

Bubba is a highly-parallel machine for data-intensive applications. It is designed to be scalable from 50 to 1,000 intelligent repositories (IRs). Each IR has a main processor, a disk controller, a communications processor, a large main memory, and a disk. Its design philosophy is shared-nothing [Ter85] [Sto86] [DeW86]: neither memory nor disks are shared between IRs. Bubba also has several interface processors (IPs) to handle interaction with users and some centralized functions. The IRs and IPs are connected by an interconnect, so that any IR or IP may send messages to any other IR or IP. This interconnect is the only shared resource; it makes a network out of the set of IRs and IPs. This network is a single machine; the IRs and IPs are physically close and message delays are small. The high-level architecture is illustrated in Figure 2.1.

[Figure 2.1: High-Level Hardware Organization]

2.2 Architectural Support For Declustering

Bubba includes the following mechanisms to efficiently support declustering. An efficient global directory mechanism replicated in each IR and IP describes which IRs contain which parts of each base relation. This supports variable declustering, where each base relation has a different DegDecl. The global directory is used to route messages to only those IRs which could be involved in a particular operation. Inverted files are declustered according to their inverted attribute and have global directory indexes. This allows a relation to be accessed using a single inverted attribute value without involving all DegDecl IRs. Instead, only a single inverted-file IR and only those IRs containing records with the inverted attribute are involved. A dynamic loading and activation mechanism is employed [AlC88], which starts up a transaction on an IR only after the first message has arrived. This minimizes program loading, startup and termination overhead by involving only those IRs that are actually required by that transaction. Multiple dataflow-control mechanisms are employed [Ale88, AlC88], which inform each dataflow operation that it has received all of its incoming data. The choice among these is made by the compiler based on which causes the least communications overhead. Taken together, these mechanisms allow startup and termination cost to be O(1) for rifle-shot transactions using inverted attributes as keys, as well as clustered attributes, rather than O(DegDecl). We have put considerable effort into implementing a streamlined communications protocol which reduces the cost of each message over the Bubba interconnect.
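The sketch below illustrates the routing role of the global directory described above; the class, method and field names are hypothetical simplifications, not Bubba's actual data structures.

```python
# Hypothetical global directory sketch. Replicated at each IR and IP, it
# lets the system send messages only to the IRs a given operation needs.
from typing import Dict, List, Set, Tuple

class GlobalDirectory:
    def __init__(self) -> None:
        # relation name -> the DegDecl IRs holding its segments
        self.segments: Dict[str, List[int]] = {}
        # (relation, inverted attribute) -> IR holding that inverted file
        self.inverted_files: Dict[Tuple[str, str], int] = {}

    def irs_for_scan(self, relation: str) -> List[int]:
        # An exhaustive scan must involve all DegDecl IRs of the relation.
        return self.segments[relation]

    def irs_for_rifle_shot(self, relation: str, attr: str,
                           matching_irs: Set[int]) -> List[int]:
        # A single-value lookup on an inverted attribute involves one
        # inverted-file IR plus only the IRs it reports as holding
        # qualifying records (passed in here for simplicity), keeping
        # startup and termination cost O(1) rather than O(DegDecl).
        if_ir = self.inverted_files[(relation, attr)]
        return [if_ir] + sorted(matching_irs)
```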


2.3 Recovery And Availability Constraints

Recovery from media or IR failure requires maintaining data redundantly on different media or IRs. Many systems maintain a checkpoint and log for this purpose [Gra78]. For higher availability, some systems maintain two identical on-line copies of each conceptual relation and their indexes, where updates are sent to both copies and reads can use either copy. Mirroring as used by Tandem [Kat78] is one example of this, where each disk has a twin containing the same data and indexes. Teradata [Ter85] uses a hash technique to decluster two copies of each relation across the same set of nodes, which has the property that the two copies of each record are guaranteed to be in different nodes. Both of these identical-copy techniques double the size of both the base data and their indexes and require updates to be applied to both copies. Each of these three recovery techniques can be supported on Bubba, the choice being determined by the availability requirements of the application. In addition, we support a novel inverted-file (IF) technique that exploits the data redundancy already present in the inverted files. One copy of the conceptual relation is a base relation called the direct copy, which has the same structure as its conceptual relation. Another copy of the conceptual relation is its set of inverted files and a remainder relation (containing all data which is not inverted), which together we call the IF copy. Either copy can be created from the other. The IF technique strikes a compromise between the previous techniques by requiring less storage and updating than the identical-copy techniques provided there is at least one inverted file, but having a recovery time between that of the checkpoint-and-logging and the identical-copy techniques. Each of these techniques constrains declustering differently. The two identical-copy techniques require different transaction scheduling and execution strategies to achieve load balancing of reads on the two copies. The IF technique requires that the two copies be declustered over different sets of IRs, which puts the following constraint on the placement of each conceptual relation: CDegDecl = DDegDecl + IFDegDecl <= total number of IRs, where CDegDecl, DDegDecl and IFDegDecl are the DegDecl for the conceptual relation, the direct copy and the IF copy. Our treatment of data placement in this paper assumes the IF technique, although the results also apply qualitatively when other recovery techniques are used.
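A minimal checker for this constraint, assuming the IF technique; the function name and representation (a set of IRs per copy) are ours:

```python
# Check the IF-technique placement constraint:
# CDegDecl = DDegDecl + IFDegDecl <= total number of IRs,
# with the direct copy and the IF copy on disjoint IR sets.
def valid_if_placement(direct_irs: set, if_irs: set, total_irs: int) -> bool:
    disjoint = direct_irs.isdisjoint(if_irs)
    c_degdecl = len(direct_irs) + len(if_irs)   # CDegDecl
    return disjoint and c_degdecl <= total_irs

assert valid_if_placement({0, 1, 2}, {3, 4}, total_irs=8)   # satisfies both
assert not valid_if_placement({0, 1}, {1, 2}, total_irs=8)  # copies overlap
```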

2.4 The Order-Entry Workload

We use database application programs based on a conventional order-entry system as our workload. Five transactions comprise our workload: Payment, Order-Shipped, New-Order, Suggested-Order and Store-Layout, plus a mix of the five transactions, Mixed, forming a composite workload. These transactions and their workload characterizations are described in [Ale87]. The order-entry database consists of 36 base relations; 8 of these are direct-copy relations and the remainder are their IF-copy relations. The total size of the database is 160 Gbytes. The five transactions may be summarized as follows:
1) New-Order records a customer's order for an average of 10 different items after the customer's new outstanding balance is checked against his credit limit.
2) Order-Shipped generates an invoice for the customer after an order is filled.
3) Payment retrieves and updates the date-paid for an order and adjusts the associated customer, salesman, district and company sales totals.
4) Suggested-Order infers the number of items to order from suppliers in order to keep a warehouse sufficiently well-stocked.
5) Store-Layout assists a customer in configuring the layout of items on the shelves in the store in an attempt to maximize customer profit.
New-Order, Order-Shipped and Payment represent conventional database transactions, whereas Suggested-Order and Store-Layout represent knowledgebase queries. These transactions require roughly from 2x to 1,000x the work involved in the DebitCredit transaction. All five transactions are executed on the same database using the same data placement, so that only a compromise data placement is possible. We think that

workloads with the kind of mix of simple and complex transactions seen in Order-Entry will gradually replace simpler workloads like DebitCredit, largely because of the availability of reasonably priced systems with the performance needed to support them [Sam87]. For each transaction type in the Order-Entry workload, Figure 2.2 provides the relative frequency of each transaction in the Mixed workload (TransFreq) and the complexity measures (as a function of DegDecl) of the costs of the database operations (e.g., joins), communication, and startup and termination as implemented on Bubba. The first three transactions contain "rifle shot" operations (i.e., select operations involving a small number of records). Suggested-Order and Store-Layout contain 1-M and N-M joins, where N and M are the number of records in the two joined relations. Both Suggested-Order and Store-Layout require exhaustive scans of a large percentage of the database without using an inverted file. Their TransFreqs are set artificially high because they are intended to be representative of many different "large" transactions, each of which we expect to be run infrequently. Commit and logging overhead are included in each of these five transactions.

transaction    TransFreq  operation type  commun cost    start/term cost
New-Order      0.332      rifle shot      O(1)           O(1)
Order-Ship     0.332      rifle shot      O(1)           O(1)
Payment        0.332      rifle shot      O(1)           O(1)
Sugg-Order     0.001      1-M join        O(DegDecl)     O(DegDecl)
Store-Layout   0.003      N-M join        O(DegDecl^2)   O(DegDecl)

Figure 2.2: Characterization Of Order-Entry Workload
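To make the figure concrete, the toy model below (our construction, not the performance model of Section 6) weights each transaction's communication and startup/termination complexity by its TransFreq. It shows how the infrequent join transactions come to dominate overhead as DegDecl grows:

```python
# Toy aggregation of Figure 2.2 (our sketch): frequency-weighted
# communication plus startup/termination overhead as DegDecl varies.
WORKLOAD = {
    # name: (TransFreq, commun cost, start/term cost), per Figure 2.2
    "New-Order":    (0.332, lambda d: 1.0,   lambda d: 1.0),
    "Order-Ship":   (0.332, lambda d: 1.0,   lambda d: 1.0),
    "Payment":      (0.332, lambda d: 1.0,   lambda d: 1.0),
    "Sugg-Order":   (0.001, lambda d: d,     lambda d: d),
    "Store-Layout": (0.003, lambda d: d * d, lambda d: d),
}

def mixed_overhead(degdecl: int) -> float:
    return sum(freq * (comm(degdecl) + st(degdecl))
               for freq, comm, st in WORKLOAD.values())

for d in (1, 4, 16, 64, 256):
    print(f"DegDecl={d:4d}  relative overhead={mixed_overhead(d):10.1f}")
```

Even at a TransFreq of only 0.003, Store-Layout's O(DegDecl^2) communication term eventually dominates, which is the nonlinearity that makes less than full declustering attractive for this workload.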

3 Locality

This section describes how we address locahty m Bubba and describes the locahty m the Order-Entry workload 3 1 What Is Locahty?

Various formal definitions of locality have been proposed in the literature [Bun84]. They all pertain to the shape of the curve of cumulative access frequency of objects (ordered by decreasing frequency) plotted against the cumulative fraction of the objects. This fraction can be based on either object size in bytes (which we call temperature locality) or number of objects (which we call heat locality). We refer to this type of curve as a locality curve. Examples are Figures 3.1 and 3.2, which are described more fully later. High locality means a steeply rising curve, and low locality means a slowly rising curve. We use the following definitions:
- The heat of an object is the access frequency of the object over some period of time.
- The size of an object is the number of bytes of the object.
- The temperature of an object is the object's heat divided by the object's size.
- Record heat (or temperature) locality is a measure of the nonuniformity in heat (or temperature) among individual records within a base relation.
- Block heat (or temperature) locality is a measure of the nonuniformity in heat (or temperature) among individual storage blocks within a base relation.
- Relation heat (or temperature) locality is a measure of the nonuniformity in heat (or temperature) among individual base relations.

To help provide some intuition, we use an analogy between our terms (heat, size and temperature) and terms from physics (heat, mass and temperature). Heats and sizes are each additive; that is, Htotal = H1 + H2 and Stotal = S1 + S2. However, because temperatures measure intensity, they are not directly additive. Two temperatures are added by adding their corresponding heats and sizes, so that if T1 = H1/S1 and T2 = H2/S2, then Ttotal = (H1 + H2)/(S1 + S2).
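A short worked example of this rule in code; the heat and size values are invented for illustration:

```python
# Combining two temperatures by adding heats and sizes, per the text:
# Ttotal = (H1 + H2) / (S1 + S2), not the average of T1 and T2.
def combine_temperature(h1: float, s1: float, h2: float, s2: float) -> float:
    return (h1 + h2) / (s1 + s2)

h1, s1 = 300.0, 10.0   # T1 = 30.0: small, very hot object
h2, s2 = 100.0, 90.0   # T2 ~= 1.1: large, cool object
print(combine_temperature(h1, s1, h2, s2))  # 4.0, far below (30 + 1.1) / 2
```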

Both record and relation localities vary widely depending on the application but are independent of database system architecture. Block locality is dependent on both record locality and the strategy used by the database system architecture to place records onto blocks. The strategies used by a database system for placing records onto blocks and for placing blocks onto IRs are linked, so that these choices must be made together.

3.2 Dealing With Locality In Bubba

Bubba uses four techniques for data placement: declustering, relation assignment, our clustering and indexing strategy, and caching. Declustering [Liv87] partitions the records of each relation into DegDecl segments. We decluster records into segments according to either the key value or its hash. When using the key value, we equally partition the sorted relation across its DegDecl IRs based on balanced heat rather than size. The advantage of declustering is improved throughput due to load balancing. In general, as DegDecl is increased, load is more balanced (increasing throughput) but with diminishing returns, and the overhead due to communications, startup and termination increases (reducing throughput). Relation assignment places the set of relations, each having DegDecl segments, onto the IRs in such a way that the total heat of each IR is roughly equal. Relation assignment and declustering are complementary techniques. Relation assignment becomes easier as DegDecl is increased and is trivial with full declustering.

Random-based declustering deliberately destroys most potential block locality due to record locality in order to buy better load balancing, so that only relation locality matters. The only cases for which block locality is not destroyed are 1) for records which are so hot that they cause their block to have a high temperature, and 2) for relations indexed on a key value that is strongly correlated with heat (i.e., history files [Ano85]). Some clustering and indexing strategies place records onto blocks according to relative heat [Flo78, Jak80, Omt83, Yu85]. These strategies more fully exploit record locality by concentrating the hottest records into relatively few, cacheable blocks. However, these schemes would improve data cache efficiency at the expense of load balancing in a parallel environment and of index cache efficiency. To whatever extent they are successful in clustering hot records in the same storage block for better data cache efficiency, load balancing across multiple IRs will be hurt. In addition, index cache efficiency is significantly reduced because much larger indexes must be used. Placing data into IRs and blocks according to key values allows cluster indexes to be used for key attributes instead of inverted indexes. Cluster indexes are much smaller than inverted indexes, because they require only IR-level or block-level resolution instead of record-level resolution, with the advantage that they can be cached.
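Pulling the declustering and relation-assignment techniques described above into code form, the sketch below shows one plausible rendering: heat-balanced range partitioning into DegDecl segments, and a greedy placement of segments onto IRs that equalizes total heat. These are simplified illustrations of the ideas; Bubba's actual heuristics are the subject of Section 5.

```python
# Simplified placement heuristics (our sketch, not Section 5's algorithms).
from typing import List, Tuple

def decluster_by_heat(records: List[Tuple[str, float]],
                      degdecl: int) -> List[List[Tuple[str, float]]]:
    """Split a relation, already sorted by key value, into DegDecl
    segments of roughly equal total heat (not equal size)."""
    total_heat = sum(heat for _, heat in records)
    target = total_heat / degdecl
    segments: List[List[Tuple[str, float]]] = []
    current: List[Tuple[str, float]] = []
    acc = 0.0
    for rec in records:
        current.append(rec)
        acc += rec[1]
        if acc >= target and len(segments) < degdecl - 1:
            segments.append(current)
            current, acc = [], 0.0
    segments.append(current)
    return segments

def assign_segments(segment_heats: List[float], num_irs: int) -> List[int]:
    """Greedy relation assignment: place each segment, hottest first,
    on the IR with the least accumulated heat."""
    ir_heat = [0.0] * num_irs
    placement = [0] * len(segment_heats)
    hottest_first = sorted(range(len(segment_heats)),
                           key=lambda i: -segment_heats[i])
    for i in hottest_first:
        ir = min(range(num_irs), key=lambda j: ir_heat[j])
        placement[i] = ir
        ir_heat[ir] += segment_heats[i]
    return placement
```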


Our clustering and indexing strategy within an IR is to use a cluster index using either the key value or its hash. With random-based declustering, the advantage of heat clustering within each IR is greatly diminished, but its cost in terms of index size is not diminished. This approach reduces data cache efficiency to the extent there is still some record locality within each IR after declustering, but significantly increases index cache efficiency because indexes are reduced by several orders of magnitude.

Caching has the obvious advantage of reducing disk IO, given that temperature locality exists. Bubba adopts the approach that a base relation is either entirely cache resident or entirely disk resident. The reasons are 1) that we force block locality to be low via declustering to achieve load balancing, and 2) that an optimizer can exploit this knowledge of cache residency by having more precise cost functions and by using memory-resident algorithms. Relations are cached based on temperature rather than heat, so that memory space is considered in addition to memory bandwidth. The optimum total amount of cache memory for a Bubba configuration can be determined by an analysis similar to the "5-minute rule" in [Gra87]. For a given configuration, the highest temperature relations are cached. We use conventional LRU to buffer blocks of non-cached relations, which handles the two cases described above (i.e., extremely hot records and correlations between key value and heat), as well as transient hot spots. Buffering of temporary data is handled using a separate mechanism. Although Bubba destroys most block locality, relation locality is still quite significant in Bubba. In general, high relation temperature locality helps data cache efficiency, whereas high relation heat locality hurts load balancing among IRs.

3.3 Locality In Order-Entry

Figures 3.1 and 3.2 indicate that both temperature and heat locality for the Order-Entry Mixed workload are quite high. High temperature locality means that only a small data cache is needed to reap most of the benefits of caching. For example, a hit ratio of about 93% is possible with only about 0.6% of the database cached. High heat locality means that load balancing will be quite difficult, because the relatively high heats of some relations mean that placement of these very hot relations will determine overall load balancing. If these relations are declustered over a small number of IRs, then load balancing will be more difficult to achieve. Thus, for the Order-Entry workload, it is quite easy to obtain efficient caching but quite difficult to achieve efficient load balancing. For the sake of comparison, Figures 3.1 and 3.2 also contain "theoretical" curves for three localities: 50/50 (no locality), 70/30 and 90/10. For example, 70/30 locality means that 70% of the logical accesses are to 30% of the relations. Numerous formulas can satisfy such a rule, where 0 < p …
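The locality curves used above can be computed directly from per-object heats, as sketched below; the heat values in the example are invented for illustration, not Order-Entry's actual statistics.

```python
# Relation heat locality curve (our sketch): cumulative access fraction
# versus cumulative fraction of relations, ordered by decreasing heat.
def locality_curve(heats):
    ordered = sorted(heats, reverse=True)
    total = sum(ordered)
    points, acc = [], 0.0
    for i, heat in enumerate(ordered, start=1):
        acc += heat
        points.append((i / len(ordered), acc / total))
    return points

# With heats [70, 10, 10, 5, 3, 2], the first point shows that 70% of
# accesses go to about 17% of the relations -- roughly 70/17 locality.
print(locality_curve([70, 10, 10, 5, 3, 2])[0])  # (0.1666..., 0.7)
```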
