Temporal Issues in Data Warehouse Systems
N. L. Sarda
Indian Institute of Technology, Bombay
nls@cse.iitb.ernet.in
Abstract
The warehouse represents the consolidated history of the organization at a suitable level of detail. A Data Warehouse (DWH) System provides extensive capabilities to interactively browse, summarize, visualize, and carry out different analyses of data. A user can analyze the performance of a selected business activity over different business parameters, note exceptions, 'drill down' into details, etc., to understand and develop reasons for past performance, and use this knowledge (using, in addition, data analysis, data mining and business intelligence tools) to make better decisions with respect to the selected goals. Today, the data warehouse concept has found wide acceptance. Early experience has led to maturing of data models, tools and technologies, extensive vendor support, and methodologies. A 'multidimensional' data model [1] is one of the widely accepted models, based on the 'hypercube' paradigm. Here, the business measures (also called 'facts') are stored within cells of a hypercube whose axes/edges represent business entities or attributes (called 'dimensions'). The data in the cube can be browsed, visualized, and aggregated along selected dimensions using operators specifically designed for this purpose (which include 'drill-down', 'roll-up', 'slice-and-dice', etc.). An alternative model, which is much closer to the popular relational data model, is based on the concept of 'star schema', where the dimension tables are created with a typical one-to-many relationship with tables containing the facts [6] [7]. Technologies supporting either or both models are available today (respectively called MOLAP and ROLAP DWH systems [4]). Many methodologies are being proposed for developing DWH systems [7, 2].
Kimball, a leading DWH expert, has proposed a lifecycle and an architecture, where techniques are proposed for 'dimensional modeling' and design of star schema [7]. His rich experience is reflected in the proposed methodology, design principles, technology and tool selection and evaluation techniques, and a large number of case studies [6]. In this paper, we propose to review some of the issues in DWH design, and give particular attention to handling of time in data warehouses. Time is a ubiquitous dimension,
1. Introduction
The significant reduction in costs and increase in processing power and storage capacities have made it possible for organizations to build integrated repositories of data and analyze them for better and faster decisions. Data warehousing provides processes and technologies to build what is described as a 'subject-oriented, integrated, time-varying, nonvolatile collection of data' to support organizations in decision making [4], [6, 7]. Today, an organization typically has many information systems dedicated to processing business transactions for supporting its operations. These systems contain the operational databases (OPD). Periodically, the data is extracted from the OPDs and stored in an integrated fashion into a warehouse (this step may also require data 'cleaning', summarization and consolidation).
$10.00 © 2000 IEEE
Authorized licensed use limited to: INDIAN INSTITUTE OF TECHNOLOGY BOMBAY. Downloaded on April 24, 2009 at 06:02 from IEEE Xplore. Restrictions apply.
These attributes define groups or classes of the dimensional entities. The attributes may form independent or hierarchical groups. Such attributes are often called category attributes. Product (with a product category attribute) and Geography (with city, state, country) are typical examples of dimensions with hierarchies. Hierarchies are important for analyzing and understanding fact data at various levels. The authors of [10] have identified many requirements for warehouse data models, and have compared various multidimensional data models against these requirements. The requirements include:
present in all DWH applications, and it is the most dominant factor in data analysis for decision support. In Sec. 2, we review the modeling issues for time for DWH applications. In Sec. 3, we propose a two-step approach to design, and elaborate on temporal modeling of facts and dimensions. In Sec. 4, we extend the temporal model to provide for aggregated facts. In Sec. 5, we present an integrated framework for design of information systems and DWH applications, where lifecycles of business objects are taken into account to define 'push operations' to move data from operational database systems to the DWH system. In Sec. 6, we conclude and identify issues that need further attention.
• Explicit and multiple hierarchies in dimensions
• Symmetric treatment of dimensions and measures
• Support for correct aggregation
2. Issues in Modeling and Design
The OPD systems contain data about business entities and operations. They are often designed to store the current data, with changes that overwrite the previous data. In some applications, the data history is explicitly maintained by including time as attributes. The temporal meaning of data and the time-based processing are typically buried in the application logic (and are not explicit in the model). The conceptual design of an OPD is often based on the Entity-Relationship (ER) model, which identifies distinguishable and independent entities and associations between them. Both entities and relationships have attributes, including key attributes. The values of these attributes define their state. Typically, an entity has its own lifecycle: it comes into existence, enters into relationships, gets modified, and, finally, gets deleted. The data model for the application may capture the life cycle and state changes by defining time attributes for them. Thus, time may exist as attributes with many entities/relationships. The ER model is generally implemented using the relational model, where tables contain entity and relationship instances (with/without time, depending on the design). Multiple OPDs within an organization act as sources for constructing a warehouse. Entities/attributes which form the basis for data analysis are defined as dimensions, and the other data (which are the target of analysis) are cast as facts. Often, the external entities (like customers and suppliers) and infrastructural entities (like employees and stores) become dimensions, and data contained in the relationships (containing transactional data like sales, order value) become facts. There are recent efforts (such as [5]) which propose a systematic technique for the design of a warehouse schema from an ER model.
However, this technique needs extensions to handle multiple sources, OPD dynamics (i.e., state changes), physical design based on querying needs, and specifications for extraction of data from OPD for loading into the DWH system. The dimensions provide attributes for selection of data against which facts are to be analyzed and aggregated.
• Non-strict hierarchies (with many-to-many relationships among levels)
• Many-to-many relationships between facts and dimensions
• Handling change and time
• Handling uncertainty
• Handling different levels of granularity.
We explicitly note their emphasis on handling time and aggregates. This has also been noted in the methodologies and modeling techniques proposed in [7] [2]. They define the time dimension and hierarchies within it in fairly general and widely applicable detail. They also define techniques to handle changes to dimensions, guided by their extensive experience. While the ad hoc solutions offered in these methodologies are practical, they lack a formal basis which can lead to better understanding of issues, alternatives and techniques. The situation has a parallel with the general DBMS technologies. Realizing the importance of modeling time and handling history data, extensive research in Temporal Database Systems, covering models, query languages, and implementation techniques [19, 16, 15], has been carried out since the mid-80s. Two orthogonal time measures, called valid (real-world) time and transaction (system) time, have been proposed to maintain complete history (including error corrections and retrospective and pro-active specifications). Two basic types of temporal tables are provided to model states (which have validity over an interval of time) and events (which happen at a given instant). Extended algebra operations and update semantics provide a solid foundation for the model and temporal query languages.
However, this research has not yet percolated into database products, and applications are often built on an ad hoc and incomplete understanding of temporal requirements (this author has seen a university application which cannot handle multiple curriculums, although it is common 'business' knowledge that students admitted at different times may have different curricula and rules). Since time is a fundamental dimension for data analysis and as DWHs store history data, we should make a thorough effort to give it a strong foundation. It is important to note not only the ubiquitous presence of time as a dimension but also its distinguishing characteristics:
Section, we give a more detailed and formal presentation of our temporal DWH framework. A fact or dimension may correspond to an entity or a relationship in the ER model. Without loss of generality, we will treat dimensions as coming from entities and facts from relationships. Let D be a dimension corresponding to entity E (that has key K and other attributes A1, A2, etc.). If only the instants when changes happen to E are of interest, we model D as an event table. D is modelled as a state table when the interval during which the state holds is of interest. Since the two temporal paradigms are convertible, we will concentrate only on the state-based dimension. At a conceptual level, the temporal validity of a state <k, a1, a2, ..., an> for an entity identified by entity-key K (which may be a real-world non-changing attribute or a surrogate) is modeled by a temporal element v, which is a set of instants where the state holds. At the representation level, the temporal validity is defined over an interval of time v. A state valid over multiple disjoint intervals will be represented by separate tuples, and hence may be thought of as separate states. Since an entity can have only one state at any time, the state table modeling D has a key which is a composite of attribute K and time v. For convenience, we propose a state-level surrogate S to uniquely identify states of entities in D, giving us its scheme as (K, S, A1, A2, ..., v). An update to one/more attributes in D produces a new state with the same K but a new S value. The new temporal validity is defined (using the 'moving' time variable NOW if required), and the previous validity interval is modified as per the semantics of update to a temporal table [16]. The attribute being modified may be a basic attribute or a category attribute. A change to any of them leads to a new state. To clarify the point, consider a geography dimension where a city d1 is shown as belonging to district (a category attribute) s1. If d1 is moved into another (naturally a neighbouring) district, we treat this as a state change for d1. It is often a business decision to decide how much change is permitted for an entity to still remain the same entity. The district s1 may be split into two districts, named s1 and s2, where s1 continues its existence. Alternatively, we may terminate the lifetime of s1 and create new districts (say, s11, s12). This is an operational level decision, but it affects the data analysis. A fact F may also be modeled as an event or a state table. We will present our discussion using only the state model. It represents some measures b1, b2, ... related to a group of dimensions D^1, D^2, .... Since the measures are time-based, they relate to states of entities at those times. Hence, a fact table is conceptualized as (S^1, S^2, ..., b1, ..., v), where its composite key is defined over the state-surrogates of its dimensions and the time. This conceptualization of F leads to the identification of some important temporal constraints between F and its D's:
• It is a continuous (although modeled as discrete) and linearly growing dimension
• It allows data validity to be defined at an instant or over an interval (the latter with specific characterizations [7])
• Users associate a 'calendar' for interpretation of time. A calendar defines multiple units of time and hierarchies among those units. An application may use multiple calendars (e.g., Julian, Gregorian and Financial calendars) concurrently or convert from one to another.
• It characterizes facts as well as other dimensions (i.e., both facts and dimension data have temporal validities)
• It is necessary to establish that data has meaningful temporal relationships before they can be analyzed or merged together.
3. Temporal Support in Data Warehousing
We propose a 2-stage approach to DWH design. These stages, the logical design and physical design, have different concerns and objectives, and their separation has been advocated in all methodologies for software development. The logical DWH model is derived from the conceptual operational database models, and takes into account the structure and relationships among them. It also builds into it a temporal framework for handling changes and correlating concurrent data. We propose use of a (ROLAP-based) relational framework for the logical model as it allows flexible and explicit structuring of data using the notions of attributes and keys. Given an ER model, and based on a good understanding of the purpose and nature of DWH applications, we expect an analyst to identify the dimensions and facts, the two concepts at the heart of a DWH application (one may use a technique as in [5] for this). In this Section, we propose to model dimensions and facts using temporal database concepts. In doing so, the analyst should examine temporal properties incorporated in the ER model itself. In the next
a projection is carried out. It merges tuples with the same attribute values and overlapping or consecutive validities into a single tuple with an extended interval that is the union of the intervals of the combined tuples. Coalesce is generally not meaningful on fact data during projection. However, it is meaningful in doing some aggregations (group-by plus some aggregate functions).
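The coalesce operation described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: rows are hypothetical (attribute values, start, end) tuples, validities are half-open intervals [start, end), and the function name `coalesce` is ours rather than an operator of any particular temporal DBMS.

```python
from datetime import date
from typing import List, Tuple

# Assumed row layout: (attribute_values, start, end), valid over [start, end).
Row = Tuple[tuple, date, date]

def coalesce(rows: List[Row]) -> List[Row]:
    """Merge value-equivalent rows whose validities overlap or are consecutive."""
    merged: List[Row] = []
    # Sorting puts rows with equal attribute values next to each other,
    # ordered by the start of their validity.
    for values, start, end in sorted(rows):
        if merged and merged[-1][0] == values and start <= merged[-1][2]:
            prev_values, prev_start, prev_end = merged[-1]
            # Extend the previous interval to the union of both intervals.
            merged[-1] = (prev_values, prev_start, max(prev_end, end))
        else:
            merged.append((values, start, end))
    return merged
```

Rows that differ in any attribute value are never merged, which is why coalescing is normally applied only after a projection has removed the attributes that distinguished them.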
1. A fact spans over a single state of an entity (and not over a set of states). The definition of states, thus, should be consistent with the measures of interest.
2. The states s1, s2, ... in a fact f ∈ F must have concurrent validities over v; and
3. There may be multiple measures with respect to a given set of concurrently valid states.
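A state-based dimension with a state-level surrogate, together with a check of the first two constraints, might look as follows. This is a minimal sketch under our own assumptions: the class and function names are hypothetical, validities are half-open intervals [start, end), and `date.max` stands in for the moving variable NOW.

```python
from dataclasses import dataclass
from datetime import date
from typing import Dict, List

FOREVER = date.max  # assumption: stands in for the moving time variable NOW

@dataclass
class State:
    s: int                 # state-level surrogate S
    k: str                 # entity key K
    attrs: Dict[str, str]  # basic and category attributes
    start: date            # validity v = [start, end)
    end: date

class Dimension:
    def __init__(self) -> None:
        self.states: List[State] = []

    def update(self, k: str, attrs: Dict[str, str], when: date) -> State:
        """Close the current state of entity k (if any) and open a new one."""
        for st in self.states:
            if st.k == k and st.end == FOREVER:
                st.end = when
        new = State(len(self.states) + 1, k, attrs, when, FOREVER)
        self.states.append(new)
        return new

def fact_is_consistent(fact_start: date, fact_end: date, states: List[State]) -> bool:
    """Constraints 1 and 2: every referenced state covers the whole fact interval."""
    return all(st.start <= fact_start and fact_end <= st.end for st in states)
```

Moving a city to another district is then a call to `update` with the same key and a new district value, which yields a new surrogate while the old state keeps its closed validity interval.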
We are able to articulate the above constraints by a systematic application of temporal database concepts to data warehousing. We may occasionally have to record facts at higher levels of granularity (e.g., sales at district level rather than at store level). As discussed in the next Section, the correct way to do this is to treat categories themselves as dimension entities. The above constraints also expose the temporal semantic confusion that arises when we deliberately try to relax them. For example, consider a product dimension showing product p1's lifetime ending at instant t1. However, a fact table may show sales for p1 even beyond t1 as it is still available in stores. The confusion here is in mixing manufacturing lifetime with on-sale lifetime. We can further apply temporal characterizations as proposed in ChronoBase [17] and other data models. These characteristics are defined with respect to an upward/downward inheritance of a state valid over interval v to instants in v or sub-intervals of v. It is natural to base dimension states on downward inheritance, since a state covers all instants in that interval. The (often numeric) measures b1, b2, ... in the state-based fact F will require careful characterization. We may also extend the notion of upward/downward inheritance of measures across categories in dimensions. This will help us in establishing the so-called additive/semi-additive characteristics of the measures [7]. The temporal foundation also allows us to apply meaningful operations on dimensions and facts and correctly interpret the temporal validity of results. Some temporal operations also need changes/extensions in the specific context of data warehouses. We list below some of the more useful temporal operators [19] and their specific meaning for DWH:
1. timeslice at t: gives a snapshot at time t in terms of only those attributes which have downward inheritance for instants; may apply to both D and F.
4. temporal aggregates [16] such as 'increasing', which gives an interval over which an attribute has increasing values.
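Two of the operators in this list, timeslice and concurrent join, can be sketched as follows. The code is illustrative only: rows are hypothetical (attributes, start, end) triples with half-open validities [start, end), all attributes are assumed to be downward inheritable, and the function names are ours.

```python
from datetime import date
from typing import Dict, List, Tuple

# Assumed row layout: (attributes, start, end), valid over [start, end).
Row = Tuple[Dict[str, object], date, date]

def timeslice(rows: List[Row], t: date) -> List[Dict[str, object]]:
    """Snapshot at instant t: attributes of the rows whose validity covers t."""
    return [attrs for attrs, start, end in rows if start <= t < end]

def concurrent_join(left: List[Row], right: List[Row], key: str) -> List[Row]:
    """Join rows that agree on `key`; the result is valid over the intersection."""
    out: List[Row] = []
    for la, ls, le in left:
        for ra, rs, re_ in right:
            start, end = max(ls, rs), min(le, re_)
            # Keep the pair only if the two validities actually overlap.
            if la[key] == ra[key] and start < end:
                out.append(({**la, **ra}, start, end))
    return out
```

An attribute that is not downward inheritable (for example, a total accumulated over the interval) would have to be dropped from both results, since its value at a single instant or on a sub-interval is not defined by the stored row.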
Thus, the temporal foundation permits much better clarity and also powerful temporal algebra operations in querying data in a DWH system. An important question to ask at this stage is: should Time itself be defined as a dimension in a DWH? Since time validities are associated with both the dimensions and facts, there are no other relationships to be captured by defining time as a dimension. However, it still serves a useful purpose by allowing us to explicitly represent calendars, time hierarchies, and specific time events (like holidays). The temporal modeling discussed above should be an integral part of the logical design phase of a DWH application. This may be followed by physical design for efficient representation and access. The techniques of physical design (such as 'de-normalization' and explicit storage of aggregate attributes) are well understood from our experience in building OLTP systems, and they are also applicable to DWH systems. In the next Section, we give a more formal presentation for DWH modeling and also consider aggregation and representation issues.
4. Varying Fact Granularity and Pre-computed Aggregates
4.1. Dimensions, Categories and Facts
The fact table in a warehouse contains state-keys of its dimensions and other attributes representing (often numeric) measures. The composition of these keys defines the 'granularity' of the data. Thus, if the dimensions are Product, Store and Time, then a typical fact contains (p1, s1, d1, a1), where a1 (say) may represent the total sale of product p1 on date d1 in store s1. As none of these dimension keys can be absent, they together define the level or granularity of fact data (in fact, they together are the key for the fact table). In most data warehouse designs, the granularity is chosen to be at a detailed level (e.g., the individual transaction level) so that the warehouse data can be used for analysis both at
2. concurrent join of tables: the result tuples contain intervals which are an intersection of intervals of joined tuples and contain only downward inheritable attributes.
3. coalesce (a fundamental operation in the temporal data model which ensures that the complete temporal history of a fact is given in a single tuple) is implicit whenever
aggregate and detailed level.
However, the users will frequently access the warehouse at an aggregate level (and drill down to details for selected cases when necessary). The fact data is suitably aggregated based on some groupings of dimension attribute values. While the grouping may be done on any attribute in a dimension (e.g., product colour), a few attributes are frequent candidates for aggregation. These are the 'category' attributes, which define hierarchies within the dimensions. Since these aggregations are done frequently, many warehouse systems provide for storing aggregates explicitly, often as separate facts based on category dimensions or by
A dimension D may be a snapshot dimension or a temporal dimension. For a temporal dimension, we define the temporal validity of values of T0 and their association with the category type T1, T2, ... values. For example, the product (p123, 350, coke) in subcategory (softdrink) may have temporal validity from date 1997/06/12 up to date 1999/04/30. The temporal validity changes whenever primary attributes change or the category relationships change.
adding a new level field in the dimensions [6]. In some warehouse applications [10], it is necessary to record facts at different granularity levels, often because the detailed data is not available. For example, some stores may send sales data for products only on a weekly basis (while most others provide the data on a daily basis).
A fact F is defined over one/more dimensions D^1, D^2, ... (which need not be distinct) and a set of 'measure' attributes b1, b2, .... We use the term 'indexing' to indicate that a fact relates to a dimension.
A base fact f ∈ F is indexed by key domains d0^1, d0^2, ... of its associated dimensions, and has a temporal validity over interval v, which may be constrained with respect to the validities v^1, v^2, ... of its dimension values. For example, the constraints may be: v^1 precedes v, v^2 contains v, etc. The measures in a base fact may have additional temporal characterizations (e.g., their validity with respect to instants in or sub-intervals of v).
These two requirements, of storing data at different granularities and of pre-computing aggregates for efficient querying, can be addressed together by extending the 'new level fields' approach of Kimball [6] in terms of careful definition of categories and ensuring that the DW system maintains aggregates that are temporally correct. In the following, we will use superscripts for referring to different dimensions and subscripts for keys, attributes, etc. within a dimension. We define a dimension D as containing a key domain d0 (which, as discussed in the previous Section, is typically a state surrogate) and a group of primary (or, base) attributes a1, a2, ..., an. The domain d0 may also be associated with category types T1, T2, ..., where each type Ti consists of a key di and a set of attributes called category attributes. Thus, Ti = (di, ai1, ai2, ...). The category types may be related by a partial order ≤, where Ti ≤ Tj implies that a subset of di is logically related to a value in dj. The dimension D by itself is also considered as type T0, which is the base of the partial order ≤. The partial order defines a 'category hierarchy' for D. As an example, a Product dimension [6] may be defined as follows (with key domains underlined):
An aggregated (also called 'higher-granularity') fact fg ∈ F may be indexed by a category domain di^k of dimension D^k (instead of d0^k) for some of its dimensions. The dimension D^k of F for which this is permitted is identified as an 'aggregatable' dimension of F. A fact fg is also associated with measures B1, B2, .... In general, a derivation relationship may be defined between base fact measures b1, b2, ... and aggregate measures B1, B2, .... Further, a fact may be aggregated at any selected level in the category hierarchy of its dimensions.
An aggregated fact fg is called a 'pre-computed' fact when the base facts indexed by the base domains of the aggregated dimensions also exist. Otherwise, fg is termed a given (or, measured) fact. The measures B1, B2, ... in a pre-computed fg are obtained using the derivation relationships specified between them and the base measures. The temporal validity for fg is computed using those of the base facts.
Because of the specific nature of time, where higher level time categories (such as month) map on to the basic domain keys (such as date) which are in a sequence forming an interval, we can characterize the temporal validity v of a time-aggregated fact fg. This validity must extend over the interval of the category itself. A non-time aggregated fact must have a validity of the basic time domain key. For example, an aggregated fact may give total sales over all products in one region on a given date. It may be noted that all time categories need not be interval-based, in which case time can be treated like any other dimension.
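As a small illustration of this rule, the sketch below computes the validity an aggregated fact should carry. It assumes, purely for the example, that the time category is the calendar month, that the basic time key is a date, and that validities are closed intervals [first, last]; the function name is hypothetical.

```python
import calendar
from datetime import date
from typing import Tuple

def expected_validity(day: date, aggregated_on_time: bool) -> Tuple[date, date]:
    """Validity of an aggregated fact as a closed interval [first, last].

    A fact aggregated over the (assumed) month category is valid over the
    whole month; a fact aggregated only on non-time dimensions keeps the
    validity of the basic time key, here a single date.
    """
    if not aggregated_on_time:
        return day, day
    # monthrange returns (weekday of the first day, number of days in month).
    last_day = calendar.monthrange(day.year, day.month)[1]
    return date(day.year, day.month, 1), date(day.year, day.month, last_day)
```

A warehouse could use such a function as a guard when a time-aggregated fact is loaded, rejecting facts whose stated validity does not match the interval of their time category.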
T4 = (storage-type)
where T0 ≤ T1 ≤ T2 ≤ T3 and T0 ≤ T4. Thus, the Product dimension has two hierarchies. Time itself may be defined as a dimension with suitable granularity and multiple category types (such as week, month, year). It is a dimension with special properties: its key domain values are ordered, and the domain grows linearly (hence, it may be populated in advance).
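The partial order among category types can be represented directly and used to enumerate the hierarchies of a dimension. The sketch below encodes only the orderings stated above (T0 ≤ T1 ≤ T2 ≤ T3 and T0 ≤ T4); the dictionary layout and function names are our own assumptions.

```python
from typing import Dict, List

# Direct "rolls up to" links taken from the stated orderings:
# T0 <= T1 <= T2 <= T3 and T0 <= T4.
PARENTS: Dict[str, List[str]] = {
    "T0": ["T1", "T4"],
    "T1": ["T2"],
    "T2": ["T3"],
    "T3": [],
    "T4": [],
}

def leq(lower: str, upper: str) -> bool:
    """True when lower <= upper in the partial order (reflexive, transitive)."""
    if lower == upper:
        return True
    return any(leq(parent, upper) for parent in PARENTS[lower])

def hierarchies(base: str = "T0") -> List[List[str]]:
    """Enumerate every maximal roll-up path that starts at the base type."""
    if not PARENTS[base]:
        return [[base]]
    return [[base] + path for parent in PARENTS[base] for path in hierarchies(parent)]
```

Calling `hierarchies()` yields two paths, one through T1, T2, T3 and one through T4, which matches the statement that the Product dimension has two hierarchies.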
4.2. Maintaining Pre-computed Facts
A pre-computed aggregated fact may be aggregated on one or more dimensions. Let D^j be one such dimension, and let Ti be the category type at which level the aggregation has been done. Let fg be one such fact with temporal validity v. When a new base fact f with validity v1, or a fact aggregated at level Tk, where Tk ≤ Ti, is added to the data warehouse, we need to update fg or generate a new aggregated fact fg':
i) f contains a base key value from d0^j, which corresponds to the category key di^j. Then, the aggregate fact fg, which is indexed by di^j, is updated by incremental derivation of the B measures in fg from the base measures in f. The incremental derivations, consistent with the derivation relationships between B's and b's, need to be specified (or inferred by the warehouse system). The temporal validity of f must be consistent with that of fg, which remains the same.
ii) f contains a time key which falls in a category for which an aggregated fact does not exist yet (this needs to be permitted for time as it is an independently growing dimension; alternatively, it may correspond to a future time value). A new aggregated fact fg' for the new category value is created.
The warehouse system may update pre-computed aggregates immediately, or do so in a batch mode where a batch of new facts is processed together (efficiently, using pre-sorting and other techniques).
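The two cases can be sketched for the simplest derivation relationship, where the aggregate measure B is the sum of the base measure b. Everything concrete here is an assumption made for illustration: dates are ISO strings, the time category is the month, and the names `month_of`, `aggregates` and `add_base_fact` are hypothetical.

```python
from typing import Dict, Tuple

def month_of(day: str) -> str:
    """Map a base time key to its category key, e.g. '1999-04-30' -> '1999-04'."""
    return day[:7]

# Pre-computed aggregates indexed by (product, month); measure B = SUM(b).
aggregates: Dict[Tuple[str, str], float] = {}

def add_base_fact(product: str, day: str, sales: float) -> None:
    """Record the effect of a new base fact on the pre-computed aggregate."""
    key = (product, month_of(day))
    if key in aggregates:
        # Case (i): incremental derivation of B from the new base measure.
        aggregates[key] += sales
    else:
        # Case (ii): first fact for a category value with no aggregate yet.
        aggregates[key] = sales
```

Derivations other than a sum (an average, for instance) need extra stored state such as a count before they can be updated incrementally, which is one reason the incremental relationships must be specified alongside the derivation relationships.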
4.3. Consistency of Given Aggregated Facts
In some situations, a fact may be recorded at a higher level. The base and finer level facts corresponding to the given aggregated fact may not be available; hence, there is no need to verify its consistency with any other facts. However, the related base facts may be added in future (with retrospective time validity as required). A warehouse system may provide an optional consistency check between the aggregated and the base facts (when the user can ascertain that all facts are in place).
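Such an optional check could be as simple as the following sketch. It assumes a SUM derivation between the base and aggregate measures and a numeric tolerance of our choosing; the function name is hypothetical.

```python
from typing import Iterable, Optional

def check_given_aggregate(
    given_value: float, base_values: Iterable[float], tol: float = 1e-9
) -> Optional[bool]:
    """Compare a given aggregated measure with the base measures, if any exist.

    Returns None when no base facts are recorded (nothing to verify),
    otherwise True or False depending on whether the sum of the base
    measures matches the given value within the tolerance.
    """
    values = list(base_values)
    if not values:
        return None
    return abs(sum(values) - given_value) <= tol
```

A False result does not necessarily indicate an error in the given fact: it may only mean that some base facts have not been loaded yet, which is why the check is meant to run when the user can ascertain that all facts are in place.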
4.4. Mapping to Relational Model
The dimensional data warehouse model defined above consists of dimensions and facts. A dimension D is defined in terms of
- one base type (d0, a1, a2, ...)
- zero/more category types (di, ai1, ...)
where d0 and the di's are distinct. Let d, the key space of D, be a union of all d's (i.e., d0 ∪ d1 ∪ d2 ...). Let S be a 'surrogate' key domain which allows a 1-to-1 mapping between d and S.
The dimension D can then be defined as a relation DR over attributes (S, d0, a1, a2, ..., di, ai1, ..., v), which contains the surrogate (key), key attributes of base and category types, and the base and category attributes. v is the temporal validity of the dimension data. Such a relation combines all the characteristics of the dimension for ease of access (as popularized by the star-schema data warehouse models [7]). The temporal validity attribute(s) is defined for dimensions requiring temporal support. When an s ∈ S corresponds to a category key, the attributes in DR corresponding to the base type, and to the categories which are lower in level than that of s, are not applicable. A fact F, indexed by dimensions D^1, D^2, ..., can be represented by a unified relation FR, which contains
- surrogate key attributes S^1, S^2, ... of the dimensions
- the base measure attributes b1, b2, ...
- the aggregate attributes B1, B2, ... for each selected aggregation
- an indicator to distinguish between pre-computed and given aggregated facts
- temporal validity attribute(s).
Note that the surrogate dimension keys used in FR uniquely identify the aggregation level. Alternatively, a fact F may be represented by multiple tables by separating base data and aggregate data, each with temporal validity. This is a normalized and more storage-efficient representation. To facilitate maintenance and use of aggregates (in querying), the schema of a warehouse system must also include specification of hierarchical relationships between categories in the dimensions, levels of permitted aggregations, and relationships (both for derivation and for incremental update) between base and aggregate measure attributes.
5. Integrated Approach to Warehouse
Traditionally, a DWH application consists of an ETL step, where data is extracted from source information systems, transformed, and then loaded into the data warehouse. This step, an activity associated with the 'back office' and an integral part of data staging [7], is based on a good understanding of the structure and meaning of data stored in the source systems. After the data is extracted into the staging area, it is checked for consistency and completeness. It has to be analyzed to decide what is new and what has changed, and mapped onto keys used locally in the DWH (which may or may not correspond with source keys; in fact, [7] recommends that warehouse keys be distinct from source keys), and temporal validities have to be associated with the extracted data, before loading them into the DWH. The ETL step is repeated periodically to incrementally update the data in
real-world situation, it may be hard to achieve a religious separation between the users of OPD and DWH systems). Besides the above advantages of the integrated approach, the primary benefit is in maintaining the consistency and completeness of data being pushed into the DWH. The objects modeled in a DWH have their own lifecycles, during which they go through state changes and interact with other objects. It may be important for any meaningful data analysis that the objects complete a chain of state changes and interactions before their related data can be pushed into the DWH. This is illustrated by the following examples:
the DWH. Simon [14] calls this 'non-obtrusive data warehousing'. For legacy information systems, the ETL step is obviously a necessity, in spite of the challenges it imposes due to administrative conflicts and operational priorities (for instance, the ETL step must be done at slack times, and any change in the source system triggers changes in the ETL and staging processes, which requires complete cooperation between the source system and DWH administrators). Integration of data sources into a DWH environment is attracting the attention of researchers for this reason [14, 3, 18].

However, we wish to emphasize issues which go beyond the above challenges of coordination between the sources and the DWH. These issues are of a fundamental nature, and they suggest an integrated approach for an effective implementation. This integration needs to go beyond the mediators suggested by Wiederhold [20], which are essentially like ETL agents. Our suggested framework is more like the 'web-enabled applications' recommended by Simon [14] (using the 'push' paradigm rather than the 'pull' paradigm used in the ETL-based approach). Here, the applications extract, perform quality checks, transform the data, and 'publish' them along with relevant metadata. The DWH systems subscribe to the publication service for updating their contents.

An OPD needs to store current data. OPDs also often contain some history pertaining to the immediate past to support management requirements in monitoring and control. An OPD may model the time dimension (using the temporal database concepts), or be simply non-temporal. In this context, a DWH application built by taking periodical snapshots suffers from the following limitations:
1. An order object may go through the states: received, accepted, shipped, invoiced, and paid-up. Only when it has reached the 'paid-up' state should its details, along with the payment details, be moved to the DWH.
2. The academic performance data of a student is complete only when the grades for all the courses registered by him/her have been received, at which time his/her study data can be pushed into the DWH for analysis.
To achieve this objective, we must model an OPD system using the temporal paradigm that captures state changes, and associate 'push' operators with states or with state transitions. Note that this requirement on the OPD is consistent with the above responsibility of maintenance of state-level surrogate keys.
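A minimal sketch of this idea, using the order example above, is given below. It is an illustration under stated assumptions rather than a prescribed design: the names `Order`, `OrderLifecycle`, `on_state` and `push_order` are hypothetical, the states are assumed to occur in a fixed linear sequence, time is a plain integer, and the DWH is stood in for by an in-memory list.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

# State sequence taken from the order example; a strictly linear lifecycle is assumed.
ORDER_STATES = ["received", "accepted", "shipped", "invoiced", "paid-up"]


@dataclass
class Order:
    order_id: str
    # Temporal history of (state, time) pairs, oldest first.
    history: List[Tuple[str, int]] = field(default_factory=list)


class OrderLifecycle:
    """Hypothetical OPD component that attaches 'push' operators to states."""

    def __init__(self) -> None:
        self._push_ops: Dict[str, List[Callable[[Order], None]]] = {}

    def on_state(self, state: str, op: Callable[[Order], None]) -> None:
        self._push_ops.setdefault(state, []).append(op)

    def transition(self, order: Order, new_state: str, at: int) -> None:
        step = len(order.history)
        if step >= len(ORDER_STATES) or ORDER_STATES[step] != new_state:
            raise ValueError(f"illegal transition to {new_state!r}")
        order.history.append((new_state, at))  # record the state change with its time
        for op in self._push_ops.get(new_state, []):
            op(order)  # fire the push operators associated with this state


warehouse: List[dict] = []  # stand-in for the DWH


def push_order(order: Order) -> None:
    # Push the complete chain of state changes, not just the final state.
    warehouse.append({"order_id": order.order_id, "history": list(order.history)})


lifecycle = OrderLifecycle()
lifecycle.on_state("paid-up", push_order)

order = Order("order-1")
for day, state in enumerate(ORDER_STATES, start=1):
    lifecycle.transition(order, state, at=day)
```

With this arrangement, nothing reaches the warehouse until the order becomes 'paid-up', and the pushed record then carries every state change with its occurrence time.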
6. Conclusions
The data warehousing technology is gaining wide attention, and many organizations are building data warehouses (or data marts) to help them in data analysis for decision support. Recent research in DWH has concentrated on data models, lifecycle methodologies and implementation techniques. The hypercube-based multidimensional model and the star-schema based (extended-) relational model have emerged as candidate data models for DWHs. However, these models do not adequately address issues related to modeling of time and history data, which are certainly core issues in data warehousing. They do give many practical, intuitive and ad hoc guidelines (e.g., [6, 7, 2]), but the lack of a formal basis obscures many issues. There has been extensive research in modeling of time for temporal databases, but this research has not received due attention from DWH application developers, and informal/ad hoc guidelines for handling history and time are offered by developers (e.g., see [11]). The main purpose of this paper is to focus on handling of time in DWH applications formally, and to establish modeling concepts and techniques that will give us temporally complete and consistent solutions. We achieve this by extending the temporal database concept for warehousing. We
- It may miss some state changes and/or exact occurrence times of events.
- Locating changes/additions/deletions may require a time-consuming check with previous snapshots.
- Not every change (e.g., error corrections) is important for data analysis. Such changes need to be filtered out.
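The first two limitations can be illustrated with a small sketch. The change log, the integer time points and the helper names `snapshot` and `diff` are assumptions made for this example only.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical temporal log kept by the OPD: (time, key, state).
change_log: List[Tuple[int, str, str]] = [
    (1, "order-1", "received"),
    (4, "order-1", "accepted"),
    (7, "order-1", "shipped"),
]


def snapshot(log: List[Tuple[int, str, str]], at: int) -> Dict[str, str]:
    """State of every key as of time `at` (the latest change not after `at`)."""
    snap: Dict[str, str] = {}
    for t, key, state in sorted(log):
        if t <= at:
            snap[key] = state
    return snap


def diff(old: Dict[str, str],
         new: Dict[str, str]) -> Dict[str, Tuple[Optional[str], Optional[str]]]:
    """Changes found by comparing two whole snapshots: key -> (old, new) state."""
    return {k: (old.get(k), new.get(k))
            for k in old.keys() | new.keys()
            if old.get(k) != new.get(k)}


early = snapshot(change_log, at=2)
late = snapshot(change_log, at=10)
visible = diff(early, late)
# visible == {"order-1": ("received", "shipped")}
```

Comparing the two snapshots reveals only that the order moved from 'received' to 'shipped'. The intermediate 'accepted' state and the occurrence times 4 and 7 are not recoverable, and finding even this much requires scanning both snapshots in full.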
In planning an integrated information system(s), we can model time and history appropriately using the temporal database concepts. We can also build efficient 'change extraction' capabilities into the OPD system. More importantly, the application uses its knowledge (of when the data becomes useful information for DSS) to 'push' out incremental data for use in the DWH. The change extraction capability may be based on logs, indexes on time, or differential files. Another significant advantage is that an OPD itself can create and maintain entity-level and state-level surrogate keys and pass them to the DWH. Finally, an integrated system can facilitate a seamless 'drill-down' from the DWH side to the OPD for a detailed look at the current data (in a
show that the dimensions and facts can be modeled as temporal state or event relations, and we identify the basic temporal constraints to ensure consistency of data in a DWH. The framework extends naturally to handle multiple granularities and hierarchies among dimensions. We also show how the temporal algebra operations can be applied for querying a DWH. We have proposed a logical DWH design phase during which a temporal warehouse schema is defined. The physical design phase can apply the well-understood database techniques for efficient storage and access to data. We emphasize an integrated approach where information systems (handling operational requirements) are 'DWH-aware', and extract and push data into the DWH based on the lifecycle of business objects. There are many challenging issues that require further research for building unified decision support systems:
- Efficient storage structures and algorithms for temporal operations (such as concurrent join and coalesce): DWH queries are predominantly time based. Special storage structures have been proposed (as extensions to R-trees) [8] to handle instant and interval-based queries. We need to investigate their applicability in DWH environments.
- Time series representation and operators [9, 13]: variations in data are often analyzed with respect to time. Such processing is facilitated by representing data as a function of time and providing many useful statistical, aggregation and pattern searching operators. In a DWH, time series data may be considered as aggregated data for efficient processing.
- Schema evolution: as newer information systems come into existence, and new decision models are formulated, the DWH application itself will need to evolve. We need to investigate solutions for providing interoperability and data access across evolving DWHs.
- Warehousing of business rules and business decisions: DWH research has completely ignored procedures and rules. These also contain an organization's history [12]. An example business rule is: all ordered items must be supplied from the same depot. Access to these rules helps in better understanding and analysis of data. It is also essential to store business decisions made from analysis of data in the DWH so that they can subsequently be verified for intended results.
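As an indication of what one of the temporal operations named in the first item involves, a simple in-memory version of coalesce is sketched below. It is an illustrative sketch only and says nothing about the efficient storage structures the item calls for; the row layout (value, start, end) and the half-open interval convention [start, end) are assumptions.

```python
from typing import List, Tuple

# (value, start, end): the value holds over the half-open interval [start, end).
Row = Tuple[str, int, int]


def coalesce(rows: List[Row]) -> List[Row]:
    """Merge value-equivalent rows whose validity intervals overlap or meet."""
    merged: List[Row] = []
    for value, start, end in sorted(rows):
        if merged and merged[-1][0] == value and start <= merged[-1][2]:
            # Same value and no gap since the previous interval: extend it.
            prev_value, prev_start, prev_end = merged[-1]
            merged[-1] = (prev_value, prev_start, max(prev_end, end))
        else:
            merged.append((value, start, end))
    return merged


depot_history = [
    ("Mumbai", 1, 5),
    ("Mumbai", 5, 9),    # meets the previous interval
    ("Pune", 9, 12),
    ("Mumbai", 12, 15),  # separated from the earlier Mumbai period by a gap
    ("Mumbai", 3, 6),    # overlaps the first two
]
result = coalesce(depot_history)
# result == [("Mumbai", 1, 9), ("Mumbai", 12, 15), ("Pune", 9, 12)]
```

The sort dominates the cost of this version; whether an index on time can avoid it for large warehouse relations is part of the open question raised above.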
References
[1] R. Agrawal, A. Gupta, and S. Sarawagi. Modeling multidimensional databases. In Proceedings of 13th International Conference on Data Engineering, pages 232-243, 1997.
[2] C. Ballard, D. Herreman, D. Schau, et al. Data Modeling Techniques for Data Warehousing. IBM Corporation, International Technical Support Organization, Feb 1998.
[3] D. Calvanese, G. De Giacomo, M. Lenzerini, D. Nardi, and R. Rosati. Source integration in data warehousing. Technical report, Technical Report DWQ-UNIROMA-002, University di Roma, Italy, October 1997.
[4] S. Chaudhuri and U. Dayal. An overview of data warehousing and OLAP technology. SIGMOD Record, 26(1), March 1997.
[5] M. Golfarelli, D. Maio, and S. Rizzi. Conceptual design of data warehouses from E/R schemes. In Proceedings of the Hawaii International Conference on System Sciences, IEEE, January 6-9, 1998.
[6] R. Kimball. The Data Warehouse Toolkit. John Wiley & Sons, Inc., 1996.
[7] R. Kimball, L. Reeves, M. Ross, and W. Thornthwaite. The Data Warehouse Lifecycle Toolkit. John Wiley & Sons, Inc., 1998.
[8] A. Kumar, V. J. Tsotras, and C. Faloutsos. Designing access methods for bitemporal databases. In Proceedings of the International Workshop on Temporal Databases, pages 235-254, Zurich, Switzerland, September 1995. Springer.
[9] Oracle Corporation. Oracle Time Series Cartridge User's Guide. URL: http://www.oracle.com.
[10] T. B. Pedersen and C. S. Jensen. Multidimensional data modeling for complex data. In Proceedings of ICDE, 1999.
[11] T. Quinlan, R. Gilbert, and M. Ferguson. Modeling history for the data warehouse. Database Programming and Design, Nov. 1998. URL: http://www.dbpd.com/vault/new~online.shtml.
[12] N. L. Sarda. A framework for application evolution management. In Proceedings of 10th Australasian Database Conference, New Zealand, pages 13-24. Springer, 1999.
[13] D. Schmidt, R. Marti, A. K. Dittrich, and W. Dreyer. Time series, a neglected issue in temporal database research? Recent Advances in Temporal Databases, J. Clifford and A. Tuzhilin (eds.), 1995.
[14] A. R. Simon. The watchful enterprise. Database Programming and Design, 1998. URL: http://www.dbpd.com.
[15] R. Snodgrass. Developing Time-Oriented Database Applications in SQL. Morgan Kaufmann, 2000.
[16] R. T. Snodgrass, editor. The TSQL2 Temporal Query Language. The Kluwer International Series in Engineering and Computer Science. Kluwer Academic Publishers, 1995.
[17] S. Sripada and P. Moller. The generalized chronobase temporal data model. Meta-Logics and Logic Programming, 1995.
[18] J. Srivastava and P. Y. Chen. Warehouse creation - a potential roadblock to data warehousing. In IEEE Transactions on KDE, volume 11, pages 118-126, Jan 1999.
[19] Tansel, Clifford, Gadia, Jajodia, Segev, and Snodgrass, editors. Temporal Databases. The Benjamin/Cummings Publishing Company, Inc., 1993.
[20] G. Wiederhold. Weaving data into information. Database Programming and Design. URL: http://www.dbpd.com/vault/new~online.shtml.