Modeling Temporal Consistency in Data Warehouses

Robert M. Bruckner (1), Beate List (1), Josef Schiefer (2), A M. Tjoa (1)

(1) Institute of Software Technology, Vienna University of Technology, Favoritenstr. 9-11 /188, A-1040 Vienna, Austria
{bruckner, list, tjoa}@ifs.tuwien.ac.at

(2) IBM Watson Research Center, 30 Saw Mill River Rd., Hawthorne, NY 10532, USA
[email protected]

Abstract

Real-world changes are generally discovered delayed by computer systems. The typical update patterns for traditional data warehouses on an overnight or even weekly basis enlarge this propagation delay until the information is available to knowledge workers. The main contribution of the paper is the identification of two different temporal characterizations of the information appearing in a data warehouse: one is the classical description of the time instant when a given fact occurred, the other represents the instant when the information has been entered into the system. We present an approach for modeling conceptual time consistency problems and introduce a data model that deals with timely delays and supports knowledge workers to determine what the situation was in the past, knowing only the information available at a given instant of time.
1 Introduction

The observation of real-world events by computer systems is characterized by a delay. In the applied context of information systems it is determined by external and internal factors (e.g. update patterns, processing speed). This so-called propagation delay is the time interval it takes for a monitoring system to realize an occurred state change. In contrast to operational systems (designed to meet well-specified response time requirements), the focus of data warehouses (DWHs) [6] is generally the strategic analysis of data integrated from heterogeneous systems. Late-arriving data that should have been loaded into the DWH weeks or months before complicate this situation [10]. Hence, keeping data current and consistent in that context is not an easy task.

Until recently, timeliness requirements [4] (describing the relative availability of data to support a given process within the timetable required to perform the process) were restricted to mid-term or long-term. W. H. Inmon, known as the founder of data warehousing, cites time variance [6] as one of four salient characteristics of a DWH. Timeliness can be viewed as an aspect of data quality, which has a strong influence on the delay until a system realizes a certain state of the data. While a complete, real-time enterprise DWH might still be the panacea, there are some approaches to enable DWHs to react "just-in-time" [16] to changing customer needs and financial concerns.

[Figure 1: timeline from past to future. A real-world event (address change from Boston to NY) is followed by the event first being captured in electronic format, then by the data load into the data warehouse, and finally by an event-driven action based on this (new) knowledge, e.g. providing special location-based services in NY.]
Figure 1. Delayed discovery of real-world changes.
Figure 1 demonstrates that typical update patterns for traditional DWHs on a weekly or even monthly basis enlarge propagation delays until the information is available to knowledge workers. Any significant delay in the recognition of events may result in a number of further considerations needing to be taken into account:
• Data integration. Aggregates have to be updated, because the new records will change counts and totals of the prior history.
• Analytical processing. Historical analysis results can no longer be repeated, if additional information regarding that time period is integrated delayed (numbers and summaries will change unexpectedly from the user's perspective).

We present a schema model that copes with delays and enables a timely consistent representation of information. This enhances analytical processing by considering that information validity of data is typically restricted to time periods, because of frequent updates or late-arriving data.

The remainder of this paper is organized as follows. Section 2 considers research issues and related work. Sections 3 and 4 classify common temporal data structures in information systems and introduce the concept of temporal consistency described from different viewpoints (conceptual and logical). Section 5 evaluates the model and finally we give a conclusion.
2 Research issues and related work

The notion of time is fundamental to our existence and an important aspect of real-world phenomena. We can reflect on past events and on possible future events, and thus reason about events in the domain of time. In many models, time is an independent variable that determines the sequence of states of a system. There are two different temporal characterizations of the information appearing in a DWH: one is the classical description of the time instant when a given fact occurred; the other represents the instant when the information is actually knowable to the system. This distinction, implicit and usually not critical in on-line transactional processing (OLTP) applications, has a particular importance for DWHs, where it can be useful to determine what was the situation in the past, knowing only the information available at a given time.

Temporal databases provide support for past, current, or even future data and allow sophisticated queries over time to be stated [17]. Research in temporal databases [5] has grown immensely in recent years. In particular, transaction time and valid time have been proposed [7] and investigated in detail [8], [15].

In the field of data warehousing, Bliujute et al. in [1] concentrated on the shortcomings of star schemas in the context of slowly changing dimensions [9] and concluded that state-oriented warehouses allow easier analytical processing and even better query performance than observed in regular events warehouses. Our formal approach to managing temporal consistency (described in section 4) is state-oriented, too.

Pedersen and Jensen [11] describe features that entire DWH data models should have (including a requirement to handle changes in data over time) and evaluate previously proposed models. In the discipline of temporal data warehouses a lot of research was done in the context of temporal view maintenance, e.g. [13].

An interesting practice approach is described in [4], where timeliness is viewed as the time from when a fact is first captured in an electronic format and when it is actually knowable by a knowledge worker who needs it. Late-arriving facts and dimension records [10] can complicate this situation, because they are changing counts and totals for prior history. Some industries, like health care, have to deal with huge numbers of late-arriving records [3].

The nature of delays in active temporal databases is discussed in [12], concluding that temporal faithfulness has to be provided. Applying this concept to data warehousing ensures that information is analytically processed in a consistent way.
3 Temporal data structures

Timestamps allow the maintenance of temporal data. When considering temporal data for DWHs we need to understand how time is reflected in a database, how this relates to the structure of the data, and how a state change affects existing data. There are a number of approaches:
• Transient data. The key characteristic of transient data is that alterations to and deletions of existing records physically destroy the previous data content. This type of data is typically found in operational environments (e.g. order-entry systems).
• Periodic data. Once a record is added to a database, it is never physically deleted, nor is its content ever modified. Rather, new records are always added to reflect updates or even deletions. Periodic data thus contains a complete record of the changes that have occurred in the data. DWHs are periodic in nature.
• Semi-periodic data. This kind of data is typically found in the real-time data of operational systems where previous states are important (bank account systems, insurance premiums systems, etc.). However, in almost all operational systems, the duration for which persistent data are held is relatively short, due to performance and/or storage constraints. Therefore, this kind of data may be termed semi-periodic data.
• Snapshot data. Snapshot data are a stable view of data as they exist at some point in time. They are a special kind of periodic data. Snapshots usually represent the data at some time in the past, and a series of snapshots can provide a view of the history of an organization.

The standard approach to storing periodic data (typically found in DWHs) is to use timestamped status and event records. There are, however, a variety of schemes to maximize the efficiency of timestamps [2], [8], [15].

The single timestamp approach, storing only a start time (when a record became valid), is well applicable to event data, but faces serious deficiencies in the context of DWHs, where in general state information is stored. There are two relatively common types of queries used in DWH environments, which explain the problem:
1. A query that needs to access current data. In a single timestamp scheme, the only way to identify current records is to find the latest timestamp of the periodic set, which is an inefficient process.
2. A query that builds a view of the data at a particular time in the past. In order to support this kind of query, the period of validity of each record must be known, to compare it with the required time. With a single timestamp approach, the end of the period of validity can only be found from the next record in the periodic sequence. In general this is also an expensive process.
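The two query types can be illustrated with a minimal Python sketch. The record layout, the integer timestamps, and the sample values are assumptions made for illustration only; they are not taken from the paper. The sketch shows that, with a start time alone, both queries have to inspect the ordered sequence of records, because the end of a validity period exists only implicitly as the start time of the next record.

```python
from bisect import bisect_right

# Hypothetical periodic records for one subject, sorted by start time.
# Each record stores only a start timestamp (single timestamp scheme).
records = [
    (10, "Boston"),
    (25, "New York"),
    (40, "Chicago"),
]

def current_record(recs):
    """Query 1: the current record is the one with the latest start time."""
    return max(recs, key=lambda r: r[0])

def record_at(recs, instant):
    """Query 2: the record valid at `instant`, or None if none had started.

    The end of a validity period is not stored; it is implied by the start
    time of the next record, so the sorted sequence has to be searched.
    """
    starts = [start for start, _ in recs]
    pos = bisect_right(starts, instant)
    return recs[pos - 1] if pos > 0 else None
```

With an explicit end time on each record, the second query could compare the requested instant against the record itself instead of consulting its successor.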
In order to address the first problem, a second timestamp (called end time) can be added to each fact. It identifies the end of the period of validity. This causes performance improvements in retrieving data. However, it is not sufficient to fully solve the second problem, because the period of validity can change over time due to new information integrated into the DWH later. A temporally consistent view (similar to snapshot data) during analytical processing requires the storage of both the old and the new (maybe overlapping) validity period. Therefore, we enhance DWH data models with the following time dimensions (described in detail in section 4.2):
• valid time dimension (validity of knowledge) as motivated above.
• revelation time dimension (transaction time). It describes the point in time when a piece of information was realized by at least one source system.
• load timestamp. This represents the point in time when the new piece of information was integrated.
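As a rough illustration of these three time dimensions, the following Python sketch attaches them to a stored state. The class name `StoredState`, its field names, the integer instants, and the sample values are all hypothetical; the paper's own logical model is described in section 4.2 and may differ.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class StoredState:
    """One periodic record; frozen because stored records are never modified."""
    subject: str
    value: str
    valid_start: int      # valid time: start of the period of validity
    valid_end: int        # valid time: end of the period of validity
    revelation_time: int  # when at least one source system realized the fact
    load_time: int        # when the fact was integrated into the DWH

def is_valid_at(state, instant):
    """True if `instant` lies within the state's period of validity."""
    return state.valid_start <= instant <= state.valid_end

def append_state(history, state):
    """Periodic data: new records are added, existing ones stay untouched."""
    return history + [state]

# Hypothetical example: the old and the new validity period overlap,
# and both are kept.
history = append_state([], StoredState("Mr. Smith", "Boston", 10, 30, 12, 15))
history = append_state(history, StoredState("Mr. Smith", "New York", 25, 60, 40, 42))
```

Keeping the earlier record unchanged is what later allows an analysis to reproduce the view that was available before the newer record arrived.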
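4 Temporal consistency

[Figure 2: timeline from past to future. State S1 (Boston) begins with event e1, has the valid time [T1S, T1E] and the revelation time T1. State S2 (NY) begins with event e2, has the valid time [T2S, T2E] and the revelation time T2. The two valid time intervals overlap.]
Figure 2. Overlapping validity periods.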
The continuum of real time can be modeled by a directed timeline consisting of an infinite set {T} of instants (time points on an underlying time axis [7]). A section of the timeline is called a duration. An event takes place at an instant of time, and therefore does not have duration. A time interval is determined by the duration between two corresponding (start - end) instants.

Figure 2 describes a situation where a computer system observes state S1 at T1, indicating that a specified person (Mr. Smith) lives in Boston. It knows that Mr. Smith stays there from T1S to T1E. The next state (S2) knowable to the computer system regarding Mr. Smith is the new address in New York at T2. Additional data reveals that he already lived in New York since T2S.

In order to determine what was the situation before instant T2, it must be feasible to process only those states known before T2. By modeling this situation temporally consistent, it will always be possible to find out that the system did not know where Mr. Smith lived after T1E until the instant T2, when the new piece of information (state S2) was integrated into the DWH.
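The Mr. Smith example can be replayed in a few lines of Python. The numeric instants are invented for illustration and chosen only so that the two valid time intervals overlap and T2 lies after T1E; the tuple layout and the function name are likewise assumptions.

```python
# Hypothetical instants on an integer time axis.
T1S, T1, T2S, T1E, T2, T2E = 10, 12, 25, 30, 40, 60

# (label, revelation time, valid start, valid end)
states = [
    ("S1: Boston", T1, T1S, T1E),
    ("S2: New York", T2, T2S, T2E),
]

def known_addresses(all_states, knowledge_instant, instant_of_interest):
    """Addresses valid at `instant_of_interest`, restricted to the states
    that had already been revealed at `knowledge_instant`."""
    return [
        label
        for label, revealed, valid_start, valid_end in all_states
        if revealed <= knowledge_instant
        and valid_start <= instant_of_interest <= valid_end
    ]
```

Under these assumed numbers, asking just before T2 about an instant after T1E yields an empty list, which corresponds to the system not yet knowing where Mr. Smith lived; asking the same question at T2 yields the New York state.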
4.1 Conceptual model

In this section we will present a conceptual model that generalizes the example from Figure 2, which illustrated the usefulness of overlapping valid times. A temporally consistent representation of information requires a reliable view on historical data at any point in time, independent from propagation delays. Therefore we define:

A knowledge state (KS) is determined by a specified instant T. It considers all information (knowledge) that was observed, captured, and integrated until the instant T. An ordered relation of two instants T1 < T2 implies that KS(T1) ≤ KS(T2). In other words, moving forward in time causes the knowledge state to grow.

In general an analysis focuses on a time interval containing at least one instant of interest (II): IIinterest = [IIstart, IIend] (e.g. July 1st, 2001). The KS and II are two orthogonal time dimensions and therefore independent from each other regarding analysis capabilities.

In general a stored state SX is determined by an instant TX (revelation time) and a corresponding valid time interval indicated by [TXS, TXE]. A data model enables temporal consistency if a set of nine conditions listed in Table 1 is satisfied for any combination of stored data sets (S1, S2) regarding the same subject.

Table 1. Conceptual model for temporal consistency

Knowledge State (KS)   Instant(s) of interest (II)   Retrieved state
KS < T1                Any II                        (not defined) (1)
T1 ≤ KS < T2           II < T1S                      (not defined) (1)
                       T1S ≤ II ≤ T1E                S1
                       II > T1E                      (not defined) (1)
KS ≥ T2                II < T1S                      (not defined) (1)
                       T1S ≤ II [...]                S1
                       [...]                         (not defined) (1)
                       [...]                         S2
                       [...]                         (not defined) (1)

(1) "not defined" means neither S1 nor S2.
(2) This case is obsolete, if the corresponding valid times [T1S, T1E] and [T2S, T2E] do not overlap.
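One way to read Table 1 is as a lookup that takes a knowledge state and an instant of interest and returns at most one stored state. The Python sketch below is an interpretation under stated assumptions, not the authors' definition: instants are integers, the sample values are invented, and where the valid times of two known states overlap the sketch prefers the state with the later revelation time.

```python
from typing import NamedTuple, Optional

class State(NamedTuple):
    name: str
    revelation: int   # T_X
    valid_start: int  # T_XS
    valid_end: int    # T_XE

# Hypothetical values with overlapping valid times (T2S < T1E < T2).
S1 = State("S1", revelation=12, valid_start=10, valid_end=30)
S2 = State("S2", revelation=40, valid_start=25, valid_end=60)

def retrieved_state(states, ks, ii) -> Optional[State]:
    """Return the state retrieved for knowledge state `ks` and instant of
    interest `ii`, or None for the "(not defined)" case."""
    # Only states revealed up to the knowledge state may be used.
    known = [s for s in states if s.revelation <= ks]
    # Of those, keep the states whose valid time contains the instant.
    candidates = [s for s in known if s.valid_start <= ii <= s.valid_end]
    if not candidates:
        return None
    # Assumption: on overlap, the most recently revealed state wins.
    return max(candidates, key=lambda s: s.revelation)
```

Because a larger `ks` can only add states to `known`, the sketch also reflects the property that moving forward in time causes the knowledge state to grow.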
Table 1 describes the timely correct state that will be retrieved during analytical processing. The retrieved state can be viewed as the output of a function considering the applied KS and the specified II. The reading examples below exactly describe the contributions of this conceptual model to conventional temporal models restricted to non-overlapping valid times.
• At any point in time an analysis based on a KS between T and (T KS