Modeling Temporal Consistency in Data Warehouses

Robert M. Bruckner¹, Beate List¹, Josef Schiefer², A M. Tjoa¹

¹ Institute of Software Technology, Vienna University of Technology, Favoritenstr. 9-11 /188, A-1040 Vienna, Austria
{bruckner, list, tjoa}@ifs.tuwien.ac.at

² IBM Watson Research Center, 30 Saw Mill River Rd., Hawthorne, NY 10532, USA
[email protected]

Abstract

Real-world changes are generally discovered delayed by computer systems. The typical update patterns for traditional data warehouses on an overnight or even weekly basis enlarge this propagation delay until the information is available to knowledge workers. The main contribution of the paper is the identification of two different temporal characterizations of the information appearing in a data warehouse: one is the classical description of the time instant when a given fact occurred, the other represents the instant when the information has been entered into the system. We present an approach for modeling conceptual time consistency problems and introduce a data model that deals with timely delays and supports knowledge workers to determine what the situation was in the past, knowing only the information available at a given instant of time.

1 Introduction

The observation of real-world events by computer systems is characterized by a delay. In the applied context of information systems it is determined by external and internal factors (e.g. update patterns, processing speed). This so-called propagation delay is the time interval it takes for a monitoring system to realize an occurred state change. In contrast to operational systems (designed to meet well-specified response time requirements), the focus of data warehouses (DWHs) [6] is generally the strategic analysis of data integrated from heterogeneous systems. Late-arriving data that should have been loaded into the DWH weeks or months before complicate this situation [10]. Hence, keeping data current and consistent in that context is not an easy task.

Until recently, timeliness requirements [4] (describing the relative availability of data to support a given process within the timetable required to perform the process) were restricted to mid-term or long-term. W.H. Inmon, known as the founder of data warehousing, cites time variance [6] as one of four salient characteristics of a DWH. Timeliness can be viewed as an aspect of data quality, which has a strong influence on the delay until a system realizes a certain state of the data. While a complete, real-time enterprise DWH might still be the panacea, there are some approaches to enable DWHs to react "just-in-time" [16] to changing customer needs and financial concerns.

Figure 1. Delayed discovery of real-world changes. [Timeline from past to future: real-world event (address change, Boston to NY); event first captured in electronic format; data load into data warehouse; event-driven action based on this (new) knowledge, e.g. provide special location-based services (NY).]

Figure 1 demonstrates that typical update patterns for traditional DWHs on a weekly or even monthly basis enlarge propagation delays until the information is available to knowledge workers. Any significant delay in the recognition of events may result in a number of further considerations needing to be taken into account:

• Data integration. Aggregates have to be updated, because the new records will change counts and totals of the prior history.
• Analytical processing. Historical analysis results can no longer be repeated, if additional information regarding that time period is integrated delayed (numbers and summaries will change unexpectedly from the user's perspective).

We present a schema model that copes with delays and enables a timely consistent representation of information. This enhances analytical processing by considering that information validity of data is typically restricted to time periods, because of frequent updates or late-arriving data.

The remainder of this paper is organized as follows. Section 2 considers research issues and related work. Sections 3 and 4 classify common temporal data structures in information systems and introduce the concept of temporal consistency described from different viewpoints (conceptual and logical). Section 5 evaluates the model and finally we give a conclusion.

2 Research issues and related work

The notion of time is fundamental to our existence and an important aspect of real-world phenomena. We can reflect on past events and on possible future events, and thus reason about events in the domain of time. In many models, time is an independent variable that determines the sequence of states of a system. There are two different temporal characterizations of the information appearing in a DWH: one is the classical description of the time instant when a given fact occurred; the other represents the instant when the information is actually knowable to the system. This distinction, implicit and usually not critical in on-line transactional processing (OLTP) applications, has a particular importance for DWHs, where it can be useful to determine what was the situation in the past, knowing only the information available at a given time.

Temporal databases provide support for past, current, or even future data and allow sophisticated queries over time to be stated [17]. Research in temporal databases [5] has grown immensely in recent years. In particular, transaction time and valid time have been proposed [7] and investigated in detail [8], [15].

In the field of data warehousing, Bliujute et al. in [1] concentrated on the shortcomings of star schemas in the context of slowly changing dimensions [9] and concluded that state-oriented warehouses allow easier analytical processing and even better query performance than observed in regular events warehouses. Our formal approach to managing temporal consistency (described in section 4) is state-oriented, too.

Pedersen and Jensen [11] describe features that entire DWH data models should have (including a requirement to handle changes in data over time) and evaluate previously proposed models. In the discipline of temporal data warehouses a lot of research was done in the context of temporal view maintenance, e.g. [13].

An interesting practice approach is described in [4], where timeliness is viewed as the time from when a fact is first captured in an electronic format and when it is actually knowable by a knowledge worker who needs it. Late-arriving facts and dimension records [10] can complicate this situation, because they are changing counts and totals for prior history. Some industries, like health care, have to deal with huge numbers of late arriving records [3].

The nature of delays in active temporal databases is discussed in [12], concluding that temporal faithfulness has to be provided. Applying this concept to data warehousing ensures that information is analytically processed in a consistent way.

3 Temporal data structures

Timestamps allow the maintenance of temporal data. When considering temporal data for DWHs we need to understand how time is reflected in a database, how this relates to the structure of the data, and how a state change affects existing data. There are a number of approaches:

• Transient data. The key characteristic of transient data is that alterations to and deletions of existing records physically destroy the previous data content. This type of data is typically found in operational environments (e.g. order-entry systems).
• Periodic data. Once a record is added to a database, it is never physically deleted, nor is its content ever modified. Rather, new records are always added to reflect updates or even deletions. Periodic data thus contain a complete record of the changes that have occurred in the data. DWHs are periodic in nature.
• Semi-periodic data. This kind of data is typically found in the real-time data of operational systems where previous states are important (bank account systems, insurance premium systems, etc.). However, in almost all operational systems, the duration for which persistent data are held is relatively short, due to performance and/or storage constraints. Therefore, this kind of data may be termed semi-periodic data.
• Snapshot data. Snapshot data are a stable view of data as they exist at some point in time. They are a special kind of periodic data. Snapshots usually represent the data at some time in the past, and a series of snapshots can provide a view of the history of an organization.

The standard approach to storing periodic data (typically found in DWHs) is to use timestamped status and event records. There are, however, a variety of schemes to maximize the efficiency of timestamps [2], [8], [15]. The single timestamp approach, storing only a start time when a record became valid, is well applicable to event data, but faces serious deficiencies in the context of DWHs, where in general state information is stored. There are two relatively common types of queries used in DWH environments, which explain the problem:

1. A query that needs to access current data. In a single timestamp scheme, the only way to identify current records is to find the latest timestamp of the periodic set, which is an inefficient process.
2. A query that builds a view of the data at a particular time in the past. In order to support this kind of query, the period of validity of each record must be known, to compare it with the required time. With a single timestamp approach, the end of the period of validity can only be found from the next record in the periodic sequence. In general this is also an expensive process.
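The two query types can be illustrated with a minimal Python sketch. The record layout, the field names, the integer time values and the half-open interval convention [start, end) are all assumptions made for illustration; they are not taken from the paper. The sketch contrasts a lookup that must search the whole periodic set (single timestamp) with one that can test each record on its own (start and end time).

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Record:
    key: str                   # subject of the record (hypothetical field)
    value: str                 # stored state, e.g. an address
    start: int                 # start time: when the record became valid
    end: Optional[int] = None  # end time; None means "still valid" (assumption)


def current_single(records: List[Record], key: str) -> Record:
    """Single timestamp: scan the periodic set for the latest start time."""
    return max((r for r in records if r.key == key), key=lambda r: r.start)


def as_of_single(records: List[Record], key: str, t: int) -> Optional[Record]:
    """Single timestamp: the end of validity is implied by the next record,
    so the latest record that started at or before t is taken as valid at t."""
    candidates = [r for r in records if r.key == key and r.start <= t]
    return max(candidates, key=lambda r: r.start) if candidates else None


def as_of_two(records: List[Record], key: str, t: int) -> Optional[Record]:
    """Start and end time: each record is tested independently of the others."""
    for r in records:
        if r.key == key and r.start <= t and (r.end is None or t < r.end):
            return r
    return None


# Hypothetical history of one subject.
history = [
    Record("smith", "Boston", start=10, end=20),
    Record("smith", "NY", start=20),
]
```

In a relational setting the single-timestamp variants would typically correspond to a subquery or self-join over the periodic set, whereas the two-timestamp variant is a plain range predicate on each row.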

In order to address the first problem, a second timestamp (called end time) can be added to each fact. It identifies the end of the period of validity. This causes performance improvements in retrieving data. However, it is not sufficient to fully solve the second problem, because the period of validity can change over time due to new information integrated into the DWH later. A temporally consistent view (similar to snapshot data) during analytical processing requires the storage of both the old and the new (maybe overlapping) validity period. Therefore, we enhance DWH data models with the following time dimensions (described in detail in section 4.2):

• valid time dimension (validity of knowledge) as motivated above.
• revelation time dimension (transaction time). It describes the point in time when a piece of information was realized by at least one source system.
• load timestamp. This represents the point in time when the new piece of information was integrated.
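As a rough illustration of how the three time dimensions could sit on one stored record, consider the following Python sketch. The class name, the field names, the closed validity interval and all dates are hypothetical; the paper's own logical model is the one described in section 4.2.

```python
from dataclasses import dataclass
from datetime import date, datetime


@dataclass(frozen=True)
class StateRecord:
    subject: str           # hypothetical key, e.g. a customer
    value: str             # the stored state, e.g. an address
    valid_from: date       # valid time: start of the validity period
    valid_to: date         # valid time: end of the validity period (inclusive)
    revealed_at: datetime  # revelation time: first realized by a source system
    loaded_at: datetime    # load timestamp: integrated into the DWH


def overlaps(a: StateRecord, b: StateRecord) -> bool:
    """True if the validity periods of two records share at least one day."""
    return a.valid_from <= b.valid_to and b.valid_from <= a.valid_to


# Hypothetical example data: both records are kept, although their
# validity periods overlap in June 2001.
boston = StateRecord("smith", "Boston", date(2001, 1, 1), date(2001, 6, 30),
                     datetime(2001, 1, 5, 9, 0), datetime(2001, 1, 6, 2, 0))
ny = StateRecord("smith", "NY", date(2001, 6, 1), date(2001, 12, 31),
                 datetime(2001, 7, 10, 9, 0), datetime(2001, 7, 11, 2, 0))
```

Keeping the records immutable mirrors the periodic nature of DWH data: a later correction is stored as an additional record instead of overwriting the earlier validity period.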

4 Temporal consistency

Figure 2. Overlapping validity periods. [Timeline from past to future. State S1 (Boston): event e1, revelation time T1, valid time of state S1 from T1S to T1E. State S2 (NY): event e2, revelation time T2, valid time of state S2 from T2S to T2E.]

The continuum of real time can be modeled by a directed timeline consisting of an infinite set {T} of instants (time points on an underlying time axis [7]). A section of the timeline is called a duration. An event takes place at an instant of time, and therefore does not have a duration. A time interval is determined by the duration between two corresponding (start - end) instants.

Figure 2 describes a situation where a computer system observes state S1 at T1, indicating that a specified person (Mr. Smith) lives in Boston. It knows that Mr. Smith stays there from T1S to T1E. The next state (S2) knowable to the computer system regarding Mr. Smith is the new address in New York at T2. Additional data reveals that he already lived in New York since T2S.

In order to determine what was the situation before instant T2, it must be feasible to process only those states known before T2. By modeling this situation temporally consistent, it will always be possible to find out that the system did not know where Mr. Smith lived after T1E until the instant T2, when the new piece of information (state S2) was integrated into the DWH.
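The Mr. Smith example can be sketched in Python as follows. The integer instants are hypothetical and only chosen so that T2S < T1E < T2 holds, as the text and the caption of Figure 2 suggest; the function names are illustrative. The sketch restricts processing to the states revealed up to a given instant and then looks up which of them are valid at the instant asked about. It deliberately does not decide between overlapping states.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class State:
    name: str         # label of the state, e.g. "S1"
    address: str
    revealed: int     # revelation time T_X
    valid_start: int  # T_XS
    valid_end: int    # T_XE


# Hypothetical instants with T2S (4) < T1E (5) < T2 (7).
S1 = State("S1", "Boston", revealed=2, valid_start=1, valid_end=5)
S2 = State("S2", "NY", revealed=7, valid_start=4, valid_end=9)


def known_at(states: List[State], instant: int) -> List[State]:
    """States whose revelation time is not later than the given instant."""
    return [s for s in states if s.revealed <= instant]


def addresses_at(states: List[State], knowledge_instant: int,
                 instant_of_interest: int) -> List[str]:
    """Addresses valid at the instant of interest, using only known states.
    Overlapping candidates are all returned; no precedence rule is applied."""
    return [s.address for s in known_at(states, knowledge_instant)
            if s.valid_start <= instant_of_interest <= s.valid_end]
```

With knowledge restricted to an instant before T2, a question about an instant after T1E yields an empty result, which corresponds to the observation that the system did not yet know where Mr. Smith lived.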

4.1 Conceptual model

In this section we will present a conceptual model that generalizes the example from Figure 2, which illustrated the usefulness of overlapping valid times. A temporally consistent representation of information requires a reliable view on historical data at any point in time independent from propagation delays. Therefore we define:

A knowledge state (KS) is determined by a specified instant T. It considers all information (knowledge) that was observed, captured, and integrated until the instant T. An ordered relation of two instants T1 < T2 implies that KS(T1) ≤ KS(T2). In other words, moving forward in time causes the knowledge state to grow.

In general an analysis focuses on a time interval containing at least one instant of interest (II): II_interest = [II_start, II_end] (e.g. July 1st, 2001). The KS and II are two orthogonal time dimensions and therefore independent from each other regarding analysis capabilities.

In general a stored state SX is determined by an instant TX (revelation time) and a corresponding valid time interval indicated by [TXS, TXE]. A data model enables temporal consistency if a set of nine conditions listed in Table 1 is satisfied for any combination of stored data sets (S1, S2) regarding the same subject.

Table 1. Conceptual model for temporal consistency

Knowledge state (KS)   Instant(s) of interest (II)   Retrieved state
KS < T1                Any II                        (not defined)¹
T1 ≤ KS < T2           II < T1S                      (not defined)¹
                       T1S ≤ II ≤ T1E                S1
                       II > T1E                      (not defined)¹
KS ≥ T2                II < T1S                      (not defined)¹
                       T1S ≤ II [...]                S1
                       [...]                         (not defined)¹
                       [...]                         S2
                       [...] T2E                     (not defined)¹

[The II conditions of the KS ≥ T2 rows are only partly legible in the source; illegible parts are shown as [...].]

¹ "not defined" means neither S1 nor S2.
² This case is obsolete, if the corresponding valid times [T1S, T1E] and [T2S, T2E] do not overlap.
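The knowledge state defined in this section can also be written in set notation. The following is one possible reading, not the paper's own formalization: it assumes that KS(T) is the set of stored states whose revelation time does not exceed T, and that the ordering of knowledge states is set inclusion.

```latex
\[
  KS(T) = \{\, S_X \mid T_X \le T \,\}, \qquad
  T_1 < T_2 \;\Longrightarrow\; KS(T_1) \subseteq KS(T_2).
\]
```

Under this reading the growth of the knowledge state is immediate: any state with T_X ≤ T_1 also satisfies T_X ≤ T_2.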

Table 1 describes the timely correct state that will be retrieved during analytical processing. The retrieved state can be viewed as the output of a function considering the applied KS and the specified II. The reading examples below exactly describe the contributions of this conceptual model to conventional temporal models restricted to non-overlapping valid times.

• At any point in time an analysis based on a KS between T1 and T2 (T1 ≤ KS
