The Cedar DBMS: A Preliminary Report, 1981 - CiteSeerX


The Cedar DBMS: A Preliminary Report
by Mark R. Brown, Roderic G. G. Cattell, and Norihisa Suzuki
Computer Science Laboratory, Xerox Palo Alto Research Center, Palo Alto, California 94304

Some conventions: in the sequel, we shall refer to programs that use a certain facility as clients of the facility. Human users of a service are called users. This sans-serif font will be used to emphasize identifiers that might appear in a program.

Abstract

The Cedar DBMS is a database management system developed as part of the Cedar programming environment. The system has several unusual aspects, including its interface for applications programming in a strongly-typed procedural language and its distribution of computation and data over a network. This paper describes the design goals, architecture, and implementation of the Cedar DBMS.

2. Background

2.1 Projected Applications

Our work in constructing the Cedar DBMS is motivated not only by our interest in DBMS research problems, but by a desire to address a range of applications that arise in our laboratory (CSL). These applications include computer-design automation, graphics and page imaging, knowledge representation for Artificial Intelligence, program storage in a shared programming environment, and office information systems. These applications present a wide range of requirements, in terms of the expected database size, the necessary performance, and the most desirable data model and interface to the DBMS. Yet there are certain similarities among most of the applications. Apart from some AI and CAD applications, none requires extremely high performance for a large number of accesses, but most would like the overhead for the first access to be low (for < 2 second interactive response). No application involves extremely large volumes of data (10^4 to 10^6 data objects is the upper range). A typical query seems to involve only a small number (< 50) of related data objects. There is a strong requirement for a program interface to the DBMS, and less of a need for a standard human interface.

1. Introduction

We formed a small group in early 1979 to discuss the design and construction of a database management system. The system was to run in the Cedar programming environment (which was itself being designed at the time), and its primary goal was to support various applications within our laboratory. The members of the group had research interests in data models, physical database design, user interfaces, and other topics related to databases; building the system would give us a chance to try out some of our ideas in these areas. A database management system has now been implemented; we shall call it the Cedar DBMS. Several interesting applications have been built using this system, which is still under development.

This paper will focus primarily on the novel aspects of the Cedar DBMS. These include: a programmer's interface that harmonizes well with its host language, distribution of computation and data over a network, and a clean separation between transaction management and data management. To a large degree these are a function of the system's starting assumptions, concerning both the system's intended applications and the system's computing environment. Therefore we shall devote Section 2 to a description of this context. Section 3 gives an overview of the design, and Section 4 discusses selected design issues in more detail. Section 5 gives some conclusions and outlines directions for future work.

2.2 Computing Environment

Hardware. Our laboratory's computing facility includes a large number of personal computers, connected by an experimental 3 Mbit/sec local network [Metcalfe & Boggs, 1976]. The local network is connected to other local nets, forming an internetwork [Boggs et al., 1979]. Many of the personal computers are Altos [Thacker et al., 1979]; a growing number of higher-speed Dorado computers [Lampson et al., 1981] are also available. Each machine is equipped with a disk storage device.

Juniper. An important service provided in the internetwork environment is the distributed file system called Juniper [Israel et al., 1978]. Juniper is implemented by a cooperating set of file server computers. A client of Juniper performs page or byte-range read and write actions to some set of files, possibly

Permission to copy without fee all or part of this material is granted provided that the copies are not made or distributed for direct commercial advantage, the ACM copyright notice and the title of the publication and its date appear, and notice is given that copying is by permission of the Association for Computing Machinery. To copy otherwise, or to republish, requires a fee and/or specific permission. © 1981 ACM 0-89791-040-0/80/0400/0205 $00.75


other programs. A program may also export an interface, thereby making the interface procedures implemented by the program available to other programs. Because Cedar is a single-language environment, all components of the system are structured as programs connected by interfaces. For instance, the entire operating system is written in Cedar, and operating system facilities are presented to client programs via standard Cedar interfaces [Lampson & Redell, 1980].

Cedar is a fully typed language, much in the style of Pascal. The type system allows the compiler to detect certain types of errors in using procedures or data items. The type system may be viewed as a form of protection: if a type and its operations are defined in an interface, a client must take extraordinary steps in order to apply a non-interface operation to a variable of this type.

The two principal extensions to Mesa that are present in Cedar are runtime types and garbage collection. In Mesa, all type checking is done at compile time, and types tend to be highly specific. This makes it difficult to write a general package for, say, manipulation of lists. To address this problem Cedar contains dynamically-typed pointers (REF ANY), along with a facility for examining types at runtime.

Storage management is a constant problem in writing general packages. It is difficult for an interface to describe the precise responsibility that each party to the interface has for storage allocation and deallocation; this gives rise to problems of dangling pointers and storage leaks. Cedar includes a garbage-collecting storage allocator to deal with this problem, and we expect many applications programs to use garbage-collected storage exclusively.

not all residing on the same machine. These operations can be grouped into a transaction to guarantee that they will occur as a single, atomic action with respect to other activity in the system. This atomicity is guaranteed even in the face of file server crashes.

Juniper uses locking to ensure the serial consistency implied by atomic transactions. For some of what follows it will be necessary to understand Juniper's locking algorithm in some detail. When a client reads a sequence of bytes from Juniper, it implicitly obtains a read lock on the bytes; writing a byte range obtains a write lock. At a given instant of time, a single byte may be covered by any number of read locks and at most one write lock. An attempt to obtain a write lock on an already write-locked byte causes the second writer to wait; eventually, a waiter either proceeds because the conflict has disappeared, or times out and is aborted (all writes performed under the transaction are undone). These timeouts resolve potential deadlock situations.

When a transaction attempts to commit (finish and make its actions visible to the rest of the world), it may be holding a write lock that overlaps a read lock of another transaction. In this case the reading transaction will be sent a broken read lock indication. The reader must then either release the broken lock (to certify that it does not depend upon the data formerly protected by the lock) or abort itself. After commit, a transaction becomes invalid and therefore loses all of the read and write locks that it had acquired. A transaction may choose to checkpoint rather than commit; in this case, the transaction first commits, then changes all of its write locks into read locks, and is able to proceed with these locks in force.

Juniper files are named by 64-bit IDs, which Juniper guarantees to be unique across the internet. A directory service, providing text names for files, is written as a Juniper application. Juniper provides standard facilities for file protection, based on access control lists. As personal computers, the client machines are not able to provide strong protection guarantees; therefore the physical separation of Juniper from client machines is useful.
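The locking rules described above can be illustrated with a small model. The sketch below is not Juniper's implementation: the class `ByteLocks`, its method names, and the per-byte bookkeeping are assumptions made purely for illustration. It shows only the compatibility rule (any number of read locks, at most one write lock per byte) and the broken read lock indication produced at commit; waiting, timeouts, and checkpoints are left out.

```python
from collections import defaultdict


class ByteLocks:
    """Toy model of per-byte read/write locks (hypothetical, not Juniper's API)."""

    def __init__(self):
        self.readers = defaultdict(set)   # byte offset -> transactions holding read locks
        self.writer = {}                  # byte offset -> transaction holding the write lock

    def read(self, txn, start, length):
        # Reading implicitly obtains a read lock on every byte in the range.
        for b in range(start, start + length):
            self.readers[b].add(txn)

    def write(self, txn, start, length):
        # At most one write lock per byte; a second writer would have to wait.
        rng = range(start, start + length)
        if any(self.writer.get(b, txn) != txn for b in rng):
            return False                  # caller must wait (or time out and abort)
        for b in rng:
            self.writer[b] = txn
        return True

    def commit(self, txn):
        # Readers whose locks overlap the committed writes get a broken-lock notice.
        broken = set()
        for b, w in list(self.writer.items()):
            if w == txn:
                broken |= self.readers[b] - {txn}
                del self.writer[b]
        # After commit the transaction loses all of its locks.
        for holders in self.readers.values():
            holders.discard(txn)
        return broken
```

In this model a reader that appears in the set returned by `commit` would either release the broken lock or abort, as the text describes.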

3. Design Overview

3.1 General Approach

The Cedar DBMS is a relatively simple, low-level system. We felt that we could not choose an appropriate high-level data model for the DBMS without actually implementing some of our intended applications. Given our range of applications it may be unreasonable to impose a single high-level data model. By building a simple system and some applications quickly we hoped to gain the experience necessary to design a practical future system with a higher degree of data independence.

We hoped that in addition to supporting applications and furnishing us with useful experience, the Cedar DBMS would contribute software components toward the development of a future high-level system. It is possible in principle to layer a high-level system on the Cedar DBMS, but the high-level system may want control of decisions that are currently being made by the Cedar DBMS, such as access path selection.

While we characterize our system as low-level, it does in fact support a useful degree of abstraction from the representation of data items in ferrous oxide. For instance, clients are never exposed to disk addresses or the equivalent ("tuple IDs").

Cedar. The Cedar project is developing an advanced programming environment to serve as the basis for most of CSL's programming in the next few years. Cedar is a single-language environment; the language is a derivative of Mesa [Mitchell et al., 1979] that is called "the Cedar language", or "Cedar" for short. The features of Cedar that have significant influence on the design of the Cedar DBMS are interfaces, the type system, and garbage collection. The remainder of this section discusses these features in more detail.

Systems built in Cedar are collections of modules. Definitions modules, often called interfaces, are the connectors between programs. Interfaces contain declarations of types, constants, procedures, and variables. Only the names and types of procedures are specified in interfaces, not their implementations. A program module, or program, contains a collection of variables and the implementations of procedures with direct access to those variables. A program may import an interface in order to use facilities defined there and implemented in


We have borrowed ideas freely from the Research Storage System (RSS), a component of System R [Astrahan et al., 1976]. One very appealing aspect of the RSS design is its apparent ability to support different styles of use (relational, hierarchical) in a graceful way. As in the RSS, we do not address the difficult and specialized issues of very large databases, since very large databases are not expected to arise in our applications. We are more concerned with fast access at the individual tuple level for moderate-sized databases.

Client access to the Cedar DBMS is through a standard Cedar interface. This is the only design that is consistent with our goal of building both the system and some applications quickly. In the future we expect database access to become important enough to require a higher degree of integration with the Cedar language.

3.2 System Levels

The Cedar DBMS is constructed in four levels: the tuple level, storage level, segment level, and file level. This represents a strict hierarchy in that each level communicates only with adjacent levels, and does so through Cedar interfaces.

The file level exports DBFile. This interface contains operations to create a file, set a file's length (in pages), read or write a page from a file, etc.

The segment level exports DBSegment, an interface that supports the abstraction of a database as a uniformly addressable set of pages. Such databases consist of a number of distinct segments, which are independent units for allocating database pages. DBSegment contains operations to make a database addressable, to create a new segment as part of the current database, to allocate or free pages from a particular segment, to generate the address of an in-core copy of page p of the current database, etc.

The storage level exports DBStorage. DBStorage supports abstractions such as tuple and index, and a set structure on tuples that is equivalent to the RSS's binary links. It provides operations to copy primitive values such as strings, numbers, and word arrays to and from tuples.

The tuple level exports DBTuples, the Cedar DBMS's only client interface. The next section discusses DBTuples in detail. The primary responsibilities of the tuple level are to implement the "system tuple sets" (data dictionary), to maintain indexes in the face of tuple updates, and to perform access path selection for retrieval requests. It also translates between Cedar language types and the more primitive types supported at the storage level.

4. Design Specifics

4.1 Client Interface

To aid in the exposition we shall intersperse the most relevant portions of the DBTuples interface with the description below.

tuple consists of a set of fields, each with an associated value. Each tuple belongs to a type, called a tuple set. A tuple set is a template for making tuples: it has a fixed set of named attributes, and each attribute has a value type. A field of a tuple is accessed by the corresponding attribute's name and its value is constrained to match the attribute's value type.

Most value types are drawn from a limited set of Cedar types that includes INTEGER, LONG INTEGER, and STRING. Variable-length values, such as strings, do not require a specific upper bound on their length. These Cedar types are uninterpreted by the DBMS, apart from some knowledge of ordering among values (as required for indexing).

The system also supports an interpreted value type, tupleRef. A tupleRef is a reference to a tuple from any tuple set. The presence of tupleRef values makes the DBTuples interface non-relational.

Various procedures in DBTuples take tuples as parameters or return tuples as results. The interface type is

    Tuple: TYPE = REF TupleObject;
    TupleObject: TYPE.

That is, a client program is given a pointer, or handle, to a garbage-collected data object whose structure is not revealed in the interface. This ensures that only procedures defined in the interface can be applied to tuples. For instance, to test two tuples for identity (a distinctly non-relational operation), we call

    Eq: PROCEDURE [t1, t2: Tuple] RETURNS [BOOLEAN].

A TupleObject actually contains information that varies with the type of tuple, plus a tuple identifier (TID) whose function is the same as in the RSS. TIDs are hidden from the client, so the system can always perform reorganizations that change TIDs.

The four primitive operations on tuples are to create or destroy tuples, and to retrieve or store tuple fields. These four are also the primitive operations on records in most programming languages, but not in the application programmer's interface to most data management systems.

    CreateTuple: PROCEDURE [ts: TupleSet] RETURNS [t: Tuple];
    DestroyTuple: PROCEDURE [t: Tuple];
    SetF: PROCEDURE [t: Tuple, a: Attribute, v: Value]; -- t.a ← v
    GetF: PROCEDURE [t: Tuple, a: Attribute] RETURNS [v: Value]; -- v ← t.a

Here the type Value = REF ANY, allowing differing types of data to be passed and returned. An implicit dereferencing and narrowing (type check) takes place when a Value is assigned to a client variable with a specific type. Storage for the result of GetF is allocated from a garbage-collected zone.

The system also includes a notion of "null tuple", and a predicate to test a Tuple for equality with the null tuple. Null tuples are generated for a variety of reasons. An uninitialized tupleRef field is null, as is a tupleRef that once pointed to a tuple that was later destroyed. Also, if a client holds two tuple-valued variables t1 and t2 such that Eq[t1, t2], then DestroyTuple[t1] causes both t1 and t2 to become null.

    Null: PROCEDURE [t: Tuple] RETURNS [BOOLEAN];
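The behaviour of the primitive operations and of the null tuple can be sketched in a few lines. The Python below is an illustration only, not the Cedar implementation: `TinyStore`, `TupleObject._tid`, the in-memory field dictionary, and the choice to keep exactly one object per TID are all assumptions introduced for the example.

```python
class TupleObject:
    """Opaque handle target; clients never see the TID (hypothetical layout)."""

    def __init__(self, tid):
        self._tid = tid          # hidden tuple identifier; None means "null tuple"
        self._fields = {}


class TinyStore:
    """Illustrative stand-in for the DBTuples primitives."""

    def __init__(self):
        self._next_tid = 1
        self._live = {}          # tid -> TupleObject, so one tuple has one object

    def create_tuple(self):
        obj = TupleObject(self._next_tid)
        self._live[self._next_tid] = obj
        self._next_tid += 1
        return obj

    def destroy_tuple(self, t):
        if t._tid is not None:
            del self._live[t._tid]
            t._tid = None        # every handle to this object now tests as null
            t._fields.clear()

    def set_f(self, t, attr, value):
        if self.null(t):
            raise ValueError("cannot store into the null tuple")
        t._fields[attr] = value

    def get_f(self, t, attr):
        if self.null(t):
            raise ValueError("cannot fetch from the null tuple")
        return t._fields[attr]

    @staticmethod
    def eq(t1, t2):
        # Identity test on the hidden TIDs.
        return t1._tid == t2._tid

    @staticmethod
    def null(t):
        return t._tid is None
```

Because both client variables refer to the same object, destroying the tuple through one of them makes the other test as null as well, which mirrors the DestroyTuple[t1] behaviour described above.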

4.1.1 Tuples

The primitive data type provided by DBTuples is the tuple. The DBTuples notion of tuple is fairly standard; it is not too far wrong to think of a row of a table. A

An interesting problem arises in the implementation of tuples. A TupleObject contains a TID, which is in effect a reference to a particular location on the disk. When a client calls DestroyTuple, it is important for the system to invalidate all copies of the destroyed tuple's TID that might be contained in other client TupleObjects. Otherwise the client might use one of these copies to perform a tuple access after the disk storage formerly occupied by the destroyed tuple had been used to create a new tuple.

There seem to be two possibilities: either ensure that duplicate TIDs are never created in TupleObjects, or find all TupleObjects when invalidation of duplicates is necessary. But avoiding duplicates also requires the ability to find all TupleObjects. So in either case, the DBMS needs to maintain a REF to a TupleObject T for as long as T is accessible to the client.

The motive for using garbage-collected storage for TupleObjects is that a TupleObject T should be reclaimed by the garbage collector when the last client REF (Tuple) to T is destroyed. But we have just argued that the DBMS must maintain its own REF to T for as long as the client does. So if the garbage collector only reclaims a TupleObject when no references to it remain, no TupleObject will ever be reclaimed.

Cedar's solution to this problem is to allow the association of a user-written finalization procedure with each collectible record type. For instance, we can arrange for the garbage collector to call a DBMS-supplied finalization procedure when the number of REFs to a TupleObject reaches a specified value greater than zero, say 2. If 2 is the number of REFs to a TupleObject that the DBMS maintains for its own use, then no more client REFs will point to a Tuple that is handed to the finalization procedure, and the procedure can safely destroy the DBMS's REFs. Later, the collector will find that the number of REFs to the TupleObject has reached zero, and will reclaim its storage.

4.1.2 Retrieval

DBTuples provides two forms of retrieval from a tuple set to yield a collection of tuples.
A result of the first form, a match set, contains all tuples whose field values fall into specified ranges. A result of the second form, a ref set, contains all tuples with a specific tupleRef field pointing to a particular tuple. Both kinds of retrieval are implemented as generators: an initial call is made to define the query, and this returns a handle for the match set or ref set. Then succeeding calls to the "next" operation are used to scan through the elements of the set. When all elements of the set have been enumerated, the null tuple is returned.

    ForMatchSet: PROCEDURE [ts: TupleSet, tm: TupleMatch] RETURNS [ms: MatchSet];
    NextMatch: PROCEDURE [ms: MatchSet] RETURNS [t: Tuple];
    ForRefSet: PROCEDURE [t: Tuple, a: Attribute] RETURNS [rs: RefSet];
    NextRef: PROCEDURE [x: RefSet] RETURNS [y: Tuple];
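The generator style of retrieval maps naturally onto generators in a modern language. The following sketch assumes that tuples are plain dictionaries and that a TupleMatch is a mapping from attribute names to inclusive (low, high) ranges; neither representation comes from the Cedar DBMS. It covers only match sets, and it uses `None` to play the role of the null tuple.

```python
def for_match_set(tuple_set, ranges):
    """Yield tuples whose fields all fall within the given inclusive ranges.

    tuple_set: list of dicts standing in for tuples (an assumption).
    ranges: dict mapping attribute name -> (low, high).
    """
    for t in tuple_set:
        if all(lo <= t[a] <= hi for a, (lo, hi) in ranges.items()):
            yield t


def next_match(match_set):
    """Return the next tuple, or None (playing the null tuple) when exhausted."""
    return next(match_set, None)
```

A client would call `for_match_set` once to define the query and then call `next_match` repeatedly until it returns `None`. This sketch scans the whole tuple set; an index, where one exists, would let the scan start at the first qualifying tuple.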

ForMatchSet automatically uses an index, if this will help generate the match set. Match sets and ref sets are represented in the same way as tuples: the client is given a handle for a dynamically allocated, garbage-collected opaque object. A tuple is not bound in any way to the match set or ref set that


generated it. In fact, a client is free to build arbitrary Cedar data structures that include Tuples. However, these structures only remain valid during a single invocation of the client.

A common use of ForMatchSet is to retrieve a tuple by its unique name (some application-defined combination of attributes), so DBTuples provides a more specialized procedure, FetchTuple, for this case. If the match set contains either no tuples or more than one tuple, FetchTuple raises an exception (Cedar SIGNAL) for the client to deal with.

    FetchTuple: PROCEDURE [ts: TupleSet, tm: TupleMatch] RETURNS [t: Tuple];
    NotFound, MultipleMatch: SIGNAL;
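The contract of FetchTuple can be sketched as follows. This is an approximation under stated assumptions: tuples are modelled as dictionaries, the match is an exact-value mapping rather than a TupleMatch, and Python exceptions stand in for Cedar SIGNALs, which are a different mechanism.

```python
class NotFound(Exception):
    """No tuple matched (stands in for the NotFound SIGNAL)."""


class MultipleMatch(Exception):
    """More than one tuple matched (stands in for the MultipleMatch SIGNAL)."""


def fetch_tuple(tuple_set, match):
    """Return the single tuple whose fields equal those in `match`."""
    found = [t for t in tuple_set if all(t.get(a) == v for a, v in match.items())]
    if not found:
        raise NotFound(match)
    if len(found) > 1:
        raise MultipleMatch(match)
    return found[0]
```

Raising on both the empty and the ambiguous case lets a client treat a successful return as proof that the name really is unique.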

In some applications it is useful to perform the following query: "given tuple t, find all tupleRef attributes a such that there exists a tuple ta with ta.a = t, i.e. ta's field a references t". This is especially useful when a database makes use of the fact that a tupleRef is not constrained to reference a tuple from any particular tuple set. To support such network-style databases, DBTuples includes this operation:

    GetAllRefAttributes: PROCEDURE [t: Tuple] RETURNS [al: AttributeList];

4.1.3 Data Definition; Starting and Stopping

The client's data schema, i.e. the tuple sets and their attributes, is described in the database itself. Tuples that represent tuple sets and tuples that represent attributes are referred to as dictionary tuples. In addition, it is possible for the client to specify what indexes (implemented as B-Trees) he would like to have maintained on tuple sets to increase the performance of the system; the indexes and the attributes that make up the index's key (termed index factors) are also represented as dictionary tuples in the client's database.

    TupleSet, Attribute, Index, IndexFactor: TYPE = Tuple;

In addition to data tuples, and the dictionary tuples that describe the data schema for data tuples, the tuple level exports system tuples that define the data schema for dictionary tuples. These are simply variables in the DBTuples interface, and their values are manufactured during system startup. There are 17 system tuples, consisting of 4 tuple sets with a total of 13 attributes. The system tuple sets are:

    TupleSetTS, AttributeTS, IndexTS, IndexFactorTS: READONLY TupleSet;

Read operations such as FetchTuple and GetF work equally well on system, dictionary, or data tuples. We do not allow the client to modify the data schema by directly modifying the data in dictionary and system tuples; instead, DBTuples includes specialized procedures for performing such updates. A fundamental difficulty with allowing clients to update even selected fields of dictionary tuples is the granularity of updates: an individual update may temporarily leave the dictionary in an inconsistent state, but the DBMS may access the dictionary before it becomes consistent. The data definition procedures include

    CreateTupleSet: PROCEDURE [name: STRING, owner, segment: STRING] RETURNS [TupleSet];
    CreateAttribute: PROCEDURE [ts: TupleSet, name: STRING, type: ValueType] RETURNS [Attribute];
    CreateIndex: PROCEDURE [ts: TupleSet, order: TupleOrder] RETURNS [Index];

Other procedures destroy tuple sets, their attributes, and indexes.

Finally, DBTuples includes procedures for starting and ending a database session, and procedures to allow updates made in a session to be committed or aborted. A commit or abort applies to all changes since the last open, commit, or abort.

    OpenDatabase: PROCEDURE [useTransID: Transaction ← NIL, userName, password, databaseName: STRING ← NIL] RETURNS [transID: Transaction, welcomeMsg: STRING];
    CloseDatabase: PROCEDURE; -- commit
    MarkTransaction: PROCEDURE; -- commit
    AbortTransaction: PROCEDURE;

All parameters of OpenDatabase may be defaulted, allowing simple Cedar programs a scratch database in which to work.
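The rule that a commit or abort applies to all changes since the last open, commit, or abort can be illustrated with a toy session object. Everything here is assumed for illustration: the `Session` class, the use of a dictionary as the database, and the snapshot-by-copy technique, which is not how the Cedar DBMS or Juniper records changes.

```python
import copy


class Session:
    """Toy session: abort restores the state as of the last open, commit, or abort."""

    def __init__(self, database=None):
        # With no argument the session works on an empty scratch database.
        self.data = {} if database is None else database
        self._saved = copy.deepcopy(self.data)

    def mark_transaction(self):
        # Commit: changes so far become the new baseline.
        self._saved = copy.deepcopy(self.data)

    def abort_transaction(self):
        # Abort: discard every change made since the baseline.
        self.data.clear()
        self.data.update(copy.deepcopy(self._saved))
```

Calling `Session()` with no arguments corresponds loosely to defaulting every parameter of OpenDatabase to obtain a scratch database.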

4.2 Distribution of Computation and Data

We use the Juniper file system to implement DBFile, the file component of the Cedar DBMS. The rest of the system runs on the client machine, i.e. on the same machine as the application program that imports DBTuples. This means that network communication takes place at the level of database page accesses. On a fast local network, the extra communications overhead is less than the cost of a disk access. Furthermore, there is the possibility of caching pages locally, since each personal computer includes a disk unit.

The alternatives to this choice would have been to run the Cedar DBMS on each Juniper server, or to provide a number of separate machines as "database servers". The first alternative is impractical because Juniper is not written in the Cedar language; the second was rejected on the grounds that performance would be unacceptable. This is largely a function of the fact that DBTuples is a tuple-at-a-time interface. A server that processes queries written in a higher-level language would be practical. A "DBTuples server" may become more attractive with the advent of a fast remote procedure call facility [Nelson, 1981]; this would also provide access to Cedar databases from non-Cedar programs.

We chose to allow a database to be structured as a collection of files, instead of making a database consist of a single file. A number of considerations entered here: performance, size, data distribution, and protection. We can get the file server to preserve the locality of files on the disk, so by placing more closely related information in a single file we can get better performance. Juniper files do not span multiple disk volumes, so in order to create a database larger than a volume we need multiple files. We provide data distribution at a low level, allowing several files, perhaps located on different file servers, to contribute to the same database. We also provide low-level protection by using Juniper's file protection.

It is worth expanding on the protection issue. Protecting parts of databases by protecting files seems like a weak idea. Much more natural protection facilities can be provided at a high level, as is done by associating protection with views in System R [Griffiths & Wade, 1976]. But our system has nothing corresponding to the RDS level of System R. Protection facilities built into DBTuples would almost certainly have to be thrown

away in building a system with a high-level data model. So we chose not to build protection in DBTuples. The main protection issue in our environment is not privacy, but inadvertent modifications, especially by undebugged programs. We want to provide natural barriers between applications, but not to make these barriers impermeable.

A segment is the database unit that lives in a single file. A tuple set (including all of its dictionary tuples) must be contained in a single segment, but a segment may contain any number of tuple sets. A database is a set of segments; a segment belongs to exactly one database.

4.3 Transaction Management and Concurrency Control

In designing the Cedar DBMS concurrency control mechanism we relied heavily on the concurrency control facilities already provided by Juniper. We wanted database transactions, like Juniper transactions, to guarantee atomicity and hence serial consistency; we did not feel it necessary to support lower degrees of consistency [Gray et al., 1976].

To first approximation, the DBMS operates as follows: it opens a Juniper transaction during OpenDatabase, performs page reads and writes as required by the client's DBTuples operations, and commits the Juniper transaction during CloseDatabase. Juniper maintains locks as described in Section 2.2, and may abort a transaction due to conflicts. It is the client's responsibility to retry an aborted transaction if desired, and also to invalidate any writes made to other media (e.g. a display terminal) based on information from the aborted transaction. Note that in this scheme, locks are held and lock processing is performed on the machine that stores files, not on the client machine that executes DBMS code.

Giving the file server all of the responsibility for locks, and thus locking physical units (pages), has some interesting consequences. For instance, it makes the published algorithms for concurrency control in B-Trees [Bayer & Scholnick, 1977] inapplicable to the Cedar DBMS; from the file server's point of view, a B-Tree page is no different from any other page. We have tried to reduce the likelihood of unnecessary lock conflicts by segregating unrelated data structures that are updated frequently, such as TupleSetTS tuples, on separate pages. Another potential locus of contention is the page allocator for a segment (i.e., extending a file). This contention can be reduced by employing separate allocators for individual logical structures within a segment (e.g. tuple sets), and having these employ the segment's allocator only when necessary.

An important benefit of using a transaction-based file server in this way is that transactions can span ordinary files as well as databases. For instance, a database representing a file directory must perform the combined operation of (1) creating a new file and then (2) making a directory entry for it; we want this to be atomic, avoiding an orphan file if the system crashes between (1) and (2). To allow this kind of application to be written, Cedar DBMS clients can perform Juniper actions under the same transaction the DBMS is using for database page transfers.
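The file-directory example can be sketched with a toy transaction that defers every action until commit. This is only a model of the atomicity property: the `Transaction` class, the dictionaries standing in for the file store and the directory database, and the buffering strategy are assumptions, not a description of how Juniper achieves atomicity.

```python
class Transaction:
    """Toy transaction that buffers actions and applies them all at commit."""

    def __init__(self, files, directory):
        self.files, self.directory = files, directory
        self._pending = []

    def create_file(self, file_id, contents=b""):
        # Step (1): an ordinary file action performed under the transaction.
        self._pending.append((self.files, file_id, contents))

    def add_directory_entry(self, name, file_id):
        # Step (2): a database update performed under the same transaction.
        self._pending.append((self.directory, name, file_id))

    def commit(self):
        for target, key, value in self._pending:
            target[key] = value
        self._pending.clear()

    def abort(self):
        # Nothing was applied, so no orphan file can be left behind.
        self._pending.clear()
```

A crash between steps (1) and (2) corresponds to `abort` in this model: neither the file nor the directory entry becomes visible.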

this situation. Should this become a problem for the Cedar DBMS we can either convince the Juniper implementors to provide exclusive-mode locks, or implement them ourselves by using Juniper files to store the lock state.

We feel that it is important to acknowledge the great amount of effort that we saved by building the Cedar DBMS on top of an existing transaction-based file server. The difficult problems in building a robust database facility in a distributed environment are associated with managing transactions, and Juniper solved these problems for us.

To obtain better performance, the Cedar DBMS actually uses an elaboration of the file-accessing method described above. First, we maintain a cache on the client machine of recently-accessed pages. This cache is implemented from virtual memory, backed by local disk storage, and provides two benefits. For cache hits on core-resident pages, the access time is obviously very fast. For cache hits that generate paging activity to the local disk, the access time is comparable to the time to access a page on a lightly-loaded Juniper server, but the local disk access generates no load on the remote server. Hence caching pages on the local disk improves the performance for the entire community of users.

The second elaboration is that for a sequence of database transactions we can checkpoint, rather than commit, at the end of each database transaction except the last. This holds read locks on all pages involved in the transaction, including locks on pages in the local cache. Hence the cache remains valid through the sequence of transactions, improving the cache's efficiency. This improvement is not entirely free, however, since the more locks a transaction holds the greater its chances are of conflict with another transaction. But Juniper's broken read lock facility gives the basis for solving this problem. The DBMS records, in the client machine, the pages that have actually been used in a database transaction (i.e. since the previous Juniper checkpoint). If a read lock for one of these pages breaks then the database transaction must abort, but other broken locks may be released without compromising the transaction's integrity.

The third elaboration involves control of when writes into the local cache are propagated to Juniper. Unless the number of writes in a transaction is extremely large, we can cache all of the pages written in the transaction on the client machine, and write these pages to Juniper immediately before the commit. Furthermore, we can write the pages back to Juniper in a standard order for all clients. This eliminates the possibility of deadlock between clients. (Of course, Juniper would break the deadlock eventually by aborting one of the participants.)

The concurrency control method we have described is an "optimistic" one [Kung & Robinson, 1979]. That is, readers almost never wait due to locking conflicts, but are sometimes aborted because data they depend upon has been updated from another transaction. This is in contrast to methods that lock objects for exclusive access in order to update.

The combination of the optimistic approach and physical locking has problems with certain patterns of transactions. One bad situation is when there is a very high traffic of short transactions trying to update the same logical entity, such as a log. In this case, some sort of logical locking seems to be required for adequate performance. Another bad situation is when there is a very long transaction that accesses many pages. If a short transaction modifies some page that has been used by the long transaction and then commits, the long transaction will be aborted. If a small transaction does this often enough it can keep the large transaction from making any progress at all. Juniper's present facilities do not include exclusive-mode locks, so it is awkward for us to cope with

5. Conclusion

The following Cedar DBMS applications have been implemented to date:

• An experimental system for the storage and incremental update of Mesa modules, which also allows efficient “browsing” of the stored information.
• The CSL Notebook, a database of informal laboratory memoranda that can be updated via the electronic mail system.
• An experimental system for maintaining and querying a database of CSL’s Cedar source and object code files.
• A prototype version of Palm, an entity-based database browser that has been used in conjunction with all of the above systems.

We feel that our strategy of building a simple system quickly has been a success; these applications were undertaken less than a year after we started to build the Cedar DBMS.

The clean separation of the DBMS from the file system has been very convenient, since by reimplementing DBFile we can run the Cedar DBMS at sites that do not have access to a Juniper file server. We have implementations of DBFile for three different file systems.

We are presently considering a number of specific improvements to the lower levels of the system, and are also pondering designs for a high-level system and greater integration of the DBMS with the Cedar language.


Acknowledgements: Peter Deutsch and Jerry Popek were members of the initial Cedar DBMS design group. Jim Saxe, John Ellis, Ray Cheng, and Eric Schmidt built the first Cedar DBMS-based application programs. Howard Sturgis, Jay Israel, Karen Kolling, and Jim Mitchell developed the Juniper file server.

References

[Astrahan et al., 1976] M. M. Astrahan et al., “System R: A Relational Approach to Database Management,” ACM Trans. on Database Systems 1, 2 (June 1976), pp. 97-137.

[Bayer & Schkolnick, 1977] Rudolf Bayer and Mario Schkolnick, “Concurrency of Operations on B-Trees,” Acta Informatica 9, 1 (1977), pp. 1-21.

[Boggs et al., 1979] David R. Boggs, John F. Shoch, Edward A. Taft, and Robert M. Metcalfe, “Pup: An Internetwork Architecture,” Xerox PARC technical report CSL-79-10, July 1979.


[Deutsch & Taft, 1980] L. Peter Deutsch and Edward A. Taft, “Requirements for an Experimental Programming Environment,” Xerox PARC technical report CSL-80-10, June 1980.

[Gray et al., 1976] J. N. Gray, R. A. Lorie, G. F. Putzolu, and I. L. Traiger, “Granularity of Locks and Degrees of Consistency in a Shared Data Base,” Modeling in Data Base Management Systems, G. M. Nijssen editor, North Holland, 1976, pp. 365-394.

[Griffiths & Wade, 1976] Patricia P. Griffiths and Bradford W. Wade, “An Authorization Mechanism for a Relational Database System,” ACM Trans. on Database Systems 1, 3 (September 1976), pp. 242-255.

[Israel et al., 1978] Jay E. Israel, James G. Mitchell, and Howard E. Sturgis, “Separating Data From Function in a Distributed File System,” Xerox PARC technical report CSL-78-5, September 1978.

[Kung & Robinson, 1979] H. T. Kung and John T. Robinson, “Optimistic Methods for Concurrency Control,” ACM Trans. on Database Systems (to appear.)

[Lampson & Sturgis, 1981] Butler W. Lampson and Howard E. Sturgis, “Crash Recovery in a Distributed Data Storage System,” Comm. ACM (to appear.)

[Lampson et al., 1981] Butler W. Lampson, Kenneth A. Pier, Gene A. McDaniel, Severo M. Ornstein, and Douglas W. Clark, “The Dorado: A High-Performance Personal Computer; Three Papers,” Xerox PARC technical report CSL-81-1, January 1981.

[Lampson & Redell, 1980] Butler W. Lampson and David D. Redell, “Experience with Processes and Monitors in Mesa,” Comm. ACM 23, 2 (February 1980), pp. 105-117.

[Metcalfe & Boggs, 1976] Robert M. Metcalfe and David R. Boggs, “Ethernet: Packet Switching for Local Computer Networks,” Comm. ACM 19, 7 (July 1976), pp.


[Mitchell et al., 1979] James G. Mitchell, William Maybury, and Richard Sweet, “Mesa Language Manual,” Xerox PARC technical report CSL-79-3, April 1979.

[Nelson, 1981] Bruce J. Nelson, “Remote Procedure Call,” Ph.D. Thesis, Carnegie-Mellon University (to appear.)

[Thacker et al., 1979] Charles P. Thacker, Edward M. McCreight, Butler W. Lampson, Robert F. Sproull, and David R. Boggs, “Alto: A Personal Computer,” Xerox PARC technical report CSL-79-11, August 1979.
