Access Method Concurrency with Recovery

David Lomet
Digital Equipment Corp.
Cambridge Research Lab
One Kendall Sq., Bldg. 700
Cambridge, MA 02139

Betty Salzberg*
College of Computer Science
Northeastern University
Boston, MA.

Abstract

Providing high concurrency in B+-trees has been studied extensively. But few efforts have been documented for combining concurrency methods with a recovery scheme that preserves well-formed trees across system crashes. We describe an approach for this that works for a class of index trees that is a generalization of the Blink-tree. A major feature of our method is that it works with a range of different recovery methods. It achieves this by decomposing structure changes in an index tree into a sequence of atomic actions, each one leaving the tree well-formed and each working on a separate level of the tree. All atomic actions on levels of the tree above the leaf level are independent of database transactions, and so are of short duration.

* This work was partially supported by NSF grant IRI-88-15707 and IRI-91-02821.

Permission to copy without fee all or part of this material is granted provided that the copies are not made or distributed for direct commercial advantage, the ACM copyright notice and the title of the publication and its date appear, and notice is given that copying is by permission of the Association for Computing Machinery. To copy otherwise, or to republish, requires a fee and/or specific permission. 1992 ACM SIGMOD - 6/92/CA, USA. © 1992 ACM 0-89791-522-4/92/0005/0351...$1.50

1 Introduction

The subject of concurrency in B+-trees has a long history [1, 6, 14, 16, 17, 18]. Most work, with the exception of [5, 14], has not treated the problem of system crashes during structure changes. In this paper, we show how to manage concurrency and recovery for a wide class of index tree structures, single attribute, multiattribute, and versioned. Further, our approach works for a large class of recovery methods. The four innovations that make it possible for us to provide high concurrency for index trees are the following:

1. We define a search structure, called a Π-tree, that is a generalization of the Blink-tree [6]. Our concurrency and recovery method is defined to work with all search structures in this class. Recent work [19] suggests that approaches based on Blink-trees should have higher concurrency than other proposed methods. Thus, our technique has both good performance and very broad applicability.

2. Π-tree structure changes consist of a sequence of atomic actions [7]. These separate actions are serializable and are guaranteed to have the all or nothing property by the recovery method. Searchers can see the intermediate states of the Π-tree that exist between these atomic actions. Hence, complete structural changes are not serializable. By contrast, in ARIES/IM [14] complete structural changes are serial.

3. We define separate actions for performing updates at each level of the tree. Update actions on non-leaf nodes are separate from any transaction whose update triggers a structure change. Only node-splitting at the leaves of a tree may need to be within an updating transaction. Even in this case, only for page-oriented UNDO recovery systems do locks on the split nodes need to be kept to the end of the transaction. This is unlike [14] where entire structure changes are subtransactions of database transactions and where non-page-oriented UNDO recovery must be supported.

4. When a system crash occurs during the sequence of atomic actions that constitutes a complete Π-tree structure change, crash recovery takes no special measures. A crash may cause an intermediate state to persist for some time. The structure change is completed when the intermediate state is detected during normal subsequent processing by scheduling a completing atomic action. The state is tested again in the completing atomic action to ensure the idempotence of completion.

The rest of this paper is organized in the following way. Section 2 defines the Π-tree. Section 3 describes the operations on Π-trees. In section 4, atomic actions are described in general. Section 5 describes how Π-tree structure changes are decomposed into atomic actions, and how to cope with such decomposed changes. Finally, section 6 is a short discussion of what we have accomplished.

2 The Π-tree

2.1 Structural Description

Informally, a Π-tree is a balanced tree, and we measure the level of a node by the number of child edges on any path between the node and a leaf node. More precisely, however, a Π-tree is a rooted DAG because, like the Blink-tree, nodes have edges to sibling nodes as well as child nodes. All these terms are defined below.

2.1.1 Within One Level

Each node is responsible for a specific part of the key space, and it retains that responsibility for as long as it is allocated. A node can meet its space responsibility in two ways. It can directly contain entries (data or index terms) for the space. Alternatively, it can delegate responsibility for part of the space to a sibling node. A node delegates space to a new sibling node during a node split. A sibling term describes a key space for which a sibling node is responsible and includes a side pointer to the sibling. A node containing a sibling term is called the containing node and the sibling node to which it refers is called the contained node. Any node except the root can contain sibling terms to contained nodes.

A level of the Π-tree is a maximal connected subgraph of nodes and side pointer edges. Each level of the Π-tree is responsible for the entire key space. The first node at each level is responsible for the whole space, i.e. it is the containing node for the whole key space. Each level describes a partition of the space into subspaces directly contained by nodes of that level. This gives the Π-tree its name.

2.1.2 Multiple Levels

The Π-tree is split from the bottom, like the B-tree. Data nodes (leaves) are at level 0. Data nodes contain only data records and/or sibling terms. As the Π-tree grows in height via splitting of a root, new levels are formed.

A split is normally described by an index term. Each index term, when posted, includes a child pointer to a child node and a description of a key space for which the child node is responsible. A node containing the index term for a child node is called a parent node. Parents are at a level one higher than their children. In Π-trees, as in Blink-trees, parent nodes are index nodes which contain only index terms and/or sibling terms.

Side pointers and child pointers must refer to nodes which are responsible for spaces that contain the indicated subspaces. A pointer can never refer to a de-allocated node. Further, an index node must contain index terms that refer to nodes that are responsible for spaces, the union of which contains the subspace directly contained by the index node. However, each node at a level need not have a parent node at the next higher level. This is an abstraction and generalization of the idea introduced in the Blink-tree [6]. That is, having a new node connected in the Blink-tree only via a side pointer is acceptable.

2.1.3 Well-formed Π-trees

Like [18], we define the requirements of a well-formed general search structure. Thus, a Π-tree is well-formed if

1. each node is responsible for a subspace of the search space;

2. each sibling term describes a subspace of its containing node for which its referenced node is responsible;

3. each index term describes a subspace of the index node for which its referenced node is responsible;

4. the union of the spaces described by the index terms and the sibling terms contains the space an index node is responsible for;

5. the lowest level nodes are data nodes;

6. a root exists that is responsible for the entire search space.

The well-formedness description above defines a correct search structure. All structure changing atomic actions must preserve this well-formedness. We will need additional constraints on structure changing actions to facilitate node consolidation (deletion).
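The six well-formedness conditions can be made concrete with a small check. The following is only a sketch under simplifying assumptions that are not part of the paper: the key space is one-dimensional, every space is a half-open interval, and each node has at most one sibling term (as in a Blink-tree). The names `Node`, `within`, `covers`, and `well_formed` are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

Interval = Tuple[float, float]  # half-open key interval [lo, hi)


@dataclass
class Node:
    space: Interval                  # space the node is responsible for
    level: int                       # 0 for data nodes
    sibling_term: Optional[Tuple[Interval, "Node"]] = None
    index_terms: List[Tuple[Interval, "Node"]] = field(default_factory=list)


def within(inner: Interval, outer: Interval) -> bool:
    """True if inner is a subspace of outer."""
    return outer[0] <= inner[0] and inner[1] <= outer[1]


def covers(pieces: List[Interval], target: Interval) -> bool:
    """True if the union of the pieces contains the target interval."""
    reach = target[0]
    for lo, hi in sorted(pieces):
        if lo > reach:
            return False             # a gap remains inside the target
        reach = max(reach, hi)
        if reach >= target[1]:
            return True
    return reach >= target[1]


def well_formed(root: Node, universe: Interval) -> bool:
    if root.space != universe:                        # condition 6
        return False
    stack, seen = [root], set()
    while stack:
        node = stack.pop()
        if id(node) in seen:
            continue
        seen.add(id(node))
        if not within(node.space, universe):          # condition 1
            return False
        described = []
        if node.sibling_term is not None:             # condition 2
            space, sib = node.sibling_term
            if not (within(space, node.space) and within(space, sib.space)):
                return False
            described.append(space)
            stack.append(sib)
        if node.level == 0:                           # condition 5
            if node.index_terms:
                return False
            continue
        for space, child in node.index_terms:         # condition 3
            if not (within(space, node.space) and within(space, child.space)):
                return False
            if child.level != node.level - 1:
                return False
            described.append(space)
            stack.append(child)
        if not covers(described, node.space):         # condition 4
            return False
    return True
```

Note that a tree in which a data node has split but the index term for the new sibling has not yet been posted still passes this check, because the original node remains responsible for the space it delegated.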

2.2 Applicability to Various Search Structures

There are a number of search trees which can be described as Π-trees, and hence exploit our concurrency control and recovery method.

2.2.1 Blink-trees

In a Blink-tree, an index term (respectively sibling term) is represented by a key value and node pointer. It denotes that the child node (respectively sibling node) referenced is responsible for the entire space greater than or equal to the key. The directly contained space of a Blink-tree node is the space it is responsible for minus the space it has delegated to its (one) sibling.

2.2.2 The TSB-tree

A TSB-tree [10] provides indexed access to multiple versions of key sequenced records. As a result, it indexes these records both by key and by time, and we can split by either of these attributes. Historical nodes (nodes created by a split in the time dimension) never split again. History sibling pointers permit the current node directly containing a key space to access history nodes that contain previous versions of records in that space.

In Figure 1, the region covered by a current node after a number of splits is in the lower right hand corner of the key space it started with. A time split produces a new (historical) node with the original node directly containing the more recent time. A history sibling pointer in the current node refers to the history node. The new history node contains a copy of the prior history sibling pointer. A key split produces a new (current) node with the original node directly containing the lower part of the key space. A key sibling pointer in the current node refers to the new current node containing the higher part of the key space. The new node will contain a copy of the history sibling pointer. It makes the new current node responsible for not merely its current key space, but for the entire history of this key space.

Figure 1: In the Time-Split B-tree, new current nodes contain copies of old history node pointers. New historic nodes contain copies of old history pointers and old key pointers. Current nodes are responsible for all previous time through their historical pointers and all higher key ranges through their key (side) pointers.

2.2.3 The hB-Tree

In the hB-tree [11], the idea of containing and contained nodes is explicit and they are described with kd-tree fragments. The "External" markers of [11] can be replaced with the addresses of the nodes which were extracted, and a linking network established with the desired properties. In addition, when the split is by a hyperplane, instead of eliminating the root of the local tree in the splitting node, as in [11], one child of the root (say the right child) points to the new sibling containing the contents of the right subtree. This makes the treatment of hyperplane splits consistent with that of other splits. This is illustrated in Figure 2. A complete description and explanation of hB-tree concurrency, node splitting, and node consolidation is given in [3].

3 Π-tree Operations

Here we describe the operations on Π-trees in a very general way. The steps do not describe how to deal with concurrent operations, with failures, or how to decompose structure changes into atomic actions. These are described in later sections.

3.1 Searching

Searches start at the root of the Π-tree. The root is an index node that directly contains the entire search space. In an index node whose directly contained space includes a search point, an index term must exist that references a child node that is responsible for the space that contains the search point. There may be several such child nodes. However, it is desirable to follow the child pointer to the node that directly contains the search point.


This avoids subsequent sibling traversals at the next lower level. Because the posting of index terms can be delayed, we can only calculate the space approximately contained by a child with respect to a given parent. This is the difference between that part of the space of the parent node the child is responsible for and the subspaces that it has delegated to other child nodes referenced by index terms that are present in the index node.

We proceed to the child that approximately contains the search point. This node will usually, but not always, contain the search point. If the directly contained space of a node does not include the search point, a side pointer is followed to the sibling node that has been delegated the subspace containing the search point. Eventually, a sibling is found whose directly contained space includes the search point. The search continues until the data node level of the tree is reached. The record for the search point will be present in the data node whose directly contained space includes the search point, if it exists at all.

Figure 2: An hB-tree index showing the use of k-d trees for sibling terms. External markers (showing what spaces have been removed in creating "holes") have been replaced with sibling pointers. Panel annotations: "Use a k-d tree as a sibling term to indicate the borders of B."; "2. Split A again at x=7."; "When using the local tree for a split, replace external markers with addresses of new siblings."

3.2 Node Splitting

3.2.1 Node Splitting Steps

A node split has the following steps:

1. Allocate space for the new node.

2. Partition the subspace directly contained by the original node into two parts. The original node continues to directly contain one part. The other part is delegated to the new sibling node.

3. If the node being split is a data node, place in the sibling node all of the original node's data that are contained in the delegated space. Include any sibling terms to subspaces for which the new node is now responsible. Remove from the original node all the data that it no longer directly contains. This partitions the data. (What is dealt with here is point data.)

4. If the node being split is an index node, include in each node the index terms that refer to child nodes whose approximately contained spaces intersect the space directly contained by the node.

5. Put a sibling term in the original node that refers to the new node.

6. Schedule the posting of an index term describing the split to the next higher level of the tree. The index term describes the new sibling and the space for which it is responsible. Posting occurs in a separate atomic action from the action that performs the split.

Example: In a Blink-tree, to perform a node split, all records ("records" may be index entries in index nodes or data records in data nodes) whose keys are ordered after the middle record's key are copied from the original node to the new node. The new node has been delegated the high order key subspace. The link (sibling term) is copied from the old node to the new node. Then the copied records are removed from the old node and the link in the old node is replaced with a new sibling term (address of the new node and the split key value).

3.2.2 Clipping

When an index node is split, it is simplest, if possible, to delegate to the new sibling a space which is the union of the approximately contained spaces of a subset of child nodes. However, it can be difficult to split an index node in a multi-attribute index tree in this way because either the space partitioning is too complex, resulting in very large index and sibling terms, or because the division between original and new sibling nodes is too unbalanced, reducing storage utilization. In this case, a child term whose approximately contained space intersects the new sibling's space as well as the remaining directly contained space in the old sibling is said to be clipped. The child term is placed in both parents.
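Clipping can be illustrated with a two-dimensional sketch. The assumptions here are not from the paper: approximately contained spaces are axis-aligned rectangles, the index node is split by the hyperplane x = `x_split`, and the original node keeps x < `x_split` while the new sibling is delegated x >= `x_split`. The function name `split_index_terms` and the dictionary representation are hypothetical.

```python
from typing import Dict, List, Tuple

# (x_lo, x_hi, y_lo, y_hi); intervals are half-open, so x_hi itself is excluded
Rect = Tuple[float, float, float, float]


def split_index_terms(terms: Dict[str, Rect], x_split: float):
    """Distribute child terms when an index node is split at x = x_split."""
    stay: Dict[str, Rect] = {}    # terms kept in the original index node
    move: Dict[str, Rect] = {}    # terms placed in the new sibling
    clipped: List[str] = []       # children whose term lands in both nodes
    for child, rect in terms.items():
        x_lo, x_hi, _, _ = rect
        in_old = x_lo < x_split   # intersects the space the original keeps
        in_new = x_hi > x_split   # intersects the space delegated to the sibling
        if in_old:
            stay[child] = rect
        if in_new:
            move[child] = rect
        if in_old and in_new:
            clipped.append(child)
    return stay, move, clipped
```

A child whose rectangle straddles the splitting hyperplane ends up with two parents, which is exactly the situation in which later postings for that child may have to reach more than one index node.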

When a child term is clipped, posting index terms describing the subsequent splitting of this child may involve the updating of several parent index nodes. We choose to post to each parent in a separate atomic action. When the split occurs, we post only to the parent that is on the current search path. Other parents can be updated when they are on a search path that results in a sibling traversal to the new node. This exploits a mechanism that is already present to cope with system failures in the midst of Π-tree structure changes. Subsequently, when we refer to "the parent", we intend this to denote the parent that is on the current search path.

3.3 Node Consolidation

When a node becomes under-utilized, it may be possible to consolidate it with either its containing node or one of its contained nodes. We always move the node contents from contained node to containing node, regardless of which is the underutilized node. Then the index term for the contained node is deleted and the contained node is de-allocated.

For this to work well, both the container and the contained node must be referenced by index terms in the same parent node, and the contained node must only be referenced by this parent. These conditions mean that only the single parent node of the contained node need be updated during a consolidation. This node will also be a parent of the container. This is important as a node cannot be deleted until all references to it have been purged. (This complication arises only in multiattribute Π-trees; Blink-tree nodes never have multiple parents.)

There is a difficulty with the above constraints. Whether a node is referenced by more than one parent is not derivable from the index term information we have described thus far. However, multiparent nodes are only formed when (1) an index node (the parent) splits, clipping one or more of its index terms, or (2) when a child with more than one parent is split, possibly requiring posting in more than one place. We mark these clipped index terms as referring to multi-parent nodes.

4 Atomic Actions for Update

Atomic actions must be correctly serialized and must have the all or nothing property. All atomic actions must leave the tree well-formed. We must ensure that interactions between atomic actions do not cause undetected deadlocks. How this is done is described in this section.

4.1 Latching for Atomic Actions

4.1.1 Latch Deadlock Avoidance

Latches can be used for concurrency control involving atomic actions that change an index tree above the leaf level. Latches are semaphores for which the holder's usage pattern guarantees the absence of deadlock. Latches normally come in two modes, share (S) mode which permits others with S latches to access the latched data simultaneously, and exclusive (X) mode which prevents all other access so that update can be performed. Latches do not involve the database lock manager and latches do not conflict with standard database locks.

Deadlock is avoided by PREVENTING cycles in a "potential delay" graph [8]. If resources are ordered and latched in that order, the potential delay graph can be kept cycle-free without materializing it. Since a Π-tree is usually accessed in search order, we can order parent nodes prior to their children and containing nodes prior to the contained nodes referenced via their side pointers. Space management information can be ordered last.

When arbitrary atomic actions are possible, two phase locking is used to ensure serializability [2]. However, when dealing with index trees, latches can sometimes be released early without violating correctness. This occurs in traversing tree nodes during a search.

Promoting a previously acquired latch violates the ordering of resources and compromises deadlock avoidance. Promotion is the most common cause of deadlock [5]. For example, when two transactions set S latches on the same object to be updated, and then subsequently desire to promote to X, a deadlock results.

Update (U) mode [4] supports promotion. It allows sharing by readers, but conflicts with X and other U modes. An atomic action is not allowed to promote from a S to an X latch. But it may promote from U to X mode. However, a U latch may only be safely promoted to X under restricted circumstances. The rule that we observe is that the promotion request is not made while the requester holds latches on higher ordered resources. Whenever a node might be written, a U latch is used.
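The latch modes and the promotion rule can be summarized in a few lines. This is a sketch only, with assumptions that go beyond the text: resources carry an integer rank reflecting the latching order (parents before children, containers before contained nodes), a larger rank means a "higher ordered" resource, and the names `COMPATIBLE`, `compatible`, and `may_request_promotion` are hypothetical.

```python
from typing import Dict

# Pairs (mode already held by another action, mode being requested) that can coexist.
# S shares with S; U allows sharing by readers; X excludes everything;
# U conflicts with X and with other U latches.
COMPATIBLE = {("S", "S"), ("S", "U"), ("U", "S")}


def compatible(held: str, requested: str) -> bool:
    """True if a latch in mode `requested` can be granted alongside `held`."""
    return (held, requested) in COMPATIBLE


def may_request_promotion(held: Dict[int, str], rank: int) -> bool:
    """Decide whether one atomic action may ask to promote its latch on `rank` to X.

    `held` maps the rank of each resource the action has latched to its mode.
    Promotion is allowed only from U mode, and only while the action holds
    no latch on a higher ordered (larger rank) resource.
    """
    if held.get(rank) != "U":
        return False                       # S may never be promoted to X
    return all(other <= rank for other in held)
```

For example, an action holding an S latch on a parent (rank 1) and a U latch on a child (rank 2) may request promotion of the child latch, whereas an action holding a U latch on the parent and any latch on the child may not request promotion of the parent latch until the child latch is released.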

operations

(together

mute with

the structure

When Avoiding

a structure

dent atomic

Deadlocks

with

their

inverses)

change is part

action,

that

of an indepen-

the locks needed for the struc-

ture change are two phased but only persist There are two situations where an index tree atomic action may interact with database transactions and also require locks. These are (1) normal accessing of

in an independent

a database

T, whose update

and

record for fetch, insert,

(2) moving

consolidate Should

data

records,

delete, or update

whet her to split

or

nodes. holders

of database

locks be required

to

duration like this.

atomic

yet updated

any record

the split,

the split

can be performed

independent

of and before

blocked

during

course,

the structure

volving only latches is possible. deadlocks, we observe the:

in-

To avoid latch-lock

Rule [14]: An action does not wait . No Wait for database locks while holding a latch that can conflict with a holder of a database lock.

For our II-tree

operations,

on data nodes whenever However,

latches

locks.

nodes may be retained.

Except for data node consolidation, no atomic (i) holds both: action or database transaction database locks; and (ii) uses other than S latches above the data node level. S latches on index nodes never conflict with database transactions, only with index

change

at omit

date, these actions consolidate

actions.

node to be updated Hence, with

its holding

another

while

it holds

of this

consolidate

a U-latch

Page-oriented

4.2.1 Data

U-latch

database

splitting

locks

for

locks.

And

database cannot

locks. conflict

action)

of the consolidate

that op-

Updates and

some

consolidation (but

not

all)

require recovery

protocols. For example, if undos of updates on database records must take place on the same page (leaf node) as the original update, (page-oriented UNDO) the records cannot be moved until their updating transaction dates can be permitted

commits or aborts. No upon records moved by uncom-

mitted structure changes, since undoing the move would cause those records to move, Finally, no update can be permitted that makes the undoing of

Then,

this independent

updates

action.

change will

that

change are only Further,

of

not be undone

T aborts. Other data node splits in page-oriented tems must be done within an updating

if

undo sysdatabase

transaction. In this case, the database locks are held to the end of transaction and the structure change must be undone 4.2.2

Move

if the transaction

aborts.

Locks

For page-oriented undo, a move that conflicts with non-commutative

lock is required updates. The

move lock causes the structure change operation to wait until all transactions that are updating records to be moved have completed. Further, it blocks updating transactions from changing records moved until

the moving

transaction

the move lock keeps updates that

would

prevent

reads do not require

completes.

the undoing undo,

Finally,

from consuming of the move.

concurrent

space Since

reads can be

tolerated. Hence, move locks are compatible share mode locks.

with

split is possible if T aborts. For the same reason, any other transaction which traverses the sibling pointer created by T’s split may not post the index term until T commits. Therefore a move lock must be distinguished from a share lock. A transaction encountering a move lock on a sibling traversal does not schedule an index posting. A move lock can be realized with a set of individual record locks, a page-level lock, a key-range lock, or even a lock on the whole relation. This depends on the implementation specifics. If the move lock is implemented using a lock whose granule is a

the move impossible. Such updates are those that consume space in the node that is needed in order

ity can alter

to consolidate

sufficient.

nodes split

T.

the structure

by

in an action

When data node splitting occurs in a system with page-oriented undo, the move lock must be held to the end of the transaction T that does the splitting. The posting of the index term for splits cannot occur until and unless T commits, so that undo of the

UNDO

Non-commutative node

for consolion the index

(or any other

holds database locks. Details eration can be found in [12].

4.2

Except

never hold database

never requests

with

to be moved

we must release latches

we wait for database

on index

If a transaction,

the need for a node split,

has not

undetected

no deadlock

action.

triggers

do not commute

even though

for the

of this action. All node consolidation is Some data node splitting can also be done

wait for latches on data nodes, this wait is not known to the lock manager and can result in an deadlock

com-

change can be permitted.

by a transaction.

Only

356

node size or larger,

once granted,

the locking

required.

no update

activ-

This one lock is

data into a node can result

Should the move lock be realized as a set of record locks, the need to wait for one of these locks

one atomic

means that

rectly

released.

the latch on the splitting This

permits

changes

node must be

to the node

that

can result in the need for additional records to be locked. Since the space involved—one node—is limited, the frequency of this problem should be low. The

node is re-latched

(records

inserted

and examined

or deleted).

ble are no change, different that

Providing

We want

ery methods.

to index

write-ahead

is used. The WAL

nodes and consolidating

what

of recov-

our approach

without

logging protocol

specifying

(the WAL

pro-

user commitment

promises.

be “relatively”

to

must be durable

prior

Atomic

actions

That

is, they

durable.

to the commitment

There

is a window

of trans-

There track”

are at

tion. tion

might

depend

is the first

one

on the results

of the atomic

ac-

This optimization which

might

assumes that

depend

any transac-

on these results

uses the

same log. an

Atomic

Action

Atomic actions must complete or partial executions must be rolled back, Hence, the recovery manager needs to know about atomic actions. Three possible ways of identifying an atomic action to the recovery manager

are as (i) a separate

database

(ii) a special system transaction,

transaction,

or (iii) as a “nested

top level action” [13]. Our approach works with any of these techniques, or any other that guarantees

s

Structure

these atomic

ac-

reasons

why

we “lose

a structure actions have

but not all. of an index term

to a single parent. Structure

changes

action try

are detected

by a tree traversal

that

At this time,

to post the index

sals may follow

as being

includes

we schedule

term.

term

node consolidation term.

a

an atomic

Several tree traver-

the same side pointer,

to post the index

incom-

following

multiple

and hence

times.

A sub-

may have removed

the

These are acceptable

Before because the state of the tree is testable. posting the index term, we test that the posting has not already been done and still needs to be done. Node consolidations are scheduled when encountering underutilized nodes. As with node splitting, the II-tree solidation

state is tested to make sure that the conis only performed once, and only when

appropriate.

5.2

atomicity.

a node splits

changes need completion.

2. We only schedule the posting

sequent

Identifying

the time

Between

two

need to post the index 4.3.2

in a

and the index term describing

least

been executed,

side pointer.

that

data

terms,

Changes

1. A system crash may interrupt change after some of its atomic

forcing

transaction

between

action

of which structure

plete

This

index

Structure

in another.

actions that use their results. Thus, it is not necessary to force a ‘(commit” record to the log when an atomic action completes. This “commit” record can be written when the next transaction commits, the log.

makes

moving

tions, a II-tree is said to be in an intermediate state. We try to complete structure changes because intermediate states may result in non-optimal search or storage utilization.

ac-

undo, prior

making changes in the stable database. Our atomic actions are not user visible and do not involve

Completing

it is posted

ensures that

their

5.1

in one atomic

are satisfied.

tions are logged so as to permit

need only

atomic

A node consolidation

between

Logging

tocol)

one level at a time.

or even

tree concurrency

Thus, we indicate

We assume that

node) and occurs in a different

action. Posting an index term can also result in a node split. Thus, node splitting changes a II-tree

single atomic action, Consolidation of index terms can lead to further node consolidation, escalating

a large number

how these requirements

4.3.1

the splitting

to old and

(of the parent of

tree changes to the next level, like splitting.

requires from a recovery method, exactly

and refering

new nodes, is a second node update

at two levels of the II-tree,

required.

to work with

the split,

This is

of index terms cor-

changes

Atomicity

our approach

and recovery

describing

in a node split.

The posting

possi-

The outcomes locks required,

the move is no longer

4.3

for changes

action.

Exploiting

Saved

State

Exploiting saved information is an important aspect of efficient index tree structure changes. The bad news of independence is that information about

Changes

A II-tree structure change is decomposed into a sequence of atomic actions. An update or insert of

357

the II-tree acquired by early atomic actions of the structure change may have changed and so cannot be trusted by later atomic actions. The II-tree may have been altered in the interim. Thus, saved information needs to be verified before it is used.

The information that we save consists of search key, nodes traversed on the path from root to data node containing the search key, and the location of the relevant index terms within those nodes. This information can permit us to locate nodes to be restructured without a second search of the nodes on the path, and to find a location within an index node where a new index term is to be inserted or an old one deleted.
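The saved information described above (search key, nodes on the path, and index term locations) can be pictured as a small record. The following is a minimal sketch, not the authors' implementation: the names `Node`, `PathEntry`, `SavedPath`, `remember` and `stale_entries` are hypothetical, and the per-node counter `state_id` that is bumped on every update is an assumption made here so that a later atomic action can tell whether a remembered node has changed. The sketch also assumes that nodes are never removed.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Node:
    """Toy index node. `state_id` changes on every update (assumption of this sketch)."""
    node_id: int
    keys: List[int] = field(default_factory=list)
    state_id: int = 0

    def insert_key(self, key: int) -> int:
        """Insert `key` in sorted position, bump the state id, return the slot used."""
        slot = 0
        while slot < len(self.keys) and self.keys[slot] < key:
            slot += 1
        self.keys.insert(slot, key)
        self.state_id += 1
        return slot


@dataclass(frozen=True)
class PathEntry:
    """What one traversal step remembers about a node."""
    node_id: int
    state_id: int  # state id observed when the node was visited
    slot: int      # location of the relevant index term within the node


@dataclass
class SavedPath:
    """Search key plus the remembered nodes, root first and data node last."""
    search_key: int
    entries: List[PathEntry]


def remember(node: Node, slot: int) -> PathEntry:
    """Record the node's identity, current state id, and the slot of interest."""
    return PathEntry(node.node_id, node.state_id, slot)


def stale_entries(path: SavedPath, nodes: Dict[int, Node]) -> List[PathEntry]:
    """Return remembered entries whose node was updated after the traversal.

    Assumes every remembered node still exists in `nodes`.
    """
    return [e for e in path.entries if nodes[e.node_id].state_id != e.state_id]
```

An entry that is not reported as stale can be reused directly, including its remembered slot; a stale entry means that node has to be searched again before its saved slot is trusted.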

To verify saved information, we use state identifiers [9] within nodes to indicate the states of each node. We record these identifiers as part of our saved path. The basic idea is that if a node and its state id (stored in the node) equal saved node and state id, then there have not been any updates to the saved node since the previous traversal. (Log sequence numbers are used for state identifiers in many commercial systems.)

Whether node consolidation is possible has a major impact on how we handle this information. The possibility of removal of a node from the structure affects the extent to which saved information can be trusted.

5.2.1 No Consolidate Case

Consolidation Not Supported [CNS] Invariant: A node, once responsible for a key subspace, is always responsible for the subspace.

CNS has three effects on our tree operations.

1. During a tree traversal, an index node is searched for an index or sibling term for the pointer to the next node to be searched. We need not hold latches so as to ensure the pointer's continued validity. The latch on an index node can be released after a search and prior to latching a child or sibling node. Only one latch at a time is held during a traversal.

2. When posting an index term in a parent node, it is not necessary to verify the existence of the nodes resulting from the split. These nodes are immortal and remain responsible for the key space assigned to them during the split.

3. During a node split, the parent index node to be updated is either the one remembered from the original traversal (the usual case) or a node that can be reached by following sibling pointers. Thus "re-traversals" to find a parent always start with the remembered parent.

5.2.2 Consolidate Case

Consolidation Possible [CP] Invariant: A node, once responsible for a key subspace, remains responsible for the subspace only until it is de-allocated.

De-allocated nodes are not responsible for any key subspace. When re-allocated, they may be used in any way, including being assigned responsibility for different key subspaces, or being used in other indexes. This affects the "validity" of remembered state. Saved path information needs to be verified before being trusted.

The effect CP has on the tree operations is as follows:

1. During a tree traversal, latch coupling is used to ensure that a node referenced via a pointer is not freed before the pointer de-referencing is completed. The latch on the referenced node is acquired prior to the release of the latch on the referencing node. Thus, two latches need to be held simultaneously during a traversal.

2. When posting an index term in a parent node, we must verify that the node produced by the split continues to exist. Thus, in the atomic operation that posts the index term, we also verify that the node that it describes exists by continuing our traversal down to this node.

3. During a node split, the remembered parent node to be updated may have been de-allocated. How to deal with this contingency depends upon how node de-allocation is treated. There are two strategies for handling node de-allocation.

(a) De-allocation is NOT a Node Update: A node's state identifier is unchanged by de-allocation. It is impossible to determine by state identifier examination if a node has been de-allocated. However, we ensure that the root does not move and is never de-allocated. Then, any node reachable from the root via a tree traversal is guaranteed to be allocated. Thus, tree re-traversals start at the root. A node on the path is accessed and latched using latch coupling, just as in the original traversal. Typically, a path re-traversal is limited to re-latching path nodes and comparing new state ids with remembered state ids, which will usually be equal.

(b)

De-allocation is a Node Update: Node de-allocation changes not only space

node whose index

management information, but also the node’s state identifier to indicate that de-

not

allocation

NODE.)

has taken place.

the posting

This

of a log record

an additional

disk access.

remembered parent always be allocated has not changed

requires

3, Space Test: accommodate

the

can-

is held

on

and re-traversals

can be-

Test NODE

for sufficient

space to

the update.

If sufficient,

proceed

to Update

Node.

Otherwise,

split

NODE:

The

space manage-

ment information is X latched and a new node is allocated. The key space directly contained

leasing latches until a node with an unchanged state id is found or the root is

by the current node is divided, such that the new node becomes responsible for a subspace

encountered.

A path

at this node.

Since node de-allocation

A

re-traversal

begins

of the key space. A sibling

is

of the tree are usu-

to NODE

Change

node

consolidation

Action

is possible

allocation

is not a node update.

the index

term

posting

operation

node, and the KEY

for

are: The LEVEL

for the Blink-tree

be posted, of the new performs

The search starts with

the state identifiers

is the root, a second node is allocated. contents

from

the root

S-latches.

that

us-

are removed

from

the root

are unchanged,

NODE.

This

the remem-

Repeat

is reached, U-latches

traversing

side pointers

is U-latched.

The parent

other

are used, pos-

until

of NODE

been posted, the action wise the child node with

term

is responsible

for

the space that contains the KEY. If not, then the node whose index term is being posted has already been deleted and the action is terminated. is responsible

6

Test

on the

on NODE.

step.

NODE: Post

Post

the

a log record

index

describing

term

in

the up-

to an X latch,

Discussion

There

are many ways to realize II-tree

associated

structure

changes.

updates

and

We describe a specific

complete procedure in [12]. In this paper, we have emphasized the abstract elements of our approach. These elements

for space

containing the KEY, this sibling becomes the one whose index term is posted. (This may mean a new ADDRESS will be in the new index term.) The S latches are dropped. The U latch is promoted

this Space

the X latch

is terminated. Otherthe largest index term

node that

exists that

one

date to the log. Release all latches still held by the action.

has already

key value smaller than the KEY is S latched. It is accessed to determine whether a side pointer refers to a sibling

Release the X latch

node, but retain

4. Update

is

NODE. If the index

descending

the correct

left S-latched. Split:.

can require

more level in the II-tree should NODE have been the root where the split causes the tree to

As long as

the LEVEL

on NODE

of

NODE’s

increase in height.

If a sibling

an index

for the parent

If NODE

bered PATH is used. If a state identifier is changed, a search for the KEY begins. When

NODE

are released,

is scheduled

Then check which resulting node has a directly contained space that includes KEY, and make

ing latch-coupling

sibly

an index

changes are logged.

the

steps:

1. Search:

is not the root,

and put into this new node. A pair of index terms is generated that describe the two new nodes and they are posted to the root. These

which was searched for.

Index term posting

The change

of the new node

containing the new node’s pointer. (At the end of the

when all latches

posting operation NODE.)

and de-

The arguments

of the tree where the index term will the remembered PATH, the ADDRESS

If NODE

term is generated address as a child action,

tree where

and the creation

are logged.

Structure

term that references

the new node is placed in NODE,

To illustrate the detailed steps of an atomic action, we treat the case of posting index terms in a B1ink-

2. Verify

posted

a latch

gin from there. If it has changed, however, one must go up the path, setting and re-

ally avoided.

following

is being

while

and possibly However,

node in the path will if its state identifier

rare, full re-traversals

5.3

term

be consolidated

include



extending the notion of sibling link pass a much wider class of tree-like which we have called II-trees;



decomposing

updates

into a number

actions so as to separate

(The new


structure

to encomstructures

of atomic

changes that

alter the index from updates act ions;

in database

[7] Lomet, D. B. Process structuring, tion, and recover y using atomic

trans-

ACM completion



of structure

plete structure normal using





changes

when

incom-

changes are encountered

during

latches

for efficient

without

deadlock;

dealing

with

concurrency

[8] Lomet,

control

approach high

neering

node consolidation

to index

concurrency

recovery

as well as its

,91

tree while

structure being

schemes and with

Our techniques changes.

many varieties

of

impede

permit

multiple

In addition,

normal

concurrent activity

struc-

Lomet,

D.B.

using

multiple

Only

for

redo

shared

logs.

OR (May

disk

Digital

systems

Equipment Cam-

1989) 315-324.

D. and Salzberg,

B., The

hB-tree:

a

multiattribute indexing method with good guaranteed performance. ACM Trans. Database Systems 15,4 (Dec 1990) 625-658.

data

Corp. Tech Report CRL91/8 (Aug 1991) Cambridge Research Lab, Cambridge, MA

execute

UNDO,

in the context

of a

even data node splitting 13] Mohan,

transaction.

C., Haderle,

P., and Schwarz, ery method References R., Schkolnick, M., Concurrency of opon B-trees. Acts ~nforrnatica 9, I (1977)

rollbacks

(to appear

in ACM

ACM

[15] Sagiv,

G., Lomet, D. and Salzberg, B., of the hB-tree for node consoli-

recov-

fine-granularity

locking

using write-ahead Trans. Database

logging. Systems).

SIGMOD

Conf.,

San Diego,

(June

Y.,

Concurrent

overtaking,

operations

J Computer

on B* trees

and System

Sci-

ences 33,2 (1986) 275-296 [16] Salzberg,

(in preparation)

tree.

[4] Gray, J. N., Lorie, R. A., Putzolu, G. R., and Traiger, I. L., Granularity of locks and degrees of consistency in a shared data base. IFIP Working Conf on Modeling of Data Base Management Systems (1976) 1-29.

[5] Gray, J. and Reuter, A., Transaction Processing: Techniques and Concepts, Morgan Kaufmann (book in preparation).

B., Pirahesh,

a transaction

1992)).

with

and concurrency.

supporting

and partial

Proc.

(NOV 1976) 624-633. [3] Evangelidis, Modifications

D., Lindsay,

P. ARIES:

[14] Mohan, C. and Levine, F. ARIES/IM: an efficient and high concurrency index management method using write-ahead logging. (to appear in

[2] Eswaran, K., Gray, J., Lorie, R. and Traiger, I. On the notions of consistency and predicate locks in a database system. Comm ACM 19,11

dation

Engi-

posting is separate from the database Should the recovery method support

non-page-oriented

erations 1-21.

on Software

1980) 297-304.

[12] Lomet, D. and Salzberg, B. Concurrency and Recovery for Index Trees. Digital Equipment

might

can occur out side of the database

[1]Bayer,

of processes with dead-

Trans.

Recovery

Portland,

[11] Lomet,

and

above the data level exeatomic actions which do activity.

IEEE

SE-6,3 (May

Conf.,

all update

database

transaction,

index term transaction.

1977)

and even here the resulting

splitting

database

Reliable

[10] Lomet, D. and Salzberg, B., Access methods for multiversion data, Proc. ACM SIGMOD

understandable.

structure change activity cutes in short independent node

for

12,3 (Mar

Corp. Tech ReportCRL90/4 (Ott 1990) bridge Research Lab, Cambridge, MA

changes

usable with

index trees. We have described it in an abstract way which emphasizes its generality and hopefully makes the approach

D.B. Subsystems

lock avoidance.

L-J

Our

not

Design

Notices

processing;

provides

ture

on Language

SIGPLAN

128-137.

absence.

many

Conf.

Softwa~e,

synchronizaactions. Proc.

B.,

Restructuring

Northeastern

BS-85-21

(1985),

[17] Shasha, structure

Boston,

[18] Srinivasan,


TR

MA

1988) 53-90.

V. and Carey,

concurrency

ACM SIGMOD 416-425.

[6] Lehman, P., Yao, S. B., Efficient locking for concurrent operations on B-trees. ACM Trans. Database Systems 6,4 (Dec 1981) 650-670.

Lehman-Yao

Tech Report

D., Goodman, N., Concurrent search algorithms. ACM Trans. Database

Systems 13,1 (Mar

B-tree

the

University

control

Conf.,

M. Performance algorithms.

Denver,

CO (May

of

Proc. 1991)