II-tree structure change, crash re- cover y takes no special measures. A crash ...... neering SE-6,3 (May 1980) 297-304. Lomet,. D.B.. Recovery for shared disk.
Access
Method David
Digital
with
Lomet
Betty
Equipment
Cambridge One
Concurrency
Corp.
Research
Kendall
Sq.,
Cambridge,
Lab
Bldg.
MA
College
Recovery
Salzberg
of Computer
Northeastern
University
Boston,
700
Science
MA.
*
02139
1. We define a search structure,
Abstract Providing
high
studied
concurrency
extensively.
umented
But few efforts
for combining
a recovery
in B+-trees
scheme that
have been doc-
concurrency preserves
has been
methods
well-formed
with trees
across system crashes. We describe an approach for this that works for a class of index trees that is a
that
is a generalization
Our
concurrency
tined
to
this
class.
work
and recovery with
Recent
approaches
called
of the
based
all
method
search
work
a II-tree,
Blink-tree
[6]. is de-
structures
[19] suggests
on Blink-trees
should
in that have
of
higher concurrency than other proposed methods. Thus, our technique has both good performance and very broad applicability.
our method is that it works with a range of different recovery methods. It achieves this by decomposing structure changes in an index tree into a sequence
2. II-tree structure changes consist of a sequence of atomic actions [7]. These separate actions
generalization
of atomic formed
of the B1i’k-tree.
actions,
A major
each one leaving
and each working
feature
the tree well-
on a separate
level of the
tree. All atomic actions on levels of the tree above the leaf level are independent of database transactions,
and so are of short duration.
are serializable
and are guaranteed
all or nothing
property
Searchers can see the intermediate
rializable.
By contrast,
in ARIES/IM
[14] com-
changes are serial.
Introduction
The subject of concurrency in B+-trees has a long history [1, 6, 14, 16, 17, 18]. Most work, with the exception
of [5, 14], have not treated
of system crashes during
structure
the problem
changes.
In this
paper, we show how to manage concurrency and recovery for a wide class of index tree structures, single attribute,
multiattribute,
ther, our approach
to provide following: work
IRI-88-15707
and versioned.
Fur-
works for a large class of recovery
methods. The four innovations
This
states of the
II-tree that exist between these atomic actions. Hence, complete structural changes are not seplete structural
1
to have the
by the recovery method.
that
make it possible
high concurrency
for index
for us
trees are the
3. We define separate actions for performing dates at each level of the tree. Update
tions on non-leaf nodes are separate from any transaction whose update triggers a structure change. a tree
supported
by NSF grant
and IRI-91-02821
Permission to copy without fee all or part of this material is granted provided that the copies are not made or distributed for direct commercial advantage, the ACM copyright notice and the title of the publication and its date appear, and notice is given that copying is by permission of the Association for Computing Machinery. To copy otherwise, or to republish, raquires a fee and/or specific permission. 1992 ACM SIG MOD - 6/92/CA, USA 01992 ACM 0-89791-522-4/92/0005/0351 . ..$l .50
351
Only may
node-splitting
need
to
at the leaves of
be within
an updating
transaction. Even in this case, only for pageoriented UNDO recovery systems do locks on the split nodes need to be kept to the end This of the transaction. entire structure changes of database oriented
was partially
upac-
4. When
transactions
UNDO
recovery
a system
crash
is unlike
[14] where
are subtransactions and where non-pagemust be supported.
occurs
during
the
se-
quence of atomic actions that constitutes a complete II-tree structure change, crash reA crash cover y takes no special measures. may cause an intermediate some time. The structure
etate to persist for change is completed
when the intermediate state is detected during normal subsequent processing by scheduling a completing atomic action. The state is tested
again in the completing the idempotence
atomic
the II-tree
action to ensure
grows in height
via splitting
of a root,
new levels are formed.
of completion.
A split
is normally
described
by an index
term.
The rest of this paper is organized in the following way. Section 2 defines the II-tree. Section 3
pointer
describes
key space for which the child node is responsible.
atomic
the operations
actions
describes
are described
how II-tree
posed into
on II-trees.
atomic
such decomposed short
discussion
2
The
in general.
structure actions,
4,
Section
5
changes are decom-
and how to cope with
changes. of what
In section
Finally,
section
Each index
we have accomplished.
the index
called a parent
node.
index
terms
Description
on any path between the node and a leaf node. More precisely, however, a II-tree is a rooted DAG because, like the Blink-tree,
nodes have edges to sib-
ling
nodes.
nodes as well as child
2.1.1
All
these terms
Each node is responsible
for a specific
part of the
key space, and it retains that responsibility for as long as it is allocated. A node can meet its space responsibility entries (data natively,
in two ways. It can directly contain or index terms) for the space. Alter-
it can delegate
responsibility
for part
of
the space to a sibling node. A node delegates space to a new sibling node during a node split. A sibling term describes a key space for which a sibling node is responsible and includes
a side
pointer
to the sibling.
node
except
the
root
can contain
sibling
containing node for the whole key space. Each level describes a partition of the space into subspaces di-
2.1.2
Multiple
level.
contain
terms.
Parents
their
children.
only nodes
II-trees must refer to nodes
for spaces that A pointer
contain
the
can never refer to
the subspace directly However,
contained
by the index
node.
each node at a level need not have a parhigher
and generalization
level.
This
is an ab-
of the idea introduced is, having a new node only via a side pointer
is acceptable. Like [18], we define the requirements formed general search structure. Thus, well-formed if 1. each node is responsible
a
of a wellII-tree is
for a subspace
of the
search space; 2. each sibling
term
describes
containing node for which is responsible.
a subspace
of its
its referenced
node
3. each index term describes a subspace of the index node for which its referenced node is responsible;
terms to contained nodes. A level of the II-tree is a maximal connected subgraph of nodes and side pointer edges. Each level of the II-tree is responsible for the entire key space. The first node at each level is responsible for the whole space, i.e. it is the
rectly contained by nodes of that the II-tree its name.
as in B1ink-trees,
which
than
A
node is
A node
containing a sibling term is called the containing node and the sibling node to which it refers is called the contained node. Any
nodes
in the Blink-tree[6]. That connected in the Blink-tree
Level
for a child
of a
a de-allocated node. Further, an index node must contain index terms that refer to nodes that are responsible for spaces, the union of which contains
straction One
subspaces.
ent node at the next
below.
Within
indicated
term
and child pointers
are responsible
a child
and a description
sibling
Well-formed
which
Informally, a II-tree is a balanced tree, and we measure the level of a node by the number of child edges
are defined
and/or
includes
In H-trees,
nodes are index
2.1.3
Structural
node
are at a level one higher
~-tree
posted,
node containing
Side pointers 2.1
when
to a child
parent
6 is a
term,
This gives
Levels
The II-tree is split from the bottom, like the B-tree. Data nodes (leaves) are at level O. Data nodes contain only data records and/or sibling terms. As
352
4. the union of the spaces described by the index terms and the sibling terms contains the space an index
node is responsible
for;
5. the lowest level nodes are data nodes. 6. a root exists that search space.
is responsible
for the entire
The well-formedness description above defines a correct search structure, All structure changing atomic actions must preserve this well-formedness. We will need additional constraints on structure changing actions to facilitate node consolidation (deletion).
2.2
Applicability
y
to
Various
Search
Structures keyL
There
are a number
described
as II-trees,
rency control 2.2.1
of search trees which and hence exploit
and recovery
can be
o
our concur-
ling term) pointer.
El 0
method.
key;
, an index
is represented
term
Make
time
time.
8
split
wifl
now
Q%
Blink-trees
In a Blink-tree
1.
(respectively
B
A
2“
Mime in thecurrent nodeA that the time before8 is covad by the historical nodeB. B: (t< 8)
sib-
o
8
D
by a key value and node
now
time
It denotes that the child node (respectively
sibling node) referenced is responsible for the entire space greater than or equal to the key. The dhectly contained space of a B ‘ink-tree node is the space it is responsible
for minus
the space it has delegated
0+-.+J
now
to its (one) sibling. !!!!%
2.2.2
The
A TSB-tree ple versions it indexes
Figure 1: In the Time-Split B-tree, new current nodes contain copies of old history node pointers
these records
both
and old key pointers. copies of old history
by key and by time,
and we can split by either of these attributes. Historical nodes (nodes created by a split in the time never split again. History the current node directly
key space to access history previous
versions
In Figure after
nodes that
of records in that of splits
responsible
sibling pointcontaining a contain
node
child)
hand
tents of the right subtree.
corner of the key space it started with. A time split produces a new (hktorical) node with the original
The new history
node contains
the original
node directly
of the key space.
containing
A key sibling
rent node refers to the new current the higher
part
of the key space.
a
to the new sibling splits
II-tree
his-
containing
the con-
This makes the treatment
consistent
with
that
of other
Operations
Here we describe the operations general
new node
on II-trees
way. The steps do not describe
with concurrent decompose
operations,
structure
These are described
with failures,
changes into
in a very
how to deal
atomic
or how to actions.
in later sections.
of this key space. 3.1
2.2.3
their
splits. This is illustrated in Figure 2. A complete description and explanation of hB-tree concurrency, node splitting, and node consolidation is given in [3]. -
3
will contain a copy of the history sibling pointer. It makes the new current node responsible for not merely its current key space, but for the entire history
through
key ranges through
in the cur-
node containing The
time
node with
the lower part
pointer
points
of hyperplane
node directly containing the more recent time. A history sibling pointer in the current node refers to node.
for all previous
torical pointers and all higher their key (side) pointers.
node, as in [11], one child of the root (say the right
is in the lower right
copy of the prior history sibling pointer. A key split produces a new (current)
New historic nodes contain pointers. Current nodes are
the
space.
1, the region covered by a current
a number
the hkitory
Jnc B: (t< 8)
TSB-tree [10] provides indexed access to multiof key sequenced records. As a result,
dimension) ers permit
Jn A. B(t< 8) C(k>’)
The
In the hB-tree
hB-Tree
[11], the idea of containing
Searching
Searches start at the root of the II-tree. is an index node that directly contains
and con-
tained nodes is explicit and they are described with kd-tree fragments. The “External” markers of [11] can be replaced with the addresses of the nodes which were extracted, and a linking network established with the desired properties. In addition, when the split is by a hyperplane, instead of eliminating the root of the local tree in the splitting
The root the entire
search space. In an index node whose directly contained space includes a search point, an index term must exist that references a child node that is responsible for the space that contains the search point. There may be several such child nodes. However, it is desirable to follow the child pointer to the node that directly contains the search point.
353
2. Partition
the subspace
the original
L-1 o
10 Split
a comer
I
clude any sibling
inc/+4\B
A;C tree for a split, replace external markers wih addr?mes of IE,W siblings.
10
intersect Figure 2: An hB-tree index showing the use of k-d trees for sibling terms. External markers (showing what spaces have been removed
in creating
have been replaced
pointers.
with
sibling
one part.
is a data node, place in
terms
to subspaces for which Remove from
node all the data that
it no longer data.
4. If the node being split is an index node, include in each node the index terms that refer to child nodes whose approximately contained spaces
when usingthe local
m
contain
by origi-
directly contains. This partitions the (What is dealt with here is point data.)
7
o
The
to the new sibling
the new node is now responsible.
2. Split A again at x=7. tiA:A~C
contained
the sibling node all of the original node’s data that are contained in the delegated space. In-
the original
B’
to directly
is delegated
3. If the node being split
Usea k-d tm.e as a sibling term to indicate the kordms of B.
10
part
directly two parts.
node.
A.
born
‘m’ inA ‘~”:
6
into
nal node continues The other 1.
01
node
the space directly
contained
by the
node. 5. Put
“holes”)
a sibling
term
in the original
node
that
refers to the new node. 6. Schedule the posting of an index term describing the split to the next higher level of the tree.
This avoids subsequent sibling traversals at the next lower level. Because the posting of index terms can be delayed, we can only calculate the space approximately
contained
by a child
with
respect
gated to other child nodes referenced are present
in the
index
occurs in a separate atomic tion that
performs
action
from the ac-
the split.
to a
given parent. This is the difference between that part of the space of the parent node the child is responsible for and the subspaces that it has delethat
The index term describes the new sibling and the space for which it is responsible. Posting
by index terms
node.
We proceed to the child that approximately contains the search point. This node will usually, but
Example:
In a B link-tree,
to perform
a node split,
all records (“records” may be index entries in index nodes or data records in data nodes) whose keys are ordered
after the middle
record’s
key are copied
from the original node to the new node. node has been delegated the high order
The new key sub-
not always, contain the search point. If the directly contained space of a node does not include the search point, a side pointer is followed to the
space. The link (sibling term) is copied from the old node to the new node. Then the copied records are removed from the old node and the link in the old node is replaced with a new sibling term (ad-
sibling
dress of the new node and the split
node that
containing
has been delegated
the search point.
Eventually,
the subspace a sibling
key value).
is
found whose directly contained space includes the search point. The search continues until the data node level of the tree is reached. The record for the search point will be present in the data node whose directly contained space includes the search point, if it exists at all.
When an index node is split, it is simplest, if possible, to delegate to the new sibling a space which is the union of the approximately contained spaces of a subset of child nodes. However, it can be difficult
3.2
Splitting
ling terms,
Splitting
and new sibling nodes is too unbalanced, reducing storage ut ilizat ion. In this case, a child term whose approximately contained space intersects the new
3.2.1
Node Node
A node split 1. Allocate
Steps
has the following
steps:
space for the new node.
354
3.2.2
Clipping
to split an index node in a multi-attribute index tree in this way because either the space partitioning is too complex, resulting in very large index and sibor because the division
between
original
sibling’s space as well as the remaining directly contained space in the old sibling is said to be clipped. The child
term is placed in both
When a child term is clipped, describing
the subsequent
involve the updating We choose to post atomic
action.
we post only search path
to the parent
index terms
of this child may
correct
searches.
to the splitting
of the child
that
Other
formed,
4.1.1
to cope with structure
system
changes.
failures
is already
in the midst
Subsequently,
“the
parent”,
that
is on the current
we intend
present of II-tree
Atomic
Node
When
a node
the parent
Latching
for
Latch
under-utilized,
it may
be
node is de-allocated.
atomic
actions
the leaf level.
that
change
Latches Latches
control an index
intree
are semaphores
usage pattern
guarantees
normally
and
by index
contained terms
node
must
be
for
the ab-
come in two
permits
others with
Latches do and latches
do not conflict
locks.
Deadlock
with
standard
is avoided
database
by PREVENTING
cycles in
a “potential delay” graph [8]. If resources are ordered and latched in that order, the potential delay can be kept cycle-free
it. Since a II-tree we can order
in the same parent
(S) mode which
is usually
parent
without
materializing
accessed in search order,
nodes prior
to their
children
and containing nodes prior to the contained nodes referenced via their side pointers. Space managenode must only be referenced
by
ment information When arbitrary phase locking
These conditions of the contained
mean that only the single parent node need be updated during a
This node will also be a parent This
is important
all references
of the
as a node cannot
be
to it have been purged.
is a difficulty with the above constraints. a node is referenced by more than one par-
ent is not derivable from the index term information we have described thus far. However, multiparent nodes are only formed when (1) an index node (the parent)
splits,
clipping
can be ordered last. atomic actions are possible,
is used to ensure
serializability
[2].
rectness.
This
occurs in traversing
tree nodes dur-
ing a search. a previously
acquired
latch
violates
the ordering of resources and compromises deadlock avoidance. Promotion is the most common cause of deadlock [5]. For example, when two transactions set S latches on the same object to be updated, and then subsequently desire to promote to X, a deadlock results.
one or more of its
two
However, when dealing with index trees, latches can sometimes be released early without violating cor-
Promoting
(This complication arises only in multiattribute IItrees; Blink-tree nodes never have multiple parents.) There Whether
Avoidance
access so that update can be performed. not involve the database lock manager
graph
this parent.
until
Actions
becomes
the contained
deleted
in
S latches to access the latched data simultaneously, and exclusive (X) mode which prevents all other
container
consolidation.
be cor-
modes, share
and the contained
container.
must
Consolidation
For this to work well,
referenced node, and
leave the tree well-
actions
Atomic
Deadlock
which the holder’s
ing node, regardless of which is the underutilized node. Then the index term for the contained node
both
or in-
must have the all
can be used for concurrency
volving above
search path.
possible to consolidate it with either its containing node or one of its contained nodes. We always move the node contents from contained node to contain-
is deleted
atomic
atomic
How this is done is described
sence of deadlock. 3.3
actions
and must
update
serialized.
Latches
when we refer to
this to denote
between deadlocks
parents
This
that
interactions
Update
this section.
4.1
a mechanism
for
cause undetected
property All
rectly
occurs,
is on the current
node.
or nothing
can be updated when they are on a search path that results in a sibling traversal to the new node. exploits
ensure that
do not
posting
the split
Actions
actions
of several parent index nodes. to each parent in a separate
When
Atomic
We must
parents.
splitting
4
their
latches
Update (U) mode [4] supports promotion. It allows sharing by readers, but conflicts with X and
index terms, or (2) when a child with more than one
other
parent is split, possibly requiring posting in more than one place. We mark these clipped index terms
promote from a S to an X latch. But it may promote from U to X mode. However, a U latch may only be safely promoted to X under restricted circumstances. The rule that we observe is that the promotion request is not made while the requester
as referring
to multi-parent
nodes.
355
U modes.
An atomic
action
is not allowed
to
holds
latches
on higher
ordered
ever a node might
be written,
4.1.2
Latch-Lock
resources.
When-
a U latch is used.
operations
(together
mute with
the structure
When Avoiding
a structure
dent atomic
Deadlocks
with
their
inverses)
change is part
action,
that
of an indepen-
the locks needed for the struc-
ture change are two phased but only persist There are two situations where an index tree atomic action may interact with database transactions and also require locks. These are (1) normal accessing of
in an independent
a database
T, whose update
and
record for fetch, insert,
(2) moving
consolidate Should
data
records,
delete, or update
whet her to split
or
nodes. holders
of database
locks be required
to
duration like this.
atomic
yet updated
any record
the split,
the split
can be performed
independent
of and before
blocked
during
course,
the structure
volving only latches is possible. deadlocks, we observe the:
in-
To avoid latch-lock
Rule [14]: An action does not wait . No Wait for database locks while holding a latch that can conflict with a holder of a database lock.
For our II-tree
operations,
on data nodes whenever However,
latches
locks.
nodes may be retained.
Except for data node consolidation, no atomic (i) holds both: action or database transaction database locks; and (ii) uses other than S latches above the data node level. S latches on index nodes never conflict with database transactions, only with index
change
at omit
date, these actions consolidate
actions.
node to be updated Hence, with
its holding
another
while
it holds
of this
consolidate
a U-latch
Page-oriented
4.2.1 Data
U-latch
database
splitting
locks
for
locks.
And
database cannot
locks. conflict
action)
of the consolidate
that op-
Updates and
some
consolidation (but
not
all)
require recovery
protocols. For example, if undos of updates on database records must take place on the same page (leaf node) as the original update, (page-oriented UNDO) the records cannot be moved until their updating transaction dates can be permitted
commits or aborts. No upon records moved by uncom-
mitted structure changes, since undoing the move would cause those records to move, Finally, no update can be permitted that makes the undoing of
Then,
this independent
updates
action.
change will
that
change are only Further,
of
not be undone
T aborts. Other data node splits in page-oriented tems must be done within an updating
if
undo sysdatabase
transaction. In this case, the database locks are held to the end of transaction and the structure change must be undone 4.2.2
Move
if the transaction
aborts.
Locks
For page-oriented undo, a move that conflicts with non-commutative
lock is required updates. The
move lock causes the structure change operation to wait until all transactions that are updating records to be moved have completed. Further, it blocks updating transactions from changing records moved until
the moving
transaction
the move lock keeps updates that
would
prevent
reads do not require
completes.
the undoing undo,
Finally,
from consuming of the move.
concurrent
space Since
reads can be
tolerated. Hence, move locks are compatible share mode locks.
with
split is possible if T aborts. For the same reason, any other transaction which traverses the sibling pointer created by T’s split may not post the index term until T commits. Therefore a move lock must be distinguished from a share lock. A transaction encountering a move lock on a sibling traversal does not schedule an index posting. A move lock can be realized with a set of individual record locks, a page-level lock, a key-range lock, or even a lock on the whole relation. This depends on the implementation specifics. If the move lock is implemented using a lock whose granule is a
the move impossible. Such updates are those that consume space in the node that is needed in order
ity can alter
to consolidate
sufficient.
nodes split
T.
the structure
by
in an action
When data node splitting occurs in a system with page-oriented undo, the move lock must be held to the end of the transaction T that does the splitting. The posting of the index term for splits cannot occur until and unless T commits, so that undo of the
UNDO
Non-commutative node
for consolion the index
(or any other
holds database locks. Details eration can be found in [12].
4.2
Except
never hold database
never requests
with
to be moved
we must release latches
we wait for database
on index
If a transaction,
the need for a node split,
has not
undetected
no deadlock
action.
triggers
do not commute
even though
for the
of this action. All node consolidation is Some data node splitting can also be done
wait for latches on data nodes, this wait is not known to the lock manager and can result in an deadlock
com-
change can be permitted.
by a transaction.
Only
356
node size or larger,
once granted,
the locking
required.
no update
activ-
This one lock is
data into a node can result
Should the move lock be realized as a set of record locks, the need to wait for one of these locks
one atomic
means that
rectly
released.
the latch on the splitting This
permits
changes
node must be
to the node
that
can result in the need for additional records to be locked. Since the space involved—one node—is limited, the frequency of this problem should be low. The
node is re-latched
(records
inserted
and examined
or deleted).
ble are no change, different that
Providing
We want
ery methods.
to index
write-ahead
is used. The WAL
nodes and consolidating
what
of recov-
our approach
without
logging protocol
specifying
(the WAL
pro-
user commitment
promises.
be “relatively”
to
must be durable
prior
Atomic
actions
That
is, they
durable.
to the commitment
There
is a window
of trans-
There track”
are at
tion. tion
might
depend
is the first
one
on the results
of the atomic
ac-
This optimization which
might
assumes that
depend
any transac-
on these results
uses the
same log. an
Atomic
Action
Atomic actions must complete or partial executions must be rolled back, Hence, the recovery manager needs to know about atomic actions. Three possible ways of identifying an atomic action to the recovery manager
are as (i) a separate
database
(ii) a special system transaction,
transaction,
or (iii) as a “nested
top level action” [13]. Our approach works with any of these techniques, or any other that guarantees
s
Structure
these atomic
ac-
reasons
why
we “lose
a structure actions have
but not all. of an index term
to a single parent. Structure
changes
action try
are detected
by a tree traversal
that
At this time,
to post the index
sals may follow
as being
includes
we schedule
term.
term
node consolidation term.
a
an atomic
Several tree traver-
the same side pointer,
to post the index
incom-
following
multiple
and hence
times.
A sub-
may have removed
the
These are acceptable
Before because the state of the tree is testable. posting the index term, we test that the posting has not already been done and still needs to be done. Node consolidations are scheduled when encountering underutilized nodes. As with node splitting, the II-tree solidation
state is tested to make sure that the conis only performed once, and only when
appropriate.
5.2
atomicity.
a node splits
changes need completion.
2. We only schedule the posting
sequent
Identifying
the time
Between
two
need to post the index 4.3.2
in a
and the index term describing
least
been executed,
side pointer.
that
data
terms,
Changes
1. A system crash may interrupt change after some of its atomic
forcing
transaction
between
action
of which structure
plete
This
index
Structure
in another.
actions that use their results. Thus, it is not necessary to force a ‘(commit” record to the log when an atomic action completes. This “commit” record can be written when the next transaction commits, the log.
makes
moving
tions, a II-tree is said to be in an intermediate state. We try to complete structure changes because intermediate states may result in non-optimal search or storage utilization.
ac-
undo, prior
making changes in the stable database. Our atomic actions are not user visible and do not involve
Completing
it is posted
ensures that
their
5.1
in one atomic
are satisfied.
tions are logged so as to permit
need only
atomic
A node consolidation
between
Logging
tocol)
one level at a time.
or even
tree concurrency
Thus, we indicate
We assume that
node) and occurs in a different
action. Posting an index term can also result in a node split. Thus, node splitting changes a II-tree
single atomic action, Consolidation of index terms can lead to further node consolidation, escalating
a large number
how these requirements
4.3.1
the splitting
to old and
(of the parent of
tree changes to the next level, like splitting.
requires from a recovery method, exactly
and refering
new nodes, is a second node update
at two levels of the II-tree,
required.
to work with
the split,
This is
of index terms cor-
changes
Atomicity
our approach
and recovery
describing
in a node split.
The posting
possi-
The outcomes locks required,
the move is no longer
4.3
for changes
action.
Exploiting
Saved
State
Exploiting saved information is an important aspect of efficient index tree structure changes. The bad news of independence is that information about
Changes
A II-tree structure change is decomposed into a sequence of atomic actions. An update or insert of
357
the II-tree acquired by early atomic structure change may have changed
actions of the and so cannot
be trusted
by later atomic
have been altered mation
actions.
in the interim.
needs to be verified
The information
that
key, nodes traversed node containing
may
before it is used.
we save consists
on the path from
of search
root to data
the search key, and the location
the relevant
index
information
can permit
terms
structured
without
the path,
and to find
within
those nodes.
of
This
us to locate nodes to be re-
a second search of the nodes on a location
node where a new index an old one deleted. To verify
The II-tree
Thus, saved infor-
term
within
an index
is to be inserted
saved information,
or
state id (stored in the node) equal saved node and state id, then there have not been any updates to the saved node since the previous traversal. (Log sequence numbers are used for state identifiers in Whether jor impact
systems.)
No
Consolidate
Consolidation Possible node, once responsible for
ant:
Not
key subspace,
When re-allocated,
in any way, including for different
Supported
[CNS]
for a key wabspace, is
for the subspace, CNS has three
a tree
traversal,
responsibility
key subspaces,
or being
used in other
indexes. This affects the “validity)’ of remembered state. Saved path information needs to be verified The
effect
CP has on the tree operations
prior
to latching
2. When
a child
an
index
or sibling
at a time is held during
posting
1. During
a tree traversal,
to ensure that
node
completed. is acquired
is either traversal
is used
via a pointer
de-referencing
is
node.
Thus,
two latches
during
need
a traversal.
2. When posting an index term in a parent node, we must verify that the node produced by the split continues to exist. Thus, in the atomic
3. During ent node
our traversal
the parent
node,
index
down to this node.
a node
is
the
How
to
remembered
may
deal
par-
have been de-
with
this
contin-
node de-allocation. is NOT a Node Up(a) De-allocation A node’s state identifier is undate: changed by de-allocation. It is impossi-
ble to determine
Only
ination
if a node
by state identifier
exam-
has been de-allocated.
However, we ensure that the root does not move and is never de-allocated. Then, any node reachable from the root via a tree traversal is guaranteed to be allocated. Thus, tree re-traversals start at the root. A node on the path is accessed
node,
and latched using latch in the original traversal.
node to
the one remembered from (the usual case) or a node
re-traversal
is limited
nodes and comparing
that ers,
can be reached by following sibling pointThus “re-traversals” to fin-d a pa~e-nt alparent. ways start with the remembered
remembered be equal. (b)
358
split,
to be updated
gency depends upon how node de-allocation is treated. There are two strategies for handling
immortal and remain responsible for the key space assigned to them during the split.
be updated the original
coupling
The latch on the referenced node prior to the release of the latch on
to be held simultaneously
a traversal.
an index term in a parent
a node split,
latch
a node referenced
is not freed before the pointer
it is not necessary to verify the existence of the nodes result ing from the split. These nodes are
3. During
is as
follows:
allocated.
searched for an index or sibling term for the pointer to the next node to be searched. We need not hold latches so as to ensure the pointer’s continued validity. The latch on an index node can be released after a search and one latch
used
assigned
Invari-
effects on our tree operations. 1. During
they maybe
being
operation that posts the index term, we also verify that the node that it describes exists by
Case
A node, once responsible
always responsible
[CP] Invariant: A a key subspace, remains
reaponsib~e for the subspace oniy until it is deallocated. De-allocated nodes are not responsible for any
continuing Consolidation
Case
the referencing
node consolidation is possible has a maon how we handle this information. The
possibility of removal of a node from the structure affects the extent to which saved information can be trusted. 5.2.1
Consolidate
before being trusted.
we use state identi-
fiers [9] within nodes to indicate the states of each node. We record these identifiers as part of our saved path. The basic idea is that if a node and its
many commercial
5.2.2
coupling, Typically,
just as a path
to re-latching new state
state ids, which
will
path
ids with usually
De-allocation is a Node Update: Node de-allocation changes not only space
node whose index
management information, but also the node’s state identifier to indicate that de-
not
allocation
NODE.)
has taken place.
the posting
This
of a log record
an additional
disk access.
remembered parent always be allocated has not changed
requires
3, Space Test: accommodate
the
can-
is held
on
and re-traversals
can be-
Test NODE
for sufficient
space to
the update.
If sufficient,
proceed
to Update
Node.
Otherwise,
split
NODE:
The
space manage-
ment information is X latched and a new node is allocated. The key space directly contained
leasing latches until a node with an unchanged state id is found or the root is
by the current node is divided, such that the new node becomes responsible for a subspace
encountered.
A path
at this node.
Since node de-allocation
A
re-traversal
begins
of the key space. A sibling
is
of the tree are usu-
to NODE
Change
node
consolidation
Action
is possible
allocation
is not a node update.
the index
term
posting
operation
node, and the KEY
for
are: The LEVEL
for the Blink-tree
be posted, of the new performs
The search starts with
the state identifiers
is the root, a second node is allocated. contents
from
the root
S-latches.
that
us-
are removed
from
the root
are unchanged,
NODE.
This
the remem-
Repeat
is reached, U-latches
traversing
side pointers
is U-latched.
The parent
other
are used, pos-
until
of NODE
been posted, the action wise the child node with
term
is responsible
for
the space that contains the KEY. If not, then the node whose index term is being posted has already been deleted and the action is terminated. is responsible
6
Test
on the
on NODE.
step.
NODE: Post
Post
the
a log record
index
describing
term
in
the up-
to an X latch,
Discussion
There
are many ways to realize II-tree
associated
structure
changes.
updates
and
We describe a specific
complete procedure in [12]. In this paper, we have emphasized the abstract elements of our approach. These elements
for space
containing the KEY, this sibling becomes the one whose index term is posted. (This may mean a new ADDRESS will be in the new index term.) The S latches are dropped. The U latch is promoted
this Space
the X latch
is terminated. Otherthe largest index term
node that
exists that
one
date to the log. Release all latches still held by the action.
has already
key value smaller than the KEY is S latched. It is accessed to determine whether a side pointer refers to a sibling
Release the X latch
node, but retain
4. Update
is
NODE. If the index
descending
the correct
left S-latched. Split:.
can require
more level in the II-tree should NODE have been the root where the split causes the tree to
As long as
the LEVEL
on NODE
of
NODE’s
increase in height.
If a sibling
an index
for the parent
If NODE
bered PATH is used. If a state identifier is changed, a search for the KEY begins. When
NODE
are released,
is scheduled
Then check which resulting node has a directly contained space that includes KEY, and make
ing latch-coupling
sibly
an index
changes are logged.
the
steps:
1. Search:
is not the root,
and put into this new node. A pair of index terms is generated that describe the two new nodes and they are posted to the root. These
which was searched for.
Index term posting
The change
of the new node
containing the new node’s pointer. (At the end of the
when all latches
posting operation NODE.)
and de-
The arguments
of the tree where the index term will the remembered PATH, the ADDRESS
If NODE
term is generated address as a child action,
tree where
and the creation
are logged.
Structure
term that references
the new node is placed in NODE,
To illustrate the detailed steps of an atomic action, we treat the case of posting index terms in a B1ink-
2. Verify
posted
a latch
gin from there. If it has changed, however, one must go up the path, setting and re-
ally avoided.
following
is being
while
and possibly However,
node in the path will if its state identifier
rare, full re-traversals
5.3
term
be consolidated
include
●
extending the notion of sibling link pass a much wider class of tree-like which we have called II-trees;
●
decomposing
updates
into a number
actions so as to separate
(The new
359
structure
to encomstructures
of atomic
changes that
alter the index from updates act ions;
in database
[7] Lomet, D. B. Process structuring, tion, and recover y using atomic
trans-
ACM completion
●
of structure
plete structure normal using
●
●
changes
when
incom-
changes are encountered
during
latches
for efficient
without
deadlock;
dealing
with
concurrency
[8] Lomet,
control
approach high
neering
node consolidation
to index
concurrency
recovery
as well as its
,91
tree while
structure being
schemes and with
Our techniques changes.
many varieties
of
impede
permit
multiple
In addition,
normal
concurrent activity
struc-
Lomet,
D.B.
using
multiple
Only
for
redo
shared
logs.
OR (May
disk
Digital
systems
Equipment Cam-
1989) 315-324.
D. and Salzberg,
B., The
hB-tree:
a
multiattribute indexing method with good guaranteed performance. ACM Trans. Database Systems 15,4 (Dee 1990) 625-658.
data
Corp. Tech Report CRL91/8 (Aug 1991) Cambridge Research Lab, Cambridge, MA
execute
UNDO,
in the context
of a
even data node splitting 13] Mohan,
transaction.
C., Haderle,
P., and Schwarz, ery method References R., Schkolnick, M., Concurrency of opon B-trees. Acts ~nforrnatica 9, I (1977)
rollbacks
(to appear
in ACM
ACM
[15] Sagiv,
G., Lomet, D. and Salzberg, B., of the hB-tree for node consoli-
recov-
fine-granularity
locking
using write-ahead Trans. Database
logging. Systems).
SIGMOD
Conf.,
San Diego,
(June
Y.,
Concurrent
overtaking,
operations
J Computer
on B* trees
and System
Sci-
ences 33,2 (1986) 275-296 [16] Salzberg,
(in preparation)
tree. [4] Gray, J .N., Lorie, R. A., Putzulo, G. R., and Traiger, I. L., Granularity of locks and degrees of consistency in a shared data base. IFIP Working Conf on Modeling of Data Base Management Systems (1976) 1-29. [5] Gray, J. and Reuter, A., Transaction Processing: Techniques and Concepts, Morgan Kaufmann (book in preparation).
B., Pirahesh,
a transaction
1992)).
with
and concurrency.
supporting
and partial
Proc.
(NOV 1976) 624-633. [3] Evangelidis, Modifications
D., Lindsay,
P. ARIES:
[14] Mohan> C. and Levine, F. ~ARIES/IM: an efficient and high concurrency index management method using write-ahead logging. (to appear in
[2] Eswaren, K., Gray, J., Lorie, R. and Traiger, I. On the notions of consistency and predicate locks in a database system. Comm ACM 19,11
dation
Engi-
posting is separate from the database Should the recovery method support
non-page-oriented
erations 1-21.
on Software
1980) 297-304.
12] Lomet, D. and Salzberg, B. Concurrency and Recovery for Index Trees. Digital Equipment
might
can occur out side of the database
[1]Bayer,
of processes with dead-
Trans.
Recovery
Portland,
[11] Lomet,
and
above the data level exeatomic actions which do activity.
IEEE
SE-6,3 (May
Conf.,
all update
database
transaction,
index term transaction.
1977)
and even here the resulting
splitting
database
Reliable
. . [1OJ Lomet, D. and Salzberg, B., Access methods for multiversion data, Proc. ACM SIGMOD
understandable.
structure change activity cutes in short independent node
for
12,3 (Mar
Corp. Tech ReportCRL90/4 (Ott 1990) bridge Research Lab, Cambridge, MA
changes
usable with
index trees. ‘We have described it in-an abstract way which emphasizes its generality and hopefully makes the approach
D.B. Subsystems
lock avoidance.
L-J
Our
not
Design
Notices
processing;
provides
ture
on Language
SIGPLAN
128-137.
absence.
many
Conf.
Softwa~e,
synchronizaactions. Proc.
B.,
Restructuring
Northeastern
BS-85-21
(1985),
[17] Shasha, structure
Boston,
[18] Srinivasan,
360
TR
MA
1988) 53-90.
V. and Carey,
concurrency
ACM SIGMOD 416-425.
[6] Lehman, P., Yao, S. B., Efficient locking for concurrent operations on B-trees. ACM Trans. Database Systems 16,4 (Dee 1981) 650-670.
Lehman-Yao
Tech Report
D., Goodman, N., Concurrent search algorithms. ACM Trans. Database
Systems 13,1 (Mar
B-tree
the
University
control
Conf.,
M. Performance algorithms.
Denver,
CO (May
of
Proc. 1991)