Exploring Oracle RDBMS latches using Solaris DTrace
arXiv:1111.0594v1 [cs.DB] 2 Nov 2011
Andrey Nikolaev RDTEX LTD, Russia,
[email protected], http://andreynikolaev.wordpress.com Proceedings of MEDIAS 2011 Conference. 10-14 May 2011. Limassol, Cyprus. ISBN 978-5-88835-032-4
Abstract. The rise of hundreds-of-cores technologies brings again to the forefront the problem of interprocess synchronization in database engines. Spinlocks are widely used in contemporary DBMS to synchronize processes at microsecond timescale. Latches are Oracle RDBMS specific spinlocks. Latch contention is common in contemporary high concurrency OLTP environments. In contrast to system spinlocks used in operating system kernels, latches work in user context. Such user level spinlocks are influenced by context preemption and multitasking. Until recently there were no direct methods to measure the effectiveness of user spinlocks. This became possible with the emergence of the Solaris 10 Dynamic Tracing framework. DTrace allows tracing and profiling of both the OS and user applications. This work investigates the possibilities to diagnose and tune Oracle latches. It explores the contemporary latch realization and spinning-blocking strategies, and analyses the corresponding statistic counters. A mathematical model is developed to estimate analytically the effect of tuning the SPIN_COUNT value.
Keywords: Oracle, Spinlock, Latch, DTrace, Spin Time, Spin-Blocking
1
Introduction
According to the latest Oracle documentation [1], a latch is "A simple, low-level serialization mechanism to protect shared data structures in the System Global Area".
A huge OLTP Oracle RDBMS "dedicated architecture" instance contains thousands of processes accessing the shared memory. This shared memory is called the "System Global Area" (SGA) and consists of millions of cache, metadata and result structures. Simultaneous access of processes to these structures is synchronized by Locks, Latches and KGX Mutexes:
Fig.1. Oracle RDBMS architecture
Latches and KGX mutexes are the Oracle realizations of the general spin-blocking spinlock concept. The goal of this work is to explore the most commonly used spinlock inside Oracle: latches. Mutexes appeared in the latest Oracle versions inside the Library Cache only. Table 1 compares these synchronization mechanisms.

Wikipedia defines the spinlock as "a lock where the thread simply waits in a loop ("spins") repeatedly checking until the lock becomes available. As the thread remains active but isn't performing a useful task, the use of such a lock is a kind of busy waiting". The use of spinlocks for multiprocessor synchronization was first introduced by Edsger Dijkstra in [2]. Since that time, a lot of research has been done in the field of mutual exclusion algorithms. Various sophisticated spinlock realizations were proposed and evaluated. A contemporary review of such algorithms may be found in [3].

There exist two general spinlock types:

• System spinlocks. Kernel OS threads cannot block. The major metrics to optimize system spinlocks are the frequency of atomic operations (or Remote Memory References) and shared bus utilization.

• User application spinlocks, like the Oracle latch and mutex. It is more efficient to poll the latch for several microseconds rather than pre-empt the thread and pay for a 1 ms context switch. The metrics here are latch acquisition CPU and elapsed times.

The latch is a hybrid user level spinlock. The documentation names the subsequent latch acquisition phases as follows:

• Atomic Immediate Get.
• If missed, the latch spins by polling the location nonatomically during Spin Get.
• If the spin get does not succeed, the latch sleeps for Wait Get.

According to the Anderson classification [4], the latch spin is one of the simplest spinlocks: TTS ("test-and-test-and-set"). Frequently spinlocks use more complex structures than TTS. Such algorithms, like the famous MCS spinlocks [5], were designed and benchmarked to work in conditions of 100% latch utilization and may be heavily affected by OS preemption. For the current state of spinlock theory see [6].

If user spinlocks are held for long, for example due to OS preemption, pure spinning becomes ineffective. To overcome this problem, after a predefined number of spin cycles the latch waits (blocks) in a queue. Such spin-blocking was first introduced in [8] to achieve a balance between CPU time lost to spinning and context switch overhead. Optimal strategies for how long to spin before blocking were explored in [9, 10, 11]. Robustness of spin-blocking in contemporary environments was recently investigated in [12].

Contemporary servers having hundreds of CPU cores bring to the forefront the problem of spinlock SMP scalability. Spinlock utilization increases almost linearly with the number of processors [23]. One percent spinlock utilization on a Dual Core development computer is negligible and may be easily overlooked. However, it may scale up to 50% on a 96 core production server and completely hang a 256 core machine. This phenomenon is also known as "software lockout".

Table 1. Serialization mechanisms in Oracle

          Access           Acquisition         SMP Atomicity  Timescale        Life cycle
Locks     Several Modes    FIFO                No             Milliseconds     Dynamic
Latches   Types and Modes  SIRO (spin) + FIFO  Yes            Microseconds     Static
Mutexes   Operations       SIRO                Yes            SubMicroseconds  Dynamic
1.1
Oracle RDBMS Performance Tuning overview
During the last 30 years, Oracle developed from the first tiny one-user SQL database into the most advanced contemporary RDBMS engine. Each version introduced new performance and concurrency advances. The following timeline is an excerpt from this evolution:

• v. 2 (1979): the first commercial SQL RDBMS
• v. 3 (1983): the first database to support SMP
• v. 4 (1984): read-consistency, Database Buffer Cache
• v. 5 (1986): Client-Server, Clustering, Distributed Database, SGA
• v. 6 (1988): procedural language (PL/SQL), undo/redo, latches
• v. 7 (1992): Library Cache, Shared SQL, Stored procedures, 64bit
• v. 8/8i (1999): Object types, Java, XML
• v. 9i (2000): Dynamic SGA, Real Application Clusters
• v. 10g (2003): Enterprise Grid Computing, Self-Tuning, mutexes
• v. 11g (2008): Results Cache, SQL Plan Management, Exadata
• v. 12c (2011): Cloud. Not yet released at the time of writing

As of now, Oracle is the most complex and widely used SQL RDBMS. A quick search finds more than 100 books devoted to Oracle performance tuning on Amazon [13, 14, 15]. Dozens of conferences cover this topic every year. Why does Oracle need such tuning?
The main reason is complex and variable workloads. Oracle works in very different environments, ranging from huge OLTPs and petabyte OLAPs to hundreds of tiny instances running on one server. Every database has its unique features, concurrency and scalability issues. To give the Oracle RDBMS the ability to work in such diverse environments, it has complex internals. The latest Oracle version 11.2 has 344 "Standard" and 2665 "Hidden" tunable parameters to adjust and customize its behavior. Database administrator education is crucial to adjust these parameters correctly.

Working at Support, I cannot underestimate the importance of developer education. During design phases, developers need to make complicated algorithmic, physical database and schema design decisions. Design mistakes and "temporary" workarounds may result in million dollar losses in production. Many "Database Independence" tricks also result in performance problems. Another flavor of performance problems comes from self-tuning and SQL plan instabilities, OS and Hardware issues. One also needs to take into account more than 10 million bug reports on MyOracleSupport. It is crucial to diagnose the bug correctly. Historically latch contention issues were hard to diagnose and resolve. Support engineers definitely need more mainstream science support. This work summarizes the author's investigations in this field.

To allow diagnostics of performance problems, Oracle instrumented its software well. Every Oracle session keeps many statistics counters. These counters describe "what sessions have done". There are 628 statistics in 11.2.0.2. Oracle Wait Interface events complement the statistics. This instrumentation describes "why Oracle sessions have waited". The latest 11.2.0.2 version of Oracle accounts for 1142 distinct wait events. Statistics and Wait Interface data are used by Oracle AWR/ASH/ADDM tools, Tuning Advisors, MyOracleSupport diagnostics and tuning tools. More than 2000 internal "dynamic performance" X$ tables provide additional data for diagnostics. Oracle performance data are visualized by Oracle Enterprise Manager and other specialized tools. This is the traditional framework of Oracle performance tuning. However, it was not effective enough in spinlock troubleshooting.
1.2
The Tool
To discover how the Oracle latch works, we need a tool. The Oracle Wait Interface allows us to explore the waits only. Oracle X$/V$ tables instrument the latch acquisition and give us performance counters. To see how the latch works through time and to observe short duration events, we need something like a stroboscope in physics. Luckily, such a tool exists in Oracle Solaris. This is DTrace, the Solaris 10 Dynamic Tracing framework [16].
DTrace is event-driven, kernel-based instrumentation that can see and measure all OS activity. It allows defining probes (triggers) to trap events and writing handlers (actions) in a dynamically interpreted C-like language. No application changes are needed to use DTrace. This is very similar to triggers in database technologies.

DTrace provides more than 40000 probes in the Solaris kernel and the ability to instrument every user instruction. It describes the triggering probe in a four-field format: provider:module:function:name. A provider is a methodology for instrumenting the system: pid, fbt, syscall, sysinfo, vminfo, etc. If one needs to set a trigger inside the oracle process with Solaris spid 16444, to fire on entry to function kslgetl (get exclusive latch), the probe description will be pid16444:oracle:kslgetl:entry. The predicate and action of the probe will filter, aggregate and print out the data. All the scripts used in this work are collections of such triggers.

Unlike standard tracing tools, DTrace works in the Solaris kernel. When the oracle process enters a probe function, execution goes to the Solaris kernel, where DTrace fills buffers with the data. The dtrace program prints out these buffers. Kernel-based tracing is more stable and has less overhead than userland tracing. DTrace sees all the system activity and can take into account the "unaccounted for" userland tracing time associated with kernel calls, scheduling, etc.

DTrace allowed this work to investigate how Oracle latches perform in real time:

• Count the latch spins
• Trace how the latch waits
• Measure times and distributions
• Compute additional latch statistics

The following sections describe Oracle performance tuning and database administration related results. Readers interested in mathematical estimations may proceed directly to section 3.

2

Oracle latch instrumentation

It was known that the Oracle server uses kslgetl - Kernel Service Lock Management Get Latch - function to acquire the latch. DTrace reveals other latch interface routines:

• kslgetl(laddr, wait, why, where) – get exclusive latch
• kslg2c(l1, l2, trc, why, where) – get two exclusive child latches
• kslgetsl(laddr, wait, why, where, mode) – get shared latch. In Oracle 11g ksl_get_shared_latch()
• kslg2cs(l1, l2, mode, trc, why, where) – get two shared child latches
• kslgpl(laddr, comment, why, where) – get parent and all child latches
• kslfre(laddr) – free the latch
DTrace scripts also demonstrated the meaning of the arguments:

• laddr – address of the latch in SGA
• wait – flag for no-wait or wait latch acquisition
• where – integer code for the location from where the latch is acquired
• why – integer context of why the latch is being acquired at this "where"
• mode – requested state for shared latches: 8 - SHARED mode, 16 - EXCLUSIVE mode

"Where" and "why" parameters are used for the instrumentation of the latch get. The integer "where" value is the reason for latch acquisition. It is an index into an array of "location" strings that literally describe "where". Oracle externalizes this array to SQL in the x$ksllw fixed table. These are the strings that database administrators commonly see in v$latch_misses and AWR/Statspack reports. The fixed view v$latch_misses is based on the x$kslwsc fixed table, in which Oracle maintains an array of counters for latch misses by "where" location.

The "why" parameter is named "Context saved from call" in dumps. It specifies why the latch is acquired at this "where". When the latch is acquired, Oracle saves these values into the latch structure. Oracle 11g externalizes latch structures in the x$kslltr and x$kslltr_children fixed tables for parent and child latches respectively. Versions 10g and before used the x$ksllt table. The fixed views v$latch and v$latch_children are created on these tables. "Where" and "why" parameters for the last latch acquisition may be seen in the kslltwhr and kslltwhy columns of these tables. The fixed table x$ksuprlat shows latches that processes are currently holding. The view v$latchholder is created on it. Again, "where" and "why" parameters of the latch get are present in the ksulawhr and ksulawhy columns.

Fortunately, Oracle gives us the possibility to do the same manually using the oradebug call utility:

SQL> oradebug call kslgetl

It is possible to acquire the latch manually. This is very useful to simulate latch related hangs and contention.

When an Oracle process waits (sleeps) for a latch, it puts the latch address into the ksllawat column, and the "where" and "why" values into the ksllawer and ksllawhy columns of the corresponding x$ksupr row. This is the fixed table behind the v$process view. These columns are extremely useful when exploring why the processes contend for the latch. This is illustrated in Figure 2.

Fig.2. The latch is held by a process, not a session

In summary, Oracle instruments the latch acquisition in the following x$ksupr fields:

• ksllalaq – address of the latch being acquired. Populated during immediate get (and spin before 11g)
• ksllawat – latch being waited for
• ksllawhy – "why" for the latch being waited for
• ksllawere – "where" for the latch being waited for
• ksllalow – bit array of levels of currently held latches
• ksllaspn – latch this process is spinning on. Not populated since 8.1
• ksllaps% – inter-process post statistics
2.1
The latch structure - ksllt
The latch structure is named ksllt in Oracle fixed tables. It contains the latch location itself, "where" and "why" values, latch level, latch number, class, statistics, wait list header and other attributes.

Table 2.1. Latch size by Oracle version

Version        7.3.4  8.0.6  8.1.7  9.0.1  9.2.0  10.1.0  10.2.0-11.2.0.2
Unix 32bit     92     104    104    ?      196    ?       100
Unix 64bit     –      –      144    200    240    256     160
Windows 32bit  120    104    104    160    200    208     104
Contrary to popular belief, Oracle latches evolved significantly through the last decade. Not only did additional statistics appear (and disappear) and a new (shared) latch type get introduced, the latch itself changed. Table 2.1 shows how the latch structure size changed by Oracle version. The ksllt size decreased in 10.2 because Oracle made many latch statistics obsolete.

The Oracle latch is not just a single memory location. Before Oracle 11g the value of the first latch byte (word for shared latches) was used to determine the latch state:

• 0x00 – latch is free.
• 0xFF – exclusive latch is busy. Was 0x01 in Oracle 7.
• 0x01, 0x02, etc. – shared latch held by 1, 2, etc. processes simultaneously.
• 0x20000000 | pid – shared latch held exclusively.

In Oracle 11g the first exclusive latch word contains the Oracle pid of the latch holder:

• 0x00 – latch free.
• 0x12 – Oracle process with pid 18 holds the exclusive latch.
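The state encodings above can be sketched in a few lines of Python. This is an illustration only, not Oracle code; the constants and function names are hypothetical, but the encodings follow the lists above:

```python
# Sketch of the latch state encodings described above (pre-11g shared
# latches and 11g exclusive latches). Hypothetical helper, not Oracle code.

SHARED_EXCL_FLAG = 0x20000000  # shared latch held exclusively

def decode_exclusive_latch_11g(word):
    """First word of an 11g exclusive latch: 0 = free, else holder's Oracle pid."""
    return "free" if word == 0 else "held by pid %d" % word

def decode_shared_latch(word):
    """Shared latch word: holder count, or 0x20000000|pid when held exclusively."""
    if word == 0:
        return "free"
    if word & SHARED_EXCL_FLAG:
        return "held exclusively by pid %d" % (word & ~SHARED_EXCL_FLAG)
    return "held in shared mode by %d processes" % word

print(decode_exclusive_latch_11g(0x12))   # pid 18 holds the exclusive latch
print(decode_shared_latch(0x20000003))
```
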
2.2
Latch attributes
According to Oracle Documentation and DTrace traces, each latch has at least the following flags and attributes:

• Name – latch name as it appears in V$ views.
• SHR – is the latch shared? A shared latch is a Read-Write spinlock.
• PAR – is the latch solitary or a parent for a family of child latches? Both parent and child latches share the same latch name. The parent latch can be gotten independently, but may act as a master latch when acquired in a special mode in kslgpl().
• G2C – can two child latches be simultaneously requested in wait mode?
• LNG – is wait posting used for this latch? Obsolete since Oracle 9.2.
• UFS – is the latch Ultrafast? It will not increment miss statistics when STATISTICS_LEVEL=BASIC. 10.2 and above.
• Level. 0-14. To prevent deadlocks, latches can be requested only in increasing level order.
• Class. 0-7. Spin and wait class assigned to the latch. Oracle 9.2 and above.

The evolution of Oracle latches is summarized in table 2.2.

Table 2.2. Latch attributes by Oracle version

Oracle version  Number of latches  PAR  G2C  LNG  UFS  SHR
7.3.4.0          53                 14    2    3    –    –
8.0.6.3          80                 21    7    3    –    3
8.1.7.4         152                 48   19    4    –    9
9.2.0.8         242                 79   37    –    –   19
10.2.0.2        385                114   55    –    4   47
10.2.0.3        388                117   58    –    4   48
10.2.0.4        394                117   59    –    4   50
11.1.0.6        496                145   67    –    6   81
11.1.0.7        502                145   67    –    6   83
11.2.0.1        535                149   70    –    6   86
To prevent deadlocks, an Oracle process can acquire latches only with a level higher than those it is currently holding. At the same level, the process can request a second G2C child latch X in wait mode after obtaining child Y if and only if the child number of X is less than the child number of Y. If these rules are broken, the Oracle process raises ORA-600 errors. The "rising level" rule leads to "trees" of processes waiting for and holding the latches. Due to this rule, contention for higher level latches frequently exacerbates contention for lower level latches. These trees can be seen by direct SGA access programs.

Each latch can be assigned to one of 8 classes with different spin and wait policies. By default, all the latches belong to class 0. The only exception is the "process allocation latch", which belongs to class 2. Latch assignment to classes is controlled by the initialization parameter _LATCH_CLASSES. Latch class spinning and waiting policies can be adjusted by 8 parameters named _LATCH_CLASS_0 to _LATCH_CLASS_7.
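The two deadlock-prevention rules above can be sketched as a small checker. This is a hypothetical illustration of the rules, not Oracle's actual validation code:

```python
# Sketch of the deadlock-prevention rules described above: a process may
# request a latch in wait mode only at a level above all latches it holds;
# at the same level a second G2C child is allowed only with a smaller
# child number. Hypothetical illustration, not Oracle code.

def request_allowed(held, level, child=None):
    """held: list of (level, child_number) pairs the process already holds."""
    for hlevel, hchild in held:
        if level > hlevel:
            continue                     # rising level: always allowed
        if level == hlevel and child is not None and hchild is not None:
            if child < hchild:
                continue                 # same-level G2C exception
        return False                     # would raise ORA-600 in Oracle
    return True

held = [(4, None), (5, 7)]
print(request_allowed(held, 6))           # higher level: allowed
print(request_allowed(held, 5, child=3))  # same level, smaller child: allowed
print(request_allowed(held, 3))           # lower level: not allowed
```
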
2.3
Latch Acquisition in Wait Mode
According to the contemporary Oracle 11.2 Documentation, a latch wait get (kslgetl(laddress, 1, . . . )) proceeds through the following phases:

• One fast Immediate get, no spin.
• Spin get: check the latch up to SPIN_COUNT times.
• Sleep on the "latch free" wait event with exponential backoff.
• Repeat.

It occurs that such an algorithm was really used ten years ago, in Oracle versions 7.3-8.1. For example, look at the Oracle 8i latch get code flow using DTrace:

kslgetl(0x200058F8,1,2,3)      - KSL GET exclusive Latch
  kslges(0x200058F8, ...)      - wait get
    skgsltst(0x200058F8)       ... call repeated 2000 times
    pollsys(...,timeout=10 ms) - Sleep 1
    skgsltst(0x200058F8)       ... call repeated 2000 times
    pollsys(...,timeout=10 ms) - Sleep 2
    skgsltst(0x200058F8)       ... call repeated 2000 times
    pollsys(...,timeout=10 ms) - Sleep 3
    skgsltst(0x200058F8)       ... call repeated 2000 times
    pollsys(...,timeout=30 ms) - Sleep 4
    ...

The 2000 cycles is the value of the SPIN_COUNT initialization parameter. This value could be changed dynamically without Oracle instance restart. The corresponding Oracle event 10046 trace [14] is:

WAIT #0: nam='latch free' ela=1 p1=536893688 p2=29 p3=0
WAIT #0: nam='latch free' ela=1 p1=536893688 p2=29 p3=1
WAIT #0: nam='latch free' ela=1 p1=536893688 p2=29 p3=2
WAIT #0: nam='latch free' ela=3 p1=536893688 p2=29 p3=2

The sleep timeouts demonstrate the exponential backoff:

0.01-0.01-0.01-0.03-0.03-0.07-0.07-0.15-0.23-0.39-0.39-0.71-0.71-1.35-1.35-2.0-2.0-2.0-2.0... sec

This sequence can be almost perfectly fitted by the following formula:

timeout = 2^[(N_wait + 1)/2] − 1    (1)

However, such a sleep for a predefined time was not efficient. Typical latch holding time is less than 10 microseconds; a ten millisecond sleep was too long. Most waits were for nothing, because the latch was already free. In addition, the repeated sleeps resulted in many unnecessary spins, burned CPU and provoked CPU thrashing. It is not surprising that in Oracle 9.2-11g the exclusive latch get was changed significantly. DTrace demonstrates its code flow:

kslgetl(0x50006318, 1)         - KSL GET exclusive Latch
  sskgslgf(0x50006318)= 0      - Immediate latch get
  kslges(0x50006318, ...)      - Wait latch get
    skgslsgts(...,0x50006318)  - Spin latch get
      sskgslspin(0x50006318)   ... call repeated 20000 cycles
    kskthbwt(0x0)
    kslwlmod()                 - set up Wait List
    sskgslgf(0x50006318)= 0    - Immediate latch get
    skgpwwait                  - Sleep latch get
      semop(11, {17,-1,0}, 1)

Note the semop() operating system call. This is an infinite wait until posted: the call blocks the process until another process posts it during latch release. Therefore, in Oracle 9.2-11.2, all the latches in the default class 0 rely on wait posting. The latch sleeps without any timeout. This is more efficient than the previous algorithm. Contemporary latch statistics show that most latch waits are now less than 1 ms. In addition, spinning only once reduces CPU consumption.

However, this introduces a problem. If a wakeup post is lost in the OS, the waiters will sleep infinitely. This was a common problem in earlier 2.6.9 Linux kernels. Such losses can lead to instance hang because the process will never be woken up. Oracle solves this problem by the _ENABLE_RELIABLE_LATCH_WAITS parameter. It changes the semop() system call to a semtimedop() call with a 0.3 sec timeout.

DTrace demonstrated that by default the process spins for an exclusive latch for 20000 cycles. This is determined by the static _LATCH_CLASS_0 initialization parameter. The SPIN_COUNT parameter (by default 2000) is effectively static for exclusive latches [21]. Therefore the spin count for exclusive latches cannot be changed without instance restart.

Latches assigned to a non-default class wait until timeout. The number of spins and the duration of sleeps for class X are determined by the corresponding _LATCH_CLASS_X parameter, which is a string of:

"Spin Yield Waittime Sleep0 Sleep1 . . . Sleep7"

A detailed description of non-default latch classes can be found in [21].

Further DTrace investigations showed that the shared latch spin in Oracle 9.2-11g is governed by the SPIN_COUNT value and can be dynamically tuned. Experiments demonstrated that an X mode shared latch get spins by default up to 4000 cycles. An S mode get does not spin at all (or spins in an unknown way). A discussion of how the Oracle shared latch works can be found in [21]. The results are summarized in table 2.3.

Table 2.3. Shared latch acquisition

                 S mode get   X mode get
Held in S mode   Compatible   2*SPIN_COUNT
Held in X mode   0            2*SPIN_COUNT
Blocking mode    0            2*SPIN_COUNT

2.3.1

Latch Release
The Oracle process releases the latch in kslfre(laddr). To deal with invalidation storms [4], the process releases the latch nonatomically. Then it sets up a memory barrier using an atomic operation on an address individual to each process. This requires less bus invalidation and ensures propagation of the latch release to other local caches. This is not a fair policy: latch spinners on the local CPU board have preference. However, it is more efficient than an atomic release. Finally the process posts the first process in the list of waiters.

3

The latch contention

As was demonstrated in the previous chapter, since version 9.2 Oracle uses a completely new latch acquisition algorithm:

• Immediate latch get
• Spin latch get
• Add the process to the waiters queue
• Sleep until posted

Since version 10.2 many previously collected latch statistics have been deprecated. We have lost important additional information about latch performance. Here I will discuss the remaining statistics set.

3.1

Raw latch statistic counters
Latch statistics are the tool to estimate whether the latch acquisition works efficiently or whether we need to tune it. Oracle counts a broad range of latch related statistics. Table 3.1 contains the description of v$latch statistics columns from contemporary Oracle documentation [1], together with when and how each is changed. Oracle collects more statistics than are usually consumed by classic queuing models.

Table 3.1. Latch statistics

• GETS – Number of times the latch was requested in willing-to-wait mode. Incremented by one after latch acquisition.
• MISSES – Number of times the latch was requested in willing-to-wait mode and the requestor had to wait. Incremented by one after latch acquisition if a miss occurred.
• SLEEPS – Number of times a willing-to-wait latch request resulted in a session sleeping while waiting for the latch. Incremented by the number of times the process slept during latch acquisition.
• SPIN_GETS – Willing-to-wait latch requests which missed the first try but succeeded while spinning. Incremented by one after latch acquisition if a miss but not a sleep occurred. Counts only the first spin.
• WAIT_TIME – Elapsed time spent waiting for the latch (in microseconds). Incremented by the wait time spent during latch acquisition.
• IMMEDIATE_GETS – Number of times a latch was requested in no-wait mode. Incremented by one after each no-wait latch get.
• IMMEDIATE_MISSES – Number of times a no-wait latch request did not succeed. Incremented by one after an unsuccessful no-wait latch get.

GETS, MISSES, etc. are integral statistics counted from the startup of the instance. These values depend on the complete workload history. AWR and Statspack reports show changes of the integral statistics per snapshot interval. Usually these values are "averaged by hour", which is much longer than a typical latch performance spike. Another problem with AWR/Statspack reports is averaging over child latches. By default AWR gathers only summary data from v$latch. This greatly distorts latch efficiency coefficients. The latch statistics should not be averaged over child latches. To avoid averaging distortions, the following analysis uses the latch statistics from v$latch_parent and v$latch_children (or x$ksllt in Oracle versions before 11g).

The current workload is characterized by differential latch statistics and ratios.

Table 3.2. Differential (point in time) latch statistics
Arrival rate:          λ = ΔGETS / Δtime              AWR: "Get Requests" / "Snap Time (Elapsed)"
Gets efficiency:       ρ = ΔMISSES / ΔGETS            AWR: "Pct Get Miss" / 100
Sleeps ratio:          κ = ΔSLEEPS / ΔMISSES          AWR: "Avg Slps/Miss"
Wait time per second:  W = ΔWAIT_TIME / (10^6 Δtime)  AWR: "Wait Time (s)" / "Snap Time (Elapsed)"
Spin efficiency:       σ = ΔSPIN_GETS / ΔMISSES       AWR: "Spin Gets" / "Misses"
There exist several ways to choose the basic set of differential statistics. I will use the set closest to the AWR/Statspack way, containing "Arrival rate", "Gets efficiency", "Spin efficiency", "Sleeps ratio" and "Wait time per second". Table 3.2 defines these quantities. This work analyzes only wait latch gets. The no-wait (IMMEDIATE_...) gets add some complexity only for several latches. I will also assume Δtime to be small enough that the workload does not change significantly. Other statistics reported by AWR depend on these key statistics:

• The latch miss rate is ΔMISSES/Δtime = ρλ.
• The latch waits (sleeps) rate is ΔSLEEPS/Δtime = κρλ.
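The differential statistics of Table 3.2 are easy to compute from two snapshots of one child latch row. The sketch below uses the column names from Table 3.1; the snapshot numbers are invented for illustration:

```python
# Sketch: compute the differential latch statistics of Table 3.2 from two
# snapshots of a single child latch row. The counter values are invented.

def latch_rates(snap1, snap2, dt):
    d = {k: snap2[k] - snap1[k] for k in snap1}
    lam   = d["GETS"] / dt                # arrival rate lambda, Hz
    rho   = d["MISSES"] / d["GETS"]       # gets efficiency
    kappa = d["SLEEPS"] / d["MISSES"]     # sleeps ratio
    sigma = d["SPIN_GETS"] / d["MISSES"]  # spin efficiency
    W     = d["WAIT_TIME"] / 1e6 / dt     # wait time per second (WAIT_TIME is in us)
    return lam, rho, kappa, sigma, W

s1 = dict(GETS=0,       MISSES=0,     SLEEPS=0,   SPIN_GETS=0,     WAIT_TIME=0)
s2 = dict(GETS=1000000, MISSES=50000, SLEEPS=500, SPIN_GETS=49500, WAIT_TIME=2000000)
lam, rho, kappa, sigma, W = latch_rates(s1, s2, dt=60.0)
print(lam, rho, kappa, sigma, W)
```
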
From the queuing theory point of view, the latch is a G/G/1/(SIRO+FIFO) system with an interesting queue discipline combining Serve In Random Order spin and First In First Out sleep. Using the latch statistics, I can roughly estimate the queuing characteristics of the latch. I expect the accuracy of such estimations to be about 20-30%. As a first approximation, I will assume that the incoming latch requests stream is Poisson and that latch holding (service) times are exponentially distributed. Therefore, our first latch model will be M/M/1/(SIRO+FIFO). Oracle measures more statistics than are usually consumed by classic queuing models. It is interesting what these additional statistics can be used for.
3.2

Average service time:

The PASTA (Poisson Arrivals See Time Averages) [20] property connects the ρ ratio with the latch utilization. For Poisson streams the latch gets efficiency should be equal to the utilization:

ρ = Δmisses/Δgets ≈ U = Δlatch_hold_time/Δtime    (2)

However, this is not exact for a server with a finite number of processors. The Oracle process occupies the CPU while acquiring the latch. As a result, the latch get sees the utilization induced by the other N_CPU − 1 processors only. Compare this with the MVA [17] arrival theorem. In some benchmarks there may be only N_proc ≤ N_CPU Oracle shadow processes that generate the latch load. In such a case we should substitute N_proc instead of N_CPU in the following estimate:

ρ ≃ (1 − 1/min(N_CPU, N_proc)) U = U/η    (3)

Here I introduced the

η = min(N_CPU, N_proc) / (min(N_CPU, N_proc) − 1)

multiplier to correct the naive utilization estimation. Clearly, the η multiplier confirms that the entire approach is inapplicable to a single CPU machine. Really, η significantly differs from one only during demonstrations on my Dual-Core notebook. For servers its impact is below the precision of my estimates. For example, for a small 8 CPU server the η multiplier adds only a 14% correction.

We can experimentally check the accuracy of these formulas and, therefore, the Poisson arrivals approximation. U can be independently measured by sampling of v$latchholder. The latchprofx.sql script by Tanel Poder [18] does this at high frequency. Within our accuracy we can expect that ρ and U should be at least of the same order.

We know that U = λS, where S is the average service (latch holding) time. This allows us to estimate the latch holding time as:

S = ηρ/λ    (4)

This is interesting. We obtained the first estimation of latch holding time directly from statistics. In AWR terms this formula looks like:

S = η × "Pct Get Miss" × "Snap Time" / (100 × "Get Requests")
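The holding-time estimate (4) can be sketched numerically. The rates below are invented; the function only packages the η correction and S = ηρ/λ:

```python
# Sketch of the holding-time estimate (4): S = eta * rho / lambda,
# with eta = m/(m-1), m = min(N_CPU, N_proc). Input numbers are invented.

def holding_time(lam, rho, n_cpu, n_proc):
    m = min(n_cpu, n_proc)
    if m < 2:
        raise ValueError("estimate inapplicable to a single CPU machine")
    eta = m / (m - 1.0)       # correction multiplier from formula (3)
    return eta * rho / lam    # average latch holding time, seconds

# 8 CPU server: eta = 8/7, i.e. about a 14% correction
S = holding_time(lam=20000.0, rho=0.07, n_cpu=8, n_proc=100)
print("S = %.1f us" % (S * 1e6))   # prints "S = 4.0 us"
```
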
3.3

Wait time:

Look more closely at the summary wait time per second, W. Each latch acquisition increments the WAIT_TIME statistic by the amount of time it waited for the latch. According to the Little law, the average latch sleeping time is related to the length of the wait (sleep) queue:

L = λ_waits × ⟨average wait time⟩ = λρκ × ⟨wait time per sleep⟩

The right hand side of this identity is exactly the "wait time per second" statistic. Therefore, actually:

W ≡ L    (5)
We can experimentally confirm this conclusion because L can be independently measured by sampling of v$process.latchwait column.
3.4

Recurrent sleeps:

In the ideal situation, the process spins and sleeps only once. Consequently, the latch statistics should satisfy the following identity:

MISSES = SPIN_GETS + SLEEPS    (6)

Or, equivalently:

1 = σ + κ    (7)

In reality, some processes have to sleep for the latch several times. This occurs when the sleeping process was posted, but another process got the latch before the first process received the CPU. The awakened process spins and sleeps again. As a result, the above equality becomes invalid. Before version 10.2 Oracle directly counted these sequential waits in separate SLEEP1-SLEEP3 statistics. Since 10.2 these statistics became obsolete. However, we can estimate the rate of such "sleep misses" from the other basic statistics. A recurrent sleep increments only the SLEEPS counter; the SPIN_GETS statistic is not changed. Thus σ + κ − 1 is the ratio of inefficient latch sleeps to misses, and the ratio of "unsuccessful sleeps" to "sleeps" is given by:

Recurrent sleeps ratio = (σ + κ − 1)/κ    (8)
Normally this ratio should be close to ρ. Frequent "unsuccessful sleeps" are inefficient and may be a symptom of OS wait posting problems or a bursty workload.
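Formula (8) is a one-liner over the raw counters. The sketch below uses invented counter values:

```python
# Sketch of formula (8): the fraction of latch sleeps that were
# "unsuccessful" (the awakened process had to spin and sleep again).

def recurrent_sleep_ratio(spin_gets, sleeps, misses):
    sigma = spin_gets / misses   # spin efficiency
    kappa = sleeps / misses      # sleeps ratio
    return (sigma + kappa - 1.0) / kappa

# invented counters: 1000 misses, 980 spin gets, 40 sleeps
print(recurrent_sleep_ratio(980.0, 40.0, 1000.0))   # prints 0.5
```

Here half of the sleeps were recurrent: 980 + 40 = 1020 exceeds the 1000 misses by 20, and 20 of the 40 sleeps were repeats.
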
3.5

Latch acquisition time:

The average latch acquisition time is the sum of spin time and wait time. Oracle does not directly measure the spin time. However, we can measure it on the Solaris platform using DTrace. On other platforms, we should rely on statistics. Fortunately, in Oracle 9.2-10.2 one can count the average number of spinning processes by sampling x$ksupr.ksllalaq. The process sets this column equal to the address of the acquired latch during the active phase of the latch get. Oracle 8i and before even filled v$process.latchspin during latch spinning.

The Little law allows us to connect the average number of spinning processes with the spinning time:

N_s = λ T_s    (9)

As a result, the average latch acquisition time is:

T_a = λ^(-1) (N_s + W)    (10)

Note that according to general queuing theory the "Serve In Random Order" discipline of the latch spin does not affect the average latch acquisition time; it is independent of the queuing discipline. In steady state, the number of processes served during the passage of an incoming request through the system should be equal to the number of spinning and waiting processes. In Oracle 11g the latch spin is no longer instrumented due to a bug; the 11g spin is invisible to SQL. This does not allow us to estimate N_s and related quantities.
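Formulas (9) and (10) combine sampled and counted quantities. A minimal numeric sketch, with invented inputs:

```python
# Sketch of formula (10): T_a = (N_s + W) / lambda, where N_s is the sampled
# average number of spinning processes and W the wait time per second.
# All input numbers below are invented.

def acquisition_time(lam, n_spinning, W):
    return (n_spinning + W) / lam    # average acquisition time, seconds

Ta = acquisition_time(lam=20000.0, n_spinning=0.12, W=0.03)
print("Ta = %.1f us" % (Ta * 1e6))   # prints "Ta = 7.5 us"
```
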
3.6 Comparison of results
Let me compare the results of DTrace measurements and latch statistics. Typical demonstration results for our 2 CPU x86 server are:

/usr/sbin/dtrace -s latch_times.d -p 17242 0x5B7C75F8
...
latch gets traced: 165180
''Library cache latch'', address=5b7c75f8
Acquisition time:
   value  ------------- Distribution ------------- count
    4096 |                                         0
    8192 |@@                                       7324
   16384 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@     151748
   32768 |@                                        4493
   65536 |                                         1676
  131072 |                                         988
  262144 |                                         464
  524288 |                                         225
 1048576 |                                         211
 2097152 |                                         53
 4194304 |                                         21
 8388608 |                                         1
16777216 |                                         1
33554432 |                                         0
Holding time:
   value  ------------- Distribution ------------- count
    8192 |                                         0
   16384 |@@@@@@@@@@@@@@@@@@@@@@@@@                105976
   32768 |@@@@@@@@@@@@                             50877
   65536 |@@                                       6962
  131072 |                                         1986
  262144 |                                         829
  524288 |                                         330
 1048576 |                                         205
 2097152 |                                         34
 4194304 |                                         6
 8388608 |                                         0
Average acquisition time = 26 us
Average holding time     = 37 us
The above histograms show latch acquisition and holding time distributions in logarithmic scale. Values are in nanoseconds. Compare the above average times with
the results of latch statistics analysis under the same conditions:

Latch statistics for 0x5B7C75F8 ''library cache'' level#=5 child#=1
Requests rate:        lambda  = 20812.2 Hz
Miss/get:             rho     = 0.078
Est. Utilization:     eta*rho = 0.156
Sampled Utilization:  U       = 0.143
Slps/Miss:            kappa   = 0.013
Wait_time/sec:        W       = 0.025
Sampled queue length: L       = 0.043
Spin_gets/miss:       sigma   = 0.987
Sampled spinning:     Ns      = 0.123
Derived statistics:
Secondary sleeps ratio = 0.01
Avg latch holding time = 7.5 us
     sleeping time     = 1.2 us
     acquisition time  = 7.2 us
We can see that ηρ and W are close to the sampled U and L respectively. The holding and acquisition times obtained by both methods are of the same order. Since both methods are intrusive, this is a remarkable agreement. Measurements of latch times and distributions for demo and production workloads lead to the conclusion that the latch holding time on contemporary servers should normally be in the microseconds range.
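The derived statistics in the listing can be cross-checked from the sampled quantities alone. A sketch using the sample values printed above; the variable names are mine:

```python
# Cross-check of the derived latch statistics from the sampled values.
lam     = 20812.2   # requests per second (lambda)
eta_rho = 0.156     # estimated utilization
n_spin  = 0.123     # sampled average number of spinning processes (Ns)
w       = 0.025     # wait time per second (W)

holding     = eta_rho / lam       # utilization = lambda * holding time
sleeping    = w / lam             # Little's law applied to the wait queue
acquisition = (n_spin + w) / lam  # eq. (10)
print(f"holding={holding*1e6:.1f} us, sleeping={sleeping*1e6:.1f} us, "
      f"acquisition={acquisition*1e6:.1f} us")
```

The computed values reproduce the "derived statistics" block of the listing to within rounding.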
4 Latch contention in Oracle 9.2-11g
Latch contention should be suspected if the latch wait events are observed in the Top 5 Timed Events AWR section. Look for the latches with the highest W. Symptoms of contention for a latch are highly variable. The most commonly observed include:

• W > 0.1 sec/sec
• Utilization > 10%
• Acquisition (or sleeping) time significantly greater than holding time

The v$latch_misses fixed view and the latchprofx.sql script by Tanel Poder [18] reveal "where" the contention arises. One should always take into account that contention for a high-level latch frequently exacerbates contention for lower-level latches [13].

How to treat the latch contention? During the last 15 years, latch performance tuning was focused on application tuning and reducing the latch demand. To achieve this, one needs to tune the SQL operators, use bind variables, change the physical schema, etc. Classic Oracle performance books explore these topics [13, 14, 15]. However, this tuning methodology may be too expensive and may even require a complete application rewrite. This work explores the complementary possibility of changing SPIN_COUNT. This is commonly treated as old-style tuning, which should be avoided by all means. Increasing the spin count may lead to a waste of CPU. However, nowadays CPU power is cheap, and we may already have enough free resources. We need to find the conditions under which spin count tuning may be beneficial. Processes spin for an exclusive latch up to 20000 cycles, for a shared latch up to 4000 cycles, and infinitely for a mutex. Tuning may find more optimal values for your application. Oracle does not explicitly forbid spin count tuning. However, a change of an undocumented parameter should be discussed with Support.

4.1 Spin count adjustment

Spin count tuning depends on the latch type. For shared latches:

• Spin count can be adjusted dynamically by the SPIN_COUNT parameter.
• A good starting point is a multiple of the default 2000 value.
• Setting the SPIN_COUNT parameter in the initialization file should be accompanied by LATCH_CLASS_0="20000". Otherwise the spin for exclusive latches will be greatly affected by the next instance restart.

On the other hand, if the contention is for exclusive latches, then:

• Spin count adjustment by the LATCH_CLASS_0 parameter needs an instance restart.
• A good starting point is a multiple of the default 20000 value.
• It may be preferable to increase the number of "yields" for class 0 latches.

In order to tune the spin count efficiently, the root cause of latch contention must be diagnosed. Obviously, spin count tuning will only be effective if the latch holding time S is in its normal microseconds range. At any time the number of spinning processes should remain less than the number of CPUs. It is a common myth that CPU consumption will rise infinitely while we increase the spin count. Actually, the process will spin only up to the "residual latch holding time". The next chapter will explore this.
5 Latch spin CPU time

The spin probes the latch holding time distribution. To predict the effect of SPIN_COUNT tuning, let me introduce a mathematical model. It extends the model used in [9] to a general latch holding time distribution. As a cost function, I will estimate the CPU time consumed while spinning. Consider a general stream of latch acquisition events. The latch was acquired by some process at time Tk and released at Tk + hk, k ∈ N. Here hk is the latch holding time distributed with p.d.f. p(t). I will assume that Tk and hk are generally independent for any k and form a recurrent stream. Furthermore, I assume the existence of at least second moments for all the distributions. If Tk+1 < Tk + hk, then the latch will be busy when the next process tries to acquire it. A latch miss will occur. In this case the process will spin for the latch up to time ∆. The spin get will succeed if Tk+1 + ∆ > Tk + hk. The process will sleep for the latch if Tk+1 + ∆ < Tk + hk. Therefore, the conditions for the latch acquisition phases are:

latch miss:      Tk+1 < Tk + hk,
latch spin get:  Tk + hk − ∆ < Tk+1 < Tk + hk,    (11)
latch sleep:     Tk+1 + ∆ < Tk + hk.

If the latch miss occurs, the second process will observe that the latch remains busy for:

τk+1 = Tk + hk − Tk+1    (12)

This is the "residual time" [20] or time until first event [22] of latch release. Its distribution differs from that of hk. To reflect this, I will add the subscript l to all residual distributions. In addition, I will omit the subscript k for the stationary state. Let me denote the probability that the missed process sees the latch released at time less than t as Pl(τ < t) = Pl(t), and the probability of not releasing the latch during time t as Ql(τ ≥ t) = 1 − Pl(t). Therefore, the probability to spin for the latch during time less than t is

Psg(ts < t) = { Pl(t)  when t < ∆
              { 1      when t ≥ ∆    (13)

and has a discontinuity at t = ∆, because the process acquiring the latch never spins longer than ∆. The magnitude of this discontinuity is 1 − Pl(∆). This is the probability of a latch sleep. Therefore, the spinning probability density function has a bump at ∆:

psg = pl(t)H(∆ − t) + (1 − Pl(∆))δ(t − ∆)    (14)

Here H(x) and δ(x) are the Heaviside step and delta functions correspondingly. The spin efficiency σ is the probability to obtain the latch during the spin get:

σ = ∫_0^{∆−0} psg(t) dt = Pl(∆) = 1 − Ql(∆)    (15)
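The classification (11) lends itself to a quick Monte Carlo check. This sketch assumes, beyond the paper's general model, a Poisson request stream and an exponentially distributed holding time; with these assumptions the residual time (12) is again exponential, so the spin-get probability has the closed form 1 − exp(−∆/⟨t⟩):

```python
import random

# Monte Carlo check of the miss / spin-get / sleep classification (11)
# under assumed Poisson arrivals and exponential holding times.
random.seed(1)
mean_hold, mean_gap, delta = 1.0, 5.0, 0.5
misses = spin_gets = sleeps = 0
for _ in range(200_000):
    h   = random.expovariate(1 / mean_hold)   # holding time h_k
    gap = random.expovariate(1 / mean_gap)    # arrival gap T_{k+1} - T_k
    if gap < h:                               # latch miss
        misses += 1
        if h - gap < delta:                   # spin get succeeds within Delta
            spin_gets += 1
        else:                                 # process goes to sleep
            sleeps += 1
sigma = spin_gets / misses
predicted = 1 - 2.718281828 ** (-delta / mean_hold)
print(f"simulated sigma={sigma:.3f}, predicted={predicted:.3f}")
```

The simulated spin efficiency agrees with the memoryless prediction to within sampling noise.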
Oracle allows measuring the average number of spinning processes. This quantity is proportional to the average CPU time spent while spinning for the latch:

Γsg = ∫_0^∞ t psg(t) dt = ∫_0^∆ t pl(t) dt + ∆(1 − Pl(∆))    (16)

Integrating by parts, both expressions may be rewritten in the form:

σ = 1 − Ql(∆)
Γsg = ∆ − ∫_0^∆ Pl(t) dt = ∫_0^∆ Ql(t) dt    (17)

or, equivalently:

σ = 1 − Ql(∆)
Γsg = ⟨tl⟩ − ∫_∆^∞ Ql(t) dt    (18)
According to classic considerations from the renewal theory [20], the distribution of residual time is the transformed latch holding time distribution:
pl(t) = (1 − P(t))/⟨t⟩

The average residual latch holding time is ⟨tl⟩ = ⟨t²⟩/(2⟨t⟩). Incorporating this into the previous formulas for spin efficiency and CPU time results in:

σ = (1/⟨t⟩) ∫_0^∆ Q(t) dt
Γsg = (1/⟨t⟩) ∫_0^∆ dt ∫_t^∞ Q(z) dz    (19)

These nice formulas reassure us that the explored observables are not artifacts:

σ = (1/⟨t⟩) ∫_0^∆ dt ∫_t^∞ p(z) dz
Γsg = (1/⟨t⟩) ∫_0^∆ dt ∫_t^∞ dz ∫_z^∞ p(x) dx    (20)

It is clear that such a spin probes the latch holding time distribution around the origin.
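Formula (19) is easy to verify numerically for a concrete distribution. A sketch, assuming a holding time uniform on [0, 2], so that ⟨t⟩ = 1, Q(t) = 1 − t/2, and (19) gives the closed form σ = ∆ − ∆²/4:

```python
# Numeric check of eq. (19) for a uniform holding time on [0, 2].
def Q(t):
    """Survival function of the assumed uniform holding time."""
    return max(0.0, 1.0 - t / 2.0)

def sigma(delta, steps=100_000):
    mean_t = 1.0                   # <t> for uniform on [0, 2]
    dt = delta / steps
    # Midpoint-rule integration of (1/<t>) * integral of Q over [0, delta].
    total = sum(Q((i + 0.5) * dt) for i in range(steps)) * dt
    return total / mean_t

for d in (0.5, 1.0, 2.0):
    print(f"Delta={d}: numeric={sigma(d):.4f}, closed form={d - d*d/4:.4f}")
```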
Assuming the existence of second moments for the latch holding time distribution, we can proceed further. It is possible to change the integration orders using:

∫_t^∞ dz ∫_z^∞ p(x) dx = ∫_t^∞ z p(z) dz − t ∫_t^∞ p(z) dz

Utilizing this identity twice, we arrive at the following expression:

Γsg = (1/(2⟨t⟩)) ∫_0^∆ t² p(t) dt + (∆/⟨t⟩) ∫_∆^∞ (t − ∆/2) p(t) dt    (21)

Other parts of the latch holding time distribution impact the spinning efficiency and CPU consumption only through the average holding time ⟨t⟩. This allows estimating how these quantities depend upon a SPIN_COUNT (∆) change. I will focus on two regions where analytical estimations are possible. To estimate the effect of spin count tuning, we need approximate scaling rules depending on the value of the "spin efficiency" σ = "Spin gets/Miss".

5.1 Spin count tuning when spin efficiency is low

The spin may be inefficient, σ ≪ 1. In this low efficiency region, (20) can be rewritten in the form:

σ = ∆/⟨t⟩ − (1/⟨t⟩) ∫_0^∆ (∆ − t) p(t) dt
Γsg = ∆ − ∆²/(2⟨t⟩) + (1/(2⟨t⟩)) ∫_0^∆ (∆ − t)² p(t) dt

If processes never release the latch immediately (p(0) = 0), then:

σ = ∆/⟨t⟩ + O(∆³)
Γsg = ∆ − ∆²/(2⟨t⟩) + O(∆⁴)    (22)

For Oracle performance tuning purposes, we need to know what happens if we double the spin count: in the low efficiency region, doubling the spin count will double the "spin efficiency" and also double the CPU consumption. These estimations are especially useful in the case of severe latch contention, and for the other type of Oracle spinlocks, the mutex.

5.2 Spin count tuning when efficiency is high

In the high efficiency region, the sleep cuts off the tail of the latch holding time distribution:

σ = 1 − (1/⟨t⟩) ∫_∆^∞ (t − ∆) p(t) dt
Γsg = ⟨t²⟩/(2⟨t⟩) − (1/(2⟨t⟩)) ∫_∆^∞ (t − ∆)² p(t) dt

Oracle normally operates in this region of a small latch sleeps ratio. Here the spin count is greater than the number of instructions protected by the latch, ∆ ≫ ⟨t⟩. From the above it is clear that the spin time is bounded by both the "residual latch holding time" and the spin count:

Γsg < min(⟨t²⟩/(2⟨t⟩), ∆)

The sleep prevents the process from wasting CPU spinning through the heavy tail of the latch holding time distribution. Normally the latch holding time distribution has an exponential tail:

Q(t) ∼ C exp(−t/τ)
κ = 1 − σ ∼ C′ exp(−∆/τ),   C′ = Cτ/⟨t⟩
Γsg ∼ ⟨t²⟩/(2⟨t⟩) − C′τ exp(−∆/τ)

It is easy to see that if the "sleep ratio" is small, κ = 1 − σ ≪ 1, then doubling the spin count will square the sleep ratio coefficient, while adding only a contribution of order κ to the spin CPU consumption. I would like to paraphrase this for Oracle performance tuning purposes as: if the "sleep ratio" for an exclusive latch is 10%, then an increase of the spin count to 40000 may result in a 10 times decrease of "latch free" wait events, and only a 10% increase of CPU consumption. In other words, if the spin is already efficient, it is worth increasing the spin count. This exponential law can be compared to Guy Harrison's experimental data [24].
5.3 Long distribution tails: CPU thrashing
A frequent origin of long latch holding time distribution tails is so-called CPU thrashing. The latch contention itself can cause CPU starvation: processes contending for a latch also contend for the CPU. Vice versa, a lack of CPU power causes latch contention. Once the CPU starves, the operating system run queue length increases and the load average exceeds the number of CPUs. Some operating systems may shrink the time quanta under such conditions. As a result, latch holders may not receive enough time to release the latch; the latch acquirers preempt the latch holders. The throughput falls because the latch holders do not receive CPU to complete their work. However, overall CPU consumption remains high. This seems to be a metastable state, observed while the server workload approaches 100% CPU. The KGX mutexes are even more prone to this transition. Due to OS preemption, the residual latch holding time rises to the CPU scheduling scale, up to milliseconds and more. Spin count tuning is useless in this case. The common advice to prevent CPU thrashing is to tune SQL in order to reduce CPU consumption. Fixed-priority OS scheduling classes may also be helpful. Future works will explore this phenomenon.
6 Conclusions
This work investigated the possibilities to diagnose and tune latches, the most commonly used Oracle spinlocks.
Using DTrace, it explored how the contemporary latch works, its spinning-blocking strategies, and the corresponding parameters and statistics. A mathematical model was developed to estimate the effect of tuning the spin count. The results are important for precise performance tuning of highly loaded Oracle OLTP databases.
7 Acknowledgements
Thanks to Professor S.V. Klimenko for kindly inviting me to the MEDIAS 2011 conference. Thanks to RDTEX CEO I.G. Kunitsky for financial support. Thanks to RDTEX Technical Support Centre Director S.P. Misiura for years of encouragement and support of my investigations.
References

[1] Oracle Database Concepts 11g Release 2 (11.2). 2010.
[2] E. W. Dijkstra. 1965. Solution of a problem in concurrent programming control. Commun. ACM 8, 9 (September 1965), 569. DOI=10.1145/365559.365617
[3] J. H. Anderson and Yong-Jik Kim. 2003. Shared-memory Mutual Exclusion: Major Research Trends Since 1986.
[4] T. E. Anderson. 1990. The Performance of Spin Lock Alternatives for Shared-Memory Multiprocessors. IEEE Trans. Parallel Distrib. Syst. 1, 1 (January 1990), 6-16. DOI=10.1109/71.80120
[5] John M. Mellor-Crummey and Michael L. Scott. 1991. Algorithms for scalable synchronization on shared-memory multiprocessors. ACM Trans. Comput. Syst. 9, 1 (February 1991), 21-65. DOI=10.1145/103727.103729
[6] M. Herlihy and N. Shavit. 2008. The Art of Multiprocessor Programming. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA. ISBN: 978-0123705914. Chapter 7, "Spin Locks and Contention".
[7] T. E. Anderson, D. D. Lazowska, and H. M. Levy. 1989. The performance implications of thread management alternatives for shared-memory multiprocessors. SIGMETRICS Perform. Eval. Rev. 17, 1 (April 1989), 49-60. DOI=10.1145/75372.75378
[8] J. K. Ousterhout. 1982. Scheduling techniques for concurrent systems. In Proc. Conf. on Distributed Computing Systems.
[9] Beng-Hong Lim and Anant Agarwal. 1993. Waiting algorithms for synchronization in large-scale multiprocessors. ACM Trans. Comput. Syst. 11, 3 (August 1993), 253-294. DOI=10.1145/152864.152869
[10] Anna R. Karlin, Mark S. Manasse, Lyle A. McGeoch, and Susan Owicki. 1990. Competitive randomized algorithms for non-uniform problems. In Proceedings of SODA '90. Society for Industrial and Applied Mathematics, Philadelphia, PA, USA, 301-309.
[11] L. Boguslavsky, K. Harzallah, A. Kreinen, K. Sevcik, and A. Vainshtein. 1994. Optimal strategies for spinning and blocking. J. Parallel Distrib. Comput. 21, 2 (May 1994), 246-254. DOI=10.1006/jpdc.1994.1056
[12] Ryan Johnson, Manos Athanassoulis, Radu Stoica, and Anastasia Ailamaki. 2009. A new look at the roles of spinning and blocking. In Proceedings of DaMoN '09. ACM, New York, NY, USA, 21-26. DOI=10.1145/1565694.1565700
[13] Steve Adams. 1999. Oracle8i Internal Services for Waits, Latches, Locks, and Memory. O'Reilly Media. ISBN: 978-1565925984
[14] C. Millsap and J. Holt. 2003. Optimizing Oracle Performance. O'Reilly & Associates. ISBN: 978-0596005276
[15] Richmond Shee, Kirtikumar Deshpande, and K. Gopalakrishnan. 2004. Oracle Wait Interface: A Practical Guide to Performance Diagnostics & Tuning. McGraw-Hill Osborne Media. ISBN: 978-0072227291
[16] Bryan M. Cantrill, Michael W. Shapiro, and Adam H. Leventhal. 2004. Dynamic instrumentation of production systems. In Proceedings of USENIX ATEC '04. USENIX Association, Berkeley, CA, USA.
[17] M. Reiser and S. Lavenberg. 1980. Mean Value Analysis of Closed Multichain Queueing Networks. JACM 27, 313-322.
[18] Tanel Poder blog. Core IT for Geeks and Pros. http://blog.tanelpoder.com
[19] Anna R. Karlin, Kai Li, Mark S. Manasse, and Susan Owicki. 1991. Empirical studies of competitive spinning for a shared-memory multiprocessor. SIGOPS Oper. Syst. Rev. 25, 5 (September 1991), 41-55. DOI=10.1145/121133.286599
[20] L. Kleinrock. 1975. Queueing Systems, Volume I: Theory. Wiley-Interscience. ISBN: 0471491101
[21] Andrey Nikolaev blog. Latch, mutex and beyond. http://andreynikolaev.wordpress.com
[22] F. Zernike. 1929. "Weglängenparadoxon". Handbuch der Physik 4. Geiger and Scheel eds., Springer, Berlin, p. 440.
[23] B. Sinharoy et al. 1996. Improving Software MP Efficiency for Shared Memory Systems. Proc. of the 29th Annual Hawaii International Conference on System Sciences.
[24] Guy Harrison. 2008. Using spin count to reduce latch contention in 11g. Yet Another Database Blog. http://guyharrison.squarespace.com/
About the author

Andrey Nikolaev is an expert at RDTEX First Line Oracle Support Center, Moscow. His contact email is [email protected].