POSTGRESQL is an open-source, full-featured relational database. This
presentation gives an overview of the shared memory structures used by
Postgres.
Inside PostgreSQL Shared Memory BRUCE MOMJIAN
Creative Commons Attribution License
http://momjian.us/presentations
Last updated: May, 2017
Outline
1. File storage format
2. Shared memory creation
3. Shared buffers
4. Row value access
5. Locking
6. Other structures
File System /data
[Figure: three Postgres processes, all accessing the data directory /data.]
File System /data/base
[Figure: three Postgres processes accessing /data and its subdirectories.]
/data
    /base /global /pg_clog /pg_multixact /pg_subtrans /pg_tblspc /pg_twophase /pg_xlog
File System /data/base/db
[Figure: Postgres processes accessing the per-database directories under /data/base.]
/data/base
    /16385 (production)
    /1 (template1)
    /16821 (test)
    /17982 (devel)
    /21452 (marketing)
File System /data/base/db/table
[Figure: Postgres processes accessing the per-table files under /data/base/16385.]
/data/base/16385
    /24692 (customer)
    /27214 (order)
    /25932 (product)
    /25952 (employee)
    /27839 (part)
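The directory nesting on this slide (data directory, then base, then a database directory, then a table file) can be sketched as a path-building helper. This is only an illustration: sketch_relation_path is a hypothetical name, not a PostgreSQL function, and the numbers used are the ones shown on the slide.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: build "<datadir>/base/<database dir>/<table file>"
 * into buf.  Returns what snprintf returns, i.e. the length the full path
 * needs (excluding the terminating NUL).
 */
int
sketch_relation_path(char *buf, size_t buflen, const char *datadir,
                     unsigned int db_dir, unsigned int table_file)
{
    return snprintf(buf, buflen, "%s/base/%u/%u", datadir, db_dir, table_file);
}
```

For the slide's values, sketch_relation_path(path, sizeof(path), "/data", 16385, 24692) fills path with /data/base/16385/24692, the file of the customer table in the production database.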
File System Data Pages
[Figure: the table file /data/base/16385/24692 is a sequence of 8k pages, accessed by the Postgres processes.]
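Because the table file is a sequence of 8k pages, locating a page is plain arithmetic: page N starts at byte N * 8192. The sketch below assumes that fixed page size and uses standard C file I/O for brevity; the function names are hypothetical and this is not how PostgreSQL itself reads pages.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define SKETCH_BLOCK_SIZE 8192L   /* the 8k page size shown on the slide */

/* Byte offset of page block_no within a table file. */
long
sketch_block_offset(long block_no)
{
    return block_no * SKETCH_BLOCK_SIZE;
}

/*
 * Read one whole page into "page" (which must hold SKETCH_BLOCK_SIZE bytes).
 * Returns 0 on success, -1 on a seek failure or a short read.
 */
int
sketch_read_block(FILE *fp, long block_no, unsigned char *page)
{
    if (fseek(fp, sketch_block_offset(block_no), SEEK_SET) != 0)
        return -1;
    if (fread(page, 1, SKETCH_BLOCK_SIZE, fp) != (size_t) SKETCH_BLOCK_SIZE)
        return -1;
    return 0;
}
```

Reading a page past the end of the file returns -1 here, since the read comes back short.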
Data Pages
[Figure: one 8K page of /data/base/16385/24692, expanded to show its layout.]
Page Header
Item, Item, Item
Tuple, Tuple, Tuple
Special
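A much-simplified sketch of this page layout follows. It assumes that item entries grow forward from the page header while tuple data is placed from the end of the page backward, tracked by two offsets called lower and upper here. All type and function names are hypothetical, the special area is left out, and the real PostgreSQL page header carries more fields.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define SKETCH_PAGE_SIZE 8192

typedef struct { uint16_t lower; uint16_t upper; } SketchPageHeader;
typedef struct { uint16_t off; uint16_t len; } SketchItem;

typedef union
{
    SketchPageHeader hdr;                 /* header overlays the first bytes */
    unsigned char bytes[SKETCH_PAGE_SIZE];
} SketchPage;

void
sketch_page_init(SketchPage *p)
{
    p->hdr.lower = sizeof(SketchPageHeader);  /* end of the item array */
    p->hdr.upper = SKETCH_PAGE_SIZE;          /* start of the tuple data */
}

/* Store a tuple; returns its item number (0-based), or -1 if the page is full. */
int
sketch_page_add(SketchPage *p, const void *tuple, uint16_t len)
{
    SketchItem item;

    if ((size_t) (p->hdr.upper - p->hdr.lower) < sizeof(item) + len)
        return -1;
    p->hdr.upper -= len;                      /* tuples fill from the end */
    memcpy(p->bytes + p->hdr.upper, tuple, len);
    item.off = p->hdr.upper;
    item.len = len;
    memcpy(p->bytes + p->hdr.lower, &item, sizeof(item));
    p->hdr.lower += sizeof(item);             /* items fill from the front */
    return (int) ((p->hdr.lower - sizeof(SketchPageHeader)) / sizeof(item)) - 1;
}

/* Look up a tuple through its item entry. */
const unsigned char *
sketch_page_get(const SketchPage *p, int item_no, uint16_t *len)
{
    SketchItem item;

    memcpy(&item, p->bytes + sizeof(SketchPageHeader)
           + (size_t) item_no * sizeof(item), sizeof(item));
    *len = item.len;
    return p->bytes + item.off;
}
```

The indirection through the item array is what lets a tuple be addressed as a (page, item) pair, as the ctid field on the tuple slide does.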
File System Block Tuple
[Figure: the same 8K page of /data/base/16385/24692 (Page Header, Items, Tuples, Special), with one Tuple pulled out of the page.]
File System Tuple
[Figure: a Tuple consists of a Header followed by Values. The figure is labeled with int4in('9241') for a value going in and textout() for the value 'Martin' coming out.]
Tuple header fields:
OID - object id of tuple (optional)
xmin - creation transaction id
xmax - destruction transaction id
cmin - creation command id
cmax - destruction command id
ctid - tuple id (page / item)
natts - number of attributes
infomask - tuple flags
hoff - length of tuple header
bits - bit map representing NULLs
Tuple Header C Structures
    typedef struct HeapTupleFields
    {
        TransactionId t_xmin;       /* inserting xact ID */
        TransactionId t_xmax;       /* deleting or locking xact ID */

        union
        {
            CommandId     t_cid;    /* inserting or deleting command ID, or both */
            TransactionId t_xvac;   /* VACUUM FULL xact ID */
        } t_field3;
    } HeapTupleFields;

    typedef struct HeapTupleHeaderData
    {
        union
        {
            HeapTupleFields  t_heap;
            DatumTupleFields t_datum;
        } t_choice;

        ItemPointerData t_ctid;     /* current TID of this or newer tuple */

        /* Fields below here must match MinimalTupleData! */

        uint16      t_infomask2;    /* number of attributes + various flags */

        uint16      t_infomask;     /* various flag bits, see below */

        uint8       t_hoff;         /* sizeof header incl. bitmap, padding */

        /* ^ - 23 bytes - ^ */

        bits8       t_bits[1];      /* bitmap of NULLs -- VARIABLE LENGTH */

        /* MORE DATA FOLLOWS AT END OF STRUCT */
    } HeapTupleHeaderData;
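The "23 bytes" marker in the structure can be checked with offsetof on a stand-in definition. The sketch below assumes 32-bit transaction and command IDs and a 6-byte tuple ID made of three 16-bit fields. The Sketch* types are simplified stand-ins rather than the PostgreSQL headers, and the t_choice union is reduced to its heap member.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct
{
    uint32_t t_xmin;               /* inserting xact ID */
    uint32_t t_xmax;               /* deleting or locking xact ID */
    union
    {
        uint32_t t_cid;            /* command ID */
        uint32_t t_xvac;           /* VACUUM FULL xact ID */
    } t_field3;
} SketchHeapTupleFields;           /* 12 bytes */

typedef struct
{
    uint16_t block_hi;             /* page number, high half (assumed layout) */
    uint16_t block_lo;             /* page number, low half */
    uint16_t item;                 /* item number within the page */
} SketchItemPointer;               /* 6 bytes */

typedef struct
{
    SketchHeapTupleFields t_heap;  /* stands in for the t_choice union */
    SketchItemPointer     t_ctid;
    uint16_t              t_infomask2;
    uint16_t              t_infomask;
    uint8_t               t_hoff;
    uint8_t               t_bits[1];   /* NULL bitmap starts here */
} SketchHeapTupleHeader;

/* Size of the fixed part of the header, i.e. where the NULL bitmap begins. */
size_t
sketch_fixed_header_size(void)
{
    return offsetof(SketchHeapTupleHeader, t_bits);
}
```

On common ABIs this is 12 + 6 + 2 + 2 + 1 = 23, matching the comment in the structure.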
Shared Memory Creation
[Figure: the postmaster calls fork() to create each postgres process. Every process has its own Program (Text), Data, and Stack, and all of them include the same Shared Memory.]
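The effect shown in the figure, where memory set up before fork() stays shared while ordinary data becomes private to each process, can be demonstrated with a small sketch. It assumes a POSIX system with anonymous shared mmap. The slide does not say which system calls PostgreSQL uses to create its shared memory, so the mmap call is a stand-in, and sketch_fork_demo is a hypothetical name.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE 1          /* for MAP_ANONYMOUS on glibc */
#endif
#include <assert.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * The child stores "value" in a shared int and in an ordinary local int.
 * The parent then reports what it sees in each.  Returns 0 on success.
 */
int
sketch_fork_demo(int value, int *shared_seen, int *private_seen)
{
    int    private_value = 0;
    int   *shared_value;
    pid_t  pid;

    /* Create the shared region before forking so the child inherits it. */
    shared_value = mmap(NULL, sizeof(int), PROT_READ | PROT_WRITE,
                        MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    if (shared_value == MAP_FAILED)
        return -1;
    *shared_value = 0;

    pid = fork();
    if (pid == 0)
    {
        *shared_value = value;     /* visible to the parent */
        private_value = value;     /* changes only the child's copy */
        _exit(0);
    }
    if (pid < 0 || waitpid(pid, NULL, 0) != pid)
    {
        munmap(shared_value, sizeof(int));
        return -1;
    }

    *shared_seen = *shared_value;
    *private_seen = private_value;
    munmap(shared_value, sizeof(int));
    return 0;
}
```

After a successful call, the parent sees the child's write through the shared region, while its own private_value is still 0.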
Shared Memory
[Figure: the structures held in shared memory.]
PROC
Lightweight Locks
XLOG Buffers
Proc Array
Lock Hashes
CLOG Buffers
LOCK
Subtrans Buffers
Auto Vacuum
PROCLOCK
Btree Vacuum
Two-Phase Structs
Multi-XACT Buffers
Statistics
Background Writer
Synchronized Scan
Shared Invalidation
Buffer Descriptors
Shared Buffers
Semaphores
Shared Buffers
[Figure: Postgres processes read() 8k pages from /data/base/16385/24692 into Shared Buffers and write() them back. A page in a shared buffer has the same layout as on disk: Page Header, Items, Tuples, Special.]
Buffer Descriptors
Pin Count - prevent page replacement
LWLock - for page changes
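The role of the pin count can be sketched as follows: a buffer is a candidate for replacement only while no one has it pinned. This is a single-process illustration with hypothetical names. It ignores the locking a real shared buffer pool needs, and it picks the first unpinned buffer rather than modelling any particular replacement policy.

```c
#include <assert.h>

typedef struct
{
    int block_no;     /* which page of the file this buffer holds */
    int pin_count;    /* number of current users of the buffer */
} SketchBufferDesc;

/* Mark the buffer as in use so its page cannot be replaced. */
void
sketch_pin(SketchBufferDesc *buf)
{
    buf->pin_count++;
}

/* Drop one use of the buffer. */
void
sketch_unpin(SketchBufferDesc *buf)
{
    assert(buf->pin_count > 0);
    buf->pin_count--;
}

/* Return the index of the first buffer that may be replaced, or -1 if none. */
int
sketch_find_victim(const SketchBufferDesc *descs, int n)
{
    for (int i = 0; i < n; i++)
    {
        if (descs[i].pin_count == 0)
            return i;
    }
    return -1;
}
```

While every buffer is pinned, sketch_find_victim returns -1 and no page can be replaced.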
HeapTuples
[Figure: a Postgres process holds a HeapTuple, a C pointer to a Tuple inside an 8K page (Page Header, Items, Tuples, Special) in Shared Buffers. The Tuple consists of a Header followed by Values, labeled with int4in('9241') and with textout() returning 'Martin'.]
Tuple header fields:
OID - object id of tuple (optional)
xmin - creation transaction id
xmax - destruction transaction id
cmin - creation command id
cmax - destruction command id
ctid - tuple id (page / item)
natts - number of attributes
infomask - tuple flags
hoff - length of tuple header
bits - bit map representing NULLs
Finding A Tuple Value in C
    Datum
    nocachegetattr(HeapTuple tuple, int attnum, TupleDesc tupleDesc, bool *isnull)
    {
        HeapTupleHeader tup = tuple->t_data;
        Form_pg_attribute *att = tupleDesc->attrs;

        {
            int i;

            /*
             * Note - This loop is a little tricky.  For each non-null attribute,
             * we have to first account for alignment padding before the attr,
             * then advance over the attr based on its length.  Nulls have no
             * storage and no alignment padding either.  We can use/set
             * attcacheoff until we reach either a null or a var-width attribute.
             */
            off = 0;
            for (i = 0;; i++)       /* loop exit is at "break" */
            {
                if (HeapTupleHasNulls(tuple) && att_isnull(i, bp))
                    continue;       /* this cannot be the target att */

                if (att[i]->attlen == -1)
                    off = att_align_pointer(off, att[i]->attalign, -1, tp + off);
                else
                    /* not varlena, so safe to use att_align_nominal */
                    off = att_align_nominal(off, att[i]->attalign);

                if (i == attnum)
                    break;

                off = att_addlength_pointer(off, att[i]->attlen, tp + off);
            }
        }

        return fetchatt(att[attnum], tp + off);
    }
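The loop above can be illustrated with a stand-alone sketch restricted to fixed-width attributes: skip NULLs, pad to the attribute's alignment, stop at the target, otherwise advance by the attribute's length. The names and the SketchAttr type are hypothetical. The sketch assumes a bitmap in which a set bit means "not NULL" and alignments that are powers of two, and it leaves out variable-length attributes entirely.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef struct
{
    size_t attlen;      /* fixed width in bytes */
    size_t attalign;    /* required alignment: 1, 2, 4 or 8 */
} SketchAttr;

/* True if attribute attnum is NULL (its bit in the bitmap is clear). */
bool
sketch_att_isnull(int attnum, const uint8_t *bits)
{
    return !(bits[attnum >> 3] & (1 << (attnum & 0x07)));
}

/* Round off up to the next multiple of align (a power of two). */
size_t
sketch_align(size_t off, size_t align)
{
    return (off + align - 1) & ~(align - 1);
}

/*
 * Offset of attribute attnum (0-based) within the tuple's data area.
 * bits is the NULL bitmap, or NULL if the tuple has no NULLs.
 * The target attribute itself must not be NULL.
 */
size_t
sketch_att_offset(const SketchAttr *att, int attnum, const uint8_t *bits)
{
    size_t off = 0;

    for (int i = 0;; i++)           /* loop exit is at "break" */
    {
        if (bits != NULL && sketch_att_isnull(i, bits))
            continue;               /* NULLs take no storage, no padding */
        off = sketch_align(off, att[i].attalign);
        if (i == attnum)
            break;
        off += att[i].attlen;
    }
    return off;
}
```

With attributes of width 1, 4, 2, and 4 bytes, the fourth attribute sits at offset 12 when nothing is NULL, but at offset 4 when the second attribute is NULL, which is why the offset cannot be cached past a NULL.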
Value Access in C
    #define fetch_att(T,attbyval,attlen) \
    ( \
        (attbyval) ? \
        ( \
            (attlen) == (int) sizeof(int32) ? \
                Int32GetDatum(*((int32 *)(T))) \
            : \
            ( \
                (attlen) == (int) sizeof(int16) ? \
                    Int16GetDatum(*((int16 *)(T))) \
                : \
                ( \
                    AssertMacro((attlen) == 1), \
                    CharGetDatum(*((char *)(T))) \
                ) \
            ) \
        ) \
        : \
        PointerGetDatum((char *) (T)) \
    )
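The same decision tree can be written as an ordinary function, which may be easier to follow than the nested macro. This is a sketch under assumptions: SketchDatum is a hypothetical pointer-sized integer standing in for Datum, and memcpy replaces the macro's direct pointer dereference so the example does not depend on alignment.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

typedef uintptr_t SketchDatum;   /* stand-in for Datum: value or pointer */

/*
 * By-value attributes of width 4, 2 or 1 are copied into the datum itself;
 * everything else is returned as a pointer to the stored bytes.
 */
SketchDatum
sketch_fetch_att(const void *p, int attbyval, int attlen)
{
    if (!attbyval)
        return (SketchDatum) (uintptr_t) p;

    if (attlen == (int) sizeof(int32_t))
    {
        int32_t v;

        memcpy(&v, p, sizeof(v));
        return (SketchDatum) v;
    }
    if (attlen == (int) sizeof(int16_t))
    {
        int16_t v;

        memcpy(&v, p, sizeof(v));
        return (SketchDatum) v;
    }
    assert(attlen == 1);
    return (SketchDatum) *(const unsigned char *) p;
}
```

For a by-reference value such as the text 'Martin', the datum is simply the address of the stored bytes, so nothing is copied.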
Test And Set Lock Can Succeed Or Fail
[Figure: the process exchanges a 1 into the lock, which holds 0 or 1. Success: the lock was 0 on exchange. Failure: the lock was 1 on exchange, so the lock was already taken.]
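The exchange described here can be sketched portably with C11 atomics: write a 1 into the lock and look at the value that was there before. The names are hypothetical and the atomics are a stand-in, since PostgreSQL uses platform-specific code for this. The return convention follows the slide: 0 means the lock was free and is now held, 1 means it was already taken.

```c
#include <assert.h>
#include <stdatomic.h>

typedef atomic_uchar sketch_slock_t;

/* Returns 0 on success (was 0 on exchange), 1 on failure (was 1). */
int
sketch_tas(sketch_slock_t *lock)
{
    return atomic_exchange(lock, 1);
}

/* Release the lock by storing 0. */
void
sketch_tas_unlock(sketch_slock_t *lock)
{
    atomic_store(lock, 0);
}
```

A failed attempt leaves the lock unchanged at 1, so the caller has to decide whether to retry.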
Test And Set Lock x86 Assembler
    static __inline__ int
    tas(volatile slock_t *lock)
    {
        register slock_t _res = 1;

        /*
         * Use a non-locking test before asserting the bus lock.  Note that the
         * extra test appears to be a small loss on some x86 platforms and a small
         * win on others; it's by no means clear that we should keep it.
         */
        __asm__ __volatile__(
            "   cmpb    $0,%1   \n"
            "   jne     1f      \n"
            "   lock            \n"
            "   xchgb   %0,%1   \n"
            "1: \n"
    :       "+q"(_res), "+m"(*lock)
    :
    :       "memory", "cc");
        return (int) _res;
    }
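Spin Lock Always Succeeds
[Figure: the test-and-set exchange is retried until it succeeds. Success: the lock was 0 on exchange. Failure: the lock was 1 on exchange, so the lock was already taken, and the process retries after a sleep of increasing duration.]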
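The retry loop in this figure can be sketched with a C11 atomic_flag and a sleep that doubles after every failed attempt. The starting delay, the cap, and the function names are arbitrary assumptions for illustration, and nanosleep assumes a POSIX system.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L    /* for nanosleep */
#endif
#include <assert.h>
#include <stdatomic.h>
#include <time.h>

/*
 * Keep trying the test-and-set until it succeeds, sleeping longer after
 * each failure.  Returns the number of failed attempts.
 */
int
sketch_spin_acquire(atomic_flag *lock)
{
    long delay_ns = 1000;          /* arbitrary starting delay */
    int  failures = 0;

    while (atomic_flag_test_and_set(lock))
    {
        struct timespec ts = {0, delay_ns};

        failures++;
        nanosleep(&ts, NULL);
        if (delay_ns < 1000000)    /* arbitrary cap of about 1 ms */
            delay_ns *= 2;
    }
    return failures;
}

/* Release the lock so another process or thread can take it. */
void
sketch_spin_release(atomic_flag *lock)
{
    atomic_flag_clear(lock);
}
```

Unlike a bare test-and-set, sketch_spin_acquire returns only once the lock is held.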
Spinlocks are designed for short-lived locking operations, like
Light Weight Locks Sleep On Lock
[Figure: the shared memory layout from the "Shared Memory" slide, repeated with its full set of structures, including Lightweight Locks and Semaphores.]
Lightweight locks attempt to acquire the lock and go to sleep on a semaphore if the lock request fails. Spinlocks control access to
Database Object Locks
[Figure: the shared memory structures used for database object locks: PROC, PROCLOCK, LOCK, and the Lock Hashes.]
Proc
[Figure: the Proc Array alongside six PROC slots marked, in order: empty, used, used, empty, used, empty.]
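A fixed array of slots that are either empty or used, as drawn here, can be sketched as follows. The slot contents, the first-empty allocation rule, and all names are assumptions for illustration; the slide shows only the empty/used pattern.

```c
#include <assert.h>
#include <stdbool.h>

typedef struct
{
    bool used;    /* false = empty slot */
    int  pid;     /* hypothetical payload: owning process id */
} SketchProc;

/* Claim the first empty slot for pid; returns its index, or -1 if all are used. */
int
sketch_proc_alloc(SketchProc *procs, int n, int pid)
{
    for (int i = 0; i < n; i++)
    {
        if (!procs[i].used)
        {
            procs[i].used = true;
            procs[i].pid = pid;
            return i;
        }
    }
    return -1;
}

/* Return a slot to the empty state so it can be reused. */
void
sketch_proc_free(SketchProc *procs, int slot)
{
    procs[slot].used = false;
    procs[slot].pid = 0;
}
```

Starting from the figure's pattern (empty, used, used, empty, used, empty), successive allocations would fill slots 0, 3, and 5 before failing.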
Other Shared Memory Structures
[Figure: the full shared memory layout.]
PROC, Lightweight Locks, XLOG Buffers, Proc Array, Lock Hashes, CLOG Buffers, LOCK, Subtrans Buffers, Auto Vacuum, PROCLOCK, Btree Vacuum, Two-Phase Structs, Multi-XACT Buffers, Statistics, Background Writer, Synchronized Scan, Shared Invalidation, Buffer Descriptors, Shared Buffers, Semaphores
Conclusion
http://momjian.us/presentations
https://www.flickr.com/photos/john_getchel/