Oracle Clusterware and Private Network Considerations
Much of this presentation is attributed to Michael Zoll and work done by the RAC Performance Development group
The following is intended to outline our general product direction. It is intended for information purposes only, and may not be incorporated into any contract. It is not a commitment to deliver any material, code, or functionality, and should not be relied upon in making purchasing decisions. The development, release, and timing of any features or functionality described for Oracle’s products remains at the sole discretion of Oracle.
Agenda
• Architectural Overview
• RAC and Cache Fusion Performance
• Infrastructure
• Common Problems and Resolution
• Aggregation and VLANs
Oracle Clusterware
[Diagram: each node runs CSSD, OPROCD, ONS, CRSD and EVMD on its OS; the nodes are joined by the cluster private high speed network through an L2/L3 switch, expose VIPs (VIP1 ... VIPn) on the public network, and attach to shared storage holding the OCR and voting disks on raw devices.]
Under the Covers
[Diagram: each node (Node 1 ... Node n) runs an instance whose SGA contains the buffer cache, library cache, dictionary cache, log buffer and Global Resource Directory. The background processes shown are LMON, LMD0, LMS0, DIAG, VKTM, LGWR, DBW0, SMON and PMON; the LMS processes are annotated as running in real-time priority. Instances communicate over the cluster private high speed network (L2/L3 switch); each node has its own redo log files, and all nodes share the data files and control files.]
Global Cache Service (GCS)
• Manages coherent access to data in the buffer caches of all instances in the cluster
• Minimizes access time to data that is not in the local cache: access to data in the global cache is faster than disk access
• Implements fast direct memory access over high-speed interconnects, for all data blocks and types
• Uses an efficient and scalable messaging protocol: never more than 3 hops
• New optimizations for read-mostly applications
Cache Hierarchy: Data in Remote Cache
[Message flow: local cache miss -> data block requested from the holding instance -> remote cache hit -> data block returned.]
Cache Hierarchy: Data On Disk
[Message flow: local cache miss -> data block requested -> remote cache miss -> grant returned -> disk read.]
Cache Hierarchy: Read Mostly
[Message flow: local cache miss -> disk read; no message required.]
11.1 CPU Optimizations for Read-Intensive Operations
• Read-only access: no messages, direct reads
• Read-mostly access: message reductions, latency improvements
• Significant gains: reductions of 50-70% measured
Performance of Cache Fusion
[Timeline: the requester initiates a send and waits; LMS on the holding instance receives the ~200-byte request (200 bytes/(1 Gb/sec) of wire time), processes the block, and sends it back; the requester receives the block, e.g. 8K (8192 bytes/(1 Gb/sec) of wire time).]
• Total access time: e.g. ~360 microseconds (UDP over GbE)
• Network propagation delay ("wire time") is a minor factor in the roundtrip time (approx. 6%, vs. 52% in the OS and network stack)
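As a back-of-the-envelope check (a sketch, not a measurement: it assumes an idle 1 GbE link and ignores frame headers and inter-frame gaps), the wire serialization times above can be computed directly; at 1 Gb/sec, 1000 bits take one microsecond:

    $ echo "scale=1; 200*8/1000" | bc     # ~200-byte request: 1.6 microseconds
    1.6
    $ echo "scale=1; 8192*8/1000" | bc    # 8K block: ~65.5 microseconds
    65.5

Together the two transfers occupy roughly 67 of the ~360 microseconds; the rest of the round trip is dominated by the OS and network stack, consistent with the split quoted above.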
Fundamentals: Minimum Latency (*), UDP/GbE and RDS/IB

RT (ms)     2K      4K      8K      16K
UDP/GbE     0.30    0.31    0.36    0.46
RDS/IB      0.12    0.13    0.16    0.20
(*) Roundtrip; blocks are not "busy", i.e. no log flush and no serialization ("buffer busy"). AWR and Statspack report averages as if latencies were normally distributed; the session wait history (included in Statspack in 10.2 and in AWR in 11g) shows the actual quantiles. The minimums in this table are the optimal values for 2-way and 3-way block transfers, and they can also be taken as the expected values (i.e. 10 ms for a 2-way block would be very high).
Infrastructure: Network Packet Processing
[Diagram: on the TX side a process (FG/LMS*) writes through the socket layer (tx buffers), the IP layer (TCP/UDP), and the interface layer; frames cross the L2/L3 switch and climb the same stack in reverse on the RX side to the receiving process (FG/LMS*).]
Infrastructure: Interconnect Bandwidth
• Bandwidth requirements depend on several factors (e.g. buffer cache size, number of CPUs per node, access patterns) and cannot be predicted precisely for every application
• Typical utilization is approx. 10-30% in OLTP
• 10,000-12,000 8K blocks per second saturate 1 x Gb Ethernet (75-80% of theoretical bandwidth; see the sketch below)
• Generally, 1 Gb/sec is sufficient for performance and scalability in OLTP; DSS/DW systems should be designed with > 1 Gb/sec capacity
• A sizing approach with rules of thumb is described in "Project MegaGrid: Capacity Planning for Large Commodity Clusters" (http://otn.oracle.com/rac)
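A quick sanity check of the 10,000-12,000 figure (a sketch assuming 1 GbE = 125,000,000 bytes/sec theoretical and 75-80% achievable for this traffic):

    $ echo "0.75 * 125000000 / 8192" | bc   # 75% of 1 GbE in 8K blocks/sec
    11444
    $ echo "0.80 * 125000000 / 8192" | bc   # 80%
    12207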
Infrastructure: Private Interconnect
• The network between the nodes of a RAC cluster MUST be private
• Supported links: GbE, IB (IPoIB: 10.2)
• Supported transport protocols: UDP, RDS (10.2.0.3)
• Use multiple or dual-ported NICs for redundancy, and increase bandwidth with NIC bonding
• Large (jumbo) frames are recommended for GbE if the global cache workload requires it: global cache block shipping versus small lock message passing
Network Packet Processing: Layers, Queues and Buffers
[Diagram: the same TX/RX stack annotated with its queues and buffers: user-space recv() and socket rx buffers; kernel socket buffers and socket queues; TX IP queues and the RX IP input queue at the IP layer (TCP/UDP); software interrupts between layers; hardware interrupts, RX queues and RX buffers at the interface layer; and ingress/egress buffers plus backplane pressure at the L2/L3 switch.]
Infrastructure: IPC Configuration
Important settings:
• Negotiated top bit rate and full duplex mode
• NIC ring buffers
• Ethernet flow control settings
• CPU(s) receiving network interrupts
Verify your setup (a sketch follows below):
• CVU does checking
• Load testing eliminates the potential for problems
• AWR and ADDM give estimations of link utilization
Buffer overflows, congested links and flow control can have severe consequences for performance.
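On Linux, most of the settings above can be inspected with ethtool and /proc; a minimal sketch, assuming the private interconnect NIC is named eth1 (substitute your own interface):

    $ ethtool eth1                 # negotiated speed/duplex: expect 1000Mb/s, Full
    $ ethtool -g eth1              # ring buffer sizes: current vs. pre-set maximums
    $ ethtool -S eth1              # driver statistics; counter names vary by driver
    $ grep eth1 /proc/interrupts   # which CPU(s) service the NIC's interrupts

CVU (cluvfy) performs its own checks at install time and on demand.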
Infrastructure: Operating System
• Block access latencies increase when CPUs are busy and run queues are long
• Immediate LMS scheduling is critical for predictable block access latencies when CPUs are > 80% busy
• Fewer and busier LMS processes may be more efficient: monitor their CPU utilization
• Caveat: 1 LMS can be good for runtime performance but may impact cluster reconfiguration and instance recovery time; the default is good for most requirements
• Higher priority for LMS is the default; the implementation is platform-specific (a check is sketched below)
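To confirm the elevated priority on Linux (a sketch; scheduling classes and process names are platform- and version-specific):

    $ ps -eo pid,class,rtprio,pcpu,comm | grep '[l]ms'
    # expect the ora_lms*_<SID> processes in a real-time scheduling class
    # (e.g. RR) and watch %CPU to see how busy each LMS process is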
Common Problems and Symptoms
• "Lost blocks": interconnect or switch problems
• System load and scheduling
• Contention
• Unexpectedly high global cache latencies
A Misconfigured or Faulty Interconnect Can Cause:
• Dropped packets/fragments
• Buffer overflows
• Packet reassembly failures or timeouts
• Ethernet flow control kicking in
• TX/RX errors
These surface as "lost blocks" at the RDBMS level and are responsible for 64% of escalations.
“Lost Blocks”: NIC Receive Errors
db_block_size = 8K

$ ifconfig -a
eth0  Link encap:Ethernet  HWaddr 00:0B:DB:4B:A2:04
      inet addr:130.35.25.110  Bcast:130.35.27.255  Mask:255.255.252.0
      UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
      RX packets:21721236 errors:135 dropped:0 overruns:0 frame:95
      TX packets:273120 errors:0 dropped:0 overruns:0 carrier:0
      …
“Lost Blocks”: IP Packet Reassembly Failures

$ netstat -s
Ip:
    84884742 total packets received
    …
    1201 fragments dropped after timeout
    …
    3384 packet reassembles failed
Finding a Problem with the Interconnect or IPC
Top 5 Timed Events
                                            Avg wait  %Total
Event              Waits    Time(s)  (ms)   Call Time  Wait Class
------------------ -------- -------- ------ ---------- ----------
log file sync      286,038  49,872   174    41.7       Commit
gc buffer busy     177,315  29,021   164    24.3       Cluster
gc cr block busy   110,348  5,703    52     4.8        Cluster
gc cr block lost   4,272    4,953    1159   4.1        Cluster   <-- should never be here
cr request retry   6,316    4,668    739    3.9        Other
Global Cache Lost Block Handling
• Detection time reduced in 11g: 500 ms (around 5 secs in 10g)
• Can be lowered if necessary
• Robust (no false positives) and no extra overhead
• The cr request retry event is related to lost blocks: it is highly likely to appear when gc cr block lost waits show up
Interconnect Statistics: Automatic Workload Repository (AWR)

Target      Avg Latency   Stddev      Avg Latency   Stddev
Instance    500B msg      500B msg    8K msg        8K msg
----------- ------------- ----------- ------------- -----------
1           .79           .65         1.04          1.06
2           .75           .57         .95           .78
3           .55           .59         .53           .59
4           1.59          3.16        1.46          1.82
…
• Latency probes for different message sizes
• Exact throughput measurements (not shown)
• Send and receive errors, dropped packets (not shown)
“Blocks Lost”: Solution
• Fix interconnect NICs and switches
• Tune IPC buffer sizes
CPU Saturation or Long Run Queues

Top 5 Timed Events
                                                   Avg wait  %Total
Event                       Waits      Time(s)  (ms)  Call Time  Wait Class
--------------------------- ---------- ------- ----- ---------- ----------
db file sequential read     1,312,840  21,590  16    21.8       User I/O
gc current block congested  275,004    21,054  77    21.3       Cluster
gc cr grant congested       177,044    13,495  76    13.6       Cluster
gc current block 2-way      1,192,113  9,931   8     10.0       Cluster
gc cr block congested       85,975     8,917   104   9.0        Cluster

"Congested": LMS could not dequeue messages fast enough
Cause: long run queue, CPU starvation
High CPU Load: Solution
• Run LMS at higher priority (the default)
• Start more LMS processes, but never use more LMS processes than CPUs
• Reduce the number of user processes
• Find the cause of the high CPU consumption (a sketch follows below)
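Long run queues can be spotted with standard OS tools; a minimal sketch for Linux:

    $ vmstat 5 3    # 'r' column: runnable processes; sustained r much larger
                    # than the CPU count indicates CPU starvation
    $ grep -c ^processor /proc/cpuinfo   # CPU count to compare against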
Contention

Event                   Waits    Time (s)  AVG (ms)  % Call Time
----------------------- -------- --------- --------- -----------
gc cr block 2-way       317,062  5,767     18        19.0
gc current block 2-way  201,663  4,063     20        13.4
gc buffer busy          111,372  3,970     36        13.1
CPU time                         2,938               9.7
gc cr block busy        40,688   1,670     41        5.5

Serialization: global contention on data. It is very likely that gc cr block busy and gc buffer busy are related.
Contention: Solution
• Identify "hot" blocks in the application
• Reduce concurrency on hot blocks
High Latencies

Event                   Waits    Time (s)  AVG (ms)  % Call Time
----------------------- -------- --------- --------- -----------
gc cr block 2-way       317,062  5,767     18        19.0
gc current block 2-way  201,663  4,063     20        13.4
gc buffer busy          111,372  3,970     36        13.1
CPU time                         2,938               9.7
gc cr block busy        40,688   1,670     41        5.5

Expected: to see 2-way and 3-way waits. Unexpected: to see averages > 1 ms (AVG ms should be around 1 ms). Tackle latency first, then tackle the busy events.
High Latencies: Solution
• Check the network configuration: it must be private and running at the expected bit rate
• Find the cause of any high CPU consumption: runaway or spinning processes
Health Check
Look for:
• Unexpected events: gc cr block lost (1159 ms)
• Unexpected "hints":
  - contention and serialization: gc cr/current block busy (52 ms)
  - load and scheduling: gc current block congested (14 ms)
• Unexpectedly high averages: gc cr/current block 2-way (36 ms)
Gigabit Ethernet: Definition and Max Bandwidth
• 1000 Mbit/s = 125 MB per sec theoretical; excluding headers and pause frames, about 118 MB per sec
• Equates to ~85,000 Clusterware/RAC messages or ~14,000 8K blocks per sec
• A RAC workload mixes short messages (~256 bytes) with long messages (db_block_size)
• For a real-life workload, only 60-70% of the bandwidth can be sustained
• For a RAC-type workload, 40 MB per sec per interface is the optimal load
• For additional bandwidth, more interfaces can be aggregated
Aggregation: Active/Standby (Single Switch)
[Diagram: two NICs (ce2, ce4) on one node cabled to a single 1U switch; ce4:1 carries the floating address.]
The ifconfig output below is Solaris IPMP: both interfaces share the group "private".

$ ifconfig -a
ce2: flags=69040843 mtu 1500 index 3
        inet 192.168.83.36 netmask ffffff00 broadcast 192.168.83.255
        groupname private
ce4: flags=9040843 mtu 1500 index 8
        inet 192.168.83.35 netmask ffffff00 broadcast 192.168.83.255
        groupname private
ce4:1: flags=1000843 mtu 1500 index 8
        inet 192.168.83.37 netmask ffffff00 broadcast 192.168.83.255
Aggregation: Active/Active (Single Switch)
[Diagram: two NICs (ce2, ce4) on one node cabled to a single 1U switch; ce4:1 carries the floating address.]

$ ifconfig -a
ce2: flags=69040843 mtu 1500 index 3
        inet 192.168.83.36 netmask ffffff00 broadcast 192.168.83.255
        groupname private
ce4: flags=9040843 mtu 1500 index 8
        inet 192.168.83.35 netmask ffffff00 broadcast 192.168.83.255
        groupname private
ce4:1: flags=1000843 mtu 1500 index 8
        inet 192.168.83.37 netmask ffffff00 broadcast 192.168.83.255
Aggregation: Active/Standby (Switch Redundancy)
[Diagram: two nodes, each with two dual-port NICs (ce2/ce4 and ce8/ce10), cross-cabled to two 1U switches; ce4:1 carries the floating address.]

$ ifconfig -a
ce10: flags=69040843 mtu 1500 index 3
        inet 192.168.83.36 netmask ffffff00 broadcast 192.168.83.255
        groupname private
ce4: flags=9040843 mtu 1500 index 8
        inet 192.168.83.35 netmask ffffff00 broadcast 192.168.83.255
        groupname private
ce4:1: flags=1000843 mtu 1500 index 8
        inet 192.168.83.37 netmask ffffff00 broadcast 192.168.83.255
Aggregation Solutions
• Cisco EtherChannel, based on 802.3ad
• AIX EtherChannel
• HP-UX Auto Port Aggregation
• Sun Trunking, IPMP, GLD
• Linux bonding (only certain modes; a sketch follows below)
• Windows NIC teaming
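For Linux bonding, active-backup (mode 1) is typically among the supported modes; a minimal sketch for a RHEL-era system (device names, addresses and file paths are illustrative, not a prescription):

    # /etc/modprobe.conf
    alias bond0 bonding
    options bonding mode=active-backup miimon=100

    # /etc/sysconfig/network-scripts/ifcfg-bond0
    DEVICE=bond0
    IPADDR=192.168.83.35
    NETMASK=255.255.255.0
    BOOTPROTO=none
    ONBOOT=yes

    # /etc/sysconfig/network-scripts/ifcfg-eth1 (and likewise ifcfg-eth2)
    DEVICE=eth1
    MASTER=bond0
    SLAVE=yes
    BOOTPROTO=none
    ONBOOT=yes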
Aggregation Methods
• Load balance / failover / load spreading: spreads on sends, serializes on receives
• Active/standby
Oracle's interconnect requirements:
• Both send- and receive-side load balancing
• NIC and switch port failure detection
General Interconnect Recommendations
• For OLTP workloads, 1 Gbit Ethernet with redundancy (active/standby or load-balance) is normally sufficient
• For DW workloads: multiple aggregated GbE links, 10 GbE, or InfiniBand
Oracle RAC Cluster Interconnect Network Selection
• Oracle Clusterware: the IP address associated with the private hostname (provided during the install interview)
• Oracle RAC database: the private network specified during the install interview, or the IP address provided via the CLUSTER_INTERCONNECTS parameter
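The classification the clusterware actually uses can be displayed with oifcfg; a sketch (the interface names and subnets are illustrative):

    $ oifcfg getif
    eth0  130.35.24.0   global  public
    eth1  192.168.83.0  global  cluster_interconnect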
Jumbo Frames
• Not an IEEE standard
• Useful for NAS/iSCSI storage
• Network device interoperability issues
• Configure with care and test rigorously (a sketch follows below)
Excerpt from alert.log:
Maximum Tranmission Unit (mtu) of the ether adapter is different on the node running instance 4, and this node. Ether adapters connecting the cluster nodes must be configured with identical mtu on all the nodes, for Oracle. Please ensure the mtu attribute of the ether adapter on all nodes [and switch ports] are identical, before running Oracle.
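A minimal Linux sketch for configuring and verifying a 9000-byte MTU end to end (the interface and host names are illustrative):

    $ ifconfig eth1 mtu 9000          # also set MTU=9000 persistently, on all
                                      # nodes, and on the switch ports
    $ ping -M do -s 8972 -c 3 node2-priv
    # 8972 = 9000 - 20 (IP header) - 8 (ICMP header); -M do forbids
    # fragmentation, so a failure means jumbo frames do not pass end to end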
UDP Socket Buffer (rx)
• Default settings are adequate for the majority of customers
• The allocated buffer size may need to be increased when:
  - the MTU size is increased
  - netstat reports fragmentation and/or reassembly errors
  - ifconfig reports dropped packets or overflows
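On Linux the relevant kernel limits are the net.core receive buffer settings; a sketch (the values are illustrative, not a recommendation; follow the platform-specific notes for your release):

    $ sysctl -w net.core.rmem_max=4194304       # maximum rx socket buffer
    $ sysctl -w net.core.rmem_default=262144    # default rx socket buffer
    # persist the values in /etc/sysctl.conf, then re-check for errors:
    $ netstat -s | grep -i 'packet receive errors'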
Cluster Interconnect NIC Settings
• NIC driver dependent: defaults are generally satisfactory
• Behavior can change between OS versions, e.g. Linux 2.4 => 2.6 kernels: flow control on e1000 drivers, NAPI interrupt coalescence in 2.6
• Confirm flow control: rx=on, tx=off (a verification sketch follows below)
• Confirm the full bit rate (1000) for the NICs
• Confirm full duplex, auto-negotiated
• Ensure NIC names/slots are identical on all nodes
• Configure interconnect NICs on the fastest PCI bus
• Ensure compatible switch settings: 802.3ad on NICs = 802.3ad on switch ports; MTU=9000 on NICs = MTU=9000 on switch ports
FAILURE TO CONFIGURE THE NICS AND SWITCHES CORRECTLY WILL RESULT IN SEVERE PERFORMANCE DEGRADATION AND NODE FENCING
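A verification sketch for the flow control and bit rate bullets above, again assuming a Linux NIC named eth1 (whether pause settings can be changed is driver-dependent):

    $ ethtool -a eth1                        # pause parameters: expect RX on, TX off
    $ ethtool -A eth1 rx on tx off           # set them if the driver allows it
    $ ethtool eth1 | egrep 'Speed|Duplex'    # expect 1000Mb/s, Full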
The Interconnect and VLANs
• The interconnect should be a dedicated, non-routable subnet mapped to a single dedicated, non-shared VLAN
• If VLANs are 'trunked', interconnect VLAN traffic should not extend beyond the access switch layer
• Minimize the impact of spanning tree events
• Monitor the switch(es) for congestion
• Avoid QoS definitions that may negatively impact interconnect performance
Q & A