Oracle RACPACK - Oracle RAC SIG

Oracle Clusterware and Private Network Considerations

Much of this presentation is attributed to Michael Zoll and work done by the RAC Performance Development group


The following is intended to outline our general product direction. It is intended for information purposes only, and may not be incorporated into any contract. It is not a commitment to deliver any material, code, or functionality, and should not be relied upon in making purchasing decisions. The development, release, and timing of any features or functionality described for Oracle’s products remains at the sole discretion of Oracle.


Agenda
• Architectural Overview
• RAC and Cache Fusion Performance
• Infrastructure
• Common Problems and Resolution
• Aggregation and VLANs


Oracle Clusterware

[Architecture diagram: each node runs the Oracle Clusterware daemons (CSSD, OPROCD, ONS, CRSD, EVMD) on top of the OS. The nodes are joined by a private high-speed cluster network through an L2/L3 switch and expose VIPs (VIP1 ... VIPn) on the public network. All nodes attach to shared storage holding the OCR and voting disks on raw devices.]

Under the Covers

[Architecture diagram: instances 1 ... n, one per node, communicate over the private high-speed cluster network through an L2/L3 switch. Each instance's SGA holds the Global Resource Directory, dictionary cache, library cache, log buffer and buffer cache. Each instance runs the background processes LMON, LMD0, LMS0, DIAG, VKTM, LGWR, DBW0, SMON and PMON; callouts mark the processes that run at real-time priority. Each node has its own redo log files, while data files and control files are shared.]

Global Cache Service (GCS)
• Manages coherent access to data in the buffer caches of all instances in the cluster
• Minimizes access time to data that is not in the local cache: access to data in the global cache is faster than disk access
• Implements fast direct memory access over high-speed interconnects, for all data blocks and types
• Uses an efficient and scalable messaging protocol: never more than 3 hops
• New optimizations for read-mostly applications


Cache Hierarchy: Data in Remote Cache

[Flow: local cache miss → data block requested from the holding instance → remote cache hit → data block returned over the interconnect.]


Cache Hierarchy: Data on Disk

[Flow: local cache miss → data block requested → remote cache miss → grant returned → requestor reads the block from disk.]


Cache Hierarchy: Read Mostly

[Flow: local cache miss → no message required → block read directly from disk.]

11.1 CPU Optimizations for read-intensive operations
• Read-only access: no messages, direct reads
• Read-mostly access: message reductions, latency improvements
• Significant gains: 50-70% reductions measured


Performance of Cache Fusion

[Timing diagram: the requestor initiates a send and waits; the ~200-byte request message crosses the wire (200 bytes / 1 Gb/sec); the LMS process receives it, processes the block and sends it back; the block (e.g. 8K) crosses the wire (8192 bytes / 1 Gb/sec); the requestor receives it.]

Total access time: e.g. ~360 microseconds (UDP over GbE). Network propagation delay ("wire time") is a minor factor in roundtrip time (approx. 6%, vs. 52% in the OS and network stack).
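As a rough illustration of why the wire itself contributes so little, here is a back-of-the-envelope calculation of the serialization component alone (an idealized sketch: it ignores headers, propagation delay and all OS/stack time):

awk 'BEGIN {
  bps = 1e9 / 8                           # 1 Gb/sec link, in bytes/sec
  printf "request message: %5.1f us\n", 200  / bps * 1e6
  printf "8K block:        %5.1f us\n", 8192 / bps * 1e6
}'

The two transfers account for only ~67 of the ~360 microseconds; the bulk of the roundtrip is spent in the OS, the network stack and LMS processing.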


Fundamentals: Minimum Latency (*), UDP/GbE and RDS/IB

Block size       2K      4K      8K      16K
--------------------------------------------
UDP/GE RT (ms)   0.30    0.31    0.36    0.46
RDS/IB RT (ms)   0.12    0.13    0.16    0.20

(*) Roundtrip; blocks are not "busy", i.e. no log flush, no serialization ("buffer busy"). AWR and Statspack report averages as if they were normally distributed; the session wait history (included in Statspack in 10.2 and AWR in 11g) shows the actual quantiles. The minimum values in this table are the optimal values for 2-way and 3-way block transfers, but can be assumed to be the expected values (i.e. 10 ms for a 2-way block would be very high).


Infrastructure: Network Packet Processing

[Diagram: on the TX side a process (FG/LMS) writes through the socket layer (tx buffers), the UDP/TCP and IP layers, and the interface layer; the packet crosses the L2/L3 switch; on the RX side it climbs the same stack in reverse, up through the socket layer (rx buffers) to the receiving process.]


Infrastructure: Interconnect Bandwidth
• Bandwidth requirements depend on several factors (e.g. buffer cache size, number of CPUs per node, access patterns) and cannot be predicted precisely for every application
• Typical utilization is approx. 10-30% in OLTP; 10000-12000 8K blocks per second saturate 1 x Gb Ethernet (75-80% of theoretical bandwidth), as the sketch below shows
• Generally, 1 Gb/sec is sufficient for performance and scalability in OLTP; DSS/DW systems should be designed with > 1 Gb/sec capacity
• A sizing approach with rules of thumb is described in "Project MegaGrid: Capacity Planning for Large Commodity Clusters" (http://otn.oracle.com/rac)
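To sanity-check the saturation figure, multiply block rate by block size and compare it to the link's theoretical 125 MB/sec (an idealized calculation that ignores protocol overhead):

awk 'BEGIN {
  link = 1e9 / 8                  # 1 GbE in bytes/sec (125 MB/sec)
  mbs  = 12000 * 8192 / 1e6       # 12000 8K blocks per sec, in MB/sec
  printf "%.0f MB/sec = %.0f%% of theoretical 1 GbE\n", mbs, 100 * mbs * 1e6 / link
}'

12000 blocks/sec x 8K works out to ~98 MB/sec, i.e. roughly 79% of the theoretical limit, matching the 75-80% rule of thumb above.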


Infrastructure: Private Interconnect
• The network between the nodes of a RAC cluster MUST be private
• Supported links: GbE, IB (IPoIB: 10.2)
• Supported transport protocols: UDP, RDS (10.2.0.3)
• Use multiple or dual-ported NICs for redundancy, and increase bandwidth with NIC bonding
• Large (jumbo) frames for GbE are recommended if the global cache workload requires it: global cache block shipping versus small lock message passing


Network Packet Processing: Layers, Queues and Buffers

[Diagram: the same TX/RX stack, annotated with every place data can queue or drop: socket buffers and socket queues at the user/kernel boundary (recv()), TX IP queues and the RX IP input queue at the IP/UDP/TCP layer, software interrupts between layers, hardware interrupts with RX queues and RX buffers at the interface layer, and ingress/egress buffers, backplane pressure and CPU load at the L2/L3 switch.]

Infrastructure: IPC Configuration
Important settings:
• Negotiated top bit rate and full duplex mode
• NIC ring buffers
• Ethernet flow control settings
• CPU(s) receiving network interrupts
Verify your setup (sample checks follow below):
• CVU does checking
• Load testing eliminates the potential for problems
• AWR and ADDM give estimations of link utilization
Buffer overflows, congested links and flow control can have severe consequences for performance.
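On Linux, one way to inspect these settings is with ethtool and /proc (a sketch; the interface name eth1 is illustrative, and option support varies by driver):

$ ethtool eth1                      # negotiated speed and duplex
$ ethtool -g eth1                   # NIC ring buffer sizes (current vs. maximum)
$ ethtool -a eth1                   # Ethernet flow control (pause frame) settings
$ grep eth1 /proc/interrupts        # which CPU(s) are servicing the NIC interrupts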


Infrastructure: Operating System
• Block access latencies increase when CPU(s) are busy and run queues are long
  • Immediate LMS scheduling is critical for predictable block access latencies when CPUs are > 80% busy
• Fewer and busier LMS processes may be more efficient
  • Monitor their CPU utilization
  • Caveat: 1 LMS can be good for runtime performance but may impact cluster reconfiguration and instance recovery time
  • The default is good for most requirements
• Higher priority for LMS is the default
  • The implementation is platform-specific


Common Problems and Symptoms
• "Lost blocks": interconnect or switch problems
• System load and scheduling
• Contention
• Unexpectedly high global cache latencies




A Misconfigured or Faulty Interconnect Can Cause:
• Dropped packets/fragments
• Buffer overflows
• Packet reassembly failures or timeouts
• Ethernet flow control kicking in
• TX/RX errors

"Lost blocks" at the RDBMS level are responsible for 64% of escalations.


"Lost Blocks": NIC Receive Errors

Db_block_size = 8K

ifconfig -a:
eth0  Link encap:Ethernet  HWaddr 00:0B:DB:4B:A2:04
      inet addr:130.35.25.110  Bcast:130.35.27.255  Mask:255.255.252.0
      UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
      RX packets:21721236 errors:135 dropped:0 overruns:0 frame:95
      TX packets:273120 errors:0 dropped:0 overruns:0 carrier:0

Note the non-zero RX errors and frame counts on an MTU 1500 interface: with an 8K block size every block is fragmented, so even a small error rate translates into lost blocks.




"Lost Blocks": IP Packet Reassembly Failures

netstat -s
Ip:
    84884742 total packets received
    ...
    1201 fragments dropped after timeout
    ...
    3384 packet reassembles failed


Finding a Problem with the Interconnect or IPC

Top 5 Timed Events
~~~~~~~~~~~~~~~~~~
                                        Avg wait  %Total
Event              Waits     Time(s)    (ms)      Call Time  Wait Class
------------------ --------- ---------- --------- ---------- ----------
log file sync      286,038   49,872     174       41.7       Commit
gc buffer busy     177,315   29,021     164       24.3       Cluster
gc cr block busy   110,348   5,703      52        4.8        Cluster
gc cr block lost   4,272     4,953      1159      4.1        Cluster
cr request retry   6,316     4,668      739       3.9        Other

gc cr block lost and cr request retry should never be here.


Global Cache Lost Block Handling
• Detection time reduced in 11g: 500 ms (around 5 secs in 10g)
  • Can be lowered if necessary
  • Robust (no false positives)
  • No extra overhead
• The cr request retry event is related to lost blocks: it is highly likely to appear when gc cr blocks lost show up
A quick check from SQL*Plus is sketched below.
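One quick way to see whether the instance is actually observing lost blocks is to query the gc lost-block statistics (a sketch; statistic names as exposed in 10g/11g v$sysstat):

sqlplus -s "/ as sysdba" <<'EOF'
SELECT name, value
FROM   v$sysstat
WHERE  name LIKE 'gc%lost%';
EOF

Steadily increasing values here, combined with the ifconfig/netstat errors shown earlier, point at the interconnect rather than the database.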


Interconnect Statistics: Automatic Workload Repository (AWR)

Target     Avg Latency  Stddev     Avg Latency  Stddev
Instance   500B msg     500B msg   8K msg       8K msg
------------------------------------------------------
1          .79          .65        1.04         1.06
2          .75          .57        .95          .78
3          .55          .59        .53          .59
4          1.59         3.16       1.46         1.82
...
------------------------------------------------------

• Latency probes for different message sizes
• Exact throughput measurements (not shown)
• Send and receive errors, dropped packets (not shown)


"Blocks Lost": Solution
• Fix interconnect NICs and switches
• Tune IPC buffer sizes


CPU Saturation or Long Run Queues

Top 5 Timed Events
~~~~~~~~~~~~~~~~~~
                                                  Avg wait  %Total
Event                       Waits      Time(s)    (ms)      Call Time  Wait Class
--------------------------- ---------- ---------- --------- ---------- ----------
db file sequential read     1,312,840  21,590     16        21.8       User I/O
gc current block congested  275,004    21,054     77        21.3       Cluster
gc cr grant congested       177,044    13,495     76        13.6       Cluster
gc current block 2-way      1,192,113  9,931      8         10.0       Cluster
gc cr block congested       85,975     8,917      104       9.0        Cluster

"Congested": LMS could not dequeue messages fast enough.
Cause: long run queue, CPU starvation.

High CPU Load: Solution
• Run LMS at higher priority (the default)
• Start more LMS processes
  • Never use more LMS processes than CPUs
• Reduce the number of user processes
• Find the cause of high CPU consumption


Contention

Event                    Waits      Time (s)   AVG (ms)  % Call Time
------------------------ ---------- ---------- --------- -----------
gc cr block 2-way        317,062    5,767      18        19.0
gc current block 2-way   201,663    4,063      20        13.4
gc buffer busy           111,372    3,970      36        13.1
CPU time                            2,938                9.7
gc cr block busy         40,688     1,670      41        5.5

Serialization: global contention on data. It is very likely that gc cr block busy and gc buffer busy are related.


Contention: Solution
• Identify "hot" blocks in the application
• Reduce concurrency on hot blocks


High Latencies

Event                    Waits      Time (s)   AVG (ms)  % Call Time
------------------------ ---------- ---------- --------- -----------
gc cr block 2-way        317,062    5,767      18        19.0
gc current block 2-way   201,663    4,063      20        13.4
gc buffer busy           111,372    3,970      36        13.1
CPU time                            2,938                9.7
gc cr block busy         40,688     1,670      41        5.5

Expected: to see 2-way and 3-way events.
Unexpected: to see > 1 ms (AVG ms should be around 1 ms).

Tackle latency first, then tackle busy events.


High Latencies: Solution
• Check network configuration
  • Private
  • Running at the expected bit rate
• Find the cause of high CPU consumption
  • Runaway or spinning processes


Health Check

Look for:
• Unexpected events: gc cr block lost (1159 ms)
• Unexpected "hints":
  • Contention and serialization: gc cr/current block busy (52 ms)
  • Load and scheduling: gc current block congested (14 ms)
• Unexpectedly high averages: gc cr/current block 2-way (36 ms)

A query sketch for this check follows.
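The same numbers can be pulled from a running instance on demand; a hedged sketch of the event averages (v$system_event is standard, though the mapping to AWR interval averages is approximate):

sqlplus -s "/ as sysdba" <<'EOF'
SELECT event, total_waits,
       ROUND(time_waited_micro / 1000 / total_waits, 2) AS avg_ms
FROM   v$system_event
WHERE  event LIKE 'gc%' AND total_waits > 0
ORDER  BY time_waited_micro DESC;
EOF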

Gigabit Ethernet: Definition and Max Bandwidth
• 1000 Mbit/sec = 125 MB per sec; excluding header and pause frames, 118 MB per sec
• Equates to 85000 Clusterware/RAC messages or 14000 8K blocks per sec
• A RAC workload has a mix of short messages of 256 bytes and long messages of db_block_size
• For a real-life workload, only 60-70% of the bandwidth can be sustained
• For a RAC-type workload, 40 MB per sec per interface is the optimal load
• For additional bandwidth, more interfaces can be aggregated


Aggregation: Active/Standby (Single Switch)

[Diagram: one node with two NICs, ce2 and ce4, cabled to a single 1U switch; the virtual interface ce4:1 carries the interconnect address on the active NIC.]

$ --> ifconfig -a
ce2: flags=69040843 mtu 1500 index 3
        inet 192.168.83.36 netmask ffffff00 broadcast 192.168.83.255
        groupname private
ce4: flags=9040843 mtu 1500 index 8
        inet 192.168.83.35 netmask ffffff00 broadcast 192.168.83.255
        groupname private
ce4:1: flags=1000843 mtu 1500 index 8
        inet 192.168.83.37 netmask ffffff00 broadcast 192.168.83.255


Aggregation: Active/Active (Single Switch)

[Diagram: as before, two NICs (ce2, ce4) on a single 1U switch, now both active, with ce4:1 as the additional logical address.]

$ --> ifconfig -a
ce2: flags=69040843 mtu 1500 index 3
        inet 192.168.83.36 netmask ffffff00 broadcast 192.168.83.255
        groupname private
ce4: flags=9040843 mtu 1500 index 8
        inet 192.168.83.35 netmask ffffff00 broadcast 192.168.83.255
        groupname private
ce4:1: flags=1000843 mtu 1500 index 8
        inet 192.168.83.37 netmask ffffff00 broadcast 192.168.83.255


Aggregation: Active/Standby (Switch Redundancy)

[Diagram: two 1U switches; each node splits its NIC pairs (ce2/ce4 and ce8/ce10) across the two switches, with ce4:1 as the floating interconnect address.]

$ --> ifconfig -a
ce10: flags=69040843 mtu 1500 index 3
        inet 192.168.83.36 netmask ffffff00 broadcast 192.168.83.255
        groupname private
ce4: flags=9040843 mtu 1500 index 8
        inet 192.168.83.35 netmask ffffff00 broadcast 192.168.83.255
        groupname private
ce4:1: flags=1000843 mtu 1500 index 8
        inet 192.168.83.37 netmask ffffff00 broadcast 192.168.83.255


Aggregation Solutions
• Cisco EtherChannel, based on 802.3ad
• AIX EtherChannel
• HP-UX Auto Port Aggregation
• Sun Trunking, IPMP, GLD
• Linux bonding (only certain modes; see the sketch below)
• Windows NIC teaming
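For Linux bonding, a minimal active-backup configuration looks roughly like this (a sketch for 2.6-era kernels; device names, addresses and file locations are illustrative and vary by distribution):

# /etc/modprobe.conf (illustrative)
alias bond0 bonding
options bonding mode=active-backup miimon=100

# /etc/sysconfig/network-scripts/ifcfg-bond0 (illustrative)
DEVICE=bond0
IPADDR=192.168.83.35
NETMASK=255.255.255.0
BOOTPROTO=none
ONBOOT=yes

# Verify the bond state after bringing it up:
$ cat /proc/net/bonding/bond0

Active-backup is one of the modes that behaves predictably for the interconnect, since it avoids receive-side packet reordering across NICs.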


Aggregation Methods
• Load balance/failover/load spreading
  • Spread on sends, serialize on receives
• Active/Standby

Oracle interconnect requirements:
• Both send- and receive-side load balancing
• NIC and switch port failure detection


General Interconnect Requirements and Recommendations
• For OLTP workloads: normally 1 Gbit Ethernet with redundancy (active/standby or load-balance) is sufficient
• For DW workloads:
  • Multiple GigE aggregated
  • 10 GigE or InfiniBand


Oracle RAC Cluster Interconnect: Network Selection
• Oracle Clusterware: the IP address associated with the private hostname (provided during the install interview)
• Oracle RAC database: the private network specified during the install interview, or an IP address provided via the CLUSTER_INTERCONNECTS parameter (example below)
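If the installer-selected network must be overridden, the parameter can be set explicitly per instance (a sketch; the address and SID are illustrative):

sqlplus "/ as sysdba" <<'EOF'
ALTER SYSTEM SET cluster_interconnects = '192.168.83.35'
  SCOPE = SPFILE SID = 'RAC1';
EOF

A commonly cited caveat is that setting this parameter bypasses Clusterware's interconnect management, so it is usually left unset unless there is a specific need.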


Jumbo Frames
• Not an IEEE standard
• Useful for NAS/iSCSI storage
• Network device interoperability issues
• Configure with care and test rigorously (see the path check after the excerpt below)

Excerpt from alert.log:

Maximum Tranmission Unit (mtu) of the ether adapter is different on the node running instance 4, and this node. Ether adapters connecting the cluster nodes must be configured with identical mtu on all the nodes, for Oracle. Please ensure the mtu attribute of the ether adapter on all nodes [and switch ports] are identical, before running Oracle.
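Before relying on jumbo frames, it is worth verifying that full-size frames actually pass between every pair of nodes (a Linux example; the peer hostname node2-priv is illustrative, and 8972 = 9000 minus 28 bytes of IP/ICMP headers):

$ ping -M do -s 8972 -c 3 node2-priv    # -M do sets the don't-fragment bit

If any hop is still at MTU 1500, the ping fails instead of silently fragmenting, which is exactly the mismatch the alert.log excerpt above complains about.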


UDP Socket Buffer (rx)
• Default settings are adequate for the majority of customers
• May need to increase the allocated buffer size when:
  • The MTU size increases
  • netstat reports fragmentation and/or reassembly errors
  • ifconfig reports dropped packets or overflows
An example of raising the limits follows.
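On Linux the relevant kernel limits can be raised with sysctl (values are illustrative, not a recommendation; make them permanent in /etc/sysctl.conf):

$ sysctl -w net.core.rmem_max=4194304        # max receive buffer a socket may request
$ sysctl -w net.core.rmem_default=262144     # default receive buffer size

Then re-check netstat -s and ifconfig under load to confirm the reassembly failures and drops have stopped growing.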


Cluster Interconnect NIC Settings
• NIC driver dependent: DEFAULTS GENERALLY SATISFACTORY
  • Changes can occur between OS versions (e.g. Linux 2.4 => 2.6 kernels: flow control on e1000 drivers, NAPI interrupt coalescence in 2.6)
• Confirm flow control: rx=on, tx=off (see the example below)
• Confirm full bit rate (1000) for the NICs
• Confirm full duplex auto-negotiate
• Ensure NIC names/slots are identical on all nodes
• Configure interconnect NICs on the fastest PCI bus
• Ensure compatible switch settings:
  • 802.3ad on NICs = 802.3ad on switch ports
  • MTU=9000 on NICs = MTU=9000 on switch ports

FAILURE TO CONFIGURE THE NICS AND SWITCHES CORRECTLY WILL RESULT IN SEVERE PERFORMANCE DEGRADATION AND NODE FENCING
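On Linux these checks map onto ethtool (a sketch; eth1 is illustrative and not every driver accepts every option):

$ ethtool -A eth1 rx on tx off          # set flow control: rx=on, tx=off
$ ethtool -a eth1                       # confirm the pause settings took effect
$ ethtool eth1 | egrep 'Speed|Duplex'   # confirm 1000 Mb/s, full duplex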


The Interconnect and VLANs
• The interconnect should be a dedicated, non-routable subnet mapped to a single dedicated, non-shared VLAN
• If VLANs are 'trunked', the interconnect VLAN traffic should not extend beyond the access switch layer
• Minimize the impact of Spanning Tree events
• Monitor the switch(es) for congestion
• Avoid QoS definitions that may negatively impact interconnect performance



Q & A