Impact of Cache Coherence Protocols on the Processing of Network Traffic

Amit Kumar and Ram Huggahalli, Communication Technology Lab, Corporate Technology Group, Intel Corporation. December 3, 2007.

© 2007 Intel Corporation

Outline

- Background
  - Network performance improvement with the new microarchitecture
  - Need to revisit platform changes for CPU onloading
- Overview of the existing and prefetch-hint coherence protocols
- Direct Cache Access (DCA) performance overview
- Prototype results
- Future research


Background & Motivation

- Adoption of 10 Gbps networking has been limited to a few applications; a primary reason has been the processing capability of general-purpose platforms.
- Recent micro-architectural changes in Intel® Core™ processors have shown 66% higher network processing capability than the previous-generation Intel® Pentium® 4 architecture.
- Providing a coherence protocol that places data into the CPU cache further improves processing capability.
- Our prototype implementation of Direct Cache Access (DCA) shows a 15.6% to 43.4% speed-up.


Background & Motivation (contd.)

Solutions to reduce TCP/IP processing overhead fall into three categories:

- Platform improvements for CPU onloading: copy-specific solutions such as user-level TCP/IP stacks, page flipping, etc.
- TCP Offload Engines (TOEs): use hardware assists to offload the main CPU; limited to a small spectrum of networking applications.
- Interconnects and protocols such as InfiniBand, Myrinet, or RDMA: require new hardware-software interfaces and therefore application support; in some cases, expensive NIC solutions as well.

New micro-architectural efficiencies provide a greater impetus for CPU onloading and diminish the need for specialized solutions.


Opportunity for DCA in Realistic Workloads

Source: Ram Huggahalli, Ravi Iyer, and Scott Tetrick, "Direct Cache Access for High Bandwidth Network I/O," 32nd Annual International Symposium on Computer Architecture (ISCA'05), pp. 50-59.

[Figure: % of inbound I/O data read by the CPU vs. distance in time. X-axis: system bus clocks (x1000), 0 to 50; y-axis: % occurrence, 0% to 100%. Workloads: Rx (512B to 4KB), TPC-W, SPECWeb99, TPC-C. Takeaway: for network I/O, about 80% of inbound data is read by the CPU within a short time of arrival.]

Today's Coherence Protocol

1. Packet arrives on the NIC from the network.
2. NIC sends the packet as I/O bus transactions to the chipset.
3. Chipset ensures coherency of the data by snooping processor caches before writing to memory; an L1/L2 line in M state is written back first.
4. Processor eventually reads the packet for TCP/IP processing and moves the data to the application buffer.

Before the CPU is interrupted: coherent memory write, snoop, writeback, memory write. After the CPU is interrupted: read, read data. The demand read from the CPU is a compulsory cache miss.

[Diagram: coherence protocol for inbound I/O — Network to NIC to Chipset to Memory, with snoops to the processor (L1/L2 line in M state).]
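To make the sequence concrete, here is a minimal C sketch (not Intel's implementation; the simplified MESI states and trace format are assumptions for illustration) that prints the bus transactions of this baseline flow:

```c
/* Baseline inbound-I/O coherence flow, as an event trace. The receive-buffer
 * line is assumed to still be in M state in the CPU cache from its previous
 * use, so the chipset's snoop forces a writeback before the packet write. */
#include <stdio.h>

typedef enum { INVALID, SHARED, MODIFIED } mesi_t;  /* simplified MESI */

static void baseline_inbound(mesi_t *cpu_line)
{
    puts("NIC     -> chipset : coherent memory write (packet)");
    puts("chipset -> CPU     : snoop");
    if (*cpu_line == MODIFIED) {
        puts("CPU     -> chipset : writeback (M-state line)");
        *cpu_line = INVALID;        /* line invalidated by the I/O write */
    }
    puts("chipset -> memory  : memory write (packet now in DRAM only)");
    puts("-- interrupt; TCP/IP processing starts --");
    puts("CPU     -> memory  : demand read (compulsory cache miss)");
    *cpu_line = SHARED;             /* line refilled from DRAM */
}

int main(void)
{
    mesi_t line = MODIFIED;
    baseline_inbound(&line);
    return 0;
}
```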

Prefetch Hint Protocol

1. Packet arrives on the NIC from the network.
2. NIC sends the packet as I/O bus transactions, with a target cache tag, to the chipset.
3. Chipset sends snoops to the processor with hints to prefetch the data; an M-state L1/L2 line is written back and the packet is written to memory.
4. Processor prefetches the packet soon after the hint is received, so the packet is present in the cache when TCP/IP processing begins.

Before the CPU is interrupted: coherent memory write, snoop-hint, writeback, memory write, prefetch read, read data.

[Diagram: coherence protocol for the DCA prototype — Network to NIC to Chipset to Memory, with snoop-hints to the processor.]
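A companion sketch of the hinted flow, with made-up placeholder latencies (the deck does not give real bus timings), showing why the critical-path read shrinks from a DRAM access to a cache hit:

```c
/* Latency model of baseline vs. prefetch-hint inbound I/O. All numbers are
 * illustrative assumptions, not measurements; the point is that the hinted
 * prefetch overlaps interrupt latency, leaving only a cache hit on the
 * CPU's demand-read path. */
#include <stdio.h>
#include <stdbool.h>

#define SNOOP_NS      40   /* assumed: chipset snoop of CPU caches    */
#define WRITEBACK_NS  60   /* assumed: writeback of an M-state line   */
#define MEM_WRITE_NS  50   /* assumed: packet write to DRAM           */
#define MEM_READ_NS  100   /* assumed: baseline demand read from DRAM */
#define CACHE_READ_NS 10   /* assumed: demand read after hinted prefetch */

struct cache_line { bool modified; };

static unsigned inbound(struct cache_line *line, bool prefetch_hint)
{
    unsigned ns = SNOOP_NS;          /* chipset snoops the CPU          */
    if (line->modified) {            /* stale M-state copy: write back  */
        ns += WRITEBACK_NS;
        line->modified = false;
    }
    ns += MEM_WRITE_NS;              /* packet written to DRAM          */

    /* With a hint, the prefetch runs while the interrupt is pending, so
     * only a cache hit remains when TCP/IP processing starts. */
    ns += prefetch_hint ? CACHE_READ_NS : MEM_READ_NS;
    return ns;
}

int main(void)
{
    struct cache_line a = { .modified = true }, b = { .modified = true };
    printf("baseline: %u ns, prefetch-hint: %u ns per line\n",
           inbound(&a, false), inbound(&b, true));
    return 0;
}
```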

Impact of the Prefetch Hint/DCA Protocol

[Chart: ns-per-packet profile at 4KB I/O, split into tcp, copy, and other (driver, OS, app interface) components for Base vs. DCA; labeled values include 2680 ns, 2481 ns, 1002 ns, 256 ns, 179 ns, and 148 ns.]

System configuration (source: Intel): four cores, two per shared L2 cache; FSB at 1333 MHz (10.4 GB/s peak); Memory Controller Hub with 4-channel FBD-667 memory (20.8 GB/s peak read bandwidth); PCIe 2x1GbE NIC, with both 1GbE ports connected to a system similar to the SUT.

Copy with DCA is 5x faster, and TCP/IP processing is 1.5x faster.
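As a sanity check on scale, ns-per-packet converts directly to an upper bound on per-core throughput. The helper below uses the chart's 2680 ns label and the 4KB I/O size, assuming, for illustration only, that the label represents the full per-packet processing cost:

```c
/* Convert a ns-per-packet figure into an upper-bound per-core throughput.
 * bits-per-ns equals Gb/s exactly, so no scaling constant is needed. */
#include <stdio.h>

static double gbps_per_core(double ns_per_packet, double io_bytes)
{
    return (io_bytes * 8.0) / ns_per_packet;  /* bits/ns == Gb/s */
}

int main(void)
{
    /* 4096 B * 8 / 2680 ns ~= 12.2 Gb/s if a core did nothing else */
    printf("4KB @ 2680 ns/packet -> %.1f Gb/s per core (upper bound)\n",
           gbps_per_core(2680.0, 4096.0));
    return 0;
}
```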

DCA Performance & Sensitivities

[Figure, panels (a)-(e): Base vs. DCA network throughput and normalized throughput per core (Gb/s) with speed-up vs. I/O size (64 to 65536 bytes), with labeled speed-ups of 15.6% to 42.4%; normalized throughput per core and speed-up vs. total TCP connections across 2 ports (log scale, 1 to 1000), with 40.3% speed-up at 2 connections and 29.9% at 512 connections; DCA speed-up with and without memory loading for I/O sizes of 1024 to 16384 bytes; and SPECintRate and SPECfpRate with concurrent network traffic across the SPEC CPU2000 suites (gzip through twolf, wupwise through apsi, plus GEOMEAN), with labeled per-benchmark gains ranging from -2.6% to 9.5%.]

Future Research

DCA next steps:

- Protocol optimization: bypass memory and write incoming data directly into the LLC (a write-update protocol).
- Measure the performance improvement with DCA at 10 Gbps and the benefit to real applications.

Related future work:

- Read Current: a network-transmit optimization in which the cached buffer used to transmit data remains in the same state in the cache.
- Cache QoS: network processing cycles kernel buffers through the CPU cache, evicting other useful data; Cache QoS policies would limit such pollution by restricting network data to a few ways of the cache (see the way-mask sketch after this list).
- CPU-NIC integration: integrating the NIC onto the CPU can unveil opportunities that traditional software and hardware do not enjoy; a broader ecosystem uplift is required to make effective use of NIC integration.
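A hypothetical illustration of the Cache QoS idea (the way-mask encoding and the 8-way geometry are assumptions for the example; no real 2007 hardware interface is implied):

```c
/* Way-partitioning sketch: confine network-buffer fills to a subset of the
 * cache's ways via a bitmask, so packet data cannot evict application
 * lines from the remaining ways. */
#include <stdio.h>
#include <stdint.h>

#define CACHE_WAYS 8

/* Pick a victim way for a fill, honoring the traffic class's way mask. */
static int pick_victim(uint8_t way_mask, unsigned next_victim)
{
    for (unsigned i = 0; i < CACHE_WAYS; i++) {
        unsigned way = (next_victim + i) % CACHE_WAYS;
        if (way_mask & (1u << way))
            return (int)way;        /* only evict within allowed ways */
    }
    return -1;                      /* empty mask: fill not allowed */
}

int main(void)
{
    uint8_t net_mask = 0x03;  /* network data: ways 0-1 only (2 of 8) */
    uint8_t app_mask = 0xFC;  /* everything else: ways 2-7            */

    printf("network fill -> way %d\n", pick_victim(net_mask, 5));
    printf("app     fill -> way %d\n", pick_victim(app_mask, 5));
    return 0;
}
```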


Q&A

Disclaimer: Performance tests and ratings are measured using specific computer systems and/or components and reflect the approximate performance of Intel products as measured by those tests. Any difference in system hardware or software design or configuration may affect actual performance. Buyers should consult other sources of information to evaluate the performance of systems or components they are considering purchasing. For more information on performance tests and on the performance of Intel products, visit http://www.intel.com/performance/resources/limits.htm

