Impact of Cache Coherence Protocols on the Processing of Network Traffic

Amit Kumar and Ram Huggahalli, Communication Technology Lab, Corporate Technology Group, Intel Corporation. December 3, 2007.

© 2007 Intel Corporation

Outline

- Background
  - Network performance improvement with the new microarchitecture
  - Need to revisit platform changes for CPU onloading
- Overview of the existing and prefetch-hint coherence protocols
- Direct Cache Access (DCA) performance overview
- Prototype results
- Future research


Background & Motivation

- Adoption of 10 Gbps networking has been limited to a few applications; a primary reason has been the processing capability of general-purpose platforms.
- Recent micro-architectural changes in Intel® Core™ processors have shown 66% higher network processing capability than the previous-generation Intel® Pentium® 4 architecture.
- Providing a coherence protocol that places data into the CPU cache further improves processing capability.
- Our prototype implementation of Direct Cache Access (DCA) shows a 15.6% to 43.4% speed-up.


Background & Motivation (contd.)

Solutions to reduce TCP/IP processing overhead fall into three categories:

- Platform improvements for CPU onloading: copy-specific solutions such as user-level TCP/IP stacks, page flipping, etc.
- TCP Offload Engines (TOEs): use hardware assists to offload the main CPU; limited to a small spectrum of networking applications.
- Interconnects and protocols such as InfiniBand, Myrinet, or RDMA: require new hardware-software interfaces and therefore application support; in some cases, expensive NIC solutions as well.

New micro-architectural efficiencies provide a greater impetus for CPU onloading and diminish the need for specialized solutions.


Opportunity for DCA in Realistic Workloads

Source: Ram Huggahalli, Ravi Iyer, and Scott Tetrick, "Direct Cache Access for High Bandwidth Network I/O," 32nd Annual International Symposium on Computer Architecture (ISCA'05), pp. 50-59.

[Figure: % of inbound I/O data read by the CPU vs. distance in time. X-axis: system bus clocks (x1000), 0 to 50; y-axis: % occurrence, 0% to 100%. Workloads: Rx (512B to 4KB), TPC-W, SPECWeb99, TPC-C. Takeaway: for network I/O, about 80% of inbound data is read by the CPU within a short time of arrival.]

Today's Coherence Protocol

1. Packet arrives on the NIC from the network.
2. NIC sends the packet as I/O bus transactions to the chipset.
3. Chipset ensures coherency of the data by snooping processor caches before writing to memory; an L1/L2 line in M state is written back first.
4. Processor eventually reads the packet for TCP/IP processing and moves the data to the application buffer.

Before the CPU is interrupted: coherent memory write, snoop, writeback, memory write. After the CPU is interrupted: read, read data. The demand read from the CPU is a compulsory cache miss.

[Diagram: coherence protocol for inbound I/O — Network to NIC to Chipset to Memory, with snoops to the processor (L1/L2 line in M state).]
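To make the sequence concrete, here is a minimal C sketch (not Intel's implementation; the simplified MESI states and trace format are assumptions for illustration) that prints the bus transactions of this baseline flow:

```c
/* Baseline inbound-I/O coherence flow, as an event trace. The receive-buffer
 * line is assumed to still be in M state in the CPU cache from its previous
 * use, so the chipset's snoop forces a writeback before the packet write. */
#include <stdio.h>

typedef enum { INVALID, SHARED, MODIFIED } mesi_t;  /* simplified MESI */

static void baseline_inbound(mesi_t *cpu_line)
{
    puts("NIC     -> chipset : coherent memory write (packet)");
    puts("chipset -> CPU     : snoop");
    if (*cpu_line == MODIFIED) {
        puts("CPU     -> chipset : writeback (M-state line)");
        *cpu_line = INVALID;        /* line invalidated by the I/O write */
    }
    puts("chipset -> memory  : memory write (packet now in DRAM only)");
    puts("-- interrupt; TCP/IP processing starts --");
    puts("CPU     -> memory  : demand read (compulsory cache miss)");
    *cpu_line = SHARED;             /* line refilled from DRAM */
}

int main(void)
{
    mesi_t line = MODIFIED;
    baseline_inbound(&line);
    return 0;
}
```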

Prefetch Hint Protocol

1. Packet arrives on the NIC from the network.
2. NIC sends the packet as I/O bus transactions, with a target cache tag, to the chipset.
3. Chipset sends snoops to the processor with hints to prefetch the data; an M-state L1/L2 line is written back and the packet is written to memory.
4. Processor prefetches the packet soon after the hint is received, so the packet is present in the cache when TCP/IP processing begins.

Before the CPU is interrupted: coherent memory write, snoop-hint, writeback, memory write, prefetch read, read data.

[Diagram: coherence protocol for the DCA prototype — Network to NIC to Chipset to Memory, with snoop-hints to the processor.]
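A companion sketch of the hinted flow, with made-up placeholder latencies (the deck does not give real bus timings), showing why the critical-path read shrinks from a DRAM access to a cache hit:

```c
/* Latency model of baseline vs. prefetch-hint inbound I/O. All numbers are
 * illustrative assumptions, not measurements; the point is that the hinted
 * prefetch overlaps interrupt latency, leaving only a cache hit on the
 * CPU's demand-read path. */
#include <stdio.h>
#include <stdbool.h>

#define SNOOP_NS      40   /* assumed: chipset snoop of CPU caches    */
#define WRITEBACK_NS  60   /* assumed: writeback of an M-state line   */
#define MEM_WRITE_NS  50   /* assumed: packet write to DRAM           */
#define MEM_READ_NS  100   /* assumed: baseline demand read from DRAM */
#define CACHE_READ_NS 10   /* assumed: demand read after hinted prefetch */

struct cache_line { bool modified; };

static unsigned inbound(struct cache_line *line, bool prefetch_hint)
{
    unsigned ns = SNOOP_NS;          /* chipset snoops the CPU          */
    if (line->modified) {            /* stale M-state copy: write back  */
        ns += WRITEBACK_NS;
        line->modified = false;
    }
    ns += MEM_WRITE_NS;              /* packet written to DRAM          */

    /* With a hint, the prefetch runs while the interrupt is pending, so
     * only a cache hit remains when TCP/IP processing starts. */
    ns += prefetch_hint ? CACHE_READ_NS : MEM_READ_NS;
    return ns;
}

int main(void)
{
    struct cache_line a = { .modified = true }, b = { .modified = true };
    printf("baseline: %u ns, prefetch-hint: %u ns per line\n",
           inbound(&a, false), inbound(&b, true));
    return 0;
}
```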

Impact of the Prefetch Hint/DCA Protocol

[Chart: ns-per-packet profile at 4KB I/O, split into tcp, copy, and other (driver, OS, app interface) components for Base vs. DCA; labeled values include 2680 ns, 2481 ns, 1002 ns, 256 ns, 179 ns, and 148 ns.]

System configuration (source: Intel): four cores, two per shared L2 cache; FSB at 1333 MHz (10.4 GB/s peak); Memory Controller Hub with 4-channel FBD-667 memory (20.8 GB/s peak read bandwidth); PCIe 2x1GbE NIC, with both 1GbE ports connected to a system similar to the SUT.

Copy with DCA is 5x faster, and TCP/IP processing is 1.5x faster.
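As a sanity check on scale, ns-per-packet converts directly to an upper bound on per-core throughput. The helper below uses the chart's 2680 ns label and the 4KB I/O size, assuming, for illustration only, that the label represents the full per-packet processing cost:

```c
/* Convert a ns-per-packet figure into an upper-bound per-core throughput.
 * bits-per-ns equals Gb/s exactly, so no scaling constant is needed. */
#include <stdio.h>

static double gbps_per_core(double ns_per_packet, double io_bytes)
{
    return (io_bytes * 8.0) / ns_per_packet;  /* bits/ns == Gb/s */
}

int main(void)
{
    /* 4096 B * 8 / 2680 ns ~= 12.2 Gb/s if a core did nothing else */
    printf("4KB @ 2680 ns/packet -> %.1f Gb/s per core (upper bound)\n",
           gbps_per_core(2680.0, 4096.0));
    return 0;
}
```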

DCA Performance & Sensitivities

[Figure, panels (a)-(e): Base vs. DCA network throughput and normalized throughput per core (Gb/s) with speed-up vs. I/O size (64 to 65536 bytes), with labeled speed-ups of 15.6% to 42.4%; normalized throughput per core and speed-up vs. total TCP connections across 2 ports (log scale, 1 to 1000), with 40.3% speed-up at 2 connections and 29.9% at 512 connections; DCA speed-up with and without memory loading for I/O sizes of 1024 to 16384 bytes; and SPECintRate and SPECfpRate with concurrent network traffic across the SPEC CPU2000 suites (gzip through twolf, wupwise through apsi, plus GEOMEAN), with labeled per-benchmark gains ranging from -2.6% to 9.5%.]

Future Research

DCA next steps:

- Protocol optimization: bypass memory and write incoming data directly into the LLC (a write-update protocol).
- Measure the performance improvement with DCA at 10 Gbps and the benefit to real applications.

Related future work:

- Read Current: a network-transmit optimization in which the cached buffer used to transmit data remains in the same state in the cache.
- Cache QoS: network processing cycles kernel buffers through the CPU cache, evicting other useful data; Cache QoS policies would limit such pollution by restricting network data to a few ways of the cache (see the way-mask sketch after this list).
- CPU-NIC integration: integrating the NIC onto the CPU can unveil opportunities that traditional software and hardware do not enjoy; a broader ecosystem uplift is required to make effective use of NIC integration.
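A hypothetical illustration of the Cache QoS idea (the way-mask encoding and the 8-way geometry are assumptions for the example; no real 2007 hardware interface is implied):

```c
/* Way-partitioning sketch: confine network-buffer fills to a subset of the
 * cache's ways via a bitmask, so packet data cannot evict application
 * lines from the remaining ways. */
#include <stdio.h>
#include <stdint.h>

#define CACHE_WAYS 8

/* Pick a victim way for a fill, honoring the traffic class's way mask. */
static int pick_victim(uint8_t way_mask, unsigned next_victim)
{
    for (unsigned i = 0; i < CACHE_WAYS; i++) {
        unsigned way = (next_victim + i) % CACHE_WAYS;
        if (way_mask & (1u << way))
            return (int)way;        /* only evict within allowed ways */
    }
    return -1;                      /* empty mask: fill not allowed */
}

int main(void)
{
    uint8_t net_mask = 0x03;  /* network data: ways 0-1 only (2 of 8) */
    uint8_t app_mask = 0xFC;  /* everything else: ways 2-7            */

    printf("network fill -> way %d\n", pick_victim(net_mask, 5));
    printf("app     fill -> way %d\n", pick_victim(app_mask, 5));
    return 0;
}
```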


Q&A

Disclaimer: Performance tests and ratings are measured using specific computer systems and/or components and reflect the approximate performance of Intel products as measured by those tests. Any difference in system hardware or software design or configuration may affect actual performance. Buyers should consult other sources of information to evaluate the performance of systems or components they are considering purchasing. For more information on performance tests and on the performance of Intel products, visit http://www.intel.com/performance/resources/limits.htm

