Architectural Characterization of Processor Affinity in Network Processing
Annie Foong, Jason Fung, Don Newell, Seth Abraham, Peggy Irelan, Alex Lopez-Estrada
Intel Corporation
[email protected]

Abstract
Network protocol stacks, in particular TCP/IP software implementations, are known for their inability to scale well in general-purpose monolithic SMP operating systems (OS). Previous researchers have experimented with affinitizing processes/threads, as well as interrupts from devices, to specific processors in an SMP system. However, general-purpose operating systems give minimal consideration to user-defined affinity in their schedulers. Our goal is to expose the full potential of affinity by an in-depth characterization of the reasons behind performance gains. We conducted an experimental study of TCP performance under various affinity modes on IA-based servers. Results showed that interrupt affinity alone provided a throughput gain of up to 25%, and combined thread/process and interrupt affinity can achieve gains of 30%. In particular, calling out the impact of affinity on machine clears (in addition to cache misses) is a characterization that has not been done before.

1. Introduction
The arrival of 10 Gigabit Ethernet (GbE) allows a standardized physical fabric to handle the tremendous speeds previously attributed only to proprietary networks. Though aimed primarily at meeting the needs of traffic loads seen in data centers and storage area networks (SANs), the concept of a converged fabric across WANs, LANs and SANs is appealing. However, supporting multiple gigabits per second of TCP traffic can quickly saturate the abilities of an SMP server today. At the platform level, integration of the memory controller on the CPU die will effectively scale memory bandwidth with processing power. Next-generation buses, such as PCI Express, will potentially deliver 64 Gb/s of bandwidth. While platform improvements will continue to address bus and memory bottlenecks, a system bottleneck still exists in terms of a processor's capacity to process TCP at these high speeds. Adding more processors to a system, by itself, does not address the problem – TCP/IP software implementations are known for their inability to scale well in general-purpose SMP operating systems (OS) [10][16]. However, next-generation chip multiprocessors (CMP) will bring multiple cores to each CPU [8], making SMP scaling a major operating-system design issue. Previous work has shown potential performance improvements from careful affinity of processes/threads to processors in an SMP system [13][21]. However, current general-purpose operating systems support only static affinity, and give minimal consideration to user-defined affinity in their schedulers. The ultimate goal of this work is to make a case for generic OS schedulers to provide mechanisms that account for user-directed affinity. Before that can be done, we must first expose the full potential of affinity by an in-depth characterization of the reasons behind performance gains.
We provide the background necessary to understand the motivation for our work and the problem statement in Sections 2 and 3. In Section 4, we give implementation details and the tools used for the analysis. In Section 5, we provide overall performance data for all the affinity modes possible, so as to determine the data points worthy of further study. We focus on these data points in Section 6. There, we analyze our results in depth in the context of a reference TCP stack. As we proceed through the stages of analysis, we gradually home in on the events and metrics that matter. Where pertinent, we also call out places where affinity does not make a difference. We conclude by discussing related and future work.
2. Background
The major overheads of TCP are well studied [3][5][9]. A seminal paper [4] showed that the number of instructions for TCP protocol processing itself is minimal. The non-scalability of TCP stems from the fact that it requires substantial support from
the operating system (e.g. buffer management, timers, etc.) and incurs substantial memory overheads. These include memory accesses for data movement (the copy-based BSD sockets programming model can incur up to three accesses to memory per request). We refer interested readers to [5], where we describe an implementation of the TCP fast paths as typified by Linux. Network adapter (NIC) manufacturers' efforts to offload functionality from the processor have resulted in real but incremental improvements. They range from checksum and segmentation offloads [3] to complete offload of the TCP stack to hardware [1]. In the case of full-fledged TCP offload engines (TOEs), the industry's success has been elusive at best. TCP has a far more complex state machine than most other transports. Unlike some newer protocols (e.g. Fibre Channel and InfiniBand), which were designed specifically for hardware implementation from the ground up, TCP began as a software stack. Corner cases abound that are not so easily addressed if the solutions are hardwired. Finally, the most commonly overlooked overheads are those incurred by scheduling and interrupting [9]. Though not cost-intensive operations by themselves, these have an indirect intrusive effect on cache and pipeline effectiveness. The intrusions into a TCP stack come in the form of applications issuing requests from above and interrupts coming from devices as data arrives or leaves. The impact of intrusions can be substantial in general-purpose SMP OSes. These OSes are designed to run a wide variety of applications. As such, their schedulers will always attempt to load balance, moving processes from processors with heavier loads to those with lighter loads. However, process migration is not free. The migrated process has to pay the price of warming the various levels of data cache, the instruction cache and the translation-lookaside buffers of the processor it has just migrated to. On the other hand, generic OSes do not attempt to balance interrupts across processors. Both Windows NT's and Linux's default SMP configurations operate with device interrupts going to CPU0. Under high loads, CPU0 saturates before the other processors in the system. Previous OS attempts to redistribute interrupts in either a random or round-robin fashion gave rise to bad side effects [1]. Interrupt handlers ended up being executed on random processors and created more contention for shared resources. Furthermore, cache contents (e.g. TCP contexts) are not reused optimally, as packets from devices are sent to different processors on every interrupt. Since the scheduler prioritizes balancing over process-to-interrupt affinity, going to more processors
will only increase the non-scalability problem. Finally, the TCP/IP stack differs from most applications in that it is completely event-driven. Applications issue requests from the top, independent of data arriving from the network. Data can arrive/leave inside OR outside of the requesting process’s context. This creates interesting affinity problems and possibilities.
3. Problem Statement
While it is always possible to design different scheduling algorithms that can be effective under differing workloads [15], the fact remains that the OS is still oblivious to application needs. We propose that an application knows its own workload best and is in a better position to "place" itself than the OS scheduler. However, leveraging such an optimization can be difficult in general-purpose OSes. A first step in validating this hypothesis is to characterize the impact of user-directed affinity as it is supported by today's OSes. Our research questions and methodology in this study are thus:
1. What is the baseline profile and characterization of TCP processing? To answer this, we break down the TCP stack into logical, functional bins. Separation at the procedure-call level (> 300 procedures) would render any analysis useless. Examining only the top few functions [1] provides only a partial view. Instead, we have carefully examined all the code of a reference TCP stack (Linux-2.4.20) and separated the procedures into basic blocks of TCP functionality. We provide a full baseline characterization of TCP processing in two affinity modes. This forms the foundation for the comparative study.
2. Where exactly do affinity improvements happen? By how much? To quantify this, we performed a series of speedup analyses using Amdahl's Law [6]. We do a comparative study of processing times (and other events) in the no-affinity mode against those in the full-affinity mode. Speedups and improvements are derived accordingly.
3. What (subset of) architectural events are responsible for performance improvements? While previous researchers have all attributed performance improvement to better cache locality, there has not been an extensive attempt to fully expose the architectural reasons behind the improvements. We monitor the counts of various events, including last-level cache misses. It must be noted that we did not exhaustively look at all possible events. Rather, we focused on the usual performance culprits (e.g. cache
misses, branch mispredictions, TLB misses, etc.). By using expected costs for event penalties seen in these processors, we are able to obtain a first-order approximation of the primary performance-affecting events.
Experiment-based characterization of entire application runs [14] and analytic models of protocol stacks [20] have given us an overall understanding of networking requirements. In addition to architectural understanding, we also hope to bring a systems-software perspective to this study. By abstracting TCP processing to a level where analysis gives useful insights, we can quantify and directly relate architectural events to the software implementation. For example, while it is important to know that the overall CPI of transmit processing of 64KB is about 5, it is extremely useful to further realize that data copy routines incur CPIs of 4, while interface routines incur CPIs of 17. Such a view allows us to (i) provide a solid understanding of TCP processing in different affinity modes; (ii) showcase exactly where affinity brings benefits to TCP processing; and (iii) relate these benefits to improvements seen in various architectural events.
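For reference, the speedup bookkeeping behind question 2 follows the standard Amdahl formulation: if a functional bin accounts for a fraction f of total processing time and that bin alone is improved by a factor s, the overall speedup is

\[
\mathrm{Speedup}_{\mathrm{overall}} = \frac{1}{(1 - f) + f/s}.
\]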
4. Experimental Setup
Figure 1 summarizes the configuration of the system under test (SUT) and clients, and the setup of our small cluster. We used the ttcp microbenchmark to exercise bulk data transmits (TX) and receives (RX) between two nodes. A connection is set up once between two nodes; data is then sent from the transmitter(s) to the receiver(s), reusing the same buffer space for all iterations. The ttcp workload primarily characterizes bulk data transfer behavior, and must be understood in that context. We have chosen this simple workload because it exercises the typical and optimal TCP code path, and allows us to focus on understanding the network stack without application-related distractions. Moreover, our focus is on quantifying the differences that affinity brings in an ideal scheduling case, where load-balancing quirks do not come into play. This workload projects directly to real workloads that are based on long-lived connections and bulk data (e.g. iSCSI and other network storage). ttcp caching behavior is also representative of real web or file servers that serve static file content to/from the network (no touching of payload data). Web characterization studies [24] showed that although 50% of web requests may be dynamic in nature, they result in 30-60% quasi-static "templates" that can be reused. More importantly, we can partition any general workload into "network fast paths", "network connection setup/teardown" and "application processing", as exemplified in [14]. The affinity studies done here project directly to the portions involving network fast paths.
To study the various affinity modes, we used the mechanisms available through Red Hat's patched version of the Linux-2.4.20 kernel (and officially folded into the mainstream Linux-2.6 kernel [11]). These mechanisms allow processes, threads and interrupts to be statically bound to processors. In our tests, one connection (unique IP address) is owned by one instance of ttcp and serviced by one physical NIC. There are a total of 8 GbE NICs, 8 connections and 8 ttcp processes running on our SUT. We compare four modes of affinity: (i) no affinity (no aff); (ii) interrupt-only affinity (IRQ aff), e.g. interrupts from NICs 1-4 are directed to CPU0; (iii) process-only affinity (proc aff), e.g. ttcp processes 1-4 are bound to CPU0; and (iv) full affinity (full aff). Full affinity is the case where a ttcp process is affinitized to the same processor as the interrupts coming from the NIC that it is assigned to (Figure 2). We modified ttcp to use sys_sched_setaffinity() to set process affinity [12]. We statically redirect interrupts from a device to a specific processor by setting a bit mask in smp_affinity under Linux's /proc filesystem.
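For illustration, a minimal sketch of these two mechanisms follows (this is not our modified ttcp source; the helper names, the modern glibc sched_setaffinity() wrapper, and the example CPU/IRQ numbers are our own assumptions):

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

/* Bind the calling process to one CPU (same mechanism as sys_sched_setaffinity). */
static int bind_self_to_cpu(int cpu)
{
    cpu_set_t mask;
    CPU_ZERO(&mask);
    CPU_SET(cpu, &mask);
    return sched_setaffinity(0, sizeof(mask), &mask);  /* pid 0 = calling process */
}

/* Direct all interrupts of a given IRQ line to one CPU by writing a
 * hexadecimal CPU bit mask to smp_affinity in /proc (requires root). */
static int bind_irq_to_cpu(int irq, int cpu)
{
    char path[64];
    FILE *f;

    snprintf(path, sizeof(path), "/proc/irq/%d/smp_affinity", irq);
    f = fopen(path, "w");
    if (!f)
        return -1;
    fprintf(f, "%x\n", 1 << cpu);
    return fclose(f);
}

In the full-affinity mode, ttcp instance i and the interrupts of the NIC carrying connection i are bound to the same processor, reproducing the mapping sketched in Figure 2.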
                 System under Test (SUT)                        Client
Processors       Intel 2GHz P4 Xeon MP × 2                      Intel 3.06GHz P4 Xeon × 2
Cache            512KB L2, 2MB L3                               512KB L2
FSB Freq         400 MHz                                        533 MHz
Memory           DDR 200MHz Registered ECC, 256MB/channel × 4   DDR 266MHz Registered ECC, 2GB/channel × 2
Board / Chipset  Shasta-G ServerWorks GC-HE                     Westville Intel E7501
PCI Bus          64-bit PCI-X 66/100MHz                         64-bit PCI-X 66/100MHz
NIC              Dual-port Intel PRO/1000 MT × 4                Dual-port Intel PRO/1000 MT × 1
[Figure 1 diagram: the eight ttcp processes and NIC ports on the SUT and their connections to Clients 1-4.]
Figure 1 System configurations and Cluster Setup
[Figure 2 diagram: placement of the interrupts and ttcp processes for connections 1-8 across CPU0 and CPU1 under the two mappings.]
No Affinity: interrupts default to CPU0, OS-based scheduling.
Full Affinity: each interrupt and process mapped to a specific CPU.
Figure 2 Two possible permutations of interrupt and process affinity
To get processing-distribution insights for our in-depth analysis, we used Oprofile-0.7 [18] as our measurement tool. Oprofile is a low-overhead, system-wide profiler based on event sampling. It allows us to determine the number of events that occurred in any function during a run. The events are those supported by the processor's hardware event counter registers [7]. For example, if the event of interest is set to cycles, we can determine the time spent in a function; if the event is last-level cache (LLC) misses, we can determine how many times touching data resulted in a memory access. It must be noted that Oprofile is based on statistical sampling; it is not an exact accounting tool. When a profile is performed over a long run, it gives a fairly accurate distribution of where events lie. For the profiling to capture all cycles, we further ensured that the processors poll on idle, instead of using the default power-saving mode.
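For concreteness, the per-bin metrics reported later in Table 1 are simple ratios of such sampled event counts; the following sketch shows the bookkeeping (the struct and function names are ours, not Oprofile's):

struct bin_counts {
    unsigned long long cycles, instructions, llc_misses, branches, mispredicted;
};

/* CPI: cycles per instruction */
static double cpi(const struct bin_counts *b)
{
    return (double)b->cycles / (double)b->instructions;
}

/* MPI: last-level cache misses per instruction */
static double mpi(const struct bin_counts *b)
{
    return (double)b->llc_misses / (double)b->instructions;
}

/* %Branches: branches as a fraction of instructions */
static double pct_branches(const struct bin_counts *b)
{
    return 100.0 * (double)b->branches / (double)b->instructions;
}

/* %Br mispredicted: mispredicted branches as a fraction of branches */
static double pct_mispredicted(const struct bin_counts *b)
{
    return 100.0 * (double)b->mispredicted / (double)b->branches;
}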
5. Overview of Performance
In this section, we present a performance overview of the various possibilities of process and interrupt affinity. Figure 3 shows the TCP performance comparison of the four affinity modes. The bars show CPU utilization (almost fully utilized in all cases), while throughput is represented by lines. We see that process affinity alone has little impact on throughput. Under this mode, CPU0 not only has to service all interrupts, but also at least four ttcp processes. Any affinity benefits are negated by the more pronounced load imbalance. On the other hand, interrupt affinity alone can improve throughput by as much as 25%. This behavior is a result of the scheduling algorithm. To reduce cache interference, the scheduler tries as much as possible to schedule a process onto the same processor that it was previously running on. By the same token, "bottom halves/tasklets" (i.e. tasks scheduled to run at a later time) of interrupt handlers are usually scheduled on the same processor where their corresponding "top halves" had previously run. As a result, interrupt affinity indirectly leads to process affinity as well. Of course, there is no guarantee, and interrupt and process contexts can still end up on different CPUs. The best improvement in throughput (up to 29%) is therefore achieved with full affinity. We also ran similar tests on 4P systems (not shown here) and observed even better improvement brought on by affinity. However, this has more to do with the imbalance of workload than with the intrinsic impact of affinity. Without affinity, the bottleneck that CPU0 imposes on a 4P system becomes even more pronounced: CPU0 is fully saturated with interrupt processing, even though there are idle cycles available on the other processors. Given these caveats, further analysis is done only on 2P systems.
A more illuminating view is to normalize processor cycles by work done – GHz/Gbps, i.e. cycles per bit transferred. This "cost" metric allows us to account for both CPU and throughput improvement at the same time (Figure 4). To interpret these charts, consider the cost of a 64KB transmit: it is about 1.9 in the no-affinity case, and is reduced to 1.4 in the full-affinity case, a reduction of about 25%. Affinity has a bigger impact on large transfers, and we will explain why in the next section.
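A small sketch of how this cost metric can be computed from the measured quantities (the helper name and the assumption that utilization is averaged across both processors are ours):

/* Normalized processing cost in GHz/Gbps, i.e. CPU cycles spent per bit transferred. */
static double cost_ghz_per_gbps(double avg_cpu_util,   /* 0.0 .. 1.0, averaged over CPUs */
                                double cpu_freq_ghz,   /* 2.0 GHz on the SUT */
                                int num_cpus,          /* 2 in these experiments */
                                double throughput_gbps)
{
    return (avg_cpu_util * cpu_freq_ghz * (double)num_cpus) / throughput_gbps;
}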
6. Detailed Analysis and Methodology
In this section, we do a detailed analysis of the extreme data points identified in the previous section, i.e. receives and transmits of 128B and 64KB, under the no-affinity and full-affinity modes. The behavior of other data points falls somewhere in between these extremes. We begin with a baseline analysis of the two affinity modes and show pertinent metrics that characterize the stack. We then extract performance-impact indicators based on the Pentium 4's expected event penalties [23]. Once these are identified, the final comparative study, evaluating the impact of affinity in terms of these identified events, is done.
[Figure 3 charts: "TX Bandwidth vs CPU Utilization" and "RX Bandwidth vs CPU Utilization" – CPU utilization (%) bars and bandwidth (Mb/s) lines versus transaction size (bytes, 128 to 65536) for the No Aff, Proc Aff, IRQ Aff and Full Aff modes.]
Figure 3 TCP CPU utilization and throughput
[Figure 4 charts: "Tx Cost in GHz/Gbps" and "Rx Cost in GHz/Gbps" versus transaction size (bytes, 128 to 65536) for the No Aff, Proc Aff, IRQ Aff and Full Aff modes.]
Figure 4 TCP processing costs
6.1. Baseline TCP Characterization
Table 1 shows a comprehensive characterization of the stack. We have separated the compute-intensive part of TCP protocol processing (Engine), i.e. the cranking of the state machine, from the memory-intensive parts of TCP processing (Buffer mgmt). Buffer mgmt includes memory and buffer management routines, the manipulation of TCP control structures, etc. Copies cover movement of payload data only. This allows us to highlight the impact of the copy semantics imposed by BSD-based synchronous sockets [5]. Data copy is always uncached on the receive side, since the packet arrives in memory via device DMA. Whether or not it is cached on the send side depends on the application's caching behavior. In our experiments, we have set ttcp to serve data directly from cache. This reflects how modern server applications are designed; e.g. in-kernel web servers (IIS, TUX) serve data out of the buffer cache. A full implementation of the sockets interface includes not only the obvious BSD sockets API (both kernel and user), but also the Linux system call (sys_call) routine and schedule-related routines. This is how an application causes a socket action to be executed from the user level all the way down to the TCP stack. We put all these functions into the Interface bin. Driver includes both the NIC driver routines and NIC interrupt processing. Locks includes all synchronization-related routines. Timers refers to all of the timer routines that TCP uses.
A few baseline observations are worth calling out; we reserve comparisons for later sections. For 64KB transfers, the top timing hotspots include the TCP engine, buffer mgmt and copies. For 128B transfers, the hotspots are the sockets interface and the TCP engine.
Table 1 Baseline Characterization

RX 128B      % Cycles         CPI             MPI               % Branches        % Br mispredicted
             No Aff  Full     No Aff  Full    No Aff   Full     No Aff   Full     No Aff   Full
Interface    41.5%   46%      8.49    8.66    0.0032   0.0036   19.76%   20.74%   0.22%    0.21%
Engine       23.7%   21%      3.38    2.72    0.0021   0.0005   15.21%   15.59%   0.98%    0.92%
Buf Mgmt     10.0%   7%       2.31    1.55    0.0023   0.0002   17.25%   17.32%   0.64%    0.44%
Copies       13.8%   15%      4.99    5.14    0.0074   0.0077   9.93%    10.94%   0.02%    0.00%
Driver       5%      5%       5.64    4.44    0.0063   0.0024   12.64%   13.01%   3.58%    4.27%
Locks        2.7%    1%       17.95   23.22   0.0080   0.0103   30.14%   34.79%   2.38%    3.11%
Timers       2.2%    3%       3.04    3.17    0.0018   0.0042   14.33%   12.92%   0.08%    0.13%
Overall      99.0%   99.0%    4.66    4.23    0.0032   0.0023   16.42%   16.81%   0.68%    0.63%

RX 64KB      % Cycles         CPI             MPI               % Branches        % Br mispredicted
             No Aff  Full     No Aff  Full    No Aff   Full     No Aff   Full     No Aff   Full
Interface    3.0%    7.5%     15.44   8.90    0.0195   0.0023   22.46%   37.66%   6.69%    6.62%
Engine       22.8%   22.7%    4.70    3.72    0.0046   0.0016   16.98%   17.92%   0.75%    0.52%
Buf Mgmt     11.2%   20.4%    6.57    4.04    0.0106   0.0039   15.95%   17.34%   1.43%    0.83%
Copies       40.3%   32.1%    66.34   58.03   0.1329   0.1100   11.97%   11.05%   0.65%    0.83%
Driver       11.0%   7.2%     6.89    5.69    0.0108   0.0051   12.63%   13.44%   3.04%    3.68%
Locks        0.3%    1.3%     15.16   22.78   0.0054   0.0222   35.20%   29.93%   1.91%    17.62%
Timers       11.3%   8.2%     5.85    7.35    0.0097   0.0146   9.60%    10.42%   0.19%    0.21%
Overall      99.9%   99.4%    8.49    7.53    0.0133   0.0101   15.28%   16.13%   1.37%    1.20%

TX 128B      % Cycles         CPI             MPI               % Branches        % Br mispredicted
             No Aff  Full     No Aff  Full    No Aff   Full     No Aff   Full     No Aff   Full
Interface    42.4%   46.0%    8.68    8.73    0.0034   0.0037   17.77%   18.61%   0.20%    0.19%
Engine       29.0%   28.8%    3.38    3.05    0.0020   0.0009   17.98%   18.46%   0.59%    0.54%
Buf Mgmt     11.6%   8.2%     4.44    2.99    0.0046   0.0001   16.59%   16.08%   1.33%    0.81%
Copies       5.9%    6.9%     1.62    1.60    0.0082   0.0079   5.05%    5.31%    3.31%    3.33%
Driver       4.4%    6.0%     5.73    4.38    0.0065   0.0025   14.94%   14.16%   3.21%    3.08%
Locks        3.8%    1.0%     14.96   20.06   0.0030   0.0099   28.77%   32.08%   1.07%    4.54%
Timers       1.5%    2.2%     2.58    3.15    0.0016   0.0042   15.69%   14.49%   0.15%    0.15%
Overall      98.8%   99.1%    4.56    4.11    0.0038   0.0028   15.80%   15.97%   0.90%    0.81%

TX 64KB      % Cycles         CPI             MPI               % Branches        % Br mispredicted
             No Aff  Full     No Aff  Full    No Aff   Full     No Aff   Full     No Aff   Full
Interface    6.0%    5.0%     17.62   11.27   0.0212   0.0063   20.06%   20.41%   6.66%    8.90%
Engine       25.5%   21.8%    5.01    3.41    0.0070   0.0016   16.96%   16.40%   1.83%    2.24%
Buf Mgmt     28.0%   20.3%    5.93    4.06    0.0065   0.0007   16.92%   16.49%   1.07%    0.53%
Copies       27.1%   37.1%    3.93    4.12    0.0106   0.0095   2.20%    2.24%    0.37%    0.39%
Driver       10.4%   12.2%    6.06    5.35    0.0049   0.0030   14.93%   14.68%   1.37%    1.57%
Locks        0.6%    0.0%     14.65   16.49   0.0025   0.0040   24.80%   20.09%   0.78%    31.73%
Timers       2.0%    3.0%     4.07    7.10    0.0029   0.0116   9.99%    10.96%   0.15%    0.27%
Overall      99.7%   99.5%    5.04    4.14    0.0078   0.0047   11.53%   10.76%   1.41%    1.43%

No Aff: no affinity. Full: full affinity.
CPI: cycles per instruction. MPI: last-level cache misses per instruction.
% Branches: number of branches / number of instructions. % Br mispredicted: number of branches mispredicted / number of branches.
Time spent in drivers is also quite substantial for large transfers. This is expected – large transfers involve primarily data and descriptor manipulation, while small transfers involve primarily socket read/write calls. For the Pentium 4, a CPI value of 1 is considered good, and a value of 5 is considered poor [23]. We see that TCP processing, on the whole, does poorly in terms of CPI. Furthermore, TX generally has lower CPIs and MPIs than RX, an indicator that RX is more memory-bound. Very large CPIs are seen in interface and locks. Given the nature of these bins – system calls and resource contention, respectively – we expected these inefficiencies. In all cases, the computation requirement of the TCP engine remains remarkably constant (~20%-30%) when cycles are normalized to work done. Copies tend to take more time in RX than TX. Copies on RX (under Linux-2.4.x) are implemented via rep movl (repeat string move), whereas copies on TX are implemented via a carefully crafted rolled-out loop that moves data efficiently based on its alignment. This also explains the glaringly large CPI and MPI seen in RX of 64KB – one instruction is responsible for moving a large amount of data. Alignment of TX data is known beforehand, allowing the rolled-out optimization; RX copies were implemented assuming arbitrary arrival of bytes, where alignment is not guaranteed. A more optimized version of the RX copy, based on integer copy, has since appeared in Linux-2.6 [1]. In addition, timers in RX of 64KB take up substantially more time. Most of that time is in the routine do_gettimeofday(), which is used by the bottom half of the receive interrupt handler to compare the current time with the timestamp in arriving packets. There is no corresponding use of this routine on the TX path. Overall, branches make up about 10%-16% of all instructions in the TCP fast path, and the percentage of branch mispredictions is also fairly low (< 2%).
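To make the two copy styles concrete, the following sketches illustrate the idea at user level (these are not the kernel's actual copy routines; x86 and a length that is a multiple of four bytes are assumed):

#include <stddef.h>

/* RX-style copy: a single "rep movsl" string move drives the whole transfer,
 * so one instruction accounts for many bytes (hence the very large CPI/MPI). */
static void copy_rep_movsl(void *dst, const void *src, size_t bytes)
{
    size_t longs = bytes / 4;
    __asm__ volatile("rep movsl"
                     : "+D"(dst), "+S"(src), "+c"(longs)
                     :
                     : "memory");
}

/* TX-style copy: an unrolled ("rolled-out") loop that exploits known alignment. */
static void copy_unrolled(unsigned long *dst, const unsigned long *src, size_t n_longs)
{
    size_t i;

    for (i = 0; i + 4 <= n_longs; i += 4) {
        dst[i]     = src[i];
        dst[i + 1] = src[i + 1];
        dst[i + 2] = src[i + 2];
        dst[i + 3] = src[i + 3];
    }
    for (; i < n_longs; i++)
        dst[i] = src[i];
}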
The only deviation from this norm is seen in locks for large transfers in full affinity mode. Diving deeper into the implementation of spinlocks in Linux (Table 2) and looking at the absolute numbers of branches and mispredictions reveals the reasons. The seemingly large misprediction rate (in the full affinity case) is not due to more mispredictions, but to fewer branches. The number of branches and instructions in the full affinity case is about 5-10% of that in the no affinity case. In the full affinity case, processes and interrupts are bound to the same processor, making for minimal spinlock contention. If a processor successfully grabs a lock, no jumps or branches are required; the number of branches and instructions decreases accordingly. As such, when a branch misprediction does occur, it counts very heavily against the branch misprediction ratio. When contention is high, as in the no affinity case, the processor finds itself spinning in a spinloop waiting for the lock to be released. On the Pentium 4, REPZ NOP translates to a PAUSE instruction, which is implemented as a no-op with a pre-defined delay, in an attempt to reduce memory-ordering violations [7]. With either the REPZ NOP or the PAUSE implementation, the absolute number of branches taken in the no affinity case will be much larger than in the full affinity case.
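For reference, this spin-wait pattern looks roughly as follows when written in C (a sketch only – the kernel's actual spinlock is the hand-written assembly shown in Table 2; the GCC atomic builtins and PAUSE intrinsic are our choice, and this sketch uses 0 = free, the opposite convention from the kernel's lock word):

static volatile int lock_word = 0;   /* 0 = free, 1 = held (sketch convention) */

static void spin_lock_sketch(void)
{
    /* atomic exchange; returns the previous value */
    while (__sync_lock_test_and_set(&lock_word, 1)) {
        /* contended path: spin locally, executing PAUSE each iteration to
         * reduce memory-ordering violations while waiting for the release */
        while (lock_word)
            __builtin_ia32_pause();
    }
}

static void spin_unlock_sketch(void)
{
    __sync_lock_release(&lock_word);  /* store 0 with release semantics */
}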
6.2. Performance Impact Indicators
As seen in the previous section, there is no limit to the depth we can go to find evidence in support of our observations. While it is tempting to dive deeper, we wanted to make sure that we were studying architectural events that have a substantial impact on performance. To this end, we follow the tuning advice and recommendations found in the VTune 7.1 manual [23]; VTune is a profiling tool written specifically for Intel architecture (IA) based processors. The events we have chosen to study include machine clears (i.e. instruction pipeline flushes), trace cache (TC) misses, L2 cache misses, LLC misses, the number of page walks due to instruction TLB (ITLB)
Table 2 Spinlocks (Linux) implementation

Address      Instructions             Comments
c02bd319:    lock decb 0x2c(%ebx)     atomic decrement of "lock"; lock=1 in unlocked state
             js c02c2c0e              if already held by another processor, jump to .text.lock.tcp
             …                        successfully grabbed lock, continue in caller's original path
c02c2c0e:    cmpb $0x0,0x2c(%ebx)     check if "lock" value is 0
             repz nop                 translates to a PAUSE if lock
             jle c02c2c0e
             jmp c02bd319