Investigate packet forwarding performance of new PC hardware:
– Multi-core CPUs
– Multiple PCI-e buses
– 10G NICs
Can we obtain enough performance to use open-source routing also in the 10Gb/s realm?
Measuring throughput
● Packets per second (worked example below)
  – Per-packet costs: CPU processing, I/O and memory latency, clock frequency
● Bandwidth
  – Per-byte costs: bandwidth limitations of bus and memory
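As a back-of-the-envelope illustration (a sketch, not from the slides; it treats the quoted packet length as the full frame and assumes standard Ethernet overhead of an 8-byte preamble plus a 12-byte inter-frame gap), line rate in packets per second follows directly from packet size:

    /* Sketch: theoretical 10GbE line rate in packets/s per frame size. */
    #include <stdio.h>

    int main(void) {
        const double link_bps = 10e9;     /* 10 Gb/s link */
        const int overhead = 8 + 12;      /* preamble + inter-frame gap, bytes */
        const int sizes[] = { 64, 512, 1500 };

        for (int i = 0; i < 3; i++) {
            double pps = link_bps / ((sizes[i] + overhead) * 8.0);
            printf("%4d-byte frames: %5.2f Mpps\n", sizes[i], pps / 1e6);
        }
        return 0;
    }

This prints roughly 14.88 Mpps for 64-byte frames but only ~0.82 Mpps for 1500-byte frames: small packets stress the per-packet costs, large packets the per-byte costs.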
Measuring throughput
[Figure: offered load vs. throughput, showing the breakpoint where drops start, the capacity level, and the overload region]
Inside a router, HW style
[Figure: switched backplane connecting four line cards, each with buffer memory and a forwarder, and a CPU card with buffer memory and the RIB]
Specialized hardware: ASICs, NPUs, backplane with switching stages or crossbars
Inside a router, PC-style
[Figure: CPU with RIB and buffer memory on a shared-bus backplane with three line cards]
● Every packet goes twice over the shared bus to the CPU (see the bus arithmetic below)
● Cheap, but low performance
● But let's increase the # of CPUs and # of buses!
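As a rough consistency check (an illustration, not from the slides): forwarding at rate R over a single shared bus needs at least 2R of bus capacity, since each packet is DMAed across the bus once on receive and once on transmit. A classic 32-bit/33 MHz PCI bus at roughly 1 Gb/s thus caps forwarding near 0.5 Gb/s, which is why multiple PCIe buses matter here.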
Block hw structure (set 1)
Hardware – Box (set 2)
AMD Opteron 2356 with one quad-core 2.3 GHz Barcelona CPU on a TYAN 2927 motherboard (2U)
Hardware – NIC
Intel 10G board, 82598 chipset. Open chip specs – thanks, Intel!
Lab
Equipment summary
● Hardware needs to be carefully selected
● Bifrost Linux on kernel 2.6.24-rc7 with LC-trie forwarding
● Tweaked pktgen
● Set 1: AMD Opteron 2222 with two dual-core 3 GHz CPUs on a Tyan Thunder n6650W (S2915) motherboard
● Set 2: AMD Opteron 2356 with one quad-core 2.3 GHz Barcelona CPU on a TYAN 2927 motherboard (2U)
● Dual PCIe buses
● 10GE network interface cards
  – PCI Express x8 lanes, based on the Intel 82598 chipset
Experiments
● Transmission (TX)
  – Upper limits on the (hw) platform
● Forwarding experiments
  – Realistic forwarding performance
Tx Experiments
● Goal: just to see how much the hw can handle – an upper limit
● Loopback tests over fibers
● Don't process RX packets, just let the MAC count them
● These numbers can give an indication of what forwarding capacity is possible
● Experiments:
  – Single CPU TX, single interface
  – Four CPUs TX, one interface each
[Figure: tested device with loopback fibers]
Tx single sender: Packets per second
Tx single sender: Bandwidth
Tx - Four CPUs: Bandwidth
[Figure: per-CPU (CPU 1–4) and summed Tx bandwidth; packet length 1500 bytes]
TX experiments summary
● A single Tx sender is primarily limited by PPS, at around 3.5 Mpps
● A bandwidth of 25.8 Gb/s and a packet rate of 10 Mp/s using four CPU cores and two PCIe buses
● This shows that the hw itself allows 10 Gb/s performance
● We also see nice, symmetric Tx between the CPU cores
Forwarding experiments
● Goal: realistic forwarding performance
● Overload measurements (packets are lost)
● Single forwarding path from one traffic source to one traffic sink
  – Single IP flow was forwarded using a single CPU
  – Realistic multiple-flow stream with varying destination address and packet sizes, using a single CPU
  – Multi-queues on the interface cards were used to dispatch different flows to four different CPUs
[Figure: Test generator -> Tested device -> Sink device]
Single flow, single CPU: Packets per second
Single flow, single CPU: Bandwidth
Single sender forwarding summary
● Virtually wire-speed for 1500-byte packets
● Little difference between forwarding on the same card, different ports, or between different cards
  – Seems to be slightly better perf on the same card, but not significant
● Primary limiting factor is pps, around 900 Kpps (see the arithmetic below)
● TX has a small effect on overall performance
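To put the ~900 Kpps limit in perspective (a back-of-the-envelope check, not from the slides): 900 Kpps of 1500-byte packets is 900,000 x 1500 x 8 ≈ 10.8 Gb/s, at or above wire speed, while 900 Kpps of 64-byte packets amounts to only about 0.46 Gb/s. That is why large packets reach wire speed while small packets stay pps-bound.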
Introducing realistic traffic
● For the rest of the experiments we introduce a more realistic traffic scenario
● Multiple packet sizes
  – Simple model based on realistic packet distribution data
● Multiple flows (multiple dst IPs)
  – Also necessary for the multi-core experiments, since NIC classification uses a hash algorithm over the packet headers
Packet size distribution (cdf)
[Figure: CDF of packet sizes; real data from www.caida.org, WIDE trace, Aug 2008]
Flow distribution
● Flows have size and duration distributions (see the pktgen sketch below)
● 8000 simultaneous flows
● Each flow is 30 packets long
  – Mean flow duration is 258 ms
● 31000 new flows per second
  – Measured by dst cache misses
● Destinations spread randomly over 11.0.0.0/8
● FIB contains ~280K entries
  – 64K entries in 11.0.0.0/8
● This flow distribution is relatively aggressive
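For illustration, the flow parameters above map directly onto pktgen's documented /proc interface (a minimal sketch; the NIC name eth1, the kernel thread kpktgend_0, and the finite packet count are assumptions, and stock pktgen sends a single packet size, so the mixed-size distribution needs the tweaked pktgen mentioned earlier):

    /* Sketch: configure pktgen to emulate the flow mix above. */
    #include <stdio.h>
    #include <stdlib.h>

    static void pg(const char *path, const char *cmd) {
        FILE *f = fopen(path, "w");         /* pktgen is driven by writing */
        if (!f) { perror(path); exit(1); }  /* commands to its /proc files */
        fprintf(f, "%s\n", cmd);
        fclose(f);
    }

    int main(void) {
        pg("/proc/net/pktgen/kpktgend_0", "add_device eth1");
        pg("/proc/net/pktgen/eth1", "count 100000000");        /* stop after 1e8 pkts */
        pg("/proc/net/pktgen/eth1", "delay 0");                /* send at full speed */
        pg("/proc/net/pktgen/eth1", "dst_min 11.0.0.0");       /* spread destinations */
        pg("/proc/net/pktgen/eth1", "dst_max 11.255.255.255"); /* over 11.0.0.0/8 */
        pg("/proc/net/pktgen/eth1", "flows 8000");             /* simultaneous flows */
        pg("/proc/net/pktgen/eth1", "flowlen 30");             /* packets per flow */
        pg("/proc/net/pktgen/pgctrl", "start");                /* blocks while sending */
        return 0;
    }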
Multi-flow and single-CPU: PPS & BW
[Figure: max/min PPS and bandwidth for Set 1 and Set 2, with a small routing table vs. 280K entries, without and with ipfilters enabled]
Multi-Q experiments
● Use more CPU cores to handle forwarding
● NIC classification (Receive Side Scaling, RSS) uses a hash algorithm to select the input queue (see the sketch below)
● Allocate several interrupt channels, one for each CPU
● Flows are distributed evenly between CPUs
  – Needs aggregated traffic with multiple flows
● Questions:
  – Is the processing of flows evenly dispatched?
  – Will performance increase as CPUs are added?
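Conceptually, RSS-style dispatch hashes each packet's flow identity and uses the result to pick an RX queue, so one flow always lands on the same CPU. A minimal sketch follows (the real 82598 uses a Toeplitz hash over the 5-tuple; the simple mixing hash below is a stand-in for illustration):

    /* Sketch: RSS-style queue selection: same flow -> same hash -> same CPU. */
    #include <stdint.h>
    #include <stdio.h>

    struct flow { uint32_t saddr, daddr; uint16_t sport, dport; uint8_t proto; };

    static uint32_t mix_hash(const struct flow *f) {
        uint32_t h = f->saddr ^ (f->daddr * 2654435761u)
                   ^ ((uint32_t)f->sport << 16 | f->dport) ^ f->proto;
        h ^= h >> 16; h *= 0x45d9f3bu; h ^= h >> 16;  /* integer mixing */
        return h;
    }

    int main(void) {
        const unsigned nqueues = 4;  /* one RX queue per CPU core */
        struct flow f = { 0x0a000001u, 0x0b000001u, 12345, 80, 6 };
        printf("flow -> queue %u\n", (unsigned)(mix_hash(&f) % nqueues));
        return 0;
    }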
Multi-flow and Multi-CPU (set 1)
[Figure: forwarding rate with 1 CPU vs. 4 CPUs (CPU #1–#4); 64-byte packets only]
Results MultiQ
● Packets are evenly distributed between the four CPUs.
● But forwarding using one CPU is better than using four CPUs!
● Why is this?
Profiling

Single CPU:
  samples      %      symbol name
   396100  14.8714    kfree
   390230  14.6510    dev_kfree_skb_irq
   300715  11.2902    skb_release_data
   156310   5.8686    eth_type_trans
   142188   5.3384    ip_rcv
   106848   4.0116    __alloc_skb
    75677   2.8413    raise_softirq_irqoff
    69924   2.6253    nf_hook_slow
    69547   2.6111    kmem_cache_free
    68244   2.5622    netif_receive_skb

Multiple CPUs:
  samples      %      symbol name
  1087576  22.0815    dev_queue_xmit
   651777  13.2333    __qdisc_run
   234205   4.7552    eth_type_trans
   204177   4.1455    dev_kfree_skb_irq
   174442   3.5418    kfree
   158693   3.2220    netif_receive_skb
   149875   3.0430    pfifo_fast_enqueue
   116842   2.3723    ip_finish_output
   114529   2.3253    __netdev_alloc_skb
   110495   2.2434    cache_alloc_refill
Multi-Q analysis
● With multiple CPUs, TX processing uses a large part of the CPU time, making the use of more CPUs sub-optimal
● It turns out that the Tx and Qdisc code needs to be adapted to scale up performance
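One plausible reading of the profile (our interpretation; the slide does not spell it out): in this kernel generation all cores funnel their packets into a single transmit queue and qdisc guarded by one lock, so the dominant dev_queue_xmit and __qdisc_run samples largely reflect the four cores serializing on that lock rather than doing useful per-packet work. Multiqueue-aware Tx paths remove that serialization.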
MultiQ: Updated drivers
● We recently made new measurements (not in the paper) using updated driver code
● We also used hw set 2 (Barcelona) to get better results
● We now see an actual improvement when we add one processor
● (More to come)
Multi-flow and Multi-CPU (set 2)
[Figure: forwarding rate with 1, 2, and 4 CPUs]
Conclusions
● For optimal results, hw and sw must be carefully selected
● >25 Gb/s Tx performance
● Near 10 Gb/s wire-speed forwarding for large packets
● Identified a bottleneck for multi-q and multi-core forwarding
  – If this is removed, scaling up to 10 Gb/s and beyond using several CPU cores is possible
● Tx and forwarding results point towards 10 Gb/s performance using Linux and selected hardware