Network Resiliency Implementation in the ATLAS TDAQ System


Stefan Stancu, 25 May 2010

ATL-DAQ-SLIDE-2010-082


IEEE-NPSS Real Time conference, Lisbon


Outline
• TDAQ system block diagram
• Networks and protocols
  - Global view
  - Control network
  - Front/Back-End network
• Operational issues
• Conclusions

TDAQ system block diagram
• Detector Readout Systems (ROSs): ~150 PCs with custom input cards, reading out the detector at 100 kHz
• FrontEnd network: carries RoI-based requests to the 2nd Level Trigger (~850 PCs) at 100 kHz and full events to the Event Builders (SFIs, ~100 PCs) at ~5 kHz
• BackEnd network: delivers built events at ~5 kHz to the Event Filter (EF) [3rd Level Trigger], ~1600 PCs (~300 in the 1st stage)
• Sub-Farm Outputs (SFOs): write the ~300 Hz of accepted events to ~10 disk servers and on to permanent storage
• Control network: carries the control, configuration and monitoring traffic of the whole system
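Taken together, the rates above imply a rejection factor of roughly 20 at the 2nd Level Trigger (100 kHz to ~5 kHz) and a further factor of roughly 15-20 at the Event Filter (~5 kHz to ~300 Hz), so only about one event in ~300 read out from the detector reaches permanent storage.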

Global view – routers
[Diagram: the FrontEnd, Control and BackEnd networks and their routers; the Control network has a primary and a backup link towards "Outside TDAQ"]

• 3 networks, 5 routers
  - Ethernet + IP
  - ~2000 computers, ~100 edge switches
• OSPF (Open Shortest Path First) in-between the routers
  - "redistribute connected"
  - black-hole routes for the prefixes assigned to TDAQ
• Links
  - in-between the Control routers and from Control to the "outside": high speed (10GE+)
  - from the Control to the Front/Back-End networks: auxiliary (used for management purposes), low speed (GE)
• Interface with the outside – static routing for easy/complete decoupling (see the sketch after this list):
  - dormant back-up (no load balancing)
  - outside: the TDAQ prefixes are routed on the primary and on the backup (higher cost) link
  - inside: two default gateways – primary and backup (with higher cost)
• A simulated failure of the primary link was not perceived at the user/application level; no real failure has been experienced to date

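The static primary/backup arrangement towards the outside can be pictured with floating static routes: a second route for the same destination with a worse administrative distance, which only takes over when the primary next hop disappears. The lines below are only a sketch in Cisco IOS-like syntax; the slides do not give the vendor, prefixes or next-hop addresses, so all of those are placeholders.

    ! Inside (TDAQ Control routers): two default gateways, the backup with a higher cost
    ip route 0.0.0.0 0.0.0.0 192.0.2.1          ! primary exit towards "outside"
    ip route 0.0.0.0 0.0.0.0 192.0.2.5 200      ! dormant backup, used only if the primary fails
    !
    ! Outside: the TDAQ prefixes are reachable over both links, the backup with a higher cost
    ip route 10.140.0.0 255.252.0.0 192.0.2.2          ! placeholder TDAQ prefix via the primary link
    ip route 10.140.0.0 255.252.0.0 192.0.2.6 200      ! same prefix via the backup link

Because routing is static on both sides, the TDAQ networks and the outside network can be operated and reconfigured independently, which is the "easy/complete decoupling" mentioned above.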

Control network - VRRP
[Diagram: edge switches sw1 (Rack 1) and sw2 (Rack 2), each dual-homed to routers R1 and R2; R1 and R2 run OSPF over an inter-router trunk and VRRP instances 1 and 2, with R1 master (M) and R2 backup (B)]
• The inter-router link runs OSPF ("redistribute connected")
  - two trunked 10GE lines
• R1 and R2 provide:
  - subnet 1 to sw1
  - subnet 2 to sw2
• VRRP (Virtual Router Redundancy Protocol) operation: subnet X → VRRP instance X (a configuration sketch follows after this list)
  - one MAC (vrrp_macX) and one IP (vrrp_ipX) for the virtual router
  - the physical routers handshake and elect:
    - a master router (R1), which implements the virtual router
    - a backup router (R2), dormant while the master is active
• If the R1–sw2 link fails:
  - the R1–R2 handshake on subnet 2 fails (R1 is no longer reachable through sw2)
  - R2 no longer sees a master, so it becomes the master itself and implements the virtual router (with vrrp_mac2 and vrrp_ip2)
  - the hosts in Rack 2 continue to talk to the virtual router, unaware of the physical change
• A single VRRP instance provides redundancy but no load balancing
  - two VRRP instances per subnet (R1 master in one, R2 master in the other) could provide load balancing
  - however, this causes asymmetric traffic (potential flooding on sw1, sw2)
  - to be avoided if bandwidth is not an issue
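The election described above can be illustrated with the fragment below. The production routers are dedicated network devices whose configuration language is not shown in the slides, so this is only a sketch of the same mechanism expressed in keepalived syntax for a Linux router; interface names, VRIDs and addresses are placeholders.

    # /etc/keepalived/keepalived.conf on R1 (R2 would use "state BACKUP" and a lower priority)
    vrrp_instance SUBNET_1 {
        state MASTER
        interface eth1              # interface towards sw1 (placeholder name)
        virtual_router_id 1         # VRRP instance 1; the protocol derives the virtual MAC 00:00:5e:00:01:<VRID>
        priority 200                # the higher priority wins the master election
        advert_int 1                # master advertisements every second
        virtual_ipaddress {
            192.0.2.254/24          # vrrp_ip1, used by the hosts in Rack 1 as default gateway
        }
    }

As long as the backup router hears the master's advertisements it stays dormant; when they stop (router dead, or the link towards the switch broken), it takes over the virtual IP and MAC, which is exactly the failover behaviour exploited here.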


Control network - VRRP
[Diagram: the same Rack 1 / Rack 2, sw1/sw2, R1 (master) / R2 (backup) VRRP topology as on the previous slide]
• Practical issues
  - Tested prior to deployment
  - With proxy ARP enabled, the following happens when a host from Rack 2 (host_r2) wants to talk outside subnet 2 (a sketch for reproducing the observation follows after this list):

      arp who-has host_x_IP tell host_r2_IP
      arp reply host_x_IP is-at vrrp_mac2        # correct
      arp reply host_x_IP is-at R2_phys_mac2     # spurious

  - Depending on which ARP reply is received first (which can be assumed to be random), the host will
    - either behave correctly (correct ARP received first)
    - or use the "backup" router R2 (spurious ARP received first)
  - This is undesired behaviour because:
    - traffic through R2 is asymmetric (the return path comes through R1), so flooding can occur on sw2 depending on its MAC-address-ageing settings
    - it results in uncontrolled load balancing
    - it will not be detected by a test which only disables one swX primary link to R1
  - Deployed in production only after the issue was fixed by the manufacturer
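The duplicated replies are easy to observe from any host in the affected subnet. The commands below are a sketch assuming a Linux host (the interface name and target address are placeholders), and the proxy-ARP knob shown at the end applies to Linux-based routers only, not necessarily to the devices actually deployed, whose fix came from the manufacturer.

    # On a host in Rack 2: watch ARP on the wire while issuing a request like the one above
    tcpdump -n -e -i eth0 arp &
    arping -c 1 -I eth0 192.0.2.10     # placeholder for host_x_IP
    # With the fault present, two "is-at" replies are seen for one request:
    # one from the VRRP virtual MAC (correct) and one from R2's physical MAC (spurious).

    # On a Linux-based router, per-interface proxy ARP could be disabled with:
    sysctl -w net.ipv4.conf.eth1.proxy_arp=0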

• If possible, thoroughly test before deployment.


Control network – high throughput servers
[Diagram: a high-throughput server attached either through edge switches (sw1/sw2) or directly to routers R1 and R2, which run VRRP (R1 master, R2 backup) and are linked by the OSPF trunk]
• High throughput is needed on ~70 infrastructure and monitoring servers
• "Standard" options (a bonding sketch follows after this list):
  - edge switch and VRRP, with 10G up-links
    - two points of failure (switch and server interface)
  - edge switch and VRRP with 10G up-links + bonding on the server
    - Linux bonding in 'active-backup' mode on the server can provide good redundancy
    - one point of failure (the switch)
  - two edge switches and VRRP with 10G up-links + bonding on the server
    - Linux bonding in 'active-backup' mode on the server can provide good redundancy
    - no single point of failure
    - NOTE: STP is required in the subnet in order to break the loops created by the two switches (each one with two up-links)
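The 'active-backup' bonding mentioned above can be set up on a Linux server of that era roughly as follows. This is only a sketch: the interface names, the IP address and the use of the legacy ifenslave tool are assumptions for illustration, not taken from the slides.

    # Load the bonding driver in active-backup mode; monitor link state every 100 ms
    modprobe bonding mode=active-backup miimon=100 primary=eth0

    # Give the bond its address and bring it up, then enslave the two NICs
    # (eth0 goes to the primary path, eth1 to the backup path)
    ifconfig bond0 192.0.2.20 netmask 255.255.255.0 up
    ifenslave bond0 eth0 eth1

    # Verify which slave is currently active, and watch it change on a link failure
    cat /proc/net/bonding/bond0

In active-backup mode only one slave carries traffic at a time; when its link goes down, the bond fails over to the other slave while keeping the same IP and MAC, so upper layers are undisturbed.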

• Direct router connections
  - Linux bonding in "active-backup": the primary link connected to R1, the back-up one to R2
  - not enough on its own:
    - example: a failure of sw1's primary up-link renders the servers unreachable from Rack 1
  - use VLAN interfaces on the routers and interconnect them to emulate a rack-level switch (see the sketch below)
    - the already existing high-speed trunk used by OSPF can be shared by virtually any number of (tagged) VLANs
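The "emulated rack-level switch" works by carrying one extra tagged VLAN over the trunk that already interconnects R1 and R2 for OSPF. The fragment below is only a sketch in Cisco IOS-like syntax; the slides do not name the router vendor, so the interface names, VLAN ID, addresses and the use of VRRP as the servers' gateway are placeholders/assumptions, and the real configuration language may differ.

    ! Sketch for R1 (R2 is symmetric): the server's primary NIC plugs into a port
    ! placed in a dedicated VLAN, and the same VLAN is tagged onto the existing
    ! high-speed inter-router trunk, so R1 and R2 together behave like one switch.
    vlan 50
     name hi-tput-servers
    !
    interface GigabitEthernet0/1           ! server-facing port (placeholder name/speed)
     switchport access vlan 50
    !
    interface Port-channel1                ! the existing trunked inter-router lines
     switchport trunk allowed vlan add 50
    !
    interface Vlan50                       ! routed VLAN interface
     ip address 192.0.2.130 255.255.255.128
     vrrp 50 ip 192.0.2.129                ! single virtual gateway address for the servers

With this in place, a server whose primary link to R1 fails keeps full connectivity through R2 and the tagged VLAN on the trunk, without needing a physical rack-level switch in the path.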


• Production experience
  - Deployed for:
    - all (most) critical servers
    - the NetApp FAS3100 storage units used system-wide (user accounts, etc.)
  - Sample failure while running:
    - a server interface went down and then re-negotiated to a lower speed
    - no effect was perceived on the data-taking run
    - the shifter reported the warnings generated by the network monitoring tools


FrontEnd TDAQ network
[Diagram: ~150 ROSs underground connected to the ros-swA and ros-swB concentrator switches, with up-links to the core switches and the Trigger/EB farms at the surface]
• High throughput, low latency
• Two vertical slices (fan-out at the ROS level): ros-swA and ros-swB
• Geographical location:
  - the ROSs and the ros-swX switches are underground
  - the cores and the Trigger/EB farms are at the surface
• ROSs to cores: more than 100 m → fibre
  - original design: fibre ports on the ROS PCs
  - once 10G became affordable: concentrate with GE copper underground and use 10G to feed the cores at the surface
• One up-link failure renders …
